apohara-codesearch

Hybrid code search for your coding agent — one offline binary, no model, no database.

Quick Start · Features · Where it fits · How it works

A single Rust binary that runs as a Model Context Protocol (MCP) server, giving a coding agent fast, fully offline hybrid search over any local repository, with no embedding model to download and no external vector or graph database. It installs in seconds, runs air-gapped in roughly 22-24 MB of resident memory on the repos measured under Footprint at scale, and keeps its entire state in one SQLite file.

# Your agent calls the search_code MCP tool:
search_code(path=".", query="where does the runtime block on a future?")

# → top hit — a chunk WITH its structure, not just a line number:
{
  "file": "src/runtime/handle.rs",
  "start_line": 241, "end_line": 341,
  "kind": "method",
  "signature": "block_on<F: Future>(&self, future: F) -> F::Output",
  "snippet": "/// Runs a future to completion on this Handle's associated Runtime...",
  "imports": [{ "source": "crate::runtime::task::JoinHandle", "line": 17 }],
  "exports": []
}

Structure (signatures, imports/exports, and struct/enum/class/type symbols) is extracted for Rust, TypeScript, Python, and Go. Any other language is still searchable — indexed as overlapping text windows.

💡 Concept

Note

A hash, not a model. The dominant code-intelligence tools are heavy: Node plus native bindings, a C/C++ toolchain for some grammars, an embedded graph or vector database, and a learned embedding model that downloads on first run. Their strength is deep understanding; their cost is that they are anything but lightweight.

apohara-codesearch takes the other side of that trade. The embedding is a deterministic blake3 feature-hash, not a learned model — so there is nothing to download, nothing to serve, and the same input always produces the same vector. That makes semantic recall weaker than a model-based tool; we compensate with hybrid retrieval (lexical + vector, fused) rather than pretending the hash is semantic. It is a Claude Code MCP plugin, and works with any MCP client.
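
To make "a hash, not a model" concrete, here is a minimal sketch of feature hashing. It is not the project's implementation: FNV-1a stands in for blake3 so the example needs no external crate, and the dimension (`DIM`), the whitespace tokenization, and the function names are all assumptions made for illustration.

```rust
// Feature-hashing sketch. Assumptions: FNV-1a replaces blake3, DIM is arbitrary,
// and tokens are split on whitespace only.
const DIM: usize = 64;

/// 64-bit FNV-1a over the bytes of one token.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Hash each token into one of DIM buckets with a +1/-1 sign, then L2-normalize.
fn embed(text: &str) -> Vec<f32> {
    let mut v = vec![0.0f32; DIM];
    for token in text.split_whitespace() {
        let h = fnv1a(token.to_lowercase().as_bytes());
        let bucket = (h % DIM as u64) as usize;
        let sign = if (h >> 63) & 1 == 0 { 1.0 } else { -1.0 };
        v[bucket] += sign;
    }
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
    v
}

/// Dot product; equals cosine similarity for unit-length inputs.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let a = embed("block on a future");
    let b = embed("block on a future");
    assert_eq!(a, b); // same input, same vector: nothing learned, nothing downloaded
    println!("self-similarity: {:.3}", cosine(&a, &b));
}
```

Because the vector depends only on which tokens appear, two texts with no tokens in common land in unrelated buckets, which is why the hash alone cannot stand in for a learned semantic model.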

✨ Features


🔌 MCP stdio server	Two tools — `search_code` (hybrid) and `reindex` (incremental) — over plain JSON-RPC. Works with Claude Code or any MCP client.
🦀 One static binary	No Node, no native bindings, no toolchain, no service. `cargo install` or `npx`, then run. The only state is one SQLite file.
🧠 Hybrid ranking	BM25 (SQLite FTS5) + a feature-hash vector (sqlite-vec), merged with Reciprocal Rank Fusion, then MMR-diversified.
🌳 Structural extraction	Per-symbol chunks with signatures + file imports/exports for Rust, TS, Python, Go; everything else indexed as text.
📴 Offline & air-gapped	Zero network at runtime AND at build. No model fetch, no telemetry, no API keys.
🪶 Near-zero footprint	~22-24 MB peak resident memory while indexing 174k-224k-LOC repos; memory stays nearly flat as repo size grows (memory-bounded pipeline).
⚡ Incremental + watch	`reindex` does blake3 content-hash deltas; the `watch` subcommand keeps the index current as files change (a plain CLI loop, not a plugin hook).
🔁 Deterministic	Same input ⇒ same vector ⇒ byte-stable `recall@k`/`MRR`. Re-indexing is stable.
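
The "MMR-diversified" step in the hybrid ranking row can be pictured with the sketch below. It is a generic greedy Maximal Marginal Relevance loop, not code from this repository: the `lambda` weight, the function names, and the assumption that vectors are unit-length are all illustrative.

```rust
/// Dot product; equals cosine similarity when both vectors are unit-length.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Greedy MMR: repeatedly pick the candidate that maximizes
/// lambda * relevance - (1 - lambda) * (similarity to the closest pick so far).
/// Returns indices into `relevance` / `vectors` in selection order.
fn mmr(relevance: &[f32], vectors: &[Vec<f32>], lambda: f32, top_k: usize) -> Vec<usize> {
    let mut selected: Vec<usize> = Vec::new();
    let mut remaining: Vec<usize> = (0..relevance.len()).collect();
    while selected.len() < top_k && !remaining.is_empty() {
        let mut best_pos = 0;
        let mut best_score = f32::NEG_INFINITY;
        for (pos, &cand) in remaining.iter().enumerate() {
            // Closest already-selected item; floored at 0 (also the value when nothing is picked yet).
            let max_sim = selected
                .iter()
                .map(|&s| dot(&vectors[cand], &vectors[s]))
                .fold(0.0f32, f32::max);
            let score = lambda * relevance[cand] - (1.0 - lambda) * max_sim;
            if score > best_score {
                best_score = score;
                best_pos = pos;
            }
        }
        selected.push(remaining.remove(best_pos));
    }
    selected
}

fn main() {
    // Candidates 0 and 1 are near-duplicates; candidate 2 is less relevant but different.
    let relevance = [0.9, 0.85, 0.5];
    let vectors = vec![vec![1.0, 0.0], vec![1.0, 0.0], vec![0.0, 1.0]];
    println!("{:?}", mmr(&relevance, &vectors, 0.5, 2)); // prints [0, 2]
}
```

With `lambda = 1.0` the loop degenerates to plain relevance order; lower values push near-duplicate chunks down in favor of distinct ones.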

🚀 Quick Start

Register it with your MCP client. For Claude Code, add to .mcp.json:

{ "mcpServers": { "codesearch": { "command": "npx", "args": ["-y", "@apohara/codesearch-mcp"] } } }

The npx wrapper downloads the matching prebuilt binary for your platform on first run. That is the whole install — no model, no database, no daemon.

Other acquisition paths — build from source, run directly, keep the index live

# Build + install from a checkout (the path that requires the least trust in prebuilt artifacts):
cargo install --path crates/apohara-codesearch

# Run the binary directly as a stdio MCP server:
apohara-codesearch serve

# Keep the index current as files change (plain CLI loop, NOT a Claude Code hook):
apohara-codesearch watch <path>

Prebuilt, per-OS binaries are also published on Releases (built by cargo-dist). It installs as a Claude Code plugin via the apohara marketplace too.

[!WARNING] Downloading a prebuilt binary is itself a supply-chain surface. Verify the checksum from the Release, or prefer cargo install and build from source.

Tools

Tool	What it does
`search_code`	Hybrid BM25 + vector search over a repo path. Lazily indexes on first call. Returns the top-k hits with structural context.
`reindex`	Re-index a repo. Incremental by default (blake3 content-hash deltas); `force: true` rebuilds from scratch.
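
As an illustration of what "content-hash deltas" means for `reindex`, the sketch below plans a reindex pass from stored hashes and current file contents. It is a simplified model under stated assumptions: FNV-1a stands in for blake3, files are held in memory, and `plan_reindex`, `Delta`, and the way `force` is handled are hypothetical names and behavior, not the tool's API.

```rust
use std::collections::HashMap;

/// Stand-in content hash (FNV-1a). The tool itself is described as using blake3.
fn content_hash(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Work a reindex pass would have to do.
#[derive(Debug, Default, PartialEq)]
struct Delta {
    changed: Vec<String>, // new or modified files: re-chunk and re-embed
    removed: Vec<String>, // files gone from disk: delete their rows
}

/// Compare stored per-file hashes with current contents.
/// `force` treats every current file as changed (a full rebuild).
fn plan_reindex(
    stored: &HashMap<String, u64>,
    current: &HashMap<String, Vec<u8>>,
    force: bool,
) -> Delta {
    let mut delta = Delta::default();
    for (path, bytes) in current {
        let unchanged = stored.get(path) == Some(&content_hash(bytes));
        if force || !unchanged {
            delta.changed.push(path.clone());
        }
    }
    for path in stored.keys() {
        if !current.contains_key(path) {
            delta.removed.push(path.clone());
        }
    }
    // Sort so the plan is deterministic regardless of HashMap iteration order.
    delta.changed.sort();
    delta.removed.sort();
    delta
}

fn main() {
    let mut stored = HashMap::new();
    stored.insert("a.rs".to_string(), content_hash(b"fn a() {}"));
    stored.insert("b.rs".to_string(), content_hash(b"fn b() {}"));

    let mut current = HashMap::new();
    current.insert("a.rs".to_string(), b"fn a() {}".to_vec()); // untouched
    current.insert("c.rs".to_string(), b"fn c() {}".to_vec()); // new file

    println!("{:?}", plan_reindex(&stored, &current, false));
}
```

In this model an untouched file costs one hash comparison and nothing else, which is the property that keeps incremental runs cheap.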

🧭 Where it fits

Lighter than the graph tools, structure-aware where ripgrep is text-only. It does not match a model-based tool on conceptual recall, and it does not build a call graph — those are deliberately out of scope.

	apohara-codesearch	graph / embedding tools	ripgrep
Runtime dependencies	one static binary	Node + native bindings + toolchain	one binary
Model download	none	hundreds of MB	none
External DB / service	none	embedded graph / vector DB	none
Offline / air-gapped	✓	usually requires a fetch	✓
Structural context	signatures + imports/exports (4 langs)	call graphs, deep	text only
Ranking	hybrid BM25 + vector (RRF)	learned embeddings	exact / regex

🔬 How it works / honesty

1. Walk + chunk. A .gitignore-aware walk splits each file into per-symbol chunks (with the symbol's signature attached) plus bounded module-remainder and window chunks, so a giant file never collapses into one diluted chunk.
2. Index. Each chunk gets a BM25 lexical row (SQLite FTS5) and a feature-hash vector row (sqlite-vec), keyed on a shared row id. Both sides share one identifier tokenizer, so parseString and parse_string match each other.
3. Search. A query runs through both BM25 and vector k-NN; the two ranked lists are merged with Reciprocal Rank Fusion, diversified with MMR, then the survivors are hydrated with their structural context.
4. Stay current. Re-indexing hashes each file and reprocesses only what changed, in a single transaction that keeps the three tables (chunks, FTS5, and sqlite-vec) consistent.
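
The fusion in the search step can be sketched as follows. This is the textbook form of Reciprocal Rank Fusion, not code lifted from `search.rs`: the constant `k = 60.0` is the value commonly used in the literature and is an assumption here, as are the function name and the use of `u32` chunk ids.

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion: score(id) = sum over lists of 1 / (k + rank),
/// with rank starting at 1. Only ranks matter, so BM25 scores and vector
/// distances never need to be put on a common scale.
fn rrf(lists: &[Vec<u32>], k: f64) -> Vec<(u32, f64)> {
    let mut scores: HashMap<u32, f64> = HashMap::new();
    for list in lists {
        for (i, &id) in list.iter().enumerate() {
            *scores.entry(id).or_insert(0.0) += 1.0 / (k + (i as f64 + 1.0));
        }
    }
    let mut fused: Vec<(u32, f64)> = scores.into_iter().collect();
    // Highest score first; ties broken by id so the output order is deterministic.
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap().then(a.0.cmp(&b.0)));
    fused
}

fn main() {
    let bm25: Vec<u32> = vec![7, 3, 9]; // chunk ids, best lexical match first
    let knn: Vec<u32> = vec![3, 5, 7]; // chunk ids, nearest vector first
    for (id, score) in rrf(&[bm25, knn], 60.0) {
        println!("chunk {id}: {score:.5}");
    }
    // Fused order: 3, 7, 5, 9. Chunks found by both lists rise above single-list hits.
}
```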

Footprint at scale

Measured with the default feature-hash embedder on a Ryzen 5 3600 / 48 GB box, driven over the stdio MCP tools:

Repo	LOC	Cold index	Peak RSS	Warm query	Index on disk
tokio	174k Rust	~10 s	~22 MB	~18 ms	39 MB
hugo	224k Go	~26 s	~24 MB	~22 ms	54 MB

Peak resident memory stays nearly flat as repo size grows (~22 MB at 174k LOC, ~24 MB at 224k LOC), with no OOM and no external process. One SQLite file is the only state.

Warning

The vector is a robustness layer, not a semantic engine. Because the embedding is a feature-hash, a conceptual query that shares no tokens with the target will not surface it — and on a clean corpus where lexical search already wins, fusion can be a slight net negative. BENCHMARK.md publishes this (synthetic corpus + a one-off external comparison on real OSS, with ≥30% committed known-miss queries) rather than hiding it. Deep structural context (callers/callees, call graphs) is out of scope by design. A real local embedding model is an opt-in, user-supplied build feature — never downloaded — so the default install stays zero-dependency.
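
The token-overlap limitation is easy to see with a small sketch. The tokenizer below is a hypothetical approximation of a shared snake/camel identifier tokenizer (it is not the contents of `tokens.rs`), and `shared_tokens` is an illustrative helper: it shows why parseString and parse_string match each other, and why a query with no tokens in common with its target has nothing for either BM25 or a feature-hash vector to latch onto.

```rust
use std::collections::HashSet;

/// Split text into lowercase tokens on non-alphanumeric characters and on
/// lower-to-upper camelCase boundaries.
fn tokenize(text: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut current = String::new();
    let mut prev_lower = false;
    for ch in text.chars() {
        if !ch.is_alphanumeric() {
            // Separator (underscore, space, punctuation): close the current token.
            if !current.is_empty() {
                tokens.push(std::mem::take(&mut current));
            }
            prev_lower = false;
            continue;
        }
        if ch.is_uppercase() && prev_lower && !current.is_empty() {
            // camelCase boundary: "parseString" splits before 'S'.
            tokens.push(std::mem::take(&mut current));
        }
        current.extend(ch.to_lowercase());
        prev_lower = ch.is_lowercase();
    }
    if !current.is_empty() {
        tokens.push(current);
    }
    tokens
}

/// Number of distinct tokens two texts have in common.
fn shared_tokens(a: &str, b: &str) -> usize {
    let left: HashSet<String> = tokenize(a).into_iter().collect();
    let right: HashSet<String> = tokenize(b).into_iter().collect();
    left.intersection(&right).count()
}

fn main() {
    assert_eq!(tokenize("parseString"), tokenize("parse_string"));
    // Shares "block" and "on" with the target symbol, so it can be retrieved.
    println!("{}", shared_tokens("where does the runtime block on a future?", "block_on"));
    // Conceptually related, but no shared tokens: neither retrieval side can surface it.
    println!("{}", shared_tokens("wait for async task completion", "block_on"));
}
```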

See BENCHMARK.md for the method, the reproduce command, and per-mode recall@k / MRR across BM25-only, vector-only, and hybrid.

🏗️ Repository layout

apohara-codesearch/
├── crates/
│   ├── apohara-indexer/        # the engine (library)
│   │   └── src/
│   │       ├── walker.rs        # .gitignore-aware file walk
│   │       ├── parser.rs        # tree-sitter structural extraction (Rust/TS/Python/Go)
│   │       ├── chunker.rs       # per-symbol + bounded module/window chunks
│   │       ├── tokens.rs        # shared snake/camel identifier tokenizer
│   │       ├── embeddings.rs    # deterministic blake3 feature-hash vector
│   │       ├── embedder.rs      # pluggable Embedder trait (opt-in gguf-embed)
│   │       ├── storage.rs       # SQLite: chunks + FTS5 + sqlite-vec
│   │       ├── schema.rs        # migrations + embedder refuse-to-mix meta
│   │       ├── search.rs        # BM25 + vector + RRF + MMR + structural boost
│   │       └── incremental.rs   # blake3-delta reindex in one transaction
│   └── apohara-codesearch/     # the MCP server + CLI
│       ├── src/{main,server,watch,dto}.rs
│       └── examples/           # bench-search (in-CI) · bench-external (one-off)
├── npm/                         # @apohara/codesearch-mcp wrapper (downloads the Release binary)
├── .claude-plugin/ + marketplace.json   # Claude Code plugin manifest
└── .github/workflows/          # ci.yml (test/clippy/fmt/dist) · release.yml (cargo-dist)

🗺️ Roadmap

🤝 Contributing

Contributions are welcome.

1. Fork the repository.
2. Create a feature branch (git checkout -b feature/my-change).
3. Make your change and run the suite: cargo test --workspace (clippy -D warnings + rustfmt --check gate CI).
4. Open a pull request.

Unless you state otherwise, any contribution you intentionally submit for inclusion in this work, as defined in the Apache-2.0 license, shall be dual-licensed as below, without any additional terms or conditions.

📄 License

Licensed under either of MIT or Apache-2.0, at your option. See NOTICE for third-party dependency licenses.

Maintained by SuarezPM.
