apohara-codesearch

Hybrid code search for your coding agent — one offline binary, no model, no database.


Quick Start · Features · Where it fits · How it works

A single Rust binary that runs as a Model Context Protocol (MCP) server, giving a coding agent fast, fully offline hybrid search over any local repository, with no embedding model to download and no external vector or graph database. It installs in seconds, runs air-gapped in roughly 22 to 24 MB of resident memory on the repositories measured under "Footprint at scale", and keeps its entire state in one SQLite file.


# Your agent calls the search_code MCP tool:
search_code(path=".", query="where does the runtime block on a future?")

# → top hit — a chunk WITH its structure, not just a line number:
{
  "file": "src/runtime/handle.rs",
  "start_line": 241, "end_line": 341,
  "kind": "method",
  "signature": "block_on<F: Future>(&self, future: F) -> F::Output",
  "snippet": "/// Runs a future to completion on this Handle's associated Runtime...",
  "imports": [{ "source": "crate::runtime::task::JoinHandle", "line": 17 }],
  "exports": []
}
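
On the wire, a call like the one above travels as a JSON-RPC 2.0 request using the MCP tools/call method. The sketch below builds such a request by hand with only the standard library. The argument names path and query come from the example above; the helper names json_escape and tools_call_request are hypothetical, and a real client would normally use a JSON library and perform the MCP initialize handshake first.

```rust
/// Escape a string for embedding inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Build a JSON-RPC 2.0 `tools/call` request for a tool taking `path` and `query`.
/// (Hypothetical helper for illustration; not part of apohara-codesearch.)
fn tools_call_request(id: u64, tool: &str, path: &str, query: &str) -> String {
    format!(
        r#"{{"jsonrpc":"2.0","id":{},"method":"tools/call","params":{{"name":"{}","arguments":{{"path":"{}","query":"{}"}}}}}}"#,
        id,
        json_escape(tool),
        json_escape(path),
        json_escape(query)
    )
}

fn main() {
    // One request per line is how a stdio MCP transport frames messages.
    let req = tools_call_request(
        1,
        "search_code",
        ".",
        "where does the runtime block on a future?",
    );
    println!("{}", req);
}
```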

Structure (signatures, imports/exports, and struct/enum/class/type symbols) is extracted for Rust, TypeScript, Python, and Go. Any other language is still searchable — indexed as overlapping text windows.


💡 Concept

Note

A hash, not a model. The dominant code-intelligence tools are heavy: Node plus native bindings, a C/C++ toolchain for some grammars, an embedded graph or vector database, and a learned embedding model that downloads on first run. Their strength is deep understanding; their cost is that they are anything but lightweight.

apohara-codesearch takes the other side of that trade. The embedding is a deterministic blake3 feature-hash, not a learned model — so there is nothing to download, nothing to serve, and the same input always produces the same vector. That makes semantic recall weaker than a model-based tool; we compensate with hybrid retrieval (lexical + vector, fused) rather than pretending the hash is semantic. It is a Claude Code MCP plugin, and works with any MCP client.
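
To make the "hash, not a model" idea concrete, here is a minimal sketch of a feature-hash embedding. It is an illustration under stated assumptions, not the project's code: FNV-1a stands in for blake3 so the example needs no external crate, and the dimension (64), the signed-bucket scheme, and the function names are all hypothetical choices.

```rust
/// FNV-1a 64-bit hash: a small deterministic stand-in for blake3 in this sketch.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Hash every token into one of `dims` buckets with a +1/-1 sign,
/// then L2-normalize so a dot product behaves like cosine similarity.
fn feature_hash(tokens: &[&str], dims: usize) -> Vec<f32> {
    let mut v = vec![0.0f32; dims];
    for t in tokens {
        let h = fnv1a(t.as_bytes());
        let idx = (h % dims as u64) as usize;
        // The top bit of the hash chooses the sign of the contribution.
        let sign = if (h >> 63) & 1 == 0 { 1.0 } else { -1.0 };
        v[idx] += sign;
    }
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
    v
}

/// Dot product of two already-normalized vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let a = feature_hash(&["block", "on", "future"], 64);
    let b = feature_hash(&["block", "on", "future"], 64);
    // Same tokens always give the same vector: nothing is learned or sampled.
    assert_eq!(a, b);
    let c = feature_hash(&["spawn", "blocking", "task"], 64);
    println!("similarity to an unrelated chunk: {}", cosine(&a, &c));
}
```

Vectors built this way only overlap when the inputs share tokens (or collide by chance), which is why the hash is paired with lexical search rather than treated as semantic.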


✨ Features

  • 🔌 MCP stdio server: two tools, search_code (hybrid) and reindex (incremental), over plain JSON-RPC. Works with Claude Code or any MCP client.
  • 🦀 One static binary: no Node, no native bindings, no toolchain, no service. cargo install or npx, then run. The only state is one SQLite file.
  • 🧠 Hybrid ranking: BM25 (SQLite FTS5) + a feature-hash vector (sqlite-vec), merged with Reciprocal Rank Fusion, then MMR-diversified.
  • 🌳 Structural extraction: per-symbol chunks with signatures + file imports/exports for Rust, TS, Python, Go; everything else is indexed as text.
  • 📴 Offline & air-gapped: zero network at runtime AND at build. No model fetch, no telemetry, no API keys.
  • 🪶 Near-zero footprint: ~22 to ~24 MB peak resident memory while indexing 174k to 224k-LOC repos, nearly flat with repo size (memory-bounded pipeline).
  • Incremental + watch: reindex does blake3 content-hash deltas; the watch subcommand keeps the index current as files change (a plain CLI loop, not a plugin hook).
  • 🔁 Deterministic: same input ⇒ same vector ⇒ byte-stable recall@k/MRR. Re-indexing is stable.
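
The hybrid-ranking entry above names Reciprocal Rank Fusion (RRF). As a rough sketch of that technique: each result's fused score is the sum of 1 / (k + rank) over every ranked list it appears in. The constant k = 60 is the value commonly used in the RRF literature and is an assumption here, as are the function name and the use of bare integer ids; the project's actual constant and types are not stated in this README.

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion over several ranked lists of chunk ids.
/// score(id) = sum over lists of 1 / (k + rank), with rank starting at 1.
fn rrf(lists: &[Vec<u32>], k: f64) -> Vec<(u32, f64)> {
    let mut scores: HashMap<u32, f64> = HashMap::new();
    for list in lists {
        for (i, id) in list.iter().enumerate() {
            *scores.entry(*id).or_insert(0.0) += 1.0 / (k + i as f64 + 1.0);
        }
    }
    let mut fused: Vec<(u32, f64)> = scores.into_iter().collect();
    // Highest score first; ties broken by id so the output is deterministic.
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap().then(a.0.cmp(&b.0)));
    fused
}

fn main() {
    let bm25_hits = vec![10, 20, 30]; // lexical ranking (hypothetical ids)
    let vector_hits = vec![20, 40, 10]; // vector k-NN ranking (hypothetical ids)
    for (id, score) in rrf(&[bm25_hits, vector_hits], 60.0) {
        println!("chunk {id}: {score:.5}");
    }
}
```

Because RRF uses only ranks, the BM25 scores and vector distances never need to be put on a common scale.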

🚀 Quick Start

Register it with your MCP client. For Claude Code, add to .mcp.json:

{ "mcpServers": { "codesearch": { "command": "npx", "args": ["-y", "@apohara/codesearch-mcp"] } } }

The npx wrapper downloads the matching prebuilt binary for your platform on first run. That is the whole install — no model, no database, no daemon.

Other acquisition paths — build from source, run directly, keep the index live
# Build + install from a checkout (lowest-trust path):
cargo install --path crates/apohara-codesearch

# Run the binary directly as a stdio MCP server:
apohara-codesearch serve

# Keep the index current as files change (plain CLI loop, NOT a Claude Code hook):
apohara-codesearch watch <path>

Prebuilt, per-OS binaries are also published on Releases (built by cargo-dist). It installs as a Claude Code plugin via the apohara marketplace too.

[!WARNING] Downloading a prebuilt binary is itself a supply-chain surface. Verify the checksum from the Release, or prefer cargo install and build from source.

Tools

| Tool | What it does |
| --- | --- |
| search_code | Hybrid BM25 + vector search over a repo path. Lazily indexes on first call. Returns the top-k hits with structural context. |
| reindex | Re-index a repo. Incremental by default (blake3 content-hash deltas); force: true rebuilds from scratch. |
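
The incremental mode of reindex can be pictured as a comparison between stored content hashes and the hashes of the files currently on disk. The sketch below is an illustration only: std's DefaultHasher stands in for blake3 (it is not stable across Rust releases, so it would be a poor choice for a persisted index), and the Delta type and plan_reindex function are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Stand-in content hash. A real index would persist a stable hash such as blake3.
fn content_hash(bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    bytes.hash(&mut h);
    h.finish()
}

/// Which files need work on the next reindex pass.
#[derive(Debug, Default, PartialEq)]
struct Delta {
    added: Vec<String>,
    changed: Vec<String>,
    removed: Vec<String>,
}

/// Compare stored hashes (path -> hash) with current file contents (path -> bytes).
fn plan_reindex(stored: &BTreeMap<String, u64>, current: &BTreeMap<String, Vec<u8>>) -> Delta {
    let mut d = Delta::default();
    for (path, bytes) in current {
        match stored.get(path) {
            None => d.added.push(path.clone()),
            Some(old) if *old != content_hash(bytes) => d.changed.push(path.clone()),
            Some(_) => {} // unchanged: skip parsing, chunking, and embedding
        }
    }
    for path in stored.keys() {
        if !current.contains_key(path) {
            d.removed.push(path.clone());
        }
    }
    d
}

fn main() {
    let mut stored = BTreeMap::new();
    stored.insert("src/a.rs".to_string(), content_hash(b"fn a() {}"));
    stored.insert("src/b.rs".to_string(), content_hash(b"fn b() {}"));

    let mut current = BTreeMap::new();
    current.insert("src/a.rs".to_string(), b"fn a() { todo!() }".to_vec());
    current.insert("src/c.rs".to_string(), b"fn c() {}".to_vec());

    println!("{:?}", plan_reindex(&stored, &current));
}
```

A forced rebuild corresponds to ignoring the stored hashes and treating every file as added.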

🧭 Where it fits

Lighter than the graph tools, structure-aware where ripgrep is text-only. It does not match a model-based tool on conceptual recall, and it does not build a call graph — those are deliberately out of scope.

| | apohara-codesearch | graph / embedding tools | ripgrep |
| --- | --- | --- | --- |
| Runtime dependencies | one static binary | Node + native bindings + toolchain | one binary |
| Model download | none | hundreds of MB | none |
| External DB / service | none | embedded graph / vector DB | none |
| Offline / air-gapped | yes | usually requires a fetch | yes |
| Structural context | signatures + imports/exports (4 langs) | call graphs, deep | text only |
| Ranking | hybrid BM25 + vector (RRF) | learned embeddings | exact / regex |

🔬 How it works / honesty

  1. Walk + chunk. A .gitignore-aware walk splits each file into per-symbol chunks (with the symbol's signature attached) plus bounded module-remainder and window chunks, so a giant file never collapses into one diluted chunk.
  2. Index. Each chunk gets a BM25 lexical row (SQLite FTS5) and a feature-hash vector row (sqlite-vec), keyed on a shared row id. Both sides share one identifier tokenizer, so parseString and parse_string match each other.
  3. Search. A query runs through both BM25 and vector k-NN; the two ranked lists are merged with Reciprocal Rank Fusion, diversified with MMR, then the survivors are hydrated with their structural context.
  4. Stay current. Re-indexing hashes each file and reprocesses only what changed, in a single transaction that keeps the three tables (chunks, FTS5, and sqlite-vec) consistent.
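
Step 2 above relies on one identifier tokenizer shared by the lexical and vector sides, so that parseString and parse_string produce the same tokens. A minimal sketch of such a splitter follows; the function name and the exact splitting rules are assumptions for illustration, and this simple version does not treat acronym runs such as HTTPServer specially.

```rust
/// Split an identifier into lowercase word tokens on non-alphanumeric
/// separators (snake_case, kebab-case) and on lower-to-upper boundaries (camelCase).
fn split_identifier(ident: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut cur = String::new();
    let mut prev_lower = false;
    for c in ident.chars() {
        if !c.is_alphanumeric() {
            // Separator: close the current token, if any.
            if !cur.is_empty() {
                tokens.push(std::mem::take(&mut cur));
            }
            prev_lower = false;
            continue;
        }
        if c.is_uppercase() && prev_lower {
            // camelCase boundary: the previous char was lowercase or a digit.
            tokens.push(std::mem::take(&mut cur));
        }
        cur.extend(c.to_lowercase());
        prev_lower = c.is_lowercase() || c.is_ascii_digit();
    }
    if !cur.is_empty() {
        tokens.push(cur);
    }
    tokens
}

fn main() {
    // Both spellings reduce to the same token list, so either one
    // matches the other in the lexical index and in the hashed vector.
    assert_eq!(split_identifier("parseString"), split_identifier("parse_string"));
    println!("{:?}", split_identifier("parseString"));
}
```

Feeding the same token list to both BM25 and the feature hash keeps the two rankings comparable before they are fused.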

Footprint at scale

Measured with the default feature-hash embedder on a Ryzen 5 3600 / 48 GB box, driven over the stdio MCP tools:

| Repo | LOC | Cold index | Peak RSS | Warm query | Index on disk |
| --- | --- | --- | --- | --- | --- |
| tokio | 174k Rust | ~10 s | ~22 MB | ~18 ms | 39 MB |
| hugo | 224k Go | ~26 s | ~24 MB | ~22 ms | 54 MB |

Peak resident memory stays nearly flat as the repo grows (~22 MB at 174k LOC, ~24 MB at 224k LOC), with no OOM and no external process. One SQLite file is the only state.

Warning

The vector is a robustness layer, not a semantic engine. Because the embedding is a feature-hash, a conceptual query that shares no tokens with the target will not surface it — and on a clean corpus where lexical search already wins, fusion can be a slight net negative. BENCHMARK.md publishes this (synthetic corpus + a one-off external comparison on real OSS, with ≥30% committed known-miss queries) rather than hiding it. Deep structural context (callers/callees, call graphs) is out of scope by design. A real local embedding model is an opt-in, user-supplied build feature — never downloaded — so the default install stays zero-dependency.

See BENCHMARK.md for the method, the reproduce command, and per-mode recall@k / MRR across BM25-only, vector-only, and hybrid.


🏗️ Repository layout

apohara-codesearch/
├── crates/
│   ├── apohara-indexer/        # the engine (library)
│   │   └── src/
│   │       ├── walker.rs        # .gitignore-aware file walk
│   │       ├── parser.rs        # tree-sitter structural extraction (Rust/TS/Python/Go)
│   │       ├── chunker.rs       # per-symbol + bounded module/window chunks
│   │       ├── tokens.rs        # shared snake/camel identifier tokenizer
│   │       ├── embeddings.rs    # deterministic blake3 feature-hash vector
│   │       ├── embedder.rs      # pluggable Embedder trait (opt-in gguf-embed)
│   │       ├── storage.rs       # SQLite: chunks + FTS5 + sqlite-vec
│   │       ├── schema.rs        # migrations + embedder refuse-to-mix meta
│   │       ├── search.rs        # BM25 + vector + RRF + MMR + structural boost
│   │       └── incremental.rs   # blake3-delta reindex in one transaction
│   └── apohara-codesearch/     # the MCP server + CLI
│       ├── src/{main,server,watch,dto}.rs
│       └── examples/           # bench-search (in-CI) · bench-external (one-off)
├── npm/                         # @apohara/codesearch-mcp wrapper (downloads the Release binary)
├── .claude-plugin/ + marketplace.json   # Claude Code plugin manifest
└── .github/workflows/          # ci.yml (test/clippy/fmt/dist) · release.yml (cargo-dist)

🗺️ Roadmap

  • MCP stdio server (search_code + reindex) + watch subcommand
  • Structural extraction for Rust, TypeScript, Python, Go
  • Hybrid retrieval — BM25 + feature-hash vector, RRF + MMR + structural boost
  • Incremental reindex (blake3 content-hash deltas), one SQLite file
  • Honest benchmark — synthetic (in-CI) + external real-OSS, with committed known-miss
  • Large-OSS soak (Rust + Go ≥100k LOC): ~22 to ~24 MB peak RSS, nearly flat
  • Pluggable Embedder trait (opt-in, default stays zero-model)
  • Real local embedder backend (candle / safetensors, opt-in, user-supplied)
  • Skip generated/minified assets in the walker (DB-bloat hardening)
  • Per-language chunk-cap validation (TypeScript / Python)

🤝 Contributing

Contributions are welcome.

  1. Fork the repository.
  2. Create a feature branch (git checkout -b feature/my-change).
  3. Make your change and run the suite: cargo test --workspace (clippy -D warnings + rustfmt --check gate CI).
  4. Open a pull request.

Unless you state otherwise, any contribution you intentionally submit for inclusion in this work, as defined in the Apache-2.0 license, shall be dual-licensed as below, without any additional terms or conditions.


📄 License

Licensed under either of MIT or Apache-2.0, at your option. See NOTICE for third-party dependency licenses.

Maintained by SuarezPM.
