Add MCP server surface and parallel indexing for AI coding agents #44
Merged
Adds a fifth surface, `coderag mcp`, exposing CodeRAG over the Model Context Protocol so Claude Code / Codex / Cursor query a warm, pre-indexed workspace instead of running slow grep/glob/read loops.

- coderag/surfaces/mcp_server.py: FastMCP server (stdio) with four tools: search_code (compact path:line results), get_file (precise range, indexed-only), index_status, reindex. Thin adapter over the CodeRAG facade. New [mcp] extra.
- Zero-config UX: warms the embedding model once, indexes the workspace in the background (responsive immediately), and keeps it live via the watcher.
- Concurrency: serialize all writers through a CodeRAG facade lock and guard FaissVectorIndex add/remove/search/rebuild with its own lock, so the watcher can write while agent queries read. Adds an optional stop_event to watch().
- Parallel indexing: chunk+embed across index_workers threads while keeping the SQLite/FAISS writes single-writer (Indexer._prepare/_write).
- General file directories: CODERAG_INDEX_ALL_TEXT / `coderag mcp --all-text` indexes any UTF-8 text file (incl. extensionless like Dockerfile); binary files are skipped via a NUL-byte sniff. A few common text extensions added by default.
- scripts/bench_vs_grep.py: measures indexed search vs a grep baseline (accuracy via the eval harness, latency, and approximate context tokens).
- tests/test_mcp.py: in-memory tool tests + parallel-indexing correctness + a search-safe-during-indexing concurrency test.
- Docs: README MCP section with onboarding snippets, example.env, CI installs [mcp].

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PKDkohprCqYpiLmB1xx4sC
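The "parallel prepare, single writer" split described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in `coderag/indexer.py`: `prepare`, `ToyStore`, and `index_files` are hypothetical names, the chunker and the "embedding" are toy stand-ins, and only the `index_workers` parameter name comes from the commit message.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def prepare(item: tuple[str, str]) -> tuple[str, list[str], list[list[float]]]:
    """Parallel-safe step: chunk the text, then 'embed' each chunk."""
    path, text = item
    chunks = [text[i:i + 40] for i in range(0, len(text), 40)]  # toy fixed-size chunker
    vectors = [[float(len(c))] for c in chunks]  # stand-in for a real embedding model
    return path, chunks, vectors


class ToyStore:
    """Hypothetical stand-in for the SQLite/FAISS pair; writes are lock-guarded."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.rows: dict[str, list[tuple[str, list[float]]]] = {}

    def write(self, path: str, chunks: list[str], vectors: list[list[float]]) -> None:
        with self._lock:
            self.rows[path] = list(zip(chunks, vectors))


def index_files(files: dict[str, str], store: ToyStore, index_workers: int = 4) -> int:
    """Run prepare() on worker threads, but write only from the calling thread."""
    with ThreadPoolExecutor(max_workers=index_workers) as pool:
        # pool.map yields results in input order; consuming them here keeps
        # every store.write() call on a single thread (the single writer).
        for path, chunks, vectors in pool.map(prepare, files.items()):
            store.write(path, chunks, vectors)
    return len(store.rows)
```

The lock in `ToyStore` is redundant for this one caller; it is there because, per the commit message, a watcher may also write while the index is being queried.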
…ep loop" pitch

Surfaces the main use case up front: one ranked, path:line-cited query instead of an agent's multi-round grep/glob/read loop, with proof from the committed eval (this repo's 24 NL->file cases): hybrid MRR 0.822 / R@5 1.00 / R@1 ~0.69, beating ranked-keyword BM25 (0.751, itself stronger than raw grep). Points to scripts/bench_vs_grep.py for the latency/token comparison and docs/eval.md for the full symbol-level / reranker / multi-repo tables and caveats.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PKDkohprCqYpiLmB1xx4sC
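For readers unfamiliar with the metrics quoted above, here is a small sketch of how MRR and R@k are commonly computed for query-to-file cases. It assumes each case pairs a ranked list of paths with a set of relevant paths, and that R@k counts a case as a hit when any relevant path appears in the top k; the repo's eval harness may define these differently, and the function names are illustrative.

```python
def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1/rank of the first relevant result, or 0.0 if none was returned."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(cases: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over all cases."""
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in cases) / len(cases)


def recall_at_k(cases: list[tuple[list[str], set[str]]], k: int) -> float:
    """Fraction of cases with at least one relevant result in the top k."""
    hits = sum(
        1 for ranked, rel in cases if any(item in rel for item in ranked[:k])
    )
    return hits / len(cases)
```

Under this definition, R@5 of 1.00 means every case had a relevant file somewhere in its top five results, while MRR rewards placing it nearer the top.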
Replace the bare `except ValueError: pass` with a commented `continue` that explains why a non-"path:count" ripgrep line is skipped, clearing the CodeQL "empty except" code-scanning alert on PR #44.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PKDkohprCqYpiLmB1xx4sC
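The pattern this commit describes might look like the sketch below. It assumes the benchmark parses ripgrep's per-file count output, where each line has the form `path:count`; `parse_rg_counts` is a hypothetical name, not the function in `scripts/bench_vs_grep.py`.

```python
def parse_rg_counts(output: str) -> dict[str, int]:
    """Parse lines of the form 'path:count' into a {path: count} mapping."""
    counts: dict[str, int] = {}
    for line in output.splitlines():
        # Split on the LAST colon so paths that contain ':' (e.g. Windows
        # drive letters) stay intact.
        path, sep, count = line.rpartition(":")
        if not sep:
            continue  # no colon at all, so this cannot be a path:count line
        try:
            counts[path] = int(count)
        except ValueError:
            # The text after the last colon is not a number, so this is not
            # a "path:count" line (for example a diagnostic message). Skip
            # it instead of aborting the whole parse.
            continue
    return counts
```

A commented `continue` behaves the same as `pass` in this loop, but it records the intent and avoids the empty-except pattern that the code-scanning alert flagged.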
run_mcp's `transport` was annotated `str`, but FastMCP.run expects Literal["stdio", "sse", "streamable-http"], which failed `mypy coderag` on the CI quality-and-tests legs (3.11/3.12/3.13). Narrow the parameter to that Literal; argparse already constrains --transport to those values and the CLI call site passes Any, so nothing else changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PKDkohprCqYpiLmB1xx4sC
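A rough sketch of the narrowing, using only the standard library. `Transport`, `run_server`, and `build_parser` are hypothetical stand-ins for the real `run_mcp` and CLI wiring; the three literal values are the ones named in the commit message.

```python
import argparse
from typing import Literal, get_args

# The three transports named in the commit message.
Transport = Literal["stdio", "sse", "streamable-http"]


def run_server(transport: Transport = "stdio") -> str:
    """Hypothetical stand-in for run_mcp; a real server would start here."""
    return f"serving over {transport}"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="coderag-sketch")
    # Deriving choices from the Literal keeps the CLI and the type in sync.
    parser.add_argument("--transport", choices=get_args(Transport), default="stdio")
    return parser
```

Because attributes on an `argparse.Namespace` are typed as `Any`, passing `args.transport` to `run_server` type-checks without a cast, which matches the commit's remark that the call site needs no change.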
Summary
This PR adds an MCP (Model Context Protocol) server surface to CodeRAG, enabling AI coding agents (Claude Code, Codex, Cursor) to search a warm, pre-indexed workspace instead of running slow grep/glob/read loops. It also introduces parallel indexing to speed up the initial index build and adds support for indexing arbitrary text files.
Key Changes
MCP Server Surface (`coderag/surfaces/mcp_server.py`)
- `build_mcp()` function that constructs a FastMCP server with four tools:
  - `search_code`: Hybrid semantic + keyword search with optional filtering (language, path prefix, kind)
  - `get_file`: Read indexed files with optional line-range selection
  - `index_status`: Report index coverage, freshness, and retrieval config
  - `reindex`: Trigger incremental or full re-indexing
- `run_mcp()` function to start the server with background indexing and filesystem watching
- Result formatter (`_format_hit`) that returns `path:start-end` locations and truncated snippets by default to minimize token usage

Parallel Indexing (`coderag/indexer.py`)
- `_embed_and_write()` method that parallelizes the expensive chunking+embedding step across worker threads while keeping store/FAISS writes serial
- `_prepare()` and `_write()` helper methods to separate the parallelizable embedding from the single-threaded persistence
- `index_workers` (default 4) to control parallelism

Thread-Safe Vector Index (`coderag/store/vector_index.py`)
- `threading.RLock()` to serialize FAISS index access (reads and writes)
- Lock held in `add()`, `remove()`, `search()`, `ntotal`, `save()`, and `rebuild_from_store()`

Text File Indexing Support
- `detect_language()` in `coderag/chunking/languages.py` extended to support:
  - Extensions: `.xml`, `.html`, `.css`, `.scss`, `.vue`, `.svelte`, etc.
  - Filenames: `Dockerfile`, `Makefile`, `LICENSE`, `.env`, `.gitignore`, etc.
  - `all_text` parameter to treat unrecognized files as plain text (for general document search)
- `coderag/indexer.py` updated to skip binary files (NUL-byte detection in first 8 KB)
- `coderag/watch.py` updated to respect the `all_text` flag when detecting indexable files

Configuration & CLI
- `index_all_text` config option (default `False`) to enable indexing arbitrary text files
- `index_workers` and `embed_batch` config options for parallel indexing tuning
- `coderag mcp` CLI command with options:
  - `--transport`: Choose MCP transport (stdio, sse, streamable-http)
  - `--no-index`: Reuse existing index instead of re-indexing on startup
  - `--no-watch`: Disable filesystem watcher
  - `--all-text`: Enable text-file indexing
- `coderag/api.py` updated with `_index_lock` to serialize concurrent indexing operations

Testing & Benchmarking
- `tests/test_mcp.py`: MCP server tests covering in-memory tool calls, parallel-indexing correctness, and search safety during indexing
- `scripts/bench_vs_grep.py`: Benchmark script comparing CodeRAG's indexed search against a ripgrep baseline, measuring accuracy (recall@k, nDCG@k, MRR), latency, and approximate context tokens

https://claude.ai/code/session_01PKDkohprCqYpiLmB1xx4sC