
Add MCP server surface and parallel indexing for AI coding agents #44

Merged
Neverdecel merged 4 commits into master from claude/upbeat-thompson-tv56fk
Jun 17, 2026

Conversation

@Neverdecel

Owner

Summary

This PR adds an MCP (Model Context Protocol) server surface to CodeRAG, enabling AI coding agents (Claude Code, Codex, Cursor) to search a warm, pre-indexed workspace instead of running slow grep/glob/read loops. It also introduces parallel indexing to speed up the initial index build and adds support for indexing arbitrary text files.

Key Changes

MCP Server Surface (coderag/surfaces/mcp_server.py)

  • New build_mcp() function that constructs a FastMCP server with four tools:
    • search_code: Hybrid semantic + keyword search with optional filtering (language, path prefix, kind)
    • get_file: Read indexed files with optional line-range selection
    • index_status: Report index coverage, freshness, and retrieval config
    • reindex: Trigger incremental or full re-indexing
  • run_mcp() function to start the server with background indexing and filesystem watching
  • Compact result formatting (_format_hit) that returns path:start-end locations and truncated snippets by default to minimize token usage
  • Thread-safe state management for concurrent indexing and searching
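
To illustrate the compact result format described above, here is a minimal sketch of a `path:start-end` formatter with snippet truncation. The `Hit` dataclass, its field names, and the `max_chars` default are assumptions made for illustration; the actual `_format_hit` in `coderag/surfaces/mcp_server.py` may be structured differently.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical search hit; the real CodeRAG result type may differ."""
    path: str
    start_line: int
    end_line: int
    text: str


def format_hit(hit: Hit, max_chars: int = 160) -> str:
    """Render a hit as 'path:start-end' followed by a truncated snippet."""
    # Collapse all whitespace so the snippet stays on a single line.
    snippet = " ".join(hit.text.split())
    if len(snippet) > max_chars:
        snippet = snippet[:max_chars].rstrip() + "..."
    return f"{hit.path}:{hit.start_line}-{hit.end_line}  {snippet}"


hit = Hit("coderag/indexer.py", 10, 42, "def index(self):\n    return self._embed_and_write()")
print(format_hit(hit, max_chars=20))
```

Returning a location plus a short snippet lets an agent decide whether a follow-up `get_file` call for the exact range is worth the tokens.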

Parallel Indexing (coderag/indexer.py)

  • New _embed_and_write() method that parallelizes the expensive chunking+embedding step across worker threads while keeping store/FAISS writes serial
  • _prepare() and _write() helper methods to separate the parallelizable embedding from the single-threaded persistence
  • Configurable index_workers (default 4) to control parallelism
  • Progress bar support for multi-threaded indexing
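
The split between parallel preparation and serial writes can be sketched roughly as follows. This is not the code from `coderag/indexer.py`: `prepare` is a stand-in for the chunk+embed step, the `store` dict stands in for the SQLite/FAISS writes, and all names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def prepare(path: str) -> tuple[str, list[float]]:
    """Stand-in for the expensive chunk+embed step (safe to run in parallel)."""
    # Hypothetical "embedding": just the length of the path.
    return path, [float(len(path))]


def embed_and_write(paths: list[str], store: dict, workers: int = 4) -> None:
    """Prepare files on worker threads, then write results from one thread."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so writes stay deterministic.
        for path, vector in pool.map(prepare, paths):
            store[path] = vector  # serial write: only the calling thread touches the store


store: dict[str, list[float]] = {}
embed_and_write(["a.py", "bb.py", "ccc.py"], store)
print(store)
```

Because only the calling thread writes, the store needs no extra locking for this step, and a serial run (`workers=1`) and a parallel run produce the same contents.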

Thread-Safe Vector Index (coderag/store/vector_index.py)

  • Added threading.RLock() to serialize FAISS index access (reads and writes)
  • Protects all index operations: add(), remove(), search(), ntotal, save(), and rebuild_from_store()
  • Necessary because the MCP server runs the watcher (writes) alongside live agent queries (reads) concurrently
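
As a rough sketch of the locking pattern, the toy class below guards every operation with one `threading.RLock`. It uses a plain dict rather than FAISS, and the class and method bodies are illustrative assumptions, not the contents of `coderag/store/vector_index.py`.

```python
import threading


class LockedIndex:
    """Toy stand-in for a vector index whose operations share one RLock."""

    def __init__(self) -> None:
        self._lock = threading.RLock()
        self._vectors: dict[int, list[float]] = {}

    def add(self, key: int, vector: list[float]) -> None:
        with self._lock:
            self._vectors[key] = vector

    def remove(self, key: int) -> None:
        with self._lock:
            self._vectors.pop(key, None)

    @property
    def ntotal(self) -> int:
        with self._lock:
            return len(self._vectors)

    def search(self, query: list[float], k: int = 1) -> list[int]:
        with self._lock:
            # Re-entrant: reading ntotal re-acquires the lock this thread already holds.
            if self.ntotal == 0:
                return []

            def dist(key: int) -> float:
                return sum((a - b) ** 2 for a, b in zip(self._vectors[key], query))

            return sorted(self._vectors, key=dist)[:k]
```

One plausible reason to prefer an `RLock` over a plain `Lock` is shown in `search()`: a locked method can call another locked member on the same thread without deadlocking.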

Text File Indexing Support

  • Extended detect_language() in coderag/chunking/languages.py to support:
    • New file extensions: .xml, .html, .css, .scss, .vue, .svelte, etc.
    • Well-known filenames without extensions: Dockerfile, Makefile, LICENSE, .env, .gitignore, etc.
    • New all_text parameter to treat unrecognized files as plain text (for general document search)
  • Updated coderag/indexer.py to skip binary files (NUL-byte detection in first 8KB)
  • Updated coderag/watch.py to respect the all_text flag when detecting indexable files
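
The detection rules above could look something like the sketch below. The lookup tables are small illustrative subsets, the language tags are assumptions, and `looks_binary` is a hypothetical helper name; only the `all_text` parameter and the NUL-byte check over the first 8 KB come from the PR description.

```python
from pathlib import Path
from typing import Optional

# Illustrative subsets only; the real tables in CodeRAG are larger.
EXTENSIONS = {".py": "python", ".html": "html", ".css": "css", ".vue": "vue"}
FILENAMES = {"Dockerfile": "dockerfile", "Makefile": "makefile", ".gitignore": "text"}


def detect_language(path: str, all_text: bool = False) -> Optional[str]:
    """Return a language tag, or None when the file should not be indexed."""
    p = Path(path)
    # Well-known filenames are matched before extensions.
    if p.name in FILENAMES:
        return FILENAMES[p.name]
    lang = EXTENSIONS.get(p.suffix.lower())
    if lang is None and all_text:
        return "text"  # unrecognized files fall back to plain text
    return lang


def looks_binary(data: bytes, sniff_bytes: int = 8192) -> bool:
    """Treat a file as binary if a NUL byte appears in its first 8 KB."""
    return b"\x00" in data[:sniff_bytes]
```

With `all_text=True`, a file such as `notes.xyz` would be indexed as plain text, while a PNG would still be rejected by the NUL-byte check.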

Configuration & CLI

  • New index_all_text config option (default False) to enable indexing arbitrary text files
  • New index_workers and embed_batch config options for parallel indexing tuning
  • New coderag mcp CLI command with options:
    • --transport: Choose MCP transport (stdio, sse, streamable-http)
    • --no-index: Reuse existing index instead of re-indexing on startup
    • --no-watch: Disable filesystem watcher
    • --all-text: Enable text-file indexing
  • Updated coderag/api.py with _index_lock to serialize concurrent indexing operations
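
For reference, the `coderag mcp` options listed above map onto an argparse subcommand along these lines. This is a sketch, not the project's CLI module: the help strings and the `stdio` default are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="coderag")
    sub = parser.add_subparsers(dest="command", required=True)
    mcp = sub.add_parser("mcp", help="serve the index over MCP")
    # choices= rejects any transport outside the three supported values.
    mcp.add_argument(
        "--transport",
        choices=["stdio", "sse", "streamable-http"],
        default="stdio",  # assumed default
    )
    mcp.add_argument("--no-index", action="store_true", help="reuse the existing index")
    mcp.add_argument("--no-watch", action="store_true", help="disable the filesystem watcher")
    mcp.add_argument("--all-text", action="store_true", help="index arbitrary text files")
    return parser


args = build_parser().parse_args(["mcp", "--transport", "sse", "--no-watch"])
print(args.transport, args.no_index, args.no_watch, args.all_text)
```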

Testing & Benchmarking

  • New tests/test_mcp.py: Comprehensive MCP server tests covering:
    • Tool registration and invocation
    • Search result formatting and filtering
    • File reading with range selection
    • Index status and reindex operations
    • Parallel indexing correctness (serial vs. parallel produce identical results)
    • Concurrent search safety during indexing (race condition testing)
  • New scripts/bench_vs_grep.py: Benchmark script comparing CodeRAG's indexed search against a ripgrep baseline, measuring accuracy (recall@k, nDCG@k, MRR), latency, and approximate context tokens
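
For readers unfamiliar with the accuracy metrics named above, here is a small sketch of recall@k and MRR using their standard definitions. The function names are illustrative and are not taken from the eval harness.

```python
def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is found."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (ranked results, relevant set) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
```

An MRR of 1.0 means the first relevant file is always ranked first; 0.5 means it sits at rank 2 on average in reciprocal terms.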

https://claude.ai/code/session_01PKDkohprCqYpiLmB1xx4sC

Adds a fifth surface, `coderag mcp`, exposing CodeRAG over the Model Context
Protocol so Claude Code / Codex / Cursor query a warm, pre-indexed workspace
instead of running slow grep/glob/read loops.

- coderag/surfaces/mcp_server.py: FastMCP server (stdio) with four tools —
  search_code (compact path:line results), get_file (precise range, indexed-only),
  index_status, reindex. Thin adapter over the CodeRAG facade. New [mcp] extra.
- Zero-config UX: warms the embedding model once, indexes the workspace in the
  background (responsive immediately), and keeps it live via the watcher.
- Concurrency: serialize all writers through a CodeRAG facade lock and guard
  FaissVectorIndex add/remove/search/rebuild with its own lock, so the watcher
  can write while agent queries read. Adds an optional stop_event to watch().
- Parallel indexing: chunk+embed across index_workers threads while keeping the
  SQLite/FAISS writes single-writer (Indexer._prepare/_write).
- General file directories: CODERAG_INDEX_ALL_TEXT / `coderag mcp --all-text`
  indexes any UTF-8 text file (incl. extensionless like Dockerfile); binary files
  are skipped via a NUL-byte sniff. A few common text extensions added by default.
- scripts/bench_vs_grep.py: measures indexed search vs a grep baseline (accuracy
  via the eval harness, latency, and approximate context tokens).
- tests/test_mcp.py: in-memory tool tests + parallel-indexing correctness + a
  search-safe-during-indexing concurrency test.
- Docs: README MCP section with onboarding snippets, example.env, CI installs [mcp].
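
The optional stop_event mentioned in the concurrency bullet can be pictured as a polling loop that exits once a `threading.Event` is set. The sketch below is an assumption about the general pattern; the real `watch()` signature and its polling mechanism are not shown in this PR description.

```python
import threading
from typing import Callable, Optional


def watch(poll: Callable[[], None], interval: float = 0.5,
          stop_event: Optional[threading.Event] = None) -> int:
    """Call poll() repeatedly until stop_event is set; return the poll count."""
    if stop_event is None:
        stop_event = threading.Event()  # never set: loop runs until interrupted
    polls = 0
    while not stop_event.is_set():
        poll()
        polls += 1
        # wait() doubles as an interruptible sleep: it returns early once set.
        stop_event.wait(interval)
    return polls
```

A server can then run `watch` on a background thread and call `stop_event.set()` during shutdown instead of killing the thread.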

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PKDkohprCqYpiLmB1xx4sC
Comment thread scripts/bench_vs_grep.py Fixed
claude added 3 commits June 17, 2026 19:01
…ep loop" pitch

Surfaces the main use case up front — one ranked, path:line-cited query instead
of an agent's multi-round grep/glob/read loop — with proof from the committed
eval (this repo's 24 NL->file cases): hybrid MRR 0.822 / R@5 1.00 / R@1 ~0.69,
beating ranked-keyword BM25 (0.751, itself stronger than raw grep). Points to
scripts/bench_vs_grep.py for the latency/token comparison and docs/eval.md for
the full symbol-level / reranker / multi-repo tables and caveats.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PKDkohprCqYpiLmB1xx4sC
Replace the bare `except ValueError: pass` with a commented `continue` that
explains why a non-"path:count" ripgrep line is skipped, clearing the CodeQL
"empty except" code-scanning alert on PR #44.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PKDkohprCqYpiLmB1xx4sC
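
The pattern this commit describes, a commented `continue` in place of a bare `pass`, might look like the sketch below. `parse_rg_counts` is a hypothetical name, and the body is a guess at the shape of the parsing loop rather than the code in scripts/bench_vs_grep.py.

```python
def parse_rg_counts(output: str) -> dict[str, int]:
    """Parse 'path:count' lines (as printed by `rg --count`) into a dict."""
    counts: dict[str, int] = {}
    for line in output.splitlines():
        path, sep, count = line.rpartition(":")
        if not sep:
            continue  # no colon at all, so this cannot be "path:count"
        try:
            counts[path] = int(count)
        except ValueError:
            # Not a "path:count" line (e.g. a warning message); skip it.
            continue
    return counts
```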
run_mcp's `transport` was annotated `str`, but FastMCP.run expects
Literal["stdio", "sse", "streamable-http"], which failed `mypy coderag` on the
CI quality-and-tests legs (3.11/3.12/3.13). Narrow the parameter to that Literal;
argparse already constrains --transport to those values and the CLI call site
passes Any, so nothing else changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PKDkohprCqYpiLmB1xx4sC
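
The annotation change can be illustrated with a small stand-alone example. `run_server` is a hypothetical stand-in for `run_mcp`, and the runtime check is an addition for illustration; the commit itself only narrows the type annotation.

```python
from typing import Literal, get_args

Transport = Literal["stdio", "sse", "streamable-http"]


def run_server(transport: Transport = "stdio") -> str:
    """Hypothetical stand-in for run_mcp(); returns the transport it would use."""
    # A runtime check mirrors what the type checker enforces statically.
    if transport not in get_args(Transport):
        raise ValueError(f"unsupported transport: {transport!r}")
    return transport


print(run_server("sse"))
```

With the `Literal` annotation, a call such as `run_server("tcp")` is flagged by mypy before the code runs, whereas a plain `str` annotation would accept it.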
@codecov-commenter


Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

ℹ️ You can also turn on project coverage checks and project coverage reporting on Pull Request comment

Thanks for integrating Codecov - We've got you covered ☂️

@Neverdecel Neverdecel merged commit e8f5ff1 into master Jun 17, 2026
13 checks passed
@Neverdecel Neverdecel deleted the claude/upbeat-thompson-tv56fk branch June 18, 2026 08:01