Add MCP server surface and parallel indexing for AI coding agents by Neverdecel · Pull Request #44 · Neverdecel/CodeRAG

Neverdecel · 2026-06-17T18:56:16Z

Summary

This PR adds an MCP (Model Context Protocol) server surface to CodeRAG, enabling AI coding agents (Claude Code, Codex, Cursor) to search a warm, pre-indexed workspace instead of running slow grep/glob/read loops. It also introduces parallel indexing to speed up the initial index build and adds support for indexing arbitrary text files.

Key Changes

MCP Server Surface (`coderag/surfaces/mcp_server.py`)

New build_mcp() function that constructs a FastMCP server with four tools:
- search_code: Hybrid semantic + keyword search with optional filtering (language, path prefix, kind)
- get_file: Read indexed files with optional line-range selection
- index_status: Report index coverage, freshness, and retrieval config
- reindex: Trigger incremental or full re-indexing
run_mcp() function to start the server with background indexing and filesystem watching
Compact result formatting (_format_hit) that returns path:start-end locations and truncated snippets by default to minimize token usage
Thread-safe state management for concurrent indexing and searching
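
To illustrate the compact result format, here is a minimal sketch of a `path:start-end` formatter with snippet truncation. The `Hit` dataclass, its field names, and the `max_chars` default are assumptions for illustration; the actual `_format_hit` in `coderag/surfaces/mcp_server.py` may differ.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical search hit; the real CodeRAG result type may differ."""
    path: str
    start_line: int
    end_line: int
    text: str


def format_hit(hit: Hit, max_chars: int = 160) -> str:
    """Render a hit as 'path:start-end' plus a truncated one-line snippet."""
    location = f"{hit.path}:{hit.start_line}-{hit.end_line}"
    # Collapse all whitespace so the snippet stays on a single line.
    snippet = " ".join(hit.text.split())
    if len(snippet) > max_chars:
        # Reserve three characters for the ellipsis marker.
        snippet = snippet[: max_chars - 3] + "..."
    return f"{location}  {snippet}"
```

Returning a location plus a short snippet lets the agent decide whether a follow-up `get_file` call with a line range is worth the tokens.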

Parallel Indexing (`coderag/indexer.py`)

New _embed_and_write() method that parallelizes the expensive chunking+embedding step across worker threads while keeping store/FAISS writes serial
_prepare() and _write() helper methods to separate the parallelizable embedding from the single-threaded persistence
Configurable index_workers (default 4) to control parallelism
Progress bar support for multi-threaded indexing
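
The prepare/write split can be sketched roughly as follows. This is not the PR's code: `prepare` is a stand-in that fakes the chunk+embed step, and the `store` list stands in for the SQLite/FAISS writes. Only the shape (parallel prepare, single writer) follows the description above.

```python
from concurrent.futures import ThreadPoolExecutor


def prepare(path: str) -> tuple[str, list[float]]:
    """Stand-in for the chunk+embed step (hypothetical; no real embedding)."""
    return path, [float(len(path))]


def index_files(paths: list[str], workers: int = 4) -> list[tuple[str, list[float]]]:
    """Prepare files on worker threads, then write results from one thread."""
    store: list[tuple[str, list[float]]] = []  # stand-in for SQLite/FAISS
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() runs prepare() on the pool but yields results in input order,
        # so the loop body below is the only writer.
        for record in pool.map(prepare, paths):
            store.append(record)
    return store
```

Because results are consumed in input order by a single thread, a run with one worker and a run with several produce the same store contents, which is the property the parallel-indexing correctness test checks.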

Thread-Safe Vector Index (`coderag/store/vector_index.py`)

Added threading.RLock() to serialize FAISS index access (reads and writes)
Protects all index operations: add(), remove(), search(), ntotal, save(), and rebuild_from_store()
Necessary because the MCP server runs the watcher (writes) alongside live agent queries (reads) concurrently
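
A minimal sketch of the locking pattern, using a toy in-memory index rather than FAISS. The class name, the brute-force search, and the `replace` helper are illustrative assumptions; only the idea of guarding every operation with one `threading.RLock` comes from the description above.

```python
import threading


class LockedIndex:
    """Toy stand-in for a vector index; no FAISS involved."""

    def __init__(self) -> None:
        self._lock = threading.RLock()
        self._vectors: dict[int, list[float]] = {}

    @property
    def ntotal(self) -> int:
        with self._lock:
            return len(self._vectors)

    def add(self, ids, vectors) -> None:
        with self._lock:
            for i, v in zip(ids, vectors):
                self._vectors[i] = list(v)

    def remove(self, ids) -> None:
        with self._lock:
            for i in ids:
                self._vectors.pop(i, None)

    def search(self, query, k: int = 5) -> list[int]:
        with self._lock:
            # Brute-force squared L2 distance over a consistent snapshot.
            scored = sorted(
                self._vectors.items(),
                key=lambda item: sum((a - b) ** 2 for a, b in zip(item[1], query)),
            )
            return [i for i, _ in scored[:k]]

    def replace(self, old_ids, new_ids, vectors) -> None:
        # Re-entrant: remove() and add() re-acquire the lock this thread holds.
        with self._lock:
            self.remove(old_ids)
            self.add(new_ids, vectors)
```

A re-entrant lock matters for compound operations such as `replace`, where a method that already holds the lock calls other locked methods; a plain `threading.Lock` would deadlock there.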

Text File Indexing Support

Extended detect_language() in coderag/chunking/languages.py to support:
- New file extensions: .xml, .html, .css, .scss, .vue, .svelte, etc.
- Well-known filenames without extensions: Dockerfile, Makefile, LICENSE, .env, .gitignore, etc.
- New all_text parameter to treat unrecognized files as plain text (for general document search)
Updated coderag/indexer.py to skip binary files (NUL-byte detection in first 8KB)
Updated coderag/watch.py to respect the all_text flag when detecting indexable files
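
The detection rules above could look something like this sketch. The two lookup tables are small illustrative subsets, and the language labels (`"dockerfile"`, `"text"`, and so on) are assumptions rather than the values used in `coderag/chunking/languages.py`.

```python
from pathlib import Path
from typing import Optional

# Illustrative subsets only; the labels are hypothetical.
EXTENSIONS = {".py": "python", ".html": "html", ".css": "css", ".xml": "xml"}
FILENAMES = {"Dockerfile": "dockerfile", "Makefile": "makefile", ".gitignore": "text"}


def looks_binary(path: Path, sniff_bytes: int = 8192) -> bool:
    """Treat a file as binary if its first 8 KB contains a NUL byte."""
    with path.open("rb") as fh:
        return b"\x00" in fh.read(sniff_bytes)


def detect_language(path: Path, all_text: bool = False) -> Optional[str]:
    """Resolve by well-known filename, then extension, then the all_text fallback."""
    if path.name in FILENAMES:
        return FILENAMES[path.name]
    lang = EXTENSIONS.get(path.suffix.lower())
    if lang is None and all_text:
        # Unrecognized files are indexed as plain text only when opted in.
        return "text"
    return lang
```

Checking the filename before the extension is what lets extensionless files such as `Dockerfile` resolve without enabling `all_text`.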

Configuration & CLI

New index_all_text config option (default False) to enable indexing arbitrary text files
New index_workers and embed_batch config options for parallel indexing tuning
New coderag mcp CLI command with options:
- --transport: Choose MCP transport (stdio, sse, streamable-http)
- --no-index: Reuse existing index instead of re-indexing on startup
- --no-watch: Disable filesystem watcher
- --all-text: Enable text-file indexing
Updated coderag/api.py with _index_lock to serialize concurrent indexing operations
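
For orientation, the CLI options listed above map onto an `argparse` setup along these lines. This is a sketch, not the project's parser: the `build_parser` name, the help strings, and the `stdio` default are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the `coderag mcp` options in this PR."""
    parser = argparse.ArgumentParser(prog="coderag")
    sub = parser.add_subparsers(dest="command", required=True)
    mcp = sub.add_parser("mcp", help="Serve the workspace over MCP")
    mcp.add_argument(
        "--transport",
        choices=["stdio", "sse", "streamable-http"],
        default="stdio",  # assumed default
    )
    mcp.add_argument("--no-index", action="store_true", help="Reuse the existing index")
    mcp.add_argument("--no-watch", action="store_true", help="Disable the filesystem watcher")
    mcp.add_argument("--all-text", action="store_true", help="Index arbitrary text files")
    return parser
```

With this shape, `build_parser().parse_args(["mcp", "--transport", "sse", "--no-watch"])` yields `transport="sse"` and `no_watch=True`, and an unknown transport is rejected by `choices` before any server code runs.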

Testing & Benchmarking

New tests/test_mcp.py: Comprehensive MCP server tests covering:
- Tool registration and invocation
- Search result formatting and filtering
- File reading with range selection
- Index status and reindex operations
- Parallel indexing correctness (serial vs. parallel produce identical results)
- Concurrent search safety during indexing (race condition testing)
New scripts/bench_vs_grep.py: Benchmark script comparing CodeRAG's indexed search against a ripgrep baseline, measuring accuracy (recall@k, nDCG@k, MRR), latency, and approximate context tokens
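
For readers unfamiliar with the accuracy metrics named above, here is a sketch of their standard binary-relevance definitions. The function names are illustrative, and the repository's eval harness may compute or aggregate them differently.

```python
import math


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1/rank of the first relevant result; MRR is the mean over all queries."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Discounted gain of the top k, normalized by the best possible ordering."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```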

https://claude.ai/code/session_01PKDkohprCqYpiLmB1xx4sC

Adds a fifth surface, `coderag mcp`, exposing CodeRAG over the Model Context Protocol so Claude Code / Codex / Cursor query a warm, pre-indexed workspace instead of running slow grep/glob/read loops.

- coderag/surfaces/mcp_server.py: FastMCP server (stdio) with four tools: search_code (compact path:line results), get_file (precise range, indexed-only), index_status, reindex. Thin adapter over the CodeRAG facade. New [mcp] extra.
- Zero-config UX: warms the embedding model once, indexes the workspace in the background (responsive immediately), and keeps it live via the watcher.
- Concurrency: serialize all writers through a CodeRAG facade lock and guard FaissVectorIndex add/remove/search/rebuild with its own lock, so the watcher can write while agent queries read. Adds an optional stop_event to watch().
- Parallel indexing: chunk+embed across index_workers threads while keeping the SQLite/FAISS writes single-writer (Indexer._prepare/_write).
- General file directories: CODERAG_INDEX_ALL_TEXT / `coderag mcp --all-text` indexes any UTF-8 text file (incl. extensionless like Dockerfile); binary files are skipped via a NUL-byte sniff. A few common text extensions added by default.
- scripts/bench_vs_grep.py: measures indexed search vs a grep baseline (accuracy via the eval harness, latency, and approximate context tokens).
- tests/test_mcp.py: in-memory tool tests + parallel-indexing correctness + a search-safe-during-indexing concurrency test.
- Docs: README MCP section with onboarding snippets, example.env, CI installs [mcp].

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PKDkohprCqYpiLmB1xx4sC
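
As an illustration of the optional `stop_event` mentioned in this commit, a watcher loop can be made cancellable roughly like this. The `poll` callback, the `interval` parameter, and the returned poll count are assumptions for the sketch; the real `watch()` in `coderag/watch.py` has its own signature.

```python
import threading
from typing import Callable, Optional


def watch(
    poll: Callable[[], None],
    interval: float = 0.05,
    stop_event: Optional[threading.Event] = None,
) -> int:
    """Call poll() repeatedly until stop_event is set; returns the poll count."""
    stop_event = stop_event or threading.Event()
    polls = 0
    while not stop_event.is_set():
        poll()
        polls += 1
        # Event.wait() doubles as an interruptible sleep between polls.
        stop_event.wait(interval)
    return polls
```

Passing the event in from the caller lets a server shut the watcher thread down cleanly instead of relying on a daemon thread being killed at exit.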

…ep loop" pitch

Surfaces the main use case up front (one ranked, path:line-cited query instead of an agent's multi-round grep/glob/read loop) with proof from the committed eval (this repo's 24 NL->file cases): hybrid MRR 0.822 / R@5 1.00 / R@1 ~0.69, beating ranked-keyword BM25 (0.751, itself stronger than raw grep). Points to scripts/bench_vs_grep.py for the latency/token comparison and docs/eval.md for the full symbol-level / reranker / multi-repo tables and caveats.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PKDkohprCqYpiLmB1xx4sC

Replace the bare `except ValueError: pass` with a commented `continue` that explains why a non-"path:count" ripgrep line is skipped, clearing the CodeQL "empty except" code-scanning alert on PR #44.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PKDkohprCqYpiLmB1xx4sC
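
The pattern this commit describes looks roughly like the following sketch, which parses `path:count` lines of the kind `rg --count` prints. The `parse_rg_counts` name and the surrounding structure are assumptions; the actual code lives in scripts/bench_vs_grep.py.

```python
def parse_rg_counts(output: str) -> dict[str, int]:
    """Parse 'path:count' lines, skipping anything that does not fit."""
    counts: dict[str, int] = {}
    for line in output.splitlines():
        # rpartition keeps any ':' inside the path (e.g. Windows drive letters).
        path, sep, count = line.rpartition(":")
        if not sep:
            continue
        try:
            counts[path] = int(count)
        except ValueError:
            # Not a "path:count" line (e.g. a warning); skip it rather than fail.
            continue
    return counts
```

A commented `continue` behaves the same as `pass` here, but it documents the intent and avoids the "empty except" warning from static analysis.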

run_mcp's `transport` was annotated `str`, but FastMCP.run expects Literal["stdio", "sse", "streamable-http"], which failed `mypy coderag` on the CI quality-and-tests legs (3.11/3.12/3.13). Narrow the parameter to that Literal; argparse already constrains --transport to those values and the CLI call site passes Any, so nothing else changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PKDkohprCqYpiLmB1xx4sC
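
The narrowing can be sketched as below. The `Transport` alias and the runtime check are illustrative additions, and this `run_mcp` is a stand-in that returns the transport instead of starting a server.

```python
from typing import Literal, get_args

# Alias name is hypothetical; the values are the three transports from the PR.
Transport = Literal["stdio", "sse", "streamable-http"]


def run_mcp(transport: Transport = "stdio") -> str:
    """Stand-in: validates and returns the transport rather than serving."""
    # Optional runtime guard; mypy already rejects other string literals.
    if transport not in get_args(Transport):
        raise ValueError(f"unsupported transport: {transport!r}")
    return transport
```

With the `Literal` annotation, a call such as `run_mcp("websocket")` is flagged by the type checker, whereas a plain `str` annotation would only surface the mismatch where the value is forwarded to a stricter signature.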

github-advanced-security AI found potential problems Jun 17, 2026

Comment thread scripts/bench_vs_grep.py Fixed

claude added 3 commits June 17, 2026 19:01

Neverdecel merged commit e8f5ff1 into master Jun 17, 2026
13 checks passed

Neverdecel mentioned this pull request Jun 17, 2026

Show retrieval speed in the demo-mode web UI #45

Merged

Neverdecel deleted the claude/upbeat-thompson-tv56fk branch June 18, 2026 08:01

Neverdecel merged 4 commits into master from claude/upbeat-thompson-tv56fk

4 participants
