Symbol-level eval: non-saturated benchmark + model comparison findings #38
Merged
File-level eval on a small repo saturates (hybrid hits Hit@10=1.0), so it could not measure retrieval improvements. Symbol-level localization (finding the right function/class, not just the file) has real headroom and discriminates.

- build_from_git(symbols=True) / `coderag eval --build --level symbol`: maps each commit's changed lines (zero-context diff hunks) to the symbols they touch, parsed from the file content *at that commit* via CodeRAG's own chunker, then intersected with the symbols present at HEAD so every ground-truth symbol is retrievable from the index. Off by default.
- Tests cover symbol extraction (only the changed function is reported) and the default-off behavior.

Result (10 symbol-level cases, this repo): the benchmark stops saturating (Hit@10 ~0.5), and the previously-flat cross-encoder reranker now shows the predicted lift: R@1 0.183->0.283 (+55%), MRR 0.420->0.514, nDCG@10 0.369->0.448. This validates move #2 and confirms the file-level null result was a benchmark artifact. Documented in docs/eval.md and the strategy doc.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7
Tooling for the symbol-level model comparisons:

- RECOMMENDED_RERANKERS registry + `coderag eval --list-models` now lists local cross-encoder rerankers (MiniLM, bge-reranker-base, jina-reranker-v2) with size/notes, so code-aware rerankers are discoverable.
- scripts/bench_embedders.py --rerank-models: score one hybrid+rerank row per named reranker, to compare reranker models on a fixed index.
- coderag/eval/datasets/coderag_self_symbols.jsonl: 22 curated natural-language -> function/method cases (verified symbol names) for a trustworthy symbol-level eval, less noisy than the git-mined set.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7
Measured on the curated 22-case symbol dataset (bge-small vs jina-code, with the ms-marco reranker):

1. The code-specific jina-code-v2 does NOT beat bge-small on NL->symbol queries (dense MRR 0.483 vs 0.675); a good general text embedder wins for natural-language "where is X" retrieval.
2. Equal-weight hybrid is not universally better: for the strong bge-small retriever, dense alone (0.675) beats 1:1 hybrid (0.573) because weak BM25 drags it down via RRF. Fusion weighting should be query-type-aware (dense-up for NL, BM25-up for identifiers) -- the biggest lever found.
3. Reranking lifts top-1 precision (R@1 0.364->0.409, +12%), consistent with the git-mined result.

Documents these in docs/eval.md and elevates query-type fusion weighting in the strategy doc.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7
Why
PR #37 shipped the eval harness but flagged a blocker: file-level eval on this small repo saturates (hybrid hits Hit@10 = 1.0), so it couldn't measure any retrieval improvement — both the embedder swap and the reranker came back flat. This PR makes the benchmark discriminating and then uses it to get real answers.
What's in here
1. Symbol-level dataset mining (the enabler)
Finding the right function/class (not just file) is far harder, so the benchmark stops saturating (Hit@10 ≈ 0.5 instead of 1.0).
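The core of symbol-level ground truth is a line-range intersection: take the new-side line ranges from a zero-context diff, find the symbols whose spans overlap them in the file as of that commit, and keep only the symbols that still exist at HEAD. The sketch below illustrates that idea under stated assumptions. It is not the code in this PR: Python's `ast` module stands in for CodeRAG's own chunker, and `changed_ranges`, `symbols_in`, and `touched_symbols` are hypothetical names.

```python
import ast
import re

# Matches a unified-diff hunk header such as "@@ -12,3 +14,2 @@".
HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@")


def changed_ranges(diff_text):
    """Return inclusive (start, end) new-side line ranges from a -U0 diff."""
    ranges = []
    for line in diff_text.splitlines():
        m = HUNK_RE.match(line)
        if not m:
            continue
        start = int(m.group(1))
        count = int(m.group(2)) if m.group(2) is not None else 1
        if count == 0:
            # Pure deletion: no new-side lines. This sketch attributes it to
            # the line at the deletion point, which is a simplification.
            start, count = max(start, 1), 1
        ranges.append((start, start + count - 1))
    return ranges


def symbols_in(source):
    """Stand-in chunker: (name, first_line, last_line) for defs and classes."""
    out = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            out.append((node.name, node.lineno, node.end_lineno))
    return out


def touched_symbols(diff_text, source_at_commit, head_symbols):
    """Symbols overlapped by the diff that are also present at HEAD."""
    ranges = changed_ranges(diff_text)
    touched = {
        name
        for name, lo, hi in symbols_in(source_at_commit)
        if any(lo <= r_hi and r_lo <= hi for r_lo, r_hi in ranges)
    }
    # Dropping symbols missing at HEAD keeps every ground-truth label retrievable.
    return touched & set(head_symbols)


source = "def a():\n    return 1\n\n\ndef b():\n    return 2\n"
diff = "@@ -6 +6 @@\n-    return 3\n+    return 2\n"
result = touched_symbols(diff, source, {"a", "b"})
```

Note that this sketch reports an enclosing class along with a changed method, since both spans overlap the hunk; selecting only the innermost symbol is not modeled here.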
- `build_from_git(symbols=True)` / `coderag eval --build --level symbol`: maps each commit's changed lines (zero-context diff hunks) to the symbols they touch, parsed from the file content at that commit via CodeRAG's own chunker, then intersected with HEAD symbols so every ground-truth symbol is retrievable. Off by default.
- `coderag/eval/datasets/coderag_self_symbols.jsonl`: 22 hand-verified natural-language → function/method cases (less noisy than the git-mined set).

2. Tooling for model comparison
- `RECOMMENDED_RERANKERS` registry + `coderag eval --list-models` now lists local cross-encoder rerankers (MiniLM, bge-reranker-base, jina-reranker-v2).
- `scripts/bench_embedders.py --rerank-models`: one `hybrid+rerank` row per named reranker, on a fixed index.

3. Findings (measured, honest)
The reranker is validated once there's headroom. On 10 git-mined symbol cases it lifted R@1 0.183 → 0.283 (+55%); on the 22 curated cases, R@1 0.364 → 0.409 (+12%) — the earlier file-level "no lift" was a saturation artifact, not a property of the technique.
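For readers checking the numbers, here is a minimal sketch of the per-case metrics and the relative-lift arithmetic behind the percentages quoted above. It assumes the usual definitions (recall@k as the fraction of ground-truth symbols found in the top k, which is why R@1 can be fractional when a case has several ground-truth symbols); the function names are illustrative and are not taken from the eval harness.

```python
def hit_at_k(ranked, relevant, k):
    """1.0 if any ground-truth symbol appears in the top k, else 0.0."""
    return 1.0 if any(r in relevant for r in ranked[:k]) else 0.0


def recall_at_k(ranked, relevant, k):
    """Fraction of ground-truth symbols that appear in the top k."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first ground-truth symbol, or 0.0 if none is ranked."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def relative_lift(before, after):
    """Relative change, e.g. 0.5 means +50%."""
    return (after - before) / before


# Relative lifts for the two R@1 results reported in this PR.
git_mined_lift = relative_lift(0.183, 0.283)
curated_lift = relative_lift(0.364, 0.409)
print(f"git-mined: {git_mined_lift:+.0%}, curated: {curated_lift:+.0%}")
```

Dataset-level scores are the mean of these per-case values, so on a 22-case set a single case moving to rank 1 shifts R@1 by several points. That is worth keeping in mind when reading small differences.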
Curated 22-case symbol-level comparison:
- `bge-small · dense` (MRR 0.675) clearly beats `jina-code · dense` (0.483) on NL→symbol queries: a good general text embedder is stronger for "where is X" retrieval; jina-v2-base-code is older/code↔code-tuned.
- For `bge-small`, dense alone (0.675) beats 1:1 hybrid (0.573): weak BM25 on NL queries drags the dense ranking down via RRF. For the weaker jina-code, BM25 helps. → Fusion weighting should be query-type-aware (dense-up for NL, BM25-up for identifiers), not fixed 1:1.

Net: the largest lever found is query-type-aware fusion weighting, then reranking for top-1, not a bigger embedding model. Larger code-aware rerankers are registered but are ~1 GB and slow to rerank on CPU; test them on GPU / a smaller pool, ideally on a larger external repo.
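To make the fusion finding concrete, the sketch below shows weighted reciprocal rank fusion on a toy example where the dense ranking has the target first and BM25 has it last. Everything here is an assumption for illustration: `weighted_rrf` and `fusion_weights` are hypothetical names, `k=60` is the commonly used RRF constant, and the 3:1 weights and the identifier heuristic are untuned placeholders rather than values from this PR.

```python
import re
from collections import defaultdict

# Single token that looks like a code identifier (snake_case, dotted, camelCase).
_IDENT_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_.:]*(\(\))?$")


def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked lists: score(d) = sum_i weights[i] / (k + rank_i(d))."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += weight / (k + rank)
    # Sort by score descending; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def fusion_weights(query):
    """Return (dense_weight, bm25_weight) from a crude query-type guess."""
    q = query.strip()
    looks_like_code = bool(_IDENT_RE.match(q)) and (
        "_" in q or "." in q or any(c.isupper() for c in q[1:])
    )
    # Illustrative weights only: BM25-up for identifiers, dense-up for NL.
    return (1.0, 3.0) if looks_like_code else (3.0, 1.0)


dense = ["target", "distractor", "other"]   # strong dense retriever
bm25 = ["distractor", "other", "target"]    # weak lexical match on an NL query

equal = weighted_rrf([dense, bm25], [1.0, 1.0])
nl_weights = fusion_weights("where is the reranker registry defined")
tuned = weighted_rrf([dense, bm25], nl_weights)
```

In this toy case equal weighting lets the BM25 list push the distractor above the target, while the dense-up weights restore the dense ordering. It mirrors the mechanism described above, not the measured magnitudes.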
Testing
New offline tests for symbol mining (only the changed function is reported; default-off) and the reranker registry. Full `pytest -m "not integration"` green; `ruff` + `mypy` clean on new code.

Follow-ups
🤖 Generated with Claude Code