Multi-repo evaluation: macro-averaged generalization view #43
Merged
import json
from collections import OrderedDict
from pathlib import Path
from typing import List
from coderag.api import CodeRAG
from coderag.config import Config
from coderag.eval import build_from_git, compare_modes, load_dataset, save_dataset
from coderag.eval.harness import EvalResult, aggregate_by_mode, format_table
Single-repo tuning overfits (external-validation.md), so a config should only become a default once it wins on the average of several repos. Adds the aggregation + a driver to judge that.

- harness: mean_results() (macro-average several EvalResults, each weighted equally so a big repo can't dominate) and aggregate_by_mode() (group per-repo results by mode and average across repos). Exported from coderag.eval.
- scripts/bench_multirepo.py: manifest-driven driver that scores each repo and prints per-repo tables plus a macro-averaged aggregate (mean:<mode> rows). Reuses prepared indexes/datasets; --index / --build to prepare them.
- coderag/eval/datasets/multirepo.example.json sample manifest.
- Tests for the aggregation helpers; docs/eval.md "Multi-repo evaluation" section framing this as the gate for promoting adaptive fusion to default-on.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7
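The macro-averaging described in this commit can be sketched roughly as below. This is an illustration, not the harness code: the `Result` dataclass is a hypothetical stand-in for `EvalResult`, the `mrr` field name is assumed, raising on empty input is only one possible form of the empty-guard, and the scores are made up.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Result:
    """Hypothetical stand-in for EvalResult; the real fields may differ."""
    mode: str
    mrr: float
    n: int  # number of eval cases behind this score


def mean_results(results: Sequence[Result]) -> Result:
    """Macro-average: each repo counts once, whatever its case count."""
    if not results:
        # Assumed behaviour for the empty-guard; the real helper may differ.
        raise ValueError("cannot average an empty list of results")
    mrr = sum(r.mrr for r in results) / len(results)
    total_cases = sum(r.n for r in results)  # carried along, not used as a weight
    return Result(mode=f"mean:{results[0].mode}", mrr=mrr, n=total_cases)


def aggregate_by_mode(per_repo: Sequence[Sequence[Result]]) -> List[Result]:
    """Group per-repo results by mode (first-seen order), then average each group."""
    groups: "OrderedDict[str, List[Result]]" = OrderedDict()
    for repo_results in per_repo:
        for r in repo_results:
            groups.setdefault(r.mode, []).append(r)
    return [mean_results(group) for group in groups.values()]


# Made-up scores for illustration only: one small repo, one large repo.
small = [Result("hybrid", 0.60, 20), Result("dense", 0.40, 20)]
large = [Result("hybrid", 0.40, 180), Result("dense", 0.30, 180)]
aggregate = aggregate_by_mode([small, large])
```

With these illustrative numbers the macro-averaged hybrid score is 0.50, whereas a case-weighted (micro) average would be 0.42 because the large repo would dominate. That difference is the reason for weighting repos equally when the question is generalization.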
Ran scripts/bench_multirepo.py across four repos (coderag, flask, requests, click; 627 git-mined symbol cases, with the embedded-identifier classifier). Aggregate MRR: hybrid 0.442 (best) > adaptive 0.423 > bm25 0.411 > dense 0.356. Hybrid is first-or-tied on every repo; adaptive is a wash on the well-powered repos and trails overall.

The big curated-CodeRAG adaptive win was an artifact of dense-friendly clean-NL queries; on realistic git-mined commit queries dense is the weakest modality, so leaning dense stops paying off.

Locks in the defaults: adaptive_fusion stays off (safe opt-in, not an aggregate win); 1:1 hybrid stays the default. The multi-repo harness blocked a single-repo "win" from becoming a default.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7
d8ef7df to
d0aac70
Why
The external-repo validation (#40) proved single-repo tuning overfits: levers that won on CodeRAG reversed on pydantic. So a config (e.g. adaptive fusion) should only be promoted to a default once it wins on the average of several repos. This adds the tooling to judge that.

What

- `harness.mean_results()`: macro-averages several `EvalResult`s, each weighted equally so a big repo can't dominate (the right lens for generalization; total case count carried in `n`).
- `harness.aggregate_by_mode()`: groups per-repo results by mode label and averages each across repos. Both exported from `coderag.eval`.
- `scripts/bench_multirepo.py`: manifest-driven driver that scores each repo with the harness, prints a per-repo table, then a macro-averaged aggregate (`mean:<mode>` rows). Reuses prepared indexes/datasets (indexing is the slow part); `--index` / `--build` to prepare; `--adaptive` / `--rerank` to include those modes.
- `coderag/eval/datasets/multirepo.example.json`; `docs/eval.md` "Multi-repo evaluation" section.

Validation

Unit tests for the aggregation (macro-average, mode grouping, first-seen order, empty-guard) and an end-to-end smoke of the driver across two repos confirming per-repo tables + the aggregate `mean:*` rows. Full `pytest -m "not integration"` green; `ruff` + `mypy` clean.

Why it matters now
This is the explicit gate for promoting adaptive fusion (#42) to default-on: it must be ≥ hybrid on the aggregate and on every individual repo before the default flips. Next step is to run it across a handful of external repos.
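The promotion gate stated above could be expressed as a small predicate along these lines. This is a sketch under assumptions: `passes_gate` is a hypothetical helper that is not part of the PR, the metric is assumed to be MRR keyed by mode label, and the per-repo numbers are invented for illustration.

```python
from typing import Mapping


def passes_gate(
    per_repo_mrr: Mapping[str, Mapping[str, float]],
    candidate: str,
    baseline: str,
) -> bool:
    """True only if candidate >= baseline on every repo and on the macro-average."""
    if not per_repo_mrr:
        return False  # no evidence, no promotion
    scores = list(per_repo_mrr.values())
    wins_every_repo = all(s[candidate] >= s[baseline] for s in scores)
    # Macro-average: each repo contributes equally, regardless of size.
    cand_mean = sum(s[candidate] for s in scores) / len(scores)
    base_mean = sum(s[baseline] for s in scores) / len(scores)
    return wins_every_repo and cand_mean >= base_mean


# Invented per-repo MRR values, for illustration only.
example = {
    "repo_a": {"hybrid": 0.45, "adaptive": 0.47},
    "repo_b": {"hybrid": 0.40, "adaptive": 0.38},
}
promote = passes_gate(example, candidate="adaptive", baseline="hybrid")
```

Here `promote` is `False`: the candidate wins on one repo and ties on the average, but loses on `repo_b`. Winning on every repo already implies winning on the macro-average, so the aggregate check is kept only to mirror the gate as worded.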
🤖 Generated with Claude Code