Multi-repo evaluation: macro-averaged generalization view by Neverdecel · Pull Request #43 · Neverdecel/CodeRAG

Neverdecel · 2026-06-17T17:02:37Z

Why

The external-repo validation (#40) proved single-repo tuning overfits — levers that won on CodeRAG reversed on pydantic. So a config (e.g. adaptive fusion) should only be promoted to a default once it wins on the average of several repos. This adds the tooling to judge that.

What

harness.mean_results() — macro-averages several EvalResults, each weighted equally so a big repo can't dominate (the right lens for generalization; total case count carried in n).
harness.aggregate_by_mode() — groups per-repo results by mode label and averages each across repos. Both exported from coderag.eval.
scripts/bench_multirepo.py — manifest-driven driver: scores each repo with the harness, prints a per-repo table, then a macro-averaged aggregate (mean:<mode> rows). Reuses prepared indexes/datasets (indexing is the slow part); --index/--build to prepare; --adaptive/--rerank to include those modes.
Sample manifest coderag/eval/datasets/multirepo.example.json; docs/eval.md "Multi-repo evaluation" section.
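The macro-averaging described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `RepoResult` is a hypothetical stand-in for `EvalResult` (its real fields may differ), `macro_mean` and `group_and_average` are illustrative names for what `mean_results()` and `aggregate_by_mode()` are described as doing, and the numbers are made up to show why equal weighting matters.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class RepoResult:
    """Hypothetical stand-in for EvalResult; the real fields may differ."""
    mode: str   # e.g. "hybrid", "dense"
    mrr: float  # per-repo mean reciprocal rank
    n: int      # number of eval cases in that repo


def macro_mean(results: Sequence[RepoResult], label: str) -> RepoResult:
    """Average per-repo scores with equal weight per repo; n is the total case count."""
    if not results:
        raise ValueError("cannot average an empty list of results")
    mrr = sum(r.mrr for r in results) / len(results)
    return RepoResult(mode=label, mrr=mrr, n=sum(r.n for r in results))


def group_and_average(results: Sequence[RepoResult]) -> List[RepoResult]:
    """Group results by mode (first-seen order) and macro-average each group."""
    groups: Dict[str, List[RepoResult]] = OrderedDict()
    for r in results:
        groups.setdefault(r.mode, []).append(r)
    return [macro_mean(rs, f"mean:{mode}") for mode, rs in groups.items()]


# Made-up numbers: one large repo (400 cases) and one small repo (20 cases).
per_repo = [
    RepoResult("hybrid", 0.50, 400),
    RepoResult("dense", 0.30, 400),
    RepoResult("hybrid", 0.30, 20),
    RepoResult("dense", 0.50, 20),
]
rows = group_and_average(per_repo)

# Case-weighted (micro) average for comparison: the large repo dominates.
hybrid = [r for r in per_repo if r.mode == "hybrid"]
micro_hybrid = sum(r.mrr * r.n for r in hybrid) / sum(r.n for r in hybrid)
```

With these invented inputs the macro average treats both repos equally, so hybrid and dense tie, while the case-weighted average for hybrid is pulled toward the large repo's score.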

Validation

Unit tests for the aggregation (macro-average, mode grouping, first-seen order, empty-guard) and an end-to-end smoke of the driver across two repos confirming per-repo tables + the aggregate mean:* rows. Full pytest -m "not integration" green; ruff + mypy clean.

Why it matters now

This is the explicit gate for promoting adaptive fusion (#42) to default-on: it must be ≥ hybrid on the aggregate and on every individual repo before the default flips. Next step is to run it across a handful of external repos.
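The promotion rule can be written down as a small check. This is a sketch under assumptions, not part of the PR: `passes_gate` is a hypothetical helper, the input shape (repo name mapped to a per-mode score) is invented for illustration, and the scores below are made up rather than measured.

```python
from typing import Mapping


def passes_gate(
    per_repo: Mapping[str, Mapping[str, float]],
    candidate: str = "adaptive",
    baseline: str = "hybrid",
) -> bool:
    """Return True only if candidate >= baseline on every repo and on the macro average.

    per_repo maps a repo name to {mode label: score}; this shape is an assumption.
    """
    if not per_repo:
        return False  # no evidence, no promotion

    def macro(mode: str) -> float:
        # Equal weight per repo, matching the macro-averaged aggregate.
        return sum(scores[mode] for scores in per_repo.values()) / len(per_repo)

    wins_every_repo = all(
        scores[candidate] >= scores[baseline] for scores in per_repo.values()
    )
    return wins_every_repo and macro(candidate) >= macro(baseline)


# Illustrative, made-up scores: the candidate wins on average but loses on repo_b.
scores = {
    "repo_a": {"hybrid": 0.40, "adaptive": 0.60},
    "repo_b": {"hybrid": 0.45, "adaptive": 0.43},
}
promote = passes_gate(scores)
```

For a plain mean, winning on every repo already implies winning on the aggregate, so the per-repo condition is the binding one; the aggregate comparison mainly matters as the headline number in the report.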

🤖 Generated with Claude Code

+import json
+from collections import OrderedDict
+from pathlib import Path
+from typing import List


+from coderag.api import CodeRAG
+from coderag.config import Config
+from coderag.eval import build_from_git, compare_modes, load_dataset, save_dataset
+from coderag.eval.harness import EvalResult, aggregate_by_mode, format_table


Single-repo tuning overfits (external-validation.md), so a config should only become a default once it wins on the average of several repos. Adds the aggregation + a driver to judge that.

- harness: mean_results() (macro-average several EvalResults, each weighted equally so a big repo can't dominate) and aggregate_by_mode() (group per-repo results by mode and average across repos). Exported from coderag.eval.
- scripts/bench_multirepo.py: manifest-driven driver that scores each repo and prints per-repo tables plus a macro-averaged aggregate (mean:<mode> rows). Reuses prepared indexes/datasets; --index / --build to prepare them.
- coderag/eval/datasets/multirepo.example.json sample manifest.
- Tests for the aggregation helpers; docs/eval.md "Multi-repo evaluation" section framing this as the gate for promoting adaptive fusion to default-on.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7

Ran scripts/bench_multirepo.py across four repos (coderag, flask, requests, click; 627 git-mined symbol cases, with the embedded-identifier classifier).

Aggregate MRR: hybrid 0.442 (best) > adaptive 0.423 > bm25 0.411 > dense 0.356. Hybrid is first-or-tied on every repo; adaptive is a wash on the well-powered repos and trails overall. The big curated-CodeRAG adaptive win was an artifact of dense-friendly clean-NL queries; on realistic git-mined commit queries dense is the weakest modality, so leaning dense stops paying off.

Locks in the defaults: adaptive_fusion stays off (safe opt-in, not an aggregate win); 1:1 hybrid stays the default. The multi-repo harness blocked a single-repo "win" from becoming a default.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7

github-advanced-security AI found potential problems Jun 17, 2026

claude added 2 commits June 17, 2026 17:36

Neverdecel force-pushed the claude/multi-repo-eval branch from d8ef7df to d0aac70 Compare June 17, 2026 17:37

Neverdecel merged commit 31eb8d6 into master Jun 17, 2026
12 checks passed

Neverdecel deleted the claude/multi-repo-eval branch June 18, 2026 08:01

Neverdecel merged 2 commits into master from claude/multi-repo-eval

3 participants
