
Multi-repo evaluation: macro-averaged generalization view #43

Merged
Neverdecel merged 2 commits into master from claude/multi-repo-eval
Jun 17, 2026


@Neverdecel

Owner

Why

The external-repo validation (#40) proved single-repo tuning overfits — levers that won on CodeRAG reversed on pydantic. So a config (e.g. adaptive fusion) should only be promoted to a default once it wins on the average of several repos. This adds the tooling to judge that.

What

  • harness.mean_results() — macro-averages several EvalResults, each weighted equally so a big repo can't dominate (the right lens for generalization; total case count carried in n).
  • harness.aggregate_by_mode() — groups per-repo results by mode label and averages each across repos. Both exported from coderag.eval.
  • scripts/bench_multirepo.py: a manifest-driven driver that scores each repo with the harness, prints a per-repo table, then prints a macro-averaged aggregate (mean:<mode> rows). It reuses prepared indexes and datasets, since indexing is the slow part. Pass --index / --build to prepare them, and --adaptive / --rerank to include those modes.
  • Sample manifest coderag/eval/datasets/multirepo.example.json; docs/eval.md "Multi-repo evaluation" section.
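The macro-averaging and mode grouping described above can be sketched as follows. This is a minimal illustration, not the harness code: `Result`, `macro_mean`, and `group_by_mode` are hypothetical stand-ins for `EvalResult`, `mean_results()`, and `aggregate_by_mode()`, whose real fields and signatures may differ, and the scores are synthetic.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass(frozen=True)
class Result:
    """Hypothetical stand-in for EvalResult; the real fields may differ."""
    mode: str
    mrr: float
    n: int  # number of eval cases behind this score


def macro_mean(results: list[Result]) -> Result:
    """Average scores with equal weight per repo; sum n for reference only."""
    if not results:
        raise ValueError("cannot average an empty list of results")
    return Result(
        mode=f"mean:{results[0].mode}",
        mrr=fmean(r.mrr for r in results),
        n=sum(r.n for r in results),
    )


def group_by_mode(per_repo: list[list[Result]]) -> list[Result]:
    """Group per-repo results by mode label (first-seen order), then average."""
    groups: dict[str, list[Result]] = {}
    for repo_results in per_repo:
        for r in repo_results:
            groups.setdefault(r.mode, []).append(r)
    return [macro_mean(rs) for rs in groups.values()]


# Synthetic numbers, for illustration only: one small repo and one large repo.
small = [Result("hybrid", 0.60, 10), Result("dense", 0.40, 10)]
large = [Result("hybrid", 0.40, 990), Result("dense", 0.50, 990)]

for row in group_by_mode([small, large]):
    print(f"{row.mode:<12} mrr={row.mrr:.3f} n={row.n}")
```

With these synthetic inputs the macro-average ranks hybrid first (0.500 vs 0.450). A case-weighted average would be dominated by the 990-case repo and rank dense first (about 0.499 vs 0.402), which is the effect equal weighting is meant to avoid.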

Validation

Unit tests cover the aggregation (macro-average, mode grouping, first-seen order, empty-guard), and an end-to-end smoke test of the driver across two repos confirms the per-repo tables plus the aggregate mean:* rows. Full pytest -m "not integration" is green; ruff and mypy are clean.

Why it matters now

This is the explicit gate for promoting adaptive fusion (#42) to default-on: it must be ≥ hybrid on the aggregate and on every individual repo before the default flips. Next step is to run it across a handful of external repos.
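The promotion gate stated above can be written as a small check. This is a sketch under assumptions: `passes_gate` is a hypothetical helper, not part of the harness, it takes one score per repo (for example MRR) for each config, and the numbers are synthetic.

```python
from statistics import fmean


def passes_gate(candidate: dict[str, float], baseline: dict[str, float]) -> bool:
    """True if candidate >= baseline on the macro-average and on every repo."""
    if not baseline or candidate.keys() != baseline.keys():
        raise ValueError("both configs must be scored on the same non-empty repo set")
    wins_aggregate = fmean(candidate.values()) >= fmean(baseline.values())
    wins_everywhere = all(candidate[repo] >= baseline[repo] for repo in baseline)
    return wins_aggregate and wins_everywhere


# Synthetic per-repo scores, for illustration only.
hybrid = {"repo_a": 0.45, "repo_b": 0.40}
adaptive = {"repo_a": 0.50, "repo_b": 0.38}

print(passes_gate(adaptive, hybrid))  # False: higher mean, but loses on repo_b
```

For a plain mean, winning on every repo already implies winning on the aggregate. The two conditions are kept separate here only to mirror the gate as worded.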

🤖 Generated with Claude Code


import json
from collections import OrderedDict
from pathlib import Path
from typing import List
from coderag.api import CodeRAG
from coderag.config import Config
from coderag.eval import build_from_git, compare_modes, load_dataset, save_dataset
from coderag.eval.harness import EvalResult, aggregate_by_mode, format_table
claude added 2 commits June 17, 2026 17:36
Single-repo tuning overfits (external-validation.md), so a config should only
become a default once it wins on the average of several repos. Adds the
aggregation + a driver to judge that.

- harness: mean_results() (macro-average several EvalResults, each weighted
  equally so a big repo can't dominate) and aggregate_by_mode() (group
  per-repo results by mode and average across repos). Exported from coderag.eval.
- scripts/bench_multirepo.py: manifest-driven driver that scores each repo and
  prints per-repo tables plus a macro-averaged aggregate (mean:<mode> rows).
  Reuses prepared indexes/datasets; --index / --build to prepare them.
- coderag/eval/datasets/multirepo.example.json sample manifest.
- Tests for the aggregation helpers; docs/eval.md "Multi-repo evaluation"
  section framing this as the gate for promoting adaptive fusion to default-on.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7
Ran scripts/bench_multirepo.py across four repos (coderag, flask, requests,
click; 627 git-mined symbol cases, with the embedded-identifier classifier).
Aggregate MRR: hybrid 0.442 (best) > adaptive 0.423 > bm25 0.411 > dense 0.356.
Hybrid is first-or-tied on every repo; adaptive is a wash on the well-powered
repos and trails overall. The big curated-CodeRAG adaptive win was an artifact
of dense-friendly clean-NL queries; on realistic git-mined commit queries dense
is the weakest modality, so leaning dense stops paying off.

Locks in the defaults: adaptive_fusion stays off (safe opt-in, not an aggregate
win); 1:1 hybrid stays the default. The multi-repo harness blocked a single-repo
"win" from becoming a default.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01LhTCPRjNmSitYxgSDfttT7
@Neverdecel force-pushed the claude/multi-repo-eval branch from d8ef7df to d0aac70 on June 17, 2026 17:37
@Neverdecel merged commit 31eb8d6 into master on Jun 17, 2026
12 checks passed
@Neverdecel deleted the claude/multi-repo-eval branch on June 18, 2026 08:01

3 participants