
Feat/long context eval curation pipeline #2

Open

Fadhili5 wants to merge 2 commits into main from feat/long-context-eval-curation-pipeline

Conversation

@Fadhili5

Summary

• Introduces a three-stage Python curation pipeline (schema.py, scorer.py, harvester.py) that scores candidate repositories and harvests multi-file, cross-layer PRs as long-context eval tasks (part of #23316).
• Adds ExternalRepoRig: a new test utility that clones a real repo at a pinned commit, applies a patch, and runs the repo's own test suite as a pass/fail oracle.
• Adds a full Vitest test suite (18 tests) for ExternalRepoRig covering setup, clone, patch application, timeout, and cleanup.

Details:

Curation pipeline (evals/datasets/curation/):

Repos are scored across four weighted signals (import density 35%, multi-file PR ratio 30%, CI complexity 15%, language diversity 20%). Repos below the 70th-percentile composite or with a broken test suite at HEAD are dropped before any task harvesting. Harvested PRs must pass four hardness gates in sequence: file count (≥5 files, ≥2 modules), layer span (≥2 of api/core/test/config), issue linkage, and leakage (issue body must not contain gold-patch symbol names).
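The composite score above can be sketched as a weighted sum. The weights and signal names come from the PR text; the function shape and the assumption that each signal is already normalised to [0, 1] are illustrative, not the scorer's actual code:

```python
# Illustrative sketch of the composite repo score; weights follow the
# PR description, the normalisation assumption is ours.
SIGNAL_WEIGHTS = {
    "import_density": 0.35,
    "multi_file_pr_ratio": 0.30,
    "ci_complexity": 0.15,
    "language_diversity": 0.20,
}

def composite_score(signals: dict[str, float]) -> float:
    """Weighted sum of signals, each assumed normalised to [0, 1]."""
    return sum(SIGNAL_WEIGHTS[name] * signals[name] for name in SIGNAL_WEIGHTS)
```

A repo with every signal at its maximum scores 1.0; the scorer then keeps only repos at or above the 70th-percentile composite.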

ExternalRepoRig (packages/test-utils/src/external-repo-rig.ts):

A parallel class to TestRig (it does not extend it), intended for long_context.eval.ts in a later PR. Uses a --filter=blob:none partial clone and pipes patches via stdin to git apply -.

Related Issues

Closes google-gemini#23316

How to Validate

ExternalRepoRig tests

cd packages/test-utils && npm test

Python pipeline (dry run)

cd evals/datasets/curation
python -m venv .venv && source .venv/bin/activate  # or .venv\Scripts\activate on Windows
pip install -r requirements.txt
python -c "from schema import ScoredRepo, RawTask, ValidatedTask; print('schema OK')"
python scorer.py --candidates /dev/null --output /tmp/out.json --skip-baseline

Pre-Merge Checklist:

  • Updated relevant documentation and README
  • Added/updated tests (18 Vitest tests for ExternalRepoRig; Python pipeline validated via import check)
  • Noted breaking changes
  • Validated on required platforms/methods:
    • macOS
      • npm run
    • Windows
      • npm run
    • Linux

Adds ExternalRepoRig as a parallel class to TestRig in the
test-utils package. Where TestRig evaluates agent behaviour against
inline files, ExternalRepoRig evaluates task completion against real
repositories: clone at a pinned commit, apply a generated patch, and
run the repo's own test suite as the evaluation oracle.

This is the harness primitive for the long-context coding evaluation
dataset (issue google-gemini#23316). The class exposes five methods:

- setup(): initialises a clean working directory for the clone
- clone(): partial clone via --filter=blob:none + pinned checkout
- applyPatch(): pipes a unified diff to git apply stdin, no temp file
- runTestSuite(): spawns the test command with shell:true, merges
  stdout/stderr, resolves (never rejects) on timeout or spawn error
- cleanup(): removes the clone dir, respects KEEP_OUTPUT
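The resolve-never-reject contract of runTestSuite can be sketched like this, in Python for consistency with the pipeline code (the rig itself is TypeScript; the result shape and names below are assumptions):

```python
import subprocess

def run_test_suite(command: str, cwd: str, timeout_s: float = 600.0) -> dict:
    """Always returns a result dict; never raises.

    Mirrors the contract described above: run the repo's test command
    through a shell, merge stdout/stderr, and report timeouts and spawn
    errors as failed results rather than exceptions.
    """
    try:
        proc = subprocess.run(command, shell=True, cwd=cwd,
                              timeout=timeout_s, capture_output=True, text=True)
        # Merge stdout and stderr into one transcript.
        return {"passed": proc.returncode == 0,
                "output": proc.stdout + proc.stderr,
                "timedOut": False}
    except subprocess.TimeoutExpired:
        return {"passed": False, "output": "", "timedOut": True}
    except OSError as exc:  # spawn failure, e.g. missing working directory
        return {"passed": False, "output": str(exc), "timedOut": False}
```

Returning a result object on every path keeps the eval loop simple: a hung or unrunnable test suite is scored as a failed oracle, not a crashed harness.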

Follows existing TestRig conventions: INTEGRATION_TEST_FILE_DIR,
KEEP_OUTPUT, VERBOSE, and CI env vars are all honoured.

Also adds test/test:ci scripts to the package so the 18 unit tests
are picked up by the CI matrix.

Related to google-gemini#23316
…chema)

Implements the Python pipeline that builds the long-context eval dataset:

- schema.py: Pydantic v2 models for ScoredRepo, RawTask, ValidatedTask,
  SolvabilityProbe and associated enums (TaskType, LayerType)
- scorer.py: scores candidate repos on four weighted signals
  (import_density, multi_file_pr_ratio, ci_complexity, language_diversity),
  runs a baseline health gate, and selects repos at >=70th-percentile composite
- harvester.py: mines merged PRs and applies four hardness gates
  (file count, layer, issue linkage, leakage) to produce raw task candidates
- README.md: usage docs for all pipeline stages
- requirements.txt: pydantic>=2.0, requests>=2.31
- output/.gitkeep: preserves generated-output directory in Git
- .gitignore: excludes curation output JSON, __pycache__, .venv
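The four hardness gates applied by harvester.py can be sketched as below; the thresholds follow the PR description, while the task dict shape and field names are assumptions for illustration:

```python
# Illustrative sketch of the harvester's four sequential hardness gates.
REQUIRED_LAYERS = {"api", "core", "test", "config"}

def passes_hardness_gates(task: dict) -> bool:
    files = task["files"]  # changed file paths from the gold PR
    modules = {path.split("/")[0] for path in files}
    if len(files) < 5 or len(modules) < 2:        # gate 1: file count
        return False
    if len(set(task["layers"]) & REQUIRED_LAYERS) < 2:  # gate 2: layer span
        return False
    issue = task.get("linked_issue")
    if not issue:                                 # gate 3: issue linkage
        return False
    # gate 4: leakage — issue body must not name gold-patch symbols
    return not any(sym in issue["body"] for sym in task["gold_symbols"])
```

Running the gates in this order puts the cheap structural checks first, so the string scan for symbol leakage only runs on PRs that already look hard enough.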

Successfully merging this pull request may close these issues.

Long-Context & Complex Reasoning Coding Evaluation Dataset
