
Feat/long context eval curation pipeline #2

Open

Fadhili5 wants to merge 2 commits into main from feat/long-context-eval-curation-pipeline

Conversation

@Fadhili5

Summary

• Introduces a three-stage Python curation pipeline (schema.py, scorer.py, harvester.py) that scores candidate repositories and harvests multi-file, cross-layer PRs as long-context eval tasks (part of #23316).
• Adds ExternalRepoRig: a new test utility that clones a real repo at a pinned commit, applies a patch, and runs the repo's own test suite as a pass/fail oracle.
• Adds a full Vitest test suite (18 tests) for ExternalRepoRig covering setup, clone, patch application, timeout, and cleanup.

Details:

Curation pipeline (evals/datasets/curation/):

Repos are scored across four weighted signals (import density 35%, multi-file PR ratio 30%, CI complexity 15%, language diversity 20%). Repos below the 70th-percentile composite or with a broken test suite at HEAD are dropped before any task harvesting. Harvested PRs must pass four hardness gates in sequence: file count (≥5 files, ≥2 modules), layer span (≥2 of api/core/test/config), issue linkage, and leakage (issue body must not contain gold-patch symbol names).
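The composite score above can be sketched as a weighted sum. The weights and signal names come from the PR text; the function shape and the assumption that each signal is already normalised to [0, 1] are illustrative, not the scorer's actual code:

```python
# Illustrative sketch of the composite repo score; weights follow the
# PR description, the normalisation assumption is ours.
SIGNAL_WEIGHTS = {
    "import_density": 0.35,
    "multi_file_pr_ratio": 0.30,
    "ci_complexity": 0.15,
    "language_diversity": 0.20,
}

def composite_score(signals: dict[str, float]) -> float:
    """Weighted sum of signals, each assumed normalised to [0, 1]."""
    return sum(SIGNAL_WEIGHTS[name] * signals[name] for name in SIGNAL_WEIGHTS)
```

A repo with every signal at its maximum scores 1.0; the scorer then keeps only repos at or above the 70th-percentile composite.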

ExternalRepoRig (packages/test-utils/src/external-repo-rig.ts):

A parallel class to TestRig (it does not extend it), intended for long_context.eval.ts in a later PR. Uses a --filter=blob:none partial clone and pipes patches via stdin to git apply -.

Related Issues

Closes google-gemini#23316

How to Validate

ExternalRepoRig tests

cd packages/test-utils && npm test

Python pipeline (dry run)

cd evals/datasets/curation
python -m venv .venv && source .venv/bin/activate  # or .venv\Scripts\activate on Windows
pip install -r requirements.txt
python -c "from schema import ScoredRepo, RawTask, ValidatedTask; print('schema OK')"
python scorer.py --candidates /dev/null --output /tmp/out.json --skip-baseline

Pre-Merge Checklist:

  • Updated relevant documentation and README
  • Added/updated tests (18 Vitest tests for ExternalRepoRig; Python pipeline validated via import check)
  • Noted breaking changes
  • Validated on required platforms/methods:
    • macOS
      • npm run
    • Windows
      • npm run
    • Linux

Adds ExternalRepoRig as a parallel class to TestRig in the
test-utils package. Where TestRig evaluates agent behaviour against
inline files, ExternalRepoRig evaluates task completion against real
repositories: clone at a pinned commit, apply a generated patch, and
run the repo's own test suite as the evaluation oracle.

This is the harness primitive for the long-context coding evaluation
dataset (issue google-gemini#23316). The class exposes five methods:

- setup(): initialises a clean working directory for the clone
- clone(): partial clone via --filter=blob:none + pinned checkout
- applyPatch(): pipes a unified diff to git apply stdin, no temp file
- runTestSuite(): spawns the test command with shell:true, merges
  stdout/stderr, resolves (never rejects) on timeout or spawn error
- cleanup(): removes the clone dir, respects KEEP_OUTPUT
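The resolve-never-reject contract of runTestSuite can be sketched like this, in Python for consistency with the pipeline code (the rig itself is TypeScript; the result shape and names below are assumptions):

```python
import subprocess

def run_test_suite(command: str, cwd: str, timeout_s: float = 600.0) -> dict:
    """Always returns a result dict; never raises.

    Mirrors the contract described above: run the repo's test command
    through a shell, merge stdout/stderr, and report timeouts and spawn
    errors as failed results rather than exceptions.
    """
    try:
        proc = subprocess.run(command, shell=True, cwd=cwd,
                              timeout=timeout_s, capture_output=True, text=True)
        # Merge stdout and stderr into one transcript.
        return {"passed": proc.returncode == 0,
                "output": proc.stdout + proc.stderr,
                "timedOut": False}
    except subprocess.TimeoutExpired:
        return {"passed": False, "output": "", "timedOut": True}
    except OSError as exc:  # spawn failure, e.g. missing working directory
        return {"passed": False, "output": str(exc), "timedOut": False}
```

Returning a result object on every path keeps the eval loop simple: a hung or unrunnable test suite is scored as a failed oracle, not a crashed harness.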

Follows existing TestRig conventions: INTEGRATION_TEST_FILE_DIR,
KEEP_OUTPUT, VERBOSE, and CI env vars are all honoured.

Also adds test/test:ci scripts to the package so the 18 unit tests
are picked up by the CI matrix.

Related to google-gemini#23316
…chema)

Implements the Python pipeline that builds the long-context eval dataset:

- schema.py: Pydantic v2 models for ScoredRepo, RawTask, ValidatedTask,
  SolvabilityProbe and associated enums (TaskType, LayerType)
- scorer.py: scores candidate repos on four weighted signals
  (import_density, multi_file_pr_ratio, ci_complexity, language_diversity),
  runs a baseline health gate, and selects repos at >=70th-percentile composite
- harvester.py: mines merged PRs and applies four hardness gates
  (file count, layer, issue linkage, leakage) to produce raw task candidates
- README.md: usage docs for all pipeline stages
- requirements.txt: pydantic>=2.0, requests>=2.31
- output/.gitkeep: preserves generated-output directory in Git
- .gitignore: excludes curation output JSON, __pycache__, .venv
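The four hardness gates applied by harvester.py can be sketched as below; the thresholds follow the PR description, while the task dict shape and field names are assumptions for illustration:

```python
# Illustrative sketch of the harvester's four sequential hardness gates.
REQUIRED_LAYERS = {"api", "core", "test", "config"}

def passes_hardness_gates(task: dict) -> bool:
    files = task["files"]  # changed file paths from the gold PR
    modules = {path.split("/")[0] for path in files}
    if len(files) < 5 or len(modules) < 2:        # gate 1: file count
        return False
    if len(set(task["layers"]) & REQUIRED_LAYERS) < 2:  # gate 2: layer span
        return False
    issue = task.get("linked_issue")
    if not issue:                                 # gate 3: issue linkage
        return False
    # gate 4: leakage — issue body must not name gold-patch symbols
    return not any(sym in issue["body"] for sym in task["gold_symbols"])
```

Running the gates in this order puts the cheap structural checks first, so the string scan for symbol leakage only runs on PRs that already look hard enough.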

Successfully merging this pull request may close these issues.

Long-Context & Complex Reasoning Coding Evaluation Dataset
