Adds ExternalRepoRig as a parallel class to TestRig in the test-utils package. Where TestRig evaluates agent behaviour against inline files, ExternalRepoRig evaluates task completion against real repositories: clone at a pinned commit, apply a generated patch, and run the repo's own test suite as the evaluation oracle. This is the harness primitive for the long-context coding evaluation dataset (issue google-gemini#23316).

The class exposes five methods:
- setup(): initialises a clean working directory for the clone
- clone(): partial clone via --filter=blob:none + pinned checkout
- applyPatch(): pipes a unified diff to git apply stdin, no temp file
- runTestSuite(): spawns the test command with shell:true, merges stdout/stderr, resolves (never rejects) on timeout or spawn error
- cleanup(): removes the clone dir, respects KEEP_OUTPUT

Follows existing TestRig conventions: INTEGRATION_TEST_FILE_DIR, KEEP_OUTPUT, VERBOSE, and CI env vars are all honoured.

Also adds test/test:ci scripts to the package so the 18 unit tests are picked up by the CI matrix.

Related to google-gemini#23316
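The "resolves, never rejects" contract of runTestSuite() can be illustrated with a minimal sketch. The real class is TypeScript; this Python version only mirrors the described behaviour (shell execution, merged stdout/stderr, a result on timeout or spawn error). The `SuiteResult` shape and the `run_test_suite` name are hypothetical and not taken from the PR.

```python
import subprocess
from dataclasses import dataclass
from typing import Optional


@dataclass
class SuiteResult:
    """Hypothetical result shape; the actual TypeScript return type is not shown in the PR."""
    exit_code: Optional[int]  # None when the command timed out or never started
    output: str               # stdout and stderr merged into one stream
    timed_out: bool
    error: Optional[str] = None


def run_test_suite(command: str, cwd: str, timeout_s: float) -> SuiteResult:
    """Run `command` through the shell and always return a result instead of raising."""
    try:
        proc = subprocess.run(
            command,
            shell=True,                # mirrors shell:true in the description
            cwd=cwd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout
            text=True,
            timeout=timeout_s,
        )
        return SuiteResult(proc.returncode, proc.stdout, timed_out=False)
    except subprocess.TimeoutExpired as exc:
        # Partial output may be bytes or None, depending on platform and timing.
        partial = exc.output or ""
        if isinstance(partial, bytes):
            partial = partial.decode(errors="replace")
        return SuiteResult(None, partial, timed_out=True)
    except OSError as exc:
        # The shell could not be started, for example because cwd does not exist.
        return SuiteResult(None, "", timed_out=False, error=str(exc))
```

Returning a value in every case lets an eval loop record a timeout or a broken environment as a failed task rather than aborting the whole run.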
…chema)

Implements the Python pipeline that builds the long-context eval dataset:
- schema.py: Pydantic v2 models for ScoredRepo, RawTask, ValidatedTask, SolvabilityProbe and associated enums (TaskType, LayerType)
- scorer.py: scores candidate repos on four weighted signals (import_density, multi_file_pr_ratio, ci_complexity, language_diversity), runs a baseline health gate, and selects repos at >=70th-percentile composite
- harvester.py: mines merged PRs and applies four hardness gates (file count, layer, issue linkage, leakage) to produce raw task candidates
- README.md: usage docs for all pipeline stages
- requirements.txt: pydantic>=2.0, requests>=2.31
- output/.gitkeep: preserves generated-output directory in Git
- .gitignore: excludes curation output JSON, __pycache__, .venv
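As a rough illustration of the scoring step, the sketch below combines the four signals with the weights stated in the PR summary and keeps repos at or above the 70th-percentile composite. It assumes each signal is already normalised to [0, 1] and uses a nearest-rank percentile; neither assumption is confirmed by the PR, and the function names are hypothetical. The baseline health gate is omitted.

```python
import math

# Weights as listed in the PR summary; they sum to 1.0.
WEIGHTS = {
    "import_density": 0.35,
    "multi_file_pr_ratio": 0.30,
    "ci_complexity": 0.15,
    "language_diversity": 0.20,
}


def composite(signals: dict) -> float:
    """Weighted sum of the four signals (assumed to be normalised to [0, 1])."""
    return sum(weight * signals[name] for name, weight in WEIGHTS.items())


def select_top(repos: dict, percentile: int = 70) -> list:
    """Return the names of repos whose composite is at or above the given percentile.

    Uses the nearest-rank definition of a percentile, which is an assumption;
    scorer.py may compute the cutoff differently.
    """
    if not repos:
        return []
    scores = {name: composite(signals) for name, signals in repos.items()}
    ranked = sorted(scores.values())
    rank = max(1, math.ceil(percentile * len(ranked) / 100))
    cutoff = ranked[rank - 1]
    return sorted(name for name, score in scores.items() if score >= cutoff)
```

With ten candidates whose composites are evenly spread, the nearest-rank cutoff falls on the seventh-lowest score, so four repos are kept.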
Summary
- Introduces a three-stage Python curation pipeline (`schema.py`, `scorer.py`, `harvester.py`) that scores candidate repositories and harvests multi-file, cross-layer PRs as long-context eval tasks (part of #23316)
- Adds `ExternalRepoRig`: a new test utility that clones a real repo at a pinned commit, applies a patch, and runs the repo's own test suite as a pass/fail oracle
- Adds a full Vitest test suite (18 tests) for `ExternalRepoRig` covering setup, clone, patch application, timeout, and cleanup

Details
Curation pipeline (`evals/datasets/curation/`):
Repos are scored across four weighted signals (import density 35%, multi-file PR ratio 30%, CI complexity 15%, language diversity 20%). Repos below the 70th-percentile composite or with a broken test suite at HEAD are dropped before any task harvesting. Harvested PRs must pass four hardness gates in sequence: file count (≥5 files, ≥2 modules), layer span (≥2 of api/core/test/config), issue linkage, and leakage (issue body must not contain gold-patch symbol names).
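The four hardness gates can be read as a short-circuiting chain. The sketch below is one possible reading, not the code in harvester.py: the `PullRequest` shape, the "top-level directory is the module" rule, the path-segment layer matching, and the identifier-based leakage check are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

LAYERS = ("api", "core", "test", "config")


@dataclass
class PullRequest:
    """Hypothetical input shape for the gates."""
    files: list
    issue_body: Optional[str]  # None when no issue is linked
    patch_symbols: set = field(default_factory=set)


def module_of(path: str) -> str:
    # Assumption: the top-level directory identifies the module.
    return path.split("/", 1)[0]


def layer_of(path: str) -> Optional[str]:
    # Assumption: a path segment named after a layer (or its plural) marks that layer.
    for part in path.lower().split("/"):
        for layer in LAYERS:
            if part in (layer, layer + "s"):
                return layer
    return None


def first_failed_gate(pr: PullRequest) -> Optional[str]:
    """Apply the gates in order; return the first one that fails, or None if all pass."""
    if len(pr.files) < 5 or len({module_of(f) for f in pr.files}) < 2:
        return "file_count"
    if len({layer_of(f) for f in pr.files} - {None}) < 2:
        return "layer_span"
    if not pr.issue_body:
        return "issue_linkage"
    identifiers = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", pr.issue_body))
    if identifiers & pr.patch_symbols:
        return "leakage"
    return None
```

Returning the name of the first failed gate, rather than a bare boolean, makes it easy to log why a candidate PR was rejected.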
ExternalRepoRig (`packages/test-utils/src/external-repo-rig.ts`):
A parallel class to `TestRig`; it does not extend it. Intended for use by `long_context.eval.ts` in a later PR. Uses a `--filter=blob:none` partial clone and pipes patches via stdin to `git apply -`.

Related Issues
Closes google-gemini#23316
How to Validate
ExternalRepoRig tests
    cd packages/test-utils && npm test

Python pipeline (dry run)

    cd evals/datasets/curation
    python -m venv .venv && source .venv/bin/activate  # or .venv\Scripts\activate on Windows
    pip install -r requirements.txt
    python -c "from schema import ScoredRepo, RawTask, ValidatedTask; print('schema OK')"
    python scorer.py --candidates /dev/null --output /tmp/out.json --skip-baseline

Pre-Merge Checklist: