feat: revert agent test edits at grading (swe_lego, swe_rebench_v2) by rasdani · Pull Request #1212 · PrimeIntellect-ai/verifiers

rasdani · 2026-04-20T21:54:35Z

Summary

Both swe_lego and swe_rebench_v2 apply test_patch in setup() so agents can read the failing tests from t=0, but then run test_cmd at grading against whatever state the working tree is in. This PR adds the canonical SWE-bench pre-grading dance — revert agent edits to test files, then re-apply test_patch cleanly — closing a reward-hack loophole where an agent could weaken FAIL_TO_PASS assertions mid-rollout and still score reward=1. Agent source edits (their actual fix) survive untouched; only test-file bits get canonicalized.

Why

Upstream SWE-bench's harness does this exact two-step before running the test command (see swebench/harness/test_spec/python.py#L405-L462 and swebench/harness/utils.py::get_modified_files / get_new_files). Our wrappers previously applied test_patch at setup-time only, so any modification the agent made to the test file in between stuck around for grading. This PR ports the upstream pattern.

Scope

New module verifiers/envs/experimental/composable/tasksets/swe/_test_patch.py with get_modified_files, get_new_files, and revert_and_reapply_test_patch. The async helper does git checkout <base_commit> -- <modified>, rm -f <new>, then delegates the re-apply to the taskset's own _apply_patch_file so the taskset-specific git apply flags (different between swe_lego and swe_rebench_v2) are preserved. Uses base_commit (not HEAD) so a git commit by the agent can't shift the checkout target onto their tampered snapshot.
swe_lego.py::_run_tests calls the helper at the top of the method, before any existing logic.
swe_rebench_v2.py::_run_tests calls the helper right after resolving workdir.
Setup-time test_patch apply is left in place on both wrappers (idempotent with the grading-time re-apply). No changes to _calculate_reward, setup, validate_instance, _apply_gold_patch, or any other taskset.

Test plan

Static checks

uv run ruff check verifiers/envs/experimental/composable/tasksets/swe/ — clean.
uv run pre-commit run --files <touched> — clean (ruff-check, ruff-format, sync-AGENTS).
Inline parser sanity tests via uv run python -c ...: modified-only patch, new-file-only patch, mixed, empty string, quoted paths-with-spaces, multiple modifieds, dedup, realistic SWE-bench-shaped test_patch — all 8 cases pass.

Live e2e (against real Prime sandboxes)

Happy-path validate (n=1 per wrapper) — confirms the revert-and-reapply doesn't regress the baseline (every validate_instance call now runs the helper; a broken helper would turn previously-green gold patches red):

wrapper	instance	`valid`	elapsed
`swe_lego`	`adamchainz__apig-wsgi-80`	✅ True	19.98s
`swe_rebench_v2`	`elastic__synthetics-316`	✅ True	44.40s

Sweep validate (n=4 per wrapper, concurrency=2) — light regression across 8 distinct rows:

wrapper	result	notes
`swe_lego`	4/4 ✅	—
`swe_rebench_v2`	4/4 ✅	First run was 3/4 with a transient on `crawler-commons__crawler-commons-227` (java, `parse_java_mvn`); a clean rerun on the identical instance list came back 4/4. `valid=False, error=None` without reproducibility — likely Maven/JVM first-run flake, no evidence the revert helper was involved.

Tamper test — the actual guard validation. Bypass gold; tamper test file via AST rewrite (replace every test_*() body with pass, preserving syntax); call _run_tests. Monkey-patched validate_instance for orchestration; monkey-patched revert_and_reapply_test_patch to no-op for the control case:

case	tamper	revert active?	reward	`valid`	explanation
A (this PR's code)	AST replaced 17 `test_*` bodies with `pass`	✅ yes	0.0	False	revert restored `test_apig_wsgi.py` + re-applied `test_patch` → canonical F2P tests ran against buggy code → failed → cheat blocked
B (control)	same tamper	❌ no (monkey-patched)	1.0	True	tamper survived → 17 `def test_*(): pass` bodies trivially succeeded → cheat succeeded

Both cases parsed 21 outcomes (tests ran cleanly in both — tamper preserved syntax, only the grading state differed). The reward split is entirely attributable to the revert step: with it, the loophole is structurally closed; without it, the exact reward-hack scenario this PR exists to prevent reproduces end-to-end.

Known limitations

Paths quoted with C-style escapes in diff headers (git config core.quotepath) are not fully decoded — we strip surrounding double quotes but leave backslash-escapes alone. SWE-bench test_patch fields in practice don't carry such paths; noted inline in the module docstring.
Revert scope matches upstream SWE-bench: only files listed in test_patch are reverted. An agent could in theory write/modify a conftest.py outside test_patch's file list to affect pytest collection — upstream has the same gap; closing it is a separate hardening question.

Mirror upstream SWE-bench's pre-grading dance so agents can't reward-hack by weakening FAIL_TO_PASS assertions mid-rollout. Before running the test command: git-checkout HEAD on any test files the agent modified, rm -f any newly-added test files, then re-apply test_patch cleanly. Agent source edits (their actual fix) are untouched. Adds a small helper module with two unified-diff parsers (get_modified_files, get_new_files) and an async revert/reapply that delegates the final re-apply to the taskset's own _apply_patch_file so the taskset-specific git apply flags are preserved.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 578cd8b. Configure here.}

Bugbot catch on #1212: `git checkout HEAD -- <test_file>` is unsafe if an agent runs `git add && git commit` mid-rollout — HEAD then points at the agent's commit, so "revert" would restore the agent's (potentially tampered) test version instead of the pristine base state. Re-apply afterward might then conflict or 3-way-merge against the tampered content, leaving the reward-hack loophole this PR exists to close partially open. Thread `base_commit` through `revert_and_reapply_test_patch` and use `git checkout <base_commit> -- <path>`. Matches upstream SWE-bench (swebench/harness/test_spec/python.py#L420). Both call-sites guard on `base_commit` being non-empty before invoking the helper. Includes docstring/comment updates pointing at the `HEAD` → `base_commit` rationale so the reasoning stays with the code.

cursor Bot reviewed Apr 20, 2026

View reviewed changes

Comment thread verifiers/envs/experimental/composable/tasksets/swe/_test_patch.py Outdated

rasdani requested a review from hallerite April 21, 2026 03:13

rasdani merged commit 58b164e into main Apr 21, 2026
6 checks passed

snimu mentioned this pull request Apr 22, 2026

chore: v0.1.13.dev4 dev release #1227

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: revert agent test edits at grading (swe_lego, swe_rebench_v2)#1212

feat: revert agent test edits at grading (swe_lego, swe_rebench_v2)#1212
rasdani merged 2 commits intomainfrom
feat/swe-revert-test-patch

rasdani commented Apr 20, 2026 •

edited

Loading

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

rasdani commented Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why

Scope

Test plan

Static checks

Live e2e (against real Prime sandboxes)

Known limitations

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

rasdani commented Apr 20, 2026 •

edited

Loading