feat: revert agent test edits at grading (swe_lego, swe_rebench_v2) #1212
Merged
Mirror upstream SWE-bench's pre-grading dance so agents can't reward-hack by weakening FAIL_TO_PASS assertions mid-rollout. Before running the test command: git-checkout HEAD on any test files the agent modified, rm -f any newly-added test files, then re-apply test_patch cleanly. Agent source edits (their actual fix) are untouched. Adds a small helper module with two unified-diff parsers (get_modified_files, get_new_files) and an async revert/reapply that delegates the final re-apply to the taskset's own _apply_patch_file so the taskset-specific git apply flags are preserved.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 578cd8b.
Bugbot catch on #1212: `git checkout HEAD -- <test_file>` is unsafe if an agent runs `git add && git commit` mid-rollout — HEAD then points at the agent's commit, so "revert" would restore the agent's (potentially tampered) test version instead of the pristine base state. Re-apply afterward might then conflict or 3-way-merge against the tampered content, leaving the reward-hack loophole this PR exists to close partially open. Thread `base_commit` through `revert_and_reapply_test_patch` and use `git checkout <base_commit> -- <path>`. Matches upstream SWE-bench (swebench/harness/test_spec/python.py#L420). Both call-sites guard on `base_commit` being non-empty before invoking the helper. Includes docstring/comment updates pointing at the `HEAD` → `base_commit` rationale so the reasoning stays with the code.
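To make the `HEAD` versus `base_commit` point concrete, here is a small sketch of how the revert commands could be assembled. `build_revert_commands` is a hypothetical name, and raising on an empty `base_commit` is an assumption for the example; in the PR as described, the call-sites do that guarding before invoking the helper.

```python
import shlex
from typing import List


def build_revert_commands(base_commit: str, modified: List[str], new: List[str]) -> List[str]:
    """Build shell commands that restore test files to their pristine state.

    Hypothetical helper for illustration. The checkout target is the task's
    base commit rather than HEAD, so a commit made by the agent mid-rollout
    cannot change which version of the test file is restored.
    """
    if not base_commit:
        # Assumption: fail loudly instead of silently falling back to HEAD.
        raise ValueError("base_commit is required to revert test files safely")

    commands = []
    if modified:
        paths = " ".join(shlex.quote(p) for p in modified)
        commands.append(f"git checkout {shlex.quote(base_commit)} -- {paths}")
    if new:
        paths = " ".join(shlex.quote(p) for p in new)
        commands.append(f"rm -f {paths}")
    return commands
```

With `HEAD` in place of `base_commit`, the first command would restore whatever the agent last committed, which is exactly the tampered state the revert is meant to discard.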

Summary
Both `swe_lego` and `swe_rebench_v2` apply `test_patch` in `setup()` so agents can read the failing tests from t=0, but then run `test_cmd` at grading against whatever state the working tree is in. This PR adds the canonical SWE-bench pre-grading dance (revert agent edits to test files, then re-apply `test_patch` cleanly), closing a reward-hack loophole where an agent could weaken FAIL_TO_PASS assertions mid-rollout and still score reward=1. Agent source edits (their actual fix) survive untouched; only test-file bits get canonicalized.
Why
Upstream SWE-bench's harness does this exact two-step before running the test command (see `swebench/harness/test_spec/python.py#L405-L462` and `swebench/harness/utils.py::get_modified_files` / `get_new_files`). Our wrappers previously applied `test_patch` at setup-time only, so any modification the agent made to the test file in between stuck around for grading. This PR ports the upstream pattern.
Scope
- New helper module `verifiers/envs/experimental/composable/tasksets/swe/_test_patch.py` with `get_modified_files`, `get_new_files`, and `revert_and_reapply_test_patch`. The async helper does `git checkout <base_commit> -- <modified>`, `rm -f <new>`, then delegates the re-apply to the taskset's own `_apply_patch_file` so the taskset-specific `git apply` flags (different between swe_lego and swe_rebench_v2) are preserved. Uses `base_commit` (not `HEAD`) so a `git commit` by the agent can't shift the checkout target onto their tampered snapshot.
- `swe_lego.py::_run_tests` calls the helper at the top of the method, before any existing logic.
- `swe_rebench_v2.py::_run_tests` calls the helper right after resolving `workdir`.
- The setup-time `test_patch` apply is left in place on both wrappers (idempotent with the grading-time re-apply). No changes to `_calculate_reward`, `setup`, `validate_instance`, `_apply_gold_patch`, or any other taskset.
Test plan
Static checks
- `uv run ruff check verifiers/envs/experimental/composable/tasksets/swe/`: clean.
- `uv run pre-commit run --files <touched>`: clean (ruff-check, ruff-format, sync-AGENTS).
- `uv run python -c ...`: modified-only patch, new-file-only patch, mixed, empty string, quoted paths-with-spaces, multiple modifieds, dedup, realistic SWE-bench-shaped `test_patch`. All 8 cases pass.
Live e2e (against real Prime sandboxes)
Happy-path validate (n=1 per wrapper) confirms the revert-and-reapply doesn't regress the baseline (every `validate_instance` call now runs the helper; a broken helper would turn previously-green gold patches red):

| wrapper | instance | `valid` |
| --- | --- | --- |
| `swe_lego` | `adamchainz__apig-wsgi-80` | |
| `swe_rebench_v2` | `elastic__synthetics-316` | |

Sweep validate (n=4 per wrapper, concurrency=2), a light regression across 8 distinct rows:
| wrapper | result |
| --- | --- |
| `swe_lego` | |
| `swe_rebench_v2` | first run flaked on `crawler-commons__crawler-commons-227` (java, `parse_java_mvn`); a clean rerun on the identical instance list came back 4/4. |

The flake reported `valid=False, error=None` without reproducibility: likely a Maven/JVM first-run flake, with no evidence the revert helper was involved.

Tamper test: the actual guard validation. Bypass gold; tamper test file via AST rewrite (replace every `test_*()` body with `pass`, preserving syntax); call `_run_tests`. Monkey-patched `validate_instance` for orchestration; monkey-patched `revert_and_reapply_test_patch` to no-op for the control case.

`valid` outcome by case (tamper in both: `test_*` bodies replaced with `pass`):
- With the revert helper: reverted `test_apig_wsgi.py` + re-applied `test_patch` → canonical F2P tests ran against buggy code → failed → cheat blocked
- Control (helper no-op'd): `def test_*(): pass` bodies trivially succeeded → cheat succeeded

Both cases parsed 21 outcomes (tests ran cleanly in both: the tamper preserved syntax, only the grading state differed). The reward split is entirely attributable to the revert step: with it, the loophole is structurally closed; without it, the exact reward-hack scenario this PR exists to prevent reproduces end-to-end.
Known limitations
- Git-quoted paths (`git config core.quotepath`) are not fully decoded: we strip surrounding double quotes but leave backslash-escapes alone. SWE-bench `test_patch` fields in practice don't carry such paths; noted inline in the module docstring.
- Only files listed in `test_patch` are reverted. An agent could in theory write/modify a `conftest.py` outside `test_patch`'s file list to affect pytest collection. Upstream has the same gap; closing it is a separate hardening question.
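To illustrate the quoted-path limitation, the sketch below contrasts quote stripping with a full decode of git's C-style quoting. Both functions are hypothetical and not part of this PR; the decoder assumes the path bytes are UTF-8 and covers only the common escapes (three-digit octal bytes plus a handful of single-character escapes).

```python
import re

# Single-character escapes mapped to their byte values (assumed subset).
_SIMPLE = {"a": 7, "b": 8, "t": 9, "n": 10, "v": 11, "f": 12, "r": 13, '"': 34, "\\": 92}
_ESCAPE = re.compile(r"\\([0-3][0-7]{2}|.)")


def _is_quoted(path: str) -> bool:
    return len(path) >= 2 and path[0] == path[-1] == '"'


def strip_quotes_only(path: str) -> str:
    """Strip surrounding double quotes and leave backslash escapes alone."""
    return path[1:-1] if _is_quoted(path) else path


def decode_git_quoted_path(path: str) -> str:
    """Decode a quoted path into its real name (illustrative only)."""
    if not _is_quoted(path):
        return path
    inner = path[1:-1]
    out = bytearray()
    pos = 0
    for match in _ESCAPE.finditer(inner):
        out += inner[pos:match.start()].encode("utf-8")
        token = match.group(1)
        if len(token) == 3:
            out.append(int(token, 8))  # octal escape for one raw byte
        elif token in _SIMPLE:
            out.append(_SIMPLE[token])
        else:
            out += token.encode("utf-8")  # unknown escape: keep the character
        pos = match.end()
    out += inner[pos:].encode("utf-8")
    return out.decode("utf-8", errors="surrogateescape")
```

For a header path such as `"tests/caf\303\251.py"`, stripping quotes leaves the literal backslash sequences in place, so the resulting string would not match the file on disk, whereas the decoder yields `tests/café.py`.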