Run SWE-Lego eval via dataset's canonical `test_cmd` (#1205)
Merged
Switches `SWELegoTaskSet._run_tests` to execute `info['test_cmd']` —
SWE-Lego-Real-Data ships a per-row pytest invocation pointing at the
whole test FILE with the flags upstream's eval uses (`LANG=C.UTF-8`,
`-p no:cacheprovider`, `-W ignore::DeprecationWarning`, sometimes
`--cov=pkg`). Scoring then parses pytest `-rA` outcomes for the specific
FAIL_TO_PASS / PASS_TO_PASS ids instead of trusting pytest's overall
exit code.
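Concretely, a per-row `test_cmd` might look something like the following. This is a hypothetical shape assembled from the flags listed above; the package name and test-file path are illustrative, not taken from the dataset:

```shell
LANG=C.UTF-8 python -m pytest -rA -p no:cacheprovider \
  -W ignore::DeprecationWarning --cov=pkg tests/test_module.py
```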
Why the change: the prior implementation hand-rolled a pytest call with
`-x --tb=short` and passed F2P / P2P test ids directly. Validating the
4432 resolved rows at gold-patch surfaced 187 false negatives whose
root cause is one of:
* a repo-wide `[tool.pytest.ini_options] addopts = "--cov-fail-under=N"`
→ pytest exits non-zero even when every scored test passes unless
`--cov=pkg` is supplied (captured in test_cmd),
* conftest-level `filterwarnings = error` that flips a
`PytestDeprecationWarning` into a failure unless
`-W ignore::DeprecationWarning` is present (captured in test_cmd),
* module-scoped fixtures that only run when the whole file is
collected (running a specific id skips them and masquerades as a
gold-patch regression),
* parametrize ids with whitespace/special chars that are unparseable
as CLI args (upstream runs the whole file so id syntax is moot).
Running the whole file as upstream does, then checking F2P/P2P
outcomes via parsed `-rA` lines (`_parse_outcomes`), recovers 138 /
187 rows (74%) without regressing any of the 30 known-pass controls.
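As a rough sketch of what parsing `-rA` short-summary lines can look like, here is a hypothetical stand-in for a `_parse_outcomes`-style helper (the regex and function names are illustrative, not the PR's actual code; it assumes lines like `PASSED path::test_id` with an optional ` - <reason>` tail):

```python
import re

# Matches pytest -rA short-summary lines, e.g.
#   PASSED tests/test_foo.py::test_bar[case one]
#   FAILED tests/test_foo.py::test_baz - AssertionError: boom
# Non-greedy id capture plus an optional " - <reason>" tail keeps
# whitespace inside parametrize ids intact.
_OUTCOME_LINE_RE = re.compile(
    r"^(PASSED|FAILED|ERROR|XFAIL|XPASS)\s+(.+?)(?:\s+-\s+.*)?$"
)

def parse_outcomes(output: str) -> dict[str, str]:
    """Map test id -> outcome from pytest -rA short-summary lines."""
    outcomes: dict[str, str] = {}
    for line in output.splitlines():
        m = _OUTCOME_LINE_RE.match(line.strip())
        if m:
            outcomes[m.group(2).rstrip()] = m.group(1)
    return outcomes
```

Scoring can then look up each F2P/P2P id in the returned dict instead of inspecting pytest's exit code.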
Validated on:
* 8-row labeled smoke test (4 always-pass controls, 2 D2-rescued,
2 genuine F2P failures) — 8/8 match expectation.
* 127-row bulk probe (97 prior failures + 30 controls) run outside
the taskset — 30/30 controls pass, 49/97 prior failures rescued.
Tested rubric scoring via `TaskSet.validate()` — both validation
and agent rollouts go through the same `_run_tests` + `_calculate_reward`
path so behavior is consistent.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit 3d69b08.
Bugbot findings on #1205:

1. `_OUTCOME_LINE_RE` used `\S+`, which stops at the first whitespace. Pytest parametrized ids like `test_dits[TypeMismatch, List var]` were silently truncated, so `outcomes.get(full_id)` returned `None` and the test scored 0.0 even when it passed. Switch to non-greedy `.+?` with an optional ` - <reason>` tail (for FAILED/ERROR/XFAIL) and strip trailing whitespace on the captured id.
2. If a row's `test_cmd` happens to lack `-rA`, `_parse_outcomes` returns `{}` and the reward was silently 0.0 with no warning. Log a clear warning at that site pointing at the likely cause so bad rows are visible.

Also dropped SKIPPED from the regex alternatives: pytest's `-rA` format for skips is `SKIPPED [N] <file>:<line>: <reason>`, with no test id to match against F2P/P2P anyway, and a skipped required test correctly scores 0 via "no PASSED entry".
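The id-truncation finding is easy to demonstrate. The snippet below contrasts a simplified version of the old `\S+`-based pattern with the suggested non-greedy one; both regexes here are illustrative reconstructions, not the module's exact source:

```python
import re

line = "PASSED tests/test_render.py::test_dits[TypeMismatch, List var]"

# Old-style pattern: \S+ stops at the first space inside the parametrize id.
old = re.match(r"^(PASSED|FAILED|ERROR)\s+(\S+)", line)

# Suggested pattern: non-greedy id capture plus an optional " - <reason>" tail.
new = re.match(r"^(PASSED|FAILED|ERROR)\s+(.+?)(?:\s+-\s+.*)?$", line)

print(old.group(2))  # truncated at the comma's trailing space
print(new.group(2))  # full parametrized id
```

With the old pattern the captured id never equals the dataset's full F2P id, so the lookup misses and the row scores 0.0 despite passing.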

Summary

Switches `SWELegoTaskSet` to run the per-row `info['test_cmd']` that
SWE-Lego-Real-Data ships — a pytest invocation pointing at the whole
test FILE with the flags upstream's eval uses — instead of our
hand-rolled `python -m pytest -x --tb=short <F2P ids>; <P2P ids>`.
Scoring then parses `-rA` output for the specific F2P/P2P ids rather
than trusting pytest's overall exit code.
Why

Validating the 4432 rows in PrimeIntellect/SWE-Lego-Real-Data at
gold-patch surfaced 187 false negatives. Root-cause analysis (split
F2P/P2P tails, then the upstream-style probe on all 187 + 30 passing
controls) traced them to four mechanisms:

* `--cov-fail-under=N` in `pyproject.toml` → pytest exits non-zero
  even when every scored test passed, unless `--cov=pkg` is supplied
  (captured in `test_cmd`).
* `filterwarnings = error` in conftest → a `PytestDeprecationWarning`
  flips into a failure unless `-W ignore::DeprecationWarning` is
  present (captured in `test_cmd`).
* Module-scoped fixtures that only run when the whole file is
  collected → running a specific id skips them, showing up as a
  spurious gold-patch regression.
* Parametrize ids with whitespace/special characters that are
  unparseable as CLI args — upstream runs the whole file so id syntax
  is moot.
Running the dataset's `test_cmd` verbatim and parsing `-rA` outcomes
recovers 138 / 187 rows (74%) with zero regressions on the 30
known-pass controls.
Changes

`verifiers/envs/experimental/composable/tasksets/swe/swe_lego.py`:
`_build_eval_script` now wraps `info['test_cmd']`; `_run_tests` passes
`test_cmd` through; new `_parse_outcomes` helper; `_calculate_reward`
scores via parsed F2P/P2P outcomes; class + function docstrings updated.

No changes to `setup()` (still applies `test_patch`),
`_apply_gold_patch`, or the class's public API. Agent rollouts and
gold-patch validation both route through the same `_run_tests` +
`_calculate_reward`, so there is no behavioral divergence between the
two paths.
Validation

* 8-row labeled smoke test: 4 always-pass controls, 2 D2-rescued rows
  (`adamchainz__apig-wsgi-80` coverage_threshold,
  `Stranger6667__postmarker-125` pytest_error), and 2 genuine F2P
  failures (`msgpack__msgpack-python-229`,
  `marcosschroh__dataclasses-avroschema-724`). 8/8 match the expected
  outcome.
* 127-row bulk probe (97 prior failures + 30 random passing
  controls): 30/30 controls pass, 49/97 prior failures rescued.
* Combined with an independently tested flag-level strategy (drop
  `-x`, add `-p no:cacheprovider -W ignore::DeprecationWarning`,
  `LANG=C.UTF-8`) that recovered an additional 90 rows, the dataset's
  clean set grows from 4245 / 4432 (95.78%) to 4383 / 4432 (98.9%).
  The remaining ~1% are genuine failures (real F2P fails, real P2P
  regressions, or rows with truncated parametrize ids that no strategy
  can recover).
Test plan

* 8-row smoke test (`scripts/d2_smoke_test.py`, run locally)
* 127-row bulk probe (`scripts/d2_bulk_probe.py`, run locally)
* Updated allow-list published as a `validated` split on the dataset
  repo.

🤖 Generated with Claude Code
Note

Medium Risk: changes SWE-Lego test execution and scoring logic, which
can affect benchmark pass/fail outcomes and may depend on `test_cmd`
emitting `-rA` summary lines. Failure modes are mostly mis-scoring
(0 reward) rather than infrastructure breakage.

Overview

Switches SWE-Lego evaluation to run the dataset's canonical per-row
`test_cmd` (whole-file pytest invocation with upstream flags) instead
of constructing pytest runs from `FAIL_TO_PASS`/`PASS_TO_PASS` ids.
Reward calculation now parses pytest `-rA` short-summary lines (via
the new `_parse_outcomes` regex) and scores based on whether all
required F2P/P2P test ids are `PASSED`, rather than trusting pytest's
overall exit code; rows missing `test_cmd` now error, and
missing/unparseable outcomes are surfaced via warnings.

Reviewed by Cursor Bugbot for commit ed37166.
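The scoring step described above can be sketched as follows. This is a minimal hypothetical stand-in for a `_calculate_reward`-style function, assuming a dict of parsed outcomes keyed by test id (the function name and warning text are illustrative, not the PR's code):

```python
def calculate_reward(outcomes: dict[str, str],
                     fail_to_pass: list[str],
                     pass_to_pass: list[str]) -> float:
    """Return 1.0 only if every required F2P and P2P id is PASSED."""
    if not outcomes:
        # Likely cause: the row's test_cmd lacked -rA, so there were no
        # short-summary lines to parse. Surface it instead of failing silently.
        print("warning: no -rA outcomes parsed; reward will be 0.0")
        return 0.0
    required = list(fail_to_pass) + list(pass_to_pass)
    # A missing id (e.g. skipped, or truncated by the parser) scores 0
    # via "no PASSED entry", the same as an explicit FAILED.
    all_passed = all(outcomes.get(tid) == "PASSED" for tid in required)
    return 1.0 if all_passed else 0.0
```

Binary all-or-nothing scoring matches the "every required id must pass" rule stated above; a partial-credit variant would instead average per-id results.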