Run SWE-Lego eval via dataset's canonical `test_cmd` (#1205)
Merged
Switches `SWELegoTaskSet._run_tests` to execute `info['test_cmd']` —
SWE-Lego-Real-Data ships a per-row pytest invocation pointing at the
whole test FILE with the flags upstream's eval uses (`LANG=C.UTF-8`,
`-p no:cacheprovider`, `-W ignore::DeprecationWarning`, sometimes
`--cov=pkg`). Scoring then parses pytest `-rA` outcomes for the specific
FAIL_TO_PASS / PASS_TO_PASS ids instead of trusting pytest's overall
exit code.
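Concretely, a per-row `test_cmd` might look something like the following. This is a hypothetical shape assembled from the flags listed above; the package name and test-file path are illustrative, not taken from the dataset:

```shell
LANG=C.UTF-8 python -m pytest -rA -p no:cacheprovider \
  -W ignore::DeprecationWarning --cov=pkg tests/test_module.py
```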
Why the change: the prior implementation hand-rolled a pytest call with
`-x --tb=short` and passed F2P / P2P test ids directly. Validating the
4432 resolved rows at gold-patch surfaced 187 false negatives whose
root cause is one of:
* a repo-wide `[tool.pytest.ini_options] addopts = "--cov-fail-under=N"`
→ pytest exits non-zero even when every scored test passes unless
`--cov=pkg` is supplied (captured in test_cmd),
* conftest-level `filterwarnings = error` that flips a
`PytestDeprecationWarning` into a failure unless
`-W ignore::DeprecationWarning` is present (captured in test_cmd),
* module-scoped fixtures that only run when the whole file is
collected (running a specific id skips them and masquerades as a
gold-patch regression),
* parametrize ids with whitespace/special chars that are unparseable
as CLI args (upstream runs the whole file so id syntax is moot).
Running the whole file as upstream does, then checking F2P/P2P
outcomes via parsed `-rA` lines (`_parse_outcomes`), recovers 138 /
187 rows (74%) without regressing any of the 30 known-pass controls.
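As a rough sketch of what parsing `-rA` short-summary lines can look like, here is a hypothetical stand-in for a `_parse_outcomes`-style helper (the regex and function names are illustrative, not the PR's actual code; it assumes lines like `PASSED path::test_id` with an optional ` - <reason>` tail):

```python
import re

# Matches pytest -rA short-summary lines, e.g.
#   PASSED tests/test_foo.py::test_bar[case one]
#   FAILED tests/test_foo.py::test_baz - AssertionError: boom
# Non-greedy id capture plus an optional " - <reason>" tail keeps
# whitespace inside parametrize ids intact.
_OUTCOME_LINE_RE = re.compile(
    r"^(PASSED|FAILED|ERROR|XFAIL|XPASS)\s+(.+?)(?:\s+-\s+.*)?$"
)

def parse_outcomes(output: str) -> dict[str, str]:
    """Map test id -> outcome from pytest -rA short-summary lines."""
    outcomes: dict[str, str] = {}
    for line in output.splitlines():
        m = _OUTCOME_LINE_RE.match(line.strip())
        if m:
            outcomes[m.group(2).rstrip()] = m.group(1)
    return outcomes
```

Scoring can then look up each F2P/P2P id in the returned dict instead of inspecting pytest's exit code.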
Validated on:
* 8-row labeled smoke test (4 always-pass controls, 2 D2-rescued,
2 genuine F2P failures) — 8/8 match expectation.
* 127-row bulk probe (97 prior failures + 30 controls) run outside
the taskset — 30/30 controls pass, 49/97 prior failures rescued.
Tested rubric scoring via `TaskSet.validate()` — both validation
and agent rollouts go through the same `_run_tests` + `_calculate_reward`
path so behavior is consistent.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit 3d69b08.
Bugbot findings on #1205:

1. `_OUTCOME_LINE_RE` used `\S+`, which stops at the first whitespace. Pytest parametrized ids like `test_dits[TypeMismatch, List var]` were silently truncated, so `outcomes.get(full_id)` returned `None` and the test scored 0.0 even when it passed. Switch to non-greedy `.+?` with an optional ` - <reason>` tail (for FAILED/ERROR/XFAIL) and strip trailing whitespace on the captured id.
2. If a row's `test_cmd` happens to lack `-rA`, `_parse_outcomes` returns `{}` and the reward was silently 0.0 with no warning. Log a clear warning at that site pointing at the likely cause so bad rows are visible.

Also dropped SKIPPED from the regex alternatives: pytest's `-rA` format for skips is `SKIPPED [N] <file>:<line>: <reason>`, with no test id to match against F2P/P2P anyway, and a skipped required test correctly scores 0 via "no PASSED entry".
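The id-truncation finding is easy to demonstrate. The snippet below contrasts a simplified version of the old `\S+`-based pattern with the suggested non-greedy one; both regexes here are illustrative reconstructions, not the module's exact source:

```python
import re

line = "PASSED tests/test_render.py::test_dits[TypeMismatch, List var]"

# Old-style pattern: \S+ stops at the first space inside the parametrize id.
old = re.match(r"^(PASSED|FAILED|ERROR)\s+(\S+)", line)

# Suggested pattern: non-greedy id capture plus an optional " - <reason>" tail.
new = re.match(r"^(PASSED|FAILED|ERROR)\s+(.+?)(?:\s+-\s+.*)?$", line)

print(old.group(2))  # truncated at the comma's trailing space
print(new.group(2))  # full parametrized id
```

With the old pattern the captured id never equals the dataset's full F2P id, so the lookup misses and the row scores 0.0 despite passing.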

Summary

Switches `SWELegoTaskSet` to run the per-row `info['test_cmd']` that
SWE-Lego-Real-Data ships — a pytest invocation pointing at the whole
test FILE with the flags upstream's eval uses — instead of our
hand-rolled `python -m pytest -x --tb=short <F2P ids>; <P2P ids>`.
Scoring then parses `-rA` output for the specific F2P/P2P ids rather
than trusting pytest's overall exit code.
Why

Validating the 4432 rows in PrimeIntellect/SWE-Lego-Real-Data at
gold-patch surfaced 187 false negatives. Root-cause analysis (split
F2P/P2P tails, then the upstream-style probe on all 187 + 30 passing
controls) traced them to four mechanisms:

* `--cov-fail-under=N` in `pyproject.toml` → pytest exits non-zero
  even when every scored test passed, unless `--cov=pkg` is supplied
  (captured in `test_cmd`).
* `filterwarnings = error` in conftest → a `PytestDeprecationWarning`
  flips into a failure unless `-W ignore::DeprecationWarning` is
  present (captured in `test_cmd`).
* Module-scoped fixtures that only run when the whole file is
  collected → running a specific id skips them, showing up as a
  spurious gold-patch regression.
* Parametrize ids with whitespace/special characters that are
  unparseable as CLI args — upstream runs the whole file so id syntax
  is moot.
Running the dataset's `test_cmd` verbatim and parsing `-rA` outcomes
recovers 138 / 187 rows (74%) with zero regressions on the 30
known-pass controls.
Changes

`verifiers/envs/experimental/composable/tasksets/swe/swe_lego.py`:
`_build_eval_script` now wraps `info['test_cmd']`; `_run_tests` passes
`test_cmd` through; new `_parse_outcomes` helper; `_calculate_reward`
scores via parsed F2P/P2P outcomes; class + function docstrings updated.

No changes to `setup()` (still applies `test_patch`),
`_apply_gold_patch`, or the class's public API. Agent rollouts and
gold-patch validation both route through the same `_run_tests` +
`_calculate_reward`, so there is no behavioral divergence between the
two paths.
Validation

* 8-row labeled smoke test: 4 always-pass controls, 2 D2-rescued rows
  (`adamchainz__apig-wsgi-80` coverage_threshold,
  `Stranger6667__postmarker-125` pytest_error), and 2 genuine F2P
  failures (`msgpack__msgpack-python-229`,
  `marcosschroh__dataclasses-avroschema-724`). 8/8 match the expected
  outcome.
* 127-row bulk probe (97 prior failures + 30 random passing
  controls): 30/30 controls pass, 49/97 prior failures rescued.
* Combined with an independently tested flag-level strategy (drop
  `-x`, add `-p no:cacheprovider -W ignore::DeprecationWarning`,
  `LANG=C.UTF-8`) that recovered an additional 90 rows, the dataset's
  clean set grows from 4245 / 4432 (95.78%) to 4383 / 4432 (98.9%).
  The remaining ~1% are genuine failures (real F2P fails, real P2P
  regressions, or rows with truncated parametrize ids that no strategy
  can recover).
Test plan

* 8-row smoke test (`scripts/d2_smoke_test.py`, run locally)
* 127-row bulk probe (`scripts/d2_bulk_probe.py`, run locally)
* Updated allow-list published as a `validated` split on the dataset
  repo.

🤖 Generated with Claude Code
Note

Medium Risk: changes SWE-Lego test execution and scoring logic, which
can affect benchmark pass/fail outcomes and may depend on `test_cmd`
emitting `-rA` summary lines. Failure modes are mostly mis-scoring
(0 reward) rather than infrastructure breakage.

Overview

Switches SWE-Lego evaluation to run the dataset's canonical per-row
`test_cmd` (whole-file pytest invocation with upstream flags) instead
of constructing pytest runs from `FAIL_TO_PASS`/`PASS_TO_PASS` ids.
Reward calculation now parses pytest `-rA` short-summary lines (via
the new `_parse_outcomes` regex) and scores based on whether all
required F2P/P2P test ids are `PASSED`, rather than trusting pytest's
overall exit code; rows missing `test_cmd` now error, and
missing/unparseable outcomes are surfaced via warnings.

Reviewed by Cursor Bugbot for commit ed37166.
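The scoring step described above can be sketched as follows. This is a minimal hypothetical stand-in for a `_calculate_reward`-style function, assuming a dict of parsed outcomes keyed by test id (the function name and warning text are illustrative, not the PR's code):

```python
def calculate_reward(outcomes: dict[str, str],
                     fail_to_pass: list[str],
                     pass_to_pass: list[str]) -> float:
    """Return 1.0 only if every required F2P and P2P id is PASSED."""
    if not outcomes:
        # Likely cause: the row's test_cmd lacked -rA, so there were no
        # short-summary lines to parse. Surface it instead of failing silently.
        print("warning: no -rA outcomes parsed; reward will be 0.0")
        return 0.0
    required = list(fail_to_pass) + list(pass_to_pass)
    # A missing id (e.g. skipped, or truncated by the parser) scores 0
    # via "no PASSED entry", the same as an explicit FAILED.
    all_passed = all(outcomes.get(tid) == "PASSED" for tid in required)
    return 1.0 if all_passed else 0.0
```

Binary all-or-nothing scoring matches the "every required id must pass" rule stated above; a partial-credit variant would instead average per-id results.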