
Run SWE-Lego eval via dataset's canonical test_cmd #1205

Merged

rasdani merged 4 commits into main from swe-lego-upstream-flags on Apr 20, 2026

Conversation

@hallerite (Member) commented Apr 20, 2026

Summary

Switches SWELegoTaskSet to run the per-row info['test_cmd'] that
SWE-Lego-Real-Data ships — a pytest invocation pointing at the whole
test FILE with the flags upstream's eval uses — instead of our
hand-rolled python -m pytest -x --tb=short <F2P ids>; <P2P ids>.
Scoring then parses -rA output for the specific F2P/P2P ids rather
than trusting pytest's overall exit code.

Why

Validating the 4432 rows in PrimeIntellect/SWE-Lego-Real-Data at
gold-patch surfaced 187 false negatives. Root-cause analysis (split
F2P/P2P tails, then the upstream-style probe on all 187 + 30 passing
controls) traced them to four mechanisms:

  • --cov-fail-under=N in pyproject.toml → pytest exits non-zero
    even when every scored test passes, unless --cov=pkg is supplied
    (captured in test_cmd).
  • filterwarnings = error in conftest → a PytestDeprecationWarning
    flips into a failure unless -W ignore::DeprecationWarning is present
    (captured in test_cmd).
  • Module-scoped fixtures that only run when the whole file is
    collected → running a specific id skips them, showing up as a
    spurious gold-patch regression.
  • Parametrize ids with whitespace/special chars that are
    unparseable as CLI args — upstream runs the whole file so id syntax
    is moot.

Running the dataset's test_cmd verbatim and parsing -rA outcomes
recovers 138 / 187 (74%) rows with zero regressions on 30
known-pass controls.
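
The parse-then-score flow described above can be sketched as follows. The helper names echo the PR's _parse_outcomes / _calculate_reward, but this is an illustration of the approach, not the repo's actual implementation:

```python
# Sketch: turn pytest -rA short-summary lines into {test_id: outcome},
# then score 1.0 only if every required F2P/P2P id PASSED.
STATUSES = ("PASSED", "FAILED", "ERROR", "XFAIL", "XPASS")

def parse_outcomes(output: str) -> dict:
    """Map test id -> outcome from pytest's -rA short-summary lines."""
    outcomes = {}
    for line in output.splitlines():
        for status in STATUSES:
            if line.startswith(status + " "):
                # FAILED/ERROR lines append " - <reason>"; keep only the id
                test_id = line[len(status) + 1:].split(" - ", 1)[0].rstrip()
                outcomes[test_id] = status
                break
    return outcomes

def calculate_reward(output: str, f2p: list, p2p: list) -> float:
    """Reward 1.0 only if every required F2P/P2P id shows up as PASSED."""
    outcomes = parse_outcomes(output)
    return 1.0 if all(outcomes.get(t) == "PASSED" for t in f2p + p2p) else 0.0
```

Because scoring keys off the per-id outcomes, a non-zero exit from --cov-fail-under or filterwarnings = error no longer matters as long as the required ids themselves passed.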

Changes

File: verifiers/envs/experimental/composable/tasksets/swe/swe_lego.py
Change: _build_eval_script now wraps info['test_cmd']; _run_tests passes
test_cmd through; new _parse_outcomes helper; _calculate_reward scores via
parsed F2P/P2P outcomes; class + function docstrings updated

No changes to setup() (still applies test_patch), _apply_gold_patch,
or the class's public API. Agent rollouts and gold-patch validation
both route through the same _run_tests + _calculate_reward — no
behavioral divergence between the two paths.
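
The wrapper behavior can be sketched roughly as below. This is a hypothetical shape — the real _build_eval_script lives in swe_lego.py, and the /testbed working directory and function signature here are assumptions for illustration:

```python
# Hedged sketch: run the dataset's canonical command verbatim inside the
# sandbox; a row without test_cmd is a hard error rather than a silent 0.
def build_eval_script(info: dict) -> str:
    test_cmd = info.get("test_cmd")
    if not test_cmd:
        raise ValueError("row is missing info['test_cmd']")
    # The command already carries upstream's flags (-rA, -p no:cacheprovider,
    # -W ignore::DeprecationWarning, sometimes --cov=pkg); don't re-derive them.
    return f"cd /testbed && {test_cmd}"
```

The key point is that the command string is passed through untouched, so per-repo quirks captured upstream (coverage flags, warning filters) travel with the row.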

Validation

  • 8-row smoke test: 4 always-pass controls, 2 D2-rescued
    (adamchainz__apig-wsgi-80 coverage_threshold,
    Stranger6667__postmarker-125 pytest_error), 2 genuine F2P failures
    (msgpack__msgpack-python-229,
    marcosschroh__dataclasses-avroschema-724). 8/8 match expected
    outcome.
  • 127-row bulk probe (upstream-style eval on 97 prior failures +
    30 random passing controls): 30/30 controls pass, 49/97 prior
    failures rescued.
  • Combined with a minimal intermediate flag-fix (drop -x, add
    -p no:cacheprovider -W ignore::DeprecationWarning, LANG=C.UTF-8)
    that was tested independently and recovered an additional 90 rows,
    the dataset's clean set grows from 4245 / 4432 (95.78%) to 4383 /
    4432 (98.9%). The remaining ~1% are genuine failures (real F2P
    fails, real P2P regressions, or rows with truncated parametrize IDs
    that no strategy can recover).
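
The recovery arithmetic above checks out, spelled out:

```python
# Sanity check of the figures quoted in the Validation bullets.
total = 4432
prior_clean = 4245                     # 95.78% clean before this PR
false_negatives = total - prior_clean  # the rows investigated at gold-patch
rescued = 138                          # canonical test_cmd + flag-fix, combined
new_clean = prior_clean + rescued

print(false_negatives)                         # 187
print(round(rescued / false_negatives * 100))  # 74 (%)
print(round(new_clean / total * 100, 1))       # 98.9 (%)
```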

Test plan

  • 8-row smoke test (scripts/d2_smoke_test.py locally)
  • 127-row bulk probe (scripts/d2_bulk_probe.py locally)
  • Full 4432-row revalidate — I'll run after merge and publish the
    updated allow-list as a validated split on the dataset repo.
  • CI

🤖 Generated with Claude Code


Note

Medium Risk
Changes SWE-Lego test execution and scoring logic, which can affect benchmark pass/fail outcomes and depends on each row's test_cmd including -rA so that pytest emits the short-summary lines scoring parses. Failure modes are mostly mis-scoring (0 reward) rather than infrastructure breakage.

Overview
Switches SWE-Lego evaluation to run the dataset’s canonical per-row test_cmd (whole-file pytest invocation with upstream flags) instead of constructing pytest runs from FAIL_TO_PASS/PASS_TO_PASS IDs.

Reward calculation now parses pytest -rA short-summary lines (via the new _parse_outcomes regex) and scores based on whether all required F2P/P2P test IDs are PASSED, rather than trusting pytest's overall exit code; rows missing test_cmd now error, and missing or unparseable outcomes are surfaced via warnings.

Reviewed by Cursor Bugbot for commit ed37166.

Switches `SWELegoTaskSet._run_tests` to execute `info['test_cmd']` —
SWE-Lego-Real-Data ships a per-row pytest invocation pointing at the
whole test FILE with the flags upstream's eval uses (`LANG=C.UTF-8`,
`-p no:cacheprovider`, `-W ignore::DeprecationWarning`, sometimes
`--cov=pkg`). Scoring then parses pytest `-rA` outcomes for the specific
FAIL_TO_PASS / PASS_TO_PASS ids instead of trusting pytest's overall
exit code.

Why the change: the prior implementation hand-rolled a pytest call with
`-x --tb=short` and passed F2P / P2P test ids directly. Validating the
4432 resolved rows at gold-patch surfaced 187 false negatives whose
root cause is one of:

  * a repo-wide `[tool.pytest.ini_options] addopts = "--cov-fail-under=N"`
    → pytest exits non-zero even when every scored test passes unless
    `--cov=pkg` is supplied (captured in test_cmd),
  * conftest-level `filterwarnings = error` that flips a
    `PytestDeprecationWarning` into a failure unless
    `-W ignore::DeprecationWarning` is present (captured in test_cmd),
  * module-scoped fixtures that only run when the whole file is
    collected (running a specific id skips them and masquerades as a
    gold-patch regression),
  * parametrize ids with whitespace/special chars that are unparseable
    as CLI args (upstream runs the whole file so id syntax is moot).

Running the whole file as upstream does, then checking F2P/P2P
outcomes via parsed -rA lines (``_parse_outcomes``), recovers 138 /
187 rows (74%) while not regressing any of 30 known-pass controls.

Validated on:
  * 8-row labeled smoke test (4 always-pass controls, 2 D2-rescued,
    2 genuine F2P failures) — 8/8 match expectation.
  * 127-row bulk probe (97 prior failures + 30 controls) run outside
    the taskset — 30/30 controls pass, 49/97 prior failures rescued.

Tested rubric scoring via `TaskSet.validate()` — both validation
and agent rollouts go through the same `_run_tests` + `_calculate_reward`
path so behavior is consistent.
Comment thread on verifiers/envs/experimental/composable/tasksets/swe/swe_lego.py
hallerite and others added 2 commits April 20, 2026 14:08

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).


Reviewed by Cursor Bugbot for commit 3d69b08.

Comment thread on verifiers/envs/experimental/composable/tasksets/swe/swe_lego.py
Bugbot findings on #1205:

1. `_OUTCOME_LINE_RE` used `\S+` which stops at the first whitespace.
   Pytest parametrized ids like `test_dits[TypeMismatch, List var]`
   were silently truncated, so `outcomes.get(full_id)` returned None
   and the test scored 0.0 even when it passed. Switch to non-greedy
   `.+?` with an optional ` - <reason>` tail (for FAILED/ERROR/XFAIL)
   and strip trailing whitespace on the captured id.

2. If a row's `test_cmd` happens to lack `-rA`, `_parse_outcomes`
   returns `{}` and reward was silently 0.0 with no warning. Log a
   clear warning at that site pointing at the likely cause so bad
   rows are visible.

Also dropped SKIPPED from the regex alternatives: pytest's -rA format
for skips is `SKIPPED [N] <file>:<line>: <reason>` — no test id to
match against F2P/P2P anyway, and a skipped required-test correctly
scores 0 via 'no PASSED entry'.
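
Finding 1 can be shown concretely. The pattern name _OUTCOME_LINE_RE comes from the discussion above; the exact alternatives below are a sketch, not the merged code:

```python
import re

# Before: \S+ stops at the first whitespace inside a parametrize id.
BUGGY_RE = re.compile(r"^(PASSED|FAILED|ERROR|XFAIL|XPASS) (\S+)")
# After: non-greedy .+? with an optional " - <reason>" tail, id stripped.
FIXED_RE = re.compile(r"^(PASSED|FAILED|ERROR|XFAIL|XPASS) (.+?)(?: - .+)?$")

line = "PASSED tests/test_core.py::test_dits[TypeMismatch, List var]"

truncated = BUGGY_RE.match(line).group(2)        # id cut off at the comma
full = FIXED_RE.match(line).group(2).rstrip()    # full parametrize id kept
```

With the buggy pattern, `outcomes.get(full_id)` returns None for any id containing whitespace, so a passing test silently scores 0.0; the fixed pattern also still strips the " - <reason>" tail that FAILED/ERROR lines append.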
@rasdani rasdani merged commit 6d251c7 into main Apr 20, 2026
6 checks passed
@hallerite hallerite deleted the swe-lego-upstream-flags branch April 20, 2026 21:56
