
feat: add SWE-rebench-V2 TaskSet #1187

Merged
rasdani merged 4 commits into main from feat/swe-rebench-v2-taskset
Apr 20, 2026

Conversation

Contributor

@rasdani rasdani commented Apr 19, 2026

Summary

  • Adds SWERebenchV2TaskSet backend (swerebench-v2) for nebius/SWE-rebench-V2 — 32 079 real multilingual bug-fix instances across 20 languages, each shipping a pre-built docker image, gold patch, test_patch, and an install_config dict pinning per-instance test_cmd + log_parser.
  • Grading mirrors upstream scripts/eval.py: apply patch then test_patch with git apply -v --3way --recount --ignore-space-change --whitespace=nowarn, run the pinned test_cmd, parse output with the parser named by install_config.log_parser, check FAIL_TO_PASS ∪ PASS_TO_PASS ⊆ PASSED with timing-suffix normalization on both sides. Workdir is /{repo-name} (upstream convention), not /testbed.
  • Vendors upstream lib/agent/log_parsers.py as swe_rebench_v2_log_parsers.py (upstream isn't pip-installable). Attribution header + # ruff: noqa on the vendored module; sole edit vs. upstream is inlining TestStatus (replaces the unresolvable from lib.agent.swe_constants import TestStatus). 76 parsers covering all 20 supported languages.
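The resolution check described in the grading bullet can be sketched in isolation (a minimal sketch: `strip_timing_suffix` and its regex are illustrative stand-ins for the backend's normalization helpers, not the actual code):

```python
import re

def strip_timing_suffix(name: str) -> str:
    # Illustrative normalization: drop a trailing "(1.23s)"-style timing
    # annotation so test names compare equal regardless of run duration.
    return re.sub(r"\s*\(\d+(\.\d+)?s\)$", "", name).strip()

def is_resolved(fail_to_pass: list[str], pass_to_pass: list[str],
                passed: list[str]) -> bool:
    # Resolved iff FAIL_TO_PASS ∪ PASS_TO_PASS ⊆ PASSED, with timing
    # suffixes normalized on both sides of the comparison.
    required = {strip_timing_suffix(t) for t in fail_to_pass + pass_to_pass}
    return required <= {strip_timing_suffix(t) for t in passed}
```

Note that normalizing both sides matters: a parser may emit `test_a (0.1s)` while FAIL_TO_PASS records plain `test_a`, or vice versa.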

Files

  • verifiers/envs/experimental/composable/tasksets/swe/swe_rebench_v2.py — class + rubric (new)
  • verifiers/envs/experimental/composable/tasksets/swe/swe_rebench_v2_log_parsers.py — vendored (new)
  • verifiers/envs/experimental/composable/tasksets/swe/swe_tasksets.py — make_swerebench_v2_taskset + swerebench-v2 entry in the make_swe_taskset factory map
  • verifiers/envs/experimental/composable/tasksets/swe/__init__.py, verifiers/envs/experimental/composable/tasksets/__init__.py — re-exports

Drop-in via make_swe_taskset(backend="swerebench-v2") — no research-environments changes needed (rlm-swe picks this up automatically like swebench / swelego-real / etc).
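The drop-in dispatch might look roughly like this (a sketch under assumptions: the stand-in factory below is a placeholder, not the real `make_swerebench_v2_taskset`, and only the one map entry is shown):

```python
from typing import Any, Callable

def make_swerebench_v2_taskset(**kwargs: Any) -> str:
    # Placeholder for the real factory in swe_tasksets.py.
    return "swerebench-v2 taskset"

# Backend-name -> factory map, mirroring the registry pattern the PR extends.
FACTORIES: dict[str, Callable[..., Any]] = {
    "swerebench-v2": make_swerebench_v2_taskset,
}

def make_swe_taskset(backend: str, **kwargs: Any) -> Any:
    # Dispatch on the backend name; unknown names fail loudly.
    try:
        return FACTORIES[backend](**kwargs)
    except KeyError:
        raise ValueError(f"unknown backend: {backend!r}") from None
```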

Test plan

  • uv run ruff check … + uv run pre-commit run --files … — clean (vendored module excluded via # ruff: noqa).

  • Import smoke: dataset loads (32 079 rows), NAME_TO_PARSER registers 76 parsers, factory dispatch works.

  • Live gold-patch validate: `await make_swerebench_v2_taskset().validate(n=1, concurrency=1)` → valid=True, 79s (TS instance, parse_log_js_4).

  • Multi-language validate sweep: validate(n=1) per language, 4 distinct parsers, all green:

    | language | parser | image (first row) | elapsed | valid |
    | --- | --- | --- | --- | --- |
    | python | parse_log_pytest | docker.io/swerebenchv2/wtforms-wtforms:614-848d28d | 41.4s | ✅ |
    | java | parse_java_mvn | docker.io/swerebenchv2/crawler-commons-crawler-commons:227-ab9e33a | 52.1s | ✅ |
    | go | parse_log_gotest | docker.io/swerebenchv2/filecoin-project-specs-actors:592-db0bbaa | 66.4s | ✅ |
    | rust | parse_log_cargo | docker.io/swerebenchv2/dtolnay-cxx:839-c5f472e | 70.8s | ✅ |
  • vf-eval end-to-end, reward=1 achieved: `uv run vf-eval rlm-swe -a '{"task_type":"swerebench-v2"}' -p prime -m openai/gpt-5.4 -n 1 -r 1 -s` completed cleanly with reward=1.000, stop_conditions: agent_completed=1.000, 4m18s, 18 turns / 17 tool calls. Instance: elastic__synthetics-316 (TS, parse_log_js_4).

Notes

  • install_config.install is intentionally not re-run: the published images ship with deps pre-installed, matching upstream's eval.py which only runs test_cmd.
  • language filter takes the dataset's native labels (python, go, rust, java, ts, js, php, kotlin, julia, elixir, scala, swift, dart, c, cpp, csharp, r, clojure, ocaml, lua); omit for the full 32k cross-language mix.
  • rlm-swe host image installs are fixed upstream: PrimeIntellect-ai/rlm#33 + #34 pin UV_INSTALL_DIR / UV_TOOL_BIN_DIR so both uv and its tool shims colocate at $HOME/.local/bin regardless of the image's XDG configuration.

🤖 Generated with Claude Code


Note

Medium Risk
Adds a sizable new sandbox-executed evaluation path that runs per-instance test_cmd and parses diverse test outputs; issues are mostly contained to the new backend but could affect sandbox stability/timeouts.

Overview
Adds a new SWE backend swerebench-v2 backed by the nebius/SWE-rebench-V2 dataset, including a new SWERebenchV2TaskSet + rubric that grades by applying the instance’s patches, running its pinned test_cmd inside the provided Docker image/workdir, and comparing parsed test statuses against FAIL_TO_PASS/PASS_TO_PASS (with timing-suffix normalization).

Wires the backend into make_swe_taskset and re-exports it from the SWE taskset packages. Vendors the upstream SWE-rebench-V2 log parser suite (large swe_rebench_v2_log_parsers.py) to match upstream evaluation behavior.

Reviewed by Cursor Bugbot for commit a17fa55.

Adds `SWERebenchV2TaskSet` backend (`swerebench-v2`) for
nebius/SWE-rebench-V2 — 32k multilingual real bug-fix instances across 20
languages, each shipping a pre-built image, a gold `patch`, a `test_patch`,
and an `install_config` pinning `test_cmd` + `log_parser`.

Grading mirrors upstream `scripts/eval.py`: apply patch + test_patch with
`git apply -v --3way --recount --ignore-space-change --whitespace=nowarn`,
run the pinned `test_cmd`, parse with the named parser, check F2P+P2P ⊆
PASSED (timing suffixes normalized on both sides). Workdir is
`/{repo-name}` to match upstream.

Vendors upstream `lib/agent/log_parsers.py` (SWE-rebench-V2 isn't
pip-installable) as `swe_rebench_v2_log_parsers.py` with attribution
header + `ruff: noqa`; only edit vs. upstream is replacing
`from lib.agent.swe_constants import TestStatus` with an inline enum.

Verified `valid=True` on the first live instance via
`make_swerebench_v2_taskset().validate(n=1, concurrency=1)` (79s).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rasdani rasdani marked this pull request as ready for review April 19, 2026 02:03
@rasdani rasdani requested a review from hallerite April 19, 2026 02:12
Member


can we maybe group everything related to swe_rebench_v2 by putting it in a folder?

Comment thread verifiers/envs/experimental/composable/tasksets/swe/swe_tasksets.py
Picks up #1199 (base-class filter_fn) so super().__init__(filter_fn=...)
resolves, and #1186 (swe-smith) which also merged. All three conflicts
were list/dict unions — swerebench-v2 additions alongside swesmith
additions; no semantic choices.

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit cd24c97.

Comment thread verifiers/envs/experimental/composable/tasksets/swe/swe_rebench_v2.py Outdated
The prior _build_eval_script dedented an f-string template whose
{body} substitution was multi-line — the first body line inherited
the template's 8-space indent, subsequent body lines started at
column 0, so dedent's common-prefix was 0 and stripped nothing.
Result: indented script (bash-tolerated but sloppy) plus misleading
dead dedent logic.

Rebuild as a flat list of lines joined with newlines. Every line
now starts at column 0 by construction. Drops the unused dedent
import.
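The failure mode this thread describes reproduces in isolation (a minimal sketch; the template and body below are stand-ins for the real `_build_eval_script`):

```python
import textwrap

body = "line1\nline2"  # multi-line substitution; second line at column 0

script_template = f"""\
        set -e
        {body}
        echo done
"""

# "line2" has no leading whitespace, so dedent's common prefix across
# non-blank lines is empty and nothing is stripped.
assert textwrap.dedent(script_template) == script_template

# The flat-list rebuild: join explicit lines, so every line starts at
# column 0 by construction and no dedent call is needed.
script = "\n".join(["set -e", *body.splitlines(), "echo done"])
assert script == "set -e\nline1\nline2\necho done"
```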
@rasdani rasdani merged commit 5d4b8b9 into main Apr 20, 2026
6 checks passed