
feat: add SWE-rebench-V2 TaskSet #1187

Merged
rasdani merged 4 commits into main from feat/swe-rebench-v2-taskset
Apr 20, 2026

Conversation

Contributor

@rasdani rasdani commented Apr 19, 2026

Summary

  • Adds SWERebenchV2TaskSet backend (swerebench-v2) for nebius/SWE-rebench-V2 — 32 079 real multilingual bug-fix instances across 20 languages, each shipping a pre-built docker image, gold patch, test_patch, and an install_config dict pinning per-instance test_cmd + log_parser.
  • Grading mirrors upstream scripts/eval.py: apply patch then test_patch with git apply -v --3way --recount --ignore-space-change --whitespace=nowarn, run the pinned test_cmd, parse output with the parser named by install_config.log_parser, check FAIL_TO_PASS ∪ PASS_TO_PASS ⊆ PASSED with timing-suffix normalization on both sides. Workdir is /{repo-name} (upstream convention), not /testbed.
  • Vendors upstream lib/agent/log_parsers.py as swe_rebench_v2_log_parsers.py (upstream isn't pip-installable). Attribution header + # ruff: noqa on the vendored module; sole edit vs. upstream is inlining TestStatus (replaces the unresolvable from lib.agent.swe_constants import TestStatus). 76 parsers covering all 20 supported languages.
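The resolution check described in the grading bullet can be sketched in isolation (a minimal sketch: `strip_timing_suffix` and its regex are illustrative stand-ins for the backend's normalization helpers, not the actual code):

```python
import re

def strip_timing_suffix(name: str) -> str:
    # Illustrative normalization: drop a trailing "(1.23s)"-style timing
    # annotation so test names compare equal regardless of run duration.
    return re.sub(r"\s*\(\d+(\.\d+)?s\)$", "", name).strip()

def is_resolved(fail_to_pass: list[str], pass_to_pass: list[str],
                passed: list[str]) -> bool:
    # Resolved iff FAIL_TO_PASS ∪ PASS_TO_PASS ⊆ PASSED, with timing
    # suffixes normalized on both sides of the comparison.
    required = {strip_timing_suffix(t) for t in fail_to_pass + pass_to_pass}
    return required <= {strip_timing_suffix(t) for t in passed}
```

Note that normalizing both sides matters: a parser may emit `test_a (0.1s)` while FAIL_TO_PASS records plain `test_a`, or vice versa.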

Files

  • verifiers/envs/experimental/composable/tasksets/swe/swe_rebench_v2.py — class + rubric (new)
  • verifiers/envs/experimental/composable/tasksets/swe/swe_rebench_v2_log_parsers.py — vendored (new)
  • verifiers/envs/experimental/composable/tasksets/swe/swe_tasksets.py — make_swerebench_v2_taskset + swerebench-v2 entry in the make_swe_taskset factory map
  • verifiers/envs/experimental/composable/tasksets/swe/__init__.py, verifiers/envs/experimental/composable/tasksets/__init__.py — re-exports

Drop-in via make_swe_taskset(backend="swerebench-v2") — no research-environments changes needed (rlm-swe picks this up automatically like swebench / swelego-real / etc).
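The drop-in dispatch might look roughly like this (a sketch under assumptions: the stand-in factory below is a placeholder, not the real `make_swerebench_v2_taskset`, and only the one map entry is shown):

```python
from typing import Any, Callable

def make_swerebench_v2_taskset(**kwargs: Any) -> str:
    # Placeholder for the real factory in swe_tasksets.py.
    return "swerebench-v2 taskset"

# Backend-name -> factory map, mirroring the registry pattern the PR extends.
FACTORIES: dict[str, Callable[..., Any]] = {
    "swerebench-v2": make_swerebench_v2_taskset,
}

def make_swe_taskset(backend: str, **kwargs: Any) -> Any:
    # Dispatch on the backend name; unknown names fail loudly.
    try:
        return FACTORIES[backend](**kwargs)
    except KeyError:
        raise ValueError(f"unknown backend: {backend!r}") from None
```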

Test plan

  • uv run ruff check … + uv run pre-commit run --files … — clean (vendored module excluded via # ruff: noqa).

  • Import smoke: dataset loads (32 079 rows), NAME_TO_PARSER registers 76 parsers, factory dispatch works.

  • Live gold-patch validate: `await make_swerebench_v2_taskset().validate(n=1, concurrency=1)` → valid=True, 79s (TS instance, parse_log_js_4).

  • Multi-language validate sweep: validate(n=1) per language, 4 distinct parsers, all green:

    | language | parser | image (first row) | elapsed | valid |
    | --- | --- | --- | --- | --- |
    | python | parse_log_pytest | docker.io/swerebenchv2/wtforms-wtforms:614-848d28d | 41.4s | ✅ |
    | java | parse_java_mvn | docker.io/swerebenchv2/crawler-commons-crawler-commons:227-ab9e33a | 52.1s | ✅ |
    | go | parse_log_gotest | docker.io/swerebenchv2/filecoin-project-specs-actors:592-db0bbaa | 66.4s | ✅ |
    | rust | parse_log_cargo | docker.io/swerebenchv2/dtolnay-cxx:839-c5f472e | 70.8s | ✅ |
  • vf-eval end-to-end, reward=1 achieved: `uv run vf-eval rlm-swe -a '{"task_type":"swerebench-v2"}' -p prime -m openai/gpt-5.4 -n 1 -r 1 -s` completed cleanly with reward=1.000, stop_conditions: agent_completed=1.000, 4m18s, 18 turns / 17 tool calls. Instance: elastic__synthetics-316 (TS, parse_log_js_4).

Notes

  • install_config.install is intentionally not re-run: the published images ship with deps pre-installed, matching upstream's eval.py which only runs test_cmd.
  • language filter takes the dataset's native labels (python, go, rust, java, ts, js, php, kotlin, julia, elixir, scala, swift, dart, c, cpp, csharp, r, clojure, ocaml, lua); omit for the full 32k cross-language mix.
  • rlm-swe host image installs are fixed upstream: PrimeIntellect-ai/rlm#33 + #34 pin UV_INSTALL_DIR / UV_TOOL_BIN_DIR so both uv and its tool shims colocate at $HOME/.local/bin regardless of the image's XDG configuration.

🤖 Generated with Claude Code


Note

Medium Risk
Adds a sizable new sandbox-executed evaluation path that runs per-instance test_cmd and parses diverse test outputs; issues are mostly contained to the new backend but could affect sandbox stability/timeouts.

Overview
Adds a new SWE backend swerebench-v2 backed by the nebius/SWE-rebench-V2 dataset, including a new SWERebenchV2TaskSet + rubric that grades by applying the instance’s patches, running its pinned test_cmd inside the provided Docker image/workdir, and comparing parsed test statuses against FAIL_TO_PASS/PASS_TO_PASS (with timing-suffix normalization).

Wires the backend into make_swe_taskset and re-exports it from the SWE taskset packages. Vendors the upstream SWE-rebench-V2 log parser suite (large swe_rebench_v2_log_parsers.py) to match upstream evaluation behavior.

Reviewed by Cursor Bugbot for commit a17fa55.

Adds `SWERebenchV2TaskSet` backend (`swerebench-v2`) for
nebius/SWE-rebench-V2 — 32k multilingual real bug-fix instances across 20
languages, each shipping a pre-built image, a gold `patch`, a `test_patch`,
and an `install_config` pinning `test_cmd` + `log_parser`.

Grading mirrors upstream `scripts/eval.py`: apply patch + test_patch with
`git apply -v --3way --recount --ignore-space-change --whitespace=nowarn`,
run the pinned `test_cmd`, parse with the named parser, check F2P+P2P ⊆
PASSED (timing suffixes normalized on both sides). Workdir is
`/{repo-name}` to match upstream.

Vendors upstream `lib/agent/log_parsers.py` (SWE-rebench-V2 isn't
pip-installable) as `swe_rebench_v2_log_parsers.py` with attribution
header + `ruff: noqa`; only edit vs. upstream is replacing
`from lib.agent.swe_constants import TestStatus` with an inline enum.

Verified `valid=True` on the first live instance via
`make_swerebench_v2_taskset().validate(n=1, concurrency=1)` (79s).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rasdani rasdani marked this pull request as ready for review April 19, 2026 02:03
@rasdani rasdani requested a review from hallerite April 19, 2026 02:12
Member


can we maybe group everything related to swe_rebench_v2 by putting it in a folder?

Comment thread verifiers/envs/experimental/composable/tasksets/swe/swe_tasksets.py
Picks up #1199 (base-class filter_fn) so super().__init__(filter_fn=...)
resolves, and #1186 (swe-smith) which also merged. All three conflicts
were list/dict unions — swerebench-v2 additions alongside swesmith
additions; no semantic choices.

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit cd24c97.

Comment thread verifiers/envs/experimental/composable/tasksets/swe/swe_rebench_v2.py Outdated
The prior _build_eval_script dedented an f-string template whose
{body} substitution was multi-line — the first body line inherited
the template's 8-space indent, subsequent body lines started at
column 0, so dedent's common-prefix was 0 and stripped nothing.
Result: indented script (bash-tolerated but sloppy) plus misleading
dead dedent logic.

Rebuild as a flat list of lines joined with newlines. Every line
now starts at column 0 by construction. Drops the unused dedent
import.
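The failure mode this thread describes reproduces in isolation (a minimal sketch; the template and body below are stand-ins for the real `_build_eval_script`):

```python
import textwrap

body = "line1\nline2"  # multi-line substitution; second line at column 0

script_template = f"""\
        set -e
        {body}
        echo done
"""

# "line2" has no leading whitespace, so dedent's common prefix across
# non-blank lines is empty and nothing is stripped.
assert textwrap.dedent(script_template) == script_template

# The flat-list rebuild: join explicit lines, so every line starts at
# column 0 by construction and no dedent call is needed.
script = "\n".join(["set -e", *body.splitlines(), "echo done"])
assert script == "set -e\nline1\nline2\necho done"
```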
@rasdani rasdani merged commit 5d4b8b9 into main Apr 20, 2026
6 checks passed