Merged
Adds `SWERebenchV2TaskSet` backend (`swerebench-v2`) for
nebius/SWE-rebench-V2 — 32k multilingual real bug-fix instances across 20
languages, each shipping a pre-built image, a gold `patch`, a `test_patch`,
and an `install_config` pinning `test_cmd` + `log_parser`.
Grading mirrors upstream `scripts/eval.py`: apply patch + test_patch with
`git apply -v --3way --recount --ignore-space-change --whitespace=nowarn`,
run the pinned `test_cmd`, parse with the named parser, check F2P+P2P ⊆
PASSED (timing suffixes normalized on both sides). Workdir is
`/{repo-name}` to match upstream.
Vendors upstream `lib/agent/log_parsers.py` (SWE-rebench-V2 isn't
pip-installable) as `swe_rebench_v2_log_parsers.py` with attribution
header + `ruff: noqa`; only edit vs. upstream is replacing
`from lib.agent.swe_constants import TestStatus` with an inline enum.
Verified `valid=True` on the first live instance via
`make_swerebench_v2_taskset().validate(n=1, concurrency=1)` (79s).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hallerite (Member) reviewed on Apr 20, 2026:
can we maybe group everything related to swe_rebench_v2 by putting it in a folder?
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit cd24c97.
The prior _build_eval_script dedented an f-string template whose
{body} substitution was multi-line — the first body line inherited
the template's 8-space indent, subsequent body lines started at
column 0, so dedent's common-prefix was 0 and stripped nothing.
Result: indented script (bash-tolerated but sloppy) plus misleading
dead dedent logic.
Rebuild as a flat list of lines joined with newlines. Every line
now starts at column 0 by construction. Drops the unused dedent
import.
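The pitfall and the fix can be reproduced in isolation (a contrived `body`, not the actual eval-script template):

```python
from textwrap import dedent

body = "echo step-1\necho step-2"  # multi-line substitution; later lines start at column 0

# Before: after f-string substitution, "echo step-2" sits at column 0, so
# dedent's common leading-whitespace prefix is empty and nothing is stripped.
broken = dedent(f"""\
        set -e
        {body}
        echo done
""")

# After: build a flat list of lines joined with newlines. Every line starts
# at column 0 by construction, and dedent is no longer needed at all.
fixed = "\n".join(["set -e", *body.splitlines(), "echo done"])
```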

Summary
`SWERebenchV2TaskSet` backend (`swerebench-v2`) for `nebius/SWE-rebench-V2` — 32 079 real multilingual bug-fix instances across 20 languages, each shipping a pre-built Docker image, a gold `patch`, a `test_patch`, and an `install_config` dict pinning a per-instance `test_cmd` + `log_parser`.
Grading mirrors upstream `scripts/eval.py`: apply `patch` then `test_patch` with `git apply -v --3way --recount --ignore-space-change --whitespace=nowarn`, run the pinned `test_cmd`, parse output with the parser named by `install_config.log_parser`, and check `FAIL_TO_PASS ∪ PASS_TO_PASS ⊆ PASSED` with timing-suffix normalization on both sides. Workdir is `/{repo-name}` (upstream convention), not `/testbed`.
Vendors upstream `lib/agent/log_parsers.py` as `swe_rebench_v2_log_parsers.py` (upstream isn't pip-installable), with an attribution header + `# ruff: noqa` on the vendored module; the sole edit vs. upstream is inlining `TestStatus` (replacing the unresolvable `from lib.agent.swe_constants import TestStatus`). 76 parsers covering all 20 supported languages.

Files
- `verifiers/envs/experimental/composable/tasksets/swe/swe_rebench_v2.py` — class + rubric (new)
- `verifiers/envs/experimental/composable/tasksets/swe/swe_rebench_v2_log_parsers.py` — vendored (new)
- `verifiers/envs/experimental/composable/tasksets/swe/swe_tasksets.py` — `make_swerebench_v2_taskset` + `swerebench-v2` in the `make_swe_taskset` factory map
- `verifiers/envs/experimental/composable/tasksets/swe/__init__.py`, `verifiers/envs/experimental/composable/tasksets/__init__.py` — re-exports

Drop-in via `make_swe_taskset(backend="swerebench-v2")` — no research-environments changes needed (rlm-swe picks this up automatically, like `swebench` / `swelego-real` / etc.).

Test plan
- `uv run ruff check …` + `uv run pre-commit run --files …` — clean (vendored module excluded via `# ruff: noqa`).
- Import smoke: dataset loads (32 079 rows), `NAME_TO_PARSER` registers 76 parsers, factory dispatch works.
- Live gold-patch validate — `await make_swerebench_v2_taskset().validate(n=1, concurrency=1)` → `valid=True`, 79s (TS instance, `parse_log_js_4`).
- Multi-language validate sweep — `validate(n=1)` per language, 4 distinct parsers, all green.
- vf-eval end-to-end, `reward=1` achieved — `uv run vf-eval rlm-swe -a '{"task_type":"swerebench-v2"}' -p prime -m openai/gpt-5.4 -n 1 -r 1 -s` completed cleanly with `reward=1.000`, `stop_conditions: agent_completed=1.000`, 4m18s, 18 turns / 17 tool calls. Instance: `elastic__synthetics-316` (TS, `parse_log_js_4`).

Notes
- `install_config.install` is intentionally not re-run: the published images ship with deps pre-installed, matching upstream's `eval.py`, which only runs `test_cmd`.
- The `language` filter takes the dataset's native labels (`python`, `go`, `rust`, `java`, `ts`, `js`, `php`, `kotlin`, `julia`, `elixir`, `scala`, `swift`, `dart`, `c`, `cpp`, `csharp`, `r`, `clojure`, `ocaml`, `lua`); omit it for the full 32k cross-language mix.
- rlm-swe host image installs are fixed upstream: PrimeIntellect-ai/rlm#33 + #34 pin `UV_INSTALL_DIR` / `UV_TOOL_BIN_DIR` so both `uv` and its tool shims colocate at `$HOME/.local/bin` regardless of the image's XDG configuration.

🤖 Generated with Claude Code
Note
Medium Risk
Adds a sizable new sandbox-executed evaluation path that runs per-instance `test_cmd` and parses diverse test outputs; issues are mostly contained to the new backend but could affect sandbox stability/timeouts.

Overview
Adds a new SWE backend `swerebench-v2` backed by the `nebius/SWE-rebench-V2` dataset, including a new `SWERebenchV2TaskSet` + rubric that grades by applying the instance's patches, running its pinned `test_cmd` inside the provided Docker image/workdir, and comparing parsed test statuses against `FAIL_TO_PASS` / `PASS_TO_PASS` (with timing-suffix normalization). Wires the backend into `make_swe_taskset` and re-exports it from the SWE taskset packages. Vendors the upstream SWE-rebench-V2 log parser suite (the large `swe_rebench_v2_log_parsers.py`) to match upstream evaluation behavior.

Reviewed by Cursor Bugbot for commit a17fa55.