Context
Several benchmarks in this repository overlap with Harbor registry datasets/adapters. Once Harbor execution is centralized, these benchmarks can be reduced to thin compatibility wrappers that preserve OpenHands CLI/output conventions while delegating task execution to Harbor.
Candidate benchmarks
- SWE-Bench -> swebench-verified
- SWE-Bench Pro -> swebenchpro
- SWE-Bench Multilingual -> swebench_multilingual
- SWTBench -> swtbench-verified
- SWESmith -> swesmith
- Multi-SWE-Bench -> multi-swe-bench
- GAIA -> gaia
- Terminal-Bench -> terminal-bench@2.0
- SWEGym -> swegym
Proposed scope
- Add a small registry/mapping from OpenHands benchmark names to Harbor dataset names.
- Convert covered benchmarks incrementally to call the shared Harbor runner.
- Preserve current script entry points and output.jsonl/cost-report behavior where practical.
- Leave unsupported or parity-sensitive behavior on the legacy path until a Harbor adapter is validated.
Acceptance criteria
- At least one non-Terminal-Bench covered benchmark uses the shared Harbor wrapper path.
- The migration path is documented for the remaining covered benchmarks.
- Tests cover mapping and output conversion behavior.