Replace Harbor-covered benchmark implementations with thin Harbor compatibility wrappers

## Context
Several benchmarks in this repository overlap with Harbor registry datasets/adapters. Once Harbor execution is centralized, these benchmarks can be reduced to thin compatibility wrappers that preserve OpenHands CLI/output conventions while delegating task execution to Harbor.

## Candidate benchmarks
- SWE-Bench -> `swebench-verified`
- SWE-Bench Pro -> `swebenchpro`
- SWE-Bench Multilingual -> `swebench_multilingual`
- SWTBench -> `swtbench-verified`
- SWESmith -> `swesmith`
- Multi-SWE-Bench -> `multi-swe-bench`
- GAIA -> `gaia`
- Terminal-Bench -> `terminal-bench@2.0`
- SWEGym -> `swegym`

## Proposed scope
- Add a small registry/mapping from OpenHands benchmark names to Harbor dataset names.
- Convert covered benchmarks incrementally to call the shared Harbor runner.
- Preserve current script entry points and `output.jsonl`/cost-report behavior where practical.
- Leave unsupported or parity-sensitive behavior on the legacy path until a Harbor adapter is validated.
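
As a starting point for discussion, the registry and the legacy fallback could be sketched roughly as below. This is a minimal sketch, not the repository's actual code: the key spelling, `resolve_harbor_dataset`, `run_benchmark`, the `validated` set, and the two runner callables are all hypothetical names. Only the dataset names on the right-hand side come from the candidate list above.

```python
from typing import Callable, Collection, Optional, TypeVar

T = TypeVar("T")

# Dataset names are copied from the "Candidate benchmarks" list.
# The lowercase, hyphenated key spelling is an assumption for illustration.
HARBOR_DATASETS: dict[str, str] = {
    "swe-bench": "swebench-verified",
    "swe-bench-pro": "swebenchpro",
    "swe-bench-multilingual": "swebench_multilingual",
    "swtbench": "swtbench-verified",
    "swesmith": "swesmith",
    "multi-swe-bench": "multi-swe-bench",
    "gaia": "gaia",
    "terminal-bench": "terminal-bench@2.0",
    "swegym": "swegym",
}


def resolve_harbor_dataset(benchmark: str) -> Optional[str]:
    """Return the Harbor dataset name for a benchmark, or None if unmapped."""
    key = benchmark.strip().lower().replace("_", "-").replace(" ", "-")
    return HARBOR_DATASETS.get(key)


def run_benchmark(
    benchmark: str,
    run_harbor: Callable[[str], T],
    run_legacy: Callable[[str], T],
    validated: Collection[str] = (),
) -> T:
    """Use the Harbor path only for mapped datasets whose adapter is validated.

    `run_harbor` and `run_legacy` are hypothetical stand-ins for the shared
    Harbor runner and the existing per-benchmark implementation.
    """
    dataset = resolve_harbor_dataset(benchmark)
    if dataset is not None and dataset in validated:
        return run_harbor(dataset)
    # Unmapped or not-yet-validated benchmarks stay on the legacy path.
    return run_legacy(benchmark)
```

Keeping the validated set separate from the mapping would let a benchmark be listed in the registry before its Harbor adapter has passed parity checks, which matches the incremental conversion described above.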

## Acceptance criteria
- At least one covered benchmark other than Terminal-Bench uses the shared Harbor wrapper path.
- The migration path is documented for the remaining covered benchmarks.
- Tests cover mapping and output conversion behavior.
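
To make the output-conversion criterion concrete, here is one possible shape for the conversion step and the kind of behavior a test could pin down. Everything about the record layout is an assumption: the Harbor-side keys (`task_id`, `reward`, `cost_usd`, `error`) and the `output.jsonl` row keys are hypothetical placeholders, and would need to be replaced with the real Harbor result schema and the current OpenHands output format.

```python
import json
from typing import Any, Iterable, Mapping


def harbor_trial_to_output_row(trial: Mapping[str, Any]) -> dict[str, Any]:
    """Convert one (hypothetical) Harbor trial record into an output.jsonl row."""
    reward = float(trial.get("reward", 0.0))
    return {
        "instance_id": trial["task_id"],
        # Assumption: a reward of 1.0 or more means the task was resolved.
        "test_result": {"reward": reward, "resolved": reward >= 1.0},
        "metrics": {"accumulated_cost": float(trial.get("cost_usd", 0.0))},
        "error": trial.get("error"),
    }


def to_output_jsonl(trials: Iterable[Mapping[str, Any]]) -> tuple[str, float]:
    """Return the JSONL text and the total cost summed across all rows."""
    lines: list[str] = []
    total_cost = 0.0
    for trial in trials:
        row = harbor_trial_to_output_row(trial)
        lines.append(json.dumps(row, sort_keys=True))
        total_cost += row["metrics"]["accumulated_cost"]
    text = "\n".join(lines) + ("\n" if lines else "")
    return text, total_cost
```

A test along these lines would feed in a small fixed list of trial records and assert on the parsed rows and the cost total, so that a change in either schema shows up as a failing assertion rather than a silent drift in reports.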

Replace Harbor-covered benchmark implementations with thin Harbor compatibility wrappers #720

Context

Candidate benchmarks

Proposed scope

Acceptance criteria

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Replace Harbor-covered benchmark implementations with thin Harbor compatibility wrappers #720

Description

Context

Candidate benchmarks

Proposed scope

Acceptance criteria

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions