Replace Harbor-covered benchmark implementations with thin Harbor compatibility wrappers #720

@neubig

Context

Several benchmarks in this repository overlap with Harbor registry datasets/adapters. Once Harbor execution is centralized, these benchmarks can be reduced to thin compatibility wrappers that preserve OpenHands CLI/output conventions while delegating task execution to Harbor.

Candidate benchmarks

  • SWE-Bench -> swebench-verified
  • SWE-Bench Pro -> swebenchpro
  • SWE-Bench Multilingual -> swebench_multilingual
  • SWTBench -> swtbench-verified
  • SWESmith -> swesmith
  • Multi-SWE-Bench -> multi-swe-bench
  • GAIA -> gaia
  • Terminal-Bench -> terminal-bench@2.0
  • SWEGym -> swegym

Proposed scope

  • Add a small registry/mapping from OpenHands benchmark names to Harbor dataset names.
  • Convert covered benchmarks incrementally to call the shared Harbor runner.
  • Preserve current script entry points and output.jsonl/cost-report behavior where practical.
  • Leave unsupported or parity-sensitive behavior on the legacy path until a Harbor adapter is validated.
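
To make the first and last bullets concrete, here is a minimal sketch of what the registry could look like. The mapping itself is copied from the candidate list above. Everything else is an assumption for illustration: the names `HARBOR_DATASETS`, `VALIDATED_ADAPTERS`, and `resolve_harbor_dataset` are hypothetical, and so is the choice to treat only Terminal-Bench as validated.

```python
from typing import Optional

# OpenHands benchmark name -> Harbor dataset name (from the candidate list).
HARBOR_DATASETS: dict[str, str] = {
    "SWE-Bench": "swebench-verified",
    "SWE-Bench Pro": "swebenchpro",
    "SWE-Bench Multilingual": "swebench_multilingual",
    "SWTBench": "swtbench-verified",
    "SWESmith": "swesmith",
    "Multi-SWE-Bench": "multi-swe-bench",
    "GAIA": "gaia",
    "Terminal-Bench": "terminal-bench@2.0",
    "SWEGym": "swegym",
}

# Hypothetical: benchmarks whose Harbor adapter has passed a parity check.
# The single entry here is an illustrative assumption, not a project fact.
VALIDATED_ADAPTERS: set[str] = {"Terminal-Bench"}


def _normalize(name: str) -> str:
    """Lowercase and drop punctuation so 'swe-bench' matches 'SWE-Bench'."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


_CANONICAL = {_normalize(name): name for name in HARBOR_DATASETS}


def resolve_harbor_dataset(benchmark: str) -> Optional[str]:
    """Return the Harbor dataset name, or None to stay on the legacy path."""
    canonical = _CANONICAL.get(_normalize(benchmark))
    if canonical is None or canonical not in VALIDATED_ADAPTERS:
        return None
    return HARBOR_DATASETS[canonical]
```

Returning `None` for unvalidated benchmarks keeps the legacy path as the default, so each benchmark can be switched over by adding it to the validated set once parity is confirmed.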

Acceptance criteria

  • At least one non-Terminal-Bench covered benchmark uses the shared Harbor wrapper path.
  • The migration path is documented for the remaining covered benchmarks.
  • Tests cover mapping and output conversion behavior.
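
As a rough illustration of the output conversion that the tests would need to cover, the sketch below turns per-task results into `output.jsonl` lines and a total cost. The input and output field names (`task_id`, `reward`, `cost_usd`, `instance_id`, `resolved`, `error`) are placeholders invented for this example. They are not taken from Harbor's actual result schema or from the existing OpenHands output format, so they would need to be replaced with the real fields.

```python
import json
from typing import Any, Iterable


def to_output_record(trial: dict[str, Any]) -> dict[str, Any]:
    """Convert one (hypothetical) Harbor trial result into an output record."""
    return {
        "instance_id": trial["task_id"],
        # Assumption: a reward of 1 or more means the task was solved.
        "resolved": trial.get("reward", 0) >= 1,
        "cost_usd": float(trial.get("cost_usd", 0.0)),
        "error": trial.get("error"),
    }


def write_output_jsonl(trials: Iterable[dict[str, Any]], path: str) -> float:
    """Write one JSON object per line and return the total cost."""
    total_cost = 0.0
    with open(path, "w", encoding="utf-8") as f:
        for trial in trials:
            record = to_output_record(trial)
            total_cost += record["cost_usd"]
            f.write(json.dumps(record) + "\n")
    return total_cost
```

A unit test for this layer can feed in a handful of hand-written trial dicts and assert on the resulting records, without needing to run Harbor itself.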
