[codex] Add Harbor covered benchmark mapping#731

Open
neubig wants to merge 1 commit into main from codex/harbor-covered-wrapper-map

neubig commented May 30, 2026

Summary

  • add a shared mapping from covered OpenHands benchmark module names to Harbor dataset names
  • wire Terminal-Bench and SkillsBench config defaults through the shared mapping
  • add tests for mapping coverage, normalization, uncovered benchmarks, and wrapper defaults
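
For readers without the diff open, the shape of such a shared mapping can be sketched as follows. This is a minimal illustration, not the code in `benchmarks/utils/harbor_compat.py`: the names `HARBOR_DATASETS`, `normalize_benchmark_name`, and `harbor_dataset_for` are hypothetical, the normalization rule is an assumption, and the dataset values are placeholders rather than real Harbor dataset names.

```python
from __future__ import annotations

# Hypothetical mapping. Keys are normalized benchmark module names;
# values are PLACEHOLDER dataset names, not the ones used in this PR.
HARBOR_DATASETS: dict[str, str] = {
    "terminalbench": "terminal-bench-placeholder",
    "skillsbench": "skillsbench-placeholder",
}


def normalize_benchmark_name(name: str) -> str:
    """Lowercase the name and drop separators so spelling variants match.

    Assumed rule: "Terminal-Bench", "terminal_bench" and "terminalbench"
    should all resolve to the same key.
    """
    return name.strip().lower().replace("-", "").replace("_", "")


def harbor_dataset_for(name: str) -> str | None:
    """Return the mapped dataset name, or None for an uncovered benchmark."""
    return HARBOR_DATASETS.get(normalize_benchmark_name(name))
```

Returning `None` for an uncovered benchmark keeps the lookup total, so a wrapper config can decide for itself whether to fall back to a default or raise.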

Replaces closed PR #728, which could not be reopened after its original stacked base branch was deleted by merging #727.
Closes #720.

Validation

  • uv run pytest tests/test_harbor_compat.py tests/test_terminalbench.py tests/test_harbor.py
  • uv run pre-commit run --files benchmarks/skillsbench/config.py benchmarks/terminalbench/config.py benchmarks/utils/harbor_compat.py tests/test_harbor_compat.py
  • 5-example CI smoke tests are being run with MiniMax M2.7 via OpenHands/evaluation#567.

This PR was created by an AI agent (OpenHands) on behalf of the user.

neubig commented May 30, 2026

Validation update for replacement of #728:

  • Local validation passed on rebased branch 9169a930:
    • uv run pytest tests/test_harbor_compat.py tests/test_terminalbench.py tests/test_harbor.py
    • uv run pre-commit run --files benchmarks/skillsbench/config.py benchmarks/terminalbench/config.py benchmarks/utils/harbor_compat.py tests/test_harbor_compat.py
  • CI checks on this PR passed (pre-commit, tests).
  • Evaluation infrastructure note: OpenHands/evaluation main did not expose skillsbench; I added support on evaluation PR OpenHands/evaluation#567 and used that branch for the smoke tests.

5-example CI smoke runs with MiniMax M2.7:

| Benchmark | Run | Harness | Result |
|---|---|---|---|
| TerminalBench | 26695326120 | 26695755325 | 5/5 submitted; 0 completed; 5 errored |
| SkillsBench | 26694648518 | 26695090760 | 5/5 submitted; 0 completed; 5 errored |

Artifacts:

Observed failures are in the Harbor execution path, not this mapping diff itself:

  • TerminalBench: all selected tasks errored with RewardFileNotFoundError after agent execution; no /logs/verifier/reward.{txt,json} was produced.
  • SkillsBench: errors included RewardFileNotFoundError; one task failed Docker startup because it requested more CPUs than the 4-CPU pod limit allows; one task failed to install openhands-sdk inside the task container due to a DNS resolution failure.
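
To make the RewardFileNotFoundError condition concrete, the check that comes up empty can be approximated as below. This is a sketch under stated assumptions, not Harbor's implementation: the helper name `find_reward_file` is hypothetical, and the only detail taken from the runs above is that neither `reward.txt` nor `reward.json` existed under `/logs/verifier/`.

```python
from pathlib import Path

# File names from the failure description above: reward.{txt,json}.
REWARD_FILENAMES = ("reward.txt", "reward.json")


def find_reward_file(verifier_dir):
    """Return the first reward file found in verifier_dir, or None.

    Hypothetical helper for triaging run artifacts; a None result
    corresponds to the "no reward file was produced" case.
    """
    base = Path(verifier_dir)
    for name in REWARD_FILENAMES:
        candidate = base / name
        if candidate.is_file():
            return candidate
    return None
```

Running this over each task's downloaded verifier log directory would separate tasks whose verifier never wrote a reward from tasks that failed earlier, such as the Docker startup and DNS failures.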

I did not proceed to full benchmark runs because both 5-example CI smoke runs currently complete 0 examples; full runs would not provide a meaningful comparison until these Harbor/runtime issues are resolved.

This comment was created by an AI agent (OpenHands) on behalf of the user.

Development

Successfully merging this pull request may close these issues.

Replace Harbor-covered benchmark implementations with thin Harbor compatibility wrappers