[codex] Add Harbor covered benchmark mapping by neubig · Pull Request #731 · OpenHands/benchmarks

neubig · 2026-05-30T20:52:09Z

Summary

- Add a shared mapping from covered OpenHands benchmark module names to Harbor dataset names.
- Wire the Terminal-Bench and SkillsBench config defaults through the shared mapping.
- Add tests for mapping coverage, normalization, uncovered benchmarks, and wrapper defaults.
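
The shape of such a shared mapping can be sketched as follows. This is a minimal illustration, not the code in this PR: the names `HARBOR_DATASETS`, `normalize_benchmark_name`, and `harbor_dataset_for` are hypothetical, and the dataset strings are placeholders rather than confirmed Harbor dataset names.

```python
from typing import Optional

# Hypothetical mapping from OpenHands benchmark module names to Harbor
# dataset names. The values are placeholders, not confirmed Harbor names.
HARBOR_DATASETS = {
    "terminalbench": "terminal-bench",
    "skillsbench": "skillsbench",
}


def normalize_benchmark_name(name: str) -> str:
    """Lowercase the name and drop separators so spelling variants match."""
    return name.strip().lower().replace("-", "").replace("_", "")


def harbor_dataset_for(name: str) -> Optional[str]:
    """Return the Harbor dataset name, or None for an uncovered benchmark."""
    return HARBOR_DATASETS.get(normalize_benchmark_name(name))
```

Under these assumptions, `harbor_dataset_for("Terminal-Bench")` and `harbor_dataset_for("terminalbench")` resolve to the same entry, while a benchmark without a Harbor counterpart returns `None` instead of raising.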

Replaces closed PR #728, which could not be reopened after its original stacked base branch was deleted by merging #727.
Closes #720.

Validation

- uv run pytest tests/test_harbor_compat.py tests/test_terminalbench.py tests/test_harbor.py
- uv run pre-commit run --files benchmarks/skillsbench/config.py benchmarks/terminalbench/config.py benchmarks/utils/harbor_compat.py tests/test_harbor_compat.py
- 5-example CI smoke tests are being run with MiniMax M2.7 via OpenHands/evaluation#567.

This PR was created by an AI agent (OpenHands) on behalf of the user.

neubig · 2026-05-30T21:54:48Z

Validation update for replacement of #728:

Local validation passed on rebased branch 9169a930:
- uv run pytest tests/test_harbor_compat.py tests/test_terminalbench.py tests/test_harbor.py
- uv run pre-commit run --files benchmarks/skillsbench/config.py benchmarks/terminalbench/config.py benchmarks/utils/harbor_compat.py tests/test_harbor_compat.py
CI checks on this PR passed (pre-commit, tests).
Evaluation infrastructure note: OpenHands/evaluation main did not expose skillsbench, so I added support in evaluation PR OpenHands/evaluation#567 and used that branch for the smoke tests.

5-example CI smoke runs with MiniMax M2.7:

Benchmark	Run	Harness	Result
TerminalBench	26695326120	26695755325	5/5 submitted; 0 completed; 5 errored
SkillsBench	26694648518	26695090760	5/5 submitted; 0 completed; 5 errored

Artifacts:

TerminalBench infer: https://results.eval.all-hands.dev/terminalbench/litellm_proxy-minimax-MiniMax-M2-7/26695326120/infer-output.tar.gz
TerminalBench harness: https://results.eval.all-hands.dev/terminalbench/litellm_proxy-minimax-MiniMax-M2-7/26695755325/results.tar.gz
SkillsBench infer: https://results.eval.all-hands.dev/skillsbench/litellm_proxy-minimax-MiniMax-M2-7/26694648518/infer-output.tar.gz
SkillsBench harness: https://results.eval.all-hands.dev/skillsbench/litellm_proxy-minimax-MiniMax-M2-7/26695090760/results.tar.gz

Observed failures are in the Harbor execution path, not this mapping diff itself:

- TerminalBench: all selected tasks errored with RewardFileNotFoundError after agent execution; no /logs/verifier/reward.{txt,json} was produced.
- SkillsBench: errors included RewardFileNotFoundError; one task failed Docker startup because it requested more CPUs than the 4-CPU pod limit allows; one task failed to install openhands-sdk inside the task container because of a DNS resolution failure.
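
The missing-reward condition described above can be checked on an extracted artifact with a small helper. This is a hedged sketch, not Harbor's implementation: `find_reward_file` and `REWARD_FILENAMES` are hypothetical names, and the only detail taken from the failure report is that the verifier is expected to write `reward.txt` or `reward.json` under `/logs/verifier`.

```python
from pathlib import Path
from typing import Optional

# File names taken from the failure report; the lookup order is an assumption.
REWARD_FILENAMES = ("reward.txt", "reward.json")


def find_reward_file(verifier_dir: Path) -> Optional[Path]:
    """Return the first reward file found in verifier_dir, or None."""
    for filename in REWARD_FILENAMES:
        candidate = verifier_dir / filename
        if candidate.is_file():
            return candidate
    return None
```

Calling `find_reward_file(Path("/logs/verifier"))` returns `None` when neither file exists, which corresponds to the state in which the errored tasks ended.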

I did not proceed to full benchmark runs because both 5-example CI smoke runs currently produce 0 completed examples; full runs would not provide a meaningful comparison until these Harbor/runtime issues are resolved.

This comment was created by an AI agent (OpenHands) on behalf of the user.

Add Harbor covered benchmark mapping

9169a93

neubig mentioned this pull request May 30, 2026

[codex] Add Harbor covered benchmark mapping #728

Closed

[codex] Add Harbor covered benchmark mapping #731
neubig wants to merge 1 commit into main from codex/harbor-covered-wrapper-map
