[codex] Add Harbor covered benchmark mapping #728

Closed
neubig wants to merge 1 commit into codex/shared-harbor-runner from codex/harbor-covered-wrapper-map

Conversation

neubig (Member) commented May 28, 2026

Summary

  • add a shared mapping from covered OpenHands benchmark module names to Harbor dataset names
  • wire Terminal-Bench and SkillsBench config defaults through the shared mapping
  • add tests for mapping coverage, normalization, uncovered benchmarks, and existing wrapper defaults

Stacked on #727.
Closes #720.

Validation

  • python -m py_compile benchmarks/utils/harbor_compat.py benchmarks/terminalbench/config.py benchmarks/skillsbench/config.py
  • uv run pytest tests/test_harbor_compat.py tests/test_terminalbench.py tests/test_skillsbench_run_infer.py
  • uv run ruff check benchmarks/utils/harbor.py benchmarks/utils/harbor_compat.py benchmarks/terminalbench/run_infer.py benchmarks/skillsbench/run_infer.py tests/test_harbor_compat.py

@neubig neubig force-pushed the codex/shared-harbor-runner branch from cc30da7 to 057a3aa Compare May 28, 2026 15:33
@neubig neubig force-pushed the codex/harbor-covered-wrapper-map branch from 2e8c35c to 538a073 Compare May 28, 2026 15:34
@neubig neubig marked this pull request as ready for review May 28, 2026 15:34
neubig (Member, Author) commented May 28, 2026

@OpenHands /codereview

openhands-ai (Bot) commented May 28, 2026

I'm on it! neubig can track my progress at all-hands.dev

neubig (Member, Author) commented May 28, 2026

@OpenHands /codereview

Validation context for the re-review:

  • PR CI is green: pre-commit and tests succeeded on commit 538a073.
  • Local validation passed: uv run ruff check ..., uv run pyright benchmarks/utils/harbor.py benchmarks/utils/harbor_compat.py tests/test_harbor_compat.py, and uv run pytest tests/test_harbor_compat.py tests/test_terminalbench.py tests/test_skillsbench_run_infer.py.
  • Terminal-Bench smoke using the stacked benchmarks branch succeeded at the GitHub Actions deployment level: OpenHands/evaluation run 26586388116 (run-infer.yml, benchmark=terminalbench, eval_limit=1, benchmarks_branch=codex/harbor-covered-wrapper-map).

Note: the SDK run-eval.yml dispatcher accepted terminalbench, but the downstream eval-job.yml prerequisite switch rejected it before inference; the direct run-infer.yml path is the valid Terminal-Bench smoke path for now.

openhands-ai (Bot) commented May 28, 2026

I'm on it! neubig can track my progress at all-hands.dev

neubig (Member, Author) commented May 30, 2026

After #727 was merged and the stacked base branch was deleted, GitHub would not let this PR be reopened or retargeted. The error shown was: "state cannot be changed. The codex/harbor-covered-wrapper-map branch was force-pushed or recreated".

I created replacement PR #731 on top of main with the rebased #728 changes and posted the CI smoke validation results there: #731 (comment)

This comment was created by an AI agent (OpenHands) on behalf of the user.
