ci: add workflow_call triggers to build workflows by simonrosenberg · Pull Request #666 · OpenHands/benchmarks

simonrosenberg · 2026-04-14T01:04:39Z

Summary

- Add workflow_call triggers to the 5 build workflows used by the evaluation pipeline (swebench, gaia, swtbench, commit0, swebenchmultimodal)
- Update checkout steps to work when called cross-repo from OpenHands/evaluation
- Update concurrency groups to key on sdk-commit when available

This enables the evaluation repo to call these build workflows as reusable workflows via uses:, eliminating the need for cross-repo workflow_dispatch polling and the temporary branch hack in orchestrate_eval.py.
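A minimal sketch of the caller side, assuming a workflow in OpenHands/evaluation. The workflow name, job ids, the `build-swebench.yml` file path, the `@main` ref, and the input values are illustrative assumptions, not taken from this PR. It shows how `uses:` plus `needs:` can replace dispatch-and-poll: the downstream job starts only after the reusable build workflow finishes.

```yaml
# Hypothetical caller workflow in OpenHands/evaluation (names are illustrative).
name: run-eval
on:
  workflow_dispatch:
    inputs:
      sdk-commit:
        description: 'SDK commit to build images for'
        required: true
        type: string

jobs:
  build-swebench:
    # Assumed file path and ref; the PR text does not list the workflow file names.
    uses: OpenHands/benchmarks/.github/workflows/build-swebench.yml@main
    # The caller's GITHUB_TOKEN is what the called workflow runs with,
    # so package write access has to be granted here.
    permissions:
      contents: read
      packages: write
    with:
      sdk-commit: ${{ inputs.sdk-commit }}
      n-limit: '5'
      benchmarks-ref: main
    # Passes the caller's secrets through; works for repos in the same organization.
    secrets: inherit

  run-eval:
    # Waits for the build to finish; no workflow_dispatch polling loop needed.
    needs: build-swebench
    runs-on: ubuntu-latest
    steps:
      - run: echo "Images are built; start the evaluation here."
```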

Changes per workflow

Each build workflow gets:

- A workflow_call: trigger with inputs for sdk-commit, n-limit, instance-ids, benchmarks-ref (and agent-type where applicable)
- Explicit repository: OpenHands/benchmarks + token on the checkout step (required because github.repository resolves to the caller's repo for workflow_call)
- benchmarks-ref as the first priority in checkout ref determination
- Simplified SDK submodule update condition (removed event_name guard since inputs.sdk-commit is sufficient)
- Updated concurrency group: ${{ inputs.sdk-commit || github.ref }} to avoid serializing builds for different SDK commits
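The changes listed above might look roughly like the following in one build workflow. This is a simplified sketch, not the actual diff: the workflow name, the concurrency group prefix, the `vendor/sdk` submodule path, and the single-expression ref fallback are assumptions made for illustration. The real workflows use a longer checkout-ref determination that is not reproduced here.

```yaml
name: Build SWE-bench images  # illustrative name

on:
  workflow_dispatch:
    inputs:
      sdk-commit:
        required: false
        type: string
  workflow_call:
    inputs:
      sdk-commit:
        required: false
        type: string
      n-limit:
        required: false
        type: string
        default: ''
      instance-ids:
        required: false
        type: string
      benchmarks-ref:
        description: 'Benchmarks repo ref to checkout (for cross-repo calls)'
        required: false
        type: string

concurrency:
  # Keyed on the SDK commit when present, so builds for different commits do not serialize.
  group: build-swebench-${{ inputs.sdk-commit || github.ref }}
  cancel-in-progress: false

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          # Explicit, because github.repository is the caller's repo under workflow_call.
          repository: OpenHands/benchmarks
          # Simplified: benchmarks-ref first, then the triggering ref.
          ref: ${{ inputs.benchmarks-ref || github.ref }}
          token: ${{ secrets.PAT_TOKEN || github.token }}
          submodules: recursive

      - name: Update SDK submodule
        # No event_name guard: a non-empty sdk-commit is enough for both trigger types.
        if: ${{ inputs.sdk-commit != '' }}
        env:
          SDK_COMMIT: ${{ inputs.sdk-commit }}
        run: |
          # Hypothetical submodule path; adjust to the repository layout.
          git -C vendor/sdk fetch origin "$SDK_COMMIT"
          git -C vendor/sdk checkout "$SDK_COMMIT"
```

Because both triggers declare `sdk-commit` under the same name, `inputs.sdk-commit` resolves the same way whether the run starts from workflow_dispatch or workflow_call.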

Backward compatibility

All existing triggers (workflow_dispatch, pull_request_target) work exactly as before. The workflow_call inputs use the same names as the existing workflow_dispatch inputs where applicable, so the inputs.* references in steps resolve correctly for both trigger types.

Validation

End-to-end runs completed successfully using these reusable workflows called from OpenHands/evaluation (all with eval_limit=1 or 5, claude-sonnet-4-5-20250929):

Benchmark	Result	SDK trigger	Eval workflow run
swebench	5/5 completed, 5/5 resolved, 0 errors	#24406322597	#24406353727
gaia	5/5 completed, 3/5 resolved, 0 errors	#24406322541	#24406354480
commit0	5/5 completed, 2/5 resolved, 0 errors	#24406322608	#24406355165
swebenchmultimodal	1/1 completed, 0/1 resolved, 0 errors	#24414072255	#24414097101
swtbench	5/5 inferred, 0 errors (eval harness blocked by unrelated `swtbench-highcore` node pool capacity)	#24406322601	#24406353744

The swebenchmultimodal run exercised fresh GHCR pushes for eval-builder, eval-base, and eval-agent-server, confirming package permissions are correctly scoped for cross-repo writes.

GHCR package permissions

For the evaluation repo's GITHUB_TOKEN to push images, the following GHCR packages must grant OpenHands/evaluation write access via Package settings → Manage Actions access:

ghcr.io/openhands/eval-builder
ghcr.io/openhands/eval-base
ghcr.io/openhands/eval-agent-server
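For context, a GHCR login step that relies on GITHUB_TOKEN typically looks like the sketch below. Whether the build workflows use `docker/login-action` is an assumption; the job id and runner are illustrative. Under workflow_call the token belongs to the calling repository (OpenHands/evaluation), which is why the packages above need to grant that repository write access.

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      # A called workflow can keep or reduce the caller's permissions, not raise them,
      # so the calling job must also grant packages: write.
      packages: write
    steps:
      - name: Log in to GHCR
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          # Under workflow_call this token is scoped to the caller's repository.
          password: ${{ secrets.GITHUB_TOKEN }}
```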

Test plan

- Test cross-repo workflow_call from evaluation repo (validated above for all 5 benchmarks)
- Verify GHCR package permissions allow push from evaluation repo context
- Verify direct workflow_dispatch triggers still work for each build workflow
- Verify pull_request_target label triggers still work

🤖 Generated with Claude Code

Enable build workflows to be called as reusable workflows from the evaluation repo, eliminating the need for cross-repo workflow_dispatch polling and the temporary branch hack in orchestrate_eval.py.

Changes per workflow:
- Add workflow_call trigger with inputs: sdk-commit, n-limit, instance-ids, benchmarks-ref (and agent-type where applicable)
- Add explicit repository/token to checkout for cross-repo context
- Add benchmarks-ref input to checkout-ref determination
- Simplify SDK submodule condition (remove event_name guard)
- Update concurrency group to use sdk-commit when available

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

all-hands-bot

🟢 Good taste - Clean, focused changes that solve a real problem without adding unnecessary complexity.

Verdict: ✅ Worth merging after testing

Key insight: The fallback patterns (inputs.sdk-commit || github.ref, secrets.PAT_TOKEN || github.token) elegantly handle both workflow_call and existing triggers without conditional spaghetti.

One trade-off to note: hardcoding repository: OpenHands/benchmarks breaks fork support for workflow_dispatch, but this is pragmatic since the primary use case is cross-repo calls from evaluation.

Test plan checkboxes are unchecked - verify all scenarios work before merge.

When build workflows are called via workflow_call from the evaluation repo, GITHUB_TOKEN is scoped to the calling repo and can't push to packages owned by the benchmarks repo (ghcr.io/openhands/eval-builder). Fall back to PAT_TOKEN which has cross-repo write:packages scope.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…pushes" This reverts commit 9585262.

Matches the old orchestrate_eval.py polling bound in the evaluation repo (MAX_POLL_ATTEMPTS=600 × 60s = 10h). This was previously relaxed to 24h on swtbench (#527 context) but 10h is the operationally-meaningful ceiling: the evaluation pipeline can't usefully wait longer than that.

Applied to the 5 workflows that PR #666 adds workflow_call triggers for (swebench, gaia, swtbench, commit0, swebenchmultimodal). Other build workflows (swegym, swesmith, multiswebench) keep their existing 1440 since they aren't part of this PR's scope.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
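The arithmetic behind the new bound: 600 attempts × 60 s = 36,000 s = 600 min = 10 h, versus 1440 min = 24 h on the workflows left unchanged. A sketch of how that would appear on a build job, assuming the limit is set with the standard job-level `timeout-minutes` key (the job id, runner, and step are illustrative):

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    # 600 min = 10 h, matching MAX_POLL_ATTEMPTS=600 x 60 s in the old polling loop.
    # Workflows outside this PR's scope keep 1440 (24 h).
    timeout-minutes: 600
    steps:
      - run: echo "Build images here."
```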

all-hands-bot

🟡 Acceptable - Clean workflow_call integration with good fallback patterns.

Two items in the test plan remain unchecked (workflow_dispatch and pull_request_target backward compatibility). Verify these before merging to ensure existing triggers still work.

Verdict: ✅ Worth merging after completing test plan validation

all-hands-bot · 2026-04-15T09:00:34Z

+        description: 'Limit number of images to build (0 for all)'
+        required: false
+        type: string
+        default: '0'


🟡 Suggestion: Inconsistent default for n-limit across workflows.

build-swtbench: default: '0'

build-commit0, build-swebench, build-swebenchmultimodal: default: ''

Is there a reason swtbench uses '0' while others use empty string? Consider standardizing unless the difference is intentional.
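One way to make the two defaults behave identically, if standardizing the input is not desirable, is to normalize the value in a step before it reaches the build command. This is a sketch only: the step id, the `limit_args` output, and the `--n-limit` flag are hypothetical and not taken from the workflows.

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Normalize n-limit
        id: limit
        env:
          N_LIMIT: ${{ inputs.n-limit }}
        run: |
          # Treat both '' and '0' as "build everything".
          if [ -z "$N_LIMIT" ] || [ "$N_LIMIT" = "0" ]; then
            echo "limit_args=" >> "$GITHUB_OUTPUT"
          else
            # Hypothetical flag name for the build script.
            echo "limit_args=--n-limit $N_LIMIT" >> "$GITHUB_OUTPUT"
          fi

      - name: Show resolved arguments
        run: echo "Build arguments: ${{ steps.limit.outputs.limit_args }}"
```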

all-hands-bot · 2026-04-15T09:00:34Z

+      benchmarks-ref:
+        description: 'Benchmarks repo ref to checkout (for cross-repo calls)'
+        required: false
+        type: string


🟡 Suggestion: GAIA workflow doesn't include n-limit or instance-ids inputs.

The other 4 build workflows have these inputs for limiting builds. Is this intentional because GAIA doesn't support limiting, or should these be added for consistency?

* ci: add workflow_call triggers to build workflows

  Enable build workflows to be called as reusable workflows from the evaluation repo, eliminating the need for cross-repo workflow_dispatch polling and the temporary branch hack in orchestrate_eval.py.

  Changes per workflow:
  - Add workflow_call trigger with inputs: sdk-commit, n-limit, instance-ids, benchmarks-ref (and agent-type where applicable)
  - Add explicit repository/token to checkout for cross-repo context
  - Add benchmarks-ref input to checkout-ref determination
  - Simplify SDK submodule condition (remove event_name guard)
  - Update concurrency group to use sdk-commit when available

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Use PAT_TOKEN for GHCR login to fix cross-repo workflow_call pushes

  When build workflows are called via workflow_call from the evaluation repo, GITHUB_TOKEN is scoped to the calling repo and can't push to packages owned by the benchmarks repo (ghcr.io/openhands/eval-builder). Fall back to PAT_TOKEN which has cross-repo write:packages scope.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Revert "Use PAT_TOKEN for GHCR login to fix cross-repo workflow_call pushes"

  This reverts commit 9585262.

* ci: set build timeouts to 10h on the 5 workflow_call build workflows

  Matches the old orchestrate_eval.py polling bound in the evaluation repo (MAX_POLL_ATTEMPTS=600 × 60s = 10h). This was previously relaxed to 24h on swtbench (OpenHands#527 context) but 10h is the operationally-meaningful ceiling: the evaluation pipeline can't usefully wait longer than that.

  Applied to the 5 workflows that PR OpenHands#666 adds workflow_call triggers for (swebench, gaia, swtbench, commit0, swebenchmultimodal). Other build workflows (swegym, swesmith, multiswebench) keep their existing 1440 since they aren't part of this PR's scope.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Debug Agent <debug@example.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Surgical revert of the swtbench portion of #651: puts the SWT-bench image-build workflow back on blacksmith-32vcpu-ubuntu-2204 with the useblacksmith/setup-docker-builder builder. Blacksmith's larger persistent disk masked the unbounded local-image growth that now fails on ubuntu-latest-8core (evaluation#495).

Kept intact from main:
- workflow_call trigger and cross-repo checkout (#666)
- all input/env plumbing and downstream steps

Intended as a parallel fallback branch while the real fix in #672 (docker rmi after push + free-disk-space + preflight) is validated.

all-hands-bot reviewed Apr 14, 2026


Debug Agent and others added 3 commits April 14, 2026 03:33

Revert "Use PAT_TOKEN for GHCR login to fix cross-repo workflow_call …

f63ba9e

…pushes" This reverts commit 9585262.

simonrosenberg self-assigned this Apr 15, 2026

simonrosenberg requested a review from all-hands-bot April 15, 2026 08:58

all-hands-bot reviewed Apr 15, 2026


xingyaoww approved these changes Apr 15, 2026


simonrosenberg merged commit 832b2f9 into main Apr 15, 2026
3 checks passed

simonrosenberg mentioned this pull request Apr 20, 2026

[DO NOT MERGE] ci(swtbench): revert to Blacksmith runner (fallback for #495) #673

Draft

simonrosenberg merged 4 commits into main from feat/workflow-call-build-images
