ci: add workflow_call triggers to build workflows #666
Enable build workflows to be called as reusable workflows from the evaluation repo, eliminating the need for cross-repo workflow_dispatch polling and the temporary branch hack in orchestrate_eval.py.

Changes per workflow:
- Add workflow_call trigger with inputs: sdk-commit, n-limit, instance-ids, benchmarks-ref (and agent-type where applicable)
- Add explicit repository/token to checkout for cross-repo context
- Add benchmarks-ref input to checkout-ref determination
- Simplify SDK submodule condition (remove event_name guard)
- Update concurrency group to use sdk-commit when available

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
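As a rough illustration of the trigger described in this commit, a build workflow's on: block might look like the sketch below. The input names come from the commit message; the descriptions (other than the benchmarks-ref one quoted later in the review) are assumptions, and this is not the PR's actual diff.

```yaml
# Sketch only: input names are from the commit message, descriptions are assumed.
on:
  workflow_call:
    inputs:
      sdk-commit:
        description: 'SDK commit to build against'
        required: false
        type: string
      n-limit:
        description: 'Limit number of images to build'
        required: false
        type: string
      instance-ids:
        description: 'Instance IDs to build'
        required: false
        type: string
      benchmarks-ref:
        description: 'Benchmarks repo ref to checkout (for cross-repo calls)'
        required: false
        type: string
```

A caller job in the evaluation repo could then reference the workflow with uses:. The workflow file name, the ref, and the input values here are placeholders, not names taken from the PR.

```yaml
# Hypothetical caller job; file name, ref, and values are placeholders.
jobs:
  build-images:
    uses: OpenHands/benchmarks/.github/workflows/build-swebench-images.yml@main
    with:
      sdk-commit: '<sdk-commit-sha>'
      n-limit: '1'
    # Passes the caller's secrets through to the called workflow.
    secrets: inherit
```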
all-hands-bot left a comment
🟢 Good taste - Clean, focused changes that solve a real problem without adding unnecessary complexity.
Verdict: ✅ Worth merging after testing
Key insight: The fallback patterns (inputs.sdk-commit || github.ref, secrets.PAT_TOKEN || github.token) elegantly handle both workflow_call and existing triggers without conditional spaghetti.
One trade-off to note: hardcoding repository: OpenHands/benchmarks breaks fork support for workflow_dispatch, but this is pragmatic since the primary use case is cross-repo calls from evaluation.
Test plan checkboxes are unchecked - verify all scenarios work before merge.
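The fallback patterns mentioned in this review can be sketched as follows. The two expressions and the hardcoded repository are quoted from the review; the concurrency group prefix, the job layout, and using benchmarks-ref directly as the checkout ref are simplifying assumptions (the commit message indicates the real workflows determine the ref in a separate step).

```yaml
# Sketch only: group prefix and job layout are assumed.
concurrency:
  # For workflow_call, inputs.sdk-commit is set; for other triggers it is empty,
  # so the expression falls through to github.ref.
  group: build-swebench-${{ inputs.sdk-commit || github.ref }}

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          # Hardcoded because github.repository resolves to the caller's repo
          # under workflow_call; this is the fork trade-off noted in the review.
          repository: OpenHands/benchmarks
          ref: ${{ inputs.benchmarks-ref || github.ref }}
          token: ${{ secrets.PAT_TOKEN || github.token }}
```

Because an unset input or secret evaluates to an empty (falsy) value, the `||` operator selects the second operand without any event_name conditionals.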
When build workflows are called via workflow_call from the evaluation repo, GITHUB_TOKEN is scoped to the calling repo and can't push to packages owned by the benchmarks repo (ghcr.io/openhands/eval-builder). Fall back to PAT_TOKEN, which has cross-repo write:packages scope.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
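A minimal sketch of what such a token fallback could look like in a login step, assuming the workflows use docker/login-action (the commit message does not say which login mechanism is used):

```yaml
# Sketch only: the use of docker/login-action is an assumption.
- name: Log in to GHCR
  uses: docker/login-action@v3
  with:
    registry: ghcr.io
    username: ${{ github.actor }}
    # Prefer PAT_TOKEN when the secret is available, otherwise use GITHUB_TOKEN.
    password: ${{ secrets.PAT_TOKEN || secrets.GITHUB_TOKEN }}
```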
…pushes" This reverts commit 9585262.
Matches the old orchestrate_eval.py polling bound in the evaluation repo (MAX_POLL_ATTEMPTS=600 × 60s = 10h). This was previously relaxed to 24h on swtbench (#527 context), but 10h is the operationally meaningful ceiling: the evaluation pipeline can't usefully wait longer than that.

Applied to the 5 workflows that PR #666 adds workflow_call triggers for (swebench, gaia, swtbench, commit0, swebenchmultimodal). Other build workflows (swegym, swesmith, multiswebench) keep their existing 1440 since they aren't part of this PR's scope.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
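The arithmetic behind the new bound, expressed as a job-level timeout. This is a sketch: whether the real workflows set the timeout at the job or step level is an assumption, as are the job name and runner.

```yaml
# Sketch only: job name, runner, and job-level placement are assumed.
jobs:
  build:
    runs-on: ubuntu-latest
    # 600 polls x 60 s = 36,000 s = 600 min = 10 h
    # (the untouched workflows keep 1440 min = 24 h)
    timeout-minutes: 600
```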
all-hands-bot left a comment
🟡 Acceptable - Clean workflow_call integration with good fallback patterns.
Two items in the test plan remain unchecked (workflow_dispatch and pull_request_target backward compatibility). Verify these before merging to ensure existing triggers still work.
Verdict: ✅ Worth merging after completing test plan validation
    description: 'Limit number of images to build (0 for all)'
    required: false
    type: string
    default: '0'
🟡 Suggestion: Inconsistent default for n-limit across workflows.
- build-swtbench: default: '0'
- build-commit0, build-swebench, build-swebenchmultimodal: default: ''
Is there a reason swtbench uses '0' while others use empty string? Consider standardizing unless the difference is intentional.
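If the difference turns out to be unintentional, one way to make it harmless would be for the consuming step to treat both values as "no limit". The sketch below assumes that an empty string and '0' are both meant to build all images, which the PR does not confirm; the step name and echo messages are illustrative.

```yaml
# Sketch only: assumes '' and '0' both mean "build all images".
- name: Resolve image limit
  env:
    N_LIMIT: ${{ inputs.n-limit }}
  run: |
    if [ -z "$N_LIMIT" ] || [ "$N_LIMIT" = "0" ]; then
      echo "Building all images"
    else
      echo "Building at most $N_LIMIT images"
    fi
```

Passing the input through env rather than interpolating it into the script keeps the shell from interpreting the value as code.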
    benchmarks-ref:
      description: 'Benchmarks repo ref to checkout (for cross-repo calls)'
      required: false
      type: string
🟡 Suggestion: GAIA workflow doesn't include n-limit or instance-ids inputs.
The other 4 build workflows have these inputs for limiting builds. Is this intentional because GAIA doesn't support limiting, or should these be added for consistency?
* ci: add workflow_call triggers to build workflows

  Enable build workflows to be called as reusable workflows from the evaluation repo, eliminating the need for cross-repo workflow_dispatch polling and the temporary branch hack in orchestrate_eval.py.

  Changes per workflow:
  - Add workflow_call trigger with inputs: sdk-commit, n-limit, instance-ids, benchmarks-ref (and agent-type where applicable)
  - Add explicit repository/token to checkout for cross-repo context
  - Add benchmarks-ref input to checkout-ref determination
  - Simplify SDK submodule condition (remove event_name guard)
  - Update concurrency group to use sdk-commit when available

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Use PAT_TOKEN for GHCR login to fix cross-repo workflow_call pushes

  When build workflows are called via workflow_call from the evaluation repo, GITHUB_TOKEN is scoped to the calling repo and can't push to packages owned by the benchmarks repo (ghcr.io/openhands/eval-builder). Fall back to PAT_TOKEN which has cross-repo write:packages scope.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Revert "Use PAT_TOKEN for GHCR login to fix cross-repo workflow_call pushes"

  This reverts commit 9585262.

* ci: set build timeouts to 10h on the 5 workflow_call build workflows

  Matches the old orchestrate_eval.py polling bound in the evaluation repo (MAX_POLL_ATTEMPTS=600 × 60s = 10h). This was previously relaxed to 24h on swtbench (OpenHands#527 context) but 10h is the operationally-meaningful ceiling: the evaluation pipeline can't usefully wait longer than that.

  Applied to the 5 workflows that PR OpenHands#666 adds workflow_call triggers for (swebench, gaia, swtbench, commit0, swebenchmultimodal). Other build workflows (swegym, swesmith, multiswebench) keep their existing 1440 since they aren't part of this PR's scope.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Debug Agent <debug@example.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Surgical revert of the swtbench portion of #651: puts the SWT-bench image-build workflow back on blacksmith-32vcpu-ubuntu-2204 with the useblacksmith/setup-docker-builder builder. Blacksmith's larger persistent disk masked the unbounded local-image growth that now fails on ubuntu-latest-8core (evaluation#495).

Kept intact from main:
- workflow_call trigger and cross-repo checkout (#666)
- all input/env plumbing and downstream steps

Intended as a parallel fallback branch while the real fix in #672 (docker rmi after push + free-disk-space + preflight) is validated.
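For orientation, the runner and builder combination named in this revert might appear in the workflow roughly as below. The runner label and action name are taken from the description; the job name, step name, and the @v1 version tag are assumptions.

```yaml
# Sketch only: job name, step name, and the @v1 tag are assumed.
jobs:
  build:
    runs-on: blacksmith-32vcpu-ubuntu-2204
    steps:
      - name: Set up Docker builder
        uses: useblacksmith/setup-docker-builder@v1
```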
Summary
- Add workflow_call triggers to the 5 build workflows used by the evaluation pipeline (swebench, gaia, swtbench, commit0, swebenchmultimodal)
- Add explicit repository/token to checkout for cross-repo calls from OpenHands/evaluation
- Update the concurrency group to use sdk-commit when available

This enables the evaluation repo to call these build workflows as reusable workflows via uses:, eliminating the need for cross-repo workflow_dispatch polling and the temporary branch hack in orchestrate_eval.py.

Changes per workflow
Each build workflow gets:
- workflow_call: trigger with inputs for sdk-commit, n-limit, instance-ids, benchmarks-ref (and agent-type where applicable)
- repository: OpenHands/benchmarks + token on the checkout step (required because github.repository resolves to the caller's repo for workflow_call)
- benchmarks-ref as the first priority in checkout ref determination
- Simplified SDK submodule condition (removed the event_name guard since inputs.sdk-commit is sufficient)
- Concurrency group using ${{ inputs.sdk-commit || github.ref }} to avoid serializing builds for different SDK commits

Backward compatibility
All existing triggers (workflow_dispatch, pull_request_target) work exactly as before. The workflow_call inputs use the same names as the existing workflow_dispatch inputs where applicable, so the inputs.* references in steps resolve correctly for both trigger types.

Validation
End-to-end runs completed successfully using these reusable workflows called from OpenHands/evaluation (all with eval_limit=1 or 5, claude-sonnet-4-5-20250929):
… swtbench-highcore node pool capacity)

The swebenchmultimodal run exercised fresh GHCR pushes for eval-builder, eval-base, and eval-agent-server, confirming package permissions are correctly scoped for cross-repo writes.

GHCR package permissions
For the evaluation repo's GITHUB_TOKEN to push images, the following GHCR packages must grant OpenHands/evaluation write access via Package settings → Manage Actions access:
- ghcr.io/openhands/eval-builder
- ghcr.io/openhands/eval-base
- ghcr.io/openhands/eval-agent-server

Related
Test plan
- workflow_call from evaluation repo (validated above for all 5 benchmarks)
- workflow_dispatch triggers still work for each build workflow
- pull_request_target label triggers still work

🤖 Generated with Claude Code