
ci: add workflow_call triggers to build workflows #666

Merged: simonrosenberg merged 4 commits into main from feat/workflow-call-build-images on Apr 15, 2026
@simonrosenberg (Collaborator) commented Apr 14, 2026

Summary

  • Add workflow_call triggers to the 5 build workflows used by the evaluation pipeline (swebench, gaia, swtbench, commit0, swebenchmultimodal)
  • Update checkout steps to work when called cross-repo from OpenHands/evaluation
  • Update concurrency groups to key on sdk-commit when available

This enables the evaluation repo to call these build workflows as reusable workflows via uses:, eliminating the need for cross-repo workflow_dispatch polling and the temporary branch hack in orchestrate_eval.py.
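For illustration, a caller in OpenHands/evaluation might look roughly like the sketch below. The workflow file name `build-swebench.yml`, the `@main` ref, the job names, and the input values are assumptions; only the input names (`sdk-commit`, `n-limit`, `benchmarks-ref`) come from this PR.

```yaml
# Hypothetical caller workflow in OpenHands/evaluation.
# File name, ref, job names, and values are assumptions for illustration.
name: run-eval

on:
  workflow_dispatch:
    inputs:
      sdk-commit:
        description: 'SDK commit to build images for'
        required: true
        type: string

jobs:
  build-images:
    # Reusable workflows are referenced as
    # {owner}/{repo}/.github/workflows/{file}@{ref}.
    uses: OpenHands/benchmarks/.github/workflows/build-swebench.yml@main
    with:
      sdk-commit: ${{ inputs.sdk-commit }}
      n-limit: '5'
      benchmarks-ref: main
    secrets: inherit

  evaluate:
    # Runs only after the reusable build workflow has finished,
    # so no polling loop is needed.
    needs: build-images
    runs-on: ubuntu-latest
    steps:
      - run: echo "Images built; start evaluation here"
```

Because the build runs as a job in the caller's own workflow run, `needs:` takes the place of the dispatch-and-poll loop.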

Changes per workflow

Each build workflow gets:

  1. A workflow_call: trigger with inputs for sdk-commit, n-limit, instance-ids, benchmarks-ref (and agent-type where applicable)
  2. Explicit repository: OpenHands/benchmarks + token on the checkout step (required because github.repository resolves to the caller's repo for workflow_call)
  3. benchmarks-ref as the first priority in checkout ref determination
  4. Simplified SDK submodule update condition (removed event_name guard since inputs.sdk-commit is sufficient)
  5. Updated concurrency group: ${{ inputs.sdk-commit || github.ref }} to avoid serializing builds for different SDK commits
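Taken together, the five changes above could look roughly like the following sketch of a single build workflow. The workflow name, input descriptions, concurrency-group prefix, the `'main'` fallback ref, and the submodule path `vendor/sdk` are all assumptions, and the token expression mirrors the fallback quoted in the review rather than the actual diff.

```yaml
# Sketch only: names, descriptions, the fallback ref, and the submodule
# path are assumptions, not copied from the PR diff.
name: build-swebench

on:
  workflow_call:
    inputs:
      sdk-commit:
        description: 'SDK commit to build against'
        required: false
        type: string
      n-limit:
        description: 'Limit number of images to build'
        required: false
        type: string
      instance-ids:
        description: 'Comma-separated instance IDs to build'
        required: false
        type: string
      benchmarks-ref:
        description: 'Benchmarks repo ref to checkout (for cross-repo calls)'
        required: false
        type: string

# Key on the SDK commit when present so builds for different commits
# do not queue behind each other; otherwise fall back to the git ref.
concurrency:
  group: build-swebench-${{ inputs.sdk-commit || github.ref }}

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout benchmarks
        uses: actions/checkout@v4
        with:
          # Under workflow_call, github.repository is the caller's repo,
          # so the repository is named explicitly.
          repository: OpenHands/benchmarks
          ref: ${{ inputs.benchmarks-ref || 'main' }}
          token: ${{ secrets.PAT_TOKEN || github.token }}
          submodules: recursive

      - name: Update SDK submodule
        # No event_name guard: a non-empty sdk-commit is sufficient.
        if: ${{ inputs.sdk-commit != '' }}
        env:
          SDK_COMMIT: ${{ inputs.sdk-commit }}
          SDK_PATH: vendor/sdk  # hypothetical submodule path
        run: |
          git -C "$SDK_PATH" fetch origin "$SDK_COMMIT"
          git -C "$SDK_PATH" checkout "$SDK_COMMIT"
```

Passing `sdk-commit` through `env` rather than interpolating it into the script is a defensive choice in this sketch, not something the PR states.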

Backward compatibility

All existing triggers (workflow_dispatch, pull_request_target) work exactly as before. The workflow_call inputs use the same names as the existing workflow_dispatch inputs where applicable, so the inputs.* references in steps resolve correctly for both trigger types.
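A minimal sketch of why shared input names keep both paths working: the `inputs` context is populated for `workflow_dispatch` and `workflow_call` alike, so a single expression serves both. The `labeled` event type, the input description, and the job below are assumptions for illustration.

```yaml
# Sketch: one input name declared under both triggers.
# The 'labeled' type, description, and job are assumptions.
name: build-example

on:
  pull_request_target:
    types: [labeled]
  workflow_dispatch:
    inputs:
      sdk-commit:
        description: 'SDK commit to build against'
        required: false
        type: string
  workflow_call:
    inputs:
      sdk-commit:
        description: 'SDK commit to build against'
        required: false
        type: string

jobs:
  show-inputs:
    runs-on: ubuntu-latest
    steps:
      - name: Print resolved input
        env:
          # Same expression for workflow_dispatch and workflow_call;
          # empty for pull_request_target, which carries no inputs.
          SDK_COMMIT: ${{ inputs.sdk-commit }}
        run: echo "sdk-commit=${SDK_COMMIT:-<unset>}"
```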

Validation

End-to-end runs completed successfully using these reusable workflows called from OpenHands/evaluation (all with eval_limit=1 or 5, claude-sonnet-4-5-20250929):

| Benchmark | Result | SDK trigger | Eval workflow run |
|---|---|---|---|
| swebench | 5/5 completed, 5/5 resolved, 0 errors | #24406322597 | #24406353727 |
| gaia | 5/5 completed, 3/5 resolved, 0 errors | #24406322541 | #24406354480 |
| commit0 | 5/5 completed, 2/5 resolved, 0 errors | #24406322608 | #24406355165 |
| swebenchmultimodal | 1/1 completed, 0/1 resolved, 0 errors | #24414072255 | #24414097101 |
| swtbench | 5/5 inferred, 0 errors (eval harness blocked by unrelated swtbench-highcore node pool capacity) | #24406322601 | #24406353744 |

The swebenchmultimodal run exercised fresh GHCR pushes for eval-builder, eval-base, and eval-agent-server, confirming package permissions are correctly scoped for cross-repo writes.

GHCR package permissions

For the evaluation repo's GITHUB_TOKEN to push images, the following GHCR packages must grant OpenHands/evaluation write access via Package settings → Manage Actions access:

  • ghcr.io/openhands/eval-builder
  • ghcr.io/openhands/eval-base
  • ghcr.io/openhands/eval-agent-server
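On the workflow side, the caller also has to grant `packages: write` to its `GITHUB_TOKEN`, since a called workflow can only keep or reduce the permissions it receives. A hedged sketch of such a caller job follows; the workflow file name, ref, and job name are assumptions.

```yaml
# Hypothetical caller in OpenHands/evaluation; file name, ref, and job
# name are assumptions.
name: build-with-ghcr-push

on:
  workflow_dispatch:
    inputs:
      sdk-commit:
        description: 'SDK commit to build images for'
        required: true
        type: string

jobs:
  build-images:
    # The called workflow inherits at most these GITHUB_TOKEN permissions.
    permissions:
      contents: read
      packages: write
    uses: OpenHands/benchmarks/.github/workflows/build-swebenchmultimodal.yml@main
    with:
      sdk-commit: ${{ inputs.sdk-commit }}
```

The package-level "Manage Actions access" setting and the job-level `permissions` block are separate controls. This sketch covers only the second.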

Related

  • Companion PR in evaluation repo: OpenHands/evaluation#471 (moves orchestration from K8s to GH Actions)
  • Resolves part of OpenHands/evaluation#371

Test plan

  • Test cross-repo workflow_call from evaluation repo (validated above for all 5 benchmarks)
  • Verify GHCR package permissions allow push from evaluation repo context
  • Verify direct workflow_dispatch triggers still work for each build workflow
  • Verify pull_request_target label triggers still work

🤖 Generated with Claude Code

Enable build workflows to be called as reusable workflows from the
evaluation repo, eliminating the need for cross-repo workflow_dispatch
polling and the temporary branch hack in orchestrate_eval.py.

Changes per workflow:
- Add workflow_call trigger with inputs: sdk-commit, n-limit,
  instance-ids, benchmarks-ref (and agent-type where applicable)
- Add explicit repository/token to checkout for cross-repo context
- Add benchmarks-ref input to checkout-ref determination
- Simplify SDK submodule condition (remove event_name guard)
- Update concurrency group to use sdk-commit when available

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@all-hands-bot (Collaborator) left a comment

🟢 Good taste - Clean, focused changes that solve a real problem without adding unnecessary complexity.

Verdict: ✅ Worth merging after testing

Key insight: The fallback patterns (inputs.sdk-commit || github.ref, secrets.PAT_TOKEN || github.token) elegantly handle both workflow_call and existing triggers without conditional spaghetti.

One trade-off to note: hardcoding repository: OpenHands/benchmarks breaks fork support for workflow_dispatch, but this is pragmatic since the primary use case is cross-repo calls from evaluation.

Test plan checkboxes are unchecked - verify all scenarios work before merge.

Debug Agent and others added 3 commits April 14, 2026 03:33
When build workflows are called via workflow_call from the evaluation
repo, GITHUB_TOKEN is scoped to the calling repo and can't push to
packages owned by the benchmarks repo (ghcr.io/openhands/eval-builder).
Fall back to PAT_TOKEN which has cross-repo write:packages scope.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Matches the old orchestrate_eval.py polling bound in the evaluation repo
(MAX_POLL_ATTEMPTS=600 × 60s = 10h). This was previously relaxed to 24h
on swtbench (#527 context) but 10h is the operationally-meaningful
ceiling: the evaluation pipeline can't usefully wait longer than that.

Applied to the 5 workflows that PR #666 adds workflow_call triggers for
(swebench, gaia, swtbench, commit0, swebenchmultimodal). Other build
workflows (swegym, swesmith, multiswebench) keep their existing 1440
since they aren't part of this PR's scope.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
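
As a sketch, a 10-hour ceiling expressed with `timeout-minutes` could look like this. Whether the real workflows set it at job or step level is not shown in this conversation, and the job and step below are placeholders.

```yaml
# Sketch: job and step are placeholders; only the 600-minute value
# (10h, down from 1440 = 24h) reflects the commit message.
name: build-timeout-example

on:
  workflow_dispatch:

jobs:
  build:
    runs-on: ubuntu-latest
    # 600 minutes = 10 hours, matching MAX_POLL_ATTEMPTS=600 x 60s.
    timeout-minutes: 600
    steps:
      - run: echo "build images here"
```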
@simonrosenberg simonrosenberg self-assigned this Apr 15, 2026
@all-hands-bot (Collaborator) left a comment

🟡 Acceptable - Clean workflow_call integration with good fallback patterns.

Two items in the test plan remain unchecked (workflow_dispatch and pull_request_target backward compatibility). Verify these before merging to ensure existing triggers still work.

Verdict: ✅ Worth merging after completing test plan validation

description: 'Limit number of images to build (0 for all)'
required: false
type: string
default: '0'

🟡 Suggestion: Inconsistent default for n-limit across workflows.

  • build-swtbench: default: '0'
  • build-commit0, build-swebench, build-swebenchmultimodal: default: ''

Is there a reason swtbench uses '0' while others use empty string? Consider standardizing unless the difference is intentional.

benchmarks-ref:
  description: 'Benchmarks repo ref to checkout (for cross-repo calls)'
  required: false
  type: string

🟡 Suggestion: GAIA workflow doesn't include n-limit or instance-ids inputs.

The other 4 build workflows have these inputs for limiting builds. Is this intentional because GAIA doesn't support limiting, or should these be added for consistency?

@simonrosenberg simonrosenberg merged commit 832b2f9 into main Apr 15, 2026
3 checks passed
GaokaiZhang pushed a commit to GaokaiZhang/benchmarks that referenced this pull request Apr 17, 2026
* ci: add workflow_call triggers to build workflows

Enable build workflows to be called as reusable workflows from the
evaluation repo, eliminating the need for cross-repo workflow_dispatch
polling and the temporary branch hack in orchestrate_eval.py.

Changes per workflow:
- Add workflow_call trigger with inputs: sdk-commit, n-limit,
  instance-ids, benchmarks-ref (and agent-type where applicable)
- Add explicit repository/token to checkout for cross-repo context
- Add benchmarks-ref input to checkout-ref determination
- Simplify SDK submodule condition (remove event_name guard)
- Update concurrency group to use sdk-commit when available

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Use PAT_TOKEN for GHCR login to fix cross-repo workflow_call pushes

When build workflows are called via workflow_call from the evaluation
repo, GITHUB_TOKEN is scoped to the calling repo and can't push to
packages owned by the benchmarks repo (ghcr.io/openhands/eval-builder).
Fall back to PAT_TOKEN which has cross-repo write:packages scope.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Revert "Use PAT_TOKEN for GHCR login to fix cross-repo workflow_call pushes"

This reverts commit 9585262.

* ci: set build timeouts to 10h on the 5 workflow_call build workflows

Matches the old orchestrate_eval.py polling bound in the evaluation repo
(MAX_POLL_ATTEMPTS=600 × 60s = 10h). This was previously relaxed to 24h
on swtbench (OpenHands#527 context) but 10h is the operationally-meaningful
ceiling: the evaluation pipeline can't usefully wait longer than that.

Applied to the 5 workflows that PR OpenHands#666 adds workflow_call triggers for
(swebench, gaia, swtbench, commit0, swebenchmultimodal). Other build
workflows (swegym, swesmith, multiswebench) keep their existing 1440
since they aren't part of this PR's scope.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Debug Agent <debug@example.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
simonrosenberg pushed a commit that referenced this pull request Apr 20, 2026
Surgical revert of the swtbench portion of #651 — puts the SWT-bench
image-build workflow back on blacksmith-32vcpu-ubuntu-2204 with the
useblacksmith/setup-docker-builder builder. Blacksmith's larger
persistent disk masked the unbounded local-image growth that now
fails on ubuntu-latest-8core (evaluation#495).

Kept intact from main:
- workflow_call trigger and cross-repo checkout (#666)
- all input/env plumbing and downstream steps

Intended as a parallel fallback branch while the real fix in #672
(docker rmi after push + free-disk-space + preflight) is validated.
3 participants