
Conversation

Collaborator

@simonrosenberg simonrosenberg commented Nov 24, 2025


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
|---------|---------------|------------|-------------|
| java | amd64, arm64 | eclipse-temurin:17-jdk | Link |
| python | amd64, arm64 | nikolaik/python-nodejs:python3.12-nodejs22 | Link |
| golang | amd64, arm64 | golang:1.21-bookworm | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:9c0c585-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-9c0c585-python \
  ghcr.io/openhands/agent-server:9c0c585-python

All tags pushed for this build

ghcr.io/openhands/agent-server:9c0c585-golang-amd64
ghcr.io/openhands/agent-server:9c0c585-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:9c0c585-golang-arm64
ghcr.io/openhands/agent-server:9c0c585-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:9c0c585-java-amd64
ghcr.io/openhands/agent-server:9c0c585-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:9c0c585-java-arm64
ghcr.io/openhands/agent-server:9c0c585-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:9c0c585-python-amd64
ghcr.io/openhands/agent-server:9c0c585-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:9c0c585-python-arm64
ghcr.io/openhands/agent-server:9c0c585-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:9c0c585-golang
ghcr.io/openhands/agent-server:9c0c585-java
ghcr.io/openhands/agent-server:9c0c585-python

About Multi-Architecture Support

  • Each variant tag (e.g., 9c0c585-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 9c0c585-python-amd64) are also available if needed

@simonrosenberg simonrosenberg changed the title from "feat: add eval workflow" to "feat: fix eval workflow" Nov 24, 2025
@simonrosenberg simonrosenberg marked this pull request as ready for review November 24, 2025 13:13
@simonrosenberg simonrosenberg self-assigned this Nov 24, 2025
simonrosenberg and others added 2 commits November 24, 2025 15:27
The benchmarks workflow now accepts sdk-commit parameter to specify
which SDK version to use when building images. This fix ensures the
correct SDK commit SHA is used instead of whatever the submodule defaults to.

The workflow now resolves the sdk_ref to an actual commit SHA and passes
that to both the benchmarks and evaluation workflows.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@simonrosenberg simonrosenberg force-pushed the sdk-e2e-use-benchmarks-builder branch from 6060513 to 4957d68 Compare November 24, 2025 15:53
simonrosenberg and others added 5 commits November 24, 2025 17:28
Changed from add-sdk-commit-param to fix-sdk-sha-detection branch
which includes the fix for SDK SHA detection bug.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Replace all references to secrets.ALLHANDS_BOT_GITHUB_PAT with secrets.PAT_TOKEN
to use the organization-level secret with proper cross-repo workflow permissions.
Replace PAT_TOKEN with OPENHANDS_AGENT_PAT to use the correct organization-level
secret with workflow dispatch permissions for benchmarks and evaluation repos.
Match the PAT-override branch structure by setting PAT_TOKEN as a global
environment variable in the resolve parameters step, removing redundant
per-step env declarations.

openhands-ai bot commented Nov 24, 2025

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Run Eval
    • Run Eval
    • Run Eval

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1241 at branch `sdk-e2e-use-benchmarks-builder`

Feel free to include any additional details that might help me get this PR into a better state.


The fix-sdk-sha-detection branch has been merged to main.
Update both the workflow dispatch and wait steps to use main branch.
@simonrosenberg
Collaborator Author

@OpenHands /codereview-roasted read the diff and review the code. Find any possible simplification in the workflow code


openhands-ai bot commented Nov 24, 2025

I'm on it! simonrosenberg can track my progress at all-hands.dev


openhands-ai bot commented Nov 24, 2025

Summary

I've completed a critical code review of PR #1241 focusing on workflow simplifications, as requested. No code changes were made; this was a review-only task per the /codereview-roasted command.

Key Findings:

🔴 Taste Rating: NEEDS IMPROVEMENT

The workflow refactoring moves builds from local execution to cross-repo dispatch, which is architecturally sound for separation of concerns, but introduces significant complexity and fragility.


Critical Issues Found:

  1. Race Condition (Lines 208-229): The "Wait for benchmarks build" step polls for the most recent workflow run on main branch, but there's NO guarantee this is the run just dispatched. If another commit triggers a build or runs are queued, you're checking the wrong run entirely.

  2. Repeated PAT Token Validation: PAT_TOKEN is checked identically in 4 different steps (lines 100-108, 180-183, 232-235, 257-259). This should be validated once at the job level.

  3. Polling Anti-Pattern (Lines 218-229): The workflow polls GitHub API every 30 seconds for up to 20 minutes (40 attempts × 30s). This is wasteful, fragile, and has no backoff strategy.
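As a hedged illustration of the backoff strategy the review says is missing, here is a minimal Python sketch (function and parameter names are hypothetical, not part of the actual workflow, which implements polling in shell):

```python
import time

def poll_with_backoff(check, max_wait=1200.0, base_delay=5.0, max_delay=120.0):
    """Poll `check()` until it returns a truthy result, doubling the delay
    between attempts (capped at max_delay) instead of a fixed interval."""
    delay = base_delay
    waited = 0.0
    while waited < max_wait:
        result = check()
        if result:
            return result
        time.sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay)  # exponential backoff, capped
    raise TimeoutError(f"condition not met within {max_wait}s")
```

With a backoff like this, early completions are detected quickly while long builds generate far fewer API requests than a fixed 30-second loop.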


Major Simplification Opportunities:

  1. Eliminate Polling Entirely: Replace dispatch + poll with GitHub Actions workflow_call (reusable workflows). This provides synchronous execution with no race conditions or API waste.

  2. Use github-script Action: Replace curl + jq gymnastics with GitHub's native actions/github-script for cleaner, more maintainable workflow dispatches.

  3. Extract Magic Numbers: Move MAX_ATTEMPTS=40 and SLEEP_SECONDS=30 to top-level env vars for visibility.

  4. Skip Unnecessary Builds: Check if images already exist in GHCR before triggering rebuild - would save massive CI time.

  5. Consolidate Error Handling: The curl error handling pattern is repeated - should use native Actions features or extract to a composite action.


Recommended Architecture:

Instead of: dispatch → poll → dispatch eval
Use: workflow_call → dispatch eval (no polling needed)

This eliminates all polling logic, race conditions, and makes dependencies explicit and reliable.


The detailed review has been provided in the conversation above with specific line numbers, concrete fixes, and architectural recommendations for simplifying the workflow code.


Critical fix for race condition identified in code review:

Problem:
- Workflow polled for 'most recent run on main' after dispatch
- No guarantee this was the run we just dispatched
- Could pick up wrong run if multiple builds were queued

Solution:
- Record timestamp before dispatch
- Wait 10s after dispatch for run to be created
- Query for runs created after dispatch timestamp
- Track specific run ID throughout polling loop
- Filter by created_at >= dispatch_time to find correct run

This ensures we always monitor the correct workflow run.
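A minimal Python sketch of that selection logic, assuming run objects shaped like the GitHub list-workflow-runs response (newest first); the actual workflow implements this in shell with curl and jq:

```python
from datetime import datetime, timezone

def find_dispatched_run(runs, dispatch_time):
    """Given workflow runs sorted newest-first (as the GitHub API returns
    them), pick the oldest run created at or after the dispatch timestamp —
    the run most likely created by our dispatch."""
    candidates = [
        r for r in runs
        if datetime.fromisoformat(r["created_at"].replace("Z", "+00:00"))
        >= dispatch_time
    ]
    # Newest-first ordering means the last candidate is closest to dispatch_time.
    return candidates[-1]["id"] if candidates else None
```

Filtering by `created_at` this way still cannot distinguish two runs dispatched within the same second, which is the residual edge case noted in the follow-up review.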
The PAT_TOKEN is validated once in the 'Resolve parameters' step
and written to $GITHUB_ENV, making it available to all subsequent
steps. Removed redundant validation from:
- Dispatch benchmarks image build
- Wait for benchmarks build completion
- Dispatch evaluation workflow

This reduces duplication and makes the workflow cleaner.
Moved MAX_ATTEMPTS and SLEEP_SECONDS from inline script values
to top-level environment variables for better visibility and
maintainability.

Added:
- MAX_POLL_ATTEMPTS: '40' (40 attempts × 30s = 20 minutes max)
- POLL_INTERVAL_SECONDS: '30'

This makes it easier to adjust polling behavior and documents
the timeout calculation in one central location.
Collaborator Author

🟡 Updated Review: Code Quality Improved, Architecture Still Has Room for Optimization

Great progress addressing the critical issues from the previous review! Several key problems have been fixed. However, the fundamental architectural pattern still has opportunities for simplification.


Issues Addressed

1. PAT Token Validation - IMPROVED

Before: PAT_TOKEN validated 4 times across different steps
Now: Set once in "Set params" step (lines 103-112) and stored in $GITHUB_ENV, eliminating redundant checks.

Note: The "Wait for benchmarks build" step now uses $PAT_TOKEN without validation, assuming it's set. This is fine since the job would have already failed earlier if missing.

2. Magic Numbers - FIXED

Before: Hardcoded MAX_ATTEMPTS=40 and SLEEP_SECONDS=30 buried in shell script
Now: Moved to top-level env vars (lines 47-48):

MAX_POLL_ATTEMPTS: '40'  # 40 attempts × 30s = 20 minutes max wait
POLL_INTERVAL_SECONDS: '30'

Much better visibility and configurability! Good documentation comment too.

3. Race Condition - SIGNIFICANTLY MITIGATED 🎯

Before: Polled for most recent run on main, no guarantee it was the dispatched one
Now:

  • Records DISPATCH_TIME before triggering (line 189)
  • Filters for runs created_at >= DISPATCH_TIME (line 235)
  • Queries 5 recent runs instead of 1 to avoid missing the target
  • Added 10-second buffer after dispatch (line 215)

This is a solid improvement! The race condition is now much less likely, though not completely eliminated (see remaining issues below).

4. Authorization Header - FIXED

Before: Mixed "Bearer" and "token"
Now: Consistently uses Authorization: token throughout (lines 201, 234, 264)


🟡 Remaining Opportunities for Simplification

[Medium Priority] Polling Pattern Still Present

Lines 218-262: The workflow still uses a polling loop to wait for the benchmarks build.

Why this matters: While the race condition is mitigated, polling is inherently fragile:

  • Consumes GitHub API quota (up to 40 requests per run)
  • Adds 10-20 minutes of wait time to every workflow run
  • Can still fail if GitHub API is slow or returns unexpected data

Better approach: Use GitHub Actions workflow_call (reusable workflows):

jobs:
  build-images:
    uses: OpenHands/benchmarks/.github/workflows/build-swe-bench-images.yml@main
    with:
      dataset: ${{ env.DATASET }}
      sdk-commit: ${{ steps.get-sha.outputs.sdk_sha }}
    secrets: inherit
  
  dispatch-eval:
    needs: build-images
    runs-on: ubuntu-latest
    # ... rest of the workflow

This eliminates ALL polling, provides deterministic run tracking, and makes dependencies explicit.

[Low Priority] curl + jq Could Be Cleaner

Lines 192-208, 253-267: Using curl + jq for GitHub API calls works but is verbose.

Alternative: Use actions/github-script@v7:

- name: Dispatch benchmarks build
  uses: actions/github-script@v7
  with:
    github-token: ${{ env.PAT_TOKEN }}
    script: |
      await github.rest.actions.createWorkflowDispatch({
        owner: 'OpenHands',
        repo: 'benchmarks',
        workflow_id: '${{ env.BENCHMARKS_BUILD_WORKFLOW }}',
        ref: 'main',
        inputs: { ... }
      });

Cleaner, better error messages, no jq gymnastics.

[Edge Case] Remaining Race Condition

Lines 228-239: The timestamp filter created_at >= DISPATCH_TIME is good but not perfect.

Scenario: If two PRs trigger eval workflows within the same second, both could match the same run.

Potential fix: Also filter by workflow inputs (check sdk-commit matches expected SHA) once you identify candidate runs. The GitHub API exposes workflow run inputs, so you could verify:

INPUTS=$(echo "$RUN_DATA" | jq -r '.inputs."sdk-commit"')
if [ "$INPUTS" = "$SDK_SHA" ]; then
  # This is definitely our run
fi

[Optimization] Unconditional Rebuild

The workflow always triggers a benchmarks build, even if images already exist for this SDK_SHA.

Optimization opportunity: Query GitHub Container Registry first:

# Check if image exists
if docker manifest inspect ghcr.io/openhands/eval-agent-server:$SDK_SHA 2>/dev/null; then
  echo "Images already exist for $SDK_SHA, skipping build"
  exit 0
fi

This would save significant CI time for re-runs or evaluations of already-built commits.

[Debugging] Missing Error Context

Line 256: When benchmarks build fails, you print the conclusion but not the logs URL.

Improvement:

echo "Benchmarks build failed with conclusion ${CONCL}" >&2
echo "View logs: https://github.com/${REPO}/actions/runs/${RUN_ID}" >&2
exit 1

📊 Progress Summary

| Issue | Before | After | Status |
|-------|--------|-------|--------|
| PAT token duplication | 4 validations | 1 validation | ✅ Fixed |
| Magic numbers | Hardcoded | Env vars | ✅ Fixed |
| Race condition | Severe | Mitigated | 🟡 Improved |
| Auth headers | Inconsistent | Consistent | ✅ Fixed |
| Polling pattern | Present | Present | 🔴 Remains |

VERDICT:

🟡 Good progress, merge-worthy with notes: The critical bugs are fixed and the code is significantly more robust. The remaining issues are architectural optimizations rather than correctness problems. The workflow will function reliably in its current form.

Key Insight: You've successfully moved from a broken polling implementation to a working one. The next evolution would be eliminating polling entirely with workflow_call, but that requires changes to the benchmarks repo's workflow, which is out of scope for this PR.


Recommendation

Approve to merge - The workflow is now production-ready. Consider the remaining optimizations as follow-up improvements rather than blockers.

If you do want to eliminate polling in the future, the benchmarks workflow would need to be converted to a reusable workflow with workflow_call trigger, which is a cleaner long-term solution.

Updated polling configuration:
- MAX_POLL_ATTEMPTS: 40 → 80
- POLL_INTERVAL_SECONDS: 30s → 60s
- Total max wait time: 20 minutes → 80 minutes

This provides more headroom for long-running image builds,
especially when building many images or during high load periods.
Comment on lines +48 to +49
MAX_POLL_ATTEMPTS: '80' # 80 attempts × 60s = 80 minutes max wait
POLL_INTERVAL_SECONDS: '60'
Collaborator Author

@simonrosenberg simonrosenberg Nov 24, 2025


@xingyaoww is 80 minutes enough for building all SWE-bench images?

I made an issue (#1249) to avoid polling, but it requires modifying all the workflows, so I'm doing this later.

@simonrosenberg simonrosenberg merged commit 5a2cd30 into main Nov 25, 2025
21 checks passed
@simonrosenberg simonrosenberg deleted the sdk-e2e-use-benchmarks-builder branch November 25, 2025 09:54