
Conversation

Collaborator

@simonrosenberg simonrosenberg commented Nov 24, 2025


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
|---------|---------------|------------|-------------|
| java | amd64, arm64 | eclipse-temurin:17-jdk | Link |
| python | amd64, arm64 | nikolaik/python-nodejs:python3.12-nodejs22 | Link |
| golang | amd64, arm64 | golang:1.21-bookworm | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:9c0c585-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-9c0c585-python \
  ghcr.io/openhands/agent-server:9c0c585-python

All tags pushed for this build

ghcr.io/openhands/agent-server:9c0c585-golang-amd64
ghcr.io/openhands/agent-server:9c0c585-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:9c0c585-golang-arm64
ghcr.io/openhands/agent-server:9c0c585-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:9c0c585-java-amd64
ghcr.io/openhands/agent-server:9c0c585-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:9c0c585-java-arm64
ghcr.io/openhands/agent-server:9c0c585-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:9c0c585-python-amd64
ghcr.io/openhands/agent-server:9c0c585-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:9c0c585-python-arm64
ghcr.io/openhands/agent-server:9c0c585-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:9c0c585-golang
ghcr.io/openhands/agent-server:9c0c585-java
ghcr.io/openhands/agent-server:9c0c585-python

About Multi-Architecture Support

  • Each variant tag (e.g., 9c0c585-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 9c0c585-python-amd64) are also available if needed

@simonrosenberg simonrosenberg changed the title from "feat: add eval workflow" to "feat: fix eval workflow" Nov 24, 2025
@simonrosenberg simonrosenberg marked this pull request as ready for review November 24, 2025 13:13
@simonrosenberg simonrosenberg self-assigned this Nov 24, 2025
simonrosenberg and others added 2 commits November 24, 2025 15:27
The benchmarks workflow now accepts sdk-commit parameter to specify
which SDK version to use when building images. This fix ensures the
correct SDK commit SHA is used instead of whatever the submodule defaults to.

The workflow now resolves the sdk_ref to an actual commit SHA and passes
that to both the benchmarks and evaluation workflows.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@simonrosenberg simonrosenberg force-pushed the sdk-e2e-use-benchmarks-builder branch from 6060513 to 4957d68 Compare November 24, 2025 15:53
simonrosenberg and others added 5 commits November 24, 2025 17:28
Changed from add-sdk-commit-param to fix-sdk-sha-detection branch
which includes the fix for SDK SHA detection bug.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Replace all references to secrets.ALLHANDS_BOT_GITHUB_PAT with secrets.PAT_TOKEN
to use the organization-level secret with proper cross-repo workflow permissions.
Replace PAT_TOKEN with OPENHANDS_AGENT_PAT to use the correct organization-level
secret with workflow dispatch permissions for benchmarks and evaluation repos.
Match the PAT-override branch structure by setting PAT_TOKEN as a global
environment variable in the resolve parameters step, removing redundant
per-step env declarations.

openhands-ai bot commented Nov 24, 2025

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Run Eval
    • Run Eval
    • Run Eval

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1241 at branch `sdk-e2e-use-benchmarks-builder`

Feel free to include any additional details that might help me get this PR into a better state.


The fix-sdk-sha-detection branch has been merged to main.
Update both the workflow dispatch and wait steps to use main branch.
@simonrosenberg
Collaborator Author

@OpenHands /codereview-roasted read the diff and review the code. Find any possible simplification in the workflow code


openhands-ai bot commented Nov 24, 2025

I'm on it! simonrosenberg can track my progress at all-hands.dev


openhands-ai bot commented Nov 24, 2025

Summary

I've completed a critical code review of PR #1241 focusing on workflow simplifications, as requested. No code changes were made; this was a review-only task per the /codereview-roasted command.

Key Findings:

🔴 Taste Rating: NEEDS IMPROVEMENT

The workflow refactoring moves builds from local execution to cross-repo dispatch, which is architecturally sound for separation of concerns, but introduces significant complexity and fragility.


Critical Issues Found:

  1. Race Condition (Lines 208-229): The "Wait for benchmarks build" step polls for the most recent workflow run on main branch, but there's NO guarantee this is the run just dispatched. If another commit triggers a build or runs are queued, you're checking the wrong run entirely.

  2. Repeated PAT Token Validation: PAT_TOKEN is checked identically in 4 different steps (lines 100-108, 180-183, 232-235, 257-259). This should be validated once at the job level.

  3. Polling Anti-Pattern (Lines 218-229): The workflow polls GitHub API every 30 seconds for up to 20 minutes (40 attempts × 30s). This is wasteful, fragile, and has no backoff strategy.
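As a hedged illustration of the backoff strategy the review says is missing, here is a minimal Python sketch (function and parameter names are hypothetical, not part of the actual workflow, which implements polling in shell):

```python
import time

def poll_with_backoff(check, max_wait=1200.0, base_delay=5.0, max_delay=120.0):
    """Poll `check()` until it returns a truthy result, doubling the delay
    between attempts (capped at max_delay) instead of a fixed interval."""
    delay = base_delay
    waited = 0.0
    while waited < max_wait:
        result = check()
        if result:
            return result
        time.sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay)  # exponential backoff, capped
    raise TimeoutError(f"condition not met within {max_wait}s")
```

With a backoff like this, early completions are detected quickly while long builds generate far fewer API requests than a fixed 30-second loop.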


Major Simplification Opportunities:

  1. Eliminate Polling Entirely: Replace dispatch + poll with GitHub Actions workflow_call (reusable workflows). This provides synchronous execution with no race conditions or API waste.

  2. Use github-script Action: Replace curl + jq gymnastics with GitHub's native actions/github-script for cleaner, more maintainable workflow dispatches.

  3. Extract Magic Numbers: Move MAX_ATTEMPTS=40 and SLEEP_SECONDS=30 to top-level env vars for visibility.

  4. Skip Unnecessary Builds: Check if images already exist in GHCR before triggering rebuild - would save massive CI time.

  5. Consolidate Error Handling: The curl error handling pattern is repeated - should use native Actions features or extract to a composite action.


Recommended Architecture:

Instead of: dispatch → poll → dispatch eval
Use: workflow_call → dispatch eval (no polling needed)

This eliminates all polling logic, race conditions, and makes dependencies explicit and reliable.


The detailed review has been provided in the conversation above with specific line numbers, concrete fixes, and architectural recommendations for simplifying the workflow code.


Critical fix for race condition identified in code review:

Problem:
- Workflow polled for 'most recent run on main' after dispatch
- No guarantee this was the run we just dispatched
- Could pick up wrong run if multiple builds were queued

Solution:
- Record timestamp before dispatch
- Wait 10s after dispatch for run to be created
- Query for runs created after dispatch timestamp
- Track specific run ID throughout polling loop
- Filter by created_at >= dispatch_time to find correct run

This ensures we always monitor the correct workflow run.
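A minimal Python sketch of that selection logic, assuming run objects shaped like the GitHub list-workflow-runs response (newest first); the actual workflow implements this in shell with curl and jq:

```python
from datetime import datetime, timezone

def find_dispatched_run(runs, dispatch_time):
    """Given workflow runs sorted newest-first (as the GitHub API returns
    them), pick the oldest run created at or after the dispatch timestamp —
    the run most likely created by our dispatch."""
    candidates = [
        r for r in runs
        if datetime.fromisoformat(r["created_at"].replace("Z", "+00:00"))
        >= dispatch_time
    ]
    # Newest-first ordering means the last candidate is closest to dispatch_time.
    return candidates[-1]["id"] if candidates else None
```

Filtering by `created_at` this way still cannot distinguish two runs dispatched within the same second, which is the residual edge case noted in the follow-up review.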
The PAT_TOKEN is validated once in the 'Resolve parameters' step
and written to $GITHUB_ENV, making it available to all subsequent
steps. Removed redundant validation from:
- Dispatch benchmarks image build
- Wait for benchmarks build completion
- Dispatch evaluation workflow

This reduces duplication and makes the workflow cleaner.
Moved MAX_ATTEMPTS and SLEEP_SECONDS from inline script values
to top-level environment variables for better visibility and
maintainability.

Added:
- MAX_POLL_ATTEMPTS: '40' (40 attempts × 30s = 20 minutes max)
- POLL_INTERVAL_SECONDS: '30'

This makes it easier to adjust polling behavior and documents
the timeout calculation in one central location.
Collaborator Author

🟡 Updated Review: Code Quality Improved, Architecture Still Has Room for Optimization

Great progress addressing the critical issues from the previous review! Several key problems have been fixed. However, the fundamental architectural pattern still has opportunities for simplification.


Issues Addressed

1. PAT Token Validation - IMPROVED

Before: PAT_TOKEN validated 4 times across different steps
Now: Set once in "Set params" step (lines 103-112) and stored in $GITHUB_ENV, eliminating redundant checks.

Note: The "Wait for benchmarks build" step now uses $PAT_TOKEN without validation, assuming it's set. This is fine since the job would have already failed earlier if missing.

2. Magic Numbers - FIXED

Before: Hardcoded MAX_ATTEMPTS=40 and SLEEP_SECONDS=30 buried in shell script
Now: Moved to top-level env vars (lines 47-48):

MAX_POLL_ATTEMPTS: '40'  # 40 attempts × 30s = 20 minutes max wait
POLL_INTERVAL_SECONDS: '30'

Much better visibility and configurability! Good documentation comment too.

3. Race Condition - SIGNIFICANTLY MITIGATED 🎯

Before: Polled for most recent run on main, no guarantee it was the dispatched one
Now:

  • Records DISPATCH_TIME before triggering (line 189)
  • Filters for runs created_at >= DISPATCH_TIME (line 235)
  • Queries 5 recent runs instead of 1 to avoid missing the target
  • Added 10-second buffer after dispatch (line 215)

This is a solid improvement! The race condition is now much less likely, though not completely eliminated (see remaining issues below).

4. Authorization Header - FIXED

Before: Mixed "Bearer" and "token"
Now: Consistently uses Authorization: token throughout (lines 201, 234, 264)


🟡 Remaining Opportunities for Simplification

[Medium Priority] Polling Pattern Still Present

Lines 218-262: The workflow still uses a polling loop to wait for the benchmarks build.

Why this matters: While the race condition is mitigated, polling is inherently fragile:

  • Consumes GitHub API quota (up to 40 requests per run)
  • Adds 10-20 minutes of wait time to every workflow run
  • Can still fail if GitHub API is slow or returns unexpected data

Better approach: Use GitHub Actions workflow_call (reusable workflows):

jobs:
  build-images:
    uses: OpenHands/benchmarks/.github/workflows/build-swe-bench-images.yml@main
    with:
      dataset: ${{ env.DATASET }}
      sdk-commit: ${{ steps.get-sha.outputs.sdk_sha }}
    secrets: inherit
  
  dispatch-eval:
    needs: build-images
    runs-on: ubuntu-latest
    # ... rest of the workflow

This eliminates ALL polling, provides deterministic run tracking, and makes dependencies explicit.

[Low Priority] curl + jq Could Be Cleaner

Lines 192-208, 253-267: Using curl + jq for GitHub API calls works but is verbose.

Alternative: Use actions/github-script@v7:

- name: Dispatch benchmarks build
  uses: actions/github-script@v7
  with:
    github-token: ${{ env.PAT_TOKEN }}
    script: |
      await github.rest.actions.createWorkflowDispatch({
        owner: 'OpenHands',
        repo: 'benchmarks',
        workflow_id: '${{ env.BENCHMARKS_BUILD_WORKFLOW }}',
        ref: 'main',
        inputs: { ... }
      });

Cleaner, better error messages, no jq gymnastics.

[Edge Case] Remaining Race Condition

Lines 228-239: The timestamp filter created_at >= DISPATCH_TIME is good but not perfect.

Scenario: If two PRs trigger eval workflows within the same second, both could match the same run.

Potential fix: Also filter by workflow inputs (check sdk-commit matches expected SHA) once you identify candidate runs. The GitHub API exposes workflow run inputs, so you could verify:

INPUTS=$(echo "$RUN_DATA" | jq -r '.inputs."sdk-commit"')
if [ "$INPUTS" = "$SDK_SHA" ]; then
  # This is definitely our run
fi

[Optimization] Unconditional Rebuild

The workflow always triggers a benchmarks build, even if images already exist for this SDK_SHA.

Optimization opportunity: Query GitHub Container Registry first:

# Check if image exists
if docker manifest inspect ghcr.io/openhands/eval-agent-server:$SDK_SHA 2>/dev/null; then
  echo "Images already exist for $SDK_SHA, skipping build"
  exit 0
fi

This would save significant CI time for re-runs or evaluations of already-built commits.

[Debugging] Missing Error Context

Line 256: When benchmarks build fails, you print the conclusion but not the logs URL.

Improvement:

echo "Benchmarks build failed with conclusion ${CONCL}" >&2
echo "View logs: https://github.com/${REPO}/actions/runs/${RUN_ID}" >&2
exit 1

📊 Progress Summary

| Issue | Before | After | Status |
|-------|--------|-------|--------|
| PAT token duplication | 4 validations | 1 validation | ✅ Fixed |
| Magic numbers | Hardcoded | Env vars | ✅ Fixed |
| Race condition | Severe | Mitigated | 🟡 Improved |
| Auth headers | Inconsistent | Consistent | ✅ Fixed |
| Polling pattern | Present | Present | 🔴 Remains |

VERDICT:

🟡 Good progress, merge-worthy with notes: The critical bugs are fixed and the code is significantly more robust. The remaining issues are architectural optimizations rather than correctness problems. The workflow will function reliably in its current form.

Key Insight: You've successfully moved from a broken polling implementation to a working one. The next evolution would be eliminating polling entirely with workflow_call, but that requires changes to the benchmarks repo's workflow, which is out of scope for this PR.


Recommendation

Approve to merge - The workflow is now production-ready. Consider the remaining optimizations as follow-up improvements rather than blockers.

If you do want to eliminate polling in the future, the benchmarks workflow would need to be converted to a reusable workflow with workflow_call trigger, which is a cleaner long-term solution.

Updated polling configuration:
- MAX_POLL_ATTEMPTS: 40 → 80
- POLL_INTERVAL_SECONDS: 30s → 60s
- Total max wait time: 20 minutes → 80 minutes

This provides more headroom for long-running image builds,
especially when building many images or during high load periods.
Comment on lines +48 to +49
MAX_POLL_ATTEMPTS: '80' # 80 attempts × 60s = 80 minutes max wait
POLL_INTERVAL_SECONDS: '60'
Collaborator Author

@simonrosenberg simonrosenberg Nov 24, 2025


@xingyaoww is 80 minutes enough for building all SWE-bench images?

I made an issue (#1249) to avoid polling, but it requires modifying all the workflows, so I'm doing this later.

@simonrosenberg simonrosenberg merged commit 5a2cd30 into main Nov 25, 2025
21 checks passed
@simonrosenberg simonrosenberg deleted the sdk-e2e-use-benchmarks-builder branch November 25, 2025 09:54