feat: fix eval workflow #1241
Conversation
The benchmarks workflow now accepts sdk-commit parameter to specify which SDK version to use when building images. This fix ensures the correct SDK commit SHA is used instead of whatever the submodule defaults to. The workflow now resolves the sdk_ref to an actual commit SHA and passes that to both the benchmarks and evaluation workflows. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
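For context, a minimal sketch of what such a ref-to-SHA resolution step could look like; the repo URL, variable names, and fallback logic here are assumptions for illustration, not the PR's actual code:

```bash
# Resolve the requested SDK ref (e.g. a branch name) to a concrete commit SHA.
# Hypothetical step: the real workflow's variable names may differ.
SDK_REF="${SDK_REF:-main}"
SDK_SHA=$(git ls-remote https://github.com/OpenHands/agent-sdk.git "refs/heads/${SDK_REF}" | cut -f1)
# If the input was already a SHA (no matching branch), pass it through unchanged.
SDK_SHA="${SDK_SHA:-$SDK_REF}"
echo "sdk_sha=${SDK_SHA}" >> "$GITHUB_OUTPUT"
```

The resolved SHA can then be passed as an input to both the benchmarks and evaluation workflow dispatches.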
Force-pushed from 6060513 to 4957d68
Changed from add-sdk-commit-param to fix-sdk-sha-detection branch which includes the fix for SDK SHA detection bug. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Replace all references to secrets.ALLHANDS_BOT_GITHUB_PAT with secrets.PAT_TOKEN to use the organization-level secret with proper cross-repo workflow permissions.
Replace PAT_TOKEN with OPENHANDS_AGENT_PAT to use the correct organization-level secret with workflow dispatch permissions for benchmarks and evaluation repos.
Match the PAT-override branch structure by setting PAT_TOKEN as a global environment variable in the resolve parameters step, removing redundant per-step env declarations.
Looks like there are a few issues preventing this PR from being merged!
If you'd like me to help, just leave a comment. Feel free to include any additional details that might help me get this PR into a better state.
The fix-sdk-sha-detection branch has been merged to main. Update both the workflow dispatch and wait steps to use main branch.
@OpenHands /codereview-roasted read the diff and review the code. Find any possible simplification in the workflow code.
I'm on it! simonrosenberg can track my progress at all-hands.dev
Summary

I've completed a critical code review of PR #1241 focusing on workflow simplifications, as requested. No code changes were made; this was a review-only task.

Key Findings

🔴 Taste Rating: NEEDS IMPROVEMENT

The workflow refactoring moves builds from local execution to cross-repo dispatch, which is architecturally sound for separation of concerns, but introduces significant complexity and fragility.

Critical Issues Found: (detailed above)

Major Simplification Opportunities: (detailed above)

Recommended Architecture

The recommended structure eliminates all polling logic and race conditions, and makes dependencies explicit and reliable. The detailed review has been provided in the conversation above with specific line numbers, concrete fixes, and architectural recommendations for simplifying the workflow code.
Critical fix for race condition identified in code review.

Problem:
- Workflow polled for 'most recent run on main' after dispatch
- No guarantee this was the run we just dispatched
- Could pick up wrong run if multiple builds were queued

Solution:
- Record timestamp before dispatch
- Wait 10s after dispatch for run to be created
- Query for runs created after dispatch timestamp
- Track specific run ID throughout polling loop
- Filter by created_at >= dispatch_time to find correct run

This ensures we always monitor the correct workflow run.
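A rough sketch of that pattern against the GitHub REST API; the workflow file variable and parameter names here are illustrative, not the exact script in the PR:

```bash
# Record the time just before dispatching so we can find the run we created.
DISPATCH_TIME=$(date -u +%Y-%m-%dT%H:%M:%SZ)

# Fire the workflow_dispatch event on the benchmarks repo.
curl -sS -X POST \
  -H "Authorization: Bearer ${PAT_TOKEN}" \
  "https://api.github.com/repos/OpenHands/benchmarks/actions/workflows/${BENCHMARKS_BUILD_WORKFLOW}/dispatches" \
  -d '{"ref": "main"}'

# Give GitHub a moment to materialize the run, then look it up by creation time.
sleep 10
RUN_ID=$(curl -sS -G \
  -H "Authorization: Bearer ${PAT_TOKEN}" \
  --data-urlencode "created=>=${DISPATCH_TIME}" \
  --data-urlencode "event=workflow_dispatch" \
  --data-urlencode "per_page=1" \
  "https://api.github.com/repos/OpenHands/benchmarks/actions/workflows/${BENCHMARKS_BUILD_WORKFLOW}/runs" \
  | jq -r '.workflow_runs[0].id')
echo "Tracking run ${RUN_ID}"
```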
The PAT_TOKEN is validated once in the 'Resolve parameters' step and written to $GITHUB_ENV, making it available to all subsequent steps.

Removed redundant validation from:
- Dispatch benchmarks image build
- Wait for benchmarks build completion
- Dispatch evaluation workflow

This reduces duplication and makes the workflow cleaner.
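In GitHub Actions terms, the pattern looks roughly like this; the secret name follows the earlier commit message, and the exact wiring is an assumption rather than the PR's literal step:

```bash
# In the "Resolve parameters" step: validate the secret once and export it to
# the job environment so later steps can read it without re-checking.
PAT_TOKEN="${OPENHANDS_AGENT_PAT}"   # assumed to be injected from secrets via the step's env
if [ -z "${PAT_TOKEN}" ]; then
  echo "OPENHANDS_AGENT_PAT is not set" >&2
  exit 1
fi
echo "PAT_TOKEN=${PAT_TOKEN}" >> "$GITHUB_ENV"
```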
Moved MAX_ATTEMPTS and SLEEP_SECONDS from inline script values to top-level environment variables for better visibility and maintainability.

Added:
- MAX_POLL_ATTEMPTS: '40' (40 attempts × 30s = 20 minutes max)
- POLL_INTERVAL_SECONDS: '30'

This makes it easier to adjust polling behavior and documents the timeout calculation in one central location.
🟡 Updated Review: Code Quality Improved, Architecture Still Has Room for Optimization

Great progress addressing the critical issues from the previous review! Several key problems have been fixed. However, the fundamental architectural pattern still has opportunities for simplification.

✅ Issues Addressed

1. PAT Token Validation - IMPROVED
Before: PAT_TOKEN was validated 4 times across different steps; it is now validated once and shared via $GITHUB_ENV. Note: the "Wait for benchmarks build" step now uses that shared value.

2. Magic Numbers - FIXED ✨
Before: hardcoded inline values. After:

```yaml
MAX_POLL_ATTEMPTS: '40'        # 40 attempts × 30s = 20 minutes max wait
POLL_INTERVAL_SECONDS: '30'
```

Much better visibility and configurability! Good documentation comment too.

3. Race Condition - SIGNIFICANTLY MITIGATED 🎯
Before: polled for the most recent run on main, with no guarantee it was the dispatched one. Now the workflow records a dispatch timestamp and tracks the specific run ID. This is a solid improvement! The race condition is now much less likely, though not completely eliminated (see remaining issues below).

4. Authorization Header - FIXED
Before: mixed "Bearer" and "token" schemes.

🟡 Remaining Opportunities for Simplification

[Medium Priority] Polling Pattern Still Present
Lines 218-262: The workflow still uses a polling loop to wait for the benchmarks build. Why this matters: while the race condition is mitigated, polling is inherently fragile.

Better approach: use GitHub Actions jobs:
```yaml
build-images:
  uses: OpenHands/benchmarks/.github/workflows/build-swe-bench-images.yml@main
  with:
    dataset: ${{ env.DATASET }}
    sdk-commit: ${{ steps.get-sha.outputs.sdk_sha }}
  secrets: inherit

dispatch-eval:
  needs: build-images
  runs-on: ubuntu-latest
  # ... rest of the workflow
```

This eliminates ALL polling, provides deterministic run tracking, and makes dependencies explicit.

[Low Priority] curl + jq Could Be Cleaner
Lines 192-208, 253-267: Using curl + jq for GitHub API calls works but is verbose. Alternative: use actions/github-script:
```yaml
- name: Dispatch benchmarks build
  uses: actions/github-script@v7
  with:
    github-token: ${{ env.PAT_TOKEN }}
    script: |
      await github.rest.actions.createWorkflowDispatch({
        owner: 'OpenHands',
        repo: 'benchmarks',
        workflow_id: '${{ env.BENCHMARKS_BUILD_WORKFLOW }}',
        ref: 'main',
        inputs: { ... }
      });
```

Cleaner, better error messages, no jq gymnastics.

[Edge Case] Remaining Race Condition
Lines 228-239: The timestamp filter alone cannot uniquely identify the dispatched run. Scenario: if two PRs trigger eval workflows within the same second, both could match the same run. Potential fix: also filter by workflow inputs (check the sdk-commit input):

```bash
INPUTS=$(echo "$RUN_DATA" | jq -r '.inputs."sdk-commit"')
if [ "$INPUTS" = "$SDK_SHA" ]; then
  # This is definitely our run
  echo "Matched the dispatched run by its sdk-commit input"
fi
```

[Optimization] Unconditional Rebuild
The workflow always triggers a benchmarks build, even if images already exist for this SDK commit. Optimization opportunity: query GitHub Container Registry first:
```bash
# Check if image exists
if docker manifest inspect ghcr.io/openhands/eval-agent-server:$SDK_SHA > /dev/null 2>&1; then
  echo "Images already exist for $SDK_SHA, skipping build"
  exit 0
fi
```

This would save significant CI time for re-runs or evaluations of already-built commits.

[Debugging] Missing Error Context
Line 256: When the benchmarks build fails, you print the conclusion but not the logs URL. Improvement:

```bash
echo "Benchmarks build failed with conclusion ${CONCL}" >&2
echo "View logs: https://github.com/${REPO}/actions/runs/${RUN_ID}" >&2
exit 1
```

📊 Progress Summary
VERDICT

🟡 Good progress, merge-worthy with notes: the critical bugs are fixed and the code is significantly more robust. The remaining issues are architectural optimizations rather than correctness problems. The workflow will function reliably in its current form.

Key Insight: You've successfully moved from a broken polling implementation to a working one. The next evolution would be eliminating polling entirely with a reusable workflow.

Recommendation

✅ Approve to merge - The workflow is now production-ready. Consider the remaining optimizations as follow-up improvements rather than blockers. If you do want to eliminate polling in the future, the benchmarks workflow would need to be converted to a reusable workflow with a workflow_call trigger.
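For reference, a minimal sketch of what that callee side could look like in the benchmarks repo; the file name follows the review's example above, while the inputs and build step are assumptions, not the repo's actual workflow:

```yaml
# .github/workflows/build-swe-bench-images.yml (sketch, not the actual workflow)
on:
  workflow_call:
    inputs:
      dataset:
        type: string
        required: true
      sdk-commit:
        type: string
        required: true

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Hypothetical build entry point; the real repo's script may differ.
      - run: ./scripts/build-images.sh "${{ inputs.dataset }}" "${{ inputs.sdk-commit }}"
```

With this in place, the caller can use `uses:` plus `needs:` (as in the YAML earlier in the review) instead of dispatch-and-poll.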
Updated polling configuration:
- MAX_POLL_ATTEMPTS: 40 → 80
- POLL_INTERVAL_SECONDS: 30s → 60s
- Total max wait time: 20 minutes → 80 minutes

This provides more headroom for long-running image builds, especially when building many images or during high load periods.
```yaml
MAX_POLL_ATTEMPTS: '80'        # 80 attempts × 60s = 80 minutes max wait
POLL_INTERVAL_SECONDS: '60'
```
@xingyaoww is 80 minutes enough for building all SWE-bench images?
I made an issue to avoid polling, but it requires modifying all workflows, so I'm doing this later: #1249
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
• eclipse-temurin:17-jdk
• nikolaik/python-nodejs:python3.12-nodejs22
• golang:1.21-bookworm

Pull (multi-arch manifest)
```bash
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:9c0c585-python
```

Run
All tags pushed for this build
About Multi-Architecture Support
Each variant tag (e.g. 9c0c585-python) is a multi-arch manifest supporting both amd64 and arm64. Architecture-specific tags (e.g. 9c0c585-python-amd64) are also available if needed.
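For example, to pull the amd64-only image explicitly (tag pattern taken from the note above):

```bash
docker pull ghcr.io/openhands/agent-server:9c0c585-python-amd64
```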