Simplify evaluation workflow by removing benchmarks build polling #1267
Conversation
- Remove benchmarks build dispatch and polling steps (327 -> ~180 lines)
- Remove polling configuration environment variables
- Orchestration now handled by evaluation repository's Kubernetes job
- Workflow now only validates inputs and dispatches evaluation

This change:
- Reduces GitHub Actions API calls from 40+ to ~2 per evaluation
- Eliminates complex polling logic from YAML
- Improves maintainability and error handling
- Makes the full evaluation flow easier to understand

Co-authored-by: openhands <openhands@all-hands.dev>
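For context, the dispatch-only step described above amounts to a single API call. Below is a minimal sketch, assuming a requests-based helper, a GH_TOKEN environment variable, and a run-eval.yml workflow in the internal OpenHands/evaluation repo; the actual workflow does this from YAML, and these names are illustrative assumptions.

```python
# Hypothetical sketch: dispatch the evaluation workflow with one API call,
# instead of dispatching a benchmarks build and polling for its completion.
# Repo path, workflow filename, and token variable are assumptions.
import os

import requests


def dispatch_evaluation(ref: str, inputs: dict) -> None:
    """Fire-and-forget dispatch; orchestration happens in the evaluation repo."""
    url = (
        "https://api.github.com/repos/OpenHands/evaluation"
        "/actions/workflows/run-eval.yml/dispatches"
    )
    resp = requests.post(
        url,
        headers={
            "Authorization": f"Bearer {os.environ['GH_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={"ref": ref, "inputs": inputs},
        timeout=30,
    )
    resp.raise_for_status()  # the API returns 204 No Content on success
```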
Allow specifying which evaluation repo branch to use when dispatching the evaluation workflow, making it easier to test cross-repo changes. Co-authored-by: openhands <openhands@all-hands.dev>
- Add models.json with full model configurations to SDK repo
- Add resolve_model_configs.py to lookup model configs by ID
- Update run-eval workflow to resolve and pass model configs
- Change evaluation workflow input from model IDs to model configs

Co-authored-by: openhands <openhands@all-hands.dev>
- Extend authorization check to also validate workflow_dispatch triggers
- Same authorized-labelers.txt list now applies to both PR labels and manual triggers
- Ensures only authorized users can trigger evaluations via any method

Co-authored-by: openhands <openhands@all-hands.dev>
Force-pushed from 45d1ecd to 7b1bc9a
- Pass reason input from SDK workflow to evaluation workflow
- Reason is shown in Slack notifications when provided

Co-authored-by: openhands <openhands@all-hands.dev>
- Delete allowed-model-stubs.json (redundant with models.json)
- Extract allowed model IDs directly from models.json
- Rename workflow input from 'model_stubs' to 'model_ids' for clarity
- Update validation logic to use model IDs from models.json
- Improve error messages to show available models on validation failure

This simplifies configuration management by having a single source of truth for model definitions.

Co-authored-by: openhands <openhands@all-hands.dev>
- Extract find_models_by_id() for better testability and separation of concerns
- Use consistent error handling with get_required_env() and error_exit()
- Standardize error message format (remove redundant 'ERROR:' prefixes)

Co-authored-by: openhands <openhands@all-hands.dev>
Tests cover:
- Single and multiple model lookups
- Order preservation
- Missing model error handling
- Empty list handling
- Full config preservation

Co-authored-by: openhands <openhands@all-hands.dev>
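A sketch of what such tests can look like, assuming find_models_by_id(model_ids, models) accepts an explicit mapping and raises KeyError on unknown IDs; the real signature, module layout, and error behaviour may differ.

```python
# Hypothetical test sketch; the function signature, the explicit MODELS mapping,
# and the KeyError on unknown IDs are assumptions, not the repo's actual code.
import pytest

from resolve_model_config import find_models_by_id

MODELS = {
    "gpt-5": {"model": "openai/gpt-5", "temperature": 0.0},
    "sonnet": {"model": "anthropic/claude-sonnet-4", "temperature": 0.0},
}


def test_single_lookup_preserves_full_config():
    assert find_models_by_id(["gpt-5"], MODELS) == [MODELS["gpt-5"]]


def test_multiple_lookups_preserve_order():
    assert find_models_by_id(["sonnet", "gpt-5"], MODELS) == [
        MODELS["sonnet"],
        MODELS["gpt-5"],
    ]


def test_missing_model_errors():
    # Exact error type is an assumption; the script may call error_exit() instead.
    with pytest.raises(KeyError):
        find_models_by_id(["does-not-exist"], MODELS)


def test_empty_list_returns_empty():
    assert find_models_by_id([], MODELS) == []
```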
if [ -n "$RUN_ID" ] && [ "$RUN_ID" != "null" ]; then
  echo "Found workflow run: $RUN_ID"
else
  echo "Waiting for workflow run to appear (attempt $i/$MAX_ATTEMPTS)..."
Out of curiosity, what happens to all this code that builds from benchmarks? How will it work now?
We moved it to the Kubernetes workflow in our internal OpenHands/evaluation repo 😓 We don't want to keep polling stuff in CI, since CI compute is likely more expensive than K8s, and keeping this centralized in one place is a bit easier to maintain in the long run anyway.
That's fine if the results are transparent and reproducible IMHO
enyst left a comment
I understand this is accurate, right?
Applies to both workflow_dispatch and pull_request_target events
Will it comment on PRs, in the case of runs on PRs?
The comment is very valuable because it has logs - or it did, on V0. In particular, when people get a not-so-good result, I think maybe we need to offer all the data they need to dig in.
I think I asked this before, sorry, but I just checked yesterday's PR, and I didn't see any comment in response to
def main() -> None:
    models_json_path = get_required_env("MODELS_JSON_PATH")
How about we write models.json as a dict explicitly in this file since it seems it is only used here?
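If that route is taken (and a later commit in this PR does embed the configs), the module could look roughly like this. A minimal sketch: the MODELS entries, field names, and the KeyError behaviour are illustrative assumptions, not the actual configuration.

```python
# Sketch of embedding the model configs directly in resolve_model_config.py
# instead of shipping a separate models.json; entries and fields are
# illustrative assumptions.
MODELS: dict[str, dict] = {
    "gpt-5": {"model": "openai/gpt-5", "temperature": 0.0},
    "sonnet": {"model": "anthropic/claude-sonnet-4", "temperature": 0.0},
}


def find_models_by_id(
    model_ids: list[str], models: dict[str, dict] = MODELS
) -> list[dict]:
    """Return full configs in the requested order; fail loudly on unknown IDs."""
    missing = [m for m in model_ids if m not in models]
    if missing:
        raise KeyError(
            f"Unknown model id(s): {missing}. Available: {sorted(models)}"
        )
    return [models[m] for m in model_ids]
```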
.github/workflows/run-eval.yml (Outdated)
| echo "User $LABELER is not authorized to trigger eval." >&2 | ||
| ACTOR="${{ github.actor }}" | ||
| if ! grep -Fx "$ACTOR" .github/run-eval/authorized-actors.txt >/dev/null; then | ||
| echo "User $ACTOR is not authorized to trigger eval." >&2 |
Maybe we can just remove this authorization? Only maintainers, or at least people with triage access to the repo, can tag a PR, so I think it should be fine even if we don't check github.actor.
Just to note, this might have consequences the other way around though: we basically haven't had new maintainers from the community in a long time. If we add an additional criterion, it might become even a little harder?
This goes both ways. I don't know. How about we let it sink in and give it some thought for a while?
- Remove ACTOR authorization logic and authorized-actors.txt
- Embed model configs in resolve_model_configs.py, remove models.json
- Pass PR_NUMBER to evaluation workflow for result comments
- Fix Python script execution using heredoc for proper YAML formatting

Co-authored-by: openhands <openhands@all-hands.dev>
Evaluation Triggered
This allows dispatching to a specific benchmarks repo branch instead of always using main, enabling end-to-end testing of feature branches. Co-authored-by: openhands <openhands@all-hands.dev>
…el_config Co-authored-by: openhands <openhands@all-hands.dev>
Evaluation Triggered
Evaluation Triggered
🎉 Evaluation Job Completed
Evaluation Name:
Results Summary

View Metadata | View Results | Download Full Results
@enyst FYI
Rename test_resolve_model_configs.py to test_resolve_model_config.py to match the actual module name (resolve_model_config.py, singular). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Consolidate SHA resolution logic into a single step instead of separate checkout and resolution steps. This eliminates unnecessary git operations and makes the workflow more efficient.

Changes:
- Remove temporary pr_number input parameter (testing complete)
- Resolve SDK commit SHA directly in params step for all event types
  - pull_request_target: Use github.event.pull_request.head.sha
  - release: Use github.event.release.target_commitish
  - workflow_dispatch: Use git rev-parse to convert ref to SHA
- Remove redundant "Checkout evaluated ref for PRs" step
- Remove redundant "Resolve SDK commit SHA for evaluation" step
- Update model_ids description to reference MODELS dict location

This reduces workflow execution time and simplifies maintenance.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
The previous version tried to checkout github.event.inputs.sdk_ref directly, which fails when a short SHA is provided because actions/checkout@v4 with fetch-depth: 0 cannot resolve short SHAs.

Solution: Use github.ref (the branch the workflow runs on) for the initial checkout. This works because:
- For workflow_dispatch: The workflow runs on the branch specified in the UI
- For pull_request_target: Uses the PR base branch ref
- For release: Uses the release tag/branch

The actual SDK SHA for evaluation is still correctly resolved later in the params step (line 136-138) using git rev-parse.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
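For reference, the ref-to-SHA conversion this commit relies on boils down to a one-liner around git rev-parse. A minimal sketch in Python; the actual workflow does this in a shell step, so the helper below is only illustrative.

```python
# Sketch: resolve a user-supplied ref (branch, tag, or short SHA) to a full
# commit SHA, assuming a prior full-history checkout (fetch-depth: 0).
import subprocess


def resolve_sha(ref: str) -> str:
    # "<ref>^{commit}" lets git rev-parse handle branches, tags, and short SHAs.
    result = subprocess.run(
        ["git", "rev-parse", f"{ref}^{{commit}}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```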
Thank you, this is starting to work! It doesn't have any links to "Download Full Results" and the rest, though? They look like they would be links, but they're not. We need links to look into results, IMHO; numbers alone don't tell much, especially for debugging bad results or simply understanding them. Edited to add:
TEMPORARY CHANGE - will be reverted after testing

This adds a pr_number workflow input so we can test the clickable GCS links fix in PR comments without needing to trigger via label.

Testing plan:
1. Trigger workflow with pr_number=1267
2. Verify PR comment has clickable HTTPS links
3. Revert this commit after successful test

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Evaluation Triggered
🎉 Evaluation Job Completed
Evaluation Name:
Results Summary
Unfortunately, those links are not public; not accessible outside AH, I assume. 😢 I believe that's why Mamoodi had in the past made a full log available via GitHub Actions. Not sure it was a great way, but it was a way.
Evaluation Triggered
Evaluation Triggered
🎉 Evaluation Job Completed
Evaluation Name:
Results Summary
This reverts commit e5bbafe.
Evaluation Triggered

🎉 Evaluation Job Completed
Evaluation Name:
Results Summary
Just to note, it still doesn't like me. 😢

AccessDenied
Access denied.
Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object. Permission 'storage.objects.get' denied on resource (or it may not exist).
Looks like there are a few issues preventing this PR from being merged!
If you'd like me to help, just leave a comment. Feel free to include any additional details that might help me get this PR into a better state.
It's a work in progress!
@enyst btw the URL fixing is done in a completely different repo; if you want to review the current code, it would be awesome :)
enyst left a comment
At a cursory look, it seems to me like we could maybe simplify some code. Let's take it in though, and have some fun!
If it works now, we can always adjust things as they come.
Thank you!!
Simplify Cross-Repository Evaluation Workflow Orchestration
Fixes #1265
Summary
This PR simplifies the SDK evaluation workflow by moving orchestration complexity from GitHub Actions (YAML) into Python code in the evaluation repository where it naturally belongs.
Before
After
Changes
Software Agent SDK Repository
1. Simplified Workflow (run-eval.yml)

Removed: the benchmarks build dispatch and polling steps, plus the polling configuration environment variables.
Streamlined to: validate inputs, resolve model configs from models.json, and dispatch the evaluation.

2. Model Configuration Management

Created resolve_model_configs.py: looks up full model configs by ID from models.json.
Replaced model stubs with a single source of truth:
- .github/run-eval/llm_config_model_stubs.json is removed
- .github/run-eval/models.json becomes the single authoritative source

Benefits: a single source of truth for model definitions.

3. Enhanced Input Validation

Added authorization check:
- Only users in the allowlist (authorized-labelers.txt) can trigger evaluations
- Applies to both workflow_dispatch and pull_request_target events

Model ID validation: requested model IDs are validated against models.json, with error messages listing the available models.

4. Testing Feature Branches

New eval_branch input: lets the dispatch target a specific evaluation repo branch; defaults to main for normal operations.

5. Trigger Reason Propagation

New reason input: passed through to the evaluation workflow and shown in Slack notifications when provided.

Code Review Updates
After initial implementation, addressed code review feedback to improve reliability:
1. Fixed Workflow Run Detection (HIGH PRIORITY)

Problem: Timestamp-based detection was unreliable and vulnerable to race conditions
Solution: Implemented baseline run ID comparison using the Check Suites API; the new run is identified by check_suite.id > baseline_id (see the sketch after this list)

2. Removed Sanitize-Inputs Job (MEDIUM PRIORITY)

Problem: Unnecessary GitHub Actions job that could be done in Python
Solution: Moved sanitization logic to orchestrate_eval.py

3. Added Comprehensive Tests
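A minimal sketch of the baseline comparison described in item 1, using the GitHub Check Suites API. The repository path, token variable, and selection logic below are illustrative assumptions rather than the code in the evaluation repo.

```python
# Hypothetical sketch of baseline-based run detection: record the newest
# check suite id before dispatching, then identify the new run as the first
# suite whose id exceeds that baseline.
import os

import requests

API = "https://api.github.com/repos/OpenHands/agent-sdk"
HEADERS = {
    "Authorization": f"Bearer {os.environ['GH_TOKEN']}",
    "Accept": "application/vnd.github+json",
}


def latest_check_suite_id(ref: str) -> int:
    """Return the highest check suite id currently attached to the commit."""
    resp = requests.get(f"{API}/commits/{ref}/check-suites", headers=HEADERS, timeout=30)
    resp.raise_for_status()
    suites = resp.json()["check_suites"]
    return max((s["id"] for s in suites), default=0)


def find_new_suite(ref: str, baseline_id: int) -> int | None:
    """Return the oldest suite id created after the baseline, if any."""
    resp = requests.get(f"{API}/commits/{ref}/check-suites", headers=HEADERS, timeout=30)
    resp.raise_for_status()
    new = [s["id"] for s in resp.json()["check_suites"] if s["id"] > baseline_id]
    return min(new) if new else None
```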
Validation
Full Workflow Testing
Test 1: Valid Model ID
Test 2: Evaluation Deployment
Test 3: Invalid Model ID (Validation)
Test 4: Code Review Fixes - End-to-End
Trigger Reason Feature
Input: "Testing fix for helm deployment issue"
Sanitized: "Testing-fix-for-helm-deployment-issue"
Result: ✅ Successfully passed to Kubernetes pod via helm_args
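The sanitization step above is essentially a whitespace-to-hyphen rewrite before the value is passed through helm_args. A rough sketch, with the allowed character set assumed from the example shown; the real orchestrate_eval.py logic may differ.

```python
# Hypothetical sketch of reason sanitization before passing it via helm_args;
# the exact allowed character set is an assumption based on the example above.
import re


def sanitize_reason(reason: str) -> str:
    """Collapse whitespace to hyphens and drop characters unsafe for helm args."""
    cleaned = re.sub(r"\s+", "-", reason.strip())
    return re.sub(r"[^A-Za-z0-9._-]", "", cleaned)


assert sanitize_reason("Testing fix for helm deployment issue") == (
    "Testing-fix-for-helm-deployment-issue"
)
```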
Impact
Reduced Complexity
Improved Maintainability
Better Error Handling
Enhanced Features
Testing
Related PRs
Co-authored-by: openhands <openhands@all-hands.dev>
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
- eclipse-temurin:17-jdk
- nikolaik/python-nodejs:python3.12-nodejs22
- golang:1.21-bookworm

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:0f06ad7-python

Run
All tags pushed for this build
About Multi-Architecture Support
- Each tag (0f06ad7-python) is a multi-arch manifest supporting both amd64 and arm64
- Architecture-specific tags (0f06ad7-python-amd64) are also available if needed