Align eval labels with benchmarks build tiers (1, 50, 200) #1254
Conversation
Update run-eval workflow to use labels that match the benchmarks repo's build tiers:

- run-eval-1: Quick debugging (1 instance)
- run-eval-50: Standard testing (50 instances)
- run-eval-200: Extended testing (200 instances)

Removed run-eval-2, run-eval-10, and run-eval-100, which don't align with benchmarks' build-swebench-50 and build-swebench-200 labels. This ensures eval instance counts match the available pre-built image tiers in the benchmarks repository, avoiding unnecessary image builds.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
xingyaoww left a comment
Would also want @neubig's thoughts on how this will work on the OH index (e.g., we might have multiple datasets).
.github/workflows/run-eval.yml (Outdated)

```diff
-          github.event.label.name == 'run-eval-2' ||
           github.event.label.name == 'run-eval-50' ||
-          github.event.label.name == 'run-eval-100'))
+          github.event.label.name == 'run-eval-200'))
```
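Stitched back together, the updated gate would read roughly as follows. This is a sketch only: the enclosing expression and the `run-eval-1` branch are inferred from the PR description, not copied from the file.

```yaml
# Sketch of the updated label condition in run-eval.yml.
# The enclosing pull_request_target check and the run-eval-1 branch are
# assumptions inferred from the PR description, not the actual file contents.
if: >-
  (github.event_name == 'pull_request_target' &&
   (github.event.label.name == 'run-eval-1' ||
    github.event.label.name == 'run-eval-50' ||
    github.event.label.name == 'run-eval-200'))
```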
We should also add eval-500 for the full set
If evaluation is getting richer and richer, perhaps we should drop label triggers entirely, since they can't really specify (model, eval_dataset) easily?
In that case, how do we run it?
By workflow dispatch from the (private) …

Do we really want to trigger a +$500 job with a GitHub PR label?
Yes, rather than none. I see more reasons for "yes" than for dropping labels. A few points, sorry for conciseness: …
Mainly, my answer to your questions is: flexible configuration + you can still trigger workflow dispatch.
Oh, okay, got it! Thank you, Simon.
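For context on the workflow-dispatch alternative discussed above, here is a minimal sketch of a dispatch trigger that makes (model, eval_dataset) explicit. All input names, types, and choices are illustrative assumptions, not the repository's actual configuration:

```yaml
# Hypothetical workflow_dispatch trigger — every input below is an
# assumption for illustration, not the actual run-eval.yml interface.
on:
  workflow_dispatch:
    inputs:
      model:
        description: "Model to evaluate"
        required: true
        type: string
      eval_dataset:
        description: "Benchmark dataset"
        required: true
        type: choice
        options:
          - swebench
      n_instances:
        description: "Instance count (matches a pre-built image tier)"
        required: true
        type: choice
        options:
          - "1"
          - "50"
          - "200"
```

Compared with labels, dispatch inputs let a maintainer pick the model/dataset combination per run, instead of minting a new label for every variant.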
Summary
This PR updates the `run-eval` workflow to use eval labels that align with the benchmarks repository's image build tiers.

Changes
Updated eval labels:
- `run-eval-1`: Quick debugging (1 instance)
- `run-eval-50`: Standard testing (50 instances)
- `run-eval-200`: Extended testing (200 instances)

Removed labels:

- `run-eval-2`: No matching benchmarks build tier
- `run-eval-10`: No matching benchmarks build tier
- `run-eval-100`: No matching benchmarks build tier

Rationale
The benchmarks repo provides these build label tiers:
- `build-swebench-50`: Build 50 images (~5-10 minutes)
- `build-swebench-200`: Build 200 images (~20-40 minutes)
- `build-swebench`: Build all images (full evaluation)

By aligning our eval labels with these tiers, we ensure that eval instance counts match the available pre-built image tiers, avoiding unnecessary image builds.
Testing
- `workflow_dispatch` input options (lines 21-23)
- `pull_request_target` label condition (lines 55-57)

🤖 Generated with Claude Code
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
- `eclipse-temurin:17-jdk`
- `nikolaik/python-nodejs:python3.12-nodejs22`
- `golang:1.21-bookworm`

Pull (multi-arch manifest)
```bash
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:c6fd3db-python
```

Run
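As a hedged example of starting a pulled variant, the port mapping below is an assumption; the agent-server's actual listen port and flags are not documented in this PR:

```bash
# Hypothetical invocation — the host/container port mapping is an assumption.
docker run -it --rm -p 8000:8000 ghcr.io/openhands/agent-server:c6fd3db-python
```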
All tags pushed for this build
About Multi-Architecture Support
- The variant tag (e.g., `c6fd3db-python`) is a multi-arch manifest supporting both amd64 and arm64
- Architecture-specific tags (e.g., `c6fd3db-python-amd64`) are also available if needed