
Conversation


@simonrosenberg simonrosenberg commented Nov 25, 2025

Summary

This PR updates the run-eval workflow to use eval labels that align with the benchmarks repository's image build tiers.

Changes

Updated eval labels:

  • run-eval-1: Quick debugging (1 instance)
  • run-eval-50: Standard testing (50 instances)
  • run-eval-200: Extended testing (200 instances)

Removed labels:

  • run-eval-2: No matching benchmarks build tier
  • run-eval-10: No matching benchmarks build tier
  • run-eval-100: No matching benchmarks build tier

Rationale

The benchmarks repo provides these build label tiers:

  • build-swebench-50: Build 50 images (~5-10 minutes)
  • build-swebench-200: Build 200 images (~20-40 minutes)
  • build-swebench: Build all images (full evaluation)

By aligning our eval labels with these tiers, we ensure:

  1. Pre-built images are available for requested eval instance counts
  2. No wasted image builds for instance counts we don't use
  3. Consistent tier structure across SDK and benchmarks repos

Testing

  • Workflow validation passed (YAML formatting, pre-commit hooks)
  • Labels updated in three places (see the sketch after this list):
    • workflow_dispatch input options (lines 21-23)
    • pull_request_target label condition (lines 55-57)
    • Parameter resolution case statement (lines 116-118)
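
A rough sketch of how those three spots can fit together in a GitHub Actions workflow. The label names come from this PR; the input name, job layout, and variable names below are illustrative, not the exact contents of run-eval.yml:

on:
  workflow_dispatch:
    inputs:
      eval_label:            # illustrative input name
        type: choice
        options:
          - run-eval-1
          - run-eval-50
          - run-eval-200
  pull_request_target:
    types: [labeled]

jobs:
  run-eval:
    # Only run for the three supported labels (or a manual dispatch)
    if: |
      github.event_name == 'workflow_dispatch' ||
      github.event.label.name == 'run-eval-1' ||
      github.event.label.name == 'run-eval-50' ||
      github.event.label.name == 'run-eval-200'
    runs-on: ubuntu-latest
    steps:
      - name: Resolve instance count from the label
        run: |
          case "${{ github.event.label.name || inputs.eval_label }}" in
            run-eval-1)   N_INSTANCES=1 ;;
            run-eval-50)  N_INSTANCES=50 ;;
            run-eval-200) N_INSTANCES=200 ;;
          esac
          echo "N_INSTANCES=$N_INSTANCES" >> "$GITHUB_ENV"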

🤖 Generated with Claude Code


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant   Architectures   Base Image
java      amd64, arm64    eclipse-temurin:17-jdk
python    amd64, arm64    nikolaik/python-nodejs:python3.12-nodejs22
golang    amd64, arm64    golang:1.21-bookworm

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:c6fd3db-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-c6fd3db-python \
  ghcr.io/openhands/agent-server:c6fd3db-python

All tags pushed for this build

ghcr.io/openhands/agent-server:c6fd3db-golang-amd64
ghcr.io/openhands/agent-server:c6fd3db-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:c6fd3db-golang-arm64
ghcr.io/openhands/agent-server:c6fd3db-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:c6fd3db-java-amd64
ghcr.io/openhands/agent-server:c6fd3db-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:c6fd3db-java-arm64
ghcr.io/openhands/agent-server:c6fd3db-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:c6fd3db-python-amd64
ghcr.io/openhands/agent-server:c6fd3db-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:c6fd3db-python-arm64
ghcr.io/openhands/agent-server:c6fd3db-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:c6fd3db-golang
ghcr.io/openhands/agent-server:c6fd3db-java
ghcr.io/openhands/agent-server:c6fd3db-python

About Multi-Architecture Support

  • Each variant tag (e.g., c6fd3db-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., c6fd3db-python-amd64) are also available if needed (see the example below)
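
For example, either of the following pins the amd64 image explicitly (the arch-specific tag is taken from the list above; --platform is the standard Docker CLI flag):

# Pull via the architecture-specific tag
docker pull ghcr.io/openhands/agent-server:c6fd3db-python-amd64

# Or force the platform when pulling the multi-arch tag
docker pull --platform linux/amd64 ghcr.io/openhands/agent-server:c6fd3db-python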

@simonrosenberg simonrosenberg self-assigned this Nov 25, 2025
@simonrosenberg simonrosenberg added the run-eval-1 (Runs evaluation on 1 SWE-bench instance) label Nov 25, 2025
@github-actions

Evaluation Triggered



@xingyaoww xingyaoww left a comment


would also want @neubig's thought on how this will work on OH index (e.g., we might have multiple datasets)

-  github.event.label.name == 'run-eval-2' ||
   github.event.label.name == 'run-eval-50' ||
-  github.event.label.name == 'run-eval-100'))
+  github.event.label.name == 'run-eval-200'))

We should also add eval-500 for the full set

@simonrosenberg

> would also want @neubig's thought on how this will work on OH index (e.g., we might have multiple datasets)

if evaluation is getting richer and richer, perhaps we should drop label triggers entirely since they can't really specify (model, eval_dataset) easily?


enyst commented Nov 25, 2025

> if evaluation is getting richer and richer, perhaps we should drop label triggers entirely since they can't really specify (model, eval_dataset) easily?

In that case, how do we run it?


simonrosenberg commented Nov 25, 2025

> if evaluation is getting richer and richer, perhaps we should drop label triggers entirely since they can't really specify (model, eval_dataset) easily?
>
> In that case, how do we run it?

By workflow dispatch from the (private) evaluation repository, allowing you to specify a custom SDK branch, a custom set of models, a custom benchmark config, ...

Do we really want to trigger a $500+ job with a GitHub PR label?


enyst commented Nov 25, 2025

> if evaluation is getting richer and richer, perhaps we should drop label triggers entirely since they can't really specify (model, eval_dataset) easily?
>
> In that case, how do we run it?
>
> By workflow dispatch from the (private) evaluation repository, allowing you to specify a custom SDK branch, a custom set of models, a custom benchmark config, ...
>
> Do we really want to trigger a $500+ job with a GitHub PR label?

Yes, rather than none. I see more reasons for Yes than for "dropping labels". A few points, sorry for conciseness:

  1. The main point is, dropping labels entirely means it removes the ability of open source maintainers, such as yours truly, to eval using this workflow, if I understand correctly?

  2. As far as I know, those able to trigger jobs with this workflow are already able to run full evals on the remote AH infra and AH LLMs. We just need to start it on our machines, and wait for it on our machines, rather than using github. I have done it sometimes. So is the cost a real problem? Idk, how is cost a problem, if we already can? 😅

  3. In the past I've been involved in trying to get this workflow to work, so I can use it to help people who needed evals in the community. I really liked the idea of this eval, so I can run it without blocking my machine over some experiment people had in some PRs. It eventually succeeded, and it was working (-ish, not on forks, I had to make a duplicate branch). However, it was unreliable for a while. From when it was reliable, until we started agent-sdk, not many times and not many others have used it. I have used it when necessary and not when not. Graham, Xingyao and Hoang have used it at times. Do we expect that suddenly we'll all become trigger happy? 😅

  4. It seems easy to make it work? Because the workflow already has an approved list. The list could include who are able to trigger a $500 eval, if we want to include a full eval, and that's it?


simonrosenberg commented Nov 25, 2025

> if evaluation is getting richer and richer, perhaps we should drop label triggers entirely since they can't really specify (model, eval_dataset) easily?
>
> In that case, how do we run it?
>
> By workflow dispatch from the (private) evaluation repository, allowing you to specify a custom SDK branch, a custom set of models, a custom benchmark config, ...
> Do we really want to trigger a $500+ job with a GitHub PR label?
>
> Yes, rather than none. I see more reasons for Yes than for "dropping labels". A few points, sorry for conciseness:
>
>   1. The main point is, dropping labels entirely means it removes the ability of open source maintainers, such as yours truly, to eval using this workflow, if I understand correctly?
>   2. As far as I know, those able to trigger jobs with this workflow are already able to run full evals on the remote AH infra and AH LLMs. We just need to start it on our machines, and wait for it on our machines, rather than using github. I have done it sometimes. So is the cost a real problem? Idk, how is cost a problem, if we already can? 😅
>   3. In the past I've been involved in trying to get this workflow to work, so I can use it to help people who needed evals in the community. I really liked the idea of this eval, so I can run it without blocking my machine over some experiment people had in some PRs. It eventually succeeded, and it was working (-ish, not on forks, I had to make a duplicate branch). However, it was unreliable for a while. From when it was reliable, until we started agent-sdk, not many times and not many others have used it. I have used it when necessary and not when not. Graham, Xingyao and Hoang have used it at times. Do we expect that suddenly we'll all become trigger happy? 😅
>   4. It seems easy to make it work? Because the workflow already has an approved list. The list could include who are able to trigger a $500 eval, if we want to include a full eval, and that's it?
  1. No. The current SDK run-eval.yml exposes a workflow_dispatch that itself calls the evaluation repo, so you can still trigger an eval with workflow dispatch.
  2. It's not just the cost. We will be adding more and more benchmarks and more and more models, so evals can be configured in a 2D grid of model × benchmark. Labels don't easily allow such flexibility.
  3. In any case you can still trigger the workflow via dispatch. It just seems that a $500+ job should require flexible and thoughtful configuration.

Mainly, my answer to your questions is: flexible configuration, plus you can still trigger it via workflow dispatch (sketched below).
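
For illustration only, a manual dispatch could look something like this from the CLI; the workflow file name comes from this thread, while the input names are hypothetical placeholders for whatever the dispatch actually exposes:

# Hypothetical inputs -- replace with the real workflow_dispatch input names
gh workflow run run-eval.yml \
  --repo OpenHands/agent-sdk \
  -f model=<model-id> \
  -f dataset=<benchmark> \
  -f sdk_branch=<branch>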


enyst commented Nov 25, 2025

Oh, okay, got it! Thank you, Simon.

@simonrosenberg simonrosenberg merged commit 1e8692b into main Nov 25, 2025
17 checks passed
@simonrosenberg simonrosenberg deleted the align-eval-labels-with-benchmarks-tiers branch November 25, 2025 17:06
