
Conversation


@simonrosenberg simonrosenberg commented Nov 25, 2025

Summary

This PR updates the run-eval workflow to use eval labels that align with the benchmarks repository's image build tiers.

Changes

Updated eval labels:

  • run-eval-1: Quick debugging (1 instance)
  • run-eval-50: Standard testing (50 instances)
  • run-eval-200: Extended testing (200 instances)

Removed labels:

  • run-eval-2: No matching benchmarks build tier
  • run-eval-10: No matching benchmarks build tier
  • run-eval-100: No matching benchmarks build tier

Rationale

The benchmarks repo provides these build label tiers:

  • build-swebench-50: Build 50 images (~5-10 minutes)
  • build-swebench-200: Build 200 images (~20-40 minutes)
  • build-swebench: Build all images (full evaluation)

By aligning our eval labels with these tiers, we ensure:

  1. Pre-built images are available for requested eval instance counts
  2. No wasted image builds for instance counts we don't use
  3. Consistent tier structure across SDK and benchmarks repos

Testing

  • Workflow validation passed (YAML formatting, pre-commit hooks)
  • Labels updated in three places (see the sketch after this list):
    • workflow_dispatch input options (lines 21-23)
    • pull_request_target label condition (lines 55-57)
    • Parameter resolution case statement (lines 116-118)
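
A rough sketch of how those three spots can fit together in a GitHub Actions workflow. The label names come from this PR; the input name, job layout, and variable names below are illustrative, not the exact contents of run-eval.yml:

on:
  workflow_dispatch:
    inputs:
      eval_label:            # illustrative input name
        type: choice
        options:
          - run-eval-1
          - run-eval-50
          - run-eval-200
  pull_request_target:
    types: [labeled]

jobs:
  run-eval:
    # Only run for the three supported labels (or a manual dispatch)
    if: |
      github.event_name == 'workflow_dispatch' ||
      github.event.label.name == 'run-eval-1' ||
      github.event.label.name == 'run-eval-50' ||
      github.event.label.name == 'run-eval-200'
    runs-on: ubuntu-latest
    steps:
      - name: Resolve instance count from the label
        run: |
          case "${{ github.event.label.name || inputs.eval_label }}" in
            run-eval-1)   N_INSTANCES=1 ;;
            run-eval-50)  N_INSTANCES=50 ;;
            run-eval-200) N_INSTANCES=200 ;;
          esac
          echo "N_INSTANCES=$N_INSTANCES" >> "$GITHUB_ENV"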

🤖 Generated with Claude Code


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant   Architectures   Base Image
java      amd64, arm64    eclipse-temurin:17-jdk
python    amd64, arm64    nikolaik/python-nodejs:python3.12-nodejs22
golang    amd64, arm64    golang:1.21-bookworm

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:c6fd3db-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-c6fd3db-python \
  ghcr.io/openhands/agent-server:c6fd3db-python

All tags pushed for this build

ghcr.io/openhands/agent-server:c6fd3db-golang-amd64
ghcr.io/openhands/agent-server:c6fd3db-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:c6fd3db-golang-arm64
ghcr.io/openhands/agent-server:c6fd3db-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:c6fd3db-java-amd64
ghcr.io/openhands/agent-server:c6fd3db-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:c6fd3db-java-arm64
ghcr.io/openhands/agent-server:c6fd3db-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:c6fd3db-python-amd64
ghcr.io/openhands/agent-server:c6fd3db-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:c6fd3db-python-arm64
ghcr.io/openhands/agent-server:c6fd3db-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:c6fd3db-golang
ghcr.io/openhands/agent-server:c6fd3db-java
ghcr.io/openhands/agent-server:c6fd3db-python

About Multi-Architecture Support

  • Each variant tag (e.g., c6fd3db-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., c6fd3db-python-amd64) are also available if needed (see the example below)
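
For example, either of the following pins the amd64 image explicitly (the arch-specific tag is taken from the list above; --platform is the standard Docker CLI flag):

# Pull via the architecture-specific tag
docker pull ghcr.io/openhands/agent-server:c6fd3db-python-amd64

# Or force the platform when pulling the multi-arch tag
docker pull --platform linux/amd64 ghcr.io/openhands/agent-server:c6fd3db-python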

@simonrosenberg simonrosenberg self-assigned this Nov 25, 2025
@simonrosenberg simonrosenberg added the run-eval-1 (Runs evaluation on 1 SWE-bench instance) label Nov 25, 2025
@github-actions

Evaluation Triggered



@xingyaoww xingyaoww left a comment


would also want @neubig's thought on how this will work on OH index (e.g., we might have multiple datasets)

-  github.event.label.name == 'run-eval-2' ||
   github.event.label.name == 'run-eval-50' ||
-  github.event.label.name == 'run-eval-100'))
+  github.event.label.name == 'run-eval-200'))

We should also add eval-500 for the full set

@simonrosenberg

> would also want @neubig's thought on how this will work on OH index (e.g., we might have multiple datasets)

if evaluation is getting richer and richer, perhaps we should drop label triggers entirely since they can't really specify (model, eval_dataset) easily?


enyst commented Nov 25, 2025

> if evaluation is getting richer and richer, perhaps we should drop label triggers entirely since they can't really specify (model, eval_dataset) easily?

In that case, how do we run it?


simonrosenberg commented Nov 25, 2025

> if evaluation is getting richer and richer, perhaps we should drop label triggers entirely since they can't really specify (model, eval_dataset) easily?
>
> In that case, how do we run it?

By workflow dispatch from the (private) evaluation repository, allowing you to specify a custom SDK branch, a custom set of models, a custom benchmark config, ...

Do we really want to trigger a $500+ job with a GitHub PR label?


enyst commented Nov 25, 2025

> if evaluation is getting richer and richer, perhaps we should drop label triggers entirely since they can't really specify (model, eval_dataset) easily?
>
> In that case, how do we run it?
>
> By workflow dispatch from the (private) evaluation repository, allowing you to specify a custom SDK branch, a custom set of models, a custom benchmark config, ...
>
> Do we really want to trigger a $500+ job with a GitHub PR label?

Yes, rather than none. I see more reasons for Yes than for "dropping labels". A few points, sorry for conciseness:

  1. The main point is, dropping labels entirely means it removes the ability of open source maintainers, such as yours truly, to eval using this workflow, if I understand correctly?

  2. As far as I know, those able to trigger jobs with this workflow are already able to run full evals on the remote AH infra and AH LLMs. We just need to start it on our machines, and wait for it on our machines, rather than using github. I have done it sometimes. So is the cost a real problem? Idk, how is cost a problem, if we already can? 😅

  3. In the past I've been involved in trying to get this workflow to work, so I can use it to help people who needed evals in the community. I really liked the idea of this eval, so I can run it without blocking my machine over some experiment people had in some PRs. It eventually succeeded, and it was working (-ish, not on forks, I had to make a duplicate branch). However, it was unreliable for a while. From when it was reliable, until we started agent-sdk, not many times and not many others have used it. I have used it when necessary and not when not. Graham, Xingyao and Hoang have used it at times. Do we expect that suddenly we'll all become trigger happy? 😅

  4. It seems easy to make it work? Because the workflow already has an approved list. The list could include who are able to trigger a $500 eval, if we want to include a full eval, and that's it?


simonrosenberg commented Nov 25, 2025

> if evaluation is getting richer and richer, perhaps we should drop label triggers entirely since they can't really specify (model, eval_dataset) easily?
>
> In that case, how do we run it?
>
> By workflow dispatch from the (private) evaluation repository, allowing you to specify a custom SDK branch, a custom set of models, a custom benchmark config, ...
> Do we really want to trigger a $500+ job with a GitHub PR label?
>
> Yes, rather than none. I see more reasons for Yes than for "dropping labels". A few points, sorry for conciseness:
>
>   1. The main point is, dropping labels entirely means it removes the ability of open source maintainers, such as yours truly, to eval using this workflow, if I understand correctly?
>   2. As far as I know, those able to trigger jobs with this workflow are already able to run full evals on the remote AH infra and AH LLMs. We just need to start it on our machines, and wait for it on our machines, rather than using github. I have done it sometimes. So is the cost a real problem? Idk, how is cost a problem, if we already can? 😅
>   3. In the past I've been involved in trying to get this workflow to work, so I can use it to help people who needed evals in the community. I really liked the idea of this eval, so I can run it without blocking my machine over some experiment people had in some PRs. It eventually succeeded, and it was working (-ish, not on forks, I had to make a duplicate branch). However, it was unreliable for a while. From when it was reliable, until we started agent-sdk, not many times and not many others have used it. I have used it when necessary and not when not. Graham, Xingyao and Hoang have used it at times. Do we expect that suddenly we'll all become trigger happy? 😅
>   4. It seems easy to make it work? Because the workflow already has an approved list. The list could include who are able to trigger a $500 eval, if we want to include a full eval, and that's it?
  1. No. The current SDK run-eval.yml exposes a workflow_dispatch that itself calls the evaluation repo, so you can still trigger an eval with workflow dispatch.
  2. It's not just the cost. We will be adding more and more benchmarks and more and more models, so evals can be configured in a 2D grid of model × benchmark. Labels don't easily allow such flexibility.
  3. In any case you can still trigger the workflow via dispatch. It just seems that a $500+ job should require flexible and thoughtful configuration.

Mainly, my answer to your questions is: flexible configuration, plus you can still trigger it via workflow dispatch (sketched below).
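
For illustration only, a manual dispatch could look something like this from the CLI; the workflow file name comes from this thread, while the input names are hypothetical placeholders for whatever the dispatch actually exposes:

# Hypothetical inputs -- replace with the real workflow_dispatch input names
gh workflow run run-eval.yml \
  --repo OpenHands/agent-sdk \
  -f model=<model-id> \
  -f dataset=<benchmark> \
  -f sdk_branch=<branch>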


enyst commented Nov 25, 2025

Oh, okay, got it! Thank you, Simon.

@simonrosenberg simonrosenberg merged commit 1e8692b into main Nov 25, 2025
17 checks passed
@simonrosenberg simonrosenberg deleted the align-eval-labels-with-benchmarks-tiers branch November 25, 2025 17:06
