Support Remote Runtime #85

xingyaoww · 2025-11-11T19:37:10Z

Command to run eval

uv run swebench-infer .llm_config/sonnet-4-5.json --n-limit 200 --workspace remote --num-workers 32 --max-iterations 500

…hing - Create workflow that can be manually triggered via workflow_dispatch - Integrate Blacksmith caching for faster Docker builds - Configure workflow to push images to ghcr.io/openhands/eval-agent-server - Make --critic parameter optional in build_images.py for build-only usage - Fix .gitignore patterns for eval_outputs and builds directories This workflow follows Blacksmith documentation for Docker builds and allows building SWE-Bench evaluation images with configurable parameters like dataset, split, target, platforms, and concurrent workers. Closes #37

…caching Following the pattern from OpenHands/software-agent-sdk#990 and Blacksmith's official documentation (https://docs.blacksmith.sh/blacksmith-caching/docker-builds), this change replaces the standard docker/setup-buildx-action with useblacksmith/setup-docker-builder@v1. Key improvements: - Replaces docker/setup-buildx-action@v3 with useblacksmith/setup-docker-builder@v1 - Removes manual cache configuration (useblacksmith/cache@v6) - Blacksmith's Docker builder automatically manages Docker layer caching via NVMe-backed sticky disks - Provides 2x to 40x improvements in build times according to Blacksmith's customers - Since we only build amd64 images, we don't need the complex multi-platform matrix strategy This approach is recommended for workflows that use Docker commands directly (as opposed to using docker/build-push-action). Co-authored-by: openhands <openhands@all-hands.dev>

…s/build-swe-bench-images-workflow

The GitHub Actions workflow was failing because uv was trying to build pyarrow from source, which requires the Arrow C++ library and CMake. This change adds the --no-build-package pyarrow flag to force uv to use the pre-built binary wheel instead of attempting to build from source. Co-authored-by: openhands <openhands@all-hands.dev>

The root cause of the build failure was that uv was installing Python 3.14.0, which doesn't have binary wheels for pyarrow 21.0.0 yet. This caused uv to attempt building from source, which failed due to missing Arrow C++ libraries. Solution: Added .python-version file to pin Python to 3.12, which matches the project's target-version in pyproject.toml and has full binary wheel support for all dependencies. Co-authored-by: openhands <openhands@all-hands.dev>

Use github.run_id instead of dataset/split names which contain slashes that are invalid in artifact names. Also added if-no-files-found: warn to provide better feedback if logs are missing. Co-authored-by: openhands <openhands@all-hands.dev>

…cters GitHub Actions artifact upload doesn't allow colons in filenames, but our log paths contain colons from Docker image tags (e.g., 'django-11999:latest'). Archive the entire builds directory into a tar.gz before upload to work around this restriction. Co-authored-by: openhands <openhands@all-hands.dev>

Docker image tags have a maximum length of 128 characters. When building SWE-Bench images with long base image names (e.g., scikit-learn), the generated cache tags exceed this limit and cause build failures with: 'ERROR: failed to configure registry cache exporter: invalid reference format' Solution: Apply a patch to vendor/software-agent-sdk that hashes the base_image_slug when it would cause the final tag to exceed 128 characters. Uses SHA256 hash (first 12 chars) to create a shorter unique identifier while maintaining cache efficiency. The patch is applied during the workflow setup before installing dependencies. Co-authored-by: openhands <openhands@all-hands.dev>

Updated the patch to match the formatting requirements from ruff and other pre-commit checks. This ensures the patch applies cleanly and passes all linting/formatting checks. Co-authored-by: openhands <openhands@all-hands.dev>

…s/build-swe-bench-images-workflow

This reverts commit 3ba1e46.

The build workflow was experiencing log file corruption and I/O errors due to concurrent builds writing to the wrong log files. This was caused by using ThreadPoolExecutor with contextlib.redirect_stdout/stderr, which only provides thread-local redirection of Python-level writes. The SDK's build() function spawns subprocesses and uses logger.info()/warning() to output build logs. Logger handlers write to process-wide file descriptors, not thread-local redirected streams, causing output from concurrent threads to: - Write to the wrong log files - Attempt writing to closed file handles - Result in ValueError('I/O operation on closed file.') Solution: Replace ThreadPoolExecutor with ProcessPoolExecutor to provide complete process-level isolation with separate stdout/stderr/logging per build. The additional overhead is negligible compared to Docker build time. Changes: - Import ProcessPoolExecutor instead of ThreadPoolExecutor - Move build_one_fn to module level (_build_with_logging) for pickle support - Update executor initialization to use ProcessPoolExecutor - Add explanatory comments about isolation requirements Co-authored-by: openhands <openhands@all-hands.dev>

This commit improves the tagging system for SWE-Bench Docker images to enable better reproducibility and clarity. ## Changes ### 1. Benchmarks Build System **benchmarks/swe_bench/build_images.py:** - Added `get_sdk_commit_hash()`: Extracts 7-char SDK submodule commit hash - Added `extract_instance_id()`: Parses SWE-Bench base images to extract instance IDs - Modified `main()`: Sets SDK_VERSION_OVERRIDE env var with SDK commit hash - Modified `build_one()`: - Generates custom tags: `swebench-{instance_id}` - Disables versioned tags via `include_versioned_tag=False` ### 2. SDK Submodule Update **vendor/software-agent-sdk:** Updated to commit 77d50e61 which includes: - `SDK_VERSION_OVERRIDE` environment variable support - `include_versioned_tag` option in BuildOptions - Target-based tag suffixes (replaces `-dev` suffix) - See: OpenHands/software-agent-sdk#1088 ### 3. Documentation **TAGGING_CHANGES.md:** Comprehensive documentation explaining: - Why these changes are needed (submodule git context issues) - Tag format comparison (before/after) - Benefits (reproducibility, usability, maintainability) - Implementation details and examples ## Tag Format ### Before ``` v1.0.0_docker.io_s_swebench_s_sweb.eval.x86_64.django_1776_django-12155_tag_latest_source-minimal-dev ``` - 137 characters - Package version (non-reproducible) - Unclear `-dev` suffix ### After ``` a612c0a-swebench-django-12155-source-minimal main-swebench-django-12155-source-minimal ``` - 84 characters (39% shorter) - Exact commit hash (reproducible) - Clear target indication ## Benefits 1. **Reproducibility**: Git commit hash ensures exact SDK version tracking 2. **Clarity**: Instance ID and target clearly visible in tag 3. **Consistency**: All builds use same suffix pattern 4. **Backward Compatible**: SDK changes only apply when explicitly enabled ## Related - SDK PR: OpenHands/software-agent-sdk#1088 - Issue: Improve SWE-Bench image build workflow Co-authored-by: openhands <openhands@all-hands.dev>

Updated SDK submodule to bc25aa0d which omits the target suffix for binary builds since it's the default/common case. This keeps tags cleaner. Tag examples: - Binary: a612c0a-swebench-django-12155 (no suffix) - Source: a612c0a-swebench-django-12155-source - Source-minimal: a612c0a-swebench-django-12155-source-minimal Updated TAGGING_CHANGES.md to reflect this behavior with updated examples showing both binary and source-minimal formats. Co-authored-by: openhands <openhands@all-hands.dev>

Updates SDK submodule to 27f37dc0 which fixes an issue where SHORT_SHA was using git info from the benchmarks repo instead of the SDK repo. Now tags correctly use the SDK commit hash when SDK_VERSION_OVERRIDE is set, ensuring proper versioning in vendored/submodule contexts. Co-authored-by: openhands <openhands@all-hands.dev>

SDK now automatically detects its own commit hash, so we don't need to manually extract and override it. This simplifies the build script significantly: - Removed get_sdk_commit_hash() function - Removed SDK_VERSION_OVERRIDE env var setting - Removed unused imports (subprocess, os) - Updated documentation to reflect simpler approach The SDK's _sdk_version() now automatically finds the SDK repo root and gets the commit hash directly, regardless of whether it's used as a submodule or vendored dependency. Co-authored-by: openhands <openhands@all-hands.dev>

Update SDK submodule to include automatic SDK_VERSION detection. SDK now auto-detects its own commit hash without requiring external override, making the tagging system fully automatic. Co-authored-by: openhands <openhands@all-hands.dev>

This reverts commit 8d8ed8c.

…s/build-swe-bench-images-workflow

…e-runtime

openhands-ai · 2025-11-13T17:19:00Z

Looks like there are a few issues preventing this PR from being merged!

GitHub Actions are failing:
- Build SWE-Bench Images

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #85 at branch `xw/remote-runtime`

Feel free to include any additional details that might help me get this PR into a better state.

_{^{You can manage your notification settings}}

xingyaoww · 2025-11-13T20:02:42Z

@OpenHands review diff introduced by this PR, understand how remote runtime works conceptually and add a section in README.md to explain how it works. Basically you will need:

Pre-build agent-server images for a particular SHA (you can label build-swebench in a PR to trigger this in this repo)
And launch the command with a working remote runtime API key

Then write a more detailed instruction of how to run SWE-Bench evaluation w/ Remote Runtime in swebench/README.md (use this command as example - uv run swebench-infer .llm_config/sonnet-4-5.json --n-limit 200 --workspace remote --num-workers 32 --max-iterations 500)

openhands-ai · 2025-11-13T20:02:52Z

I'm on it! xingyaoww can track my progress at all-hands.dev

openhands-ai · 2025-11-13T20:08:23Z

Summary

I've successfully reviewed PR #85 and added comprehensive documentation for Remote Runtime support. The changes have been committed and pushed to the xw/remote-runtime branch.

Changes Made:

✅ Main README.md - Added "Workspace Types" section explaining:

Docker Workspace (default) vs Remote Workspace comparison
How Remote Runtime works conceptually:
1. Pre-built agent images tagged with SDK SHA
2. Runtime API provisioning containers on-demand
3. Image existence verification before execution
4. Parallel execution in isolated containers
Prerequisites for using Remote Runtime:
- Pre-built images via build-swebench PR label or manual builds
- RUNTIME_API_KEY environment variable requirement
- Optional configuration (RUNTIME_API_URL, SDK_SHORT_SHA)

✅ benchmarks/swe_bench/README.md - Added detailed instructions:

Reorganized into "Docker Workspace" and "Remote Workspace" sections
Remote Workspace guide with:
- Step 1: Pre-building images (automated via PR label or manual)
- Step 2: Environment variable setup
- Step 3: Running inference with example commands
- Practical examples for full-scale and subset evaluations
- Troubleshooting section for common errors
- Comparison table: Docker vs Remote workspace

Key Features Documented:

Automated Image Building: Label a PR with build-swebench to trigger GitHub Actions that build and push images for all SWE-Bench instances
Scalable Evaluation: Use --workspace remote --num-workers 32 to run 32+ parallel evaluations

Example Command (from PR description):

uv run swebench-infer .llm_config/sonnet-4-5.json \
    --n-limit 200 --workspace remote --num-workers 32 --max-iterations 500

The documentation is now comprehensive, well-structured, and ready for users to leverage the Remote Runtime feature for large-scale SWE-Bench evaluations.

PR #85 has been updated: #85

_{View full conversation}

xingyaoww · 2025-11-14T19:05:22Z

Remote runtime eval is ready 🔥
362/500 (72.4%) using remote runtime + sonnet 4.5
This is comparable to 72.8% that Juan got earlier!

Documentation for how to use it now: https://github.com/OpenHands/benchmarks/tree/main/benchmarks/swe_bench#remote-workspace-scalable-cloud-evaluation
Total instances: 500
Instances submitted: 499
Instances completed: 498
Instances incomplete: 1
Instances resolved: 362
Instances unresolved: 136
Instances with empty patches: 1
Instances with errors: 0
Unstopped containers: 0
Unremoved images: 500

Benchmarks commit: 376cd94

openhands-agent and others added 30 commits October 27, 2025 21:39

Merge commit 'bb150852c64a555806cfa939f31e8f9abd7b3791' into openhand…

282f863

…s/build-swe-bench-images-workflow

revert unneed stuff

8508006

simplify setup dependency

a565e77

set eval-agent-server

9bbd7fb

fix line break

c661b2c

default to 10 for testing

632432e

run on all prs for debugging

c536903

Update patch with pre-commit formatting fixes

21bb226

Updated the patch to match the formatting requirements from ruff and other pre-commit checks. This ensures the patch applies cleanly and passes all linting/formatting checks. Co-authored-by: openhands <openhands@all-hands.dev>

checkout to v1.0.0 of sdk

2f89775

update uv.lock

dfb966b

Merge commit 'dfb966bd2d3e4d2086223cf4ff85d998d15354d4' into openhand…

d04de8a

…s/build-swe-bench-images-workflow

Revert "Fix Docker cache tag length exceeding 128 character limit"

cdd7200

This reverts commit 3ba1e46.

chore: update SDK to commit 85e436df

6d6845e

Update SDK submodule to include automatic SDK_VERSION detection. SDK now auto-detects its own commit hash without requiring external override, making the tagging system fully automatic. Co-authored-by: openhands <openhands@all-hands.dev>

update agent-sdk version

8d8ed8c

improve custom tags for swebench image

8763fad

Revert "update agent-sdk version"

99927f8

This reverts commit 8d8ed8c.

Merge commit '2ca8a917036ddb6ac069b3ecbb0f14ec616a4883' into openhand…

8ed14f3

…s/build-swe-bench-images-workflow

update sha

7e3c50e

xingyaoww added build-swebench Build 500 SWE-Bench Verified Image based on SDK version on this PR. and removed build-swebench Build 500 SWE-Bench Verified Image based on SDK version on this PR. labels Nov 13, 2025

Merge commit '03cd6395e407d1463ed99e2eb80466fe9b10d590' into xw/remot…

422282e

…e-runtime

xingyaoww removed the build-swebench Build 500 SWE-Bench Verified Image based on SDK version on this PR. label Nov 13, 2025

trying fixing docker build trigger

5d734aa

xingyaoww added the build-swebench Build 500 SWE-Bench Verified Image based on SDK version on this PR. label Nov 13, 2025

fix typo

3e1f8f9

xingyaoww added build-swebench Build 500 SWE-Bench Verified Image based on SDK version on this PR. and removed build-swebench Build 500 SWE-Bench Verified Image based on SDK version on this PR. labels Nov 13, 2025

tweak

8601875

xingyaoww removed the build-swebench Build 500 SWE-Bench Verified Image based on SDK version on this PR. label Nov 13, 2025

tweak

af6966a

xingyaoww added the build-swebench Build 500 SWE-Bench Verified Image based on SDK version on this PR. label Nov 13, 2025

xingyaoww added 2 commits November 13, 2025 16:36

drop default

2160810

Merge commit 'b3f5ab74e589803943cd65414ef2510e6b1d2966' into xw/remot…

19d58fa

…e-runtime

xingyaoww added build-swebench Build 500 SWE-Bench Verified Image based on SDK version on this PR. and removed build-swebench Build 500 SWE-Bench Verified Image based on SDK version on this PR. labels Nov 13, 2025

xingyaoww added 2 commits November 13, 2025 17:22

sleep after failure

fd5c0c6

check target image existence before build

ea3f69f

xingyaoww added build-swebench Build 500 SWE-Bench Verified Image based on SDK version on this PR. and removed build-swebench Build 500 SWE-Bench Verified Image based on SDK version on this PR. labels Nov 13, 2025

xingyaoww requested a review from simonrosenberg November 13, 2025 17:33

simonrosenberg approved these changes Nov 13, 2025

View reviewed changes

xingyaoww merged commit 3dfeb4b into main Nov 13, 2025
3 checks passed

xingyaoww mentioned this pull request Nov 13, 2025

Add comprehensive documentation for Remote Runtime workspace #94

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Support Remote Runtime #85

Support Remote Runtime #85

Uh oh!

xingyaoww commented Nov 11, 2025 •

edited

Loading

Uh oh!

openhands-ai bot commented Nov 13, 2025

Uh oh!

Uh oh!

xingyaoww commented Nov 13, 2025

Uh oh!

openhands-ai bot commented Nov 13, 2025

Uh oh!

openhands-ai bot commented Nov 13, 2025

Uh oh!

xingyaoww commented Nov 14, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Support Remote Runtime #85

Support Remote Runtime #85

Uh oh!

Conversation

xingyaoww commented Nov 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openhands-ai bot commented Nov 13, 2025

Uh oh!

Uh oh!

xingyaoww commented Nov 13, 2025

Uh oh!

openhands-ai bot commented Nov 13, 2025

Uh oh!

openhands-ai bot commented Nov 13, 2025

Summary

Changes Made:

Key Features Documented:

Uh oh!

xingyaoww commented Nov 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

xingyaoww commented Nov 11, 2025 •

edited

Loading

xingyaoww commented Nov 14, 2025 •

edited

Loading