Merged
75 commits
c9b86c1
Add GitHub workflow for building SWE-Bench images with Blacksmith cac…
openhands-agent Oct 27, 2025
5752043
Use Blacksmith's setup-docker-builder action for faster Docker layer …
openhands-agent Nov 3, 2025
282f863
Merge commit 'bb150852c64a555806cfa939f31e8f9abd7b3791' into openhand…
xingyaoww Nov 4, 2025
8508006
revert unneed stuff
xingyaoww Nov 4, 2025
a565e77
simplify setup dependency
xingyaoww Nov 4, 2025
9bbd7fb
set eval-agent-server
xingyaoww Nov 4, 2025
c661b2c
fix line break
xingyaoww Nov 4, 2025
632432e
default to 10 for testing
xingyaoww Nov 4, 2025
c536903
run on all prs for debugging
xingyaoww Nov 4, 2025
efb731f
Fix pyarrow build issue by forcing binary wheel installation
openhands-agent Nov 4, 2025
29084f2
Pin Python version to 3.12 to fix pyarrow compatibility
openhands-agent Nov 4, 2025
551405b
Fix artifact upload naming to avoid invalid characters
openhands-agent Nov 4, 2025
90b6ed6
Fix artifact upload by archiving logs to avoid invalid filename chara…
openhands-agent Nov 4, 2025
3ba1e46
Fix Docker cache tag length exceeding 128 character limit
openhands-agent Nov 4, 2025
21bb226
Update patch with pre-commit formatting fixes
openhands-agent Nov 4, 2025
2f89775
checkout to v1.0.0 of sdk
xingyaoww Nov 6, 2025
dfb966b
update uv.lock
xingyaoww Nov 6, 2025
d04de8a
Merge commit 'dfb966bd2d3e4d2086223cf4ff85d998d15354d4' into openhand…
xingyaoww Nov 6, 2025
cdd7200
Revert "Fix Docker cache tag length exceeding 128 character limit"
xingyaoww Nov 6, 2025
001bcee
Fix log file mixing issue by using ProcessPoolExecutor
openhands-agent Nov 6, 2025
271b527
Improve Docker image tagging for reproducibility
openhands-agent Nov 6, 2025
92f04c1
refactor: omit target suffix for binary builds (default case)
openhands-agent Nov 6, 2025
49d9667
fix: update SDK to use SDK_VERSION for commit tags
openhands-agent Nov 6, 2025
c2711a3
refactor: remove SDK_VERSION_OVERRIDE logic
openhands-agent Nov 6, 2025
6d6845e
chore: update SDK to commit 85e436df
openhands-agent Nov 6, 2025
8d8ed8c
update agent-sdk version
xingyaoww Nov 7, 2025
8763fad
improve custom tags for swebench image
xingyaoww Nov 7, 2025
99927f8
Revert "update agent-sdk version"
xingyaoww Nov 7, 2025
8ed14f3
Merge commit '2ca8a917036ddb6ac069b3ecbb0f14ec616a4883' into openhand…
xingyaoww Nov 7, 2025
7e3c50e
update sha
xingyaoww Nov 7, 2025
c118297
fix: update run_infer.py to use new SDK tag format
openhands-agent Nov 7, 2025
4f3f9b1
refactor: deduplicate extract_custom_tag by importing from run_infer
openhands-agent Nov 7, 2025
26c3f02
docs: clarify SHORT_SHA source in run_infer.py
openhands-agent Nov 7, 2025
89e4cda
update sdk
xingyaoww Nov 7, 2025
eacfe0b
refactor
xingyaoww Nov 7, 2025
3a2c009
remove tagging changes
xingyaoww Nov 7, 2025
84c8876
bump commit
xingyaoww Nov 7, 2025
de46db7
simplify build script
xingyaoww Nov 7, 2025
bcbd455
bump version
xingyaoww Nov 7, 2025
96f2da6
bump
xingyaoww Nov 7, 2025
aad870b
bump
xingyaoww Nov 7, 2025
acee9cb
refactor build util into shared file
xingyaoww Nov 7, 2025
a4bf9e4
simplify build on the fly logic
xingyaoww Nov 7, 2025
9ef0d48
remove targets and platform
xingyaoww Nov 7, 2025
06e994a
Add automatic comment to issue #81 on successful build
openhands-agent Nov 7, 2025
fba2a55
Fix SDK URL and add workflow trigger information
openhands-agent Nov 7, 2025
0ab219f
Update .gitignore to properly allow .openhands/microagents/
openhands-agent Nov 7, 2025
aa8b452
Add error handling to skip comment when no images are built
openhands-agent Nov 7, 2025
a95969e
Fix manifest file path detection using find command
openhands-agent Nov 7, 2025
46b5266
bump sdk
xingyaoww Nov 7, 2025
16526b3
increase n work and n limit
xingyaoww Nov 7, 2025
90ee94e
Show only one tag per image in issue comment
openhands-agent Nov 7, 2025
2d10954
bump sdk commit
xingyaoww Nov 8, 2025
178123e
increase to 500 limit and 32 concurrency
xingyaoww Nov 8, 2025
0619134
disable rebuild on every push
xingyaoww Nov 10, 2025
e67b9b0
Fix workflow summary mismatch: use manifest.jsonl instead of summary.…
openhands-agent Nov 10, 2025
822e417
Remove redundant 'Upload build manifest' step
openhands-agent Nov 10, 2025
04f0cf4
bump sdk to v1.1
xingyaoww Nov 11, 2025
a1c93c9
support remote runtime & bump ver again
xingyaoww Nov 11, 2025
07abd72
fix target type
xingyaoww Nov 11, 2025
59b6631
Merge commit '89162cbbba455b5b6aa69c9facbd8c11eb6ed9f2' into xw/remot…
xingyaoww Nov 11, 2025
4949957
bump sdk
xingyaoww Nov 11, 2025
cc121b5
Merge commit '4dab8b1e02bd89e2ffa258847c917746967e67dd' into xw/remot…
xingyaoww Nov 11, 2025
94c4326
check image exists before launching remote runtime job
xingyaoww Nov 12, 2025
0f621e4
Merge commit '34bcaea6fbf0477b6f6691ec9d2bbcda7dcafbcc' into xw/remot…
xingyaoww Nov 12, 2025
d7d6faf
Merge commit '15fd19d91fa933d20790abb3f87098f3d0874399' into xw/remot…
xingyaoww Nov 12, 2025
422282e
Merge commit '03cd6395e407d1463ed99e2eb80466fe9b10d590' into xw/remot…
xingyaoww Nov 13, 2025
5d734aa
trying fixing docker build trigger
xingyaoww Nov 13, 2025
3e1f8f9
fix typo
xingyaoww Nov 13, 2025
8601875
tweak
xingyaoww Nov 13, 2025
af6966a
tweak
xingyaoww Nov 13, 2025
2160810
drop default
xingyaoww Nov 13, 2025
19d58fa
Merge commit 'b3f5ab74e589803943cd65414ef2510e6b1d2966' into xw/remot…
xingyaoww Nov 13, 2025
fd5c0c6
sleep after failure
xingyaoww Nov 13, 2025
ea3f69f
check target image existence before build
xingyaoww Nov 13, 2025
95 changes: 65 additions & 30 deletions benchmarks/swe_bench/run_infer.py
@@ -17,6 +17,7 @@
construct_eval_output_dir,
get_default_on_result_writer,
)
from benchmarks.utils.image_utils import image_exists
from benchmarks.utils.models import (
EvalInstance,
EvalMetadata,
@@ -26,7 +27,7 @@
from openhands.sdk import LLM, Agent, Conversation, get_logger
from openhands.sdk.workspace import RemoteWorkspace
from openhands.tools.preset.default import get_default_tools
-from openhands.workspace import DockerWorkspace
+from openhands.workspace import APIRemoteWorkspace, DockerWorkspace


logger = get_logger(__name__)
@@ -96,45 +97,78 @@ def prepare_workspace(self, instance: EvalInstance) -> RemoteWorkspace:
"""
Use DockerWorkspace by default.
"""
SKIP_BUILD = os.getenv("SKIP_BUILD", "1").lower() in ("1", "true", "yes")
logger.info(f"SKIP_BUILD={SKIP_BUILD}")
official_docker_image = get_official_docker_image(instance.id)
build_target = "source-minimal"
custom_tag = extract_custom_tag(official_docker_image)

# For non-binary targets, append target suffix
suffix = f"-{build_target}" if build_target != "binary" else ""
agent_server_image = (
f"{EVAL_AGENT_SERVER_IMAGE}:{SDK_SHORT_SHA}-{custom_tag}{suffix}"
)
if not SKIP_BUILD:
logger.info(
f"Building workspace from {official_docker_image} "
f"for instance {instance.id}. "
"This may take a while...\n"
"You can run benchmarks/swe_bench/build_images.py and set "
"SWE_BENCH_SKIP_BUILD=1 to skip building and use pre-built "
"agent-server image."

if self.metadata.workspace_type == "docker":
agent_server_image = (
f"{EVAL_AGENT_SERVER_IMAGE}:{SDK_SHORT_SHA}-{custom_tag}{suffix}"
)
SKIP_BUILD = os.getenv("SKIP_BUILD", "1").lower() in ("1", "true", "yes")
logger.info(f"SKIP_BUILD={SKIP_BUILD}")
if not SKIP_BUILD:
logger.info(
f"Building workspace from {official_docker_image} "
f"for instance {instance.id}. "
"This may take a while...\n"
"You can run benchmarks/swe_bench/build_images.py and set "
"SKIP_BUILD=1 to skip building and use pre-built "
"agent-server image."
)
output = build_image(
base_image=official_docker_image,
target_image=EVAL_AGENT_SERVER_IMAGE,
custom_tag=custom_tag,
target=build_target,
push=False,
)
logger.info(f"Image build output: {output}")
assert output.error is None, f"Image build failed: {output.error}"
if agent_server_image not in output.tags:
raise RuntimeError(
f"Built image tags {output.tags} do not include expected tag "
f"{agent_server_image}"
)

workspace = DockerWorkspace(
server_image=agent_server_image,
working_dir="/workspace",
)
output = build_image(
base_image=official_docker_image,
target_image=EVAL_AGENT_SERVER_IMAGE,
custom_tag=custom_tag,
target=build_target,
push=False,
elif self.metadata.workspace_type == "remote":
runtime_api_key = os.getenv("RUNTIME_API_KEY")
sdk_short_sha = os.getenv("SDK_SHORT_SHA", SDK_SHORT_SHA)
if not runtime_api_key:
raise ValueError(
"RUNTIME_API_KEY environment variable is not set for remote workspace"
)

agent_server_image = (
f"{EVAL_AGENT_SERVER_IMAGE}:{sdk_short_sha}-{custom_tag}{suffix}"
)
logger.info(f"Image build output: {output}")
assert output.error is None, f"Image build failed: {output.error}"
if agent_server_image not in output.tags:
if not image_exists(agent_server_image):
raise RuntimeError(
f"Built image tags {output.tags} do not include expected tag "
f"{agent_server_image}"
f"Agent server image {agent_server_image} does not exist in container registry, "
"make sure to build it, push it, and make it publicly accessible before using remote workspace."
)
logger.info(
f"Using remote workspace with image {agent_server_image} (sdk sha: {sdk_short_sha})"
)
workspace = APIRemoteWorkspace(
runtime_api_url=os.getenv(
"RUNTIME_API_URL", "https://runtime.eval.all-hands.dev"
),
runtime_api_key=runtime_api_key,
server_image=agent_server_image,
target_type="source" if "source" in build_target else "binary",
)
else:
raise ValueError(
f"Unsupported workspace_type: {self.metadata.workspace_type}"
)

workspace = DockerWorkspace(
server_image=agent_server_image,
working_dir="/workspace",
)
for cmd in self.metadata.env_setup_commands or []:
res = workspace.execute_command(cmd)
if res.exit_code != 0:
@@ -297,6 +331,7 @@ def main() -> None:
critic_name=args.critic,
selected_instances_file=args.select,
max_retries=args.max_retries,
workspace_type=args.workspace,
)

# Run orchestrator with a simple JSONL writer
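The image naming used in `prepare_workspace` can be sketched in isolation. This is a minimal illustration rather than the benchmark's own code: the function names and the repository, SHA, and custom-tag values are placeholders invented for the example. Only the `<sha>-<custom_tag>[-<target>]` layout and the `source`/`binary` mapping are taken from the diff above.

```python
def agent_server_image(
    repo: str, sdk_short_sha: str, custom_tag: str, build_target: str
) -> str:
    """Compose '<repo>:<sha>-<custom_tag>[-<target>]'.

    The default 'binary' target gets no suffix, as in the diff above.
    """
    suffix = f"-{build_target}" if build_target != "binary" else ""
    return f"{repo}:{sdk_short_sha}-{custom_tag}{suffix}"


def remote_target_type(build_target: str) -> str:
    """Collapse build targets such as 'source-minimal' to 'source' or 'binary'."""
    return "source" if "source" in build_target else "binary"


if __name__ == "__main__":
    # Placeholder values, not real image coordinates.
    print(
        agent_server_image(
            "ghcr.io/example/eval-agent-server",
            "abc1234",
            "demo-instance",
            "source-minimal",
        )
    )
```

Because the docker and remote branches both build the reference from the same pieces, a remote run resolves to the same image a local build would produce, provided `SDK_SHORT_SHA` is not overridden through the environment.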
7 changes: 7 additions & 0 deletions benchmarks/utils/args_parser.py
@@ -25,6 +25,13 @@ def get_parser(add_llm_config: bool = True) -> argparse.ArgumentParser:
        help="Dataset name",
    )
    parser.add_argument("--split", type=str, default="test", help="Dataset split")
    parser.add_argument(
        "--workspace",
        type=str,
        default="docker",
        choices=["docker", "remote"],
        help="Type of workspace to use (default: docker)",
    )
    parser.add_argument(
        "--max-iterations", type=int, default=100, help="Maximum iterations"
    )
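To show how the new `--workspace` flag behaves on the command line, here is a small standalone sketch. It assumes nothing beyond the standard library; `build_parser` and the `prog` name are made up for the example, and the real `get_parser` defines many more arguments.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for get_parser(); only the new flag is reproduced.
    parser = argparse.ArgumentParser(prog="run_infer_sketch")
    parser.add_argument(
        "--workspace",
        type=str,
        default="docker",
        choices=["docker", "remote"],
        help="Type of workspace to use (default: docker)",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--workspace", "remote"])
    print(args.workspace)
```

With `choices` set, argparse rejects any other value and exits with status 2 before `EvalMetadata` is constructed, so the `Literal["docker", "remote"]` field acts as a second line of validation.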
8 changes: 8 additions & 0 deletions benchmarks/utils/build_utils.py
@@ -7,6 +7,7 @@
import contextlib
import io
import subprocess
import time
import tomllib
from concurrent.futures import ProcessPoolExecutor, as_completed
from datetime import UTC, datetime
@@ -19,6 +20,7 @@

from benchmarks.utils.args_parser import get_parser
from benchmarks.utils.constants import EVAL_AGENT_SERVER_IMAGE
from benchmarks.utils.image_utils import image_exists
from openhands.agent_server.docker.build import BuildOptions, TargetType, build
from openhands.sdk import get_logger

@@ -195,6 +197,11 @@ def build_image(
git_sha=git_sha,
sdk_version=sdk_version,
)
for t in opts.all_tags[0]:
# Check if image exists or not
if image_exists(t):
logger.info(f"Image {t} already exists. Skipping build.")
return BuildOutput(base_image=base_image, tags=[t], error=None)
tags = build(opts)
return BuildOutput(base_image=base_image, tags=tags, error=None)
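The check-before-build step added here can be sketched on its own. This is an illustration under stated assumptions: `build_unless_present` is a hypothetical helper, and the registry lookup and the builder are passed in as plain callables so that the example runs without Docker or network access.

```python
from typing import Callable


def build_unless_present(
    candidate_tags: list[str],
    exists: Callable[[str], bool],
    build: Callable[[], list[str]],
) -> list[str]:
    """Return the first tag already in the registry, else build and return new tags."""
    for tag in candidate_tags:
        if exists(tag):
            # Someone already pushed this image; skip the expensive build.
            return [tag]
    return build()


if __name__ == "__main__":
    published = {"example/agent-server:abc1234-demo"}  # placeholder registry contents
    tags = build_unless_present(
        ["example/agent-server:abc1234-demo"],
        exists=published.__contains__,
        build=lambda: ["example/agent-server:freshly-built"],
    )
    print(tags)
```

One consequence of this design is that a lookup failure is indistinguishable from a missing image: since `image_exists` returns `False` on request errors, a flaky registry leads to a rebuild rather than a skipped build.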

@@ -224,6 +231,7 @@ def _build_with_logging(
logger.info(
f"Retrying build for {base_image} (attempt {attempt + 1}/{max_retries})"
)
time.sleep(2 + attempt * 2)
result = build_image(base_image, target_image, custom_tag, target, push)
result.log_path = str(log_path)
if not result.error:
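The `time.sleep(2 + attempt * 2)` line adds a linearly growing pause between build attempts. A minimal sketch of that pattern follows. Two things here are assumptions, because the hunk does not show the enclosing code: that the pause applies only to retries (`attempt > 0`), and the use of exceptions to signal failure, whereas the PR inspects `result.error`. The names `retry_delay_seconds` and `run_with_retries` are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_delay_seconds(attempt: int) -> int:
    """Linear backoff: 2 s base plus 2 s per attempt index."""
    return 2 + attempt * 2


def run_with_retries(
    task: Callable[[], T],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    last_error: Exception | None = None
    for attempt in range(max_retries):
        if attempt > 0:
            # Assumed: only retries wait, never the first attempt.
            sleep(retry_delay_seconds(attempt))
        try:
            return task()
        except Exception as exc:  # sketch only; real code may narrow this
            last_error = exc
    raise RuntimeError(f"task failed after {max_retries} attempts") from last_error
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. Under these assumptions the first retry waits 4 seconds and the second waits 6.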
105 changes: 105 additions & 0 deletions benchmarks/utils/image_utils.py
@@ -0,0 +1,105 @@
#!/usr/bin/env python3
import base64
import sys

import requests


ACCEPT = ",".join(
    [
        "application/vnd.oci.image.index.v1+json",
        "application/vnd.oci.image.manifest.v1+json",
        "application/vnd.docker.distribution.manifest.v2+json",
        "application/vnd.docker.distribution.manifest.list.v2+json",
    ]
)


def _parse(image: str):
    digest = None
    if "@" in image:
        image, digest = image.split("@", 1)
    tag = None
    last = image.rsplit("/", 1)[-1]
    if ":" in last:  # tag after last slash (not registry:port)
        image, tag = image.rsplit(":", 1)
    parts = image.split("/")
    if "." in parts[0] or ":" in parts[0] or parts[0] == "localhost":
        registry, repo = parts[0], "/".join(parts[1:])
    else:
        registry, repo = "registry-1.docker.io", "/".join(parts)
    ref = digest or tag or "latest"
    return registry, repo, ref


def _dockerhub_token(repo: str) -> str | None:
    url = f"https://auth.docker.io/token?service=registry.docker.io&scope=repository:{repo}:pull"
    r = requests.get(url, timeout=10)
    if r.ok:
        return r.json().get("token")
    return None


def _ghcr_token(repo: str, username: str | None, pat: str | None) -> str | None:
    # Public: anonymous works; Private: Basic auth with PAT (read:packages) to get bearer
    url = f"https://ghcr.io/token?service=ghcr.io&scope=repository:{repo}:pull"
    headers = {}
    if username and pat:
        headers["Authorization"] = (
            "Basic " + base64.b64encode(f"{username}:{pat}".encode()).decode()
        )
    r = requests.get(url, headers=headers, timeout=10)
    if r.ok:
        return r.json().get("token")
    return None


def image_exists(
    image_ref: str,
    gh_username: str | None = None,
    gh_pat: str | None = None,  # GitHub PAT with read:packages for private GHCR
    docker_token: str | None = None,  # Docker Hub JWT if you already have one
) -> bool:
    registry, repo, ref = _parse(image_ref)
    headers = {"Accept": ACCEPT}

    if registry in ("docker.io", "index.docker.io", "registry-1.docker.io"):
        base = "https://registry-1.docker.io"
        token = docker_token or _dockerhub_token(repo)
        if token:
            headers["Authorization"] = f"Bearer {token}"
    elif registry == "ghcr.io":
        base = "https://ghcr.io"
        token = _ghcr_token(repo, gh_username, gh_pat)
        if token:
            headers["Authorization"] = f"Bearer {token}"
    else:
        base = f"https://{registry}"

    url = f"{base}/v2/{repo}/manifests/{ref}"
    try:
        r = requests.head(url, headers=headers, timeout=10)
        if r.status_code in (
            405,
            406,
        ):  # some registries disallow HEAD or need GET for content-negotiation
            r = requests.get(url, headers=headers, timeout=10)
        # Only 200 counts as "exists". 401/403 (image may exist but we are not
        # authorized) and 404 (not found) are both reported as False.
        return r.status_code == 200
    except requests.RequestException:
        return False


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print(
            "Usage: python image_utils.py <image[:tag]|image@sha256:...> [gh_user] [gh_pat]"
        )
        sys.exit(1)

    image = sys.argv[1]
    gh_user = sys.argv[2] if len(sys.argv) > 2 else None
    gh_pat = sys.argv[3] if len(sys.argv) > 3 else None

    ok = image_exists(image, gh_username=gh_user, gh_pat=gh_pat)
    print(f"{image} -> {'✅ exists' if ok else '❌ not found or unauthorized'}")
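A few worked cases may make the reference-splitting rules in `_parse` easier to follow. The function below is a compact restatement written for this example; the name `split_image_ref` and the sample references are not part of the PR, and it makes no network calls. Like `_parse`, it does not add Docker Hub's `library/` namespace for single-name official images.

```python
def split_image_ref(image: str) -> tuple[str, str, str]:
    """Split an image reference into (registry, repository, tag-or-digest)."""
    digest = None
    if "@" in image:
        image, digest = image.split("@", 1)
    tag = None
    # Only a colon after the last slash is a tag; earlier ones are registry ports.
    if ":" in image.rsplit("/", 1)[-1]:
        image, tag = image.rsplit(":", 1)
    parts = image.split("/")
    first = parts[0]
    if "." in first or ":" in first or first == "localhost":
        registry, repo = first, "/".join(parts[1:])
    else:
        # No registry host given: fall back to Docker Hub.
        registry, repo = "registry-1.docker.io", "/".join(parts)
    return registry, repo, digest or tag or "latest"


if __name__ == "__main__":
    for ref in (
        "ghcr.io/example/agent-server:abc1234",
        "localhost:5000/agent-server",
        "example/agent-server@sha256:deadbeef",
    ):
        print(ref, "->", split_image_ref(ref))
```

A digest takes precedence over a tag when both are present, and a reference with neither falls back to `latest`.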
6 changes: 5 additions & 1 deletion benchmarks/utils/models.py
@@ -1,4 +1,4 @@
-from typing import Any
+from typing import Any, Literal

from pydantic import BaseModel, Field

@@ -45,6 +45,10 @@ class EvalMetadata(BaseModel):
        ge=0,
        description="Maximum number of retries for instances that throw exceptions",
    )
    workspace_type: Literal["docker", "remote"] = Field(
        default="docker",
        description="Type of workspace to use, e.g., 'docker' or 'remote'",
    )


EvalInstanceID = str
1 change: 1 addition & 0 deletions pyproject.toml
@@ -19,6 +19,7 @@ dependencies = [
"openhands-workspace",
"modal>=1.1.4",
"swebench",
"docker-registry-client>=0.5.2",
]

[project.scripts]
31 changes: 31 additions & 0 deletions uv.lock
