feat: build lightweight benchmark images by default by simonrosenberg · Pull Request #548 · OpenHands/benchmarks

simonrosenberg · 2026-03-20T15:41:06Z

Summary

Add --lightweight CLI flag to SWE-bench and SWT-bench image build scripts
Thread extra_build_args through the full call chain: build_all_images → _build_with_logging → build_image → SDK BuildOptions
Both workflows default lightweight: true, skipping ACP, boto3, and browser-use
Can be toggled off via workflow dispatch input (lightweight: false) when full images are needed
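
The call chain above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the `BuildOptions` dataclass is a hypothetical stand-in for the SDK class, and the function signatures are assumptions that keep only the `extra_build_args` parameter.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BuildOptions:
    """Hypothetical stand-in for the SDK's BuildOptions (the real class has more fields)."""
    base_image: str
    build_args: dict[str, str] = field(default_factory=dict)


def build_image(base_image: str,
                extra_build_args: Optional[dict[str, str]] = None) -> BuildOptions:
    # Innermost step: copy the extra args into the options handed to the SDK.
    return BuildOptions(base_image=base_image, build_args=dict(extra_build_args or {}))


def _build_with_logging(base_image: str,
                        extra_build_args: Optional[dict[str, str]] = None) -> BuildOptions:
    # Middle step: log, then forward the argument unchanged.
    args = extra_build_args or {}
    print(f"building {base_image} with build args {args}")
    return build_image(base_image, extra_build_args=extra_build_args)


def build_all_images(base_images: list[str],
                     extra_build_args: Optional[dict[str, str]] = None) -> list[BuildOptions]:
    # Outermost step: the same extra args apply to every image in the batch.
    return [_build_with_logging(img, extra_build_args=extra_build_args)
            for img in base_images]
```

Each layer only forwards the parameter, so a caller that omits it gets an empty set of extra build args at the bottom of the chain.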

Changes

benchmarks/utils/build_utils.py: LIGHTWEIGHT_BUILD_ARGS constant, --lightweight CLI flag, extra_build_args parameter threaded through all build functions
benchmarks/swebench/build_images.py: Pass lightweight build args to build_all_images
benchmarks/swtbench/build_images.py: Same
.github/workflows/build-swtbench-images.yml: lightweight input (default true), passed as --lightweight flag
.github/workflows/build-swebench-images.yml: Same
vendor/software-agent-sdk: Updated to the feat/lightweight-benchmark-images branch (software-agent-sdk#2538, "feat: lightweight benchmark images", which merges #2535 + #2536 + #2537)

Expected impact

From analysis of 433-image build logs (#537):

| Dependency skipped | Per-image saving | Cumulative (433 images) |
| --- | --- | --- |
| npm ACP + nodejs | ~32s install + ~4s export/push | 4.5h |
| browser-use + playwright | ~15s export/push | 1.8h |
| boto3/botocore | ~3s export/push | 0.4h |
| Total | ~54s/image | ~6.7h |

Wall-clock improvement: ~1.9-3.0h (at 3.5x effective parallelism), plus non-linear savings from reduced disk pressure (fewer prune events, lower batch degradation).
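
The cumulative figures follow from multiplying the per-image saving by the image count, and the lower wall-clock bound from dividing by the effective parallelism. A rough sketch of that arithmetic, using only the numbers quoted above (the names are illustrative). Straight multiplication lands slightly below the rounded totals in the table, so treat the results as approximations:

```python
IMAGES = 433
PARALLELISM = 3.5  # effective parallelism quoted in the PR description

# Per-image savings in seconds, taken from the table above.
per_image_seconds = {
    "npm ACP + nodejs": 32 + 4,        # install + export/push
    "browser-use + playwright": 15,    # export/push
    "boto3/botocore": 3,               # export/push
}


def cumulative_hours(seconds_per_image: float, images: int = IMAGES) -> float:
    """Total time saved across all images, in hours."""
    return seconds_per_image * images / 3600


total_seconds = sum(per_image_seconds.values())
total_hours = cumulative_hours(total_seconds)
wall_clock_hours = total_hours / PARALLELISM

for name, seconds in per_image_seconds.items():
    print(f"{name}: {cumulative_hours(seconds):.2f}h cumulative")
print(f"total: {total_seconds}s/image, {total_hours:.2f}h cumulative")
print(f"wall-clock at {PARALLELISM}x parallelism: {wall_clock_hours:.2f}h")
```

This covers only the linear part of the estimate; the non-linear savings from reduced disk pressure are not modeled.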

Dependencies

Requires software-agent-sdk#2538 ("feat: lightweight benchmark images", which merges #2535 + #2536 + #2537)

Test plan

Run SWT-bench build with --lightweight on 4-10 images to verify builds succeed
Verify image can run a basic SWT-bench evaluation
Run without --lightweight to verify no regression for full images
Full 433-image build to measure actual wall-clock improvement

Closes #537

🤖 Generated with Claude Code

Add --lightweight flag to image build scripts that passes INSTALL_ACP=false, INSTALL_BOTO3=false, INSTALL_BROWSER=false to the SDK Dockerfile. These skip npm ACP packages, boto3/botocore, and browser-use/playwright, none of which are used by benchmarks.

Estimated savings at 433 images:
- npm ACP + nodejs: ~32s/image (3.9h cumulative)
- browser-use + playwright: ~15s export/push (1.8h cumulative)
- boto3/botocore: ~3s export/push (0.4h cumulative)
- Total: ~1.9-3.0h wall-clock improvement

Changes:
- build_utils.py: add extra_build_args plumbing through build_image → _build_with_logging → build_all_images; add --lightweight CLI flag
- swtbench/build_images.py, swebench/build_images.py: pass LIGHTWEIGHT_BUILD_ARGS when --lightweight is set
- Both build workflows: add lightweight input (default: true)
- SDK submodule: update to feat/lightweight-benchmark-images branch

Depends on: OpenHands/software-agent-sdk#2538

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

all-hands-bot

The core implementation is clean and pragmatic, but there are critical blockers before this can merge. See inline comments.

all-hands-bot · 2026-03-20T15:43:51Z

vendor/software-agent-sdk

@@ -1 +1 @@
-Subproject commit fc9e8fce2d5a6b5dee4543ff740d0c295aa968a3
+Subproject commit fde7eebe527a7c3fc7d6bc716e3a42716488ddb9


🔴 Critical: This submodule points to commit fde7eebe on feature branch feat/lightweight-benchmark-images (OpenHands/software-agent-sdk#2538).

Problem: Feature branches are unstable - they can be rebased, force-pushed, or deleted. This creates a dependency time bomb.

Required fix:

Wait for software-agent-sdk#2538 ("feat: lightweight benchmark images", which merges #2535 + #2536 + #2537) to merge to main

Update this submodule to point to the merged commit on main (or a tagged release)

Never point production submodules to feature branches
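
One way to guard against this in CI is to check that the pinned submodule commit is reachable from the SDK's main branch. The sketch below is an assumption about how such a check could look, not something this PR contains; the helper names and the `origin/main` ref are illustrative, and the submodule must have that ref fetched for the check to be meaningful.

```python
import subprocess


def ancestor_check_cmd(submodule_path: str, commit: str,
                       branch: str = "origin/main") -> list[str]:
    """Build the git command that tests whether `commit` is an ancestor of `branch`."""
    return ["git", "-C", submodule_path, "merge-base", "--is-ancestor", commit, branch]


def is_on_branch(submodule_path: str, commit: str,
                 branch: str = "origin/main") -> bool:
    # `git merge-base --is-ancestor` exits with 0 when the commit is an ancestor
    # of the branch and with 1 when it is not; other codes indicate an error.
    result = subprocess.run(ancestor_check_cmd(submodule_path, commit, branch),
                            check=False)
    return result.returncode == 0
```

For example, `is_on_branch("vendor/software-agent-sdk", "fde7eebe527a7c3fc7d6bc716e3a42716488ddb9")` would be expected to return False while that commit exists only on the feature branch.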

all-hands-bot · 2026-03-20T15:43:51Z

benchmarks/utils/build_utils.py


+# Default build args for lightweight benchmark images.
+# These correspond to ARGs in the SDK Dockerfile that default to "true".
+LIGHTWEIGHT_BUILD_ARGS: dict[str, str] = {


🟢 Good taste: Simple, clean constant. No magic, no over-engineering. This is the right way to define build args.
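
The diff hunk above is cut off after the opening brace. Based on the ARG names listed in the commit message (INSTALL_ACP, INSTALL_BOTO3, INSTALL_BROWSER), the constant plausibly looks like the sketch below. The `to_docker_flags` helper is hypothetical and only shows how such a dict maps onto `docker build --build-arg KEY=VALUE` flags; the SDK may pass the values differently.

```python
# Reconstructed from the commit message, not copied from the diff.
LIGHTWEIGHT_BUILD_ARGS: dict[str, str] = {
    "INSTALL_ACP": "false",
    "INSTALL_BOTO3": "false",
    "INSTALL_BROWSER": "false",
}


def to_docker_flags(build_args: dict[str, str]) -> list[str]:
    """Hypothetical helper: expand a build-arg dict into docker CLI flags."""
    flags: list[str] = []
    for key, value in build_args.items():
        flags += ["--build-arg", f"{key}={value}"]
    return flags


print(to_docker_flags(LIGHTWEIGHT_BUILD_ARGS))
```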

all-hands-bot · 2026-03-20T15:43:51Z

.github/workflows/build-swebench-images.yml

        default: 'false'
        type: string
+      lightweight:
+        description: 'Build lightweight images (skip ACP, boto3, browser-use)'


🟡 Suggestion: Using type: string with 'true'/'false' strings instead of type: boolean is unconventional.

If this is intentional (e.g., for workflow_dispatch compatibility), add a comment explaining why. Otherwise the next person might "fix" it to boolean and break things.

      lightweight:
        description: 'Build lightweight images (skip ACP, boto3, browser-use)'
        required: false
        default: 'true'
        type: string  # Must be string, not boolean, for workflow_dispatch compatibility

all-hands-bot · 2026-03-20T15:43:51Z

benchmarks/swebench/build_images.py

    )
    build_dir = default_build_output_dir(args.dataset, args.split)

+    extra_build_args = LIGHTWEIGHT_BUILD_ARGS if args.lightweight else None


🟢 Acceptable: Clean conditional assignment. Some might prefer an empty dict {} over None to avoid null checks, but since you're checking downstream anyway, this is fine.

Hardcode BENCHMARK_BUILD_ARGS in build_image() instead of threading an optional flag through the entire call chain. Benchmark images never need ACP, boto3, or browser-use, so there's no reason to make this configurable.

Removes: --lightweight CLI flag, lightweight workflow input, extra_build_args parameter from build_all_images/_build_with_logging.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

build_image() defaults extra_build_args to LIGHTWEIGHT_BUILD_ARGS (skip ACP, boto3, browser-use) but callers can pass ACP_BUILD_ARGS when building images for ACP evaluation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add --agent-type flag to build scripts and workflow inputs. When set to 'acp-claude' or 'acp-codex', images are built with ACP installed (ACP_BUILD_ARGS). Otherwise defaults to LIGHTWEIGHT_BUILD_ARGS which skips ACP, boto3, and browser-use. Thread extra_build_args through build_all_images → _build_with_logging → build_image so the flag reaches the SDK BuildOptions. Workflow inputs default to 'default' (lightweight), matching existing behavior. ACP benchmarks explicitly pass --agent-type acp-claude.
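
The selection logic described in this commit can be sketched as below. Several details are assumptions: the contents of ACP_BUILD_ARGS are not shown anywhere in the PR, the `choices` list is inferred from the values named in the commit message, and `select_build_args` and `make_parser` are illustrative names.

```python
import argparse

LIGHTWEIGHT_BUILD_ARGS: dict[str, str] = {
    "INSTALL_ACP": "false",
    "INSTALL_BOTO3": "false",
    "INSTALL_BROWSER": "false",
}

# Assumption: ACP images re-enable only the ACP install. The real dict may differ.
ACP_BUILD_ARGS: dict[str, str] = {**LIGHTWEIGHT_BUILD_ARGS, "INSTALL_ACP": "true"}

ACP_AGENT_TYPES = {"acp-claude", "acp-codex"}


def select_build_args(agent_type: str) -> dict[str, str]:
    """Pick ACP build args for ACP agent types, lightweight args otherwise."""
    return ACP_BUILD_ARGS if agent_type in ACP_AGENT_TYPES else LIGHTWEIGHT_BUILD_ARGS


def make_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Build benchmark images (sketch)")
    parser.add_argument(
        "--agent-type",
        default="default",
        choices=["default", "acp-claude", "acp-codex"],
        help="'default' builds lightweight images; ACP types install ACP",
    )
    return parser


args = make_parser().parse_args(["--agent-type", "acp-claude"])
print(select_build_args(args.agent_type))
```

The chosen dict would then be passed as extra_build_args to build_all_images, which forwards it down to build_image.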

all-hands-bot reviewed Mar 20, 2026

simonrosenberg mentioned this pull request Mar 20, 2026

Proposal: lightweight benchmark images via optional dependency flags #537

Closed

Debug Agent and others added 4 commits March 20, 2026 13:28

revert: restore workflow files to main branch state

66c941f

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

simonrosenberg mentioned this pull request Mar 20, 2026

feat: build lightweight benchmark images by default #549

Merged

4 tasks

simonrosenberg wants to merge 5 commits into main from feat/lightweight-benchmark-images
