feat: build lightweight benchmark images by default#548
simonrosenberg wants to merge 5 commits into main
Add --lightweight flag to image build scripts that passes INSTALL_ACP=false, INSTALL_BOTO3=false, INSTALL_BROWSER=false to the SDK Dockerfile. These skip npm ACP packages, boto3/botocore, and browser-use/playwright, none of which are used by benchmarks.

Estimated savings at 433 images:
- npm ACP + nodejs: ~32s/image (3.9h cumulative)
- browser-use + playwright: ~15s export/push (1.8h cumulative)
- boto3/botocore: ~3s export/push (0.4h cumulative)
- Total: ~1.9-3.0h wall-clock improvement

Changes:
- build_utils.py: add extra_build_args plumbing through build_image → _build_with_logging → build_all_images; add --lightweight CLI flag
- swtbench/build_images.py, swebench/build_images.py: pass LIGHTWEIGHT_BUILD_ARGS when --lightweight is set
- Both build workflows: add lightweight input (default: true)
- SDK submodule: update to feat/lightweight-benchmark-images branch

Depends on: OpenHands/software-agent-sdk#2538

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
all-hands-bot
left a comment
The core implementation is clean and pragmatic, but there are critical blockers before this can merge. See inline comments.
@@ -1 +1 @@
-Subproject commit fc9e8fce2d5a6b5dee4543ff740d0c295aa968a3
+Subproject commit fde7eebe527a7c3fc7d6bc716e3a42716488ddb9
🔴 Critical: This submodule points to commit fde7eebe on feature branch feat/lightweight-benchmark-images (OpenHands/software-agent-sdk#2538).
Problem: Feature branches are unstable: they can be rebased, force-pushed, or deleted, at which point the pinned commit may no longer be reachable. This creates a dependency time bomb.
Required fix:
- Wait for OpenHands/software-agent-sdk#2538 (feat: lightweight benchmark images, merging #2535 + #2536 + #2537) to merge to main
- Update this submodule to point to the merged commit on main (or a tagged release)
- Never point production submodules to feature branches
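One way to enforce the required fix above is a small pre-merge check that the pinned submodule commit is reachable from the SDK's main branch. The following is a minimal sketch, assuming the submodule lives at vendor/software-agent-sdk and that origin/main has been fetched; the function name is hypothetical and not part of this repository.

```python
import subprocess
from typing import Callable


def is_merged_into(
    repo_dir: str,
    commit: str,
    ref: str = "origin/main",
    run: Callable = subprocess.run,
) -> bool:
    """Return True if `commit` is an ancestor of `ref` in `repo_dir`.

    Uses `git merge-base --is-ancestor`, which exits 0 when the commit is
    an ancestor, 1 when it is not, and another code on error.
    """
    cmd = ["git", "-C", repo_dir, "merge-base", "--is-ancestor", commit, ref]
    result = run(cmd, check=False)
    if result.returncode in (0, 1):
        return result.returncode == 0
    raise RuntimeError(f"git failed with exit code {result.returncode}")
```

Calling `is_merged_into("vendor/software-agent-sdk", "fde7eebe")` in CI would then fail the build while the submodule still points at an unmerged feature-branch commit.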
# Default build args for lightweight benchmark images.
# These correspond to ARGs in the SDK Dockerfile that default to "true".
LIGHTWEIGHT_BUILD_ARGS: dict[str, str] = {
🟢 Good taste: Simple, clean constant. No magic, no over-engineering. This is the right way to define build args.
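The diff excerpt above cuts off before the dictionary body. A plausible reconstruction, based only on the three build args named in the commit message, is sketched below together with a hypothetical helper that flattens the dict into docker build flags. The helper name is an assumption for illustration; the real code passes the dict to the SDK's BuildOptions instead.

```python
# Keys and values taken from the commit message; the exact dict in the PR
# is truncated in the diff, so treat this as a reconstruction.
LIGHTWEIGHT_BUILD_ARGS: dict[str, str] = {
    "INSTALL_ACP": "false",
    "INSTALL_BOTO3": "false",
    "INSTALL_BROWSER": "false",
}


def to_docker_flags(build_args: dict[str, str]) -> list[str]:
    """Flatten a build-arg dict into `--build-arg KEY=VALUE` pairs.

    Hypothetical helper, sorted by key for deterministic output.
    """
    flags: list[str] = []
    for key, value in sorted(build_args.items()):
        flags += ["--build-arg", f"{key}={value}"]
    return flags
```

Values are strings rather than booleans because Docker build args are always passed as text on the command line.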
    default: 'false'
    type: string
  lightweight:
    description: 'Build lightweight images (skip ACP, boto3, browser-use)'
🟡 Suggestion: Using type: string with 'true'/'false' strings instead of type: boolean is unconventional.
If this is intentional (e.g., for workflow_dispatch compatibility), add a comment explaining why. Otherwise the next person might "fix" it to boolean and break things.
lightweight:
  description: 'Build lightweight images (skip ACP, boto3, browser-use)'
  required: false
  default: 'true'
  type: string  # Must be string, not boolean, for workflow_dispatch compatibility
benchmarks/swebench/build_images.py
Outdated
    )
    build_dir = default_build_output_dir(args.dataset, args.split)

    extra_build_args = LIGHTWEIGHT_BUILD_ARGS if args.lightweight else None
🟢 Acceptable: Clean conditional assignment. Some might prefer an empty dict {} over None to avoid null checks, but since you're checking downstream anyway, this is fine.
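The None-versus-empty-dict trade-off mentioned above can be handled in one place downstream. This is a minimal sketch, assuming a hypothetical helper that merges caller-supplied args over some base set; it is not the function used in build_utils.py.

```python
from typing import Optional


def effective_build_args(
    base: dict[str, str],
    extra: Optional[dict[str, str]],
) -> dict[str, str]:
    """Merge optional extra build args over a base set.

    Treats None the same as an empty dict, so callers may pass either.
    Returns a new dict and leaves both inputs unmodified.
    """
    merged = dict(base)
    merged.update(extra or {})
    return merged
```

Normalizing with `extra or {}` at the point of use keeps the null check out of every intermediate function in the call chain.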
Hardcode BENCHMARK_BUILD_ARGS in build_image() instead of threading an optional flag through the entire call chain. Benchmark images never need ACP, boto3, or browser-use, so there's no reason to make this configurable.

Removes: --lightweight CLI flag, lightweight workflow input, extra_build_args parameter from build_all_images/_build_with_logging.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
build_image() defaults extra_build_args to LIGHTWEIGHT_BUILD_ARGS (skip ACP, boto3, browser-use) but callers can pass ACP_BUILD_ARGS when building images for ACP evaluation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add --agent-type flag to build scripts and workflow inputs. When set to 'acp-claude' or 'acp-codex', images are built with ACP installed (ACP_BUILD_ARGS). Otherwise defaults to LIGHTWEIGHT_BUILD_ARGS which skips ACP, boto3, and browser-use. Thread extra_build_args through build_all_images → _build_with_logging → build_image so the flag reaches the SDK BuildOptions. Workflow inputs default to 'default' (lightweight), matching existing behavior. ACP benchmarks explicitly pass --agent-type acp-claude.
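The selection logic described in this commit could look roughly like the sketch below. The contents of ACP_BUILD_ARGS are an assumption (the source only says ACP is installed), and the function names are hypothetical; only the flag name and the three agent-type values come from the commit message.

```python
import argparse

LIGHTWEIGHT_BUILD_ARGS: dict[str, str] = {
    "INSTALL_ACP": "false",
    "INSTALL_BOTO3": "false",
    "INSTALL_BROWSER": "false",
}

# Assumption: ACP builds re-enable only the ACP packages.
ACP_BUILD_ARGS: dict[str, str] = {
    "INSTALL_ACP": "true",
    "INSTALL_BOTO3": "false",
    "INSTALL_BROWSER": "false",
}

ACP_AGENT_TYPES = {"acp-claude", "acp-codex"}


def select_build_args(agent_type: str) -> dict[str, str]:
    """Pick ACP build args for ACP agent types, lightweight args otherwise."""
    if agent_type in ACP_AGENT_TYPES:
        return ACP_BUILD_ARGS
    return LIGHTWEIGHT_BUILD_ARGS


def make_parser() -> argparse.ArgumentParser:
    """Build a parser exposing only the --agent-type flag."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--agent-type",
        default="default",
        choices=["default", "acp-claude", "acp-codex"],
    )
    return parser
```

With this shape, omitting the flag yields the lightweight args, matching the stated default behavior.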
Summary
- Add --lightweight CLI flag to SWE-bench and SWT-bench image build scripts
- Thread extra_build_args through the full call chain: build_all_images → _build_with_logging → build_image → SDK BuildOptions
- Build workflows default to lightweight: true, skipping ACP, boto3, and browser-use
- Can be disabled (lightweight: false) when full images are needed

Changes
- benchmarks/utils/build_utils.py: LIGHTWEIGHT_BUILD_ARGS constant, --lightweight CLI flag, extra_build_args parameter threaded through all build functions
- benchmarks/swebench/build_images.py: Pass lightweight build args to build_all_images
- benchmarks/swtbench/build_images.py: Same
- .github/workflows/build-swtbench-images.yml: lightweight input (default true), passed as --lightweight flag
- .github/workflows/build-swebench-images.yml: Same
- vendor/software-agent-sdk: Updated to feat/lightweight-benchmark-images branch (OpenHands/software-agent-sdk#2538, feat: lightweight benchmark images, merging #2535 + #2536 + #2537)

Expected impact
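From analysis of 433-image build logs (#537):
- npm ACP + nodejs: ~32s/image (3.9h cumulative)
- browser-use + playwright: ~15s export/push (1.8h cumulative)
- boto3/botocore: ~3s export/push (0.4h cumulative)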
Wall-clock improvement: ~1.9-3.0h (at 3.5x effective parallelism), plus non-linear savings from reduced disk pressure (fewer prune events, lower batch degradation).
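The cumulative figures quoted in the commit message follow directly from the per-image savings multiplied by the image count. The sketch below reproduces only that step; it does not derive the ~1.9-3.0h wall-clock range, which depends on parallelism and disk-pressure effects that are not spelled out here.

```python
IMAGES = 433

# Per-image savings in seconds, as quoted in the commit message.
per_image_seconds = {
    "npm ACP + nodejs": 32,
    "browser-use + playwright": 15,
    "boto3/botocore": 3,
}

# Cumulative savings in hours across all images (serial, no parallelism).
cumulative_hours = {
    name: IMAGES * seconds / 3600
    for name, seconds in per_image_seconds.items()
}

total_cumulative_hours = sum(cumulative_hours.values())

for name, hours in cumulative_hours.items():
    print(f"{name}: {hours:.1f}h")
print(f"total: {total_cumulative_hours:.1f}h")
```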
Dependencies
- OpenHands/software-agent-sdk#2538
Test plan
- Run with --lightweight on 4-10 images to verify builds succeed
- Run without --lightweight to verify no regression for full images

Closes #537
🤖 Generated with Claude Code