feat: build lightweight benchmark images by default #548

Open
simonrosenberg wants to merge 5 commits into main from feat/lightweight-benchmark-images

@simonrosenberg
Collaborator

Summary

  • Add --lightweight CLI flag to SWE-bench and SWT-bench image build scripts
  • Thread extra_build_args through the full call chain: build_all_images → _build_with_logging → build_image → SDK BuildOptions
  • Both workflows default lightweight: true, skipping ACP, boto3, and browser-use
  • Can be toggled off via workflow dispatch input (lightweight: false) when full images are needed
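
The call chain in the summary can be sketched as follows. This is a minimal illustration, not the code in build_utils.py: the function bodies, the `base_image` parameter, and the `BuildOptions` dataclass are hypothetical stand-ins, since the real signatures and the SDK class are not shown in this PR.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BuildOptions:
    """Hypothetical stand-in for the SDK's BuildOptions (real fields not shown here)."""
    base_image: str
    extra_build_args: dict[str, str] = field(default_factory=dict)


def build_image(base_image: str,
                extra_build_args: Optional[dict[str, str]] = None) -> BuildOptions:
    # Normalize None to an empty dict so later code needs no null checks.
    return BuildOptions(base_image=base_image,
                        extra_build_args=dict(extra_build_args or {}))


def _build_with_logging(base_image: str,
                        extra_build_args: Optional[dict[str, str]] = None) -> BuildOptions:
    print(f"building {base_image} with build args {extra_build_args}")
    return build_image(base_image, extra_build_args)


def build_all_images(base_images: list[str],
                     extra_build_args: Optional[dict[str, str]] = None) -> list[BuildOptions]:
    # The same optional dict is forwarded unchanged at every level.
    return [_build_with_logging(img, extra_build_args) for img in base_images]
```

Each layer only forwards the optional dict, so a caller that omits it gets the previous (full image) behavior.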

Changes

  • benchmarks/utils/build_utils.py: LIGHTWEIGHT_BUILD_ARGS constant, --lightweight CLI flag, extra_build_args parameter threaded through all build functions
  • benchmarks/swebench/build_images.py: Pass lightweight build args to build_all_images
  • benchmarks/swtbench/build_images.py: Same
  • .github/workflows/build-swtbench-images.yml: lightweight input (default true), passed as --lightweight flag
  • .github/workflows/build-swebench-images.yml: Same
  • vendor/software-agent-sdk: Updated to feat/lightweight-benchmark-images branch (OpenHands/software-agent-sdk#2538, "feat: lightweight benchmark images (merge #2535 + #2536 + #2537)")

Expected impact

From analysis of 433-image build logs (#537):

| Dependency skipped | Per-image saving | Cumulative (433 images) |
|---|---|---|
| npm ACP + nodejs | ~32s install + ~4s export/push | 4.5h |
| browser-use + playwright | ~15s export/push | 1.8h |
| boto3/botocore | ~3s export/push | 0.4h |
| Total | ~54s/image | ~6.7h |

Wall-clock improvement: ~1.9-3.0h (at 3.5x effective parallelism), plus non-linear savings from reduced disk pressure (fewer prune events, lower batch degradation).
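
As a rough check of the arithmetic, the per-image figures can be multiplied out. The sketch below uses only the numbers quoted above. Because the per-image savings are approximate, the recomputed total (about 6.5h) lands slightly under the ~6.7h reported, and dividing by 3.5x parallelism gives the lower end of the wall-clock estimate.

```python
IMAGES = 433
PARALLELISM = 3.5  # effective parallelism quoted in the description

# Approximate per-image savings in seconds, as listed in the description.
per_image_seconds = {
    "npm ACP + nodejs": 32 + 4,        # install + export/push
    "browser-use + playwright": 15,    # export/push
    "boto3/botocore": 3,               # export/push
}


def cumulative_hours(seconds_per_image: float, images: int = IMAGES) -> float:
    """Convert a per-image saving into total hours across all images."""
    return seconds_per_image * images / 3600


total_seconds = sum(per_image_seconds.values())   # 54 s/image
total_hours = cumulative_hours(total_seconds)     # about 6.5 h
wall_clock_hours = total_hours / PARALLELISM      # about 1.9 h

print(f"{total_seconds}s/image, {total_hours:.1f}h cumulative, "
      f"{wall_clock_hours:.1f}h wall-clock at {PARALLELISM}x")
```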

Dependencies

  • OpenHands/software-agent-sdk#2538

Test plan

  • Run SWT-bench build with --lightweight on 4-10 images to verify builds succeed
  • Verify image can run a basic SWT-bench evaluation
  • Run without --lightweight to verify no regression for full images
  • Full 433-image build to measure actual wall-clock improvement

Closes #537

🤖 Generated with Claude Code

Add --lightweight flag to image build scripts that passes
INSTALL_ACP=false, INSTALL_BOTO3=false, INSTALL_BROWSER=false
to the SDK Dockerfile. These skip npm ACP packages, boto3/botocore,
and browser-use/playwright — none of which are used by benchmarks.

Estimated savings at 433 images:
- npm ACP + nodejs: ~32s/image (3.9h cumulative)
- browser-use + playwright: ~15s export/push (1.8h cumulative)
- boto3/botocore: ~3s export/push (0.4h cumulative)
- Total: ~1.9-3.0h wall-clock improvement

Changes:
- build_utils.py: add extra_build_args plumbing through build_image →
  _build_with_logging → build_all_images; add --lightweight CLI flag
- swtbench/build_images.py, swebench/build_images.py: pass
  LIGHTWEIGHT_BUILD_ARGS when --lightweight is set
- Both build workflows: add lightweight input (default: true)
- SDK submodule: update to feat/lightweight-benchmark-images branch

Depends on: OpenHands/software-agent-sdk#2538

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@all-hands-bot all-hands-bot left a comment

The core implementation is clean and pragmatic, but there are critical blockers before this can merge. See inline comments.

@@ -1 +1 @@
-Subproject commit fc9e8fce2d5a6b5dee4543ff740d0c295aa968a3
+Subproject commit fde7eebe527a7c3fc7d6bc716e3a42716488ddb9

🔴 Critical: This submodule points to commit fde7eebe on feature branch feat/lightweight-benchmark-images (OpenHands/software-agent-sdk#2538).

Problem: Feature branches are unstable - they can be rebased, force-pushed, or deleted. This creates a dependency time bomb.

Required fix:

  1. Wait for OpenHands/software-agent-sdk#2538 ("feat: lightweight benchmark images (merge #2535 + #2536 + #2537)") to merge to main
  2. Update this submodule to point to the merged commit on main (or a tagged release)
  3. Never point production submodules to feature branches
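
One way to enforce step 2 is a small pre-merge check that the pinned submodule commit is reachable from main. The sketch below is an assumption about how such a check could look, not something in this PR: the helper names are hypothetical, and it relies on `git merge-base --is-ancestor`, which exits 0 when the first commit is an ancestor of the second.

```python
import subprocess


def ancestor_check_command(commit: str, branch: str = "origin/main") -> list[str]:
    """Build the git command that tests whether `commit` is reachable from `branch`."""
    return ["git", "merge-base", "--is-ancestor", commit, branch]


def is_on_branch(repo_dir: str, commit: str, branch: str = "origin/main") -> bool:
    # Exit code 0 means "is an ancestor"; 1 means "not an ancestor".
    # Any other code is a git error, which this sketch also treats as False.
    result = subprocess.run(ancestor_check_command(commit, branch), cwd=repo_dir)
    return result.returncode == 0
```

For example, `is_on_branch("vendor/software-agent-sdk", "fde7eebe")` would be expected to return False until the SDK PR is merged and fetched.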


# Default build args for lightweight benchmark images.
# These correspond to ARGs in the SDK Dockerfile that default to "true".
LIGHTWEIGHT_BUILD_ARGS: dict[str, str] = {

🟢 Good taste: Simple, clean constant. No magic, no over-engineering. This is the right way to define build args.
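
For illustration, a constant like this maps naturally onto Docker's `--build-arg KEY=VALUE` flags. The dict values below come from the commit message (INSTALL_ACP, INSTALL_BOTO3, INSTALL_BROWSER all set to false). The `to_docker_flags` helper is hypothetical, and the SDK may hand build args to Docker differently.

```python
# Values as listed in the commit message; the real constant lives in build_utils.py.
LIGHTWEIGHT_BUILD_ARGS: dict[str, str] = {
    "INSTALL_ACP": "false",
    "INSTALL_BOTO3": "false",
    "INSTALL_BROWSER": "false",
}


def to_docker_flags(build_args: dict[str, str]) -> list[str]:
    """Hypothetical helper: expand a build-arg dict into docker CLI flags."""
    flags: list[str] = []
    for key, value in build_args.items():
        flags += ["--build-arg", f"{key}={value}"]
    return flags


print(" ".join(["docker", "build", *to_docker_flags(LIGHTWEIGHT_BUILD_ARGS), "."]))
```

Keeping the values as strings matches how Dockerfile ARGs are compared, since build args are always passed as text.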

    default: 'false'
    type: string
  lightweight:
    description: 'Build lightweight images (skip ACP, boto3, browser-use)'

🟡 Suggestion: Using type: string with 'true'/'false' strings instead of type: boolean is unconventional.

If this is intentional (e.g., for workflow_dispatch compatibility), add a comment explaining why. Otherwise the next person might "fix" it to boolean and break things.

lightweight:
  description: 'Build lightweight images (skip ACP, boto3, browser-use)'
  required: false
  default: 'true'
  type: string  # Must be string, not boolean, for workflow_dispatch compatibility

)
build_dir = default_build_output_dir(args.dataset, args.split)

extra_build_args = LIGHTWEIGHT_BUILD_ARGS if args.lightweight else None

🟢 Acceptable: Clean conditional assignment. Some might prefer an empty dict {} over None to avoid null checks, but since you're checking downstream anyway, this is fine.

Debug Agent and others added 4 commits March 20, 2026 13:28
Hardcode BENCHMARK_BUILD_ARGS in build_image() instead of threading
an optional flag through the entire call chain. Benchmark images never
need ACP, boto3, or browser-use, so there's no reason to make this
configurable.

Removes: --lightweight CLI flag, lightweight workflow input,
extra_build_args parameter from build_all_images/_build_with_logging.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

build_image() defaults extra_build_args to LIGHTWEIGHT_BUILD_ARGS
(skip ACP, boto3, browser-use) but callers can pass ACP_BUILD_ARGS
when building images for ACP evaluation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add --agent-type flag to build scripts and workflow inputs. When set to
'acp-claude' or 'acp-codex', images are built with ACP installed
(ACP_BUILD_ARGS). Otherwise defaults to LIGHTWEIGHT_BUILD_ARGS which
skips ACP, boto3, and browser-use.

Thread extra_build_args through build_all_images → _build_with_logging
→ build_image so the flag reaches the SDK BuildOptions.

Workflow inputs default to 'default' (lightweight), matching existing
behavior. ACP benchmarks explicitly pass --agent-type acp-claude.
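
The flag-to-build-args selection described in this commit could look roughly like the sketch below. The three agent type values come from the commit message. The contents of ACP_BUILD_ARGS are an assumption (lightweight args with ACP re-enabled), and `select_build_args` and `parse_args` are hypothetical names.

```python
import argparse
from typing import Optional

LIGHTWEIGHT_BUILD_ARGS: dict[str, str] = {
    "INSTALL_ACP": "false",
    "INSTALL_BOTO3": "false",
    "INSTALL_BROWSER": "false",
}
# Assumption: ACP images differ from lightweight ones only by installing ACP.
ACP_BUILD_ARGS: dict[str, str] = {**LIGHTWEIGHT_BUILD_ARGS, "INSTALL_ACP": "true"}

ACP_AGENT_TYPES = {"acp-claude", "acp-codex"}


def select_build_args(agent_type: str) -> dict[str, str]:
    """Pick ACP build args for ACP agent types, lightweight args otherwise."""
    return ACP_BUILD_ARGS if agent_type in ACP_AGENT_TYPES else LIGHTWEIGHT_BUILD_ARGS


def parse_args(argv: Optional[list[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Build benchmark images")
    parser.add_argument(
        "--agent-type",
        default="default",
        choices=["default", "acp-claude", "acp-codex"],
        help="Agent type; ACP types build images with ACP installed",
    )
    return parser.parse_args(argv)


args = parse_args(["--agent-type", "acp-claude"])
extra_build_args = select_build_args(args.agent_type)
```

The resulting `extra_build_args` would then be passed to build_all_images so it reaches the SDK BuildOptions.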

Development

Successfully merging this pull request may close these issues.

Proposal: lightweight benchmark images via optional dependency flags
