fix(swtbench): rmi after push + free runner disk (fixes evaluation#495) #672
simonrosenberg wants to merge 6 commits into
Conversation
SWT-bench image assembly kept every freshly pushed agent image in the
local docker daemon, so a full 500-instance cold build outgrew the
~80 GB ubuntu-latest-8core runner partway through (no-space-left
around image 150/433). Pre-migration Blacksmith runners had hundreds
of GB of persistent disk, which hid the retention bug.
- assemble_agent_image: `docker rmi -f` the pushed tags once the push
  succeeds (sketched below). Shared base+builder layers stay cached, so
  the intended ~70 s/image speedup is preserved; only the just-pushed
  final tag is dropped.
- build-swtbench-images.yml: add pinned jlumbroso/free-disk-space
step (frees ~30 GB of preinstalled toolchains) and the swebench-
style preflight that prunes buildx cache and verifies ≥75 GB free
on /var/lib/buildkit before the main build runs.
Fixes OpenHands/evaluation#495
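
For concreteness, a minimal sketch of the warn-and-continue cleanup from the first bullet — the helper name and timeout are illustrative; the real change lives in `assemble_agent_image` in `benchmarks/swebench/build_base_images.py`:

```python
import subprocess


def _remove_pushed_tags(tags: list[str]) -> None:
    """Best-effort: drop just-pushed tags from the local daemon.

    Shared base+builder layers are still referenced by other tags, so
    `docker rmi` leaves them cached for the next assembly.
    """
    result = subprocess.run(
        ["docker", "rmi", "-f", *tags],
        capture_output=True,
        text=True,
        timeout=120,  # illustrative; the real timeout may differ
    )
    if result.returncode != 0:
        # Cleanup must never fail the assembly itself.
        print(f"WARNING: docker rmi failed (rc={result.returncode}): "
              f"{result.stderr.strip()}")
```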
all-hands-bot
left a comment
🟢 Good taste — Pragmatic fix for a real production issue. The solution is simple, defensive, and well-scoped.
Detailed Review Analysis

[RISK ASSESSMENT] [Overall PR] This PR modifies CI infrastructure and Docker image lifecycle management in production builds.

Recommendation: Safe to merge after CI validation. The defensive approach (non-fatal …

VERDICT: ✅ Worth merging — Core fix is sound, defensive error handling is excellent. The workflow additions are appropriate defense-in-depth for a constrained runner environment.

KEY INSIGHT: The root cause analysis is spot-on: Blacksmith's large disks masked a latent resource leak that GitHub runners expose. The fix correctly targets the leak (missing …

This review was generated by an AI agent (OpenHands) on behalf of the user.
Surgical revert of the swtbench portion of #651 — puts the SWT-bench
image-build workflow back on blacksmith-32vcpu-ubuntu-2204 with the
useblacksmith/setup-docker-builder builder. Blacksmith's larger
persistent disk masked the unbounded local-image growth that now fails
on ubuntu-latest-8core (evaluation#495).

Kept intact from main:
- workflow_call trigger and cross-repo checkout (#666)
- all input/env plumbing and downstream steps

Intended as a parallel fallback branch while the real fix in #672
(docker rmi after push + free-disk-space + preflight) is validated.
Validation run [1] showed the previous change (#672 / commit 2bafa99)
rmi'd the pushed final tags but OOMed anyway at image 165/433:

ERROR: failed to build: failed to solve: ResourceExhausted: failed to
copy: write /var/lib/docker/buildkit/content/ingest/.../data: no space
left on device

Disk filled because:
- DOCKER_BUILDKIT=1 makes `docker build` route image pulls through
  BuildKit's content-addressed ingest store, which is *not* cleaned by
  `docker rmi` or `docker image prune`.
- Each per-instance base image blob (SWE-bench bakes conda + test
  repos in -> several GB each) stays in ingest forever until
  `docker buildx prune` runs.

Extend the post-push cleanup to three tiers, all warn-and-continue
(sketched below):
1. rmi the final tags AND the per-instance base tag (base is 1:1 with
   the instance, never reused).
2. `docker image prune -f` to return the dangling overlay2 layers to
   free space immediately.
3. `docker builder prune -f --keep-storage 40g` (configurable via
   OPENHANDS_BUILDKIT_KEEP_STORAGE_GB) to cap BuildKit's ingest cache.

Shared builder image layers fit inside the 40g cap, so the intended
local-cache speedup for assembly is preserved.

[1] https://github.com/OpenHands/benchmarks/actions/runs/24642237265

Refs OpenHands/evaluation#495
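
A sketch of the three tiers under the stated assumptions — the function name and call shape are illustrative; the commands, flags, and env var come from the list above:

```python
import os
import subprocess


def _post_push_cleanup(final_tags: list[str], base_tag: str) -> None:
    """Three-tier post-push cleanup, all warn-and-continue."""
    keep_gb = os.environ.get("OPENHANDS_BUILDKIT_KEEP_STORAGE_GB", "40")
    commands = [
        # 1. Final tags plus the per-instance base tag (1:1 with the
        #    instance, never reused by later assemblies).
        ["docker", "rmi", "-f", *final_tags, base_tag],
        # 2. Return the now-dangling overlay2 layers to free space.
        ["docker", "image", "prune", "-f"],
        # 3. Cap BuildKit's content-addressed ingest store, which
        #    neither rmi nor image prune touches.
        ["docker", "builder", "prune", "-f", "--keep-storage", f"{keep_gb}g"],
    ]
    for cmd in commands:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"WARNING: {' '.join(cmd)} failed: {result.stderr.strip()}")
```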
Second validation run [1] got from 72/433 to 247/433 successful
assemblies before OOM but died anyway. Three follow-ons (the first is
sketched below):

1. Pre-assembly reset: `docker builder prune -af --keep-storage 0` at
   the top of assemble_all_agent_images. The base-image phase runs
   `buildx build --push`, which fills BuildKit's ingest with ~100 GB of
   SWE-bench base-image blobs before assembly even starts; the
   per-assembly cap can't shrink that retroactively.
2. `docker system prune -f` added to the post-assembly cleanup. Val2
   OOM sites included /var/lib/docker/tmp (tarsplit staging) and
   /var/lib/docker/image/overlay2/layerdb/tmp (new-layer commits) —
   system prune sweeps both.
3. Drop the BuildKit keep-storage default from 40 GiB to 10 GiB and
   add -a (include unused images). The 40 GiB cap permitted too much
   to survive alongside four concurrent in-flight builds. Still
   tunable via OPENHANDS_BUILDKIT_KEEP_STORAGE_GB.

The pre-assembly prune can be skipped with
OPENHANDS_SKIP_PREASSEMBLY_PRUNE=1 for dev loops where cache reuse is
more valuable than disk headroom.

[1] https://github.com/OpenHands/benchmarks/actions/runs/24670432554
    (247 ok, 51 failed, OOM at ~57%)

Refs OpenHands/evaluation#495
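
A minimal sketch of follow-on 1, assuming it runs at the top of `assemble_all_agent_images`; the env-var name and command are from the commit text, the function shape is illustrative:

```python
import os
import subprocess


def _preassembly_reset() -> None:
    if os.environ.get("OPENHANDS_SKIP_PREASSEMBLY_PRUNE") == "1":
        # Dev loops may prefer warm caches over disk headroom.
        return
    # Drop everything the base-image phase parked in BuildKit's ingest
    # (~100 GB of SWE-bench base-image blobs) before assembly starts.
    result = subprocess.run(
        ["docker", "builder", "prune", "-af", "--keep-storage", "0"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        print(f"WARNING: pre-assembly prune failed: {result.stderr.strip()}")
```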
…e cap

Forensic read of the four #495 runs (run0 failed, val1+val2 OOM'd,
val3 passed 433/433) showed several pieces of commits 2bafa99,
479296c, c18df3d were no-ops. Details in PR thread. Changes here:

Workflow preflight step:
- Drop the `docker buildx prune --max-storage 60g` first-try: the flag
  does not exist, every run printed `unknown flag: --max-storage` and
  fell over to `--keep-storage` (itself deprecated now; buildx renamed
  it to `--reserved-space`). Setup-buildx had just created the builder
  so there was nothing to prune anyway — reclaimed 0 B every run.
- Drop the `df /var/lib/buildkit` free-space gate: the path does not
  exist on GH-hosted runners because BuildKit runs inside the
  buildx-container builder, not on the host. Every run logged
  "skipping disk check" and proceeded. Replace with a `df /` gate
  (≥150 GiB free) — the runner reliably has 238 GiB free after
  `free-disk-space`, so this is the check we actually wanted (sketched
  below).

Per-assembly cleanup (`assemble_agent_image`):
- Drop `docker image prune -f`: redundant with `docker system prune -f`
  (system prune is a superset). Worse, the two compete for the Docker
  daemon's prune lock — the run24777009977 log shows 139/433 image
  prunes and 110/433 system prunes failed with rc=1 racing each other
  on the same image. Dropping one saves the wasted subprocess spawns
  and the lock contention.
- Raise the OPENHANDS_BUILDKIT_KEEP_STORAGE_GB default 10 → 30. The
  10 GiB cap from c18df3d was calibrated on the val2 run that OOM'd at
  247/433; run24777009977 (same 10 GiB cap) then finished 433/433 with
  ~260 GiB starting headroom, so 10 was over-provisioned. 30 leaves
  more shared base+builder layers warm between assemblies, restoring
  the 70 s/image fast path the assemble_agent_image comment assumes.

Pre-assembly prune (`assemble_all_agent_images`):
- Replace `docker builder prune --keep-storage 0` with
  `docker buildx prune -af`. The base phase runs
  `docker buildx build --push` against the docker-container driver
  builder from setup-buildx-action; its cache lives inside that
  container, isolated from the host daemon's /var/lib/docker/buildkit.
  `docker builder prune` targets the host daemon (empty at this point
  — assembly hasn't started), which is why run24777009977 logged
  "Total reclaimed space: 0B" on this exact call. `docker buildx
  prune` targets the currently-selected buildx context — i.e. the
  container builder — and will actually reclaim whatever the base
  phase parked there.
- Drop the `docker image prune -f` companion call (nothing for it to
  prune before assembly begins) and the
  OPENHANDS_SKIP_PREASSEMBLY_PRUNE escape hatch (now that the step
  does useful work, don't encourage skipping it).

Refs OpenHands/evaluation#495
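
The corrected preflight, rendered in Python for illustration (the real step is shell in `build-swtbench-images.yml`; the 150 GiB threshold and the buildx-vs-builder distinction come from the commit text above):

```python
import shutil
import subprocess

# Gate on the root filesystem -- /var/lib/buildkit never existed on
# GH-hosted runners; / is what actually fills up.
free_gib = shutil.disk_usage("/").free / 2**30
if free_gib < 150:  # threshold from the commit text
    raise SystemExit(f"preflight: {free_gib:.0f} GiB free on /, need >= 150")

# The pre-assembly prune must target the buildx (docker-container)
# builder's cache, not the host daemon's empty /var/lib/docker/buildkit.
subprocess.run(["docker", "buildx", "prune", "-af"], check=False)
```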
all-hands-bot
left a comment
🟢 Good taste — Pragmatic fix for a real production issue with solid defensive error handling. Just two minor consistency suggestions below.
Addresses PR #672 review comments on d400e50:

- build_base_images.py:809 — capture the pre-assembly
  `docker buildx prune` subprocess result and warn on timeout/nonzero
  exit, matching the pattern used in the per-assembly cleanup below
  (sketched after this message).
- build-swtbench-images.yml:221 — comment explaining where the 150 GiB
  threshold comes from (ubuntu-latest-8core has 290 GiB total;
  ~260 GiB free post free-disk-space across validation runs
  24642237265 / 24670432554 / 24777009977; 150 picks the midpoint
  between steady-state and observed cold-assembly consumption).

Refs OpenHands/evaluation#495
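
Roughly the pattern the first bullet asks for (the timeout value is an assumption):

```python
import subprocess

try:
    result = subprocess.run(
        ["docker", "buildx", "prune", "-af"],
        capture_output=True,
        text=True,
        timeout=600,  # illustrative
    )
    if result.returncode != 0:
        print(f"WARNING: buildx prune failed (rc={result.returncode}): "
              f"{result.stderr.strip()}")
except subprocess.TimeoutExpired:
    # Warn-and-continue, matching the per-assembly cleanup.
    print("WARNING: buildx prune timed out; continuing")
```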
all-hands-bot
left a comment
🟢 Good taste — Pragmatic fix for a real production issue with excellent defensive error handling. The solution is simple, well-documented, and properly scoped. Both previously unresolved threads appear to be addressed in the current code.
The step reclaims ~24 GiB on top of the 238 GiB the runner already has free at job start, at a cost of ~3m24s wall time. With the code-side fixes in this PR (rmi + per-assembly system/builder prune + pre-assembly buildx prune), run 24798675779 finished 433/433 with ~100 GiB headroom even with free-disk-space in place, so the extra 24 GiB was not load-bearing. Preflight df / ≥150 GiB gate still fast-fails on unexpected runner state. Refs OpenHands/evaluation#495
Superseded by #690, which keeps the disk fix in the shared SWE/SWT phased image assembly code without workflow preflight changes.
Closing in favor of #690 so the SWE-bench and SWT-bench disk cleanup fix is reviewed as one shared-code PR.
Summary
Fixes OpenHands/evaluation#495. SWT-bench image builds started hitting
`No space left on device` on `ubuntu-latest-8core` partway through
assembly, e.g. run 24572967778 (disk filled at image 160/433 after
~42 min).

Root cause is a latent bug in `assemble_agent_image` exposed by the
Blacksmith → GitHub-hosted migration in #651:

- It assembles with `docker build` + `docker push` (a comment documents
  the 455 s → 70 s speedup vs. `buildx --push`) but never calls
  `docker rmi` on the assembled tag. Every per-instance agent image
  stayed in `/var/lib/docker`.
- `ubuntu-latest-8core` has ~80 GB and fills up around image 150.
- `n-limit=5` runs hid this because `remote_image_exists(final_tag)`
  short-circuits the docker build (see the sketch below). No full cold
  SWT-bench run has actually succeeded on `ubuntu-latest-8core` since
  the migration.
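
A hypothetical sketch of the fast path that hid the leak — the helper name `remote_image_exists` is from the PR text; probing via `docker manifest inspect` is one common implementation, not necessarily this repo's:

```python
import subprocess


def remote_image_exists(tag: str) -> bool:
    # Probe the registry without pulling; rc == 0 means the tag exists.
    return subprocess.run(
        ["docker", "manifest", "inspect", tag],
        capture_output=True,
    ).returncode == 0


# Warm n-limit=5 runs find nearly every final_tag already pushed, skip
# `docker build` entirely, and therefore never trigger the local-image
# leak -- only a cold, full 433-image build exposes it.
```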
Changes

- `benchmarks/swebench/build_base_images.py` — in
  `assemble_agent_image`, after all pushes succeed, run
  `docker rmi -f <pushed tags>`. Shared base+builder layers stay
  cached (they're what the comment-advertised speedup relies on); only
  the just-pushed final tag is dropped. Failures warn but don't fail
  the assembly.
- `.github/workflows/build-swtbench-images.yml` — belt-and-suspenders
  headroom so the next unexpected growth doesn't OOM again:
  - `jlumbroso/free-disk-space` pinned to
    `54081f138730dfa15788a46383842cd2f914a1be` (v1.3.1). Frees ~30 GB
    by removing preinstalled Android/.NET/Haskell/etc. Keeps the
    docker-images cache.
  - a swebench-style preflight that prunes buildx cache and verifies
    ≥75 GB free on `/var/lib/buildkit` before the main build runs.

The code change is the actual fix; the workflow changes are defense in
depth (applying them to SWE-bench too should follow in a separate PR,
since SWE-bench has the same latent bug).
Testing
`TestAssembleAgentImage` tests should still pass — the mocked
`subprocess.run` returns OK for the new `rmi` call; assertions only
check `result.tags` / `result.error`.
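
A sketch of why that holds, under the assumption that the suite patches `subprocess.run` with an always-OK result (the reduction below is hypothetical, not the repo's actual test code):

```python
import subprocess
from unittest import mock

# A blanket rc=0 mock means the new `docker rmi` call succeeds exactly
# like the existing build/push calls, and no assertion counts
# invocations -- only result.tags / result.error are checked.
ok = mock.Mock(returncode=0, stdout="", stderr="")
with mock.patch("subprocess.run", return_value=ok):
    assert subprocess.run(["docker", "rmi", "-f", "some-tag"]).returncode == 0
```

AI disclosure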
This PR was prepared by Claude (Claude Code) on behalf of @simonrosenberg.