
fix(swtbench): rmi after push + free runner disk (fixes evaluation#495)#672

Closed
simonrosenberg wants to merge 6 commits into main from openhands/fix-swtbench-disk-space-495

Conversation

@simonrosenberg
Collaborator

Summary

Fixes OpenHands/evaluation#495. SWT-bench image builds started hitting No space left on device on ubuntu-latest-8core partway through assembly, e.g. run 24572967778 (disk filled at image 160/433 after ~42 min).

Root cause is a latent bug in assemble_agent_image exposed by the Blacksmith → GitHub-hosted migration in #651:

  • Assembly uses docker build + docker push (comment documents the 455 s → 70 s speedup vs. buildx --push) but never calls docker rmi on the assembled tag. Every per-instance agent image stayed in /var/lib/docker.
  • Blacksmith runners had hundreds of GB of persistent disk, so accumulation never mattered. ubuntu-latest-8core has ~80 GB and fills up around image 150.
  • Cache-hit and n-limit=5 runs hid this because remote_image_exists(final_tag) short-circuits the docker build. No full cold SWT-bench run has actually succeeded on ubuntu-latest-8core since the migration.
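The cache-hit short-circuit described in the last bullet can be sketched as follows (a minimal illustration, not the real `assemble_agent_image` signature — the helpers are passed in here purely so the control flow is visible):

```python
def assemble_agent_image(final_tag, remote_image_exists, build_and_push):
    # If the final tag already exists in the registry, we return before
    # `docker build` ever runs, so no layers land in /var/lib/docker.
    # This is why cache-hit and n-limit=5 runs never hit the disk leak.
    if remote_image_exists(final_tag):
        return final_tag
    return build_and_push(final_tag)
```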

Changes

  1. benchmarks/swebench/build_base_images.py — in assemble_agent_image, after all pushes succeed, run docker rmi -f <pushed tags>. Shared base+builder layers stay cached (they're what the comment-advertised speedup relies on), only the just-pushed final tag is dropped. Failures warn but don't fail the assembly.
  2. .github/workflows/build-swtbench-images.yml — belt-and-suspenders headroom so the next unexpected growth doesn't OOM again:
    • jlumbroso/free-disk-space pinned to 54081f138730dfa15788a46383842cd2f914a1be (v1.3.1). Frees ~30 GB by removing preinstalled Android/.NET/Haskell/etc. Keeps the docker-images cache.
    • Swebench-style preflight step that prunes BuildKit cache to 60 GiB target and fails fast if less than 75 GB free on /var/lib/buildkit.

The code change is the actual fix; the workflow changes are defense in depth (applying them to SWE-bench too should follow in a separate PR since SWE-bench has the same latent bug).
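A minimal sketch of the change-1 cleanup, assuming the warn-but-don't-fail behavior described above (function names and the 300 s timeout are illustrative, not the actual `build_base_images.py` code):

```python
import subprocess

def rmi_command(pushed_tags):
    # Only the just-pushed final tags are removed; shared base+builder
    # layers are untouched, preserving the documented assembly speedup.
    return ["docker", "rmi", "-f", *pushed_tags]

def cleanup_after_push(pushed_tags, run=subprocess.run):
    # Warn-and-continue: a failed rmi must never fail the assembly.
    try:
        result = run(rmi_command(pushed_tags), capture_output=True,
                     text=True, timeout=300)
        if result.returncode != 0:
            print(f"warning: docker rmi exited {result.returncode}; continuing")
    except (subprocess.TimeoutExpired, FileNotFoundError) as exc:
        print(f"warning: docker rmi skipped ({exc}); continuing")
```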

Testing

  • Existing TestAssembleAgentImage tests should still pass — mocked subprocess.run returns OK for the new rmi call; assertions only check result.tags / result.error.
  • Real validation: re-running evaluation run 24572967778 from the evaluation repo, pointed at this branch. Will comment back on the issue with the re-run URL.

AI disclosure

This PR was prepared by Claude (Claude Code) on behalf of @simonrosenberg.

SWT-bench image assembly kept every freshly pushed agent image in the
local docker daemon, so a full 500-instance cold build outgrew the
~80 GB ubuntu-latest-8core runner partway through (no-space-left
around image 150/433). Pre-migration Blacksmith runners had hundreds
of GB of persistent disk, which hid the retention bug.

- assemble_agent_image: `docker rmi -f` the pushed tags once the push
  succeeds. Shared base+builder layers stay cached, so the intended
  ~70 s/image speedup is preserved; only the just-pushed final tag is
  dropped.
- build-swtbench-images.yml: add pinned jlumbroso/free-disk-space
  step (frees ~30 GB of preinstalled toolchains) and the swebench-
  style preflight that prunes buildx cache and verifies ≥75 GB free
  on /var/lib/buildkit before the main build runs.

Fixes OpenHands/evaluation#495

@all-hands-bot all-hands-bot left a comment


🟢 Good taste — Pragmatic fix for a real production issue. The solution is simple, defensive, and well-scoped.

- Comment thread: benchmarks/swebench/build_base_images.py (outdated)
- Comment thread: .github/workflows/build-swtbench-images.yml (outdated)
- Comment thread: .github/workflows/build-swtbench-images.yml (outdated)
@all-hands-bot

Detailed Review Analysis

[RISK ASSESSMENT]

[Overall PR] ⚠️ Risk Assessment: 🟡 MEDIUM

This PR modifies CI infrastructure and Docker image lifecycle management in production builds. Risk factors:

Increasing risk:

  • Changes GitHub Actions workflow (new external dependency, new preflight checks)
  • Adds jlumbroso/free-disk-space action (SHA-pinned, mitigated)
  • Modifies Docker image lifecycle across all SWT-bench builds

Decreasing risk:

  • Fixes actual production failures (No space left on device in run 24572967778)
  • Defensive implementation: rmi failures are logged but non-fatal
  • No changes to actual build or assembly logic
  • Backward compatible (only adds cleanup, doesn't change build behavior)
  • Preflight checks fail fast to prevent wasted CI time

Recommendation: Safe to merge after CI validation. The defensive approach (non-fatal rmi, warnings, preflight fast-fail) minimizes risk. Monitor the first few production runs to confirm disk usage stays under control.


VERDICT:

Worth merging — Core fix is sound, defensive error handling is excellent. The workflow additions are appropriate defense-in-depth for a constrained runner environment.

KEY INSIGHT:

The root cause analysis is spot-on: Blacksmith's large disks masked a latent resource leak that GitHub runners expose. The fix correctly targets the leak (missing rmi) while adding safeguards (free-disk-space, preflight checks) to prevent similar issues. This is pragmatic engineering — solve the real problem, then add guardrails.


This review was generated by an AI agent (OpenHands) on behalf of the user.

simonrosenberg pushed a commit that referenced this pull request Apr 20, 2026
Surgical revert of the swtbench portion of #651 — puts the SWT-bench
image-build workflow back on blacksmith-32vcpu-ubuntu-2204 with the
useblacksmith/setup-docker-builder builder. Blacksmith's larger
persistent disk masked the unbounded local-image growth that now
fails on ubuntu-latest-8core (evaluation#495).

Kept intact from main:
- workflow_call trigger and cross-repo checkout (#666)
- all input/env plumbing and downstream steps

Intended as a parallel fallback branch while the real fix in #672
(docker rmi after push + free-disk-space + preflight) is validated.
Simon Rosenberg added 2 commits April 20, 2026 10:53
Validation run [1] showed the previous change (#672 / commit 2bafa99)
rmi'd the pushed final tags but OOMed anyway at image 165/433:

  ERROR: failed to build: failed to solve: ResourceExhausted: failed
  to copy: write /var/lib/docker/buildkit/content/ingest/.../data:
  no space left on device

Disk filled because:
- DOCKER_BUILDKIT=1 makes `docker build` route image pulls through
  BuildKit's content-addressed ingest store, which is *not* cleaned
  by `docker rmi` or `docker image prune`.
- Each per-instance base image blob (SWE-bench bakes conda + test
  repos in -> several GB each) stays in ingest forever until
  `docker buildx prune` runs.

Extend the post-push cleanup to three tiers, all warn-and-continue:
1. rmi the final tags AND the per-instance base tag (base is 1:1
   with the instance, never reused).
2. `docker image prune -f` to return the dangling overlay2 layers
   to free space immediately.
3. `docker builder prune -f --keep-storage 40g` (configurable via
   OPENHANDS_BUILDKIT_KEEP_STORAGE_GB) to cap BuildKit's ingest
   cache.

Shared builder image layers fit inside the 40g cap, so the intended
local-cache speedup for assembly is preserved.

[1] https://github.com/OpenHands/benchmarks/actions/runs/24642237265

Refs OpenHands/evaluation#495
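The three-tier cleanup above might be assembled roughly like this sketch (the env knob and 40g default come from the commit message; the function name and command-list shape are illustrative):

```python
import os

def post_push_cleanup_commands(final_tags, base_tag):
    # All three tiers run warn-and-continue; each targets a different
    # store: tagged images, dangling overlay2 layers, BuildKit ingest.
    keep_gb = os.environ.get("OPENHANDS_BUILDKIT_KEEP_STORAGE_GB", "40")
    return [
        # Tier 1: drop the final tags and the 1:1 per-instance base tag.
        ["docker", "rmi", "-f", *final_tags, base_tag],
        # Tier 2: return the now-dangling layers to free space immediately.
        ["docker", "image", "prune", "-f"],
        # Tier 3: cap BuildKit's ingest cache, which rmi never touches.
        ["docker", "builder", "prune", "-f", f"--keep-storage={keep_gb}g"],
    ]
```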
Second validation run [1] got from 72/433 to 247/433 successful
assemblies before OOM but died anyway. Three follow-ons:

1. Pre-assembly reset: `docker builder prune -af --keep-storage 0`
   at the top of assemble_all_agent_images. The base-image phase
   runs `buildx build --push` which fills BuildKit's ingest with
   ~100 GB of SWE-bench base-image blobs before assembly even
   starts; the per-assembly cap can't shrink that retroactively.
2. `docker system prune -f` added to the post-assembly cleanup.
   Val2 OOM sites included /var/lib/docker/tmp (tarsplit staging)
   and /var/lib/docker/image/overlay2/layerdb/tmp (new-layer
   commits) — system prune sweeps both.
3. Drop BuildKit keep-storage default from 40 GiB to 10 GiB and
   add -a (include unused images). The 40 GiB cap permitted too
   much to survive alongside four concurrent in-flight builds.
   Still tunable via OPENHANDS_BUILDKIT_KEEP_STORAGE_GB.

The pre-assembly prune can be skipped with
OPENHANDS_SKIP_PREASSEMBLY_PRUNE=1 for dev loops where cache
reuse is more valuable than disk headroom.

[1] https://github.com/OpenHands/benchmarks/actions/runs/24670432554
    (247 ok, 51 failed, OOM at ~57%)

Refs OpenHands/evaluation#495
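The three follow-ons could look roughly like this (a sketch under the commit message's stated env knobs; function names are illustrative):

```python
import os

def pre_assembly_prune_commands():
    # Follow-on 1: full BuildKit reset before assembly starts, since the
    # base phase has already parked ~100 GB in ingest. Skippable for dev
    # loops where cache reuse beats disk headroom.
    if os.environ.get("OPENHANDS_SKIP_PREASSEMBLY_PRUNE") == "1":
        return []
    return [["docker", "builder", "prune", "-af", "--keep-storage", "0"]]

def post_assembly_prune_commands():
    # Follow-on 2: system prune sweeps /var/lib/docker/tmp and
    # layerdb/tmp. Follow-on 3: keep-storage default drops to 10 GiB
    # and -a includes unused images.
    keep_gb = os.environ.get("OPENHANDS_BUILDKIT_KEEP_STORAGE_GB", "10")
    return [
        ["docker", "system", "prune", "-f"],
        ["docker", "builder", "prune", "-af", f"--keep-storage={keep_gb}g"],
    ]
```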
…e cap

Forensic read of the four #495 runs (run0 failed, val1+val2 OOM'd,
val3 passed 433/433) showed several pieces of commits 2bafa99, 479296c,
c18df3d were no-ops. Details in PR thread. Changes here:

Workflow preflight step:
- Drop `docker buildx prune --max-storage 60g` first-try: flag does not
  exist, every run printed `unknown flag: --max-storage` and fell over
  to `--keep-storage` (itself deprecated now; buildx renamed it to
  `--reserved-space`). Setup-buildx had just created the builder so
  there was nothing to prune anyway — reclaimed 0 B every run.
- Drop `df /var/lib/buildkit` free-space gate: path does not exist on
  GH-hosted runners because BuildKit runs inside the buildx-container
  builder, not on the host. Every run logged "skipping disk check"
  and proceeded. Replace with a `df /` gate (≥150 GiB free) — the
  runner reliably has 238 GiB free after `free-disk-space`, so this
  is the check we actually wanted.

Per-assembly cleanup (`assemble_agent_image`):
- Drop `docker image prune -f`: redundant with `docker system prune -f`
  (system prune is a superset). Worse, the two compete for the Docker
  daemon's prune lock — run 24777009977 log shows 139/433 image prunes
  and 110/433 system prunes failed with rc=1 racing each other on the
  same image. Dropping one saves the wasted subprocess spawns and the
  lock contention.
- Raise OPENHANDS_BUILDKIT_KEEP_STORAGE_GB default 10 → 30. The 10 GiB
  cap from c18df3d was calibrated on the val2 run that OOM'd at
  247/433; run 24777009977 (same 10 GiB cap) then finished 433/433 with
  ~260 GiB starting headroom, so 10 was over-provisioned. 30 leaves
  more shared base+builder layers warm between assemblies, restoring
  the 70 s/image fast path the assemble_agent_image comment assumes.

Pre-assembly prune (`assemble_all_agent_images`):
- Replace `docker builder prune --keep-storage 0` with `docker buildx
  prune -af`. The base phase runs `docker buildx build --push` against
  the docker-container driver builder from setup-buildx-action; its
  cache lives inside that container, isolated from the host daemon's
  /var/lib/docker/buildkit. `docker builder prune` targets the host
  daemon (empty at this point — assembly hasn't started), which is why
  run 24777009977 logged "Total reclaimed space: 0B" on this exact
  call. `docker buildx prune` targets the currently-selected buildx
  context — i.e. the container builder — and will actually reclaim
  whatever the base phase parked there.
- Drop `docker image prune -f` companion call (nothing for it to prune
  before assembly begins) and the OPENHANDS_SKIP_PREASSEMBLY_PRUNE
  escape hatch (now that the step does useful work, don't encourage
  skipping it).

Refs OpenHands/evaluation#495
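The replacement `df /` gate described above can be sketched in a few lines (the workflow step itself is shell; this Python equivalent just illustrates the check, with the 150 GiB threshold from the commit message):

```python
import shutil

def preflight_disk_gate(path="/", min_free_gib=150):
    # Check the host root filesystem, which actually exists on GH-hosted
    # runners, instead of the nonexistent /var/lib/buildkit path the old
    # gate silently skipped. Fail fast below the threshold.
    free_gib = shutil.disk_usage(path).free / 2**30
    if free_gib < min_free_gib:
        raise SystemExit(
            f"preflight: {free_gib:.0f} GiB free on {path}, "
            f"need >= {min_free_gib} GiB"
        )
    return free_gib
```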

@all-hands-bot all-hands-bot left a comment


🟢 Good taste — Pragmatic fix for a real production issue with solid defensive error handling. Just two minor consistency suggestions below.

- Comment thread: benchmarks/swebench/build_base_images.py
- Comment thread: .github/workflows/build-swtbench-images.yml
Addresses PR #672 review comments on d400e50:
- build_base_images.py:809 — capture the pre-assembly `docker buildx
  prune` subprocess result and warn on timeout/nonzero exit, matching
  the pattern used in the per-assembly cleanup below.
- build-swtbench-images.yml:221 — comment explaining where the 150
  GiB threshold comes from (ubuntu-latest-8core has 290 GiB total;
  ~260 GiB free post free-disk-space across validation runs
  24642237265 / 24670432554 / 24777009977; 150 picks the midpoint
  between steady-state and observed cold-assembly consumption).

Refs OpenHands/evaluation#495
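The capture-and-warn pattern referenced in the first bullet might look like this sketch (illustrative helper, not the actual `build_base_images.py` code; the 600 s timeout is an assumption):

```python
import subprocess

def run_warn_on_failure(cmd, run=subprocess.run, timeout=600):
    # Capture the subprocess result and warn on timeout or nonzero exit
    # instead of silently dropping it -- matching the per-assembly
    # cleanup pattern. Never raises; cleanup must not abort the run.
    try:
        result = run(cmd, capture_output=True, text=True, timeout=timeout)
        if result.returncode != 0:
            print(f"warning: {' '.join(cmd)} exited {result.returncode}")
        return result
    except subprocess.TimeoutExpired:
        print(f"warning: {' '.join(cmd)} timed out after {timeout}s")
        return None
```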
@simonrosenberg simonrosenberg self-assigned this Apr 22, 2026

@all-hands-bot all-hands-bot left a comment


🟢 Good taste — Pragmatic fix for a real production issue with excellent defensive error handling. The solution is simple, well-documented, and properly scoped. Both previously unresolved threads appear to be addressed in the current code.

The step reclaims ~24 GiB on top of the 238 GiB the runner already has
free at job start, at a cost of ~3m24s wall time. With the code-side
fixes in this PR (rmi + per-assembly system/builder prune + pre-assembly
buildx prune), run 24798675779 finished 433/433 with ~100 GiB headroom
even with free-disk-space in place, so the extra 24 GiB was not
load-bearing. Preflight df / ≥150 GiB gate still fast-fails on
unexpected runner state.

Refs OpenHands/evaluation#495
@simonrosenberg
Collaborator Author

Superseded by #690, which keeps the disk fix in the shared SWE/SWT phased image assembly code without workflow preflight changes.

@simonrosenberg
Collaborator Author

Closing in favor of #690 so the SWE-bench and SWT-bench disk cleanup fix is reviewed as one shared-code PR.
