fix(swtbench): rmi after push + free runner disk (fixes evaluation#495) #672
simonrosenberg wants to merge 6 commits into
Conversation
SWT-bench image assembly kept every freshly pushed agent image in the
local docker daemon, so a full 500-instance cold build outgrew the
~80 GB ubuntu-latest-8core runner partway through (no-space-left
around image 150/433). Pre-migration Blacksmith runners had hundreds
of GB of persistent disk, which hid the retention bug.
- assemble_agent_image: `docker rmi -f` the pushed tags once the push
  succeeds (sketched below). Shared base+builder layers stay cached, so
  the intended ~70 s/image speedup is preserved; only the just-pushed
  final tag is dropped.
- build-swtbench-images.yml: add pinned jlumbroso/free-disk-space
step (frees ~30 GB of preinstalled toolchains) and the swebench-
style preflight that prunes buildx cache and verifies ≥75 GB free
on /var/lib/buildkit before the main build runs.
Fixes OpenHands/evaluation#495
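
For concreteness, a minimal sketch of the warn-and-continue cleanup from the first bullet — the helper name and timeout are illustrative; the real change lives in `assemble_agent_image` in `benchmarks/swebench/build_base_images.py`:

```python
import subprocess


def _remove_pushed_tags(tags: list[str]) -> None:
    """Best-effort: drop just-pushed tags from the local daemon.

    Shared base+builder layers are still referenced by other tags, so
    `docker rmi` leaves them cached for the next assembly.
    """
    result = subprocess.run(
        ["docker", "rmi", "-f", *tags],
        capture_output=True,
        text=True,
        timeout=120,  # illustrative; the real timeout may differ
    )
    if result.returncode != 0:
        # Cleanup must never fail the assembly itself.
        print(f"WARNING: docker rmi failed (rc={result.returncode}): "
              f"{result.stderr.strip()}")
```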
all-hands-bot
left a comment
🟢 Good taste — Pragmatic fix for a real production issue. The solution is simple, defensive, and well-scoped.
Detailed Review Analysis

[RISK ASSESSMENT] [Overall PR] This PR modifies CI infrastructure and Docker image lifecycle management in production builds.

Recommendation: Safe to merge after CI validation. The defensive approach (non-fatal …

VERDICT: ✅ Worth merging — Core fix is sound, defensive error handling is excellent. The workflow additions are appropriate defense-in-depth for a constrained runner environment.

KEY INSIGHT: The root cause analysis is spot-on: Blacksmith's large disks masked a latent resource leak that GitHub runners expose. The fix correctly targets the leak (missing …

This review was generated by an AI agent (OpenHands) on behalf of the user.
Surgical revert of the swtbench portion of #651 — puts the SWT-bench
image-build workflow back on blacksmith-32vcpu-ubuntu-2204 with the
useblacksmith/setup-docker-builder builder. Blacksmith's larger
persistent disk masked the unbounded local-image growth that now fails
on ubuntu-latest-8core (evaluation#495).

Kept intact from main:
- workflow_call trigger and cross-repo checkout (#666)
- all input/env plumbing and downstream steps

Intended as a parallel fallback branch while the real fix in #672
(docker rmi after push + free-disk-space + preflight) is validated.
Validation run [1] showed the previous change (#672 / commit 2bafa99)
rmi'd the pushed final tags but OOMed anyway at image 165/433:

ERROR: failed to build: failed to solve: ResourceExhausted: failed to
copy: write /var/lib/docker/buildkit/content/ingest/.../data: no space
left on device

Disk filled because:
- DOCKER_BUILDKIT=1 makes `docker build` route image pulls through
  BuildKit's content-addressed ingest store, which is *not* cleaned by
  `docker rmi` or `docker image prune`.
- Each per-instance base image blob (SWE-bench bakes conda + test
  repos in -> several GB each) stays in ingest forever until
  `docker buildx prune` runs.

Extend the post-push cleanup to three tiers, all warn-and-continue
(sketched below):
1. rmi the final tags AND the per-instance base tag (base is 1:1 with
   the instance, never reused).
2. `docker image prune -f` to return the dangling overlay2 layers to
   free space immediately.
3. `docker builder prune -f --keep-storage 40g` (configurable via
   OPENHANDS_BUILDKIT_KEEP_STORAGE_GB) to cap BuildKit's ingest cache.

Shared builder image layers fit inside the 40g cap, so the intended
local-cache speedup for assembly is preserved.

[1] https://github.com/OpenHands/benchmarks/actions/runs/24642237265

Refs OpenHands/evaluation#495
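
A sketch of the three tiers under the stated assumptions — the function name and call shape are illustrative; the commands, flags, and env var come from the list above:

```python
import os
import subprocess


def _post_push_cleanup(final_tags: list[str], base_tag: str) -> None:
    """Three-tier post-push cleanup, all warn-and-continue."""
    keep_gb = os.environ.get("OPENHANDS_BUILDKIT_KEEP_STORAGE_GB", "40")
    commands = [
        # 1. Final tags plus the per-instance base tag (1:1 with the
        #    instance, never reused by later assemblies).
        ["docker", "rmi", "-f", *final_tags, base_tag],
        # 2. Return the now-dangling overlay2 layers to free space.
        ["docker", "image", "prune", "-f"],
        # 3. Cap BuildKit's content-addressed ingest store, which
        #    neither rmi nor image prune touches.
        ["docker", "builder", "prune", "-f", "--keep-storage", f"{keep_gb}g"],
    ]
    for cmd in commands:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"WARNING: {' '.join(cmd)} failed: {result.stderr.strip()}")
```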
Second validation run [1] got from 72/433 to 247/433 successful
assemblies before OOM but died anyway. Three follow-ons (the first is
sketched below):

1. Pre-assembly reset: `docker builder prune -af --keep-storage 0` at
   the top of assemble_all_agent_images. The base-image phase runs
   `buildx build --push`, which fills BuildKit's ingest with ~100 GB of
   SWE-bench base-image blobs before assembly even starts; the
   per-assembly cap can't shrink that retroactively.
2. `docker system prune -f` added to the post-assembly cleanup. Val2
   OOM sites included /var/lib/docker/tmp (tarsplit staging) and
   /var/lib/docker/image/overlay2/layerdb/tmp (new-layer commits) —
   system prune sweeps both.
3. Drop the BuildKit keep-storage default from 40 GiB to 10 GiB and
   add -a (include unused images). The 40 GiB cap permitted too much
   to survive alongside four concurrent in-flight builds. Still
   tunable via OPENHANDS_BUILDKIT_KEEP_STORAGE_GB.

The pre-assembly prune can be skipped with
OPENHANDS_SKIP_PREASSEMBLY_PRUNE=1 for dev loops where cache reuse is
more valuable than disk headroom.

[1] https://github.com/OpenHands/benchmarks/actions/runs/24670432554
    (247 ok, 51 failed, OOM at ~57%)

Refs OpenHands/evaluation#495
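
A minimal sketch of follow-on 1, assuming it runs at the top of `assemble_all_agent_images`; the env-var name and command are from the commit text, the function shape is illustrative:

```python
import os
import subprocess


def _preassembly_reset() -> None:
    if os.environ.get("OPENHANDS_SKIP_PREASSEMBLY_PRUNE") == "1":
        # Dev loops may prefer warm caches over disk headroom.
        return
    # Drop everything the base-image phase parked in BuildKit's ingest
    # (~100 GB of SWE-bench base-image blobs) before assembly starts.
    result = subprocess.run(
        ["docker", "builder", "prune", "-af", "--keep-storage", "0"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        print(f"WARNING: pre-assembly prune failed: {result.stderr.strip()}")
```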
…e cap

Forensic read of the four #495 runs (run0 failed, val1+val2 OOM'd,
val3 passed 433/433) showed several pieces of commits 2bafa99,
479296c, c18df3d were no-ops. Details in PR thread. Changes here:

Workflow preflight step:
- Drop the `docker buildx prune --max-storage 60g` first-try: the flag
  does not exist, every run printed `unknown flag: --max-storage` and
  fell over to `--keep-storage` (itself deprecated now; buildx renamed
  it to `--reserved-space`). Setup-buildx had just created the builder
  so there was nothing to prune anyway — reclaimed 0 B every run.
- Drop the `df /var/lib/buildkit` free-space gate: the path does not
  exist on GH-hosted runners because BuildKit runs inside the
  buildx-container builder, not on the host. Every run logged
  "skipping disk check" and proceeded. Replace with a `df /` gate
  (≥150 GiB free) — the runner reliably has 238 GiB free after
  `free-disk-space`, so this is the check we actually wanted (sketched
  below).

Per-assembly cleanup (`assemble_agent_image`):
- Drop `docker image prune -f`: redundant with `docker system prune -f`
  (system prune is a superset). Worse, the two compete for the Docker
  daemon's prune lock — the run24777009977 log shows 139/433 image
  prunes and 110/433 system prunes failed with rc=1 racing each other
  on the same image. Dropping one saves the wasted subprocess spawns
  and the lock contention.
- Raise the OPENHANDS_BUILDKIT_KEEP_STORAGE_GB default 10 → 30. The
  10 GiB cap from c18df3d was calibrated on the val2 run that OOM'd at
  247/433; run24777009977 (same 10 GiB cap) then finished 433/433 with
  ~260 GiB starting headroom, so 10 was over-provisioned. 30 leaves
  more shared base+builder layers warm between assemblies, restoring
  the 70 s/image fast path the assemble_agent_image comment assumes.

Pre-assembly prune (`assemble_all_agent_images`):
- Replace `docker builder prune --keep-storage 0` with
  `docker buildx prune -af`. The base phase runs
  `docker buildx build --push` against the docker-container driver
  builder from setup-buildx-action; its cache lives inside that
  container, isolated from the host daemon's /var/lib/docker/buildkit.
  `docker builder prune` targets the host daemon (empty at this point
  — assembly hasn't started), which is why run24777009977 logged
  "Total reclaimed space: 0B" on this exact call. `docker buildx
  prune` targets the currently-selected buildx context — i.e. the
  container builder — and will actually reclaim whatever the base
  phase parked there.
- Drop the `docker image prune -f` companion call (nothing for it to
  prune before assembly begins) and the
  OPENHANDS_SKIP_PREASSEMBLY_PRUNE escape hatch (now that the step
  does useful work, don't encourage skipping it).

Refs OpenHands/evaluation#495
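
The corrected preflight, rendered in Python for illustration (the real step is shell in `build-swtbench-images.yml`; the 150 GiB threshold and the buildx-vs-builder distinction come from the commit text above):

```python
import shutil
import subprocess

# Gate on the root filesystem -- /var/lib/buildkit never existed on
# GH-hosted runners; / is what actually fills up.
free_gib = shutil.disk_usage("/").free / 2**30
if free_gib < 150:  # threshold from the commit text
    raise SystemExit(f"preflight: {free_gib:.0f} GiB free on /, need >= 150")

# The pre-assembly prune must target the buildx (docker-container)
# builder's cache, not the host daemon's empty /var/lib/docker/buildkit.
subprocess.run(["docker", "buildx", "prune", "-af"], check=False)
```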
all-hands-bot
left a comment
🟢 Good taste — Pragmatic fix for a real production issue with solid defensive error handling. Just two minor consistency suggestions below.
Addresses PR #672 review comments on d400e50:

- build_base_images.py:809 — capture the pre-assembly
  `docker buildx prune` subprocess result and warn on timeout/nonzero
  exit, matching the pattern used in the per-assembly cleanup below
  (sketched after this message).
- build-swtbench-images.yml:221 — comment explaining where the 150 GiB
  threshold comes from (ubuntu-latest-8core has 290 GiB total;
  ~260 GiB free post free-disk-space across validation runs
  24642237265 / 24670432554 / 24777009977; 150 picks the midpoint
  between steady-state and observed cold-assembly consumption).

Refs OpenHands/evaluation#495
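
Roughly the pattern the first bullet asks for (the timeout value is an assumption):

```python
import subprocess

try:
    result = subprocess.run(
        ["docker", "buildx", "prune", "-af"],
        capture_output=True,
        text=True,
        timeout=600,  # illustrative
    )
    if result.returncode != 0:
        print(f"WARNING: buildx prune failed (rc={result.returncode}): "
              f"{result.stderr.strip()}")
except subprocess.TimeoutExpired:
    # Warn-and-continue, matching the per-assembly cleanup.
    print("WARNING: buildx prune timed out; continuing")
```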
all-hands-bot
left a comment
🟢 Good taste — Pragmatic fix for a real production issue with excellent defensive error handling. The solution is simple, well-documented, and properly scoped. Both previously unresolved threads appear to be addressed in the current code.
The step reclaims ~24 GiB on top of the 238 GiB the runner already has free at job start, at a cost of ~3m24s wall time. With the code-side fixes in this PR (rmi + per-assembly system/builder prune + pre-assembly buildx prune), run 24798675779 finished 433/433 with ~100 GiB headroom even with free-disk-space in place, so the extra 24 GiB was not load-bearing. Preflight df / ≥150 GiB gate still fast-fails on unexpected runner state. Refs OpenHands/evaluation#495
Superseded by #690, which keeps the disk fix in the shared SWE/SWT phased image assembly code without workflow preflight changes.
Closing in favor of #690 so the SWE-bench and SWT-bench disk cleanup fix is reviewed as one shared-code PR.
Summary
Fixes OpenHands/evaluation#495. SWT-bench image builds started hitting
`No space left on device` on `ubuntu-latest-8core` partway through
assembly, e.g. run 24572967778 (disk filled at image 160/433 after
~42 min).

Root cause is a latent bug in `assemble_agent_image` exposed by the
Blacksmith → GitHub-hosted migration in #651:

- It assembles with `docker build` + `docker push` (a comment documents
  the 455 s → 70 s speedup vs. `buildx --push`) but never calls
  `docker rmi` on the assembled tag. Every per-instance agent image
  stayed in `/var/lib/docker`.
- `ubuntu-latest-8core` has ~80 GB and fills up around image 150.
- `n-limit=5` runs hid this because `remote_image_exists(final_tag)`
  short-circuits the docker build (see the sketch below). No full cold
  SWT-bench run has actually succeeded on `ubuntu-latest-8core` since
  the migration.
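
A hypothetical sketch of the fast path that hid the leak — the helper name `remote_image_exists` is from the PR text; probing via `docker manifest inspect` is one common implementation, not necessarily this repo's:

```python
import subprocess


def remote_image_exists(tag: str) -> bool:
    # Probe the registry without pulling; rc == 0 means the tag exists.
    return subprocess.run(
        ["docker", "manifest", "inspect", tag],
        capture_output=True,
    ).returncode == 0


# Warm n-limit=5 runs find nearly every final_tag already pushed, skip
# `docker build` entirely, and therefore never trigger the local-image
# leak -- only a cold, full 433-image build exposes it.
```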
Changes

- `benchmarks/swebench/build_base_images.py` — in
  `assemble_agent_image`, after all pushes succeed, run
  `docker rmi -f <pushed tags>`. Shared base+builder layers stay
  cached (they're what the comment-advertised speedup relies on); only
  the just-pushed final tag is dropped. Failures warn but don't fail
  the assembly.
- `.github/workflows/build-swtbench-images.yml` — belt-and-suspenders
  headroom so the next unexpected growth doesn't OOM again:
  - `jlumbroso/free-disk-space` pinned to
    `54081f138730dfa15788a46383842cd2f914a1be` (v1.3.1). Frees ~30 GB
    by removing preinstalled Android/.NET/Haskell/etc. Keeps the
    docker-images cache.
  - a swebench-style preflight that prunes buildx cache and verifies
    ≥75 GB free on `/var/lib/buildkit` before the main build runs.

The code change is the actual fix; the workflow changes are defense in
depth (applying them to SWE-bench too should follow in a separate PR,
since SWE-bench has the same latent bug).
Testing
`TestAssembleAgentImage` tests should still pass — the mocked
`subprocess.run` returns OK for the new `rmi` call; assertions only
check `result.tags` / `result.error`.
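
A sketch of why that holds, under the assumption that the suite patches `subprocess.run` with an always-OK result (the reduction below is hypothetical, not the repo's actual test code):

```python
import subprocess
from unittest import mock

# A blanket rc=0 mock means the new `docker rmi` call succeeds exactly
# like the existing build/push calls, and no assertion counts
# invocations -- only result.tags / result.error are checked.
ok = mock.Mock(returncode=0, stdout="", stderr="")
with mock.patch("subprocess.run", return_value=ok):
    assert subprocess.run(["docker", "rmi", "-f", "some-tag"]).returncode == 0
```

AI disclosure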
This PR was prepared by Claude (Claude Code) on behalf of @simonrosenberg.