## Problem

Benchmark image builds spend significant time exporting BuildKit cache to the registry (`--cache-to`), but this cache is never reused because each SWE-bench instance has a unique base image and therefore a unique cache tag. The registry cache is 100% cold on every run (`cache_import_miss_count=1` for all images across all three experiment runs).
Meanwhile, Blacksmith's sticky disk already provides effective local layer caching — common Dockerfile steps (12-13 out of ~15) are cached locally across builds within the same run and across runs via Blacksmith's snapshot-and-clone mechanism.
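The mechanism above can be sketched as a hedged shell fragment. The registry path, flag shapes, and `INSTANCE` value below are illustrative assumptions, not the exact workflow values:

```shell
# Illustrative sketch (assumed registry path and flag shapes): why the
# exported cache is never reused. The cache ref embeds a per-instance id,
# so no two builds ever share a tag.
INSTANCE="sweb.eval.x86_64.sympy_1776_sympy-17139_tag_latest-0432b673f2b9"
CACHE_REF="ghcr.io/example-org/example-repo:buildcache-source-minimal-${INSTANCE}"

# With OPENHANDS_BUILDKIT_CACHE_MODE=max the build would run roughly:
#   docker buildx build \
#     --cache-from "type=registry,ref=${CACHE_REF}" \
#     --cache-to   "type=registry,ref=${CACHE_REF},mode=max" \
#     .
# The --cache-from lookup always misses (no prior build ever used this ref),
# and the --cache-to export is never consumed by any later build.
echo "${CACHE_REF}"
```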
## Experiment

Three SWT-bench builds (100 images each, `force-build=true`, `sdk-commit=main`) with different `OPENHANDS_BUILDKIT_CACHE_MODE` values:
| Cache mode | Build duration | Throughput | Avg cache export/image | Avg wall clock/image | Retries | Run |
|---|---|---|---|---|---|---|
| `max` | 2h 49m | 35.5 img/h | 22.4s (5.9%) | 380s | 1 | #23275304313 |
| `min` | 2h 37m | 38.2 img/h | 11.7s (3.4%) | 348s | 0 | #23275495756 |
| `off` | 1h 58m | 50.7 img/h | 0s (0%) | 263s | 0 | #23275305744 |
### Caveat: sticky disk inheritance makes `off` vs `max`/`min` not apples-to-apples
All three runs executed on different Blacksmith physical hosts but shared sticky disk state via Blacksmith's snapshot-and-clone mechanism. Sticky disk commit timestamps and disk sizes reveal the inheritance chain:
| Run order | Mode | Physical host | Sticky disk committed | Final disk used |
|---|---|---|---|---|
| 1st | `max` | production-131.153.236.169 | 04:16:28 UTC | 355.10 GiB |
| 2nd | `min` | production-131.153.143.135 | 04:14:18 UTC | 360.24 GiB |
| 3rd | `off` | production-125.253.72.239 | 06:18:00 UTC | 546.29 GiB |
The `off` run's sticky disk parent snapshot (ULID `01KM24V9RKK0DEZNMKVBF5J80B`, timestamp 04:14:51 UTC) was created 33 seconds after the `min` run committed (04:14:18 UTC), and its initial disk usage (~360 GiB) matches the `min` run's final state. The `off` run inherited the `min` run's sticky disk, which had already built all 100 of the same images.
This means:
- The `off` run started with 13/13 cached Dockerfile steps for 86% of images (vs 12/15 for `max` and `min`)
- The ~117s/image speedup of `off` over `max` decomposes into ~22s from skipping cache export plus ~95s from a warmer local BuildKit cache
- The `max` vs `min` comparison is more reliable (similar starting sticky disk state), showing `min` is ~8% faster than `max`
## Detailed Findings

### 1. Registry cache is 100% useless for benchmark builds
Every image across all three runs reported `cache_import_miss_count=1` with zero hits. The `--cache-from type=registry` lookup found nothing in GHCR for any image. This is expected because:

- Each SWE-bench instance has a unique cache tag (e.g., `buildcache-source-minimal-sweb.eval.x86_64.sympy_1776_sympy-17139_tag_latest-0432b673f2b9`)
- These tags are per-instance and never shared across images
- Even within the same mode, the cache exported by one image is never imported by another

Cache import still costs time even on a miss: avg 4.2s (`max`), 5.0s (`min`), 6.5s (`off`) per image.
### 2. Blacksmith sticky disk provides the real caching
Local BuildKit cache hit rates (cached Dockerfile steps per image):
| Cached steps | `max` run | `min` run | `off` run |
|---|---|---|---|
| 6-7 | 1 | 1 | 1 |
| 11 | 7 | 7 | 0 |
| 12 | 79 | 74 | 13 |
| 13 | 6 | 14 | 86 |
| 15-17 | 7 | 4 | 0 |
The `max` and `min` runs both had ~12 cached steps per image (from the same pre-existing sticky disk snapshot). The `off` run had 13 cached steps for 86% of images because it inherited from `min`'s completed state with all 100 images already built.
Within a run, later batches show mild improvement in cache hits as newly built layers become available for subsequent images sharing base layers.
### 3. Cache export overhead is modest but pure waste
| Mode | Avg export time | Total export time (100 images) | % of wall clock |
|---|---|---|---|
| `max` | 22.4s | 37.4 min | 5.9% |
| `min` | 11.7s | 19.5 min | 3.4% |
| `off` | 0s | 0 min | 0% |
For `max`, early images had significantly higher export times (75-93s for the first batch) because all layers had to be uploaded. Later images averaged ~18s as layers were already in the registry. The exported cache is never consumed by any subsequent build.
At 433 images, cache export would waste ~2.7 hours (`max`) or ~1.4 hours (`min`) of compute time.
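The 433-image projection follows directly from the per-image averages in the table above; a quick sanity check of the arithmetic (plain shell and awk, no build tooling assumed):

```shell
# Projected total cache-export time at 433 images, from the measured
# per-image export averages (22.4s for max, 11.7s for min).
max_hours=$(awk 'BEGIN { printf "%.2f", 433 * 22.4 / 3600 }')
min_hours=$(awk 'BEGIN { printf "%.2f", 433 * 11.7 / 3600 }')
echo "max: ${max_hours} h, min: ${min_hours} h"   # → max: 2.69 h, min: 1.41 h
```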
### 4. Image export and push are the dominant overhead (unchanged by cache mode)

| Phase | `max` avg | `min` avg | `off` avg |
|---|---|---|---|
| Image export | 65.9s | 70.5s | 63.4s |
| Push layers | 30.3s | 35.6s | 26.3s |
| Combined | 96.2s | 106.1s | 89.7s |
These are unavoidable since we must push the built image to GHCR. They account for 25-30% of per-image wall clock time across all modes.
### 5. Disk usage was healthy for all runs
| Mode | Start disk | Peak disk | Peak % | Prune events | GiB/image |
|---|---|---|---|---|---|
| `max` | 116 GiB | 355 GiB | 39.7% | 0 | 2.39 |
| `min` | 140 GiB | 360 GiB | 40.3% | 0 | 2.20 |
| `off` | 360 GiB | 546 GiB | 57.9% | 0 | 1.86 |
The `off` run used less disk per image (1.86 vs 2.39 GiB) because it doesn't store cache export artifacts locally. However, it started at a higher base (inherited from `min`). All runs stayed well under the 60% prune threshold.
### 6. Reliability
| Mode | Built | Failed | Retried | Error type |
|---|---|---|---|---|
| `max` | 100 | 0 | 1 | BuildKit gRPC disconnect (transient) |
| `min` | 100 | 0 | 0 | — |
| `off` | 100 | 0 | 0 | — |
All three runs achieved a 100% success rate. The single retry in `max` was a transient BuildKit gRPC error, not cache-related.
## Recommendation

Default benchmark build workflows to `cache-mode=off`. This:

- Eliminates ~22s/image of wasted registry cache export (`max` → `off`)
- Avoids registry write contention when parallel workers export simultaneously
- Reduces per-image disk footprint (1.86 vs 2.39 GiB/image with `max`)
- Preserves `--cache-from` (registry read) in case cache tags are ever pre-populated externally
- Has no downside, since the exported registry cache had a 100% miss rate across all observed runs
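Concretely, the recommended shape keeps the read path and drops the write path. A minimal sketch, assuming a hypothetical cache ref and illustrative flag shapes:

```shell
# Sketch of the off-mode flags: --cache-from is kept (a harmless read that
# currently always misses) while --cache-to is dropped entirely, so the
# build skips the export phase. CACHE_REF is a hypothetical example.
CACHE_REF="ghcr.io/example-org/example-repo:buildcache-source-minimal-example-instance"
BUILD_ARGS="--cache-from type=registry,ref=${CACHE_REF}"
# In max mode the extra flag would be roughly:
#   --cache-to type=registry,ref=${CACHE_REF},mode=max
echo "${BUILD_ARGS}"
```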
The reliable `max` → `min` comparison shows a ~10s/image saving (8% throughput improvement). The full `max` → `off` saving is ~22s/image from cache export alone, plus additional savings from reduced disk I/O contention.
Keep `cache-mode=max` for CI builds of the agent-server image on `main`, where the same image is rebuilt repeatedly and registry cache hits are valuable.
## Future investigation: isolating cache export cost

To get a clean measurement of the pure cache export overhead (without sticky disk confounding), run `max` and `off` back-to-back starting from the same sticky disk snapshot, or run them on cold (no sticky disk) hosts. The current experiment bounds the cache export cost at ~22s/image (`max`) and ~12s/image (`min`), but the ~95s/image difference from sticky disk warmth is a separate variable.
## References

- PR adding `cache-mode` input to workflows: #536
- SDK feature: software-agent-sdk#2479
- Parent tracking issue: SWT-bench image build throughput tracker (historical source of truth) #530