Use cache-mode=off for batch benchmark image builds #540

@simonrosenberg

Description

Problem

Benchmark image builds spend significant time exporting BuildKit cache to the registry (--cache-to), but this cache is never reused because each SWE-bench instance has a unique base image and therefore a unique cache tag. The registry cache is 100% cold on every run (cache_import_miss_count=1 for all images across all three experiment runs).

Meanwhile, Blacksmith's sticky disk already provides effective local layer caching — common Dockerfile steps (12-13 out of ~15) are cached locally across builds within the same run and across runs via Blacksmith's snapshot-and-clone mechanism.

Experiment

Three SWT-bench builds (100 images each, force-build=true, sdk-commit=main) with different OPENHANDS_BUILDKIT_CACHE_MODE values:

| Cache mode | Build duration | Throughput | Avg cache export/image | Avg wall clock/image | Retries | Run |
|---|---|---|---|---|---|---|
| max | 2h 49m | 35.5 img/h | 22.4s (5.9%) | 380s | 1 | #23275304313 |
| min | 2h 37m | 38.2 img/h | 11.7s (3.4%) | 348s | 0 | #23275495756 |
| off | 1h 58m | 50.7 img/h | 0s (0%) | 263s | 0 | #23275305744 |
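As a sanity check, the throughput column can be re-derived from the build durations alone (all figures come from the table above):

```python
# Re-derive throughput (images/hour) from the reported build durations.
durations_min = {"max": 2 * 60 + 49, "min": 2 * 60 + 37, "off": 1 * 60 + 58}
images = 100

for mode, minutes in durations_min.items():
    throughput = images / (minutes / 60)  # images per hour
    print(f"{mode}: {throughput:.1f} img/h")
# max ≈ 35.5 and min ≈ 38.2 match the table exactly; off computes to ≈ 50.8,
# within rounding of the reported 50.7 (the run was slightly over 1h 58m).
```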

Caveat: sticky disk inheritance makes off vs max/min not apples-to-apples

All three runs executed on different Blacksmith physical hosts but shared sticky disk state via Blacksmith's snapshot-and-clone mechanism. Sticky disk commit timestamps and disk sizes reveal the inheritance chain:

| Run order | Mode | Physical host | Sticky disk committed | Final disk used |
|---|---|---|---|---|
| 1st | max | production-131.153.236.169 | 04:16:28 UTC | 355.10 GiB |
| 2nd | min | production-131.153.143.135 | 04:14:18 UTC | 360.24 GiB |
| 3rd | off | production-125.253.72.239 | 06:18:00 UTC | 546.29 GiB |

The off run's sticky disk parent snapshot (ULID 01KM24V9RKK0DEZNMKVBF5J80B, timestamp 04:14:51 UTC) was created 33 seconds after the min run committed (04:14:18 UTC), and its initial disk usage (~360 GiB) matches the min run's final state. The off run inherited the min run's sticky disk, which had already built all 100 of the same images.
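The 33-second gap follows directly from the two timestamps above:

```python
# Gap between the min run's sticky disk commit (04:14:18 UTC) and the
# off run's parent snapshot timestamp (04:14:51 UTC), both from above.
from datetime import datetime

committed = datetime.strptime("04:14:18", "%H:%M:%S")
snapshot = datetime.strptime("04:14:51", "%H:%M:%S")
print((snapshot - committed).total_seconds())  # 33.0
```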

This means:

  • The off run started with 13/13 cached Dockerfile steps for 86% of images (vs 12/15 for max and min)
  • The ~117s/image speedup of off over max decomposes into: ~22s from skipping cache export + ~95s from a warmer local BuildKit cache
  • The max vs min comparison is more reliable (similar starting sticky disk state), showing min is ~8% faster than max
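The decomposition in the second bullet is simple arithmetic on the per-image figures already reported:

```python
# Decompose the per-image speedup of off over max, using the experiment
# table's numbers: 380s (max) vs 263s (off) wall clock, 22.4s avg export.
wall_max, wall_off = 380, 263
export_max = 22.4

total_speedup = wall_max - wall_off            # 117s total
from_export = export_max                       # ~22s: skipped cache export
from_warm_cache = total_speedup - from_export  # ~95s: warmer local cache
print(total_speedup, round(from_warm_cache, 1))  # 117 94.6
```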

Detailed Findings

1. Registry cache is 100% useless for benchmark builds

Every image across all three runs reported cache_import_miss_count=1 with zero hits. The --cache-from type=registry lookup found nothing in GHCR for any image. This is expected because:

  • Each SWE-bench instance has a unique cache tag (e.g., buildcache-source-minimal-sweb.eval.x86_64.sympy_1776_sympy-17139_tag_latest-0432b673f2b9)
  • These tags are per-instance and never shared across images
  • Even within the same mode, the cache exported by one image is never imported by another, because no two images share a cache tag
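To make the collision-free tagging concrete, here is a hypothetical sketch (the helper function and the second instance name are illustrative, not the real tag-generation code) of why one instance's exported cache can never be looked up by another:

```python
# Hypothetical sketch of the per-instance cache tag scheme, modeled on
# the example tag shown above; not the actual implementation.
def cache_tag(instance: str, digest: str) -> str:
    return f"buildcache-source-minimal-{instance}_tag_latest-{digest}"

a = cache_tag("sweb.eval.x86_64.sympy_1776_sympy-17139", "0432b673f2b9")
b = cache_tag("sweb.eval.x86_64.django_9999_django-12345", "deadbeef1234")
assert a != b  # distinct instances -> distinct tags -> guaranteed cold cache
print(a)
```

Because the `--cache-from type=registry` lookup is keyed on this per-instance tag, every lookup is cold unless the exact same instance was previously built and exported.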

Cache import still costs time even on miss: avg 4.2s (max), 5.0s (min), 6.5s (off) per image.

2. Blacksmith sticky disk provides the real caching

Local BuildKit cache hit rates (cached Dockerfile steps per image):

| Cached steps | max run | min run | off run |
|---|---|---|---|
| 6-7 | 1 | 1 | 1 |
| 11 | 7 | 7 | 0 |
| 12 | 79 | 74 | 13 |
| 13 | 6 | 14 | 86 |
| 15-17 | 7 | 4 | 0 |

The max and min runs both had ~12 cached steps per image (from the same pre-existing sticky disk snapshot). The off run had 13 cached steps for 86% of images because it inherited from min's completed state with all 100 images already built.

Within a run, later batches show mild improvement in cache hits as newly built layers become available for subsequent images sharing base layers.
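The per-run averages can be recovered from the distribution table (range buckets are approximated by their midpoints, an assumption of this sketch):

```python
# Weighted average of cached Dockerfile steps per image, from the
# distribution table above. Bucket midpoint -> (max, min, off) image counts.
dist = {
    6.5: (1, 1, 1),
    11:  (7, 7, 0),
    12:  (79, 74, 13),
    13:  (6, 14, 86),
    16:  (7, 4, 0),
}
for i, run in enumerate(("max", "min", "off")):
    avg = sum(steps * counts[i] for steps, counts in dist.items()) / 100
    print(f"{run}: {avg:.1f} cached steps/image")
# max and min both average ~12 steps; off averages ~12.8, consistent
# with 86% of its images hitting 13 cached steps.
```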

3. Cache export overhead is modest but pure waste

| Mode | Avg export time | Total export time (100 images) | % of wall clock |
|---|---|---|---|
| max | 22.4s | 37.4 min | 5.9% |
| min | 11.7s | 19.5 min | 3.4% |
| off | 0s | 0 min | 0% |

For max, early images had significantly higher export times (75-93s for the first batch) because all layers had to be uploaded. Later images averaged ~18s as layers were already in the registry. The exported cache is never consumed by any subsequent build.

At 433 images: cache export wastes ~2.6 hours (max) or ~1.4 hours (min) of compute time.
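The 433-image extrapolation is a straight scale-up of the per-image export averages:

```python
# Extrapolate cache-export waste to a full 433-image run, using the
# per-image export averages from the table above.
images = 433
for mode, export_s in (("max", 22.4), ("min", 11.7)):
    hours = images * export_s / 3600
    print(f"{mode}: {hours:.1f} h of compute spent on cache export")
# max ≈ 2.7 h and min ≈ 1.4 h, matching the figures above within rounding.
```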

4. Image export and push are the dominant overhead (unchanged by cache mode)

| Phase | max avg | min avg | off avg |
|---|---|---|---|
| Image export | 65.9s | 70.5s | 63.4s |
| Push layers | 30.3s | 35.6s | 26.3s |
| Combined | 96.2s | 106.1s | 89.7s |

These are unavoidable since we must push the built image to GHCR. They account for 25-30% of per-image wall clock time across all modes.

5. Disk usage was healthy for all runs

| Mode | Start disk | Peak disk | Peak % | Prune events | GiB/image |
|---|---|---|---|---|---|
| max | 116 GiB | 355 GiB | 39.7% | 0 | 2.39 |
| min | 140 GiB | 360 GiB | 40.3% | 0 | 2.20 |
| off | 360 GiB | 546 GiB | 57.9% | 0 | 1.86 |

The off run used less disk per image (1.86 vs 2.39 GiB) because it doesn't store cache export artifacts locally. However, it started at a higher base (inherited from min). All runs stayed well under the 60% prune threshold.
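The GiB/image column is just (peak − start) / 100, which can be verified against the table:

```python
# Re-derive GiB/image from the start and peak disk figures above.
runs = {"max": (116, 355), "min": (140, 360), "off": (360, 546)}
for mode, (start, peak) in runs.items():
    print(f"{mode}: {(peak - start) / 100:.2f} GiB/image")
# max: 2.39, min: 2.20, off: 1.86 — matching the table
```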

6. Reliability

| Mode | Built | Failed | Retried | Error type |
|---|---|---|---|---|
| max | 100 | 0 | 1 | BuildKit gRPC disconnect (transient) |
| min | 100 | 0 | 0 | none |
| off | 100 | 0 | 0 | none |

All three runs achieved 100% success rate. The single retry in max was a transient BuildKit gRPC error, not cache-related.

Recommendation

Default benchmark build workflows to cache-mode=off. This:

  • Eliminates ~22s/image of wasted registry cache export (max → off)
  • Avoids registry write contention when parallel workers export simultaneously
  • Reduces per-image disk footprint (1.86 vs 2.39 GiB/image with max)
  • Preserves --cache-from (registry read) in case cache tags are ever pre-populated externally
  • Has no downside since the exported registry cache has a 100% miss rate across all observed runs
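In buildx terms, the recommendation amounts to keeping the cache read flag and dropping the cache write flag for benchmark builds. The sketch below shows only the flag shapes; the cache reference value is a placeholder, not the project's real tag:

```python
# Sketch of the buildx cache flags implied by the recommendation.
# cache_ref is illustrative; only the flag syntax is real buildx usage.
cache_ref = "ghcr.io/example/repo:buildcache-example"  # hypothetical tag

# Benchmark builds (cache-mode=off): read-only registry cache, no export.
benchmark_flags = [f"--cache-from=type=registry,ref={cache_ref}"]

# CI builds of the agent-server image (cache-mode=max): read and write.
ci_flags = [
    f"--cache-from=type=registry,ref={cache_ref}",
    f"--cache-to=type=registry,ref={cache_ref},mode=max",
]
print(benchmark_flags, ci_flags)
```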

The more reliable max vs min comparison shows a ~10s/image saving (8% throughput improvement). The full max → off saving is ~22s/image from cache export alone, plus additional savings from reduced disk I/O contention.

Keep cache-mode=max for CI builds of the agent-server image on main, where the same image is rebuilt repeatedly and registry cache hits are valuable.

Future investigation: isolating cache export cost

To get a clean measurement of the pure cache export overhead (without sticky disk confounding), run max and off back-to-back starting from the same sticky disk snapshot, or run them on cold (no sticky disk) hosts. The current experiment bounds the cache export cost at 22s/image (max) and 12s/image (min), but the 95s/image difference from sticky disk warmth is a separate variable.
