Use cache-mode=off for batch benchmark image builds #540

@simonrosenberg

Description

Problem

Benchmark image builds spend significant time exporting BuildKit cache to the registry (--cache-to), but this cache is never reused because each SWE-bench instance has a unique base image and therefore a unique cache tag. The registry cache is 100% cold on every run (cache_import_miss_count=1 for all images across all three experiment runs).

Meanwhile, Blacksmith's sticky disk already provides effective local layer caching — common Dockerfile steps (12-13 out of ~15) are cached locally across builds within the same run and across runs via Blacksmith's snapshot-and-clone mechanism.

Experiment

Three SWT-bench builds (100 images each, force-build=true, sdk-commit=main) with different OPENHANDS_BUILDKIT_CACHE_MODE values:

| Cache mode | Build duration | Throughput | Avg cache export/image | Avg wall clock/image | Retries | Run |
|---|---|---|---|---|---|---|
| max | 2h 49m | 35.5 img/h | 22.4s (5.9%) | 380s | 1 | #23275304313 |
| min | 2h 37m | 38.2 img/h | 11.7s (3.4%) | 348s | 0 | #23275495756 |
| off | 1h 58m | 50.7 img/h | 0s (0%) | 263s | 0 | #23275305744 |
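As a sanity check, the throughput column can be re-derived from the build durations alone (all figures come from the table above):

```python
# Re-derive throughput (images/hour) from the reported build durations.
durations_min = {"max": 2 * 60 + 49, "min": 2 * 60 + 37, "off": 1 * 60 + 58}
images = 100

for mode, minutes in durations_min.items():
    throughput = images / (minutes / 60)  # images per hour
    print(f"{mode}: {throughput:.1f} img/h")
# max ≈ 35.5 and min ≈ 38.2 match the table exactly; off computes to ≈ 50.8,
# within rounding of the reported 50.7 (the run was slightly over 1h 58m).
```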

Caveat: sticky disk inheritance makes off vs max/min not apples-to-apples

All three runs executed on different Blacksmith physical hosts but shared sticky disk state via Blacksmith's snapshot-and-clone mechanism. Sticky disk commit timestamps and disk sizes reveal the inheritance chain:

| Run order | Mode | Physical host | Sticky disk committed | Final disk used |
|---|---|---|---|---|
| 1st | max | production-131.153.236.169 | 04:16:28 UTC | 355.10 GiB |
| 2nd | min | production-131.153.143.135 | 04:14:18 UTC | 360.24 GiB |
| 3rd | off | production-125.253.72.239 | 06:18:00 UTC | 546.29 GiB |

The off run's sticky disk parent snapshot (ULID 01KM24V9RKK0DEZNMKVBF5J80B, timestamp 04:14:51 UTC) was created 33 seconds after the min run committed (04:14:18 UTC), and its initial disk usage (~360 GiB) matches the min run's final state. The off run inherited the min run's sticky disk, which had already built all 100 of the same images.
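The 33-second gap follows directly from the two timestamps above:

```python
# Gap between the min run's sticky disk commit (04:14:18 UTC) and the
# off run's parent snapshot timestamp (04:14:51 UTC), both from above.
from datetime import datetime

committed = datetime.strptime("04:14:18", "%H:%M:%S")
snapshot = datetime.strptime("04:14:51", "%H:%M:%S")
print((snapshot - committed).total_seconds())  # 33.0
```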

This means:

  • The off run started with 13/13 cached Dockerfile steps for 86% of images (vs 12/15 for max and min)
  • The ~117s/image speedup of off over max decomposes into: ~22s from skipping cache export + ~95s from a warmer local BuildKit cache
  • The max vs min comparison is more reliable (similar starting sticky disk state), showing min is ~8% faster than max
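The decomposition in the second bullet is simple arithmetic on the per-image figures already reported:

```python
# Decompose the per-image speedup of off over max, using the experiment
# table's numbers: 380s (max) vs 263s (off) wall clock, 22.4s avg export.
wall_max, wall_off = 380, 263
export_max = 22.4

total_speedup = wall_max - wall_off            # 117s total
from_export = export_max                       # ~22s: skipped cache export
from_warm_cache = total_speedup - from_export  # ~95s: warmer local cache
print(total_speedup, round(from_warm_cache, 1))  # 117 94.6
```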

Detailed Findings

1. Registry cache is 100% useless for benchmark builds

Every image across all three runs reported cache_import_miss_count=1 with zero hits. The --cache-from type=registry lookup found nothing in GHCR for any image. This is expected because:

  • Each SWE-bench instance has a unique cache tag (e.g., buildcache-source-minimal-sweb.eval.x86_64.sympy_1776_sympy-17139_tag_latest-0432b673f2b9)
  • These tags are per-instance and never shared across images
  • Even within the same mode, the cache exported by one image is never imported by another, because no two images share a cache tag
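To make the collision-free tagging concrete, here is a hypothetical sketch (the helper function and the second instance name are illustrative, not the real tag-generation code) of why one instance's exported cache can never be looked up by another:

```python
# Hypothetical sketch of the per-instance cache tag scheme, modeled on
# the example tag shown above; not the actual implementation.
def cache_tag(instance: str, digest: str) -> str:
    return f"buildcache-source-minimal-{instance}_tag_latest-{digest}"

a = cache_tag("sweb.eval.x86_64.sympy_1776_sympy-17139", "0432b673f2b9")
b = cache_tag("sweb.eval.x86_64.django_9999_django-12345", "deadbeef1234")
assert a != b  # distinct instances -> distinct tags -> guaranteed cold cache
print(a)
```

Because the `--cache-from type=registry` lookup is keyed on this per-instance tag, every lookup is cold unless the exact same instance was previously built and exported.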

Cache import still costs time even on miss: avg 4.2s (max), 5.0s (min), 6.5s (off) per image.

2. Blacksmith sticky disk provides the real caching

Local BuildKit cache hit rates (cached Dockerfile steps per image):

| Cached steps | max run | min run | off run |
|---|---|---|---|
| 6-7 | 1 | 1 | 1 |
| 11 | 7 | 7 | 0 |
| 12 | 79 | 74 | 13 |
| 13 | 6 | 14 | 86 |
| 15-17 | 7 | 4 | 0 |

The max and min runs both had ~12 cached steps per image (from the same pre-existing sticky disk snapshot). The off run had 13 cached steps for 86% of images because it inherited from min's completed state with all 100 images already built.

Within a run, later batches show mild improvement in cache hits as newly built layers become available for subsequent images sharing base layers.
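The per-run averages can be recovered from the distribution table (range buckets are approximated by their midpoints, an assumption of this sketch):

```python
# Weighted average of cached Dockerfile steps per image, from the
# distribution table above. Bucket midpoint -> (max, min, off) image counts.
dist = {
    6.5: (1, 1, 1),
    11:  (7, 7, 0),
    12:  (79, 74, 13),
    13:  (6, 14, 86),
    16:  (7, 4, 0),
}
for i, run in enumerate(("max", "min", "off")):
    avg = sum(steps * counts[i] for steps, counts in dist.items()) / 100
    print(f"{run}: {avg:.1f} cached steps/image")
# max and min both average ~12 steps; off averages ~12.8, consistent
# with 86% of its images hitting 13 cached steps.
```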

3. Cache export overhead is modest but pure waste

| Mode | Avg export time | Total export time (100 images) | % of wall clock |
|---|---|---|---|
| max | 22.4s | 37.4 min | 5.9% |
| min | 11.7s | 19.5 min | 3.4% |
| off | 0s | 0 min | 0% |

For max, early images had significantly higher export times (75-93s for the first batch) because all layers had to be uploaded. Later images averaged ~18s as layers were already in the registry. The exported cache is never consumed by any subsequent build.

At 433 images: cache export wastes ~2.6 hours (max) or ~1.4 hours (min) of compute time.
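The 433-image extrapolation is a straight scale-up of the per-image export averages:

```python
# Extrapolate cache-export waste to a full 433-image run, using the
# per-image export averages from the table above.
images = 433
for mode, export_s in (("max", 22.4), ("min", 11.7)):
    hours = images * export_s / 3600
    print(f"{mode}: {hours:.1f} h of compute spent on cache export")
# max ≈ 2.7 h and min ≈ 1.4 h, matching the figures above within rounding.
```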

4. Image export and push are the dominant overhead (unchanged by cache mode)

| Phase | max avg | min avg | off avg |
|---|---|---|---|
| Image export | 65.9s | 70.5s | 63.4s |
| Push layers | 30.3s | 35.6s | 26.3s |
| Combined | 96.2s | 106.1s | 89.7s |

These are unavoidable since we must push the built image to GHCR. They account for 25-30% of per-image wall clock time across all modes.

5. Disk usage was healthy for all runs

| Mode | Start disk | Peak disk | Peak % | Prune events | GiB/image |
|---|---|---|---|---|---|
| max | 116 GiB | 355 GiB | 39.7% | 0 | 2.39 |
| min | 140 GiB | 360 GiB | 40.3% | 0 | 2.20 |
| off | 360 GiB | 546 GiB | 57.9% | 0 | 1.86 |

The off run used less disk per image (1.86 vs 2.39 GiB) because it doesn't store cache export artifacts locally. However, it started at a higher base (inherited from min). All runs stayed well under the 60% prune threshold.
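The GiB/image column is just (peak − start) / 100, which can be verified against the table:

```python
# Re-derive GiB/image from the start and peak disk figures above.
runs = {"max": (116, 355), "min": (140, 360), "off": (360, 546)}
for mode, (start, peak) in runs.items():
    print(f"{mode}: {(peak - start) / 100:.2f} GiB/image")
# max: 2.39, min: 2.20, off: 1.86 — matching the table
```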

6. Reliability

| Mode | Built | Failed | Retried | Error type |
|---|---|---|---|---|
| max | 100 | 0 | 1 | BuildKit gRPC disconnect (transient) |
| min | 100 | 0 | 0 | none |
| off | 100 | 0 | 0 | none |

All three runs achieved 100% success rate. The single retry in max was a transient BuildKit gRPC error, not cache-related.

Recommendation

Default benchmark build workflows to cache-mode=off. This:

  • Eliminates ~22s/image of wasted registry cache export (max → off)
  • Avoids registry write contention when parallel workers export simultaneously
  • Reduces per-image disk footprint (1.86 vs 2.39 GiB/image with max)
  • Preserves --cache-from (registry read) in case cache tags are ever pre-populated externally
  • Has no downside since the exported registry cache has a 100% miss rate across all observed runs
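In buildx terms, the recommendation amounts to keeping the cache read flag and dropping the cache write flag for benchmark builds. The sketch below shows only the flag shapes; the cache reference value is a placeholder, not the project's real tag:

```python
# Sketch of the buildx cache flags implied by the recommendation.
# cache_ref is illustrative; only the flag syntax is real buildx usage.
cache_ref = "ghcr.io/example/repo:buildcache-example"  # hypothetical tag

# Benchmark builds (cache-mode=off): read-only registry cache, no export.
benchmark_flags = [f"--cache-from=type=registry,ref={cache_ref}"]

# CI builds of the agent-server image (cache-mode=max): read and write.
ci_flags = [
    f"--cache-from=type=registry,ref={cache_ref}",
    f"--cache-to=type=registry,ref={cache_ref},mode=max",
]
print(benchmark_flags, ci_flags)
```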

The more reliable max vs min comparison shows a ~10s/image saving (8% throughput improvement). The full max → off saving is ~22s/image from cache export alone, plus additional savings from reduced disk I/O contention.

Keep cache-mode=max for CI builds of the agent-server image on main, where the same image is rebuilt repeatedly and registry cache hits are valuable.

Future investigation: isolating cache export cost

To get a clean measurement of the pure cache export overhead (without sticky disk confounding), run max and off back-to-back starting from the same sticky disk snapshot, or run them on cold (no sticky disk) hosts. The current experiment bounds the cache export cost at 22s/image (max) and 12s/image (min), but the 95s/image difference from sticky disk warmth is a separate variable.
