openshell sandbox create --from OOM-killed on large images due to in-memory image buffering #600

@minhdqdev


Agent Diagnostic

Investigated using the create-spike skill backed by principal-engineer-reviewer. Skills loaded: debug-openshell-cluster, openshell-cli, create-spike.

Investigation summary:

  • Traced sandbox create --from through run.rs → build.rs → push.rs
  • Confirmed three discrete full-image heap allocations in push_local_images()
  • Verified bollard upload_to_container API supports streaming via body_try_stream — no API constraint blocks the fix
  • Confirmed tar crate size-in-header constraint requires a seekable (disk) intermediate
  • Identified tempfile is already a dev-dep; needs promotion to prod dep
  • No test coverage exists for push.rs

Description

Running openshell sandbox create --from <dockerfile-dir> on a large sandbox image (~3.7 GB) causes the openshell process to be killed by the Linux OOM killer. The image export pipeline in crates/openshell-bootstrap/src/push.rs buffers the entire Docker image tar three times in memory before importing it into the gateway, producing ~11 GB peak allocation.

Expected: The export → tar wrap → upload pipeline streams data in O(chunk) memory.
Actual: ~3× image size held in memory simultaneously → OOM kill on images ≥ ~4 GB on machines with ≤ 16 GB free RAM.


Root Cause

Data flow

sandbox create --from ./my-image/
  └─ run.rs:build_from_dockerfile()
       └─ build.rs:build_and_push_image()
            ├─ build_image()          ← streaming, no issue
            └─ push_local_images()
                 ├─ collect_export()  ← BUFFER 1: 3.7 GB Vec<u8>
                 ├─ wrap_in_tar()     ← BUFFER 2: 3.7 GB Vec<u8> (both live simultaneously)
                 └─ upload_archive()  ← BUFFER 3: 3.7 GB Bytes::copy_from_slice

Buffer 1 — collect_export (push.rs:97–107)

async fn collect_export(docker: &Docker, images: &[&str]) -> Result<Vec<u8>> {
    let mut stream = docker.export_images(images);  // already a Stream<Item = Result<Bytes>>
    let mut buf = Vec::new();
    while let Some(chunk) = stream.next().await {
        buf.extend_from_slice(&chunk?);             // ← 3.7 GB accumulates here
    }
    Ok(buf)
}

Buffer 2 — wrap_in_tar (push.rs:114–131)

fn wrap_in_tar(file_path: &str, data: &[u8]) -> Result<Vec<u8>> {
    let mut builder = tar::Builder::new(Vec::new());
    let mut header = tar::Header::new_gnu();
    header.set_path(file_path)?;
    header.set_size(data.len() as u64);
    header.set_cksum();
    builder.append(&header, data)?;   // ← copies the full 3.7 GB into a new Vec<u8>
    Ok(builder.into_inner()?)         // ← returns the second 3.7 GB allocation
}

Both image_tar (Buffer 1) and outer_tar (Buffer 2) are live simultaneously at push.rs:54. Peak ~7.4 GB at this point.

Buffer 3 — upload_archive (push.rs:135–151)

docker.upload_to_container(
    container_name,
    Some(options),
    bollard::body_full(Bytes::copy_from_slice(archive)),  // ← third 3.7 GB copy
)

All three allocations overlap in scope. Peak ~11 GB.

| Step | Allocation | Size |
| --- | --- | --- |
| collect_export | Vec&lt;u8&gt; (image tar) | ~3.7 GB |
| wrap_in_tar | Vec&lt;u8&gt; (outer tar) | ~3.7 GB |
| upload_archive | Bytes (HTTP body) | ~3.7 GB |
| Total peak | | ~11 GB |

Streaming Fix: Feasibility Confirmed

Bollard supports streaming

upload_to_container accepts a BodyType which has three constructors:

| Constructor | Notes |
| --- | --- |
| bollard::body_full(Bytes) | Materialized — current, causes Buffer 3 |
| bollard::body_stream(impl Stream&lt;Item = Bytes&gt;) | Infallible stream |
| bollard::body_try_stream(impl Stream&lt;Item = Result&lt;Bytes, io::Error&gt;&gt;) | Best fit |

There is no API constraint blocking streaming.

Tar size-in-header constraint requires temp file

The POSIX tar format encodes the entry byte size in the 512-byte header, written before the data. The tar crate's `append_writer` escape hatch requires a seekable destination. This means:

  • A fully in-memory zero-copy pipeline is not feasible without knowing the export size in advance
  • Writing the raw export to a temp file first, then constructing the outer tar header from the known file size, then streaming the header + file chunks is the correct approach

Proposed streaming pipeline

export_images() stream
      │ (chunked Bytes)
      ▼
write to NamedTempFile   ← sync via spawn_blocking or async write
      │ (flush + seek to start)
      ▼
stat file → build 512-byte tar header in memory (set_size = file.len())
      │
      ▼
Stream: [header_bytes] ++ [file content chunks] ++ [tar padding]
      │ (impl Stream<Item=Result<Bytes, io::Error>>)
      ▼
bollard::body_try_stream(...)
      ▼
upload_to_container()    ← single pass, O(chunk) RAM

Peak memory drops from ~11 GB to O(stream chunk size) + one temp file on disk (~3.7 GB disk, ~few MB RAM).

Dependencies

| Primitive | Status |
| --- | --- |
| tempfile::NamedTempFile | Promote from dev-dep to prod dep in Cargo.toml |
| tokio::fs::File + AsyncReadExt | Available (tokio already a dep) |
| futures::stream::try_unfold | Available (futures already a dep) |
| tokio_util::io::ReaderStream | Optional (already in lockfile as a transitive dep; cleaner ergonomics) |

Design Questions

  1. Single vs. two temp files: Writing raw export to file A, then outer tar to file B keeps logic simple. Writing export to file A, constructing the 512-byte header in memory, then streaming header_bytes ++ file_chunks ++ padding avoids file B entirely. The latter is preferable — no second disk write, avoids a second ~3.7 GB allocation.

  2. Should OPENSHELL_SCRATCH_DIR be supported? Some CI environments mount /tmp as tmpfs (in-RAM), defeating the purpose. An env-var override is a nice-to-have but not blocking for v1 — users can redirect $TMPDIR.

  3. Disk space pre-check: A proactive check (`docker image inspect` VirtualSize vs. available disk) before starting the export would give a clear error instead of a cryptic OS write failure. Nice-to-have.

  4. build.rs build context: create_build_context_tar materializes the Dockerfile build context into a Vec<u8>. Same pattern, much smaller scale (MBs not GBs). Out of scope for this fix but worth tracking as a follow-up.


Reproduction Steps

  1. Build a large sandbox image (≥ 3 GB uncompressed), e.g.:
    openshell sandbox create --from sandboxes/gemini
    
  2. Observe:
    [progress] Exported 3745 MiB
    Killed
    
  3. Exit code: 137 (SIGKILL from OOM killer)

Environment

  • Image size: ~3.7 GB (base sandbox + @google/gemini-cli@0.34.0)
  • Host RAM: 23 GB total, ~13 GB available at time of failure
  • Swap: 8 GB total, ~12 MB free at time of failure
  • OOM killer log: Out of memory: Killed process (openshell) total-vm:19913320kB, anon-rss:13626992kB

Scope

| Dimension | Assessment |
| --- | --- |
| Type | perf / bug — correct behavior, resource-fatal on large inputs |
| Complexity | Medium — clear fix, async streaming around sync tar with size-in-header constraint |
| Confidence | High — no API blockers, temp-file path is proven |
| Primary files | push.rs (logic), Cargo.toml (promote tempfile to prod dep) |
| Tests needed | Unit tests for streaming helpers (tar header construction, stream reassembly); no Docker daemon required |

Agent-First Checklist

  • I pointed my agent at the repo and had it investigate this issue
  • I loaded relevant skills (debug-openshell-cluster, openshell-cli, create-spike)
  • My agent could not resolve this — the diagnostic above explains why (code change required)
