feat(cli): two-stage compile-many primitive — framework once, sketches in parallel (#112)

Sub-issue of #112 (iter8 candidate 1).

## Design intent

Memory and CPU should not scale with the number of sketches. After the framework + dependent libraries are built once, every additional sketch only pays for:

1. Compiling its own `sketch.cpp` (and any sketch-local sources) to `.o`.
2. Linking `sketch.o` against the pre-built `fastled.a` (+ framework archive) to produce `firmware.elf` → `firmware.hex`/`.bin`.

Both steps are tiny relative to a full build. So the right compile strategy is **two-stage**, and the parallelism story has to live inside fbuild — not in a downstream wrapper.

## Where the parallelism belongs

Today the benchmark invokes FastLED's `./compile` wrapper, which loops over examples and calls `fbuild deploy` once per sketch. Any parallelism decision FastLED's wrapper makes (or doesn't make — its `ci/util/cpu_count.py` short-circuits to 1 on `GITHUB_ACTIONS`) sits one level above fbuild and isn't fbuild's concern. FastLED is downstream.

What's missing is the right primitive on the **fbuild** side: a single command that takes a board and a list of sketches, builds the shared framework/library archives once, then compiles + links the sketches in parallel using fbuild's own scheduler. With that primitive in place, the benchmark workflow calls fbuild directly — FastLED's wrapper isn't in the loop, and whatever knob FastLED's wrapper exposes downstream is independent.

## The primitive

New fbuild subcommand (working name): `fbuild compile-many`.

```
fbuild compile-many \
    --board <board> \
    --framework-jobs <N> \
    --sketch-jobs <M> \
    <sketch1> [<sketch2> ...]

# --framework-jobs <N>: parallelism for stage 1 (framework + libs)
# --sketch-jobs <M>:    parallelism for stage 2 (per-sketch compile + link)
```

Two independent knobs because the two stages have very different resource profiles:

- **`--framework-jobs`** governs how wide stage 1 fans out. The framework build is memory-heavy (many TUs, large per-TU includes after LDF, intermediate archive writes). On a 2-core `ubuntu-latest` this is the lever you keep modest (1 or 2). On a beefier runner you crank it up.
- **`--sketch-jobs`** governs how wide stage 2 fans out. A stage-2 worker is one `cc1plus` over a sketch.cpp plus one `ld` linking against pre-built archives via read-only mmap. Per-worker memory is tiny. On `ubuntu-latest` 2 is the safe default; on bigger runners 4–8 is fine. This number can be much higher than `--framework-jobs` without risking OOM.

A reasonable default policy when neither flag is set: `--framework-jobs = min(cores, 2)`, `--sketch-jobs = cores`. Conservative on framework, generous on sketches.
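
As a minimal sketch of that default policy (the function name `default_jobs` is hypothetical, not an existing fbuild API), the two values could be derived from the core count like this:

```python
def default_jobs(cores: int) -> tuple[int, int]:
    """Return (framework_jobs, sketch_jobs) when neither flag is set.

    Hypothetical helper illustrating the policy described above.
    """
    if cores < 1:
        raise ValueError("cores must be >= 1")
    framework_jobs = min(cores, 2)  # stage 1 is memory-heavy: cap at 2
    sketch_jobs = cores             # stage 2 is light: one worker per core
    return framework_jobs, sketch_jobs
```

On a 2-core runner this gives `(2, 2)`; on an 8-core runner `(2, 8)`. A caller would typically pass `os.cpu_count() or 1`.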

## Why not just `xargs -P 2 fbuild deploy`?

Each `fbuild deploy` invocation today pays the full per-invocation cost: load `pyproject.toml`, run LDF (cached via #205 but still process-startup overhead), enumerate framework libraries, materialize the toolchain into memory. With two concurrent invocations, that overhead doubles **and** each invocation may race to populate the same on-disk artifacts (framework `.o` files, library archives, LDF cache entries). zccache makes the duplicated work cheap on hit, but it doesn't make the memory cost free — you load the toolchain twice, do LDF twice, etc.

`compile-many` sidesteps that: one process, one toolchain load, one LDF run, one archive build — then fan out only the small per-sketch work.

## Plan

1. **fbuild — `compile-many` subcommand** (`crates/fbuild-cli`):
   - Accepts board + sketch list + `--framework-jobs` + `--sketch-jobs`.
   - Routes through the existing orchestrator infrastructure (`fbuild-build`) so platform code stays in one place.
   - Returns a per-sketch result map (success / failure + log path), suitable for the bench summary.

2. **fbuild — two-stage execution**:
   - Stage 1 runs once with `--framework-jobs` workers, produces the framework + library archives + records their cache key, and stops.
   - Stage 2 receives the cache key and the sketch list; spawns `--sketch-jobs` workers; each worker compiles its sketch's TUs and links against the stage-1 archives. Workers share read-only access to the archives via mmap; the daemon does not re-do LDF or re-archive.

3. **Concurrent-safety review**:
   - Confirm the zccache read path is lock-free under concurrent access from stage-2 workers.
   - Confirm the per-sketch output directory is unique per sketch so two workers can't race on the same `firmware.elf` path.

4. **Benchmark workflow update** (`.github/workflows/benchmark.yml`):
   - Drop the `./compile --no-interactive --no-parallel ...` line.
   - Call `.venv/bin/fbuild compile-many --board $BOARD --framework-jobs 1 --sketch-jobs 2 $EXAMPLES` directly.
   - Add `stage1_archives` and `stage2_sketches` rows to the per-phase TSV so the next iteration has a clean baseline.

5. **Downstream FastLED switch (out of scope here)**: a separate issue against FastLED/FastLED asks them to consume `fbuild compile-many` from their `./compile` wrapper. That work is independent of this fbuild PR — it can land any time after the primitive ships.
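
The control flow in items 1 to 3 can be sketched roughly as follows. This is an illustrative Python model only, not the fbuild implementation: `build_archives`, `build_sketch`, and `compile_many` are hypothetical stand-ins, and the actual compile and link work is stubbed out. It shows stage 1 running exactly once, stage 2 fanning out over `sketch_jobs` workers, one output directory per sketch, and a per-sketch result map.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_archives(board: str, framework_jobs: int) -> str:
    """Stage 1 stand-in: build framework + library archives once, return a cache key."""
    # Hypothetical: a real implementation would fan out over framework TUs here.
    return f"{board}-archives"


def build_sketch(sketch: str, cache_key: str, out_root: Path) -> dict:
    """Stage 2 stand-in: compile one sketch and link against the stage-1 archives."""
    out_dir = out_root / Path(sketch).name  # one output directory per sketch
    # Hypothetical: compile the sketch's TUs, then link against the archives
    # identified by cache_key. Nothing is actually built in this sketch.
    return {"ok": True, "log": str(out_dir / "build.log")}


def compile_many(board, sketches, framework_jobs, sketch_jobs, out_root=Path("build")):
    """Run stage 1 once, then stage 2 in parallel; return {sketch: result}."""
    # Reject sketch lists whose output directories would collide.
    names = [Path(s).name for s in sketches]
    if len(set(names)) != len(names):
        raise ValueError("sketch names must be unique to avoid output-path races")
    cache_key = build_archives(board, framework_jobs)  # stage 1: exactly once
    with ThreadPoolExecutor(max_workers=sketch_jobs) as pool:  # stage 2: fan out
        results = list(pool.map(lambda s: build_sketch(s, cache_key, out_root), sketches))
    return dict(zip(sketches, results))
```

Deriving the output directory from the sketch name alone is a simplifying assumption; any scheme works as long as two workers can never resolve to the same `firmware.elf` path.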

## Expected savings

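iter7 warm baseline: compile phase = 136 s, of which the first-example toolchain-init residual is ≈ 10.97 s. The remaining ≈ 125 s covers ~80 sketches at ~1.5 s each, all serial. The estimates below assume stage 2 scales linearly with `--sketch-jobs` and that the non-compile part of the warm total (198 s − 136 s = 62 s) is unchanged.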

| scenario | compile phase | warm total |
|---|---:|---:|
| iter7 (FastLED `./compile`, serial) | 136 s | 198 s |
| `compile-many` stage-1 + 2-wide stage-2 | ~10 s init + ~63 s fan-out = **~73 s** | **~135 s** |
| `compile-many` stage-1 + 4-wide stage-2 (bigger runner) | ~10 s + ~32 s = **~42 s** | **~104 s** |
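
The arithmetic behind the table can be reproduced with a small model. It is a back-of-the-envelope sketch that assumes perfect linear scaling of the per-sketch work and an unchanged 62 s of non-compile time (198 s − 136 s); the function name `estimate` is illustrative.

```python
def estimate(init_s: float, serial_sketch_s: float, sketch_jobs: int,
             other_s: float) -> tuple[float, float]:
    """Return (compile_phase_s, warm_total_s) for a given stage-2 width."""
    fan_out_s = serial_sketch_s / sketch_jobs  # assumes perfect scaling
    compile_phase_s = init_s + fan_out_s
    return compile_phase_s, compile_phase_s + other_s


INIT_S = 10.0            # toolchain-init residual, rounded from 10.97 s
SERIAL_SKETCH_S = 125.0  # serial per-sketch work in the iter7 warm baseline
OTHER_S = 198.0 - 136.0  # non-compile part of the warm total

two_wide = estimate(INIT_S, SERIAL_SKETCH_S, 2, OTHER_S)
four_wide = estimate(INIT_S, SERIAL_SKETCH_S, 4, OTHER_S)
```

The model gives 72.5 s / 134.5 s for 2-wide and 41.25 s / 103.25 s for 4-wide, which agree with the table rows after rounding. Real runs will likely land somewhat above these figures, since scheduling overhead and uneven sketch sizes are ignored.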

## Acceptance

- New `fbuild compile-many` subcommand exists with `--framework-jobs` + `--sketch-jobs` flags.
- Warm `compile` phase ≤ 80 s on `ubuntu-latest` with `--framework-jobs 1 --sketch-jobs 2`.
- Cold `compile` phase ≤ 110 s under the same flags.
- No example-failure regression vs serial.
- Memory peak across stage 2 ≤ (memory peak of one serial example) × `--sketch-jobs` + small overhead.
- Per-phase TSV records `stage1_archives` and `stage2_sketches` separately.
- Benchmark workflow calls fbuild directly; the `./compile` wrapper is no longer on the bench path.
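
The numeric criteria above could be checked mechanically from benchmark measurements. The following is a sketch under assumptions: the metric key names and the 64 MB default for "small overhead" are placeholders chosen for illustration, not values from the benchmark.

```python
def check_acceptance(metrics: dict, sketch_jobs: int, overhead_mb: float = 64.0) -> list:
    """Return a list of failed criteria; an empty list means all checks pass.

    `metrics` uses hypothetical keys: warm_compile_s, cold_compile_s,
    serial_peak_mb, stage2_peak_mb.
    """
    failures = []
    if metrics["warm_compile_s"] > 80:
        failures.append("warm compile phase > 80 s")
    if metrics["cold_compile_s"] > 110:
        failures.append("cold compile phase > 110 s")
    # Stage-2 peak may grow at most linearly with the number of workers.
    budget_mb = metrics["serial_peak_mb"] * sketch_jobs + overhead_mb
    if metrics["stage2_peak_mb"] > budget_mb:
        failures.append("stage-2 memory peak exceeds budget")
    return failures
```

Returning the failed criteria rather than a single boolean keeps the bench summary informative when more than one threshold is missed.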

## Out of scope

- Caching `fastled/.venv` — tracked separately as #239.
- Pushing the per-sketch step below ~1.5 s. Stage 2's per-sketch cost is the lower bound this lever can move toward; going further belongs to a daemon-side compile/link reuse ticket.
- Switching runners. The lever is "two stages on the same hardware", not "different infrastructure".
- Changing FastLED's `./compile` wrapper. Downstream concern; filed against FastLED/FastLED.

## Related

- #112 — meta tracker.
- #239 — `fastled/.venv` caching, stacks with this.
- FastLED/FastLED issue (will be linked here when filed) — downstream switch to consume `compile-many`.
