feat(cli): two-stage compile-many primitive — framework once, sketches in parallel (#112) #238

@zackees

Description

Sub-issue of #112 (iter8 candidate 1).

Design intent

Memory and CPU should not scale with the number of sketches. After the framework + dependent libraries are built once, every additional sketch only pays for:

  1. Compiling its own sketch.cpp (and any sketch-local sources) to .o.
  2. Linking sketch.o against the pre-built fastled.a (+ framework archive) to produce firmware.elf → firmware.hex/.bin.

Both steps are tiny relative to a full build. So the right compile strategy is two-stage, and the parallelism story has to live inside fbuild — not in a downstream wrapper.

Where the parallelism belongs

Today the benchmark invokes FastLED's ./compile wrapper, which loops over examples and calls fbuild deploy once per sketch. Any parallelism decision FastLED's wrapper makes (or doesn't make — its ci/util/cpu_count.py short-circuits to 1 on GITHUB_ACTIONS) sits one level above fbuild and isn't fbuild's concern. FastLED is downstream.

What's missing is the right primitive on the fbuild side: a single command that takes a board and a list of sketches, builds the shared framework/library archives once, then compiles + links the sketches in parallel using fbuild's own scheduler. With that primitive in place, the benchmark workflow calls fbuild directly — FastLED's wrapper isn't in the loop, and whatever knob FastLED's wrapper exposes downstream is independent.

The primitive

New fbuild subcommand (working name): fbuild compile-many.

# --framework-jobs: parallelism for stage 1 (framework + libs)
# --sketch-jobs:    parallelism for stage 2 (per-sketch compile + link)
fbuild compile-many \
    --board <board> \
    --framework-jobs <N> \
    --sketch-jobs <M> \
    <sketch1> [<sketch2> ...]

Two independent knobs because the two stages have very different resource profiles:

  • --framework-jobs governs how wide stage 1 fans out. The framework build is memory-heavy (many TUs, large per-TU includes after LDF, intermediate archive writes). On a 2-core ubuntu-latest this is the lever you keep modest (1 or 2). On a beefier runner you crank it up.
  • --sketch-jobs governs how wide stage 2 fans out. A stage-2 worker is one cc1plus over a sketch.cpp plus one ld linking against pre-built archives via read-only mmap. Per-worker memory is tiny. On ubuntu-latest 2 is the safe default; on bigger runners 4–8 is fine. This number can be much higher than --framework-jobs without risking OOM.

A reasonable default policy when neither flag is set: --framework-jobs = min(cores, 2), --sketch-jobs = cores. Conservative on framework, generous on sketches.
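
The default policy above can be sketched in a few lines. This is an illustration only: `resolve_jobs` and its parameters are hypothetical names, not part of fbuild, and the sketch assumes "cores" means whatever `os.cpu_count()` reports.

```python
import os
from typing import Optional, Tuple


def resolve_jobs(framework_jobs: Optional[int] = None,
                 sketch_jobs: Optional[int] = None,
                 cores: Optional[int] = None) -> Tuple[int, int]:
    """Return (framework_jobs, sketch_jobs), filling unset flags with defaults."""
    if cores is None:
        cores = os.cpu_count() or 1  # os.cpu_count() can return None
    cores = max(1, cores)
    if framework_jobs is None:
        framework_jobs = min(cores, 2)  # conservative: stage 1 is memory-heavy
    if sketch_jobs is None:
        sketch_jobs = cores             # generous: stage-2 workers are small
    return framework_jobs, sketch_jobs
```

An explicitly passed value always wins, so `resolve_jobs(1, 2)` reproduces the `--framework-jobs 1 --sketch-jobs 2` setting used in the benchmark plan regardless of core count.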

Why not just xargs -P 2 fbuild deploy?

Each fbuild deploy invocation today pays the full per-invocation cost: load pyproject.toml, run LDF (cached via #205 but still process-startup overhead), enumerate framework libraries, materialize the toolchain into memory. With two concurrent invocations, that overhead doubles and each invocation may race to populate the same on-disk artifacts (framework .o files, library archives, LDF cache entries). zccache makes the duplicated work cheap on hit, but it doesn't make the memory cost free — you load the toolchain twice, do LDF twice, etc.

compile-many sidesteps that: one process, one toolchain load, one LDF run, one archive build — then fan out only the small per-sketch work.
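
The shape of that control flow, as a minimal sketch: the shared build runs exactly once, and only the per-sketch step is fanned out. `build_framework` and `build_sketch` are hypothetical callables standing in for fbuild's real stages; a thread pool is used here purely for illustration and says nothing about how fbuild's own scheduler works.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def compile_many(sketches: List[str],
                 build_framework: Callable[[], str],
                 build_sketch: Callable[[str, str], bool],
                 sketch_jobs: int) -> Dict[str, bool]:
    """Build shared archives once, then compile + link sketches in parallel."""
    # Stage 1: runs once and returns a key identifying the built archives.
    cache_key = build_framework()

    # Stage 2: fan out only the small per-sketch work.
    with ThreadPoolExecutor(max_workers=sketch_jobs) as pool:
        outcomes = list(pool.map(lambda s: build_sketch(s, cache_key), sketches))

    # Per-sketch result map: sketch -> success flag.
    return dict(zip(sketches, outcomes))
```

One failing sketch does not stop the others in this sketch; each entry in the returned map reports its own outcome.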

Plan

  1. fbuild — compile-many subcommand (crates/fbuild-cli):

    • Accepts board + sketch list + --framework-jobs + --sketch-jobs.
    • Routes through the existing orchestrator infrastructure (fbuild-build) so platform code stays in one place.
    • Returns a per-sketch result map (success / failure + log path), suitable for the bench summary.
  2. fbuild — two-stage execution:

    • Stage 1 runs once with --framework-jobs workers, produces the framework + library archives + records their cache key, and stops.
    • Stage 2 receives the cache key and the sketch list; spawns --sketch-jobs workers; each worker compiles its sketch's TUs and links against the stage-1 archives. Workers share read-only access to the archives via mmap; the daemon does not re-do LDF or re-archive.
  3. Concurrent-safety review:

    • Confirm the zccache read path is lock-free under concurrent access from stage-2 workers.
    • Confirm the per-sketch output directory is unique per sketch so two workers can't race on the same firmware.elf path.
  4. Benchmark workflow update (.github/workflows/benchmark.yml):

    • Drop the ./compile --no-interactive --no-parallel ... line.
    • Call .venv/bin/fbuild compile-many --board $BOARD --framework-jobs 1 --sketch-jobs 2 $EXAMPLES directly.
    • Add stage1_archives and stage2_sketches rows to the per-phase TSV so the next iteration has a clean baseline.
  5. Downstream FastLED switch (out of scope here): a separate issue against FastLED/FastLED asks them to consume fbuild compile-many from their ./compile wrapper. That work is independent of this fbuild PR — it can land any time after the primitive ships.
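
For the output-directory check in the concurrent-safety item, one possible approach is to derive each directory from the sketch's full path rather than its basename, and to refuse to start stage 2 if two entries collide. The naming scheme below (basename plus a short SHA-256 prefix) is an assumption for illustration, not fbuild's actual layout, and `output_dir` / `assign_output_dirs` are hypothetical helpers.

```python
import hashlib
from pathlib import PurePosixPath
from typing import Dict, List


def output_dir(build_root: str, sketch: str) -> str:
    """Directory name = sketch basename + short hash of its normalized path."""
    path = PurePosixPath(sketch)  # normalizes a trailing slash away
    digest = hashlib.sha256(str(path).encode("utf-8")).hexdigest()[:8]
    return str(PurePosixPath(build_root) / f"{path.name}-{digest}")


def assign_output_dirs(build_root: str, sketches: List[str]) -> Dict[str, str]:
    """Map each sketch to its output directory; fail fast on any collision."""
    dirs = {s: output_dir(build_root, s) for s in sketches}
    # Compare against len(sketches) so a sketch listed twice is also caught.
    if len(set(dirs.values())) != len(sketches):
        raise ValueError("two sketches map to the same output directory")
    return dirs
```

With this scheme `examples/Blink` and `other/Blink` get different directories even though they share a basename, while listing the same sketch twice is rejected before any worker can race on firmware.elf.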

Expected savings

iter7 warm baseline: compile phase = 136 s. First-example toolchain-init residual ≈ 10.97 s. Remaining ≈ 125 s for ~80 sketches × ~1.5 s each, all serial.

| scenario | compile phase | warm total |
|---|---|---|
| iter7 (FastLED ./compile, serial) | 136 s | 198 s |
| compile-many stage-1 + 2-wide stage-2 | ~10 s init + ~63 s fan-out = ~73 s | ~135 s |
| compile-many stage-1 + 4-wide stage-2 (bigger runner) | ~10 s + ~32 s = ~42 s | ~104 s |
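
The arithmetic behind these estimates, as a small sketch. It assumes ideal scaling: the toolchain-init residual stays serial, the remaining per-sketch time divides evenly across stage-2 workers, and everything outside the compile phase is unchanged. The function names are illustrative.

```python
def estimate_compile_phase(serial_compile_s: float,
                           init_s: float,
                           sketch_jobs: int) -> float:
    """Init stays serial; the remaining per-sketch work splits across workers."""
    per_sketch_total = serial_compile_s - init_s
    return init_s + per_sketch_total / sketch_jobs


def estimate_warm_total(serial_total_s: float,
                        serial_compile_s: float,
                        new_compile_s: float) -> float:
    """Swap the serial compile phase for the estimate; other phases unchanged."""
    return serial_total_s - serial_compile_s + new_compile_s


for jobs in (2, 4):
    compile_s = estimate_compile_phase(136.0, 10.97, jobs)
    total_s = estimate_warm_total(198.0, 136.0, compile_s)
    print(jobs, round(compile_s), round(total_s))
# prints "2 73 135" then "4 42 104"
```

Real runs will land somewhat above these figures, since linking contention and uneven sketch sizes keep the fan-out from scaling perfectly.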

Acceptance

  • New fbuild compile-many subcommand exists with --framework-jobs + --sketch-jobs flags.
  • Warm compile phase ≤ 80 s on ubuntu-latest with --framework-jobs 1 --sketch-jobs 2.
  • Cold compile phase ≤ 110 s under the same flags.
  • No example-failure regression vs serial.
  • Memory peak across stage 2 ≤ (memory peak of one serial example) × --sketch-jobs + small overhead.
  • Per-phase TSV records stage1_archives and stage2_sketches separately.
  • Benchmark workflow calls fbuild directly; the ./compile wrapper is no longer on the bench path.

Out of scope

  • Caching fastled/.venv — tracked separately as bench iter8b: cache fastled/.venv across runs (#112) — uv_sync 22s → <5s on warm #239.
  • Pushing per-sketch step below 1.5 s. Stage 2's per-sketch cost is the lower bound this lever can move toward; going further belongs to a daemon-side compile/link reuse ticket.
  • Switching runners. The lever is "two stages on the same hardware", not "different infrastructure".
  • Changing FastLED's ./compile wrapper. Downstream concern; filed against FastLED/FastLED.
