Design intent
Memory and CPU should not scale with the number of sketches. After the framework + dependent libraries are built once, every additional sketch only pays for:
Compiling its own sketch.cpp (and any sketch-local sources) to .o.
Linking sketch.o against the pre-built fastled.a (+ framework archive) to produce firmware.elf → firmware.hex/.bin.
Both steps are tiny relative to a full build. So the right compile strategy is two-stage, and the parallelism story has to live inside fbuild — not in a downstream wrapper.
Where the parallelism belongs
Today the benchmark invokes FastLED's ./compile wrapper, which loops over examples and calls fbuild deploy once per sketch. Any parallelism decision FastLED's wrapper makes (or doesn't make — its ci/util/cpu_count.py short-circuits to 1 on GITHUB_ACTIONS) sits one level above fbuild and isn't fbuild's concern. FastLED is downstream.
What's missing is the right primitive on the fbuild side: a single command that takes a board and a list of sketches, builds the shared framework/library archives once, then compiles + links the sketches in parallel using fbuild's own scheduler. With that primitive in place, the benchmark workflow calls fbuild directly — FastLED's wrapper isn't in the loop, and whatever knob FastLED's wrapper exposes downstream is independent.
The primitive
New fbuild subcommand (working name): fbuild compile-many.
Two independent knobs because the two stages have very different resource profiles:
--framework-jobs governs how wide stage 1 fans out. The framework build is memory-heavy (many TUs, large per-TU includes after LDF, intermediate archive writes). On a 2-core ubuntu-latest this is the lever you keep modest (1 or 2). On a beefier runner you crank it up.
--sketch-jobs governs how wide stage 2 fans out. A stage-2 worker is one cc1plus over a sketch.cpp plus one ld linking against pre-built archives via read-only mmap. Per-worker memory is tiny. On ubuntu-latest 2 is the safe default; on bigger runners 4–8 is fine. This number can be much higher than --framework-jobs without risking OOM.
A reasonable default policy when neither flag is set: --framework-jobs = min(cores, 2), --sketch-jobs = cores. Conservative on framework, generous on sketches.
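The default policy above can be written down in a few lines. This is a minimal sketch in Python for illustration only; `default_jobs` is a hypothetical helper name, not an existing fbuild function.

```python
import os


def default_jobs(cores: int) -> tuple[int, int]:
    """Return (framework_jobs, sketch_jobs) when neither flag is set."""
    cores = max(1, cores)           # guard against a zero or negative count
    framework_jobs = min(cores, 2)  # stage 1 is memory-heavy: keep it narrow
    sketch_jobs = cores             # stage 2 is light: use every core
    return framework_jobs, sketch_jobs


if __name__ == "__main__":
    # os.cpu_count() may return None, so fall back to 1.
    print(default_jobs(os.cpu_count() or 1))
```

On a 2-core runner this yields 2 for both flags; an explicit `--framework-jobs 1` remains the more conservative choice where memory is tight.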
Why not just xargs -P 2 fbuild deploy?
Each fbuild deploy invocation today pays the full per-invocation cost: load pyproject.toml, run LDF (cached via #205, but still subject to process-startup overhead), enumerate framework libraries, and materialize the toolchain into memory. With two concurrent invocations that overhead doubles, and the invocations may race to populate the same on-disk artifacts (framework .o files, library archives, LDF cache entries). zccache makes the duplicated work cheap on a hit, but it doesn't make the memory cost free: you still load the toolchain twice, run LDF twice, and so on.
compile-many sidesteps that: one process, one toolchain load, one LDF run, one archive build — then fan out only the small per-sketch work.
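The "build once, then fan out" shape can be sketched as follows. This is an illustrative Python outline under stated assumptions, not fbuild's implementation: `build_archives`, `build_sketch`, `SketchResult`, and the `build/<sketch>/build.log` layout are all hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class SketchResult:
    ok: bool
    log_path: str


def build_archives(board, framework_jobs):
    """Hypothetical stage 1: build shared archives once, return a cache key."""
    return f"{board}-archives"  # placeholder for the real build


def build_sketch(sketch, cache_key):
    """Hypothetical stage 2 worker: compile one sketch, link against stage-1 archives."""
    return SketchResult(ok=True, log_path=f"build/{sketch}/build.log")


def compile_many(board, sketches, framework_jobs, sketch_jobs,
                 stage1=build_archives, stage2=build_sketch):
    sketches = list(sketches)
    cache_key = stage1(board, framework_jobs)  # runs exactly once

    def run_one(sketch):
        # A failing sketch must not take down the others; record it instead.
        try:
            return stage2(sketch, cache_key)
        except Exception:
            return SketchResult(ok=False, log_path=f"build/{sketch}/build.log")

    # Only the small per-sketch work is fanned out, sketch_jobs wide.
    with ThreadPoolExecutor(max_workers=sketch_jobs) as pool:
        return dict(zip(sketches, pool.map(run_one, sketches)))
```

Each sketch gets its own output path in this sketch, so two workers never write the same file, and the returned map carries one success/failure entry per sketch.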
Plan
fbuild: compile-many subcommand (crates/fbuild-cli):
Accepts board + sketch list + --framework-jobs + --sketch-jobs.
Routes through the existing orchestrator infrastructure (fbuild-build) so platform code stays in one place.
Returns a per-sketch result map (success / failure + log path), suitable for the bench summary.
fbuild: two-stage execution:
Stage 1 runs once with --framework-jobs workers, produces the framework + library archives, records their cache key, and stops.
Stage 2 receives the cache key and the sketch list and spawns --sketch-jobs workers; each worker compiles its sketch's TUs and links against the stage-1 archives. Workers share read-only access to the archives via mmap; the daemon does not redo LDF or re-archive.
Concurrent-safety review:
Confirm the zccache read path is lock-free under concurrent access from stage-2 workers.
Confirm the per-sketch output directory is unique per sketch so two workers can't race on the same firmware.elf path.
Benchmark workflow update (.github/workflows/benchmark.yml):
Drop the ./compile --no-interactive --no-parallel ... line.
Call .venv/bin/fbuild compile-many --board $BOARD --framework-jobs 1 --sketch-jobs 2 $EXAMPLES directly.
Add stage1_archives and stage2_sketches rows to the per-phase TSV so the next iteration has a clean baseline.
Downstream FastLED switch (out of scope here): a separate issue against FastLED/FastLED asks them to consume fbuild compile-many from their ./compile wrapper. That work is independent of this fbuild PR and can land any time after the primitive ships.
Expected savings
iter7 warm baseline: compile phase = 136 s. First-example toolchain-init residual ≈ 10.97 s. Remaining ≈ 125 s for ~80 sketches × ~1.5 s each, all serial.
[Table: compile-phase comparison of three configurations: the current path (./compile, serial), compile-many stage-1 + 2-wide stage-2, and compile-many stage-1 + 4-wide stage-2 (bigger runner). Values not recoverable.]
Acceptance
fbuild compile-many subcommand exists with --framework-jobs + --sketch-jobs flags.
[...] compile phase ≤ 80 s on ubuntu-latest with --framework-jobs 1 --sketch-jobs 2.
[...] compile phase ≤ 110 s under the same flags.
[...] --sketch-jobs + small overhead.
Per-phase TSV reports stage1_archives and stage2_sketches separately.
FastLED's ./compile wrapper is no longer on the bench path.
Out of scope
Caching fastled/.venv: tracked separately as #239 (bench iter8b: cache fastled/.venv across runs (#112), uv_sync 22s → <5s on warm).
Pushing per-sketch step below 1.5 s. Stage 2's per-sketch cost is the lower bound this lever can move toward; going further belongs to a daemon-side compile/link reuse ticket.
Switching runners. The lever is "two stages on the same hardware", not "different infrastructure".
Changing FastLED's ./compile wrapper. Downstream concern; filed against FastLED/FastLED.
Related
Sub-issue of #112 (iter8 candidate 1).
[...] fastled/.venv caching, stacks with this.
[...] compile-many.
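The warm-baseline figures quoted under Expected savings (136 s compile phase, ≈ 10.97 s toolchain-init residual) allow a rough best-case estimate of what a wider stage 2 could deliver. The sketch below is an assumption-laden back-of-envelope model, not a measurement: it treats the residual as serial, assumes the remaining per-sketch time divides perfectly across workers, and ignores any stage-1 cost. `estimate_compile_phase` is a hypothetical helper.

```python
def estimate_compile_phase(total_s: float, init_s: float, sketch_jobs: int) -> float:
    """Best-case compile-phase time if per-sketch work scales linearly."""
    per_sketch_total = total_s - init_s          # time spent in the serial sketch loop
    return init_s + per_sketch_total / sketch_jobs


if __name__ == "__main__":
    for jobs in (1, 2, 4):
        print(jobs, round(estimate_compile_phase(136.0, 10.97, jobs), 1))
    # prints:
    # 1 136.0
    # 2 73.5
    # 4 42.2
```

Real runs should land above these numbers, since scheduling overhead, link contention, and the stage-1 build are all left out of the model.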