Summary
The LDF implementation shipped under #205 / #214 is correct and the warm-cache path is healthy, but the cold scan leaves real performance on the table:
The transitive include walker (fbuild-header-scan::walker::walk) is single-threaded BFS — no rayon, no concurrent I/O.
The fixed-point resolver (fbuild-library-select::resolve) calls walk() twice; Pass 2 re-reads and re-scans every file Pass 1 already touched, because visited is local to each walk() invocation.
No tracing spans wrap the scan/walk, so regressions are invisible without external profiling.
Search-path lookup is O(search_paths) per #include per library.
None of this breaks the #205 acceptance criteria today (cold scan still hits the ≤ 200 ms gate for teensy41), but on bigger FastLED examples / future ESP32 wiring the cold path is the one that bites users.
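Observations
1. Walker is fully sequential
crates/fbuild-header-scan/src/walker.rs:29-67
The walker pops one path at a time with VecDeque::pop_front and does a blocking read_to_string per file. For a Teensy 4.1 sketch the BFS frontier easily reaches dozens of headers per level. fbuild-header-scan/Cargo.toml has no rayon / crossbeam / async runtime dependency, so the scanner's per-thread throughput (≥ 50 MB/s/thread per benches/scan_throughput.rs) is never multiplied across cores.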
2. Pass 2 throws away Pass 1's work
crates/fbuild-library-select/src/lib.rs:85-129
Pass 1 (let res = walk(seeds, ...)) and Pass 2 (walk(&recon_seeds, ...) inside loop) each construct their own visited set. Pass 2's recon_seeds are seeds ∪ selected_libs' source_files, so every header reached in Pass 1 is re-read + re-scanned in Pass 2. On a converged resolution this is roughly 2× the I/O and 2× the scanner work for the same answer.
3. No tracing instrumentation
No tracing::info_span!("ldf_walk") or #[tracing::instrument] in walker.rs or lib.rs. Adding spans around walk() and the Pass 2 loop would expose per-iteration timing in the daemon's existing log stream without external profilers.
4. Linear search-path scan per include
crates/fbuild-header-scan/src/walker.rs:69-85
for sp in search_paths {
    let candidate = sp.join(&inc.path);
    if candidate.is_file() {
        return Some(candidate);
    }
}
Each #include does up to search_paths.len() filesystem stat calls. With ~30 Teensy libraries each contributing 1–2 include dirs, that's 30–60 stats × (every #include in every file). The hot misses (e.g. unresolved system headers) are the worst because they walk the full list.
Suggested scope (no design lock-in — pick what fits)
A. Rayon-parallelize the walker frontier (biggest win)
Process the BFS frontier in waves: each iteration pulls everything currently in the queue, fans out across rayon to read+scan in parallel, then merges the new edges back into a shared visited set guarded by a single Mutex (only locked at merge boundaries, not during scan).
Keeps the BFS-order semantics; only the within-wave work runs concurrently. Scanner stays pure (no Send/Sync plumbing needed beyond what it already has).
Target: ≥ 4× cold-scan speedup on an 8-core CI runner for the Teensy 4.1 corpus (~500 sources).
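The wave scheme above can be sketched as follows. This is a minimal illustration, not the real walker: `walk_waves`, `scan_edges`, and `workers` are hypothetical names, `scan_edges` stands in for "read the file, scan it, resolve its includes", and `std::thread::scope` is swapped in for rayon so the sketch has no dependencies. Because the sketch merges on the calling thread after the workers join, a plain set is enough here; merging from inside the workers would need the Mutex described above.

```rust
use std::collections::BTreeSet;
use std::path::{Path, PathBuf};
use std::thread;

/// Hypothetical wave-parallel BFS. `scan_edges` returns the resolved
/// includes of one file; it is an assumption, not the real walker API.
pub fn walk_waves<F>(seeds: &[PathBuf], scan_edges: F, workers: usize) -> BTreeSet<PathBuf>
where
    F: Fn(&Path) -> Vec<PathBuf> + Sync,
{
    let mut visited: BTreeSet<PathBuf> = seeds.iter().cloned().collect();
    let mut frontier: Vec<PathBuf> = visited.iter().cloned().collect();

    while !frontier.is_empty() {
        // Split the current wave into at most `workers` chunks.
        let chunk_len = frontier.len().div_ceil(workers.max(1));
        // Fan out: each worker scans its chunk without touching `visited`.
        let found: Vec<Vec<PathBuf>> = thread::scope(|s| {
            let handles: Vec<_> = frontier
                .chunks(chunk_len)
                .map(|chunk| {
                    let scan = &scan_edges;
                    s.spawn(move || {
                        chunk
                            .iter()
                            .flat_map(|p| scan(p.as_path()))
                            .collect::<Vec<PathBuf>>()
                    })
                })
                .collect();
            handles
                .into_iter()
                .map(|h| h.join().expect("scan worker panicked"))
                .collect()
        });
        // Merge once per wave: paths not seen before form the next frontier.
        frontier = found
            .into_iter()
            .flatten()
            .filter(|p| visited.insert(p.clone()))
            .collect();
    }
    visited
}
```

Levels are still processed one after another, so the BFS-order property is kept; only the order of scans inside a wave is unspecified.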
B. Memoize scan results across Pass 1 and Pass 2
Lift visited and a HashMap<PathBuf, Vec<IncludeRef>> cache out of walk() into resolve(). Pass &mut WalkState (visited + scan_cache) into a new walk_with_state() so Pass 2's recon walk reuses everything Pass 1 already learned.
No behavior change — the merged include set is the union either way.
Expected: ~halves cold-scan time on multi-pass resolutions (which is the common case once a library is selected).
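A minimal sketch of the shared-state idea, under stated assumptions: the `WalkState` fields follow the names in the text, but the `scan` closure is hypothetical (it stands in for read + scan + resolve), and the cache stores resolved paths as `Vec<PathBuf>` rather than `Vec<IncludeRef>` to keep the example self-contained.

```rust
use std::collections::{BTreeSet, HashMap, VecDeque};
use std::path::{Path, PathBuf};

/// Hypothetical state that outlives a single walk and is owned by resolve().
#[derive(Default)]
pub struct WalkState {
    pub visited: BTreeSet<PathBuf>,
    /// Resolved includes per scanned file (an assumption; see lead-in).
    pub scan_cache: HashMap<PathBuf, Vec<PathBuf>>,
}

/// BFS from `seeds` that scans each file at most once per `WalkState`.
pub fn walk_with_state<F>(seeds: &[PathBuf], state: &mut WalkState, mut scan: F)
where
    F: FnMut(&Path) -> Vec<PathBuf>,
{
    let mut queue: VecDeque<PathBuf> = seeds.iter().cloned().collect();
    while let Some(path) = queue.pop_front() {
        state.visited.insert(path.clone());
        if state.scan_cache.contains_key(&path) {
            continue; // scanned in an earlier pass, or earlier in this one
        }
        let includes = scan(&path);
        for inc in &includes {
            // Only enqueue paths that no pass has reached yet.
            if state.visited.insert(inc.clone()) {
                queue.push_back(inc.clone());
            }
        }
        state.scan_cache.insert(path, includes);
    }
}
```

With one `WalkState` passed to both passes, Pass 2 only scans the files added by the newly selected libraries; `state.visited` ends up as the same union the two independent walks would have produced.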
C. Add tracing spans
#[tracing::instrument(skip_all, fields(seeds = seeds.len(), search_paths = search_paths.len()))] on walk() and resolve().
A tracing::info_span!("ldf_pass", pass = i) around each Pass 2 iteration.
Visible in the existing daemon log without new wiring.
D. Precompute a header-name index per library (lower priority)
Build a HashMap<&str /* basename */, Vec<&FrameworkLibrary>> once per resolve; resolve(inc, …) consults it before the linear search-path scan.
Mostly speeds up unresolved-include misses (the dominant case on framework header walks). Skippable if A+B land.
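One way the basename index could look, as a sketch only: the `FrameworkLibrary` struct below is a hypothetical stand-in with made-up fields (`name`, `headers`), and `build_header_index` and `candidates` are illustrative names rather than existing functions in the crate.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for the real `FrameworkLibrary` type.
pub struct FrameworkLibrary {
    pub name: String,
    pub headers: Vec<PathBuf>,
}

/// Map each header basename to the libraries that ship it.
/// Intended to be built once per resolve.
pub fn build_header_index(libs: &[FrameworkLibrary]) -> HashMap<&str, Vec<&FrameworkLibrary>> {
    let mut index: HashMap<&str, Vec<&FrameworkLibrary>> = HashMap::new();
    for lib in libs {
        for header in &lib.headers {
            // Non-UTF-8 file names are skipped rather than guessed at.
            if let Some(base) = header.file_name().and_then(|n| n.to_str()) {
                index.entry(base).or_default().push(lib);
            }
        }
    }
    index
}

/// Libraries whose headers share the basename of `inc_path`.
/// An empty slice is a miss decided by one hash lookup, with no stat calls.
pub fn candidates<'a, 'b>(
    index: &'b HashMap<&'a str, Vec<&'a FrameworkLibrary>>,
    inc_path: &str,
) -> &'b [&'a FrameworkLibrary] {
    let base = Path::new(inc_path).file_name().and_then(|n| n.to_str());
    match base.and_then(|b| index.get(b)) {
        Some(libs) => libs.as_slice(),
        None => &[],
    }
}
```

The lookup matches on basename only, so a hit narrows the candidates but does not replace the path check; the caller would still join and stat against the candidate libraries' include dirs.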
E. CI perf gate using existing benches
benches/resolve_cold.rs and benches/scan_throughput.rs exist but aren't gated. Wire criterion-baseline into the workflow so regressions on cold scan / scan throughput fail PRs.
Acceptance criteria
resolve() on teensy41 + Blink ≤ 100 ms on CI runner (currently ≤ 200 ms gate per #205, "feat(library-selection): Rust-native LDF-style transitive header scanner, zccache-backed", AC#6).
resolve() on teensy41 Audio-shield example (larger transitive set) shows ≥ 2× speedup over current master.
[…] Read shim, or check via tracing-test).
resolve_cached() is unchanged (still a single bincode deserialize from zccache_artifact::KvStore).
tracing::Subscriber capture shows ldf_walk and ldf_pass spans with durations.
[…] resolve_cold and scan_throughput.
Out of scope
[…] memmap2). Files are small (< 64 KB headers); the win is in parallelism + memoization, not the read syscall path.
[…] BTreeSet / HashSet with DashSet. The wave-merge boundary is fine with a Mutex and avoids DashSet's lookup overhead.
Related
[…] resolve_cached (teensy, stm32 orchestrators)
crates/fbuild-header-scan/benches/scan_throughput.rs, crates/fbuild-library-select/benches/resolve_{cold,warm}.rs: existing perf harness
Files touched (likely)
crates/fbuild-header-scan/src/walker.rs: add walk_with_state, rayon wave-fan-out
crates/fbuild-header-scan/Cargo.toml: add rayon (workspace dep)
crates/fbuild-library-select/src/lib.rs:57-157: thread WalkState through resolve, add tracing spans
.github/workflows/ci.yml (or equivalent): gate resolve_cold / scan_throughput benches