Skip to content

perf(library-select): #205 follow-up — parallelize BFS walker and memoize Pass 1 scans across Pass 2 #236

@zackees

Description

@zackees

Summary

The LDF implementation shipped under #205 / #214 is correct and the warm-cache path is healthy, but the cold scan leaves real performance on the table:

  1. The transitive include walker (fbuild-header-scan::walker::walk) is single-threaded BFS — no rayon, no concurrent I/O.
  2. The fixed-point resolver (fbuild-library-select::resolve) calls walk() twice; Pass 2 re-reads and re-scans every file Pass 1 already touched, because visited is local to each walk() invocation.
  3. No tracing spans wrap the scan/walk, so regressions are invisible without external profiling.
  4. Search-path lookup is O(search_paths) per #include per library.

None of this breaks the #205 acceptance criteria today (cold scan still hits the ≤ 200 ms gate for teensy41), but on bigger FastLED examples / future ESP32 wiring the cold path is the one that bites users.

Observations

1. Walker is fully sequential

crates/fbuild-header-scan/src/walker.rs:29-67

while let Some(file) = queue.pop_front() {
    let Ok(text) = std::fs::read_to_string(&file) else { continue; };
    for inc in scan(&text) { ... }
}

VecDeque::pop_front + blocking read_to_string per file. For a Teensy 4.1 sketch the BFS frontier easily reaches dozens of headers per level. fbuild-header-scan/Cargo.toml has no rayon / crossbeam / async runtime dep — scanner throughput (≥ 50 MB/s/thread per benches/scan_throughput.rs) goes unused.

2. Pass 2 throws away Pass 1's work

crates/fbuild-library-select/src/lib.rs:85-129

Pass 1 (let res = walk(seeds, ...)) and Pass 2 (walk(&recon_seeds, ...) inside loop) each construct their own visited set. Pass 2's recon_seeds are seeds ∪ selected_libs' source_files, so every header reached in Pass 1 is re-read + re-scanned in Pass 2. On a converged resolution this is roughly 2× the I/O and 2× the scanner work for the same answer.

3. No tracing instrumentation

No tracing::info_span!(\"ldf_walk\") or #[tracing::instrument] in walker.rs or lib.rs. Adding spans around walk() and the Pass 2 loop would expose per-iteration timing in the daemon's existing log stream without external profilers.

4. Linear search-path scan per include

crates/fbuild-header-scan/src/walker.rs:69-85

for sp in search_paths {
    let candidate = sp.join(&inc.path);
    if candidate.is_file() { return Some(candidate); }
}

Each #include does up to search_paths.len() filesystem stat calls. With ~30 Teensy libraries each contributing 1–2 include dirs, that's 30–60 stats × (every #include in every file). The hot misses (e.g. unresolved system headers) are the worst because they walk the full list.

Suggested scope (no design lock-in — pick what fits)

A. Rayon-parallelize the walker frontier (biggest win)

  • Process the BFS frontier in waves: each iteration pulls everything currently in the queue, fans out across rayon to read+scan in parallel, then merges the new edges back into a shared visited set guarded by a single Mutex (only locked at merge boundaries, not during scan).
  • Keeps the BFS-order semantics; only the within-wave work runs concurrently. Scanner stays pure (no Send/Sync plumbing needed beyond what it already has).
  • Target: ≥ 4× cold-scan speedup on an 8-core CI runner for the Teensy 4.1 corpus (~500 sources).

B. Memoize scan results across Pass 1 and Pass 2

  • Lift visited and a HashMap<PathBuf, Vec<IncludeRef>> cache out of walk() into resolve(). Pass &mut WalkState (visited + scan_cache) into a new walk_with_state() so Pass 2's recon walk reuses everything Pass 1 already learned.
  • No behavior change — the merged include set is the union either way.
  • Expected: ~halves cold-scan time on multi-pass resolutions (which is the common case once a library is selected).

C. Add tracing spans

  • #[tracing::instrument(skip_all, fields(seeds = seeds.len(), search_paths = search_paths.len()))] on walk() and resolve().
  • A tracing::info_span!(\"ldf_pass\", pass = i) around each Pass 2 iteration.
  • Visible in the existing daemon log without new wiring.

D. Precompute a header-name index per library (lower priority)

  • Build a HashMap<&str /* basename */, Vec<&FrameworkLibrary>> once per resolve; resolve(inc, …) consults it before the linear search-path scan.
  • Mostly speeds up unresolved-include misses (the dominant case on framework header walks). Skippable if A+B land.

E. CI perf gate using existing benches

  • benches/resolve_cold.rs and benches/scan_throughput.rs exist but aren't gated. Wire criterion-baseline into the workflow so regressions on cold scan / scan throughput fail PRs.

Acceptance criteria

  1. Cold resolve() on teensy41 + Blink ≤ 100 ms on CI runner (currently ≤ 200 ms gate per feat(library-selection): Rust-native LDF-style transitive header scanner, zccache-backed #205 AC#6).
  2. Cold resolve() on teensy41 Audio-shield example (larger transitive set) shows ≥ 2× speedup over current master.
  3. Single-pass walk reads each file exactly once across Pass 1 + Pass 2 (assert in a test via a counting Read shim, or check via tracing-test).
  4. Warm resolve_cached() is unchanged (still a single bincode deserialize from zccache_artifact::KvStore).
  5. tracing::Subscriber capture shows ldf_walk and ldf_pass spans with durations.
  6. New criterion baseline check in CI for resolve_cold and scan_throughput.

Out of scope

  • mmap-based reads (memmap2). Files are small (< 64 KB headers); the win is in parallelism + memoization, not read syscall path.
  • Replacing BTreeSet/HashSet with DashSet. The wave-merge boundary is fine with a Mutex and avoids DashSet's lookup overhead.
  • ESP32 / AVR orchestrator wiring — separate task (out of scope per #205 follow-up: wire library-selection cache into orchestrators (teensy, stm32) #214).
  • Eviction policy for the library-selection KvStore. The existing zccache eviction handles it.

Related

Files touched (likely)

  • crates/fbuild-header-scan/src/walker.rs — add walk_with_state, rayon wave-fan-out
  • crates/fbuild-header-scan/Cargo.toml — add rayon (workspace dep)
  • crates/fbuild-library-select/src/lib.rs:57-157 — thread WalkState through resolve, add tracing spans
  • .github/workflows/ci.yml (or equivalent) — gate resolve_cold / scan_throughput benches

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions