feat(daemon): process containment via running-process (closes #32)#108
feat(daemon): process containment via running-process (closes #32)#108
Conversation
|
Warning Rate limit exceeded
Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 45 minutes and 18 seconds. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (16)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
5026916 to
6b1345e
Compare
Every subprocess the fbuild daemon spawns now dies when the daemon dies, including grandchildren forked by the child. On Windows this is a single Job Object with JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE; on Linux it's setpgid(0,0) + PR_SET_PDEATHSIG(SIGKILL); on macOS it's a process group with a drop-time killpg. Eliminates the orphaned bash.exe / conhost.exe / OpenConsole.exe (and Unix init-reparented) leaks after a daemon crash. Dep: running-process-core via git, pinned to ff9d7972504f7a0dcee0f410274daf3b02e4fcc2 (not yet on crates.io). New module `fbuild_core::containment` exposes: - init_global_containment() — called from fbuild-daemon/src/main.rs - spawn_contained() / spawn_detached() for std::process::Command - tokio_spawn::spawn_contained() for tokio::process::Command (Unix: pre_exec; Windows: parallel Job Object since running-process-core doesn't expose its handle to tokio Commands) Spawn sites rerouted through containment: - fbuild_core::subprocess::run_command — the central blocking helper, covering compilers, linkers, esptool, avrdude, addr2line, teensy_loader and every other tool launched via run_command across fbuild-build, fbuild-deploy, fbuild-serial, fbuild-packages (~70 call sites). - fbuild-daemon/src/handlers/emulator.rs — find_node / ensure_avr8js_npm / find_simavr (run_command) and run_avr8js_headless / run_qemu_process (tokio_spawn::spawn_contained). - fbuild-build/src/script_runtime.rs — extra_scripts python runtime. - fbuild-packages/src/disk_cache/budget.rs — powershell/df probes. Intentional hold-outs (documented inline): - fbuild-cli/src/daemon_client.rs::spawn_daemon_process — the CLI spawns the daemon and exits; the daemon must outlive the CLI. - fbuild-python/src/lib.rs::Daemon::ensure_running (×2) — same pattern from Python host. - fbuild-build/src/zccache.rs::ensure_running — zccache is its own long-running daemon with independent lifecycle. Integration test: crates/fbuild-daemon/tests/process_containment.rs spawns a parent harness that installs containment, spawns a child which spawns a grandchild, hard-kills the parent, then asserts both child and grandchild PIDs disappear within 10 s. Passes on Windows; Unix uses libc::kill probes, Windows uses OpenProcess + GetExitCodeProcess. Verified: `uv run cargo check --workspace --all-targets`, `uv run cargo clippy --workspace --all-targets -- -D warnings`, `uv run cargo fmt --all --check`, full `uv run cargo test --workspace`, and `cargo test -p fbuild-daemon --test process_containment -- --include-ignored`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6b1345e to
cfbfec2
Compare
…arm build On Unix, `fbuild_core::containment::spawn_contained` delegated to `ContainedProcessGroup::spawn_with_containment` from the `running-process-core` crate. That implementation stores the first spawned child's PID as the group's PGID and then has every subsequent child call `setpgid(0, first_child_pid)` from its `pre_exec` hook. Once the first child exits and is reaped (e.g. the short `avr-gcc -dumpversion` call emitted by `log_toolchain_version`), the kernel tears down that process group. The second spawn's `setpgid(0, stale_pgid)` then fails with EPERM, which surfaces as `build error: build failed: io error: Operation not permitted (os error 1)` immediately after the `Toolchain: avr-gcc 7.3.0` line — exactly the failure reported in #129. This is reproducible on Linux CI from 2.1.17 onwards (Build Leonardo et al.) and blocks every AVR / ESP32 / etc. build made via the daemon. Fix: bypass the shared-pgid behaviour on Unix. Install a per-child `pre_exec` hook that creates a fresh process group with `setpgid(0, 0)` and, on Linux, requests `PR_SET_PDEATHSIG(SIGKILL)` so the kernel still kills the child when the spawning daemon thread exits. Windows is unchanged — Job Object assignment is stateless and has no analogous failure mode. macOS loses the drop-time `killpg` backstop, which was already a no-op in practice because the global `ContainedProcessGroup` lives in a `OnceLock` that never drops; this is the same coverage profile as before the fix. Regression test: `sequential_contained_spawns_do_not_fail_with_eperm` in `crates/fbuild-core/src/containment.rs` initialises the global group and performs two consecutive `spawn_contained` calls with a wait+sleep between them, mirroring the AVR build's "dumpversion then compile" shape. Refs #129. Reproducing commits: #108 (containment feature) + the interaction surfaced by #120 / #119 that made the second-spawn path universally reachable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…arm build On Unix, `fbuild_core::containment::spawn_contained` delegated to `ContainedProcessGroup::spawn_with_containment` from the `running-process-core` crate. That implementation stores the first spawned child's PID as the group's PGID and then has every subsequent child call `setpgid(0, first_child_pid)` from its `pre_exec` hook. Once the first child exits and is reaped (e.g. the short `avr-gcc -dumpversion` call emitted by `log_toolchain_version`), the kernel tears down that process group. The second spawn's `setpgid(0, stale_pgid)` then fails with EPERM, which surfaces as `build error: build failed: io error: Operation not permitted (os error 1)` immediately after the `Toolchain: avr-gcc 7.3.0` line — exactly the failure reported in #129. This is reproducible on Linux CI from 2.1.17 onwards (Build Leonardo et al.) and blocks every AVR / ESP32 / etc. build made via the daemon. Fix: bypass the shared-pgid behaviour on Unix. Install a per-child `pre_exec` hook that creates a fresh process group with `setpgid(0, 0)` and, on Linux, requests `PR_SET_PDEATHSIG(SIGKILL)` so the kernel still kills the child when the spawning daemon thread exits. Windows is unchanged — Job Object assignment is stateless and has no analogous failure mode. macOS loses the drop-time `killpg` backstop, which was already a no-op in practice because the global `ContainedProcessGroup` lives in a `OnceLock` that never drops; this is the same coverage profile as before the fix. Regression test: `sequential_contained_spawns_do_not_fail_with_eperm` in `crates/fbuild-core/src/containment.rs` initialises the global group and performs two consecutive `spawn_contained` calls with a wait+sleep between them, mirroring the AVR build's "dumpversion then compile" shape. Refs #129. Reproducing commits: #108 (containment feature) + the interaction surfaced by #120 / #119 that made the second-spawn path universally reachable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The containment rewrite (#108) replaced `Command::output()` with a hand-rolled `read_to_end(stdout)` → `read_to_end(stderr)` → `wait()` sequence inside `fbuild_core::subprocess::run_command`. Reading the two pipes serially deadlocks the moment the child fills either OS pipe buffer (~64 KB): the child blocks on `write()` to the still-undrained pipe, while the parent sits forever in `read_to_end()` on the other. ESP32-C6 builds tripped this every run since 49307a3 — the riscv32-esp GCC writes more than 64 KB to stderr on a full build, so every CI invocation hung after `Toolchain: riscv32-esp-elf-gcc 14.2.0`, and the client's 1800 s reqwest timeout was the only thing that ended the build (`error: daemon error: stream error: error decoding response body`). Fix: - `run_no_timeout`: use `child.wait_with_output()`, which is the same background-thread drain that `Command::output()` used before. - `run_with_timeout`: spawn two reader threads up front, keep the `try_wait` poll loop, join the readers after the child exits. Drains both pipes in parallel for the entire lifetime of the subprocess. Both paths still spawn through `containment::spawn_contained`, so process-tree containment from #108 is preserved. Also adds `ci/find_direct_subprocess.py` — enumerates every remaining direct `Command::new(...)` site in `crates/` so they can be migrated to `running-process` and the wrapper-vs-direct split can be eliminated. Tracked by #141; the script self-deletes when the count reaches zero. Verified locally: `uv run cargo check`, `uv run cargo clippy -- -D warnings`, all 5 fbuild-core subprocess unit tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The containment rewrite (#108) replaced `Command::output()` with a hand-rolled `read_to_end(stdout)` → `read_to_end(stderr)` → `wait()` sequence inside `fbuild_core::subprocess::run_command`. Reading the two pipes serially deadlocks the moment the child fills either OS pipe buffer (~64 KB): the child blocks on `write()` to the still-undrained pipe, while the parent sits forever in `read_to_end()` on the other. ESP32-C6 builds tripped this every run since 49307a3 — the riscv32-esp GCC writes more than 64 KB to stderr on a full build, so every CI invocation hung after `Toolchain: riscv32-esp-elf-gcc 14.2.0`, and the client's 1800 s reqwest timeout was the only thing that ended the build (`error: daemon error: stream error: error decoding response body`). Fix: - `run_no_timeout`: use `child.wait_with_output()`, which is the same background-thread drain that `Command::output()` used before. - `run_with_timeout`: spawn two reader threads up front, keep the `try_wait` poll loop, join the readers after the child exits. Drains both pipes in parallel for the entire lifetime of the subprocess. Both paths still spawn through `containment::spawn_contained`, so process-tree containment from #108 is preserved. Also adds `ci/find_direct_subprocess.py` — enumerates every remaining direct `Command::new(...)` site in `crates/` so they can be migrated to `running-process` and the wrapper-vs-direct split can be eliminated. Tracked by #141; the script self-deletes when the count reaches zero. Verified locally: `uv run cargo check`, `uv run cargo clippy -- -D warnings`, all 5 fbuild-core subprocess unit tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…200) Closes #141. Every fbuild subprocess now flows through `fbuild_core::subprocess::{run_command,run_command_passthrough}`, backed by `running_process_core::NativeProcess`, which drains stdout and stderr concurrently from the moment the child starts. This eliminates the class of pipe-buffer deadlocks that broke ESP32-C6 CI after the hand-rolled drain in the containment rewrite (#108). Migrated sites: - fbuild-build: compiler_version, script_runtime harness + find_python - fbuild-cli: find_pio, run_pio_command (via run_command_passthrough), open_in_browser, kill_process, find_daemon_pids Intentional hold-outs annotated with `// allow-direct-spawn: <reason>`: - zccache bootstrap and daemon spawns from CLI / Python (must outlive parent) - containment module's own regression tests - integration-test drivers that spawn binaries under test - tokio async streaming emulator handlers (QEMU, avr8js/node) where the blocking NativeProcess API is unsuitable - tokio parallel async fan-out in the CLI (IWYU, clang-tidy) `ci/find_direct_subprocess.py` is promoted from a migration tracker to a CI-enforced linter (`--fail`), wired into the Ubuntu check workflow and a dedicated `lint-subprocess.yml`. Future direct `Command::new` spawns without an allow-marker break the build. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #32 — final piece of the daemon-teardown-hardening tracker. PRs #61 (Windows console ctrl handler) and #87 (SO_LINGER 0) landed the previous two.
Changes
running-process-coregit dep (not on crates.io as of 2026-04-18):fbuild_core::subprocess::run_commandhelper — single chokepoint covering ~70 callers acrossfbuild-build(every platform orchestrator),fbuild-deploy(esptool/avrdude/teensy_loader/reset),fbuild-serial(addr2line),fbuild-packages/library_compiler.fbuild-daemon/src/handlers/emulator.rs: containedfind_node,ensure_avr8js_npm,find_simavr(blocking path) +run_avr8js_headless,run_qemu_process(tokio path).fbuild-build/src/script_runtime.rs— extra_scripts python runtime.fbuild-packages/src/disk_cache/budget.rs— powershell/df probes.Intentional detached holdouts (inline-commented with
INTENTIONALLY DETACHED (FastLED/fbuild#32))fbuild-cli/src/daemon_client.rs::spawn_daemon_process— CLI→daemon; daemon must outlive CLIfbuild-python/src/lib.rs::Daemon::ensure_running(sync + async) — Python→daemon, samefbuild-build/src/zccache.rs::ensure_running— zccache is its own daemonIntegration test
crates/fbuild-daemon/tests/process_containment.rs+ helper bincrates/fbuild-daemon/src/bin/containment_harness.rs:taskkill /F(Windows) /SIGKILL(Unix)#[ignore]likeport_recovery.rs; picked up byuv run test --full(which passes--include-ignored)Test plan
uv run cargo test -p fbuild-daemon --test process_containment -- --include-ignoredpasses on Windowsuv run cargo clippy --workspace --all-targets -- -D warningscleanuv run cargo fmt --all -- --checkcleanrunning-process-coreuses verbatimpre_exec+killpg+PR_SET_PDEATHSIGfor POSIX; CI confirms across platforms🤖 Generated with Claude Code