fix(build): P0 regression — Operation not permitted (os error 1) on warm build #134
…arm build

On Unix, `fbuild_core::containment::spawn_contained` delegated to `ContainedProcessGroup::spawn_with_containment` from the `running-process-core` crate. That implementation stores the first spawned child's PID as the group's PGID and then has every subsequent child call `setpgid(0, first_child_pid)` from its `pre_exec` hook. Once the first child exits and is reaped (e.g. the short `avr-gcc -dumpversion` call emitted by `log_toolchain_version`), the kernel tears down that process group. The second spawn's `setpgid(0, stale_pgid)` then fails with EPERM, which surfaces as `build error: build failed: io error: Operation not permitted (os error 1)` immediately after the `Toolchain: avr-gcc 7.3.0` line, exactly the failure reported in #129. This is reproducible on Linux CI from 2.1.17 onwards (Build Leonardo et al.) and blocks every AVR / ESP32 / etc. build made via the daemon.

Fix: bypass the shared-pgid behaviour on Unix. Install a per-child `pre_exec` hook that creates a fresh process group with `setpgid(0, 0)` and, on Linux, requests `PR_SET_PDEATHSIG(SIGKILL)` so the kernel still kills the child when the spawning daemon thread exits. Windows is unchanged: Job Object assignment is stateless and has no analogous failure mode. macOS loses the drop-time `killpg` backstop, which was already a no-op in practice because the global `ContainedProcessGroup` lives in a `OnceLock` that never drops; this is the same coverage profile as before the fix.

Regression test: `sequential_contained_spawns_do_not_fail_with_eperm` in `crates/fbuild-core/src/containment.rs` initialises the global group and performs two consecutive `spawn_contained` calls with a wait+sleep between them, mirroring the AVR build's "dumpversion then compile" shape.

Refs #129. Reproducing commits: #108 (containment feature) + the interaction surfaced by #120 / #119 that made the second-spawn path universally reachable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
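The stale-PGID failure and the per-child fix can be reproduced with a small std-only sketch. This is not fbuild's or `running-process-core`'s code: `spawn_joining_group` is a hypothetical stand-in for the old shared-PGID strategy (the child runs `setpgid(0, pgid)` via `CommandExt::process_group`), and `spawn_in_fresh_group` approximates the fix with `process_group(0)` (equivalent to `setpgid(0, 0)`) plus, on Linux only, a `pre_exec` hook calling `prctl(PR_SET_PDEATHSIG, SIGKILL)`. It assumes a Unix target, Rust 1.82+ (for `unsafe extern` blocks) and a `true` binary on `PATH`.

```rust
use std::io;
use std::os::unix::process::CommandExt; // process_group, pre_exec (Unix only)
use std::process::{Child, Command};

/// Hypothetical stand-in for the old strategy: the child runs
/// setpgid(0, pgid) to join an existing group. If every member of that
/// group has already exited and been reaped, the call fails with EPERM.
fn spawn_joining_group(cmd: &mut Command, pgid: i32) -> io::Result<Child> {
    cmd.process_group(pgid).spawn()
}

/// Sketch of the fix: each child creates its own group (setpgid(0, 0)),
/// so no spawn depends on an earlier child still being alive.
fn spawn_in_fresh_group(cmd: &mut Command) -> io::Result<Child> {
    cmd.process_group(0);
    install_pdeathsig(cmd);
    cmd.spawn()
}

#[cfg(target_os = "linux")]
fn install_pdeathsig(cmd: &mut Command) {
    // SAFETY: the hook issues a single prctl syscall between fork and exec
    // and does not allocate or take locks.
    unsafe {
        cmd.pre_exec(pdeathsig::die_with_parent);
    }
}

#[cfg(not(target_os = "linux"))]
fn install_pdeathsig(_cmd: &mut Command) {
    // No PR_SET_PDEATHSIG equivalent outside Linux; nothing to install.
}

#[cfg(target_os = "linux")]
mod pdeathsig {
    use std::io;
    use std::os::raw::{c_int, c_ulong};

    const PR_SET_PDEATHSIG: c_int = 1; // value from <linux/prctl.h>
    const SIGKILL: c_ulong = 9;

    unsafe extern "C" {
        fn prctl(option: c_int, ...) -> c_int;
    }

    /// Ask the kernel to SIGKILL this (child) process when the thread
    /// that spawned it exits.
    pub fn die_with_parent() -> io::Result<()> {
        // SAFETY: PR_SET_PDEATHSIG takes one unsigned long (the signal).
        let rc = unsafe { prctl(PR_SET_PDEATHSIG, SIGKILL) };
        if rc == -1 {
            Err(io::Error::last_os_error())
        } else {
            Ok(())
        }
    }
}
```

With a reaped first child, `spawn_joining_group(&mut Command::new("true"), stale_pgid)` returns an error whose `raw_os_error()` is `Some(1)` (EPERM), while repeated `spawn_in_fresh_group` calls keep succeeding. Using `process_group(0)` instead of a hand-written `setpgid` keeps the sketch free of a `libc` dependency; the PR itself installs both calls in its own `pre_exec` hook.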
Cuts a release containing the two P0 fixes landed since 2.1.19:

- #134 "P0 regression — Operation not permitted (os error 1) on warm build"
- #135 "preserve exec bit on fbuild console script in wheel"

Both are currently blocking every FastLED uno build on GitHub Actions: the wheel's console script installs without +x, so CI can't even run `fbuild --version`, and the subsequent compile fails with `Operation not permitted (os error 1)` on every example.

Also includes:

- #131 rustfmt on lnk pipeline
- #133 DiskCache leases.refcount schema migration
- #128 AVR orchestrator fingerprint fast-path + telemetry (#127)
- #126 FBUILD_WATCH_SET_CACHE_SECS env override
- f8533d3 extend watch-set fingerprint fast-path to AVR orchestrator

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Allows manually re-running the uno build from the Actions UI without making a trigger-path change. Also serves as a verification trigger for the fbuild 2.1.20 bump in #2353: this PR's change lives in the uno workflow's own trigger-path list, so CI will actually run uno against the new fbuild, confirming FastLED/fbuild#134 (Operation not permitted) and #135 (exec bit on wheel) are resolved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
iter2 (008a065) picked up fbuild 2.1.20, which shipped the EPERM fix (#134) but still had the console-script exec bit stripped by the wheel publisher. fbuild 2.1.21 carries the S_IFREG fix (fda86d4).

Add `chmod +x` after `pip install fbuild` so a future wheel regression can't mask our cache-behavior measurements by failing before the compile phase. Bump `CACHE_BUST` to force a fresh cold cache for iter3. The warm run will follow as an empty commit once this lands.

Tracked by: #112
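Why the S_IFREG bit matters, assuming the installer behaves like pip (whose `zip_item_is_executable` helper reads the Unix mode from the upper 16 bits of a ZIP entry's `external_attr` and restores +x only for a regular file with some execute bit set): a console script stored with mode `0o755` but without the `S_IFREG` file-type bits is extracted without +x. A minimal sketch of that predicate; the function names are illustrative and are not fbuild's or pip's code.

```rust
/// File-type mask and regular-file type bits, as defined in <sys/stat.h>.
const S_IFMT: u32 = 0o170000;
const S_IFREG: u32 = 0o100000;

/// Hypothetical helper: a ZIP entry created on Unix stores st_mode in the
/// upper 16 bits of its 32-bit `external_attr` field.
fn external_attr_for_mode(mode: u32) -> u32 {
    mode << 16
}

/// Assumed pip-style check: non-zero mode, regular file, any execute bit.
fn zip_entry_is_executable(external_attr: u32) -> bool {
    let mode = external_attr >> 16;
    mode != 0 && (mode & S_IFMT) == S_IFREG && (mode & 0o111) != 0
}
```

Under this check, `external_attr_for_mode(0o755)` is not treated as executable, while `external_attr_for_mode(S_IFREG | 0o755)` is, which matches the symptom of a wheel whose console script installs without +x until the file-type bits are written.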
Summary
Fixes #129 (the `os error 1` half). AVR / ESP32 / every-platform builds fail on `main` immediately after `Toolchain: avr-gcc 7.3.0` with `build error: build failed: io error: Operation not permitted (os error 1)`. This reproduces on CI (Build Leonardo and siblings) from 2.1.17 onwards and on every `pip install fbuild==2.1.18`.

Root cause
`fbuild_core::containment::spawn_contained` (Unix) delegated to `ContainedProcessGroup::spawn_with_containment` in the external `running-process-core` crate. That implementation stores the first spawned child's PID as the group's PGID and has every subsequent child call `setpgid(0, first_child_pid)` from its `pre_exec` hook.

Once the first child exits and is reaped (e.g. the short `avr-gcc -dumpversion` call that emits the `Toolchain:` line), the kernel tears down that process group. The very next spawn's `setpgid(0, stale_pgid)` then fails with EPERM, which propagates as `io error: Operation not permitted (os error 1)` through the build orchestrator.

Reproducing commit: #108 (added containment via the `running-process` crate). The failure became universal only once the daemon-centric build path in 2.1.17 / 2.1.18 started taking `spawn_contained` for every subprocess.

What this PR does
- On Unix, bypasses `ContainedProcessGroup::spawn_with_containment` and installs our own `pre_exec` hook that:
  - calls `setpgid(0, 0)`, so each child gets a fresh process group with no stale-pgid dependency;
  - on Linux, also calls `prctl(PR_SET_PDEATHSIG, SIGKILL)` so the kernel still SIGKILLs the child when the spawning daemon thread exits.
- macOS loses the drop-time `killpg` backstop, which was already a no-op in practice: the global `ContainedProcessGroup` lives in a `OnceLock` that never drops. Improving macOS containment can be tracked separately.

Regression test
`sequential_contained_spawns_do_not_fail_with_eperm` in `crates/fbuild-core/src/containment.rs` initialises the global containment group, spawns a short-lived command, waits for it to exit, sleeps briefly, then spawns a second command. On the pre-fix code path, the second spawn reliably failed with EPERM; with this fix, both succeed.

Test plan
- `uv run cargo clippy --workspace --all-targets -- -D warnings`: clean
- `uv run cargo test -p fbuild-core --lib containment`: 3/3 pass
- `uv run cargo test -p fbuild-daemon --test process_containment -- --ignored`: 1/1 pass (Windows locally; CI will verify Linux/macOS)
- `uv run cargo test --workspace --lib`: green
- Build Leonardo / Build Arduino Uno (end-to-end reproducer): verified by this PR's CI run

🤖 Generated with Claude Code