
fix(cli): close stale-lock cleanup race in acquireOnboardLock (#1281)#1656

Merged
ericksoa merged 5 commits into NVIDIA:main from ColinM-sys:fix/1281-onboard-lock-stale-cleanup-race
Apr 16, 2026

Conversation

@ColinM-sys
Contributor

@ColinM-sys ColinM-sys commented Apr 9, 2026

Summary

Closes #1281.

acquireOnboardLock in src/lib/onboard-session.ts had a stale-cleanup path that read a stale lock, decided the holder was dead, and unconditionally called fs.unlinkSync(LOCK_FILE) before retrying writeFileSync(wx).

If two concurrent processes observe the same stale lock, both try to clean it up, and the slower one can unlink the fresh lock the faster one just claimed. Each process then believes it holds the lock, and onboard runs in parallel against the same shared session state.

Originally surfaced by CodeRabbit on #1272.

Race walkthrough

A: writeFile(wx) → EEXIST (stale lock present)
B: writeFile(wx) → EEXIST (same stale lock)
A: read stale, isProcessAlive(stale.pid) → false
A: unlinkSync(LOCK_FILE)                                     ← deletes stale
A: loop, writeFile(wx) → SUCCESS, A holds the lock
B: read stale (still has its own copy from before)
B: isProcessAlive(stale.pid) → false (B is reasoning about
   the OLD pid, not the fresh lock A just wrote)
B: unlinkSync(LOCK_FILE)                                     ← DELETES A's FRESH LOCK
B: loop, writeFile(wx) → SUCCESS, B also "holds" the lock

After this, both A and B believe they own the lock and proceed with onboard.
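The buggy path can be sketched as follows. This is a simplified reconstruction for illustration, not the actual src/lib/onboard-session.ts source; the lock file's location, the payload shape, and isProcessAlive's implementation are assumptions:

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

const LOCK_FILE = path.join(os.tmpdir(), "onboard-race-buggy.lock");

// Stand-in liveness probe: signal 0 tests a PID without sending a signal.
function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch {
    return false;
  }
}

// Simplified pre-fix shape: the unconditional unlinkSync on the stale path
// is the race. If another process reclaimed the lock between our read and
// our unlink, we delete THEIR fresh lock and then "win" the retry.
function acquireOnboardLockBuggy(command: string): boolean {
  const MAX_ATTEMPTS = 2;
  for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
    try {
      const payload = JSON.stringify({ pid: process.pid, command });
      fs.writeFileSync(LOCK_FILE, payload, { flag: "wx" }); // atomic create-or-fail
      return true;
    } catch {
      // EEXIST: inspect the current holder (other error handling elided)
      const stale = JSON.parse(fs.readFileSync(LOCK_FILE, "utf8"));
      if (!isProcessAlive(stale.pid)) {
        fs.unlinkSync(LOCK_FILE); // unconditional: may delete a fresh lock
      }
    }
  }
  return false;
}
```

In a single process the loop looks harmless; the bug only appears when two processes interleave between the liveness check and the unlink.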

Fix

Capture the stale file's inode via fs.statSync(LOCK_FILE, { bigint: true }) at the same time we read its contents. Then, in a new unlinkIfInodeMatches() helper, re-stat immediately before fs.unlinkSync and bail out if the inode has changed.

The stat-then-unlink pair is the closest portable POSIX primitive Node exposes for this (there is no atomic "unlink-if-inode" syscall), so a sufficiently unlucky race can still slip through. However, the window is orders of magnitude smaller than with the unconditional unlink it replaces, and the outer retry loop detects a wrong unlink on its next writeFileSync(wx) attempt: either we re-create the file, or we observe a new lock with a different inode and retry.

Also bumps MAX_ATTEMPTS from 2 to 5 because the inode-verified cleanup can take a few more spins under contention before one cleaner wins.
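A minimal sketch of what the inode-guarded helper could look like, based on the description above (the real signature and error handling in the PR may differ):

```typescript
import * as fs from "node:fs";

// Re-stat immediately before unlinking and bail out if the inode changed.
// This narrows, but cannot fully close, the stat/unlink TOCTOU window.
function unlinkIfInodeMatches(filePath: string, expectedInode: bigint): boolean {
  try {
    const current = fs.statSync(filePath, { bigint: true });
    if (current.ino !== expectedInode) {
      return false; // someone re-created the lock; leave it alone
    }
    fs.unlinkSync(filePath);
    return true;
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code === "ENOENT") {
      return false; // already gone; another cleaner won the race
    }
    throw err;
  }
}
```

The caller would capture expectedInode from fs.statSync(LOCK_FILE, { bigint: true }).ino at the same moment it reads and parses the stale lock's contents.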

Behavior preserved

  • Malformed lock files are still left on disk (the existing "treats unreadable or transient lock contents as a retry" test still passes — I explicitly do NOT call unlinkIfInodeMatches on malformed locks, only on parseable-but-stale ones).
  • A holder PID that is still alive is still reported correctly to the caller.
  • releaseOnboardLock() semantics unchanged.

Test plan

  • Added a regression test, "regression #1281: stale-cleanup race does not unlink a fresh lock claimed by another process", that simulates the race deterministically: it wraps fs.statSync so the first stat (inside acquireOnboardLock) succeeds against the original stale inode, then atomically swaps the lock file (unlink + recreate) to give it a new inode before unlinkIfInodeMatches re-stats it. The test asserts that the fresh claim survives the race and is the file on disk after acquireOnboardLock returns.
  • Negative case verified: stashed the source fix, re-ran the test against the unguarded code. The new regression test correctly fails because the unconditional unlinkSync deletes the fresh claim and writes a new one with a different command string. The test's expect(onDisk.command).toContain("fresh claim from concurrent process") then fails with the actual content showing the wrong winner.
  • Positive case verified: re-applied the fix, re-ran. All 14 lock-related tests pass.
  • No regression on existing lock tests (acquires and releases the onboard lock, replaces a stale onboard lock, treats unreadable or transient lock contents as a retry, not a stale lock, ignores malformed lock files when releasing the onboard lock).
  • One pre-existing Windows-only POSIX-permissions test (creates and persists a session with restrictive permissions checking mode & 0o777 === 0o600) fails on this branch with the same failure as main — this is the umask/POSIX-mode-on-Windows issue and is unrelated to the lock fix.
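The swap-before-re-stat trick can be demonstrated without the Vitest harness. The standalone sketch below inlines a stand-in for unlinkIfInodeMatches and performs the swap between capturing the stale inode and invoking the guard; the actual test instead monkey-patches fs.statSync to interleave the swap mid-helper, and the paths and payloads here are invented:

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

const LOCK_FILE = path.join(os.tmpdir(), "onboard-race-sketch.lock");

// Inode-guarded unlink (stands in for the PR's unlinkIfInodeMatches).
function unlinkIfInodeMatches(file: string, expectedInode: bigint): boolean {
  try {
    if (fs.statSync(file, { bigint: true }).ino !== expectedInode) return false;
    fs.unlinkSync(file);
    return true;
  } catch {
    return false; // ENOENT etc.: another cleaner already won
  }
}

// 1. Plant a stale lock and capture its inode, as the acquire loop would.
fs.rmSync(LOCK_FILE, { force: true });
fs.writeFileSync(LOCK_FILE, JSON.stringify({ pid: 999999999, command: "stale" }));
const staleInode = fs.statSync(LOCK_FILE, { bigint: true }).ino;

// 2. Simulate the concurrent winner: create the fresh claim at a temp path
//    (its inode is allocated while the stale inode still exists, so the two
//    must differ), then rename it over the lock atomically.
const temp = LOCK_FILE + ".tmp";
fs.writeFileSync(temp, JSON.stringify({ pid: process.pid, command: "fresh claim" }));
fs.renameSync(temp, LOCK_FILE);

// 3. The slow contender's cleanup must now refuse to unlink.
const removed = unlinkIfInodeMatches(LOCK_FILE, staleInode);
```

The temp-then-rename step is what makes the simulation deterministic: a plain unlink + recreate could hand the replacement file the same inode number back on some filesystems.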

Why an inode check rather than the issue's suggested linkSync approach

The issue body suggests using fs.linkSync(temp, LOCK_FILE) which fails atomically with EEXIST if the target exists. That's a valid alternative, but:

  1. The current writeFileSync(LOCK_FILE, payload, { flag: "wx" }) already provides the same atomic-create-or-fail semantics — there's no need to introduce a tempfile + link dance for the initial claim. The race is specifically in the cleanup of stale locks, not the initial claim.
  2. Inode-comparison preserves the existing happy-path code (single-syscall writeFileSync(wx)) and only adds the check on the rare stale-cleanup branch.
  3. Tempfile + linkSync would add a second possible failure mode (tempfile leak under crash) without solving the cleanup race any more cleanly.

Both approaches are correct; this one is the smaller surgical change.
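For comparison, the issue's linkSync approach would look roughly like this (a hypothetical sketch, not code from this PR):

```typescript
import * as fs from "node:fs";

// Claim lockPath by hard-linking a private temp file into place.
// link(2) fails atomically with EEXIST if the target already exists.
function claimViaLink(lockPath: string, payload: string): boolean {
  const temp = `${lockPath}.${process.pid}.tmp`;
  fs.writeFileSync(temp, payload); // private to us; no one races on this name
  try {
    fs.linkSync(temp, lockPath);
    return true; // we won the claim
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code === "EEXIST") return false;
    throw err;
  } finally {
    fs.unlinkSync(temp); // drop the temp name either way; the link survives
  }
}
```

Note this is exactly the extra tempfile bookkeeping the PR avoids: if the process crashes between writeFileSync and the finally block, the temp file leaks.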

Summary by CodeRabbit

  • Tests

    • Added a regression test that reproduces a stale-lock race and verifies correct behavior without cross-test side effects.
  • Bug Fixes

    • Increased retry attempts for lock acquisition to reduce transient failures.
    • More robust, atomic lock creation to avoid partial writes.
    • Safer stale-lock cleanup using on-disk verification before removal.
    • More reliable lock release that verifies ownership before removing lock files.

Signed-off-by: ColinM-sys cmcdonough@50words.com

@coderabbitai
Contributor

coderabbitai bot commented Apr 9, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Hardened onboard lock acquisition and release to avoid stale-lock cleanup races: increased retry attempts (2→5), create locks using atomic open/write and track the descriptor, perform inode-verified stale unlinking, prefer fd/inode checks on release, and added a Vitest regression that simulates the race via fs.statSync monkey-patch.

Changes

  • Regression test for stale-lock race (src/lib/onboard-session.test.ts): Added a Vitest regression that sets up a stale onboard lock, monkey-patches fs.statSync to simulate a race in which another process replaces the lock, asserts the on-disk lock reflects the replacer, acquired: false, and holderPid equal to the replacer, and restores the original fs.statSync.
  • Stale-lock cleanup and lock ownership (src/lib/onboard-session.ts): Increased the retry budget to 5; switched lock creation to fs.openSync(..., "wx") + fs.writeSync and stored the created fd in the module-level heldLockFd; replaced the unconditional unlinkSync with a new unlinkIfInodeMatches(file, expectedInode) helper (using the bigint inode from statSync) for inode-verified stale removal; simplified the liveness check; updated releaseOnboardLock to prefer fd-based inode verification and to close the fd on release.

Sequence Diagram(s)

sequenceDiagram
  participant ProcA as Process A
  participant Acquire as acquireOnboardLock
  participant FS as Filesystem
  participant ProcB as Process B

  ProcA->>Acquire: attempt acquire
  Acquire->>FS: stat/read LOCK_FILE -> sees stale PID
  Acquire->>FS: stat(LOCK_FILE,{bigint:true}) -> capture staleInode
  Acquire->>FS: unlinkIfInodeMatches(LOCK_FILE, staleInode)
  activate FS
  FS-->>Acquire: unlink success or ENOENT
  deactivate FS
  ProcB->>FS: concurrently creates fresh lock (open "wx" / write) -> new inode
  FS-->>ProcB: lock created
  Acquire->>FS: try openSync(LOCK_FILE,"wx")
  alt open fails (EEXIST)
    Acquire->>FS: read current lock -> reports ProcB pid
    Acquire-->>ProcA: return {acquired:false, holderPid:ProcB}
  else open succeeds
    Acquire-->>ProcA: return {acquired:true}
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇
I sniffed a stale, small LOCK in moss,
Two paws scurried—one lost, one boss.
I checked the inode, nudged with care,
The fresh claim stayed; I twitched my ear,
Now duty locked, I nibble spare. 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 25.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
  • Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title check ✅ Passed: the title clearly and concisely summarizes the main fix (closing a stale-lock cleanup race in acquireOnboardLock), matching the core change in the PR.
  • Linked Issues check ✅ Passed: the PR implements inode-verified stale-lock cleanup and increases retries to 5, directly addressing the race in which concurrent processes could both remove and recreate the lock, though it uses a different approach than the issue's suggested atomic linkSync method.
  • Out of Scope Changes check ✅ Passed: all code changes are scoped to the stale-lock cleanup fix: inode-verified unlink, increased MAX_ATTEMPTS, atomic fd-based lock creation, and fd-based release verification with fallback semantics.



Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
src/lib/onboard-session.test.ts (1)

219-251: Use a distinct live PID and assert the loser path.

This regression currently reuses process.pid for the simulated fresh claimant and then allows either acquired result. That means a future change that wrongly treats the replacement lock as self-owned can still pass, even though the slower contender should lose. Make the replacement claim use a different live PID and assert result.acquired === false/result.holderPid so the test exercises the mutual-exclusion contract, not just file survival.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard-session.test.ts` around lines 219 - 251, The test currently
simulates a fresh claimant using process.pid which allows acquireOnboardLock to
ambiguously succeed; modify the simulated concurrent writer to use a distinct
live PID (not process.pid) when writing session.LOCK_FILE, then call
session.acquireOnboardLock("nemoclaw onboard --resume") and assert the
mutual-exclusion loser path by checking result.acquired === false and
result.holderPid equals the distinct PID you wrote, while still asserting the
on-disk lock content contains the fresh claim string.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/onboard-session.ts`:
- Around line 410-443: The current unlinkIfInodeMatches(filePath, expectedInode)
still allows a TOCTOU: another process can replace the file between statSync and
unlinkSync. Replace this approach by acquiring the lock via an atomic create
(fs.openSync(lockPath, 'wx')) and keep the returned file descriptor for the
lifetime of the lock; on release, use the descriptor to confirm ownership
(fs.fstatSync(fd).ino) and then close the fd and unlink the path only if
fs.statSync(path).ino matches that fstat ino — this ensures you created the file
atomically (no race at acquisition) and check ownership against the live
descriptor on release instead of relying on a stale expectedInode via
unlinkIfInodeMatches; remove or stop using unlinkIfInodeMatches and update the
lock acquisition/release code to use the create-with-'wx' + hold-fd pattern
(reference functions/file: unlinkIfInodeMatches, lock acquisition code that
calls writeFileSync(..., { flag: "wx" }) and any release/cleanup that currently
uses unlinkIfInodeMatches).
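The create-with-"wx" plus hold-fd pattern this prompt describes can be sketched like this (hypothetical helper name; the PR's actual release logic may differ):

```typescript
import * as fs from "node:fs";

// Release the lock only if the path still refers to the inode behind the
// fd we kept from fs.openSync(lockPath, "wx") at acquisition time.
function releaseIfStillOwner(fd: number, lockPath: string): boolean {
  const ownedInode = fs.fstatSync(fd, { bigint: true }).ino; // our lock's inode
  let released = false;
  try {
    if (fs.statSync(lockPath, { bigint: true }).ino === ownedInode) {
      fs.unlinkSync(lockPath); // the path is still our file; safe to remove
      released = true;
    }
  } catch {
    // path already gone: nothing of ours left to remove
  }
  fs.closeSync(fd); // always release the descriptor
  return released;
}
```

Because fstat works on the open descriptor rather than the pathname, this check cannot be fooled by another process replacing the file after our stat: the fd keeps referring to the inode we created.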


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c2579469-1b12-4b69-a79f-a071bb49e84e

📥 Commits

Reviewing files that changed from the base of the PR and between b9acacf and 687e360.

📒 Files selected for processing (2)
  • src/lib/onboard-session.test.ts
  • src/lib/onboard-session.ts

Comment thread src/lib/onboard-session.ts
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
src/lib/onboard-session.ts (1)

445-462: ⚠️ Potential issue | 🔴 Critical

The stale-cleanup race is still present here.

fs.statSync() and fs.unlinkSync() are still separate pathname operations. If another contender recreates LOCK_FILE after the inode check but before fs.unlinkSync(filePath), this helper will delete that fresh lock, and the next openSync("wx") retry can still succeed while the earlier claimant already returned acquired: true. The fd-based ownership check hardens release, but it does not make acquisition-side cleanup safe.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard-session.ts` around lines 445 - 462, The helper
unlinkIfInodeMatches currently does separate pathname stat and unlink which
allows a race where a new LOCK_FILE can be created between the two operations;
change it to open the file first (fs.openSync(filePath, 'r')) and then use
fs.fstatSync(fd) to compare the inode (stat.ino === expectedInode), close the
fd, and only then call fs.unlinkSync(filePath) if the fstat matched; update
unlinkIfInodeMatches to perform open -> fstat -> close -> unlink sequence (using
the same filePath and expectedInode symbols) and propagate errors the same way.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/onboard-session.test.ts`:
- Around line 195-260: The simulated race currently swaps the lock file on
statCallCount === 1, which lets acquireOnboardLock read the fresh file and skip
the stale-cleanup path; instead delay the swap until the cleanup helper's window
so unlinkIfInodeMatches is exercised. Update the fs.statSync shim used in the
test (the overridden function that increments statCallCount) to perform the
unlink+recreate swap on statCallCount === 2 (or otherwise detect the second stat
call coming from unlinkIfInodeMatches) so the first stat returns the original
stale inode and the second stat inside unlinkIfInodeMatches sees a different
inode, then assert via session.acquireOnboardLock, session.LOCK_FILE,
unlinkIfInodeMatches and isProcessAlive as before.

In `@src/lib/onboard-session.ts`:
- Around line 421-423: The current call fs.writeSync(fd, payload) may perform a
short write and truncate the JSON; update the write logic in onboard-session.ts
where fs.writeSync is used (the fd and payload variables) to loop until the
entire payload buffer/string is written (track bytesWritten and call
fs.writeSync repeatedly with an offset/length until bytesWritten ===
payload.length), or replace with a platform-safe helper that guarantees
full-write semantics; ensure the loop handles the returned byte count and throws
on unexpected errors.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 630ea7ac-04b8-499f-b421-a51b7a669c1b

📥 Commits

Reviewing files that changed from the base of the PR and between 687e360 and 951f9ca.

📒 Files selected for processing (2)
  • src/lib/onboard-session.test.ts
  • src/lib/onboard-session.ts

Comment thread src/lib/onboard-session.test.ts Outdated
Comment on lines +421 to +423
try {
fs.writeSync(fd, payload);
} catch (writeError) {
Contributor


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Check the file exists and read the relevant section
head -n 5 src/lib/onboard-session.ts

Repository: NVIDIA/NemoClaw

Length of output: 269


🏁 Script executed:

# Read the code around lines 421-423 with broader context
sed -n '410,435p' src/lib/onboard-session.ts

Repository: NVIDIA/NemoClaw

Length of output: 932


🏁 Script executed:

# Check for SPDX header compliance
head -n 5 src/lib/onboard-session.ts

Repository: NVIDIA/NemoClaw

Length of output: 269


🏁 Script executed:

# Get the full lock file management section to understand the context
sed -n '333,463p' src/lib/onboard-session.ts | cat -n

Repository: NVIDIA/NemoClaw

Length of output: 6028


🌐 Web query:

Node.js fs.writeFileSync file descriptor parameter usage

💡 Result:

In fs.writeFileSync(file, data[, options]), the first argument is not a special “fd parameter”—it’s the file parameter, and it can be either:

  • a path (string | Buffer | URL), or
  • a numeric file descriptor (integer) you previously opened with fs.openSync() (or received elsewhere). (nodejs.org)

Using a file descriptor (fd)

import fs from "node:fs";

const fd = fs.openSync("out.txt", "w");   // open -> fd
fs.writeFileSync(fd, "hello\n");          // write using the fd
fs.closeSync(fd);                         // you must close it

Key behaviors when file is an fd:

  • It behaves like writing directly via fs.write() (Node even recommends using fs.write() for the fd case). (nodejs.org)
  • The file is not automatically “replaced/truncated” the way it is when you pass a filename; data is written at the current fd position, and old bytes may remain before/after your write. (nodejs.org)
  • Because you opened the fd yourself, you’re responsible for closing it (otherwise you can leak descriptors). (This is explicitly called out for similar “*FileSync” APIs that accept an fd.) (nodejs.org)


🌐 Web query:

Node.js fs.writeSync short write behavior regular files

💡 Result:

fs.writeSync() can legally perform a short write (return a value < length) even when fd refers to a regular file.

  • In Node.js, fs.writeSync(...) returns “the number of bytes written” (it does not promise to write the full requested length). If it returns n, only the first n bytes were written and you must write the remainder yourself if you need “write all”. [1]
  • On POSIX systems, the underlying write(2) may write fewer than count bytes for reasons such as insufficient space, process file-size limits (RLIMIT_FSIZE), or being interrupted by a signal after writing some bytes. [2]

Practical guidance: if you need a “guaranteed full write”, loop until all bytes are written (or an exception is thrown), even for regular files.

Sources: [1] Node.js fs.writeSync docs, [2] Linux write(2) semantics.


Ensure fs.writeSync() writes the full payload in a single call, or loop until all bytes are written.

fs.writeSync(fd, payload) can return fewer bytes than requested, leaving the JSON truncated. For the lock mechanism to function correctly, the entire payload must be written atomically. The suggested fs.writeFileSync() has identical short-write behavior on file descriptors and does not solve this.

Either ensure the payload fits in a single fs.writeSync() call (it does at ~150–300 bytes, but this is not guaranteed across all platforms and configurations), or loop until all bytes are written:

const buf = Buffer.from(payload);
let bytesWritten = 0;
while (bytesWritten < buf.length) {
  bytesWritten += fs.writeSync(fd, buf, bytesWritten, buf.length - bytesWritten);
}

Alternatively, use a platform-safe wrapper if the codebase has one.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard-session.ts` around lines 421 - 423, The current call
fs.writeSync(fd, payload) may perform a short write and truncate the JSON;
update the write logic in onboard-session.ts where fs.writeSync is used (the fd
and payload variables) to loop until the entire payload buffer/string is written
(track bytesWritten and call fs.writeSync repeatedly with an offset/length until
bytesWritten === payload.length), or replace with a platform-safe helper that
guarantees full-write semantics; ensure the loop handles the returned byte count
and throws on unexpected errors.

@wscurran wscurran added the NemoClaw CLI label (Use this label to identify issues with the NemoClaw command-line interface (CLI)) and the fix label Apr 9, 2026
@wscurran
Contributor

wscurran commented Apr 9, 2026

✨ Thanks for submitting this PR, which proposes a way to fix a bug in the onboard lock mechanism and may improve the reliability of the NemoClaw CLI.


@ColinM-sys ColinM-sys force-pushed the fix/1281-onboard-lock-stale-cleanup-race branch from 951f9ca to 5e484d2 on April 10, 2026 at 01:18
@wscurran
Contributor

Thanks for tracking down this race in acquireOnboardLock — stale-lock cleanup races during concurrent onboard operations could be a real source of unpredictable failures. The codebase has changed significantly since this was opened, including a TypeScript migration (#1673). Could you rebase onto the current main? Also noting you have three other open PRs (#1667, #1676, #1677) — a joint rebase across all four would be the most efficient path. We appreciate the dedication to these fixes.

@wscurran wscurran added the status: rebase label (PR needs to be rebased against main before review can continue) Apr 15, 2026
@ColinM-sys ColinM-sys force-pushed the fix/1281-onboard-lock-stale-cleanup-race branch from 5e484d2 to ce43a45 on April 15, 2026 at 02:23
@ColinM-sys
Contributor Author

Rebased all four (#1656, #1667, #1676, #1677) onto current main. Thank you for the review!

Contributor

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (2)
src/lib/onboard-session.ts (2)

425-430: ⚠️ Potential issue | 🟠 Major

Handle short writes before publishing the lock.

fs.writeSync(fd, payload) can return fewer bytes than requested. If that happens here, this code still returns acquired: true with a truncated lock file, and after a crash that malformed file becomes non-reclaimable because stale cleanup intentionally skips unparsable locks.

🔧 Suggested fix
-      fs.writeSync(fd, payload);
+      const buffer = Buffer.from(payload);
+      let offset = 0;
+      while (offset < buffer.length) {
+        const written = fs.writeSync(fd, buffer, offset, buffer.length - offset);
+        if (written <= 0) {
+          throw new Error("Failed to fully write onboard lock payload");
+        }
+        offset += written;
+      }
Does Node.js fs.writeSync(...) on a regular file descriptor guarantee a full write, or can it legally return fewer bytes than requested?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard-session.ts` around lines 425 - 430, The write call may
perform a short write; replace the single fs.writeSync(fd, payload) with a loop
that repeatedly calls fs.writeSync(fd, payload, offset, length) (or slices)
until totalBytesWritten === payload.length, updating offset/remaining each
iteration, and treat any 0/negative return as an error; on any write error
ensure you still run the existing cleanup (fs.closeSync(fd) and
fs.unlinkSync(LOCK_FILE) in their try/catch blocks) and rethrow so a partially
written/truncated LOCK_FILE is never treated as a successful acquire.

439-467: ⚠️ Potential issue | 🔴 Critical

The stale-cleanup helper can still delete a fresh lock.

This narrows the window, but it does not close it: two contenders can both observe stale inode A, the faster one can unlink A and create B, and the slower one can still reach Line 466 and unlink B because statSync() and unlinkSync() are separate pathname lookups. heldLockFd only makes release safer; it does not prevent this acquire-time interleaving, so mutual exclusion can still break.

Does Node.js expose any atomic filesystem primitive to unlink a path only if it still refers to a previously observed inode, or are fs.statSync(...) and fs.unlinkSync(...) always separate pathname-based operations?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard-session.ts` around lines 439 - 467, The current
unlinkIfInodeMatches(filePath, expectedInode) still allows an acquire-time race
where two processes see the same inode and one can unlink a freshly created
lock; stop attempting an unconditional pathname-based unlink during acquisition
and instead rely on atomic creation (fs.openSync with 'wx' / writeFileSync(wx))
to win the race. Change unlinkIfInodeMatches usage so that we only remove a lock
in the stale-cleanup path when we either (a) hold the lock file descriptor
(heldLockFd) and verify via fs.fstatSync(heldLockFd) that the inode matches
expectedInode before unlinking, or (b) the file is demonstrably stale (mtime
threshold) and we still treat failure to create with 'wx' as a lost race. In
practice: remove the unlink-if-seen code from the acquire flow, ensure acquire
uses fs.openSync(filePath, 'wx')/writeFileSync(wx) atomically, and in
unlinkIfInodeMatches only perform unlink when validated via the open fd
(heldLockFd) or after stale-time checks to avoid the separate pathname lookup
race.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 77acc3cf-f379-42e5-b6d2-72e3e6e66c16

📥 Commits

Reviewing files that changed from the base of the PR and between 5e484d2 and ce43a45.

📒 Files selected for processing (2)
  • src/lib/onboard-session.test.ts
  • src/lib/onboard-session.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/onboard-session.test.ts

@ColinM-sys
Contributor Author

Rebased onto current main.

Contributor

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (2)
src/lib/onboard-session.ts (2)

422-427: ⚠️ Potential issue | 🟠 Major

Write the full lock payload before returning success.

Line 426 assumes fs.writeSync() writes the entire payload in one call. If it short-writes, this returns acquired: true with truncated JSON on disk, and every other process will treat the lock as malformed and stop reclaiming it.

Proposed fix
-    try {
-      fs.writeSync(fd, payload);
+    try {
+      const bytes = Buffer.from(payload, "utf8");
+      let offset = 0;
+      while (offset < bytes.length) {
+        offset += fs.writeSync(fd, bytes, offset, bytes.length - offset);
+      }
Does Node.js `fs.writeSync()` on a regular file guarantee writing the entire string/buffer in one call, or can it legally return a short write that requires looping until all bytes are written?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard-session.ts` around lines 422 - 427, The current atomic-create
success path writes the lock payload with a single fs.writeSync(fd, payload)
call (in the function handling onboard lock creation) which can short-write and
leave truncated JSON while still returning acquired: true; change this to loop
until the entire payload buffer is written (tracking bytesWritten and advancing
the buffer/offset) before returning success and leaving the fd open for
releaseOnboardLock() to verify ownership, ensuring the written JSON is complete
and valid for other processes to read.

412-419: ⚠️ Potential issue | 🔴 Critical

The stale-lock reclaim path is still racy.

Between Line 454 and Line 466, another process can recreate LOCK_FILE, and this helper will then unlink that fresh lock. The slower contender can immediately win the next openSync("wx"), so two onboard runs can still proceed concurrently. heldLockFd only hardens releaseOnboardLock(); it does not make this reclaim path safe.

Does Node.js expose any atomic filesystem operation that unlinks a pathname only if it still refers to a previously observed inode, or are `fs.statSync()` and `fs.unlinkSync()` always separate pathname-based operations?

Also applies to: 439-472

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard-session.ts` around lines 412 - 419, The stale-lock reclaim
path is racy because unlinkIfInodeMatches(LOCK_FILE, staleInode) can remove a
freshly created lock; remove this unsafe unlink and instead stop trying to
forcibly reclaim by inode. Change the logic in the lock-acquire loop (the code
calling unlinkIfInodeMatches, around the functions/methods handling LOCK_FILE
and heldLockFd) to: when you detect a stale inode, do not unlink; sleep/back off
and retry opening LOCK_FILE with fs.openSync(..., "wx") until you succeed (or
time out); keep releaseOnboardLock() and heldLockFd behavior for clean release
but do not attempt inode-based unlinks from this reclaim path. This removes the
race without relying on an atomic unlink-by-inode operation (which Node.js does
not provide).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/lib/onboard-session.ts`:
- Around line 422-427: The current atomic-create success path writes the lock
payload with a single fs.writeSync(fd, payload) call (in the function handling
onboard lock creation) which can short-write and leave truncated JSON while
still returning acquired: true; change this to loop until the entire payload
buffer is written (tracking bytesWritten and advancing the buffer/offset) before
returning success and leaving the fd open for releaseOnboardLock() to verify
ownership, ensuring the written JSON is complete and valid for other processes
to read.
- Around line 412-419: The stale-lock reclaim path is racy because
unlinkIfInodeMatches(LOCK_FILE, staleInode) can remove a freshly created lock;
remove this unsafe unlink and instead stop trying to forcibly reclaim by inode.
Change the logic in the lock-acquire loop (the code calling
unlinkIfInodeMatches, around the functions/methods handling LOCK_FILE and
heldLockFd) to: when you detect a stale inode, do not unlink; sleep/back off and
retry opening LOCK_FILE with fs.openSync(..., "wx") until you succeed (or time
out); keep releaseOnboardLock() and heldLockFd behavior for clean release but do
not attempt inode-based unlinks from this reclaim path. This removes the race
without relying on an atomic unlink-by-inode operation (which Node.js does not
provide).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: f9dade3d-824b-4414-a83a-7c941eadc7cf

📥 Commits

Reviewing files that changed from the base of the PR and between ce43a45 and 5d31136.

📒 Files selected for processing (2)
  • src/lib/onboard-session.test.ts
  • src/lib/onboard-session.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/onboard-session.test.ts

@wscurran wscurran removed the status: rebase PR needs to be rebased against main before review can continue label Apr 15, 2026
@cv cv added the v0.0.18 Release target label Apr 16, 2026
The acquireOnboardLock stale-cleanup path read a stale lock, decided
the holder was dead, and unconditionally unlinked LOCK_FILE before
retrying writeFileSync(wx). Two concurrent processes that both observe
the same stale lock will both try to clean it up — and the slower of
the two can unlink the *fresh* lock the faster process just claimed,
breaking mutual exclusion: both processes end up holding 'their' lock
and onboard runs in parallel against the same shared session state.

Reported as NVIDIA#1281, originally surfaced by CodeRabbit on NVIDIA#1272.

Fix: capture the stale file's inode via fs.statSync({ bigint: true })
at the same time we read its contents, then in a new
unlinkIfInodeMatches() helper, re-stat right before fs.unlinkSync and
bail if the inode has changed. The dual stat-then-unlink is the only
portable POSIX primitive Node exposes for this — there is no atomic
"unlink-if-inode" syscall — so a sufficiently unlucky race can still
slip through. The window is orders of magnitude smaller than the
unconditional unlink it replaces, and the outer retry loop will detect
a wrong unlink on its next writeFileSync(wx) attempt because either
we re-create the file or we observe a new lock with a different inode.

Also bumps MAX_ATTEMPTS from 2 to 5 because the inode-verified cleanup
can take a few more spins under contention before one cleaner wins.

Adds a regression test that simulates the race deterministically by
wrapping fs.statSync so the first stat succeeds against the original
stale inode, then atomically swaps the lock file (unlink + recreate)
to give it a new inode before unlinkIfInodeMatches re-stats it. The
test asserts the fresh claim survives the race and is the file on
disk after acquireOnboardLock returns.

Verified by stashing the source fix and re-running: the new test
fails on the unguarded code as expected, and passes with the inode
guard in place.

Closes NVIDIA#1281
CodeRabbit flagged a residual TOCTOU window in unlinkIfInodeMatches:
between statSync and unlinkSync, another process could replace the
file. Switch the acquisition primitive to fs.openSync(LOCK_FILE, "wx",
0o600) and keep the resulting file descriptor at module scope for the
lifetime of the lock. On release, compare fstatSync(fd).ino against
statSync(LOCK_FILE).ino — if the two diverge, another process owns
the path now and we leave it alone.

The legacy pid-based release fallback is preserved so tests that
write the lock file directly (without going through acquireOnboardLock)
keep their existing semantics for malformed and foreign-pid locks.

Also tighten the NVIDIA#1281 regression test per CodeRabbit nit: the
simulated concurrent writer now uses process.ppid (a distinct live
PID, not process.pid) so the test asserts the mutual-exclusion loser
path — result.acquired === false and result.holderPid === ppid —
rather than ambiguously accepting either outcome.

Refs: NVIDIA#1281
Signed-off-by: ColinM-sys <cmcdonough@50words.com>
@ColinM-sys ColinM-sys force-pushed the fix/1281-onboard-lock-stale-cleanup-race branch from 5d31136 to 19219a3 Compare April 16, 2026 15:22
@prekshivyas prekshivyas self-assigned this Apr 16, 2026
Contributor

@prekshivyas prekshivyas left a comment


Code fix is solid — inode-verified stale-lock cleanup with fd-based ownership in release. Pushed a test fix: swap now happens just before stat #2 (inside unlinkIfInodeMatches) so the stale-cleanup branch is actually exercised. Uses write-to-temp + rename instead of unlink + recreate to guarantee a different inode on tmpfs/overlayfs (which can reuse inodes, causing the CI failure). All 15 lock tests pass.

@prekshivyas prekshivyas force-pushed the fix/1281-onboard-lock-stale-cleanup-race branch from f142971 to 4dfc607 Compare April 16, 2026 21:26
@prekshivyas prekshivyas requested a review from cv April 16, 2026 21:30
…exercised

CodeRabbit correctly flagged that swapping on stat #1 caused
readFileSync to see the live PID and exit via isProcessAlive —
unlinkIfInodeMatches was never called. Move the swap to just before
stat #2 (inside unlinkIfInodeMatches): stat #1 reads the original
stale inode, readFileSync sees the dead PID, isProcessAlive returns
false, stale-cleanup runs, and stat #2 sees the new inode and skips
the unlink.

Use write-to-temp + rename instead of unlink + recreate to guarantee
a different inode even on tmpfs/overlayfs which can reuse inodes.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
@prekshivyas prekshivyas force-pushed the fix/1281-onboard-lock-stale-cleanup-race branch from af5c877 to 6629df6 Compare April 16, 2026 21:34
@ericksoa ericksoa merged commit 18cbc63 into NVIDIA:main Apr 16, 2026
12 checks passed

Labels

  • fix
  • NemoClaw CLI: Use this label to identify issues with the NemoClaw command-line interface (CLI).
  • v0.0.18 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(cli): stale-lock cleanup race in acquireOnboardLock

5 participants