fix(cli): close stale-lock cleanup race in acquireOnboardLock (#1281)#1656
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings.
📝 Walkthrough

Hardened onboard lock acquisition and release to avoid stale-lock cleanup races: increased retry attempts (2→5), create locks using an atomic open/write and track the descriptor, perform inode-verified stale unlinking, prefer fd/inode checks on release, and added a Vitest regression that simulates the race.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant ProcA as Process A
    participant Acquire as acquireOnboardLock
    participant FS as Filesystem
    participant ProcB as Process B
    ProcA->>Acquire: attempt acquire
    Acquire->>FS: stat/read LOCK_FILE -> sees stale PID
    Acquire->>FS: stat(LOCK_FILE,{bigint:true}) -> capture staleInode
    Acquire->>FS: unlinkIfInodeMatches(LOCK_FILE, staleInode)
    activate FS
    FS-->>Acquire: unlink success or ENOENT
    deactivate FS
    ProcB->>FS: concurrently creates fresh lock (open "wx" / write) -> new inode
    FS-->>ProcB: lock created
    Acquire->>FS: try openSync(LOCK_FILE,"wx")
    alt open fails (EEXIST)
        Acquire->>FS: read current lock -> reports ProcB pid
        Acquire-->>ProcA: return {acquired:false, holderPid:ProcB}
    else open succeeds
        Acquire-->>ProcA: return {acquired:true}
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed

❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/lib/onboard-session.test.ts (1)
219-251: Use a distinct live PID and assert the loser path.

This regression currently reuses `process.pid` for the simulated fresh claimant and then allows either `acquired` result. That means a future change that wrongly treats the replacement lock as self-owned can still pass, even though the slower contender should lose. Make the replacement claim use a different live PID and assert `result.acquired === false` / `result.holderPid` so the test exercises the mutual-exclusion contract, not just file survival.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard-session.test.ts` around lines 219 - 251, The test currently simulates a fresh claimant using process.pid which allows acquireOnboardLock to ambiguously succeed; modify the simulated concurrent writer to use a distinct live PID (not process.pid) when writing session.LOCK_FILE, then call session.acquireOnboardLock("nemoclaw onboard --resume") and assert the mutual-exclusion loser path by checking result.acquired === false and result.holderPid equals the distinct PID you wrote, while still asserting the on-disk lock content contains the fresh claim string.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/onboard-session.ts`:
- Around line 410-443: The current unlinkIfInodeMatches(filePath, expectedInode)
still allows a TOCTOU: another process can replace the file between statSync and
unlinkSync. Replace this approach by acquiring the lock via an atomic create
(fs.openSync(lockPath, 'wx')) and keep the returned file descriptor for the
lifetime of the lock; on release, use the descriptor to confirm ownership
(fs.fstatSync(fd).ino) and then close the fd and unlink the path only if
fs.statSync(path).ino matches that fstat ino — this ensures you created the file
atomically (no race at acquisition) and check ownership against the live
descriptor on release instead of relying on a stale expectedInode via
unlinkIfInodeMatches; remove or stop using unlinkIfInodeMatches and update the
lock acquisition/release code to use the create-with-'wx' + hold-fd pattern
(reference functions/file: unlinkIfInodeMatches, lock acquisition code that
calls writeFileSync(..., { flag: "wx" }) and any release/cleanup that currently
uses unlinkIfInodeMatches).
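A minimal sketch of that create-with-`"wx"` + hold-fd pattern (illustrative only; `lockPath`, the payload shape, and the module-scoped `heldFd` are stand-ins, not the PR's actual code):

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

const lockPath = path.join(os.tmpdir(), `onboard-sketch-${process.pid}.lock`);
let heldFd: number | null = null;

function acquire(): boolean {
  try {
    // O_CREAT|O_EXCL: fails with EEXIST if the path already exists,
    // so creation itself decides the winner; no pre-check needed.
    heldFd = fs.openSync(lockPath, "wx", 0o600);
    fs.writeSync(heldFd, JSON.stringify({ pid: process.pid }));
    return true;
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code === "EEXIST") return false;
    throw err;
  }
}

function release(): void {
  if (heldFd === null) return;
  // The held fd pins our inode even if the path was since replaced,
  // so only unlink when the path still refers to the inode we created.
  const ours = fs.fstatSync(heldFd).ino;
  fs.closeSync(heldFd);
  heldFd = null;
  try {
    if (fs.statSync(lockPath).ino === ours) fs.unlinkSync(lockPath);
  } catch {
    // ENOENT: someone already removed it; nothing to do.
  }
}
```

Because the fd stays open for the lock's lifetime, release can verify ownership against the live descriptor instead of a previously recorded inode.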
---
Nitpick comments:
In `@src/lib/onboard-session.test.ts`:
- Around line 219-251: The test currently simulates a fresh claimant using
process.pid which allows acquireOnboardLock to ambiguously succeed; modify the
simulated concurrent writer to use a distinct live PID (not process.pid) when
writing session.LOCK_FILE, then call session.acquireOnboardLock("nemoclaw
onboard --resume") and assert the mutual-exclusion loser path by checking
result.acquired === false and result.holderPid equals the distinct PID you
wrote, while still asserting the on-disk lock content contains the fresh claim
string.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c2579469-1b12-4b69-a79f-a071bb49e84e
📒 Files selected for processing (2)
- src/lib/onboard-session.test.ts
- src/lib/onboard-session.ts
Actionable comments posted: 2
♻️ Duplicate comments (1)
src/lib/onboard-session.ts (1)
445-462: ⚠️ Potential issue | 🔴 Critical: The stale-cleanup race is still present here.

`fs.statSync()` and `fs.unlinkSync()` are still separate pathname operations. If another contender recreates `LOCK_FILE` after the inode check but before `fs.unlinkSync(filePath)`, this helper will delete that fresh lock, and the next `openSync("wx")` retry can still succeed while the earlier claimant already returned `acquired: true`. The fd-based ownership check hardens release, but it does not make acquisition-side cleanup safe.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard-session.ts` around lines 445 - 462, The helper unlinkIfInodeMatches currently does separate pathname stat and unlink which allows a race where a new LOCK_FILE can be created between the two operations; change it to open the file first (fs.openSync(filePath, 'r')) and then use fs.fstatSync(fd) to compare the inode (stat.ino === expectedInode), close the fd, and only then call fs.unlinkSync(filePath) if the fstat matched; update unlinkIfInodeMatches to perform open -> fstat -> close -> unlink sequence (using the same filePath and expectedInode symbols) and propagate errors the same way.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/onboard-session.test.ts`:
- Around line 195-260: The simulated race currently swaps the lock file on
statCallCount === 1, which lets acquireOnboardLock read the fresh file and skip
the stale-cleanup path; instead delay the swap until the cleanup helper's window
so unlinkIfInodeMatches is exercised. Update the fs.statSync shim used in the
test (the overridden function that increments statCallCount) to perform the
unlink+recreate swap on statCallCount === 2 (or otherwise detect the second stat
call coming from unlinkIfInodeMatches) so the first stat returns the original
stale inode and the second stat inside unlinkIfInodeMatches sees a different
inode, then assert via session.acquireOnboardLock, session.LOCK_FILE,
unlinkIfInodeMatches and isProcessAlive as before.
In `@src/lib/onboard-session.ts`:
- Around line 421-423: The current call fs.writeSync(fd, payload) may perform a
short write and truncate the JSON; update the write logic in onboard-session.ts
where fs.writeSync is used (the fd and payload variables) to loop until the
entire payload buffer/string is written (track bytesWritten and call
fs.writeSync repeatedly with an offset/length until bytesWritten ===
payload.length), or replace with a platform-safe helper that guarantees
full-write semantics; ensure the loop handles the returned byte count and throws
on unexpected errors.
---
Duplicate comments:
In `@src/lib/onboard-session.ts`:
- Around line 445-462: The helper unlinkIfInodeMatches currently does separate
pathname stat and unlink which allows a race where a new LOCK_FILE can be
created between the two operations; change it to open the file first
(fs.openSync(filePath, 'r')) and then use fs.fstatSync(fd) to compare the inode
(stat.ino === expectedInode), close the fd, and only then call
fs.unlinkSync(filePath) if the fstat matched; update unlinkIfInodeMatches to
perform open -> fstat -> close -> unlink sequence (using the same filePath and
expectedInode symbols) and propagate errors the same way.
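The open, fstat, close, unlink sequence that prompt describes might look like this sketch (hypothetical helper name; note the final `unlinkSync` is still a pathname operation, so this narrows the TOCTOU window rather than closing it):

```typescript
import * as fs from "node:fs";

// Hypothetical helper: unlink filePath only if it still appears to be
// the inode observed earlier. The fstat runs against an open fd, but
// the closing unlinkSync is a fresh pathname lookup, so a replacement
// between the two calls can still be deleted in an unlucky race.
function unlinkIfInodeMatchesSketch(
  filePath: string,
  expectedInode: bigint,
): boolean {
  let fd = -1;
  try {
    fd = fs.openSync(filePath, "r");
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code === "ENOENT") return false;
    throw err;
  }
  try {
    const live = fs.fstatSync(fd, { bigint: true }).ino;
    if (live !== expectedInode) return false; // someone replaced the file
  } finally {
    fs.closeSync(fd);
  }
  fs.unlinkSync(filePath); // still racy: path may change before this line
  return true;
}
```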
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 630ea7ac-04b8-499f-b421-a51b7a669c1b
📒 Files selected for processing (2)
- src/lib/onboard-session.test.ts
- src/lib/onboard-session.ts
```ts
try {
  fs.writeSync(fd, payload);
} catch (writeError) {
```
🧩 Analysis chain
🏁 Script executed:

```shell
# Check the file exists and read the relevant section
head -n 5 src/lib/onboard-session.ts
```

Repository: NVIDIA/NemoClaw
Length of output: 269

🏁 Script executed:

```shell
# Read the code around lines 421-423 with broader context
sed -n '410,435p' src/lib/onboard-session.ts
```

Repository: NVIDIA/NemoClaw
Length of output: 932

🏁 Script executed:

```shell
# Check for SPDX header compliance
head -n 5 src/lib/onboard-session.ts
```

Repository: NVIDIA/NemoClaw
Length of output: 269

🏁 Script executed:

```shell
# Get the full lock file management section to understand the context
sed -n '333,463p' src/lib/onboard-session.ts | cat -n
```

Repository: NVIDIA/NemoClaw
Length of output: 6028
🌐 Web query:
Node.js fs.writeFileSync file descriptor parameter usage
💡 Result:
In `fs.writeFileSync(file, data[, options])`, the first argument is not a special "fd parameter"—it's the `file` parameter, and it can be either:
- a path (`string | Buffer | URL`), or
- a numeric file descriptor (`integer`) you previously opened with `fs.openSync()` (or received elsewhere). (nodejs.org)
Using a file descriptor (fd)
```js
import fs from "node:fs";

const fd = fs.openSync("out.txt", "w"); // open -> fd
fs.writeFileSync(fd, "hello\n");        // write using the fd
fs.closeSync(fd);                       // you must close it
```

Key behaviors when `file` is an fd:
- It behaves like writing directly via `fs.write()` (Node even recommends using `fs.write()` for the fd case). (nodejs.org)
- The file is not automatically "replaced/truncated" the way it is when you pass a filename; data is written at the current fd position, and old bytes may remain before/after your write. (nodejs.org)
- Because you opened the fd yourself, you're responsible for closing it (otherwise you can leak descriptors). (This is explicitly called out for similar "*FileSync" APIs that accept an fd.) (nodejs.org)
Citations:
- 1: https://nodejs.org/api/fs.html
- 2: https://nodejs.org/api/fs.html
- 3: https://nodejs.org/api/fs.html
- 4: https://nodejs.org/api/fs.html
🌐 Web query:
Node.js fs.writeSync short write behavior regular files
💡 Result:
fs.writeSync() can legally perform a short write (return a value < length) even when fd refers to a regular file.
- In Node.js, `fs.writeSync(...)` returns "the number of bytes written" (it does not promise to write the full requested length). If it returns `n`, only the first `n` bytes were written and you must write the remainder yourself if you need "write all". [1]
- On POSIX systems, the underlying `write(2)` may write fewer than `count` bytes for reasons such as insufficient space, process file-size limits (RLIMIT_FSIZE), or being interrupted by a signal after writing some bytes. [2]
Practical guidance: if you need a “guaranteed full write”, loop until all bytes are written (or an exception is thrown), even for regular files.
Sources: [1] Node.js fs.writeSync docs, [2] Linux write(2) semantics.
Ensure fs.writeSync() writes the full payload in a single call, or loop until all bytes are written.
fs.writeSync(fd, payload) can return fewer bytes than requested, leaving the JSON truncated. For the lock mechanism to function correctly, the entire payload must be written atomically. The suggested fs.writeFileSync() has identical short-write behavior on file descriptors and does not solve this.
Either ensure the payload fits in a single fs.writeSync() call (it does at ~150–300 bytes, but this is not guaranteed across all platforms and configurations), or loop until all bytes are written:
```js
let bytesWritten = 0;
while (bytesWritten < payload.length) {
  bytesWritten += fs.writeSync(fd, payload, bytesWritten);
}
```

Alternatively, use a platform-safe wrapper if the codebase has one.
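As a self-contained variant (a sketch; it passes a `Buffer` because `fs.writeSync`'s third argument is a byte offset only for buffers, whereas for strings it is a file position):

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Write the entire buffer, tolerating short writes from fs.writeSync.
function writeAllSync(fd: number, data: Buffer): void {
  let offset = 0;
  while (offset < data.length) {
    // For buffers the args are (fd, buffer, offset, length): each call
    // resumes exactly where the previous short write stopped.
    const written = fs.writeSync(fd, data, offset, data.length - offset);
    if (written <= 0) {
      throw new Error("fs.writeSync made no progress");
    }
    offset += written;
  }
}

// Usage: write a JSON payload fully before treating it as published.
const file = path.join(os.tmpdir(), `writeall-demo-${process.pid}.json`);
const fd = fs.openSync(file, "w");
writeAllSync(fd, Buffer.from(JSON.stringify({ pid: process.pid })));
fs.closeSync(fd);
```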
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lib/onboard-session.ts` around lines 421 - 423, The current call
fs.writeSync(fd, payload) may perform a short write and truncate the JSON;
update the write logic in onboard-session.ts where fs.writeSync is used (the fd
and payload variables) to loop until the entire payload buffer/string is written
(track bytesWritten and call fs.writeSync repeatedly with an offset/length until
bytesWritten === payload.length), or replace with a platform-safe helper that
guarantees full-write semantics; ensure the loop handles the returned byte count
and throws on unexpected errors.
✨ Thanks for submitting this PR, which proposes a way to fix a bug in the onboard lock mechanism and may improve the reliability of the NemoClaw CLI.
951f9ca to 5e484d2 (Compare)
Thanks for tracking down this race.
5e484d2 to ce43a45 (Compare)
♻️ Duplicate comments (2)
src/lib/onboard-session.ts (2)
425-430: ⚠️ Potential issue | 🟠 Major: Handle short writes before publishing the lock.

`fs.writeSync(fd, payload)` can return fewer bytes than requested. If that happens here, this code still returns `acquired: true` with a truncated lock file, and after a crash that malformed file becomes non-reclaimable because stale cleanup intentionally skips unparsable locks.

🔧 Suggested fix

```diff
- fs.writeSync(fd, payload);
+ const buffer = Buffer.from(payload);
+ let offset = 0;
+ while (offset < buffer.length) {
+   const written = fs.writeSync(fd, buffer, offset, buffer.length - offset);
+   if (written <= 0) {
+     throw new Error("Failed to fully write onboard lock payload");
+   }
+   offset += written;
+ }
```

Does Node.js `fs.writeSync(...)` on a regular file descriptor guarantee a full write, or can it legally return fewer bytes than requested?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard-session.ts` around lines 425 - 430, The write call may perform a short write; replace the single fs.writeSync(fd, payload) with a loop that repeatedly calls fs.writeSync(fd, payload, offset, length) (or slices) until totalBytesWritten === payload.length, updating offset/remaining each iteration, and treat any 0/negative return as an error; on any write error ensure you still run the existing cleanup (fs.closeSync(fd) and fs.unlinkSync(LOCK_FILE) in their try/catch blocks) and rethrow so a partially written/truncated LOCK_FILE is never treated as a successful acquire.
439-467: ⚠️ Potential issue | 🔴 Critical: The stale-cleanup helper can still delete a fresh lock.

This narrows the window, but it does not close it: two contenders can both observe stale inode A, the faster one can unlink A and create B, and the slower one can still reach Line 466 and unlink B, because `statSync()` and `unlinkSync()` are separate pathname lookups. `heldLockFd` only makes release safer; it does not prevent this acquire-time interleaving, so mutual exclusion can still break.

Does Node.js expose any atomic filesystem primitive to unlink a path only if it still refers to a previously observed inode, or are `fs.statSync(...)` and `fs.unlinkSync(...)` always separate pathname-based operations?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard-session.ts` around lines 439 - 467, The current unlinkIfInodeMatches(filePath, expectedInode) still allows an acquire-time race where two processes see the same inode and one can unlink a freshly created lock; stop attempting an unconditional pathname-based unlink during acquisition and instead rely on atomic creation (fs.openSync with 'wx' / writeFileSync(wx)) to win the race. Change unlinkIfInodeMatches usage so that we only remove a lock in the stale-cleanup path when we either (a) hold the lock file descriptor (heldLockFd) and verify via fs.fstatSync(heldLockFd) that the inode matches expectedInode before unlinking, or (b) the file is demonstrably stale (mtime threshold) and we still treat failure to create with 'wx' as a lost race. In practice: remove the unlink-if-seen code from the acquire flow, ensure acquire uses fs.openSync(filePath, 'wx')/writeFileSync(wx) atomically, and in unlinkIfInodeMatches only perform unlink when validated via the open fd (heldLockFd) or after stale-time checks to avoid the separate pathname lookup race.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/lib/onboard-session.ts`:
- Around line 425-430: The write call may perform a short write; replace the
single fs.writeSync(fd, payload) with a loop that repeatedly calls
fs.writeSync(fd, payload, offset, length) (or slices) until totalBytesWritten
=== payload.length, updating offset/remaining each iteration, and treat any
0/negative return as an error; on any write error ensure you still run the
existing cleanup (fs.closeSync(fd) and fs.unlinkSync(LOCK_FILE) in their
try/catch blocks) and rethrow so a partially written/truncated LOCK_FILE is
never treated as a successful acquire.
- Around line 439-467: The current unlinkIfInodeMatches(filePath, expectedInode)
still allows an acquire-time race where two processes see the same inode and one
can unlink a freshly created lock; stop attempting an unconditional
pathname-based unlink during acquisition and instead rely on atomic creation
(fs.openSync with 'wx' / writeFileSync(wx)) to win the race. Change
unlinkIfInodeMatches usage so that we only remove a lock in the stale-cleanup
path when we either (a) hold the lock file descriptor (heldLockFd) and verify
via fs.fstatSync(heldLockFd) that the inode matches expectedInode before
unlinking, or (b) the file is demonstrably stale (mtime threshold) and we still
treat failure to create with 'wx' as a lost race. In practice: remove the
unlink-if-seen code from the acquire flow, ensure acquire uses
fs.openSync(filePath, 'wx')/writeFileSync(wx) atomically, and in
unlinkIfInodeMatches only perform unlink when validated via the open fd
(heldLockFd) or after stale-time checks to avoid the separate pathname lookup
race.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 77acc3cf-f379-42e5-b6d2-72e3e6e66c16
📒 Files selected for processing (2)
- src/lib/onboard-session.test.ts
- src/lib/onboard-session.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- src/lib/onboard-session.test.ts
ce43a45 to 5d31136 (Compare)
Rebased onto current main.
♻️ Duplicate comments (2)
src/lib/onboard-session.ts (2)
422-427: ⚠️ Potential issue | 🟠 Major: Write the full lock payload before returning success.

Line 426 assumes `fs.writeSync()` writes the entire payload in one call. If it short-writes, this returns `acquired: true` with truncated JSON on disk, and every other process will treat the lock as malformed and stop reclaiming it.

Proposed fix

```diff
- try {
-   fs.writeSync(fd, payload);
+ try {
+   const bytes = Buffer.from(payload, "utf8");
+   let offset = 0;
+   while (offset < bytes.length) {
+     offset += fs.writeSync(fd, bytes, offset, bytes.length - offset);
+   }
```

Does Node.js `fs.writeSync()` on a regular file guarantee writing the entire string/buffer in one call, or can it legally return a short write that requires looping until all bytes are written?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard-session.ts` around lines 422 - 427, The current atomic-create success path writes the lock payload with a single fs.writeSync(fd, payload) call (in the function handling onboard lock creation) which can short-write and leave truncated JSON while still returning acquired: true; change this to loop until the entire payload buffer is written (tracking bytesWritten and advancing the buffer/offset) before returning success and leaving the fd open for releaseOnboardLock() to verify ownership, ensuring the written JSON is complete and valid for other processes to read.
412-419: ⚠️ Potential issue | 🔴 Critical: The stale-lock reclaim path is still racy.

Between Line 454 and Line 466, another process can recreate `LOCK_FILE`, and this helper will then unlink that fresh lock. The slower contender can immediately win the next `openSync("wx")`, so two onboard runs can still proceed concurrently. `heldLockFd` only hardens `releaseOnboardLock()`; it does not make this reclaim path safe.

Does Node.js expose any atomic filesystem operation that unlinks a pathname only if it still refers to a previously observed inode, or are `fs.statSync()` and `fs.unlinkSync()` always separate pathname-based operations?

Also applies to: 439-472
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard-session.ts` around lines 412 - 419, The stale-lock reclaim path is racy because unlinkIfInodeMatches(LOCK_FILE, staleInode) can remove a freshly created lock; remove this unsafe unlink and instead stop trying to forcibly reclaim by inode. Change the logic in the lock-acquire loop (the code calling unlinkIfInodeMatches, around the functions/methods handling LOCK_FILE and heldLockFd) to: when you detect a stale inode, do not unlink—sleep/backs off and retry opening LOCK_FILE with fs.openSync(..., "wx") until you succeed (or time out); keep releaseOnboardLock() and heldLockFd behavior for clean release but do not attempt inode-based unlinks from this reclaim path. This removes the race without relying on an atomic unlink-by-inode operation (which Node.js does not provide).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/lib/onboard-session.ts`:
- Around line 422-427: The current atomic-create success path writes the lock
payload with a single fs.writeSync(fd, payload) call (in the function handling
onboard lock creation) which can short-write and leave truncated JSON while
still returning acquired: true; change this to loop until the entire payload
buffer is written (tracking bytesWritten and advancing the buffer/offset) before
returning success and leaving the fd open for releaseOnboardLock() to verify
ownership, ensuring the written JSON is complete and valid for other processes
to read.
- Around line 412-419: The stale-lock reclaim path is racy because
unlinkIfInodeMatches(LOCK_FILE, staleInode) can remove a freshly created lock;
remove this unsafe unlink and instead stop trying to forcibly reclaim by inode.
Change the logic in the lock-acquire loop (the code calling
unlinkIfInodeMatches, around the functions/methods handling LOCK_FILE and
heldLockFd) to: when you detect a stale inode, do not unlink—sleep/backs off and
retry opening LOCK_FILE with fs.openSync(..., "wx") until you succeed (or time
out); keep releaseOnboardLock() and heldLockFd behavior for clean release but do
not attempt inode-based unlinks from this reclaim path. This removes the race
without relying on an atomic unlink-by-inode operation (which Node.js does not
provide).
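The no-forced-unlink alternative sketched by these prompts could look like this (illustrative constants and naming; the synchronous backoff uses `Atomics.wait`, and fd retention for release is elided for brevity):

```typescript
import * as fs from "node:fs";

const MAX_ATTEMPTS = 5; // illustrative, mirroring the PR's retry bump

function sleepMs(ms: number): void {
  // Synchronous sleep without busy-waiting, usable on Node's main thread.
  Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, ms);
}

function tryAcquire(lockPath: string, payload: string): boolean {
  for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
    try {
      const fd = fs.openSync(lockPath, "wx", 0o600);
      fs.writeSync(fd, payload);
      fs.closeSync(fd);
      return true;
    } catch (err) {
      if ((err as NodeJS.ErrnoException).code !== "EEXIST") throw err;
      // Lock exists: back off and retry instead of force-unlinking,
      // trading liveness (a crashed holder blocks us until reclamation
      // runs elsewhere) for safety (we can never delete a fresh lock).
      sleepMs(10 * (attempt + 1));
    }
  }
  return false;
}
```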
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: f9dade3d-824b-4414-a83a-7c941eadc7cf
📒 Files selected for processing (2)
src/lib/onboard-session.test.tssrc/lib/onboard-session.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- src/lib/onboard-session.test.ts
The acquireOnboardLock stale-cleanup path read a stale lock, decided the holder was dead, and unconditionally unlinked LOCK_FILE before retrying writeFileSync(wx). Two concurrent processes that both observe the same stale lock will both try to clean it up — and the slower of the two can unlink the *fresh* lock the faster process just claimed, breaking mutual exclusion: both processes end up holding 'their' lock and onboard runs in parallel against the same shared session state. Reported as NVIDIA#1281, originally surfaced by CodeRabbit on NVIDIA#1272.

Fix: capture the stale file's inode via fs.statSync({ bigint: true }) at the same time we read its contents, then in a new unlinkIfInodeMatches() helper, re-stat right before fs.unlinkSync and bail if the inode has changed. The dual stat-then-unlink is the only portable POSIX primitive Node exposes for this — there is no atomic "unlink-if-inode" syscall — so a sufficiently unlucky race can still slip through. The window is orders of magnitude smaller than the unconditional unlink it replaces, and the outer retry loop will detect a wrong unlink on its next writeFileSync(wx) attempt because either we re-create the file or we observe a new lock with a different inode.

Also bumps MAX_ATTEMPTS from 2 to 5 because the inode-verified cleanup can take a few more spins under contention before one cleaner wins.

Adds a regression test that simulates the race deterministically by wrapping fs.statSync so the first stat succeeds against the original stale inode, then atomically swaps the lock file (unlink + recreate) to give it a new inode before unlinkIfInodeMatches re-stats it. The test asserts the fresh claim survives the race and is the file on disk after acquireOnboardLock returns. Verified by stashing the source fix and re-running: the new test fails on the unguarded code as expected, and passes with the inode guard in place.

Closes NVIDIA#1281
CodeRabbit flagged a residual TOCTOU window in unlinkIfInodeMatches: between statSync and unlinkSync, another process could replace the file. Switch the acquisition primitive to fs.openSync(LOCK_FILE, "wx", 0o600) and keep the resulting file descriptor at module scope for the lifetime of the lock. On release, compare fstatSync(fd).ino against statSync(LOCK_FILE).ino — if the two diverge, another process owns the path now and we leave it alone.

The legacy pid-based release fallback is preserved so tests that write the lock file directly (without going through acquireOnboardLock) keep their existing semantics for malformed and foreign-pid locks.

Also tighten the NVIDIA#1281 regression test per CodeRabbit nit: the simulated concurrent writer now uses process.ppid (a distinct live PID, not process.pid) so the test asserts the mutual-exclusion loser path — result.acquired === false and result.holderPid === ppid — rather than ambiguously accepting either outcome.

Refs: NVIDIA#1281
Signed-off-by: ColinM-sys <cmcdonough@50words.com>
5d31136 to 19219a3 (Compare)
Code fix is solid — inode-verified stale-lock cleanup with fd-based ownership in release. Pushed a test fix: swap now happens just before stat #2 (inside unlinkIfInodeMatches) so the stale-cleanup branch is actually exercised. Uses write-to-temp + rename instead of unlink + recreate to guarantee a different inode on tmpfs/overlayfs (which can reuse inodes, causing the CI failure). All 15 lock tests pass.
f142971 to 4dfc607 (Compare)
…exercised

CodeRabbit correctly flagged that swapping on stat #1 caused readFileSync to see the live PID and exit via isProcessAlive — unlinkIfInodeMatches was never called. Move the swap to just before stat #2 (inside unlinkIfInodeMatches): stat #1 reads the original stale inode, readFileSync sees the dead PID, isProcessAlive returns false, stale-cleanup runs, and stat #2 sees the new inode and skips the unlink.

Use write-to-temp + rename instead of unlink + recreate to guarantee a different inode even on tmpfs/overlayfs which can reuse inodes.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
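The inode-changing swap that commit describes can be shown in isolation (hypothetical names; the temp file coexists with the original, so the two inodes must differ, and `rename(2)` atomically repoints the path):

```typescript
import * as fs from "node:fs";

// Swap lockPath's contents so the path ends up with a different inode,
// even on filesystems that aggressively reuse inode numbers: while the
// temp file and the original both exist, their inodes cannot collide.
function swapWithNewInode(lockPath: string, newContents: string): void {
  const tmp = lockPath + ".tmp"; // hypothetical temp name
  fs.writeFileSync(tmp, newContents);
  fs.renameSync(tmp, lockPath); // atomic replace on POSIX
}
```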
af5c877 to 6629df6 (Compare)
Summary
Closes #1281.
`acquireOnboardLock` in `src/lib/onboard-session.ts` had a stale-cleanup path that read a stale lock, decided the holder was dead, and unconditionally called `fs.unlinkSync(LOCK_FILE)` before retrying `writeFileSync(wx)`. Two concurrent processes that both observe the same stale lock both try to clean it up — and the slower of the two can unlink the fresh lock the faster process just claimed. Both processes then end up holding 'their' lock and onboard runs in parallel against the same shared session state.
Originally surfaced by CodeRabbit on #1272.
Race walkthrough
After this, both A and B believe they own the lock and proceed with onboard.
Fix
Capture the stale file's inode via `fs.statSync({ bigint: true })` at the same time we read its contents. Then in a new `unlinkIfInodeMatches()` helper, re-stat right before `fs.unlinkSync` and bail if the inode has changed.

The dual stat-then-unlink is the only portable POSIX primitive Node exposes for this — there is no atomic "unlink-if-inode" syscall — so a sufficiently unlucky race can still slip through. However the window is orders of magnitude smaller than the unconditional unlink it replaces, and the outer retry loop will detect a wrong unlink on its next `writeFileSync(wx)` attempt because either we re-create the file or we observe a new lock with a different inode and retry.

Also bumps `MAX_ATTEMPTS` from 2 to 5 because the inode-verified cleanup can take a few more spins under contention before one cleaner wins.

Behavior preserved

- We never call `unlinkIfInodeMatches` on malformed locks, only on parseable-but-stale ones.
- `releaseOnboardLock()` semantics unchanged.

Test plan
- Adds `regression #1281: stale-cleanup race does not unlink a fresh lock claimed by another process`, which simulates the race deterministically by wrapping `fs.statSync` so the first stat (inside acquireOnboardLock) succeeds against the original stale inode, then atomically swaps the lock file (unlink + recreate) to give it a new inode before `unlinkIfInodeMatches` re-stats it. The test asserts the fresh claim survives the race and is the file on disk after `acquireOnboardLock` returns.
- Without the fix, `unlinkSync` deletes the fresh claim and writes a new one with a different command string. The test's `expect(onDisk.command).toContain("fresh claim from concurrent process")` then fails with the actual content showing the wrong winner.
- Existing lock tests still pass (`acquires and releases the onboard lock`, `replaces a stale onboard lock`, `treats unreadable or transient lock contents as a retry, not a stale lock`, `ignores malformed lock files when releasing the onboard lock`).
- `creates and persists a session with restrictive permissions` (checking `mode & 0o777 === 0o600`) fails on this branch with the same failure as `main` — this is the umask/POSIX-mode-on-Windows issue and is unrelated to the lock fix.
linkSyncapproachThe issue body suggests using
fs.linkSync(temp, LOCK_FILE)which fails atomically with EEXIST if the target exists. That's a valid alternative, but:writeFileSync(LOCK_FILE, payload, { flag: "wx" })already provides the same atomic-create-or-fail semantics — there's no need to introduce a tempfile + link dance for the initial claim. The race is specifically in the cleanup of stale locks, not the initial claim.writeFileSync(wx)) and only adds the check on the rare stale-cleanup branch.Both approaches are correct; this one is the smaller surgical change.
Summary by CodeRabbit
Tests
Bug Fixes
Signed-off-by: ColinM-sys cmcdonough@50words.com