Symptom
The mount daemon's lifecycle management is leaky: relayfile stop and pkill -f "mount <workspace>" do not reliably terminate the daemon. Orphans accumulate over a session and race the new daemon on local state writes, producing visible corruption symptoms.
In today's rw_fc7b534b session, after a series of relayfile mount … --background / relayfile stop / pkill -f cycles to test 0.7.27 then 0.7.28, ps showed:
pid: 22788 etime: 11:40 ← the 0.7.28 daemon I thought was the only one
pid: 62068 etime: 33:42 ← orphan v0.7.27 daemon that had survived multiple earlier `pkill -f "mount rw_fc7b534b"` calls and a `relayfile stop` call
The 33:42-old orphan persisted across at least 3 explicit termination attempts. relayfile stop rw_fc7b534b invoked while it was running returned no running mount found for workspace rw_fc7b534b — i.e. stop didn't even know it existed. pkill -9 -f "mount rw_fc7b534b" was what finally killed it.
Consequence (the real impact)
Two daemons writing to the same relayfile-mount/.relayfile-mount-state.json and provider files via temp+rename produces sustained races. Today's mount.log showed the smoking gun:
mount local change failed: lstat .../relayfile-mount/google-mail/threads/.19d20d2cb91ef830.json.tmp-402007899: no such file or directory
mount local change failed: lstat .../relayfile-mount/.relayfile-mount-state.json.tmp-3553847730: no such file or directory
mount sync cycle failed: context deadline exceeded
Each daemon's atomic-write tmp file vanishes under the other daemon's rename, so lstat after rename keeps failing. As a result the cursor stalled at evt_27501 and files=4350 for the entire 10-minute observation window even though both #179 (auth race fixed) and #177 (mid-cycle checkpoint) were live in the binary — neither fix can help when two daemons fight for the same state file.
Once I killed both with pkill -9 and started a single clean daemon, the symptom disappeared.
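The log fingerprint can be reproduced in isolation. The sketch below is a minimal illustration, not the daemon's code: `atomic_write` and `late_lstat` are hypothetical names, and the temp-file naming only mimics the `.tmp-<n>` suffix seen in mount.log. It shows that any process that learns of a temp name and calls lstat on it after the writer's rename gets "no such file or directory", which is what each daemon would see for the other's in-flight writes.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> str:
    """Write data via temp file + rename; return the temp path that was used."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(
        prefix=os.path.basename(path) + ".tmp-", dir=directory
    )
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
        handle.flush()
        os.fsync(handle.fileno())
    # After this call the temp name no longer exists; only `path` does.
    os.replace(tmp_path, path)
    return tmp_path


def late_lstat(tmp_path: str) -> str:
    """What a second process sees if it stats the temp name after the rename."""
    try:
        os.lstat(tmp_path)
        return "present"
    except FileNotFoundError:
        return "no such file or directory"
```

With a single writer nothing ever looks at the temp name after the rename, so the error never surfaces. It needs a second process working in the same directory.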
Likely root causes
- Stale / wrong .relay/mount.pid. relayfile stop appears to read the pid file written at start-time. When --background forks (or when the daemon process gets re-PID'd through node→Go shim), the pid file may not match the real running daemon. So stop sends a signal to a non-existent or wrong pid (silent no-op) while reporting "no running mount found."
- pkill -f racing the daemon's process tree. The daemon is a node wrapper invoking a darwin-arm64 Go binary; signals to the wrapper may not propagate to the child, or vice versa. Today the surviving orphan was the child binary (relayfile-cli-darwin-arm64 mount …) with PPID 1 (reparented to init after the wrapper exited).
- No startup lock. relayfile mount … --background does not check for an existing daemon for the same workspace/local-dir before forking a new one. If the pid file is stale or the existing daemon is orphaned, a second daemon happily starts and competes.
Suggested fixes (any one closes the worst exposure)
- Make relayfile stop discover daemons by scanning ps, not just the pid file. Match on mount <workspace> in the cmdline, signal every match, then confirm by re-scanning. Falls back gracefully when the pid file is stale.
- Atomic, fsynced pid file written by the child (Go) process after fork, not by the parent wrapper. Then the pid file always reflects what's actually running.
- Refuse-to-start lock at mount start time. If a process matching mount <workspace> (or the configured localDir) is already running, error out with the existing pid, instead of silently launching a competitor that will corrupt state.
- Surface orphans in relayfile status. If ps scan finds a daemon process but the pid file doesn't match, report daemon: orphan(pid=…) so operators can spot it without grepping ps themselves. (Touches the same observability gap as relayfile#175 C, where status already misreports daemon liveness for foreground mounts.)
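One way to combine the pid-file and refuse-to-start ideas is to have the daemon itself hold an advisory lock on its pid file. The sketch below is an assumption about how that could look, written in Python for brevity even though the real daemon is a Go binary; `acquire_pid_lock` and `AlreadyRunning` are hypothetical names, and the pid path is whatever the CLI already uses. Note one deliberate swap: it writes the pid in place under the lock instead of via temp+rename, because a rename would replace the locked file with a new, unlocked one.

```python
import fcntl
import os


class AlreadyRunning(Exception):
    """Raised when another process already holds the mount lock."""

    def __init__(self, pid: str):
        super().__init__(f"mount daemon already running (pid={pid})")
        self.pid = pid


def acquire_pid_lock(pid_path: str) -> int:
    """Take an exclusive advisory lock on pid_path and record our pid in it.

    Returns the open file descriptor. The caller keeps it open for the
    daemon's lifetime; the kernel drops the lock when the process exits,
    including after SIGKILL, so a leftover file alone never blocks a start.
    """
    fd = os.open(pid_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Someone else holds the lock: report the pid they recorded.
        holder = os.read(fd, 32).decode().strip() or "unknown"
        os.close(fd)
        raise AlreadyRunning(holder)
    # We own the lock: overwrite any stale pid and flush it to disk.
    os.ftruncate(fd, 0)
    os.write(fd, f"{os.getpid()}\n".encode())
    os.fsync(fd)
    return fd
```

Because the process that holds the lock is the one that writes the pid, the file cannot point at the wrapper while the child is the one running, and an orphaned child keeps the lock until it actually dies. A second `mount … --background` would then fail with the orphan's pid instead of starting a competitor.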
Why it matters more than a footgun
The cluster of "the daemon doesn't seem to be making progress" symptoms today had three distinct root causes — broken events ordering (cloud#862), refresh-token rotation race (relayfile#178/#179), and this orphan-races-state problem. The first two are now fixed; this one will silently undo their fixes any time an operator does what looks like a routine stop && start. The state-file .tmp-X: no such file or directory log line is the canonical fingerprint.
Reproducible
Today's session is the repro: run relayfile mount … --background, then relayfile stop (it'll report no daemon found in some race conditions), then relayfile mount … --background again. Run ps -axo pid,etime,command | grep "mount <workspace>" and you may find two processes for the same workspace.
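The ps check above can be scripted so that duplicates and orphans are reported rather than eyeballed. This is a rough sketch under assumptions: it takes the text of `ps -axo pid,ppid,etime,command`, treats a process as a mount daemon only if some token before `mount <workspace>` has a basename starting with `relayfile` (a guess based on the `relayfile-cli-darwin-arm64 mount …` command line seen today, which also keeps a stray `grep` from matching), and borrows the `orphan(pid=…)` label from the status suggestion. All function names are hypothetical.

```python
import os
import subprocess
from typing import List, NamedTuple, Optional


class Proc(NamedTuple):
    pid: int
    ppid: int
    etime: str
    command: str


def read_ps() -> str:
    """Capture the process table in the column layout parse_ps expects."""
    result = subprocess.run(
        ["ps", "-axo", "pid,ppid,etime,command"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_ps(output: str) -> List[Proc]:
    """Parse ps output; the header and malformed lines are skipped."""
    procs = []
    for line in output.splitlines():
        parts = line.split(None, 3)
        if len(parts) == 4 and parts[0].isdigit() and parts[1].isdigit():
            procs.append(Proc(int(parts[0]), int(parts[1]), parts[2], parts[3]))
    return procs


def is_mount_daemon(command: str, workspace: str) -> bool:
    """True if the command line looks like `<relayfile...> mount <workspace>`."""
    tokens = command.split()
    for i in range(1, len(tokens) - 1):
        if tokens[i] == "mount" and tokens[i + 1] == workspace:
            return any(
                os.path.basename(t).startswith("relayfile") for t in tokens[:i]
            )
    return False


def classify(procs: List[Proc], workspace: str,
             recorded_pid: Optional[int]) -> List[str]:
    """Label each matching daemon as tracked by the pid file or as an orphan."""
    labels = []
    for proc in procs:
        if not is_mount_daemon(proc.command, workspace):
            continue
        kind = "tracked" if proc.pid == recorded_pid else "orphan"
        labels.append(f"{kind}(pid={proc.pid})")
    return labels
```

Calling `classify(parse_ps(read_ps()), "<workspace>", recorded_pid)` after the stop/start sequence returns more than one entry when the repro has succeeded. The same scan could back both the `stop` and `status` suggestions.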
Related: relayfile#175 C (status misreports liveness), relayfile#178 (auth thrash, separate but compounded by this).
🤖 Generated with Claude Code