mount: orphan daemons survive stop/pkill, race on state writes #180

@khaliqgant

Symptom

The mount daemon's lifecycle management is leaky: `relayfile stop` and `pkill -f "mount <workspace>"` do not reliably terminate the daemon. Orphans accumulate over a session and race the new daemon on local state writes, producing visible corruption symptoms.

In today's rw_fc7b534b session, after a series of `relayfile mount … --background` / `relayfile stop` / `pkill -f` cycles to test 0.7.27 then 0.7.28, `ps` showed:

pid: 22788  etime: 11:40   ← the 0.7.28 daemon I thought was the only one
pid: 62068  etime: 33:42   ← orphan v0.7.27 daemon that had survived
                            multiple earlier `pkill -f "mount rw_fc7b534b"`
                            calls and a `relayfile stop` call

The 33:42-old orphan persisted across at least 3 explicit termination attempts. `relayfile stop rw_fc7b534b`, invoked while the orphan was running, returned `no running mount found for workspace rw_fc7b534b`; that is, `stop` did not even know the orphan existed. `pkill -9 -f "mount rw_fc7b534b"` was what finally killed it.

Consequence (the real impact)

Two daemons writing to the same `relayfile-mount/.relayfile-mount-state.json` and provider files via temp+rename produce sustained races. Today's `mount.log` showed the smoking gun:

mount local change failed: lstat .../relayfile-mount/google-mail/threads/.19d20d2cb91ef830.json.tmp-402007899: no such file or directory
mount local change failed: lstat .../relayfile-mount/.relayfile-mount-state.json.tmp-3553847730: no such file or directory
mount sync cycle failed: context deadline exceeded

Each daemon's atomic-write tmp file vanishes under the other daemon's rename, so lstat after rename keeps failing. As a result the cursor stalled at evt_27501 and files=4350 for the entire 10-minute observation window even though both #179 (auth race fixed) and #177 (mid-cycle checkpoint) were live in the binary — neither fix can help when two daemons fight for the same state file.
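To make the fingerprint concrete, here is a minimal Python sketch of the temp+rename pattern and of why an `lstat` on the temp path fails once a rename has consumed it. This is not relayfile's code (the daemon is a Go binary); `atomic_write` and `lstat_fails_after_rename` are hypothetical names, and the idea that a second process stats the temp path after the rename is only one plausible reading of the log lines above, not a confirmed trace.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> str:
    """Write data to path via a temp file plus rename; return the temp path used.

    Hypothetical helper for illustration only.
    """
    directory = os.path.dirname(path) or "."
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(
        prefix=os.path.basename(path) + ".tmp-", dir=directory
    )
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make the bytes durable before publishing
        os.replace(tmp_path, path)  # atomic on POSIX; the temp name disappears here
    except BaseException:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    return tmp_path


def lstat_fails_after_rename(tmp_path: str) -> bool:
    """Return True if lstat on the temp path reports 'no such file or directory'."""
    try:
        os.lstat(tmp_path)
    except FileNotFoundError:
        return True
    return False
```

With a single writer nothing ever looks at the temp path after the rename, so the error never appears. With two processes working in the same directory, either one can observe the other's temp name and stat it too late, which matches the shape of the `lstat …tmp-…: no such file or directory` lines.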

Once I killed both with pkill -9 and started a single clean daemon, the symptom disappeared.

Likely root causes

  1. Stale / wrong `.relay/mount.pid`. `relayfile stop` appears to read the pid file written at start time. When `--background` forks (or when the daemon process gets re-PID'd through the node→Go shim), the pid file may not match the real running daemon. `stop` then sends a signal to a non-existent or wrong pid (a silent no-op) while reporting "no running mount found."
  2. pkill -f racing the daemon's process tree. The daemon is a node wrapper invoking a darwin-arm64 Go binary; signals to the wrapper may not propagate to the child, or vice versa. Today the surviving orphan was the child binary (relayfile-cli-darwin-arm64 mount …) with PPID 1 (reparented to init after the wrapper exited).
  3. No startup lock. relayfile mount … --background does not check for an existing daemon for the same workspace/local-dir before forking a new one. If the pid file is stale or the existing daemon is orphaned, a second daemon happily starts and competes.
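The first root cause can be checked by hand. The sketch below, in Python rather than the daemon's Go, classifies a pid file as missing, invalid, stale, or alive using the POSIX convention that signal 0 tests for a process without delivering anything. `pid_file_status` is a hypothetical helper, and the pid file format (a single decimal pid) is an assumption.

```python
import os


def pid_file_status(pid_path: str) -> str:
    """Classify a pid file as 'missing', 'invalid', 'stale', or 'alive'.

    Assumes the file holds one decimal pid; that format is a guess.
    """
    try:
        with open(pid_path, "r", encoding="utf-8") as f:
            pid = int(f.read().strip())
    except FileNotFoundError:
        return "missing"
    except ValueError:
        return "invalid"
    if pid <= 0:
        # Guard: pid 0 and negative pids address process groups, not one process.
        return "invalid"
    try:
        os.kill(pid, 0)  # signal 0: existence and permission check only
    except ProcessLookupError:
        return "stale"  # no such process: the pid file points at nothing
    except PermissionError:
        return "alive"  # the process exists but belongs to another user
    return "alive"
```

An "alive" result only says that some process has that pid. It cannot tell whether that process is the mount daemon or an unrelated one that reused the number, which is why a cmdline scan is still needed to catch the "wrong pid" case.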

Suggested fixes (any one closes the worst exposure)

  1. Make relayfile stop discover daemons by scanning ps, not just the pid file. Match on mount <workspace> in the cmdline, signal every match, then confirm by re-scanning. Falls back gracefully when the pid file is stale.
  2. Atomic, fsynced pid file written by the child (Go) process after fork, not by the parent wrapper. Then the pid file always reflects what's actually running.
  3. Refuse-to-start lock at mount start time. If a process matching mount <workspace> (or the configured localDir) is already running, error out with the existing pid, instead of silently launching a competitor that will corrupt state.
  4. Surface orphans in relayfile status. If ps scan finds a daemon process but the pid file doesn't match, report daemon: orphan(pid=…) so operators can spot it without grepping ps themselves. (Touches the same observability gap as relayfile#175 C, where status already misreports daemon liveness for foreground mounts.)
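Fix 3 could also be built on an advisory file lock instead of a process scan. The following is a minimal sketch under stated assumptions: it is Python, not the daemon's Go; `acquire_mount_lock` is a hypothetical name; and the lock file location (for example a `mount.lock` next to `.relay/mount.pid`) is made up for illustration.

```python
import fcntl
import os
from typing import Optional


def acquire_mount_lock(lock_path: str) -> Optional[int]:
    """Try to take an exclusive, non-blocking lock for one workspace.

    Returns an open file descriptor on success (keep it open for the
    daemon's lifetime) or None if another holder already has the lock.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None  # someone else holds the lock: refuse to start
    # Only the lock holder rewrites the file, so the recorded pid is the holder's.
    os.ftruncate(fd, 0)
    os.write(fd, f"{os.getpid()}\n".encode())
    os.fsync(fd)
    return fd
```

The design point is that the kernel drops the lock when the last descriptor for it closes, so a crashed daemon cannot leave a stale lock behind. An orphaned daemon that is still running keeps holding it, and a second start for the same workspace would fail and could print the pid recorded in the file instead of launching a competitor.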

Why it matters more than a footgun

The cluster of "the daemon doesn't seem to be making progress" symptoms today had three distinct root causes: broken events ordering (cloud#862), refresh-token rotation race (relayfile#178/#179), and this orphan-races-state problem. The first two are now fixed; this one will silently undo their fixes any time an operator does what looks like a routine `stop && start`. The state-file `.tmp-X: no such file or directory` log line is the canonical fingerprint.

Reproducible

Today's session is the repro: run `relayfile mount … --background`, then `relayfile stop` (it'll report no daemon found in some race conditions), then `relayfile mount … --background` again. Run `ps -axo pid,etime,command | grep "mount <workspace>"` and you may find two processes for the same workspace.
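The final check can be scripted so it does not depend on reading `ps` output by eye. This is a rough Python sketch: `find_mount_daemons` and `scan` are hypothetical helpers, and the match is the same plain substring test on `mount <workspace>` that the `grep` above performs.

```python
import subprocess
from typing import List, Tuple


def find_mount_daemons(ps_output: str, workspace: str) -> List[Tuple[int, str, str]]:
    """Return (pid, etime, command) for every ps line containing 'mount <workspace>'.

    Expects the three-column output of `ps -axo pid,etime,command`.
    """
    needle = f"mount {workspace}"
    matches = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)  # pid, etime, then the rest is the command
        if len(parts) < 3 or not parts[0].isdigit():
            continue  # skips the header row and blank lines
        pid, etime, command = parts
        if needle in command:
            matches.append((int(pid), etime, command))
    return matches


def scan(workspace: str) -> List[Tuple[int, str, str]]:
    """Run ps and return the matching daemon processes for one workspace."""
    out = subprocess.run(
        ["ps", "-axo", "pid,etime,command"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return find_mount_daemons(out, workspace)
```

More than one entry from `scan("<workspace>")` after a stop and start cycle reproduces the bug. Because this is a substring match, a workspace id that is a prefix of another id would also match, so a real implementation would need a stricter comparison.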

Related: relayfile#175 C (status misreports liveness), relayfile#178 (auth thrash, separate but compounded by this).

🤖 Generated with Claude Code
