mount: orphan daemons survive stop/pkill, race on state writes #180

## Symptom

The mount daemon's lifecycle management is leaky: `relayfile stop` and `pkill -f "mount <workspace>"` do **not** reliably terminate the daemon. Orphans accumulate over a session and race the new daemon on local state writes, which shows up as failed local-change writes and a sync cursor that stops advancing.

In today's `rw_fc7b534b` session, after a series of `relayfile mount … --background` / `relayfile stop` / `pkill -f` cycles to test 0.7.27 then 0.7.28, `ps` showed:

```
pid: 22788  etime: 11:40   ← the 0.7.28 daemon I thought was the only one
pid: 62068  etime: 33:42   ← orphan v0.7.27 daemon that had survived
                            multiple earlier `pkill -f "mount rw_fc7b534b"`
                            calls and a `relayfile stop` call
```

The orphan, 33:42 old at that point, had persisted across at least 3 explicit termination attempts. `relayfile stop rw_fc7b534b`, invoked while the orphan was running, returned `no running mount found for workspace rw_fc7b534b`; stop did not even know it existed. Only `pkill -9 -f "mount rw_fc7b534b"` finally killed it.

## Consequence (the real impact)

Two daemons writing to the same `relayfile-mount/.relayfile-mount-state.json` and provider files via temp+rename produces sustained races. Today's `mount.log` showed the smoking gun:

```
mount local change failed: lstat .../relayfile-mount/google-mail/threads/.19d20d2cb91ef830.json.tmp-402007899: no such file or directory
mount local change failed: lstat .../relayfile-mount/.relayfile-mount-state.json.tmp-3553847730: no such file or directory
mount sync cycle failed: context deadline exceeded
```

Each daemon's atomic-write tmp file vanishes under the *other* daemon's rename, so `lstat` after rename keeps failing. As a result, the cursor stalled at `evt_27501` and `files=4350` for the entire 10-minute observation window, even though **both** #179 (auth race fix) and #177 (mid-cycle checkpoint) were live in the binary. Neither fix can help when two daemons fight over the same state file.
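
To make the failure mode concrete, here is a minimal single-process sketch of one plausible interleaving: a writer completes a temp+rename write, and an observer that noticed the temp name earlier calls `lstat` on it afterwards. The exact interleaving inside the daemons is an assumption, and `atomic_write`, `observe`, and `demo` are hypothetical names. Python is used only for brevity; the daemon itself is a Go binary.

```python
import os
import tempfile


def atomic_write(path: str, data: str) -> str:
    """Write data via temp+rename and return the temp path that was used."""
    directory = os.path.dirname(path)
    prefix = os.path.basename(path) + ".tmp-"
    fd, tmp_path = tempfile.mkstemp(prefix=prefix, dir=directory)
    with os.fdopen(fd, "w") as handle:
        handle.write(data)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)  # the temp name stops existing here
    return tmp_path


def observe(tmp_path: str) -> str:
    """What a second process sees if it lstat()s a temp name after the rename."""
    try:
        os.lstat(tmp_path)
        return "present"
    except FileNotFoundError:
        return "no such file or directory"


def demo() -> str:
    with tempfile.TemporaryDirectory() as mount_dir:
        state = os.path.join(mount_dir, ".relayfile-mount-state.json")
        tmp_seen = atomic_write(state, '{"cursor": "evt_27501"}')
        return observe(tmp_seen)


if __name__ == "__main__":
    print(demo())
```

With a single writer the temp name is private to that writer, so nothing else ever looks at it. With two daemons watching the same directory, each can see the other's temp names appear and disappear.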

Once I killed both with `pkill -9` and started a single clean daemon, the symptom disappeared.

## Likely root causes

1. **Stale or wrong `.relay/mount.pid`.** `relayfile stop` appears to read the pid file written at start time. When `--background` forks (or when the daemon ends up under a different PID after the node wrapper hands off to the Go binary), the pid file may not match the real running daemon. So `stop` sends a signal to a non-existent or wrong pid (a silent no-op) while reporting "no running mount found."
2. **`pkill -f` racing the daemon's process tree.** The daemon is a node wrapper invoking a darwin-arm64 Go binary; signals to the wrapper may not propagate to the child, or vice versa. Today the surviving orphan was the *child* binary (`relayfile-cli-darwin-arm64 mount …`) with PPID 1 (reparented to init after the wrapper exited).
3. **No startup lock.** `relayfile mount … --background` does not check for an existing daemon for the same workspace/local-dir before forking a new one. If the pid file is stale or the existing daemon is orphaned, a second daemon happily starts and competes.
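
Root cause 1 can be checked by hand with a few lines. The sketch below classifies a pid file as missing, stale, or pointing at a live process. It assumes the pid file holds a single decimal PID, which is not confirmed by the source, and `read_pid`, `pid_alive`, and `classify` are hypothetical helper names.

```python
import os
from typing import Optional


def read_pid(pid_file: str) -> Optional[int]:
    """Return the PID stored in pid_file, or None if it is missing or malformed."""
    try:
        with open(pid_file) as handle:
            return int(handle.read().strip())
    except (FileNotFoundError, ValueError):
        return None


def pid_alive(pid: int) -> bool:
    """Probe a PID with signal 0, which checks existence without delivering a signal."""
    if pid <= 0:
        return False  # 0 and negative values address process groups, not one process
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def classify(pid_file: str) -> str:
    pid = read_pid(pid_file)
    if pid is None:
        return "missing-or-unreadable"
    return "alive" if pid_alive(pid) else "stale"


if __name__ == "__main__":
    print(classify(".relay/mount.pid"))
```

An `alive` result only shows that some process holds that PID. It does not prove the process is the mount daemon, since PIDs get reused, and it says nothing about an orphan running under a PID the file never recorded.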

## Suggested fixes (any one closes the worst exposure)

1. **Make `relayfile stop` discover daemons by scanning `ps`**, not just the pid file. Match on `mount <workspace>` in the cmdline, signal every match, then confirm by re-scanning. This keeps working when the pid file is stale.
2. **Atomic, fsynced pid file written by the *child* (Go) process** after fork, not by the parent wrapper. Then the pid file always reflects what's actually running.
3. **Refuse-to-start lock at mount start time.** If a process matching `mount <workspace>` (or the configured `localDir`) is already running, error out with the existing pid, instead of silently launching a competitor that will corrupt state.
4. **Surface orphans in `relayfile status`.** If `ps` scan finds a daemon process but the pid file doesn't match, report `daemon: orphan(pid=…)` so operators can spot it without grepping `ps` themselves. (Touches the same observability gap as relayfile#175 C, where `status` already misreports daemon liveness for foreground mounts.)
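
Fixes 2 and 3 can be sketched together. Note that the lock below is an advisory `flock` on a lock file, which is a swapped-in technique and not the cmdline matching that fix 3 describes; its appeal is that the kernel releases the lock when the holder dies, so it cannot go stale. The lock file path and the function names `acquire_mount_lock` and `write_pid_file` are hypothetical, and the real daemon would do the equivalent in Go.

```python
import fcntl
import os
from typing import Optional


def acquire_mount_lock(lock_path: str) -> Optional[int]:
    """Return an fd holding an exclusive lock, or None if another holder exists."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    # Keep the fd open for the daemon's lifetime; closing it drops the lock.
    return fd


def write_pid_file(pid_path: str, pid: int) -> None:
    """Write the pid via temp+rename, fsyncing the file and its directory."""
    tmp_path = f"{pid_path}.tmp-{pid}"
    with open(tmp_path, "w") as handle:
        handle.write(f"{pid}\n")
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, pid_path)
    dir_fd = os.open(os.path.dirname(pid_path) or ".", os.O_RDONLY)
    try:
        os.fsync(dir_fd)  # make the rename itself durable
    finally:
        os.close(dir_fd)


if __name__ == "__main__":
    os.makedirs(".relay", exist_ok=True)
    lock_fd = acquire_mount_lock(".relay/mount.lock")  # hypothetical path
    if lock_fd is None:
        raise SystemExit("another mount daemon already holds the lock")
    # Called from the long-lived process itself, so the pid is the real one.
    write_pid_file(".relay/mount.pid", os.getpid())
```

The ordering matters: take the lock first, then write the pid file from the process that will actually keep running.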

## Why it matters more than a footgun

The cluster of "the daemon doesn't seem to be making progress" symptoms today had three distinct root causes: broken event ordering (cloud#862), a refresh-token rotation race (relayfile#178/#179), and **this orphan-races-state problem**. The first two are now fixed; this one will silently undo their fixes any time an operator does what looks like a routine `stop && start`. The state-file `.tmp-X: no such file or directory` log line is the canonical fingerprint.

## Reproducible

Today's session is the repro: run `relayfile mount … --background`, then `relayfile stop` (under some race conditions it reports that no daemon was found), then `relayfile mount … --background` again. Run `ps -axo pid,etime,command | grep "mount <workspace>"` and you may find two processes for the same workspace.
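
The `ps` check can be scripted so that it matches the workspace as a whole argument and does not pick up the `grep` itself. This is a rough sketch: it assumes the daemon's command line contains the tokens `mount` and the workspace name next to each other, as in the `pkill -f` pattern used above, and `parse_mount_daemons` and `scan` are hypothetical names.

```python
import subprocess
from typing import List, Tuple


def parse_mount_daemons(ps_output: str, workspace: str) -> List[Tuple[int, str, str]]:
    """Return (pid, etime, command) for each line whose command has `mount <workspace>`."""
    matches = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3 or not parts[0].isdigit():
            continue  # header row or malformed line
        pid, etime, command = int(parts[0]), parts[1], parts[2]
        tokens = command.split()
        # Require an exact token match so rw_abc does not also match rw_abc2.
        for first, second in zip(tokens, tokens[1:]):
            if first == "mount" and second == workspace:
                matches.append((pid, etime, command))
                break
    return matches


def scan(workspace: str) -> List[Tuple[int, str, str]]:
    """Run the same ps invocation as the manual check and filter its output."""
    result = subprocess.run(
        ["ps", "-axo", "pid,etime,command"],
        capture_output=True, text=True, check=True,
    )
    return parse_mount_daemons(result.stdout, workspace)


if __name__ == "__main__":
    daemons = scan("rw_fc7b534b")
    for pid, etime, command in daemons:
        print(pid, etime, command)
    if len(daemons) > 1:
        print("more than one mount daemon for this workspace")
```

The same scan is what fix 1 would run before and after signalling, and what fix 4 would compare against the pid file.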

Related: relayfile#175 C (`status` misreports liveness), relayfile#178 (auth thrash, separate but compounded by this).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

