fix(bridge): prevent process leak via PID file singleton guard #1765
Each `hermes chat` session spawns a new bridge.cts process. When the
chat session ends (stdin closes), the bridge transitions to a
"staying alive as daemon" state to keep the Memory Viewer accessible.
However, the next chat session spawns yet another bridge without
killing the previous one, causing unbounded process accumulation
(observed: 10+ zombie bridge pairs on long-running servers).
Root cause: bridge.cts has no mechanism to detect or clean up a
previously running instance before starting.
Fix: introduce a PID file (`~/<agent>/memos-plugin/daemon/bridge.pid`)
as a lightweight singleton lock:
- On startup (unless --no-viewer), read the PID file; if the
recorded process is still alive, send SIGTERM and wait up to 5s
before SIGKILL.
- Write own PID to the file after acquiring the slot.
- Remove the PID file on all exit paths (SIGTERM/SIGINT handler,
daemon shutdown, headless exit, keepalive viewer-closed check).
- --no-viewer (headless) bridges skip the kill — they don't need
the port and coexist with the daemon that owns it.
The existing install.sh kill logic is preserved as a deployment-time
fallback; the two mechanisms are complementary.
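
The guard described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in bridge.cts: the helper names (`isAlive`, `readPid`, `acquireSingleton`, `releaseSingleton`, `releaseOnExit`) and the throwaway demo path are hypothetical, and the `--no-viewer` bypass and the daemon/keepalive exit paths are omitted.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

/** True if a process with this PID exists (signal 0 only probes; nothing is delivered). */
function isAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    return (err as NodeJS.ErrnoException).code === "EPERM";
  }
}

/** Read the recorded PID, or null if the file is missing or malformed. */
function readPid(pidFile: string): number | null {
  try {
    const pid = Number.parseInt(fs.readFileSync(pidFile, "utf8").trim(), 10);
    return Number.isInteger(pid) && pid > 0 ? pid : null;
  } catch {
    return null;
  }
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

/** Stop a previous instance (SIGTERM, then SIGKILL after graceMs) and claim the slot. */
async function acquireSingleton(pidFile: string, graceMs = 5000): Promise<void> {
  const old = readPid(pidFile);
  if (old !== null && old !== process.pid && isAlive(old)) {
    try { process.kill(old, "SIGTERM"); } catch { /* exited in the meantime */ }
    const deadline = Date.now() + graceMs;
    while (isAlive(old) && Date.now() < deadline) await sleep(100);
    if (isAlive(old)) {
      try { process.kill(old, "SIGKILL"); } catch { /* exited in the meantime */ }
    }
  }
  fs.mkdirSync(path.dirname(pidFile), { recursive: true });
  fs.writeFileSync(pidFile, String(process.pid));
}

/** Remove the PID file, but only if it still records this process. */
function releaseSingleton(pidFile: string): void {
  if (readPid(pidFile) === process.pid) fs.rmSync(pidFile, { force: true });
}

/** Release the slot on normal exit and on SIGTERM/SIGINT. */
function releaseOnExit(pidFile: string): void {
  process.once("exit", () => releaseSingleton(pidFile));
  for (const sig of ["SIGTERM", "SIGINT"] as const) {
    process.once(sig, () => process.exit(0)); // fires the "exit" handler above
  }
}

// Demo wiring with a throwaway directory (hypothetical; not the real bridge.pid location).
const demoPidFile = path.join(
  fs.mkdtempSync(path.join(os.tmpdir(), "bridge-demo-")),
  "daemon",
  "bridge.pid",
);
const ready = acquireSingleton(demoPidFile).then(() => releaseOnExit(demoPidFile));
```

In this sketch `releaseSingleton` deletes the file only when it still holds the caller's own PID, so a bridge that has already been replaced does not remove its successor's entry. Note that a PID file cannot tell a live bridge apart from an unrelated process that has reused the same PID; a real implementation may want an extra identity check before signalling.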
Co-authored-by: Cursor <cursoragent@cursor.com>
Large-DB crash loop: PID singleton kills old bridge before DB lock releases

We reverted a 655MB MemOS DB from v2.0.0-beta.1 to v2.0.5 and hit an infinite bridge restart cycle. The PID singleton guard kills the old bridge, but the new bridge then opens the same SQLite DB and hangs in WAL recovery, long enough to exceed the gateway's 30s timeout. The gateway spawns yet another bridge, which kills the current one, and the cycle repeats indefinitely.

On a smaller 43MB DB the same code works fine: WAL recovery completes before the next bridge spawns. The threshold where it breaks seems to be somewhere between those two, likely dependent on DB size and I/O speed.

Root cause:

Cross-ref: #1452 describes a related class of startup-hang → restart-loop issues.

Summary
- Each `hermes chat` session spawns a new bridge.cts, and the previous one stays alive as a "daemon" but is never cleaned up. On long-running servers this results in 10+ zombie bridge pairs.
- Introduces a PID file (`daemon/bridge.pid`) as a lightweight singleton lock. On startup, the new bridge reads the PID file, kills the stale process (SIGTERM → 5s → SIGKILL), writes its own PID, and removes the file on all exit paths.
- `--no-viewer` (headless) bridges skip the kill; they coexist with the daemon that owns the viewer port.
- The existing `install.sh` kill logic is preserved as a complementary deployment-time fallback.

Test plan
- Verify `daemon/bridge.pid` is created with the correct PID
- Run `hermes chat` twice in sequence → verify only 1 bridge pair remains (no accumulation)
- Start a `--no-viewer` bridge alongside the daemon → verify both coexist (no kill)