fix(bridge): prevent process leak via PID file singleton guard #1765

Merged
syzsunshine219 merged 1 commit into MemTensor:mem-agent-0514 from syzsunshine219:fix/bridge-process-leak-pid-singleton on May 19, 2026

@syzsunshine219 (Collaborator) commented on May 19, 2026

Summary

  • Bridge processes accumulate indefinitely: each hermes chat session spawns a new bridge.cts, and the previous one stays alive as a "daemon" but is never cleaned up. On long-running servers this results in 10+ zombie bridge pairs.
  • Introduces a PID file (daemon/bridge.pid) as a lightweight singleton lock. On startup, the new bridge reads the PID file, terminates the recorded process if it is still alive (SIGTERM, then SIGKILL after a 5s grace period), writes its own PID, and removes the file on all exit paths.
  • --no-viewer (headless) bridges skip the kill: they do not need the viewer port, so they coexist with the daemon that owns it.
  • The existing install.sh kill logic is preserved as a complementary deployment-time fallback.
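
The SIGTERM → 5s → SIGKILL escalation described above could look roughly like the sketch below. This is not the code from this PR: `terminateWithGrace`, the `ProcessControl` interface, and the 100 ms poll interval are hypothetical names and values chosen for illustration. The only Node.js API assumed is `process.kill`, where signal `0` tests for existence without delivering a signal.

```typescript
// Hypothetical sketch; names and the injectable `ctl` are assumptions, not the PR's code.
type Signal = "SIGTERM" | "SIGKILL";

interface ProcessControl {
  isAlive(pid: number): boolean;
  send(pid: number, signal: Signal): void;
  sleep(ms: number): Promise<void>;
}

// Default implementation backed by Node's process.kill.
const nodeControl: ProcessControl = {
  isAlive(pid) {
    try {
      process.kill(pid, 0); // signal 0 only checks that the PID exists
      return true;
    } catch (err) {
      // EPERM: the process exists but belongs to another user.
      return (err as { code?: string }).code === "EPERM";
    }
  },
  send(pid, signal) {
    try {
      process.kill(pid, signal);
    } catch {
      // The process may have exited between the liveness check and the signal.
    }
  },
  sleep: (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
};

// Ask the process to stop, poll for up to graceMs, then force-kill it.
async function terminateWithGrace(
  pid: number,
  graceMs = 5000,
  pollMs = 100,
  ctl: ProcessControl = nodeControl,
): Promise<"not-running" | "terminated" | "killed"> {
  if (!ctl.isAlive(pid)) return "not-running";
  ctl.send(pid, "SIGTERM");
  for (let waited = 0; waited < graceMs; waited += pollMs) {
    await ctl.sleep(pollMs);
    if (!ctl.isAlive(pid)) return "terminated";
  }
  // "killed" means SIGKILL was sent, not that the exit has been observed yet.
  ctl.send(pid, "SIGKILL");
  return "killed";
}
```

Injecting `ctl` keeps the escalation logic testable without spawning real processes; a real caller would simply use the default.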

Test plan

  • Start bridge daemon → verify daemon/bridge.pid is created with correct PID
  • Start a second bridge → verify the first is killed and PID file updated
  • SIGTERM the running bridge → verify PID file is removed
  • Run hermes chat twice in sequence → verify only 1 bridge pair remains (no accumulation)
  • Run --no-viewer bridge alongside daemon → verify both coexist (no kill)

Each `hermes chat` session spawns a new bridge.cts process. When the
chat session ends (stdin closes), the bridge transitions to a
"staying alive as daemon" state to keep the Memory Viewer accessible.
However, the next chat session spawns yet another bridge without
killing the previous one, causing unbounded process accumulation
(observed: 10+ zombie bridge pairs on long-running servers).

Root cause: bridge.cts has no mechanism to detect or clean up a
previously running instance before starting.

Fix: introduce a PID file (`~/<agent>/memos-plugin/daemon/bridge.pid`)
as a lightweight singleton lock:

  - On startup (unless --no-viewer), read the PID file; if the
    recorded process is still alive, send SIGTERM and wait up to 5s
    before SIGKILL.
  - Write own PID to the file after acquiring the slot.
  - Remove the PID file on all exit paths (SIGTERM/SIGINT handler,
    daemon shutdown, headless exit, keepalive viewer-closed check).
  - --no-viewer (headless) bridges skip the kill — they don't need
    the port and coexist with the daemon that owns it.
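
The PID file lifecycle in the list above can be sketched as follows. This is a minimal illustration, not the implementation in bridge.cts: `readPid`, `writePid`, `removePid`, `findRunningBridge`, and `installCleanup` are hypothetical names. The ownership check in `removePid` is an added assumption that the source does not describe; it is meant to stop a late exit handler in an old bridge from deleting its successor's file.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Returns the PID recorded in the file, or null if the file is missing or malformed.
function readPid(pidFile: string): number | null {
  try {
    const pid = Number.parseInt(fs.readFileSync(pidFile, "utf8").trim(), 10);
    return Number.isInteger(pid) && pid > 0 ? pid : null;
  } catch {
    return null;
  }
}

// Signal 0 performs the existence check without delivering a signal.
function isAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM: the process exists but belongs to another user.
    return (err as { code?: string }).code === "EPERM";
  }
}

// Returns the PID of another live bridge recorded in the file, if any.
function findRunningBridge(pidFile: string): number | null {
  const pid = readPid(pidFile);
  return pid !== null && pid !== process.pid && isAlive(pid) ? pid : null;
}

// Records a PID (this process by default), creating the daemon directory if needed.
function writePid(pidFile: string, pid: number = process.pid): void {
  fs.mkdirSync(path.dirname(pidFile), { recursive: true });
  fs.writeFileSync(pidFile, String(pid));
}

// Removes the file only if it still records the given PID (assumption, see lead-in).
function removePid(pidFile: string, pid: number = process.pid): void {
  if (readPid(pidFile) !== pid) return;
  try {
    fs.unlinkSync(pidFile);
  } catch {
    // Already removed by another exit path.
  }
}

// Routes SIGTERM/SIGINT through process.exit so the "exit" hook removes the file.
function installCleanup(pidFile: string): void {
  process.once("exit", () => removePid(pidFile));
  for (const sig of ["SIGTERM", "SIGINT"] as const) {
    process.once(sig, () => process.exit(0));
  }
}
```

Under these assumptions, a non-headless bridge would call `findRunningBridge`, terminate the returned PID if there is one, then call `writePid` and `installCleanup`. Reading and then writing the file is not atomic, so two bridges starting at the same instant could still race.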

The existing install.sh kill logic is preserved as a deployment-time
fallback; the two mechanisms are complementary.

Co-authored-by: Cursor <cursoragent@cursor.com>
syzsunshine219 merged commit 2908e87 into MemTensor:mem-agent-0514 on May 19, 2026
@chiefmojo commented

Large-DB crash loop: PID singleton kills old bridge before DB lock releases

We reverted a 655MB MemOS DB from v2.0.0-beta.1 to v2.0.5 and hit an infinite bridge restart cycle. The PID singleton guard kills the old bridge, but the new bridge then opens the same SQLite DB and hangs in WAL recovery — long enough to exceed the gateway's 30s timeout. The gateway spawns yet another bridge, which kills the current one, and the cycle repeats indefinitely.

On a smaller 43MB DB the same code works fine — WAL recovery completes before the next bridge spawns. The threshold where it breaks seems to be somewhere between those two, likely dependent on DB size and I/O speed.

Root cause: killExistingBridge() sends SIGTERM but doesn't wait for the old process to fully exit and release file locks. A waitpid() or polling fuser on the DB file before proceeding would prevent the race.
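
A possible shape for the wait suggested above, as a sketch only. POSIX `waitpid()` applies only to a process's own children, and the old bridge is not a child of the new one, so this version polls with signal `0` until the PID disappears. `waitForExit` and its defaults are hypothetical names and values, not part of killExistingBridge().

```typescript
// Signal 0 checks for existence without delivering a signal.
function defaultIsAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM: the process exists but belongs to another user.
    return (err as { code?: string }).code === "EPERM";
  }
}

// Resolves true once the PID is gone, or false if it is still alive at the deadline.
async function waitForExit(
  pid: number,
  timeoutMs = 10_000,
  pollMs = 100,
  isAlive: (pid: number) => boolean = defaultIsAlive,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (isAlive(pid)) {
    if (Date.now() >= deadline) return false;
    await new Promise<void>((resolve) => setTimeout(resolve, pollMs));
  }
  return true;
}
```

A caller could refuse to open the SQLite DB when this returns `false`. Once a process has exited, the kernel has closed its file descriptors and released its file locks, so a `true` result means the old bridge no longer holds the DB. It does not shorten WAL recovery itself, so the gateway's 30s timeout may still need separate handling.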

Cross-ref: #1452 describes a related class of startup-hang → restart-loop issues.
