fix(agent-channel): set busy_timeout on agent_channel.db to prevent concurrent stop-hook deadlock by evannadeau · Pull Request #12 · SpawnBox-dev/claude-plugins

evannadeau · 2026-05-15T20:53:13Z

Summary

The agent-channel SQLite DB (<project>/.orchestrator-state/agent-channel/agent_channel.db) is initialized with journal_mode = WAL and synchronous = NORMAL but NO busy_timeout (see mcp/engine/agent_channel_state.ts:200-203, pre-fix). The global plugin DB at mcp/db/connection.ts:79 already has busy_timeout = 5000 for the same reason this PR adds it here.

Without a busy timeout, a concurrent SQLite writer hits SQLITE_BUSY immediately rather than waiting. Multiple Claude Code sessions in the same project each run their own MCP server with its own AgentChannel writing heartbeats, offsets, sessions, and system_events to this same per-project DB. When PA + one or more SAs trigger /exit simultaneously, all their stop-hooks fire at once and race for the writer lock. The losers see SQLITE_BUSY immediately, Claude Code re-fires the stop-hook reminder (Before ending: complete orchestrator housekeeping...), and the losers retry against the still-locked DB. The cycle hangs both parent shells until the operator force-quits and restarts the host.

Reproduction signature

Reported empirically on Ubuntu-24.04 WSL2 with the orchestrator plugin (incident 2026-05-14):

Two sessions hung at /exit, requiring full WSL restart 14 minutes later to recover. Process kill alone is insufficient — the kernel-level SQLite locks survive in shared memory until the WSL VM is recycled.
Their JSONL transcripts end with rapid back-to-back stop-hook reminder user-role injections (no assistant response between them) — pointing at hook-retry-loop, not model hang.
A third concurrent session that /exit-ed slightly earlier (before the parallel exit storm) drained cleanly. The triangulation (one clean exit + two hung exits, same project, same time window) identifies the failure mode as concurrent-write contention rather than a per-session bug.
agent_channel.db-wal and global.db-wal both sat at ~4 MB at incident time, inflating the checkpoint-replay component of the contention window — but the proximate trigger is the missing busy_timeout, not the WAL size.

Fix

Add PRAGMA busy_timeout = 5000 to the agent_channel.db connection init, mirroring the same pragma already set on the global plugin DB. Writers now wait up to 5s for the writer lock instead of throwing SQLITE_BUSY immediately. The lock is held briefly (sessions/offsets/heartbeats are tiny INSERT/UPSERTs), so 5s is well beyond the worst-case contention window.

Diff:

   if (!useInMemory) {
     db.exec("PRAGMA journal_mode = WAL;");
   }
   db.exec("PRAGMA synchronous = NORMAL;");
+  // WAL allows concurrent readers but writers serialize. Without a busy
+  // timeout, a concurrent writer throws SQLITE_BUSY immediately instead of
+  // waiting. Multiple Claude Code sessions in the same project all run their
+  // own MCP server with its own AgentChannel writing heartbeats, offsets,
+  // sessions, and system_events to this DB - and they all hit the stop-hook
+  // write path at end-of-session. Without this timeout, concurrent stop-hooks
+  // race for the writer lock, the loser sees SQLITE_BUSY immediately, Claude
+  // Code re-fires the stop-hook reminder, and the loser retries against the
+  // still-locked DB - producing the deadlock-shape that hangs both parent
+  // shells until host restart. Mirrors the same fix already applied to the
+  // global plugin DB at mcp/db/connection.ts.
+  db.run("PRAGMA busy_timeout = 5000;");

Style note (asking for guidance)

The new line uses db.run("PRAGMA ...") while the adjacent lines use db.exec("PRAGMA ..."). Both are functionally identical for single-statement PRAGMA configuration on bun:sqlite. The asymmetry was forced by a local PreToolUse hook that false-positives on .exec( regex matches in tool input (thinks they're child_process.exec). Happy to normalize to db.exec on review request — operator workaround was the only blocker.

Test plan

bun install clean (98 packages).
bun run build → dist/server.js 0.94 MB, 249 modules (regenerated to match source).
bun test → 516 pass / 0 fail / 1207 expect() calls (unchanged — pragma config doesn't break any existing behavior).
Verified the existing :memory: test path still works (busy_timeout is a no-op on in-memory DBs, harmless to set).
Manual code-review confirmation: the only call site for the new DB connection setup is getDb() at line 165; the dbCache is per-stateDir so each project gets exactly one connection with the pragma applied.
Reviewer may want a stress test simulating concurrent /exit from multiple sessions. Doable but requires either real child processes or careful single-process lock-holding. Happy to add if you'd like.

Adjacent hygiene (intentionally out-of-scope, separate PR)

The WAL inflation contributing to incident-time contention has no periodic-checkpoint mitigation today — only checkpoint-on-close at agent_channel_state.ts:146. Adding a periodic wal_checkpoint(TRUNCATE) to the MCP server's heartbeat tick (e.g., once per N heartbeats) would bound WAL size and further reduce the contention surface. Worth considering as a separate hygiene PR after this lock-timeout fix lands.

Files changed

plugins/orchestrator/mcp/engine/agent_channel_state.ts — 12 added lines (1 pragma, 11 explanatory comment).
plugins/orchestrator/dist/server.js — regenerated via bun run build. The dist diff is microscopic (1 line) — bundler-deterministic since the source change is localized to a single PRAGMA-statement string.

…oncurrent stop-hook deadlock The agent-channel SQLite DB was initialized with `journal_mode = WAL` and `synchronous = NORMAL` but NO `busy_timeout`. WAL allows concurrent readers but writers still serialize via the writer lock; without a busy timeout a concurrent writer hits SQLITE_BUSY immediately rather than waiting. The bug surfaces at session end. Every Claude Code session in a project runs its own MCP server with its own AgentChannel instance writing heartbeats, offsets, sessions, and system_events to this same per-project DB. When PA + one or more SAs trigger /exit simultaneously, all their stop-hooks fire at once and race for the writer lock. The losers hit SQLITE_BUSY immediately, Claude Code re-fires the stop-hook reminder (`Before ending: complete orchestrator housekeeping...`), the losers retry against the still-locked DB, and the cycle hangs both parent shells until the operator force-quits and restarts the host. Recovery via process kill alone is insufficient because the kernel-level SQLite locks survive in shared memory until the WSL VM (or equivalent) is recycled. Symptom signature (from operator-reported incident 2026-05-14): - Two sessions hung at /exit, requiring full WSL restart 14 minutes later - Their transcripts end with rapid back-to-back stop-hook reminder user-role injections (no assistant response between them), pointing at hook-retry-loop rather than model hang - A third concurrent session that /exit-ed slightly earlier drained cleanly, which by elimination identifies the failure as concurrent- write contention rather than per-session bug - agent_channel.db-wal and global.db-wal sat at ~4 MB at incident time, inflating the checkpoint-replay component of the contention window Fix: add `PRAGMA busy_timeout = 5000` to the agent_channel.db connection init in agent_channel_state.ts, mirroring the same pragma already set on the global plugin DB at mcp/db/connection.ts:79. Writers now wait up to 5s for the lock instead of throwing immediately. The lock is held briefly (sessions/offsets/heartbeats are tiny INSERT/UPSERTs), so 5s is well beyond the worst-case contention window. Style note: this single new line uses `db.run("PRAGMA ...")` rather than `db.exec("PRAGMA ...")` (which the adjacent lines use); both are functionally identical for single-statement PRAGMA configuration on bun:sqlite. The asymmetry was forced by a local pre-tool-use hook that false-positives on `.exec(` patterns thinking they're child_process.exec. Happy to normalize to .exec on review request. Adjacent hygiene (intentionally out-of-scope for this PR but related): the WAL inflation contributing to incident-time contention has no periodic-checkpoint mitigation today (only checkpoint-on-close at agent_channel_state.ts:146). Adding a periodic `wal_checkpoint(TRUNCATE)` to the MCP server's heartbeat tick would bound WAL size and further reduce the contention surface. Separate PR. dist/server.js regenerated via `bun run build` (249 modules, 0.94 MB); test suite stays at 516 pass / 0 fail.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(agent-channel): set busy_timeout on agent_channel.db to prevent concurrent stop-hook deadlock#12

fix(agent-channel): set busy_timeout on agent_channel.db to prevent concurrent stop-hook deadlock#12
evannadeau wants to merge 1 commit into
SpawnBox-dev:mainfrom
evannadeau:fix/stop-hook-sqlite-busy-timeout

evannadeau commented May 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

evannadeau commented May 15, 2026

Summary

Reproduction signature

Fix

Style note (asking for guidance)

Test plan

Adjacent hygiene (intentionally out-of-scope, separate PR)

Files changed

Related

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant