fix(hooks): SessionStats.flush() race condition allows lost updates between concurrent processes #1493

@JeremyDev87

Problem

SessionStats.flush() in packages/claude-code-plugin/hooks/lib/stats.py performs a non-atomic read-modify-write against the on-disk stats file. The file lock is released between the read and the write, so two PostToolUse hook processes that fire concurrently can each read the same baseline, each apply their delta, and the second writer overwrites the first writer's update. The session loses one or more tool calls.

Affected code

File: packages/claude-code-plugin/hooks/lib/stats.py
Method: SessionStats.flush()

```python
def flush(self) -> None:
    """Flush accumulated in-memory stats to disk."""
    if self._pending_count == 0:
        return
    data = self._locked_read()  # <-- LOCK_SH acquired and released
    data["tool_count"] = data.get("tool_count", 0) + self._mem_tool_count
    data["error_count"] = data.get("error_count", 0) + self._mem_error_count
    tool_names = data.get("tool_names", {})
    for name, count in self._mem_tool_names.items():
        tool_names[name] = tool_names.get(name, 0) + count
    data["tool_names"] = tool_names
    # Merge hook timings
    hook_timings = data.get("hook_timings", {})
    for name, times in self._mem_hook_timings.items():
        if name not in hook_timings:
            hook_timings[name] = []
        hook_timings[name].extend(times)
    data["hook_timings"] = hook_timings
    self._locked_write(data)  # <-- new LOCK_EX acquired here
    # Reset in-memory accumulators
    ...
```

The helpers `_locked_read()` and `_locked_write()` each open the file inside their own `with` block and release the lock when that block exits. Between the two calls no lock is held, so another process can acquire `LOCK_EX`, write, and release before this process writes back its now-stale copy.

Race scenario

Two PostToolUse hook processes A and B fire for two parallel tool calls:

  1. Process A: _locked_read() returns {tool_count: 5}, releases lock
  2. Process B: _locked_read() returns {tool_count: 5}, releases lock
  3. Process A: applies delta, _locked_write({tool_count: 6})
  4. Process B: applies delta, _locked_write({tool_count: 6}) — should be 7

One tool call has been silently lost.
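
The interleaving above can be replayed deterministically in a single process. This is only an illustrative sketch: `read_stats` and `write_stats` are hypothetical stand-ins for `_locked_read()` and `_locked_write()`, and locking is omitted because each helper would release its lock before the next step runs anyway.

```python
import json
import os
import tempfile


def read_stats(path):
    """Hypothetical stand-in for _locked_read(): return the parsed file."""
    with open(path) as f:
        return json.load(f)


def write_stats(path, data):
    """Hypothetical stand-in for _locked_write(): overwrite the file."""
    with open(path, "w") as f:
        json.dump(data, f)


def simulate_lost_update():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "stats.json")
        write_stats(path, {"tool_count": 5})

        # Steps 1 and 2: A and B both read the same baseline.
        snapshot_a = read_stats(path)
        snapshot_b = read_stats(path)

        # Step 3: A applies its delta and writes 6.
        snapshot_a["tool_count"] += 1
        write_stats(path, snapshot_a)

        # Step 4: B applies its delta to a stale snapshot and overwrites A.
        snapshot_b["tool_count"] += 1
        write_stats(path, snapshot_b)

        return read_stats(path)["tool_count"]


final_count = simulate_lost_update()
print("expected 7, got", final_count)  # prints: expected 7, got 6
```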

Why this matters

  • The fix shipped in #1492 (`fix(hooks): persist tool stats from short-lived hook processes`) ensures every recorded call reaches disk through a single-process flush. This issue is the next layer of correctness: ensuring that concurrent flushes do not clobber each other.
  • Claude Code can dispatch tools in parallel (e.g. multiple Agent invocations, or several Read/Grep in one assistant turn). Each spawns its own short-lived PostToolUse process, so the race window is realistic, not hypothetical.
  • Users will see slightly under-reported [CB] Xm | N tools | ... numbers in the Stop hook summary on busy turns.

Reproduction

```bash
cd packages/claude-code-plugin
python3 - <<'PY'
import multiprocessing as mp
import os, sys, tempfile, json

sys.path.insert(0, 'hooks/lib')
from stats import SessionStats

def worker(data_dir, session_id, n):
    s = SessionStats(session_id=session_id, data_dir=data_dir, flush_interval=10)
    for _ in range(n):
        s.record_tool_call("Bash")
        s.flush()

# worker is defined in a stdin script, which spawn/forkserver children
# cannot re-import, so force the fork start method for this repro.
mp.set_start_method("fork")

with tempfile.TemporaryDirectory() as tmp:
    procs = [mp.Process(target=worker, args=(tmp, "race", 100)) for _ in range(8)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    s = SessionStats(session_id="race", data_dir=tmp)
    on_disk = s._locked_read()
    print("expected", 8 * 100, "got", on_disk["tool_count"])
PY
```

Expected output: `expected 800 got 800`. Today you will typically see a number well under 800 (varies per run).

Fix direction

Replace _locked_read() + _locked_write() inside flush() with a single critical section:

  1. Open the stats file in `r+` mode
  2. Acquire `fcntl.flock(LOCK_EX)`
  3. `json.load()` from the file handle
  4. Apply in-memory deltas to the loaded dict
  5. `f.seek(0)` + `f.truncate()` + `json.dump()`
  6. Release the lock by closing the file

A clean way to do this is to add a private `_locked_modify(self, mutator)` helper on `SessionStats` that takes a callable `(data: dict) -> dict` and runs it inside one `LOCK_EX` window. `flush()` then becomes a thin caller of `_locked_modify`.

Note: `_locked_read()` and `_locked_write()` may still be useful for callers that only read or only write, so leave them in place.
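
A minimal sketch of the single critical section described above, written as a free function so it runs on its own. Everything here is an assumption for illustration: the name `locked_modify`, the `path` argument, the `O_CREAT` handling for a missing file, and treating an empty file as `{}` are not taken from `stats.py`. It also assumes `fcntl` is available (POSIX only).

```python
import fcntl
import json
import os
import tempfile


def locked_modify(path, mutator):
    """Run mutator(data) -> data inside one LOCK_EX window (hypothetical helper)."""
    # O_CREAT lets the first flush create the file without truncating an existing one.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until no other process holds the lock
        raw = f.read()
        data = json.loads(raw) if raw.strip() else {}
        data = mutator(data)
        f.seek(0)
        f.truncate()
        json.dump(data, f)
        f.flush()
        # Leaving the with block closes the file, which releases the lock.
    return data


def add_one_bash_call(data):
    """Example mutator: apply a delta of one "Bash" call."""
    data["tool_count"] = data.get("tool_count", 0) + 1
    names = data.setdefault("tool_names", {})
    names["Bash"] = names.get("Bash", 0) + 1
    return data


with tempfile.TemporaryDirectory() as tmp:
    stats_path = os.path.join(tmp, "stats.json")
    locked_modify(stats_path, add_one_bash_call)
    result = locked_modify(stats_path, add_one_bash_call)

print(result)
```

With a helper of this shape, `flush()` would pass a mutator that merges `_mem_tool_count`, `_mem_error_count`, `_mem_tool_names`, and `_mem_hook_timings` into the loaded dict, so the read and the write can no longer be separated by another writer.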

Acceptance criteria

  • `SessionStats.flush()` performs read-modify-write inside a single `fcntl.flock(LOCK_EX)` window.
  • New regression test in `packages/claude-code-plugin/tests/test_stats.py` (suggested class `TestConcurrentFlush`):
    • Spawns 8 `multiprocessing.Process` workers, each calling `record_tool_call() + flush()` 100 times against the same session/data_dir.
    • Asserts the final on-disk `tool_count` equals `8 * 100` exactly.
    • Asserts the final `tool_names["Bash"]` equals `8 * 100`.
  • Existing `test_concurrent_writes_dont_corrupt` (single-process) still passes.
  • All other tests in `test_stats.py` still pass.
  • Verify behavior on macOS (where `fcntl` is available), and document the fallback when `HAS_FCNTL` is False. The code currently skips locking silently in that case, so the fallback also needs a comment about the lost-update risk.
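
The shape of the regression test could look roughly like the sketch below. It does not import `SessionStats`: `locked_increment` is a hypothetical stand-in for `record_tool_call() + flush()` under the single-lock design, and the explicit `fork` context is an assumption made so the sketch runs as a plain script. The real `TestConcurrentFlush` would drive `SessionStats` against a shared `data_dir` instead.

```python
import fcntl
import json
import multiprocessing as mp
import os
import tempfile


def locked_increment(path, tool_name):
    """Hypothetical stand-in: one recorded call flushed under a single LOCK_EX."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        raw = f.read()
        data = json.loads(raw) if raw.strip() else {}
        data["tool_count"] = data.get("tool_count", 0) + 1
        names = data.setdefault("tool_names", {})
        names[tool_name] = names.get(tool_name, 0) + 1
        f.seek(0)
        f.truncate()
        json.dump(data, f)


def worker(path, n):
    for _ in range(n):
        locked_increment(path, "Bash")


def run_concurrent_flush(workers=8, calls=100):
    # Assumption: "fork" avoids re-importing this script in the children.
    ctx = mp.get_context("fork")
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "race.json")
        procs = [ctx.Process(target=worker, args=(path, calls)) for _ in range(workers)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        with open(path) as f:
            return json.load(f)


final = run_concurrent_flush()
print(final["tool_count"], final["tool_names"]["Bash"])
```

Both printed values should equal `8 * 100`; with the read and write in separate lock windows, the same harness would typically report less.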

Out of scope

  • `record_hook_timing` is never called by any hook — tracked separately.
  • `HookTimer` (`hooks/lib/hook_timer.py`) is also dead code — tracked in the same separate issue.

References

  • Discovered while reviewing #1492 (`fix(hooks): persist tool stats from short-lived hook processes`).
  • Fixed in #1492: the "single record per process is lost on exit" bug. This issue is the orthogonal "concurrent flushes lose updates" bug.
  • Touched files (read-only context):
    • `packages/claude-code-plugin/hooks/lib/stats.py` — `SessionStats.flush`, `_locked_read`, `_locked_write`
    • `packages/claude-code-plugin/hooks/post-tool-use.py` — caller that triggers concurrent flushes
    • `packages/claude-code-plugin/tests/test_stats.py` — existing single-process locking test

Metadata

  • Labels: bug