fix(hooks): SessionStats.flush() race condition allows lost updates between concurrent processes

## Problem

`SessionStats.flush()` in `packages/claude-code-plugin/hooks/lib/stats.py` performs a non-atomic read-modify-write against the on-disk stats file. The file lock is released between the read and the write, so two `PostToolUse` hook processes that fire concurrently can each read the same baseline, each apply their delta, and the second writer overwrites the first writer's update. The session loses one or more tool calls.

## Affected code

File: `packages/claude-code-plugin/hooks/lib/stats.py`
Method: `SessionStats.flush()`

```python
def flush(self) -> None:
    """Flush accumulated in-memory stats to disk."""
    if self._pending_count == 0:
        return
    data = self._locked_read()         # <-- LOCK_SH acquired and released
    data["tool_count"] = data.get("tool_count", 0) + self._mem_tool_count
    data["error_count"] = data.get("error_count", 0) + self._mem_error_count
    tool_names = data.get("tool_names", {})
    for name, count in self._mem_tool_names.items():
        tool_names[name] = tool_names.get(name, 0) + count
    data["tool_names"] = tool_names
    # Merge hook timings
    hook_timings = data.get("hook_timings", {})
    for name, times in self._mem_hook_timings.items():
        if name not in hook_timings:
            hook_timings[name] = []
        hook_timings[name].extend(times)
    data["hook_timings"] = hook_timings
    self._locked_write(data)            # <-- new LOCK_EX acquired here
    # Reset in-memory accumulators
    ...
```

The helpers `_locked_read()` and `_locked_write()` each open the file inside their own `with` block and release the lock when the block exits. Between the two calls there is a window in which another process can grab `LOCK_EX`.

## Race scenario

Two `PostToolUse` hook processes A and B fire for two parallel tool calls:

1. Process A: `_locked_read()` returns `{tool_count: 5}`, releases lock
2. Process B: `_locked_read()` returns `{tool_count: 5}`, releases lock
3. Process A: applies delta, `_locked_write({tool_count: 6})`
4. Process B: applies delta, `_locked_write({tool_count: 6})` — should be 7

One tool call has been silently lost.
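
The interleaving above can be replayed deterministically in a single process. This is a minimal sketch, not the plugin's code: `read_stats` and `write_stats` are hypothetical stand-ins for `_locked_read()` and `_locked_write()`, and the locks are omitted because each one would be released before the next step runs anyway.

```python
import json
import os
import tempfile


def read_stats(path):
    """Hypothetical stand-in for _locked_read(): return the on-disk dict."""
    with open(path) as f:
        return json.load(f)


def write_stats(path, data):
    """Hypothetical stand-in for _locked_write(): overwrite the on-disk dict."""
    with open(path, "w") as f:
        json.dump(data, f)


def replay_lost_update():
    """Replay steps 1-4 in order and return the final tool_count."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "stats.json")
        write_stats(path, {"tool_count": 5})

        snapshot_a = read_stats(path)   # step 1: A reads 5
        snapshot_b = read_stats(path)   # step 2: B reads 5

        snapshot_a["tool_count"] += 1   # A applies its delta
        write_stats(path, snapshot_a)   # step 3: A writes 6

        snapshot_b["tool_count"] += 1   # B applies its delta to a stale read
        write_stats(path, snapshot_b)   # step 4: B writes 6, clobbering A

        return read_stats(path)["tool_count"]


print(replay_lost_update())  # prints 6, although two calls were recorded
```

Holding a lock inside each helper does not change the outcome, because neither lock spans both the read and the write.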

## Why this matters

- The fix shipped in #1492 ensures every recorded call reaches disk through a single-process flush. This issue is the **next layer** of correctness: ensuring that concurrent flushes do not clobber each other.
- Claude Code can dispatch tools in parallel (e.g. multiple `Agent` invocations, or several `Read`/`Grep` in one assistant turn). Each spawns its own short-lived `PostToolUse` process, so the race window is realistic, not hypothetical.
- Users will see slightly under-reported `[CB] Xm | N tools | ...` numbers in the Stop hook summary on busy turns.

## Reproduction

```bash
cd packages/claude-code-plugin
python3 - <<'PY'
import multiprocessing as mp
import sys, tempfile

sys.path.insert(0, 'hooks/lib')
from stats import SessionStats

# The script is read from stdin, so "spawn" children cannot re-import
# worker(). Force "fork", which is available on Linux and macOS.
mp.set_start_method("fork")

def worker(data_dir, session_id, n):
    s = SessionStats(session_id=session_id, data_dir=data_dir, flush_interval=10)
    for _ in range(n):
        s.record_tool_call("Bash")
        s.flush()

with tempfile.TemporaryDirectory() as tmp:
    procs = [mp.Process(target=worker, args=(tmp, "race", 100)) for _ in range(8)]
    for p in procs: p.start()
    for p in procs: p.join()
    s = SessionStats(session_id="race", data_dir=tmp)
    on_disk = s._locked_read()
    print("expected", 8 * 100, "got", on_disk["tool_count"])
PY
```

Expected output: `expected 800 got 800`. Today you will typically see a number well under 800 (varies per run).

## Fix direction

Replace `_locked_read()` + `_locked_write()` inside `flush()` with a single critical section:

1. Open the stats file in `r+` mode (creating it first if it does not exist, because `r+` raises `FileNotFoundError` on a missing file)
2. Acquire `fcntl.flock(LOCK_EX)`
3. `json.load()` from the file handle
4. Apply in-memory deltas to the loaded dict
5. `f.seek(0)` + `f.truncate()` + `json.dump()`
6. Release the lock by closing the file

A clean way to do this is to add a private `_locked_modify(self, mutator)` helper on `SessionStats` that takes a callable `(data: dict) -> dict` and runs it inside one `LOCK_EX` window. `flush()` then becomes a thin caller of `_locked_modify`.

Note: `_locked_read()` and `_locked_write()` may still be useful for callers that only read or only write, so leave them in place.
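
As a rough illustration of the single critical section, here is a minimal standalone sketch. It is written as a free function rather than a `SessionStats` method because the class internals (how the stats path is resolved, how accumulators are reset) are not shown in this issue. The names `locked_modify` and `add_one_bash_call` are hypothetical, and treating an empty file as `{}` is an assumption.

```python
import fcntl
import json
import os
import tempfile


def locked_modify(path, mutator):
    """Run mutator(data) -> data on a JSON file under one LOCK_EX window."""
    # O_CREAT without O_TRUNC: create the file if missing, but never wipe
    # existing content before the lock is held.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)   # blocks until this process owns the lock
        raw = f.read()
        data = json.loads(raw) if raw.strip() else {}  # assumption: empty file means {}
        data = mutator(data)
        f.seek(0)
        f.truncate()
        json.dump(data, f)
        f.flush()                       # push bytes out while still locked
    # Closing the file released the lock.
    return data


def add_one_bash_call(data):
    """Hypothetical mutator: the delta for one recorded "Bash" call."""
    data["tool_count"] = data.get("tool_count", 0) + 1
    names = data.setdefault("tool_names", {})
    names["Bash"] = names.get("Bash", 0) + 1
    return data


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stats.json")
    locked_modify(path, add_one_bash_call)
    final = locked_modify(path, add_one_bash_call)

print(final)  # {'tool_count': 2, 'tool_names': {'Bash': 2}}
```

Because the read and the write share one lock, a second process calling `locked_modify` blocks in `flock` until the first has written, so it always starts from the updated baseline.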

## Acceptance criteria

- [ ] `SessionStats.flush()` performs read-modify-write inside a single `fcntl.flock(LOCK_EX)` window.
- [ ] New regression test in `packages/claude-code-plugin/tests/test_stats.py` (suggested class `TestConcurrentFlush`):
  - Spawns 8 `multiprocessing.Process` workers, each calling `record_tool_call()` followed by `flush()` 100 times against the same session/data_dir.
  - Asserts the final on-disk `tool_count` equals `8 * 100` exactly.
  - Asserts the final `tool_names["Bash"]` equals `8 * 100`.
- [ ] Existing `test_concurrent_writes_dont_corrupt` (single-process) still passes.
- [ ] All other tests in `test_stats.py` still pass.
- [ ] Verify behavior on macOS (where `fcntl` is available), and document the fallback when `HAS_FCNTL` is False. The code currently skips locking silently in that case, so the fallback also needs a comment about the lost-update risk.
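
For the last item, one possible shape for a documented fallback is sketched below. The flag name `HAS_FCNTL` comes from this issue, but how `stats.py` actually defines it is assumed here, and `exclusive_lock` is a hypothetical helper, not an existing function.

```python
import contextlib
import tempfile

try:
    import fcntl
    HAS_FCNTL = True
except ImportError:  # fcntl is Unix-only
    HAS_FCNTL = False


@contextlib.contextmanager
def exclusive_lock(f):
    """Hold LOCK_EX on f for the block when fcntl exists; yield whether locked."""
    if not HAS_FCNTL:
        # No locking available on this platform: concurrent flushes can
        # still overwrite each other and lose updates.
        yield False
        return
    fcntl.flock(f, fcntl.LOCK_EX)
    try:
        yield True
    finally:
        fcntl.flock(f, fcntl.LOCK_UN)


with tempfile.TemporaryFile("w+") as f:
    with exclusive_lock(f) as locked:
        f.write("{}")  # the read-modify-write would go here
```

Yielding the boolean lets a caller or test tell whether the critical section was really protected, rather than having the unlocked path look identical to the locked one.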

## Out of scope

- `record_hook_timing` is never called by any hook (tracked separately).
- `HookTimer` (`hooks/lib/hook_timer.py`) is also dead code (tracked in the same separate issue).

## References

- Discovered while reviewing #1492 (`fix(hooks): persist tool stats from short-lived hook processes`).
- Fixed in #1492: the "single record per process is lost on exit" bug. This issue is the orthogonal "concurrent flushes lose updates" bug.
- Touched files (read-only context):
  - `packages/claude-code-plugin/hooks/lib/stats.py`: `SessionStats.flush`, `_locked_read`, `_locked_write`
  - `packages/claude-code-plugin/hooks/post-tool-use.py`: caller that triggers concurrent flushes
  - `packages/claude-code-plugin/tests/test_stats.py`: existing single-process locking test

fix(hooks): SessionStats.flush() race condition allows lost updates between concurrent processes #1493
