Serialize claude stdin writes; drain cancelled turns to boundary (closes #979) by FidoCanCode · Pull Request #980 · FidoCanCode/home

FidoCanCode · 2026-04-25T18:00:56Z

Fixes #979.

Real root cause

_fire_worker_cancel writes control_request interrupt to claude's stdin without holding the session _lock. The worker thread holds _lock for its in-flight turn — can't release until it sees the cancel — so the webhook firing the preempt has no choice but to skirt the lock.

Linux pipe writes are only atomic up to PIPE_BUF=4096 bytes. Worker mid-write of a multi-KB user message gets its bytes interleaved with the webhook's tiny control_request, producing malformed JSON. claude consumes the corrupt bytes and goes idle in epoll_wait waiting for valid input — the wedged state we kept observing in production.

This is the underlying cause of the entire post-#955 cascade (#963, #967, #971, #973, #974, #975). Each prior fix addressed a downstream symptom; this addresses the source.

Two changes that close the race

1. _stdin_lock (separate from _lock)

Held briefly around every proc.stdin.write + flush. Both lock-holding callers and the unlocked cancel path acquire it, so byte-level writes from concurrent threads cannot interleave. Worker still holds _lock for its whole turn; webhook still fires interrupt without _lock. The byte-level invariant is restored regardless of the protocol-level lock state.

2. iter_events drains to boundary on cancel

Old behavior: cancel → break immediately. Left claude's response (control_response + type=result) in the pipe as stale events, requiring _drain_to_boundary on the next send to clean up.

New behavior: cancel → keep reading until type=result (just record _last_turn_cancelled=True). Same iter_events call drains the response. Next holder gets a clean session.

A new _CANCEL_DRAIN_TIMEOUT (5s) bounds the drain — if claude is wedged and never emits the boundary, kill subprocess + recover. Replaces the 10s drain-to-boundary in send() and the _in_turn-based stale-turn detection.

Tests

TestClaudeSessionStdinAtomicity (new): drives concurrent send + _send_control_interrupt at byte granularity and asserts no interleave. Plus per-method tests that each write site holds _stdin_lock.
test_cancel_drains_to_boundary (rewritten): asserts events ARE consumed past cancel and _in_turn=False at end.
test_recovers_when_cancel_drain_times_out (rewritten): asserts ClaudeStreamError(_RETURNCODE_CANCEL_DRAIN_TIMEOUT) when claude doesn't emit the boundary.
test_handler_prompt_runs_after_preempt_does_not_inherit_cancel (simplified): the cancel-leak bug from Bug: webhook handler's first prompt aborted by leftover cancel flag from worker preempt #973 is structurally impossible in the new design.

All 263 claude tests pass. The single failing copilotcli test (test_webhook_preempts_worker_cancels_runtime) is pre-existing and unrelated.

Follow-ups

The cancel-firer-tid tracking from Bug: webhook handler's first prompt aborted by leftover cancel flag from worker preempt (closes #973) #974 (_cancel_fired_by_tid + __enter__) is now redundant — iter_events naturally clears the cancel state via the existing _preempt_pending gate. Worth removing in a follow-up.
_drain_to_boundary is no longer called from production code, only from its own tests. Worth deleting.
Both follow-ups would simplify the codebase further; left out of this PR to keep the scope tight.

…to boundary (closes #979) The post-#955 cascade of preempt-related hangs (#963, #967, #971, #973, #974, #975) all traced to one underlying race: _fire_worker_cancel writes a control_request interrupt to claude's stdin without holding the session _lock. The worker thread holds _lock for its in-flight turn (can't release until it sees the cancel), so the webhook firing the preempt has no choice but to skirt the lock. Linux pipe writes are only atomic up to PIPE_BUF=4096 bytes; the worker mid-write of a multi-KB user message gets its bytes interleaved with the webhook's tiny control_request, producing malformed JSON. claude consumes the corrupt bytes and goes idle in epoll_wait waiting for valid input — the wedged state we kept observing. Two changes that together close the race: 1. Add ClaudeSession._stdin_lock, separate from _lock, that serializes only the bytes-on-the-wire layer. Held briefly per write+flush. Both _lock-holding callers and the un-locked cancel path acquire it, so writes from concurrent threads cannot interleave. 2. iter_events on cancel keeps reading until the type=result/error boundary instead of breaking immediately. After the cancel signal, claude emits its control_response and a closing type=result for the cancelled turn — fido consumes them inside the same iter_events call, so the next holder gets a clean session with no leftover events to drain. A new _CANCEL_DRAIN_TIMEOUT (5s) bounds the drain: if claude is wedged and never emits the boundary, the subprocess is killed and recovered. This replaces the 10s drain-to-boundary call from send() and the _in_turn-based stale-turn detection. Adds TestClaudeSessionStdinAtomicity that drives concurrent send + _send_control_interrupt at byte granularity and asserts no interleave. Updates iter_events tests for the new keep-reading-on-cancel contract.

…loses #981) (#982) Fixes #981. ## Root cause `ClaudeSession._pending_talker_kind` was a session-level instance attribute that `prompt()` wrote and `__enter__` read. Cross-thread race: when worker and webhook threads called `prompt()` in overlapping windows, one's write clobbered the other's. The lock-winner then read the wrong kind in `__enter__`. Production captured the bug clearly: status XML showed the worker tid (140034430314176) registered with `kind="webhook"`. Every subsequent `preempt_worker()` call read "holder is webhook" → returned False → no cancel fired → webhooks queued behind the worker for 4+ minutes. ## Fix `__enter__` reads the kind directly from `provider.current_thread_kind()` — the calling thread's thread-local. No shared attribute. No race. Each thread carries its own kind by construction. Removes `_pending_talker_kind` entirely (set + read + clear). Removes the now-unused `Literal` import. ## Test `test_enter_uses_thread_kind_not_shared_attribute` plants a stale "webhook" annotation on the session-level attribute (simulating what another thread might have written) while the current thread's kind is "worker". - Pre-fix: `__enter__` reads the annotation and mis-registers as webhook → test fails - Post-fix: `__enter__` reads the thread-local and registers as worker → test passes Verified by stashing the source fix and re-running: pre-fix FAILS, post-fix PASSES. So the test catches the bug. ## Related - #979 / #980 — byte-level stdin race; fixed prior. This is a separate higher-level race on talker registration that #980 unmasked (same cause: lock handoff is faster, race window more frequent). - #973 / #974 — earlier preempt-related race fixes (cancel-leak) - #786 — original talker registration design Co-authored-by: Fido Can Code <190991155+FidoCanCode@users.noreply.github.com>

Fixes #983. ## Bug When webhook A held the lock and webhook B queued behind it via `hold_for_handler`, the lock handoff at A→exit went into a free-for-all between B and any worker thread also waiting. Python's `RLock` isn't FIFO; the worker often won. Result: `A → worker → B` instead of `A → B → worker`. Every subsequent webhook in a flurry waited behind a full worker turn. Production trace at 18:25:44: review submitted with two inline comments → three webhooks → handler 1 EXITs → worker grabs lock and runs Agent Explore tool → handler 3 still queued. ## Fix **ClaudeSession** gains a `_preempt_cond` Condition. Webhook callers set `_preempt_pending` and `notify_all` under this mutex (via new `_signal_pending_preempt` method). Workers entering `__enter__` `wait_for(not pending)`, and **re-check after acquiring** the lock — if a webhook arrived in the gap, release and retry. Worker re-acquire is strictly gated on "no webhook waiting". **OwnedSession** gains an abstract `_signal_pending_preempt` hook (default no-op). `hold_for_handler` calls it on the queueing branch (where `preempt_worker` returned False because the holder is also a webhook). ClaudeSession overrides. The cond mutex is held only briefly per check/set/clear — never during session work. RLock semantics for the holder unchanged; only acquire-time prioritization changes. ## Test `test_hold_for_handler_signals_pending_when_queueing_behind_webhook` patches `try_preempt_worker` to return `(False, "webhook")` and asserts `hold_for_handler` calls `_signal_pending_preempt` exactly once and `_preempt_pending` is set afterward. Deterministic — no `time.sleep`. Pre-fix: `AttributeError` (no such method). Post-fix: passes. ## Related - #955, #963, #967, #971, #973, #974, #975, #979, #980, #981, #982 — all the prior fixes in this preempt cascade. Each peeled a layer; this one closes the lock-handoff prioritization. Co-authored-by: Fido Can Code <190991155+FidoCanCode@users.noreply.github.com>

FidoCanCode requested a review from rhencke April 25, 2026 18:01

rhencke approved these changes Apr 25, 2026

View reviewed changes

FidoCanCode merged commit d1e8953 into main Apr 25, 2026
1 check passed

FidoCanCode deleted the fix-stdin-race-and-cancel-drain branch April 25, 2026 18:02

This was referenced Apr 25, 2026

Bug: _pending_talker_kind race lets worker register with kind=webhook (or vice versa) — breaks preempt detection #981

Closed

Read talker kind from current thread, not session-shared attribute (closes #981) #982

Merged

This was referenced Apr 25, 2026

Bug: webhook→webhook handoff lets the worker win the lock between handlers #983

Closed

Webhook-priority lock handoff via Condition mutex (closes #983) #984

Merged

FidoCanCode mentioned this pull request Apr 25, 2026

Bug: switch_model hangs when called with _in_turn=True after preempt-cancelled drain #975

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Serialize claude stdin writes; drain cancelled turns to boundary (closes #979)#980

Serialize claude stdin writes; drain cancelled turns to boundary (closes #979)#980
FidoCanCode merged 1 commit into
mainfrom
fix-stdin-race-and-cancel-drain

FidoCanCode commented Apr 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

FidoCanCode commented Apr 25, 2026

Real root cause

Two changes that close the race

Tests

Follow-ups

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants