Bug: webhook handler's first prompt aborted by leftover cancel flag from worker preempt (closes #973) by FidoCanCode · Pull Request #974 · FidoCanCode/home

FidoCanCode · 2026-04-25T15:38:36Z

Fixes #973.

Root cause

After #967 (synchronous webhook preempt) and #971 (in-place model switch), webhook handlers consistently aborted their first prompt with an empty result. Production trace at 15:24:24:

hold_for_handler: preempting worker
ClaudeClient.run_turn: preempted mid-flight — retry 2
triage classifier: requesting category from opus
session.prompt: lock acquired (waited=0.00s)
switch_model: sonnet → opus (control_request)
claude result:                                      ← empty, instant
(silence)

The #786 fix added the _preempt_pending gate in iter_events to prevent clobbering an in-flight cancel signal. That gate is correct for its scenario (thread A fires cancel while thread B is mid-turn). But after lock turnover — thread A fires cancel, previous holder yields, thread A itself acquires the lock — A's own prompt was getting aborted by A's own signal.

_drain_to_boundary saw cancel set, aborted early. send() saw cancel set, returned without writing. iter_events gated on _preempt_pending (still set from preempt), refused to clear cancel, broke immediately.

Fix

Track which thread fired the most recent cancel (_cancel_fired_by_tid). In __enter__, clear the cancel signal when the entering thread is the same as the firer — that signal was meant for the previous holder, not for the firer's own first turn.

Different-thread acquisition (the #786 lock-handoff race) preserves the cancel and lets iter_events consume it.

Test

Adds test_handler_prompt_runs_after_preempt_does_not_inherit_cancel that drives a real preempt → hold_for_handler → prompt sequence and asserts the handler's user message actually reaches stdin. The pre-fix behavior was that proc.stdin.write only saw control_request interrupt writes from drain — never the handler's content.

All 263 claude-related tests pass including #786's invariant tests (test_enter_preserves_cancel_event, test_cancel_survives_lock_handoff).

Bug: webhooks no longer preempt the worker — they queue behind the session lock #955 — synchronous preempt regression
Bug: webhooks no longer preempt the worker — they queue behind the session lock (closes #955) #967 — landed synchronous preempt
Use control_request set_model for mid-session model switch instead of session restart (closes #852) #971 — landed in-place model switch (fixed init-handshake deadlock; this PR fixes the next-layer hang)
Webhook preempt no longer cancels worker mid-flight — 147s lag between request and cede #786 — original cancel-survival fix; this PR is compatible
Bug: fido goes silent for 5–10+ minutes during ACT/ANSWER cycles on review comments #949 — silent ANSWER cycles. Same surface; Bug: webhook handler's first prompt aborted by leftover cancel flag from worker preempt #973 is the precise root cause.

…loses #973) After #967 added synchronous webhook preempt and #971 made the model switch in-place, webhook handlers consistently aborted their first prompt with an empty result. Trace showed `_drain_to_boundary: cancel set — aborting drain early` immediately followed by `claude result: ` (empty) — the handler's own preempt-cancel signal was aborting the handler's send(). The #786 fix added a `_preempt_pending` gate in iter_events to prevent clobbering an in-flight cancel. That gate is correct for its scenario (thread A fires cancel while thread B is mid-turn) but was inadvertently keeping the cancel alive across lock turnover when the firer becomes the new holder (thread A fires cancel, previous holder thread B yields, thread A acquires the lock — A's prompt gets aborted by A's own signal). Track the tid that fired the most recent cancel; clear the cancel in __enter__ when the entering thread matches. Different-thread acquisition preserves #786's invariant (cancel survives lock handoff for an unrelated preempter). Adds regression test that drives a real preempt+hold_for_handler+prompt sequence and asserts the handler's user message actually reaches stdin. The pre-fix behavior was that proc.stdin.write only saw control_request interrupts from drain — never the handler's content.

…oses #975) (#977) Fixes #975. ## Real bug Probed the actual claude-code subprocess (`claude --version` → 2.1.120). Its `control_response` shape is: ```json {"type": "control_response", "response": {"subtype": "success", "request_id": "..."}} ``` `request_id` is nested under `response`. Fido's `_send_control_set_model` checked `obj.get("request_id")` at the top level. Always None. Predicate never matched. Every `switch_model` call hung until the subprocess was killed. ## Proof ``` $ grep -c 'switch_model: now on model=' ~/log/fido.log 0 $ grep -c 'switch_model:.*(control_request)' ~/log/fido.log 9 ``` Zero successful returns. Nine attempts. Every one hung. Concrete trace: - 15:40:23 `[home] switch_model: opus → haiku` - (no [home] log lines for 5 minutes) - 15:45:23 `[home] worker started` (fresh thread post-fido-restart) The session was wedged the entire window. Same shape across every `switch_model` log line in the file. ## Why fido seemed to work anyway Many `switch_model` calls are no-ops (target == current model) and return at line 1102 before sending the control_request — so sessions that didn't need an actual model change kept working. The 15:40:30 webhook test that succeeded landed on a freshly-booted confusio session that was already on opus from spawn — no switch needed. Bug only fires on real model changes. ## Fix claude.py:1023-1028: ```python if obj.get("type") == "control_response": response = obj.get("response") or {} if response.get("request_id") == request_id: return ``` ## Test Existing test fixtures had the wrong shape too — they put `request_id` at the top level, matching fido's broken expectation. Both helpers (`TestClaudeSessionSwitchModel._make_response_line` and `TestClaudeSessionSendControlSetModel._make_response_line`) updated to emit the real nested shape. New regression test `test_ignores_top_level_request_id_in_control_response` explicitly emits a malformed top-level placement first, then the correct nested one — asserts the malformed one is skipped and the wait completes only on the nested response. This catches any future regression that re-introduces top-level lookup. All 264 claude tests pass. ## Related - #975 — this issue - #971 — landed control_request set_model with the broken predicate - #973 / #974 — cancel-leak fix; uncovered the hang underneath - #976 — silent .get linter (would have caught this at write time) Co-authored-by: Fido Can Code <190991155+FidoCanCode@users.noreply.github.com>

#979) (#980) Fixes #979. ## Real root cause `_fire_worker_cancel` writes `control_request interrupt` to claude's stdin **without holding the session `_lock`**. The worker thread holds `_lock` for its in-flight turn — can't release until it sees the cancel — so the webhook firing the preempt has no choice but to skirt the lock. Linux pipe writes are only atomic up to `PIPE_BUF=4096 bytes`. Worker mid-write of a multi-KB user message gets its bytes interleaved with the webhook's tiny `control_request`, producing malformed JSON. claude consumes the corrupt bytes and goes idle in `epoll_wait` waiting for valid input — the wedged state we kept observing in production. This is the underlying cause of the entire post-#955 cascade (#963, #967, #971, #973, #974, #975). Each prior fix addressed a downstream symptom; this addresses the source. ## Two changes that close the race **1. `_stdin_lock` (separate from `_lock`)** Held briefly around every `proc.stdin.write` + `flush`. Both lock-holding callers and the unlocked cancel path acquire it, so byte-level writes from concurrent threads cannot interleave. Worker still holds `_lock` for its whole turn; webhook still fires interrupt without `_lock`. The byte-level invariant is restored regardless of the protocol-level lock state. **2. iter_events drains to boundary on cancel** Old behavior: cancel → break immediately. Left claude's response (`control_response` + `type=result`) in the pipe as stale events, requiring `_drain_to_boundary` on the next send to clean up. New behavior: cancel → keep reading until `type=result` (just record `_last_turn_cancelled=True`). Same iter_events call drains the response. Next holder gets a clean session. A new `_CANCEL_DRAIN_TIMEOUT` (5s) bounds the drain — if claude is wedged and never emits the boundary, kill subprocess + recover. Replaces the 10s drain-to-boundary in send() and the `_in_turn`-based stale-turn detection. ## Tests - `TestClaudeSessionStdinAtomicity` (new): drives concurrent `send` + `_send_control_interrupt` at byte granularity and asserts no interleave. Plus per-method tests that each write site holds `_stdin_lock`. - `test_cancel_drains_to_boundary` (rewritten): asserts events ARE consumed past cancel and `_in_turn=False` at end. - `test_recovers_when_cancel_drain_times_out` (rewritten): asserts `ClaudeStreamError(_RETURNCODE_CANCEL_DRAIN_TIMEOUT)` when claude doesn't emit the boundary. - `test_handler_prompt_runs_after_preempt_does_not_inherit_cancel` (simplified): the cancel-leak bug from #973 is structurally impossible in the new design. All 263 claude tests pass. The single failing copilotcli test (`test_webhook_preempts_worker_cancels_runtime`) is pre-existing and unrelated. ## Follow-ups - The cancel-firer-tid tracking from #974 (`_cancel_fired_by_tid` + `__enter__`) is now redundant — iter_events naturally clears the cancel state via the existing `_preempt_pending` gate. Worth removing in a follow-up. - `_drain_to_boundary` is no longer called from production code, only from its own tests. Worth deleting. - Both follow-ups would simplify the codebase further; left out of this PR to keep the scope tight. Co-authored-by: Fido Can Code <190991155+FidoCanCode@users.noreply.github.com>

…loses #981) (#982) Fixes #981. ## Root cause `ClaudeSession._pending_talker_kind` was a session-level instance attribute that `prompt()` wrote and `__enter__` read. Cross-thread race: when worker and webhook threads called `prompt()` in overlapping windows, one's write clobbered the other's. The lock-winner then read the wrong kind in `__enter__`. Production captured the bug clearly: status XML showed the worker tid (140034430314176) registered with `kind="webhook"`. Every subsequent `preempt_worker()` call read "holder is webhook" → returned False → no cancel fired → webhooks queued behind the worker for 4+ minutes. ## Fix `__enter__` reads the kind directly from `provider.current_thread_kind()` — the calling thread's thread-local. No shared attribute. No race. Each thread carries its own kind by construction. Removes `_pending_talker_kind` entirely (set + read + clear). Removes the now-unused `Literal` import. ## Test `test_enter_uses_thread_kind_not_shared_attribute` plants a stale "webhook" annotation on the session-level attribute (simulating what another thread might have written) while the current thread's kind is "worker". - Pre-fix: `__enter__` reads the annotation and mis-registers as webhook → test fails - Post-fix: `__enter__` reads the thread-local and registers as worker → test passes Verified by stashing the source fix and re-running: pre-fix FAILS, post-fix PASSES. So the test catches the bug. ## Related - #979 / #980 — byte-level stdin race; fixed prior. This is a separate higher-level race on talker registration that #980 unmasked (same cause: lock handoff is faster, race window more frequent). - #973 / #974 — earlier preempt-related race fixes (cancel-leak) - #786 — original talker registration design Co-authored-by: Fido Can Code <190991155+FidoCanCode@users.noreply.github.com>

Fixes #983. ## Bug When webhook A held the lock and webhook B queued behind it via `hold_for_handler`, the lock handoff at A→exit went into a free-for-all between B and any worker thread also waiting. Python's `RLock` isn't FIFO; the worker often won. Result: `A → worker → B` instead of `A → B → worker`. Every subsequent webhook in a flurry waited behind a full worker turn. Production trace at 18:25:44: review submitted with two inline comments → three webhooks → handler 1 EXITs → worker grabs lock and runs Agent Explore tool → handler 3 still queued. ## Fix **ClaudeSession** gains a `_preempt_cond` Condition. Webhook callers set `_preempt_pending` and `notify_all` under this mutex (via new `_signal_pending_preempt` method). Workers entering `__enter__` `wait_for(not pending)`, and **re-check after acquiring** the lock — if a webhook arrived in the gap, release and retry. Worker re-acquire is strictly gated on "no webhook waiting". **OwnedSession** gains an abstract `_signal_pending_preempt` hook (default no-op). `hold_for_handler` calls it on the queueing branch (where `preempt_worker` returned False because the holder is also a webhook). ClaudeSession overrides. The cond mutex is held only briefly per check/set/clear — never during session work. RLock semantics for the holder unchanged; only acquire-time prioritization changes. ## Test `test_hold_for_handler_signals_pending_when_queueing_behind_webhook` patches `try_preempt_worker` to return `(False, "webhook")` and asserts `hold_for_handler` calls `_signal_pending_preempt` exactly once and `_preempt_pending` is set afterward. Deterministic — no `time.sleep`. Pre-fix: `AttributeError` (no such method). Post-fix: passes. ## Related - #955, #963, #967, #971, #973, #974, #975, #979, #980, #981, #982 — all the prior fixes in this preempt cascade. Each peeled a layer; this one closes the lock-handoff prioritization. Co-authored-by: Fido Can Code <190991155+FidoCanCode@users.noreply.github.com>

FidoCanCode requested a review from rhencke April 25, 2026 15:38

rhencke approved these changes Apr 25, 2026

View reviewed changes

FidoCanCode merged commit 60bed7c into main Apr 25, 2026
1 check passed

FidoCanCode deleted the fix-cancel-tid-tracking branch April 25, 2026 15:39

This was referenced Apr 25, 2026

Bug: switch_model hangs when called with _in_turn=True after preempt-cancelled drain #975

Closed

Fix control_response request_id check — read from nested response (closes #975) #977

Merged

This was referenced Apr 25, 2026

Bug: cross-thread stdin writes corrupt claude protocol stream — actual root cause of preempt hangs #979

Closed

Serialize claude stdin writes; drain cancelled turns to boundary (closes #979) #980

Merged

FidoCanCode mentioned this pull request Apr 25, 2026

Read talker kind from current thread, not session-shared attribute (closes #981) #982

Merged

FidoCanCode mentioned this pull request Apr 25, 2026

Webhook-priority lock handoff via Condition mutex (closes #983) #984

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug: webhook handler's first prompt aborted by leftover cancel flag from worker preempt (closes #973)#974

Bug: webhook handler's first prompt aborted by leftover cancel flag from worker preempt (closes #973)#974
FidoCanCode merged 1 commit into
mainfrom
fix-cancel-tid-tracking

FidoCanCode commented Apr 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

FidoCanCode commented Apr 25, 2026

Root cause

Fix

Test

Related

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants