Webhook-priority lock handoff via Condition mutex (closes #983) by FidoCanCode · Pull Request #984 · FidoCanCode/home

FidoCanCode · 2026-04-25T18:39:35Z

Fixes #983.

Bug

When webhook A held the lock and webhook B queued behind it via hold_for_handler, the lock handoff at A→exit went into a free-for-all between B and any worker thread also waiting. Python's RLock isn't FIFO; the worker often won. Result: A → worker → B instead of A → B → worker. Every subsequent webhook in a flurry waited behind a full worker turn.

Production trace at 18:25:44: review submitted with two inline comments → three webhooks → handler 1 EXITs → worker grabs lock and runs Agent Explore tool → handler 3 still queued.

Fix

ClaudeSession gains a _preempt_cond Condition. Webhook callers set _preempt_pending and notify_all under this mutex (via new _signal_pending_preempt method). Workers entering __enter__ wait_for(not pending), and re-check after acquiring the lock — if a webhook arrived in the gap, release and retry. Worker re-acquire is strictly gated on "no webhook waiting".

OwnedSession gains an abstract _signal_pending_preempt hook (default no-op). hold_for_handler calls it on the queueing branch (where preempt_worker returned False because the holder is also a webhook). ClaudeSession overrides.

The cond mutex is held only briefly per check/set/clear — never during session work. RLock semantics for the holder unchanged; only acquire-time prioritization changes.

Test

test_hold_for_handler_signals_pending_when_queueing_behind_webhook patches try_preempt_worker to return (False, "webhook") and asserts hold_for_handler calls _signal_pending_preempt exactly once and _preempt_pending is set afterward. Deterministic — no time.sleep. Pre-fix: AttributeError (no such method). Post-fix: passes.

Bug: webhooks no longer preempt the worker — they queue behind the session lock #955, Bug: webhooks no longer preempt the worker — they queue behind the session lock (closes #955) #963, Bug: webhooks no longer preempt the worker — they queue behind the session lock (closes #955) #967, Use control_request set_model for mid-session model switch instead of session restart (closes #852) #971, Bug: webhook handler's first prompt aborted by leftover cancel flag from worker preempt #973, Bug: webhook handler's first prompt aborted by leftover cancel flag from worker preempt (closes #973) #974, Bug: switch_model hangs when called with _in_turn=True after preempt-cancelled drain #975, Bug: cross-thread stdin writes corrupt claude protocol stream — actual root cause of preempt hangs #979, Serialize claude stdin writes; drain cancelled turns to boundary (closes #979) #980, Bug: _pending_talker_kind race lets worker register with kind=webhook (or vice versa) — breaks preempt detection #981, Read talker kind from current thread, not session-shared attribute (closes #981) #982 — all the prior fixes in this preempt cascade. Each peeled a layer; this one closes the lock-handoff prioritization.

Previously, when webhook A was holding the lock and webhook B queued behind it via hold_for_handler, the lock release at A→exit went into a free-for-all between B and any worker thread also blocked on the lock. Python's RLock is not FIFO, so the worker often won. Result: A → worker → B instead of A → B → worker. Each subsequent webhook in a flurry waited behind a full worker turn. Two changes: 1. ClaudeSession gains a _preempt_cond Condition. Webhook callers set _preempt_pending and notify under this mutex (via the new _signal_pending_preempt method). Workers entering __enter__ wait on _preempt_cond until pending clears, and re-check pending under the mutex AFTER acquiring the lock — if a webhook arrived during the gap, release and retry. Worker re-acquire is now strictly gated on "no webhook waiting", with no race window. 2. OwnedSession gains an abstract _signal_pending_preempt hook (default no-op). hold_for_handler calls it on the queueing branch (the case where preempt_worker returned False because the holder is itself a webhook). ClaudeSession overrides to set _preempt_pending and notify on _preempt_cond. The cond mutex is held only briefly per check/set/clear — it never covers the actual session work, so no new contention. Adds test_hold_for_handler_signals_pending_when_queueing_behind_webhook that patches try_preempt_worker to return (False, "webhook") and asserts hold_for_handler calls _signal_pending_preempt exactly once and _preempt_pending is set afterward. Pre-fix: AttributeError. Post-fix: passes. (Pre-existing rocq generation drift in the workspace was reset; this commit is source-only.)

…loses #975, #983) (#986) Fixes #975. Closes #983. ## What we thought was happening `set_model` wedges in claude-code 2.1.114 after 2-3 cycles per session (#975). Reproducer used a single-threaded probe that drove turns and switches against the bare claude binary; it wedged on cycle 2 every time. We attributed it to upstream claude-code. ## What was actually happening fido was violating its own invariant. When a webhook fired `_fire_worker_cancel`, it wrote the `control_request interrupt` to claude's stdin from the **webhook thread** — but the worker thread held the session lock and was the conversation owner. That side-channel write left claude's internal protocol state half-cancelled (turn observably interrupted from claude's side, but fido didn't drain the ack). Subsequent `control_request`s on the same session inherited that protocol-state divergence and stopped getting `control_response` back. Looked like an upstream wedge; was actually fido leaving claude confused. ## Fix Only the lock-holder writes to stdin. The webhook signals, the worker acts. - `_fire_worker_cancel` (called from webhook thread): sets `_cancel` + `_signal_pending_preempt` + `_wake()`. **No stdin write.** - `iter_events` (worker thread, lock-holder): on observed `_cancel`, sends the `control_request interrupt` itself — and **assertively drains**: tracks the interrupt's `request_id`, asserts the matching `control_response` ack arrives before the `type=result` boundary. If the ack is missing, the drain wasn't clean — kill + recover rather than hand a half-cancelled session to the next holder. - `_send_control_interrupt` returns the `request_id` so `iter_events` can match. - `_stdin_lock` removed entirely. No cross-thread writes; nothing to serialize. Real preemption (sub-millisecond from cancel-fired to interrupt-on-the-wire), single-owner invariant restored, claude's protocol state stays consistent with fido's. ## Verification Wrote a Python harness driving the real `ClaudeSession` production code path against real claude-code 2.1.114. Worker runs a long task, webhook fires preempt, repeat. **4/4 cycles pass** with each cycle doing multiple `set_model`s. Pre-fix the same flow wedged at cycle 2. So the wedge was directly caused by the cross-thread write — fix it, the wedge goes away. ``` ========== summary ========== cycle 0: OK wh_elapsed=2.00s cycle 1: OK wh_elapsed=6.83s cycle 2: OK wh_elapsed=1.54s cycle 3: OK wh_elapsed=3.59s ``` ## Tests 264 claude tests pass. Drops the obsolete `_stdin_lock` tests (4 tests removed); the invariant they enforced is replaced by the simpler "only the lock-holder writes" rule. ## Closes - #975 — set_model wedge. Was a fido-side cause, not upstream. - #983 — webhook→webhook handoff. The handoff Condition logic from #984 still applies; this PR keeps that and adds the cancel-drain assertion. Co-authored-by: Fido Can Code <190991155+FidoCanCode@users.noreply.github.com>

…-handler starvation (closes #1022) Stopgap patch on the way to the larger session_lock.v FSM-driven rewrite tracked in #1022. Same root cause; smaller surgery. Symptom (#984/#1017/#1019/#1022): when N webhook handlers signal pending-preempt and the first acquires the lock, the binary Event ``_preempt_pending`` clears. The worker, waiting on ``not is_set()``, sees the clear and races for the lock — beating the second queued handler because RLock isn't FIFO. Each prior fix narrowed the race window without addressing the underlying signal granularity. Fix: ``_preempt_pending`` is now an ``int`` counter reflecting the number of queued handlers. ``_signal_pending_preempt`` increments it; each handler's outermost ``__enter__`` decrements; workers wait for ``_preempt_pending == 0``. A second queued handler keeps the counter above zero after the first acquires, so the worker keeps yielding until ALL queued handlers drain. This closes the symptom directly observed at 04:43:05Z on PR #1021 (comment 3142991935 went unanswered because the worker won the lock over a queued webhook B after webhook A released). Workers don't decrement the counter — they aren't tracked. The prompt() finally-block safety net decrements by 1 (was: clear()) so concurrent unrelated handlers aren't over-cleared. The architectural FSM rewrite remains tracked in #1022 as an experiment for a focused PR (~half-day effort with stress-test validation). This commit reduces the cascade frequency in the meantime; #1022 closes it structurally. Tests: - All 251 existing tests in test_claude.py + test_claude_hold_for_handler.py updated for the new API (.set/.is_set/.clear → +=/>0/=0) and pass. - New test_two_queued_handlers_keep_counter_above_zero_after_first_acquires asserts the #1022 invariant: counter remains > 0 between first and second handler acquires.

…ation (closes #1022) (#1023) ## Summary Stopgap patch closing [#1022](#1022) (and the symptom of #984/#1017/#1019). Same root cause; smaller surgery than the full FSM rewrite (which remains tracked in #1022 as a focused-PR experiment). ## What changed \`_preempt_pending\` was a \`threading.Event\` (binary: anyone-pending vs. none). It's now an \`int\` counter (queue length). All readers updated: - Workers wait for \`_preempt_pending == 0\` (was \`not is_set()\`) - Handlers' \`__enter__\` decrements once on outermost entry (was \`clear()\`) - Workers \*don't\* decrement — they aren't queued - \`_signal_pending_preempt\` increments (was \`set()\`) - \`prompt()\` safety net decrements by 1 (was \`clear()\`) to avoid over-decrement of concurrent handlers ## Why this fixes #1022 Two handlers signal pending → counter = 2. Worker waits for ==0. First handler acquires + decrements → counter = 1. Worker still waits (counter != 0). First releases. Second handler acquires + decrements → counter = 0. Second releases. Worker finally acquires. **The second queued handler is no longer beat by the worker on the wake race.** Under the old binary Event: first handler's \`clear()\` made the worker see \"no one pending\" while the second handler was still queued. Worker raced ahead. (See trace on PR #1021 04:43:05Z — comment 3142991935 went unanswered.) ## Why not the full FSM rewrite Started that during the vet (this branch's earlier commits — now reverted). Surgery was ~300 lines across claude.py + 36 test updates + a new stress harness — too much for safe ship-tonight. #1022 stays open with the architectural rewrite as a focused next-day PR. ## Test plan - [x] All 251 tests in \`test_claude.py\` + \`test_claude_hold_for_handler.py\` updated for the new API and pass - [x] New \`test_two_queued_handlers_keep_counter_above_zero_after_first_acquires\` asserts the invariant directly - [x] \`./fido ci\` green - [ ] Live verification: bring fido up, fire two near-simultaneous review comments, confirm both get individual replies (not just the first) 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Fido Can Code <190991155+FidoCanCode@users.noreply.github.com>

…oordination (validates real fix for #984/#1017/#1019/#1022 cascade) (closes #1022) (#1031) Fixes #1022. Flip the extracted FSM in session_lock.v from a side-band oracle to the authoritative lock coordinator in OwnedSession, then migrate both ClaudeSession and CopilotCLISession to delegate to it — deleting the ad-hoc Event/Condition/sleep-poll primitives that caused the #984→#1017→#1019 race cascade. Validates that the proved invariants (no_dual_ownership, release_only_by_owner) eliminate the entire class of coordination bugs by construction. --- ## Work queue  - [ ] [Replace individual FSM type imports with a module-level alias](#1031 (comment)) **→ next**  <details><summary>Completed (5)</summary> - [x] Remove wait_for_pending_preempt from Protocol and worker callers  - [x] Final sweep: remove dead code, stale comments, unused imports  - [x] Rewrite CopilotCLISession lock to delegate to OwnedSession FSM  - [x] Rewrite ClaudeSession lock to delegate to OwnedSession FSM  - [x] Promote FSM to authoritative lock coordinator in OwnedSession  </details>  --------- Co-authored-by: Fido Can Code <190991155+FidoCanCode@users.noreply.github.com>

FidoCanCode requested a review from rhencke April 25, 2026 18:39

rhencke approved these changes Apr 25, 2026

View reviewed changes

FidoCanCode merged commit ec05a88 into main Apr 25, 2026
1 check passed

FidoCanCode deleted the fix-webhook-handoff-priority branch April 25, 2026 18:40

This was referenced Apr 25, 2026

Bug: switch_model hangs when called with _in_turn=True after preempt-cancelled drain #975

Closed

Worker drives cancel: interrupt + assertive drain from iter_events (closes #975, #983) #986

Merged

FidoCanCode mentioned this pull request Apr 26, 2026

Counter-based _preempt_pending — stopgap fix for queued-handler starvation (closes #1022) #1023

Merged

4 tasks

FidoCanCode mentioned this pull request Apr 26, 2026

Experiment: drive the lock from session_lock.v's FSM, delete ad-hoc coordination (validates real fix for #984/#1017/#1019/#1022 cascade) (closes #1022) #1031

Merged

6 tasks

FidoCanCode mentioned this pull request Apr 26, 2026

Survey: bug-mined invariants — what coordination invariants do our closed bugs imply? #1040

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Webhook-priority lock handoff via Condition mutex (closes #983)#984

Webhook-priority lock handoff via Condition mutex (closes #983)#984
FidoCanCode merged 1 commit into
mainfrom
fix-webhook-handoff-priority

FidoCanCode commented Apr 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

FidoCanCode commented Apr 25, 2026

Bug

Fix

Test

Related

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants