Webhook-priority lock handoff via Condition mutex (closes #983)#984
Merged
Conversation
Previously, when webhook A was holding the lock and webhook B queued behind it via hold_for_handler, the lock release at A→exit went into a free-for-all between B and any worker thread also blocked on the lock. Python's RLock is not FIFO, so the worker often won. Result: A → worker → B instead of A → B → worker. Each subsequent webhook in a flurry waited behind a full worker turn. Two changes: 1. ClaudeSession gains a _preempt_cond Condition. Webhook callers set _preempt_pending and notify under this mutex (via the new _signal_pending_preempt method). Workers entering __enter__ wait on _preempt_cond until pending clears, and re-check pending under the mutex AFTER acquiring the lock — if a webhook arrived during the gap, release and retry. Worker re-acquire is now strictly gated on "no webhook waiting", with no race window. 2. OwnedSession gains an abstract _signal_pending_preempt hook (default no-op). hold_for_handler calls it on the queueing branch (the case where preempt_worker returned False because the holder is itself a webhook). ClaudeSession overrides to set _preempt_pending and notify on _preempt_cond. The cond mutex is held only briefly per check/set/clear — it never covers the actual session work, so no new contention. Adds test_hold_for_handler_signals_pending_when_queueing_behind_webhook that patches try_preempt_worker to return (False, "webhook") and asserts hold_for_handler calls _signal_pending_preempt exactly once and _preempt_pending is set afterward. Pre-fix: AttributeError. Post-fix: passes. (Pre-existing rocq generation drift in the workspace was reset; this commit is source-only.)
rhencke
approved these changes
Apr 25, 2026
This was referenced Apr 25, 2026
FidoCanCode
added a commit
that referenced
this pull request
Apr 25, 2026
…loses #975, #983) (#986) Fixes #975. Closes #983. ## What we thought was happening `set_model` wedges in claude-code 2.1.114 after 2-3 cycles per session (#975). Reproducer used a single-threaded probe that drove turns and switches against the bare claude binary; it wedged on cycle 2 every time. We attributed it to upstream claude-code. ## What was actually happening fido was violating its own invariant. When a webhook fired `_fire_worker_cancel`, it wrote the `control_request interrupt` to claude's stdin from the **webhook thread** — but the worker thread held the session lock and was the conversation owner. That side-channel write left claude's internal protocol state half-cancelled (turn observably interrupted from claude's side, but fido didn't drain the ack). Subsequent `control_request`s on the same session inherited that protocol-state divergence and stopped getting `control_response` back. Looked like an upstream wedge; was actually fido leaving claude confused. ## Fix Only the lock-holder writes to stdin. The webhook signals, the worker acts. - `_fire_worker_cancel` (called from webhook thread): sets `_cancel` + `_signal_pending_preempt` + `_wake()`. **No stdin write.** - `iter_events` (worker thread, lock-holder): on observed `_cancel`, sends the `control_request interrupt` itself — and **assertively drains**: tracks the interrupt's `request_id`, asserts the matching `control_response` ack arrives before the `type=result` boundary. If the ack is missing, the drain wasn't clean — kill + recover rather than hand a half-cancelled session to the next holder. - `_send_control_interrupt` returns the `request_id` so `iter_events` can match. - `_stdin_lock` removed entirely. No cross-thread writes; nothing to serialize. Real preemption (sub-millisecond from cancel-fired to interrupt-on-the-wire), single-owner invariant restored, claude's protocol state stays consistent with fido's. ## Verification Wrote a Python harness driving the real `ClaudeSession` production code path against real claude-code 2.1.114. Worker runs a long task, webhook fires preempt, repeat. **4/4 cycles pass** with each cycle doing multiple `set_model`s. Pre-fix the same flow wedged at cycle 2. So the wedge was directly caused by the cross-thread write — fix it, the wedge goes away. ``` ========== summary ========== cycle 0: OK wh_elapsed=2.00s cycle 1: OK wh_elapsed=6.83s cycle 2: OK wh_elapsed=1.54s cycle 3: OK wh_elapsed=3.59s ``` ## Tests 264 claude tests pass. Drops the obsolete `_stdin_lock` tests (4 tests removed); the invariant they enforced is replaced by the simpler "only the lock-holder writes" rule. ## Closes - #975 — set_model wedge. Was a fido-side cause, not upstream. - #983 — webhook→webhook handoff. The handoff Condition logic from #984 still applies; this PR keeps that and adds the cancel-drain assertion. Co-authored-by: Fido Can Code <190991155+FidoCanCode@users.noreply.github.com>
FidoCanCode
added a commit
that referenced
this pull request
Apr 26, 2026
…-handler starvation (closes #1022) Stopgap patch on the way to the larger session_lock.v FSM-driven rewrite tracked in #1022. Same root cause; smaller surgery. Symptom (#984/#1017/#1019/#1022): when N webhook handlers signal pending-preempt and the first acquires the lock, the binary Event ``_preempt_pending`` clears. The worker, waiting on ``not is_set()``, sees the clear and races for the lock — beating the second queued handler because RLock isn't FIFO. Each prior fix narrowed the race window without addressing the underlying signal granularity. Fix: ``_preempt_pending`` is now an ``int`` counter reflecting the number of queued handlers. ``_signal_pending_preempt`` increments it; each handler's outermost ``__enter__`` decrements; workers wait for ``_preempt_pending == 0``. A second queued handler keeps the counter above zero after the first acquires, so the worker keeps yielding until ALL queued handlers drain. This closes the symptom directly observed at 04:43:05Z on PR #1021 (comment 3142991935 went unanswered because the worker won the lock over a queued webhook B after webhook A released). Workers don't decrement the counter — they aren't tracked. The prompt() finally-block safety net decrements by 1 (was: clear()) so concurrent unrelated handlers aren't over-cleared. The architectural FSM rewrite remains tracked in #1022 as an experiment for a focused PR (~half-day effort with stress-test validation). This commit reduces the cascade frequency in the meantime; #1022 closes it structurally. Tests: - All 251 existing tests in test_claude.py + test_claude_hold_for_handler.py updated for the new API (.set/.is_set/.clear → +=/>0/=0) and pass. - New test_two_queued_handlers_keep_counter_above_zero_after_first_acquires asserts the #1022 invariant: counter remains > 0 between first and second handler acquires.
Merged
4 tasks
FidoCanCode
added a commit
that referenced
this pull request
Apr 26, 2026
…ation (closes #1022) (#1023) ## Summary Stopgap patch closing [#1022](#1022) (and the symptom of #984/#1017/#1019). Same root cause; smaller surgery than the full FSM rewrite (which remains tracked in #1022 as a focused-PR experiment). ## What changed \`_preempt_pending\` was a \`threading.Event\` (binary: anyone-pending vs. none). It's now an \`int\` counter (queue length). All readers updated: - Workers wait for \`_preempt_pending == 0\` (was \`not is_set()\`) - Handlers' \`__enter__\` decrements once on outermost entry (was \`clear()\`) - Workers \*don't\* decrement — they aren't queued - \`_signal_pending_preempt\` increments (was \`set()\`) - \`prompt()\` safety net decrements by 1 (was \`clear()\`) to avoid over-decrement of concurrent handlers ## Why this fixes #1022 Two handlers signal pending → counter = 2. Worker waits for ==0. First handler acquires + decrements → counter = 1. Worker still waits (counter != 0). First releases. Second handler acquires + decrements → counter = 0. Second releases. Worker finally acquires. **The second queued handler is no longer beat by the worker on the wake race.** Under the old binary Event: first handler's \`clear()\` made the worker see \"no one pending\" while the second handler was still queued. Worker raced ahead. (See trace on PR #1021 04:43:05Z — comment 3142991935 went unanswered.) ## Why not the full FSM rewrite Started that during the vet (this branch's earlier commits — now reverted). Surgery was ~300 lines across claude.py + 36 test updates + a new stress harness — too much for safe ship-tonight. #1022 stays open with the architectural rewrite as a focused next-day PR. ## Test plan - [x] All 251 tests in \`test_claude.py\` + \`test_claude_hold_for_handler.py\` updated for the new API and pass - [x] New \`test_two_queued_handlers_keep_counter_above_zero_after_first_acquires\` asserts the invariant directly - [x] \`./fido ci\` green - [ ] Live verification: bring fido up, fire two near-simultaneous review comments, confirm both get individual replies (not just the first) 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Fido Can Code <190991155+FidoCanCode@users.noreply.github.com>
6 tasks
FidoCanCode
added a commit
that referenced
this pull request
Apr 26, 2026
…oordination (validates real fix for #984/#1017/#1019/#1022 cascade) (closes #1022) (#1031) Fixes #1022. Flip the extracted FSM in session_lock.v from a side-band oracle to the authoritative lock coordinator in OwnedSession, then migrate both ClaudeSession and CopilotCLISession to delegate to it — deleting the ad-hoc Event/Condition/sleep-poll primitives that caused the #984→#1017→#1019 race cascade. Validates that the proved invariants (no_dual_ownership, release_only_by_owner) eliminate the entire class of coordination bugs by construction. --- ## Work queue <!-- WORK_QUEUE_START --> - [ ] [Replace individual FSM type imports with a module-level alias](#1031 (comment)) **→ next** <!-- type:thread --> <details><summary>Completed (5)</summary> - [x] Remove wait_for_pending_preempt from Protocol and worker callers <!-- type:spec --> - [x] Final sweep: remove dead code, stale comments, unused imports <!-- type:spec --> - [x] Rewrite CopilotCLISession lock to delegate to OwnedSession FSM <!-- type:spec --> - [x] Rewrite ClaudeSession lock to delegate to OwnedSession FSM <!-- type:spec --> - [x] Promote FSM to authoritative lock coordinator in OwnedSession <!-- type:spec --> </details> <!-- WORK_QUEUE_END --> --------- Co-authored-by: Fido Can Code <190991155+FidoCanCode@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #983.
Bug
When webhook A held the lock and webhook B queued behind it via
hold_for_handler, the lock handoff at A→exit went into a free-for-all between B and any worker thread also waiting. Python'sRLockisn't FIFO; the worker often won. Result:A → worker → Binstead ofA → B → worker. Every subsequent webhook in a flurry waited behind a full worker turn.Production trace at 18:25:44: review submitted with two inline comments → three webhooks → handler 1 EXITs → worker grabs lock and runs Agent Explore tool → handler 3 still queued.
Fix
ClaudeSession gains a
_preempt_condCondition. Webhook callers set_preempt_pendingandnotify_allunder this mutex (via new_signal_pending_preemptmethod). Workers entering__enter__wait_for(not pending), and re-check after acquiring the lock — if a webhook arrived in the gap, release and retry. Worker re-acquire is strictly gated on "no webhook waiting".OwnedSession gains an abstract
_signal_pending_preempthook (default no-op).hold_for_handlercalls it on the queueing branch (wherepreempt_workerreturned False because the holder is also a webhook). ClaudeSession overrides.The cond mutex is held only briefly per check/set/clear — never during session work. RLock semantics for the holder unchanged; only acquire-time prioritization changes.
Test
test_hold_for_handler_signals_pending_when_queueing_behind_webhookpatchestry_preempt_workerto return(False, "webhook")and assertshold_for_handlercalls_signal_pending_preemptexactly once and_preempt_pendingis set afterward. Deterministic — notime.sleep. Pre-fix:AttributeError(no such method). Post-fix: passes.Related