Skip to content

Webhook-priority lock handoff via Condition mutex (closes #983)#984

Merged
FidoCanCode merged 1 commit into
mainfrom
fix-webhook-handoff-priority
Apr 25, 2026
Merged

Webhook-priority lock handoff via Condition mutex (closes #983)#984
FidoCanCode merged 1 commit into
mainfrom
fix-webhook-handoff-priority

Conversation

@FidoCanCode
Copy link
Copy Markdown
Owner

Fixes #983.

Bug

When webhook A held the lock and webhook B queued behind it via hold_for_handler, the lock handoff at A→exit went into a free-for-all between B and any worker thread also waiting. Python's RLock isn't FIFO; the worker often won. Result: A → worker → B instead of A → B → worker. Every subsequent webhook in a flurry waited behind a full worker turn.

Production trace at 18:25:44: review submitted with two inline comments → three webhooks → handler 1 EXITs → worker grabs lock and runs Agent Explore tool → handler 3 still queued.

Fix

ClaudeSession gains a _preempt_cond Condition. Webhook callers set _preempt_pending and notify_all under this mutex (via new _signal_pending_preempt method). Workers entering __enter__ wait_for(not pending), and re-check after acquiring the lock — if a webhook arrived in the gap, release and retry. Worker re-acquire is strictly gated on "no webhook waiting".

OwnedSession gains an abstract _signal_pending_preempt hook (default no-op). hold_for_handler calls it on the queueing branch (where preempt_worker returned False because the holder is also a webhook). ClaudeSession overrides.

The cond mutex is held only briefly per check/set/clear — never during session work. RLock semantics for the holder unchanged; only acquire-time prioritization changes.

Test

test_hold_for_handler_signals_pending_when_queueing_behind_webhook patches try_preempt_worker to return (False, "webhook") and asserts hold_for_handler calls _signal_pending_preempt exactly once and _preempt_pending is set afterward. Deterministic — no time.sleep. Pre-fix: AttributeError (no such method). Post-fix: passes.

Related

Previously, when webhook A was holding the lock and webhook B queued
behind it via hold_for_handler, the lock release at A→exit went into
a free-for-all between B and any worker thread also blocked on the
lock.  Python's RLock is not FIFO, so the worker often won.  Result:
A → worker → B instead of A → B → worker.  Each subsequent webhook in
a flurry waited behind a full worker turn.

Two changes:

1. ClaudeSession gains a _preempt_cond Condition.  Webhook callers set
   _preempt_pending and notify under this mutex (via the new
   _signal_pending_preempt method).  Workers entering __enter__ wait
   on _preempt_cond until pending clears, and re-check pending under
   the mutex AFTER acquiring the lock — if a webhook arrived during
   the gap, release and retry.  Worker re-acquire is now strictly
   gated on "no webhook waiting", with no race window.

2. OwnedSession gains an abstract _signal_pending_preempt hook (default
   no-op).  hold_for_handler calls it on the queueing branch (the case
   where preempt_worker returned False because the holder is itself a
   webhook).  ClaudeSession overrides to set _preempt_pending and
   notify on _preempt_cond.

The cond mutex is held only briefly per check/set/clear — it never
covers the actual session work, so no new contention.

Adds test_hold_for_handler_signals_pending_when_queueing_behind_webhook
that patches try_preempt_worker to return (False, "webhook") and asserts
hold_for_handler calls _signal_pending_preempt exactly once and
_preempt_pending is set afterward.  Pre-fix: AttributeError.  Post-fix:
passes.

(Pre-existing rocq generation drift in the workspace was reset; this
commit is source-only.)
@FidoCanCode FidoCanCode requested a review from rhencke April 25, 2026 18:39
@FidoCanCode FidoCanCode merged commit ec05a88 into main Apr 25, 2026
1 check passed
@FidoCanCode FidoCanCode deleted the fix-webhook-handoff-priority branch April 25, 2026 18:40
FidoCanCode added a commit that referenced this pull request Apr 25, 2026
…loses #975, #983) (#986)

Fixes #975. Closes #983.

## What we thought was happening

`set_model` wedges in claude-code 2.1.114 after 2-3 cycles per session
(#975). Reproducer used a single-threaded probe that drove turns and
switches against the bare claude binary; it wedged on cycle 2 every
time. We attributed it to upstream claude-code.

## What was actually happening

fido was violating its own invariant. When a webhook fired
`_fire_worker_cancel`, it wrote the `control_request interrupt` to
claude's stdin from the **webhook thread** — but the worker thread held
the session lock and was the conversation owner. That side-channel write
left claude's internal protocol state half-cancelled (turn observably
interrupted from claude's side, but fido didn't drain the ack).
Subsequent `control_request`s on the same session inherited that
protocol-state divergence and stopped getting `control_response` back.
Looked like an upstream wedge; was actually fido leaving claude
confused.

## Fix

Only the lock-holder writes to stdin. The webhook signals, the worker
acts.

- `_fire_worker_cancel` (called from webhook thread): sets `_cancel` +
`_signal_pending_preempt` + `_wake()`. **No stdin write.**
- `iter_events` (worker thread, lock-holder): on observed `_cancel`,
sends the `control_request interrupt` itself — and **assertively
drains**: tracks the interrupt's `request_id`, asserts the matching
`control_response` ack arrives before the `type=result` boundary. If the
ack is missing, the drain wasn't clean — kill + recover rather than hand
a half-cancelled session to the next holder.
- `_send_control_interrupt` returns the `request_id` so `iter_events`
can match.
- `_stdin_lock` removed entirely. No cross-thread writes; nothing to
serialize.

Real preemption (sub-millisecond from cancel-fired to
interrupt-on-the-wire), single-owner invariant restored, claude's
protocol state stays consistent with fido's.

## Verification

Wrote a Python harness driving the real `ClaudeSession` production code
path against real claude-code 2.1.114. Worker runs a long task, webhook
fires preempt, repeat. **4/4 cycles pass** with each cycle doing
multiple `set_model`s. Pre-fix the same flow wedged at cycle 2. So the
wedge was directly caused by the cross-thread write — fix it, the wedge
goes away.

```
========== summary ==========
  cycle 0: OK  wh_elapsed=2.00s
  cycle 1: OK  wh_elapsed=6.83s
  cycle 2: OK  wh_elapsed=1.54s
  cycle 3: OK  wh_elapsed=3.59s
```

## Tests

264 claude tests pass. Drops the obsolete `_stdin_lock` tests (4 tests
removed); the invariant they enforced is replaced by the simpler "only
the lock-holder writes" rule.

## Closes

- #975 — set_model wedge. Was a fido-side cause, not upstream.
- #983 — webhook→webhook handoff. The handoff Condition logic from #984
still applies; this PR keeps that and adds the cancel-drain assertion.

Co-authored-by: Fido Can Code <190991155+FidoCanCode@users.noreply.github.com>
FidoCanCode added a commit that referenced this pull request Apr 26, 2026
…-handler starvation (closes #1022)

Stopgap patch on the way to the larger session_lock.v FSM-driven
rewrite tracked in #1022. Same root cause; smaller surgery.

Symptom (#984/#1017/#1019/#1022): when N webhook handlers signal
pending-preempt and the first acquires the lock, the binary Event
``_preempt_pending`` clears.  The worker, waiting on ``not is_set()``,
sees the clear and races for the lock — beating the second queued
handler because RLock isn't FIFO.  Each prior fix narrowed the race
window without addressing the underlying signal granularity.

Fix: ``_preempt_pending`` is now an ``int`` counter reflecting the
number of queued handlers.  ``_signal_pending_preempt`` increments
it; each handler's outermost ``__enter__`` decrements; workers wait
for ``_preempt_pending == 0``.  A second queued handler keeps the
counter above zero after the first acquires, so the worker keeps
yielding until ALL queued handlers drain.

This closes the symptom directly observed at 04:43:05Z on PR #1021
(comment 3142991935 went unanswered because the worker won the lock
over a queued webhook B after webhook A released).

Workers don't decrement the counter — they aren't tracked.  The
prompt() finally-block safety net decrements by 1 (was: clear()) so
concurrent unrelated handlers aren't over-cleared.

The architectural FSM rewrite remains tracked in #1022 as an
experiment for a focused PR (~half-day effort with stress-test
validation).  This commit reduces the cascade frequency in the
meantime; #1022 closes it structurally.

Tests:
- All 251 existing tests in test_claude.py + test_claude_hold_for_handler.py
  updated for the new API (.set/.is_set/.clear → +=/>0/=0) and pass.
- New test_two_queued_handlers_keep_counter_above_zero_after_first_acquires
  asserts the #1022 invariant: counter remains > 0 between first and
  second handler acquires.
FidoCanCode added a commit that referenced this pull request Apr 26, 2026
…ation (closes #1022) (#1023)

## Summary

Stopgap patch closing
[#1022](#1022) (and the
symptom of #984/#1017/#1019). Same root cause; smaller surgery than the
full FSM rewrite (which remains tracked in #1022 as a focused-PR
experiment).

## What changed

\`_preempt_pending\` was a \`threading.Event\` (binary: anyone-pending
vs. none). It's now an \`int\` counter (queue length). All readers
updated:

- Workers wait for \`_preempt_pending == 0\` (was \`not is_set()\`)
- Handlers' \`__enter__\` decrements once on outermost entry (was
\`clear()\`)
- Workers \*don't\* decrement — they aren't queued
- \`_signal_pending_preempt\` increments (was \`set()\`)
- \`prompt()\` safety net decrements by 1 (was \`clear()\`) to avoid
over-decrement of concurrent handlers

## Why this fixes #1022

Two handlers signal pending → counter = 2. Worker waits for ==0. First
handler acquires + decrements → counter = 1. Worker still waits (counter
!= 0). First releases. Second handler acquires + decrements → counter =
0. Second releases. Worker finally acquires. **The second queued handler
is no longer beat by the worker on the wake race.**

Under the old binary Event: first handler's \`clear()\` made the worker
see \"no one pending\" while the second handler was still queued. Worker
raced ahead. (See trace on PR #1021 04:43:05Z — comment 3142991935 went
unanswered.)

## Why not the full FSM rewrite

Started that during the vet (this branch's earlier commits — now
reverted). Surgery was ~300 lines across claude.py + 36 test updates + a
new stress harness — too much for safe ship-tonight. #1022 stays open
with the architectural rewrite as a focused next-day PR.

## Test plan

- [x] All 251 tests in \`test_claude.py\` +
\`test_claude_hold_for_handler.py\` updated for the new API and pass
- [x] New
\`test_two_queued_handlers_keep_counter_above_zero_after_first_acquires\`
asserts the invariant directly
- [x] \`./fido ci\` green
- [ ] Live verification: bring fido up, fire two near-simultaneous
review comments, confirm both get individual replies (not just the
first)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Fido Can Code <190991155+FidoCanCode@users.noreply.github.com>
FidoCanCode added a commit that referenced this pull request Apr 26, 2026
…oordination (validates real fix for #984/#1017/#1019/#1022 cascade) (closes #1022) (#1031)

Fixes #1022.

Flip the extracted FSM in session_lock.v from a side-band oracle to the
authoritative lock coordinator in OwnedSession, then migrate both
ClaudeSession and CopilotCLISession to delegate to it — deleting the
ad-hoc Event/Condition/sleep-poll primitives that caused the
#984#1017#1019 race cascade. Validates that the proved invariants
(no_dual_ownership, release_only_by_owner) eliminate the entire class of
coordination bugs by construction.

---

## Work queue

<!-- WORK_QUEUE_START -->
- [ ] [Replace individual FSM type imports with a module-level
alias](#1031 (comment))
**→ next** <!-- type:thread -->

<details><summary>Completed (5)</summary>

- [x] Remove wait_for_pending_preempt from Protocol and worker callers
<!-- type:spec -->
- [x] Final sweep: remove dead code, stale comments, unused imports <!--
type:spec -->
- [x] Rewrite CopilotCLISession lock to delegate to OwnedSession FSM
<!-- type:spec -->
- [x] Rewrite ClaudeSession lock to delegate to OwnedSession FSM <!--
type:spec -->
- [x] Promote FSM to authoritative lock coordinator in OwnedSession <!--
type:spec -->
</details>
<!-- WORK_QUEUE_END -->

---------

Co-authored-by: Fido Can Code <190991155+FidoCanCode@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: webhook→webhook handoff lets the worker win the lock between handlers

2 participants