Skip to content

Bug: webhook handler's first prompt aborted by leftover cancel flag from worker preempt (closes #973)#974

Merged
FidoCanCode merged 1 commit into
mainfrom
fix-cancel-tid-tracking
Apr 25, 2026
Merged

Bug: webhook handler's first prompt aborted by leftover cancel flag from worker preempt (closes #973)#974
FidoCanCode merged 1 commit into
mainfrom
fix-cancel-tid-tracking

Conversation

@FidoCanCode
Copy link
Copy Markdown
Owner

Fixes #973.

Root cause

After #967 (synchronous webhook preempt) and #971 (in-place model switch), webhook handlers consistently aborted their first prompt with an empty result. Production trace at 15:24:24:

hold_for_handler: preempting worker
ClaudeClient.run_turn: preempted mid-flight — retry 2
triage classifier: requesting category from opus
session.prompt: lock acquired (waited=0.00s)
switch_model: sonnet → opus (control_request)
claude result:                                      ← empty, instant
(silence)

The #786 fix added the _preempt_pending gate in iter_events to prevent clobbering an in-flight cancel signal. That gate is correct for its scenario (thread A fires cancel while thread B is mid-turn). But after lock turnover — thread A fires cancel, previous holder yields, thread A itself acquires the lock — A's own prompt was getting aborted by A's own signal.

_drain_to_boundary saw cancel set, aborted early. send() saw cancel set, returned without writing. iter_events gated on _preempt_pending (still set from preempt), refused to clear cancel, broke immediately.

Fix

Track which thread fired the most recent cancel (_cancel_fired_by_tid). In __enter__, clear the cancel signal when the entering thread is the same as the firer — that signal was meant for the previous holder, not for the firer's own first turn.

Different-thread acquisition (the #786 lock-handoff race) preserves the cancel and lets iter_events consume it.

Test

Adds test_handler_prompt_runs_after_preempt_does_not_inherit_cancel that drives a real preempt → hold_for_handler → prompt sequence and asserts the handler's user message actually reaches stdin. The pre-fix behavior was that proc.stdin.write only saw control_request interrupt writes from drain — never the handler's content.

All 263 claude-related tests pass including #786's invariant tests (test_enter_preserves_cancel_event, test_cancel_survives_lock_handoff).

Related

…loses #973)

After #967 added synchronous webhook preempt and #971 made the model switch
in-place, webhook handlers consistently aborted their first prompt with an
empty result. Trace showed `_drain_to_boundary: cancel set — aborting drain
early` immediately followed by `claude result: ` (empty) — the handler's
own preempt-cancel signal was aborting the handler's send().

The #786 fix added a `_preempt_pending` gate in iter_events to prevent
clobbering an in-flight cancel. That gate is correct for its scenario
(thread A fires cancel while thread B is mid-turn) but was inadvertently
keeping the cancel alive across lock turnover when the firer becomes the
new holder (thread A fires cancel, previous holder thread B yields,
thread A acquires the lock — A's prompt gets aborted by A's own signal).

Track the tid that fired the most recent cancel; clear the cancel in
__enter__ when the entering thread matches. Different-thread acquisition
preserves #786's invariant (cancel survives lock handoff for an unrelated
preempter).

Adds regression test that drives a real preempt+hold_for_handler+prompt
sequence and asserts the handler's user message actually reaches stdin.
The pre-fix behavior was that proc.stdin.write only saw control_request
interrupts from drain — never the handler's content.
@FidoCanCode FidoCanCode requested a review from rhencke April 25, 2026 15:38
@FidoCanCode FidoCanCode merged commit 60bed7c into main Apr 25, 2026
1 check passed
@FidoCanCode FidoCanCode deleted the fix-cancel-tid-tracking branch April 25, 2026 15:39
FidoCanCode added a commit that referenced this pull request Apr 25, 2026
…oses #975) (#977)

Fixes #975.

## Real bug

Probed the actual claude-code subprocess (`claude --version` → 2.1.120).
Its `control_response` shape is:

```json
{"type": "control_response",
 "response": {"subtype": "success", "request_id": "..."}}
```

`request_id` is nested under `response`. Fido's
`_send_control_set_model` checked `obj.get("request_id")` at the top
level. Always None. Predicate never matched. Every `switch_model` call
hung until the subprocess was killed.

## Proof

```
$ grep -c 'switch_model: now on model=' ~/log/fido.log
0
$ grep -c 'switch_model:.*(control_request)' ~/log/fido.log
9
```

Zero successful returns. Nine attempts. Every one hung. Concrete trace:

- 15:40:23 `[home] switch_model: opus → haiku`
- (no [home] log lines for 5 minutes)
- 15:45:23 `[home] worker started` (fresh thread post-fido-restart)

The session was wedged the entire window. Same shape across every
`switch_model` log line in the file.

## Why fido seemed to work anyway

Many `switch_model` calls are no-ops (target == current model) and
return at line 1102 before sending the control_request — so sessions
that didn't need an actual model change kept working. The 15:40:30
webhook test that succeeded landed on a freshly-booted confusio session
that was already on opus from spawn — no switch needed. Bug only fires
on real model changes.

## Fix

claude.py:1023-1028:

```python
if obj.get("type") == "control_response":
    response = obj.get("response") or {}
    if response.get("request_id") == request_id:
        return
```

## Test

Existing test fixtures had the wrong shape too — they put `request_id`
at the top level, matching fido's broken expectation. Both helpers
(`TestClaudeSessionSwitchModel._make_response_line` and
`TestClaudeSessionSendControlSetModel._make_response_line`) updated to
emit the real nested shape.

New regression test
`test_ignores_top_level_request_id_in_control_response` explicitly emits
a malformed top-level placement first, then the correct nested one —
asserts the malformed one is skipped and the wait completes only on the
nested response. This catches any future regression that re-introduces
top-level lookup.

All 264 claude tests pass.

## Related

- #975 — this issue
- #971 — landed control_request set_model with the broken predicate
- #973 / #974 — cancel-leak fix; uncovered the hang underneath
- #976 — silent .get linter (would have caught this at write time)

Co-authored-by: Fido Can Code <190991155+FidoCanCode@users.noreply.github.com>
FidoCanCode added a commit that referenced this pull request Apr 25, 2026
 #979) (#980)

Fixes #979.

## Real root cause

`_fire_worker_cancel` writes `control_request interrupt` to claude's
stdin **without holding the session `_lock`**. The worker thread holds
`_lock` for its in-flight turn — can't release until it sees the cancel
— so the webhook firing the preempt has no choice but to skirt the lock.

Linux pipe writes are only atomic up to `PIPE_BUF=4096 bytes`. Worker
mid-write of a multi-KB user message gets its bytes interleaved with the
webhook's tiny `control_request`, producing malformed JSON. claude
consumes the corrupt bytes and goes idle in `epoll_wait` waiting for
valid input — the wedged state we kept observing in production.

This is the underlying cause of the entire post-#955 cascade (#963,
#967, #971, #973, #974, #975). Each prior fix addressed a downstream
symptom; this addresses the source.

## Two changes that close the race

**1. `_stdin_lock` (separate from `_lock`)**

Held briefly around every `proc.stdin.write` + `flush`. Both
lock-holding callers and the unlocked cancel path acquire it, so
byte-level writes from concurrent threads cannot interleave. Worker
still holds `_lock` for its whole turn; webhook still fires interrupt
without `_lock`. The byte-level invariant is restored regardless of the
protocol-level lock state.

**2. iter_events drains to boundary on cancel**

Old behavior: cancel → break immediately. Left claude's response
(`control_response` + `type=result`) in the pipe as stale events,
requiring `_drain_to_boundary` on the next send to clean up.

New behavior: cancel → keep reading until `type=result` (just record
`_last_turn_cancelled=True`). Same iter_events call drains the response.
Next holder gets a clean session.

A new `_CANCEL_DRAIN_TIMEOUT` (5s) bounds the drain — if claude is
wedged and never emits the boundary, kill subprocess + recover. Replaces
the 10s drain-to-boundary in send() and the `_in_turn`-based stale-turn
detection.

## Tests

- `TestClaudeSessionStdinAtomicity` (new): drives concurrent `send` +
`_send_control_interrupt` at byte granularity and asserts no interleave.
Plus per-method tests that each write site holds `_stdin_lock`.
- `test_cancel_drains_to_boundary` (rewritten): asserts events ARE
consumed past cancel and `_in_turn=False` at end.
- `test_recovers_when_cancel_drain_times_out` (rewritten): asserts
`ClaudeStreamError(_RETURNCODE_CANCEL_DRAIN_TIMEOUT)` when claude
doesn't emit the boundary.
- `test_handler_prompt_runs_after_preempt_does_not_inherit_cancel`
(simplified): the cancel-leak bug from #973 is structurally impossible
in the new design.

All 263 claude tests pass. The single failing copilotcli test
(`test_webhook_preempts_worker_cancels_runtime`) is pre-existing and
unrelated.

## Follow-ups

- The cancel-firer-tid tracking from #974 (`_cancel_fired_by_tid` +
`__enter__`) is now redundant — iter_events naturally clears the cancel
state via the existing `_preempt_pending` gate. Worth removing in a
follow-up.
- `_drain_to_boundary` is no longer called from production code, only
from its own tests. Worth deleting.
- Both follow-ups would simplify the codebase further; left out of this
PR to keep the scope tight.

Co-authored-by: Fido Can Code <190991155+FidoCanCode@users.noreply.github.com>
FidoCanCode added a commit that referenced this pull request Apr 25, 2026
…loses #981) (#982)

Fixes #981.

## Root cause

`ClaudeSession._pending_talker_kind` was a session-level instance
attribute that `prompt()` wrote and `__enter__` read. Cross-thread race:
when worker and webhook threads called `prompt()` in overlapping
windows, one's write clobbered the other's. The lock-winner then read
the wrong kind in `__enter__`.

Production captured the bug clearly: status XML showed the worker tid
(140034430314176) registered with `kind="webhook"`. Every subsequent
`preempt_worker()` call read "holder is webhook" → returned False → no
cancel fired → webhooks queued behind the worker for 4+ minutes.

## Fix

`__enter__` reads the kind directly from
`provider.current_thread_kind()` — the calling thread's thread-local. No
shared attribute. No race. Each thread carries its own kind by
construction.

Removes `_pending_talker_kind` entirely (set + read + clear). Removes
the now-unused `Literal` import.

## Test

`test_enter_uses_thread_kind_not_shared_attribute` plants a stale
"webhook" annotation on the session-level attribute (simulating what
another thread might have written) while the current thread's kind is
"worker".

- Pre-fix: `__enter__` reads the annotation and mis-registers as webhook
→ test fails
- Post-fix: `__enter__` reads the thread-local and registers as worker →
test passes

Verified by stashing the source fix and re-running: pre-fix FAILS,
post-fix PASSES. So the test catches the bug.

## Related

- #979 / #980 — byte-level stdin race; fixed prior. This is a separate
higher-level race on talker registration that #980 unmasked (same cause:
lock handoff is faster, race window more frequent).
- #973 / #974 — earlier preempt-related race fixes (cancel-leak)
- #786 — original talker registration design

Co-authored-by: Fido Can Code <190991155+FidoCanCode@users.noreply.github.com>
FidoCanCode added a commit that referenced this pull request Apr 25, 2026
Fixes #983.

## Bug

When webhook A held the lock and webhook B queued behind it via
`hold_for_handler`, the lock handoff at A→exit went into a free-for-all
between B and any worker thread also waiting. Python's `RLock` isn't
FIFO; the worker often won. Result: `A → worker → B` instead of `A → B →
worker`. Every subsequent webhook in a flurry waited behind a full
worker turn.

Production trace at 18:25:44: review submitted with two inline comments
→ three webhooks → handler 1 EXITs → worker grabs lock and runs Agent
Explore tool → handler 3 still queued.

## Fix

**ClaudeSession** gains a `_preempt_cond` Condition. Webhook callers set
`_preempt_pending` and `notify_all` under this mutex (via new
`_signal_pending_preempt` method). Workers entering `__enter__`
`wait_for(not pending)`, and **re-check after acquiring** the lock — if
a webhook arrived in the gap, release and retry. Worker re-acquire is
strictly gated on "no webhook waiting".

**OwnedSession** gains an abstract `_signal_pending_preempt` hook
(default no-op). `hold_for_handler` calls it on the queueing branch
(where `preempt_worker` returned False because the holder is also a
webhook). ClaudeSession overrides.

The cond mutex is held only briefly per check/set/clear — never during
session work. RLock semantics for the holder unchanged; only
acquire-time prioritization changes.

## Test

`test_hold_for_handler_signals_pending_when_queueing_behind_webhook`
patches `try_preempt_worker` to return `(False, "webhook")` and asserts
`hold_for_handler` calls `_signal_pending_preempt` exactly once and
`_preempt_pending` is set afterward. Deterministic — no `time.sleep`.
Pre-fix: `AttributeError` (no such method). Post-fix: passes.

## Related

- #955, #963, #967, #971, #973, #974, #975, #979, #980, #981, #982 — all
the prior fixes in this preempt cascade. Each peeled a layer; this one
closes the lock-handoff prioritization.

Co-authored-by: Fido Can Code <190991155+FidoCanCode@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: webhook handler's first prompt aborted by leftover cancel flag from worker preempt

2 participants