Skip to content

Bug: webhooks no longer preempt the worker — they queue behind the session lock (closes #955)#967

Merged
FidoCanCode merged 3 commits into
mainfrom
fix-webhook-preemption
Apr 25, 2026
Merged

Bug: webhooks no longer preempt the worker — they queue behind the session lock (closes #955)#967
FidoCanCode merged 3 commits into
mainfrom
fix-webhook-preemption

Conversation

@FidoCanCode
Copy link
Copy Markdown
Owner

@FidoCanCode FidoCanCode commented Apr 25, 2026

Fixes #955.

Moves the preemption signal from the background handler thread into the synchronous HTTP webhook path, eliminating the thread-scheduling race that let worker turns run to completion when a webhook arrived seconds before the turn ended. Removes the now-redundant try_preempt_worker call from session.prompt (keeping hold_for_handler as a secondary safety net for new-turn races), and adds an end-to-end test for the new timing guarantee.


Work queue

Completed (3)
  • Remove redundant preemption from hold_for_handler and session.prompt
  • Add end-to-end test: synchronous preemption fires before background thread runs
  • Fire preemption signal synchronously in HTTP handler before background spawn

The preemption cancel was previously deferred to the background handler
thread's hold_for_handler(preempt_worker=True) call.  This meant if the
background thread was delayed by OS scheduling, the worker's turn could
complete before the cancel ever fired — exactly the race observed when a
256-second turn had only 9 seconds left when the webhook arrived.

Move the cancel to _do_post_inner, synchronous on the HTTP handler thread,
before the background thread is spawned.  session.preempt_worker() checks
the talker registry directly (no caller-kind gate needed — the HTTP handler
is inherently a webhook context) and fires _fire_worker_cancel() immediately
if a worker is running.  hold_for_handler(preempt_worker=True) in the
background thread remains as a second preemption point to handle the edge
case where the worker starts a new turn in the window between the synchronous
cancel and the background thread acquiring the session lock.
@FidoCanCode FidoCanCode force-pushed the fix-webhook-preemption branch from 6b0cfdd to 5c76683 Compare April 25, 2026 14:19
….__enter__

With preemption now firing synchronously on the HTTP handler thread
(OwnedSession.preempt_worker, #955) and hold_for_handler as the secondary
safety net, the try_preempt_worker calls in ClaudeSession.prompt and
CopilotCLISession.__enter__ are redundant — they can never fire (webhook
handlers always enter via hold_for_handler which already preempted; workers
and reentrant webhook calls are gated out by the caller_kind check).

Remove both preemption sites and their associated logging.  Preemption is
now documented at exactly two points: preempt_worker() for the synchronous
HTTP-handler fire, and hold_for_handler(preempt_worker=True) for the
background-thread safety net.  Update _fire_worker_cancel docstrings to
reference the callers by their correct names.

test_claude_preempt_kind.py shrinks to just the set_thread_kind roundtrip —
the preempt-kind tests for session.prompt are deleted since that code is
gone.  Preemption tests for hold_for_handler remain in
test_claude_hold_for_handler.py and test_copilot_hold_for_handler.py.
Add TestSynchronousPreemption to test_server.py with three tests:
preempt fires before spawn for model-needing webhooks, no preempt for
non-model actions (PR merge), and no crash when get_session returns None.
@FidoCanCode FidoCanCode marked this pull request as ready for review April 25, 2026 14:33
@FidoCanCode FidoCanCode requested a review from rhencke April 25, 2026 14:33
@FidoCanCode FidoCanCode merged commit 2c73bd5 into main Apr 25, 2026
1 check passed
@FidoCanCode FidoCanCode deleted the fix-webhook-preemption branch April 25, 2026 14:36
FidoCanCode added a commit that referenced this pull request Apr 25, 2026
…ry misclassified as task completion (closes #969) (#970)

Fixes #969.

## Root cause: collision between #965 and #967

`Worker._execute_task`'s resume loop checked at worker.py:2387:

```python
if not any(
    t["id"] == task["id"] and t.get("status") == TaskStatus.PENDING
    for t in current_task_list
):
    log.info("task externally completed — stopping retry ...")
    break
```

The intent was always *"task was marked COMPLETED by an external
action"*, but the implementation was *"status is anything other than
PENDING"*. Pre-#965 those were equivalent. #965 added
`self._tasks.update(task_id, IN_PROGRESS)` at task start
(worker.py:2362), so the running task is now IN_PROGRESS — the check
breaks out immediately on every resume-loop iteration.

The bug was latent until #967 added synchronous webhook preempt: a
preempted turn returns empty without commits, `while head_before ==
head_after` runs, the check misclassifies, the worker breaks the resume
loop, and downstream logic closes the in-flight PR.

## Fix

Compare for COMPLETED directly. One-line change.

## Test

Adds `test_does_not_treat_in_progress_task_as_externally_completed` that
drives `execute_task` through one resume-loop retry with an IN_PROGRESS
task and asserts the loop does not short-circuit.

## Trace from the wild

```
14:42:02 task XXX → in_progress
14:42:18 webhook: preempting worker synchronously
14:42:18 ClaudeSession: cancelled — exiting turn early
14:42:18 session.prompt: turn complete (cancelled=True)
14:42:18 ClaudeClient.run_turn: preempted mid-flight — retry 1
14:42:18 session.prompt: turn complete (total=0.01s, result_len=0)   ← empty
14:42:18 task done (session=)
14:42:18 task externally completed — stopping retry (id=XXX)         ← bug fires
14:42:22 PR #968 closed by FidoCanCode                                ← self-close
```

Co-authored-by: Fido Can Code <190991155+FidoCanCode@users.noreply.github.com>
FidoCanCode added a commit that referenced this pull request Apr 25, 2026
…rom worker preempt (closes #973) (#974)

Fixes #973.

## Root cause

After #967 (synchronous webhook preempt) and #971 (in-place model
switch), webhook handlers consistently aborted their first prompt with
an empty result. Production trace at 15:24:24:

```
hold_for_handler: preempting worker
ClaudeClient.run_turn: preempted mid-flight — retry 2
triage classifier: requesting category from opus
session.prompt: lock acquired (waited=0.00s)
switch_model: sonnet → opus (control_request)
claude result:                                      ← empty, instant
(silence)
```

The #786 fix added the `_preempt_pending` gate in `iter_events` to
prevent clobbering an in-flight cancel signal. That gate is correct for
its scenario (thread A fires cancel while thread B is mid-turn). But
after lock turnover — thread A fires cancel, previous holder yields,
**thread A itself acquires the lock** — A's own prompt was getting
aborted by A's own signal.

`_drain_to_boundary` saw cancel set, aborted early. `send()` saw cancel
set, returned without writing. `iter_events` gated on `_preempt_pending`
(still set from preempt), refused to clear cancel, broke immediately.

## Fix

Track which thread fired the most recent cancel
(`_cancel_fired_by_tid`). In `__enter__`, clear the cancel signal when
the entering thread is the same as the firer — that signal was meant for
the previous holder, not for the firer's own first turn.

Different-thread acquisition (the #786 lock-handoff race) preserves the
cancel and lets `iter_events` consume it.

## Test

Adds `test_handler_prompt_runs_after_preempt_does_not_inherit_cancel`
that drives a real preempt → hold_for_handler → prompt sequence and
asserts the handler's user message actually reaches stdin. The pre-fix
behavior was that `proc.stdin.write` only saw `control_request
interrupt` writes from drain — never the handler's content.

All 263 claude-related tests pass including #786's invariant tests
(`test_enter_preserves_cancel_event`,
`test_cancel_survives_lock_handoff`).

## Related

- #955 — synchronous preempt regression
- #967 — landed synchronous preempt
- #971 — landed in-place model switch (fixed init-handshake deadlock;
this PR fixes the next-layer hang)
- #786 — original cancel-survival fix; this PR is compatible
- #949 — silent ANSWER cycles. Same surface; #973 is the precise root
cause.

Co-authored-by: Fido Can Code <190991155+FidoCanCode@users.noreply.github.com>
FidoCanCode added a commit that referenced this pull request Apr 25, 2026
 #979) (#980)

Fixes #979.

## Real root cause

`_fire_worker_cancel` writes `control_request interrupt` to claude's
stdin **without holding the session `_lock`**. The worker thread holds
`_lock` for its in-flight turn — can't release until it sees the cancel
— so the webhook firing the preempt has no choice but to skirt the lock.

Linux pipe writes are only atomic up to `PIPE_BUF=4096 bytes`. Worker
mid-write of a multi-KB user message gets its bytes interleaved with the
webhook's tiny `control_request`, producing malformed JSON. claude
consumes the corrupt bytes and goes idle in `epoll_wait` waiting for
valid input — the wedged state we kept observing in production.

This is the underlying cause of the entire post-#955 cascade (#963,
#967, #971, #973, #974, #975). Each prior fix addressed a downstream
symptom; this addresses the source.

## Two changes that close the race

**1. `_stdin_lock` (separate from `_lock`)**

Held briefly around every `proc.stdin.write` + `flush`. Both
lock-holding callers and the unlocked cancel path acquire it, so
byte-level writes from concurrent threads cannot interleave. Worker
still holds `_lock` for its whole turn; webhook still fires interrupt
without `_lock`. The byte-level invariant is restored regardless of the
protocol-level lock state.

**2. iter_events drains to boundary on cancel**

Old behavior: cancel → break immediately. Left claude's response
(`control_response` + `type=result`) in the pipe as stale events,
requiring `_drain_to_boundary` on the next send to clean up.

New behavior: cancel → keep reading until `type=result` (just record
`_last_turn_cancelled=True`). Same iter_events call drains the response.
Next holder gets a clean session.

A new `_CANCEL_DRAIN_TIMEOUT` (5s) bounds the drain — if claude is
wedged and never emits the boundary, kill subprocess + recover. Replaces
the 10s drain-to-boundary in send() and the `_in_turn`-based stale-turn
detection.

## Tests

- `TestClaudeSessionStdinAtomicity` (new): drives concurrent `send` +
`_send_control_interrupt` at byte granularity and asserts no interleave.
Plus per-method tests that each write site holds `_stdin_lock`.
- `test_cancel_drains_to_boundary` (rewritten): asserts events ARE
consumed past cancel and `_in_turn=False` at end.
- `test_recovers_when_cancel_drain_times_out` (rewritten): asserts
`ClaudeStreamError(_RETURNCODE_CANCEL_DRAIN_TIMEOUT)` when claude
doesn't emit the boundary.
- `test_handler_prompt_runs_after_preempt_does_not_inherit_cancel`
(simplified): the cancel-leak bug from #973 is structurally impossible
in the new design.

All 263 claude tests pass. The single failing copilotcli test
(`test_webhook_preempts_worker_cancels_runtime`) is pre-existing and
unrelated.

## Follow-ups

- The cancel-firer-tid tracking from #974 (`_cancel_fired_by_tid` +
`__enter__`) is now redundant — iter_events naturally clears the cancel
state via the existing `_preempt_pending` gate. Worth removing in a
follow-up.
- `_drain_to_boundary` is no longer called from production code, only
from its own tests. Worth deleting.
- Both follow-ups would simplify the codebase further; left out of this
PR to keep the scope tight.

Co-authored-by: Fido Can Code <190991155+FidoCanCode@users.noreply.github.com>
FidoCanCode added a commit that referenced this pull request Apr 25, 2026
Fixes #983.

## Bug

When webhook A held the lock and webhook B queued behind it via
`hold_for_handler`, the lock handoff at A→exit went into a free-for-all
between B and any worker thread also waiting. Python's `RLock` isn't
FIFO; the worker often won. Result: `A → worker → B` instead of `A → B →
worker`. Every subsequent webhook in a flurry waited behind a full
worker turn.

Production trace at 18:25:44: review submitted with two inline comments
→ three webhooks → handler 1 EXITs → worker grabs lock and runs Agent
Explore tool → handler 3 still queued.

## Fix

**ClaudeSession** gains a `_preempt_cond` Condition. Webhook callers set
`_preempt_pending` and `notify_all` under this mutex (via new
`_signal_pending_preempt` method). Workers entering `__enter__`
`wait_for(not pending)`, and **re-check after acquiring** the lock — if
a webhook arrived in the gap, release and retry. Worker re-acquire is
strictly gated on "no webhook waiting".

**OwnedSession** gains an abstract `_signal_pending_preempt` hook
(default no-op). `hold_for_handler` calls it on the queueing branch
(where `preempt_worker` returned False because the holder is also a
webhook). ClaudeSession overrides.

The cond mutex is held only briefly per check/set/clear — never during
session work. RLock semantics for the holder unchanged; only
acquire-time prioritization changes.

## Test

`test_hold_for_handler_signals_pending_when_queueing_behind_webhook`
patches `try_preempt_worker` to return `(False, "webhook")` and asserts
`hold_for_handler` calls `_signal_pending_preempt` exactly once and
`_preempt_pending` is set afterward. Deterministic — no `time.sleep`.
Pre-fix: `AttributeError` (no such method). Post-fix: passes.

## Related

- #955, #963, #967, #971, #973, #974, #975, #979, #980, #981, #982 — all
the prior fixes in this preempt cascade. Each peeled a layer; this one
closes the lock-handoff prioritization.

Co-authored-by: Fido Can Code <190991155+FidoCanCode@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: webhooks no longer preempt the worker — they queue behind the session lock

2 participants