Bug: webhooks no longer preempt the worker — they queue behind the session lock (closes #955) by FidoCanCode · Pull Request #967 · FidoCanCode/home

FidoCanCode · 2026-04-25T14:13:11Z

Fixes #955.

Moves the preemption signal from the background handler thread into the synchronous HTTP webhook path, eliminating the thread-scheduling race that let worker turns run to completion when a webhook arrived seconds before the turn ended. Removes the now-redundant try_preempt_worker call from session.prompt (keeping hold_for_handler as a secondary safety net for new-turn races), and adds an end-to-end test for the new timing guarantee.

Work queue

Completed (3)

Remove redundant preemption from hold_for_handler and session.prompt
Add end-to-end test: synchronous preemption fires before background thread runs
Fire preemption signal synchronously in HTTP handler before background spawn

The preemption cancel was previously deferred to the background handler thread's hold_for_handler(preempt_worker=True) call. This meant if the background thread was delayed by OS scheduling, the worker's turn could complete before the cancel ever fired — exactly the race observed when a 256-second turn had only 9 seconds left when the webhook arrived. Move the cancel to _do_post_inner, synchronous on the HTTP handler thread, before the background thread is spawned. session.preempt_worker() checks the talker registry directly (no caller-kind gate needed — the HTTP handler is inherently a webhook context) and fires _fire_worker_cancel() immediately if a worker is running. hold_for_handler(preempt_worker=True) in the background thread remains as a second preemption point to handle the edge case where the worker starts a new turn in the window between the synchronous cancel and the background thread acquiring the session lock.

….__enter__ With preemption now firing synchronously on the HTTP handler thread (OwnedSession.preempt_worker, #955) and hold_for_handler as the secondary safety net, the try_preempt_worker calls in ClaudeSession.prompt and CopilotCLISession.__enter__ are redundant — they can never fire (webhook handlers always enter via hold_for_handler which already preempted; workers and reentrant webhook calls are gated out by the caller_kind check). Remove both preemption sites and their associated logging. Preemption is now documented at exactly two points: preempt_worker() for the synchronous HTTP-handler fire, and hold_for_handler(preempt_worker=True) for the background-thread safety net. Update _fire_worker_cancel docstrings to reference the callers by their correct names. test_claude_preempt_kind.py shrinks to just the set_thread_kind roundtrip — the preempt-kind tests for session.prompt are deleted since that code is gone. Preemption tests for hold_for_handler remain in test_claude_hold_for_handler.py and test_copilot_hold_for_handler.py.

Add TestSynchronousPreemption to test_server.py with three tests: preempt fires before spawn for model-needing webhooks, no preempt for non-model actions (PR merge), and no crash when get_session returns None.

…ry misclassified as task completion (closes #969) (#970) Fixes #969. ## Root cause: collision between #965 and #967 `Worker._execute_task`'s resume loop checked at worker.py:2387: ```python if not any( t["id"] == task["id"] and t.get("status") == TaskStatus.PENDING for t in current_task_list ): log.info("task externally completed — stopping retry ...") break ``` The intent was always *"task was marked COMPLETED by an external action"*, but the implementation was *"status is anything other than PENDING"*. Pre-#965 those were equivalent. #965 added `self._tasks.update(task_id, IN_PROGRESS)` at task start (worker.py:2362), so the running task is now IN_PROGRESS — the check breaks out immediately on every resume-loop iteration. The bug was latent until #967 added synchronous webhook preempt: a preempted turn returns empty without commits, `while head_before == head_after` runs, the check misclassifies, the worker breaks the resume loop, and downstream logic closes the in-flight PR. ## Fix Compare for COMPLETED directly. One-line change. ## Test Adds `test_does_not_treat_in_progress_task_as_externally_completed` that drives `execute_task` through one resume-loop retry with an IN_PROGRESS task and asserts the loop does not short-circuit. ## Trace from the wild ``` 14:42:02 task XXX → in_progress 14:42:18 webhook: preempting worker synchronously 14:42:18 ClaudeSession: cancelled — exiting turn early 14:42:18 session.prompt: turn complete (cancelled=True) 14:42:18 ClaudeClient.run_turn: preempted mid-flight — retry 1 14:42:18 session.prompt: turn complete (total=0.01s, result_len=0) ← empty 14:42:18 task done (session=) 14:42:18 task externally completed — stopping retry (id=XXX) ← bug fires 14:42:22 PR #968 closed by FidoCanCode ← self-close ``` Co-authored-by: Fido Can Code <190991155+FidoCanCode@users.noreply.github.com>

…rom worker preempt (closes #973) (#974) Fixes #973. ## Root cause After #967 (synchronous webhook preempt) and #971 (in-place model switch), webhook handlers consistently aborted their first prompt with an empty result. Production trace at 15:24:24: ``` hold_for_handler: preempting worker ClaudeClient.run_turn: preempted mid-flight — retry 2 triage classifier: requesting category from opus session.prompt: lock acquired (waited=0.00s) switch_model: sonnet → opus (control_request) claude result: ← empty, instant (silence) ``` The #786 fix added the `_preempt_pending` gate in `iter_events` to prevent clobbering an in-flight cancel signal. That gate is correct for its scenario (thread A fires cancel while thread B is mid-turn). But after lock turnover — thread A fires cancel, previous holder yields, **thread A itself acquires the lock** — A's own prompt was getting aborted by A's own signal. `_drain_to_boundary` saw cancel set, aborted early. `send()` saw cancel set, returned without writing. `iter_events` gated on `_preempt_pending` (still set from preempt), refused to clear cancel, broke immediately. ## Fix Track which thread fired the most recent cancel (`_cancel_fired_by_tid`). In `__enter__`, clear the cancel signal when the entering thread is the same as the firer — that signal was meant for the previous holder, not for the firer's own first turn. Different-thread acquisition (the #786 lock-handoff race) preserves the cancel and lets `iter_events` consume it. ## Test Adds `test_handler_prompt_runs_after_preempt_does_not_inherit_cancel` that drives a real preempt → hold_for_handler → prompt sequence and asserts the handler's user message actually reaches stdin. The pre-fix behavior was that `proc.stdin.write` only saw `control_request interrupt` writes from drain — never the handler's content. All 263 claude-related tests pass including #786's invariant tests (`test_enter_preserves_cancel_event`, `test_cancel_survives_lock_handoff`). ## Related - #955 — synchronous preempt regression - #967 — landed synchronous preempt - #971 — landed in-place model switch (fixed init-handshake deadlock; this PR fixes the next-layer hang) - #786 — original cancel-survival fix; this PR is compatible - #949 — silent ANSWER cycles. Same surface; #973 is the precise root cause. Co-authored-by: Fido Can Code <190991155+FidoCanCode@users.noreply.github.com>

#979) (#980) Fixes #979. ## Real root cause `_fire_worker_cancel` writes `control_request interrupt` to claude's stdin **without holding the session `_lock`**. The worker thread holds `_lock` for its in-flight turn — can't release until it sees the cancel — so the webhook firing the preempt has no choice but to skirt the lock. Linux pipe writes are only atomic up to `PIPE_BUF=4096 bytes`. Worker mid-write of a multi-KB user message gets its bytes interleaved with the webhook's tiny `control_request`, producing malformed JSON. claude consumes the corrupt bytes and goes idle in `epoll_wait` waiting for valid input — the wedged state we kept observing in production. This is the underlying cause of the entire post-#955 cascade (#963, #967, #971, #973, #974, #975). Each prior fix addressed a downstream symptom; this addresses the source. ## Two changes that close the race **1. `_stdin_lock` (separate from `_lock`)** Held briefly around every `proc.stdin.write` + `flush`. Both lock-holding callers and the unlocked cancel path acquire it, so byte-level writes from concurrent threads cannot interleave. Worker still holds `_lock` for its whole turn; webhook still fires interrupt without `_lock`. The byte-level invariant is restored regardless of the protocol-level lock state. **2. iter_events drains to boundary on cancel** Old behavior: cancel → break immediately. Left claude's response (`control_response` + `type=result`) in the pipe as stale events, requiring `_drain_to_boundary` on the next send to clean up. New behavior: cancel → keep reading until `type=result` (just record `_last_turn_cancelled=True`). Same iter_events call drains the response. Next holder gets a clean session. A new `_CANCEL_DRAIN_TIMEOUT` (5s) bounds the drain — if claude is wedged and never emits the boundary, kill subprocess + recover. Replaces the 10s drain-to-boundary in send() and the `_in_turn`-based stale-turn detection. ## Tests - `TestClaudeSessionStdinAtomicity` (new): drives concurrent `send` + `_send_control_interrupt` at byte granularity and asserts no interleave. Plus per-method tests that each write site holds `_stdin_lock`. - `test_cancel_drains_to_boundary` (rewritten): asserts events ARE consumed past cancel and `_in_turn=False` at end. - `test_recovers_when_cancel_drain_times_out` (rewritten): asserts `ClaudeStreamError(_RETURNCODE_CANCEL_DRAIN_TIMEOUT)` when claude doesn't emit the boundary. - `test_handler_prompt_runs_after_preempt_does_not_inherit_cancel` (simplified): the cancel-leak bug from #973 is structurally impossible in the new design. All 263 claude tests pass. The single failing copilotcli test (`test_webhook_preempts_worker_cancels_runtime`) is pre-existing and unrelated. ## Follow-ups - The cancel-firer-tid tracking from #974 (`_cancel_fired_by_tid` + `__enter__`) is now redundant — iter_events naturally clears the cancel state via the existing `_preempt_pending` gate. Worth removing in a follow-up. - `_drain_to_boundary` is no longer called from production code, only from its own tests. Worth deleting. - Both follow-ups would simplify the codebase further; left out of this PR to keep the scope tight. Co-authored-by: Fido Can Code <190991155+FidoCanCode@users.noreply.github.com>

Fixes #983. ## Bug When webhook A held the lock and webhook B queued behind it via `hold_for_handler`, the lock handoff at A→exit went into a free-for-all between B and any worker thread also waiting. Python's `RLock` isn't FIFO; the worker often won. Result: `A → worker → B` instead of `A → B → worker`. Every subsequent webhook in a flurry waited behind a full worker turn. Production trace at 18:25:44: review submitted with two inline comments → three webhooks → handler 1 EXITs → worker grabs lock and runs Agent Explore tool → handler 3 still queued. ## Fix **ClaudeSession** gains a `_preempt_cond` Condition. Webhook callers set `_preempt_pending` and `notify_all` under this mutex (via new `_signal_pending_preempt` method). Workers entering `__enter__` `wait_for(not pending)`, and **re-check after acquiring** the lock — if a webhook arrived in the gap, release and retry. Worker re-acquire is strictly gated on "no webhook waiting". **OwnedSession** gains an abstract `_signal_pending_preempt` hook (default no-op). `hold_for_handler` calls it on the queueing branch (where `preempt_worker` returned False because the holder is also a webhook). ClaudeSession overrides. The cond mutex is held only briefly per check/set/clear — never during session work. RLock semantics for the holder unchanged; only acquire-time prioritization changes. ## Test `test_hold_for_handler_signals_pending_when_queueing_behind_webhook` patches `try_preempt_worker` to return `(False, "webhook")` and asserts `hold_for_handler` calls `_signal_pending_preempt` exactly once and `_preempt_pending` is set afterward. Deterministic — no `time.sleep`. Pre-fix: `AttributeError` (no such method). Post-fix: passes. ## Related - #955, #963, #967, #971, #973, #974, #975, #979, #980, #981, #982 — all the prior fixes in this preempt cascade. Each peeled a layer; this one closes the lock-handoff prioritization. Co-authored-by: Fido Can Code <190991155+FidoCanCode@users.noreply.github.com>

FidoCanCode force-pushed the fix-webhook-preemption branch from 6b0cfdd to 5c76683 Compare April 25, 2026 14:19

FidoCanCode added 2 commits April 25, 2026 14:25

test: verify synchronous preemption fires before background spawn (#955)

4ec06c8

Add TestSynchronousPreemption to test_server.py with three tests: preempt fires before spawn for model-needing webhooks, no preempt for non-model actions (PR merge), and no crash when get_session returns None.

FidoCanCode marked this pull request as ready for review April 25, 2026 14:33

FidoCanCode requested a review from rhencke April 25, 2026 14:33

rhencke approved these changes Apr 25, 2026

View reviewed changes

FidoCanCode merged commit 2c73bd5 into main Apr 25, 2026
1 check passed

FidoCanCode deleted the fix-webhook-preemption branch April 25, 2026 14:36

This was referenced Apr 25, 2026

Bug: fido closes his own PR after webhook preempt — empty post-cancel retry misclassified as task completion #969

Closed

Bug: fido closes own PR after webhook preempt — empty post-cancel retry misclassified as task completion (closes #969) #970

Merged

This was referenced Apr 25, 2026

Bug: webhook handler's first prompt aborted by leftover cancel flag from worker preempt #973

Closed

Bug: webhook handler's first prompt aborted by leftover cancel flag from worker preempt (closes #973) #974

Merged

This was referenced Apr 25, 2026

Bug: cross-thread stdin writes corrupt claude protocol stream — actual root cause of preempt hangs #979

Closed

Serialize claude stdin writes; drain cancelled turns to boundary (closes #979) #980

Merged

This was referenced Apr 25, 2026

Bug: webhook→webhook handoff lets the worker win the lock between handlers #983

Closed

Webhook-priority lock handoff via Condition mutex (closes #983) #984

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug: webhooks no longer preempt the worker — they queue behind the session lock (closes #955)#967

Bug: webhooks no longer preempt the worker — they queue behind the session lock (closes #955)#967
FidoCanCode merged 3 commits into
mainfrom
fix-webhook-preemption

FidoCanCode commented Apr 25, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

FidoCanCode commented Apr 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Work queue

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

FidoCanCode commented Apr 25, 2026 •

edited

Loading