Use control_request set_model for mid-session model switch instead of session restart (closes #852) by FidoCanCode · Pull Request #971 · FidoCanCode/home

FidoCanCode · 2026-04-25T14:57:50Z

Fixes #852.

Replaces the restart-based model switch in ClaudeSession with a control_request set_model message sent over the existing subprocess stdin pipe — no more kill/respawn cycle, no init-handshake hang window, no cache-cold re-read of system prompts. Deletes the restart-to-switch fallback entirely and updates all callers and docstrings to match.

Work queue

Completed (4)

Update callers and delete restart-to-switch-model fallback
Clean up stale spawn/resume comments and docstrings
Replace restart-based switch_model with set_model control_request
Add set_model control_request method to ClaudeSession

rhencke · 2026-04-25T15:04:09Z

Use ./fido ci to check everything

FidoCanCode · 2026-04-25T15:07:32Z

Yep, ./fido ci is my go-to! That's the canonical check path — format, lint, typecheck, tests, the whole pack. Will make sure every commit clears it before pushing. 🐾

New _send_control_set_model(model) method sends a control_request set_model to subprocess stdin and reads stdout until the matching control_response arrives (correlated by request_id). No subprocess kill, no restart, no init-handshake — same session, same pipes. Also adds _RETURNCODE_SET_MODEL_TIMEOUT sentinel (-2) for the ClaudeStreamError raised on EOF, subprocess exit, or timeout while waiting for the response. 13 tests cover: wire format, request_id correlation, response skipping (wrong id, wrong type, empty lines), EOF/timeout/exit error paths, continue-when-not-ready loop, and session_id tracking from response.

Rewrites switch_model() to call _send_control_set_model() instead of kill/spawn. No more child-unregister/kill/wait/spawn/register cycle on every model transition — the running subprocess switches in-place via the control_request protocol. Updates TestClaudeSessionSwitchModel to assert on the control_request stdin write rather than kill/respawn.

…tics Audit confirmed all three callers (session_agent.ensure_session, ClaudeSession.prompt, ClaudeClient.ensure_session) are already compatible with the control_request implementation — no defensive wrappers or restart fallbacks to remove. Update the protocol docstring to make explicit that the switch is in-place: no kill, no respawn, no session-state loss.

…ments Four comments and docstrings still described the old kill/spawn model swap after switch_model was rewritten to use control_request: - _lock init comment: drop the phantom :meth:`restart` reference - _session_id init comment: now says recover/reset use --resume on _spawn, not switch_model - _spawn docstring: replace "context survives a model swap" with "context survives a subprocess restart (recover/reset)" - iter_events session_id comment: now says recover/reset, not switch_model restart

…rom worker preempt (closes #973) (#974) Fixes #973. ## Root cause After #967 (synchronous webhook preempt) and #971 (in-place model switch), webhook handlers consistently aborted their first prompt with an empty result. Production trace at 15:24:24: ``` hold_for_handler: preempting worker ClaudeClient.run_turn: preempted mid-flight — retry 2 triage classifier: requesting category from opus session.prompt: lock acquired (waited=0.00s) switch_model: sonnet → opus (control_request) claude result: ← empty, instant (silence) ``` The #786 fix added the `_preempt_pending` gate in `iter_events` to prevent clobbering an in-flight cancel signal. That gate is correct for its scenario (thread A fires cancel while thread B is mid-turn). But after lock turnover — thread A fires cancel, previous holder yields, **thread A itself acquires the lock** — A's own prompt was getting aborted by A's own signal. `_drain_to_boundary` saw cancel set, aborted early. `send()` saw cancel set, returned without writing. `iter_events` gated on `_preempt_pending` (still set from preempt), refused to clear cancel, broke immediately. ## Fix Track which thread fired the most recent cancel (`_cancel_fired_by_tid`). In `__enter__`, clear the cancel signal when the entering thread is the same as the firer — that signal was meant for the previous holder, not for the firer's own first turn. Different-thread acquisition (the #786 lock-handoff race) preserves the cancel and lets `iter_events` consume it. ## Test Adds `test_handler_prompt_runs_after_preempt_does_not_inherit_cancel` that drives a real preempt → hold_for_handler → prompt sequence and asserts the handler's user message actually reaches stdin. The pre-fix behavior was that `proc.stdin.write` only saw `control_request interrupt` writes from drain — never the handler's content. All 263 claude-related tests pass including #786's invariant tests (`test_enter_preserves_cancel_event`, `test_cancel_survives_lock_handoff`). ## Related - #955 — synchronous preempt regression - #967 — landed synchronous preempt - #971 — landed in-place model switch (fixed init-handshake deadlock; this PR fixes the next-layer hang) - #786 — original cancel-survival fix; this PR is compatible - #949 — silent ANSWER cycles. Same surface; #973 is the precise root cause. Co-authored-by: Fido Can Code <190991155+FidoCanCode@users.noreply.github.com>

…oses #975) (#977) Fixes #975. ## Real bug Probed the actual claude-code subprocess (`claude --version` → 2.1.120). Its `control_response` shape is: ```json {"type": "control_response", "response": {"subtype": "success", "request_id": "..."}} ``` `request_id` is nested under `response`. Fido's `_send_control_set_model` checked `obj.get("request_id")` at the top level. Always None. Predicate never matched. Every `switch_model` call hung until the subprocess was killed. ## Proof ``` $ grep -c 'switch_model: now on model=' ~/log/fido.log 0 $ grep -c 'switch_model:.*(control_request)' ~/log/fido.log 9 ``` Zero successful returns. Nine attempts. Every one hung. Concrete trace: - 15:40:23 `[home] switch_model: opus → haiku` - (no [home] log lines for 5 minutes) - 15:45:23 `[home] worker started` (fresh thread post-fido-restart) The session was wedged the entire window. Same shape across every `switch_model` log line in the file. ## Why fido seemed to work anyway Many `switch_model` calls are no-ops (target == current model) and return at line 1102 before sending the control_request — so sessions that didn't need an actual model change kept working. The 15:40:30 webhook test that succeeded landed on a freshly-booted confusio session that was already on opus from spawn — no switch needed. Bug only fires on real model changes. ## Fix claude.py:1023-1028: ```python if obj.get("type") == "control_response": response = obj.get("response") or {} if response.get("request_id") == request_id: return ``` ## Test Existing test fixtures had the wrong shape too — they put `request_id` at the top level, matching fido's broken expectation. Both helpers (`TestClaudeSessionSwitchModel._make_response_line` and `TestClaudeSessionSendControlSetModel._make_response_line`) updated to emit the real nested shape. New regression test `test_ignores_top_level_request_id_in_control_response` explicitly emits a malformed top-level placement first, then the correct nested one — asserts the malformed one is skipped and the wait completes only on the nested response. This catches any future regression that re-introduces top-level lookup. All 264 claude tests pass. ## Related - #975 — this issue - #971 — landed control_request set_model with the broken predicate - #973 / #974 — cancel-leak fix; uncovered the hang underneath - #976 — silent .get linter (would have caught this at write time) Co-authored-by: Fido Can Code <190991155+FidoCanCode@users.noreply.github.com>

#979) (#980) Fixes #979. ## Real root cause `_fire_worker_cancel` writes `control_request interrupt` to claude's stdin **without holding the session `_lock`**. The worker thread holds `_lock` for its in-flight turn — can't release until it sees the cancel — so the webhook firing the preempt has no choice but to skirt the lock. Linux pipe writes are only atomic up to `PIPE_BUF=4096 bytes`. Worker mid-write of a multi-KB user message gets its bytes interleaved with the webhook's tiny `control_request`, producing malformed JSON. claude consumes the corrupt bytes and goes idle in `epoll_wait` waiting for valid input — the wedged state we kept observing in production. This is the underlying cause of the entire post-#955 cascade (#963, #967, #971, #973, #974, #975). Each prior fix addressed a downstream symptom; this addresses the source. ## Two changes that close the race **1. `_stdin_lock` (separate from `_lock`)** Held briefly around every `proc.stdin.write` + `flush`. Both lock-holding callers and the unlocked cancel path acquire it, so byte-level writes from concurrent threads cannot interleave. Worker still holds `_lock` for its whole turn; webhook still fires interrupt without `_lock`. The byte-level invariant is restored regardless of the protocol-level lock state. **2. iter_events drains to boundary on cancel** Old behavior: cancel → break immediately. Left claude's response (`control_response` + `type=result`) in the pipe as stale events, requiring `_drain_to_boundary` on the next send to clean up. New behavior: cancel → keep reading until `type=result` (just record `_last_turn_cancelled=True`). Same iter_events call drains the response. Next holder gets a clean session. A new `_CANCEL_DRAIN_TIMEOUT` (5s) bounds the drain — if claude is wedged and never emits the boundary, kill subprocess + recover. Replaces the 10s drain-to-boundary in send() and the `_in_turn`-based stale-turn detection. ## Tests - `TestClaudeSessionStdinAtomicity` (new): drives concurrent `send` + `_send_control_interrupt` at byte granularity and asserts no interleave. Plus per-method tests that each write site holds `_stdin_lock`. - `test_cancel_drains_to_boundary` (rewritten): asserts events ARE consumed past cancel and `_in_turn=False` at end. - `test_recovers_when_cancel_drain_times_out` (rewritten): asserts `ClaudeStreamError(_RETURNCODE_CANCEL_DRAIN_TIMEOUT)` when claude doesn't emit the boundary. - `test_handler_prompt_runs_after_preempt_does_not_inherit_cancel` (simplified): the cancel-leak bug from #973 is structurally impossible in the new design. All 263 claude tests pass. The single failing copilotcli test (`test_webhook_preempts_worker_cancels_runtime`) is pre-existing and unrelated. ## Follow-ups - The cancel-firer-tid tracking from #974 (`_cancel_fired_by_tid` + `__enter__`) is now redundant — iter_events naturally clears the cancel state via the existing `_preempt_pending` gate. Worth removing in a follow-up. - `_drain_to_boundary` is no longer called from production code, only from its own tests. Worth deleting. - Both follow-ups would simplify the codebase further; left out of this PR to keep the scope tight. Co-authored-by: Fido Can Code <190991155+FidoCanCode@users.noreply.github.com>

Fixes #983. ## Bug When webhook A held the lock and webhook B queued behind it via `hold_for_handler`, the lock handoff at A→exit went into a free-for-all between B and any worker thread also waiting. Python's `RLock` isn't FIFO; the worker often won. Result: `A → worker → B` instead of `A → B → worker`. Every subsequent webhook in a flurry waited behind a full worker turn. Production trace at 18:25:44: review submitted with two inline comments → three webhooks → handler 1 EXITs → worker grabs lock and runs Agent Explore tool → handler 3 still queued. ## Fix **ClaudeSession** gains a `_preempt_cond` Condition. Webhook callers set `_preempt_pending` and `notify_all` under this mutex (via new `_signal_pending_preempt` method). Workers entering `__enter__` `wait_for(not pending)`, and **re-check after acquiring** the lock — if a webhook arrived in the gap, release and retry. Worker re-acquire is strictly gated on "no webhook waiting". **OwnedSession** gains an abstract `_signal_pending_preempt` hook (default no-op). `hold_for_handler` calls it on the queueing branch (where `preempt_worker` returned False because the holder is also a webhook). ClaudeSession overrides. The cond mutex is held only briefly per check/set/clear — never during session work. RLock semantics for the holder unchanged; only acquire-time prioritization changes. ## Test `test_hold_for_handler_signals_pending_when_queueing_behind_webhook` patches `try_preempt_worker` to return `(False, "webhook")` and asserts `hold_for_handler` calls `_signal_pending_preempt` exactly once and `_preempt_pending` is set afterward. Deterministic — no `time.sleep`. Pre-fix: `AttributeError` (no such method). Post-fix: passes. ## Related - #955, #963, #967, #971, #973, #974, #975, #979, #980, #981, #982 — all the prior fixes in this preempt cascade. Each peeled a layer; this one closes the lock-handoff prioritization. Co-authored-by: Fido Can Code <190991155+FidoCanCode@users.noreply.github.com>

FidoCanCode force-pushed the mid-session-model-switch branch from 2375660 to 01809c4 Compare April 25, 2026 15:09

FidoCanCode added 3 commits April 25, 2026 15:13

FidoCanCode marked this pull request as ready for review April 25, 2026 15:18

FidoCanCode requested a review from rhencke April 25, 2026 15:18

rhencke approved these changes Apr 25, 2026

View reviewed changes

FidoCanCode merged commit e7d1b29 into main Apr 25, 2026
1 check passed

FidoCanCode deleted the mid-session-model-switch branch April 25, 2026 15:20

This was referenced Apr 25, 2026

Bug: webhook handler's first prompt aborted by leftover cancel flag from worker preempt #973

Closed

Bug: webhook handler's first prompt aborted by leftover cancel flag from worker preempt (closes #973) #974

Merged

This was referenced Apr 25, 2026

Bug: cross-thread stdin writes corrupt claude protocol stream — actual root cause of preempt hangs #979

Closed

Serialize claude stdin writes; drain cancelled turns to boundary (closes #979) #980

Merged

FidoCanCode mentioned this pull request Apr 25, 2026

Webhook-priority lock handoff via Condition mutex (closes #983) #984

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use control_request set_model for mid-session model switch instead of session restart (closes #852)#971

Use control_request set_model for mid-session model switch instead of session restart (closes #852)#971
FidoCanCode merged 4 commits into
mainfrom
mid-session-model-switch

FidoCanCode commented Apr 25, 2026 •

edited

Loading

Uh oh!

rhencke commented Apr 25, 2026

Uh oh!

FidoCanCode commented Apr 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

FidoCanCode commented Apr 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Work queue

Uh oh!

rhencke commented Apr 25, 2026

Uh oh!

FidoCanCode commented Apr 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

FidoCanCode commented Apr 25, 2026 •

edited

Loading