Fix stuck Thinking after restart: 30s quiescence timeout#211
Merged
Conversation
…sessions When the app restarts, sessions that were mid-turn show 'Thinking...' and the user has to click Stop. The existing watchdog waits 600s (10 min) for resumed sessions, which is far too long. Changes: - Add WatchdogResumeQuiescenceTimeoutSeconds (30s) for resumed sessions that receive ZERO SDK events since restart - Seed watchdog LastEventAtTicks from events.jsonl file write time instead of DateTime.UtcNow, so elapsed time reflects actual idle time - If events.jsonl was written 5+ min ago, the watchdog fires on its first 15s check cycle (effectively instant) - If events start flowing (turn is genuinely active), HasReceivedEventsSinceResume goes true and the normal 120s/600s timeout tiers apply Respects all 8 processing-state-safety invariants. No new code paths for clearing IsProcessing — reuses existing watchdog fire path. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The WatchdogTimeoutSelection tests embedded the OLD two-tier formula (hasActiveTool || isResumed || hasUsedTools) instead of the new three-tier logic with useResumeQuiescence. They passed because they were self-contained (testing their own local formula, not calling real code), giving false coverage. Changes: - Extract ComputeEffectiveTimeout helper that mirrors production logic - Update 4 existing tests to use the new formula - Add 4 new tests: resumed+events (600s), resumed+activeTool (600s), multi-agent (600s), multi-agent+resumed+no-events (30s) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Three models (Opus 4.6, Sonnet 4.6, Codex 5.3) independently identified a critical bug: seeding LastEventAtTicks from events.jsonl file write time caused the 30s quiescence timeout to fire on the FIRST 15s watchdog check for any session where the file was >15s old. This is exactly the PR #148 regression (killing active sessions with a too-short timeout). Fix: seed from DateTime.UtcNow (not file time) so the 30s quiescence measures from app restart, not from last SDK event before crash. This gives the SDK a full 30s to reconnect and start streaming events. Added 16 new regression guard tests: - ResumeQuiescenceTimeout_IsReasonable (20-120s range, >= 2x check interval) - ResumeQuiescenceTimeout_IsLessThanInactivityTimeout - ResumeQuiescence_OnlyTriggersWhenResumedAndNoEvents (exhaustive combos) - ResumeQuiescence_NotResumed_NeverTriggersQuiescence - ResumeQuiescence_TransitionsToToolTimeout_WhenEventsArrive - ResumeQuiescence_TransitionsToInactivity_AfterIsResumedCleared - WatchdogTimeoutSelection_ExhaustiveMatrix (10 Theory rows) - SeedTime_MustNotCauseImmediateKill_RegressionGuard (PR #148 guard) - SeedTime_FromStaleFile_WouldCauseImmediateKill_DocumentsRisk - QuiescenceTimeout_EscapesOnFirstEvent - QuiescenceTimeout_DoesNotAffect_NormalSendPromptPath - WatchdogResumeQuiescence_Constant_MatchesExpectedValue - AllThreeTimeoutTiers_AreDistinct Also updated existing tests to use ComputeEffectiveTimeout helper mirroring the production 3-tier formula, preventing test-vs-production logic drift. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ime param Three-model review (Opus 4.6, Sonnet 4.6, Codex 5.3) found: 1. Error message shows '0 minute(s)' for 30s quiescence timeout due to integer division (30/60=0). Fixed: format as seconds when < 60s. 2. StartProcessingWatchdog had a seedTime parameter with a comment describing the OPPOSITE of correct usage (said 'seed from file time' when file-time seeding causes PR #148 regression). Removed the dead parameter entirely and replaced comment with a warning. 3. Stale comment at resume site said 'no separate short timeout needed' which contradicts the 30s quiescence this PR adds. Updated to list all three timeout tiers. Added 5 Theory tests for timeout message formatting (30s/59s/60s/120s/600s). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen
added a commit
that referenced
this pull request
Feb 26, 2026
Three bugs fixed: 1. Sessions stuck forever when SDK sends repeated SessionUsageInfoEvent (e.g., FailedDelegation) without terminal events. Root cause: HandleSessionEvent unconditionally updated LastEventAtTicks for ALL events, including metrics-only events that don't indicate progress. Fix: Skip LastEventAtTicks update for SessionUsageInfoEvent and AssistantUsageEvent. 2. Added WatchdogMaxProcessingTimeSeconds (3600s) as absolute safety net. Even if progress events keep arriving, no turn should run for 60 minutes without user notification. Uses ProcessingStartedAt (set in SendPromptAsync) so it cannot be reset by events. 3. False 'session stuck' warnings after app restart. Root cause: GetEventsFileRestoreHints used WatchdogInactivityTimeoutSeconds (120s) as file age threshold, but tool executions can go 5-10 minutes without events.jsonl writes. Fix: Use WatchdogToolExecutionTimeoutSeconds (600s) threshold. Added ~25 regression guard tests covering every known failure mode from PRs #148, #163, #195, #211, #224, plus invariant tests for the 8 processing state safety invariants. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen
added a commit
that referenced
this pull request
Feb 26, 2026
Three bugs fixed: 1. Sessions stuck forever when SDK sends repeated SessionUsageInfoEvent (e.g., FailedDelegation) without terminal events. Root cause: HandleSessionEvent unconditionally updated LastEventAtTicks for ALL events, including metrics-only events that don't indicate progress. Fix: Skip LastEventAtTicks update for SessionUsageInfoEvent and AssistantUsageEvent. 2. Added WatchdogMaxProcessingTimeSeconds (3600s) as absolute safety net. Even if progress events keep arriving, no turn should run for 60 minutes without user notification. Uses ProcessingStartedAt (set in SendPromptAsync) so it cannot be reset by events. 3. False 'session stuck' warnings after app restart. Root cause: GetEventsFileRestoreHints used WatchdogInactivityTimeoutSeconds (120s) as file age threshold, but tool executions can go 5-10 minutes without events.jsonl writes. Fix: Use WatchdogToolExecutionTimeoutSeconds (600s) threshold. Added ~25 regression guard tests covering every known failure mode from PRs #148, #163, #195, #211, #224, plus invariant tests for the 8 processing state safety invariants. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen
added a commit
that referenced
this pull request
Feb 26, 2026
## Problem Three bugs in the processing watchdog: 1. **Session stuck at 361 minutes** — MergeNET11 session had `IsProcessing=true` for 361m. The SDK was sending repeated `SessionUsageInfoEvent` (FailedDelegation) which reset `LastEventAtTicks` without ever sending a terminal event. The watchdog timer kept resetting and never fired. 2. **No max processing time** — Even with fix #1, there was no absolute ceiling. A session could run forever as long as progress events arrived. 3. **False 'stuck' warnings** — After app restart, sessions got `⚠️ Session appears stuck — no events received for over 30 seconds` even though they were actively working. `GetEventsFileRestoreHints` used 120s threshold, but tool executions can go 5-10 minutes without events.jsonl writes. ## Fix 1. **Gate `LastEventAtTicks` update** — Skip for `SessionUsageInfoEvent` and `AssistantUsageEvent` (metrics-only events that don't indicate turn progress). 2. **Add `WatchdogMaxProcessingTimeSeconds`** (3600s = 60 min) — Absolute safety net using `ProcessingStartedAt`, which cannot be reset by events. 3. **Fix restore hints threshold** — Changed from `WatchdogInactivityTimeoutSeconds` (120s) to `WatchdogToolExecutionTimeoutSeconds` (600s). ## Testing - ~25 new regression guard tests covering every known failure mode from PRs #148, #163, #195, #211, #224 - Invariant tests for INV-1, INV-4, INV-5, INV-6 - End-to-end scenario tests - All 1447 tests pass --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
After app restart, resumed sessions that were mid-turn show Thinking... with a Stop button. The user must manually click Stop every time. The existing watchdog waited 600s (10 min!) before clearing stuck IsProcessing.
Solution
Add a 30s resume quiescence timeout for sessions that receive zero SDK events after restart. If no events flow within 30s of app start, the session is cleared as stuck.
Key design decisions (informed by 4-model consultation: Opus 4.6, Sonnet 4.6, Codex 5.3, GPT-5.1):
Timeout tiers (3-tier, was 2-tier):
Tests
Regression history context
This code has been through 7 PRs of fix/regression cycles (PRs #141→#147→#148→#153→#158→#163→#164). The most relevant precedent: PR #148 added a 10s resume timeout that killed active sessions. Our 30s timeout avoids this by being event-gated and seeded from UtcNow.
Fixes the 'click Stop on every restart' UX issue.