Skip to content

Fix stuck Thinking after restart: 30s quiescence timeout#211

Merged
PureWeen merged 4 commits intomainfrom
fix/watchdog-quiescence
Feb 25, 2026
Merged

Fix stuck Thinking after restart: 30s quiescence timeout#211
PureWeen merged 4 commits intomainfrom
fix/watchdog-quiescence

Conversation

@PureWeen
Copy link
Copy Markdown
Owner

Problem

After app restart, resumed sessions that were mid-turn show Thinking... with a Stop button. The user must manually click Stop every time. The existing watchdog waited 600s (10 min!) before clearing stuck IsProcessing.

Solution

Add a 30s resume quiescence timeout for sessions that receive zero SDK events after restart. If no events flow within 30s of app start, the session is cleared as stuck.

Key design decisions (informed by 4-model consultation: Opus 4.6, Sonnet 4.6, Codex 5.3, GPT-5.1):

  1. 30s quiescence — short enough users don't wait, long enough for SDK reconnect (~5s typical, 6x safety margin)
  2. Event-gated — only fires when \HasReceivedEventsSinceResume == false. Once events start flowing, transitions to normal 120s/600s timeout tiers
  3. Seed from DateTime.UtcNow, NOT file time — all 3 models independently flagged that seeding from events.jsonl would cause immediate kills for sessions >15s old (exact PR Fix stuck sessions: watchdog recovery, model labels, and thread safety #148 regression pattern)
  4. Reuses existing watchdog fire path — no new IsProcessing cleanup code, all 8 invariants preserved

Timeout tiers (3-tier, was 2-tier):

Condition Timeout
Resumed, zero events since restart 30s (NEW)
Normal processing, no tools 120s
Active tools / resumed with events / multi-agent 600s

Tests

  • 16 new regression guard tests covering quiescence edge cases, seed time safety, exhaustive timeout matrix
  • Updated existing tests to use \ComputeEffectiveTimeout\ helper mirroring production 3-tier formula
  • 108 total watchdog+recovery tests pass

Regression history context

This code has been through 7 PRs of fix/regression cycles (PRs #141#147#148#153#158#163#164). The most relevant precedent: PR #148 added a 10s resume timeout that killed active sessions. Our 30s timeout avoids this by being event-gated and seeded from UtcNow.

Fixes the 'click Stop on every restart' UX issue.

PureWeen and others added 4 commits February 25, 2026 13:08
…sessions

When the app restarts, sessions that were mid-turn show 'Thinking...' and
the user has to click Stop. The existing watchdog waits 600s (10 min) for
resumed sessions, which is far too long.

Changes:
- Add WatchdogResumeQuiescenceTimeoutSeconds (30s) for resumed sessions
  that receive ZERO SDK events since restart
- Seed watchdog LastEventAtTicks from events.jsonl file write time instead
  of DateTime.UtcNow, so elapsed time reflects actual idle time
- If events.jsonl was written 5+ min ago, the watchdog fires on its first
  15s check cycle (effectively instant)
- If events start flowing (turn is genuinely active), HasReceivedEventsSinceResume
  goes true and the normal 120s/600s timeout tiers apply

Respects all 8 processing-state-safety invariants. No new code paths for
clearing IsProcessing — reuses existing watchdog fire path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The WatchdogTimeoutSelection tests embedded the OLD two-tier formula
(hasActiveTool || isResumed || hasUsedTools) instead of the new
three-tier logic with useResumeQuiescence. They passed because they
were self-contained (testing their own local formula, not calling
real code), giving false coverage.

Changes:
- Extract ComputeEffectiveTimeout helper that mirrors production logic
- Update 4 existing tests to use the new formula
- Add 4 new tests: resumed+events (600s), resumed+activeTool (600s),
  multi-agent (600s), multi-agent+resumed+no-events (30s)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Three models (Opus 4.6, Sonnet 4.6, Codex 5.3) independently identified a
critical bug: seeding LastEventAtTicks from events.jsonl file write time
caused the 30s quiescence timeout to fire on the FIRST 15s watchdog check
for any session where the file was >15s old. This is exactly the PR #148
regression (killing active sessions with a too-short timeout).

Fix: seed from DateTime.UtcNow (not file time) so the 30s quiescence
measures from app restart, not from last SDK event before crash. This gives
the SDK a full 30s to reconnect and start streaming events.

Added 16 new regression guard tests:
- ResumeQuiescenceTimeout_IsReasonable (20-120s range, >= 2x check interval)
- ResumeQuiescenceTimeout_IsLessThanInactivityTimeout
- ResumeQuiescence_OnlyTriggersWhenResumedAndNoEvents (exhaustive combos)
- ResumeQuiescence_NotResumed_NeverTriggersQuiescence
- ResumeQuiescence_TransitionsToToolTimeout_WhenEventsArrive
- ResumeQuiescence_TransitionsToInactivity_AfterIsResumedCleared
- WatchdogTimeoutSelection_ExhaustiveMatrix (10 Theory rows)
- SeedTime_MustNotCauseImmediateKill_RegressionGuard (PR #148 guard)
- SeedTime_FromStaleFile_WouldCauseImmediateKill_DocumentsRisk
- QuiescenceTimeout_EscapesOnFirstEvent
- QuiescenceTimeout_DoesNotAffect_NormalSendPromptPath
- WatchdogResumeQuiescence_Constant_MatchesExpectedValue
- AllThreeTimeoutTiers_AreDistinct

Also updated existing tests to use ComputeEffectiveTimeout helper mirroring
the production 3-tier formula, preventing test-vs-production logic drift.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ime param

Three-model review (Opus 4.6, Sonnet 4.6, Codex 5.3) found:

1. Error message shows '0 minute(s)' for 30s quiescence timeout due to
   integer division (30/60=0). Fixed: format as seconds when < 60s.

2. StartProcessingWatchdog had a seedTime parameter with a comment
   describing the OPPOSITE of correct usage (said 'seed from file time'
   when file-time seeding causes PR #148 regression). Removed the dead
   parameter entirely and replaced comment with a warning.

3. Stale comment at resume site said 'no separate short timeout needed'
   which contradicts the 30s quiescence this PR adds. Updated to list
   all three timeout tiers.

Added 5 Theory tests for timeout message formatting (30s/59s/60s/120s/600s).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen marked this pull request as ready for review February 25, 2026 20:44
@PureWeen PureWeen merged commit 6be33ca into main Feb 25, 2026
@PureWeen PureWeen deleted the fix/watchdog-quiescence branch February 25, 2026 20:44
PureWeen added a commit that referenced this pull request Feb 26, 2026
Three bugs fixed:

1. Sessions stuck forever when SDK sends repeated SessionUsageInfoEvent
   (e.g., FailedDelegation) without terminal events. Root cause:
   HandleSessionEvent unconditionally updated LastEventAtTicks for ALL
   events, including metrics-only events that don't indicate progress.
   Fix: Skip LastEventAtTicks update for SessionUsageInfoEvent and
   AssistantUsageEvent.

2. Added WatchdogMaxProcessingTimeSeconds (3600s) as absolute safety
   net. Even if progress events keep arriving, no turn should run for
   60 minutes without user notification. Uses ProcessingStartedAt
   (set in SendPromptAsync) so it cannot be reset by events.

3. False 'session stuck' warnings after app restart. Root cause:
   GetEventsFileRestoreHints used WatchdogInactivityTimeoutSeconds
   (120s) as file age threshold, but tool executions can go 5-10
   minutes without events.jsonl writes. Fix: Use
   WatchdogToolExecutionTimeoutSeconds (600s) threshold.

Added ~25 regression guard tests covering every known failure mode
from PRs #148, #163, #195, #211, #224, plus invariant tests for
the 8 processing state safety invariants.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen added a commit that referenced this pull request Feb 26, 2026
Three bugs fixed:

1. Sessions stuck forever when SDK sends repeated SessionUsageInfoEvent
   (e.g., FailedDelegation) without terminal events. Root cause:
   HandleSessionEvent unconditionally updated LastEventAtTicks for ALL
   events, including metrics-only events that don't indicate progress.
   Fix: Skip LastEventAtTicks update for SessionUsageInfoEvent and
   AssistantUsageEvent.

2. Added WatchdogMaxProcessingTimeSeconds (3600s) as absolute safety
   net. Even if progress events keep arriving, no turn should run for
   60 minutes without user notification. Uses ProcessingStartedAt
   (set in SendPromptAsync) so it cannot be reset by events.

3. False 'session stuck' warnings after app restart. Root cause:
   GetEventsFileRestoreHints used WatchdogInactivityTimeoutSeconds
   (120s) as file age threshold, but tool executions can go 5-10
   minutes without events.jsonl writes. Fix: Use
   WatchdogToolExecutionTimeoutSeconds (600s) threshold.

Added ~25 regression guard tests covering every known failure mode
from PRs #148, #163, #195, #211, #224, plus invariant tests for
the 8 processing state safety invariants.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen added a commit that referenced this pull request Feb 26, 2026
## Problem
Three bugs in the processing watchdog:

1. **Session stuck at 361 minutes** — MergeNET11 session had
`IsProcessing=true` for 361m. The SDK was sending repeated
`SessionUsageInfoEvent` (FailedDelegation) which reset
`LastEventAtTicks` without ever sending a terminal event. The watchdog
timer kept resetting and never fired.

2. **No max processing time** — Even with fix #1, there was no absolute
ceiling. A session could run forever as long as progress events arrived.

3. **False 'stuck' warnings** — After app restart, sessions got `⚠️
Session appears stuck — no events received for over 30 seconds` even
though they were actively working. `GetEventsFileRestoreHints` used 120s
threshold, but tool executions can go 5-10 minutes without events.jsonl
writes.

## Fix
1. **Gate `LastEventAtTicks` update** — Skip for `SessionUsageInfoEvent`
and `AssistantUsageEvent` (metrics-only events that don't indicate turn
progress).
2. **Add `WatchdogMaxProcessingTimeSeconds`** (3600s = 60 min) —
Absolute safety net using `ProcessingStartedAt`, which cannot be reset
by events.
3. **Fix restore hints threshold** — Changed from
`WatchdogInactivityTimeoutSeconds` (120s) to
`WatchdogToolExecutionTimeoutSeconds` (600s).

## Testing
- ~25 new regression guard tests covering every known failure mode from
PRs #148, #163, #195, #211, #224
- Invariant tests for INV-1, INV-4, INV-5, INV-6
- End-to-end scenario tests
- All 1447 tests pass

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant