Skip to content

Fix false 'session stuck' watchdog warning and response content loss on restart#224

Merged
PureWeen merged 5 commits intomainfrom
fix/session-session-20260225-205601-why-did-20260226-1434
Feb 26, 2026
Merged

Fix false 'session stuck' watchdog warning and response content loss on restart#224
PureWeen merged 5 commits intomainfrom
fix/session-session-20260225-205601-why-did-20260226-1434

Conversation

@PureWeen
Copy link
Copy Markdown
Owner

Problem

Two bugs when PolyPilot restarts while sessions are actively processing:

1. False 'Session appears stuck' warning (30s quiescence timeout)

When the app restarts and resumes a session that was mid-turn, the 30s quiescence timeout fires even though the session was genuinely still active on the server. The quiescence timeout is designed for sessions that had already finished before the restart — but if the SDK takes >30s to reconnect and start streaming events, the watchdog fires incorrectly.

2. Response content lost between turn_end and session.idle

When the app restarts after assistant.turn_end but before session.idle, accumulated response text in CurrentResponse is never flushed to history/DB. This caused the ReviewPRs session to lose a 6123-character PR review — the review was generated (visible in events.jsonl) but never displayed because CompleteResponse (which flushes content) is only triggered by SessionIdleEvent.

Fix

Quiescence bypass for recently-active sessions

  • Add GetEventsFileRestoreHints() to check events.jsonl freshness during restore
  • If file was modified within 120s (WatchdogInactivityTimeoutSeconds), pre-seed HasReceivedEventsSinceResume=true to bypass the 30s quiescence
  • If last event was a tool event, also set HasUsedToolsThisTurn=true for 600s timeout
  • Stale files (>120s) still use the original 30s quiescence behavior

Flush response on turn_end

  • Call FlushCurrentResponse(state) in the AssistantTurnEndEvent handler
  • This persists accumulated content to history/DB at the end of each sub-turn
  • If CurrentResponse is empty (tool-only sub-turns), it's a no-op

Tests

  • 10 new tests in ProcessingWatchdogTests: RestoreHints_* tests for file freshness, tool detection, stale files, and integration with timeout selection
  • 3 new tests in ResponseFlushTests: TurnEndFlush_* tests for content preservation, empty response handling, and deduplication
  • All 1412 tests pass

PureWeen and others added 5 commits February 26, 2026 08:51
…on restart

Two bugs fixed:

1. False quiescence timeout on resumed sessions: When the app restarts and
   resumes a session that was mid-turn, the 30s quiescence timeout could fire
   incorrectly even though the session was genuinely still active on the server.
   Fix: During restore, check events.jsonl freshness. If recently modified
   (< 120s), pre-seed HasReceivedEventsSinceResume to bypass the 30s
   quiescence and use the appropriate longer timeout (120s or 600s).

2. Response content lost between turn_end and session.idle: When the app
   restarts after assistant.turn_end but before session.idle, accumulated
   response text in CurrentResponse was never flushed to history/DB.
   Fix: Call FlushCurrentResponse on AssistantTurnEndEvent so content is
   persisted at the end of each sub-turn, not just on idle.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Code review findings:
- Add dedup guard in FlushCurrentResponse: if last assistant message in
  History has identical content, skip the add to prevent duplicates when
  SDK replays events after session resume.
- Add malformed JSON test for GetEventsFileRestoreHints
- Add FlushCurrentResponse idempotency test
- Add content dedup model-level test

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- SKILL.md: Add Content Persistence Safety section (turn_end flush,
  dedup guard, quiescence bypass, zero-idle sessions), update PR count
  to 8, add missing-flush to recurring mistakes list
- copilot-instructions.md: Add 30s quiescence tier with bypass to
  watchdog section, document turn_end flush in IsProcessing invariant
  and SDK event flow sections

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
SKILL.md changes:
- Remove checklist item 10 (internal behavior, not caller requirement)
- Remove inaccurate FlushCurrentResponse call sites table
- Remove quiescence bypass and zero-idle sections (moved to instructions)
- Fix PR count to 7 (not 8, PR #224 is pending)
- Keep turn-end flush and dedup guard sections (directly relevant)

copilot-instructions.md changes:
- Move FlushCurrentResponse paragraph to new Content Persistence section
  (was incorrectly under IsProcessing Cleanup Invariant)
- Integrate SDK Event Flow sub-turn note into item 8 text
- Remove GetEventsFileRestoreHints function name (too specific)
- Add zero-idle session context to Processing Watchdog section
- Fix PR count references

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1. Move FlushCurrentResponse inside Invoke() lambda in AssistantTurnEndEvent
   handler to avoid racing with UI thread on List<ChatMessage> (Finding #1)

2. Preserve isRecentlyActive across catch block in GetEventsFileRestoreHints
   — malformed JSON in a recently-written events.jsonl no longer loses the
   file-age signal that bypasses the 30s quiescence timeout (Finding #2)

3. Add Debug log when dedup guard fires in FlushCurrentResponse so
   legitimate content drops are observable in diagnostics (Finding #3)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen merged commit d1c569b into main Feb 26, 2026
@PureWeen PureWeen deleted the fix/session-session-20260225-205601-why-did-20260226-1434 branch February 26, 2026 16:14
PureWeen added a commit that referenced this pull request Feb 26, 2026
Three bugs fixed:

1. Sessions stuck forever when SDK sends repeated SessionUsageInfoEvent
   (e.g., FailedDelegation) without terminal events. Root cause:
   HandleSessionEvent unconditionally updated LastEventAtTicks for ALL
   events, including metrics-only events that don't indicate progress.
   Fix: Skip LastEventAtTicks update for SessionUsageInfoEvent and
   AssistantUsageEvent.

2. Added WatchdogMaxProcessingTimeSeconds (3600s) as absolute safety
   net. Even if progress events keep arriving, no turn should run for
   60 minutes without user notification. Uses ProcessingStartedAt
   (set in SendPromptAsync) so it cannot be reset by events.

3. False 'session stuck' warnings after app restart. Root cause:
   GetEventsFileRestoreHints used WatchdogInactivityTimeoutSeconds
   (120s) as file age threshold, but tool executions can go 5-10
   minutes without events.jsonl writes. Fix: Use
   WatchdogToolExecutionTimeoutSeconds (600s) threshold.

Added ~25 regression guard tests covering every known failure mode
from PRs #148, #163, #195, #211, #224, plus invariant tests for
the 8 processing state safety invariants.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen added a commit that referenced this pull request Feb 26, 2026
Three bugs fixed:

1. Sessions stuck forever when SDK sends repeated SessionUsageInfoEvent
   (e.g., FailedDelegation) without terminal events. Root cause:
   HandleSessionEvent unconditionally updated LastEventAtTicks for ALL
   events, including metrics-only events that don't indicate progress.
   Fix: Skip LastEventAtTicks update for SessionUsageInfoEvent and
   AssistantUsageEvent.

2. Added WatchdogMaxProcessingTimeSeconds (3600s) as absolute safety
   net. Even if progress events keep arriving, no turn should run for
   60 minutes without user notification. Uses ProcessingStartedAt
   (set in SendPromptAsync) so it cannot be reset by events.

3. False 'session stuck' warnings after app restart. Root cause:
   GetEventsFileRestoreHints used WatchdogInactivityTimeoutSeconds
   (120s) as file age threshold, but tool executions can go 5-10
   minutes without events.jsonl writes. Fix: Use
   WatchdogToolExecutionTimeoutSeconds (600s) threshold.

Added ~25 regression guard tests covering every known failure mode
from PRs #148, #163, #195, #211, #224, plus invariant tests for
the 8 processing state safety invariants.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen added a commit that referenced this pull request Feb 26, 2026
## Problem
Three bugs in the processing watchdog:

1. **Session stuck at 361 minutes** — MergeNET11 session had
`IsProcessing=true` for 361m. The SDK was sending repeated
`SessionUsageInfoEvent` (FailedDelegation) which reset
`LastEventAtTicks` without ever sending a terminal event. The watchdog
timer kept resetting and never fired.

2. **No max processing time** — Even with fix #1, there was no absolute
ceiling. A session could run forever as long as progress events arrived.

3. **False 'stuck' warnings** — After app restart, sessions got `⚠️
Session appears stuck — no events received for over 30 seconds` even
though they were actively working. `GetEventsFileRestoreHints` used 120s
threshold, but tool executions can go 5-10 minutes without events.jsonl
writes.

## Fix
1. **Gate `LastEventAtTicks` update** — Skip for `SessionUsageInfoEvent`
and `AssistantUsageEvent` (metrics-only events that don't indicate turn
progress).
2. **Add `WatchdogMaxProcessingTimeSeconds`** (3600s = 60 min) —
Absolute safety net using `ProcessingStartedAt`, which cannot be reset
by events.
3. **Fix restore hints threshold** — Changed from
`WatchdogInactivityTimeoutSeconds` (120s) to
`WatchdogToolExecutionTimeoutSeconds` (600s).

## Testing
- ~25 new regression guard tests covering every known failure mode from
PRs #148, #163, #195, #211, #224
- Invariant tests for INV-1, INV-4, INV-5, INV-6
- End-to-end scenario tests
- All 1447 tests pass

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant