fix: detect dead connections in watchdog Case B via file size growth check#426

Merged
PureWeen merged 4 commits into main from fix/session-whatismultiagentdoing-what-is-ha-20260324-1928 on Mar 24, 2026

Conversation

@PureWeen (Owner)

Problem

Multi-agent worker sessions get stuck in IsProcessing=true for up to 30 minutes when the JSON-RPC connection is lost (ConnectionLostException). The watchdog Case B deferral logic uses events.jsonl modification time to detect if the CLI is still active, but the 1800s multi-agent freshness window means a file modified before the connection died still appears "fresh" for 30 minutes.

Observed symptoms

  • Sessions show "Thinking..." or spinner stuck for 20+ minutes
  • Debug log: [WATCHDOG] Case B deferred — events.jsonl modified since turn start, session still active (elapsed=120s, totalProcessing=1455s, deferral=12/40, freshness=1800s [multi-agent])
  • crash.log shows ConnectionLostException: The JSON-RPC connection with the remote party was lost

Root Cause

When a ConnectionLostException kills the JSON-RPC connection:

  1. No more SDK events arrive → watchdog inactivity timer fires after 120s
  2. Case B checks events.jsonl freshness: file was written before connection died, passes age < 1800s
  3. Each deferral resets LastEventAtTicks, creating a 120s cycle
  4. Session stays stuck until file age exceeds 1800s (~30 min) or 40-deferral cap (~80 min)
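The stuck-session windows in the list above follow directly from the constants in the PR description. A quick arithmetic check (illustrative Python; the production code is C#):

```python
# Worked numbers for the stuck-session window described above
# (constants taken from the PR description).
FRESHNESS_WINDOW_S = 1800  # multi-agent freshness window
DEFERRAL_CYCLE_S = 120     # watchdog inactivity timer per Case B deferral
DEFERRAL_CAP = 40          # Case B deferral cap

print(FRESHNESS_WINDOW_S / 60)               # 30.0 -> minutes until the file ages out
print(DEFERRAL_CAP * DEFERRAL_CYCLE_S / 60)  # 80.0 -> minutes at the deferral cap
```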

Fix

Added a file size growth check to Case B deferral logic:

  • Track events.jsonl file size at each Case B check (WatchdogCaseBLastFileSize)
  • If the file hasn't grown across 2 consecutive checks (WatchdogCaseBMaxStaleChecks), the CLI is dead
  • Override the freshness signal and force-complete the session

This detects dead connections within ~4 minutes (2 × ~120s cycles) without reducing the 1800s freshness window — no regression for issue #365.
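A minimal model of the size-growth check (field and constant names mirror the PR — `WatchdogCaseBLastFileSize`, `WatchdogCaseBStaleCount`, `WatchdogCaseBMaxStaleChecks`; the actual implementation lives in C# in CopilotService.Events.cs, so this Python sketch is illustrative only):

```python
# Model of the Case B file-size growth check; the surrounding watchdog
# plumbing (timers, Interlocked access, force-complete path) is omitted.
WATCHDOG_CASE_B_MAX_STALE_CHECKS = 2

class CaseBState:
    def __init__(self) -> None:
        self.last_file_size = 0  # WatchdogCaseBLastFileSize
        self.stale_count = 0     # WatchdogCaseBStaleCount

def check_case_b(state: CaseBState, current_size: int) -> bool:
    """Return True when the CLI looks dead (events.jsonl stopped growing)."""
    if state.last_file_size > 0 and current_size == state.last_file_size:
        state.stale_count += 1
        if state.stale_count >= WATCHDOG_CASE_B_MAX_STALE_CHECKS:
            return True  # override the freshness signal and force-complete
    else:
        # First check (baseline) or the file grew: reset the stale counter.
        state.stale_count = 0
    state.last_file_size = current_size
    return False
```

Note that the first check only records a baseline, so detection fires on the third consecutive check of an unchanged file — a point the review below examines in detail.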

Changes

  • CopilotService.cs: Added WatchdogCaseBLastFileSize and WatchdogCaseBStaleCount fields to SessionState
  • CopilotService.Events.cs: Added WatchdogCaseBMaxStaleChecks constant, file size comparison in Case B, reset logic on event arrival and watchdog start
  • ProcessingWatchdogTests.cs: 8 new tests verifying the new behavior
  • StateChangeCoalescerTests.cs: Fixed flaky timing test

Testing

  • 2923 tests pass (0 failures)
  • Mac Catalyst build succeeds

PureWeen and others added 2 commits March 24, 2026 14:44
… size growth check

Multi-agent worker sessions were getting stuck in IsProcessing=true for up to
30 minutes when the JSON-RPC connection was lost (ConnectionLostException). The
watchdog Case B deferral logic checked events.jsonl modification time freshness,
but the 1800s multi-agent window meant a file modified before the connection
died still appeared "fresh" for 30 minutes.

Added a file size growth check: if events.jsonl has not grown across 2
consecutive Case B deferral cycles (~4 minutes), the CLI is no longer writing
events and the session is force-completed. This detects dead connections quickly
without reducing the freshness window (no regression for issue #365).

Changes:
- SessionState: added WatchdogCaseBLastFileSize and WatchdogCaseBStaleCount
  fields for tracking file size across deferral cycles
- CopilotService.Events.cs: added WatchdogCaseBMaxStaleChecks constant (2),
  file size comparison in Case B, reset logic on event arrival and watchdog start
- ProcessingWatchdogTests: 8 new tests verifying constant values, source code
  structure, field existence, and reset behavior
- StateChangeCoalescerTests: increased timing margins to fix flaky test

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Code review found that ForceCompleteProcessingAsync resets WatchdogCaseBResets
but not the new companion fields WatchdogCaseBLastFileSize and
WatchdogCaseBStaleCount. All three fields must travel together per the pattern
established in StartProcessingWatchdog and the event handler.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen (Owner, Author)

🔍 PR #426 — Code Review: Watchdog Case B Dead Connection Detection

1 commit eb8a6e22 | Files: CopilotService.cs, CopilotService.Events.cs, ProcessingWatchdogTests.cs, StateChangeCoalescerTests.cs | Tests: ✅ 2923/2923


Summary

PR fixes a real bug: when a ConnectionLostException kills the JSON-RPC connection, events.jsonl stops growing but its modification time stays "fresh" within the 1800s multi-agent window — keeping the session stuck for up to 30 minutes. The fix adds a file size growth check across consecutive Case B deferrals, declaring the connection dead if the file hasn't grown after WatchdogCaseBMaxStaleChecks=2 checks.


Findings

🟡 M1 — Timing comment is wrong: detection takes ~6 minutes, not ~4

ProcessingWatchdogTests.cs test comment and PR description both say:

"2 checks × ~120s cycle = ~240s (4 min)"

This is incorrect. Tracing the logic:

  1. Cycle 1 (prevSize=0): The prevSize > 0 guard is false → baseline established, no stale count
  2. Cycle 2 (file same size): staleCount incremented to 1 — 1 >= 2? No → still defers
  3. Cycle 3 (file same size): staleCount incremented to 2 — 2 >= 2? Yes → force complete

That's 3 × ~120s ≈ 6 minutes, not 4. The WatchdogCaseBMaxStaleChecks=2 constant means "2 consecutive stale checks after establishing a baseline," which takes 3 cycles total.

This doesn't affect correctness (6 min is still far better than 30 min), but the test comment and PR description should say "~3 cycles × ~120s ≈ 360s (~6 min)" to match the actual behavior. The Assert.InRange(WatchdogCaseBMaxStaleChecks, 1, 5) bounds test is still appropriate.
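The three-cycle trace can be verified with a small simulation (illustrative Python model of the C# logic; `MAX_STALE_CHECKS` corresponds to WatchdogCaseBMaxStaleChecks, and the ~120s cycle length comes from the debug log):

```python
# Simulate Case B checks against a file whose size is frozen (dead
# connection) and count the cycles until detection fires.
CYCLE_SECONDS = 120

def cycles_until_detection(max_stale_checks: int) -> int:
    prev_size, stale, cycles = 0, 0, 0
    frozen_size = 4096  # connection died; events.jsonl stops growing
    while True:
        cycles += 1
        if prev_size > 0 and frozen_size == prev_size:
            stale += 1
            if stale >= max_stale_checks:
                return cycles
        prev_size = frozen_size

print(cycles_until_detection(2))                  # 3  (1 baseline + 2 stale)
print(cycles_until_detection(2) * CYCLE_SECONDS)  # 360 -> ~6 minutes, not ~4
```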


🟢 N1 — SKILL.md documents the "why NOT file-staleness detection" reasoning, but this is file SIZE staleness (different)

.claude/skills/multi-agent-orchestration/SKILL.md (line ~578) documents:

"Why NOT mtime staleness detection: legitimate workers pause for 5+ minutes between tool rounds (model thinking), which would trigger false positives."

The PR uses file size staleness, not mtime staleness. These are different: during model "thinking" between tool calls, the CLI still writes AssistantMessageDeltaEvent entries to events.jsonl (file grows). Only a dead connection stops all writes. So the false-positive concern for mtime-based staleness does NOT apply to size-based staleness.

The SKILL.md note is now stale/misleading — it should clarify that size-based staleness (not mtime-based) is safe. A small SKILL.md update would prevent future agents from incorrectly citing this note to reject similar fixes.


🟢 N2 — No SKILL.md update for the new Case B behavior

The multi-agent-orchestration SKILL doc describes the 30-min stuck session scenario (line 557) but doesn't document the new file-size growth mitigation. Future agents reading the SKILL doc will see the bug scenario but not know it's been fixed. The "Long-Running Session Safety" section should note this.


What's Good

  • Core logic is correct: prevSize > 0 guard correctly skips the bootstrap cycle (avoiding false positives on first check). Interlocked.Read/Exchange/Increment used consistently. FileInfo reads both LastWriteTimeUtc and Length in one stat call (avoids TOCTOU).
  • Reset paths are complete: Both HandleSessionEvent (real event arrives) and StartProcessingWatchdog (new turn) reset WatchdogCaseBLastFileSize and WatchdogCaseBStaleCount alongside WatchdogCaseBResets.
  • False positive analysis: The fix only fires inside if (caseBEventsActive) — already requires file is fresh AND written after turn start. Adding size-stale check on top of a fresh file is the right layering.
  • StateChangeCoalescerTests fix: Increasing 300ms→500ms delays is appropriate for CI flakiness. Low risk.
  • 8 new tests: Cover constant value, source-code structure (field existence, reset points), and FileInfo usage. All 10 Case B tests pass, all 2923 total tests pass.
  • Thread safety: WatchdogCaseBLastFileSize (long) uses Interlocked.Read (required for atomic 64-bit reads). WatchdogCaseBStaleCount (int) uses Interlocked.Increment/Exchange. Correct.
  • caseBEventsActive=false path: Correctly falls through to CompleteResponse via InvokeOnUI with generation guard.

Verdict

Approve — the fix is correct and addresses a real 30-minute stuck-session regression. M1 is a documentation/comment inaccuracy (actual behavior is ~6 min detection, not 4) but doesn't affect correctness. N1/N2 are SKILL.md hygiene.

One optional follow-up: update the test comment and PR/commit body to say "~3 cycles × ~120s ≈ 6 min" and add a line to the SKILL.md clarifying that size-based staleness detection was added.

…PR review

Addresses PR #426 review findings:

M1: Detection takes ~6 min (3 cycles: 1 baseline + 2 stale checks), not ~4 min.
    Fixed comments in CopilotService.Events.cs, ProcessingWatchdogTests.cs,
    copilot-instructions.md, and both SKILL.md files.

N1: Clarified that file-SIZE staleness (safe) is different from mtime staleness
    (risky). During model thinking, the CLI writes AssistantMessageDeltaEvent
    entries so the file grows — only a dead connection stops writes. Updated
    multi-agent-orchestration SKILL.md to distinguish the two.

N2: Added INV-11b to processing-state-safety SKILL.md and INV-O16 to
    multi-agent-orchestration SKILL.md documenting the file-size-growth
    dead-connection detection mechanism.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen (Owner, Author)

🔍 PR #426 — Re-Review Round 2

2 new commits since Round 1 (4ce777f9, e0603000) | Tests: ✅ 2921/2923 (2 pre-existing flaky, pass in isolation)


Previous Findings Status

  • M1 (timing comment wrong — ~4 min, actually ~6 min): FIXED

    • CopilotService.Events.cs:1630 — constant XML doc now correctly says "3 watchdog cycles (~6 minutes) — 1 baseline cycle to record the initial file size, then 2 consecutive stale checks"
    • ProcessingWatchdogTests.cs:2913 — test comment updated to "3 cycles × ~120s = ~360s (6 min)"
    • .github/copilot-instructions.md — correctly says "~360s (3 cycles: 1 baseline + 2 stale checks)"
    • processing-state-safety/SKILL.md INV-11b — correctly says "~6 minutes (1 baseline cycle + 2 stale checks × ~120s each)"
  • N1 (SKILL.md "Why NOT mtime staleness" note misleading): FIXED

    • multi-agent-orchestration/SKILL.md:606 — the "Why NOT mtime staleness detection" note now has an explicit callout block distinguishing mtime-based staleness (risky) from file-size-growth (safe), with reference to INV-O16
  • N2 (SKILL.md didn't document the new mitigation): FIXED

    • multi-agent-orchestration/SKILL.md:489 — INV-O16 added with mechanism, reset points, and safety analysis
    • processing-state-safety/SKILL.md:195 — INV-11b added with fields table, logic, detection time, and reset sites

New Issue Found (commit e0603000)

🟢 Minor — Two stale "~240s" mentions remain in multi-agent-orchestration/SKILL.md

Commit e0603000 fixed most timing references but missed two in multi-agent-orchestration/SKILL.md:

  1. Line 515 (INV-O16 "Timeline" field):

    **Timeline for dead connection detection**: ~240s (2 × 120s Case B check interval)

  2. Line 962 (troubleshooting table):

    Dead connection — file-size-growth check should catch in ~240s (INV-O16)

Both should say ~360s (3 cycles: 1 baseline + 2 stale checks) to match INV-11b, copilot-instructions.md, and the commit message. Not a correctness issue — the code itself is correct — but creates inconsistency within the SKILL doc.


New Code (commit 4ce777f) — ForceCompleteProcessingAsync Reset

The commit correctly adds WatchdogCaseBLastFileSize and WatchdogCaseBStaleCount resets alongside WatchdogCaseBResets in ForceCompleteProcessingAsync (Organization.cs:1988-1989). Verified all three reset sites now consistently reset all three fields:

| Reset site | CaseBResets | LastFileSize | StaleCount |
|---|---|---|---|
| HandleSessionEvent (real event) | ✅ | ✅ | ✅ |
| StartProcessingWatchdog | ✅ | ✅ | ✅ |
| ForceCompleteProcessingAsync | ✅ | ✅ (new) | ✅ (new) |

No other reset sites exist — checked both Events.cs and Organization.cs.
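The "all three fields travel together" invariant can be modeled as a single reset helper (illustrative Python dataclass, not the actual SessionState; field names mirror the PR):

```python
from dataclasses import dataclass

@dataclass
class CaseBTracking:
    resets: int = 0           # WatchdogCaseBResets
    last_file_size: int = 0   # WatchdogCaseBLastFileSize
    stale_count: int = 0      # WatchdogCaseBStaleCount

def reset_case_b(t: CaseBTracking) -> None:
    """All three fields are cleared together at every reset site
    (HandleSessionEvent, StartProcessingWatchdog, ForceCompleteProcessingAsync)."""
    t.resets = 0
    t.last_file_size = 0
    t.stale_count = 0
```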


Tests

2921/2923 passed. The 2 failures (PurgeOldCaptures_KeepsOnlyMostRecent, FallbackTimer_NotCancelled_FiresAfterDelay) pass in isolation — pre-existing timing-sensitive flaky tests, not a regression from this PR.


Verdict

Approve — All Round 1 findings addressed. The two remaining ~240s mentions in the SKILL.md are cosmetic inconsistencies; the code is correct and the critical documentation (INV-O16, INV-11b, copilot-instructions.md) is accurate.

@PureWeen (Owner, Author)

🔍 PR Review Squad — Round 2 Re-Review

Previous Findings Status

  • M1 (timing comment "~4 minutes"): ⚠️ Partially fixed — corrected in xmldoc, .github/copilot-instructions.md, and processing-state-safety/SKILL.md, but two stale "~240s" references remain in multi-agent-orchestration/SKILL.md (lines 515 and 962). See F1 below.
  • N1 (SKILL.md stale): ✅ Fixed — Comprehensive INV-O16 (multi-agent skill) and INV-11b (processing-state-safety skill) additions with mechanism, timeline, and reset points documented.
  • N2 (no behavior test): ℹ️ Acknowledged / Still present — No behavioral test covering the 3-cycle detection path. Carried from Round 1. Acceptable given filesystem mocking difficulty.

New Commits Reviewed

  • 4ce777f9: Added WatchdogCaseBLastFileSize/WatchdogCaseBStaleCount resets to ForceCompleteProcessingAsync — correct defensive cleanup ✅
  • e0603000: Docs update — partially fixed timing comments, added INV-O16/INV-11b ✅ (see F1 for remaining gap)

Build & Tests

  • ✅ Build: 0 errors (msbuild warning about cache file is pre-existing)
  • ✅ All targeted tests pass

Findings

| # | Severity | Location | Description | Consensus |
|---|---|---|---|---|
| F1 | 🟡 MODERATE | multi-agent-orchestration/SKILL.md:515,962 | Two stale "~240s" references contradict the corrected "~360s" documented at line 621 of the same file, in processing-state-safety/SKILL.md, and in copilot-instructions.md. Line 515: "~240s (2 × 120s Case B check interval)". Line 962 troubleshooting table: "should catch in ~240s (INV-O16)". Correct value is ~360s (1 baseline + 2 stale × 120s). Operators debugging 30-min stuck sessions will be misled. | 4/5 |

Cleared Concerns

ForceCompleteProcessingAsync reset: Correct. The other reset sites already reset both fields, but this one closes a narrow race window where a lingering watchdog iteration could run after force-complete but before the CancellationToken is honored. ✅

False positive for LLM thinking phases: Not a real production bug. Case B only fires after 120s of no SDK events. If the model were actively streaming tokens, events arrive and Case B never fires. The 6-minute file-growth window is a strong dead-connection signal in the scenario where Case B fires. During active tool execution, the path uses the 600s timeout, not Case B. ✅

Verdict: ⚠️ Request changes (one nit)

Core logic is sound. The ForceCompleteProcessingAsync reset is a good defensive addition. Just fix the two remaining stale "~240s" in SKILL.md and this is ready to merge.

Lines 515 and 962 still said ~240s. Correct value is ~360s (3 cycles:
1 baseline + 2 stale checks × ~120s each).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen merged commit 58243f7 into main Mar 24, 2026
@PureWeen PureWeen deleted the fix/session-whatismultiagentdoing-what-is-ha-20260324-1928 branch March 24, 2026 20:40