fix: detect dead connections in watchdog Case B via file size growth check#426

Merged
PureWeen merged 4 commits into main from fix/session-whatismultiagentdoing-what-is-ha-20260324-1928 on Mar 24, 2026

Conversation

@PureWeen (Owner)

Problem

Multi-agent worker sessions get stuck in IsProcessing=true for up to 30 minutes when the JSON-RPC connection is lost (ConnectionLostException). The watchdog Case B deferral logic uses events.jsonl modification time to detect if the CLI is still active, but the 1800s multi-agent freshness window means a file modified before the connection died still appears "fresh" for 30 minutes.

Observed symptoms

  • Sessions show "Thinking..." or spinner stuck for 20+ minutes
  • Debug log: [WATCHDOG] Case B deferred — events.jsonl modified since turn start, session still active (elapsed=120s, totalProcessing=1455s, deferral=12/40, freshness=1800s [multi-agent])
  • crash.log shows ConnectionLostException: The JSON-RPC connection with the remote party was lost

Root Cause

When a ConnectionLostException kills the JSON-RPC connection:

  1. No more SDK events arrive → watchdog inactivity timer fires after 120s
  2. Case B checks events.jsonl freshness: file was written before connection died, passes age < 1800s
  3. Each deferral resets LastEventAtTicks, creating a 120s cycle
  4. Session stays stuck until file age exceeds 1800s (~30 min) or 40-deferral cap (~80 min)
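The stuck-session windows in the list above follow directly from the constants in the PR description. A quick arithmetic check (illustrative Python; the production code is C#):

```python
# Worked numbers for the stuck-session window described above
# (constants taken from the PR description).
FRESHNESS_WINDOW_S = 1800  # multi-agent freshness window
DEFERRAL_CYCLE_S = 120     # watchdog inactivity timer per Case B deferral
DEFERRAL_CAP = 40          # Case B deferral cap

print(FRESHNESS_WINDOW_S / 60)               # 30.0 -> minutes until the file ages out
print(DEFERRAL_CAP * DEFERRAL_CYCLE_S / 60)  # 80.0 -> minutes at the deferral cap
```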

Fix

Added a file size growth check to Case B deferral logic:

  • Track events.jsonl file size at each Case B check (WatchdogCaseBLastFileSize)
  • If the file hasn't grown across 2 consecutive checks (WatchdogCaseBMaxStaleChecks), the CLI is dead
  • Override the freshness signal and force-complete the session

This detects dead connections within ~4 minutes (2 × ~120s cycles) without reducing the 1800s freshness window — no regression for issue #365.
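A minimal model of the size-growth check (field and constant names mirror the PR — `WatchdogCaseBLastFileSize`, `WatchdogCaseBStaleCount`, `WatchdogCaseBMaxStaleChecks`; the actual implementation lives in C# in CopilotService.Events.cs, so this Python sketch is illustrative only):

```python
# Model of the Case B file-size growth check; the surrounding watchdog
# plumbing (timers, Interlocked access, force-complete path) is omitted.
WATCHDOG_CASE_B_MAX_STALE_CHECKS = 2

class CaseBState:
    def __init__(self) -> None:
        self.last_file_size = 0  # WatchdogCaseBLastFileSize
        self.stale_count = 0     # WatchdogCaseBStaleCount

def check_case_b(state: CaseBState, current_size: int) -> bool:
    """Return True when the CLI looks dead (events.jsonl stopped growing)."""
    if state.last_file_size > 0 and current_size == state.last_file_size:
        state.stale_count += 1
        if state.stale_count >= WATCHDOG_CASE_B_MAX_STALE_CHECKS:
            return True  # override the freshness signal and force-complete
    else:
        # First check (baseline) or the file grew: reset the stale counter.
        state.stale_count = 0
    state.last_file_size = current_size
    return False
```

Note that the first check only records a baseline, so detection fires on the third consecutive check of an unchanged file — a point the review below examines in detail.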

Changes

  • CopilotService.cs: Added WatchdogCaseBLastFileSize and WatchdogCaseBStaleCount fields to SessionState
  • CopilotService.Events.cs: Added WatchdogCaseBMaxStaleChecks constant, file size comparison in Case B, reset logic on event arrival and watchdog start
  • ProcessingWatchdogTests.cs: 8 new tests verifying the new behavior
  • StateChangeCoalescerTests.cs: Fixed flaky timing test

Testing

  • 2923 tests pass (0 failures)
  • Mac Catalyst build succeeds

PureWeen and others added 2 commits March 24, 2026 14:44
… size growth check

Multi-agent worker sessions were getting stuck in IsProcessing=true for up to
30 minutes when the JSON-RPC connection was lost (ConnectionLostException). The
watchdog Case B deferral logic checked events.jsonl modification time freshness,
but the 1800s multi-agent window meant a file modified before the connection
died still appeared "fresh" for 30 minutes.

Added a file size growth check: if events.jsonl has not grown across 2
consecutive Case B deferral cycles (~4 minutes), the CLI is no longer writing
events and the session is force-completed. This detects dead connections quickly
without reducing the freshness window (no regression for issue #365).

Changes:
- SessionState: added WatchdogCaseBLastFileSize and WatchdogCaseBStaleCount
  fields for tracking file size across deferral cycles
- CopilotService.Events.cs: added WatchdogCaseBMaxStaleChecks constant (2),
  file size comparison in Case B, reset logic on event arrival and watchdog start
- ProcessingWatchdogTests: 8 new tests verifying constant values, source code
  structure, field existence, and reset behavior
- StateChangeCoalescerTests: increased timing margins to fix flaky test

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Code review found that ForceCompleteProcessingAsync resets WatchdogCaseBResets
but not the new companion fields WatchdogCaseBLastFileSize and
WatchdogCaseBStaleCount. All three fields must travel together per the pattern
established in StartProcessingWatchdog and the event handler.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen (Owner, Author)

🔍 PR #426 — Code Review: Watchdog Case B Dead Connection Detection

1 commit eb8a6e22 | Files: CopilotService.cs, CopilotService.Events.cs, ProcessingWatchdogTests.cs, StateChangeCoalescerTests.cs | Tests: ✅ 2923/2923


Summary

PR fixes a real bug: when a ConnectionLostException kills the JSON-RPC connection, events.jsonl stops growing but its modification time stays "fresh" within the 1800s multi-agent window — keeping the session stuck for up to 30 minutes. The fix adds a file size growth check across consecutive Case B deferrals, declaring the connection dead if the file hasn't grown after WatchdogCaseBMaxStaleChecks=2 checks.


Findings

🟡 M1 — Timing comment is wrong: detection takes ~6 minutes, not ~4

ProcessingWatchdogTests.cs test comment and PR description both say:

"2 checks × ~120s cycle = ~240s (4 min)"

This is incorrect. Tracing the logic:

  1. Cycle 1 (prevSize=0): The prevSize > 0 guard is false → baseline established, no stale count
  2. Cycle 2 (file same size): staleCount incremented to 1 — 1 >= 2? No → still defers
  3. Cycle 3 (file same size): staleCount incremented to 2 — 2 >= 2? Yes → force complete

That's 3 × ~120s ≈ 6 minutes, not 4. The WatchdogCaseBMaxStaleChecks=2 constant means "2 consecutive stale checks after establishing a baseline," which takes 3 cycles total.

This doesn't affect correctness (6 min is still far better than 30 min), but the test comment and PR description should say "~3 cycles × ~120s ≈ 360s (~6 min)" to match the actual behavior. The Assert.InRange(WatchdogCaseBMaxStaleChecks, 1, 5) bounds test is still appropriate.
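The three-cycle trace can be verified with a small simulation (illustrative Python model of the C# logic; `MAX_STALE_CHECKS` corresponds to WatchdogCaseBMaxStaleChecks, and the ~120s cycle length comes from the debug log):

```python
# Simulate Case B checks against a file whose size is frozen (dead
# connection) and count the cycles until detection fires.
CYCLE_SECONDS = 120

def cycles_until_detection(max_stale_checks: int) -> int:
    prev_size, stale, cycles = 0, 0, 0
    frozen_size = 4096  # connection died; events.jsonl stops growing
    while True:
        cycles += 1
        if prev_size > 0 and frozen_size == prev_size:
            stale += 1
            if stale >= max_stale_checks:
                return cycles
        prev_size = frozen_size

print(cycles_until_detection(2))                  # 3  (1 baseline + 2 stale)
print(cycles_until_detection(2) * CYCLE_SECONDS)  # 360 -> ~6 minutes, not ~4
```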


🟢 N1 — SKILL.md documents the "why NOT file-staleness detection" reasoning, but this is file SIZE staleness (different)

.claude/skills/multi-agent-orchestration/SKILL.md (line ~578) documents:

"Why NOT mtime staleness detection: legitimate workers pause for 5+ minutes between tool rounds (model thinking), which would trigger false positives."

The PR uses file size staleness, not mtime staleness. These are different: during model "thinking" between tool calls, the CLI still writes AssistantMessageDeltaEvent entries to events.jsonl (file grows). Only a dead connection stops all writes. So the false-positive concern for mtime-based staleness does NOT apply to size-based staleness.

The SKILL.md note is now stale/misleading — it should clarify that size-based staleness (not mtime-based) is safe. A small SKILL.md update would prevent future agents from incorrectly citing this note to reject similar fixes.


🟢 N2 — No SKILL.md update for the new Case B behavior

The multi-agent-orchestration SKILL doc describes the 30-min stuck session scenario (line 557) but doesn't document the new file-size growth mitigation. Future agents reading the SKILL doc will see the bug scenario but not know it's been fixed. The "Long-Running Session Safety" section should note this.


What's Good

  • Core logic is correct: prevSize > 0 guard correctly skips the bootstrap cycle (avoiding false positives on first check). Interlocked.Read/Exchange/Increment used consistently. FileInfo reads both LastWriteTimeUtc and Length in one stat call (avoids TOCTOU).
  • Reset paths are complete: Both HandleSessionEvent (real event arrives) and StartProcessingWatchdog (new turn) reset WatchdogCaseBLastFileSize and WatchdogCaseBStaleCount alongside WatchdogCaseBResets.
  • False positive analysis: The fix only fires inside if (caseBEventsActive) — already requires file is fresh AND written after turn start. Adding size-stale check on top of a fresh file is the right layering.
  • StateChangeCoalescerTests fix: Increasing 300ms→500ms delays is appropriate for CI flakiness. Low risk.
  • 8 new tests: Cover constant value, source-code structure (field existence, reset points), and FileInfo usage. All 10 Case B tests pass, all 2923 total tests pass.
  • Thread safety: WatchdogCaseBLastFileSize (long) uses Interlocked.Read (required for atomic 64-bit reads). WatchdogCaseBStaleCount (int) uses Interlocked.Increment/Exchange. Correct.
  • caseBEventsActive=false path: Correctly falls through to CompleteResponse via InvokeOnUI with generation guard.

Verdict

Approve — the fix is correct and addresses a real 30-minute stuck-session regression. M1 is a documentation/comment inaccuracy (actual behavior is ~6 min detection, not 4) but doesn't affect correctness. N1/N2 are SKILL.md hygiene.

One optional follow-up: update the test comment and PR/commit body to say "~3 cycles × ~120s ≈ 6 min" and add a line to the SKILL.md clarifying that size-based staleness detection was added.

…PR review

Addresses PR #426 review findings:

M1: Detection takes ~6 min (3 cycles: 1 baseline + 2 stale checks), not ~4 min.
    Fixed comments in CopilotService.Events.cs, ProcessingWatchdogTests.cs,
    copilot-instructions.md, and both SKILL.md files.

N1: Clarified that file-SIZE staleness (safe) is different from mtime staleness
    (risky). During model thinking, the CLI writes AssistantMessageDeltaEvent
    entries so the file grows — only a dead connection stops writes. Updated
    multi-agent-orchestration SKILL.md to distinguish the two.

N2: Added INV-11b to processing-state-safety SKILL.md and INV-O16 to
    multi-agent-orchestration SKILL.md documenting the file-size-growth
    dead-connection detection mechanism.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen (Owner, Author)

🔍 PR #426 — Re-Review Round 2

2 new commits since Round 1 (4ce777f9, e0603000) | Tests: ✅ 2921/2923 (2 pre-existing flaky, pass in isolation)


Previous Findings Status

  • M1 (timing comment wrong — ~4 min, actually ~6 min): FIXED

    • CopilotService.Events.cs:1630 — constant XML doc now correctly says "3 watchdog cycles (~6 minutes) — 1 baseline cycle to record the initial file size, then 2 consecutive stale checks"
    • ProcessingWatchdogTests.cs:2913 — test comment updated to "3 cycles × ~120s = ~360s (6 min)"
    • .github/copilot-instructions.md — correctly says "~360s (3 cycles: 1 baseline + 2 stale checks)"
    • processing-state-safety/SKILL.md INV-11b — correctly says "~6 minutes (1 baseline cycle + 2 stale checks × ~120s each)"
  • N1 (SKILL.md "Why NOT mtime staleness" note misleading): FIXED

    • multi-agent-orchestration/SKILL.md:606 — the "Why NOT mtime staleness detection" note now has an explicit callout block distinguishing mtime-based staleness (risky) from file-size-growth (safe), with reference to INV-O16
  • N2 (SKILL.md didn't document the new mitigation): FIXED

    • multi-agent-orchestration/SKILL.md:489 — INV-O16 added with mechanism, reset points, and safety analysis
    • processing-state-safety/SKILL.md:195 — INV-11b added with fields table, logic, detection time, and reset sites

New Issue Found (commit e0603000)

🟢 Minor — Two stale "~240s" mentions remain in multi-agent-orchestration/SKILL.md

Commit e0603000 fixed most timing references but missed two in multi-agent-orchestration/SKILL.md:

  1. Line 515 (INV-O16 "Timeline" field):

    **Timeline for dead connection detection**: ~240s (2 × 120s Case B check interval)

  2. Line 962 (troubleshooting table):

    Dead connection — file-size-growth check should catch in ~240s (INV-O16)

Both should say ~360s (3 cycles: 1 baseline + 2 stale checks) to match INV-11b, copilot-instructions.md, and the commit message. Not a correctness issue — the code itself is correct — but creates inconsistency within the SKILL doc.


New Code (commit 4ce777f) — ForceCompleteProcessingAsync Reset

The commit correctly adds WatchdogCaseBLastFileSize and WatchdogCaseBStaleCount resets alongside WatchdogCaseBResets in ForceCompleteProcessingAsync (Organization.cs:1988-1989). Verified all three reset sites now consistently reset all three fields:

| Reset site | CaseBResets | LastFileSize | StaleCount |
|---|---|---|---|
| HandleSessionEvent (real event) | ✅ | ✅ | ✅ |
| StartProcessingWatchdog | ✅ | ✅ | ✅ |
| ForceCompleteProcessingAsync | ✅ | ✅ (new) | ✅ (new) |

No other reset sites exist — checked both Events.cs and Organization.cs.
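The "all three fields travel together" invariant can be modeled as a single reset helper (illustrative Python dataclass, not the actual SessionState; field names mirror the PR):

```python
from dataclasses import dataclass

@dataclass
class CaseBTracking:
    resets: int = 0           # WatchdogCaseBResets
    last_file_size: int = 0   # WatchdogCaseBLastFileSize
    stale_count: int = 0      # WatchdogCaseBStaleCount

def reset_case_b(t: CaseBTracking) -> None:
    """All three fields are cleared together at every reset site
    (HandleSessionEvent, StartProcessingWatchdog, ForceCompleteProcessingAsync)."""
    t.resets = 0
    t.last_file_size = 0
    t.stale_count = 0
```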


Tests

2921/2923 passed. The 2 failures (PurgeOldCaptures_KeepsOnlyMostRecent, FallbackTimer_NotCancelled_FiresAfterDelay) pass in isolation — pre-existing timing-sensitive flaky tests, not a regression from this PR.


Verdict

Approve — All Round 1 findings addressed. The two remaining ~240s mentions in the SKILL.md are cosmetic inconsistencies; the code is correct and the critical documentation (INV-O16, INV-11b, copilot-instructions.md) is accurate.

@PureWeen (Owner, Author)

🔍 PR Review Squad — Round 2 Re-Review

Previous Findings Status

  • M1 (timing comment "~4 minutes"): ⚠️ Partially fixed — corrected in xmldoc, .github/copilot-instructions.md, and processing-state-safety/SKILL.md, but two stale "~240s" references remain in multi-agent-orchestration/SKILL.md (lines 515 and 962). See F1 below.
  • N1 (SKILL.md stale): ✅ Fixed — Comprehensive INV-O16 (multi-agent skill) and INV-11b (processing-state-safety skill) additions with mechanism, timeline, and reset points documented.
  • N2 (no behavior test): ℹ️ Acknowledged / Still present — No behavioral test covering the 3-cycle detection path. Carried from Round 1. Acceptable given filesystem mocking difficulty.

New Commits Reviewed

  • 4ce777f9: Added WatchdogCaseBLastFileSize/WatchdogCaseBStaleCount resets to ForceCompleteProcessingAsync — correct defensive cleanup ✅
  • e0603000: Docs update — partially fixed timing comments, added INV-O16/INV-11b ✅ (see F1 for remaining gap)

Build & Tests

  • ✅ Build: 0 errors (msbuild warning about cache file is pre-existing)
  • ✅ All targeted tests pass

Findings

| # | Severity | Location | Description | Consensus |
|---|---|---|---|---|
| F1 | 🟡 MODERATE | multi-agent-orchestration/SKILL.md:515,962 | Two stale "~240s" references contradict the corrected "~360s" documented at line 621 of the same file, in processing-state-safety/SKILL.md, and in copilot-instructions.md. Line 515: "~240s (2 × 120s Case B check interval)". Line 962 troubleshooting table: "should catch in ~240s (INV-O16)". Correct value is ~360s (1 baseline + 2 stale × 120s). Operators debugging 30-min stuck sessions will be misled. | 4/5 |

Cleared Concerns

ForceCompleteProcessingAsync reset: Correct. The other reset sites already reset both fields, but this one closes a narrow race window where a lingering watchdog iteration could run after force-complete but before the CancellationToken is honored. ✅

False positive for LLM thinking phases: Not a real production bug. Case B only fires after 120s of no SDK events. If the model were actively streaming tokens, events arrive and Case B never fires. The 6-minute file-growth window is a strong dead-connection signal in the scenario where Case B fires. During active tool execution, the path uses the 600s timeout, not Case B. ✅

Verdict: ⚠️ Request changes (one nit)

Core logic is sound. The ForceCompleteProcessingAsync reset is a good defensive addition. Just fix the two remaining stale "~240s" in SKILL.md and this is ready to merge.

Lines 515 and 962 still said ~240s. Correct value is ~360s (3 cycles:
1 baseline + 2 stale checks × ~120s each).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen merged commit 58243f7 into main Mar 24, 2026
@PureWeen PureWeen deleted the fix/session-whatismultiagentdoing-what-is-ha-20260324-1928 branch March 24, 2026 20:40