Conversation
… size growth check

Multi-agent worker sessions were getting stuck in `IsProcessing=true` for up to 30 minutes when the JSON-RPC connection was lost (`ConnectionLostException`). The watchdog Case B deferral logic checked `events.jsonl` modification-time freshness, but the 1800s multi-agent window meant a file modified before the connection died still appeared "fresh" for 30 minutes.

Added a file size growth check: if `events.jsonl` has not grown across 2 consecutive Case B deferral cycles (~4 minutes), the CLI is no longer writing events and the session is force-completed. This detects dead connections quickly without reducing the freshness window (no regression for issue #365).

Changes:
- `SessionState`: added `WatchdogCaseBLastFileSize` and `WatchdogCaseBStaleCount` fields for tracking file size across deferral cycles
- `CopilotService.Events.cs`: added `WatchdogCaseBMaxStaleChecks` constant (2), file size comparison in Case B, reset logic on event arrival and watchdog start
- `ProcessingWatchdogTests`: 8 new tests verifying constant values, source code structure, field existence, and reset behavior
- `StateChangeCoalescerTests`: increased timing margins to fix a flaky test

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
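The commit's size-growth check can be modeled in miniature. The codebase is C#; the following is a hedged Python sketch of the logic as described in the commit (field and constant names are transliterated from `SessionState`/`CopilotService.Events.cs`; the actual implementation may differ):

```python
WATCHDOG_CASE_B_MAX_STALE_CHECKS = 2  # mirrors WatchdogCaseBMaxStaleChecks


class SessionState:
    def __init__(self):
        self.last_file_size = -1   # mirrors WatchdogCaseBLastFileSize
        self.stale_count = 0       # mirrors WatchdogCaseBStaleCount


def case_b_should_force_complete(state, current_size):
    """Called once per Case B deferral cycle (~120s apart).

    Returns True once events.jsonl has failed to grow for
    WATCHDOG_CASE_B_MAX_STALE_CHECKS consecutive cycles."""
    if current_size > state.last_file_size:
        # File grew: the CLI is still writing — reset staleness tracking.
        state.last_file_size = current_size
        state.stale_count = 0
        return False
    state.stale_count += 1
    return state.stale_count >= WATCHDOG_CASE_B_MAX_STALE_CHECKS


# Dead connection: file frozen at 1000 bytes across three cycles.
state = SessionState()
print(case_b_should_force_complete(state, 1000))  # cycle 1: baseline -> False
print(case_b_should_force_complete(state, 1000))  # cycle 2: stale 1  -> False
print(case_b_should_force_complete(state, 1000))  # cycle 3: stale 2  -> True
```

Note that the first cycle only records the baseline, so three cycles elapse before force-completion — a detail the later review rounds pick up on.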
Code review found that `ForceCompleteProcessingAsync` resets `WatchdogCaseBResets` but not the new companion fields `WatchdogCaseBLastFileSize` and `WatchdogCaseBStaleCount`. All three fields must travel together per the pattern established in `StartProcessingWatchdog` and the event handler. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
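The "all three fields travel together" requirement is the kind of invariant a single reset helper enforces. A hypothetical Python sketch (the real code is C#, and the helper name is invented for illustration):

```python
class SessionState:
    """Minimal stand-in for the C# SessionState's Case B tracking fields."""
    def __init__(self):
        self.case_b_resets = 0          # mirrors WatchdogCaseBResets
        self.case_b_last_file_size = -1  # mirrors WatchdogCaseBLastFileSize
        self.case_b_stale_count = 0      # mirrors WatchdogCaseBStaleCount


def reset_case_b_tracking(state):
    """One reset site used from force-complete, watchdog start, and event
    arrival, so the three companion fields can never drift apart."""
    state.case_b_resets = 0
    state.case_b_last_file_size = -1
    state.case_b_stale_count = 0


state = SessionState()
state.case_b_resets = 5
state.case_b_last_file_size = 4096
state.case_b_stale_count = 1
reset_case_b_tracking(state)  # all three cleared together
assert (state.case_b_resets,
        state.case_b_last_file_size,
        state.case_b_stale_count) == (0, -1, 0)
```

Funneling every reset through one function is what makes a review finding like this impossible to reintroduce.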
🔍 PR #426 — Code Review: Watchdog Case B Dead Connection Detection (1 commit)

Summary

The PR fixes a real bug: when the JSON-RPC connection is lost, worker sessions stay stuck in `IsProcessing=true` for up to 30 minutes because `events.jsonl` still looks fresh under the 1800s mtime window.

Findings

🟡 M1 — Timing comment is wrong: detection takes ~6 minutes, not ~4
This is incorrect. Tracing the logic:

- Cycle 1 (~120s): Case B records the baseline `events.jsonl` size — no stale check yet
- Cycle 2 (~240s): size unchanged → stale count 1
- Cycle 3 (~360s): size unchanged → stale count 2 → force-complete

That's 3 × ~120s ≈ 6 minutes, not 4. This doesn't affect correctness (6 min is still far better than 30 min), but the test comment and PR description should say "~3 cycles × ~120s ≈ 360s (~6 min)" to match the actual behavior.

🟢 N1 — SKILL.md documents the "why NOT file-staleness detection" reasoning, but this is file SIZE staleness (different)
The PR uses file size staleness, not mtime staleness. These are different: during model "thinking" between tool calls, the CLI still writes `AssistantMessageDeltaEvent` entries, so the file keeps growing — only a dead connection stops writes entirely. The SKILL.md note is now stale/misleading — it should clarify that size-based staleness (not mtime-based) is safe. A small SKILL.md update would prevent future agents from incorrectly citing this note to reject similar fixes.

🟢 N2 — No SKILL.md update for the new Case B behavior

The multi-agent-orchestration SKILL doc describes the 30-min stuck-session scenario (line 557) but doesn't document the new file-size growth mitigation. Future agents reading the SKILL doc will see the bug scenario but not know it's been fixed. The "Long-Running Session Safety" section should note this.

What's Good
Verdict

✅ Approve — the fix is correct and addresses a real 30-minute stuck-session regression. M1 is a documentation/comment inaccuracy (actual behavior is ~6 min detection, not ~4) but doesn't affect correctness. N1/N2 are SKILL.md hygiene. One optional follow-up: update the test comment and PR/commit body to say "~3 cycles × ~120s ≈ 6 min" and add a line to the SKILL.md clarifying that size-based staleness detection was added.
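The corrected M1 arithmetic is easy to sanity-check (the ~120s cycle spacing is approximate, and the constants are taken from the review's description):

```python
CYCLE_SECONDS = 120        # Case B evaluates every ~120s of SDK silence
BASELINE_CYCLES = 1        # first cycle only records the baseline file size
MAX_STALE_CHECKS = 2       # WatchdogCaseBMaxStaleChecks

cycles = BASELINE_CYCLES + MAX_STALE_CHECKS
seconds = cycles * CYCLE_SECONDS
print(f"{cycles} cycles x {CYCLE_SECONDS}s = {seconds}s (~{seconds // 60} min)")
# 3 cycles x 120s = 360s (~6 min)
```

The "~4 minutes" figure in the original commit counted only the two stale checks and forgot the baseline cycle.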
…PR review

Addresses PR #426 review findings:

M1: Detection takes ~6 min (3 cycles: 1 baseline + 2 stale checks), not ~4 min. Fixed comments in CopilotService.Events.cs, ProcessingWatchdogTests.cs, copilot-instructions.md, and both SKILL.md files.

N1: Clarified that file-SIZE staleness (safe) is different from mtime staleness (risky). During model thinking, the CLI writes AssistantMessageDeltaEvent entries so the file grows — only a dead connection stops writes. Updated multi-agent-orchestration SKILL.md to distinguish the two.

N2: Added INV-11b to processing-state-safety SKILL.md and INV-O16 to multi-agent-orchestration SKILL.md documenting the file-size-growth dead-connection detection mechanism.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
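The N1 distinction (size staleness vs mtime staleness) can be illustrated with a toy model; the timestamps and byte counts below are invented, only the shape of the argument matters:

```python
FRESHNESS_WINDOW = 1800  # seconds; the multi-agent mtime freshness window


def fresh_by_mtime(mtime, now):
    """mtime signal: file looks alive while its age is under the window."""
    return (now - mtime) < FRESHNESS_WINDOW


def grew(prev_size, cur_size):
    """size signal: file is alive only if bytes were actually appended."""
    return cur_size > prev_size


# While the model "thinks" between tool calls, delta events keep appending,
# so the size signal correctly reads "alive" even without tool activity.
assert grew(prev_size=100, cur_size=250)

# After a dead connection, writes stop. The frozen mtime still passes the
# 1800s freshness test for ~30 minutes, but the size signal flips at once.
last_mtime = 130
assert fresh_by_mtime(last_mtime, now=1000)    # mtime: "fresh" (misleading)
assert not grew(prev_size=250, cur_size=250)   # size:  "stale" (correct)
```

This is why the older SKILL.md warning against "file-staleness detection" (which was about mtime) does not apply to the size-growth check.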
🔍 PR #426 — Re-Review Round 2 (2 new commits since Round 1)

Previous Findings Status
New Issue Found (commit 4ce777f)

🟢 Minor — Two stale "~240s" mentions remain

Both should say ~360s (3 cycles: 1 baseline + 2 stale checks × ~120s each).

New Code (commit 4ce777f) — ForceCompleteProcessingAsync Reset

The commit correctly adds resets of `WatchdogCaseBLastFileSize` and `WatchdogCaseBStaleCount` alongside `WatchdogCaseBResets` in `ForceCompleteProcessingAsync`. No other reset sites exist — checked both `StartProcessingWatchdog` and the event-arrival handler.

Tests

2921/2923 passed. The 2 failures (…) do not block approval.

Verdict

✅ Approve — All Round 1 findings addressed. The two remaining "~240s" mentions are a minor doc cleanup.
🔍 PR Review Squad — Round 2 Re-Review

Previous Findings Status
New Commits Reviewed
Build & Tests
Findings
Cleared Concerns

ForceCompleteProcessingAsync reset: Correct. `StartProcessingWatchdog` already resets both fields, but the explicit reset closes a narrow race window where a lingering watchdog iteration could run after force-complete but before the CancellationToken is honored. ✅

False positive for LLM thinking phases: Not a real production bug. Case B only fires after 120s of no SDK events. If the model were actively streaming tokens, events arrive and Case B never fires. The 6-minute file-growth window is a strong dead-connection signal in the scenario where Case B fires. During active tool execution, the path uses the 600s timeout, not Case B. ✅

One-Line Fix

Verdict:
Lines 515 and 962 still said ~240s. Correct value is ~360s (3 cycles: 1 baseline + 2 stale checks × ~120s each). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Problem
Multi-agent worker sessions get stuck in `IsProcessing=true` for up to 30 minutes when the JSON-RPC connection is lost (`ConnectionLostException`). The watchdog Case B deferral logic uses `events.jsonl` modification time to detect whether the CLI is still active, but the 1800s multi-agent freshness window means a file modified before the connection died still appears "fresh" for 30 minutes.

Observed symptoms
- `[WATCHDOG] Case B deferred — events.jsonl modified since turn start, session still active (elapsed=120s, totalProcessing=1455s, deferral=12/40, freshness=1800s [multi-agent])`
- `ConnectionLostException: The JSON-RPC connection with the remote party was lost`
When a `ConnectionLostException` kills the JSON-RPC connection:

- The CLI stops writing to `events.jsonl`, so the file never grows again
- The file's last modification predates the failure, so `age < 1800s` still holds and Case B keeps treating it as fresh
- Each deferral refreshes `LastEventAtTicks`, creating a 120s cycle that repeats until the 1800s window finally expires
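A back-of-the-envelope simulation of the old mtime-only rule shows how long that loop persists (timestamps below are illustrative, not taken from real logs):

```python
FRESHNESS_WINDOW = 1800  # seconds; multi-agent freshness window
CYCLE = 120              # Case B re-checks every ~120s


def defers(file_mtime, now):
    """Old Case B rule: keep deferring while events.jsonl looks fresh."""
    return (now - file_mtime) < FRESHNESS_WINDOW


mtime = 100   # last write to events.jsonl, just before the connection died
checks = range(250, 2500, CYCLE)   # watchdog cycles after the failure
deferred = [t for t in checks if defers(mtime, t)]
print(len(deferred))   # 14 consecutive deferrals — roughly half an hour stuck
```

The session only unblocks once the file's age crosses 1800s, which is exactly the "up to 30 minutes" stall the PR fixes.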
Added a file size growth check to the Case B deferral logic:

- On each Case B deferral, record the current `events.jsonl` size (`WatchdogCaseBLastFileSize`)
- If the size has not grown since the previous cycle, increment `WatchdogCaseBStaleCount`
- If the stale count reaches 2 (`WatchdogCaseBMaxStaleChecks`), the CLI is dead and the session is force-completed

This detects dead connections within ~6 minutes (3 cycles: 1 baseline + 2 stale checks, each ~120s) without reducing the 1800s freshness window — no regression for issue #365.
Changes
- `CopilotService.cs`: Added `WatchdogCaseBLastFileSize` and `WatchdogCaseBStaleCount` fields to `SessionState`
- `CopilotService.Events.cs`: Added `WatchdogCaseBMaxStaleChecks` constant, file size comparison in Case B, reset logic on event arrival and watchdog start
- `ProcessingWatchdogTests.cs`: 8 new tests verifying the new behavior
- `StateChangeCoalescerTests.cs`: Fixed a flaky timing test

Testing