fix: throttle sibling re-resume + faster dead-event-stream detection (#406) #409
…406)

Root cause (from event-diagnostics.log analysis):
- When SendAsync hits a connection error, PolyPilot recreates the client and fires 35-47 concurrent ResumeSessionAsync calls for all sibling sessions.
- This flood overwhelms the server's event delivery mechanism — the primary session's SendAsync succeeds but its event stream is silently dead.
- The watchdog then takes 120s to detect that no events arrived, then kills and retries.
- ci-agentic hit this pattern twice on 2026-03-18: a 1173s watchdog timeout, then a reconnect retry that also went silent until the user had to prod it.

Fixes:
1. Throttle sibling re-resume to 3 concurrent ResumeSessionAsync calls (SemaphoreSlim), with parallel Task dispatch instead of sequential. This prevents flooding the server while still resuming all siblings.
2. Add an IsReconnectedSend flag on SessionState. Set on the reconnect path before StartProcessingWatchdog; cleared when the first real SDK event arrives.
3. Add WatchdogReconnectInactivityTimeoutSeconds = 35s. Reconnected sessions with IsReconnectedSend=true use this shorter timeout instead of 120s, so a dead event stream after reconnect is detected and retried in ~35s rather than 2 full minutes.
4. Re-snapshot EventsFileSizeAtSend after reconnect so the watchdog's Case D (dead-send detection) uses a fresh baseline, not the stale value from the failed primary send.
5. Reset WatchdogAbortAttempted=false on reconnect so Case D can fire if the retried send is also a dead send.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
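The throttled parallel dispatch described in fix 1 might look roughly like the following sketch. The class, method name `ResumeAllAsync`, and the delegate parameter are illustrative stand-ins, not PolyPilot's actual API:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class SiblingResume
{
    // All siblings are dispatched up front, but the semaphore caps
    // in-flight resumes at 3, giving back-pressure without full serialization.
    public static async Task ResumeAllAsync(
        IEnumerable<string> siblingIds,
        Func<string, CancellationToken, Task> resumeSessionAsync, // stand-in for ResumeSessionAsync
        CancellationToken ct)
    {
        using var gate = new SemaphoreSlim(3, 3); // at most 3 concurrent resumes
        var tasks = siblingIds.Select(async id =>
        {
            await gate.WaitAsync(ct);
            try { await resumeSessionAsync(id, ct); }
            finally { gate.Release(); } // released only after a successful WaitAsync
        }).ToList();
        await Task.WhenAll(tasks);
    }
}
```

The key design choice is that every sibling still gets resumed; only the in-flight count is bounded, so the server never sees the 35-47 call burst at once.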
…x semaphore over-release + add tests (#406)

Address Challenger feedback on PR #409:

1. IsReconnectedSend lifecycle fix (bug): the flag was only cleared when a real SDK event arrived. If a dead event stream → watchdog fires at 35s → user sends a new message, the next turn would wrongly use the 35s timeout instead of the normal 120s, causing cascading false timeouts.
   - CopilotService.cs SendPromptAsync: clear alongside HasUsedToolsThisTurn
   - CopilotService.Events.cs CompleteResponse: clear as defense-in-depth
2. Semaphore over-release fix: WaitAsync(cancellationToken) can throw OperationCanceledException before acquiring the slot. The finally block was unconditionally calling Release(), causing SemaphoreFullException. Fixed with an 'acquired = false' guard, set to true only after WaitAsync succeeds. Also: OperationCanceledException is now handled separately (no zombie marking needed — the sibling simply wasn't resumed).
3. Tests in ProcessingWatchdogTests.cs:
   - WatchdogReconnectTimeout_IsWithinValidRange: 30s < 35s < 120s
   - WatchdogTimeoutSelection_ReconnectedSend_NoTools_UsesReconnectTimeout
   - WatchdogTimeoutSelection_ReconnectedSend_WithActiveTool_UsesToolTimeout
   - WatchdogTimeoutSelection_ReconnectedSend_Resumed_NoEvents_UsesQuiescenceTimeout
   - WatchdogTimeoutSelection_NotReconnectedSend_UsesInactivityTimeout
   - IsReconnectedSend_IsDeclaredVolatile (reflection check)
   - Updated the ComputeEffectiveTimeout helper to include an isReconnectedSend param

All 2800 tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
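The over-release guard described in item 2 can be sketched in isolation. The class and member names below are illustrative stand-ins, not PolyPilot's actual code:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class SiblingResumeGate
{
    public readonly SemaphoreSlim Gate = new SemaphoreSlim(3, 3);

    public async Task ResumeOneAsync(Func<Task> resumeSibling, CancellationToken ct)
    {
        var acquired = false; // guard: only Release() a slot that was actually taken
        try
        {
            await Gate.WaitAsync(ct); // can throw OperationCanceledException *before* acquiring
            acquired = true;
            await resumeSibling();
        }
        catch (OperationCanceledException)
        {
            // Cancelled: the sibling simply wasn't resumed; no zombie marking needed.
        }
        finally
        {
            if (acquired)
                Gate.Release(); // an unconditional Release() here is what caused SemaphoreFullException
        }
    }
}
```

If `WaitAsync` is cancelled before a slot is granted, `acquired` stays false and the `finally` block skips `Release()`, so the semaphore count can never exceed its maximum.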
PR #409 Review — Throttle sibling re-resume + faster dead-event-stream detection

Production Logic: ✅ Correct

All three fixes are well-designed and thread-safe.
Finding F1 — 🟡 MODERATE (4/5 models)
The test helper's `ComputeEffectiveTimeout` diverges from production:

```csharp
// Test helper (line 901) — WRONG
var useToolTimeout = hasActiveTool || (isResumed && !useResumeQuiescence) || hasUsedTools;

// Production — correct
var useToolTimeout = hasActiveTool || (state.Info.IsResumed && !useResumeQuiescence);
```

Consequences: in the helper, `|| hasUsedTools` routes the used-tools case into the tool timeout, so the `useUsedToolsTimeout` tier is dead code and the tests assert the wrong timeout for the `hasUsedTools=true, hasActiveTool=false` case.

The divergence is pre-existing, but this PR's new tests extend it.

Fix:

```csharp
// Remove || hasUsedTools from useToolTimeout in ComputeEffectiveTimeout
var useToolTimeout = hasActiveTool || (isResumed && !useResumeQuiescence);
var useUsedToolsTimeout = !useToolTimeout && hasUsedTools && !hasActiveTool;
var useReconnectTimeout = isReconnectedSend && !useToolTimeout && !useUsedToolsTimeout && !useResumeQuiescence;
return useResumeQuiescence
    ? CopilotService.WatchdogResumeQuiescenceTimeoutSeconds
    : useToolTimeout
        ? CopilotService.WatchdogToolExecutionTimeoutSeconds
        : useUsedToolsTimeout
            ? CopilotService.WatchdogUsedToolsIdleTimeoutSeconds
            : useReconnectTimeout
                ? CopilotService.WatchdogReconnectInactivityTimeoutSeconds
                : CopilotService.WatchdogInactivityTimeoutSeconds;
```

And update the affected test assertions.

Test Coverage Assessment

The new reconnect timeout tests adequately cover the happy path.
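As a sanity check, the proposed selection formula can be exercised standalone. The constants below mirror values stated elsewhere in this thread (600s tool, 180s used-tools, 35s reconnect, 120s inactivity); the quiescence value is not stated here, so 300 is only a placeholder:

```csharp
static class WatchdogTimeouts
{
    const int Quiescence = 300; // placeholder: value not stated in this thread
    const int Tool       = 600;
    const int UsedTools  = 180;
    const int Reconnect  = 35;
    const int Inactivity = 120;

    // Mirrors the corrected formula: quiescence → tool → usedTools → reconnect → inactivity.
    public static int Compute(bool hasActiveTool, bool hasUsedTools,
                              bool isResumed, bool useResumeQuiescence,
                              bool isReconnectedSend)
    {
        var useToolTimeout      = hasActiveTool || (isResumed && !useResumeQuiescence);
        var useUsedToolsTimeout = !useToolTimeout && hasUsedTools && !hasActiveTool;
        var useReconnectTimeout = isReconnectedSend && !useToolTimeout
                                  && !useUsedToolsTimeout && !useResumeQuiescence;
        return useResumeQuiescence ? Quiescence
             : useToolTimeout      ? Tool
             : useUsedToolsTimeout ? UsedTools
             : useReconnectTimeout ? Reconnect
             : Inactivity;
    }
}
```

With this version, `Compute(false, true, false, false, true)` yields 180, matching the later test that verifies `hasUsedTools` takes priority over `IsReconnectedSend`.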
Verdict: ❌ Request changes — F1 (test helper divergence) should be fixed before merge.
… used-tools tier (#406)

The test helper ComputeEffectiveTimeout had `|| hasUsedTools` in the useToolTimeout branch, making the WatchdogUsedToolsIdleTimeoutSeconds (180s) tier unreachable. This caused tests to assert 600s for the hasUsedTools=true, hasActiveTool=false case, diverging from production, which correctly routes through useUsedToolsTimeout → 180s.

Changes:
- Fix ComputeEffectiveTimeout to mirror the production formula exactly
- Rename WatchdogTimeoutSelection_HasUsedTools_UsesToolTimeout → _UsesUsedToolsTimeout; assert 180, not 600
- Update Invariant_INV5 to assert WatchdogUsedToolsIdleTimeoutSeconds (180)
- Rename Regression_PR148_ToolLoops; assert 180, not 600
- Update Scenario_LongAgentTask to assert 180, not 600
- Update ResumeQuiescence_NotResumed to assert 180 for the hasUsedTools case
- Update ExhaustiveMatrix InlineData: hasUsedTools=T, isResumed=F → 180
- Add WatchdogTimeoutSelection_ReconnectedSend_WithUsedTools_UsesUsedToolsTimeout, verifying hasUsedTools takes priority over IsReconnectedSend (180 over 35)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PR #409 Re-Review — Round 2 (Post-Fix)

Commit reviewed:

Previous Findings Status
Fix Verification

The author's fix commit correctly addresses F1.

Before (buggy):

```csharp
var useToolTimeout = hasActiveTool || (isResumed && !useResumeQuiescence) || hasUsedTools;
// return chain: quiescence → tool → inactivity (useUsedToolsTimeout was dead code)
```

After (fixed):

```csharp
var useToolTimeout = hasActiveTool || (isResumed && !useResumeQuiescence);
var useUsedToolsTimeout = !useToolTimeout && hasUsedTools && !hasActiveTool;
var useReconnectTimeout = isReconnectedSend && !useToolTimeout && !useUsedToolsTimeout && !useResumeQuiescence;
// return chain: quiescence → tool → usedTools (180s) → reconnect (35s) → inactivity (120s)
```

All affected tests were updated.
New Findings (Consensus Filter: 2+ models required)

None. One model (Sonnet) raised a potential concern, but it fell below the 2-model threshold.

Production Logic Confirmed Clean (5/5 models)
Test Coverage

✅ The new reconnect timeout tier is covered. No coverage gaps introduced by this PR.

Verdict: ✅ Approve

The single blocking finding from Round 1 is definitively fixed. All three features (the sibling re-resume throttle, the IsReconnectedSend fast-fail timeout, and the EventsFileSizeAtSend re-snapshot) are sound.

Co-reviewed by: Copilot <223556219+Copilot@users.noreply.github.com>
PR #409 Round 3 Re-Review

New commit reviewed:

Previous Findings Status
New Commit Analysis

Two new `IsReconnectedSend = false` resets were added:

- Line 801 — SessionErrorEvent termination path: an SDK error event terminates the turn; the flag is now cleared here.
- Line 2311 — Watchdog Case C kill: the watchdog times out and kills the session; the flag is now cleared here.

These two paths were the real risks.

Coverage Assessment: Remaining Uncovered Paths (Consensus: Safe)

One model (Gemini, 1/5) flagged that 9 other paths still leave the flag set — a 1/5 model consensus, below the 2-model threshold, so it is not included in the findings.

Verdict: ✅ Approve

The new commit adds correct, well-placed INV-1 hygiene in the two paths that actually needed it. F1 remains fixed. 163/163 watchdog tests pass. All 4 commits are clean. Ready to merge.

Co-reviewed by: Copilot <223556219+Copilot@users.noreply.github.com>
AbortSessionAsync cleared HasUsedToolsThisTurn but not IsReconnectedSend, leaving a stale flag if the user aborts a reconnected turn then sends again. The watchdog would incorrectly use the 35s reconnect timeout for the new turn. Add IsReconnectedSend = false alongside HasUsedToolsThisTurn in AbortSessionAsync, completing INV-1 compliance across all IsProcessing=false termination paths. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
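Taken together, the clear sites accumulated across these commits give the flag a lifecycle that can be condensed into a sketch. The wrapper methods below are hypothetical; the comments name the real call sites from the PR:

```csharp
class SessionState
{
    // Declared volatile in the PR (checked by IsReconnectedSend_IsDeclaredVolatile).
    public volatile bool IsReconnectedSend;

    public void OnReconnectSend()    => IsReconnectedSend = true;  // reconnect path, before StartProcessingWatchdog
    public void OnFirstSdkEvent()    => IsReconnectedSend = false; // HandleSessionEvent: first real SDK event
    public void OnNewTurn()          => IsReconnectedSend = false; // SendPromptAsync: start of a new user turn
    public void OnResponseComplete() => IsReconnectedSend = false; // CompleteResponse: defense-in-depth
    public void OnAbort()            => IsReconnectedSend = false; // AbortSessionAsync: INV-1 hygiene

    // 35s reconnect fast-fail vs the normal 120s inactivity timeout.
    public int EffectiveInactivityTimeoutSeconds() => IsReconnectedSend ? 35 : 120;
}
```

The invariant the reviews keep checking is simply that no path can leave `IsReconnectedSend=true` into a turn that was not itself a reconnect retry.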
PR #409 Round 4 Re-Review

New commit reviewed:

Previous Findings Status
New Commit Analysis (5/5 models: no issues)

Single-line addition to AbortSessionAsync:

```csharp
state.HasUsedToolsThisTurn = false;
state.IsReconnectedSend = false; // INV-1: clear all per-turn flags on abort ← NEW
Interlocked.Exchange(ref state.SuccessfulToolCountThisTurn, 0);
```

Placement: Correct — grouped logically with the adjacent per-turn flag resets.

Is this a real bug fix or defense-in-depth? Per 5/5 model consensus: defense-in-depth. Two existing mechanisms already prevented the "35s timeout on next turn" scenario described in the commit message: SendPromptAsync clears the flag at the start of every new turn, and CompleteResponse clears it when a response completes.
The commit message slightly overclaims, but the addition is correct, harmless, and improves consistency with INV-1 hygiene across all termination paths.

IsReconnectedSend Coverage After This Commit (Complete)
Remaining uncovered paths (SendAsync errors, steer error, permission recovery, watchdog crash, ForceCompleteProcessing) are all safe — the flag is cleared again on the next SendPromptAsync.

Verdict: ✅ Approve

5/5 models confirmed clean. The addition is correct and consistent. All prior findings remain fixed. 163/163 watchdog tests pass. Ready to merge.

Co-reviewed by: Copilot <223556219+Copilot@users.noreply.github.com>
Problem
When a session hits a connection error and reconnects, PolyPilot fires 35-47 concurrent `ResumeSessionAsync` calls for all sibling sessions. This floods the server's event delivery mechanism — `SendAsync` on the primary session succeeds, but the event stream goes silently dead. The 120s watchdog eventually kills and retries, but the retry can also be affected.

Evidence from `~/.polypilot/event-diagnostics.log` (ci-agentic session, 2026-03-18): the log shows a `SessionIdleEvent`, then "awaiting events" with nothing further until `IsProcessing=false` (watchdog timeout after 120s). The user had to prod it manually.
Root Cause
The sibling re-resume `Task.Run` launched all 47 sessions concurrently with no throttling. The resulting flood of `ResumeSessionAsync` calls likely exhausted the server's per-client connection limits or I/O queues, silently breaking event delivery for the session that just reconnected.

Fixes
1. Throttle sibling re-resume concurrency (primary fix)
Replace the sequential loop (which was also calling `await` sequentially — actually serialized, but with no back-pressure) with a proper parallel dispatch bounded by `SemaphoreSlim(3, 3)`. At most 3 siblings resume at a time. The primary session's event handler is already registered before the `Task.Run` starts.

2. Faster dead-event-stream detection
Add an `IsReconnectedSend` flag on `SessionState`. On the reconnect path, set it `true` before `StartProcessingWatchdog`. When the first real SDK event arrives (`HandleSessionEvent`), clear it. In the watchdog, use a new `WatchdogReconnectInactivityTimeoutSeconds = 35` (vs 120s) when `IsReconnectedSend=true` and no tools are involved. This means a dead event stream after reconnect is detected and retried in ~35s, not 2 minutes.

3. Re-snapshot `EventsFileSizeAtSend` on reconnect

The stale snapshot from the failed primary send is no longer valid after reconnect. A fresh snapshot enables the watchdog's Case D (dead-send detection) on the retried send. Also reset `WatchdogAbortAttempted=false` so Case D can fire on the retry.

Test Results