fix: abort interrupted tools on session resume + Settings cleanup #393
The CLI Source setting was moved to the Developer group in b94d0f4 but the nav button and empty section shell were left behind, creating a menu item that scrolled to nothing. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
After a crash mid-tool-execution, the SDK session gets stuck waiting for tool results that will never arrive. New SendAsync calls are silently queued/ignored, causing the session to appear permanently stuck.

Four fixes:

1. Detect interrupted tools on resume (HasInterruptedToolExecution): Scans events.jsonl for unmatched tool.execution_start events before session.shutdown or end-of-file. Handles both graceful shutdown and force-kill (SIGKILL/Stop-Process) scenarios.
2. Send AbortAsync on resume when interrupted tools are detected: In EnsureSessionConnectedAsync, after ResumeSessionAsync, check for interrupted tools and abort to clear the SDK's pending tool state. This allows subsequent SendAsync calls to work immediately.
3. Fix INV-16 violation in EnsureSessionConnectedAsync: Move .On() callback registration BEFORE setting state.Session, matching the safe pattern in sibling reconnect and worker revival.
4. Watchdog Case D (dead-send detection): If events.jsonl hasn't grown 30s after SendAsync, try AbortAsync as recovery. Safety net for cases the resume-abort doesn't catch.

Also fixes:

- relaunch.ps1 PowerShell 5.1 compatibility (bracket parsing, Join-Path)
- Update MauiDevFlow NuGet packages to v0.23.1
- Add [RESUME-ABORT] to diagnostic log filter

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
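The ordering in fixes 2 and 3 can be sketched as follows. This is a minimal Python illustration, not the project's C# code: `FakeSession`, `SessionState`, and `ensure_session_connected` are hypothetical stand-ins, and placing the abort after the session is published is an assumption, since the commit message only says the abort happens after the resume.

```python
class FakeSession:
    """Hypothetical stand-in for an SDK session; records calls in order."""

    def __init__(self):
        self.calls = []

    def on(self, handler):
        self.calls.append("on")

    def abort(self):
        self.calls.append("abort")


class SessionState:
    """Hypothetical holder for the currently published session."""

    def __init__(self):
        self.session = None


def ensure_session_connected(state, resume_session, has_interrupted_tools, on_event):
    session = resume_session()

    # Register the event handler BEFORE publishing the session, so no
    # event can arrive while the session is visible but unhandled.
    session.on(on_event)
    state.session = session

    # If the event log shows a tool that never completed, abort once to
    # clear the pending tool state before the user's next send.
    if has_interrupted_tools():
        session.abort()
    return session


if __name__ == "__main__":
    state = SessionState()
    session = ensure_session_connected(
        state, FakeSession, lambda: True, lambda event: None
    )
    print(session.calls)
```

Passing the detection check in as a callable keeps the sketch independent of how the event log is actually scanned.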
SQLite's CloseAsync is fire-and-forget via ObserveClose — on Windows the file handle isn't released before File.Delete runs. Add GC.Collect + retry loop with brief delay to handle the async file release. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
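The collect-then-retry pattern described in this commit can be sketched in Python as below. The helper name `delete_with_retry`, the attempt count, and the delay are illustrative assumptions; the actual test fix is C# and its values are not shown on this page.

```python
import gc
import os
import time


def delete_with_retry(path, attempts=5, delay_s=0.1):
    """Delete `path`, retrying while another owner still holds the file.

    Returns True once the file is gone, False if every attempt failed.
    `attempts` and `delay_s` are illustrative values, not the project's.
    """
    for attempt in range(attempts):
        # Run pending finalizers so lingering handles get released.
        gc.collect()
        try:
            os.remove(path)
            return True
        except FileNotFoundError:
            return True  # already gone counts as success
        except PermissionError:
            # On Windows, a still-open handle surfaces as PermissionError.
            if attempt < attempts - 1:
                time.sleep(delay_s)
    return False
```

A bounded retry keeps the test deterministic: it either succeeds within a short window or fails clearly, instead of depending on when the asynchronous close happens to finish.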
Multi-Model Consensus Review (4/5 models)

CI: PASS 2668/2669 (1 pre-existing flaky)

CRITICAL (4/4 models)
HasInterruptedToolExecution false positive (CopilotService.Utilities.cs:199-216)
The backwards scan is inverted for paired events. When reading from end-of-file backwards, tool.execution_complete is seen first with pendingToolStarts==0 (decrement skipped), then tool.execution_start increments to 1. A matched start→complete pair returns true = interrupted. Verified: zero-idle session with matched tools → pending=1, result=TRUE (false positive). Every session completing tools via the zero-idle SDK path will trigger a spurious AbortAsync on the next lazy-resume. Fix: invert the scan (increment on complete, decrement on start).

MODERATE (4/4 models)
Watchdog Case D fires on healthy slow sessions (CopilotService.Events.cs:1975-2010)
Fires after 30s with no events.jsonl growth. This matches any session where the LLM/tool takes >30s before the first event. EventsFileSizeAtSend > 0 for all real sessions. Case D is also placed BEFORE the hasActiveTool check (Case A), so it fires before the correct long-tool-run handler.

MINOR (2/4 models)
HasInterruptedToolExecution loads the entire file (Utilities.cs:164-169): it only needs the last 30 lines.

Missing tests
HasInterruptedToolExecution has a testable overload but no tests. Need:
Verdict: Request Changes

Settings.razor cleanup (commit 1) is correct. The interrupted-tool detection has a critical false positive that will abort healthy sessions. Please fix the backwards scan algorithm and add tests before merging.
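The false positive from the CRITICAL finding can be reproduced with a small Python sketch. It assumes each events.jsonl line is a JSON object with a `type` field, which this page does not confirm, and `naive_backward_scan` is a hypothetical reconstruction of the described logic, not the reviewed C# code.

```python
import json


def naive_backward_scan(lines):
    """Buggy variant: counts starts and decrements on completes.

    Walking backwards, a complete is always seen before its start, so
    the guarded decrement never runs for a matched pair.
    """
    pending_tool_starts = 0
    for line in reversed(lines):
        event_type = json.loads(line).get("type")  # assumed field name
        if event_type == "tool.execution_start":
            pending_tool_starts += 1
        elif event_type == "tool.execution_complete":
            if pending_tool_starts > 0:  # skipped: counter is still 0
                pending_tool_starts -= 1
    return pending_tool_starts > 0


if __name__ == "__main__":
    matched_pair = [
        '{"type": "tool.execution_start"}',
        '{"type": "tool.execution_complete"}',
    ]
    # A healthy, fully matched pair is reported as interrupted.
    print(naive_backward_scan(matched_pair))  # True
```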
… perf

1. CRITICAL: HasInterruptedToolExecution backwards scan was inverted. When scanning events.jsonl in reverse, tool.execution_complete is seen before its matching tool.execution_start. The old code incremented on start and decremented on complete, but complete was seen first with pendingToolStarts==0, so the decrement was skipped; start then incremented to 1, causing every session that ran tools to be falsely flagged as interrupted. Fixed: use a pendingCompletions counter (increment on complete, decrement on start). Unmatched starts are tracked separately.
2. MODERATE: Watchdog Case D moved after Case A (active tools) and Case B (lost terminal event). Added !hasActiveTool guard so dead-send detection never fires on sessions with actively running tools.
3. MINOR: Replaced full file load with a streaming Queue<string>(31) ring buffer that keeps only the last 30 lines in memory.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
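A minimal Python sketch of the corrected reverse scan combined with a bounded tail buffer is shown below. It is an illustration under assumptions: the `type` field name is assumed, `collections.deque(maxlen=...)` stands in for the C# `Queue<string>` ring buffer, and boundary handling such as stopping at session.shutdown is omitted.

```python
import json
from collections import deque

TAIL_LINES = 30  # window size taken from the commit message


def has_interrupted_tool_execution(lines, tail_lines=TAIL_LINES):
    """Return True if the tail of the log holds an unmatched tool start.

    `lines` is any iterable of JSONL strings, e.g. an open file object.
    """
    # Streaming ring buffer: only the last `tail_lines` lines are kept.
    tail = deque(lines, maxlen=tail_lines)

    pending_completions = 0
    for line in reversed(tail):
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a torn final line after a force-kill
        if not isinstance(event, dict):
            continue
        event_type = event.get("type")  # assumed field name
        if event_type == "tool.execution_complete":
            pending_completions += 1
        elif event_type == "tool.execution_start":
            if pending_completions > 0:
                pending_completions -= 1  # start matched by a later complete
            else:
                return True  # start with no completion after it
    return False
```

Counting completions rather than starts works because, in reverse order, every healthy start has already been preceded by its completion.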
Round 2 Re-Review: PR #393

5-model parallel review (claude-opus-4.6 ×2, claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex). Consensus filter: 2+ models must agree.

CI:

Previous Findings Status
Fix Verification

F1 (
F2 (
F3 (

New Logic Reviewed (INV-16, abort-on-resume, dead-send detection)

INV-16 fix (
Abort on resume (
Dead-send detection (

No Consensus Issues Found

No new bugs reach the 2-model consensus threshold.

Informational (below consensus threshold)
Verdict: ✅ Approve

All three Round 1 findings are correctly resolved. No consensus issues. Logic is sound.
## Summary

Two fixes in this PR:

### 1. Dead Event Stream Recovery (abort interrupted tools on resume)

**Root cause:** When a session crashes mid-tool-execution (e.g., force-kill via `relaunch.ps1`), the SDK resumes the session but remains stuck waiting for `tool.execution_complete` results that will never arrive. All subsequent `SendAsync` calls are silently queued/ignored: zero events written to disk, zero callbacks fired. The session appears permanently stuck.

**Evidence pattern in events.jsonl:**

```
tool.execution_start   ← last real event before crash
session.resume         ← resume, zero new events after sends
abort                  ← THIS unlocks the session
user.message           ← now everything works normally
```

**Fixes:**

- **Abort on resume** (`CopilotService.Persistence.cs`): After `ResumeSessionAsync`, scan `events.jsonl` for unmatched `tool.execution_start` events. If found, send `AbortAsync` to clear the SDK's pending tool state before the user's message.
- **HasInterruptedToolExecution helper** (`CopilotService.Utilities.cs`): Streams the last 30 lines of `events.jsonl` via a ring buffer. Scans backwards with correct reverse-order semantics (`pendingCompletions` counter). Handles both graceful shutdown and force-kill scenarios.
- **INV-16 fix** (`CopilotService.Persistence.cs`): Moved `.On()` callback registration BEFORE `state.Session` assignment, closing a race window where events arrive with no handler registered.
- **Watchdog Case D** (`CopilotService.Events.cs`): Safety net. If `events.jsonl` hasn't grown 30s after `SendAsync` and no tools are active, try `AbortAsync` to recover. Positioned after Case A/B with a `!hasActiveTool` guard to avoid false positives.

### 2. Settings.razor Cleanup

Removed the orphaned **Copilot CLI** nav button and empty section shell from Settings. The underlying setting was moved to the **Developer** group in b94d0f4 but the nav button was left behind.

### Other Changes

- `relaunch.ps1`: Fixed PowerShell 5.1 compatibility (bracket strings parsed as array indexers, `Join-Path` with 4 args)
- `PolyPilot.csproj`: Updated MauiDevFlow packages 0.12.1 → 0.23.1
- `ChatDatabaseResilienceTests.cs`: Fixed Windows-specific flaky test (async file handle release)

### Invariants Validated

- INV-1 ✅ INV-3 ✅ INV-4 ✅ INV-11 ✅ INV-12 ✅ INV-16 ✅ INV-17 ✅

### Tests

2669/2669 passing

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
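The ordering of the watchdog cases described in the summary can be illustrated with a small Python sketch. The field names, the `classify` function, and the returned labels are hypothetical; only the 30-second threshold, the Case A/B-before-D ordering, and the no-active-tool guard come from the description. Any other watchdog cases are not described on this page and are left out.

```python
from dataclasses import dataclass

DEAD_SEND_TIMEOUT_S = 30  # threshold from the PR description


@dataclass
class WatchdogSnapshot:
    """Hypothetical view of one session at a watchdog tick."""

    has_active_tool: bool
    lost_terminal_event: bool
    events_file_size_at_send: int
    events_file_size_now: int
    seconds_since_send: float


def classify(snapshot):
    # Case A first: a running tool means the session is busy, not dead.
    if snapshot.has_active_tool:
        return "case_a_active_tool"
    # Case B: the terminal event was lost.
    if snapshot.lost_terminal_event:
        return "case_b_lost_terminal_event"
    # Case D last: no tool active, and the event log has not grown
    # since the send for at least the timeout.
    file_grew = snapshot.events_file_size_now > snapshot.events_file_size_at_send
    if not file_grew and snapshot.seconds_since_send >= DEAD_SEND_TIMEOUT_S:
        return "case_d_abort"
    return "healthy"
```

Checking the active-tool case first is what keeps a legitimately long tool run from being mistaken for a dead send.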