Skip to content

fix: abort interrupted tools on session resume + Settings cleanup#393

Merged
PureWeen merged 4 commits intomainfrom
session-20260316-095052
Mar 16, 2026
Merged

fix: abort interrupted tools on session resume + Settings cleanup#393
PureWeen merged 4 commits intomainfrom
session-20260316-095052

Conversation

@PureWeen
Copy link
Copy Markdown
Owner

@PureWeen PureWeen commented Mar 16, 2026

Summary

Two fixes in this PR:

1. Dead Event Stream Recovery (abort interrupted tools on resume)

Root cause: When a session crashes mid-tool-execution (e.g., force-kill via relaunch.ps1), the SDK resumes the session but remains stuck waiting for tool.execution_complete results that will never arrive. All subsequent SendAsync calls are silently queued/ignored — zero events written to disk, zero callbacks fired. The session appears permanently stuck.

Evidence pattern in events.jsonl:

tool.execution_start   ← last real event before crash
session.resume         ← resume, zero new events after sends
abort                  ← THIS unlocks the session
user.message           ← now everything works normally

Fixes:

  • Abort on resume (CopilotService.Persistence.cs): After ResumeSessionAsync, scan events.jsonl for unmatched tool.execution_start events. If found, send AbortAsync to clear the SDK's pending tool state before the user's message.
  • HasInterruptedToolExecution helper (CopilotService.Utilities.cs): Streams last 30 lines of events.jsonl via ring buffer. Scans backwards with correct reverse-order semantics (pendingCompletions counter). Handles both graceful shutdown and force-kill scenarios.
  • INV-16 fix (CopilotService.Persistence.cs): Moved .On() callback registration BEFORE state.Session assignment — closes a race window where events arrive with no handler registered.
  • Watchdog Case D (CopilotService.Events.cs): Safety net — if events.jsonl hasn't grown 30s after SendAsync and no tools are active, try AbortAsync to recover. Positioned after Case A/B with !hasActiveTool guard to avoid false positives.

2. Settings.razor Cleanup

Removed orphaned Copilot CLI nav button and empty section shell from Settings. The underlying setting was moved to the Developer group in b94d0f4 but the nav button was left behind.

Other Changes

  • relaunch.ps1: Fixed PowerShell 5.1 compatibility (bracket strings parsed as array indexers, Join-Path with 4 args)
  • PolyPilot.csproj: Updated MauiDevFlow packages 0.12.1 → 0.23.1
  • ChatDatabaseResilienceTests.cs: Fixed Windows-specific flaky test (async file handle release)

Invariants Validated

  • INV-1 ✅ INV-3 ✅ INV-4 ✅ INV-11 ✅ INV-12 ✅ INV-16 ✅ INV-17 ✅

Tests

2669/2669 passing

PureWeen and others added 3 commits March 16, 2026 11:14
The CLI Source setting was moved to the Developer group in b94d0f4 but
the nav button and empty section shell were left behind, creating a
menu item that scrolled to nothing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
After a crash mid-tool-execution, the SDK session gets stuck waiting for
tool results that will never arrive. New SendAsync calls are silently
queued/ignored, causing the session to appear permanently stuck.

Three fixes:

1. Detect interrupted tools on resume (HasInterruptedToolExecution):
   Scans events.jsonl for unmatched tool.execution_start events before
   session.shutdown or end-of-file. Handles both graceful shutdown and
   force-kill (SIGKILL/Stop-Process) scenarios.

2. Send AbortAsync on resume when interrupted tools detected:
   In EnsureSessionConnectedAsync, after ResumeSessionAsync, check for
   interrupted tools and abort to clear the SDK's pending tool state.
   This allows subsequent SendAsync calls to work immediately.

3. Fix INV-16 violation in EnsureSessionConnectedAsync:
   Move .On() callback registration BEFORE setting state.Session,
   matching the safe pattern in sibling reconnect and worker revival.

4. Watchdog Case D (dead-send detection):
   If events.jsonl hasn't grown 30s after SendAsync, try AbortAsync
   as recovery. Safety net for cases the resume-abort doesn't catch.

Also fixes:
- relaunch.ps1 PowerShell 5.1 compatibility (bracket parsing, Join-Path)
- Update MauiDevFlow NuGet packages to v0.23.1
- Add [RESUME-ABORT] to diagnostic log filter

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
SQLite's CloseAsync is fire-and-forget via ObserveClose — on Windows
the file handle isn't released before File.Delete runs. Add GC.Collect
+ retry loop with brief delay to handle the async file release.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Owner Author

Multi-Model Consensus Review (4/5 models)

CI: PASS 2668/2669 (1 pre-existing flaky)

CRITICAL (4/4 models)

HasInterruptedToolExecution false positive (CopilotService.Utilities.cs:199-216)

The backwards scan is inverted for paired events. When reading end-of-file backwards, tool.execution_complete is seen first with pendingToolStarts==0 (decrement skipped), then tool.execution_start increments to 1. A matched start→complete pair returns true = interrupted.

Verified: zero-idle session with matched tools → pending=1, result=TRUE (false positive).

Every session completing tools via the zero-idle SDK path will trigger spurious AbortAsync on next lazy-resume.

Fix: invert the scan — increment on complete, decrement on start.

MODERATE (4/4 models)

Watchdog Case D fires on healthy slow sessions (CopilotService.Events.cs:1975-2010)

Fires after 30s with no events.jsonl growth. This matches any session where LLM/tool takes >30s before first event. EventsFileSizeAtSend > 0 for all real sessions. Case D is also placed BEFORE the hasActiveTool check (Case A), so it fires before the correct long-tool-run handler.

MINOR (2/4 models)

HasInterruptedToolExecution loads entire file (Utilities.cs:164-169) — only needs last 30 lines.

Missing tests

HasInterruptedToolExecution has a testable overload but no tests. Need:

  • matched start+complete (no session.idle): must return false
  • start only (SIGKILL): must return true
  • start+shutdown (graceful crash): must return true
  • matched pair + restart marker: must return false

Verdict: Request Changes

Settings.razor cleanup (commit 1) is correct. The interrupted-tool detection has a critical false-positive that will abort healthy sessions. Please fix backwards scan algorithm and add tests before merging.

… perf

1. CRITICAL: HasInterruptedToolExecution backwards scan was inverted.
   When scanning events.jsonl in reverse, tool.execution_complete is seen
   before its matching tool.execution_start. The old code incremented on
   start and decremented on complete — but complete was seen first with
   pendingToolStarts==0 so the decrement was skipped, then start incremented
   to 1, causing every session that ran tools to be falsely flagged as
   interrupted. Fixed: use pendingCompletions counter (increment on complete,
   decrement on start). Unmatched starts are tracked separately.

2. MODERATE: Watchdog Case D moved after Case A (active tools) and Case B
   (lost terminal event). Added !hasActiveTool guard so dead-send detection
   never fires on sessions with actively running tools.

3. MINOR: Replaced full file load with streaming Queue<string>(31) ring
   buffer that keeps only the last 30 lines in memory.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Owner Author

Round 2 Re-Review — PR #393

5-model parallel review (claude-opus-4.6 ×2, claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex). Consensus filter: 2+ models must agree.

CI: ⚠️ 1 test failure (ZeroIdleCaptureTests.PurgeOldCaptures_KeepsOnlyMostRecent) — passes in isolation; pre-existing flaky test, not PR-specific.


Previous Findings Status

Finding Status
🔴 F1 — HasInterruptedToolExecution backwards scan inverted (false-positive on every matched tool pair) FIXED (4/4 models)
🟡 F2 — Watchdog Case D fired before Case A hasActiveTool check, aborting healthy slow sessions FIXED (4/4 models)
🟢 F3 — Full file load for 30-line scan FIXED (4/4 models)

Fix Verification

F1 (CopilotService.Utilities.cs): The pendingCompletions counter correctly handles the backwards-scan semantics. Traced through all scenarios:

  • Matched pair (start→complete): backwards sees complete first → pendingCompletions++=1; then startpendingCompletions--=0, unmatchedStarts=0false
  • Interrupted (start, no complete): sees start with pendingCompletions=0unmatchedStarts=1true
  • Force-kill (no shutdown): sawOnlyControlEvents=false from the start event itself → condition satisfied ✓
  • Control-events only (resume/shutdown): unmatchedStarts=0false

F2 (CopilotService.Events.cs): Case D is now placed after Case A (active tool) and Case B (clean complete), with an explicit !hasActiveTool guard. Cannot fire on sessions with running tools.

F3 (CopilotService.Utilities.cs): File.ReadLines() streams lazily; Queue<string>(31) ring buffer keeps only the last 30 lines in memory.


New Logic Reviewed (INV-16, abort-on-resume, dead-send detection)

INV-16 fix (Persistence.cs:371): copilotSession.On(...) registered before state.Session = copilotSession. No event-loss window on lazy resume. ✓

Abort on resume (Persistence.cs:379-391): HasInterruptedToolExecution called after successful ResumeSessionAsync; abort clears stuck SDK state. Safe: IsProcessing=false during lazy resume, so any SDK events from the abort hit CompleteResponse's !IsProcessing guard and are no-op'd. ✓

Dead-send detection (CopilotService.cs + Events.cs): EventsFileSizeAtSend snapshotted at each SendPromptAsync; WatchdogAbortAttempted prevents repeated aborts; File.Exists guard inside Case D prevents spurious aborts when file is absent. ✓


No Consensus Issues Found

No new bugs reach the 2-model consensus threshold.


Informational (below consensus threshold)

EventsFileSizeAtSend stale value (1/4 models — Sonnet): If events.jsonl does not exist at SendPromptAsync time, the field retains its value from the previous send. If the SDK then creates a small file for the new session, Case D could fire spuriously. Trivial fix: reset EventsFileSizeAtSend to 0 unconditionally before the try-block in SendPromptAsync (before overwriting it if the file exists).

HasInterruptedToolExecution test coverage (1/4 models — Opus B): The testable (sessionId, basePath) overload has no unit tests. Core scenarios (matched pair → false, interrupted → true, force-kill → true) would guard against future regressions on this previously-critical path.


Verdict: ✅ Approve

All three Round 1 findings are correctly resolved. No consensus issues. Logic is sound.

@PureWeen PureWeen changed the title fix: remove orphaned Copilot CLI nav item from Settings fix: abort interrupted tools on session resume + Settings cleanup Mar 16, 2026
@PureWeen PureWeen merged commit 393cc06 into main Mar 16, 2026
@PureWeen PureWeen deleted the session-20260316-095052 branch March 16, 2026 19:55
PureWeen added a commit that referenced this pull request Mar 16, 2026
## Summary

Two fixes in this PR:

### 1. Dead Event Stream Recovery (abort interrupted tools on resume)

**Root cause:** When a session crashes mid-tool-execution (e.g.,
force-kill via `relaunch.ps1`), the SDK resumes the session but remains
stuck waiting for `tool.execution_complete` results that will never
arrive. All subsequent `SendAsync` calls are silently queued/ignored —
zero events written to disk, zero callbacks fired. The session appears
permanently stuck.

**Evidence pattern in events.jsonl:**
```
tool.execution_start   ← last real event before crash
session.resume         ← resume, zero new events after sends
abort                  ← THIS unlocks the session
user.message           ← now everything works normally
```

**Fixes:**
- **Abort on resume** (`CopilotService.Persistence.cs`): After
`ResumeSessionAsync`, scan `events.jsonl` for unmatched
`tool.execution_start` events. If found, send `AbortAsync` to clear the
SDK's pending tool state before the user's message.
- **HasInterruptedToolExecution helper**
(`CopilotService.Utilities.cs`): Streams last 30 lines of `events.jsonl`
via ring buffer. Scans backwards with correct reverse-order semantics
(`pendingCompletions` counter). Handles both graceful shutdown and
force-kill scenarios.
- **INV-16 fix** (`CopilotService.Persistence.cs`): Moved `.On()`
callback registration BEFORE `state.Session` assignment — closes a race
window where events arrive with no handler registered.
- **Watchdog Case D** (`CopilotService.Events.cs`): Safety net — if
`events.jsonl` hasn't grown 30s after `SendAsync` and no tools are
active, try `AbortAsync` to recover. Positioned after Case A/B with
`!hasActiveTool` guard to avoid false positives.

### 2. Settings.razor Cleanup

Removed orphaned **Copilot CLI** nav button and empty section shell from
Settings. The underlying setting was moved to the **Developer** group in
b94d0f4 but the nav button was left behind.

### Other Changes
- `relaunch.ps1`: Fixed PowerShell 5.1 compatibility (bracket strings
parsed as array indexers, `Join-Path` with 4 args)
- `PolyPilot.csproj`: Updated MauiDevFlow packages 0.12.1 → 0.23.1
- `ChatDatabaseResilienceTests.cs`: Fixed Windows-specific flaky test
(async file handle release)

### Invariants Validated
- INV-1 ✅ INV-3 ✅ INV-4 ✅ INV-11 ✅ INV-12 ✅ INV-16 ✅ INV-17 ✅

### Tests
2669/2669 passing

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant