Skip to content

fix: handle stale CLI process error during worker session resume#374

Merged
PureWeen merged 2 commits intomainfrom
fix/session-fiesta-orchestrator-fiesta-worke-20260315-1504
Mar 15, 2026
Merged

fix: handle stale CLI process error during worker session resume#374
PureWeen merged 2 commits intomainfrom
fix/session-fiesta-orchestrator-fiesta-worke-20260315-1504

Conversation

@PureWeen
Copy link
Copy Markdown
Owner

Problem

When the CLI server process dies, the SDK's \StartCliServerAsync\ calls \Process.HasExited\ on a stale handle, throwing:
\
System.InvalidOperationException: No process is associated with this object.
\
This error was not recognized by \RestorePreviousSessionsAsync's fallback logic, so worker sessions (e.g., \ iesta-worker-1) were silently dropped during resume — causing orchestration to fail.

Root Cause

The catch block in \RestorePreviousSessionsAsync\ only handled three error patterns: \Session not found, \corrupt, and \session file. The process-related \InvalidOperationException\ matched none of them and also wasn't classified as a connection error by \IsConnectionError().

Fix

  • **\CopilotService.Utilities.cs**: Add \IsProcessError()\ helper that detects stale process handle exceptions (\InvalidOperationException\ with \No process is associated). Include it in \IsConnectionError()\ so retry/reconnect logic works across all call sites.
  • **\CopilotService.Persistence.cs**: Add \IsProcessError(ex)\ to the fallback condition in \RestorePreviousSessionsAsync\ so affected sessions get recreated via \CreateSessionAsync\ instead of being silently dropped.

Tests

Added 9 regression tests in \ConnectionRecoveryTests.cs:

  • \IsProcessError\ detection (direct, AggregateException-wrapped, inner exception)
  • \IsProcessError\ negative cases (unrelated InvalidOperationException, non-IOE)
  • \IsConnectionError\ process handle detection (direct, wrapped, inner)
  • Structural guard verifying the restore fallback includes \IsProcessError\

All 2570 tests pass. Windows build succeeds.

When the CLI server process dies, the SDK's StartCliServerAsync calls
Process.HasExited on a stale handle, throwing InvalidOperationException:
'No process is associated with this object.' This error was not recognized
by RestorePreviousSessionsAsync's fallback logic, causing worker sessions
like fiesta-worker-1 to be silently dropped during resume.

Changes:
- Add IsProcessError() helper to detect stale process handle exceptions
- Include IsProcessError in IsConnectionError() so retry logic works
  across all call sites (SendPromptAsync, CreateSessionAsync, etc.)
- Add IsProcessError check to RestorePreviousSessionsAsync fallback
  condition so affected sessions get recreated via CreateSessionAsync
- Add 9 regression tests covering process error detection, wrapping
  in AggregateException, inner exceptions, negative cases, and a
  structural guard for the restore fallback

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Owner Author

PR #374 Review — fix: handle stale CLI process error during worker session resume
CI: ⚠️ No checks configured
Consensus Findings (2+ of 5 models)

Sev Location Finding Consensus

1 🟡 ConnectionRecoveryTests.cs:~444 Structural test uses Substring(conditionIndex, 500) — fragile fixed window that can ArgumentOutOfRangeException or false-fail if code shifts. Tests source text, not behavior. 5/5
2 🟡 ConnectionRecoveryTests.cs No behavior-level test proving RestorePreviousSessionsAsync actually takes the CreateSessionAsync fallback when the process error fires. 2/5
3 🟡 CopilotService.Utilities.cs:338 "No process is associated" is an English BCL string — fragile on localized runtimes (low real-world risk on .NET Core, but worth noting). 2/5
4 🟢 CopilotService.Utilities.cs:323-327 Redundant double-traversal of AggregateException inner exceptions — IsProcessError walks them, then IsConnectionError walks again. Correct but wasteful. 3/5
5 🟢 ConnectionRecoveryTests.cs:~444 conditionIndex > 0 should be != -1 (IndexOf returns -1 on miss, 0 at position zero). 1/5 (noted)
Verified Clean
IsProcessError() correctly guards on type (InvalidOperationException) + message substring ✅
Fallback condition uses || correctly — process errors OR'd in, not replacing existing checks ✅
Exception chain traversal terminates correctly ✅
9 unit tests cover detection logic including negative cases ✅
Verdict: ⚠️ Request Changes
One ask: Replace or harden the source-scan structural test (RestorePreviousSessions_FallbackCoversProcessErrors):
At minimum: use Math.Min(500, source.Length - conditionIndex) and switch to IndexOf("IsProcessError", conditionIndex) instead of a fixed window
Ideally: add a behavior-level test that throws the process InvalidOperationException during restore and asserts the CreateSessionAsync fallback fires

Replace fragile Substring(conditionIndex, 500) with IndexOf("IsProcessError", conditionIndex),
eliminating the ArgumentOutOfRangeException risk and the fixed-window brittleness.
Also fix conditionIndex > 0 to conditionIndex != -1 for correct IndexOf semantics.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen merged commit 5c8fc2e into main Mar 15, 2026
@PureWeen PureWeen deleted the fix/session-fiesta-orchestrator-fiesta-worke-20260315-1504 branch March 15, 2026 18:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant