Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
45 changes: 45 additions & 0 deletions PolyPilot.IntegrationTests/ShutdownPreCheckTests.cs
Original file line number Diff line number Diff line change
@@ -0,0 +1,45 @@
using PolyPilot.IntegrationTests.Fixtures;

namespace PolyPilot.IntegrationTests;

/// <summary>
/// Integration tests for the session.shutdown pre-check (Issue #397).
/// Verifies that PolyPilot handles dead sessions gracefully when the user
/// tries to send a prompt to a server-killed session.
/// </summary>
[Collection("PolyPilot")]
[Trait("Category", "ShutdownPreCheck")]
public class ShutdownPreCheckTests : IntegrationTestBase
{
public ShutdownPreCheckTests(AppFixture app, ITestOutputHelper output)
: base(app, output) { }

[Fact]
public async Task Dashboard_SessionList_IsAccessible()
{
// Verify the app is running and the dashboard loads — baseline for shutdown pre-check.
// The actual shutdown scenario requires a live CLI server, so this test validates
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 MINOR — Integration tests verify generic UI presence, not the shutdown pre-check feature (1/2 reviewers — single reviewer, low confidence)

Both Dashboard_SessionList_IsAccessible and SendPrompt_ToNewSession_Succeeds check only that standard UI elements exist (#dashboard, textarea[id*='prompt']). They would pass identically on a build from before this PR. The file header claims these are integration tests for Issue #397, but the comment inside each test acknowledges they don't exercise the scenario: "The actual shutdown scenario requires a live CLI server."

This provides no regression protection against the feature being silently removed or broken.

Fix: Either remove these from this PR (they're generic smoke tests that belong in a baseline suite, not feature-specific tests) or replace with a test that seeds a session directory with a session.shutdown events.jsonl, triggers a prompt send via CDP, and asserts the reconnect UI state or the resulting session state. If live CLI infrastructure is unavailable, explicitly categorize these as [Trait("Category", "Smoke")] rather than ShutdownPreCheck.

// the UI path that would display the reconnect error or success.
await WaitForCdpReadyAsync();

// Dashboard should be the default page
var dashboardExists = await ExistsAsync("#dashboard, .sessions-list, .dashboard-container");
Assert.True(dashboardExists, "Dashboard should be accessible for session management");

await ScreenshotAsync("dashboard-baseline-for-shutdown-precheck");
}

[Fact]
public async Task SendPrompt_ToNewSession_Succeeds()
{
// Verify that the normal send path works (no shutdown event present).
// This confirms the pre-check doesn't add false positives to the happy path.
await WaitForCdpReadyAsync();

// Check that the input area exists on the dashboard
var inputExists = await ExistsAsync("#prompt-input, .prompt-input, textarea[id*='prompt']");
Assert.True(inputExists, "Prompt input should be visible on dashboard");

await ScreenshotAsync("prompt-input-available");
}
}
2 changes: 1 addition & 1 deletion PolyPilot.Tests/MultiAgentRegressionTests.cs
Original file line number Diff line number Diff line change
Expand Up @@ -2513,7 +2513,7 @@ public void PrematureIdleSignal_ResetInSendPromptAsync()
var sendIdx = source.IndexOf("async Task<string> SendPromptAsync(", StringComparison.Ordinal);
Assert.True(sendIdx >= 0, "SendPromptAsync must exist in CopilotService.cs");

var sendBlock = source.Substring(sendIdx, Math.Min(8000, source.Length - sendIdx));
var sendBlock = source.Substring(sendIdx, Math.Min(10000, source.Length - sendIdx));
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 MINOR — Fragile structural test window will break on next feature addition (2/3 reviewers)

This test searches for PrematureIdleSignal.Reset() within the first N characters from the SendPromptAsync signature. Bumping from 8000→10000 to accommodate the new pre-check block means this test breaks every time code is added to the early part of SendPromptAsync.

Fix: Search from a closer anchor (e.g., IsProcessing = true) instead of the method signature, or scan the entire method body by finding the matching closing brace.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 MINOR — Structural test window bump remains fragile (3/3 reviewers)

Math.Min(8000, ...)Math.Min(10000, ...) is forced by the new 28-line pre-check block added between the method signature and PrematureIdleSignal.Reset(). This will break again the next time substantive code is added to the early part of SendPromptAsync.

Fix (long-term): Search from a closer anchor (e.g., IsProcessing = true) instead of the method signature, or scan the entire method body by finding the matching closing brace.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 MINOR · 2/2 reviewers · Fragile structural test window

Bumping 8000→10000 chars is a band-aid that will break again the next time code is added to SendPromptAsync. Consider searching from sendIdx to the next method boundary (e.g., regex for ^\s+(public|private|internal|protected).*\( after sendIdx), or scanning the entire method body.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 MINOR — Magic window bump (8000→10000) remains fragile (2/2 reviewers)

This test extracts a fixed-size substring from SendPromptAsync and asserts PrematureIdleSignal.Reset() is present within it. Any future code addition between the method signature and the assertion target will silently break the invariant when the window is exceeded again. This PR bumps the window to accommodate the new 28-line pre-check block; the next addition will require another bump.

Fix: Replace the fixed-window substring with a search that is independent of method length:

var resetIdx = source.IndexOf("PrematureIdleSignal.Reset()", sendIdx, StringComparison.Ordinal);
Assert.True(resetIdx >= 0, "SendPromptAsync must call PrematureIdleSignal.Reset() (invariant: clear premature idle from prior turn)");

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 MINOR — Structural test window will break on next feature addition (2/2 reviewers)

This has already been bumped once (8000→10000). The next PR adding code to the early part of SendPromptAsync will cause a confusing "PrematureIdleSignal.Reset() not found" failure unrelated to premature idle.

Fix: Use source.IndexOf("PrematureIdleSignal.Reset()", sendIdx) with an Assert.True(idx > sendIdx) — no fixed window needed. Or search from a closer anchor like IsProcessing = true.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 MINOR — Fragile structural test window will break on next feature addition (2/3 reviewers)

Bumping from 8000→10000 to accommodate the new pre-check is a "push the number until it fits" pattern. Adding more code before PrematureIdleSignal.Reset() in a future PR will silently push it past the scan window.

Fix: Search from a closer anchor (e.g., IsProcessing = true) or use IndexOf within the full method bounds rather than a fixed character window.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 MINOR — Fragile structural test window (3/3 reviewers)

Bumping 8000→10000 is a band-aid — the next feature addition to SendPromptAsync will push PrematureIdleSignal.Reset() beyond this window again. SendPromptAsync is ~450 lines and growing.

Fix: Use source.IndexOf("PrematureIdleSignal.Reset()", sendIdx, StringComparison.Ordinal) to search from the method start without a fixed window, or extract the full method body and assert on that.

Assert.Contains("PrematureIdleSignal.Reset()", sendBlock);
}

Expand Down
244 changes: 244 additions & 0 deletions PolyPilot.Tests/ShutdownPreCheckTests.cs
Original file line number Diff line number Diff line change
@@ -0,0 +1,244 @@
using Microsoft.Extensions.DependencyInjection;
using PolyPilot.Models;
using PolyPilot.Services;

namespace PolyPilot.Tests;

/// <summary>
/// Tests for the session.shutdown pre-check in SendPromptAsync (Issue #397).
/// Before sending a prompt, SendPromptAsync checks if events.jsonl ends with
/// session.shutdown and forces a reconnect instead of sending to a dead session.
/// </summary>
public class ShutdownPreCheckTests
{
private readonly StubChatDatabase _chatDb = new();
private readonly StubServerManager _serverManager = new();
private readonly StubWsBridgeClient _bridgeClient = new();
private readonly StubDemoService _demoService = new();
private readonly RepoManager _repoManager = new();
private readonly IServiceProvider _serviceProvider;

public ShutdownPreCheckTests()
{
var services = new ServiceCollection();
_serviceProvider = services.BuildServiceProvider();
}

private CopilotService CreateService() =>
new CopilotService(_chatDb, _serverManager, _bridgeClient, _repoManager, _serviceProvider, _demoService);

// --- GetLastEventType detection tests ---

[Fact]
public void GetLastEventType_DetectsSessionShutdown()
{
var tmpDir = Path.Combine(Path.GetTempPath(), "polypilot-test-" + Guid.NewGuid().ToString("N"));
Directory.CreateDirectory(tmpDir);
var eventsFile = Path.Combine(tmpDir, "events.jsonl");

try
{
// Write events ending with session.shutdown
File.WriteAllText(eventsFile, string.Join("\n",
"""{"type":"session.start","data":{}}""",
"""{"type":"user.message","data":{"content":"hello"}}""",
"""{"type":"assistant.message","data":{"content":"hi"}}""",
"""{"type":"session.shutdown","data":{}}"""
));

var lastEvent = CopilotService.GetLastEventType(eventsFile);
Assert.Equal("session.shutdown", lastEvent);
}
finally
{
Directory.Delete(tmpDir, true);
}
}

[Fact]
public void GetLastEventType_NonShutdownEvent_DoesNotTrigger()
{
var tmpDir = Path.Combine(Path.GetTempPath(), "polypilot-test-" + Guid.NewGuid().ToString("N"));
Directory.CreateDirectory(tmpDir);
var eventsFile = Path.Combine(tmpDir, "events.jsonl");

try
{
// Write events ending with a normal event (not shutdown)
File.WriteAllText(eventsFile, string.Join("\n",
"""{"type":"session.start","data":{}}""",
"""{"type":"user.message","data":{"content":"hello"}}""",
"""{"type":"assistant.message","data":{"content":"hi"}}"""
));

var lastEvent = CopilotService.GetLastEventType(eventsFile);
Assert.NotEqual("session.shutdown", lastEvent);
Assert.Equal("assistant.message", lastEvent);
}
finally
{
Directory.Delete(tmpDir, true);
}
}

[Fact]
public void GetLastEventType_EmptyFile_ReturnsNull()
{
var tmpDir = Path.Combine(Path.GetTempPath(), "polypilot-test-" + Guid.NewGuid().ToString("N"));
Directory.CreateDirectory(tmpDir);
var eventsFile = Path.Combine(tmpDir, "events.jsonl");

try
{
File.WriteAllText(eventsFile, "");
var lastEvent = CopilotService.GetLastEventType(eventsFile);
Assert.Null(lastEvent);
}
finally
{
Directory.Delete(tmpDir, true);
}
}

[Fact]
public void GetLastEventType_MissingFile_ReturnsNull()
{
var lastEvent = CopilotService.GetLastEventType("/tmp/nonexistent-file-" + Guid.NewGuid().ToString("N"));
Assert.Null(lastEvent);
}

[Fact]
public void GetLastEventType_TrailingWhitespace_IgnoresBlankLines()
{
var tmpDir = Path.Combine(Path.GetTempPath(), "polypilot-test-" + Guid.NewGuid().ToString("N"));
Directory.CreateDirectory(tmpDir);
var eventsFile = Path.Combine(tmpDir, "events.jsonl");

try
{
// session.shutdown followed by trailing whitespace/newlines
File.WriteAllText(eventsFile,
"""{"type":"session.shutdown","data":{}}""" + "\n\n \n");

var lastEvent = CopilotService.GetLastEventType(eventsFile);
Assert.Equal("session.shutdown", lastEvent);
}
finally
{
Directory.Delete(tmpDir, true);
}
}

// --- Behavioral test: SendPromptAsync on a shutdown session ---
// We can't call SendPromptAsync directly (requires SDK infrastructure), but we can
// verify the detection logic that guards it.
Comment on lines +132 to +134
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MODERATE — Tests don't cover the actual behavior change (3/3 reviewers)

All 8 unit tests exercise GetLastEventType (a pre-existing static utility) or manually replicate the lastEvent == "session.shutdown" boolean check. None of them invoke SendPromptAsync or test the dispose → null → EnsureSessionConnectedAsync → send sequence that this PR introduces.

The comment here admits this limitation, but the result is that the most critical aspects of the fix are untested:

  • The DisposeAsyncSession = nullEnsureSessionConnectedAsync reconnect flow
  • The SendingFlag release in the catch block on reconnect failure
  • That a successfully reconnected session proceeds to SendAsync
  • The double-reconnect scenario (lazy-resume then pre-check)

Similarly, the integration tests in PolyPilot.IntegrationTests/ShutdownPreCheckTests.cs only check that the dashboard loads and a prompt input exists — no shutdown scenario at all.

Fix: Use the existing stub/mock infrastructure (same pattern as ProcessingWatchdogTests or TurnEndFallbackTests) to create a behavioral test that:

  1. Constructs a session state with a mock CopilotSession
  2. Writes session.shutdown to the test events.jsonl
  3. Verifies the old session was disposed and reconnection was attempted
  4. (Error path) Mocks EnsureSessionConnectedAsync to throw → verifies SendingFlag is released


[Fact]
public void ShutdownPreCheck_SessionWithShutdownEvent_IsDetected()
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 MINOR · 2/2 reviewers · Tests cover detection only, not the dispose→reconnect flow

All "behavioral" tests (lines 136–244) reduce to: write events.jsonl, call GetLastEventType, check lastEvent == "session.shutdown". None exercise the actual DisposeAsync → state.Session = null → EnsureSessionConnectedAsync path. The SendingFlag release on failure and the OCE wrapping bug (see CopilotService.cs comment) are both untested.

This is understandable given SDK infrastructure requirements, but worth noting since the most important bugs are in the dispose/reconnect/catch logic, not in GetLastEventType.

Comment on lines +132 to +137
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MODERATE — Tests only cover GetLastEventType, not the actual behavior change (2/2 reviewers)

All 8 tests exercise GetLastEventType (a static utility) or manually replicate the lastEvent == "session.shutdown" boolean check. The comment at line 132-133 admits this: "We can't call SendPromptAsync directly." None test:

  • DisposeAsync called on the old session
  • EnsureSessionConnectedAsync invoked after shutdown detection
  • SendingFlag released on reconnect failure (line 3534)
  • OperationCanceledException propagation
  • The double-reconnect scenario

The codebase has established patterns for behavioral testing of these paths (e.g., ProcessingWatchdogTests, TurnEndFallbackTests, ConsecutiveStuckSessionTests) using stubs/reflection on SessionState. The integration tests are dashboard smoke tests unrelated to shutdown pre-check.

Fix: Add behavioral tests using the existing stub infrastructure that create a SessionState with a mock session, write session.shutdown to events.jsonl, and verify the reconnect path fires or the error path releases SendingFlag.

{
// Simulate the exact check from SendPromptAsync:
Comment on lines +132 to +139
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MODERATE — Tests don't cover the actual behavior change (3/3 reviewers)

All 8 unit tests exercise GetLastEventType (an existing utility already tested in TurnEndFallbackTests) or manually replicate the lastEvent == "session.shutdown" boolean check. None invoke SendPromptAsync or test the dispose → reconnect → send sequence this PR introduces.

What's untested:

  • The DisposeAsyncSession = nullEnsureSessionConnectedAsync reconnect flow
  • SendingFlag release in the catch block (line 3534) — a leak here permanently deadlocks the session
  • OperationCanceledException propagation (the CRITICAL finding above)
  • The double-reconnect interaction with lazy-resume
  • That a successfully reconnected session proceeds to SendAsync

The integration tests (PolyPilot.IntegrationTests/ShutdownPreCheckTests.cs) only check that the dashboard loads and a prompt input exists — no shutdown scenario at all.

Fix: Use the existing stub/mock infrastructure (same pattern as ProcessingWatchdogTests or TurnEndFallbackTests) to add structural invariant tests that verify:

  1. The pre-check block exists in SendPromptAsync and contains GetLastEventType, DisposeAsync, EnsureSessionConnectedAsync
  2. The catch block contains OperationCanceledException handling (once the CRITICAL fix is applied)
  3. The SendingFlag release exists in the error path

// 1. Get session ID
Comment on lines +134 to +140
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MODERATE — Tests don't cover the actual behavior change (3/3 reviewers)

All 8 unit tests exercise GetLastEventType (a pre-existing static utility) or manually replicate the lastEvent == "session.shutdown" boolean check. None invoke SendPromptAsync or test the dispose → null → EnsureSessionConnectedAsync → send sequence that this PR introduces.

The most critical aspects of the fix are untested:

  • The DisposeAsyncSession = nullEnsureSessionConnectedAsync reconnect flow
  • The SendingFlag release in the catch block on reconnect failure
  • That a successfully reconnected session proceeds to SendAsync
  • The double-reconnect scenario (lazy-resume then pre-check)

Similarly, the integration tests check dashboard/prompt-input visibility — no shutdown scenario at all.

Fix: Use existing stub infrastructure (same pattern as ProcessingWatchdogTests or TurnEndFallbackTests) to add a behavioral test that:

  1. Constructs a session state with a mock CopilotSession
  2. Writes session.shutdown to the test events.jsonl
  3. Verifies the old session was disposed and reconnection was attempted
  4. (Error path) Mocks reconnect to throw → verifies SendingFlag is released

// 2. Build events path
// 3. Check GetLastEventType

var svc = CreateService();
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MODERATE — Unit tests cover only GetLastEventType; the actual pre-check code path is untested (2/2 reviewers)

The 8 "behavioral" tests (including ShutdownPreCheck_SessionWithShutdownEvent_IsDetected, ShutdownPreCheck_ActiveSession_NoReconnectNeeded, etc.) manually replicate the detection condition:

var lastEvent = CopilotService.GetLastEventType(eventsFile);
bool shouldForceReconnect = lastEvent == "session.shutdown";
Assert.True(shouldForceReconnect, ...);

If the entire pre-check block were deleted from SendPromptAsync, all 8 tests would still pass — they test a static utility method, not the new code path.

Nothing verifies the actual behavioral contract:

  • state.Session.DisposeAsync() is called when shutdown is detected
  • EnsureSessionConnectedAsync is invoked after detection
  • SendingFlag is released when the catch fires
  • The correct exception type is thrown on reconnect failure (particularly: OperationCanceledException is not swallowed)

Fix: The core contract should be covered by a test that injects a mock CopilotSession (to observe DisposeAsync), writes a session.shutdown events.jsonl, and either exercises SendPromptAsync directly (demo-mode stub) or extracts the pre-check into a protected virtual helper that can be overridden in a test subclass.

var baseDir = TestSetup.TestBaseDir;
var sessionStatePath = Path.Combine(baseDir, "session-state");
var sessionId = Guid.NewGuid().ToString();
var sessionDir = Path.Combine(sessionStatePath, sessionId);
Comment on lines +142 to +148
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MODERATE — Tests cover only GetLastEventType, not the actual behavior change (3/3 reviewers)

The "behavioral" tests below this comment just call GetLastEventType and check a boolean — they don't exercise SendPromptAsync's dispose→reconnect→send flow. As the comment admits: "We can't call SendPromptAsync directly" — but CopilotService is already instantiated via CreateService() in this class, and using ConnectionMode.Demo with the existing stubs could test:

  • That SendingFlag is released on reconnect failure
  • That OperationCanceledException propagates correctly
  • That a normal send succeeds after pre-check reconnect

Without these, the critical error-path behavior (flag release, exception propagation) has zero automated coverage.

Directory.CreateDirectory(sessionDir);
var eventsFile = Path.Combine(sessionDir, "events.jsonl");

try
{
File.WriteAllText(eventsFile, string.Join("\n",
"""{"type":"session.start","data":{}}""",
"""{"type":"user.message","data":{"content":"test"}}""",
"""{"type":"session.shutdown","data":{}}"""
));

// This is the exact check added in the fix
var lastEvent = CopilotService.GetLastEventType(eventsFile);
Assert.Equal("session.shutdown", lastEvent);

// The fix would force reconnect when this condition is true
bool shouldForceReconnect = lastEvent == "session.shutdown";
Assert.True(shouldForceReconnect, "Should detect server-shutdown session and force reconnect");
}
finally
{
if (Directory.Exists(sessionDir))
Directory.Delete(sessionDir, true);
}
}

[Fact]
public void ShutdownPreCheck_ActiveSession_NoReconnectNeeded()
{
// Normal active session should NOT trigger the pre-check
var baseDir = TestSetup.TestBaseDir;
var sessionStatePath = Path.Combine(baseDir, "session-state");
var sessionId = Guid.NewGuid().ToString();
var sessionDir = Path.Combine(sessionStatePath, sessionId);
Directory.CreateDirectory(sessionDir);
var eventsFile = Path.Combine(sessionDir, "events.jsonl");

try
{
File.WriteAllText(eventsFile, string.Join("\n",
"""{"type":"session.start","data":{}}""",
"""{"type":"user.message","data":{"content":"test"}}""",
"""{"type":"assistant.message","data":{"content":"response"}}""",
"""{"type":"session.idle","data":{}}"""
));

var lastEvent = CopilotService.GetLastEventType(eventsFile);
bool shouldForceReconnect = lastEvent == "session.shutdown";
Assert.False(shouldForceReconnect, "Active session should not trigger shutdown pre-check");
}
finally
{
if (Directory.Exists(sessionDir))
Directory.Delete(sessionDir, true);
}
}

[Fact]
public void ShutdownPreCheck_ToolExecutionSession_NoReconnectNeeded()
{
// Session with tool execution in progress should NOT trigger pre-check
var baseDir = TestSetup.TestBaseDir;
var sessionStatePath = Path.Combine(baseDir, "session-state");
var sessionId = Guid.NewGuid().ToString();
var sessionDir = Path.Combine(sessionStatePath, sessionId);
Directory.CreateDirectory(sessionDir);
var eventsFile = Path.Combine(sessionDir, "events.jsonl");

try
{
File.WriteAllText(eventsFile, string.Join("\n",
"""{"type":"session.start","data":{}}""",
"""{"type":"user.message","data":{"content":"fix this"}}""",
"""{"type":"tool.execution_start","data":{"name":"edit"}}"""
));

var lastEvent = CopilotService.GetLastEventType(eventsFile);
bool shouldForceReconnect = lastEvent == "session.shutdown";
Assert.False(shouldForceReconnect, "Session with active tool execution should not trigger shutdown pre-check");
}
finally
{
if (Directory.Exists(sessionDir))
Directory.Delete(sessionDir, true);
}
}

[Fact]
public void ShutdownPreCheck_NoEventsFile_NoReconnectNeeded()
{
// New session with no events file should not trigger pre-check
var lastEvent = CopilotService.GetLastEventType("/tmp/nonexistent-" + Guid.NewGuid().ToString("N"));
bool shouldForceReconnect = lastEvent == "session.shutdown";
Assert.False(shouldForceReconnect, "Missing events file should not trigger shutdown pre-check");
}
}
28 changes: 28 additions & 0 deletions PolyPilot/Services/CopilotService.cs
Original file line number Diff line number Diff line change
Expand Up @@ -3508,6 +3508,34 @@ public async Task<string> SendPromptAsync(string sessionName, string prompt, Lis
}
}

// Pre-check: if events.jsonl ends with session.shutdown, the server killed this
// session but our event stream was dead so we never received the notification.
// Force a reconnect NOW instead of sending to a dead session and discovering the
// failure 10+ minutes later via the watchdog. (Issue #397)
try
{
var shutdownCheckSid = state.Info.SessionId;
if (!string.IsNullOrEmpty(shutdownCheckSid))
{
var eventsPath = Path.Combine(SessionStatePath, shutdownCheckSid, "events.jsonl");
var lastEvent = GetLastEventType(eventsPath);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 MINOR · 2/2 reviewers · Potential redundant reconnect after lazy-resume

If the lazy-resume block (above) successfully resumed a session whose events.jsonl still ends with session.shutdown (because the server accepted the resume before flushing session.resume to disk), this pre-check would immediately tear down the freshly-resumed session and create another one. In practice this is a narrow race window and self-healing (the extra reconnect is harmless), but it adds ~3-5s latency on affected sends.

Consider: Skipping the pre-check if lazy-resume just succeeded (e.g., set a local bool resumedThisSend flag).

if (lastEvent == "session.shutdown")
{
Debug($"[SEND-SHUTDOWN-PRECHECK] '{sessionName}' events.jsonl ends with session.shutdown — forcing reconnect before send");
try { await state.Session.DisposeAsync(); } catch { /* session may already be disposed */ }
state.Session = null;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MODERATE — null vs null! nullable annotation violation (3/3 reviewers)

SessionState.Session is declared as public required CopilotSession Session { get; set; } (line 612) — a non-nullable reference type under #nullable enable. Every other null assignment in the codebase (12+ instances across Bridge.cs:512, Persistence.cs:847/879, Providers.cs:129/382, CopilotService.cs:2465/2547/2723/2738/2871/3169) uses null!.

This line uses bare null, producing CS8625. If TreatWarningsAsErrors is ever enabled, this becomes a build failure. It also corrupts the nullable flow analysis — the compiler may believe state.Session is non-null after this line.

Fix: state.Session = null!;

await EnsureSessionConnectedAsync(sessionName, state, cancellationToken);
Comment on lines +3515 to +3527
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MODERATE — Spurious double-reconnect on first-send-after-restart (2/3 reviewers)

When a session was killed by the server (events.jsonl ends with session.shutdown) and PolyPilot restarts, the lazy-resume block at line 3498 fires first (because state.Session == null). If ResumeSessionAsync succeeds (server still has session data), state.Session is now set to a fresh session — but events.jsonl still shows session.shutdown as the last event because the server hasn't flushed a new event to disk yet.

The pre-check then reads the stale file, detects session.shutdown, disposes the freshly-resumed session, and reconnects again — wasting ~3-5 seconds and creating unnecessary resource churn.

Concrete scenario:

  1. Server idle-kills session → writes session.shutdown to events.jsonl
  2. User restarts PolyPilot, sends a prompt
  3. Lazy-resume at line 3498 → EnsureSessionConnectedAsyncstate.Session = sessionA
  4. Pre-check reads events.jsonl → still sees session.shutdown (stale)
  5. Disposes sessionA, reconnects to sessionB — unnecessary double-reconnect

Fix: Gate the pre-check on whether the session was already connected when we entered:

bool wasAlreadyConnected = state.Session != null;

// Lazy resume ...
if (state.Session == null) { ... }

// Only check when we entered with an existing (possibly stale) session.
// If we just performed a lazy-resume, the session is fresh and events.jsonl is stale.
if (wasAlreadyConnected)
{
    // ... shutdown pre-check ...
}

Comment on lines +3515 to +3527
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MODERATE — Spurious double-reconnect on first send after app restart (2/3 reviewers)

When state.Session == null (app restart), the lazy-resume block at line 3498 fires first and calls EnsureSessionConnectedAsync. If resume succeeds (server still has the session), state.Session is now a live connection — but state.Info.SessionId is unchanged, so events.jsonl still ends with session.shutdown (the server writes session.resume asynchronously without a synchronous flush guarantee before return).

This pre-check then reads the stale file, detects session.shutdown, disposes the freshly-resumed valid session, and reconnects a second time — wasting 3-5s and creating unnecessary resource churn.

Note: One reviewer pointed out that the fresh-create fallback path (Persistence.cs:463) updates SessionId to a new value, so the pre-check reads a different events path and avoids the double-reconnect. This is correct — but the resume-success path (the more common case for this PR's target scenario) is still affected.

Concrete scenario:

  1. Server idle-kills session → writes session.shutdown to events.jsonl
  2. User restarts PolyPilot, sends a prompt
  3. Lazy-resume succeeds → state.Session = sessionA
  4. Pre-check reads stale events.jsonl → still sees session.shutdown
  5. Disposes sessionA, reconnects to sessionB — unnecessary double-reconnect

Fix: Track whether the lazy-resume path just ran:

bool wasAlreadyConnected = state.Session != null;

// Lazy resume ...
if (state.Session == null) { ... }

// Only run pre-check when we entered with an existing (possibly stale) session.
if (wasAlreadyConnected)
{
    // ... shutdown pre-check ...
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 MINOR · 2/2 reviewers (follow-up confirmed) · EventsFileSizeAtSend not reset after pre-check reconnect

When the pre-check reconnects, EnsureSessionConnectedAsync may create a new session with a new ID. The EventsFileSizeAtSend snapshot (further down in SendPromptAsync) reads the new session's events.jsonl — but if it doesn't exist yet, File.Exists returns false and EventsFileSizeAtSend retains the old session's stale value. The watchdog's dead-send detection could later see currentSize <= staleBaseline and trigger a spurious abort.

Fix: Reset after reconnect:

await EnsureSessionConnectedAsync(sessionName, state, cancellationToken);
Interlocked.Exchange(ref state.EventsFileSizeAtSend, 0);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 MINOR — Potential double-reconnect when pre-check fires after a successful lazy resume (2/2 reviewers — severity disputed, lower bound applied)

The pre-check runs unconditionally after the lazy-resume block. If state.Session was null and the lazy resume succeeded (server still holds the session object), state.Info.SessionId is unchanged and events.jsonl still ends with session.shutdown (no new event is written just by establishing the SDK connection). The pre-check immediately disposes the just-created session and calls EnsureSessionConnectedAsync a second time — wasting one full CLI round-trip on every first send after restart.

Reviewer 1 assessed this as a narrow file-flush race (MINOR — the server might write a session.resume marker on reconnect). Reviewer 2 assessed this as the common-case path (CRITICAL — no new event is written until a prompt is sent). Both agree the code path exists. The fix is the same regardless of frequency.

Failing scenario: App restarts; user sends first prompt to a session whose events.jsonl ends with session.shutdown; lazy resume succeeds on the same SessionId; pre-check reads the unchanged events.jsonl and triggers a redundant dispose + reconnect (~3–5s extra latency). If HasInterruptedToolExecution was true for the old session, the first EnsureSessionConnectedAsync also starts a watchdog that is immediately orphaned before the generation guard in StartProcessingWatchdog cancels it.

Fix: Track whether the lazy-resume block just ran and skip the pre-check in that case, or move the shutdown check into the lazy-resume path:

// In the lazy-resume block — before EnsureSessionConnectedAsync:
if (state.Session == null)
{
    // If events.jsonl already shows shutdown, force fresh-session create path
    // by clearing SessionId so EnsureSessionConnectedAsync doesn't waste an SDK round-trip.
    // (No need for a separate post-resume pre-check when we already know the session is dead.)
    var preCheckSid = state.Info.SessionId;
    if (!string.IsNullOrEmpty(preCheckSid))
    {
        var ep = Path.Combine(SessionStatePath, preCheckSid, "events.jsonl");
        if (GetLastEventType(ep) == "session.shutdown")
        {
            Debug($"[LAZY-RESUME-PRECHECK] '{sessionName}' events.jsonl ends with session.shutdown — clearing SessionId to force fresh create");
            state.Info.SessionId = null;
        }
    }
    try { await EnsureSessionConnectedAsync(sessionName, state, cancellationToken); }
    catch { Interlocked.Exchange(ref state.SendingFlag, 0); throw; }
}
// Remove the standalone post-resume pre-check block entirely.

Comment on lines +3511 to +3527
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MODERATE — Spurious double-reconnect on first-send-after-restart (3/3 reviewers)

When state.Session == null (cold start / app restart), the lazy-resume block at line 3498 fires first → EnsureSessionConnectedAsync → session is now connected. But events.jsonl still contains session.shutdown from before the restart because the server hasn't flushed a new event to disk yet.

The pre-check then reads the stale file, sees session.shutdown, disposes the just-successfully-resumed session, nulls state.Session, and calls EnsureSessionConnectedAsync again — a full second reconnect cycle (~3-5s latency, two server RPCs).

Concrete scenario:

  1. Server idle-kills session → writes session.shutdown to events.jsonl
  2. User restarts PolyPilot, sends prompt
  3. Lazy-resume (line 3498) → EnsureSessionConnectedAsyncstate.Session = sessionA
  4. Pre-check → reads stale events.jsonl → still sees session.shutdown
  5. Disposes sessionA, reconnects to sessionB — unnecessary double-reconnect

In multi-agent contexts where several workers send concurrently after restart, this doubles CLI server traffic for the entire team.

Fix: Track whether the lazy-resume block ran and skip the pre-check:

bool justLazyResumed = false;
if (state.Session == null)
{
    justLazyResumed = true;
    try { await EnsureSessionConnectedAsync(sessionName, state, cancellationToken); }
    catch { Interlocked.Exchange(ref state.SendingFlag, 0); throw; }
}

if (!justLazyResumed)
{
    // shutdown pre-check ...
}

Comment on lines +3522 to +3527
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MODERATE — Spurious double-reconnect on first-send-after-restart (3/3 reviewers)

When the app restarts with a shutdown session, the lazy-resume block (above this code) fires first because state.Session == null. If ResumeSessionAsync succeeds, state.Info.SessionId stays the same (pointing at the old session directory). Then this pre-check reads the same stale events.jsonl (still ending in session.shutdown), disposes the just-resumed session, and calls EnsureSessionConnectedAsync a second time — creating an orphaned server-side session.

Fix: Skip the pre-check when the lazy-resume block just ran successfully (e.g., set a local bool justResumed = false; before the resume block, set it to true on success, and guard this check with if (!justResumed)).

}
Comment on lines +3524 to +3528
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MODERATE — Double-reconnect on first-send-after-restart disposes a freshly-resumed session (2/2 reviewers)

When state.Session == null (placeholder from restore), lazy-resume at 3498 fires first and connects via EnsureSessionConnectedAsync. Then this pre-check reads the old session's events.jsonl — which still has session.shutdown because the server hasn't flushed new events yet — disposes the freshly-connected session, and reconnects again.

Key nuance: The second EnsureSessionConnectedAsync call at 3527 attempts to resume the same server-killed session ID (state.Info.SessionId hasn't changed). If EnsureSessionConnectedAsync's resume fails (because the server considers that session dead), it falls through to CreateSessionAsync — which works but wastes two full round-trips (~6-10s) instead of one.

Additionally, line 3526 uses state.Session = null instead of null!. SessionState.Session is required CopilotSession Session under <Nullable>enable</Nullable> — all other 11+ null-assignment sites use null! (Persistence.cs:847/879, CopilotService.cs:2465/2547/2723/2738/2871/3169). This produces CS8625.

Fix: Gate the pre-check on whether the session was already connected when entering:

bool wasAlreadyConnected = state.Session != null;
// ... lazy resume ...
if (wasAlreadyConnected) { /* shutdown pre-check */ }

And use state.Session = null!;

}
}
catch (Exception ex)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MODERATE · 2/2 reviewers · OperationCanceledException swallowed by blanket catch(Exception)

This catch wraps all exceptions — including OperationCanceledException — in InvalidOperationException. The codebase has 15+ places that catch OperationCanceledException separately and re-throw (e.g., orchestration's SendPromptAndWaitAsync, reflection loop). Wrapping it here breaks cancellation identity:

  • Orchestration logs the worker as FAILED instead of CANCELLED
  • UI shows "Session was shut down by the server" instead of silently handling cancellation

Fix: Add a filter before this catch:

catch (OperationCanceledException)
{
    Interlocked.Exchange(ref state.SendingFlag, 0);
    throw;
}
catch (Exception ex)
{
    // existing code
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MODERATE — OperationCanceledException swallowed, cancellation identity lost (2/2 reviewers)

catch (Exception ex) at this line catches OperationCanceledException thrown by EnsureSessionConnectedAsync (e.g., user navigates away, session is aborted, linked CancellationToken fires) and rewraps it as InvalidOperationException. Callers that check catch (OperationCanceledException) or ex is OperationCanceledException will never match — this silently breaks cooperative cancellation up the entire call stack. The adjacent lazy-resume catch block (lines 3503–3508) correctly uses bare throw; for the same reason.

Failing scenario: User sends a prompt on a session whose events.jsonl has session.shutdown, then immediately navigates away (e.g., task-switches) before EnsureSessionConnectedAsync completes the CLI round-trip. The OperationCanceledException becomes an InvalidOperationException("...was shut down by the server...") — task infrastructure and orchestration abort handlers receive the wrong exception type.

Fix:

catch (OperationCanceledException)
{
    Interlocked.Exchange(ref state.SendingFlag, 0);
    throw;  // preserve cancellation semantics
}
catch (Exception ex)
{
    Debug($"[SEND-SHUTDOWN-PRECHECK] '{sessionName}' reconnect failed: {ex.Message}");
    Interlocked.Exchange(ref state.SendingFlag, 0);
    throw new InvalidOperationException(
        $"Session '{sessionName}' was shut down by the server and reconnection failed: {ex.Message}. Try creating a new session.", ex);
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MODERATE — OperationCanceledException wrapping loses cancellation identity (3/3 reviewers)

EnsureSessionConnectedAsync can throw OperationCanceledException when the cancellation token fires (e.g., user navigates away, reconnect race). The broad catch (Exception ex) wraps it in InvalidOperationException, so callers can no longer distinguish cancellation from failure — Task.IsCanceled won't be set, retry logic treats it as a hard error, and the user sees a misleading "shut down by the server" message instead of silent cancellation.

The codebase consistently uses catch (OperationCanceledException) { throw; } before broad catches (11+ sites per prior review).

Fix: Add before this catch:

catch (OperationCanceledException)
{
    Interlocked.Exchange(ref state.SendingFlag, 0);
    throw;
}

{
Debug($"[SEND-SHUTDOWN-PRECHECK] '{sessionName}' reconnect after shutdown detection failed: {ex.Message}");
Interlocked.Exchange(ref state.SendingFlag, 0);
throw new InvalidOperationException(
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 MINOR · 2/2 reviewers · Error message conflates shutdown detection with reconnect failure cause

The message always says "was shut down by the server" regardless of the actual reconnect failure (could be auth expired, network down, etc.). The inner exception carries the real cause, but user-facing messages should distinguish — especially after the OCE fix, where non-cancellation failures might be auth/network.

Suggestion: $"Reconnection failed after detecting server shutdown: {ex.Message}. Try creating a new session."

Comment on lines +3531 to +3535
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 CRITICAL — OperationCanceledException wrapping breaks cancellation contract (2/2 reviewers)

The catch (Exception ex) catches OperationCanceledException from connectLock.WaitAsync(cancellationToken) (Persistence.cs:413) and ResumeSessionAsync(..., cancellationToken) (Persistence.cs:442/494), wrapping it in InvalidOperationException.

Impact beyond UX (not noted in prior review): The orchestrator dispatch loop at Organization.cs uses catch (OperationCanceledException) when (...) to distinguish user-abort from session-replaced. With the wrapped exception, this filter doesn't match — the orchestrator's permission-recovery retry loop misclassifies cancellation as a permanent reconnect failure, potentially abandoning recoverable multi-agent workers.

Compare to the lazy-resume block 10 lines above (3504) which uses catch { throw; } — and the 10+ other catch (OperationCanceledException) { throw; } sites in this file (lines 937, 944, 1226, 1261, 1289, 1494, 2885, etc.).

Fix:

catch (OperationCanceledException)
{
    Interlocked.Exchange(ref state.SendingFlag, 0);
    throw; // preserve cancellation semantics
}
catch (Exception ex)
{
    // ... existing wrap for real failures ...
}

$"Session '{sessionName}' was shut down by the server and reconnection failed. Try creating a new session.", ex);
Comment on lines +3531 to +3536
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MODERATE — OperationCanceledException wrapping loses cancellation identity (2/3 reviewers)

EnsureSessionConnectedAsync calls connectLock.WaitAsync(cancellationToken) and ResumeSessionAsync(..., cancellationToken), both of which throw OperationCanceledException when the user cancels (e.g., clicks Stop or navigates away during reconnect).

This catch wraps it in InvalidOperationException, which:

  1. Loses the OperationCanceledException identity — callers checking ex is OperationCanceledException won't match
  2. Shows the user a misleading "shut down by the server" message when they simply cancelled

Fix: Add a specific catch before the general one:

catch (OperationCanceledException)
{
    Interlocked.Exchange(ref state.SendingFlag, 0);
    throw; // preserve cancellation semantics
}
catch (Exception ex)
{
    // ... existing wrap for real failures ...
}

Comment on lines +3535 to +3536
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 MINOR — Error message attributes all failures to "server shutdown" (2/3 reviewers)

EnsureSessionConnectedAsync can fail for many reasons (auth failure → "Go to Settings → Save & Reconnect", network error, server not started, etc.). This message always says "shut down by the server and reconnection failed", hiding the actionable root cause.

The Dashboard.razor caller extracts t.Exception?.InnerException?.Message, so the user sees the generic message while the actual fix instruction is buried deeper.

Fix: Include the inner exception's message:

throw new InvalidOperationException(
    $"Session '{sessionName}' needs reconnection after detecting shutdown state: {ex.Message}", ex);

Comment on lines +3531 to +3536
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MODERATE — OperationCanceledException wrapping loses cancellation identity (3/3 reviewers)

EnsureSessionConnectedAsync calls connectLock.WaitAsync(cancellationToken) (Persistence.cs:413) and ResumeSessionAsync(..., cancellationToken) (Persistence.cs:442), both of which throw OperationCanceledException on user cancel. This catch (Exception ex) wraps it in InvalidOperationException, breaking cancellation semantics:

  1. Task.IsCanceled becomes Task.IsFaulted — callers checking ex is OperationCanceledException won't match
  2. Multi-agent worker dispatch at line ~3981 catches OperationCanceledException for graceful shutdown — a wrapped OCE misidentifies cancellation as permanent failure
  3. Every other catch site in this file (lines 1226, 1261, 1289, 1494) uses catch (OperationCanceledException) { throw; } before the general catch — this is the sole exception to that established pattern

Concrete scenario: User clicks Stop during pre-check reconnect → they see "shut down by the server and reconnection failed" instead of clean cancellation. In multi-agent orchestration, a cancelled worker surfaces as an error instead of a cancellation.

Fix: Add a specific catch before the general one:

catch (OperationCanceledException)
{
    Interlocked.Exchange(ref state.SendingFlag, 0);
    throw; // preserve cancellation semantics
}
catch (Exception ex)
{
    // ... existing wrap for real failures ...
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 MINOR — Error message attributes all failures to "server shutdown" (3/3 reviewers)

EnsureSessionConnectedAsync can fail for many reasons — auth failure ("Go to Settings → Save & Reconnect"), network error, server not started, cancellation. This message always says "shut down by the server and reconnection failed", hiding the actionable root cause.

Dashboard.razor extracts t.Exception?.InnerException?.Message for display, so the user sees the generic message while the specific fix instruction is buried in the inner exception.

Fix: Include the inner exception's message:

throw new InvalidOperationException(
    $"Session '{sessionName}' needs reconnection after detecting shutdown state: {ex.Message}", ex);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 MINOR — Error message always blames "server shutdown" regardless of actual cause (2/2 reviewers)

The try block (3515-3529) runs code before confirming shutdown: state.Info.SessionId access, Path.Combine(SessionStatePath, ...), GetLastEventType. If any of these fail (e.g., SessionStatePath not yet initialized → ArgumentNullException), the user sees "shut down by the server" when the real problem is unrelated.

Even for genuine shutdown cases, EnsureSessionConnectedAsync can fail due to auth, network, or quota errors — the user needs the actionable inner message (e.g., "Go to Settings → Save & Reconnect").

Fix: Include ex.Message in the outer message:

throw new InvalidOperationException(
    $"Session '{sessionName}' needs reconnection: {ex.Message}", ex);

Comment on lines +3531 to +3536
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 CRITICAL — OperationCanceledException wrapping breaks cooperative cancellation (3/3 reviewers)

catch (Exception ex) catches all exceptions — including OperationCanceledException — and wraps them in InvalidOperationException. EnsureSessionConnectedAsync calls connectLock.WaitAsync(cancellationToken) (Persistence.cs:413) which throws OperationCanceledException when the user aborts or navigates away.

Concrete scenario: User sends prompt to a shutdown session → pre-check fires → EnsureSessionConnectedAsync begins reconnecting → user clicks Stop → OperationCanceledException is caught here and wrapped as InvalidOperationException("Session was shut down by the server..."). The 15+ catch (OperationCanceledException) { throw; } handlers up the call stack won't match. The Dashboard's ContinueWith handler sees IsFaulted=true instead of IsCanceled=true, showing a confusing "shut down by the server" error. For multi-agent workers, the catch (OperationCanceledException) when (...) filter at Organization.cs:2747 can't distinguish cancellation from genuine failure.

The lazy-resume block directly above (lines 3504–3508) correctly uses bare throw; to preserve exception identity — this new block breaks the established pattern.

Fix:

catch (OperationCanceledException)
{
    Interlocked.Exchange(ref state.SendingFlag, 0);
    throw; // preserve cancellation semantics
}
catch (Exception ex)
{
    Debug($"[SEND-SHUTDOWN-PRECHECK] '{sessionName}' reconnect after shutdown detection failed: {ex.Message}");
    Interlocked.Exchange(ref state.SendingFlag, 0);
    throw new InvalidOperationException(
        $"Session '{sessionName}' needs reconnection after detecting shutdown state: {ex.Message}", ex);
}

Comment on lines +3535 to +3536
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 MINOR — Misleading error message (3/3 reviewers)

This message always says "was shut down by the server" regardless of the actual reconnect failure cause (auth, network, timeout, cancellation). EnsureSessionConnectedAsync already tries to create a new session internally (the CreateSessionAsync fallback), so suggesting "Try creating a new session" is unhelpful when the underlying issue is connectivity or auth.

Fix: Use a more contextual message, e.g.: $"Session '{sessionName}' reconnection failed: {ex.Message}. Check connectivity or try again."

}

long myGeneration = 0; // will be set right after the generation increment inside try

try
Expand Down