Add MCP server error detection, auto-recovery, and `/mcp reload` command by Copilot · Pull Request #381 · PureWeen/PolyPilot

Copilot · 2026-03-15T23:44:01Z

When MCP servers go down mid-session, the model loops asking users to run /mcp commands to debug configuration — a dead end for non-technical users who have no idea what ~/.copilot/mcp-config.json is.

MCP failure detection & auto-recovery

New IsMcpError() detector in CopilotService.Events.cs catches MCP-specific tool errors (ECONNREFUSED, connection refused, server disconnected, transport error, failed to start, server process exited, spawn ENOENT). Plugged into the existing sliding-window recovery flow alongside permission denials and shell failures:

var isRecoverableError = isPermissionDenial || isShellFailure || isMcpFailure;

When 3/5 recent tool results are MCP failures, auto-triggers TryRecoverPermissionAsync (session reconnect with fresh MCP config) and shows an actionable message pointing to /mcp reload.

`/mcp reload` command

Implements the already-advertised-in-autocomplete but never-wired reload subcommand. Creates a new session with freshly loaded MCP server configs from disk, switches to it, and shows troubleshooting tips on failure.

System message guidance

New AppendMcpServerGuidance() helper injects configured MCP server names into the session system prompt with explicit instructions to suggest /mcp reload on failure — prevents the model from looping on "please verify your MCP setup" messages. Applied in both CreateSessionAsync and BuildFreshSessionConfig.

Tests

20+ new tests for IsMcpError() (positive/negative/confusion with other detectors)
Structural regression guards ensuring isRecoverableError includes isMcpFailure and recovery messages reference /mcp reload
Widened structural test search windows (2000→3000 chars) to accommodate new system message block in BuildFreshSessionConfig

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

192.0.2.1
- Triggering command: /usr/share/dotnet/dotnet /usr/share/dotnet/dotnet exec --runtimeconfig /home/REDACTED/work/PolyPilot/PolyPilot/PolyPilot.Tests/bin/Debug/net10.0/PolyPilot.Tests.runtimeconfig.json --depsfile /home/REDACTED/work/PolyPilot/PolyPilot/PolyPilot.Tests/bin/Debug/net10.0/PolyPilot.Tests.deps.json /home/REDACTED/work/PolyPilot/PolyPilot/PolyPilot.Tests/bin/Debug/net10.0/testhost.dll --port 40731 --endpoint 127.0.0.1:040731 --role client --parentprocessid 5020 --telemetryoptedin false (packet block)
- Triggering command: /usr/share/dotnet/dotnet /usr/share/dotnet/dotnet exec --runtimeconfig /home/REDACTED/work/PolyPilot/PolyPilot/PolyPilot.Tests/bin/Debug/net10.0/PolyPilot.Tests.runtimeconfig.json --depsfile /home/REDACTED/work/PolyPilot/PolyPilot/PolyPilot.Tests/bin/Debug/net10.0/PolyPilot.Tests.deps.json /home/REDACTED/work/PolyPilot/PolyPilot/PolyPilot.Tests/bin/Debug/net10.0/testhost.dll --port 42233 --endpoint 127.0.0.1:042233 --role client --parentprocessid 6432 --telemetryoptedin false (packet block)
1k9vsblobprodcus379.vsblob.vsassets.io
- Triggering command: /usr/bin/dotnet dotnet restore --no-dependencies /tmp/codeql-scratch-abbd4936ea2bb08e/dbs/csharp/working/55BB33C8E1046D06DB6BC8FB401555C0/missingpackages_workingdir --packages /tmp/codeql-scratch-abbd4936ea2bb08e/dbs/csharp/working/missingpackages /p:DisableImplicitNuGetFallbackFolder=true --verbosity normal --configfile /tmp/codeql-scratch-abbd4936ea2bb08e/dbs/csharp/working/nugetconfig/nuget.config --force (dns block)
4myvsblobprodcus32.vsblob.vsassets.io
- Triggering command: /usr/bin/dotnet dotnet restore --no-dependencies /tmp/codeql-scratch-abbd4936ea2bb08e/dbs/csharp/working/55BB33C8E1046D06DB6BC8FB401555C0/missingpackages_workingdir --packages /tmp/codeql-scratch-abbd4936ea2bb08e/dbs/csharp/working/missingpackages /p:DisableImplicitNuGetFallbackFolder=true --verbosity normal --configfile /tmp/codeql-scratch-abbd4936ea2bb08e/dbs/csharp/working/nugetconfig/nuget.config --force (dns block)
kijvsblobprodcus387.vsblob.vsassets.io
- Triggering command: /usr/bin/dotnet dotnet restore --no-dependencies /tmp/codeql-scratch-abbd4936ea2bb08e/dbs/csharp/working/55BB33C8E1046D06DB6BC8FB401555C0/missingpackages_workingdir --packages /tmp/codeql-scratch-abbd4936ea2bb08e/dbs/csharp/working/missingpackages /p:DisableImplicitNuGetFallbackFolder=true --verbosity normal --configfile /tmp/codeql-scratch-abbd4936ea2bb08e/dbs/csharp/working/nugetconfig/nuget.config --force (dns block)
pc2vsblobprodcus360.vsblob.vsassets.io
- Triggering command: /usr/bin/dotnet dotnet restore --no-dependencies /tmp/codeql-scratch-abbd4936ea2bb08e/dbs/csharp/working/55BB33C8E1046D06DB6BC8FB401555C0/missingpackages_workingdir --packages /tmp/codeql-scratch-abbd4936ea2bb08e/dbs/csharp/working/missingpackages /p:DisableImplicitNuGetFallbackFolder=true --verbosity normal --configfile /tmp/codeql-scratch-abbd4936ea2bb08e/dbs/csharp/working/nugetconfig/nuget.config --force (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Configure Actions setup steps to set up my environment, which run before the firewall is enabled
Add the appropriate URLs or hosts to the custom allowlist in this repository's Copilot coding agent settings (admins only)

Original prompt

This section details on the original issue you should resolve

<issue_title>[feedback] MCP server reactivation</issue_title>
<issue_description>[feedback] MCP server reactivation
when an MCP is not available, the model is stuck asking to validate with the command:
/mcp <show, add>
the list of mcps setup and active.
MCP setup is abstracted, from polypilot, and the NON technical user, or just a busy user may not know if polypilot looks at a system wide MCP setup ~/.copilot/mcp-config.json vs MCP servers enabled inside VSCode, for example, having the perspections that polypilot is not able to fetch the MCP, reactivate them or suggest to non technical users how to get unblocked.

It happened to me a couple of times, and my first reaction was:

copy the prompt into vscode and run it there.

You want to consider this scenarios, considering that this tool will be used by non tecnical people too, for "office" tasks which leverage AIs too. (daily-staup, management retro, team activity retro, active incidents, pending bugs etc...) -> not executed just by engineer, but also managers, support team.

I can clearly see -> Cannot post the pictures because some mcps are not public
that copilot cli shows that the mcps servers are active.

And polypilot, does not fetch them, does not restart,
the model just ask in loop to check if the issue is on copilot cli side -> Which is not.

</issue_description>

Comments on the Issue (you are @copilot in this section)

Fixes [feedback] MCP server reactivation #368

📱 Kick off Copilot coding agent tasks wherever you are with GitHub Mobile, available on iOS and Android.

PureWeen · 2026-03-16T15:09:41Z

PR #381 Re-Review — MCP Error Detection & `/mcp reload`

Tests: ✅ 2701 passed, 0 failed | Rebased onto origin/main ✅

Two fix commits applied:

Fixes Applied

Commit 1 (c27df976): IsMcpError — require MCP context for generic network patterns

"connection refused", "server disconnected", "transport error", "failed to start", "server process exited" matched any tool error without MCP context. An SSH failure, database connection error, or Docker tool call would trigger the 3/5 sliding window and fire session reconnect + "MCP server errors detected" message.

Fix: these patterns now require "mcp" in the error text. "MCP server" and "mcp_server" remain unconditional.

Commit 2 (b8863b2b): Two issues in ReloadMcpServersAsync + complete the ECONNREFUSED/spawn ENOENT fix

(a) Build-breaking: _activeSession doesn't exist (field is expandedSession). Would have caused CS0103 on every platform build.

(b) Incomplete session switch: Missing CopilotService.SetActiveSession() meant the service layer (watchdog, bridge sync, persistence) still tracked the old session as active after /mcp reload. Fixed to mirror the /new handler pattern exactly.

(c) ECONNREFUSED / spawn ENOENT unconditional (5/5 model consensus): ECONNREFUSED is a standard libuv errno emitted by any failed TCP connection (fetch tools, DB clients, Redis, HTTP APIs). spawn ENOENT fires for any missing binary (jq, git, docker) — not just MCP servers. Both now require "mcp" context, same as the other generic patterns.

Remaining Finding (not blocking)

🟢 Dashboard.razor — if MCP auto-recovery (TryRecoverPermissionAsync) fires for a permanently dead MCP server, the session reconnect won't help (it doesn't restart MCP processes). The existing _permissionRecoveryAttempts circuit breaker (stops after 2 attempts) prevents an infinite loop. The /mcp reload path is the correct remedy, and the recovery message already points users to it. No code change needed — the behavior is acceptable.

Verdict: ✅ Approve

All consensus findings are resolved. The feature is functionally correct: MCP detection is now precise (only fires on genuine MCP errors), /mcp reload creates a fresh session and switches to it properly, and AppendMcpServerGuidance primes the model to suggest the command proactively.

Co-reviewed by: Copilot 223556219+Copilot@users.noreply.github.com

PureWeen · 2026-03-17T14:47:56Z

PR #381 Review — 5-Model Consensus

CI: No checks configured
Prior reviews: 1 existing comment (author confirming two fix commits applied)

🔴 CRITICAL — `ReloadMcpServersAsync` sets `IsProcessing = false` off the UI thread (4/5 models)

ReloadMcpServersAsync is called as _ = ReloadMcpServersAsync(session) (fire-and-forget) from a Blazor event handler. After await state.Session?.AbortAsync(), the continuation resumes on a ThreadPool thread. state.Info.IsProcessing = false (plus ProcessingStartedAt, ToolCallCount, ProcessingPhase) are then set off the UI thread.

The documented invariant is unambiguous: all mutations to state.Info.IsProcessing must be marshaled to the UI thread via InvokeOnUI(). This is the exact pattern that caused 7+ PRs of fix/regression cycles (processing-state-safety skill, INV-1). Every other path that clears IsProcessing (8 sites in Events.cs and CopilotService.cs) wraps those mutations in InvokeOnUI() or runs directly on the UI thread.

Fix: Wrap the abort cleanup block in InvokeOnUI(() => { ... }):

if (state.Info.IsProcessing)
{
    try { await state.Session?.AbortAsync()!; } catch { }
    InvokeOnUI(() =>
    {
        FlushCurrentResponse(state);
        state.Info.IsProcessing = false;
        state.Info.ProcessingStartedAt = null;
        state.Info.ToolCallCount = 0;
        state.Info.ProcessingPhase = 0;
        state.Info.ClearPermissionDenials();
    });
}

🟡 MODERATE — `ReloadMcpServersAsync` does not call `ClearPermissionDenials()` (4/5 models)

The sliding window (3-of-5 tool errors → TryRecoverPermissionAsync) is never cleared. newState.Info = state.Info shares the same AgentSessionInfo object, so all stale denial counts carry into the new session.

If the MCP server is still broken after reload, the very first new tool error combines with 2 stale entries to immediately re-hit denialCount == 3 — triggering TryRecoverPermissionAsync on top of the reload the user just triggered. Every other recovery/termination path (9 sites across Events.cs and CopilotService.cs) calls state.Info.ClearPermissionDenials().

Fix: Add state.Info.ClearPermissionDenials(); to the cleanup block (shown above in the InvokeOnUI lambda).

🟡 MODERATE — `Dashboard.razor` auto-dismiss missing `_disposed` guard (3/5 models)

The new 8-second auto-dismiss for orchestration completion fires InvokeAsync after Task.Delay(8000). If the Dashboard component is disposed within 8 seconds (user navigates away), InvokeAsync throws ObjectDisposedException. Every other delayed background task in this component guards with if (_disposed) return; before calling InvokeAsync.

Fix:

await InvokeAsync(() =>
{
    if (_disposed) return;
    _groupPhases.Remove(groupId);
    ScheduleRender();
});

🟢 MINOR — `ReloadMcpServersAsync` silent success when recovery already in-progress (2/5 models)

When _recoveryInProgress.TryAdd fails, the method returns silently with no exception. The Dashboard caller already added "🔄 Reloading MCP servers..." before the call and unconditionally adds "✅ MCP servers reloaded..." after — giving the user a false success confirmation.

Fix: Throw InvalidOperationException (or return a bool), so the Dashboard catch block can show an honest message.

✅ Confirmed NOT bugs (consensus)

IsMcpError pattern matching: The two-tier approach (unconditional for "MCP server"/"mcp_server", MCP-context-gated for ECONNREFUSED/spawn ENOENT/etc.) is correct and avoids false positives from SSH/Docker/DB errors.
_recoveryInProgress concurrency guard: Both TryRecoverPermissionAsync and ReloadMcpServersAsync use the same _recoveryInProgress: ConcurrentDictionary<string, bool> — concurrent reload + permission recovery for the same session is correctly prevented.
_sessions[sessionName] = newState: Pre-existing pattern also used in TryRecoverPermissionAsync. Not a new regression.

Test coverage gaps

ReloadMcpServersAsync_ClearsPermissionDenials — reload should reset the sliding window so one tool error after reload doesn't immediately re-trigger recovery
ReloadMcpServersAsync_ReturnsEarlyWhenInProgress — calling reload while recovery is running should surface an honest message, not silent success

Verdict: ⚠️ Request changes

Two targeted fixes needed: the IsProcessing thread safety violation (wrapping in InvokeOnUI) and the missing ClearPermissionDenials() (which can create an infinite recovery loop). The _disposed guard in the auto-dismiss Task.Run and the silent-success behavior are lower severity. Core IsMcpError detection logic and ReloadMcpServersAsync architecture are solid.

…nd system message guidance - Add IsMcpError() detection method in CopilotService.Events.cs for MCP-specific tool failures (connection refused, ECONNREFUSED, server disconnected, transport error, failed to start, server process exited, spawn ENOENT) - Integrate MCP errors into existing recovery flow (isRecoverableError now includes isMcpFailure alongside isPermissionDenial and isShellFailure) - Show MCP-specific recovery message with /mcp reload hint when threshold is reached - Implement /mcp reload command in Dashboard.razor - creates new session with fresh MCP server configuration - Add MCP server awareness to session system messages so the model knows what's configured and can suggest /mcp reload instead of looping - Update /help text and GetMcpServerInfo output to mention /mcp reload - Add 20+ tests for IsMcpError detection and structural regression guards - Update structural test search windows from 2000 to 3000 chars to accommodate new MCP guidance block in BuildFreshSessionConfig Co-authored-by: PureWeen <5375137+PureWeen@users.noreply.github.com>

…ad session message target - Extract duplicated MCP system message block into AppendMcpServerGuidance() helper - Fix ReloadMcpServersAsync to add success message to the NEW session (not old one) - Switch _activeSession after adding the message so it's visible to the user Co-authored-by: PureWeen <5375137+PureWeen@users.noreply.github.com>

Generic patterns like 'connection refused', 'server disconnected', 'transport error', 'failed to start', and 'server process exited' are extremely common in non-MCP tool errors (SSH, Docker, databases, kubectl). Without MCP context these would trigger unwanted session reconnects on any tool that hits a network or process failure. Fix: MCP-specific patterns (ECONNREFUSED, mcp_server, spawn ENOENT, 'MCP server') match unconditionally. Generic network/process patterns now require 'mcp' to appear in the error text to match. Updated tests: moved 4 standalone generic patterns from the 'returns true' suite to 'returns false', added non-MCP false-positive cases (SSH, Docker, kubectl). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Fix 1: ReloadMcpServersAsync build error and incomplete session switch - _activeSession does not exist; corrected to expandedSession - Add CopilotService.SetActiveSession() call to keep service-layer active session in sync with bridge, watchdog, and persistence (mirrors /new handler) - Add _explicitlyCollapsed = false to match /new session switch pattern - Add newSession.MessageCount sync after History.Add (consistent with all other History.Add sites in Dashboard.razor) Fix 2: IsMcpError — move ECONNREFUSED and spawn ENOENT behind mcp guard ECONNREFUSED is a standard Node.js/libuv errno emitted by any failed TCP connection (Docker, databases, HTTP APIs, Redis). spawn ENOENT fires for any missing binary (jq, git, docker) — not just MCP server processes. Both are now gated behind hasMcpContext, leaving only the unambiguous 'MCP server' and 'mcp_server' patterns unconditional. Updated tests: moved 3 standalone cases (ECONNREFUSED, connect ECONNREFUSED, spawn ENOENT without mcp context) to false list; added mcp-context variants to true list; added git spawn ENOENT false-positive case. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…conditional Only 'MCP server' and 'mcp_server' match unconditionally. All other patterns (ECONNREFUSED, spawn ENOENT, connection refused, etc.) require 'mcp' context to avoid false positives from unrelated errors. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…-session UX Three bugs fixed: 1. Repeated /mcp reload would throw 'Session already exists' because session.Name + " (reloaded)" collides on second use, showing the troubleshooting error tips instead of working. 2. Reloading from an already-reloaded session would stack the suffix: 'Foo (reloaded) (reloaded)' instead of 'Foo (reloaded 2)'. 3. Error catch block showed MCP troubleshooting tips for all errors; now only shown for MCP-related errors. Fix: strip existing '(reloaded...)' suffix from baseName, use counter to find a unique name. Success message clarifies why a new session is created (SDK loads MCP configs once at session start). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…tory Instead of creating a brand new named session (losing all history), ReloadMcpServersAsync now replaces the underlying SDK CopilotSession while reusing the existing AgentSessionInfo (history, messages, name). The SDK has no API to reload MCP servers mid-session — they're baked into SessionConfig at creation time. A fresh SDK session is required so the CLI re-launches MCP server processes. By reusing the same AgentSessionInfo we preserve the full conversation history. Pattern follows TryRecoverPermissionAsync (in-place replacement): - Abort any in-progress request - Cancel all timers/watchdogs - Dispose old SDK session - BuildFreshSessionConfig (reloads MCP servers from disk) - CopilotClient.CreateSessionAsync with fresh config - New SessionState { Session = newSdk, Info = existingInfo } - Replace _sessions[name] in-place, re-register event handler Dashboard.razor simplified: calls CopilotService.ReloadMcpServersAsync, adds success message to existing session (no session switch needed). Tests: 2 structural regression guards added to PermissionDenialDetectionTests verifying the service method signature and in-place history preservation. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…persistence Three correctness fixes per code review: 1. Set state.IsOrphaned = true before disposal so stale event callbacks from the disposed SDK session bail out (via IsOrphaned guard in HandleSessionEvent) and don't corrupt the shared AgentSessionInfo. Follows the same CRITICAL pattern used in the reconnect path (~line 3152). 2. Add reentrance guard using _recoveryInProgress (same dict as TryRecoverPermissionAsync). Prevents double-reload if user triggers /mcp reload twice before the first completes. 3. Call FlushSaveActiveSessionsToDisk() after updating state.Info.SessionId so the new SDK session ID is persisted immediately. Without this, a crash after reload leaves active-sessions.json pointing at the old dead session ID. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

PureWeen · 2026-03-17T18:48:22Z

⚠️ Round 3 Review — PR #381 "Add MCP server error detection, auto-recovery, and `/mcp reload`"

Reviewers: 3/5 models completed (Opus, Gemini, GPT-Codex) | CI: ⚠️ No checks configured | Files: 5 changed (+476 lines)

Previous Findings Status

#	Finding	Round 2	Round 3
1	🔴 `IsProcessing = false` off UI thread	⚠️ Request changes	❌ STILL PRESENT (3/3)
2	🟡 Missing `ClearPermissionDenials()` + companion fields	⚠️ Request changes	❌ STILL PRESENT (3/3)
3	🟡 Dashboard `_disposed` guard missing	⚠️ Request changes	❌ STILL PRESENT (2/3)
4	🟢 Silent success on concurrent reload	Noted	⚠️ STILL PRESENT (2/3)

🔴 F1: `IsProcessing = false` set off UI thread (3/3 models — STILL PRESENT)

File: CopilotService.cs — ReloadMcpServersAsync

Called via fire-and-forget _ = ReloadMcpServersAsync(session) from Dashboard. After await state.Session?.AbortAsync()!, execution resumes on a ThreadPool thread. state.Info.IsProcessing = false is then set off the UI thread, violating INV-1/INV-2. This is the exact pattern that caused 7+ PRs of fix/regression cycles.

Fix: Wrap the cleanup block in InvokeOnUI():

InvokeOnUI(() =>
{
    FlushCurrentResponse(state);
    state.Info.IsProcessing = false;
    state.Info.IsResumed = false;
    state.Info.HasUsedToolsThisTurn = false;
    Interlocked.Exchange(ref state.ActiveToolCallCount, 0);
    state.Info.ProcessingStartedAt = null;
    state.Info.ToolCallCount = 0;
    state.Info.ProcessingPhase = 0;
    Interlocked.Exchange(ref state.SendingFlag, 0);
    state.Info.ClearPermissionDenials();
});

🔴 F2: Incomplete IsProcessing companion field cleanup (3/3 models — STILL PRESENT)

File: CopilotService.cs — ReloadMcpServersAsync

The processing cleanup only clears 4 fields (IsProcessing, ProcessingStartedAt, ToolCallCount, ProcessingPhase). Missing per INV-1 checklist:

IsResumed = false
HasUsedToolsThisTurn = false
ActiveToolCallCount = 0
SendingFlag = 0
ClearPermissionDenials()

The missing ClearPermissionDenials() is particularly impactful: stale denial counts carry into the new session, and the very first new tool error can immediately re-trigger TryRecoverPermissionAsync (stale 2 + new 1 = threshold 3).

🔴 F3: INV-16 violation — handler registered AFTER dictionary publish (3/3 models — NEW)

File: CopilotService.cs — ReloadMcpServersAsync

_sessions[sessionName] = newState;                              // visible to event loop
newSdkSession.On(evt => HandleSessionEvent(newState, evt));     // handler registered AFTER

Events from the new SDK session can arrive between these lines and be dropped permanently. The established pattern (per INV-16) is handler-before-publish. Swap these two lines.

🟡 F4: INV-15 violation — `_sessions[key] = newState` instead of `TryUpdate` (3/3 models — NEW)

File: CopilotService.cs — ReloadMcpServersAsync

Direct dictionary assignment can silently overwrite a concurrent swap. Use _sessions.TryUpdate(sessionName, newState, state) with expected old value. Note: the structural test ReloadMcpServersAsync_PreservesHistoryInPlace asserts _sessions[sessionName] = newState — it will need updating too.

🟡 F5: State mutations after dictionary swap race with event handlers (2/3 models — NEW)

File: CopilotService.cs — ReloadMcpServersAsync

_sessions[sessionName] = newState;                // newState is live, events can fire
state.Info.SessionId = newSdkSession.SessionId;   // mutates shared Info AFTER publish
state.Info.IsResumed = false;                     // mutates shared Info AFTER publish

Since newState.Info = state.Info (shared reference), event handlers see stale SessionId. Move these assignments BEFORE the dictionary swap.

🟡 F6: Dashboard `ReloadMcpServersAsync` missing `_disposed` guard (2/3 models — STILL PRESENT)

File: Dashboard.razor — ReloadMcpServersAsync

After await CopilotService.ReloadMcpServersAsync(...), the component may be disposed. No _disposed check before session.History.Add(...) and InvokeAsync(SafeRefreshAsync). Other async methods in Dashboard guard against this.

🟡 F7: Silent success misleads caller on concurrent reload (2/3 models — STILL PRESENT)

File: CopilotService.cs — ReloadMcpServersAsync

When _recoveryInProgress.TryAdd returns false, the method returns normally. The Dashboard caller then unconditionally adds "✅ MCP servers reloaded" — false confirmation. Either throw or return a bool.

✅ Confirmed Clean

IsMcpError detection logic — two-tier approach (unconditional for "MCP server"/"mcp_server", context-gated for generic patterns) is correct (3/3 models)
AppendMcpServerGuidance helper — cleanly extracted, applied in both session creation paths (3/3 models)
Sliding window integration — isMcpFailure correctly added to isRecoverableError with MCP-specific recovery message (3/3 models)
Test coverage for IsMcpError — 20+ tests covering positive/negative/confusion scenarios (3/3 models)

Verdict: ⚠️ Request changes

The 3 prior findings (F1, F2, F6) remain unfixed. F3 (INV-16: handler-after-publish) is a new critical finding with consensus from all 3 models.

Required fixes (blocking):

Wrap IsProcessing cleanup in InvokeOnUI() and clear ALL companion fields per INV-1
Swap handler registration and dictionary publish per INV-16
Move SessionId/IsResumed assignments before dictionary swap

Recommended fixes (non-blocking):
4. Use TryUpdate per INV-15
5. Add _disposed guard in Dashboard
6. Throw/return bool on concurrent reload instead of silent success

F1+F2: Wrap IsProcessing cleanup in InvokeOnUIAsync for thread safety (INV-1). Clear ALL 9 companion fields: IsProcessing, IsResumed, HasUsedToolsThisTurn, ActiveToolCallCount, SendingFlag, ProcessingStartedAt, ToolCallCount, ProcessingPhase, ClearPermissionDenials(). F3 (INV-16): Register event handler BEFORE publishing newState to _sessions. Events from the new SDK session could arrive between the publish and handler registration and be permanently dropped. F4 (INV-15): Replace direct _sessions[key]=newState with TryUpdate(key, new, old). Prevents silently overwriting a concurrent replacement. F5: Move state.Info.SessionId and IsResumed assignments BEFORE dictionary publish so event handlers never see a stale SessionId on the shared Info object. F7: Throw InvalidOperationException on concurrent reload (instead of silent return) so Dashboard shows an honest error instead of a false success confirmation. Dashboard: add _disposed guard after await, update pre-message text, show specific message for concurrent-reload error vs MCP failure. Tests: +2 structural guards (ThrowsOnConcurrentReload, ClearsPermissionDenials). Update PreservesHistoryInPlace assertion to match TryUpdate pattern. All 68 PermissionDenialDetection tests pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

PureWeen · 2026-03-17T19:58:23Z

Round 4 Review — MCP Error Detection & `/mcp reload`

Latest commit: 6be93592 — "Fix ReloadMcpServersAsync: address all Round 3 review findings"
CI: ⚠️ No checks configured
Models: claude-opus-4.6 ×2, claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex (5/5 completed)

Previous Findings — All Fixed ✅

#	Round 3 Finding	Status
F1 🔴	`IsProcessing = false` off UI thread	✅ FIXED — wrapped in `await InvokeOnUIAsync(() => { ... })`
F2 🔴	Incomplete companion field cleanup (5 of 9 missing)	✅ FIXED — all 9 fields cleared: `IsProcessing`, `IsResumed`, `HasUsedToolsThisTurn`, `ActiveToolCallCount`, `SendingFlag`, `ProcessingStartedAt`, `ToolCallCount`, `ProcessingPhase`, `ClearPermissionDenials()`
F3 🔴	INV-16: handler registered AFTER dict publish	✅ FIXED — `newSdkSession.On(...)` now before `TryUpdate`
F4 🟡	INV-15: direct `_sessions[key] = newState`	✅ FIXED — uses `_sessions.TryUpdate(sessionName, newState, state)`
F5 🟡	`SessionId`/`IsResumed` assigned after publish	✅ FIXED — both assigned before `TryUpdate`
F6 🟡	Dashboard `_disposed` guard missing	✅ FIXED — `if (_disposed) return;` in both success and catch paths
F7 🟡	Silent success on concurrent reload	✅ FIXED — throws `InvalidOperationException`; Dashboard surfaces `⚠️ MCP reload already in progress`

All 5 models confirmed all 7 findings fixed.

New Finding — 🟡 MODERATE (5/5 models)

CopilotService.cs:3930–3946 — ClearPermissionDenials() skipped when session is idle

if (state.Info.IsProcessing)    // ← entire cleanup block is conditional
{
    ...
    await InvokeOnUIAsync(() =>
    {
        ...
        state.Info.ClearPermissionDenials();   // ← NEVER runs when IsProcessing == false
    });
}
// newState.Info = state.Info — same AgentSessionInfo object — stale denial queue persists

The typical /mcp reload scenario is: MCP server breaks mid-session, tool errors accumulate 2 denials, the model's turn completes, and the session is idle (IsProcessing = false) when the user types the command. Since the if block is skipped, the sliding-window queue (_recentToolResults) is not cleared. newState.Info = state.Info shares the same object, so 2 stale denials carry into the fresh SDK session. The first MCP tool error in the new session pushes the count to 3 → TryRecoverPermissionAsync fires immediately on top of a reload the user just completed — exactly the cascade this feature is meant to stop.

Fix: Move state.Info.ClearPermissionDenials() unconditionally outside the if block (it is internally lock-protected; no UI-thread requirement):

if (state.Info.IsProcessing)
{
    await InvokeOnUIAsync(() =>
    {
        FlushCurrentResponse(state);
        state.Info.IsProcessing = false;
        state.Info.IsResumed = false;
        state.HasUsedToolsThisTurn = false;
        Interlocked.Exchange(ref state.ActiveToolCallCount, 0);
        Interlocked.Exchange(ref state.SendingFlag, 0);
        state.Info.ProcessingStartedAt = null;
        state.Info.ToolCallCount = 0;
        state.Info.ProcessingPhase = 0;
    });
}
state.Info.ClearPermissionDenials();  // ← unconditional: always reset before the new session goes live

New Finding — 🟢 MINOR (2/5 models)

PermissionDenialDetectionTests.cs:293 — Structural test ReloadMcpServersAsync_ClearsPermissionDenials gives false confidence

Assert.Contains("ClearPermissionDenials()", source);

This assertion passes today (the call exists inside the if (IsProcessing) block), but doesn't verify the call is unconditional. It would continue to pass even if the fix above were never applied, providing no regression protection. After fixing the MODERATE above, tighten to match a string unique to the unconditional call site.

Verdict: ⚠️ Request changes

One targeted fix needed: move ClearPermissionDenials() outside the if (state.Info.IsProcessing) block. Everything else from Round 3 is cleanly addressed. The core detection logic (IsMcpError), the thread-safety fixes (INV-1/INV-16/INV-15), and the Dashboard UX are all correct.

…Round 4) ClearPermissionDenials() was inside the if (state.Info.IsProcessing) block, so it was silently skipped in the typical /mcp reload scenario where the session is idle when the user types the command. Result: stale denial count carries into the fresh SDK session. The first MCP tool error in the new session pushes the count to the threshold and triggers an unwanted TryRecoverPermissionAsync cascade -- exactly what /mcp reload is meant to stop. Fix: move ClearPermissionDenials() unconditionally after the if block. It is lock-protected internally and has no UI-thread requirement. Also tighten the structural test: the old Assert.Contains('ClearPermissionDenials()') passed even with the bug in place. New test verifies the unique marker comment that accompanies the unconditional call, providing real regression protection. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

PureWeen · 2026-03-17T21:17:29Z

Round 5 Review — MCP Error Detection & `/mcp reload`

Latest commit: c2c4dc2e — "Fix ReloadMcpServersAsync: clear permission denials unconditionally"
CI: ⚠️ No checks configured
Models: claude-opus-4.6 ×2, claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex (5/5 completed)

Round 4 Findings — All Fixed ✅

#	Round 4 Finding	Status
R4-1 🟡	`ClearPermissionDenials()` gated on `IsProcessing` — idle-session reload left stale denial queue	✅ FIXED — moved outside the `if` block, called unconditionally with unique marker comment
R4-2 🟢	Structural test gave false confidence (checked string exists anywhere in file)	✅ FIXED — test now uses `IndexOf` + 500-char proximity window to verify the call is near the marker comment

All Round 3 findings (F1–F7) remain fixed. Confirmed by all 5 models.

New Finding — 🟢 MINOR (2/5: claude-opus-4.6 ×2)

CopilotService.cs:3933 — Missing [MCP-RELOAD] diagnostic log tag when IsProcessing = false is cleared

The project invariant (.claude/skills/processing-state-safety/SKILL.md): "Every code path that sets IsProcessing = false MUST have a diagnostic log entry." Every other path logs before clearing ([ABORT], [ERROR], [COMPLETE], [WATCHDOG], [BRIDGE-COMPLETE], [INTERRUPTED]). The ReloadMcpServersAsync abort block logs [MCP-RELOAD] when the whole reload completes, but not specifically when it clears IsProcessing in the InvokeOnUIAsync callback.

Suggested fix: add Debug($"[MCP-RELOAD] '{sessionName}' abort cleared IsProcessing for reload") inside the InvokeOnUIAsync callback before state.Info.IsProcessing = false.

Reviewer Note (1/5 only — independently verified in source)

_permissionRecoveryAttempts not cleared in ReloadMcpServersAsync

Only sonnet-4.6 flagged this (below the 2-model consensus bar), but I verified directly against the PR source:

_permissionRecoveryAttempts.TryRemove is called in AbortSessionAsync (line 3555) and CompleteResponse (line 1060)
ReloadMcpServersAsync calls state.Session?.AbortAsync() (SDK-level) — NOT the service-level AbortSessionAsync — so the counter is NOT cleared

Failure scenario: MCP errors fire 3 auto-recovery attempts (TryRecoverPermissionAsync), exhausting the counter. User sees "Multiple recovery attempts failed", types /mcp reload. Fresh SDK session is created — but counter is still 3 (> 2). First MCP tool error in the new session immediately hits the exhaustion branch and shows "⚠️ Multiple recovery attempts failed. Use Settings → Save & Reconnect…" — the exact failure state the user was trying to escape.

Fix: Add _permissionRecoveryAttempts.TryRemove(sessionName, out _); unconditionally alongside ClearPermissionDenials(). Not blocking for merge, but worth a follow-up.

Verdict: ✅ Approve

All 3 rounds of findings (F1–F7, R4-1, R4-2) correctly addressed. The only consensus finding is 🟢 MINOR (a log tag). The _permissionRecoveryAttempts gap is real but narrow (requires prior auto-recovery exhaustion before reload), and it's a pre-existing pattern in other session-replacement paths.

Copilot AI assigned Copilot and PureWeen Mar 15, 2026

Copilot started work on behalf of PureWeen March 15, 2026 23:44 View session

Copilot AI changed the title ~~[WIP] [feedback] Address MCP server reactivation issues~~ Add MCP server error detection, auto-recovery, and /mcp reload command Mar 16, 2026

Copilot AI requested a review from PureWeen March 16, 2026 00:05

Copilot finished work on behalf of PureWeen March 16, 2026 00:05

PureWeen force-pushed the copilot/feedback-reactivate-mcp-server branch from 53d8d3a to c27df97 Compare March 16, 2026 14:57

PureWeen force-pushed the copilot/feedback-reactivate-mcp-server branch from b8863b2 to c610456 Compare March 17, 2026 12:37

Copilot AI and others added 9 commits March 17, 2026 13:12

Initial plan

8c72f03

PureWeen force-pushed the copilot/feedback-reactivate-mcp-server branch from cbbe337 to 822c71b Compare March 17, 2026 18:13

PureWeen marked this pull request as ready for review March 17, 2026 18:51

PureWeen merged commit 0cf2cc0 into main Mar 17, 2026

PureWeen deleted the copilot/feedback-reactivate-mcp-server branch March 17, 2026 22:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add MCP server error detection, auto-recovery, and `/mcp reload` command#381

Add MCP server error detection, auto-recovery, and `/mcp reload` command#381
PureWeen merged 11 commits intomainfrom
copilot/feedback-reactivate-mcp-server

Copilot AI commented Mar 15, 2026 •

edited

Loading

Uh oh!

PureWeen commented Mar 16, 2026

Uh oh!

PureWeen commented Mar 17, 2026

Uh oh!

PureWeen commented Mar 17, 2026

Uh oh!

PureWeen commented Mar 17, 2026

Uh oh!

PureWeen commented Mar 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Copilot AI commented Mar 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

MCP failure detection & auto-recovery

/mcp reload command

System message guidance

Tests

I tried to connect to the following addresses, but was blocked by firewall rules:

Comments on the Issue (you are @copilot in this section)

Uh oh!

PureWeen commented Mar 16, 2026

PR #381 Re-Review — MCP Error Detection & /mcp reload

Fixes Applied

Remaining Finding (not blocking)

Verdict: ✅ Approve

Uh oh!

PureWeen commented Mar 17, 2026

PR #381 Review — 5-Model Consensus

🔴 CRITICAL — ReloadMcpServersAsync sets IsProcessing = false off the UI thread (4/5 models)

🟡 MODERATE — ReloadMcpServersAsync does not call ClearPermissionDenials() (4/5 models)

🟡 MODERATE — Dashboard.razor auto-dismiss missing _disposed guard (3/5 models)

🟢 MINOR — ReloadMcpServersAsync silent success when recovery already in-progress (2/5 models)

✅ Confirmed NOT bugs (consensus)

Test coverage gaps

Verdict: ⚠️ Request changes

Uh oh!

PureWeen commented Mar 17, 2026

⚠️ Round 3 Review — PR #381 "Add MCP server error detection, auto-recovery, and /mcp reload"

Previous Findings Status

🔴 F1: IsProcessing = false set off UI thread (3/3 models — STILL PRESENT)

🔴 F2: Incomplete IsProcessing companion field cleanup (3/3 models — STILL PRESENT)

🔴 F3: INV-16 violation — handler registered AFTER dictionary publish (3/3 models — NEW)

🟡 F4: INV-15 violation — _sessions[key] = newState instead of TryUpdate (3/3 models — NEW)

🟡 F5: State mutations after dictionary swap race with event handlers (2/3 models — NEW)

🟡 F6: Dashboard ReloadMcpServersAsync missing _disposed guard (2/3 models — STILL PRESENT)

🟡 F7: Silent success misleads caller on concurrent reload (2/3 models — STILL PRESENT)

✅ Confirmed Clean

Verdict: ⚠️ Request changes

Uh oh!

PureWeen commented Mar 17, 2026

Round 4 Review — MCP Error Detection & /mcp reload

Previous Findings — All Fixed ✅

New Finding — 🟡 MODERATE (5/5 models)

New Finding — 🟢 MINOR (2/5 models)

Verdict: ⚠️ Request changes

Uh oh!

PureWeen commented Mar 17, 2026

Round 5 Review — MCP Error Detection & /mcp reload

Round 4 Findings — All Fixed ✅

New Finding — 🟢 MINOR (2/5: claude-opus-4.6 ×2)

Reviewer Note (1/5 only — independently verified in source)

Verdict: ✅ Approve

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Copilot AI commented Mar 15, 2026 •

edited

Loading

`/mcp reload` command

PR #381 Re-Review — MCP Error Detection & `/mcp reload`

🔴 CRITICAL — `ReloadMcpServersAsync` sets `IsProcessing = false` off the UI thread (4/5 models)

🟡 MODERATE — `ReloadMcpServersAsync` does not call `ClearPermissionDenials()` (4/5 models)

🟡 MODERATE — `Dashboard.razor` auto-dismiss missing `_disposed` guard (3/5 models)

🟢 MINOR — `ReloadMcpServersAsync` silent success when recovery already in-progress (2/5 models)

⚠️ Round 3 Review — PR #381 "Add MCP server error detection, auto-recovery, and `/mcp reload`"

🔴 F1: `IsProcessing = false` set off UI thread (3/3 models — STILL PRESENT)

🟡 F4: INV-15 violation — `_sessions[key] = newState` instead of `TryUpdate` (3/3 models — NEW)

🟡 F6: Dashboard `ReloadMcpServersAsync` missing `_disposed` guard (2/3 models — STILL PRESENT)

Round 4 Review — MCP Error Detection & `/mcp reload`

Round 5 Review — MCP Error Detection & `/mcp reload`