Skip to content

Add MCP server error detection, auto-recovery, and /mcp reload command#381

Merged
PureWeen merged 11 commits intomainfrom
copilot/feedback-reactivate-mcp-server
Mar 17, 2026
Merged

Add MCP server error detection, auto-recovery, and /mcp reload command#381
PureWeen merged 11 commits intomainfrom
copilot/feedback-reactivate-mcp-server

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented Mar 15, 2026

When MCP servers go down mid-session, the model loops asking users to run /mcp commands to debug configuration — a dead end for non-technical users who have no idea what ~/.copilot/mcp-config.json is.

MCP failure detection & auto-recovery

New IsMcpError() detector in CopilotService.Events.cs catches MCP-specific tool errors (ECONNREFUSED, connection refused, server disconnected, transport error, failed to start, server process exited, spawn ENOENT). Plugged into the existing sliding-window recovery flow alongside permission denials and shell failures:

var isRecoverableError = isPermissionDenial || isShellFailure || isMcpFailure;

When 3/5 recent tool results are MCP failures, auto-triggers TryRecoverPermissionAsync (session reconnect with fresh MCP config) and shows an actionable message pointing to /mcp reload.

/mcp reload command

Implements the already-advertised-in-autocomplete but never-wired reload subcommand. Creates a new session with freshly loaded MCP server configs from disk, switches to it, and shows troubleshooting tips on failure.

System message guidance

New AppendMcpServerGuidance() helper injects configured MCP server names into the session system prompt with explicit instructions to suggest /mcp reload on failure — prevents the model from looping on "please verify your MCP setup" messages. Applied in both CreateSessionAsync and BuildFreshSessionConfig.

Tests

  • 20+ new tests for IsMcpError() (positive/negative/confusion with other detectors)
  • Structural regression guards ensuring isRecoverableError includes isMcpFailure and recovery messages reference /mcp reload
  • Widened structural test search windows (2000→3000 chars) to accommodate new system message block in BuildFreshSessionConfig

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • 192.0.2.1
    • Triggering command: /usr/share/dotnet/dotnet /usr/share/dotnet/dotnet exec --runtimeconfig /home/REDACTED/work/PolyPilot/PolyPilot/PolyPilot.Tests/bin/Debug/net10.0/PolyPilot.Tests.runtimeconfig.json --depsfile /home/REDACTED/work/PolyPilot/PolyPilot/PolyPilot.Tests/bin/Debug/net10.0/PolyPilot.Tests.deps.json /home/REDACTED/work/PolyPilot/PolyPilot/PolyPilot.Tests/bin/Debug/net10.0/testhost.dll --port 40731 --endpoint 127.0.0.1:040731 --role client --parentprocessid 5020 --telemetryoptedin false (packet block)
    • Triggering command: /usr/share/dotnet/dotnet /usr/share/dotnet/dotnet exec --runtimeconfig /home/REDACTED/work/PolyPilot/PolyPilot/PolyPilot.Tests/bin/Debug/net10.0/PolyPilot.Tests.runtimeconfig.json --depsfile /home/REDACTED/work/PolyPilot/PolyPilot/PolyPilot.Tests/bin/Debug/net10.0/PolyPilot.Tests.deps.json /home/REDACTED/work/PolyPilot/PolyPilot/PolyPilot.Tests/bin/Debug/net10.0/testhost.dll --port 42233 --endpoint 127.0.0.1:042233 --role client --parentprocessid 6432 --telemetryoptedin false (packet block)
  • 1k9vsblobprodcus379.vsblob.vsassets.io
    • Triggering command: /usr/bin/dotnet dotnet restore --no-dependencies /tmp/codeql-scratch-abbd4936ea2bb08e/dbs/csharp/working/55BB33C8E1046D06DB6BC8FB401555C0/missingpackages_workingdir --packages /tmp/codeql-scratch-abbd4936ea2bb08e/dbs/csharp/working/missingpackages /p:DisableImplicitNuGetFallbackFolder=true --verbosity normal --configfile /tmp/codeql-scratch-abbd4936ea2bb08e/dbs/csharp/working/nugetconfig/nuget.config --force (dns block)
  • 4myvsblobprodcus32.vsblob.vsassets.io
    • Triggering command: /usr/bin/dotnet dotnet restore --no-dependencies /tmp/codeql-scratch-abbd4936ea2bb08e/dbs/csharp/working/55BB33C8E1046D06DB6BC8FB401555C0/missingpackages_workingdir --packages /tmp/codeql-scratch-abbd4936ea2bb08e/dbs/csharp/working/missingpackages /p:DisableImplicitNuGetFallbackFolder=true --verbosity normal --configfile /tmp/codeql-scratch-abbd4936ea2bb08e/dbs/csharp/working/nugetconfig/nuget.config --force (dns block)
  • kijvsblobprodcus387.vsblob.vsassets.io
    • Triggering command: /usr/bin/dotnet dotnet restore --no-dependencies /tmp/codeql-scratch-abbd4936ea2bb08e/dbs/csharp/working/55BB33C8E1046D06DB6BC8FB401555C0/missingpackages_workingdir --packages /tmp/codeql-scratch-abbd4936ea2bb08e/dbs/csharp/working/missingpackages /p:DisableImplicitNuGetFallbackFolder=true --verbosity normal --configfile /tmp/codeql-scratch-abbd4936ea2bb08e/dbs/csharp/working/nugetconfig/nuget.config --force (dns block)
  • pc2vsblobprodcus360.vsblob.vsassets.io
    • Triggering command: /usr/bin/dotnet dotnet restore --no-dependencies /tmp/codeql-scratch-abbd4936ea2bb08e/dbs/csharp/working/55BB33C8E1046D06DB6BC8FB401555C0/missingpackages_workingdir --packages /tmp/codeql-scratch-abbd4936ea2bb08e/dbs/csharp/working/missingpackages /p:DisableImplicitNuGetFallbackFolder=true --verbosity normal --configfile /tmp/codeql-scratch-abbd4936ea2bb08e/dbs/csharp/working/nugetconfig/nuget.config --force (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Original prompt

This section details on the original issue you should resolve

<issue_title>[feedback] MCP server reactivation</issue_title>
<issue_description>[feedback] MCP server reactivation
when an MCP is not available, the model is stuck asking to validate with the command:

/mcp <show, add>

the list of mcps setup and active.
MCP setup is abstracted, from polypilot, and the NON technical user, or just a busy user may not know if polypilot looks at a system wide MCP setup ~/.copilot/mcp-config.json vs MCP servers enabled inside VSCode, for example, having the perspections that polypilot is not able to fetch the MCP, reactivate them or suggest to non technical users how to get unblocked.

It happened to me a couple of times, and my first reaction was:

  • copy the prompt into vscode and run it there.

You want to consider this scenarios, considering that this tool will be used by non tecnical people too, for "office" tasks which leverage AIs too. (daily-staup, management retro, team activity retro, active incidents, pending bugs etc...) -> not executed just by engineer, but also managers, support team.

I can clearly see -> Cannot post the pictures because some mcps are not public
that copilot cli shows that the mcps servers are active.

And polypilot, does not fetch them, does not restart,
the model just ask in loop to check if the issue is on copilot cli side -> Which is not.

Image</issue_description>

Comments on the Issue (you are @copilot in this section)


📱 Kick off Copilot coding agent tasks wherever you are with GitHub Mobile, available on iOS and Android.

Copilot AI changed the title [WIP] [feedback] Address MCP server reactivation issues Add MCP server error detection, auto-recovery, and /mcp reload command Mar 16, 2026
Copilot AI requested a review from PureWeen March 16, 2026 00:05
@PureWeen PureWeen force-pushed the copilot/feedback-reactivate-mcp-server branch from 53d8d3a to c27df97 Compare March 16, 2026 14:57
@PureWeen
Copy link
Copy Markdown
Owner

PR #381 Re-Review — MCP Error Detection & /mcp reload

Tests: ✅ 2701 passed, 0 failed | Rebased onto origin/main

Two fix commits applied:


Fixes Applied

Commit 1 (c27df976): IsMcpError — require MCP context for generic network patterns

"connection refused", "server disconnected", "transport error", "failed to start", "server process exited" matched any tool error without MCP context. An SSH failure, database connection error, or Docker tool call would trigger the 3/5 sliding window and fire session reconnect + "MCP server errors detected" message.

Fix: these patterns now require "mcp" in the error text. "MCP server" and "mcp_server" remain unconditional.

Commit 2 (b8863b2b): Two issues in ReloadMcpServersAsync + complete the ECONNREFUSED/spawn ENOENT fix

(a) Build-breaking: _activeSession doesn't exist (field is expandedSession). Would have caused CS0103 on every platform build.

(b) Incomplete session switch: Missing CopilotService.SetActiveSession() meant the service layer (watchdog, bridge sync, persistence) still tracked the old session as active after /mcp reload. Fixed to mirror the /new handler pattern exactly.

(c) ECONNREFUSED / spawn ENOENT unconditional (5/5 model consensus): ECONNREFUSED is a standard libuv errno emitted by any failed TCP connection (fetch tools, DB clients, Redis, HTTP APIs). spawn ENOENT fires for any missing binary (jq, git, docker) — not just MCP servers. Both now require "mcp" context, same as the other generic patterns.


Remaining Finding (not blocking)

🟢 Dashboard.razor — if MCP auto-recovery (TryRecoverPermissionAsync) fires for a permanently dead MCP server, the session reconnect won't help (it doesn't restart MCP processes). The existing _permissionRecoveryAttempts circuit breaker (stops after 2 attempts) prevents an infinite loop. The /mcp reload path is the correct remedy, and the recovery message already points users to it. No code change needed — the behavior is acceptable.


Verdict: ✅ Approve

All consensus findings are resolved. The feature is functionally correct: MCP detection is now precise (only fires on genuine MCP errors), /mcp reload creates a fresh session and switches to it properly, and AppendMcpServerGuidance primes the model to suggest the command proactively.

Co-reviewed by: Copilot 223556219+Copilot@users.noreply.github.com

@PureWeen PureWeen force-pushed the copilot/feedback-reactivate-mcp-server branch from b8863b2 to c610456 Compare March 17, 2026 12:37
@PureWeen
Copy link
Copy Markdown
Owner

PR #381 Review — 5-Model Consensus

CI: No checks configured
Prior reviews: 1 existing comment (author confirming two fix commits applied)


🔴 CRITICAL — ReloadMcpServersAsync sets IsProcessing = false off the UI thread (4/5 models)

ReloadMcpServersAsync is called as _ = ReloadMcpServersAsync(session) (fire-and-forget) from a Blazor event handler. After await state.Session?.AbortAsync(), the continuation resumes on a ThreadPool thread. state.Info.IsProcessing = false (plus ProcessingStartedAt, ToolCallCount, ProcessingPhase) are then set off the UI thread.

The documented invariant is unambiguous: all mutations to state.Info.IsProcessing must be marshaled to the UI thread via InvokeOnUI(). This is the exact pattern that caused 7+ PRs of fix/regression cycles (processing-state-safety skill, INV-1). Every other path that clears IsProcessing (8 sites in Events.cs and CopilotService.cs) wraps those mutations in InvokeOnUI() or runs directly on the UI thread.

Fix: Wrap the abort cleanup block in InvokeOnUI(() => { ... }):

if (state.Info.IsProcessing)
{
    try { await state.Session?.AbortAsync()!; } catch { }
    InvokeOnUI(() =>
    {
        FlushCurrentResponse(state);
        state.Info.IsProcessing = false;
        state.Info.ProcessingStartedAt = null;
        state.Info.ToolCallCount = 0;
        state.Info.ProcessingPhase = 0;
        state.Info.ClearPermissionDenials();
    });
}

🟡 MODERATE — ReloadMcpServersAsync does not call ClearPermissionDenials() (4/5 models)

The sliding window (3-of-5 tool errors → TryRecoverPermissionAsync) is never cleared. newState.Info = state.Info shares the same AgentSessionInfo object, so all stale denial counts carry into the new session.

If the MCP server is still broken after reload, the very first new tool error combines with 2 stale entries to immediately re-hit denialCount == 3 — triggering TryRecoverPermissionAsync on top of the reload the user just triggered. Every other recovery/termination path (9 sites across Events.cs and CopilotService.cs) calls state.Info.ClearPermissionDenials().

Fix: Add state.Info.ClearPermissionDenials(); to the cleanup block (shown above in the InvokeOnUI lambda).


🟡 MODERATE — Dashboard.razor auto-dismiss missing _disposed guard (3/5 models)

The new 8-second auto-dismiss for orchestration completion fires InvokeAsync after Task.Delay(8000). If the Dashboard component is disposed within 8 seconds (user navigates away), InvokeAsync throws ObjectDisposedException. Every other delayed background task in this component guards with if (_disposed) return; before calling InvokeAsync.

Fix:

await InvokeAsync(() =>
{
    if (_disposed) return;
    _groupPhases.Remove(groupId);
    ScheduleRender();
});

🟢 MINOR — ReloadMcpServersAsync silent success when recovery already in-progress (2/5 models)

When _recoveryInProgress.TryAdd fails, the method returns silently with no exception. The Dashboard caller already added "🔄 Reloading MCP servers..." before the call and unconditionally adds "✅ MCP servers reloaded..." after — giving the user a false success confirmation.

Fix: Throw InvalidOperationException (or return a bool), so the Dashboard catch block can show an honest message.


✅ Confirmed NOT bugs (consensus)

  • IsMcpError pattern matching: The two-tier approach (unconditional for "MCP server"/"mcp_server", MCP-context-gated for ECONNREFUSED/spawn ENOENT/etc.) is correct and avoids false positives from SSH/Docker/DB errors.
  • _recoveryInProgress concurrency guard: Both TryRecoverPermissionAsync and ReloadMcpServersAsync use the same _recoveryInProgress: ConcurrentDictionary<string, bool> — concurrent reload + permission recovery for the same session is correctly prevented.
  • _sessions[sessionName] = newState: Pre-existing pattern also used in TryRecoverPermissionAsync. Not a new regression.

Test coverage gaps

  1. ReloadMcpServersAsync_ClearsPermissionDenials — reload should reset the sliding window so one tool error after reload doesn't immediately re-trigger recovery
  2. ReloadMcpServersAsync_ReturnsEarlyWhenInProgress — calling reload while recovery is running should surface an honest message, not silent success

Verdict: ⚠️ Request changes

Two targeted fixes needed: the IsProcessing thread safety violation (wrapping in InvokeOnUI) and the missing ClearPermissionDenials() (which can create an infinite recovery loop). The _disposed guard in the auto-dismiss Task.Run and the silent-success behavior are lower severity. Core IsMcpError detection logic and ReloadMcpServersAsync architecture are solid.

Copilot AI and others added 9 commits March 17, 2026 13:12
…nd system message guidance

- Add IsMcpError() detection method in CopilotService.Events.cs for MCP-specific
  tool failures (connection refused, ECONNREFUSED, server disconnected, transport error,
  failed to start, server process exited, spawn ENOENT)
- Integrate MCP errors into existing recovery flow (isRecoverableError now includes
  isMcpFailure alongside isPermissionDenial and isShellFailure)
- Show MCP-specific recovery message with /mcp reload hint when threshold is reached
- Implement /mcp reload command in Dashboard.razor - creates new session with fresh
  MCP server configuration
- Add MCP server awareness to session system messages so the model knows what's
  configured and can suggest /mcp reload instead of looping
- Update /help text and GetMcpServerInfo output to mention /mcp reload
- Add 20+ tests for IsMcpError detection and structural regression guards
- Update structural test search windows from 2000 to 3000 chars to accommodate
  new MCP guidance block in BuildFreshSessionConfig

Co-authored-by: PureWeen <5375137+PureWeen@users.noreply.github.com>
…ad session message target

- Extract duplicated MCP system message block into AppendMcpServerGuidance() helper
- Fix ReloadMcpServersAsync to add success message to the NEW session (not old one)
- Switch _activeSession after adding the message so it's visible to the user

Co-authored-by: PureWeen <5375137+PureWeen@users.noreply.github.com>
Generic patterns like 'connection refused', 'server disconnected',
'transport error', 'failed to start', and 'server process exited'
are extremely common in non-MCP tool errors (SSH, Docker, databases,
kubectl). Without MCP context these would trigger unwanted session
reconnects on any tool that hits a network or process failure.

Fix: MCP-specific patterns (ECONNREFUSED, mcp_server, spawn ENOENT,
'MCP server') match unconditionally. Generic network/process patterns
now require 'mcp' to appear in the error text to match.

Updated tests: moved 4 standalone generic patterns from the 'returns
true' suite to 'returns false', added non-MCP false-positive cases
(SSH, Docker, kubectl).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fix 1: ReloadMcpServersAsync build error and incomplete session switch
- _activeSession does not exist; corrected to expandedSession
- Add CopilotService.SetActiveSession() call to keep service-layer active
  session in sync with bridge, watchdog, and persistence (mirrors /new handler)
- Add _explicitlyCollapsed = false to match /new session switch pattern
- Add newSession.MessageCount sync after History.Add (consistent with
  all other History.Add sites in Dashboard.razor)

Fix 2: IsMcpError — move ECONNREFUSED and spawn ENOENT behind mcp guard
ECONNREFUSED is a standard Node.js/libuv errno emitted by any failed TCP
connection (Docker, databases, HTTP APIs, Redis). spawn ENOENT fires for
any missing binary (jq, git, docker) — not just MCP server processes.
Both are now gated behind hasMcpContext, leaving only the unambiguous
'MCP server' and 'mcp_server' patterns unconditional.

Updated tests: moved 3 standalone cases (ECONNREFUSED, connect ECONNREFUSED,
spawn ENOENT without mcp context) to false list; added mcp-context variants
to true list; added git spawn ENOENT false-positive case.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…conditional

Only 'MCP server' and 'mcp_server' match unconditionally. All other patterns
(ECONNREFUSED, spawn ENOENT, connection refused, etc.) require 'mcp' context
to avoid false positives from unrelated errors.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…-session UX

Three bugs fixed:
1. Repeated /mcp reload would throw 'Session already exists' because
   session.Name + " (reloaded)" collides on second use, showing
   the troubleshooting error tips instead of working.
2. Reloading from an already-reloaded session would stack the suffix:
   'Foo (reloaded) (reloaded)' instead of 'Foo (reloaded 2)'.
3. Error catch block showed MCP troubleshooting tips for all errors;
   now only shown for MCP-related errors.

Fix: strip existing '(reloaded...)' suffix from baseName, use counter
to find a unique name. Success message clarifies why a new session is
created (SDK loads MCP configs once at session start).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tory

Instead of creating a brand new named session (losing all history),
ReloadMcpServersAsync now replaces the underlying SDK CopilotSession
while reusing the existing AgentSessionInfo (history, messages, name).

The SDK has no API to reload MCP servers mid-session — they're baked
into SessionConfig at creation time. A fresh SDK session is required
so the CLI re-launches MCP server processes. By reusing the same
AgentSessionInfo we preserve the full conversation history.

Pattern follows TryRecoverPermissionAsync (in-place replacement):
- Abort any in-progress request
- Cancel all timers/watchdogs
- Dispose old SDK session
- BuildFreshSessionConfig (reloads MCP servers from disk)
- CopilotClient.CreateSessionAsync with fresh config
- New SessionState { Session = newSdk, Info = existingInfo }
- Replace _sessions[name] in-place, re-register event handler

Dashboard.razor simplified: calls CopilotService.ReloadMcpServersAsync,
adds success message to existing session (no session switch needed).

Tests: 2 structural regression guards added to PermissionDenialDetectionTests
verifying the service method signature and in-place history preservation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…persistence

Three correctness fixes per code review:

1. Set state.IsOrphaned = true before disposal so stale event callbacks
   from the disposed SDK session bail out (via IsOrphaned guard in
   HandleSessionEvent) and don't corrupt the shared AgentSessionInfo.
   Follows the same CRITICAL pattern used in the reconnect path (~line 3152).

2. Add reentrance guard using _recoveryInProgress (same dict as
   TryRecoverPermissionAsync). Prevents double-reload if user triggers
   /mcp reload twice before the first completes.

3. Call FlushSaveActiveSessionsToDisk() after updating state.Info.SessionId
   so the new SDK session ID is persisted immediately. Without this, a crash
   after reload leaves active-sessions.json pointing at the old dead session ID.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen force-pushed the copilot/feedback-reactivate-mcp-server branch from cbbe337 to 822c71b Compare March 17, 2026 18:13
@PureWeen
Copy link
Copy Markdown
Owner

⚠️ Round 3 Review — PR #381 "Add MCP server error detection, auto-recovery, and /mcp reload"

Reviewers: 3/5 models completed (Opus, Gemini, GPT-Codex) | CI: ⚠️ No checks configured | Files: 5 changed (+476 lines)


Previous Findings Status

# Finding Round 2 Round 3
1 🔴 IsProcessing = false off UI thread ⚠️ Request changes STILL PRESENT (3/3)
2 🟡 Missing ClearPermissionDenials() + companion fields ⚠️ Request changes STILL PRESENT (3/3)
3 🟡 Dashboard _disposed guard missing ⚠️ Request changes STILL PRESENT (2/3)
4 🟢 Silent success on concurrent reload Noted ⚠️ STILL PRESENT (2/3)

🔴 F1: IsProcessing = false set off UI thread (3/3 models — STILL PRESENT)

File: CopilotService.csReloadMcpServersAsync

Called via fire-and-forget _ = ReloadMcpServersAsync(session) from Dashboard. After await state.Session?.AbortAsync()!, execution resumes on a ThreadPool thread. state.Info.IsProcessing = false is then set off the UI thread, violating INV-1/INV-2. This is the exact pattern that caused 7+ PRs of fix/regression cycles.

Fix: Wrap the cleanup block in InvokeOnUI():

InvokeOnUI(() =>
{
    FlushCurrentResponse(state);
    state.Info.IsProcessing = false;
    state.Info.IsResumed = false;
    state.Info.HasUsedToolsThisTurn = false;
    Interlocked.Exchange(ref state.ActiveToolCallCount, 0);
    state.Info.ProcessingStartedAt = null;
    state.Info.ToolCallCount = 0;
    state.Info.ProcessingPhase = 0;
    Interlocked.Exchange(ref state.SendingFlag, 0);
    state.Info.ClearPermissionDenials();
});

🔴 F2: Incomplete IsProcessing companion field cleanup (3/3 models — STILL PRESENT)

File: CopilotService.csReloadMcpServersAsync

The processing cleanup only clears 4 fields (IsProcessing, ProcessingStartedAt, ToolCallCount, ProcessingPhase). Missing per INV-1 checklist:

  • IsResumed = false
  • HasUsedToolsThisTurn = false
  • ActiveToolCallCount = 0
  • SendingFlag = 0
  • ClearPermissionDenials()

The missing ClearPermissionDenials() is particularly impactful: stale denial counts carry into the new session, and the very first new tool error can immediately re-trigger TryRecoverPermissionAsync (stale 2 + new 1 = threshold 3).

🔴 F3: INV-16 violation — handler registered AFTER dictionary publish (3/3 models — NEW)

File: CopilotService.csReloadMcpServersAsync

_sessions[sessionName] = newState;                              // visible to event loop
newSdkSession.On(evt => HandleSessionEvent(newState, evt));     // handler registered AFTER

Events from the new SDK session can arrive between these lines and be dropped permanently. The established pattern (per INV-16) is handler-before-publish. Swap these two lines.

🟡 F4: INV-15 violation — _sessions[key] = newState instead of TryUpdate (3/3 models — NEW)

File: CopilotService.csReloadMcpServersAsync

Direct dictionary assignment can silently overwrite a concurrent swap. Use _sessions.TryUpdate(sessionName, newState, state) with expected old value. Note: the structural test ReloadMcpServersAsync_PreservesHistoryInPlace asserts _sessions[sessionName] = newState — it will need updating too.

🟡 F5: State mutations after dictionary swap race with event handlers (2/3 models — NEW)

File: CopilotService.csReloadMcpServersAsync

_sessions[sessionName] = newState;                // newState is live, events can fire
state.Info.SessionId = newSdkSession.SessionId;   // mutates shared Info AFTER publish
state.Info.IsResumed = false;                     // mutates shared Info AFTER publish

Since newState.Info = state.Info (shared reference), event handlers see stale SessionId. Move these assignments BEFORE the dictionary swap.

🟡 F6: Dashboard ReloadMcpServersAsync missing _disposed guard (2/3 models — STILL PRESENT)

File: Dashboard.razorReloadMcpServersAsync

After await CopilotService.ReloadMcpServersAsync(...), the component may be disposed. No _disposed check before session.History.Add(...) and InvokeAsync(SafeRefreshAsync). Other async methods in Dashboard guard against this.

🟡 F7: Silent success misleads caller on concurrent reload (2/3 models — STILL PRESENT)

File: CopilotService.csReloadMcpServersAsync

When _recoveryInProgress.TryAdd returns false, the method returns normally. The Dashboard caller then unconditionally adds "✅ MCP servers reloaded" — false confirmation. Either throw or return a bool.


✅ Confirmed Clean

  • IsMcpError detection logic — two-tier approach (unconditional for "MCP server"/"mcp_server", context-gated for generic patterns) is correct (3/3 models)
  • AppendMcpServerGuidance helper — cleanly extracted, applied in both session creation paths (3/3 models)
  • Sliding window integrationisMcpFailure correctly added to isRecoverableError with MCP-specific recovery message (3/3 models)
  • Test coverage for IsMcpError — 20+ tests covering positive/negative/confusion scenarios (3/3 models)

Verdict: ⚠️ Request changes

The 3 prior findings (F1, F2, F6) remain unfixed. F3 (INV-16: handler-after-publish) is a new critical finding with consensus from all 3 models.

Required fixes (blocking):

  1. Wrap IsProcessing cleanup in InvokeOnUI() and clear ALL companion fields per INV-1
  2. Swap handler registration and dictionary publish per INV-16
  3. Move SessionId/IsResumed assignments before dictionary swap

Recommended fixes (non-blocking):
4. Use TryUpdate per INV-15
5. Add _disposed guard in Dashboard
6. Throw/return bool on concurrent reload instead of silent success

@PureWeen PureWeen marked this pull request as ready for review March 17, 2026 18:51
F1+F2: Wrap IsProcessing cleanup in InvokeOnUIAsync for thread safety (INV-1).
Clear ALL 9 companion fields: IsProcessing, IsResumed, HasUsedToolsThisTurn,
ActiveToolCallCount, SendingFlag, ProcessingStartedAt, ToolCallCount,
ProcessingPhase, ClearPermissionDenials().

F3 (INV-16): Register event handler BEFORE publishing newState to _sessions.
Events from the new SDK session could arrive between the publish and handler
registration and be permanently dropped.

F4 (INV-15): Replace direct _sessions[key]=newState with TryUpdate(key, new, old).
Prevents silently overwriting a concurrent replacement.

F5: Move state.Info.SessionId and IsResumed assignments BEFORE dictionary publish
so event handlers never see a stale SessionId on the shared Info object.

F7: Throw InvalidOperationException on concurrent reload (instead of silent return)
so Dashboard shows an honest error instead of a false success confirmation.

Dashboard: add _disposed guard after await, update pre-message text,
show specific message for concurrent-reload error vs MCP failure.

Tests: +2 structural guards (ThrowsOnConcurrentReload, ClearsPermissionDenials).
Update PreservesHistoryInPlace assertion to match TryUpdate pattern.
All 68 PermissionDenialDetection tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Owner

Round 4 Review — MCP Error Detection & /mcp reload

Latest commit: 6be93592 — "Fix ReloadMcpServersAsync: address all Round 3 review findings"
CI: ⚠️ No checks configured
Models: claude-opus-4.6 ×2, claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex (5/5 completed)


Previous Findings — All Fixed ✅

# Round 3 Finding Status
F1 🔴 IsProcessing = false off UI thread FIXED — wrapped in await InvokeOnUIAsync(() => { ... })
F2 🔴 Incomplete companion field cleanup (5 of 9 missing) FIXED — all 9 fields cleared: IsProcessing, IsResumed, HasUsedToolsThisTurn, ActiveToolCallCount, SendingFlag, ProcessingStartedAt, ToolCallCount, ProcessingPhase, ClearPermissionDenials()
F3 🔴 INV-16: handler registered AFTER dict publish FIXEDnewSdkSession.On(...) now before TryUpdate
F4 🟡 INV-15: direct _sessions[key] = newState FIXED — uses _sessions.TryUpdate(sessionName, newState, state)
F5 🟡 SessionId/IsResumed assigned after publish FIXED — both assigned before TryUpdate
F6 🟡 Dashboard _disposed guard missing FIXEDif (_disposed) return; in both success and catch paths
F7 🟡 Silent success on concurrent reload FIXED — throws InvalidOperationException; Dashboard surfaces ⚠️ MCP reload already in progress

All 5 models confirmed all 7 findings fixed.


New Finding — 🟡 MODERATE (5/5 models)

CopilotService.cs:3930–3946ClearPermissionDenials() skipped when session is idle

if (state.Info.IsProcessing)    // ← entire cleanup block is conditional
{
    ...
    await InvokeOnUIAsync(() =>
    {
        ...
        state.Info.ClearPermissionDenials();   // ← NEVER runs when IsProcessing == false
    });
}
// newState.Info = state.Info — same AgentSessionInfo object — stale denial queue persists

The typical /mcp reload scenario is: MCP server breaks mid-session, tool errors accumulate 2 denials, the model's turn completes, and the session is idle (IsProcessing = false) when the user types the command. Since the if block is skipped, the sliding-window queue (_recentToolResults) is not cleared. newState.Info = state.Info shares the same object, so 2 stale denials carry into the fresh SDK session. The first MCP tool error in the new session pushes the count to 3 → TryRecoverPermissionAsync fires immediately on top of a reload the user just completed — exactly the cascade this feature is meant to stop.

Fix: Move state.Info.ClearPermissionDenials() unconditionally outside the if block (it is internally lock-protected; no UI-thread requirement):

if (state.Info.IsProcessing)
{
    await InvokeOnUIAsync(() =>
    {
        FlushCurrentResponse(state);
        state.Info.IsProcessing = false;
        state.Info.IsResumed = false;
        state.HasUsedToolsThisTurn = false;
        Interlocked.Exchange(ref state.ActiveToolCallCount, 0);
        Interlocked.Exchange(ref state.SendingFlag, 0);
        state.Info.ProcessingStartedAt = null;
        state.Info.ToolCallCount = 0;
        state.Info.ProcessingPhase = 0;
    });
}
state.Info.ClearPermissionDenials();  // ← unconditional: always reset before the new session goes live

New Finding — 🟢 MINOR (2/5 models)

PermissionDenialDetectionTests.cs:293 — Structural test ReloadMcpServersAsync_ClearsPermissionDenials gives false confidence

Assert.Contains("ClearPermissionDenials()", source);

This assertion passes today (the call exists inside the if (IsProcessing) block), but doesn't verify the call is unconditional. It would continue to pass even if the fix above were never applied, providing no regression protection. After fixing the MODERATE above, tighten to match a string unique to the unconditional call site.


Verdict: ⚠️ Request changes

One targeted fix needed: move ClearPermissionDenials() outside the if (state.Info.IsProcessing) block. Everything else from Round 3 is cleanly addressed. The core detection logic (IsMcpError), the thread-safety fixes (INV-1/INV-16/INV-15), and the Dashboard UX are all correct.

…Round 4)

ClearPermissionDenials() was inside the if (state.Info.IsProcessing) block,
so it was silently skipped in the typical /mcp reload scenario where the
session is idle when the user types the command.

Result: stale denial count carries into the fresh SDK session. The first MCP
tool error in the new session pushes the count to the threshold and triggers
an unwanted TryRecoverPermissionAsync cascade -- exactly what /mcp reload is
meant to stop.

Fix: move ClearPermissionDenials() unconditionally after the if block. It is
lock-protected internally and has no UI-thread requirement.

Also tighten the structural test: the old Assert.Contains('ClearPermissionDenials()')
passed even with the bug in place. New test verifies the unique marker comment
that accompanies the unconditional call, providing real regression protection.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Owner

Round 5 Review — MCP Error Detection & /mcp reload

Latest commit: c2c4dc2e — "Fix ReloadMcpServersAsync: clear permission denials unconditionally"
CI: ⚠️ No checks configured
Models: claude-opus-4.6 ×2, claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex (5/5 completed)


Round 4 Findings — All Fixed ✅

# Round 4 Finding Status
R4-1 🟡 ClearPermissionDenials() gated on IsProcessing — idle-session reload left stale denial queue FIXED — moved outside the if block, called unconditionally with unique marker comment
R4-2 🟢 Structural test gave false confidence (checked string exists anywhere in file) FIXED — test now uses IndexOf + 500-char proximity window to verify the call is near the marker comment

All Round 3 findings (F1–F7) remain fixed. Confirmed by all 5 models.


New Finding — 🟢 MINOR (2/5: claude-opus-4.6 ×2)

CopilotService.cs:3933 — Missing [MCP-RELOAD] diagnostic log tag when IsProcessing = false is cleared

The project invariant (.claude/skills/processing-state-safety/SKILL.md): "Every code path that sets IsProcessing = false MUST have a diagnostic log entry." Every other path logs before clearing ([ABORT], [ERROR], [COMPLETE], [WATCHDOG], [BRIDGE-COMPLETE], [INTERRUPTED]). The ReloadMcpServersAsync abort block logs [MCP-RELOAD] when the whole reload completes, but not specifically when it clears IsProcessing in the InvokeOnUIAsync callback.

Suggested fix: add Debug($"[MCP-RELOAD] '{sessionName}' abort cleared IsProcessing for reload") inside the InvokeOnUIAsync callback before state.Info.IsProcessing = false.


Reviewer Note (1/5 only — independently verified in source)

_permissionRecoveryAttempts not cleared in ReloadMcpServersAsync

Only sonnet-4.6 flagged this (below the 2-model consensus bar), but I verified directly against the PR source:

  • _permissionRecoveryAttempts.TryRemove is called in AbortSessionAsync (line 3555) and CompleteResponse (line 1060)
  • ReloadMcpServersAsync calls state.Session?.AbortAsync() (SDK-level) — NOT the service-level AbortSessionAsync — so the counter is NOT cleared

Failure scenario: MCP errors fire 3 auto-recovery attempts (TryRecoverPermissionAsync), exhausting the counter. User sees "Multiple recovery attempts failed", types /mcp reload. Fresh SDK session is created — but counter is still 3 (> 2). First MCP tool error in the new session immediately hits the exhaustion branch and shows "⚠️ Multiple recovery attempts failed. Use Settings → Save & Reconnect…" — the exact failure state the user was trying to escape.

Fix: Add _permissionRecoveryAttempts.TryRemove(sessionName, out _); unconditionally alongside ClearPermissionDenials(). Not blocking for merge, but worth a follow-up.


Verdict: ✅ Approve

All 3 rounds of findings (F1–F7, R4-1, R4-2) correctly addressed. The only consensus finding is 🟢 MINOR (a log tag). The _permissionRecoveryAttempts gap is real but narrow (requires prior auto-recovery exhaustion before reload), and it's a pre-existing pattern in other session-replacement paths.

@PureWeen PureWeen merged commit 0cf2cc0 into main Mar 17, 2026
@PureWeen PureWeen deleted the copilot/feedback-reactivate-mcp-server branch March 17, 2026 22:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[feedback] MCP server reactivation

2 participants