Session resilience: heartbeat timer, permission auto-recovery, activity-aware watchdog#285
Session resilience: heartbeat timer, permission auto-recovery, activity-aware watchdog#285
Conversation
PR Review Squad — PR #285 Round 1CI Status: 🔴 CRITICAL (5/5 models)
Bug A — Bug B — Old Fix required: Before resend: clear 🟡 MODERATE
Recommended Action:
|
908c6e4 to
2e461bf
Compare
PR Review Squad — PR #285 Round 3 Re-review (after commit
|
| Finding | Status | Notes |
|---|---|---|
🔴 Bug A — SendPromptAsync throws "already processing" |
STILL PRESENT (5/5) | Info = state.Info still shares object; IsProcessing never cleared; guard 1 always throws |
🔴 Bug B — Old ResponseCompletion TCS orphaned forever |
STILL PRESENT (5/5) | No state.ResponseCompletion?.TrySetCanceled() added; original awaiter hangs indefinitely |
🟡 P3 — SendPromptAsync from Task.Run background thread |
STILL PRESENT (5/5) | Still violates INV-2 (moot today due to Bug A, becomes real once A is fixed) |
🟡 P4 — _permissionRecoveryAttempts.TryRemove on any success |
STILL PRESENT (5/5) | Any clean tool call resets the attempts > 2 guard; intermittent permission errors loop forever |
🟡 P5 — _lastRenderedSnapshot plain Dictionary under Timer |
STILL PRESENT (5/5) | No lock; System.Threading.Timer callbacks can overlap and corrupt on concurrent resize |
🟡 P6 — LastUpdatedAt plain DateTime from background thread |
STILL PRESENT (5/5) | Bare auto-property write; ProcessingStartedAt uses Interlocked for the same reason |
🔴 NEW CRITICAL — SendingFlag transfer worsens Bug A (4/5 models)
CopilotService.Events.cs — TryRecoverPermissionAsync
908c6e4 added the SendingFlag transfer. The intent was to prevent a concurrent SendPromptAsync from racing after session replacement — reasonable. The problem: it also blocks the intentional resend at the end of the same function.
// 908c6e4 addition:
Interlocked.Exchange(ref newState.SendingFlag,
Interlocked.Exchange(ref state.SendingFlag, 0));
// Result: state.SendingFlag = 0 (cleared)
// newState.SendingFlag = 1 (was 1 = processing)Then immediately:
await SendPromptAsync(sessionName, lastPrompt, skipHistoryMessage: true);
// Fetches _sessions[sessionName] = newState
// Guard 1: newState.Info.IsProcessing == true → throws (Bug A, unchanged)
// Guard 2: CompareExchange(newState.SendingFlag=1, 1, 0) → 1≠0 → throws (NEW)
// Both caught silently. Resend is permanently unreachable.Before 908c6e4, if Bug A were independently fixed, newState.SendingFlag would default to 0 and Guard 2 would pass. Now both guards independently block recovery.
Fix: Reset newState.SendingFlag = 0 before the resend call (or don't transfer it until after resend succeeds).
🔴 NEW CRITICAL — HandleReconnect destroys ALL sessions (5/5 models)
Dashboard.razor — HandleReconnect() (~line 2229)
The "Reconnect" button appears per-session (inside the HasPermissionIssue warning banner), but its handler calls a service-wide operation:
private async Task HandleReconnect()
{
var connSettings = ConnectionSettings.Load();
await CopilotService.ReconnectAsync(connSettings); // tears down ALL active sessions
}A user with 5 active sessions, one of which has a permission issue, clicks "Reconnect" to fix that one session. All 5 sessions are destroyed.
Fix: Use TryRecoverPermissionAsync directly (expose as a public method on CopilotService), or scope the reconnect to the affected session only.
What 908c6e4 Correctly Fixed ✅
Interlocked.Incrementfor denial counting (prevents double-recovery races)Volatile.Read/Volatile.WriteforPermissionDenialCountvia backing field- Event handler registered after
_sessions[sessionName] = newState(preventsisCurrentStatefailures) ActiveToolCallCounttransfer (prevents negative counter after in-flight completions)PermissionDenialCount = 0reset in bothCompleteResponseandSendPromptAsync- Tests for
HasPermissionIssuethreshold behavior
Test Coverage Gap
The new permission-recovery code path has no unit tests. Suggested test cases:
TryRecoverPermissionAsync— verifystate.ResponseCompletionis cancelled before new state creationSendPromptAsync— verify it succeeds afterIsProcessingis cleared andSendingFlagis 0HandleReconnect— verify it only affects the target session, not all sessions- Recovery attempt counter — verify
TryRemoveon successful tool call is not resetting it across recovery cycles
Required Fixes Before Merge
-
Bug A (blocking): In
TryRecoverPermissionAsync, before callingSendPromptAsync:- Clear
state.Info.IsProcessing = false+ all 9 companion fields viaInvokeOnUI(see processing-state-safety invariant) - Add diagnostic log
[PERMISSION-RECOVER-RESEND]
- Clear
-
Bug B (blocking): Add
state.ResponseCompletion?.TrySetCanceled()before creatingnewState -
SendingFlag regression (blocking): Add
Interlocked.Exchange(ref newState.SendingFlag, 0)before the resend call -
HandleReconnect scope (blocking): Scope the "Reconnect" button to per-session recovery only; do not call global
ReconnectAsync
Recommended Action: ⚠️ Request changes
The permission-recovery feature is the core of this PR, and it remains non-functional after 3 commits of fixes. The resend path is blocked by two independent guards (IsProcessing + SendingFlag), the original awaiter leaks forever (Bug B), and the manual reconnect button has destructive global scope.
PR #285 Round 4 Re-review (commit
|
| Finding | Status | Evidence |
|---|---|---|
🔴 Bug A — IsProcessing never cleared before resend |
✅ FIXED (5/5) | InvokeOnUI now clears IsProcessing=false + all 9 companion fields before cleanupDone.TrySetResult(). newState.Info == state.Info (shared reference), so resend guards pass. |
🔴 Bug B — Old ResponseCompletion TCS orphaned |
✅ FIXED (5/5) | state.ResponseCompletion?.TrySetCanceled() added before new state creation. Original awaiter receives TaskCanceledException. |
🔴 SendingFlag transfer regression |
✅ FIXED (5/5) | Interlocked.Exchange(ref state.SendingFlag, 0) clears old flag; newState.SendingFlag stays at default 0. Both CAS guards pass for resend. |
🔴 HandleReconnect destroys ALL sessions |
✅ FIXED (5/5) | No longer calls global ReconnectAsync(). Now calls CopilotService.RecoverSessionAsync(sessionName). See new finding below for residual targeting issue. |
🟡 P3 — SendPromptAsync from background thread (INV-2) |
❌ STILL PRESENT (5/5) | After await cleanupDone.Task.ConfigureAwait(false), execution resumes on thread pool. SendPromptAsync then sets state.Info.IsProcessing = true and invokes OnStateChanged from that background thread — direct INV-2 violation per codebase convention. |
🟡 P4 — TryRemove resets loop guard on any tool success |
✅ FIXED (5/5) | _permissionRecoveryAttempts.TryRemove moved from per-tool-success path to CompleteResponse. Intermittent successes no longer reset the 2-attempt limit. |
🟡 P5 — _lastRenderedSnapshot plain Dictionary under Timer |
✅ FIXED (5/5) | Changed to ConcurrentDictionary. |
🟡 P6 — LastUpdatedAt non-atomic write from background thread |
✅ FIXED (5/5) | Now uses _lastUpdatedAtTicks with Interlocked.Read/Exchange, matching the existing ProcessingStartedAt pattern. |
New Consensus Findings
🟡 MODERATE — HandleReconnect still targets wrong session (5/5 models)
Dashboard.razor:2248 / ChatMessageList.razor:83-87
HandleReconnect reads CopilotService.ActiveSessionName instead of the session whose button was clicked. The OnReconnectRequested EventCallback carries no session identity as it flows through ChatMessageList → ExpandedSessionView → Dashboard. If the user sees a permission banner for session A but has since navigated to session B (changing ActiveSessionName), the click recovers session B instead of the broken session A.
Fix: Change EventCallback → EventCallback<string> and thread the session name through the component tree. In Dashboard: OnReconnectRequested="() => HandleReconnect(session.Name)".
🟡 MODERATE — TryRecoverPermissionAsync failure path leaves session stuck (2/5 models)
CopilotService.Events.cs — catch block at end of TryRecoverPermissionAsync
If _client.ResumeSessionAsync throws (network error, session expired), execution jumps to the outer catch block. At that point:
- The outer catch adds a warning message but does not call
state.ResponseCompletion?.TrySetCanceled() IsProcessingis never cleared- The old session was already
await state.Session.DisposeAsync()'d
Result: the session is stuck in "Thinking…" until the watchdog fires (120–600s). The watchdog is a safety net that will eventually recover it, but the user experience is a multi-minute freeze with no clear indication.
Fix: The catch block should call InvokeOnUI(() => { state.Info.IsProcessing = false; ... state.Info.PermissionDenialCount = 0; ... }) and state.ResponseCompletion?.TrySetCanceled(). Alternatively, move the TCS cancel and watchdog cancel into a finally block.
🟡 MODERATE — P3 still present: SendPromptAsync on background thread (5/5 models)
CopilotService.Events.cs:1573-1582
Per the codebase INV-2 invariant: "All mutations to state.Info.IsProcessing must be marshaled to the UI thread." SendPromptAsync sets state.Info.IsProcessing = true and fires OnStateChanged at lines 2006/2039 of CopilotService.cs. After await cleanupDone.Task.ConfigureAwait(false), the continuation runs on a thread-pool thread — these mutations then happen off the UI thread.
Fix: Wrap the resend call inside an InvokeOnUI (or its async equivalent), or change ConfigureAwait(false) to ConfigureAwait(true) so the continuation resumes on the captured synchronization context.
🟢 MINOR — TaskCanceledException from TCS cancel reaches the original caller silently (2/5 models)
CopilotService.cs:2222 / Dashboard.razor ContinueWith
When state.ResponseCompletion?.TrySetCanceled() fires, the original SendPromptAsync's await state.ResponseCompletion.Task throws TaskCanceledException. Dashboard's ContinueWith(t => { if (t.IsFaulted) ... }) checks IsFaulted (not IsCanceled), so the cancellation is silently swallowed with no [COMPLETE] or [ERROR] diagnostic log entry. Functionally harmless since recovery takes over, but creates a blind spot in event diagnostics.
Summary
4 of 8 previously critical/moderate findings are confirmed fixed. P3 (INV-2 background thread) is still present. Two new moderate issues were identified — notably the failure-path cleanup gap, which can cause a session to freeze for 120-600 seconds if the network drops during recovery.
Recommended action:
Required before merge:
- Fix
HandleReconnectto thread session name through the EventCallback chain - Fix the
catchblock inTryRecoverPermissionAsyncto clearIsProcessing, cancel the TCS, and callCancelProcessingWatchdog - Fix P3: wrap the
SendPromptAsyncresend inInvokeOnUIor restore the synchronization context before calling it
PR #285 Round 4 Re-review (commit
|
| Finding | Status |
|---|---|
🔴 Bug A — SendPromptAsync throws "already processing" |
✅ FIXED (4/4) |
🔴 Bug B — Old ResponseCompletion TCS orphaned forever |
✅ FIXED (4/4) |
🔴 SendingFlag transfer regression |
✅ FIXED (4/4) |
🔴 HandleReconnect destroys ALL sessions |
✅ FIXED (4/4) |
🟡 P3 — SendPromptAsync from Task.Run background thread |
❌ STILL PRESENT (4/4) |
🟡 P4 — TryRemove resets loop guard on clean tool call |
✅ FIXED (4/4) |
🟡 P5 — _lastRenderedSnapshot plain Dictionary under Timer |
✅ FIXED (4/4) |
🟡 P6 — LastUpdatedAt plain DateTime non-atomic write |
✅ FIXED (4/4) |
🔴 NEW CRITICAL — FlushCurrentResponse missing (4/4 models)
CopilotService.Events.cs — TryRecoverPermissionAsync, InvokeOnUI cleanup block
The cleanup sets IsProcessing = false but silently discards any partial assistant response:
InvokeOnUI(() => {
state.CurrentResponse.Clear(); // ← accumulated partial response DISCARDED
state.FlushedResponse.Clear(); // ← without persisting to History or ChatDB
state.Info.IsProcessing = false;
...
});Per INV-1, every IsProcessing=false path MUST call FlushCurrentResponse(state) before clearing the buffers. Without it, any streamed assistant text that arrived before the 3rd permission denial is silently lost — not written to state.Info.History, not persisted to the chat database.
Fix: Add FlushCurrentResponse(state); inside the InvokeOnUI block, before clearing CurrentResponse.
🔴 NEW CRITICAL — History.Add from SDK background thread without Invoke() (3/4 models)
CopilotService.Events.cs ~line 372 — ToolExecutionCompleteEvent handler
if (denialCount == 3)
{
state.Info.History.Add(ChatMessage.SystemMessage( // ← no Invoke()
"⚠️ Permission errors detected. Attempting to reconnect session..."));
_ = Task.Run(async () => await TryRecoverPermissionAsync(state, sessionName));
}History is a List<ChatMessage>. This write happens on the SDK background event thread. The UI thread reads and iterates History during render. A concurrent Add can corrupt the list's internal array (InvalidOperationException: Collection was modified or silent memory corruption). Compare to lines 390–394 in the same handler which correctly use Invoke(() => { ... }).
Fix: Wrap the History.Add in Invoke(() => { ... }) like all other History mutations in the same handler.
🟡 STILL PRESENT — P3: SendPromptAsync from Task.Run background thread (4/4 models)
CopilotService.Events.cs — TryRecoverPermissionAsync
The cleanupDone TCS correctly synchronizes the cleanup, but the resend still violates INV-2:
_ = Task.Run(async () => await TryRecoverPermissionAsync(state, sessionName));
// inside TryRecoverPermissionAsync, after cleanup:
await SendPromptAsync(sessionName, lastPrompt, skipHistoryMessage: true);
// SendPromptAsync sets IsProcessing=true, ProcessingStartedAt, etc. on Task.Run threadAll IsProcessing mutations must be marshaled to the UI thread. Fix: use InvokeOnUI(() => _ = SendPromptAsync(...)) instead of direct call.
🟡 NEW MODERATE — HasUsedToolsThisTurn not cleared on IsProcessing=false path (2/4 models)
CopilotService.Events.cs — TryRecoverPermissionAsync, InvokeOnUI cleanup block
The cleanup correctly clears IsResumed, ProcessingStartedAt, ToolCallCount, ProcessingPhase, but does not clear HasUsedToolsThisTurn (it lives on SessionState, not AgentSessionInfo). The stale value is transferred to newState:
newState.HasUsedToolsThisTurn = state.HasUsedToolsThisTurn; // transfers stale trueIf the resend succeeds, SendPromptAsync resets it on entry — no harm. But if lastPrompt is null/empty (no resend), newState sits with IsProcessing=false but HasUsedToolsThisTurn=true. On the user's next manual prompt, the watchdog will select the 600s timeout instead of 120s, delaying stuck-session detection.
Fix: Add newState.HasUsedToolsThisTurn = false; after the transfer.
Test Coverage Gap
No unit tests cover the recovery flow:
- Test that
FlushCurrentResponseis called (content preserved to History) beforeIsProcessing=false - Test that
History.Addfor the permission warning reachesHistory(i.e. is not lost due to thread race) - Test that
SendPromptAsyncsucceeds after recovery (guards are all clear)
Required Fixes Before Merge
- Add
FlushCurrentResponse(state);in theInvokeOnUIcleanup block beforestate.CurrentResponse.Clear()(INV-1) - Wrap
History.Add(atdenialCount == 3) inInvoke(() => { ... })(thread safety) - Add
newState.HasUsedToolsThisTurn = false;after the transfer (INV-1 partial) - (Carry-over P3) Marshal the
SendPromptAsyncresend viaInvokeOnUI(INV-2)
Recommended Action: ⚠️ Request changes
Two new critical issues were introduced. The 4 criticals from Round 3 are all fixed — good progress. Three small additions will complete the fix.
Round 4 Action Items — 4 remaining before mergeGreat progress on 1. 🔴
|
PR #285 Round 5 Re-review (commit
|
| Finding | Status | Evidence |
|---|---|---|
| F1 — FlushCurrentResponse not called before buffer clear | ❌ STILL PRESENT (5/5) | state.CurrentResponse.Clear() in InvokeOnUI with no prior FlushCurrentResponse(state) call. Partial response text lost. |
| F2 — History.LastOrDefault on background thread | ❌ STILL PRESENT (5/5) | state.Info.History.LastOrDefault(...) runs before InvokeOnUI, on Task.Run background thread. List<ChatMessage> not thread-safe. |
| F3 — SendPromptAsync from background thread (INV-2) | ❌ STILL PRESENT (5/5) | After await cleanupDone.Task.ConfigureAwait(false), continuation runs on thread pool. SendPromptAsync sets IsProcessing=true without InvokeOnUI — INV-2 violation. |
| F4 — HasUsedToolsThisTurn not reset | ✅ FIXED (5/5) | SendPromptAsync resets state.HasUsedToolsThisTurn = false at line 2013 unconditionally, before watchdog starts. Stale copy is overwritten immediately. |
Carry-over Finding from Round 4 (previously flagged, still present)
🟡 MODERATE — Catch block missing IsProcessing cleanup on ResumeSessionAsync failure (4/5 models)
CopilotService.Events.cs — outer catch block of TryRecoverPermissionAsync
If _client.ResumeSessionAsync(...) throws, execution jumps to the outer catch. At that point:
state.ResponseCompletion?.TrySetCanceled()has not been called yet (it's placed AFTER the await), so the originalSendPromptAsyncawaiter hangs onawait state.ResponseCompletion.Taskuntil the watchdog fires (120-600s)state.Info.IsProcessingremainstrue— UI shows "Thinking…" with no exit pathstate.SendingFlagremains1— session locked for new sends
The catch block only adds a system message and calls OnStateChanged. Minimum fix:
catch (Exception ex)
{
state.ResponseCompletion?.TrySetCanceled();
InvokeOnUI(() =>
{
state.Info.IsProcessing = false;
state.Info.ProcessingStartedAt = null;
state.Info.ToolCallCount = 0;
state.Info.ProcessingPhase = 0;
state.Info.HasUsedToolsThisTurn = false;
Interlocked.Exchange(ref state.SendingFlag, 0);
state.Info.History.Add(ChatMessage.SystemMessage("⚠️ Auto-recovery failed..."));
OnStateChanged?.Invoke();
});
}Outstanding Consensus Findings (requiring fixes)
🟡 MODERATE — F1: FlushCurrentResponse (5/5)
CopilotService.Events.cs — TryRecoverPermissionAsync InvokeOnUI block
Add FlushCurrentResponse(state) call before state.CurrentResponse.Clear() to persist any partial response text. Even though the session is being reconnected, silently discarding streamed content violates the content-persistence invariant. The flush includes a dedup guard (checked via state.FlushedResponse) so it won't duplicate already-persisted content.
🟡 MODERATE — F2: History read on background thread (5/5)
CopilotService.Events.cs — before InvokeOnUI in TryRecoverPermissionAsync
Move the History read inside the InvokeOnUI lambda (before cleanupDone.TrySetResult()), then pass the extracted lastPrompt to the resend via a captured variable:
string? lastPrompt = null;
InvokeOnUI(() =>
{
var lastUserMsg = state.Info.History.LastOrDefault(m => m.Role == "user");
lastPrompt = lastUserMsg?.OriginalContent ?? lastUserMsg?.Content;
// ... existing cleanup ...
cleanupDone.TrySetResult();
});🟡 MODERATE — F3: SendPromptAsync INV-2 (5/5)
CopilotService.Events.cs — resend at end of TryRecoverPermissionAsync
Change ConfigureAwait(false) to ConfigureAwait(true) so the resend continuation captures the synchronization context, OR wrap the resend in InvokeOnUI/InvokeAsync:
await cleanupDone.Task; // remove ConfigureAwait(false)This ensures SendPromptAsync runs on the UI thread, consistent with all other call sites.
Summary
1 of the 4 Round 4 findings is confirmed fixed (F4). F1, F2, F3 remain. The catch-block cleanup gap is a separate carry-over moderate issue.
Recommended action:
Required before merge:
- Add
FlushCurrentResponse(state)before clearing buffers (F1) - Move History read into
InvokeOnUIlambda (F2) - Remove
ConfigureAwait(false)from cleanupDone await so resend runs on UI thread (F3) - Add
IsProcessingcleanup + TCS cancel to catch block
Round 6 Re-Review:
|
| Finding | Status |
|---|---|
| F1: FlushCurrentResponse before Clear in success path | ✅ FIXED (5/5) |
| F2: History.LastOrDefault on background thread | ✅ FIXED (5/5) |
| F3: SendPromptAsync called from threadpool | ✅ FIXED (5/5) |
| F4: HasUsedToolsThisTurn not reset in SendPromptAsync | ✅ FIXED (5/5) |
| NEW/R5: Catch block missing IsProcessing cleanup | ✅ FIXED (5/5) |
All 5 prior round findings are confirmed fixed. However 3 new consensus issues were found.
New Consensus Findings
🔴 CRITICAL — History.Add on SDK background thread (5/5 models)
CopilotService.Events.cs, ToolExecutionCompleteEvent handler, denialCount == 3 branch
state.Info.History.Add(ChatMessage.SystemMessage("⚠️ Permission errors detected...")); // background thread!
InvokeOnUI(() => _ = TryRecoverPermissionAsync(state, sessionName));This bare History.Add runs on the SDK event thread outside any InvokeOnUI. History is List<ChatMessage> — not thread-safe. The Blazor render loop and FlushCurrentResponse enumerate this list on the UI thread simultaneously. This is the same class of violation as F2, just at a new call site.
Fix: Move the History.Add inside the existing InvokeOnUI callback (can be prepended to the TryRecoverPermissionAsync launch lambda), or into its own InvokeOnUI(() => state.Info.History.Add(...)).
🟡 MODERATE — HasUsedToolsThisTurn not cleared in success-path cleanup (4/5 models)
CopilotService.Events.cs, TryRecoverPermissionAsync success path InvokeOnUI block
The cleanup block sets IsProcessing = false but does not clear HasUsedToolsThisTurn (the catch block does clear it). This violates INV-1 checklist item 2. newState.HasUsedToolsThisTurn = state.HasUsedToolsThisTurn copies the stale true value to the live session. If lastPrompt is null/empty and the resend is skipped, IsProcessing=false with HasUsedToolsThisTurn=true persists — any subsequent watchdog will incorrectly select the 600s tool timeout instead of 120s.
Fix: Add newState.HasUsedToolsThisTurn = false; in the InvokeOnUI cleanup block alongside the other field clears.
🟡 MODERATE — FlushCurrentResponse called after IsProcessing=false in catch block (2/5 models)
CopilotService.Events.cs, TryRecoverPermissionAsync catch block InvokeOnUI
state.Info.IsProcessing = false; // ← cleared first
...
FlushCurrentResponse(state); // ← flush happens after (INV-1 item 7 violated)The success path has the correct order (FlushCurrentResponse before IsProcessing=false). The catch block has it inverted. INV-1 item 7: "Call FlushCurrentResponse(state) BEFORE clearing IsProcessing". If FlushCurrentResponse ever adds an IsProcessing guard (consistent with the direction of the codebase), this will silently drop response content on recovery failures.
Fix: Move FlushCurrentResponse(state) to before state.Info.IsProcessing = false in the catch block's InvokeOnUI, matching the success path ordering.
Summary
| # | Severity | Finding | Models |
|---|---|---|---|
| N1 | 🔴 CRITICAL | History.Add on background thread (ToolExecutionCompleteEvent) | 5/5 |
| N2 | 🟡 MODERATE | HasUsedToolsThisTurn not cleared in success path InvokeOnUI | 4/5 |
| N3 | 🟡 MODERATE | FlushCurrentResponse after IsProcessing=false in catch block | 2/5 |
3 targeted fixes needed, all in TryRecoverPermissionAsync and its callsite.
PR #285 Round 6 Re-review (commit
|
| Finding | Status | Verification |
|---|---|---|
| F1: FlushCurrentResponse not called before Clear() | ✅ FIXED | FlushCurrentResponse(state) now called before CurrentResponse.Clear() in InvokeOnUI block |
| F2: History.LastOrDefault on background thread | ✅ FIXED | Moved inside InvokeOnUI lambda via cleanupDone TCS pattern |
| F3: SendPromptAsync after ConfigureAwait(false) | ✅ FIXED | ConfigureAwait(false) removed; continuation stays on UI thread |
| F4: HasUsedToolsThisTurn not reset | ✅ FIXED | HasUsedToolsThisTurn = false present in SendPromptAsync |
| NEW: Catch block missing cleanup | ✅ FIXED | Full 9-field cleanup + FlushCurrentResponse + TCS.TrySetException in catch |
All 5 models unanimously agree all previous findings are fixed.
New Consensus Findings
🔴 CRITICAL — History.Add() on background thread (4/5 models)
CopilotService.Events.cs ~line 368 — In the ToolExecutionCompleteEvent handler, when denialCount == 3:
if (denialCount == 3)
{
state.Info.History.Add(ChatMessage.SystemMessage( // ← SDK background thread!
"⚠️ Permission errors detected..."));
InvokeOnUI(() => _ = TryRecoverPermissionAsync(state, sessionName));
}List<ChatMessage> is not thread-safe. This Add() executes on the SDK event callback thread while the UI thread may be enumerating History for rendering — risking InvalidOperationException or internal array corruption. Every other History.Add in this handler correctly goes through Invoke().
Fix: Move the History.Add into the InvokeOnUI lambda or a separate Invoke() call.
🟡 MODERATE — Watchdog exceededMaxTime change is logically redundant (3/5 models)
CopilotService.Events.cs ~line 1399:
-var exceededMaxTime = totalProcessingSeconds >= WatchdogMaxProcessingTimeSeconds;
+var exceededMaxTime = totalProcessingSeconds >= WatchdogMaxProcessingTimeSeconds
+ && elapsed >= effectiveTimeout;Since exceededMaxTime now requires elapsed >= effectiveTimeout, and that is already the first branch of if (elapsed >= effectiveTimeout || exceededMaxTime), the entire exceededMaxTime term is logically redundant: A || (B && A) ≡ A. This silently disables the absolute time safety net — a session receiving periodic events (keeping elapsed below effectiveTimeout) can now run forever past WatchdogMaxProcessingTimeSeconds.
Fix: If intent is "only fire max-time when events are somewhat stale," use a lower staleness threshold (e.g., elapsed >= 30) rather than effectiveTimeout.
🟡 MODERATE — CheckForMissedUpdates iterates sessions from timer thread (2/5 models)
Dashboard.razor heartbeat timer:
_heartbeatTimer = new Timer(_ => CheckForMissedUpdates(), null, 5000, 5000);
// ...
var snapshot = sessions; // List<AgentSessionInfo>, not thread-safe
foreach (var session in snapshot) { ... }sessions is a List<> mutated on the UI thread. The Timer fires on ThreadPool. Concurrent enumeration + mutation throws InvalidOperationException. The bare catch {} swallows it but silently breaks the heartbeat for that tick.
Fix: Use InvokeAsync to marshal to Blazor sync context, or take a .ToArray() snapshot.
🟡 MODERATE — Race between zombie SDK events and TryRecoverPermissionAsync (2/5 models)
CopilotService.Events.cs TryRecoverPermissionAsync — newState shares the Info object reference with old state. Between DisposeAsync and _sessions[sessionName] = newState, in-flight SDK events on the old session can invoke HandleSessionEvent and mutate the shared Info (e.g., calling CompleteResponse, setting IsProcessing=false), potentially corrupting the newly resumed session state.
Fix: Set a guard flag (e.g., state.IsRecovering) checked in CompleteResponse, or ensure HandleSessionEvent's isCurrentState check catches this.
🟢 MINOR — _permissionDenialCount backing field is public (3/5 models)
AgentSessionInfo.cs ~line 50:
public int _permissionDenialCount;Leaks the backing field, allowing external code to bypass Volatile.Read/Write semantics. Should be internal with an IncrementPermissionDenialCount() method.
🟢 MINOR — _lastRenderedSnapshot grows unbounded (2/5 models)
Dashboard.razor — Entries added to snapshot dictionary but never pruned when sessions are closed. Minor memory leak over long app lifetime.
Recommended Action: ⚠️ Request Changes
The 🔴 CRITICAL History.Add on background thread is a genuine thread-safety bug that can cause InvalidOperationException crashes during rendering. This must be fixed before merge.
The watchdog redundancy (🟡) silently removes the absolute time safety net and should be clarified/fixed.
The remaining moderate/minor issues are lower priority but worth addressing.
All 5 previous round findings are confirmed fixed — excellent progress! One more small fix and this is ready.
FlushCurrentResponse persists messages to History on each TurnEnd, but in zero-idle sessions (where SessionIdleEvent never arrives), the event- driven render pipeline can miss updates because: - RenderThrottle may suppress the OnStateChanged callback - CompleteResponse/HandleComplete never fires to bypass the throttle - The session appears stuck showing stale content Add a 5-second periodic heartbeat timer that compares each session's current state (MessageCount, IsProcessing, ToolCallCount, ProcessingPhase) against the last-rendered snapshot. If anything changed, schedule a render. This acts as a safety net alongside the existing event-driven mechanism. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When a session is processing, the status indicator now shows how long since the last SDK event (e.g., 'Working · 2m · 8 tools… (last activity 45s ago)'). This helps distinguish actively-working sessions from stuck ones, especially in zero-idle scenarios. Changes: - Update LastUpdatedAt on every meaningful SDK event in HandleSessionEvent - Add LastActivityAt parameter to ChatMessageList with AppendLastActivity - Add AppendShortLastActivity to SessionListItem sidebar status - Wire LastActivityAt through SessionCard and ExpandedSessionView - Heartbeat timer now also re-renders processing sessions so elapsed time and last-activity text stay fresh Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Revert unintended changes to CopilotService.cs, Organization.cs, and skill documentation files from a previous session - Fix thread safety in heartbeat timer: snapshot sessions list before iterating to avoid racing with UI-thread mutations - Add null/empty guard for snapshot Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tton Detect when the copilot CLI's permission system returns 'Permission denied' on tool calls (caused by lost callback binding in headless mode). After 3 consecutive denials: - Show amber warning banner in chat with Reconnect button - Show⚠️ indicator in sidebar session list - Add system message to chat history explaining the issue Reconnect button calls ReconnectAsync to re-establish the permission callback. Count resets on successful tool completion or new prompt. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When 3 consecutive 'Permission denied' tool failures are detected, automatically attempt to recover by resuming just the affected session with a fresh AutoApprovePermissions callback. Other sessions are untouched. On success: resets denial count, adds '✅ Session reconnected' system message. On failure: falls back to UI warning banner with manual Reconnect button. The per-session reconnect disposes only the broken SDK session, creates a new one via ResumeSessionAsync (preserving chat history, watchdog state, and processing generation), and re-registers the event handler. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…work After a successful per-session reconnect, finds the last user message and resends it with skipHistoryMessage:true so the agent picks up where it left off. Adds a recovery attempt counter (max 2) to prevent infinite loops if the permission issue persists. Counter resets when a tool succeeds. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…n recovery - Use Interlocked.Increment for PermissionDenialCount to prevent duplicate recovery triggers from concurrent tool completion events - Register event handler AFTER dictionary replacement so HandleSessionEvent's isCurrentState check passes for the new state (prevents dropped events) - Transfer ActiveToolCallCount to prevent negative values from in-flight tool completions decrementing the new state's default-zero counter - Transfer SendingFlag to prevent concurrent SendPromptAsync races Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The watchdog previously killed sessions after 3600s total processing time regardless of event activity. This caused premature kills on long-running CI/agent sessions that were actively receiving events. Now the max-time cap only fires when BOTH conditions are true: - Total processing time exceeds 3600s, AND - No events received within the effectiveTimeout window If events are still flowing, the session is alive — let it work. The cap still catches zombie sessions where non-progress events keep the inactivity timer happy without any real work happening. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…hread safety Bug A (CRITICAL): Clear IsProcessing + all 9 companion fields via InvokeOnUI before calling SendPromptAsync in recovery path. Use TaskCompletionSource to await UI-thread cleanup completion before resending. Bug B (CRITICAL): Call state.ResponseCompletion?.TrySetCanceled() before creating new state, so the original SendPromptAsync awaiter doesn't hang. SendingFlag regression (CRITICAL): Don't transfer SendingFlag=1 to newState since we need it at 0 for the resend. Clear old state's flag instead. HandleReconnect scope (CRITICAL): Changed from global ReconnectAsync (destroys all sessions) to per-session RecoverSessionAsync via new public method. P4 (recovery attempt reset): Only reset _permissionRecoveryAttempts on successful turn completion (CompleteResponse), not on every successful tool. P5 (Dictionary thread safety): Use ConcurrentDictionary for heartbeat _lastRenderedSnapshot to prevent corruption from overlapping Timer callbacks. P6 (LastUpdatedAt thread safety): Use Interlocked ticks pattern matching ProcessingStartedAt's thread-safe implementation. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The strict consecutive counter was defeated by occasional OK tool results interspersed with denials. Worker-1 hit 2 denials, then 1 OK (resetting the counter), then 2 more denials — never reaching the threshold of 3. Replace with a sliding window of the last 5 tool results. Recovery triggers when 3+ of those 5 are denials, tolerating intermittent successes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ead resend, clear HasUsedToolsThisTurn 1. FlushCurrentResponse before clearing buffers in recovery (INV-1) 2. Wrap History.Add in Invoke() in ToolExecutionComplete handler 3. Dispatch resend via InvokeOnUI to satisfy INV-2 4. Clear HasUsedToolsThisTurn after transfer to prevent stale 600s timeout Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…afety Fixes from 4-model code review (Opus, Sonnet, GPT-5.1, Gemini): 1. HandleReconnect targeted ActiveSessionName instead of clicked session — replaced with lambda capturing session.Name (Sonnet) 2. _permissionRecoveryAttempts not cleared on abort — added TryRemove in AbortSessionAsync (Sonnet) 3. Concurrent recovery race — added _recoveryInProgress ConcurrentDictionary guard with TryAdd/TryRemove in finally (Opus + GPT) 4. Thread-unsafe History.LastOrDefault on pool thread — moved read into InvokeOnUI cleanup block (Opus) 5. exceededMaxTime was dead code (subsumed by elapsed check) — simplified to use exceededMaxTime only for logging (Opus + Gemini) 6. PermissionDenialCount public setter could desync from sliding window — changed to read-only property (Gemini) 7. Bumped fragile RightClickContextMenu test threshold (our new markup pushed distance from 161 to 284 chars) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… Invoke calls - N1: Merge two Invoke() calls into one atomic UI dispatch for History.Add + TryRecoverPermissionAsync (was two separate posts) - N2: Add HasUsedToolsThisTurn + ActiveToolCallCount reset to success-path cleanup (INV-1 completeness) - F3: Replace Task.Run with Invoke() so TryRecoverPermissionAsync runs on UI thread; remove ConfigureAwait(false); simplify resend to direct await (no InvokeOnUI+TCS dance needed) - Catch block: Full IsProcessing cleanup with TCS cancel, SendingFlag reset, FlushCurrentResponse, all 9 companion fields cleared Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
5235bfc to
56b4b3a
Compare
Round 8 Re-Review:
|
| Finding | Status |
|---|---|
| F1: FlushCurrentResponse before Clear (success path) | ✅ FIXED (5/5) |
| F2: History.LastOrDefault on background thread | ✅ FIXED (5/5) |
| F3: SendPromptAsync on threadpool | ✅ FIXED (5/5) |
| F4: HasUsedToolsThisTurn not reset | ✅ FIXED (5/5) |
| NEW-R5: Catch block missing IsProcessing cleanup | ✅ FIXED (5/5) |
| N1: History.Add at denialCount==3 on background thread | ✅ FIXED (5/5) |
| N2: HasUsedToolsThisTurn not cleared in success path | ✅ FIXED (5/5) |
| N3: FlushCurrentResponse after IsProcessing=false in catch | ✅ FIXED (5/5) |
| N4: Watchdog exceededMaxTime AND condition | ✅ FIXED (5/5) |
The sliding window redesign (5-result window, 3-denial threshold) and _recoveryInProgress concurrency guard are solid additions.
New Consensus Finding (4/5 models)
🟡 MODERATE — attempts > 2 early return leaves session frozen until watchdog
CopilotService.Events.cs, TryRecoverPermissionAsync — the attempts > 2 branch (and the null-client/null-sessionId guard)
When recovery has been attempted 3 times, the method posts a warning message and returns without clearing processing state. state.ResponseCompletion.TCS is never resolved — the original SendPromptAsync caller hangs until watchdog (up to 600s). SendingFlag remains set, so SendPromptAsync will silently reject any new prompt. The user sees the warning AND "Thinking..." simultaneously with no way to send a new message (only Stop works). The null-guard early return has the same issue.
Fix: Apply the same cleanup as the catch block before each early return:
state.ResponseCompletion?.TrySetCanceled()Interlocked.Exchange(ref state.SendingFlag, 0)- Full 9-field IsProcessing cleanup (FlushCurrentResponse first, then clear the fields)
Test Coverage
✅ 5 new unit tests for sliding window permission detection
✅ Heartbeat timer, permission UI, LastUpdatedAt thread-safety all covered
🔲 No test for the attempts > 2 early return path (should verify session is releasable after max attempts)
8 rounds of iteration. Every prior finding is fixed. One targeted fix needed on the early returns, then this PR is ready to merge.
…very Extract ClearProcessingStateForRecoveryFailure helper to avoid tripling cleanup code. Called from: null-guard return, attempts>2 return, and catch block. Prevents session stuck in 'Thinking...' until watchdog when recovery is skipped or fails. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PR #285 — Round 9 Review (5-model consensus)Models: claude-opus-4.6 ×2, claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex Previous Findings Status — ALL 10 FIXED ✅
The New Consensus Findings (2+ model agreement)🟡 R9-A (5/5 models) — Null-client guard missing 🟡 R9-C (5/5 models) — Catch block should use explicit 🟢 R9-B (5/5 models) — Missing diagnostic log in helper (INV-1 #8) 🟢 R9-D (5/5 models) — Verdict: ✅ ApproveAll 10 previous findings are confirmed fixed. The 2 moderate findings (R9-A and R9-C) are non-blocking — R9-A is mitigated by the heartbeat timer (5s), and R9-C is mitigated by MAUI's SynchronizationContext. The Review by PR Review Squad worker-2 (Round 9, 5-model consensus) |
…ty-aware watchdog (#285) Add heartbeat timer and liveness signal for zero-idle sessions 9 rounds of multi-model review, all 10 findings fixed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…onAsync cancellation, restored comment - Changed TurnEndIdleCts from property to field to fix 7 CS0206 compile errors with Interlocked.Exchange(ref ...) - Added TurnEndIdleCts cancellation in AbortSessionAsync to prevent delayed CompleteResponse on aborted sessions - Restored deleted // Release send lock comment Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…onAsync cancellation, restored comment - Changed TurnEndIdleCts from property to field to fix 7 CS0206 compile errors with Interlocked.Exchange(ref ...) - Added TurnEndIdleCts cancellation in AbortSessionAsync to prevent delayed CompleteResponse on aborted sessions - Restored deleted // Release send lock comment Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…roved permission detection (#305) ## Fixes issues from PR #285 ### Changes - **Gate auto-resend on successful tools** (fixes #298): Added `SuccessfulToolCountThisTurn` to SessionState. Permission recovery skips resend when tools already completed this turn, preventing re-execution of side-effectful work. Shows `⟳ Resending: "..."` when resending, or `Auto-resend skipped` when skipped. - **TurnEnd→Idle fallback** (addresses #299): When AssistantTurnEndEvent fires, schedule a 4s delayed CompleteResponse. Cancelled when SessionIdleEvent arrives (normal path) or a new turn starts. - **Improved permission denial detection** (improves #300): Extracted IsPermissionDenialText() helper matching "Permission denied", "denied-no-approval-rule", and "could not request permission". ### Issues Addressed - Fixes #298 - Addresses #299 (workaround for SDK bug) - Improves #300 (more robust detection) ### Verification - Build: 0 errors - 20/20 tests pass --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Adds defense-in-depth resilience for two classes of session failures observed in production:
session.idle, causing the UI to miss final renders (see Zero-idle sessions: SDK never emits session.idle event #299)denied-no-approval-rule-and-could-not-request-from-user, silently breaking all tool execution (see SDK permission callback binding lost in headless mode after reconnections #300)These workarounds are safe no-ops when the SDK works correctly. They activate only when failures are detected and deactivate automatically when conditions resolve.
Changes
Heartbeat Timer (Dashboard.razor)
System.Threading.Timercompares session state snapshots (_lastRenderedSnapshotviaConcurrentDictionary)LastUpdatedAtticks and schedules a re-renderOnStateChangedfires but the Blazor render cycle doesn't pick it upLiveness Signal (ChatMessageList.razor, SessionCard.razor)
Permission Denial Detection (AgentSessionInfo.cs)
Queue<bool>of last 5 tool results, triggers at ≥3 denialslock+Volatile.Writefor cross-thread count readsAuto-Recovery (CopilotService.Events.cs)
TryRecoverPermissionAsync: disposes broken session →ResumeSessionAsyncwith freshAutoApprovePermissionscallback → resends last prompt_recoveryInProgressConcurrentDictionaryprevents auto + manual recovery racingCompleteResponseandAbortSessionAsync)ClearProcessingStateForRecoveryFailurehelper — called from null-guard return, attempts>2 return, and catch block. Clears TCS, SendingFlag, FlushCurrentResponse, and all 9 companion fieldsInvoke()(notTask.Run) so allawaitcontinuations stay on UI SynchronizationContext.SendPromptAsyncresend is a directawait— noConfigureAwait(false), noInvokeOnUI+TCS dance neededPermission Warning UI (ChatMessageList.razor, SessionListItem.razor)
HasPermissionIssueis trueActivity-Aware Watchdog (CopilotService.Events.cs)
exceededMaxTimeused only for logging when total > 3600s; actual timeout uses tiered approach (30s/120s/600s) based on event stalenessFollow-up Issues
session.idleevent (root cause investigation)Testing
AgentSessionInfoTests.cs)Files Changed (12 files, +493 / -16)
AgentSessionInfo.csLastUpdatedAtInterlocked patternCopilotService.Events.csTryRecoverPermissionAsync,ClearProcessingStateForRecoveryFailure, activity-aware watchdogCopilotService.csClearPermissionDenialsin SendPromptAsync,TryRemovein AbortSessionAsyncDashboard.razorChatMessageList.razorChatMessageList.razor.cssExpandedSessionView.razorHasPermissionIssue+OnReconnectRequestedSessionListItem.razorSessionListItem.razor.cssSessionCard.razorLastActivityAtparameterAgentSessionInfoTests.csRightClickContextMenuTests.cs