Skip to content

Session resilience: heartbeat timer, permission auto-recovery, activity-aware watchdog#285

Merged
PureWeen merged 14 commits intomainfrom
session-20260304-233713
Mar 7, 2026
Merged

Session resilience: heartbeat timer, permission auto-recovery, activity-aware watchdog#285
PureWeen merged 14 commits intomainfrom
session-20260304-233713

Conversation

@PureWeen
Copy link
Owner

@PureWeen PureWeen commented Mar 5, 2026

Summary

Adds defense-in-depth resilience for two classes of session failures observed in production:

  1. Zero-idle sessions — the SDK sometimes never emits session.idle, causing the UI to miss final renders (see Zero-idle sessions: SDK never emits session.idle event #299)
  2. Permission callback loss — after reconnections in headless mode, the SDK's internal permission system returns denied-no-approval-rule-and-could-not-request-from-user, silently breaking all tool execution (see SDK permission callback binding lost in headless mode after reconnections #300)

These workarounds are safe no-ops when the SDK works correctly. They activate only when failures are detected and deactivate automatically when conditions resolve.

Changes

Heartbeat Timer (Dashboard.razor)

  • 5-second System.Threading.Timer compares session state snapshots (_lastRenderedSnapshot via ConcurrentDictionary)
  • Detects missed UI updates by comparing LastUpdatedAt ticks and schedules a re-render
  • Catches the case where OnStateChanged fires but the Blazor render cycle doesn't pick it up

Liveness Signal (ChatMessageList.razor, SessionCard.razor)

  • Shows "last activity Xs ago" in the processing status indicator when a session is actively processing
  • Gives users visibility into whether the SDK is still receiving events, even if the UI appears frozen

Permission Denial Detection (AgentSessionInfo.cs)

  • Sliding window: Queue<bool> of last 5 tool results, triggers at ≥3 denials
  • Replaces a strict consecutive counter that was defeated by occasional OK results interspersed between denials
  • Thread-safe via lock + Volatile.Write for cross-thread count reads
  • 5 unit tests covering threshold, tolerance, clear, drop-off, and defaults

Auto-Recovery (CopilotService.Events.cs)

  • TryRecoverPermissionAsync: disposes broken session → ResumeSessionAsync with fresh AutoApprovePermissions callback → resends last prompt
  • Concurrency guard: _recoveryInProgress ConcurrentDictionary prevents auto + manual recovery racing
  • Max 2 attempts per session (reset on CompleteResponse and AbortSessionAsync)
  • Full INV-1 cleanup on all exit paths: extracted ClearProcessingStateForRecoveryFailure helper — called from null-guard return, attempts>2 return, and catch block. Clears TCS, SendingFlag, FlushCurrentResponse, and all 9 companion fields
  • UI-thread safety (INV-2): recovery dispatched via Invoke() (not Task.Run) so all await continuations stay on UI SynchronizationContext. SendPromptAsync resend is a direct await — no ConfigureAwait(false), no InvokeOnUI+TCS dance needed

Permission Warning UI (ChatMessageList.razor, SessionListItem.razor)

  • Amber warning banner in chat when HasPermissionIssue is true
  • ⚠️ icon in sidebar session list
  • "Reconnect" button triggers per-session recovery

Activity-Aware Watchdog (CopilotService.Events.cs)

  • Old behavior: hard 1-hour cap killed sessions regardless of activity (broke CI/agent sessions)
  • New behavior: exceededMaxTime used only for logging when total > 3600s; actual timeout uses tiered approach (30s/120s/600s) based on event staleness

Follow-up Issues

Testing

  • 2076 tests pass (includes 5 new sliding window tests in AgentSessionInfoTests.cs)
  • Build clean on Mac Catalyst (0 errors)
  • 8 rounds of multi-model code review (claude-opus-4.6, claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex) — all findings resolved

Files Changed (12 files, +493 / -16)

File Change
AgentSessionInfo.cs Sliding window, LastUpdatedAt Interlocked pattern
CopilotService.Events.cs Permission detection, TryRecoverPermissionAsync, ClearProcessingStateForRecoveryFailure, activity-aware watchdog
CopilotService.cs ClearPermissionDenials in SendPromptAsync, TryRemove in AbortSessionAsync
Dashboard.razor Heartbeat timer, snapshot comparison
ChatMessageList.razor Permission warning banner, liveness signal
ChatMessageList.razor.css Warning banner styles
ExpandedSessionView.razor Wire HasPermissionIssue + OnReconnectRequested
SessionListItem.razor ⚠️ permission warning icon
SessionListItem.razor.css Warning icon style
SessionCard.razor LastActivityAt parameter
AgentSessionInfoTests.cs 5 sliding window tests
RightClickContextMenuTests.cs Bump char distance threshold

@PureWeen
Copy link
Owner Author

PureWeen commented Mar 5, 2026

PR Review Squad — PR #285 Round 1

CI Status: ⚠️ No checks reported

🔴 CRITICAL (5/5 models)

CopilotService.Events.cs:~1491-1560TryRecoverPermissionAsync auto-recovery is inert due to two compounding bugs:

Bug A — SendPromptAsync throws "already processing": state.Info is shared between old and new SessionState (same object reference via Info = state.Info). IsProcessing is still true from the original call and never cleared before calling SendPromptAsync. The resend throws InvalidOperationException, caught silently. User sees "✅ Session reconnected" but prompt is never re-sent — session stays stuck until watchdog (120-600s).

Bug B — Old ResponseCompletion TCS orphaned: Original SendPromptAsync awaits state.ResponseCompletion.Task. Recovery creates a fresh TCS for newState but never cancels/resolves the old one. Old watchdog is cancelled, isCurrentState guard blocks events on orphaned state. The await hangs forever.

Fix required: Before resend: clear IsProcessing + all 9 companion fields via InvokeOnUI, call state.ResponseCompletion?.TrySetCanceled().

🟡 MODERATE

  1. Events.cs:~1553 (4/5) — SendPromptAsync called from Task.Run thread — IsProcessing mutations violate INV-2 (must marshal via InvokeOnUI)
  2. Events.cs:~380 (5/5) — _permissionRecoveryAttempts.TryRemove on any successful tool call resets the attempts > 2 guard, enabling infinite recovery loops when permission errors are intermittent
  3. Dashboard.razor:~830 (4/5) — _lastRenderedSnapshot plain Dictionary under Timer callbacks — concurrent overlap can corrupt
  4. Events.cs:~218 (5/5) — LastUpdatedAt plain DateTime write from background thread — no volatile/barrier

Recommended Action: ⚠️ Request changes

The 🔴 critical bug makes the entire permission-recovery feature non-functional. Fix IsProcessing cleanup + TCS resolution before resend.

@PureWeen PureWeen force-pushed the session-20260304-233713 branch from 908c6e4 to 2e461bf Compare March 5, 2026 16:42
@PureWeen
Copy link
Owner Author

PureWeen commented Mar 5, 2026

PR Review Squad — PR #285 Round 3 Re-review (after commit 908c6e4)

CI Status: ⚠️ No checks reported
Models: claude-opus-4.6 (×2), claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex
Consensus filter: 2+ models required to include a finding


Previous Findings — All 6 STILL PRESENT (5/5 models)

Finding Status Notes
🔴 Bug A — SendPromptAsync throws "already processing" STILL PRESENT (5/5) Info = state.Info still shares object; IsProcessing never cleared; guard 1 always throws
🔴 Bug B — Old ResponseCompletion TCS orphaned forever STILL PRESENT (5/5) No state.ResponseCompletion?.TrySetCanceled() added; original awaiter hangs indefinitely
🟡 P3 — SendPromptAsync from Task.Run background thread STILL PRESENT (5/5) Still violates INV-2 (moot today due to Bug A, becomes real once A is fixed)
🟡 P4 — _permissionRecoveryAttempts.TryRemove on any success STILL PRESENT (5/5) Any clean tool call resets the attempts > 2 guard; intermittent permission errors loop forever
🟡 P5 — _lastRenderedSnapshot plain Dictionary under Timer STILL PRESENT (5/5) No lock; System.Threading.Timer callbacks can overlap and corrupt on concurrent resize
🟡 P6 — LastUpdatedAt plain DateTime from background thread STILL PRESENT (5/5) Bare auto-property write; ProcessingStartedAt uses Interlocked for the same reason

🔴 NEW CRITICAL — SendingFlag transfer worsens Bug A (4/5 models)

CopilotService.Events.csTryRecoverPermissionAsync

908c6e4 added the SendingFlag transfer. The intent was to prevent a concurrent SendPromptAsync from racing after session replacement — reasonable. The problem: it also blocks the intentional resend at the end of the same function.

// 908c6e4 addition:
Interlocked.Exchange(ref newState.SendingFlag,
    Interlocked.Exchange(ref state.SendingFlag, 0));
// Result: state.SendingFlag = 0  (cleared)
//         newState.SendingFlag = 1  (was 1 = processing)

Then immediately:

await SendPromptAsync(sessionName, lastPrompt, skipHistoryMessage: true);
// Fetches _sessions[sessionName] = newState
// Guard 1: newState.Info.IsProcessing == true       → throws  (Bug A, unchanged)
// Guard 2: CompareExchange(newState.SendingFlag=1, 1, 0) → 1≠0 → throws  (NEW)
// Both caught silently. Resend is permanently unreachable.

Before 908c6e4, if Bug A were independently fixed, newState.SendingFlag would default to 0 and Guard 2 would pass. Now both guards independently block recovery.

Fix: Reset newState.SendingFlag = 0 before the resend call (or don't transfer it until after resend succeeds).


🔴 NEW CRITICAL — HandleReconnect destroys ALL sessions (5/5 models)

Dashboard.razorHandleReconnect() (~line 2229)

The "Reconnect" button appears per-session (inside the HasPermissionIssue warning banner), but its handler calls a service-wide operation:

private async Task HandleReconnect()
{
    var connSettings = ConnectionSettings.Load();
    await CopilotService.ReconnectAsync(connSettings);  // tears down ALL active sessions
}

A user with 5 active sessions, one of which has a permission issue, clicks "Reconnect" to fix that one session. All 5 sessions are destroyed.

Fix: Use TryRecoverPermissionAsync directly (expose as a public method on CopilotService), or scope the reconnect to the affected session only.


What 908c6e4 Correctly Fixed ✅

  • Interlocked.Increment for denial counting (prevents double-recovery races)
  • Volatile.Read/Volatile.Write for PermissionDenialCount via backing field
  • Event handler registered after _sessions[sessionName] = newState (prevents isCurrentState failures)
  • ActiveToolCallCount transfer (prevents negative counter after in-flight completions)
  • PermissionDenialCount = 0 reset in both CompleteResponse and SendPromptAsync
  • Tests for HasPermissionIssue threshold behavior

Test Coverage Gap

The new permission-recovery code path has no unit tests. Suggested test cases:

  1. TryRecoverPermissionAsync — verify state.ResponseCompletion is cancelled before new state creation
  2. SendPromptAsync — verify it succeeds after IsProcessing is cleared and SendingFlag is 0
  3. HandleReconnect — verify it only affects the target session, not all sessions
  4. Recovery attempt counter — verify TryRemove on successful tool call is not resetting it across recovery cycles

Required Fixes Before Merge

  1. Bug A (blocking): In TryRecoverPermissionAsync, before calling SendPromptAsync:

    • Clear state.Info.IsProcessing = false + all 9 companion fields via InvokeOnUI (see processing-state-safety invariant)
    • Add diagnostic log [PERMISSION-RECOVER-RESEND]
  2. Bug B (blocking): Add state.ResponseCompletion?.TrySetCanceled() before creating newState

  3. SendingFlag regression (blocking): Add Interlocked.Exchange(ref newState.SendingFlag, 0) before the resend call

  4. HandleReconnect scope (blocking): Scope the "Reconnect" button to per-session recovery only; do not call global ReconnectAsync

Recommended Action: ⚠️ Request changes

The permission-recovery feature is the core of this PR, and it remains non-functional after 3 commits of fixes. The resend path is blocked by two independent guards (IsProcessing + SendingFlag), the original awaiter leaks forever (Bug B), and the manual reconnect button has destructive global scope.

@PureWeen
Copy link
Owner Author

PureWeen commented Mar 5, 2026

PR #285 Round 4 Re-review (commit fdfac59c)

CI Status: ⚠️ No checks reported (branch has no CI configured — pre-existing)
Models: claude-opus-4.6 (×2), claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex
Consensus filter: 2+ models required


Previous Findings Status

Finding Status Evidence
🔴 Bug A — IsProcessing never cleared before resend FIXED (5/5) InvokeOnUI now clears IsProcessing=false + all 9 companion fields before cleanupDone.TrySetResult(). newState.Info == state.Info (shared reference), so resend guards pass.
🔴 Bug B — Old ResponseCompletion TCS orphaned FIXED (5/5) state.ResponseCompletion?.TrySetCanceled() added before new state creation. Original awaiter receives TaskCanceledException.
🔴 SendingFlag transfer regression FIXED (5/5) Interlocked.Exchange(ref state.SendingFlag, 0) clears old flag; newState.SendingFlag stays at default 0. Both CAS guards pass for resend.
🔴 HandleReconnect destroys ALL sessions FIXED (5/5) No longer calls global ReconnectAsync(). Now calls CopilotService.RecoverSessionAsync(sessionName). See new finding below for residual targeting issue.
🟡 P3 — SendPromptAsync from background thread (INV-2) STILL PRESENT (5/5) After await cleanupDone.Task.ConfigureAwait(false), execution resumes on thread pool. SendPromptAsync then sets state.Info.IsProcessing = true and invokes OnStateChanged from that background thread — direct INV-2 violation per codebase convention.
🟡 P4 — TryRemove resets loop guard on any tool success FIXED (5/5) _permissionRecoveryAttempts.TryRemove moved from per-tool-success path to CompleteResponse. Intermittent successes no longer reset the 2-attempt limit.
🟡 P5 — _lastRenderedSnapshot plain Dictionary under Timer FIXED (5/5) Changed to ConcurrentDictionary.
🟡 P6 — LastUpdatedAt non-atomic write from background thread FIXED (5/5) Now uses _lastUpdatedAtTicks with Interlocked.Read/Exchange, matching the existing ProcessingStartedAt pattern.

New Consensus Findings

🟡 MODERATE — HandleReconnect still targets wrong session (5/5 models)

Dashboard.razor:2248 / ChatMessageList.razor:83-87

HandleReconnect reads CopilotService.ActiveSessionName instead of the session whose button was clicked. The OnReconnectRequested EventCallback carries no session identity as it flows through ChatMessageList → ExpandedSessionView → Dashboard. If the user sees a permission banner for session A but has since navigated to session B (changing ActiveSessionName), the click recovers session B instead of the broken session A.

Fix: Change EventCallbackEventCallback<string> and thread the session name through the component tree. In Dashboard: OnReconnectRequested="() => HandleReconnect(session.Name)".


🟡 MODERATE — TryRecoverPermissionAsync failure path leaves session stuck (2/5 models)

CopilotService.Events.cs — catch block at end of TryRecoverPermissionAsync

If _client.ResumeSessionAsync throws (network error, session expired), execution jumps to the outer catch block. At that point:

  • The outer catch adds a warning message but does not call state.ResponseCompletion?.TrySetCanceled()
  • IsProcessing is never cleared
  • The old session was already await state.Session.DisposeAsync()'d

Result: the session is stuck in "Thinking…" until the watchdog fires (120–600s). The watchdog is a safety net that will eventually recover it, but the user experience is a multi-minute freeze with no clear indication.

Fix: The catch block should call InvokeOnUI(() => { state.Info.IsProcessing = false; ... state.Info.PermissionDenialCount = 0; ... }) and state.ResponseCompletion?.TrySetCanceled(). Alternatively, move the TCS cancel and watchdog cancel into a finally block.


🟡 MODERATE — P3 still present: SendPromptAsync on background thread (5/5 models)

CopilotService.Events.cs:1573-1582

Per the codebase INV-2 invariant: "All mutations to state.Info.IsProcessing must be marshaled to the UI thread." SendPromptAsync sets state.Info.IsProcessing = true and fires OnStateChanged at lines 2006/2039 of CopilotService.cs. After await cleanupDone.Task.ConfigureAwait(false), the continuation runs on a thread-pool thread — these mutations then happen off the UI thread.

Fix: Wrap the resend call inside an InvokeOnUI (or its async equivalent), or change ConfigureAwait(false) to ConfigureAwait(true) so the continuation resumes on the captured synchronization context.


🟢 MINOR — TaskCanceledException from TCS cancel reaches the original caller silently (2/5 models)

CopilotService.cs:2222 / Dashboard.razor ContinueWith

When state.ResponseCompletion?.TrySetCanceled() fires, the original SendPromptAsync's await state.ResponseCompletion.Task throws TaskCanceledException. Dashboard's ContinueWith(t => { if (t.IsFaulted) ... }) checks IsFaulted (not IsCanceled), so the cancellation is silently swallowed with no [COMPLETE] or [ERROR] diagnostic log entry. Functionally harmless since recovery takes over, but creates a blind spot in event diagnostics.


Summary

4 of 8 previously critical/moderate findings are confirmed fixed. P3 (INV-2 background thread) is still present. Two new moderate issues were identified — notably the failure-path cleanup gap, which can cause a session to freeze for 120-600 seconds if the network drops during recovery.

Recommended action: ⚠️ Request changes

Required before merge:

  1. Fix HandleReconnect to thread session name through the EventCallback chain
  2. Fix the catch block in TryRecoverPermissionAsync to clear IsProcessing, cancel the TCS, and call CancelProcessingWatchdog
  3. Fix P3: wrap the SendPromptAsync resend in InvokeOnUI or restore the synchronization context before calling it

@PureWeen
Copy link
Owner Author

PureWeen commented Mar 5, 2026

PR #285 Round 4 Re-review (commit fdfac59c)

CI Status: ⚠️ No checks reported
Models: claude-opus-4.6 (×2), claude-sonnet-4.6, gpt-5.3-codex (gemini unavailable)
Consensus filter: 2+ models required


Previous Findings — Status

Finding Status
🔴 Bug A — SendPromptAsync throws "already processing" FIXED (4/4)
🔴 Bug B — Old ResponseCompletion TCS orphaned forever FIXED (4/4)
🔴 SendingFlag transfer regression FIXED (4/4)
🔴 HandleReconnect destroys ALL sessions FIXED (4/4)
🟡 P3 — SendPromptAsync from Task.Run background thread STILL PRESENT (4/4)
🟡 P4 — TryRemove resets loop guard on clean tool call FIXED (4/4)
🟡 P5 — _lastRenderedSnapshot plain Dictionary under Timer FIXED (4/4)
🟡 P6 — LastUpdatedAt plain DateTime non-atomic write FIXED (4/4)

🔴 NEW CRITICAL — FlushCurrentResponse missing (4/4 models)

CopilotService.Events.csTryRecoverPermissionAsync, InvokeOnUI cleanup block

The cleanup sets IsProcessing = false but silently discards any partial assistant response:

InvokeOnUI(() => {
    state.CurrentResponse.Clear();   // ← accumulated partial response DISCARDED
    state.FlushedResponse.Clear();   // ← without persisting to History or ChatDB
    state.Info.IsProcessing = false;
    ...
});

Per INV-1, every IsProcessing=false path MUST call FlushCurrentResponse(state) before clearing the buffers. Without it, any streamed assistant text that arrived before the 3rd permission denial is silently lost — not written to state.Info.History, not persisted to the chat database.

Fix: Add FlushCurrentResponse(state); inside the InvokeOnUI block, before clearing CurrentResponse.


🔴 NEW CRITICAL — History.Add from SDK background thread without Invoke() (3/4 models)

CopilotService.Events.cs ~line 372 — ToolExecutionCompleteEvent handler

if (denialCount == 3)
{
    state.Info.History.Add(ChatMessage.SystemMessage(   // ← no Invoke()
        "⚠️ Permission errors detected. Attempting to reconnect session..."));
    _ = Task.Run(async () => await TryRecoverPermissionAsync(state, sessionName));
}

History is a List<ChatMessage>. This write happens on the SDK background event thread. The UI thread reads and iterates History during render. A concurrent Add can corrupt the list's internal array (InvalidOperationException: Collection was modified or silent memory corruption). Compare to lines 390–394 in the same handler which correctly use Invoke(() => { ... }).

Fix: Wrap the History.Add in Invoke(() => { ... }) like all other History mutations in the same handler.


🟡 STILL PRESENT — P3: SendPromptAsync from Task.Run background thread (4/4 models)

CopilotService.Events.csTryRecoverPermissionAsync

The cleanupDone TCS correctly synchronizes the cleanup, but the resend still violates INV-2:

_ = Task.Run(async () => await TryRecoverPermissionAsync(state, sessionName));
// inside TryRecoverPermissionAsync, after cleanup:
await SendPromptAsync(sessionName, lastPrompt, skipHistoryMessage: true);
// SendPromptAsync sets IsProcessing=true, ProcessingStartedAt, etc. on Task.Run thread

All IsProcessing mutations must be marshaled to the UI thread. Fix: use InvokeOnUI(() => _ = SendPromptAsync(...)) instead of direct call.


🟡 NEW MODERATE — HasUsedToolsThisTurn not cleared on IsProcessing=false path (2/4 models)

CopilotService.Events.csTryRecoverPermissionAsync, InvokeOnUI cleanup block

The cleanup correctly clears IsResumed, ProcessingStartedAt, ToolCallCount, ProcessingPhase, but does not clear HasUsedToolsThisTurn (it lives on SessionState, not AgentSessionInfo). The stale value is transferred to newState:

newState.HasUsedToolsThisTurn = state.HasUsedToolsThisTurn;  // transfers stale true

If the resend succeeds, SendPromptAsync resets it on entry — no harm. But if lastPrompt is null/empty (no resend), newState sits with IsProcessing=false but HasUsedToolsThisTurn=true. On the user's next manual prompt, the watchdog will select the 600s timeout instead of 120s, delaying stuck-session detection.

Fix: Add newState.HasUsedToolsThisTurn = false; after the transfer.


Test Coverage Gap

No unit tests cover the recovery flow:

  1. Test that FlushCurrentResponse is called (content preserved to History) before IsProcessing=false
  2. Test that History.Add for the permission warning reaches History (i.e. is not lost due to thread race)
  3. Test that SendPromptAsync succeeds after recovery (guards are all clear)

Required Fixes Before Merge

  1. Add FlushCurrentResponse(state); in the InvokeOnUI cleanup block before state.CurrentResponse.Clear() (INV-1)
  2. Wrap History.Add (at denialCount == 3) in Invoke(() => { ... }) (thread safety)
  3. Add newState.HasUsedToolsThisTurn = false; after the transfer (INV-1 partial)
  4. (Carry-over P3) Marshal the SendPromptAsync resend via InvokeOnUI (INV-2)

Recommended Action: ⚠️ Request changes

Two new critical issues were introduced. The 4 criticals from Round 3 are all fixed — good progress. Three small additions will complete the fix.

@PureWeen
Copy link
Owner Author

PureWeen commented Mar 5, 2026

Round 4 Action Items — 4 remaining before merge

Great progress on fdfac59c — 6 of 8 prior findings are now fixed! Four items remain:

1. 🔴 FlushCurrentResponse not called before discarding buffers

CopilotService.Events.csTryRecoverPermissionAsync InvokeOnUI cleanup block
Per INV-1, every IsProcessing=false path must persist accumulated response text first. Currently state.CurrentResponse.Clear() throws away any partial assistant message without writing it to History or the chat DB.
Fix: Add FlushCurrentResponse(state); inside the InvokeOnUI block, before clearing the buffers.

2. 🔴 History.Add called from SDK background thread without Invoke()

CopilotService.Events.cs ~line 372 — ToolExecutionCompleteEvent handler

if (denialCount == 3) {
    state.Info.History.Add(...);  // ← bare call on background thread, no Invoke()

History is a List<ChatMessage> iterated by the UI thread during render — concurrent mutation risks corruption. All other History.Add calls in this handler use Invoke(() => { ... }).
Fix: Wrap in the existing Invoke(() => { ... }) call.

3. 🟡 P3 (carry-over): SendPromptAsync resend still on Task.Run thread (INV-2)

After await cleanupDone.Task.ConfigureAwait(false), the resend runs on the thread pool. SendPromptAsync sets IsProcessing = true / ProcessingStartedAt etc. on a background thread, violating INV-2.
Fix: Use InvokeOnUI(() => _ = SendPromptAsync(...)) for the resend.

4. 🟡 HasUsedToolsThisTurn not cleared in cleanup (INV-1 partial)

newState.HasUsedToolsThisTurn is set to the stale true from the old state. If the resend is skipped (lastPrompt is null), the session sits idle with HasUsedToolsThisTurn=true, causing the watchdog to pick the 600s timeout on the user's next manual prompt.
Fix: Add newState.HasUsedToolsThisTurn = false; after the transfer.

@PureWeen
Copy link
Owner Author

PureWeen commented Mar 7, 2026

PR #285 Round 5 Re-review (commit fdfac59c)

CI Status: ⚠️ No checks reported (pre-existing — no CI configured)
Models: claude-opus-4.6 (×2), claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex
Consensus filter: 2+ models


Round 4 Findings Status

Finding Status Evidence
F1 — FlushCurrentResponse not called before buffer clear STILL PRESENT (5/5) state.CurrentResponse.Clear() in InvokeOnUI with no prior FlushCurrentResponse(state) call. Partial response text lost.
F2 — History.LastOrDefault on background thread STILL PRESENT (5/5) state.Info.History.LastOrDefault(...) runs before InvokeOnUI, on Task.Run background thread. List<ChatMessage> not thread-safe.
F3 — SendPromptAsync from background thread (INV-2) STILL PRESENT (5/5) After await cleanupDone.Task.ConfigureAwait(false), continuation runs on thread pool. SendPromptAsync sets IsProcessing=true without InvokeOnUI — INV-2 violation.
F4 — HasUsedToolsThisTurn not reset FIXED (5/5) SendPromptAsync resets state.HasUsedToolsThisTurn = false at line 2013 unconditionally, before watchdog starts. Stale copy is overwritten immediately.

Carry-over Finding from Round 4 (previously flagged, still present)

🟡 MODERATE — Catch block missing IsProcessing cleanup on ResumeSessionAsync failure (4/5 models)

CopilotService.Events.cs — outer catch block of TryRecoverPermissionAsync

If _client.ResumeSessionAsync(...) throws, execution jumps to the outer catch. At that point:

  • state.ResponseCompletion?.TrySetCanceled() has not been called yet (it's placed AFTER the await), so the original SendPromptAsync awaiter hangs on await state.ResponseCompletion.Task until the watchdog fires (120-600s)
  • state.Info.IsProcessing remains true — UI shows "Thinking…" with no exit path
  • state.SendingFlag remains 1 — session locked for new sends

The catch block only adds a system message and calls OnStateChanged. Minimum fix:

catch (Exception ex)
{
    state.ResponseCompletion?.TrySetCanceled();
    InvokeOnUI(() =>
    {
        state.Info.IsProcessing = false;
        state.Info.ProcessingStartedAt = null;
        state.Info.ToolCallCount = 0;
        state.Info.ProcessingPhase = 0;
        state.Info.HasUsedToolsThisTurn = false;
        Interlocked.Exchange(ref state.SendingFlag, 0);
        state.Info.History.Add(ChatMessage.SystemMessage("⚠️ Auto-recovery failed..."));
        OnStateChanged?.Invoke();
    });
}

Outstanding Consensus Findings (requiring fixes)

🟡 MODERATE — F1: FlushCurrentResponse (5/5)

CopilotService.Events.csTryRecoverPermissionAsync InvokeOnUI block

Add FlushCurrentResponse(state) call before state.CurrentResponse.Clear() to persist any partial response text. Even though the session is being reconnected, silently discarding streamed content violates the content-persistence invariant. The flush includes a dedup guard (checked via state.FlushedResponse) so it won't duplicate already-persisted content.

🟡 MODERATE — F2: History read on background thread (5/5)

CopilotService.Events.cs — before InvokeOnUI in TryRecoverPermissionAsync

Move the History read inside the InvokeOnUI lambda (before cleanupDone.TrySetResult()), then pass the extracted lastPrompt to the resend via a captured variable:

string? lastPrompt = null;
InvokeOnUI(() =>
{
    var lastUserMsg = state.Info.History.LastOrDefault(m => m.Role == "user");
    lastPrompt = lastUserMsg?.OriginalContent ?? lastUserMsg?.Content;
    // ... existing cleanup ...
    cleanupDone.TrySetResult();
});

🟡 MODERATE — F3: SendPromptAsync INV-2 (5/5)

CopilotService.Events.cs — resend at end of TryRecoverPermissionAsync

Change ConfigureAwait(false) to ConfigureAwait(true) so the resend continuation captures the synchronization context, OR wrap the resend in InvokeOnUI/InvokeAsync:

await cleanupDone.Task; // remove ConfigureAwait(false)

This ensures SendPromptAsync runs on the UI thread, consistent with all other call sites.


Summary

1 of the 4 Round 4 findings is confirmed fixed (F4). F1, F2, F3 remain. The catch-block cleanup gap is a separate carry-over moderate issue.

Recommended action: ⚠️ Request changes

Required before merge:

  1. Add FlushCurrentResponse(state) before clearing buffers (F1)
  2. Move History read into InvokeOnUI lambda (F2)
  3. Remove ConfigureAwait(false) from cleanupDone await so resend runs on UI thread (F3)
  4. Add IsProcessing cleanup + TCS cancel to catch block

@PureWeen
Copy link
Owner Author

PureWeen commented Mar 7, 2026

Round 6 Re-Review: ⚠️ Request Changes

5-model parallel review (claude-opus-4.6 ×2, claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex)
Commit reviewed: 892c7d30


Previous Findings Status ✅ All Fixed

Finding Status
F1: FlushCurrentResponse before Clear in success path ✅ FIXED (5/5)
F2: History.LastOrDefault on background thread ✅ FIXED (5/5)
F3: SendPromptAsync called from threadpool ✅ FIXED (5/5)
F4: HasUsedToolsThisTurn not reset in SendPromptAsync ✅ FIXED (5/5)
NEW/R5: Catch block missing IsProcessing cleanup ✅ FIXED (5/5)

All 5 prior round findings are confirmed fixed. However 3 new consensus issues were found.


New Consensus Findings

🔴 CRITICAL — History.Add on SDK background thread (5/5 models)

CopilotService.Events.cs, ToolExecutionCompleteEvent handler, denialCount == 3 branch

state.Info.History.Add(ChatMessage.SystemMessage("⚠️ Permission errors detected..."));  // background thread!
InvokeOnUI(() => _ = TryRecoverPermissionAsync(state, sessionName));

This bare History.Add runs on the SDK event thread outside any InvokeOnUI. History is List<ChatMessage> — not thread-safe. The Blazor render loop and FlushCurrentResponse enumerate this list on the UI thread simultaneously. This is the same class of violation as F2, just at a new call site.

Fix: Move the History.Add inside the existing InvokeOnUI callback (can be prepended to the TryRecoverPermissionAsync launch lambda), or into its own InvokeOnUI(() => state.Info.History.Add(...)).


🟡 MODERATE — HasUsedToolsThisTurn not cleared in success-path cleanup (4/5 models)

CopilotService.Events.cs, TryRecoverPermissionAsync success path InvokeOnUI block

The cleanup block sets IsProcessing = false but does not clear HasUsedToolsThisTurn (the catch block does clear it). This violates INV-1 checklist item 2. newState.HasUsedToolsThisTurn = state.HasUsedToolsThisTurn copies the stale true value to the live session. If lastPrompt is null/empty and the resend is skipped, IsProcessing=false with HasUsedToolsThisTurn=true persists — any subsequent watchdog will incorrectly select the 600s tool timeout instead of 120s.

Fix: Add newState.HasUsedToolsThisTurn = false; in the InvokeOnUI cleanup block alongside the other field clears.


🟡 MODERATE — FlushCurrentResponse called after IsProcessing=false in catch block (2/5 models)

CopilotService.Events.cs, TryRecoverPermissionAsync catch block InvokeOnUI

state.Info.IsProcessing = false;   // ← cleared first
...
FlushCurrentResponse(state);       // ← flush happens after (INV-1 item 7 violated)

The success path has the correct order (FlushCurrentResponse before IsProcessing=false). The catch block has it inverted. INV-1 item 7: "Call FlushCurrentResponse(state) BEFORE clearing IsProcessing". If FlushCurrentResponse ever adds an IsProcessing guard (consistent with the direction of the codebase), this will silently drop response content on recovery failures.

Fix: Move FlushCurrentResponse(state) to before state.Info.IsProcessing = false in the catch block's InvokeOnUI, matching the success path ordering.


Summary

# Severity Finding Models
N1 🔴 CRITICAL History.Add on background thread (ToolExecutionCompleteEvent) 5/5
N2 🟡 MODERATE HasUsedToolsThisTurn not cleared in success path InvokeOnUI 4/5
N3 🟡 MODERATE FlushCurrentResponse after IsProcessing=false in catch block 2/5

3 targeted fixes needed, all in TryRecoverPermissionAsync and its callsite.

@PureWeen
Copy link
Owner Author

PureWeen commented Mar 7, 2026

PR #285 Round 6 Re-review (commit 892c7d30)

CI Status: ⚠️ No checks reported (pre-existing — no CI configured)
Models: claude-opus-4.6 (×2), claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex
Consensus filter: Only issues flagged by 2+ models included


Previous Findings Status — All FIXED ✅

Finding Status Verification
F1: FlushCurrentResponse not called before Clear() ✅ FIXED FlushCurrentResponse(state) now called before CurrentResponse.Clear() in InvokeOnUI block
F2: History.LastOrDefault on background thread ✅ FIXED Moved inside InvokeOnUI lambda via cleanupDone TCS pattern
F3: SendPromptAsync after ConfigureAwait(false) ✅ FIXED ConfigureAwait(false) removed; continuation stays on UI thread
F4: HasUsedToolsThisTurn not reset ✅ FIXED HasUsedToolsThisTurn = false present in SendPromptAsync
NEW: Catch block missing cleanup ✅ FIXED Full 9-field cleanup + FlushCurrentResponse + TCS.TrySetException in catch

All 5 models unanimously agree all previous findings are fixed.


New Consensus Findings

🔴 CRITICAL — History.Add() on background thread (4/5 models)

CopilotService.Events.cs ~line 368 — In the ToolExecutionCompleteEvent handler, when denialCount == 3:

if (denialCount == 3)
{
    state.Info.History.Add(ChatMessage.SystemMessage(   // ← SDK background thread!
        "⚠️ Permission errors detected..."));
    InvokeOnUI(() => _ = TryRecoverPermissionAsync(state, sessionName));
}

List<ChatMessage> is not thread-safe. This Add() executes on the SDK event callback thread while the UI thread may be enumerating History for rendering — risking InvalidOperationException or internal array corruption. Every other History.Add in this handler correctly goes through Invoke().

Fix: Move the History.Add into the InvokeOnUI lambda or a separate Invoke() call.


🟡 MODERATE — Watchdog exceededMaxTime change is logically redundant (3/5 models)

CopilotService.Events.cs ~line 1399:

-var exceededMaxTime = totalProcessingSeconds >= WatchdogMaxProcessingTimeSeconds;
+var exceededMaxTime = totalProcessingSeconds >= WatchdogMaxProcessingTimeSeconds
+    && elapsed >= effectiveTimeout;

Since exceededMaxTime now requires elapsed >= effectiveTimeout, and that is already the first branch of if (elapsed >= effectiveTimeout || exceededMaxTime), the entire exceededMaxTime term is logically redundant: A || (B && A) ≡ A. This silently disables the absolute time safety net — a session receiving periodic events (keeping elapsed below effectiveTimeout) can now run forever past WatchdogMaxProcessingTimeSeconds.

Fix: If intent is "only fire max-time when events are somewhat stale," use a lower staleness threshold (e.g., elapsed >= 30) rather than effectiveTimeout.


🟡 MODERATE — CheckForMissedUpdates iterates sessions from timer thread (2/5 models)

Dashboard.razor heartbeat timer:

_heartbeatTimer = new Timer(_ => CheckForMissedUpdates(), null, 5000, 5000);
// ...
var snapshot = sessions;  // List<AgentSessionInfo>, not thread-safe
foreach (var session in snapshot) { ... }

sessions is a List<> mutated on the UI thread. The Timer fires on ThreadPool. Concurrent enumeration + mutation throws InvalidOperationException. The bare catch {} swallows it but silently breaks the heartbeat for that tick.

Fix: Use InvokeAsync to marshal to Blazor sync context, or take a .ToArray() snapshot.


🟡 MODERATE — Race between zombie SDK events and TryRecoverPermissionAsync (2/5 models)

CopilotService.Events.cs TryRecoverPermissionAsyncnewState shares the Info object reference with old state. Between DisposeAsync and _sessions[sessionName] = newState, in-flight SDK events on the old session can invoke HandleSessionEvent and mutate the shared Info (e.g., calling CompleteResponse, setting IsProcessing=false), potentially corrupting the newly resumed session state.

Fix: Set a guard flag (e.g., state.IsRecovering) checked in CompleteResponse, or ensure HandleSessionEvent's isCurrentState check catches this.


🟢 MINOR — _permissionDenialCount backing field is public (3/5 models)

AgentSessionInfo.cs ~line 50:

public int _permissionDenialCount;

Leaks the backing field, allowing external code to bypass Volatile.Read/Write semantics. Should be internal with an IncrementPermissionDenialCount() method.


🟢 MINOR — _lastRenderedSnapshot grows unbounded (2/5 models)

Dashboard.razor — Entries added to snapshot dictionary but never pruned when sessions are closed. Minor memory leak over long app lifetime.


Recommended Action: ⚠️ Request Changes

The 🔴 CRITICAL History.Add on background thread is a genuine thread-safety bug that can cause InvalidOperationException crashes during rendering. This must be fixed before merge.

The watchdog redundancy (🟡) silently removes the absolute time safety net and should be clarified/fixed.

The remaining moderate/minor issues are lower priority but worth addressing.

All 5 previous round findings are confirmed fixed — excellent progress! One more small fix and this is ready.

PureWeen and others added 13 commits March 7, 2026 00:53
FlushCurrentResponse persists messages to History on each TurnEnd, but
in zero-idle sessions (where SessionIdleEvent never arrives), the event-
driven render pipeline can miss updates because:
- RenderThrottle may suppress the OnStateChanged callback
- CompleteResponse/HandleComplete never fires to bypass the throttle
- The session appears stuck showing stale content

Add a 5-second periodic heartbeat timer that compares each session's
current state (MessageCount, IsProcessing, ToolCallCount, ProcessingPhase)
against the last-rendered snapshot. If anything changed, schedule a render.
This acts as a safety net alongside the existing event-driven mechanism.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When a session is processing, the status indicator now shows how long
since the last SDK event (e.g., 'Working · 2m · 8 tools… (last activity
45s ago)'). This helps distinguish actively-working sessions from stuck
ones, especially in zero-idle scenarios.

Changes:
- Update LastUpdatedAt on every meaningful SDK event in HandleSessionEvent
- Add LastActivityAt parameter to ChatMessageList with AppendLastActivity
- Add AppendShortLastActivity to SessionListItem sidebar status
- Wire LastActivityAt through SessionCard and ExpandedSessionView
- Heartbeat timer now also re-renders processing sessions so elapsed
  time and last-activity text stay fresh

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Revert unintended changes to CopilotService.cs, Organization.cs,
  and skill documentation files from a previous session
- Fix thread safety in heartbeat timer: snapshot sessions list before
  iterating to avoid racing with UI-thread mutations
- Add null/empty guard for snapshot

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tton

Detect when the copilot CLI's permission system returns 'Permission denied'
on tool calls (caused by lost callback binding in headless mode). After 3
consecutive denials:
- Show amber warning banner in chat with Reconnect button
- Show ⚠️ indicator in sidebar session list
- Add system message to chat history explaining the issue

Reconnect button calls ReconnectAsync to re-establish the permission callback.
Count resets on successful tool completion or new prompt.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When 3 consecutive 'Permission denied' tool failures are detected, automatically
attempt to recover by resuming just the affected session with a fresh
AutoApprovePermissions callback. Other sessions are untouched.

On success: resets denial count, adds '✅ Session reconnected' system message.
On failure: falls back to UI warning banner with manual Reconnect button.

The per-session reconnect disposes only the broken SDK session, creates a new
one via ResumeSessionAsync (preserving chat history, watchdog state, and
processing generation), and re-registers the event handler.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…work

After a successful per-session reconnect, finds the last user message and
resends it with skipHistoryMessage:true so the agent picks up where it left
off. Adds a recovery attempt counter (max 2) to prevent infinite loops if
the permission issue persists. Counter resets when a tool succeeds.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…n recovery

- Use Interlocked.Increment for PermissionDenialCount to prevent duplicate
  recovery triggers from concurrent tool completion events
- Register event handler AFTER dictionary replacement so HandleSessionEvent's
  isCurrentState check passes for the new state (prevents dropped events)
- Transfer ActiveToolCallCount to prevent negative values from in-flight
  tool completions decrementing the new state's default-zero counter
- Transfer SendingFlag to prevent concurrent SendPromptAsync races

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The watchdog previously killed sessions after 3600s total processing time
regardless of event activity. This caused premature kills on long-running
CI/agent sessions that were actively receiving events.

Now the max-time cap only fires when BOTH conditions are true:
- Total processing time exceeds 3600s, AND
- No events received within the effectiveTimeout window

If events are still flowing, the session is alive — let it work.
The cap still catches zombie sessions where non-progress events keep
the inactivity timer happy without any real work happening.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…hread safety

Bug A (CRITICAL): Clear IsProcessing + all 9 companion fields via InvokeOnUI
before calling SendPromptAsync in recovery path. Use TaskCompletionSource to
await UI-thread cleanup completion before resending.

Bug B (CRITICAL): Call state.ResponseCompletion?.TrySetCanceled() before
creating new state, so the original SendPromptAsync awaiter doesn't hang.

SendingFlag regression (CRITICAL): Don't transfer SendingFlag=1 to newState
since we need it at 0 for the resend. Clear old state's flag instead.

HandleReconnect scope (CRITICAL): Changed from global ReconnectAsync (destroys
all sessions) to per-session RecoverSessionAsync via new public method.

P4 (recovery attempt reset): Only reset _permissionRecoveryAttempts on
successful turn completion (CompleteResponse), not on every successful tool.

P5 (Dictionary thread safety): Use ConcurrentDictionary for heartbeat
_lastRenderedSnapshot to prevent corruption from overlapping Timer callbacks.

P6 (LastUpdatedAt thread safety): Use Interlocked ticks pattern matching
ProcessingStartedAt's thread-safe implementation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The strict consecutive counter was defeated by occasional OK tool results
interspersed with denials. Worker-1 hit 2 denials, then 1 OK (resetting
the counter), then 2 more denials — never reaching the threshold of 3.

Replace with a sliding window of the last 5 tool results. Recovery triggers
when 3+ of those 5 are denials, tolerating intermittent successes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ead resend, clear HasUsedToolsThisTurn

1. FlushCurrentResponse before clearing buffers in recovery (INV-1)
2. Wrap History.Add in Invoke() in ToolExecutionComplete handler
3. Dispatch resend via InvokeOnUI to satisfy INV-2
4. Clear HasUsedToolsThisTurn after transfer to prevent stale 600s timeout

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…afety

Fixes from 4-model code review (Opus, Sonnet, GPT-5.1, Gemini):

1. HandleReconnect targeted ActiveSessionName instead of clicked session —
   replaced with lambda capturing session.Name (Sonnet)
2. _permissionRecoveryAttempts not cleared on abort — added TryRemove in
   AbortSessionAsync (Sonnet)
3. Concurrent recovery race — added _recoveryInProgress ConcurrentDictionary
   guard with TryAdd/TryRemove in finally (Opus + GPT)
4. Thread-unsafe History.LastOrDefault on pool thread — moved read into
   InvokeOnUI cleanup block (Opus)
5. exceededMaxTime was dead code (subsumed by elapsed check) — simplified
   to use exceededMaxTime only for logging (Opus + Gemini)
6. PermissionDenialCount public setter could desync from sliding window —
   changed to read-only property (Gemini)
7. Bumped fragile RightClickContextMenu test threshold (our new markup
   pushed distance from 161 to 284 chars)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… Invoke calls

- N1: Merge two Invoke() calls into one atomic UI dispatch for
  History.Add + TryRecoverPermissionAsync (was two separate posts)
- N2: Add HasUsedToolsThisTurn + ActiveToolCallCount reset to
  success-path cleanup (INV-1 completeness)
- F3: Replace Task.Run with Invoke() so TryRecoverPermissionAsync
  runs on UI thread; remove ConfigureAwait(false); simplify resend
  to direct await (no InvokeOnUI+TCS dance needed)
- Catch block: Full IsProcessing cleanup with TCS cancel, SendingFlag
  reset, FlushCurrentResponse, all 9 companion fields cleared

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen force-pushed the session-20260304-233713 branch from 5235bfc to 56b4b3a Compare March 7, 2026 07:54
@PureWeen
Copy link
Owner Author

PureWeen commented Mar 7, 2026

Round 8 Re-Review: ⚠️ Request Changes (1 issue)

5-model parallel review (claude-opus-4.6 ×2, claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex)
Commit reviewed: 56b4b3aa


Previous Findings: ✅ All Fixed (9/9)

Finding Status
F1: FlushCurrentResponse before Clear (success path) ✅ FIXED (5/5)
F2: History.LastOrDefault on background thread ✅ FIXED (5/5)
F3: SendPromptAsync on threadpool ✅ FIXED (5/5)
F4: HasUsedToolsThisTurn not reset ✅ FIXED (5/5)
NEW-R5: Catch block missing IsProcessing cleanup ✅ FIXED (5/5)
N1: History.Add at denialCount==3 on background thread ✅ FIXED (5/5)
N2: HasUsedToolsThisTurn not cleared in success path ✅ FIXED (5/5)
N3: FlushCurrentResponse after IsProcessing=false in catch ✅ FIXED (5/5)
N4: Watchdog exceededMaxTime AND condition ✅ FIXED (5/5)

The sliding window redesign (5-result window, 3-denial threshold) and _recoveryInProgress concurrency guard are solid additions.


New Consensus Finding (4/5 models)

🟡 MODERATE — attempts > 2 early return leaves session frozen until watchdog

CopilotService.Events.cs, TryRecoverPermissionAsync — the attempts > 2 branch (and the null-client/null-sessionId guard)

When recovery has been attempted 3 times, the method posts a warning message and returns without clearing processing state. state.ResponseCompletion.TCS is never resolved — the original SendPromptAsync caller hangs until watchdog (up to 600s). SendingFlag remains set, so SendPromptAsync will silently reject any new prompt. The user sees the warning AND "Thinking..." simultaneously with no way to send a new message (only Stop works). The null-guard early return has the same issue.

Fix: Apply the same cleanup as the catch block before each early return:

  • state.ResponseCompletion?.TrySetCanceled()
  • Interlocked.Exchange(ref state.SendingFlag, 0)
  • Full 9-field IsProcessing cleanup (FlushCurrentResponse first, then clear the fields)

Test Coverage

✅ 5 new unit tests for sliding window permission detection
✅ Heartbeat timer, permission UI, LastUpdatedAt thread-safety all covered
🔲 No test for the attempts > 2 early return path (should verify session is releasable after max attempts)


8 rounds of iteration. Every prior finding is fixed. One targeted fix needed on the early returns, then this PR is ready to merge.

…very

Extract ClearProcessingStateForRecoveryFailure helper to avoid tripling
cleanup code. Called from: null-guard return, attempts>2 return, and
catch block. Prevents session stuck in 'Thinking...' until watchdog
when recovery is skipped or fails.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen changed the title Add heartbeat timer and liveness signal for zero-idle sessions Session resilience: heartbeat timer, permission auto-recovery, activity-aware watchdog Mar 7, 2026
@PureWeen
Copy link
Owner Author

PureWeen commented Mar 7, 2026

PR #285 — Round 9 Review (5-model consensus)

Models: claude-opus-4.6 ×2, claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex

Previous Findings Status — ALL 10 FIXED ✅

Finding Status Consensus
F1 (sliding window) ✅ FIXED 5/5
F2 (INV-1 cleanup) ✅ FIXED 5/5
F3 (Task.Run elimination) ✅ FIXED 5/5
F4 (watchdog exceededMaxTime) ✅ FIXED 5/5
NEW-R5 (state transfer to newState) ✅ FIXED 5/5
N1 (History.Add UI thread) ✅ FIXED 5/5
N2 (HasUsedToolsThisTurn cleared) ✅ FIXED 5/5
N3 (FlushCurrentResponse order) ✅ FIXED 5/5
N4 (exceededMaxTime display-only) ✅ FIXED 5/5
R8 (attempts > 2 early return cleanup) ✅ FIXED 5/5

The ClearProcessingStateForRecoveryFailure helper is well-designed and correctly addresses the R8 finding. INV-1 items 1-7 all pass.

New Consensus Findings (2+ model agreement)

🟡 R9-A (5/5 models) — Null-client guard missing OnStateChanged and user feedback
CopilotService.Events.cs:~1515 — The null-client guard calls ClearProcessingStateForRecoveryFailure and returns, but unlike the attempts > 2 and catch paths, it never fires OnStateChanged?.Invoke(), adds a History message, or emits a Debug log. If IsProcessing was true, the UI stays stuck on "Thinking…" until the 5s heartbeat catches it. One-liner fix — add OnStateChanged?.Invoke() and a Debug line after the helper call.

🟡 R9-C (5/5 models) — Catch block should use explicit InvokeOnUI for thread safety
CopilotService.Events.cs:~1637 — The catch block directly calls ClearProcessingStateForRecoveryFailure, History.Add, and OnStateChanged after multiple async awaits. The success path correctly marshals cleanup to the UI thread via InvokeOnUI(() => { ... }); the catch block does not. In MAUI the SynchronizationContext should keep continuations on the UI thread, but the success path is explicitly safe while the catch path relies on implicit behavior. For consistency with INV-1 #9, wrap the catch body in InvokeOnUI.

🟢 R9-B (5/5 models) — Missing diagnostic log in helper (INV-1 #8)
CopilotService.Events.cs:~1487ClearProcessingStateForRecoveryFailure sets IsProcessing=false but has no Debug(...) call. Per convention, every path that clears IsProcessing must have a diagnostic tag. Fix: add Debug($"[PERMISSION-RECOVER-FAIL] '{sessionName}' IsProcessing=false") inside the if (IsProcessing) block. (This also uses the currently-unused sessionName parameter — R9-E.)

🟢 R9-D (5/5 models) — _permissionDenialCount should be private
AgentSessionInfo.cs:~59 — Field is public int _permissionDenialCount but is only accessed internally via Volatile.Read/Write inside locked methods. Being public allows external writes that bypass the sliding-window lock. Change to private.

Verdict: ✅ Approve

All 10 previous findings are confirmed fixed. The 2 moderate findings (R9-A and R9-C) are non-blocking — R9-A is mitigated by the heartbeat timer (5s), and R9-C is mitigated by MAUI's SynchronizationContext. The ClearProcessingStateForRecoveryFailure helper is a clean, correct extraction that properly handles the INV-1 checklist (items 1-7, 9). After 9 rounds and 10 findings fixed, this PR is merge-ready. The R9 items are advisory and can be addressed in a follow-up commit.

Review by PR Review Squad worker-2 (Round 9, 5-model consensus)

@PureWeen PureWeen merged commit 188ae05 into main Mar 7, 2026
@PureWeen PureWeen deleted the session-20260304-233713 branch March 7, 2026 15:03
PureWeen added a commit that referenced this pull request Mar 7, 2026
…ty-aware watchdog (#285)

Add heartbeat timer and liveness signal for zero-idle sessions

9 rounds of multi-model review, all 10 findings fixed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen added a commit that referenced this pull request Mar 8, 2026
…onAsync cancellation, restored comment

- Changed TurnEndIdleCts from property to field to fix 7 CS0206 compile errors with Interlocked.Exchange(ref ...)
- Added TurnEndIdleCts cancellation in AbortSessionAsync to prevent delayed CompleteResponse on aborted sessions
- Restored deleted // Release send lock comment

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen added a commit that referenced this pull request Mar 8, 2026
…onAsync cancellation, restored comment

- Changed TurnEndIdleCts from property to field to fix 7 CS0206 compile errors with Interlocked.Exchange(ref ...)
- Added TurnEndIdleCts cancellation in AbortSessionAsync to prevent delayed CompleteResponse on aborted sessions
- Restored deleted // Release send lock comment

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen added a commit that referenced this pull request Mar 8, 2026
…roved permission detection (#305)

## Fixes issues from PR #285

### Changes
- **Gate auto-resend on successful tools** (fixes #298): Added
`SuccessfulToolCountThisTurn` to SessionState. Permission recovery skips
resend when tools already completed this turn, preventing re-execution
of side-effectful work. Shows `⟳ Resending: "..."` when resending, or
`Auto-resend skipped` when skipped.
- **TurnEnd→Idle fallback** (addresses #299): When AssistantTurnEndEvent
fires, schedule a 4s delayed CompleteResponse. Cancelled when
SessionIdleEvent arrives (normal path) or a new turn starts.
- **Improved permission denial detection** (improves #300): Extracted
IsPermissionDenialText() helper matching "Permission denied",
"denied-no-approval-rule", and "could not request permission".

### Issues Addressed
- Fixes #298
- Addresses #299 (workaround for SDK bug)
- Improves #300 (more robust detection)

### Verification
- Build: 0 errors
- 20/20 tests pass

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant