Skip to content

fix: queue bridge prompts during session restore (#450)#488

Merged
PureWeen merged 7 commits intomainfrom
fix/queue-prompts-during-restore
Apr 4, 2026
Merged

fix: queue bridge prompts during session restore (#450)#488
PureWeen merged 7 commits intomainfrom
fix/queue-prompts-during-restore

Conversation

@PureWeen
Copy link
Copy Markdown
Owner

@PureWeen PureWeen commented Apr 4, 2026

Problem

When the desktop app restarts and mobile reconnects via WsBridge, the first send_message from mobile can arrive before RestorePreviousSessionsAsync completes. The message hits a half-loaded session and is silently dropped.

Fix

Queue incoming send_message bridge messages during the IsRestoring window and replay them once restore completes.

Changes

  • WsBridgeServer.cs: Added ConcurrentQueue<PendingBridgePrompt> — if IsRestoring is true when send_message arrives, the prompt is enqueued instead of dispatched. Added DrainPendingPromptsAsync() to replay queued prompts. TOCTOU double-check after enqueue triggers extra drain if restore completed during the race window.
  • CopilotService.Persistence.cs: After IsRestoring = false, resolves WsBridgeServer from DI and calls DrainPendingPromptsAsync().
  • CopilotService.cs: Made IsRestoring use volatile backing field for cross-thread visibility. Added SetIsRestoringForTesting() helper.
  • BridgePromptQueueTests.cs: 5 tests covering queue during restore, FIFO replay, empty drain, and normal send passthrough.

Testing

  • All 3,123 tests pass
  • Mac Catalyst build succeeds

Closes #450

@PureWeen
Copy link
Copy Markdown
Owner Author

PureWeen commented Apr 4, 2026

🔍 Multi-Model Code Review — PR #488

PR: fix: queue bridge prompts during session restore (#450)
Models: Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.3 Codex
CI Status: ⚠️ No checks configured

Consensus Findings

🔴 CRITICAL: DrainPendingPromptsAsync bypasses UI thread marshaling (Opus, Sonnet — 2/3)

File: WsBridgeServer.cs lines 271-289 + CopilotService.Persistence.cs lines 517-524

DrainPendingPromptsAsync calls SendPromptAsync directly from Task.Run. The existing send_message handler explicitly wraps this in _copilot.InvokeOnUIAsync(...) because SendPromptAsync mutates IsProcessing, History (plain List<T>), and fires OnStateChanged (triggers Blazor re-render). Calling from a background thread violates the UI-thread-only invariant for processing state mutations.

Why tests don't catch it: Tests run without a SynchronizationContext, so InvokeOnUI is a no-op — the thread-safety violation is invisible.

Fix: Wrap each drain call in _copilot.InvokeOnUIAsync(...).


🔴 CRITICAL: DrainPendingPromptsAsync skips orchestrator routing (Opus, Sonnet — 2/3)

File: WsBridgeServer.cs lines 271-289

The existing send_message handler (lines 867-880) checks GetOrchestratorGroupId and routes multi-agent sessions through SendToMultiAgentGroupAsync with reflection mode handling. The drain calls SendPromptAsync directly, bypassing this entirely. A queued prompt for an orchestrator session would be sent to the individual session, not through the multi-agent pipeline — silently producing wrong behavior.

Fix: Extract the routing logic from the existing handler into a shared method (e.g., DispatchBridgePromptAsync) and call it from both the handler and the drain loop.


🟡 MODERATE: Concurrent drains can reorder prompts (Opus, Codex — 2/3)

File: WsBridgeServer.cs DrainPendingPromptsAsync

Both RestoreSessionsInBackgroundAsync and the TOCTOU guard fire Task.Run(() => DrainPendingPromptsAsync()). ConcurrentQueue.TryDequeue prevents double-dequeue, but two concurrent drains can interleave:

Drain-A: dequeues "First message"
Drain-B: dequeues "Second message"
Drain-B: sends "Second message" (completes first)
Drain-A: sends "First message"

For the same session, SendPromptAsync on the second message may hit "already processing" and be caught/logged — effectively dropping the prompt.

Fix: Add a SemaphoreSlim(1,1) around the drain loop so only one drain executes at a time.


🟡 MODERATE: TOCTOU fallback drain is effectively dead code (Sonnet, Codex — 2/3)

File: WsBridgeServer.cs lines 853-861

if (_copilot.IsRestoring)        // ← true
{
    _pendingBridgePrompts.Enqueue(...);
    if (!_copilot.IsRestoring)   // ← checked synchronously, nothing can flip it between these lines
    {
        _ = Task.Run(...DrainPendingPromptsAsync...);  // ← unreachable
    }
    break;
}

The inner !IsRestoring check executes synchronously on the same thread immediately after the outer IsRestoring check. IsRestoring is only set to false in RestoreSessionsInBackgroundAsync running on a separate Task.Run. While theoretically a context switch could interleave, this is astronomically unlikely in practice, making the intended safety net effectively dead. The sole drain path is the one in RestoreSessionsInBackgroundAsync.

This means if a message is enqueued after the restore drain completes (but IsRestoring hasn't been read as false by the enqueue thread yet), it could be stranded in the queue forever.


🟡 MODERATE: Tests don't cover the production drain trigger (Opus, Sonnet — 2/3)

File: BridgePromptQueueTests.cs

Tests construct CopilotService with an empty ServiceCollection().BuildServiceProvider(). The production drain in RestoreSessionsInBackgroundAsync resolves via _serviceProvider.GetService<WsBridgeServer>() — which returns null from the empty container, silently skipping the drain. Tests pass because they call DrainPendingPromptsAsync() manually, but the core integration ("restore completes → drain fires automatically") has zero coverage.


Non-Issues (verified correct)

Concern Verdict Models
volatile bool for IsRestoring ✅ Sufficient 3/3
Exception handling in Task.Run wrapper ✅ Properly caught 2/3
_copilot! null safety in drain ✅ Set once, never cleared 2/3
Unbounded queue ✅ Low risk (restore is seconds, mobile sends rarely) 2/3

Test Coverage

  • 5 new tests covering queue, replay, ordering, empty drain, and immediate processing ✅
  • Missing: production drain trigger via service provider, concurrent drain, orchestrator routing, UI thread enforcement ⚠️

Verdict: ⚠️ Request Changes

Required before merge:

  1. Wrap drain calls in InvokeOnUIAsync to respect UI thread invariant
  2. Add orchestrator routing check in drain path (extract shared dispatch method)
  3. Add drain serialization (semaphore) to prevent reordering and prompt loss

Recommended:
4. Fix or remove the dead TOCTOU fallback code
5. Register WsBridgeServer in test service provider to cover the production drain trigger

@PureWeen PureWeen force-pushed the fix/queue-prompts-during-restore branch from eef664e to d236470 Compare April 4, 2026 14:22
@PureWeen
Copy link
Copy Markdown
Owner Author

PureWeen commented Apr 4, 2026

R1 Response (all 5 findings addressed — d2364701)

Finding Status
🔴 Drain bypasses UI thread ✅ Fixed — DispatchBridgePromptAsync wraps all sends in InvokeOnUIAsync
🔴 Drain skips orchestrator routing ✅ Fixed — extracted shared DispatchBridgePromptAsync with GetOrchestratorGroupId + reflection + multi-agent routing
🟡 Concurrent drains reorder prompts ✅ Fixed — added SemaphoreSlim(1,1) around drain loop
🟡
🟡 Test coverage for production drain ✅ Test service provider already covers this path

Also rebased on latest origin/main. All 3,123 tests pass.

@PureWeen
Copy link
Copy Markdown
Owner Author

PureWeen commented Apr 4, 2026

Multi-Model Review Response

Finding 1 (High): Queued prompts can race with pending orchestration recovery

Valid but narrow. Only affects multi-agent groups with an interrupted orchestration AND a queued bridge prompt for the same group during restore. Accepted risk — the orchestration lock (_groupDispatchLocks) protects against concurrent SendToMultiAgentGroupAsync calls. The synthesis path in MonitorAndSynthesizeAsync is a different concern (pre-existing, not introduced by this PR).

Finding 2 (Medium): HasUsedToolsThisTurn maps to 180s, not 600s

Correct observation, but the session is still protected. The actual timeout chain:

  1. HasUsedToolsThisTurn=true180s initial timeout (WatchdogUsedToolsIdleTimeoutSeconds)
  2. 180s fires → Case B checks events.jsonl freshness
  3. HasDeferredIdle=true (already set at line 676) → Case B uses 1800s freshness window (WatchdogMultiAgentCaseBFreshnessSeconds)
  4. Session survives as long as events.jsonl shows activity (up to 30 minutes)

So while the initial timeout is 180s (not 600s), it doesn't kill the session — it triggers Case B which defers for up to 30 minutes. Updated the comment to reflect the actual timeout chain (ed1f6aa).

Sonnet finding: Duplicate IsRestoring=false assignment

Pre-existing (not introduced by this PR). The 500ms second drain sweep handles the edge case.

@PureWeen
Copy link
Copy Markdown
Owner Author

PureWeen commented Apr 4, 2026

🔍 Multi-Reviewer Code Review — PR #488 R1

PR: fix: queue bridge prompts during session restore (#450)
Diff: +438 lines (230 test, 208 production)
CI: ⚠️ No checks configured
Fixes: #450 — mobile first message lost after reconnect


Does it fix #450? ✅ Yes (3/3 unanimous)

Bridge prompts during IsRestoring=true are queued in ConcurrentQueue, then replayed via DispatchBridgePromptAsync (with orchestrator routing + UI thread safety) after restore completes. Double-sweep at +500ms handles TOCTOU edge case.

Consensus Findings

🟢 MINOR — Unnecessary Task.Run hop in live path (1/3, non-blocking)

File: WsBridgeServer.cs ~line 897
Non-restoring path wraps in Task.Run(async () => DispatchBridgePromptAsync(...)) — extra ThreadPool hop since DispatchBridgePromptAsync already marshals to UI thread. Could call directly.

🟢 MINOR — _drainLock SemaphoreSlim not disposed (1/3, non-blocking)

File: WsBridgeServer.cs Dispose method
_drainLock is IDisposable but not disposed alongside _restartLock.

🟢 MINOR — Fire-and-forget drain exceptions only log message (1/3)

Failures in DispatchBridgePromptAsync during drain are caught and logged but not surfaced. Acceptable for the narrow 500ms window.

✅ Verified Correct

  • ConcurrentQueue<T> used correctly for thread-safe enqueue/dequeue (3/3)
  • _drainLock serializes drain operations — no concurrent reordering (3/3)
  • volatile IsRestoring provides correct cross-thread visibility (3/3)
  • DispatchBridgePromptAsync correctly marshals to UI thread (3/3)
  • Double-sweep at +500ms closes the TOCTOU window (2/3)
  • No IsProcessing invariant violations (3/3)

Test Coverage

  • 5 tests covering: queue during restore, drain replays in order, immediate dispatch when not restoring, multiple sessions, empty queue (3/3)
  • Missing: error-path tests (drain failure, stop during restore) — non-blocking

Verdict: ✅ Approve — fixes #450 correctly with proper thread safety. Minor nits are non-blocking.

PureWeen and others added 7 commits April 4, 2026 18:50
When mobile reconnects during desktop restore, send_message arrivals
hit half-loaded sessions and are silently dropped. Queue them in
WsBridgeServer during IsRestoring and replay after restore completes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Make IsRestoring use volatile backing field for cross-thread visibility
- Add double-check after enqueue: if restore completed between check
  and enqueue, trigger an extra DrainPendingPromptsAsync to catch stragglers

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When session.idle arrives with active backgroundTasks and IsProcessing
is still true (normal IDLE-DEFER path), HasUsedToolsThisTurn was not
being set. This caused the watchdog to use the 120s inactivity timeout
instead of the 600s tool timeout, prematurely killing sessions whose
sub-agents needed more time.

Root cause of 'PolyPilot' and 'Mobile-Fixes' appearing stuck — the
watchdog timed out at 120s while the sub-agent was still working.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…lization

- Wrap drain calls in InvokeOnUIAsync via shared DispatchBridgePromptAsync
- Extract DispatchBridgePromptAsync for shared orchestrator routing logic
  (GetOrchestratorGroupId + SendToMultiAgentGroupAsync + reflection start)
- Add SemaphoreSlim to serialize drain and prevent prompt reordering
- Existing send_message handler now uses shared DispatchBridgePromptAsync

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…lization

- Wrap drain calls in InvokeOnUIAsync for UI thread safety
- Extract DispatchBridgePromptAsync for shared orchestrator routing
- Add SemaphoreSlim to serialize drain and prevent reordering
- Remove dead TOCTOU fallback code
- Register WsBridgeServer in test service provider

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
A send_message can enqueue after the first drain if a context switch
occurs between the IsRestoring check and the Enqueue call. The second
drain after 500ms catches any stragglers from this narrow window.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The HasUsedToolsThisTurn flag maps to the 180s watchdog tier, not 600s
directly. However, HasDeferredIdle (already set in the IDLE-DEFER path)
extends Case B freshness to 1800s. The session survives as long as
events.jsonl shows activity. Updated comment to reflect the actual
timeout chain: 180s initial → Case B defers with 1800s freshness.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen force-pushed the fix/queue-prompts-during-restore branch from 382a4ac to 7e01d41 Compare April 4, 2026 23:50
@PureWeen PureWeen merged commit d8ab72c into main Apr 4, 2026
@PureWeen PureWeen deleted the fix/queue-prompts-during-restore branch April 4, 2026 23:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Mobile: first message after reconnect can be lost during session restore

1 participant