Orchestration relaunch resilience: persist dispatch state, resume workers, fix watchdog by PureWeen · Pull Request #207 · PureWeen/PolyPilot

PureWeen · 2026-02-25T00:37:06Z

Summary

Makes multi-agent orchestration dispatch resilient to app relaunches. When the app is relaunched while workers are processing, the orchestrator can now automatically resume and collect results.

Root Cause

When the app relaunches mid-dispatch, the in-memory Task.WhenAll that awaits the workers' TaskCompletionSource (TCS) completions dies with the old process. Workers continue on the backend, but their results are never collected for synthesis.

Changes

Relaunch Resilience (CopilotService.Organization.cs)

PendingOrchestration model persisted to ~/.polypilot/pending-orchestration.json before dispatching workers
ResumeOrchestrationIfPendingAsync detects pending orchestrations on restart and monitors workers
MonitorAndSynthesizeAsync polls every 5s (15min timeout) until all workers idle, then auto-synthesizes
Dispatch state cleared in finally blocks to prevent leaked files on cancellation/error
Worker results filtered by dispatch timestamp to avoid picking up stale pre-dispatch messages
UTC→local time conversion for reliable timestamp comparison
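The save/load/clear lifecycle listed above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the fields of `PendingOrchestration` are not shown in this text, so the record shape (`GroupId`, `WorkerNames`, `StartedAt`) and the `PendingOrchestrationStore` class are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

// Hypothetical shape; the real model's fields are not shown in the PR text.
public sealed record PendingOrchestration(
    string GroupId,
    List<string> WorkerNames,
    DateTime StartedAt); // assumed to be stored as UTC

// Hypothetical helper wrapping the pending-orchestration.json file.
public sealed class PendingOrchestrationStore
{
    private readonly string _path;

    public PendingOrchestrationStore(string baseDir) =>
        _path = Path.Combine(baseDir, "pending-orchestration.json");

    // Called before workers are dispatched.
    public void Save(PendingOrchestration pending)
    {
        Directory.CreateDirectory(Path.GetDirectoryName(_path)!);
        File.WriteAllText(_path, JsonSerializer.Serialize(pending));
    }

    // Returns null when no dispatch was in flight or the file is unreadable.
    public PendingOrchestration? Load()
    {
        if (!File.Exists(_path)) return null;
        try
        {
            return JsonSerializer.Deserialize<PendingOrchestration>(File.ReadAllText(_path));
        }
        catch (JsonException)
        {
            return null;
        }
    }

    // Called from a finally block so the file is not leaked on error or cancellation.
    public void Clear()
    {
        if (File.Exists(_path)) File.Delete(_path);
    }
}
```

On restart, a non-null `Load()` result would be the signal to start monitoring the listed workers instead of dispatching again.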

Watchdog Fix (CopilotService.cs)

IsMultiAgentSession carried forward on session reconnect — without this, the watchdog used the 120s inactivity timeout instead of 600s, killing long-running workers prematurely
[DISPATCH] tag added to diagnostic log file filter for post-mortem analysis
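To illustrate why the missing flag mattered, here is a small sketch of the reconnect carry-forward and the timeout choice. The `SessionState` class below is a stand-in with only the two flags named in this PR, and `WatchdogSketch` with its method names is hypothetical; only the 120s and 600s values come from the description.

```csharp
using System;

// Hypothetical stand-in for the app's per-session state.
public sealed class SessionState
{
    public bool IsMultiAgentSession { get; set; }
    public bool HasUsedToolsThisTurn { get; set; }
}

public static class WatchdogSketch
{
    public static readonly TimeSpan DefaultInactivityTimeout = TimeSpan.FromSeconds(120);
    public static readonly TimeSpan MultiAgentInactivityTimeout = TimeSpan.FromSeconds(600);

    // Multi-agent workers get the longer inactivity window.
    public static TimeSpan InactivityTimeoutFor(SessionState state) =>
        state.IsMultiAgentSession ? MultiAgentInactivityTimeout : DefaultInactivityTimeout;

    // On reconnect, copy the flags that must survive into the new state.
    // Omitting IsMultiAgentSession here reproduces the bug: the new state
    // silently falls back to the 120s timeout.
    public static SessionState CreateReconnectState(SessionState previous) => new SessionState
    {
        HasUsedToolsThisTurn = previous.HasUsedToolsThisTurn,
        IsMultiAgentSession = previous.IsMultiAgentSession,
    };
}
```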

Tests (MultiAgentRegressionTests.cs)

7 new tests: save/load/clear lifecycle, no-file resume, missing-group cleanup, dispatch tag guard, reconnect state guard, timestamp filter guard, finally block guard

Multi-Model Review

Reviewed by 3 models (Opus 4.6, Sonnet 4.6, GPT-5.3-Codex). All 3 consensus findings fixed:

✅ Worker result collection picks up pre-dispatch history → timestamp filter added
✅ UTC/local time mismatch → converted on comparison
✅ Pending file leaked on cancellation → try/finally added

Testing

1268 tests total, 1267 pass (1 pre-existing flaky TestIsolationGuard)
Manually tested orchestration resume after app relaunch

When the app is relaunched while an orchestrator dispatch is in progress, workers continue processing on the backend but the in-memory orchestration task dies. Previously, worker responses were silently lost.

Now the dispatch state is persisted to pending-orchestration.json before workers are dispatched. On app restart, the pending state is detected and a background monitor waits for all workers to complete, then automatically collects their responses and sends the synthesis prompt to the orchestrator.

- PendingOrchestration model with save/load/clear persistence
- SavePendingOrchestration called in both Orchestrator and OrchestratorReflect
- ClearPendingOrchestration called after successful synthesis
- ResumeOrchestrationIfPendingAsync monitors workers post-restart
- System messages show resume status in orchestrator chat
- 15-minute timeout on resume monitoring
- 4 new tests covering persistence round-trip and resume edge cases
- SetBaseDirForTesting clears _pendingOrchestrationFile for test isolation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
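The background monitor described in this commit amounts to a poll-until-idle loop with a deadline. The sketch below shows one way such a loop could look; `ResumeMonitorSketch`, `WaitForWorkersIdleAsync`, and the `isProcessing` probe are hypothetical names, and the real MonitorAndSynthesizeAsync may be structured differently.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ResumeMonitorSketch
{
    // Returns true when every worker went idle before the timeout, false otherwise.
    // Cancellation surfaces as OperationCanceledException from Task.Delay.
    public static async Task<bool> WaitForWorkersIdleAsync(
        IReadOnlyCollection<string> workerNames,
        Func<string, bool> isProcessing, // hypothetical "is this worker still busy?" probe
        TimeSpan pollInterval,
        TimeSpan timeout,
        CancellationToken cancellationToken = default)
    {
        var deadline = DateTime.UtcNow + timeout;
        while (DateTime.UtcNow < deadline)
        {
            if (!workerNames.Any(isProcessing)) return true;
            await Task.Delay(pollInterval, cancellationToken);
        }
        // One last check so a worker that finished during the final delay still counts.
        return !workerNames.Any(isProcessing);
    }
}
```

With the values from the PR description, a caller would pass `TimeSpan.FromSeconds(5)` as the poll interval and `TimeSpan.FromMinutes(15)` as the timeout, and only collect responses and send the synthesis prompt when the result is true.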

Root cause: When SendPromptAsync reconnects a session (disconnect during dispatch), the new SessionState didn't carry forward IsMultiAgentSession. The watchdog then used the 120s inactivity timeout instead of 600s, killing long-running PR review workers mid-task.

Fixes:
- Carry IsMultiAgentSession to new state on reconnect (like HasUsedToolsThisTurn)
- Add [DISPATCH] to diagnostic log file filter so orchestration events are persisted for post-mortem analysis
- Fix test race condition in PendingOrchestration tests

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- PendingOrchestration save/load/clear lifecycle test (isolated dir)
- Resume with no pending file test
- Resume with missing group clears state test
- DiagnosticLogFilter includes [DISPATCH] tag guard test
- ReconnectState carries IsMultiAgentSession guard test
- Tests use per-test temp subdirs to avoid parallel races

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Filter worker results by dispatch timestamp to avoid stale pre-dispatch messages (3/3 reviewer consensus)
- Convert UTC StartedAt to local time for ChatMessage.Timestamp comparison
- Wrap dispatch→clear in try/finally for both non-reflect and reflect paths to prevent leaked pending-orchestration.json on cancellation/error
- Add 2 source-code guard tests for timestamp filter and finally block

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
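The timestamp filter and the UTC-to-local conversion can be illustrated together. In this sketch the `ChatMessage` record is a cut-down stand-in, the `"assistant"` role string and the `WorkerResultFilter` class are assumptions, and message timestamps are assumed to be local time, as the commit implies.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the app's ChatMessage; only the fields used here.
public sealed record ChatMessage(string Role, string Content, DateTime Timestamp);

public static class WorkerResultFilter
{
    // startedAtUtc is the persisted dispatch time (UTC); message timestamps are assumed local.
    // Returns null when the worker has produced nothing since the dispatch.
    public static string? LatestResponseSince(IEnumerable<ChatMessage> history, DateTime startedAtUtc)
    {
        // DateTime comparison ignores Kind and compares raw ticks, so both sides
        // must be in the same zone before comparing.
        var dispatchLocal = DateTime.SpecifyKind(startedAtUtc, DateTimeKind.Utc).ToLocalTime();

        return history
            .Where(m => m.Role == "assistant" && m.Timestamp >= dispatchLocal)
            .OrderBy(m => m.Timestamp)
            .LastOrDefault()?.Content;
    }
}
```

Without the conversion, a machine whose local offset is not zero would compare UTC ticks against local ticks and either admit stale pre-dispatch messages or reject fresh ones, depending on the sign of the offset.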

Root cause: UsageStatsTests used reflection to set _polyPilotBaseDir and nulled it in Dispose, racing with TestIsolationGuardTests reading BaseDir.

Fixes:
- UsageStatsTests: use SetBaseDirForTesting API, restore to TestBaseDir in Dispose instead of nulling (never null _polyPilotBaseDir)
- CopilotService: replace ??= with LazyInitializer.EnsureInitialized (uses Interlocked.CompareExchange, prevents concurrent init overwrite)
- CopilotService: use Volatile.Write in SetBaseDirForTesting for ARM memory ordering visibility
- Add [Collection("BaseDir")] to all test classes that mutate BaseDir to prevent parallel races
- New BaseDirCollection.cs defines the xUnit collection

Verified: 20/20 consecutive test runs pass (0 flaky failures).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
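The initialization change can be sketched in isolation. `BaseDirSketch` and `ComputeDefaultBaseDir` are hypothetical names, and resolving the default from the user profile folder is an assumption based on the ~/.polypilot path mentioned earlier; the `LazyInitializer.EnsureInitialized` and `Volatile.Write` calls are the standard .NET APIs the commit names.

```csharp
using System;
using System.IO;
using System.Threading;

public static class BaseDirSketch
{
    private static string? _baseDir;

    // A plain `_baseDir ??= Compute()` is a separate read and write, so a value
    // installed by another thread in between can be overwritten.
    // EnsureInitialized publishes with a compare-and-swap and only installs the
    // computed value if the field is still null.
    public static string BaseDir =>
        LazyInitializer.EnsureInitialized(ref _baseDir, ComputeDefaultBaseDir);

    // Volatile.Write ensures the override is visible to other threads,
    // including on weakly ordered CPUs such as ARM.
    public static void SetBaseDirForTesting(string path) =>
        Volatile.Write(ref _baseDir, path);

    // Assumed default location; the real computation is not shown in the PR text.
    private static string ComputeDefaultBaseDir() =>
        Path.Combine(
            Environment.GetFolderPath(Environment.SpecialFolder.UserProfile),
            ".polypilot");
}
```

Restoring a known test directory in Dispose instead of writing null keeps the field non-null for any test reading `BaseDir` concurrently, which is the other half of the fix.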

Add [DISPATCH] tag logging to SendToMultiAgentGroupAsync for:
- Early-return paths (group not found, no members)
- Entry logging (group name, mode, member count)
- Exception logging with try/catch around dispatch switch

Improve Dashboard error handler to capture inner exception message before InvokeAsync boundary (avoids potential null reference).

Verified end-to-end dispatch flow via MauiDevFlow: orchestrator parses plan → dispatches to 5 workers → workers complete → synthesis sent back to orchestrator successfully.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…n cleanup

1. DiffView: Remove HtmlEncode() that caused double-encoding with Blazor's @() auto-escaping. Angle brackets in diffs (HTML, generics, JSX) were rendering as &lt; instead of <. Restore DiffParser guard tests.
2. Input selection: Revert @onkeydown back to @onkeyup on value-bound inputs. @onkeydown causes Blazor to re-render before the browser processes the keystroke, clearing text selection and breaking multi-character delete. Restore InputSelectionTests regression guard.
3. Session close: Restore cleanup block that clears expandedSession, _lastActiveSession, and _focusedInputId when active session becomes null. Without this, single-session auto-expand doesn't fire after closing a session.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

When the orchestrator session was processing and a user sent a message, it was queued (Dashboard.razor line 1210). On dequeue in Events.cs, the message went directly to SendPromptAsync, bypassing the multi-agent dispatch pipeline entirely. The orchestrator would respond with @worker assignments but nothing parsed or dispatched them.

Fix: The queue drain in CompleteResponse now checks GetOrchestratorGroupId() and routes dequeued messages through SendToMultiAgentGroupAsync instead of SendPromptAsync.

Also adds DISPATCH-ROUTE diagnostic logging and QUEUED tracking to the event-diagnostics.log for future debugging. Adds 4 regression tests for GetOrchestratorGroupId.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
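The routing decision in the queue drain can be reduced to a single branch. The sketch below passes the three operations in as delegates so it stays self-contained; the delegate signatures and the `QueueDrainSketch` class are assumptions and do not reflect the real method signatures in CopilotService.

```csharp
using System;
using System.Threading.Tasks;

public static class QueueDrainSketch
{
    // Routes one dequeued message. Delegates stand in for GetOrchestratorGroupId,
    // SendToMultiAgentGroupAsync and SendPromptAsync; their shapes are hypothetical.
    public static async Task DrainOneAsync(
        string sessionName,
        string message,
        Func<string, string?> getOrchestratorGroupId,
        Func<string, string, Task> sendToMultiAgentGroup,
        Func<string, string, Task> sendPrompt)
    {
        var groupId = getOrchestratorGroupId(sessionName);
        if (groupId != null)
        {
            // Orchestrator session: go through the dispatch pipeline so that
            // @worker assignments in the reply are parsed and dispatched.
            await sendToMultiAgentGroup(groupId, message);
        }
        else
        {
            // Ordinary session: send the prompt directly, as before.
            await sendPrompt(sessionName, message);
        }
    }
}
```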

Move Complete event into finally blocks in SendViaOrchestratorAsync and SendViaOrchestratorReflectAsync so it fires even when synthesis or worker dispatch throws. Without this, the UI permanently shows 'Waiting for workers...' with no way to clear it.

Also add Complete event to MonitorAndSynthesizeAsync cancellation exit path which previously only cleared the file without notifying the UI.

Found by multi-model code review (Opus 4.6 + GPT-5.1-Codex consensus).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
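The pattern behind this fix, and behind the earlier try/finally around the pending file, is that cleanup and UI notification must not depend on the happy path. A minimal sketch, with `DispatchSketch` and its delegate parameters as hypothetical stand-ins for the real dispatch, clear, and Complete-event code:

```csharp
using System;
using System.Threading.Tasks;

public static class DispatchSketch
{
    public static async Task RunDispatchAsync(
        Func<Task> dispatchAndSynthesize, // stand-in for worker dispatch plus synthesis
        Action clearPendingState,         // stand-in for clearing pending-orchestration.json
        Action notifyComplete)            // stand-in for raising the Complete event
    {
        try
        {
            await dispatchAndSynthesize();
        }
        finally
        {
            // Runs on success, exception, and cancellation alike, so the file is
            // not leaked and the UI is not left waiting for workers.
            clearPendingState();
            notifyComplete();
        }
    }
}
```

The exception still propagates to the caller after the finally block runs, so error reporting is unaffected.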

PureWeen and others added 10 commits February 24, 2026 10:36

Merge remote-tracking branch 'origin/main' into orchestration-messages

65e40fa

# Conflicts: # PolyPilot/Components/DiffView.razor

PureWeen merged commit 7088a9b into main Feb 25, 2026

PureWeen deleted the orchestration-messages branch February 25, 2026 14:54

PureWeen mentioned this pull request Feb 26, 2026

Fix orchestrator delegation returning empty response #228

Closed

PureWeen merged 10 commits into main from orchestration-messages
