Orchestration relaunch resilience: persist dispatch state, resume workers, fix watchdog#207
Merged
When the app is relaunched while an orchestrator dispatch is in progress, workers continue processing on the backend but the in-memory orchestration task dies. Previously, worker responses were silently lost. Now the dispatch state is persisted to pending-orchestration.json before workers are dispatched. On app restart, the pending state is detected and a background monitor waits for all workers to complete, then automatically collects their responses and sends the synthesis prompt to the orchestrator.
- PendingOrchestration model with save/load/clear persistence
- SavePendingOrchestration called in both Orchestrator and OrchestratorReflect
- ClearPendingOrchestration called after successful synthesis
- ResumeOrchestrationIfPendingAsync monitors workers post-restart
- System messages show resume status in orchestrator chat
- 15-minute timeout on resume monitoring
- 4 new tests covering persistence round-trip and resume edge cases
- SetBaseDirForTesting clears _pendingOrchestrationFile for test isolation
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
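The save/load/clear lifecycle described in this commit can be illustrated with a small sketch. It is written in Python purely for illustration and is not the project's C# code. The field names, the helper names, and the write-then-rename step are all assumptions; the commit only confirms that the state goes to pending-orchestration.json before dispatch and is cleared after synthesis.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import List, Optional


@dataclass
class PendingOrchestration:
    # Hypothetical fields; the real model's shape is not shown in this PR text.
    group_id: str
    orchestrator_session: str
    worker_sessions: List[str]
    started_at_utc: str  # ISO-8601 timestamp recorded before dispatch


def save_pending(path: Path, pending: PendingOrchestration) -> None:
    """Persist the dispatch state before any worker is started."""
    # Assumption: write to a temp file and rename it, so a crash mid-write
    # cannot leave a truncated JSON file behind.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(asdict(pending)), encoding="utf-8")
    tmp.replace(path)


def load_pending(path: Path) -> Optional[PendingOrchestration]:
    """Return the pending dispatch, or None if the file is absent or unreadable."""
    try:
        return PendingOrchestration(**json.loads(path.read_text(encoding="utf-8")))
    except (FileNotFoundError, json.JSONDecodeError, TypeError):
        return None


def clear_pending(path: Path) -> None:
    """Remove the state file; a missing file is not an error."""
    path.unlink(missing_ok=True)
```

On restart, a caller would check load_pending first and start monitoring the workers only when it returns a value.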
Root cause: When SendPromptAsync reconnects a session (disconnect during dispatch), the new SessionState didn't carry forward IsMultiAgentSession. The watchdog then used the 120s inactivity timeout instead of 600s, killing long-running PR review workers mid-task.
Fixes:
- Carry IsMultiAgentSession to new state on reconnect (like HasUsedToolsThisTurn)
- Add [DISPATCH] to diagnostic log file filter so orchestration events are persisted for post-mortem analysis
- Fix test race condition in PendingOrchestration tests
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
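The root cause above comes down to a flag being dropped when a state object is rebuilt. The following Python sketch (an illustration, not the project's C# code) uses a trimmed-down, hypothetical SessionState to show why the watchdog picked the shorter timeout; only the two flag names and the 120s/600s values come from the commit message.

```python
from dataclasses import dataclass

INACTIVITY_TIMEOUT_S = 120   # default inactivity timeout named in the commit
MULTI_AGENT_TIMEOUT_S = 600  # longer timeout for multi-agent workers


@dataclass
class SessionState:
    # Hypothetical stand-in; the real SessionState has more members.
    has_used_tools_this_turn: bool = False
    is_multi_agent_session: bool = False


def reconnect(old: SessionState) -> SessionState:
    """Build the replacement state, copying the flags that must survive a reconnect."""
    return SessionState(
        has_used_tools_this_turn=old.has_used_tools_this_turn,
        is_multi_agent_session=old.is_multi_agent_session,  # omitted before the fix
    )


def watchdog_timeout(state: SessionState) -> int:
    """Pick the inactivity timeout the watchdog should apply to this session."""
    if state.is_multi_agent_session:
        return MULTI_AGENT_TIMEOUT_S
    return INACTIVITY_TIMEOUT_S
```

If reconnect omitted the second copy, a worker that started with the 600s budget would fall back to 120s after any reconnect.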
- PendingOrchestration save/load/clear lifecycle test (isolated dir)
- Resume with no pending file test
- Resume with missing group clears state test
- DiagnosticLogFilter includes [DISPATCH] tag guard test
- ReconnectState carries IsMultiAgentSession guard test
- Tests use per-test temp subdirs to avoid parallel races
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Filter worker results by dispatch timestamp to avoid stale pre-dispatch messages (3/3 reviewer consensus)
- Convert UTC StartedAt to local time for ChatMessage.Timestamp comparison
- Wrap dispatch→clear in try/finally for both non-reflect and reflect paths to prevent leaked pending-orchestration.json on cancellation/error
- Add 2 source-code guard tests for timestamp filter and finally block
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
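The two fixes in this commit can be sketched as follows, in Python for illustration only. The function names, the ChatMessage stand-in, and the use of >= for the comparison are assumptions; the commit confirms only that results are filtered by the dispatch timestamp after converting UTC to local time, and that the clear step runs in a finally block.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, List, TypeVar

T = TypeVar("T")


@dataclass
class ChatMessage:
    text: str
    timestamp: datetime  # naive local time, standing in for ChatMessage.Timestamp


def results_since_dispatch(messages: List[ChatMessage], started_at_utc: datetime) -> List[ChatMessage]:
    """Keep only messages written at or after the moment the dispatch started."""
    # started_at_utc is timezone-aware UTC; convert it to naive local time so it
    # is comparable with the naive local timestamps on the messages.
    started_local = started_at_utc.astimezone().replace(tzinfo=None)
    return [m for m in messages if m.timestamp >= started_local]


def dispatch_with_cleanup(save: Callable[[], None],
                          dispatch: Callable[[], T],
                          clear: Callable[[], None]) -> T:
    """Persist state, run the dispatch, and always clear the state afterwards."""
    save()
    try:
        return dispatch()
    finally:
        clear()  # runs on success, error, and cancellation alike
```

Comparing a UTC instant directly against local timestamps would shift the cutoff by the machine's UTC offset, which is why the conversion happens before the filter.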
Root cause: UsageStatsTests used reflection to set _polyPilotBaseDir and
nulled it in Dispose, racing with TestIsolationGuardTests reading BaseDir.
Fixes:
- UsageStatsTests: use SetBaseDirForTesting API, restore to TestBaseDir
in Dispose instead of nulling (never null _polyPilotBaseDir)
- CopilotService: replace ??= with LazyInitializer.EnsureInitialized
(uses Interlocked.CompareExchange, prevents concurrent init overwrite)
- CopilotService: use Volatile.Write in SetBaseDirForTesting for ARM
memory ordering visibility
- Add [Collection("BaseDir")] to all test classes that mutate BaseDir
to prevent parallel races
- New BaseDirCollection.cs defines the xUnit collection
Verified: 20/20 consecutive test runs pass (0 flaky failures).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add [DISPATCH] tag logging to SendToMultiAgentGroupAsync for:
- Early-return paths (group not found, no members)
- Entry logging (group name, mode, member count)
- Exception logging with try/catch around dispatch switch
Improve Dashboard error handler to capture inner exception message before InvokeAsync boundary (avoids potential null reference).
Verified end-to-end dispatch flow via MauiDevFlow: orchestrator parses plan → dispatches to 5 workers → workers complete → synthesis sent back to orchestrator successfully.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…n cleanup
1. DiffView: Remove HtmlEncode() that caused double-encoding with Blazor's @() auto-escaping. Angle brackets in diffs (HTML, generics, JSX) were rendering as &lt; instead of <. Restore DiffParser guard tests.
2. Input selection: Revert @onkeydown back to @onkeyup on value-bound inputs. @onkeydown causes Blazor to re-render before the browser processes the keystroke, clearing text selection and breaking multi-character delete. Restore InputSelectionTests regression guard.
3. Session close: Restore cleanup block that clears expandedSession, _lastActiveSession, and _focusedInputId when active session becomes null. Without this, single-session auto-expand doesn't fire after closing a session.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When the orchestrator session was processing and a user sent a message, it was queued (Dashboard.razor line 1210). On dequeue in Events.cs, the message went directly to SendPromptAsync, bypassing the multi-agent dispatch pipeline entirely. The orchestrator would respond with @worker assignments but nothing parsed or dispatched them.
Fix: The queue drain in CompleteResponse now checks GetOrchestratorGroupId() and routes dequeued messages through SendToMultiAgentGroupAsync instead of SendPromptAsync.
Also adds DISPATCH-ROUTE diagnostic logging and QUEUED tracking to the event-diagnostics.log for future debugging.
Adds 4 regression tests for GetOrchestratorGroupId.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
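The routing decision in this fix can be sketched as a single branch. This is a Python illustration, not the project's C# code; the function and parameter names are hypothetical, and the callables stand in for GetOrchestratorGroupId, SendToMultiAgentGroupAsync, and SendPromptAsync.

```python
from typing import Callable, Optional


def drain_queued_message(
    session: str,
    message: str,
    get_orchestrator_group_id: Callable[[str], Optional[str]],
    send_to_group: Callable[[str, str], None],
    send_prompt: Callable[[str, str], None],
) -> str:
    """Route one dequeued message and report which path it took."""
    group_id = get_orchestrator_group_id(session)
    if group_id is not None:
        # Orchestrator session: go through the multi-agent pipeline so that
        # @worker assignments in the reply get parsed and dispatched.
        send_to_group(group_id, message)
        return "dispatch"
    # Ordinary session: send the prompt directly, as before the fix.
    send_prompt(session, message)
    return "prompt"
```

Before the fix, every dequeued message effectively took the second branch, regardless of whether the session was an orchestrator.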
Move Complete event into finally blocks in SendViaOrchestratorAsync and SendViaOrchestratorReflectAsync so it fires even when synthesis or worker dispatch throws. Without this, the UI permanently shows 'Waiting for workers...' with no way to clear it.
Also add Complete event to MonitorAndSynthesizeAsync cancellation exit path which previously only cleared the file without notifying the UI.
Found by multi-model code review (Opus 4.6 + GPT-5.1-Codex consensus).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
# Conflicts:
#   PolyPilot/Components/DiffView.razor
Summary
Makes multi-agent orchestration dispatch resilient to app relaunches. When the app is relaunched while workers are processing, the orchestrator can now automatically resume and collect results.
Root Cause
When the app relaunches mid-dispatch, the in-memory Task.WhenAll awaiting worker TCS completions dies with the old process. Workers continue on the backend but their results are never collected for synthesis.
Changes
Relaunch Resilience (CopilotService.Organization.cs)
- PendingOrchestration model persisted to ~/.polypilot/pending-orchestration.json before dispatching workers
- ResumeOrchestrationIfPendingAsync detects pending orchestrations on restart and monitors workers
- MonitorAndSynthesizeAsync polls every 5s (15min timeout) until all workers idle, then auto-synthesizes
- finally blocks to prevent leaked files on cancellation/error
Watchdog Fix (CopilotService.cs)
- IsMultiAgentSession carried forward on session reconnect; without this, the watchdog used the 120s inactivity timeout instead of 600s, killing long-running workers prematurely
- [DISPATCH] tag added to diagnostic log file filter for post-mortem analysis
Tests (MultiAgentRegressionTests.cs)
Multi-Model Review
Reviewed by 3 models (Opus 4.6, Sonnet 4.6, GPT-5.3-Codex). All 3 consensus findings fixed:
Testing
TestIsolationGuard)