
Orchestration relaunch resilience: persist dispatch state, resume workers, fix watchdog #207

Merged
PureWeen merged 10 commits into main from orchestration-messages on Feb 25, 2026



@PureWeen
Owner

Summary

Makes multi-agent orchestration dispatch resilient to app relaunches. When the app is relaunched while workers are processing, the orchestrator can now automatically resume and collect results.

Root Cause

When the app relaunches mid-dispatch, the in-memory Task.WhenAll that awaits the workers' TaskCompletionSource (TCS) completions dies with the old process. Workers continue on the backend, but their results are never collected for synthesis.

Changes

Relaunch Resilience (CopilotService.Organization.cs)

  • PendingOrchestration model persisted to ~/.polypilot/pending-orchestration.json before dispatching workers
  • ResumeOrchestrationIfPendingAsync detects pending orchestrations on restart and monitors workers
  • MonitorAndSynthesizeAsync polls every 5s (15min timeout) until all workers are idle, then auto-synthesizes
  • Dispatch state cleared in finally blocks to prevent leaked files on cancellation/error
  • Worker results filtered by dispatch timestamp to avoid picking up stale pre-dispatch messages
  • UTC→local time conversion for reliable timestamp comparison
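The save/load/clear lifecycle and the finally-block cleanup described above can be sketched roughly as follows. This is a minimal illustration in Python, not the project's C# code: the field names on `PendingOrchestration`, the helper names, and the `run_workers` callback are all assumptions made for the example.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import Callable, List, Optional


@dataclass
class PendingOrchestration:
    # Hypothetical fields; the PR does not list the real model's members.
    group_id: str
    worker_names: List[str]
    started_at_utc: str  # ISO-8601 timestamp in UTC


def save_pending(path: Path, pending: PendingOrchestration) -> None:
    """Persist dispatch state before any worker is started."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(asdict(pending)))


def load_pending(path: Path) -> Optional[PendingOrchestration]:
    """Return the persisted state, or None if absent or unreadable."""
    if not path.exists():
        return None
    try:
        return PendingOrchestration(**json.loads(path.read_text()))
    except (ValueError, TypeError):
        return None


def clear_pending(path: Path) -> None:
    """Remove the state file; a missing file is not an error."""
    path.unlink(missing_ok=True)


def dispatch(path: Path, pending: PendingOrchestration,
             run_workers: Callable[[List[str]], object]) -> object:
    """Save state, run workers, and always clear the file afterwards."""
    save_pending(path, pending)
    try:
        return run_workers(pending.worker_names)
    finally:
        # Runs on success, error, and cancellation alike. If the process
        # itself is killed, this never runs and the file survives for resume.
        clear_pending(path)
```

The point of the design is the asymmetry: a normal exit or an exception clears the file, while a process death leaves it behind, which is exactly the signal the resume path looks for on the next launch.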

Watchdog Fix (CopilotService.cs)

  • IsMultiAgentSession carried forward on session reconnect — without this, the watchdog used the 120s inactivity timeout instead of 600s, killing long-running workers prematurely
  • [DISPATCH] tag added to diagnostic log file filter for post-mortem analysis
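As a rough illustration of the watchdog fix, the sketch below shows why dropping the flag on reconnect shortens the timeout. It is written in Python rather than the project's C#; the `SessionState` shape and function names are assumptions, and only the 120s/600s values and the two carried-forward flags come from the description.

```python
from dataclasses import dataclass

DEFAULT_INACTIVITY_TIMEOUT_S = 120      # regular sessions
MULTI_AGENT_INACTIVITY_TIMEOUT_S = 600  # orchestrated workers


@dataclass
class SessionState:
    # Hypothetical stand-in for the real session state object.
    session_id: str
    is_multi_agent_session: bool = False
    has_used_tools_this_turn: bool = False


def reconnect_buggy(old: SessionState) -> SessionState:
    """Pre-fix behaviour: the multi-agent flag is lost on reconnect."""
    return SessionState(
        session_id=old.session_id,
        has_used_tools_this_turn=old.has_used_tools_this_turn,
    )


def reconnect_fixed(old: SessionState) -> SessionState:
    """Post-fix behaviour: both flags are carried to the new state."""
    return SessionState(
        session_id=old.session_id,
        is_multi_agent_session=old.is_multi_agent_session,
        has_used_tools_this_turn=old.has_used_tools_this_turn,
    )


def watchdog_timeout_s(state: SessionState) -> int:
    """Pick the inactivity timeout the watchdog applies to a session."""
    if state.is_multi_agent_session:
        return MULTI_AGENT_INACTIVITY_TIMEOUT_S
    return DEFAULT_INACTIVITY_TIMEOUT_S
```

With a worker session, `reconnect_buggy` yields the 120s timeout while `reconnect_fixed` keeps the 600s one.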

Tests (MultiAgentRegressionTests.cs)

  • 7 new tests: save/load/clear lifecycle, no-file resume, missing-group cleanup, dispatch tag guard, reconnect state guard, timestamp filter guard, finally block guard

Multi-Model Review

Reviewed by 3 models (Opus 4.6, Sonnet 4.6, GPT-5.3-Codex). All 3 consensus findings fixed:

  1. ✅ Worker result collection picks up pre-dispatch history → timestamp filter added
  2. ✅ UTC/local time mismatch → converted on comparison
  3. ✅ Pending file leaked on cancellation → try/finally added

Testing

  • 1268 tests total, 1267 pass (1 pre-existing flaky TestIsolationGuard)
  • Manually tested orchestration resume after app relaunch

PureWeen and others added 10 commits February 24, 2026 10:36
When the app is relaunched while an orchestrator dispatch is in progress,
workers continue processing on the backend but the in-memory orchestration
task dies. Previously, worker responses were silently lost.

Now the dispatch state is persisted to pending-orchestration.json before
workers are dispatched. On app restart, the pending state is detected and
a background monitor waits for all workers to complete, then automatically
collects their responses and sends the synthesis prompt to the orchestrator.

- PendingOrchestration model with save/load/clear persistence
- SavePendingOrchestration called in both Orchestrator and OrchestratorReflect
- ClearPendingOrchestration called after successful synthesis
- ResumeOrchestrationIfPendingAsync monitors workers post-restart
- System messages show resume status in orchestrator chat
- 15-minute timeout on resume monitoring
- 4 new tests covering persistence round-trip and resume edge cases
- SetBaseDirForTesting clears _pendingOrchestrationFile for test isolation
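The resume monitor amounts to a poll-with-deadline loop. A minimal Python sketch follows, assuming a hypothetical `workers_idle` callback; the 5-second interval and 15-minute timeout are the values stated in the PR summary, and the injectable `sleep` and `clock` parameters exist only to make the loop testable.

```python
import time
from typing import Callable


def monitor_until_idle(
    workers_idle: Callable[[], bool],
    poll_interval_s: float = 5.0,
    timeout_s: float = 15 * 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until all workers are idle; return False if the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        if workers_idle():
            return True   # caller can now collect results and synthesize
        if clock() >= deadline:
            return False  # give up; caller should clear the pending state
        sleep(poll_interval_s)
```

Using a monotonic clock avoids the loop being cut short or extended when the wall clock is adjusted while the monitor is waiting.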

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Root cause: When SendPromptAsync reconnects a session (disconnect during
dispatch), the new SessionState didn't carry forward IsMultiAgentSession.
The watchdog then used the 120s inactivity timeout instead of 600s,
killing long-running PR review workers mid-task.

Fixes:
- Carry IsMultiAgentSession to new state on reconnect (like HasUsedToolsThisTurn)
- Add [DISPATCH] to diagnostic log file filter so orchestration events
  are persisted for post-mortem analysis
- Fix test race condition in PendingOrchestration tests

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- PendingOrchestration save/load/clear lifecycle test (isolated dir)
- Resume with no pending file test
- Resume with missing group clears state test
- DiagnosticLogFilter includes [DISPATCH] tag guard test
- ReconnectState carries IsMultiAgentSession guard test
- Tests use per-test temp subdirs to avoid parallel races

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Filter worker results by dispatch timestamp to avoid stale pre-dispatch
  messages (3/3 reviewer consensus)
- Convert UTC StartedAt to local time for ChatMessage.Timestamp comparison
- Wrap dispatch→clear in try/finally for both non-reflect and reflect paths
  to prevent leaked pending-orchestration.json on cancellation/error
- Add 2 source-code guard tests for timestamp filter and finally block
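The timestamp filter and the UTC-to-local conversion can be illustrated as below. This Python sketch assumes, as the commit message implies, that message timestamps are stored as local wall-clock times while the dispatch start is stored in UTC; the `ChatMessage` shape and helper names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass
class ChatMessage:
    # Hypothetical shape: timestamp is a naive local wall-clock time.
    text: str
    timestamp: datetime


def to_local_naive(started_at_utc: datetime) -> datetime:
    """Convert an aware UTC time to a naive local time for comparison."""
    return started_at_utc.astimezone().replace(tzinfo=None)


def results_since_dispatch(messages: List[ChatMessage],
                           started_at_utc: datetime) -> List[ChatMessage]:
    """Keep only messages produced at or after the dispatch started."""
    started_local = to_local_naive(started_at_utc)
    return [m for m in messages if m.timestamp >= started_local]
```

Comparing the UTC value directly against local timestamps would shift the cutoff by the machine's UTC offset, either admitting stale pre-dispatch messages or dropping valid results, depending on the sign of the offset.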

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Root cause: UsageStatsTests used reflection to set _polyPilotBaseDir and
nulled it in Dispose, racing with TestIsolationGuardTests reading BaseDir.

Fixes:
- UsageStatsTests: use SetBaseDirForTesting API, restore to TestBaseDir
  in Dispose instead of nulling (never null _polyPilotBaseDir)
- CopilotService: replace ??= with LazyInitializer.EnsureInitialized
  (uses Interlocked.CompareExchange, prevents concurrent init overwrite)
- CopilotService: use Volatile.Write in SetBaseDirForTesting for ARM
  memory ordering visibility
- Add [Collection("BaseDir")] to all test classes that mutate BaseDir
  to prevent parallel races
- New BaseDirCollection.cs defines the xUnit collection

Verified: 20/20 consecutive test runs pass (0 flaky failures).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add [DISPATCH] tag logging to SendToMultiAgentGroupAsync for:
- Early-return paths (group not found, no members)
- Entry logging (group name, mode, member count)
- Exception logging with try/catch around dispatch switch

Improve Dashboard error handler to capture inner exception message
before InvokeAsync boundary (avoids potential null reference).

Verified end-to-end dispatch flow via MauiDevFlow: orchestrator
parses plan → dispatches to 5 workers → workers complete →
synthesis sent back to orchestrator successfully.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…n cleanup

1. DiffView: Remove HtmlEncode() that caused double-encoding with Blazor's
   @() auto-escaping. Angle brackets in diffs (HTML, generics, JSX) were
   rendering as &lt; instead of <. Restore DiffParser guard tests.

2. Input selection: Revert @onkeydown back to @onkeyup on value-bound
   inputs. @onkeydown causes Blazor to re-render before the browser
   processes the keystroke, clearing text selection and breaking
   multi-character delete. Restore InputSelectionTests regression guard.

3. Session close: Restore cleanup block that clears expandedSession,
   _lastActiveSession, and _focusedInputId when active session becomes
   null. Without this, single-session auto-expand doesn't fire after
   closing a session.
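The double-encoding in item 1 is easy to reproduce outside Blazor. The Python sketch below uses the standard library's `html.escape` as a stand-in for both the explicit HtmlEncode() call and the renderer's automatic escaping; it illustrates the general failure mode, not the DiffView code itself.

```python
import html

raw = "List<string> items"

# One round of escaping is what an auto-escaping renderer already applies.
escaped_once = html.escape(raw)

# Escaping manually first, then letting the renderer escape again.
escaped_twice = html.escape(escaped_once)

# The browser decodes entities exactly once when displaying text.
shown_when_escaped_once = html.unescape(escaped_once)
shown_when_escaped_twice = html.unescape(escaped_twice)

print(shown_when_escaped_once)   # List<string> items
print(shown_when_escaped_twice)  # List&lt;string&gt; items
```

Removing the manual encoding step leaves a single round of escaping, so the angle brackets display correctly while the markup stays safe.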

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When the orchestrator session was processing and a user sent a message,
it was queued (Dashboard.razor line 1210). On dequeue in Events.cs, the
message went directly to SendPromptAsync, bypassing the multi-agent
dispatch pipeline entirely. The orchestrator would respond with @worker
assignments but nothing parsed or dispatched them.

Fix: The queue drain in CompleteResponse now checks GetOrchestratorGroupId()
and routes dequeued messages through SendToMultiAgentGroupAsync instead of
SendPromptAsync. Also adds DISPATCH-ROUTE diagnostic logging and QUEUED
tracking to the event-diagnostics.log for future debugging.

Adds 4 regression tests for GetOrchestratorGroupId.
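The routing decision in the queue drain can be sketched as follows. This is a Python illustration with hypothetical callback parameters standing in for GetOrchestratorGroupId, SendToMultiAgentGroupAsync, and SendPromptAsync; the real signatures are not shown in this PR.

```python
from typing import Callable, Optional


def route_dequeued_message(
    session_name: str,
    message: str,
    get_orchestrator_group_id: Callable[[str], Optional[str]],
    send_to_group: Callable[[str, str], str],
    send_prompt: Callable[[str, str], str],
) -> str:
    """Send a dequeued message through the right pipeline."""
    group_id = get_orchestrator_group_id(session_name)
    if group_id is not None:
        # Orchestrator session: go through multi-agent dispatch so that
        # worker assignments in the reply are parsed and dispatched.
        return send_to_group(group_id, message)
    # Ordinary session: send the prompt directly, as before.
    return send_prompt(session_name, message)
```

Before the fix, the drain effectively always took the second branch, which is why queued orchestrator messages skipped dispatch.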

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move Complete event into finally blocks in SendViaOrchestratorAsync and
SendViaOrchestratorReflectAsync so it fires even when synthesis or worker
dispatch throws. Without this, the UI permanently shows 'Waiting for
workers...' with no way to clear it.

Also add Complete event to MonitorAndSynthesizeAsync cancellation exit
path which previously only cleared the file without notifying the UI.

Found by multi-model code review (Opus 4.6 + GPT-5.1-Codex consensus).
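A minimal sketch of the pattern, in Python: the completion notification sits in a finally block so the UI is released on every exit path. The `notify` callback and the "WaitingForWorkers" event name are assumptions for the example; only the Complete event is named in the commit message.

```python
from typing import Callable, List


def send_via_orchestrator(
    dispatch_workers: Callable[[], List[str]],
    synthesize: Callable[[List[str]], str],
    notify: Callable[[str], None],
) -> str:
    """Run dispatch and synthesis, always signalling completion to the UI."""
    notify("WaitingForWorkers")  # hypothetical event name
    try:
        results = dispatch_workers()
        return synthesize(results)
    finally:
        # Fires on success, on exceptions from dispatch or synthesis,
        # and on cancellation, so the waiting indicator is always cleared.
        notify("Complete")
```

The exception still propagates to the caller; the finally block only guarantees that the UI state is reset first.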

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
# Conflicts:
#	PolyPilot/Components/DiffView.razor
@PureWeen PureWeen merged commit 7088a9b into main Feb 25, 2026
@PureWeen PureWeen deleted the orchestration-messages branch February 25, 2026 14:54