feat: mixed-model PR Review Squad (Opus + Sonnet + Codex)#451

Merged
PureWeen merged 13 commits into main from fix/mixed-model-review-squad
Mar 30, 2026

Conversation


@PureWeen PureWeen commented Mar 29, 2026

Summary

Redesigns the built-in PR Review Squad multi-agent preset and adds fleet command diagnostics.

Architecture: 5x Opus Workers with Internal Multi-Model Dispatch

Each of the 5 workers runs on Opus and internally dispatches 3 parallel sub-agent reviews via the task tool:

  • claude-opus-4.6 -- deep reasoning, architecture, subtle logic bugs
  • claude-sonnet-4.6 -- fast pattern matching, common bug classes, security
  • gpt-5.3-codex -- alternative perspective, edge cases

The worker synthesizes a 2-of-3 consensus report. The orchestrator assigns one worker per PR (round-robin for multiple PRs) -- no fan-out.
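For illustration, the dispatch rules above reduce to a few lines. This is a hypothetical Python sketch, not the actual PolyPilot code (the real rules are encoded in C# prompt text): a finding survives when 2 or more of the 3 sub-agent models flag it, and PRs map to workers round-robin with no fan-out.

```python
from collections import Counter

def consensus_findings(findings_by_model, min_agree=2):
    """Keep findings flagged by at least `min_agree` of the sub-agent models."""
    counts = Counter()
    for model, findings in findings_by_model.items():
        for finding in set(findings):  # dedupe within one model's report
            counts[finding] += 1
    return {f for f, n in counts.items() if n >= min_agree}

def assign_round_robin(prs, workers):
    """One worker per PR, distributed round-robin -- no fan-out."""
    return {pr: workers[i % len(workers)] for i, pr in enumerate(prs)}

reviews = {
    "opus":   ["null-deref in Foo", "race in Bar"],
    "sonnet": ["null-deref in Foo", "style nit"],
    "codex":  ["race in Bar", "null-deref in Foo"],
}
print(sorted(consensus_findings(reviews)))  # ['null-deref in Foo', 'race in Bar']
print(assign_round_robin(["PR-1", "PR-2", "PR-3"], ["w1", "w2"]))
```

The "style nit" finding is dropped because only one model raised it; the other two findings meet the 2-of-3 bar.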

Changes

  1. WorkerReviewPrompt -- instructs workers to dispatch 3 sub-agents with different models
  2. RoutingContext -- 1 worker per PR, no fan-out, round-robin distribution
  3. SharedContext -- consensus filter (2+ models agree), fix standards use git merge (not rebase)
  4. Fix process -- git merge instead of git rebase + force-with-lease, includes push verification
  5. Re-review tracking -- structured FIXED / STILL PRESENT / N/A tracking restored
  6. .squad/ deleted -- replaced by built-in preset; Squad integration tracked in Deep Squad Integration: SquadSessionProvider via ISessionProvider Plugin System #436
  7. /fleet diagnostics -- StartFleetAsync returns specific error reason
  8. Tests -- verify all 5 workers are Opus, prompts contain 3 sub-agent models, merge not rebase

Why 5x Opus?

Workers orchestrate sub-agent dispatch -- Opus excels at this. Model diversity happens at the sub-agent level inside each worker.

PureWeen force-pushed the fix/mixed-model-review-squad branch from 32b4d6b to 91041af on March 29, 2026 at 07:08
PureWeen and others added 2 commits March 29, 2026 09:25
….squad/

Replace 5 identical Sonnet workers with 3 diverse-model workers:
- Worker 1: Claude Opus (deep reasoning, subtle bugs)
- Worker 2: Claude Sonnet (fast pattern matching, common bugs)
- Worker 3: GPT Codex (alternative perspective, edge cases)

Model diversity is now achieved at the worker level instead of
relying on workers to internally dispatch sub-agents. The
orchestrator synthesizes consensus (2-of-3 filter) from the
different model perspectives.

Also removes .squad/ directory — the useful content (routing rules,
review standards) is now baked into the built-in preset. Real Squad
integration will come via SquadSessionProvider (issue #436).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
User feedback: 3 workers was too few. Restored to 5 workers with
mixed models for better consensus coverage. Updated consensus
filter references to 2-of-5.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen force-pushed the fix/mixed-model-review-squad branch from baea1ae to ae642e5 on March 29, 2026 at 14:25
PureWeen and others added 2 commits March 29, 2026 09:31
Each worker dispatches 3 sub-agents (Opus, Sonnet, Codex) via the
task tool for consensus. The orchestrator assigns 1 worker per PR
and distributes round-robin — no fan-out of multiple workers to
the same PR. All workers run on Opus (best at orchestrating
sub-agent dispatch).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Return specific error (session not found, not connected, processing,
RPC error) instead of generic failure message.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen force-pushed the fix/mixed-model-review-squad branch from ae642e5 to b3fe6fe on March 29, 2026 at 14:31
@PureWeen (Owner, Author)

🤖 Multi-Model Code Review — PR #451

feat: mixed-model PR Review Squad (Opus + Sonnet + Codex)
Reviewed by: Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.3-Codex · 2-of-3 consensus applied


🔴 CRITICAL — Worker array contradicts PR description: 5×Opus, not 3 diverse models

File: PolyPilot/Models/ModelCapabilities.cs:327
Flagged by: Opus, Sonnet, Codex (unanimous)

The diff changes workers from 5×claude-sonnet-4.6 to 5×claude-opus-4.6. The PR description claims "Replace 5 identical Sonnet workers with 3 diverse-model workers (Opus + Sonnet + Codex)." The code keeps 5 workers, all Opus. The description string in the code even says "5 reviewers — each does multi-model consensus (Opus + Sonnet + Codex)", confirming the code and PR description describe two different architectures.

This is also a cost regression — 5 Opus workers each dispatching 3 sub-agents = 20 model calls per review, with 10 at premium Opus tier.
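The call-count arithmetic behind the cost concern checks out (assuming one review pass per PR, no retries):

```python
workers = 5                 # every worker runs on Opus
subagents_per_worker = 3    # Opus + Sonnet + Codex, dispatched via the task tool
total_calls = workers + workers * subagents_per_worker  # 5 orchestrating + 15 sub-agent
opus_calls = workers + workers * 1  # each Opus worker also dispatches one Opus sub-agent
print(total_calls, opus_calls)  # 20 10
```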


🟡 MODERATE — git rebase + --force-with-lease in worker prompt contradicts deleted safety rules

File: PolyPilot/Models/ModelCapabilities.cs:303-304
Flagged by: Opus, Sonnet

The fix instructions tell workers: git rebase origin/main then push with --force-with-lease. The deleted .squad/decisions.md explicitly banned both: "NEVER force push — no --force, --force-with-lease, or any force variant" and "ALWAYS integrate with git merge — never git rebase." These rules existed because rebase+force-push caused commits to land on the wrong remote. The PR deletes the safety documents while restoring the dangerous pattern.


🟡 MODERATE — .squad/ deletion removes push-to-pr.sh safety script with no replacement

File: .squad/push-to-pr.sh (deleted)
Flagged by: Opus, Sonnet, Codex (unanimous)

The 58-line script verified push targets, confirmed remote/branch matching, and validated pushes landed correctly. The embedded prompt's fix instructions are a brief 2-line "checkout, rebase, push" with no verification step. Workers pushing to contributor forks without verification is the exact failure mode decisions.md documented.


🟡 MODERATE — Test coverage insufficient for the change

File: PolyPilot.Tests/SessionOrganizationTests.cs:2531
Flagged by: Opus, Sonnet, Codex (unanimous)

The test only changes WorkerModels[0] from sonnet to opus. It still asserts WorkerModels.Length == 5 (unchanged). It doesn't verify models at indices 1-4, doesn't test model diversity, and doesn't validate the description matches reality. Missing tests:

  • Worker count matches intent (3 vs 5)
  • Model diversity (Opus + Sonnet + Codex all present)
  • Worker prompt content (rebase vs merge)
  • StartFleetAsync new tuple return type

🟡 MODERATE — Cross-worker consensus filter weakened/undefined

File: PolyPilot/Models/ModelCapabilities.cs:280-292
Flagged by: Opus, Sonnet

Old: 5 workers, 2+ agree = 40% threshold. New: 3 sub-agent models, 2+ agree = 67% threshold. The higher threshold means fewer findings survive consensus. Additionally, the orchestrator's routing context says "Assign ONE worker per PR" but has no guidance for merging findings if multiple workers review the same PR.
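The threshold shift is just arithmetic:

```python
# Fraction of reviewers that must agree before a finding survives the filter.
old_threshold = 2 / 5   # old: 5 workers, 2+ must agree
new_threshold = 2 / 3   # new: 3 sub-agent models, 2+ must agree
print(f"{old_threshold:.0%} -> {new_threshold:.0%}")  # 40% -> 67%
```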


🟢 MINOR — Raw exception messages surfaced in UI

File: PolyPilot/Services/CopilotService.cs:~1989-2008, Dashboard.razor:~2222-2226
Flagged by: Sonnet, Codex

ex.Message from RPC exceptions is written directly into chat history via ChatMessage.ErrorMessage($"Failed to start fleet mode: {error}"). This can expose internal endpoints, paths, or credential fragments. The message is persisted to events.jsonl. Fix: log full exception but return a generic "RPC error. Check logs." to the caller.
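The suggested fix is the standard log-full/return-generic pattern. A Python sketch for illustration (the real code is C# using the project's logging; `start_fleet_error` and the message text here are hypothetical):

```python
import logging

logger = logging.getLogger("fleet")

def start_fleet_error(ex: Exception) -> str:
    # Log the full exception (message + stack trace) for diagnostics...
    logger.error("Failed to start fleet mode", exc_info=ex)
    # ...but surface only a generic string to the UI and persisted chat history,
    # so internal endpoints, paths, or token fragments never leak.
    return "RPC error. Check logs."

print(start_fleet_error(RuntimeError("rpc://internal-host:9443 token=abc123")))
# RPC error. Check logs.
```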


🟢 MINOR — Re-review tracking removed

File: PolyPilot/Models/ModelCapabilities.cs:307-315 (deleted §6)
Flagged by: Opus, Sonnet

The old prompt had a "Re-Review Process" section with structured FIXED / STILL PRESENT / N/A tracking. The replacement is a one-liner: "re-run the 3-model review on the updated diff." Without explicit finding-status tracking, fixed issues may be re-reported and still-present issues may be missed.


📋 Additional Notes

  • CI: ⚠️ No checks configured for this branch
  • Prior review comments: None
  • Positive: The StartFleetAsync refactor from Task<bool> to Task<(bool, string?)> is clean — specific error messages with proper Dashboard integration.

Recommended action: 🔴 Do not merge

Two blockers:

  1. The feature wasn't implemented. Code delivers 5×Opus, not 3×mixed. Description, code string, and implementation are three different architectures. Decide on one and align all three.
  2. Safety regression. The deleted .squad/decisions.md and push-to-pr.sh documented hard-won lessons from wrong-remote pushes. Either keep the safety script or embed equivalent git merge + verification in the prompt. Do not restore rebase + --force-with-lease.

PureWeen and others added 3 commits March 29, 2026 09:49
The fleet review incorrectly flagged these as non-existent because
they aren't in PolyPilot's internal registry, but these are CLI
model IDs passed to the task tool — the CLI supports them.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Replace git rebase + force-with-lease with git merge in worker
  fix instructions and SharedContext (safety regression)
- Add push verification step (equivalent to deleted push-to-pr.sh)
- Restore structured re-review tracking (FIXED/STILL PRESENT/N/A)
- Sanitize RPC exception in fleet error (don't leak internal paths)
- Improve test coverage: verify all 5 models are Opus, verify worker
  prompts contain all 3 sub-agent model names, verify merge not rebase
  in SharedContext, verify 1-worker-per-PR routing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen (Owner, Author)

🤖 Multi-Model Code Review (Re-Review) — PR #451

feat: mixed-model PR Review Squad (Opus + Sonnet + Codex)
Reviewed by: Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.3-Codex · 2-of-3 consensus applied


Previous Findings Status

| # | Previous Finding | Status | Models |
|---|------------------|--------|--------|
| 1 | 🔴 Worker array 5×Opus vs described 3×mixed | FIXED | Opus ✅, Sonnet ✅, Codex ✅ |
| 2 | 🟡 git rebase + --force-with-lease contradicting deleted safety rules | FIXED | Opus ✅, Sonnet ✅, Codex ✅ |
| 3 | 🟡 push-to-pr.sh deleted with no replacement | ⚠️ PARTIALLY FIXED | Opus ⚠️, Sonnet ⚠️, Codex ❌ |
| 4 | 🟡 Insufficient test coverage | ⚠️ PARTIALLY FIXED | Opus ⚠️, Sonnet ✅, Codex ⚠️ |
| 5 | 🟡 Consensus filter weakened | FIXED | Opus ✅, Sonnet ✅, Codex ❌ |
| 6 | 🟢 Raw exception messages in UI | FIXED | Opus ✅, Sonnet ✅, Codex ✅ |
| 7 | 🟢 Re-review tracking removed | FIXED | Opus ✅, Sonnet ✅, Codex ✅ |

Summary: 5 of 7 previous findings are fully resolved. 2 are partially addressed (see details below).


Remaining Issues

⚠️ Finding #3: push-to-pr.sh — Inline verification weaker than script

Flagged by: Opus, Sonnet, Codex (unanimous)

Push verification is now inline in the worker prompt (steps 5–6: git fetch origin <branch> && git log). This replaces the deleted 58-line bash script that did SHA comparison and fork-remote detection. Text instructions are less deterministic than set -e scripted checks, but gh pr checkout handles remote tracking, so the gap is narrower than before. Acceptable for merge — the core safety (merge-not-rebase, gh pr checkout) is preserved.

⚠️ Finding #4: Test coverage — Preset tests good, StartFleetAsync untested

Flagged by: Opus, Codex

Good preset test additions: 5×Opus verification, 3 sub-agent model strings, merge-not-rebase assertions, 1-worker-per-PR routing. However, the StartFleetAsync signature change from Task<bool> to Task<(bool, string?)> has no dedicated tests for the new error tuple propagation or Dashboard integration.


New Issues (2+ model consensus)

🟢 MINOR — Console.WriteLine with full stack trace in production

File: PolyPilot/Services/CopilotService.cs:~2007
Flagged by: Opus, Sonnet, Codex (unanimous)

Changed from Debug.WriteLine (stripped in Release) to Console.WriteLine (always present). Logs {ex} (full stack trace) instead of {ex.Message}. On iOS/Android, Console.WriteLine goes nowhere useful. Should use ILogger or keep Debug.WriteLine for the stack trace.


📋 Additional Notes

  • CI: ⚠️ No checks configured for this branch
  • Prior review comment addressed: Yes — the original review had 2 blockers (architecture mismatch + safety regression). Both are resolved: PR description now explicitly documents 5×Opus architecture with "Why 5x Opus?" section, and all fix instructions use git merge with push verification.
  • Notable single-model finding (Sonnet only, did not meet consensus): StartFleetAsync lacks an IsRemoteMode guard -- in remote mode state.Session is always null, so the new error message "Session is not connected" is misleading. Every other CopilotService method that touches state.Session checks IsRemoteMode first. Worth addressing but not a consensus finding.

Recommended action: ✅ Approve

Both previous blockers are resolved. The remaining items are minor.

The PR delivers what it promises: 5 Opus workers with internal 3-model consensus, merge-based fix process, and structured re-review tracking.

@PureWeen (Owner, Author)

🤖 Multi-Model Code Review (Fresh Review #3) — PR #451

feat: mixed-model PR Review Squad (Opus + Sonnet + Codex)
Reviewed by: Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.3-Codex · 2-of-3 consensus applied


Consensus Findings (flagged by 2+ models)

🟡 MODERATE — Console.WriteLine replaces Debug.WriteLine in production path

File: PolyPilot/Services/CopilotService.cs:~2008
Flagged by: Opus 🟢, Sonnet 🟢, Codex 🟡 (unanimous)

Changed from System.Diagnostics.Debug.WriteLine (stripped in Release builds) to Console.WriteLine (always present). Logs full exception via {ex} (message + stack trace). On iOS/Android, Console.WriteLine goes nowhere useful and bypasses the app's ILogger pipeline. Should use ILogger or at minimum keep Debug.WriteLine for the stack trace while returning the sanitized error string to the caller.


🟡 MODERATE — Degraded consensus fallback undefined (1-model scenario)

File: PolyPilot/Models/ModelCapabilities.cs (WorkerReviewPrompt §2-3)
Flagged by: Opus 🟡, Sonnet 🟢

The prompt says "If a model is unavailable, proceed with the remaining models" and "require 2+ models (if only 2 ran, require both)." If 2 of 3 models fail, only 1 remains — the 2+ consensus threshold can never be met, silently filtering every finding and producing an empty report with no warning. Add explicit guidance: "If only 1 model ran, include all its findings with a low-confidence disclaimer."
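The proposed guidance could be encoded roughly like this (illustrative Python sketch with hypothetical names; the actual logic lives in the worker prompt text, not code):

```python
def filter_with_fallback(findings_by_model, min_agree=2):
    """2+ model consensus, falling back to flagged output when only one model ran."""
    ran = {m: fs for m, fs in findings_by_model.items() if fs is not None}
    if len(ran) == 1:
        # Degraded mode: a lone surviving model can never meet the 2+ bar,
        # so report all of its findings with an explicit low-confidence marker.
        (only,) = ran.values()
        return [(f, "low-confidence: single-model") for f in only]
    counts = {}
    for fs in ran.values():
        for f in set(fs):
            counts[f] = counts.get(f, 0) + 1
    return [(f, "consensus") for f, n in counts.items() if n >= min_agree]

# Two of the three model APIs failed (None) -- findings still surface:
print(filter_with_fallback({"opus": None, "sonnet": ["leak in X"], "codex": None}))
# [('leak in X', 'low-confidence: single-model')]
```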


🟢 MINOR — .squad/ deletion removes deterministic push safety script

File: .squad/push-to-pr.sh (deleted), .squad/decisions.md (deleted)
Flagged by: Sonnet 🟡, Codex 🟢

The 58-line bash script enforced push safety deterministically (branch verification, correct remote detection, SHA comparison). Replaced by prompt text in the worker instructions (steps 5-6: git fetch && git log). Prompt-based instructions are less reliable than set -e scripted checks. However, gh pr checkout handles remote tracking and the new prompt includes merge-not-rebase + push verification steps, so the gap is narrowed. Acceptable for merge but worth noting the regression in enforcement rigor.


Notable Single-Model Findings (did not meet consensus — informational only)

| Finding | Model | Severity |
|---------|-------|----------|
| StartFleetAsync has no IsRemoteMode guard -- in remote mode state.Session is always null, so /fleet gives a misleading "Session not connected" error instead of "not supported in remote mode" | Codex | 🟡 |
| Force-push prohibition (--force) not explicitly banned -- the old SharedContext had it; the new one only removes the recommendation | Sonnet | 🟡 |
| "Verify Claims Against Code" section removed from worker prompt -- weakens cross-checking of PR descriptions | Codex | 🟡 |

📋 Summary

  • CI: ⚠️ No checks configured for this branch
  • Test coverage: Good preset assertions added (5×Opus, 3 sub-agent models, merge-not-rebase, 1-worker-per-PR). StartFleetAsync tuple return type has no dedicated tests.
  • Prior reviews (2 comments): Review 1 found 7 issues (2 blockers: architecture mismatch + safety regression). Review 2 confirmed both blockers resolved, 5/7 fixed, 2/7 partially fixed. This fresh review confirms the PR is in good shape.
  • Architecture: Code, description, and tests are aligned — 5 Opus workers each dispatching 3 sub-agents (Opus/Sonnet/Codex) with 2-of-3 consensus.
  • Safety: git merge replaces git rebase everywhere. --force-with-lease removed. Push verification added.

Recommended action: ✅ Approve

The PR delivers what it promises. Two moderate findings remain (Console.WriteLine logging, degraded consensus fallback) but neither blocks merge. The Console.WriteLine is a minor production hygiene issue, and the 1-model consensus gap is an edge case (2 of 3 model APIs failing simultaneously) that can be addressed in a follow-up.

PureWeen and others added 3 commits March 29, 2026 10:41
Workers now:
- Post exactly ONE comment per PR (edit existing, never add new)
- Use adversarial consensus: when only 1 model flags an issue,
  the other models get a follow-up round to agree/disagree
- Handle degraded mode: if only 1 model ran, include findings
  with low-confidence disclaimer
- Structured re-review updates the existing comment in-place

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When a user sends a message to an orchestrator that's already
dispatching workers, the message is now queued with a '📨 New task
queued' system message visible in the orchestrator's chat. After
the current dispatch completes (workers finish + synthesis), all
queued messages are drained and dispatched sequentially.

Previously, messages blocked silently on a semaphore — the user
got no feedback and the message appeared to vanish.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two fixes for stuck orchestrator when a worker goes silent:

1. Add 15-minute OrchestratorCollectionTimeout on Task.WhenAll.
   If any worker is stuck, force-complete it and proceed to
   synthesis with partial results. Previously the orchestrator
   would block for up to 60 minutes (WorkerExecutionTimeout).

2. Don't reset WatchdogCaseBLastFileSize to 0 on each SDK event.
   The stale-file-size detection needs prevSize > 0 to work on
   its first iteration. Resetting to 0 wasted one full 180s
   timeout cycle, compounding to ~540s (9 min) total recovery
   instead of ~360s (6 min).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

PureWeen commented Mar 29, 2026

🤖 Multi-Model Code Review (Re-Review #6 — Post-Fix) — PR #451

feat: mixed-model PR Review Squad (Opus + Sonnet + Codex)
Reviewed by: Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.3-Codex · 2-of-3 adversarial consensus applied
Diff: +230/-254 across 11 files · Latest commit: fix: address consensus findings — queue cleanup, timeout bound, cancel path


Previous Findings Status (from Review #5)

| # | Finding | Status | Details |
|---|---------|--------|---------|
| 1 | 🟡 Missing _orchestratorQueuedPrompts cleanup in DeleteGroup | FIXED | TryRemove added at line 1043 (Opus ✅, Sonnet ✅, Codex ✅) |
| 2 | 🟡 Collection timeout not a true bound for pre-processing workers | FIXED | New else if branch at lines 1955-1960 resolves TCS for non-processing workers (Opus ✅, Sonnet ⚠️, Codex ⚠️ -- see below) |
| 3 | 🟢 Cancellation triggers force-complete path | FIXED | CancellationToken.None at line 1939 with explanatory comment (Opus ✅, Sonnet ⚠️, Codex ✅ -- see below) |

All 3 targeted fixes are functionally correct. Two models raised secondary concerns about the fix implementations (detailed below), but the original bugs are resolved.


New Consensus Findings (2+ models agree after adversarial challenge)

🟢 MINOR — _groupDispatchLocks not cleaned up in DeleteGroup

File: PolyPilot/Services/CopilotService.Organization.cs:~1043
Flagged by: Opus, Sonnet (Codex not asked)

The fix commit added _orchestratorQueuedPrompts.TryRemove but not _groupDispatchLocks.TryRemove. One SemaphoreSlim (~100 bytes, no OS handles) leaks per deleted group. Not a functional bug — SendToMultiAgentGroupAsync guards on group existence before touching the lock. Do NOT call Dispose() on the removed semaphore (race with concurrent GetOrAdd callers).

Fix: _groupDispatchLocks.TryRemove(groupId, out _);


🟢 MINOR — Stranded queued prompts when depth exceeds drain cap

File: PolyPilot/Services/CopilotService.Organization.cs:1724-1758
Flagged by: Opus, Sonnet

DrainOrchestratorQueueAsync caps at maxDrainPerCycle=3. If >3 prompts queue while the orchestrator is busy, the remainder sit until the next user message triggers another dispatch+drain. The "📨 will process after this cycle completes" message is misleading — no automatic trigger exists. User is notified, so not silent data loss. Practical risk is low (queueing 4+ prompts during one dispatch is unusual), and the cap correctly prevents unbounded lock holding.
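The capped drain amounts to a bounded dequeue loop. A Python sketch, assuming a synchronous dispatch callback for brevity (the real DrainOrchestratorQueueAsync is async C#):

```python
from collections import deque

MAX_DRAIN_PER_CYCLE = 3  # mirrors maxDrainPerCycle in the PR

def drain_queue(queue, dispatch):
    """Dispatch at most MAX_DRAIN_PER_CYCLE queued prompts; leave the rest queued."""
    drained = 0
    while queue and drained < MAX_DRAIN_PER_CYCLE:
        dispatch(queue.popleft())
        drained += 1
    return drained

q = deque(["p1", "p2", "p3", "p4", "p5"])
sent = []
drain_queue(q, sent.append)
print(sent, list(q))  # ['p1', 'p2', 'p3'] ['p4', 'p5']
```

As the finding notes, the leftover prompts ("p4", "p5") wait until something triggers the next dispatch+drain cycle; the cap trades latency for bounded lock holding.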


🟢 MINOR — Uncancelled 15-min Task.Delay timer

File: PolyPilot/Services/CopilotService.Organization.cs:1939
Flagged by: Opus, Sonnet (Codex disagrees — says GC handles it)

Task.Delay(OrchestratorCollectionTimeout, CancellationToken.None) holds a System.Threading.Timer for 15 minutes even when orchestration finishes early. Bounded in practice (1-5 concurrent timers in normal use), but a trivial hygiene fix:

    using var timeoutCts = new CancellationTokenSource();
    var timeout = Task.Delay(OrchestratorCollectionTimeout, timeoutCts.Token);
    if (await Task.WhenAny(allDone, timeout) == allDone) timeoutCts.Cancel();

Contested Finding (did not reach consensus — informational)

ℹ️ TrySetResult on pre-processing workers skips companion state cleanup

File: PolyPilot/Services/CopilotService.Organization.cs:1955-1960
Sonnet: 🟡 MODERATE — SendingFlag not cleared, OnSessionComplete not fired
Opus: 🟢 LOW — gap exists but triggering conditions are extremely narrow (microsecond-scale cross-thread race between two UI-thread mutations)
Codex: DISAGREE — transitions to non-processing already clear SendingFlag on normal/abort/watchdog paths

Analysis: The else if branch fires when IsProcessing=false AND ResponseCompletion.Task.IsCompleted==false. Every code path that clears IsProcessing also resolves the TCS and clears SendingFlag in the same UI-thread dispatch. The window for this state combination is vanishingly small. Adding Interlocked.Exchange(ref ws.SendingFlag, 0) is essentially free defensive cleanup but not a blocker.


📋 Summary

  • CI: ⚠️ No checks configured for this branch
  • Tests: BuiltInPresets_IncludePRReviewSquad and Scenario_CreateGroupFromPreset both ✅ pass
  • Fix commit quality: All 3 targeted fixes are correct, minimal, and well-commented
  • Remaining items: Three 🟢 MINOR findings (dispatch lock cleanup, stranded prompts, timer hygiene) — none are blockers

Recommended action: ✅ Approve

All previous findings are resolved. The three new minor findings are good hygiene improvements but don't block merge. The fix commit is clean — each fix is minimal, correctly targeted, and includes explanatory comments. The PR delivers what it promises.

1. 🔴 Force-completion now uses ForceCompleteProcessingAsync instead
   of bare TrySetResult — properly clears IsProcessing + 9 companion
   fields (INV-1 compliant). Timed-out workers no longer show stuck
   spinner.

2. 🟡 Fix double-await of allDone that re-threw caught exceptions.
   Timeout path now collects results individually with per-task
   try/catch, then passes partial results array to synthesis.

3. 🟡 Cap queue drain to 3 per cycle to prevent unbounded lock
   holding. Remaining prompts get user-visible feedback and are
   processed on the next cycle.

4. 🟡 Console.WriteLine → Debug() in fleet error path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…llation

1. 🟡 Clean up _orchestratorQueuedPrompts in DeleteGroup to prevent
   memory/stale-state leak when orchestrator groups are removed.

2. 🟡 Collection timeout now also resolves TCS for workers not yet in
   IsProcessing state (stuck in SendAsync). Previously these workers
   were skipped by ForceCompleteProcessingAsync, causing await to
   block beyond the 15min timeout.

3. 🟢 Use CancellationToken.None for the timeout Task.Delay so caller
   cancellation propagates cleanly via the worker tasks instead of
   incorrectly entering the force-complete branch.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen (Owner, Author)

🤖 Multi-Model Code Review (Review #6) — PR #451

feat: mixed-model PR Review Squad (Opus + Sonnet + Codex)
Reviewed by: Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.3-Codex


Previous Review #4 Findings — Status

All 4 findings from Review #4 are unanimously confirmed FIXED by all 3 models:

| # | Previous Finding | Status | Consensus |
|---|------------------|--------|-----------|
| 1 | 🔴 Force-completion via bare TrySetResult left IsProcessing stuck | FIXED -- now uses ForceCompleteProcessingAsync (INV-1 compliant). Bare TrySetResult only fires when IsProcessing==false (correct). | 3/3 |
| 2 | 🟡 Double-await of allDone re-throws caught OCE | FIXED -- Task.Delay uses CancellationToken.None; per-task try/catch in timeout branch; caller cancellation routes through the happy path cleanly. | 3/3 |
| 3 | 🟡 Unbounded queue drain holds dispatch lock indefinitely | FIXED -- maxDrainPerCycle = 3 cap added. | 3/3 |
| 4 | 🟡 Console.WriteLine replaces Debug.WriteLine in production | FIXED -- StartFleetAsync now uses the Debug() helper. | 3/3 |

New Findings (2/3 model consensus required)

🟡 MODERATE — GetOrchestratorSession reads Organization.Sessions in async continuation paths

File: CopilotService.Organization.cs — lines 1738, 1752 (DrainOrchestratorQueueAsync)
Models: Sonnet + Codex (2/3)

GetOrchestratorSession reads Organization.Sessions (a plain List<SessionMeta>, documented as UI-thread-only). It is called inside DrainOrchestratorQueueAsync after await SendViaOrchestratorAsync(...). In MAUI Blazor Hybrid, the UIKitSynchronizationContext marshals async continuations back to the UI thread, so this is safe in the current call chain (always initiated from Blazor event handlers). However, the method has no thread-affinity guard — if a future caller invokes SendToMultiAgentGroupAsync from a background thread, or if ConfigureAwait(false) is added anywhere in the chain, this becomes a data race on the List.

Impact: Defensive concern, not a current runtime bug. Low risk.

🟡 MODERATE — StartFleetAsync missing IsRemoteMode guard

File: CopilotService.cs — lines ~1991-1996
Models: Opus + Sonnet + Codex (3/3)

StartFleetAsync checks state.Session == null and returns "Session is not connected". In remote mode, state.Session is always null by design — the error message is misleading ("not connected" when the session IS connected via bridge). The established convention is to check IsRemoteMode first and delegate or return an appropriate message. Low runtime risk since fleet startup is desktop-only, but violates the project convention and would be a trap if fleet support is ever bridged to mobile.
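The convention the models point to is a simple guard ordering. A Python sketch with hypothetical names mirroring the C# described above (not the actual PolyPilot API):

```python
from types import SimpleNamespace

def start_fleet(state):
    # Convention: check remote mode before touching state.session, because
    # in remote mode the session is always None by design (it lives behind
    # the bridge), so "not connected" would be the wrong diagnosis.
    if state.is_remote_mode:
        return (False, "Fleet mode is not supported in remote mode.")
    if state.session is None:
        return (False, "Session is not connected.")
    return (True, None)

remote = SimpleNamespace(is_remote_mode=True, session=None)
local_disconnected = SimpleNamespace(is_remote_mode=False, session=None)
print(start_fleet(remote))              # (False, 'Fleet mode is not supported in remote mode.')
print(start_fleet(local_disconnected))  # (False, 'Session is not connected.')
```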


Single-Model Findings (excluded by consensus rule, noted for completeness)

  • Drain cap (3) still allows ~45 min worst-case lock hold (Opus only)
  • DeleteGroup queue cleanup races with in-flight dispatch — orphaned queue possible (Sonnet only)
  • Orphaned XML doc comment from field insertion (Opus only)

CI Status

⚠️ No CI checks configured for fix/mixed-model-review-squad branch.

Test Coverage

PR adds 24 new test assertions in SessionOrganizationTests.cs validating:

  • 5×Opus worker configuration
  • 3 sub-agent model diversity (Opus, Sonnet, Codex)
  • Adversarial consensus language
  • Merge-not-rebase instruction

New code paths in DrainOrchestratorQueueAsync, force-completion, and queue cleanup lack direct unit tests. These are hard to unit-test (deep integration with SendViaOrchestratorAsync) but could benefit from focused testing of the drain cap and error-handling paths.

Existing Review Comments

4 prior review comments (Reviews #1-5). All previously-flagged blockers are resolved.


Recommended Action

Approve — Both critical blockers from Review #4 are resolved. The two remaining moderate findings are defensive concerns (not active bugs). The force-completion path now correctly uses ForceCompleteProcessingAsync, the drain is bounded, cancellation is handled cleanly, and logging uses the project helper. Good to merge.

@PureWeen PureWeen merged commit bfadbd2 into main Mar 30, 2026
@PureWeen PureWeen deleted the fix/mixed-model-review-squad branch March 30, 2026 01:02