feat: mixed-model PR Review Squad (Opus + Sonnet + Codex)#451
Conversation
Force-pushed from 32b4d6b to 91041af.
….squad/

Replace 5 identical Sonnet workers with 3 diverse-model workers:
- Worker 1: Claude Opus (deep reasoning, subtle bugs)
- Worker 2: Claude Sonnet (fast pattern matching, common bugs)
- Worker 3: GPT Codex (alternative perspective, edge cases)

Model diversity is now achieved at the worker level instead of relying on workers to internally dispatch sub-agents. The orchestrator synthesizes consensus (2-of-3 filter) from the different model perspectives.

Also removes the .squad/ directory — the useful content (routing rules, review standards) is now baked into the built-in preset. Real Squad integration will come via SquadSessionProvider (issue #436).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
User feedback: 3 workers were too few. Restored to 5 workers with mixed models for better consensus coverage. Updated consensus filter references to 2-of-5. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Force-pushed from baea1ae to ae642e5.
Each worker dispatches 3 sub-agents (Opus, Sonnet, Codex) via the task tool for consensus. The orchestrator assigns 1 worker per PR and distributes round-robin — no fan-out of multiple workers to the same PR. All workers run on Opus (best at orchestrating sub-agent dispatch). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
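The round-robin routing described above can be sketched as follows. This is a hypothetical Python illustration of the assignment rule (one worker per PR, no fan-out); the function and names are not part of PolyPilot's actual code.

```python
# Illustrative sketch of 1-worker-per-PR round-robin routing.
# assign_round_robin is a hypothetical name, not the preset's API.

def assign_round_robin(prs, workers):
    """Assign exactly one worker per PR, cycling through workers in order."""
    return {pr: workers[i % len(workers)] for i, pr in enumerate(prs)}

# With 5 workers, a 6th PR wraps back to worker 1 -- no fan-out of
# multiple workers onto the same PR.
workers = [f"worker-{n}" for n in range(1, 6)]
prs = [f"pr-{n}" for n in range(1, 7)]
assignments = assign_round_robin(prs, workers)
```

Each PR maps to exactly one worker, and the sixth PR wraps back to `worker-1`.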
Return specific error (session not found, not connected, processing, RPC error) instead of generic failure message. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Force-pushed from ae642e5 to b3fe6fe.
🤖 Multi-Model Code Review — PR #451: feat: mixed-model PR Review Squad (Opus + Sonnet + Codex)

🔴 CRITICAL — Worker array contradicts PR description: 5×Opus, not 3 diverse models
File: … The diff changes workers from … This is also a cost regression — 5 Opus workers each dispatching 3 sub-agents = 20 model calls per review, with 10 at premium Opus tier.

🟡 MODERATE — …
The fleet review incorrectly flagged these as non-existent because they aren't in PolyPilot's internal registry, but these are CLI model IDs passed to the task tool — the CLI supports them. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Replace git rebase + force-with-lease with git merge in worker fix instructions and SharedContext (safety regression)
- Add push verification step (equivalent to deleted push-to-pr.sh)
- Restore structured re-review tracking (FIXED / STILL PRESENT / N/A)
- Sanitize RPC exception in fleet error (don't leak internal paths)
- Improve test coverage: verify all 5 models are Opus, verify worker prompts contain all 3 sub-agent model names, verify merge not rebase in SharedContext, verify 1-worker-per-PR routing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🤖 Multi-Model Code Review (Re-Review) — PR #451: feat: mixed-model PR Review Squad (Opus + Sonnet + Codex)

Previous Findings Status
Summary: 5 of 7 previous findings are fully resolved. 2 are partially addressed (see details below). Remaining Issues
🤖 Multi-Model Code Review (Fresh Review #3) — PR #451: feat: mixed-model PR Review Squad (Opus + Sonnet + Codex)

Consensus Findings (flagged by 2+ models)

🟡 MODERATE — …
| Finding | Model | Severity |
|---|---|---|
| `StartFleetAsync` has no `IsRemoteMode` guard — in remote mode `state.Session` is always null, so `/fleet` gives a misleading "Session not connected" error instead of "not supported in remote mode" | Codex | 🟡 |
| Force push prohibition (`--force`) not explicitly banned — old SharedContext had it, the new one only removes the recommendation | Sonnet | 🟡 |
| "Verify Claims Against Code" section removed from worker prompt — weakens cross-checking of PR descriptions | Codex | 🟡 |
📋 Summary
- CI: ⚠️ No checks configured for this branch
- Test coverage: Good preset assertions added (5×Opus, 3 sub-agent models, merge-not-rebase, 1-worker-per-PR). The `StartFleetAsync` tuple return type has no dedicated tests.
- Prior reviews (2 comments): Review 1 found 7 issues (2 blockers: architecture mismatch + safety regression). Review 2 confirmed both blockers resolved, 5/7 fixed, 2/7 partially fixed. This fresh review confirms the PR is in good shape.
- Architecture: Code, description, and tests are aligned — 5 Opus workers each dispatching 3 sub-agents (Opus/Sonnet/Codex) with 2-of-3 consensus.
- Safety: `git merge` replaces `git rebase` everywhere. `--force-with-lease` removed. Push verification added.
Recommended action: ✅ Approve
The PR delivers what it promises. Two moderate findings remain (Console.WriteLine logging, degraded consensus fallback) but neither blocks merge. The Console.WriteLine is a minor production hygiene issue, and the 1-model consensus gap is an edge case (2 of 3 model APIs failing simultaneously) that can be addressed in a follow-up.
Workers now:
- Post exactly ONE comment per PR (edit existing, never add new)
- Use adversarial consensus: when only 1 model flags an issue, the other models get a follow-up round to agree/disagree
- Handle degraded mode: if only 1 model ran, include findings with a low-confidence disclaimer
- Structured re-review updates the existing comment in-place

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
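The adversarial-consensus and degraded-mode rules can be sketched as below. This is a hypothetical Python illustration of the filtering logic only; `filter_findings` and the `challenge` callback are invented names, not the worker prompt's actual mechanism.

```python
# Illustrative sketch of adversarial 2-of-N consensus with degraded mode.

def filter_findings(findings_by_model, challenge):
    """findings_by_model: {model_name: set of finding ids from round 1}.
    challenge(finding, model) -> bool: follow-up round where `model` is
    asked to agree/disagree with a finding it did not flag itself."""
    models = list(findings_by_model)
    if len(models) == 1:
        # Degraded mode: only 1 model ran -> keep findings, mark low confidence.
        return {(f, "low-confidence") for f in findings_by_model[models[0]]}
    accepted = set()
    for f in set().union(*findings_by_model.values()):
        votes = sum(f in flagged for flagged in findings_by_model.values())
        if votes == 1:
            # Adversarial round: the non-flagging models vote again.
            votes += sum(challenge(f, m) for m in models
                         if f not in findings_by_model[m])
        if votes >= 2:
            accepted.add((f, "consensus"))
    return accepted

# A finding flagged by one model survives only if a challenged model agrees.
result = filter_findings({"opus": {"A", "B"}, "sonnet": {"A"}, "codex": set()},
                         lambda f, m: f == "B")
```

Here "A" passes on first-round votes and "B" passes only because the challenged models agree during the follow-up round.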
When a user sends a message to an orchestrator that's already dispatching workers, the message is now queued with a '📨 New task queued' system message visible in the orchestrator's chat. After the current dispatch completes (workers finish + synthesis), all queued messages are drained and dispatched sequentially. Previously, messages blocked silently on a semaphore — the user got no feedback and the message appeared to vanish. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
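The queue-then-drain behavior described above can be sketched as follows. This is a hypothetical, single-threaded Python illustration; `OrchestratorSketch` and its fields are invented names and do not reflect PolyPilot's actual C# implementation.

```python
from collections import deque

class OrchestratorSketch:
    """Illustrative model of queueing user prompts during dispatch."""

    def __init__(self):
        self.dispatching = False
        self.queued = deque()
        self.chat = []  # system messages the user sees in the orchestrator chat

    def send(self, prompt):
        if self.dispatching:
            # Previously this blocked silently on a semaphore; now the user
            # gets visible feedback and the prompt is queued.
            self.queued.append(prompt)
            self.chat.append(f"📨 New task queued: {prompt}")
            return
        self._dispatch(prompt)

    def _dispatch(self, prompt):
        self.dispatching = True
        self.chat.append(f"dispatching: {prompt}")
        # ... workers run and synthesis completes here ...
        self.dispatching = False
        while self.queued:  # drain queued prompts sequentially
            self._dispatch(self.queued.popleft())
```

Sending while a dispatch is in flight queues the prompt with visible feedback; completing the dispatch drains the queue in order.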
Two fixes for a stuck orchestrator when a worker goes silent:

1. Add a 15-minute OrchestratorCollectionTimeout on Task.WhenAll. If any worker is stuck, force-complete it and proceed to synthesis with partial results. Previously the orchestrator would block for up to 60 minutes (WorkerExecutionTimeout).
2. Don't reset WatchdogCaseBLastFileSize to 0 on each SDK event. The stale-file-size detection needs prevSize > 0 to work on its first iteration. Resetting to 0 wasted one full 180s timeout cycle, compounding to ~540s (9 min) total recovery instead of ~360s (6 min).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
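The collection-timeout fix can be sketched with asyncio. This is an illustrative Python analogue of the C# behavior, not the actual code; `collect_with_timeout` and the cancellation-as-force-complete step are assumptions standing in for the real `Task.WhenAll` + force-complete path.

```python
import asyncio

COLLECTION_TIMEOUT = 15 * 60  # seconds; the 15-minute cap described above

async def collect_with_timeout(worker_tasks, timeout=COLLECTION_TIMEOUT):
    """Wait for all workers, but if any is still running when the collection
    timeout fires, force-complete it and hand partial results to synthesis
    instead of blocking for the full per-worker execution timeout."""
    done, pending = await asyncio.wait(worker_tasks, timeout=timeout)
    results = [t.result() for t in done if t.exception() is None]
    for t in pending:
        t.cancel()            # force-complete the stuck worker
        results.append(None)  # placeholder so synthesis sees a partial set
    return results
```

A stuck worker no longer holds up synthesis; its slot is filled with a placeholder and the completed results proceed.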
🤖 Multi-Model Code Review (Re-Review #6 — Post-Fix) — PR #451: feat: mixed-model PR Review Squad (Opus + Sonnet + Codex)

Previous Findings Status (from Review #5)
All 3 targeted fixes are functionally correct. Two models raised secondary concerns about the fix implementations (detailed below), but the original bugs are resolved.

New Consensus Findings (2+ models agree after adversarial challenge)

🟢 MINOR — …
1. 🔴 Force-completion now uses ForceCompleteProcessingAsync instead of bare TrySetResult — properly clears IsProcessing + 9 companion fields (INV-1 compliant). Timed-out workers no longer show a stuck spinner.
2. 🟡 Fix double-await of allDone that re-threw caught exceptions. The timeout path now collects results individually with a per-task try/catch, then passes the partial results array to synthesis.
3. 🟡 Cap queue drain to 3 per cycle to prevent unbounded lock holding. Remaining prompts get user-visible feedback and are processed on the next cycle.
4. 🟡 Console.WriteLine → Debug() in the fleet error path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
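The per-task collection in fix 2 can be sketched as below. This is an illustrative Python analogue (the real code is C#); `collect_individually` is an invented name, and the point shown is only the shape of the fix: awaiting each worker under its own try/except rather than re-awaiting a combined task that re-raises an already-observed exception.

```python
import asyncio

async def collect_individually(worker_tasks):
    """Collect each worker result under its own try/except so one failure
    does not abort collection, yielding a partial array for synthesis."""
    results = []
    for t in worker_tasks:
        try:
            results.append(await t)
        except Exception as exc:
            results.append(exc)  # keep the slot; synthesis sees every worker
    return results
```

A failing worker contributes its exception object in place of a result, so synthesis still receives one entry per worker.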
…llation

1. 🟡 Clean up _orchestratorQueuedPrompts in DeleteGroup to prevent a memory/stale-state leak when orchestrator groups are removed.
2. 🟡 The collection timeout now also resolves the TCS for workers not yet in the IsProcessing state (stuck in SendAsync). Previously these workers were skipped by ForceCompleteProcessingAsync, causing the await to block beyond the 15-minute timeout.
3. 🟢 Use CancellationToken.None for the timeout Task.Delay so caller cancellation propagates cleanly via the worker tasks instead of incorrectly entering the force-complete branch.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🤖 Multi-Model Code Review (Review #6) — PR #451: feat: mixed-model PR Review Squad (Opus + Sonnet + Codex)

Previous Review #4 Findings — Status
All 4 findings from Review #4 are unanimously confirmed FIXED by all 3 models.

New Findings (2/3 model consensus required)

🟡 MODERATE — …
Summary
Redesigns the built-in PR Review Squad multi-agent preset and adds fleet command diagnostics.
Architecture: 5x Opus Workers with Internal Multi-Model Dispatch
Each of the 5 workers runs on Opus and internally dispatches 3 parallel sub-agent reviews via the task tool: one on Claude Opus, one on Claude Sonnet, and one on GPT Codex.
The worker synthesizes a 2-of-3 consensus report. The orchestrator assigns one worker per PR (round-robin for multiple PRs) -- no fan-out.
Changes
Why 5x Opus?
Workers orchestrate sub-agent dispatch -- Opus excels at this. Model diversity happens at the sub-agent level inside each worker.