feat: mixed-model PR Review Squad (Opus + Sonnet + Codex)#451

Merged
PureWeen merged 13 commits into main from fix/mixed-model-review-squad
Mar 30, 2026

Conversation


@PureWeen PureWeen commented Mar 29, 2026

Summary

Redesigns the built-in PR Review Squad multi-agent preset and adds fleet command diagnostics.

Architecture: 5x Opus Workers with Internal Multi-Model Dispatch

Each of the 5 workers runs on Opus and internally dispatches 3 parallel sub-agent reviews via the task tool:

  • claude-opus-4.6 -- deep reasoning, architecture, subtle logic bugs
  • claude-sonnet-4.6 -- fast pattern matching, common bug classes, security
  • gpt-5.3-codex -- alternative perspective, edge cases

The worker synthesizes a 2-of-3 consensus report. The orchestrator assigns one worker per PR (round-robin for multiple PRs) -- no fan-out.
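For illustration, the dispatch rules above reduce to a few lines. This is a hypothetical Python sketch, not the actual PolyPilot code (the real rules are encoded in C# prompt text): a finding survives when 2 or more of the 3 sub-agent models flag it, and PRs map to workers round-robin with no fan-out.

```python
from collections import Counter

def consensus_findings(findings_by_model, min_agree=2):
    """Keep findings flagged by at least `min_agree` of the sub-agent models."""
    counts = Counter()
    for model, findings in findings_by_model.items():
        for finding in set(findings):  # dedupe within one model's report
            counts[finding] += 1
    return {f for f, n in counts.items() if n >= min_agree}

def assign_round_robin(prs, workers):
    """One worker per PR, distributed round-robin -- no fan-out."""
    return {pr: workers[i % len(workers)] for i, pr in enumerate(prs)}

reviews = {
    "opus":   ["null-deref in Foo", "race in Bar"],
    "sonnet": ["null-deref in Foo", "style nit"],
    "codex":  ["race in Bar", "null-deref in Foo"],
}
print(sorted(consensus_findings(reviews)))  # ['null-deref in Foo', 'race in Bar']
print(assign_round_robin(["PR-1", "PR-2", "PR-3"], ["w1", "w2"]))
```

The "style nit" finding is dropped because only one model raised it; the other two findings meet the 2-of-3 bar.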

Changes

  1. WorkerReviewPrompt -- instructs workers to dispatch 3 sub-agents with different models
  2. RoutingContext -- 1 worker per PR, no fan-out, round-robin distribution
  3. SharedContext -- consensus filter (2+ models agree), fix standards use git merge (not rebase)
  4. Fix process -- git merge instead of git rebase + force-with-lease, includes push verification
  5. Re-review tracking -- structured FIXED / STILL PRESENT / N/A tracking restored
  6. .squad/ deleted -- replaced by built-in preset; Squad integration tracked in Deep Squad Integration: SquadSessionProvider via ISessionProvider Plugin System #436
  7. /fleet diagnostics -- StartFleetAsync returns specific error reason
  8. Tests -- verify all 5 workers are Opus, prompts contain 3 sub-agent models, merge not rebase

Why 5x Opus?

Workers orchestrate sub-agent dispatch -- Opus excels at this. Model diversity happens at the sub-agent level inside each worker.

PureWeen force-pushed the fix/mixed-model-review-squad branch from 32b4d6b to 91041af on March 29, 2026 at 07:08
PureWeen and others added 2 commits March 29, 2026 09:25
….squad/

Replace 5 identical Sonnet workers with 3 diverse-model workers:
- Worker 1: Claude Opus (deep reasoning, subtle bugs)
- Worker 2: Claude Sonnet (fast pattern matching, common bugs)
- Worker 3: GPT Codex (alternative perspective, edge cases)

Model diversity is now achieved at the worker level instead of
relying on workers to internally dispatch sub-agents. The
orchestrator synthesizes consensus (2-of-3 filter) from the
different model perspectives.

Also removes .squad/ directory — the useful content (routing rules,
review standards) is now baked into the built-in preset. Real Squad
integration will come via SquadSessionProvider (issue #436).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
User feedback: 3 workers was too few. Restored to 5 workers with
mixed models for better consensus coverage. Updated consensus
filter references to 2-of-5.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen force-pushed the fix/mixed-model-review-squad branch from baea1ae to ae642e5 on March 29, 2026 at 14:25
PureWeen and others added 2 commits March 29, 2026 09:31
Each worker dispatches 3 sub-agents (Opus, Sonnet, Codex) via the
task tool for consensus. The orchestrator assigns 1 worker per PR
and distributes round-robin — no fan-out of multiple workers to
the same PR. All workers run on Opus (best at orchestrating
sub-agent dispatch).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Return specific error (session not found, not connected, processing,
RPC error) instead of generic failure message.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen force-pushed the fix/mixed-model-review-squad branch from ae642e5 to b3fe6fe on March 29, 2026 at 14:31
@PureWeen (Owner, Author)

🤖 Multi-Model Code Review — PR #451

feat: mixed-model PR Review Squad (Opus + Sonnet + Codex)
Reviewed by: Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.3-Codex · 2-of-3 consensus applied


🔴 CRITICAL — Worker array contradicts PR description: 5×Opus, not 3 diverse models

File: PolyPilot/Models/ModelCapabilities.cs:327
Flagged by: Opus, Sonnet, Codex (unanimous)

The diff changes workers from 5×claude-sonnet-4.6 to 5×claude-opus-4.6. The PR description claims "Replace 5 identical Sonnet workers with 3 diverse-model workers (Opus + Sonnet + Codex)." The code keeps 5 workers, all Opus. The description string in the code even says "5 reviewers — each does multi-model consensus (Opus + Sonnet + Codex)", confirming the code and PR description describe two different architectures.

This is also a cost regression — 5 Opus workers each dispatching 3 sub-agents = 20 model calls per review, with 10 at premium Opus tier.
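The call-count arithmetic behind the cost concern checks out (assuming one review pass per PR, no retries):

```python
workers = 5                 # every worker runs on Opus
subagents_per_worker = 3    # Opus + Sonnet + Codex, dispatched via the task tool
total_calls = workers + workers * subagents_per_worker  # 5 orchestrating + 15 sub-agent
opus_calls = workers + workers * 1  # each Opus worker also dispatches one Opus sub-agent
print(total_calls, opus_calls)  # 20 10
```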


🟡 MODERATE — git rebase + --force-with-lease in worker prompt contradicts deleted safety rules

File: PolyPilot/Models/ModelCapabilities.cs:303-304
Flagged by: Opus, Sonnet

The fix instructions tell workers: git rebase origin/main then push with --force-with-lease. The deleted .squad/decisions.md explicitly banned both: "NEVER force push — no --force, --force-with-lease, or any force variant" and "ALWAYS integrate with git merge — never git rebase." These rules existed because rebase+force-push caused commits to land on the wrong remote. The PR deletes the safety documents while restoring the dangerous pattern.


🟡 MODERATE — .squad/ deletion removes push-to-pr.sh safety script with no replacement

File: .squad/push-to-pr.sh (deleted)
Flagged by: Opus, Sonnet, Codex (unanimous)

The 58-line script verified push targets, confirmed remote/branch matching, and validated pushes landed correctly. The embedded prompt's fix instructions are a brief 2-line "checkout, rebase, push" with no verification step. Workers pushing to contributor forks without verification is the exact failure mode decisions.md documented.


🟡 MODERATE — Test coverage insufficient for the change

File: PolyPilot.Tests/SessionOrganizationTests.cs:2531
Flagged by: Opus, Sonnet, Codex (unanimous)

The test only changes WorkerModels[0] from sonnet to opus. It still asserts WorkerModels.Length == 5 (unchanged). It doesn't verify models at indices 1-4, doesn't test model diversity, and doesn't validate the description matches reality. Missing tests:

  • Worker count matches intent (3 vs 5)
  • Model diversity (Opus + Sonnet + Codex all present)
  • Worker prompt content (rebase vs merge)
  • StartFleetAsync new tuple return type

🟡 MODERATE — Cross-worker consensus filter weakened/undefined

File: PolyPilot/Models/ModelCapabilities.cs:280-292
Flagged by: Opus, Sonnet

Old: 5 workers, 2+ agree = 40% threshold. New: 3 sub-agent models, 2+ agree = 67% threshold. The higher threshold means fewer findings survive consensus. Additionally, the orchestrator's routing context says "Assign ONE worker per PR" but has no guidance for merging findings if multiple workers review the same PR.
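The threshold shift is just arithmetic:

```python
# Fraction of reviewers that must agree before a finding survives the filter.
old_threshold = 2 / 5   # old: 5 workers, 2+ must agree
new_threshold = 2 / 3   # new: 3 sub-agent models, 2+ must agree
print(f"{old_threshold:.0%} -> {new_threshold:.0%}")  # 40% -> 67%
```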


🟢 MINOR — Raw exception messages surfaced in UI

File: PolyPilot/Services/CopilotService.cs:~1989-2008, Dashboard.razor:~2222-2226
Flagged by: Sonnet, Codex

ex.Message from RPC exceptions is written directly into chat history via ChatMessage.ErrorMessage($"Failed to start fleet mode: {error}"). This can expose internal endpoints, paths, or credential fragments. The message is persisted to events.jsonl. Fix: log full exception but return a generic "RPC error. Check logs." to the caller.
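The suggested fix is the standard log-full/return-generic pattern. A Python sketch for illustration (the real code is C# using the project's logging; `start_fleet_error` and the message text here are hypothetical):

```python
import logging

logger = logging.getLogger("fleet")

def start_fleet_error(ex: Exception) -> str:
    # Log the full exception (message + stack trace) for diagnostics...
    logger.error("Failed to start fleet mode", exc_info=ex)
    # ...but surface only a generic string to the UI and persisted chat history,
    # so internal endpoints, paths, or token fragments never leak.
    return "RPC error. Check logs."

print(start_fleet_error(RuntimeError("rpc://internal-host:9443 token=abc123")))
# RPC error. Check logs.
```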


🟢 MINOR — Re-review tracking removed

File: PolyPilot/Models/ModelCapabilities.cs:307-315 (deleted §6)
Flagged by: Opus, Sonnet

The old prompt had a "Re-Review Process" section with structured FIXED / STILL PRESENT / N/A tracking. The replacement is a one-liner: "re-run the 3-model review on the updated diff." Without explicit finding-status tracking, fixed issues may be re-reported and still-present issues may be missed.


📋 Additional Notes

  • CI: ⚠️ No checks configured for this branch
  • Prior review comments: None
  • Positive: The StartFleetAsync refactor from Task<bool> to Task<(bool, string?)> is clean — specific error messages with proper Dashboard integration.

Recommended action: 🔴 Do not merge

Two blockers:

  1. The feature wasn't implemented. Code delivers 5×Opus, not 3×mixed. Description, code string, and implementation are three different architectures. Decide on one and align all three.
  2. Safety regression. The deleted .squad/decisions.md and push-to-pr.sh documented hard-won lessons from wrong-remote pushes. Either keep the safety script or embed equivalent git merge + verification in the prompt. Do not restore rebase + --force-with-lease.

PureWeen and others added 3 commits March 29, 2026 09:49
The fleet review incorrectly flagged these as non-existent because
they aren't in PolyPilot's internal registry, but these are CLI
model IDs passed to the task tool — the CLI supports them.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Replace git rebase + force-with-lease with git merge in worker
  fix instructions and SharedContext (safety regression)
- Add push verification step (equivalent to deleted push-to-pr.sh)
- Restore structured re-review tracking (FIXED/STILL PRESENT/N/A)
- Sanitize RPC exception in fleet error (don't leak internal paths)
- Improve test coverage: verify all 5 models are Opus, verify worker
  prompts contain all 3 sub-agent model names, verify merge not rebase
  in SharedContext, verify 1-worker-per-PR routing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen (Owner, Author)

🤖 Multi-Model Code Review (Re-Review) — PR #451

feat: mixed-model PR Review Squad (Opus + Sonnet + Codex)
Reviewed by: Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.3-Codex · 2-of-3 consensus applied


Previous Findings Status

| # | Previous Finding | Status | Models |
|---|------------------|--------|--------|
| 1 | 🔴 Worker array 5×Opus vs described 3×mixed | FIXED | Opus ✅, Sonnet ✅, Codex ✅ |
| 2 | 🟡 git rebase + --force-with-lease contradicting deleted safety rules | FIXED | Opus ✅, Sonnet ✅, Codex ✅ |
| 3 | 🟡 push-to-pr.sh deleted with no replacement | ⚠️ PARTIALLY FIXED | Opus ⚠️, Sonnet ⚠️, Codex ❌ |
| 4 | 🟡 Insufficient test coverage | ⚠️ PARTIALLY FIXED | Opus ⚠️, Sonnet ✅, Codex ⚠️ |
| 5 | 🟡 Consensus filter weakened | FIXED | Opus ✅, Sonnet ✅, Codex ❌ |
| 6 | 🟢 Raw exception messages in UI | FIXED | Opus ✅, Sonnet ✅, Codex ✅ |
| 7 | 🟢 Re-review tracking removed | FIXED | Opus ✅, Sonnet ✅, Codex ✅ |

Summary: 5 of 7 previous findings are fully resolved. 2 are partially addressed (see details below).


Remaining Issues

⚠️ Finding #3: push-to-pr.sh — Inline verification weaker than script

Flagged by: Opus, Sonnet, Codex (unanimous)

Push verification is now inline in the worker prompt (steps 5–6: git fetch origin <branch> && git log). This replaces the deleted 58-line bash script that did SHA comparison and fork-remote detection. Text instructions are less deterministic than set -e scripted checks, but gh pr checkout handles remote tracking, so the gap is narrower than before. Acceptable for merge — the core safety (merge-not-rebase, gh pr checkout) is preserved.

⚠️ Finding #4: Test coverage — Preset tests good, StartFleetAsync untested

Flagged by: Opus, Codex

Good preset test additions: 5×Opus verification, 3 sub-agent model strings, merge-not-rebase assertions, 1-worker-per-PR routing. However, the StartFleetAsync signature change from Task<bool> to Task<(bool, string?)> has no dedicated tests for the new error tuple propagation or Dashboard integration.


New Issues (2+ model consensus)

🟢 MINOR — Console.WriteLine with full stack trace in production

File: PolyPilot/Services/CopilotService.cs:~2007
Flagged by: Opus, Sonnet, Codex (unanimous)

Changed from Debug.WriteLine (stripped in Release) to Console.WriteLine (always present). Logs {ex} (full stack trace) instead of {ex.Message}. On iOS/Android, Console.WriteLine goes nowhere useful. Should use ILogger or keep Debug.WriteLine for the stack trace.


📋 Additional Notes

  • CI: ⚠️ No checks configured for this branch
  • Prior review comment addressed: Yes — the original review had 2 blockers (architecture mismatch + safety regression). Both are resolved: PR description now explicitly documents 5×Opus architecture with "Why 5x Opus?" section, and all fix instructions use git merge with push verification.
  • Notable single-model finding (Sonnet only, did not meet consensus): StartFleetAsync lacks an IsRemoteMode guard -- in remote mode state.Session is always null, so the new error message "Session is not connected" is misleading. Every other CopilotService method that touches state.Session checks IsRemoteMode first. Worth addressing but not a consensus finding.

Recommended action: ✅ Approve

Both previous blockers are resolved. The remaining items are minor.

The PR delivers what it promises: 5 Opus workers with internal 3-model consensus, merge-based fix process, and structured re-review tracking.

@PureWeen (Owner, Author)

🤖 Multi-Model Code Review (Fresh Review #3) — PR #451

feat: mixed-model PR Review Squad (Opus + Sonnet + Codex)
Reviewed by: Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.3-Codex · 2-of-3 consensus applied


Consensus Findings (flagged by 2+ models)

🟡 MODERATE — Console.WriteLine replaces Debug.WriteLine in production path

File: PolyPilot/Services/CopilotService.cs:~2008
Flagged by: Opus 🟢, Sonnet 🟢, Codex 🟡 (unanimous)

Changed from System.Diagnostics.Debug.WriteLine (stripped in Release builds) to Console.WriteLine (always present). Logs full exception via {ex} (message + stack trace). On iOS/Android, Console.WriteLine goes nowhere useful and bypasses the app's ILogger pipeline. Should use ILogger or at minimum keep Debug.WriteLine for the stack trace while returning the sanitized error string to the caller.


🟡 MODERATE — Degraded consensus fallback undefined (1-model scenario)

File: PolyPilot/Models/ModelCapabilities.cs (WorkerReviewPrompt §2-3)
Flagged by: Opus 🟡, Sonnet 🟢

The prompt says "If a model is unavailable, proceed with the remaining models" and "require 2+ models (if only 2 ran, require both)." If 2 of 3 models fail, only 1 remains — the 2+ consensus threshold can never be met, silently filtering every finding and producing an empty report with no warning. Add explicit guidance: "If only 1 model ran, include all its findings with a low-confidence disclaimer."
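The proposed guidance could be encoded roughly like this (illustrative Python sketch with hypothetical names; the actual logic lives in the worker prompt text, not code):

```python
def filter_with_fallback(findings_by_model, min_agree=2):
    """2+ model consensus, falling back to flagged output when only one model ran."""
    ran = {m: fs for m, fs in findings_by_model.items() if fs is not None}
    if len(ran) == 1:
        # Degraded mode: a lone surviving model can never meet the 2+ bar,
        # so report all of its findings with an explicit low-confidence marker.
        (only,) = ran.values()
        return [(f, "low-confidence: single-model") for f in only]
    counts = {}
    for fs in ran.values():
        for f in set(fs):
            counts[f] = counts.get(f, 0) + 1
    return [(f, "consensus") for f, n in counts.items() if n >= min_agree]

# Two of the three model APIs failed (None) -- findings still surface:
print(filter_with_fallback({"opus": None, "sonnet": ["leak in X"], "codex": None}))
# [('leak in X', 'low-confidence: single-model')]
```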


🟢 MINOR — .squad/ deletion removes deterministic push safety script

File: .squad/push-to-pr.sh (deleted), .squad/decisions.md (deleted)
Flagged by: Sonnet 🟡, Codex 🟢

The 58-line bash script enforced push safety deterministically (branch verification, correct remote detection, SHA comparison). Replaced by prompt text in the worker instructions (steps 5-6: git fetch && git log). Prompt-based instructions are less reliable than set -e scripted checks. However, gh pr checkout handles remote tracking and the new prompt includes merge-not-rebase + push verification steps, so the gap is narrowed. Acceptable for merge but worth noting the regression in enforcement rigor.


Notable Single-Model Findings (did not meet consensus — informational only)

| Finding | Model | Severity |
|---------|-------|----------|
| StartFleetAsync has no IsRemoteMode guard -- in remote mode state.Session is always null, so /fleet gives a misleading "Session not connected" error instead of "not supported in remote mode" | Codex | 🟡 |
| Force-push prohibition (--force) not explicitly banned -- the old SharedContext had it; the new one only removes the recommendation | Sonnet | 🟡 |
| "Verify Claims Against Code" section removed from worker prompt -- weakens cross-checking of PR descriptions | Codex | 🟡 |

📋 Summary

  • CI: ⚠️ No checks configured for this branch
  • Test coverage: Good preset assertions added (5×Opus, 3 sub-agent models, merge-not-rebase, 1-worker-per-PR). StartFleetAsync tuple return type has no dedicated tests.
  • Prior reviews (2 comments): Review 1 found 7 issues (2 blockers: architecture mismatch + safety regression). Review 2 confirmed both blockers resolved, 5/7 fixed, 2/7 partially fixed. This fresh review confirms the PR is in good shape.
  • Architecture: Code, description, and tests are aligned — 5 Opus workers each dispatching 3 sub-agents (Opus/Sonnet/Codex) with 2-of-3 consensus.
  • Safety: git merge replaces git rebase everywhere. --force-with-lease removed. Push verification added.

Recommended action: ✅ Approve

The PR delivers what it promises. Two moderate findings remain (Console.WriteLine logging, degraded consensus fallback) but neither blocks merge. The Console.WriteLine is a minor production hygiene issue, and the 1-model consensus gap is an edge case (2 of 3 model APIs failing simultaneously) that can be addressed in a follow-up.

PureWeen and others added 3 commits March 29, 2026 10:41
Workers now:
- Post exactly ONE comment per PR (edit existing, never add new)
- Use adversarial consensus: when only 1 model flags an issue,
  the other models get a follow-up round to agree/disagree
- Handle degraded mode: if only 1 model ran, include findings
  with low-confidence disclaimer
- Structured re-review updates the existing comment in-place

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When a user sends a message to an orchestrator that's already
dispatching workers, the message is now queued with a '📨 New task
queued' system message visible in the orchestrator's chat. After
the current dispatch completes (workers finish + synthesis), all
queued messages are drained and dispatched sequentially.

Previously, messages blocked silently on a semaphore — the user
got no feedback and the message appeared to vanish.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two fixes for stuck orchestrator when a worker goes silent:

1. Add 15-minute OrchestratorCollectionTimeout on Task.WhenAll.
   If any worker is stuck, force-complete it and proceed to
   synthesis with partial results. Previously the orchestrator
   would block for up to 60 minutes (WorkerExecutionTimeout).

2. Don't reset WatchdogCaseBLastFileSize to 0 on each SDK event.
   The stale-file-size detection needs prevSize > 0 to work on
   its first iteration. Resetting to 0 wasted one full 180s
   timeout cycle, compounding to ~540s (9 min) total recovery
   instead of ~360s (6 min).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

PureWeen commented Mar 29, 2026

🤖 Multi-Model Code Review (Re-Review #6 — Post-Fix) — PR #451

feat: mixed-model PR Review Squad (Opus + Sonnet + Codex)
Reviewed by: Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.3-Codex · 2-of-3 adversarial consensus applied
Diff: +230/-254 across 11 files · Latest commit: fix: address consensus findings — queue cleanup, timeout bound, cancel path


Previous Findings Status (from Review #5)

| # | Finding | Status | Details |
|---|---------|--------|---------|
| 1 | 🟡 Missing _orchestratorQueuedPrompts cleanup in DeleteGroup | FIXED | TryRemove added at line 1043 (Opus ✅, Sonnet ✅, Codex ✅) |
| 2 | 🟡 Collection timeout not a true bound for pre-processing workers | FIXED | New else if branch at lines 1955-1960 resolves TCS for non-processing workers (Opus ✅, Sonnet ⚠️, Codex ⚠️ -- see below) |
| 3 | 🟢 Cancellation triggers force-complete path | FIXED | CancellationToken.None at line 1939 with explanatory comment (Opus ✅, Sonnet ⚠️, Codex ✅ -- see below) |

All 3 targeted fixes are functionally correct. Two models raised secondary concerns about the fix implementations (detailed below), but the original bugs are resolved.


New Consensus Findings (2+ models agree after adversarial challenge)

🟢 MINOR — _groupDispatchLocks not cleaned up in DeleteGroup

File: PolyPilot/Services/CopilotService.Organization.cs:~1043
Flagged by: Opus, Sonnet (Codex not asked)

The fix commit added _orchestratorQueuedPrompts.TryRemove but not _groupDispatchLocks.TryRemove. One SemaphoreSlim (~100 bytes, no OS handles) leaks per deleted group. Not a functional bug — SendToMultiAgentGroupAsync guards on group existence before touching the lock. Do NOT call Dispose() on the removed semaphore (race with concurrent GetOrAdd callers).

Fix: _groupDispatchLocks.TryRemove(groupId, out _);


🟢 MINOR — Stranded queued prompts when depth exceeds drain cap

File: PolyPilot/Services/CopilotService.Organization.cs:1724-1758
Flagged by: Opus, Sonnet

DrainOrchestratorQueueAsync caps at maxDrainPerCycle=3. If >3 prompts queue while the orchestrator is busy, the remainder sit until the next user message triggers another dispatch+drain. The "📨 will process after this cycle completes" message is misleading — no automatic trigger exists. User is notified, so not silent data loss. Practical risk is low (queueing 4+ prompts during one dispatch is unusual), and the cap correctly prevents unbounded lock holding.
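The capped drain amounts to a bounded dequeue loop. A Python sketch, assuming a synchronous dispatch callback for brevity (the real DrainOrchestratorQueueAsync is async C#):

```python
from collections import deque

MAX_DRAIN_PER_CYCLE = 3  # mirrors maxDrainPerCycle in the PR

def drain_queue(queue, dispatch):
    """Dispatch at most MAX_DRAIN_PER_CYCLE queued prompts; leave the rest queued."""
    drained = 0
    while queue and drained < MAX_DRAIN_PER_CYCLE:
        dispatch(queue.popleft())
        drained += 1
    return drained

q = deque(["p1", "p2", "p3", "p4", "p5"])
sent = []
drain_queue(q, sent.append)
print(sent, list(q))  # ['p1', 'p2', 'p3'] ['p4', 'p5']
```

As the finding notes, the leftover prompts ("p4", "p5") wait until something triggers the next dispatch+drain cycle; the cap trades latency for bounded lock holding.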


🟢 MINOR — Uncancelled 15-min Task.Delay timer

File: PolyPilot/Services/CopilotService.Organization.cs:1939
Flagged by: Opus, Sonnet (Codex disagrees — says GC handles it)

Task.Delay(OrchestratorCollectionTimeout, CancellationToken.None) holds a System.Threading.Timer for 15 minutes even when orchestration finishes early. Bounded in practice (1-5 concurrent timers in normal use), but a trivial hygiene fix:

    using var timeoutCts = new CancellationTokenSource();
    var timeout = Task.Delay(OrchestratorCollectionTimeout, timeoutCts.Token);
    if (await Task.WhenAny(allDone, timeout) == allDone) timeoutCts.Cancel();

Contested Finding (did not reach consensus — informational)

ℹ️ TrySetResult on pre-processing workers skips companion state cleanup

File: PolyPilot/Services/CopilotService.Organization.cs:1955-1960
Sonnet: 🟡 MODERATE — SendingFlag not cleared, OnSessionComplete not fired
Opus: 🟢 LOW — gap exists but triggering conditions are extremely narrow (microsecond-scale cross-thread race between two UI-thread mutations)
Codex: DISAGREE — transitions to non-processing already clear SendingFlag on normal/abort/watchdog paths

Analysis: The else if branch fires when IsProcessing=false AND ResponseCompletion.Task.IsCompleted==false. Every code path that clears IsProcessing also resolves the TCS and clears SendingFlag in the same UI-thread dispatch. The window for this state combination is vanishingly small. Adding Interlocked.Exchange(ref ws.SendingFlag, 0) is essentially free defensive cleanup but not a blocker.


📋 Summary

  • CI: ⚠️ No checks configured for this branch
  • Tests: BuiltInPresets_IncludePRReviewSquad and Scenario_CreateGroupFromPreset both ✅ pass
  • Fix commit quality: All 3 targeted fixes are correct, minimal, and well-commented
  • Remaining items: Three 🟢 MINOR findings (dispatch lock cleanup, stranded prompts, timer hygiene) — none are blockers

Recommended action: ✅ Approve

All previous findings are resolved. The three new minor findings are good hygiene improvements but don't block merge. The fix commit is clean — each fix is minimal, correctly targeted, and includes explanatory comments. The PR delivers what it promises.

1. 🔴 Force-completion now uses ForceCompleteProcessingAsync instead
   of bare TrySetResult — properly clears IsProcessing + 9 companion
   fields (INV-1 compliant). Timed-out workers no longer show stuck
   spinner.

2. 🟡 Fix double-await of allDone that re-threw caught exceptions.
   Timeout path now collects results individually with per-task
   try/catch, then passes partial results array to synthesis.

3. 🟡 Cap queue drain to 3 per cycle to prevent unbounded lock
   holding. Remaining prompts get user-visible feedback and are
   processed on the next cycle.

4. 🟡 Console.WriteLine → Debug() in fleet error path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…llation

1. 🟡 Clean up _orchestratorQueuedPrompts in DeleteGroup to prevent
   memory/stale-state leak when orchestrator groups are removed.

2. 🟡 Collection timeout now also resolves TCS for workers not yet in
   IsProcessing state (stuck in SendAsync). Previously these workers
   were skipped by ForceCompleteProcessingAsync, causing await to
   block beyond the 15min timeout.

3. 🟢 Use CancellationToken.None for the timeout Task.Delay so caller
   cancellation propagates cleanly via the worker tasks instead of
   incorrectly entering the force-complete branch.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen (Owner, Author)

🤖 Multi-Model Code Review (Review #6) — PR #451

feat: mixed-model PR Review Squad (Opus + Sonnet + Codex)
Reviewed by: Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.3-Codex


Previous Review #4 Findings — Status

All 4 findings from Review #4 are unanimously confirmed FIXED by all 3 models:

| # | Previous Finding | Status | Consensus |
|---|------------------|--------|-----------|
| 1 | 🔴 Force-completion via bare TrySetResult left IsProcessing stuck | FIXED -- now uses ForceCompleteProcessingAsync (INV-1 compliant). Bare TrySetResult only fires when IsProcessing==false (correct). | 3/3 |
| 2 | 🟡 Double-await of allDone re-throws caught OCE | FIXED -- Task.Delay uses CancellationToken.None; per-task try/catch in timeout branch; caller cancellation routes through the happy path cleanly. | 3/3 |
| 3 | 🟡 Unbounded queue drain holds dispatch lock indefinitely | FIXED -- maxDrainPerCycle = 3 cap added. | 3/3 |
| 4 | 🟡 Console.WriteLine replaces Debug.WriteLine in production | FIXED -- StartFleetAsync now uses the Debug() helper. | 3/3 |

New Findings (2/3 model consensus required)

🟡 MODERATE — GetOrchestratorSession reads Organization.Sessions in async continuation paths

File: CopilotService.Organization.cs — lines 1738, 1752 (DrainOrchestratorQueueAsync)
Models: Sonnet + Codex (2/3)

GetOrchestratorSession reads Organization.Sessions (a plain List<SessionMeta>, documented as UI-thread-only). It is called inside DrainOrchestratorQueueAsync after await SendViaOrchestratorAsync(...). In MAUI Blazor Hybrid, the UIKitSynchronizationContext marshals async continuations back to the UI thread, so this is safe in the current call chain (always initiated from Blazor event handlers). However, the method has no thread-affinity guard — if a future caller invokes SendToMultiAgentGroupAsync from a background thread, or if ConfigureAwait(false) is added anywhere in the chain, this becomes a data race on the List.

Impact: Defensive concern, not a current runtime bug. Low risk.

🟡 MODERATE — StartFleetAsync missing IsRemoteMode guard

File: CopilotService.cs — lines ~1991-1996
Models: Opus + Sonnet + Codex (3/3)

StartFleetAsync checks state.Session == null and returns "Session is not connected". In remote mode, state.Session is always null by design — the error message is misleading ("not connected" when the session IS connected via bridge). The established convention is to check IsRemoteMode first and delegate or return an appropriate message. Low runtime risk since fleet startup is desktop-only, but violates the project convention and would be a trap if fleet support is ever bridged to mobile.
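The convention the models point to is a simple guard ordering. A Python sketch with hypothetical names mirroring the C# described above (not the actual PolyPilot API):

```python
from types import SimpleNamespace

def start_fleet(state):
    # Convention: check remote mode before touching state.session, because
    # in remote mode the session is always None by design (it lives behind
    # the bridge), so "not connected" would be the wrong diagnosis.
    if state.is_remote_mode:
        return (False, "Fleet mode is not supported in remote mode.")
    if state.session is None:
        return (False, "Session is not connected.")
    return (True, None)

remote = SimpleNamespace(is_remote_mode=True, session=None)
local_disconnected = SimpleNamespace(is_remote_mode=False, session=None)
print(start_fleet(remote))              # (False, 'Fleet mode is not supported in remote mode.')
print(start_fleet(local_disconnected))  # (False, 'Session is not connected.')
```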


Single-Model Findings (excluded by consensus rule, noted for completeness)

  • Drain cap (3) still allows ~45 min worst-case lock hold (Opus only)
  • DeleteGroup queue cleanup races with in-flight dispatch — orphaned queue possible (Sonnet only)
  • Orphaned XML doc comment from field insertion (Opus only)

CI Status

⚠️ No CI checks configured for fix/mixed-model-review-squad branch.

Test Coverage

PR adds 24 new test assertions in SessionOrganizationTests.cs validating:

  • 5×Opus worker configuration
  • 3 sub-agent model diversity (Opus, Sonnet, Codex)
  • Adversarial consensus language
  • Merge-not-rebase instruction

New code paths in DrainOrchestratorQueueAsync, force-completion, and queue cleanup lack direct unit tests. These are hard to unit-test (deep integration with SendViaOrchestratorAsync) but could benefit from focused testing of the drain cap and error-handling paths.

Existing Review Comments

4 prior review comments (Reviews #1-5). All previously-flagged blockers are resolved.


Recommended Action

Approve — Both critical blockers from Review #4 are resolved. The two remaining moderate findings are defensive concerns (not active bugs). The force-completion path now correctly uses ForceCompleteProcessingAsync, the drain is bounded, cancellation is handled cleanly, and logging uses the project helper. Good to merge.

@PureWeen PureWeen merged commit bfadbd2 into main Mar 30, 2026
@PureWeen PureWeen deleted the fix/mixed-model-review-squad branch March 30, 2026 01:02