fix: comprehensive self-healing for multi-agent groups lost on restart#412
Multi-agent groups were silently lost across app restarts because the self-healing in LoadOrganization only checked Role == Orchestrator, but all session roles defaulted to Worker when the org file was stale.

Root cause: The organization.json on disk had IsMultiAgent=false on all groups and Role=Worker on all sessions (including orchestrator sessions). The existing self-healing could only restore IsMultiAgent when a session explicitly had Role=Orchestrator, creating a chicken-and-egg problem.

Fix: Three-phase self-healing in HealMultiAgentGroups():
1. Detect orchestrator sessions by name pattern (*-orchestrator*) and restore Role=Orchestrator
2. Restore IsMultiAgent on groups that have orchestrator sessions, using name matching to avoid incorrectly marking repo groups
3. Reconstruct missing multi-agent groups from scattered sessions by detecting team members via name prefix patterns

Also adds:
- Save verification in WriteOrgFile (checks file exists after write)
- Error logging to event-diagnostics.log for save failures
- 6 new tests covering all healing scenarios

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
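The save verification and failure logging mentioned in this commit could look roughly like the sketch below. It is only an illustration: the PR does not show WriteOrgFile itself, so the `OrgFileWriter` class, the `WriteAndVerify` method, its parameters, and the log line format are all hypothetical names chosen for this example.

```csharp
using System;
using System.IO;

// Hypothetical helper; the real WriteOrgFile is not shown in this PR.
public static class OrgFileWriter
{
    // Writes json to path, then confirms the file is actually on disk.
    // Returns true on success. On failure, appends one line to
    // diagnosticsLogPath (assumed to be event-diagnostics.log) and returns false.
    public static bool WriteAndVerify(string path, string json, string diagnosticsLogPath)
    {
        try
        {
            File.WriteAllText(path, json);
            if (!File.Exists(path))
                throw new IOException("File missing after write: " + path);
            return true;
        }
        catch (Exception ex)
        {
            try
            {
                // Best-effort logging: a logging failure must not mask the save failure.
                File.AppendAllText(
                    diagnosticsLogPath,
                    DateTime.UtcNow.ToString("O") + " org save failed: " + ex.Message + Environment.NewLine);
            }
            catch
            {
                // Intentionally ignored; the caller still sees 'false'.
            }
            return false;
        }
    }
}
```

Returning a boolean instead of rethrowing keeps a failed save from crashing the caller while still leaving a trace in the diagnostics log; whether the actual implementation makes the same choice is not stated in the PR.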
PR Review: fix: comprehensive self-healing for multi-agent groups lost on restart (#412)

CI Status:

🔴 CRITICAL: Phase 3 creates duplicate groups when a team has multiple orchestrators

File:
Scenario: A team has both
Result: Two duplicate groups named
Fix: Track processed team prefixes to skip duplicates:

    var processedPrefixes = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
    foreach (var orchMeta in orchSessions)
    {
        // ...
        if (!processedPrefixes.Add(teamPrefix))
            continue; // Already reconstructed a group for this team
        // ... rest of reconstruction
    }

🟡 MODERATE: Phase 2
Addendum (Round 1 Review, 5th model results)

After posting the review, the 5th model (claude-sonnet-4.6) completed. It confirmed all 4 findings above and added one additional consensus issue (2/5 models):

🟡 MODERATE: Phase 3 worker query has no GroupId filter (cross-group session theft)

File:
The worker filter matches by session name prefix across all groups, not just the orchestrator's group. Any session in any group whose name happens to match
Fix: add

Full consensus summary (2+ models):
…ross-group theft

Fixes from PR #412 review:

CRITICAL: Phase 3 duplicate groups with multiple orchestrators
- Add processedPrefixes guard so TeamA-orchestrator + TeamA-orchestrator-1 creates exactly ONE group, with both orchestrators moved into it.

MODERATE: Phase 2 orchInGroup[0] arbitrary selection
- Replace [0] with Any() check so all orchestrators in a group are considered for team prefix matching.

MODERATE: Phase 3 cross-group worker theft
- Filter worker query to only non-multi-agent groups, preventing workers from being stolen from already-correct teams.

MINOR: Phase 1 false positives on coincidental names
- Only promote Role to Orchestrator if matching worker sessions exist. A session named 'deploy-orchestrator' with no workers stays Worker.

MINOR: Remove dead _lastOrgSaveTime field.

New test: MultipleOrchestrators_NoDuplicateGroups
Updated test: false-positive coverage for names ending in -orchestrator

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
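The Phase 1 false-positive rule (promote a name-matched session only when matching workers exist) can be sketched as follows. This is a minimal illustration, not the PR's code: `SessionMeta`, `SessionRole`, `Phase1Sketch`, and both methods are hypothetical, and the worker naming convention `<prefix>-worker...` is an assumption, since the PR only says workers are detected "via name prefix patterns".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical minimal model of a session; the real type is not shown in this PR.
public enum SessionRole { Worker, Orchestrator }

public sealed class SessionMeta
{
    public SessionMeta(string name) { Name = name; }
    public string Name { get; }
    public SessionRole Role { get; set; } = SessionRole.Worker;
}

public static class Phase1Sketch
{
    private const string Marker = "-orchestrator";

    // Extracts "TeamA" from "TeamA-orchestrator" or "TeamA-orchestrator-1".
    // Returns false when the marker is absent or the name starts with it.
    public static bool TryGetTeamPrefix(string sessionName, out string prefix)
    {
        int i = sessionName.IndexOf(Marker, StringComparison.OrdinalIgnoreCase);
        prefix = i > 0 ? sessionName.Substring(0, i) : "";
        return i > 0;
    }

    // Promotes name-matched sessions to Orchestrator, but only when at least one
    // worker with the same team prefix exists. Returns the number of promotions.
    public static int PromoteOrchestrators(IReadOnlyList<SessionMeta> sessions)
    {
        int promoted = 0;
        foreach (var candidate in sessions)
        {
            if (candidate.Role == SessionRole.Orchestrator)
                continue;
            if (!TryGetTeamPrefix(candidate.Name, out var prefix))
                continue;

            // ASSUMPTION: workers are named "<prefix>-worker...".
            bool hasWorkers = sessions.Any(s =>
                s.Name.StartsWith(prefix + "-worker", StringComparison.OrdinalIgnoreCase));

            if (hasWorkers)
            {
                candidate.Role = SessionRole.Orchestrator;
                promoted++;
            }
        }
        return promoted;
    }
}
```

With sessions named TeamA-orchestrator, TeamA-orchestrator-1, TeamA-worker-1, and deploy-orchestrator, the two TeamA orchestrators are promoted and deploy-orchestrator stays a Worker, which is the behavior the commit message describes.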
Code review caught that the healing code reads/writes Organization.Sessions and Organization.Groups without holding _organizationLock. While this runs during LoadOrganization (before background timers), the FlushSaveOrganization call at the end could race with SaveOrganizationCore's lock-guarded snapshot.

C# Monitor (lock) is reentrant, so the nested AddGroup() call is safe.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
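The reentrancy claim can be illustrated with a small sketch. `_organizationLock` and `AddGroup` are names taken from the commit message; the `OrganizationStore` class, the `Heal` method, and the string-based group list are hypothetical stand-ins for the real types.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the service that owns the organization state.
public sealed class OrganizationStore
{
    private readonly object _organizationLock = new object();
    private readonly List<string> _groups = new List<string>();

    public void AddGroup(string name)
    {
        // When called from Heal(), this thread already owns the lock.
        // Monitor allows the owning thread to re-enter, so this does not deadlock.
        lock (_organizationLock)
        {
            _groups.Add(name);
        }
    }

    // Adds every group name that is not present yet and returns the group count.
    public int Heal(IEnumerable<string> missingGroups)
    {
        lock (_organizationLock)
        {
            foreach (var name in missingGroups)
            {
                if (!_groups.Contains(name))
                    AddGroup(name); // nested acquisition of the same lock
            }
            return _groups.Count;
        }
    }
}
```

Holding the lock for the whole healing pass means a concurrent save sees either the state before healing or the state after it, never a half-healed snapshot.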
🔄 Re-Review Round 2: PR #412

Previous Findings Status

All 5 original findings confirmed fixed. 7/7 tests pass (6 original + 1 new)

New Issue (4/5 models consensus)

🟡 MODERATE

File:
When
Minimal fix: Move the

    // Before the existing-group lookup (replace current guard):
    if (processedPrefixes.Contains(teamPrefix))
    {
        var existingGroup = Organization.Groups.FirstOrDefault(g =>
            string.Equals(g.Name, teamPrefix, StringComparison.OrdinalIgnoreCase) && g.IsMultiAgent);
        if (existingGroup != null)
            orchMeta.GroupId = existingGroup.Id;
        continue;
    }
    // ... teamWorkers check ...
    if (teamWorkers.Count == 0)
        continue;
    // ... AddGroup, move sessions ...
    processedPrefixes.Add(teamPrefix); // only claim after group is actually created

This also fixes a secondary case-sensitivity issue (use

CI Status

Verdict
…x poisoning

If the first orchestrator for a team has no eligible workers (e.g., all in already-healed multi-agent groups), the prefix was claimed but no group was created. A second orchestrator for the same team would then find no existing group to join and be stranded.

Fix: claim prefix only AFTER verifying workers exist and creating the group. Also fixes case-sensitivity in the existing-group lookup.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Filter otherOrchs to exclude sessions already in multi-agent groups
- Reuse existing multi-agent group if available instead of creating duplicate
- Add Co-authored-by trailer as per guidelines

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🔄 Re-Review Round 3: PR #412

Previous Findings Status

New Findings (Round 3): Consensus 2+/5

N2 🟡 MODERATE: Phase 3 creates duplicate MA group when Phase 2 already healed one

If Phase 2 healed a group TeamA to
Consensus: Sonnet + Gemini (2/5 explicit; Opus-1 flagged the related

N3 🟡 MODERATE

Consensus: Opus-1 + Gemini (2/5)

Fix Applied (commit
Problem
Multi-agent groups were silently lost across app restarts. The organization.json on disk had:
- IsMultiAgent=false on all groups
- Role=Worker on all sessions (even orchestrator sessions)

Affected teams: PR Review Squad, MAUI - PR Reviewer, FixWeirdInfoPList, PP- Organize, Evaluate Ortinau Skills, Implement & Challenge (6 teams total, ~25 sessions).

Root Cause

The self-healing in LoadOrganization only checked Role == Orchestrator, but all sessions defaulted to Worker when the org file was stale. Chicken-and-egg: can't detect orchestrator → can't heal group → can't protect sessions → sessions scatter.

Fix

Three-phase self-healing in HealMultiAgentGroups():
1. Detect orchestrator sessions by name pattern (*-orchestrator*) and restore Role=Orchestrator
2. Restore IsMultiAgent on groups that have orchestrator sessions (with name matching to avoid incorrectly marking repo groups)
3. Reconstruct missing multi-agent groups from scattered sessions by detecting team members via name prefix patterns

Also adds save verification and error logging to event-diagnostics.log.

Tests
6 new tests covering:
All 55 organization tests pass.