Skip to content

fix: comprehensive self-healing for multi-agent groups lost on restart#412

Merged
PureWeen merged 5 commits intomainfrom
fix/heal-multiagent-groups
Mar 20, 2026
Merged

fix: comprehensive self-healing for multi-agent groups lost on restart#412
PureWeen merged 5 commits intomainfrom
fix/heal-multiagent-groups

Conversation

@PureWeen
Copy link
Copy Markdown
Owner

Problem

Multi-agent groups were silently lost across app restarts. The organization.json on disk had:

  • All groups with IsMultiAgent=false
  • All sessions with Role=Worker (even orchestrator sessions)
  • Multi-agent group IDs from event-diagnostics didn't exist in the file

Affected teams: PR Review Squad, MAUI - PR Reviewer, FixWeirdInfoPList, PP- Organize, Evaluate Ortinau Skills, Implement & Challenge (6 teams total, ~25 sessions).

Root Cause

The self-healing in LoadOrganization only checked Role == Orchestrator, but all sessions defaulted to Worker when the org file was stale. Chicken-and-egg: can't detect orchestrator → can't heal group → can't protect sessions → sessions scatter.

Fix

Three-phase self-healing in HealMultiAgentGroups():

  1. Phase 1: Detect orchestrator sessions by name pattern (*-orchestrator*) and restore Role=Orchestrator
  2. Phase 2: Restore IsMultiAgent on groups that have orchestrator sessions (with name matching to avoid incorrectly marking repo groups)
  3. Phase 3: Reconstruct missing multi-agent groups from scattered sessions by detecting team members via name prefix patterns

Also adds save verification and error logging to event-diagnostics.log.

Tests

6 new tests covering:

  • Orchestrator role detection by name pattern (with suffix variants)
  • Group reconstruction from scattered sessions
  • Multiple scattered teams handling
  • No false positives on regular session names
  • No-op when groups are already correct

All 55 organization tests pass.

Multi-agent groups were silently lost across app restarts because the
self-healing in LoadOrganization only checked Role == Orchestrator, but
all session roles defaulted to Worker when the org file was stale.

Root cause: The organization.json on disk had IsMultiAgent=false on all
groups and Role=Worker on all sessions (including orchestrator sessions).
The existing self-healing could only restore IsMultiAgent when a session
explicitly had Role=Orchestrator, creating a chicken-and-egg problem.

Fix: Three-phase self-healing in HealMultiAgentGroups():
1. Detect orchestrator sessions by name pattern (*-orchestrator*)
   and restore Role=Orchestrator
2. Restore IsMultiAgent on groups that have orchestrator sessions,
   using name matching to avoid incorrectly marking repo groups
3. Reconstruct missing multi-agent groups from scattered sessions
   by detecting team members via name prefix patterns

Also adds:
- Save verification in WriteOrgFile (checks file exists after write)
- Error logging to event-diagnostics.log for save failures
- 6 new tests covering all healing scenarios

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Owner Author

PR Review: fix: comprehensive self-healing for multi-agent groups lost on restart (#412)

CI Status: ⚠️ No checks reported on this branch.
Tests: 6 new HealMultiAgentGroups tests — all pass ✅ (verified locally). 55 total organization tests pass.


🔴 CRITICAL — Phase 3 creates duplicate groups when team has multiple orchestrators

File: PolyPilot/Services/CopilotService.Organization.cs ~line 248–300
Flagged by: 4/5 models

Scenario: A team has both TeamA-orchestrator and TeamA-orchestrator-1 as separate sessions (both promoted to Role=Orchestrator in Phase 1). The Phase 3 loop iterates over ALL orchestrator sessions:

  • Iteration 1 (TeamA-orchestrator): creates new group G1, moves workers into it
  • Iteration 2 (TeamA-orchestrator-1): orchMeta.GroupId still points to the OLD repo group (the snapshot orchSessions captured the old GroupId). The guard currentGroup?.IsMultiAgent == true passes against the old repo group (still not multi-agent). The teamWorkers query searches by name prefix globally — finds the same workers regardless of their new GroupId. Creates G2, re-assigns workers from G1 to G2.

Result: Two duplicate groups named TeamA, the first with only the orchestrator (stranded), the second with all workers and the second orchestrator. The team is split/corrupted on every startup.

Fix: Track processed team prefixes to skip duplicates:

var processedPrefixes = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
foreach (var orchMeta in orchSessions)
{
    // ...
        continue; // Already reconstructed a group for this team
    // ... rest of reconstruction
}

🟡 MODERATE — Phase 2 orchInGroup[0] picks arbitrarily when multiple teams share a repo group

File: PolyPilot/Services/CopilotService.Organization.cs ~line 234
Flagged by: 2/5 models

If two different teams' orchestrators land in the same repo group (both name-detected in Phase 1), only orchInGroup[0].SessionName is used for the team prefix comparison. If the list order puts TeamB first but the repo group is named TeamA, Phase 2 fails to heal the group even though TeamA's orchestrator is present. Phase 3 compensates, but the behavior is non-deterministic. Fix: check orchInGroup.Any(m => string.Equals(orchestratorPattern.Replace(m.SessionName, ""), group.Name, StringComparison.OrdinalIgnoreCase)) instead of [0].


🟢 MINOR — Phase 1 false-positives on coincidentally named sessions

File: PolyPilot/Services/CopilotService.Organization.cs ~line 197–203
Flagged by: 2/5 models

A normal user session named "notes-orchestrator" or "deploy-orchestrator" matches -[Oo]rchestrator(-\d+)?$ and gets permanently promoted to Role=Orchestrator and saved to disk. Phase 3 skips it (no matching workers), but the incorrect role persists and could affect UI rendering. The false-positive test only checks "orchestrator-tips" (starts with orchestrator) and "my-orchestrator-notes" (has it in the middle) — not "notes-orchestrator" (ends with it), which does false-positive.


🟢 MINOR — _lastOrgSaveTime dead field

File: PolyPilot/Services/CopilotService.Organization.cs ~line 327
Flagged by: 2/5 models

private DateTime _lastOrgSaveTime; is written in SaveOrganizationCore() but never read anywhere. Should be removed or used.


Test Coverage Assessment

The 6 new tests cover the primary healing scenarios well. However, there is a gap:

  • Not tested: Two orchestrators with the same team prefix (e.g., TeamA-orchestrator + TeamA-orchestrator-1) — the scenario that triggers the CRITICAL duplicate-group bug
  • Not tested: "notes-orchestrator" false positive (user session ending in -orchestrator)

Recommended Action: ⚠️ Request Changes

Fix required before merge:

  1. CRITICAL: Add a processedPrefixes guard in Phase 3 to prevent duplicate group creation for teams with multiple orchestrators
  2. MODERATE: Replace orchInGroup[0] with Any() check in Phase 2
  3. Add a test for the multiple-orchestrator scenario

@PureWeen
Copy link
Copy Markdown
Owner Author

Addendum (Round 1 Review — 5th model results)

After posting the review, the 5th model (claude-sonnet-4.6) completed. It confirmed all 4 findings above and added one additional consensus issue (2/5 models):

🟡 MODERATE — Phase 3 worker query has no GroupId filter (cross-group session theft)

File: CopilotService.Organization.cs Phase 3 teamWorkers LINQ
Flagged by: sonnet + codex (2/5 models)

The worker filter matches by session name prefix across all groups, not just the orchestrator's group. Any session in any group whose name happens to match teamPrefix + "-worker-N" gets moved into the reconstructed group. This means workers scattered across different repo groups (not the orchestrator's group) could be incorrectly stolen.

Fix: add && m.GroupId == orchMeta.GroupId to the worker filter, or at minimum gather workers from a broader set of non-multi-agent groups rather than all groups indiscriminately.


Full consensus summary (2+ models):

Severity Finding
🔴 CRITICAL Phase 3 duplicate groups — multiple orchestrators with same prefix
🟡 MODERATE Phase 3 cross-group worker theft — no GroupId filter
🟡 MODERATE Phase 2 orchInGroup[0] arbitrary selection
🟢 MINOR _lastOrgSaveTime dead field
🟢 MINOR Phase 1 false-positive role promotion

PureWeen and others added 2 commits March 20, 2026 09:44
…ross-group theft

Fixes from PR #412 review:

CRITICAL: Phase 3 duplicate groups with multiple orchestrators
- Add processedPrefixes guard so TeamA-orchestrator + TeamA-orchestrator-1
  creates exactly ONE group, with both orchestrators moved into it.

MODERATE: Phase 2 orchInGroup[0] arbitrary selection
- Replace [0] with Any() check so all orchestrators in a group are
  considered for team prefix matching.

MODERATE: Phase 3 cross-group worker theft
- Filter worker query to only non-multi-agent groups, preventing
  workers from being stolen from already-correct teams.

MINOR: Phase 1 false positives on coincidental names
- Only promote Role to Orchestrator if matching worker sessions exist.
  A session named 'deploy-orchestrator' with no workers stays Worker.

MINOR: Remove dead _lastOrgSaveTime field.

New test: MultipleOrchestrators_NoDuplicateGroups
Updated test: false-positive coverage for names ending in -orchestrator

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Code review caught that the healing code reads/writes Organization.Sessions
and Organization.Groups without holding _organizationLock. While this runs
during LoadOrganization (before background timers), the FlushSaveOrganization
call at the end could race with SaveOrganizationCore's lock-guarded snapshot.

C# Monitor (lock) is reentrant, so the nested AddGroup() call is safe.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Owner Author

🔄 Re-Review Round 2 — PR #412

Previous Findings Status

Finding Status
F1 🔴 Phase 3 duplicate groups (processedPrefixes guard) ✅ FIXED
F2 🟡 Cross-group worker theft (no GroupId filter) ✅ FIXED
F3 🟡 Phase 2 orchInGroup[0] arbitrary selection ✅ FIXED
F4 🟢 _lastOrgSaveTime dead field ✅ FIXED
F5 🟢 Phase 1 false-positive role promotion ✅ FIXED

All 5 original findings confirmed fixed. 7/7 tests pass (6 original + 1 new MultipleOrchestrators_NoDuplicateGroups).


New Issue (4/5 models consensus)

🟡 MODERATE — processedPrefixes claims prefix before verifying group can be created

File: PolyPilot/Services/CopilotService.Organization.cs ~line 273

processedPrefixes.Add(teamPrefix) succeeds and the prefix is recorded before the teamWorkers.Count == 0 check. If the first orchestrator for TeamA has no eligible workers (e.g., they're all in a group already marked IsMultiAgent=true by Phase 2, which the nonMultiAgentGroupIds filter correctly excludes), it skips group creation with continue — but the prefix is now permanently claimed.

When TeamA-orchestrator-1 is iterated next, processedPrefixes.Add("TeamA") returns false → enters the redirect branch → Organization.Groups.FirstOrDefault(g => g.Name == "TeamA" && g.IsMultiAgent) finds nothing (the group was never created) → existingGroup is null → orchestrator is left unhealed in its old repo group.

Minimal fix: Move the processedPrefixes.Add() call to AFTER group creation succeeds:

// Before the existing-group lookup (replace current guard):
if (processedPrefixes.Contains(teamPrefix))
{
    var existingGroup = Organization.Groups.FirstOrDefault(g =>
        string.Equals(g.Name, teamPrefix, StringComparison.OrdinalIgnoreCase) && g.IsMultiAgent);
    if (existingGroup != null)
        orchMeta.GroupId = existingGroup.Id;
    continue;
}
// ... teamWorkers check ...
if (teamWorkers.Count == 0)
    continue;
// ... AddGroup, move sessions ...
processedPrefixes.Add(teamPrefix); // only claim after group is actually created

This also fixes a secondary case-sensitivity issue (use StringComparison.OrdinalIgnoreCase in the lookup — current code uses g.Name == teamPrefix which is ordinal).


CI Status

⚠️ No checks reported on this branch.


Verdict

⚠️ Request changes — one new moderate issue introduced by the F1 fix. Small targeted fix needed: move processedPrefixes.Add() to after group creation, not before the workers check.

PureWeen and others added 2 commits March 20, 2026 10:17
…x poisoning

If the first orchestrator for a team has no eligible workers (e.g., all
in already-healed multi-agent groups), the prefix was claimed but no
group was created. A second orchestrator for the same team would then
find no existing group to join and be stranded.

Fix: claim prefix only AFTER verifying workers exist and creating the
group. Also fixes case-sensitivity in the existing-group lookup.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Filter otherOrchs to exclude sessions already in multi-agent groups
- Reuse existing multi-agent group if available instead of creating duplicate
- Add Co-authored-by trailer as per guidelines

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Owner Author

🔄 Re-Review Round 3 — PR #412

Previous Findings Status

  • F1 (🔴 Duplicate groups): FIXED ✅ (Round 2)
  • F2 (🟡 Cross-group worker theft): FIXED ✅ (Round 2)
  • F3 (🟡 Arbitrary orchInGroup[0]): FIXED ✅ (Round 2)
  • F4 (🟢 Dead _lastOrgSaveTime): FIXED ✅ (Round 2)
  • F5 (🟢 False-positive role promotion): FIXED ✅ (Round 2)
  • N1 (🟡 processedPrefixes premature claim): FIXED ✅processedPrefixes.Add(teamPrefix) now at line 314, after the teamWorkers.Count == 0 guard at line 309. Two-orch scenario (first has no eligible workers, second does) is now handled correctly: the first orchestrator skips without claiming the prefix, leaving it available for the second to create the group.

New Findings (Round 3) — Consensus 2+/5

N2 🟡 MODERATE — Phase 3 creates duplicate MA group when Phase 2 already healed one

If Phase 2 healed a group TeamA to IsMultiAgent=true, and Phase 3 encounters a scattered TeamA-orchestrator, it would call AddGroup(new SessionGroup { Name = TeamA, IsMultiAgent = true }) without checking whether TeamA already exists. Result: two MA groups named TeamA in the UI, one populated and one empty.

Consensus: Sonnet + Gemini (2/5 explicit; Opus-1 flagged the related otherOrchs filter gap)

N3 🟡 MODERATE — otherOrchs missing nonMultiAgentGroupIds filter

otherOrchs had no nonMultiAgentGroupIds.Contains(m.GroupId) guard (unlike teamWorkers). A Phase-2-correctly-placed orchestrator in TeamA could be grabbed by otherOrchs and moved into a newly-reconstructed group, dismantling the valid group.

Consensus: Opus-1 + Gemini (2/5)


Fix Applied (commit 3b95f524)

Both N2 and N3 fixed in a single commit already pushed to this branch:

  1. otherOrchs filter — added && nonMultiAgentGroupIds.Contains(m.GroupId) to exclude orchestrators already in valid MA groups.

  2. Reuse existing MA group — added targetGroup lookup before AddGroup. If Phase 2 already healed a group with the matching name, scattered sessions merge into it instead of creating a duplicate.

  • Assertion at class-init.c:4675, condition `parent' not met

=================================================================
Native Crash Reporting

Got a abrt while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries
used by your application.

=================================================================
Native stacktrace:

0x1008e5679 - /Library/Frameworks/Mono.framework/Versions/6.12.0/bin/mono-sgen64 : mono_dump_native_crash_info
0x10087d40e - /Library/Frameworks/Mono.framework/Versions/6.12.0/bin/mono-sgen64 : mono_handle_native_crash
0x1008e4c3f - /Library/Frameworks/Mono.framework/Versions/6.12.0/bin/mono-sgen64 : sigabrt_signal_handler
0x7ff8191d037d - /usr/lib/system/libsystem_platform.dylib : _sigtramp
0x3057d05d0 - Unknown
0x7ff8190d63a6 - /usr/lib/system/libsystem_c.dylib : abort
0x100ad1ec7 - /Library/Frameworks/Mono.framework/Versions/6.12.0/bin/mono-sgen64 : monoeg_assert_abort
0x100ab2d2f - /Library/Frameworks/Mono.framework/Versions/6.12.0/bin/mono-sgen64 : mono_log_write_logfile
0x100ad235e - /Library/Frameworks/Mono.framework/Versions/6.12.0/bin/mono-sgen64 : monoeg_g_logv_nofree
0x100ad24df - /Library/Frameworks/Mono.framework/Versions/6.12.0/bin/mono-sgen64 : monoeg_assertion_message
0x100ad251a - /Library/Frameworks/Mono.framework/Versions/6.12.0/bin/mono-sgen64 : mono_assertion_message
0x10096bf3d - /Library/Frameworks/Mono.framework/Versions/6.12.0/bin/mono-sgen64 : mono_class_setup_parent
0x10096b497 - /Library/Frameworks/Mono.framework/Versions/6.12.0/bin/mono-sgen64 : mono_class_create_from_typedef
0x100963aac - /Library/Frameworks/Mono.framework/Versions/6.12.0/bin/mono-sgen64 : mono_class_get_checked
0x10096474b - /Library/Frameworks/Mono.framework/Versions/6.12.0/bin/mono-sgen64 : mono_class_from_name_checked_aux
0x1009647cc - /Library/Frameworks/Mono.framework/Versions/6.12.0/bin/mono-sgen64 : mono_class_from_name_checked_aux
0x10095f440 - /Library/Frameworks/Mono.framework/Versions/6.12.0/bin/mono-sgen64 : mono_class_load_from_name
0x100953f17 - /Library/Frameworks/Mono.framework/Versions/6.12.0/bin/mono-sgen64 : mono_init_internal
0x1007d7bc3 - /Library/Frameworks/Mono.framework/Versions/6.12.0/bin/mono-sgen64 : mini_init
0x10083d072 - /Library/Frameworks/Mono.framework/Versions/6.12.0/bin/mono-sgen64 : mono_main
0x1007c9b78 - /Library/Frameworks/Mono.framework/Versions/6.12.0/bin/mono-sgen64 : main
0x200ef5781 - Unknown

=================================================================
Telemetry Dumper:

=================================================================
Basic Fault Address Reporting

Memory around native instruction pointer (0x7ff81918982e):0x7ff81918981e ff ff c3 00 00 00 b8 48 01 00 02 49 89 ca 0f 05 .......H...I....
0x7ff81918982e 73 08 48 89 c7 e9 70 9a ff ff c3 00 00 00 b8 53 s.H...p........S
0x7ff81918983e 00 00 02 49 89 ca 0f 05 73 08 48 89 c7 e9 58 9a ...I....s.H...X.
0x7ff81918984e ff ff c3 00 00 00 b8 83 01 00 02 49 89 ca 0f 05 ...........I....


Tests

dotnet test --filter SessionOrganizationTests19/19 pass ✅ (includes MultipleOrchestrators_NoDuplicateGroups)

CI Status

⚠️ No CI checks configured on this branch.

Verdict

Approve — N2 and N3 have been identified and fixed (commit 3b95f524). All Round 1–3 findings resolved. Tests pass. Please merge when ready.

@PureWeen PureWeen merged commit d515696 into main Mar 20, 2026
@PureWeen PureWeen deleted the fix/heal-multiagent-groups branch March 20, 2026 15:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant