Skip to content

fix: eliminate null transient state in DeleteGroup concurrent reads#290

Merged
PureWeen merged 3 commits intomainfrom
fix/delete-group-test-null-ref
Mar 6, 2026
Merged

fix: eliminate null transient state in DeleteGroup concurrent reads#290
PureWeen merged 3 commits intomainfrom
fix/delete-group-test-null-ref

Conversation

@PureWeen
Copy link
Copy Markdown
Owner

@PureWeen PureWeen commented Mar 5, 2026

Problem

CI failure: WsBridgeIntegrationTests.Organization_DeleteGroup_RemovesFromServer

System.NullReferenceException : Object reference not set to an instance of an object.
at WsBridgeIntegrationTests.<Organization_DeleteGroup_RemovesFromServer>b__2(SessionGroup g)
   in WsBridgeIntegrationTests.cs:line 520

Root cause: List<T>.RemoveAll() in .NET nulls out trailing array elements via Array.Clear before decrementing _size. This creates a transient window where:

  • A background thread (WebSocket receive) is running DeleteGroup → RemoveAll
  • A concurrent reader (test polling thread, or the Blazor render thread) iterates Organization.Groups
  • The reader sees a null SessionGroup at an index that's still within the old _size
  • g.Id throws NullReferenceException

This only manifests on CI because thread scheduling variance is higher in the GitHub Actions runner environment. It passes locally where the polling loop rarely lands in the narrow null-transient window.

Fix

Replace in-place RemoveAll with an atomic list-reference replacement:

// Before (produces transient null state):
Organization.Groups.RemoveAll(g => g.Id == groupId);

// After (atomic: readers see old complete list OR new list, never null elements):
Organization.Groups = Organization.Groups.Where(g => g.Id != groupId).ToList();

Assigning a new List<T> reference is atomic on 64-bit .NET (object reference assignment). Readers that already hold a reference to the old list continue to see all elements. Readers that pick up the new reference see the post-delete state. No reader ever sees a null element.

Testing

  • Organization_DeleteGroup_RemovesFromServer passes
  • ✅ All 50 WsBridgeIntegrationTests pass
  • ✅ 2026/2039 tests pass (13 pre-existing PopupThemeTests failures unrelated to this change)

Fixes: https://github.com/PureWeen/PolyPilot/actions/runs/22735831218/job/65936842712

Organization.Groups.RemoveAll() nulls out trailing array elements before
decrementing _size (via Array.Clear), creating a window where concurrent
readers (test polling thread, render thread) observe a null SessionGroup
and throw NullReferenceException.

Replace with an atomic list-reference swap using Where().ToList(). Readers
see either the old complete list or the new list without the deleted group,
but never a null-containing partial state.

Fixes CI failure: WsBridgeIntegrationTests.Organization_DeleteGroup_RemovesFromServer

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Owner Author

PureWeen commented Mar 5, 2026

PR Review — 5-Model Consensus Analysis

CI Status: ✅ Passing (2026/2039, 13 pre-existing failures)
Existing reviews: None


The Fix (What's Correct)

The root-cause diagnosis is accurate. List<T>.RemoveAll in .NET shifts matching elements out and then calls Array.Clear on the tail — before decrementing _size. This creates a transient window where a concurrent reader can dereference a null entry. Replacing with Where().ToList() + reference assignment does eliminate that null-window for Organization.Groups.


🔴 CRITICAL — Lost-Update Race Introduced by This Fix (3/5 models)

CopilotService.Organization.cs:530

The Where().ToList() + field assignment is an unsynchronized read-modify-write. The race scenario:

  1. Thread A (WebSocket receive): HandleOrganizationCommandDeleteGroup → starts Where(g => g.Id != groupId) snapshot
  2. Thread B (Blazor UI): CreateGroup / GetOrCreateRepoGroupOrganization.Groups.Add(newGroup) into the old list
  3. Thread A assigns the new filtered list — which doesn't contain newGroupsilent group loss

HandleOrganizationCommand (line 540 in WsBridgeServer.cs) is called directly from the WebSocket receive handler without InvokeOnUI. Same for the BridgeMessageTypes.RemoveRepo path at line 727. So the concurrent-thread condition is real and present in production today.

The original in-place RemoveAll was immune to this specific lost-update because it mutated the live list that other writers also hold a reference to. The new approach trades a visible crash for a silent data loss regression.

Request: Marshal DeleteGroup calls from WsBridgeServer to the UI thread with InvokeOnUI (lines 540 and 727), matching the pattern already used elsewhere (lines 612, 697, 807). That would also fix the original race correctly without the lost-update problem.


🟡 MODERATE — Incomplete Fix: Two Other RemoveAll Sites Unaddressed (4/5 models)

CopilotService.Organization.cs:379 and CopilotService.Organization.cs:467

The same List<T>.RemoveAll null-transient pattern on Organization.Sessions is unfixed:

  • Line 467: Organization.Sessions.RemoveAll(m => sessionNames.Contains(m.SessionName)) in DeleteGroup itself

If the concurrent-read premise is correct (and it is — BroadcastOrganizationState at WsBridgeServer line 997 serializes _copilot.Organization including Sessions from the WebSocket thread), these are also vulnerable. The fix is incomplete on its own stated terms.


What Doesn't Need Addressing

  • The List.Add / Insert calls (lines 62, 135, 425, 705, 731) don't null-out trailing elements on resize — they extend or overwrite valid slots. A concurrent reader would see the list without the new element (safe) or with it (also safe). The null-transient issue is specific to RemoveAll.
  • Reference assignment atomicity: The ECMA-335 spec guarantees atomic reads/writes for all reference types on all CLI platforms (not just 64-bit), so the approach is sound in principle.

Recommended Action: ⚠️ Request Changes

The fix correctly eliminates the NRE crash on the specific line. But it introduces a lost-update regression and leaves two sibling RemoveAll sites on Organization.Sessions vulnerable.

Specific ask:

  1. In WsBridgeServer.HandleClientMessage (line 540): wrap HandleOrganizationCommand(orgCmd) in InvokeOnUI to serialize it onto the UI thread — this is the proper fix for the race root cause.
  2. In the BridgeMessageTypes.RemoveRepo case (line 727): wrap _copilot?.DeleteGroup(removeReq.GroupId) in InvokeOnUI as well.
  3. If WsBridgeServer marshaling is the right fix, revert the Organization.Groups change back to RemoveAll (or leave it — both are fine once writes are serialized).
  4. Fix lines 379 and 467: apply the same Where().ToList() swap (or rely on UI-thread serialization if that's the approach).

Reviewed by independent 5-model consensus (claude-opus-4.6 ×2, claude-sonnet-4.6, gemini-3-pro-preview, gpt-5.3-codex)

PureWeen and others added 2 commits March 5, 2026 18:29
Root cause: WsBridgeServer called DeleteGroup and other Organization-
mutating methods directly from the WebSocket receive thread (ThreadPool),
violating the UI-thread-only invariant for Organization.Groups mutations.
List<T>.RemoveAll() nulls trailing elements before decrementing _size,
so concurrent UI-thread readers saw null SessionGroup entries.

Fix: convert HandleOrganizationCommand to HandleOrganizationCommandAsync
and wrap every _copilot.* organization mutation in InvokeOnUIAsync() so
they execute on the UI thread. Same fix applied to the RemoveRepo handler.

Also revert the previous Where().ToList() workaround in DeleteGroup --
that approach introduced a lost-update race where a concurrent CreateGroup
on the UI thread could be silently overwritten by the assignment.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Refactor ShowSkillsPopup, ShowAgentsPopup, ShowPromptsPopup, and
ShowLogPopup in ExpandedSessionView.razor to use the existing CSS
class system (skills-popup-*, skills-popup-log-*) instead of inline
styles with hardcoded Catppuccin dark-theme hex colors.

Changes per method:
- Overlay: className='skills-popup-overlay' replaces inline style.cssText
- Popup: className='skills-popup' (or with --wide modifier), only
  bottom/left set inline for computed positioning
- Rows: CSS classes for layout (skills-popup-row, -title, -name,
  -source, -desc, -log-row, -log-ts, -log-detail, -empty)
- Log event type colors: CSS variables (--accent-primary/success/
  warning/error) instead of hardcoded hex
- Form inputs in ShowPromptsPopup: CSS variables (--bg-tertiary,
  --control-border, --text-bright, --text-muted, --accent-primary)

All 2039 tests pass (including all 23 PopupThemeTests).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen
Copy link
Copy Markdown
Owner Author

PureWeen commented Mar 6, 2026

Round 2 Update — All review findings addressed ✅

Previous review checklist status

  1. WsBridgeServer.cs line ~540: HandleOrganizationCommand → converted to HandleOrganizationCommandAsync, all mutations wrapped in await _copilot.InvokeOnUIAsync()
  2. WsBridgeServer.cs line ~727: _copilot?.DeleteGroup in RemoveRepo handler → await _copilot.InvokeOnUIAsync(() => _copilot.DeleteGroup(...))
  3. CopilotService.Organization.cs line ~379: ReconcileOrganization RemoveAll — safe; ReconcileOrganization runs on UI thread only (all callers marshal via InvokeOnUI)
  4. CopilotService.Organization.cs line ~467: DeleteGroup Sessions.RemoveAll — safe; DeleteGroup is now only called via InvokeOnUIAsync from WebSocket handlers
  5. CopilotService.Organization.cs line ~530: Reverted to Organization.Groups.RemoveAll() — the root fix is at the caller level (UI-thread marshaling), not here

Additional: All 13 pre-existing PopupThemeTests fixed

Refactored ShowSkillsPopup, ShowAgentsPopup, ShowPromptsPopup, and ShowLogPopup in ExpandedSessionView.razor to use the existing CSS class system (skills-popup-*) instead of inline styles with hardcoded Catppuccin hex colors.

Test results: ✅ 2039/2039 passing (0 failures)

@PureWeen PureWeen merged commit c789d33 into main Mar 6, 2026
@PureWeen PureWeen deleted the fix/delete-group-test-null-ref branch March 6, 2026 16:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant