fix(graph): advance the release watcher's dispatch high-water only on the Update COMMIT path#194
Conversation
… the Update COMMIT path (#185) InstallReleaseRequestWatcher advanced its process-local dispatchHighWater EAGERLY in the Subscribe callback — before the stream Update that stamps LastReleaseRequestHandledAt committed. The Update's bail paths (trigger already handled; status already Pending/Compiling) return `curr` WITHOUT stamping, so a trigger whose Update bailed was left with high-water >= trigger while the on-node stamp stayed below it: the post-settle re-emission then failed the `req > dispatchHighWater` gate and the trigger was LOST for the life of the process. The in-code claim that "an in-flight compile carries this request to a terminal status" did not hold — the settled re-emission was the intended recovery, and the eager advance gated it off. Deterministic repro (ReleaseRequestWatcherHighWaterTest): park the owner's serialized MeshNode write queue, enqueue two release triggers T1 < T2 while parked (so both settled-status emissions pass the Where before the watcher's first Update applies), release. U1 commits Pending + stamps T1; U2 bails on the status guard; after the compile settles, the node sits at req=T2 > handled=T1 forever — red on main, green with this fix. Fix (re-deriving abandoned #124's refinement against post-#173 semantics): advance the high-water ONLY in the Update lambda's commit branch — the one path that stamps LastReleaseRequestHandledAt — so the in-memory mark never runs ahead of the on-node stamp and a bailed trigger re-fires on the next settled emission. The mark is now Interlocked over UTC ticks (MonotonicHighWaterMark) because the commit-time advance runs on the owner's serialized write path while the Where reads it on the reduced-stream emission path. Flap-back correctness is unchanged: the carried-trigger + monotonic on-node stamp (and the data layer's DropStaleMonotonicTriggers) handle it; the eager advance only bought suppression of redundant no-op Updates in the dispatch-to-commit window, which now simply bail on the handled/status guards. Regression-green: CodeEditRecompileTest 5/5, NodeTypeReleaseGateTest 4/4, NodeTypeReleaseTest (Monolith 1/1, Graph 31/31 incl. helpers), repro 1/1. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Fixes a correctness bug in the Graph release-request watcher where the process-local “dispatch high-water” could advance ahead of the node’s committed LastReleaseRequestHandledAt stamp, allowing a newer trigger to be permanently gated off after an Update bail. Adds a deterministic integration test that reproduces the lost-trigger interleaving and verifies the fix.
Changes:
- Advance the watcher’s high-water mark only on the Update lambda’s commit branch (in lockstep with stamping
LastReleaseRequestHandledAt). - Introduce a small
MonotonicHighWaterMarkhelper (Interlocked-backed) to keep cross-context reads/writes tear-free. - Add
ReleaseRequestWatcherHighWaterTestas a deterministic repro and regression test for issue #185 residual 1.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| test/MeshWeaver.Hosting.Monolith.Test/ReleaseRequestWatcherHighWaterTest.cs | New deterministic repro/regression test for the watcher high-water lost-trigger interleaving. |
| src/MeshWeaver.Graph/Configuration/NodeTypeCompilationHelpers.cs | Fix watcher gating by moving high-water advancement to the Update commit path + add Interlocked high-water helper. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Test Results (shard 2) 15 files ±0 15 suites ±0 7m 10s ⏱️ +4s Results for commit 3e1786f. ± Comparison against base commit 9f7b0ea. This pull request removes 39 and adds 32 tests. Note that renamed tests count towards both.♻️ This comment has been updated with latest results. |
Test Results 54 files ±0 54 suites ±0 20m 19s ⏱️ -6s Results for commit 3e1786f. ± Comparison against base commit 9f7b0ea. This pull request removes 39 and adds 33 tests. Note that renamed tests count towards both.♻️ This comment has been updated with latest results. |
…andshake in the high-water repro 1. Replace the IDisposable sync-over-async disposal (base.DisposeAsync() .GetAwaiter().GetResult()) with an override of the base's async ValueTask DisposeAsync() — xUnit v3 awaits it naturally, the mesh tears down exactly once, and the per-test compile cache dir is cleaned after the mesh is gone. 2. Close the false-PASS window in the forced interleaving: the gate write is meant to park the owner's serialized MeshNode write queue before the two release triggers are enqueued, but nothing confirmed the gate lambda had actually STARTED executing. On a busy runner the gate could run only after gate.Set() — nothing parked, the interleaving degrades to ordinary timing, and the test could silently pass over a regressed watcher. Add a `parked` ManualResetEventSlim handshake (the project's established synchronous Subscribe-side signal — see DeleteNodeBehaviorTest): set at the top of the gate lambda, asserted on the test thread (bounded wait + Should().BeTrue) before P1/P2 are posted. Re-verified end to end after hardening: the repro still FAILS against the pre-fix watcher (46s, node frozen at req=T2 handled=T1 — same trace as the original pin) and passes with the fix (3s); CodeEditRecompileTest 5/5; test project builds Release -warnaserror clean. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Addresses residual 1 of #185 (the watcher eager high-water advance) — residuals 2 (flake-repro workflow) and 3 remain open, so this does not close the issue.
Verdict: the lost trigger was REAL
Pinned first with a deterministic repro per repo rules —
ReleaseRequestWatcherHighWaterTestparks the owner's serialized write queue with a gate write and enqueues two triggers while parked, forcing the exact interleaving with no timing races. Red on unmodified main (frozen atstatus=Ok, req=T2, handled=T1— the trigger lost for the life of the process), green with the fix.The interleaving (bail path C)
Two triggers T1 < T2 enqueue back-to-back → emission E1 advances the high-water eagerly to T1 and posts U1 → E2 (still settled, U1 not yet applied) advances eagerly to T2 and posts U2 → U1 commits (Pending,
handled=T1) → U2 bails on the status guard without stamping → compile settles → the recovery re-emissionreq=T2 > handled=T1failsreq > dispatchHighWater(T2 > T2 is false) → T2 never dispatches. The in-code claim that "the in-flight compile carries this request to a terminal status" was false — the settled re-emission was the intended recovery, and the eager advance gated it off. Bail A (untyped-degrade) had the same lossy shape; bail B (triggerAt <= handled) is genuinely safe.Fix
Re-derives #124's refinement against post-#173 code: the high-water advances only in the Update lambda's commit branch, in the same breath as the
LastReleaseRequestHandledAtstamp — the in-memory mark can no longer run ahead of the on-node stamp, so a bailed trigger stays live and re-fires on the next settled emission. Since commit runs on the owner's write path while theWherereads on the emission path, the mark is a per-installMonotonicHighWaterMark(Interlocked over UTC ticks — no torn reads, no static state). Flap-back correctness unchanged (carried trigger + monotonic stamp +DropStaleMonotonicTriggers); the only cost is an occasional redundant Update in the dispatch-to-commit window, which bails on the existing guards.Tests
ReleaseRequestWatcherHighWaterTest(red on main → green): documents the guarantee.CodeEditRecompileTest5/5,NodeTypeReleaseGateTest4/4,NodeTypeReleaseTest(Monolith + Graph) andNodeTypeCompilationHelpersTest33/33.dotnet build -c Release -warnaserror: clean, including after merging current main.MeshNodeCompilationService.EmitToDiskWithRetry).🤖 Generated with Claude Code