refactor: serialise FinalizeSync via captured per-round pivot by asdacap · Pull Request #11477 · NethermindEth/nethermind

asdacap · 2026-05-04T03:52:11Z

Summary

Stacks on top of #11102. Addresses the GetPivotHeader() race flagged by svlachakis in #11102 (root cause discussed in #11457 / #11458): TreeSync.SaveNode previously called _store.FinalizeSync(GetPivotHeader()) inline on the IsRoot branch, while parallel HandleResponse workers were still writing nodes — and GetPivotHeader() is mutating, so the pivot rotation can race with the root commit and corrupt FlatDB CurrentState.

This PR pins the pivot per round in the runner and runs all post-sync work at the quiescent point after the dispatcher has fully drained.

What changed

Pivot pinned per round. TreeSync.ResetStateRootToBestSuggested now returns the BlockHeader it used. StateSyncRunner.RunStateSyncRounds captures that header at the start of each round and, when the round commits the root, uses that captured value for FinalizeSync — no second GetPivotHeader() call.
FinalizeSync out of SaveNode. The _store.FinalizeSync(pivotHeader) call is removed from TreeSync.SaveNode's IsRoot branch. The runner invokes treeSync.FinalizeSync(finalPivot) once after the outer loop exits via IsRootComplete.
VerifyPostSyncCleanUp out of IsSyncRoundFinished. The two cleanup-firing branches inside the predicate (_rootSaved == 1 and NodeExists) no longer call cleanup as a side effect. The runner calls treeSync.VerifyPostSyncCleanUp() once after the loop.
STATE SYNC FINISHED:… log moved to the runner so all post-sync reporting is in one place.
_syncStateLock removed. It existed to serialise HandleResponse (read) against ResetStateRoot / CleanupMemory (write). With the cleanup moved out and the writers only running from the runner outside dispatcher.Run, SimpleDispatcher.Run's drain (semaphore.WaitAsync(CancellationToken.None) × maxThreads) is the sole synchronisation barrier — strict happens-before between the last HandleResponse and the next reset/cleanup.
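For readers following along, the capture-once-per-round shape described above can be sketched roughly as follows. This is a minimal sketch, not the code in this PR: `IRoundSync`, `PivotHeader`, `RoundRunner`, `FakeSync`, the `maxRounds` bound and the `retryDelay` parameter are hypothetical stand-ins, and the real runner presumably loops until cancellation rather than for a fixed number of rounds.

```csharp
#nullable enable
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for a block header; not the Nethermind BlockHeader type.
public sealed record PivotHeader(long Number, string StateRoot);

// Hypothetical surface of the sync object as the runner sees it.
public interface IRoundSync
{
    PivotHeader? ResetStateRootToBestSuggested(); // returns the pivot this round will use
    Task RunDispatcherAsync();                     // assumed to return only after all workers drained
    bool IsRootComplete { get; }
    void VerifyPostSyncCleanUp();
    void FinalizeSync(PivotHeader pivot);
}

public static class RoundRunner
{
    public static async Task<PivotHeader?> RunRoundsAsync(IRoundSync sync, int maxRounds, TimeSpan retryDelay)
    {
        PivotHeader? finalPivot = null;
        for (int round = 0; round < maxRounds && finalPivot is null; round++)
        {
            // Capture the pivot once, at the start of the round.
            PivotHeader? roundPivot = sync.ResetStateRootToBestSuggested();
            if (roundPivot is null)
            {
                await Task.Delay(retryDelay); // pivot not known yet: park instead of spinning
                continue;
            }

            await sync.RunDispatcherAsync();

            // Reuse the captured value; never re-read the mutating pivot source here.
            if (sync.IsRootComplete) finalPivot = roundPivot;
        }

        if (finalPivot is not null)
        {
            // Post-sync work runs once, at the quiescent point after the loop.
            sync.VerifyPostSyncCleanUp();
            sync.FinalizeSync(finalPivot);
        }
        return finalPivot;
    }
}

// Demo double: the pivot advances on every reset, imitating a rotating pivot.
public sealed class FakeSync : IRoundSync
{
    private readonly int _completeOnRound;
    private int _round;
    public PivotHeader? Finalized { get; private set; }
    public int CleanupCalls { get; private set; }

    public FakeSync(int completeOnRound) => _completeOnRound = completeOnRound;

    public PivotHeader? ResetStateRootToBestSuggested()
    {
        _round++;
        return new PivotHeader(100 + _round, $"root-{_round}");
    }

    public Task RunDispatcherAsync() => Task.CompletedTask;
    public bool IsRootComplete => _round >= _completeOnRound;
    public void VerifyPostSyncCleanUp() => CleanupCalls++;
    public void FinalizeSync(PivotHeader pivot) => Finalized = pivot;
}
```

With `new FakeSync(completeOnRound: 2)`, the second round is the one that completes the root, so the pivot handed to `FinalizeSync` is the one captured at the start of that round, whatever the pivot source would return afterwards.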

What didn't change

Round-termination is still signalled by StateSyncFeed.PrepareRequest returning null when treeSync.IsSyncRoundFinished() is true — natural dispatcher exit, no watcher / per-round CTS.
Outer break stays on treeSync.IsRootComplete. Tightening to the strict IsRootSaved (_rootSaved == 1) breaks legitimate completion paths (empty-state syncs, snap-prepopulated root) where the IsRoot branch in SaveNode never fires. The NodeExists over-reporting concern stays in #11457 (FlatDb 1.37.1: restart mid-snap promotes incomplete state, deletes canonical chain, sync permanently stuck) / #11458 (fix(snap-sync): defer FinalizeSync to end of state sync to prevent canonical chain wipe) territory.
Other concerns from the race-analysis comment (GetPivotHeader() null-deref at TreeSync.cs:355, SyncCompleted event timing) are pre-existing and untouched here.

Test plan

Nethermind.Synchronization.Test filtered to StateSync + SnapSync: 932/940 pass, 8 skips (hash-db).
E2E sync tests (CI / PR owner).

🤖 Generated with Claude Code

claude · 2026-05-04T03:52:28Z

Claude finished @asdacap's task in 7m 58s

PR Review

Load task context and understand the PR
Load relevant rule files (coding-style, robustness)
Read changed files and git diff
Analyze correctness, edge cases, concurrency
Check performance implications
Post inline comments for specific findings
Post final review summary

Summary

The architectural move is sound. StateSyncFeed.PrepareRequest is now a pure batch producer, and round-termination authority is cleanly centralised in RunStateSyncRounds. The linked-CTS + drain pattern is the right structure for cooperative shutdown, and the IsRootComplete (vs strict IsRootSaved) break condition correctly handles the empty-state and snap-pre-populated paths.

Critical: 0 | High: 0 | Medium: 0 | Low: 4

Low findings

| # | Location | Finding |
|---|---|---|
| 1 | `StateSyncRunner.cs:111` | `when (roundCts.IsCancellationRequested)` is logically always true when reached: it covers both normal round-end (path A, we cancel it) and outer-token shutdown (path B, linked CTS). Effectively a bare `catch (OperationCanceledException) { }`. Behaviour is correct but the filter and comment are misleading. |
| 2 | `StateSyncRunner.cs:89` | If `dispatch` is already faulted before the `while (!dispatch.IsCompleted)` loop is entered, `IsSyncRoundFinished()` / `VerifyPostSyncCleanUp()` are never called and verify-trie is bypassed. Requires a non-OCE fault from `SimpleDispatcher.Run()`, which never happens in practice (all `DoDispatch` exceptions are caught), but the assumption isn't documented. |
| 3 | `StateSyncRunner.cs:91` | `VerifyPostSyncCleanUp()` is called from inside `IsSyncRoundFinished()` before `roundCts.Cancel()`, so in-flight `HandleResponse` calls can race with `CleanupMemory()`'s write lock. The write lock serialises them correctly and `_ongoingRequests.TryRemove` guards any late arrivals, but the `_pendingItems.Count != 0` / `_dependencies.Count != 0` corruption warnings can still fire as false positives. Pre-existing behaviour. |
| 4 | `StateSyncRunner.cs:84` | If `roundCts.Cancel()` fires while the dispatcher is blocked in `peerPool.Allocate` or `semaphore.WaitAsync`, the OCE exits `SimpleDispatcher.Run()` before its drain loop, so in-flight workers become fire-and-forget. They still free their peer allocations (via `DoDispatch`'s try/finally), just asynchronously. Comment "peer allocations are always freed" is true but the timing guarantee is weaker than implied. Pre-existing. |

Correctness notes

Empty-state / snap-pre-populated paths: IsSyncRoundFinished() returns true (empty-tree-hash branch) without calling VerifyPostSyncCleanUp(), and IsRootComplete also short-circuits on EmptyTreeHash. Both paths exit cleanly. ✅
Pivot-change mid-round: IsSyncRoundFinished() returns true (pivot-mismatch branch, no cleanup), IsRootComplete returns false, outer loop continues into a new round. Correct. ✅
Re-entrant rounds: SimpleDispatcher creates a fresh SemaphoreSlim per Run() call, so multiple rounds are safe. roundCts is using-scoped, so it can't leak across rounds. ✅
Drain guarantee (happy path): On normal round completion IsSyncRoundFinished() → roundCts.Cancel() → PrepareRequest's Task.Delay throws OCE → returns null → dispatcher exits via request is null → drain loop runs. Peer allocations freed synchronously. ✅

Mergeable. The low findings are either pre-existing or cosmetic (comment clarity). No blocking issues.

claude · 2026-05-04T03:59:42Z

+            // CancellationToken.None internally, so peer allocations are always freed.
+            // Swallow OCE from roundCts cancellation: that's the normal way to end a
+            // round here and must not prevent the verify-trie hand-off below.
+            try { await dispatch; }


The catch filter when (roundCts.IsCancellationRequested) is effectively always true when an OCE is raised here, because roundCts is cancelled in both exit paths before reaching this line:

Path A (round finished): roundCts.Cancel() called explicitly at line 93.

Path B (outer token cancelled): roundCts is linked to token, so IsCancellationRequested is already true.

In practice this means it's equivalent to catch (OperationCanceledException) { } — any OCE from dispatch, including one raised because token was cancelled during the drain, is silently swallowed. The intent (letting the verify-trie handoff proceed) is correct, but the comment could be clearer that this also covers the shutdown path, not just the normal-round-end path. Consider:

    // Swallow OCE regardless of which token fired: both normal round termination (roundCts
    // cancelled by us) and graceful shutdown (token cancelled) must not skip verify-trie.
    try { await dispatch; }
    catch (OperationCanceledException) { }

Low severity — behaviour is correct, just the comment/filter is misleading.

claude · 2026-05-04T03:59:46Z

+            // Round-termination authority lives here, not in the feed: poll IsSyncRoundFinished
+            // (which also fires VerifyPostSyncCleanUp / SyncCompleted) and cancel the dispatcher
+            // when it returns true. The feed is a pure batch producer.
+            while (!dispatch.IsCompleted)


Edge case: if dispatch is already completed (e.g. the dispatcher faulted with a non-OCE exception) before the polling loop is entered, while (!dispatch.IsCompleted) evaluates to false immediately. The loop body never executes, so IsSyncRoundFinished() and VerifyPostSyncCleanUp() are never called. Control then falls through to await dispatch, which re-throws the fault exception — that propagates all the way out of Run(), bypassing verify-trie.

In practice SimpleDispatcher.Run() doesn't fault (all exceptions in DoDispatch are caught), so this is unlikely, but worth a comment to document the assumption.

claude · 2026-05-04T03:59:56Z

+            // when it returns true. The feed is a pure batch producer.
+            while (!dispatch.IsCompleted)
+            {
+                if (treeSync.IsSyncRoundFinished())


IsSyncRoundFinished() has a DB read side path: the rootNodeKeyExists check acquires _stateDbLock and calls _store.NodeExists(...) (→ RocksDB). This runs every 50 ms for the lifetime of a sync round. That's fine for state sync, but worth keeping in mind if the interval is ever tightened.

More importantly, IsSyncRoundFinished() calls VerifyPostSyncCleanUp() (via the _rootSaved == 1 and rootNodeKeyExists branches) before roundCts.Cancel() is called (line 93). In-flight DoDispatch tasks can still be calling HandleResponse → treeSync.HandleResponse concurrently when CleanupMemory() acquires its write lock. The write lock serialises correctly, but it means VerifyPostSyncCleanUp's _pendingItems / _dependencies non-zero checks fire before all in-flight responses are drained, which can produce false-positive "POSSIBLE FAST SYNC CORRUPTION" log errors.

This is pre-existing behaviour (same race existed when IsSyncRoundFinished was called from inside PrepareRequest), but noting it because the new drain (await dispatch) does not change the window — cleanup is still called before the drain completes.

claude · 2026-05-04T04:00:06Z

-            await stateSyncDispatcher.Run(token);
+
+            using CancellationTokenSource roundCts = CancellationTokenSource.CreateLinkedTokenSource(token);
+            Task dispatch = stateSyncDispatcher.Run(roundCts.Token);


There is a narrow path where the drain is bypassed. If roundCts.Cancel() fires while SimpleDispatcher.Run() is blocked inside peerPool.Allocate(..., roundCts.Token) or semaphore.WaitAsync(roundCts.Token), an OperationCanceledException propagates out of Run() directly, skipping the drain loop (for (int i = 0; i < maxThreads; ...) in the dispatcher). The runner catches this OCE at line 111–112 and continues, but the orphaned Task.Run workers from that round are still executing in the background.

In practice the consequences are benign — the workers hold closure references to semaphore (keeping it alive), eventually call peerPool.Free and semaphore.Release(), and any HandleResponse they attempt will fail the _ongoingRequests.TryRemove guard and return early after CleanupMemory() has cleared the dictionary. No data corruption or resource leak. But the comment "peer allocations are always freed" is slightly optimistic — they are freed, just asynchronously by the still-running tasks, not synchronously by the drain. This same race exists in the pre-PR code.

asdacap · 2026-05-04T06:29:24Z

@claude reevaluate

claude · 2026-05-04T06:29:42Z

Claude finished @asdacap's task in 5m 52s

Re-evaluation

Start re-evaluation
Read changed files and diff
Analyze concurrency correctness
Post findings

What changed since the first review

The previous review described a roundCts-based cooperative-shutdown mechanism — a linked CancellationTokenSource that RunStateSyncRounds would cancel to signal round end, followed by a catch (OperationCanceledException) when (roundCts.IsCancellationRequested) filter. That mechanism does not exist in the current code. The actual round-termination flow is simpler:

IsSyncRoundFinished() returns true → StateSyncFeed.PrepareRequest returns null → SimpleDispatcher.Run exits via if (request is null) break → drain loop runs → dispatcher.Run() returns.
Runner checks treeSync.IsRootComplete, captures roundPivot if true, then after the outer loop calls VerifyPostSyncCleanUp() and FinalizeSync(finalPivot).

Findings 1 and 3 from the previous review were based on that roundCts model. Finding 1 (misleading catch filter) does not apply — there is no such catch in RunStateSyncRounds. Finding 3 (VerifyPostSyncCleanUp called before drain) is resolved: cleanup now happens after dispatcher.Run() returns, which means after all HandleResponse workers complete via the drain. The false-positive "POSSIBLE FAST SYNC CORRUPTION" warnings are no longer possible here.

Concurrency model — verified correct

The lock-removal invariant holds: HandleResponse only executes inside dispatcher.Run(), while ResetStateRoot and CleanupMemory only execute outside it. SimpleDispatcher.Run's drain (maxThreads × semaphore.WaitAsync(CancellationToken.None)) is a strict happens-before barrier — semaphore.Release() is in the finally of the worker lambda, which runs after DoDispatch returns (which includes HandleResponse). The five previously-locked fields are safe without the ReaderWriterLockSlim.
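The drain argument can be illustrated with a small, self-contained sketch. This is not the real SimpleDispatcher: `DrainSketch`, `RunAsync`, and the `prepareRequest` / `handleResponse` delegates are hypothetical names, and peer allocation is left out entirely. It only shows why reacquiring every permit with `CancellationToken.None` means all worker `finally` blocks have already run.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class DrainSketch
{
    // Hypothetical miniature of a dispatcher loop with a semaphore-based drain.
    public static async Task RunAsync(
        Func<int?> prepareRequest,       // returns null to signal the end of the round
        Func<int, Task> handleResponse,  // the per-request work done by a worker
        int maxThreads,
        CancellationToken token)
    {
        // Deliberately not disposed: a late worker may still call Release().
        SemaphoreSlim semaphore = new(maxThreads, maxThreads);

        while (true)
        {
            // If the token fires here, the OCE leaves the method and the drain below is skipped.
            await semaphore.WaitAsync(token);

            int? request = prepareRequest();
            if (request is null)
            {
                semaphore.Release(); // hand back the permit taken for this iteration
                break;
            }

            int value = request.Value;
            _ = Task.Run(async () =>
            {
                try { await handleResponse(value); }
                finally { semaphore.Release(); } // released only after the handler has finished
            });
        }

        // Drain: take back every permit, ignoring cancellation. Each permit can only be
        // reacquired after the worker holding it ran its finally block, so once this loop
        // completes no handler is still executing.
        for (int i = 0; i < maxThreads; i++)
        {
            await semaphore.WaitAsync(CancellationToken.None);
        }
    }
}
```

Anything the caller does after `RunAsync` returns normally (for example a reset or cleanup step) is therefore ordered after the last handler, which is the property the lock removal relies on.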

Drain bypass on cancellation (pre-existing, benign)

If token is cancelled while SimpleDispatcher.Run is blocked in semaphore.WaitAsync(token) or peerPool.Allocate(…, token), the OCE propagates out of the while loop — the drain does not run, and in-flight Task.Run workers become fire-and-forget. However, DoDispatch guards with if (token.IsCancellationRequested) return before calling HandleResponse, so those workers skip the call entirely. The runner catches the OCE in Run(), finalPivot is null, and the sync session terminates cleanly. No new round starts after cancellation, so there is no concurrent ResetStateRoot to race with.

Updated findings

Critical: 0 | High: 0 | Medium: 0 | Low: 2

| # | Location | Finding |
|---|---|---|
| 1 | `StateSyncRunner.cs:95–98` + `TreeSync.cs:60` | `IsRootComplete` is `true` for `_rootNode == Keccak.EmptyTreeHash`, so `FinalizeSync(roundPivot)` and `VerifyPostSyncCleanUp()` are now called on the empty-state path. Previously `FinalizeSync` was only invoked from `SaveNode`'s `IsRoot` branch, which never fires for empty-state sync. `VerifyPostSyncCleanUp` was only called from the `_rootSaved == 1` and `rootNodeKeyExists` branches of `IsSyncRoundFinished`, not from the `EmptyTreeHash` branch. These are new behaviours. They should be harmless in practice (empty state has no nodes to corrupt, and `_store.FinalizeSync` on an empty pivot should be a no-op), but it would be worth a brief test or an explicit comment confirming the store tolerates this. |
| 2 | `TreeSync.cs:351` | `_stateSyncPivot.GetPivotHeader()` is called without a null guard; `GetPivotHeader()` can return null, producing an NRE on `.StateRoot`. Pre-existing; not introduced here. |

Previous findings — status

| Old # | Status | Note |
|---|---|---|
| 1 (roundCts filter) | ❌ Not applicable | No `roundCts` in current code |
| 2 (dispatcher fault before loop) | ✅ Still valid (low) | `SimpleDispatcher.Run` never faults in practice; assumption undocumented |
| 3 (VerifyPostSyncCleanUp before drain) | ✅ Resolved | Cleanup now correctly runs after drain |
| 4 (drain bypass on cancellation) | ✅ Still valid (low) | Benign; `HandleResponse` skipped when cancelled |

Overall

The architectural core is sound and cleaner than the previous review characterised. The pivot-per-round capture eliminates the GetPivotHeader() mutation race, the drain is the correct synchronisation barrier, and moving cleanup after the drain resolves the pre-existing false-positive corruption warnings. No blocking issues.

LukaszRozmej · 2026-05-04T10:20:17Z

+        catch (ObjectDisposedException)
+        {
+            // DBs / stores can be torn down while sync is in progress on shutdown — same
+            // semantics as cancellation: swallow rather than logging "State sync failed".
+        }


Can this be prevented and order be preserved?

LukaszRozmej · 2026-05-04T10:22:12Z

+        // The five fields below were previously guarded by an RWLock so HandleResponse (read)
+        // could run concurrently with ResetStateRoot / CleanupMemory (write). The lock is no
+        // longer needed: SimpleDispatcher.Run drains all in-flight workers before returning,
+        // and StateSyncRunner only calls reset/cleanup outside dispatcher.Run, so the drain is
+        // the sole synchronisation barrier.


Can be more concise, don't need history.

The base branch was changed.

Round-termination signalling stays in StateSyncFeed.PrepareRequest (it returns null when treeSync.IsSyncRoundFinished() is true), so the dispatcher exits naturally per round and the runner just `await`s it. Post-sync work (VerifyPostSyncCleanUp, FinalizeSync, verify-trie) moves out of TreeSync's response-handling path (was inside SaveNode's IsRoot branch and IsSyncRoundFinished's success branches) into the runner, which calls them once after the loop with the captured per-round pivot. The captured pivot — returned by ResetStateRootToBestSuggested — is the key fix for svlachakis #11457/#11458: FinalizeSync now uses exactly the pivot the round committed against, instead of re-reading the mutating GetPivotHeader() and racing concurrent SaveNodes. Pending_items_cache_mechanism_works_across_root_changes drives TreeSync directly since the feed only returns null on round-finish/cancellation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- StateSyncRunner: drop the try/catch around FinalizeSync (original call site in TreeSync.SaveNode had no such guard).
- TreeSync: move the "STATE SYNC FINISHED:..." log into StateSyncRunner so all post-sync reporting lives in one place.
- TreeSync: keep VerifyPostSyncCleanUp at its original location next to CleanupMemory.
- TreeSync.SaveNode: drop the explanatory comment about FinalizeSync's prior call site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The RWLock previously serialised HandleResponse (read) against ResetStateRoot / CleanupMemory (write). After the round-termination move, those writers only run from StateSyncRunner outside dispatcher.Run, and SimpleDispatcher.Run drains all in-flight workers before returning, so the drain alone is the synchronisation barrier. The lock is dead weight on the response hot path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The feed didn't actually need to change — the OCE catches I added were redundant (the dispatcher / runner already swallow OCE on cancellation), and the formatting tweaks were noise. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Every caller of ResetStateRoot / ResetStateRootToBestSuggested passes SyncFeedState.Dormant; the only thing the parameter did was guard a branch that's always taken (UpdateHeaderForcefully) and a throw that's unreachable. Holdover from when StateSyncFeed was an ISyncFeed with a state machine — gone after the ISimpleSyncFeed migration. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

VerifyPostSyncCleanUp's local try/catch around the dependency-clear and CleanupMemory was a one-off swallow at a low level. Drop it and add a single ObjectDisposedException catch alongside the existing OperationCanceledException catch in StateSyncRunner.Run, so any shutdown-time DB teardown is handled in one place. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

GetPivotHeader() can return null transiently (pivot not known yet, beacon control not ready). Previously TreeSync.IsSyncRoundFinished would NRE on .StateRoot. Now it treats null pivot as "round done so the runner can park", and the runner sleeps 1s before retrying when ResetStateRootToBestSuggested also returns null — preventing a tight loop on transient pivot unavailability. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- StateSyncRunner.Run: drop the catch (ObjectDisposedException) and instead skip the DB-tune-default in the finally when token is cancelled, since the ObjectDisposedException on shutdown came from touching already-disposed DBs in that path.
- TreeSync: trim the lock-removal comment; keep the invariant, drop the history.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

asdacap requested review from LukaszRozmej, benaadams, flcl42 and marcindsobczak as code owners May 4, 2026 03:52

asdacap mentioned this pull request May 4, 2026

refactor: simplify snap+state sync dispatcher #11102

Merged

8 tasks

github-actions Bot added the refactoring label May 4, 2026

github-actions Bot added the sync label May 4, 2026

claude Bot reviewed May 4, 2026


asdacap force-pushed the move-logic-to-state-sync branch 3 times, most recently from 20cf987 to a35807d Compare May 4, 2026 05:04

asdacap changed the title ~~refactor: move state-sync round termination from feed to runner~~ refactor: serialise FinalizeSync via captured per-round pivot May 4, 2026

LukaszRozmej previously approved these changes May 4, 2026


Base automatically changed from amirul/simple-sync-dispatcher to master May 4, 2026 11:06

github-actions Bot added network snap sync labels May 4, 2026

LukaszRozmej approved these changes May 4, 2026


asdacap and others added 6 commits May 4, 2026 20:00

asdacap and others added 2 commits May 4, 2026 20:00

asdacap force-pushed the move-logic-to-state-sync branch from fa60df4 to ef57ead Compare May 4, 2026 12:03

github-actions Bot removed network snap sync labels May 4, 2026

asdacap mentioned this pull request May 4, 2026

fix(snap-sync): defer FinalizeSync to end of state sync to prevent canonical chain wipe #11480

Closed

12 tasks

svlachakis approved these changes May 4, 2026


asdacap merged commit 12fabc9 into master May 4, 2026
469 of 894 checks passed

asdacap deleted the move-logic-to-state-sync branch May 4, 2026 12:51

ak88 mentioned this pull request May 12, 2026

XDC Fix sync #11562

Merged

asdacap merged 8 commits into master from move-logic-to-state-sync
