
refactor: serialise FinalizeSync via captured per-round pivot #11477

Merged: asdacap merged 8 commits into master from move-logic-to-state-sync on May 4, 2026


@asdacap asdacap commented May 4, 2026

Summary

Stacks on top of #11102. Addresses the GetPivotHeader() race flagged by svlachakis in #11102 (root cause discussed in #11457 / #11458): TreeSync.SaveNode previously called _store.FinalizeSync(GetPivotHeader()) inline on the IsRoot branch, while parallel HandleResponse workers were still writing nodes — and GetPivotHeader() is mutating, so the pivot rotation can race with the root commit and corrupt FlatDB CurrentState.

This PR pins the pivot per round in the runner and runs all post-sync work at the quiescent point after the dispatcher has fully drained.

What changed

  • Pivot pinned per round. TreeSync.ResetStateRootToBestSuggested now returns the BlockHeader it used. StateSyncRunner.RunStateSyncRounds captures that header at the start of each round and, when the round commits the root, uses that captured value for FinalizeSync — no second GetPivotHeader() call.
  • FinalizeSync out of SaveNode. The _store.FinalizeSync(pivotHeader) call is removed from TreeSync.SaveNode's IsRoot branch. The runner invokes treeSync.FinalizeSync(finalPivot) once after the outer loop exits via IsRootComplete.
  • VerifyPostSyncCleanUp out of IsSyncRoundFinished. The two cleanup-firing branches inside the predicate (_rootSaved == 1 and NodeExists) no longer call cleanup as a side effect. The runner calls treeSync.VerifyPostSyncCleanUp() once after the loop.
  • STATE SYNC FINISHED:… log moved to the runner so all post-sync reporting is in one place.
  • _syncStateLock removed. It existed to serialise HandleResponse (read) against ResetStateRoot / CleanupMemory (write). With the cleanup moved out and the writers only running from the runner outside dispatcher.Run, SimpleDispatcher.Run's drain (semaphore.WaitAsync(CancellationToken.None) × maxThreads) is the sole synchronisation barrier — strict happens-before between the last HandleResponse and the next reset/cleanup.
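The per-round capture described in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's code: `Header`, `PivotSource`, `PinnedPivotSketch`, and the `runRound` / `finalize` delegates are hypothetical stand-ins for `BlockHeader`, the pivot provider, `StateSyncRunner.RunStateSyncRounds`, one dispatcher round, and `treeSync.FinalizeSync`.

```csharp
using System;

// Hypothetical stand-ins for illustration only; none of these are Nethermind types.
public sealed record Header(long Number);

public sealed class PivotSource
{
    private long _next;

    // Mutating read: every call may rotate to a newer pivot.
    public Header GetPivotHeader() => new Header(++_next);
}

public static class PinnedPivotSketch
{
    // Runs rounds until one reports the root as complete, then finalises with
    // the header captured at the start of that same round.
    public static Header RunRounds(
        PivotSource pivots,
        Func<Header, bool> runRound,   // assumed to return only after all workers have drained
        Action<Header> finalize)
    {
        Header? finalPivot = null;
        while (finalPivot is null)
        {
            Header roundPivot = pivots.GetPivotHeader(); // read exactly once per round
            bool rootComplete = runRound(roundPivot);
            if (rootComplete) finalPivot = roundPivot;
        }

        finalize(finalPivot); // reuses the captured header; no second GetPivotHeader() call
        return finalPivot;
    }
}
```

Because the header is read once and carried through the round, a later pivot rotation cannot change which header the finalisation step sees.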

What didn't change

Test plan

  • Nethermind.Synchronization.Test filtered to StateSync + SnapSync: 932/940 pass, 8 skips (hash-db).
  • E2E sync tests (CI / PR owner).

🤖 Generated with Claude Code


claude Bot commented May 4, 2026

Claude finished @asdacap's task in 7m 58s.


PR Review

  • Load task context and understand the PR
  • Load relevant rule files (coding-style, robustness)
  • Read changed files and git diff
  • Analyze correctness, edge cases, concurrency
  • Check performance implications
  • Post inline comments for specific findings
  • Post final review summary

Summary

The architectural move is sound. StateSyncFeed.PrepareRequest is now a pure batch producer, and round-termination authority is cleanly centralised in RunStateSyncRounds. The linked-CTS + drain pattern is the right structure for cooperative shutdown, and the IsRootComplete (vs strict IsRootSaved) break condition correctly handles the empty-state and snap-pre-populated paths.

Critical: 0 | High: 0 | Medium: 0 | Low: 4


Low findings

| # | Location | Finding |
|---|----------|---------|
| 1 | StateSyncRunner.cs:111 | when (roundCts.IsCancellationRequested) is logically always true when reached: it covers both normal round-end (path A, we cancel it) and outer-token shutdown (path B, linked CTS). Effectively a bare catch (OperationCanceledException) { }. Behaviour is correct but the filter and comment are misleading. |
| 2 | StateSyncRunner.cs:89 | If dispatch is already faulted before the while (!dispatch.IsCompleted) loop is entered, IsSyncRoundFinished() / VerifyPostSyncCleanUp() are never called and verify-trie is bypassed. Requires a non-OCE fault from SimpleDispatcher.Run(), which never happens in practice (all DoDispatch exceptions are caught), but the assumption isn't documented. |
| 3 | StateSyncRunner.cs:91 | VerifyPostSyncCleanUp() is called from inside IsSyncRoundFinished() before roundCts.Cancel(), so in-flight HandleResponse calls can race with CleanupMemory()'s write lock. The write lock serialises them correctly and _ongoingRequests.TryRemove guards any late arrivals, but the _pendingItems.Count != 0 / _dependencies.Count != 0 corruption warnings can still fire as false positives. Pre-existing behaviour. |
| 4 | StateSyncRunner.cs:84 | If roundCts.Cancel() fires while the dispatcher is blocked in peerPool.Allocate or semaphore.WaitAsync, the OCE exits SimpleDispatcher.Run() before its drain loop, so in-flight workers become fire-and-forget. They still free their peer allocations (via DoDispatch's try/finally), just asynchronously. Comment "peer allocations are always freed" is true but the timing guarantee is weaker than implied. Pre-existing. |

Correctness notes

  • Empty-state / snap-pre-populated paths: IsSyncRoundFinished() returns true (empty-tree-hash branch) without calling VerifyPostSyncCleanUp(), and IsRootComplete also short-circuits on EmptyTreeHash. Both paths exit cleanly. ✅
  • Pivot-change mid-round: IsSyncRoundFinished() returns true (pivot-mismatch branch, no cleanup), IsRootComplete returns false, outer loop continues into a new round. Correct. ✅
  • Re-entrant rounds: SimpleDispatcher creates a fresh SemaphoreSlim per Run() call, so multiple rounds are safe. roundCts is using-scoped, so it can't leak across rounds. ✅
  • Drain guarantee (happy path): On normal round completion IsSyncRoundFinished() → roundCts.Cancel() → PrepareRequest's Task.Delay throws OCE → returns null → dispatcher exits via request is null → drain loop runs. Peer allocations freed synchronously. ✅

Mergeable. The low findings are either pre-existing or cosmetic (comment clarity). No blocking issues.

@github-actions github-actions Bot added the sync label May 4, 2026
// CancellationToken.None internally, so peer allocations are always freed.
// Swallow OCE from roundCts cancellation: that's the normal way to end a
// round here and must not prevent the verify-trie hand-off below.
try { await dispatch; }

The catch filter when (roundCts.IsCancellationRequested) is effectively always true when an OCE is raised here, because roundCts is cancelled in both exit paths before reaching this line:

  • Path A (round finished): roundCts.Cancel() called explicitly at line 93.
  • Path B (outer token cancelled): roundCts is linked to token, so IsCancellationRequested is already true.

In practice this means it's equivalent to catch (OperationCanceledException) { }: any OCE from dispatch, including one raised because token was cancelled during the drain, is silently swallowed. The intent (letting the verify-trie hand-off proceed) is correct, but the comment could be clearer that this also covers the shutdown path, not just the normal-round-end path. Consider:

// Swallow OCE regardless of which token fired: both normal round termination (roundCts
// cancelled by us) and graceful shutdown (token cancelled) must not skip verify-trie.
try { await dispatch; }
catch (OperationCanceledException) { }

Low severity: behaviour is correct; only the comment and filter are misleading.

// Round-termination authority lives here, not in the feed: poll IsSyncRoundFinished
// (which also fires VerifyPostSyncCleanUp / SyncCompleted) and cancel the dispatcher
// when it returns true. The feed is a pure batch producer.
while (!dispatch.IsCompleted)

Edge case: if dispatch is already completed (e.g. the dispatcher faulted with a non-OCE exception) before the polling loop is entered, while (!dispatch.IsCompleted) evaluates to false immediately. The loop body never executes, so IsSyncRoundFinished() and VerifyPostSyncCleanUp() are never called. Control then falls through to await dispatch, which re-throws the fault exception; that exception propagates all the way out of Run(), bypassing verify-trie.

In practice SimpleDispatcher.Run() doesn't fault (all exceptions in DoDispatch are caught), so this is unlikely, but worth a comment to document the assumption.

// when it returns true. The feed is a pure batch producer.
while (!dispatch.IsCompleted)
{
if (treeSync.IsSyncRoundFinished())

IsSyncRoundFinished() has a DB read side path: the rootNodeKeyExists check acquires _stateDbLock and calls _store.NodeExists(...) (→ RocksDB). This runs every 50 ms for the lifetime of a sync round. That's fine for state sync, but worth keeping in mind if the interval is ever tightened.

More importantly, IsSyncRoundFinished() calls VerifyPostSyncCleanUp() (via the _rootSaved == 1 and rootNodeKeyExists branches) before roundCts.Cancel() is called (line 93). In-flight DoDispatch tasks can still be calling HandleResponse → treeSync.HandleResponse concurrently when CleanupMemory() acquires its write lock. The write lock serialises correctly, but it means VerifyPostSyncCleanUp's _pendingItems / _dependencies non-zero checks fire before all in-flight responses are drained, which can produce false-positive "POSSIBLE FAST SYNC CORRUPTION" log errors.

This is pre-existing behaviour (the same race existed when IsSyncRoundFinished was called from inside PrepareRequest). It is noted here because the new drain (await dispatch) does not change the window: cleanup is still called before the drain completes.

await stateSyncDispatcher.Run(token);

using CancellationTokenSource roundCts = CancellationTokenSource.CreateLinkedTokenSource(token);
Task dispatch = stateSyncDispatcher.Run(roundCts.Token);

There is a narrow path where the drain is bypassed. If roundCts.Cancel() fires while SimpleDispatcher.Run() is blocked inside peerPool.Allocate(..., roundCts.Token) or semaphore.WaitAsync(roundCts.Token), an OperationCanceledException propagates out of Run() directly, skipping the drain loop (for (int i = 0; i < maxThreads; ...) in the dispatcher). The runner catches this OCE at line 111–112 and continues, but the orphaned Task.Run workers from that round are still executing in the background.

In practice the consequences are benign: the workers hold closure references to semaphore (keeping it alive), eventually call peerPool.Free and semaphore.Release(), and any HandleResponse they attempt will fail the _ongoingRequests.TryRemove guard and return early after CleanupMemory() has cleared the dictionary. No data corruption or resource leak. But the comment "peer allocations are always freed" is slightly optimistic: they are freed, just asynchronously by the still-running tasks, not synchronously by the drain. This same race exists in the pre-PR code.

@asdacap asdacap force-pushed the move-logic-to-state-sync branch 3 times, most recently from 20cf987 to a35807d May 4, 2026 05:04
@asdacap asdacap changed the title from "refactor: move state-sync round termination from feed to runner" to "refactor: serialise FinalizeSync via captured per-round pivot" May 4, 2026

asdacap commented May 4, 2026

@claude reevaluate


claude Bot commented May 4, 2026

Claude finished @asdacap's task in 5m 52s.


Re-evaluation

  • Start re-evaluation
  • Read changed files and diff
  • Analyze concurrency correctness
  • Post findings

What changed since the first review

The previous review described a roundCts-based cooperative-shutdown mechanism: a linked CancellationTokenSource that RunStateSyncRounds would cancel to signal round end, followed by a catch (OperationCanceledException) when (roundCts.IsCancellationRequested) filter. That mechanism does not exist in the current code. The actual round-termination flow is simpler:

  1. IsSyncRoundFinished() returns true → StateSyncFeed.PrepareRequest returns null → SimpleDispatcher.Run exits via if (request is null) break → drain loop runs → dispatcher.Run() returns.
  2. Runner checks treeSync.IsRootComplete, captures roundPivot if true, then after the outer loop calls VerifyPostSyncCleanUp() and FinalizeSync(finalPivot).

Findings 1 and 3 from the previous review were based on that roundCts model. Finding 1 (misleading catch filter) does not apply, because there is no such catch in RunStateSyncRounds. Finding 3 (VerifyPostSyncCleanUp called before drain) is resolved: cleanup now happens after dispatcher.Run() returns, which means after all HandleResponse workers complete via the drain. The false-positive "POSSIBLE FAST SYNC CORRUPTION" warnings are no longer possible here.


Concurrency model: verified correct

The lock-removal invariant holds: HandleResponse only executes inside dispatcher.Run(), while ResetStateRoot and CleanupMemory only execute outside it. SimpleDispatcher.Run's drain (maxThreads × semaphore.WaitAsync(CancellationToken.None)) is a strict happens-before barrier — semaphore.Release() is in the finally of the worker lambda, which runs after DoDispatch returns (which includes HandleResponse). The five previously-locked fields are safe without the ReaderWriterLockSlim.
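The drain-as-barrier argument above can be illustrated with a small sketch. It is not SimpleDispatcher's code; `DrainSketch`, `RunAsync`, and the `handle` delegate are hypothetical names, and only the SemaphoreSlim pattern (release in `finally`, then reacquire all `maxThreads` slots with `CancellationToken.None`) mirrors what the review describes.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch of a bounded dispatcher whose return is a drain barrier.
public static class DrainSketch
{
    // Runs `jobs` work items on at most `maxThreads` concurrent workers and
    // returns how many were handled, but only after every worker has finished.
    public static async Task<int> RunAsync(int jobs, int maxThreads, Action<int> handle)
    {
        using SemaphoreSlim semaphore = new(maxThreads, maxThreads);
        int handled = 0;

        for (int i = 0; i < jobs; i++)
        {
            await semaphore.WaitAsync(); // take a slot before starting a worker
            int job = i;
            _ = Task.Run(() =>
            {
                try
                {
                    handle(job); // stands in for DoDispatch / HandleResponse
                    Interlocked.Increment(ref handled);
                }
                finally
                {
                    semaphore.Release(); // released only after the handler returned
                }
            });
        }

        // Drain: reacquiring every slot succeeds only once each worker has
        // released its slot, so all handler calls happen-before the return.
        for (int i = 0; i < maxThreads; i++)
            await semaphore.WaitAsync(CancellationToken.None);

        return handled;
    }
}
```

Code that runs after `RunAsync` completes (the analogue of ResetStateRoot / CleanupMemory in the runner) therefore cannot overlap with any handler call, which is the property the lock removal relies on.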


Drain bypass on cancellation (pre-existing, benign)

If token is cancelled while SimpleDispatcher.Run is blocked in semaphore.WaitAsync(token) or peerPool.Allocate(…, token), the OCE propagates out of the while loop: the drain does not run, and in-flight Task.Run workers become fire-and-forget. However, DoDispatch guards with if (token.IsCancellationRequested) return before calling HandleResponse, so those workers skip the call entirely. The runner catches the OCE in Run(), finalPivot is null, and the sync session terminates cleanly. No new round starts after cancellation, so there is no concurrent ResetStateRoot to race with.
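The guard that makes the orphaned workers harmless can be sketched as below. This is an illustration only: `CancelGuardSketch` and `DispatchOne` are hypothetical, and the real DoDispatch does more (peer allocation, request execution, freeing the peer).

```csharp
using System;
using System.Threading;

// Hypothetical sketch: a worker checks the token before touching shared state.
public static class CancelGuardSketch
{
    // Returns true if the response handler ran, false if it was skipped
    // because cancellation had already been requested.
    public static bool DispatchOne(CancellationToken token, Action handleResponse)
    {
        // ... the request would have been sent and its response received here ...

        if (token.IsCancellationRequested) return false; // skip the handler entirely

        handleResponse();
        return true;
    }
}
```

A worker that outlives a cancelled dispatcher run then never reaches the handler, so it cannot touch state that the runner is tearing down.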


Updated findings

Critical: 0 | High: 0 | Medium: 0 | Low: 2

| # | Location | Finding |
|---|----------|---------|
| 1 | StateSyncRunner.cs:95–98 + TreeSync.cs:60 | IsRootComplete is true for _rootNode == Keccak.EmptyTreeHash, so FinalizeSync(roundPivot) and VerifyPostSyncCleanUp() are now called on the empty-state path. Previously FinalizeSync was only invoked from SaveNode's IsRoot branch, which never fires for empty-state sync. VerifyPostSyncCleanUp was only called from the _rootSaved == 1 and rootNodeKeyExists branches of IsSyncRoundFinished, not from the EmptyTreeHash branch. These are new behaviours. They should be harmless in practice (empty state has no nodes to corrupt, and _store.FinalizeSync on an empty pivot should be a no-op), but it would be worth a brief test or an explicit comment confirming the store tolerates this. |
| 2 | TreeSync.cs:351 | _stateSyncPivot.GetPivotHeader() is called without a null guard. GetPivotHeader() can return null, producing an NRE on .StateRoot. Pre-existing; not introduced here. |

Previous findings — status

| Old # | Status | Note |
|-------|--------|------|
| 1 (roundCts filter) | ❌ Not applicable | No roundCts in current code |
| 2 (dispatcher fault before loop) | ✅ Still valid (low) | SimpleDispatcher.Run never faults in practice; assumption undocumented |
| 3 (VerifyPostSyncCleanUp before drain) | Resolved | Cleanup now correctly runs after drain |
| 4 (drain bypass on cancellation) | ✅ Still valid (low) | Benign; HandleResponse skipped when cancelled |

Overall

The architectural core is sound and cleaner than the previous review characterised. The pivot-per-round capture eliminates the GetPivotHeader() mutation race, the drain is the correct synchronisation barrier, and moving cleanup after the drain resolves the pre-existing false-positive corruption warnings. No blocking issues.

LukaszRozmej previously approved these changes May 4, 2026
Comment on lines +67 to +71
catch (ObjectDisposedException)
{
// DBs / stores can be torn down while sync is in progress on shutdown — same
// semantics as cancellation: swallow rather than logging "State sync failed".
}

Can this be prevented, so that the teardown order is preserved?

Comment on lines +69 to +73
// The five fields below were previously guarded by an RWLock so HandleResponse (read)
// could run concurrently with ResetStateRoot / CleanupMemory (write). The lock is no
// longer needed: SimpleDispatcher.Run drains all in-flight workers before returning,
// and StateSyncRunner only calls reset/cleanup outside dispatcher.Run, so the drain is
// the sole synchronisation barrier.

This can be more concise; the comment doesn't need the history.

Base automatically changed from amirul/simple-sync-dispatcher to master May 4, 2026 11:06
@asdacap asdacap dismissed LukaszRozmej’s stale review May 4, 2026 11:06

The base branch was changed.

asdacap and others added 6 commits May 4, 2026 20:00
Round-termination signalling stays in StateSyncFeed.PrepareRequest (it
returns null when treeSync.IsSyncRoundFinished() is true), so the
dispatcher exits naturally per round and the runner just `await`s it.

Post-sync work (VerifyPostSyncCleanUp, FinalizeSync, verify-trie) moves
out of TreeSync's response-handling path (was inside SaveNode's IsRoot
branch and IsSyncRoundFinished's success branches) into the runner,
which calls them once after the loop with the captured per-round pivot.

The captured pivot — returned by ResetStateRootToBestSuggested — is the
key fix for svlachakis #11457/#11458: FinalizeSync now uses exactly the
pivot the round committed against, instead of re-reading the mutating
GetPivotHeader() and racing concurrent SaveNodes.

Pending_items_cache_mechanism_works_across_root_changes drives TreeSync
directly since the feed only returns null on round-finish/cancellation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- StateSyncRunner: drop the try/catch around FinalizeSync (original
  call site in TreeSync.SaveNode had no such guard).
- TreeSync: move the "STATE SYNC FINISHED:..." log into StateSyncRunner
  so all post-sync reporting lives in one place.
- TreeSync: keep VerifyPostSyncCleanUp at its original location next to
  CleanupMemory.
- TreeSync.SaveNode: drop the explanatory comment about FinalizeSync's
  prior call site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The RWLock previously serialised HandleResponse (read) against
ResetStateRoot / CleanupMemory (write). After the round-termination
move, those writers only run from StateSyncRunner outside dispatcher.Run,
and SimpleDispatcher.Run drains all in-flight workers before returning,
so the drain alone is the synchronisation barrier. The lock is dead
weight on the response hot path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The feed didn't actually need to change: the OCE catches I added were
redundant (the dispatcher / runner already swallow OCE on cancellation),
and the formatting tweaks were noise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Every caller of ResetStateRoot / ResetStateRootToBestSuggested passes
SyncFeedState.Dormant; the only thing the parameter did was guard a
branch that's always taken (UpdateHeaderForcefully) and a throw that's
unreachable. Holdover from when StateSyncFeed was an ISyncFeed with a
state machine — gone after the ISimpleSyncFeed migration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

VerifyPostSyncCleanUp's local try/catch around the dependency-clear and
CleanupMemory was a one-off swallow at a low level. Drop it and add a
single ObjectDisposedException catch alongside the existing
OperationCanceledException catch in StateSyncRunner.Run, so any
shutdown-time DB teardown is handled in one place.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

asdacap and others added 2 commits May 4, 2026 20:00
GetPivotHeader() can return null transiently (pivot not known yet,
beacon control not ready). Previously TreeSync.IsSyncRoundFinished
would NRE on .StateRoot. Now it treats null pivot as "round done so
the runner can park", and the runner sleeps 1s before retrying when
ResetStateRootToBestSuggested also returns null — preventing a tight
loop on transient pivot unavailability.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
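
The null-pivot handling in the commit message above (park, sleep, retry) can be sketched as follows. This is a hedged illustration rather than the committed code: `PivotRetrySketch`, `WaitForPivotAsync`, and the `tryGetPivot` delegate are hypothetical, with the delegate standing in for ResetStateRootToBestSuggested returning null while the pivot is unavailable.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch: back off between attempts instead of spinning on a null pivot.
public static class PivotRetrySketch
{
    public static async Task<T> WaitForPivotAsync<T>(
        Func<T?> tryGetPivot,
        TimeSpan retryDelay,
        CancellationToken token) where T : class
    {
        while (true)
        {
            T? pivot = tryGetPivot();
            if (pivot is not null) return pivot;

            // Pivot not known yet: wait before retrying. Task.Delay throws
            // OperationCanceledException if the token is cancelled meanwhile.
            await Task.Delay(retryDelay, token);
        }
    }
}
```

With a one-second `retryDelay`, this matches the behaviour the commit message describes: transient pivot unavailability costs a short pause per attempt rather than a tight loop.
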
- StateSyncRunner.Run: drop the catch (ObjectDisposedException) and instead skip the
  DB-tune-default in the finally when token is cancelled, since the OD on shutdown
  came from touching already-disposed DBs in that path.
- TreeSync: trim the lock-removal comment — keep the invariant, drop the history.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@asdacap asdacap merged commit 12fabc9 into master May 4, 2026
469 of 894 checks passed
@asdacap asdacap deleted the move-logic-to-state-sync branch May 4, 2026 12:51
@ak88 ak88 mentioned this pull request May 12, 2026