refactor: serialise FinalizeSync via captured per-round pivot#11477
Conversation
|
Claude finished @asdacap's task in 7m 58s —— View job PR Review
SummaryThe architectural move is sound. Critical: 0 | High: 0 | Medium: 0 | Low: 4 Low findings
Correctness notes
Mergeable. The low findings are either pre-existing or cosmetic (comment clarity). No blocking issues. |
| // CancellationToken.None internally, so peer allocations are always freed. | ||
| // Swallow OCE from roundCts cancellation: that's the normal way to end a | ||
| // round here and must not prevent the verify-trie hand-off below. | ||
| try { await dispatch; } |
There was a problem hiding this comment.
The catch filter when (roundCts.IsCancellationRequested) is effectively always true when an OCE is raised here, because roundCts is cancelled in both exit paths before reaching this line:
- Path A (round finished):
roundCts.Cancel()called explicitly at line 93. - Path B (outer
tokencancelled):roundCtsis linked totoken, soIsCancellationRequestedis already true.
In practice this means it's equivalent to catch (OperationCanceledException) { } — any OCE from dispatch, including one raised because token was cancelled during the drain, is silently swallowed. The intent (letting the verify-trie handoff proceed) is correct, but the comment could be clearer that this also covers the shutdown path, not just the normal-round-end path. Consider:
// Swallow OCE regardless of which token fired: both normal round termination (roundCts
// cancelled by us) and graceful shutdown (token cancelled) must not skip verify-trie.
try { await dispatch; }
catch (OperationCanceledException) { }Low severity — behaviour is correct, just the comment/filter is misleading.
| // Round-termination authority lives here, not in the feed: poll IsSyncRoundFinished | ||
| // (which also fires VerifyPostSyncCleanUp / SyncCompleted) and cancel the dispatcher | ||
| // when it returns true. The feed is a pure batch producer. | ||
| while (!dispatch.IsCompleted) |
There was a problem hiding this comment.
Edge case: if dispatch is already completed (e.g. the dispatcher faulted with a non-OCE exception) before the polling loop is entered, while (!dispatch.IsCompleted) evaluates to false immediately. The loop body never executes, so IsSyncRoundFinished() and VerifyPostSyncCleanUp() are never called. Control then falls through to await dispatch, which re-throws the fault exception — that propagates all the way out of Run(), bypassing verify-trie.
In practice SimpleDispatcher.Run() doesn't fault (all exceptions in DoDispatch are caught), so this is unlikely, but worth a comment to document the assumption.
| // when it returns true. The feed is a pure batch producer. | ||
| while (!dispatch.IsCompleted) | ||
| { | ||
| if (treeSync.IsSyncRoundFinished()) |
There was a problem hiding this comment.
IsSyncRoundFinished() has a DB read side path: the rootNodeKeyExists check acquires _stateDbLock and calls _store.NodeExists(...) (→ RocksDB). This runs every 50 ms for the lifetime of a sync round. That's fine for state sync, but worth keeping in mind if the interval is ever tightened.
More importantly, IsSyncRoundFinished() calls VerifyPostSyncCleanUp() (via the _rootSaved == 1 and rootNodeKeyExists branches) before roundCts.Cancel() is called (line 93). In-flight DoDispatch tasks can still be calling HandleResponse → treeSync.HandleResponse concurrently when CleanupMemory() acquires its write lock. The write lock serialises correctly, but it means VerifyPostSyncCleanUp's _pendingItems / _dependencies non-zero checks fire before all in-flight responses are drained, which can produce false-positive "POSSIBLE FAST SYNC CORRUPTION" log errors.
This is pre-existing behaviour (same race existed when IsSyncRoundFinished was called from inside PrepareRequest), but noting it because the new drain (await dispatch) does not change the window — cleanup is still called before the drain completes.
| await stateSyncDispatcher.Run(token); | ||
|
|
||
| using CancellationTokenSource roundCts = CancellationTokenSource.CreateLinkedTokenSource(token); | ||
| Task dispatch = stateSyncDispatcher.Run(roundCts.Token); |
There was a problem hiding this comment.
There is a narrow path where the drain is bypassed. If roundCts.Cancel() fires while SimpleDispatcher.Run() is blocked inside peerPool.Allocate(..., roundCts.Token) or semaphore.WaitAsync(roundCts.Token), an OperationCanceledException propagates out of Run() directly, skipping the drain loop (for (int i = 0; i < maxThreads; ...) in the dispatcher). The runner catches this OCE at line 111–112 and continues, but the orphaned Task.Run workers from that round are still executing in the background.
In practice the consequences are benign — the workers hold closure references to semaphore (keeping it alive), eventually call peerPool.Free and semaphore.Release(), and any HandleResponse they attempt will fail the _ongoingRequests.TryRemove guard and return early after CleanupMemory() has cleared the dictionary. No data corruption or resource leak. But the comment "peer allocations are always freed" is slightly optimistic — they are freed, just asynchronously by the still-running tasks, not synchronously by the drain. This same race exists in the pre-PR code.
20cf987 to
a35807d
Compare
|
@claude reevaluate |
|
Claude finished @asdacap's task in 5m 52s —— View job Re-evaluation
What changed since the first reviewThe previous review described a
Findings 1 and 3 from the previous review were based on that Concurrency model — verified correctThe lock-removal invariant holds: Drain bypass on cancellation (pre-existing, benign)If Updated findingsCritical: 0 | High: 0 | Medium: 0 | Low: 2
Previous findings — status
OverallThe architectural core is sound and cleaner than the previous review characterised. The pivot-per-round capture eliminates the |
| catch (ObjectDisposedException) | ||
| { | ||
| // DBs / stores can be torn down while sync is in progress on shutdown — same | ||
| // semantics as cancellation: swallow rather than logging "State sync failed". | ||
| } |
There was a problem hiding this comment.
Can this be prevented and order be preserved?
| // The five fields below were previously guarded by an RWLock so HandleResponse (read) | ||
| // could run concurrently with ResetStateRoot / CleanupMemory (write). The lock is no | ||
| // longer needed: SimpleDispatcher.Run drains all in-flight workers before returning, | ||
| // and StateSyncRunner only calls reset/cleanup outside dispatcher.Run, so the drain is | ||
| // the sole synchronisation barrier. |
There was a problem hiding this comment.
Can be more concise, don't need history.
Round-termination signalling stays in StateSyncFeed.PrepareRequest (it returns null when treeSync.IsSyncRoundFinished() is true), so the dispatcher exits naturally per round and the runner just `await`s it. Post-sync work (VerifyPostSyncCleanUp, FinalizeSync, verify-trie) moves out of TreeSync's response-handling path (was inside SaveNode's IsRoot branch and IsSyncRoundFinished's success branches) into the runner, which calls them once after the loop with the captured per-round pivot. The captured pivot — returned by ResetStateRootToBestSuggested — is the key fix for svlachakis #11457/#11458: FinalizeSync now uses exactly the pivot the round committed against, instead of re-reading the mutating GetPivotHeader() and racing concurrent SaveNodes. Pending_items_cache_mechanism_works_across_root_changes drives TreeSync directly since the feed only returns null on round-finish/cancellation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- StateSyncRunner: drop the try/catch around FinalizeSync (original call site in TreeSync.SaveNode had no such guard). - TreeSync: move the "STATE SYNC FINISHED:..." log into StateSyncRunner so all post-sync reporting lives in one place. - TreeSync: keep VerifyPostSyncCleanUp at its original location next to CleanupMemory. - TreeSync.SaveNode: drop the explanatory comment about FinalizeSync's prior call site. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The RWLock previously serialised HandleResponse (read) against ResetStateRoot / CleanupMemory (write). After the round-termination move, those writers only run from StateSyncRunner outside dispatcher.Run, and SimpleDispatcher.Run drains all in-flight workers before returning, so the drain alone is the synchronisation barrier. The lock is dead weight on the response hot path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The feed didn't actually need to change — the OCE catches I added were redundant (the dispatcher / runner already swallow OCE on cancellation), and the formatting tweaks were noise. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every caller of ResetStateRoot / ResetStateRootToBestSuggested passes SyncFeedState.Dormant; the only thing the parameter did was guard a branch that's always taken (UpdateHeaderForcefully) and a throw that's unreachable. Holdover from when StateSyncFeed was an ISyncFeed with a state machine — gone after the ISimpleSyncFeed migration. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
VerifyPostSyncCleanUp's local try/catch around the dependency-clear and CleanupMemory was a one-off swallow at a low level. Drop it and add a single ObjectDisposedException catch alongside the existing OperationCanceledException catch in StateSyncRunner.Run, so any shutdown-time DB teardown is handled in one place. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GetPivotHeader() can return null transiently (pivot not known yet, beacon control not ready). Previously TreeSync.IsSyncRoundFinished would NRE on .StateRoot. Now it treats null pivot as "round done so the runner can park", and the runner sleeps 1s before retrying when ResetStateRootToBestSuggested also returns null — preventing a tight loop on transient pivot unavailability. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- StateSyncRunner.Run: drop the catch (ObjectDisposedException) and instead skip the DB-tune-default in the finally when token is cancelled, since the OD on shutdown came from touching already-disposed DBs in that path. - TreeSync: trim the lock-removal comment — keep the invariant, drop the history. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fa60df4 to
ef57ead
Compare
Summary
Stacks on top of #11102. Addresses the
GetPivotHeader()race flagged bysvlachakisin #11102 (root cause discussed in #11457 / #11458):TreeSync.SaveNodepreviously called_store.FinalizeSync(GetPivotHeader())inline on theIsRootbranch, while parallelHandleResponseworkers were still writing nodes — andGetPivotHeader()is mutating, so the pivot rotation can race with the root commit and corrupt FlatDBCurrentState.This PR pins the pivot per round in the runner and runs all post-sync work at the quiescent point after the dispatcher has fully drained.
What changed
TreeSync.ResetStateRootToBestSuggestednow returns theBlockHeaderit used.StateSyncRunner.RunStateSyncRoundscaptures that header at the start of each round and, when the round commits the root, uses that captured value forFinalizeSync— no secondGetPivotHeader()call.FinalizeSyncout ofSaveNode. The_store.FinalizeSync(pivotHeader)call is removed fromTreeSync.SaveNode'sIsRootbranch. The runner invokestreeSync.FinalizeSync(finalPivot)once after the outer loop exits viaIsRootComplete.VerifyPostSyncCleanUpout ofIsSyncRoundFinished. The two cleanup-firing branches inside the predicate (_rootSaved == 1andNodeExists) no longer call cleanup as a side effect. The runner callstreeSync.VerifyPostSyncCleanUp()once after the loop.STATE SYNC FINISHED:…log moved to the runner so all post-sync reporting is in one place._syncStateLockremoved. It existed to serialiseHandleResponse(read) againstResetStateRoot/CleanupMemory(write). With the cleanup moved out and the writers only running from the runner outsidedispatcher.Run,SimpleDispatcher.Run's drain (semaphore.WaitAsync(CancellationToken.None)× maxThreads) is the sole synchronisation barrier — strict happens-before between the lastHandleResponseand the next reset/cleanup.What didn't change
StateSyncFeed.PrepareRequestreturning null whentreeSync.IsSyncRoundFinished()is true — natural dispatcher exit, no watcher / per-round CTS.treeSync.IsRootComplete. Tightening to the strictIsRootSaved(_rootSaved == 1) breaks legitimate completion paths (empty-state syncs, snap-prepopulated root) where theIsRootbranch inSaveNodenever fires. TheNodeExistsover-reporting concern stays as FlatDb 1.37.1: restart mid-snap promotes incomplete state, deletes canonical chain, sync permanently stuck #11457 / fix(snap-sync): deferFinalizeSyncto end of state sync to prevent canonical chain wipe #11458 territory.GetPivotHeader()null-deref atTreeSync.cs:355,SyncCompletedevent timing) are pre-existing and untouched here.Test plan
Nethermind.Synchronization.Testfiltered to StateSync + SnapSync: 932/940 pass, 8 skips (hash-db).🤖 Generated with Claude Code