consensus: fix milestone-mismatch rewind deadlock#2246
Conversation
There was a problem hiding this comment.
Claude Code Review
This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.
Tip: disable this comment in your organization's Code Review settings.
|
@claude review once |
There was a problem hiding this comment.
Pull request overview
This PR fixes a recovery deadlock where Bor nodes on a bad fork could not automatically rewind on a milestone mismatch because the canonical sidechain segment was missing locally and the downloader would refuse to fetch it while the bad-fork tip was still present. The updated rewind logic allows rewinding to an attested canonical ancestor (when the canonical segment isn’t locally available), and purges stale milestone/fork-validation state so canonical resync from peers can proceed.
Changes:
- Update milestone-mismatch handling to allow rewinding to an attested canonical ancestor even when the canonical chain segment isn’t present locally, while preserving the “don’t rewind blindly” safety checks.
- Add milestone-state purge support (
PurgeMilestonesAfter/PurgeAfter) to clear stale whitelisted/locked/future milestone entries and fork-validation cache after a rewind. - Add rawdb helper to delete persisted finality entries and introduce new tests covering purge and rewind behavior.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| interfaces.go | Extends ChainValidator with PurgeMilestonesAfter to support post-rewind cleanup. |
| eth/downloader/whitelist/service.go | Implements PurgeMilestonesAfter to purge milestone state and reset fork-validation cache/bounds. |
| eth/downloader/whitelist/milestone.go | Adds PurgeAfter to drop stale milestone whitelist/lock/future entries in-memory and on-disk. |
| eth/downloader/whitelist/milestone_test.go | Adds unit tests validating PurgeAfter behavior for memory+disk combinations. |
| eth/downloader/bor_downloader_test.go | Updates whitelist fake to satisfy the extended ChainValidator interface. |
| eth/bor_checkpoint_verifier.go | Refactors mismatch rewind logic (attested blind rewind fallback), adds miner pause helper, and invokes milestone purge after successful rewinds. |
| eth/bor_checkpoint_verifier_test.go | Adds integration tests for the new rewind behavior and updates helper logic. |
| core/rawdb/milestone.go | Adds DeleteLastFinality helper used by milestone purge. |
| core/forkchoice_test.go | Updates chain validator fake to satisfy the extended ChainValidator interface. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Codecov Report❌ Patch coverage is ❌ Your patch check has failed because the patch coverage (62.41%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## develop #2246 +/- ##
===========================================
+ Coverage 52.84% 52.91% +0.07%
===========================================
Files 886 886
Lines 156786 156863 +77
===========================================
+ Hits 82848 83008 +160
+ Misses 68688 68605 -83
Partials 5250 5250
... and 21 files with indirect coverage changes
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
I didn't surface any bugs, but this PR rewrites the milestone-mismatch rewind state machine and adds new chain-reorg behavior (blind rewind on attested ancestor + cross-component whitelist purge), so it warrants a careful human review of the consensus path.
Extended reasoning...
Overview
The PR fixes a real deadlock where a forked Bor node couldn't auto-recover: GetBlocksFromHash returned nothing because the canonical sidechain only existed on peers, so the rewind path bailed; meanwhile the downloader rejected those peers as 'sidechain ghost-state attack' because the bad-fork tip was still local. Recovery previously required wiping chain data. The fix introduces an attested-anchor concept (whitelist hash matches local hash at the anchor block and num < end) and, only for milestones, performs SetHead to that anchor without inserting a canonical chain, letting the downloader resync from peers. It also adds PurgeMilestonesAfter to drop stale whitelist/lock/future-milestone state above the new head so the downloader's validators don't reject canonical peers post-rewind.
Security risks
The new blind-rewind path is gated by three conjoined conditions (not a checkpoint, rewindAttested is true, and head-rewindTo <= maxRewindLen) and the attestation explicitly requires num < end && GetCanonicalHash(num) == expectedHash, so an attacker can't induce a rewind on a hash they control unless that hash already matches the local canonical chain at that height — i.e. the local node would have to already agree. The checkpoint path is intentionally excluded because the checkpoint hash is a merkle root, not a block hash, and cannot attest. The maxRewindLen clamp prevents unbounded rewinds and explicitly clears rewindAttested when the target is clamped. PurgeMilestonesAfter also caps lastValidForkBlock so checkForkCorrectness's max(milestoneNumber, lastValidForkBlock) bound doesn't blind-accept peer chains past the rewind anchor — a subtle but important detail. The risks I'd want a human to weigh are: (1) whether the attested predicate is sufficient under adversarial peer scenarios where a milestone might be processed before reorg detection completes, (2) the genesis-anchor edge case (rewindTo=0 with an attested genesis hash deliberately distinguished from 'unset'), and (3) whether dropping the future-milestone list could in some scenario weaken protection against a fork the node was correctly rejecting.
Level of scrutiny
High. This touches the consensus-validation rewind path, modifies chain reorg behavior, and adds a new interface method (PurgeMilestonesAfter) that ripples across ChainValidator consumers. Bugs here can desync nodes from the canonical chain or — in the worst case — make a node accept a non-canonical fork. The author has added thoughtful test coverage including negative cases (whitelist hash differs, rewindTo==end no-op, unattested fallback, genesis anchor), and the change was validated on a live devnet, but the state-machine reasoning is dense enough that a maintainer familiar with this code path should sign off rather than an automated reviewer.
Other factors
Code quality is high: the new helpers (pauseMiner, rewind, reorgToFinalized now returning bool) are sensible refactors that fix a previously ignored SetHead error; comments explain the why (e.g. race between findCommonAncestor's re-read and the outer mismatch check). The PR also adds a metric (PurgeAfterDBErrorMeter) for observability on DB failures during purge. SonarQube quality gate passed. The only nit I noticed is that this is a substantial behavioral change with no CODEOWNERS sign-off visible yet — that alone is reason to defer.
|



Summary
Bor nodes that forked off the canonical chain due to a Heimdall startup hiccup, span propagation gap, or similar issue could not recover automatically. The mismatch-rewind path refused to act without the canonical sidechain in the local DB, while the downloader refused to deliver that sidechain because the bad-fork tip was already in place. This created a chicken-and-egg deadlock where a manual chain data wipe was the only recovery path.
Root cause
In eth/bor_checkpoint_verifier.go before the fix:
GetBlocksFromHashonly returns blocks that already exist in the local DB. After a fork, those canonical blocks live on peers, but the downloader rejects them as sidechain data. The refusal logic predates the fix rewritten in this PR and was originally added in PR #23 to prevent a separate “merkle root reorg on nil chain” failure mode. This fix preserves that protection while allowing recovery.Validation on a live devnet
Two RPC nodes hit the bug naturally and recovered: