consensus: fix milestone-mismatch rewind deadlock by kamuikatsurgi · Pull Request #2246 · 0xPolygon/bor

kamuikatsurgi · 2026-05-27T12:27:00Z

Summary

Bor nodes that forked off the canonical chain due to a Heimdall startup hiccup, span propagation gap, or similar issue could not recover automatically. The mismatch-rewind path refused to act without the canonical sidechain in the local DB, while the downloader refused to deliver that sidechain because the bad-fork tip was already in place. This created a chicken-and-egg deadlock where a manual chain data wipe was the only recovery path.

INFO [06-24-32.167] Sealing successful                       number=378
INFO [06-24-52.504] Whitelisting new milestone from heimdall block=397 hash=aafd6f..e695f7
INFO [06-25-01.113] Sealing successful                       number=407  ← peak
WARN [06-25-03.949] End block hash mismatch while whitelisting milestone expected=919b75.. got=91b2b4..
INFO [06-25-03.949] Rewinding chain due to milestone endblock hash mismatch number=397
WARN [06-25-03.949] Milestone mismatch: refusing to rewind without a canonical chain segment head=407 rewindTo=397 start=398 end=398
WARN [06-25-08.310] Sidechain ghost-state attack detected    number=398 sideroot=b4d9e9.. canonroot=b4d9e9..
WARN [06-26-20.967] Synchronisation failed, dropping peer    err="retrieved hash chain is invalid: sidechain ghost-state attack"

Root cause

In eth/bor_checkpoint_verifier.go before the fix:

canonicalChain = eth.BlockChain().GetBlocksFromHash(canonicalHash, length)
if len(canonicalChain) == 0 {
    log.Warn("Milestone mismatch: refusing to rewind without a canonical chain segment")
    return errHashMismatch
}

GetBlocksFromHash only returns blocks that already exist in the local DB. After a fork, the canonical blocks live only on peers, and the downloader rejects them as sidechain data. The refusal logic predates this PR: it was originally added in PR #23 to prevent a separate “merkle root reorg on nil chain” failure mode. This fix preserves that protection while allowing recovery.
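The guard's behavior can be illustrated with a small self-contained model. This is a sketch only: `block`, `localDB`, `blocksFromHash`, and `oldRewindCheck` are hypothetical stand-ins, not bor types or functions, and the hashes are made up. It shows that a lookup limited to local blocks returns nothing for a canonical hash that exists only on peers, so the pre-fix check refuses the rewind.

```go
package main

import (
	"errors"
	"fmt"
)

// block and localDB are hypothetical stand-ins, not bor types.
type block struct {
	number uint64
	hash   string
	parent string
}

// localDB models the node's block store, keyed by block hash.
type localDB map[string]block

// blocksFromHash walks parent links backwards from hash and returns at most n
// blocks that are present locally. Blocks that exist only on peers are never
// returned, which is the property the root-cause note describes.
func (db localDB) blocksFromHash(hash string, n int) []block {
	var out []block
	for i := 0; i < n; i++ {
		b, ok := db[hash]
		if !ok {
			break
		}
		out = append(out, b)
		hash = b.parent
	}
	return out
}

var errHashMismatch = errors.New("hash mismatch")

// oldRewindCheck mirrors the shape of the pre-fix guard: no local canonical
// segment means no rewind.
func oldRewindCheck(db localDB, canonicalHash string, length int) error {
	if len(db.blocksFromHash(canonicalHash, length)) == 0 {
		return errHashMismatch
	}
	return nil
}

func main() {
	// The node holds only its own fork; canonical block "c398" lives on peers.
	db := localDB{
		"f397": {number: 397, hash: "f397", parent: "f396"},
		"f398": {number: 398, hash: "f398", parent: "f397"},
	}
	fmt.Println(oldRewindCheck(db, "c398", 1)) // canonical segment missing locally
	fmt.Println(oldRewindCheck(db, "f398", 2)) // a locally known hash passes the guard
}
```

In this model the first call returns the mismatch error and the second returns nil. The other half of the deadlock (the downloader refusing the canonical segment while the bad-fork tip is in place) is not modeled here.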

Validation on a live devnet

Two RPC nodes hit the bug naturally and recovered:

WARN [11:31:47.845] End block hash mismatch while whitelisting milestone expected=ebe1a9bd.. got=fabc9940..
WARN [11:31:47.845] Milestone mismatch: rewinding to attested canonical ancestor without local sidechain; canonical chain will resync from peers head=639 rewindTo=589 start=590 end=591
WARN [11:31:47.845] Rewinding blockchain to block            target=589
INFO [11:31:47.845] Rewound to block with state              number=638..589

claude

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.

Tip: disable this comment in your organization's Code Review settings.

kamuikatsurgi · 2026-05-27T12:32:30Z

@claude review once

Copilot

Pull request overview

This PR fixes a recovery deadlock where Bor nodes on a bad fork could not automatically rewind on a milestone mismatch because the canonical sidechain segment was missing locally and the downloader would refuse to fetch it while the bad-fork tip was still present. The updated rewind logic allows rewinding to an attested canonical ancestor (when the canonical segment isn’t locally available), and purges stale milestone/fork-validation state so canonical resync from peers can proceed.

Changes:

Update milestone-mismatch handling to allow rewinding to an attested canonical ancestor even when the canonical chain segment isn’t present locally, while preserving the “don’t rewind blindly” safety checks.
Add milestone-state purge support (PurgeMilestonesAfter / PurgeAfter) to clear stale whitelisted/locked/future milestone entries and fork-validation cache after a rewind.
Add rawdb helper to delete persisted finality entries and introduce new tests covering purge and rewind behavior.
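The purge step listed above can be pictured with a minimal in-memory model. This is a sketch under assumptions: the `milestoneState` struct, its fields, and the strict "greater than head" comparison are hypothetical, and the real `PurgeAfter` also touches on-disk state, which is omitted here.

```go
package main

import "fmt"

// milestoneState is a hypothetical, simplified model of the whitelist state:
// one whitelisted milestone, one lock, and a set of future milestones.
type milestoneState struct {
	set          bool              // whether a milestone is whitelisted
	number       uint64            // whitelisted milestone end block
	hash         string            // whitelisted milestone end hash
	locked       bool              // whether a lock is held
	lockedNumber uint64            // block number the lock refers to
	future       map[uint64]string // future milestones: end block -> hash
}

// purgeAfter drops every entry that refers to a block above head. The strict
// ">" comparison is an assumption made for this sketch.
func (m *milestoneState) purgeAfter(head uint64) {
	if m.set && m.number > head {
		m.set, m.number, m.hash = false, 0, ""
	}
	if m.locked && m.lockedNumber > head {
		m.locked, m.lockedNumber = false, 0
	}
	// Deleting from a map while ranging over it is safe in Go.
	for n := range m.future {
		if n > head {
			delete(m.future, n)
		}
	}
}

func main() {
	m := &milestoneState{
		set: true, number: 600, hash: "0xaa",
		locked: true, lockedNumber: 610,
		future: map[uint64]string{580: "0xbb", 620: "0xcc"},
	}
	m.purgeAfter(589) // pretend the chain was rewound to block 589
	fmt.Println(m.set, m.locked, len(m.future))
}
```

After the call, only the future entry at or below the new head survives, so nothing above the rewind point can cause validators to reject canonical peers.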

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| interfaces.go | Extends `ChainValidator` with `PurgeMilestonesAfter` to support post-rewind cleanup. |
| eth/downloader/whitelist/service.go | Implements `PurgeMilestonesAfter` to purge milestone state and reset fork-validation cache/bounds. |
| eth/downloader/whitelist/milestone.go | Adds `PurgeAfter` to drop stale milestone whitelist/lock/future entries in-memory and on-disk. |
| eth/downloader/whitelist/milestone_test.go | Adds unit tests validating `PurgeAfter` behavior for memory+disk combinations. |
| eth/downloader/bor_downloader_test.go | Updates whitelist fake to satisfy the extended `ChainValidator` interface. |
| eth/bor_checkpoint_verifier.go | Refactors mismatch rewind logic (attested blind rewind fallback), adds miner pause helper, and invokes milestone purge after successful rewinds. |
| eth/bor_checkpoint_verifier_test.go | Adds integration tests for the new rewind behavior and updates helper logic. |
| core/rawdb/milestone.go | Adds `DeleteLastFinality` helper used by milestone purge. |
| core/forkchoice_test.go | Updates chain validator fake to satisfy the extended `ChainValidator` interface. |

codecov · 2026-05-27T12:57:37Z

Codecov Report

❌ Patch coverage is 62.41135% with 53 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.91%. Comparing base (7fdfdb4) to head (09fafb5).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| eth/bor_checkpoint_verifier.go | 63.63% | 23 Missing and 5 partials ⚠️ |
| eth/downloader/whitelist/milestone.go | 77.77% | 7 Missing and 3 partials ⚠️ |
| eth/downloader/whitelist/service.go | 30.76% | 9 Missing ⚠️ |
| core/rawdb/milestone.go | 0.00% | 6 Missing ⚠️ |

❌ Your patch check has failed because the patch coverage (62.41%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.
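The patch-coverage figure can be checked against the per-file counts. In the sketch below, the missing and partial counts come from the report, while the total of 141 changed lines is an inferred value (the report does not state it) chosen because it reproduces the quoted percentage. `patchCoverage` is a helper defined only for this illustration.

```go
package main

import "fmt"

// patchCoverage returns the covered share of changed lines as a percentage.
// Lines reported as "missing" or "partial" both count as not covered here.
func patchCoverage(total, notCovered int) float64 {
	return float64(total-notCovered) / float64(total) * 100
}

func main() {
	// Per-file "missing + partials" counts taken from the report.
	notCovered := (23 + 5) + (7 + 3) + 9 + 6

	// total is an assumption inferred from the quoted percentage,
	// not a number that appears in the report.
	const total = 141

	fmt.Printf("not covered=%d coverage=%.5f%%\n", notCovered, patchCoverage(total, notCovered))
}
```

The per-file counts sum to the 53 lines the report mentions, and 88 covered lines out of 141 gives the quoted 62.41135%.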

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #2246      +/-   ##
===========================================
+ Coverage    52.84%   52.91%   +0.07%     
===========================================
  Files          886      886              
  Lines       156786   156863      +77     
===========================================
+ Hits         82848    83008     +160     
+ Misses       68688    68605      -83     
  Partials      5250     5250

| Files with missing lines | Coverage Δ | |
| --- | --- | --- |
| interfaces.go | `0.00% <ø> (ø)` | |
| core/rawdb/milestone.go | `0.00% <0.00%> (ø)` | |
| eth/downloader/whitelist/service.go | `89.15% <30.76%> (-2.84%)` | ⬇️ |
| eth/downloader/whitelist/milestone.go | `81.71% <77.77%> (-2.26%)` | ⬇️ |
| eth/bor_checkpoint_verifier.go | `58.48% <63.63%> (+12.66%)` | ⬆️ |

... and 21 files with indirect coverage changes

claude

I didn't surface any bugs, but this PR rewrites the milestone-mismatch rewind state machine and adds new chain-reorg behavior (blind rewind on attested ancestor + cross-component whitelist purge), so it warrants a careful human review of the consensus path.

Extended reasoning...

Overview

The PR fixes a real deadlock where a forked Bor node couldn't auto-recover: GetBlocksFromHash returned nothing because the canonical sidechain only existed on peers, so the rewind path bailed; meanwhile the downloader rejected those peers as 'sidechain ghost-state attack' because the bad-fork tip was still local. Recovery previously required wiping chain data. The fix introduces an attested-anchor concept (whitelist hash matches local hash at the anchor block and num < end) and, only for milestones, performs SetHead to that anchor without inserting a canonical chain, letting the downloader resync from peers. It also adds PurgeMilestonesAfter to drop stale whitelist/lock/future-milestone state above the new head so the downloader's validators don't reject canonical peers post-rewind.
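The attested-anchor idea described in this overview can be written as a small predicate. This is a sketch only: `canonicalHashes` and `isAttested` are hypothetical names, the map stands in for a local canonical-hash lookup, and the hashes and block numbers are made up for illustration.

```go
package main

import "fmt"

// canonicalHashes is a hypothetical stand-in for the node's local
// number -> canonical hash lookup.
type canonicalHashes map[uint64]string

// isAttested reports whether block num can serve as a rewind anchor:
// it must lie strictly below the milestone end block, and the hash the
// whitelist expects at num must equal the hash the node already has there.
func isAttested(local canonicalHashes, num, end uint64, expectedHash string) bool {
	got, ok := local[num]
	return ok && num < end && got == expectedHash
}

func main() {
	local := canonicalHashes{589: "0xabc", 590: "0xfork"}

	fmt.Println(isAttested(local, 589, 591, "0xabc"))   // node already agrees at 589
	fmt.Println(isAttested(local, 590, 591, "0xcanon")) // local hash differs at 590
	fmt.Println(isAttested(local, 591, 591, "0xabc"))   // num is not below end
}
```

Only the first call yields an anchor. The point of the predicate is that a rewind target is trusted only where the node's own chain already matches the expected hash.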

Security risks

The new blind-rewind path is gated by three conjoined conditions (not a checkpoint, rewindAttested is true, and head-rewindTo <= maxRewindLen) and the attestation explicitly requires num < end && GetCanonicalHash(num) == expectedHash, so an attacker can't induce a rewind on a hash they control unless that hash already matches the local canonical chain at that height — i.e. the local node would have to already agree. The checkpoint path is intentionally excluded because the checkpoint hash is a merkle root, not a block hash, and cannot attest. The maxRewindLen clamp prevents unbounded rewinds and explicitly clears rewindAttested when the target is clamped. PurgeMilestonesAfter also caps lastValidForkBlock so checkForkCorrectness's max(milestoneNumber, lastValidForkBlock) bound doesn't blind-accept peer chains past the rewind anchor — a subtle but important detail. The risks I'd want a human to weigh are: (1) whether the attested predicate is sufficient under adversarial peer scenarios where a milestone might be processed before reorg detection completes, (2) the genesis-anchor edge case (rewindTo=0 with an attested genesis hash deliberately distinguished from 'unset'), and (3) whether dropping the future-milestone list could in some scenario weaken protection against a fork the node was correctly rejecting.
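The gating described in this section can be sketched as three small functions. Everything here is an assumption made for illustration: the function names, the `maxLen` value of 100, and the extra `rewindTo > head` guard are not taken from the PR. Only the three conjoined conditions, the clamp that clears the attestation, and the cap on the fork bound follow the description above.

```go
package main

import "fmt"

// clampRewind limits how far back a rewind may go. If the requested target is
// more than maxLen below head, the target is raised to head-maxLen and the
// attestation is cleared, because the clamped block was never attested.
func clampRewind(head, rewindTo, maxLen uint64, attested bool) (uint64, bool) {
	if head > rewindTo && head-rewindTo > maxLen {
		return head - maxLen, false
	}
	return rewindTo, attested
}

// allowBlindRewind combines the three gates: not a checkpoint, an attested
// target, and a rewind distance within maxLen. The rewindTo > head guard is an
// extra safety check added in this sketch to avoid unsigned underflow.
func allowBlindRewind(isCheckpoint, attested bool, head, rewindTo, maxLen uint64) bool {
	if isCheckpoint || !attested || rewindTo > head {
		return false
	}
	return head-rewindTo <= maxLen
}

// capForkBound lowers a previously recorded fork bound to the new head so that
// nothing above the rewind anchor is treated as already validated.
func capForkBound(lastValidForkBlock, newHead uint64) uint64 {
	if lastValidForkBlock > newHead {
		return newHead
	}
	return lastValidForkBlock
}

func main() {
	const maxLen = 100 // illustrative value only

	target, attested := clampRewind(639, 589, maxLen, true)
	fmt.Println(target, attested, allowBlindRewind(false, attested, 639, target, maxLen))

	target, attested = clampRewind(639, 100, maxLen, true)
	fmt.Println(target, attested, allowBlindRewind(false, attested, 639, target, maxLen))

	fmt.Println(allowBlindRewind(true, true, 639, 589, maxLen)) // checkpoints never qualify
	fmt.Println(capForkBound(620, 589))
}
```

In this model a short attested rewind is allowed, while a rewind that had to be clamped loses its attestation and falls back to the non-blind path.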

Level of scrutiny

High. This touches the consensus-validation rewind path, modifies chain reorg behavior, and adds a new interface method (PurgeMilestonesAfter) that ripples across ChainValidator consumers. Bugs here can desync nodes from the canonical chain or — in the worst case — make a node accept a non-canonical fork. The author has added thoughtful test coverage including negative cases (whitelist hash differs, rewindTo==end no-op, unattested fallback, genesis anchor), and the change was validated on a live devnet, but the state-machine reasoning is dense enough that a maintainer familiar with this code path should sign off rather than an automated reviewer.

Other factors

Code quality is high: the new helpers (pauseMiner, rewind, reorgToFinalized now returning bool) are sensible refactors that fix a previously ignored SetHead error; comments explain the why (e.g. race between findCommonAncestor's re-read and the outer mismatch check). The PR also adds a metric (PurgeAfterDBErrorMeter) for observability on DB failures during purge. SonarQube quality gate passed. The only nit I noticed is that this is a substantial behavioral change with no CODEOWNERS sign-off visible yet — that alone is reason to defer.

sonarqubecloud · 2026-05-27T15:10:54Z

Quality Gate passed

Issues
3 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

lucca30

lgtm

kamuikatsurgi added 4 commits May 27, 2026 16:09

eth: fix milestone-mismatch rewind deadlock

b80185c

chore: nits

3e60cc6

chore: nits

9e3cce1

chore: nit

4004edb

Copilot AI review requested due to automatic review settings May 27, 2026 12:27

claude Bot reviewed May 27, 2026

Copilot started reviewing on behalf of kamuikatsurgi May 27, 2026 12:27

Copilot AI reviewed May 27, 2026

Comment thread eth/bor_checkpoint_verifier.go

Comment thread eth/downloader/whitelist/service.go

Comment thread eth/bor_checkpoint_verifier_test.go

claude Bot reviewed May 27, 2026

chore: address copilot comments

09fafb5

adamdossa requested review from lucca30 and pratikspatil024 May 28, 2026 10:11

lucca30 approved these changes May 28, 2026

pratikspatil024 approved these changes May 29, 2026

kamuikatsurgi merged commit b267c75 into develop May 29, 2026
20 of 21 checks passed

kamuikatsurgi deleted the kamui/fix-milestone-rewind-deadlock branch May 29, 2026 09:31
