Skip to content

consensus: fix milestone-mismatch rewind deadlock#2246

Merged
kamuikatsurgi merged 5 commits into
developfrom
kamui/fix-milestone-rewind-deadlock
May 29, 2026
Merged

consensus: fix milestone-mismatch rewind deadlock#2246
kamuikatsurgi merged 5 commits into
developfrom
kamui/fix-milestone-rewind-deadlock

Conversation

@kamuikatsurgi
Copy link
Copy Markdown
Member

Summary

Bor nodes that forked off the canonical chain due to a Heimdall startup hiccup, span propagation gap, or similar issue could not recover automatically. The mismatch-rewind path refused to act without the canonical sidechain in the local DB, while the downloader refused to deliver that sidechain because the bad-fork tip was already in place. This created a chicken-and-egg deadlock where a manual chain data wipe was the only recovery path.

INFO [06-24-32.167] Sealing successful                       number=378
INFO [06-24-52.504] Whitelisting new milestone from heimdall block=397 hash=aafd6f..e695f7
INFO [06-25-01.113] Sealing successful                       number=407  ← peak
WARN [06-25-03.949] End block hash mismatch while whitelisting milestone expected=919b75.. got=91b2b4..
INFO [06-25-03.949] Rewinding chain due to milestone endblock hash mismatch number=397
WARN [06-25-03.949] Milestone mismatch: refusing to rewind without a canonical chain segment head=407 rewindTo=397 start=398 end=398
WARN [06-25-08.310] Sidechain ghost-state attack detected    number=398 sideroot=b4d9e9.. canonroot=b4d9e9..
WARN [06-26-20.967] Synchronisation failed, dropping peer    err="retrieved hash chain is invalid: sidechain ghost-state attack"

Root cause

In eth/bor_checkpoint_verifier.go before the fix:

canonicalChain = eth.BlockChain().GetBlocksFromHash(canonicalHash, length)
if len(canonicalChain) == 0 {
    log.Warn("Milestone mismatch: refusing to rewind without a canonical chain segment")
    return errHashMismatch
}

GetBlocksFromHash only returns blocks that already exist in the local DB. After a fork, those canonical blocks live on peers, but the downloader rejects them as sidechain data. The refusal logic predates the fix rewritten in this PR and was originally added in PR #23 to prevent a separate “merkle root reorg on nil chain” failure mode. This fix preserves that protection while allowing recovery.

Validation on a live devnet

Two RPC nodes hit the bug naturally and recovered:

WARN [11:31:47.845] End block hash mismatch while whitelisting milestone expected=ebe1a9bd.. got=fabc9940..
WARN [11:31:47.845] Milestone mismatch: rewinding to attested canonical ancestor without local sidechain; canonical chain will resync from peers head=639 rewindTo=589 start=590 end=591
WARN [11:31:47.845] Rewinding blockchain to block            target=589
INFO [11:31:47.845] Rewound to block with state              number=638..589

Copilot AI review requested due to automatic review settings May 27, 2026 12:27
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.

Tip: disable this comment in your organization's Code Review settings.

@kamuikatsurgi
Copy link
Copy Markdown
Member Author

@claude review once

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a recovery deadlock where Bor nodes on a bad fork could not automatically rewind on a milestone mismatch because the canonical sidechain segment was missing locally and the downloader would refuse to fetch it while the bad-fork tip was still present. The updated rewind logic allows rewinding to an attested canonical ancestor (when the canonical segment isn’t locally available), and purges stale milestone/fork-validation state so canonical resync from peers can proceed.

Changes:

  • Update milestone-mismatch handling to allow rewinding to an attested canonical ancestor even when the canonical chain segment isn’t present locally, while preserving the “don’t rewind blindly” safety checks.
  • Add milestone-state purge support (PurgeMilestonesAfter / PurgeAfter) to clear stale whitelisted/locked/future milestone entries and fork-validation cache after a rewind.
  • Add rawdb helper to delete persisted finality entries and introduce new tests covering purge and rewind behavior.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
interfaces.go Extends ChainValidator with PurgeMilestonesAfter to support post-rewind cleanup.
eth/downloader/whitelist/service.go Implements PurgeMilestonesAfter to purge milestone state and reset fork-validation cache/bounds.
eth/downloader/whitelist/milestone.go Adds PurgeAfter to drop stale milestone whitelist/lock/future entries in-memory and on-disk.
eth/downloader/whitelist/milestone_test.go Adds unit tests validating PurgeAfter behavior for memory+disk combinations.
eth/downloader/bor_downloader_test.go Updates whitelist fake to satisfy the extended ChainValidator interface.
eth/bor_checkpoint_verifier.go Refactors mismatch rewind logic (attested blind rewind fallback), adds miner pause helper, and invokes milestone purge after successful rewinds.
eth/bor_checkpoint_verifier_test.go Adds integration tests for the new rewind behavior and updates helper logic.
core/rawdb/milestone.go Adds DeleteLastFinality helper used by milestone purge.
core/forkchoice_test.go Updates chain validator fake to satisfy the extended ChainValidator interface.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread eth/bor_checkpoint_verifier.go
Comment thread eth/downloader/whitelist/service.go
Comment thread eth/bor_checkpoint_verifier_test.go
@codecov
Copy link
Copy Markdown

codecov Bot commented May 27, 2026

Codecov Report

❌ Patch coverage is 62.41135% with 53 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.91%. Comparing base (7fdfdb4) to head (09fafb5).

Files with missing lines Patch % Lines
eth/bor_checkpoint_verifier.go 63.63% 23 Missing and 5 partials ⚠️
eth/downloader/whitelist/milestone.go 77.77% 7 Missing and 3 partials ⚠️
eth/downloader/whitelist/service.go 30.76% 9 Missing ⚠️
core/rawdb/milestone.go 0.00% 6 Missing ⚠️

❌ Your patch check has failed because the patch coverage (62.41%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #2246      +/-   ##
===========================================
+ Coverage    52.84%   52.91%   +0.07%     
===========================================
  Files          886      886              
  Lines       156786   156863      +77     
===========================================
+ Hits         82848    83008     +160     
+ Misses       68688    68605      -83     
  Partials      5250     5250              
Files with missing lines Coverage Δ
interfaces.go 0.00% <ø> (ø)
core/rawdb/milestone.go 0.00% <0.00%> (ø)
eth/downloader/whitelist/service.go 89.15% <30.76%> (-2.84%) ⬇️
eth/downloader/whitelist/milestone.go 81.71% <77.77%> (-2.26%) ⬇️
eth/bor_checkpoint_verifier.go 58.48% <63.63%> (+12.66%) ⬆️

... and 21 files with indirect coverage changes

Files with missing lines Coverage Δ
interfaces.go 0.00% <ø> (ø)
core/rawdb/milestone.go 0.00% <0.00%> (ø)
eth/downloader/whitelist/service.go 89.15% <30.76%> (-2.84%) ⬇️
eth/downloader/whitelist/milestone.go 81.71% <77.77%> (-2.26%) ⬇️
eth/bor_checkpoint_verifier.go 58.48% <63.63%> (+12.66%) ⬆️

... and 21 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't surface any bugs, but this PR rewrites the milestone-mismatch rewind state machine and adds new chain-reorg behavior (blind rewind on attested ancestor + cross-component whitelist purge), so it warrants a careful human review of the consensus path.

Extended reasoning...

Overview

The PR fixes a real deadlock where a forked Bor node couldn't auto-recover: GetBlocksFromHash returned nothing because the canonical sidechain only existed on peers, so the rewind path bailed; meanwhile the downloader rejected those peers as 'sidechain ghost-state attack' because the bad-fork tip was still local. Recovery previously required wiping chain data. The fix introduces an attested-anchor concept (whitelist hash matches local hash at the anchor block and num < end) and, only for milestones, performs SetHead to that anchor without inserting a canonical chain, letting the downloader resync from peers. It also adds PurgeMilestonesAfter to drop stale whitelist/lock/future-milestone state above the new head so the downloader's validators don't reject canonical peers post-rewind.

Security risks

The new blind-rewind path is gated by three conjoined conditions (not a checkpoint, rewindAttested is true, and head-rewindTo <= maxRewindLen) and the attestation explicitly requires num < end && GetCanonicalHash(num) == expectedHash, so an attacker can't induce a rewind on a hash they control unless that hash already matches the local canonical chain at that height — i.e. the local node would have to already agree. The checkpoint path is intentionally excluded because the checkpoint hash is a merkle root, not a block hash, and cannot attest. The maxRewindLen clamp prevents unbounded rewinds and explicitly clears rewindAttested when the target is clamped. PurgeMilestonesAfter also caps lastValidForkBlock so checkForkCorrectness's max(milestoneNumber, lastValidForkBlock) bound doesn't blind-accept peer chains past the rewind anchor — a subtle but important detail. The risks I'd want a human to weigh are: (1) whether the attested predicate is sufficient under adversarial peer scenarios where a milestone might be processed before reorg detection completes, (2) the genesis-anchor edge case (rewindTo=0 with an attested genesis hash deliberately distinguished from 'unset'), and (3) whether dropping the future-milestone list could in some scenario weaken protection against a fork the node was correctly rejecting.

Level of scrutiny

High. This touches the consensus-validation rewind path, modifies chain reorg behavior, and adds a new interface method (PurgeMilestonesAfter) that ripples across ChainValidator consumers. Bugs here can desync nodes from the canonical chain or — in the worst case — make a node accept a non-canonical fork. The author has added thoughtful test coverage including negative cases (whitelist hash differs, rewindTo==end no-op, unattested fallback, genesis anchor), and the change was validated on a live devnet, but the state-machine reasoning is dense enough that a maintainer familiar with this code path should sign off rather than an automated reviewer.

Other factors

Code quality is high: the new helpers (pauseMiner, rewind, reorgToFinalized now returning bool) are sensible refactors that fix a previously ignored SetHead error; comments explain the why (e.g. race between findCommonAncestor's re-read and the outer mismatch check). The PR also adds a metric (PurgeAfterDBErrorMeter) for observability on DB failures during purge. SonarQube quality gate passed. The only nit I noticed is that this is a substantial behavioral change with no CODEOWNERS sign-off visible yet — that alone is reason to defer.

@sonarqubecloud
Copy link
Copy Markdown

Copy link
Copy Markdown
Contributor

@lucca30 lucca30 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@kamuikatsurgi kamuikatsurgi merged commit b267c75 into develop May 29, 2026
20 of 21 checks passed
@kamuikatsurgi kamuikatsurgi deleted the kamui/fix-milestone-rewind-deadlock branch May 29, 2026 09:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants