-
Notifications
You must be signed in to change notification settings - Fork 182
feat: rewind chain head on bad fork (wrong actor bundle) #6193
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughAdds actor-bundle-aware head rewinding: new StateManager APIs expose and repeatedly rewind the heaviest tipset when actor-bundle metadata diverges, integrates that rewind into daemon startup, exposes manifest metadata lookup, and enables equality for actor-bundle metadata. Changes
Sequence Diagram(s)sequenceDiagram
participant Daemon
participant SM as StateManager
participant CC as ChainConfig
participant BS as Blockstore
participant Manifest as BuiltinActorManifest
participant Bundles as ACTOR_BUNDLES_METADATA
Daemon->>SM: maybe_rewind_heaviest_tipset()?
loop until no rewind
SM->>SM: heaviest_tipset() (current head)
SM->>CC: network_height_with_actor_bundle(epoch)
CC->>BS: load manifest CID
BS-->>CC: manifest bytes
CC-->>SM: HeightInfoWithActorManifest (height, info, manifest_cid)
SM->>Manifest: manifest.metadata()?
Manifest->>Bundles: lookup matching ActorBundleMetadata
alt metadata matches current state
SM-->>SM: stop rewinding
else metadata diverged or missing
SM->>SM: rewind heaviest tipset one step (repeat)
end
end
SM-->>Daemon: done
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches✅ Passed checks (5 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
e4630b7 to
aed3c9c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/daemon/mod.rs (1)
598-611: Run rewind again after snapshot import (before services continue)Import sets a new HEAD that may still be on the bad fork. Call rewind immediately after import to ensure the node evaluates the correct chain before continuing (and also covers halt-after-import). Minimal patch:
@@ - maybe_import_snapshot(opts, &mut config, &ctx).await?; + maybe_import_snapshot(opts, &mut config, &ctx).await?; + // Re-evaluate HEAD after import to ensure we rewind off any bad fork introduced by the snapshot + ctx.state_manager.maybe_rewind_heaviest_tipset()?; if opts.halt_after_import {Also applies to: 613-619
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (6)
Cargo.toml(1 hunks)src/daemon/mod.rs(1 hunks)src/networks/actors_bundle.rs(2 hunks)src/networks/mod.rs(3 hunks)src/shim/machine/manifest.rs(2 hunks)src/state_manager/mod.rs(8 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
src/networks/mod.rs (2)
src/shim/state_tree.rs (2)
store(180-180)store(274-281)src/shim/machine/manifest.rs (1)
load_manifest(126-134)
src/daemon/mod.rs (3)
src/rpc/methods/sync.rs (1)
ctx(177-252)src/tool/subcommands/api_cmd/generate_test_snapshot.rs (1)
ctx(106-159)src/tool/subcommands/api_cmd/test_snapshot.rs (1)
ctx(127-178)
src/state_manager/mod.rs (2)
src/chain/store/chain_store.rs (1)
heaviest_tipset(221-229)src/blocks/tipset.rs (10)
parent_state(349-351)parent_state(535-537)from(105-110)from(160-162)from(166-168)from(172-177)from(181-186)from(205-214)from(479-484)chain(390-397)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
- GitHub Check: Build forest binaries on Linux AMD64
- GitHub Check: tests
- GitHub Check: tests-release
- GitHub Check: V1 snapshot export checks
- GitHub Check: Forest CLI checks
- GitHub Check: Calibnet stateless mode check
- GitHub Check: Calibnet eth mapping check
- GitHub Check: Calibnet no discovery checks
- GitHub Check: Calibnet kademlia checks
- GitHub Check: Wallet tests
- GitHub Check: Diff snapshot export checks
- GitHub Check: db-migration-checks
- GitHub Check: State migrations
- GitHub Check: Bootstrap checks - Forest
- GitHub Check: Calibnet check
- GitHub Check: V2 snapshot export checks
- GitHub Check: Devnet checks
- GitHub Check: Calibnet api test-stateful check
🔇 Additional comments (4)
Cargo.toml (1)
66-66: Educe PartialEq feature addition — OKMatches the new #[educe(PartialEq)] usage; no concerns.
src/networks/actors_bundle.rs (1)
114-123: Derive semantics look rightIgnoring manifest in PartialEq avoids heavy structural compares and keeps equality on network/version/bundle_cid. Good fit for rewind checks.
src/networks/mod.rs (1)
427-445: Bundle-aware network height helper — LGTMFinds the latest upgrade with a bundle before the given epoch and loads its manifest; behavior is correct.
src/state_manager/mod.rs (1)
218-255: Rewind logic is sound; relies on accurate metadata mappingFlow and checks look good; once metadata lookup (in BuiltinActorManifest::metadata) is fixed to match by bundle CID, this should behave deterministically across networks.
Do a quick local check once the metadata() fix lands:
- Start with a HEAD just after an upgrade with the wrong bundle; confirm the log shows a rewind to (upgrade_epoch-1) and that subsequent validation/migration proceeds.
aed3c9c to
9de356a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/state_manager/mod.rs (1)
218-254: Sound rewind logic with good safety checks.The implementation correctly:
- Compares expected vs. actual actor bundle metadata
- Rewinds to the epoch before the network upgrade (
expected_height_info.epoch - 1)- Validates that the target state exists before committing the rewind
- Prevents unnecessary rewinds when
target_epoch >= current_epochConsider adding inline documentation explaining why the target epoch is
expected_height_info.epoch - 1rather than the epoch itself. This would help future maintainers understand that we're rewinding to just before the upgrade boundary to allow the node to re-evaluate and follow the correct chain after being upgraded.Additionally, the error message at lines 245-248 could include guidance, such as:
anyhow::bail!( "failed to rewind, state tree @ {target_epoch} is missing from blockstore: {}. \ This may indicate an incomplete or corrupted snapshot. Consider re-syncing from a trusted snapshot.", target_head.parent_state() );
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (6)
Cargo.toml(1 hunks)src/daemon/mod.rs(1 hunks)src/networks/actors_bundle.rs(2 hunks)src/networks/mod.rs(3 hunks)src/shim/machine/manifest.rs(2 hunks)src/state_manager/mod.rs(8 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- src/shim/machine/manifest.rs
- src/daemon/mod.rs
- src/networks/mod.rs
- src/networks/actors_bundle.rs
🧰 Additional context used
🧬 Code graph analysis (1)
src/state_manager/mod.rs (3)
src/chain/store/chain_store.rs (1)
heaviest_tipset(221-229)src/blocks/tipset.rs (10)
parent_state(349-351)parent_state(535-537)from(105-110)from(160-162)from(166-168)from(172-177)from(181-186)from(205-214)from(479-484)chain(390-397)src/chain/store/index.rs (1)
chain(191-198)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: All lint checks
- GitHub Check: Build forest binaries on Linux AMD64
- GitHub Check: cargo-publish-dry-run
- GitHub Check: Build MacOS
- GitHub Check: Build Ubuntu
- GitHub Check: tests
- GitHub Check: tests-release
🔇 Additional comments (4)
Cargo.toml (1)
66-66: No issues found — the educe dependency change is correct and well-motivated.educe 0.6 supports both Debug and PartialEq features with per-field controls, matching the requirements in this PR. The struct
ActorBundleMetadataat src/networks/actors_bundle.rs:116 already derives#[educe(PartialEq)]with selective field ignoring, confirming the PartialEq feature is actively in use. The version constraint relaxation from "0.6.0" to "0.6" allows patch updates and is standard practice. The change aligns with the codebase's existing patterns and needs.src/state_manager/mod.rs (3)
207-210: LGTM: Clean abstraction for accessing the heaviest tipset.This wrapper method provides a convenient way to access the heaviest tipset and enables consistent usage across the state manager.
212-216: LGTM: Correctly handles multiple consecutive rewinds.The loop correctly handles scenarios where the node missed multiple network upgrades. Each iteration re-validates the current head against the expected actor bundle.
While the logic is correct, this loop could perform multiple expensive operations during startup if many network upgrades were missed. Consider verifying that this startup delay is acceptable in your operational environment, or add logging to track the number of iterations for observability.
263-263: LGTM: Consistent refactoring to use the new accessor.All call sites correctly updated to use
self.heaviest_tipset()instead of directly accessingself.cs.heaviest_tipset(). This improves encapsulation and maintains consistency with the new API.Also applies to: 662-662, 672-672, 716-716, 1129-1129, 1248-1251, 1548-1548, 1728-1728
src/networks/actors_bundle.rs
Outdated
| #[educe(PartialEq(ignore))] | ||
| pub manifest: BuiltinActorManifest, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we ignore it for PartialEq?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. I thought BuiltinActorManifest does not implement PartialEq
src/networks/mod.rs
Outdated
| .map(|(height, _)| *height) | ||
| } | ||
|
|
||
| pub fn network_height_with_actor_bundle( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have some unit tests and docs here? Also, the return type is complex - could we contrive a type aggregating it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
| pub fn builtin_actors(&self) -> impl ExactSizeIterator<Item = (BuiltinActor, Cid)> + '_ { | ||
| self.builtin2cid.iter().map(|(k, v)| (*k, *v)) // std::iter::Copied doesn't play well with the tuple here | ||
| } | ||
| /// Get the actor bundle metadata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's trivial, but a unit test for better coverage would be nice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
| self.chain_store().heaviest_tipset() | ||
| } | ||
|
|
||
| /// Returns the currently tracked heaviest tipset and rewind to a most recent valid one if necessary. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you expand on what valid means in this context?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
| "num-traits", | ||
| ] | ||
|
|
||
| [[package]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Including Cargo.lock change because dependabot has stopped working for Rust deps for a few weeks. https://github.com/ChainSafe/forest/network/updates/13772088/jobs
LesnyRumcajs
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Summary of changes
There's a common case that an old node would go into a bad fork after a network upgrade. This PR allows the node to automatically rewind the chain head and perform the network upgrade after the node binary is updated
Manual test
Log
Changes introduced in this pull request:
Reference issue to close (if applicable)
Closes #6089
Other information and links
Change checklist
Summary by CodeRabbit
New Features
Improvements
Chores