
Conversation

@hanabi1224
Contributor

@hanabi1224 hanabi1224 commented Oct 27, 2025

Summary of changes

A node running an outdated binary commonly ends up on a bad fork after a network upgrade. With this PR, once the binary is updated, the node automatically rewinds the chain head on startup and then performs the network upgrade.

Manual test

ln forest_snapshot_calibnet_2025-09-11_height_3008356.forest.car.zst calibnet/0.30.2/car_db/forest_snapshot_calibnet_2025-09-11_height_3008356.forest.car.zst
forest --encrypt-keystore=false --no-gc --chain calibnet &
forest-cli chain set-head bafy2bzaced2v35cwfc6yvibuw544kxhztvwwqow66k3fp7lvi364qlm5m5ls2
forest-cli shutdown --force
forest --encrypt-keystore=false --no-gc --chain calibnet

Log

2025-10-27T15:29:25.999257Z  INFO forest::daemon: Using network :: calibnet
2025-10-27T15:29:26.003400Z  WARN forest::state_manager: rewinding chain head from 3008356 to 3007293, actor bundle: v16.0.1, expected: v17.0.0
2025-10-27T15:29:26.005411Z  INFO forest::libp2p::behaviour: libp2p Forest version: 0.30.2+git.5b60a60da
...
2025-10-27T15:29:34.846117Z  INFO forest::state_manager: Evaluating tipset: EPOCH=3007294, blocks=1, tsk=[bafy2bzaceb54as5qltgjajpra6fzqusqxhcghhhgl4ub5smlurz64ut2ytafa]
2025-10-27T15:29:34.991606Z  INFO forest::state_manager: Evaluating tipset: EPOCH=3007295, blocks=2, tsk=[bafy2bzaceatbrjx43qwlrmchfazccqzmmryzpptcy3d7xwnbuoeeawwnfb4oc, bafy2bzacedvprarzalosodtiqo34f4woggcj6cxoaswpptwzrgemuqwyn5jmg]
2025-10-27T15:29:34.991832Z  INFO compute_tipset_state_blocking: forest::state_migration: Running GoldenWeek migration at epoch 3007294

Changes introduced in this pull request:

Reference issue to close (if applicable)

Closes #6089

Other information and links

Change checklist

  • I have performed a self-review of my own code,
  • I have made corresponding changes to the documentation. All new code adheres to the team's documentation standards,
  • I have added tests that prove my fix is effective or that my feature works (if possible),
  • I have made sure the CHANGELOG is up-to-date. All user-facing changes should be reflected in this document.

Summary by CodeRabbit

  • New Features

    • Exposed tipset tracking API and on-demand validation to rewind node state to the most recent valid tipset.
    • Added manifest metadata access for actor bundles.
  • Improvements

    • Actor bundle metadata supports equality comparisons.
    • Network configuration can locate actor bundles by network height.
    • Node startup may perform a state rewind when divergence is detected.
  • Chores

    • Dependency version updated.

@coderabbitai
Contributor

coderabbitai bot commented Oct 27, 2025

Walkthrough

Adds actor-bundle-aware head rewinding: new StateManager APIs expose and repeatedly rewind the heaviest tipset when actor-bundle metadata diverges, integrates that rewind into daemon startup, exposes manifest metadata lookup, and enables equality for actor-bundle metadata.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Dependency: Cargo.toml | Updated educe dependency from "0.6.0" to "0.6". |
| Daemon startup: src/daemon/mod.rs | Call inserted: ctx.state_manager.maybe_rewind_heaviest_tipset()? during service startup (before P2P creation). |
| State manager: src/state_manager/mod.rs | Added pub fn heaviest_tipset(&self) -> Arc<Tipset> and pub fn maybe_rewind_heaviest_tipset(&self) -> anyhow::Result<()>; added single-step rewind helper and replaced internal uses of the old heaviest accessor. |
| Network helpers: src/networks/mod.rs | Added HeightInfoWithActorManifest<'a> wrapper and pub fn network_height_with_actor_bundle<'a>(&'a self, epoch: ChainEpoch) -> Option<HeightInfoWithActorManifest<'a>>; added imports for Blockstore and BuiltinActorManifest; added tests test_network_height_with_actor_bundle. |
| Actor bundle types: src/networks/actors_bundle.rs | ActorBundleMetadata now derives PartialEq (derive list extended). |
| Manifest metadata accessor: src/shim/machine/manifest.rs | Added pub fn metadata(&self) -> anyhow::Result<&ActorBundleMetadata> on BuiltinActorManifest; added related imports and unit test test_manifest_metadata. |

Sequence Diagram(s)

sequenceDiagram
    participant Daemon
    participant SM as StateManager
    participant CC as ChainConfig
    participant BS as Blockstore
    participant Manifest as BuiltinActorManifest
    participant Bundles as ACTOR_BUNDLES_METADATA

    Daemon->>SM: maybe_rewind_heaviest_tipset()?
    loop until no rewind
        SM->>SM: heaviest_tipset() (current head)
        SM->>CC: network_height_with_actor_bundle(epoch)
        CC->>BS: load manifest CID
        BS-->>CC: manifest bytes
        CC-->>SM: HeightInfoWithActorManifest (height, info, manifest_cid)
        SM->>Manifest: manifest.metadata()?
        Manifest->>Bundles: lookup matching ActorBundleMetadata
        alt metadata matches current state
            SM-->>SM: stop rewinding
        else metadata diverged or missing
            SM->>SM: rewind heaviest tipset one step (repeat)
        end
    end
    SM-->>Daemon: done
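
The flow in the diagram can be sketched as a small, self-contained model. This is not Forest's implementation: `ToyChain`, `BundleMeta`, `expected_at` and `actual_at` are hypothetical stand-ins, the state and upgrade schedules are plain maps, and treating the upgrade lookup as inclusive of the given epoch is an assumption. The sketch only illustrates why the loop terminates: every successful step strictly lowers the head epoch.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for actor bundle metadata (not Forest's type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BundleMeta {
    pub version: &'static str,
}

/// Toy chain model: the head epoch plus two schedules keyed by epoch.
pub struct ToyChain {
    pub head: i64,
    /// Upgrade epoch -> bundle the current binary expects from that epoch on.
    pub upgrades: BTreeMap<i64, BundleMeta>,
    /// Epoch -> bundle actually recorded in the local state from that epoch on.
    pub state_bundles: BTreeMap<i64, BundleMeta>,
}

impl ToyChain {
    /// Latest upgrade at or before `epoch` (inclusive bound is an assumption).
    fn expected_at(&self, epoch: i64) -> Option<(i64, BundleMeta)> {
        self.upgrades.range(..=epoch).next_back().map(|(e, b)| (*e, *b))
    }

    /// Bundle found in the local state at `epoch`, if any.
    fn actual_at(&self, epoch: i64) -> Option<BundleMeta> {
        self.state_bundles.range(..=epoch).next_back().map(|(_, b)| *b)
    }

    /// One rewind step; returns true when the head was moved.
    pub fn rewind_once(&mut self) -> bool {
        let Some((upgrade_epoch, expected)) = self.expected_at(self.head) else {
            return false;
        };
        if self.actual_at(self.head) == Some(expected) {
            return false; // metadata matches, nothing to do
        }
        let target = upgrade_epoch - 1; // just before the upgrade boundary
        if target >= self.head {
            return false;
        }
        self.head = target;
        true
    }

    /// Repeat single steps until the head is consistent with the schedule.
    pub fn maybe_rewind(&mut self) {
        while self.rewind_once() {}
    }
}
```

With an upgrade at epoch 90 and a local state that never migrated, a head at epoch 100 ends up at epoch 89; a head whose state already carries the expected bundle is left alone.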

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Areas needing extra attention:
    • Correctness and termination conditions of maybe_rewind_heaviest_tipset loop and single-step rewind.
    • Safety and correctness of replacing prior heaviest-accessor call sites with heaviest_tipset().
    • Error handling and messaging when BuiltinActorManifest::metadata() fails to find matching bundle metadata.
    • Interactions and ordering when invoking rewind during daemon startup.

Possibly related PRs

Suggested reviewers

  • akaladarshi
  • LesnyRumcajs

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title Check | ✅ Passed | The pull request title "feat: rewind chain head on bad fork (wrong actor bundle)" clearly and directly describes the primary objective of the changeset. The changes implement automatic chain head rewinding logic triggered by actor bundle mismatches, including new state manager APIs, actor bundle metadata comparison support, and daemon-level invocation of the rewind logic. The title is concise, specific, and accurately conveys the main feature being introduced without vague terminology. |
| Linked Issues Check | ✅ Passed | The pull request directly implements the requirements from linked issue #6089 to detect and automatically rewind chain head on bad forks caused by actor bundle mismatches. The changes introduce public API methods in StateManager (heaviest_tipset, maybe_rewind_heaviest_tipset, and maybe_rewind_heaviest_tipset_once) that detect actor bundle mismatches and rewind to valid states, new struct HeightInfoWithActorManifest to retrieve manifests alongside height info, and PartialEq support for ActorBundleMetadata to enable metadata comparison. The daemon integration in start_services ensures rewinding occurs during initialization. The manual testing evidence provided in the PR description demonstrates successful detection and rewinding with proper actor bundle version validation. |
| Out of Scope Changes Check | ✅ Passed | All changes in the pull request are directly related to the objective of implementing chain head rewind on bad forks due to actor bundle mismatches. The dependency update in Cargo.toml supports the PartialEq derive addition, the actor bundle metadata struct receives PartialEq for equality comparisons needed during fork detection, the networks module gains struct and methods to retrieve actor manifest metadata, the manifest module adds metadata lookup capability, and state_manager implements the core rewinding logic. The daemon integration invokes the rewind during service startup. No extraneous or unrelated modifications are present. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

@hanabi1224 hanabi1224 force-pushed the hm/rewind-on-bad-fork branch from e4630b7 to aed3c9c Compare October 27, 2025 15:49
@hanabi1224 hanabi1224 marked this pull request as ready for review October 27, 2025 16:01
@hanabi1224 hanabi1224 requested a review from a team as a code owner October 27, 2025 16:01
@hanabi1224 hanabi1224 requested review from LesnyRumcajs and elmattic and removed request for a team October 27, 2025 16:01
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/daemon/mod.rs (1)

598-611: Run rewind again after snapshot import (before services continue)

Import sets a new HEAD that may still be on the bad fork. Call rewind immediately after import to ensure the node evaluates the correct chain before continuing (and also covers halt-after-import). Minimal patch:

@@
-    maybe_import_snapshot(opts, &mut config, &ctx).await?;
+    maybe_import_snapshot(opts, &mut config, &ctx).await?;
+    // Re-evaluate HEAD after import to ensure we rewind off any bad fork introduced by the snapshot
+    ctx.state_manager.maybe_rewind_heaviest_tipset()?;
     if opts.halt_after_import {

Also applies to: 613-619

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5b60a60 and aed3c9c.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (6)
  • Cargo.toml (1 hunks)
  • src/daemon/mod.rs (1 hunks)
  • src/networks/actors_bundle.rs (2 hunks)
  • src/networks/mod.rs (3 hunks)
  • src/shim/machine/manifest.rs (2 hunks)
  • src/state_manager/mod.rs (8 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
src/networks/mod.rs (2)
src/shim/state_tree.rs (2)
  • store (180-180)
  • store (274-281)
src/shim/machine/manifest.rs (1)
  • load_manifest (126-134)
src/daemon/mod.rs (3)
src/rpc/methods/sync.rs (1)
  • ctx (177-252)
src/tool/subcommands/api_cmd/generate_test_snapshot.rs (1)
  • ctx (106-159)
src/tool/subcommands/api_cmd/test_snapshot.rs (1)
  • ctx (127-178)
src/state_manager/mod.rs (2)
src/chain/store/chain_store.rs (1)
  • heaviest_tipset (221-229)
src/blocks/tipset.rs (10)
  • parent_state (349-351)
  • parent_state (535-537)
  • from (105-110)
  • from (160-162)
  • from (166-168)
  • from (172-177)
  • from (181-186)
  • from (205-214)
  • from (479-484)
  • chain (390-397)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
  • GitHub Check: Build forest binaries on Linux AMD64
  • GitHub Check: tests
  • GitHub Check: tests-release
  • GitHub Check: V1 snapshot export checks
  • GitHub Check: Forest CLI checks
  • GitHub Check: Calibnet stateless mode check
  • GitHub Check: Calibnet eth mapping check
  • GitHub Check: Calibnet no discovery checks
  • GitHub Check: Calibnet kademlia checks
  • GitHub Check: Wallet tests
  • GitHub Check: Diff snapshot export checks
  • GitHub Check: db-migration-checks
  • GitHub Check: State migrations
  • GitHub Check: Bootstrap checks - Forest
  • GitHub Check: Calibnet check
  • GitHub Check: V2 snapshot export checks
  • GitHub Check: Devnet checks
  • GitHub Check: Calibnet api test-stateful check
🔇 Additional comments (4)
Cargo.toml (1)

66-66: Educe PartialEq feature addition — OK

Matches the new #[educe(PartialEq)] usage; no concerns.

src/networks/actors_bundle.rs (1)

114-123: Derive semantics look right

Ignoring manifest in PartialEq avoids heavy structural compares and keeps equality on network/version/bundle_cid. Good fit for rewind checks.
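
To make the semantics concrete, here is a hand-written sketch of what ignoring one field in equality amounts to. It does not use educe, and `Meta` with its placeholder field types is a hypothetical stand-in for ActorBundleMetadata; only the field names follow the comment above.

```rust
/// Hypothetical stand-in for ActorBundleMetadata with placeholder field types.
#[derive(Debug, Clone)]
pub struct Meta {
    pub network: String,
    pub version: String,
    pub bundle_cid: String,
    /// Potentially large payload that the comparison deliberately skips.
    pub manifest: Vec<u8>,
}

impl PartialEq for Meta {
    fn eq(&self, other: &Self) -> bool {
        // Equality is decided by the identifying fields only;
        // `manifest` is not compared.
        self.network == other.network
            && self.version == other.version
            && self.bundle_cid == other.bundle_cid
    }
}
```

Under this definition two values that differ only in `manifest` compare equal, which is the trade-off a reviewer has to accept when a field is excluded.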

src/networks/mod.rs (1)

427-445: Bundle-aware network height helper — LGTM

Finds the latest upgrade with a bundle before the given epoch and loads its manifest; behavior is correct.
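
A rough sketch of that lookup, under stated assumptions: `HeightInfo`, the string-keyed map and `latest_height_with_bundle` are hypothetical, bundles are reduced to version labels, and whether the epoch bound is inclusive is an assumption here rather than something this thread confirms.

```rust
use std::collections::BTreeMap;

/// Hypothetical upgrade entry: activation epoch plus an optional bundle label.
#[derive(Debug, Clone, PartialEq)]
pub struct HeightInfo {
    pub epoch: i64,
    pub bundle: Option<&'static str>,
}

/// Returns the most recent upgrade that ships a bundle and is active at `epoch`.
pub fn latest_height_with_bundle<'a>(
    height_infos: &'a BTreeMap<&'static str, HeightInfo>,
    epoch: i64,
) -> Option<(&'static str, &'a HeightInfo)> {
    height_infos
        .iter()
        // Skip upgrades without a bundle and upgrades that are not active yet.
        .filter(|(_, info)| info.bundle.is_some() && info.epoch <= epoch)
        // Among the remaining candidates, take the one with the highest epoch.
        .max_by_key(|(_, info)| info.epoch)
        .map(|(name, info)| (*name, info))
}
```

Upgrades that carry no bundle are skipped, so a query between two bundle-carrying upgrades resolves to the earlier one.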

src/state_manager/mod.rs (1)

218-255: Rewind logic is sound; relies on accurate metadata mapping

Flow and checks look good; once metadata lookup (in BuiltinActorManifest::metadata) is fixed to match by bundle CID, this should behave deterministically across networks.

Do a quick local check once the metadata() fix lands:

  • Start with a HEAD just after an upgrade with the wrong bundle; confirm the log shows a rewind to (upgrade_epoch-1) and that subsequent validation/migration proceeds.

@hanabi1224 hanabi1224 force-pushed the hm/rewind-on-bad-fork branch from aed3c9c to 9de356a Compare October 28, 2025 07:00
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/state_manager/mod.rs (1)

218-254: Sound rewind logic with good safety checks.

The implementation correctly:

  • Compares expected vs. actual actor bundle metadata
  • Rewinds to the epoch before the network upgrade (expected_height_info.epoch - 1)
  • Validates that the target state exists before committing the rewind
  • Prevents unnecessary rewinds when target_epoch >= current_epoch

Consider adding inline documentation explaining why the target epoch is expected_height_info.epoch - 1 rather than the epoch itself. This would help future maintainers understand that we're rewinding to just before the upgrade boundary to allow the node to re-evaluate and follow the correct chain after being upgraded.

Additionally, the error message at lines 245-248 could include guidance, such as:

anyhow::bail!(
    "failed to rewind, state tree @ {target_epoch} is missing from blockstore: {}. \
     This may indicate an incomplete or corrupted snapshot. Consider re-syncing from a trusted snapshot.",
    target_head.parent_state()
);
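
The checks listed above can be condensed into one decision function. This is a sketch, not the code under review: `plan_rewind`, `Rewind` and the set of available state epochs are hypothetical, and the blockstore lookup is reduced to a set membership test.

```rust
use std::collections::HashSet;

/// Outcome of a single rewind decision.
#[derive(Debug, PartialEq)]
pub enum Rewind {
    NotNeeded,
    To(i64),
}

/// Decides whether and where to rewind for one upgrade boundary.
pub fn plan_rewind(
    current_epoch: i64,
    upgrade_epoch: i64,
    bundle_matches: bool,
    available_states: &HashSet<i64>,
) -> Result<Rewind, String> {
    if bundle_matches {
        return Ok(Rewind::NotNeeded);
    }
    // Target the last epoch before the upgrade so the migration runs again.
    let target_epoch = upgrade_epoch - 1;
    if target_epoch >= current_epoch {
        return Ok(Rewind::NotNeeded);
    }
    // Refuse to move the head to a state that is not stored locally.
    if !available_states.contains(&target_epoch) {
        return Err(format!(
            "failed to rewind, state tree @ {target_epoch} is missing from blockstore"
        ));
    }
    Ok(Rewind::To(target_epoch))
}
```

Fed the epochs from the log in the PR description (head 3008356, migration at 3007294), the function selects 3007293 as the target, provided that state is available.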
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aed3c9c and cccb004.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (6)
  • Cargo.toml (1 hunks)
  • src/daemon/mod.rs (1 hunks)
  • src/networks/actors_bundle.rs (2 hunks)
  • src/networks/mod.rs (3 hunks)
  • src/shim/machine/manifest.rs (2 hunks)
  • src/state_manager/mod.rs (8 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/shim/machine/manifest.rs
  • src/daemon/mod.rs
  • src/networks/mod.rs
  • src/networks/actors_bundle.rs
🧰 Additional context used
🧬 Code graph analysis (1)
src/state_manager/mod.rs (3)
src/chain/store/chain_store.rs (1)
  • heaviest_tipset (221-229)
src/blocks/tipset.rs (10)
  • parent_state (349-351)
  • parent_state (535-537)
  • from (105-110)
  • from (160-162)
  • from (166-168)
  • from (172-177)
  • from (181-186)
  • from (205-214)
  • from (479-484)
  • chain (390-397)
src/chain/store/index.rs (1)
  • chain (191-198)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: All lint checks
  • GitHub Check: Build forest binaries on Linux AMD64
  • GitHub Check: cargo-publish-dry-run
  • GitHub Check: Build MacOS
  • GitHub Check: Build Ubuntu
  • GitHub Check: tests
  • GitHub Check: tests-release
🔇 Additional comments (4)
Cargo.toml (1)

66-66: No issues found — the educe dependency change is correct and well-motivated.

educe 0.6 supports both Debug and PartialEq features with per-field controls, matching the requirements in this PR. The struct ActorBundleMetadata at src/networks/actors_bundle.rs:116 already derives #[educe(PartialEq)] with selective field ignoring, confirming the PartialEq feature is actively in use. Changing the version requirement from "0.6.0" to "0.6" is cosmetic: under Cargo's default caret semantics both mean >=0.6.0, <0.7.0, so patch updates were already allowed. The shorter form is standard practice and aligns with the codebase's existing patterns.

src/state_manager/mod.rs (3)

207-210: LGTM: Clean abstraction for accessing the heaviest tipset.

This wrapper method provides a convenient way to access the heaviest tipset and enables consistent usage across the state manager.


212-216: LGTM: Correctly handles multiple consecutive rewinds.

The loop correctly handles scenarios where the node missed multiple network upgrades. Each iteration re-validates the current head against the expected actor bundle.

While the logic is correct, this loop could perform multiple expensive operations during startup if many network upgrades were missed. Consider verifying that this startup delay is acceptable in your operational environment, or add logging to track the number of iterations for observability.
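
One way to get that observability is to have the loop report how many rewinds it performed. The sketch below is an illustration only: `rewind_until_stable` is a hypothetical helper, the single-step rewind is passed in as a closure, and the `max_rewinds` cap is an extra safety net that the PR does not describe.

```rust
/// Runs `rewind_once` until it reports that no rewind happened.
/// Returns the number of rewinds performed so the caller can log it.
pub fn rewind_until_stable<F>(mut rewind_once: F, max_rewinds: usize) -> Result<usize, String>
where
    F: FnMut() -> Result<bool, String>,
{
    let mut rewinds = 0usize;
    // Each `true` means the head moved and must be validated again.
    while rewind_once()? {
        rewinds += 1;
        if rewinds > max_rewinds {
            return Err(format!("gave up after {max_rewinds} rewinds"));
        }
    }
    Ok(rewinds)
}
```

A caller could log the returned count at startup; a value above one would indicate that the node had missed several upgrades.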


263-263: LGTM: Consistent refactoring to use the new accessor.

All call sites correctly updated to use self.heaviest_tipset() instead of directly accessing self.cs.heaviest_tipset(). This improves encapsulation and maintains consistency with the new API.

Also applies to: 662-662, 672-672, 716-716, 1129-1129, 1248-1251, 1548-1548, 1728-1728

Comment on lines 121 to 122
#[educe(PartialEq(ignore))]
pub manifest: BuiltinActorManifest,
Member

why do we ignore it for PartialEq?

Contributor Author

Fixed. I thought BuiltinActorManifest did not implement PartialEq.

.map(|(height, _)| *height)
}

pub fn network_height_with_actor_bundle(
Member

Can we have some unit tests and docs here? Also, the return type is complex; could we introduce a type that aggregates it?

Contributor Author

Fixed.

pub fn builtin_actors(&self) -> impl ExactSizeIterator<Item = (BuiltinActor, Cid)> + '_ {
self.builtin2cid.iter().map(|(k, v)| (*k, *v)) // std::iter::Copied doesn't play well with the tuple here
}
/// Get the actor bundle metadata
Member

It's trivial, but a unit test for better coverage would be nice.

Contributor Author

Fixed.

self.chain_store().heaviest_tipset()
}

/// Returns the currently tracked heaviest tipset and rewind to a most recent valid one if necessary.
Member

Could you expand on what valid means in this context?

Contributor Author

Fixed.

"num-traits",
]

[[package]]
Contributor Author

Including Cargo.lock change because dependabot has stopped working for Rust deps for a few weeks. https://github.com/ChainSafe/forest/network/updates/13772088/jobs

Member

@LesnyRumcajs LesnyRumcajs left a comment

LGTM

@LesnyRumcajs LesnyRumcajs added this pull request to the merge queue Oct 28, 2025
Merged via the queue into main with commit b8abe8d Oct 28, 2025
40 checks passed
@LesnyRumcajs LesnyRumcajs deleted the hm/rewind-on-bad-fork branch October 28, 2025 13:03
Development

Successfully merging this pull request may close these issues.

Node rewind on a bad fork

3 participants