Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Fix] Unpause atomic writes on block insert error #2457

Open
wants to merge 2 commits into
base: mainnet-staging
Choose a base branch
from

Conversation

niklaslong
Copy link
Contributor

@niklaslong niklaslong commented May 17, 2024

This PR contains a potential fix to one of the atomic panics we've seen on the devnet:

May 15 15:55:28 ip-172-31-42-88 snarkos[58891]: 2024-05-15T15:55:28.771227Z  WARN snarkos_node_sync::block_sync: Attempted to insert a block at the incorrect height into storage
May 15 15:55:29 ip-172-31-42-88 snarkos[58891]: thread 'tokio-runtime-worker' panicked at /Users/nkls/dev/snarkVM/ledger/store/src/helpers/rocksdb/internal/mod.rs:235:9:
May 15 15:55:29 ip-172-31-42-88 snarkos[58891]: assertion failed: !already_paused

I initially thought the error might suggest the occurrence of simultaneous block inserts from different threads (there are no await points so it can't be from the same task) but the block lock ruled this out. This leaves us with improper handling of unpausing which this changeset aims to fix.

Drafting this PR as I haven't yet been able to confirm this fix is sufficient to resolve the issue entirely.

@niklaslong niklaslong changed the title fix: unpause atomics on block insert error [Fix] Unpause atomics on block insert error May 17, 2024
@niklaslong niklaslong changed the title [Fix] Unpause atomics on block insert error [Fix] Unpause atomic writes on block insert error May 17, 2024
// Note: This call is guaranteed to succeed (without error), because `DISCARD_BATCH == true`.
self.block_store().unpause_atomic_writes::<true>()?;
// Rollback the Merkle tree.
self.block_store().remove_last_n_from_tree_only(1).map_err(|removal_error| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there was an issue with adding the block to the block store, why would we need to revert the tree or rollback the block?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh we might not, I thought there might be partial state that would need rolling back but I'm not yet super familiar with this codepath.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a commit to remove this logic

Copy link
Contributor

@ljedrz ljedrz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@raychu86
Copy link
Contributor

Is there a way to force this error to occur and validate that this fix correctly addresses the failure case?

In addition, is there a way to determine why the rocksDB write might have failed? Could there be any issues with the BFTPersistentStorage rocksDB usage in snarkOS? Are there any locks we need to enforce between the two rocksdb writes (ledger vs bft)?

@ljedrz
Copy link
Contributor

ljedrz commented May 20, 2024

@niklaslong correct me if I'm mistaken, but I believe the error was not rocksDB-related, but rather "couldn't insert block at height N", i.e. a logic error.

@raychu86
Copy link
Contributor

I believe the only errors in self.block_store().insert(block) are the merkle tree update or RocksDB insert.

@ljedrz
Copy link
Contributor

ljedrz commented May 20, 2024

BlockStore::insert can also fail as such:

if block.height() != u32::try_from(updated_tree.number_of_leaves())? - 1 {
    bail!("Attempted to insert a block at the incorrect height into storage")
}

@niklaslong
Copy link
Contributor Author

Yes, the error I managed to trigger on the devnet was the "incorrect height" ☝️

force this error to occur and validate that this fix correctly addresses the failure case?

I didn't have any further luck last week on the devnets but we could try to trigger an error on insert manually to check if the issue can be replicated that way. It would at least validate that it's possible to trigger it on that codepath.

@raychu86
Copy link
Contributor

Is this ready for review or is there still changes/testing to be done?

@niklaslong
Copy link
Contributor Author

Should be good to go in theory. Fabiano has encountered the error again on a client, I've sent him a patched branch to try, hopefully we can validate the fix in practice that way 🤞

@HarukaMa
Copy link
Contributor

Tested this on my 3-node setup and this works. Node would still corrupt the database or stop syncing for unknown reasons, but there's no assertion failed: !already_paused anymore.

@niklaslong
Copy link
Contributor Author

@HarukaMa thank you for testing the branch! Did you spot any other atomic related panics in your logs?

@niklaslong niklaslong marked this pull request as ready for review May 29, 2024 08:33
@HarukaMa
Copy link
Contributor

@HarukaMa thank you for testing the branch! Did you spot any other atomic related panics in your logs?

hmm, after grepping the logs I did see a few panics about 16 hours ago, but as the stderr is not logged to log file I'm not sure what they are about now.

Copy link
Contributor

@ljedrz ljedrz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM post-rebase 👌

self.block_store().abort_atomic();
// Disable the atomic batch override.
// Note: This call is guaranteed to succeed (without error), because `DISCARD_BATCH == true`.
self.block_store().unpause_atomic_writes::<true>()?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like the correct approach, but can we be sure this doesn't have any deleterious side effects? How are we sure that it doesn't?

Copy link
Contributor

@ljedrz ljedrz May 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This always should have been there; block insertion is the only operation where pausing is used (because block insertion and finalization are technically separate atomic operations, but they need to be treated atomically here), and once the pause is in force, it MUST be undone by the end of that operation or if that operation fails (which is covered by this PR).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants