HWY-239: Pause if too much stake is offline #842
This adds a ping mechanism that allows validators to signal that they are online even if they are not creating new units. If we see fewer validators online than the required quorum for finality at the configured fault tolerance threshold, we pause and don't create new units. This avoids inflating the protocol state unnecessarily during periods where too many validators have an outage.
}

/// Returns a `Timestamp` provided the vertex is a `Vertex::Unit`
pub(crate) fn timestamp(&self) -> Option<Timestamp> {
    match self {
        Vertex::Unit(signed_wire_unit) => Some(signed_wire_unit.wire_unit().timestamp),
        Vertex::Evidence(_) => None,
        Vertex::Endorsements(_) => None,
        Vertex::Ping(ping) => Some(ping.timestamp()),
The comment could be updated. Now it only mentions `Vertex::Unit`.
I don't think the comment has been updated.
Yes it has. GitHub just shows the snippet of the version you commented on.
Looks good, I can't see any blockers but I will give it another pass after the comments are addressed.
Nice.
bors r+
Build succeeded:
902: HWY-252: Pause consensus if execution lags behind finalization. r=afck a=afck

In a test with 75 validators and 1000 transactions per block we observed that block execution was much slower than consensus. That didn't stop consensus from finalizing all the era-0 blocks, though, which ended up in the block executor queue. Then there was a gap that would have lasted several hours, where consensus waited for the switch block to be executed, so it could start era 1.

This PR makes consensus wait if the block executor falls behind by more than 3 blocks (configurable): Whenever the height of the latest executed block is more than 3 below the height of the latest finalized block, consensus will switch to "paused" mode, using the mechanism introduced in #842.

https://casperlabs.atlassian.net/browse/HWY-252

Co-authored-by: Andreas Fackler <andreas@casperlabs.io>
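The lag rule described in that message can be sketched as a single comparison. This is only an illustration: the function name `execution_lags` and its parameters are hypothetical and not taken from the codebase, and heights are assumed to be plain `u64` block heights.

```rust
/// Hypothetical helper (not from the PR): returns `true` if consensus should
/// pause because block execution has fallen too far behind finalization.
///
/// `max_lag` plays the role of the configurable limit (3 in the description).
fn execution_lags(latest_finalized: u64, latest_executed: u64, max_lag: u64) -> bool {
    // `saturating_sub` yields 0 if execution is somehow ahead of finalization,
    // so this never underflows.
    latest_finalized.saturating_sub(latest_executed) > max_lag
}
```

With `max_lag = 3`, a finalized height of 10 and an executed height of 7 would not pause, while an executed height of 6 would, matching "more than 3 below".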
This prevents the remaining validators from creating more and more units (increasing the protocol state size, and thus memory usage) while too many validators are offline to finalize anything. E.g. if there's a 34% fault tolerance threshold, we need more than 67% = 50% + 34% / 2 online for liveness, so if 33% or more are offline, the other validators will pause.
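The threshold arithmetic above can be written as an integer check. This is a minimal sketch under assumptions: `should_pause` is a hypothetical name, and weights and the fault tolerance threshold are assumed to be expressed in the same integer units (for example, percent of total stake).

```rust
/// Hypothetical helper (not from the PR): returns `true` if the online weight
/// is too low for finality at fault tolerance threshold `ftt`.
///
/// Liveness needs online weight strictly greater than `total / 2 + ftt / 2`.
/// Multiplying both sides by 2 avoids rounding: pause if
/// `2 * online <= total + ftt`.
fn should_pause(total_weight: u64, ftt: u64, online_weight: u64) -> bool {
    // Widen to u128 so the doubled value cannot overflow.
    2 * (online_weight as u128) <= (total_weight as u128) + (ftt as u128)
}
```

With a total weight of 100 and `ftt = 34`, an online weight of 67 (33 offline) triggers a pause, while 68 does not.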
During a pause, instead of units, they gossip new `Ping` messages, so that others can keep track of who's still online. Once enough validators are back online again, they resume.

https://casperlabs.atlassian.net/browse/HWY-239
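One way to keep track of who is still online is to remember the latest timestamp seen from each validator, whether it arrived in a unit or in a ping. The sketch below is an assumption-laden illustration, not the PR's implementation: `OnlineTracker`, its methods, and the millisecond `timeout` are all hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical tracker (not from the PR) of each validator's latest sign of life.
struct OnlineTracker {
    /// Weight of each validator, indexed by validator index.
    weights: Vec<u64>,
    /// Latest timestamp (milliseconds) seen per validator index.
    last_seen: HashMap<usize, u64>,
}

impl OnlineTracker {
    fn new(weights: Vec<u64>) -> Self {
        OnlineTracker { weights, last_seen: HashMap::new() }
    }

    /// Records a unit or ping timestamp; older timestamps and unknown
    /// validator indices are ignored.
    fn observe(&mut self, validator: usize, timestamp: u64) {
        if validator >= self.weights.len() {
            return;
        }
        let entry = self.last_seen.entry(validator).or_insert(timestamp);
        if timestamp > *entry {
            *entry = timestamp;
        }
    }

    /// Sums the weight of validators seen within `timeout` of `now`.
    fn online_weight(&self, now: u64, timeout: u64) -> u64 {
        let mut sum = 0;
        for (idx, ts) in &self.last_seen {
            if now.saturating_sub(*ts) <= timeout {
                sum += self.weights[*idx];
            }
        }
        sum
    }
}
```

The resulting online weight would then be compared against the quorum required for finality to decide whether to pause or resume.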