HWY-239: Pause if too much stake is offline #842

Merged: 4 commits, Feb 2, 2021

@afck afck (Contributor) commented Feb 1, 2021

This prevents the remaining validators from creating more and more units (increasing the protocol state size, and thus memory usage) while too many validators are offline to finalize anything. E.g. if there's a 34% fault tolerance threshold, we need more than 67% = 50% + 34% / 2 online for liveness, so if 33% or more are offline, the other validators will pause.

During a pause, instead of units, they gossip new Ping messages, so that others can keep track of who's still online. Once enough validators are back online again, they resume.
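As a rough illustration of the arithmetic above, the pause condition can be written with integer weights. This is only a sketch: `should_pause` and its parameters are hypothetical names, not the code in this PR.

```rust
/// Hypothetical helper (not taken from this PR): returns `true` if unit
/// creation should pause because the online weight does not exceed the
/// quorum needed for finality.
///
/// The quorum is 50% of the total weight plus half the fault tolerance
/// threshold (`ftt`), i.e. `(total_weight + ftt) / 2`. Liveness needs
/// strictly more than that online.
fn should_pause(total_weight: u64, ftt: u64, online_weight: u64) -> bool {
    // Compare `2 * online` with `total + ftt` to avoid rounding in the
    // division; u128 avoids overflow for large weights.
    2 * u128::from(online_weight) <= u128::from(total_weight) + u128::from(ftt)
}
```

With a total weight of 100 and a threshold of 34, `should_pause(100, 34, 67)` is `true` (33 offline) and `should_pause(100, 34, 68)` is `false`.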

https://casperlabs.atlassian.net/browse/HWY-239

This adds a ping mechanism that allows validators to signal that they
are online even if they are not creating new units. If we see fewer
validators online than the required quorum for finality at the
configured fault tolerance threshold, we pause and don't create new
units. This avoids inflating the protocol state unnecessarily during
periods where too many validators have an outage.
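One way to picture the bookkeeping behind the ping mechanism is a map from each validator to the latest timestamp at which it was seen, through either a unit or a ping. The sketch below rests on assumptions: `OnlineTracker`, its fields, the `u32` validator IDs and the millisecond timeout are all illustrative and do not come from this PR.

```rust
use std::collections::HashMap;

/// Hypothetical tracker of which validators have recently been seen online.
struct OnlineTracker {
    /// Weight of each validator, keyed by an illustrative numeric ID.
    weights: HashMap<u32, u64>,
    /// Latest timestamp (in ms) of a unit or ping seen from each validator.
    last_seen: HashMap<u32, u64>,
}

impl OnlineTracker {
    fn new(weights: HashMap<u32, u64>) -> Self {
        OnlineTracker { weights, last_seen: HashMap::new() }
    }

    /// Records a unit or ping from `validator`; unknown validators are ignored.
    fn observe(&mut self, validator: u32, timestamp: u64) {
        if !self.weights.contains_key(&validator) {
            return;
        }
        let entry = self.last_seen.entry(validator).or_insert(timestamp);
        // Keep only the most recent timestamp.
        *entry = (*entry).max(timestamp);
    }

    /// Sums the weight of validators seen within `timeout` ms before `now`.
    fn online_weight(&self, now: u64, timeout: u64) -> u64 {
        self.last_seen
            .iter()
            .filter(|&(_, &seen)| now.saturating_sub(seen) <= timeout)
            .map(|(id, _)| self.weights[id])
            .sum()
    }
}
```

The resulting online weight would then be compared against the finality quorum to decide whether to pause or resume.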
}

/// Returns a `Timestamp` provided the vertex is a `Vertex::Unit`
pub(crate) fn timestamp(&self) -> Option<Timestamp> {
    match self {
        Vertex::Unit(signed_wire_unit) => Some(signed_wire_unit.wire_unit().timestamp),
        Vertex::Evidence(_) => None,
        Vertex::Endorsements(_) => None,
        Vertex::Ping(ping) => Some(ping.timestamp()),
Collaborator

The comment could be updated. Now it only mentions Vertex::Unit.

Collaborator

I don't think the comment has been updated.

Contributor Author

Yes it has. GitHub just shows the snippet of the version you commented on.

@goral09 goral09 (Collaborator) left a comment

Looks good, I can't see any blockers but I will give it another pass after the comments are addressed.

@goral09 goral09 (Collaborator) left a comment

Nice.

@afck afck (Contributor Author) commented Feb 2, 2021

bors r+

@bors bors bot (Contributor) commented Feb 2, 2021

Build succeeded:

@bors bors bot merged commit f5ce2a5 into casper-network:master Feb 2, 2021
@afck afck deleted the ping branch February 2, 2021 11:58
bors bot added a commit that referenced this pull request Feb 12, 2021
902: HWY-252: Pause consensus if execution lags behind finalization. r=afck a=afck

In a test with 75 validators and 1000 transactions per block we observed that block execution was much slower than consensus. That didn't stop consensus from finalizing all the era-0 blocks, though, which ended up in the block executor queue. Then there was a gap that would have lasted several hours, where consensus waited for the switch block to be executed, so it could start era 1.

This PR makes consensus wait if the block executor falls behind by more than 3 blocks (configurable): Whenever the height of the latest executed block is more than 3 below the height of the latest finalized block, consensus will switch to "paused" mode, using the mechanism introduced in #842.

https://casperlabs.atlassian.net/browse/HWY-252

Co-authored-by: Andreas Fackler <andreas@casperlabs.io>
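
The lag condition described in the commit message above (pause when the latest executed block is more than a configurable number of blocks below the latest finalized block) can be sketched as a single comparison. The function and parameter names here are hypothetical, not taken from the referenced commit.

```rust
/// Hypothetical check (not from the referenced commit): returns `true` if
/// consensus should pause because execution has fallen more than `max_lag`
/// blocks behind finalization.
fn execution_lags(finalized_height: u64, executed_height: u64, max_lag: u64) -> bool {
    // `saturating_sub` yields 0 if execution is level with or ahead of
    // finalization, so the subtraction cannot underflow.
    finalized_height.saturating_sub(executed_height) > max_lag
}
```

With a `max_lag` of 3, a finalized height of 10 and an executed height of 7 does not trigger a pause (the difference is exactly 3), while an executed height of 6 does.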