Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase base consensus timeout and make it a config #4532

Merged
merged 1 commit into from
Sep 8, 2022
Merged

Conversation

andll
Copy link
Contributor

@andll andll commented Sep 8, 2022

Sui has a timeout on sending transaction to the nw consensus and automatically adjusts the timeout based on recent latency from nw it has seen in case of success, or linearly increases in case of timeout.

In the stress test network we see quite a few of nw timeouts from the sui transaction submission side, and we indeed see that control delay is adjusted accordingly to the timeouts and nw is not stuck.

However, it seems that current timeout adjustment logic drops timeout pretty quickly which causes timeouts again.

While this is not end of the world, it is noise and probably suboptimal.

It also can be noticed that current base timeout value is very close to the actual consensus delay, so the timeout has to be frequently adjusted. Therefore, a short-medium term fix (this PR) is just to bump the delay step for timeout calculations so that it is further away from median nw latency.

In the future we can improve timeout calculation by using weighted avg over a longer window instead of weighting current and next timing only.

Copy link
Contributor

@lxfind lxfind left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

Sui has a timeout on sending transaction to the nw consensus and automatically adjusts the timeout based on recent latency from nw it has seen in case of success, or linearly increases in case of timeout.

In the stress test network we see quite a few of nw timeouts from the sui transaction submission side, and we indeed [see](http://grafana.shared.internal.sui.io:3000/goto/wqnUWbGVz?orgId=1) that control delay is adjusted accordingly to the timeouts and nw is not stuck.

However, it seems that current timeout adjustment logic drops timeout pretty quickly which causes timeouts again.

While this is not end of the world, it is noise and probably suboptimal.

It also can be noticed that current base timeout value is very close to the actual consensus delay, so the timeout has to be frequently adjusted. Therefore, a short-medium term fix (this PR) is just to bump the delay step for timeout calculations so that it is further away from median nw latency.

In the future we can improve timeout calculation by using weighted avg over a longer window instead of weighting current and next timing only.
@andll andll merged commit ee77d8a into main Sep 8, 2022
@andll andll deleted the andrey-21 branch September 8, 2022 21:22
@@ -152,6 +152,7 @@ impl NodeConfig {
pub struct ConsensusConfig {
pub consensus_address: Multiaddr,
pub consensus_db_path: PathBuf,
pub delay_step: Option<u64>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively I think it may be cleaner if you don't use Option here but provide a serde default?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, too late :)
I don't really have a strong opinion on that, can use default next time

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The benefit of serde default is that we can then deploy this without a new config file

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is same with Option, it deserializes to None by default

longbowlu pushed a commit that referenced this pull request Sep 11, 2022
Sui has a timeout on sending transaction to the nw consensus and automatically adjusts the timeout based on recent latency from nw it has seen in case of success, or linearly increases in case of timeout.

In the stress test network we see quite a few of nw timeouts from the sui transaction submission side, and we indeed [see](http://grafana.shared.internal.sui.io:3000/goto/wqnUWbGVz?orgId=1) that control delay is adjusted accordingly to the timeouts and nw is not stuck.

However, it seems that current timeout adjustment logic drops timeout pretty quickly which causes timeouts again.

While this is not end of the world, it is noise and probably suboptimal.

It also can be noticed that current base timeout value is very close to the actual consensus delay, so the timeout has to be frequently adjusted. Therefore, a short-medium term fix (this PR) is just to bump the delay step for timeout calculations so that it is further away from median nw latency.

In the future we can improve timeout calculation by using weighted avg over a longer window instead of weighting current and next timing only.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants