Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Narwhal] improve resilience of nodes growing and following DAG #8288

Closed
3 of 7 tasks
mwtian opened this issue Feb 14, 2023 · 4 comments
Closed
3 of 7 tasks

[Narwhal] improve resilience of nodes growing and following DAG #8288

mwtian opened this issue Feb 14, 2023 · 4 comments
Assignees

Comments

@mwtian
Copy link
Member

mwtian commented Feb 14, 2023

Currently, the limiting factor to reduce Narwhal round generation speed (hence reducing e2e Narwhal latency) is the resilience of nodes growing and following DAG. It is observed that once more Narwhal rounds are generated per sec, at some limit nodes start to fail behind in following the DAG, resulting in higher e2e latencies. We plan to make the following improvements to help nodes receiving certificate out of order or missing one or more of them to stay up-to-speed with growth of the DAG.

Propagating recently created certificates

  • Move Certificate handling from Core into Synchronizer, which allows concurrent processing of certificates and access to metadata.
  • When broadcasting created own certificates, allow multiple inflight instead of only keeping the most recent certificate and cancelling the rest.
  • Create a buffer to keep track of known certificates that have missing parents, and certificates only known by digests. Their dependencies should be tracked so they can be accepted efficiently when possible.

Improving certificates catchup throughput

  • Improve efficiency of certificate verifications during catchup, by verifying higher round certificates and only ensuring digests match for child certificates.
  • Flesh out details for chaining certificates from own origin.
  • Improve efficiency of certificate fetching by fetching from highest rounds of accepted certificates per origin, instead of trying to fill in certificates within gc_depth.
  • Pipeline certificate fetching after verifications are no longer a major factor in catchup time.
@mwtian mwtian self-assigned this Feb 14, 2023
@akichidis
Copy link
Contributor

When broadcasting created own certificates, each peer should be sent a bounded list of certificates that they may need, not just the most recently created own certificate.

I assume we'll want to do that only to the nodes who haven't voted for the underlying header, right?

Improve efficiency of certificate verifications during catchup, by verifying higher round certificates and only ensuring digests match for child certificates.

To confirm that child certificate digests match we'll need to have previously the child certificates processed? Trying to see the difference with current approach.

@mwtian
Copy link
Member Author

mwtian commented Feb 15, 2023

I assume we'll want to do that only to the nodes who haven't voted for the underlying header, right?

Why is that the case? This should be done against all peers for created certificates that have not been ack'ed, with a max size bound e.g. 100, to avoid the recipient having to rely on the higher latency fetch process.

@mwtian
Copy link
Member Author

mwtian commented Feb 15, 2023

To confirm that child certificate digests match we'll need to have previously the child certificates processed? Trying to see the difference with current approach.

The difference is that the previous child certificates do not have to be verified, if their digests are included transitively from verified certificates. The caveat is fetch response served from these data may contain certificates with unverified signatures now, so the recipient of the fetch response needs to be able to handle that.

mwtian added a commit that referenced this issue Feb 21, 2023
…#8270)

Currently much of `Certificate` processing logic lives in `Core`, which
has the following drawbacks:
- Processing is single threaded, even if much of `Certificate`
processing can be parallelized, e.g. verifications, checking parents,
etc.
- It is hard to evolve, e.g. if we want to lookup suspended certificates
while checking existence of parents, a new channel need to be added.

This PR moves the `Certificate` processing logic into `Synchronizer`,
which is now responsible for synchronizing DAG, not just the missing
parent certificates.

In future, additional data structures will be added to `Synchronizer` to
keep track of suspended certificates.

Majority of the changes are the required plumbing. The most significant
change is that certificate broadcast is re-implemented.

#8288
mwtian added a commit that referenced this issue Feb 27, 2023
## Description 

Narwhal used to have `HeaderWaiter` that suspends Headers and
Certificates, and commits a certificate when it has no missing parent.
The active requests for parent certificates were a bit problematic, and
the idea of suspending received certificates is useful. As Narwhal
decreases its round intervals, more nodes may start to receive
certificates out of their causal order. It is useful to temporarily
buffer the certificates to smooth out disruptions from network or slower
nodes.

## Test Plan 
unit tests
deployed to private testnet

#8288
@mwtian
Copy link
Member Author

mwtian commented Mar 17, 2023

Catch up performance will be tracked separately. Currently the bottleneck to catch up performance is payload fetching.

@mwtian mwtian closed this as completed Mar 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants