
Initial sync as state machine #1961

Merged
merged 26 commits into master from dapplion/initial-sync-as-state-machine
Jan 25, 2021

Conversation

dapplion
Contributor

@dapplion dapplion commented Jan 17, 2021

Refactors initial sync with a sync heavily inspired by Lighthouse.

TL;DR: higher sync speeds. It works as a state machine, giving easier testing, debugging and predictability.

Record of successfully completing initial sync on cloud instances

  • lodestar-cloud-01: Pyrmont on container lodestar_lodestar_1: 2 times
  • lodestar-cloud-02: Mainnet on container lodestar_lodestar_1: 3 times

The Sync has two integral components:

Batch

Represents a fixed range of slots to fetch and process. A batch is a state machine that transitions through multiple statuses via public methods:

  • AwaitingDownload: Initial state, or the batch has failed either downloading or processing, but can be requested again.
  • Downloading: The batch is being downloaded
  • AwaitingProcessing: The batch has been completely downloaded and is ready for processing
  • Processing: The batch is being processed.
  • AwaitingValidation: The batch was successfully processed and is waiting to be validated. It is not sufficient to process a batch successfully to consider it correct. This is because batches could be erroneously empty, or incomplete. Therefore, a batch is considered valid, only if the next sequential batch imports at least a block.

Tracking the status this way greatly simplifies the logic for handling and downloading multiple batches concurrently. The AwaitingValidation status also allows handling infinitely long periods of completely skipped slots without issues.
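As an illustration, the status transitions described above could be sketched roughly like this in TypeScript (method names and shapes are approximations, not the exact Lodestar implementation):

```typescript
// Illustrative sketch of the Batch state machine described above.
enum BatchStatus {
  AwaitingDownload = "AwaitingDownload",
  Downloading = "Downloading",
  AwaitingProcessing = "AwaitingProcessing",
  Processing = "Processing",
  AwaitingValidation = "AwaitingValidation",
}

class Batch {
  status: BatchStatus = BatchStatus.AwaitingDownload;
  failedAttempts = 0;

  startDownloading(): void {
    this.assertStatus(BatchStatus.AwaitingDownload);
    this.status = BatchStatus.Downloading;
  }
  downloadingSuccess(): void {
    this.assertStatus(BatchStatus.Downloading);
    this.status = BatchStatus.AwaitingProcessing;
  }
  downloadingError(): void {
    this.assertStatus(BatchStatus.Downloading);
    this.failedAttempts++;
    this.status = BatchStatus.AwaitingDownload; // can be requested again
  }
  startProcessing(): void {
    this.assertStatus(BatchStatus.AwaitingProcessing);
    this.status = BatchStatus.Processing;
  }
  processingSuccess(): void {
    this.assertStatus(BatchStatus.Processing);
    // Not yet valid: validation requires the next sequential batch to import a block.
    this.status = BatchStatus.AwaitingValidation;
  }
  processingError(): void {
    this.assertStatus(BatchStatus.Processing);
    this.failedAttempts++;
    this.status = BatchStatus.AwaitingDownload;
  }

  private assertStatus(expected: BatchStatus): void {
    // Guarding transitions is what makes the machine easy to test and debug.
    if (this.status !== expected) throw new Error(`expected ${expected}, got ${this.status}`);
  }
}
```

Because every transition is a guarded public method, an invalid state combination surfaces immediately as a thrown error rather than as silent corruption.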

Batches always start at slot 1 of an epoch to ease committee forecasting when batch signature verification is ON.

 Epoch boundary |                                   |
  ... | 30 | 31 | 32 | 33 | 34 | ... | 61 | 62 | 63 | 64 | 65 |
       Batch 1       |              Batch 2              |  Batch 3
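Under mainnet constants (SLOTS_PER_EPOCH = 32) the slot ranges in the diagram above can be computed as below; EPOCHS_PER_BATCH is an assumed constant here for illustration:

```typescript
// Assumed constants for illustration (mainnet SLOTS_PER_EPOCH is 32).
const SLOTS_PER_EPOCH = 32;
const EPOCHS_PER_BATCH = 1;

/**
 * Batches start at slot 1 of an epoch to ease committee forecasting,
 * so the batch for startEpoch covers slots [startEpoch*32 + 1, (startEpoch+1)*32].
 */
function batchSlotRange(startEpoch: number): {startSlot: number; count: number} {
  return {
    startSlot: startEpoch * SLOTS_PER_EPOCH + 1,
    count: SLOTS_PER_EPOCH * EPOCHS_PER_BATCH,
  };
}
```

For example, the epoch-1 batch spans slots 33 through 64, matching Batch 2 in the diagram.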

Batches also track their failed attempts, which will allow sophisticated peer scoring in upcoming PRs. So far it's used to reduce the chance of re-requesting a batch from a peer with a failed attempt.

SyncChain

Main driver that handles a map of Batches. It uses an AsyncIterable batchProcessor to process batches in order, one at a time. All the state necessary to choose the next batch to download, and to decide when syncing is done, is derived from the batches, plus startEpoch, which tracks the latest validated epoch.
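A minimal sketch of the batchProcessor idea, i.e. an AsyncIterable that yields one "process next batch" signal at a time so processing stays strictly sequential (an illustrative pattern, not the actual Lodestar code):

```typescript
// Illustrative single-consumer trigger: each trigger() produces exactly one
// iteration of the for-await loop, so batches are processed one at a time.
class BatchProcessorSignal implements AsyncIterable<void> {
  private pending = 0;
  private resolve: (() => void) | null = null;

  /** Signal that a batch is ready for processing. */
  trigger(): void {
    this.pending++;
    if (this.resolve) {
      const r = this.resolve;
      this.resolve = null;
      r();
    }
  }

  async *[Symbol.asyncIterator]() {
    while (true) {
      if (this.pending > 0) {
        this.pending--;
        yield; // consumer processes exactly one batch per yield
      } else {
        await new Promise<void>((res) => (this.resolve = res));
      }
    }
  }
}
```

The consumer would run `for await (const _ of processor) { /* pick and process the next AwaitingProcessing batch */ }`, which guarantees in-order, one-at-a-time processing regardless of how many downloads complete concurrently.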

It interfaces with the sync manager through only 3 methods:

  • processChainSegment
  • downloadBeaconBlocksByRange
  • getPeersAndTargetEpoch: Abstracts the peer handling logic upstream. Allows the sync manager to enforce minPeers throughout the sync
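Roughly, that three-method interface could look like the following (type names and signatures are approximations for illustration, not the real Lodestar types):

```typescript
// Illustrative shape of the SyncChain <-> sync manager boundary.
type Slot = number;
type Epoch = number;
type PeerId = string; // placeholder for the real libp2p peer id type
type SignedBeaconBlock = {slot: Slot}; // placeholder for the real SSZ type

interface ISyncChainModules {
  /** Import a contiguous segment of blocks into the chain. */
  processChainSegment(blocks: SignedBeaconBlock[]): Promise<void>;
  /** Fetch a range of blocks from a specific peer. */
  downloadBeaconBlocksByRange(peer: PeerId, startSlot: Slot, count: number): Promise<SignedBeaconBlock[]>;
  /** Abstracts peer handling upstream; lets the sync manager enforce minPeers. */
  getPeersAndTargetEpoch(): Promise<{peers: PeerId[]; targetEpoch: Epoch}>;
}
```

Keeping the boundary this narrow is what makes the SyncChain testable in isolation: each method can be stubbed independently.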

Instead of start() and stop(), SyncChain exposes a single sync() method that resolves when the sync completes. It uses an abort signal for stopping.
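The resolve-on-completion pattern with an abort signal could be sketched as follows (a simplified illustration under assumed helpers, not the real API):

```typescript
// Illustrative: sync() resolves when synced, rejects with ErrorAborted on abort.
class ErrorAborted extends Error {}

async function sync(signal: AbortSignal, isSynced: () => boolean): Promise<void> {
  while (!isSynced()) {
    if (signal.aborted) throw new ErrorAborted("sync aborted");
    // Placeholder for one unit of sync work (download/process a batch).
    await new Promise((res) => setTimeout(res, 10));
  }
}
```

A caller then stops the chain by aborting the controller it passed in, e.g. `const controller = new AbortController(); const done = sync(controller.signal, isSynced); controller.abort();`, rather than calling a separate stop() method.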

SyncChain can detect that the sync has stalled via a recurring maybeStuckTimeout. It will print detailed data to help debugging, and will try to re-fetch new batches and reprocess pending ones. It also detects specific invalid internal state combinations and will nuke itself if those happen.

@github-actions github-actions bot added the scope-networking All issues related to networking, gossip, and libp2p. label Jan 17, 2021
@codeclimate

codeclimate bot commented Jan 17, 2021

Code Climate has analyzed commit 35dae04 and detected 8 issues on this pull request.

Here's the issue category breakdown:

Category Count
Complexity 6
Duplication 2

View more on Code Climate.

@twoeths
Contributor

twoeths commented Jan 18, 2021

this is great, the sync speed is really impressive to me! /eth/v1/node/syncing does not seem to be correct anymore during initial sync.

@@ -57,8 +56,7 @@ export class ArchiveBlocksTask implements ITask {
i = upperBound;
}
await this.deleteNonCanonicalBlocks();
this.logger.profile("Archive Blocks epoch #" + this.finalized.epoch);
Contributor


why do you remove this? it's used to track how much time we use to archive blocks

Contributor Author

@dapplion dapplion Jan 18, 2021


As it currently stands, logging that at info level is very noisy. I agree it's valuable info, but it should not always be logged at info. We should profile the archive blocks task, the state transition and everything else, but on demand only.

For now I've removed it since there is also this statement:

this.logger.verbose("Archiving of finalized blocks complete")

Contributor Author


As discussed in the meeting we will implement profiling through metrics in an upcoming PR


// At least one block was successfully verified and imported, so we can be sure all
// previous batches are valid and we only need to download the current failed batch.
if (e instanceof BlockProcessorError && e.importedBlocks.length > 0) {
Contributor


I don't see this BlockProcessorError being thrown anywhere

Contributor Author


Right, I'm waiting for Cayman to tweak the chain segment processor and add this error. The error is not necessary for correctness; it just adds a performance boost in some situations.


Member


will change in a separate PR

@dapplion
Contributor Author

/eth/v1/node/syncing does not seem to be correct anymore during initial sync.

What behavior specifically do you see that does not match expectations? I've changed the target to just be target = clock.currentSlot (as Lighthouse does) as an interim solution for a future PR to handle the case of skipped slots on target.

@ChainSafe ChainSafe deleted a comment from lgtm-com bot Jan 18, 2021
@dapplion dapplion force-pushed the dapplion/initial-sync-as-state-machine branch from d8e9587 to e097c19 Compare January 18, 2021 22:12
@wemeetagain wemeetagain added this to the 0.15.0 milestone Jan 19, 2021
@twoeths
Contributor

twoeths commented Jan 21, 2021

/eth/v1/node/syncing does not seem to be correct anymore during initial sync.

What behavior specifically do you see that does not match expectations? I've changed the target to just be target = clock.currentSlot (as Lighthouse does) as an interim solution for a future PR to handle the case of skipped slots on target.

it returns {"data":{"head_slot":"438615","sync_distance":"422901"}} (head_slot is wrong) although I started the sync only 30 minutes ago

@dapplion
Contributor Author

dapplion commented Jan 22, 2021

it returns {"data":{"head_slot":"438615","sync_distance":"422901"}} (head_slot is wrong) although I just started the sync in 30 minutes

Thanks for testing it out! Fixed with 88feee9. After a couple of minutes of testing it returns

/usr/app # curl http://localhost:9596/eth/v1/node/syncing
{"data":{"head_slot":"386","sync_distance":"467256"}}

await this.regularSync.stop();
await this.initialSync.start();
}
private processChainSegment: ProcessChainSegment = async (blocks) => {
Member


Remove in a future PR

@wemeetagain wemeetagain merged commit 8bb6e91 into master Jan 25, 2021
@wemeetagain wemeetagain deleted the dapplion/initial-sync-as-state-machine branch January 25, 2021 18:28