
ChainSync refactor #693

Merged: austinabell merged 14 commits into main from austin/sync/refactor on Sep 14, 2020

Conversation

@austinabell (Contributor)

Summary of changes
Changes introduced in this pull request:

  • Handles syncing asynchronously through sync workers on separate tasks
    • All syncing logic now lives in the SyncWorker struct; the ChainSyncer just handles network events and the scheduling/handling of sync workers
    • Making some of the fields thread-safe across tasks required some auxiliary updates
  • Updates peer selection for blocksync requests to be random (should be improved with Implement peer manager stats for each peer #692); see the sketch after this list
  • Fixes some bugs with tipset scheduling and other functions
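
A minimal sketch of the random selection, assuming the rand crate's IteratorRandom; the peer list and the String-typed IDs are hypothetical stand-ins for the peer manager's state:

```rust
use rand::seq::IteratorRandom;

/// Pick one peer uniformly at random from the available set.
/// `peers` stands in for the peer manager's tracked peers; String is
/// used here only as a placeholder for the real peer ID type.
fn select_random_peer(peers: &[String]) -> Option<&String> {
    peers.iter().choose(&mut rand::thread_rng())
}
```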

So the loading/fetching of tipsets from the hello protocol is done on separate tasks so as not to block the ChainSyncer polling, since otherwise the network event channel fills up and stalls (which is what was happening before). This can be handled better, but I want to keep the scope of this PR to a minimum as it's already changing a lot.
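
A minimal sketch of that pattern, assuming async-std-style tasks; SyncWorker, TipsetKeys, and the channel wiring are hypothetical stand-ins, not the actual Forest signatures:

```rust
use async_std::task;
use futures::{channel::mpsc, StreamExt};

// Hypothetical stand-ins for the real types.
struct TipsetKeys(Vec<u8>);
struct SyncWorker {
    id: usize,
}

impl SyncWorker {
    async fn sync(&self, _target: TipsetKeys) {
        // ... reverse header sync, fetching, validation ...
    }
}

// The event loop only schedules work and immediately returns to polling,
// so the network event channel never fills up behind a long sync.
async fn event_loop(mut network_events: mpsc::UnboundedReceiver<TipsetKeys>) {
    let mut next_id = 0;
    while let Some(target) = network_events.next().await {
        let worker = SyncWorker { id: next_id };
        next_id += 1;
        // Spawn the (potentially long-running) sync on its own task so
        // this loop keeps draining the channel instead of blocking on it.
        task::spawn(async move {
            worker.sync(target).await;
            println!("sync worker {} finished", worker.id);
        });
    }
}
```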

Reference issue to close (if applicable)

Closes #690

Other information and links

@ec2 (Member) left a comment

It seems like, while in the bootstrap phase, for every Hello we get we'll be doing a sync from that tipset? So if there are 3 Hello messages, they all get scheduled? Or am I misreading?

Resolved review threads: blockchain/chain/src/store/chain_store.rs; node/forest_libp2p/src/service.rs (outdated)
@austinabell (Contributor, Author)

> It seems like, while in the bootstrap phase, for every Hello we get we'll be doing a sync from that tipset? So if there are 3 Hello messages, they all get scheduled? Or am I misreading?

No, only tipsets that aren't connected get scheduled, which is the point of having multiple workers (so one doesn't get sidetracked with an invalid chain and have to start from scratch again)

@ec2 (Member) commented Sep 10, 2020

> No, only tipsets that aren't connected get scheduled, which is the point of having multiple workers (so one doesn't get sidetracked with an invalid chain and have to start from scratch again)

What do you mean by connected here? If I understand correctly, connected means having a common ancestor?

Let's say we receive a Hello from PeerA and start doing a reverse header sync. Halfway through, we receive another Hello from PeerB, which will probably have a newer head than PeerA. If we start a reverse header sync on PeerB as well, then in most cases (assuming no attacks or weird behaviour, which I think is by far the most common scenario) we will end up with a large number of duplicate fetches. A reverse header sync can take quite a bit of time if the chain is long, and the blocks don't get persisted into the store until ALL the blocks from head -> 0 are fetched, so WorkerB fetching from PeerB will not know that WorkerA fetching from PeerA is already fetching the same blocks.
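
A rough sketch of the reverse header sync being described, with hypothetical Tipset and fetch_parent stand-ins; the key point is that nothing is persisted until the whole walk completes:

```rust
// Hypothetical minimal stand-ins for illustration.
struct Tipset {
    epoch: u64,
}

// Stand-in for a blocksync request asking a peer for the parent tipset.
async fn fetch_parent(ts: &Tipset) -> Tipset {
    Tipset { epoch: ts.epoch - 1 }
}

// Walk parents from the advertised head back toward genesis, buffering
// headers only in memory; because nothing hits the store until the walk
// finishes, two workers syncing overlapping chains duplicate fetches.
async fn reverse_header_sync(head: Tipset) -> Vec<Tipset> {
    let mut chain = vec![head];
    while chain.last().unwrap().epoch > 0 {
        let parent = fetch_parent(chain.last().unwrap()).await;
        chain.push(parent);
    }
    chain
}
```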

@austinabell
Copy link
Contributor Author

> No, only tipsets that aren't connected get scheduled, which is the point of having multiple workers (so one doesn't get sidetracked with an invalid chain and have to start from scratch again)

> What do you mean by connected here? If I understand correctly, connected means having a common ancestor?

> Let's say we receive a Hello from PeerA and start doing a reverse header sync. Halfway through, we receive another Hello from PeerB, which will probably have a newer head than PeerA. If we start a reverse header sync on PeerB as well, then in most cases (assuming no attacks or weird behaviour, which I think is by far the most common scenario) we will end up with a large number of duplicate fetches. A reverse header sync can take quite a bit of time if the chain is long, and the blocks don't get persisted into the store until ALL the blocks from head -> 0 are fetched, so WorkerB fetching from PeerB will not know that WorkerA fetching from PeerA is already fetching the same blocks.

It's all based on the hello messages received: they get grouped as they come in, so as long as there isn't a break between two epochs (probably rare) they will be considered on the same chain. This is determined by checking whether one tipset's parent is the other, or whether they are the same.

This logic existed before, but it is now used more correctly than it was (it can still be improved later). This is the point of having those tipset buckets: to track which tipsets are being synced and what to assign another worker to. A sketch of the grouping check follows.
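
A sketch of that grouping check, with hypothetical TipsetKey/Tipset stand-ins: two tipsets land in the same bucket when they are equal or one's parents point at the other:

```rust
// Hypothetical stand-ins; the real types live elsewhere in the codebase.
#[derive(PartialEq)]
struct TipsetKey(Vec<u8>);

struct Tipset {
    key: TipsetKey,
    parents: TipsetKey,
}

// Same bucket if the tipsets are equal or one directly extends the other.
fn same_chain(a: &Tipset, b: &Tipset) -> bool {
    a.key == b.key || a.parents == b.key || b.parents == a.key
}
```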

@ec2 (Member) left a comment

It L(Real)GTM. I noticed that currently we don't verify the winning PoSt proof; can't remember why that is, but it can be addressed in another PR. Also, nice one getting rid of the network_handler. I always got mixed up between that and the context when I was heads deep.

Resolved review thread: blockchain/chain_sync/src/sync.rs
@dutterbutter (Contributor) left a comment

Awesome job! Great work separating the sync_worker logic out of sync.rs. Just a few nits, and more questions than anything.

```rust
let ts: Tipset = Tipset::new(vec![header.clone()])?;
// TODO this is really broken, should not be setting the tipset metadata to a tipset with just
// the one header.
pub async fn set_tipset_tracker(&self, header: &BlockHeader) -> Result<(), Error> {
```
Contributor:

I don't really see the need for the tip_index.rs file anymore. I recall implementing it really early, thinking it would be good for tipset data retrievals, but it doesn't look like we are using it anywhere for that purpose? We set the tracker but never make use of it.

Contributor (author):

Yeah, I was planning on leaving it for now since it's out of scope for this PR and the changes are already a lot.

```rust
use vm::TokenAmount;

// TODO revisit this type, necessary for two sets of Arc<Mutex<>> because each state is
// on seperate thread and needs to be mutated independently, but the vec needs to be read
```
Contributor:

Nit: seperate -> separate

```rust
.chain_store
    .heaviest_tipset()
    .await
    .map(|heaviest| ts.weight() >= heaviest.weight())
```
Contributor:

Shouldn't this just be ts.weight() > heaviest.weight()? Do we want to validate and set the peer head when the weights are equal?

Contributor (author):

Nah, it's intentional that if the weight is the same it will also be considered. The logic also matches Lotus (although they set up the function differently).

```diff
 }

 /// informs the syncer about a new potential tipset
 /// This should be called when connecting to new peers, and additionally
 /// when receiving new blocks from the network
-pub async fn inform_new_head(&mut self, peer: PeerId, fts: &FullTipset) -> Result<(), Error> {
+pub async fn inform_new_head(&mut self, peer: PeerId, ts: &FullTipset) -> Result<(), Error> {
```
Contributor:

I noticed in the Lotus implementation they are adding the PeerId to their ChainExchange pool; should there be a TODO to add this peer to the available peers for blocksync?

Contributor (author):

We're doing it when the hello message is handled by the syncer; I don't see why it would need to be handled again after that.

@austinabell austinabell merged commit 3411459 into main Sep 14, 2020
@austinabell austinabell deleted the austin/sync/refactor branch September 14, 2020 19:14
timvermeulen added a commit that referenced this pull request Sep 17, 2020
commit 8ef5ae5
Author: Austin Abell <austinabell8@gmail.com>
Date:   Wed Sep 16 12:30:01 2020 -0400

    Peer stats tracking and selection (#701)

    * Added peer manager logic and started to move blocksync handling into network context

    * Migrate to using handled blocksync requests and remove a duplicate code

    * Fix test use

    * Fix duration subtract logic

    * Update peer manager updates from hello requests

    * Docs and cleanup

    * more cleanup

commit dc0ff4c
Author: Eric Tu <6364934+ec2@users.noreply.github.com>
Date:   Tue Sep 15 14:43:18 2020 -0400

    Semantic Validation for Messages (#703)

    * semantic validation

    * suggestions

    * suggestions

commit 3411459
Author: Austin Abell <austinabell8@gmail.com>
Date:   Mon Sep 14 15:14:48 2020 -0400

    ChainSync refactor (#693)

    * Switch chain store to threadsafe

    * Update genesis to arc reference

    * wip refactoring chain sync to workers in async tasks

    * Update network event handling and remove NetworkHandler

    * Update tipset scheduling logic

    * Update peer retrieval to take a random sample of available peers

    * Cleanup and enabling all existing tests

    * fix worker task spawn

    * Add TODO for emit event ignoring and change to error log

    * oops

    * Update comment

    * Fix typo

commit 66ca99e
Author: Eric Tu <6364934+ec2@users.noreply.github.com>
Date:   Mon Sep 14 13:48:22 2020 -0400

    Fix StateManager use in different components (#694)

    * wrap sm in arc

    * lint

    * fix tests

    * suggestions

    * clippy

commit 96b64cb
Author: nannick <rajarupans@gmail.com>
Date:   Mon Sep 14 10:33:27 2020 -0400

    Drand ignore env variable (#697)

    * First commit

    * Using enviromental vairable instead of cli flag

    * FIxing dumb mistakes

    * Renaming drand flag

    * UPdate latest_beacon_entry

    * Adding drand check

    * Adding doc

commit 548a464
Author: Austin Abell <austinabell8@gmail.com>
Date:   Thu Sep 10 13:22:22 2020 -0400

    Print out conformance results and add log for skips (#695)

commit 0d7b16c
Author: Stepan <s.s.naumov@gmail.com>
Date:   Tue Sep 8 09:01:05 2020 -0400

    Add CLI command to add Genesis Miner to Genesis Template (#644)

    * AddMinerGenesis cmd command

    * fixing tests

    * addressing comments

    * removing jsons
timvermeulen added a commit that referenced this pull request Sep 26, 2020
* WIP

* More changes

* Even more changes

* Amt docs

* Clippy & copyright

* Squashed commit of the following: (identical to the Sep 17 commit message above, commits 8ef5ae5 through 0d7b16c)

* Delete old deadline tests

* Fix verify_block_signature

* Fix state_api

* Silent unused variable warnings

* clippy

* clippy

* clippy

* Fix miner worker addr function

* fix load_sectors_from_set

* Fix get_miner_info

* miner faults update

* Setup faults and recoveries logic

* Add todo for checking verification function

* Update to specs-actors v0.9.3

* Fix test

* Remove Cid.Undef check

* PR fixes

* Use the actor_error! macro in the miner actor

* Fix type mismatch

* PR changes

* Use return value

* Rename file

* More PR fixes

* Add assertions and error message

* Downcast errors originating from Deadlines::{load_deadline, update_deadline} and State::save_deadlines

* Rename Sectors::{new, load} to Sectors::{load, load_sector}

* Fix inverted logic

* Fix state manager

* Remove unused parameters

* clippy

* Replace function by const, remove duplication, remove comment

* Rename field, fix macro formatting, remove unnecessary vec

* Downcast errors

Co-authored-by: austinabell <austinabell8@gmail.com>
Linked issue: Refactor ChainSyncer to handle initial sync asynchronously (#690)