Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trigger sync when unable to forge block due to low consensus #2073

Closed
SargeKhan opened this issue May 29, 2018 · 10 comments
Closed

Trigger sync when unable to forge block due to low consensus #2073

SargeKhan opened this issue May 29, 2018 · 10 comments

Comments

@SargeKhan
Copy link
Contributor

SargeKhan commented May 29, 2018

Parent #2080

Expected behavior

Before forging a block, if consensus is lower than minBroadhashConsensus value, then node should trigger sync operation.

Actual behavior

Node keeps logging consensus is low, and waits for sync trigger to happen.

Steps to reproduce

N/A

Which version(s) does this affect? (Environment, OS, etc...)

1.0.0

@4miners
Copy link
Contributor

4miners commented May 29, 2018

@SargeKhan In what will it help? I doubt that sync will be able to complete during slot window, and even if - then it will affect next delegate because block will be forged at the very end of forging slot, so it's possible that it will reach rest of the network in next slot.

@shuse2
Copy link
Member

shuse2 commented May 29, 2018

We think the problem is not in the sync process or node staying behind.
It's more of using outdated information about peers to calculate the consensus.

  • When a node forge a block, it broadcast their header (like broadhash) to 100 random peers, and it doesn't hop.
  • We update all of peers information every 30s by discoveryAndUpdate.

@SargeKhan
Copy link
Contributor Author

SargeKhan commented May 30, 2018

@4miners, after discussing with the team, we've decided that we should start sync process if consensus is low and:
1- The current slot is one slot before node's delegate slot.
2- The current slot is the node's delegate slot.
I think our application should be designed so that every node is interested in his benefit, and later the benefit of the network. So,

I doubt that sync will be able to complete during slot window, and even if - then it will affect next delegate because block will be forged at the very end of forging slot, so it's possible that it will reach rest of the network in next slot.

I think it's still a better option than not forging a block at all, and just missing a block slot.

@4miners
Copy link
Contributor

4miners commented May 30, 2018

@SargeKhan I don't think that approach proposed here is a good one, for few reasons:

  1. It should be an edge case (node get out of sync and stay behind), if that is so common that we need such a fix for keeping network stable (or trying to) - it means we have issues with propagation, and propagation is a thing that should be fixed instead.
  2. Consensus is always low for some period of a slot (not our own forging slot), so that can be tricky to implement.
  3. We will not always know if current slot is the one before our delegate slot (when delegate is first one of a round).
  4. When syncing node cannot receive new blocks, that can cause additional issues with blocks propagation. Also it can stack with scenario that I described before.

@nazarhussain
Copy link
Contributor

I agree with point 1, if consensus is low quite often that's probably the issue with propagation of blocks and that should be fixed on that layer.

But the we develop syncing process as fail safe process to broadcasting. And it should run independent of knowing that (broadcast failed quite often or not). When our slot came (and concuss is low), we have to do these stuff in 10 seconds slot.

  1. Syncing
  2. Forging
  3. Broadcasting

I feel that in that 10 seconds slot we should only do forging and broadcasting, and we should be prepared for it before. We have 109 slots (1090 secs) to prepare for forging, during that time we should prepare all pre-requisites for forging. One of which is to have high consensus and that requires syncing. So not just one block, even few blocks before time if we do syncing there are vital chances that forging slot goes fine.

Your point 3 is interesting edge case, that we can cover by syncing on last slot of the round. So who ever the first one would be in delegate list of next round would have higher consensus.

The summary of discussion is syncing is developed as fail safe process to broadcasting and it should be triggered on specific and defined use cases, not periodically.

@4miners
Copy link
Contributor

4miners commented May 30, 2018

@nazarhussain Sync was never developed to be the fail safe to broadcasting. Main purpose of sync:

  • sync the blockchain
    • when we start a fresh node
    • when we start old node after some period of downtime
  • catch with the blockchain when node stay behind due to an edge case (for example node is overloaded, network outage, etc.)
    • when we didn't receive a block for 20 seconds (it was 50 or more in the past)
    • after fork 1 recovery (we deleted 2 blocks, so for sure we're behind and need to sync)

The time after which we trigger sync is not accidental. It's because when we didn't receive one block - we assume that delegate failed to forge that block and we not trigger sync. However 2 delegates that fail to forge a block in a row is highly unlikely - so we trigger sync after not receiving 2 blocks.

Broadcasting process should be reliable, every forged block should hit all peers in the network, as soon as possible. Consensus should also be reliable - it was designed to mitigate possible chain splits by not allowing nodes to forge.

Sync on last block of round is dangerous and can cause more issues.

@SargeKhan
Copy link
Contributor Author

@4miners, I agree with your opinion. I will try investigating further why we broadcasts of blocks do not reach all the peers. And @MaciejBaj, what's your opinion on this topic?

@MaciejBaj
Copy link
Contributor

MaciejBaj commented May 30, 2018

The time after which we trigger sync is not accidental.

20 sec interval of sync is conciliatory anyway due to asynchronous nature of the process. Manually triggering the sync process from time to time we aren't changing the behaviour.

Our main pain point is the broadcast process and the update headers not reaching the right amount of peers - 100 is our MAX_PEERS_LIMIT, possibly many more are stored on Peers List. As the result sync process triggers. In this case, it works as the recovery mechanism for the broadcast process not functioning correctly.

I see the value in performing the sync process in one slot before when discovering that Consensus is poor. It's not the solution that guarantees that a block will be forged in the next slot, but increases that probability significantly by going back on the right track with the majority of the network. The corner case of the last block of the round might be ignored. Other 100 cases are valid.

Sync on last block of round is dangerous and can cause more issues.

It can happen anyway, would be nice to know more details.

@4miners
Copy link
Contributor

4miners commented May 30, 2018

Not really matters if sync is triggered exactly when last receipt is 20 old or one second later (even 8 seconds later should be viable). Triggering sync manually because of low consensus (which happens too often currently because of issues with blocks/headers propagation) can cause more new issues than is expected to fix. In my opinion sync cannot be used as recovery or fail safe mechanism for broadcast process - it's a workaround to mitigate bad design/implementation. We should fix root cause instead of wasting time on countless patches that are not even proven to fix anything. Look at 0.9.x - there is no manual sync trigger and network is stable, how is that even possible? 😏

Consensus should be poor only when there is small fork or chain split. In large network it will not prevent from chain splits (requires only 51 peers to match our broadhash). I don't agree that sync will catch up with majority when we're at the beginning of a fork. Peers can have equal height, there can be no longer chain. Triggering sync manually one slot before delegate slot will increase possibility of a fork / chain split further.

Possible scenario:

  • forgers order: A, B, C - expected to forge at slots: 1, 2, 3
  • delegate A node is overloaded, broadcast at the very end of his slot 1 or even at slot 2
  • slot 2, network starts to receive block forged by delegate A
  • delegate B is expected to forge, but not yet received block A, some other nodes already have it - consensus is poor
  • delegate B trigger sync, block A hit delegate B but cannot be received (because of sync phase)
  • delegate B sync with peer that don't have block A (good peers have height >=)
  • delegate B forges his block and broadcast, block get rejected, delegate missed his slot because of sync

Another scenario:

  • (slot 1) delegate A double forge on 2 nodes (or more), broadcast
  • (slot 1) some peers have block A, some A', some don't have it
  • (slot 1) block A' is winning one
  • (slot 1) delegate B only received A (fork not decided)
  • (slot 1) delegate B is next one expected to forge, consensus is low, trigger sync
  • (slot 1) block A' is broadcasted to delegate B, but it cannot receive it because of sync
  • (slot 1) in the meantime block A propagated to more pees or delegate B updated peer info, consensus is fine
  • (slot 2) delegate B forges on top of A, broadcast
  • (slot 2) nodes that have block A - add B to chain
  • (slot 2) delegate C is next one expected to forge, have blocks A and B
  • (slot 3) delegate C forge on top of it, broadcast

Those scenarios meant to demonstrate that sync on unstable network will not fix network stability at all. There are also other bad scenarios. One again - we should fix the root cause - which is related to poor blocks propagation or exchange data (headers) between peers. Messing up with sync is bad.

@MaciejBaj
Copy link
Contributor

We will not proceed with that solution in favour of adjusting 'broadcastLimit' and 'maxRelays' constants.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants