Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix vc true stuck partially #286

Merged
merged 68 commits into from
Jun 10, 2021
Merged

Fix vc true stuck partially #286

merged 68 commits into from
Jun 10, 2021

Conversation

peterlimg
Copy link
Member

@peterlimg peterlimg commented May 17, 2021

DO NOT squash and merge
Changes:

  • Return a clone block of latest finalized magic block when it will be accessed concurrently
  • Update to send entity requests concurrently instead of one by one. When there are lots of miners and shaders, sending request one by one would always time out.
  • on getMinerNode function, check the node from state db first, if failed, check from allMinersList
  • Update dkg move functions to return error instead of bool to see errors easier.
  • Fix nodes status monitor issues.
  • Fix unknown batch error
  • Fix panic errors.

@bbist
Copy link
Contributor

bbist commented May 20, 2021

@peterlimg,
Could you resolve the unit test failures?

platsko and others added 27 commits June 9, 2021 16:12
It is hard to known what's wrong when move failed without returning error.
Even when magic block starting round is the same, the magic block may
have changed, and if we don't restart the status monitor, it would inactive miners
or sharders in the old pools, hence the node active status would not be updated
in current magic block. This will lead further issue for calculating the TKN value
when create new magic block.
In move function for contribute, the mpk keys is checked to meet the condition that len(mpks) >= dkgMinersList.K. In the following phase, sos.SignOrShare will be computed base on the mpks, one thing that need to note is the miner's own share will not be included, hence if the current mpks is 7, the sos will be 6. In the previous code, the condition for passing the phase is sos num >= dkgMinersList.N - 2, which is incorrect. As the max number of sos it can be is len(mpks) - 1, however, miners could move to current phase if the len(mpks) >= dkgMinersList.K. Therefore, if the len(mpks) == K, then len(sos) will be K - 1, as long as the N - 2 > K - 1, i.e  N - 1 > K, the DKG process will fail.
Access it through methods instead, it has to be protected by mutex.
Returned block should not be able to modify the block state or update the variables.
Update node_pool locks
…nction

to avoid lock concurrent accessing and updating of latestFinalizedMagicBlock
It should be safe to do this, although doing http requests in scFunc would not able to
update the nodes active status, but the status monitor will update them.
This error is caused by the incorrect conflicts resolving, where the function was removed.
Stop doing unit test cases parallelly, the global config variable is not thread safe.
Push to channel may have goroutine leak if the receiver returns early
Force restart current round in HandleRoundTimeout
It would stuck the BlockMessageChannel if any function in the handler got stuck.
If there's no receiver waiting for the errC channel, the pushing would be stuck and the goroutine will be leaked.
3 seconds is too short which would lead to incorrect result of 'not registered node' and cause unnecessary repeatedly node registration.
If a request failed due to the inactive of sharder, the GetFromSharders function would get stuck in waiting for the response from collection channel.
@peterlimg peterlimg merged commit e083d39 into master Jun 10, 2021
@Sriep Sriep deleted the fix/vc-true-stuck branch July 21, 2021 12:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants