Fix vc true stuck partially #286

peterlimg · 2021-05-17T03:54:29Z

DO NOT squash and merge
Changes:

Return a clone of the latest finalized magic block wherever it may be accessed concurrently.
Send entity requests concurrently instead of one by one. With many miners and sharders, sending the requests one by one would always time out.
In the getMinerNode function, look up the node in the state DB first; if that fails, fall back to allMinersList.
Update the DKG move functions to return an error instead of a bool so that failures are easier to diagnose.
Fix node status monitor issues.
Fix unknown batch error
Fix panic errors.

bbist · 2021-05-20T02:19:33Z

@peterlimg,
Could you resolve the unit test failures?


Even when the magic block's starting round is unchanged, the magic block itself may have changed. If the status monitor is not restarted, it keeps checking the miners or sharders in the old pools, so the active status of the nodes in the current magic block is never updated. This causes further problems when calculating the TKN values while creating a new magic block.

In the move function for the contribute phase, the mpks are checked against the condition len(mpks) >= dkgMinersList.K. In the following phase, sos.SignOrShare is computed from the mpks. Note that the miner's own share is not included, so if there are currently 7 mpks, there will be 6 sos. In the previous code, the condition for passing the phase was sos num >= dkgMinersList.N - 2, which is incorrect: the number of sos can be at most len(mpks) - 1, yet miners can enter the current phase as soon as len(mpks) >= dkgMinersList.K. Therefore, if len(mpks) == K, then len(sos) is K - 1, and whenever N - 2 > K - 1, i.e. N - 1 > K, the DKG process fails.
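
The arithmetic can be checked with a small sketch. The function names and the values N = 10, K = 7 are hypothetical and only restate the conditions described above; this is not the DKG code itself, and it does not show the corrected condition.

```go
package main

import "fmt"

// maxShares returns the largest number of sos entries a miner can collect:
// one per received mpk, excluding the miner's own share.
func maxShares(mpks int) int { return mpks - 1 }

// oldPhaseCheck is the previous pass condition: sos num >= N - 2.
func oldPhaseCheck(sos, n int) bool { return sos >= n-2 }

// canStall reports whether a miner that entered the phase with exactly K
// mpks can never satisfy the old check.
func canStall(n, k int) bool { return !oldPhaseCheck(maxShares(k), n) }

func main() {
	n, k := 10, 7 // hypothetical values for illustration

	fmt.Println(maxShares(k))        // 6
	fmt.Println(oldPhaseCheck(6, n)) // false: 6 < N - 2 = 8
	fmt.Println(canStall(n, k))      // true: N - 1 = 9 > K = 7
	fmt.Println(canStall(8, 7))      // false: N - 1 = 7 is not greater than K
}
```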


guruhubb requested a review from bbist May 17, 2021 04:29

peterlimg force-pushed the fix/vc-true-stuck branch from f9669c6 to 077ec6d Compare May 17, 2021 12:33

platsko and others added 27 commits June 9, 2021 16:12

ISSUE#227 Correct go.mod and go.sum (#232)

20309f2

Add debug log

Get magic block of a given round without offset for status monitoring

1c20c91

Add debug log

Return error instead of bool value for move functions

76cf0f6

It is hard to know what went wrong when a move fails without returning an error.

Add round offset back for getting prev magic block from MB

b63ae67

Getting or updating MPT values through functions

f704e4e

Fix unknown batch error

418a63f

Do not return own LFMB for GetLatestFinalizedMagicBlockFromShareders

481cf9a

Add lock when getting latest finalized magic block from sharders

8118e80

Remove useless copyNodes

8d91abf

Do not export chain's latest finalized magic block

47f2039

Access it through methods instead; it must be protected by a mutex.

Add clone method for block.Block

a235347

The returned block must not be able to modify the original block's state or update its variables.
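
The two commits above (an unexported field behind mutex-guarded methods, plus a clone on return) can be sketched together as follows. MagicBlock and Chain here are simplified, hypothetical stand-ins; the real block.Block and chain types carry far more state, and their Clone methods may copy different fields.

```go
package main

import (
	"fmt"
	"sync"
)

// MagicBlock is a simplified, hypothetical stand-in for the real type.
type MagicBlock struct {
	StartingRound int64
	Miners        []string
}

// Clone returns a deep copy so callers cannot mutate the original's slice.
func (mb *MagicBlock) Clone() *MagicBlock {
	if mb == nil {
		return nil
	}
	return &MagicBlock{
		StartingRound: mb.StartingRound,
		Miners:        append([]string(nil), mb.Miners...),
	}
}

// Chain keeps the latest finalized magic block unexported, so it is only
// reachable through the mutex-guarded methods below.
type Chain struct {
	mu                        sync.RWMutex
	latestFinalizedMagicBlock *MagicBlock
}

// SetLatestFinalizedMagicBlock stores a private copy under the write lock.
func (c *Chain) SetLatestFinalizedMagicBlock(mb *MagicBlock) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.latestFinalizedMagicBlock = mb.Clone()
}

// GetLatestFinalizedMagicBlock returns a clone under the read lock, so the
// caller may read or modify it without affecting the chain's copy.
func (c *Chain) GetLatestFinalizedMagicBlock() *MagicBlock {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.latestFinalizedMagicBlock.Clone()
}

func main() {
	c := &Chain{}
	c.SetLatestFinalizedMagicBlock(&MagicBlock{StartingRound: 5, Miners: []string{"m1", "m2"}})

	got := c.GetLatestFinalizedMagicBlock()
	got.Miners[0] = "tampered" // only changes the caller's clone

	fmt.Println(c.GetLatestFinalizedMagicBlock().Miners[0]) // m1
}
```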

Set root hash in CreateState method

b440b7c

Change Client.Copy method name to Clone()

efc9149

Change getMinerNode to get node first, if failed, then check the miner list

deb408a

Export GetPhaseNode

1ea5e7e

Remove Info lock from node

1747287

Update node_pool locks

Add chain.RequestEntityFromMiners to request entity from miners

b678ae1

Add MagicBlockBrief struct for getting latest finalized magic block base info

822f0a9

Remove unnecessary bytes converting

503ae0a

Check if miners pool is nil before copying nodes to avoid panic

7bff6e9

Fix ensureLatestFinalizedBlocks

3d633a3

Run scFunc in the lock of LatestFinalizedMagicBlockUpdate callback function to avoid lock concurrent accessing and updating of latestFinalizedMagicBlock

aef5858

Export NewStateContext

e0b4a60

add PORTABLE=1 for rocksdb build

80a8278

Fix logging error

528b1c6

peterlimg added 26 commits June 9, 2021 16:23

Run scFunc against cloned lfmb

cf9ffbc

This should be safe. HTTP requests made in scFunc will not be able to update the nodes' active status on the clone, but the status monitor will update it.

Move http request error to N2n logger

d607c01

Fix goroutine leak on SetupSC

2fc9dc0

Fix GetMagicBlockNoOffset missing error

f6dcd9f

This error was caused by an incorrect merge conflict resolution, in which the function was removed.

Fix MakeSCRestAPICall error

20865ec

Fix panic of sending VRFS

66140c9

Fix integration tests

01857ac

Remove unused code

3796b3e

Restart status monitor with magic block starting round

4ff3125

Fix unit test errors

bedfe42

Stop running unit test cases in parallel; the global config variable is not thread-safe.

Fix goroutine leak in checking node registration

f91e44d

Close channel to notify the finish of HandleRoundTimeout

fba4775

Pushing to the channel may leak the goroutine if the receiver returns early.

Minor refactoring of log

ac6d586

Force restart current round

367d745

Force restart current round in HandleRoundTimeout

Do not run computeState in goroutine in AddNotarizedBlock

d779be1

Update getLatestFinalizedBlockFromSharders

a6a67a3

Move the BlockMessage handler to a goroutine

755ca64

The BlockMessageChannel would be blocked if any function in the handler got stuck.

Push err to errC with 'select .. default' to avoid stuck

b3f3ed7

If no receiver is waiting on the errC channel, the push blocks forever and the goroutine is leaked.

Extend the timeout time to 30 seconds for checking node register

f412cd2

3 seconds is too short: it leads to incorrect 'not registered node' results and causes unnecessary repeated node registration.

Fix goroutine leak in GetFromSharders function

88c066a

If a request failed because a sharder was inactive, the GetFromSharders function would get stuck waiting for the response on the collection channel.

Update to avoid copying mutex value on Client struct

4779953

Check error on encoding self node client

acf81ac

Reassign miner round to avoid nil panic

f1b6a15

Do not return error for adding miners or sharders that already exist

5106630

Fetch missing state nodes in LFB instead of c.StateDB

8076c1c

Return 500 internal server error for failed to decode gn

0cfa7e1

peterlimg force-pushed the fix/vc-true-stuck branch from a617640 to 0cfa7e1 Compare June 10, 2021 03:34

Fix unit test errors

848b57e

peterlimg merged commit e083d39 into master Jun 10, 2021

Sriep deleted the fix/vc-true-stuck branch July 21, 2021 12:44
