View Change process gets stuck at Share phase and restarts whenever one of the MPK miners goes down (or intentionally rejects share) during Share phase #190

Closed
bbist opened this issue Apr 16, 2021 · 6 comments
@bbist
Contributor

bbist commented Apr 16, 2021

In the DKGProcess() method in 0chain.net/miner/protocol_view_change.go,
a new phase event is allowed to execute under any of the following conditions:

  1. It is the share phase
  2. It is the start phase
  3. It is exactly one phase after the current phase

This logic breaks the view-change process. In the current implementation, the current phase does not advance while the share phase function keeps returning an error. Re-executing the share phase is fine in itself, because the periodic phase event forces the miner to re-send its shares to the other MPK miners. However, when the phase completion time (measured in rounds) expires, the MinerSC smart contract moves to the publish phase, and all DKG miners receive a publish phase event. A miner that was stuck re-executing the share phase, because it could not get verification from at least one of the DKG miners and so never moved its current phase from contribute to share, cannot proceed because of condition 3 above: its current phase is still contribute, but the phase event received is publish (contribute + 2). This shows up as the "dkg process -- jumping over a phase" debug log in 0chain.log.
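
To make the failure concrete, here is a minimal Go sketch of the three conditions as described in this issue. It is not the actual DKGProcess() code: the Phase type, the constant names, and the allowed() helper are hypothetical, and the numeric values are only assumed to match the "phase" numbers that appear in the logs (start = 0, contribute = 1, share = 2, publish = 3).

```go
package main

import "fmt"

// Phase is a hypothetical stand-in for the DKG phase value.
// The numbering is assumed from the "phase" field in the logs.
type Phase int

const (
	Start      Phase = iota // 0
	Contribute              // 1
	Share                   // 2
	Publish                 // 3
)

// allowed reports whether a phase event may execute, following the
// three conditions listed in this issue (hypothetical helper).
func allowed(current, next Phase) bool {
	return next == Share || // condition 1: share phase events always run
		next == Start || // condition 2: start phase events always run
		next == current+1 // condition 3: exactly one phase ahead
}

func main() {
	// A miner whose share phase keeps failing stays at contribute.
	current := Contribute

	fmt.Println(allowed(current, Share))   // true: share is retried
	fmt.Println(allowed(current, Publish)) // false: contribute + 2 is a jump
}
```

Because the current phase never leaves contribute, every share event passes the check, but the later publish event is rejected.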

A solution would be to let the share phase move the current phase from contribute to share even if the share phase function fails to send shares to, and receive verification from, a few of the MPK miners. The share phase should also be retried every "repeat" interval (until the next phase starts), so that it can still succeed if the earlier failures were caused by network issues. This ensures that the current phase has moved from contribute to share during the share phase even if a few MPK miners failed to verify shares (for whatever reason), so the subsequent publish phase event executes as expected under condition 3.
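
A rough Go sketch of this proposal follows. It is only an illustration under stated assumptions, not the change that ended up in the repository: the miner struct, onPhaseEvent(), and the two error values are hypothetical names, and the phase numbering (start = 0, contribute = 1, share = 2, publish = 3) is assumed from the logs.

```go
package main

import (
	"errors"
	"fmt"
)

// Phase is a hypothetical stand-in for the DKG phase value.
type Phase int

const (
	Start Phase = iota
	Contribute
	Share
	Publish
)

// Hypothetical errors used only in this sketch.
var (
	errPeerDown = errors.New("an MPK miner did not verify the share")
	errJump     = errors.New("jumping over a phase")
)

// miner holds only the state this sketch needs.
type miner struct{ current Phase }

// onPhaseEvent applies the three conditions from this issue, runs the
// phase function, and then decides whether the current phase advances.
func (m *miner) onPhaseEvent(next Phase, run func() error) error {
	if next != Share && next != Start && next != m.current+1 {
		return errJump
	}
	err := run()
	if err != nil && next != Share {
		return err // other phases: stay put on failure, as before
	}
	// Proposed change: the share phase advances even on a partial
	// failure. It is still retried on the next periodic share event,
	// since share events always pass the check above.
	m.current = next
	return err
}

func main() {
	m := &miner{current: Contribute}

	// One MPK miner is down, so the share phase function fails.
	err := m.onPhaseEvent(Share, func() error { return errPeerDown })
	fmt.Println(m.current == Share, err != nil) // true true

	// The publish event is now exactly one phase ahead and executes.
	err = m.onPhaseEvent(Publish, func() error { return nil })
	fmt.Println(m.current == Publish, err == nil) // true true
}
```

Returning the share error while still advancing keeps the failure visible to the caller for logging and retry, without blocking the publish phase.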

Terminology:
DKG Miners: All miners selected for DKG process
MPK Miners: All DKG miners who submitted their MPK to the BC

@bbist bbist changed the title View Change process gets stuck at Share phase and restart whenever one of the MPK miners goes down (or intentionally rejects share) during Share phase View Change process gets stuck at Share phase and restarts whenever one of the MPK miners goes down (or intentionally rejects share) during Share phase Apr 16, 2021
@peterlimg
Member

But when the phase completion time (in terms of rounds) expires, the MinerSC smart contract moves to publish phase
May I know where the code related to this is? Thank you.

@bbist
Contributor Author

bbist commented Apr 17, 2021

The logs that helped identify this were from https://two.devnet-0chain.net/miner*. The following sequence of log entries seems to repeat continuously.

2021-04-14T16:37:36.285Z        INFO    miner/protocol_view_change.go:160       dkg process start       {"next_phase": {"phase":0,"start_round":158701,"current_round":158702,"restarts":726}, "phase": "unknown", "sc funcs": 5}
2021-04-14T16:37:36.285Z        DEBUG   miner/protocol_view_change.go:178       dkg process move phase  {"next_phase": {"phase":0,"start_round":158701,"current_round":158702,"restarts":726}, "phase": "unknown", "txn": null}
2021-04-14T16:37:36.285Z        DEBUG   miner/protocol_view_change.go:187       dkg process moved phase {"old_phase": "unknown", "phase": "start"}
...
2021-04-14T16:38:11.368Z        DEBUG   miner/protocol_view_change.go:144       dkg process trying      {"next_phase": {"phase":1,"start_round":158751,"current_round":158752,"restarts":726}, "phase": "start", "sc funcs": 5}
2021-04-14T16:38:11.368Z        INFO    miner/protocol_view_change.go:160       dkg process start       {"next_phase": {"phase":1,"start_round":158751,"current_round":158752,"restarts":726}, "phase": "start", "sc funcs": 5}
...
2021-04-14T16:38:11.462Z        DEBUG   miner/protocol_view_change.go:178       dkg process move phase  {"next_phase": {"phase":1,"start_round":158751,"current_round":158752,"restarts":726}, "phase": "start", "txn": {"hash":"107f234e46ec8105c71a5f0a054589aaacaafa0e8bdd192f0c88fab4a94331f1","version":"1.0","client_id":"32468d29660506a60dedab3f5a53fc70d03d26721a59c17e1f162efab6e614eb","public_key":"856d646281b257bab951a05fcf8dee063d580083d7d8bbb613bf346f38aef1039669f556fe96cee365a9324954e9a65c0c2feeb98ed2d45458ef3580aafebd9c","to_client_id":"6dba10422e368813802877a85039d3985d96760ed844092319743fb3a76712d9","chain_id":"0afc093ffb509f059c55478bc1a60351cef7b4e9c008a53a6cc8241ca8617dfe","transaction_data":"{\"name\":\"contributeMpk\",\"input\":{\"ID\":\"32468d29660506a60dedab3f5a53fc70d03d26721a59c17e1f162efab6e614eb\",\"Mpk\":[\"a81ecd03097132db82b0cb27eb992dd5b7e09f34cd384061bc41e1f44b29ac1ab2457ace90b3847cbc36f46393320d94c269b1a123851d4e44e4366f22cf088f\",\"36c68c2b4a72fe722cb63ecbf77b90d901c000275866abb764a89cfc73d27a0654b836d12e2a834240e967750170976ef7f7c72aa18d908ae79a996b547f16a2\",\"39d1c5c9a8dbca0f4990a035b2fadea315222756de164fe5148187acb6bbff1dec52430db6126e1b6554c3846ece1c436eadccd17349e698983eae7a1a9b5b08\"]}}","signature":"93ba5f267d1a63fd1f5ae544677361f4da23336d1ec36cca6ffac9f8ef9ccd23","creation_date":1618418291,"transaction_type":1000,"txn_output_hash":""}}
...
2021-04-14T16:38:14.765Z        DEBUG   miner/protocol_view_change.go:187       dkg process moved phase {"old_phase": "start", "phase": "contribute"}
...
2021-04-14T16:38:46.197Z        DEBUG   chain/protocol_view_change.go:601       dkg_process -- phase from sharders      {"phase": "share", "start_round": 158801, "restarts": 726}
2021-04-14T16:38:46.198Z        DEBUG   miner/protocol_view_change.go:144       dkg process trying      {"next_phase": {"phase":2,"start_round":158801,"current_round":158801,"restarts":726}, "phase": "contribute", "sc funcs": 5}
2021-04-14T16:38:46.198Z        INFO    miner/protocol_view_change.go:160       dkg process start       {"next_phase": {"phase":2,"start_round":158801,"current_round":158801,"restarts":726}, "phase": "contribute", "sc funcs": 5}
...
2021-04-14T16:38:56.191Z        DEBUG   chain/protocol_view_change.go:601       dkg_process -- phase from sharders      {"phase": "share", "start_round": 158801, "restarts": 726}
2021-04-14T16:38:56.191Z        DEBUG   miner/protocol_view_change.go:144       dkg process trying      {"next_phase": {"phase":2,"start_round":158801,"current_round":158815,"restarts":726}, "phase": "contribute", "sc funcs": 5}
2021-04-14T16:38:56.191Z        INFO    miner/protocol_view_change.go:160       dkg process start       {"next_phase": {"phase":2,"start_round":158801,"current_round":158815,"restarts":726}, "phase": "contribute", "sc funcs": 5}
...
2021-04-14T16:39:01.192Z        DEBUG   chain/protocol_view_change.go:601       dkg_process -- phase from sharders      {"phase": "share", "start_round": 158801, "restarts": 726}
2021-04-14T16:39:01.192Z        DEBUG   miner/protocol_view_change.go:144       dkg process trying      {"next_phase": {"phase":2,"start_round":158801,"current_round":158822,"restarts":726}, "phase": "contribute", "sc funcs": 5}
2021-04-14T16:39:01.192Z        INFO    miner/protocol_view_change.go:160       dkg process start       {"next_phase": {"phase":2,"start_round":158801,"current_round":158822,"restarts":726}, "phase": "contribute", "sc funcs": 5}
...
2021-04-14T16:39:06.203Z        DEBUG   chain/protocol_view_change.go:601       dkg_process -- phase from sharders      {"phase": "share", "start_round": 158801, "restarts": 726}
2021-04-14T16:39:06.203Z        DEBUG   miner/protocol_view_change.go:144       dkg process trying      {"next_phase": {"phase":2,"start_round":158801,"current_round":158828,"restarts":726}, "phase": "contribute", "sc funcs": 5}
2021-04-14T16:39:06.203Z        INFO    miner/protocol_view_change.go:160       dkg process start       {"next_phase": {"phase":2,"start_round":158801,"current_round":158828,"restarts":726}, "phase": "contribute", "sc funcs": 5}
...
2021-04-14T16:39:11.258Z        DEBUG   chain/protocol_view_change.go:601       dkg_process -- phase from sharders      {"phase": "share", "start_round": 158801, "restarts": 726}
2021-04-14T16:39:11.258Z        DEBUG   miner/protocol_view_change.go:144       dkg process trying      {"next_phase": {"phase":2,"start_round":158801,"current_round":158835,"restarts":726}, "phase": "contribute", "sc funcs": 5}
2021-04-14T16:39:11.258Z        INFO    miner/protocol_view_change.go:160       dkg process start       {"next_phase": {"phase":2,"start_round":158801,"current_round":158835,"restarts":726}, "phase": "contribute", "sc funcs": 5}
...
2021-04-14T16:39:16.321Z        DEBUG   chain/protocol_view_change.go:601       dkg_process -- phase from sharders      {"phase": "share", "start_round": 158801, "restarts": 726}
2021-04-14T16:39:16.321Z        DEBUG   miner/protocol_view_change.go:144       dkg process trying      {"next_phase": {"phase":2,"start_round":158801,"current_round":158842,"restarts":726}, "phase": "contribute", "sc funcs": 5}
2021-04-14T16:39:16.321Z        INFO    miner/protocol_view_change.go:160       dkg process start       {"next_phase": {"phase":2,"start_round":158801,"current_round":158842,"restarts":726}, "phase": "contribute", "sc funcs": 5}
...
2021-04-14T16:39:21.220Z        DEBUG   chain/protocol_view_change.go:601       dkg_process -- phase from sharders      {"phase": "share", "start_round": 158801, "restarts": 726}
2021-04-14T16:39:21.220Z        DEBUG   miner/protocol_view_change.go:144       dkg process trying      {"next_phase": {"phase":2,"start_round":158801,"current_round":158849,"restarts":726}, "phase": "contribute", "sc funcs": 5}
2021-04-14T16:39:21.220Z        INFO    miner/protocol_view_change.go:160       dkg process start       {"next_phase": {"phase":2,"start_round":158801,"current_round":158849,"restarts":726}, "phase": "contribute", "sc funcs": 5}
...
2021-04-14T16:39:26.259Z        DEBUG   chain/protocol_view_change.go:601       dkg_process -- phase from sharders      {"phase": "publish", "start_round": 158851, "restarts": 726}
2021-04-14T16:39:26.259Z        DEBUG   miner/protocol_view_change.go:144       dkg process trying      {"next_phase": {"phase":3,"start_round":158851,"current_round":158856,"restarts":726}, "phase": "contribute", "sc funcs": 5}
2021-04-14T16:39:26.259Z        DEBUG   miner/protocol_view_change.go:151       dkg process -- jumping over a phase; skip, wait for 'start'     {"current_phase": 1, "phase": "publish"}

@bbist
Contributor Author

bbist commented Apr 17, 2021

The following is the full log file from miner01:
0chain-2021-04-14T16-53-03.179.zip

@peterlimg
Member

But when the phase completion time (in terms of rounds) expires, the MinerSC smart contract moves to publish phase,

A round timeout would not move the process to the publish phase. If the miner received a publish phase event, it means that other miners, and the sharders, have already moved to the publish phase. If the current miner was still in the contribute phase, it will jump over a phase and set its current phase to unknown. I think setting it to unknown is for safety. The current miner may then have to wait until all the others move to the start phase again before it can rejoin the DKG process.

@bbist
Contributor Author

bbist commented May 5, 2021

Hi @peterlimg,

My concern is not whether a miner can participate again when the DKG process restarts.
Since the share phase is the phase most prone to Byzantine attacks, we should eliminate as many causes of a DKG restart as possible. The protocol itself already accounts for the failure of some nodes (K). I don't understand why a miner should be forced to rejoin the DKG on restart when only one of the other miners failed to respond to its share. I think we should allow miners to continue even if they were not able to send shares to, and receive signs from, some of the other DKG miners. If they cannot accumulate enough signs, they will automatically be removed by minersc. Forcing them to wait for the next start phase because of phase jumping is not necessary.

@bbist bbist reopened this May 5, 2021
peterlimg added a commit that referenced this issue May 7, 2021
allow share phase to partially fail and still move to publish phase #190
@peterlimg
Member

Closed by #191
