FD-883: Fixed sim setup in LeaderBrainSwap_test.go #747
Conversation
We can get this merged in. It seems the failure is not occurring after the Bond release. The error we were seeing before was that a leader would panic if it saw its own identity on the network.
I've been testing with the following information & branch:
@sanderPostma, sorry to make this a moving target. I incorporated your test tweak here: https://github.com/FactomProject/factomd/blob/FD-883_failing_back_to_back_brainswap_longtest/simTest/LeaderBrainSwap_test.go#L32 This branch was built on top of Bond with the additions needed for the sim test to run. Because this is set to run for more than 10 minutes, you have to override the default time limit for go test, for example as shown below.
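Something along these lines should work; `-timeout` is a standard `go test` flag (the default is 10 minutes), while the exact test name and package path are assumptions based on the file simTest/LeaderBrainSwap_test.go:

```sh
# Raise the go test timeout so the long-running sim can finish.
# Test name and path are assumed from simTest/LeaderBrainSwap_test.go.
go test -v -timeout 60m -run TestLeaderBrainSwap ./simTest/...
```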
I made the test run for 101 swaps; here's what the panic looks like:
[panic screenshot omitted]
Guidance from @factom-clay was that we don't really want to call this bug solved unless it can last for 100 blocks under this load condition.
@sanderPostma the test above was run with a pretty tight blktime=10. I ran again with blktime=15 to see if it is reproducible; this time it failed for me at block 34. Currently we don't think changing the block time makes this test better or worse.
I finally managed to reproduce it. On my PC it took until block 142 using blktime=15.
Adding 3 extra nodes worked for me. Looking at the logs I found this one:
[Ack log line; quoted in full in the reply below]
My theory is this: at block 96, node 1570f8 got an AckChange and did an identity reload. But at the end of the last minute of that block it signed the DBSig as FNode01, embedded an Ack message in it, and dispatched it. A moment later it is at height 97 on a new node with a new timestamp salt, where it receives the message it just sent as FNode01, and boom...
state.RunLeader is a terribly named variable from the beginning of time. State.Leader is the interesting variable. RunLeader really just means you are out of the ignore/wait period at boot, and it is true even on followers.
I like your diagnosis. If you run with --debuglogs=. I can look in the logs and show you exactly the messages, if this is the case.
I believe in SendDBSig we generate both the DBSig and the Ack. There is another place we also generate DBSigs in some edge case... I don't recall the details. I would bet one of them is actually running in minute 9 or 10 instead of minute 0, and thus your scenario can happen because the Leader flag represents the previous block's leader state while the Ack is for the next block.
I see two options:
1) Move DBSig generation to MoveStateToHeight and do it at the very start of minute 0, after the state.Leader flag is set, and only if it is set to true.
2) Make both existing pieces of DBSig-generating code check for an AckChange at that height, and if there is one, check whether the new identity is a leader identity before generating the DBSig/Ack pair.
Option 1 seems cleaner (a rough sketch of it follows below). I'm open to other ideas.
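For what it's worth, here is a rough Go sketch of option 1, assuming leadership for the new height is recomputed inside MoveStateToHeight before anything else runs. Apart from MoveStateToHeight, Leader, and SendDBSig, which are named in this thread, every type and helper below is a hypothetical stand-in, not factomd's real API:

```go
// Hypothetical sketch of option 1: generate the DBSig at the very start of
// minute 0, inside MoveStateToHeight, only after the Leader flag for the new
// block has been set.
package main

import "fmt"

type State struct {
	LLeaderHeight uint32
	CurrentMinute int
	Leader        bool // are we a leader for the block we are moving into?
	LeaderVMIndex int
}

// SendDBSig stands in for the real DBSig/Ack pair generation.
func (s *State) SendDBSig(dbheight uint32, vmIndex int) {
	fmt.Printf("DBSig+Ack for height %d from VM %d\n", dbheight, vmIndex)
}

// MoveStateToHeight moves the node to a new block height. Option 1 puts the
// DBSig generation here, after the new block's leadership is decided, so a
// node that just brainswapped can never sign with its previous identity.
func (s *State) MoveStateToHeight(newHeight uint32) {
	s.LLeaderHeight = newHeight
	s.CurrentMinute = 0

	// Re-evaluate leadership for the NEW height first (identity reloads from
	// an AckChange have already been applied at this point).
	s.Leader, s.LeaderVMIndex = s.computeLeadership(newHeight)

	// Only now, at the very start of minute 0, emit the DBSig/Ack pair.
	if s.Leader {
		s.SendDBSig(newHeight, s.LeaderVMIndex)
	}
}

// computeLeadership is a placeholder for looking up this node's identity in
// the authority set for the given height.
func (s *State) computeLeadership(height uint32) (bool, int) {
	return true, 1 // pretend we are leader of VM 1
}

func main() {
	s := &State{}
	s.MoveStateToHeight(97)
}
```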
Thanks,
Clay
…On Fri, Jun 7, 2019 at 4:16 AM Sander Postma ***@***.***> wrote:
Adding 3 extra nodes worked for me. Looking at the logs I found this one:
13311550 00:22:36.284 96-:-9 Ack: EmbeddedMsg: M-f48676|R-689bc7|H-f48676|0xc003fc1b80
Directory Block Signature[ 7]: DBSig-VM 1: DBHt: 97 -- Signer[1570f8] PrevDBKeyMR[5cb49c] HeaderHash[ae5fc7] BodyMR[7befc0] Timestamp[825643087640-2019-06-07 00:22:36] hash[f48676]
header - version: 0 networkid: fa92e5a4 bodymr: 7befc0 prevkeymr: 5cb49c prevfullhash: c7e28b timestamp: 25997662 timestamp str: 2019-06-07 00:22:00 dbheight: 96 blockcount: 12
My theory is this: at block 96, node 1570f8 got an AckChange signal and an identity reload. But it was still the process-list leader in the last minute of the block, so it signed the last DBSig of that block as FNode01 and sent it out. A moment later it is at height 97 on a new node with a new salt, where it receives the message it just sent as FNode01.
I've added some extra logging to confirm that, but I think that's it. Then I have the following questions:
1. What do we do when this happens? Toss the message or still process it?
2. Should we even have state.RunLeader set to true when there is an AckChange?
3. Should state.RunLeader be false between ChgAckHeight-1 and AckChange+1, to be safe?
It's actually the other way around: FNode01 at block 96 becomes FNode05 at block 97, only MoveStateToHeight does not happen at the same moment on all nodes. I see up to 100 ms difference between the first and last node to switch. FNode01 receives a DBSig from the "future" because it is still at block 96 minute 9 while its "future self" is already at block 97 minute 0. (Similar to what happened with the audit brainswap, where one node was lagging behind.) For the audit server heartbeat the fix was to not panic when the message was from the future. Would that be safe to do in this case as well?
The network and processing delays will make MoveStateToHeight execute at
different times on different nodes. That is not surprising.
Not panicking because of future ACK messages with the same ID is a valid
fix.
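A minimal sketch of that fix, assuming an Ack validator with a reject/hold/process result in the spirit of factomd's message validation; all names below are hypothetical stand-ins, not the actual validation code:

```go
// Hypothetical sketch: when validating an incoming Ack, a message from a
// future block height signed with our own identity is held rather than
// treated as a fatal self-duplicate.
package main

import "fmt"

type Ack struct {
	DBHeight uint32
	SignerID string
}

type Node struct {
	DBHeight uint32
	Identity string
}

// validateAck returns -1 to reject, 0 to hold for later, 1 to process.
func (n *Node) validateAck(a *Ack) int {
	if a.SignerID == n.Identity && a.DBHeight > n.DBHeight {
		// Previously this case panicked ("saw its own identity on the
		// network"). After a brainswap the old identity can legitimately
		// see a future message from its new self, so hold it until this
		// node catches up to that height.
		return 0
	}
	if a.DBHeight < n.DBHeight {
		return -1 // stale, reject
	}
	return 1
}

func main() {
	n := &Node{DBHeight: 96, Identity: "1570f8"}
	fmt.Println(n.validateAck(&Ack{DBHeight: 97, SignerID: "1570f8"})) // 0: hold
}
```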
Thanks,
Clay
…On Tue, Jun 11, 2019 at 6:38 AM Sander Postma ***@***.***> wrote:
It's actually the other way around: FNode01 at block 96 becomes FNode05 at block 97, only MoveStateToHeight does not happen at the same moment on all nodes. I see up to 100 ms difference between the first and last node to switch. FNode01 receives a DBSig from the "future" because it is still at block 96 minute 9 while its "future self" is already at block 97 minute 0. (Similar to what happened with the audit brainswap, where one node was lagging behind.)
In the logs that I am looking at, 3 out of 8 nodes are behind; the others report block 96 minute 10, so FNode01 at block 96 minute 9 is further behind than the other two.
The log summary: https://drive.google.com/open?id=1DzCNvdxA4dkKjJ2mpKB2nlPpy07ESuGW
7zip of all logs: https://drive.google.com/open?id=1YhuCyuU_X26VuWb3C2Nm4Jnkxb_2Tdix
For the audit server heartbeat the fix was to not panic when the message
was from the future. Would that be safe to do in this case as well?
I will need to open a new PR.
This was a very simple misconfiguration of SetupSim: when the BatchCount exceeds the ExpectedHeight, the test will fail. A rough sketch of the constraint follows below.
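To make the constraint concrete, here is a hedged sketch; setupSim, its parameters, and the block arithmetic are all illustrative assumptions, not the real simTest helper. Only the invariant (the configured expected height must cover all the batches the test runs) is the point:

```go
// Sketch of the SetupSim constraint described above: the sim must be
// configured to run long enough for every batch to complete.
package main

import "fmt"

const (
	batchCount     = 5 // brainswap batches the test will run
	blocksPerBatch = 2 // blocks each batch consumes
	warmupBlocks   = 3 // blocks used before the first swap
)

// setupSim stands in for the test helper; it fails the "test" if the sim is
// configured to stop before the batches can finish.
func setupSim(nodes string, expectedHeight int) error {
	needed := warmupBlocks + batchCount*blocksPerBatch
	if expectedHeight < needed {
		return fmt.Errorf("sim ends at height %d but the batches need %d blocks",
			expectedHeight, needed)
	}
	return nil
}

func main() {
	// Misconfigured: the batch count effectively exceeds the expected height.
	fmt.Println(setupSim("LLLLAAF", 10)) // error
	// Fixed: give the sim enough height to run every batch.
	fmt.Println(setupSim("LLLLAAF", 13)) // <nil>
}
```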