
FD-883: Fixed sim setup in LeaderBrainSwap_test.go #747

Conversation

sanderPostma
Contributor

This was a very simple misconfiguration of SetupSim. When the BatchCount exceeds ExpectedHeight the test will fail.
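For context, the fix boils down to keeping the expected-height argument passed to SetupSim above the number of swap batches the test runs. A minimal sketch of the corrected setup is below; the SetupSim call itself comes from the test, while the batchCount value, the "--blktime" param name, and the commented "broken" line are assumptions for illustration only:

// Each swap batch consumes at least one block, so the sim needs headroom
// beyond the last batch or it shuts down before the test finishes.
batchCount := 50                               // assumed value for illustration
params := map[string]string{"--blktime": "15"} // assumed param name

// Broken (illustrative): an expected height below batchCount makes the test time out.
// state0 := SetupSim("LLLFFF", params, 10, 0, 0, t)

// Fixed: expected height comfortably exceeds batchCount.
state0 := SetupSim("LLLFFF", params, batchCount+10, 0, 0, t)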

@stackdump
Contributor

We can get this merged in. It seems the failure is not occurring after the bond release.

The error we were seeing before was that a leader would panic if it saw its own Identity on the network.

@sanderPostma
Contributor Author

sanderPostma commented Jun 4, 2019

I've been testing with the following information and branch: setting the value at the linked line to 30 or 50 will likely cause a panic.
https://github.com/FactomProject/factomd/blob/FD-821_release_candidate_parchment/simTest/LeaderBrainSwap_test.go#L70

The only panics I got in parchment while running the test were about the timeout. That was resolved by
state0 := SetupSim("LLLFFF", params, batchCount+10, 0, 0, t)
But I see now that the bond branch does not contain LeaderBrainSwap_test.go at all.
Unfortunately I do not have access to the whole story. I will let the test run a few more times with a very high batch count to make sure there are no more panics in this branch.

@stackdump
Contributor

@sanderPostma sorry to make this a moving target

I incorporated your test tweak here: https://github.com/FactomProject/factomd/blob/FD-883_failing_back_to_back_brainswap_longtest/simTest/LeaderBrainSwap_test.go#L32

This branch was built on top of Bond with the additions needed for the sim test to run.

Because this is set to run for >10 min, you have to override the default time limit for go test like:

go test -timeout=500m -v simTest/LeaderBrainSwap_test.go
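For reference, the same timeout override also works when running the test by name from the repo root (the test function name TestLeaderBrainSwap is an assumption here; use whatever the file actually declares):

go test -timeout=500m -v -run TestLeaderBrainSwap ./simTest/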

I made the test run for 101 swaps; here's what the panic looks like. On my system it happened on block 79:

Guidance from @factom-clay was that we don't really want to call this bug solved unless it can last for 100 blocks under this load condition.

This  AckHash    9de856de360677141412b0069b80d80bbba991f5eba9f3c0b4eb81b661045871
This  ChainID    8888888da6ed14ec63e623cab6917c66b954b361d530770b3f5f5188f87f1738
This  Salt       7627c7a4c5028c4c
This  SaltNumber 765e799e
 for this ack
Ack   ChainID    8888888da6ed14ec63e623cab6917c66b954b361d530770b3f5f5188f87f1738
Ack   Salt       176654567b3def23
Ack   SaltNumber bf4f5b0
 for this ack
panic: There are two leaders configured with the same Identity in this network!  This is a configuration problem!

goroutine 100 [running]:
github.com/FactomProject/factomd/state.(*ProcessList).AddToProcessList(0xc019fb2600, 0xc00056c000, 0xc003601600, 0x1092680, 0xc0036014a0)
        /home/ork/go/src/github.com/FactomProject/factomd/state/processList.go:1012 +0x1c4d
github.com/FactomProject/factomd/state.(*State).FollowerExecuteMsg(0xc00056c000, 0x1092680, 0xc0036014a0)
        /home/ork/go/src/github.com/FactomProject/factomd/state/stateConsensus.go:989 +0x22b
github.com/FactomProject/factomd/common/messages.(*DirectoryBlockSignature).FollowerExecute(0xc0036014a0, 0x109a100, 0xc00056c000)
        /home/ork/go/src/github.com/FactomProject/factomd/common/messages/directoryBlockSignature.go:251 +0x4a
github.com/FactomProject/factomd/state.(*State).FollowerExecuteAck(0xc00056c000, 0x1091640, 0xc003601600)
        /home/ork/go/src/github.com/FactomProject/factomd/state/stateConsensus.go:1082 +0x2f6
github.com/FactomProject/factomd/common/messages.(*Ack).FollowerExecute(0xc003601600, 0x109a100, 0xc00056c000)
        /home/ork/go/src/github.com/FactomProject/factomd/common/messages/ack.go:197 +0x4a
github.com/FactomProject/factomd/state.(*State).executeMsg(0xc00056c000, 0x1091640, 0xc003601600, 0xdf9100)
        /home/ork/go/src/github.com/FactomProject/factomd/state/stateConsensus.go:314 +0x917
github.com/FactomProject/factomd/state.(*State).Process(0xc00056c000, 0x1)
        /home/ork/go/src/github.com/FactomProject/factomd/state/stateConsensus.go:473 +0x83f
github.com/FactomProject/factomd/state.(*State).DoProcessing(0xc00056c000)
        /home/ork/go/src/github.com/FactomProject/factomd/state/validation.go:39 +0x7d
created by github.com/FactomProject/factomd/state.(*State).ValidatorLoop
        /home/ork/go/src/github.com/FactomProject/factomd/state/validation.go:73 +0x5c
FAIL    command-line-arguments  804.717s
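For readers without the code open: the panic fires in ProcessList.AddToProcessList (per the trace) when an incoming Ack claims this node's own leader identity but carries a salt the node never issued. The check is roughly of the following shape; this is an illustrative sketch with invented names, not the actual factomd code:

// Illustrative sketch of the duplicate-identity check behind the panic above.
// A leader that sees an Ack signed with its own identity chain ID but a
// different salt concludes a second node is running the same identity.
func detectDuplicateLeader(myChainID, ackChainID [32]byte, mySalt, ackSalt uint64) {
	if myChainID == ackChainID && mySalt != ackSalt {
		panic("There are two leaders configured with the same Identity in this network!  This is a configuration problem!")
	}
}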

@stackdump
Contributor

stackdump commented Jun 4, 2019

@sanderPostma the test above was run with a pretty tight blktime=10. I ran it again with blktime=15 to see if it is reproducible; this time it failed for me at block 34.

Currently we don't think changing the block time makes that test better or worse.

@sanderPostma
Contributor Author

I finally managed to reproduce it. On my PC it took until block 142 using blktime=15.

@sanderPostma
Contributor Author

sanderPostma commented Jun 7, 2019

Adding 3 extra nodes worked for me. Looking at the logs I found this one:
13311550 00:22:36.284 96-:-9 Ack: EmbeddedMsg: M-f48676|R-689bc7|H-f48676|0xc003fc1b80 Directory Block Signature[ 7]: DBSig-VM 1: DBHt: 97 -- Signer[1570f8] PrevDBKeyMR[5cb49c] HeaderHash[ae5fc7] BodyMR[7befc0] Timestamp[825643087640-2019-06-07 00:22:36] hash[f48676] header - version: 0 networkid: fa92e5a4 bodymr: 7befc0 prevkeymr: 5cb49c prevfullhash: c7e28b timestamp: 25997662 timestamp str: 2019-06-07 00:22:00 dbheight: 96 blockcount: 12

My theory is this: at block 96 node 1570f8 got an AckChange and did an identity reload. But at the end of the last minute of that block it still signed the DBSig as FNode01, embedded it in an Ack message and dispatched it. A moment later it is at height 97 on a new node with a new timestamp salt, where it receives the message it just sent as FNode01, and boom...
I've added some extra logging to confirm it, but I think that's what is happening. That leaves me with the following questions:

  1. What do we do when this happens? Toss the message or still process it?
  2. Should we even have state.RunLeader set to true for the next block when there is an AckChange?
  3. Should we set state.RunLeader to false between ChgAckHeight-1 and AckChange+1 to be safe? (See the sketch below.)
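To make question 3 concrete, here is a rough sketch of the kind of guard being proposed. It is purely illustrative; ackChangeHeight and the surrounding wiring are assumptions, not existing factomd fields:

// Illustrative only: keep leader duties suspended around an identity change.
// Returning false from one block before the AckChange height until one block
// after it means the outgoing identity never signs a DBSig that its post-swap
// "future self" then receives and panics on.
func shouldRunLeader(currentHeight, ackChangeHeight uint32, isLeader bool) bool {
	// written as currentHeight+1 >= ackChangeHeight to avoid uint32 underflow
	if currentHeight+1 >= ackChangeHeight && currentHeight <= ackChangeHeight+1 {
		return false
	}
	return isLeader
}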

@factom-clay
Contributor

factom-clay commented Jun 7, 2019 via email

@sanderPostma
Contributor Author

It's actually the other way around. FNode01 at block 96 becomes FNode05 at block 97, only moveToHeight does not happen at the same moment on all nodes. I see up to 100ms difference between the first and the last node to switch. FNode01 receives a DBSig from the "future" because it is still at block 96 minute 9 while its "future self" is already at block 97 minute 0. (Similar to what happened with the audit brain swap, where one node was lagging behind.)
In the logs I am looking at, 3 out of 8 nodes are behind; the others report block 96 minute 10, so FNode01, still at block 96 minute 9, is further behind than the other two.

The log summary
7zip all logs

For the audit server heartbeat the fix was to not panic when the message was from the future. Would that be safe to do in this case as well?
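For comparison, the audit-heartbeat style of fix being referred to would look roughly like this: rather than panicking on a message whose height is ahead of the node's, park it and retry once the node catches up. A sketch under invented names (handleAck, holdForFuture), not the real factomd API:

// Sketch: tolerate a message from a "future" block instead of panicking.
// If the message is for a height we have not reached yet, the sender may
// simply be a node (possibly our own post-swap identity) that moved to the
// next block first, so we hold the message and re-evaluate later.
func handleAck(msgHeight, currentHeight uint32, holdForFuture func()) bool {
	if msgHeight > currentHeight {
		holdForFuture()
		return false // not processed now; no duplicate-identity panic
	}
	return true // at or below our height: run the normal checks
}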

@factom-clay
Contributor

factom-clay commented Jun 11, 2019 via email

@sanderPostma
Contributor Author

I will need to open a new PR.
