FD-883: Fixed sim setup in LeaderBrainSwap_test.go #747
Conversation
We can get this merged in. It seems the failure is not occurring after the Bond release. The error we were seeing before was that a leader would panic if it saw its own identity on the network.
I've been testing with the following information & branch:
@sanderPostma, sorry to make this a moving target. I incorporated your test tweak here: https://github.com/FactomProject/factomd/blob/FD-883_failing_back_to_back_brainswap_longtest/simTest/LeaderBrainSwap_test.go#L32 This branch was built on top of Bond with the additions needed for the sim test to run. Because this is set to run for more than 10 minutes, you have to override the default time limit for go test, for example as shown below.
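Something along these lines should work; `-timeout` is a standard `go test` flag (the default is 10 minutes), while the exact test name and package path are assumptions based on the file simTest/LeaderBrainSwap_test.go:

```sh
# Raise the go test timeout so the long-running sim can finish.
# Test name and path are assumed from simTest/LeaderBrainSwap_test.go.
go test -v -timeout 60m -run TestLeaderBrainSwap ./simTest/...
```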
I made the test run for 101 swaps; here's what the panic looks like:
[panic screenshot omitted]
Guidance from @factom-clay was that we don't really want to call this bug solved unless it can last for 100 blocks under this load condition.
@sanderPostma the test above was run with a pretty tight blktime=10. I ran again with blktime=15 to see if it is reproducible; this time it failed for me at block 34. Currently we don't think changing the block time makes this test better or worse.
I finally managed to reproduce it. On my PC it took until block 142 using blktime=15.
Adding 3 extra nodes worked for me. Looking at the logs I found this one:
[Ack log line; quoted in full in the reply below]
My theory is this: at block 96, node 1570f8 got an AckChange and did an identity reload. But at the end of the last minute of that block it signed the DBSig as FNode01, embedded an Ack message in it, and dispatched it. A moment later it is at height 97 on a new node with a new timestamp salt, where it receives the message it just sent as FNode01, and boom...
state.RunLeader is a terribly named variable from the beginning of time. State.Leader is the interesting variable. RunLeader really just means you are out of the ignore/wait period at boot, and it is true even on followers.
I like your diagnosis. If you run with --debuglogs=. I can look in the logs and show you exactly the messages, if this is the case.
I believe in SendDBSig we generate both the DBSig and the Ack. There is another place we also generate DBSigs in some edge case... I don't recall the details. I would bet one of them is actually running in minute 9 or 10 instead of minute 0, and thus your scenario can happen because the Leader flag represents the previous block's leader state while the Ack is for the next block.
I see two options:
1) Move DBSig generation to MoveStateToHeight and do it at the very start of minute 0, after the state.Leader flag is set, and only if it is set to true.
2) Make both existing pieces of DBSig-generating code check for an AckChange at that height, and if there is one, check whether the new identity is a leader identity before generating the DBSig/Ack pair.
Option 1 seems cleaner (a rough sketch of it follows below). I'm open to other ideas.
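For what it's worth, here is a rough Go sketch of option 1, assuming leadership for the new height is recomputed inside MoveStateToHeight before anything else runs. Apart from MoveStateToHeight, Leader, and SendDBSig, which are named in this thread, every type and helper below is a hypothetical stand-in, not factomd's real API:

```go
// Hypothetical sketch of option 1: generate the DBSig at the very start of
// minute 0, inside MoveStateToHeight, only after the Leader flag for the new
// block has been set.
package main

import "fmt"

type State struct {
	LLeaderHeight uint32
	CurrentMinute int
	Leader        bool // are we a leader for the block we are moving into?
	LeaderVMIndex int
}

// SendDBSig stands in for the real DBSig/Ack pair generation.
func (s *State) SendDBSig(dbheight uint32, vmIndex int) {
	fmt.Printf("DBSig+Ack for height %d from VM %d\n", dbheight, vmIndex)
}

// MoveStateToHeight moves the node to a new block height. Option 1 puts the
// DBSig generation here, after the new block's leadership is decided, so a
// node that just brainswapped can never sign with its previous identity.
func (s *State) MoveStateToHeight(newHeight uint32) {
	s.LLeaderHeight = newHeight
	s.CurrentMinute = 0

	// Re-evaluate leadership for the NEW height first (identity reloads from
	// an AckChange have already been applied at this point).
	s.Leader, s.LeaderVMIndex = s.computeLeadership(newHeight)

	// Only now, at the very start of minute 0, emit the DBSig/Ack pair.
	if s.Leader {
		s.SendDBSig(newHeight, s.LeaderVMIndex)
	}
}

// computeLeadership is a placeholder for looking up this node's identity in
// the authority set for the given height.
func (s *State) computeLeadership(height uint32) (bool, int) {
	return true, 1 // pretend we are leader of VM 1
}

func main() {
	s := &State{}
	s.MoveStateToHeight(97)
}
```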
Thanks,
Clay
…On Fri, Jun 7, 2019 at 4:16 AM Sander Postma ***@***.***> wrote:
Adding 3 extra nodes worked for me. Looking at the logs I found this one:
13311550 00:22:36.284 96-:-9 Ack: EmbeddedMsg: M-f48676|R-689bc7|H-f48676|0xc003fc1b80
Directory Block Signature[ 7]: DBSig-VM 1: DBHt: 97 -- Signer[1570f8] PrevDBKeyMR[5cb49c] HeaderHash[ae5fc7] BodyMR[7befc0] Timestamp[825643087640-2019-06-07 00:22:36] hash[f48676]
header - version: 0 networkid: fa92e5a4 bodymr: 7befc0 prevkeymr: 5cb49c prevfullhash: c7e28b timestamp: 25997662 timestamp str: 2019-06-07 00:22:00 dbheight: 96 blockcount: 12
My theory is this: at block 96, node 1570f8 got an AckChange signal and an identity reload. But it was still the process-list leader in the last minute of the block, so it signed the last DBSig of that block as FNode01 and sent it out. A moment later it is at height 97 on a new node with a new salt, where it receives the message it just sent as FNode01.
I've added some extra logging to confirm that, but I think that's it. Then I have the following questions:
1. What do we do when this happens? Toss the message or still process it?
2. Should we even have state.RunLeader set to true when there is an AckChange?
3. Should state.RunLeader be false between ChgAckHeight-1 and AckChange+1, to be safe?
It's actually the other way around: FNode01 at block 96 becomes FNode05 at block 97, only MoveStateToHeight does not happen at the same moment on all nodes. I see up to 100 ms difference between the first and last node to switch. FNode01 receives a DBSig from the "future" because it is still at block 96 minute 9 while its "future self" is already at block 97 minute 0. (Similar to what happened with the audit brainswap, where one node was lagging behind.) For the audit server heartbeat the fix was to not panic when the message was from the future. Would that be safe to do in this case as well?
The network and processing delays will make MoveStateToHeight execute at
different times on different nodes. That is not surprising.
Not panicking because of future ACK messages with the same ID is a valid
fix.
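A minimal sketch of that fix, assuming an Ack validator with a reject/hold/process result in the spirit of factomd's message validation; all names below are hypothetical stand-ins, not the actual validation code:

```go
// Hypothetical sketch: when validating an incoming Ack, a message from a
// future block height signed with our own identity is held rather than
// treated as a fatal self-duplicate.
package main

import "fmt"

type Ack struct {
	DBHeight uint32
	SignerID string
}

type Node struct {
	DBHeight uint32
	Identity string
}

// validateAck returns -1 to reject, 0 to hold for later, 1 to process.
func (n *Node) validateAck(a *Ack) int {
	if a.SignerID == n.Identity && a.DBHeight > n.DBHeight {
		// Previously this case panicked ("saw its own identity on the
		// network"). After a brainswap the old identity can legitimately
		// see a future message from its new self, so hold it until this
		// node catches up to that height.
		return 0
	}
	if a.DBHeight < n.DBHeight {
		return -1 // stale, reject
	}
	return 1
}

func main() {
	n := &Node{DBHeight: 96, Identity: "1570f8"}
	fmt.Println(n.validateAck(&Ack{DBHeight: 97, SignerID: "1570f8"})) // 0: hold
}
```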
Thanks,
Clay
…On Tue, Jun 11, 2019 at 6:38 AM Sander Postma ***@***.***> wrote:
It's actually the other way around: FNode01 at block 96 becomes FNode05 at block 97, only MoveStateToHeight does not happen at the same moment on all nodes. I see up to 100 ms difference between the first and last node to switch. FNode01 receives a DBSig from the "future" because it is still at block 96 minute 9 while its "future self" is already at block 97 minute 0. (Similar to what happened with the audit brainswap, where one node was lagging behind.)
In the logs that I am looking at, 3 out of 8 nodes are behind; the others report block 96 minute 10, so FNode01 at block 96 minute 9 is further behind than the other two.
The log summary: https://drive.google.com/open?id=1DzCNvdxA4dkKjJ2mpKB2nlPpy07ESuGW
7zip of all logs: https://drive.google.com/open?id=1YhuCyuU_X26VuWb3C2Nm4Jnkxb_2Tdix
For the audit server heartbeat the fix was to not panic when the message
was from the future. Would that be safe to do in this case as well?
I will need to open a new PR.
This was a very simple misconfiguration of SetupSim: when the BatchCount exceeds the ExpectedHeight, the test will fail. A rough sketch of the constraint follows below.
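To make the constraint concrete, here is a hedged sketch; setupSim, its parameters, and the block arithmetic are all illustrative assumptions, not the real simTest helper. Only the invariant (the configured expected height must cover all the batches the test runs) is the point:

```go
// Sketch of the SetupSim constraint described above: the sim must be
// configured to run long enough for every batch to complete.
package main

import "fmt"

const (
	batchCount     = 5 // brainswap batches the test will run
	blocksPerBatch = 2 // blocks each batch consumes
	warmupBlocks   = 3 // blocks used before the first swap
)

// setupSim stands in for the test helper; it fails the "test" if the sim is
// configured to stop before the batches can finish.
func setupSim(nodes string, expectedHeight int) error {
	needed := warmupBlocks + batchCount*blocksPerBatch
	if expectedHeight < needed {
		return fmt.Errorf("sim ends at height %d but the batches need %d blocks",
			expectedHeight, needed)
	}
	return nil
}

func main() {
	// Misconfigured: the batch count effectively exceeds the expected height.
	fmt.Println(setupSim("LLLLAAF", 10)) // error
	// Fixed: give the sim enough height to run every batch.
	fmt.Println(setupSim("LLLLAAF", 13)) // <nil>
}
```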