-
Notifications
You must be signed in to change notification settings - Fork 720
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[potential BUG] - TraceDidntAdoptBlock for 5 minutes #4720
Comments
Looking at the code for my own education I could trace the possible origin of this message to be in the GC process.
and |
@gufmar Do you have a |
As far as I can tell, that forum thread has five blocks of interest. (by ascending slot)
9c1, 14a, and bbe are on the
A third (bonus 😄) factor is the timeouts.
If no new block arises before that timeout (randomized per upstream connection), then we disconnect from that peer. Depending on re-connection delays etc, that might amplify the duration of some stall in the chain growth. Thanks to those sharing logs, we do have some more specific information in this case 👍 In particular, we have two reports of
Relevant facts:
So, it could be the case that only 15c and 0a2 were minted during the gap (ie no one else was elected). If they had extended 14a instead of 9c1, then the gap would be smaller. But! Why didn't 15c and 0a2 extend 14a instead of its predecessor 9c1?
I am puzzled. Seems unlikely 14a would happen to arrive at just the wrong time at two nodes, 170 seconds apart for this to happen. Other possibilities:
I noticed this while writing the above. Looking more closely at Ada4Good's logs now, I actually see it select 14a between (I remain as mystified as ever about the duplication of
So: this is exactly the anticipated reason for ever seeing What remains odd is that it finished validating 14a 207 seconds after 14a was minted and happened to do so exactly during its leadership attempt, which lasted between 60 and 70 milliseconds? Seems like real bad luck if there turns out to be no systematic correlation here after all... I don't see this in the logs that BBHMM submitted. But I also don't see But the two leadership tests were 170 seconds apart; seems very unlikely that two nodes would be validating 14a at such different times and also exactly during their leadership checks. |
Hi @nfrisby, Thank you for analyzing. TraceStartLeadershipCheck {"app":[],"at":"2022-12-07T07:17:52.00Z","data":{"chainDensity":4.78205e-2,"credentials":"Cardano","delegMapSize":1237930,"kind":"TraceStartLeadershipCheck","slot":78831181,"utxoSize":9404848},"env":"1.35.3:950c4","host":"bbhmm-bp","loc":null,"msg":"","ns":["cardano.node.LeadershipCheck"],"pid":"1429","sev":"Info","thread":"214"} TraceNodeIsLeader {"app":[],"at":"2022-12-07T07:17:52.03Z","data":{"credentials":"Cardano","val":{"kind":"TraceNodeIsLeader","slot":78831181}},"env":"1.35.3:950c4","host":"bbhmm-bp","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"1429","sev":"Info","thread":"214"} TraceForgedBlock {"app":[],"at":"2022-12-07T07:17:52.04Z","data":{"credentials":"Cardano","val":{"block":"15c3003320feb493e2a6eebeaa34d7c206dcb23c4bd8bbcc05443de542681803","blockNo":8107959,"blockPrev":"9c1a0cbe38f4a32ef55a0426dcacfcb8bf005defdca46710e5d5a7f0fc193aac","kind":"TraceForgedBlock","slot":78831181}},"env":"1.35.3:950c4","host":"bbhmm-bp","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"1429","sev":"Info","thread":"214"} AddedToCurrentChain {"app":[],"at":"2022-12-07T07:17:52.04Z","data":{"chainLengthDelta":1,"kind":"TraceAddBlockEvent.AddedToCurrentChain","newtip":"14aa389d1f9ac49a50b4991ce24f2e0c276f22143bbfead582f5cff63bfa1e1c@78831144"},"env":"1.35.3:950c4","host":"bbhmm-bp","loc":null,"msg":"","ns":["cardano.node.ChainDB"],"pid":"1429","sev":"Notice","thread":"204"} TrySwitchToAFork {"app":[],"at":"2022-12-07T07:17:52.05Z","data":{"block":{"hash":"15c3003320feb493e2a6eebeaa34d7c206dcb23c4bd8bbcc05443de542681803","kind":"Point","slot":78831181},"kind":"TraceAddBlockEvent.TrySwitchToAFork"},"env":"1.35.3:950c4","host":"bbhmm-bp","loc":null,"msg":"","ns":["cardano.node.ChainDB"],"pid":"1429","sev":"Info","thread":"204"} TraceDidntAdoptBlock {"app":[],"at":"2022-12-07T07:17:52.05Z","data":{"credentials":"Cardano","val":{"kind":"TraceDidntAdoptBlock","slot":78831181}},"env":"1.35.3:950c4","host":"bbhmm-bp","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"1429","sev":"Error","thread":"214"} |
I have observed the same for the first time this epoch. Will update when I learn more from the logs (BP and both relays). First snipped from the BP, anyone saw the same error at the same time frame? {"app":[],"at":"2023-01-02T05:13:57.00Z","data":{"chainDensity":4.7301885e-2,"credentials":"Cardano","delegMapSize":1251425,"kind":"TraceStartLeadershipCheck","slot":81070146,"utxoSize":9553777},"env":"1.35.4:ebc7b","host":"ub20cn","loc":null,"msg":"","ns":["cardano.node.LeadershipCheck"],"pid":"833","sev":"Info","thread":"353"} |
Please re-open this issue in https://github.com/input-output-hk/ouroboros-network |
Opened Issue IntersectMBO/ouroboros-network#4252 |
External
Area
Other Any other topic (Delegation, Ranking, ...).
Summary
At 2022-12-07 07:17:15am GMT from slot 78831144 and block 8107959 on, there was an extraordinary break in block production.
The next block was adopted at 07:21:59am GMT for slot 78831428 and block https://cardanoscan.io/block/8107960
So 284 Seconds or almost 5 minutes without a new block adopted on chain.
Multiple SPOs reported logs from the producing node, having a successfull
TraceNodeIsLeader
>TraceForgedBlock
>AddedToCurrentChain
but then followed by an unexpectedTraceDidntAdoptBlock
System info (please complete the following information):
Different blockproducing nodes running on v1.35.3 or .4 on mainnet
Screenshots and attachments
Reports from at least 3 Pool operators with forged but not adopted blocks can be found at https://forum.cardano.org/t/no-blocks-during-the-epoch-for-almost-5-minutes/111280/32
Additional context
https://input-output-hk.github.io/ouroboros-network/ouroboros-consensus/Ouroboros-Consensus-NodeKernel.html > TraceForgedBlock > TraceDidntAdoptBlock
So this is not an rarely expectable long slot gap, caused by VRF dice probability. The pools had scheduled slots. they were up, running and actually forged the block. but whatever unknown reason caused them to finally not adopt it even localy. Hence also not anounced to the network. All this happened in a row of 5 minutes, or roughly ~15 blocks. It is unclear what caused it, and what happened then to get out of this situation.
The text was updated successfully, but these errors were encountered: