Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[potential BUG] - TraceDidntAdoptBlock for 5 minutes #4720

Closed
gufmar opened this issue Dec 16, 2022 · 7 comments
Closed

[potential BUG] - TraceDidntAdoptBlock for 5 minutes #4720

gufmar opened this issue Dec 16, 2022 · 7 comments
Labels
comp: networking type: bug Something is not working user type: external Created by an external user

Comments

@gufmar
Copy link
Contributor

gufmar commented Dec 16, 2022

External

Area
Other Any other topic (Delegation, Ranking, ...).

Summary
At 2022-12-07 07:17:15am GMT from slot 78831144 and block 8107959 on, there was an extraordinary break in block production.
The next block was adopted at 07:21:59am GMT for slot 78831428 and block https://cardanoscan.io/block/8107960
So 284 Seconds or almost 5 minutes without a new block adopted on chain.
Multiple SPOs reported logs from the producing node, having a successfull TraceNodeIsLeader > TraceForgedBlock > AddedToCurrentChain but then followed by an unexpected TraceDidntAdoptBlock

System info (please complete the following information):
Different blockproducing nodes running on v1.35.3 or .4 on mainnet

Screenshots and attachments
Reports from at least 3 Pool operators with forged but not adopted blocks can be found at https://forum.cardano.org/t/no-blocks-during-the-epoch-for-almost-5-minutes/111280/32

Additional context
https://input-output-hk.github.io/ouroboros-network/ouroboros-consensus/Ouroboros-Consensus-NodeKernel.html > TraceForgedBlock > TraceDidntAdoptBlock

So this is not an rarely expectable long slot gap, caused by VRF dice probability. The pools had scheduled slots. they were up, running and actually forged the block. but whatever unknown reason caused them to finally not adopt it even localy. Hence also not anounced to the network. All this happened in a row of 5 minutes, or roughly ~15 blocks. It is unclear what caused it, and what happened then to get out of this situation.

@gufmar gufmar added the type: bug Something is not working label Dec 16, 2022
@ghost
Copy link

ghost commented Dec 19, 2022

Looking at the code for my own education I could trace the possible origin of this message to be in the GC process.
The trace is emitted when getIsInvalidBlock returns Nothing, and this function is defined as:

 fmap (fmap (fmap invalidBlockReason) . flip Map.lookup) <$> readTVar cdbInvalid

and cdbInvalid is only modified in the garbageCollect function. So it's as if the newly created block had been garbage collected before it could be propagated?

@ghost
Copy link

ghost commented Dec 19, 2022

@gufmar Do you have a PerformedGC trace somewhere around this TraceDidntAdoptBlock?

@nfrisby
Copy link
Contributor

nfrisby commented Dec 19, 2022

As far as I can tell, that forum thread has five blocks of interest.

(by ascending slot)

hash prefix slot blockno predecessor slot as UTC time full hash
9c1 78831128 8107958 don't care don't care 9c1a0cbe38f4a32ef55a0426dcacfcb8bf005defdca46710e5d5a7f0fc193aac
14a 78831144 8107959 9c1 2022-12-07T-07:17:15.00Z 14aa389d1f9ac49a50b4991ce24f2e0c276f22143bbfead582f5cff63bfa1e1c
15c 78831181 8107959 9c1 2022-12-07T-07:17:52.00Z 15c3003320feb493e2a6eebeaa34d7c206dcb23c4bd8bbcc05443de542681803
0a2 78831351 8107959 9c1 2022-12-07T-07:20:42.00Z 0a22def6da84e6fc0e65ee2d96ce0086a5d25c176d9ce056f066fe529ce0d139
bbe 78831428 8107960 14a 2022-12-07T-07:21:59.00Z bbe6c6301d467bb8277e7c461782c8bc34ea62eefd893ecd1b747a0522ac58af

9c1, 14a, and bbe are on the mainnet chain. The alarm begins because 14a and bbe are 4m44s apart, which is "big" for mainnet Cardano. We have two typical hypotheses for some gap might be that big.

  • The (random!) leader schedule happened to be empty during that duration. The chance of any given gap being >=5minutes is (0.95)^(5*60) = 2.075e-7. But there are a lot of slots in the year. This next arithmetic naively ignores overlap, but the rough order of magnitude is: 2.075e-7 * (365*24*60*60) = approx 6.5 times per year.

  • Some ledger calculation (accidentally) took too long, interfering by starving the leadership-checking thread of its access to the CPU. These blocks around about 8% through the epoch --- I don't know off the top of my head if we have any known computation spikes around that slot.

A third (bonus 😄) factor is the timeouts.

    mustReplyTimeout <- Just <$> randomElem [90, 135, 180, 224, 269]

If no new block arises before that timeout (randomized per upstream connection), then we disconnect from that peer. Depending on re-connection delays etc, that might amplify the duration of some stall in the chain growth.


Thanks to those sharing logs, we do have some more specific information in this case 👍

In particular, we have two reports of TraceDidntAdoptBlock.

Relevant facts:

  • 15c and 0a2 are in the gap, but they were competing with 14a (the winner) to be block number 8107959 --- all three extended 9c1.
  • The VRF output of 14a begins 04..., which means it probably has a stronger tiebreaker than 15c and 0a2. I doubt the logs would include the VRF output of 15c and 0a2 to confirm this, unfortunately. (Those two leading hexdigits are only greater than four other possible hexdigit pairs out of the 255 non-equivalent pairs.)
  • Both Ada4Good and BBHMM shared logs that spanned from TraceStartLeadershipCheck to TraceDidntAdoptBlock in less than 70ms and 60ms respectively. That means the blocks they forged were processed by the ChainDB quicker than that. Note that processed does not necessarily mean validated: if a better block is already in there, the ChainDB won't waste time validating the new block. But it's not unusual for a block to be validated in less than 100ms.
  • The submitted log messages were exactly on time; they have the same exact timestamp as the above table.

So, it could be the case that only 15c and 0a2 were minted during the gap (ie no one else was elected). If they had extended 14a instead of 9c1, then the gap would be smaller.

But! Why didn't 15c and 0a2 extend 14a instead of its predecessor 9c1?

  • They're 37 seconds and 207 seconds later, so presumably they had already received 14a and validated 14a.
  • The TraceDidntApplyBlock message suggests a better block 8107959 was already in the ChainDB when these new blocks were added to it.
  • But it's just 70ms --- so presumably that better block already existed in the ChainDB. Why wasn't it chosen as the block to extend when minting 15c/0a2?

I am puzzled. Seems unlikely 14a would happen to arrive at just the wrong time at two nodes, 170 seconds apart for this to happen.

Other possibilities:

  • An invalid transaction made it through our defense, but getIsInvalidBlock was somehow not updated, so the minting thread reported TraceDidntAdoptBlock instead of TraceForgedInvalidBlock? Could "it took that long to flush the bad transaction" explain the gap size? (Assuming there are more like BBHMM and Ada4Good who just haven't been vocal.) This would require that 14a hadn't arrived to BBHMM within 37 seconds, which is possible, but not typical. And for there to be an invalid transaction in the new block, which seems unlikely. Or maybe another invalid 8107959 block that isn't showing up in these logs?

  • Something is causing alignment between block validations and leadership checks. Such that the leadership thread will choose its new block's predecessor, then a latest-greatest block is added to the ChainDB, then the leadership thread adds its newly minted block (which happens to lose a tiebreaker against the latest-greatest).


I noticed this while writing the above.

Looking more closely at Ada4Good's logs now, I actually see it select 14a between TraceStartLeadershipCheck and TraceDidntAdoptBlock!

(I remain as mystified as ever about the duplication of AddedToCurrentChain and the absurd elision count 😬.)

TraceStartLeadershipCheck	{"app":[],"at":"2022-12-07T07:20:42.00Z","data":{"kind":"TraceStartLeadershipCheck","slot":78831351},"env":"1.35.4:ebc7b","host":"BPxxx","loc":null,"msg":"","ns":["cardano.node.LeadershipCheck"],"pid":"82143","sev":"Info","thread":"375"}
TraceNodeIsLeader		{"app":[],"at":"2022-12-07T07:20:42.03Z","data":{"val":{"kind":"TraceNodeIsLeader","slot":78831351}},"env":"1.35.4:ebc7b","host":"BPxxx","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"82143","sev":"Info","thread":"375"}
TraceForgedBlock		{"app":[],"at":"2022-12-07T07:20:42.05Z","data":{"val":{"block":"0a22def6da84e6fc0e65ee2d96ce0086a5d25c176d9ce056f066fe529ce0d139","blockNo":8107959,"blockPrev":"9c1a0cbe38f4a32ef55a0426dcacfcb8bf005defdca46710e5d5a7f0fc193aac","kind":"TraceForgedBlock","slot":78831351}},"env":"1.35.4:ebc7b","host":"BPxxx","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"82143","sev":"Info","thread":"375"}
AddedToCurrentChain [14a]	{"app":[],"at":"2022-12-07T07:20:42.06Z","data":{"chainLengthDelta":1,"kind":"TraceAddBlockEvent.AddedToCurrentChain","newtip":"14aa389d1f9ac49a50b4991ce24f2e0c276f22143bbfead582f5cff63bfa1e1c@78831144"},"env":"1.35.4:ebc7b","host":"BPxxx","loc":null,"msg":"","ns":["cardano.node.ChainDB"],"pid":"82143","sev":"Notice","thread":"365"}
<elisions>			{"app":[],"at":"2022-12-07T07:20:42.06Z","data":{"kind":"LogValue","name":"before next, messages elided","value":{"contents":427653286419604,"tag":"PureI"}},"env":"1.35.4:ebc7b","host":"BPxxx","loc":null,"msg":"","ns":["cardano.node.ChainDB"],"pid":"82143","sev":"Notice","thread":"365"}
AddedToCurrentChain [14a]	{"app":[],"at":"2022-12-07T07:20:42.06Z","data":{"chainLengthDelta":1,"kind":"TraceAddBlockEvent.AddedToCurrentChain","newtip":"14aa389d1f9ac49a50b4991ce24f2e0c276f22143bbfead582f5cff63bfa1e1c@78831144"},"env":"1.35.4:ebc7b","host":"BPxxx","loc":null,"msg":"","ns":["cardano.node.ChainDB"],"pid":"82143","sev":"Notice","thread":"365"}
TrySwitchToAFork [0a2]		{"app":[],"at":"2022-12-07T07:20:42.06Z","data":{"block":{"hash":"0a22def6da84e6fc0e65ee2d96ce0086a5d25c176d9ce056f066fe529ce0d139","kind":"Point","slot":78831351},"kind":"TraceAddBlockEvent.TrySwitchToAFork"},"env":"1.35.4:ebc7b","host":"BPxxx","loc":null,"msg":"","ns":["cardano.node.ChainDB"],"pid":"82143","sev":"Info","thread":"365"}
TraceDidntAdoptBlock		{"app":[],"at":"2022-12-07T07:20:42.06Z","data":{"val":{"kind":"TraceDidntAdoptBlock","slot":78831351}},"env":"1.35.4:ebc7b","host":"BPxxx","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"82143","sev":"Error","thread":"375"}

So: this is exactly the anticipated reason for ever seeing TraceDidntAdoptBlock. The node started its leadership check, chose to extend the tip of the chain (ie 9c1), minted that block (ie 0a2), but finished validating and selected a different better block (ie 14a) before its newly minted block had a chance to be considered.

What remains odd is that it finished validating 14a 207 seconds after 14a was minted and happened to do so exactly during its leadership attempt, which lasted between 60 and 70 milliseconds? Seems like real bad luck if there turns out to be no systematic correlation here after all...

I don't see this in the logs that BBHMM submitted. But I also don't see TrySwitchToAFork. So either they grep-ed out intervening messages, or their node took the alternate TryAddToCurrentChain path (which has Debug severity as opposed to TrySwitchToAFork's Info severity, since it's the typical path. Although leading is so rare, perhaps we should make it Info too, just for this kind of examination) and then they wouldn't fit this hypothesis.

But the two leadership tests were 170 seconds apart; seems very unlikely that two nodes would be validating 14a at such different times and also exactly during their leadership checks.

@chadle-git
Copy link

chadle-git commented Dec 22, 2022

Hi @nfrisby, Thank you for analyzing.
I found the same type scenario in my logs. Let me know if you need any more data that can help with this.

TraceStartLeadershipCheck {"app":[],"at":"2022-12-07T07:17:52.00Z","data":{"chainDensity":4.78205e-2,"credentials":"Cardano","delegMapSize":1237930,"kind":"TraceStartLeadershipCheck","slot":78831181,"utxoSize":9404848},"env":"1.35.3:950c4","host":"bbhmm-bp","loc":null,"msg":"","ns":["cardano.node.LeadershipCheck"],"pid":"1429","sev":"Info","thread":"214"}

TraceNodeIsLeader {"app":[],"at":"2022-12-07T07:17:52.03Z","data":{"credentials":"Cardano","val":{"kind":"TraceNodeIsLeader","slot":78831181}},"env":"1.35.3:950c4","host":"bbhmm-bp","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"1429","sev":"Info","thread":"214"}

TraceForgedBlock {"app":[],"at":"2022-12-07T07:17:52.04Z","data":{"credentials":"Cardano","val":{"block":"15c3003320feb493e2a6eebeaa34d7c206dcb23c4bd8bbcc05443de542681803","blockNo":8107959,"blockPrev":"9c1a0cbe38f4a32ef55a0426dcacfcb8bf005defdca46710e5d5a7f0fc193aac","kind":"TraceForgedBlock","slot":78831181}},"env":"1.35.3:950c4","host":"bbhmm-bp","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"1429","sev":"Info","thread":"214"}

AddedToCurrentChain {"app":[],"at":"2022-12-07T07:17:52.04Z","data":{"chainLengthDelta":1,"kind":"TraceAddBlockEvent.AddedToCurrentChain","newtip":"14aa389d1f9ac49a50b4991ce24f2e0c276f22143bbfead582f5cff63bfa1e1c@78831144"},"env":"1.35.3:950c4","host":"bbhmm-bp","loc":null,"msg":"","ns":["cardano.node.ChainDB"],"pid":"1429","sev":"Notice","thread":"204"}

TrySwitchToAFork {"app":[],"at":"2022-12-07T07:17:52.05Z","data":{"block":{"hash":"15c3003320feb493e2a6eebeaa34d7c206dcb23c4bd8bbcc05443de542681803","kind":"Point","slot":78831181},"kind":"TraceAddBlockEvent.TrySwitchToAFork"},"env":"1.35.3:950c4","host":"bbhmm-bp","loc":null,"msg":"","ns":["cardano.node.ChainDB"],"pid":"1429","sev":"Info","thread":"204"}

TraceDidntAdoptBlock {"app":[],"at":"2022-12-07T07:17:52.05Z","data":{"credentials":"Cardano","val":{"kind":"TraceDidntAdoptBlock","slot":78831181}},"env":"1.35.3:950c4","host":"bbhmm-bp","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"1429","sev":"Error","thread":"214"}

@Straightpool
Copy link

Straightpool commented Jan 2, 2023

I have observed the same for the first time this epoch. Will update when I learn more from the logs (BP and both relays).

First snipped from the BP, anyone saw the same error at the same time frame?

{"app":[],"at":"2023-01-02T05:13:57.00Z","data":{"chainDensity":4.7301885e-2,"credentials":"Cardano","delegMapSize":1251425,"kind":"TraceStartLeadershipCheck","slot":81070146,"utxoSize":9553777},"env":"1.35.4:ebc7b","host":"ub20cn","loc":null,"msg":"","ns":["cardano.node.LeadershipCheck"],"pid":"833","sev":"Info","thread":"353"}
{"app":[],"at":"2023-01-02T05:13:57.02Z","data":{"credentials":"Cardano","val":{"kind":"TraceNodeIsLeader","slot":81070146}},"env":"1.35.4:ebc7b","host":"ub20cn","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"833","sev":"Info","thread":"353"}
{"app":[],"at":"2023-01-02T05:13:57.02Z","data":{"credentials":"Cardano","val":{"block":"521058f6067abb88828611255df6c2695393ae3abb4e113cf550c74ad9e88754","blockNo":8217108,"blockPrev":"9306145db376fd2aa74ea1bb24ed915f2f69a66d40dfdef2f68e82b53141a337","kind":"TraceForgedBlock","slot":81070146}},"env":"1.35.4:ebc7b","host":"ub20cn","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"833","sev":"Info","thread":"353"}
{"app":[],"at":"2023-01-02T05:13:57.02Z","data":{"block":{"hash":"521058f6067abb88828611255df6c2695393ae3abb4e113cf550c74ad9e88754","kind":"Point","slot":81070146},"kind":"TraceAddBlockEvent.TrySwitchToAFork"},"env":"1.35.4:ebc7b","host":"ub20cn","loc":null,"msg":"","ns":["cardano.node.ChainDB"],"pid":"833","sev":"Info","thread":"343"}
{"app":[],"at":"2023-01-02T05:13:57.03Z","data":{"credentials":"Cardano","val":{"kind":"TraceDidntAdoptBlock","slot":81070146}},"env":"1.35.4:ebc7b","host":"ub20cn","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"833","sev":"Error","thread":"353"}

@Jimbo4350
Copy link
Contributor

Please re-open this issue in https://github.com/input-output-hk/ouroboros-network

@chadle-git
Copy link

Opened Issue IntersectMBO/ouroboros-network#4252

@dorin100 dorin100 added user type: external Created by an external user comp: networking labels Jan 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
comp: networking type: bug Something is not working user type: external Created by an external user
Projects
None yet
Development

No branches or pull requests

6 participants