Subpar handling of blocks from the future #4251
My node had a block from the future around the time of the IntersectMBO/cardano-node#4720 issue. An SPO shared logs with @gufmar regarding another TraceDidntAdoptBlock at 2023-01-02T05:13:57.03. |
We have noticed several such examples in the last 4-5 epochs, as one particular pool regularly started announcing its blocks a little too early (enough to reach many/most other nodes before the actual slot time). This is an example of how it looks: one of the four nodes heard about the new block header for slot 79817694 late enough, so it fetched and adopted it, all fine. In the example above, the second block producer (for slot 79817700) hadn't received the previous block. It looks like all of its relays hadn't adopted that block because of its early appearance. |
@karknu For anyone who wants to double-check local BP log files for the same issue, do this: if there is a "TraceDidntAdoptBlock" with severity "Error" in the BP log, find the prior line in the BP log with the TraceForgedBlock message. In close vicinity above it, you should find a block fetch of a future block with a negative delay (i.e. too early), which blocked your own block from local adoption and thus prevented propagation. In my case the relevant log lines are in line with the graphical representation of negative block delay, so my BP saw the future block triggering this issue:

```json
{"app":[],"at":"2023-01-02T05:13:44.71Z","data":{"block":"ec095f8782d832cfbe936b43640cfbcfa07f8b57df333be9d23c490d3f766f13","delay":-0.285766939,"kind":"CompletedBlockFetch","peer":{"local":{"addr":"","port":""},"remote":{"addr":"","port":""}},"size":871},"env":"1.35.4:ebc7b","host":"ub20cn","loc":null,"msg":"","ns":["cardano.node.BlockFetchClient"],"pid":"833","sev":"Info","thread":"631239"}
{"app":[],"at":"2023-01-02T05:13:57.02Z","data":{"credentials":"Cardano","val":{"block":"521058f6067abb88828611255df6c2695393ae3abb4e113cf550c74ad9e88754","blockNo":8217108,"blockPrev":"9306145db376fd2aa74ea1bb24ed915f2f69a66d40dfdef2f68e82b53141a337","kind":"TraceForgedBlock","slot":81070146}},"env":"1.35.4:ebc7b","host":"ub20cn","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"833","sev":"Info","thread":"353"}
{"app":[],"at":"2023-01-02T05:13:57.02Z","data":{"block":{"hash":"521058f6067abb88828611255df6c2695393ae3abb4e113cf550c74ad9e88754","kind":"Point","slot":81070146},"kind":"TraceAddBlockEvent.TrySwitchToAFork"},"env":"1.35.4:ebc7b","host":"ub20cn","loc":null,"msg":"","ns":["cardano.node.ChainDB"],"pid":"833","sev":"Info","thread":"343"}
```

So for my missed block this was most definitely the issue, good catch! Looking forward to the node patch; any missed block is one missed block too many. May I ask, though, for the rationale in adding the complexity to try to save blocks from the future?
Why not keep it simple and just discard them, to encourage proper node NTP configuration? Maybe because they are hard to avoid when other nodes have improper time? |
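The log-scanning procedure described above could be scripted. The sketch below is not an official tool, just an illustration of the matching logic; the sample log lines are abridged from the logs quoted in this thread, and real BP logs would be read from a file instead.

```python
import json

# Abridged sample log lines (from this thread); in practice, read your BP's JSON log file.
LOG = """
{"at":"2023-01-02T05:13:44.71Z","data":{"delay":-0.285766939,"kind":"CompletedBlockFetch"},"ns":["cardano.node.BlockFetchClient"],"sev":"Info"}
{"at":"2023-01-02T05:13:57.02Z","data":{"val":{"kind":"TraceForgedBlock","slot":81070146}},"ns":["cardano.node.Forge"],"sev":"Info"}
{"at":"2023-01-02T05:13:57.03Z","data":{"val":{"kind":"TraceDidntAdoptBlock","slot":81070146}},"ns":["cardano.node.Forge"],"sev":"Error"}
""".strip().splitlines()

def kind_of(entry):
    # The "kind" field sits either directly in "data" or nested under "data.val".
    data = entry.get("data", {})
    return data.get("kind") or data.get("val", {}).get("kind")

suspects = []
recent_future_fetches = []
for line in LOG:
    entry = json.loads(line)
    k = kind_of(entry)
    if k == "CompletedBlockFetch" and entry["data"].get("delay", 0) < 0:
        # Negative delay: the block arrived before its slot time (a future block).
        recent_future_fetches.append(entry)
    elif k == "TraceDidntAdoptBlock" and entry.get("sev") == "Error":
        if recent_future_fetches:
            suspects.append((entry["at"], recent_future_fetches[-1]["at"]))

for fail_at, fetch_at in suspects:
    print(f"TraceDidntAdoptBlock at {fail_at} preceded by future-block fetch at {fetch_at}")
```

A production version would also bound how far back the negative-delay fetch may lie, as the comment above suggests looking only in "close vicinity".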
We can't expect the clocks of all nodes to be completely in sync so there needs to be some allowance for blocks-from-the-future. Even Ouroboros Chronos needs to keep track of blocks-from-the-future. There is a limit (currently 5s) for how far into the future blocks can be before they are ignored. This puts a limit on how out of sync nodes in the network can be. |
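The allowance described above could be sketched as a three-way classification. This is a simplified model, not the node's actual code; only the 5-second limit comes from the comment, and the names `MAX_CLOCK_SKEW` and `classify_block` are made up for illustration.

```python
from datetime import datetime, timedelta

MAX_CLOCK_SKEW = timedelta(seconds=5)  # current limit mentioned above

def classify_block(slot_start: datetime, now: datetime) -> str:
    """Rough model of how a block from a future slot is treated."""
    if slot_start <= now:
        return "adopt"   # not from the future: normal chain selection
    if slot_start - now <= MAX_CLOCK_SKEW:
        return "defer"   # tuck it away; adopt once its slot time arrives
    return "ignore"      # too far in the future: dropped outright

now = datetime(2023, 1, 2, 5, 13, 44)
assert classify_block(now - timedelta(seconds=1), now) == "adopt"
assert classify_block(now + timedelta(seconds=3), now) == "defer"
assert classify_block(now + timedelta(seconds=30), now) == "ignore"
```

The "ignore" branch is what bounds how far out of sync nodes can be before their blocks stop propagating at all.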
I'm not 100% sure my comment will be useful, since Karl created this issue and the IOG team probably already observed similar behavior, but just in case I'll leave my logs here So, the block
and reached my block producer at 2023-01-07 05:14:09.73 UTC. According to the block schedule (based on analysis of the blockchain), this block should have been created at 2023-01-07 05:14:10 UTC, so it was a block from the future. After that my pool didn't put this block into its chain (as Karl mentioned earlier), and just continued checking each slot for leadership until the time for block creation came at 2023-01-07 05:16:01.00 UTC. After that the fork occurred
The fact that my pool has also created the block before |
The first suggestion seems reasonable, and like only a mild design challenge.
This would ensure that you consider pending blocks before forging. When you do find such blocks this way, it would delay your new block by however long it takes to process the pending block. But your forged block would (usually?) have a higher

The other suggestion, checking for pending blocks on some fixed interval (eg the midpoint of each

Not super confident either way, but those are my initial thoughts. I'm unsure how to prioritize this wrt other work. How prevalent is it? Are maleficent SPOs setting their clocks early, causing this to happen often? |
Here below is a list of (almost*) all recorded blocks for the last 3 weeks with a header announced in the future.
(*) Note: we noticed this special behaviour since mid-December, and only after Dec-27 did we refine the monitoring data capture process to also identify the delayed and then not-adopted block. Considering the extraordinarily negative impact on the rewards for the early-announced block (50% of the lost slot battles go to its account, while the other 50% are shared with all the other pools), I wouldn't say that it is deliberately malicious. But if you look at the number of early-announced blocks since mid-December 2022, you will notice that one particular pool is causing most of these height battles. (I won't name names here; the pool has already been informed.) This delaying effect may increase the probability of #4252
|
I again lost a block due to this issue today, at the time mentioned above by Gufmar. My BP completed the BlockFetch of the correct parent block from an unnamed pool with a negative delay, a block from the future: https://cardanoscan.io/block/8749998
My 1.35.7 BP then did not apply this block to the currentTip until 4 minutes later, colliding with itself: first my BP minted my block on the old, wrong parent
At the exact same UTC time my BP decided to finally add the unnamed pool's future block to the tip, not once but twice for good measure
Now the BP had created a fork on its own and tried to decide which fork to take: the one it had just created itself, or the block from the future it had added to its own ledger DB 4 minutes too late?
My BP then dropped its own block:
The BP's losing new block was never picked up by my relays either, so no block explorer saw the issue. This is also visualised in the monitoring: the BP (bright green) is one block ahead, and it takes minutes for all nodes to align again |
Any progress on a potential fix since January? I'm curious if a node release version to address this has been planned. Thx. |
Hi 👋 The Consensus team analyzed possible fixes for this issue, but they have security and incentives implications for the network, so we need to look at this with a multi-disciplinary team consisting of members of the Network and Research team. At the moment we don't have people available to work on this, but we'll consider this issue for inclusion in our quarterly planning sessions. |
Happened again today. Luckily I did not lose a block myself this time, but others surely did; the chain halted for 6 minutes, see https://pooltool.io/realtime/9861109 Log from one of my relays: |
Unfortunately I was one of the pools that lost a block because of this... Below is a copy of my logs in case they are useful:

```json
{"app":[],"at":"2024-01-28T12:03:23.03Z","data":{"credentials":"Cardano","val":{"kind":"TraceNodeNotLeader","slot":114877112}},"env":"8.1.2:d2d90","host":"producer","loc":null,"msg":"","ns":["cardano.node.Forge"],"pid":"1547","sev":"Info",>
```
|
Here's what most probably happened:
During these 6 minutes we probably had other slot leaders, all stuck adopting SCAR's block and hence building their blocks on RSTK. So they ran into height battles with SCAR.
So no new blocks until STAKE was lucky enough to win the height battle and knock out SCAR's block, which was stuck in adoption on many nodes worldwide. |
Another problem with using the block VRF to deterministically settle "fork battles" is that a malicious pool that is the next valid leader can deliberately choose to cause a fork by building upon the second-to-last block if it knows it will win the VRF battle with the last block. A group of such pools could work together to selectively knock out blocks produced by other pools that are not members of their malicious group. I wonder if this attack method, as well as the problem identified in this comment thread, could be solved by three changes:
Such changes would also fix another problem that disincentivises geographic decentralisation: stake pools run from more remote areas (relative to the majority in USA/EU) suffer increased network delays in sending and receiving blocks. Such pools that experience just 1-second delays will suffer 3 times the number of "fork battles", and pools that experience 2-second delays will suffer 5 times the number of fork battles. Since half of these battles will be lost, this is a huge cost for the increased geographic decentralisation these pools provide. I.e.: the majority pools in USA/EU with less than 1-second delays will only lose around 2.5% of their blocks, but pools with 1-second delays will lose around 7.5% of their blocks, and pools with 2-second delays will lose around 12.5%. Such numbers create a massive incentive for operators to centrally re-house their pool in a USA/EU data centre owned by Amazon. The above changes would give each block 4 seconds to propagate across the network, so pools with 1, 2, or even 3 seconds of delays wouldn't be penalised for their decentralisation. |
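The loss-rate figures in the comment above follow from simple arithmetic, assuming Cardano's active-slot coefficient f = 0.05 (any given slot has a ~5% chance of some other leader) and that roughly half of all fork battles are lost. A quick sketch of that arithmetic:

```python
# Reproduces the comment's loss-rate figures. Assumptions (not stated as
# exact protocol analysis): active-slot coefficient f = 0.05, a pool with
# a propagation delay of d seconds collides with leaders of the 2*d + 1
# nearby slots, and half of all fork battles are lost.
F = 0.05

def blocks_lost_fraction(delay_s: int) -> float:
    contested_slots = 2 * delay_s + 1    # slots close enough to collide
    battle_prob = F * contested_slots    # chance some other pool leads one
    return battle_prob / 2               # roughly half of battles are lost

for d in (0, 1, 2):
    print(f"{d}s delay: ~{blocks_lost_fraction(d):.1%} of blocks lost")
# 0s -> 2.5%, 1s -> 7.5%, 2s -> 12.5%, matching the figures above
```

The "3 times" and "5 times" multipliers are just the ratio of contested slots: 3/1 and 5/1 relative to the sub-second-delay baseline.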
Sorry for the long delay here; this bug has now been fixed for a while (since node 8.8), but only since Conway can we be sure that all block producers have actually updated to a version containing the fix. See https://updates.cardano.intersectmbo.org/2024-09-07-incident/ (in particular the linked reports) for all of the details. |
@TerminadaPool keep us posted if you notice better block production from your node / area. |
When the node manages to download a block from the future, it is not adopted directly; rather, it is tucked away with the hope that it can be adopted when the time comes. Any time a new block is downloaded, or after a forged block is adopted, the node checks to see if there are any old blocks-from-the-future that it could adopt. If there are such blocks, they are adopted before the new block.
Notably, it appears that a new slot/tick isn't enough to trigger the adoption of blocks-from-the-future; it is always dependent on another block being added.
This means that if all nodes in the network were to download a block early, then the only way for the network to progress would be for a BP to forge a new block on the current tip (that is, without the block from the future). This would cause the BP to notice and adopt the old block-from-the-future, and then attempt to adopt its forged block. If it lost due to VRF, the TraceDidntAdoptBlock would be shown. If the BP lost, another BP would have to attempt to beat the block-from-the-future; since no new block was adopted, nothing will cause the rest of the nodes in the network to adopt the block-from-the-future.
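The behaviour described above can be modelled in a few lines. This is an illustrative sketch, not the node's actual code: blocks from the future are parked, and the parked queue is only re-examined when another block is added; a slot tick alone never triggers adoption.

```python
class Node:
    """Toy model of the deferred-adoption behaviour described above."""

    def __init__(self):
        self.tip = 0       # block number of the current tip
        self.parked = []   # blocks-from-the-future waiting for their slot
        self.now = 0       # current slot

    def tick(self):
        self.now += 1      # time passes; parked blocks are NOT re-checked

    def add_block(self, block_no, slot):
        # Re-examine parked blocks first, as the description above says.
        for b, s in list(self.parked):
            if s <= self.now:
                self.tip = max(self.tip, b)
                self.parked.remove((b, s))
        if slot > self.now:
            self.parked.append((block_no, slot))   # from the future: park it
        else:
            self.tip = max(self.tip, block_no)

n = Node()
n.add_block(1, slot=5)    # block from the future (now=0): parked, tip stays 0
for _ in range(10):
    n.tick()              # its slot passes, but nothing triggers adoption
assert n.tip == 0         # still stuck: ticks alone never unpark blocks
n.add_block(2, slot=10)   # only a newly arriving block...
assert n.parked == []     # ...unparks the old future block as a side effect
```

The model omits VRF tie-breaking and real chain selection, but it shows why, if every node parks the same early block, progress depends entirely on someone forging.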
Suggestions:

- `chainSelectionForFutureBlocks` should be run so that the node can forge its block on top of blocks-from-the-future.
- Use a timeout for `getBlockToAdd` in `addBlockRunner`. If there was a timeout, call `chainSelectionForFutureBlocks` and try again.

This would stop blocks from the future causing forks.
I suspect that this is at least partially the cause of IntersectMBO/cardano-node#4720
@nfrisby