Decide on mitigation of missed leadership checks due to ledger snapshots #868
Comments
A quick comment.
That may prove to be a useful workaround in the short term, but our separation-of-concerns goal has so far been to avoid this sort of design (anything that changes the node's behavior based on the upcoming leadership schedule).
It seems fair to assume the leadership check thread is starved of CPU while the snapshot is being taken. I think there are two a priori suspects that might be occupying the CPU during that time:
If the GC pauses are longer than a second, then it's entirely possible the leadership check doesn't get CPU time during the elected slot. If we incrementalize the snapshotting work, then perhaps the lower allocation rate will allow for smaller GC pauses, ideally sub-second (i.e. sub-slot).

Generally the RTS scheduler is fair, so I wouldn't expect the snapshotting threads (busy while taking the snapshot) to be able to starve the leadership check thread. But I list them anyway because maybe something about the computation is not playing nice with the RTS's preemption. (Examples include a long-running pure calculation that doesn't allocate, but I don't anticipate that in this particular case. Maybe some "FFI" stuff? I'm unsure how the file is actually written.)

The lack of fairness in STM shouldn't matter here. The leadership check thread only reads the current chain, which shouldn't be changing rapidly enough to incur contention among the STM transactions that depend on it.
There are several possible culprits for the increase in CPU usage, memory usage, and GC time:
@lehins will look into 1 and 2, to see how much work they would require. As for 0, this is something we could do on the Consensus side. @coot suggested we could also use info table profiling to extract more information.
Regarding 0, we are writing a lazy bytestring:

```haskell
writeSnapshot ::
     forall m blk. MonadThrow m
  => SomeHasFS m
  -> (ExtLedgerState blk -> Encoding)
  -> DiskSnapshot
  -> ExtLedgerState blk
  -> m ()
writeSnapshot (SomeHasFS hasFS) encLedger ss cs = do
    withFile hasFS (snapshotToPath ss) (WriteMode MustBeNew) $ \h ->
      void $ hPut hasFS h $ CBOR.toBuilder (encode cs)
  where
    encode :: ExtLedgerState blk -> Encoding
    encode = encodeSnapshot encLedger

-- | This function makes sure that the whole 'Builder' is written.
--
-- The chunk size of the resulting 'BL.ByteString' determines how much memory
-- will be used while writing to the handle.
hPut :: forall m h
      . (HasCallStack, Monad m)
     => HasFS m h
     -> Handle h
     -> Builder
     -> m Word64
hPut hasFS g = hPutAll hasFS g . BS.toLazyByteString
```

I tried using […]

The following screenshot shows the heap profile, which is started before taking a snapshot of the ledger state. The first spike related to encoders appears on the 3rd page of the "Detailed" tab, and takes only 40 MB.
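Since the comment on `hPut` notes that the chunk size of the lazy `ByteString` governs memory use, it may help to see how that chunk size can be controlled. This is a sketch (not from the thread): `toChunkedLazyBS` is a hypothetical helper built on `Data.ByteString.Builder.Extra.toLazyByteStringWith`, which lets the caller pick the buffer size instead of relying on the default.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Builder as B
import           Data.ByteString.Builder.Extra (toLazyByteStringWith, untrimmedStrategy)
import qualified Data.ByteString.Lazy as BL

-- Render a 'Builder' with a fixed buffer size, so every chunk of the
-- resulting lazy 'ByteString' is at most 'chunkSize' bytes. Writing
-- these chunks one at a time bounds the memory held per write.
toChunkedLazyBS :: Int -> B.Builder -> BL.ByteString
toChunkedLazyBS chunkSize =
    toLazyByteStringWith (untrimmedStrategy chunkSize chunkSize) BL.empty

main :: IO ()
main = do
    -- 100,000 little-endian 8-byte words = 800,000 bytes of output.
    let bs = toChunkedLazyBS 32768 (mconcat (replicate 100000 (B.word64LE 42)))
    print (BL.length bs)                                   -- 800000
    print (all ((<= 32768) . BS.length) (BL.toChunks bs))  -- True
```

Smaller buffers mean more, smaller chunks: less memory held per write, at the cost of more write calls.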
We also observe very low productivity during the run above:
after processing a fixed number of slots. This is an experiment to investigate #868
Processing 20K blocks and storing 3 snapshots corresponding to the last 3 blocks that were applied results in the following heap profile. The "Detailed" tab seems to show […]. The branch used to obtain these results can be found here. The profile can be produced by running:

```
cabal run exe:db-analyser -- cardano --config $NODE_HOME/configuration/cardano/mainnet-config.json --db $NODE_HOME/mainnet/db --analyse-from 72316896 --only-immutable-db --store-ledger 72336896 +RTS -hi -s -l-agu
```
@dnadales could you run it once more, with a https://hackage.haskell.org/package/base-4.19.1.0/docs/GHC-Stats.html#v:getRTSStats dump before and after each of the ledger snapshots being written?
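A minimal sketch of the kind of instrumentation being requested here; the helper name is hypothetical, and the stats are only populated when the program runs with `+RTS -T`.

```haskell
import Control.Monad (when)
import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)

-- Hypothetical helper: print cumulative GC figures before and after
-- an action (e.g. writing a ledger snapshot). 'getRTSStats' only
-- returns meaningful data when the program runs with: +RTS -T -RTS
withRTSStatsDump :: String -> IO a -> IO a
withRTSStatsDump label act = do
    dump ("before " ++ label)
    r <- act
    dump ("after " ++ label)
    pure r
  where
    dump tag = do
      enabled <- getRTSStatsEnabled
      when enabled $ do
        s <- getRTSStats
        putStrLn $ tag
          ++ ": gcs=" ++ show (gcs s)
          ++ " gc_cpu_ns=" ++ show (gc_cpu_ns s)
          ++ " gc_elapsed_ns=" ++ show (gc_elapsed_ns s)
          ++ " max_live_bytes=" ++ show (max_live_bytes s)

main :: IO ()
main = do
    _ <- withRTSStatsDump "dummy work" (pure (sum [1 .. 1000000 :: Int]))
    pure ()
```

Diffing the before/after `gc_cpu_ns` values would show how much GC time each snapshot incurs.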
A thought for a quick-fix partial mitigation to help out users until we eliminate the underlying performance bug: we could take snapshots much less often. The default interval is 72 min. The nominal expectation for that duration is 216 new blocks arising. If we increased the interval to 240 minutes, then the expectation would be 720 blocks. The initializing node should still be able to re-process that many blocks in less than a few minutes (my recollection is that deserializing the ledger state snapshot file is far and away the dominant factor in startup time).

^^^ That all assumes we're considering a node that is already caught up. A syncing node would have processed many more blocks than 216 during a 72 min interval. But the argument above does suggest it would be relatively harmless for a caught-up node to use an inter-snapshot duration of 240 min. (Moreover, I don't immediately see why it couldn't even be 10 times that or more. But that seems like a big change, so I hesitate to bless it without giving it much more thought.)

(The interval is the first […]
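The 216 and 720 figures above follow from simple arithmetic, assuming (as on Cardano mainnet) 1-second slots and one block expected every 20 slots on average (active slot coefficient f = 1/20):

```haskell
-- Expected number of new blocks over a snapshot interval, assuming
-- 1-second slots and one block expected every 20 slots on average
-- (Cardano mainnet's active slot coefficient f = 1/20).
expectedBlocks :: Double -> Double
expectedBlocks minutes = minutes * 60 / 20

main :: IO ()
main = do
    print (expectedBlocks 72)   -- 216.0, the default 72 min interval
    print (expectedBlocks 240)  -- 720.0, the proposed 240 min interval
```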
Sure thing:
@TimSheard suggested we could try increasing the pulse size in this line (e.g. by doubling it), and re-running this experiment. @TimSheard would also like to know what the protocol parameters are during this run. We could also re-run the experiment with a GHC build that has info tables built into the base libraries, or using different […] And also, it might be useful to make sure we cross epoch boundaries.
FTR, Ledger experiment on avoiding forcing the pulser when serializing a ledger state: IntersectMBO/cardano-ledger#4196
Problem description
John Lotoski informed us that currently on Cardano mainnet, adequately resourced nodes (well above minimum specs) are missing lots of leadership checks during ledger snapshots.
Concretely, during every ledger snapshot (performed every 2k seconds = 72 min by default), which takes about ~2 min, the node misses ~30 leadership checks with 32 GB RAM, and ~100 with 16 GB RAM. This means that the node is missing ~0.7–2.3% of its leadership opportunities, and without mitigations, this number will likely grow as the size of the ledger state increases over time. This problem is not new; it has existed since at least node 8.0.0 (and likely even before).
Analysis
Various experiments (credits to John Lotoski) indicate that this problem is due to high GC load/long GC pauses while taking a ledger snapshot (current mainnet size is ~2.6GB serialized). The main reasons for this belief are:
Using `--nonmoving-gc` fixes the problem for some time.1

Judging from a 6h log excerpt, both GC time and missed slots increase greatly during a ledger snapshot:
(GC time comes from `gc_cpu_ns`.)

Changing other aspects of the machine running the node (compute, IOPS) has no effect.
Potential mitigations
Several orthogonal mitigation options have been raised:

- Maybe `--nonmoving-gc` on a more recent GHC is enough, see 1.
- Write the snapshot in smaller (`ByteString`) chunks?

Note that UTxO HD will also help, but it will likely not be used for some time by block producers (where this issue is actually important).
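The chunked-writing idea could look something like the following sketch. The helper name is hypothetical, and plain `System.IO` handles stand in for the `HasFS` abstraction used in the real code; the `yield` between chunks gives other threads (e.g. the leadership check) explicit opportunities to run.

```haskell
import Control.Concurrent (yield)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import System.IO (Handle, IOMode (WriteMode), withFile)

-- Hypothetical sketch of the "smaller chunks" mitigation: write a
-- lazy 'ByteString' chunk by chunk, yielding to the RTS scheduler
-- after each chunk instead of handing the whole serialization to a
-- single write.
hPutChunked :: Handle -> BL.ByteString -> IO ()
hPutChunked h = mapM_ putChunk . BL.toChunks
  where
    putChunk c = BS.hPut h c >> yield

main :: IO ()
main =
    withFile "/tmp/chunked-demo" WriteMode $ \h ->
      hPutChunked h (BL.fromChunks [BS.pack [1, 2], BS.pack [3, 4]])
```

Combined with a bounded chunk size when rendering the `Builder`, this would cap both the memory held per write and the time between scheduling points.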
The goal of this ticket is to interact with other teams/stakeholders to identify the best way forward here.
Footnotes
Quoting from John Lotoski: