
Decide on mitigation of missed leadership checks due to ledger snapshots #868

Open · amesgen opened this issue Jan 8, 2024 · 14 comments

@amesgen (Member) commented Jan 8, 2024

Problem description

John Lotoski informed us that currently on Cardano mainnet, adequately resourced nodes (well above minimum specs) are missing lots of leadership checks during ledger snapshots.

Concretely, during every ledger snapshot (performed every 2k seconds = 72 min by default), which takes roughly 2 min, the node misses ~30 leadership checks with 32 GB RAM, and ~100 with 16 GB RAM. This means that the node misses ~0.7–2.3% of its leadership opportunities, and without mitigations, this number will likely grow as the size of the ledger state increases over time.
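
For reference, a quick back-of-the-envelope check of those percentages (a sketch, assuming one leadership check per 1-second mainnet slot, so that a 72 min snapshot interval spans 2k = 2 × 2160 = 4320 checks):

  -- Rough check of the quoted percentages, assuming one leadership check per
  -- 1-second mainnet slot: a 72 min snapshot interval spans 2k = 2 * 2160 = 4320
  -- slots, so missing ~30 or ~100 checks per snapshot gives:
  missedFraction :: Double -> Double
  missedFraction missedPerSnapshot = missedPerSnapshot / (2 * 2160)

  -- missedFraction  30 ≈ 0.007  (~0.7%)
  -- missedFraction 100 ≈ 0.023  (~2.3%)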

This problem is not new; it has existed since at least node 8.0.0 (and likely even before).

Analysis

Various experiments (credits to John Lotoski) indicate that this problem is due to high GC load/long GC pauses while taking a ledger snapshot (current mainnet size is ~2.6GB serialized). The main reasons for this belief are:

  • Using --nonmoving-gc fixes the problem for some time (see footnote 1).

  • Judging from a 6h log excerpt, both GC time and missed slots increase greatly during a ledger snapshot:

    [chart: GC time and missed slots during a ledger snapshot]

    (GC time comes from gc_cpu_ns)

  • Changing other aspects of the machine running the node (compute, IOPS) has no effect.

Potential mitigations

Several orthogonal mitigation options have been raised:

  1. Try to find a set of RTS options that fixes the problem without any code changes. In particular, it might be the case that --nonmoving-gc on a more recent GHC is enough, see footnote 1.
  2. Incrementalising the creation of ledger snapshots. Specifically, this could be achieved by introducing a delay between writing individual chunks of the snapshot to disk. One hope here is that by spreading out the allocation work over a longer period of time (the default snapshot interval is 72 min, so there is plenty of room to spread it out), the total time spent GCing will be less; but a priori, it might just be the case that we miss as many slots as before, just spread out more evenly over a longer period of time. (A minimal sketch of this idea follows the list.)
  3. Don't take ledger snapshots around the time the node is elected. Since we know in advance when the node will be elected, we can take this into account when deciding whether to take a ledger snapshot, and choose not to do so if an election opportunity is imminent (eg <5min). This way, we are guaranteed not to miss slots due to this problem when it is actually important.
  4. Try to optimize the allocation behavior of ledger snapshot serialization. There might be some opportunities to improve the existing code to allocate less, or in a gentler way, eg by using unpinned instead of pinned (ByteString) chunks.
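
As a rough illustration of option 2, here is a minimal sketch of writing the snapshot's chunks with a pause in between, so the allocation work is spread out over time. It assumes a plain System.IO handle rather than the actual HasFS interface, and writeChunksWithDelay and the delay value are purely illustrative:

  import           Control.Concurrent (threadDelay)
  import qualified Data.ByteString as BS
  import qualified Data.ByteString.Lazy as BL
  import           System.IO (Handle)

  -- Write a lazily produced snapshot chunk by chunk, sleeping between chunks,
  -- so the allocation (and hence GC) load is spread out instead of occurring
  -- in one burst. Illustrative only: the real code goes through HasFS.
  writeChunksWithDelay :: Int -> Handle -> BL.ByteString -> IO ()
  writeChunksWithDelay delayMicros h bytes =
      mapM_ writeOne (BL.toChunks bytes)
    where
      writeOne :: BS.ByteString -> IO ()
      writeOne chunk = do
        BS.hPut h chunk
        threadDelay delayMicros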

Note that UTxO HD will also help, but it will likely not be used for some time by block producers (where this issue actually matters).


The goal of this ticket is to interact with other teams/stakeholders to identify the best way forward here.

Footnotes

  1. Quoting from John Lotoski:

    Trying the non-moving GC out did in fact resolve the missed slots, at least for about 5 days, after which missed slots started happening again, and eventually they surpassed the default copying GC missed slots in quantity of about 30 per hour and looked to continue increasing further over time.
    I suspect this is due to increasing fragmentation which the default copying GC is better at minimizing. There have been several improvements and bug fixes in the non-moving garbage collector through GHC 9.4.X, including improving the non-moving GC's ability to return allocated memory back to the OS which doesn't seem to happen at all on 8.10.7, so perhaps there might be better results with the non-moving GC once node is compiled on GHC >= ~9.4.X. In any case, when using the non-moving GC on 8.10.7, there are no new observed segfaults or other immediately obvious problem related to the use of the non-moving GC.

@nfrisby (Contributor) commented Jan 8, 2024

A quick comment.

Don't take ledger snapshots around the time the node is elected.

That may prove to be a useful workaround in the short term, but our separation-of-concerns goal has so far been to avoid this sort of design (anything that changes the node's behavior based on the upcoming leadership schedule).

@nfrisby (Contributor) commented Jan 10, 2024

Incrementalising the creation of ledger snapshots. Specifically, this could be achieved by introducing a delay between writing individual chunks of the snapshot to disk. One hope here is that by spreading out the allocation work over a longer period of time (the default snapshot interval is 72 min, so there is plenty of room to spread it out), the total time spent GCing will be less; but a priori, it might just be the case that we miss as many slots as before, just spread out more evenly over a longer period of time.

It seems fair to assume the leadership check thread is starved of CPU while the snapshot is being taken. I think there are two a priori suspects that might be occupying the CPU during that time:

  • long GC pauses
  • the snapshotting thread (threads?)

If the GC pauses are greater than a second, then it's entirely possible the leadership check doesn't have CPU time during the elected slot. If we incrementalize the snapshotting work, then perhaps the lower allocation rate will allow for smaller GC pauses, ideally sub-second (ie sub-slot).

Generally the RTS scheduler is fair, so I wouldn't expect the snapshotting threads (busy while taking the snapshot) to be able to starve the leadership check thread. But I list them anyway because maybe something about the computation is not playing nice with the RTS's preemption. (Examples include a long-running pure calculation that doesn't allocate, but I don't anticipate that in this particular case. Maybe some "FFI" stuff? I'm unsure how the file is actually written.)

The lack of fairness in STM shouldn't matter here. The leadership check thread only reads the current chain, which shouldn't be changing rapidly enough to incur contention among STM transactions that depend on it.

dnadales added this to the Q1 2024 milestone on Jan 25, 2024
dnadales self-assigned this on Feb 2, 2024
@dnadales (Member) commented:

The heap profile shows that the spikes are dominated by pinned memory (ARR_WORDS):

[heap profile screenshot]

However, we also see allocations of Data.Map and Data.Set:

[heap profile screenshot]

We will discuss with the Ledger team possible ways to avoid these allocations.

@dnadales (Member) commented:

There are several possible culprits for the increase in CPU and memory usage and GC time:

  1. The encoding is not consumed as a stream, which requires a large amount of data to be encoded as a ByteString at once before being written to disk.
  2. We are forcing the Ledger's pulser, which requires a time-consuming computation (as suggested by @jasagredo).
  3. We are converting the compact representation of multi assets, which allocates a Data.Map.

@lehins will look into 2 and 3, to see how much work they would require. As for 1, this is something we could do on the Consensus side.

@coot suggested we could also use info tables profiling to extract more information.

@dnadales (Member) commented Feb 16, 2024

Regarding 1, we are writing a lazy ByteString to disk, so we are not building one huge ByteString before writing it to disk; instead, we write it chunk by chunk:

  writeSnapshot ::
       forall m blk. MonadThrow m
    => SomeHasFS m
    -> (ExtLedgerState blk -> Encoding)
    -> DiskSnapshot
    -> ExtLedgerState blk -> m ()
  writeSnapshot (SomeHasFS hasFS) encLedger ss cs = do
      withFile hasFS (snapshotToPath ss) (WriteMode MustBeNew) $ \h ->
        void $ hPut hasFS h $ CBOR.toBuilder (encode cs)
    where
      encode :: ExtLedgerState blk -> Encoding
      encode = encodeSnapshot encLedger

-- | This function makes sure that the whole 'Builder' is written.
--
-- The chunk size of the resulting 'BL.ByteString' determines how much memory
-- will be used while writing to the handle.
hPut :: forall m h
     .  (HasCallStack, Monad m)
     => HasFS m h
     -> Handle h
     -> Builder
     -> m Word64
hPut hasFS g = hPutAll hasFS g . BS.toLazyByteString

I tried using hPutBuilder (using this branch) to check whether it would reduce memory usage by avoiding the allocation of chunks of bytestrings, but it did not seem to help.
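
For context, the two variants being compared look roughly like this (a sketch using a plain System.IO Handle for illustration; the real code goes through the HasFS abstraction):

  import qualified Data.ByteString.Builder as Builder
  import qualified Data.ByteString.Lazy as BL
  import           System.IO (Handle)

  -- Current approach: materialise the Builder as a lazy ByteString and write
  -- its chunks one by one.
  writeViaLazyBS :: Handle -> Builder.Builder -> IO ()
  writeViaLazyBS h = BL.hPut h . Builder.toLazyByteString

  -- Attempted alternative: stream the Builder directly into the handle's
  -- buffer instead of allocating fresh ByteString chunks first.
  writeViaBuilder :: Handle -> Builder.Builder -> IO ()
  writeViaBuilder = Builder.hPutBuilder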

The following screenshot shows the heap profile, which is started before taking a snapshot of the ledger state:

[heap profile screenshot]

The first spike related to encoders appears on the 3rd page of the "Detailed" tab, and accounts for only 40 MB:

[heap profile screenshot]

Attachment: db-analyser.eventlog.zip

@dnadales (Member) commented:

A second run of db-analyser, built without the profiling flag, does not seem to exhibit the spikes observed above:

[memory usage screenshot]

@dnadales (Member) commented:

We also observe very low productivity during the run above:

[426.09964258s] Started StoreLedgerStateAt (SlotNo 72436908)
[442.253250968s] BlockNo 7793000	SlotNo 72343251
[473.668029643s] BlockNo 7794000	SlotNo 72364115
[504.554850119s] BlockNo 7795000	SlotNo 72384162
[544.457781846s] BlockNo 7796000	SlotNo 72405218
[568.979508903s] BlockNo 7797000	SlotNo 72425439
....
[933.002690111s] Snapshot stored at SlotNo 72436985
[933.00286017s] Snapshot was created at SlotNo 72436985 because there was no block forged at requested SlotNo 72436908
[933.00319006s] Done
ImmutableDB tip: At (Block {blockPointSlot = SlotNo 109730906, blockPointHash = c51616e6e4eefa40d13f5d0f9a47371f50d01ce760e3b8f0d0add212449e9b12})
 216,427,440,776 bytes allocated in the heap
1,487,832,561,768 bytes copied during GC
   3,344,035,160 bytes maximum residency (684 sample(s))
     122,786,736 bytes maximum slop
            6465 MiB total memory in use (0 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      6211 colls,  6211 par   22.448s  22.395s     0.0036s    0.0370s
  Gen  1       684 colls,   683 par   1208.587s  852.713s     1.2467s    1.8203s

  Parallel GC work balance: 90.70% (serial 0%, perfect 100%)

  TASKS: 7 (1 bound, 6 peak workers (6 total), using -N2)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.263s  (  0.263s elapsed)
  MUT     time   58.518s  ( 58.299s elapsed)
  GC      time  1231.035s  (875.108s elapsed)
  EXIT    time    0.000s  (  0.010s elapsed)
  Total   time  1289.816s  (933.680s elapsed)

  Alloc rate    3,698,500,665 bytes per MUT second

  Productivity   4.5% of total user, 6.2% of total elapsed

dnadales added a commit that referenced this issue on Feb 27, 2024 ("…after processing a fixed number of slots. This is an experiment to investigate #868").
@dnadales (Member) commented:

Processing 20K blocks and storing 3 snapshots corresponding to the last 3 blocks that were applied results in the following heap profile:

[heap profile screenshot]

The "Detailed" tab seems to show PulsingReward as allocating a large portion of data:

[heap profile screenshot]

The branch used to obtain these results can be found here.

The profile can be produced by running:

cabal run exe:db-analyser -- cardano --config $NODE_HOME/configuration/cardano/mainnet-config.json --db $NODE_HOME/mainnet/db --analyse-from 72316896 --only-immutable-db --store-ledger 72336896 +RTS -hi -s -l-agu

NODE_HOME points to the directory where the node was run (which contains the chain database).

Attachment: db-analyser.eventlog.zip

@nfrisby (Contributor) commented Feb 28, 2024

@dnadales could you run it once more, with the https://hackage.haskell.org/package/base-4.19.1.0/docs/GHC-Stats.html#v:getRTSStats dump before and after each of the ledger snapshots being written?
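
(For reference, a minimal sketch of such a dump, assuming statistics are enabled via +RTS -T; dumpGCStats and the exact wiring around the snapshot write are illustrative:)

  import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)

  -- Print the GC-related counters from GHC.Stats. Only meaningful when the
  -- program is run with statistics enabled (e.g. +RTS -T).
  dumpGCStats :: String -> IO ()
  dumpGCStats label = do
    enabled <- getRTSStatsEnabled
    if not enabled
      then putStrLn "RTS stats not enabled; run with +RTS -T"
      else do
        s <- getRTSStats
        putStrLn $ "GC stats " <> label <> " snapshot"
        putStrLn $ "Total number of GCs:       " <> show (gcs s)
        putStrLn $ "Total number of major GCs: " <> show (major_gcs s)
        putStrLn $ "Allocated bytes:           " <> show (allocated_bytes s)
        putStrLn $ "Copied bytes:              " <> show (copied_bytes s)
        putStrLn $ "Total GC CPU time:         " <> show (gc_cpu_ns s)
        putStrLn $ "Total elapsed GC time:     " <> show (gc_elapsed_ns s)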

@nfrisby (Contributor) commented Mar 1, 2024

A thought for a quick-fix partial mitigation to help out users until we eliminate the underlying performance bug: we could take snapshots much less often. The default interval is 72 min, for which the nominal expectation is 216 new blocks arising. If we increased the interval to 240 minutes, then the expectation would be 720 blocks. The initializing node should still be able to re-process that many blocks in less than a few minutes (my recollection is that deserializing the ledger state snapshot file is far and away the dominant factor in startup time).

^^^ that all assumes we're considering a node that is already caught up. A syncing node would have processed many more blocks than 216 during a 72 min interval. But the argument above does suggest it would be relatively harmless for a caught-up node to use an inter-snapshot duration of 240 min. (Moreover, I don't immediately see why it couldn't even be 10 times that or more. But that seems like a big change, so I hesitate to bless it without giving it much more thought.)

(The interval is the first DiskPolicyArgs, exposed as of #943 . I'm unsure what name the Node code uses for that in the JSON files etc.)
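
(For reference, the expected-block arithmetic above, assuming mainnet's 1-second slots and active slot coefficient f = 0.05, i.e. one block every 20 s on average:)

  -- Expected number of new blocks during a snapshot interval of the given
  -- length in minutes, assuming one block every 20 s on average.
  expectedBlocks :: Double -> Double
  expectedBlocks intervalMinutes = intervalMinutes * 60 / 20

  -- expectedBlocks 72  == 216
  -- expectedBlocks 240 == 720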

@dnadales (Member) commented Mar 7, 2024

@dnadales could you run it once more, with the https://hackage.haskell.org/package/base-4.19.1.0/docs/GHC-Stats.html#v:getRTSStats dump before and after each of the ledger snapshots being written?

Sure thing:

  GC stats before snapshot
  Total number of GCs:       6632
  Total number of major GCs: 21
  Allocated bytes:           195637928136
  Copied bytes:              26769009240
  Total GC CPU time:         22224343455
  Total elapsed GC time:     18213635966

  GC stats after snapshot
  Total number of GCs:       8939
  Total number of major GCs: 26
  Allocated bytes:           286259560712
  Copied bytes:              60956628280
  Total GC CPU time:         47662376906
  Total elapsed GC time:     38520211470
  [120.563698243s] Snapshot stored at SlotNo 72727294


  GC stats before snapshot
  Total number of GCs:       8939
  Total number of major GCs: 26
  Allocated bytes:           286259560712
  Copied bytes:              60956628280
  Total GC CPU time:         47662376906
  Total elapsed GC time:     38520211470

  GC stats after snapshot
  Total number of GCs:       11201
  Total number of major GCs: 31
  Allocated bytes:           375574341072
  Copied bytes:              93958305552
  Total GC CPU time:         70104502875
  Total elapsed GC time:     57027413813
  [155.701501429s] Snapshot stored at SlotNo 72727306


  GC stats before snapshot
  Total number of GCs:       11201
  Total number of major GCs: 31
  Allocated bytes:           375574341072
  Copied bytes:              93958305552
  Total GC CPU time:         70104502875
  Total elapsed GC time:     57027413813

  GC stats after snapshot
  Total number of GCs:       13464
  Total number of major GCs: 36
  Allocated bytes:           464890139512
  Copied bytes:              126969005864
  Total GC CPU time:         93114399263
  Total elapsed GC time:     76153926493
  [191.899606387s] Snapshot stored at SlotNo 72727311
Elapsed time after snapshot 1:         20.306575504
Elapsed time between snapshot 1 and 2: 0
Elapsed time after snapshot 2:         18.507202343
Elapsed time between snapshot 2 and 3: 0
Elapsed time after snapshot 3:         19.12651268

@dnadales (Member) commented Mar 13, 2024

@TimSheard suggested we could try increasing the pulse size in this line (eg by doubling it), and re-run this experiment.

@TimSheard would like to know what the protocol parameters are during this run.

We could also re-run the experiment with a GHC build that has info-table information built into the base libraries, or using different -h (heap-profiling) RTS options.

It might also be useful to make sure we cross epoch boundaries.

@dnadales (Member) commented:

FTR, here is the Ledger experiment on avoiding forcing the pulser when serializing a ledger state: IntersectMBO/cardano-ledger#4196

dnadales modified the milestones: Q1 2024, QX on Apr 17, 2024
dnadales removed their assignment on Apr 17, 2024