
Keep every snapshot in memory and remove the concept of snapEvery #2459

Merged: 1 commit merged into master on Aug 25, 2020

Conversation

@kderme (Contributor) commented Jul 23, 2020

#2440

I tried to:

  • remove some tests that became trivial,
  • update the documentation,
  • simplify some constraints,

but it's possible I missed some of those.

@mrBliss added the "consensus" label (issues related to ouroboros-consensus) on Jul 23, 2020
@mrBliss linked an issue on Jul 23, 2020 that may be closed by this pull request
@mrBliss (Contributor) left a comment:

Keep up the good work!

@mrBliss (Contributor) left a comment:

Just some minor things.

Can you set #2446 to be the base branch of this PR?

data LedgerDB l r = LedgerDB {
      -- | The ledger state at the tip of the chain
      ledgerDbCurrent :: !l

      -- | Older ledger states

Review comment:

Now that `ledgerDbCurrent` is no longer a field, we only have these "older ledger states", which include the current one. So rename it to:

Suggested change:

    -      -- | Older ledger states
    +      -- | Ledger state checkpoints
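
To make the shape of the change concrete, here is a minimal sketch of the idea behind this PR, using hypothetical simplified types rather than the actual ouroboros-consensus definitions: the LedgerDB keeps a checkpoint for every one of the last k blocks, and the current ledger state becomes a function over those checkpoints instead of a separate field.

    -- Minimal sketch; hypothetical simplified types, not the real LedgerDB.
    import           Data.Sequence (Seq)
    import qualified Data.Sequence as Seq

    newtype LedgerDB l = LedgerDB
      { ledgerDbCheckpoints :: Seq l
        -- ^ Ledger state checkpoints, oldest first; the last element is the
        -- ledger state at the tip of the chain.
      }

    -- | The ledger state at the tip (previously the 'ledgerDbCurrent' field).
    ledgerDbCurrent :: LedgerDB l -> l
    ledgerDbCurrent (LedgerDB cps) = case Seq.viewr cps of
        _ Seq.:> l -> l
        Seq.EmptyR -> error "LedgerDB invariant: at least one checkpoint"

    -- | Add the ledger state obtained by applying one more block, keeping at
    -- most k + 1 states: the current one plus the k older checkpoints.
    ledgerDbPush :: Int -> l -> LedgerDB l -> LedgerDB l
    ledgerDbPush k l (LedgerDB cps)
      | Seq.length cps >= k + 1 = LedgerDB (Seq.drop 1 cps Seq.|> l)
      | otherwise               = LedgerDB (cps Seq.|> l)

With snapEvery gone there is no sampling parameter left: rolling back up to k blocks is always just dropping elements from the right of the sequence.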

@mrBliss (Contributor) commented Aug 10, 2020

Just as a sanity check, I tried this out using cardano-node:

cardano-node: e8fb486e44aceb4b9962d7b3b76849becd510820
ouroboros-network: ec74d1c, with and without cherry-picking c4d4435

I used a database recently synced with mainnet, in the Shelley era, and disabled syncing. Very important: the VolatileDB must contain a chain of at least k blocks, otherwise we won't have k snapshots in memory.

Trace output:

[desktop:cardano.node.ChainDB:Info:5] [2020-08-10 10:16:16.18 UTC] Opened imm db with immutable tip at (Point 4564640, "35b10144ee477ba9f80389fbcdbb4f5ebd7678e1750dbaa2583eeff00de69466") and epoch 211
[desktop:cardano.node.ChainDB:Info:5] [2020-08-10 10:16:16.46 UTC] Opened vol db
[desktop:cardano.node.ChainDB:Info:5] [2020-08-10 10:16:19.91 UTC] Replaying ledger from snapshot DiskSnapshot 28 at (Point 4563080, "2c85088fa88973c87cd0f2aca3aebab30ab9719e72e2b784d2395b7df692071e")
[desktop:cardano.node.ChainDB:Info:5] [2020-08-10 10:16:20.23 UTC] Replayed block: slot SlotNo {unSlotNo = 4563100} of At (SlotNo {unSlotNo = 4564640})
[desktop:cardano.node.ChainDB:Info:5] [2020-08-10 10:16:22.92 UTC] block replay progress (%) = 100.0
[desktop:cardano.node.ChainDB:Info:5] [2020-08-10 10:16:24.87 UTC] before next, messages elided = 4563119
[desktop:cardano.node.ChainDB:Info:5] [2020-08-10 10:16:24.87 UTC] Replayed block: slot SlotNo {unSlotNo = 4564640} of At (SlotNo {unSlotNo = 4564640})
[desktop:cardano.node.ChainDB:Info:5] [2020-08-10 10:16:24.87 UTC] Opened lgr db
[desktop:cardano.node.ChainDB:Info:5] [2020-08-10 10:19:51.55 UTC] Opened db with immutable tip at (Point 4564640, "35b10144ee477ba9f80389fbcdbb4f5ebd7678e1750dbaa2583eeff00de69466") and tip (Point 4608280, "64455962d499039d109e642aafce53408df6c9dcc0f8f726074a837b739682c7")

I looked at the memory usage of cardano-node with htop.

Without this PR: 911 MB RAM
With this PR: 7717 MB RAM (!!!)

I repeated this a few times and always got similar results. Trying with snapEvery = 1 instead of this PR gave similar results.

This means we should definitely not merge this PR now.

Thoughts:

  • The measurements in #1936 ("Keep every snapshot of the ledger state in memory") were done using the Byron ledger, not the Shelley (or Cardano) one.
  • I suspect that for some reason, there is much less sharing in the Shelley ledger than in the Byron one.
  • Recently there have been a lot of transactions on the chain, record numbers. This might explain some extra memory usage, but not this much.
  • This is without cardano-ledger#1722 ("Fix thunks"), which fixes some thunks. I'll measure again with that fix in, but I'm not expecting much of it. UPDATE: it doesn't improve things.

@mrBliss (Contributor) commented Aug 10, 2020

As a sanity check, I repeated my measurements using mainnet at the end of the Byron era:


Without this PR: 305 MB RAM
With this PR: 455-465 MB RAM

Not the catastrophic increase we saw for Shelley, but still more than expected from #1936: 150 MB more instead of the expected <10 MB 🤔. I'm not sure how the Haskell heap grows; maybe that plays a role here.

I'm wondering whether we're overlooking something 🤔.

@mrBliss (Contributor) commented Aug 11, 2020

@kderme Could you repeat the experiments from #1936 with the Shelley ledger?

@kderme (Contributor, Author) commented Aug 13, 2020

With 8e7b5e2 and IntersectMBO/cardano-ledger#1775 cherry-picked, I still see very high memory usage (maximum residency) on Shelley.

snapEvery = 1:

 4,105,555,200 bytes maximum residency (47 sample(s))
 104,871,740,160 bytes allocated in the heap
 92,482,133,216 bytes copied during GC

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     97207 colls,     0 par   38.548s  38.702s     0.0004s    0.0033s
  Gen  1        47 colls,     0 par   36.904s  40.847s     0.8691s    4.6105s

  MUT     time   60.596s  ( 62.544s elapsed)
  GC      time   75.452s  ( 79.549s elapsed)
  Total   time  136.048s  (142.094s elapsed)

snapEvery = 100:

 831,904,064 bytes maximum residency (120 sample(s))
 104,871,360,408 bytes allocated in the heap
 91,016,416,384 bytes copied during GC

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     97134 colls,     0 par   38.149s  38.292s     0.0004s    0.0030s
  Gen  1       120 colls,     0 par   20.708s  21.219s     0.1768s    0.6442s

  MUT     time   61.557s  ( 63.407s elapsed)
  GC      time   58.857s  ( 59.510s elapsed)
  Total   time  120.415s  (122.918s elapsed)

snapEvery = k:

 682,901,344 bytes maximum residency (127 sample(s))
 104,871,422,048 bytes allocated in the heap
 90,606,678,416 bytes copied during GC

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     97127 colls,     0 par   37.810s  37.952s     0.0004s    0.0035s
  Gen  1       127 colls,     0 par   18.535s  18.971s     0.1494s    0.4656s

  MUT     time   59.905s  ( 61.707s elapsed)
  GC      time   56.345s  ( 56.922s elapsed)
  Total   time  116.251s  (118.630s elapsed)

@kderme (Contributor, Author) commented Aug 16, 2020

It looks like what caused the issue was the delegationTransition fixed in IntersectMBO/cardano-ledger#1779. Creating a new reward Map (with over 40k entries) probably killed sharing between consecutive ledger states.
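
For intuition, here is a minimal illustration, with hypothetical stand-in types rather than the ledger's actual delegationTransition code, of why rebuilding a large Map matters once many consecutive ledger states are kept alive: an incremental update shares almost the whole tree with the previous version, while a rebuilt Map shares nothing, so every retained ledger state pays for a full copy.

    import           Data.Map.Strict (Map)
    import qualified Data.Map.Strict as Map

    type Credential = String   -- hypothetical stand-in for a staking credential
    type Rewards    = Map Credential Integer

    -- Shares all untouched subtrees with the old map: keeping the previous
    -- ledger state alive costs only O(log n) extra nodes.
    addRewardShared :: Credential -> Integer -> Rewards -> Rewards
    addRewardShared = Map.insertWith (+)

    -- Builds a brand-new tree (here only updating existing credentials, for
    -- simplicity): keeping old and new ledger states alive roughly doubles
    -- the memory for this map, for every retained state.
    addRewardRebuilt :: Credential -> Integer -> Rewards -> Rewards
    addRewardRebuilt cred amount rewards =
        Map.fromList
          [ (c, if c == cred then r + amount else r)
          | (c, r) <- Map.toList rewards ]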

snapEvery = 1, without delegation fix:
[heap profile graph: every-1-no-fix-2]

With the fix cherry-picked, the graph looks quite different.

snapEvery = 1, with delegation fix:

[heap profile graph: every-1-fix]

This looks similar to the graph for
snapEvery = 100, with delegation fix:

[heap profile graph: every-100-fix]

@kderme (Contributor, Author) commented Aug 16, 2020

Some points:

  • All these results are with cardano-ledger#1775 ("Avoid using Map.keysSet on nesOsched") cherry-picked.
  • The memory difference between snapEvery=1 vs 100 (with the fix) happens because of utxoInductive (dark green on the snapEvery=1 graph). Maybe it's something we can optimize, or we should check whether it's reasonable.
  • I'd suggest we delay this a bit more, until all fixes are in place (including the fixes around epoch boundaries), to make sure we don't cause big memory spikes.

@mrBliss (Contributor) commented Aug 17, 2020

It looks like what caused the issue was the delegationTransition fixed in input-output-hk/cardano-ledger-specs#1779. Creating a new reward Map (with over 40k entries) probably killed sharing between consecutive ledger states.

[..]

Great, thanks for this investigation! Also nice to see I unknowingly already fixed the issue 🙂

Some points:

  • All these results are with input-output-hk/cardano-ledger-specs#1775 (https://github.com/input-output-hk/cardano-ledger-specs/pull/1775) cherry-picked.

  • The memory difference between snapEvery=1 vs 100 (with the fix) happens because of `utxoInductive` (dark green on the snapEvery=1 graph). Maybe it's something we can optimize, or we should check whether it's reasonable.

I believe it corresponds to this line:

      { _utxo = eval ((txins txb ⋪ utxo) ∪ txouts txb),

which is the main expected contributor to the memory growth, i.e., UTxO changes. Unless that eval is doing something suboptimal, there's not much we can do.
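
For reference, a sketch in plain Data.Map terms, with hypothetical simplified types rather than the ledger's set-algebra DSL, of what that line computes: remove the spent inputs from the UTxO and add the transaction's outputs, which is exactly the per-block growth we expect to retain across the k in-memory states.

    import           Data.Map.Strict (Map)
    import qualified Data.Map.Strict as Map
    import           Data.Set (Set)

    type TxIn  = (String, Word)     -- hypothetical (transaction id, output index)
    type TxOut = (String, Integer)  -- hypothetical (address, amount)
    type UTxO  = Map TxIn TxOut

    -- (txins txb ⋪ utxo) ∪ txouts txb, spelled out with Data.Map: drop the
    -- spent inputs, then add the new outputs.
    applyTxBody :: Set TxIn -> Map TxIn TxOut -> UTxO -> UTxO
    applyTxBody spentInputs newOutputs utxo =
        Map.union newOutputs (Map.withoutKeys utxo spentInputs)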

  • I'd suggest we delay this a bit more, until all fixes are in place (including the fixes around epoch boundaries), to make sure we don't cause big memory spikes.

Which fixes do you mean? IntersectMBO/cardano-ledger#1779 has been merged and IntersectMBO/cardano-ledger#1775 should not affect sharing, right? Do you mean IntersectMBO/cardano-ledger#1785?

I would actually be interested in seeing the impact of an epoch transition on the memory usage (with IntersectMBO/cardano-ledger#1785 applied).

I'm relieved that we can still go through with this simplification after all 😌. But I agree, let's first optimise the overlay schedule so we can reuse the memory freed by that optimisation for this simplification.

@kderme (Contributor, Author) commented Aug 20, 2020

On epoch boundaries the results also seem similar. This is using the db-analyser, starting from a snapshot at the very beginning of epoch 208 and validating up to a very recent block:

snapEvery = 100:
[heap profile graph: every-100]

snapEvery = 1:
[heap profile graph: every-1-boundaries]

Both cases report a maximum residency of 2,500,000,000 bytes and I see at most 4 GB used by top/ps (what's this PINNED memory?).
Total time is slightly higher with snapEvery = 1 (around 10 secs). I believe this is because of more time spent on GC:

snapEvery = 100:

MUT     time   18.213s  ( 56.391s elapsed)
GC      time   64.126s  ( 34.119s elapsed)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      4246 colls,  4246 par    9.601s   5.887s     0.0014s    0.0353s
  Gen  1       146 colls,   145 par   131.197s  66.556s     0.4559s    1.3432s

snapEvery = 1:

MUT     time   13.467s  ( 60.153s elapsed)
GC      time   83.366s  ( 44.040s elapsed)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      4240 colls,  4240 par   11.467s   7.066s     0.0017s    0.0429s
  Gen  1       164 colls,   163 par   165.699s  83.906s     0.5116s    1.3530s


@mrBliss (Contributor) commented Aug 20, 2020

On epoch boundaries the results also seem similar. This is using the db-analyser, starting from a snapshot at the very beginning of epoch 208 and validating up to a very recent block:

snapEvery = 100:
[..]

snapEvery = 1:
[..]

Both cases report a maximum residency of 2,500,000,000 bytes and I see at most 4 GB used by top/ps (what's this PINNED memory?).

Good, that's what I was expecting/hoping: there can be at most one epoch boundary transition in a range of k blocks, so snapEvery should not matter.

PINNED memory is typically the bytes in ByteStrings. As explained on Slack, my hypothesis is that these might be deserialised Byron addresses.

Total time is slightly higher with snapEvery = 1 (around 10 secs). I believe this is because of more time spent on GC:

snapEvery = 100:

MUT     time   18.213s  ( 56.391s elapsed)
GC      time   64.126s  ( 34.119s elapsed)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      4246 colls,  4246 par    9.601s   5.887s     0.0014s    0.0353s
  Gen  1       146 colls,   145 par   131.197s  66.556s     0.4559s    1.3432s

snapEvery = 1:

MUT     time   13.467s  ( 60.153s elapsed)
GC      time   83.366s  ( 44.040s elapsed)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      4240 colls,  4240 par   11.467s   7.066s     0.0017s    0.0429s
  Gen  1       164 colls,   163 par   165.699s  83.906s     0.5116s    1.3530s

More things to traverse, so more time; that's expected. Note that profiling skews this result: running with -s but without profiling, the result should be better. Also, that epoch boundary transition generates tons of garbage, so it makes sense that GC is doing a lot of work.

@kderme (Contributor, Author) commented Aug 24, 2020

The results after the fix #2532 look pretty similar again:

before the PR:
[heap profile graph: every-100]

after the PR:
[heap profile graph: pr]

I also tested what happens if we never call prune (keeping all ledger states in memory). This shows that sharing is pretty good, taking into account that by the end there are ~90000 ledger states stored.
[heap profile graph: no-prune]

I also tested times and memory with psrecord (no GHC profiling involved) and the PR doesn't create any visible difference:

[psrecord plot: plot1]

@mrBliss (Contributor) commented Aug 24, 2020

Great, let's merge!

@mrBliss (Contributor) commented Aug 24, 2020

bors merge

@iohk-bors bot commented Aug 24, 2020

👎 Rejected by too few approved reviews

@mrBliss self-requested a review on August 24, 2020 at 15:04

@mrBliss (Contributor) left a comment:

bors merge


@mrBliss (Contributor) commented Aug 25, 2020

bors ping

@iohk-bors bot commented Aug 25, 2020

pong

@mrBliss (Contributor) commented Aug 25, 2020

bors merge

@iohk-bors bot commented Aug 25, 2020

@iohk-bors bot merged commit 04734c4 into master on Aug 25, 2020
@iohk-bors bot deleted the kderme/snapEvery-1 branch on August 25, 2020 at 06:55
mrBliss added a commit that referenced this pull request Aug 26, 2020
Fixes #1935.

Consider the following situation:

    A -> B          -- current chain
      \
       > B' -> C'   -- fork

To validate the header of C', we need a ledger view that is valid for the slot
of C'. We can't always use the current ledger to produce a ledger view, because
our chain might include changes (that applied after the intersection A) to the
ledger view that are not present in the fork. So we must get a ledger view at
the intersection, A, and use that to forecast the ledger view at C'.

Previously, we only kept a ledger state in memory for every 100 blocks, so
obtaining a ledger state at the intersection point might require reading blocks
from disk and reapplying them. As this is expensive for us, but cheap to trigger
by attackers (create headers and serve them to us), this would lead to DoS
possibilities.

For that reason, each ledger state stored snapshots of past ledger views. So to
obtain a ledger view for A, we could ask B for a past ledger view at the slot of
A.

After #2459, we keep all `k` past ledgers in memory, which makes it cheap to ask
for a past ledger and thus past ledger view. This means we no longer need to
store ledger view snapshots in each ledger state. Both for Byron and Shelley we
can remove the ledger view history from the ledger. To remain backwards binary
compatible with existing ledger snapshots, we still allow the ledger view
history in the decoders, but ignore it, and don't encode it anymore.

Consequently, `ledgerViewForecastAtTip` no longer needs to specify *at* which
slot (i.e., A's slot) to make the forecast (i.e., which past ledger view
snapshot to use). Instead, we first get the right past ledger with
`getPastLedger` and use that to make forecasts. This results in some
simplifications in the ChainSyncClient.
mrBliss added a commit that referenced this pull request Aug 30, 2020
Fixes #1935.

Consider the following situation:

    A -> B          -- current chain
      \
       > B' -> C'   -- fork

To validate the header of C', we need a ledger view that is valid for the slot
of C'. We can't always use the current ledger to produce a ledger view, because
our chain might include changes to the ledger view that are not present in the
fork (they were activated after the intersection A). So we must get a ledger
view at the intersection, A, and use that to forecast the ledger view at C'.

Previously, we only kept a ledger state in memory for every 100 blocks, so
obtaining a ledger state at the intersection point might require reading blocks
from disk and reapplying them. As this is expensive for us, but cheap to trigger
by attackers (create headers and serve them to us), this would lead to DoS
possibilities.

For that reason, each ledger state stored snapshots of past ledger views. So to
obtain a ledger view for A, we could ask B for a past ledger view at the slot of
A (and use that to forecast for C').

After #2459, we keep all `k` past ledgers in memory, which makes it cheap to ask
for a past ledger and thus past ledger view. This means we no longer need to
store ledger view snapshots in each ledger state. This was awkward, because we
had a double history: we stored snapshots of the ledger state and each ledger
state stored snapshots of the ledger view.

Both for Byron and Shelley we can remove the ledger view history from the
ledger. To remain backwards binary compatible with existing ledger snapshots, we
still allow the ledger view history in the decoders, but ignore it, and don't
encode it anymore.

Consequently, `ledgerViewForecastAtTip` no longer needs to specify *at* which
slot (i.e., A's slot) to make the forecast (i.e., which past ledger view
snapshot to use). Instead, we first get the right past ledger with
`getPastLedger` and use its ledger view to make forecasts. This results in some
simplifications in the ChainSyncClient.
mrBliss added a commit that referenced this pull request Aug 31, 2020
Fixes #1935, #2506, #2559, and #2562.

Consider the following situation:

    A -> B          -- current chain
      \
       > B' -> C'   -- fork

To validate the header of C', we need a ledger view that is valid for the slot
of C'. We can't always use the current ledger to produce a ledger view, because
our chain might include changes to the ledger view that are not present in the
fork (they were activated after the intersection A). So we must get a ledger
view at the intersection, A, and use that to forecast the ledger view at C'.

Previously, we only kept a ledger state in memory for every 100 blocks, so
obtaining a ledger state at the intersection point might require reading blocks
from disk and reapplying them. As this is expensive for us, but cheap to trigger
by attackers (create headers and serve them to us), this would lead to DoS
possibilities.

For that reason, each ledger state stored snapshots of past ledger views. So to
obtain a ledger view for A, we asked B for a past ledger view at the slot of
A (and used that to forecast for C').

After #2459 we keep all `k` past ledgers in memory, which makes it cheap to ask
for a past ledger and thus a past ledger view. This means we no longer need to
store ledger view snapshots in each ledger state. This was awkward, because we
had a double history: we stored snapshots of the ledger state and each ledger
state stored snapshots of the ledger view.

Both for Byron and Shelley we can remove the ledger view history from the
ledger. To remain backwards binary compatible with existing ledger snapshots, we
still allow the ledger view history in the decoders but ignore it, and don't
encode it anymore.

Consequently, `ledgerViewForecastAtTip` no longer needs to specify *at* which
slot (i.e., A's slot) to make the forecast (i.e., which past ledger view
snapshot to use). Instead, we first get the right past ledger with
`getPastLedger` and use its ledger view to make forecasts. This results in some
simplifications in the ChainSyncClient.
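
As a rough sketch of the flow described above, with hypothetical simplified types rather than the real ouroboros-consensus API: the LedgerDB, which after #2459 holds all of the last k ledger states, gives us the state at the intersection point A, and we forecast the ledger view for C' from that state.

    import Data.Word (Word64)

    newtype Point  = Point Word64  deriving (Eq, Show)   -- stand-in for a chain point
    newtype SlotNo = SlotNo Word64 deriving (Eq, Ord, Show)

    -- Stand-in for the LedgerDB: the last k ledger states keyed by their point.
    newtype LedgerDB l = LedgerDB [(Point, l)]

    -- | Cheap in-memory lookup, possible for any point within the last k
    -- blocks now that every snapshot is kept.
    getPastLedger :: Point -> LedgerDB l -> Maybe l
    getPastLedger pt (LedgerDB checkpoints) = lookup pt checkpoints

    -- | To validate the header of C' on a fork, forecast the ledger view at
    -- its slot from the ledger state at the intersection A, not from the tip.
    ledgerViewForHeader
      :: (l -> SlotNo -> Maybe view)  -- ^ forecast a view from a ledger state
      -> LedgerDB l
      -> Point                        -- ^ intersection point A
      -> SlotNo                       -- ^ slot of the header C'
      -> Maybe view
    ledgerViewForHeader forecast db intersection slot = do
        pastLedger <- getPastLedger intersection db
        forecast pastLedger slot
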
Successfully merging this pull request may close these issues.

Simplify LedgerDB after snapEvery changes to 1