
Avoid rollbacks on restarts #1190

Closed · kderme opened this issue Jul 11, 2022 · 8 comments · Fixed by #1266
Labels: enhancement (New feature or request)

Comments

@kderme (Contributor) commented Jul 11, 2022

On restarts, db-sync finds the latest point for which there is both a snapshot and data already inserted in the db. It deletes newer snapshots and the data of newer blocks, and starts syncing from that point. Deleting the newer data could be avoided if db-sync instead applied the ledger rules from the snapshot until it reaches a state compatible with the db tip. Before applying the rules, db-sync should first find the db tip and make sure it is on the node's chain.
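As a rough sketch of the idea (every type and function name below is hypothetical and does not correspond to db-sync's actual internals), the replay-instead-of-delete approach could look something like this:

```haskell
-- Illustrative sketch only: all names here are placeholders, not db-sync code.
module ReplaySketch where

newtype SlotNo = SlotNo Word deriving (Eq, Ord, Show)

data LedgerState = LedgerState { ledgerTip :: SlotNo }
data Block = Block { blockSlot :: SlotNo }

-- Stand-ins for real snapshot, database and chain access.
loadSnapshot :: IO LedgerState
loadSnapshot = pure (LedgerState (SlotNo 0))

findDbTip :: IO SlotNo
findDbTip = pure (SlotNo 100)

isOnNodeChain :: SlotNo -> IO Bool
isOnNodeChain _ = pure True

nextBlockAfter :: SlotNo -> IO Block
nextBlockAfter (SlotNo s) = pure (Block (SlotNo (s + 1)))

applyLedgerRules :: LedgerState -> Block -> LedgerState
applyLedgerRules _ b = LedgerState (blockSlot b)

-- Instead of deleting everything newer than the snapshot, replay the ledger
-- rules from the snapshot until the ledger state catches up with the db tip.
catchUpToDbTip :: IO LedgerState
catchUpToDbTip = do
  dbTip   <- findDbTip
  onChain <- isOnNodeChain dbTip  -- first make sure the db tip is on the node's chain
  if not onChain
    then error "db tip not on the node chain: fall back to a normal rollback"
    else loadSnapshot >>= go dbTip
  where
    go dbTip st
      | ledgerTip st >= dbTip = pure st  -- caught up; resume normal syncing from here
      | otherwise = do
          blk <- nextBlockAfter (ledgerTip st)
          go dbTip (applyLedgerRules st blk)
```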

Other ways to avoid big rollbacks on restarts are to store more frequent snapshots, or to store a snapshot from a signal handler while exiting.

@kderme added the bug (Something isn't working) and enhancement (New feature or request) labels and removed the bug label on Jul 11, 2022
@erikd (Contributor) commented Jul 21, 2022

> Other ways to avoid big rollbacks on restarts are to store more frequent snapshots

Yes, it will reduce the need for rollbacks, but it is likely to slow syncing down even further. In epoch 319 I see:

[2022-07-21 03:20:55.35 UTC] Took a ledger snapshot at ledger-state/mainnet/52613171-79e0ce53e6.lstate.
          It took 103.31990899s.
[2022-07-21 05:03:43.35 UTC] Took a ledger snapshot at ledger-state/mainnet/52823074-07f01c5bd4.lstate.
          It took 127.062317205s.
[2022-07-21 05:23:36.66 UTC] Took a ledger snapshot at ledger-state/mainnet/52876752-af192981f4-319.lstate.
          It took 55.49092172s.

That is 5 minutes out of the total of 194 minutes it took to sync that epoch.

More frequent snapshots means more time spent storing them on disk. A partial solution is to offload the snapshot write to another thread, but since db-sync is already I/O bound this will almost certainly do very little to help.
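As an aside, here is a minimal sketch of the "offload to another thread" idea using the async package (the LedgerState type and the serialisation function are placeholders, not db-sync's real ones); as noted, this is unlikely to buy much while the process is I/O bound:

```haskell
-- Minimal sketch of writing a snapshot from a separate thread.
module AsyncSnapshotSketch where

import Control.Concurrent.Async (Async, async, wait)
import qualified Data.ByteString.Lazy as LBS

data LedgerState = LedgerState

serialiseLedgerState :: LedgerState -> LBS.ByteString
serialiseLedgerState _ = LBS.empty

-- Start the snapshot write without blocking the sync loop. In a real
-- implementation the state would have to be frozen or copied first so the
-- sync loop can keep going while the write happens.
takeSnapshotAsync :: FilePath -> LedgerState -> IO (Async ())
takeSnapshotAsync path st =
  async (LBS.writeFile path (serialiseLedgerState st))

-- The caller keeps inserting blocks and only waits for the write when it
-- actually needs it to have finished (e.g. just before exiting).
example :: LedgerState -> IO ()
example st = do
  writer <- takeSnapshotAsync "ledger-state/mainnet/example.lstate" st
  -- ... continue syncing blocks here ...
  wait writer
```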

The correct solution is for db-sync to drop its maintenance of ledger state (which I believe was always the idea) and have ledger/consensus/node trickle-feed the required ledger state data to db-sync in a fully deterministic way. This means that db-sync would not have any ledger state that needs to be stored on disk, and would not need to roll back more than a couple of blocks (worst case) on restart.
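Purely as an illustration of that direction (no such interface exists today; every type and name below is made up), the trickle feed could amount to the node emitting a deterministic stream of ledger events that db-sync only has to persist:

```haskell
-- Hypothetical sketch of a deterministic ledger-event feed; none of these
-- types exist in the node or db-sync, they only illustrate the shape of the idea.
module LedgerEventSketch where

newtype SlotNo = SlotNo Word   deriving (Eq, Ord, Show)
newtype Coin   = Coin Integer  deriving (Eq, Show)
newtype PoolId = PoolId String deriving (Eq, Show)

-- Events the node would emit, in a fixed order, as it applies each block,
-- so db-sync never has to recompute them from its own copy of the ledger state.
data LedgerEvent
  = RewardsDistributed SlotNo [(PoolId, Coin)]
  | EpochBoundary SlotNo
  deriving Show

-- db-sync would only persist each event; a restart resumes from the db tip
-- and never needs to roll back more than a couple of blocks.
persistEvent :: LedgerEvent -> IO ()
persistEvent = print  -- stand-in for a real database insert
```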

@erikd (Contributor) commented Jul 21, 2022

> to store a snapshot from a signal handler while exiting.

There are a bunch of real-world scenarios where signal handlers and on-exit functions do not get called. For example, the machine loses power or the process is killed by the kernel OOM handler. I am sure there are many others.

@kderme (Contributor, Author) commented Jul 21, 2022

> There are a bunch of real-world scenarios where signal handlers and on-exit functions do not get called. For example, the machine loses power or the process is killed by the kernel OOM handler. I am sure there are many others.

Sure, we're trying to catch the most common case, a normal exit, and improve it.

> The correct solution is for db-sync to drop its maintenance of ledger state (which I believe was always the idea) and have ledger/consensus/node trickle-feed the required ledger state data to db-sync in a fully deterministic way.

That's something we should push to the node teams, but it could take quite a while to design and implement. In the meantime we can try to improve the situation on our side.

@erikd (Contributor) commented Jul 21, 2022

> In the meantime we can try to improve the situation on our side.

IMO this is a bad idea. If we produce a workaround for this in db-sync and then try to undo it when the lower-level libraries provide us with what we need, there is a good chance that deleterious artifacts will be left over in the db-sync code. For example, artifacts of the split between the cardano-db-sync and cardano-sync packages are still in this code base, where they should not be, a year or so after the two packages were merged back into one. I doubt they will ever be removed.

@kderme (Contributor, Author) commented Aug 4, 2022

> IMO this is a bad idea.

It's hard to agree that it's a bad idea to improve things on our side in any context. The benefits of splitting the packages in two were questionable; here, however, there are direct benefits from having fast restarts. It's actually one of the most requested improvements.

@centromere commented:
I am crafting a Helm chart for db-sync, and as part of that development I need to restart it quite often. Below are the most recent logs from my Pod:

[db-sync-node:Info:69] [2022-09-01 07:16:03.07 UTC] Found snapshot file for slot 67766471, hash 5efade4e133986bb9363330beb19fc558ece738ca37efe1bf6c037b8bd0f7f36. It took 2.503379629s to read from disk and 140.376048847s to parse.
[db-sync-node:Info:69] [2022-09-01 07:16:09.17 UTC] Found snapshot file for slot 67766471, hash 5efade4e133986bb9363330beb19fc558ece738ca37efe1bf6c037b8bd0f7f36
[db-sync-node:Info:69] [2022-09-01 07:16:09.18 UTC] File /var/lib/cexplorer/67766471-5efade4e13.lstate exists
[db-sync-node:Info:69] [2022-09-01 07:16:09.29 UTC] Starting at epoch 354
[db-sync-node:Info:69] [2022-09-01 07:16:09.74 UTC] Insert Alonzo Block: epoch 354, slot 67766486, block 7570000, hash f2a92fb5a949c32f562424eb69cd522d6f502671d48891c6e2e14e04f728cca5
[db-sync-node:Info:69] [2022-09-01 07:26:02.90 UTC] Offline pool metadata fetch: 49 results, 43 fetch errors
[db-sync-node:Info:69] [2022-09-01 07:37:57.27 UTC] Offline pool metadata fetch: 50 results, 41 fetch errors
[db-sync-node:Info:69] [2022-09-01 07:42:53.09 UTC] Insert Alonzo Block: epoch 354, slot 67868072, block 7575000, hash 9093365c52520e34bf67184fc407c905c2732eb2881d347c94f1fdb833a21391
[db-sync-node:Info:69] [2022-09-01 07:48:12.88 UTC] Offline pool metadata fetch: 50 results, 39 fetch errors
[db-sync-node:Info:69] [2022-09-01 07:58:58.45 UTC] Offline pool metadata fetch: 49 results, 41 fetch errors

If I need to recreate the Pod, db-sync will be terminated and all those blocks will be rolled back on next start. This is extremely annoying!

At the very least, db-sync should install a SIGTERM or SIGINT handler which initiates an orderly shutdown -- including the creation of a fresh *.lstate snapshot.
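A minimal sketch of that suggestion, assuming a POSIX system and a placeholder takeFinalSnapshot action (this is not db-sync's actual shutdown code):

```haskell
-- Sketch only: install SIGTERM/SIGINT handlers that request an orderly
-- shutdown, including a final snapshot write. takeFinalSnapshot is a placeholder.
module ShutdownSketch where

import Control.Concurrent.MVar (newEmptyMVar, takeMVar, tryPutMVar)
import Control.Monad (void)
import System.Posix.Signals (Handler (Catch), installHandler, sigINT, sigTERM)

takeFinalSnapshot :: IO ()
takeFinalSnapshot = putStrLn "writing a final .lstate snapshot (placeholder)"

main :: IO ()
main = do
  shutdownRequested <- newEmptyMVar
  -- The handlers do as little as possible: they only flag the request.
  let request = void (tryPutMVar shutdownRequested ())
  _ <- installHandler sigTERM (Catch request) Nothing
  _ <- installHandler sigINT  (Catch request) Nothing
  -- ... the normal sync loop would run in other threads ...
  takeMVar shutdownRequested  -- block until SIGTERM or SIGINT arrives
  takeFinalSnapshot           -- then shut down in an orderly way
```

The handler only flags the request; the snapshot write itself happens on the main thread, which keeps any real I/O out of the signal handler.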

@kderme (Contributor, Author) commented Sep 1, 2022

> I am crafting a Helm chart for db-sync, and as part of that development I need to restart it quite often.

If you don't need the historic rewards and the other ledger-derived data described in https://github.com/input-output-hk/cardano-db-sync/blob/master/doc/configuration.md#--disable-ledger, you can already use the --disable-ledger flag. With that flag there is no rollback on restarts.

@kderme (Contributor, Author) commented Sep 28, 2022

This has been fixed by #1266.
