
Avoid rollbacks on restarts #1190

Closed · kderme opened this issue Jul 11, 2022 · 8 comments · Fixed by #1266
Labels: enhancement (New feature or request)

Comments

@kderme (Contributor) commented Jul 11, 2022

On restarts, db-sync finds the latest point for which there is both a snapshot and data already inserted in the db. It deletes newer snapshots and the data of newer blocks, and starts syncing from that point. Deleting the newer data could be avoided if db-sync instead applied the ledger rules from the snapshot until it reaches a state compatible with the db tip. Before applying the rules, db-sync should first find the db tip and make sure it is on the node's chain.
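As a rough sketch of the idea (every type and function name below is hypothetical and does not correspond to db-sync's actual internals), the replay-instead-of-delete approach could look something like this:

```haskell
-- Illustrative sketch only: all names here are placeholders, not db-sync code.
module ReplaySketch where

newtype SlotNo = SlotNo Word deriving (Eq, Ord, Show)

data LedgerState = LedgerState { ledgerTip :: SlotNo }
data Block = Block { blockSlot :: SlotNo }

-- Stand-ins for real snapshot, database and chain access.
loadSnapshot :: IO LedgerState
loadSnapshot = pure (LedgerState (SlotNo 0))

findDbTip :: IO SlotNo
findDbTip = pure (SlotNo 100)

isOnNodeChain :: SlotNo -> IO Bool
isOnNodeChain _ = pure True

nextBlockAfter :: SlotNo -> IO Block
nextBlockAfter (SlotNo s) = pure (Block (SlotNo (s + 1)))

applyLedgerRules :: LedgerState -> Block -> LedgerState
applyLedgerRules _ b = LedgerState (blockSlot b)

-- Instead of deleting everything newer than the snapshot, replay the ledger
-- rules from the snapshot until the ledger state catches up with the db tip.
catchUpToDbTip :: IO LedgerState
catchUpToDbTip = do
  dbTip   <- findDbTip
  onChain <- isOnNodeChain dbTip  -- first make sure the db tip is on the node's chain
  if not onChain
    then error "db tip not on the node chain: fall back to a normal rollback"
    else loadSnapshot >>= go dbTip
  where
    go dbTip st
      | ledgerTip st >= dbTip = pure st  -- caught up; resume normal syncing from here
      | otherwise = do
          blk <- nextBlockAfter (ledgerTip st)
          go dbTip (applyLedgerRules st blk)
```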

Other ways to avoid big rollbacks on restarts are to store more frequent snapshots, or to store a snapshot from a signal handler while exiting.

@kderme added the bug (Something isn't working) and enhancement (New feature or request) labels and removed the bug label on Jul 11, 2022
@erikd (Contributor) commented Jul 21, 2022

> Other ways to avoid big rollbacks on restarts are to store more frequent snapshots

Yes, it will reduce the need for rollbacks, but it is likely to slow syncing down even further. In epoch 319 I see:

[2022-07-21 03:20:55.35 UTC] Took a ledger snapshot at ledger-state/mainnet/52613171-79e0ce53e6.lstate.
          It took 103.31990899s.
[2022-07-21 05:03:43.35 UTC] Took a ledger snapshot at ledger-state/mainnet/52823074-07f01c5bd4.lstate.
          It took 127.062317205s.
[2022-07-21 05:23:36.66 UTC] Took a ledger snapshot at ledger-state/mainnet/52876752-af192981f4-319.lstate.
          It took 55.49092172s.

That is 5 minutes out of the total of 194 minutes it took to sync that epoch.

More frequent snapshots means more time spent storing them on disk. A partial solution is to offload the snapshot write to another thread, but since db-sync is already I/O bound this will almost certainly do very little to help.
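As an aside, here is a minimal sketch of the "offload to another thread" idea using the async package (the LedgerState type and the serialisation function are placeholders, not db-sync's real ones); as noted, this is unlikely to buy much while the process is I/O bound:

```haskell
-- Minimal sketch of writing a snapshot from a separate thread.
module AsyncSnapshotSketch where

import Control.Concurrent.Async (Async, async, wait)
import qualified Data.ByteString.Lazy as LBS

data LedgerState = LedgerState

serialiseLedgerState :: LedgerState -> LBS.ByteString
serialiseLedgerState _ = LBS.empty

-- Start the snapshot write without blocking the sync loop. In a real
-- implementation the state would have to be frozen or copied first so the
-- sync loop can keep going while the write happens.
takeSnapshotAsync :: FilePath -> LedgerState -> IO (Async ())
takeSnapshotAsync path st =
  async (LBS.writeFile path (serialiseLedgerState st))

-- The caller keeps inserting blocks and only waits for the write when it
-- actually needs it to have finished (e.g. just before exiting).
example :: LedgerState -> IO ()
example st = do
  writer <- takeSnapshotAsync "ledger-state/mainnet/example.lstate" st
  -- ... continue syncing blocks here ...
  wait writer
```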

The correct solution is for db-sync to drop its maintenance of ledger state (which I believe was always the idea) and have ledger/consensus/node trickle-feed the required ledger state data to db-sync in a fully deterministic way. This means that db-sync would not have any ledger state that needs to be stored on disk, and would not need to roll back more than a couple of blocks (worst case) on restart.
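Purely as an illustration of that direction (no such interface exists today; every type and name below is made up), the trickle feed could amount to the node emitting a deterministic stream of ledger events that db-sync only has to persist:

```haskell
-- Hypothetical sketch of a deterministic ledger-event feed; none of these
-- types exist in the node or db-sync, they only illustrate the shape of the idea.
module LedgerEventSketch where

newtype SlotNo = SlotNo Word   deriving (Eq, Ord, Show)
newtype Coin   = Coin Integer  deriving (Eq, Show)
newtype PoolId = PoolId String deriving (Eq, Show)

-- Events the node would emit, in a fixed order, as it applies each block,
-- so db-sync never has to recompute them from its own copy of the ledger state.
data LedgerEvent
  = RewardsDistributed SlotNo [(PoolId, Coin)]
  | EpochBoundary SlotNo
  deriving Show

-- db-sync would only persist each event; a restart resumes from the db tip
-- and never needs to roll back more than a couple of blocks.
persistEvent :: LedgerEvent -> IO ()
persistEvent = print  -- stand-in for a real database insert
```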

@erikd (Contributor) commented Jul 21, 2022

> to store a snapshot from a signal handler while exiting.

There are a bunch of real-world scenarios where signal handlers and on-exit functions do not get called. For example, the machine loses power or the process is killed by the kernel OOM handler. I am sure there are many others.

@kderme (Contributor, Author) commented Jul 21, 2022

> There are a bunch of real-world scenarios where signal handlers and on-exit functions do not get called. For example, the machine loses power or the process is killed by the kernel OOM handler. I am sure there are many others.

Sure, we're trying to catch the most common case, a normal exit, and improve it.

> The correct solution is for db-sync to drop its maintenance of ledger state (which I believe was always the idea) and have ledger/consensus/node trickle-feed the required ledger state data to db-sync in a fully deterministic way.

That's something we should push to the node teams, but it could take quite a while to design and implement. In the meantime we can try to improve the situation on our side.

@erikd (Contributor) commented Jul 21, 2022

> In the meantime we can try to improve the situation on our side.

IMO this is a bad idea. If we produce a workaround for this in db-sync and then try to undo it when the lower-level libraries provide us with what we need, there is a good chance that deleterious artifacts will be left over in the db-sync code. For example, artifacts of the split between the cardano-db-sync and cardano-sync packages are still in this code base, where they should not be, a year or so after the two packages were merged back into one. I doubt they will ever be removed.

@kderme (Contributor, Author) commented Aug 4, 2022

> IMO this is a bad idea.

It's hard to agree that it's a bad idea to improve things on our side in any context. The benefits of splitting the packages in two were questionable; here, however, there are direct benefits from having fast restarts. It's actually one of the most requested improvements.

@centromere commented:
I am crafting a Helm chart for db-sync, and as part of that development I need to restart it quite often. Below are the most recent logs from my Pod:

[db-sync-node:Info:69] [2022-09-01 07:16:03.07 UTC] Found snapshot file for slot 67766471, hash 5efade4e133986bb9363330beb19fc558ece738ca37efe1bf6c037b8bd0f7f36. It took 2.503379629s to read from disk and 140.376048847s to parse.
[db-sync-node:Info:69] [2022-09-01 07:16:09.17 UTC] Found snapshot file for slot 67766471, hash 5efade4e133986bb9363330beb19fc558ece738ca37efe1bf6c037b8bd0f7f36
[db-sync-node:Info:69] [2022-09-01 07:16:09.18 UTC] File /var/lib/cexplorer/67766471-5efade4e13.lstate exists
[db-sync-node:Info:69] [2022-09-01 07:16:09.29 UTC] Starting at epoch 354
[db-sync-node:Info:69] [2022-09-01 07:16:09.74 UTC] Insert Alonzo Block: epoch 354, slot 67766486, block 7570000, hash f2a92fb5a949c32f562424eb69cd522d6f502671d48891c6e2e14e04f728cca5
[db-sync-node:Info:69] [2022-09-01 07:26:02.90 UTC] Offline pool metadata fetch: 49 results, 43 fetch errors
[db-sync-node:Info:69] [2022-09-01 07:37:57.27 UTC] Offline pool metadata fetch: 50 results, 41 fetch errors
[db-sync-node:Info:69] [2022-09-01 07:42:53.09 UTC] Insert Alonzo Block: epoch 354, slot 67868072, block 7575000, hash 9093365c52520e34bf67184fc407c905c2732eb2881d347c94f1fdb833a21391
[db-sync-node:Info:69] [2022-09-01 07:48:12.88 UTC] Offline pool metadata fetch: 50 results, 39 fetch errors
[db-sync-node:Info:69] [2022-09-01 07:58:58.45 UTC] Offline pool metadata fetch: 49 results, 41 fetch errors

If I need to recreate the Pod, db-sync will be terminated and all those blocks will be rolled back on next start. This is extremely annoying!

At the very least, db-sync should install a SIGTERM or SIGINT handler which initiates an orderly shutdown -- including the creation of a fresh *.lstate snapshot.
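A minimal sketch of that suggestion, assuming a POSIX system and a placeholder takeFinalSnapshot action (this is not db-sync's actual shutdown code):

```haskell
-- Sketch only: install SIGTERM/SIGINT handlers that request an orderly
-- shutdown, including a final snapshot write. takeFinalSnapshot is a placeholder.
module ShutdownSketch where

import Control.Concurrent.MVar (newEmptyMVar, takeMVar, tryPutMVar)
import Control.Monad (void)
import System.Posix.Signals (Handler (Catch), installHandler, sigINT, sigTERM)

takeFinalSnapshot :: IO ()
takeFinalSnapshot = putStrLn "writing a final .lstate snapshot (placeholder)"

main :: IO ()
main = do
  shutdownRequested <- newEmptyMVar
  -- The handlers do as little as possible: they only flag the request.
  let request = void (tryPutMVar shutdownRequested ())
  _ <- installHandler sigTERM (Catch request) Nothing
  _ <- installHandler sigINT  (Catch request) Nothing
  -- ... the normal sync loop would run in other threads ...
  takeMVar shutdownRequested  -- block until SIGTERM or SIGINT arrives
  takeFinalSnapshot           -- then shut down in an orderly way
```

The handler only flags the request; the snapshot write itself happens on the main thread, which keeps any real I/O out of the signal handler.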

@kderme (Contributor, Author) commented Sep 1, 2022

> I am crafting a Helm chart for db-sync, and as part of that development I need to restart it quite often.

If you don't need the historic rewards and the other ledger-derived data described in https://github.com/input-output-hk/cardano-db-sync/blob/master/doc/configuration.md#--disable-ledger, you can already use the --disable-ledger flag. With that flag there is no rollback on restarts.

@kderme (Contributor, Author) commented Sep 28, 2022

This has been fixed by #1266.
