
Most recent node tip of Kupo may get stuck #143

Closed
nielstron opened this issue Nov 1, 2023 · 11 comments

@nielstron
Contributor

nielstron commented Nov 1, 2023

What Git revision / release tag are you using?

2.7.0+41ae67

Describe what the problem is.

Same as #142. It is unfortunately not resolved by upgrading to Kupo 2.7.0.

$ curl localhost:1482/health
# TYPE kupo_connection_status gauge
kupo_connection_status  1.0
# TYPE kupo_most_recent_checkpoint counter
kupo_most_recent_checkpoint  107274767
# TYPE kupo_most_recent_node_tip counter
kupo_most_recent_node_tip  107274767
$ curl localhost:6666/health
# TYPE kupo_connection_status gauge
kupo_connection_status  1.0
# TYPE kupo_most_recent_checkpoint counter
kupo_most_recent_checkpoint  107257239
# TYPE kupo_most_recent_node_tip counter
kupo_most_recent_node_tip  107257239

What should be the expected behavior?

The most recent node tip reported by Kupo should always reflect the actual current tip of the underlying node.
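A quick way to check whether the reported tip keeps advancing is to poll the /health endpoint and watch successive values of kupo_most_recent_node_tip. A minimal sketch, assuming the same port (1482) and an arbitrary 30-second interval; the metric names are the ones shown in the output above:

# Print the reported node tip every 30 seconds; a value that stops increasing
# while the node itself keeps advancing indicates a stuck tip.
while true; do
  tip=$(curl -s localhost:1482/health | awk '/^kupo_most_recent_node_tip/ { print $2 }')
  echo "$(date -u +%H:%M:%S) kupo_most_recent_node_tip=$tip"
  sleep 30
done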

If applicable, what are the logs from the server around the occurrence of the problem?

The relevant log from Kupo:

Nov 01 07:49:51 ns366159 bash[2889517]: {"severity":"Warning","timestamp":"2023-11-01T07:49:51.965094591Z","thread":"379893","message":{"Database":{"tag":"DatabaseConnection","message":{"tag":"ConnectionBusy","attempts":5134,"retryingIn":0.1}}},"version":"v2.7.0+41ae67"}
Nov 01 07:49:51 ns366159 bash[2889517]: {"severity":"Warning","timestamp":"2023-11-01T07:49:51.995123761Z","thread":"380244","message":{"Database":{"tag":"DatabaseConnection","message":{"tag":"ConnectionBusy","attempts":1427,"retryingIn":0.25}}},"version":"v2.7.0+41ae67"}
Nov 01 07:49:52 ns366159 bash[2889517]: {"severity":"Warning","timestamp":"2023-11-01T07:49:52.066252366Z","thread":"379893","message":{"Database":{"tag":"DatabaseConnection","message":{"tag":"ConnectionBusy","attempts":5135,"retryingIn":0.1}}},"version":"v2.7.0+41ae67"}
Nov 01 07:49:52 ns366159 bash[2889517]: {"severity":"Warning","timestamp":"2023-11-01T07:49:52.160340692Z","thread":"379902","message":{"Database":{"tag":"DatabaseConnection","message":{"tag":"ConnectionBusy","attempts":2056,"retryingIn":0.25}}},"version":"v2.7.0+41ae67"}
Nov 01 07:49:52 ns366159 bash[2889517]: {"severity":"Warning","timestamp":"2023-11-01T07:49:52.16747226Z","thread":"379893","message":{"Database":{"tag":"DatabaseConnection","message":{"tag":"ConnectionBusy","attempts":5136,"retryingIn":0.1}}},"version":"v2.7.0+41ae67"}
Nov 01 07:49:52 ns366159 bash[2889517]: {"severity":"Warning","timestamp":"2023-11-01T07:49:52.245537436Z","thread":"380244","message":{"Database":{"tag":"DatabaseConnection","message":{"tag":"ConnectionBusy","attempts":1428,"retryingIn":0.25}}},"version":"v2.7.0+41ae67"}
Nov 01 07:49:52 ns366159 bash[2889517]: {"severity":"Warning","timestamp":"2023-11-01T07:49:52.268632358Z","thread":"379893","message":{"Database":{"tag":"DatabaseConnection","message":{"tag":"ConnectionBusy","attempts":5137,"retryingIn":0.1}}},"version":"v2.7.0+41ae67"}
Nov 01 07:49:52 ns366159 bash[2889517]: {"severity":"Notice","timestamp":"2023-11-01T07:49:52.305537426Z","thread":"14","message":{"Gardener":{"tag":"GardenerExitGarbageCollection","totalPrunedRows":147695}},"version":"v2.7.0+41ae67"}
Nov 01 07:49:52 ns366159 bash[2889517]: {"severity":"Notice","timestamp":"2023-11-01T07:49:52.305640449Z","thread":"14","message":{"Gardener":{"tag":"GardenerBeginGarbageCollection","pruning":"binary_data"}},"version":"v2.7.0+41ae67"}
Nov 01 07:49:55 ns366159 bash[2889517]: {"severity":"Notice","timestamp":"2023-11-01T07:49:55.888146796Z","thread":"12","message":{"Consumer":{"tag":"ConsumerRollBackward","point":107257204}},"version":"v2.7.0+41ae67"}

When this log was fetched and the tip difference observed, the current time was 12:15 UTC.

@KtorZ
Member

KtorZ commented Nov 3, 2023

Hey,

  1. Are you sure that your node is actually synchronized?
  2. Could you run Kupo with an additional --log-level debug?
  3. Are you able to query the instance for matches and other stored elements in this situation?
  4. Could you run sqlite3 your/working/dir/kupo.sqlite3 ".indexes" on your database and forward the result?

Some of your connections have an extremely high number of failed attempts, which is rather odd. Some are even failing just to open (250ms is the retry delay on open). It might be necessary to ALSO prune binary_data incrementally (incremental deletion only happens for rollbacks and inputs pruning so far) to avoid keeping the database busy for too long. The high --gc-interval and multiple unsuccessful pruning runs have likely led to this situation.
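For concreteness, a minimal sketch of what one incremental pruning batch looks like against the SQLite database. The DELETE statement is the one that shows up in the v2.7.1 debug log further down in this thread, bounded by a LIMIT so each transaction stays short; treat it as an illustration against an offline copy, not an official maintenance command:

# One bounded pruning batch: remove up to 50000 binary_data rows that are no longer
# referenced by any input, then release the write lock before the next batch.
sqlite3 your/working/dir/kupo.sqlite3 "DELETE FROM binary_data WHERE binary_data_hash IN ( SELECT binary_data_hash FROM binary_data LEFT JOIN inputs ON binary_data_hash = inputs.datum_hash WHERE inputs.ext_output_reference IS NULL ORDER BY inputs.datum_hash LIMIT 50000 )"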

@Saghen

Saghen commented Nov 20, 2023

We've noticed a similar issue on Kupo v2.7.0 where garbage collection on binary_data takes 10-15 minutes. We don't have any traffic coming to the node and it's fully synchronized. I'll grab the additional logging soon

Arguments: --node-socket /ipc/node.socket --node-config /config/cardano-node/config.json --host "0.0.0.0" --workdir /data/db --since origin --match "*/*" --prune-utxo

{"severity":"Notice","timestamp":"2023-11-20T22:48:47.895743118Z","thread":"13","message":{"Gardener":{"tag":"GardenerBeginGarbageCollection","pruning":"binary_data"}},"version":"v2.7.0+41ae67"}
{"severity":"Notice","timestamp":"2023-11-20T23:00:34.770557484Z","thread":"13","message":{"Gardener":{"tag":"GardenerExitGarbageCollection","totalPrunedRows":822}},"version":"v2.7.0+41ae67"}
# sqlite3 kupo.sqlite3 ".indexes"

inputsByAddress                 sqlite_autoindex_binary_data_1
inputsByCreatedAt               sqlite_autoindex_inputs_1
inputsByOutputReference         sqlite_autoindex_patterns_1
inputsByPaymentCredential       sqlite_autoindex_policies_1
inputsBySpentAt                 sqlite_autoindex_scripts_1
policiesByPolicyId

@Saghen

Saghen commented Nov 22, 2023

Here's the debug logging excluding health checks (via rg -v HttpRequest | rg -v HttpResponse... hopefully doesn't exclude anything important): https://paste.super.fish/p/9Y3mCS.txt

@KtorZ
Member

KtorZ commented Nov 22, 2023

Thanks a lot @Saghen; I am a bit overloaded lately, but I will look into it by next week at the latest. One last question: I take from the command above that you aren't running with a special gc-interval but are using the default one?

@Saghen

Saghen commented Nov 22, 2023

Sounds good! And yep, using the default interval

KtorZ closed this as completed in fff8c25 on Dec 3, 2023
KtorZ added a commit that referenced this issue on Dec 3, 2023
@nielstron
Contributor Author

nielstron commented Dec 4, 2023

Reporting that Kupo on mainnet is still choking on the pruning (now limited to 50k rows, running for approximately 20 minutes at this point). Would it make sense to make the pruning increment size a command-line option?

Dec 04 12:57:12 ns366159 bash[2614907]: {"severity":"Debug","timestamp":"2023-12-04T12:57:12.959016276Z","thread":"13","message":{"Database":{"tag":"DatabaseConnection","message":{"tag":"ConnectionBeginQuery","beginQuery":" DELETE FROM binary_data WHERE binary_data_hash IN ( SELECT binary_data_hash FROM binary_data LEFT JOIN inputs ON binary_data_hash = inputs.datum_hash WHERE inputs.ext_output_reference IS NULL ORDER BY inputs.datum_hash LIMIT 50000 )"}}},"version":"v2.7.1+5c0c7a"}

@KtorZ
Member

KtorZ commented Dec 4, 2023

> Would it make sense to make the pruning increment size a command line option?

No.

I am surprised, however, by the times you're reporting. I've tested this on a dev machine, where it took less than 30s per batch with 1.5M entries to delete.

What are the specs of the machine you're running this on?

@Saghen

Saghen commented Dec 4, 2023

We're seeing similar behavior to Niels: garbage collection took 15 minutes and stalled the syncing. We're running with 2 cores, 2 GB RAM, and NVMe storage.

@KtorZ
Member

KtorZ commented Dec 4, 2023

As identified in another discussion, Kupo is somehow missing one crucial database index. It was probably lost in some refactoring and was not caught by tests.

So right now, you may want to run (offline):

sqlite3 your/working/dir/kupo.sqlite3 "CREATE INDEX IF NOT EXISTS inputsByDatumHash ON inputs(datum_hash)"

This should take a minute or so and solve the GC issue for now: a single GC increment should then take no more than a minute. Given that you probably have a lot of data to clean up from previously interrupted GC runs, it will take several minutes or hours to clear everything down, though synchronization should interleave with it without problem.
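For anyone who wants to verify that the index is in place and actually picked up, a quick read-only check with sqlite3; the DELETE statement is the one from the debug log above, and the exact query-plan wording varies between SQLite versions:

# List the indexes on the inputs table; inputsByDatumHash should now appear.
sqlite3 your/working/dir/kupo.sqlite3 ".indexes inputs"
# Ask SQLite how it would execute the pruning statement; with the index present,
# the plan should mention inputsByDatumHash instead of a full scan of inputs.
sqlite3 your/working/dir/kupo.sqlite3 "EXPLAIN QUERY PLAN DELETE FROM binary_data WHERE binary_data_hash IN ( SELECT binary_data_hash FROM binary_data LEFT JOIN inputs ON binary_data_hash = inputs.datum_hash WHERE inputs.ext_output_reference IS NULL ORDER BY inputs.datum_hash LIMIT 50000 )"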

@biofreeze

biofreeze commented Dec 30, 2023

Just to double-check: should we be running the following command manually each time we start Kupo with the --defer-db-indexes option?

sqlite3 your/working/dir/kupo.sqlite3 "CREATE INDEX IF NOT EXISTS inputsByDatumHash ON inputs(datum_hash)"

I noticed that if you start a sync with the --defer-db-indexes option, it will actually remove that index.

(Screenshot, 2023-12-30 11:24 AM, showing the index being removed after a restart.)

So I'm assuming we should be starting Kupo and then manually creating the index after it has started, each time? If it's necessary to speed up garbage collection, is there a way for this index not to be deleted when we restart Kupo with the --defer-db-indexes option?

@KtorZ
Member

KtorZ commented Dec 30, 2023

Nothing extra is needed since 2.7.2; the issue was fixed.
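For anyone still on a release older than 2.7.2, the earlier workaround applies: if a restart with --defer-db-indexes dropped the index, recreate it with the command already quoted above (a sketch only; it is idempotent thanks to IF NOT EXISTS):

# Recreate the GC index if a restart with --defer-db-indexes removed it.
sqlite3 your/working/dir/kupo.sqlite3 "CREATE INDEX IF NOT EXISTS inputsByDatumHash ON inputs(datum_hash)"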
