
Most recent node tip of Kupo may get stuck #143

Closed
nielstron opened this issue Nov 1, 2023 · 11 comments

@nielstron
Contributor

nielstron commented Nov 1, 2023

What Git revision / release tag are you using?

2.7.0+41ae67

Describe what the problem is.

Same as #142. It is unfortunately not resolved by upgrading to Kupo 2.7.0.

$ curl localhost:1482/health
# TYPE kupo_connection_status gauge
kupo_connection_status  1.0
# TYPE kupo_most_recent_checkpoint counter
kupo_most_recent_checkpoint  107274767
# TYPE kupo_most_recent_node_tip counter
kupo_most_recent_node_tip  107274767
$ curl localhost:6666/health
# TYPE kupo_connection_status gauge
kupo_connection_status  1.0
# TYPE kupo_most_recent_checkpoint counter
kupo_most_recent_checkpoint  107257239
# TYPE kupo_most_recent_node_tip counter
kupo_most_recent_node_tip  107257239

What should be the expected behavior?

The most recent node tip reported by Kupo should always reflect the actual current tip of the underlying node.
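A quick way to check whether the reported tip keeps advancing is to poll the /health endpoint and watch successive values of kupo_most_recent_node_tip. A minimal sketch, assuming the same port (1482) and an arbitrary 30-second interval; the metric names are the ones shown in the output above:

# Print the reported node tip every 30 seconds; a value that stops increasing
# while the node itself keeps advancing indicates a stuck tip.
while true; do
  tip=$(curl -s localhost:1482/health | awk '/^kupo_most_recent_node_tip/ { print $2 }')
  echo "$(date -u +%H:%M:%S) kupo_most_recent_node_tip=$tip"
  sleep 30
done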

If applicable, what are the logs from the server around the occurrence of the problem?

The relevant log from Kupo:

Nov 01 07:49:51 ns366159 bash[2889517]: {"severity":"Warning","timestamp":"2023-11-01T07:49:51.965094591Z","thread":"379893","message":{"Database":{"tag":"DatabaseConnection","message":{"tag":"ConnectionBusy","attempts":5134,"retryingIn":0.1}}},"version":"v2.7.0+41ae67"}
Nov 01 07:49:51 ns366159 bash[2889517]: {"severity":"Warning","timestamp":"2023-11-01T07:49:51.995123761Z","thread":"380244","message":{"Database":{"tag":"DatabaseConnection","message":{"tag":"ConnectionBusy","attempts":1427,"retryingIn":0.25}}},"version":"v2.7.0+41ae67"}
Nov 01 07:49:52 ns366159 bash[2889517]: {"severity":"Warning","timestamp":"2023-11-01T07:49:52.066252366Z","thread":"379893","message":{"Database":{"tag":"DatabaseConnection","message":{"tag":"ConnectionBusy","attempts":5135,"retryingIn":0.1}}},"version":"v2.7.0+41ae67"}
Nov 01 07:49:52 ns366159 bash[2889517]: {"severity":"Warning","timestamp":"2023-11-01T07:49:52.160340692Z","thread":"379902","message":{"Database":{"tag":"DatabaseConnection","message":{"tag":"ConnectionBusy","attempts":2056,"retryingIn":0.25}}},"version":"v2.7.0+41ae67"}
Nov 01 07:49:52 ns366159 bash[2889517]: {"severity":"Warning","timestamp":"2023-11-01T07:49:52.16747226Z","thread":"379893","message":{"Database":{"tag":"DatabaseConnection","message":{"tag":"ConnectionBusy","attempts":5136,"retryingIn":0.1}}},"version":"v2.7.0+41ae67"}
Nov 01 07:49:52 ns366159 bash[2889517]: {"severity":"Warning","timestamp":"2023-11-01T07:49:52.245537436Z","thread":"380244","message":{"Database":{"tag":"DatabaseConnection","message":{"tag":"ConnectionBusy","attempts":1428,"retryingIn":0.25}}},"version":"v2.7.0+41ae67"}
Nov 01 07:49:52 ns366159 bash[2889517]: {"severity":"Warning","timestamp":"2023-11-01T07:49:52.268632358Z","thread":"379893","message":{"Database":{"tag":"DatabaseConnection","message":{"tag":"ConnectionBusy","attempts":5137,"retryingIn":0.1}}},"version":"v2.7.0+41ae67"}
Nov 01 07:49:52 ns366159 bash[2889517]: {"severity":"Notice","timestamp":"2023-11-01T07:49:52.305537426Z","thread":"14","message":{"Gardener":{"tag":"GardenerExitGarbageCollection","totalPrunedRows":147695}},"version":"v2.7.0+41ae67"}
Nov 01 07:49:52 ns366159 bash[2889517]: {"severity":"Notice","timestamp":"2023-11-01T07:49:52.305640449Z","thread":"14","message":{"Gardener":{"tag":"GardenerBeginGarbageCollection","pruning":"binary_data"}},"version":"v2.7.0+41ae67"}
Nov 01 07:49:55 ns366159 bash[2889517]: {"severity":"Notice","timestamp":"2023-11-01T07:49:55.888146796Z","thread":"12","message":{"Consumer":{"tag":"ConsumerRollBackward","point":107257204}},"version":"v2.7.0+41ae67"}

When this log was fetched and the tip difference observed, the current time was 12:15 UTC.

@KtorZ
Member

KtorZ commented Nov 3, 2023

Hey,

  1. Are you sure that your node is actually synchronized?
  2. Could you run Kupo with an additional --log-level debug?
  3. Are you able to query the instance for matches and other stored elements in this situation?
  4. Could you run sqlite3 your/working/dir/kupo.sqlite3 ".indexes" on your database and forward the result?

Some of your connections have an extremely high number of failed attempts, which is rather odd. Some are even failing just to open (250ms is the retry delay on open). It might be necessary to ALSO prune binary_data incrementally (incremental deletion only happens for rollbacks and inputs pruning so far) to avoid keeping the database busy for too long. The high --gc-interval and multiple unsuccessful pruning runs have likely led to this situation.
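For concreteness, a minimal sketch of what one incremental pruning batch looks like against the SQLite database. The DELETE statement is the one that shows up in the v2.7.1 debug log further down in this thread, bounded by a LIMIT so each transaction stays short; treat it as an illustration against an offline copy, not an official maintenance command:

# One bounded pruning batch: remove up to 50000 binary_data rows that are no longer
# referenced by any input, then release the write lock before the next batch.
sqlite3 your/working/dir/kupo.sqlite3 "DELETE FROM binary_data WHERE binary_data_hash IN ( SELECT binary_data_hash FROM binary_data LEFT JOIN inputs ON binary_data_hash = inputs.datum_hash WHERE inputs.ext_output_reference IS NULL ORDER BY inputs.datum_hash LIMIT 50000 )"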

@Saghen

Saghen commented Nov 20, 2023

We've noticed a similar issue on Kupo v2.7.0 where garbage collection on binary_data takes 10-15 minutes. We don't have any traffic coming to the node and it's fully synchronized. I'll grab the additional logging soon

Arguments: --node-socket /ipc/node.socket --node-config /config/cardano-node/config.json --host "0.0.0.0" --workdir /data/db --since origin --match "*/*" --prune-utxo

{"severity":"Notice","timestamp":"2023-11-20T22:48:47.895743118Z","thread":"13","message":{"Gardener":{"tag":"GardenerBeginGarbageCollection","pruning":"binary_data"}},"version":"v2.7.0+41ae67"}
{"severity":"Notice","timestamp":"2023-11-20T23:00:34.770557484Z","thread":"13","message":{"Gardener":{"tag":"GardenerExitGarbageCollection","totalPrunedRows":822}},"version":"v2.7.0+41ae67"}
# sqlite3 kupo.sqlite3 ".indexes"

inputsByAddress                 sqlite_autoindex_binary_data_1
inputsByCreatedAt               sqlite_autoindex_inputs_1
inputsByOutputReference         sqlite_autoindex_patterns_1
inputsByPaymentCredential       sqlite_autoindex_policies_1
inputsBySpentAt                 sqlite_autoindex_scripts_1
policiesByPolicyId

@Saghen

Saghen commented Nov 22, 2023

Here's the debug logging excluding health checks (via rg -v HttpRequest | rg -v HttpResponse... hopefully doesn't exclude anything important): https://paste.super.fish/p/9Y3mCS.txt

@KtorZ
Member

KtorZ commented Nov 22, 2023

Thanks a lot @Saghen; I am a bit overloaded lately, but I will look into it by next week at the latest. One last question: I take from the command above that you aren't running with a special gc-interval but are using the default one?

@Saghen

Saghen commented Nov 22, 2023

Sounds good! And yep, using the default interval

KtorZ closed this as completed in fff8c25 on Dec 3, 2023
KtorZ added a commit that referenced this issue on Dec 3, 2023
@nielstron
Contributor Author

nielstron commented Dec 4, 2023

Reporting that Kupo on mainnet is still choking on the pruning (now limited to 50k rows, running for approximately 20 minutes at this point). Would it make sense to make the pruning increment size a command-line option?

Dec 04 12:57:12 ns366159 bash[2614907]: {"severity":"Debug","timestamp":"2023-12-04T12:57:12.959016276Z","thread":"13","message":{"Database":{"tag":"DatabaseConnection","message":{"tag":"ConnectionBeginQuery","beginQuery":" DELETE FROM binary_data WHERE binary_data_hash IN ( SELECT binary_data_hash FROM binary_data LEFT JOIN inputs ON binary_data_hash = inputs.datum_hash WHERE inputs.ext_output_reference IS NULL ORDER BY inputs.datum_hash LIMIT 50000 )"}}},"version":"v2.7.1+5c0c7a"}

@KtorZ
Member

KtorZ commented Dec 4, 2023

> Would it make sense to make the pruning increment size a command line option?

No.

I am surprised, however, by the times you're reporting. I've tested this on a dev machine, where it took less than 30s per batch with 1.5M entries to delete.

What are the specs of the machine you're running this on?

@Saghen

Saghen commented Dec 4, 2023

We're seeing similar behavior to Niels: garbage collection took 15 minutes and stalled the syncing. We're running with 2 cores, 2 GB RAM, and NVMe storage.

@KtorZ
Member

KtorZ commented Dec 4, 2023

As identified in another discussion, Kupo is somehow missing one crucial database index. It was probably lost in some refactoring and was not caught by tests.

So right now, you may want to run (offline):

sqlite3 your/working/dir/kupo.sqlite3 "CREATE INDEX IF NOT EXISTS inputsByDatumHash ON inputs(datum_hash)"

This should take a minute or so and solve the GC issue for now: a single GC increment should then take no more than a minute. Given that you probably have a lot of data to clean up from previously interrupted GC runs, it will take several minutes or hours to clear everything down, though synchronization should interleave with it without problem.
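For anyone who wants to verify that the index is in place and actually picked up, a quick read-only check with sqlite3; the DELETE statement is the one from the debug log above, and the exact query-plan wording varies between SQLite versions:

# List the indexes on the inputs table; inputsByDatumHash should now appear.
sqlite3 your/working/dir/kupo.sqlite3 ".indexes inputs"
# Ask SQLite how it would execute the pruning statement; with the index present,
# the plan should mention inputsByDatumHash instead of a full scan of inputs.
sqlite3 your/working/dir/kupo.sqlite3 "EXPLAIN QUERY PLAN DELETE FROM binary_data WHERE binary_data_hash IN ( SELECT binary_data_hash FROM binary_data LEFT JOIN inputs ON binary_data_hash = inputs.datum_hash WHERE inputs.ext_output_reference IS NULL ORDER BY inputs.datum_hash LIMIT 50000 )"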

@biofreeze

biofreeze commented Dec 30, 2023

Just to double-check: should we be running the following command manually each time we start Kupo with the --defer-db-indexes option?

sqlite3 your/working/dir/kupo.sqlite3 "CREATE INDEX IF NOT EXISTS inputsByDatumHash ON inputs(datum_hash)"

I noticed that if you start a sync with the --defer-db-indexes option, it will actually remove that index.

(Screenshot, 2023-12-30 11:24 AM, showing the index being removed after a restart.)

So I'm assuming we should be starting Kupo and then manually creating the index after it has started, each time? If it's necessary to speed up garbage collection, is there a way for this index not to be deleted when we restart Kupo with the --defer-db-indexes option?

@KtorZ
Member

KtorZ commented Dec 30, 2023

Nothing extra is needed since 2.7.2; the issue was fixed.
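For anyone still on a release older than 2.7.2, the earlier workaround applies: if a restart with --defer-db-indexes dropped the index, recreate it with the command already quoted above (a sketch only; it is idempotent thanks to IF NOT EXISTS):

# Recreate the GC index if a restart with --defer-db-indexes removed it.
sqlite3 your/working/dir/kupo.sqlite3 "CREATE INDEX IF NOT EXISTS inputsByDatumHash ON inputs(datum_hash)"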
