Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-linear search for payment credentials by introducing virtual indices #58

Closed
nielstron opened this issue Aug 11, 2022 · 5 comments
Closed
Labels
enhancement New feature or request

Comments

@nielstron
Copy link
Contributor

nielstron commented Aug 11, 2022

Describe your idea, in simple words.

Currently, search the UTxO set for UTxOs by payment credential is slow (~2s on 3GB of data) due to it being performed as linear search by sqlite. We can improve on that.

Currently kupo serializes addresses internally like this

     ┏━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━┓
     ┃ tag │ delegation credentials │ header │ payment credentials ┃
     ┗━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━┛

which makes total sense if we only want to leverage SQLite indices for delegation credentials.

We can actually leverage sqlite indices even better and achieve better search performance on both payment and stake credentials like this:

  1. we switch back to this layout of addresses
     ┏━━━━━┯━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━┓
     ┃ tag │ header │payment credentials │  delegation credentials ┃
     ┗━━━━━┷━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━┛
  1. add a virtual column address_delegation_credentials as substr(address, 60)
  2. add an index on address_delegation_credentials
  3. query on delegation credentials on address_delegation_credentials = ? instead of a LIKE comparison

Note: we could even add a virtual column that only contains the payment credentials and add an index on that too.

Why is it a good idea?

In the case that users want to query the UTxO set by payment credentials this is useful. I know of at least two handy use cases of this.
This will dramatically speed up queries in this use case.

This also avoid the issue of memory duplication in #57 (the size of kupo.sqlite3 with this index after vacuum is 2.7GB vs 2.4GB. timings for queries TBD)

Are you willing to work on it yourself?

Yes

@KtorZ
Copy link
Member

KtorZ commented Aug 11, 2022

I like the idea 🤨. I am curious to see how this affects the behavior of indexes.
We can play around using EXPLAIN QUERY PLAN to see whether indexes are still being used.

@KtorZ KtorZ added the enhancement New feature or request label Aug 12, 2022
@nielstron
Copy link
Contributor Author

nielstron commented Aug 12, 2022

Findings from playing around with this:

EXPLAIN QUERY PLAN select * from inputs where address_dc = "..." / LIKE = "...%"
-> SEARCH TABLE inputs USING INDEX inputsByDelegation (address_dc=?)

Real query takes 5 ms.
It is important to index using "COLLATE" however.

@nielstron
Copy link
Contributor Author

Note that virtual columns also support more complex generators such as case substr(address, 1, 2) when "01" then substr(address, 60) else NULL end. This does not seem to affect query time but likely causes a (small) penalty on insertion time.

@KtorZ
Copy link
Member

KtorZ commented Aug 12, 2022

Real query takes 5 ms.

SEARCH is the keyword here. Regarding COLLATE, I guess that if you create the virtual column as COLLATE NOCASE already, you may not need to specify it in the index. In principle, you need either. Having said that, I don't know of the pros/cons of having it on the column vs in the index 🤷

@KtorZ
Copy link
Member

KtorZ commented Oct 8, 2022

Looking into this ☝️, actually I kept the column as it was an simply added the virtual column with a substr(address, -56); firstly because this avoir re-writing existing stuff and providing possibly heavy migrations for it and second, because it actually works nicely as such!

With an index on the column and... well:

Before

Server Software:        kupo
Server Hostname:        localhost
Server Port:            1442

Document Path:          /matches/e19375828f94f381bdf6dcee56debe02856b027e327d26992db9f8dc/*

Concurrency Level:      10
Complete requests:      100
Time per request:       9925.704 [ms] (mean)

Percentage of the requests served within a certain time (ms)
  50%   9982
  66%  10175
  75%  10446
  80%  10665
  90%  11176
  95%  11176
  98%  11177
  99%  11177
 100%  11177 (longest request)

After

Server Software:        kupo
Server Hostname:        localhost
Server Port:            1442

Document Path:          /matches/e19375828f94f381bdf6dcee56debe02856b027e327d26992db9f8dc/*

Concurrency Level:      10
Complete requests:      100
Time per request:       2195.870 [ms] (mean)

Percentage of the requests served within a certain time (ms)
  50%   2193
  66%   2268
  75%   2296
  80%   2307
  90%   2393
  95%   2517
  98%   2545
  99%   2550
 100%   2550 (longest request)

This being the busiest address on the testnet (125k entries). Neat.

@KtorZ KtorZ closed this as completed Oct 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants