
Distinguish between spent and unspent matched inputs #20

Closed · KtorZ opened this issue May 17, 2022 · 21 comments · Fixed by #23

KtorZ commented May 17, 2022

Describe your idea, in simple words.

Kupo currently synchronizes and stores the entire chain history (matching the provided patterns). However, there are use-cases where users are only interested in the edge of the UTxO graph. See the discussion here: #19

A 2-step idea could be to:

  • Mark entries as spent or unspent as blocks are being scrutinized, and allow searching only matches that are spent or unspent.
  • Allow a flag to automatically prune spent entries from the database (both behaviors are sketched below).
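
To make the two behaviors concrete, here is a rough sketch of what marking vs. pruning would mean for the index. This is an illustration only (plain Python, hypothetical names), not Kupo's actual schema or implementation:

```python
from typing import Dict, Iterator, Optional, Tuple

OutputRef = Tuple[str, int]  # (transaction id, output index)

class Index:
    """Toy model of the matched-inputs index, supporting both behaviors."""

    def __init__(self, prune_utxo: bool = False):
        self.prune_utxo = prune_utxo
        # output reference -> slot at which it was spent (None = still unspent)
        self.entries: Dict[OutputRef, Optional[int]] = {}

    def add_match(self, ref: OutputRef) -> None:
        # A new output matching one of the configured patterns was found.
        self.entries[ref] = None

    def consume(self, ref: OutputRef, slot: int) -> None:
        # A block spends `ref`: either prune it, or merely mark it as spent.
        if ref not in self.entries:
            return
        if self.prune_utxo:
            del self.entries[ref]
        else:
            self.entries[ref] = slot

    def matches(self, status: Optional[str] = None) -> Iterator[Tuple[OutputRef, Optional[int]]]:
        # Search matches, optionally restricted to 'spent' or 'unspent'.
        for ref, spent_at in self.entries.items():
            if status == "unspent" and spent_at is not None:
                continue
            if status == "spent" and spent_at is None:
                continue
            yield ref, spent_at
```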

Why is it a good idea?

  • It enables client applications to detect spending of inputs and trigger some logic in response.
  • For clients only interested in the edge of the graph, it keeps the database relatively tiny, with only reasonable (if not null, depending on the patterns) growth.

Are you willing to work on it yourself?

Yes.

KtorZ added the enhancement label May 17, 2022
KtorZ self-assigned this May 17, 2022

KtorZ commented May 23, 2022

Some thoughts / refinement:

  • Some reference benchmark from a full testnet synchronization (--match *) before implementing this feature:

    | Revision | Pattern        | Syncing Time | Max Memory Usage | Database Size | Total entries |
    | -------- | -------------- | ------------ | ---------------- | ------------- | ------------- |
    | 5fcd314  | all            | 39min        | 680MB            | 4.1 GB        | 10 465 339    |
    | 5fcd314  | single address | 8min         | 680MB            | 184KB         | 469           |
  • This requires a database migration on the inputs table. While it would be possible to write a migration that migrates an existing index (by leveraging the node's state-query protocol to find out whether previously stored entries are spent or unspent), it would, in all likelihood, take more time than re-syncing the entire index from scratch.

  • Marking entries as spent or unspent means that the internal consumer process now needs to look at each and every transaction input (and collaterals, when applicable) and then adjust the index accordingly. Ideally, we can bundle the marking / deleting operation in the same db-transaction that adds new matches, though there may be cases where only marking / deleting is needed because none of the new outputs actually match.

  • One of the major reasons kupo is fast is that it limits the number of database operations performed. In particular, when defining a restrictive pattern, there's almost no database operation happening (except checkpoint storing). To alleviate this issue and only perform the necessary db operations, we need to maintain in memory some representation of the database's UTxO set, so that we can know without querying the database whether some UTxO in the database needs to be marked / deleted. On mainnet, there are about 40M transactions, so that's about the maximum number of elements that could end up in our data-structure. Beyond a certain number of entries, there's not much point trying to optimize anything because a database operation will likely be needed anyway. I see mainly two options:

    • (a) Use a balanced binary tree keyed on the first 8 bytes of each transaction id. This should be fairly collision-resistant, with log(n) insertion, deletion and search. Yet, the size is "unbounded" and grows linearly with the number of transactions (i.e. 8 bytes per transaction). While this is currently "only" ~300MB, it is already 300MB, plus the overhead of maintaining such a data-structure in memory.

    • (b) Use a bloom filter, which gives us a hard "not in the database" or a "maybe in the database" answer, with a false-positive probability that depends on parameters we choose. For example, with k = 2 and m = 2MB, we get <1% false positives for sets smaller than ~1M entries; it gets better as the set gets smaller (so quite restrictive patterns would be very efficient in this case). As the number of entries gets larger, the filter gets worse, but it is also less needed since db operations become a lot more probable.

(b) definitely sounds like the better solution, especially because it adds only a small memory overhead, remains quite efficient to construct, and would perform extremely well for small sets (i.e. for restrictive patterns), which is precisely where this sort of optimization comes in handy.
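
As a quick sanity check of those parameters, the usual Bloom-filter false-positive approximation p ≈ (1 - e^(-k*n/m))^k can be evaluated for a 2MB filter with k = 2. The snippet below is only a back-of-the-envelope estimate, not Kupo's implementation:

```python
import math

def false_positive_rate(k: int, m_bits: int, n: int) -> float:
    """Expected false-positive probability for a Bloom filter with
    k hash functions, m_bits bits and n inserted elements."""
    return (1 - math.exp(-k * n / m_bits)) ** k

m_bits = 2 * 1024 * 1024 * 8  # a 2MB filter
for n in (100_000, 500_000, 1_000_000, 5_000_000):
    p = false_positive_rate(k=2, m_bits=m_bits, n=n)
    print(f"n = {n:>9,} -> ~{p:.2%} false positives")
```

The rate drops sharply for small sets and degrades gracefully as the set grows, which matches the trade-off described above.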


bakon11 commented May 23, 2022

Maybe a flag option for users who don't want to have the checks done?

Absolutely love this process and am super excited about this feature.


KtorZ commented May 23, 2022

@bakon11 are you suggesting a flag different from the one I mentioned in the issue description? If I understand you correctly, you mean a flag to completely disable the spent / unspent marking.

If that's the case, then I would be quite opposed to it because of the inherent complexity / inconsistency it may introduce (for example, the DB schema would have to be modified for that feature, but the modification would be pointless / unused should the feature be disabled via a flag).


bakon11 commented May 23, 2022

Ah ha, I see. I got fixated on this portion here:

> One of the major reasons kupo is fast is that it limits the number of database operations performed.

In my head, I figured that if a user wanted a faster DB sync they could have the option for Kupo not to check for spent UTxOs and flag them; I completely misinterpreted the rest of the post.

> because of the inherent complexity / inconsistency it may introduce (the DB schema would have to be modified for that feature)

This makes a bit more sense now, thanks again for explaining. Personally, I think being able to keep track of spent UTxOs is a must-have feature, even if it means some performance sacrifice because of the extra operations performed.


KtorZ commented May 23, 2022

At this stage, I don't know what to expect performance-wise. It could well be that this addition does not impact the overall behavior much (hence the benchmark). But if it does, at least I have a plan.


KtorZ commented Jun 7, 2022

Here are some preliminary results, without any particular optimization beyond good practices:

| Option       | Pattern        | Syncing Time | Max Memory Usage | Database Size | Total entries | Total UTxO |
| ------------ | -------------- | ------------ | ---------------- | ------------- | ------------- | ---------- |
| None         | all            | 55min        | 680MB            | 4.3GB         | 10 465 221    | 2 310 842  |
| --prune-utxo | all            | 42min        | 680MB            | 1.5GB         | 2 310 839     | 2 310 839  |
| None         | single address | 10min        | 680MB            | 192KB         | 469           | 2          |
| --prune-utxo | single address | 9min         | 680MB            | 80KB          | 2             | 2          |

Still a few TODOs (in particular, making this available through the API), but it already looks promising without much further work needed. It's about 40% slower in the marking case, but that remains within an acceptable range. This version runs with and without a new flag --prune-utxo. When set, the index will automatically remove inputs that are spent. When not set, it will only mark them as spent. So both behaviors are possible.

One weird thing, however, is the discrepancy in the number of "unspent" entries between the two scenarios. The scenarios (and queries) are executed on the exact same time range, which is relatively far in the past (in the stable area), so I would expect them to be strictly equal 🤔 ... will investigate and add a few more tests.

@MartinSchere

@KtorZ Amazing work!

Do you have performance results (i.e. how much time the query takes) for a large UTxO query for an address?


bakon11 commented Jun 8, 2022

This is really amazing, it's literally the type of tool I was looking forward to.

Absolutely fantastic work, man. IMO a 55 min sync time for the non-pruned, match-all case is pretty fast and MORE than acceptable.


KtorZ commented Jun 8, 2022

@MartinSchere a "large" UTxO query on the testnet is a bit tricky. The most active address on the testnet is this one:

  • X = addr_test1vrd26p8d9dlknc4fhevzudcfzuul5qm2znquytugqcw583czzqrpm

It is the one I used for the tests above. It now has around ~600 entries, 72 of which are unspent at the time of writing.

With this, I get the following results (these include the whole e2e pipeline, including JSON serialization; this is timing curl commands against the API):

| Options       | Total Entries | address = X | address = X AND status = 'unspent' |
| ------------- | ------------- | ----------- | ----------------------------------- |
| --prune-utxo  | 2 310 839     | 11ms        | 11ms                                |
| none (marked) | 10 465 221    | 23ms        | 12ms                                |
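
For reference, these timings come from plain HTTP calls against the running instance. A rough equivalent of the two queries above is sketched below; the /v1/matches endpoint is the one shown later in this thread, while the ?unspent filter name is an assumption to double-check against the API reference:

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:1442"
ADDRESS = "addr_test1vrd26p8d9dlknc4fhevzudcfzuul5qm2znquytugqcw583czzqrpm"

def timed_query(path: str):
    # Fetch all matches for the given path and report (count, elapsed ms).
    start = time.monotonic()
    with urllib.request.urlopen(f"{BASE_URL}{path}") as response:
        matches = json.load(response)
    elapsed_ms = (time.monotonic() - start) * 1000
    return len(matches), elapsed_ms

print(timed_query(f"/v1/matches/{ADDRESS}"))           # address = X
print(timed_query(f"/v1/matches/{ADDRESS}?unspent"))   # address = X AND unspent (assumed flag)
```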


KtorZ commented Jun 8, 2022

I really don't understand what people have against SQLite. It's really an amazing tool 😄

@MartinSchere

@KtorZ Wow. It's fair to say it's an improvement compared to the 45 minutes it takes in db-sync.

Great work!


KtorZ commented Jun 8, 2022

45 min to query information about an address? For real 😳?

@MartinSchere

To query UTxOs, you need to query TxOuts, then join them with their respective addresses, then check every transaction record to see whether each TxOut has been spent or not, resulting in quadratic complexity.

On the positive side, it gives you time to reflect on life and grab a coffee


KtorZ commented Jun 8, 2022

But... a chain index should be optimized for querying UTxOs, because that is where the main added value is :|

Anyway. Okay. Whatever. I am trying to quit coffee though.

@MartinSchere

Yeah, I absolutely agree with you, and many of us are waiting for an indexer that does this (I'll use it for my explorer).

Let me know if I can contribute in some way.


KtorZ commented Jun 8, 2022

What would you miss from Kupo beyond this ticket and perhaps #21?

@MartinSchere

I think this is getting out of the scope of this issue. Mind continuing the conversation in Discord?


waalge commented Jun 8, 2022

I'd be interested in that conversation


KtorZ commented Jun 8, 2022

@waalge I'll create issues out of the conversation 👍


KtorZ commented Jun 11, 2022

@MartinSchere, back to your question:

do you have performance results (i.e. how much time the query takes) for a large UTxO query for an address?

Thanks to @fivebinaries trying things out on mainnet, I now have an answer. On the current master (29835e5), querying one of the largest addresses on mainnet (200k+ UTxOs):

```
# time curl http://localhost:1442/v1/matches/addr1w999n67e86jn6xal07pzxtrmqynspgx0fwmcmpua4wc6yzsxpljz3
real    0m5.925s
```

That's about 132MB of JSON streamed from the database. So I guess we can say that ~6s is an upper bound 😅
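
For clients that don't want to hold ~132MB of JSON in memory at once, the response can also be streamed, e.g. straight to disk or into an incremental JSON parser. A minimal sketch using only the Python standard library:

```python
import shutil
import urllib.request

URL = ("http://localhost:1442/v1/matches/"
       "addr1w999n67e86jn6xal07pzxtrmqynspgx0fwmcmpua4wc6yzsxpljz3")

# Copy the response body to disk in 1MB chunks instead of buffering it entirely.
with urllib.request.urlopen(URL) as response, open("matches.json", "wb") as out:
    shutil.copyfileobj(response, out, length=1024 * 1024)
```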

@MartinSchere

@KtorZ Thanks for the metric.

Guess what, that's jpg.store's contract
