
Distinguish between spent and unspent matched inputs #20

Closed · KtorZ opened this issue May 17, 2022 · 21 comments · Fixed by #23

KtorZ commented May 17, 2022

Describe your idea, in simple words.

Kupo currently synchronizes and stores the entire chain history (matching the provided patterns). However, there are use-cases where users are only interested in the edge of the UTxO graph. See the discussion here: #19

A 2-step idea could be to:

  • Mark entries as spent or unspent as blocks are being scrutinized, and allow searching only matches that are spent or unspent.
  • Allow a flag to automatically prune spent entries from the database (both behaviors are sketched below).
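
To make the two behaviors concrete, here is a rough sketch of what marking vs. pruning would mean for the index. This is an illustration only (plain Python, hypothetical names), not Kupo's actual schema or implementation:

```python
from typing import Dict, Iterator, Optional, Tuple

OutputRef = Tuple[str, int]  # (transaction id, output index)

class Index:
    """Toy model of the matched-inputs index, supporting both behaviors."""

    def __init__(self, prune_utxo: bool = False):
        self.prune_utxo = prune_utxo
        # output reference -> slot at which it was spent (None = still unspent)
        self.entries: Dict[OutputRef, Optional[int]] = {}

    def add_match(self, ref: OutputRef) -> None:
        # A new output matching one of the configured patterns was found.
        self.entries[ref] = None

    def consume(self, ref: OutputRef, slot: int) -> None:
        # A block spends `ref`: either prune it, or merely mark it as spent.
        if ref not in self.entries:
            return
        if self.prune_utxo:
            del self.entries[ref]
        else:
            self.entries[ref] = slot

    def matches(self, status: Optional[str] = None) -> Iterator[Tuple[OutputRef, Optional[int]]]:
        # Search matches, optionally restricted to 'spent' or 'unspent'.
        for ref, spent_at in self.entries.items():
            if status == "unspent" and spent_at is not None:
                continue
            if status == "spent" and spent_at is None:
                continue
            yield ref, spent_at
```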

Why is it a good idea?

  • It enables client applications to detect spending of inputs and trigger some logic in response.
  • For clients only interested in the edge of the graph, it keeps the database relatively tiny, with only reasonable (if not null, depending on the patterns) growth.

Are you willing to work on it yourself?

Yes.

KtorZ added the enhancement label May 17, 2022
KtorZ self-assigned this May 17, 2022

KtorZ commented May 23, 2022

Some thoughts / refinement:

  • Some reference benchmark from a full testnet synchronization (--match *) before implementing this feature:

    | Revision | Pattern        | Syncing Time | Max Memory Usage | Database Size | Total entries |
    | -------- | -------------- | ------------ | ---------------- | ------------- | ------------- |
    | 5fcd314  | all            | 39min        | 680MB            | 4.1 GB        | 10 465 339    |
    | 5fcd314  | single address | 8min         | 680MB            | 184KB         | 469           |
  • This requires a database migration on the inputs table. While it would be possible to write a migration that migrates an existing index (by leveraging the node's state-query protocol to find out whether previously stored entries are spent or unspent), it would, in all likelihood, take more time than re-syncing the entire index from scratch.

  • Marking entries as spent or unspent means that the internal consumer process now needs to look at each and every transaction input (and collaterals, when applicable) and then adjust the index accordingly. Ideally, we can bundle the marking / deleting operation in the same db-transaction that adds new matches, though there may be cases where only marking / deleting is needed because none of the new outputs actually match.

  • One of the major reasons kupo is fast is that it limits the number of database operations performed. In particular, when defining a restrictive pattern, there's almost no database operation happening (except checkpoint storing). To alleviate this issue and only perform the necessary db operations, we need to maintain in memory some representation of the database's UTxO set, so that we can know without querying the database whether some UTxO in the database needs to be marked / deleted. On mainnet, there are about 40M transactions, so that's about the maximum number of elements that could end up in our data-structure. Beyond a certain number of entries, there's not much point trying to optimize anything because a database operation will likely be needed anyway. I see mainly two options:

    • (a) Use a balanced binary tree keyed on the first 8 bytes of each transaction id. This should be fairly collision-resistant, with log(n) insertion, deletion and search. Yet, the size is "unbounded" and grows linearly with the number of transactions (i.e. 8 bytes per transaction). While this is currently "only" ~300MB, it is already 300MB, plus the overhead of maintaining such a data-structure in memory.

    • (b) Use a bloom filter, which gives us a hard "not in the database" or a "maybe in the database" answer, with a false-positive probability that depends on parameters we choose. For example, with k = 2 and m = 2MB, we get <1% false positives for sets smaller than ~1M entries; it gets better as the set gets smaller (so quite restrictive patterns would be very efficient in this case). As the number of entries gets larger, the filter gets worse, but it is also less needed since db operations become a lot more probable.

(b) definitely sounds like the better solution, especially because it adds only a small memory overhead, remains quite efficient to construct, and would perform extremely well for small sets (i.e. for restrictive patterns), which is precisely where this sort of optimization comes in handy.
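
As a quick sanity check of those parameters, the usual Bloom-filter false-positive approximation p ≈ (1 - e^(-k*n/m))^k can be evaluated for a 2MB filter with k = 2. The snippet below is only a back-of-the-envelope estimate, not Kupo's implementation:

```python
import math

def false_positive_rate(k: int, m_bits: int, n: int) -> float:
    """Expected false-positive probability for a Bloom filter with
    k hash functions, m_bits bits and n inserted elements."""
    return (1 - math.exp(-k * n / m_bits)) ** k

m_bits = 2 * 1024 * 1024 * 8  # a 2MB filter
for n in (100_000, 500_000, 1_000_000, 5_000_000):
    p = false_positive_rate(k=2, m_bits=m_bits, n=n)
    print(f"n = {n:>9,} -> ~{p:.2%} false positives")
```

The rate drops sharply for small sets and degrades gracefully as the set grows, which matches the trade-off described above.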


bakon11 commented May 23, 2022

Maybe a flag option for users who don't want to have the checks done?

Absolutely love this process and am super excited about this feature.


KtorZ commented May 23, 2022

@bakon11 are you suggesting a flag different from the one I mentioned in the issue description? If I understand you correctly, you mean a flag to completely disable the spent / unspent marking.

If that's the case, then I would be quite opposed to it because of the inherent complexity / inconsistency it may introduce (for example, the DB schema would have to be modified for that feature, but the modification would be pointless / unused should the feature be disabled via a flag).


bakon11 commented May 23, 2022

Ah ha, I see. I got fixated on this portion here:

> One of the major reasons kupo is fast is that it limits the number of database operations performed.

In my head, I figured that if a user wanted a faster DB sync they could have the option for Kupo not to check for spent UTxOs and flag them; I completely misinterpreted the rest of the post.

> because of the inherent complexity / inconsistency it may introduce (the DB schema would have to be modified for that feature)

This makes a bit more sense now, thanks again for explaining. Personally, I think being able to keep track of spent UTxOs is a must-have feature, even if it means some performance sacrifice because of the extra operations performed.


KtorZ commented May 23, 2022

At this stage, I don't know what to expect performance-wise. It could well be that this addition does not impact the overall behavior much (hence the benchmark). But if it does, at least I have a plan.


KtorZ commented Jun 7, 2022

Here are some preliminary results, without any particular optimization beyond good practices:

| Option       | Pattern        | Syncing Time | Max Memory Usage | Database Size | Total entries | Total UTxO |
| ------------ | -------------- | ------------ | ---------------- | ------------- | ------------- | ---------- |
| None         | all            | 55min        | 680MB            | 4.3GB         | 10 465 221    | 2 310 842  |
| --prune-utxo | all            | 42min        | 680MB            | 1.5GB         | 2 310 839     | 2 310 839  |
| None         | single address | 10min        | 680MB            | 192KB         | 469           | 2          |
| --prune-utxo | single address | 9min         | 680MB            | 80KB          | 2             | 2          |

Still a few TODOs (in particular, making this available through the API), but it already looks promising without much further work needed. It's about 40% slower in the marking case, but that remains within an acceptable range. This version runs with and without a new flag --prune-utxo. When set, the index will automatically remove inputs that are spent. When not set, it will only mark them as spent. So both behaviors are possible.

One weird thing, however, is the discrepancy in the number of "unspent" entries between the two scenarios. The scenarios (and queries) are executed on the exact same time range, which is relatively far in the past (in the stable area), so I would expect them to be strictly equal 🤔 ... will investigate and add a few more tests.

@MartinSchere

@KtorZ Amazing work!

Do you have performance results (i.e. how much time the query takes) for a large UTxO query for an address?


bakon11 commented Jun 8, 2022

This is really amazing, it's literally the type of tool I was looking forward to.

Absolutely fantastic work, man. IMO a 55 min sync time for the non-pruned, match-all case is pretty fast and MORE than acceptable.


KtorZ commented Jun 8, 2022

@MartinSchere a "large" UTxO query on the testnet is a bit tricky. The most active address on the testnet is this one:

  • X = addr_test1vrd26p8d9dlknc4fhevzudcfzuul5qm2znquytugqcw583czzqrpm

It is the one I used for the tests above. It now has around ~600 entries, 72 of which are unspent at the time of writing.

With this, I get the following results (these include the whole e2e pipeline, including JSON serialization; this is timing curl commands against the API):

| Options       | Total Entries | address = X | address = X AND status = 'unspent' |
| ------------- | ------------- | ----------- | ----------------------------------- |
| --prune-utxo  | 2 310 839     | 11ms        | 11ms                                |
| none (marked) | 10 465 221    | 23ms        | 12ms                                |
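
For reference, these timings come from plain HTTP calls against the running instance. A rough equivalent of the two queries above is sketched below; the /v1/matches endpoint is the one shown later in this thread, while the ?unspent filter name is an assumption to double-check against the API reference:

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:1442"
ADDRESS = "addr_test1vrd26p8d9dlknc4fhevzudcfzuul5qm2znquytugqcw583czzqrpm"

def timed_query(path: str):
    # Fetch all matches for the given path and report (count, elapsed ms).
    start = time.monotonic()
    with urllib.request.urlopen(f"{BASE_URL}{path}") as response:
        matches = json.load(response)
    elapsed_ms = (time.monotonic() - start) * 1000
    return len(matches), elapsed_ms

print(timed_query(f"/v1/matches/{ADDRESS}"))           # address = X
print(timed_query(f"/v1/matches/{ADDRESS}?unspent"))   # address = X AND unspent (assumed flag)
```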


KtorZ commented Jun 8, 2022

I really don't understand what people have against SQLite. It's really an amazing tool 😄

@MartinSchere

@KtorZ Wow. It's fair to say it's an improvement compared to the 45 minutes it takes in db-sync.

Great work!


KtorZ commented Jun 8, 2022

45 min to query information about an address? For real 😳?

@MartinSchere

To query UTxOs, you need to query TxOuts, then join them with their respective addresses, then check every transaction record to see whether each TxOut has been spent or not, resulting in quadratic complexity.

On the positive side, it gives you time to reflect on life and grab a coffee


KtorZ commented Jun 8, 2022

But... a chain index should be optimized for querying UTxOs, because that is where the main added value is :|

Anyway. Okay. Whatever. I am trying to quit coffee though.

@MartinSchere

Yeah, I absolutely agree with you, and many of us are waiting for an indexer that does this (I'll use it for my explorer).

Let me know if I can contribute in some way.


KtorZ commented Jun 8, 2022

What would you miss from Kupo beyond this ticket and perhaps #21?

@MartinSchere

I think this is getting out of the scope of this issue. Mind continuing the conversation in Discord?


waalge commented Jun 8, 2022

I'd be interested in that conversation


KtorZ commented Jun 8, 2022

@waalge I'll create issues out of the conversation 👍


KtorZ commented Jun 11, 2022

@MartinSchere, back to your question:

do you have performance results (i.e. how much time the query takes) for a large UTxO query for an address?

Thanks to @fivebinaries trying things out on mainnet, I now have an answer. On the current master (29835e5), querying one of the largest addresses on mainnet (200k+ UTxOs):

```
# time curl http://localhost:1442/v1/matches/addr1w999n67e86jn6xal07pzxtrmqynspgx0fwmcmpua4wc6yzsxpljz3
real    0m5.925s
```

That's about 132MB of JSON streamed from the database. So I guess we can say that ~6s is an upper bound 😅
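
For clients that don't want to hold ~132MB of JSON in memory at once, the response can also be streamed, e.g. straight to disk or into an incremental JSON parser. A minimal sketch using only the Python standard library:

```python
import shutil
import urllib.request

URL = ("http://localhost:1442/v1/matches/"
       "addr1w999n67e86jn6xal07pzxtrmqynspgx0fwmcmpua4wc6yzsxpljz3")

# Copy the response body to disk in 1MB chunks instead of buffering it entirely.
with urllib.request.urlopen(URL) as response, open("matches.json", "wb") as out:
    shutil.copyfileobj(response, out, length=1024 * 1024)
```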

@MartinSchere

@KtorZ Thanks for the metric.

Guess what, that's jpg.store's contract
