
On performance and memory usage #1

Open
dpc opened this issue Apr 27, 2021 · 25 comments
Comments

dpc commented Apr 27, 2021

I'm moving the discussion from rust-bitcoin/rust-bitcoin#595 here, since it looks like it will be ongoing, people might find it useful, and it feels like we've been off topic there for a long time now.


dpc commented Apr 27, 2021

Is there any reason to use fxhash instead of just taking some bytes of the txhash? The txhash is already uniform and "random", no? Is this to protect against people crafting collisions on purpose? But then they could just as well craft them against fxhash, since it doesn't seem to be randomized.

The main reason to use another hash is the vout, because it's not uniformly distributed in the search space. For example, if you take 6 bytes from the txhash and only 2 for the vout (a u16, because the historical max vout is 30000), you will end up with one byte that is almost always 0, increasing the chance of collisions.
It's fxhash just because it's faster than the standard siphash, especially with longish keys. And yeah, if someone crafts transactions to create collisions on this iterator, I can switch to siphash.
Also, note that I am doing the hash externally and then using a passthrough hasher, which basically skips hashing inside the hashmap, because the key is already uniformly distributed.
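
A minimal sketch of that scheme, assuming the fxhash crate; the `PassthroughHasher` name and the key layout are illustrative, not the crate's actual code:

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

/// A hasher that passes the (already uniform) u64 key through unchanged,
/// so the map does no hashing of its own.
#[derive(Default)]
struct PassthroughHasher(u64);

impl Hasher for PassthroughHasher {
    fn finish(&self) -> u64 { self.0 }
    fn write(&mut self, _bytes: &[u8]) { unreachable!("only u64 keys expected") }
    fn write_u64(&mut self, n: u64) { self.0 = n; }
}

/// The key is the fxhash of (txid, vout), computed once, outside the map.
fn utxo_key(txid: &[u8; 32], vout: u32) -> u64 {
    let mut h = fxhash::FxHasher::default();
    h.write(txid);
    h.write_u32(vout);
    h.finish()
}

type UtxoMap<V> = HashMap<u64, V, BuildHasherDefault<PassthroughHasher>>;
```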

Oh, I see. That's better than what I've been doing then.

I'm wondering if access to this map will have to be serialized, due to the potential for race conditions etc.

Exactly, I can't wait to try something like this https://stackoverflow.com/a/65038724/1113302 :)

There are a bunch of concurrent hashmaps on crates.io. Maybe one of them would fit the bill.

FYI I did a couple of tests:

  • branch testser, using Vec<u8> = serialize(Vec<Option<TxOut>>): on testnet it uses about half the memory (~5GB, down from ~10GB), however it's about 3 times slower, and even more so on mainnet
  • branch script_reuse, using your idea of keeping a reference to scripts: on mainnet it saves about 10% of memory but is about 10% slower. Moreover, as I have implemented it, a script collision would result in getting the wrong prevout scripts.

In the end I decided neither of the two approaches is worth pursuing.

Thanks for letting me know!

The only one left that I know of that is worth it is an LRU cache. But it requires -txindex (methinks). Any missing output is just read from the fullnode, which is slower, but it doesn't happen too often because most outputs are spent quite soon after they have been created.

A great tradeoff between perf and memory usage. It also puts a strict ceiling on memory usage, which gives the end user the ability to fine-tune it to their system. You have a small board with 2GB? You can size the LRU to fit exactly that, and use the tool as fast as it can go with those resources.

Also, with that additional code (to get outputs straight from the node), it would allow block_iterator to be restartable, or to start from a fixed height, which is very useful on its own.
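
A minimal sketch of that LRU-with-node-fallback idea, assuming the lru crate; `fetch_from_node` is a hypothetical stand-in for an RPC lookup (which is why -txindex would be needed):

```rust
use std::num::NonZeroUsize;
use lru::LruCache; // assumption: the `lru` crate

type Key = ([u8; 32], u32); // (txid, vout)
struct TxOutEntry { value: u64, script: Vec<u8> }

/// UTXO store with a hard memory ceiling: at most `max_entries` outputs
/// stay in memory; anything evicted is re-fetched from the full node.
struct CappedUtxoStore { cache: LruCache<Key, TxOutEntry> }

impl CappedUtxoStore {
    fn new(max_entries: usize) -> Self {
        Self { cache: LruCache::new(NonZeroUsize::new(max_entries).unwrap()) }
    }

    fn insert(&mut self, key: Key, out: TxOutEntry) {
        // `put` evicts the least-recently-used entry when full,
        // which is what caps the memory usage.
        self.cache.put(key, out);
    }

    fn spend(&mut self, key: Key) -> TxOutEntry {
        match self.cache.pop(&key) {
            // Common case: outputs are spent soon after creation.
            Some(hit) => hit,
            // Slow path: ask the node (e.g. getrawtransaction, needs -txindex).
            None => fetch_from_node(&key),
        }
    }
}

fn fetch_from_node(_key: &Key) -> TxOutEntry {
    unimplemented!("query the full node over RPC")
}
```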

dpc changed the title from "One performance and memory usage of tracking UTXO set" to "One performance and memory usage" on Apr 27, 2021
dpc changed the title from "One performance and memory usage" to "On performance and memory usage" on Apr 27, 2021

dpc commented Apr 27, 2021

Oh, I see that you're using block file parsing to get the block data. Most bitcoin-related projects do, and I never got around to benchmarking the difference. In rust-bitcoin-indexer I've been just using JsonRPC to request hex-encoded blocks, and my bottleneck was always writing to the database, so I didn't worry about it much. On SSDs you might want to try to parallelize this code: at the very least, read from storage and decode in parallel, but modern storage is often so fast that it benefits from running concurrent reads (obviously not great for HDDs). In rust-bitcoin-indexer I just had N workers doing everything in parallel for the next N blocks in sequence, and then an iterator that would put it all back in order: https://github.com/dpc/rust-bitcoin-indexer/blob/e859ead4bef62661940097c8a87d7fcb39039f30/src/node/prefetcher.rs#L290
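
A condensed sketch of that prefetcher pattern (not the linked implementation): N workers fetch and decode blocks in parallel, and the consumer restores height order; `fetch_block` is a hypothetical stand-in for a JSON-RPC getblock call:

```rust
use std::collections::BTreeMap;
use std::sync::mpsc;
use std::thread;

fn fetch_block(height: u64) -> Vec<u8> {
    unimplemented!("getblockhash + getblock over JSON-RPC, then hex-decode")
}

fn ordered_blocks(tip: u64, workers: u64) -> impl Iterator<Item = (u64, Vec<u8>)> {
    let (tx, rx) = mpsc::sync_channel(workers as usize);
    for w in 0..workers {
        let tx = tx.clone();
        // Each worker takes every N-th height, so fetches overlap in time.
        thread::spawn(move || {
            let mut h = w;
            while h <= tip {
                if tx.send((h, fetch_block(h))).is_err() { return; }
                h += workers;
            }
        });
    }
    drop(tx);
    // Out-of-order arrivals wait in a buffer until the next height shows up.
    let mut pending = BTreeMap::new();
    let mut next = 0u64;
    std::iter::from_fn(move || loop {
        if let Some(block) = pending.remove(&next) {
            next += 1;
            return Some((next - 1, block));
        }
        match rx.recv() {
            Ok((h, block)) => { pending.insert(h, block); }
            Err(_) => return None, // all workers done and buffer drained
        }
    })
}
```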

You would probably benefit from good user-accessible benchmark time tracking. First for your own tuning, and then for users to be able to find the bottleneck on their setup and tune according to the results. If the bottleneck is reading the data, then a 10% hit to the UTXO tracking performance might not matter at all.

RCasatta commented

Oh, I see that you're using block file parsing to get the block data. Most bitcoin-related projects do, and I never got around to benchmarking the difference. In rust-bitcoin-indexer I've been just using JsonRPC to request hex-encoded blocks, and my bottleneck was always writing to the database

I know electrs is faster when you configure it to read the blocks*.dat files, and my experience with RPC is that it is not super fast, but I didn't bench it.

my bottleneck was always writing to the database

Yes, if that is the bottleneck it makes sense to use RPC; for blocks_iterator the bottleneck is the previous_output computation.

You would probably benefit from good user-accessible benchmark time tracking. First for your own tuning, and then for users to be able to find the bottleneck on their setup and tune according to the results. If the bottleneck is reading the data, then a 10% hit to the UTXO tracking performance might not matter at all.

Any idea on how to do this? Maybe a GitHub Actions job with a Docker image loaded with testnet blocks*.dat files?


dpc commented Apr 29, 2021

my experience with RPC is that it is not super fast, but I didn't bench it.

It's important that the RPC is queried with some parallelism to get rid of the latency overhead; the node itself has to do the same IO on the same files, and then encoding to hex, sending over, and decoding hex are all kind of cheap. It seems to me that all the overheads tend not to matter as long as the IO stays 100% saturated. If one puts the node on another box, the RPC approach might become faster, as the load just gets spread across two boxes.

for blocks_iterator the bottleneck is previous_output computation

I'm really surprised. That would mean it's CPU/memory-bandwidth bound. I mean, the memory usage is quite big, but adding and looking up stuff in the hashmap should be reaaaaaally fast.

Any idea on how to do this? Maybe a github actions with a docker loaded with testnet blocks*.dat

My favorite trick, which I use in some pieces of Rust code, is to wrap the Sender and Receiver of every channel in a newtype that wraps recv and send with timing code, and measure how long they block. There is all sorts of stuff one can do with that. E.g. I take the duration, saturating-subtract some tiny value (around 1ms, to account for the fixed overhead of the operation itself) to get the time it was blocked, then keep accumulating how long was spent in it, and every time it crosses a certain threshold (10s) there's a log line that prints "blocked on <recv/send> of ". This can be gated by cmdline flags or the RUST_LOG env var or something. In a more enterprise setup I'd plug this into prometheus.

But generally that's it: measuring crucial parts and then exposing the results to the user somehow.
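
A minimal sketch of the receiving half of that trick, using std's mpsc; the names and thresholds are illustrative:

```rust
use std::sync::mpsc::{Receiver, RecvError};
use std::time::{Duration, Instant};

/// Wraps a Receiver, accumulating the time spent blocked in `recv` and
/// logging whenever the accumulated blocked time crosses a threshold.
struct TimedReceiver<T> {
    inner: Receiver<T>,
    name: &'static str,
    blocked: Duration,
    reported: Duration,
}

impl<T> TimedReceiver<T> {
    fn new(inner: Receiver<T>, name: &'static str) -> Self {
        Self { inner, name, blocked: Duration::ZERO, reported: Duration::ZERO }
    }

    fn recv(&mut self) -> Result<T, RecvError> {
        let start = Instant::now();
        let res = self.inner.recv();
        // Subtract ~1ms for the fixed cost of the call itself, saturating at zero.
        self.blocked += start.elapsed().saturating_sub(Duration::from_millis(1));
        if self.blocked - self.reported > Duration::from_secs(10) {
            self.reported = self.blocked;
            eprintln!("blocked on recv of {}: {:?} total", self.name, self.blocked);
        }
        res
    }
}
```

The Sender side gets the same treatment around `send`.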


dpc commented Apr 29, 2021

BTW, channels of size 200 are, in my experience, just wasting memory. Eventually one stage becomes a permanent bottleneck one way or another, and all the other channels are then always full/empty. The channel size is there purely to amortize variance, so in my experience channels of size 2 are already plenty, and I usually use 1 or even a rendezvous one.
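
For illustration, with std's bounded channels (Block is a placeholder payload type) the sizes suggested above look like this:

```rust
use std::sync::mpsc::sync_channel;

type Block = Vec<u8>; // placeholder payload

fn main() {
    // Capacity 2: enough buffering to amortize variance between stages.
    let (_tx2, _rx2) = sync_channel::<Block>(2);
    // Capacity 0: a rendezvous channel; `send` blocks until a `recv`
    // is ready, so nothing is ever buffered.
    let (_tx0, _rx0) = sync_channel::<Block>(0);
}
```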

RCasatta commented

My favorite trick...

lol, here I did exactly the opposite: measuring the busy time as the time the thread spends not blocked on recv.

I'm really surprised. That would mean it's CPU/memory-bandwidth bound. I mean, the memory usage is quite big, but adding and looking up stuff in the hashmap should be reaaaaaally fast.

If you check the busy times, on my machine the fee thread is the one that takes the longest. The reason is that there are many insertions into the map (the number of outputs in a block) and lookups (the number of inputs in a block). Lookups are made on 2 maps, because the key-truncation trick comes at the expense of checking the non-truncated map first... Also, the map is pretty big, so the CPU cache effect becomes less powerful.

BTW, channels of size 200 are, in my experience, just wasting memory. Eventually one stage becomes a permanent bottleneck one way or another, and all the other channels are then always full/empty. The channel size is there purely to amortize variance, so in my experience channels of size 2 are already plenty, and I usually use 1 or even a rendezvous one.

I stayed a bit larger because I imagined thread sleep/wake-up could cause some issues, like the bottleneck thread finding its input channel empty because the producing thread is taking some time to wake up. Need to experiment with this.


RCasatta commented Apr 30, 2021

Finally, I am very happy with the last 2 improvements: they decreased the memory needed a lot (see the readme for the stats table).

In particular, the most obvious but shamefully missing one is skipping provably unspendable scripts: 547f9d2

The other one is putting the most common scripts directly on the stack: f92857e and a1c9d65
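
A minimal sketch of both ideas (not the linked commits), assuming rust-bitcoin's Script::is_provably_unspendable and a hypothetical inline capacity of 36 bytes (enough for the common P2PKH/P2SH/P2WPKH/P2WSH templates):

```rust
use bitcoin::Script; // assumption: the rust-bitcoin crate

/// Common output scripts are short, so they can live inline on the
/// stack instead of in a heap-allocated Vec.
enum StoredScript {
    Inline { len: u8, bytes: [u8; 36] },
    Heap(Vec<u8>),
}

/// Returns None for outputs that should never enter the UTXO map.
fn store(script: &Script) -> Option<StoredScript> {
    // Provably unspendable outputs (e.g. OP_RETURN) can never be spent,
    // so tracking them only wastes memory.
    if script.is_provably_unspendable() {
        return None;
    }
    let b = script.as_bytes();
    Some(if b.len() <= 36 {
        let mut bytes = [0u8; 36];
        bytes[..b.len()].copy_from_slice(b);
        StoredScript::Inline { len: b.len() as u8, bytes }
    } else {
        StoredScript::Heap(b.to_vec())
    })
}
```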


dpc commented Apr 30, 2021

lol, here I did exactly the opposite, measuring the busy time as the time the thread spends non-blocked on recv.

They are both useful. Logically, each step in a pipeline is "recv", "process", "send", and they should all add up to the total runtime. Oftentimes it's worthwhile to measure sub-parts of "process" as well.

Also, the map is pretty big so the CPU cache effect becomes less powerful.

I was thinking about what you said about using fxhash to minimize collisions due to the vout, and something tells me that just taking a txid prefix and XORing the last byte(s) with the vout is going to be good enough, and significantly faster even than the already fast fxhash. If that's your bottleneck, I think it's worth giving it a try.
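
A minimal sketch of that key scheme, under the assumption that txids are already uniformly distributed; it XORs the vout into the low bytes of the prefix (a slight generalization of "the last byte", since the vout can exceed 255):

```rust
/// Build a 64-bit map key from a txid prefix and the vout, with no
/// hash function at all: txid uniformity keeps the key uniform.
fn fast_key(txid: &[u8; 32], vout: u32) -> u64 {
    let prefix = u64::from_le_bytes(txid[..8].try_into().unwrap());
    prefix ^ u64::from(vout)
}
```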

I imagined thread sleep/wake up could make some issues, like the bottleneck thread find the input channel empty because the input thread is taking some time to wake up.

Assuming a roughly uniform dataset (which the Bitcoin blockchain definitely is), a pipeline formed by a bunch of threads connected with channels always stabilizes at a steady state. All channels before the bottleneck are always full, and their sending threads sleep while waiting to insert one additional element; all channels past the bottleneck are always empty, and their receivers sleep waiting for the next piece of work. The bottleneck thread always has the next piece of work ready in the (always full) channel before it, and can send (to an otherwise always empty channel) without blocking.

If you spot results contradicting this, please let me know about it. I would have some projects to fix. :)


dpc commented Apr 30, 2021

is skipping the provably unspendable scripts

Out of curiosity, how much does it save?


RCasatta commented May 1, 2021

is skipping the provably unspendable scripts

Out of curiosity, how much does it save?

It's about 3GB, but in practice it avoids the last doubling of the utxo map, saving almost 10GB!


dpc commented May 1, 2021

It's about 3GB, but in practice it avoids the last doubling of the utxo map saving almost 10GB!

Thought: would sharding this one map (or two maps) into 32-256 smaller bucket-maps, indexed by the first byte of the txid, make the memory usage slightly smoother?
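
A minimal sketch of the sharding idea (though, as discussed below, uniformly distributed keys make the shards grow in lockstep, which limits the smoothing):

```rust
use std::collections::HashMap;

/// UTXO map split into 256 shards by the first byte of the txid, so each
/// shard reallocates independently instead of one giant map doubling.
struct ShardedUtxo<V> {
    shards: Vec<HashMap<u64, V>>,
}

impl<V> ShardedUtxo<V> {
    fn new() -> Self {
        Self { shards: (0..256).map(|_| HashMap::new()).collect() }
    }

    fn shard_mut(&mut self, txid: &[u8; 32]) -> &mut HashMap<u64, V> {
        &mut self.shards[txid[0] as usize]
    }
}
```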


RCasatta commented May 3, 2021

It's about 3GB, but in practice it avoids the last doubling of the utxo map, saving almost 10GB!

Thought: would sharding this one map (or two maps) into 32-256 smaller bucket-maps, indexed by the first byte of the txid, make the memory usage slightly smoother?

So, taking the simple case of two maps: even txids go in one, odd in the other. If evens and odds are uniform, won't they double at almost the same time?

Honestly, I find it weird there isn't an option in the std map to configure the growth factor. In these cases, something like 1.5 would require more reallocations but give less peak memory usage.


RCasatta commented May 3, 2021

So, taking the simple case of two maps: even txids go in one, odd in the other. If evens and odds are uniform, won't they double at almost the same time?

Maybe by using a split that's not 50/50, like 2/3 of the requests going to one map and 1/3 to the other, you could make it a little smoother? (But you always hit the moment where both double... maybe choosing relatively prime sizes? Not sure, it looks complicated; again, I would prefer to tweak the map reallocation factor.)


dpc commented May 3, 2021

they will double almost at the same time

I forgot to comment on it, but I thought about it a little longer and it was a silly idea.

something like 1.5 would require more reallocations but less peak memory usage

BTW, don't you just want to pre-allocate (and even resize) the map manually then?
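
A minimal sketch of that suggestion; the capacity figure is a made-up placeholder, not a measured UTXO-set size:

```rust
use std::collections::HashMap;

// Placeholder: pick per network, and revisit as the UTXO set grows.
const UTXO_CAPACITY_HINT: usize = 80_000_000;

/// Pre-allocating up front avoids the repeated 2x growth doublings
/// (and their peak-memory spikes) during the initial sync.
fn make_utxo_map<V>() -> HashMap<u64, V> {
    HashMap::with_capacity(UTXO_CAPACITY_HINT)
}
```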


RCasatta commented May 3, 2021

BTW, don't you just want to pre-allocate (and even resize) the map manually then?

The problems with preallocation are:

  • The initial capacity is different per network (not a big issue)
  • You will need to tweak the value in the future as the utxo set grows
  • There seem to be brackets in the map allocation: when I tried a value lower than what was automatically allocated, the capacity ended up being the same


dpc commented Jan 27, 2022

Hi!

For a long while I'd been itching to try to improve rust-bitcoin-indexer, and I finally got some time & motivation to get into it.

I want to use more iterator-based APIs (based on a crate I published recently: dpc-pariter), and possibly modularize it, which could give it capabilities similar to blocks_iterator (since reusing each piece should be easier).

I started with the block-fetching (jsonrpc) code. While at it, I decided to benchmark that code against blocks_iterator. I'm getting ~110k txs/s, while blocks_iterator gets ~350k txs/s on the same machine, indexing roughly the same beginning of the mainnet blockchain. Not sure if anyone ever did a comparison, so I thought it might be interesting. The blockchain data is sitting on a slow desktop SMR HDD.

If you don't mind, I'm planning to eventually copy & adapt your blk-file reading code, and support both modes of sourcing the blockchain data.


RCasatta commented Jan 27, 2022

Thanks for the comparison, that's very interesting. Did you use the skip-prevout flag in the blocks_iterator tests?

If you don't mind, I'm planning to eventually copy & adapt your blk-file reading code, and support both modes of sourcing the blockchain data.

You are free to copy whatever you like :)
Just wondering if you have considered having blocks_iterator as a dep.

RCasatta commented

The bitcoin028 branch should improve performance as well, thanks to rust-bitcoin/rust-bitcoin#672


dpc commented Jan 27, 2022

Did you use the skip-prevout flag in the blocks_iterator tests?

Yes. Otherwise the utxo tracking becomes the bottleneck anyway, and you can't tell the max IO throughput.

On my machine the rocksdb utxo tracking maxes out around ~70k txs/s, the in-memory one around 200k txs/s. This is just the early mainnet blocks, so they're kind of empty etc., but it's a good first-order approximation.


dpc commented Jan 27, 2022

Just wondering if you have considered having blocks_iterator as a dep.

I actually did and still do. :D

BTW, looking at the code, I was confused about the reorder step until I read through the implementation. IMO it would be better if ReadDetect and the reordering were part of some inner thing (even if just a function that calls both, exactly like iterate does right now), because they are just implementation details of the same functionality: getting a stream of block data.


dpc commented Feb 3, 2022

I pushed https://github.com/dpc/block-iter/ recently. Right now it has kind of turned into a slow research project into the feasibility of https://docs.rs/fallible-iterator/ and https://docs.rs/dpc-pariter, with heavy reliance on copying some of your code and turning it into impl {Fallible}Iterator for <your code>. I hit some roadblocks with FallibleIterator, but after that I guess I'm going to copy your Utxo stores (they are so well optimized that I can't come up with anything better), and only then will I start refactoring the rust-bitcoin-indexer parts on top of it and figure out some other minor details.


RCasatta commented Feb 3, 2022

they are so well optimized

I am very happy :)


dpc commented Feb 3, 2022

I mean, if I find anything that makes it better, I'll report it back. Off the top of my head, I'm sure an LRU cache on top of that DB utxo store would be a huge speedup.


dpc commented Feb 3, 2022

I am planning to take a somewhat unified approach to benchmarking; have a look at https://github.com/dpc/block-iter/tree/master/libs/main/examples


RCasatta commented Feb 6, 2022

the in-memory one around 200k txs/s

I finally had a chance to try the bench example; I got about 150k txs/s on mainnet.
