On performance and memory usage #1
Oh, I see. That's better than what I've been doing then.
There is a bunch of concurrent hashmaps on crates.io. Maybe some of them would fit the bill.
Thanks for letting me know! The only option left that I know is worth it is an LRU cache. It's a great tradeoff between perf and memory usage, and it also puts a strict ceiling on memory usage, which gives the end user the ability to fine-tune it to their system. You have a small board with 2GB? You can make the LRU fit exactly that size and run the tool as fast as it can go with those resources. Also, with that additional code (to get output straight from the node), it would allow …
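To make the ceiling idea concrete, here is a minimal std-only sketch of a capacity-bounded, LRU-evicting map. It is illustrative only (the type and its fields are made up, not the project's code): a production LRU, e.g. the `lru` crate, keeps an ordered list for O(1) eviction, while this sketch scans for the least recently used entry on eviction, which is O(n) but keeps the example short.

```rust
use std::collections::HashMap;

/// Capacity-bounded cache sketch: memory stays at roughly
/// `capacity * entry_size`, no matter how many keys pass through.
struct BoundedCache<K, V> {
    map: HashMap<K, (V, u64)>, // value + last-access tick
    capacity: usize,
    tick: u64,
}

impl<K: std::hash::Hash + Eq + Clone, V> BoundedCache<K, V> {
    fn new(capacity: usize) -> Self {
        Self { map: HashMap::with_capacity(capacity), capacity, tick: 0 }
    }

    fn get(&mut self, key: &K) -> Option<&V> {
        self.tick += 1;
        let tick = self.tick;
        // Touch the entry so it counts as recently used.
        self.map.get_mut(key).map(|(v, t)| { *t = tick; &*v })
    }

    fn insert(&mut self, key: K, value: V) {
        self.tick += 1;
        if self.map.len() >= self.capacity && !self.map.contains_key(&key) {
            // Evict the least recently used entry to respect the ceiling.
            if let Some(old) = self
                .map
                .iter()
                .min_by_key(|(_, (_, t))| *t)
                .map(|(k, _)| k.clone())
            {
                self.map.remove(&old);
            }
        }
        self.map.insert(key, (value, self.tick));
    }
}
```

The user-facing knob would then be `capacity`, derived from however much RAM the box has.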
Oh, I see that you're using block file parsing to get the block data. Most bitcoin-related projects do, and I never got around to benchmarking the difference. You would probably benefit from good user-accessible benchmark time tracking: first for your own tuning, and then for users to be able to find the bottleneck on their setup and tune according to the results. If the bottleneck is reading the data, then a 10% hit to the UTXO tracking performance might not matter at all.
I know electrs is faster when you configure it to read the blocks.dat files, and my usual experience with RPC is that it's not super fast, but I didn't benchmark it.
Yes, if that is the bottleneck it makes sense to use RPC; for blocks_iterator the bottleneck is the previous_output computation.
Any idea on how to do this? Maybe a GitHub Actions job with a Docker image loaded with testnet blocks*.dat?
It's important that the RPC is queried with some parallelism to get rid of the latency overhead. The node itself has to do the same IO to the same files, and encoding to hex, sending over, and decoding hex are all kind of cheap. It seems to me that all the overheads tend not to matter as long as the IO stays 100% saturated. If one puts the node on another box, the RPC approach might become faster as the load just gets spread across two boxes.
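The parallelism point can be sketched with a small worker pool over std channels. This is a hedged illustration, not the project's code: `fetch_block` is a stand-in for the real blocking JSON-RPC `getblock` call, and the pool size is arbitrary.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Stand-in for a blocking JSON-RPC call; in reality this would go over
/// HTTP to the node, and round-trip latency would dominate wall time.
fn fetch_block(height: u64) -> Vec<u8> {
    vec![height as u8] // dummy payload
}

/// Fetch `heights` with `workers` concurrent requests, so per-request
/// latency overlaps instead of adding up serially.
fn fetch_parallel(heights: Vec<u64>, workers: usize) -> Vec<(u64, Vec<u8>)> {
    let (job_tx, job_rx) = mpsc::channel::<u64>();
    let job_rx = Arc::new(Mutex::new(job_rx)); // shared job queue
    let (res_tx, res_rx) = mpsc::channel();

    let mut handles = Vec::new();
    for _ in 0..workers {
        let job_rx = Arc::clone(&job_rx);
        let res_tx = res_tx.clone();
        handles.push(thread::spawn(move || loop {
            // Lock only to pull the next height; the fetch itself
            // runs outside the lock, so workers overlap.
            let height = match job_rx.lock().unwrap().recv() {
                Ok(h) => h,
                Err(_) => break, // queue closed: done
            };
            res_tx.send((height, fetch_block(height))).unwrap();
        }));
    }
    drop(res_tx);

    let n = heights.len();
    for h in heights {
        job_tx.send(h).unwrap();
    }
    drop(job_tx); // closing the queue lets workers exit

    let mut out: Vec<_> = res_rx.iter().take(n).collect();
    for h in handles {
        h.join().unwrap();
    }
    out.sort_by_key(|(h, _)| *h); // responses arrive out of order
    out
}
```

With latency hidden this way, the remaining cost is the IO the node does anyway, which matches the "IO stays saturated" observation.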
I'm really surprised. That would mean it's CPU/memory-bandwidth bound. I mean, the memory usage is quite big, but adding and looking up stuff in the hashmap should be really fast.
My favorite trick that I use in some pieces of Rust code is that I wrap … But generally that's it: measuring crucial parts and then exposing them somehow to the user.
BTW, channels of size 200 are in my experience just wasting memory. Eventually one part becomes a permanent bottleneck one way or another, and all the other places are always full/empty. The channel size is there purely to amortize variance, so in my experience channels of size …
lol, here I did exactly the opposite, measuring the busy time as the time the thread spends non-blocked on …
If you check the busy time, on my machine the fee thread is the one that takes the longest. The reason is that there are many insertions in the map (the number of outputs in a block) and lookups (the number of inputs in a block). Lookups are made on 2 maps because the key-truncation trick comes at the expense of checking the non-truncated map first... Also, the map is pretty big, so the CPU cache effect becomes less powerful.
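The two-map truncated-key trick mentioned above can be sketched like this. All names and sizes here are hypothetical (not the project's actual layout): keys are truncated to 8 bytes to save memory, and the rare keys whose truncation would collide are stored under their full key in an "exceptions" map that every lookup must probe first.

```rust
use std::collections::HashMap;

/// Sketch of a memory-saving map with truncated keys plus a full-key
/// exceptions map for truncation collisions.
struct TruncatedMap {
    full: HashMap<[u8; 32], u64>, // collision exceptions, full key
    short: HashMap<[u8; 8], u64>, // common case, truncated key
}

impl TruncatedMap {
    fn truncate(key: &[u8; 32]) -> [u8; 8] {
        key[..8].try_into().unwrap()
    }

    fn insert(&mut self, key: [u8; 32], value: u64) {
        let short = Self::truncate(&key);
        if self.short.contains_key(&short) {
            // Truncated slot already taken: fall back to the full-key map.
            self.full.insert(key, value);
        } else {
            self.short.insert(short, value);
        }
    }

    /// Every lookup pays for probing the exceptions map first; that extra
    /// probe is the cost the comment above refers to.
    fn get(&self, key: &[u8; 32]) -> Option<&u64> {
        self.full.get(key).or_else(|| self.short.get(&Self::truncate(key)))
    }
}
```

The memory win comes from the common case storing 8-byte instead of 32-byte keys; the price is a second hashmap probe on every lookup.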
I stayed a bit large because I imagined thread sleep/wake-up could cause some issues, like the bottleneck thread finding the input channel empty because the input thread is taking some time to wake up. Need to experiment on this.
Finally, I am very happy with the last 2 improvements: they decreased the memory needed a lot (see the readme for a stats table). The most obvious, but shamefully missing until now, is skipping the provably unspendable scripts (547f9d2). The other one is putting the most common scripts directly on the stack (f92857e and a1c9d65).
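For readers unfamiliar with the first improvement: the textbook example of a provably unspendable output is one whose script begins with OP_RETURN, so it never needs to enter the UTXO map at all. A minimal check might look like this (hedged: the referenced commit may recognize more patterns than just this one).

```rust
/// OP_RETURN opcode byte. A script beginning with OP_RETURN can never be
/// spent, so its output can be skipped instead of tracked as a UTXO.
const OP_RETURN: u8 = 0x6a;

/// Minimal "provably unspendable" test covering only the OP_RETURN case.
fn is_provably_unspendable(script_pubkey: &[u8]) -> bool {
    script_pubkey.first() == Some(&OP_RETURN)
}
```

Skipping these outputs saves both the insertion work and the map entry's memory for the rest of the run.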
They are both useful. Logically, each step in a pipeline is "recv", "process", "send", and they should all add up to the total runtime. Oftentimes it's worthwhile to measure sub-parts of "process" as well.
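The recv/process/send accounting can be done with one tiny helper. This is a generic sketch (not either project's actual instrumentation): wrap each of the three phases in a timed closure and keep a per-stage accumulator, so the three durations add up to roughly the thread's total runtime and the bottleneck stage stands out.

```rust
use std::time::{Duration, Instant};

/// Run `f`, adding its wall-clock duration to the per-stage accumulator
/// `acc`, and pass its result through unchanged.
fn timed<T>(acc: &mut Duration, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = f();
    *acc += start.elapsed();
    out
}
```

Usage inside a pipeline thread would be e.g. `let item = timed(&mut recv_time, || rx.recv());`, and at shutdown each thread reports its `recv_time` / `process_time` / `send_time` totals.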
I was wondering about what you said about using …
Assuming a roughly uniform dataset (which the Bitcoin blockchain definitely is), a pipeline formed by a bunch of threads connected with channels always stabilizes at a steady state. All channels before the bottleneck are always full, and their sending threads sleep waiting to insert one additional element; all channels past the bottleneck are always empty, and their receivers sleep waiting for the next piece of work. The bottleneck thread always has the next piece of work ready in the (always full) channel, and can send (to an otherwise always empty channel) without blocking. If you spot results contradicting this, please let me know; I would have some projects to fix. :)
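The steady-state argument implies that even a capacity-1 bounded channel loses nothing once the pipeline settles. A toy three-stage pipeline sketch (illustrative only, using std's `sync_channel`):

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Producer -> worker -> collector over capacity-1 bounded channels.
/// If the worker is the bottleneck, the channel before it stays full and
/// the one after stays empty, so capacity beyond a small variance buffer
/// buys nothing except memory usage.
fn pipeline(inputs: Vec<u64>) -> Vec<u64> {
    let (tx1, rx1) = sync_channel::<u64>(1);
    let (tx2, rx2) = sync_channel::<u64>(1);

    let producer = thread::spawn(move || {
        for x in inputs {
            tx1.send(x).unwrap(); // blocks whenever the worker lags
        }
        // tx1 dropped here, which ends the worker's loop.
    });
    let worker = thread::spawn(move || {
        for x in rx1 {
            // Pretend this is the expensive stage (the bottleneck).
            tx2.send(x * 2).unwrap();
        }
    });

    let out: Vec<u64> = rx2.iter().collect();
    producer.join().unwrap();
    worker.join().unwrap();
    out
}
```

Throughput here is set entirely by the slowest stage; growing the channel capacities only changes how much memory sits parked in the full channel before the bottleneck.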
Out of curiosity, how much does it save?
It's about 3GB, but in practice it avoids the last doubling of the UTXO map, saving almost 10GB!
Thought: would sharding this one map (or two maps) into 32–256 smaller bucket maps, indexed by the first byte of the txid, make the memory usage slightly smoother?
So, taking the simple case of two maps: even txids go in one, odd in the other. If evens and odds are uniform, won't they double at almost the same time? Honestly, I find it weird there isn't an option in the std map to configure the map's growth factor. In these cases, something like 1.5 would require more reallocations but give less peak memory usage.
Maybe with a split that isn't 50/50, like 2/3 of requests going to one map and 1/3 to the other, you could make it a little smoother? (But you'd still have moments where both double... maybe choosing relatively prime sizes? Not sure, it looks complicated; again, I'd rather just tweak the map reallocation factor.)
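For reference, the sharding idea under discussion could look like the following sketch (names and shard count are hypothetical). Each doubling then costs one shard's reallocation rather than the whole map's, though, as noted above, with uniformly distributed txids all shards still tend to double around the same total size.

```rust
use std::collections::HashMap;

/// 256 bucket maps selected by the first key byte. Peak transient memory
/// during a grow is one shard's reallocation, not the whole map's.
struct ShardedMap {
    shards: Vec<HashMap<[u8; 32], u64>>,
}

impl ShardedMap {
    fn new() -> Self {
        Self { shards: (0..256).map(|_| HashMap::new()).collect() }
    }

    fn insert(&mut self, key: [u8; 32], value: u64) {
        // The first txid byte picks the shard; uniform txids spread
        // the load evenly across all 256 buckets.
        self.shards[key[0] as usize].insert(key, value);
    }

    fn get(&self, key: &[u8; 32]) -> Option<&u64> {
        self.shards[key[0] as usize].get(key)
    }
}
```

This is exactly why the uniform-distribution objection bites: evenly loaded shards smooth the *size* of each reallocation spike but not the *timing*, since they all cross their capacity thresholds in the same region of the sync.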
I forgot to comment on it, but I thought about it a little bit longer and it was a silly idea.
BTW. Don't you just want to pre-allocate (and even resize) the map manually then? |
The problems with preallocation are: …
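For context, std's `HashMap` does support sizing a map up front; a minimal sketch (the size here is made up, not a measured figure for the real chainstate):

```rust
use std::collections::HashMap;

/// Build a UTXO-style map with room for `n` entries reserved up front,
/// avoiding the incremental grow-and-rehash doublings during the sync.
/// The caller must guess `n`, and the full allocation is paid immediately.
fn preallocated_map(n: usize) -> HashMap<[u8; 8], u64> {
    HashMap::with_capacity(n)
}
```

`with_capacity(n)` guarantees at least `n` insertions without reallocation, at the cost of committing that memory from the start even while the map is still mostly empty.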
Hi! For a long while I was itching to try to improve …

I want to use more iterator-based APIs (based on a crate I published recently: …).

I started from the block-fetching (jsonrpc) code. While at it, I decided to benchmark that code against …

If you don't mind, I'm planning to eventually copy & adapt your blk-files reading code and support both modes of sourcing the blockchain data.
Thanks for the comparison, that's very interesting. Did you use …
You are free to copy whatever you like :) |
The bitcoin028 branch should also improve performance, thanks to rust-bitcoin/rust-bitcoin#672.
Yes. Otherwise the UTXO tracking becomes the bottleneck anyway, and you can't tell the max IO throughput. On my machine the rocksdb UTXO tracking maxes out around ~70k txs/s, the in-memory one around ~200k txs/s. This is just the early mainnet blocks, so they're kind of empty etc., but it's a good first-order approximation.
I actually did and still do. :D BTW, looking at the code, I was confused about …
I've pushed https://github.com/dpc/block-iter/ recently. Right now it has kind of turned into a slow research project on the feasibility of https://docs.rs/fallible-iterator/ and https://docs.rs/dpc-pariter , with heavy reliance on copying some of your code and turning it into …
I am very happy :) |
I mean, if I find anything that makes it better I'll report it back. Off the top of my head, I'm sure an LRU cache on top of that DB UTXO store would be a huge speedup.
I am planning to take a somewhat unified approach to benchmarking, if you look at https://github.com/dpc/block-iter/tree/master/libs/main/examples
I finally had a chance to try the bench example; I got about 150,000 txs/s on mainnet.
I'm moving the discussion from rust-bitcoin/rust-bitcoin#595 here, since I think this might be an ongoing discussion, people might find it useful, and it feels like we've been off topic there for a long time now.