Faster Bulk Next() and Add() #123
Evidently, we are always happy to receive contributions. Anything that you use in an actual application with good results should make it into our code. At a minimum, we should make sure that you can get the features you need without having to maintain a private fork (something that is not good for anyone). Let us chat.

I'd break your contributions into two distinct ones.

First, it seems that you propose a way to iterate over values faster by doing it in bulk. In C, we sidestep the problem of iterating over the values with an iterator, and use a kind of for-each construct instead: the function takes a function pointer as input and calls your function with each value (possibly stopping if your function returns false). I don't know how well this works in Go, but in C it is ridiculously fast. So it is possible that there are alternative designs that you have not considered that might serve your performance needs even better. Or not.

Then you want to add values in bulk. We have a function called AddMany already.
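For reference, a minimal Go sketch of that for-each shape, built on the iterator the Go port exposes; the forEach helper itself is an assumption for illustration, not part of the current API:

```go
package main

import (
	"fmt"

	"github.com/RoaringBitmap/roaring"
)

// forEach visits every value in the bitmap and stops early when the
// callback returns false -- the same contract as the C for-each construct
// described above. Sketch only; not part of the roaring Go API.
func forEach(rb *roaring.Bitmap, visit func(x uint32) bool) {
	it := rb.Iterator()
	for it.HasNext() {
		if !visit(it.Next()) {
			return
		}
	}
}

func main() {
	rb := roaring.BitmapOf(1, 2, 100, 1000)
	forEach(rb, func(x uint32) bool {
		fmt.Println(x)
		return x < 100 // stop once we have printed 100
	})
}
```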
cc @maciej @glycerine
@lemire Thanks for the thorough response. We're absolutely not tied to any of the names we're currently using. Here are some benchmarks for our bulk add.
1% density:
So in an apples-to-apples comparison (same overall inputs) we see a 3-4x speedup. Let me know if you spot any errors here; these were very quickly constructed. For comparison:
0.1% density:
I'll try to put together something similar for the bulk iteration.
Thanks! I love hard numbers!
Next rows
Again, forgive the rough benchmarks here, and point out any mistakes I've made.
So: ~16x on RLE, ~3.5x on array, ~2.5x on bitmap. In practice these numbers are a worst case for us, as dealing with the rows in blocks also speeds up our own code. (The bitmap nextRows is very naively implemented, hence the extra memory pressure; I'd fix this when upstreaming, and likely achieve numbers quite a bit better.)
Awesome. Wow, that is significant. This is great work. I'd just like to encourage you to polish it and submit a PR so roaring can be improved for all users. When I wrote the RLE part of roaring in Go, we established correctness with extensive tests, but there was, and still is, performance tuning that can and should be done. So thanks for doing this!
Implemented a smarter nextRows for bitmaps:
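The patch itself is not shown here; below is a generic sketch of the word-scanning technique such an implementation typically relies on (decode each 64-bit word with TrailingZeros64 and clear the lowest set bit), not the actual change:

```go
package main

import (
	"fmt"
	"math/bits"
)

// nextManyFromWords decodes set-bit positions from a bitmap container's
// 64-bit words into buf, returning how many values were written. base is
// the container's high bits already shifted into place. Illustrative
// only; a real container would also keep resumable iteration state.
func nextManyFromWords(words []uint64, base uint32, buf []uint32) int {
	n := 0
	for i, w := range words {
		for w != 0 {
			if n == len(buf) {
				return n
			}
			buf[n] = base + uint32(i)<<6 + uint32(bits.TrailingZeros64(w))
			n++
			w &= w - 1 // clear the lowest set bit
		}
	}
	return n
}

func main() {
	words := []uint64{0b1011, 1 << 63}
	buf := make([]uint32, 8)
	n := nextManyFromWords(words, 0, buf)
	fmt.Println(buf[:n]) // [0 1 3 127]
}
```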
Memory is much better, and it's up to ~3.5x. I'll put together a PR in the near future. I'm not entirely decided on the interface.
It's kinda ugly though. Also, it's a bit more work to ensure things stay in sync if the same iterator object supports calls to both NextMany and Next. And the interface is different anyway: Next() requires a HasNext(), while in our bulk version exhaustion is signalled through the return count.
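To make the mismatch concrete, here are the two contracts side by side as Go interfaces; the bulk shape follows the return-count convention discussed in this thread, and the names are illustrative:

```go
package sketch

// The existing one-at-a-time protocol: every Next must be guarded
// by a HasNext check.
type IntIterable interface {
	HasNext() bool
	Next() uint32
}

// A bulk protocol needs no separate guard: NextMany fills buf and
// returns the number of values written, with 0 signalling exhaustion.
type ManyIntIterable interface {
	NextMany(buf []uint32) int
}
```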
I agree with @glycerine.
I would support this. Again, though, I will not get into bikeshedding about exactly how things should be named; since we have no concept of rows in our API, it is probably best not to use the term, I think.
Yes. Some naive thoughts:
Maybe spelling out something like NextMany.
This could be openly discouraged/disallowed. The scenario where one would try to use both sounds a bit adversarial. If we simply document it ("don't do this"), that is probably good enough.
The Go way is to make things simple. Adding extra wrappers does not sound Go-like to me, but that's not something I feel strongly about.
I completely agree with what was said here.
I don't understand why the input needs to be split into high and low bits.
@maciej That's a very good question. I am somewhat puzzled by the split between low and high.
Cool, I'll start with a PR for the bulk iteration.
We want the ability to construct the bitmap in multiple chunks; I'm not sure how to do that without the function being a method on the bitmap itself, or part of a builder.
We saw better performance with the split. Our implementation iterates through the buffer, looking for when the high 16 bits change (i.e. when the container changes), and then passes all of the low 16 bits down to the container at once. In the best case, with something like an array container where the high and low bits are already split, it's a trivial memcpy of the entire chunk.
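A standalone sketch of the chunk-splitting step described above; the emit callback stands in for the container-level bulk insert (which is where the memcpy-style copy would happen), and all names are illustrative rather than the library's internals:

```go
package main

import "fmt"

// splitByHigh16 walks a sorted slice of uint32 values and hands each run
// sharing the same high 16 bits (i.e. belonging to the same container)
// to emit in one call.
func splitByHigh16(vals []uint32, emit func(high uint16, chunk []uint32)) {
	for len(vals) > 0 {
		hb := vals[0] >> 16
		end := 1
		for end < len(vals) && vals[end]>>16 == hb {
			end++
		}
		emit(uint16(hb), vals[:end])
		vals = vals[end:]
	}
}

func main() {
	vals := []uint32{1, 2, 70000, 70001, 140000}
	splitByHigh16(vals, func(high uint16, chunk []uint32) {
		fmt.Printf("container %d receives %d values\n", high, len(chunk))
	})
}
```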
What is wrong with the following...
This is not rhetorical; I'd like to better understand your point of view.
@cakey I was pretty curious about how a simple bulk add would perform, so I tried a few naive implementations myself. The benchmarks are somehow no worse than what you showed (1% density):
which makes me even more curious to see your pull request!
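The benchmarked implementations are not shown above; a naive baseline presumably looks something like the following, paying the container lookup once per value:

```go
package main

import (
	"fmt"

	"github.com/RoaringBitmap/roaring"
)

// addManyNaive is the obvious baseline: one container lookup per value.
// A sketch of the kind of simple variant being compared here, not the
// actual code from this thread.
func addManyNaive(rb *roaring.Bitmap, vals []uint32) {
	for _, v := range vals {
		rb.Add(v)
	}
}

func main() {
	rb := roaring.New()
	addManyNaive(rb, []uint32{1, 2, 3, 100000})
	fmt.Println(rb.GetCardinality()) // 4
}
```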
@lemire Nothing wrong with that if you don't need to build the bitmap in chunks.
@maciej I compared your functions to mine. I then tried a variant of my own, AddManySeqNew, taking inspiration from your much simpler functions, which remove a ton of container overhead, and again trying to choose the right container. I'm quite happy with the results. Here it is with the high/low split:
and without the split:
Results:
0.1% density:
1% density:
10% density:
So it's about a 10-30% gain from splitting into highs and lows. It might be a bit of work to keep the performance of the above approach with chunking, though. The other reason we want chunking is that it gives us the opportunity to use addRange for RLE, which is common in the format we're converting from and is key for performance in a subset of our use cases (it lets us avoid materialising all the uint32s).
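For the run-heavy case, roaring's existing AddRange already adds a run without materialising each value:

```go
package main

import (
	"fmt"

	"github.com/RoaringBitmap/roaring"
)

func main() {
	rb := roaring.New()
	rb.AddRange(1000, 2000) // half-open: adds 1000 through 1999 in one call
	fmt.Println(rb.GetCardinality()) // 1000
}
```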
@cakey ok, now looking at your implementation the performance gains seem obvious. If you can copy a subslice from the input straight into the container, that is clearly cheaper than handling values one by one.
What about the idea of having a function that accepts a "generator" (an anonymous function; I don't know what you'd call such a pattern in Go)?
You said that the high-low split gives you an opportunity to use addRange.
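Something along these lines, perhaps, where the caller supplies sorted chunks on demand so the full input never has to be materialised; AddManyFunc is hypothetical, while AddMany is roaring's existing bulk add:

```go
package main

import (
	"fmt"

	"github.com/RoaringBitmap/roaring"
)

// AddManyFunc pulls sorted chunks from a generator-style callback until
// it reports exhaustion. Hypothetical sketch of the pattern floated above.
func AddManyFunc(rb *roaring.Bitmap, next func() ([]uint32, bool)) {
	for {
		chunk, ok := next()
		if !ok {
			return
		}
		rb.AddMany(chunk)
	}
}

func main() {
	rb := roaring.New()
	i := uint32(0)
	AddManyFunc(rb, func() ([]uint32, bool) {
		if i >= 30 {
			return nil, false // generator exhausted
		}
		chunk := []uint32{i, i + 1, i + 2}
		i += 10
		return chunk, true
	})
	fmt.Println(rb.GetCardinality()) // 9
}
```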
I think that this issue has been resolved entirely. Reopen if you disagree.
Hey!
Great library! Big fan of the research you put out :)
We've been starting to use roaring for much faster union/intersection operations, but we've found that a big bottleneck for us has been creating those recordsets, and then iterating through them after the operations.
Internally we've had a lot of success implementing bulk creation and iteration functions to avoid unnecessary per-row work. For some of our use cases we see speedups of up to 10-20x.
We'd love to contribute them back upstream!
Before cleaning up our code and submitting a PR, it would be great to get your feedback on what you'd like the API to look like.
Our signatures currently look something like:
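Shapes along these lines would fit the discussion below; the exact names and signatures here are assumptions, not the actual proposal:

```go
package sketch

// Hypothetical shapes inferred from the discussion, not the actual
// proposed signatures.
type BulkAdder interface {
	// AddRows bulk-inserts a sorted slice of values.
	AddRows(rows []uint32)
}

type BulkIterator interface {
	// NextRows fills buf with the next values and returns how many were
	// written; 0 means the iterator is exhausted.
	NextRows(buf []uint32) int
}
```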
Each container type then has corresponding functions that can exploit those assumptions to avoid unnecessary per-row work.
For example, for RLE, nextrows gives us 20x the performance of naively calling next() repeatedly.
Excited to contribute :) What are the next steps here?