generalise bitmap construction and expose heuristics #276
Great work. Maybe someone wants to review this? (I will review.)
OK. I have just removed one of the marker interfaces to reduce the verbosity required for type bounds over the different implementations. Since there are quite a few interesting features contributed by various users, it would be good to try to tie them together into a consistent API somehow: this PR attempts to start that process by generalising construction. I am happy to make changes if necessary.
I have just asked @blacelle and @rafael-telles to review... It would be great to get wider feedback.
I've fixed an existing bug where the key wraps.
I've put it in the bitmap builders codebase I maintain. I love the API; all is clear, and it is suited to advanced users. I can see differences in processing time between the different heuristics used. LGTM! Great job! As a next step, I would imagine support for helping users decide which heuristics to enable. Given that some of my bitmap collections are quite similar every day I build them, I'd like to decide which heuristics to use. The simplest version could be a tool to analyse some simple statistics like The more advanced would be
@ppiotrow that would be quite useful. Actually, I'm interested in automated data quality checks, and producing a significantly different "recommendation" day on day would be a very cheap indication that a new dataset is significantly different. Do you have ideas about how to implement it?
Looking at your code, the most powerful method seems to be I wonder if we could use information about the number of array, bitmap and run containers, and also the average number of elements per container type, to apply some expert rules. But no better ideas now.
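One way the "expert rules" idea could be prototyped is to tally, for each chunk of 2^16 values, which container type would serialize smallest. The sketch below is hypothetical and is not part of the library: the class `ContainerStats` and its methods are invented for illustration. It assumes the usual Roaring sizes (2 bytes per value for an array container of at most 4096 values, 8192 bytes for a bitmap container, 2 + 4 bytes per run for a run container) and input that is sorted and distinct.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical helper, not library code: tallies the cheapest container type per 2^16 chunk. */
final class ContainerStats {
    enum Kind { ARRAY, BITMAP, RUN }

    /** Picks the representation with the smallest serialized size for one chunk. */
    static Kind cheapest(int cardinality, int runs) {
        // Array containers hold at most 4096 values at 2 bytes each.
        int arrayBytes = cardinality <= 4096 ? 2 * cardinality : Integer.MAX_VALUE;
        int bitmapBytes = 8192;       // 1024 longs
        int runBytes = 2 + 4 * runs;  // run count, then (start, length) pairs
        if (runBytes < Math.min(arrayBytes, bitmapBytes)) {
            return Kind.RUN;
        }
        return arrayBytes <= bitmapBytes ? Kind.ARRAY : Kind.BITMAP;
    }

    /** Counts chunks per favoured kind. The input must be sorted and distinct. */
    static Map<Kind, Integer> tally(int[] sorted) {
        Map<Kind, Integer> counts = new EnumMap<>(Kind.class);
        for (Kind k : Kind.values()) {
            counts.put(k, 0);
        }
        int i = 0;
        while (i < sorted.length) {
            int key = sorted[i] >>> 16; // high 16 bits identify the chunk
            int cardinality = 0;
            int runs = 0;
            int prev = -1;
            while (i < sorted.length && (sorted[i] >>> 16) == key) {
                int low = sorted[i] & 0xFFFF;
                if (cardinality == 0 || low != prev + 1) {
                    runs++; // a gap starts a new run
                }
                prev = low;
                cardinality++;
                i++;
            }
            counts.merge(cheapest(cardinality, runs), 1, Integer::sum);
        }
        return counts;
    }

    /** Returns the kind favoured by the most chunks (ties go to the earlier enum constant). */
    static Kind recommend(int[] sorted) {
        Map<Kind, Integer> counts = tally(sorted);
        Kind best = Kind.ARRAY;
        for (Kind k : Kind.values()) {
            if (counts.get(k) > counts.get(best)) {
                best = k;
            }
        }
        return best;
    }
}
```

Comparing the tallies from one day's data with the next would give the cheap "is this dataset different" signal mentioned above, though the thresholds for calling two tallies different are left open here.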
The performance rationale for the writer abstraction in this PR, along with the benefits of doing a partial radix sort, is described here.
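For readers unfamiliar with the idea, a partial radix sort orders values by their high 16 bits only, so that all values destined for the same container become adjacent while the low 16 bits keep their arrival order. The sketch below is a simplified single-pass counting sort written for illustration; the class name `PartialRadixSortSketch` is hypothetical and this is not the implementation used in the PR.

```java
/** Hypothetical illustration of a partial radix sort on the container key (high 16 bits). */
final class PartialRadixSortSketch {

    /** Stable counting sort by the high 16 bits; values sharing a key keep their input order. */
    static int[] sortByKey(int[] data) {
        // offsets[k + 1] first holds the number of values with key k.
        int[] offsets = new int[(1 << 16) + 1];
        for (int v : data) {
            offsets[(v >>> 16) + 1]++;
        }
        // Prefix sum: offsets[k] becomes the first output index for key k.
        for (int k = 0; k < (1 << 16); k++) {
            offsets[k + 1] += offsets[k];
        }
        int[] out = new int[data.length];
        for (int v : data) {
            out[offsets[v >>> 16]++] = v;
        }
        return out;
    }
}
```

After this pass, a writer that buffers one container at a time sees each key exactly once, even if the original input interleaved several keys.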
Are there any concerns to be addressed?
The
I'm inclined to agree, but not without quite a lot of changes like changing the visibility of e.g.
I think this might be a good place and time to share. While building gigabytes of bitmaps, we are getting the best results by not using
@ppiotrow does that relate to the features in this PR? Or just GC tuning advice?
General tip for creating Roaring Bitmaps. I assume that the new builders code (if used smartly) should decrease the number of allocations. Still, I expect better results with
OK. I guess they don't call it the "throughput collector" for nothing! Maybe you could get yourself onto JDK11 and give ZGC a try? Performance-wise, these features are designed for mostly ordered insertions, and for completely random insertions could only be slightly worse than repeatedly calling
…richardstartin/RoaringBitmap into ordered-writer-array-heuristics
I'm mostly working with ordered data. Previous
If you have 32 threads, 32 * 8kB = 256kB really isn't that much memory, particularly compared to what you're building or what it's being built from, or considering container transitions. I think it might be a good idea to allow resetting the writer (clearing the buffer and reinitialising the underlying bitmap) and then you would always have a lot less than 1MB for buffering.
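To make the buffering and reset idea concrete, here is a minimal sketch of a writer that keeps a single reusable 8 kB bitmap for the chunk being written and can be reset for reuse. It is hypothetical and is not the writer in this PR: the class `BufferedWriterSketch` is invented, and for brevity every flushed chunk is stored as a plain `long[1024]` in a `TreeMap` rather than as an array, run or bitmap container.

```java
import java.util.Arrays;
import java.util.TreeMap;

/** Hypothetical sketch, not library code: buffers one 2^16 chunk in a reusable 8 kB bitmap. */
final class BufferedWriterSketch {
    private final long[] buffer = new long[1024]; // 1024 * 8 bytes = 8 kB, reused for every chunk
    private TreeMap<Integer, long[]> chunks = new TreeMap<>();
    private int currentKey = -1;                  // -1 means no chunk is buffered yet
    private boolean dirty;

    void add(int value) {
        int key = value >>> 16; // unsigned shift: keys run from 0 to 0xFFFF without wrapping
        if (key != currentKey) {
            flush();            // the key changed, so write out the previous chunk first
            currentKey = key;
        }
        int low = value & 0xFFFF;
        buffer[low >>> 6] |= 1L << low; // long shifts use only the low 6 bits of the count
        dirty = true;
    }

    /** Merges the buffer into the stored chunk for the current key and clears the buffer. */
    void flush() {
        if (!dirty) {
            return;
        }
        long[] target = chunks.computeIfAbsent(currentKey, k -> new long[1024]);
        for (int i = 0; i < buffer.length; i++) {
            target[i] |= buffer[i]; // OR-merge lets a key be revisited by out-of-order writes
            buffer[i] = 0L;
        }
        dirty = false;
    }

    TreeMap<Integer, long[]> get() {
        flush();
        return chunks;
    }

    /** Clears the buffer and reinitialises the underlying structure so the writer can be reused. */
    void reset() {
        Arrays.fill(buffer, 0L);
        chunks = new TreeMap<>();
        currentKey = -1;
        dirty = false;
    }

    static long cardinality(TreeMap<Integer, long[]> chunks) {
        long total = 0;
        for (long[] words : chunks.values()) {
            for (long w : words) {
                total += Long.bitCount(w);
            }
        }
        return total;
    }
}
```

With one such writer per thread, the buffering overhead stays at one 8 kB array per thread no matter how many bitmaps are built in sequence.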
I meant holding millions of bitmaps in memory to build them. Think of
I don't know anything about how your application works but the moment you put a single bitmap container in those 5M bitmaps you have 40GB anyway, so it feels like having 5M bitmaps, let alone building 5M bitmaps at the same time in a single process, could be part of the problem. @lemire discusses two encoding strategies for reducing the number of bitmaps required here.
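The two figures quoted in this exchange follow from the size of one bitmap container or buffer, which is 1024 longs, i.e. 8192 bytes. A small worked check, using a hypothetical helper class `BufferMemory` written only for this illustration:

```java
/** Hypothetical helper for checking the memory figures quoted in the discussion. */
final class BufferMemory {
    /** One bitmap container (or one writer buffer) is 1024 longs of 8 bytes each. */
    static final long BITMAP_CONTAINER_BYTES = 1024L * Long.BYTES; // 8192

    static long bytes(long count) {
        return count * BITMAP_CONTAINER_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(bytes(32));        // 262144 bytes, i.e. 256 kB for 32 threads
        System.out.println(bytes(5_000_000)); // 40960000000 bytes, i.e. roughly 40 GB for 5M bitmaps
    }
}
```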
Not if it is an array or run container.
Sure, but 1
That is what I meant. The previous API was just a classic "space–time tradeoff" example where
No, I understand entirely, and hope I didn't give you the impression I didn't like what you were saying. I just never imagined anyone creating 5 million of these!
OK. So this looks good to me. We lost 0.005% in coverage and somehow coveralls thinks this warrants a red flag. Mysterious.
boolean isEmpty();

T runOptimize();
I know it's a little too late, but I've noticed this. I think this method should have the more general name toEfficientContainer and implement that logic. Right now, if we "optimiseForBitmaps", we may end up with BitmapContainers with cardinality lower than 4096 in the final bitmap, which should not happen. Also, it's a happy coincidence that RunContainer::runOptimize calls RunContainer::toEfficientContainer.
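The invariant in question is that a container holding 4096 or fewer values should be stored as a sorted array of 16-bit values rather than as an 8 kB bitmap. A minimal sketch of that one conversion follows; it is hypothetical, not the library's toEfficientContainer, the class `EfficientContainerSketch` is invented for illustration, and it ignores run containers entirely.

```java
import java.util.Optional;

/** Hypothetical sketch of the bitmap-to-array direction of an "efficient container" conversion. */
final class EfficientContainerSketch {
    static final int ARRAY_MAX = 4096; // largest cardinality an array container holds

    static int cardinality(long[] bitmap) {
        int count = 0;
        for (long word : bitmap) {
            count += Long.bitCount(word);
        }
        return count;
    }

    /** Returns the sorted 16-bit values if the bitmap is sparse enough for an array, else empty. */
    static Optional<char[]> toArrayIfSparse(long[] bitmap) {
        int card = cardinality(bitmap);
        if (card > ARRAY_MAX) {
            return Optional.empty(); // dense enough to stay a bitmap container
        }
        char[] values = new char[card];
        int n = 0;
        for (int i = 0; i < bitmap.length; i++) {
            long word = bitmap[i];
            while (word != 0) {
                values[n++] = (char) ((i << 6) + Long.numberOfTrailingZeros(word));
                word &= word - 1; // clear the lowest set bit
            }
        }
        return Optional.of(values);
    }
}
```

Applying a check like this when a container is finalised would prevent a writer tuned for bitmaps from leaving sparse bitmap containers in the result.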
It's not too late because nobody (even me) is using this yet, and optimiseForBitmaps should probably either not exist, or just call constantMemory. The idea of this interface was to simply create a generic type bound, and leave the implementations alone. toEfficientContainer isn't defined for all container types at the moment. But I think it should be an easy change to make.
Yes, that was my second question: why there is both constantMemory and optimiseForBitmaps. I'd remove optimiseForBitmaps if we still can.
We can issue a new release.
* generalise bitmap construction and expose heuristics
* remove need for one marker interface
* get rid of unnecessary Appender interface
* avoid wrap around for key, test writes to max key succeed
* add buffer test cases
* feed expected container size into array containers
* allow resetting of a writer
This PR generalises efficient construction of bitmaps. This introduces a few extra constructors and marker interfaces, which I expect need socialising amongst users. This breaks the interface of OrderedWriter yet again, but this is unlikely to affect any user but myself; no warnings or compiler errors are introduced anywhere else. The aim is to be able to make optimisations to bitmap build times without needing to modify source code within the library, by providing various toggles, and to make the choice between implementations simpler to switch between using generics.

* OrderedWriter becomes RoaringBitmapWriter because it can accept writes in any order
* add(long min, long max) to RoaringBitmapWriter
* addMany(int... data) to RoaringBitmapWriter
* Container and MappeableContainer with the introduction of a marker interface as a generic type bound
* RoaringArray and MutableRoaringArray via a marker interface
* RoaringArray, whether to partially sort data first, whether to expect mostly array, runs or bitmaps
* FastRankRoaringBitmap - it's possible FastRankRoaringBitmap could be properly integrated into FastAggregation and ParallelAggregation by adopting this approach to constructing bitmaps.

For example, to create a large, contiguous "existence bitmap"

To create a FastRankRoaringBitmap with constant memory during construction:

And to build a buffer bitmap: