-
Notifications
You must be signed in to change notification settings - Fork 268
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added 64-bit C++ version using std::map for the high-order bits #75
Conversation
This looks good to me. Clean code. Complete with unit tests. Maybe @fsaintjacques and @madscientist want to review before we pull this into master? |
c.c. @owenkaser |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cardinality counter of a "full" 64bits bitmap does not fit in uint64_t
, thus all signatures that references this concept must be changed to the non-standard uint128_t
. In other words, uint64_t
maximum value is 2^64 - 1 while a full bitmaps contains 2^64 active bits. With RLE and enough ram, it is definitively possible to construct such monstrosity.
We had the same problem, we started with a uint32_t
cardinality and moved on to uint64_t
.
Otherwise I also note inconsistencies in loop index types, a melting pot of int
, int32_t
, size_t
, ...
// we can iterate over all values using custom functions | ||
uint64_t counter = 0; | ||
r1.iterate(roaring_iterator_sumall64, &counter); | ||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you'd like me to remove the comment? it's emulating the one in the 32-bit test above
assert_true(r1.contains((uint64_t)14000000000000000500ull)); | ||
|
||
// compute how many bits there are: | ||
uint32_t cardinality = r1.cardinality(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this be a uint64_t
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it should
|
||
// if your bitmaps have long runs, you can compress them by calling | ||
// run_optimize | ||
uint32_t size = r1.getSizeInBytes(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, size could possibly take more than uint32_t bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed this part of the test as well
/** | ||
* Returns true if the bitmap is strict subset of the other. | ||
*/ | ||
bool isStrictSubset(const Roaring64Map &r) const { return isSubset(r) && !(*this == r); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't do a full equality check, it'll be really costly when the 2 are equals. You only want to check the size (since is subset would return true).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point
*/ | ||
bool operator==(const Roaring64Map &r) const { | ||
// we cannot use operator == on the map because either side may contain empty Roaring Bitmaps | ||
auto lhs_iter = roarings.cbegin(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is un-reviewable. I can't follow what it does.
In theory, if iterators are sorted, all you need is a all_of(zip(lhs, rhs), [](a, b) {a == b}))
. You can probably early return by validating if both TreeMap are of same size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a number of comments to improve readability. The maps do not remove empty roaring elements unless shrinkToFit() is called. Thus we need a more complicated equality that skips over any empties found.
void iterate(roaring_iterator64 iterator, void *ptr) const { | ||
std::for_each(roarings.begin(), roarings.cend(), | ||
[=](const std::pair<uint32_t, Roaring>& map_entry) { | ||
roaring_iterate64(map_entry.second.roaring, iterator, uint64_t(map_entry.first) << 32, ptr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pattern uint64_t(x) << 32
is used everywhere, make an inline function / macro for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added an inline function that stitches two 32-bit values back into a 64-bit one
*/ | ||
bool select(uint64_t rank, uint32_t *element) const { | ||
for (const auto& map_entry : roarings) { | ||
uint64_t sub_cardinality = (uint32_t)map_entry.second.cardinality(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uint64_t x = (uint32_t) expr;
is this on purpose, am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was supposed to be a (uint64_t) cast to keep Visual Studio from complaining
(Thanks for the thorough review.) I am not eager for us to use non-standard types. For example, I'd like for this C++ API to work more or less unchanged under Visual Studio or other compilers. And who wants to deal with non-standard types on a regular fashion? You can't even print these things easily! I think we should favor "standard types" as much as possible. The problem you identify is real, however. I am not dismissing your concern in the least. |
We really, really don't want to make the API hard to use with custom data types where we can avoid it. The case where the cardinality is 1<<64 is unique and can be optimized out at the application level. We don't want everyone to suffer because of this one case. The idea that we should be able to hold data structures as large as memory would allow is quickly invalidated... The following program will crash as a 32-bit binary when sizeof(size_t) is 4...
Try it... g++ -m32 -O3 -o toobig toobig.cpp
./toobig You should get The corresponding thing should happen on a 64-bit machine with a much larger data structure but, of obvious reasons I cannot test it. So we can add a test of the type "if computed cardinality is zero and the roaring bitmap is not empty, then throw an exception". It is not perfect but it is as good as what STL does. And we should document it, of course. Thoughts? |
So unless @fsaintjacques comes back with something better, I recommend you modify your API so that when the cardinality overflows (it is equal to zero but the bitmap is not empty) then you throw an exception. |
Great stuff @ksajme. Regarding the cardinality issue @fsaintjacques; I see your point about that. However, while it might be possible to create a completely full roaring32 in a reasonable size using RLE, this implementation uses a std::map to hold the individual 32bit roaring structures without any RLE etc. in the upper 32bits. I don't know how large a full roaring32 is, but an empty one is 44 bytes and I doubt it's smaller than that. So, just the memory needed to store 2^32 full roaring32 must be at least 176G (4G * 44), and that's not counting whatever overhead is required for that many std::map elements. A different implementation could definitely reduce this significantly. I don't know if it's worthwhile to force the API on this implementation, given that it requires non-standard types, to avoid changing it if/when a new fancier implementation comes along. I wrote some ideas for how to resolve this but @lemire already mentioned them... the only other one would be to return a structure instead of a uint64_t which contains a cardinality value and an extra boolean that means "full", but given all the above I have a hard time justifying that. |
@fsaintjacques thank you for the feedback. I'll address the issues that you found. Interesting off-by-one problem. Visual Studio doesn't have the 128 bit numbers we need but you came up with a good solution @lemire . I'll implement it. I forgot to add the 64-bit header to the amalgamation and the demo. I'll go ahead and do that. |
Agreed. With run compression, a full 32-bit roaring might easily fit in 64 bytes. So we are talking about maybe 256 GB to create a full 64-bit roaring bitmap using the approach indicated here. So, yes, it is huge. That would not be the right way to go about it. You don't want to be using hundreds of gigabytes for something like this. |
@lemire yes... I think the first person who needs something like that should be drafted into enhancing this implementation :-). This implementation will run into problems in situations where you set bits whose values are scattered across a significant portion of the upper 32bit space. An implementation that handles that more gracefully would be very cool, no doubt... but an enormous amount of effort I'd imagine. |
Exception might be overkill, I'd prefer a struct with a bool that says full. Much less intrusive. |
The advantage of an exception (to me) is that for people who are confident their machine will keel over long, long before they have to worry about it (which is most of us) they can just ignore the exception, or catch it in main() or whatever for sanity. Personally I'd prefer to deal with an exception in the vanishingly unlikely case that you really do have a completely full roaring64, than pay the price for returning a 12+byte structure on every cardinality check and the added complexity of dealing with a returned structure and checking two different values. Plus as @lemire points out this behavior matches other parts of the STL. Another option would be to provide an additional non-throwing cardinality method, for those who felt it might be desirable. In that case they'd need to check "isempty" if the cardinality was 0, to see whether it was really 0 bits set or if it was 2^64 bits set. Or something like that. Of course this can be easily wrapped by the user themselves if they need it. Anyway that's my $0.02. I'll leave it up to you all to sort it. One note: if you do throw would it be worthwhile to throw a roaring-specific subclass of std::length_error instead of the standard exception itself... just to make it simpler to catch? Not sure, just a consideration. |
I'll throw an exception then. A full data structure will be an exceptional circumstance due to the current limitations of physical memory. Meanwhile the common case will be unaffected with an unusual return type. |
Could you add a So one could write exception-free code like so...
|
@ksajme Please add yourself to the AUTHORS file. |
* responding to issues found in code review * handling cardinality larger than a 64-bit unsigned integer * updating amalgamation
Seems that there are compilation errors having to do with https://travis-ci.org/RoaringBitmap/CRoaring/builds/195387727 We probably can solve this with something like...
|
You're right, I think this is: Which is a lack of emplace for maps in gcc's C++11 libstdc++ prior to version 4.8. Although Travis is using Clang, I'm guess it's building against the gcc STL. I wrote a macro to detect old versions of the gcc STL and use insert instead. Let's see if Travis agrees with me... |
* Attempt to fix compilation on Travis CI
Ah. Ah. Tests are passing now. |
@fsaintjacques I trust that you feel that it is acceptable now? |
I am wondering, would it be possible (and easy) to use this C++ code in the C library? Or would it be better to reimplement it in C from scratch? |
@Ezibenroc I think it would be a lot easier to code a 64-bit data structure in C now that we have a fully worked out example in C. All one needs is an ordered int-to-pointer map in C. I am sure we can find a decent and lightweight C implementation we can adapt. |
Go ahead, thanks for the contribution @ksajme . |
❤️ |
a 64-bit C++ implementation using the TreeMap methodology described in the http://r-libre.teluq.ca/930/1/Roaring64bits.pdf paper.