Added 64-bit C++ version using std::map for the high-order bits #75

ksajme · 2017-01-25T16:11:23Z

a 64-bit C++ implementation using the TreeMap methodology described in the http://r-libre.teluq.ca/930/1/Roaring64bits.pdf paper.

lemire · 2017-01-25T16:28:37Z

This looks good to me. Clean code. Complete with unit tests.

Maybe @fsaintjacques and @madscientist want to review before we pull this into master?

lemire · 2017-01-25T16:28:46Z

c.c. @owenkaser

fsaintjacques

The cardinality counter of a "full" 64bits bitmap does not fit in uint64_t, thus all signatures that references this concept must be changed to the non-standard uint128_t. In other words, uint64_t maximum value is 2^64 - 1 while a full bitmaps contains 2^64 active bits. With RLE and enough ram, it is definitively possible to construct such monstrosity.

We had the same problem, we started with a uint32_t cardinality and moved on to uint64_t.

Otherwise I also note inconsistencies in loop index types, a melting pot of int, int32_t, size_t, ...

fsaintjacques · 2017-01-25T17:09:01Z

tests/cpp_unit.cpp

+    // we can iterate over all values using custom functions
+    uint64_t counter = 0;
+    r1.iterate(roaring_iterator_sumall64, &counter);
+    /**


Remove this.

you'd like me to remove the comment? it's emulating the one in the 32-bit test above

fsaintjacques · 2017-01-25T17:11:14Z

tests/cpp_unit.cpp

+    assert_true(r1.contains((uint64_t)14000000000000000500ull));
+
+    // compute how many bits there are:
+    uint32_t cardinality = r1.cardinality();


shouldn't this be a uint64_t?

fsaintjacques · 2017-01-25T17:11:57Z

tests/cpp_unit.cpp

+
+    // if your bitmaps have long runs, you can compress them by calling
+    // run_optimize
+    uint32_t size = r1.getSizeInBytes();


Same here, size could possibly take more than uint32_t bytes.

fixed this part of the test as well

fsaintjacques · 2017-01-25T17:33:02Z

cpp/roaring64map.hh

+    /**
+    * Returns true if the bitmap is strict subset of the other.
+    */
+    bool isStrictSubset(const Roaring64Map &r) const { return isSubset(r) && !(*this == r); }


Don't do a full equality check, it'll be really costly when the 2 are equals. You only want to check the size (since is subset would return true).

fsaintjacques · 2017-01-25T17:42:39Z

cpp/roaring64map.hh

+     */
+    bool operator==(const Roaring64Map &r) const {
+        // we cannot use operator == on the map because either side may contain empty Roaring Bitmaps
+        auto lhs_iter = roarings.cbegin();


This method is un-reviewable. I can't follow what it does.

In theory, if iterators are sorted, all you need is a all_of(zip(lhs, rhs), [](a, b) {a == b})). You can probably early return by validating if both TreeMap are of same size.

I added a number of comments to improve readability. The maps do not remove empty roaring elements unless shrinkToFit() is called. Thus we need a more complicated equality that skips over any empties found.

fsaintjacques · 2017-01-25T17:44:56Z

cpp/roaring64map.hh

+    void iterate(roaring_iterator64 iterator, void *ptr) const {
+        std::for_each(roarings.begin(), roarings.cend(),
+            [=](const std::pair<uint32_t, Roaring>& map_entry) {
+                roaring_iterate64(map_entry.second.roaring, iterator, uint64_t(map_entry.first) << 32, ptr);


The pattern uint64_t(x) << 32 is used everywhere, make an inline function / macro for it.

I added an inline function that stitches two 32-bit values back into a 64-bit one

fsaintjacques · 2017-01-25T17:46:18Z

cpp/roaring64map.hh

+     */
+    bool select(uint64_t rank, uint32_t *element) const {
+        for (const auto& map_entry : roarings) {
+            uint64_t sub_cardinality = (uint32_t)map_entry.second.cardinality();


uint64_t x = (uint32_t) expr; is this on purpose, am I missing something?

this was supposed to be a (uint64_t) cast to keep Visual Studio from complaining

lemire · 2017-01-25T18:47:15Z

@fsaintjacques

(Thanks for the thorough review.)

I am not eager for us to use non-standard types. For example, I'd like for this C++ API to work more or less unchanged under Visual Studio or other compilers. And who wants to deal with non-standard types on a regular fashion? You can't even print these things easily! I think we should favor "standard types" as much as possible.

The problem you identify is real, however. I am not dismissing your concern in the least.

lemire · 2017-01-25T19:27:18Z

@fsaintjacques
@ksajme

We really, really don't want to make the API hard to use with custom data types where we can avoid it. The case where the cardinality is 1<<64 is unique and can be optimized out at the application level. We don't want everyone to suffer because of this one case.

The idea that we should be able to hold data structures as large as memory would allow is quickly invalidated... The following program will crash as a 32-bit binary when sizeof(size_t) is 4...

#include <iostream>
#include <vector>


int main() {
  std::cout << sizeof(size_t) << std::endl;
  std::vector<bool> x;
  for(uint64_t z = 0 ; z < UINT64_C(1) << 32; ++z)
    x.push_back(0);
  std::cout << x.size() << std::endl;
}

Try it...

g++ -m32 -O3 -o toobig toobig.cpp
./toobig

You should get std::length_error.

The corresponding thing should happen on a 64-bit machine with a much larger data structure but, of obvious reasons I cannot test it.

So we can add a test of the type "if computed cardinality is zero and the roaring bitmap is not empty, then throw an exception".

It is not perfect but it is as good as what STL does.

And we should document it, of course.

Thoughts?

lemire · 2017-01-25T19:30:08Z

@ksajme

So unless @fsaintjacques comes back with something better, I recommend you modify your API so that when the cardinality overflows (it is equal to zero but the bitmap is not empty) then you throw an exception.

madscientist · 2017-01-25T19:36:15Z

Great stuff @ksajme. Regarding the cardinality issue @fsaintjacques; I see your point about that. However, while it might be possible to create a completely full roaring32 in a reasonable size using RLE, this implementation uses a std::map to hold the individual 32bit roaring structures without any RLE etc. in the upper 32bits. I don't know how large a full roaring32 is, but an empty one is 44 bytes and I doubt it's smaller than that. So, just the memory needed to store 2^32 full roaring32 must be at least 176G (4G * 44), and that's not counting whatever overhead is required for that many std::map elements.

A different implementation could definitely reduce this significantly. I don't know if it's worthwhile to force the API on this implementation, given that it requires non-standard types, to avoid changing it if/when a new fancier implementation comes along.

I wrote some ideas for how to resolve this but @lemire already mentioned them... the only other one would be to return a structure instead of a uint64_t which contains a cardinality value and an extra boolean that means "full", but given all the above I have a hard time justifying that.

ksajme · 2017-01-25T19:43:22Z

@fsaintjacques thank you for the feedback. I'll address the issues that you found. Interesting off-by-one problem.

Visual Studio doesn't have the 128 bit numbers we need but you came up with a good solution @lemire . I'll implement it.

I forgot to add the 64-bit header to the amalgamation and the demo. I'll go ahead and do that.

lemire · 2017-01-25T19:43:41Z

@madscientist

Agreed.

With run compression, a full 32-bit roaring might easily fit in 64 bytes. So we are talking about maybe 256 GB to create a full 64-bit roaring bitmap using the approach indicated here. So, yes, it is huge. That would not be the right way to go about it. You don't want to be using hundreds of gigabytes for something like this.

madscientist · 2017-01-25T19:55:28Z

@lemire yes... I think the first person who needs something like that should be drafted into enhancing this implementation :-). This implementation will run into problems in situations where you set bits whose values are scattered across a significant portion of the upper 32bit space. An implementation that handles that more gracefully would be very cool, no doubt... but an enormous amount of effort I'd imagine.

fsaintjacques · 2017-01-25T20:01:05Z

Exception might be overkill, I'd prefer a struct with a bool that says full. Much less intrusive.

madscientist · 2017-01-25T20:17:23Z

The advantage of an exception (to me) is that for people who are confident their machine will keel over long, long before they have to worry about it (which is most of us) they can just ignore the exception, or catch it in main() or whatever for sanity. Personally I'd prefer to deal with an exception in the vanishingly unlikely case that you really do have a completely full roaring64, than pay the price for returning a 12+byte structure on every cardinality check and the added complexity of dealing with a returned structure and checking two different values.

Plus as @lemire points out this behavior matches other parts of the STL.

Another option would be to provide an additional non-throwing cardinality method, for those who felt it might be desirable. In that case they'd need to check "isempty" if the cardinality was 0, to see whether it was really 0 bits set or if it was 2^64 bits set. Or something like that. Of course this can be easily wrapped by the user themselves if they need it.

Anyway that's my $0.02. I'll leave it up to you all to sort it.

One note: if you do throw would it be worthwhile to throw a roaring-specific subclass of std::length_error instead of the standard exception itself... just to make it simpler to catch? Not sure, just a consideration.

ksajme · 2017-01-25T21:20:57Z

I'll throw an exception then. A full data structure will be an exceptional circumstance due to the current limitations of physical memory. Meanwhile the common case will be unaffected with an unusual return type.

lemire · 2017-01-25T21:23:33Z

@ksajme

Could you add a isFull() method? In this manner, the user could first check whether the roaring is full... and if so then he/she knows that the cardinality method will not return the right value. The call to isFull() could be dirt cheap in practice because it could start by checking the size of the map. In all practical cases, the map would be too small to generate a full bitmap.

So one could write exception-free code like so...

uint128_t card = b.isFull() ?  1<< 64 : b.cardinality()

lemire · 2017-01-25T21:26:59Z

@ksajme Please add yourself to the AUTHORS file.

* responding to issues found in code review * handling cardinality larger than a 64-bit unsigned integer * updating amalgamation

lemire · 2017-01-26T01:14:49Z

@ksajme

Seems that there are compilation errors having to do with emplace:

https://travis-ci.org/RoaringBitmap/CRoaring/builds/195387727

We probably can solve this with something like...

#ifdef __cplusplus >= 201103L
// do something wild
#else
// do boring stuff
#endif

ksajme · 2017-01-26T22:16:48Z

You're right, I think this is:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=44436

Which is a lack of emplace for maps in gcc's C++11 libstdc++ prior to version 4.8. Although Travis is using Clang, I'm guess it's building against the gcc STL.

I wrote a macro to detect old versions of the gcc STL and use insert instead. Let's see if Travis agrees with me...

* Attempt to fix compilation on Travis CI

lemire · 2017-01-26T22:29:24Z

Ah. Ah. Tests are passing now.

lemire · 2017-01-27T04:43:15Z

@fsaintjacques I trust that you feel that it is acceptable now?

Ezibenroc · 2017-01-27T09:28:29Z

I am wondering, would it be possible (and easy) to use this C++ code in the C library? Or would it be better to reimplement it in C from scratch?

lemire · 2017-01-27T12:49:55Z

@Ezibenroc I think it would be a lot easier to code a 64-bit data structure in C now that we have a fully worked out example in C. All one needs is an ordered int-to-pointer map in C. I am sure we can find a decent and lightweight C implementation we can adapt.

fsaintjacques · 2017-01-27T16:18:44Z

Go ahead, thanks for the contribution @ksajme .

lemire · 2017-01-27T16:25:31Z

❤️

Added 64-bit C++ version using std::map for the high-order bits

cee6714

lemire mentioned this pull request Jan 25, 2017

Support for bitmaps of size > 2^32 in C #1

Closed

fsaintjacques reviewed Jan 25, 2017

View reviewed changes

Added 64-bit C++ version using std::map for the high-order bits

2f2d5af

* responding to issues found in code review * handling cardinality larger than a 64-bit unsigned integer * updating amalgamation

Added 64-bit C++ version using std::map for the high-order bits

bc9ba83

* Attempt to fix compilation on Travis CI

lemire merged commit 72d4759 into RoaringBitmap:master Jan 27, 2017

ksajme deleted the roaring64map branch January 30, 2017 09:39

richardstartin mentioned this pull request Aug 12, 2017

Introduce RoaringTreeMap to handle a bitmap holding any long value RoaringBitmap/RoaringBitmap#176

Merged

gaodayue mentioned this pull request Jan 15, 2020

Support 64-bits integers as elements for BITMAP type apache/doris#2771

Closed

tleasure mentioned this pull request Aug 10, 2024

Support for 64-bit aviggiano/redis-roaring#110

Open

Added 64-bit C++ version using std::map for the high-order bits #75

Added 64-bit C++ version using std::map for the high-order bits #75

Conversation

ksajme commented Jan 25, 2017

lemire commented Jan 25, 2017

lemire commented Jan 25, 2017

fsaintjacques left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lemire commented Jan 25, 2017

lemire commented Jan 25, 2017

lemire commented Jan 25, 2017

madscientist commented Jan 25, 2017

ksajme commented Jan 25, 2017

lemire commented Jan 25, 2017

madscientist commented Jan 25, 2017

fsaintjacques commented Jan 25, 2017

madscientist commented Jan 25, 2017

ksajme commented Jan 25, 2017

lemire commented Jan 25, 2017

lemire commented Jan 25, 2017

lemire commented Jan 26, 2017

ksajme commented Jan 26, 2017

lemire commented Jan 26, 2017

lemire commented Jan 27, 2017

Ezibenroc commented Jan 27, 2017

lemire commented Jan 27, 2017

fsaintjacques commented Jan 27, 2017

lemire commented Jan 27, 2017