Skip to content

Implement support for 64-bit integers #109

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
lemire opened this issue May 12, 2016 · 12 comments
Closed

Implement support for 64-bit integers #109

lemire opened this issue May 12, 2016 · 12 comments

Comments

@lemire
Copy link
Member

lemire commented May 12, 2016

We have several implementations of Roaring over 64-bit integers in separate repositories...

https://bitbucket.org/samytto/lazyroaring64-bits

https://bitbucket.org/samytto/roaring64bits2levels

https://bitbucket.org/samytto/roaringtreemap

... as well as an evaluation...

https://bitbucket.org/samytto/datastructures64-bitints

Work is needed to bring the best of these implementations in the main RoaringBitmap library.

@luoguohao
Copy link

thx,that`s what i want.

@JasonHuangHuster
Copy link

hi , lemire, i 've checked the "Roaring over 64-bit" code , it support long quite good , but it's didn't realize any Serialization and Deserialization which eagerly needed in my project .. can use suggest any other java BitMap which support 64 bit ? BestWishes ! 😃

@lemire
Copy link
Member Author

lemire commented Jul 14, 2016

@JasonHuangHuster

I do not. Sorry.

However, you can fork the code and build up the features you need.

@ponkin
Copy link

ponkin commented Nov 16, 2016

+1 for this feature in RoaringBitmap. Now we have to create RoaringBitmap[] to hold huge BloomFilters( up to 10 billions bits)

@lemire
Copy link
Member Author

lemire commented Nov 16, 2016

Warning: this issue is marked as "help wanted" meaning that nobody is working on it.

@richardstartin
Copy link
Member

richardstartin commented Mar 30, 2017

@lemire what about using a long to store the most significant bits in RoaringArray?

Using a long costs 48 bits per key on the short required for 32 bit storage, and allows storage of up to 80 bit integers. A cost of 48 bits per key is competitive with any solution requiring an object reference (32 bits on a small heap, 64 bits > 32GB heap). Thoughts?

Does an admissible solution require that RoaringBitmaps storing only 32 bit integers do not get larger or slower under any circumstances?

@lemire
Copy link
Member Author

lemire commented Mar 30, 2017

@openkappa Yes, what you describe is one possibility out of a few that are quite reasonable. We have a paper http://r-libre.teluq.ca/930/ that reviews these possibilities and there are some experiments. Sadly the paper is in French. There is more work to be done on this front.

@richardstartin
Copy link
Member

@lemire It sounds like this isn't a decision you're likely to rush into, is this perhaps an open research topic in its early stages? If this functionality is required in a short time-frame, especially by anyone more interested in the "D" than "R" of "R&D", is it recommended to take a view on which implementation is best (for a particular definition of "good") and fork? I noticed the rust implementation went for RoaringTreemap (RoaringBitmap/roaring-rs#38) - is this the preferred data structure?

For anyone else who prefers English, here's what I gleaned from the paper about the three proposals:

  • RoaringTreemap is an implementation of a red-black tree, much like java.util.TreeMap. Longs are not stored directly but broken into 32 most significant bits and 32 least significant bits. A node in a RoaringTreemap consists of a 32 bit key and a RoaringBitmap. The key consists of the most significant 32 bits, and the RoaringBitmap stores the least significant 32 bits as unsigned integers. RoaringTreemap utilises an application of prefix compression on the 32 most significant bits of each node, which can save up to 32 * (2 ^ 32 - 1) bits per node.
  • Roaring2Level uses a 2 level structure similar to RoaringBitmap. This is basically what I asked about in my last post (long keys storing the 48 MSBs) with the innovation that the wasted 16 bits in the key are used to store the cardinality, which then can be moved out of the container.
  • LazyRoaring uses three levels of indexing. The first level is an int[] storing the 32 MSBs. The second level is a short[][] where each inner array stores bits 33 to 48. The third array consists of containers. Each entry in the first level int[] points to one and only one short[] in the second level. Each container is pointed to by one and only one short in the second level inner arrays.

The implementations were benchmarked on some synthetic data over a range of densities generated from a uniform distribution and Zipfian distribution separately. Compression, performance of union, intersection and insertion were measured for each data set.

For uniform data each proposal produces a similar compression ratio. Compression ratio across a range of densities is not presented for Zipfian data. Each implementation is similar for intersection performance, for both Zipfian and uniform data. LazyRoaring and Roaring2Level out perform RoaringTreemap for union, for all densities and both uniform and Zipfian data. RoaringTreemap and Roaring2Level significantly outperform LazyRoaring for insertion; Roaring2Level is better for high density uniformly distributed data; results for Zipfian data are not presented.

@lemire
Copy link
Member Author

lemire commented Mar 30, 2017

@openkappa

If this functionality is required in a short time-frame, especially by anyone more interested in the "D" than "R" of "R&D", is it recommended to take a view on which implementation is best (for a particular definition of "good") and fork?

A fork is always possible, but I wish people wouldn't do this.

Instead, one could probably implement RoaringTreemap in a weekend as an extension of the existing RoaringBitmap library.

This was done in the past by Samy Chambi. His code might still be around on bitbucket. So we know that it is doable. And because we had benchmarks, we know that the results are likely to be reasonable.

The same thing has been done in CRoaring: someone implemented a 64-bit version of Roaring by extending the 32-bit version.

Forks lead to effort duplication. By building on the existing library, we can minimize duplication and maximize code reuse.

It sounds like this isn't a decision you're likely to rush into

Roaring is a community driven project.

Look at the CRoaring library, for example, it supports 64-bit integers using one model. This happened because people with the talent, time and interest push a solution forward with a willingness to maintain the end result for the duration.

is this perhaps an open research topic in its early stages?

I wouldn't say so. I don't think that there are deep computer science problems. It is a matter of producing quality code, testing it and so forth.

In open source projects, if something is not happening, it might simply be because nobody showed enough interest.

@richardstartin
Copy link
Member

@lemire Sure, I get the community aspect. I ask because it's not clear to me if there is consensus on the preferred data structure - is there? The paper indicates that there's a tradeoff between Roaring2Level and RoaringTreemap depending on whether you care about inserts or treat bitmaps as immutable and just want query performance. If someone goes off one weekend and unilaterally chooses to implement one or the other, then it might not be the best outcome, even if it's better than not supporting longs at all.

@lemire
Copy link
Member Author

lemire commented Mar 31, 2017

@openkappa

If someone goes off one weekend and unilaterally chooses to implement one or the other, then it might not be the best outcome, even if it's better than not supporting longs at all.

I sort of gave you an answer: The strong argument in favor of RoaringTreemap is that it is easier to implement.

Can something else be implemented and end up being better for some use cases? Sure.

Now, here is the thing: with this sort of work, there will never be one solution that is always best in all cases. I mean, obviously you can micro-optimize one given design, and this is win-win... but there are design choices to be made and some choices will favor one type of use cases over others. That's just how it is.

It is slightly frustrating because whatever you implement, someone will come up with a use case that makes you look bad.

But then there is something of a fix. What you can do is setup a competition against some reasonable alternatives and measure the results in a use case you like. So people can say "in my use case, you work is useless", and then you can say "sure, but look at this other example".

Watch my recent talk where I sort of do this...

https://www.youtube.com/watch?v=ubykHUyNi_0

Finally, there is another point to consider: code that is correct and reasonably efficient always win over code that might theoretically exist but that nobody has coded.

@lemire
Copy link
Member Author

lemire commented Aug 29, 2017

It is now supported.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants