Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for bitmaps of size > 2^32 in C #1

Open
fsaintjacques opened this issue Jan 4, 2016 · 12 comments
Open

Support for bitmaps of size > 2^32 in C #1

fsaintjacques opened this issue Jan 4, 2016 · 12 comments

Comments

@fsaintjacques
Copy link
Member

In OLAP workload, 2^32 rows is quickly reached. This limitation impedes the use of roaring bitmaps as bitmap indices on such system.

It seems the implementation (and the paper) are strongly built on 2^32 assumption. What are the implication of growing the number of buckets vs the size of a bucket?

@lemire
Copy link
Member

lemire commented Jan 4, 2016

@fsaintjacques

  1. It is "easy" to build a 64-bit data structure on top a 32-bit one. Put the Roaring bitmaps into a tree with 32-bit keys. Samy Chambi has done the work and will have a chapter in his thesis on this topic. There are a few interesting design issues. This work should be published in 2016. Because it can be built on top of a 32-bit data structure, then it is not an issue. It is also possible to build a custom 64-bit data structure, but, again, the 16-bit containers can still prove useful as-is. So supporting 64-bit integers can be done separately without harm.
  2. Roaring is used in analytical systems like Druid or Apache Kylin that do scale up to very large data sets, while using only 32 bits. That's because 2^32 is a large number and it often makes sense to partition the problem well before you get to such counts. You probably want to partition for other reasons... such as parallelism.
  3. If you do have a practical use case where, for example, hash sets are not an obviously good choice, then I would be very interested.

@lemire
Copy link
Member

lemire commented Jan 4, 2016

I'm going to leave this open.

@lemire
Copy link
Member

lemire commented May 12, 2016

We have several implementations of Roaring over 64-bit integers in separate repositories...

https://bitbucket.org/samytto/lazyroaring64-bits

https://bitbucket.org/samytto/roaring64bits2levels

https://bitbucket.org/samytto/roaringtreemap

... as well as an evaluation...

https://bitbucket.org/samytto/datastructures64-bitints

@madscientist
Copy link
Contributor

All the implementations above appear to be in Java; any news on a 64bit version for C/C++? We have various IDs which are 64bits (32 bits is a lot, yes, but when you're talking 100's of transactions / sec and services running for weeks/months, and being stored durably as well, it goes by pretty quickly...). We'll look into a two-layer approach. You mentioned a year ago that Samy Chambi was working on doing this and might publish something; did that happen and if so is there someplace to get a copy?

Cheers!

@lemire
Copy link
Member

lemire commented Jan 5, 2017

@madscientist

any news on a 64bit version for C/C++?

Not that I know. I have marked the issue as "help wanted" just now.

You mentioned a year ago that Samy Chambi was working on doing this and might publish something; did that happen and if so is there someplace to get a copy?

Yes. There is a paper in French with experimental results as well as links to (Java) implementations. http://r-libre.teluq.ca/930/1/Roaring64bits.pdf

@lemire
Copy link
Member

lemire commented Jan 25, 2017

We have a relevant PR : #75

@lemire
Copy link
Member

lemire commented Jan 27, 2017

With the merger of PR #75, this issue is resolved for the c++ layer.

@lemire lemire changed the title Support for bitmaps of size > 2^32 Support for bitmaps of size > 2^32 in C Jan 27, 2017
@lemire
Copy link
Member

lemire commented Jan 27, 2017

I think that the issue remains open for C users.

@K2
Copy link

K2 commented Jul 23, 2017

@lemire I just wrote a C++/CLI wrapper that allow's .NET callers (C#, F# etc..) to use the 64 bit CPP interface. It's able to be amalgamated since I used inline #pragma define's for the CLI mode. It requires a small change to the build settings in visual studio that specifies mixed mode (only for the shim.cpp). I can send you a pull request if you'd like, I'm not really great with cmake but can include a project settings file if that helps. The nice thing about mixed mode is that there is only the one DLL instead of at least two with other thunk mechanisms.

@lemire
Copy link
Member

lemire commented Jul 23, 2017

I'd encourage you to issue a PR.

@lemire
Copy link
Member

lemire commented Aug 9, 2017

@K2 My main concern at this point is to make sure that your code can be used by others in a reasonable manner. It is fine to ignore CMake... but I am concerned about dropping undocumented mysterious files in the middle of a medium-size project. Who is going to find these files and know how to use them?

@K2
Copy link

K2 commented Aug 9, 2017

Sure I can take a look at CMake if you'd like or if you have other suggestions I'm open to it, maybe another .NET (the RogueException/CRoaring.Net is very good with 32bit) for 64.

Similar to those who started this thread (and others Tornhoof/RoaringBitmap#2) regarding 64 implementations, I was looking over the various implementations for a 64 bit compatible version before going out and making my own when I noticed the CPP version and figured that it'd be nice if that were exported into .NET since it's an existing and tested implementation. The CLI code I wrote is almost (exception move operator, I could do better at simulating that;) 1:1 with the Roaring64Map in terms of interfaces.

To me it seemed less error prone to add 64 bit .NET support by extending an existing underlying native implementation, I'd be assured not to inject any serious bugs. An added bonus was that as the Roaring64Map class performance improves it'd automatically boost this interface.

I think in a perfect world if/when the underlying C interfaces are extended to 64bit then RogueException's exemplary work would do well for any other .NET language. I'd tend to defer to @RogueException to extend his code on his timeline, moving to 64 isn't quite as easy to-do and maintain good performance since the *Many operators essentially turn into enumerated forms over the inputs that slows everything down a lot, i.e. Roaring64Map::AddMany. If I were to rewrite those methods it'd be a lot nicer to sort of determine a *Many threshold to sort/bucket the inputs and use the underlying C *many call's to speed them up....

I can fork this off or invite @RogueException for suggestions or @ksajme or anybody else...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants