redesign how containers are stored: move typecodes in tagged pointer, replace pointers by actual container values, and so on (#5)
Sounds like a good idea, but I think that this needs to be synced with the work that @owenkaser is doing.
Unfortunately, I don't expect to do anything more for at least a week. If we borrow the high-order 16 bits of the pointer for the key, do we know how [...]? Also, it might be nicer (cache-wise) to conduct most of the binary search [...]
I also really like the idea of hiding away the type-code parameters. Even better if we can find a way to do it that is elegant. @fsaintjacques: want to tackle this optimization?
I am guessing that we can check the addressable range using the cpuid instruction. So you could do it, but add a check when you build the code. I am more concerned with the loss of code clarity and the potential for (small) performance losses.
Looks like 16 TB of RAM in Java is a thing https://www.linkedin.com/pulse/javaone-2015-operating-16tb-jvm-antoine-chambille
It's worth noting that with paging, the physical address space of the whole [...]. Another possibility we might consider would be dropping down to 32-bit [...]. I'd be interested in talking over the potential benefit of this approach if [...].
I strongly agree with this: anything we are going to search over should [...].
Looks like yes, CPUID with 80000008H in EAX will give the number of [...]. --nate
As I said initially, the [...]
What @nkurz suggests, with 32-bit offsets, is appealing, and I wonder whether this design issue (which goes beyond the current issue) shouldn't be discussed seriously. What @owenkaser implemented is perfectly reasonable, but there are other interesting alternatives. Why don't we pack the container structs in one array, making sure that they all have the same size (possibly using padding)? This is a variation on Nate's idea, but where the offsets are just flat. Think about summing the cardinality of all of the containers or doing a select: if all the structs are stored consecutively, you avoid cache misses. In the current code, computing the cardinality of a roaring bitmap means accessing each and every container... possibly generating cache issues. Another thing to consider is supporting copy-on-write at the container level.
Another issue is that with memory-mapped files, the pointer might not be word-aligned. It seems to me that we would win the most by "standardizing the container size". They are all small, and all made of pointers to other data elsewhere. So, really, the pointer-to-container thing is a waste.
The 3 structs are of the following form: [...]
This fits in 128 bits (capacity can probably be shrunk with further limitations). The offset strategy will make it very cumbersome to resize at will the [...]
Right. Details aside, we can have the struct fit in a nice 128 bits, as you say. It is a nice round number. There will be some waste, but if you save a pointer, you save 64 bits... We have enough space in there to have the typecodes... for example, you allocate 32 bits to cardinality, but none of the containers can store more than 1<<16 values... so there is a good margin.
Are you thinking of maintaining a containers array per bitmap (perhaps kept ordered by key), or one shared by all bitmaps (presumably disordered, and using some freelist approach)?
I was thinking about one container array. If you are to implement copy-on-write, then you'd have to copy the container struct... but it is only 128 bits. I am almost certain it would use less memory (saving the pointer) and be slightly faster.
If we limit capacity to 2^16 - 1, we can shrink it to uint16_t and properly use uint8_t without bit hacks, but then the number of elements in the array might not be a multiple of [...]. With containers in an array, what do you think of pre-allocating in slabs of 4k?
Slabs have the nice property of being multiples of [...]
Note that it doesn't change much from a standard array, but the underlying memory allocator probably allocates in chunks of 4k.
Is there any reason to think that copy-on-write would usually be a big win? My impression is that it would have to come with awkward machinery for reference counting etc., so that we know when to free memory. Your Go implementation uses the language's GC, right?
I would skip the COW mechanism for the first implementation.
@lemire @fsaintjacques
It is not "my" Go implementation, but yes, it relies on the language's GC. I do not plan to implement COW in CRoaring. I am just playing devil's advocate so that we do not make the design less flexible than it needs to be.
In many applications, you only have a handful of containers per bitmap. Allocating them in blocks of 256 would be terribly wasteful. It seems much more prudent to allocate arrays in the usual manner, while keeping room for further tuning through customized memory allocation as an extra feature.
I think that going with 128-bit containers is the wisest course of action, and that's the one you suggested. Right now, we are keeping it close to the Java/Go implementations, but we might part ways at some point... it would be interesting then to have enough "room" to support other container types.
Shall we define only one struct and use typedefs? E.g.:

```c
struct container_s {
    int32_t cardinality;
    uint32_t capacity;
    void *data;
};

typedef struct container_s bitset_container_t;
typedef struct container_s array_container_t;
typedef struct container_s run_container_t;
```

I'll open a branch for this experimentation.
Is it possible to do the job with a union? |
How would you do the union? I foresee a declaration-order ambiguity if the tag is in the pointer to the data. Do you envision something like:

```c
/**
 * Requires an implicit ordering in the definitions of the container structs. This is required
 * to find the tag type. But sizeof(container_t) will always be correct and explicit.
 */
union container_u {
    struct bitset_container_t bitset;
    struct array_container_t array;
    struct run_container_t run;
};
```
I don't know how to make it work but the piece of code is elegant, isn't it? |
The original proposition is succinct and explicit but might make future expansion harder. The union version is friendly to future expansion but induces a dangerous ordering (and alignment!) dependency; that could be solved by having an explicit enum/tag, but then we lose the 128-bit packing per container type.
I'm looking over the run code, and I'll have to go over it again. I spotted a few bugs.
@fsaintjacques We now have four container types, including the COW containers.
We need to revisit this point, as I think that CRoaring still leaves performance on the table.
This issue is still outstanding. Instead of using a pointer to a container, we should be using a union type. The current design works, but it is unnecessarily inefficient.
Although it is a known technique used by projects like V8, the very fact that it is commonly exploited means there is a real chance that some layer of a system already relies on it: weirder compilers and transpilers may leverage it in their C implementation itself, or be built on a runtime stack that uses it somewhere. So I'd be wary of doing this if your codebase is not itself the platform. A good generic C bitset library should conform to the standard, to make it usable anywhere. That said, as long as it can be turned off with a #define somehow, it's not a bad idea if it doesn't complicate other things too much. Though I'd suggest prioritizing other reorganizations first.
Another idea that may reduce indirection is to optimize short arrays and runs by using the pointer itself as storage:

```c
typedef struct run_container_s {
    uint32_t n_runs;
    uint32_t capacity;
    union {
        rle16_t *runs;
        rle16_t s_runs[2]; // 2 32-bit runs per pointer.
    };
} run_container_t;

// Assumes i is in range.
rle16_t get_run_at_index(run_container_t *r, int i) {
    if (r->capacity > 2) return r->runs[i];
    return r->s_runs[i];
}
```

That way things like full ranges won't require any kind of heap-memory overhead, whether it be the accounting or the indirection itself. It should be more cache-friendly as well.
Even better, since this way capacity is never below [...]:

```c
struct generic_container_s {
    int32_t cardinality;
    uint8_t typecode;
    uint8_t reserved1;
    uint16_t reserved2; // uint16_t keeps every variant at the same 16 bytes
    void *payload;
};

struct bitset_container_s {
    int32_t cardinality;
    uint8_t typecode;
    uint8_t unused1;
    uint16_t unused2;
    uint64_t *words;
};

struct array_container_s {
    int32_t cardinality;
    uint8_t typecode;
    uint8_t unused;
    uint16_t capacity;
    union {
        uint16_t *array;
        uint16_t s_array[sizeof(void*)/sizeof(uint16_t)];
    };
};

struct run_container_s {
    int32_t n_runs;
    uint8_t typecode;
    uint8_t unused;
    uint16_t capacity;
    union {
        rle16_t *runs;
        rle16_t s_runs[sizeof(void*)/sizeof(rle16_t)];
    };
};

union container_u {
    struct generic_container_s generic;
    struct bitset_container_s bitset;
    struct array_container_s array;
    struct run_container_s run;
};
```

We would always add [...]
Full containers could steal a bit from the [...]
About pointer tagging, we could play with alignment. It'd be less robust when adding new container types, but it's guaranteed the platform won't play tricks on us, because the tag lives in the meaningful (low) part of the pointer.
Many good ideas. We have one extra jump per container, and it must add up in some cases.
I know. The short-run optimization should reduce jumps, though. I thought embedding the typecode bit was required to embed the containers in the array, but now that I think of it, we're good as long as they're all the same size. They don't even need to have the same layout in principle, but the size needs to be reasonable.
@Oppen There are gains to be had... but I would be shy about introducing new container types while changing the layout of how they are stored. It would be best to isolate these issues.
A dumb-enough proposal to handle shared containers in a flatter world: just add an array of pointers to reference counts in the high-low container. If the pointer is [...]
@Oppen The current design is optimized so that if you do not use shared containers, then you do not pay for them. My expectation is that few people use shared containers. They have their uses, but it is kind of specific.
That's true. There would be a cost in memory you'd be paying whether you use shared containers or not (although you could skip it if the bitmap is not marked COW). We still check for it in many places due to the indirection. At worst, removing those checks makes the code simpler; at best, it can remove a lot of branching. The gain I see is that you keep the wins of using a flat structure this way, with no extra indirection for non-modifying operations regardless of the containers being shared. Regarding users, I know I don't use them, but I have no idea how widespread that is in general.
Another option would be to still store pointers to containers but take advantage of the fact that their metadata is quite small, and just embed both in a single allocation, like this:

```c
struct bitset_container_s {
    int32_t cardinality;
    uint64_t words[1024];
};

struct array_container_s {
    uint16_t cardinality;
    uint16_t capacity;
    uint16_t array[];
};

struct run_container_s {
    uint16_t n_runs;
    uint16_t capacity;
    rle16_t runs[];
};
```

I think the most annoying part here is that then we would need to always check for reallocations at the [...]
Maybe we don't need to put the typecode in the struct, so that the shared-container problem is naturally solved:

```c
struct bitset_container_s {
    int32_t cardinality;
    int32_t use_count;
    uint64_t *words;
};

struct array_container_s {
    uint16_t cardinality;
    uint16_t capacity;
    int32_t use_count;
    uint16_t *array;
};

struct run_container_s {
    uint16_t n_runs;
    uint16_t capacity;
    int32_t use_count;
    rle16_t *runs;
};

union container_u {
    struct bitset_container_s bitset;
    struct array_container_s array;
    struct run_container_s run;
};
```
That would make all of them have some memory overhead regardless of actually being shared, plus require us to also keep using pointers to the structs instead of embedding them; otherwise, copying a bitmap with COW wouldn't propagate the new reference count.
The `roaring_array_t` could be optimized by moving the `typecode` enum into the pointer itself. On x86-64, pointers must be aligned, thus freeing the last 3 bits. That's enough space to specify the type of container in the pointer itself. The goal is to minimize memory usage (and thus improve cache friendliness). Note that it should also be possible to store the `key` (prefix) in this pointer too, since only 48 bits are addressable, but this would impact the initial bisect search for a prefix container.