
redesign how containers are stored: move typecodes into tagged pointers, replace pointers with actual container values, and so on #5

Open
fsaintjacques opened this issue Feb 8, 2016 · 43 comments

@fsaintjacques
Member

The roaring_array_t could be optimized by moving the typecode enum into the pointer itself. On x86-64, container pointers are at least 8-byte aligned, which frees the low 3 bits. That's enough space to encode the container type in the pointer itself. The goal is to minimize memory usage (and thus improve cache friendliness).

Note that it should also be possible to store the key (prefix) in this pointer too, since only 48 bits are addressable, but this would impact the initial binary search for a prefix container.
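
For concreteness, here is a minimal sketch of the tagging helpers, assuming container allocations are at least 8-byte aligned; the names are hypothetical, not existing API:

#include <stdint.h>

typedef uintptr_t tagged_container_t;

// The low 3 bits are free because the container allocation is 8-byte aligned.
static inline tagged_container_t container_tag(void *c, uint8_t typecode) {
    return (uintptr_t)c | (typecode & 0x7);
}

static inline uint8_t container_typecode(tagged_container_t t) {
    return (uint8_t)(t & 0x7);
}

static inline void *container_untag(tagged_container_t t) {
    return (void *)(t & ~(uintptr_t)0x7);
}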

@lemire
Member

lemire commented Feb 8, 2016

Sounds like a good idea, but I think that this needs to be synced with the work that @owenkaser is doing.

@owenkaser
Member

Unfortunately, I don't expect to do anything more for at least a week, maybe two. So the risk of merge conflicts is low. Packing the type code in the low-order bits seems like a good idea. Avoiding the separate uint8_t and uint8_t* type-code parameters will probably pay for the bit masking.

If we borrow the high order 16 bits of the pointer for key, do we know how
many years it will be until Intel decides that those bits are needed?

Also, it might be nicer (cachewise) to conduct most of the binary search
over the more concise uint16_t array than the new alternative.
-owen


@lemire
Member

lemire commented Feb 9, 2016

I also really like the idea of hiding away the type-code parameters. Even better if we can find a way to do it that is elegant.

@fsaintjacques : want to tackle this optimization?

If we borrow the high order 16 bits of the pointer for key, do we know how many years it will be until Intel decides that those bits are needed? Also, it might be nicer (cachewise) to conduct most of the binary search over the more concise uint16_t array than the new alternative.

I am guessing that we can check the addressable range using the cpuid instruction. So you could do it, but add a check when you build the code. I am more concerned with the loss of code clarity and the potential for (small) performance losses.

@lemire
Member

lemire commented Feb 9, 2016

Looks like 16 TB of RAM in Java is a thing

https://www.linkedin.com/pulse/javaone-2015-operating-16tb-jvm-antoine-chambille

@nkurz
Collaborator

nkurz commented Feb 9, 2016


If we borrow the high order 16 bits of the pointer for key, do we know how
many years it will be until Intel decides that those bits are needed?

Unknown, but I don't think it will be a problem in practice. The current belief is that 48-bit virtual addresses should hold us for quite some time to come. HP's "Machine" is one of the largest memory systems currently proposed, and they have announced that they will stick with 48-bit virtual addressing:
http://www.nextplatform.com/2016/01/25/the-bits-and-bytes-of-the-machines-storage/

It's worth noting that with paging, the physical address space of the whole machine can be much larger than the virtual address space of each process. It's also worth noting that while these high bits are not currently used, the AMD64 spec (the official name of the thing colloquially referred to as x64 or x86_64) specifies that these bits all need to be set to the same value (a "canonical" address), so masking will be required.

Another possibility we might consider would be dropping down to 32-bit
"offsets" instead of storing pointers. I'm not sure if this would be
possible or not, but in addition to shaving some size it might allow for
better integration with mmap(). That is, if instead of storing 64-bit
pointers in memory we were to store "offset" * 16B (or 32B) from a base
address, we might be able to have the on disk representation match the
in-memory representation.

I'd be interested in talking over the potential benefit of this approach if
anyone is interested.
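
A rough sketch of what the offset scheme could look like, assuming a 16-byte granule and one base address per mapping (both are assumptions, not settled design):

#include <stddef.h>
#include <stdint.h>

#define CONTAINER_GRANULE 16  // bytes per offset unit (assumption)

// Resolve a 32-bit scaled offset against the mapping's base address.
// The same offsets would then be valid on disk and in memory.
static inline void *container_at(void *base, uint32_t offset) {
    return (char *)base + (size_t)offset * CONTAINER_GRANULE;
}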

Also, it might be nicer (cachewise) to conduct most of the binary search
over the more concise uint16_t array than the new alternative.

I strongly agree with this: anything we are going to search over should be packed as densely as it can be. Not doing so hurts cache efficiency, wastes memory bandwidth, and reduces the gain from vectorization as SIMD continues to get wider. Splitting data into parallel dense streams is a worthy target.

I am guessing that we can check the addressable range using the cpuid
instruction. So you could do it, but add a check when you build the code. I
am more concerned with the loss of code clarity and the potential for
(small) performance losses.

Looks like yes: CPUID with 80000008H in EAX will give the number of physical address bits in AL, and the number of "linear" address bits in AH. I think "linear" is the same as "virtual", but I'm not certain.

--nate
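
A sketch of that check using the GCC/Clang cpuid.h wrapper; the EAX field layout is as described above:

#include <cpuid.h>
#include <stdio.h>

int main(void) {
    unsigned eax, ebx, ecx, edx;
    // Leaf 0x80000008: EAX[7:0] = physical address bits,
    // EAX[15:8] = linear (virtual) address bits.
    if (__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx)) {
        printf("physical: %u bits, virtual: %u bits\n",
               eax & 0xff, (eax >> 8) & 0xff);
    }
    return 0;
}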

@fsaintjacques
Member Author

As I said initially, the uint16_t keys are not worth packing. I will concentrate on packing the type tags.

@fsaintjacques fsaintjacques self-assigned this Feb 9, 2016
@lemire
Member

lemire commented Feb 9, 2016

What @nkurz suggests, with 32-bit offsets, is appealing, and I wonder if this design issue (which goes beyond the current issue) shouldn't be discussed seriously.

What @owenkaser implemented is perfectly reasonable, but there are other interesting alternatives. Why don't we pack the container structs in one array, making sure that they all have the same size (possibly using padding)? This is a variation on Nate's idea, but where the offsets are just flat. Think about summing the cardinality of all of the containers or doing a select:

https://github.com/RoaringBitmap/RoaringBitmap/blob/master/src/main/java/org/roaringbitmap/buffer/ImmutableRoaringBitmap.java#L1005-L1026

If all the structs are stored consecutively, you avoid cache misses.

In the current code, computing the cardinality of a roaring bitmap means dereferencing each and every container pointer... possibly generating cache misses.
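
As a sketch of the difference (struct names illustrative): with a flat array of container values, the cardinality query becomes one linear pass over contiguous memory:

#include <stdint.h>

struct container {        // flat container value, 128 bits
    int32_t cardinality;
    int32_t capacity;
    void *data;
};

uint64_t total_cardinality(const struct container *containers, int32_t n) {
    uint64_t total = 0;
    for (int32_t i = 0; i < n; i++)
        total += (uint64_t)containers[i].cardinality;  // no pointer chasing
    return total;
}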

Another thing to consider is supporting copy-on-write at the container level.

@lemire
Member

lemire commented Feb 16, 2016

Another issue is that with memory-mapped files, the pointer might not be word-aligned.

It seems to me that we would win the most by "standardizing the container size". They are all small, and all made of pointers to other data elsewhere. So, really, the pointer-to-container indirection is a waste.

@fsaintjacques
Member Author

The 3 structs are of the following form:

struct container {
  int32_t cardinality;
  int32_t capacity; // not used in bitset
  void* pointer_to_data;
};

This fits in 128 bits (capacity can probably be shrunk with further limitations).

The offset strategy would make it very cumbersome to resize the run and array containers at will AND make it work with a memory-mapped backend.

@lemire
Member

lemire commented Feb 16, 2016

@fsaintjacques

Right. Details aside, we can have the struct fit in a nice 128 bits, as you say. It is a nice round number. There will be some waste, but if you save a pointer, you save 64 bits...

We have enough space in there to have the typecodes... for example, you allocate 32 bits to cardinality, but none of the containers can store more than 1<<16 values... so there is a good margin.

@owenkaser
Member

Are you thinking of maintaining a containers array per bitmap (perhaps kept ordered by key), or one shared by all bitmaps (presumably unordered, and using some freelist approach)?

@lemire
Member

lemire commented Feb 16, 2016

@owenkaser

I was thinking about one container array.

If you are to implement copy-on-write, then you'd have to copy the container struct... but it is only 128 bits.

I am almost certain it would use less memory (saving the pointer) and be slightly faster.

@fsaintjacques
Member Author

If we limit capacity to 2^16 - 1, we can shrink it to a uint16_t and use a plain uint8_t typecode without bit hacks, but then the number of elements in the array might not be a multiple of __m256i (so many things to consider!).

By having containers in an array, what do you think of pre-allocating in slabs of 4 kB:

struct roaring {
  uint16_t n_containers;
  uint16_t *prefixes;
  // n_slabs can be derived from n_containers
  struct container_slab *slabs;
};

// fits in a page: 256 containers * 16 bytes = 4096 bytes.
struct container_slab {
  struct container containers[256];
};

Slabs have the nice property of being multiples of PAGE_SIZE, thus easily mmap-able.
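
A compile-time sanity check for that sizing (C11 _Static_assert; assumes the 128-bit, i.e. 16-byte, container struct above, so 256 * 16 = 4096 bytes):

_Static_assert(sizeof(struct container_slab) == 4096,
               "a slab must be exactly one 4 kB page");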

@fsaintjacques
Member Author

Note that it doesn't change much from a standard array, but the underlying memory allocator probably allocates in chunks of 4 kB anyway.

@owenkaser
Member

@lemire

Is there any reason to think that copy-on-write would usually be a big win? My impression is that it would have to come with awkward machinery for reference counting etc., so that we know when to free memory. Your Go implementation uses the language's GC, right?

@fsaintjacques
Member Author

I would skip COW mechanism for the first implementation.

@owenkaser
Member

@lemire
With one big containers array shared by all bitmaps, your dream of an efficient scan to find a single bitmap's cardinality may be hard to realize. Are there mmapping consequences too?

@fsaintjacques
Within a slab, would you allow containers from multiple bitmaps? If not, how would we stop tiny bitmaps from causing internal fragmentation?

@lemire
Member

lemire commented Feb 16, 2016

@owenkaser

It is not "my" Go implementation, but yes, it relies on the language's GC.

I do not plan to implement COW in CRoaring. I am just playing devil's advocate so that we do not make the design less flexible than it needs to be.

@lemire
Member

lemire commented Feb 16, 2016

@fsaintjacques

In many applications, you only have a handful of containers per bitmap. Allocating them in blocks of 256 would be terribly wasteful. It seems much more prudent to allocate arrays in the usual manner... while keeping space for further tuning using customized memory allocation as an extra feature.

@lemire
Member

lemire commented Feb 16, 2016

@fsaintjacques

I think that going with 128-bit containers is the wisest course of action, and that's the one you suggested.

Right now, we are keeping it close to the Java/Go implementations, but we might part ways at some point... it would be interesting then to have enough "room" to support other container types.

@fsaintjacques
Member Author

Shall we define only one struct and use typedefs, e.g.

struct container_s {
  int32_t cardinality;
  uint32_t capacity;
  void *data;
};

typedef struct container_s bitset_container_t;
typedef struct container_s array_container_t;
typedef struct container_s run_container_t;

I'll open a branch for this experimentation.

@lemire
Member

lemire commented Feb 17, 2016

@fsaintjacques

Is it possible to do the job with a union?

@fsaintjacques
Member Author

How would you do the union? I foresee a declaration order ambiguity if the tag is in the pointer to the data. Do you envision something like:

/**
 * Requires an implicit ordering in the definitions of the container structs.
 * This is required to find the tag type. But the sizeof(container_t) will
 * always be correct and explicit.
 */
union container_u {
   bitset_container_t bitset;
   array_container_t array;
   run_container_t run;
};

@lemire
Member

lemire commented Feb 17, 2016

@fsaintjacques

I don't know how to make it work but the piece of code is elegant, isn't it?

@fsaintjacques
Member Author

The original proposition is succinct and explicit but might make future expansion harder. The union version is friendly to future expansion but induces a dangerous ordering (and alignment!) dependency, which can be solved with an explicit enum/tag, at the cost of losing the 128-bit packing per container type.

@fsaintjacques
Member Author

I'm looking over the run container code, and I'll have to go over it carefully. I spotted a few bugs.

@lemire
Member

lemire commented Sep 3, 2016

@fsaintjacques We now have four container types, including the COW containers.

@lemire lemire changed the title from "move typecodes in tagged pointer" to "redesign how containers are stored: move typecodes into tagged pointers, replace pointers with actual container values, and so on" Sep 3, 2016
@lemire
Member

lemire commented Sep 3, 2016

We need to revisit this point, as I think that CRoaring still leaves performance on the table.

@lemire
Member

lemire commented Oct 29, 2020

This issue is still outstanding. Instead of using a pointer to a container, we should be using a union type. The current design works, but it is unnecessarily inefficient.

@lemire lemire added this to the 0.3 milestone Oct 29, 2020
@AE1020
Contributor

AE1020 commented Nov 2, 2020

The roaring_array_t could be optimized by moving the typecode enum into the pointer itself. On x86-64, container pointers are at least 8-byte aligned, which frees the low 3 bits. That's enough space to encode the container type in the pointer itself. The goal is to minimize memory usage (and thus improve cache friendliness).

Although this is a known technique used by projects like V8, the very fact that it is commonly exploited means the chances are not small that it is already used by some layer of a system.

This means that weirder compilers and transpilers may leverage it in their C implementation itself... or be built on a runtime stack that uses it somewhere.

So I'd be wary of doing this if your codebase is not itself the platform. A good generic C bitset library should conform to the standard, to make it usable anywhere.

That said: as long as it can be turned off with a #define somehow, it's not a bad idea if it doesn't complicate other things too much. Though I'd suggest prioritizing other reorganizations first.
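
A sketch of what such an opt-out could look like; the macro and type names are made up:

#include <stdint.h>

#ifdef ROARING_NO_TAGGED_POINTERS
// Portable fallback: keep the typecode in a separate byte.
typedef struct { void *ptr; uint8_t typecode; } container_ref_t;
#else
// Tagged variant: typecode packed into the low 3 bits of the pointer.
typedef uintptr_t container_ref_t;
#endif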

@Oppen
Collaborator

Oppen commented Jun 13, 2021

Another idea that may reduce indirection is to optimize short arrays and runs by using the pointer itself as storage.
Something like this:

// rle16_t is CRoaring's run type: a uint16_t start value and a uint16_t length.
typedef struct run_container_s {
    uint32_t n_runs;
    uint32_t capacity;
    union {
        rle16_t *runs;       // heap storage for longer run sequences
        rle16_t s_runs[2];   // 2 32-bit runs fit in the pointer's 8 bytes
    };
} run_container_t;

// Assumes i is in range.
rle16_t get_run_at_index(run_container_t *r, int i) {
    if (r->capacity > 2) return r->runs[i];
    return r->s_runs[i];
}

That way, things like full ranges won't require any kind of heap memory overhead, whether it be the accounting or the indirection itself. It should be more cache friendly as well.
The downside is the extra step and the danger of forgetting it and dereferencing garbage.

@Oppen
Collaborator

Oppen commented Jun 13, 2021

Even better: since this way the capacity is never below 2 for runs or 4 for arrays (or 1 and 2 on 32-bit platforms), we can start counting from 1, so one uint16_t is enough for the capacity and we can use the extra 2 bytes for the tag.

container_t could then look like this:

struct generic_container_s {
    int32_t cardinality;
    uint8_t typecode;
    uint8_t reserved1;
    uint16_t reserved2; // uint16_t (not uint32_t) to mirror the capacity slot of the variants below
    void *payload;
};

struct bitset_container_s {
    int32_t cardinality;
    uint8_t typecode;
    uint8_t unused1;
    uint16_t unused2;
    uint64_t *words;
};

struct array_container_s {
    int32_t cardinality;
    uint8_t typecode;
    uint8_t unused;
    uint16_t capacity;
    union {
         uint16_t *array;
         uint16_t s_array[sizeof(void*)/sizeof(uint16_t)];
    };
};

struct run_container_s {
    int32_t n_runs;
    uint8_t typecode;
    uint8_t unused;
    uint16_t capacity;
    union {
         rle16_t *runs;
         rle16_t s_runs[sizeof(void*)/sizeof(rle16_t)];
    };
};

union container_u {
     struct generic_container_s generic;
     struct bitset_container_s bitset;
     struct array_container_s array;
     struct run_container_s run;
};

We would always add 1 to the stored capacity field to get the real capacity. Since the real capacity will be <= (1 << 16), the stored value will be < (1 << 16) and fit in the uint16_t storage.
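
In code, the biased encoding is just (helper names made up):

#include <stdint.h>

// Store capacity - 1 so the full range 1..65536 fits in a uint16_t.
static inline uint16_t capacity_encode(uint32_t real_capacity) {
    return (uint16_t)(real_capacity - 1);
}

static inline uint32_t capacity_decode(uint16_t stored) {
    return (uint32_t)stored + 1;
}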

@Oppen
Collaborator

Oppen commented Jun 13, 2021

Full containers could steal a bit from the typecode as well, becoming a distinct type. That type requires no extra allocation, and it lets the invariants of the other types include "not full", so their cardinalities are always under (1 << 16) and representable by a uint16_t. This would be useful because we could then keep n_runs and the cardinality cached: for any given (repaired, if a lazy operation was done) roaring bitmap, we could just sum the cached sizes when queried for cardinality, plus the number of full containers multiplied by (1 << 16).
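
A sketch of the resulting cardinality query, assuming a hypothetical FULL typecode and the container_u union above:

#define FULL_TYPECODE 4  // hypothetical tag for full containers

uint64_t bitmap_cardinality(const union container_u *cs, int32_t n) {
    uint64_t total = 0;
    for (int32_t i = 0; i < n; i++) {
        if (cs[i].generic.typecode == FULL_TYPECODE)
            total += (uint64_t)1 << 16;  // a full container holds 65536 values
        else
            total += (uint64_t)cs[i].generic.cardinality;  // cached count
    }
    return total;
}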

@Oppen
Collaborator

Oppen commented Jun 13, 2021

About pointer tagging, we could play with alignment. It'd be less robust when adding new container types, but it's guaranteed the platform won't play games with us, because the tag lives in the meaningful part of the pointer.
The con here is that natural alignment doesn't seem to give us enough free bits (arrays are 16-bit aligned and runs 32-bit aligned, so there's not a lot we can unambiguously mask), and a stronger alignment guarantee would require breaking the frozen format.

@lemire
Member

lemire commented Jun 13, 2021

Many good ideas.
Speaking for myself, the primary concern is less about memory usage and more about indirection.

We have one extra jump per container and it must add up in some cases.

@Oppen
Collaborator

Oppen commented Jun 13, 2021

I know. The short-run optimization should reduce jumps, though. I thought embedding the typecode was required to embed the containers in the array, but now that I think of it, we're good as long as they're all the same size. They don't even need to have the same layout in principle, but the size needs to be reasonable.

@lemire
Member

lemire commented Jun 13, 2021

@Oppen There are gains to be had... but I would be shy about introducing new container types while changing the layout of how they are stored. It would be best to isolate these issues.

@Oppen
Collaborator

Oppen commented Jul 8, 2021

A dumb enough proposal to handle shared containers in a flatter world: just add an array of pointers to reference counts in the high-low container. If the pointer is NULL, then the container with the same index is necessarily not shared. This saves extra indirection when working with the containers and allows us to skip the check where it is not relevant. Say we want the cardinality: that's not a mutating operation, so we don't care whether the container is shared, but with the encapsulating approach we need to check that just to know the container's type. That could also help us pool the counts in the future if we deem that worth the effort (although I'm not sure there's anything stopping us with the current approach).
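
A sketch of that layout; the field and struct names are illustrative, not the current roaring_array_t:

#include <stddef.h>
#include <stdint.h>

struct flat_roaring_array {
    int32_t size;
    uint16_t *keys;                 // 16-bit key prefixes
    union container_u *containers;  // flat container values
    int32_t **refcounts;            // refcounts[i] == NULL => not shared
};

static inline int container_is_shared(const struct flat_roaring_array *ra,
                                      int32_t i) {
    return ra->refcounts != NULL && ra->refcounts[i] != NULL;
}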

@lemire
Member

lemire commented Jul 8, 2021

@Oppen The current design is optimized so that if you do not use shared containers, then you do not pay for them.

My expectation is that few people use shared containers. They have their uses, but it is kind of specific.

@Oppen
Collaborator

Oppen commented Jul 8, 2021

That's true. There would be a cost in memory you'd be paying whether you use shared containers or not (although you could skip it if the bitmap is not marked COW). We still check for it in many places due to the indirection. At worst, removing those checks makes the code simpler; at best, it can remove a lot of branching. The gain I see is that you keep the wins of using a flat structure this way, with no extra indirection for non-modifying operations, regardless of whether the containers are shared.

Regarding users, I know I don't use them, but I have no idea how widespread that is in general.

@Oppen
Collaborator

Oppen commented Sep 8, 2021

Another option would be to still store pointers to containers, but take advantage of the fact that their metadata is quite small and embed both the metadata and the payload in a single allocation, like this:

struct bitset_container_s {
    int32_t cardinality;
    uint64_t words[1024];
};

struct array_container_s {
    uint16_t cardinality;
    uint16_t capacity;
    uint16_t array[];
};

struct run_container_s {
    uint16_t n_runs;
    uint16_t capacity;
    rle16_t runs[];
};

I think the most annoying part here is that we would then need to always check for reallocations at the roaring_array_t level, except for bitset containers, whose size is always constant.
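
A sketch of a constructor under that scheme (the function name is made up): the header and the payload share one allocation, so growing the payload reallocates, and possibly moves, the whole container, which is why the owner must re-store the pointer:

#include <stdlib.h>

static struct array_container_s *array_container_create(uint16_t capacity) {
    // One malloc covers the header and the flexible array member.
    struct array_container_s *c =
        malloc(sizeof(struct array_container_s) + capacity * sizeof(uint16_t));
    if (c != NULL) {
        c->cardinality = 0;
        c->capacity = capacity;
    }
    return c;
}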

@stdpain
Contributor

stdpain commented Sep 8, 2021

Maybe we don't need to put the typecode in the struct, and then the shared_container problem is naturally solved: if the typecode's 0x10 bit tests true, we treat the container as a shared container.

struct bitset_container_s {
    int32_t cardinality;
    int32_t use_count;
    uint64_t *words;
};

struct array_container_s {
    uint16_t cardinality;
    uint16_t capacity;
    int32_t use_count;
    uint16_t *array;
};

struct run_container_s {
    uint16_t n_runs;
    uint16_t capacity;
    int32_t use_count;
    rle16_t *runs;
};

union container_u {
     struct bitset_container_s bitset;
     struct array_container_s array;
     struct run_container_s run;
};

@Oppen
Collaborator

Oppen commented Sep 8, 2021

That would make all of them carry some memory overhead regardless of actually being shared, plus it would require keeping pointers to the structs instead of embedding them; otherwise, copying a bitmap with COW wouldn't propagate the new reference count.
