
redesign how containers are stored: move typecodes into tagged pointers, replace pointers with actual container values, and so on #5

Open
fsaintjacques opened this issue Feb 8, 2016 · 43 comments

@fsaintjacques
Member

The roaring_array_t could be optimized by moving the typecode enum into the pointer itself. On x86-64, container pointers are at least 8-byte aligned, which frees the low 3 bits. That's enough space to encode the container type in the pointer itself. The goal is to minimize memory usage (and thus improve cache friendliness).

Note that it should also be possible to store the key (prefix) in this pointer too, since only 48 bits are addressable, but this would impact the initial binary search for a prefix container.
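
For concreteness, here is a minimal sketch of the tagging helpers, assuming container allocations are at least 8-byte aligned; the names are hypothetical, not existing API:

#include <stdint.h>

typedef uintptr_t tagged_container_t;

// The low 3 bits are free because the container allocation is 8-byte aligned.
static inline tagged_container_t container_tag(void *c, uint8_t typecode) {
    return (uintptr_t)c | (typecode & 0x7);
}

static inline uint8_t container_typecode(tagged_container_t t) {
    return (uint8_t)(t & 0x7);
}

static inline void *container_untag(tagged_container_t t) {
    return (void *)(t & ~(uintptr_t)0x7);
}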

@lemire
Member

lemire commented Feb 8, 2016

Sounds like a good idea, but I think that this needs to be synced with the work that @owenkaser is doing.

@owenkaser
Member

Unfortunately, I don't expect to do anything more for at least a week, maybe two. So the risk of merge conflicts is low. Packing the type code in the low-order bits seems like a good idea. Avoiding the separate uint8_t and uint8_t* type-code parameters will probably pay for the bit masking.

If we borrow the high order 16 bits of the pointer for key, do we know how
many years it will be until Intel decides that those bits are needed?

Also, it might be nicer (cachewise) to conduct most of the binary search
over the more concise uint16_t array than the new alternative.
-owen


@lemire
Member

lemire commented Feb 9, 2016

I also really like the idea of hiding away the type-code parameters. Even better if we can find a way to do it that is elegant.

@fsaintjacques : want to tackle this optimization?

If we borrow the high order 16 bits of the pointer for key, do we know how many years it will be until Intel decides that those bits are needed? Also, it might be nicer (cachewise) to conduct most of the binary search over the more concise uint16_t array than the new alternative.

I am guessing that we can check the addressable range using the cpuid instruction. So you could do it, but add a check when you build the code. I am more concerned with the loss of code clarity and the potential for (small) performance losses.

@lemire
Member

lemire commented Feb 9, 2016

Looks like 16 TB of RAM in Java is a thing

https://www.linkedin.com/pulse/javaone-2015-operating-16tb-jvm-antoine-chambille

@nkurz
Collaborator

nkurz commented Feb 9, 2016


If we borrow the high order 16 bits of the pointer for key, do we know how
many years it will be until Intel decides that those bits are needed?

Unknown, but I don't think it will be a problem in practice. The current belief is that 48-bit virtual addresses should hold us for quite some time to come. HP's "Machine" is one of the largest memory systems currently proposed, and they have announced that they will stick with 48-bit virtual addressing:
http://www.nextplatform.com/2016/01/25/the-bits-and-bytes-of-the-machines-storage/

It's worth noting that with paging, the physical address space of the whole machine can be much larger than the virtual address space of each process. It's also worth noting that while these high bits are not currently used, the AMD64 spec (the official name of the thing colloquially referred to as x64 or x86_64) specifies that these bits all need to be set to the same value (a "canonical" address), so masking will be required.

Another possibility we might consider would be dropping down to 32-bit
"offsets" instead of storing pointers. I'm not sure if this would be
possible or not, but in addition to shaving some size it might allow for
better integration with mmap(). That is, if instead of storing 64-bit
pointers in memory we were to store "offset" * 16B (or 32B) from a base
address, we might be able to have the on disk representation match the
in-memory representation.

I'd be interested in talking over the potential benefit of this approach if
anyone is interested.
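
A rough sketch of what the offset scheme could look like, assuming a 16-byte granule and one base address per mapping (both are assumptions, not settled design):

#include <stddef.h>
#include <stdint.h>

#define CONTAINER_GRANULE 16  // bytes per offset unit (assumption)

// Resolve a 32-bit scaled offset against the mapping's base address.
// The same offsets would then be valid on disk and in memory.
static inline void *container_at(void *base, uint32_t offset) {
    return (char *)base + (size_t)offset * CONTAINER_GRANULE;
}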

Also, it might be nicer (cachewise) to conduct most of the binary search
over the more concise uint16_t array than the new alternative.

I strongly agree with this: anything we are going to search over should be packed as densely as it can be. Not doing so hurts cache efficiency, wastes memory bandwidth, and reduces the gain from vectorization as SIMD continues to get wider. Splitting data into parallel dense streams is a worthy target.

I am guessing that we can check the addressable range using the cpuid
instruction. So you could do it, but add a check when you build the code. I
am more concerned with the loss of code clarity and the potential for
(small) performance losses.

Looks like yes: CPUID with 80000008H in EAX will give the number of physical address bits in AL, and the number of "linear" address bits in AH. I think "linear" is the same as "virtual", but I'm not certain.

--nate
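
A sketch of that check using the GCC/Clang cpuid.h wrapper; the EAX field layout is as described above:

#include <cpuid.h>
#include <stdio.h>

int main(void) {
    unsigned eax, ebx, ecx, edx;
    // Leaf 0x80000008: EAX[7:0] = physical address bits,
    // EAX[15:8] = linear (virtual) address bits.
    if (__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx)) {
        printf("physical: %u bits, virtual: %u bits\n",
               eax & 0xff, (eax >> 8) & 0xff);
    }
    return 0;
}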

@fsaintjacques
Member Author

As I said initially, the uint16_t keys are not worth packing. I will concentrate on packing the type tags.

@fsaintjacques fsaintjacques self-assigned this Feb 9, 2016
@lemire
Member

lemire commented Feb 9, 2016

What @nkurz suggests, with 32-bit offsets, is appealing, and I wonder if this design issue (which goes beyond the current issue) shouldn't be discussed seriously.

What @owenkaser implemented is perfectly reasonable, but there are other interesting alternatives. Why don't we pack the container structs in one array, making sure that they all have the same size (possibly using padding)? This is a variation on Nate's idea, but where the offsets are just flat. Think about summing the cardinality of all of the containers or doing a select:

https://github.com/RoaringBitmap/RoaringBitmap/blob/master/src/main/java/org/roaringbitmap/buffer/ImmutableRoaringBitmap.java#L1005-L1026

If all the structs are stored consecutively, you avoid cache misses.

In the current code, computing the cardinality of a roaring bitmap means dereferencing each and every container pointer... possibly generating cache misses.
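
As a sketch of the difference (struct names illustrative): with a flat array of container values, the cardinality query becomes one linear pass over contiguous memory:

#include <stdint.h>

struct container {        // flat container value, 128 bits
    int32_t cardinality;
    int32_t capacity;
    void *data;
};

uint64_t total_cardinality(const struct container *containers, int32_t n) {
    uint64_t total = 0;
    for (int32_t i = 0; i < n; i++)
        total += (uint64_t)containers[i].cardinality;  // no pointer chasing
    return total;
}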

Another thing to consider is supporting copy-on-write at the container level.

@lemire
Member

lemire commented Feb 16, 2016

Another issue is that with memory-mapped files, the pointer might not be word-aligned.

It seems to me that we would win the most by "standardizing the container size". They are all small, and all made of pointers to other data elsewhere. So, really, the pointer-to-container indirection is a waste.

@fsaintjacques
Member Author

The 3 structs are of the following form:

struct container {
  int32_t cardinality;
  int32_t capacity; // not used in bitset
  void* pointer_to_data;
};

This fits in 128 bits (capacity can probably be shrunk with further limitations).

The offset strategy would make it very cumbersome to resize the run and array containers at will AND make it work with a memory-mapped backend.

@lemire
Member

lemire commented Feb 16, 2016

@fsaintjacques

Right. Details aside, we can have the struct fit in a nice 128 bits, as you say. It is a nice round number. There will be some waste, but if you save a pointer, you save 64 bits...

We have enough space in there to have the typecodes... for example, you allocate 32 bits to cardinality, but none of the containers can store more than 1<<16 values... so there is a good margin.

@owenkaser
Member

Are you thinking of maintaining a containers array per bitmap (perhaps kept ordered by key), or one shared by all bitmaps (presumably unordered, and using some freelist approach)?

@lemire
Member

lemire commented Feb 16, 2016

@owenkaser

I was thinking about one container array.

If you are to implement copy-on-write, then you'd have to copy the container struct... but it is only 128 bits.

I am almost certain it would use less memory (saving the pointer) and be slightly faster.

@fsaintjacques
Member Author

If we limit capacity to 2^16 - 1, we can shrink it to a uint16_t and use a plain uint8_t typecode without bit hacks, but then the number of elements in the array might not be a multiple of __m256i (so many things to consider!).

By having containers in an array, what do you think of pre-allocating in slabs of 4 kB:

struct roaring {
  uint16_t n_containers;
  uint16_t *prefixes;
  // n_slabs can be derived from n_containers
  struct container_slab *slabs;
};

// fits in a page: 256 containers * 16 bytes = 4096 bytes.
struct container_slab {
  struct container containers[256];
};

Slabs have the nice property of being multiples of PAGE_SIZE, thus easily mmap-able.
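
A compile-time sanity check for that sizing (C11 _Static_assert; assumes the 128-bit, i.e. 16-byte, container struct above, so 256 * 16 = 4096 bytes):

_Static_assert(sizeof(struct container_slab) == 4096,
               "a slab must be exactly one 4 kB page");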

@fsaintjacques
Member Author

Note that it doesn't change much from a standard array, but the underlying memory allocator probably allocates in chunks of 4 kB anyway.

@owenkaser
Member

@lemire

Is there any reason to think that copy-on-write would usually be a big win? My impression is that it would have to come with awkward machinery for reference counting etc., so that we know when to free memory. Your Go implementation uses the language's GC, right?

@fsaintjacques
Member Author

I would skip COW mechanism for the first implementation.

@owenkaser
Member

@lemire
With one big containers array shared by all bitmaps, your dream of an efficient scan to find a single bitmap's cardinality may be hard to realize. Are there mmapping consequences too?

@fsaintjacques
Within a slab, would you allow containers from multiple bitmaps? If not, how would we stop tiny bitmaps from causing internal fragmentation?

@lemire
Member

lemire commented Feb 16, 2016

@owenkaser

It is not "my" Go implementation, but yes, it relies on the language's GC.

I do not plan to implement COW in CRoaring. I am just playing devil's advocate so that we do not make the design less flexible than it needs to be.

@lemire
Member

lemire commented Feb 16, 2016

@fsaintjacques

In many applications, you only have a handful of containers per bitmap. Allocating them in blocks of 256 would be terribly wasteful. It seems much more prudent to allocate arrays in the usual manner... while keeping space for further tuning using customized memory allocation as an extra feature.

@lemire
Member

lemire commented Feb 16, 2016

@fsaintjacques

I think that going with 128-bit containers is the wisest course of action, and that's the one you suggested.

Right now, we are keeping it close to the Java/Go implementations, but we might part ways at some point... it would be interesting then to have enough "room" to support other container types.

@fsaintjacques
Member Author

Shall we define only one struct and use typedefs, e.g.

struct container_s {
  int32_t cardinality;
  uint32_t capacity;
  void *data;
};

typedef struct container_s bitset_container_t;
typedef struct container_s array_container_t;
typedef struct container_s run_container_t;

I'll open a branch for this experimentation.

@lemire
Member

lemire commented Feb 17, 2016

@fsaintjacques

Is it possible to do the job with a union?

@fsaintjacques
Member Author

How would you do the union? I foresee a declaration order ambiguity if the tag is in the pointer to the data. Do you envision something like:

/**
 * Requires an implicit ordering in the definitions of the container structs.
 * This is required to find the tag type. But the sizeof(container_t) will
 * always be correct and explicit.
 */
union container_u {
   bitset_container_t bitset;
   array_container_t array;
   run_container_t run;
};

@lemire
Member

lemire commented Feb 17, 2016

@fsaintjacques

I don't know how to make it work but the piece of code is elegant, isn't it?

@fsaintjacques
Member Author

The original proposition is succinct and explicit but might make future expansion harder. The union version is friendly to future expansion but induces a dangerous ordering (and alignment!) dependency, which can be solved with an explicit enum/tag, at the cost of losing the 128-bit packing per container type.

@fsaintjacques
Member Author

I'm looking over the run container code, and I'll have to go over it carefully. I spotted a few bugs.

@lemire
Member

lemire commented Sep 3, 2016

@fsaintjacques We now have four container types, including the COW containers.

@lemire lemire changed the title from "move typecodes in tagged pointer" to "redesign how containers are stored: move typecodes into tagged pointers, replace pointers with actual container values, and so on" Sep 3, 2016
@lemire
Member

lemire commented Sep 3, 2016

We need to revisit this point, as I think that CRoaring still leaves performance on the table.

@lemire
Member

lemire commented Oct 29, 2020

This issue is still outstanding. Instead of using a pointer to a container, we should be using a union type. The current design works, but it is unnecessarily inefficient.

@lemire lemire added this to the 0.3 milestone Oct 29, 2020
@AE1020
Contributor

AE1020 commented Nov 2, 2020

The roaring_array_t could be optimized by moving the typecode enum into the pointer itself. On x86-64, container pointers are at least 8-byte aligned, which frees the low 3 bits. That's enough space to encode the container type in the pointer itself. The goal is to minimize memory usage (and thus improve cache friendliness).

Although this is a known technique used by projects like V8, the very fact that it is commonly exploited means the chances are not small that it is already used by some layer of a system.

This means that weirder compilers and transpilers may leverage it in their C implementation itself... or be built on a runtime stack that uses it somewhere.

So I'd be wary of doing this if your codebase is not itself the platform. A good generic C bitset library should conform to the standard, to make it usable anywhere.

That said: as long as it can be turned off with a #define somehow, it's not a bad idea if it doesn't complicate other things too much. Though I'd suggest prioritizing other reorganizations first.
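
A sketch of what such an opt-out could look like; the macro and type names are made up:

#include <stdint.h>

#ifdef ROARING_NO_TAGGED_POINTERS
// Portable fallback: keep the typecode in a separate byte.
typedef struct { void *ptr; uint8_t typecode; } container_ref_t;
#else
// Tagged variant: typecode packed into the low 3 bits of the pointer.
typedef uintptr_t container_ref_t;
#endif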

@Oppen
Collaborator

Oppen commented Jun 13, 2021

Another idea that may reduce indirection is to optimize short arrays and runs by using the pointer itself as storage.
Something like this:

// rle16_t is CRoaring's run type: a uint16_t start value and a uint16_t length.
typedef struct run_container_s {
    uint32_t n_runs;
    uint32_t capacity;
    union {
        rle16_t *runs;       // heap storage for longer run sequences
        rle16_t s_runs[2];   // 2 32-bit runs fit in the pointer's 8 bytes
    };
} run_container_t;

// Assumes i is in range.
rle16_t get_run_at_index(run_container_t *r, int i) {
    if (r->capacity > 2) return r->runs[i];
    return r->s_runs[i];
}

That way, things like full ranges won't require any kind of heap memory overhead, whether it be the accounting or the indirection itself. It should be more cache friendly as well.
The downside is the extra step and the danger of forgetting it and dereferencing garbage.

@Oppen
Collaborator

Oppen commented Jun 13, 2021

Even better: since this way the capacity is never below 2 for runs or 4 for arrays (or 1 and 2 on 32-bit platforms), we can start counting from 1, so one uint16_t is enough for the capacity and we can use the extra 2 bytes for the tag.

container_t could then look like this:

struct generic_container_s {
    int32_t cardinality;
    uint8_t typecode;
    uint8_t reserved1;
    uint16_t reserved2; // uint16_t (not uint32_t) to mirror the capacity slot of the variants below
    void *payload;
};

struct bitset_container_s {
    int32_t cardinality;
    uint8_t typecode;
    uint8_t unused1;
    uint16_t unused2;
    uint64_t *words;
};

struct array_container_s {
    int32_t cardinality;
    uint8_t typecode;
    uint8_t unused;
    uint16_t capacity;
    union {
         uint16_t *array;
         uint16_t s_array[sizeof(void*)/sizeof(uint16_t)];
    };
};

struct run_container_s {
    int32_t n_runs;
    uint8_t typecode;
    uint8_t unused;
    uint16_t capacity;
    union {
         rle16_t *runs;
         rle16_t s_runs[sizeof(void*)/sizeof(rle16_t)];
    };
};

union container_u {
     struct generic_container_s generic;
     struct bitset_container_s bitset;
     struct array_container_s array;
     struct run_container_s run;
};

We would always add 1 to the stored capacity field to get the real capacity. Since the real capacity will be <= (1 << 16), the stored value will be < (1 << 16) and fit in the uint16_t storage.
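
In code, the biased encoding is just (helper names made up):

#include <stdint.h>

// Store capacity - 1 so the full range 1..65536 fits in a uint16_t.
static inline uint16_t capacity_encode(uint32_t real_capacity) {
    return (uint16_t)(real_capacity - 1);
}

static inline uint32_t capacity_decode(uint16_t stored) {
    return (uint32_t)stored + 1;
}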

@Oppen
Collaborator

Oppen commented Jun 13, 2021

Full containers could steal a bit from the typecode as well, becoming a distinct type. That type requires no extra allocation, and it lets the invariants of the other types include "not full", so their cardinalities are always under (1 << 16) and representable by a uint16_t. This would be useful because we could then keep n_runs and the cardinality cached: for any given (repaired, if a lazy operation was done) roaring bitmap, we could just sum the cached sizes when queried for cardinality, plus the number of full containers multiplied by (1 << 16).
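
A sketch of the resulting cardinality query, assuming a hypothetical FULL typecode and the container_u union above:

#define FULL_TYPECODE 4  // hypothetical tag for full containers

uint64_t bitmap_cardinality(const union container_u *cs, int32_t n) {
    uint64_t total = 0;
    for (int32_t i = 0; i < n; i++) {
        if (cs[i].generic.typecode == FULL_TYPECODE)
            total += (uint64_t)1 << 16;  // a full container holds 65536 values
        else
            total += (uint64_t)cs[i].generic.cardinality;  // cached count
    }
    return total;
}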

@Oppen
Collaborator

Oppen commented Jun 13, 2021

About pointer tagging, we could play with alignment. It'd be less robust when adding new container types, but it's guaranteed the platform won't play games with us, because the tag lives in the meaningful part of the pointer.
The con here is that natural alignment doesn't seem to give us enough free bits (arrays are 16-bit aligned and runs 32-bit aligned, so there's not a lot we can unambiguously mask), and a stronger alignment guarantee would require breaking the frozen format.

@lemire
Member

lemire commented Jun 13, 2021

Many good ideas.
Speaking for myself, the primary concern is less about memory usage and more about indirection.

We have one extra jump per container and it must add up in some cases.

@Oppen
Collaborator

Oppen commented Jun 13, 2021

I know. The short-run optimization should reduce jumps, though. I thought embedding the typecode was required to embed the containers in the array, but now that I think of it, we're good as long as they're all the same size. They don't even need to have the same layout in principle, but the size needs to be reasonable.

@lemire
Member

lemire commented Jun 13, 2021

@Oppen There are gains to be had... but I would be shy about introducing new container types while changing the layout of how they are stored. It would be best to isolate these issues.

@Oppen
Collaborator

Oppen commented Jul 8, 2021

A dumb enough proposal to handle shared containers in a flatter world: just add an array of pointers to reference counts in the high-low container. If the pointer is NULL, then the container with the same index is necessarily not shared. This saves extra indirection when working with the containers and allows us to skip the check where it is not relevant. Say we want the cardinality: that's not a mutating operation, so we don't care whether the container is shared, but with the encapsulating approach we need to check that just to know the container's type. That could also help us pool the counts in the future if we deem that worth the effort (although I'm not sure there's anything stopping us with the current approach).
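
A sketch of that layout; the field and struct names are illustrative, not the current roaring_array_t:

#include <stddef.h>
#include <stdint.h>

struct flat_roaring_array {
    int32_t size;
    uint16_t *keys;                 // 16-bit key prefixes
    union container_u *containers;  // flat container values
    int32_t **refcounts;            // refcounts[i] == NULL => not shared
};

static inline int container_is_shared(const struct flat_roaring_array *ra,
                                      int32_t i) {
    return ra->refcounts != NULL && ra->refcounts[i] != NULL;
}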

@lemire
Member

lemire commented Jul 8, 2021

@Oppen The current design is optimized so that if you do not use shared containers, then you do not pay for them.

My expectation is that few people use shared containers. They have their uses, but it is kind of specific.

@Oppen
Collaborator

Oppen commented Jul 8, 2021

That's true. There would be a cost in memory you'd be paying whether you use shared containers or not (although you could skip it if the bitmap is not marked COW). We still check for it in many places due to the indirection. At worst, removing those checks makes the code simpler; at best, it can remove a lot of branching. The gain I see is that you keep the wins of using a flat structure this way, with no extra indirection for non-modifying operations, regardless of whether the containers are shared.

Regarding users, I know I don't use them, but I have no idea how widespread that is in general.

@Oppen
Collaborator

Oppen commented Sep 8, 2021

Another option would be to still store pointers to containers, but take advantage of the fact that their metadata is quite small and embed both the metadata and the payload in a single allocation, like this:

struct bitset_container_s {
    int32_t cardinality;
    uint64_t words[1024];
};

struct array_container_s {
    uint16_t cardinality;
    uint16_t capacity;
    uint16_t array[];
};

struct run_container_s {
    uint16_t n_runs;
    uint16_t capacity;
    rle16_t runs[];
};

I think the most annoying part here is that we would then need to always check for reallocations at the roaring_array_t level, except for bitset containers, whose size is always constant.
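
A sketch of a constructor under that scheme (the function name is made up): the header and the payload share one allocation, so growing the payload reallocates, and possibly moves, the whole container, which is why the owner must re-store the pointer:

#include <stdlib.h>

static struct array_container_s *array_container_create(uint16_t capacity) {
    // One malloc covers the header and the flexible array member.
    struct array_container_s *c =
        malloc(sizeof(struct array_container_s) + capacity * sizeof(uint16_t));
    if (c != NULL) {
        c->cardinality = 0;
        c->capacity = capacity;
    }
    return c;
}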

@stdpain
Contributor

stdpain commented Sep 8, 2021

Maybe we don't need to put the typecode in the struct, and then the shared_container problem is naturally solved: if the typecode's 0x10 bit tests true, we treat the container as a shared container.

struct bitset_container_s {
    int32_t cardinality;
    int32_t use_count;
    uint64_t *words;
};

struct array_container_s {
    uint16_t cardinality;
    uint16_t capacity;
    int32_t use_count;
    uint16_t *array;
};

struct run_container_s {
    uint16_t n_runs;
    uint16_t capacity;
    int32_t use_count;
    rle16_t *runs;
};

union container_u {
     struct bitset_container_s bitset;
     struct array_container_s array;
     struct run_container_s run;
};

@Oppen
Collaborator

Oppen commented Sep 8, 2021

That would make all of them carry some memory overhead regardless of actually being shared, plus it would require keeping pointers to the structs instead of embedding them; otherwise, copying a bitmap with COW wouldn't propagate the new reference count.
