Fixed-block allocator for large hashtables #32362

Bulat-Ziganshin opened this issue Dec 8, 2021 · 6 comments
@Bulat-Ziganshin

Now, CH tries to optimize memory allocation by employing a better general-purpose allocator (jemalloc, hualloc). I propose researching a different approach: identify the main memory eaters and individually customize these data structures and their allocation strategies, with the goal of reducing their memory fragmentation and/or memory allocation times. In particular, the following analysis targets large hash tables, since I know their implementation and I expect that they are among the main CH memory eaters. @alexey-milovidov, I hope you will read it.

Buddy allocator

In particular, each hash table has its own element size S, and memory segments allocated for a particular hash table are always S*2^N bytes. So, instead of trying to fit them into segments of 2^N bytes, we can reserve for each S value a large address area (1 TB or so) and use, f.e., a buddy memory allocator to share this area among all hash tables with the given S.

Limit the block size

Another opportunity (which may be considered independently) is to store a large hash table as an array of fixed-size segments rather than as one big segment. F.e. we can limit the segment size to 2 MB and thus find the largest N for which S*2^N <= 2^21.

Once the hash table is about to have more than 2^N elements, we allocate its memory as multiple fixed-size blocks of S*2^N bytes and access the hash table with two-level indexing:

hashtable[i] === hashtable.data[ i / 2^N ][ i % 2^N ]

This way, smaller blocks can still be allocated using the buddy allocator, while all larger blocks have the same size and may be kept in a single global list (one per S).
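A minimal sketch of this two-level indexing, assuming blocks of 2^N slots (the container name and layout are illustrative, not from the ClickHouse code):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical fixed-block container: element i is found by splitting i into
// a block number (i / 2^N) and an offset inside the block (i % 2^N).
template <typename T, unsigned N>
struct FixedBlockArray
{
    std::vector<std::vector<T>> blocks;  // each inner vector holds 1 << N slots

    T & operator[](std::size_t i)
    {
        std::size_t block = i >> N;                            // i / 2^N
        std::size_t offset = i & ((std::size_t{1} << N) - 1);  // i % 2^N
        while (block >= blocks.size())
            blocks.emplace_back(std::size_t{1} << N);  // add one more fixed-size block
        return blocks[block][offset];
    }
};
```

Growing such a table never moves existing blocks; it only appends new fixed-size ones.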

Extendible hashing

But wait - CH hashtables are already 2-level, so this adds a 3rd indexing level. We can keep only 2 levels and still limit the sub-hash size with extendible hashing:

  • the first level always contains 2^K entries, where K may vary over time
  • a sub-hash may be referenced by multiple first-level entries
  • so, when K increments by 1, we just duplicate all links to sub-hashes in the first-level table
  • when a sub-hash is about to grow over the limit (f.e. 2M bytes), we split it into 2 sub-hashes (each 2 MB), un-duplicating a pair of links in the first-level table
  • K grows only when we need to split a sub-hash that is referenced by only one first-level entry

In terms of indexing, it just requires replacing the current i1 = h >> 24 instruction with i1 = h >> V, increasing the hashtable access latency by only 1 CPU cycle, and only on Skylake and older archs.
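The bullet points above can be sketched as a toy directory (illustrative names, not ClickHouse code; it indexes by the low K bits for simplicity, while the text indexes by the high bits, and a real table would also rehash elements on a split):

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// A "sub-hash" here carries only its local depth; the point is the link bookkeeping.
struct SubHash
{
    unsigned local_depth = 0;
};

struct Directory
{
    unsigned K = 0;  // global depth: the first level holds 2^K links
    std::vector<std::shared_ptr<SubHash>> links{std::make_shared<SubHash>()};

    // Incrementing K just duplicates every link in the first-level table.
    void grow()
    {
        auto copy = links;
        links.insert(links.end(), copy.begin(), copy.end());
        ++K;
    }

    // Splitting an overfull sub-hash un-duplicates a pair of links;
    // K grows only when the sub-hash is referenced by a single entry.
    void split(std::size_t i1)
    {
        auto old = links[i1];
        if (old->local_depth == K)
            grow();
        unsigned d = old->local_depth++;
        auto fresh = std::make_shared<SubHash>();
        fresh->local_depth = old->local_depth;
        for (std::size_t j = 0; j < links.size(); ++j)
            if (links[j] == old && ((j >> d) & 1))
                links[j] = fresh;  // redirect the "buddy" half of the links
    }
};
```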

All blocks are buddies

Instead of making a private buddy allocator for each S, we can put a non-power-of-2 number of elements in each block. F.e. make all blocks 2 MB large, so a block will contain K = 2 MB / S elements. We can use multiplication-based index calculation to accommodate a non-power-of-2 hash table size:

i1 = h & SIZE_MASK
i2 = h*K >> 32
subhash = data[i1]
x = subhash[i2]

Note that the subhash calculation requires 5-6 CPU cycles (mask+load), while the i2 calculation requires 4 CPU cycles (mul+shift), so using MUL-based indexing shouldn't make 2-level hashing slower.

This code, however, places some requirements on the hash function: 1) the hash value should have high-quality bits on both the lower and upper ends (since we use both), 2) we need to know the bitness of actual hash values, f.e. 32 for crc32. See #30969.
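A minimal sketch of the two index computations above (helper names are illustrative): i1 comes from the low bits with AND, i2 from the high bits with MUL, and (h*K) >> 32 always lands in [0, K):

```cpp
#include <cstdint>

// i1: picks the subhash from the low bits (requires a power-of-2 subhash count).
inline uint32_t subhash_index(uint32_t h, uint32_t num_subhashes)
{
    return h & (num_subhashes - 1);
}

// i2: picks the slot inside a K-element block from the high bits;
// works for any K, not just powers of 2, and never divides.
inline uint32_t mul_index(uint32_t h, uint32_t K)
{
    return uint32_t((uint64_t(h) * K) >> 32);  // always in [0, K)
}
```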

Fixed subhash size

We can continue to use AND-based indexing for a one-level hash (x = data1[h & SIZE_MASK]) and switch to a 2-level hash only when the hashtable is about to become larger than 2 MB. The latter condition gives us a nice property: each subhash will have a fixed size of 2 MB (if we never contract subhashes).

So, for each subhash we need to keep only a 64-bit pointer and a 32-bit population. Even with 2 MB CPU pages, this metadata may pollute the LLC only when the hashtables grow to more than 1 TB combined, so this approach should be no worse than the current 256-way hashtables. Also, the population update & check will not affect the hashtable access latency.

Indexing and population accounting can be reversed in order to avoid a comparison with K:

if (--i2 < 0)  i2 = K-1;   // it's even 1 cycle faster than the current i2 = (i2+1) & MASK
if (--population[i1] == 0)  extend_subhash();

The first-level table can be extended in just two steps - 32 KB and 2 MB.
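The reversed probe and the population countdown can be sketched as follows (illustrative names; insert_and_check is a hypothetical helper):

```cpp
#include <cstdint>

// Downward probe step: walking i2 towards 0 and wrapping to K-1 avoids
// an explicit comparison with K on every step.
inline int next_slot(int i2, int K)
{
    if (--i2 < 0)
        i2 = K - 1;  // per the text, 1 cycle cheaper than i2 = (i2 + 1) & MASK
    return i2;
}

// Per-subhash countdown: starts at the maximum allowed population and
// hits zero exactly when the subhash must be extended (or split).
struct SubHashMeta
{
    int32_t remaining;
};

inline bool insert_and_check(SubHashMeta & m)
{
    return --m.remaining == 0;  // true => time to extend this subhash
}
```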

Conclusion

This text considers several independent proposals. I will evaluate only the last approach, which combines them all together - extendible hashing with fixed-size 2 MB pages and MUL-based indexing:

  • for hashtables smaller than 2 MB, the behavior is the same as with the current one-level hashtables. I rely on the assumption that the main problem of CH is the memory usage and allocation speed of (much) larger hashtables
  • data for 2-level hashtables are allocated in fixed 2 MB chunks, making the allocation extremely fast even with a global chunk pool - resulting in SPEED and LACK OF EXTERNAL FRAGMENTATION
  • MUL-based indexing means that the entire 2 MB chunk space will be used for data - resulting in LACK OF INTERNAL FRAGMENTATION
  • the proposed code sequences should make access to the modified hashtables about as fast as the current ones

TLDR: for hashtables larger than 1-2 MB, memory (de)allocation will become extremely fast, and both internal and external fragmentation will be gone (except for the load factor). The access speed and behavior of smaller hashtables will not be affected, but the code complexity will grow.

@azat
Collaborator

azat commented May 7, 2023

Hi @Bulat-Ziganshin

Now, CH tries to optimize memory allocation by employing a better general-purpose allocator (jemalloc, hualloc)

The builtin HashTable/HashMap/HashSet allocates memory for the whole table, and the builtin allocator in ClickHouse will use mmap/mremap directly if the hashtable exceeds 64 MiB.

So as far as I can see, the problem will come only if the stored type is non-POD, which should not be a common use case in ClickHouse, and there's not a lot that can be done here (jemalloc tries hard to accomplish this, but there are some workloads that jemalloc does not like, for example frequent realloc; more details can be found in jemalloc/jemalloc#566)

Maybe I'm missing something here?

MUL-based indexing means that the entire 2 MB chunk space will be used for data - resulting in LACK OF INTERNAL FRAGMENTATION

But this means that the load factor will be close to 1?
Right now the common load factor is 0.5 (to get as much speed as possible)

@Bulat-Ziganshin
Author

Bulat-Ziganshin commented May 7, 2023

there's not a lot that can be done here

actually, we can. F.e. consider 5-byte elements that we want to put into 256-byte pages. Each page can contain 51 elements, so we can compute an initial index for data in the hashtable as:

h = hash32(data)
index = (h*51) >> 32
page = h & (num_pages-1)

So the index will be in the 0..50 range, computed from the higher bits of h, while the hash page (subtable) is computed from the lower bits of h (the current scheme is page = h>>24, index = h & (pagesize-1)).

By using this approach we can allocate memory for any hash table in pages of 2^N bytes and fully utilize each page for the data, while the existing system either has internal fragmentation (since it allocates an 8 KB block, of which only 5 KB is used to keep 1024 5-byte elements) or external fragmentation (because it splits blocks and then can't recombine them).

So, by using multiplication (h*51 here) in the indexing instead of the existing approach, we can allocate memory in 2^N-byte blocks and still fully utilize it for hash items of non-2^K size.
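A runnable version of the 5-byte / 256-byte-page example above (names are illustrative):

```cpp
#include <cstdint>

constexpr uint32_t ELEM_SIZE = 5;
constexpr uint32_t PAGE_SIZE = 256;
constexpr uint32_t K = PAGE_SIZE / ELEM_SIZE;  // 51 elements fit in one page

struct Slot
{
    uint32_t page;   // which subtable (low bits of h)
    uint32_t index;  // which slot inside the page, 0..K-1 (high bits of h)
};

// num_pages must be a power of 2 for the AND-based page selection.
inline Slot locate(uint32_t h, uint32_t num_pages)
{
    return Slot{h & (num_pages - 1), uint32_t((uint64_t(h) * K) >> 32)};
}
```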

The hash load factor is a separate matter. F.e. if we have a hash table with 257 5-byte items, then in the current system (which limits the load factor to <=50%):

  • useful data stored are 257*5=1285 bytes
  • hash table size is 1024*5 = 5120 bytes (and load factor is 257/1024~=25%)
  • really allocated block is 5120 rounded up to 2^N = 8192 bytes

while in the proposed system

  • useful data stored are 257*5=1285 bytes
  • hash table size is 16*(51*5) = 4080 bytes (and load factor is 1285/4080~=31%)
  • really allocated block is 4080 rounded up to 2^N = 4096 bytes

So we use less space. And if the hash table contains 255 5-byte elements, both systems will use the same space - 4096 bytes - but the existing system will have a 50% load factor, while the proposed one will have only 31%, thus making operations faster.
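The arithmetic in both lists can be reproduced mechanically (a sketch; current_alloc and proposed_alloc are hypothetical helpers encoding the <=50% load-factor cap from the example):

```cpp
#include <cstdint>

// Smallest power of 2 that is >= x.
inline uint64_t round_up_pow2(uint64_t x)
{
    uint64_t r = 1;
    while (r < x)
        r <<= 1;
    return r;
}

// Current scheme: power-of-2 slot count with load factor <= 50%,
// allocation rounded up to a power-of-2 block.
inline uint64_t current_alloc(uint64_t items, uint64_t elem)
{
    uint64_t slots = round_up_pow2(items * 2);
    return round_up_pow2(slots * elem);
}

// Proposed scheme: power-of-2 number of fully-packed pages, each holding
// page_size / elem elements (51 for 5-byte items in 256-byte pages).
inline uint64_t proposed_alloc(uint64_t items, uint64_t elem, uint64_t page_size)
{
    uint64_t per_page = page_size / elem;
    uint64_t pages = round_up_pow2((items * 2 + per_page - 1) / per_page);
    return round_up_pow2(pages * per_page * elem);
}
```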

@azat
Collaborator

azat commented May 7, 2023

I was talking about the following: HashMap<uint64, std::string>. Here std::string could do dynamic allocations by itself (if it exceeds the max static string size); HashMap will do an in-place new for the std::string, but not for the members of std::string.

So what I was trying to say is that for this particular case the allocator should be improved, not the HashMap.

Also, this case is not that common for ClickHouse; the most typical HashMap usage should be with POD types, and for them there are no memory fragmentation issues already.

@Bulat-Ziganshin
Author

Bulat-Ziganshin commented May 7, 2023

If HashMap maps, f.e., a 32-bit value to a 64-bit value, it should contain 12-byte items, shouldn't it? And in this case we will probably allocate memory (for tables smaller than 64 MB) in 2^N-sized blocks but use only 75% of this space.

Note that with 256-way hashtables, this means that the entire hashtable allocated without mmap may have a size of up to 256*64MB = 16 GB. And with 64 threads processing one operation simultaneously, each having its own 256-way hashtable, the entire memory usage may grow up to 1 TB before we start to use mmap directly.

So while this optimization improves memory utilization only for small tables, they aren't actually so small.

@azat
Collaborator

azat commented May 7, 2023

If HashMap maps, f.e., a 32-bit value to a 64-bit value, it should contain 12-byte items, shouldn't it?

32-bit key, 64-bit value? It would be 12 bytes if there were no alignment, but it takes 16 bytes in reality.
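A quick check of the padding claim, assuming a typical 64-bit ABI (the Cell name is illustrative):

```cpp
#include <cstdint>

// The 8-byte value must stay 8-byte aligned in an array of cells, so the
// compiler pads the 4-byte key, giving 16 bytes per cell instead of 12.
struct Cell
{
    uint32_t key;
    uint64_t value;
};

static_assert(sizeof(Cell) == 16, "4 bytes of padding; not 12 bytes total");
static_assert(alignof(Cell) == 8, "alignment dictated by the 64-bit member");
```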

And in this case we will probably allocate memory (for tables smaller than 64 MB) in 2^N-sized blocks but use only 75% of this space.

Not sure that I got this part.
I guess you are not talking about the load factor or padding inside the structure here?

Note that with 256-way hashtables

Here you are referring to TwoLevelHashTable?

this means that the entire hashtable allocated without mmap may have a size of up to 256*64MB = 16 GB. And with 64 threads processing one operation simultaneously, each having its own 256-way hashtable, the entire memory usage may grow up to 1 TB before we start to use mmap directly.

But even if it uses malloc instead, what is the problem here?

For example you have HashMap<uint32, uint64>:

  • initial size 256, size in bytes 4K
  • then, when 128 elements are inserted, it will grow (using realloc) to 512 elements, size in bytes 8K

So yes, it will cause internal fragmentation in jemalloc, but old allocations could be reused for other small hashtables (though not always, due to the thread cache)

And you are suggesting having a separate allocator for hashtables? But how will you avoid memory fragmentation there?

@Bulat-Ziganshin
Author

Bulat-Ziganshin commented May 8, 2023

32-bit key, 64-bit value? It would be 12 bytes if there were no alignment, but it takes 16 bytes in reality.

it can be improved, in theory :) replace Cell *buf with byte *buf and operate on the bytes directly

Note that with 256-way hashtables

Here you are referring to TwoLevelHashTable?

yeah. From now on I will refer to them as 256-segment tables, as opposed to the tables with fixed-size segments I am proposing.


I'm sorry for the repetitive, unstructured text below. We may arrange a voice call in Russian. Overall, the proposed approach has the following improvements over the existing code:

  • absolutely no memory wasted for all but the smallest hashtables
  • 10-100x fewer calls to allocation operations
  • each allocation operation requires just a few CPU instructions

Not sure that I got this part.

Let's see: whether we use malloc, jemalloc or anything else, general-purpose memory allocation remains a complex task. We spend a lot of time in alloc/free and waste a significant part of memory (aside from the hashtable load factor, which is an entirely different problem). The existing approach maintains 256 hashtable segments of varying size per thread, thus forcing a memory allocator to choose between wasting more memory and wasting more CPU time to repeatedly join/split blocks and deal with multiple threads.

Only mmap can handle memory efficiently (for large enough allocations), because it just remaps fixed-size pages, but even mmap has its own drawbacks (time spent in the OS call + time spent modifying the page table). And since mmap is currently used only for segments >= 64 MB, malloc/jemalloc has to serve hashtables up to ~8 GB in size. That's per thread, so it essentially deals with all but the largest GROUP BY operations.

So I propose a different approach: instead of splitting larger hashtables into exactly 256 segments of variable size, we use extendible hashing to split them into segments of fixed size, let's say 2 MB. This means that a pool of segments can be reused by all tables in all threads, with quick alloc/free operations.

The approach proposed here is a bit like software-based paging. Memory is allocated in fixed-size blocks, which allows us to completely avoid wasting memory (like the situation when we have two free 16 MB blocks but can't satisfy a 32 MB memory request) for essentially all hashtables (a minimum requirement of 2 MB per thread doesn't look too restrictive).

So, pros of the proposed method:

  • no memory fragmentation for all but the smallest hashtables, since any 2 MB block is good for any table and is completely used
  • optional: a global memory pool used by all threads
  • alloc and free operations are just a few CPU instructions, plus a compare-and-exchange cycle if we use a global pool
  • significantly fewer alloc/free operations; f.e. if the total size of a 256-segment hashtable is 10 MB, extending it to 12 MB will require ~44 realloc operations, compared to the single operation required to extend the proposed hashtable

In the current implementation, we enjoy no memory fragmentation (by using mmap) only for tables larger than ~8 GB, which are extremely rare due to multithreading. The proposed change shifts this threshold down to 2 MB. It also speeds up memory (de)allocation and significantly reduces the number of mmap calls, since they are needed only when the memory usage of the entire program grows or shrinks.
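Such a pool of fixed-size segments can be sketched as follows (a single-threaded toy; a shared global pool would guard the free list with the compare-and-exchange loop mentioned above):

```cpp
#include <cstddef>
#include <new>
#include <vector>

// Because every block has the same size, alloc/free are a pop/push on a
// free list - a few instructions each; only the slow path touches the OS.
constexpr std::size_t SEGMENT_SIZE = 2 * 1024 * 1024;  // 2 MB

class SegmentPool
{
    std::vector<void *> free_list;

public:
    void * alloc()
    {
        if (free_list.empty())
            return ::operator new(SEGMENT_SIZE);  // slow path: grow the pool
        void * p = free_list.back();
        free_list.pop_back();
        return p;
    }

    void free(void * p) { free_list.push_back(p); }  // cached, not returned to the OS

    std::size_t cached() const { return free_list.size(); }
};
```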

I see drawbacks of the proposed method only in the merging of hash tables (produced by different threads). While the existing code easily creates 256 independent merging tasks, the proposed method has two problems:

  • each thread's hashtable maps hash bits onto segments in its own, unique way
  • if one of the hashtables being merged contains only N segments, we can split the merging into at most N tasks

So, efficient multithreading of hashtable merging is problematic, to say the least.


And you are suggesting having a separate allocator for hashtables? But how will you avoid memory fragmentation there?

All we need is some way to allocate a 2 MB buffer. It may be implemented in various ways:

  • direct mmap/munmap calls for each allocation operation
  • a pool of free pages, backed by mmap or any existing allocator
  • direct calls to an existing allocator
  • a pool of free pages, used both directly by large hashtables and as the source of free memory by an existing allocator

The last approach seems the best.

Now, about guarantees of efficient memory usage. With a 256-segment hashtable, each segment is guaranteed to have a 25%..50% load factor, so the entire hashtable also guarantees at least 25% load.

With fixed-size segments, we also maintain the load factor of each segment in the 25%..50% range, so the entire hashtable likewise guarantees at least 25% load.
