Fixed-block allocator for large hashtables #32362
The builtin HashTable/HashMap/HashSet allocates memory for the whole table, and the builtin allocator in ClickHouse will use mmap only for allocations of at least 64 MB, falling back to malloc for smaller ones.
So, as far as I can see, the problem will arise only if the stored type is non-POD, which should not be a common use case in ClickHouse, and there is not a lot that can be done here (jemalloc tries hard to accomplish this, but there are some workloads that jemalloc does not like, for example frequent realloc; more details can be found in jemalloc/jemalloc#566). Maybe I'm missing something here?
But this means that the load factor will be close to 1?
actually, we can. F.e. consider 5-byte elements which we want to put into 256-byte pages. Each page can contain 51 elements, so we can compute an initial index from `h = hash32(data)` by multiplication: `i = (uint64(h) * 51) >> 32`. By using this approach we can allocate memory for any hash table in pages of 2^N bytes and fully utilize each page for the data, while the existing system either has internal fragmentation (since it allocates an 8 KB block, of which only 5 KB is used to keep 1024 5-byte elements) or external fragmentation (because it splits blocks and then can't recombine them back). So, by using multiplication instead of masking, the page capacity no longer has to be a power of 2.

The hash load factor is a separate entity. F.e. if we have a hash table with 257 5-byte items, then the current system (which limits the load factor to <=50%) allocates 1024 slots, i.e. an 8192-byte block of which only 5120 bytes are used, while the proposed system serves it with a single 4096-byte page holding 819 slots.

So we use less space. And if the hash table contains 255 5-byte elements, both systems will use the same space - 4096 bytes, but the existing system will have a 50% load factor, while the proposed one will have only 31%, thus making operations faster.
I was talking about the following
So what I was trying to say is that for this particular case the allocator should be improved, not the hashtable. Also, this case is not that general for ClickHouse; the most common
If HashMap maps f.e. a 32-bit value to a 64-bit value, it should contain 12-byte items, shouldn't it? And in this case we will probably allocate memory (for tables smaller than 64 MB) in 2^N-sized blocks but use only 75% of this space. Note that with 256-way hashtables, this means that the entire hashtable allocated without mmap may have a size of up to 256 * 64 MB = 16 GB. And with 64 threads processing one operation simultaneously, each having its own 256-way hashtable, the entire memory usage may grow up to 1 TB before we start to use mmap directly. So while this optimization improves memory utilization only for small tables, they aren't actually so small.
32-bit key, 64-bit value? It would be 12 bytes if there were no alignment, but in reality it will take 16 bytes.
Not sure that I got this part.
Here you are referring to
But even if it uses malloc instead, what is the problem here? For example, you have
So yes, it will cause internal fragmentation in jemalloc, but old allocations could be reused for other small hashtables (though not always, due to the thread cache). And you are suggesting having a separate allocator for hashtables? But how will you avoid memory fragmentation there?
it can be improved, in theory :) replace
yeah. From now on I will refer to them as 256-segment tables, as opposed to the tables with fixed-size segments that I am proposing. I'm sorry for the repetitive, unstructured text below. We may arrange a voice call in Russian. Overall, the proposed approach has the following improvements over the existing code:
let's see. Whether we use malloc, jemalloc or anything else, general-purpose memory allocation remains a complex task: we spend a lot of time in alloc/free and waste a significant part of memory (aside from the hashtable load factor, which is an entirely different problem). The existing approach maintains 256 hashtable segments of varying size per thread, thus forcing the memory allocator to choose between wasting more memory or wasting more CPU time to repeatedly join/split blocks and deal with multiple threads. Only mmap can handle memory efficiently (for large enough allocations), because it just remaps fixed-size pages, but even mmap has its own drawbacks (time spent in the OS call + time spent modifying the page table). And since currently mmap is used only for segments >= 64 MB, malloc/jemalloc has to serve hashtables of up to ~8 GB in size. That's per thread, so it essentially deals with all but the largest GROUP BY operations.

So I propose a different approach. Instead of splitting larger hashtables into exactly 256 segments of variable size, we use extendible hashing to split them into segments of fixed size, let's say 2 MB. This means that a pool of segments can be reused by all tables in all threads, with quick alloc/free operations. The approach proposed here is a bit like software-based paging: memory is allocated in fixed-size blocks, which allows us to completely avoid wasting memory (like the situation when we have two free 16 MB blocks but can't satisfy a 32 MB memory request) for essentially all hashtables (a minimum requirement of 2 MB per thread doesn't look too restrictive). So, the pros of the proposed method:
In the current implementation, we enjoy no memory fragmentation (by using mmap) only for tables larger than ~8 GB, which are extremely rare due to multithreading. The proposed change shifts this threshold down to 2 MB. It also speeds up memory (de)allocation and significantly reduces the number of mmap calls, since they are needed only when the memory usage of the entire program grows or shrinks. I see drawbacks of the proposed method only in the merging of hash tables (produced by different threads). While the existing code easily creates 256 independent merging tasks, the proposed method has two problems:
So, efficient multithreading of hashtable merging is problematic, to say the least.
All we need is some way to allocate a 2 MB buffer. It may be implemented in various ways:
The last approach seems the best. Now, about guarantees of efficient memory usage. With a 256-segment hashtable, each segment is guaranteed to have a 25%..50% load factor, so the entire hashtable also guarantees at least 25% load. With fixed-size segments, we also maintain the load factor of each segment in the 25%..50% range, so the entire hashtable again guarantees at least 25% load.
Now, CH tries to optimize memory allocation by employing a better general-purpose allocator (jemalloc, hualloc). I propose to research a different approach: identify the main memory eaters and individually customize these data structures and their allocation strategies, with the goal of reducing their memory fragmentation and/or memory allocation times. In particular, the following analysis targets large hash tables, since I know their implementation and I expect that they are among the main CH memory eaters. @alexey-milovidov, I hope you will read it.
Buddy allocator
In particular, each hash table has its own element size S, and memory segments allocated for a particular hash table are always S*2^N bytes. So, instead of trying to fit them into segments of 2^N bytes, we can reserve for each S value a large address area (1 TB or so) and use f.e. a buddy memory allocator to share this memory area between all hash tables with the given S.
Limit the block size
Another opportunity (that may be considered independently) is to store a large hash table as an array of fixed-size segments rather than one big segment. F.e. we can limit the segment size to 2 MB and thus find the largest N for which S*2^N <= 2^21.
Once the hash table is going to have more than 2^N elements, we allocate its memory as multiple fixed-size blocks of S*2^N bytes and access the hash table with two-level indexing:
hashtable[i] === hashtable.data[ i / 2^N ][ i % 2^N ]
This way, smaller blocks can still be allocated using the buddy allocator, while all larger blocks have the same size and may be kept in a single global list (for each S).
Extendible hashing
But wait - CH hashtables are already 2-level, so this adds the 3rd indexing level. We can have only 2 levels and still limit the sub-hash size with extendible hashing:
In terms of indexing, it just needs replacing the current `i1 = h >> 24` with `i1 = h >> V`, increasing the hashtable access latency by only 1 CPU cycle, and only on Skylake and older archs.

All blocks are buddies
Instead of making a private buddy allocator for each S, we can put a non-power-of-2 number of elements in each block. F.e. make all blocks 2 MB large, so a block will contain `K = 2MB / S` elements. We can use multiplication-based index calculation to accommodate a non-power-of-2 hash table size.

Note that the subhash calculation requires 5-6 CPU cycles (mask+load), while the i2 calculation requires 4 CPU cycles (mul+shift), so using MUL-based indexing shouldn't make 2-level hashing slower.
This code, however, places some requirements on the hash function: 1) the hash value should have high-quality bits at both the lower and upper ends (since we use both), 2) we need to know the bitness of actual hash values, f.e. 32 for crc32. See #30969.
Fixed subhash size
We can continue to use AND-based indexing for the one-level hash (`x = data1[h & SIZE_MASK]`) and switch to a 2-level hash only when the hashtable is going to become larger than 2 MB. The last condition gives us a nice property: each subhash will have a fixed size of 2 MB (if we never contract subhashes).

So, for each subhash we need to keep only a 64-bit pointer and a 32-bit population. Even with 2 MB CPU pages, this metadata may pollute the LLC only when hashtables grow to more than 1 TB combined, so this approach should be no worse than the current 256-way hashtables. Also, the population update & check will not affect the hashtable access latency.
Indexing and population accounting can be reversed in order to avoid comparison with K:
The first-level table can be extended in just two steps - 32 KB and 2 MB.
Conclusion
This text considers several independent proposals. I will evaluate only the last approach, which combines them all together - extendible hashing with fixed-size 2 MB pages and MUL-based indexing:
TLDR: for hashtables larger than 1-2 MB, memory (de)allocation will become extremely fast and both internal and external fragmentation will be gone (except for load factor). The access speed and behavior of smaller hashtables will not be affected, but the code complexity will grow.