-
Notifications
You must be signed in to change notification settings - Fork 6.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Run perf tests with memory sampling (for allocations >1M) #12142
Conversation
This is to know the memory allocation size distribution, that can be obtained later from left-metric-log.tsv. This is an attempt to tune tcmalloc (new CPP version by google) to use lock-free part of the allocator for typical allocations (and it is a bad idea just to increase kMaxSize there, since number of allocation for each size class is also important). P.S. hope that this file will be applied, if no, then the same effect can be reached by tunning defaults in Settings.h Refs: ClickHouse#11590 Cc: @akuzm
Looks like it doesn't affect performance. So we can merge :) |
Indeed, this is interesting, I mean that jemalloc+memory sampling (although it uses pipe w/ O_NONBLOCK, so not sure that all the allocations was written) works faster then tcmalloc (need to look at data)
But maybe worth adjust the P.S. output.7z (archive with all data) now 2x bigger, 5xx MiB vs 1.2GiB |
Allocation distribution (dumb analysis):
|
500% expected variability in query time is something new... I've only seen it go to 150% before. |
https://twitter.com/trav_downs/status/1262058940831092738 A nice trick they use in another sampling memory profiler to record deallocation for the same blocks. |
Interesting, but if understand it right, it will work only for "large enough" allocations |
Didn't read the code, but as I understood from the discussion, they align tracked allocations by 1GB, to skip the untracked ones fast, based on address. Then, if they see an aligned address in free(), they go on slow path to look it up in a hash table. So it works for allocation of any size, it just has to be aligned. And they decide which ones to sample and align in malloc(). |
It uses just an array for all available pointers of aligned allocations ((1<<47)/(1<<30)), sure this can be converted to hashtable But (correct me if I'm wrong):
|
Not sure, didn't study it in detail yet, but they only do this alignment for a small percentage of allocations they randomly chose to sample, so maybe it's ok... By the way, I have another interesting thing for you :) It is viable to record every memory allocation for a query in CH, it gives about 5 times slowdown. Here is a patch that does that: You use it like this:
And you get all the allocations for the query in a form of a flame graph like this: https://clickhouse-test-reports.s3.yandex.net/12142/allocs.svg. This proved useful to me on a couple of occasions, but it was inconvenient to properly integrate into the CH code so that we could actually merge it, so I never did. If it was in the main tree, we could just record full trace for a prewarm run of every query in perf test. Sampling the measured query runs has a drawback that it adds instability to query run time. upd: forgot to add one line to the patch, fixed |
I got what I need (maybe I will try smaller allocations but that's it), so the only question should this be merged or not, but as @akuzm said this does not looks good:
@akuzm this is interesting, thanks! Few months ago I need smth similar (in particular tracking allocation that do not have proper tracker already), but at that time I was thinking about adding MemoryTracker everywhere instead, you idea looks simpler. FWIW maybe you will be interesting too, few times ago I was thinking about adding address of the allocation/deallocation into the trace_log, and you can find memory leaks with it, by analyzing trace_log, not always, since memory can be owned by someone else, i.e. Buffer or similar, but something like in your patch will help with this (and not only leaks, but also places where the memory was moved to another tracker, for debugging memory related issues) |
Let's merge and look how it goes, we can roll it back if it's too bad. |
@akuzm Now it is possible to track every allocation without patching ClickHouse, just look at |
Also there was an idea to sample allocations and deallocations based on hash of address (hash can be calculated in 3 clock ticks) (and do sampling with higher probability for larger sizes), but it has a downside that a loop with alloc/free will likely to always return the same address and some random loops will slowdown due to sampling. |
About alignment by 1 GiB... it seems totally pointless, because you can just do some allocations to return an address inside specific range of virtual address space (you can create separate arena for this purpose). |
I didn't realize we trace every allocation above 1 MB now. Just seen a query where profiling takes 25% of the query time, I'm reverting this now. Probably will have to disable CPU profiling for measured query runs as well, because the instability is just too much. Instead of that, I'll try to add both memory and CPU profiling to prewarm runs. |
This is to know the memory allocation size distribution, that can be
obtained later from left-trace-log.tsv.
This is an attempt to tune tcmalloc (new CPP version by google) to use
lock-free part of the allocator for typical allocations (and it is a bad
idea just to increase kMaxSize there, since number of allocation for
each size class is also important).
P.S. hope that this file will be applied, if no, then the same effect
can be reached by tunning defaults in Settings.h
Refs: #11590
Cc: @akuzm
Changelog category (leave one):
HEAD's: