
Run perf tests with memory sampling (for allocations >1M) #12142

Merged: 1 commit into ClickHouse:master on Jul 6, 2020

Conversation

@azat (Collaborator) commented Jul 5, 2020

This is to learn the memory allocation size distribution, which can be
obtained later from left-trace-log.tsv.

This is an attempt to tune tcmalloc (the new C++ version by Google) to use
the lock-free part of the allocator for typical allocations (and it is a bad
idea to just increase kMaxSize there, since the number of allocations for
each size class is also important).

P.S. I hope this file will be applied; if not, the same effect can be
reached by tuning the defaults in Settings.h.

Refs: #11590
Cc: @akuzm

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

@azat marked this pull request as draft July 5, 2020 09:47
@blinkov added the pr-not-for-changelog label (This PR should not be mentioned in the changelog) Jul 5, 2020
@alexey-milovidov (Member):

Looks like it doesn't affect performance. So we can merge :)

@alexey-milovidov self-assigned this Jul 5, 2020
@azat (Collaborator, author) commented Jul 5, 2020

Looks like it doesn't affect performance.

Indeed, this is interesting: jemalloc + memory sampling works faster than tcmalloc (although the sampling uses a pipe with O_NONBLOCK, so I'm not sure that all the allocations were written; I need to look at the data).

So we can merge :)

But maybe it's worth adjusting max_untracked_memory?

P.S. output.7z (the archive with all the data) is now 2x bigger: 5xx MiB vs 1.2 GiB

@azat (Collaborator, author) commented Jul 5, 2020

Allocation distribution (dumb analysis):

$ docker run -v $PWD:/wrk:ro -w /wrk -it --rm --name ch yandex/clickhouse-server clickhouse-local -q "
    SELECT
        floor(log2(size)) AS lg,
        min(size) AS min,
        max(size) AS max,
        count() AS cnt
    FROM file('left-trace-log.tsv', TSVWithNamesAndTypes, '$(cat left-trace-log.tsv.columns)')
    WHERE trace_type = 'MemorySample' AND size > 0
    GROUP BY lg
    ORDER BY cnt
    FORMAT PrettyCompact"
┌─lg─┬────────min─┬────────max─┬──────cnt─┐
│ 15 │      38696 │      38776 │       14 │
│  9 │        856 │        856 │       28 │
│ 30 │ 1073873728 │ 1074411312 │      224 │
│ 29 │  536871128 │  541056496 │      338 │
│ 28 │  268435456 │  536869536 │      348 │
│ 26 │   67108864 │  134217616 │     7649 │
│ 25 │   33554432 │   67106704 │     7830 │
│ 27 │  134217728 │  201326704 │     8037 │
│ 23 │    8388608 │   16777208 │   292857 │
│ 22 │    4194304 │    8388600 │   547939 │
│ 24 │   16777216 │   33554320 │   733812 │
│ 21 │    2097152 │    4194296 │  1399472 │
│ 20 │    1048577 │    2097144 │ 16922414 │ <-- allocations between 1 MB and 2 MB are very hot
└────┴────────────┴────────────┴──────────┘

@akuzm (Contributor) commented Jul 6, 2020

https://clickhouse-test-reports.s3.yandex.net/12142/307c3c92a586e0cde9f5522607bb5d951df32103/performance_comparison/report.html#3.15

500% expected variability in query time is something new... I've only seen it go to 150% before.

@akuzm (Contributor) commented Jul 6, 2020

https://twitter.com/trav_downs/status/1262058940831092738

A nice trick used in another sampling memory profiler to record deallocations for the same blocks.

@azat (Collaborator, author) commented Jul 6, 2020

https://twitter.com/trav_downs/status/1262058940831092738
A nice trick they use in another sampling memory profiler to record deallocation for the same blocks.

Interesting, but if I understand it right, it will only work for "large enough" allocations

@akuzm (Contributor) commented Jul 6, 2020

https://twitter.com/trav_downs/status/1262058940831092738
A nice trick they use in another sampling memory profiler to record deallocation for the same blocks.

Interesting, but if I understand it right, it will only work for "large enough" allocations

Didn't read the code, but as I understood from the discussion, they align tracked allocations to 1 GB so they can skip the untracked ones quickly, based on the address alone. Then, if they see an aligned address in free(), they take the slow path and look it up in a hash table. So it works for allocations of any size; the block just has to be aligned. And they decide which ones to sample (and align) in malloc().
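
If I understand the trick right, the fast path boils down to a single alignment check per free(). A minimal C++ sketch of that idea (all names are hypothetical, not the actual profiler's code):

#include <cstddef>
#include <cstdint>
#include <unordered_map>

static constexpr uintptr_t kSampleAlign = 1ULL << 30; // sampled blocks are 1 GiB aligned

// Slow-path bookkeeping for the rare sampled blocks.
static std::unordered_map<void *, std::size_t> sampled_blocks;

bool maybeRecordFree(void * ptr)
{
    // Fast path: an unsampled pointer is almost never 1 GiB aligned,
    // so the vast majority of frees cost one AND and one branch here.
    if (reinterpret_cast<uintptr_t>(ptr) & (kSampleAlign - 1))
        return false;

    // Slow path: alignment alone can be a false positive, so confirm
    // the block is really tracked before reporting the deallocation.
    auto it = sampled_blocks.find(ptr);
    if (it == sampled_blocks.end())
        return false;
    // ... report a deallocation of it->second bytes to the profiler ...
    sampled_blocks.erase(it);
    return true;
}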

@azat (Collaborator, author) commented Jul 6, 2020

So it works for allocation of any size, it just has to be aligned.

It just uses an array with a slot for every possible aligned pointer ((1<<47)/(1<<30) entries); sure, this could be converted to a hash table.

But (correct me if I'm wrong):

  • if the alignment is too small, it will require too much memory and produce too many false positives;
  • if it is too large, you can exhaust the virtual address space too quickly;
  • plus there is a limit on the number of VMAs (AFAIR it is 1<<16), so by default you cannot allocate more than this limit (and it is questionable whether it would be fast enough if the limit were raised a lot); with the default VMA limit and 4 MB allocations you can allocate no more than 256 GB of memory (and in practice even less, since typical VMA usage for clickhouse-server is already around 3-4K). The arithmetic is worked out below.
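
To make the numbers above concrete (assuming a 47-bit user address space and the common default vm.max_map_count of roughly 1<<16):

(1 << 47) / (1 << 30) = 1 << 17 = 131072 array slots, one per possible 1 GiB-aligned address;
(1 << 16) VMAs x 4 MiB per sampled allocation = 2^16 * 2^22 bytes = 2^38 bytes = 256 GiB.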

@akuzm (Contributor) commented Jul 6, 2020

Not sure, I didn't study it in detail yet, but they only do this alignment for a small percentage of allocations that they randomly choose to sample, so maybe it's OK...

By the way, I have another interesting thing for you :) It is viable to record every memory allocation for a query in CH; it gives about a 5x slowdown. Here is a patch that does that:
memory-profiler.txt

You use it like this:

:) truncate table system.trace_log;
:) set trace_memory = 1;

:) SELECT uniq(UserID)
FROM hits_100m_single
WHERE AdvEngineID != 0

┌─uniq(UserID)─┐
│       295652 │
└──────────────┘

1 rows in set. Elapsed: 0.640 sec. Processed 100.00 million rows, 457.66 MB (156.16 million rows/s., 714.66 MB/s.) 

:) set trace_memory = 0;
:) system flush logs;

clickhouse-client -q "
    SELECT
        arrayStringConcat(arrayMap(x -> concat(splitByChar('/', addressToLine(x))[-1], '#', demangle(addressToSymbol(x))), trace), ';') AS stack,
        sum(abs(size)) AS samples
    FROM system.trace_log
    WHERE trace_type = 'Memory' AND event_date = today()
    GROUP BY trace
    ORDER BY samples DESC
    FORMAT TabSeparated" | ~/fg/flamegraph.pl > allocs.svg

And you get all the allocations for the query in the form of a flame graph like this: https://clickhouse-test-reports.s3.yandex.net/12142/allocs.svg. This proved useful to me on a couple of occasions, but it was inconvenient to integrate properly into the CH code so that we could actually merge it, so I never did. If it were in the main tree, we could just record a full trace for a prewarm run of every query in the perf test. Sampling the measured query runs has the drawback that it adds instability to the query run time.

upd: I forgot to add one line to the patch; fixed.

@azat (Collaborator, author) commented Jul 6, 2020

So we can merge :)

I got what I needed (maybe I will try smaller allocations, but that's it), so the only question is whether this should be merged or not; as @akuzm said, this does not look good:

500% expected variability in query time is something new... I've only seen it go to 150% before.

@alexey-milovidov @akuzm ?

By the way, I have another interesting thing for you :) It is viable to record every memory allocation for a query in CH, it gives about 5 times slowdown. Here is a patch that does that:
memory-profiler.txt

@akuzm this is interesting, thanks! A few months ago I needed something similar (in particular, tracking allocations that do not yet have a proper tracker), but at that time I was thinking about adding a MemoryTracker everywhere instead; your idea looks simpler.

FWIW, this may be interesting to you too: a while ago I was thinking about adding the address of each allocation/deallocation to the trace_log, so that you can find memory leaks by analyzing trace_log. Not always, since the memory can be owned by someone else (e.g. a Buffer or similar), but something like your patch would help with this (and not only with leaks, but also with places where memory was moved to another tracker, for debugging memory-related issues).
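
A sketch of how such an analysis could look once trace_log rows carry the address (hypothetical row type; in practice addresses get reused, so this is only a first approximation):

#include <cstdint>
#include <unordered_map>
#include <vector>

struct TraceRow
{
    uintptr_t ptr;   // address recorded at allocation/deallocation time
    int64_t size;    // positive for alloc, negative for free
};

// Addresses whose logged sizes do not cancel out were allocated but never
// freed (or their memory was handed off to another tracker).
std::vector<TraceRow> findLeakCandidates(const std::vector<TraceRow> & rows)
{
    std::unordered_map<uintptr_t, int64_t> balance;
    for (const auto & row : rows)
        balance[row.ptr] += row.size;

    std::vector<TraceRow> leaks;
    for (const auto & [ptr, size] : balance)
        if (size != 0)
            leaks.push_back({ptr, size});
    return leaks;
}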

@akuzm (Contributor) commented Jul 6, 2020

Let's merge it and see how it goes; we can roll it back if it's too bad.

@akuzm marked this pull request as ready for review July 6, 2020 22:34
@akuzm merged commit a3826e5 into ClickHouse:master Jul 6, 2020
@alexey-milovidov (Member):

@akuzm Now it is possible to track every allocation without patching ClickHouse, just look at
max_untracked_memory, total_memory_profiler_step...
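
(For illustration, assuming the semantics of the settings named above: max_untracked_memory = 0 at the query level makes the memory tracker see every allocation, and a very small total_memory_profiler_step in the server config then records a trace_log sample for each step of memory usage growth.)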

@alexey-milovidov (Member) commented Jul 6, 2020

Also there was an idea to sample allocations and deallocations based on a hash of the address (the hash can be calculated in 3 clock ticks), with a higher sampling probability for larger sizes. But it has the downside that a loop doing alloc/free will likely always get the same address back, so some unlucky loops will always be slowed down by sampling.
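
A minimal sketch of such a hash-based decision (illustrative constants and names, not ClickHouse code). Because the decision depends only on the pointer, malloc and free agree on it with no shared state, which is also exactly why a loop that keeps getting the same address keeps getting the same verdict:

#include <cstddef>
#include <cstdint>

// A cheap multiply-xorshift mix: a few clock ticks, as suggested above.
inline uint64_t cheapHash(uintptr_t x)
{
    x *= 0x9E3779B97F4A7C15ULL;
    return x ^ (x >> 32);
}

// P(sample) = min(size / sample_step, 1): larger blocks are sampled with
// proportionally higher probability; blocks of 1 MiB or more always are.
inline bool shouldSample(void * ptr, std::size_t size)
{
    constexpr uint64_t sample_step = 1ULL << 20; // 1 MiB, illustrative
    return cheapHash(reinterpret_cast<uintptr_t>(ptr)) % sample_step < size;
}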

@alexey-milovidov (Member):

About alignment by 1 GiB... it seems totally pointless, because you can just make allocations return an address inside a specific range of the virtual address space (you can create a separate arena for this purpose).

@azat deleted the perf-memory-sampling branch July 7, 2020 08:31
@akuzm (Contributor) commented Jul 7, 2020

I didn't realize we now trace every allocation above 1 MB. I've just seen a query where profiling takes 25% of the query time, so I'm reverting this now. Probably I will have to disable CPU profiling for measured query runs as well, because the instability is just too much. Instead, I'll try to add both memory and CPU profiling to the prewarm runs.

Labels: no-docs-needed, pr-not-for-changelog