
Run perf tests with memory sampling (for allocations >1M) #12142

Merged: 1 commit into ClickHouse:master on Jul 6, 2020

Conversation

@azat (Collaborator) commented Jul 5, 2020

This is to learn the memory allocation size distribution, which can be
obtained later from left-trace-log.tsv.

This is an attempt to tune tcmalloc (the new C++ version by Google) to use
the lock-free part of the allocator for typical allocations (and it is a bad
idea to just increase kMaxSize there, since the number of allocations for
each size class is also important).

P.S. I hope this file will be applied; if not, the same effect can be
reached by tuning the defaults in Settings.h.

Refs: #11590
Cc: @akuzm

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

@azat marked this pull request as draft July 5, 2020 09:47
@blinkov added the pr-not-for-changelog label (This PR should not be mentioned in the changelog) Jul 5, 2020
@alexey-milovidov (Member):

Looks like it doesn't affect performance. So we can merge :)

@alexey-milovidov self-assigned this Jul 5, 2020
@azat (Collaborator, author) commented Jul 5, 2020

Looks like it doesn't affect performance.

Indeed, this is interesting: jemalloc + memory sampling works faster than tcmalloc (although the sampling uses a pipe with O_NONBLOCK, so I'm not sure that all the allocations were written; I need to look at the data).

So we can merge :)

But maybe it's worth adjusting max_untracked_memory?

P.S. output.7z (the archive with all the data) is now 2x bigger: 5xx MiB vs 1.2 GiB

@azat (Collaborator, author) commented Jul 5, 2020

Allocation distribution (dumb analysis):

$ docker run -v $PWD:/wrk:ro -w /wrk -it --rm --name ch yandex/clickhouse-server clickhouse-local -q "
    SELECT
        floor(log2(size)) AS lg,
        min(size) AS min,
        max(size) AS max,
        count() AS cnt
    FROM file('left-trace-log.tsv', TSVWithNamesAndTypes, '$(cat left-trace-log.tsv.columns)')
    WHERE trace_type = 'MemorySample' AND size > 0
    GROUP BY lg
    ORDER BY cnt
    FORMAT PrettyCompact"
┌─lg─┬────────min─┬────────max─┬──────cnt─┐
│ 15 │      38696 │      38776 │       14 │
│  9 │        856 │        856 │       28 │
│ 30 │ 1073873728 │ 1074411312 │      224 │
│ 29 │  536871128 │  541056496 │      338 │
│ 28 │  268435456 │  536869536 │      348 │
│ 26 │   67108864 │  134217616 │     7649 │
│ 25 │   33554432 │   67106704 │     7830 │
│ 27 │  134217728 │  201326704 │     8037 │
│ 23 │    8388608 │   16777208 │   292857 │
│ 22 │    4194304 │    8388600 │   547939 │
│ 24 │   16777216 │   33554320 │   733812 │
│ 21 │    2097152 │    4194296 │  1399472 │
│ 20 │    1048577 │    2097144 │ 16922414 │ <-- allocations between 1 MB and 2 MB are very hot
└────┴────────────┴────────────┴──────────┘

@akuzm (Contributor) commented Jul 6, 2020

https://clickhouse-test-reports.s3.yandex.net/12142/307c3c92a586e0cde9f5522607bb5d951df32103/performance_comparison/report.html#3.15

500% expected variability in query time is something new... I've only seen it go to 150% before.

@akuzm (Contributor) commented Jul 6, 2020

https://twitter.com/trav_downs/status/1262058940831092738

A nice trick used in another sampling memory profiler to record deallocations for the same blocks.

@azat (Collaborator, author) commented Jul 6, 2020

https://twitter.com/trav_downs/status/1262058940831092738
A nice trick they use in another sampling memory profiler to record deallocation for the same blocks.

Interesting, but if I understand it right, it will only work for "large enough" allocations

@akuzm (Contributor) commented Jul 6, 2020

https://twitter.com/trav_downs/status/1262058940831092738
A nice trick they use in another sampling memory profiler to record deallocation for the same blocks.

Interesting, but if I understand it right, it will only work for "large enough" allocations

Didn't read the code, but as I understood from the discussion, they align tracked allocations to 1 GB so they can skip the untracked ones quickly, based on the address alone. Then, if they see an aligned address in free(), they take the slow path and look it up in a hash table. So it works for allocations of any size; the block just has to be aligned. And they decide which ones to sample (and align) in malloc().
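
If I understand the trick right, the fast path boils down to a single alignment check per free(). A minimal C++ sketch of that idea (all names are hypothetical, not the actual profiler's code):

#include <cstddef>
#include <cstdint>
#include <unordered_map>

static constexpr uintptr_t kSampleAlign = 1ULL << 30; // sampled blocks are 1 GiB aligned

// Slow-path bookkeeping for the rare sampled blocks.
static std::unordered_map<void *, std::size_t> sampled_blocks;

bool maybeRecordFree(void * ptr)
{
    // Fast path: an unsampled pointer is almost never 1 GiB aligned,
    // so the vast majority of frees cost one AND and one branch here.
    if (reinterpret_cast<uintptr_t>(ptr) & (kSampleAlign - 1))
        return false;

    // Slow path: alignment alone can be a false positive, so confirm
    // the block is really tracked before reporting the deallocation.
    auto it = sampled_blocks.find(ptr);
    if (it == sampled_blocks.end())
        return false;
    // ... report a deallocation of it->second bytes to the profiler ...
    sampled_blocks.erase(it);
    return true;
}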

@azat (Collaborator, author) commented Jul 6, 2020

So it works for allocation of any size, it just has to be aligned.

It just uses an array with a slot for every possible aligned pointer ((1<<47)/(1<<30) entries); sure, this could be converted to a hash table.

But (correct me if I'm wrong):

  • if the alignment is too small, it will require too much memory and produce too many false positives;
  • if it is too large, you can exhaust the virtual address space too quickly;
  • plus there is a limit on the number of VMAs (AFAIR it is 1<<16), so by default you cannot allocate more than this limit (and it is questionable whether it would be fast enough if the limit were raised a lot); with the default VMA limit and 4 MB allocations you can allocate no more than 256 GB of memory (and in practice even less, since typical VMA usage for clickhouse-server is already around 3-4K). The arithmetic is worked out below.
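
To make the numbers above concrete (assuming a 47-bit user address space and the common default vm.max_map_count of roughly 1<<16):

(1 << 47) / (1 << 30) = 1 << 17 = 131072 array slots, one per possible 1 GiB-aligned address;
(1 << 16) VMAs x 4 MiB per sampled allocation = 2^16 * 2^22 bytes = 2^38 bytes = 256 GiB.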

@akuzm (Contributor) commented Jul 6, 2020

Not sure, I didn't study it in detail yet, but they only do this alignment for a small percentage of allocations that they randomly choose to sample, so maybe it's OK...

By the way, I have another interesting thing for you :) It is viable to record every memory allocation for a query in CH; it gives about a 5x slowdown. Here is a patch that does that:
memory-profiler.txt

You use it like this:

:) truncate table system.trace_log;
:) set trace_memory = 1;

:) SELECT uniq(UserID)
FROM hits_100m_single
WHERE AdvEngineID != 0

┌─uniq(UserID)─┐
│       295652 │
└──────────────┘

1 rows in set. Elapsed: 0.640 sec. Processed 100.00 million rows, 457.66 MB (156.16 million rows/s., 714.66 MB/s.) 

:) set trace_memory = 0;
:) system flush logs;

clickhouse-client -q "
    SELECT
        arrayStringConcat(arrayMap(x -> concat(splitByChar('/', addressToLine(x))[-1], '#', demangle(addressToSymbol(x))), trace), ';') AS stack,
        sum(abs(size)) AS samples
    FROM system.trace_log
    WHERE trace_type = 'Memory' AND event_date = today()
    GROUP BY trace
    ORDER BY samples DESC
    FORMAT TabSeparated" | ~/fg/flamegraph.pl > allocs.svg

And you get all the allocations for the query in the form of a flame graph like this: https://clickhouse-test-reports.s3.yandex.net/12142/allocs.svg. This proved useful to me on a couple of occasions, but it was inconvenient to integrate properly into the CH code so that we could actually merge it, so I never did. If it were in the main tree, we could just record a full trace for a prewarm run of every query in the perf test. Sampling the measured query runs has the drawback that it adds instability to the query run time.

upd: I forgot to add one line to the patch; fixed.

@azat (Collaborator, author) commented Jul 6, 2020

So we can merge :)

I got what I needed (maybe I will try smaller allocations, but that's it), so the only question is whether this should be merged or not; as @akuzm said, this does not look good:

500% expected variability in query time is something new... I've only seen it go to 150% before.

@alexey-milovidov @akuzm ?

By the way, I have another interesting thing for you :) It is viable to record every memory allocation for a query in CH, it gives about 5 times slowdown. Here is a patch that does that:
memory-profiler.txt

@akuzm this is interesting, thanks! A few months ago I needed something similar (in particular, tracking allocations that do not yet have a proper tracker), but at that time I was thinking about adding a MemoryTracker everywhere instead; your idea looks simpler.

FWIW, this may be interesting to you too: a while ago I was thinking about adding the address of each allocation/deallocation to the trace_log, so that you can find memory leaks by analyzing trace_log. Not always, since the memory can be owned by someone else (e.g. a Buffer or similar), but something like your patch would help with this (and not only with leaks, but also with places where memory was moved to another tracker, for debugging memory-related issues).
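
A sketch of how such an analysis could look once trace_log rows carry the address (hypothetical row type; in practice addresses get reused, so this is only a first approximation):

#include <cstdint>
#include <unordered_map>
#include <vector>

struct TraceRow
{
    uintptr_t ptr;   // address recorded at allocation/deallocation time
    int64_t size;    // positive for alloc, negative for free
};

// Addresses whose logged sizes do not cancel out were allocated but never
// freed (or their memory was handed off to another tracker).
std::vector<TraceRow> findLeakCandidates(const std::vector<TraceRow> & rows)
{
    std::unordered_map<uintptr_t, int64_t> balance;
    for (const auto & row : rows)
        balance[row.ptr] += row.size;

    std::vector<TraceRow> leaks;
    for (const auto & [ptr, size] : balance)
        if (size != 0)
            leaks.push_back({ptr, size});
    return leaks;
}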

@akuzm (Contributor) commented Jul 6, 2020

Let's merge it and see how it goes; we can roll it back if it's too bad.

@akuzm marked this pull request as ready for review July 6, 2020 22:34
@akuzm merged commit a3826e5 into ClickHouse:master Jul 6, 2020
@alexey-milovidov (Member):

@akuzm Now it is possible to track every allocation without patching ClickHouse, just look at
max_untracked_memory, total_memory_profiler_step...
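
(For illustration, assuming the semantics of the settings named above: max_untracked_memory = 0 at the query level makes the memory tracker see every allocation, and a very small total_memory_profiler_step in the server config then records a trace_log sample for each step of memory usage growth.)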

@alexey-milovidov (Member) commented Jul 6, 2020

Also there was an idea to sample allocations and deallocations based on a hash of the address (the hash can be calculated in 3 clock ticks), with a higher sampling probability for larger sizes. But it has the downside that a loop doing alloc/free will likely always get the same address back, so some unlucky loops will always be slowed down by sampling.
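
A minimal sketch of such a hash-based decision (illustrative constants and names, not ClickHouse code). Because the decision depends only on the pointer, malloc and free agree on it with no shared state, which is also exactly why a loop that keeps getting the same address keeps getting the same verdict:

#include <cstddef>
#include <cstdint>

// A cheap multiply-xorshift mix: a few clock ticks, as suggested above.
inline uint64_t cheapHash(uintptr_t x)
{
    x *= 0x9E3779B97F4A7C15ULL;
    return x ^ (x >> 32);
}

// P(sample) = min(size / sample_step, 1): larger blocks are sampled with
// proportionally higher probability; blocks of 1 MiB or more always are.
inline bool shouldSample(void * ptr, std::size_t size)
{
    constexpr uint64_t sample_step = 1ULL << 20; // 1 MiB, illustrative
    return cheapHash(reinterpret_cast<uintptr_t>(ptr)) % sample_step < size;
}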

@alexey-milovidov (Member):

About alignment by 1 GiB... it seems totally pointless, because you can just make allocations return an address inside a specific range of the virtual address space (you can create a separate arena for this purpose).

@azat deleted the perf-memory-sampling branch July 7, 2020 08:31
@akuzm (Contributor) commented Jul 7, 2020

I didn't realize we now trace every allocation above 1 MB. I've just seen a query where profiling takes 25% of the query time, so I'm reverting this now. Probably I will have to disable CPU profiling for measured query runs as well, because the instability is just too much. Instead, I'll try to add both memory and CPU profiling to the prewarm runs.

Labels: no-docs-needed, pr-not-for-changelog