
Perf test: bind server to one NUMA node #15144

Merged
merged 38 commits into from Oct 23, 2020

Conversation

akuzm
Contributor

@akuzm akuzm commented Sep 22, 2020

Another try with disabled percpu arenas; see #12631 (comment).

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

@robot-clickhouse robot-clickhouse added the pr-not-for-changelog (This PR should not be mentioned in the changelog) and submodule changed (At least one submodule changed in this PR) labels Sep 22, 2020
@akuzm
Contributor Author

akuzm commented Sep 23, 2020

https://clickhouse-test-reports.s3.yandex.net/15144/ab19bb25fd8c286713580649ad1c183d493ea5dc/performance_comparison/report.html#fail1

Disabling percpu arenas didn't help, it still segfaults.

Oh wait, it did help -- only the old server segfaults, not the new one. Now to find some way to disable it at runtime...

akuzm and others added 2 commits September 23, 2020 12:02
@@ -7,6 +7,10 @@ trap 'kill $(jobs -pr) ||:' EXIT
stage=${stage:-}
script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"

# https://github.com/jemalloc/jemalloc/wiki/Getting-Started
export MALLOC_CONF="percpu_arena:disabled"
ln -s "percpu_arena:disabled" /etc/malloc.conf
Collaborator

Suggested change
ln -s "percpu_arena:disabled" /etc/malloc.conf
ln -s "confirm_conf:true,percpu_arena:disabled" /etc/malloc.conf

I would also suggest adding confirm_conf:true; it will print all options with their values:

$ MALLOC_CONF=confirm_conf:true,percpu_arena:disabled ./clickhouse server -V
<jemalloc>: malloc_conf #1 (string specified via --with-malloc-conf): "percpu_arena:percpu,oversize_threshold:0,muzzy_decay_ms:10000"
<jemalloc>: -- Set conf value: percpu_arena:percpu
<jemalloc>: -- Set conf value: oversize_threshold:0
<jemalloc>: -- Set conf value: muzzy_decay_ms:10000
<jemalloc>: malloc_conf #2 (string pointed to by the global variable malloc_conf): ""
<jemalloc>: malloc_conf #3 ("name" of the file referenced by the symbolic link named /etc/malloc.conf): "confirm_conf:true"
<jemalloc>: -- Set conf value: confirm_conf:true
<jemalloc>: malloc_conf #4 (value of the environment variable MALLOC_CONF): "confirm_conf:true,percpu_arena:disabled"
<jemalloc>: -- Set conf value: confirm_conf:true
<jemalloc>: -- Set conf value: percpu_arena:disabled
ClickHouse server version 20.10.1.4618.
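
Note that jemalloc takes the option string from the name of the file the /etc/malloc.conf symlink points to (visible as malloc_conf #3 above), so the link target does not need to be an existing file. A minimal sketch of setting and inspecting it:

$ ln -sfn "confirm_conf:true,percpu_arena:disabled" /etc/malloc.conf
$ readlink /etc/malloc.conf
confirm_conf:true,percpu_arena:disabled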

@akuzm
Contributor Author

akuzm commented Sep 24, 2020

Finally it works, and we have far fewer unstable queries than in master (only 12, compared to ~100):
https://clickhouse-test-reports.s3.yandex.net/15144/21f9cc65c99e8210d7b14a4bef5969710451bc5e/performance_comparison/report.html#fail1

After the jemalloc bug is fixed, we can enable this in master. Or maybe we can just disable percpu arenas for the perf test; I'll check how that influences performance.

@amosbird
Collaborator

amosbird commented Sep 24, 2020

Maybe we can also try binding the two instances to different NUMA nodes. That might give unequal CPU frequencies but less cache contention. It might also be good to leave out CPU 0, since it handles system interrupts.

@akuzm
Contributor Author

akuzm commented Sep 24, 2020

Maybe we can also try binding the two instances to different NUMA nodes. That might give unequal CPU frequencies but less cache contention. It might also be good to leave out CPU 0, since it handles system interrupts.

I'm afraid of introducing some systematic difference by binding them to different nodes, but I'll try it as well. We only have two nodes there (this is from the test log):

2020-09-24 04:03:21	 + numactl --hardware
2020-09-24 04:03:21	 available: 2 nodes (0-1)
2020-09-24 04:03:21	 node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
2020-09-24 04:03:21	 node 0 size: 257592 MB
2020-09-24 04:03:21	 node 0 free: 126497 MB
2020-09-24 04:03:21	 node 1 cpus: 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
2020-09-24 04:03:21	 node 1 size: 258027 MB
2020-09-24 04:03:21	 node 1 free: 83471 MB
2020-09-24 04:03:21	 node distances:
2020-09-24 04:03:21	 node   0   1 
2020-09-24 04:03:21	   0:  10  21 
2020-09-24 04:03:21	   1:  21  10 
2020-09-24 04:03:21	 + lscpu
2020-09-24 04:03:21	 Architecture:        x86_64
2020-09-24 04:03:21	 CPU op-mode(s):      32-bit, 64-bit
2020-09-24 04:03:21	 Byte Order:          Little Endian
2020-09-24 04:03:21	 CPU(s):              80
2020-09-24 04:03:21	 On-line CPU(s) list: 0-79
2020-09-24 04:03:21	 Thread(s) per core:  2
2020-09-24 04:03:21	 Core(s) per socket:  20
2020-09-24 04:03:21	 Socket(s):           2
2020-09-24 04:03:21	 NUMA node(s):        2
2020-09-24 04:03:21	 Vendor ID:           GenuineIntel
2020-09-24 04:03:21	 CPU family:          6
2020-09-24 04:03:21	 Model:               85
2020-09-24 04:03:21	 Model name:          Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
2020-09-24 04:03:21	 Stepping:            7
2020-09-24 04:03:21	 CPU MHz:             2799.957
2020-09-24 04:03:21	 CPU max MHz:         3900.0000
2020-09-24 04:03:21	 CPU min MHz:         800.0000
2020-09-24 04:03:21	 BogoMIPS:            4200.00
2020-09-24 04:03:21	 Virtualization:      VT-x
2020-09-24 04:03:21	 L1d cache:           32K
2020-09-24 04:03:21	 L1i cache:           32K
2020-09-24 04:03:21	 L2 cache:            1024K
2020-09-24 04:03:21	 L3 cache:            28160K
2020-09-24 04:03:21	 NUMA node0 CPU(s):   0-19,40-59
2020-09-24 04:03:21	 NUMA node1 CPU(s):   20-39,60-79
2020-09-24 04:03:21	 Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke flush_l1d arch_capabilities
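
For reference, binding each server instance to its own node (both CPUs and memory) would look roughly like this with numactl; the config paths here are placeholders, not the actual perf-test command:

$ numactl --cpunodebind=0 --membind=0 clickhouse-server --config-file=left/config.xml
$ numactl --cpunodebind=1 --membind=1 clickhouse-server --config-file=right/config.xml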

@akuzm
Contributor Author

akuzm commented Sep 25, 2020

https://clickhouse-test-reports.s3.yandex.net/15144/26abe8cb30819eea1c1f3383cc99620ab2e9da9a/performance_comparison/report.html#fail1

Disabling percpu arenas doesn't really influence the performance.

@azat
Collaborator

azat commented Sep 25, 2020

https://clickhouse-test-reports.s3.yandex.net/15144/26abe8cb30819eea1c1f3383cc99620ab2e9da9a/performance_comparison/report.html#fail1
Disabling percpu arenas doesn't really influence the performance.

In general it should not. The purpose of using percpu arenas (#11084) is to avoid live leaks with a big thread pool: otherwise there are too many arenas, and if some thread is not scheduled for a while, it will not free its data. (There is also the background-threads approach, but percpu arenas were preferred: #11086.)
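
For context, the background-threads alternative mentioned above is jemalloc's background_thread option, which dedicates internal threads to purging unused pages even when application threads are idle. A hypothetical combination (not what the perf test ships):

$ MALLOC_CONF=background_thread:true,percpu_arena:disabled ./clickhouse server -V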

@robot-clickhouse robot-clickhouse removed the submodule changed At least one submodule changed in this PR. label Sep 30, 2020
@akuzm akuzm marked this pull request as ready for review October 23, 2020 11:10
@akuzm akuzm merged commit 4476117 into master Oct 23, 2020
@akuzm akuzm deleted the aku/numa-perf branch October 23, 2020 11:10
@@ -77,23 +77,25 @@ function restart
while killall clickhouse-server; do echo . ; sleep 1 ; done
echo all killed

# Disable percpu arenas because they segfault when the process is bound to
# a particular NUMA node: https://github.com/jemalloc/jemalloc/pull/1939
Collaborator

But it should not segfault anymore, since #15035 is merged (it updates the jemalloc submodule to include a backported version of the patch from the referenced pull request in upstream jemalloc).

Contributor Author

Great, I'll remove it.

@azat
Collaborator

azat commented Oct 23, 2020

@akuzm just out of curiosity, what does it improve in the perf tests? (Taking into account the fact that you disable percpu arenas, you may have a bit more memory reserved and sometimes a bit less overhead on freeing, but this is just a possibility.)

@akuzm
Contributor Author

akuzm commented Oct 26, 2020

@akuzm just out of curiosity, what does it improve in the perf tests? (Taking into account the fact that you disable percpu arenas, you may have a bit more memory reserved and sometimes a bit less overhead on freeing, but this is just a possibility.)

It gives more stable results for test queries that reference a lot of memory. Without NUMA binding we had about 150 unstable queries; now only 35 are left.
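
For anyone reproducing this, one way to verify a running server's binding is to inspect its CPU affinity and per-node memory usage; a sketch, with illustrative output:

$ taskset -cp "$(pidof clickhouse-server)"
pid 12345's current affinity list: 0-19,40-59
$ numastat -p "$(pidof clickhouse-server)"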
