Performance improvements: moving from data organized in an Array of Structures (AoS) to a Structure of Arrays (SoA) #5

filipecosta90 · 2021-02-21T11:01:46Z

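This PR changes how the node data is organized in memory, moving from an Array of Structures (AoS) to a Structure of Arrays (SoA). In practical terms, we now use arrays of primitive data types instead of an array of centroids (struct).
The impact of this change on both reads and writes is described by the following table and chart. td_add() improves by roughly 90% to 97% and td_merge() by roughly 130% to 178%, while td_quantile() is essentially unchanged. The improvement is the AoS CPU time divided by the SoA CPU time, minus 1, so 178% means the AoS version needs about 2.8x the CPU time for the same operation.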

method / data distribution / compression	cpu_time for 1 OP, Array of Structures (AoS)	cpu_time for 1 OP, Structure of Arrays (SoA)	time_unit	overall % improvement
td_add() unif. dist / cmp. 100	110	58	us	90.0%
td_add() unif. dist / cmp. 200	119	62	us	93.8%
td_add() unif. dist / cmp. 300	125	64	us	95.4%
td_add() unif. dist / cmp. 400	129	66	us	96.1%
td_add() unif. dist / cmp. 500	132	67	us	97.1%
td_add() logn. dist / cmp. 100	111	58	us	92.5%
td_add() logn. dist / cmp. 200	120	62	us	94.2%
td_add() logn. dist / cmp. 300	125	64	us	95.7%
td_add() logn. dist / cmp. 400	129	66	us	96.6%
td_add() logn. dist / cmp. 500	132	67	us	97.1%
td_quantile() logn. dist / cmp. 100	58	55	us	5.9%
td_quantile() logn. dist / cmp. 200	80	80	us	0.4%
td_quantile() logn. dist / cmp. 300	104	106	us	-1.9%
td_quantile() logn. dist / cmp. 400	124	121	us	2.4%
td_quantile() logn. dist / cmp. 500	152	152	us	-0.1%
td_merge() logn. dist / cmp. 100	56	22	us	158.2%
td_merge() logn. dist / cmp. 200	95	35	us	170.3%
td_merge() logn. dist / cmp. 300	135	49	us	178.2%
td_merge() logn. dist / cmp. 400	173	64	us	170.4%
td_merge() logn. dist / cmp. 500	212	92	us	130.3%
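
For readers less familiar with the two layouts compared above, here is a minimal sketch. The type names, the field names (mean, weight) and the helper functions are assumptions made for this illustration and are not taken from src/tdigest.c. The point is only that, with SoA, a loop that touches a single field walks one contiguous array of doubles, which is typically friendlier to the cache and to compiler auto-vectorization.

```c
#include <assert.h>
#include <stddef.h>

/* AoS: one struct per centroid, so mean and weight are interleaved in memory.
 * Hypothetical layout, for illustration only. */
typedef struct {
    double mean;
    double weight;
} centroid_aos_t;

/* SoA: one array per field, so all means are contiguous and so are all weights.
 * Hypothetical layout, for illustration only. */
typedef struct {
    double *means;
    double *weights;
    size_t count;
} centroids_soa_t;

/* AoS version: every load of a weight steps over the neighbouring mean. */
double total_weight_aos(const centroid_aos_t *nodes, size_t count) {
    assert(nodes != NULL || count == 0);
    double total = 0.0;
    for (size_t i = 0; i < count; i++) {
        total += nodes[i].weight;
    }
    return total;
}

/* SoA version: a unit-stride walk over a plain array of doubles. */
double total_weight_soa(const centroids_soa_t *c) {
    assert(c != NULL);
    double total = 0.0;
    for (size_t i = 0; i < c->count; i++) {
        total += c->weights[i];
    }
    return total;
}
```

Both functions return the same sum for the same centroids; only the memory access pattern differs.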

Raw benchmark outputs:

Before performance improvements:

build/tests/histogram_benchmark --benchmark_min_time=10
2021-02-28 17:47:05
Running build/tests/histogram_benchmark
Run on (40 X 819.226 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x20)
  L1 Instruction 32 KiB (x20)
  L2 Unified 1024 KiB (x20)
  L3 Unified 28160 KiB (x1)
Load Average: 0.41, 0.40, 0.19
***WARNING*** Library was built as DEBUG. Timings may be affected.
-----------------------------------------------------------------------------------------------------
Benchmark                                           Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------
BM_td_add_uniform_dist/100/10000000        1098646509 ns   1098624403 ns           13 Centroid_Count=76 Total_Compressions=242.134k items_per_second=700.176k/s
BM_td_add_uniform_dist/200/10000000        1194952265 ns   1194932800 ns           12 Centroid_Count=124 Total_Compressions=110.06k items_per_second=697.389k/s
BM_td_add_uniform_dist/300/10000000        1251559692 ns   1251541906 ns           11 Centroid_Count=167 Total_Compressions=66.769k items_per_second=726.377k/s
BM_td_add_uniform_dist/400/10000000        1292489949 ns   1292473451 ns           11 Centroid_Count=212 Total_Compressions=49.954k items_per_second=703.373k/s
BM_td_add_uniform_dist/500/10000000        1322877635 ns   1322865217 ns           11 Centroid_Count=261 Total_Compressions=39.9k items_per_second=687.214k/s
BM_td_add_lognormal_dist/100/10000000      1113979525 ns   1113955647 ns           12 Centroid_Count=73 Total_Compressions=222.825k items_per_second=748.085k/s
BM_td_add_lognormal_dist/200/10000000      1198644556 ns   1198622031 ns           12 Centroid_Count=124 Total_Compressions=110.264k items_per_second=695.243k/s
BM_td_add_lognormal_dist/300/10000000      1253960787 ns   1253937374 ns           11 Centroid_Count=170 Total_Compressions=66.953k items_per_second=724.989k/s
BM_td_add_lognormal_dist/400/10000000      1294703824 ns   1294678583 ns           11 Centroid_Count=215 Total_Compressions=50.043k items_per_second=702.175k/s
BM_td_add_lognormal_dist/500/10000000      1323218297 ns   1323190558 ns           11 Centroid_Count=254 Total_Compressions=39.821k items_per_second=687.045k/s
BM_td_quantile_lognormal_dist/100/10000000  579410668 ns    579398605 ns           26 items_per_second=663.818k/s
BM_td_quantile_lognormal_dist/200/10000000  800659098 ns    800643219 ns           17 items_per_second=734.703k/s
BM_td_quantile_lognormal_dist/300/10000000 1039924788 ns   1039905417 ns           13 items_per_second=739.712k/s
BM_td_quantile_lognormal_dist/400/10000000 1239878310 ns   1239852700 ns           11 items_per_second=733.225k/s
BM_td_quantile_lognormal_dist/500/10000000 1518966398 ns   1518935909 ns            9 items_per_second=731.506k/s
BM_td_merge_lognormal_dist/100/10000000     557161796 ns    557154135 ns           25 items_per_second=7.17934k/s
BM_td_merge_lognormal_dist/200/10000000     954183623 ns    954170568 ns           15 items_per_second=6.98687k/s
BM_td_merge_lognormal_dist/300/10000000    1353286663 ns   1353246713 ns           11 items_per_second=6.71785k/s
BM_td_merge_lognormal_dist/400/10000000    1725730075 ns   1725634626 ns            8 items_per_second=7.24371k/s
BM_td_merge_lognormal_dist/500/10000000    2123453467 ns   2123344748 ns            7 items_per_second=6.72793k/s

After performance improvements:

build/tests/histogram_benchmark --benchmark_min_time=10
2021-02-28 17:57:37
Running build/tests/histogram_benchmark
Run on (40 X 920.051 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x20)
  L1 Instruction 32 KiB (x20)
  L2 Unified 1024 KiB (x20)
  L3 Unified 28160 KiB (x1)
Load Average: 0.16, 0.37, 0.35
***WARNING*** Library was built as DEBUG. Timings may be affected.
-----------------------------------------------------------------------------------------------------
Benchmark                                           Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------
BM_td_add_uniform_dist/100/10000000         578131150 ns    578120317 ns           24 Centroid_Count=68 Total_Compressions=444.294k items_per_second=720.727k/s
BM_td_add_uniform_dist/200/10000000         616649934 ns    616638680 ns           23 Centroid_Count=112 Total_Compressions=209.924k items_per_second=705.085k/s
BM_td_add_uniform_dist/300/10000000         640655884 ns    640644563 ns           22 Centroid_Count=155 Total_Compressions=133.172k items_per_second=709.513k/s
BM_td_add_uniform_dist/400/10000000         659102738 ns    659091473 ns           21 Centroid_Count=203 Total_Compressions=95.322k items_per_second=722.495k/s
BM_td_add_uniform_dist/500/10000000         671314223 ns    671302383 ns           21 Centroid_Count=241 Total_Compressions=75.951k items_per_second=709.353k/s
BM_td_add_lognormal_dist/100/10000000       578820296 ns    578810541 ns           24 Centroid_Count=69 Total_Compressions=444.315k items_per_second=719.867k/s
BM_td_add_lognormal_dist/200/10000000       617284259 ns    617274381 ns           23 Centroid_Count=117 Total_Compressions=210.366k items_per_second=704.359k/s
BM_td_add_lognormal_dist/300/10000000       640681537 ns    640671290 ns           22 Centroid_Count=156 Total_Compressions=133.296k items_per_second=709.483k/s
BM_td_add_lognormal_dist/400/10000000       658390596 ns    658379846 ns           21 Centroid_Count=205 Total_Compressions=95.258k items_per_second=723.276k/s
BM_td_add_lognormal_dist/500/10000000       671175851 ns    671189179 ns           21 Centroid_Count=241 Total_Compressions=75.976k items_per_second=709.473k/s
BM_td_quantile_lognormal_dist/100/10000000  547323253 ns    547335555 ns           25 items_per_second=730.813k/s
BM_td_quantile_lognormal_dist/200/10000000  797471026 ns    797486162 ns           17 items_per_second=737.612k/s
BM_td_quantile_lognormal_dist/300/10000000 1059981567 ns   1059999967 ns           14 items_per_second=673.854k/s
BM_td_quantile_lognormal_dist/400/10000000 1211063112 ns   1211081686 ns           12 items_per_second=688.09k/s
BM_td_quantile_lognormal_dist/500/10000000 1520062184 ns   1520081955 ns           10 items_per_second=657.859k/s
BM_td_merge_lognormal_dist/100/10000000     215755283 ns    215757947 ns           64 items_per_second=7.24191k/s
BM_td_merge_lognormal_dist/200/10000000     352967312 ns    352971094 ns           39 items_per_second=7.26434k/s
BM_td_merge_lognormal_dist/300/10000000     486397786 ns    486402209 ns           25 items_per_second=8.22365k/s
BM_td_merge_lognormal_dist/400/10000000     638274014 ns    638278973 ns           21 items_per_second=7.46054k/s
BM_td_merge_lognormal_dist/500/10000000     922054093 ns    922059806 ns           15 items_per_second=7.23019k/s
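
The percentages in the summary table can be reproduced from the CPU column of the two raw outputs above. A small sketch, assuming the improvement is defined as AoS time / SoA time - 1 (this definition is inferred from the numbers and is not stated explicitly in the PR); the function names are hypothetical:

```c
#include <assert.h>

/* Improvement in percent: how much longer the baseline takes than the
 * candidate, relative to the candidate. Can exceed 100%. */
double improvement_pct(double baseline_ns, double candidate_ns) {
    assert(candidate_ns > 0.0);
    return (baseline_ns / candidate_ns - 1.0) * 100.0;
}

/* CPU time saved in percent, relative to the baseline. Always below 100%. */
double time_saved_pct(double baseline_ns, double candidate_ns) {
    assert(baseline_ns > 0.0);
    return (1.0 - candidate_ns / baseline_ns) * 100.0;
}
```

For BM_td_add_uniform_dist/100, improvement_pct(1098624403, 578120317) gives about 90.0, matching the table, while time_saved_pct gives about 47.4, i.e. the SoA version saves a little under half of the CPU time. For BM_td_merge_lognormal_dist/300, the same inputs from the raw outputs give about 178.2 and 64.1 respectively.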

Measuring Giga-FLOPS per cycle per core per second

We can use the fp_arith_inst_retired.scalar_double event (the number of SSE/AVX computational scalar double-precision floating-point instructions retired) to get the Giga-FLOPS per cycle per core per second for both branches, with the benchmark pinned to a single core:

Type	cycles	fp_arith_inst_retired.scalar_double	seconds	Giga-FLOPS per cycle per core per second
master branch -- Array of Structures (AoS)	60897564626	2181894073	15.679	0.1392
perf.qsort.central branch -- Structure of Arrays (SoA)	80443906341	5379084425	20.705	0.2598
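
The last column of the table above appears to be the event count divided by the elapsed seconds and by 10^9; the cycle count does not enter that figure. A minimal sketch of the arithmetic, assuming each counted scalar instruction is treated as one floating-point operation; the function names are hypothetical:

```c
#include <assert.h>

/* Billions of floating-point operations per second of elapsed time. */
double gflops(double fp_ops, double elapsed_seconds) {
    assert(elapsed_seconds > 0.0);
    return fp_ops / elapsed_seconds / 1e9;
}

/* Floating-point operations retired per CPU cycle. */
double flops_per_cycle(double fp_ops, double cycles) {
    assert(cycles > 0.0);
    return fp_ops / cycles;
}
```

With the table's inputs, gflops(2181894073, 15.679) is about 0.1392 for master and gflops(5379084425, 20.705) is about 0.2598 for perf.qsort.central, matching the last column. The per-cycle view from the same counters is about 0.036 and 0.067 respectively.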

Giga-FLOPS per cycle per core per second master branch

fco@hpe10:~/t-digest-c$ sudo taskset -c 0 perf stat -e cycles,fp_arith_inst_retired.scalar_double build/tests/histogram_benchmark --benchmark_min_time=10 --benchmark_filter=BM_td_add_uniform_dist/100/10000000
2021-03-01 00:03:25
Running build/tests/histogram_benchmark
Run on (40 X 2735.53 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x20)
  L1 Instruction 32 KiB (x20)
  L2 Unified 1024 KiB (x20)
  L3 Unified 28160 KiB (x1)
Load Average: 0.81, 0.34, 0.21
***WARNING*** Library was built as DEBUG. Timings may be affected.
----------------------------------------------------------------------------------------------
Benchmark                                    Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------
BM_td_add_uniform_dist/100/10000000 1097939062 ns   1097912728 ns           13 Centroid_Count=71 Total_Compressions=240.982k items_per_second=700.63k/s

 Performance counter stats for 'build/tests/histogram_benchmark --benchmark_min_time=10 --benchmark_filter=BM_td_add_uniform_dist/100/10000000':

       60897564626      cycles                                                      
        2181894073      fp_arith_inst_retired.scalar_double                                   

      15.679044836 seconds time elapsed

Giga-FLOPS per cycle per core per second perf.qsort.central branch

fco@hpe10:~/t-digest-c$ sudo taskset -c 0 perf stat -e cycles,fp_arith_inst_retired.scalar_double build/tests/histogram_benchmark --benchmark_min_time=10 --benchmark_filter=BM_td_add_uniform_dist/100/10000000
2021-03-01 00:02:20
Running build/tests/histogram_benchmark
Run on (40 X 3195.4 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x20)
  L1 Instruction 32 KiB (x20)
  L2 Unified 1024 KiB (x20)
  L3 Unified 28160 KiB (x1)
Load Average: 0.27, 0.20, 0.16
***WARNING*** Library was built as DEBUG. Timings may be affected.
----------------------------------------------------------------------------------------------
Benchmark                                    Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------
BM_td_add_uniform_dist/100/10000000  578997285 ns    578982180 ns           24 Centroid_Count=69 Total_Compressions=444.531k items_per_second=719.654k/s

 Performance counter stats for 'build/tests/histogram_benchmark --benchmark_min_time=10 --benchmark_filter=BM_td_add_uniform_dist/100/10000000':

       80482605337      cycles                                                      
        5384528505      fp_arith_inst_retired.scalar_double                                   

      20.714224460 seconds time elapsed

codecov · 2021-02-21T11:10:01Z

Codecov Report

Merging #5 (9d30132) into master (265c005) will decrease coverage by 10.22%.
The diff coverage is 81.40%.

@@             Coverage Diff             @@
##           master       #5       +/-   ##
===========================================
- Coverage   95.20%   84.98%   -10.23%     
===========================================
  Files           1        1               
  Lines         167      253       +86     
===========================================
+ Hits          159      215       +56     
- Misses          8       38       +30

Impacted Files	Coverage Δ
src/tdigest.c	`84.98% <81.40%> (-10.23%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

filipecosta90 · 2021-02-22T10:15:05Z

Add the hot code for td_add.
Prove that this portion of the code is not vectorized on master.
Prove that it is vectorized with the arrays of primitive data types.
Get numbers for td_quantile on master.
Get numbers for td_cdf on master.
Get numbers for td_merge on master.

filipecosta90 · 2021-08-07T14:19:36Z

@ashtul merging given I need the most performant version for a multi sketch benchmark

filipecosta90 force-pushed the perf.qsort.central branch from 8366d72 to 030e490 Compare February 21, 2021 11:09

filipecosta90 force-pushed the perf.qsort.central branch 2 times, most recently from 44064e4 to 7649bd9 Compare February 21, 2021 11:23

filipecosta90 added the enhancement label Feb 21, 2021

filipecosta90 requested a review from ashtul February 21, 2021 21:45

filipecosta90 force-pushed the perf.qsort.central branch 2 times, most recently from 2e9535a to ed12416 Compare February 28, 2021 11:25

[add] Performance improvements: use array of primary data-types instead of array of centroids ( struct )

29ed47b

filipecosta90 force-pushed the perf.qsort.central branch from ed12416 to 29ed47b Compare February 28, 2021 11:28

filipecosta90 changed the title ~~Performance improvements: use array of primary data-types instead of array of centroids ( struct )~~ Performance improvements: moving from data organized in an Array of Structures (AoS) to a Structure of Arrays (SoA) Feb 28, 2021

Merge branch 'master' into perf.qsort.central

9d30132

filipecosta90 merged commit 94001de into master Aug 7, 2021

filipecosta90 deleted the perf.qsort.central branch August 7, 2021 14:19
