Performance improvements: moving from data organized in an Array of Structures (AoS) to an organization of Structure of Arrays (SoA) #5

Merged: filipecosta90 merged 2 commits into master from perf.qsort.central on Aug 7, 2021

@filipecosta90 (Collaborator) commented Feb 21, 2021

This PR changes how the data nodes are organized, from an Array of Structures (AoS) to a Structure of Arrays (SoA). In practical terms, we now use one array of primitive data types per field instead of a single array of centroids (struct).
The impact of this change on both reads and writes is summarized in the table and chart below. The improvement column is the ratio of AoS to SoA CPU time minus one, so td_merge() improves by up to 178.2% (about 2.8x faster, or roughly 64% less CPU time), td_add() improves by 90% to 97%, and td_quantile() is essentially unchanged.
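To make the layout change concrete, here is a minimal sketch of the two organizations. The type and function names (`centroid_t`, `digest_aos_t`, `digest_soa_t`, `aos_push`, `soa_push`, and so on) are hypothetical and chosen for illustration; they are not taken from `src/tdigest.c`.

```c
#include <assert.h>
#include <stdlib.h>

/* AoS: hypothetical centroid struct, one array element per centroid. */
typedef struct {
    double mean;
    double weight;
} centroid_t;

typedef struct {
    int count;          /* centroids in use */
    int cap;            /* allocated capacity */
    centroid_t *nodes;  /* memory: mean0 weight0 mean1 weight1 ... */
} digest_aos_t;

/* SoA: one array of primitive doubles per field. */
typedef struct {
    int count;
    int cap;
    double *means;      /* memory: mean0 mean1 mean2 ... */
    double *weights;    /* memory: weight0 weight1 weight2 ... */
} digest_soa_t;

/* Both init functions return 0 on success and -1 on allocation failure. */
static int aos_init(digest_aos_t *d, int cap) {
    assert(d != NULL && cap > 0);
    d->count = 0;
    d->cap = cap;
    d->nodes = malloc((size_t)cap * sizeof *d->nodes);
    return d->nodes ? 0 : -1;
}

static int soa_init(digest_soa_t *d, int cap) {
    assert(d != NULL && cap > 0);
    d->count = 0;
    d->cap = cap;
    d->means = malloc((size_t)cap * sizeof *d->means);
    d->weights = malloc((size_t)cap * sizeof *d->weights);
    if (!d->means || !d->weights) {
        free(d->means);
        free(d->weights);
        d->means = d->weights = NULL;
        return -1;
    }
    return 0;
}

/* Append one centroid; the caller guarantees count < cap. */
static void aos_push(digest_aos_t *d, double mean, double weight) {
    assert(d->count < d->cap);
    d->nodes[d->count].mean = mean;
    d->nodes[d->count].weight = weight;
    d->count++;
}

static void soa_push(digest_soa_t *d, double mean, double weight) {
    assert(d->count < d->cap);
    d->means[d->count] = mean;
    d->weights[d->count] = weight;
    d->count++;
}

/* Sum of weights: the AoS loop strides over whole structs,
 * the SoA loop walks one contiguous array of doubles. */
static double aos_total_weight(const digest_aos_t *d) {
    double total = 0.0;
    for (int i = 0; i < d->count; i++) total += d->nodes[i].weight;
    return total;
}

static double soa_total_weight(const digest_soa_t *d) {
    double total = 0.0;
    for (int i = 0; i < d->count; i++) total += d->weights[i];
    return total;
}
```

A loop that touches only one field reads half as many bytes in the SoA layout, which is one plausible reason for the gains reported here; the sketch does not try to reproduce them.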

| Method | Data distribution | Compression | CPU time for 1 op, Array of Structures (AoS) | CPU time for 1 op, Structure of Arrays (SoA) | Time unit | Overall % improvement |
|---|---|---|---|---|---|---|
| td_add() | uniform | 100 | 110 | 58 | us | 90.0% |
| td_add() | uniform | 200 | 119 | 62 | us | 93.8% |
| td_add() | uniform | 300 | 125 | 64 | us | 95.4% |
| td_add() | uniform | 400 | 129 | 66 | us | 96.1% |
| td_add() | uniform | 500 | 132 | 67 | us | 97.1% |
| td_add() | lognormal | 100 | 111 | 58 | us | 92.5% |
| td_add() | lognormal | 200 | 120 | 62 | us | 94.2% |
| td_add() | lognormal | 300 | 125 | 64 | us | 95.7% |
| td_add() | lognormal | 400 | 129 | 66 | us | 96.6% |
| td_add() | lognormal | 500 | 132 | 67 | us | 97.1% |
| td_quantile() | lognormal | 100 | 58 | 55 | us | 5.9% |
| td_quantile() | lognormal | 200 | 80 | 80 | us | 0.4% |
| td_quantile() | lognormal | 300 | 104 | 106 | us | -1.9% |
| td_quantile() | lognormal | 400 | 124 | 121 | us | 2.4% |
| td_quantile() | lognormal | 500 | 152 | 152 | us | -0.1% |
| td_merge() | lognormal | 100 | 56 | 22 | us | 158.2% |
| td_merge() | lognormal | 200 | 95 | 35 | us | 170.3% |
| td_merge() | lognormal | 300 | 135 | 49 | us | 178.2% |
| td_merge() | lognormal | 400 | 173 | 64 | us | 170.4% |
| td_merge() | lognormal | 500 | 212 | 92 | us | 130.3% |

image
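The percentages in the table match the ratio of the two raw CPU times minus one. For example, BM_td_add_uniform_dist/100 goes from 1098624403 ns to 578120317 ns, which gives 90.0%. That is a speedup figure, so it can exceed 100%, unlike a reduction in CPU time. The sketch below computes both views; the function names `improvement_pct` and `time_saved_pct` are illustrative only.

```c
#include <assert.h>

/* Speedup-style improvement: (before / after - 1) * 100.
 * This appears to be how the table's last column was derived. */
static double improvement_pct(double before_ns, double after_ns) {
    assert(before_ns > 0.0 && after_ns > 0.0);
    return (before_ns / after_ns - 1.0) * 100.0;
}

/* Share of CPU time saved: (1 - after / before) * 100.
 * This can never exceed 100%. */
static double time_saved_pct(double before_ns, double after_ns) {
    assert(before_ns > 0.0 && after_ns > 0.0);
    return (1.0 - after_ns / before_ns) * 100.0;
}
```

For BM_td_merge_lognormal_dist/300 (1353246713 ns before, 486402209 ns after), `improvement_pct` gives about 178.2% and `time_saved_pct` gives about 64%.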


Raw benchmark outputs:

Before performance improvements:

build/tests/histogram_benchmark --benchmark_min_time=10
2021-02-28 17:47:05
Running build/tests/histogram_benchmark
Run on (40 X 819.226 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x20)
  L1 Instruction 32 KiB (x20)
  L2 Unified 1024 KiB (x20)
  L3 Unified 28160 KiB (x1)
Load Average: 0.41, 0.40, 0.19
***WARNING*** Library was built as DEBUG. Timings may be affected.
-----------------------------------------------------------------------------------------------------
Benchmark                                           Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------
BM_td_add_uniform_dist/100/10000000        1098646509 ns   1098624403 ns           13 Centroid_Count=76 Total_Compressions=242.134k items_per_second=700.176k/s
BM_td_add_uniform_dist/200/10000000        1194952265 ns   1194932800 ns           12 Centroid_Count=124 Total_Compressions=110.06k items_per_second=697.389k/s
BM_td_add_uniform_dist/300/10000000        1251559692 ns   1251541906 ns           11 Centroid_Count=167 Total_Compressions=66.769k items_per_second=726.377k/s
BM_td_add_uniform_dist/400/10000000        1292489949 ns   1292473451 ns           11 Centroid_Count=212 Total_Compressions=49.954k items_per_second=703.373k/s
BM_td_add_uniform_dist/500/10000000        1322877635 ns   1322865217 ns           11 Centroid_Count=261 Total_Compressions=39.9k items_per_second=687.214k/s
BM_td_add_lognormal_dist/100/10000000      1113979525 ns   1113955647 ns           12 Centroid_Count=73 Total_Compressions=222.825k items_per_second=748.085k/s
BM_td_add_lognormal_dist/200/10000000      1198644556 ns   1198622031 ns           12 Centroid_Count=124 Total_Compressions=110.264k items_per_second=695.243k/s
BM_td_add_lognormal_dist/300/10000000      1253960787 ns   1253937374 ns           11 Centroid_Count=170 Total_Compressions=66.953k items_per_second=724.989k/s
BM_td_add_lognormal_dist/400/10000000      1294703824 ns   1294678583 ns           11 Centroid_Count=215 Total_Compressions=50.043k items_per_second=702.175k/s
BM_td_add_lognormal_dist/500/10000000      1323218297 ns   1323190558 ns           11 Centroid_Count=254 Total_Compressions=39.821k items_per_second=687.045k/s
BM_td_quantile_lognormal_dist/100/10000000  579410668 ns    579398605 ns           26 items_per_second=663.818k/s
BM_td_quantile_lognormal_dist/200/10000000  800659098 ns    800643219 ns           17 items_per_second=734.703k/s
BM_td_quantile_lognormal_dist/300/10000000 1039924788 ns   1039905417 ns           13 items_per_second=739.712k/s
BM_td_quantile_lognormal_dist/400/10000000 1239878310 ns   1239852700 ns           11 items_per_second=733.225k/s
BM_td_quantile_lognormal_dist/500/10000000 1518966398 ns   1518935909 ns            9 items_per_second=731.506k/s
BM_td_merge_lognormal_dist/100/10000000     557161796 ns    557154135 ns           25 items_per_second=7.17934k/s
BM_td_merge_lognormal_dist/200/10000000     954183623 ns    954170568 ns           15 items_per_second=6.98687k/s
BM_td_merge_lognormal_dist/300/10000000    1353286663 ns   1353246713 ns           11 items_per_second=6.71785k/s
BM_td_merge_lognormal_dist/400/10000000    1725730075 ns   1725634626 ns            8 items_per_second=7.24371k/s
BM_td_merge_lognormal_dist/500/10000000    2123453467 ns   2123344748 ns            7 items_per_second=6.72793k/s

After performance improvements:

build/tests/histogram_benchmark --benchmark_min_time=10
2021-02-28 17:57:37
Running build/tests/histogram_benchmark
Run on (40 X 920.051 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x20)
  L1 Instruction 32 KiB (x20)
  L2 Unified 1024 KiB (x20)
  L3 Unified 28160 KiB (x1)
Load Average: 0.16, 0.37, 0.35
***WARNING*** Library was built as DEBUG. Timings may be affected.
-----------------------------------------------------------------------------------------------------
Benchmark                                           Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------
BM_td_add_uniform_dist/100/10000000         578131150 ns    578120317 ns           24 Centroid_Count=68 Total_Compressions=444.294k items_per_second=720.727k/s
BM_td_add_uniform_dist/200/10000000         616649934 ns    616638680 ns           23 Centroid_Count=112 Total_Compressions=209.924k items_per_second=705.085k/s
BM_td_add_uniform_dist/300/10000000         640655884 ns    640644563 ns           22 Centroid_Count=155 Total_Compressions=133.172k items_per_second=709.513k/s
BM_td_add_uniform_dist/400/10000000         659102738 ns    659091473 ns           21 Centroid_Count=203 Total_Compressions=95.322k items_per_second=722.495k/s
BM_td_add_uniform_dist/500/10000000         671314223 ns    671302383 ns           21 Centroid_Count=241 Total_Compressions=75.951k items_per_second=709.353k/s
BM_td_add_lognormal_dist/100/10000000       578820296 ns    578810541 ns           24 Centroid_Count=69 Total_Compressions=444.315k items_per_second=719.867k/s
BM_td_add_lognormal_dist/200/10000000       617284259 ns    617274381 ns           23 Centroid_Count=117 Total_Compressions=210.366k items_per_second=704.359k/s
BM_td_add_lognormal_dist/300/10000000       640681537 ns    640671290 ns           22 Centroid_Count=156 Total_Compressions=133.296k items_per_second=709.483k/s
BM_td_add_lognormal_dist/400/10000000       658390596 ns    658379846 ns           21 Centroid_Count=205 Total_Compressions=95.258k items_per_second=723.276k/s
BM_td_add_lognormal_dist/500/10000000       671175851 ns    671189179 ns           21 Centroid_Count=241 Total_Compressions=75.976k items_per_second=709.473k/s
BM_td_quantile_lognormal_dist/100/10000000  547323253 ns    547335555 ns           25 items_per_second=730.813k/s
BM_td_quantile_lognormal_dist/200/10000000  797471026 ns    797486162 ns           17 items_per_second=737.612k/s
BM_td_quantile_lognormal_dist/300/10000000 1059981567 ns   1059999967 ns           14 items_per_second=673.854k/s
BM_td_quantile_lognormal_dist/400/10000000 1211063112 ns   1211081686 ns           12 items_per_second=688.09k/s
BM_td_quantile_lognormal_dist/500/10000000 1520062184 ns   1520081955 ns           10 items_per_second=657.859k/s
BM_td_merge_lognormal_dist/100/10000000     215755283 ns    215757947 ns           64 items_per_second=7.24191k/s
BM_td_merge_lognormal_dist/200/10000000     352967312 ns    352971094 ns           39 items_per_second=7.26434k/s
BM_td_merge_lognormal_dist/300/10000000     486397786 ns    486402209 ns           25 items_per_second=8.22365k/s
BM_td_merge_lognormal_dist/400/10000000     638274014 ns    638278973 ns           21 items_per_second=7.46054k/s
BM_td_merge_lognormal_dist/500/10000000     922054093 ns    922059806 ns           15 items_per_second=7.23019k/s

Measuring GFLOP/s on a single core

We can use the fp_arith_inst_retired.scalar_double event (number of SSE/AVX computational scalar double-precision floating-point instructions retired) to estimate the GFLOP/s achieved on a single core for both branches. The figure is computed as event count / elapsed seconds / 10^9, with the benchmark pinned to core 0 via taskset.

| Type | cycles | fp_arith_inst_retired.scalar_double | seconds | GFLOP/s (one core) |
|---|---|---|---|---|
| master branch, Array of Structures (AoS) | 60897564626 | 2181894073 | 15.679 | 0.1392 |
| perf.qsort.central branch, Structure of Arrays (SoA) | 80482605337 | 5384528505 | 20.714 | 0.2599 |

GFLOP/s on a single core: master branch (AoS)

fco@hpe10:~/t-digest-c$ sudo taskset -c 0 perf stat -e cycles,fp_arith_inst_retired.scalar_double build/tests/histogram_benchmark --benchmark_min_time=10 --benchmark_filter=BM_td_add_uniform_dist/100/10000000
2021-03-01 00:03:25
Running build/tests/histogram_benchmark
Run on (40 X 2735.53 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x20)
  L1 Instruction 32 KiB (x20)
  L2 Unified 1024 KiB (x20)
  L3 Unified 28160 KiB (x1)
Load Average: 0.81, 0.34, 0.21
***WARNING*** Library was built as DEBUG. Timings may be affected.
----------------------------------------------------------------------------------------------
Benchmark                                    Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------
BM_td_add_uniform_dist/100/10000000 1097939062 ns   1097912728 ns           13 Centroid_Count=71 Total_Compressions=240.982k items_per_second=700.63k/s

 Performance counter stats for 'build/tests/histogram_benchmark --benchmark_min_time=10 --benchmark_filter=BM_td_add_uniform_dist/100/10000000':

       60897564626      cycles                                                      
        2181894073      fp_arith_inst_retired.scalar_double                                   

      15.679044836 seconds time elapsed

GFLOP/s on a single core: perf.qsort.central branch (SoA)

fco@hpe10:~/t-digest-c$ sudo taskset -c 0 perf stat -e cycles,fp_arith_inst_retired.scalar_double build/tests/histogram_benchmark --benchmark_min_time=10 --benchmark_filter=BM_td_add_uniform_dist/100/10000000
2021-03-01 00:02:20
Running build/tests/histogram_benchmark
Run on (40 X 3195.4 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x20)
  L1 Instruction 32 KiB (x20)
  L2 Unified 1024 KiB (x20)
  L3 Unified 28160 KiB (x1)
Load Average: 0.27, 0.20, 0.16
***WARNING*** Library was built as DEBUG. Timings may be affected.
----------------------------------------------------------------------------------------------
Benchmark                                    Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------
BM_td_add_uniform_dist/100/10000000  578997285 ns    578982180 ns           24 Centroid_Count=69 Total_Compressions=444.531k items_per_second=719.654k/s

 Performance counter stats for 'build/tests/histogram_benchmark --benchmark_min_time=10 --benchmark_filter=BM_td_add_uniform_dist/100/10000000':

       80482605337      cycles                                                      
        5384528505      fp_arith_inst_retired.scalar_double                                   

      20.714224460 seconds time elapsed
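As a worked check, the counters from the two perf stat runs above can be turned into a rate with a few lines of C. This is a sketch under two assumptions: each scalar_double count is treated as one floating-point operation, and the elapsed time of the whole process is used as the denominator, so benchmark setup time is included. The helper names `gflops` and `fp_ops_per_cycle` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Floating-point operations per second, in units of 10^9. */
static double gflops(uint64_t fp_ops, double seconds) {
    assert(seconds > 0.0);
    return (double)fp_ops / seconds / 1e9;
}

/* Floating-point operations per core cycle, which is independent
 * of the clock frequency the core happened to run at. */
static double fp_ops_per_cycle(uint64_t fp_ops, uint64_t cycles) {
    assert(cycles > 0);
    return (double)fp_ops / (double)cycles;
}
```

With the master run (2181894073 events, 60897564626 cycles, 15.679044836 s) this gives roughly 0.139 GFLOP/s and 0.036 operations per cycle. With the perf.qsort.central run (5384528505 events, 80482605337 cycles, 20.714224460 s) it gives roughly 0.260 GFLOP/s and 0.067 operations per cycle.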

codecov bot commented Feb 21, 2021

Codecov Report

Merging #5 (9d30132) into master (265c005) will decrease coverage by 10.22%.
The diff coverage is 81.40%.


@@             Coverage Diff             @@
##           master       #5       +/-   ##
===========================================
- Coverage   95.20%   84.98%   -10.23%     
===========================================
  Files           1        1               
  Lines         167      253       +86     
===========================================
+ Hits          159      215       +56     
- Misses          8       38       +30     
| Impacted Files | Coverage Δ |
|---|---|
| src/tdigest.c | 84.98% <81.40%> (-10.23%) ⬇️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@filipecosta90 filipecosta90 force-pushed the perf.qsort.central branch 2 times, most recently from 44064e4 to 7649bd9 Compare February 21, 2021 11:23
@filipecosta90 filipecosta90 added the enhancement New feature or request label Feb 21, 2021
@filipecosta90 (Collaborator, Author) commented:

  • add the hot code for td_add
  • prove that this portion of code is not vectorized on master
  • prove that it is vectorized with the arrays of primitive data types
  • get numbers for td_quantile on master
  • get numbers for td_cdf on master
  • get numbers for td_merge on master
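One way to approach the two vectorization items is to isolate a loop in both layouts and ask the compiler what it did with each. The loop below is a stand-in, not the actual hot code of td_add, and the names `centroid_t`, `scale_weights_aos`, and `scale_weights_soa` are hypothetical. Compiling with `gcc -O3 -fopt-info-vec-optimized -fopt-info-vec-missed`, or with `clang -O3 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize`, prints a remark per loop saying whether it was vectorized.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical AoS element: weight sits next to mean in memory. */
typedef struct {
    double mean;
    double weight;
} centroid_t;

/* AoS version: every access to weight skips over a mean,
 * so the loads and stores are strided. */
void scale_weights_aos(centroid_t *nodes, size_t n, double factor) {
    assert(nodes != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        nodes[i].weight *= factor;
    }
}

/* SoA version: weights are contiguous doubles, the simplest
 * case for a compiler's auto-vectorizer. */
void scale_weights_soa(double *weights, size_t n, double factor) {
    assert(weights != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        weights[i] *= factor;
    }
}
```

The compiler remarks only describe this stand-in loop under an optimized build. For the real code, comparing the scalar event used in this PR against the packed counterparts (for example fp_arith_inst_retired.128b_packed_double and fp_arith_inst_retired.256b_packed_double, where the CPU exposes them) would show whether packed instructions are actually retired at run time.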

@filipecosta90 filipecosta90 force-pushed the perf.qsort.central branch 2 times, most recently from 2e9535a to ed12416 Compare February 28, 2021 11:25
@filipecosta90 filipecosta90 changed the title from "Performance improvements: use array of primary data-types instead of array of centroids ( struct )" to "Performance improvements: moving from data organized in an Array of Structures (AoS) to an organization of Structure of Arrays (SoA)" on Feb 28, 2021
@filipecosta90 (Collaborator, Author) commented:

@ashtul merging, given that I need the most performant version for a multi-sketch benchmark

@filipecosta90 filipecosta90 merged commit 94001de into master Aug 7, 2021
@filipecosta90 filipecosta90 deleted the perf.qsort.central branch August 7, 2021 14:19