Multithreading #7

Merged
brenhinkeller merged 9 commits into main from multithreading on Oct 27, 2021
Collaborator

@brenhinkeller brenhinkeller commented Oct 25, 2021

This is an attempt at adding multithreaded vt variants (e.g. vtstd) of most functions. I'm not sure this is an option we want, but it's easy enough to add by swapping in @tturbo here and there. On my 4-core AVX2 laptop, I generally see net speedups (over the single-threaded vectorized version) for arrays with about 10k elements or more:

128 elements (not faster yet)

julia> a = rand(128);

julia> @benchmark std($a)
BenchmarkTools.Trial: 10000 samples with 927 evaluations.
 Range (min … max):  110.576 ns … 341.696 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     110.711 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   115.900 ns ±  17.193 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▂  ▃      ▃        ▃                                       ▁ ▁
  ██▇▅██▆▆▆▇▇█▆▆▇▄▅▅▅▄█▄▆▅▃▅▄▄▅▃▄▄▃▁▃▄▄▁▁▁▁▃▄▃▁▃▃▄▁▁▄▃▁▃▄▁▁▁▁▄█ █
  111 ns        Histogram: log(frequency) by time        215 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark vstd($a)
BenchmarkTools.Trial: 10000 samples with 993 evaluations.
 Range (min … max):  35.796 ns … 106.163 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     36.016 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   36.422 ns ±   2.923 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ██▅▂                                                         ▂
  █████▆▅▆▆▆▄▇██▅▅▅▅▅▃▄▃▅▁▄▃▄▅▄▅▄▄▇▇▇▅▄▃▃▅▃▄▃▃▄▅▆▆▆▆▄▅▄▁▃▄▃▁▇▆ █
  35.8 ns       Histogram: log(frequency) by time      47.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark vtstd($a)
BenchmarkTools.Trial: 10000 samples with 987 evaluations.
 Range (min … max):  49.385 ns … 155.669 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     49.391 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   52.066 ns ±   8.325 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █   ▃      ▃        ▃                                      ▂ ▁
  █▆▇▇█▅▅▅▄▅▄█▆▅█▅▆█▅▃█▅▄▅▅▄▁▅▄▄▄▃▄▄▄▃▄▃▁▃▄▃▃▁▃▃▄▃▁▁▃▄▃▁▃▁▄▄▄█ █
  49.4 ns       Histogram: log(frequency) by time      95.9 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

10,000 elements (slight speedup)

julia> a = rand(10_000);

julia> @benchmark std($a)
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.073 μs …  13.137 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.096 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.246 μs ± 490.244 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▄  ▄      ▃       ▁▃▁                                    ▁ ▁
  ██▆▅██▆▆▅▅▅██▅▄▅▇▄▅███▅▅▃▃▄▅▄▁▄▄▅▄▄▃▁▁▄▄▃▄▄▃▄▁▁▁▄▁▃▃▄▄▄▁▄▅█ █
  3.07 μs      Histogram: log(frequency) by time      6.03 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark vstd($a)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.792 μs …   7.883 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.863 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.988 μs ± 374.501 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅█▇▆▄▃▂▂  ▁▂▃▂▁   ▁▂▃▃▁▁                             ▁      ▂
  █████████▆██████▇▆███████▃▄▅▅▄▅▅▅▅▄▁▃▃▄▃▄▄▅▄▁▁▄▃▅▁▁▃▄█▇█▆██ █
  1.79 μs      Histogram: log(frequency) by time      3.71 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark vtstd($a)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.221 μs …   7.955 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.323 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.372 μs ± 247.224 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    █▁▆▃                                                       
  ▅▆████▃▆▂▆▄▃▃▃▃▂▂▂▂▂▂▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▁▂▂▂▁▂▁▂▂▁▁▂▂▂▂▁▁▂▂ ▃
  1.22 μs         Histogram: frequency by time        2.54 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

10^6 elements (significant speedup)

julia> a = rand(1000,1000);

julia> @benchmark std($a, dims=2)
BenchmarkTools.Trial: 7497 samples with 1 evaluation.
 Range (min … max):  554.592 μs …   7.403 ms  ┊ GC (min … max): 0.00% … 89.92%
 Time  (median):     609.998 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   657.780 μs ± 151.642 μs  ┊ GC (mean ± σ):  0.13% ±  1.04%

  ▇█▁ ▃▁                                                         
  ███▇██▆▅▅▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  555 μs           Histogram: frequency by time         1.18 ms <

 Memory estimate: 16.56 KiB, allocs estimate: 16.

julia> @benchmark vstd($a, dims=2)
BenchmarkTools.Trial: 6697 samples with 1 evaluation.
 Range (min … max):  658.113 μs …   1.766 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     702.754 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   737.962 μs ± 108.474 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    █▄▂                                                          
  ▄████▇▆▅▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  658 μs           Histogram: frequency by time         1.25 ms <

 Memory estimate: 7.94 KiB, allocs estimate: 1.

julia> @benchmark vtstd($a, dims=2)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  194.014 μs …  1.190 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     244.788 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   273.979 μs ± 83.736 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂▂▄▆▇███▇▅▄▄▄▃▂▂▂▂▁▂▂▂▂▂▁▂▂▁▁▁▁▁▁                            ▂
  ██████████████████████████████████████▇▇█▇▆▇▇▇▆▆▇▆▆▆▅▆▇▆▇█▇▇ █
  194 μs        Histogram: log(frequency) by time       628 μs <

 Memory estimate: 7.94 KiB, allocs estimate: 1.
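
The swap described at the top of this comment can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: it assumes LoopVectorization.jl is installed, and the helper names `sum_turbo` and `sum_tturbo` are hypothetical. `@turbo` vectorizes the loop on a single thread, while `@tturbo` is the threaded counterpart.

```julia
using LoopVectorization

# Hypothetical helper: single-threaded SIMD reduction via @turbo.
function sum_turbo(x::AbstractVector{Float64})
    s = 0.0
    @turbo for i in eachindex(x)
        s += x[i]
    end
    return s
end

# Hypothetical helper: same loop, with @tturbo swapped in so that
# LoopVectorization may also split the work across threads.
function sum_tturbo(x::AbstractVector{Float64})
    s = 0.0
    @tturbo for i in eachindex(x)
        s += x[i]
    end
    return s
end
```

Both functions should agree with `sum(x)` up to floating-point reassociation; whether the threaded one is actually faster depends on array size and hardware, as the benchmarks above suggest.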


codecov bot commented Oct 25, 2021

Codecov Report

Merging #7 (1a79a96) into main (44faf93) will decrease coverage by 0.12%.
The diff coverage is 98.56%.

@@            Coverage Diff             @@
##             main       #7      +/-   ##
==========================================
- Coverage   98.62%   98.50%   -0.13%     
==========================================
  Files           7        7              
  Lines         364      800     +436     
==========================================
+ Hits          359      788     +429     
- Misses          5       12       +7     
Impacted Files                 Coverage Δ
src/VectorizedStatistics.jl    100.00% <ø> (ø)
src/vcov.jl                    97.63% <98.00%> (+0.29%) ⬆️
src/vvar.jl                    97.70% <98.19%> (-0.26%) ⬇️
src/vsum.jl                    99.47% <98.96%> (-0.53%) ⬇️
src/vmean.jl                   99.47% <98.97%> (-0.53%) ⬇️
src/vstd.jl                    100.00% <100.00%> (ø)
... and 2 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@brenhinkeller
Collaborator Author

Following discussion in https://julialang.zulipchat.com/#narrow/stream/137791-general/topic/VectorizedStatistics.20.2F.20dealing.20with.20high-dimensional.20arrays, the default syntax is now vfunction(...; multithreaded=:auto). The default :auto option enables multithreading for arrays with at least 4096 elements; it can be overridden by setting multithreaded=false (never multithread) or multithreaded=true (always multithread).
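
As a rough illustration of the keyword behavior described above, here is a minimal sketch in plain Julia. It is not the package's implementation: `should_multithread`, `threaded_sum`, `demo_sum`, and `AUTO_THRESHOLD` are hypothetical names, and the threaded path uses Base's `Threads.@spawn` as a stand-in for `@tturbo`. Only the 4096-element cutoff and the three accepted values are taken from the comment.

```julia
# Cutoff for :auto, as stated in the comment above.
const AUTO_THRESHOLD = 4096

# Hypothetical helper: decide whether to take the threaded path.
should_multithread(A, multithreaded::Bool) = multithreaded
function should_multithread(A, multithreaded::Symbol)
    multithreaded === :auto ||
        throw(ArgumentError("multithreaded must be true, false, or :auto"))
    return length(A) >= AUTO_THRESHOLD
end

# Stand-in threaded reduction built on Base threads (an assumption;
# the PR itself swaps in @tturbo instead).
function threaded_sum(A)
    isempty(A) && return zero(eltype(A))
    chunksize = cld(length(A), Threads.nthreads())
    tasks = map(Iterators.partition(eachindex(A), chunksize)) do idxs
        Threads.@spawn sum(view(A, idxs))
    end
    return sum(fetch, tasks)
end

# Hypothetical vfunction-style entry point with the keyword interface.
function demo_sum(A; multithreaded=:auto)
    return should_multithread(A, multithreaded) ? threaded_sum(A) : sum(A)
end
```

With this sketch, `demo_sum(rand(100))` takes the serial path, `demo_sum(rand(10_000))` takes the threaded path, and `multithreaded=true` or `multithreaded=false` forces one or the other regardless of size.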

@brenhinkeller brenhinkeller merged commit e3af647 into main Oct 27, 2021
@chriselrod
Member

We should talk about size thresholding at some point.

LoopVectorization already does this (and uses cost modeling to try and guess good cutoffs).
You shouldn't have to make guesses here. If it isn't doing a good job, we should probably just fix it there.

@brenhinkeller
Collaborator Author

Oh cool! Yeah, let me know what I can do on that front

@chriselrod chriselrod deleted the multithreading branch October 28, 2021 07:18