
Pass arguments to _avx_! in registers. #261

Merged: 3 commits merged into master from avxargsinregisters on May 13, 2021

Conversation

@chriselrod (Member) commented on May 12, 2021

On current master, the arguments are passed on the stack.
This PR passes them in registers instead.

Of course, this only matters when `_avx_!` isn't inlined.
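
As a rough way to see this kind of difference outside of LoopVectorization (an illustrative sketch with hypothetical functions, not the actual `_avx_!` signature), compare the calling conventions of a callee that takes one argument tuple with one that takes separate scalars:

using InteractiveUtils

# Hypothetical example: the same three Float64 values passed as one tuple
# (an aggregate the ABI may place on the stack) versus as separate scalars
# (small isbits values, typically passed in floating-point registers).
callee_tuple(args::NTuple{3,Float64}) = args[1] + args[2] * args[3]
callee_scalars(a, b, c) = a + b * c

@code_native callee_tuple((1.0, 2.0, 3.0))
@code_native callee_scalars(1.0, 2.0, 3.0)

On a typical x86-64 build, the scalar version receives its arguments directly in registers, while the tuple version typically reads them from the stack.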

@codecov (bot) commented on May 12, 2021

Codecov Report

Merging #261 (2486aaa) into master (3ef16fb) will decrease coverage by 0.04%.
The diff coverage is 86.86%.


@@            Coverage Diff             @@
##           master     #261      +/-   ##
==========================================
- Coverage   90.25%   90.20%   -0.05%     
==========================================
  Files          36       36              
  Lines        7572     7650      +78     
==========================================
+ Hits         6834     6901      +67     
- Misses        738      749      +11     
Impacted Files                         Coverage Δ
src/LoopVectorization.jl               100.00% <ø> (ø)
src/parse/memory_ops_common.jl          80.60% <0.00%> (ø)
src/codegen/lowering.jl                 89.67% <66.66%> (-0.31%) ⬇️
src/codegen/lower_threads.jl            59.94% <69.23%> (+0.05%) ⬆️
src/condense_loopset.jl                 92.65% <85.00%> (-0.90%) ⬇️
src/modeling/determinestrategy.jl       97.18% <90.90%> (-0.02%) ⬇️
src/modeling/graphs.jl                  88.82% <95.83%> (+0.23%) ⬆️
src/codegen/loopstartstopmanager.jl     89.11% <100.00%> (+0.02%) ⬆️
src/codegen/lower_compute.jl            93.84% <100.00%> (ø)
src/reconstruct_loopset.jl              93.41% <100.00%> (+0.01%) ⬆️
... and 5 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@chriselrod (Member, Author) commented:

The bump in VectorizationBase requirement fixes #214:

julia> @btime fun_avxt($D,$QT,$μ,$σ,$m,$i);
  1.378 μs (0 allocations: 0 bytes)

julia> @btime fun_avx($D,$QT,$μ,$σ,$m,$i);
  7.565 μs (0 allocations: 0 bytes)

julia> @btime fun_simd($D,$QT,$μ,$σ,$m,$i);
  9.948 μs (0 allocations: 0 bytes)

julia> # redefine args

julia> eltype(D)
Float32

julia> @btime fun_avxt($D,$QT,$μ,$σ,$m,$i);
  905.000 ns (0 allocations: 0 bytes)

julia> @btime fun_avx($D,$QT,$μ,$σ,$m,$i);
  1.848 μs (0 allocations: 0 bytes)

julia> @btime fun_simd($D,$QT,$μ,$σ,$m,$i);
  1.873 μs (0 allocations: 0 bytes)

This allows using approximate inverses in fun_simd.
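
("Approximate inverses" refers to reciprocal-estimate style computation. The sketch below shows only the textbook refinement idea, with a Float16 round-trip standing in for a hardware estimate; it is not VectorizationBase's implementation.)

# Hypothetical sketch: take a low-precision reciprocal estimate and apply one
# Newton-Raphson step, y_new = y*(2 - x*y), which roughly doubles the number
# of correct bits.
coarse_inv(x::Float32) = Float32(inv(Float16(x)))           # ~11-bit estimate
newton_refine(x::Float32, y::Float32) = y * (2f0 - x * y)   # one Newton step

x  = 3.7f0
y0 = coarse_inv(x)
y1 = newton_refine(x, y0)
(y0, y1, inv(x))   # y1 matches inv(x) to (nearly) full Float32 precision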

This should also fix the type instabilities caused by passing types as arguments, and it should lower the function call overhead. A few benchmarks suggest that this helps threading in cases where `_avx_!` isn't inlined, since the individual calls are then cheaper.
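
A minimal sketch of the type-argument issue (hypothetical functions, not the `_avx_!` signature): when a type is accepted as an ordinary value, Julia may choose not to specialize on it, whereas dispatching on `::Type{T}` guarantees specialization and keeps the result inferable.

# Hypothetical example: T arrives as a plain value, so specialization on it is
# left to Julia's heuristics.
alloc_plain(T, n) = Vector{T}(undef, n)

# Dispatching on ::Type{T} forces specialization, so the return type
# (here Vector{Float64}) is known to inference.
alloc_typed(::Type{T}, n) where {T} = Vector{T}(undef, n)

alloc_typed(Float64, 8)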

using LoopVectorization, Octavian, LinearAlgebra, StaticArrays, BenchmarkTools

@inline function AmulB!(C,A,B)
    @avx for n in indices((B,C),2), m in indices((A,C),1)
        Cmn = zero(eltype(C))
        for k in indices((A,B),(2,1))
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end
    C
end

M = K = N = 8; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
Astatic = SMatrix{8,8}(A); Bstatic = SMatrix{8,8}(B); Amutable = MMatrix(Astatic); Bmutable = MMatrix(Bstatic); Cmutable = similar(Amutable);
@time(AmulB!(C0,A,B)); C0 ≈ C1 # compile time (a bunch of things were compiled already when I ran this)
@benchmark AmulB!($C0,$A,$B) # dynamic 
@benchmark matmul_serial!($C0,$A,$B) # dynamically sized Octavian, serial
@benchmark matmul!($C0,$A,$B) # dynamically sized Octavian, multithreaded

Astatic = SMatrix{8,8}(A); Bstatic = SMatrix{8,8}(B);
Amutable = MMatrix(Astatic); Bmutable = MMatrix(Bstatic); Cmutable = similar(Amutable);

@benchmark mul!($Cmutable, $Amutable, $Bmutable)
@benchmark $(Ref(Astatic))[] * $(Ref(Bstatic))[]

@benchmark AmulB!($Cmutable, $Amutable, $Bmutable)
@benchmark matmul!($Cmutable, $Amutable, $Bmutable)

Results:

julia> @time(AmulB!(C0,A,B)); C0 ≈ C1 # I redefined `AmulB!`, but that particular `_avx_!` was already compiled
  0.009071 seconds (37.50 k allocations: 1.917 MiB, 99.85% compilation time)
true

julia> @benchmark AmulB!($C0,$A,$B) # dynamic
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     23.575 ns (0.00% GC)
  median time:      24.313 ns (0.00% GC)
  mean time:        24.355 ns (0.00% GC)
  maximum time:     59.959 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     996

julia> @benchmark matmul_serial!($C0,$A,$B) # dynamically sized Octavian, serial
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     27.155 ns (0.00% GC)
  median time:      27.295 ns (0.00% GC)
  mean time:        27.356 ns (0.00% GC)
  maximum time:     76.102 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     995

julia> @benchmark matmul!($C0,$A,$B) # dynamically sized Octavian, multithreaded
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     27.087 ns (0.00% GC)
  median time:      27.390 ns (0.00% GC)
  mean time:        27.420 ns (0.00% GC)
  maximum time:     65.763 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     995

julia> Astatic = SMatrix{8,8}(A); Bstatic = SMatrix{8,8}(B);

julia> Amutable = MMatrix(Astatic); Bmutable = MMatrix(Bstatic); Cmutable = similar(Amutable);

julia> @benchmark mul!($Cmutable, $Amutable, $Bmutable)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     62.418 ns (0.00% GC)
  median time:      62.452 ns (0.00% GC)
  mean time:        62.533 ns (0.00% GC)
  maximum time:     93.819 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     983

julia> @benchmark $(Ref(Astatic))[] * $(Ref(Bstatic))[]
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     37.092 ns (0.00% GC)
  median time:      37.280 ns (0.00% GC)
  mean time:        37.296 ns (0.00% GC)
  maximum time:     68.081 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     992

julia> @benchmark AmulB!($Cmutable, $Amutable, $Bmutable)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     12.097 ns (0.00% GC)
  median time:      12.146 ns (0.00% GC)
  mean time:        12.168 ns (0.00% GC)
  maximum time:     38.972 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark matmul!($Cmutable, $Amutable, $Bmutable)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     12.097 ns (0.00% GC)
  median time:      12.148 ns (0.00% GC)
  mean time:        12.168 ns (0.00% GC)
  maximum time:     38.316 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

So the dynamically sized version already yields a good improvement over StaticArrays here.

@chriselrod merged commit 93d2be2 into master on May 13, 2021
@chriselrod deleted the avxargsinregisters branch on June 1, 2021 at 09:23