
Conversation

@chriselrod (Collaborator) commented Jan 4, 2021

BLASBenchmarks currently uses reshape(view(x, 1:prod(dims)), dims) to create its arrays.
This works around a BenchmarkTools issue where interpolated variables never get freed by the GC. Doing it that way makes it easier to run benchmarks on computers with limited RAM, because I can reuse the same x across all iterations and matrix sizes, rather than allocating new arrays every time that are never freed.
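
As a minimal sketch of that pattern (the buffer size and the matrix_of_size helper are illustrative, not BLASBenchmarks' actual code):

x = Vector{Float64}(undef, 10_000_000)   # one big buffer, allocated once
matrix_of_size(dims) = reshape(view(x, 1:prod(dims)), dims)

A = matrix_of_size((200, 200))   # reuses x; no fresh allocation per size
B = matrix_of_size((500, 500))   # same buffer, different shape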

However, I realized that Octavian was unusually slow at small sizes. This was because

julia> x = rand(10000);

julia> A = reshape(view(x, 1:16*16), (16,16));

julia> A isa DenseArray
false

Because two of Octavian's deps already make heavy use of ArrayInterface.jl, I figured the easiest way to handle this case would be to add an explicit dependency. This also opens the door to using it in more places.
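
(For reference, Base's StridedArray union, unlike DenseArray, does still cover this reshape-of-a-view, and the stride information is recoverable; ArrayInterface.jl exposes this kind of information as traits that generic code can dispatch on.)

julia> A isa StridedArray
true

julia> strides(A)
(1, 16)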

A couple asides:
We should really optimize the fast path.
That was important for getting PaddedMatrices' fully dynamic jmul!(::Matrix, ::Matrix, ::Matrix) to quickly overtake StaticArrays, e.g. here. To hit 20 GFLOPS for 6x6 and 35 GFLOPS for 8x8 requires around 22ns and 30ns respectively (a 6x6 matmul is 2·6³ = 432 flops, and 432 flops in 22ns is about 20 GFLOPS); every nanosecond matters there.

Also, we should add threading for loop 5 when we don't pack B. This is important at smallish sizes, especially on computers with a lot of threads looking for a chunk of matrix to multiply.

codecov bot commented Jan 4, 2021

Codecov Report

Merging #29 (11ed033) into master (98ac7ba) will not change coverage.
The diff coverage is 100.00%.


@@            Coverage Diff            @@
##            master       #29   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            9         9           
  Lines          116       121    +5     
=========================================
+ Hits           116       121    +5     
Impacted Files    Coverage Δ
src/Octavian.jl   100.00% <ø> (ø)
src/matmul.jl     100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@DilumAluthge (Member)

> We should really optimize the fast path.
> That was important for getting PaddedMatrices' fully dynamic jmul!(::Matrix, ::Matrix, ::Matrix) to quickly overtake StaticArrays, e.g. here. To hit 20 GFLOPS for 6x6 and 35 for 8x8 requires around 22ns and 30ns respectively; every nanosecond matters there.

Can we e.g. check the matrix size, and if it's small enough, just completely unroll all the loops? Would that be feasible (and performant)?

@DilumAluthge (Member)

> Also, we should add threading for loop 5 when we don't pack B. This is important at smallish sizes, especially on computers with a lot of threads looking for a chunk of matrix to multiply.

In this case, the threads won't all be able to share the same Bptr, right?

@chriselrod (Collaborator, Author) commented Jan 4, 2021

>> We should really optimize the fast path.
>> That was important for getting PaddedMatrices' fully dynamic jmul!(::Matrix, ::Matrix, ::Matrix) to quickly overtake StaticArrays, e.g. here. To hit 20 GFLOPS for 6x6 and 35 for 8x8 requires around 22ns and 30ns respectively; every nanosecond matters there.
>
> Can we e.g. check the matrix size, and if it's small enough, just completely unroll all the loops? Would that be feasible (and performant)?

If the arrays are statically sized, PaddedMatrices beats StaticArrays starting from 4x4, and ties it at smaller sizes.
https://chriselrod.github.io/PaddedMatrices.jl/dev/arches/haswell/#Haswell
https://chriselrod.github.io/PaddedMatrices.jl/dev/arches/cascadelake/#Cascadelake
For small statically sized arrays, it just applies @avx to the loops directly and lets everything get inlined.
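
Something like the canonical LoopVectorization kernel (a sketch; the actual code PaddedMatrices generates differs):

using LoopVectorization

function mygemm_avx!(C, A, B)
    @avx for m in axes(A, 1), n in axes(B, 2)
        Cmn = zero(eltype(C))
        for k in axes(A, 2)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
    return C
end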

If they're dynamically sized, it's much harder to unroll all the loops.
It's probably hard to beat LoopVectorization's approach here: even with dynamic loops, it doesn't take long (in matrix size) for it to overtake the fully unrolled statically sized code. That's why PaddedMatrices tries to minimize the amount of code executed before the check on whether to take the fast path.
When taking the fast path, it's around 0.5ns slower than calling a LoopVectorization function directly, while Octavian seems to be around 10ns slower.
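
A hedged sketch of that structure (matmul_sketch!, matmul_blocked!, and the 32^3 cutoff are made-up names and values, not PaddedMatrices' or Octavian's actual internals):

# Do as little work as possible before the fast-path check, so small
# matrices reach the inlined @avx kernel with minimal overhead.
function matmul_sketch!(C, A, B)
    M, N = size(C)
    K = size(A, 2)
    if M * K * N ≤ 32^3                # hypothetical small-size cutoff
        return mygemm_avx!(C, A, B)    # fast path: straight to the @avx kernel
    end
    return matmul_blocked!(C, A, B)    # hypothetical packing/blocking path
end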

> In this case, the threads won't all be able to share the same Bptr, right?

Note that we only actually use Bptr if do_not_pack_B is false. So if it is true, we can thread loop 5:

if do_not_pack_B
    # Use B's memory directly; no pack buffer needed.
    matmul_loop3!(C, T, A, Bview, α, β, ksize, nsize, M, k, n, Mc)
else
    # Pack the current block of B into the buffer at Bptr, then multiply.
    Bblock = PointerMatrix(Bptr, (ksize,nsize))
    unsafe_copyto_avx!(Bblock, Bview)
    matmul_loop3!(C, T, A, Bblock, α, β, ksize, nsize, M, k, n, Mc)
end

Also

do_not_pack_B = (B isa DenseArray && (K * N ≤ _Kc * _Nc)) || N ≤ LoopVectorization.nᵣ

We should make this check better. But the key is that it is true when K * N <= Kc * Nc. It'll be true unless the matrices are very big. Nc also tends to increase with the number of cores.
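
A rough sketch of what threading loop 5 in the no-packing case could look like (the names, signatures, and partitioning here are illustrative, not Octavian's actual code):

using Base.Threads

# Illustrative only: matmul_loop5_threaded! and matmul_loop4! are hypothetical.
function matmul_loop5_threaded!(C, A, B, α, β, Nc)
    N = size(B, 2)
    # Loop 5 partitions the columns of B (and C) into blocks of width Nc.
    # Because B is not packed, each thread reads it directly and no shared
    # pack buffer (Bptr) is needed.
    @threads for n in 1:Nc:N
        nsize = min(Nc, N - n + 1)
        Bview = view(B, :, n:n+nsize-1)
        Cview = view(C, :, n:n+nsize-1)
        matmul_loop4!(Cview, A, Bview, α, β)   # hypothetical inner loop nest
    end
    return C
end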

@DilumAluthge merged commit 60689dd into master Jan 4, 2021
@DilumAluthge deleted the addarrayinterfacedep branch January 4, 2021 13:00