
Conversation

@chriselrod (Collaborator) commented Jan 4, 2021

BLASBenchmarks currently uses reshape(view(x, 1:prod(dims)), dims) to create its arrays.
This works around a BenchmarkTools issue where interpolated variables never get freed by the GC. Doing it that way makes it easier to run benchmarks on computers with limited RAM, because I can reuse the same x across all iterations and matrix sizes, rather than allocating new arrays every time that are never freed.
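
As a minimal sketch of that pattern (the buffer size and the matrix_of_size helper are illustrative, not BLASBenchmarks' actual code):

x = Vector{Float64}(undef, 10_000_000)   # one big buffer, allocated once
matrix_of_size(dims) = reshape(view(x, 1:prod(dims)), dims)

A = matrix_of_size((200, 200))   # reuses x; no fresh allocation per size
B = matrix_of_size((500, 500))   # same buffer, different shape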

However, I realized that Octavian was unusually slow at small sizes. This was because

julia> x = rand(10000);

julia> A = reshape(view(x, 1:16*16), (16,16));

julia> A isa DenseArray
false

Because two of Octavian's deps already make heavy use of ArrayInterface.jl, I figured the easiest way to handle this case would be to add an explicit dependency. This also opens the door to using it in more places.
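
(For reference, Base's StridedArray union, unlike DenseArray, does still cover this reshape-of-a-view, and the stride information is recoverable; ArrayInterface.jl exposes this kind of information as traits that generic code can dispatch on.)

julia> A isa StridedArray
true

julia> strides(A)
(1, 16)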

A couple asides:
We should really optimize the fast path.
That was important for getting PaddedMatrices' fully dynamic jmul!(::Matrix, ::Matrix, ::Matrix) to quickly overtake StaticArrays, e.g. here. To hit 20 GFLOPS for 6x6 and 35 GFLOPS for 8x8 requires around 22ns and 30ns respectively (a 6x6 matmul is 2·6³ = 432 flops, and 432 flops in 22ns is about 20 GFLOPS); every nanosecond matters there.

Also, we should add threading for loop 5 when we don't pack B. This is important at smallish sizes, especially on computers with a lot of threads looking for a chunk of matrix to multiply.

codecov bot commented Jan 4, 2021

Codecov Report

Merging #29 (11ed033) into master (98ac7ba) will not change coverage.
The diff coverage is 100.00%.


@@            Coverage Diff            @@
##            master       #29   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            9         9           
  Lines          116       121    +5     
=========================================
+ Hits           116       121    +5     
Impacted Files    Coverage Δ
src/Octavian.jl   100.00% <ø> (ø)
src/matmul.jl     100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@DilumAluthge (Member)

> We should really optimize the fast path.
> That was important for getting PaddedMatrices' fully dynamic jmul!(::Matrix, ::Matrix, ::Matrix) to quickly overtake StaticArrays, e.g. here. To hit 20 GFLOPS for 6x6 and 35 for 8x8 requires around 22ns and 30ns respectively; every nanosecond matters there.

Can we e.g. check the matrix size, and if it's small enough, just completely unroll all the loops? Would that be feasible (and performant)?

@DilumAluthge (Member)

> Also, we should add threading for loop 5 when we don't pack B. This is important at smallish sizes, especially on computers with a lot of threads looking for a chunk of matrix to multiply.

In this case, the threads won't all be able to share the same Bptr, right?

@chriselrod (Collaborator, Author) commented Jan 4, 2021

>> We should really optimize the fast path.
>> That was important for getting PaddedMatrices' fully dynamic jmul!(::Matrix, ::Matrix, ::Matrix) to quickly overtake StaticArrays, e.g. here. To hit 20 GFLOPS for 6x6 and 35 for 8x8 requires around 22ns and 30ns respectively; every nanosecond matters there.
>
> Can we e.g. check the matrix size, and if it's small enough, just completely unroll all the loops? Would that be feasible (and performant)?

If the arrays are statically sized, PaddedMatrices beats StaticArrays starting from 4x4, and ties it at smaller sizes.
https://chriselrod.github.io/PaddedMatrices.jl/dev/arches/haswell/#Haswell
https://chriselrod.github.io/PaddedMatrices.jl/dev/arches/cascadelake/#Cascadelake
For small statically sized arrays, it just applies @avx to the loops directly and lets everything get inlined.
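
Something like the canonical LoopVectorization kernel (a sketch; the actual code PaddedMatrices generates differs):

using LoopVectorization

function mygemm_avx!(C, A, B)
    @avx for m in axes(A, 1), n in axes(B, 2)
        Cmn = zero(eltype(C))
        for k in axes(A, 2)
            Cmn += A[m, k] * B[k, n]
        end
        C[m, n] = Cmn
    end
    return C
end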

If they're dynamically sized, it's much harder to unroll all the loops.
It's probably hard to beat LoopVectorization's approach here: even with dynamic loops, it doesn't take long (in matrix size) for it to overtake the fully unrolled statically sized code. That's why PaddedMatrices tries to minimize the amount of code executed before the check on whether to take the fast path.
When taking the fast path, it's around 0.5ns slower than calling a LoopVectorization function directly, while Octavian seems to be around 10ns slower.
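
A hedged sketch of that structure (matmul_sketch!, matmul_blocked!, and the 32^3 cutoff are made-up names and values, not PaddedMatrices' or Octavian's actual internals):

# Do as little work as possible before the fast-path check, so small
# matrices reach the inlined @avx kernel with minimal overhead.
function matmul_sketch!(C, A, B)
    M, N = size(C)
    K = size(A, 2)
    if M * K * N ≤ 32^3                # hypothetical small-size cutoff
        return mygemm_avx!(C, A, B)    # fast path: straight to the @avx kernel
    end
    return matmul_blocked!(C, A, B)    # hypothetical packing/blocking path
end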

> In this case, the threads won't all be able to share the same Bptr, right?

Note that we only actually use Bptr if do_not_pack_B is false. So if it is true, we can thread loop 5:

if do_not_pack_B
    # Use B's memory directly; no pack buffer needed.
    matmul_loop3!(C, T, A, Bview, α, β, ksize, nsize, M, k, n, Mc)
else
    # Pack the current block of B into the buffer at Bptr, then multiply.
    Bblock = PointerMatrix(Bptr, (ksize,nsize))
    unsafe_copyto_avx!(Bblock, Bview)
    matmul_loop3!(C, T, A, Bblock, α, β, ksize, nsize, M, k, n, Mc)
end

Also

do_not_pack_B = (B isa DenseArray && (K * N ≤ _Kc * _Nc)) || N ≤ LoopVectorization.nᵣ

We should make this check better. But the key is that it is true when K * N <= Kc * Nc. It'll be true unless the matrices are very big. Nc also tends to increase with the number of cores.
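
A rough sketch of what threading loop 5 in the no-packing case could look like (the names, signatures, and partitioning here are illustrative, not Octavian's actual code):

using Base.Threads

# Illustrative only: matmul_loop5_threaded! and matmul_loop4! are hypothetical.
function matmul_loop5_threaded!(C, A, B, α, β, Nc)
    N = size(B, 2)
    # Loop 5 partitions the columns of B (and C) into blocks of width Nc.
    # Because B is not packed, each thread reads it directly and no shared
    # pack buffer (Bptr) is needed.
    @threads for n in 1:Nc:N
        nsize = min(Nc, N - n + 1)
        Bview = view(B, :, n:n+nsize-1)
        Cview = view(C, :, n:n+nsize-1)
        matmul_loop4!(Cview, A, Bview, α, β)   # hypothetical inner loop nest
    end
    return C
end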

@DilumAluthge merged commit 60689dd into master Jan 4, 2021
@DilumAluthge deleted the addarrayinterfacedep branch January 4, 2021 13:00