Add fastpath for generic mul!
#51812
Conversation
Also, this implementation does not require one-based-indexing, as long as the axes match.
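As a minimal sketch (hypothetical helper name, not code from this PR), this is the kind of `axes`-based conformability check that makes offset-indexed arrays work:

```julia
# Hypothetical helper: validate via `axes` rather than assuming 1:size(A, i),
# so OffsetArray-style inputs are fine as long as the axes line up.
function check_mul_axes(C, A, B)
    axes(C, 1) == axes(A, 1) || throw(DimensionMismatch("rows of C and A differ"))
    axes(C, 2) == axes(B, 2) || throw(DimensionMismatch("columns of C and B differ"))
    axes(A, 2) == axes(B, 1) || throw(DimensionMismatch("columns of A and rows of B differ"))
    return nothing
end
```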
If this implementation is faster, shouldn't we be able to remove the current fallback implementation?
The current fallback implementation takes character arguments. I agree; ideally, we'd remove that generic implementation and only go the character-based route when actually forwarding to BLAS. I'd rather leave the reworking of the internal dispatch hierarchy to someone else, if anyone would like to take it on. ;)
If this is supposed to handle (for now) only the case […]
I think we should only have a BLAS-like API for actually forwarding to BLAS. Having a Julian API that gets processed into a BLAS-like one (meaning using character codes like 'N'/'T'/'C') seems backwards. Additionally, it makes the function 4x bigger. Why compile all code paths when we could compile the only one (in most circumstances) that we need? I think the current […]. We may want to dispatch based on […].
What you're describing is the status of Julia prior to v1.10. What you see on master, i.e., the character processing and the BLAS-like interface, is the new status of Julia v1.10+. That change was actually "celebrated", because it allowed a reduction in the number of methods packages had to insert. And the reason for that reduction is very simple: if you wanted to distinguish between the different storage types (dense, sparse, GPU arrays), you needed to dispatch over type parameters (à la […]).
These dispatches are internal and thus do not need to add to […]. The terrible function […]
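To make the two styles in this exchange concrete, here is a hedged sketch with made-up function names (the real signatures in matmul.jl differ):

```julia
using LinearAlgebra

# Style 1: BLAS-like. A single method takes 'N'/'T'/'C' flags and branches at
# runtime, so one large compiled body covers every transpose combination.
function matmul_charstyle!(C, tA::Char, tB::Char, A, B)
    A2 = tA == 'N' ? A : tA == 'T' ? transpose(A) : A'
    B2 = tB == 'N' ? B : tB == 'T' ? transpose(B) : B'
    for j in axes(C, 2), i in axes(C, 1)
        acc = zero(eltype(C))
        for k in axes(A2, 2)
            acc += A2[i, k] * B2[k, j]
        end
        C[i, j] = acc
    end
    return C
end

# Style 2: Julian. One small generic method; the compiler specializes it per
# Adjoint/Transpose wrapper combination that actually occurs at a call site.
function matmul_wrapperstyle!(C, A::AbstractMatrix, B::AbstractMatrix)
    for j in axes(C, 2), i in axes(C, 1)
        acc = zero(eltype(C))
        for k in axes(A, 2)
            acc += A[i, k] * B[k, j]
        end
        C[i, j] = acc
    end
    return C
end
```

Calling `matmul_wrapperstyle!(C, A', B)` compiles one specialization for the `Adjoint` wrapper, while `matmul_charstyle!` carries all branches in a single method; that trade-off is what the compile-time benchmarks below measure.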
Anyway, assuming […]:

```julia
using ForwardDiff, LinearAlgebra

d(x, n) = ForwardDiff.Dual(x, ntuple(_ -> randn(), n))

function dualify(A, n, j)
    if n > 0
        A = d.(A, n)
        if j > 0
            A = d.(A, j)
        end
    end
    A
end

A = dualify.(randn(5, 5), 7, 2);
B = dualify.(randn(5, 5), 7, 2);
C = similar(A);

@time @eval mul!(C, A, B);
@time @eval mul!(C, A', B);
@time @eval mul!(C, A, B');
@time @eval mul!(C, A', B');
@time @eval mul!(C, transpose(A), B);
@time @eval mul!(C, A, transpose(B));
@time @eval mul!(C, transpose(A), transpose(B));
```

On Julia master, I get

```
julia> @time @eval mul!(C, A, B);
1.121381 seconds (2.98 M allocations: 202.741 MiB, 0.97% gc time, 99.36% compilation time)
julia> @time @eval mul!(C, A', B);
0.006050 seconds (4.22 k allocations: 294.500 KiB, 96.86% compilation time)
julia> @time @eval mul!(C, A, B');
0.003332 seconds (2.67 k allocations: 204.047 KiB, 95.07% compilation time)
julia> @time @eval mul!(C, A', B');
0.003324 seconds (2.72 k allocations: 206.984 KiB, 95.31% compilation time)
julia> @time @eval mul!(C, transpose(A), B);
0.006030 seconds (4.20 k allocations: 293.125 KiB, 97.28% compilation time)
julia> @time @eval mul!(C, A, transpose(B));
0.003468 seconds (2.67 k allocations: 203.719 KiB, 95.28% compilation time)
julia> @time @eval mul!(C, transpose(A), transpose(B));
0.003527 seconds (2.72 k allocations: 207.172 KiB, 94.68% compilation time)
```

while on this PR, I get

```
julia> @time @eval mul!(C, A, B);
0.114683 seconds (137.27 k allocations: 9.591 MiB, 93.84% compilation time)
julia> @time @eval mul!(C, A', B);
0.077514 seconds (37.96 k allocations: 2.566 MiB, 99.74% compilation time)
julia> @time @eval mul!(C, A, B');
0.048722 seconds (29.82 k allocations: 2.013 MiB, 99.61% compilation time)
julia> @time @eval mul!(C, A', B');
0.063893 seconds (29.93 k allocations: 2.019 MiB, 99.72% compilation time)
julia> @time @eval mul!(C, transpose(A), B);
0.076696 seconds (35.31 k allocations: 2.390 MiB, 99.74% compilation time)
julia> @time @eval mul!(C, A, transpose(B));
0.048606 seconds (29.81 k allocations: 2.013 MiB, 99.63% compilation time)
julia> @time @eval mul!(C, transpose(A), transpose(B));
0.073229 seconds (29.92 k allocations: 2.018 MiB, 11.88% gc time, 99.75% compilation time)
```

The thing to point out here is that on this PR, we have close to the same compile time each time, as we're compiling the entire generic kernel anew for each wrapper combination. However, on master, we only need to compile the one big character-dispatching method on the first call; the remaining six calls largely reuse it. The single compile on Julia master was, however, significantly slower than the sum of all 7 separate compile times on this PR.

Here is another benchmark:
```julia
using ForwardDiff, LinearAlgebra

d(x, n) = ForwardDiff.Dual(x, ntuple(_ -> randn(), n))

function dualify(A, n, j)
    if n > 0
        A = d.(A, n)
        if j > 0
            A = d.(A, j)
        end
    end
    A
end

@time for n = 0:8, j = (n != 0):4
    A = dualify.(randn(5, 5), n, j)
    B = dualify.(randn(5, 5), n, j)
    C = similar(A)
    mul!(C, A, B)
    mul!(C, A', B)
    mul!(C, A, B')
    mul!(C, A', B')
    mul!(C, transpose(A), B)
    mul!(C, A, transpose(B))
    mul!(C, transpose(A), transpose(B))
end
```

On Julia master, I get

```
31.304752 seconds (56.77 M allocations: 3.749 GiB, 1.15% gc time, 99.93% compilation time)
```

versus on this PR:

```
18.514421 seconds (12.76 M allocations: 860.920 MiB, 0.49% gc time, 99.88% compilation time)
```

If we comment out all the adjoint/transpose calls, leaving only the plain `mul!(C, A, B)`, Julia master gives

```
30.290741 seconds (56.12 M allocations: 3.703 GiB, 1.14% gc time, 99.95% compilation time)
```

versus this PR:

```
5.230921 seconds (6.99 M allocations: 468.793 MiB, 0.99% gc time, 99.76% compilation time)
```

So I suggest: […]
This PR does both by avoiding […]. If you have strong opinions about the implementation, I suggest we close this PR and you implement the fixes.
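A rough sketch of the control flow being proposed, with hypothetical names and the BLAS-eligibility condition paraphrased from the diff just below:

```julia
using LinearAlgebra
using LinearAlgebra: BlasFloat

# Hypothetical generic kernel: wrapper-dispatched, no character processing.
function generic_kernel!(C, A, B, α, β)
    for j in axes(C, 2), i in axes(C, 1)
        acc = zero(eltype(C))
        for k in axes(A, 2)
            acc += A[i, k] * B[k, j]
        end
        C[i, j] = α * acc + β * C[i, j]
    end
    return C
end

# Hypothetical guard: only genuinely BLAS-eligible arguments take the
# character-processing BLAS route; everything else goes straight to the
# generic kernel.
function mul_fastpath!(C, A, B, α = true, β = false)
    blas_ok = eltype(C) <: BlasFloat &&
              eltype(C) === eltype(A) === eltype(B) &&
              C isa StridedVecOrMat && A isa StridedVecOrMat && B isa StridedVecOrMat
    return blas_ok ? mul!(C, A, B, α, β) : generic_kernel!(C, A, B, α, β)
end
```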
stdlib/LinearAlgebra/src/matmul.jl (Outdated)

```julia
if (!((eltype(C) <: BlasFloat && eltype(C) === eltype(A) === eltype(B)))) ||
        (!(C isa StridedVecOrMat && A isa StridedVecOrMat && B isa StridedVecOrMat)) &&
        α == true && (β == true || β == false)
```
IIUC, then this condition applies, for instance, to CUDA, CUSparse and sparse matrices (and/or their transposes and adjoints), right? That would massively slow down SparseArrays.jl and even break a few GPUArrays-related packages, which specifically overload `LinearAlgebra.generic_matmatmul!` (with the character arguments).
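Schematically, such an overload looks something like the following. `MyGPUMatrix` is a made-up stand-in, and the exact `generic_matmatmul!` argument list is internal and varies across Julia versions, so treat the signature as approximate:

```julia
using LinearAlgebra

# Made-up array type standing in for a GPU or sparse matrix.
struct MyGPUMatrix{T} <: AbstractMatrix{T}
    data::Matrix{T}  # pretend this lives on a device
end
Base.size(A::MyGPUMatrix) = size(A.data)
Base.getindex(A::MyGPUMatrix, i::Int, j::Int) = A.data[i, j]

# One character-argument method covers plain/transpose/adjoint for every
# mul! combination, instead of many wrapper-typed mul! methods.
function LinearAlgebra.generic_matmatmul!(C::MyGPUMatrix, tA, tB,
                                          A::MyGPUMatrix, B::MyGPUMatrix,
                                          _add::LinearAlgebra.MulAddMul)
    A2 = tA == 'N' ? A.data : tA == 'T' ? transpose(A.data) : A.data'
    B2 = tB == 'N' ? B.data : tB == 'T' ? transpose(B.data) : B.data'
    # A real package would launch a device kernel here.
    mul!(C.data, A2, B2, _add.alpha, _add.beta)
    return C
end
```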
I can make it `Array`-only.
Wait, I might have misread the logic. Suppose I want to multiply non-transposed dense GPUArrays with `BlasFloat` elements; then this returns ... `true`? And directs away from `generic_matmatmul!`? Too many conditions for my little brain.
The check is supposed to be for BLAS compatibility of types. BLAS requires both of:

1. The eltypes match and they're `BlasFloat`.
2. They're all strided.

The idea is, if either of these fails, call the new overload; otherwise, call the BLAS dispatcher. But because packages like CUDA are relying on extending the non-Julian and non-exported `generic_matmatmul!`, it's suddenly a breaking change.
Yes, they do, because that significantly reduced load times of those packages, both SparseArrays and all of the GPUArrays-related packages. Otherwise people had to overload tons of `mul!` methods, for all combinations of plain/transpose/adjoint factors, possibly dispatching on type parameters. JuliaGPU/CUDA.jl#1904
Okay, ideally there would have been another way to avoid `mul!`. I went with requiring `Matrix` for the first two arguments and `Array` for the third.
> Okay, ideally there would have been another way to avoid `mul!`.

Which one? Since this is all internal, it can be changed to whatever yields the correct dispatch to the underlying BLAS methods and doesn't increase load times. Method insertion at package loading was the big issue.
Closing in favor of #52038
This is another attempt at improving the compile time issue with generic matmatmul, hopefully improving runtime performance also. @chriselrod @jishnub

There seems to be a little typo/oversight somewhere, but it shows how it could work. Locally, this reduces benchmark times from #51812 (comment) by more than 50%.

Co-authored-by: Chris Elrod <elrodc@gmail.com>
The current `_generic_matmatmul!` is bad. Sample results: […]

This is of course with `Float64`, which `mul!` normally will not hit. The point here is to show that all the extra complexity of `_generic_matmatmul!` makes the code slower to run, even at fairly large sizes. The naive tiling there does not help performance.
Worse, the current method also takes longer to compile.
Here is a benchmark of log10 runtimes across a range of size inputs. For each size input, the benchmark covers all dual combinations: from no duals, to 1–8 partials, to 1–8 × 1–2 partials. It does this 1981 times on multiple threads (`1981 == length(0.1:0.005:10.0)`). I also added a C++ implementation to the comparison for good measure.
Here is the performance on Julia master: […]

Versus this PR: […]
The `x` axis is matrix dimension, and the `y` axis is log10 time in nanoseconds (so `6` is 1 ms). Note that the big bump above 3x3 on Julia master is because it special-cases a few matrix sizes, like 3x3.
In case the log scale obscures it, I'll emphasize: this PR improves performance several-fold in this `ForwardDiff.Dual` benchmark, just like the `Float64` benchmark above.

On this PR, the compile times were, in seconds, […] for the custom implementation below, `ExponentialUtilities.exponential!`, and the `ccall` to the C++ version, respectively. `time cmake --build` itself reported about 3 seconds to actually compile the C++ shared library. On Julia master, I got […].

I am not sure why `ExponentialUtilities.exponential!`'s compile time seemed to regress on this PR, but the total compile time of the combination was reduced.

The script: […]
Anyway, it is annoying that "write your own triple loop instead of calling `mul!`" is an important performance hack whenever dealing with generic code.