Regression in sparse-dense multiplication #52137

dkarrasch · 2023-11-12T19:59:37Z

Running this piece of code, I get

using LinearAlgebra, SparseArrays, BenchmarkTools

function _spmatmul!(C, A, B, α, β)
    size(A, 2) == size(B, 1) || throw(DimensionMismatch())
    size(A, 1) == size(C, 1) || throw(DimensionMismatch())
    size(B, 2) == size(C, 2) || throw(DimensionMismatch())
    nzv = nonzeros(A)
    rv = rowvals(A)
    β != one(β) && LinearAlgebra._rmul_or_fill!(C, β)
    for k in 1:size(C, 2)
        @inbounds for col in 1:size(A, 2)
            αxj = B[col,k] * α
            for j in nzrange(A, col)
                C[rv[j], k] += nzv[j]*αxj
            end
        end
    end
    C
end

n = 1000;
A = sparse(Float64, I, n, n);
b = Vector{Float64}(undef, n);
x = ones(n);
@btime _spmatmul!($b, $A, $x, true, false);
#  1.703 μs (0 allocations: 0 bytes) on v1.9.3
# 2.161 μs (0 allocations: 0 bytes) on v1.10.0-rc1
# 2.945 μs (0 allocations: 0 bytes) on an 11 days old master

Multiplication dispatch has changed over the 1.9 to 1.10 transition, but this is nothing but the barebone mutliplication code that we used to have "ever since", so without character processing and all that. And since this is not calling high-level functions, the issue must be outside of SparseArrays.jl, AFAIU.

x-ref JuliaSparse/SparseArrays.jl#469

aravindh-krishnamoorthy · 2023-11-13T18:03:36Z

An initial analysis below. Firstly, I used PProf to profile the function _spmatmul! for v1.9.3 and 85d7cca. Then, I compared the Intel assembly obtained via @code_native (still underway).

PProf

Details

v1.9.3: ```julia _spmatmul!(::Vector{Float64}, ::SparseMatrixCSC{Float64, Int64}, ::Vector{Float64}, ::Bool, ::Bool) /home/ark/exp/52137.jl

Total: 52298 53841 (flat, cum) 99.86%
0 1053 1053
1 . . using LinearAlgebra, SparseArrays, BenchmarkTools
2 . .
3 46 46 function _spmatmul!(C, A, B, α, β)
4 115 115 size(A, 2) == size(B, 1) || throw(DimensionMismatch())
5 7 7 size(A, 1) == size(C, 1) || throw(DimensionMismatch())
6 . . size(B, 2) == size(C, 2) || throw(DimensionMismatch())
7 . . nzv = nonzeros(A)
8 . . rv = rowvals(A)
9 . 1543 β != one(β) && LinearAlgebra._rmul_or_fill!(C, β)
10 . . for k in 1:size(C, 2)
11 . . @inbounds for col in 1:size(A, 2)
12 . . αxj = B[col,k] * α
13 10181 10181 for j in nzrange(A, col)
14 32632 32632 C[rv[j], k] += nzv[j]*αxj
15 7388 7388 end
16 853 853 end
17 . . end
18 23 23 C
19 . . end
20 . .

85d7cca:
```julia
_spmatmul!(::Vector{Float64}, ::SparseMatrixCSC{Float64, Int64}, ::Vector{Float64}, ::Bool, ::Bool)
/home/ark/exp/52137.jl

  Total:       76441      77940 (flat, cum)   100%
      0         8239       8239           <instructions with unknown line numbers> 
      1            .          .           using LinearAlgebra, SparseArrays, BenchmarkTools 
      2            .          .            
      3          135        135           function _spmatmul!(C, A, B, α, β) 
      4           77         77               size(A, 2) == size(B, 1) || throw(DimensionMismatch()) 
      5            .          .               size(A, 1) == size(C, 1) || throw(DimensionMismatch()) 
      6            .          .               size(B, 2) == size(C, 2) || throw(DimensionMismatch()) 
      7            .          .               nzv = nonzeros(A) 
      8            .          .               rv = rowvals(A) 
      9           15       1514               β != one(β) && LinearAlgebra._rmul_or_fill!(C, β) 
     10            .          .               for k in 1:size(C, 2) 
     11            2          2                   @inbounds for col in 1:size(A, 2) 
     12            .          .                       αxj = B[col,k] * α 
     13        10453      10453                       for j in nzrange(A, col) 
     14        34117      34117                           C[rv[j], k] += nzv[j]*αxj 
     15        22692      22692                       end 
     16          690        690                   end 
     17            .          .               end 
     18           21         21               C 
     19            .          .           end 
     20            .          .

Observations

Both lines 13 nzrange and 14 (multiply) entail around the same complexity in both versions.
Line 15 end seems to take a lot more cycles in 85d7cca.

Assembly

Assembly @code_native on my Intel PC, with Intel syntax, for both cases is attached: 1.9.3.asm.txt
85d7cca.asm.txt

Observations

85d7cca results in a much longer assembly and unrolling (?). The file is much longer than for v1.9.4.
From the assembly files, the inner loop cycles for C[rv[j], k] += nzv[j]*αxj in both cases seem to be the same (4 instructions, uses SSE xmmX).
My best guess as of today (after looking at the asm files) is that the new code in 85d7cca attempts to do a 4x loop unroll of the inner loop in Line 14 above, but ends up causing a register spill. The cycles counted in Line 15 seem to be from the register spills and reloads. (I also see that some iterator code has changed, but I feel this shouldn't cause such a huge difference in performance.)

More later!

KristofferC · 2023-11-13T18:55:00Z

Could maybe be an LLVM upgrade where the vectorizer now does a worse job?

aravindh-krishnamoorthy · 2023-11-14T20:29:04Z

LLVM Code

Comparing the LLVM code, generated using the command @code_llvm _spmatmul!(b, A, x, true, false), we have the following observations. The files are here: 1.9.3.llvm.txt and 85d7cca.llvm.txt.

The LLVM code generated for 85d7cca is longer than that for 1.9.3.
4x loop unrolling, which was observed in the native code above, seems to have already been performed in the LLVM code.

Hence, at this point, my best guess is that the 4x unrolling in LLVM code causes register spills and reloads (2x of 64 bits each) when mapping to my Intel CPU (i7-1165G7), leading to the increased cycles observed in Line 15 of the above post.

-> Any suggestions to confirm this guess are welcome!

NB: It could be that the generated native code itself is suboptimal. But, based on my initial reading, the code structure, except for minor differences in code and the 4x unrolling, seems pretty similar between the two versions.

aravindh-krishnamoorthy · 2023-12-08T14:38:01Z

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regression in sparse-dense multiplication #52137

Regression in sparse-dense multiplication #52137

dkarrasch commented Nov 12, 2023 •

edited

Loading

aravindh-krishnamoorthy commented Nov 13, 2023 •

edited

Loading

KristofferC commented Nov 13, 2023

aravindh-krishnamoorthy commented Nov 14, 2023 •

edited

Loading

aravindh-krishnamoorthy commented Dec 8, 2023

ViralBShah commented May 30, 2024 •

edited

Loading

KristofferC commented May 30, 2024 •

edited

Loading

Regression in sparse-dense multiplication #52137

Regression in sparse-dense multiplication #52137

Comments

dkarrasch commented Nov 12, 2023 • edited Loading

aravindh-krishnamoorthy commented Nov 13, 2023 • edited Loading

PProf

Observations

Assembly

Observations

KristofferC commented Nov 13, 2023

aravindh-krishnamoorthy commented Nov 14, 2023 • edited Loading

LLVM Code

aravindh-krishnamoorthy commented Dec 8, 2023

ViralBShah commented May 30, 2024 • edited Loading

KristofferC commented May 30, 2024 • edited Loading

dkarrasch commented Nov 12, 2023 •

edited

Loading

aravindh-krishnamoorthy commented Nov 13, 2023 •

edited

Loading

aravindh-krishnamoorthy commented Nov 14, 2023 •

edited

Loading

ViralBShah commented May 30, 2024 •

edited

Loading

KristofferC commented May 30, 2024 •

edited

Loading