
spmatmul sparse matrix multiplication - performance improvements #30372

Merged: 16 commits merged into JuliaLang:master on Dec 17, 2018

Conversation

@KlausC (Contributor) commented Dec 12, 2018

The implementation of Gustavson's algorithm used to need a sorting step after the result matrix was constructed. The algorithm has been modified to produce the result vectors properly ordered.
The speed improvement is about 4x at comparable space usage.
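The modified approach can be sketched as follows. This is a minimal, hedged sketch (not the PR's actual code; `spgemm_gustavson` and all local names are mine): each result column is accumulated in a dense workspace, the touched row indices are collected and sorted in place per column, so no global sorting pass over the finished matrix is needed.

```julia
using SparseArrays

# Sketch of Gustavson's algorithm for C = A*B in CSC format with
# per-column sorting of row indices (illustrative, not the PR code).
function spgemm_gustavson(A::SparseMatrixCSC{Tv,Ti}, B::SparseMatrixCSC{Tv,Ti}) where {Tv,Ti}
    mA, nA = size(A)
    nA == size(B, 1) || throw(DimensionMismatch("A and B have incompatible sizes"))
    nB = size(B, 2)
    x  = zeros(Tv, mA)                # dense accumulator for one result column
    xb = falses(mA)                   # Bool tags marking occupied slots
    colptr = Vector{Ti}(undef, nB + 1)
    colptr[1] = 1
    rowind = Ti[]
    nzval  = Tv[]
    for j in 1:nB
        found = Ti[]                  # rows touched while forming column j
        for bp in nzrange(B, j)
            k   = rowvals(B)[bp]
            bkj = nonzeros(B)[bp]
            for ap in nzrange(A, k)
                i = rowvals(A)[ap]
                if xb[i]
                    x[i] += nonzeros(A)[ap] * bkj
                else
                    xb[i] = true
                    x[i]  = nonzeros(A)[ap] * bkj
                    push!(found, i)
                end
            end
        end
        sort!(found)                  # in-place sort per column, no global pass
        for i in found
            push!(rowind, i)
            push!(nzval, x[i])
            x[i]  = zero(Tv)          # reset workspace for the next column
            xb[i] = false
        end
        colptr[j + 1] = colptr[j] + length(found)
    end
    return SparseMatrixCSC(mA, nB, colptr, rowind, nzval)
end
```

The sorted-per-column output is what removes the need for the post-construction sorting step mentioned above.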
old results:

julia> n = 10000;
julia> Random.seed!(0); A = sprandn(n, n, 0.01); b = randn(n);  nnz(A)
1000598
julia> @benchmark A * A
BenchmarkTools.Trial: 
  memory estimate:  977.61 MiB
  allocs estimate:  28
  --------------
  minimum time:     5.815 s (3.76% GC)
  median time:      5.815 s (3.76% GC)
  mean time:        5.815 s (3.76% GC)
  maximum time:     5.815 s (3.76% GC)
  --------------
  samples:          1
  evals/sample:     1

new results:

julia> @benchmark A * A
BenchmarkTools.Trial: 
  memory estimate:  1.04 GiB
  allocs estimate:  10
  --------------
  minimum time:     1.299 s (3.74% GC)
  median time:      1.328 s (5.37% GC)
  mean time:        1.327 s (5.55% GC)
  maximum time:     1.353 s (7.64% GC)
  --------------
  samples:          4
  evals/sample:     1

@StefanKarpinski requested a review from @KristofferC Dec 12, 2018

@jebej (Contributor) commented Dec 13, 2018

This doesn't seem to work with SparseVectors.

@ViralBShah (Member) commented Dec 13, 2018

Can you test this with matrices that are 10^6x10^6 with 1 and 10 nonzeros per row? IIRC, we used to have this method in the very early days but changed it.

There are definitely problem sizes for which this will be faster (but not asymptotically), and if we want these benefits, we need to have some reasonable heuristics to decide which variant to pick.

@ViralBShah added the sparse label Dec 13, 2018

@KristofferC (Contributor) commented Dec 13, 2018

I had some benchmarks in #29682 that could be used here as well.

@ViralBShah (Member) commented Dec 14, 2018

Just a thought - we can consider having an optional parameter for accumulating into a dense vector or by sorting. The default would have to be the one that is asymptotically good, but this dense accumulator approach could be picked optionally.

@KlausC (Contributor, Author) commented Dec 16, 2018

Hi @ViralBShah,
I adapted your first proposal and found a robust and simple way to decide between the full scan (my first proposal in this PR, for the less sparse corner cases) and sorting of the row indices (the original approach, which handles the asymptotic case better: huge matrices with strong sparsity).
The decision is made for each result column separately.
Interestingly, the decision depends on nnz * log2(nnz) * 3 < m, where m is the result column length and nnz is the number of nonzeros in one column.

Some additional changes are:

  • initial estimation of the nnz of the result
  • replace Vector{Int} by Vector{Bool} to tag nonzero slots
  • re-use nzvals as dense accumulator storage
  • use in-place sort! for each result column individually; the sparseMatrixSort! pass at the end was removed

As a result, all cases observed so far show speedups.

[Plot: sort_or_scan0_28]
This plot shows the evaluation of the alternatives: the x-axis is log2(m), the y-axis is log2(nnz). The blue points performed better with the scanning approach, the yellow points with the sorting approach. The solid line is the graph of nnz -> 3*nnz*log2(nnz).
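The decision rule could be sketched as below. The names `prefer_sort` and `ilog2` follow the later discussion in this thread, but the bodies here are my own hedged reconstruction from the formula above (assuming nnz >= 1; the m > 6 guard is an assumption to skip trivially short columns):

```julia
# floor(log2(n)) for a positive integer, via the bit width minus the
# leading zero count (avoids a floating-point log).
ilog2(n::Integer) = (sizeof(n) << 3) - leading_zeros(n) - 1

# Sorting the nz collected row indices costs roughly nz*log2(nz);
# scanning the dense accumulator costs roughly m.  Prefer sorting when
# its estimated cost (constant factor 3) is below the column length m.
prefer_sort(nz::Integer, m::Integer) = m > 6 && 3 * ilog2(nz) * nz < m
```

The constant factor 3 corresponds to the solid line in the plot separating the blue (scan) and yellow (sort) regions.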

Benchmark results:
The measurements compare a copy of the original implementation, SparseWrappers.spmatmul_orig, with the modified version, SparseWrappers.spmatmul; the latter is identical to the version in this PR.

My own original example:

julia> n = 10000;
julia> Random.seed!(0); A = sprandn(n, n, 0.01); b = randn(n);  nnz(A)
1000598
julia> @btime SparseWrappers.spmatmul_orig($A, $A);
  5.259 s (28 allocations: 122.60 MiB)
julia> @btime SparseWrappers.spmatmul($A, $A);
  1.400 s (8 allocations: 1.04 GiB)

The asymptotic case suggested by @ViralBShah:

julia> n = 10^6
1000000
julia> Random.seed!(0); A = sprandn(n, n, 10 / n); b = randn(n);  nnz(A)
10003563
julia> @btime SparseWrappers.spmatmul_orig($A, $A);
  9.613 s (24 allocations: 2.43 GiB)
julia> @btime SparseWrappers.spmatmul($A, $A);
  6.449 s (9 allocations: 1.65 GiB)

julia> Random.seed!(0); A = sprandn(n, n, 1 / n); b = randn(n);  nnz(A)
1000757
julia> @btime SparseWrappers.spmatmul_orig($A, $A);
  220.634 ms (20 allocations: 106.86 MiB)
julia> @btime SparseWrappers.spmatmul($A, $A);
  162.402 ms (11 allocations: 42.20 MiB) 

The benchmarks proposed by @KristofferC:

julia> for N = (^).(10, 1:7)
           S = spdiagm(0 => ones(N), 1 => -ones(N-1));
           SA = sparse(S')
           @show N
           @btime SparseWrappers.spmatmul_orig($S, $SA)
           @btime SparseWrappers.spmatmul($S, $SA)
       end
N = 10
  782.991 ns (10 allocations: 1.80 KiB)
  674.351 ns (7 allocations: 2.22 KiB)
N = 100
  6.028 μs (10 allocations: 11.81 KiB)
  4.024 μs (5 allocations: 8.11 KiB)
N = 1000
  55.681 μs (12 allocations: 110.41 KiB)
  36.536 μs (7 allocations: 77.89 KiB)
N = 10000
  544.767 μs (18 allocations: 1.07 MiB)
  356.850 μs (8 allocations: 775.78 KiB)
N = 100000
  5.501 ms (18 allocations: 10.68 MiB)
  3.685 ms (9 allocations: 7.57 MiB)
N = 1000000
  56.884 ms (18 allocations: 106.81 MiB)
  54.472 ms (9 allocations: 75.72 MiB)
N = 10000000
  995.334 ms (18 allocations: 1.04 GiB)
  620.350 ms (9 allocations: 757.22 MiB)

julia> using MatrixDepot # v0.8.0
populating internal database...

julia> md = mdopen("Wissgott/parabolic_fem")
(RS Wissgott/parabolic_fem(#1853)  525825x525825(3674625/2100225) 2007 [A, b] 'computational fluid dynamics problem' [Parabolic FEM, diffusion-convection reaction, constant homogenous diffusion]()

julia> A = sparse(md.A); # MatrixDepot delivers a Symmetric{SparseMatrixCSC}
julia> @btime SparseWrappers.spmatmul_orig($A, $A);
  422.052 ms (20 allocations: 248.35 MiB)
julia> @btime SparseWrappers.spmatmul($A, $A);
  273.490 ms (9 allocations: 435.51 MiB)

@ViralBShah (Member) commented Dec 16, 2018

This is some really nice work @KlausC!

Is it possible for the spmatmul_orig tests to be run with sortcols and doubletranspose, just so that we have comprehensive testing of all cases? Also, can we test more sprand cases with n varying from 10 to 10^6 and a few varying densities?

Hopefully this is easy to do, and it will serve as a good record for this PR.

Is there any benefit from inlining prefer_sort and ilog2?

@KlausC (Contributor, Author) commented Dec 16, 2018

The same examples with sortindices=:doubletranspose showed essentially inferior results.

julia> for N = (^).(10, 1:7)
           S = spdiagm(0 => ones(N), 1 => -ones(N-1));
           SA = sparse(S')
           @show N
           @btime SparseWrappers.spmatmul_orig($S, $SA, sortindices=:doubletranspose)
           @btime SparseWrappers.spmatmul_orig($S, $SA, sortindices=:sortcols)
           @btime SparseWrappers.spmatmul($S, $SA)
       end
N = 10
  830.519 ns (12 allocations: 2.25 KiB)
  761.190 ns (10 allocations: 1.80 KiB)
  657.576 ns (7 allocations: 2.22 KiB)
N = 100
  5.906 μs (12 allocations: 15.16 KiB)
  5.898 μs (10 allocations: 11.81 KiB)
  3.900 μs (5 allocations: 8.11 KiB)
N = 1000
  49.402 μs (16 allocations: 141.72 KiB)
  53.948 μs (12 allocations: 110.41 KiB)
  35.144 μs (7 allocations: 77.89 KiB)
N = 10000
  475.227 μs (20 allocations: 1.37 MiB)
  524.470 μs (18 allocations: 1.07 MiB)
  347.792 μs (8 allocations: 775.78 KiB)
N = 100000
  5.015 ms (20 allocations: 13.73 MiB)
  5.319 ms (18 allocations: 10.68 MiB)
  3.577 ms (9 allocations: 7.57 MiB)
N = 1000000
  73.128 ms (20 allocations: 137.33 MiB)
  70.665 ms (18 allocations: 106.81 MiB)
  53.516 ms (9 allocations: 75.72 MiB)
N = 10000000
  1.150 s (20 allocations: 1.34 GiB)
  960.450 ms (18 allocations: 1.04 GiB)
  566.719 ms (9 allocations: 757.22 MiB)

julia> n = 10^6
1000000

julia> Random.seed!(0); A = sprandn(n, n, 1 / n); b = randn(n);  nnz(A)
1000757
julia> S = SA = A;
julia> @btime SparseWrappers.spmatmul_orig($S, $SA, sortindices=:doubletranspose)
  404.522 ms (22 allocations: 106.89 MiB)
julia> @btime SparseWrappers.spmatmul_orig($S, $SA, sortindices=:sortcols)
  243.995 ms (20 allocations: 106.86 MiB)
julia> @btime SparseWrappers.spmatmul($S, $SA)
  170.740 ms (11 allocations: 42.20 MiB)

julia> n = 10^6
1000000
julia> Random.seed!(0); A = sprandn(n, n, 10 / n); b = randn(n);  nnz(A)
10003563
julia> @btime SparseWrappers.spmatmul_orig($S, $SA, sortindices=:doubletranspose);
  39.763 s (26 allocations: 3.91 GiB)
julia> @btime SparseWrappers.spmatmul_orig($S, $SA, sortindices=:sortcols);
  9.845 s (24 allocations: 2.43 GiB)
julia> @btime SparseWrappers.spmatmul($S, $SA);
  6.227 s (9 allocations: 1.65 GiB)

julia> using MatrixDepot
julia> md = mdopen("Wissgott/parabolic_fem")
(RS Wissgott/parabolic_fem(#1853)  525825x525825(3674625/2100225) 2007 [A, b] 'computational fluid dynamics problem' [Parabolic FEM, diffusion-convection reaction, constant homogenous diffusion]()

julia> A = sparse(md.A);
julia> S = SA = A;
julia> @btime SparseWrappers.spmatmul_orig($S, $SA, sortindices=:doubletranspose);
  715.782 ms (22 allocations: 392.31 MiB)
julia> @btime SparseWrappers.spmatmul_orig($S, $SA, sortindices=:sortcols);
  417.349 ms (20 allocations: 248.35 MiB)
julia> @btime SparseWrappers.spmatmul($S, $SA);
  278.560 ms (9 allocations: 435.51 MiB)

@ViralBShah (Member) commented Dec 17, 2018

@KlausC The tests should be an easy fix. I also propose that we mention this performance enhancement in the NEWS file, and that we no longer have the two options for spmatmul. Since spmatmul is in a stdlib and is not an exported function, I think we should be OK with this.

@ViralBShah (Member) commented Dec 17, 2018

Also, for future reference, here's a paper by @aydinbuluc on the state of the art in sparse matrix multiplication: https://people.eecs.berkeley.edu/~aydin/spgemm_p2s218.pdf with code at https://bitbucket.org/YusukeNagasaka/mtspgemmlib

@KlausC (Contributor, Author) commented Dec 17, 2018

> ... performance tests and see if it helps?

        if numrows <= 16
            alg = Base.Sort.InsertionSort
        else
            alg = Base.Sort.QuickSort
        end

I tried it and did not see a difference. The QuickSort algorithm itself branches to InsertionSort if numrows <= 20, so the best we could gain is one invocation.

@ViralBShah (Member) commented Dec 17, 2018

Thanks!

@ViralBShah (Member) commented Dec 17, 2018

I am actually happy to merge this as is (the test changes are not that important).

@ViralBShah merged commit fae262c into JuliaLang:master Dec 17, 2018

3 checks passed: continuous-integration/appveyor/pr (AppVeyor build succeeded), continuous-integration/travis-ci/pr (Travis CI build passed), julia freebsd ci (build done)
@KlausC (Contributor, Author) commented Dec 17, 2018

I committed the added test cases and NEWS.md after your merge. How do I get these last changes saved?

KristofferC added a commit that referenced this pull request Dec 20, 2018
spmatmul sparse matrix multiplication - performance improvements (#30372)

* General performance improvements for sparse matmul
Details for the polyalgorithm are in: #30372

(cherry picked from commit fae262c)
@KristofferC referenced this pull request Dec 20, 2018 (45 of 52 tasks complete)
ViralBShah added a commit that referenced this pull request Dec 20, 2018
KristofferC added a commit that referenced this pull request Jan 11, 2019
spmatmul sparse matrix multiplication - performance improvements (#30372)

* General performance improvements for sparse matmul
Details for the polyalgorithm are in: #30372

(cherry picked from commit fae262c)
@KristofferC referenced this pull request Jan 11, 2019 (39 of 53 tasks complete)