
draft: symmetric sparse matrix support #22200

Closed
wants to merge 4 commits

Conversation

@dpo
Contributor

dpo commented Jun 3, 2017

There is background information for this, including benchmarks of a draft version of this code, at https://discourse.julialang.org/t/slow-sparse-matrix-vector-product-with-symmetric-matrices. The more general implementation in this PR, which follows the code for the generic sparse matrix-vector product, is a bit slower than what's reported in the thread.

I'd like to gather some feedback and, assuming this is of interest, I'll be happy to implement what's missing and do the same for Hermitian.
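For context, here is a minimal sketch of the intended usage (Julia 0.6-era API; the matrix below is made up for illustration):

# Sketch only: wrap a sparse matrix (no copy) so that products dispatch to
# the specialized symmetric kernels added in this PR.
n = 10_000
A = sprandn(n, n, 1e-3); A = A + A'   # symmetrize
S = Symmetric(A)                      # wrapper; the data is not copied
b = randn(n)
y = S * b                             # the product this PR accelerates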

@andreasnoack

@tkelman
Contributor

tkelman commented Jun 3, 2017

+1 for the direction. Symmetric{SparseMatrixCSC} has a missing-method problem that this makes a nice dent in.

@jebej
Contributor

jebej commented Jun 3, 2017

Nice! Is this faster than just using a regular sparse matrix that happens to be symmetric?

Also, would it make sense to store only the upper or lower triangle of A, instead of the full A?

@fredrikekre
Member

fredrikekre commented Jun 3, 2017

Also, would it make sense to store only the upper or lower triangle of A, instead of the full A?

Symmetric should just be a wrapper. It is up to the user to decide what to store (i.e., when creating the sparse matrix).

@jebej
Contributor

jebej commented Jun 3, 2017

Symmetric should just be a wrapper. It is up to the user to decide what to store (i.e., when creating the sparse matrix).

That's fair, but I am assuming that there are two reasons to have an optimized sparse symmetric implementation: storage and speed. I was curious about this because I use Hermitian sparse matrices, so an optimized type is something I am interested in. I made a quick benchmark comparing regular sparse matrix multiplication with this one, and with one where the stored data comprises only the upper triangle.

The code is there, and here are the results for a density of 0.32 and square matrices of different sizes:

SymmetricSparseTests.runtests(5)
Normal sparse: Trial(146.313 ns)
Symmetric sparse: Trial(164.220 ns)
Symmetric sparse (triu only): Trial(161.990 ns)
Symmetric sparse (triu only) optimized: Trial(152.664 ns)

SymmetricSparseTests.runtests(10)
Normal sparse: Trial(613.115 ns)
Symmetric sparse: Trial(632.982 ns)
Symmetric sparse (triu only): Trial(605.547 ns)
Symmetric sparse (triu only) optimized: Trial(562.321 ns)

SymmetricSparseTests.runtests(25)
Normal sparse: Trial(5.992 μs)
Symmetric sparse: Trial(6.255 μs)
Symmetric sparse (triu only): Trial(6.196 μs)
Symmetric sparse (triu only) optimized: Trial(5.992 μs)

SymmetricSparseTests.runtests(50)
Normal sparse: Trial(37.411 μs)
Symmetric sparse: Trial(41.211 μs)
Symmetric sparse (triu only): Trial(40.626 μs)
Symmetric sparse (triu only) optimized: Trial(37.119 μs)

SymmetricSparseTests.runtests(200)
Normal sparse: Trial(2.304 ms)
Symmetric sparse: Trial(2.766 ms)
Symmetric sparse (triu only): Trial(2.678 ms)
Symmetric sparse (triu only) optimized: Trial(2.422 ms)

SymmetricSparseTests.runtests(500)
Normal sparse: Trial(34.553 ms)
Symmetric sparse: Trial(45.762 ms)
Symmetric sparse (triu only): Trial(44.226 ms)
Symmetric sparse (triu only) optimized: Trial(40.840 ms)

@KristofferC
Member

You can always create Symmetric(triu(K)). It just probably shouldn't be done automatically by the Symmetric constructor.
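For example (a sketch; K stands for any sparse matrix with symmetric entries):

K = sprandn(1000, 1000, 0.01); K = K + K'   # both triangles stored
S = Symmetric(triu(K))                      # wrap only the upper triangle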

@jebej
Contributor

jebej commented Jun 3, 2017

You can always create Symmetric(triu(K)). It just probably shouldn't be done automatically by the Symmetric constructor.

I understand that, but if the wrapper cannot assume that only the triu part of the matrix is stored, there still need to be checks to determine which triangle, and which entries, should be looked at.

@dpo
Contributor Author

dpo commented Jun 3, 2017

There were some benchmarks in the Discourse thread using a large sparse matrix. You can see them in this gist: https://gist.github.com/dpo/481b0c03dd08d26af342573df98ddc21. Profiling reveals that the index tests inside the multiplication kernel consume substantial time.

row > col && break # assume indices are sorted
a = nzval[j]
C[row, k] += a * αxj
row == col || (tmp += a * B[row, k])
Contributor

I think this should have an alpha

Contributor Author

Of course, thanks. I chose to multiply tmp by alpha after the loop to save a few cycles.
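In code, the arrangement is roughly (same names as in the excerpt above):

tmp += a * B[row, k]    # inside the loop: accumulate without α
C[col, k] += α * tmp    # after the loop: apply α once per column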

Contributor Author

PS: should the kernels be exported?

Contributor

I don't think so.


function A_mul_B_L_kernel!(α::Number, A::Symmetric{TA,SparseMatrixCSC{TA,S}}, B::StridedVecOrMat, β::Number, C::StridedVecOrMat) where {TA,S}

colptr = A.data.colptr
Contributor

4 space indent, and shouldn't start a function body with a blank line

Contributor Author

Should be fixed. Besides doing the same job for Hermitian, what other methods not yet covered would users find crucial?

@jebej
Contributor

jebej commented Jun 3, 2017

Is there anything that can be done to make it faster than the regular sparse type? Right now this is slower, which seems undesirable.

@@ -0,0 +1,72 @@
# This file is a part of Julia. License is MIT: https://julialang.org/license

function Symmetric(A::SparseMatrixCSC, uplo::Symbol=:U)
Member

This is not needed since it is equivalent to the default constructor:

Symmetric(A::AbstractMatrix, uplo::Symbol=:U) = (checksquare(A); Symmetric{eltype(A),typeof(A)}(A, char_uplo(uplo)))

Contributor Author

Right. Fixed.

A_mul_B!(one(T), A, B, zero(T), similar(B, T, (A.data.n, size(B, 2))))
end

function A_mul_B_U_kernel!(α::Number, A::Symmetric{TA,SparseMatrixCSC{TA,S}}, B::StridedVecOrMat, β::Number, C::StridedVecOrMat) where {TA,S}
Member

There are just two lines that differ between the U/L kernels. I think we can merge them and add Val{:U}/Val{:L} as the last argument to the kernel? Something like:

function A_mul_B_UL_kernel!(α::Number, A::Symmetric{TA,SparseMatrixCSC{TA,S}},
    B::StridedVecOrMat, β::Number, C::StridedVecOrMat, ::Type{Val{uplo}}) where {TA,S,uplo}
    colptr = A.data.colptr
    rowval = A.data.rowval
    nzval = A.data.nzval
    if β != 1
        β != 0 ? scale!(C, β) : fill!(C, zero(eltype(C)))
    end
    @inbounds for k = 1 : size(C, 2)
        for col = 1 : A.data.n
            αxj = α * B[col, k]
            tmp = TA(0)
            for j = colptr[col] : (colptr[col + 1] - 1)
                row = rowval[j]
                # assume indices are sorted; skip entries outside the stored triangle
                if uplo == :U
                    row > col && break
                else
                    row < col && continue
                end
                a = nzval[j]
                C[row, k] += a * αxj
                row == col || (tmp += a * B[row, k])
            end
            C[col, k] += α * tmp
        end
    end
    C
end
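(With a constant uplo, the branch should compile away, and the kernel would be called as, e.g., A_mul_B_UL_kernel!(α, A, B, β, C, Val{:U}).)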

Contributor Author

There's a performance impact for that, which is why I separated them.

Contributor Author

In fact, there's a performance impact for allowing C to be general, which makes me think that a separate version where C is just a vector is desirable.

Member

There's a performance impact for that, which is why I separated them.

That should be compiled away.

In fact, there's a performance impact for allowing C to be general, which makes me think that a separate version where C is just a vector is desirable.

If the impact is noticeable it makes sense to have a separate kernel for matvec.

Contributor Author

dpo commented Jun 4, 2017

Using the matrix in my gist, here's the benchmark of A * b:

BenchmarkTools.Trial: 
  memory estimate:  3.85 MiB
  allocs estimate:  2
  --------------
  minimum time:     21.475 ms (0.00% GC)
  median time:      22.218 ms (0.00% GC)
  mean time:        22.425 ms (1.17% GC)
  maximum time:     27.263 ms (5.92% GC)
  --------------
  samples:          223
  evals/sample:     1

Now U * b and L * b using separate kernels:

BenchmarkTools.Trial: 
  memory estimate:  3.85 MiB
  allocs estimate:  2
  --------------
  minimum time:     28.259 ms (0.00% GC)
  median time:      29.112 ms (0.00% GC)
  mean time:        29.481 ms (0.88% GC)
  maximum time:     35.077 ms (0.00% GC)
  --------------
  samples:          170
  evals/sample:     1
BenchmarkTools.Trial: 
  memory estimate:  3.85 MiB
  allocs estimate:  2
  --------------
  minimum time:     33.965 ms (0.00% GC)
  median time:      35.219 ms (0.00% GC)
  mean time:        35.797 ms (0.92% GC)
  maximum time:     40.802 ms (0.00% GC)
  --------------
  samples:          140
  evals/sample:     1

and U * b and L * b using the joint kernel (after obvious corrections):

BenchmarkTools.Trial: 
  memory estimate:  3.85 MiB
  allocs estimate:  4
  --------------
  minimum time:     33.787 ms (0.00% GC)
  median time:      34.744 ms (0.00% GC)
  mean time:        35.347 ms (0.88% GC)
  maximum time:     40.835 ms (0.00% GC)
  --------------
  samples:          142
  evals/sample:     1
BenchmarkTools.Trial: 
  memory estimate:  3.85 MiB
  allocs estimate:  4
  --------------
  minimum time:     33.890 ms (0.00% GC)
  median time:      34.753 ms (0.00% GC)
  mean time:        35.240 ms (0.89% GC)
  maximum time:     41.668 ms (0.00% GC)
  --------------
  samples:          142
  evals/sample:     1

I didn't know about Val. It improves over my initial joint version somewhat, but there'll still be a noticeable difference over large numbers of matrix-vector products.

@fredrikekre
Member

Is there anything that can be done to make it faster than the regular sparse type? Right now this is slower, which seems undesirable.

Try some larger matrices; a 500x500 matrix with 0.32 density is not really representative of a real use case for sparse matrices. For the cases in your benchmark post above, calling BLAS is more than 13 times faster on my computer.

@jebej
Contributor

jebej commented Jun 3, 2017

Try some larger matrices; a 500x500 matrix with 0.32 density is not really representative of a real use case for sparse matrices. For the cases in your benchmark post above, calling BLAS is more than 13 times faster on my computer.

I have tried larger and sparser matrices, with similar results. I think the cost of accessing the dense array out of column order is high:

SymmetricSparseTests.run_tests(800,0.05)
Normal sparse: Trial(25.844 ms)
Symmetric sparse: Trial(32.871 ms)
Symmetric sparse (triu only): Trial(30.654 ms)
Symmetric sparse (triu only) optimized: Trial(27.898 ms)

SymmetricSparseTests.run_tests(1500,0.05)
Normal sparse: Trial(159.849 ms)
Symmetric sparse: Trial(205.979 ms)
Symmetric sparse (triu only): Trial(194.236 ms)
Symmetric sparse (triu only) optimized: Trial(179.533 ms)

@andreasnoack
Member

I just did a small test using MKL with n=1000 and A = sprandn(n,n,0.01) |> t -> t + t', and the symmetric multiplication is quite a bit slower than the general multiplication, so I don't think we should assume that we can beat the general multiplication. Users would need to choose between speed and memory efficiency.

We could consider adding the check

all(1:size(A,2)) do j
    i = last(colptr[j]:colptr[j+1] - 1)
    # the last (largest) stored row in column j must not be below the diagonal
    return i > 0 ? rowval[i] <= j : true
end

to determine if only the triangle is stored. It's an O(n) check before an O(nnz) operation, and a few examples suggest it might be worth it, although it complicates the code a bit.
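A self-contained version of that check might look like this (a sketch; the name istriustored is hypothetical, and the CSC fields are as in Base):

# Hypothetical helper: true if no stored entry lies strictly below the diagonal.
function istriustored(A::SparseMatrixCSC)
    colptr, rowval = A.colptr, A.rowval
    return all(1:size(A, 2)) do j
        r = colptr[j] : colptr[j + 1] - 1
        # an empty column is fine; otherwise the last (largest) stored row
        # must be on or above the diagonal
        isempty(r) ? true : rowval[last(r)] <= j
    end
end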

@dpo
Contributor Author

dpo commented Jun 4, 2017

Option 3 in my gist above adds O(n) storage to basically recover the performance of general matrix-vector multiplication. The disadvantage is that a lot of methods involving sparse Symmetric matrices must be reimplemented.
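(For illustration only, one plausible shape of that O(n) extra storage, assuming a per-column pointer to the diagonal computed once at construction; the actual layout is in the gist:)

# Hypothetical sketch: record, for each column, the global nzval index where
# the diagonal entry would sit, so the kernel never has to search for it.
diagptr = zeros(Int, A.n)
for col = 1 : A.n
    r = A.colptr[col] : A.colptr[col + 1] - 1
    k = searchsortedfirst(A.rowval[r], col)   # position within the column
    diagptr[col] = first(r) + k - 1           # global index into nzval
end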

@fredrikekre
Member

Perhaps the index vector in that option can be computed in A_mul_B instead of the Symmetric constructor then?

@jebej
Contributor

jebej commented Jun 4, 2017

Is there a plan for implementing more sophisticated algorithms for sparse-dense matrix multiplication?

@dpo
Contributor Author

dpo commented Jun 4, 2017

Perhaps the index vector in that option can be computed in A_mul_B instead of the Symmetric constructor then?

You could, but it would be very inefficient to reconstruct the same array over and over, every time you multiply. In iterative methods for linear systems, you may need tens or hundreds of products.

@dpo
Contributor Author

dpo commented Jun 4, 2017

Variant 3 in the gist above, with O(n) extra storage, does lend itself to a "fused" implementation using Val{:L}/Val{:U}. The timings are essentially unchanged and the code is shorter.

@dpo
Contributor Author

dpo commented Jun 9, 2017

I wonder if it wouldn't make more sense for Symmetric to be more than just a wrapper in the sparse case. In the dense case, there are no savings to be had in storing only a triangle, but in the large and sparse case it makes a lot of sense, and I would argue that that's one of the main points of having a Symmetric type (besides promising that a matrix is indeed symmetric). In addition, most (probably all) sparse factorization routines I know of for symmetric matrices claim that they only access one triangle of the input. This makes me wonder whether it wouldn't make more sense for Triangular to call tril()/triu() in the sparse case, and for Symmetric and Hermitian to take a Triangular matrix as input. In my opinion, this is a place where Matlab has always missed out. Working with actual triangular matrices would eliminate the performance hit and the extra storage of Variant 3 in my gist above; it would correspond to Variant 1. If properly documented, I can't imagine it would startle users.

@andreasnoack
Member

A problem might be that Symmetric(SparseMatrixCSC) would then make a copy, which it usually doesn't do. The check suggested in #22200 (comment) should be pretty cheap. Have you considered that option?

@dpo
Contributor Author

dpo commented Jun 9, 2017

Sure, but what do we do if the test isn't satisfied? If Symmetric takes a Triangular as input, it wouldn't make a copy, but perhaps Triangular would. In the sparse case, that doesn't bother me, because it would simply encourage users to build only one triangle of symmetric/Hermitian matrices.

@andreasnoack
Member

Sure, but what do we do if the test isn't satisfied?

Just fall back on the slower version that checks the indices. In most cases, users will provide a genuinely triangular matrix anyway, and we could document that, if it isn't already triangular, using Symmetric(triu!(A)) will be better. I don't think that XTriangular will help, since it is also just a view/interpretation of the underlying array, so the same issue applies. It's not that I completely oppose your idea; I just want to think through the alternatives.

@Sacha0
Member

Sacha0 commented Jun 11, 2017

(Tangentially, the above (and related past) discussion highlights that introducing a separate class of matrix annotations might be worthwhile at some future point: in some cases you need an annotation that asserts some property of the wrapped storage's contents (for example, that some Matrix is nonnegative, or is lower triangular), rather than one that merely indicates how the wrapped storage's contents should be interpreted, without providing guarantees about those contents.)

@andreasnoack
Member

@dpo Would you be able to revisit this? It would be great to have.

@dpo
Contributor Author

dpo commented Aug 23, 2017

@andreasnoack Sure thing. I'll try to get to it this week.

@dpo
Contributor Author

dpo commented Aug 24, 2017

@andreasnoack Is this what you have in mind: https://gist.github.com/dpo/481b0c03dd08d26af342573df98ddc21#file-symmetric_matvec_v4-jl

Here's the expected behavior.

The performance will be good if the user supplies a triangular matrix and poor if they don't.

@andreasnoack
Member

I'd actually prefer the version that uses the current Symmetric type but checks, in the multiplication method, whether just the triangle is stored.

@dpo
Contributor Author

dpo commented Aug 28, 2017

You don't want to do that each time you compute a product. It would completely defeat the savings of storing and using a sparse matrix. Krylov methods would grind to a halt.

@fredrikekre
Member

That check is presumably very cheap in comparison to the actual product, no?

@andreasnoack
Member

I just did some timings. I thought it would be cheaper to check, so I now agree with @dpo's conclusion, and we might need a different approach. I can see two solutions:

  1. Give up on using Symmetric/Hermitian for sparse matrices and define a custom SparseMatrixCSCHermitian
  2. Make Symmetric always modify sparse input such that we can assume that the storage is triangular

@fredrikekre
Member

Out of curiosity, how expensive is the check compared to the multiply?

@andreasnoack
Member

The check costs about 50% of the matvec for a tridiagonal matrix, and 35-40% for a pentadiagonal one.

@dpo
Contributor Author

dpo commented Aug 28, 2017

Make Symmetric always modify sparse input such that we can assume that the storage is triangular

In my view, that's the whole point of symmetric sparse matrices. More than just promising that the underlying matrix is symmetric (as in the dense case), you want to save storage and preserve efficient matvecs.

@Sacha0
Member

Sacha0 commented Aug 28, 2017

Ref. #17367 for discussion of mutating constructors (Hermitian!). Best!

@dpo
Contributor Author

dpo commented Aug 29, 2017

Symmetric! sounds sensible to me.

@andreasnoack
Member

Should be obsolete with #30018. Please comment if not.

@dpo
Contributor Author

dpo commented Feb 9, 2019

Are people still open to Symmetric!?

@andreasnoack
Member

Are people still open to Symmetric!?

Could you remind us what the benefit would be? For faster symmetric multiplication, you'd need the guarantee that the storage is triangular, right?

@dpo
Contributor Author

dpo commented Feb 17, 2019

Right, Symmetric! would begin with a tril!() (or triu!()) to save storage.
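A sketch of what that could look like (hypothetical; not in Base, and it assumes tril!/triu! drop the discarded entries of a sparse matrix in place):

# Hypothetical mutating constructor: drop the redundant triangle in place,
# then wrap, so downstream kernels may assume triangular storage.
function Symmetric!(A::SparseMatrixCSC, uplo::Symbol=:U)
    uplo == :U ? triu!(A) : tril!(A)
    return Symmetric(A, uplo)
end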

@andreasnoack
Member

So the only benefit would be storage, right? In that case, I think it would be more transparent to ask people to write Symmetric(tril!(A)); that clearly signals the intention. I'm not completely against the idea of introducing mutating constructors, but I think it would be nice to have more significant benefits when introducing a new idiom.

@fredrikekre
Member

AFAIU from the discussion above (e.g. #22200 (comment) and the following comments), the benefit would be that we could know that no entries were stored in the upper/lower half and not have to check, which wouldn't be the case with Symmetric(tril!(A)).

@andreasnoack
Member

andreasnoack commented Feb 18, 2019

...but then Symmetric! would have to be the only constructor, right? Otherwise, you wouldn't have the guarantee.

@fredrikekre
Member

fredrikekre commented Feb 18, 2019

Yeah. Maybe we could specialize on Symmetric{T,LowerTriangular{T,SparseMatrixCSC{T,Int}}} where T.

Edit: Doesn't work, because LowerTriangular is still backed by a regular matrix. I guess Symmetric! would have to create a new type.
