
# Speed up UMFpack LU for multiple (sparse) right hand sides #19500

Closed · opened this issue Dec 3, 2016 · 13 comments
Labels: domain:arrays:sparse (Sparse arrays), domain:linear algebra (Linear algebra), performance (Must go faster)

### lruthotto commented Dec 3, 2016 • edited by andreasnoack

The LU factorization (through UMFPACK) currently implemented in Julia 0.5 is rather slow when using it for multiple right hand sides. Looking at umfpack.jl around line 259, you can see that there is a loop over all right hand sides. As far as I could see, UMFPACK does not seem to have a built-in option for multiple rhs, so I followed the MATLAB code here to build one. It works quite nicely and I'd be pleased to contribute it to Base. To this end, it would be great to get some help and advice. For example, I'm not sure where to put the code. Also, I'm not sure why it takes so long to get the parts of the LU factorization. See this example:

```julia
function myLUsolve(A::Base.SparseArrays.UMFPACK.UmfpackLU, B)
    # get parts of the LU factorization
    println("time to extract LUfactors")
    @time (L, U, p, q, R) = A[:(:)]

    println("time to solve for $(size(B,2)) right hand sides")
    @time begin
        X = zeros(eltype(B), size(B))
        X[q,:] = U \ (L \ ((Diagonal(R) * B)[p,:]))
    end
    return X
end

nrhs = 300
n    = 10000
A    = UniformScaling(10) + sprandn(n, n, 0.0005);
rhs  = randn(n, nrhs);

println("=== lufact(A)\\rhs for real matrix ===")
println("\t factorize")
@time luA = lufact(A)
println("\t julia solve")
@time t1 = luA \ rhs
println("\t my solve")
@time t2 = myLUsolve(luA, rhs)
println("err: $(norm(t1-t2)/norm(t1))")
```

On my machine, Ubuntu 14, Julia 0.5.1, I get the following timings:

```
=== lufact(A)\rhs for real matrix ===
	 factorize
  5.479585 seconds (66 allocations: 536.308 MB, 0.64% gc time)
	 julia solve
 29.411826 seconds (613 allocations: 183.115 MB, 0.82% gc time)
	 my solve
time to extract LUfactors
  1.517049 seconds (42 allocations: 562.236 MB, 5.83% gc time)
time to solve for 300 right hand sides
 14.070643 seconds (31 allocations: 114.442 MB, 0.96% gc time)
 15.619639 seconds (7.67 k allocations: 677.021 MB, 1.44% gc time)
err: 1.5411555266919556e-12
```

### andreasnoack commented Dec 3, 2016

That is quite a large speedup. We should try to improve here. Extracting the factors as you do is one option, but it might be faster to use `umfpack_dl_wsolve`. The documentation states:

> When you have many linear systems to solve, this routine is faster than umfpack_*_solve, since the workspace (Wi, W) needs to be allocated only once, prior to calling umfpack_*_wsolve.
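For reference, a rough sketch of how such a wrapper could look. This is not code from the thread: it assumes, per the UMFPACK user guide, that `umfpack_dl_wsolve` takes the same arguments as `umfpack_dl_solve` plus two preallocated workspace arrays `Wi` and `W`, and it borrows the internal names (`UMFPACK_A`, `umf_ctrl`, `umf_info`, the `UmfpackLU` fields) from base/sparse/umfpack.jl at the time; all of these should be double-checked against the actual sources before use.

```julia
# Rough sketch only -- internal names and fields follow base/sparse/umfpack.jl
# of the Julia 0.5/0.6 era and may differ in other versions.
import Base.SparseArrays.UMFPACK: UmfpackLU, UMFPACK_A, umf_ctrl, umf_info

function wsolve_columns!(X::Matrix{Float64}, lu::UmfpackLU{Float64,Int64},
                         B::Matrix{Float64})
    n  = size(B, 1)
    Wi = Vector{Int64}(n)       # integer workspace, allocated once
    W  = Vector{Float64}(5n)    # 5n doubles suffices even with iterative refinement
    for j in 1:size(B, 2)
        status = ccall((:umfpack_dl_wsolve, :libumfpack), Int64,
            (Int64, Ptr{Int64}, Ptr{Int64}, Ptr{Float64},
             Ptr{Float64}, Ptr{Float64}, Ptr{Void},
             Ptr{Float64}, Ptr{Float64}, Ptr{Int64}, Ptr{Float64}),
            UMFPACK_A, lu.colptr, lu.rowval, lu.nzval,
            pointer(X, (j - 1)*n + 1), pointer(B, (j - 1)*n + 1), lu.numeric,
            umf_ctrl, umf_info, Wi, W)
        status == 0 || error("umfpack_dl_wsolve failed with status $status")
    end
    return X
end
```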

### lruthotto commented Dec 3, 2016

Sounds interesting, but I'm not familiar with `umfpack_dl_wsolve`. Is that a Julia function? If you code up an example, I'd be happy to run some comparisons. Note that the version I posted above also works for sparse right hand sides. Do you know whether that is also the case for `umfpack_dl_wsolve`?

### lruthotto commented Dec 3, 2016

On the same note: a big bottleneck is still the triangular solves (see `fwdTriSolve!` and `bwdTriSolve!` in linalg.jl). Is there any way to parallelize them for multiple right hand sides? Would shared arrays work?

### andreasnoack commented Dec 4, 2016

I've just taken a look at this, and the reason the default solve is slow is that iterative refinement of the solutions is enabled by default in UMFPACK. You can try it out with `SparseArrays.UMFPACK.umf_ctrl[8] = 0`. I get

```
 julia solve
julia> @time t1 = luA\rhs
 23.085403 seconds (28.54 k allocations: 184.735 MB, 0.58% gc time)

julia> @time t2 = myLUsolve(luA,rhs)
time to extract LUfactors
  1.528154 seconds (7.95 k allocations: 548.319 MB, 8.36% gc time)
time to solve for 300 right hand sides
 13.622108 seconds (504.26 k allocations: 136.302 MB, 0.96% gc time)
 15.343389 seconds (762.59 k allocations: 696.246 MB, 1.72% gc time)

julia> SparseArrays.UMFPACK.umf_ctrl[8] = 0
0

julia> @time t3 = luA\rhs;
  7.729600 seconds (609 allocations: 91.562 MB, 0.34% gc time)
```

We might want to expose this through a function. I also tried `umfpack_dl_wsolve`, but the allocation time is insignificant, so it doesn't really change anything.
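One possible shape for such a convenience function, as a minimal sketch; the name is hypothetical, and it pokes at the internal `umf_ctrl` vector, which may change between Julia versions:

```julia
# Hypothetical helper: solve with iterative refinement temporarily disabled.
# umf_ctrl[8] (1-based) corresponds to UMFPACK's IRSTEP control parameter.
function solve_norefine(F::Base.SparseArrays.UMFPACK.UmfpackLU, B)
    ctrl = Base.SparseArrays.UMFPACK.umf_ctrl
    old = ctrl[8]
    ctrl[8] = 0            # zero refinement steps
    try
        return F \ B
    finally
        ctrl[8] = old      # restore the previous setting
    end
end
```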

### lruthotto commented Dec 4, 2016

Thanks for the hint. I see a similar speedup now when deactivating iterative refinement. There is still additional room for speedup here (or elsewhere), though, if we could get OpenMP-style parallelism for the upper- and lower-triangular solves. UMFPACK seems to be using only one thread on my machine during the solve stage.

### andreasnoack commented Dec 4, 2016

We kind of have that option with `Threads.@threads`. The function `mysolve2` does a `@threads` `for` loop over the columns of `rhs`:

```
julia> @time MyUMF.mysolve(luA, rhs);
  7.381999 seconds (610 allocations: 23.374 MB, 0.02% gc time)

julia> @time MyUMF.mysolve2(luA, rhs);
  2.049506 seconds (1.64 k allocations: 142.510 MB, 3.53% gc time)
```

However, it looks like some of the functions in the `UMFPACK` module have too restrictive signatures, since they don't allow `view`s as input. We should change that. You should also be a bit careful when using Julia's threads, because nested `@threads` `for` loops (more than a single level) do not work.
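`MyUMF.mysolve2` itself is not shown in the thread; a minimal sketch of what such a threaded column loop could look like (names are illustrative, and it copies columns rather than using `view`s, since the `UMFPACK` wrappers rejected views at the time):

```julia
# Illustrative sketch only: threaded solve, one right-hand-side column per task.
function mysolve2(F::Base.SparseArrays.UMFPACK.UmfpackLU, B::Matrix{Float64})
    X = similar(B)
    Threads.@threads for j in 1:size(B, 2)
        # Copy the column because the wrappers do not accept views of B/X.
        X[:, j] = F \ B[:, j]
    end
    return X
end
```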

added the domain:linear algebra (Linear algebra), performance (Must go faster), and domain:arrays:sparse (Sparse arrays) labels Dec 4, 2016

### lruthotto commented Dec 4, 2016 • edited by andreasnoack

Thanks for the hint. I have looked at threads now. This would be a great addition to many of our codes. However, so far it does not seem to work well: there is a lot of added allocation, even for this example (which doesn't involve UMFPACK):

```julia
println("Number Of Threads : $(Threads.nthreads())")

function fwdTriSolve2!(A::SparseMatrixCSC, B::AbstractVecOrMat)
    # forward substitution for CSC matrices
    nrowB, ncolB = size(B, 1), size(B, 2)
    ncol = LinAlg.checksquare(A)
    if nrowB != ncol
        throw(DimensionMismatch("A is $(ncol) columns and B has $(nrowB) rows"))
    end

    aa = A.nzval
    ja = A.rowval
    ia = A.colptr

    Threads.@threads for k = 1:ncolB
        for j = 1:nrowB
            i1 = ia[j]
            i2 = ia[j + 1] - 1

            # loop through the structural zeros
            ii = i1
            jai = ja[ii]
            while ii <= i2 && jai < j
                ii += 1
                jai = ja[ii]
            end

            # check for zero pivot and divide with pivot
            if jai == j
                bj = B[jai,k]/aa[ii]
                B[jai,k] = bj
                ii += 1
            else
                throw(LinAlg.SingularException(j))
            end

            # update remaining part
            for i = ii:i2
                B[ja[i],k] -= bj*aa[i]
            end
        end
    end
    B
end

A = tril(sprandn(10000,10000,0.00002) + UniformScaling(100))
rhs = randn(size(A,1),20)
t1 = copy(rhs)
t2 = copy(rhs)
println("no-thread")
@time t1 = A\t1
println("with thread")
@time t2 = fwdTriSolve2!(A,t2)
println("\nerr: $(norm(t1-t2)./norm(t1))")
```

Running this (and repeating a couple of times) I get:

```
no-thread
  0.002094 seconds (7 allocations: 1.526 MB)
with thread
  0.131745 seconds (1.25 M allocations: 20.965 MB)
```

### KristofferC commented Dec 4, 2016

Try

```julia
function fwdTriSolve2!(A::SparseMatrixCSC, B::AbstractVecOrMat)
    # forward substitution for CSC matrices
    nrowB, ncolB = size(B, 1), size(B, 2)
    ncol = LinAlg.checksquare(A)
    if nrowB != ncol
        throw(DimensionMismatch("A is $(ncol) columns and B has $(nrowB) rows"))
    end

    aa = A.nzval
    ja = A.rowval
    ia = A.colptr

    Threads.@threads for k = 1:ncolB
        do_stuff(aa, ja, ia, k, nrowB)
    end
    B
end

function do_stuff(aa, ja, ia, k, nrowB)
    for j = 1:nrowB
        i1 = ia[j]
        i2 = ia[j + 1] - 1

        # loop through the structural zeros
        ii = i1
        jai = ja[ii]
        while ii <= i2 && jai < j
            ii += 1
            jai = ja[ii]
        end

        # check for zero pivot and divide with pivot
        if jai == j
            bj = B[jai,k]/aa[ii]
            B[jai,k] = bj
            ii += 1
        else
            throw(LinAlg.SingularException(j))
        end

        # update remaining part
        for i = ii:i2
            B[ja[i],k] -= bj*aa[i]
        end
    end
end
```

```
julia> @time t2 = fwdTriSolve2!(A,t2);
  0.000126 seconds (17 allocations: 416 bytes)
```

### lruthotto commented Dec 5, 2016

Thanks for this suggestion. I tried it, but the result was incorrect. I assume you meant to pass `B` to `do_stuff` as well. Doing this, the result is correct, but the timing is less impressive. For 4 threads I'm getting

```
no-thread
  1.419104 seconds (7 allocations: 152.588 MB, 0.86% gc time)
with thread
  0.690881 seconds (19.85 k allocations: 153.461 MB, 10.19% gc time)
```

So it's much better than what I previously had.

### KristofferC commented Dec 5, 2016

 Yeah, I messed up, but the point was to put everything inside the thread macro in its own function.
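For readers unfamiliar with the idiom: this is the usual function-barrier pattern, sketched below with illustrative names (not code from the thread). Keeping the loop body in its own function lets the compiler specialize on concrete argument types instead of working with variables captured by the closure that `@threads` generates, which is where the extra allocations tend to come from.

```julia
# Function-barrier sketch: the threaded loop only dispatches into `kernel!`,
# which is compiled with concrete argument types.
function threaded_columns!(B::AbstractMatrix)
    Threads.@threads for k in 1:size(B, 2)
        kernel!(B, k)              # all real work happens behind the barrier
    end
    return B
end

function kernel!(B, k)
    for j in 1:size(B, 1)
        B[j, k] = 2 * B[j, k]      # placeholder work on column k
    end
end
```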

### dmbates commented Dec 5, 2016

In the short term, it is worthwhile delving into the internals of UMFPACK, and SuiteSparse in general, to see how to speed things up. But in the long term, both for the license implications and for the sanity of those trying to interface to SuiteSparse, it would be better to create an alternative implementation in Julia.


### andreasnoack commented Dec 5, 2016

I've opened #19511 to make it easier to use threads for this. With that change, you can just use `A_ldiv_B!` and a threaded loop over the columns of `B`, e.g.

```julia
function mysolve{T}(F::UmfpackLU{T}, B::Matrix{T})
    n = checksquare(F)
    n == size(B, 1) || error("")
    X = similar(B)
    @threads for j in 1:size(B,2)
        A_ldiv_B!(view(X,:,j), F, view(B,:,j))
    end
    return X
end
```

I don't think extracting the factors of the LU will give a speedup.
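A quick usage sketch, assuming the function above has been defined with `UmfpackLU`, `checksquare`, `A_ldiv_B!`, and `@threads` brought into scope from their Base modules, and that Julia was started with several threads (e.g. `JULIA_NUM_THREADS=4`):

```julia
A   = UniformScaling(10) + sprandn(10_000, 10_000, 0.0005)
rhs = randn(10_000, 300)
luA = lufact(A)
X   = mysolve(luA, rhs)                 # one task per right-hand-side column
norm(X - luA \ rhs) / norm(luA \ rhs)   # should be close to machine precision
```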

### ViralBShah commented Dec 17, 2018

 Please reopen if something further can be done here.
