Speed up UMFpack LU for multiple (sparse) right hand sides #19500
Comments
That is quite a large speedup. We should try to improve this. Extracting the factors like you do here is one option, but it might be faster to use
|
Sounds interesting, but I don't know |
On the same note: a big bottleneck is still the triangular solves (see |
I've just taken a look at this, and the reason the default solve is slow is that iterative refinement of the solution is enabled by default in UMFPACK. You can try it out with SparseArrays.UMFPACK.umf_ctrl[8] = 0. I get:
julia> @time t1 = luA\rhs
23.085403 seconds (28.54 k allocations: 184.735 MB, 0.58% gc time)
julia> @time t2 = myLUsolve(luA,rhs)
time to extract LUfactors
1.528154 seconds (7.95 k allocations: 548.319 MB, 8.36% gc time)
time to solve for 300 right hand sides
13.622108 seconds (504.26 k allocations: 136.302 MB, 0.96% gc time)
15.343389 seconds (762.59 k allocations: 696.246 MB, 1.72% gc time)
julia> SparseArrays.UMFPACK.umf_ctrl[8] = 0
0
julia> @time t3 = luA\rhs;
7.729600 seconds (609 allocations: 91.562 MB, 0.34% gc time)

We might want to expose this through a function.
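Such a function might look like this: a minimal sketch, assuming umf_ctrl[8] is the refinement-step slot toggled above; solve_norefine is a hypothetical name, not an existing API.

# Hypothetical helper: temporarily disable UMFPACK's iterative refinement
# (the umf_ctrl[8] slot set above), solve, then restore the old setting.
function solve_norefine(F, b)
    old = SparseArrays.UMFPACK.umf_ctrl[8]
    SparseArrays.UMFPACK.umf_ctrl[8] = 0
    try
        return F \ b
    finally
        SparseArrays.UMFPACK.umf_ctrl[8] = old
    end
end

I also tried |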
Thanks for the hint. I see a similar speedup now when deactivating iterative refinement. I do see additional room for speedup here (or elsewhere), though, if we could have OpenMP-style parallelism for the upper- and lower-triangular solves. UMFPACK seems to be using only one thread on my machine during the solve stage. |
We kind of have that option with threads:

julia> @time MyUMF.mysolve(luA, rhs);
7.381999 seconds (610 allocations: 23.374 MB, 0.02% gc time)
julia> @time MyUMF.mysolve2(luA, rhs);
2.049506 seconds (1.64 k allocations: 142.510 MB, 3.53% gc time)

However, it looks like some of the functions in the |
Thanks for the hint. I looked at threads now. This would be a great addition to many of our codes. However, so far it does not seem to work well: a lot of allocation is added, also in my example (which doesn't involve UMFPACK). See this example:

println("Number Of Threads : $(Threads.nthreads())")
function fwdTriSolve2!(A::SparseMatrixCSC, B::AbstractVecOrMat)
    # forward substitution for CSC matrices
    nrowB, ncolB = size(B, 1), size(B, 2)
    ncol = LinAlg.checksquare(A)
    if nrowB != ncol
        throw(DimensionMismatch("A is $(ncol) columns and B has $(nrowB) rows"))
    end
    aa = A.nzval
    ja = A.rowval
    ia = A.colptr
    Threads.@threads for k = 1:ncolB
        for j = 1:nrowB
            i1 = ia[j]
            i2 = ia[j + 1] - 1
            # loop through the structural zeros
            ii = i1
            jai = ja[ii]
            while ii <= i2 && jai < j
                ii += 1
                jai = ja[ii]
            end
            # check for zero pivot and divide with pivot
            if jai == j
                bj = B[jai, k] / aa[ii]
                B[jai, k] = bj
                ii += 1
            else
                throw(LinAlg.SingularException(j))
            end
            # update remaining part
            for i = ii:i2
                B[ja[i], k] -= bj * aa[i]
            end
        end
    end
    B
end
A = tril(sprandn(10000,10000,0.00002) + UniformScaling(100))
rhs = randn(size(A,1),20)
t1 = copy(rhs)
t2 = copy(rhs)
println("no-thread")
@time t1 = A\t1
println("with thread")
@time t2 = fwdTriSolve2!(A,t2)
println("\nerr: $(norm(t1-t2)./norm(t1))") Running this (and repeating a couple of times) I get this:
|
Try

function fwdTriSolve2!(A::SparseMatrixCSC, B::AbstractVecOrMat)
    # forward substitution for CSC matrices
    nrowB, ncolB = size(B, 1), size(B, 2)
    ncol = LinAlg.checksquare(A)
    if nrowB != ncol
        throw(DimensionMismatch("A is $(ncol) columns and B has $(nrowB) rows"))
    end
    aa = A.nzval
    ja = A.rowval
    ia = A.colptr
    Threads.@threads for k = 1:ncolB
        do_stuff(aa, ja, ia, k, nrowB)
    end
    B
end

function do_stuff(aa, ja, ia, k, nrowB)
    for j = 1:nrowB
        i1 = ia[j]
        i2 = ia[j + 1] - 1
        # loop through the structural zeros
        ii = i1
        jai = ja[ii]
        while ii <= i2 && jai < j
            ii += 1
            jai = ja[ii]
        end
        # check for zero pivot and divide with pivot
        if jai == j
            bj = B[jai, k] / aa[ii]
            B[jai, k] = bj
            ii += 1
        else
            throw(LinAlg.SingularException(j))
        end
        # update remaining part
        for i = ii:i2
            B[ja[i], k] -= bj * aa[i]
        end
    end
end

julia> @time t2 = fwdTriSolve2!(A,t2);
0.000126 seconds (17 allocations: 416 bytes) |
Thanks for this suggestion. I tried it, but the result was incorrect. I assume you want to pass B into do_stuff as well.
So, it's much better than what I previously had. |
Yeah, I messed up, but the point was to put everything inside the thread macro in its own function: the function barrier avoids the closure boxing that @threads otherwise introduces, which is what causes the extra allocations.
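For completeness, here is a corrected sketch of that pattern with B passed through to the worker (fwdTriSolve3! and solve_column! are made-up names; the body is the same algorithm as above):

function fwdTriSolve3!(A::SparseMatrixCSC, B::AbstractVecOrMat)
    nrowB, ncolB = size(B, 1), size(B, 2)
    ncol = LinAlg.checksquare(A)
    if nrowB != ncol
        throw(DimensionMismatch("A is $(ncol) columns and B has $(nrowB) rows"))
    end
    aa = A.nzval
    ja = A.rowval
    ia = A.colptr
    Threads.@threads for k = 1:ncolB
        # B is now an explicit argument, so the worker sees it without boxing
        solve_column!(B, aa, ja, ia, k, nrowB)
    end
    B
end

function solve_column!(B, aa, ja, ia, k, nrowB)
    for j = 1:nrowB
        i1 = ia[j]
        i2 = ia[j + 1] - 1
        # skip entries above the diagonal
        ii = i1
        jai = ja[ii]
        while ii <= i2 && jai < j
            ii += 1
            jai = ja[ii]
        end
        # check for zero pivot and divide with pivot
        jai == j || throw(LinAlg.SingularException(j))
        bj = B[jai, k] / aa[ii]
        B[jai, k] = bj
        ii += 1
        # update remaining part of the column
        for i = ii:i2
            B[ja[i], k] -= bj * aa[i]
        end
    end
    B
end
|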
Short term, it is worthwhile to delve into the internals of UMFPACK and SuiteSparse in general to see how to speed things up. But long term, both for the license implications and for the sanity of those trying to interface to SuiteSparse, it is better to create an alternative implementation in Julia. |
I've opened #19511 to make it easier to use threads for this. With that change, you can just use

function mysolve{T}(F::UmfpackLU{T}, B::Matrix{T})
    n = checksquare(F)
    n == size(B, 1) || throw(DimensionMismatch("F is $(n)×$(n) but B has $(size(B, 1)) rows"))
    X = similar(B)
    @threads for j in 1:size(B, 2)
        A_ldiv_B!(view(X, :, j), F, view(B, :, j))
    end
    return X
end
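To actually get parallelism, Julia has to be started with the JULIA_NUM_THREADS environment variable set; a usage sketch (names as in the snippet above, with A and rhs assumed to exist):

# e.g. start with: JULIA_NUM_THREADS=4 julia
luA = lufact(A)          # sparse LU via UMFPACK
X   = mysolve(luA, rhs)  # one column per threaded iteration, solved into views

I don't think extracting the factors of the LU will give a speedup. |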
Please reopen if something further can be done here. |
The LU factorization (through UMFPACK) currently implemented in Julia 0.5 is rather slow when used with multiple right-hand sides. Looking at umfpack.jl line 259, you can see that there is a loop over all right-hand sides.
As far as I could see, UMFPACK does not seem to have a built-in option for multiple right-hand sides, so I followed the MATLAB code here to build one.
It works quite nicely and I'd be pleased to contribute it to Base. To this end, it would be great to get some help and advice. For example, I'm not sure where to put the code. Also, I'm not sure why it takes so long to extract the parts of the LU factorization. A sketch of the idea is below.
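For illustration, a minimal sketch of the factor-extraction approach (mylusolve is a hypothetical stand-in for the gist's myLUsolve; it relies on the documented UMFPACK identity F[:L]*F[:U] == (F[:Rs] .* A)[F[:p], F[:q]]):

# Hypothetical sketch: factor once, then solve A*X = B for many right-hand
# sides with two sparse triangular solves plus scaling and permutations.
function mylusolve(F, B::AbstractMatrix)
    L, U = F[:L], F[:U]              # sparse triangular factors
    p, q, Rs = F[:p], F[:q], F[:Rs]  # row/col permutations and row scaling
    Y = U \ (L \ (Rs .* B)[p, :])    # Y holds the solution in permuted order
    X = similar(Y)
    X[q, :] = Y                      # undo the column permutation
    return X
end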
See this example here:
On my machine (Ubuntu 14, Julia 0.5.1), I get the following timings: