
@schmrlng (Contributor)

Notes:

  • 14 (the size threshold for dispatching to the Base versions of functions, inherited from lu) may be worth canonicalizing somewhere instead of appearing repeatedly as a magic number; see the sketch after this list.
  • solve and \ currently only work for square matrices on the LHS, though the same was true of the inv(a) * b stopgap previously in place (i.e., no functionality has been lost). Maybe solve should be folded into the ldiv family of functions in some future update.
  • There's currently no size threshold for falling back to Base in expm. Compile times can get into the minutes beyond 15 x 15; I think this is a consequence of mul! never being dispatched to its Base version.
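
For illustration, a minimal sketch of what canonicalizing that threshold could look like. Everything here is hypothetical: `mysolve` and `BASE_FALLBACK_THRESHOLD` are made-up names, and the small-size branch just reuses the old `inv(a) * b` stopgap as a stand-in for the real unrolled kernel:

```julia
using StaticArrays

# Hypothetical sketch only; StaticArrays hard-codes 14 at each call site.
const BASE_FALLBACK_THRESHOLD = 14

function mysolve(A::StaticMatrix{N,N,T}, b::StaticVector{N,T}) where {N,T}
    if N > BASE_FALLBACK_THRESHOLD
        # Past the threshold, fall back to Base/LAPACK so compile time stays
        # bounded. N is a type parameter, so this branch folds away at
        # compile time.
        return similar_type(b)(Matrix(A) \ Vector(b))
    else
        # Below it, stay on a static-size path (the old inv(a) * b stopgap
        # stands in for the specialized unrolled kernel here).
        return inv(A) * b
    end
end
```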

@tkoolen (Contributor)

tkoolen commented May 31, 2018

> Compile times can get into the minutes beyond 15 x 15

Yeah, I'm generally a little worried about the compile times for the LU code; testing the LU stuff probably takes about 80% of total test time.

@martinholters (Collaborator)

> Yeah, I'm generally a little worried about the compile times for the LU code;

Indeed. It was a nice exercise and it's cool that it's working, but it might not be too practical.

What I think we can do on 0.7, with the better allocation elision, is to convert the input to an MMatrix, have an inplace LU algorithm working on that, then return it converted to an SMatrix, and not suffer from any allocations.

@andyferris (Member)

Cool! Any analysis of speed and/or accuracy?

Yes, I do agree with the comments above. If it's true that expm is slow for large sizes we should definitely fix that, but

> I think this is a consequence of mul! never being dispatched to its Base version.

doesn't mul! call BLAS directly for large matrices?

> What I think we can do on 0.7, with the better allocation elision, is to convert the input to an MMatrix, have an inplace LU algorithm working on that, then return it converted to an SMatrix, and not suffer from any allocations.

Yes - I'd really love to see proof of this working. It would be really cool and would be the "correct" way of dealing with larger structures. In fact, we could use BLAS and LAPACK a lot more - we'd just pass stack pointers to the inputs and outputs. In the original code (way back before StaticArrays was even released) I had MMatrix as basically RefValue{SMatrix}, which was kind of neat for this sort of thing.

Speculatively, I've thought that rather than similar_type and construction, we should use similar, allocation elision and mutation, followed by publish, a function that says "I'm done mutating this" and would unwrap the ref and return the result. (Basically, there's no reason mutation of "immutables" should be invalid during the construction phase; see the sketch below.)
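
A purely speculative sketch of that idea: `publish` does not exist in StaticArrays, and `MMatrix` here plays the role of the old `RefValue{SMatrix}` wrapper, standing in for whatever `similar` would return:

```julia
using StaticArrays

# Hypothetical `publish`: "I'm done mutating this"; unwrap the mutable
# wrapper and hand back the immutable result. Not a real API.
publish(B::MMatrix) = SMatrix(B)

function add_identity(A::SMatrix{N,N,T}) where {N,T}
    B = MMatrix(A)           # stand-in for `similar` plus a copy
    for i in 1:N
        B[i,i] += one(T)     # mutation during the "construction phase"
    end
    return publish(B)        # freeze and return an SMatrix
end
```

With allocation elision on 0.7, the whole round trip should in principle compile down to pure stack operations.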

@schmrlng (Contributor, Author)

schmrlng commented Jun 1, 2018

> Cool! Any analysis of speed and/or accuracy?

Not a particularly exhaustive study (especially in terms of accuracy), but here's a quick-and-dirty look:

```julia
using StaticArrays, BenchmarkTools, DataFrames, DataStructures

Nmax = 20
unary = (det, inv, expm)
binary = (\,)

# Four rows per size: compile time, StaticArrays runtime, Base runtime, max error
data = OrderedDict{Symbol,Any}()
data[:SIZE] = vcat(([i, "", "", ""] for i in 1:Nmax)...)
data[:STAT] = [stat for sz in 1:Nmax for stat in ("compile time (s)", "StaticArrays (μs)", "Base (μs)", "max error")]

for f in unary
    f_data = Float64[]
    for N in 1:Nmax
        print("\r$((f,N))")
        SA = @SMatrix rand(N,N)
        A = Array(SA)
        push!(f_data, @elapsed f(SA))           # first call, so this includes compilation
        push!(f_data, 1e6*@belapsed $f($SA))    # StaticArrays runtime (μs)
        push!(f_data, 1e6*@belapsed $f($A))     # Base runtime (μs)
        # worst-case deviation from the Base result over 1000 random trials
        push!(f_data, maximum([begin
                                   SA = @SMatrix rand(N,N)
                                   A = Array(SA)
                                   norm(f(A) - f(SA))
                               end for i in 1:1000]))
    end
    data[Symbol(f)] = f_data
end

for f in binary
    f_data = Float64[]
    for N in 1:Nmax
        print("\r$((f,N))")
        SA = @SMatrix rand(N,N)
        A = Array(SA)
        SB = @SMatrix rand(N,N)
        B = Array(SB)
        push!(f_data, @elapsed f(SA,SB))        # first call, so this includes compilation
        push!(f_data, 1e6*@belapsed $f($SA,$SB))
        push!(f_data, 1e6*@belapsed $f($A,$B))
        push!(f_data, maximum([begin
                                   SA = @SMatrix rand(N,N)
                                   A = Array(SA)
                                   SB = @SMatrix rand(N,N)
                                   B = Array(SB)
                                   norm(f(A,B) - f(SA,SB))
                               end for i in 1:1000]))
    end
    data[Symbol(f)] = f_data
end

df = DataFrame(data...)
```

Results.

The compile time does continue to increase for expm past 14x14 matrices (my new pet theory: the ldiv methods in triangular.jl never dispatch to Base; see the sketch below).
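
For illustration, the kind of cutoff that theory would suggest; `tri_ldiv` is a made-up name, and StaticArrays' triangular.jl does not currently do this:

```julia
using LinearAlgebra, StaticArrays

# Hypothetical: past the same size threshold, route the triangular solve
# through Base/LAPACK instead of the unrolled static method.
function tri_ldiv(A::UpperTriangular{T,<:StaticMatrix{N,N,T}},
                  b::StaticVector{N,T}) where {N,T}
    if N > 14
        return similar_type(b)(UpperTriangular(Matrix(parent(A))) \ Vector(b))
    else
        return A \ b   # unrolled static-size method in triangular.jl
    end
end
```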

> doesn't mul! call BLAS directly for large matrices?

Oh yes, that's correct: https://github.com/JuliaArrays/StaticArrays.jl/blob/master/src/matrix_multiply.jl#L126.

@martinholters (Collaborator)

Little experiment concerning the LU implementation (and sorry for hijacking this PR):

```julia
# Note: binding the element type needs StaticMatrix{<:Any,<:Any,T}, since the
# first two parameters of StaticMatrix are the dimensions.
function mylu(A::StaticMatrix{<:Any,<:Any,T}, ::Val{Pivot}=Val(true)) where {T,Pivot}
    A = MMatrix(A)
    m, n = size(A)
    ipiv = MVector{m,Int}(undef)
    info = 0  # index of the first zero pivot, as in the stdlib version

    minmn = min(m,n)

    @inbounds begin # this whole loop copied from stdlib/LinearAlgebra/src/lu.jl
        for k = 1:minmn
            # find index max
            kp = k
            if Pivot
                amax = abs(zero(T))
                for i = k:m
                    absi = abs(A[i,k])
                    if absi > amax
                        kp = i
                        amax = absi
                    end
                end
            end
            ipiv[k] = kp
            if !iszero(A[kp,k])
                if k != kp
                    # Interchange rows k and kp
                    for i = 1:n
                        tmp = A[k,i]
                        A[k,i] = A[kp,i]
                        A[kp,i] = tmp
                    end
                end
                # Scale first column
                Akkinv = inv(A[k,k])
                for i = k+1:m
                    A[i,k] *= Akkinv
                end
            elseif info == 0
                info = k
            end
            # Update the rest
            for j = k+1:n
                for i = k+1:m
                    A[i,j] -= A[i,k]*A[k,j]
                end
            end
        end
    end

    return SMatrix(A), SVector(ipiv)
end
```

Then I get (on 0.7.0-alpha.0):

```julia
julia> @btime mylu($(SMatrix{10,10}(rand(10,10))));
  606.908 ns (0 allocations: 0 bytes)
```

Compare with (on 0.6.3; cannot meaningfully benchmark on 0.7 due to deprecations):

```julia
julia> @btime StaticArrays.__lu($(SMatrix{10,10}(rand(10,10))), Val{true});
  652.778 ns (0 allocations: 0 bytes)
```

It's not a 100% fair comparison (results in slightly different formats, different Julia versions, runtime dependent on the random values due to pivoting, ...), but they're definitely close, and approximately ten times faster than `lu($(rand(10,10)))`.

However, on 0.6.3, i.e. without the allocation elimination:

```julia
julia> @btime mylu($(SMatrix{10,10}(rand(10,10))));
  138.708 μs (2505 allocations: 57.20 KiB)
```

@andyferris (Member) left a comment

This looks like some solid work. Thanks @schmrlng :)

I think we should merge this and play with the idea of mutating algorithms more when v0.7 is released, around the time we drop v0.6 support.

I'll leave this open for another day or so for any more feedback before merging.

Finally - @schmrlng feel free to add your benchmarking and/or fuzz testing code to the repository, such as in the perf directory - it is very useful, and even if this kind of thing tends to go stale after a while, it can always be revived.
