
BFGS - GPU #946

Merged: 10 commits into JuliaNLSolvers:master on Oct 14, 2021

Conversation

@vpuri3 (Contributor) commented Sep 28, 2021

BLAS-ified update_h! call in BFGS() so it can work on GPU

@inbounds state.invH[i, j] += c1 * state.dx[i] * state.dx[j]' - c2 * (state.u[i] * state.dx[j]' + state.u[j]' * state.dx[i])
end
end
state.invH .+= c1.*(state.dx * state.dx') .-
Review comment (Contributor) on the diff above:
this introduces quite a bit of intermediate allocations. You probably want to keep a fast ArrayInterface.fast_scalar_indexing(state.u) check and keep the branch
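
A minimal sketch of what that suggestion could look like (function and argument names here are illustrative, not the PR's actual code): keep the fused scalar loop when scalar indexing is cheap and fall back to 5-argument mul! otherwise, e.g. for CuArrays.

using ArrayInterface, LinearAlgebra

function bfgs_invH_update!(invH, dx, u, c1, c2)  # hypothetical helper
    if ArrayInterface.fast_scalar_indexing(invH)
        # CPU path: one fused pass over invH, no temporaries
        @inbounds for j in eachindex(dx)
            @simd for i in eachindex(dx)
                invH[i, j] += c1 * dx[i] * dx[j]' - c2 * (u[i] * dx[j]' + u[j]' * dx[i])
            end
        end
    else
        # GPU/BLAS path: three rank-one updates that accumulate into invH
        mul!(invH, dx, dx',  c1, 1)
        mul!(invH, u,  dx', -c2, 1)
        mul!(invH, dx, u',  -c2, 1)
    end
    return invH
end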

@vpuri3 (Contributor Author) commented Sep 29, 2021

@ChrisRackauckas ArrayInterface.can_setindex(state.u) evaluates to true. I also replaced the one-liner with calls to mul!. What are the next steps?

@ChrisRackauckas (Contributor)

It's true for arrays
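
To spell out the distinction being made (illustrative values; assumes CUDA.jl and its ArrayInterface support are loaded, so the exact results are my expectation rather than output from the thread):

using ArrayInterface, CUDA

ArrayInterface.can_setindex(rand(4))               # true: Arrays are mutable
ArrayInterface.can_setindex(CUDA.rand(4))          # also true, so it cannot select the GPU branch
ArrayInterface.fast_scalar_indexing(rand(4))       # true: elementwise loops are cheap on CPU
ArrayInterface.fast_scalar_indexing(CUDA.rand(4))  # false: scalar indexing a CuArray is slow/disallowed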

@longemen3000 (Contributor)

The errors are relevant, as the update_h function is tested against matrix and tensor types. Maybe wrapping state.dx and state.u in vec(x) could solve the problem. Finally, there are some new unused dependencies (CUDA, ArrayInterface) in the Project.toml.

@ChrisRackauckas (Contributor)

The code is also not correct because it's just overwriting state.invH instead of accumulating.

And it adds a Manifest.toml.
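
For readers following along, the overwrite-versus-accumulate distinction is exactly the difference between 3-argument and 5-argument mul! (a standalone illustration, not the PR's code):

using LinearAlgebra

C = ones(2, 2); a = [1.0, 2.0]
mul!(C, a, a')            # 3-arg form: C is overwritten with a*a'
C = ones(2, 2)
mul!(C, a, a', 2.0, 1.0)  # 5-arg form: C = 2.0*(a*a') + 1.0*C, i.e. it accumulates into C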

@killah-t-cell commented Oct 8, 2021

I'm new here, but based on this MWE (which I wrote based on the Rosenbrock OptimTestProblems), the code with mul! seems to be doing the same thing:

using LinearAlgebra

# with loop
c1 = 41.772696387872294
c2 = 0.0319955499083424
invH1 = [1.0 -0.0; -0.0 1.0]
dx = [0.14330445374562797, 0.0584916137737257]
u = [186.9231445284796, 76.37722725998003]

for i in 1:2
    @simd for j in 1:2
        @inbounds invH1[i, j] += c1 * dx[i] * dx[j]' - c2 * (u[i] * dx[j]' + u[j]' * dx[i])
    end
end
@show invH1

# with mul!
c1 = 41.772696387872294
c2 = 0.0319955499083424
invH2 = [1.0 -0.0; -0.0 1.0]
dx = [0.14330445374562797, 0.0584916137737257]
u = [186.9231445284796, 76.37722725998003]

mul!(invH2,dx,dx', c1,1)
mul!(invH2,u ,dx',-c2,1)
mul!(invH2,dx,u' ,-c2,1) 
@show invH2

invH1 - invH2
abs.(invH1 .- invH2) .< 1e-6

Am I missing something?

EDIT: It seems like the new code can't handle tensors or manifolds.

@longemen3000 (Contributor)

EDIT: It seems like the new code can't handle tensors or manifolds.

For the tensor part, I think wrapping the tensor in vec would suffice; I don't know if doing the same with the manifold would help. The thing is that, by iterating over one dimension, the old code implicitly assumes vector form regardless of the actual shape, but with an explicit mul! that assumption becomes explicit, and for arrays that are not in vector form the new code fails.
Something like this would solve the dimensionality problem:

mul!(invH2,vec(dx),vec(dx)', c1,1)
mul!(invH2,vec(u) ,vec(dx)',-c2,1)
mul!(invH2,vec(dx),vec(u)' ,-c2,1) 

@ChrisRackauckas (Contributor)

mul!(invH2,dx,dx', c1,1)
mul!(invH2,u ,dx',-c2,1)
mul!(invH2,dx,u' ,-c2,1) 

is equivalent to

mul!(invH2,dx,u' ,-c2,1) 

because it's just overwriting the same cache. You need to add the previous calculation to the last.

@killah-t-cell

@ChrisRackauckas I don't think so. As per the definition of 5-argument mul!, mul!(C,A,B,α,β) performs an in-place matrix-matrix or matrix-vector multiply-add of the form A*B*α + C*β and stores it in C. So an addition is already happening here. I wrote this short program to convince myself that I understood this right, and it seems that both approaches are equivalent (though they sometimes vary by 1e-16, which I assume is just numerical rounding error).

using LinearAlgebra

# Declarations
invH1 = [2.0 40; -4.0 7.0]
invH2 = [2.0 40; -4.0 7.0]
invH3 = [2.0 40; -4.0 7.0]
c1 = 41.772696387872294
c2 = 0.0319955499083424
dx = [0.14330445374562797, 0.0584916137737257]
u = [186.9231445284796, 76.37722725998003]

# with loop
for i in 1:2
    @simd for j in 1:2
        @show c1 * dx[i] * dx[j]' - c2 * (u[i] * dx[j]')
        @show -c2*u[j]' * dx[i]
        @inbounds invH1[i, j] += c1 * dx[i] * dx[j]' - c2 * (u[i] * dx[j]' + u[j]' * dx[i])
    end
end

# with mul!
mul!(invH2,dx,dx', c1,1) 
mul!(invH2,u ,dx',-c2,1)
mul!(invH2,dx,u' ,-c2,1) 

# with A*B*α + C -> C
invH3 += dx*dx'*c1
invH3 += u*dx'*-c2
invH3 += dx*u'*-c2

# Test equivalence
@show invH1
@show invH2
@show invH3

invH1 ≈ invH2
invH2 ≈ invH3
invH1 - invH2
invH2 == invH3

# prove inequivalence of https://github.com/JuliaNLSolvers/Optim.jl/pull/946#issuecomment-939242924
invH4 = [2.0 40; -4.0 7.0]
mul!(invH4,dx,u' ,-c2,1) 

invH5 = [2.0 40; -4.0 7.0]
mul!(invH5,dx,dx', c1,1) 
mul!(invH5,u ,dx',-c2,1)
mul!(invH5,dx,u' ,-c2,1) 
invH4 ≈ invH5

Once again LMK if there is something I don't get.

@killah-t-cell mentioned this pull request Oct 10, 2021
@killah-t-cell

I opened a PR that continues this work here #950. I didn't know how to add to this one. I hope this isn't rude to @vpuri3. This is mostly his work!

@ChrisRackauckas (Contributor)

As per the definition of 5-argument mul!, mul!(C,A,B,α,β) performs an in-place matrix-matrix or matrix-vector multiply-add of the form A*B*α + C*β and stores it in C. So an addition is already happening here.

Oh I see.

@ChrisRackauckas (Contributor)

You'll still want to keep the loops on CPU to fuse the operations, and use LoopVectorization there.

@killah-t-cell commented Oct 10, 2021

So you're saying I should create a mul! function using LoopVectorization, like in the example here? (So I still use a loop, but with the proper LoopVectorization macros?)

https://juliasimd.github.io/LoopVectorization.jl/stable/examples/matrix_multiplication/#Matrix-Multiplication

using LoopVectorization

function A_mul_B!(C, A, B)
    @turbo for n ∈ indices((C,B), 2), m ∈ indices((C,A), 1)
        Cmn = zero(eltype(C))
        for k ∈ indices((A,B), (2,1))
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end
end

@ChrisRackauckas (Contributor)

No, that @simd loop fuses the 3 operations, so I would expect that a LoopVectorization form would outperform the BLAS one on Arrays for at least "small enough" equations. It's worth testing and possibly specializing.

@vpuri3 (Contributor Author) commented Oct 10, 2021

@ChrisRackauckas, @killah-t-cell thanks for continuing the work! Both the @simd and @turbo implementations fail with CuArrays due to scalar indexing errors. Would you recommend moving the arrays to CPU with Array for the update?

@ChrisRackauckas (Contributor)

Would you recommend moving the arrays to CPU with Array for the update?

No, it should branch on the type.

@killah-t-cell

I would probably just remove the new Manifest.toml you added and maybe delete the # TODO BLASify this and @show ArrayInterface.can_setindex(state.u) lines.

Comment on lines 165 to 168
state.invH[i,j] += c1 * state.dx[i] * state.dx[j]
state.invH[i,j] -= c2 * state.u[i] * state.dx[j]
state.invH[i,j] -= c2 * state.dx[i] * state.u[j]
end end
Contributor:
why not fused?

Contributor Author:

using LoopVectorization, BenchmarkTools

n = 200
A = rand(n,n)
x = rand(n)
u = rand(n)
c1= 1
c2= 1
    
function fav1!(A,x,u,c1,c2)
    @turbo for i in 1:n for j in 1:n
        A[i,j] += c1 * x[i] * x[j]
        A[i,j] -= c2 * u[i] * x[j]
        A[i,j] -= c2 * x[i] * u[j]
    end end
    return
end
function fav2!(A,x,u,c1,c2)
    @turbo for i in 1:n for j in 1:n
        A[i,j] += c1 * x[i]*x[j] - c2*(u[i]*x[j] + x[i]*u[j])
    end end
    return 
end     
function fsmd!(A,x,u,c1,c2)
    @inbounds @fastmath for i in 1:n for j in 1:n
        A[i,j] += c1 * x[i] * x[j]
        A[i,j] -= c2 * u[i] * x[j]
        A[i,j] -= c2 * x[i] * u[j]
    end end
    return
end 

fav1!(A,x,u,c1,c2) 
fav2!(A,x,u,c1,c2) 
fsmd!(A,x,u,c1,c2)
@btime fav1!(A,x,u,c1,c2)
@btime fav2!(A,x,u,c1,c2)
@btime fsmd!(A,x,u,c1,c2)

output:

  27.323 μs (21 allocations: 656 bytes)
  27.619 μs (21 allocations: 656 bytes)
  30.255 ms (640401 allocations: 10.39 MiB)

somehow unfused is faster, and with similar # of allocations

Contributor Author:

*barely faster

Contributor:

Interesting. How does that scale with N? @chriselrod any explanation there?

Contributor Author:

#----------------------------#
n = 32
#----------------------------#
n=32 LoopVectorization unfused
  463.362 ns (0 allocations: 0 bytes)
n=32 LoopVectorization fused
  559.114 ns (0 allocations: 0 bytes)
n=32 inbounds fastmath
  3.136 μs (0 allocations: 0 bytes)
n=32 mul!
  3.028 μs (0 allocations: 0 bytes)
n=32 CuArrays mul!
  22.927 μs (102 allocations: 4.88 KiB)
#----------------------------#
n = 64
#----------------------------#
n=64 LoopVectorization unfused
  1.701 μs (0 allocations: 0 bytes)
n=64 LoopVectorization fused
  1.992 μs (0 allocations: 0 bytes)
n=64 inbounds fastmath
  13.412 μs (0 allocations: 0 bytes)
n=64 mul!
  8.614 μs (0 allocations: 0 bytes)
n=64 CuArrays mul!
  22.870 μs (102 allocations: 4.88 KiB)
#----------------------------#
n = 128
#----------------------------#
n=128 LoopVectorization unfused
  7.816 μs (0 allocations: 0 bytes)
n=128 LoopVectorization fused
  8.062 μs (0 allocations: 0 bytes)
n=128 inbounds fastmath
  71.125 μs (0 allocations: 0 bytes)
n=128 mul!
  31.908 μs (0 allocations: 0 bytes)
n=128 CuArrays mul!
  27.459 μs (102 allocations: 4.88 KiB)
#----------------------------#
n = 256
#----------------------------#
n=256 LoopVectorization unfused
  39.736 μs (0 allocations: 0 bytes)
n=256 LoopVectorization fused
  40.701 μs (0 allocations: 0 bytes)
n=256 inbounds fastmath
  465.373 μs (0 allocations: 0 bytes)
n=256 mul!
  197.076 μs (0 allocations: 0 bytes)
n=256 CuArrays mul!
  27.553 μs (102 allocations: 4.88 KiB)

Contributor:

I'll have to take more of a look at this.
What are the two unfused loops vs the fused loops?

I need to update this repo and allow minimization instead of just maximization, but:
https://github.com/chriselrod/QuasiNewtonMethods.jl
I added LoopVectorization based versions there, and got much better benchmark results (>5x faster) than Optim when the function is cheap to evaluate (rosenbrock) for a 60-dimensional example.

Contributor Author:

@chriselrod

function fav1!(A,x,u,c1,c2) # unfused
    @turbo for i ∈ axes(A,1), j ∈ axes(A,2)
        A[i,j] += c1 * x[i] * x[j]
        A[i,j] -= c2 * u[i] * x[j]
        A[i,j] -= c2 * x[i] * u[j]
    end
    return
end
function fav2!(A,x,u,c1,c2) # fused
    @turbo for i ∈ axes(A,1), j ∈ axes(A,2)
        A[i,j] += c1 * x[i]*x[j] - c2*(u[i]*x[j] + x[i]*u[j])
    end
    return
end

Contributor:

Thanks. For your earlier examples, you'd have needed to add n as an argument to avoid globals.

Contributor:

FWIW:

julia> @btime fav1!($A,$x,$u,$c1,$c2)
  4.348 μs (0 allocations: 0 bytes)

julia> @btime fav2!($A,$x,$u,$c1,$c2)
  4.276 μs (0 allocations: 0 bytes)

julia> @btime fav3!($A,$x,$u,$c1,$c2)
  4.390 μs (0 allocations: 0 bytes)

julia> @btime fav4!($A,$x,$u,$c1,$c2)
  4.370 μs (0 allocations: 0 bytes)

julia> @btime fsmd!($A,$x,$u,$c1,$c2)
  8.280 μs (0 allocations: 0 bytes)

where:

function fav3!(A,x,u,c1,c2) # fused
  @turbo for i ∈ axes(A,1), j ∈ axes(A,2)
    A[i,j] += c1 * x[i]*x[j] - c2*u[i]*x[j] - c2*x[i]*u[j]
  end
  return
end
function fav4!(A,x,u,c1,c2) # fused
  @turbo for i ∈ axes(A,1), j ∈ axes(A,2)
    A[i,j] += (c1 * x[j])*x[i] - (c2*x[j])*u[i] - (c2*u[j])*x[i]
  end
  return
end

So I get the same performance as the unfused version with these, but that performance is slightly worse than with the fused one.
I think this may be noise. I did confirm that the inner loops are FMA-only with fav1!, fav3!, and fav4!, while fav2! has additional multiplications.
But I think this code is basically memory-bottlenecked, so that extra compute doesn't matter much.

Although for n=32, it'll probably start mattering, like it made a big difference in your benchmarks there.
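
A rough back-of-the-envelope (my numbers, not the commenter's) for why the n = 256 case looks bandwidth-bound:

n = 256
bytes = n^2 * 8 * 2   # read and write every Float64 of A once: about 1 MiB of traffic
flops = n^2 * 9       # roughly 9 multiplies/adds per element for the three-term update
flops / bytes         # about 0.56 flop per byte, far below machine balance, so the
                      # extra multiplies in the fused form are hidden behind memory traffic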

Contributor:

Although this is what I get for n=32:

julia> @btime fav1!($A,$x,$u,$c1,$c2)
  107.040 ns (0 allocations: 0 bytes)

julia> @btime fav2!($A,$x,$u,$c1,$c2)
  100.216 ns (0 allocations: 0 bytes)

julia> @btime fav3!($A,$x,$u,$c1,$c2)
  106.829 ns (0 allocations: 0 bytes)

julia> @btime fav4!($A,$x,$u,$c1,$c2)
  108.732 ns (0 allocations: 0 bytes)

julia> @btime fsmd!($A,$x,$u,$c1,$c2)
  276.584 ns (0 allocations: 0 bytes)

@vpuri3 (Contributor Author) commented Oct 11, 2021

@ChrisRackauckas is this PR good to go?

@pkofod (Member) commented Oct 14, 2021

Interesting that CI seemed to pass. @antoine-levitt shouldn't those ' be preserved for complex support?
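
A small illustration of the complex-support concern (toy numbers, not from the PR): the ' in the rank-one terms is a conjugate transpose, which is what keeps the inverse-Hessian approximation Hermitian for complex inputs.

using LinearAlgebra

dx = [1.0 + 2.0im, 0.5 - 1.0im]
dx * transpose(dx)  # no conjugation: the result is complex symmetric but not Hermitian
dx * dx'            # conjugate transpose: Hermitian rank-one term, matching dx[i] * dx[j]' elementwise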

@codecov bot commented Oct 14, 2021

Codecov Report

Merging #946 (26a08e7) into master (a1b957d) will decrease coverage by 0.04%.
The diff coverage is 66.66%.


@@            Coverage Diff             @@
##           master     #946      +/-   ##
==========================================
- Coverage   83.60%   83.55%   -0.05%     
==========================================
  Files          42       42              
  Lines        3019     3022       +3     
==========================================
+ Hits         2524     2525       +1     
- Misses        495      497       +2     
Impacted Files                                 Coverage Δ
src/multivariate/solvers/first_order/bfgs.jl   91.54% <66.66%> (-2.57%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@vpuri3 (Contributor Author) commented Oct 14, 2021

Weird... there seems to be an issue with LoopVectorization not recognizing '. @chriselrod could you help with this?

using LoopVectorization, LinearAlgebra
using BenchmarkTools

function fav1!(A,x,u,c1,c2)
    @turbo for i ∈ axes(A,1), j ∈ axes(A,2)
        A[i,j] += c1 * x[i] * x[j]'
        A[i,j] -= c2 * u[i] * x[j]'
        A[i,j] -= c2 * x[i] * u[j]'
    end
    return
end
julia> include("t.jl")
ERROR: LoadError: Expression not recognized.
(x[j])'
Stacktrace:
  [1] add_operation!(ls::LoopVectorization.LoopSet, LHS::Symbol, RHS::Expr, elementbytes::Int64, position::Int64)
    @ LoopVectorization ~/.julia/packages/LoopVectorization/bIqmO/src/modeling/graphs.jl:1131
  [2] add_parent!(vparents::Vector{LoopVectorization.Operation}, deps::Vector{Symbol}, reduceddeps::Vector{Symbol}, ls::LoopVectorization.LoopSet, var::Expr, elementbytes::Int64, position::Int64)
    @ LoopVectorization ~/.julia/packages/LoopVectorization/bIqmO/src/parse/add_compute.jl:68
  [3] add_compute!(ls::LoopVectorization.LoopSet, var::Symbol, ex::Expr, elementbytes::Int64, position::Int64, mpref::Nothing)
    @ LoopVectorization ~/.julia/packages/LoopVectorization/bIqmO/src/parse/add_compute.jl:353
  [4] add_compute!
    @ ~/.julia/packages/LoopVectorization/bIqmO/src/parse/add_compute.jl:316 [inlined]
  [5] add_operation!(ls::LoopVectorization.LoopSet, LHS::Symbol, RHS::Expr, elementbytes::Int64, position::Int64)
    @ LoopVectorization ~/.julia/packages/LoopVectorization/bIqmO/src/modeling/graphs.jl:1115
  [6] add_parent!(vparents::Vector{LoopVectorization.Operation}, deps::Vector{Symbol}, reduceddeps::Vector{Symbol}, ls::LoopVectorization.LoopSet, var::Expr, elementbytes::Int64, position::Int64)
    @ LoopVectorization ~/.julia/packages/LoopVectorization/bIqmO/src/parse/add_compute.jl:68
  [7] add_compute!(ls::LoopVectorization.LoopSet, var::Symbol, ex::Expr, elementbytes::Int64, position::Int64, mpref::LoopVectorization.ArrayReferenceMetaPosition)

@chriselrod (Contributor)

(For reference, in case anyone reads this: I answered on Slack saying LV currently doesn't parse '. I could add parsing for that, but for now, explicitly calling adjoint should be fine.)
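
A sketch of the workaround described above, replacing the postfix ' with an explicit adjoint call so that LoopVectorization can parse the loop (function name is hypothetical; for real element types adjoint is a no-op, so this matches the existing behaviour):

using LoopVectorization

function fav1_adj!(A, x, u, c1, c2)  # hypothetical name
    @turbo for i ∈ axes(A, 1), j ∈ axes(A, 2)
        A[i, j] += c1 * x[i] * adjoint(x[j])
        A[i, j] -= c2 * u[i] * adjoint(x[j])
        A[i, j] -= c2 * x[i] * adjoint(u[j])
    end
    return
end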

@pkofod (Member) commented Oct 14, 2021

The nightlies are unrelated, I think.

@pkofod pkofod merged commit ac17b7e into JuliaNLSolvers:master Oct 14, 2021