
BFGS - GPU #946

Merged: 10 commits into JuliaNLSolvers:master on Oct 14, 2021

Conversation

@vpuri3 (Contributor) commented Sep 28, 2021

BLAS-ified update_h! call in BFGS() so it can work on GPU

@inbounds state.invH[i, j] += c1 * state.dx[i] * state.dx[j]' - c2 * (state.u[i] * state.dx[j]' + state.u[j]' * state.dx[i])
end
end
state.invH .+= c1.*(state.dx * state.dx') .-
Review comment (Contributor) on the diff above:
this introduces quite a bit of intermediate allocations. You probably want to keep a fast ArrayInterface.fast_scalar_indexing(state.u) check and keep the branch
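
A minimal sketch of what that suggestion could look like (function and argument names here are illustrative, not the PR's actual code): keep the fused scalar loop when scalar indexing is cheap and fall back to 5-argument mul! otherwise, e.g. for CuArrays.

using ArrayInterface, LinearAlgebra

function bfgs_invH_update!(invH, dx, u, c1, c2)  # hypothetical helper
    if ArrayInterface.fast_scalar_indexing(invH)
        # CPU path: one fused pass over invH, no temporaries
        @inbounds for j in eachindex(dx)
            @simd for i in eachindex(dx)
                invH[i, j] += c1 * dx[i] * dx[j]' - c2 * (u[i] * dx[j]' + u[j]' * dx[i])
            end
        end
    else
        # GPU/BLAS path: three rank-one updates that accumulate into invH
        mul!(invH, dx, dx',  c1, 1)
        mul!(invH, u,  dx', -c2, 1)
        mul!(invH, dx, u',  -c2, 1)
    end
    return invH
end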

@vpuri3 (Contributor Author) commented Sep 29, 2021

@ChrisRackauckas ArrayInterface.can_setindex(state.u) evaluates to true. I also replaced the one-liner with calls to mul!. What are the next steps?

@ChrisRackauckas (Contributor)

It's true for arrays
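
To spell out the distinction being made (illustrative values; assumes CUDA.jl and its ArrayInterface support are loaded, so the exact results are my expectation rather than output from the thread):

using ArrayInterface, CUDA

ArrayInterface.can_setindex(rand(4))               # true: Arrays are mutable
ArrayInterface.can_setindex(CUDA.rand(4))          # also true, so it cannot select the GPU branch
ArrayInterface.fast_scalar_indexing(rand(4))       # true: elementwise loops are cheap on CPU
ArrayInterface.fast_scalar_indexing(CUDA.rand(4))  # false: scalar indexing a CuArray is slow/disallowed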

@longemen3000 (Contributor)

The errors are relevant, as the update_h function is tested against matrix and tensor types. Maybe wrapping state.dx and state.u in vec(x) could solve the problem. Finally, there are some new unused dependencies (CUDA, ArrayInterface) in the Project.toml.

@ChrisRackauckas (Contributor)

The code is also not correct because it's just overwriting state.invH instead of accumulating.

And it adds a Manifest.toml.
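
For readers following along, the overwrite-versus-accumulate distinction is exactly the difference between 3-argument and 5-argument mul! (a standalone illustration, not the PR's code):

using LinearAlgebra

C = ones(2, 2); a = [1.0, 2.0]
mul!(C, a, a')            # 3-arg form: C is overwritten with a*a'
C = ones(2, 2)
mul!(C, a, a', 2.0, 1.0)  # 5-arg form: C = 2.0*(a*a') + 1.0*C, i.e. it accumulates into C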

@killah-t-cell commented Oct 8, 2021

I'm new here, but based on this MWE (which I wrote based on the Rosenbrock OptimTestProblems), the code with mul! seems to be doing the same thing:

using LinearAlgebra

# with loop
c1 = 41.772696387872294
c2 = 0.0319955499083424
invH1 = [1.0 -0.0; -0.0 1.0]
dx = [0.14330445374562797, 0.0584916137737257]
u = [186.9231445284796, 76.37722725998003]

for i in 1:2
    @simd for j in 1:2
        @inbounds invH1[i, j] += c1 * dx[i] * dx[j]' - c2 * (u[i] * dx[j]' + u[j]' * dx[i])
    end
end
@show invH1

# with mul!
c1 = 41.772696387872294
c2 = 0.0319955499083424
invH2 = [1.0 -0.0; -0.0 1.0]
dx = [0.14330445374562797, 0.0584916137737257]
u = [186.9231445284796, 76.37722725998003]

mul!(invH2,dx,dx', c1,1)
mul!(invH2,u ,dx',-c2,1)
mul!(invH2,dx,u' ,-c2,1) 
@show invH2

invH1 - invH2
abs.(invH1 .- invH2) .< 1e-6

Am I missing something?

EDIT: It seems like the new code can't handle tensors or manifolds.

@longemen3000 (Contributor)

EDIT: It seems like the new code can't handle tensors or manifolds.

For the tensor part, I think wrapping the tensor in vec would suffice; I don't know if doing the same with the manifold would help. The thing is that, by iterating over one dimension, the old code implicitly assumes vector form regardless of the actual shape, but with an explicit mul! that assumption becomes explicit, and for arrays that are not in vector form the new code fails.
Something like this would solve the dimensionality problem:

mul!(invH2,vec(dx),vec(dx)', c1,1)
mul!(invH2,vec(u) ,vec(dx)',-c2,1)
mul!(invH2,vec(dx),vec(u)' ,-c2,1) 

@ChrisRackauckas (Contributor)

mul!(invH2,dx,dx', c1,1)
mul!(invH2,u ,dx',-c2,1)
mul!(invH2,dx,u' ,-c2,1) 

is equivalent to

mul!(invH2,dx,u' ,-c2,1) 

because it's just overwriting the same cache. You need to add the previous calculation to the last.

@killah-t-cell

@ChrisRackauckas I don't think so. As per the definition of 5-argument mul!, mul!(C,A,B,α,β) performs an in-place matrix-matrix or matrix-vector multiply-add of the form A*B*α + C*β and stores it in C. So an addition is already happening here. I wrote this short program to convince myself that I understood this right, and it seems that both approaches are equivalent (though they sometimes vary by 1e-16, which I assume is just numerical rounding error).

using LinearAlgebra

# Declarations
invH1 = [2.0 40; -4.0 7.0]
invH2 = [2.0 40; -4.0 7.0]
invH3 = [2.0 40; -4.0 7.0]
c1 = 41.772696387872294
c2 = 0.0319955499083424
dx = [0.14330445374562797, 0.0584916137737257]
u = [186.9231445284796, 76.37722725998003]

# with loop
for i in 1:2
    @simd for j in 1:2
        @show c1 * dx[i] * dx[j]' - c2 * (u[i] * dx[j]')
        @show -c2*u[j]' * dx[i]
        @inbounds invH1[i, j] += c1 * dx[i] * dx[j]' - c2 * (u[i] * dx[j]' + u[j]' * dx[i])
    end
end

# with mul!
mul!(invH2,dx,dx', c1,1) 
mul!(invH2,u ,dx',-c2,1)
mul!(invH2,dx,u' ,-c2,1) 

# with A*B*α + C -> C
invH3 += dx*dx'*c1
invH3 += u*dx'*-c2
invH3 += dx*u'*-c2

# Test equivalence
@show invH1
@show invH2
@show invH3

invH1 ≈ invH2
invH2 ≈ invH3
invH1 - invH2
invH2 == invH3

# prove inequivalence of https://github.com/JuliaNLSolvers/Optim.jl/pull/946#issuecomment-939242924
invH4 = [2.0 40; -4.0 7.0]
mul!(invH4,dx,u' ,-c2,1) 

invH5 = [2.0 40; -4.0 7.0]
mul!(invH5,dx,dx', c1,1) 
mul!(invH5,u ,dx',-c2,1)
mul!(invH5,dx,u' ,-c2,1) 
invH4 ≈ invH5

Once again LMK if there is something I don't get.

@killah-t-cell mentioned this pull request Oct 10, 2021
@killah-t-cell

I opened a PR that continues this work here #950. I didn't know how to add to this one. I hope this isn't rude to @vpuri3. This is mostly his work!

@ChrisRackauckas (Contributor)

As per the definition of 5-argument mul!, mul!(C,A,B,α,β) performs an in-place matrix-matrix or matrix-vector multiply-add of the form A*B*α + C*β and stores it in C. So an addition is already happening here.

Oh I see.

@ChrisRackauckas (Contributor)

You'll still want to keep the loops on CPU to fuse the operations, and use LoopVectorization there.

@killah-t-cell commented Oct 10, 2021

So you're saying I should create a mul! function using LoopVectorization, like in the example here? (So I still use a loop, but with the proper LoopVectorization macros?)

https://juliasimd.github.io/LoopVectorization.jl/stable/examples/matrix_multiplication/#Matrix-Multiplication

using LoopVectorization

function A_mul_B!(C, A, B)
    @turbo for n ∈ indices((C,B), 2), m ∈ indices((C,A), 1)
        Cmn = zero(eltype(C))
        for k ∈ indices((A,B), (2,1))
            Cmn += A[m,k] * B[k,n]
        end
        C[m,n] = Cmn
    end
end

@ChrisRackauckas (Contributor)

No, that @simd loop fuses the 3 operations, so I would expect that a LoopVectorization form would outperform the BLAS one on Arrays for at least "small enough" equations. It's worth testing and possibly specializing.

@vpuri3 (Contributor Author) commented Oct 10, 2021

@ChrisRackauckas, @killah-t-cell thanks for continuing the work! Both the @simd and @turbo implementations fail with CuArrays due to scalar indexing errors. Would you recommend moving the arrays to CPU with Array for the update?

@ChrisRackauckas (Contributor)

Would you recommend moving the arrays to CPU with Array for the update?

No, it should branch on the type.

@killah-t-cell

I would probably just remove the new Manifest.toml you added and maybe delete the # TODO BLASify this and @show ArrayInterface.can_setindex(state.u) lines.

Comment on lines 165 to 168
state.invH[i,j] += c1 * state.dx[i] * state.dx[j]
state.invH[i,j] -= c2 * state.u[i] * state.dx[j]
state.invH[i,j] -= c2 * state.dx[i] * state.u[j]
end end
Contributor:
why not fused?

Contributor Author:

using LoopVectorization, BenchmarkTools

n = 200
A = rand(n,n)
x = rand(n)
u = rand(n)
c1= 1
c2= 1
    
function fav1!(A,x,u,c1,c2)
    @turbo for i in 1:n for j in 1:n
        A[i,j] += c1 * x[i] * x[j]
        A[i,j] -= c2 * u[i] * x[j]
        A[i,j] -= c2 * x[i] * u[j]
    end end
    return
end
function fav2!(A,x,u,c1,c2)
    @turbo for i in 1:n for j in 1:n
        A[i,j] += c1 * x[i]*x[j] - c2*(u[i]*x[j] + x[i]*u[j])
    end end
    return 
end     
function fsmd!(A,x,u,c1,c2)
    @inbounds @fastmath for i in 1:n for j in 1:n
        A[i,j] += c1 * x[i] * x[j]
        A[i,j] -= c2 * u[i] * x[j]
        A[i,j] -= c2 * x[i] * u[j]
    end end
    return
end 

fav1!(A,x,u,c1,c2) 
fav2!(A,x,u,c1,c2) 
fsmd!(A,x,u,c1,c2)
@btime fav1!(A,x,u,c1,c2)
@btime fav2!(A,x,u,c1,c2)
@btime fsmd!(A,x,u,c1,c2)

output:

  27.323 μs (21 allocations: 656 bytes)
  27.619 μs (21 allocations: 656 bytes)
  30.255 ms (640401 allocations: 10.39 MiB)

somehow unfused is faster, and with similar # of allocations

Contributor Author:

*barely faster

Contributor:

Interesting. How does that scale with N? @chriselrod any explanation there?

Contributor Author:

#----------------------------#
n = 32
#----------------------------#
n=32 LoopVectorization unfused
  463.362 ns (0 allocations: 0 bytes)
n=32 LoopVectorization fused
  559.114 ns (0 allocations: 0 bytes)
n=32 inbounds fastmath
  3.136 μs (0 allocations: 0 bytes)
n=32 mul!
  3.028 μs (0 allocations: 0 bytes)
n=32 CuArrays mul!
  22.927 μs (102 allocations: 4.88 KiB)
#----------------------------#
n = 64
#----------------------------#
n=64 LoopVectorization unfused
  1.701 μs (0 allocations: 0 bytes)
n=64 LoopVectorization fused
  1.992 μs (0 allocations: 0 bytes)
n=64 inbounds fastmath
  13.412 μs (0 allocations: 0 bytes)
n=64 mul!
  8.614 μs (0 allocations: 0 bytes)
n=64 CuArrays mul!
  22.870 μs (102 allocations: 4.88 KiB)
#----------------------------#
n = 128
#----------------------------#
n=128 LoopVectorization unfused
  7.816 μs (0 allocations: 0 bytes)
n=128 LoopVectorization fused
  8.062 μs (0 allocations: 0 bytes)
n=128 inbounds fastmath
  71.125 μs (0 allocations: 0 bytes)
n=128 mul!
  31.908 μs (0 allocations: 0 bytes)
n=128 CuArrays mul!
  27.459 μs (102 allocations: 4.88 KiB)
#----------------------------#
n = 256
#----------------------------#
n=256 LoopVectorization unfused
  39.736 μs (0 allocations: 0 bytes)
n=256 LoopVectorization fused
  40.701 μs (0 allocations: 0 bytes)
n=256 inbounds fastmath
  465.373 μs (0 allocations: 0 bytes)
n=256 mul!
  197.076 μs (0 allocations: 0 bytes)
n=256 CuArrays mul!
  27.553 μs (102 allocations: 4.88 KiB)

Contributor:

I'll have to take more of a look at this.
What are the two unfused loops vs the fused loops?

I need to update this repo and allow minimization instead of just maximization, but:
https://github.com/chriselrod/QuasiNewtonMethods.jl
I added LoopVectorization based versions there, and got much better benchmark results (>5x faster) than Optim when the function is cheap to evaluate (rosenbrock) for a 60-dimensional example.

Contributor Author:

@chriselrod

function fav1!(A,x,u,c1,c2) # unfused
    @turbo for i ∈ axes(A,1), j ∈ axes(A,2)
        A[i,j] += c1 * x[i] * x[j]
        A[i,j] -= c2 * u[i] * x[j]
        A[i,j] -= c2 * x[i] * u[j]
    end
    return
end
function fav2!(A,x,u,c1,c2) # fused
    @turbo for i ∈ axes(A,1), j ∈ axes(A,2)
        A[i,j] += c1 * x[i]*x[j] - c2*(u[i]*x[j] + x[i]*u[j])
    end
    return
end

Contributor:

Thanks. For your earlier examples, you'd have needed to add n as an argument to avoid globals.

Contributor:

FWIW:

julia> @btime fav1!($A,$x,$u,$c1,$c2)
  4.348 μs (0 allocations: 0 bytes)

julia> @btime fav2!($A,$x,$u,$c1,$c2)
  4.276 μs (0 allocations: 0 bytes)

julia> @btime fav3!($A,$x,$u,$c1,$c2)
  4.390 μs (0 allocations: 0 bytes)

julia> @btime fav4!($A,$x,$u,$c1,$c2)
  4.370 μs (0 allocations: 0 bytes)

julia> @btime fsmd!($A,$x,$u,$c1,$c2)
  8.280 μs (0 allocations: 0 bytes)

where:

function fav3!(A,x,u,c1,c2) # fused
  @turbo for i ∈ axes(A,1), j ∈ axes(A,2)
    A[i,j] += c1 * x[i]*x[j] - c2*u[i]*x[j] - c2*x[i]*u[j]
  end
  return
end
function fav4!(A,x,u,c1,c2) # fused
  @turbo for i ∈ axes(A,1), j ∈ axes(A,2)
    A[i,j] += (c1 * x[j])*x[i] - (c2*x[j])*u[i] - (c2*u[j])*x[i]
  end
  return
end

So I get the same performance as the unfused version with these, but that performance is slightly worse than with the fused one.
I think this may be noise. I did confirm that the inner loops are FMA-only with fav1!, fav3!, and fav4!, while fav2! has additional multiplications.
But I think this code is basically memory-bottlenecked, so that extra compute doesn't matter much.

Although for n=32, it'll probably start mattering, like it made a big difference in your benchmarks there.
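
A rough back-of-the-envelope (my numbers, not the commenter's) for why the n = 256 case looks bandwidth-bound:

n = 256
bytes = n^2 * 8 * 2   # read and write every Float64 of A once: about 1 MiB of traffic
flops = n^2 * 9       # roughly 9 multiplies/adds per element for the three-term update
flops / bytes         # about 0.56 flop per byte, far below machine balance, so the
                      # extra multiplies in the fused form are hidden behind memory traffic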

Contributor:

Although this is what I get for n=32:

julia> @btime fav1!($A,$x,$u,$c1,$c2)
  107.040 ns (0 allocations: 0 bytes)

julia> @btime fav2!($A,$x,$u,$c1,$c2)
  100.216 ns (0 allocations: 0 bytes)

julia> @btime fav3!($A,$x,$u,$c1,$c2)
  106.829 ns (0 allocations: 0 bytes)

julia> @btime fav4!($A,$x,$u,$c1,$c2)
  108.732 ns (0 allocations: 0 bytes)

julia> @btime fsmd!($A,$x,$u,$c1,$c2)
  276.584 ns (0 allocations: 0 bytes)

@vpuri3 (Contributor Author) commented Oct 11, 2021

@ChrisRackauckas is this PR good to go?

@pkofod (Member) commented Oct 14, 2021

Interesting that CI seemed to pass. @antoine-levitt shouldn't those ' be preserved for complex support?
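
A small illustration of the complex-support concern (toy numbers, not from the PR): the ' in the rank-one terms is a conjugate transpose, which is what keeps the inverse-Hessian approximation Hermitian for complex inputs.

using LinearAlgebra

dx = [1.0 + 2.0im, 0.5 - 1.0im]
dx * transpose(dx)  # no conjugation: the result is complex symmetric but not Hermitian
dx * dx'            # conjugate transpose: Hermitian rank-one term, matching dx[i] * dx[j]' elementwise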

@codecov bot commented Oct 14, 2021

Codecov Report

Merging #946 (26a08e7) into master (a1b957d) will decrease coverage by 0.04%.
The diff coverage is 66.66%.


@@            Coverage Diff             @@
##           master     #946      +/-   ##
==========================================
- Coverage   83.60%   83.55%   -0.05%     
==========================================
  Files          42       42              
  Lines        3019     3022       +3     
==========================================
+ Hits         2524     2525       +1     
- Misses        495      497       +2     
Impacted Files                                 Coverage Δ
src/multivariate/solvers/first_order/bfgs.jl   91.54% <66.66%> (-2.57%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@vpuri3 (Contributor Author) commented Oct 14, 2021

Weird... there seems to be an issue with LoopVectorization not recognizing '. @chriselrod could you help with this?

using LoopVectorization, LinearAlgebra
using BenchmarkTools

function fav1!(A,x,u,c1,c2)
    @turbo for i ∈ axes(A,1), j ∈ axes(A,2)
        A[i,j] += c1 * x[i] * x[j]'
        A[i,j] -= c2 * u[i] * x[j]'
        A[i,j] -= c2 * x[i] * u[j]'
    end
    return
end
julia> include("t.jl")
ERROR: LoadError: Expression not recognized.
(x[j])'
Stacktrace:
  [1] add_operation!(ls::LoopVectorization.LoopSet, LHS::Symbol, RHS::Expr, elementbytes::Int64, position::Int64)
    @ LoopVectorization ~/.julia/packages/LoopVectorization/bIqmO/src/modeling/graphs.jl:1131
  [2] add_parent!(vparents::Vector{LoopVectorization.Operation}, deps::Vector{Symbol}, reduceddeps::Vector{Symbol}, ls::LoopVectorization.LoopSet, var::Expr, elementbytes::Int64, position::Int64)
    @ LoopVectorization ~/.julia/packages/LoopVectorization/bIqmO/src/parse/add_compute.jl:68
  [3] add_compute!(ls::LoopVectorization.LoopSet, var::Symbol, ex::Expr, elementbytes::Int64, position::Int64, mpref::Nothing)
    @ LoopVectorization ~/.julia/packages/LoopVectorization/bIqmO/src/parse/add_compute.jl:353
  [4] add_compute!
    @ ~/.julia/packages/LoopVectorization/bIqmO/src/parse/add_compute.jl:316 [inlined]
  [5] add_operation!(ls::LoopVectorization.LoopSet, LHS::Symbol, RHS::Expr, elementbytes::Int64, position::Int64)
    @ LoopVectorization ~/.julia/packages/LoopVectorization/bIqmO/src/modeling/graphs.jl:1115
  [6] add_parent!(vparents::Vector{LoopVectorization.Operation}, deps::Vector{Symbol}, reduceddeps::Vector{Symbol}, ls::LoopVectorization.LoopSet, var::Expr, elementbytes::Int64, position::Int64)
    @ LoopVectorization ~/.julia/packages/LoopVectorization/bIqmO/src/parse/add_compute.jl:68
  [7] add_compute!(ls::LoopVectorization.LoopSet, var::Symbol, ex::Expr, elementbytes::Int64, position::Int64, mpref::LoopVectorization.ArrayReferenceMetaPosition)

@chriselrod (Contributor)

(For reference, in case anyone reads this: I answered on Slack saying LV currently doesn't parse '. I could add parsing for that, but for now, explicitly calling adjoint should be fine.)
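
A sketch of the workaround described above, replacing the postfix ' with an explicit adjoint call so that LoopVectorization can parse the loop (function name is hypothetical; for real element types adjoint is a no-op, so this matches the existing behaviour):

using LoopVectorization

function fav1_adj!(A, x, u, c1, c2)  # hypothetical name
    @turbo for i ∈ axes(A, 1), j ∈ axes(A, 2)
        A[i, j] += c1 * x[i] * adjoint(x[j])
        A[i, j] -= c2 * u[i] * adjoint(x[j])
        A[i, j] -= c2 * x[i] * adjoint(u[j])
    end
    return
end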

@pkofod (Member) commented Oct 14, 2021

The nightlies are unrelated, I think.

@pkofod pkofod merged commit ac17b7e into JuliaNLSolvers:master Oct 14, 2021