BFGS - GPU #946
Conversation
@inbounds state.invH[i, j] += c1 * state.dx[i] * state.dx[j]' - c2 * (state.u[i] * state.dx[j]' + state.u[j]' * state.dx[i])
end
end
state.invH .+= c1.*(state.dx * state.dx') .-
this introduces quite a bit of intermediate allocations. You probably want to keep an ArrayInterface.fast_scalar_indexing(state.u) check and keep the branch
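A minimal sketch of what that branch could look like (the helper name update_invH! and its exact placement are hypothetical; ArrayInterface.fast_scalar_indexing and 5-argument mul! are the APIs referenced above, and the adjoints are dropped as in the real-valued loop in the diff):
using LinearAlgebra, ArrayInterface

# Hypothetical helper: keep the scalar loop when scalar indexing is fast
# (CPU arrays), and fall back to 5-argument mul! otherwise (e.g. GPU arrays).
function update_invH!(invH, dx, u, c1, c2)
    if ArrayInterface.fast_scalar_indexing(u)
        @inbounds for i in axes(invH, 1), j in axes(invH, 2)
            invH[i, j] += c1 * dx[i] * dx[j] - c2 * (u[i] * dx[j] + dx[i] * u[j])
        end
    else
        mul!(invH, dx, dx',  c1, 1)   # invH .+= c1 .* (dx * dx')
        mul!(invH, u,  dx', -c2, 1)   # invH .-= c2 .* (u * dx')
        mul!(invH, dx, u',  -c2, 1)   # invH .-= c2 .* (dx * u')
    end
    return invH
end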
@ChrisRackauckas
It's true for arrays
the errors are relevant, as the
The code is also not correct because it's just overwriting. And it adds a Manifest.toml.
I'm new here, but based on this MWE (which I wrote based on the Rosenbrock OptimTestProblems), the code with mul! seems to be doing the same thing:
# with loop
c1 = 41.772696387872294
c2 = 0.0319955499083424
invH1 = [1.0 -0.0; -0.0 1.0]
dx = [0.14330445374562797, 0.0584916137737257]
u = [186.9231445284796, 76.37722725998003]
for i in 1:2
@simd for j in 1:2
@inbounds invH1[i, j] += c1 * dx[i] * dx[j]' - c2 * (u[i] * dx[j]' + u[j]' * dx[i])
end
end
@show invH1
# with mul!
using LinearAlgebra  # needed for mul!
c1 = 41.772696387872294
c2 = 0.0319955499083424
invH2 = [1.0 -0.0; -0.0 1.0]
dx = [0.14330445374562797, 0.0584916137737257]
u = [186.9231445284796, 76.37722725998003]
mul!(invH2,dx,dx', c1,1)
mul!(invH2,u ,dx',-c2,1)
mul!(invH2,dx,u' ,-c2,1)
@show invH2
invH1 - invH2
(invH1 .- invH2) .< 1e-6
Am I missing something? EDIT: It seems like the new code can't handle tensors or manifolds.
for the tensor part, I think wrapping the tensor in
mul!(invH2,dx,dx', c1,1)
mul!(invH2,u ,dx',-c2,1)
mul!(invH2,dx,u' ,-c2,1)
is equivalent to mul!(invH2,dx,u' ,-c2,1) because it's just overwriting the same cache. You need to add the previous calculation to the last.
@ChrisRackauckas I don't think so. As per the definition of 5-argument mul! (mul!(C, A, B, α, β) computes A*B*α + C*β in place), with β = 1 each call accumulates into the existing matrix rather than overwriting it:
# Declarations
invH1 = [2.0 40; -4.0 7.0]
invH2 = [2.0 40; -4.0 7.0]
invH3 = [2.0 40; -4.0 7.0]
c1 = 41.772696387872294
c2 = 0.0319955499083424
dx = [0.14330445374562797, 0.0584916137737257]
u = [186.9231445284796, 76.37722725998003]
# with loop
for i in 1:2
@simd for j in 1:2
@show c1 * dx[i] * dx[j]' - c2 * (u[i] * dx[j]')
@show -c2*u[j]' * dx[i]
@inbounds invH1[i, j] += c1 * dx[i] * dx[j]' - c2 * (u[i] * dx[j]' + u[j]' * dx[i])
end
end
# with mul!
mul!(invH2,dx,dx', c1,1)
mul!(invH2,u ,dx',-c2,1)
mul!(invH2,dx,u' ,-c2,1)
# with A*B*α + C -> C
invH3 += dx*dx'*c1
invH3 += u*dx'*-c2
invH3 += dx*u'*-c2
# Test equivalence
@show invH1
@show invH2
@show invH3
invH1 ≈ invH2
invH2 ≈ invH3
invH1 - invH2
invH2 == invH3
# prove inequivalence of https://github.com/JuliaNLSolvers/Optim.jl/pull/946#issuecomment-939242924
invH4 = [2.0 40; -4.0 7.0]
mul!(invH4,dx,u' ,-c2,1)
invH5 = [2.0 40; -4.0 7.0]
mul!(invH5,dx,dx', c1,1)
mul!(invH5,u ,dx',-c2,1)
mul!(invH5,dx,u' ,-c2,1)
invH4 ≈ invH5
Once again LMK if there is something I don't get.
Oh I see.
You'll still want to keep the loops on CPU to fuse the operations, and use LoopVectorization there.
So you're saying I should create a mul! function using LoopVectorization like in the example here? (So I still use a loop, but just with the proper LoopVectorization macros?)
No, that
@ChrisRackauckas, @killah-t-cell thanks for continuing the work! Both the
No, it should branch on the type.
I would probably just remove the new Manifest.toml you added and maybe delete the
state.invH[i,j] += c1 * state.dx[i] * state.dx[j]
state.invH[i,j] -= c2 * state.u[i] * state.dx[j]
state.invH[i,j] -= c2 * state.dx[i] * state.u[j]
end end
why not fused?
using LoopVectorization, BenchmarkTools
n = 200
A = rand(n,n)
x = rand(n)
u = rand(n)
c1= 1
c2= 1
function fav1!(A,x,u,c1,c2)
@turbo for i in 1:n for j in 1:n
A[i,j] += c1 * x[i] * x[j]
A[i,j] -= c2 * u[i] * x[j]
A[i,j] -= c2 * x[i] * u[j]
end end
return
end
function fav2!(A,x,u,c1,c2)
@turbo for i in 1:n for j in 1:n
A[i,j] += c1 * x[i]*x[j] - c2*(u[i]*x[j] + x[i]*u[j])
end end
return
end
function fsmd!(A,x,u,c1,c2)
@inbounds @fastmath for i in 1:n for j in 1:n
A[i,j] += c1 * x[i] * x[j]
A[i,j] -= c2 * u[i] * x[j]
A[i,j] -= c2 * x[i] * u[j]
end end
return
end
fav1!(A,x,u,c1,c2)
fav2!(A,x,u,c1,c2)
fsmd!(A,x,u,c1,c2)
@btime fav1!(A,x,u,c1,c2)
@btime fav2!(A,x,u,c1,c2)
@btime fsmd!(A,x,u,c1,c2)
output:
27.323 μs (21 allocations: 656 bytes)
27.619 μs (21 allocations: 656 bytes)
30.255 ms (640401 allocations: 10.39 MiB)
somehow unfused is faster, and with similar # of allocations
*barely faster
Interesting. How does that scale with N? @chriselrod any explanation there?
#----------------------------#
n = 32
#----------------------------#
n=32 LoopVectorization unfused
463.362 ns (0 allocations: 0 bytes)
n=32 LoopVectorization fused
559.114 ns (0 allocations: 0 bytes)
n=32 inbounds fastmath
3.136 μs (0 allocations: 0 bytes)
n=32 mul!
3.028 μs (0 allocations: 0 bytes)
n=32 CuArrays mul!
22.927 μs (102 allocations: 4.88 KiB)
#----------------------------#
n = 64
#----------------------------#
n=64 LoopVectorization unfused
1.701 μs (0 allocations: 0 bytes)
n=64 LoopVectorization fused
1.992 μs (0 allocations: 0 bytes)
n=64 inbounds fastmath
13.412 μs (0 allocations: 0 bytes)
n=64 mul!
8.614 μs (0 allocations: 0 bytes)
n=64 CuArrays mul!
22.870 μs (102 allocations: 4.88 KiB)
#----------------------------#
n = 128
#----------------------------#
n=128 LoopVectorization unfused
7.816 μs (0 allocations: 0 bytes)
n=128 LoopVectorization fused
8.062 μs (0 allocations: 0 bytes)
n=128 inbounds fastmath
71.125 μs (0 allocations: 0 bytes)
n=128 mul!
31.908 μs (0 allocations: 0 bytes)
n=128 CuArrays mul!
27.459 μs (102 allocations: 4.88 KiB)
#----------------------------#
n = 256
#----------------------------#
n=256 LoopVectorization unfused
39.736 μs (0 allocations: 0 bytes)
n=256 LoopVectorization fused
40.701 μs (0 allocations: 0 bytes)
n=256 inbounds fastmath
465.373 μs (0 allocations: 0 bytes)
n=256 mul!
197.076 μs (0 allocations: 0 bytes)
n=256 CuArrays mul!
27.553 μs (102 allocations: 4.88 KiB)
I'll have to take more of a look at this.
What are the two unfused loops vs the fused loops?
I need to update this repo and allow minimization instead of just maximization, but:
https://github.com/chriselrod/QuasiNewtonMethods.jl
I added LoopVectorization based versions there, and got much better benchmark results (>5x faster) than Optim when the function is cheap to evaluate (rosenbrock) for a 60-dimensional example.
function fav1!(A,x,u,c1,c2) # unfused
@turbo for i ∈ axes(A,1), j ∈ axes(A,2)
A[i,j] += c1 * x[i] * x[j]
A[i,j] -= c2 * u[i] * x[j]
A[i,j] -= c2 * x[i] * u[j]
end
return
end
function fav2!(A,x,u,c1,c2) # fused
@turbo for i ∈ axes(A,1), j ∈ axes(A,2)
A[i,j] += c1 * x[i]*x[j] - c2*(u[i]*x[j] + x[i]*u[j])
end
return
end
Thanks. For your earlier examples, you'd have needed to add n as an argument to avoid globals.
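A small sketch of what that fix could look like for the fsmd! version above (just passing n through and interpolating the arguments into @btime; fsmd2! is a hypothetical name, everything else unchanged):
using BenchmarkTools

function fsmd2!(A, x, u, c1, c2, n)  # n passed explicitly instead of read from a global
    @inbounds @fastmath for i in 1:n, j in 1:n
        A[i,j] += c1 * x[i] * x[j]
        A[i,j] -= c2 * u[i] * x[j]
        A[i,j] -= c2 * x[i] * u[j]
    end
    return
end

n = 200
A = rand(n, n); x = rand(n); u = rand(n)
@btime fsmd2!($A, $x, $u, 1.0, 1.0, $n)   # should report 0 allocations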
FWIW:
julia> @btime fav1!($A,$x,$u,$c1,$c2)
4.348 μs (0 allocations: 0 bytes)
julia> @btime fav2!($A,$x,$u,$c1,$c2)
4.276 μs (0 allocations: 0 bytes)
julia> @btime fav3!($A,$x,$u,$c1,$c2)
4.390 μs (0 allocations: 0 bytes)
julia> @btime fav4!($A,$x,$u,$c1,$c2)
4.370 μs (0 allocations: 0 bytes)
julia> @btime fsmd!($A,$x,$u,$c1,$c2)
8.280 μs (0 allocations: 0 bytes)
where:
function fav3!(A,x,u,c1,c2) # fused
@turbo for i ∈ axes(A,1), j ∈ axes(A,2)
A[i,j] += c1 * x[i]*x[j] - c2*u[i]*x[j] - c2*x[i]*u[j]
end
return
end
function fav4!(A,x,u,c1,c2) # fused
@turbo for i ∈ axes(A,1), j ∈ axes(A,2)
A[i,j] += (c1 * x[j])*x[i] - (c2*x[j])*u[i] - (c2*u[j])*x[i]
end
return
end
So I get the same performance as the unfused with these, but that performance is slightly worse than with the fused.
I think this may be noise. I did confirm that the inner loops are fma-only with fav1!, 3, and 4, while there are additional multiplications with fav2!.
But I think this code is basically memory bottlenecked so that extra compute doesn't matter much.
Although for n=32, it'll probably start mattering, like it made a big difference in your benchmarks there.
Although this is what I get for n=32:
julia> @btime fav1!($A,$x,$u,$c1,$c2)
107.040 ns (0 allocations: 0 bytes)
julia> @btime fav2!($A,$x,$u,$c1,$c2)
100.216 ns (0 allocations: 0 bytes)
julia> @btime fav3!($A,$x,$u,$c1,$c2)
106.829 ns (0 allocations: 0 bytes)
julia> @btime fav4!($A,$x,$u,$c1,$c2)
108.732 ns (0 allocations: 0 bytes)
julia> @btime fsmd!($A,$x,$u,$c1,$c2)
276.584 ns (0 allocations: 0 bytes)
@ChrisRackauckas is this PR good to go?
Interesting that CI seemed to pass. @antoine-levitt shouldn't those
Codecov Report
@@ Coverage Diff @@
## master #946 +/- ##
==========================================
- Coverage 83.60% 83.55% -0.05%
==========================================
Files 42 42
Lines 3019 3022 +3
==========================================
+ Hits 2524 2525 +1
- Misses 495 497 +2
Continue to review full report at Codecov.
weird...there seems to be an issue with LoopVectorization not recognizing the ' (adjoint):
using LoopVectorization, LinearAlgebra
using BenchmarkTools
function fav1!(A,x,u,c1,c2)
@turbo for i ∈ axes(A,1), j ∈ axes(A,2)
A[i,j] += c1 * x[i] * x[j]'
A[i,j] -= c2 * u[i] * x[j]'
A[i,j] -= c2 * x[i] * u[j]'
end
return
end

julia> include("t.jl")
ERROR: LoadError: Expression not recognized.
(x[j])'
Stacktrace:
 [1] add_operation!(ls::LoopVectorization.LoopSet, LHS::Symbol, RHS::Expr, elementbytes::Int64, position::Int64)
   @ LoopVectorization ~/.julia/packages/LoopVectorization/bIqmO/src/modeling/graphs.jl:1131
 [2] add_parent!(vparents::Vector{LoopVectorization.Operation}, deps::Vector{Symbol}, reduceddeps::Vector{Symbol}, ls::LoopVectorization.LoopSet, var::Expr, elementbytes::Int64, position::Int64)
   @ LoopVectorization ~/.julia/packages/LoopVectorization/bIqmO/src/parse/add_compute.jl:68
 [3] add_compute!(ls::LoopVectorization.LoopSet, var::Symbol, ex::Expr, elementbytes::Int64, position::Int64, mpref::Nothing)
   @ LoopVectorization ~/.julia/packages/LoopVectorization/bIqmO/src/parse/add_compute.jl:353
 [4] add_compute!
   @ ~/.julia/packages/LoopVectorization/bIqmO/src/parse/add_compute.jl:316 [inlined]
 [5] add_operation!(ls::LoopVectorization.LoopSet, LHS::Symbol, RHS::Expr, elementbytes::Int64, position::Int64)
   @ LoopVectorization ~/.julia/packages/LoopVectorization/bIqmO/src/modeling/graphs.jl:1115
 [6] add_parent!(vparents::Vector{LoopVectorization.Operation}, deps::Vector{Symbol}, reduceddeps::Vector{Symbol}, ls::LoopVectorization.LoopSet, var::Expr, elementbytes::Int64, position::Int64)
   @ LoopVectorization ~/.julia/packages/LoopVectorization/bIqmO/src/parse/add_compute.jl:68
 [7] add_compute!(ls::LoopVectorization.LoopSet, var::Symbol, ex::Expr, elementbytes::Int64, position::Int64, mpref::LoopVectorization.ArrayReferenceMetaPosition)
(For reference in case anyone reads this, I answered on slack saying LV currently doesn't parse the ' adjoint syntax.)
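For real element types the adjoint of a scalar is a no-op, so a version that simply drops the ' compiles under @turbo, which is also what the adjoint-free loop in the merged diff above does. A minimal sketch of that workaround (fav1_noadj! is a hypothetical name):
using LoopVectorization

function fav1_noadj!(A, x, u, c1, c2)
    # same kernel as fav1! but without the scalar adjoints, so @turbo can parse it
    @turbo for i in axes(A,1), j in axes(A,2)
        A[i,j] += c1 * x[i] * x[j]
        A[i,j] -= c2 * u[i] * x[j]
        A[i,j] -= c2 * x[i] * u[j]
    end
    return
end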
the nightlies are unrelated, I think
BLAS-ified update_h! call in BFGS() so it can work on GPU