Use Automatic Differentiation #2
I've set up this example for the Lorenz system:

using ForwardDiff, BenchmarkTools
function eom!(du, u)
du[1] = 10.0(u[2]-u[1])
du[2] = u[1]*(28.0-u[3]) - u[2]
du[3] = u[1]*u[2] - (8/3)*u[3]
end
function jacob!(J, u)
i = one(eltype(J))
J[1,1] = -10i; J[1,2] = 10i; J[1,3] = 0i
J[2,1] = 28i - u[3]; J[2,2] = -i; J[2,3] = -u[1]
J[3,1] = u[2]; J[3,2] = u[1]; J[3,3] = -8i/3
return J
end

# should give exponents [0.9056, 0, -14.5723]
jacob_fd! = (J,u) -> ForwardDiff.jacobian!(J, eom!, zeros(u), u)
println("Benchmarking hand-coded vs. ForwardDiff")
function bench(N)
println("for N = $N")
J = rand(3,3); u0 = rand(3)
a = @benchmark for i in 1:$N
jacob!($J, $u0)
end
b = @benchmark for i in 1:$N
jacob_fd!($J, $u0)
end
println("Coded: $(median(a))")
println("ForwardDiff: $(median(b))")
end
bench(100)
bench(10000)
bench(1000000)

Which prints:
Now, I am aware that the line:

If you're trying to get optimal performance, you probably want to preallocate the

@jrevels I appreciate you joining the conversation. Replacing the line

makes it go much faster (I wonder why??? I simply used the JacobianConfig that the

The timings are still orders of magnitude larger, though. For me this seems reasonable (I could not expect automatic derivatives to be as fast as plain old hand-coded ones). But just to be sure, am I doing something wrong? P.S. I have absolutely no clue what the
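The preallocation being discussed can be sketched like this (a hedged sketch against the ForwardDiff v0.5-era API; the `du_buf` and `cfg` names are mine, not from the thread). Reusing one output buffer and one `JacobianConfig` avoids rebuilding the dual-number work arrays on every call:

```julia
using ForwardDiff

# Lorenz equations of motion, in-place form (from the thread above).
function eom!(du, u)
    du[1] = 10.0(u[2] - u[1])
    du[2] = u[1]*(28.0 - u[3]) - u[2]
    du[3] = u[1]*u[2] - (8/3)*u[3]
end

# Preallocate the output buffer and the config once, up front.
const du_buf = zeros(3)
const cfg = ForwardDiff.JacobianConfig(eom!, du_buf, zeros(3))

# Every call now reuses du_buf and cfg instead of allocating fresh ones.
jacob_fd! = (J, u) -> ForwardDiff.jacobian!(J, eom!, du_buf, u, cfg)

J = zeros(3, 3)
jacob_fd!(J, [1.0, 1.0, 1.0])
```

With `u = [1, 1, 1]` the analytic Jacobian has, e.g., `J[1,1] == -10` and `J[2,1] == 28 - u[3] == 27`, which gives a quick correctness check against the hand-coded `jacob!`.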
OK, I stand corrected. Maybe try ReverseDiff.jl?
Ooouf, it took me all this time to make it work! Using:

# reverse diff:
using ReverseDiff
const f_tape = ReverseDiff.JacobianTape(eom!, rand(3), rand(3))
const compiled_f_tape = ReverseDiff.compile(f_tape)
jacob_rd! = (J, u) -> ReverseDiff.jacobian!(J, compiled_f_tape, u)
println("Benchmarking hand-coded vs. ReverseDiff")
function bench(N)
println("for N = $N")
J = rand(3,3); u0 = rand(3)
a = @benchmark for i in 1:$N
jacob!($J, $u0)
end
b = @benchmark for i in 1:$N
jacob_rd!($J, $u0)
end
println("Coded: $(median(a))")
println("ReverseDiff: $(median(b))")
end
bench(10000)

Outputs:
Which is not an improvement! I will catch up to the discussion in about 12 hours from now!
It's expected that ReverseDiff will be quite a bit worse for Jacobians of this dimensionality - ForwardDiff is algorithmically the better choice here. I rewrote your benchmark script to
Here's the new script:

using ForwardDiff, BenchmarkTools, StaticArrays
# Using StaticArrays #
#--------------------#
@inline eom(u) = @SVector [10.0(u[2]-u[1]), u[1]*(28.0-u[3]) - u[2], u[1]*u[2] - (8/3)*u[3]]
@inline function manual_jacobian(u)
i = one(eltype(u))
return @SMatrix [ -10i 10i 0i;
(28i - u[3]) -i -u[1];
u[2] u[1] -8i/3]
end
@inline forward_jacobian(u) = ForwardDiff.jacobian(eom, u)
u = SVector{3}(rand(3))
# sanity check
@assert manual_jacobian(u) == forward_jacobian(u)
println("---- eom(::SVector) benchmark results ----")
display(@benchmark eom($u))
println("---- manual_jacobian(::SVector) benchmark results ----")
display(@benchmark manual_jacobian($u))
println("---- forward_jacobian(::SVector) benchmark results ----")
display(@benchmark forward_jacobian($u))
# Using Base Arrays #
#-------------------#
@inline function eom!(du, u)
du[1] = 10.0(u[2]-u[1])
du[2] = u[1]*(28.0-u[3]) - u[2]
du[3] = u[1]*u[2] - (8/3)*u[3]
end
@inline function manual_jacobian!(J, u)
i = one(eltype(J))
J[1,1] = -10i
J[1,2] = 10i
J[1,3] = 0i
J[2,1] = 28i - u[3]
J[2,2] = -i
J[2,3] = -u[1]
J[3,1] = u[2]
J[3,2] = u[1]
J[3,3] = -8i/3
return J
end
@inline forward_jacobian!(J, du, u, cfg) = ForwardDiff.jacobian!(J, eom!, du, u, cfg)
u = rand(3)
du = zeros(3)
J = zeros(3, 3)
cfg = ForwardDiff.JacobianConfig(eom!, du, u)
# sanity check
@assert manual_jacobian!(J, u) == forward_jacobian!(J, du, u, cfg)
println("---- eom!(::Vector, ::Vector) benchmark results ----")
display(@benchmark eom!($du, $u))
println("---- manual_jacobian!(::Matrix, ::Vector) benchmark results ----")
display(@benchmark manual_jacobian!($J, $u))
println("---- forward_jacobian!(::Matrix, ::Vector, ::Vector, ::JacobianConfig) benchmark results ----")
display(@benchmark forward_jacobian!($J, $du, $u, $cfg))

...and here are the results on my machine (using ForwardDiff v0.5 and Julia v0.6):

---- eom(::SVector) benchmark results ----
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 2.943 ns (0.00% GC)
median time: 2.952 ns (0.00% GC)
mean time: 2.955 ns (0.00% GC)
maximum time: 11.160 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
---- manual_jacobian(::SVector) benchmark results ----
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 3.904 ns (0.00% GC)
median time: 3.917 ns (0.00% GC)
mean time: 3.926 ns (0.00% GC)
maximum time: 12.365 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
---- forward_jacobian(::SVector) benchmark results ----
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 4.551 ns (0.00% GC)
median time: 4.561 ns (0.00% GC)
mean time: 4.572 ns (0.00% GC)
maximum time: 13.199 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
---- eom!(::Vector, ::Vector) benchmark results ----
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 4.234 ns (0.00% GC)
median time: 4.242 ns (0.00% GC)
mean time: 4.262 ns (0.00% GC)
maximum time: 15.073 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
---- manual_jacobian!(::Matrix, ::Vector) benchmark results ----
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 5.849 ns (0.00% GC)
median time: 5.860 ns (0.00% GC)
mean time: 5.877 ns (0.00% GC)
maximum time: 8.275 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
---- forward_jacobian!(::Matrix, ::Vector, ::Vector, ::JacobianConfig) benchmark results ----
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 90.940 ns (0.00% GC)
median time: 91.572 ns (0.00% GC)
mean time: 91.629 ns (0.00% GC)
maximum time: 254.181 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 956

So we see there's only a small overhead for StaticArrays, and a bit larger overhead for Base Arrays.
@jrevels Thank you so much for all the help and input! This is beyond amazing.
But how can it be that
??? An
Those metrics mean "Julia's GC is reporting 0 heap allocations were performed and 0 bytes of heap memory were used." It's worth doing some Googling to learn about the performance characteristics of stack allocation/access vs. heap allocation/access (in general, not just in Julia).
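The stack-vs-heap distinction can be made concrete with a small sketch (illustrative function names are mine): an `SVector` is an immutable, fixed-size value the compiler can keep on the stack or in registers, so constructing one triggers no GC-tracked allocation, while a `Base` `Vector` is always heap-allocated.

```julia
using StaticArrays

# Two ways to build the same three numbers:
make_static(x) = @SVector [x, 2x, 3x]   # immutable, stack-friendly
make_heap(x)   = [x, 2x, 3x]            # mutable, heap-allocated

# Warm up so compilation itself is not measured.
make_static(1.0); make_heap(1.0)

static_bytes = @allocated make_static(1.0)
heap_bytes   = @allocated make_heap(1.0)
# heap_bytes is nonzero; static_bytes is (at most) far smaller.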
The benchmarks @jrevels gives are about the same as what I've seen. In general, hand-coded Jacobians eke out a bit more performance, but ForwardDiff does really well on these kinds of systems if the Jacobian is quite dense. The difference grows if you get to something like 15 ODEs with sparse interactions, but even then it's not that big. Allowing the user to give a Jacobian is nice, but it's only actually necessary when ForwardDiff doesn't apply to the equation, which shouldn't happen in this case (all of the dynamical systems here are "just math", not using any interop). Symbolic then has about the same speedup in this case as analytical, since you're solving for a symbolic Jacobian once and then re-using it every iteration. So it's really not that big of a deal if you're only getting the Jacobian. However, the reason to use ParameterizedFunctions.jl is that it gives you a lot more. The stiff diffeq "Master Equation" is solving the linear system

The downside to symbolic is that you need the expression that the user defines the equations with, or some other DSL. That's why ParameterizedFunctions.jl has to be used through the macro
Thanks @jrevels, very nice results!
Yes, they were quite good! However, I have spent the last 2-3 hours re-implementing these benchmarks in a more "realistic setting", and a lot of things change. Because my benchmark codes are around 300 lines, I will link them instead. I have also changed the system, since previously half the elements of the Jacobian were simple numbers without any operation happening. Basically, what I did was benchmark 5 different ways one could set up the equations of motion of a system:
I used these e.o.m. to benchmark 2 different things: first, evolving the system AND updating the output. This is different from simply calling
This was a crucial thing missing in @jrevels' benchmarks. Of course this makes the use of

Comparison of evolve call:
v2 versus v1
BenchmarkTools.TrialJudgement:
time: -38.01% => improvement (5.00% tolerance)
memory: -57.75% => improvement (1.00% tolerance)
v3 versus v1
BenchmarkTools.TrialJudgement:
time: -88.58% => improvement (5.00% tolerance)
memory: -73.94% => improvement (1.00% tolerance)
v4 versus v1
BenchmarkTools.TrialJudgement:
time: -87.33% => improvement (5.00% tolerance)
memory: -71.83% => improvement (1.00% tolerance)
v5 versus v1
BenchmarkTools.TrialJudgement:
time: -84.89% => improvement (5.00% tolerance)
memory: -71.83% => improvement (1.00% tolerance)

As far as the Jacobian calculation is concerned, the results are here. I used the

In summary, the method with the StaticVectors is the fastest. However, it has to be said that the method that has the e.o.m. in place is very close, within 20-40% of its speed. It is also worth noting that (for the Jacobian case) the in-place e.o.m. that uses simple Vectors is actually faster than the one that uses MVectors. Results:

Comparison of in-place jacobian call:
v2 versus v1
BenchmarkTools.TrialJudgement:
time: +6951.54% => regression (5.00% tolerance)
memory: +950.00% => regression (1.00% tolerance)
v3 versus v1
BenchmarkTools.TrialJudgement:
time: +38.53% => regression (5.00% tolerance)
memory: +0.00% => invariant (1.00% tolerance)
v4 versus v1
BenchmarkTools.TrialJudgement:
time: +35.02% => regression (5.00% tolerance)
memory: -100.00% => improvement (1.00% tolerance)
v5 versus v1
BenchmarkTools.TrialJudgement:
time: +36.49% => regression (5.00% tolerance)
memory: -50.00% => improvement (1.00% tolerance)

I think @dpsanders and @jrevels might want to check out the actual benchmarking code, since, as you already know very well, I am not a benchmarking expert, nor a StaticArrays expert, nor a Julia expert! :D
I didn't read too closely, but there seem to be a couple of issues with your code. I'm assuming your examples are representative of the actual dimensionality you care about. For dimensions this small, there's no reason not to use StaticArrays almost entirely, and you should avoid manipulating heap-allocated mutable objects. Instead, you should be operating on immutable values - constructing new objects on the stack (even incrementally) can often be faster than mutating heap-allocated objects when the dimensions involved are this low. Besides being more performant, it'll also make the code cleaner. If for some reason you had to have a mutable system object (as opposed to an immutable one), and you want a field's loads/stores to be performant, you should fully expose that field's type. For example, instead of this:

# Operations involving `u` will be slow because Julia can't specialize
# on the element type of `u` when compiling functions for `System`
# on the element type of `u` when compiling functions for `System`
mutable struct System
u::SVector{3,<:Real}
end

...you should write this:

mutable struct System{T<:Real}
u::SVector{3,T}
end
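The difference between the two declarations can be seen directly with type inference (a sketch with hypothetical names; `LooseSystem`/`TightSystem` stand in for the two `System` variants above). With the abstractly-typed field, Julia cannot know the element type at compile time, so functions touching the field infer an abstract result; with the parametric version, everything is concrete and fully specialized:

```julia
using StaticArrays

# Field type is a UnionAll (abstract): element type unknown at compile time.
mutable struct LooseSystem
    u::SVector{3,<:Real}
end

# Field type becomes concrete once T is fixed (e.g. T = Float64).
mutable struct TightSystem{T<:Real}
    u::SVector{3,T}
end

sum3(s) = s.u[1] + s.u[2] + s.u[3]

loose = LooseSystem(SVector(1.0, 2.0, 3.0))
tight = TightSystem(SVector(1.0, 2.0, 3.0))

# Inference can only say "some Real-ish thing" for the loose version,
# but pins down Float64 for the tight one.
loose_rt = Base.return_types(sum3, (LooseSystem,))[1]
tight_rt = Base.return_types(sum3, (TightSystem{Float64},))[1]
```

`@code_warntype sum3(loose)` vs. `@code_warntype sum3(tight)` shows the same story interactively: the former is littered with abstract types, the latter is fully inferred.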
Something crazy happened for me: using your suggestion and making the system type parametric changes the benchmarks significantly! Besides everything being much faster, the comparisons between the different benchmarks also change!
@jrevels The results state that using mutable StaticArrays and operating on them in place is the best approach. Combining that with the fact that the Jacobian calculation for mutable statics and StaticArrays is not very far off in terms of performance, I think using mutable static arrays may be the best approach. Would you agree, or would you still suggest using exclusively static arrays, even with these new benchmarks? There is no reason for the system's field to specifically be a mutable type, as long as it is efficient to update it (since this will happen all the time). For my package the systems will have small dimensionality; I would guess all usage will be limited in
Well, the simple reason not to use StaticArrays is that evolve is simply much faster for mutable StaticArrays. P.S. EDIT: Initially I could not understand why there should be a difference between the two ways to write the
Yup, giving the compiler "full" type information for performance hotspots can make a big difference in Julia!
Your benchmarks aren't testing the design I'm suggesting, which is to make the system type itself immutable. For example:

using StaticArrays, BenchmarkTools
# I'm adding the `i` field just for the sake of example, to
# motivate even defining a system type at all. Assuming
# your benchmarks are representative of your actual use
# case, defining a new type here would only be useful if
# you have multiple fields to keep track of. Otherwise, it
# would be cleaner to just write all of your operations on
# generic arrays.
struct SystemState{T<:Real}
u::SVector{3,T}
i::Int
end
@inline SystemState(u::SVector) = SystemState(u, 1)
@inline evolve(s::SystemState) = SystemState(eom(s.u), s.i + 1)
@inline function eom(x)
x1, x2, x3 = x[1], x[2], x[3]
return SVector(3.8*x1*(1-x1)-0.05*(x2+0.35)*(1-2*x3),
0.1*((x2+0.35)*(1-2*x3)-1 )*(1-1.9*x1),
3.78*x3*(1-x3)+0.2*x2)
end
@benchmark evolve(s) setup=(s = SystemState(SVector{3}(rand(3))))

On my machine, that last line produces:

BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 6.165 ns (0.00% GC)
median time: 6.174 ns (0.00% GC)
mean time: 6.182 ns (0.00% GC)
maximum time: 14.547 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000

Compare that to the output of

BenchmarkTools.Trial:
memory estimate: 32 bytes
allocs estimate: 1
--------------
minimum time: 28.358 ns (0.00% GC)
median time: 28.886 ns (0.00% GC)
mean time: 31.701 ns (7.57% GC)
maximum time: 2.038 μs (92.91% GC)
--------------
samples: 10000
evals/sample: 995
Hey @jrevels, thanks so much for all the input. (Seriously, I appreciate it a lot. I would've never been able to explore all these possibilities just by myself.) Now I understand your approach! However, I need to define my system in a way that contains all the information of the system, in the scientific sense. That is, the current state and the equations of motion. Therefore, my type has to be something like:
or
It is really important to have the equations of motion inside the system type, and I am really sorry I was not clear about how I wanted to define the system. Having the eom in the type is good not only because this is what makes sense (in the sense of what a dynamical system is), but also because it makes the user interface beautiful. I consider the last part really important. For example, the user could then do:
and it would work without wonky calls that need to pass 2, 3, or even 4 unnecessary arguments to the function. I have written this benchmark file that creates the systems as I would need them in the real setting. (I am not pasting it since it is ~150 lines.) Version 3 is your suggestion.
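A system type carrying both state and equations of motion doesn't have to give up the immutable design — a minimal sketch (all names hypothetical, not the package's actual API): parameterizing the struct on the eom's function type `F` lets Julia specialize `evolve` for each concrete system, so storing the function costs nothing at call time.

```julia
using StaticArrays

# State + equations of motion in one immutable value. The F parameter
# means each distinct eom gets its own specialized method of evolve.
struct DiscreteSystem{T<:Real,F}
    u::SVector{3,T}
    eom::F
end

# Evolving returns a fresh, fully-typed system (no mutation needed).
evolve(s::DiscreteSystem) = DiscreteSystem(s.eom(s.u), s.eom)

# The Lorenz right-hand side from earlier, used here as a discrete map
# purely for illustration.
lorenz_map(u) = SVector(10.0*(u[2] - u[1]),
                        u[1]*(28.0 - u[3]) - u[2],
                        u[1]*u[2] - (8/3)*u[3])

s  = DiscreteSystem(SVector(1.0, 1.0, 1.0), lorenz_map)
s2 = evolve(s)
```

With `u = (1, 1, 1)` the first step gives `s2.u == (0, 26, 1 - 8/3)`, so correctness is easy to spot-check, and the user-facing call is just `evolve(s)` — no extra arguments.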
(Notice: all of the systems have a field

-----Comparison of evolve call:-----
v2 versus v1
BenchmarkTools.TrialJudgement:
time: +279.30% => regression (5.00% tolerance)
memory: +Inf% => regression (1.00% tolerance)
v3 versus v1
BenchmarkTools.TrialJudgement:
time: -55.17% => improvement (5.00% tolerance)
memory: +0.00% => invariant (1.00% tolerance)
v4 versus v1
BenchmarkTools.TrialJudgement:
time: -20.69% => improvement (5.00% tolerance)
memory: +0.00% => invariant (1.00% tolerance)
-----Comparison of in-place jacobian call:-----
v2 versus v1
BenchmarkTools.TrialJudgement:
time: +94.45% => regression (5.00% tolerance)
memory: +0.00% => invariant (1.00% tolerance)
v3 versus v1
BenchmarkTools.TrialJudgement:
time: -5.55% => improvement (5.00% tolerance)
memory: +0.00% => invariant (1.00% tolerance)
v4 versus v1
BenchmarkTools.TrialJudgement:
time: +134.33% => regression (5.00% tolerance)
memory: +0.00% => invariant (1.00% tolerance)
Conclusions:
Minimum time of evolve call for method v3
Minimum time of jacob call for method v3

Sooo I guess you are right? That makes me extremely happy!!!!!!! Because I can finally forget all that stuff, since the decision is clear, and start coding the actual implementation!
Oh damn. This is killing my brain... I now think that if I use that approach, there is no reason to have the system state as part of the system... Since it won't be updated in place, what is the reason for that?
Thanks everybody for the really helpful discussions! Now everything in the package uses automatic derivatives from ForwardDiff if a user-provided one is not supplied, even for the continuous systems! I want to especially thank @jrevels for helping me so much with the benchmarks and opening my eyes to how fast StaticArrays truly are! To my astonishment, even the integration of the
For the moment, I would suggest to concentrate on using Automatic Differentiation, rather than hand-coded Jacobians or symbolic.
Do you have examples where you have checked that hand-coded Jacobians are actually much faster than ForwardDiff.jl?
cc @jrevels