Reduce the first time to solve from 5 seconds to 1 second for Tsit5 #1465
Conversation
Ya'll think you write good compilers? Well, I'm the compiler now!

```julia
using OrdinaryDiffEq, SnoopCompile

function lorenz(du,u,p,t)
    du[1] = 10.0(u[2]-u[1])
    du[2] = u[1]*(28.0-u[3]) - u[2]
    du[3] = u[1]*u[2] - (8/3)*u[3]
end

u0 = [1.0;0.0;0.0]
tspan = (0.0,100.0)
prob = ODEProblem(lorenz,u0,tspan)
alg = Tsit5()

tinf = @snoopi_deep solve(prob,alg)
itrigs = inference_triggers(tinf)
itrig = itrigs[13]
ascend(itrig)

@time solve(prob,alg)

using ProfileView
ProfileView.view(flamegraph(tinf))
```

v5.60.2:
InferenceTimingNode: 1.249748/4.881587 on Core.Compiler.Timings.ROOT() with 2 direct children

Before:
InferenceTimingNode: 1.136504/3.852949 on Core.Compiler.Timings.ROOT() with 2 direct children

Without `@turbo`:
InferenceTimingNode: 0.956948/3.460591 on Core.Compiler.Timings.ROOT() with 2 direct children

With `@inbounds @simd`:
InferenceTimingNode: 0.941427/3.439566 on Core.Compiler.Timings.ROOT() with 2 direct children

With `@turbo`:
InferenceTimingNode: 1.174613/11.118534 on Core.Compiler.Timings.ROOT() with 2 direct children

With `@inbounds @simd` everywhere:
InferenceTimingNode: 0.760500/1.151602 on Core.Compiler.Timings.ROOT() with 2 direct children
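For context, the variants above compare fused broadcasts in the stage updates against hand-written loops. A minimal sketch of that kind of rewrite, with a hypothetical `stage_update_*!` function and illustrative coefficient names (the actual diff in this PR touches the real `perform_step!` methods):

```julia
# Hypothetical example of the rewrite being benchmarked above; the function
# names and arguments are illustrative, not the actual OrdinaryDiffEq code.

# Broadcast form: each fused broadcast builds a Broadcasted object that the
# compiler must specialize and infer at every call site.
stage_update_broadcast!(u, uprev, dt, a1, a2, k1, k2) =
    (u .= uprev .+ dt .* (a1 .* k1 .+ a2 .* k2); u)

# Loop form (the "With `@inbounds @simd`" variant): a plain loop that is much
# cheaper to compile and typically still vectorizes.
function stage_update_loop!(u, uprev, dt, a1, a2, k1, k2)
    @inbounds @simd ivdep for i in eachindex(u)
        u[i] = uprev[i] + dt * (a1 * k1[i] + a2 * k2[i])
    end
    return u
end
```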
```diff
@@ -15,6 +15,7 @@ FiniteDiff = "6a86dc24-6348-571c-b903-95158fe2bd41"
 ForwardDiff = "f6369f11-7733-5829-9624-2563aa707210"
 LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
 Logging = "56ddb016-857b-54e1-b83d-db4d58db5568"
+LoopVectorization = "bdcacae8-1622-11e9-2a5c-532679323890"
```
Compat
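The one-word review above is asking for a version bound on the new dependency. A plausible `[compat]` entry for Project.toml (the specific bound here is an assumption, not taken from the PR):

```toml
[compat]
LoopVectorization = "0.12"
```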
Laptop
Before:
InferenceTimingNode: 1.585750/5.363441 on Core.Compiler.Timings.ROOT() with 2 direct children
After:
InferenceTimingNode: 0.885957/1.254411 on Core.Compiler.Timings.ROOT() with 2 direct children
LOL Vern7
Before:
InferenceTimingNode: 5.962703/13.461966 on Core.Compiler.Timings.ROOT() with 1 direct children
After:
InferenceTimingNode: 2.979609/3.563301 on Core.Compiler.Timings.ROOT() with 2 direct children
Vern9
Before:
InferenceTimingNode: 17.960255/23.513515 on Core.Compiler.Timings.ROOT() with 2 direct children
After:
InferenceTimingNode: 6.969864/7.531495 on Core.Compiler.Timings.ROOT() with 2 direct children
* reduce compile times by specializing broadcasts to loops. Companion PR to SciML/OrdinaryDiffEq.jl#1465
* Update src/calculate_residuals.jl (Co-authored-by: Yingbo Ma <mayingbo5@gmail.com>)
* Update src/calculate_residuals.jl (Co-authored-by: Yingbo Ma <mayingbo5@gmail.com>)
* Update src/calculate_residuals.jl (Co-authored-by: Yingbo Ma <mayingbo5@gmail.com>)
* simd ivdep
* remove reduction compile

Co-authored-by: Yingbo Ma <mayingbo5@gmail.com>
Yep, broadcasting is quite expensive for the compiler. Sorry you had to write all this out by hand, but nice outcome! You're getting to be a master of the tools!
Writing it out by hand is fine. The fact that I cannot do the same for RecursiveFactorization.jl's compile times bugs me though...
In the next month I'm planning to try to go through some of https://github.com/JuliaLang/julia/issues?q=is%3Aopen+is%3Aissue+label%3Aprecompile. That might make the inference part go away. Any overhead due to codegen/LLVM won't be helped though (yet).
From these studies and #1467, JuliaLinearAlgebra/RecursiveFactorization.jl#29, and JuliaSIMD/TriangularSolve.jl#8, the biggest thing for us would be to figure out why the RecursiveFactorization/TriangularSolve/LoopVectorization stack won't cache the precompiles. In some sense it should be easy: the lowest level is just functions on
I suspect a big part of the problem is that LoopVectorization owns the

An additional problem is that most of the time is not spent during inference. However, this is from compiling a large number of methods. The first

EDIT:
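For reference, the two numbers SnoopCompile prints for the ROOT node already separate these contributions: for ROOT, the exclusive time is everything except inference (codegen, LLVM, runtime), and the inclusive time is the whole snooped call. A small back-of-the-envelope check on the numbers from the first comment:

```julia
# Interpreting the ROOT-node pairs from the first comment
# ("exclusive/inclusive on Core.Compiler.Timings.ROOT()"):
# for ROOT, the exclusive time is everything *except* inference.
before_exclusive, before_inclusive = 1.249748, 4.881587   # v5.60.2
after_exclusive,  after_inclusive  = 0.760500, 1.151602   # `@inbounds @simd` everywhere

before_inference = before_inclusive - before_exclusive    # ≈ 3.6 s spent in inference
after_inference  = after_inclusive  - after_exclusive     # ≈ 0.4 s spent in inference

# Most of the remaining ~1.15 s after this PR is therefore codegen/LLVM and
# runtime, which precompile caching of inference results alone cannot remove.
```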
AFAICT @ChrisRackauckas, you duplicated #1467 up there; feel free to edit and then I will take a look.
Edited. Ahh yes, that's the one piece in the chain I didn't set up to precompile! I'll go add something to DiffEqBase and see if that handles it.
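"Add something" here typically means precompile statements (or a representative workload) executed while the package's cache file is being built. A minimal sketch of that pattern, with a placeholder package and kernel since the actual DiffEqBase change is not shown in this thread:

```julia
# Minimal sketch of the precompile-statement pattern; `MyPkg` and
# `residual_kernel!` are placeholders, not the actual DiffEqBase code.
module MyPkg

function residual_kernel!(out, u, uprev, atol, rtol)
    @inbounds @simd ivdep for i in eachindex(out)
        out[i] = (u[i] - uprev[i]) / (atol + max(abs(u[i]), abs(uprev[i])) * rtol)
    end
    return out
end

function _precompile_()
    # Only emit precompile statements while a cache file is being generated.
    ccall(:jl_generating_output, Cint, ()) == 1 || return nothing
    precompile(residual_kernel!, (Vector{Float64}, Vector{Float64}, Vector{Float64},
                                  Float64, Float64))
    return nothing
end
_precompile_()

end # module
```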