
Reduce the first time to solve from 5 seconds to 1 second for Tsit5 #1465

Merged
ChrisRackauckas merged 9 commits into master from ftts_by_hand
Aug 7, 2021

Conversation

@ChrisRackauckas (Member)


Y'all think you write good compilers? Well, I'm the compiler now!

```julia
using OrdinaryDiffEq, SnoopCompile

function lorenz(du,u,p,t)
 du[1] = 10.0(u[2]-u[1])
 du[2] = u[1]*(28.0-u[3]) - u[2]
 du[3] = u[1]*u[2] - (8/3)*u[3]
end

u0 = [1.0;0.0;0.0]
tspan = (0.0,100.0)
prob = ODEProblem(lorenz,u0,tspan)
alg = Tsit5()
tinf = @snoopi_deep solve(prob,alg)
itrigs = inference_triggers(tinf)
itrig = itrigs[13]
ascend(itrig)
@time solve(prob,alg)

using ProfileView
ProfileView.view(flamegraph(tinf))

```

v5.60.2
InferenceTimingNode: 1.249748/4.881587 on Core.Compiler.Timings.ROOT() with 2 direct children

Before
InferenceTimingNode: 1.136504/3.852949 on Core.Compiler.Timings.ROOT() with 2 direct children

Without `@turbo`
InferenceTimingNode: 0.956948/3.460591 on Core.Compiler.Timings.ROOT() with 2 direct children

With `@inbounds @simd`
InferenceTimingNode: 0.941427/3.439566 on Core.Compiler.Timings.ROOT() with 2 direct children

With `@turbo`
InferenceTimingNode: 1.174613/11.118534 on Core.Compiler.Timings.ROOT() with 2 direct children

With `@inbounds @simd` everywhere
InferenceTimingNode: 0.760500/1.151602 on Core.Compiler.Timings.ROOT() with 2 direct children
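The core of the PR is replacing fused broadcasts in the steppers with hand-written `@inbounds @simd` loops, since each distinct fused broadcast expression is a fresh specialization the compiler must infer and codegen. A minimal sketch of the transformation (function names are illustrative, not the actual OrdinaryDiffEq code):

```julia
# Broadcast form: concise, but every distinct fused-broadcast expression
# creates a new anonymous kernel for the compiler to infer and codegen.
function step_broadcast!(u, uprev, k, dt)
    @. u = uprev + dt * k
    return u
end

# Loop form: a plain indexed loop is much cheaper for the compiler;
# `@inbounds @simd` recovers the runtime performance of the broadcast.
function step_loop!(u, uprev, k, dt)
    @inbounds @simd for i in eachindex(u)
        u[i] = uprev[i] + dt * k[i]
    end
    return u
end
```

Both compute the identical update; the loop form simply costs the compiler less the first time it is hit, which is where the first-solve-time savings come from.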
@@ -15,6 +15,7 @@ FiniteDiff = "6a86dc24-6348-571c-b903-95158fe2bd41"
ForwardDiff = "f6369f11-7733-5829-9624-2563aa707210"
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
Logging = "56ddb016-857b-54e1-b83d-db4d58db5568"
LoopVectorization = "bdcacae8-1622-11e9-2a5c-532679323890"
Member (review comment on the diff above):

Compat
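The "Compat" request presumably asks for a version bound on the newly added dependency. A hypothetical `[compat]` entry in Project.toml (the bound shown is illustrative, not taken from this PR):

```toml
[compat]
LoopVectorization = "0.12"
```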

@ChrisRackauckas (Member Author)

Laptop

Before:
InferenceTimingNode: 1.585750/5.363441 on Core.Compiler.Timings.ROOT() with 2 direct children

After:
InferenceTimingNode: 0.885957/1.254411 on Core.Compiler.Timings.ROOT() with 2 direct children

@ChrisRackauckas (Member Author)

LOL

Vern7
Before:
InferenceTimingNode: 5.962703/13.461966 on Core.Compiler.Timings.ROOT() with 1 direct children

After:
InferenceTimingNode: 2.979609/3.563301 on Core.Compiler.Timings.ROOT() with 2 direct children

@ChrisRackauckas (Member Author)

Vern9
Before:
InferenceTimingNode: 17.960255/23.513515 on Core.Compiler.Timings.ROOT() with 2 direct children

After:
InferenceTimingNode: 6.969864/7.531495 on Core.Compiler.Timings.ROOT() with 2 direct children

ChrisRackauckas added a commit to SciML/DiffEqBase.jl that referenced this pull request Aug 7, 2021
* reduce compile times by specializing broadcasts to loops

Companion PR to SciML/OrdinaryDiffEq.jl#1465

* Update src/calculate_residuals.jl

Co-authored-by: Yingbo Ma <mayingbo5@gmail.com>

* Update src/calculate_residuals.jl

Co-authored-by: Yingbo Ma <mayingbo5@gmail.com>

* Update src/calculate_residuals.jl

Co-authored-by: Yingbo Ma <mayingbo5@gmail.com>

* simd ivdep

* remove reduction compile

Co-authored-by: Yingbo Ma <mayingbo5@gmail.com>
@ChrisRackauckas ChrisRackauckas merged commit 977c5cb into master Aug 7, 2021
@ChrisRackauckas ChrisRackauckas deleted the ftts_by_hand branch August 7, 2021 03:11
@timholy (Contributor) commented Aug 7, 2021

Yep, broadcasting is quite expensive for the compiler. Sorry you had to write all this out by hand, but nice outcome! You're getting to be a master of the tools!

@ChrisRackauckas (Member Author)

Writing it out by hand is fine. The fact that I cannot do the same for RecursiveFactorization.jl's compile times bugs me, though...

@timholy (Contributor) commented Aug 7, 2021

In the next month I'm planning to try to go through some of https://github.com/JuliaLang/julia/issues?q=is%3Aopen+is%3Aissue+label%3Aprecompile. That might make the inference part go away. Any overhead due to codegen/LLVM won't be helped, though (yet).
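The precompile-directive approach timholy refers to can be sketched on a toy function; SnoopCompile can emit statements of this shape from an `@snoopi_deep` trace (the function here is illustrative, not from this PR):

```julia
# Toy stand-in for an expensive-to-infer entry point.
mysolve(x::Vector{Float64}) = sum(abs2, x)

function _precompile_()
    # Only emit directives while a package image is being generated.
    ccall(:jl_generating_output, Cint, ()) == 1 || return nothing
    Base.precompile(Tuple{typeof(mysolve), Vector{Float64}})
    return nothing
end
_precompile_()
```

When the package is being precompiled, the `Base.precompile` call forces inference of that signature so the result lands in the cache file; at normal load time the early return makes it a no-op.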

@ChrisRackauckas (Member Author) commented Aug 8, 2021

From these studies and #1467, JuliaLinearAlgebra/RecursiveFactorization.jl#29, and JuliaSIMD/TriangularSolve.jl#8, the biggest thing for us would be to figure out why the RecursiveFactorization/TriangularSolve/LoopVectorization stack won't cache the precompiles. In some sense it should be easy: the lowest level is just functions on Vector{Float64}? But I wonder if the caches being an __init__() time static array is causing issues there. This 20 seconds remaining is almost entirely due to this, and it's the reason why https://discourse.julialang.org/t/catalyst-differentiaequations-startup-time/65946/8 can't be solved with a few cheap tricks like this. I was going to ask @chriselrod if there was another way to try and do the caching.

@chriselrod (Contributor) commented Aug 9, 2021

I suspect a big part of the problem is that LoopVectorization owns the `_turbo_!` calls, so there's nothing to root the inference results into the calling libraries?

An additional problem is that most of the time is not spent during inference.
From one subset of tests, I get:

ROOT                      :  0.03 %   197544027
GC                        :  3.25 %   22497827136
LOWERING                  :  4.44 %   30751764035
PARSING                   :  0.01 %   43577402
INFERENCE                 :  7.60 %   52646925454
CODEGEN                   : 57.02 %   394947402034
METHOD_LOOKUP_SLOW        :  0.00 %   24650402
METHOD_LOOKUP_FAST        :  3.43 %   23736669017
LLVM_OPT                  : 20.35 %   140965350896
LLVM_MODULE_FINISH        :  0.27 %   1865150203
METHOD_MATCH              :  0.52 %   3579983560
TYPE_CACHE_LOOKUP         :  1.76 %   12172038157
TYPE_CACHE_INSERT         :  0.00 %   33134016
STAGED_FUNCTION           :  0.11 %   765643563
MACRO_INVOCATION          :  0.00 %   30346646
AST_COMPRESS              :  0.47 %   3252351088
AST_UNCOMPRESS            :  0.48 %   3334916759
SYSIMG_LOAD               :  0.02 %   148451732
ADD_METHOD                :  0.19 %   1294783747
LOAD_MODULE               :  0.03 %   225725871
INIT_MODULE               :  0.02 %   164596592

However, this is from compiling a large number of methods.

The first @turbo compilation tends to take 8+ seconds, while subsequent compilations are <1s.
It seems like that first compilation -- which presumably is compiling a lot of LoopVectorization's internal methods -- really should be much faster than it is. It also probably has a compilation profile more typical of standard Julia code.

EDIT:
LoopVectorization itself doesn't show up in the TriangularSolve benchmark at all, presumably because the 1x1 lu! doesn't call it?
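The first-call cost chriselrod describes is ordinary first-call compilation latency, just amplified because LoopVectorization compiles much of its own machinery on the first `@turbo` expansion. A generic way to observe the effect (shown with a plain `@inbounds @simd` loop so the sketch doesn't require LoopVectorization):

```julia
function dot_loop(x, y)
    s = 0.0
    @inbounds @simd for i in eachindex(x, y)
        s += x[i] * y[i]
    end
    return s
end

x = rand(1000); y = rand(1000)
t_first  = @elapsed dot_loop(x, y)   # includes inference + codegen for this signature
t_second = @elapsed dot_loop(x, y)   # compiled code only
# t_first should dwarf t_second; with @turbo the reported gap is 8+ s vs <1 s.
```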

@timholy (Contributor) commented Aug 9, 2021

AFAICT __init__ is never cached: JuliaGraphics/Gtk.jl#466 (comment). Figuring out whether that's an oversight or deliberate is high on the agenda.

@ChrisRackauckas, you duplicated #1467 up there, feel free to edit and then I will take a look.

@ChrisRackauckas (Member Author)

Edited. Ahh yes, that's the one piece in the chain I didn't set up to precompile! I'll go add something to DiffEqBase and see if that handles it.
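timholy's point that `__init__` is never cached suggests the fix pattern: trigger the representative workload at module top level, during precompilation, rather than from `__init__`. A sketch under that assumption (module and function names are illustrative, not the actual DiffEqBase code):

```julia
module PrecompileSketch

# Toy stand-in for a DiffEqBase-style residual computation.
calculate_residuals(u, uprev, atol, rtol) =
    @. abs(u - uprev) / (atol + rtol * max(abs(u), abs(uprev)))

# Top-level code runs while the package is being precompiled, so the
# specializations it triggers are cached in the package image.
let u = [1.0, 2.0], uprev = [1.0, 1.5]
    calculate_residuals(u, uprev, 1e-6, 1e-3)
end

function __init__()
    # Anything compiled from here runs at load time and is NOT cached.
end

end # module
```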
