# Reduce the first time to solve from 5 seconds to 1 second for Tsit5 #1465

Merged: merged 9 commits into from Aug 7, 2021
## Conversation


### ChrisRackauckas commented Aug 6, 2021

Y'all think you write good compilers? Well, I'm the compiler now!

```julia
using OrdinaryDiffEq, SnoopCompile

function lorenz(du, u, p, t)
    du[1] = 10.0 * (u[2] - u[1])
    du[2] = u[1] * (28.0 - u[3]) - u[2]
    du[3] = u[1] * u[2] - (8 / 3) * u[3]
end

u0 = [1.0; 0.0; 0.0]
tspan = (0.0, 100.0)
prob = ODEProblem(lorenz, u0, tspan)
alg = Tsit5()
tinf = @snoopi_deep solve(prob, alg)
itrigs = inference_triggers(tinf)
itrig = itrigs[13]
ascend(itrig)
@time solve(prob, alg)

using ProfileView
ProfileView.view(flamegraph(tinf))
```

Timing results (`InferenceTimingNode` inference/total seconds on `Core.Compiler.Timings.ROOT()`, each with 2 direct children):

| Variant | Inference (s) | Total (s) |
| --- | --- | --- |
| v5.60.2 | 1.249748 | 4.881587 |
| Before | 1.136504 | 3.852949 |
| Without `@turbo` | 0.956948 | 3.460591 |
| With `@inbounds @simd` | 0.941427 | 3.439566 |
| With `@turbo` | 1.174613 | 11.118534 |
| With `@inbounds @simd` everywhere | 0.760500 | 1.151602 |
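The transformation behind these numbers can be sketched as follows: replace a fused broadcast, whose whole expression type must be inferred and code-generated anew for every distinct stage update, with an explicit `@inbounds @simd` loop. This is a minimal illustration of the technique; the function names are hypothetical and not taken from OrdinaryDiffEq's source.

```julia
# Broadcast version: the fused broadcast builds a nested Broadcasted object
# whose type depends on the whole expression, so every distinct stage update
# triggers fresh inference and codegen.
function stage_update_broadcast!(tmp, uprev, dt, a1, a2, k1, k2)
    @. tmp = uprev + dt * (a1 * k1 + a2 * k2)
    return nothing
end

# Loop version: a plain @inbounds @simd loop over eachindex is far cheaper
# for the compiler while still vectorizing, which is where the
# compile-time win comes from.
function stage_update_loop!(tmp, uprev, dt, a1, a2, k1, k2)
    @inbounds @simd for i in eachindex(tmp)
        tmp[i] = uprev[i] + dt * (a1 * k1[i] + a2 * k2[i])
    end
    return nothing
end
```

Both versions compute identical results elementwise; only the amount of work handed to the compiler differs.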

``` Reduce the first time to solve from 5 seconds to 1 second for Tsit5 ```
``` 6fff9f9 ```
added a commit to SciML/DiffEqBase.jl that referenced this pull request Aug 6, 2021
``` reduce compile times by specializing broadcasts to loops ```
``` bcc5f52 ```
`Companion PR to SciML/OrdinaryDiffEq.jl#1465`
mentioned this pull request Aug 6, 2021
reviewed
```diff
@@ -15,6 +15,7 @@
 FiniteDiff = "6a86dc24-6348-571c-b903-95158fe2bd41"
 ForwardDiff = "f6369f11-7733-5829-9624-2563aa707210"
 LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
 Logging = "56ddb016-857b-54e1-b83d-db4d58db5568"
+LoopVectorization = "bdcacae8-1622-11e9-2a5c-532679323890"
```

### YingboMa Aug 6, 2021

Compat
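YingboMa's "Compat" note asks for a `[compat]` bound to accompany the newly added LoopVectorization dependency, i.e. an entry along these lines in Project.toml (the `"0.12"` bound here is illustrative, not taken from the PR):

```toml
[compat]
LoopVectorization = "0.12"
```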

``` simd ivdep ```
``` e643216 ```
``` fix initdt type ```
``` 3b7a757 ```

### ChrisRackauckas commented Aug 6, 2021

Laptop:

```
Before: InferenceTimingNode: 1.585750/5.363441 on Core.Compiler.Timings.ROOT() with 2 direct children
After:  InferenceTimingNode: 0.885957/1.254411 on Core.Compiler.Timings.ROOT() with 2 direct children
```

``` fix typo ```
``` e639caa ```
``` add vern7 specializations ```
``` 32cded5 ```

### ChrisRackauckas commented Aug 6, 2021

LOL

```
Vern7
Before: InferenceTimingNode: 5.962703/13.461966 on Core.Compiler.Timings.ROOT() with 1 direct children
After:  InferenceTimingNode: 2.979609/3.563301 on Core.Compiler.Timings.ROOT() with 2 direct children
```

``` Vern9 speedup ```
``` ec0bd9a ```

### ChrisRackauckas commented Aug 6, 2021

```
Vern9
Before: InferenceTimingNode: 17.960255/23.513515 on Core.Compiler.Timings.ROOT() with 2 direct children
After:  InferenceTimingNode: 6.969864/7.531495 on Core.Compiler.Timings.ROOT() with 2 direct children
```

``` copyto! instead of broadcast ```
``` 3f3c860 ```
``` get rid of last broadcast ```
``` 0a7ea3e ```
``` relax equality test due to broadcast change ```
``` 3d6c158 ```
added a commit to SciML/DiffEqBase.jl that referenced this pull request Aug 7, 2021
``` reduce compile times by specializing broadcasts to loops (#696) ```
``` 72e9cd1 ```
```
* reduce compile times by specializing broadcasts to loops

Companion PR to SciML/OrdinaryDiffEq.jl#1465

* Update src/calculate_residuals.jl

Co-authored-by: Yingbo Ma <mayingbo5@gmail.com>

* Update src/calculate_residuals.jl

Co-authored-by: Yingbo Ma <mayingbo5@gmail.com>

* Update src/calculate_residuals.jl

Co-authored-by: Yingbo Ma <mayingbo5@gmail.com>

* simd ivdep

* remove reduction compile

Co-authored-by: Yingbo Ma <mayingbo5@gmail.com>
```
deleted the ftts_by_hand branch August 7, 2021 03:11

 Yep, broadcasting is quite expensive for the compiler. Sorry you had to write all this out by hand, but nice outcome! You're getting to be a master of the tools!

### ChrisRackauckas commented Aug 7, 2021

Writing it out by hand is fine. The fact that I cannot do the same for RecursiveFactorization.jl's compile times bugs me, though...

### timholy commented Aug 7, 2021

In the next month I'm planning to try to go through some of https://github.com/JuliaLang/julia/issues?q=is%3Aopen+is%3Aissue+label%3Aprecompile. That might make the inference part go away. Any overhead due to codegen/LLVM won't be helped, though (yet).

From these studies and #1467, JuliaLinearAlgebra/RecursiveFactorization.jl#29, and JuliaSIMD/TriangularSolve.jl#8, the biggest thing for us would be to figure out why the RecursiveFactorization/TriangularSolve/LoopVectorization stack won't cache its precompiles. In some sense it should be easy: the lowest level is just functions on `Vector{Float64}`? But I wonder if the caches being `__init__()`-time static arrays is causing issues there. The remaining 20 seconds are almost entirely due to this, and it's the reason why https://discourse.julialang.org/t/catalyst-differentiaequations-startup-time/65946/8 can't be solved with a few cheap tricks like this. I was going to ask @chriselrod if there was another way to try to do the caching.

I suspect a big part of the problem is that LoopVectorization owns the `_turbo_!` calls / there's nothing to root the inference results into the calling libraries? An additional problem is that most of the time is not spent during inference. From one subset of tests, I get:

```
ROOT                 :  0.03 %   197544027
GC                   :  3.25 %   22497827136
LOWERING             :  4.44 %   30751764035
PARSING              :  0.01 %   43577402
INFERENCE            :  7.60 %   52646925454
CODEGEN              : 57.02 %   394947402034
METHOD_LOOKUP_SLOW   :  0.00 %   24650402
METHOD_LOOKUP_FAST   :  3.43 %   23736669017
LLVM_OPT             : 20.35 %   140965350896
LLVM_MODULE_FINISH   :  0.27 %   1865150203
METHOD_MATCH         :  0.52 %   3579983560
TYPE_CACHE_LOOKUP    :  1.76 %   12172038157
TYPE_CACHE_INSERT    :  0.00 %   33134016
STAGED_FUNCTION      :  0.11 %   765643563
MACRO_INVOCATION     :  0.00 %   30346646
AST_COMPRESS         :  0.47 %   3252351088
AST_UNCOMPRESS       :  0.48 %   3334916759
SYSIMG_LOAD          :  0.02 %   148451732
ADD_METHOD           :  0.19 %   1294783747
LOAD_MODULE          :  0.03 %   225725871
INIT_MODULE          :  0.02 %   164596592
```

However, this is from compiling a large number of methods. The first `@turbo` compilation tends to take 8+ seconds, while subsequent compilations are <1 s. It seems like that first compilation, which presumably is compiling a lot of LoopVectorization's internal methods, really should be much faster than it is. It also probably has a compilation profile more typical of standard Julia code.

EDIT: LoopVectorization itself doesn't show up in the `TriangularSolve` benchmark at all, presumably because the 1x1 `lu!` doesn't call it?

### timholy commented Aug 9, 2021

 AFAICT `__init__` is never cached: JuliaGraphics/Gtk.jl#466 (comment). Figuring out whether that's an oversight or deliberate is high on the agenda. @ChrisRackauckas, you duplicated #1467 up there, feel free to edit and then I will take a look.
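The distinction under discussion can be illustrated schematically. This is a hypothetical module sketching the two patterns, not code from Gtk.jl, LoopVectorization, or any other package named in this thread:

```julia
module CacheDemo

# Pattern 1: a cache filled inside __init__. __init__ runs at load time,
# after precompilation has already happened, so any work or compilation it
# triggers is redone on every session and never ends up in the cache file.
const RUNTIME_CACHE = Ref{Vector{Float64}}()

function __init__()
    RUNTIME_CACHE[] = zeros(4)
end

# Pattern 2: a const computed at module top level is serialized into the
# precompile file, so its construction cost is paid once, at precompile time.
const PRECOMPILED_CACHE = zeros(4)

end # module
```

Both caches hold the same data; the difference is only when (and how often) the work to build them happens.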

### ChrisRackauckas commented Aug 9, 2021

Edited. Ahh yes, that's the one piece in the chain I didn't set up to precompile! I'll go add something to DiffEqBase and see if that handles it.
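The generic pattern for "adding something to precompile" is an explicit `precompile` statement at module top level. A sketch under stated assumptions: the module and function here are hypothetical stand-ins, not the actual DiffEqBase change.

```julia
module PrecompileDemo

using LinearAlgebra

# The kind of low-level call whose inference results we want cached:
# plain functions on concrete Matrix{Float64} / Vector{Float64} arguments.
solve_system(A::Matrix{Float64}, b::Vector{Float64}) = lu(A) \ b

# Executed during package precompilation, this caches the inference work
# for exactly this signature, so the first runtime call skips it.
precompile(solve_system, (Matrix{Float64}, Vector{Float64}))

end # module
```

Note that `precompile` only covers inference results for the given signature; as discussed above, codegen/LLVM time is not cached this way.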

mentioned this pull request Aug 12, 2021
mentioned this pull request Oct 5, 2021
mentioned this pull request Mar 17, 2022