# Multithreaded Runge-Kutta Methods

## The Goal

To achieve its mission as an efficient package for HPC and research purposes, DifferentialEquations.jl aims to provide multithreaded versions of the most common algorithms. In this notebook we look at some results for the `DP5` solver. As shown in the other benchmarks, DifferentialEquations.jl's `DP5` is already more efficient than the classic Hairer `dopri5`, and vastly outperforms ODE.jl's `ode45`. Multithreading is meant to widen that performance gap further for larger problems. Here we test the `DP5Threaded` method against `DP5`.

## The Problem

For a simple test problem, we start with a linear ODE whose state is a matrix:

In [3]:
using DifferentialEquations

# 2D Linear ODE
f = (t,u,du) -> begin
  Threads.@threads for i in 1:length(u)
    du[i] = 1.01*u[i]
  end
end
(::typeof(f))(::Type{Val{:analytic}},t,u0) = u0*exp(1.01*t)
using Plots; gr()
tspan = (0.0,10.0);

setups = [Dict(:alg=>DP5())
          Dict(:alg=>DP5Threaded())]

2-element Array{Dict{Symbol,V},1}:
 Dict(:alg=>OrdinaryDiffEq.DP5())        
 Dict(:alg=>OrdinaryDiffEq.DP5Threaded())
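
Written out, the test problem defined in the cell above applies the same scalar linear ODE to every entry of the matrix. The symbols $U$, $U_0$ and $n$ are notation introduced here for illustration; the rate `1.01` and the analytic solution come from the code:

$$
\frac{dU}{dt} = 1.01\,U, \qquad U(0) = U_0 \in \mathbb{R}^{n \times n}, \qquad U(t) = U_0\, e^{1.01\,t}, \qquad t \in [0, 10].
$$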

For reference, the runs below use 16 threads (as reported by `Threads.nthreads()` in the next cell) on a machine with two Intel Xeon E5-2667 V3 processors (3.2 GHz, eight cores each, 20 MB cache, 135 W).

In [2]:
Threads.nthreads()

16
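
Julia fixes the thread count when the process starts, using the `JULIA_NUM_THREADS` environment variable, so it has to be set before launching Julia or the notebook kernel. The following is a small sketch for checking what a session actually received; the variable name `requested` is only illustrative.

```julia
# JULIA_NUM_THREADS is read once at startup; changing ENV afterwards
# does not change the number of threads in the running session.
requested = get(ENV, "JULIA_NUM_THREADS", "unset")

println("JULIA_NUM_THREADS  = ", requested)
println("Threads.nthreads() = ", Threads.nthreads())
```

If the second line prints `1`, the threaded loops still run, but serially, and no speedup from `DP5Threaded` should be expected.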

The comparison below is between the two Dormand-Prince 4/5 implementations listed in `setups`: `DP5` and `DP5Threaded`.

## Effect of Problem Size

Multithreading makes the most difference at medium problem sizes. For small problems, the overhead of parallelism outweighs the gains. For large problems, more of the total time is spent in the calculation of `f`, so the speed of the method's own calculations matters less. These results show that multithreading within the method begins to give reliable gains at problem sizes of about 50x50 and does best in roughly the 75x75 to 200x200 range (time ratios of about 0.71 to 0.80), before trailing off.

With fewer threads, these thresholds likely shift downward, but the effect is also smaller. The effect size is probably larger for higher order methods since there will be more "method calculations" per step.
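
The "method calculations" are the elementwise array updates the solver performs between calls to `f`, and these are what within-method threading targets. The sketch below, written for current Julia 1.x, times a serial and a threaded version of one Runge-Kutta-style stage update at several sizes. It is only an illustration of the overhead-versus-work trade-off: the function names are hypothetical, the coefficients are placeholders rather than the Dormand-Prince tableau, and it is not the `DP5Threaded` implementation.

```julia
# One Runge-Kutta-style stage update: tmp = u + dt*(a1*k1 + a2*k2).
# a1 and a2 are placeholder coefficients, not the DP5 tableau.
function stage_serial!(tmp, u, k1, k2, dt, a1, a2)
    for i in 1:length(u)
        @inbounds tmp[i] = u[i] + dt*(a1*k1[i] + a2*k2[i])
    end
    return tmp
end

# Same update, with the loop split across the available threads.
function stage_threaded!(tmp, u, k1, k2, dt, a1, a2)
    Threads.@threads for i in 1:length(u)
        @inbounds tmp[i] = u[i] + dt*(a1*k1[i] + a2*k2[i])
    end
    return tmp
end

# Time `reps` calls on n-by-n matrices, after one warm-up call
# so that compilation is not included in the measurement.
function time_stage(stage!, n; reps=1000)
    u, k1, k2 = rand(n, n), rand(n, n), rand(n, n)
    tmp = similar(u)
    stage!(tmp, u, k1, k2, 0.1, 0.2, 0.3)
    t = @elapsed begin
        for _ in 1:reps
            stage!(tmp, u, k1, k2, 0.1, 0.2, 0.3)
        end
    end
    return t
end

for n in (15, 50, 100, 300)
    ts = time_stage(stage_serial!, n)
    tt = time_stage(stage_threaded!, n)
    println(n, "x", n, ": serial ", ts, " s, threaded ", tt, " s, threaded/serial ", tt/ts)
end
```

The timings depend on the machine and the thread count, so no output is shown. A threaded/serial ratio above 1 at the smallest sizes would correspond to the overhead regime described above.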

### 15x15

In [3]:
prob = ODEProblem(f,rand(15,15),tspan)
shoot = ode_shootout(prob,setups;dt=1/2^(10),numruns=1000)
println(shoot.errors)
println(shoot.times)
println(shoot.effratios[1,:])
plot(shoot)

[0.653191,0.653191]
[0.000974153,0.00127949]
[1.0,1.31344]
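
The three printed rows can be related to each other as follows. This sketch assumes that efficiency is defined as `1/(error*time)` and that the printed row is the first method's efficiency divided by each method's efficiency. That is an assumption about `ode_shootout` that reproduces the numbers above, not a statement of its implementation. The variable names are illustrative.

```julia
# Values copied from the 15x15 output above (order: DP5, DP5Threaded).
errors = [0.653191, 0.653191]
times  = [0.000974153, 0.00127949]

# Assumed definition: efficiency = 1 / (error * time).
effs = 1 ./ (errors .* times)

# Efficiency of the first method (DP5) relative to each method.
ratios = effs[1] ./ effs
println(ratios)   # second entry is approximately 1.3134
```

Because both methods produce the same error, the second entry reduces to the time of `DP5Threaded` divided by the time of `DP5`. Under this reading, a value above 1 means the threaded method was slower and a value below 1 means it was faster.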


### 20x20

In [4]:
prob = ODEProblem(f,rand(20,20),tspan)
shoot = ode_shootout(prob,setups;dt=1/2^(10),numruns=1000)
println(shoot.errors)
println(shoot.times)
println(shoot.effratios[1,:])
plot(shoot)

[0.62665,0.62665]
[0.00104161,0.00125653]
[1.0,1.20634]


### 25x25

In [5]:
prob = ODEProblem(f,rand(25,25),tspan)
shoot = ode_shootout(prob,setups;dt=1/2^(10),numruns=1000)
println(shoot.errors)
println(shoot.times)
println(shoot.effratios[1,:])
plot(shoot)

[0.611453,0.611453]
[0.00112247,0.00141191]
[1.0,1.25786]


### 50x50

In [6]:
prob = ODEProblem(f,rand(50,50),tspan)
shoot = ode_shootout(prob,setups;dt=1/2^(10),numruns=1000)
println(shoot.errors)
println(shoot.times)
println(shoot.effratios[1,:])
plot(shoot)

[0.604175,0.604175]
[0.0032904,0.00283078]
[1.0,0.860313]


### 75x75

In [7]:
prob = ODEProblem(f,rand(75,75),tspan)
shoot = ode_shootout(prob,setups;dt=1/2^(10),numruns=1000)
println(shoot.errors)
println(shoot.times)
println(shoot.effratios[1,:])
plot(shoot)

[0.606848,0.606848]
[0.00908926,0.00642197]
[1.0,0.706545]


### 100x100

In [8]:
prob = ODEProblem(f,rand(100,100),tspan)
shoot = ode_shootout(prob,setups;dt=1/2^(10),numruns=1000)
println(shoot.errors)
println(shoot.times)
println(shoot.effratios[1,:])
plot(shoot)

[0.608219,0.608219]
[0.0136606,0.010079]
[1.0,0.737818]


### 200x200

In [12]:
prob = ODEProblem(f,rand(200,200),tspan)
shoot = ode_shootout(prob,setups;dt=1/2^(10),numruns=1000)
println(shoot.errors)
println(shoot.times)
println(shoot.effratios[1,:])
plot(shoot)

[0.606576,0.606576]
[0.0949148,0.0760345]
[1.0,0.801081]


### 300x300

In [14]:
prob = ODEProblem(f,rand(300,300),tspan)
shoot = ode_shootout(prob,setups;dt=1/2^(10),numruns=1000)
println(shoot.errors)
println(shoot.times)
println(shoot.effratios[1,:])
plot(shoot)

[0.609555,0.609555]
[0.202752,0.171872]
[1.0,0.847693]


## Conclusion

This is only a very early form of the within-method multithreaded versions, and it already shows promising results for larger problems. By around 50x50 matrices we already see a speedup. The speedup then lessens as the problems get bigger, since more time is spent in the function evaluations. There are still some major problems with Julia's threading that keep it from reaching maximum performance. Hopefully these issues will get worked out soon.
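
To see the trend in one place, the timings printed in the sections above can be gathered and compared directly. The sketch below, written for current Julia 1.x, only re-uses those printed numbers; the variable names are illustrative.

```julia
# Timings copied from the printed `shoot.times` rows above.
sizes = [15, 20, 25, 50, 75, 100, 200, 300]
t_dp5 = [0.000974153, 0.00104161, 0.00112247, 0.0032904,
         0.00908926, 0.0136606, 0.0949148, 0.202752]
t_thr = [0.00127949, 0.00125653, 0.00141191, 0.00283078,
         0.00642197, 0.010079, 0.0760345, 0.171872]

# Time of DP5Threaded relative to DP5; below 1 means the threaded method was faster.
ratio = t_thr ./ t_dp5

for (n, r) in zip(sizes, ratio)
    println(n, "x", n, ": DP5Threaded/DP5 time = ", round(r, digits=3))
end

# Problem size with the largest relative gain in these runs.
best = sizes[argmin(ratio)]
println("smallest ratio at ", best, "x", best)
```

In these runs the ratio is above 1 for the three smallest sizes, drops below 1 from 50x50 onward, is smallest at 75x75, and rises again toward 300x300.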