# Performance

This notebook measures the performance of `Simulate.jl` functionality in order to

1. improve it,
2. provide a basis for decisions about issue ["Getting rid of all functionality involving global scope #15"](https://github.com/pbayer/Simulate.jl/issues/15) and
3. compile the [performance section](https://pbayer.github.io/Simulate.jl/dev/performance/) of the documentation.

In [1]:
using Simulate, BenchmarkTools, Random
res = Dict(); # results dictionary

## Event-based simulations

The following is a modification of the [channel example](https://pbayer.github.io/Simulate.jl/dev/approach/#Event-based-modeling-1). We simulate events 

1. taking something from a common channel, or waiting if there is nothing,
2. then waiting for a random delay, doing a calculation and
3. returning three times to the first step.
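
The channel operations behind step 1 can be tried in isolation. The following is a minimal sketch in plain Julia, without `Simulate.jl`; the capacity of 2 is an arbitrary choice for illustration.

```julia
ch = Channel{Int}(2)        # buffered channel with capacity 2 (arbitrary for this sketch)

println(isready(ch))        # false: nothing to take yet
put!(ch, 1)                 # put a token into the channel
println(isready(ch))        # true: a token is available
x = take!(ch)               # take the token out again
println(x)                  # 1
println(isready(ch))        # false: the channel is empty again
```

Since `take!` blocks on an empty channel, the simulation checks `isready(ch)` first and otherwise schedules a conditional event.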

As the calculation we take the following sum, known as the Leibniz series:

$$4 \sum_{k=1}^{n} \frac{(-1)^{k+1}}{2 k - 1}$$

This gives a slow approximation to $\pi$: the error after $n$ terms is roughly $1/n$.
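
To see how slowly the sum above (the Leibniz series for $\pi$) converges, it can be evaluated directly. This is a minimal sketch outside the simulation; the function name `leibniz_pi` is hypothetical and not part of `Simulate.jl`.

```julia
# Partial sum of 4 * sum_{k=1}^{n} (-1)^(k+1) / (2k - 1)
# (hypothetical helper, only for illustration)
leibniz_pi(n::Int) = 4 * sum((-1)^(k + 1) / (2k - 1) for k in 1:n)

for n in (10, 250, 10_000)
    approx = leibniz_pi(n)
    println("n = ", n, ": ", approx, "  error = ", abs(pi - approx))
end
```

For `n = 250` the error is about 0.004, which is why the simulations below report a result near 3.1376 rather than 3.1416.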

### Function calls as events

The first implementation is based on events with `SimFunction`s. 

In [2]:
function take(id::Int64, ch::Channel, step::Int64, qpi::Array{Float64,1})
    if isready(ch)
        take!(ch)                                            # take something from common channel
        event!(SF(put, id, ch, step, qpi), after, rand())    # timed event after some time
    else
        event!(SF(take, id, ch, step, qpi), SF(isready, ch)) # conditional event until channel is ready
    end
end

function put(id::Int64, ch::Channel, step::Int64, qpi::Array{Float64,1})
    put!(ch, 1)
    qpi[step] += (-1)^(id+1)/(2id -1)      # Leibniz series (slow approximation to pi)
    step > 3 || take(id, ch, step+1, qpi)
end

function setup(n::Int)                     # set up the simulation
    reset!(𝐶)
    Random.seed!(123)
    global ch = Channel{Int64}(32)  # create a channel
    global qpi = zeros(4)
    si = shuffle(1:n)
    for i in 1:n
        take(si[i], ch, 1, qpi)
    end
    for i in 1:min(n, 32)
        put!(ch, 1) # put the first tokens into the channel
    end
end

setup (generic function with 1 method)

If we set up 250 summation elements, we get 1000 timed events (250 elements times 4 steps) and 1438 sample steps with conditional events.

In [3]:
@time setup(250)
println(@time run!(𝐶, 500))
println("result=", sum(qpi))

  0.962778 seconds (2.30 M allocations: 117.676 MiB, 3.38% gc time)
  1.269727 seconds (4.13 M allocations: 198.800 MiB, 3.40% gc time)
run! finished with 1000 clock events, 1438 sample steps, simulation time: 500.0
result=3.137592669589475


In [4]:
t = run(@benchmarkable run!(𝐶, 500) setup=setup(250) evals=1 seconds=10.0 samples=30)

BenchmarkTools.Trial: 
  memory estimate:  115.75 MiB
  allocs estimate:  2481210
  --------------
  minimum time:     354.273 ms (2.16% GC)
  median time:      369.565 ms (3.23% GC)
  mean time:        369.519 ms (2.91% GC)
  maximum time:     407.303 ms (3.72% GC)
  --------------
  samples:          27
  evals/sample:     1

In [5]:
res["Event based with SimFunctions"] = minimum(t).time * 1e-6 # ms 

354.273085
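
The benchmark pattern used here can be tried on a toy workload. This is only a sketch: the function `work` and the vector `v` are hypothetical stand-ins for `run!` and the simulation setup. It assumes `BenchmarkTools` is installed.

```julia
using BenchmarkTools

# hypothetical toy workload standing in for run!(𝐶, 500)
work(v) = sum(abs2, v)

# setup runs before every sample; evals=1 means one evaluation per sample
b = @benchmarkable work(v) setup=(v = rand(1000)) evals=1 seconds=1.0 samples=30
t = run(b)

println("samples taken: ", length(t.times))        # at most `samples`
println("minimum [ms]:  ", minimum(t).time * 1e-6) # times are recorded in ns
```

`samples` is an upper limit, and sampling also stops once the `seconds` time budget is used up. That would explain why the run above collected 27 samples instead of 30.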

### Expressions as events

The second implementation does the same but with expressions, which are `eval`uated in global scope at runtime:

In [6]:
function take(id::Int64, ch::Channel, step::Int64, qpi::Array{Float64,1})
    if isready(ch)
        take!(ch)                                            # take something from common channel
        event!(:(put($id, ch, $step, qpi)), after, rand())   # timed event after some time
    else
        event!(:(take($id, ch, $step, qpi)), :(isready(ch))) # conditional event until channel is ready
    end
end

function put(id::Int64, ch::Channel, step::Int64, qpi::Array{Float64,1})
    put!(ch, 1)
    qpi[step] += (-1)^(id+1)/(2id -1)      # Leibniz series (slow approximation to pi)
    step > 3 || take(id, ch, step+1, qpi)
end

put (generic function with 1 method)

In [7]:
@time setup(250)
println(@time run!(𝐶, 500))
println("result=", sum(qpi))

  0.090309 seconds (158.23 k allocations: 8.476 MiB)
 13.032384 seconds (7.92 M allocations: 487.178 MiB, 0.51% gc time)
run! finished with 1000 clock events, 1438 sample steps, simulation time: 500.0
result=3.137592669589475


In [8]:
t = run(@benchmarkable run!(𝐶, 500) setup=setup(250) evals=1 seconds=10.0 samples=30)

BenchmarkTools.Trial: 
  memory estimate:  486.70 MiB
  allocs estimate:  7905745
  --------------
  minimum time:     12.850 s (0.46% GC)
  median time:      12.850 s (0.46% GC)
  mean time:        12.850 s (0.46% GC)
  maximum time:     12.850 s (0.46% GC)
  --------------
  samples:          1
  evals/sample:     1

In [9]:
res["Event based with Expressions"] = minimum(t).time * 1e-6 # ms
res

Dict{Any,Any} with 2 entries:
  "Event based with Expressions"  => 12850.1
  "Event based with SimFunctions" => 354.273

In [10]:
res["Event based with Expressions"]/res["Event based with SimFunctions"]

36.2716093095246

This takes about 36 times longer. It shows that evaluating Julia expressions with `eval` in global scope at runtime is very expensive and should be avoided whenever performance matters.
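
The cost of `eval` can also be observed outside the simulation. The following is a minimal sketch in plain Julia, not how `Simulate.jl` executes events internally; the names `inc`, `sum_direct` and `sum_eval` are hypothetical.

```julia
# hypothetical toy function standing in for an event action
inc(x) = x + 1

# call the function directly, so the compiler sees the call
function sum_direct(n)
    s = 0
    for i in 1:n
        s += inc(i)
    end
    return s
end

# build an expression and evaluate it in the module's global scope each time
function sum_eval(n)
    s = 0
    for i in 1:n
        s += Core.eval(@__MODULE__, :(inc($i)))
    end
    return s
end

sum_direct(10); sum_eval(10)                 # warm up, so compilation is not timed
t_direct = @elapsed sum_direct(10_000)
t_eval   = @elapsed sum_eval(10_000)
println("same result: ", sum_direct(10_000) == sum_eval(10_000))
println("direct: ", t_direct, " s, eval: ", t_eval, " s")
```

Both variants return the same sum. The absolute timings depend on the machine and the Julia version, so none are quoted here.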

### Involving a global variable

The third implementation works with `SimFunction`s like the first but involves a global variable `A`:

In [11]:
function take(id::Int64, ch::Channel, step::Int64)
    if isready(ch)
        take!(ch)                                       # take something from common channel
        event!(SF(put, id, ch, step), after, rand())    # timed event after some time
    else
        event!(SF(take, id, ch, step), SF(isready, ch)) # conditional event until channel is ready
    end
end

function put(id::Int64, ch::Channel, step::Int64)
    put!(ch, 1)
    global A += (-1)^(id+1)/(2id -1)      # Leibniz series (slow approximation to pi)
    step > 3 || take(id, ch, step+1)
end

function setup(n::Int)                     # set up the simulation
    reset!(𝐶)
    Random.seed!(123)
    global ch = Channel{Int64}(32)  # create a channel
    global A = 0
    si = shuffle(1:n)
    for i in 1:n
        take(si[i], ch, 1)
    end
    for i in 1:min(n, 32)
        put!(ch, 1) # put the first tokens into the channel
    end
end

setup (generic function with 1 method)

In [12]:
@time setup(250)
println(@time run!(𝐶, 500))
println("result=", A)

  0.066916 seconds (132.96 k allocations: 6.673 MiB, 12.95% gc time)
  0.456441 seconds (2.63 M allocations: 123.395 MiB, 2.96% gc time)
run! finished with 1000 clock events, 1438 sample steps, simulation time: 500.0
result=3.1375926695894556


In [13]:
t = run(@benchmarkable run!(𝐶, 500) setup=setup(250) evals=1 seconds=10.0 samples=30)

BenchmarkTools.Trial: 
  memory estimate:  115.75 MiB
  allocs estimate:  2483210
  --------------
  minimum time:     360.148 ms (2.38% GC)
  median time:      371.667 ms (3.52% GC)
  mean time:        375.799 ms (3.14% GC)
  maximum time:     414.637 ms (2.18% GC)
  --------------
  samples:          26
  evals/sample:     1

In [14]:
res["Event based with functions and a global variable"] = minimum(t).time * 1e-6 # ms
res

Dict{Any,Any} with 3 entries:
  "Event based with Expressions"                     => 12850.1
  "Event based with SimFunctions"                    => 354.273
  "Event based with functions and a global variable" => 360.148

In this case the global variable `A` costs little: the minimum time is 360.1 ms against 354.3 ms for the first version, so the run takes only marginally longer.
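
If a global accumulator is wanted, a common Julia idiom is a constant binding to a typed container, so that the type of the stored value is fixed. This is a minimal sketch outside the simulation, not what the notebook benchmarks; the names `ACC` and `add_term!` are hypothetical.

```julia
# constant binding with mutable, concretely typed content (hypothetical name)
const ACC = Ref(0.0)

# add the term for one id to the accumulator
function add_term!(id::Int)
    ACC[] += (-1)^(id + 1) / (2id - 1)
    return nothing
end

ACC[] = 0.0                      # reset before a run
foreach(add_term!, 1:250)        # one term per id; the simulation adds each term 4 times
println("result=", 4 * ACC[])
```

With 250 terms this gives approximately 3.1376, in line with the results above. The last digits may differ because the summation order is different.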