# Performance

This notebook measures performance of `Simulate.jl` functionality in order to 

1. improve it,
2. provide a basis for deciding on the issue ["Getting rid of all functionality involving gobal scope #15"](https://github.com/pbayer/Simulate.jl/issues/15) and
3. compile the [performance section](https://pbayer.github.io/Simulate.jl/dev/performance/) of the documentation.

In [1]:
using Simulate, BenchmarkTools, Random
res = Dict(); # results dictionary

## Event-based simulations

The following is a modification of the [channel example](https://pbayer.github.io/Simulate.jl/dev/approach/#Event-based-modeling-1). We simulate events 

1. taking something from a common channel or waiting if there is nothing, 
2. then taking a delay, doing a calculation and
3. returning three times to the first step.

As the calculation we take the Leibniz series for $\pi$:

$$4 \sum_{k=1}^{n} \frac{(-1)^{k+1}}{2 k - 1}$$

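This converges slowly to $\pi$: the error after $n$ terms is roughly $1/n$. In the code each element $k$ adds only its term $(-1)^{k+1}/(2k-1)$, but it does so on each of its four passes through `put`, which is where the factor 4 comes from.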
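As a plausibility check outside the simulation, the partial sum can be computed directly. This is a minimal sketch; the function name `leibniz_pi` is an assumption for illustration and not part of `Simulate.jl`.

```julia
# Partial sum of the series: 4 * sum_{k=1}^{n} (-1)^(k+1) / (2k - 1).
# `leibniz_pi` is a hypothetical helper, not part of Simulate.jl.
function leibniz_pi(n::Integer)
    s = 0.0
    for k in 1:n
        s += (-1)^(k + 1) / (2k - 1)   # k-th term of the alternating series
    end
    return 4s
end

approx = leibniz_pi(250)
println("approximation = ", approx, ", error = ", π - approx)
```

With 250 terms the error is close to 1/250 = 0.004. The simulations should reproduce this value up to floating-point rounding, since they add the same terms in a shuffled order.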

### Function calls as events

The first implementation is based on events with `SimFunction`s. 

In [2]:
function take(id::Int64, qpi::Vector{Float64}, step::Int64)
    if isready(ch)
        take!(ch)                                            # take something from common channel
        event!(SF(put, id, qpi, step), after, rand())    # timed event after some time
    else
        event!(SF(take, id, qpi, step), SF(isready, ch)) # conditional event until channel is ready
    end
end

function put(id::Int64, qpi::Vector{Float64}, step::Int64)
    put!(ch, 1)
    qpi[1] += (-1)^(id+1)/(2id -1)      # term of the Leibniz series (slow approximation to pi)
    step > 3 || take(id, qpi, step+1)
end

function setup(n::Int)                     # set up the simulation
    reset!(𝐶)
    Random.seed!(123)
    global ch = Channel{Int64}(32)  # create a channel
    global qpi = [0.0]
    si = shuffle(1:n)
    for i in 1:n
        take(si[i], qpi, 1)
    end
    for i in 1:min(n, 32)
        put!(ch, 1) # put first tokens into channel 1
    end
end

setup (generic function with 1 method)

If we set up 250 summation elements, each passing through `put` four times, we get 1000 timed events. The run below additionally takes 1438 sample steps to evaluate the conditional events.

In [4]:
@time setup(250)
println(@time run!(𝐶, 500))
println("result=", qpi[1])

  0.000775 seconds (2.55 k allocations: 149.969 KiB)
  0.271519 seconds (1.76 M allocations: 66.246 MiB, 2.66% gc time)
run! finished with 1000 clock events, 1438 sample steps, simulation time: 500.0
result=3.1375926695894556


In [5]:
t = run(@benchmarkable run!(𝐶, 500) setup=setup(250) evals=1 seconds=15.0 samples=50)

BenchmarkTools.Trial: 
  memory estimate:  66.24 MiB
  allocs estimate:  1759423
  --------------
  minimum time:     267.719 ms (1.52% GC)
  median time:      272.915 ms (2.36% GC)
  mean time:        272.391 ms (2.31% GC)
  maximum time:     278.427 ms (2.86% GC)
  --------------
  samples:          50
  evals/sample:     1

In [6]:
res["Event based with SimFunctions"] = minimum(t).time * 1e-6 # ms 

267.71865299999996

### Expressions as events

The second implementation does the same but with expressions, which are `eval`uated in global scope at runtime:

In [7]:
function take(id::Int64, qpi::Vector{Float64}, step::Int64)
    if isready(ch)
        take!(ch)                                            # take something from common channel
        event!(:(put($id, qpi, $step)), after, rand())   # timed event after some time
    else
        event!(:(take($id, qpi, $step)), :(isready(ch))) # conditional event until channel is ready
    end
end

function put(id::Int64, qpi::Vector{Float64}, step::Int64)
    put!(ch, 1)
    qpi[1] += (-1)^(id+1)/(2id -1)      # term of the Leibniz series (slow approximation to pi)
    step > 3 || take(id, qpi, step+1)
end

put (generic function with 1 method)

In [8]:
@time setup(250)
println(@time run!(𝐶, 500))
println("result=", sum(qpi))

  0.112309 seconds (241.49 k allocations: 12.564 MiB, 7.51% gc time)
 10.530088 seconds (6.52 M allocations: 395.501 MiB, 0.47% gc time)
run! finished with 1000 clock events, 1438 sample steps, simulation time: 500.0
result=3.1375926695894556


In [9]:
t = run(@benchmarkable run!(𝐶, 500) setup=setup(250) evals=1 seconds=15.0 samples=50)

BenchmarkTools.Trial: 
  memory estimate:  395.27 MiB
  allocs estimate:  6513416
  --------------
  minimum time:     10.469 s (0.42% GC)
  median time:      10.478 s (0.42% GC)
  mean time:        10.478 s (0.42% GC)
  maximum time:     10.486 s (0.42% GC)
  --------------
  samples:          2
  evals/sample:     1

In [10]:
res["Event based with Expressions"] = minimum(t).time * 1e-6 # ms
res

Dict{Any,Any} with 2 entries:
  "Event based with Expressions"  => 10469.4
  "Event based with SimFunctions" => 267.719

In [11]:
res["Event based with Expressions"]/res["Event based with SimFunctions"]

39.105903916975116

This takes about 39 times longer (10.47 s versus 268 ms minimum time). It shows that evaluating Julia expressions with `eval` in global scope is very expensive and should be avoided if performance matters.
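The cost difference can also be seen in isolation, without `Simulate.jl`. The following sketch stores the same deferred call once as a closure and once as an expression, then executes each repeatedly. The names `work`, `run_closures` and `run_exprs` are assumptions for illustration. Timings depend on the machine and the Julia version, so none are quoted here.

```julia
# Hypothetical stand-in for an event action; not part of Simulate.jl.
work(id::Int) = (-1)^(id + 1) / (2id - 1)

# Deferred calls stored as closures: executed by an ordinary function call.
function run_closures(n::Int)
    calls = [() -> work(i) for i in 1:n]
    s = 0.0
    for c in calls
        s += c()
    end
    return s
end

# Deferred calls stored as expressions: each one goes through `eval`
# in the module's global scope at runtime.
function run_exprs(n::Int)
    calls = [:(work($i)) for i in 1:n]
    s = 0.0
    for ex in calls
        s += Core.eval(@__MODULE__, ex)
    end
    return s
end

r1 = run_closures(1000); r2 = run_exprs(1000)   # first calls also trigger compilation
t_closure = @elapsed run_closures(1000)
t_expr    = @elapsed run_exprs(1000)
println("closures: ", t_closure, " s, expressions: ", t_expr, " s")
println("same result: ", r1 == r2)
```

Both variants compute the same sum. The expression variant has to process every expression through `eval` when it is executed, whereas the closures are compiled once and then simply called.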

### Involving a global variable

The third implementation works with `SimFunction`s like the first but accumulates the sum in a global variable `A`:

In [12]:
function take(id::Int64, step::Int64)
    if isready(ch)
        take!(ch)                                       # take something from common channel
        event!(SF(put, id, step), after, rand())    # timed event after some time
    else
        event!(SF(take, id, step), SF(isready, ch)) # conditional event until channel is ready
    end
end

function put(id::Int64, step::Int64)
    put!(ch, 1)
    global A += (-1)^(id+1)/(2id -1)      # term of the Leibniz series (slow approximation to pi)
    step > 3 || take(id, step+1)
end

function setup(n::Int)                     # set up the simulation
    reset!(𝐶)
    Random.seed!(123)
    global ch = Channel{Int64}(32)  # create a channel
    global A = 0
    si = shuffle(1:n)
    for i in 1:n
        take(si[i], 1)
    end
    for i in 1:min(n, 32)
        put!(ch, 1) # put first tokens into channel 1
    end
end

setup (generic function with 1 method)

In [13]:
ch = Channel{Int64}(32)
@code_warntype put(1, 1)

Variables
  #self#::Core.Compiler.Const(put, false)
  id::Int64
  step::Int64

Body::Any
1 ─       Main.put!(Main.ch, 1)
│         nothing
│   %3  = (id + 1)::Int64
│   %4  = ((-1) ^ %3)::Int64
│   %5  = (2 * id)::Int64
│   %6  = (%5 - 1)::Int64
│   %7  = (%4 / %6)::Float64
│   %8  = (Main.A + %7)::Any
│         (Main.A = %8)
│   %10 = (step > 3)::Bool
└──       goto #3 if not %10
2 ─       return %10
3 ─ %13 = (step + 1)::Int64
│   %14 = Main.take(id, %13)::Any
└──       return %14


In [15]:
@time setup(250)
println(@time run!(𝐶, 500))
println("result=", A)

  0.000773 seconds (2.55 k allocations: 149.875 KiB)
  0.272900 seconds (1.76 M allocations: 66.124 MiB, 1.89% gc time)
run! finished with 1000 clock events, 1438 sample steps, simulation time: 500.0
result=3.1375926695894556


In [16]:
t = run(@benchmarkable run!(𝐶, 500) setup=setup(250) evals=1 seconds=10.0 samples=30)

BenchmarkTools.Trial: 
  memory estimate:  66.11 MiB
  allocs estimate:  1760176
  --------------
  minimum time:     269.096 ms (1.58% GC)
  median time:      275.843 ms (2.33% GC)
  mean time:        275.089 ms (2.37% GC)
  maximum time:     291.051 ms (3.04% GC)
  --------------
  samples:          30
  evals/sample:     1

In [17]:
res["Event based with functions and a global variable"] = minimum(t).time * 1e-6 # ms
res

Dict{Any,Any} with 3 entries:
  "Event based with Expressions"                     => 10469.4
  "Event based with SimFunctions"                    => 267.719
  "Event based with functions and a global variable" => 269.096

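`@code_warntype` reports `Main.A + %7` as `Any`, so the compiler cannot infer the type of the global `A`. Even so, the run takes only marginally longer than the first version (minimum 269.1 ms versus 267.7 ms), presumably because one dynamically dispatched addition per event is small compared with the event-handling overhead.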
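If an untyped global ever does become a bottleneck, one common remedy is to bind a constant to a typed container, so that the element type is known to the compiler. The following is a sketch of that pattern, not code from `Simulate.jl`; the names `ACC`, `add_untyped!` and `add_typed!` are assumptions for illustration.

```julia
# Untyped, non-constant global: its type may change at any time, so code
# using it cannot be specialized on a concrete type.
A = 0
function add_untyped!(x::Float64)
    global A += x
end

# Constant binding to a typed container: the binding never changes,
# only the Float64 value stored inside it.
const ACC = Ref(0.0)
add_typed!(x::Float64) = (ACC[] += x)

add_untyped!(0.5)
add_typed!(0.5)
println("A = ", A, ", ACC[] = ", ACC[])

# Compare what the compiler infers as return type for each variant.
println(Base.return_types(add_untyped!, (Float64,)))
println(Base.return_types(add_typed!, (Float64,)))
```

Both functions produce the same value; they differ only in what the compiler can infer about it. Whether the change is measurable depends on how much work surrounds the global access.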