## Notes about yesterday:
- Simon Danisch gave a related talk at Curry On
  - https://nextjournal.com/sdanisch/curry-on-julia
  - https://www.youtube.com/watch?v=DCs0_T9BRp0
  - Also NextJournal is cool
- The Zygote paper about differentiable programming hit arXiv yesterday
  - https://twitter.com/ChrisRackauckas/status/1151793248626774016
  - https://arxiv.org/abs/1907.07587
  - https://github.com/MikeInnes/zygote-paper


# Multithreading in Julia
For best results, use Julia 1.2 or, even better, 1.3.

Multithreading (MT) support is under heavy development right now.

In [5]:
NCPU = Sys.CPU_THREADS
using Base.Threads
@show NCPU
@show nthreads();

NCPU = 4
nthreads() = 1


In [36]:
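using IJulia
# Install one Jupyter kernel per power-of-two thread count, up to NCPU.
# Note: `cpus` is a Float64 here, so the kernels are named e.g. "Julia (2.0 threads)".
for I in 0:1.0:log(2, NCPU)
    cpus = 2^round(I)
    @show cpus
    installkernel("Julia ($cpus threads)", env=Dict("JULIA_NUM_THREADS"=>"$cpus"))
end
# Now restart Jupyter and switch kernel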

I = 0.0
cpus = 1.0
I = 1.0
cpus = 2.0
I = 2.0
cpus = 4.0


┌ Info: Installing Julia (1.0 threads) kernelspec in /home/vchuravy/.local/share/jupyter/kernels/julia-(1.0-threads)-1.1
└ @ IJulia /home/vchuravy/.julia/packages/IJulia/gI2uA/deps/kspec.jl:72
┌ Info: Installing Julia (2.0 threads) kernelspec in /home/vchuravy/.local/share/jupyter/kernels/julia-(2.0-threads)-1.1
└ @ IJulia /home/vchuravy/.julia/packages/IJulia/gI2uA/deps/kspec.jl:72
┌ Info: Installing Julia (4.0 threads) kernelspec in /home/vchuravy/.local/share/jupyter/kernels/julia-(4.0-threads)-1.1
└ @ IJulia /home/vchuravy/.julia/packages/IJulia/gI2uA/deps/kspec.jl:72


## Your hardware (on Linux)

In [14]:
;lscpu

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              2
Core(s) per socket:              2
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           142
Model name:                      Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz
Stepping:                        9
CPU MHz:                         3438.326
CPU max MHz:                     4000.0000
CPU min MHz:                     400.0000
BogoMIPS:                        4993.00
Virtualization:                  VT-x
L1d cache:                       64 KiB
L1i cache:                       64 KiB
L2 cache:                        512 KiB
L3 cache:                       

## Fork-join parallelism
Julia's threading model is based on a fork-join approach and is still considered experimental.

Fork-join describes the control flow that a group of threads undergoes: execution is forked and an anonymous function is run across all threads.

All threads then have to join before serial execution continues.

Special care needs to be taken if the loop body has side effects or accesses global state (this includes IO and random numbers).

https://github.com/JuliaLang/julia/pull/32477
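
A minimal sketch of the fork-join pattern described above, assuming every iteration writes only to its own array slot (the array name `squares` is illustrative and not part of the original notebook):

```julia
using Base.Threads

squares = zeros(Int, 100)

# Fork: the iterations are distributed across the available threads.
# Each iteration writes only to its own slot, so no mutable state is shared.
@threads for i in 1:length(squares)
    squares[i] = i^2
end
# Join: the loop only returns after all threads have finished.

# Serial execution continues and sees every write made inside the loop.
total = sum(squares)
```

Because no iteration touches another iteration's slot, the result does not depend on how many threads the kernel was started with.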



In [11]:
a = zeros(Int, nthreads()*10)
@threads for i in 1:length(a)
    a[i] = threadid()
end
a

10-element Array{Int64,1}:
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1

In [16]:
using Random
A = zeros(10_000)
@threads for i in 1:length(A)
    A[i] = rand()
end

# This might change soon https://github.com/JuliaLang/julia/pull/32407
const rngs = [MersenneTwister(rand(UInt64)) for i in 1:nthreads()]

@threads for i in 1:length(A)
    A[i] = rand(rngs[threadid()])
end

### Atomics

In [24]:
function ff()
    acc = 0
    @threads for i in 1:10_000
        acc += 1
    end
    return acc
end
# What is the result?

ff (generic function with 1 method)

In [25]:
function f_ref()
    acc = Ref(0)
    @threads for i in 1:10_000
        acc[] += 1
    end
    return acc
end
# What is the result?

f_ref (generic function with 1 method)

In [21]:
function f_acc()
    acc = Atomic{Int64}(0)
    @threads for i in 1:10_000
        atomic_add!(acc, 1)
    end
    acc
end
# What is acc?

f_acc (generic function with 1 method)
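
One way to explore these questions is to run the counters with more than one thread. The sketch below uses illustrative function names (`count_atomic`, `count_racy`) and relies only on what is guaranteed: the atomic counter always reaches the full count, while the plain `Ref` counter may lose updates when `nthreads() > 1`, because `acc[] += 1` is a separate load, add, and store.

```julia
using Base.Threads

function count_atomic(n)
    acc = Atomic{Int}(0)
    @threads for i in 1:n
        atomic_add!(acc, 1)   # indivisible read-modify-write
    end
    return acc[]              # acc[] reads the current value
end

function count_racy(n)
    acc = Ref(0)
    @threads for i in 1:n
        acc[] += 1            # data race when nthreads() > 1
    end
    return acc[]
end

n = 10_000
atomic_total = count_atomic(n)   # always n
racy_total = count_racy(n)       # n with one thread, possibly less with several
```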

### Locks

In [3]:
struct Accumulator{T, L}
    x::Base.RefValue{T}
    lock::L
end

Base.lock(a::Accumulator) = lock(a.lock)
Base.unlock(a::Accumulator) = unlock(a.lock)
Base.getindex(a::Accumulator) = a.x[]
Base.setindex!(a::Accumulator, val) = a.x[] = val

In [6]:
function f(accum)
    @threads for i in 1:10_000
        lock(accum)
        accum[] += 1
        unlock(accum)
    end
end

f (generic function with 1 method)

In [7]:
acc_m = Accumulator(Ref(0), Mutex())
f(acc_m)
acc_m[]

10000

In [8]:
acc_sl = Accumulator(Ref(0), SpinLock())
f(acc_sl)
acc_sl[]

10000

In [14]:
acc_rsl = Accumulator(Ref(0), RecursiveSpinLock())
f(acc_rsl)
acc_rsl[]

10000
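
The explicit `lock`/`unlock` pair in `f` leaves the lock held if the loop body throws. A hedged sketch of the same counter using `ReentrantLock` from Base and the `lock(f, l)` do-block form, which releases the lock on exit. The function name `locked_count` is illustrative, and the sketch assumes Julia 1.3 or newer, where `ReentrantLock` can be used across threads:

```julia
using Base.Threads

function locked_count(n)
    counter = Ref(0)
    l = ReentrantLock()
    @threads for i in 1:n
        # lock(f, l) acquires l, runs f, and releases l even if f throws
        lock(l) do
            counter[] += 1
        end
    end
    return counter[]
end

locked_total = locked_count(10_000)
```

The do-block costs a closure call per iteration, so it trades a little speed for not having to pair every `lock` with an `unlock` by hand.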

In [2]:
using BenchmarkTools

┌ Info: Recompiling stale cache file /home/vchuravy/.julia/compiled/v1.1/BenchmarkTools/ZXPQo.ji for BenchmarkTools [6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf]
└ @ Base loading.jl:1184


In [10]:
@btime f($acc_m)

  203.075 μs (1 allocation: 32 bytes)


In [11]:
@btime f($acc_sl)

  100.056 μs (1 allocation: 32 bytes)


In [19]:
@btime f($acc_rsl)

  124.969 μs (1 allocation: 32 bytes)


In [22]:
@btime ff()

  136.899 μs (9491 allocations: 148.31 KiB)


10000

In [26]:
@btime f_acc()

  45.171 μs (2 allocations: 48 bytes)


Atomic{Int64}(10000)

In [28]:
@btime f_ref()

  2.641 μs (2 allocations: 48 bytes)


Base.RefValue{Int64}(10000)

### Useful trick 
Benchmarks for this were run on:
```
> lscpu
...
Thread(s) per core:  2
Core(s) per socket:  2
...
Model name:          Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz
...
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            4096K
...
Flags: ... avx2 ...
```

and
```
> lscpu
...
Thread(s) per core:  2
Core(s) per socket:  12
...
Model name:          AMD Ryzen Threadripper 1920X 12-Core Processor
...
L1d cache:           32K
L1i cache:           64K
L2 cache:            512K
L3 cache:            8192K
...
Flags: ... avx2 ...
```

```julia
@threads for id in 1:nthreads()
    #each thread does something
end
```

In [29]:
function threaded_sum(arr)
   @assert length(arr) % nthreads() == 0
   results = zeros(eltype(arr), nthreads())
   @threads for tid in 1:nthreads()
       # split work
       acc = zero(eltype(arr))
       len = div(length(arr), nthreads())
       domain = ((tid-1)*len +1):tid*len
       @inbounds for i in domain
           acc += arr[i]    
       end
       results[tid] = acc
   end
   sum(results)
end

threaded_sum (generic function with 1 method)

| NT (threads) | Skylake | AMD TR |
|---|---|---|
| sum | 514.476 μs | 430.409 μs |
| 1 | 1.578 ms | 1.206 ms |
| 2 | 831.411 μs | 575.872 μs |
| 4 | 417.656 μs | 294.724 μs |
| 6 | X | 215.986 μs |
| 12 | X | 109.536 μs |
| 24 | X | 57.197 μs |

If your `@threads` version running on a single thread is not as fast as the non-`@threads` version, something is off (here one thread takes roughly 3x as long as `sum`). But yay for linear scaling!

```julia
function threaded_sum(arr)
       ...
       @inbounds for i in domain
           acc += arr[i]    
       end
       ...
end
```

In [51]:
function threaded_sum2(arr)
   @assert length(arr) % nthreads() == 0
   results = zeros(eltype(arr), nthreads())
   @threads for tid in 1:nthreads()
       # split work
       acc = zero(eltype(arr))
       len = div(length(arr), nthreads())
       domain = ((tid-1)*len +1):tid*len
       @inbounds @simd for i in domain
           acc += arr[i]    
       end
       results[tid] = acc
   end
   sum(results)
end

threaded_sum2 (generic function with 1 method)

| NT (threads) | Skylake | AMD TR |
|---|---|---|
| sum | 514.476 μs | 430.409 μs |
| 1 | 493.384 μs | 401.755 μs |
| 2 | 282.030 μs | 73.408 μs |
| 4 | 230.988 μs | 37.541 μs |
| 6 | X | 29.185 μs |
| 12 | X | 16.491 μs |
| 24 | X | 17.693 μs |

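We note:
1. Hyperthreading: going beyond the number of physical cores brings little (Skylake, 2 to 4 threads) or nothing (Threadripper, 12 to 24 threads)
2. Superlinear speedup from 1 to 2 threads on Threadripper
  - Cache effect
  - Data is 12MB, 2xL3 = 16MB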
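
The cache argument can be checked with a little arithmetic. The sketch below assumes `Float64` elements and an element count of `3 * 2^19` (neither is stated explicitly above; they are inferred from the 12MB figure), and takes the 8192K L3 size from the Threadripper `lscpu` output:

```julia
n = 3 * 2^19                            # assumed element count
data_mib = n * sizeof(Float64) / 2^20   # 12.0
l3_mib = 8192 / 1024                    # L3 size as reported by lscpu: 8.0

fits_one_l3 = data_mib <= l3_mib        # false: 12MB does not fit into 8MB
fits_two_l3 = data_mib <= 2 * l3_mib    # true: 12MB fits into 2 x 8MB
```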

### $3 \cdot 2^{21}$ elements (48MB)

| NT (threads) | Skylake | AMD TR |
|---|---|---|
| sum | 2.423 ms | 1.723 ms |
| 1 | 2.295 ms | 1.610 ms |
| 2 | 1.307 ms | 1.158 ms |
| 4 | 1.106 ms | 582.885 μs |
| 6 | X | 470.023 μs |
| 12 | X | 627.699 μs |
| 24 | X | 595.068 μs |

### $3 \cdot 2^{22}$ elements (96MB)

| NT (threads) | Skylake | AMD TR |
|---|---|---|
| sum | 4.970 ms | 3.477 ms |
| 1 | 4.996 ms | 3.249 ms |
| 2 | 3.867 ms | 2.241 ms |
| 4 | 2.742 ms | 1.195 ms |
| 6 | X | 1.143 ms |
| 12 | X | 1.225 ms |
| 24 | X | 1.305 ms |

## False sharing

https://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads

In [47]:
function sum_simple()
    acc = zeros(Int64, nthreads())
    @threads for tid in 1:nthreads()
        for i in 1:10_000
             acc[tid] += 1
        end
    end
    sum(acc)
end

function sum_cacheaware()
    CACHE_LINE = 64 #bytes
    elems = div(CACHE_LINE, sizeof(Int64))
    acc = zeros(Int64, nthreads()*elems)
    @threads for tid in 1:nthreads()
        store = (tid-1)*elems+1
        for i in 1:10_000
             acc[store] += 1
        end
    end
    sum(acc)
end

sum_cacheaware (generic function with 1 method)

In [48]:
@benchmark sum_simple()

BenchmarkTools.Trial: 
  memory estimate:  128 bytes
  allocs estimate:  2
  --------------
  minimum time:     12.630 μs (0.00% GC)
  median time:      13.824 μs (0.00% GC)
  mean time:        15.444 μs (0.00% GC)
  maximum time:     86.649 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

In [50]:
@benchmark sum_cacheaware()

BenchmarkTools.Trial: 
  memory estimate:  192 bytes
  allocs estimate:  2
  --------------
  minimum time:     14.048 μs (0.00% GC)
  median time:      17.409 μs (0.00% GC)
  mean time:        18.417 μs (0.00% GC)
  maximum time:     87.734 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
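
Another way to sidestep false sharing, mirroring what `threaded_sum` does, is to accumulate into a variable local to each thread's iteration and write to the shared array only once. A sketch (the name `sum_local` is illustrative, and no timing is claimed for it):

```julia
using Base.Threads

function sum_local()
    acc = zeros(Int64, nthreads())
    @threads for tid in 1:nthreads()
        local_acc = 0              # private to this iteration, not in the shared array
        for i in 1:10_000
            local_acc += 1
        end
        acc[tid] = local_acc       # a single write to the shared array
    end
    sum(acc)
end

local_total = sum_local()
```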

## A fun user submission
Help me! This code is slow.

In [40]:
using BenchmarkTools
using Base.Threads
using LinearAlgebra
using Random
println("Number of threads: ", nthreads())

function myfun(rng::MersenneTwister)
    s = 0.0
    N = 10000
    for i=1:N
        s+=det(randn(rng, 3,3))
    end
    s/N
end

rgi   = [MersenneTwister(rand(UInt)) for _ in 1:nthreads()]
function bench(rgi)
    a  = zeros(1000)
    @threads for i=1:length(a)
        a[i] = myfun(rgi[threadid()])
    end
end

result = @benchmark bench($rgi)
display(result)

BenchmarkTools.Trial: 
  memory estimate:  4.32 GiB
  allocs estimate:  40000002
  --------------
  minimum time:     4.174 s (6.70% GC)
  median time:      4.299 s (6.27% GC)
  mean time:        4.299 s (6.27% GC)
  maximum time:     4.425 s (5.86% GC)
  --------------
  samples:          2
  evals/sample:     1

Number of threads: 1


#### 1 thread

```
BenchmarkTools.Trial: 
  memory estimate:  4.32 GiB
  allocs estimate:  40000002
  --------------
  minimum time:     4.063 s (4.03% GC)
  median time:      4.217 s (3.57% GC)
  mean time:        4.217 s (3.57% GC)
  maximum time:     4.371 s (3.14% GC)
  --------------
  samples:          2
  evals/sample:     1
```

#### 4 threads

```
BenchmarkTools.Trial: 
  memory estimate:  3.52 GiB
  allocs estimate:  34212074
  --------------
  minimum time:     3.346 s (0.00% GC)
  median time:      3.960 s (10.85% GC)
  mean time:        3.960 s (10.85% GC)
  maximum time:     4.574 s (18.78% GC)
  --------------
  samples:          2
  evals/sample:     1
```

#### How did I optimise this code?
1. Memory allocations in hot-loop
2. Eliminate allocs caused by rand
3. Investigate how det is implemented
4. Implement det!
5. Remove overhead to library call
6. Use profiling tools
7. Start using StaticArrays

Full story here: https://hackmd.io/@dLigA9a4SwKmdcaQtloXXw/BkyZ5Mmbb

```julia
# `@edit det(zeros(3, 3))` shows that `det(A)` is implemented as `det(lufact(A))`, with
lufact(A, pivot = true) = lufact!(copy(A), pivot)  # copies A on every call
```

```julia
det!(A) = det(lufact!(A))                         # first version: in-place LU via LAPACK, no copy
det!(A) = det(LinearAlgebra.generic_lufact!(A))   # second version: pure Julia LU
```

`det!` originally called `lufact!` from LAPACK, which is overkill for a 3x3 matrix.
First attempt: switch to a pure Julia implementation. (`lufact` and `lufact!` are the Julia 0.6 names; later releases call them `lu` and `lu!`.)
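
For a fixed 3x3 size, the factorization can be avoided altogether. The following is a sketch only, not the code from the linked write-up: `det3` and `mean_det` are hypothetical helpers that reuse one preallocated buffer via `randn!` and expand the determinant along the first row.

```julia
using LinearAlgebra
using Random

# Determinant of a 3x3 matrix by cofactor expansion along the first row.
function det3(A::AbstractMatrix)
    @assert size(A) == (3, 3)
    return A[1,1] * (A[2,2]*A[3,3] - A[2,3]*A[3,2]) -
           A[1,2] * (A[2,1]*A[3,3] - A[2,3]*A[3,1]) +
           A[1,3] * (A[2,1]*A[3,2] - A[2,2]*A[3,1])
end

function mean_det(rng::AbstractRNG, N = 10_000)
    A = zeros(3, 3)        # allocated once, outside the hot loop
    s = 0.0
    for i in 1:N
        randn!(rng, A)     # refill the buffer in place
        s += det3(A)
    end
    return s / N
end

M = [2.0 0.0 1.0; 1.0 3.0 2.0; 1.0 1.0 4.0]
agrees = det3(M) ≈ det(M)   # compare against the LinearAlgebra implementation
```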

In [46]:
using BenchmarkTools
using StaticArrays
using LinearAlgebra  # det
using Random         # MersenneTwister
using Base.Threads
println("Number of threads: ", nthreads())

function myfun(rng::MersenneTwister)
    s = 0.0
    N = 10000
    for i=1:N
        s += det(randn(rng, SMatrix{3, 3}))
    end
    s/N
end

rgi   = [MersenneTwister(abs(rand(Int))) for s in 1:nthreads()]

function bench(rgi)
    a  = zeros(1000)
    @threads for i=1:length(a)
        @inbounds a[i] = myfun(rgi[threadid()])
    end
end

result = @benchmark bench($rgi)
display(result)

BenchmarkTools.Trial: 
  memory estimate:  7.98 KiB
  allocs estimate:  2
  --------------
  minimum time:     553.192 ms (0.00% GC)
  median time:      748.534 ms (0.00% GC)
  mean time:        741.813 ms (0.00% GC)
  maximum time:     910.000 ms (0.00% GC)
  --------------
  samples:          7
  evals/sample:     1

Number of threads: 1


### The future is near! Partr is coming
- Julia 1.2 and 1.3
- https://github.com/JuliaLang/julia/pull/32600
- https://github.com/JuliaLang/julia/pull/32477
- https://github.com/NHDaly/CspExamples.jl/blob/master/src/CspExamples.jl

In [17]:
macro par(expr)
    thunk = esc(:(()->($expr)))
    quote
        local task = Task($thunk)
        task.sticky = false
        schedule(task)
        task
    end
end

@par (macro with 1 method)

In [56]:
@par println("Hello!")

ErrorException: type Task has no field sticky

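The error above arises because this notebook ran on Julia 1.1, where `Task` has no `sticky` field. In Julia 1.3, tasks can be executed on multiple worker threads, allowing fine-grained control. This is concurrency à la Go/CSP.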

Our handy trick from above can then simply be written as:

```julia
tasks = Task[]
for tid in 1:nthreads()
    task = @par begin
        # some work
    end
    push!(tasks, task)
end
for task in tasks
    wait(task)
end
```
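
On Julia 1.3 and newer, Base ships `Threads.@spawn`, which creates a task and schedules it on any available thread, much like `@par`. A sketch of the chunked sum written with it; `spawn_sum` and its chunking scheme are illustrative, and the code assumes Julia 1.3 or newer and a non-empty array:

```julia
using Base.Threads

function spawn_sum(arr, nchunks = nthreads())
    len = cld(length(arr), nchunks)   # chunk length, rounded up
    tasks = Task[]
    for lo in 1:len:length(arr)
        hi = min(lo + len - 1, length(arr))
        # each task sums its own chunk; the scheduler picks the thread
        t = Threads.@spawn sum(view(arr, lo:hi))
        push!(tasks, t)
    end
    # fetch waits for a task and returns its result
    return sum(fetch(t) for t in tasks)
end

spawned_total = spawn_sum(collect(1:1000))
```

Unlike the `@threads` version, this does not require the array length to be divisible by the number of threads, since the last chunk is simply shorter.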