# Multithreading benchmarks for haplotyping

Julia offers multithreading capabilities, but matrix-matrix multiplication with BLAS is also inherently multithreaded. If both use their maximum number of threads, the machine may suffer from oversubscription, as discussed in [this post](https://discourse.julialang.org/t/julia-threads-vs-blas-threads/8914). In addition, there are numerous ways to parallelize the haplotyping code. Below we therefore try to find the best combination of Julia and BLAS threads for the haplotyping project.

Useful code:
+ **Check BLAS threads in Julia:** `ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ())`
+ **Set BLAS threads in Julia:** `BLAS.set_num_threads(n)`
+ **Check Julia threads in Julia:** `Threads.nthreads()`
+ **Set Julia threads (in terminal):** `export JULIA_NUM_THREADS=n`
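
As a convenience, the checks above can be collected into one snippet. This is a minimal sketch, not part of the original benchmarks: `BLAS.get_num_threads()` is only available on Julia 1.6 and later, so on the Julia 1.0.3 used below the `ccall` form is still needed, and the snippet falls back to `missing` in that case.

```julia
using LinearAlgebra

# Julia threads are fixed at startup (JULIA_NUM_THREADS, or `julia -t n` on Julia >= 1.5).
julia_threads = Threads.nthreads()

# BLAS threads can be changed at runtime.
BLAS.set_num_threads(1)

# BLAS.get_num_threads() exists on Julia >= 1.6; older versions need the ccall shown above.
blas_threads = isdefined(BLAS, :get_num_threads) ? BLAS.get_num_threads() : missing

println("Julia threads: ", julia_threads, ", BLAS threads: ", blas_threads)
```

Setting BLAS to a single thread while varying Julia threads is one way to rule out oversubscription when comparing runs.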

In [1]:
versioninfo()

Julia Version 1.0.3
Commit 099e826241 (2018-12-18 01:34 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 1


In [2]:
using BenchmarkTools
using MendelImpute
using Random
using LinearAlgebra
using Profile

┌ Info: Recompiling stale cache file /Users/biona001/.julia/compiled/v1.0/MendelImpute/DVXpm.ji for MendelImpute [e47305d1-6a61-5370-bc5d-77554d143183]
└ @ Base loading.jl:1190


In [3]:
Random.seed!(123)
n = 5000 # number of individuals
p = 500  # number of SNPs
d = 500  # number of reference haplotypes
H = convert(Matrix{Float32}, rand(0:1, p, d))
X = convert(Matrix{Float32}, rand(0:2, p, n))
M = Transpose(H) * H
for j in 1:d, i in 1:(j - 1) # off-diagonal
    M[i, j] = 2M[i, j] + M[i, i] + M[j, j]
end
for j in 1:d # diagonal
    M[j, j] *= 4
end
N = Transpose(X) * H
for I in eachindex(N)
    N[I] *= 2
end

happair  = zeros(Int, n), zeros(Int, n)
hapscore = zeros(eltype(N), n)
haplopair!(happair, hapscore, M, N)
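
Reading the setup above, `M` and `N` appear to be arranged so that the combination `M[j, k] - N[i, j] - N[i, k]` used in the loop of the next section is a least-squares criterion up to a constant. This is an inference from the code rather than something the notebook states. Writing $h_j$ for column $j$ of `H` and $x_i$ for column $i$ of `X`:

$$
\begin{aligned}
M_{jk} &= 2 h_j^\top h_k + h_j^\top h_j + h_k^\top h_k = \lVert h_j + h_k \rVert^2 \quad (j < k), \qquad M_{jj} = 4 h_j^\top h_j = \lVert 2 h_j \rVert^2, \\
N_{ij} &= 2 x_i^\top h_j, \\
M_{jk} - N_{ij} - N_{ik} &= \lVert h_j + h_k \rVert^2 - 2 x_i^\top (h_j + h_k) = \lVert x_i - h_j - h_k \rVert^2 - \lVert x_i \rVert^2 .
\end{aligned}
$$

Since $\lVert x_i \rVert^2$ does not depend on $(j, k)$, minimizing the left-hand side over haplotype pairs is the same as minimizing $\lVert x_i - h_j - h_k \rVert^2$. Only the upper triangle of `M` ($j \le k$) carries these values, which matches the `j in 1:k` range of the search loop.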

## Multithreading in finding optimal haplotypes in each window over multiple individuals

This is accomplished by threading the outermost loop, which runs over the second haplotype index `k`:

```Julia
Threads.@threads for k in 1:d
    @inbounds for j in 1:k
        # loop over individuals
        for i in 1:n
            score = M[j, k] - N[i, j] - N[i, k]
            if score < hapmin[i]
                hapmin[i], happair[1][i], happair[2][i] = score, j, k
            end
        end
    end
end
```
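
In the loop as shown, threads working on different values of `k` can read and update the same `hapmin[i]` at the same time, so the result may depend on thread timing. One possible alternative, sketched here under the assumption that each individual can be searched independently, is to thread over individuals so that every thread writes only to its own `i`. The function name `haplopair_by_individual!` is hypothetical and is not part of MendelImpute; its performance was not measured in this notebook.

```julia
# Hypothetical race-free variant: thread over individuals instead of haplotypes.
function haplopair_by_individual!(happair, hapmin, M, N)
    n, d = size(N)
    Threads.@threads for i in 1:n
        # Thread-local running minimum for individual i.
        best, bj, bk = typemax(eltype(hapmin)), 0, 0
        @inbounds for k in 1:d, j in 1:k
            score = M[j, k] - N[i, j] - N[i, k]
            if score < best
                best, bj, bk = score, j, k
            end
        end
        # Each i is written by exactly one thread.
        hapmin[i], happair[1][i], happair[2][i] = best, bj, bk
    end
    return happair, hapmin
end

# Small usage example with random inputs (not the benchmark data).
M_demo = rand(Float32, 5, 5)
N_demo = rand(Float32, 8, 5)
pairs_demo, mins_demo = haplopair_by_individual!((zeros(Int, 8), zeros(Int, 8)), zeros(Float32, 8), M_demo, N_demo)
```

The trade-off is memory access: for a fixed `i`, `N[i, j]` strides across columns of a column-major array, so storing `N` transposed may be worth trying if this layout turns out to be slow.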

### 1 Julia thread, 8 BLAS threads (default setup)

In [4]:
#BLAS threads
ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ())

8

In [5]:
#julia threads
Threads.nthreads()

1

In [6]:
@benchmark haplopair!($happair, $hapscore, $M, $N) seconds=30

BenchmarkTools.Trial: 
  memory estimate:  64 bytes
  allocs estimate:  1
  --------------
  minimum time:     422.589 ms (0.00% GC)
  median time:      430.402 ms (0.00% GC)
  mean time:        432.180 ms (0.00% GC)
  maximum time:     459.712 ms (0.00% GC)
  --------------
  samples:          70
  evals/sample:     1

### 16 Julia threads (max), 1~8 BLAS threads

In [4]:
Threads.nthreads()

16

In [4]:
blas_threads = collect(1:8)
for n in blas_threads
    BLAS.set_num_threads(n)
    println("$n BLAS threads: \n")
    @btime haplopair!($happair, $hapscore, $M, $N) seconds=30
    println("\n")
end

1 BLAS threads: 

  100.901 ms (1 allocation: 64 bytes)


2 BLAS threads: 

  106.044 ms (1 allocation: 64 bytes)


3 BLAS threads: 

  107.748 ms (1 allocation: 64 bytes)


4 BLAS threads: 

  108.500 ms (1 allocation: 64 bytes)


5 BLAS threads: 

  109.335 ms (1 allocation: 64 bytes)


6 BLAS threads: 

  108.418 ms (1 allocation: 64 bytes)


7 BLAS threads: 

  107.891 ms (1 allocation: 64 bytes)


8 BLAS threads: 

  109.018 ms (1 allocation: 64 bytes)




### 8 Julia threads, 1~8 BLAS threads

In [4]:
Threads.nthreads()

8

In [5]:
blas_threads = collect(1:8)
for n in blas_threads
    BLAS.set_num_threads(n)
    println("$n BLAS threads: \n")
    @btime haplopair!($happair, $hapscore, $M, $N) seconds=30
    println("\n")
end

1 BLAS threads: 

  135.831 ms (1 allocation: 64 bytes)


2 BLAS threads: 

  135.022 ms (1 allocation: 64 bytes)


3 BLAS threads: 

  144.304 ms (1 allocation: 64 bytes)


4 BLAS threads: 

  139.879 ms (1 allocation: 64 bytes)


5 BLAS threads: 

  140.054 ms (1 allocation: 64 bytes)


6 BLAS threads: 

  139.810 ms (1 allocation: 64 bytes)


7 BLAS threads: 

  139.915 ms (1 allocation: 64 bytes)


8 BLAS threads: 

  144.358 ms (1 allocation: 64 bytes)




### 4 Julia threads, 1~8 BLAS threads

In [4]:
Threads.nthreads()

4

In [5]:
blas_threads = collect(1:8)
for n in blas_threads
    BLAS.set_num_threads(n)
    println("$n BLAS threads: \n")
    @btime haplopair!($happair, $hapscore, $M, $N) seconds=30
    println("\n")
end

1 BLAS threads: 

  224.559 ms (1 allocation: 64 bytes)


2 BLAS threads: 

  212.941 ms (1 allocation: 64 bytes)


3 BLAS threads: 

  213.031 ms (1 allocation: 64 bytes)


4 BLAS threads: 

  212.990 ms (1 allocation: 64 bytes)


5 BLAS threads: 

  214.731 ms (1 allocation: 64 bytes)


6 BLAS threads: 

  213.046 ms (1 allocation: 64 bytes)


7 BLAS threads: 

  213.081 ms (1 allocation: 64 bytes)


8 BLAS threads: 

  218.415 ms (1 allocation: 64 bytes)
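
To compare the configurations at a glance, the timings above can be turned into speedups. The sketch below copies the minimum times reported in this notebook (the `@benchmark` minimum for 1 Julia thread, and the `@btime` results with 1 BLAS thread for the threaded runs) and adds nothing beyond that arithmetic. "Parallel efficiency" here simply means speedup divided by the number of Julia threads.

```julia
# Minimum times (ms) copied from the runs above.
baseline = 422.589  # 1 Julia thread, 8 BLAS threads
threaded = [4 => 224.559, 8 => 135.831, 16 => 100.901]  # Julia threads => time with 1 BLAS thread

for (nt, t) in threaded
    speedup = baseline / t
    println(nt, " Julia threads: ", round(speedup, digits=2), "x speedup, ",
            round(100 * speedup / nt, digits=1), "% parallel efficiency")
end
```

Within each Julia thread count, the BLAS thread setting changes the time by only a few percent in these runs.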




# 7/10/2020 update

On small problems (e.g. compare 1~3), using 8 threads to parallelize across windows achieves a 4~5x speedup. However, on large problems (e.g. HRC chr20), we only get a 2x speedup with 10 threads.

Related discussions:
+ [This post](https://discourse.julialang.org/t/multiply-many-matrices-by-many-vectors/18542/17) discusses multiplying many matrices by many vectors. They got linear scaling with the number of cores when the output matrices were preallocated.
+ [This post](https://discourse.julialang.org/t/poor-performance-on-cluster-multithreading/12248/9) suggests that allocations are very bad for multithreaded loops because the GC is single-threaded.

In [1]:
using LinearAlgebra, BenchmarkTools, Random
BLAS.set_num_threads(1)

In [2]:
function mult_reduce(M::Vector{Matrix{Float64}})
    n = length(M)
    s = 0.0
    for i in 1:n
        tmp = Transpose(M[i]) * M[i] # calculate Mi' Mi
        s  += sum(tmp)               # reduction
    end
    return s
end

function mult_reduce_threaded(M::Vector{Matrix{Float64}})
    n = length(M)
    s = zeros(Threads.nthreads())
    Threads.@threads for i in 1:n
        tmp = Transpose(M[i]) * M[i]      # calculate Mi' Mi        
        s[Threads.threadid()] += sum(tmp) # reduction on each thread         
    end
    return sum(s) # reduce across threads
end

mult_reduce_threaded (generic function with 1 method)
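
Both functions above allocate a fresh `tmp` on every iteration. Following the preallocation advice in the linked discussions, one could instead write the products into buffers created once up front. The sketch below is an assumption about how that might look, not a variant benchmarked in this notebook; `make_buffers` and `mult_reduce_prealloc!` are hypothetical names. It also stores one partial sum per matrix rather than per thread, so no two iterations write to the same slot.

```julia
using LinearAlgebra

# Hypothetical helper: one output buffer per input matrix, allocated once up front.
make_buffers(M) = [Matrix{Float64}(undef, size(A, 2), size(A, 2)) for A in M]

function mult_reduce_prealloc!(bufs, partial, M)
    Threads.@threads for i in 1:length(M)
        mul!(bufs[i], Transpose(M[i]), M[i]) # writes Mi' Mi into the existing buffer
        partial[i] = sum(bufs[i])            # one slot per matrix, so no shared writes
    end
    return sum(partial) # reduce across matrices
end

# Usage on small random matrices (not the benchmark data).
M_small = [rand(k, k) for k in rand(100:200, 100)]
bufs, partial = make_buffers(M_small), zeros(length(M_small))
total = mult_reduce_prealloc!(bufs, partial, M_small)
```

Keeping one buffer per matrix holds every product in memory at once. When that is too much, a single buffer per thread sized for the largest matrix is another option.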

In [3]:
# generate (small) test data
Random.seed!(2020)
n = 100
M = Vector{Matrix{Float64}}(undef, n)
for i in 1:n
    k = rand(100:200) # random matrix dimension
    M[i] = rand(k, k)
end

In [4]:
@btime mult_reduce(M) # 1 thread

  18.225 ms (200 allocations: 18.71 MiB)


1.0128044609863697e8

In [4]:
@btime mult_reduce_threaded(M) # 8 threads

  4.451 ms (259 allocations: 18.72 MiB)


1.0128044609863696e8

In [5]:
# generate (large) test data
Random.seed!(2020)
n = 100
M = Vector{Matrix{Float64}}(undef, n)
for i in 1:n
    k = rand(1000:3000) # random matrix dimension
    M[i] = rand(k, k)
end

In [6]:
@btime mult_reduce(M) # 1 thread

  21.042 s (200 allocations: 3.12 GiB)


2.368913048551288e11

In [6]:
@btime mult_reduce_threaded(M) # 8 threads

  9.937 s (275 allocations: 3.12 GiB)


2.3689130485512885e11
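
The toy example shows the same pattern as the haplotyping runs: the speedup from 8 threads shrinks as the matrices grow. A small calculation on the timings reported above makes this explicit. The variable names are illustrative only.

```julia
# Timings copied from the @btime output above, in seconds.
small_serial, small_threaded = 18.225e-3, 4.451e-3
large_serial, large_threaded = 21.042, 9.937

small_speedup = small_serial / small_threaded
large_speedup = large_serial / large_threaded

println("small matrices: ", round(small_speedup, digits=2), "x with 8 threads")
println("large matrices: ", round(large_speedup, digits=2), "x with 8 threads")
```

In both cases the threaded version still allocates about as much memory as the serial one (18.72 MiB and 3.12 GiB), which is consistent with the allocation concern raised in the related discussions, although these numbers alone do not prove that allocation is the cause.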