# Linear algebra of SnpArray

SnpArrays.jl supports three modes of matrix-vector multiplication:

1. Direct operations on a plink-formatted `SnpArray`: `SnpLinAlg`
2. Operations on transformed `BitMatrix`es: `SnpBitMatrix`
3. Direct operations on plink-formatted data on an Nvidia GPU: `CuSnpArray`

`SnpLinAlg` and `SnpBitMatrix` use Chris Elrod's [LoopVectorization.jl](https://github.com/chriselrod/LoopVectorization.jl) internally, which makes them much faster on machines with AVX support. `CuSnpArray` uses [CUDA.jl](https://juliagpu.gitlab.io/CUDA.jl/) internally.
This page compares the three.

In [1]:
versioninfo()

Julia Version 1.4.1
Commit 381693d3df* (2020-04-14 17:20 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)


In [2]:
using SnpArrays

In [3]:
const EUR = SnpArray(SnpArrays.datadir("EUR_subset.bed"))

379×54051 SnpArray:
 0x03  0x03  0x03  0x02  0x02  0x03  …  0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x02  0x03  0x02  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x02  0x02  0x02  0x03  0x03  0x02
 0x03  0x03  0x03  0x00  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x00  0x03  0x03     0x02  0x02  0x02  0x03  0x03  0x03
 0x02  0x03  0x03  0x03  0x03  0x03  …  0x03  0x03  0x03  0x03  0x03  0x02
 0x02  0x03  0x03  0x02  0x02  0x03     0x03  0x03  0x02  0x02  0x03  0x03
 0x02  0x03  0x03  0x03  0x02  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x00  0x02  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x03  0x03  0x02  0x03  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x02  0x03  0x03  …  0x03  0x03  0x02  0x02  0x03  0x03
 0x03  0x03  0x03  0x02  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x02
 0x03  0x02  0x03  0x02  0x02  0x03     0x03  0x03  0x03  0x03  0x03  0x03
    ⋮
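
The bytes displayed above are raw 2-bit PLINK genotype codes, not numbers to multiply directly. As a minimal sketch of what a matrix-vector product means under the additive model, the following plain-Julia example decodes a toy matrix into allele counts and multiplies it by a vector. The helper `dosage` is hypothetical and not part of SnpArrays.jl, and mapping the missing code `0x01` to zero is an assumption made only to keep the example short.

```julia
# Hypothetical helper (not a SnpArrays.jl function): map a 2-bit PLINK
# code to an allele count under the additive model.
function dosage(code::UInt8)
    code == 0x02 && return 1.0   # heterozygous: one copy
    code == 0x03 && return 2.0   # homozygous: two copies
    return 0.0                   # 0x00 (zero copies); 0x01 (missing) is
                                 # also sent to zero in this sketch only
end

# A toy 3×2 genotype matrix in raw PLINK codes.
G = UInt8[0x03 0x02;
          0x00 0x03;
          0x02 0x01]

A = dosage.(G)       # dense Float64 matrix of allele counts
x = [1.0, 10.0]
y = A * x            # the product y = Ax for the toy data
```

The types benchmarked on this page are meant to compute such products without ever building the dense matrix `A`.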

Let's try with the EUR data repeated 100 times, which gives a 37900×54051 matrix.

In [4]:
EUR_10 = [EUR; EUR; EUR; EUR; EUR; EUR; EUR; EUR; EUR; EUR]
EUR_100 = [EUR_10; EUR_10; EUR_10; EUR_10; EUR_10; EUR_10; EUR_10; EUR_10; EUR_10; EUR_10]

37900×54051 SnpArray:
 0x03  0x03  0x03  0x02  0x02  0x03  …  0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x02  0x03  0x02  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x02  0x02  0x02  0x03  0x03  0x02
 0x03  0x03  0x03  0x00  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x00  0x03  0x03     0x02  0x02  0x02  0x03  0x03  0x03
 0x02  0x03  0x03  0x03  0x03  0x03  …  0x03  0x03  0x03  0x03  0x03  0x02
 0x02  0x03  0x03  0x02  0x02  0x03     0x03  0x03  0x02  0x02  0x03  0x03
 0x02  0x03  0x03  0x03  0x02  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x00  0x02  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x03  0x03  0x02  0x03  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x02  0x03  0x03  …  0x03  0x03  0x02  0x02  0x03  0x03
 0x03  0x03  0x03  0x02  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x02
 0x03  0x02  0x03  0x02  0x02  0x03     0x03  0x03  0x03  0x03  0x03  0x03
    ⋮

We create instances of `SnpLinAlg`, `SnpBitMatrix` and `CuSnpArray`:

In [5]:
EUR_100_bm = SnpBitMatrix{Float64}(EUR_100; model=ADDITIVE_MODEL, center=false, scale=false)
EUR_100_sla = SnpLinAlg{Float64}(EUR_100; model=ADDITIVE_MODEL, center=false, scale=false);

In [6]:
ENV["JULIA_CUDA_USE_BINARYBUILDER"] = "false"
using CUDA
EUR_100_cu = CuSnpArray{Float64}(EUR_100; model=ADDITIVE_MODEL, center=false, scale=false);


## $y = Ax$

In [7]:
using LinearAlgebra
using BenchmarkTools

In [8]:
v1 = randn(size(EUR_100, 1))
v2 = randn(size(EUR_100, 2));

Direct linear algebra on a `SnpArray` with `SnpLinAlg`:

In [9]:
@benchmark LinearAlgebra.mul!($v1, $EUR_100_sla, $v2)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.455 s (0.00% GC)
  median time:      2.460 s (0.00% GC)
  mean time:        2.462 s (0.00% GC)
  maximum time:     2.470 s (0.00% GC)
  --------------
  samples:          3
  evals/sample:     1

The benchmark for `SnpBitMatrix`:

In [10]:
@benchmark (LinearAlgebra.mul!($v1, $EUR_100_bm, $v2))

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.301 s (0.00% GC)
  median time:      1.306 s (0.00% GC)
  mean time:        1.306 s (0.00% GC)
  maximum time:     1.311 s (0.00% GC)
  --------------
  samples:          4
  evals/sample:     1

`SnpBitMatrix` is about twice as fast as `SnpLinAlg` here (median 1.306 s versus 2.460 s). Note, however, that it allocates additional `BitMatrix`es of the same size as the plink file itself.
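
To illustrate why a bit-matrix representation can take as much memory as the plink data, here is a minimal plain-Julia sketch of one way to express additive allele counts with two `BitMatrix`es, each using one bit per genotype. It assumes the counts split as "at least one copy" plus "two copies"; the actual internal layout of `SnpBitMatrix` is not shown on this page, so treat this as an illustration rather than the package's implementation.

```julia
using LinearAlgebra

# Allele counts (0, 1 or 2) for a toy 3×2 genotype matrix.
A = [2.0 1.0;
     0.0 2.0;
     1.0 0.0]

# Two indicator matrices, one bit per genotype each.
B1 = A .>= 1     # true where there is at least one copy
B2 = A .>= 2     # true where there are two copies
# Both are BitMatrix values, and A == B1 + B2 elementwise.

x = [1.0, 10.0]
y_dense = A * x              # product with the dense matrix
y_bits  = B1 * x + B2 * x    # same product from the two bit matrices
```

Two bits per genotype is also what the plink format uses, which is consistent with the memory remark above.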

Now let's try CUDA. The device is an Nvidia Titan V.

In [11]:
using Adapt

Moving data to GPU: 

In [12]:
v1_d = adapt(CuArray{Float64}, v1)
v2_d = adapt(CuArray{Float64}, v2);

In [13]:
using BenchmarkTools
@benchmark LinearAlgebra.mul!($v1_d, $EUR_100_cu, $v2_d)


BenchmarkTools.Trial: 
  memory estimate:  1.44 KiB
  allocs estimate:  54
  --------------
  minimum time:     22.021 ms (0.00% GC)
  median time:      22.162 ms (0.00% GC)
  mean time:        22.666 ms (0.00% GC)
  maximum time:     44.924 ms (0.00% GC)
  --------------
  samples:          221
  evals/sample:     1

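By median time, the GPU (22.2 ms) is roughly 110 times faster than `SnpLinAlg` (2.46 s) and roughly 59 times faster than `SnpBitMatrix` (1.31 s). Let's check correctness: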

In [14]:
isapprox(collect(v1_d), v1)

true
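
The check uses `isapprox` rather than `==` because two implementations of the same product may accumulate sums in different orders, so their results can differ in the last few bits. The following sketch uses made-up vectors, not the benchmark data, to show how the default relative tolerance of `isapprox` behaves; the perturbation size `1e-12` is an arbitrary choice for illustration.

```julia
using LinearAlgebra

a = [1.0, 2.0, 3.0]
b = a .+ 1e-12        # tiny perturbation, standing in for rounding differences

exactly_equal = (a == b)                      # elementwise equality is too strict
same_default  = isapprox(a, b)                # default rtol is sqrt(eps(Float64))
same_strict   = isapprox(a, b; rtol = 1e-15)  # a much tighter tolerance
```

For vectors, `isapprox` compares `norm(a - b)` against the tolerance scaled by the larger of the two norms, so it summarizes the whole vector in one test.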

## $y = A^T x$

In [15]:
v1 = randn(size(EUR_100, 1))
v2 = randn(size(EUR_100, 2))
v1_d = adapt(CuArray{Float64}, v1)
v2_d = adapt(CuArray{Float64}, v2);

In [16]:
@benchmark LinearAlgebra.mul!($v2, transpose($EUR_100_sla), $v1)

BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     1.601 s (0.00% GC)
  median time:      1.605 s (0.00% GC)
  mean time:        1.607 s (0.00% GC)
  maximum time:     1.616 s (0.00% GC)
  --------------
  samples:          4
  evals/sample:     1

In [17]:
@benchmark (LinearAlgebra.mul!($v2, transpose($EUR_100_bm), $v1))

BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     601.766 ms (0.00% GC)
  median time:      609.034 ms (0.00% GC)
  mean time:        617.225 ms (0.00% GC)
  maximum time:     657.481 ms (0.00% GC)
  --------------
  samples:          9
  evals/sample:     1

In [18]:
@benchmark LinearAlgebra.mul!($v2_d, transpose($EUR_100_cu), $v1_d)

BenchmarkTools.Trial: 
  memory estimate:  1.42 KiB
  allocs estimate:  53
  --------------
  minimum time:     27.659 ms (0.00% GC)
  median time:      27.871 ms (0.00% GC)
  mean time:        28.316 ms (0.00% GC)
  maximum time:     33.411 ms (0.00% GC)
  --------------
  samples:          177
  evals/sample:     1

In [19]:
isapprox(collect(v2_d), v2)

true
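
The transposed benchmarks above pass `transpose(...)` directly to `mul!`. In base Julia, `transpose` of a matrix returns a lazy wrapper instead of a copy, so the product can be computed in place without materializing the transposed matrix. Here is a minimal sketch with an ordinary dense matrix; whether the SnpArrays.jl types use the same wrapper type internally is an assumption not verified here.

```julia
using LinearAlgebra

A = [2.0 1.0;
     0.0 2.0;
     1.0 0.0]
x = [1.0, 2.0, 3.0]

At = transpose(A)      # lazy wrapper that shares memory with A
out = zeros(2)
mul!(out, At, x)       # writes transpose(A) * x into out
```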