# Example of Random M Variance Components Simulation

Authors: Sarah Ji, Hua Zhou, Janet Sinsheimer, Kenneth Lange

In this notebook we demonstrate how to simulate traits under the linear mixed model (LMM) / variance component model (VCM) framework with the parameters below. We then benchmark against the same simulation process using the `MatrixNormal` distribution in the Distributions.jl package.

In [1]:
@show n = 1000   # no. observations
@show d = 2      # dimension of responses
@show m = 10      # no. variance components
@show p = 2;      # no. covariates

n = 1000 = 1000
d = 2 = 2
m = 10 = 10
p = 2 = 2


2

### Double check that you are using Julia version 1.0 or higher by checking the machine information

In [2]:
versioninfo()

Julia Version 1.2.0
Commit c6da87ff4b (2019-08-20 00:03 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.6.0)
  CPU: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)


In [3]:
using DataFrames, Random, LinearAlgebra, TraitSimulation, Distributions, BenchmarkTools
Random.seed!(1234);


# Generating Random Design Matrix, Coefficient Vector and Variance Component Matrices

Here, for m = 10 random variance components, we generate m random covariance matrices, a random design matrix, and a p-by-d matrix of regression coefficients to illustrate the simulation of a d-dimensional response for a sample of n = 1000 people.
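
For reference, the model being simulated can be written as follows. This is our reading of the standard variance component formulation, not a statement taken from the package documentation; the error terms $E_i$ are notation introduced here for illustration.

$$
Y = XB + \sum_{i=1}^{m} E_i, \qquad \operatorname{vec}(E_i) \sim N(0, \Sigma_i \otimes V_i),
$$

so that

$$
\operatorname{vec}(Y) \sim N\Big(\operatorname{vec}(XB), \; \sum_{i=1}^{m} \Sigma_i \otimes V_i\Big),
$$

where $Y$ is n-by-d, $X$ is n-by-p, $B$ is p-by-d, each $\Sigma_i$ is d-by-d, and each $V_i$ is n-by-n.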


In [4]:
# n-by-p design matrix
X = randn(n, p)

# p-by-d mean component regression coefficient
B = ones(p, d)  

# a tuple of m covariance matrices
V = ntuple(x -> zeros(n, n), m) 
for i = 1:m-1
  Vi = [j ≥ i ? i * (n - j + 1) : j * (n - i + 1) for i in 1:n, j in 1:n]
  copy!(V[i], Vi * Vi')
end
copy!(V[m], Diagonal(ones(n))) # last covariance matrix is the identity

# a tuple of m d-by-d variance component parameters
Σ = ntuple(x -> zeros(d, d), m) 
for i in 1:m
  Σi = [j ≥ i ? i * (d - j + 1) : j * (d - i + 1) for i in 1:d, j in 1:d]
  copy!(Σ[i], Σi' * Σi)
end

Random_VCM_Trait = DataFrame(VCM_simulation(X, B, Σ, V), [:SimTrait1, :SimTrait2])

| Row | SimTrait1 | SimTrait2 |
|-----|-----------|-----------|
|     | Float64   | Float64   |
| 1   | -0.141849 | -0.141849 |
| 2   | -1.37707  | -1.37707  |
| 3   | -1.25452  | -1.25452  |
| 4   | -1.46763  | -1.46763  |
| 5   | 1.04345   | 1.04345   |
| 6   | 1.16475   | 1.16475   |
| 7   | 1.37619   | 1.37619   |
| 8   | -0.900015 | -0.900015 |
| 9   | -1.26027  | -1.26027  |
| 10  | 0.813623  | 0.813623  |
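
The covariance matrices in cell In [4] are built as the product of a matrix with its own transpose, which makes them symmetric and positive semidefinite. The sketch below repeats that construction at a small, hypothetical size (`n_small = 4`, chosen only for illustration) and checks both properties. The names `n_small`, `K`, and `V_small` are ours and do not appear in the notebook.

```julia
using LinearAlgebra

n_small = 4  # hypothetical small size, for illustration only

# Same entry formula as in the notebook: min(i, j) * (n - max(i, j) + 1)
K = [j ≥ i ? i * (n_small - j + 1) : j * (n_small - i + 1) for i in 1:n_small, j in 1:n_small]

# A matrix times its own transpose is symmetric positive semidefinite
V_small = Float64.(K * K')

@show issymmetric(V_small)
@show isposdef(V_small)
```

At n = 1000 the same product can become badly conditioned, so a floating-point `isposdef` check may be less reliable there than at this small size.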


# Benchmarking Against the MatrixNormal Distribution in Distributions.jl

In our `VarianceComponent` type, we store the Cholesky decomposition of each $\Sigma_i$ and $V_i$, so the factorizations are computed once, outside the simulation, when `vc_vector` (a vector of `VarianceComponent` objects) is built. This matters because, more often than not, users have to run the simulation many times to reach their goal. The benchmarks below show that when we simulate traits n_reps times, using the `VarianceComponent` type is faster and more memory efficient than calling the `MatrixNormal` distribution from Distributions.jl m times, for the following parameters:

In [6]:
@show n # sample size
@show m # number of random variance components
@show d # number of traits
@show p; # number of fixed effects

n = 1000
m = 10
d = 2
p = 2
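
To make the idea of caching Cholesky factors concrete, here is a minimal sketch in plain Julia. It is not the TraitSimulation implementation: the type `CachedVC` and the functions `cache_vc` and `draw` are hypothetical names introduced only for this illustration, and the sizes are deliberately small. Each draw adds $L_{V_i} Z_i L_{\Sigma_i}^T$ to the mean, where $Z_i$ has independent standard normal entries, so that the vectorized term has covariance $\Sigma_i \otimes V_i$.

```julia
using LinearAlgebra, Random

# Hypothetical container (not the TraitSimulation type): holds precomputed Cholesky factors.
struct CachedVC
    LΣ::Matrix{Float64}  # lower factor, Σ = LΣ * LΣ'
    LV::Matrix{Float64}  # lower factor, V = LV * LV'
end

# Factorize once, up front.
cache_vc(Σ, V) = CachedVC(Matrix(cholesky(Symmetric(Σ)).L), Matrix(cholesky(Symmetric(V)).L))

# One draw of μ + Σ_i LV_i * Z_i * LΣ_i', with i.i.d. N(0, 1) entries in each Z_i.
function draw(rng::AbstractRNG, μ::AbstractMatrix, vcs::Vector{CachedVC})
    Y = Float64.(μ)          # fresh copy of the mean
    Z = similar(Y)           # reusable buffer for the standard normal draws
    for vc in vcs
        randn!(rng, Z)
        Y .+= vc.LV * Z * vc.LΣ'
    end
    return Y
end

rng = MersenneTwister(1234)
n_small, d_small = 5, 2                  # hypothetical small sizes
A = randn(rng, n_small, n_small)
V_demo = A * A' + n_small * I            # symmetric positive definite by construction
Σ_demo = [2.0 0.5; 0.5 1.0]
vcs = [cache_vc(Σ_demo, V_demo)]
μ = zeros(n_small, d_small)

Y1 = draw(rng, μ, vcs)   # repeated draws reuse the cached factors
Y2 = draw(rng, μ, vcs)
```

Because the factorizations happen in `cache_vc`, repeated calls to `draw` only pay for random number generation and matrix products.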


## Compare for m = 1 variance component

With only one variance component, simulating this bivariate trait uses roughly 4x less memory (7.66 MiB vs. 30.76 MiB) and is roughly 5x faster by median time (3.2 ms vs. 15.7 ms) than the Distributions.jl approach.

In [7]:
LMMtraitobj = LMMTrait(X*B, VarianceComponent(Σ[1], V[1]))
@benchmark simulate(LMMtraitobj)

BenchmarkTools.Trial: 
  memory estimate:  7.66 MiB
  allocs estimate:  4
  --------------
  minimum time:     2.837 ms (0.00% GC)
  median time:      3.214 ms (0.00% GC)
  mean time:        4.038 ms (19.96% GC)
  maximum time:     12.038 ms (37.68% GC)
  --------------
  samples:          1235
  evals/sample:     1

In [8]:
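# Baseline: draw from MatrixNormal(mean, row covariance, column covariance) in Distributions.jl
function MN_J(X, B, V, Σ; n_reps = 1)
    n, d = size(X*B)    # n observations, d traits
    # NOTE: this preallocates n matrices, not n_reps; the memory estimate below includes them
    sim = [zeros(n, d) for i in 1:n]
    for i in 1:n_reps
        sim[i] = rand(MatrixNormal(X*B, V, Σ))
    end
    return sim
end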

@benchmark MN_J($X, $B, $V[1], $Σ[1])

BenchmarkTools.Trial: 
  memory estimate:  30.76 MiB
  allocs estimate:  1024
  --------------
  minimum time:     13.612 ms (16.46% GC)
  median time:      15.681 ms (21.03% GC)
  mean time:        16.957 ms (21.45% GC)
  maximum time:     74.202 ms (84.04% GC)
  --------------
  samples:          295
  evals/sample:     1

## Compare simulation for m = 10 variance components

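With m = 10 variance components, we still use about 2x less memory (76.51 MiB vs. 169.07 MiB) and are about 3.7x faster by median time (34.4 ms vs. 126.8 ms) than the Distributions.jl approach.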

In [9]:
vc_vector = [VarianceComponent(Σ[i], V[i]) for i in eachindex(V)]
@benchmark VCM_simulation($X, $B, $vc_vector)

BenchmarkTools.Trial: 
  memory estimate:  76.51 MiB
  allocs estimate:  35
  --------------
  minimum time:     29.629 ms (12.73% GC)
  median time:      34.381 ms (12.79% GC)
  mean time:        37.353 ms (12.70% GC)
  maximum time:     68.097 ms (11.66% GC)
  --------------
  samples:          134
  evals/sample:     1

In [10]:
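# Baseline for m components: sum one MatrixNormal draw per variance component
function MN_Jm(X, B, V, Σ; n_reps = 1)
    n, d = size(X*B)    # n observations, d traits
    m = length(V)
    # NOTE: this preallocates n matrices, not n_reps; the memory estimate below includes them
    sim = [zeros(n, d) for i in 1:n]
    for i in 1:n_reps
        for j in 1:m
            # NOTE: each draw is centered at X*B, so the sum has mean m * X*B
            dist = MatrixNormal(X*B, V[j], Σ[j])
            sim[i] += rand(dist)
        end
    end
    return sim
end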

@benchmark vecs = MN_Jm($X, $B, $V, $Σ)

BenchmarkTools.Trial: 
  memory estimate:  169.07 MiB
  allocs estimate:  1232
  --------------
  minimum time:     103.212 ms (9.49% GC)
  median time:      126.816 ms (8.67% GC)
  mean time:        137.321 ms (24.76% GC)
  maximum time:     199.573 ms (44.95% GC)
  --------------
  samples:          37
  evals/sample:     1
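
The memory and speed comparisons can be recomputed directly from the figures reported in the four benchmark outputs above. The sketch below copies those numbers by hand (memory in MiB, median time in ms) and takes the ratio of the Distributions.jl run to the `VarianceComponent` run; the `results` named tuple and its field names are ours, introduced only for this calculation.

```julia
# Figures copied from the BenchmarkTools output in this notebook
# (memory in MiB, median time in ms).
results = (
    m1  = (vc = (mem = 7.66,  t = 3.214),  mn = (mem = 30.76,  t = 15.681)),
    m10 = (vc = (mem = 76.51, t = 34.381), mn = (mem = 169.07, t = 126.816)),
)

for (label, r) in pairs(results)
    mem_ratio  = round(r.mn.mem / r.vc.mem, digits = 1)
    time_ratio = round(r.mn.t / r.vc.t, digits = 1)
    println(label, ": ", mem_ratio, "x memory, ", time_ratio, "x median time")
end
```

Timings vary between machines and runs, so these ratios describe only the run recorded here.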

## Citations: 

[1] Lange K, Papp JC, Sinsheimer JS, Sripracha R, Zhou H, Sobel EM (2013) Mendel: The Swiss army knife of genetic analysis programs. Bioinformatics 29:1568-1570.


[2] OPENMENDEL: a cooperative programming project for statistical genetics.
[Hum Genet. 2019 Mar 26. doi: 10.1007/s00439-019-02001-z](https://www.ncbi.nlm.nih.gov/pubmed/?term=OPENMENDEL).
