# Multiple Traits, Multiple Variance Components? Easy.

This example extends the standard genetic variance component model (VCM) to efficiently account for any number of random effects beyond the additive genetic and environmental variance components (that is, more than 2 variance components). Suppose we have $m \geq 2$ variance components for $d$ correlated traits measured on $n$ related people. Under the VCM, users specify the model as follows:

$$Y_{n \times d} \sim \text{MatrixNormal}(\mathbf{M}_{n \times d} = XB, \Omega_{nd \times nd} = \sum_{k=1}^m \Sigma_k \otimes V_k)$$

where $\Omega$ is the covariance matrix of the $nd \times 1$ vector $\text{vec}(Y)$.

The model **data** can be input under the standard [VarianceComponentModels.jl](https://github.com/OpenMendel/VarianceComponentModels.jl/) framework as follows:

* `Y`: `n x d` response (phenotype)
* `X`: `n x p` covariate matrix
* `V = (V1, ..., Vm)`: a tuple of `m` `n x n` covariance matrices

and the **parameters** are

* `B`: `p x d` mean parameter
* `Σ = (Σ1, ..., Σm)`: a tuple of `m` `d x d` variance components.
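
To make the Kronecker structure concrete, the following minimal sketch builds $\Omega$ for a toy problem using only Julia's standard library. The sizes and the helper name `toy_spd` are illustrative assumptions and are not part of TraitSimulation.jl. For realistic `n`, the `nd x nd` matrix would normally not be formed explicitly.

```julia
using LinearAlgebra, Random

Random.seed!(1)

# Hypothetical helper (illustration only): a small symmetric positive definite matrix
function toy_spd(k)
    A = randn(k, k)
    return A * A' + k * I
end

n, d, m = 4, 2, 3                      # toy sizes, chosen only for readability
V = ntuple(_ -> toy_spd(n), m)         # m covariance matrices, each n-by-n
Σ = ntuple(_ -> toy_spd(d), m)         # m variance components, each d-by-d

# Ω = Σ₁ ⊗ V₁ + ... + Σₘ ⊗ Vₘ is the nd-by-nd covariance of vec(Y)
Ω = sum(kron(Σ[k], V[k]) for k in 1:m)

size(Ω)                  # equals (n * d, n * d)
isposdef(Symmetric(Ω))   # a sum of Kronecker products of SPD matrices is SPD
```

Each `kron(Σ[k], V[k])` term contributes the between-trait covariance `Σ[k]` scaled by the between-person covariance `V[k]`.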

In this example we show alternative ways to specify the simulation parameters for the VCM and benchmark the simulation against the existing alternative, the `MatrixNormal` distribution in Julia's [Distributions.jl](https://juliastats.org/Distributions.jl/latest/matrix/#Distributions.MatrixNormal) package.


For users who want a reference on genetic modeling, we recommend [Mathematical And Statistical Methods For Genetic Analysis](http://www.biometrica.tomsk.ru/lib/lange_1.pdf) by Dr. Kenneth Lange. Chapter 8 of this book introduces variance component models in the genetic setting. For a more in-depth review of variance component modeling in the genetic setting, we include a reference at the end of the notebook [4].

In [1]:
using LinearAlgebra, Random, TraitSimulation, DataFrames, Distributions, BenchmarkTools

Here, for m = 10 variance components, we generate m random covariance matrices, a random design matrix, and a matrix of regression coefficients (p per trait) to illustrate the simulation of a d = 2 dimensional response for a sample of n = 1000 people.

In [2]:
n = 1000   # no. observations
d = 2      # dimension of responses
m = 10      # no. variance components
p = 2;      # no. covariates
Random.seed!(1234);

The following functions generate the random data used to benchmark our model. We simulate a design matrix, a matrix of regression coefficients, and, for the variance components, one tuple of covariance matrices and one tuple of variance component parameters.

In [3]:
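# Generate an n-by-n symmetric positive definite matrix:
# a rank-one term 0.5 * A * A' plus n times the identity
function generateSPDmatrix(n)
    A = rand(n)
    m = 0.5 * (A * A')
    PDmat = m + (n * Diagonal(ones(n)))
    return PDmat
end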

function generateRandomVCM(n::Int64, p::Int64, d::Int64, m::Int64)
    # n-by-p design matrix
    X = randn(n, p)

    # p-by-d regression coefficients: a column of ones, then d - 1 random columns
    B = hcat(ones(p), rand(p, d - 1))

    # a tuple of m n-by-n covariance matrices
    V = ntuple(x -> zeros(n, n), m)
    for i = 1:m-1
        copy!(V[i], generateSPDmatrix(n))
    end
    copy!(V[end], Diagonal(ones(n))) # last covariance matrix is the identity

    # a tuple of m d-by-d variance component parameters
    Σ = ntuple(x -> zeros(d, d), m)
    for i in 1:m
        copy!(Σ[i], generateSPDmatrix(d))
    end

    return (X, B, Σ, V)
end;
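
As a brief check of why `generateSPDmatrix` always returns a positive definite matrix, write $a$ for the random vector `A` in the code. For any nonzero $x \in \mathbb{R}^n$:

$$x^\top \left( \tfrac{1}{2} a a^\top + n I_n \right) x = \tfrac{1}{2} (a^\top x)^2 + n \lVert x \rVert_2^2 > 0$$

So each generated matrix is symmetric with every eigenvalue at least $n$, which guarantees that its Cholesky decomposition exists.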


In [5]:
X_sim, B_sim, Σ_sim, V_sim = generateRandomVCM(n, p, d, m);
VCM_model = VCMTrait(X_sim, B_sim, [Σ_sim...], [V_sim...])

Variance Component Model
  * number of traits: 2
  * number of variance components: 10
  * sample size: 1000

In [6]:
Random_VCM_Trait = DataFrame(simulate(VCM_model))
rename!(Random_VCM_Trait, [Symbol("Trait$i") for i in 1:d])

First 10 rows of the output:

| Row | Trait1 (Float64) | Trait2 (Float64) |
|---|---|---|
| 1 | -118.132 | 100.018 |
| 2 | -40.241 | -41.3468 |
| 3 | 91.5106 | -191.511 |
| 4 | 70.3645 | -31.6985 |
| 5 | 64.6592 | -186.749 |
| 6 | -5.65423 | -243.272 |
| 7 | 128.86 | 244.639 |
| 8 | -68.5528 | -39.9173 |
| 9 | 203.579 | 100.126 |
| 10 | 27.9118 | 227.065 |


In our VarianceComponent type, we store the Cholesky decomposition of each $\Sigma_i$ and $V_i$. These decompositions are computed once, outside of the simulation, when the vc_vector of VarianceComponent types is constructed. This matters because, more often than not, users have to run the simulation many times to reach their desired goal.
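
The benefit of caching the factorizations can be sketched with a single variance component. The code below is an illustration under stated assumptions, not TraitSimulation.jl's implementation: `CachedVC`, `cache_vc`, and `draw` are hypothetical names, and the toy sizes are arbitrary. It relies on the fact that if $Z$ has i.i.d. standard normal entries, $V = L_V L_V^\top$, and $\Sigma = L_\Sigma L_\Sigma^\top$, then $\text{vec}(L_V Z L_\Sigma^\top)$ has covariance $\Sigma \otimes V$.

```julia
using LinearAlgebra, Random

# Hypothetical container: holds the lower Cholesky factors of Σ and V
struct CachedVC{T<:LowerTriangular}
    LΣ::T   # d-by-d factor, Σ = LΣ * LΣ'
    LV::T   # n-by-n factor, V = LV * LV'
end

# Factorize once, outside of any simulation loop
cache_vc(Σ, V) = CachedVC(cholesky(Symmetric(Σ)).L, cholesky(Symmetric(V)).L)

# One draw of an n-by-d trait matrix with mean M and vec-covariance Σ ⊗ V
function draw(rng::AbstractRNG, M::AbstractMatrix, vc::CachedVC)
    Z = randn(rng, size(M)...)        # i.i.d. standard normal entries
    return M + vc.LV * Z * vc.LΣ'     # no factorization happens here
end

rng = MersenneTwister(2024)
n, d = 50, 2                                        # toy sizes (assumption)
V = let A = randn(rng, n, n); A * A' + n * I end    # toy SPD matrices
Σ = let A = randn(rng, d, d); A * A' + d * I end
M = zeros(n, d)

vc = cache_vc(Σ, V)                            # pay the factorization cost once
draws = [draw(rng, M, vc) for _ in 1:100]      # then simulate as often as needed
```

In this sketch, every call to `draw` costs only matrix multiplications, while the cubic-cost factorization of `V` is paid a single time in `cache_vc`.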

# Compare simulation for m = 1 variance component
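With only one variance component, the benchmarks below show that `simulate` allocates roughly 1,000x less memory (15.75 KiB versus 15.38 MiB) and has a median run time roughly 15x shorter (about 0.50 ms versus 7.3 ms) than sampling with `MatrixNormal` when simulating this bivariate trait.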

In [10]:
vcomp = @vc Σ_sim[1] ⊗ V_sim[1]
VCM_model = VCMTrait(X_sim, B_sim, vcomp)
@benchmark simulate($VCM_model)

BenchmarkTools.Trial: 
  memory estimate:  15.75 KiB
  allocs estimate:  1
  --------------
  minimum time:     436.371 μs (0.00% GC)
  median time:      497.925 μs (0.00% GC)
  mean time:        505.794 μs (0.35% GC)
  maximum time:     18.283 ms (96.78% GC)
  --------------
  samples:          9873
  evals/sample:     1

In [11]:
function MN_J(X, B, Σ, V; n_reps = 1)
    n, p = size(X*B)
    sim = [zeros(n, p) for i in 1:n_reps]
    for i in 1:n_reps
        sim[i] = rand(MatrixNormal(X*B, V, Σ))
    end
    return(sim)
end

@benchmark MN_J($X_sim, $B_sim, $Σ_sim[1], $V_sim[1])

BenchmarkTools.Trial: 
  memory estimate:  15.38 MiB
  allocs estimate:  18
  --------------
  minimum time:     6.182 ms (0.00% GC)
  median time:      7.322 ms (0.00% GC)
  mean time:        8.836 ms (14.03% GC)
  maximum time:     23.197 ms (51.52% GC)
  --------------
  samples:          566
  evals/sample:     1

# Compare simulation for m = 10 variance components
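With m = 10 variance components, `simulate` still allocates only 15.75 KiB (versus 153.70 MiB), and its median run time is roughly 18x shorter (about 6.2 ms versus 110.8 ms) than calling `MatrixNormal` from the Distributions package once per component.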

In [15]:
vc_vector = [VarianceComponent(Σ_sim[i], V_sim[i]) for i in eachindex(V_sim)]
VCM_model_m = VCMTrait(X_sim, B_sim, vc_vector)
@benchmark simulate($VCM_model_m)

BenchmarkTools.Trial: 
  memory estimate:  15.75 KiB
  allocs estimate:  1
  --------------
  minimum time:     5.841 ms (0.00% GC)
  median time:      6.196 ms (0.00% GC)
  mean time:        6.217 ms (0.00% GC)
  maximum time:     7.371 ms (0.00% GC)
  --------------
  samples:          805
  evals/sample:     1

In [16]:
function MN_Jm(X, B, Σ, V; n_reps = 1)
    M = X * B                      # n-by-d mean matrix, computed once
    n, d = size(M)
    m = length(V)
    sim = [copy(M) for i in 1:n_reps]
    for i in 1:n_reps
        for j in 1:m
            # zero-mean draw for component j, so that the mean X*B enters once, not m times
            dist = MatrixNormal(zeros(n, d), V[j], Σ[j])
            sim[i] += rand(dist)
        end
    end
    return sim
end

@benchmark MN_Jm($X_sim, $B_sim, $Σ_sim, $V_sim)

BenchmarkTools.Trial: 
  memory estimate:  153.70 MiB
  allocs estimate:  163
  --------------
  minimum time:     83.526 ms (0.00% GC)
  median time:      110.818 ms (18.10% GC)
  mean time:        110.729 ms (13.94% GC)
  maximum time:     124.819 ms (17.24% GC)
  --------------
  samples:          46
  evals/sample:     1

The benchmarks above show that, when traits are simulated repeatedly (n_reps times), using the VarianceComponent type is much faster and more memory efficient than calling the available Julia MatrixNormal distribution m times per replicate. This is largely because the Cholesky decompositions of the covariance matrices are computed only once, before simulation, rather than again on every draw.
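
Summing independent per-component draws is a valid way to sample from the model because of a standard property of Gaussian vectors. Writing $E_1, \dots, E_m$ for the independent zero-mean component effects (notation introduced here only for illustration):

$$\text{vec}(E_k) \sim N(0, \Sigma_k \otimes V_k) \text{ independently} \quad \Longrightarrow \quad \text{vec}\Big(XB + \sum_{k=1}^m E_k\Big) \sim N\Big(\text{vec}(XB), \sum_{k=1}^m \Sigma_k \otimes V_k\Big)$$

The covariances of independent terms add, and the mean $XB$ enters exactly once.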

## Citations: 

[1] Lange K, Papp JC, Sinsheimer JS, Sripracha R, Zhou H, Sobel EM (2013) Mendel: The Swiss army knife of genetic analysis programs. Bioinformatics 29:1568-1570.

[2] OPENMENDEL: a cooperative programming project for statistical genetics.
[Hum Genet. 2019 Mar 26. doi: 10.1007/s00439-019-02001-z](https://www.ncbi.nlm.nih.gov/pubmed/?term=OPENMENDEL).

[3] Ji, SS, Lange, K, Sinsheimer, JS, Zhou, JJ, Zhou, H, Sobel, E. Modern Simulation Utilities for Genetic Analysis. BMC Bioinformatics. 2020; BINF-D-20-00690

[4] Lange K, Boehnke M (1983) Extensions to pedigree analysis. IV. Covariance component models for multivariate traits. Amer J Med Genet 14:513-524.