# Multiple Traits, Multiple Variance Components? Easy.

Suppose you have the classical setting in genetics with two variance components: one for the additive genetic variance and one for the environmental variance. This example extends that standard variance component model (VCM) to efficiently account for any number of additional random effects, that is, more than two variance components.

In this example we show alternative ways to specify the simulation parameters for the VCM, and we benchmark the simulation against the same process implemented with the MatrixNormal() function in the Distributions.jl package.


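$$Y_{n \times d} \sim \text{MatrixNormal}(\mathbf{M}_{n \times d} = XB, \Omega_{nd \times nd} = \sum_{k=1}^m \Sigma_k \otimes V_k)$$

Here $X$ is the $n \times p$ design matrix, $B$ is the $p \times d$ matrix of regression coefficients, each $\Sigma_k$ is a $d \times d$ covariance matrix across traits, and each $V_k$ is an $n \times n$ covariance matrix across individuals.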
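To make the covariance structure concrete, here is a minimal sketch that assembles $\Omega$ for a toy problem using only the LinearAlgebra standard library. The sizes (3 individuals, 2 traits, 2 variance components) and all matrix values are assumptions chosen for illustration; they are not taken from the example in this notebook.

```julia
using LinearAlgebra

# Toy sizes, assumed for illustration: n = 3 individuals, d = 2 traits
n, d = 3, 2

# d-by-d covariance matrices across traits (hypothetical values)
Σ1 = [2.0 0.5; 0.5 1.0]
Σ2 = [1.0 0.0; 0.0 1.0]

# n-by-n covariance matrices across individuals (hypothetical values)
V1 = [1.0 0.5 0.25; 0.5 1.0 0.5; 0.25 0.5 1.0]
V2 = Matrix{Float64}(I, n, n)

# Ω = Σ1 ⊗ V1 + Σ2 ⊗ V2 is the covariance of the column-stacked traits vec(Y)
Ω = kron(Σ1, V1) + kron(Σ2, V2)

println(size(Ω))         # nd-by-nd
println(issymmetric(Ω))
println(isposdef(Ω))
```

Forming $\Omega$ explicitly costs $O(n^2 d^2)$ memory, which is why a simulation would normally work with the individual $\Sigma_k$ and $V_k$ factors instead of the full Kronecker sum.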


Users can specify their covariance structures as follows. This form also accommodates more than two variance components.

In [1]:
using LinearAlgebra, Random, TraitSimulation, DataFrames, Distributions, BenchmarkTools

Here, for m = 10 variance components, we generate m random covariance matrices, a random n-by-p design matrix, and a p-by-d matrix of regression coefficients, and use them to simulate an n-by-d matrix of d = 2 traits for a sample of n = 1000 people.

In [2]:
n = 1000   # no. observations
d = 2      # dimension of responses
m = 10      # no. variance components
p = 2;      # no. covariates
Random.seed!(1234);

The following functions generate the random data used to benchmark our model. We simulate a design matrix, the matrix of regression coefficients, and two tuples of covariance matrices (Σ and V) with one entry per variance component.

In [3]:
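# Generate a random n-by-n symmetric positive definite matrix:
# a rank-one positive semidefinite term plus n times the identity.
function generateSPDmatrix(n)
    A = rand(n)
    m = 0.5 * (A * A')
    PDmat = m + (n * Diagonal(ones(n)))
    return PDmat
end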


function generateRandomVCM(n::Int64, p::Int64, d::Int64, m::Int64)
    # n-by-p design matrix
    X = randn(n, p)

    # p-by-d mean component regression coefficient for each trait
    B = hcat(ones(p), rand(p, d - 1))

    # a tuple of m n-by-n covariance matrices
    V = ntuple(x -> zeros(n, n), m)
    for i = 1:m-1
        copy!(V[i], generateSPDmatrix(n))
    end
    copy!(V[end], Diagonal(ones(n))) # last covariance matrix is identity

    # a tuple of m d-by-d variance component parameters
    Σ = ntuple(x -> zeros(d, d), m)
    for i in 1:m
        copy!(Σ[i], generateSPDmatrix(d))
    end

    return X, B, Σ, V
end;
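
As a quick sanity check on the construction used by generateSPDmatrix, the sketch below repeats the same formula under a hypothetical name, `spd_sketch`, so that it runs on its own. The rank-one term `0.5 * a * a'` is positive semidefinite, so adding n times the identity should push every eigenvalue to at least n. The seed and the size 5 are arbitrary choices.

```julia
using LinearAlgebra, Random

# Same formula as generateSPDmatrix, repeated here (hypothetical name)
# so the snippet is self-contained
function spd_sketch(n)
    a = rand(n)
    return 0.5 * (a * a') + n * Diagonal(ones(n))
end

Random.seed!(1)
S = spd_sketch(5)

println(issymmetric(S))
println(isposdef(S))

# Smallest eigenvalue should be n = 5, up to floating-point error
println(eigmin(Symmetric(S)))
```

Because every eigenvalue is at least n, these matrices are well conditioned, which keeps their Cholesky factorizations numerically stable.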


In [4]:
X_sim, B_sim, Σ_sim, V_sim = generateRandomVCM(n, p, d, m);
VCM_model = VCMTrait(X_sim, B_sim, Σ_sim, V_sim)

Variance Component Model
  * number of traits: 2
  * number of variance components: 10
  * sample size: 1000

In [5]:
Random_VCM_Trait = DataFrame(simulate(VCM_model))
rename!(Random_VCM_Trait, [Symbol("Trait$i") for i in 1:d])

| Row | Trait1 (Float64) | Trait2 (Float64) |
|-----|------------------|------------------|
| 1   | -69.5177         | -125.128         |
| 2   | -184.975         | 179.507          |
| 3   | 62.2697          | -29.6645         |
| 4   | -77.7197         | -143.363         |
| 5   | 215.216          | -109.543         |
| 6   | 45.6826          | 42.9817          |
| 7   | -47.6316         | -128.685         |
| 8   | -43.8541         | -191.055         |
| 9   | 95.7323          | -119.535         |
| 10  | 89.6551          | -251.84          |


Each VarianceComponent stores the Cholesky decompositions of its $\Sigma_i$ and $V_i$. These are computed once, when the vc_vector of VarianceComponent objects is constructed, rather than inside each simulation. This matters because, more often than not, users need to run the simulation many times.
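
The benefit of factoring once can be sketched with the standard library alone. The code below is not TraitSimulation's internal implementation; `factor_vc` and `simulate_sketch` are hypothetical names, and the toy matrices are assumptions. It relies on the fact that if $Z$ has independent standard normal entries, then $L_V Z L_\Sigma^T$ has covariance $\Sigma \otimes V$ when $V = L_V L_V^T$ and $\Sigma = L_\Sigma L_\Sigma^T$.

```julia
using LinearAlgebra, Random

# Factor one variance component once, outside the simulation loop
# (hypothetical helper; returns the lower Cholesky factors)
factor_vc(Σ, V) = (LΣ = cholesky(Symmetric(Σ)).L, LV = cholesky(Symmetric(V)).L)

# Draw one n-by-d trait matrix with mean M from prefactored components
function simulate_sketch(M, factors; rng = Random.default_rng())
    n, d = size(M)
    Y = copy(M)
    for f in factors
        Z = randn(rng, n, d)        # independent standard normal entries
        Y .+= f.LV * Z * f.LΣ'      # this term has covariance Σ ⊗ V
    end
    return Y
end

# Toy inputs, assumed for illustration
rng = MersenneTwister(2024)
n, d = 4, 2
M = zeros(n, d)
Σs = ([2.0 0.5; 0.5 1.0], [1.0 0.0; 0.0 1.0])
Vs = (Matrix{Float64}(I, n, n) .+ 0.5, Matrix{Float64}(I, n, n))

# The factorizations are computed once and reused for every replicate
factors = [factor_vc(Σ, V) for (Σ, V) in zip(Σs, Vs)]
draws = [simulate_sketch(M, factors; rng = rng) for _ in 1:3]
```

Each replicate then costs only random number generation and triangular matrix products, while the $O(n^3)$ factorization of each $V_k$ is paid a single time.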

# Compare simulation for m = 1 variance component
For a single variance component, we are about 2x more memory efficient (7.68 MiB vs. 15.38 MiB) and roughly 3.6x faster by median time (3.126 ms vs. 11.277 ms) at simulating this bivariate trait.

In [6]:
VCM_model = VCMTrait(X_sim*B_sim, VarianceComponent(Σ_sim[1], V_sim[1]))
@benchmark simulate(VCM_model)

BenchmarkTools.Trial: 
  memory estimate:  7.68 MiB
  allocs estimate:  5
  --------------
  minimum time:     2.064 ms (0.00% GC)
  median time:      3.126 ms (0.00% GC)
  mean time:        3.355 ms (11.71% GC)
  maximum time:     10.899 ms (29.62% GC)
  --------------
  samples:          1484
  evals/sample:     1

In [7]:
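# Baseline: simulate n_reps trait matrices with Distributions.jl's MatrixNormal,
# using V as the n-by-n row covariance and Σ as the d-by-d column covariance
function MN_J(X, B, Σ, V; n_reps = 1)
    n, d = size(X*B)
    sim = [zeros(n, d) for i in 1:n_reps]
    for i in 1:n_reps
        sim[i] = rand(MatrixNormal(X*B, V, Σ))
    end
    return sim
end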

@benchmark MN_J($X_sim, $B_sim, $Σ_sim[1], $V_sim[1])

BenchmarkTools.Trial: 
  memory estimate:  15.38 MiB
  allocs estimate:  25
  --------------
  minimum time:     9.058 ms (0.00% GC)
  median time:      11.277 ms (10.12% GC)
  mean time:        11.992 ms (8.03% GC)
  maximum time:     23.930 ms (8.26% GC)
  --------------
  samples:          417
  evals/sample:     1

# Compare simulation for m = 10 variance components
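With m = 10 variance components, we are still about 2x more memory efficient (76.36 MiB vs. 153.70 MiB) and about 3.2x faster by median time (35.511 ms vs. 113.232 ms) than the Distributions package.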

In [8]:
vc_vector = [VarianceComponent(Σ_sim[i], V_sim[i]) for i in eachindex(V_sim)]
VCM_model_m = VCMTrait(X_sim*B_sim, vc_vector);
@benchmark simulate(VCM_model_m)

BenchmarkTools.Trial: 
  memory estimate:  76.36 MiB
  allocs estimate:  24
  --------------
  minimum time:     32.783 ms (15.10% GC)
  median time:      35.511 ms (14.73% GC)
  mean time:        35.561 ms (16.06% GC)
  maximum time:     42.054 ms (13.81% GC)
  --------------
  samples:          141
  evals/sample:     1

In [9]:
# Baseline for m components: sum one MatrixNormal draw per variance component
function MN_Jm(X, B, Σ, V; n_reps = 1)
    n, d = size(X*B)
    m = length(V)
    sim = [zeros(n, d) for i in 1:n_reps]
    for i in 1:n_reps
        for j in 1:m
            dist = MatrixNormal(X*B, V[j], Σ[j])
            sim[i] += rand(dist)
        end
    end
    return sim
end

@benchmark vecs = MN_Jm($X_sim, $B_sim, $Σ_sim, $V_sim)

BenchmarkTools.Trial: 
  memory estimate:  153.70 MiB
  allocs estimate:  233
  --------------
  minimum time:     105.317 ms (9.59% GC)
  median time:      113.232 ms (10.42% GC)
  mean time:        113.533 ms (10.05% GC)
  maximum time:     126.924 ms (9.60% GC)
  --------------
  samples:          45
  evals/sample:     1

Our benchmarks above show that when we use the simulation package to simulate traits n_reps times, using the VarianceComponent type is faster and more memory efficient than calling the available Julia MatrixNormal distribution m times: about 3.2x faster with about half the memory for m = 10.