# TraitSimulation: Simulation Utilities for Family Data

Authors: Sarah Ji, Janet Sinsheimer, Hua Zhou, Ken Lange, Eric Sobel

In this notebook, we show how to simulate trait data so that the related individuals have correlated phenotype values even after we account for the effect of a snp, a combination of snps or other fixed effects. We simulate data under a linear mixed model so that we can model residual dependency among individuals. 

We use a subset of the UKBiobank data to demonstrate how to simulate phenotypic traits after controlling for family structure under different scenarios. Users who are interested in the simulation of traits for independent samples under the Generalized Linear Model (GLM) framework can refer to the [example notebook: prototyping MendelIHT](https://github.com/OpenMendel/TraitSimulation.jl/blob/master/docs/TraitSimulation-TestingMendelIHT.ipynb). 


In addition to VCM's we also show how to simulate traits from the exponential family of distributions, after controlling for family structure under the Generalized Linear Mixed Model (GLMM) framework. To illustrate an example code pipeline for downstream analysis,  we include Jupyter notebooks passing the simulation results to analysis packages, [ordinal model](https://github.com/OpenMendel/TraitSimulation.jl/blob/master/docs/Example_Ordinal_Multinomial_Power.ipynb) and [variance component model](https://github.com/OpenMendel/TraitSimulation.jl/blob/master/docs/Example_VCM_Power.ipynb) to calculate study power.


We let the user specify the effect sizes for non-genetic covariates, and we provide the user with an option to simulate their own effect sizes, given a known distribution. In both models $\mu$ is the mean, $x_i$ is the allele count for snp $i$, GRM is kinship matrix as derived from the genetic relationship matrix (GRM) across only the common snps with minor allele frequency $\ge 0.05$, and I is the identity matrix. 

At the end of each example, we demonstrate how to write the results of each simulation to a file on the users own machine. The notebook is organized as follows: <br>


**Example 1: User specified mixed effects model with Interaction in the Fixed Effects**


**Example 2: Rare Variant Model with effect sizes generated off minor allele frequency**


In this example we first simulate rare SNP's uniformly from a vector of specified minor allele frequencies. We choose the potential minor allele frequencies to be greater than 0.005 but less than 0.03, then we simulate traits with the 20 rare SNP's as fixed effects  for both univariate and bivariate models.

$$
Y \sim \text{Normal}(\mathbf{\mu}_{n \times 1} = X\beta, \Sigma_{n \times n} = \sigma_A \times 2\hat{\Phi}_{GRM} + \sigma_E \times I_n)
$$

**Example 3: GLMM Trait simulation - Simulating count data in families**

$$
Y \sim \text{Poisson}(\mathbf{\mu}_{n \times 1} = g^{-1}(X\beta), \Sigma_{n \times n} = \sigma_A \times 2\hat{\Phi}_{GRM} + \sigma_E \times I_n)
$$

### Double check that you are using Julia version 1.0 or higher by checking the machine information

In [259]:
using Random, SnpArrays, TraitSimulation, VarianceComponentModels, StatsBase, DataFrames
using LinearAlgebra, Distributions, CSV, Plots, StatsFuns, GLM
Random.seed!(1234);
versioninfo()

Julia Version 1.3.0
Commit 46ce4d7933 (2019-11-26 06:09 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i9-9920X CPU @ 3.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)


In [192]:
filename = "hypertension_L4_full"
sample_snp_data = SnpData(filename);

Information on the individual can be found in the .fam file. We retrieve the `familyid` (Pedigree ID) along with the `personid` for each the 3754 individuals in the sample. 

In [193]:
full_snps = SnpArray("hypertension_L4_full.bed", 3754)

3754×470228 SnpArray:
 0x03  0x03  0x03  0x02  0x03  0x03  …  0x00  0x00  0x00  0x00  0x00  0x00
 0x02  0x03  0x02  0x03  0x03  0x02     0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x03  0x03  0x03  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
 0x02  0x02  0x03  0x03  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x03  0x03  0x02  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
 0x02  0x03  0x02  0x03  0x03  0x02  …  0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x03  0x03  0x03  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
 0x02  0x03  0x02  0x03  0x03  0x02     0x00  0x00  0x00  0x00  0x00  0x00
 0x02  0x03  0x02  0x03  0x03  0x02     0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x03  0x03  0x03  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x03  0x03  0x03  0x03  0x03  …  0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x03  0x03  0x03  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
 0x02  0x03  0x02  0x03  0x03  0x02     0x00  0x00  0x00  0x00  0x00  0x00
   

In [14]:
# rows are people; columns are SNPs
people, snps = size(full_snps)

(3754, 470228)

We subset the SNP names into a vector called `snpid`

In [7]:
bimfile = sample_snp_data.snp_info # store the snp_info with the snp names
snpid  = bimfile[!, :snpid] # store the snp names in the snpid vector;

# Example 1: User Specified Model

Say, for instance, we know which SNP's we want to have an effect on the Trait. 

To save computing time and memory, we can simply call them by their names in `snpid`, and subset only those specified SNP's for the analysis.

In this example, we choose three snp's on three different chromosomes to prevent Linkage disequilibrium (LD). Say I know the associated SNP's are located on the tenth, eleventh and twenty first respective chromosomes to have an effect on the trait. We include the three SNP's and the interaction of the SNP on chromosome 11 and 21.

We first get the indices of the specified snps to subset the dataset. The snps of interest in this example are in the 10th, 5348671th and 8348112th respective columns of the dataset.


Note: While prior knowledge of which SNP's have an effect on the trait can be helpful, it is important to consider the proper assumptions of the SNP's before performing any analyses. In particular, we will later check to make sure that all the minor allele frequencies are large enough so that a fixed effect model makes sense, and the SNP's are not monomorphic. 

In [42]:
specified_indices = findall(x -> (x == "rs11240779" || x == "rs13257831"  || x == "rs10058946"), snpid)
# Use can change this SNP if you would like to assess another's snps effect on the trait, e.g.:
#specified_indices = findall(x -> x == "rs11240779", snpid); # find the index of the snp of interest by snpid

In [43]:
loci = convert(Matrix{Float64}, @view(full_snps[:, specified_indices]), impute = true)
X_gen = DataFrame(loci)
rename!(X_gen, Symbol.(snpid[specified_indices]))

Unnamed: 0_level_0,rs11240779,rs10058946,rs13257831
Unnamed: 0_level_1,Float64,Float64,Float64
1,1.0,1.0,1.0
2,1.0,2.0,2.0
3,2.0,0.0,2.0
4,1.0,1.0,0.0
5,1.0,2.0,2.0
6,1.0,1.0,2.0
7,2.0,0.0,2.0
8,1.0,1.0,1.0
9,2.0,1.0,2.0
10,2.0,2.0,1.0


# Polymorphic Loci 

We check that the minor allele frequencies for our three specified SNP's are greater than 0.05 so we will proceed with just these SNP's in our fixed effect model. 

In [46]:
maf_cs = maf(@view full_snps[:, specified_indices])[:] 

3-element Array{Float64,1}:
 0.22996515679442509
 0.3670141673349372 
 0.31128300880234727

For the three specified SNP's, we simulate effect sizes based off of their minor allele frequencies. These effect sizes will be used throughout example 1. 

## Construct Design Matrix

Here we construc the desired design matrix for simulation. We specify sex from the person_info field of the SnpData, and map it from "M/F" indicated male/female to 1/0. We then simulate age from a Normal distribution with mean age 45 years old, variance of 3 years. 

In [208]:
n = people
famfile = sample_snp_data.person_info
IndividualID = famfile[:, 1:2];

# map sex from M/F to 1/0
sex = map(x -> strip(x) == "F" ? 0.0 : 1.0, famfile[!, :sex]);

pdf_age = Normal(45, 3)
age = rand(pdf_age, n)

intercept = ones(n)
X_non_gen = DataFrame(intercept = intercept, age = age, sex = sex)
sim_effectsize = round.(simulate_effect_size(maf_cs)

β_cov = [1.0, 0.0002, 0.2]
β = vcat(β_cov, sim_effectsize)
X_design = [X_non_gen X_gen]

Unnamed: 0_level_0,intercept,age,sex,rs11240779,rs10058946,rs13257831
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64,Float64
1,1.0,0.961317,1.0,1.0,1.0,1.0
2,1.0,0.430665,1.0,1.0,2.0,2.0
3,1.0,-0.540137,1.0,2.0,0.0,2.0
4,1.0,-2.13068,1.0,1.0,1.0,0.0
5,1.0,0.69471,1.0,1.0,2.0,2.0
6,1.0,-0.0775978,1.0,1.0,1.0,2.0
7,1.0,1.35295,1.0,2.0,0.0,2.0
8,1.0,1.38327,1.0,1.0,1.0,1.0
9,1.0,0.573823,1.0,2.0,1.0,2.0
10,1.0,-0.839263,1.0,2.0,2.0,1.0


# Genetic Relationship Matrix (GRM)

We estimate the kinship using the grm function in `SnpArrays` from only the common snps with minor allele frequency(maf) > 0.05.

We will use the same values of GRM for both univaraite and bivariate examples. 

This example simulates an infinitesmal effect model in which each SNP contributes a small amount to the variance of the trait. In this case, methods to identify SNPs that contribute to the trait value will be unsuccessful and any SNPs identified are false positives.  As such, this simulation can be when the null hypothesis is desired. 

Another issue of importance $E(GRM) = \Phi$ and $V_p = 2V_a \Phi + V_e I$ , so we are in fact simulating a trait that has $V_p = 1.0, h^2 = 0.30$.

We take a look at the Genetic Relationship Matrix computed through `SnpArrays`. 
As a rule of thumb, we should see values close to a half on the diagonal of the GRM.

In [49]:
# Compute GRM using the grm function in SnpArrays
GRM = grm(full_snps)

3754×3754 Array{Float64,2}:
  0.507378      0.000982835   0.00147616   …  -0.00075625    0.00119529 
  0.000982835   0.492554     -0.000152061     -0.00112687   -0.00288623 
  0.00147616   -0.000152061   0.497708         0.00162231   -0.000139996
 -0.00145189   -0.000773482  -0.00177809       0.000876674   0.00170538 
  0.00225233   -0.00107256   -0.00320005       0.00212845   -0.000825247
 -0.00358522    0.002827     -0.000474857  …  -0.00107532    0.000566833
 -0.00417571   -0.00433671    0.000894409      0.00194644    0.00231383 
  0.000524487  -0.000306755  -0.000585709     -0.00142927   -0.00260279 
  0.000801737  -0.00245975   -0.00203422      -0.000757197   0.000850964
  0.00141141    0.00381874   -0.00333167       0.000482886  -0.00265196 
  0.00186258    0.00319489    0.000228094  …   0.00298501   -0.0005898  
  0.000965864  -0.000342783   0.00218133      -0.00153142   -0.00199691 
  0.00328322   -0.00300997    0.000346194      0.00014947    0.00142216 
  ⋮                    

In [54]:
I_n = Matrix{Float64}(I, size(GRM));
totalvc = @vc [0.1][:, :] ⊗ GRM + [0.9][:, :] ⊗ I_n
# # # Create the simulation model 
vcm_model1 = VCMTrait(Matrix(X_design), β, totalvc)

Variance Component Model
  * number of traits: 1
  * number of variance components: 2
  * sample size: 3754

### Alternative Model Specification

```julia
mean_formula = ["1 + 0.0002age + 0.2sex + 0.238rs11240779 + 0.207rs10058946 - 0.216rs13257831"]
vcm_model1 = VCMTrait(mean_formula, X_design, totalvc)
```

For users who wish to specify the fixed effects as a formula, we provide alternative ways to specify the model parameters for simulation. 

In [55]:
# simulate the trait
y_1 = DataFrame(Trait = simulate(vcm_model1)[:])

Unnamed: 0_level_0,Trait
Unnamed: 0_level_1,Float64
1,0.526989
2,0.79911
3,0.525143
4,0.139311
5,-0.742257
6,0.334136
7,0.287026
8,2.14406
9,2.11084
10,1.95348


## Saving Simulation Results to Local Machine

Next we output the SNPs and the coefficients used to simulate this trait along with the simulated trait values and corresponding design matrix for each of the 849 individuals, labeled by their pedigree ID and person ID.

In addition, we output the genotypes for the variants used to simulate this trait. Note that we can impute missing genotypes by turning the argument `impute = true`.


In [56]:
Coefficients = DataFrame(Coefficients = β)
Covariates = DataFrame(covariates = names(X_design))
Trait1_SNPs = hcat(Coefficients, Covariates)

Unnamed: 0_level_0,Coefficients,covariates
Unnamed: 0_level_1,Float64,Symbol
1,1.0,intercept
2,0.0002,age
3,0.2,sex
4,0.237637,rs11240779
5,-0.207473,rs10058946
6,-0.215974,rs13257831


In [57]:
Trait1_data = [IndividualID X_design]

Unnamed: 0_level_0,fid,iid,intercept,age,sex,rs11240779,rs10058946,rs13257831
Unnamed: 0_level_1,Abstract…,Abstract…,Float64,Float64,Float64,Float64,Float64,Float64
1,1002101,1002101,1.0,2.22475,1.0,1.0,1.0,1.0
2,1004064,1004064,1.0,-1.26116,1.0,1.0,2.0,2.0
3,1004240,1004240,1.0,-0.599745,1.0,2.0,0.0,2.0
4,1004303,1004303,1.0,0.922595,1.0,1.0,1.0,0.0
5,1007498,1007498,1.0,-2.29475,1.0,1.0,2.0,2.0
6,1008785,1008785,1.0,-0.896205,1.0,1.0,1.0,2.0
7,1011423,1011423,1.0,0.207726,1.0,2.0,0.0,2.0
8,1012197,1012197,1.0,0.109201,1.0,1.0,1.0,1.0
9,1013656,1013656,1.0,0.110908,1.0,2.0,1.0,2.0
10,1013701,1013701,1.0,0.44119,1.0,2.0,2.0,1.0


In [58]:
# CSV.write("Trait1_data.csv", Trait1_data)
# CSV.write("Trait1_SNPs.csv", Trait1_SNPs)

## Example 2: Rare Variant VCM Related Individuals

$$
Y \sim \text{Normal}(\mathbf{\mu}_{n \times 1} = X\beta, \Sigma_{n \times n} = \sigma_A \times 2\hat{\Phi}_{GRM} + \sigma_E \times I_n)
$$

This example is meant to simulate data in a scenario in which a number of rare mutations in a single gene can change a trait value. We model the residual variation among relatives with the additive genetic variance component and we include 20 rare variants in the mean portion of the model, defined as loci with minor allele frequencies greater than 0.002 but less than 0.02.

Specifically we are generating a single normal trait controlling for family structure with residual heritabiity of 67%, and effect sizes for the variants generated as a function of the minor allele frequencies. The rarer the variant the greater its effect size.

In practice rare variants have smaller minor allele frequencies, but we are limited in this tutorial by the relatively small size of the data set. Note also that our modeling these effects as part of the mean is not meant to imply that the best way to detect them would be a standard association analysis. Instead we recommend a burden or SKAT test.

In [241]:
n = 3754
n_snps = 20
rare_maf = rand([0.02, 0.003, 0.004, 0.02, 0.003, 0.01, 0.02], n_snps)
G = snparray_simulation(rare_maf, n);

In [122]:
X_rare_snps = DataFrame(convert(Matrix{Float64}, G, model=ADDITIVE_MODEL, center=true, scale=true))
rename!(X_rare_snps, [Symbol("Rare SNP $i") for i in 1:length(maf_rare)])

Unnamed: 0_level_0,Rare SNP 1,Rare SNP 2,Rare SNP 3,Rare SNP 4,Rare SNP 5,Rare SNP 6,Rare SNP 7
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,-0.0865253,-0.0833667,-0.0849605,-0.201233,-0.0800855,-0.0673705,-0.149522
2,-0.0865253,-0.0833667,-0.0849605,-0.201233,-0.0800855,-0.0673705,-0.149522
3,-0.0865253,-0.0833667,-0.0849605,-0.201233,-0.0800855,-0.0673705,-0.149522
4,-0.0865253,-0.0833667,-0.0849605,-0.201233,-0.0800855,-0.0673705,-0.149522
5,-0.0865253,-0.0833667,-0.0849605,-0.201233,-0.0800855,-0.0673705,-0.149522
6,-0.0865253,-0.0833667,-0.0849605,-0.201233,-0.0800855,-0.0673705,-0.149522
7,-0.0865253,-0.0833667,-0.0849605,-0.201233,-0.0800855,-0.0673705,-0.149522
8,-0.0865253,11.9535,-0.0849605,-0.201233,-0.0800855,-0.0673705,-0.149522
9,-0.0865253,-0.0833667,-0.0849605,-0.201233,-0.0800855,-0.0673705,-0.149522
10,-0.0865253,-0.0833667,-0.0849605,-0.201233,-0.0800855,-0.0673705,-0.149522


## Generating Effect Sizes 

Below we demonstrate how to simulate effect sizes for each SNP, conditional on its minor allele frequency and a known distribution.

We include these distributions to model realistic scenarios where the rarest snps have the largest effect size.  


In [243]:
maf_rare = round.(maf(G), digits = 3);

## Chisquared(df = 1)

We want to use allele frequency as x and find f(x) where f is the pdf for the chisquare (df=1) density, so that the rarest snps have the biggest effect sizes.

In [94]:
# Generating Effect Sizes from Chisquared(df = 1) density
chisq_coeff = zeros(n_snps)
for i in 1:20
    chisq_coeff[i] = chisqpdf(1, maf_rare[i])
end

Take a look at the simulated coefficients on the left, next to the corresponding minor allele frequency. Notice how the more rare SNP's have the largest effect sizes.

## Exponential

For demonstration purposes, we use simulated_effectsizes3 = 3*exp.(-200*maf_rare\[1:20\]), rounded to the second digit, throughout this example. However, named distribution can also be used to simulate effect sizes. 

In [244]:
simulated_effectsizes_exp = round.(3*exp.(-700 * maf_rare), digits = 2)

DataFrame(chisq_coeff = simulated_effectsizes_chisq, exp_coeff = simulated_effectsizes_exp, maf = maf_rare)

Unnamed: 0_level_0,chisq_coeff,exp_coeff,maf
Unnamed: 0_level_1,Float64,Float64,Float64
1,6.3,0.37,0.003
2,7.27,0.01,0.009
3,6.3,0.01,0.009
4,2.79,0.0,0.021
5,7.27,0.37,0.003
6,8.91,0.37,0.003
7,3.78,0.0,0.011
8,2.72,0.0,0.021
9,3.03,0.18,0.004
10,2.87,0.0,0.021


In [245]:
xbm = SnpBitMatrix{Float64}(G, model=ADDITIVE_MODEL, center=true, scale=true); 

### Simulate Trait

Now for the univariate rare variant model we have constructed, we simulate y_2. 

We write our results for the marginal trait simulation with the bivariate simulation results at the end of this example.

In [240]:
vcm_model2 = VCMTrait3(Matrix(X_non_gen), β_cov, xbm, simulated_effectsizes_exp, totalvc)

Variance Component Model
  * number of traits: 1
  * number of variance components: 2
  * sample size: 3754

In [262]:
# Generate the simulations
y_2a = DataFrame(Marginal_Trait1 = simulate(vcm_model2)[:])

Unnamed: 0_level_0,Marginal_Trait1
Unnamed: 0_level_1,Float64
1,-0.604751
2,1.45546
3,0.819765
4,-1.4136
5,1.07872
6,1.64403
7,2.16217
8,-1.09265
9,0.435962
10,-0.450501


### Alternative VCM Parameter Specification:


We can extend the mixed model for a single trait in the previous example to demo how to efficiently simulate multiple traits, while accounting for any number of other random effects in addition to the additive genetic and environmental variance components. In particular, we note the alternative ways users can specify simulation parameters under the VCM. 

Say we have $m \geq 2$ variance components for $d$ correlated traits of $n$ related people under the VCM.


$Y_{n \times d} \sim \text{MatrixNormal}(\mathbf{M}_{n \times d} = XB, \Omega_{nd \times nd} = \Sigma_1 \otimes V_1 + \cdots + \Sigma_m \otimes V_m)$

Users can also specify the model under the standard [VarianceComponentModels.jl](https://github.com/OpenMendel/VarianceComponentModels.jl/) framework as follows:

* `Y`: `n x d` response matrix 
* `X`: `n x p` covariate matrix 
* `V=(V1,...,Vm)`: a tuple of `m` `n x n` covariance matrices

and **parameters** are

* `B`: `p x d` mean parameter matrix
* `Σ=(Σ1,...,Σm)`: a tuple of `m` `d x d` variance components. 

Those who wish to specify a large number of group effects or clusters, we provide an alternative method to specify the variance components and benchmark its performance in [this Example](https://github.com/OpenMendel/TraitSimulation.jl/blob/master/docs/benchmarking_VCM.ipynb).

Note both univariate and multivariate simulation models allow for alternative model specifications. For the bivariate model the following two are equivalent:

**1.**
```julia
Σ_2 = [Σ_A, Σ_E]
V_2 = [GRM, I_n]
# # Create the simulation model 
vcm_model3 =  VCMTrait(Matrix(X_non_gen), β_2, xbm, γ_s, Σ_2, V_2)
```

**2.**
```julia
variance_formula = @vc Σ_A ⊗ GRM  + Σ_E ⊗ I_n
vcm_model3 =  VCMTrait(Matrix(X_non_gen), β_2, xbm, γ_s, variance_formula)
```

In [316]:
β_2 = [β_cov β_cov]
γ_s = [simulated_effectsizes_exp simulated_effectsizes_chisq]

Σ_A = [4 1; 1 4]
Σ_E = [2.0 0.0; 0.0 2.0]
n_traits = size(Σ_A, 1)

variance_formula = @vc Σ_A ⊗ GRM  + Σ_E ⊗ I_n

# # Create the simulation model 
vcm_model3 =  VCMTrait(Matrix(X_non_gen), β_2, xbm, γ_s, variance_formula)

Variance Component Model
  * number of traits: 2
  * number of variance components: 2
  * sample size: 3754

### Simulate Trait

Now for the bivariate rare variant model we have constructed, we simulate y_2b. Notice how the correlation structure between the two traits has an effect in this simulation compared to the marginal simulation of y_2a above.

In [315]:
# Generate the simulations
y_2b = DataFrame(simulate(vcm_model3))
rename!(y_2b, [Symbol("Trait$i") for i in 1:n_traits])

Unnamed: 0_level_0,Trait1,Trait2
Unnamed: 0_level_1,Float64,Float64
1,-1.47905,1.9046
2,-2.10728,-2.39905
3,0.0438116,2.73362
4,-2.9106,-1.30386
5,-1.09351,-0.230581
6,1.39904,1.50179
7,3.61088,-1.25602
8,-1.1113,-0.305428
9,0.427108,1.75923
10,-1.888,1.97498


## Saving Simulation Results to Local Machine

In [264]:
Trait2_data = [IndividualID hcat(y_2a, y_2b, X_rare_snps)]

Unnamed: 0_level_0,fid,iid,Marginal_Trait1,Trait1,Trait2,Rare SNP 1,Rare SNP 2
Unnamed: 0_level_1,Abstract…,Abstract…,Float64,Float64,Float64,Float64,Float64
1,1002101,1002101,-0.604751,1.9902,-4.72049,-0.0865253,-0.0833667
2,1004064,1004064,1.45546,1.53203,-0.175097,-0.0865253,-0.0833667
3,1004240,1004240,0.819765,2.85028,3.94665,-0.0865253,-0.0833667
4,1004303,1004303,-1.4136,-0.298231,4.31397,-0.0865253,-0.0833667
5,1007498,1007498,1.07872,2.06732,3.29499,-0.0865253,-0.0833667
6,1008785,1008785,1.64403,1.94517,1.50403,-0.0865253,-0.0833667
7,1011423,1011423,2.16217,-0.37764,-4.46406,-0.0865253,-0.0833667
8,1012197,1012197,-1.09265,-2.00309,0.602762,-0.0865253,11.9535
9,1013656,1013656,0.435962,0.467142,3.36846,-0.0865253,-0.0833667
10,1013701,1013701,-0.450501,-0.554165,0.768331,-0.0865253,-0.0833667


In [265]:
Trait2_SNPs = DataFrame(Coefficients_Trait1 = simulated_effectsizes_exp,
                       Coefficients_Trait2 = simulated_effectsizes_chisq, maf = maf_rare)

Unnamed: 0_level_0,Coefficients_Trait1,Coefficients_Trait2,maf
Unnamed: 0_level_1,Float64,Float64,Float64
1,0.37,6.3,0.003
2,0.01,7.27,0.009
3,0.01,6.3,0.009
4,0.0,2.79,0.021
5,0.37,7.27,0.003
6,0.37,8.91,0.003
7,0.0,3.78,0.011
8,0.0,2.72,0.021
9,0.18,3.03,0.004
10,0.0,2.87,0.021


In [188]:
# CSV.write("Trait2_data.csv", Trait2_data)
# CSV.write("Trait2_SNPs.csv", Trait2_SNPs)

# Example 3: GLMM Trait Simulation

Next, we demonstrate how to simulate a Poisson Trait, after controlling for family structure.  

In [320]:
X_3 = Matrix(hcat(X_non_gen, X_rare_snps))
β_3 = [β_cov β_cov; simulated_effectsizes_exp simulated_effectsizes_chisq];

In [318]:
Σ_A = [4 -1; -1 4]
Σ_E = [2.0 0.0; 0.0 2.0];

dist = Poisson()
link = LogLink()

variance_formula = @vc Σ_A ⊗ GRM + Σ_E ⊗ I_n
GLMMmodel = GLMMTrait(X_3, β_3, variance_formula, dist, link)

Generalized Linear Mixed Model
  * response distribution: Poisson
  * link function: LogLink
  * number of variance components: 2
  * sample size: 3754

In [319]:
y_3 = DataFrame(stimulate(GLMMmodel))
rename!(y_3, [Symbol("Trait$i") for i in 1:n_traits])

Unnamed: 0_level_0,Trait1,Trait2
Unnamed: 0_level_1,Int64,Int64
1,2,3
2,2,3
3,2,3
4,2,3
5,2,3
6,2,3
7,2,3
8,2,3
9,2,3
10,2,3


In [323]:
Trait3_data = [IndividualID hcat(y_3, X_non_gen, X_rare_snps)]

Unnamed: 0_level_0,fid,iid,Trait1,Trait2,intercept,age,sex,Rare SNP 1
Unnamed: 0_level_1,Abstract…,Abstract…,Int64,Int64,Float64,Float64,Float64,Float64
1,1002101,1002101,2,3,1.0,0.961317,1.0,-0.0865253
2,1004064,1004064,2,3,1.0,0.430665,1.0,-0.0865253
3,1004240,1004240,2,3,1.0,-0.540137,1.0,-0.0865253
4,1004303,1004303,2,3,1.0,-2.13068,1.0,-0.0865253
5,1007498,1007498,2,3,1.0,0.69471,1.0,-0.0865253
6,1008785,1008785,2,3,1.0,-0.0775978,1.0,-0.0865253
7,1011423,1011423,2,3,1.0,1.35295,1.0,-0.0865253
8,1012197,1012197,2,3,1.0,1.38327,1.0,-0.0865253
9,1013656,1013656,2,3,1.0,0.573823,1.0,-0.0865253
10,1013701,1013701,2,3,1.0,-0.839263,1.0,-0.0865253


In [333]:
Trait3_SNPs = DataFrame(Coefficients_Trait1 = β_3[:, 1],
                        Coefficients_Trait2 = β_3[:, 2], sim_info = vcat(String.(names(X_non_gen)), maf_rare))

Unnamed: 0_level_0,Coefficients_Trait1,Coefficients_Trait2,sim_info
Unnamed: 0_level_1,Float64,Float64,Any
1,1.0,1.0,intercept
2,0.0002,0.0002,age
3,0.2,0.2,sex
4,0.37,6.3,0.003
5,0.01,7.27,0.009
6,0.01,6.3,0.009
7,0.0,2.79,0.021
8,0.37,7.27,0.003
9,0.37,8.91,0.003
10,0.0,3.78,0.011


In [334]:
# CSV.write("Trait3_data.csv", Trait3_data)
# CSV.write("Trait3_SNPs.csv", Trait3_SNPs)

## Citations: 

[1] Lange K, Papp JC, Sinsheimer JS, Sripracha R, Zhou H, Sobel EM (2013) Mendel: The Swiss army knife of genetic analysis programs. Bioinformatics 29:1568-1570.`


[2] OPENMENDEL: a cooperative programming project for statistical genetics.
[Hum Genet. 2019 Mar 26. doi: 10.1007/s00439-019-02001-z](https://www.ncbi.nlm.nih.gov/pubmed/?term=OPENMENDEL).