# Trait Simulation Tutorial

Authors: Sarah Ji, Chris German, Kenneth Lange, Janet Sinsheimer, Hua Zhou, Jin Zhou, Eric Sobel

In this notebook we show how to use the `TraitSimulation.jl` package to simulate traits from genotype data, all within the OpenMendel universe. Operating within this universe brings potential advantages over other available software when the simulated data are needed for downstream analysis or study design. To illustrate a downstream application of this software, we conduct a power analysis in [this example](https://github.com/OpenMendel/TraitSimulation.jl/blob/master/docs/example_power_analysis.ipynb) on simulated data using the ordered multinomial model [3].

## Background

There is a lack of software available to geneticists who wish to calculate power and sample sizes when designing a study on genetic data. Typically, study power depends on assumptions about the underlying disease model. Many power calculation tools operate as a black box and do not allow for customization. To develop custom tests, researchers can write their own simulation procedures to carry out power calculations. One limitation of many existing methods for simulating traits conditional on genotypes is that they are restricted to normally distributed traits and to fixed effects.

This software package, TraitSimulation.jl, addresses the need for simulated trait data in genetic analyses. The package generates data sets that allow researchers to check the validity of programs and to calculate power for their proposed studies. It gives users the ability to easily simulate phenotypic traits under generalized linear models (GLMs) or variance component models (VCMs) conditional on PLINK formatted genotype data. In addition, we include customized simulation utilities that accompany specific genetic analysis options in OpenMendel, for example ordered multinomial traits. We demonstrate these simulation utilities on the example dataset described below.

## Demonstration

##### Example Data

We use the OpenMendel package [SnpArrays.jl](https://openmendel.github.io/SnpArrays.jl/latest/) to both read in and write out PLINK formatted files. To demonstrate how to simulate phenotypic traits on PLINK formatted data, we use the file `"EUR_subset"`, which is available in the data directory described under the [Example_Data](https://openmendel.github.io/SnpArrays.jl/latest/#Example-data-1) section of that package.
For convenience we use the common assumption that the residual covariance between two relatives can be captured by the additive genetic variance times twice the kinship coefficient.

In each example the user can specify the simulation model parameters, along with the number of repetitions for each simulation model. By default, the simulation returns the result of a single simulation.

### Double check that you are using Julia version 1.0 or higher by checking the machine information

In [1]:
versioninfo()

Julia Version 1.5.2
Commit 539f3ce943 (2020-09-23 23:17 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)


In [2]:
using Random, Plots, DataFrames, LinearAlgebra, DelimitedFiles
using SnpArrays, TraitSimulation, GLM, StatsBase, OrdinalMultinomialModels
Random.seed!(1234);

# Reading genotype data using SnpArrays

First use `SnpArrays.jl` to read in the genotype data. We use PLINK formatted data with the same prefixes for the .bim, .fam, .bed files.

SnpArrays is a very useful utility and can do a lot more than just read in the data. More information about all the functionality of SnpArrays can be found at:
https://openmendel.github.io/SnpArrays.jl/latest/

Because missing genotypes are often due to problems making the calls, the called genotypes at a marker with too many missing genotypes are potentially unreliable. By default, `SnpArrays.filter` keeps only the samples and SNPs with genotyping success rates greater than 0.98 and the SNPs with a minor allele frequency of at least 0.01. To change the stringency, change the thresholds passed to `filter` as described in the [SnpArrays](https://openmendel.github.io/SnpArrays.jl/latest/#Fitering-1) documentation.

In [3]:
filename = "EUR_subset"
EUR = SnpArray(SnpArrays.datadir(filename * ".bed"));

In [4]:
rowmask, colmask =  SnpArrays.filter(EUR)
minor_allele_frequency = maf(EUR);
people, snps = size(EUR)

(379, 54051)
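The filtering thresholds discussed above can be tightened when calling `SnpArrays.filter`. The sketch below is a hedged example: the keyword names `min_success_rate_per_row`, `min_success_rate_per_col`, and `min_maf` follow our reading of the SnpArrays documentation, and the threshold values are arbitrary choices for illustration. Check `?SnpArrays.filter` in your installed version before relying on them.

```julia
using SnpArrays

# Same example data as in the cells above
EUR = SnpArray(SnpArrays.datadir("EUR_subset.bed"))

# Stricter than the defaults (success rate 0.98, minimum MAF 0.01).
# Keyword names are assumed from the SnpArrays documentation.
rowmask_strict, colmask_strict = SnpArrays.filter(EUR;
    min_success_rate_per_row = 0.99,
    min_success_rate_per_col = 0.99,
    min_maf = 0.05)

# Number of people and SNPs that pass the stricter filter
count(rowmask_strict), count(colmask_strict)
```

The returned masks are Boolean vectors over people (rows) and SNPs (columns), so they can be reused wherever `rowmask` and `colmask` appear later in this notebook.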

In [5]:
EUR_data = SnpData(SnpArrays.datadir(filename));

Here we identify by name which locus to include. We first store the names of all the loci in a vector called `snpid`, and then look up the index of the locus of interest. We use this index below to build the design matrix for a model that includes sex and the locus of choice.

In [6]:
bimfile = EUR_data.snp_info # store the snp_info with the snp names

snpid  = bimfile[!, :snpid] # store the snp names in the snpid vector

causal_snp_index = findfirst(x -> x == "rs150018646", snpid); # find the index of the snp of interest by snpid

Additionally, we will control for sex, with females as the baseline group, `sex = 0.0`. Having found the index of the causal locus in the SNP definition (.bim) file, we now subset that locus from the genetic marker data above. Note Julia's ternary operator `? :`, which lets us recode sex in a single expression.

Using SnpArrays.jl we can then use the `convert` and `@view` commands to get the appropriate conversion from SnpArray to a computable vector of Float64. 

In [7]:
locus = convert(Vector{Float64}, @view(EUR[:, causal_snp_index]))
famfile = EUR_data.person_info
sex = map(x -> strip(x) == "F" ? 0.0 : 1.0, famfile[!, :sex])
intercept = ones(length(sex))

X_covar = [intercept sex]

X = [intercept sex locus]

379×3 Array{Float64,2}:
 1.0  1.0  2.0
 1.0  1.0  1.0
 1.0  1.0  2.0
 1.0  1.0  2.0
 1.0  1.0  2.0
 1.0  1.0  2.0
 1.0  1.0  2.0
 1.0  1.0  2.0
 1.0  1.0  2.0
 1.0  1.0  2.0
 1.0  1.0  2.0
 1.0  1.0  2.0
 1.0  1.0  2.0
 ⋮         
 1.0  1.0  2.0
 1.0  1.0  2.0
 1.0  1.0  1.0
 1.0  1.0  2.0
 1.0  1.0  2.0
 1.0  1.0  2.0
 1.0  1.0  2.0
 1.0  1.0  2.0
 1.0  1.0  2.0
 1.0  1.0  2.0
 1.0  1.0  2.0
 1.0  1.0  2.0
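
The sex recoding in the cell above relies on Julia's ternary operator. As a small illustration, the sketch below applies the same expression to a hypothetical vector of labels (`sex_labels` is made up for this example and includes stray whitespace to show why `strip` is used).

```julia
# Hypothetical sex labels, some with surrounding whitespace
sex_labels = ["F", "M", " F", "M "]

# condition ? value_if_true : value_if_false
# Females are the baseline group (0.0); everyone else is coded 1.0
sex_toy = map(x -> strip(x) == "F" ? 0.0 : 1.0, sex_labels)
```

Here `sex_toy` is a `Vector{Float64}` that can be placed directly into a design matrix next to the intercept column.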


# Example 1: GLM Trait

In this example we first demonstrate how to use the GLM.jl package to simulate a trait from unrelated individuals. A more thorough application of this GLM TraitSimulation can be found [here](https://github.com/OpenMendel/TraitSimulation.jl/blob/master/docs/TraitSimulation-TestingMendelIHT.ipynb). 

$Y \sim \text{Poisson}(\mu = g^{-1}(X\beta))$

## GLM Traits from Unrelated Individuals
$$
    Y_{n \times 1} \sim \text{Poisson}(\mu_{n \times 1} = g^{-1}(X\beta))
$$

Here we specify the fixed effects, the phenotype distribution, and the link function, and simulate ten traits per person.

In [8]:
β = [1; 0.2; 0.5]
dist = Poisson()
link = LogLink()
GLMmodel = GLMTrait(X, β, dist, link)

Generalized Linear Model
  * response distribution: Poisson
  * link function: LogLink
  * sample size: 379
  * fixed effects: 3
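
With a log link, each person's Poisson mean is the exponential of the linear predictor, $\mu_i = \exp(x_i^T\beta)$. The sketch below uses only base Julia to show how the fixed effects translate into expected counts. The three-row matrix `X_toy` is a hypothetical stand-in for the design matrix, not data taken from the EUR file.

```julia
# Fixed effects from the cell above: intercept, sex, locus
β = [1.0, 0.2, 0.5]

# Hypothetical design rows (intercept, sex, genotype code); not the EUR data
X_toy = [1.0 1.0 2.0;
         1.0 1.0 1.0;
         1.0 0.0 0.0]

# Inverse of the log link: μ = exp.(Xβ)
μ = exp.(X_toy * β)

round.(μ, digits = 2)
```

For the first row the linear predictor is 1 + 0.2 + 2(0.5) = 2.2, so the expected count is exp(2.2), roughly 9.03. Simulated Poisson counts for such a person should scatter around that value.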

In [9]:
nsim = 10
Simulated_GLM_Traits = DataFrame(simulate(GLMmodel, nsim))
rename!(Simulated_GLM_Traits, [Symbol("Trait$i") for i in 1:nsim])

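| Row | Trait1 | Trait2 | Trait3 | Trait4 | Trait5 | Trait6 | Trait7 | Trait8 | Trait9 |
|-----|--------|--------|--------|--------|--------|--------|--------|--------|--------|
|     | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 10.0 | 11.0 | 5.0 | 8.0 | 8.0 | 13.0 | 8.0 | 11.0 | 9.0 |
| 2 | 7.0 | 5.0 | 8.0 | 5.0 | 2.0 | 6.0 | 5.0 | 3.0 | 8.0 |
| 3 | 9.0 | 8.0 | 11.0 | 6.0 | 2.0 | 8.0 | 10.0 | 8.0 | 11.0 |
| 4 | 9.0 | 13.0 | 10.0 | 7.0 | 7.0 | 8.0 | 13.0 | 3.0 | 11.0 |
| 5 | 11.0 | 6.0 | 12.0 | 13.0 | 13.0 | 4.0 | 12.0 | 8.0 | 11.0 |
| 6 | 12.0 | 7.0 | 8.0 | 9.0 | 11.0 | 9.0 | 8.0 | 14.0 | 8.0 |
| 7 | 6.0 | 13.0 | 12.0 | 6.0 | 9.0 | 10.0 | 9.0 | 7.0 | 5.0 |
| 8 | 7.0 | 12.0 | 7.0 | 10.0 | 5.0 | 5.0 | 3.0 | 7.0 | 12.0 |
| 9 | 7.0 | 12.0 | 5.0 | 7.0 | 7.0 | 9.0 | 5.0 | 9.0 | 13.0 |
| 10 | 9.0 | 5.0 | 6.0 | 8.0 | 11.0 | 12.0 | 6.0 | 10.0 | 8.0 |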


# Example 2: Rare Variant VCM Related Individuals

In this example we show how to generate data so that related individuals have correlated trait values even after we account for the effect of a SNP, a combination of SNPs, or other fixed effects. We simulate data under a linear mixed model so that we can model residual dependency among individuals.

$$
Y \sim \text{Normal}(\mathbf{\mu}_{n \times 1} = X\beta, \Sigma_{n \times n} = \sigma_A \times 2\hat{\Phi}_{GRM} + \sigma_E \times I_n)
$$

This example is meant to simulate data in a scenario in which a number of rare mutations in a single gene can change a trait value. We model the residual variation among relatives with the additive genetic variance component, and we include 20 rare variants in the mean portion of the model. Here rare variants are defined as loci with minor allele frequencies greater than 0.002 and at most 0.02.

Specifically we are generating a single normal trait controlling for family structure with residual heritability of 67%, and effect sizes for the variants generated as a function of the minor allele frequencies. The rarer the variant the greater its effect size.

To demonstrate how to specify the genetic and non-genetic covariates separately, we use randomly generated effect sizes in this demonstration. In practice rare variants have smaller minor allele frequencies, but we are limited in this tutorial by the relatively small size of the data set. Note also that modeling these effects as part of the mean is not meant to imply that the best way to detect them would be a standard association analysis. Instead we recommend a burden or SKAT test.
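
To make the covariance structure concrete, here is a minimal sketch in base Julia of one draw from a variance component model for a hypothetical father, mother, and child trio. The kinship matrix `Φ`, the variance components `σ_A = 2` and `σ_E = 1` (chosen so that heritability is 2/3, about 67%), and the mean vector `μ` are illustrative assumptions. The tutorial itself estimates relatedness with the GRM and simulates through `VCMTrait`.

```julia
using LinearAlgebra, Random

Random.seed!(2020)

# Kinship matrix for a non-inbred trio: father, mother, child (hypothetical pedigree)
Φ = [0.5  0.0  0.25;
     0.0  0.5  0.25;
     0.25 0.25 0.5]

# Illustrative variance components: additive genetic and environmental
σ_A, σ_E = 2.0, 1.0

# Residual covariance: additive variance times twice the kinship, plus independent noise
Σ = σ_A * 2Φ + σ_E * I

# Heritability implied by these variance components
h2 = σ_A / (σ_A + σ_E)

# One draw from Normal(μ, Σ) via the Cholesky factor of Σ
μ = [1.0, 1.6, 1.6]
L = cholesky(Symmetric(Σ)).L
y = μ + L * randn(3)
```

The off-diagonal entries of `Σ` are nonzero only for the parent and child pairs, which is what induces correlated trait values among relatives after the mean has been accounted for.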

For users who want a reference on genetic modeling, we recommend [Mathematical And Statistical Methods For Genetic Analysis](http://www.biometrica.tomsk.ru/lib/lange_1.pdf) by Dr. Kenneth Lange. Chapter 8 of this book gives an introduction to variance component models in the genetic setting. For a more in-depth review of variance component modeling in the genetic setting, we include a reference at the end of the notebook [4].

In [10]:
GRM = grm(EUR, minmaf = 0.05);

We simulate traits on 20 rare SNPs for demonstration. Change the parameters and the number of SNPs to model different regions of the genome. The number 20 is arbitrary; you can use more or fewer than 20 by changing the final number.

In [11]:
rare_index = (0.002 .< minor_allele_frequency .≤ 0.02)
filtsnpdata = SnpArrays.filter(EUR_data, rowmask, rare_index, des = "rare_filtered_28data");

In [12]:
β_rare = rand([0.002, 0.004, 0.008, 0.01, 0.012, 0.015, 0.02], 20)
rare_snps = SnpArray("rare_filtered_28data.bed", 379, 20)

379×20 SnpArray:
 0x00  0x00  0x00  0x00  0x00  0x00  …  0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00  …  0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00  …  0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00
 0x00  0x00  0x00  0x00  0x00  0x00     0x00  0x00  0x00  0x00  0x00  0x00
    ⋮   

In [13]:
β_covar = [1.0; 0.6]
I_n = Matrix{Float64}(I, size(GRM))
vc = @vc [0.01][:, :] ⊗ (GRM + I_n) + [0.9][:, :] ⊗ I_n
sigma, v = vcobjtuple(vc)
rare_20_snp_model = VCMTrait(X_covar, β_covar[:, :], rare_snps, β_rare[:, :], [sigma...], [v...])

Variance Component Model
  * number of traits: 1
  * number of variance components: 2
  * sample size: 379

In [14]:
Rare_SNP_Trait = DataFrame(simulate(rare_20_snp_model))
rename!(Rare_SNP_Trait, [Symbol("Trait$i") for i in 1:size(Rare_SNP_Trait, 2)])

| Row | Trait1 |
|-----|--------|
|     | Float64 |
| 1 | 2.09497 |
| 2 | 1.77248 |
| 3 | 1.61694 |
| 4 | 1.94672 |
| 5 | 2.64797 |
| 6 | 1.2259 |
| 7 | 2.90142 |
| 8 | 2.75068 |
| 9 | 1.42602 |
| 10 | 0.892808 |


In [15]:
rm("rare_filtered_28data.bed")
rm("rare_filtered_28data.bim")
rm("rare_filtered_28data.fam")

### Multiple Traits, Multiple Variance Components? Easy.

This example extends the variance component model in the previous example to demonstrate how to efficiently account for any number of other random effects, in addition to the additive genetic and environmental variance components.

$$
Y \sim \text{MatrixNormal}(M = XB, \Omega = \sum_{k=1}^m \Sigma_k \otimes V_k)
$$

We note that this form can also accommodate more than 2 variance components.
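
As a rough illustration of the Kronecker-sum covariance, the base Julia sketch below builds $\Omega$ for $d = 2$ traits, $n = 3$ people, and $m = 2$ variance components, then draws one trait matrix. All matrices here (`V1`, `V2`, `Σ1`, `Σ2`, and the zero mean `M`) are hypothetical values chosen for the example. This is the direct dense construction, not necessarily how `TraitSimulation.jl` computes it internally.

```julia
using LinearAlgebra, Random

Random.seed!(1)

n, d = 3, 2   # people, traits

# Hypothetical n × n covariance structures across people
V1 = [1.0 0.5 0.0;
      0.5 1.0 0.0;
      0.0 0.0 1.0]
V2 = Matrix{Float64}(I, n, n)

# Hypothetical d × d variance components across traits
Σ1 = [2.0 0.5;
      0.5 1.0]
Σ2 = [1.0 0.0;
      0.0 1.0]

# Covariance of vec(Y), with the columns (traits) of Y stacked
Ω = kron(Σ1, V1) + kron(Σ2, V2)

# One draw: vec(Y) ~ Normal(vec(M), Ω), reshaped to n × d
M = zeros(n, d)
Y = M + reshape(cholesky(Symmetric(Ω)).L * randn(n * d), n, d)
```

The dense `Ω` has size $nd \times nd$, so this direct approach becomes expensive quickly as the sample size grows.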

We encourage interested readers to look at [this example](https://github.com/OpenMendel/TraitSimulation.jl/blob/master/docs/benchmarking_VCM.ipynb), where we demonstrate the simulation of $d = 2$ traits with $m = 10$ variance components and benchmark it against the available method using the MatrixNormal distribution in the Julia package [Distributions.jl](https://juliastats.org/Distributions.jl/latest/matrix/#Distributions.MatrixNormal).

# Example 3: Ordered Multinomial Trait

For the last example, we show how to simulate from customized simulation models that accompany specific genetic analysis options in OpenMendel; for example, ordered, multinomial traits. 

We demonstrate on the `OrderedMultinomialTrait` model object in TraitSimulation.jl.


### Ordered Multinomial Trait

This phenotype is special in that the [OrdinalMultinomialModels](https://openmendel.github.io/OrdinalMultinomialModels.jl/stable/#Syntax-1) package provides Julia utilities to fit ordered multinomial models, including the [proportional odds model](https://en.wikipedia.org/wiki/Ordered_logit) and the [ordered probit model](https://en.wikipedia.org/wiki/Ordered_probit) as special cases.

In [16]:
θ = [1.0, 1.2, 1.4]
Ordinal_Model = OrderedMultinomialTrait(X, β, θ, LogitLink())

Ordinal Multinomial Model
  * number of fixed effects: 3
  * number of ordinal multinomial outcome categories: 4
  * link function: LogitLink
  * sample size: 379
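
To see what the thresholds `θ` do, the sketch below computes the four category probabilities for one person under a proportional odds model. It assumes the cumulative-logit parameterization $P(Y \le j) = \text{logistic}(\theta_j - x^T\beta)$, which is our reading of the convention used by OrdinalMultinomialModels. The helper `logistic` and the covariate vector `x` are introduced here for illustration only.

```julia
# Logistic function, the inverse of the logit link
logistic(z) = 1 / (1 + exp(-z))

θ = [1.0, 1.2, 1.4]    # thresholds from the cell above
β = [1.0, 0.2, 0.5]    # fixed effects: intercept, sex, locus
x = [1.0, 1.0, 2.0]    # hypothetical covariate row

# Linear predictor xᵀβ
η = sum(x .* β)

# Cumulative probabilities P(Y ≤ j) for j = 1, 2, 3, and P(Y ≤ 4) = 1
cum = [logistic.(θ .- η); 1.0]

# Category probabilities P(Y = j) as successive differences
probs = diff([0.0; cum])
```

Under this assumed parameterization the thresholds sit close together relative to the linear predictor, so most of the probability mass falls in the first and last categories.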

In [17]:
nsim = 10 
Ordinal_Trait = DataFrame(simulate(Ordinal_Model, nsim))
rename!(Ordinal_Trait, [Symbol("Trait$i") for i in 1:nsim])

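| Row | Trait1 | Trait2 | Trait3 | Trait4 | Trait5 | Trait6 | Trait7 | Trait8 | Trait9 |
|-----|--------|--------|--------|--------|--------|--------|--------|--------|--------|
|     | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 |
| 2 | 4.0 | 1.0 | 4.0 | 1.0 | 4.0 | 4.0 | 4.0 | 1.0 | 4.0 |
| 3 | 1.0 | 4.0 | 4.0 | 4.0 | 4.0 | 1.0 | 1.0 | 4.0 | 4.0 |
| 4 | 4.0 | 4.0 | 3.0 | 4.0 | 4.0 | 4.0 | 1.0 | 1.0 | 4.0 |
| 5 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 1.0 | 1.0 |
| 6 | 1.0 | 4.0 | 2.0 | 4.0 | 4.0 | 1.0 | 4.0 | 1.0 | 4.0 |
| 7 | 1.0 | 4.0 | 4.0 | 1.0 | 1.0 | 4.0 | 4.0 | 2.0 | 4.0 |
| 8 | 4.0 | 1.0 | 4.0 | 1.0 | 1.0 | 4.0 | 3.0 | 4.0 | 4.0 |
| 9 | 4.0 | 4.0 | 4.0 | 3.0 | 1.0 | 1.0 | 4.0 | 1.0 | 4.0 |
| 10 | 1.0 | 1.0 | 1.0 | 4.0 | 4.0 | 1.0 | 4.0 | 4.0 | 1.0 |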


### Simulate Ordered Multinomial Logistic

Specific to the Ordered Multinomial Logistic model is the option to transform the multinomial outcome (i.e., 1, 2, 3, 4) into a binary outcome for logistic regression.

The default is the multinomial simulation above, but the user can simulate the transformed logistic outcome by specifying the arguments `Logistic = true` and `threshold = 2`, where `threshold` is the value used as a cutoff for identifying cases and controls. **(i.e., if y > 2 => y_logit == 1).** If you specify `Logistic = true` and do not provide a threshold value, the function will throw an error to remind you to specify one.
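
The case and control recoding described above amounts to an elementwise comparison against the cutoff. A minimal sketch in base Julia, using a hypothetical vector of ordinal outcomes `y` (not output from `simulate`):

```julia
# Hypothetical ordinal outcomes with categories 1 through 4
y = [4, 1, 3, 2, 4]

# Cutoff separating controls (y ≤ threshold) from cases (y > threshold)
threshold = 2

# 1.0 for cases, 0.0 for controls
y_binary = Float64.(y .> threshold)
```

With `threshold = 2`, categories 3 and 4 are coded as cases and categories 1 and 2 as controls.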

In [18]:
Logistic_Trait = DataFrame(simulate(Ordinal_Model, nsim, Logistic = true, threshold = 2))
rename!(Logistic_Trait, [Symbol("Trait$i") for i in 1:nsim])

Unnamed: 0_level_0,Trait1,Trait2,Trait3,Trait4,Trait5,Trait6,Trait7,Trait8,Trait9
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0
2,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0
3,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0
4,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0
5,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0
6,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0
7,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0
8,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0
9,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
10,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0


# Example 4: GLMM Trait Simulation

Next, we demonstrate how to simulate a Poisson Trait, after controlling for family structure.  

In [19]:
dist = Poisson()
link = LogLink()

vc = @vc [0.01][:, :] ⊗ (GRM + I_n) + [0.9][:, :] ⊗ I_n
GLMMmodel = GLMMTrait(X, β[:, :], vc, dist, link)

Generalized Linear Mixed Model
  * response distribution: Poisson
  * link function: LogLink
  * number of variance components: 2
  * sample size: 379

In [20]:
Y = DataFrame(simulate(GLMMmodel))
rename!(Y, [Symbol("Trait$i") for i in 1:size(Y, 2)])

| Row | Trait1 |
|-----|--------|
|     | Int64 |
| 1 | 29 |
| 2 | 1 |
| 3 | 46 |
| 4 | 11 |
| 5 | 5 |
| 6 | 12 |
| 7 | 16 |
| 8 | 23 |
| 9 | 2 |
| 10 | 49 |


### Example Downstream Application: Power Analysis

Our group recently demonstrated customized simulation utilities that accompany specific genetic analysis options in OpenMendel, for example Variance Component Models (VCM) and ordered multinomial traits, on a subset of UK Biobank data [5]. The [simulation results](https://openmendel.github.io/TraitSimulation.jl/dev/examples/ukbiobank_vcm_power/) are now available to view on GitHub.

In addition, we use the ordinal model to demonstrate how we estimate the statistical power to detect the effect of a single associated SNP on simulated data (accessible to all users). In this example we use simulated data to demonstrate the full power pipeline contained within the OpenMendel environment. Operating within the OpenMendel universe brings potential advantages over other available software when the simulated data are needed for downstream analysis or study design. The figure below diagrams how TraitSimulation can deliver an understandable and developer-friendly pipeline in concert with other OpenMendel modules.

![png](diagram.png)

We illustrate this example in three digestible steps, as shown in the figure for the Ordered Multinomial Model:
   1. Simulate genotypes and covariate values representative of the study population.
   2. Carry over the simulated design matrix from (1) to create the OrderedMultinomialTrait model object.
   3. Simulate from the OrderedMultinomialTrait model object created in (2) and run the power analysis at the desired significance level.
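
The general shape of such a simulation-based power calculation can be sketched in base Julia. This is not the OpenMendel pipeline: for brevity it swaps the ordered multinomial model for a simple linear trait and a Wald z-test on one SNP, and the function name `estimate_power`, its arguments, and the critical value 1.96 (two-sided level 0.05 under a normal approximation) are all assumptions made for this illustration.

```julia
using Random, Statistics

# Estimate power as the fraction of simulated data sets in which the
# SNP effect is declared significant.
function estimate_power(n, maf, effect; nsim = 1000, zcrit = 1.96,
                        rng = Random.default_rng())
    rejections = 0
    for _ in 1:nsim
        # Step 1: genotypes as allele counts (0, 1, or 2) under Hardy-Weinberg
        g = [(rand(rng) < maf) + (rand(rng) < maf) for _ in 1:n]
        # Steps 2 and 3: simulate a trait conditional on the genotypes
        y = effect .* g .+ randn(rng, n)
        # Wald z statistic for the slope of a simple linear regression
        gc = g .- mean(g)
        sxx = sum(abs2, gc)
        sxx == 0 && continue   # monomorphic sample: no test possible
        slope = sum(gc .* y) / sxx
        resid = (y .- mean(y)) .- slope .* gc
        se = sqrt(sum(abs2, resid) / (n - 2) / sxx)
        rejections += abs(slope / se) > zcrit
    end
    return rejections / nsim
end

# Example call with hypothetical study parameters
estimate_power(500, 0.2, 0.15; nsim = 500)
```

Replacing the trait simulation and the test inside the loop with `simulate` on an `OrderedMultinomialTrait` object and the corresponding ordinal regression test would give the structure of the three steps listed above.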

## Citations: 

[1] Lange K, Papp JC, Sinsheimer JS, Sripracha R, Zhou H, Sobel EM (2013) Mendel: The Swiss army knife of genetic analysis programs. Bioinformatics 29:1568-1570.


[2] OPENMENDEL: a cooperative programming project for statistical genetics.
[Hum Genet. 2019 Mar 26. doi: 10.1007/s00439-019-02001-z](https://www.ncbi.nlm.nih.gov/pubmed/?term=OPENMENDEL).

[3] German, CA, Sinsheimer, JS, Klimentidis, YC, Zhou, H, Zhou, JJ. Ordered multinomial regression for genetic association analysis of ordinal phenotypes at Biobank scale. Genetic Epidemiology. 2019; 1-13. https://doi.org/10.1002/gepi.22276

[4] Lange K, Boehnke M (1983) Extensions to pedigree analysis. IV. Covariance component models for multivariate traits. Amer J Med Genet 14:513-524.

[5] Ji, SS, Lange, K, Sinsheimer, JS, Zhou, JJ, Zhou, H, Sobel, E. Modern Simulation Utilities for Genetic Analysis. BMC Bioinformatics. 2020; BINF-D-20-00690