# Trait Simulation Tutorial


Authors: Sarah Ji, Janet Sinsheimer, Kenneth Lange

In this notebook we show how to use the `TraitSimulation.jl` package to simulate traits from genotype data on unrelated individuals or families under user-specified Generalized Linear Models (GLMs) or Linear Mixed Models (LMMs), respectively. Under either a GLM or an LMM, the user can specify the number of repetitions for each simulation model. By default, the simulation returns the result of a single simulation.

We use the OpenMendel package [SnpArrays.jl](https://openmendel.github.io/SnpArrays.jl/latest/) to both read in and write out PLINK formatted files. The notebook is organized as follows:

$\textbf{Example 1: Mendel28e Linear Mixed Model Example}$

In this example we show how to generate data so that related individuals have correlated trait values even after we account for the effect of a SNP, a combination of SNPs, or other fixed effects. We simulate data under a linear mixed model so that we can model residual dependency among individuals. We use the same parameters as in Mendel Option 28e, with the simulation parameters for Trait1 and Trait2 as shown below.

For convenience we use the common assumption that the residual covariance between two relatives can be captured by the additive genetic variance times twice the kinship coefficient. However, you can specify your own variance components and their design matrices, as long as they are positive semidefinite, using the `@vc` macro demonstrated in this example. We run this simulation 1000 times, and store the simulation results in a vector of DataFrames.

$\textbf{Example 2: Rare Variant Linear Mixed Model}$

This example simulates data for a scenario in which a number of rare mutations in a single gene can change a trait value. We model the residual variation among relatives with the additive genetic variance component, and we include 20 rare variants in the mean portion of the model. Here rare variants are defined as loci with minor allele frequencies greater than 0.002 but less than 0.02. In practice rare variants have smaller minor allele frequencies, but we are limited in this tutorial by the relatively small size of the data set. Note also that modeling these effects as part of the mean is not meant to imply that the best way to detect them would be a standard association analysis. Instead we recommend a burden or SKAT test.

Specifically, we generate a single normal trait controlling for family structure with residual heritability of 67%, and effect sizes for the variants generated as a function of the minor allele frequencies. The rarer the variant, the greater its effect size.

In both examples, you can specify your own arbitrary fixed effect sizes, variance components and simulation parameters as desired. You can also specify the number of replicates for each Trait simulation in the `simulate` function.

### Double check that you are using Julia version 1.0 or higher by checking the machine information

In [1]:
versioninfo()

Julia Version 1.2.0
Commit c6da87ff4b (2019-08-20 00:03 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.6.0)
  CPU: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)


In [2]:
using DataFrames, SnpArrays, Random, LinearAlgebra, CSV, TraitSimulation

In [3]:
Random.seed!(1234);

# Reading the Mendel 28e data using SnpArrays

First use `SnpArrays.jl` to read in the genotype data. We use PLINK formatted data with the same prefixes for the .bim, .fam, .bed files.

The data we will be using are from the Mendel version 16[1] sample files. The data are described in examples under Option 28e in the Mendel Version 16 Manual [Section 28.1,  page 279](http://software.genetics.ucla.edu/download?file=202). They consist of simulated data where the two traits of interest have one contributing SNP and a sex effect.

SnpArrays is a very useful utility and can do a lot more than just read in the data. More information about all the functionality of SnpArrays can be found at:
https://openmendel.github.io/SnpArrays.jl/latest/

In [4]:
filepath = "traitsim28e"
snpdata = SnpArray(filepath * ".bed")
rowmask, colmask =  SnpArrays.filter(snpdata);
famfile28e = CSV.read("traitsim28e.fam"; header = [:fid, :iid, :father, :mother, :sex, :phenotype1, :phenotype2])

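| Row | fid | iid | father | mother | sex | phenotype1 | phenotype2 |
|-----|-----|-----|--------|--------|-----|------------|------------|
| | Int64 | Int64 | Int64 | Int64 | String | Float64 | Float64 |
| 1 | 1 | 9218 | 17008 | 16 | M | 38.9635 | 18.9857 |
| 2 | 1 | 3226 | 9218 | 8228 | F | 33.7391 | 21.1041 |
| 3 | 2 | 29 | 0 | 0 | F | 34.8884 | 19.0114 |
| 4 | 2 | 2294 | 0 | 0 | M | 37.7011 | 19.1656 |
| 5 | 2 | 3416 | 0 | 0 | M | 45.1317 | 19.8409 |
| 6 | 2 | 17893 | 2294 | 29 | F | 35.156 | 14.1423 |
| 7 | 2 | 6952 | 3416 | 17893 | M | 42.4514 | 19.9271 |
| 8 | 2 | 14695 | 2294 | 29 | F | 35.6426 | 17.4191 |
| 9 | 2 | 6790 | 2294 | 29 | M | 40.6344 | 23.6845 |
| 10 | 2 | 3916 | 2294 | 29 | F | 34.8618 | 16.8684 |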


Transform the sex variable from M/F to 1/-1, as is done in the older version of Mendel. If you prefer, you can use the more common convention of coding one sex as the reference (0) and the other as 1, but then comparing the results to the older version of Mendel takes a little more work.
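The two coding conventions differ only by a linear change of variable, so the coefficients can be translated between them. The following is a minimal sketch on a toy vector (the labels are made up, and the intercept 40 and sex effect 3 are taken from the Trait1 mean used later in this tutorial):

```julia
# Toy stand-in for the sex column of the .fam file (hypothetical values)
sex_labels = ["M", "F", "F", "M"]

# Mendel-style coding: M = 1, F = -1
sex_mendel = [strip(s) == "F" ? -1.0 : 1.0 for s in sex_labels]

# Reference coding: F = 0 (reference), M = 1
sex_ref = [strip(s) == "F" ? 0.0 : 1.0 for s in sex_labels]

# Since sex_mendel = 2 * sex_ref - 1, the mean 40 + 3 * sex_mendel
# is the same as 37 + 6 * sex_ref
mean_mendel = 40 .+ 3 .* sex_mendel
mean_ref    = 37 .+ 6 .* sex_ref
```

Under the reference coding the intercept becomes the mean for the reference sex, and the sex effect doubles.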

In [5]:
sex = map(x -> strip(x) == "F" ? -1.0 : 1.0, famfile28e[!, :sex]) # note julia's ternary operator '?'

209-element Array{Float64,1}:
  1.0
 -1.0
 -1.0
  1.0
  1.0
 -1.0
  1.0
 -1.0
  1.0
 -1.0
 -1.0
 -1.0
  1.0
  ⋮  
  1.0
  1.0
  1.0
 -1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0

We will use SNP rs10412915 as a covariate in our model. We want to find the index of this causal locus in the SNP definition (`.bim`) file and then subset that locus from the genetic marker data above.
We first collect the names of all the loci into a vector called `snpid`, then store the design matrix for the model, which includes sex and locus rs10412915.
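The lookup itself only needs base Julia. As a toy illustration (the SNP names other than rs10412915 and the dosage matrix are hypothetical), `findfirst` returns the column index directly and returns `nothing` when the name is absent, which is worth checking before indexing:

```julia
# Hypothetical SNP names and a toy dosage matrix (3 individuals x 3 SNPs)
snpid = ["rs1000", "rs10412915", "rs2000"]
genotypes = [0.0 2.0 1.0;
             1.0 0.0 1.0;
             2.0 1.0 0.0]

# Index of the causal locus; `nothing` if the name is not present
idx = findfirst(isequal("rs10412915"), snpid)
idx === nothing && error("SNP not found in snpid")

# Column of dosages for that locus, used as a covariate
locus = genotypes[:, idx]
```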

In [6]:
bimfile = SnpData(filepath).snp_info
snpid  = bimfile[!, :snpid]
ind_rs10412915 = findall(x -> x == "rs10412915", snpid)[1]
locus = convert(Vector{Float64}, @view(snpdata[:, ind_rs10412915]))
X = DataFrame(sex = sex, locus = locus)

| Row | sex | locus |
|-----|-----|-------|
| | Float64 | Float64 |
| 1 | 1.0 | 2.0 |
| 2 | -1.0 | 0.0 |
| 3 | -1.0 | 2.0 |
| 4 | 1.0 | 2.0 |
| 5 | 1.0 | 1.0 |
| 6 | -1.0 | 1.0 |
| 7 | 1.0 | 1.0 |
| 8 | -1.0 | 2.0 |
| 9 | 1.0 | 1.0 |
| 10 | -1.0 | 1.0 |


# Example 1: Multiple Correlated Traits (Mendel Example 28e Simulation)

We simulate two correlated normal traits controlling for family structure, with mean $\mathbf{\mu}$ and covariance $\mathbf\Sigma$.
The corresponding bivariate variance-covariance matrix $\mathbf\Sigma$, as specified in Mendel Option 28e, is generated here.

$$
Y \sim N(\mathbf{\mu}, \mathbf\Sigma)
$$

$$
\mathbf{\mu} = \begin{bmatrix}
\mu_1 \\
\mu_2
\end{bmatrix}
= \begin{bmatrix}
40 + 3(\text{sex}) - 1.5(\text{locus}) \\
20 + 2(\text{sex}) - 1.5(\text{locus})
\end{bmatrix}
$$

$$
\mathbf\Sigma  = V_a \otimes (2GRM) + V_e \otimes I_n
$$
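To make the Kronecker structure of $\mathbf\Sigma$ concrete, here is a minimal sketch that builds it for a toy pedigree of two unrelated parents and one child, then draws one set of correlated residuals with a Cholesky factor. The matrix `two_phi` is a hypothetical stand-in for `2GRM`, and this is only an illustration of the formula, not the internals of `TraitSimulation.jl`:

```julia
using LinearAlgebra, Random

# Toy stand-in for 2 * GRM: two unrelated parents and their child
two_phi = [1.0 0.0 0.5;
           0.0 1.0 0.5;
           0.5 0.5 1.0]
n = size(two_phi, 1)

V_A = [4.0 1.0; 1.0 4.0]   # additive genetic covariance between the two traits
V_E = [2.0 0.0; 0.0 2.0]   # environmental covariance between the two traits

# Covariance of vec(Y), where Y is n x 2 with one column per trait
Sigma = kron(V_A, two_phi) + kron(V_E, Matrix{Float64}(I, n, n))

# One draw of correlated residuals, to be added to the n x 2 mean matrix
rng = MersenneTwister(1234)
L = cholesky(Symmetric(Sigma)).L
residuals = reshape(L * randn(rng, 2n), n, 2)
```

The first `n` rows and columns of `Sigma` hold the covariance of Trait1 among individuals, the last `n` hold that of Trait2, and the off-diagonal blocks hold the cross-trait covariances.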


Note: To create a trait with different variance components, change the elements of $\mathbf\Sigma$. We create the variance component object `variance_formula` below to simulate the traits in this example. Although this tutorial uses only two variance components, the `@vc` macro is designed to handle as many variance components as needed.

As long as each Variance Component is specified correctly, we can create a `VarianceComponent` Julia object for Trait Simulation:

For example, to specify more than two variance components (let `V_H` denote an additional household variance component and `V_D` a dominance genetic effect):

```julia
multiple_variance_formula = @vc V_A ⊗ 2GRM + V_E ⊗ I_n + V_D ⊗ Δ + V_H ⊗ H;
```
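Each matrix in such a formula should be symmetric and positive semidefinite. A minimal way to check this before building the formula is sketched below; the helper `is_psd`, the tolerance, and the toy household matrix `H` are assumptions made for illustration and are not part of `TraitSimulation.jl`:

```julia
using LinearAlgebra

# Hypothetical helper: symmetric with smallest eigenvalue >= 0 (up to a tolerance)
is_psd(M; tol = 1e-8) = issymmetric(M) && eigmin(Symmetric(M)) >= -tol

# Toy household-sharing matrix: individuals 1 and 2 share a household
H = [1.0 1.0 0.0;
     1.0 1.0 0.0;
     0.0 0.0 1.0]
V_H = [1.0 0.5; 0.5 1.0]

# A symmetric matrix that is NOT positive semidefinite (eigenvalues -1 and 3)
bad = [1.0 2.0; 2.0 1.0]

println(is_psd(H), " ", is_psd(V_H), " ", is_psd(kron(V_H, H)), " ", is_psd(bad))
```

The Kronecker product of two positive semidefinite matrices is again positive semidefinite, so checking the two factors of each term is sufficient.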

## The Variance Covariance Matrix

Recall that $E(\mathbf{GRM}) = \Phi$.

We use the [SnpArrays.jl](https://github.com/OpenMendel/SnpArrays.jl) package to compute the Genetic Relationship Matrix (GRM), an estimate of the kinship matrix $\Phi$.

We will use the same values of $\mathbf{GRM}$, $V_a$, and $V_e$ in the bivariate covariance matrix for both the mixed effect example and the rare variant example.

Note that the residual covariance between two relatives is the additive genetic variance, $V_a$, times twice the kinship coefficient, $\Phi$. The kinship matrix is estimated by the genetic relationship matrix $\mathbf{GRM}$, computed across the common SNPs with minor allele frequency at least 0.05.
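For intuition, a GRM-style kinship estimate can be sketched in a few lines on a toy dosage matrix. This assumes the classical estimator, in which each SNP is centered at twice its allele frequency and scaled by its binomial variance. The data are made up, and the `grm` function in `SnpArrays.jl` may differ in details such as missing-data handling:

```julia
using Statistics

# Toy dosage matrix: 5 individuals x 4 SNPs (hypothetical data)
G = Float64[0 1 2 1;
            1 1 0 2;
            2 0 1 1;
            1 2 1 0;
            0 1 1 2]

# Allele frequencies and a common-SNP filter (minor allele frequency >= 0.05)
p = vec(mean(G, dims = 1)) ./ 2
keep = min.(p, 1 .- p) .>= 0.05
pk = p[keep]'                       # 1 x m row of kept frequencies

# Center each SNP at 2p and scale by sqrt(2p(1 - p))
Z = (G[:, keep] .- 2 .* pk) ./ sqrt.(2 .* pk .* (1 .- pk))

# Kinship estimate: average over SNPs, with the extra factor 1/2
Phi_hat = Z * Z' ./ (2 * count(keep))
```

With this scaling the diagonal entries are near 1/2 for non-inbred individuals in large samples, which is why the model multiplies the GRM by 2.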

In [7]:
GRM = grm(snpdata, minmaf = 0.05)

209×209 Array{Float64,2}:
  0.497782     0.00779799    0.016657    …   0.0140643    -0.00555751
  0.00779799   0.498684     -0.0213316       0.0198823    -0.00755835
  0.016657    -0.0213316     0.497585        0.00955793    0.0336777 
  0.25366     -0.00194255    0.281407        0.00753494    0.0204589 
  0.126378     0.254046      0.129138        0.00573495   -0.00581729
 -0.0144856   -0.00275722   -0.00250825  …  -0.0116607     0.00598452
 -0.0226168    0.00895183   -0.0110798       0.00297856   -0.0140158 
 -0.0161397   -0.00731042   -0.015204       -0.00219203    0.00893666
 -0.0170008    0.00269143   -0.0131553      -0.0169046    -0.0058373 
 -0.0177143   -0.0104742    -0.0160042      -0.00568833   -0.00707821
 -0.0162183    0.00914329    0.0059497   …  -0.0160541    -0.00293139
 -0.0164437    0.0034591    -0.00952172     -0.00945007   -0.00813788
 -0.0311472    0.000528572  -0.0109591       0.007882     -0.00431412
  ⋮                                      ⋱                      

These are the formulas for the mean and variance, as specified by Mendel Option 28e.

In [8]:
mean_formulas = ["40 + 3(sex) - 1.5(locus)", "20 + 2(sex) - 1.5(locus)"]

2-element Array{String,1}:
 "40 + 3(sex) - 1.5(locus)"
 "20 + 2(sex) - 1.5(locus)"
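As a sanity check on what these formulas encode, the means can be evaluated by hand. The sketch below uses the `sex` and `locus` values from the first three rows of `X` shown earlier and plain broadcasting rather than the package's formula parser:

```julia
# First three rows of the design matrix X shown above
sex   = [1.0, -1.0, -1.0]
locus = [2.0,  0.0,  2.0]

# Evaluate the two mean formulas element by element
mu1 = 40 .+ 3 .* sex .- 1.5 .* locus
mu2 = 20 .+ 2 .* sex .- 1.5 .* locus
```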

In [9]:
I_n = Matrix{Float64}(I, size(GRM))
V_A = [4 1; 1 4]
V_E = [2.0 0.0; 0.0 2.0];

# @vc is a macro that creates a 'VarianceComponent' Type for simulation
variance_formula = @vc V_A ⊗ 2GRM + V_E ⊗ I_n;

In [10]:
Multiple_LMM_traits_model = LMMTrait(mean_formulas, X, variance_formula)
Simulated_LMM_Traits = simulate(Multiple_LMM_traits_model)
Simulated_LMM_Traits = DataFrame(Trait1 = Simulated_LMM_Traits[:, 1][:], Trait2 = Simulated_LMM_Traits[:, 2][:])

| Row | Trait1 | Trait2 |
|-----|--------|--------|
| | Float64 | Float64 |
| 1 | 40.0 | 19.0 |
| 2 | 37.0 | 18.0 |
| 3 | 34.0 | 15.0 |
| 4 | 40.0 | 19.0 |
| 5 | 41.5 | 20.5 |
| 6 | 35.5 | 16.5 |
| 7 | 41.5 | 20.5 |
| 8 | 34.0 | 15.0 |
| 9 | 41.5 | 20.5 |
| 10 | 35.5 | 16.5 |


### Summary Statistics of Our Simulated Traits vs. Mendel 28e Simulated Traits

In [11]:
describe(Simulated_LMM_Traits)

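| Row | variable | mean | min | median | max | nunique | nmissing | eltype |
|-----|----------|------|-----|--------|-----|---------|----------|--------|
| | Symbol | Float64 | Float64 | Float64 | Float64 | Nothing | Nothing | DataType |
| 1 | Trait1 | 38.1411 | 34.0 | 40.0 | 43.0 | | | Float64 |
| 2 | Trait2 | 18.0502 | 15.0 | 19.0 | 22.0 | | | Float64 |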


In [12]:
describe(famfile28e[!, [:phenotype1, :phenotype2]])

| Row | variable | mean | min | median | max | nunique | nmissing | eltype |
|-----|----------|------|-----|--------|-----|---------|----------|--------|
| | Symbol | Float64 | Float64 | Float64 | Float64 | Nothing | Nothing | DataType |
| 1 | phenotype1 | 37.9152 | 29.2403 | 37.7165 | 47.8619 | | | Float64 |
| 2 | phenotype2 | 18.5265 | 9.88673 | 18.654 | 27.5554 | | | Float64 |


# Example 2: Rare Variant Linear Mixed Model


$$
Y \sim N(\mu, 4 \cdot (2GRM) + 2 I_n)
$$
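The variance components in this model imply the residual heritability quoted in the overview. As a quick check, assuming the diagonal of `2GRM` is close to 1 so that the additive variance is about 4 and the environmental variance is 2:

```julia
sigma2_a = 4.0   # additive genetic variance (coefficient of 2GRM)
sigma2_e = 2.0   # environmental variance (coefficient of the identity)

# Residual heritability: share of residual variance that is additive genetic
h2 = sigma2_a / (sigma2_a + sigma2_e)
println(round(100 * h2, digits = 1), "%")   # prints 66.7%
```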

In this example we first subset only the rare SNPs with minor allele frequency greater than 0.002 but no more than 0.02, then we simulate a trait with 20 of the rare SNPs as fixed effects. For this demo, we take the first k = 20 rare SNPs. Change the parameters and the number of SNPs to model different regions of the genome. The number 20 is arbitrary; you can use more or fewer than 20 by changing the final number.
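The selection logic can be illustrated on a toy vector of minor allele frequencies (the values and `k = 2` are hypothetical). A chained, broadcast comparison builds the mask, and `findall` gives the indices from which the first `k` are taken:

```julia
# Hypothetical minor allele frequencies for five SNPs
maf_toy = [0.001, 0.005, 0.02, 0.03, 0.0021]

# true where 0.002 < MAF <= 0.02
rare_mask = 0.002 .< maf_toy .<= 0.02

# Indices of the first k rare SNPs
k = 2
first_k = findall(rare_mask)[1:k]
```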

In [13]:
# filter out rare SNPS, get a subset of uncommon SNPs with 0.002 < MAF ≤ 0.02
minor_allele_frequency = maf(snpdata)
rare_index = (0.002 .< minor_allele_frequency .≤ 0.02)
filtsnpdata = SnpArrays.filter(filepath, rowmask, rare_index, des = "rare_filtered_28data")

209×79507 SnpArray:
 0x03  0x00  0x00  0x00  0x03  0x03  …  0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x02  0x02  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03  …  0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03  …  0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
    ⋮

### Chi-squared Distribution (df = 1)

For demonstration purposes, we simulate effect sizes from the Chi-squared(df = 1) distribution: we use the minor allele frequency (maf) as x and evaluate f(x), where f is the Chi-squared(df = 1) density, so that the rarest SNPs have the largest effect sizes. The effect sizes are rounded to two decimal places throughout this example. Notice that each effect is multiplied by a random +1 or -1, so that some effects increase and others decrease the simulated trait value.

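```julia
# Generating effect sizes (excerpt). This code assumes `rare_snps` and
# `maf_rare_snps` (the selected rare SNPs and their minor allele
# frequencies) are already defined.

n = length(rare_snps)
chisq_coeff = zeros(n)

for i in 1:n
    # random sign times a magnitude that grows as the minor allele frequency shrinks
    chisq_coeff[i] = rand([-1, 1]) * (0.1 / sqrt(maf_rare_snps[i] * (1 - maf_rare_snps[i])))
end
```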
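The prose in this section describes evaluating the Chi-squared(df = 1) density at each minor allele frequency, whereas the excerpt above uses a different closed form. Both give larger magnitudes to rarer variants. A minimal sketch that follows the prose literally is shown here; the frequencies are hypothetical, the density is written out by hand, and this is not claimed to be what `Generate_Random_Model_Chisq` does internally:

```julia
using Random

# Chi-squared density with 1 degree of freedom, for x > 0
chisq1_pdf(x) = exp(-x / 2) / sqrt(2π * x)

rng = MersenneTwister(1234)
maf_toy = [0.003, 0.008, 0.015, 0.02]   # hypothetical rare-variant frequencies

# Magnitudes rounded to two decimal places; rarer variants get larger values
magnitudes = round.(chisq1_pdf.(maf_toy), digits = 2)

# Random sign so that effects both increase and decrease the trait
signs = rand(rng, [-1, 1], length(maf_toy))
effects = signs .* magnitudes
```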

In [14]:
meanformula_rare, df_rare = Generate_Random_Model_Chisq("rare_filtered_28data", 20)
rare_20_snp_model = LMMTrait([meanformula_rare], df_rare, 4*(2*GRM) + 2*(I_n))
trait_rare_20_snps = DataFrame(SimTrait = simulate(rare_20_snp_model)[:])

| Row | SimTrait |
|-----|----------|
| | Float64 |
| 1 | 0.438912 |
| 2 | 4.05968 |
| 3 | 1.06029 |
| 4 | -1.70496 |
| 5 | 1.47966 |
| 6 | 5.69039 |
| 7 | 0.886001 |
| 8 | 0.0463611 |
| 9 | 3.94482 |
| 10 | 0.98625 |


Some summary statistics

In [15]:
describe(trait_rare_20_snps)

| Row | variable | mean | min | median | max | nunique | nmissing | eltype |
|-----|----------|------|-----|--------|-----|---------|----------|--------|
| | Symbol | Float64 | Float64 | Float64 | Float64 | Nothing | Nothing | DataType |
| 1 | SimTrait | 2.1325 | -5.61635 | 2.13658 | 9.29495 | | | Float64 |


## Citations: 

[1] Lange K, Papp JC, Sinsheimer JS, Sripracha R, Zhou H, Sobel EM (2013) Mendel: The Swiss army knife of genetic analysis programs. Bioinformatics 29:1568-1570.


[2] OPENMENDEL: a cooperative programming project for statistical genetics.
[Hum Genet. 2019 Mar 26. doi: 10.1007/s00439-019-02001-z](https://www.ncbi.nlm.nih.gov/pubmed/?term=OPENMENDEL).
