# Trait Simulation Tutorial


Authors: Sarah Ji, Janet Sinsheimer, Kenneth Lange

In this notebook we show how to use the `TraitSimulation.jl` package to simulate traits from genotype data on unrelated individuals or families, using user-specified Generalized Linear Models (GLMs) or Linear Mixed Models (LMMs), respectively. Under either a GLM or an LMM, the user can specify the number of repetitions for each simulation model. By default, the simulation returns the result of a single simulation.

We use the OpenMendel package [SnpArrays.jl](https://openmendel.github.io/SnpArrays.jl/latest/) to read in the PLINK formatted SNP data. The notebook is organized as follows:

$\textbf{Example 1: Mendel28e Example}$

In example 1 we simulate data under a linear mixed model so that we can model residual dependency among individuals. We use the same parameters as were used in Mendel Option 28e with the simulation parameters for Trait1 and Trait2 in Ped28e.out as shown below. 

$\textbf{Example 2: Rare Variant Example}$

In example 2 we simulate a trait from the first 20 rare SNPs, with effect sizes simulated from the minor allele frequencies.

In both examples, you can specify your own arbitrary fixed effect sizes, variance components and simulation parameters as desired. You can also specify the number of replicates for each trait simulation in the `simulate` function.

### Double check that you are using Julia version 1.0 or higher by checking the machine information

In [1]:
versioninfo()

Julia Version 1.2.0
Commit c6da87ff4b (2019-08-20 00:03 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.6.0)
  CPU: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)


In [2]:
using DataFrames, SnpArrays, Random, LinearAlgebra, TraitSimulation

In [3]:
Random.seed!(1234);

# Reading the Mendel 28e data using SnpArrays

First use `SnpArrays.jl` to read in the genotype data. We use PLINK formatted data with the same prefixes for the .bim, .fam, .bed files.

The data we will be using are from the Mendel version 16 [1] sample files. The data are described in the examples under Option 28e in the Mendel Version 16 Manual [Section 28.1, page 279](http://software.genetics.ucla.edu/download?file=202). They consist of simulated data where the two traits of interest have one contributing SNP and a sex effect.

SnpArrays is a very useful utility and can do a lot more than just read in the data. More information about all the functionality of SnpArrays can be found at:
https://openmendel.github.io/SnpArrays.jl/latest/

In [22]:
filepath = "traitsim28e"
snpdata = SnpArray(filepath * ".bed")
rowmask, colmask =  SnpArrays.filter(snpdata);
famfile = SnpData(filepath).person_info

| | fid | iid | father | mother | sex | phenotype | x7 |
|---|---|---|---|---|---|---|---|
| | Abstract… | Abstract… | Abstract… | Abstract… | Abstract… | Abstract… | Abstract… |
| 1 | 1 | 16 | 0 | 0 | F | 30.20564 | 9.2421 |
| 2 | 1 | 8228 | 0 | 0 | F | 35.82143 | 15.27458 |
| 3 | 1 | 17008 | 0 | 0 | M | 36.05298 | 19.50496 |
| 4 | 1 | 9218 | 17008 | 16 | M | 38.96351 | 18.98575 |
| 5 | 1 | 3226 | 9218 | 8228 | F | 33.73911 | 21.10412 |
| 6 | 2 | 29 | 0 | 0 | F | 34.88835 | 19.01142 |
| 7 | 2 | 2294 | 0 | 0 | M | 37.70105 | 19.16556 |
| 8 | 2 | 3416 | 0 | 0 | M | 45.13171 | 19.84088 |
| 9 | 2 | 17893 | 2294 | 29 | F | 35.15599 | 14.14228 |
| 10 | 2 | 6952 | 3416 | 17893 | M | 42.45136 | 19.92713 |


Transform the sex variable from M/F to 1/-1, as is done in the older version of Mendel. If you prefer, you can use the more common convention of coding one sex as the reference (zero) and the other as 1, but then you will have to work a little harder to compare the results to the older version of Mendel.

In [5]:
sex = map(x -> strip(x) == "F" ? -1.0 : 1.0, famfile[!, :sex]) # note julia's ternary operator '?'

212-element Array{Float64,1}:
 -1.0
 -1.0
  1.0
  1.0
 -1.0
 -1.0
  1.0
  1.0
 -1.0
  1.0
 -1.0
  1.0
 -1.0
  ⋮  
  1.0
  1.0
  1.0
 -1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
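To relate the 1/-1 coding used above to the 0/1 reference coding, here is a minimal sketch on toy labels. The vector `sex_labels` and the reduced mean `40 + 3(sex)` (the Trait 1 mean without the locus term) are illustrative assumptions, not part of the package.

```julia
# Toy sex labels; in the tutorial these come from famfile[!, :sex].
sex_labels = ["F", "F", "M", "M", "F"]

# Mendel-style coding: F => -1, M => +1
sex_pm = [s == "F" ? -1.0 : 1.0 for s in sex_labels]

# Reference coding: F => 0 (reference sex), M => 1
sex_01 = [s == "F" ? 0.0 : 1.0 for s in sex_labels]

# Mean under the 1/-1 coding: 40 + 3 * sex
mu_pm = 40 .+ 3 .* sex_pm

# Same means under the 0/1 coding: intercept 40 - 3 = 37, slope 2 * 3 = 6
mu_01 = 37 .+ 6 .* sex_01
```

Both codings give identical means; only the intercept and the sex coefficient change, which is why results need translating before comparison with the older version of Mendel.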

We will use SNP rs10412915 as a covariate in our model. We want to find the index of this causal locus in the SNP definition (`.bim`) file and then subset that locus from the genetic marker data above.
We first collect the names of all the loci into a vector called `snpid`, then store the design matrix for the model, which includes sex and locus rs10412915.

In [6]:
bimfile = SnpData(filepath).snp_info
snpid  = bimfile[!, :snpid]
ind_rs10412915 = findall(x -> x == "rs10412915", snpid)[1]
locus = convert(Vector{Float64}, @view(snpdata[:, ind_rs10412915]))
X = DataFrame(sex = sex, locus = locus)

| | sex | locus |
|---|---|---|
| | Float64 | Float64 |
| 1 | -1.0 | 2.0 |
| 2 | -1.0 | 0.0 |
| 3 | 1.0 | 2.0 |
| 4 | 1.0 | 2.0 |
| 5 | -1.0 | 1.0 |
| 6 | -1.0 | 1.0 |
| 7 | 1.0 | 1.0 |
| 8 | 1.0 | 2.0 |
| 9 | -1.0 | 1.0 |
| 10 | 1.0 | 1.0 |


# Example 1: Multiple Correlated Traits (Mendel Example 28e Simulation)

We simulate two correlated normal traits, controlling for family structure, with location $\mathbf{\mu}$ and scale $\mathbf\Sigma$.
The corresponding bivariate variance-covariance matrix $\mathbf{\Sigma}$, as specified in Mendel Option 28e, is generated here.

$$
Y \sim N(\mathbf{\mu}, \mathbf\Sigma)
$$

$$
\mathbf{\mu} = \begin{bmatrix}
\mu_1 \\
\mu_2
\end{bmatrix}
= \begin{bmatrix}
40 + 3(sex) - 1.5(locus)\\
20 + 2(sex) - 1.5(locus)
\end{bmatrix}
$$
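As a quick check of the mean model, the following sketch evaluates both mean formulas by hand for the first three rows of the design matrix `X` shown earlier. It uses plain Julia broadcasting rather than the package's formula parser.

```julia
# First three rows of the design matrix X shown earlier (sex, locus).
sex   = [-1.0, -1.0, 1.0]
locus = [ 2.0,  0.0, 2.0]

mu1 = 40 .+ 3 .* sex .- 1.5 .* locus   # Trait 1 mean
mu2 = 20 .+ 2 .* sex .- 1.5 .* locus   # Trait 2 mean

mu = hcat(mu1, mu2)   # one row per individual, one column per trait
```

For example, the first individual (sex = -1, locus = 2) has means 40 - 3 - 3 = 34 and 20 - 2 - 3 = 15.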

$$
\mathbf\Sigma  = V_a \otimes (2GRM) + V_e \otimes I_n
$$
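To make the Kronecker structure concrete, here is a minimal sketch that assembles $\mathbf\Sigma$ for a toy family and draws one realization through a Cholesky factor. The 3×3 matrix `Phi` (two unrelated parents and one child) and the constant means are hypothetical, and this is not necessarily how `TraitSimulation.jl` samples internally.

```julia
using LinearAlgebra, Random

Random.seed!(1234)

# Hypothetical kinship-scale matrix for a trio: two unrelated parents and one child.
Phi = [0.50 0.00 0.25;
       0.00 0.50 0.25;
       0.25 0.25 0.50]
n = size(Phi, 1)

V_a = [4.0 1.0; 1.0 4.0]   # additive genetic covariance between the two traits
V_e = [2.0 0.0; 0.0 2.0]   # environmental covariance between the two traits

# Covariance of vec(Y), where Y is n × 2 and vec stacks the two trait columns.
Sigma = kron(V_a, 2 .* Phi) + kron(V_e, Matrix{Float64}(I, n, n))

# Draw one realization: vec(Y) = vec(mu) + L * z, with Sigma = L * L'.
mu = hcat(fill(40.0, n), fill(20.0, n))   # hypothetical constant means
L = cholesky(Symmetric(Sigma)).L
Y = reshape(vec(mu) + L * randn(2n), n, 2)
```

With d traits and n individuals, `Sigma` is dn × dn, so the dense Cholesky factorization is only practical for moderate sample sizes such as the 212 individuals used here.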


FYI: To create a trait with different variance components, change the elements of $\mathbf\Sigma$. We create the variance component object `variance_formula` below to simulate our traits in this example. While this tutorial uses only two variance components, the `@vc` macro is designed to handle as many variance components as needed.

As long as each variance component is specified correctly, we can create a `VarianceComponent` Julia object for trait simulation.

Example: specifying more than two variance components (let V_H denote an additional household variance component and V_D a dominance genetic effect):

```julia
multiple_variance_formula = @vc V_A ⊗ 2GRM + V_E ⊗ I_n + V_D ⊗ Δ + V_H ⊗ H;
```

## The Variance Covariance Matrix

Recall: $E(\mathbf{GRM}) = \Phi$

We use the [SnpArrays.jl](https://github.com/OpenMendel/SnpArrays.jl) package to compute the Genetic Relationship Matrix (GRM), which serves as an estimate of the kinship matrix ($\Phi$).

We will use the same values of $\mathbf{GRM}$, $V_a$, and $V_e$ in the bivariate covariance matrix for both the mixed effect example and the rare variant example.

Note that the residual covariance between two relatives is the additive genetic variance, $V_a$, times twice the kinship coefficient, $\Phi$. The kinship matrix is estimated by the genetic relationship matrix $\mathbf{GRM}$, computed across the common SNPs with minor allele frequency at least 0.05.
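As a worked example, consider a parent-offspring pair $(i, j)$, whose theoretical kinship coefficient is $\Phi_{ij} = 1/4$, together with the additive genetic covariance used in this tutorial (4 on the diagonal, 1 off the diagonal). The environmental term contributes nothing for $i \neq j$, so under these assumptions:

$$
\operatorname{Cov}(y_{i1}, y_{j1}) = 2\Phi_{ij}\,(V_a)_{11} = 2 \cdot \tfrac{1}{4} \cdot 4 = 2,
\qquad
\operatorname{Cov}(y_{i1}, y_{j2}) = 2\Phi_{ij}\,(V_a)_{12} = 2 \cdot \tfrac{1}{4} \cdot 1 = \tfrac{1}{2}
$$

In practice the estimated GRM entry replaces the theoretical value of $\Phi_{ij}$, so the realized covariance differs slightly from pair to pair.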

In [7]:
GRM = grm(snpdata, minmaf = 0.05)

212×212 Array{Float64,2}:
  0.498264     0.0080878    0.0164327   …   0.0246825    0.00181856
  0.0080878    0.498054    -0.0212599      -0.0285927   -0.0226525 
  0.0164327   -0.0212599    0.499442       -0.0219661   -0.00748536
  0.253627    -0.00160532   0.282542        0.00612693  -0.00339125
  0.126098     0.253365     0.128931       -0.0158446   -0.00633959
 -0.014971    -0.00266073  -0.00243384  …   0.00384757   0.0145936 
 -0.0221357    0.0100492   -0.0107012      -0.0148443   -0.00127783
 -0.01629     -0.00749253  -0.015372       -0.0163305   -0.00258392
 -0.016679     0.00353587  -0.0128844      -0.0332489   -0.00707839
 -0.0176101   -0.00996912  -0.0158473      -0.00675875  -0.0122339 
 -0.0162558    0.00938592   0.0064231   …  -0.00510882   0.0168778 
 -0.0167487    0.00414544  -0.00936538     -0.0134863    0.0020952 
 -0.031148     0.00112387  -0.010794        0.00383105   0.0198635 
  ⋮                                     ⋱   ⋮                      
 -0.00865735  -0.00335

These are the formulas for the mean and variance, as specified by Mendel Option 28e.

In [8]:
mean_formulas = ["40 + 3(sex) - 1.5(locus)", "20 + 2(sex) - 1.5(locus)"]

2-element Array{String,1}:
 "40 + 3(sex) - 1.5(locus)"
 "20 + 2(sex) - 1.5(locus)"

In [9]:
I_n = Matrix{Float64}(I, size(GRM))
V_A = [4 1; 1 4]
V_E = [2.0 0.0; 0.0 2.0];

# @vc is a macro that creates a 'VarianceComponent' Type for simulation
variance_formula = @vc V_A ⊗ 2GRM + V_E ⊗ I_n;

In [10]:
Multiple_LMM_traits_model = LMMTrait(mean_formulas, X, variance_formula)
Simulated_LMM_Traits = simulate(Multiple_LMM_traits_model)
Simulated_LMM_Traits = DataFrame(Trait1 = Simulated_LMM_Traits[:, 1][:], Trait2 = Simulated_LMM_Traits[:, 2][:])

| | Trait1 | Trait2 |
|---|---|---|
| | Float64 | Float64 |
| 1 | 34.0 | 15.0 |
| 2 | 37.0 | 18.0 |
| 3 | 40.0 | 19.0 |
| 4 | 40.0 | 19.0 |
| 5 | 35.5 | 16.5 |
| 6 | 35.5 | 16.5 |
| 7 | 41.5 | 20.5 |
| 8 | 40.0 | 19.0 |
| 9 | 35.5 | 16.5 |
| 10 | 41.5 | 20.5 |


### Summary Statistics of Our Simulated Traits

In [18]:
describe(Simulated_LMM_Traits)

| | variable | mean | min | median | max | nunique | nmissing | eltype |
|---|---|---|---|---|---|---|---|---|
| | Symbol | Float64 | Float64 | Float64 | Float64 | Nothing | Nothing | DataType |
| 1 | Trait1 | 38.1321 | 34.0 | 40.0 | 43.0 | | | Float64 |
| 2 | Trait2 | 18.0472 | 15.0 | 19.0 | 22.0 | | | Float64 |


# Example 2: Rare Variant Linear Mixed Model


$$
Y \sim N(\mu_{rare20},\ 4 \cdot 2\,\mathbf{GRM} + 2I)
$$

In this example we first subset only the rare SNPs with minor allele frequency greater than 0.002 but at most 0.02, then we simulate traits using 20 of the rare SNPs as fixed effects. For this demo, we subset the first k = 20 rare SNPs. Change the parameters and the number of SNPs for simulation to model different regions of the genome. The number 20 is arbitrary; you can use more or fewer than 20 by changing the final number.

In [13]:
# filter out rare SNPS, get a subset of uncommon SNPs with 0.002 < MAF ≤ 0.02
minor_allele_frequency = maf(snpdata)
rare_index = (0.002 .< minor_allele_frequency .≤ 0.02)
filtsnpdata = SnpArrays.filter(filepath, rowmask, rare_index, des = "rare_filtered_28data")

212×80493 SnpArray:
 0x03  0x00  0x00  0x00  0x03  0x03  …  0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x02  0x02  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03  …  0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03  …  0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
    ⋮
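The filtering step can be illustrated on a small dense matrix. The sketch below assumes a hypothetical genotype matrix `G` of allele counts (100 individuals by 4 SNPs) and computes minor allele frequencies by hand, where the tutorial itself calls `maf` from `SnpArrays.jl`.

```julia
# Hypothetical genotype matrix: 100 individuals × 4 SNPs, entries count copies of one allele.
G = zeros(Int, 100, 4)
G[1, 1]     = 1   # 1 copy among 200 alleles -> frequency 0.005
G[1:3, 2]  .= 1   # 3 copies                 -> frequency 0.015
G[1:40, 3] .= 1   # 40 copies                -> frequency 0.2
                  # SNP 4 is monomorphic     -> frequency 0.0

# Allele frequency per SNP, folded so that it is the minor allele frequency.
freq = vec(sum(G, dims = 1)) ./ (2 * size(G, 1))
toy_maf = min.(freq, 1 .- freq)

# Same interval as in the tutorial: 0.002 < MAF ≤ 0.02
rare_mask = 0.002 .< toy_maf .≤ 0.02

rare_snps = findall(rare_mask)   # column indices of the retained SNPs
```

The lower bound excludes monomorphic SNPs such as the fourth column, which would carry no information as a fixed effect.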

### Chi-squared Distribution (df = 1)

For demonstration purposes, we simulate effect sizes from the Chi-squared (df = 1) distribution: we use the minor allele frequency (maf) as x and evaluate f(x), where f is the pdf of the Chi-squared (df = 1) density, so that the rarest SNPs have the biggest effect sizes. The effect sizes are rounded to the second digit throughout this example. Notice there is a random +1 or -1, so that there are effects that both increase and decrease the simulated trait value.

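```julia
# Generating effect sizes for the Chi-squared (df = 1) example.
# Assumption: `maf_20_rare_snps` holds the minor allele frequencies of the
# first 20 rare SNPs, e.g. maf_20_rare_snps = maf(filtsnpdata)[1:20]

n = length(maf_20_rare_snps)
chisq_coeff = zeros(n)

for i in 1:n
    p = maf_20_rare_snps[i]
    # random sign times a magnitude that grows as the SNP gets rarer
    chisq_coeff[i] = rand([-1, 1]) * (0.1 / sqrt(p * (1 - p)))
end
```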
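To see how the magnitude in the loop above behaves, the following sketch evaluates 0.1 / sqrt(p(1 - p)) at a few hypothetical minor allele frequencies inside the rare interval. The helper `effect_magnitude` and the vector `toy_mafs` are names introduced here for illustration only.

```julia
# Magnitude used in the loop above: 0.1 / sqrt(p * (1 - p)), with p the minor allele frequency.
effect_magnitude(p) = 0.1 / sqrt(p * (1 - p))

# Hypothetical minor allele frequencies inside the rare interval (0.002, 0.02].
toy_mafs = [0.005, 0.01, 0.02]
magnitudes = effect_magnitude.(toy_mafs)

# Random sign so that some SNPs raise and others lower the trait.
signs = rand([-1, 1], length(toy_mafs))
effects = signs .* magnitudes
```

At p = 0.02 the magnitude is 0.1 / 0.14 ≈ 0.71, and it grows as p shrinks, so the rarest SNPs receive the largest effects in absolute value.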

In [16]:
meanformula_rare, df_rare = Generate_Random_Model_Chisq("rare_filtered_28data", 20)
rare_20_snp_model = LMMTrait([meanformula_rare], df_rare, 4*(2*GRM) + 2*(I_n))
trait_rare_20_snps = DataFrame(SimTrait = simulate(rare_20_snp_model)[:])

| | SimTrait |
|---|---|
| | Float64 |
| 1 | -5.63749 |
| 2 | -11.1311 |
| 3 | -10.7046 |
| 4 | -9.1107 |
| 5 | -4.73371 |
| 6 | -10.8011 |
| 7 | -12.6473 |
| 8 | -12.8405 |
| 9 | -11.6011 |
| 10 | -10.359 |


Summary statistics of the simulated trait:

In [17]:
describe(trait_rare_20_snps)

| | variable | mean | min | median | max | nunique | nmissing | eltype |
|---|---|---|---|---|---|---|---|---|
| | Symbol | Float64 | Float64 | Float64 | Float64 | Nothing | Nothing | DataType |
| 1 | SimTrait | -10.0043 | -18.0882 | -9.94102 | -3.60614 | | | Float64 |


## Saving Simulation Results to Local Machine

Here we output the simulated trait values and corresponding genotypes for each of the 212 individuals, labeled by their pedigree ID and person ID.
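A minimal sketch of such an export, using only Julia's built-in file I/O, might look as follows. The ID and trait vectors are small illustrative stand-ins, the output path is hypothetical, and the genotype columns are omitted for brevity.

```julia
# Illustrative stand-ins for the pedigree IDs, person IDs, and simulated trait values.
fid   = ["1", "1", "2"]
iid   = ["16", "8228", "29"]
trait = [-5.63749, -11.1311, -10.8011]

outfile = joinpath(mktempdir(), "simulated_trait.csv")   # hypothetical output path

open(outfile, "w") do io
    println(io, "fid,iid,SimTrait")          # header row
    for (f, p, y) in zip(fid, iid, trait)
        println(io, f, ",", p, ",", y)       # one individual per line
    end
end

saved = readlines(outfile)   # read the file back to confirm its contents
```

In the full workflow, the ID columns would presumably come from `famfile` and the trait column from the simulated DataFrame, with one additional column per exported SNP.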

## Citations: 

[1] Lange K, Papp JC, Sinsheimer JS, Sripracha R, Zhou H, Sobel EM (2013) Mendel: The Swiss army knife of genetic analysis programs. Bioinformatics 29:1568-1570.


[2] OPENMENDEL: a cooperative programming project for statistical genetics.
[Hum Genet. 2019 Mar 26. doi: 10.1007/s00439-019-02001-z](https://www.ncbi.nlm.nih.gov/pubmed/?term=OPENMENDEL).
