# Trait Simulation Tutorial


Authors: Sarah Ji, Janet Sinsheimer, Kenneth Lange

In this notebook we show how to use the `TraitSimulation.jl` package to simulate traits from genotype data for unrelated individuals or families, using user-specified Generalized Linear Models (GLMs) or Linear Mixed Models (LMMs), respectively. Under either a GLM or an LMM, the user can specify the number of repetitions for each simulation model. By default, the simulation returns the result of a single simulation.

The data we will be using are from the Mendel version 16 [1] sample files. The data are described in the examples under Option 28e in the Mendel Version 16 Manual [Section 28.1, page 279](http://software.genetics.ucla.edu/download?file=202). They consist of simulated data where the two traits of interest have one contributing SNP and a sex effect.

### Double check that you are using Julia version 1.0 or higher by checking the machine information

In [1]:
versioninfo()

Julia Version 1.2.0
Commit c6da87ff4b (2019-08-20 00:03 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.6.0)
  CPU: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)


# Add any missing packages needed for this tutorial:

Note: For demonstration purposes, the generation of this Jupyter Notebook requires the use of the following registered packages: `DataFrames.jl`, `SnpArrays.jl`, `StatsModels.jl`, `Random.jl`, `DelimitedFiles.jl`, `StatsBase.jl`, and `StatsFuns.jl`. 

If this is your first time using these packages, you will first have to add them by running the following code chunk in Julia's package manager:

```{julia}
pkg> add DataFrames
pkg> add SnpArrays
...
pkg> add StatsFuns
```
You can also use the package manager to add the `TraitSimulation.jl` package directly from its repository:

```{julia}
pkg> add "https://github.com/sarah-ji/TraitSimulation.jl"
```

Only after all of the necessary packages have been added, load them into your working environment with the `using` command:

In [51]:
using DataFrames, SnpArrays, Random, LinearAlgebra, TraitSimulation, Glob

# Reproducibility

For reproducibility, we set a random seed using the `Random.jl` package for each simulation using `Random.seed!(1234)`. If you wish to end up with different data, you will need to comment out these commands or pass another value to `Random.seed!()`.

In [53]:
Random.seed!(1234);
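
As a quick check of what seeding does, the following sketch reseeds the global generator and confirms that the same seed reproduces the same draws. It uses only the `Random` standard library; the variable names are illustrative and not part of the tutorial's workflow.

```julia
using Random

Random.seed!(1234)
first_draw = randn(5)     # five standard normal draws

Random.seed!(1234)        # reset the generator to the same state
second_draw = randn(5)

Random.seed!(4321)        # a different seed starts a different stream
third_draw = randn(5)

println(first_draw == second_draw)   # same seed, same draws
println(first_draw == third_draw)    # different seed, different draws
```

Reseeding immediately before each simulation call is what makes an individual cell reproducible regardless of which cells were run before it.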

# The notebook is organized as follows:

We use the OpenMendel package [SnpArrays.jl](https://openmendel.github.io/SnpArrays.jl/latest/) to read in the PLINK-formatted SNP data. In Example 1, we simulate generalized linear models assuming that everyone is unrelated, so the only data used from Option 28e are the genotype for a specific locus in the SNP file and the sex of the individual. The pedigree structure and relationship matrix are irrelevant. In Example 2, we simulate data under a linear mixed model so that we can model residual dependency among individuals. In Example 2b, we use the same parameters as were used in Mendel Option 28e, with the simulation parameters for Trait1 and Trait2 in Ped28e.out as shown below.

In both examples, you can specify your own arbitrary fixed effect sizes, variance components and simulation parameters as desired. You can also specify the number of replicates for each Trait simulation in the `simulate` function.

In Example 3, we demo how to simulate from the rare variant model. In addition, we show how the user can generate effect sizes that depend on the minor allele frequencies from the chi-squared distribution. To aid users who wish to include a large number of loci in the model, we created a function that automatically writes out the mean components for simulation.

$\textbf{At the end of Examples 1 and 3}$, we demo how to $\textbf{write the results}$ of the simulation to a file on your own machine.

# Reading the Mendel 28a data using SnpArrays.jl

First use `SnpArrays.jl` to read in the genotype data. There are 379 individuals and 54,051 SNPs in the genotype data set, as the output below shows.


In [54]:
filepath = SnpArrays.datadir("EUR_subset")
snpdata = SnpArray(filepath * ".bed")

379×54051 SnpArray:
 0x03  0x03  0x03  0x02  0x02  0x03  …  0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x02  0x03  0x02  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x02  0x02  0x02  0x03  0x03  0x02
 0x03  0x03  0x03  0x00  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x00  0x03  0x03     0x02  0x02  0x02  0x03  0x03  0x03
 0x02  0x03  0x03  0x03  0x03  0x03  …  0x03  0x03  0x03  0x03  0x03  0x02
 0x02  0x03  0x03  0x02  0x02  0x03     0x03  0x03  0x02  0x02  0x03  0x03
 0x02  0x03  0x03  0x03  0x02  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x00  0x02  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x03  0x03  0x02  0x03  0x02     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x02  0x03  0x03  …  0x03  0x03  0x02  0x02  0x03  0x03
 0x03  0x03  0x03  0x02  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x02
 0x03  0x02  0x03  0x02  0x02  0x03     0x03  0x03  0x03  0x03  0x03  0x03
    ⋮

The binary codes correspond to genotypes: A1,A1 = 0x00, missing = 0x01, A1,A2 = 0x02, and A2,A2 = 0x03.
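
To make the coding concrete, here is a minimal sketch that maps the raw codes to counts of the A2 allele. The helper `decode_additive` is hypothetical and written only for illustration; it is not a `SnpArrays.jl` function, and representing a missing genotype as `missing` is an assumption of this sketch.

```julia
# Map a raw 2-bit genotype code to the number of A2 alleles (additive coding).
# Assumption: a missing genotype is returned as `missing`.
function decode_additive(code::UInt8)
    code == 0x00 && return 0.0       # A1,A1: zero copies of A2
    code == 0x01 && return missing   # missing genotype
    code == 0x02 && return 1.0       # A1,A2: one copy of A2
    code == 0x03 && return 2.0       # A2,A2: two copies of A2
    throw(ArgumentError("unexpected genotype code $code"))
end

raw = UInt8[0x03, 0x02, 0x00, 0x01]
decoded = decode_additive.(raw)      # broadcast over the raw codes
println(decoded)
```

In the tutorial itself this conversion is done by `convert(Vector{Float64}, ...)` on a column of the `SnpArray`, as shown further down.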

SnpArrays is a very useful utility and can do a lot more than just read in the data. More information about all the functionality of SnpArrays can be found at:
https://openmendel.github.io/SnpArrays.jl/latest/

In [55]:
EUR_famdata = SnpData(filepath)

SnpData(people: 379, snps: 54051,
snp_info: 
│ Row │ chromosome │ snpid       │ genetic_distance │ position │ allele1      │ allele2      │
│     │ String     │ String      │ Float64          │ Int64    │ Categorical… │ Categorical… │
├─────┼────────────┼─────────────┼──────────────────┼──────────┼──────────────┼──────────────┤
│ 1   │ 17         │ rs34151105  │ 0.0              │ 1665     │ T            │ C            │
│ 2   │ 17         │ rs143500173 │ 0.0              │ 2748     │ T            │ A            │
│ 3   │ 17         │ rs113560219 │ 0.0              │ 4702     │ T            │ C            │
│ 4   │ 17         │ rs1882989   │ 5.6e-5           │ 15222    │ G            │ A            │
│ 5   │ 17         │ rs8069133   │ 0.000499         │ 32311    │ G            │ A            │
│ 6   │ 17         │ rs112221137 │ 0.000605         │ 36405    │ G            │ T            │
…,
person_info: 
│ Row │ fid       │ iid       │ father    │ mother    │ sex       │ phenotype │
│  

### Names of Variants:

We will use SNP rs7212950 as a covariate in our model. We want to find the index of this causal locus in the SNP information table and then subset that locus from the genetic marker data above.
We first extract the names of all the loci into a vector called `snpid`.

In [56]:
snpid = EUR_famdata.snp_info[!, :snpid]

54051-element CSV.Column{String,String}:
 "rs34151105" 
 "rs143500173"
 "rs113560219"
 "rs1882989"  
 "rs8069133"  
 "rs112221137"
 "rs34889101" 
 "rs35840960" 
 "rs144918387"
 "rs62057022" 
 "rs4890182"  
 "rs1882990"  
 "rs62057050" 
 ⋮            
 "rs5770999"  
 "rs6010070"  
 "rs6010072"  
 "rs6009960"  
 "rs56807126" 
 "rs6010073"  
 "rs184517959"
 "rs113391741"
 "rs151247655"
 "rs187225588"
 "rs9616967"  
 "rs148755559"

We next need to find the position of the snp rs7212950.  If you wish to use another snp as the causal locus just change the rs number to another one that is found in the available genotype data, for example rs148755559.

In [60]:
index_rs7212950 = findall(x -> x == "rs7212950", snpid)[1]

2362

We see that rs7212950 is the 2362nd locus in the dataset.
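
If you substitute your own rs number, it helps to guard against a typo or a SNP that is absent from the data. The sketch below uses `findfirst` from Base Julia on a toy vector; `snp_names` and the ID `rs0000000` are made up for illustration.

```julia
snp_names = ["rs34151105", "rs143500173", "rs113560219"]   # toy subset of SNP IDs

idx_present = findfirst(==("rs143500173"), snp_names)   # position of the first match
idx_absent  = findfirst(==("rs0000000"), snp_names)     # `nothing` when there is no match

if idx_absent === nothing
    println("SNP not found; check the rs number")
end
println(idx_present)
```

Unlike `findall(...)[1]`, which throws a `BoundsError` on an empty result, `findfirst` returns `nothing`, so the missing case can be handled explicitly.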

Let's create a design matrix for the model that includes the locus rs7212950.

In [61]:
X = DataFrame(locus = convert(Vector{Float64}, @view(snpdata[:, index_rs7212950])))

│ Row │ locus   │
│     │ Float64 │
├─────┼─────────┤
│ 1   │ 2.0     │
│ 2   │ 2.0     │
│ 3   │ 1.0     │
│ 4   │ 2.0     │
│ 5   │ 2.0     │
│ 6   │ 2.0     │
│ 7   │ 2.0     │
│ 8   │ 2.0     │
│ 9   │ 1.0     │
│ 10  │ 2.0     │


# Example 1 Generalized Linear Model:

This example simulates a case where a SNP has a fixed effect on the trait. Any apparent genetic correlation between relatives for the trait is due to the effect of this SNP, so once its effect is modelled there should be no residual correlation among relatives. Note that by default, individuals with missing genotype values will have missing phenotype values, unless the user specifies the argument `impute = true` in the `convert` function above.
Be sure to change `Random.seed!(1234)` to something else (or comment it out) if you want to generate a new data set.


### Example 1a: Single Trait
$$Y ∼ N(\mu, \sigma^{2})$$

In example (1a) we simulate a $\textbf{SINGLE INDEPENDENT NORMAL TRAIT}$, with simulation parameters: $\mu = 20 - 1.5*locus$, $\sigma^{2} = 2$

In [62]:
mean_formula = "20 - 1.5(locus)"
GLM_trait_model = GLMTrait(mean_formula, X, NormalResponse(2), IdentityLink())
Simulated_GLM_trait = DataFrame(Simulated_Trait = simulate(GLM_trait_model))


│ Row │ Simulated_Trait │
│     │ Float64         │
├─────┼─────────────────┤
│ 1   │ 18.7347         │
│ 2   │ 15.1965         │
│ 3   │ 17.511          │
│ 4   │ 15.1942         │
│ 5   │ 18.7288         │
│ 6   │ 21.4238         │
│ 7   │ 18.0656         │
│ 8   │ 16.4565         │
│ 9   │ 19.5047         │
│ 10  │ 15.966          │


In [63]:
describe(Simulated_GLM_trait, :mean, :std, :min, :median, :max)

│ Row │ variable        │ mean    │ std     │ min     │ median  │ max     │
│     │ Symbol          │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼─────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ 1   │ Simulated_Trait │ 17.211  │ 2.1495  │ 11.8256 │ 17.1321 │ 26.1927 │
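
For intuition, the model in Example 1a can be sketched in plain Julia without the package. This is not how `TraitSimulation.jl` is implemented; it is a minimal illustration on toy genotypes, and it assumes the value 2 is the variance $\sigma^{2}$ as stated in the text.

```julia
using Random, Statistics

Random.seed!(1234)
n = 1_000
locus = Float64.(rand(0:2, n))          # toy allele counts, not the tutorial's data
mu = 20 .- 1.5 .* locus                 # mean formula "20 - 1.5(locus)"
sigma2 = 2.0                            # variance as stated in the text (assumption)
y = mu .+ sqrt(sigma2) .* randn(n)      # identity link: trait = mean + normal noise

println((mean(y), mean(mu)))            # sample mean should sit close to the average of mu
```

With an identity link the linear predictor is the mean itself, so the only random part is the independent normal noise added to each individual.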


# Example 2: Linear Mixed Model (with additive genetic variance component).
Example 2a simulates a single trait, while Example 2b simulates two correlated traits.

Note that you can simulate the trait multiple times by specifying the argument `n_reps`.
Also, you can extend the model in Example 2b to include more than 2 variance components using the `@vc` macro.


## The Variance Covariance Matrix

Recall: $E(\mathbf{GRM}) = \Phi$
<br>
We use the [SnpArrays.jl](https://github.com/OpenMendel/SnpArrays.jl) package to compute the Genetic Relationship Matrix (GRM), which serves as an estimate of the kinship matrix $\Phi$.

We will use the same values of $\mathbf{GRM}$, $V_a$, and $V_e$ in the bivariate covariance matrix for both the mixed effect example and the rare variant example.

Note that the residual covariance between two relatives is the additive genetic variance, $V_a$, times twice the kinship coefficient, $\Phi$. The kinship matrix is estimated by the genetic relationship matrix $\mathbf{GRM}$, computed across the common SNPs with minor allele frequency at least 0.05.

In [64]:
GRM = grm(snpdata, minmaf = 0.05)

379×379 Array{Float64,2}:
  0.515862     -0.0171514     0.000830645  …   0.01089       0.0133328  
 -0.0171514     0.504733     -0.000395141      0.000707769  -0.00699462 
  0.000830645  -0.000395141   0.503371        -0.00896103   -0.00369584 
 -0.00482143    0.00671988    0.00610991      -0.000679375   0.000816469
 -0.00652937    0.00396417    0.0130454       -0.011225     -0.00324037 
  0.00217151   -0.00044179    0.0132678    …  -0.0133392    -0.00687835 
 -0.0136113     0.00497049   -0.00402601      -0.0014478     0.000640062
 -0.00802599    0.00345512    0.00503707      -0.0115018    -0.0058708  
  0.00277393   -0.0047003    -0.000865875      0.000308418   0.0107124  
 -0.0144153     0.00509626    0.00395056      -0.00405372   -0.00677055 
 -0.0011438    -0.00350985    0.0111732    …   0.000910357   0.0101875  
  0.0138263    -0.00419484    1.09592e-5      -0.00980202   -0.00291516 
 -0.00418412    0.0100604    -0.00555352      -0.00600662   -0.00960445 
  ⋮                      

### Example 2a: Single Trait 
$$
Y \sim N(\mu, 4 * 2GRM + 2I)
$$

We simulate a normal trait controlling for family structure, with location $\mu = 40 - 1.5(locus)$ and scale $\mathbf{V} = 2 V_a \Phi + V_e I = 4 * 2GRM + 2I$.


In [65]:
mean_formula = ["40 - 1.5(locus)"]
I_n = Matrix{Float64}(I, size(GRM));
LMM_trait_model = LMMTrait(mean_formula, X, 4*(2*GRM) + 2*(I_n))
Simulated_LMM_Trait = DataFrame(Simulated_Trait = simulate(LMM_trait_model, 1000)[:, :, 1][:])


│ Row │ Simulated_Trait │
│     │ Float64         │
├─────┼─────────────────┤
│ 1   │ 36.8906         │
│ 2   │ 36.6386         │
│ 3   │ 39.2387         │
│ 4   │ 36.156          │
│ 5   │ 37.5698         │
│ 6   │ 33.5482         │
│ 7   │ 36.3204         │
│ 8   │ 36.222          │
│ 9   │ 37.8047         │
│ 10  │ 34.578          │


Let's look at summary statistics of just the first of the 1000 simulation results.

In [66]:
describe(Simulated_LMM_Trait, :mean, :std, :min, :median, :max)

│ Row │ variable        │ mean    │ std     │ min     │ median  │ max     │
│     │ Symbol          │ Float64 │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼─────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ 1   │ Simulated_Trait │ 37.1641 │ 2.50047 │ 29.7848 │ 37.0535 │ 43.6048 │
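
The covariance structure of Example 2a can be illustrated on a three-person toy pedigree. The sketch below builds $\mathbf{V} = 2 V_a \Phi + V_e I$ and draws one multivariate normal vector through a Cholesky factor. It is an illustration under assumed toy values (the kinship matrix and means are made up), not the package's internal algorithm.

```julia
using LinearAlgebra, Random

Random.seed!(1234)
# Toy kinship matrix Φ: persons 1 and 2 are full siblings, person 3 is unrelated
Φ = [0.5  0.25 0.0;
     0.25 0.5  0.0;
     0.0  0.0  0.5]
V_a, V_e = 4.0, 2.0
V = V_a * (2Φ) + V_e * I           # V = 2 V_a Φ + V_e I
μ = [37.0, 37.0, 38.5]             # 40 - 1.5(locus) for locus = 2, 2, 1
L = cholesky(Symmetric(V)).L       # lower factor, V = L * L'
y = μ + L * randn(3)               # one draw from N(μ, V)

println(V)
```

The siblings share covariance $V_a \cdot 2 \cdot 0.25 = 2$, while the unrelated person has zero covariance with both, which is the residual dependency the mixed model is meant to capture.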


###  Example 2b: Multiple Correlated Traits (Mendel Example 28e Simulation)

We simulate two correlated normal traits controlling for family structure, with location $\mu$ and scale $\mathbf\Sigma$.
The corresponding bivariate variance-covariance matrix $\mathbf{\Sigma}$, as specified in Mendel Option 28e, is generated here.

$$
Y ∼ N(μ, \mathbf\Sigma)
$$ 

$$
\mathbf{\mu} = \begin{bmatrix}
\mu_1 \\
\mu_2
\end{bmatrix}
= \begin{bmatrix}
40 - 1.5(locus) \\
20 - 1.5(locus)
\end{bmatrix}
$$

$$
\mathbf\Sigma  = V_a \otimes (2GRM) + V_e \otimes I_n
$$


FYI: To create a trait with different variance components, change the elements of $\mathbf\Sigma$. We create the variance component object `variance_formula` below to simulate our traits in Example 2b. While this tutorial only uses 2 variance components, the `@vc` macro is designed to handle as many variance components as needed.

As long as each variance component is specified correctly, we can create a `VarianceComponent` Julia object for trait simulation.

Example: specifying more than 2 variance components (let `V_H` indicate an additional household variance component and `V_D` a dominance genetic effect):

```{julia}
    multiple_variance_formula = @vc V_A ⊗ 2GRM + V_E ⊗ I_n + V_D ⊗ Δ + V_H ⊗ H;
```

`V_E` multiplies the 379 by 379 identity matrix `I_n` created in Example 2a. We now create the `V_A` and `V_E` matrices.

In [68]:
V_A = [4 1; 1 4]
V_E = [2.0 0.0; 0.0 2.0];
# @vc is a macro that creates a 'VarianceComponent' Type for simulation
variance_formula = @vc V_A ⊗ 2GRM + V_E ⊗ I_n;
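
To see what the Kronecker formula $\mathbf\Sigma = V_a \otimes (2GRM) + V_e \otimes I_n$ produces, here is a plain-Julia sketch using `kron` from `LinearAlgebra` on a toy 3-person relationship matrix. The names `toy_GRM`, `A`, and `E` are illustrative, and the sketch does not use the `@vc` macro or reflect how the package stores variance components.

```julia
using LinearAlgebra

n = 3
toy_GRM = [0.5  0.25 0.0;
           0.25 0.5  0.0;
           0.0  0.0  0.5]
A = [4.0 1.0; 1.0 4.0]     # plays the role of V_A (2 traits)
E = [2.0 0.0; 0.0 2.0]     # plays the role of V_E

# Covariance of the stacked trait vector (trait 1 for all people, then trait 2)
Σ = kron(A, 2 .* toy_GRM) + kron(E, Matrix{Float64}(I, n, n))

println(size(Σ))           # one row and column per (trait, person) pair
```

The top-left 3×3 block is the single-trait covariance from Example 2a, and the off-diagonal block `A[1, 2] * 2 * toy_GRM` carries the genetic correlation between the two traits.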

These are the formulas for the fixed effects, as specified by Mendel Option 28e.

In [69]:
mean_formulas = ["40 - 1.5(locus)", "20 - 1.5(locus)"]
Multiple_LMM_traits_model = LMMTrait(mean_formulas, X, variance_formula)
Simulated_LMM_Traits = DataFrame(simulate(Multiple_LMM_traits_model))


│ Row │ x1      │ x2      │
│     │ Float64 │ Float64 │
├─────┼─────────┼─────────┤
│ 1   │ 37.0    │ 17.0    │
│ 2   │ 37.0    │ 17.0    │
│ 3   │ 38.5    │ 18.5    │
│ 4   │ 37.0    │ 17.0    │
│ 5   │ 37.0    │ 17.0    │
│ 6   │ 37.0    │ 17.0    │
│ 7   │ 37.0    │ 17.0    │
│ 8   │ 37.0    │ 17.0    │
│ 9   │ 38.5    │ 18.5    │
│ 10  │ 37.0    │ 17.0    │


### Summary Statistics of Our Simulated Traits

In [71]:
describe(Simulated_LMM_Traits, :mean, :std, :min, :max)

│ Row │ variable │ mean    │ std      │ min     │ max     │
│     │ Symbol   │ Float64 │ Float64  │ Float64 │ Float64 │
├─────┼──────────┼─────────┼──────────┼─────────┼─────────┤
│ 1   │ x1       │ 37.2335 │ 0.586633 │ 37.0    │ 40.0    │
│ 2   │ x2       │ 17.2335 │ 0.586633 │ 17.0    │ 20.0    │


# Example 3: Rare Variant Linear Mixed Model


$$
Y ∼ N(\mu, 4* 2GRM + 2I)
$$

In this example we simulate a trait from a prespecified number, k, of SNPs, with effect sizes that are simulated from a chi-squared distribution and depend on the minor allele frequencies.

## Generating Effect Sizes (Based on MAF)

In practice rare SNPs have smaller minor allele frequencies, but we are limited in this tutorial by the number of individuals in the data set. We use generated effect sizes to evaluate $\mu$ on the following DataFrame:

In [83]:
# keep only the rare SNPs: the subset with 0.002 < MAF ≤ 0.02
minor_allele_frequency = maf(snpdata)
rare_index = (0.002 .< minor_allele_frequency .≤ 0.02)
data_rare = @view(snpdata[:, rare_index]);
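
The filtering step can be illustrated without `SnpArrays.jl`. The sketch below computes minor allele frequencies from a toy matrix of A2 allele counts and builds the same kind of boolean mask. It assumes no missing genotypes, and the thresholds are chosen only to suit the toy data.

```julia
# Toy genotype matrix: rows are people, columns are SNPs, entries count A2 alleles
G = [0 2 1;
     0 2 1;
     1 2 0;
     0 1 1]

a2_freq = vec(sum(G, dims = 1)) ./ (2 * size(G, 1))   # A2 allele frequency per SNP
toy_maf = min.(a2_freq, 1 .- a2_freq)                 # minor allele frequency
keep = 0.1 .< toy_maf .≤ 0.2                          # toy thresholds for this example
G_rare = @view G[:, keep]                             # columns passing the filter

println(toy_maf)
println(size(G_rare))
```

Taking a `@view` avoids copying the selected columns, which matters when the genotype matrix has tens of thousands of SNPs.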

In [87]:

model_k = GLMTrait(meanformula_k, genotype_df, NormalResponse(1), IdentityLink())
simulated_trait = simulate(model_k)


379-element Array{Float64,1}:
  1.5661027315584106 
 -2.428580940323021  
 -1.897170220570642  
 -1.8148788490514518 
 -3.295922821333367  
 -1.5335608813595354 
  0.3038800577950106 
 -0.8168966275116707 
 -2.3732135814713318 
 -2.084770406004215  
 -1.6952389469633933 
 -1.0108267421628827 
 -1.1130663200983464 
  ⋮                  
 -0.736593478287938  
 -4.3141578250966415 
 -2.943297217377027  
 -2.034231422322049  
 -0.16117207974432068
 -1.6909873192064548 
 -1.1612333913411892 
 -3.5264825656258125 
 -2.7851622478430267 
 -0.7162636204504689 
 -0.9463627805091666 
 -2.7294006981079697 

Some summary statistics of the simulated trait:

In [29]:
describe(simulated_trait[:])

## Saving Simulation Results to Local Machine

Here we output the simulated trait values and corresponding genotypes for each of the 379 individuals, labeled by their pedigree ID and person ID, for the first iteration of the 1000 simulations.

In [30]:
Trait3_rare = hcat(Fam_Person_id, trait_rare_20_snps[:], geno_rare20_converted)

UndefVarError: UndefVarError: trait_rare_20_snps not defined

In addition, we output the simulation parameters (generated effect sizes and SNP names) used to simulate this trait.

In [31]:
Coefficients = DataFrame(Coefficients = simulated_effectsizes_chisq)
SNPs_rare = DataFrame(SNPs = rare_snps_for_simulation)
Trait3_rare_sim = hcat(Coefficients, SNPs_rare)

UndefVarError: UndefVarError: simulated_effectsizes_chisq not defined

In [32]:
using CSV  # provides CSV.write; it is not loaded by the earlier `using` statement
#cd("/Users") # change to home directory
CSV.write("Trait3_rare.csv", Trait3_rare)
CSV.write("Trait3_rare_sim.csv", Trait3_rare_sim);

UndefVarError: UndefVarError: Trait3_rare not defined
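
Because the cells above depend on objects that are not defined in this notebook, here is a small self-contained sketch of the same idea: write person IDs and a trait to a comma-separated file and read it back. It uses the `DelimitedFiles` standard library instead of `CSV.write`, and the IDs, values, and file name are all made up for illustration.

```julia
using DelimitedFiles

person_id = ["id1", "id2", "id3"]      # hypothetical person IDs
toy_trait = [1.57, -2.43, -1.90]       # toy trait values
out_path = joinpath(mktempdir(), "toy_trait.csv")

open(out_path, "w") do io
    println(io, "person_id,trait")                    # header row
    writedlm(io, hcat(person_id, toy_trait), ',')     # one row per person
end

table, header = readdlm(out_path, ','; header = true) # read the file back to check it
println(header)
println(table)
```

Writing to a temporary directory keeps the sketch from overwriting files in your working directory; replace `out_path` with your own path when saving real results.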

## Citations: 

[1] Lange K, Papp JC, Sinsheimer JS, Sripracha R, Zhou H, Sobel EM (2013) Mendel: The Swiss army knife of genetic analysis programs. Bioinformatics 29:1568-1570.


[2] OPENMENDEL: a cooperative programming project for statistical genetics.
[Hum Genet. 2019 Mar 26. doi: 10.1007/s00439-019-02001-z](https://www.ncbi.nlm.nih.gov/pubmed/?term=OPENMENDEL).
