# Trait Simulation Tutorial

Authors: Sarah Ji, Janet Sinsheimer, Kenneth Lange

In this notebook we demonstrate how to simulate phenotypic traits using the classic Mendel Option 28e data, for which the parameter estimates are known. Example 2b follows Mendel Option 28e, using the simulation parameters for Trait1 and Trait2 in Ped28e.out as shown below.

You can specify arbitrary fixed effect sizes in both examples. 

Additionally, in the Generating Effect Sizes section of Example 2c, we show how to generate effect sizes that depend on the minor allele frequencies through a function such as an exponential or chi-squared density. To help when a model includes a large number of loci, we provide a function that automatically writes out the mean components.

At the end of Example 2, we show how to write the results of each simulation to a file on the user's own machine.

## Mendel Option 28e Data: 
Mean effect:
$$
\mathbf{\mu} = \begin{bmatrix}
\mu_1 \\
\mu_2
\end{bmatrix}
= \begin{bmatrix}
40 + 3(sex) - 1.5(locus)\\
20 + 2(sex) - 1.5(locus)
\end{bmatrix}
$$

Covariance matrix of both traits, simulated simultaneously through a linear mixed model (LMM):

$$
\Sigma  = V_a \otimes (2GRM) + V_e \otimes I_n
$$

Where we have the additive and environmental variances:

$$
V_a = 
\begin{bmatrix}
4 & 1\\
1 & 4
\end{bmatrix}
$$

$$
V_e = 
\begin{bmatrix}
2 & 0\\
0 & 2
\end{bmatrix}
$$

The kinship matrix is estimated by the genetic relationship matrix (GRM) computed across the common SNPs with minor allele frequency of at least 0.05. $I_{n}$ is the $n$-dimensional identity matrix. The locus in this case is SNP rs10412915. This SNP is "causal" in the sense that its genotype contributes to the trait value.
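To make the Kronecker structure of $\Sigma$ concrete, here is a minimal sketch that builds it for a toy pedigree of two unrelated parents and their child. The kinship values in `Φ` are hypothetical stand-ins for the GRM computed later from the real data; only `kron`, `I`, and `isposdef` from Julia's `LinearAlgebra` standard library are used.

```julia
using LinearAlgebra

# Toy kinship matrix for two unrelated parents and their child (hypothetical values)
Φ = [0.5  0.0  0.25;
     0.0  0.5  0.25;
     0.25 0.25 0.5]
n = size(Φ, 1)

V_a = [4.0 1.0; 1.0 4.0]   # additive genetic variances and covariance
V_e = [2.0 0.0; 0.0 2.0]   # environmental variances

# Σ = V_a ⊗ (2Φ) + V_e ⊗ I_n is a 2n × 2n matrix
Σ = kron(V_a, 2 * Φ) + kron(V_e, Matrix{Float64}(I, n, n))

# Top-left n × n block: covariance of trait 1 across individuals
trait1_block_ok = Σ[1:n, 1:n] ≈ 4 * (2 * Φ) + 2 * I
# Top-right n × n block: covariance between trait 1 and trait 2 across individuals
cross_block_ok = Σ[1:n, n+1:2n] ≈ 1 * (2 * Φ)
# A valid covariance matrix must be positive definite (or at least semidefinite)
Σ_is_posdef = isposdef(Σ)
```

With this ordering, the first `n` rows and columns of `Σ` refer to trait 1 and the last `n` to trait 2.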

### Double check that you are using Julia version 1.0.3 or higher by checking the machine information

In [1]:
versioninfo()

Julia Version 1.0.3
Commit 099e826241 (2018-12-18 01:34 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)


# Add any missing packages needed for this tutorial:

Note that this Jupyter notebook requires the following registered packages: `DataFrames.jl`, `SnpArrays.jl`, `StatsModels.jl`, `StatsBase.jl`, `StatsFuns.jl`, and, for the final section, `CSV.jl`. It also uses `Random`, `LinearAlgebra`, and `DelimitedFiles`, which ship with Julia as standard libraries.

If this is your first time using these packages, add them first by running the following code chunk in Julia's package manager:

```{julia}
pkg> add DataFrames
pkg> add SnpArrays
...
pkg> add StatsFuns
```
You can also use the package manager to add the `TraitSimulation.jl` package directly from its repository:

```{julia}
pkg> add "https://github.com/sarah-ji/TraitSimulation.jl"
```

Once all of the necessary packages have been added, we can load them into our working environment with the `using` command:

In [2]:
using DataFrames, SnpArrays, StatsModels, Random, LinearAlgebra, DelimitedFiles, StatsBase, TraitSimulation, StatsFuns

# Reproducibility

For reproducibility, we set a random seed with `Random.seed!(1234)` from the `Random` standard library. If you wish to end up with different data, comment out this command or pass another value to `Random.seed!()`.

In [3]:
Random.seed!(1234);
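As a small illustration of what seeding buys us, the sketch below resets the seed and checks that the same random draws are replayed, while a different seed (5678, chosen arbitrarily) gives different draws.

```julia
using Random

Random.seed!(1234)
first_draw = randn(3)

Random.seed!(1234)           # resetting the seed replays the same random stream
second_draw = randn(3)

Random.seed!(5678)           # a different seed starts a different stream
third_draw = randn(3)

same_seed_matches = first_draw == second_draw
different_seed_differs = first_draw != third_draw
```

Note that the exact numbers produced by a given seed can change between Julia versions, so results are reproducible only within the same version.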

# The notebook is organized as follows:
## Example 1: Generalized Linear Fixed Effects Model (No Residual Familial Correlation)

### Multiple Independent Traits: User specified distributions
In Example 1 we simultaneously simulate two independent traits, one from a normal distribution and one from a Poisson distribution.<br>
$$ Y = 
\begin{bmatrix}
Y_{1}\\
Y_{2}
\end{bmatrix}
$$

$$ 
Y_{1} ∼ N(\mu_{1}, 2), \mu_{1} = 40 + 3(sex) - 1.5(locus)\\
Y_{2} ∼ Poisson(\mu_{2}), \mu_{2} = 2 + 2(sex) - 1.5(locus)
$$

## Example 2: Linear Mixed Model (With Additive Genetic Variance Component).
In this example we show how to generate data with a residual correlation among relatives. 
For convenience we use the common assumption that the residual covariance between two relatives can be captured by the additive genetic variance times twice the kinship coefficient. However, you can specify your own variance components and their design matrices, as long as they are positive semidefinite, using the `@vc` macro demonstrated in this example.

### (a) Single Trait:
We simulate a normal trait controlling for family structure, with location = $\mu_{1}$ and scale = $V_{a_{1,1}} * 2GRM + V_{e_{1,1}} I$. 
$$
Y_{2a} ∼ N(\mu_{1}, 4* 2GRM + 2I)
$$


### (b) Multiple Correlated Traits: (Mendel Example 28e Simulation)
We simulate two correlated Normal Traits controlling for family structure, location = $\mu$ and scale = $\Sigma$. 
$$
Y_{2b} ∼ N(\mu, \Sigma) , \Sigma  = V_{a} \otimes (2GRM) + V_{e} \otimes I_{n}
$$

### (c) Rare Variant Linear Mixed Model 

This example also assumes an additive genetic variance component in the model and includes 20 rare SNPs as fixed effects, where rare SNPs are defined as SNPs with minor allele frequencies greater than 0.002 but less than 0.02. In practice rare SNPs have smaller minor allele frequencies, but we are limited in this tutorial by the number of individuals in the data set. <br>

We simulate a single normal trait controlling for family structure, with effect sizes generated as a function of the minor allele frequencies.
$$
Y_{2c} ∼ N(\mu_{rare20}, 4* 2GRM + 2I)
$$

# Reading the Mendel 28e data using SnpArrays

First use `SnpArrays.jl` to read in the SNP data.


In [4]:
snpdata = SnpArray("traitsim28e.bed", 212)

212×253141 SnpArray:
 0x03  0x03  0x00  0x03  0x03  0x03  …  0x02  0x02  0x00  0x00  0x03  0x00
 0x03  0x03  0x00  0x02  0x02  0x03     0x00  0x03  0x00  0x00  0x03  0x00
 0x03  0x03  0x00  0x03  0x03  0x03     0x03  0x02  0x00  0x00  0x03  0x00
 0x03  0x03  0x00  0x03  0x03  0x03     0x02  0x03  0x00  0x00  0x03  0x00
 0x03  0x03  0x00  0x03  0x03  0x03     0x00  0x03  0x00  0x00  0x03  0x00
 0x03  0x03  0x00  0x03  0x03  0x03  …  0x00  0x00  0x00  0x00  0x00  0x03
 0x03  0x02  0x00  0x03  0x03  0x03     0x02  0x03  0x00  0x03  0x00  0x03
 0x03  0x03  0x00  0x03  0x03  0x03     0x02  0x03  0x00  0x03  0x00  0x03
 0x03  0x02  0x00  0x03  0x03  0x03     0x02  0x02  0x00  0x02  0x00  0x03
 0x03  0x02  0x00  0x03  0x03  0x03     0x03  0x03  0x00  0x03  0x00  0x03
 0x03  0x02  0x00  0x03  0x03  0x03  …  0x00  0x02  0x00  0x02  0x00  0x03
 0x03  0x03  0x00  0x03  0x03  0x03     0x00  0x02  0x00  0x02  0x00  0x03
 0x03  0x02  0x00  0x03  0x03  0x03     0x02  0x02  0x00  0x02  0x00  0x03
    

Store the FamID and PersonID of Individuals in Mendel 28e data

In [5]:
famfile = readdlm("traitsim28e.fam", ',')
Fam_Person_id = DataFrame(FamID = famfile[:, 1], PID = famfile[:, 2])

Unnamed: 0_level_0,FamID,PID
Unnamed: 0_level_1,Any,Any
1,1,16
2,1,8228
3,1,17008
4,1,9218
5,1,3226
6,2,29
7,2,2294
8,2,3416
9,2,17893
10,2,6952


Note: Because we will later want to compare our simulated traits with the original traits in the file, we store the original Trait1 and Trait2 columns in `traits_original`.

In [6]:
traits_original = DataFrame(Trait1 = famfile[:, 7], Trait2 = famfile[:, 8])

Unnamed: 0_level_0,Trait1,Trait2
Unnamed: 0_level_1,Any,Any
1,30.2056,9.2421
2,35.8214,15.2746
3,36.053,19.505
4,38.9635,18.9857
5,33.7391,21.1041
6,34.8884,19.0114
7,37.7011,19.1656
8,45.1317,19.8409
9,35.156,14.1423
10,42.4514,19.9271


Transform the sex variable from M/F to 1/-1, as Mendel Option 28e does. If you prefer, you can use the more common convention of making one of the sexes the reference sex (coded as 0) and giving the other sex the value 1.

In [7]:
sex = map(x -> strip(x) == "F" ? -1.0 : 1.0, famfile[:, 5]) # note julia's ternary operator '?'

212-element Array{Float64,1}:
 -1.0
 -1.0
  1.0
  1.0
 -1.0
 -1.0
  1.0
  1.0
 -1.0
  1.0
 -1.0
  1.0
 -1.0
  ⋮  
  1.0
  1.0
  1.0
 -1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
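For readers who prefer the reference coding, here is a minimal sketch on a hypothetical four-person sex column. It also checks that the two codings describe the same means once the intercept and slope are adjusted: with $s = 2d - 1$, the mean $40 + 3s$ under the 1/-1 coding equals $37 + 6d$ under the 0/1 coding.

```julia
# Hypothetical sex column, with stray whitespace as might occur in a raw file
sex_raw = ["F", " M", "F ", "M"]

# Coding used in this tutorial: F => -1, M => 1
sex_pm1 = map(x -> strip(x) == "F" ? -1.0 : 1.0, sex_raw)

# Reference coding: F => 0 (reference sex), M => 1
sex_01 = map(x -> strip(x) == "F" ? 0.0 : 1.0, sex_raw)

# The same fitted means under both codings: 40 + 3s with s = ±1 equals 37 + 6d with d = 0/1
means_pm1 = 40 .+ 3 .* sex_pm1
means_01 = 37 .+ 6 .* sex_01
codings_agree = means_pm1 == means_01
```

If you switch codings, remember to adjust the intercept and sex coefficient in the mean formulas accordingly.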

### Names of Variants:

We want to find the index of the causal SNP, rs10412915, in the SNP definition file and then subset that SNP from the genetic marker data above. 
We first store the SNP names in a vector called `snpid`.

In [8]:
snpdef28_1 = readdlm("traitsim28e.bim", Any; header = false)
snpid = map(x -> strip(string(x)), snpdef28_1[:, 1]) # strip whitespace from the SNP names

253141-element Array{SubString{String},1}:
 "rs3020701"  
 "rs56343121" 
 "rs143501051"
 "rs56182540" 
 "rs7260412"  
 "rs11669393" 
 "rs181646587"
 "rs8106297"  
 "rs8106302"  
 "rs183568620"
 "rs186451972"
 "rs189699222"
 "rs182902214"
 ⋮            
 "rs188169422"
 "rs144587467"
 "rs139879509"
 "rs143250448"
 "rs145384750"
 "rs149215836"
 "rs139221927"
 "rs181848453"
 "rs138318162"
 "rs186913222"
 "rs141816674"
 "rs150801216"

Next we need to find the position of the SNP rs10412915. If you wish to use another SNP, just change the rs number to another one that is found in the available genotype data, for example rs186913222.

In [9]:
ind_rs10412915 = findall(x -> x == "rs10412915", snpid)[1]

236074

We see that the causal snp, rs10412915, is the 236074th variant in the snp dataset.
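An alternative lookup, sketched here on a toy three-element vector, uses `findfirst` from Base. It returns `nothing` when the name is absent, whereas indexing `findall(...)[1]` on an empty result throws an error, so `findfirst` can make a mistyped rs number easier to detect.

```julia
# Toy vector of SNP names (a stand-in for the full `snpid` vector)
snp_names = ["rs3020701", "rs56343121", "rs10412915"]

# Index of the first exact match
idx_found = findfirst(isequal("rs10412915"), snp_names)

# A name that is not present yields `nothing` rather than an error
idx_missing = findfirst(isequal("rs0000"), snp_names)

if idx_missing === nothing
    println("SNP not found in the genotype data")
end
```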

Let's create a design matrix for the alternative model that includes sex and locus rs10412915.

In [10]:
locus = convert(Vector{Float64}, @view(snpdata[:, ind_rs10412915]))
X = DataFrame(sex = sex, locus = locus)

Unnamed: 0_level_0,sex,locus
Unnamed: 0_level_1,Float64,Float64
1,-1.0,2.0
2,-1.0,0.0
3,1.0,2.0
4,1.0,2.0
5,-1.0,1.0
6,-1.0,1.0
7,1.0,1.0
8,1.0,2.0
9,-1.0,1.0
10,1.0,1.0


# Example 1) Multiple Independent Traits: User specified distributions

Here we simulate two independent traits simultaneously, one from a normal distribution and the other from a Poisson distribution. 
We create the following three vectors to specify the simulation parameters of the two independent traits: 

&nbsp; &nbsp; `dist_type_vector` &nbsp; &nbsp; `link_type_vector` &nbsp; &nbsp; `mean_formulas`

$$
Y_{1b_{1}} ∼ N(\mu_{1b}, 2),  \mu_{1b} = 40 + 3(sex) - 1.5(locus)\\
Y_{1b_{2}} ∼ Poisson(\mu_{2b}),  \mu_{2b} = 2 + 2(sex) - 1.5(locus)\\
$$
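As a rough illustration of how a link function enters, the sketch below evaluates both mean formulas on a toy design of three individuals. It assumes, as is standard for generalized linear models, that each formula gives the linear predictor and that the mean is obtained through the inverse link: the identity for the normal trait and the exponential for a Poisson trait with a log link. Whether TraitSimulation.jl applies the link in exactly this way internally is an assumption here, and the toy `sex` and `locus` values are hypothetical.

```julia
# Toy design: three individuals (hypothetical values)
sex   = [-1.0, 1.0, 1.0]
locus = [ 2.0, 0.0, 1.0]

# Linear predictors from the two mean formulas
η1 = 40 .+ 3 .* sex .- 1.5 .* locus
η2 = 2 .+ 2 .* sex .- 1.5 .* locus

# Identity link: the mean equals the linear predictor
μ1 = η1

# Log link (assumed interpretation): the mean is exp of the linear predictor,
# which keeps the Poisson mean positive even when η2 is negative
μ2 = exp.(η2)
```

Under this interpretation a male with no copies of the allele has a Poisson mean of $e^{4} \approx 54.6$, while a female with two copies has a mean of $e^{-3} \approx 0.05$, so a wide spread of counts is expected.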

In [11]:
#for multiple glm traits from different distributions
dist_type_vector = [NormalResponse(4), PoissonResponse()]
link_type_vector = [IdentityLink(), LogLink()]

mean_formulas = ["40 + 3(sex) - 1.5(locus)", "2 + 2(sex) - 1.5(locus)"]

Multiple_GLM_traits_model_NOTIID = Multiple_GLMTraits(mean_formulas, X, dist_type_vector, link_type_vector)
Simulated_GLM_trait_NOTIID = simulate(Multiple_GLM_traits_model_NOTIID)

Unnamed: 0_level_0,trait1,trait2
Unnamed: 0_level_1,Float64,Int64
1,37.4694,1
2,33.393,2
3,38.0221,1
4,36.3883,1
5,38.9576,0
6,44.3475,1
7,43.6313,16
8,38.9131,2
9,37.5093,0
10,39.4321,13


In [12]:
describe(Simulated_GLM_trait_NOTIID, stats = [:mean, :std, :min, :q25, :median, :q75, :max, :eltype])

Unnamed: 0_level_0,variable,mean,std,min,q25,median,q75,max,eltype
Unnamed: 0_level_1,Symbol,Float64,Float64,Real,Float64,Float64,Float64,Real,DataType
1,trait1,38.119,4.99078,21.1546,34.6485,38.5702,41.7241,50.9751,Float64
2,trait2,6.32075,12.1924,0.0,0.0,1.0,8.0,66.0,Int64


See the instructions at the end of this Jupyter notebook if you want to save your results to a file.

# Example 2: Linear Mixed Model (with additive genetic variance component).
Note that the residual covariance between two relatives is the additive genetic variance times twice the kinship coefficient. Examples 2a and 2c simulate single traits, while Example 2b simulates two correlated traits.


The model in Example 2b can be extended to include more than two variance components using the `@vc` macro.


## The Variance Covariance Matrix
### Single Trait 
Recall: $E(\mathbf{GRM}) = \Phi$ and $\mathbf{V} = 2\mathbf{V_a} \mathbf{\Phi} + \mathbf{V_e} \mathbf{I}$
<br>
We will use the same values of $\mathbf{GRM}$, $V_a$, and $V_e$ for the mixed effects Examples 2a and 2b and for the rare variant Example 2c.

We use the SnpArrays.jl package to compute the genetic relationship matrix (GRM).

In [13]:
GRM = grm(snpdata, method = :GRM)
V_A = [4 1; 1 4]
V_E = [2.0 0.0; 0.0 2.0]
I_n = Matrix{Float64}(I, size(GRM));

We simulate a normal trait controlling for family structure, with location = $μ_{1}$ and scale = $4* 2GRM + 2I$. 
$$
Y_{2a} ∼ N(μ_{1}, 4* 2GRM + 2I)
$$

In [14]:
mean_formula = ["40 + 3(sex) - 1.5(locus)"]

1-element Array{String,1}:
 "40 + 3(sex) - 1.5(locus)"

In [15]:
Ex2a_model = LMMTrait(mean_formula, X, 4*(2*GRM) + 2*(I_n))
trait_2a = simulate(Ex2a_model)

Unnamed: 0_level_0,trait1
Unnamed: 0_level_1,Float64
1,35.5325
2,34.2255
3,39.4903
4,36.9414
5,37.5817
6,34.4615
7,46.4516
8,42.7973
9,38.1481
10,42.2538


In [16]:
describe(trait_2a, stats = [:mean, :std, :min, :q25, :median, :q75, :max, :eltype])

Unnamed: 0_level_0,variable,mean,std,min,q25,median,q75,max,eltype
Unnamed: 0_level_1,Symbol,Float64,Float64,Float64,Float64,Float64,Float64,Float64,DataType
1,trait1,38.25,3.95697,27.8593,35.1312,38.5002,41.1097,48.297,Float64
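Conceptually, a draw like the one in Example 2a can be produced by factoring the covariance matrix. The sketch below is a generic multivariate normal draw via the Cholesky factor on a toy three-person kinship matrix; it is not necessarily how `simulate` is implemented, and the values of `Φ` and `μ` are hypothetical.

```julia
using LinearAlgebra, Random

Random.seed!(1234)

# Toy kinship matrix for two unrelated parents and their child (hypothetical values)
Φ = [0.5  0.0  0.25;
     0.0  0.5  0.25;
     0.25 0.25 0.5]
n = size(Φ, 1)

μ = [34.0, 43.0, 41.5]                            # toy mean vector
V = 4 * (2 * Φ) + 2 * Matrix{Float64}(I, n, n)    # V = 4 * 2Φ + 2I

# If V = L L' and z ~ N(0, I), then μ + L z ~ N(μ, V)
L = cholesky(Symmetric(V)).L
y = μ + L * randn(n)
```

Relatives share the off-diagonal entries of `V`, which is what induces the familial correlation in `y`.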


## Example 2b) Simulating Two Correlated Traits with Mendel Option 28e parameters


### (b) Multiple Correlated Traits: (Mendel Example 28e Simulation)
We simulate two correlated normal traits controlling for family structure, with location = $μ$ and scale = $\Sigma$. 
$$
Y_{2b} ∼ N(μ, \Sigma) ,  \Sigma  = V_a \otimes (2GRM) + V_e \otimes I_n
$$


### Multiple Correlated Traits Variance Components

The variance-covariance matrix $\mathbf{Σ}$ specified by Mendel Option 28e is generated here. To simulate a trait with different variance components, change the terms of $\Sigma  = V_a \otimes (2GRM) + V_e \otimes I$. We create the variance component object `variance_formula` below to simulate our traits in Example 2b.

&nbsp; FYI: While this tutorial only uses two variance components, the `@vc` macro is designed to handle as many variance components as needed. As long as each variance component is specified correctly, we can create a `VarianceComponent` Julia object for trait simulation:

&nbsp; 
For example:
```{julia}
multiple_variance_formula = @vc V_A ⊗ GRM + V_E ⊗ I_n + V_B ⊗ I_n + V_C ⊗ I_n;
```


In [17]:
# @vc is a macro that creates a 'VarianceComponent' type for simulation.
# Note: V_A is paired with GRM here, whereas the formula above uses 2GRM.
variance_formula = @vc V_A ⊗ GRM + V_E ⊗ I_n;

These are the formulas for the fixed effects, as specified by Mendel Option 28e.

In [18]:
mean_formulas = ["40 + 3(sex) - 1.5(locus)", "20 + 2(sex) - 1.5(locus)"]

2-element Array{String,1}:
 "40 + 3(sex) - 1.5(locus)"
 "20 + 2(sex) - 1.5(locus)"

In [19]:
Ex2b_model = LMMTrait(mean_formulas, X, variance_formula)
trait_2b = simulate(Ex2b_model)

Unnamed: 0_level_0,trait1,trait2
Unnamed: 0_level_1,Float64,Float64
1,37.3789,15.123
2,39.3743,16.8079
3,36.4352,14.5049
4,38.0227,17.2941
5,39.2585,20.0591
6,34.7246,9.5668
7,40.4575,12.5429
8,43.034,17.9209
9,34.3649,8.61429
10,41.9882,18.1775


### Summary Statistics of Our Simulated Traits

In [20]:
describe(trait_2b, stats = [:mean, :std, :min, :max, :eltype])

Unnamed: 0_level_0,variable,mean,std,min,max,eltype
Unnamed: 0_level_1,Symbol,Float64,Float64,Float64,Float64,DataType
1,trait1,38.1871,4.51019,28.8591,52.6095,Float64
2,trait2,18.1381,5.0916,5.97027,33.7165,Float64
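To show how two correlated traits can be drawn jointly, here is a minimal sketch that samples the stacked vector of both traits from $N(\mathrm{vec}(M), \Sigma)$ and reshapes it into one column per trait. The toy kinship matrix `Φ` and mean matrix `M` are hypothetical, and this generic construction is not necessarily what `simulate` does internally.

```julia
using LinearAlgebra, Random

Random.seed!(1234)

# Toy kinship matrix for two unrelated parents and their child (hypothetical values)
Φ = [0.5  0.0  0.25;
     0.0  0.5  0.25;
     0.25 0.25 0.5]
n = size(Φ, 1)

V_a = [4.0 1.0; 1.0 4.0]
V_e = [2.0 0.0; 0.0 2.0]
Σ = kron(V_a, 2 * Φ) + kron(V_e, Matrix{Float64}(I, n, n))

# Toy n × 2 matrix of means: column 1 is trait 1, column 2 is trait 2
M = [34.0 15.0;
     43.0 22.0;
     41.5 20.5]

# vec(M) stacks trait 1 on top of trait 2, matching the block order of Σ
L = cholesky(Symmetric(Σ)).L
y = vec(M) + L * randn(2n)

# Back to one row per individual and one column per trait
Y = reshape(y, n, 2)
```

The off-diagonal entries of `V_a` are what make the two columns of `Y` correlated within and across relatives.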


### Summary Statistics of the Original Mendel 28e dataset Traits:

We expect the summary statistics of our simulated traits to be similar to these.

In [21]:
describe(traits_original, stats = [:mean, :std, :min, :max, :eltype])

Unnamed: 0_level_0,variable,mean,std,min,max,eltype
Unnamed: 0_level_1,Symbol,Float64,Float64,Float64,Float64,DataType
1,Trait1,37.8602,4.04887,29.2403,47.8619,Any
2,Trait2,18.472,3.37633,9.2421,27.5554,Any


## Example 2c: Rare Variant Linear Mixed Model with effect sizes as a function of the allele frequencies. 

In this example we first subset only the rare SNPs, with minor allele frequency greater than 0.002 but no more than 0.02, and then simulate a trait using 20 of the rare SNPs as fixed effects. The 20 SNPs used for trait simulation in this example are listed below.  

For this demo, the indexing `snpid[rare_index][1:2:40]` takes every other rare SNP among the first 40 rare SNPs, giving our list of 20 rare SNPs. Change the range to simulate with more or fewer SNPs or with SNPs from different regions of the genome. The number 20 is arbitrary: change the final number to use more or fewer SNPs, and change the second number to change the spacing between SNPs. 
For example, `snpid[rare_index][1:5:500]` would give you 100 SNPs.
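The counts quoted above follow directly from Julia's `start:step:stop` range syntax, which can be checked without loading any data:

```julia
# start:step:stop ranges select every `step`-th index up to `stop`
every_other = 1:2:40
every_fifth = 1:5:500

n_every_other = length(every_other)   # number of SNPs selected by 1:2:40
n_every_fifth = length(every_fifth)   # number of SNPs selected by 1:5:500

first_three = collect(every_other)[1:3]   # the first few selected positions
last_selected = last(every_other)         # the final selected position
```

Here `1:2:40` selects positions 1, 3, 5, and so on up to 39, so the 40th rare SNP itself is not included.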


In practice rare SNPs have smaller minor allele frequencies, but we are limited in this tutorial by the number of individuals in the data set. We use generated effect sizes to evaluate $\mu_{rare20}$. <br>

### (c) Single Trait: 
$$
Y_{2c} ∼ N(\mu_{rare20}, 4* 2GRM + 2I)
$$

In [22]:
# keep only the rare SNPs: the subset with 0.002 < MAF ≤ 0.02
minor_allele_frequency = maf(snpdata)
rare_index = (0.002 .< minor_allele_frequency .≤ 0.02)
data_rare = snpdata[:, rare_index]

212×80493 Array{UInt8,2}:
 0x03  0x00  0x00  0x00  0x03  0x03  …  0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x02  0x02  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03  …  0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03  …  0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
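For intuition about what a minor allele frequency filter does, the following sketch computes MAFs by hand on a tiny genotype matrix coded as allele counts (0, 1, or 2) and applies the same kind of chained comparison. The matrix and the upper threshold of 0.2 are hypothetical, and the calculation ignores missing genotypes, which `maf` in SnpArrays.jl has to handle.

```julia
using Statistics

# Toy genotypes: rows are individuals, columns are SNPs, entries count copies of one allele
G = [0.0 2.0 1.0;
     1.0 2.0 1.0;
     0.0 1.0 0.0;
     0.0 2.0 1.0]

# Frequency of the counted allele: mean count divided by 2 alleles per person
allele_freq = vec(mean(G, dims = 1)) ./ 2

# The minor allele is whichever of the two alleles is less frequent
toy_maf = min.(allele_freq, 1 .- allele_freq)

# Keep SNPs whose MAF lies in a chosen interval (toy upper threshold)
toy_rare_index = 0.002 .< toy_maf .≤ 0.2
G_rare = G[:, toy_rare_index]
```

The second SNP has a counted-allele frequency of 0.875, so its minor allele frequency is 0.125 and it passes the toy filter.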

In [23]:
maf_20_rare_snps = minor_allele_frequency[rare_index][1:2:40]

20-element Array{Float64,1}:
 0.01650943396226412  
 0.014150943396226415 
 0.009433962264150941 
 0.018867924528301886 
 0.009433962264150941 
 0.004716981132075526 
 0.007075471698113208 
 0.009433962264150941 
 0.007075471698113178 
 0.002358490566037763 
 0.014150943396226415 
 0.0047169811320754715
 0.002358490566037763 
 0.004716981132075526 
 0.018867924528301886 
 0.002358490566037763 
 0.002358490566037763 
 0.0023584905660377358
 0.018867924528301886 
 0.004716981132075526 

In [24]:
rare_snps_for_simulation = snpid[rare_index][1:2:40]

20-element Array{SubString{String},1}:
 "rs3020701"  
 "rs181646587"
 "rs182902214"
 "rs184527030"
 "rs10409990" 
 "rs185166611"
 "rs181637538"
 "rs186213888"
 "rs184010370"
 "rs11667161" 
 "rs188819713"
 "rs182378235"
 "rs146361744"
 "rs190575937"
 "rs149949827"
 "rs117671630"
 "rs149171388"
 "rs188520640"
 "rs142722885"
 "rs146938393"

In [25]:
geno_rare20_converted = convert(DataFrame, convert(Matrix{Float64}, @view(data_rare[:, 1:2:40])))
names!(geno_rare20_converted, Symbol.(rare_snps_for_simulation))

Unnamed: 0_level_0,rs3020701,rs181646587,rs182902214,rs184527030,rs10409990,rs185166611,rs181637538,rs186213888,rs184010370,rs11667161,rs188819713,rs182378235,rs146361744,rs190575937,rs149949827,rs117671630,rs149171388,rs188520640,rs142722885,rs146938393
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,3.0,0.0,3.0,0.0,3.0,3.0,0.0,3.0,3.0,3.0,0.0,0.0,3.0,3.0,0.0,3.0,3.0,0.0,0.0,3.0
2,3.0,0.0,3.0,0.0,3.0,3.0,0.0,3.0,3.0,3.0,0.0,0.0,3.0,3.0,0.0,3.0,3.0,0.0,0.0,3.0
3,3.0,0.0,3.0,0.0,3.0,3.0,0.0,3.0,3.0,3.0,0.0,0.0,3.0,3.0,0.0,3.0,3.0,0.0,0.0,3.0
4,3.0,0.0,3.0,0.0,3.0,3.0,0.0,3.0,3.0,3.0,0.0,0.0,3.0,3.0,0.0,3.0,3.0,0.0,0.0,3.0
5,3.0,0.0,3.0,0.0,3.0,3.0,0.0,3.0,3.0,3.0,0.0,0.0,3.0,3.0,0.0,3.0,3.0,0.0,0.0,3.0
6,3.0,0.0,3.0,0.0,3.0,3.0,0.0,3.0,3.0,3.0,0.0,0.0,3.0,3.0,0.0,3.0,3.0,0.0,0.0,3.0
7,3.0,0.0,3.0,0.0,3.0,3.0,0.0,3.0,3.0,3.0,0.0,0.0,3.0,3.0,0.0,3.0,3.0,0.0,0.0,3.0
8,3.0,0.0,3.0,0.0,3.0,3.0,0.0,3.0,3.0,3.0,0.0,2.0,3.0,3.0,0.0,3.0,3.0,0.0,0.0,3.0
9,3.0,0.0,3.0,0.0,3.0,3.0,0.0,3.0,3.0,3.0,0.0,0.0,3.0,3.0,0.0,3.0,3.0,0.0,0.0,3.0
10,3.0,0.0,3.0,0.0,3.0,3.0,0.0,3.0,3.0,3.0,0.0,2.0,3.0,3.0,0.0,3.0,3.0,0.0,0.0,3.0


## Generating Effect Sizes Based on MAF

For demonstration purposes, we generate effect sizes from the chi-squared (df = 1) density: we use the minor allele frequency (MAF) as x and evaluate f(x), where f is the pdf of the chi-squared (df = 1) distribution, and then divide by 5, so that the rarest SNPs have the largest effect sizes. The effect sizes are rounded to three decimal places throughout this example. Notice there is a random sign, +1 or -1, so that some effects increase and others decrease the simulated trait value.

In addition to the chi-squared distribution, we also show how to generate effect sizes from the exponential distribution, where we use the MAF as x and evaluate f(x), where f is proportional to an exponential density. 

## Chisquared(df = 1)

In [26]:
# Generating Effect Sizes from Chisquared(df = 1) density
n = length(maf_20_rare_snps)
chisq_coeff = zeros(n)

for i in 1:n
    chisq_coeff[i] = sign(rand() - .5) * chisqpdf(1, maf_20_rare_snps[i])/5.0
end
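The chi-squared density with one degree of freedom has the closed form $f(x) = e^{-x/2} / \sqrt{2\pi x}$ for $x > 0$, which grows without bound as $x \to 0$. The sketch below evaluates this formula directly, using a hypothetical helper `chisq1_pdf` instead of `chisqpdf` from StatsFuns.jl, for the most common and one of the rarest of the 20 SNPs.

```julia
# Closed-form pdf of the chi-squared distribution with 1 degree of freedom, for x > 0
chisq1_pdf(x) = exp(-x / 2) / sqrt(2π * x)

# Two of the minor allele frequencies listed earlier in this example
toy_maf = [0.01650943396226412, 0.002358490566037763]

# Effect size magnitudes: density divided by 5, rounded to three decimals
magnitudes = round.(chisq1_pdf.(toy_maf) ./ 5.0, digits = 3)
```

The rarer SNP receives the larger magnitude, which is the behavior the scaling is meant to produce; the random sign is applied separately.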

Take a look at the simulated coefficients on the left, next to the corresponding minor allele frequency. Notice the rarer SNPs have the largest absolute values for their effect sizes.

In [27]:
Ex2c_rare = round.([chisq_coeff maf_20_rare_snps], digits = 3)
Ex2c_rare = DataFrame(Chisq_Coefficient = Ex2c_rare[:, 1] , MAF_rare = Ex2c_rare[:, 2] )

Unnamed: 0_level_0,Chisq_Coefficient,MAF_rare
Unnamed: 0_level_1,Float64,Float64
1,-0.616,0.017
2,-0.666,0.014
3,-0.818,0.009
4,-0.575,0.019
5,0.818,0.009
6,1.159,0.005
7,-0.945,0.007
8,-0.818,0.009
9,0.945,0.007
10,-1.641,0.002


In [28]:
simulated_effectsizes_chisq = Ex2c_rare[:, 1]

20-element Array{Float64,1}:
 -0.616
 -0.666
 -0.818
 -0.575
  0.818
  1.159
 -0.945
 -0.818
  0.945
 -1.641
 -0.666
  1.159
  1.641
 -1.159
 -0.575
  1.641
  1.641
 -1.641
  0.575
 -1.159

### Exponential Distribution
Here we show how to generate effect sizes for the 20 rare SNPs from the exponential distribution: we use the MAF as x and evaluate $6e^{-200x}$, which is proportional to the pdf of an exponential distribution with rate 200.

In [29]:
simulated_effectsizes_exp = round.(6*exp.(-200*maf_20_rare_snps), digits = 3)

20-element Array{Float64,1}:
 0.221
 0.354
 0.909
 0.138
 0.909
 2.336
 1.457
 0.909
 1.457
 3.744
 0.354
 2.336
 3.744
 2.336
 0.138
 3.744
 3.744
 3.744
 0.138
 2.336
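To connect the expression `6*exp.(-200*maf_20_rare_snps)` to an exponential density, recall that an exponential distribution with rate $\lambda$ has pdf $f(x) = \lambda e^{-\lambda x}$. The sketch below, using a hypothetical helper `exp_pdf`, checks that the effect size is the rate-200 density rescaled by $6/200$.

```julia
# pdf of an exponential distribution with the given rate, for x ≥ 0
exp_pdf(x, rate) = rate * exp(-rate * x)

x = 0.002358490566037763                # one of the rarest MAFs in this example

effect_direct = 6 * exp(-200 * x)               # expression used in the tutorial
effect_from_pdf = (6 / 200) * exp_pdf(x, 200)   # rescaled rate-200 density

same_value = effect_direct ≈ effect_from_pdf
effect_rounded = round(effect_direct, digits = 3)
```

Unlike the chi-squared example, no random sign is applied here, so every effect size is positive.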

## Function for Mean Model Expression

In some cases a large number of variants may be used for simulation. Thus, in this example we create a function that takes a vector of coefficients and a vector of variant names and returns the mean model expression as a string. 

The function `FixedEffectTerms` builds the expression for the simulation process from the specified vectors of coefficients and SNP names. Its output is used to evaluate the mean effect, `μ`, in our mixed effects model. We use this function here instead of writing out all of the coefficients and variant names by hand.

In [30]:
rare_snps_for_simulation

20-element Array{SubString{String},1}:
 "rs3020701"  
 "rs181646587"
 "rs182902214"
 "rs184527030"
 "rs10409990" 
 "rs185166611"
 "rs181637538"
 "rs186213888"
 "rs184010370"
 "rs11667161" 
 "rs188819713"
 "rs182378235"
 "rs146361744"
 "rs190575937"
 "rs149949827"
 "rs117671630"
 "rs149171388"
 "rs188520640"
 "rs142722885"
 "rs146938393"

In [31]:
function FixedEffectTerms(effectsizes::AbstractVecOrMat, snps::AbstractVecOrMat)
    # Build a string of the form " + b1(snp1) + b2(snp2) + ..." from the arguments
    fixed_terms = ""
    # NOTE: the loop stops at length - 1, so the final coefficient/SNP pair is
    # omitted; this matches the 19-term expression shown in the output below
    for i in 1:length(effectsizes) - 1
        expression = " + " * string(effectsizes[i]) * "(" * snps[i] * ")"
        fixed_terms = fixed_terms * expression
    end
    return String(fixed_terms)
end


FixedEffectTerms (generic function with 1 method)
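A more compact way to build such a formula string is to pair each coefficient with its SNP name and `join` the terms. The sketch below uses a hypothetical helper, `fixed_effect_formula`, which is not part of TraitSimulation.jl; it includes every coefficient and SNP pair and produces no leading ` + `.

```julia
# Hypothetical helper: builds "b1(snp1) + b2(snp2) + ..." from matching vectors
function fixed_effect_formula(effectsizes::AbstractVector{<:Real},
                              snps::AbstractVector{<:AbstractString})
    length(effectsizes) == length(snps) ||
        throw(ArgumentError("need exactly one effect size per SNP"))
    # One term per (coefficient, SNP) pair, for example "-0.616(rs3020701)"
    terms = [string(b) * "(" * s * ")" for (b, s) in zip(effectsizes, snps)]
    return join(terms, " + ")
end

toy_formula = fixed_effect_formula([-0.616, 0.818], ["rs3020701", "rs10409990"])
```

The length check guards against silently dropping or mismatching terms when the two vectors are built separately.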

In [32]:
mean_formula_rare = FixedEffectTerms(simulated_effectsizes_chisq, rare_snps_for_simulation)

" + -0.616(rs3020701) + -0.666(rs181646587) + -0.818(rs182902214) + -0.575(rs184527030) + 0.818(rs10409990) + 1.159(rs185166611) + -0.945(rs181637538) + -0.818(rs186213888) + 0.945(rs184010370) + -1.641(rs11667161) + -0.666(rs188819713) + 1.159(rs182378235) + 1.641(rs146361744) + -1.159(rs190575937) + -0.575(rs149949827) + 1.641(rs117671630) + 1.641(rs149171388) + -1.641(rs188520640) + 0.575(rs142722885)"

## Example 2c) Mixed effects model Single Trait and rare variants:
$$
Y_{2c} ∼ N(\mu_{rare20}, 4* 2GRM + 2I)
$$


In [33]:
rare_20_snp_model = LMMTrait([mean_formula_rare], geno_rare20_converted, 4*(2*GRM) + 2*(I_n))
trait_rare_20_snps = simulate(rare_20_snp_model)

Unnamed: 0_level_0,trait1
Unnamed: 0_level_1,Float64
1,7.84557
2,8.32246
3,8.98565
4,10.4579
5,10.078
6,8.87202
7,7.18824
8,12.2827
9,7.04519
10,12.8496


In [34]:
describe(trait_rare_20_snps, stats = [:mean, :std, :min, :max, :eltype])

Unnamed: 0_level_0,variable,mean,std,min,max,eltype
Unnamed: 0_level_1,Symbol,Float64,Float64,Float64,Float64,DataType
1,trait1,8.12279,2.55208,1.11429,14.219,Float64


## Saving Simulation Results to Local Machine

Write the newly simulated trait into a comma-separated values (CSV) file for later use. Note that you can set the separator to '\t' for a tab-separated file, or to another separator of your choice. 

Here we output the simulated trait values for each of the 212 individuals, labeled by their pedigree ID and person ID.

In addition, we output the genotypes for the variants used to simulate this trait.

In [35]:
Trait2_mixed = hcat(Fam_Person_id, trait_rare_20_snps, geno_rare20_converted)

Unnamed: 0_level_0,FamID,PID,trait1,rs3020701,rs181646587,rs182902214,rs184527030,rs10409990,rs185166611,rs181637538,rs186213888,rs184010370,rs11667161,rs188819713,rs182378235,rs146361744,rs190575937,rs149949827,rs117671630,rs149171388,rs188520640,rs142722885,rs146938393
Unnamed: 0_level_1,Any,Any,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,1,16,7.84557,3.0,0.0,3.0,0.0,3.0,3.0,0.0,3.0,3.0,3.0,0.0,0.0,3.0,3.0,0.0,3.0,3.0,0.0,0.0,3.0
2,1,8228,8.32246,3.0,0.0,3.0,0.0,3.0,3.0,0.0,3.0,3.0,3.0,0.0,0.0,3.0,3.0,0.0,3.0,3.0,0.0,0.0,3.0
3,1,17008,8.98565,3.0,0.0,3.0,0.0,3.0,3.0,0.0,3.0,3.0,3.0,0.0,0.0,3.0,3.0,0.0,3.0,3.0,0.0,0.0,3.0
4,1,9218,10.4579,3.0,0.0,3.0,0.0,3.0,3.0,0.0,3.0,3.0,3.0,0.0,0.0,3.0,3.0,0.0,3.0,3.0,0.0,0.0,3.0
5,1,3226,10.078,3.0,0.0,3.0,0.0,3.0,3.0,0.0,3.0,3.0,3.0,0.0,0.0,3.0,3.0,0.0,3.0,3.0,0.0,0.0,3.0
6,2,29,8.87202,3.0,0.0,3.0,0.0,3.0,3.0,0.0,3.0,3.0,3.0,0.0,0.0,3.0,3.0,0.0,3.0,3.0,0.0,0.0,3.0
7,2,2294,7.18824,3.0,0.0,3.0,0.0,3.0,3.0,0.0,3.0,3.0,3.0,0.0,0.0,3.0,3.0,0.0,3.0,3.0,0.0,0.0,3.0
8,2,3416,12.2827,3.0,0.0,3.0,0.0,3.0,3.0,0.0,3.0,3.0,3.0,0.0,2.0,3.0,3.0,0.0,3.0,3.0,0.0,0.0,3.0
9,2,17893,7.04519,3.0,0.0,3.0,0.0,3.0,3.0,0.0,3.0,3.0,3.0,0.0,0.0,3.0,3.0,0.0,3.0,3.0,0.0,0.0,3.0
10,2,6952,12.8496,3.0,0.0,3.0,0.0,3.0,3.0,0.0,3.0,3.0,3.0,0.0,2.0,3.0,3.0,0.0,3.0,3.0,0.0,0.0,3.0


In [36]:
Coefficients = DataFrame(Coefficients = simulated_effectsizes_chisq)
SNPs_rare = DataFrame(SNPs = rare_snps_for_simulation)
Trait2_mixed_sim = hcat(Coefficients, SNPs_rare)


Unnamed: 0_level_0,Coefficients,SNPs
Unnamed: 0_level_1,Float64,SubStrin…
1,-0.616,rs3020701
2,-0.666,rs181646587
3,-0.818,rs182902214
4,-0.575,rs184527030
5,0.818,rs10409990
6,1.159,rs185166611
7,-0.945,rs181637538
8,-0.818,rs186213888
9,0.945,rs184010370
10,-1.641,rs11667161


In [37]:
#cd("/Users") #change to home directory
using CSV
CSV.write("Trait2c_mixed.csv", Trait2_mixed)
CSV.write("Trait2c_mixed_sim.csv", Trait2_mixed_sim);
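If you prefer not to depend on CSV.jl, a tab-separated file can also be written with the `DelimitedFiles` standard library. The sketch below writes two toy rows (copied from the table above) to a temporary file and reads them back; the file name `trait_toy.tsv` is a placeholder.

```julia
using DelimitedFiles

# Column names as a 1 × 3 row, followed by two toy data rows
header = ["FamID" "PID" "trait1"]
rows = [1 16 7.84557;
        1 8228 8.32246]

# Write to a temporary directory so nothing in the working directory is overwritten
outfile = joinpath(mktempdir(), "trait_toy.tsv")
writedlm(outfile, vcat(header, rows), '\t')

# Read the file back, splitting the header row from the numeric data
data_read, header_read = readdlm(outfile, '\t', header = true)
```

Because the numeric matrix literal mixes integers and floats, the IDs are stored and written as floating-point numbers in this sketch; a `DataFrame` written with `CSV.write` keeps each column's own type.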