# Trait Simulation Demonstration

In this notebook we demonstrate how to simulate phenotypic traits. We use the Mendel Option 28e data with known parameter estimates to validate whether the simulation is sensible. In example 2b, we follow Mendel Option 28e with the simulation parameters for Trait1 and Trait2 in Ped28e.out as shown below.

The user specifies arbitrary fixed effect sizes in examples 1 and 2. 

In the Generating Effect Sizes section of Example 3, we show how the user can generate effect sizes that depend on the minor allele frequencies through a function such as an exponential or chi-squared density. To help users who wish to include a large number of loci in the model, we created a function that automatically writes out the mean components. At the end of Example 3, we demonstrate how to write the results of each simulation to a file on the user's own machine.

## Mendel Option 28e Data: 
Mean effect:
$$
\mathbf{\mu} = \begin{bmatrix}
\mu_1 \\
\mu_2
\end{bmatrix}
= \begin{bmatrix}
40 + 3(sex) - 1.5(locus)\\
20 + 2(sex) - 1.5(locus)
\end{bmatrix}
$$

Covariance Matrix of Both Traits simulated Simultaneously through Linear Mixed Model (LMM):

$$
\Sigma  = V_a \otimes (2GRM) + V_e \otimes I_n
$$

Where we have the additive and environmental variances:

$$
V_a = 
\begin{bmatrix}
4 & 1\\
1 & 4
\end{bmatrix}
$$

$$
V_e = 
\begin{bmatrix}
2 & 0\\
0 & 2
\end{bmatrix}
$$

The kinship matrix is derived from the genetic relationship matrix (GRM) computed across the common SNPs with minor allele frequency of at least 0.05. $I_n$ is the $n \times n$ identity matrix, where $n$ is the number of individuals.
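
To make the Kronecker structure of $\Sigma$ concrete, here is a minimal sketch on a toy kinship matrix. The matrix `Φ_toy` and the other `_toy` names are hypothetical and are not taken from the Mendel 28e data; only `V_a` and `V_e` use the values given above.

```julia
using LinearAlgebra

# Toy kinship matrix for three people: two full siblings and one unrelated person.
# (Hypothetical values, not derived from the Mendel 28e data.)
Φ_toy = [0.5  0.25 0.0;
         0.25 0.5  0.0;
         0.0  0.0  0.5]

Va_toy = [4.0 1.0; 1.0 4.0]             # additive genetic variance components
Ve_toy = [2.0 0.0; 0.0 2.0]             # environmental variance components
n_toy  = size(Φ_toy, 1)
I_toy  = Matrix{Float64}(I, n_toy, n_toy)

# Σ = V_a ⊗ (2Φ) + V_e ⊗ I_n, with trait 1 for everyone stacked above trait 2
Σ_toy = kron(Va_toy, 2 .* Φ_toy) + kron(Ve_toy, I_toy)

size(Σ_toy)     # (6, 6): two traits times three people
Σ_toy[1, 1]     # variance of trait 1 for person 1: 4 * 1 + 2 = 6
Σ_toy[1, 2]     # trait 1 covariance between the siblings: 4 * 0.5 = 2
Σ_toy[1, 4]     # covariance of trait 1 and trait 2 within person 1: 1 * 1 + 0 = 1
```

The upper-left $n \times n$ block is the single-trait covariance $4 \cdot 2\Phi + 2I$ used for a single trait.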

# Reproducibility

For reproducibility, we set a random seed for each simulation with `Random.seed!(1234)` from the `Random` module. Users who want different data should comment out these commands or pass another value to `Random.seed!()`.

In [1]:
using Random
Random.seed!(1234);
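
As a quick check of what the seed does, the following minimal sketch resets the seed and confirms that the same random numbers come back. The variable names are illustrative only.

```julia
using Random

Random.seed!(1234)
first_draw = rand(3)

Random.seed!(1234)            # resetting the seed replays the same stream
second_draw = rand(3)

first_draw == second_draw     # true
```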

Machine information:

In [40]:
versioninfo()

Julia Version 1.0.3
Commit 099e826241 (2018-12-18 01:34 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)


# The notebook is organized as follows:
## Example 1: Generalized Linear Fixed Effects Model (no residual familial correlation)

### a) Single IID Non-Normal Trait:<br>
We simulate an iid Poisson Trait, location = 5.
$$
Y_{1a} ∼ Poisson(5)
$$

### b) Multiple Independent Traits: User specified distributions
We simulate two independent traits simultaneously, one Normal and one Poisson.<br>
$$
Y_{1b_{1}} ∼ N(\mu_{1b}, 2), \mu_{1b} = 40 + 3(sex) - 1.5(locus)\\
Y_{1b_{2}} ∼ Poisson(\mu_{2b}), \mu_{2b} = 2 + 2(sex) - 1.5(locus)
$$

## Example 2: Linear Mixed Model (with additive genetic variance component).
Note that the residual covariance among two relatives is the additive genetic variance times twice the kinship coefficient.  

### (a) Single Trait:
We simulate a Normal trait controlling for family structure, location = $\mu_{1}$ and scale = $V_{a_{1,1}} * 2GRM + V_{e_{1,1}} I$.
$$
Y_{2a} ∼ N(\mu_{1}, 4 * 2GRM + 2I)
$$


### (b) Multiple Correlated Traits: (Mendel Example 28e Simulation)
We simulate two correlated Normal Traits controlling for family structure, location = $\mu$ and scale = $\Sigma$. 
$$
Y_{2b} ∼ N(\mu, \Sigma) , \Sigma  = V_{a} \otimes (2GRM) + V_{e} \otimes I_{n}
$$

## Example 3: Rare Variant Linear Mixed Model with effect sizes as a function of the allele frequencies. 

The example also assumes an additive genetic variance component in the model, which includes 20 rare SNPs, defined as SNPs with minor allele frequencies greater than 0.002 but less than 0.02. In practice rare SNPs have smaller minor allele frequencies, but we are limited in this tutorial by the number of individuals in the data set. We use generated effect sizes to evaluate $\mu_{rare20}$. <br>

### (a) Single Trait: 
$$
Y_{3a} ∼ N(\mu_{rare20}, 4* 2GRM + 2I)
$$

# Reading the Mendel 28e data using SnpArrays

First use `SnpArrays.jl` to read in the SNP data.

In [2]:
using DataFrames, SnpArrays, StatsModels, Random, LinearAlgebra, DelimitedFiles, StatsBase, TraitSimulation
snpdata = SnpArray("traitsim28e.bed", 212)

212×253141 SnpArray:
 0x03  0x03  0x00  0x03  0x03  0x03  …  0x02  0x02  0x00  0x00  0x03  0x00
 0x03  0x03  0x00  0x02  0x02  0x03     0x00  0x03  0x00  0x00  0x03  0x00
 0x03  0x03  0x00  0x03  0x03  0x03     0x03  0x02  0x00  0x00  0x03  0x00
 0x03  0x03  0x00  0x03  0x03  0x03     0x02  0x03  0x00  0x00  0x03  0x00
 0x03  0x03  0x00  0x03  0x03  0x03     0x00  0x03  0x00  0x00  0x03  0x00
 0x03  0x03  0x00  0x03  0x03  0x03  …  0x00  0x00  0x00  0x00  0x00  0x03
 0x03  0x02  0x00  0x03  0x03  0x03     0x02  0x03  0x00  0x03  0x00  0x03
 0x03  0x03  0x00  0x03  0x03  0x03     0x02  0x03  0x00  0x03  0x00  0x03
 0x03  0x02  0x00  0x03  0x03  0x03     0x02  0x02  0x00  0x02  0x00  0x03
 0x03  0x02  0x00  0x03  0x03  0x03     0x03  0x03  0x00  0x03  0x00  0x03
 0x03  0x02  0x00  0x03  0x03  0x03  …  0x00  0x02  0x00  0x02  0x00  0x03
 0x03  0x03  0x00  0x03  0x03  0x03     0x00  0x02  0x00  0x02  0x00  0x03
 0x03  0x02  0x00  0x03  0x03  0x03     0x02  0x02  0x00  0x02  0x00  0x03
    

Store the FamID and PersonID of Individuals in Mendel 28e data

In [3]:
famfile = readdlm("traitsim28e.fam", ',')
Fam_Person_id = DataFrame(FamID = famfile[:, 1], PID = famfile[:, 2])

| Row | FamID | PID |
|---|---|---|
| | Any | Any |
| 1 | 1 | 16 |
| 2 | 1 | 8228 |
| 3 | 1 | 17008 |
| 4 | 1 | 9218 |
| 5 | 1 | 3226 |
| 6 | 2 | 29 |
| 7 | 2 | 2294 |
| 8 | 2 | 3416 |
| 9 | 2 | 17893 |
| 10 | 2 | 6952 |


Note: We store the two traits simulated by Mendel Option 28e in `traits_original` so that we can compare them with our own simulated traits in Example 2b.

In [4]:
traits_original = DataFrame(Trait1 = famfile[:, 7], Trait2 = famfile[:, 8])

| Row | Trait1 | Trait2 |
|---|---|---|
| | Any | Any |
| 1 | 30.2056 | 9.2421 |
| 2 | 35.8214 | 15.2746 |
| 3 | 36.053 | 19.505 |
| 4 | 38.9635 | 18.9857 |
| 5 | 33.7391 | 21.1041 |
| 6 | 34.8884 | 19.0114 |
| 7 | 37.7011 | 19.1656 |
| 8 | 45.1317 | 19.8409 |
| 9 | 35.156 | 14.1423 |
| 10 | 42.4514 | 19.9271 |


Recode the sex variable from M/F to 1/-1, matching the coding used in the Mendel Option 28e data.

In [5]:
sex = map(x -> strip(x) == "F" ? -1.0 : 1.0, famfile[:, 5])

212-element Array{Float64,1}:
 -1.0
 -1.0
  1.0
  1.0
 -1.0
 -1.0
  1.0
  1.0
 -1.0
  1.0
 -1.0
  1.0
 -1.0
  ⋮  
  1.0
  1.0
  1.0
 -1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0

### Names of Variants:

We want to find the index of the causal SNP, rs10412915, in the SNP definition (`.bim`) file and then subset that SNP from the genetic marker data above.
We store the SNP names in a vector called `snpid`.

In [6]:
snpdef28_1 = readdlm("traitsim28e.bim", Any; header = false)
snpid = map(x -> strip(string(x)), snpdef28_1[:, 1])

253141-element Array{SubString{String},1}:
 "rs3020701"  
 "rs56343121" 
 "rs143501051"
 "rs56182540" 
 "rs7260412"  
 "rs11669393" 
 "rs181646587"
 "rs8106297"  
 "rs8106302"  
 "rs183568620"
 "rs186451972"
 "rs189699222"
 "rs182902214"
 ⋮            
 "rs188169422"
 "rs144587467"
 "rs139879509"
 "rs143250448"
 "rs145384750"
 "rs149215836"
 "rs139221927"
 "rs181848453"
 "rs138318162"
 "rs186913222"
 "rs141816674"
 "rs150801216"

We see that the causal snp, rs10412915, is the 236074th variant in the snp dataset.

In [7]:
ind_rs10412915 = findall(x -> x == "rs10412915", snpid)[1]

236074
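
As an aside, the same lookup can be written with `findfirst`, which stops at the first match and returns `nothing` when the name is absent. This is a minimal sketch on a toy vector; `snp_names_toy` is a hypothetical stand-in for `snpid`.

```julia
# Toy vector of SNP names (hypothetical stand-in for `snpid`).
snp_names_toy = ["rs3020701", "rs56343121", "rs10412915", "rs7260412"]

idx_toy = findfirst(==("rs10412915"), snp_names_toy)     # 3

# `findfirst` returns `nothing` when there is no match, so check before indexing.
absent_idx = findfirst(==("rs0000000"), snp_names_toy)
absent_idx === nothing                                   # true
```

By contrast, `findall(...)[1]` throws a `BoundsError` on an empty result, so checking for `nothing` gives a clearer failure when a SNP name is misspelled.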

Let's create a design matrix for the alternative model that includes sex and locus rs10412915.

In [8]:
locus = convert(Vector{Float64}, @view(snpdata[:, ind_rs10412915]))
X = DataFrame(sex = sex, locus = locus)

| Row | sex | locus |
|---|---|---|
| | Float64 | Float64 |
| 1 | -1.0 | 2.0 |
| 2 | -1.0 | 0.0 |
| 3 | 1.0 | 2.0 |
| 4 | 1.0 | 2.0 |
| 5 | -1.0 | 1.0 |
| 6 | -1.0 | 1.0 |
| 7 | 1.0 | 1.0 |
| 8 | 1.0 | 2.0 |
| 9 | -1.0 | 1.0 |
| 10 | 1.0 | 1.0 |


## The Variance Covariance Matrix
### Single Trait 
Recall : $E(\mathbf{GRM}) = \Phi$ and $\mathbf{V} = 2\mathbf{V_a} \mathbf{\Phi} + \mathbf{V_e} \mathbf{I}$
<br>
We will use the same values of $\mathbf{GRM}$, $V_a$, and $V_e$ for the mixed effect example (2) and for the rare variant example (3).

We use the SnpArrays.jl package to compute the Genetic Relationship Matrix (GRM).

In [9]:
GRM = grm(snpdata, method = :GRM)
V_A = [4 1; 1 4]
V_E = [2.0 0.0; 0.0 2.0]
I_n = Matrix{Float64}(I, size(GRM));

### Multiple Correlated Traits

The variance-covariance matrix $\mathbf{Σ}$ specified by Mendel Option 28e is generated here. To simulate traits with different variance components, change the terms of $\Sigma  = V_a \otimes (2GRM) + V_e \otimes I$. We create the variance component object `variancecomp` below and use it to simulate the traits in Example 2b.

In [10]:
variancecomp = @vc V_A ⊗ GRM + V_E ⊗ I_n;

# Example 1 Generalized Linear Model:

This example simulates a case where three SNPs have fixed effects on the trait. Any apparent genetic correlation between relatives for the trait is due to the effects of these SNPs, so once their effects are modelled there should be no residual correlation among relatives. Note that by default, individuals with missing genotype values will have missing phenotype values, unless the user specifies the argument `impute = true` in the `convert` function above.
Be sure to change `Random.seed!(1234)` to something else (or comment it out) if you want to generate a new data set.


### a) Single IID Non-Normal Trait: <br>
We simulate an iid Poisson Trait, location = 5. We use the Identity Link to simulate this Poisson trait.
$$
Y_{1a} ∼ Poisson(5)
$$



In [11]:
GLM_trait_model_Poisson5 = GLMTrait(5, X, PoissonResponse(), IdentityLink())
Simulated_GLM_trait = simulate(GLM_trait_model_Poisson5)

| Row | trait1 |
|---|---|
| | Int64 |
| 1 | 7 |
| 2 | 3 |
| 3 | 3 |
| 4 | 7 |
| 5 | 2 |
| 6 | 4 |
| 7 | 1 |
| 8 | 3 |
| 9 | 4 |
| 10 | 3 |


Descriptive Statistics of Poisson(5) Trait

In [12]:
describe(Simulated_GLM_trait[:, 1])

Summary Stats:
Length:         212
Missing Count:  0
Mean:           5.080189
Minimum:        1.000000
1st Quartile:   4.000000
Median:         5.000000
3rd Quartile:   6.000000
Maximum:        11.000000
Type:           Int64


# Example 1b) Multiple Independent Traits: User specified distributions

Here we simulate two independent traits simultaneously, one from a Normal distribution and the other from a Poisson distribution. Unlike Example 1a, we use the `LogLink` to simulate the Poisson trait this time.

$$
Y_{1b_{1}} ∼ N(\mu_{1b}, 2), \text{ where } \mu_{1b} = 40 + 3(sex) - 1.5(locus)\\
Y_{1b_{2}} ∼ Poisson(\mu_{2b}), \text{ where } \mu_{2b} = 2 + 2(sex) - 1.5(locus)
$$
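
The role of the link function can be illustrated with a small sketch. It assumes the standard GLM convention that the identity link leaves the linear predictor unchanged and the log link maps it to the mean through `exp`; whether `TraitSimulation.jl` applies the links in exactly this way is an assumption here. The two individuals and the `_toy` names are hypothetical.

```julia
# Two hypothetical individuals: a male with one copy of the allele
# and a female with two copies (sex is coded 1 for male, -1 for female).
sex_toy   = [1.0, -1.0]
locus_toy = [1.0, 2.0]

η1 = 40 .+ 3 .* sex_toy .- 1.5 .* locus_toy    # linear predictor for trait 1
η2 = 2 .+ 2 .* sex_toy .- 1.5 .* locus_toy     # linear predictor for trait 2

μ1_toy = η1           # identity link: mean equals η, here [41.5, 34.0]
μ2_toy = exp.(η2)     # log link: mean equals exp(η), here [exp(2.5), exp(-3.0)]
```

Under this convention the Poisson mean is about 12.2 for the first individual and about 0.05 for the second, which would explain a trait with many zeros and a long right tail.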

In [13]:
# Simulate multiple GLM traits, each with its own response distribution and link
dist_type_vector = [NormalResponse(4), PoissonResponse()]
link_type_vector = [IdentityLink(), LogLink()]

Ex1b_formulas = ["40 + 3(sex) - 1.5(locus)", "2 + 2(sex) - 1.5(locus)"]

Multiple_GLM_traits_model_NOTIID = Multiple_GLMTraits(Ex1b_formulas, X, dist_type_vector, link_type_vector)
Simulated_GLM_trait_NOTIID = simulate(Multiple_GLM_traits_model_NOTIID)

| Row | trait1 | trait2 |
|---|---|---|
| | Float64 | Int64 |
| 1 | 33.2534 | 0 |
| 2 | 35.606 | 1 |
| 3 | 32.8041 | 1 |
| 4 | 38.9971 | 2 |
| 5 | 40.3932 | 1 |
| 6 | 34.869 | 0 |
| 7 | 39.7272 | 6 |
| 8 | 32.5276 | 2 |
| 9 | 37.8167 | 0 |
| 10 | 41.8518 | 15 |


In [39]:
describe(Simulated_GLM_trait_NOTIID, stats = [:mean, :std, :min, :q25, :median, :q75, :max, :eltype])

| Row | variable | mean | std | min | q25 | median | q75 | max | eltype |
|---|---|---|---|---|---|---|---|---|---|
| | Symbol | Float64 | Float64 | Real | Float64 | Float64 | Float64 | Real | DataType |
| 1 | trait1 | 38.1454 | 5.20123 | 24.1389 | 34.7277 | 38.5765 | 41.7131 | 50.6483 | Float64 |
| 2 | trait2 | 6.1934 | 11.9836 | 0.0 | 0.0 | 1.0 | 9.0 | 63.0 | Int64 |


# Example 2: Linear Mixed Model (with additive genetic variance component).
Note that the residual covariance among two relatives is the additive genetic variance times twice the kinship coefficient.  


## (a) Single Trait:
We simulate a Normal trait controlling for family structure, location = $\mu_{1}$ and scale = $4 * 2GRM + 2I$.
$$
Y_{2a} ∼ N(\mu_{1}, 4 * 2GRM + 2I)
$$
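
One common way to draw from such a multivariate Normal distribution is through a Cholesky factor of the covariance matrix. The sketch below shows the idea on a toy three-person kinship matrix; it is not necessarily how `simulate` is implemented in `TraitSimulation.jl`, and all `_toy` names and values are hypothetical.

```julia
using LinearAlgebra, Random

rng_toy = MersenneTwister(1234)

# Toy kinship matrix (hypothetical): two full siblings and one unrelated person.
Φ_toy = [0.5 0.25 0.0; 0.25 0.5 0.0; 0.0 0.0 0.5]
μ_toy = [41.5, 34.0, 38.5]                        # hypothetical fixed-effect means

# Single-trait covariance: 4 * (2Φ) + 2 * I
Σ_toy = 4 .* (2 .* Φ_toy) + 2 .* Matrix{Float64}(I, 3, 3)

# If Σ = L * L', then μ + L * z with z ~ N(0, I) has mean μ and covariance Σ.
L_toy = cholesky(Symmetric(Σ_toy)).L
y_toy = μ_toy + L_toy * randn(rng_toy, 3)
```

The siblings share the off-diagonal covariance $4 \cdot 2 \cdot 0.25 = 2$, while the unrelated person is independent of both.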

In [14]:
Ex2a_formula = ["40 + 3(sex) - 1.5(locus)"]

1-element Array{String,1}:
 "40 + 3(sex) - 1.5(locus)"

In [15]:
Ex2a_model = LMMTrait(Ex2a_formula, X, 4*(2*GRM) + 2*(I_n))
trait_2a = simulate(Ex2a_model)

| Row | trait1 |
|---|---|
| | Float64 |
| 1 | 31.4138 |
| 2 | 35.8122 |
| 3 | 36.6426 |
| 4 | 39.8293 |
| 5 | 36.6758 |
| 6 | 33.7105 |
| 7 | 39.885 |
| 8 | 40.3451 |
| 9 | 35.4798 |
| 10 | 37.1198 |


In [16]:
describe(trait_2a, stats = [:mean, :std, :min, :q25, :median, :q75, :max, :eltype])

| Row | variable | mean | std | min | q25 | median | q75 | max | eltype |
|---|---|---|---|---|---|---|---|---|---|
| | Symbol | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | DataType |
| 1 | trait1 | 38.0685 | 4.12272 | 27.4444 | 34.7404 | 37.9142 | 41.1204 | 47.197 | Float64 |


## Example 2b) Simulating Two Correlated Traits with Mendel Option 28e parameters


### (b) Multiple Correlated Traits: (Mendel Example 28e Simulation)
We simulate two correlated Normal traits controlling for family structure, location = $\mu$ and scale = $\Sigma$.
$$
Y_{2b} ∼ N(\mu, \Sigma), \text{ where } \Sigma  = V_a \otimes (2GRM) + V_e \otimes I_n
$$


These are the formulas for the fixed effects, as specified by Mendel Option 28e.

In [17]:
Example2b_formulas = ["40 + 3(sex) - 1.5(locus)", "20 + 2(sex) - 1.5(locus)"]

2-element Array{String,1}:
 "40 + 3(sex) - 1.5(locus)"
 "20 + 2(sex) - 1.5(locus)"

In [18]:
Ex2b_model = LMMTrait(Example2b_formulas, X, variancecomp)
trait_2b = simulate(Ex2b_model)

| Row | trait1 | trait2 |
|---|---|---|
| | Float64 | Float64 |
| 1 | 33.0168 | 15.7839 |
| 2 | 41.9126 | 23.6211 |
| 3 | 38.8357 | 16.794 |
| 4 | 39.2944 | 17.6285 |
| 5 | 41.0893 | 20.9907 |
| 6 | 35.9302 | 19.2215 |
| 7 | 45.7619 | 24.5611 |
| 8 | 39.3657 | 21.864 |
| 9 | 38.4263 | 16.3262 |
| 10 | 44.8287 | 20.6457 |


### Summary Statistics of Our Simulated Traits

In [19]:
describe(trait_2b, stats = [:mean, :std, :min, :max, :eltype])

| Row | variable | mean | std | min | max | eltype |
|---|---|---|---|---|---|---|
| | Symbol | Float64 | Float64 | Float64 | Float64 | DataType |
| 1 | trait1 | 38.315 | 4.48644 | 26.0719 | 47.6865 | Float64 |
| 2 | trait2 | 18.2868 | 4.214 | 5.85156 | 29.2223 | Float64 |


### Summary Statistics of the original Mendel 28e dataset Traits:

Note we want to see similar values from our simulated traits!

In [20]:
describe(traits_original, stats = [:mean, :std, :min, :max, :eltype])

| Row | variable | mean | std | min | max | eltype |
|---|---|---|---|---|---|---|
| | Symbol | Float64 | Float64 | Float64 | Float64 | DataType |
| 1 | Trait1 | 37.8602 | 4.04887 | 29.2403 | 47.8619 | Any |
| 2 | Trait2 | 18.472 | 3.37633 | 9.2421 | 27.5554 | Any |


# Example 3: 20 Rare SNPs for Simulation


## Example 3: Rare Variant Linear Mixed Model with effect sizes as a function of the allele frequencies. 

In this example we first subset only the rare SNPs with minor allele frequency greater than 0.002 but less than 0.02, then we simulate traits using 20 of the rare SNPs as fixed effects. The 20 SNPs used for trait simulation in this example are listed below.

For this demo, the indexing `snpid[rare_index][1:2:40]` selects every other rare SNP among the first 40, which gives our list of 20 rare SNPs. Change the range and step to simulate with more or fewer SNPs, or with SNPs from different regions of the genome.

In practice rare SNPs have smaller minor allele frequencies, but we are limited in this tutorial by the number of individuals in the data set. We use generated effect sizes to evaluate $\mu_{rare20}$. <br>

### (a) Single Trait: 
$$
Y_{3a} ∼ N(\mu_{rare20}, 4* 2GRM + 2I)
$$

In [21]:
# Keep only the rare SNPs, i.e. those with 0.002 < MAF ≤ 0.02
minor_allele_frequency = maf(snpdata)
rare_index = (0.002 .< minor_allele_frequency .≤ 0.02)
data_rare = snpdata[:, rare_index]

212×80493 Array{UInt8,2}:
 0x03  0x00  0x00  0x00  0x03  0x03  …  0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x02  0x02  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03  …  0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03  …  0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
 0x03  0x00  0x00  0x00  0x03  0x03     0x00  0x00  0x00  0x00  0x00  0x00
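
The filtering step can be sketched on a small dosage matrix without `SnpArrays.jl`. Everything here is hypothetical toy data, and the thresholds are widened because five individuals cannot produce frequencies as small as 0.002.

```julia
using Statistics

# Toy dosage matrix: rows are individuals, columns are SNPs, and each
# entry counts copies of one allele (0, 1, or 2). Hypothetical data.
G_toy = [0 2 0 0;
         0 2 1 0;
         0 1 0 0;
         1 2 0 0;
         0 2 1 0]

allele_freq = vec(mean(G_toy, dims = 1)) ./ 2       # frequency of the counted allele
maf_toy = min.(allele_freq, 1 .- allele_freq)       # minor allele frequency

# A chained broadcast comparison yields a Boolean mask, as in the cell above.
mask_toy = 0.05 .< maf_toy .≤ 0.25
G_rare_toy = G_toy[:, mask_toy]                     # drops the monomorphic fourth SNP
```

Indexing the columns with the mask keeps the SNPs in their original order, which is why `snpid[rare_index]` and `minor_allele_frequency[rare_index]` stay aligned with the columns of `data_rare`.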

In [22]:
maf_20_rare_snps = minor_allele_frequency[rare_index][1:2:40]

20-element Array{Float64,1}:
 0.01650943396226412  
 0.014150943396226415 
 0.009433962264150941 
 0.018867924528301886 
 0.009433962264150941 
 0.004716981132075526 
 0.007075471698113208 
 0.009433962264150941 
 0.007075471698113178 
 0.002358490566037763 
 0.014150943396226415 
 0.0047169811320754715
 0.002358490566037763 
 0.004716981132075526 
 0.018867924528301886 
 0.002358490566037763 
 0.002358490566037763 
 0.0023584905660377358
 0.018867924528301886 
 0.004716981132075526 

In [23]:
# Names of every other rare SNP among the first 40 (20 SNPs in total)
rare_snps_for_simulation = snpid[rare_index][1:2:40]

20-element Array{SubString{String},1}:
 "rs3020701"  
 "rs181646587"
 "rs182902214"
 "rs184527030"
 "rs10409990" 
 "rs185166611"
 "rs181637538"
 "rs186213888"
 "rs184010370"
 "rs11667161" 
 "rs188819713"
 "rs182378235"
 "rs146361744"
 "rs190575937"
 "rs149949827"
 "rs117671630"
 "rs149171388"
 "rs188520640"
 "rs142722885"
 "rs146938393"

In [24]:
geno_rare20_converted = convert(DataFrame, convert(Matrix{Float64}, @view(data_rare[:, 1:2:40])))
names!(geno_rare20_converted, Symbol.(rare_snps_for_simulation))

| Row | rs3020701 | rs181646587 | rs182902214 | rs184527030 | rs10409990 | rs185166611 | rs181637538 | rs186213888 | rs184010370 | rs11667161 | rs188819713 | rs182378235 | rs146361744 | rs190575937 | rs149949827 | rs117671630 | rs149171388 | rs188520640 | rs142722885 | rs146938393 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 3.0 | 0.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 |
| 2 | 3.0 | 0.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 |
| 3 | 3.0 | 0.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 |
| 4 | 3.0 | 0.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 |
| 5 | 3.0 | 0.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 |
| 6 | 3.0 | 0.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 |
| 7 | 3.0 | 0.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 |
| 8 | 3.0 | 0.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 3.0 | 0.0 | 2.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 |
| 9 | 3.0 | 0.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 |
| 10 | 3.0 | 0.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 3.0 | 0.0 | 2.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 |


## Generating Effect Sizes Based on MAF

For demonstration purposes, we generate effect sizes from the Chi-squared(df = 1) density: we use the minor allele frequency (MAF) as x and evaluate f(x), where f is the Chi-squared(df = 1) pdf, scaled by 1/5, so that the rarest SNPs have the largest effect sizes. The effect sizes are rounded to three decimal places throughout this example. Each effect size is multiplied by a random +1 or -1, so that some effects increase and others decrease the simulated trait value.

In addition to the Chi-squared distribution, we also demonstrate how to generate effect sizes from the Exponential distribution, where we again use the MAF as x and evaluate f(x), with f now the Exponential density.

## Chisquared(df = 1)

In [25]:
# Generating Effect Sizes from Chisquared(df = 1) density
using StatsFuns
n = length(maf_20_rare_snps)
chisq_coeff = zeros(n)

for i in 1:n
    chisq_coeff[i] = sign(rand() - .5) * chisqpdf(1, maf_20_rare_snps[i])/5.0
end
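
For readers without `StatsFuns`, the Chi-squared(df = 1) density has the closed form $f(x) = e^{-x/2} / \sqrt{2\pi x}$ for $x > 0$. The sketch below writes it out and checks the magnitude of two effect sizes against the MAF values listed above; `chisq1_pdf` is a hypothetical helper name, not part of any package.

```julia
# Density of the chi-squared distribution with one degree of freedom,
# written out explicitly: f(x) = exp(-x/2) / sqrt(2πx) for x > 0.
chisq1_pdf(x) = exp(-x / 2) / sqrt(2 * pi * x)

maf_first  = 0.01650943396226412      # first MAF listed above
maf_rarest = 0.002358490566037763     # smallest MAF listed above

round(chisq1_pdf(maf_first) / 5, digits = 3)     # 0.616
round(chisq1_pdf(maf_rarest) / 5, digits = 3)    # 1.641
```

Because the density grows without bound as x approaches 0, smaller minor allele frequencies map to larger effect magnitudes; the random sign is applied separately.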

Take a look at the simulated coefficients on the left, next to the corresponding minor allele frequencies. Notice how the rarer SNPs have the largest effect sizes.

In [26]:
Ex3_rare = round.([chisq_coeff maf_20_rare_snps], digits = 3)
Ex3_rare = DataFrame(Chisq_Coefficient = Ex3_rare[:, 1] , MAF_rare = Ex3_rare[:, 2] )

| Row | Chisq_Coefficient | MAF_rare |
|---|---|---|
| | Float64 | Float64 |
| 1 | 0.616 | 0.017 |
| 2 | -0.666 | 0.014 |
| 3 | 0.818 | 0.009 |
| 4 | -0.575 | 0.019 |
| 5 | 0.818 | 0.009 |
| 6 | 1.159 | 0.005 |
| 7 | -0.945 | 0.007 |
| 8 | -0.818 | 0.009 |
| 9 | 0.945 | 0.007 |
| 10 | -1.641 | 0.002 |


In [27]:
simulated_effectsizes_chisq = Ex3_rare[:, 1]

20-element Array{Float64,1}:
  0.616
 -0.666
  0.818
 -0.575
  0.818
  1.159
 -0.945
 -0.818
  0.945
 -1.641
  0.666
 -1.159
  1.641
 -1.159
 -0.575
  1.641
 -1.641
  1.641
  0.575
 -1.159

### Simulating effect sizes from the Exponential distribution, where we use the maf as x and find f(x) where f is the pdf for the Exponential density

In [28]:
simulated_effectsizes_exp = round.(6*exp.(-200*maf_20_rare_snps), digits = 3)

20-element Array{Float64,1}:
 0.221
 0.354
 0.909
 0.138
 0.909
 2.336
 1.457
 0.909
 1.457
 3.744
 0.354
 2.336
 3.744
 2.336
 0.138
 3.744
 3.744
 3.744
 0.138
 2.336
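
The expression `6*exp.(-200*maf)` is a rescaled Exponential density. A minimal sketch of the relationship, with `exp_pdf` as a hypothetical helper name:

```julia
# Exponential density with rate λ: f(x) = λ * exp(-λ * x) for x ≥ 0.
exp_pdf(x, λ) = λ * exp(-λ * x)

λ = 200.0
maf_first = 0.01650943396226412           # first MAF listed above

# 6 * exp(-200x) equals (6 / λ) * f(x): the density rescaled so that
# the effect size approaches 6 as x approaches 0.
effect_first = (6 / λ) * exp_pdf(maf_first, λ)
round(effect_first, digits = 3)           # 0.221
```

Unlike the Chi-squared version, these effect sizes are all positive and bounded above by 6; both the rate of 200 and the cap of 6 are tuning choices made for this demonstration.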

## Function for Mean Model Expression

In some cases a large number of variants may be used for simulation. In this example we therefore create a function that takes a vector of coefficients and a vector of variant names and returns the mean model expression.

The function `FixedEffectTerms` builds the mean model expression for the simulation from the specified vectors of coefficients and SNP names. It returns the expression as a string, which is then used to evaluate the mean effect `μ` in our mixed effects model. Using this function saves us from writing out all 20 coefficients and variant names by hand.

In [29]:
rare_snps_for_simulation

20-element Array{SubString{String},1}:
 "rs3020701"  
 "rs181646587"
 "rs182902214"
 "rs184527030"
 "rs10409990" 
 "rs185166611"
 "rs181637538"
 "rs186213888"
 "rs184010370"
 "rs11667161" 
 "rs188819713"
 "rs182378235"
 "rs146361744"
 "rs190575937"
 "rs149949827"
 "rs117671630"
 "rs149171388"
 "rs188520640"
 "rs142722885"
 "rs146938393"

In [30]:
function FixedEffectTerms(effectsizes::AbstractVecOrMat, snps::AbstractVecOrMat)
    # Build the string " + β1(snp1) + β2(snp2) + ..." from the arguments.
    # The loop stops at length - 1, so the last SNP is left out of the formula,
    # which matches the 19-term output shown below.
    fixed_terms = ""
    for i in 1:length(effectsizes) - 1
        expression = " + " * string(effectsizes[i]) * "(" * snps[i] * ")"
        fixed_terms = fixed_terms * expression
    end
    return String(fixed_terms)
end


FixedEffectTerms (generic function with 1 method)
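
An alternative way to build the same kind of string is to pair each coefficient with its SNP name and join the pieces. This is a minimal sketch, and `fixed_effect_terms` is a hypothetical name that is not part of `TraitSimulation.jl`. Unlike the loop above, which stops one element short, this version emits a term for every SNP it is given.

```julia
# Build " + β1(snp1) + β2(snp2) + ..." from the function's own arguments.
# `fixed_effect_terms` is a hypothetical helper, not a package function.
function fixed_effect_terms(effectsizes::AbstractVector, snps::AbstractVector)
    length(effectsizes) == length(snps) ||
        throw(ArgumentError("need exactly one effect size per SNP"))
    terms = [string(" + ", β, "(", snp, ")") for (β, snp) in zip(effectsizes, snps)]
    return join(terms)
end

demo_formula = fixed_effect_terms([0.616, -0.666], ["rs3020701", "rs181646587"])
# " + 0.616(rs3020701) + -0.666(rs181646587)"
```

The length check guards against silently dropping or misaligning terms when the two vectors differ in size.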

In [31]:
formula_20_rare_snps = FixedEffectTerms(simulated_effectsizes_chisq, rare_snps_for_simulation)

" + 0.616(rs3020701) + -0.666(rs181646587) + 0.818(rs182902214) + -0.575(rs184527030) + 0.818(rs10409990) + 1.159(rs185166611) + -0.945(rs181637538) + -0.818(rs186213888) + 0.945(rs184010370) + -1.641(rs11667161) + 0.666(rs188819713) + -1.159(rs182378235) + 1.641(rs146361744) + -1.159(rs190575937) + -0.575(rs149949827) + 1.641(rs117671630) + -1.641(rs149171388) + 1.641(rs188520640) + 0.575(rs142722885)"

## Example 3(a) Mixed effects model Single Trait:
$$
Y_{3a} ∼ N(μ_{20raresnps}, 4* 2GRM + 2I)$$


This intermediate step uses the `mean_formula` function to evaluate the `formula_20_rare_snps` above on the 20 rare snp data, `geno_rare20_converted`,  to get the fixed effects mean vector (rounded to the third digit).

In [32]:
μ_20_rare_snps = round.(mean_formula(formula_20_rare_snps, geno_rare20_converted), digits = 3)

212-element Array{Float64,1}:
  7.137
  7.137
  7.137
  7.137
  7.137
  7.137
  7.137
  4.819
  7.137
  4.819
  7.137
  7.137
  7.137
  ⋮    
  7.137
  7.137
  7.137
  8.469
 10.419
  7.137
  7.137
  7.137
  7.137
  7.137
  7.137
  7.137
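
The values above can be checked by hand, assuming `mean_formula` evaluates the formula as a plain linear combination of the genotype columns. The sketch below takes the 19 coefficients that appear in `formula_20_rare_snps` and the genotype codes of individuals 1 and 8 from the table above.

```julia
using LinearAlgebra

# The 19 effect sizes that appear as terms in `formula_20_rare_snps`.
β = [0.616, -0.666, 0.818, -0.575, 0.818, 1.159, -0.945, -0.818, 0.945, -1.641,
     0.666, -1.159, 1.641, -1.159, -0.575, 1.641, -1.641, 1.641, 0.575]

# Genotype codes of individual 1 at those 19 SNPs (from the table above).
g1 = [3.0, 0.0, 3.0, 0.0, 3.0, 3.0, 0.0, 3.0, 3.0, 3.0,
      0.0, 0.0, 3.0, 3.0, 0.0, 3.0, 3.0, 0.0, 0.0]

# Individual 8 differs only at the 12th SNP (rs182378235), coded 2.0.
g8 = copy(g1)
g8[12] = 2.0

round(dot(g1, β), digits = 3)    # 7.137
round(dot(g8, β), digits = 3)    # 4.819
```

Both results agree with the first and eighth entries of `μ_20_rare_snps`.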

In [33]:
rare_20_snp_model = LMMTrait([formula_20_rare_snps], geno_rare20_converted, 4*(2*GRM) + 2*(I_n))
trait_rare_20_snps = simulate(rare_20_snp_model)

| Row | trait1 |
|---|---|
| | Float64 |
| 1 | 13.1579 |
| 2 | 8.96176 |
| 3 | 4.97421 |
| 4 | 11.8191 |
| 5 | 9.14336 |
| 6 | 4.33189 |
| 7 | 6.03755 |
| 8 | 2.22563 |
| 9 | 7.45517 |
| 10 | 5.35407 |


In [34]:
describe(trait_rare_20_snps, stats = [:mean, :std, :min, :max, :eltype])

| Row | variable | mean | std | min | max | eltype |
|---|---|---|---|---|---|---|
| | Symbol | Float64 | Float64 | Float64 | Float64 | DataType |
| 1 | trait1 | 7.01025 | 2.53983 | 0.548453 | 14.0576 | Float64 |


## Saving Simulation Results to Local Machine

Write the newly simulated trait into a comma separated (csv) file for later use. Note that the user can specify the separator to '\t' for tab separated, or another separator of choice. 

Here we output the simulated trait values for each of the 212 individuals, labeled by their pedigree ID and person ID.

In addition, we output the genotypes for the variants used to simulate this trait.

In [35]:
Trait3_mixed = hcat(Fam_Person_id, trait_rare_20_snps, geno_rare20_converted)

| Row | FamID | PID | trait1 | rs3020701 | rs181646587 | rs182902214 | rs184527030 | rs10409990 | rs185166611 | rs181637538 | rs186213888 | rs184010370 | rs11667161 | rs188819713 | rs182378235 | rs146361744 | rs190575937 | rs149949827 | rs117671630 | rs149171388 | rs188520640 | rs142722885 | rs146938393 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Any | Any | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 1 | 16 | 13.1579 | 3.0 | 0.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 |
| 2 | 1 | 8228 | 8.96176 | 3.0 | 0.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 |
| 3 | 1 | 17008 | 4.97421 | 3.0 | 0.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 |
| 4 | 1 | 9218 | 11.8191 | 3.0 | 0.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 |
| 5 | 1 | 3226 | 9.14336 | 3.0 | 0.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 |
| 6 | 2 | 29 | 4.33189 | 3.0 | 0.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 |
| 7 | 2 | 2294 | 6.03755 | 3.0 | 0.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 |
| 8 | 2 | 3416 | 2.22563 | 3.0 | 0.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 3.0 | 0.0 | 2.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 |
| 9 | 2 | 17893 | 7.45517 | 3.0 | 0.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 |
| 10 | 2 | 6952 | 5.35407 | 3.0 | 0.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 3.0 | 0.0 | 2.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 |


In [36]:
Coefficients = DataFrame(Coefficients = simulated_effectsizes_chisq)
SNPs_rare = DataFrame(SNPs = rare_snps_for_simulation)
Trait3_mixed_sim = hcat(Coefficients, SNPs_rare)

| Row | Coefficients | SNPs |
|---|---|---|
| | Float64 | SubStrin… |
| 1 | 0.616 | rs3020701 |
| 2 | -0.666 | rs181646587 |
| 3 | 0.818 | rs182902214 |
| 4 | -0.575 | rs184527030 |
| 5 | 0.818 | rs10409990 |
| 6 | 1.159 | rs185166611 |
| 7 | -0.945 | rs181637538 |
| 8 | -0.818 | rs186213888 |
| 9 | 0.945 | rs184010370 |
| 10 | -1.641 | rs11667161 |


In [37]:
#cd("/Users") #change to home directory
using CSV
CSV.write("Trait3_mixed.csv", Trait3_mixed)
CSV.write("Trait3_mixed_sim.csv", Trait3_mixed_sim);
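
For a tab-separated file without `CSV.jl`, the `DelimitedFiles` standard library is enough for a simple numeric table. The sketch below writes a header and two rows to a temporary file and reads them back; the file name and the tiny table are hypothetical stand-ins for the simulation results.

```julia
using DelimitedFiles

# Hypothetical stand-in for a small table of simulated results.
header_demo = ["FamID" "PID" "trait1"]
rows_demo   = Any[1 16 13.1579;
                  1 8228 8.96176]

path_demo = joinpath(mktempdir(), "trait_demo.tsv")
open(path_demo, "w") do io
    writedlm(io, header_demo, '\t')     # one header line
    writedlm(io, rows_demo, '\t')       # one line per individual
end

# Read the file back; with `header = true` the result is a (data, header) tuple.
data_back, header_back = readdlm(path_demo, '\t'; header = true)
```

Writing to a temporary directory keeps the sketch from overwriting the files created above.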