# Trait Simulation Demonstration

In this notebook we demonstrate how to simulate phenotypic traits. We use the Mendel Option 28e data, which has known parameter estimates, to check that the simulated traits are sensible.

We use the GAW19 data and provide three examples of how to simulate phenotypic traits. In all the examples, we follow Mendel Option 28e and use the following simulation parameters for Trait1 and Trait2 in Ped28e.out:

Mean effect:
$$
\mathbf{\mu} =
\begin{bmatrix}
\mu_1 \\
\mu_2
\end{bmatrix}
=
\begin{bmatrix}
40 + 3(sex) - 1.5(locus) \\
20 + 2(sex) - 1.5(locus)
\end{bmatrix}
$$

Covariance matrix of both traits, simulated simultaneously through a linear mixed model (LMM):

$$
\Sigma  = V_a \otimes (2GRM) + V_e \otimes I_n
$$

Where we have the additive and environmental variances:

$$
V_a =
\begin{bmatrix}
4 & 1 \\
1 & 4
\end{bmatrix}
$$

$$
V_e =
\begin{bmatrix}
2 & 0 \\
0 & 2
\end{bmatrix}
$$

The kinship matrix is derived from the genetic relationship matrix (GRM) across the common SNPs with minor allele frequency at least 0.05. I is the identity matrix. In these examples, we only use SNPs with genotype success rate > 0.98 but this criterion can be modified by the user. 
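
As a minimal sketch of how this covariance is assembled, the following builds $\Sigma$ for a toy kinship matrix with `kron` from Julia's LinearAlgebra standard library. The kinship values and the name `Φ` are illustrative assumptions, not values computed from the notebook's genotypes, and this is not necessarily how TraitSimulation builds the matrix internally.

```julia
using LinearAlgebra

# Toy kinship matrix: a parent-offspring pair plus one unrelated person
# (hypothetical values, not computed from the notebook's genotypes).
Φ = [0.5  0.25 0.0;
     0.25 0.5  0.0;
     0.0  0.0  0.5]

V_a = [4.0 1.0; 1.0 4.0]   # additive genetic covariance between the two traits
V_e = [2.0 0.0; 0.0 2.0]   # environmental covariance between the two traits
n = size(Φ, 1)

# Σ = V_a ⊗ (2Φ) + V_e ⊗ I_n
# Rows are ordered as trait 1 for all individuals, then trait 2 for all individuals.
Σ = kron(V_a, 2Φ) + kron(V_e, Matrix{Float64}(I, n, n))
```

With this ordering, `Σ[1, 1]` is the trait 1 variance of the first person (4 · 1 + 2 = 6), and `Σ[1, n + 1]` is the covariance between that person's two traits (1 · 1 + 0 = 1).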

The user specifies arbitrary fixed effect sizes in Examples 1 and 2. In the Generating Effect Sizes section of Example 3 we show how the user can generate effect sizes that depend on the minor allele frequencies through a function such as an exponential or chi-squared density. To help users who wish to include a large number of loci in the model, we created a function that automatically writes out the mean components. For reproducibility, we set a random seed for each simulation (`Random.seed!` in Julia 1.0 and later, formerly `srand`). Users who want different data need to comment out these commands or pass another value as the seed. At the end of each example, we write the results of the simulation to a file on the user's own machine.
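
The effect of seeding can be illustrated with a short sketch that uses only Julia's `Random` standard library. The seed value 1111 and the variable names are illustrative; in Julia 1.0 and later the seeding function is `Random.seed!`.

```julia
using Random

Random.seed!(1111)        # fix the state of the global random number generator
first_draw = randn(3)

Random.seed!(1111)        # resetting to the same seed reproduces the same stream
second_draw = randn(3)

same_draws = first_draw == second_draw   # true: identical seeds give identical draws
```

Changing the seed, or omitting the call, gives a different set of draws and therefore a different simulated data set.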

The notebook is organized as follows: <br>

Example 1: Generalized Linear Fixed Effects Model (no residual familial correlation)

a) Fixed effects model, normal trait: the user specifies the SNPs that have fixed effects
$$
Y_{1a} \sim N(\mu_{1}, 2), \text{ where } \mu_{1} = 40 + 3(sex) - 1.5(locus)
$$

b) Fixed effects model, non-normal trait:
$$
Y_{1b} \sim N(\mu_{1}, 2), \text{ where } \mu_{1} = 40 + 3(sex) - 1.5(locus)
$$

Example 2: Linear Mixed Model (models include an additive genetic variance component). Note that the residual covariance between two relatives is the additive genetic variance times twice their kinship coefficient.

(a) Mixed effects model, single trait:
$$
Y_{2a} \sim N(\mu_{1}, 4 \cdot 2GRM + 2I)
$$

(b) Mixed effects model, multiple traits:
$$
Y_{2b} \sim N(\mu, \Sigma)
$$

Example 3: Rare Variant Model with effect sizes as a function of the allele frequencies. This model also assumes an additive genetic variance component.

The example includes 21 rare SNPs, with minor allele frequencies greater than 0.002 but less than 0.02.  In practice rare SNPs have smaller minor allele frequencies but we are limited in this tutorial by the number of individuals in the data set.<br>

a) Mixed effects model: $Y \sim N(\mu, \Sigma)$, with $\mu$ the combination of the effects of the 21 variants.
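
To make the mixed model in Example 2(a) concrete, here is a minimal sketch of one draw from $N(\mu, 4 \cdot 2\Phi + 2I)$ using a Cholesky factor. The toy kinship matrix, the mean vector, and the seed are assumptions for illustration. The sketch uses only the standard library and is not TraitSimulation's own implementation.

```julia
using LinearAlgebra, Random

Random.seed!(1234)             # fixed seed so the draw is reproducible

Φ = [0.5 0.25; 0.25 0.5]       # toy kinship for a parent-offspring pair (assumption)
Σ = 4.0 * 2Φ + 2.0 * I         # single-trait covariance: 4 * 2Φ + 2I
μ = [40.0, 40.0]               # toy mean vector (intercept only)

L = cholesky(Symmetric(Σ)).L   # lower-triangular factor with L * L' ≈ Σ
y = μ + L * randn(length(μ))   # one draw from N(μ, Σ)
```

Here the diagonal of `Σ` is 4 · 1 + 2 = 6 and the off-diagonal is 4 · 0.5 = 2, which is the additive variance times twice the kinship coefficient of the pair.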

In [1]:
using TraitSimulation



In [2]:
using DataFrames, SnpArrays, StatsModels, Random, LinearAlgebra, DelimitedFiles
snpdata = SnpArray("traitsim28e.bed", 212)



212×253141 SnpArray:
 0x03  0x03  0x00  0x03  0x03  0x03  …  0x02  0x02  0x00  0x00  0x03  0x00
 0x03  0x03  0x00  0x02  0x02  0x03     0x00  0x03  0x00  0x00  0x03  0x00
 0x03  0x03  0x00  0x03  0x03  0x03     0x03  0x02  0x00  0x00  0x03  0x00
 0x03  0x03  0x00  0x03  0x03  0x03     0x02  0x03  0x00  0x00  0x03  0x00
 0x03  0x03  0x00  0x03  0x03  0x03     0x00  0x03  0x00  0x00  0x03  0x00
 0x03  0x03  0x00  0x03  0x03  0x03  …  0x00  0x00  0x00  0x00  0x00  0x03
 0x03  0x02  0x00  0x03  0x03  0x03     0x02  0x03  0x00  0x03  0x00  0x03
 0x03  0x03  0x00  0x03  0x03  0x03     0x02  0x03  0x00  0x03  0x00  0x03
 0x03  0x02  0x00  0x03  0x03  0x03     0x02  0x02  0x00  0x02  0x00  0x03
 0x03  0x02  0x00  0x03  0x03  0x03     0x03  0x03  0x00  0x03  0x00  0x03
 0x03  0x02  0x00  0x03  0x03  0x03  …  0x00  0x02  0x00  0x02  0x00  0x03
 0x03  0x03  0x00  0x03  0x03  0x03     0x00  0x02  0x00  0x02  0x00  0x03
 0x03  0x02  0x00  0x03  0x03  0x03     0x02  0x02  0x00  0x02  0x00  0x03
    

In [3]:
famfile = readdlm("traitsim28e.fam", ',')

212×9 Array{Any,2}:
     1     16       "          "  …  "          "  30.2056   9.2421  ""
     1   8228       "          "     "          "  35.8214  15.2746  ""
     1  17008       "          "     "          "  36.053   19.505   ""
     1   9218  17008                 "          "  38.9635  18.9857  ""
     1   3226   9218                 "          "  33.7391  21.1041  ""
     2     29       "          "  …  "          "  34.8884  19.0114  ""
     2   2294       "          "     "          "  37.7011  19.1656  ""
     2   3416       "          "     "          "  45.1317  19.8409  ""
     2  17893   2294                 "          "  35.156   14.1423  ""
     2   6952   3416                 "          "  42.4514  19.9271  ""
     2  14695   2294              …  "          "  35.6426  17.4191  ""
     2   6790   2294                 "          "  40.6344  23.6845  ""
     2   3916   2294                 "          "  34.8618  16.8684  ""
     ⋮                            ⋱  ⋮      

In [4]:
traits_original = DataFrame(Trait1 = famfile[:, 7], Trait2 = famfile[:, 8])

| Row | Trait1 | Trait2 |
|-----|--------|--------|
|     | Any | Any |
| 1 | 30.2056 | 9.2421 |
| 2 | 35.8214 | 15.2746 |
| 3 | 36.053 | 19.505 |
| 4 | 38.9635 | 18.9857 |
| 5 | 33.7391 | 21.1041 |
| 6 | 34.8884 | 19.0114 |
| 7 | 37.7011 | 19.1656 |
| 8 | 45.1317 | 19.8409 |
| 9 | 35.156 | 14.1423 |
| 10 | 42.4514 | 19.9271 |


# Summary Statistics of the original Mendel 28e dataset:

Note we want to see similar values from our simulated traits!

In [5]:
describe(traits_original)

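| Row | variable | mean | min | median | max | nunique | nmissing | eltype |
|-----|----------|------|-----|--------|-----|---------|----------|--------|
|     | Symbol | Float64 | Float64 | Nothing | Float64 | Int64 | Int64 | DataType |
| 1 | Trait1 | 37.8602 | 29.2403 |  | 47.8619 | 212 | 0 | Any |
| 2 | Trait2 | 18.472 | 9.2421 |  | 27.5554 | 212 | 0 | Any |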


Transform the sex variable from M/F to 1/-1, following the coding used in the Mendel Option 28e data.

In [6]:
sex = map(x -> strip(x) == "F" ? -1.0 : 1.0, famfile[:, 5])

212-element Array{Float64,1}:
 -1.0
 -1.0
  1.0
  1.0
 -1.0
 -1.0
  1.0
  1.0
 -1.0
  1.0
 -1.0
  1.0
 -1.0
  ⋮  
  1.0
  1.0
  1.0
 -1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0
  1.0

### Names of Variants:

We want to find the index of the causal SNP, rs10412915, in the SNP definition file and then subset that SNP from the genetic marker data above.
We collect the SNP names in a vector called `snpid`.

In [7]:
snpdef28_1 = readdlm("traitsim28e.bim", Any; header = false)
snpid = map(x -> strip(string(x)), snpdef28_1[:, 1])

253141-element Array{SubString{String},1}:
 "rs3020701"  
 "rs56343121" 
 "rs143501051"
 "rs56182540" 
 "rs7260412"  
 "rs11669393" 
 "rs181646587"
 "rs8106297"  
 "rs8106302"  
 "rs183568620"
 "rs186451972"
 "rs189699222"
 "rs182902214"
 ⋮            
 "rs188169422"
 "rs144587467"
 "rs139879509"
 "rs143250448"
 "rs145384750"
 "rs149215836"
 "rs139221927"
 "rs181848453"
 "rs138318162"
 "rs186913222"
 "rs141816674"
 "rs150801216"

# Example 1: User Specified Model

In this example we suppose that the user knows which causal SNP they want to have an effect on the trait. To save computing time and memory, we can look SNPs up by name in `snpid` and subset only the specified SNPs for the analysis.

We see that the causal snp, rs10412915, is the 236074th variant in the snp dataset.

In [8]:
ind_rs10412915 = findall(x -> x == "rs10412915", snpid)[1]

236074

Let's create a design matrix for the alternative model that includes sex and locus rs10412915.

In [9]:
locus = convert(Vector{Float64}, @view(snpdata[:,ind_rs10412915]))
X = DataFrame(sex = sex, locus = locus)

| Row | sex | locus |
|-----|-----|-------|
|     | Float64 | Float64 |
| 1 | -1.0 | 2.0 |
| 2 | -1.0 | 0.0 |
| 3 | 1.0 | 2.0 |
| 4 | 1.0 | 2.0 |
| 5 | -1.0 | 1.0 |
| 6 | -1.0 | 1.0 |
| 7 | 1.0 | 1.0 |
| 8 | 1.0 | 2.0 |
| 9 | -1.0 | 1.0 |
| 10 | 1.0 | 1.0 |


### The Variance Covariance Matrix $\mathbf{\Sigma}$
Recall: $E(\mathbf{GRM}) = \Phi$ and $\mathbf{\Sigma} = \mathbf{V_a} \otimes (2\mathbf{\Phi}) + \mathbf{V_e} \otimes \mathbf{I}$
<br>
We will use the same values of $\mathbf{GRM}$, $V_a$, and $V_e$ for the single-trait mixed model (Example 2a), the multiple-trait mixed model (Example 2b), and the rare variant model (Example 3).

We use the SnpArrays.jl package to compute the Genetic Relationship Matrix (GRM).

In [10]:
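GRM = grm(snpdata, method = :GRM)       # empirical kinship estimate from the genotypes
V_A = [4 1; 1 4]                         # additive genetic covariance of the two traits
V_E = [2.0 0.0; 0.0 2.0]                 # environmental covariance of the two traits
I_n = Matrix{Float64}(I, size(GRM));     # n × n identity matrix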

The corresponding variance-covariance matrix, $\mathbf{Σ}$, is generated here. To simulate a trait with different variance components, change the terms of $\Sigma  = V_a \otimes (2GRM) + V_e \otimes I$. We create the variance component object `variancecomp` below to simulate our traits. Note that the cell below passes `GRM` as written, without the explicit factor of 2 that appears in the formula.

In [11]:
variancecomp = @vc V_A ⊗ GRM + V_E ⊗ I_n;

# A) Generalized Linear Model:


# Y ∼ N( μ, 1.0)

# μ = ? 

This example simulates a case where three SNPs have fixed effects on the trait. Any apparent genetic correlation between relatives for the trait is due to the effects of these SNPs, so once their effects are modelled there should be no residual correlation among relatives. Note that by default, individuals with missing genotype values will have missing phenotype values, unless the user specifies the argument `impute = true` in the `convert` function above.
Be sure to change `Random.seed!(1111)` to another value (or comment it out) if you want to generate a new data set.


# Simulating Traits under Null Model

We simulate the two traits under the null model, with intercepts only. 

In [12]:
#trait_null = simulate(LMMTrait(["40", "20"], X, variancecomp))
trait_null_weird = simulate(LMMTrait(["40 + 0(sex) + 0(locus)", "20 + 0(sex) + 0(locus)"], X, variancecomp))

| Row | trait1 | trait2 |
|-----|--------|--------|
|     | Float64 | Float64 |
| 1 | 41.7539 | 22.2263 |
| 2 | 39.2975 | 21.1787 |
| 3 | 39.8776 | 17.6752 |
| 4 | 38.5322 | 20.4645 |
| 5 | 40.3502 | 17.7098 |
| 6 | 40.351 | 15.354 |
| 7 | 41.5968 | 21.0145 |
| 8 | 37.9784 | 20.7133 |
| 9 | 36.1813 | 15.8798 |
| 10 | 38.7386 | 14.3234 |


In [13]:
user_formula_string = ["40", "20"]
users_formula_expression = Meta.parse.(user_formula_string)
typeof(users_formula_expression)
simulate(LMMTrait(["40", "20"], X, variancecomp))

| Row | trait1 | trait2 |
|-----|--------|--------|
|     | Float64 | Float64 |
| 1 | 43.4836 | 23.9638 |
| 2 | 49.3585 | 28.2032 |
| 3 | 40.9944 | 18.4056 |
| 4 | 42.0447 | 23.4854 |
| 5 | 44.9094 | 34.8414 |
| 6 | 42.431 | 26.8722 |
| 7 | 42.3989 | 24.1662 |
| 8 | 40.4534 | 24.0244 |
| 9 | 38.2909 | 21.1426 |
| 10 | 39.8644 | 24.3895 |


# These are the formulas for the alternative model simulation

In [14]:
formulas = ["40 + 3(sex) - 1.5(locus)", "20 + 2(sex) - 1.5(locus)"]

2-element Array{String,1}:
 "40 + 3(sex) - 1.5(locus)"
 "20 + 2(sex) - 1.5(locus)"

# Trait Simulated from Alternative Model 
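
Before simulating, it can help to check by hand what means the two formulas imply. The sketch below evaluates them on three made-up individuals with plain broadcasting. The toy `sex` and `locus` values are assumptions, and this only mirrors the arithmetic of the formulas, not how TraitSimulation parses them.

```julia
sex   = [-1.0, 1.0, 1.0]    # toy sex coding: -1 = female, 1 = male
locus = [2.0, 0.0, 1.0]     # toy genotype codes (0, 1, 2) at the causal SNP

# Element-wise evaluation of "40 + 3(sex) - 1.5(locus)" and "20 + 2(sex) - 1.5(locus)"
μ1 = 40 .+ 3 .* sex .- 1.5 .* locus
μ2 = 20 .+ 2 .* sex .- 1.5 .* locus
```

For these inputs `μ1` is `[34.0, 43.0, 41.5]` and `μ2` is `[15.0, 22.0, 20.5]`.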

In [15]:
alternative_model = LMMTrait(formulas, X, variancecomp)
trait_alternative = simulate(alternative_model)

| Row | trait1 | trait2 |
|-----|--------|--------|
|     | Float64 | Float64 |
| 1 | 32.3437 | 19.852 |
| 2 | 36.5528 | 16.4761 |
| 3 | 40.3608 | 18.6442 |
| 4 | 41.0844 | 20.4292 |
| 5 | 35.6206 | 13.8892 |
| 6 | 37.3675 | 23.0473 |
| 7 | 44.6417 | 20.2759 |
| 8 | 42.1219 | 24.3055 |
| 9 | 33.4115 | 20.1576 |
| 10 | 42.7116 | 25.8549 |


In [16]:
describe(trait_alternative)

| Row | variable | mean | min | median | max | nunique | nmissing | eltype |
|-----|----------|------|-----|--------|-----|---------|----------|--------|
|     | Symbol | Float64 | Float64 | Float64 | Float64 | Nothing | Nothing | DataType |
| 1 | trait1 | 38.306 | 26.5297 | 38.4138 | 49.3515 |  |  | Float64 |
| 2 | trait2 | 18.0051 | 7.86279 | 17.6972 | 30.3197 |  |  | Float64 |


## 21 Rare SNPs for Simulation

Here are the 21 SNPs that will be used for trait simulation in this example.

For this demo, the indexing `snpid[rare_index][1:2:41]` takes every other rare SNP among the first 41 rare SNPs, giving our list of 21 rare SNPs. Here `rare_index` stands for the Boolean filter `0.002 .< minor_allele_frequency .≤ 0.02`. Change the range and number of SNPs to simulate with more or fewer SNPs, or with SNPs from different regions of the genome.
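
The filter-then-stride pattern can be tried on a small made-up example first. The frequencies and SNP names below are hypothetical; only the indexing idiom is the point.

```julia
# Hypothetical minor allele frequencies and SNP names
toy_maf = [0.30, 0.01, 0.015, 0.001, 0.004, 0.02, 0.25, 0.018]
toy_ids = ["snp1", "snp2", "snp3", "snp4", "snp5", "snp6", "snp7", "snp8"]

# Boolean mask: keep SNPs with 0.002 < MAF ≤ 0.02
toy_rare_index = 0.002 .< toy_maf .≤ 0.02

# Names of all rare SNPs, then every other one of them
toy_rare_ids = toy_ids[toy_rare_index]
toy_selected = toy_rare_ids[1:2:end]

# The range 1:2:41 used in the notebook picks positions 1, 3, ..., 41
n_selected = length(1:2:41)
```

Here five SNPs pass the filter, the stride keeps `snp2`, `snp5`, and `snp8`, and `1:2:41` has 21 elements, which is why the notebook ends up with 21 rare SNPs.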

In [17]:
minor_allele_frequency = maf(snpdata)

253141-element Array{Float64,1}:
 0.01650943396226412 
 0.08254716981132071 
 0.009433962264150943
 0.08726415094339623 
 0.08490566037735847 
 0.02594339622641506 
 0.014150943396226415
 0.05896226415094341 
 0.05896226415094341 
 0.018867924528301886
 0.08490566037735847 
 0.025943396226415096
 0.009433962264150941
 ⋮                   
 0.08254716981132071 
 0.01179245283018868 
 0.3136792452830188  
 0.2806603773584906  
 0.28773584905660377 
 0.37028301886792453 
 0.30660377358490565 
 0.07075471698113212 
 0.04481132075471698 
 0.2169811320754717  
 0.2806603773584906  
 0.2783018867924528  

In [18]:
rare_index = 0.002 .< minor_allele_frequency .≤ 0.02   # Boolean mask of the rare SNPs
rare_snps_for_simulation = snpid[rare_index][1:2:41]    # every other one of the first 41

21-element Array{SubString{String},1}:
 "rs3020701"  
 "rs181646587"
 "rs182902214"
 "rs184527030"
 "rs10409990" 
 "rs185166611"
 "rs181637538"
 "rs186213888"
 "rs184010370"
 "rs11667161" 
 "rs188819713"
 "rs182378235"
 "rs146361744"
 "rs190575937"
 "rs149949827"
 "rs117671630"
 "rs149171388"
 "rs188520640"
 "rs142722885"
 "rs146938393"
 "rs184561383"

In [19]:
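# Columns of the 21 selected rare SNPs (same filter and stride as `rare_snps_for_simulation`)
rare_cols = findall(0.002 .< minor_allele_frequency .≤ 0.02)[1:2:41]
geno_rare21_converted = convert(DataFrame, convert(Matrix{Float64}, @view(snpdata[:, rare_cols])))
names!(geno_rare21_converted, Symbol.(rare_snps_for_simulation))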

## Generating Effect Sizes 

For demonstration purposes, we simulate effect sizes from the Chi-squared(df = 1) distribution: we use the minor allele frequency (maf) as x and evaluate f(x), where f is the pdf of the Chi-squared(df = 1) distribution, so that the rarest SNPs have the largest effect sizes. The effect sizes are rounded to two decimal places throughout this example. Each effect size is multiplied by a random +1 or -1, so that some effects increase and others decrease the simulated trait value.

In addition to the Chi-Squared distribution, we also demo how to simulate from the Exponential distribution, where we use the minor allele frequency (maf) as x and find f(x) where f is the pdf for the Exponential density. 
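
As a rough illustration of both mappings, the sketch below evaluates the closed-form Chi-squared(df = 1) density, $f(x) = e^{-x/2} / \sqrt{2\pi x}$, and an exponential curve at three made-up frequencies. The helper name `chisq1_pdf` and the toy frequencies are assumptions. The scaling constants (division by 5, and 6 and 200 in the exponential curve) are the ones used in the notebook's cells, and the random sign is left out so the ordering is easy to see.

```julia
# Closed-form Chi-squared(df = 1) density for x > 0 (hypothetical helper)
chisq1_pdf(x) = exp(-x / 2) / sqrt(2π * x)

maf_toy = [0.004, 0.01, 0.02]                  # made-up minor allele frequencies

chisq_effects = chisq1_pdf.(maf_toy) ./ 5      # density at the MAF, scaled down by 5
exp_effects   = 6 .* exp.(-200 .* maf_toy)     # exponential curve in the MAF

# Both mappings decrease with the MAF, so the rarest SNP gets the largest effect
chisq_decreasing = issorted(chisq_effects, rev = true)
exp_decreasing   = issorted(exp_effects, rev = true)
```

Both mappings are monotone decreasing on this range, so either choice gives the rarest variants the largest effects. They differ in how quickly the effect size falls off as the frequency grows.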

## Chi-squared(df = 1)

In [20]:
# Generating effect sizes from the Chi-squared(df = 1) density
using StatsFuns   # if missing, install with `import Pkg; Pkg.add("StatsFuns")`
# MAFs of the 21 selected rare SNPs (same filter and stride as `rare_snps_for_simulation`)
maf_rare = minor_allele_frequency[0.002 .< minor_allele_frequency .≤ 0.02][1:2:41]
n = length(maf_rare)
chisq_coeff = zeros(n)

for i in 1:n
    # random sign times the Chi-squared(df = 1) pdf at the MAF, scaled down by 5
    chisq_coeff[i] = sign(rand() - 0.5) * chisqpdf(1, maf_rare[i]) / 5.0
end

Take a look at the simulated coefficients on the left, next to the corresponding minor allele frequency. Notice how the rarer SNPs have the largest effect sizes in absolute value.

In [21]:
[chisq_coeff maf_rare]

In [22]:
simulated_effectsizes_chisq = round.(chisq_coeff, digits = 2)

In [23]:
simulated_effectsizes_exp = round.(6 * exp.(-200 * maf_rare), digits = 2)

## Function for Mean Model Expression

In some cases a large number of variants may be used for simulation. In this example we therefore create a function that takes a vector of coefficients and a vector of variants, and returns the mean model expression.

The function `FixedEffectTerms` builds and evaluates the expression needed for the simulation from the specified vectors of coefficients and SNP names. It returns `evaluated_fixed_expression`, which is used to compute the mean effect `μ` in our mixed effects model. We use this function in this example so that we do not have to write out all 21 coefficients and variant locus names.

In [24]:
function FixedEffectTerms(effectsizes::AbstractVecOrMat, snps::AbstractVecOrMat)
    # Build one " + coefficient .* genotype column" term per SNP
    fixed_terms = ""
    for i in 1:length(snps)
        expression = " + " * string(effectsizes[i]) * " .* geno_rare21_converted[:" * snps[i] * "]"
        fixed_terms = fixed_terms * expression
    end
    # Parse the assembled string and evaluate it to obtain the mean component
    fixed_expression = Meta.parse(fixed_terms)
    evaluated_fixed_expression = eval(fixed_expression)
    return evaluated_fixed_expression
end

FixedEffectTerms (generic function with 1 method)
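
The sum that this expression encodes, effect size times genotype column added over SNPs, is the same as a matrix-vector product. The sketch below shows the equivalence on a made-up genotype matrix. The names `G` and `β` and all values are assumptions for illustration, and this is an alternative view of the computation rather than what `FixedEffectTerms` does internally.

```julia
# Toy genotype matrix: 3 individuals (rows) × 2 SNPs (columns); values are assumptions
G = [0.0 1.0;
     2.0 0.0;
     1.0 1.0]
β = [0.5, -0.25]            # one effect size per SNP

# Sum of effect size × genotype column, written term by term ...
μ_terms = β[1] .* G[:, 1] .+ β[2] .* G[:, 2]
# ... and as a single matrix-vector product
μ_matrix = G * β
```

Both give `[-0.25, 1.0, 0.25]`. The product form avoids building and evaluating a string when the genotypes are already available as a numeric matrix.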

## Saving Simulation Results to Local Machine

Write the newly simulated trait to a comma-separated (csv) file for later use. Note that the user can set the separator to '\t' for a tab-separated file, or to another separator of choice.

Here we output the simulated trait values for each of the 212 individuals, labeled by their pedigree ID and person ID.

In addition, we output the genotypes for the variants used to simulate this trait. Note that we can impute missing genotypes by setting the argument:<br> `impute = true`.
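
A minimal sketch of the write step, using `writedlm` from Julia's DelimitedFiles standard library, is shown below. The three rows, the file name, and the use of a temporary directory are assumptions for illustration. The notebook's own output cell may use a different layout or function.

```julia
using DelimitedFiles

# Toy rows: pedigree ID, person ID, simulated trait value (illustrative only)
ped_id    = [1, 1, 2]
person_id = [16, 8228, 29]
trait1    = [41.7539, 39.2975, 40.351]

out = hcat(ped_id, person_id, trait1)            # 3 × 3 numeric matrix

# Write to a temporary directory; replace with a path on your own machine
path = joinpath(mktempdir(), "simulated_trait.csv")
writedlm(path, out, ',')                          # use '\t' for a tab-separated file

readback = readdlm(path, ',')                     # read the file back to check it
```

Reading the file back returns the same matrix, which is a quick way to confirm that the separator and the column order are what you expect.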