# Trait Simulation Demonstration

In this notebook we demonstrate how to simulate phenotypic traits. We use the Mendel Option 28e data with known parameter estimates to validate whether the simulation is sensible. In all the examples, we follow Mendel Option 28e with the simulation parameters for Trait1 and Trait2 in Ped28e.out as shown below.

## Mendel Option 28e Data: 
Mean effect:
$$
\mathbf{\mu} = \begin{bmatrix}
\mu_1 \\
\mu_2
\end{bmatrix}
= \begin{bmatrix}
40 + 3(sex) - 1.5(locus)\\
20 + 2(sex) - 1.5(locus)
\end{bmatrix}
$$

Covariance matrix of the two traits, simulated simultaneously through a linear mixed model (LMM):

$$
\Sigma  = V_a \otimes (2GRM) + V_e \otimes I_n
$$

Where we have the additive and environmental variances:

$$
V_a = 
\begin{bmatrix}
4 & 1\\
1 & 4
\end{bmatrix}
$$

$$
V_e = 
\begin{bmatrix}
2 & 0\\
0 & 2
\end{bmatrix}
$$

The kinship matrix is derived from the genetic relationship matrix (GRM) computed across the common SNPs with minor allele frequency of at least 0.05. $I_n$ is the $n$-dimensional identity matrix.
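The Kronecker structure of $\Sigma$ can be illustrated on a toy example. The sketch below is not part of the original analysis: the 3-person kinship matrix `Phi` is a made-up stand-in for the GRM (two full siblings and one unrelated person), while `V_a` and `V_e` are the values given above.

```julia
using LinearAlgebra

# Toy kinship matrix (illustrative only): persons 1 and 2 are full siblings
# (kinship 0.25), person 3 is unrelated to both.
Phi = [0.5  0.25 0.0;
       0.25 0.5  0.0;
       0.0  0.0  0.5]

V_a = [4.0 1.0; 1.0 4.0]   # additive genetic (co)variances from the text
V_e = [2.0 0.0; 0.0 2.0]   # environmental (co)variances from the text
n = size(Phi, 1)

# Covariance of the stacked vector [trait 1 for all people; trait 2 for all people]
Sigma = kron(V_a, 2 * Phi) + kron(V_e, Matrix{Float64}(I, n, n))

Sigma[1, 1]   # trait 1 variance for person 1: 4 * 1 + 2 = 6
Sigma[1, 2]   # trait 1, sibling pair: 4 * 0.5 = 2
Sigma[1, 4]   # person 1, trait 1 versus trait 2: 1 * 1 + 0 = 1
```

With this ordering, each 3 x 3 block of `Sigma` describes one pair of traits across all people.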

# The notebook is organized as follows: <br>

The user specifies arbitrary fixed effect sizes in examples 1 and 2. 

In the Generating Effect Sizes section of Example 3 we show how the user can generate effect sizes that depend on the minor allele frequencies through a function such as an exponential or chi-squared density. To help users who wish to include a large number of loci in the model, we created a function that automatically writes out the mean components. For reproducibility, we set a random seed for each simulation using `Random.seed!` (called `srand` before Julia 0.7). Users who want different data should comment out these commands or pass another value to `Random.seed!`. At the end of each example, we write the results of each simulation to a file on the user's own machine.

## Example 1: Generalized Linear Fixed Effects Model (no residual familial correlation)

### a) Single IID Normal Trait: User specified the SNPs to have fixed effects <br>
We simulate an iid Normal Trait with parameters, location = $μ_{1a}$ and scale = 2.
$$
Y_{1a} \sim N(\mu_{1a}, 2), \text{ where } \mu_{1a} = 40 + 3(sex) - 1.5(locus)
$$
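To make the model concrete, here is a minimal sketch that draws such a trait by hand using only the Julia standard library. The covariates are hypothetical (six made-up people, not the 28e data), and the sketch assumes that the 2 in $N(\mu_{1a}, 2)$ is a variance. It does not use the TraitSimulation API.

```julia
using Random

Random.seed!(1111)                            # fixed seed so the draw is reproducible

# Hypothetical covariates for six people (not the 28e data)
sex   = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]     # 1 = male, -1 = female
locus = [0.0, 1.0, 2.0, 0.0, 1.0, 2.0]        # minor allele counts

mu = 40 .+ 3 .* sex .- 1.5 .* locus           # mean for each person
sigma2 = 2.0                                  # assumption: 2 is the variance
y = mu .+ sqrt(sigma2) .* randn(length(mu))   # independent normal draws around mu
```

Because the draws are independent given `mu`, any similarity between relatives in `y` comes only through shared covariate values.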

### b) Single IID Non-Normal Trait:<br>
We simulate an iid Poisson trait with mean 5.
$$
Y_{1b} ∼ Poisson(5)
$$

### c) Multiple Independent Traits: User specified distributions
We simulate two independent traits simultaneously: a Normal trait as in Example 1a and a Poisson trait whose mean depends on the locus.
$$
Y_{1c_{1}} \sim N(\mu_{1c}, 2), \text{ where } \mu_{1c} = 40 + 3(sex) - 1.5(locus)\\
Y_{1c_{2}} \sim Poisson(\mu_{2c}), \text{ where } \mu_{2c} = 2 - 1.5(locus)
$$

## Example 2: Linear Mixed Model (with additive genetic variance component).
Note that the residual covariance among two relatives is the additive genetic variance times twice the kinship coefficient.  
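As a worked example with the Trait1 values used in this notebook (additive variance $\sigma_a^2 = 4$, environmental variance $\sigma_e^2 = 2$), and assuming the textbook kinship coefficients of 1/4 for full siblings and 1/2 for a non-inbred person with themselves:

$$
Cov(Y_i, Y_j) = 2\Phi_{ij}\,\sigma_a^2 = 2 \cdot \tfrac{1}{4} \cdot 4 = 2 \quad \text{(full siblings)}, \qquad
Var(Y_i) = 2\Phi_{ii}\,\sigma_a^2 + \sigma_e^2 = 2 \cdot \tfrac{1}{2} \cdot 4 + 2 = 6
$$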

### (a) Single Trait:
We simulate a Normal trait controlling for family structure, with location $\mu_{1}$ and scale $4* 2GRM + 2I$.
$$
Y_{2a} \sim N(\mu_{1}, 4* 2GRM + 2I)
$$


### (b) Multiple Correlated Traits: (Mendel Example 28e Simulation)
We simulate two correlated Normal traits controlling for family structure, with location $\mu$ and scale $\Sigma$.
$$
Y_{2b} \sim N(\mu, \Sigma), \text{ where } \Sigma = V_a \otimes (2GRM) + V_e \otimes I_n
$$

## Example 3: Rare Variant Linear Mixed Model with effect sizes as a function of the allele frequencies. 

The example also assumes an additive genetic variance component in the model, which includes 20 rare SNPs, defined here as SNPs with minor allele frequency greater than 0.002 but no greater than 0.02. In practice rare SNPs have smaller minor allele frequencies, but we are limited in this tutorial by the number of individuals in the data set. We use generated effect sizes to evaluate $\mu_{rare20}$. <br>

### (a) Single Trait: 
$$
Y_{3a} ∼ N(\mu_{rare20}, 4* 2GRM + 2I)
$$

In [None]:
using DataFrames, SnpArrays, StatsModels, Random, LinearAlgebra, DelimitedFiles, TraitSimulation
snpdata = SnpArray("traitsim28e.bed", 212)


In [None]:
famfile = readdlm("traitsim28e.fam", ',')

In [None]:
traits_original = DataFrame(Trait1 = famfile[:, 7], Trait2 = famfile[:, 8])

# Summary Statistics of the original Mendel 28e dataset:

Note we want to see similar values from our simulated traits!

In [None]:
describe(traits_original)

Recode the sex variable from M/F to 1/-1, as in the Mendel 28e data.

In [None]:
sex = map(x -> strip(x) == "F" ? -1.0 : 1.0, famfile[:, 5])

### Names of Variants:

We want to find the index of the causal SNP, rs10412915, in the SNP definition file and then subset that SNP from the genetic marker data above.
We first read the SNP names into a vector called `snpid`.

In [None]:
snpdef28_1 = readdlm("traitsim28e.bim", Any; header = false)
snpid = map(x -> strip(string(x)), snpdef28_1[:, 1])

# Example 1: User Specified Model

In this example we suppose that the user knows which causal SNP should have an effect on the trait. To save computing time and memory, we can look SNPs up by name in `snpid` and subset only the specified SNPs for the analysis.

We see that the causal snp, rs10412915, is the 236074th variant in the snp dataset.

In [None]:
ind_rs10412915 = findall(x -> x == "rs10412915", snpid)[1]

Let's create a design matrix for the alternative model that includes sex and locus rs10412915.

In [None]:
locus = convert(Vector{Float64}, @view(snpdata[:,ind_rs10412915]))
X = DataFrame(sex = sex, locus = locus)

### The Variance Covariance Matrix $\mathbf{\Sigma}$
Recall : $E(\mathbf{GRM}) = \Phi$ and $\mathbf{V} = 2\mathbf{V_a} \mathbf{\Phi} + \mathbf{V_e} \mathbf{I}$
<br>
We will use the same values of $\mathbf{GRM}$, $V_a$, and $V_e$ for the random effect example (2A), for the mixed effect example (2B) and for the rare variant example (3).

We use the SnpArrays.jl package to compute the Genetic Relationship Matrix (GRM).
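For intuition, the sketch below computes a kinship-scale GRM by hand from a small, made-up dosage matrix. It uses a common textbook form of the estimator (center each SNP by twice its allele frequency, scale by its binomial variance, average over SNPs, divide by 2). This is an assumption for illustration and may differ in details, such as missing-data handling, from what `grm` does.

```julia
using Statistics, LinearAlgebra

# Hypothetical dosage matrix: 4 people (rows) by 3 SNPs (columns), no missing values
G = [0.0 1.0 2.0;
     1.0 1.0 0.0;
     2.0 0.0 1.0;
     1.0 2.0 1.0]

p = vec(mean(G, dims = 1)) ./ 2                       # frequency of the counted allele
W = (G .- 2 .* p') ./ sqrt.(2 .* p' .* (1 .- p'))     # center and scale each SNP
Phi_hat = W * W' ./ (2 * size(G, 2))                  # kinship-scale estimate

Phi_hat ≈ Phi_hat'                                    # the estimate is symmetric
```

On this scale, `2 * Phi_hat` plays the role of $2GRM$ in the covariance formula.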

In [None]:
GRM = grm(snpdata, method= :GRM)
V_A = [4 1; 1 4]
V_E = [2.0 0.0; 0.0 2.0]
I_n = Matrix{Float64}(I, size(GRM));

The corresponding variance-covariance matrix, $\mathbf{Σ}$, is generated here. To simulate a trait with different variance components, change $V_a$ and $V_e$ in $\Sigma  = V_a \otimes (2GRM) + V_e \otimes I$. We create the variance component object `variancecomp` below to simulate our traits.

In [None]:
twoGRM = 2 * GRM  # the formula above uses 2GRM, so scale the kinship estimate first
variancecomp = @vc V_A ⊗ twoGRM + V_E ⊗ I_n;
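To show what a draw from this covariance structure involves, here is a minimal sketch that samples two correlated traits for a toy family using a Cholesky factor. It is one standard way to sample a multivariate normal and is not claimed to be how `simulate` works internally. The kinship matrix `Phi` is hypothetical.

```julia
using LinearAlgebra, Random

Random.seed!(1111)

Phi = [0.5 0.25; 0.25 0.5]                   # toy kinship for two full siblings
V_a = [4.0 1.0; 1.0 4.0]
V_e = [2.0 0.0; 0.0 2.0]
n, d = size(Phi, 1), size(V_a, 1)            # number of people, number of traits

Sigma = kron(V_a, 2 * Phi) + kron(V_e, Matrix{Float64}(I, n, n))
mu = repeat([40.0, 20.0], inner = n)         # intercept-only means, stacked by trait

L = cholesky(Symmetric(Sigma)).L             # Sigma = L * L'
y = mu + L * randn(n * d)                    # one draw of vec(Y) ~ N(mu, Sigma)
Y = reshape(y, n, d)                         # column j holds trait j
```

Multiplying standard normal draws by `L` gives them covariance `L * L'`, which equals `Sigma`.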

# A) Generalized Linear Model:


# Y ∼ N( μ, 1.0)

# μ = ? 

This example simulates a case where sex and the causal SNP have fixed effects on the trait. Any apparent genetic correlation between relatives for the trait is due to the effect of the SNP, so once this effect is modelled there should be no residual correlation among relatives. Note that by default, individuals with missing genotype values will have missing phenotype values, unless the user specifies the argument `impute = true` in the convert function above.
Be sure to change `Random.seed!(1111)` (`srand(1111)` before Julia 0.7) to another value, or comment it out, if you want to generate a new data set.
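The effect of the seed can be checked directly. This small sketch uses only the `Random` standard library; the seed values are arbitrary.

```julia
using Random

Random.seed!(1111)
first_draw = randn(3)

Random.seed!(1111)           # same seed, same random stream
second_draw = randn(3)

Random.seed!(2222)           # a different seed gives different data
third_draw = randn(3)

first_draw == second_draw    # true
first_draw == third_draw     # false
```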


# b) $Y_{1b} \sim Poisson(5)$


In [None]:
GLM_trait_model_Poisson5 = GLMTrait(5, X, PoissonResponse(), IdentityLink())
Simulated_GLM_trait = simulate(GLM_trait_model_Poisson5)

In [None]:
describe(Simulated_GLM_trait[:,1])

# Example 1c) Simulate Multiple GLM Traits from Different Distributions

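Here we simulate two independent traits simultaneously, one from a Normal distribution and the other from a Poisson distribution.

 trait 1) Normal response with identity link and mean formula `40 + 3(sex) - 1.5(locus)`
 trait 2) Poisson response with log link and mean formula `2 + 2(sex) - 1.5(locus)`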

In [None]:
#for multiple glm traits from different distributions
dist_type_vector = [NormalResponse(4), PoissonResponse()]
link_type_vector = [IdentityLink(), LogLink()]

Ex1a_formulas = ["40 + 3(sex) - 1.5(locus)", "2 + 2(sex) - 1.5(locus)"]

Multiple_GLM_traits_model_NOTIID = Multiple_GLMTraits(Ex1a_formulas, X, dist_type_vector, link_type_vector)
Simulated_GLM_trait_NOTIID = simulate(Multiple_GLM_traits_model_NOTIID)

# Simulating Traits under Null Model

We simulate the two traits under the null model, with intercepts only. 

In [None]:
null_model = LMMTrait(["40", "20"], X, variancecomp)
trait_null = simulate(null_model)

# These are the formulas for the alternative model simulation

In [None]:
alternative_formulas = ["40 + 3(sex) - 1.5(locus)", "20 + 2(sex) - 1.5(locus)"]

# Trait Simulated from Alternative Model 

In [None]:
alternative_model = LMMTrait(alternative_formulas, X, variancecomp)
trait_alternative = simulate(alternative_model)

In [None]:
describe(trait_alternative)

## 20 Rare SNPs for Simulation

In this example we first subset only the rare SNPs with minor allele frequency greater than 0.002 and at most 0.02, then we simulate traits with 20 of the rare SNPs as fixed effects. Here are the 20 SNPs that will be used for trait simulation in this example.

For this demo, the indexing `snpid[rare_index][1:2:40]` selects every other rare SNP among the first 40, which gives our list of 20 rare SNPs. Change the range and step to simulate with more or fewer SNPs, or with SNPs from different regions of the genome.
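The filtering logic can be sketched on a small dosage matrix whose minor allele frequencies are known exactly. The matrix below is made up for illustration and is a plain `Matrix{Float64}` rather than a `SnpArray`.

```julia
# Hypothetical dosages: 250 people by 4 SNPs, built so the MAFs are known exactly
n = 250
G = zeros(n, 4)
G[1, 1] = 1.0              # 1 minor allele    -> MAF = 1/500   = 0.002
G[1:2, 2] .= 1.0           # 2 minor alleles   -> MAF = 2/500   = 0.004
G[1:10, 3] .= 1.0          # 10 minor alleles  -> MAF = 10/500  = 0.02
G[1:50, 4] .= 2.0          # 100 minor alleles -> MAF = 100/500 = 0.2

p = vec(sum(G, dims = 1)) ./ (2n)       # allele frequency per SNP
maf_toy = min.(p, 1 .- p)               # minor allele frequency
rare = 0.002 .< maf_toy .≤ 0.02         # strict lower bound, inclusive upper bound

every_other = findall(rare)[1:2:end]    # same stride idea as 1:2:40 above
```

Only the second and third SNPs pass the filter, because the lower bound is strict and the upper bound is inclusive.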

In [None]:
# filter out rare SNPS, get a subset of uncommon SNPs with 0.002 < MAF ≤ 0.02
minor_allele_frequency = maf(snpdata)
rare_index = (0.002 .< minor_allele_frequency .≤ 0.02)
data_rare = snpdata[:, rare_index]

In [None]:
maf_20_rare_snps = minor_allele_frequency[rare_index][1:2:40]

In [None]:
# Names of the 20 rare SNPs: every other rare SNP among the first 40
rare_snps_for_simulation = snpid[rare_index][1:2:40]

In [None]:
geno_rare20_converted = convert(DataFrame, convert(Matrix{Float64}, data_rare[:, 1:2:40]))
names!(geno_rare20_converted, Symbol.(rare_snps_for_simulation))

## Generating Effect Sizes 

For demonstration purposes, we generate effect sizes from the Chi-squared(df = 1) distribution: we use the minor allele frequency (maf) as x and evaluate f(x), where f is the Chi-squared(df = 1) density, so that the rarest SNPs have the biggest effect sizes. The effect sizes are rounded to three decimal places throughout this example. Each effect is multiplied by a random sign (+1 or -1), so that there are effects that both increase and decrease the simulated trait value.

In addition to the Chi-squared distribution, we also demonstrate how to generate effect sizes from the Exponential distribution, where we use the minor allele frequency (maf) as x and evaluate f(x), where f is an Exponential-shaped density.
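To see why both choices give larger effects to rarer variants, the sketch below evaluates the two curves on a few hypothetical minor allele frequencies. It writes out the Chi-squared(df = 1) density in closed form so that it needs no packages. The scaling constants (division by 5, and $6e^{-200x}$) follow the code cells in this section, and the random sign is omitted.

```julia
# Density of the chi-squared distribution with 1 degree of freedom, for x > 0
chisq1_pdf(x) = exp(-x / 2) / sqrt(2 * pi * x)

maf_toy = [0.002, 0.005, 0.01, 0.02]            # hypothetical minor allele frequencies

chisq_effects = chisq1_pdf.(maf_toy) ./ 5.0     # chi-squared-based magnitudes
exp_effects   = 6 .* exp.(-200 .* maf_toy)      # exponential-based magnitudes

# Both curves decrease as the variant becomes more common
issorted(chisq_effects, rev = true)             # true
issorted(exp_effects, rev = true)               # true
```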

## Chisquared(df = 1)

In [None]:
# Generating Effect Sizes from Chisquared(df = 1) density
using StatsFuns
n = length(maf_20_rare_snps)
chisq_coeff = zeros(n)

for i in 1:n
    chisq_coeff[i] = sign(rand()- .5) * chisqpdf(1, maf_20_rare_snps[i])/5.0
end

Take a look at the simulated coefficients on the left, next to the corresponding minor allele frequencies. Notice how the rarer SNPs have the largest effect sizes.

In [None]:
Ex3_rare = round.([chisq_coeff maf_20_rare_snps], digits = 3)
Ex3_rare = DataFrame(Chisq_Coefficient = Ex3_rare[:, 1] , MAF_rare = Ex3_rare[:, 2] )

In [None]:
simulated_effectsizes_chisq = Ex3_rare[:, 1]

In [None]:
# Effect sizes from an exponential-shaped curve, 6 * exp(-200 * MAF), rounded to 3 decimals
simulated_effectsizes_exp = round.(6*exp.(-200*maf_20_rare_snps), digits = 3)

## Function for Mean Model Expression

In some cases a large number of variants may be used for simulation. Thus, in this example we create a function where the user inputs a vector of coefficients and a vector of variants for simulation, then the function outputs the mean model expression. 

The function `FixedEffectTerms` builds the mean-model expression for the simulation from the specified vectors of coefficients and SNP names. It returns the expression as a string, which is then used to evaluate the mean effect, `μ`, in our mixed effects model. We use this function here instead of writing out all 20 coefficients and variant names by hand.
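The idea can be sketched on a toy case: a formula string of `coefficient(snp)` terms describes the same mean as a matrix-vector product of genotypes and effect sizes. The SNP names, effect sizes and genotypes below are hypothetical, and the string layout is only assumed to match what the mean formula parser accepts.

```julia
# Hypothetical inputs: two people, three SNPs
snp_names = ["rsA", "rsB", "rsC"]
effects   = [1.78, -0.56, 0.4]
G = [0.0 1.0 2.0;
     1.0 0.0 0.0]

# Text form: one "coefficient(snp)" term per variant, joined with " + "
formula = join([string(effects[i], "(", snp_names[i], ")") for i in 1:length(effects)], " + ")

# Numeric form of the same mean: genotype matrix times effect sizes
mu = G * effects
```

Here `formula` is `"1.78(rsA) + -0.56(rsB) + 0.4(rsC)"`, and each entry of `mu` is that expression evaluated at one person's genotypes.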

In [None]:
rare_snps_for_simulation

In [None]:
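function FixedEffectTerms(effectsizes::AbstractVecOrMat, snps::AbstractVecOrMat)
    # Use the function arguments rather than global variables, and include
    # every variant (the earlier loop stopped one short of the last SNP).
    length(effectsizes) == length(snps) ||
        throw(ArgumentError("effectsizes and snps must have the same length"))
    # One "coefficient(snp)" term per variant, joined with " + "
    terms = [string(effectsizes[i]) * "(" * string(snps[i]) * ")" for i in 1:length(effectsizes)]
    return join(terms, " + ")
end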


In [None]:
formula_20_rare_snps = FixedEffectTerms(simulated_effectsizes_chisq, rare_snps_for_simulation)

In [None]:
fixed_effect_20_rare_snps = round.(mean_formula(formula_20_rare_snps, geno_rare20_converted), digits = 3)

In [None]:
geno_rare20_converted

(a) Mixed effects model, single trait:
$$
Y_{3a} \sim N(\mu_{rare20}, 4* 2GRM + 2I)
$$


In [None]:
A = 4*(2*GRM) + 2*(I_n)

In [None]:
rare_20_snp_model = LMMTrait([formula_20_rare_snps], geno_rare20_converted, A)
trait_rare_20_snps = simulate(rare_20_snp_model)

In [None]:
describe(trait_rare_20_snps)

## Saving Simulation Results to Local Machine

Write the newly simulated trait to a comma-separated (csv) file for later use. The user can set the separator to '\t' for tab-separated output, or to another separator of choice.

Here we output the simulated trait values for each of the 212 individuals, labeled by their pedigree ID and person ID.

In addition, we output the genotypes for the variants used to simulate this trait. Note that we can impute missing genotypes by setting the argument:<br> `impute = true`.
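A minimal sketch of the writing step, using `writedlm` from the `DelimitedFiles` standard library that is loaded at the top of the notebook. The IDs, trait values, column names and file name are hypothetical. In the notebook they would come from the `.fam` file and the simulated trait.

```julia
using DelimitedFiles

# Hypothetical IDs and trait values (placeholders for the real data)
ped_id    = ["fam1", "fam1", "fam2"]
person_id = ["p1", "p2", "p1"]
trait     = [41.2, 38.7, 40.1]

# Header row on top of the ID and trait columns
out = [["PedID" "PersonID" "SimTrait"]; [ped_id person_id trait]]

path = joinpath(mktempdir(), "simulated_trait.csv")   # temporary location for this sketch
writedlm(path, out, ',')                              # use '\t' for tab-separated output

# Read the file back to confirm what was written
back = readdlm(path, ',', Any; header = true)
```

`readdlm` with `header = true` returns a tuple of the data matrix and the header row, so `back[1]` holds the three data rows.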