# Trait Simulation Demonstration

Authors: Sarah Ji, Janet Sinsheimer, Ken Lange

In this notebook we demonstrate how to simulate phenotypic traits. We use the Classic Mendel Option 28e data, which has known parameter estimates, to validate the simulation. In Example 2b, we follow Mendel Option 28e and use the simulation parameters for Trait1 and Trait2 in Ped28e.out, shown below.

The user specifies arbitrary fixed effect sizes in examples 1 and 2. 

In the Generating Effect Sizes section of Example 3, we show how the user can generate effect sizes that depend on the minor allele frequencies through a function such as an exponential or chi-squared density. To help users who wish to include a large number of loci in the model, we created a function that automatically writes out the mean components. At the end of Example 3, we demonstrate how to write the results of each simulation to a file on the user's own machine.

## Mendel Option 28e Data: 
Mean effect:
$$
\mathbf{\mu} = \begin{bmatrix}
\mu_1 \\
\mu_2
\end{bmatrix}
= \begin{bmatrix}
40 + 3(sex) - 1.5(locus)\\
20 + 2(sex) - 1.5(locus)
\end{bmatrix}
$$

Covariance matrix of both traits, simulated simultaneously through a linear mixed model (LMM):

$$
\Sigma  = V_a \otimes (2GRM) + V_e \otimes I_n
$$

where the additive and environmental variances are:

$$
V_a = 
\begin{bmatrix}
4 & 1\\
1 & 4
\end{bmatrix}
$$

$$
V_e = 
\begin{bmatrix}
2 & 0\\
0 & 2
\end{bmatrix}
$$

The kinship matrix is derived from the genetic relationship matrix (GRM) across the common SNPs with minor allele frequency at least 0.05. $I_{n}$ is the n dimensional identity matrix.
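To make the Kronecker structure concrete, here is a minimal sketch that assembles $\Sigma$ for a toy family of three people. The kinship matrix `Φ_toy` (two unrelated parents and their child) is an illustrative assumption that stands in for the GRM computed later from the SNP data; the `_toy` names are not part of the notebook.

```julia
using LinearAlgebra

# Variance components from Mendel Option 28e (as given above).
V_a = [4.0 1.0; 1.0 4.0]
V_e = [2.0 0.0; 0.0 2.0]

# Hypothetical kinship matrix for two unrelated parents and one child
# (illustrative only; the notebook estimates this from SNP data).
Φ_toy = [0.5  0.0  0.25;
         0.0  0.5  0.25;
         0.25 0.25 0.5]
n = size(Φ_toy, 1)
I_toy = Matrix{Float64}(I, n, n)

# Σ = V_a ⊗ (2Φ) + V_e ⊗ I_n: a 2n × 2n covariance for both traits stacked.
Σ_toy = kron(V_a, 2 * Φ_toy) + kron(V_e, I_toy)

size(Σ_toy)       # (6, 6)
isposdef(Σ_toy)   # true
```

The upper-left 3 × 3 block is the covariance of the first trait across the three people, and the off-diagonal blocks carry the cross-trait covariance induced by the off-diagonal entry of $V_a$.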

# Reproducibility

For reproducibility, we set a random seed for each simulation with `Random.seed!(1234)` from the `Random` standard library. To generate different data, comment out these commands or pass another value to `Random.seed!()`.

In [1]:
using Random
Random.seed!(1234);
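As a quick illustration of what the seed does, the sketch below reseeds the generator and confirms that the same draws come back. The variable names are hypothetical and used only for this check.

```julia
using Random

Random.seed!(1234)
first_draw = rand(3)

Random.seed!(1234)          # reseeding restarts the same random stream
second_draw = rand(3)

first_draw == second_draw   # true
```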

Machine information:

In [2]:
versioninfo()

Julia Version 1.0.3
Commit 099e826241 (2018-12-18 01:34 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)


# The notebook is organized as follows:
## Example 1: Generalized Linear Fixed Effects Model (no residual familial correlation)

### a) Single IID Non-Normal Trait:
We simulate an iid Poisson trait with mean 5.
$$
Y_{1a} ∼ Poisson(5)
$$

### b) Multiple Independent Traits: User specified distributions
We simulate two independent traits simultaneously: one Normal and one Poisson.
$$
Y_{1b_{1}} ∼ N(\mu_{1b}, 2), \mu_{1b} = 40 + 3(sex) - 1.5(locus)\\
Y_{1b_{2}} ∼ Poisson(\mu_{2b}), \mu_{2b} = 2 + 2(sex) - 1.5(locus)
$$

## Example 2: Linear Mixed Model (with additive genetic variance component).
Note that the residual covariance among two relatives is the additive genetic variance times twice the kinship coefficient.  

### (a) Single Trait:
We simulate a Normal trait controlling for family structure, with location $\mu_{1}$ and scale $V_{{a}_{1,1}}* 2GRM + V_{{e}_{1,1}}I$.
$$
Y_{2a} ∼ N(\mu_{1}, 4* 2GRM + 2I)
$$


### (b) Multiple Correlated Traits: (Mendel Example 28e Simulation)
We simulate two correlated Normal Traits controlling for family structure, location = $\mu$ and scale = $\Sigma$. 
$$
Y_{2b} ∼ N(\mu, \Sigma) , \Sigma  = V_{a} \otimes (2GRM) + V_{e} \otimes I_{n}
$$

## Example 3: Rare Variant Linear Mixed Model with effect sizes as a function of the allele frequencies. 

The example also assumes an additive genetic variance component in the model, which includes 20 rare SNPs, defined as SNPs with minor allele frequencies greater than 0.002 and no greater than 0.02. In practice rare SNPs have smaller minor allele frequencies, but we are limited in this tutorial by the number of individuals in the data set. We use generated effect sizes to evaluate $\mu_{rare20}$.

### (a) Single Trait: 
$$
Y_{3a} ∼ N(\mu_{rare20}, 4* 2GRM + 2I)
$$

# Reading the Mendel 28e data using SnpArrays

First use `SnpArrays.jl` to read in the SNP data.

In [3]:
using DataFrames, SnpArrays, StatsModels, Random, LinearAlgebra, DelimitedFiles, StatsBase, TraitSimulation
snpdata = SnpArray("traitsim28e.bed", 212)

SystemError: SystemError: opening file traitsim28e.bed: No such file or directory

Store the FamID and PersonID of Individuals in Mendel 28e data

In [4]:
famfile = readdlm("traitsim28e.fam", ',')
Fam_Person_id = DataFrame(FamID = famfile[:, 1], PID = famfile[:, 2])

ArgumentError: ArgumentError: Cannot open 'traitsim28e.fam': not a file

Note: We store the two traits simulated by Mendel Option 28e in `traits_original` so that in Example 2b we can compare our simulated traits against them.

In [5]:
traits_original = DataFrame(Trait1 = famfile[:, 7], Trait2 = famfile[:, 8])

UndefVarError: UndefVarError: famfile not defined

Transform the sex variable from M/F to 1/-1, as in the Mendel 28e data.

In [6]:
sex = map(x -> strip(x) == "F" ? -1.0 : 1.0, famfile[:, 5]) # note julia's ternary operator '?'

UndefVarError: UndefVarError: famfile not defined

### Names of Variants:

We want to find the index of the causal SNP, rs10412915, in the SNP definition file and then subset that SNP from the genetic marker data above.
We store the SNP names in a vector called `snpid`.

In [40]:
snpdef28_1 = readdlm("traitsim28e.bim", Any; header = false)
snpid = map(x -> strip(string(x)), snpdef28_1[:, 1]) # strip surrounding whitespace from each SNP name

ArgumentError: ArgumentError: Cannot open 'traitsim28e.bim': not a file

We see that the causal SNP, rs10412915, is the 236074th variant in the SNP dataset.

In [8]:
ind_rs10412915 = findall(x -> x == "rs10412915", snpid)[1]

UndefVarError: UndefVarError: snpid not defined

Let's create a design matrix for the alternative model that includes sex and locus rs10412915.

In [9]:
locus = convert(Vector{Float64}, @view(snpdata[:, ind_rs10412915]))
X = DataFrame(sex = sex, locus = locus)

UndefVarError: UndefVarError: snpdata not defined

## The Variance Covariance Matrix
### Single Trait 
Recall: $E(\mathbf{GRM}) = \Phi$ and $\mathbf{V} = 2\mathbf{V_a} \mathbf{\Phi} + \mathbf{V_e} \mathbf{I}$
<br>
We will use the same values of $\mathbf{GRM}$, $V_a$, and $V_e$ for the mixed effect example (2) and for the rare variant example (3).

We use the SnpArrays.jl package to compute the Genetic Relationship Matrix (GRM).

In [10]:
GRM = grm(snpdata, method = :GRM)
V_A = [4 1; 1 4]
V_E = [2.0 0.0; 0.0 2.0]
I_n = Matrix{Float64}(I, size(GRM));

UndefVarError: UndefVarError: snpdata not defined

### Multiple Correlated Traits

The variance-covariance matrix specified by Mendel Option 28e, $\mathbf{Σ}$, is generated here. To simulate a trait with different variance components, change the terms of $\Sigma  = V_a \otimes (2GRM) + V_e \otimes I$. We create the variance component object `variance_formula` below to simulate our traits in Example 2b.

In [11]:
# @vc is a macro that creates a 'VarianceComponent' Type for simulation
variance_formula = @vc V_A ⊗ GRM + V_E ⊗ I_n;

UndefVarError: UndefVarError: V_A not defined

# Example 1 Generalized Linear Model:

This example simulates a case where the SNP has a fixed effect on the trait. Any apparent genetic correlation between relatives for the trait is due to the effect of the SNP, so once the effect of the SNP is modelled there should be no residual correlation among relatives. Note that by default, individuals with missing genotype values will have missing phenotype values, unless the user specifies the argument `impute = true` in the `convert` function above.
Be sure to change `Random.seed!(1234)` to another value (or comment it out) if you want to generate a new data set.


### a) Single IID Non-Normal Trait:
We simulate an iid Poisson trait with mean 5. We use the identity link (`IdentityLink`) to simulate this Poisson trait.
$$
Y_{1a} ∼ Poisson(5)
$$



In [12]:
GLM_trait_model_Poisson5 = GLMTrait(5, X, PoissonResponse(), IdentityLink())
Simulated_GLM_trait = simulate(GLM_trait_model_Poisson5)

UndefVarError: UndefVarError: X not defined

Descriptive Statistics of Poisson(5) Trait

In [13]:
describe(Simulated_GLM_trait[:, 1])

UndefVarError: UndefVarError: Simulated_GLM_trait not defined

# Example 1b) Multiple Independent Traits: User specified distributions

Here we simulate two independent traits simultaneously, one from a Normal distribution and the other from a Poisson distribution. Unlike Example 1a, we use the log link (`LogLink`) to simulate the Poisson trait this time.

$$
Y_{1b_{1}} ∼ N(\mu_{1b}, 2), \text{ where } \mu_{1b} = 40 + 3(sex) - 1.5(locus)\\
Y_{1b_{2}} ∼ Poisson(\mu_{2b}), \text{ where } \mu_{2b} = 2 + 2(sex) - 1.5(locus)
$$
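A plausible reason for switching to the log link here: under an identity link the linear predictor itself has to be a valid (positive) Poisson mean, which this formula does not guarantee. The sketch below tabulates the linear predictor $\eta = 2 + 2(sex) - 1.5(locus)$ and the corresponding log-link mean $e^{\eta}$. It assumes the usual GLM convention that a log link means $\log(\text{mean}) = \eta$ and an additive 0/1/2 genotype coding; the helper names are hypothetical.

```julia
# Linear predictor for the Poisson trait in Example 1b.
η(sex, locus) = 2 + 2 * sex - 1.5 * locus

# Under a log link, log(mean) = η, so the Poisson mean is exp(η).
log_link_mean(sex, locus) = exp(η(sex, locus))

# sex is coded 1 / -1; locus is assumed to be coded 0, 1, or 2.
for sex in (-1.0, 1.0), locus in (0.0, 1.0, 2.0)
    println("sex = ", sex, ", locus = ", locus,
            ", η = ", η(sex, locus),
            ", exp(η) = ", round(log_link_mean(sex, locus), digits = 3))
end
```

For `sex = -1` and `locus = 2` the linear predictor is negative, which could not serve as a Poisson mean directly, while its exponential is always positive.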

In [14]:
# response distribution and link function for each of the two GLM traits
dist_type_vector = [NormalResponse(4), PoissonResponse()]
link_type_vector = [IdentityLink(), LogLink()]

mean_formulas = ["40 + 3(sex) - 1.5(locus)", "2 + 2(sex) - 1.5(locus)"]

Multiple_GLM_traits_model_NOTIID = Multiple_GLMTraits(mean_formulas, X, dist_type_vector, link_type_vector)
Simulated_GLM_trait_NOTIID = simulate(Multiple_GLM_traits_model_NOTIID)

UndefVarError: UndefVarError: X not defined

In [15]:
describe(Simulated_GLM_trait_NOTIID, stats = [:mean, :std, :min, :q25, :median, :q75, :max, :eltype])

UndefVarError: UndefVarError: Simulated_GLM_trait_NOTIID not defined

# Example 2: Linear Mixed Model (with additive genetic variance component).
Note that the residual covariance among two relatives is the additive genetic variance times twice the kinship coefficient.  


## (a) Single Trait:
We simulate a Normal trait controlling for family structure, with location $μ_{1}$ and scale $4* 2GRM + 2I$.
$$
Y_{2a} ∼ N(μ_{1}, 4* 2GRM + 2I)
$$
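For intuition, one standard way to draw from $N(\mu, V)$ is through a Cholesky factor of $V$. The sketch below does this on a toy three-person example; it is not necessarily how `simulate` is implemented, and the kinship matrix, covariate values, and variable names are illustrative assumptions.

```julia
using LinearAlgebra, Random

Random.seed!(1234)

# Hypothetical 3-person kinship matrix standing in for the GRM.
Φ_toy = [0.5  0.0  0.25;
         0.0  0.5  0.25;
         0.25 0.25 0.5]
V = 4 * (2 * Φ_toy) + 2 * Matrix{Float64}(I, 3, 3)   # 4 * 2GRM + 2I

# Toy covariates: sex coded 1 / -1, locus coded as a minor-allele count.
sex   = [1.0, -1.0, 1.0]
locus = [0.0, 1.0, 1.0]
μ = 40 .+ 3 .* sex .- 1.5 .* locus

# If V = L * L' and z ~ N(0, I), then μ + L * z ~ N(μ, V).
L = cholesky(Symmetric(V)).L
y = μ + L * randn(3)
```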

In [16]:
mean_formula = ["40 + 3(sex) - 1.5(locus)"]

1-element Array{String,1}:
 "40 + 3(sex) - 1.5(locus)"

In [17]:
Ex2a_model = LMMTrait(mean_formula, X, 4*(2*GRM) + 2*(I_n))
trait_2a = simulate(Ex2a_model)

UndefVarError: UndefVarError: GRM not defined

In [18]:
describe(trait_2a, stats = [:mean, :std, :min, :q25, :median, :q75, :max, :eltype])

UndefVarError: UndefVarError: trait_2a not defined

## Example 2b) Simulating Two Correlated Traits with Mendel Option 28e parameters


### (b) Multiple Correlated Traits: (Mendel Example 28e Simulation)
We simulate two correlated Normal traits controlling for family structure, with location $μ$ and scale $\Sigma$.
$$
Y_{2b} ∼ N(μ, \Sigma), \text{ where } \Sigma  = V_a \otimes (2GRM) + V_e \otimes I_n
$$


These are the formulas for the fixed effects, as specified by Mendel Option 28e.

In [19]:
mean_formulas = ["40 + 3(sex) - 1.5(locus)", "20 + 2(sex) - 1.5(locus)"]

2-element Array{String,1}:
 "40 + 3(sex) - 1.5(locus)"
 "20 + 2(sex) - 1.5(locus)"

In [20]:
Ex2b_model = LMMTrait(mean_formulas, X, variance_formula)
trait_2b = simulate(Ex2b_model)

UndefVarError: UndefVarError: X not defined

### Summary Statistics of Our Simulated Traits

In [21]:
describe(trait_2b, stats = [:mean, :std, :min, :max, :eltype])

UndefVarError: UndefVarError: trait_2b not defined

### Summary Statistics of the original Mendel 28e dataset Traits:

We expect our simulated traits to show values similar to these.

In [22]:
describe(traits_original, stats = [:mean, :std, :min, :max, :eltype])

UndefVarError: UndefVarError: traits_original not defined

# Example 3: 20 Rare SNPs for Simulation


## Example 3: Rare Variant Linear Mixed Model with effect sizes as a function of the allele frequencies. 

In this example we first subset only the rare SNPs, those with minor allele frequency greater than 0.002 and no greater than 0.02, then we simulate traits using 20 of the rare SNPs as fixed effects. Here are the 20 SNPs that will be used for trait simulation in this example.

For this demo, the indexing `snpid[rare_index][1:2:40]` selects every other rare SNP among the first 40 rare SNPs, giving our list of 20 rare SNPs. Change the range and step to simulate with more or fewer SNPs, or with SNPs from different regions of the genome.
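The following sketch shows the same mask-then-stride indexing on made-up allele frequencies, so the selection can be checked without the SNP files. The `toy_` arrays are hypothetical values, not data from Mendel Option 28e.

```julia
# Hypothetical minor allele frequencies: 50 rare SNPs plus three others.
toy_maf = vcat([0.25, 0.0011],
               [0.0021 + 0.0003 * k for k in 0:49],
               [0.4])

# Chained elementwise comparison gives a Boolean mask, as in the notebook.
toy_rare_index = (0.002 .< toy_maf .≤ 0.02)

# Every other SNP among the first 40 rare ones: positions 1, 3, 5, ..., 39.
toy_subset = toy_maf[toy_rare_index][1:2:40]

length(toy_subset)   # 20
```

Changing the step (for example `1:4:40`) or the upper bound changes how many SNPs are kept.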

In practice rare SNPs have smaller minor allele frequencies, but we are limited in this tutorial by the number of individuals in the data set. We use generated effect sizes to evaluate $\mu_{rare20}$.

### (a) Single Trait: 
$$
Y_{3a} ∼ N(\mu_{rare20}, 4* 2GRM + 2I)
$$

In [23]:
# select the rare SNPs: the subset with 0.002 < MAF ≤ 0.02
minor_allele_frequency = maf(snpdata)
rare_index = (0.002 .< minor_allele_frequency .≤ 0.02)
data_rare = snpdata[:, rare_index]

UndefVarError: UndefVarError: snpdata not defined

In [24]:
maf_20_rare_snps = minor_allele_frequency[rare_index][1:2:40]

UndefVarError: UndefVarError: minor_allele_frequency not defined

In [25]:
# names of the 20 rare SNPs used for simulation (every other one of the first 40 rare SNPs)
rare_snps_for_simulation = snpid[rare_index][1:2:40]

UndefVarError: UndefVarError: snpid not defined

In [26]:
geno_rare20_converted = convert(DataFrame, convert(Matrix{Float64}, @view(data_rare[:, 1:2:40])))
names!(geno_rare20_converted, Symbol.(rare_snps_for_simulation))

UndefVarError: UndefVarError: data_rare not defined

## Generating Effect Sizes Based on MAF

For demonstration purposes, we derive effect sizes from the Chi-squared(df = 1) distribution: we use the minor allele frequency (maf) as x and evaluate f(x), where f is the Chi-squared(df = 1) density, so that the rarest SNPs have the biggest effect sizes. The effect sizes are rounded to three decimal places throughout this example. Each effect size is multiplied by a random +1 or -1, so that some effects increase and others decrease the simulated trait value.

In addition to the Chi-Squared distribution, we also demo how to simulate from the Exponential distribution, where we use the minor allele frequency (maf) as x and find f(x) where f is the pdf for the Exponential density. 
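As a dependency-free sketch of the idea, the block below writes out the Chi-squared(df = 1) density by hand, $f(x) = e^{-x/2} / \sqrt{2\pi x}$, and evaluates it and a scaled exponential curve at a few made-up allele frequencies. The function `chisq1_pdf` and the `toy_` values are hypothetical; the notebook itself calls `chisqpdf` from `StatsFuns`.

```julia
# Density of the Chi-squared distribution with 1 degree of freedom, for x > 0.
# Written by hand to avoid dependencies; the notebook uses StatsFuns.chisqpdf.
chisq1_pdf(x) = exp(-x / 2) / sqrt(2π * x)

toy_maf = [0.002, 0.005, 0.01, 0.02]          # hypothetical allele frequencies

# Same scalings as the notebook cells: f(maf) / 5 and 6 * exp(-200 * maf).
toy_effects_chisq = chisq1_pdf.(toy_maf) ./ 5.0
toy_effects_exp   = 6 .* exp.(-200 .* toy_maf)

# Both curves decrease in x, so smaller MAF gives a larger magnitude.
issorted(toy_effects_chisq, rev = true)       # true
issorted(toy_effects_exp, rev = true)         # true
```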

## Chisquared(df = 1)

In [27]:
# Generating Effect Sizes from Chisquared(df = 1) density
using StatsFuns
n = length(maf_20_rare_snps)
chisq_coeff = zeros(n)

for i in 1:n
    chisq_coeff[i] = sign(rand() - .5) * chisqpdf(1, maf_20_rare_snps[i])/5.0
end

UndefVarError: UndefVarError: maf_20_rare_snps not defined

Take a look at the simulated coefficients on the left, next to the corresponding minor allele frequency. Notice the rarer SNPs have the largest effect sizes.

In [28]:
Ex3_rare = round.([chisq_coeff maf_20_rare_snps], digits = 3)
Ex3_rare = DataFrame(Chisq_Coefficient = Ex3_rare[:, 1] , MAF_rare = Ex3_rare[:, 2] )

UndefVarError: UndefVarError: chisq_coeff not defined

In [29]:
simulated_effectsizes_chisq = Ex3_rare[:, 1]

UndefVarError: UndefVarError: Ex3_rare not defined

### Exponential

Here we use the minor allele frequency (maf) as x and evaluate a scaled exponential density, $6e^{-200x}$, to obtain the effect sizes.

In [30]:
simulated_effectsizes_exp = round.(6*exp.(-200*maf_20_rare_snps), digits = 3)

UndefVarError: UndefVarError: maf_20_rare_snps not defined

## Function for Mean Model Expression

In some cases a large number of variants may be used for simulation. In this example we therefore create a function that takes a vector of coefficients and a vector of variant names and returns the mean model expression.

The function `FixedEffectTerms` builds the expression string for the simulation from the specified vectors of coefficients and SNP names. Its output is used to evaluate the mean effect, `μ`, in our mixed effects model. We use this function here instead of writing out all 20 coefficients and variant locus names by hand.
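To show the kind of string such a function produces, here is a small sketch on made-up inputs. The helper `build_terms`, the effect sizes, and the SNP names are all hypothetical and only illustrate the pattern of pairing each coefficient with its variant name.

```julia
# Hypothetical helper: pair each coefficient with its SNP name and join
# the resulting terms into a single formula string.
build_terms(effects, snps) =
    join([string(β, "(", snp, ")") for (β, snp) in zip(effects, snps)], " + ")

toy_effects = [0.5, -1.2, 0.3]      # made-up effect sizes
toy_snps = ["rsA", "rsB", "rsC"]    # made-up SNP names

build_terms(toy_effects, toy_snps)
# "0.5(rsA) + -1.2(rsB) + 0.3(rsC)"
```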

In [31]:
rare_snps_for_simulation

UndefVarError: UndefVarError: rare_snps_for_simulation not defined

In [32]:
function FixedEffectTerms(effectsizes::AbstractVecOrMat, snps::AbstractVecOrMat)
    # Build " + β1(snp1) + β2(snp2) + ..." from the function arguments
    # (not from global variables), with one term for every SNP supplied.
    fixed_terms = ""
    for i in 1:length(effectsizes)
        expression = " + " * string(effectsizes[i]) * "(" * snps[i] * ")"
        fixed_terms = fixed_terms * expression
    end
    return String(fixed_terms)
end


FixedEffectTerms (generic function with 1 method)

In [33]:
formula_20_rare_snps = FixedEffectTerms(simulated_effectsizes_chisq, rare_snps_for_simulation)

UndefVarError: UndefVarError: simulated_effectsizes_chisq not defined

## Example 3(a) Mixed effects model Single Trait:
$$
Y_{3a} ∼ N(μ_{20raresnps}, 4* 2GRM + 2I)$$


This intermediate step uses the `mean_formula` function to evaluate the `formula_20_rare_snps` above on the 20 rare snp data, `geno_rare20_converted`,  to get the fixed effects mean vector (rounded to the third digit).

In [34]:
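# qualify with the module name, because `mean_formula` was rebound to a vector of strings in Example 2a
μ_20_rare_snps = round.(TraitSimulation.mean_formula(formula_20_rare_snps, geno_rare20_converted), digits = 3)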

UndefVarError: UndefVarError: formula_20_rare_snps not defined

In [35]:
rare_20_snp_model = LMMTrait([formula_20_rare_snps], geno_rare20_converted, 4*(2*GRM) + 2*(I_n))
trait_rare_20_snps = simulate(rare_20_snp_model)

UndefVarError: UndefVarError: formula_20_rare_snps not defined

In [36]:
describe(trait_rare_20_snps, stats = [:mean, :std, :min, :max, :eltype])

UndefVarError: UndefVarError: trait_rare_20_snps not defined

## Saving Simulation Results to Local Machine

Write the newly simulated trait into a comma separated (csv) file for later use. Note that the user can set the separator to '\t' for tab separated output, or to another separator of choice.

Here we output the simulated trait values for each of the 212 individuals, labeled by their pedigree ID and person ID.

In addition, we output the genotypes for the variants used to simulate this trait.
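If only the standard library is available, a tab separated file can be written in a similar spirit with `DelimitedFiles`. This is a sketch on made-up values, not the notebook's `CSV.write` workflow; the labels, trait values, and file name are hypothetical.

```julia
using DelimitedFiles

toy_ids = ["fam1_p1", "fam1_p2", "fam2_p1"]    # hypothetical person labels
toy_trait = [41.2, 38.7, 44.0]                 # made-up trait values

# Write to a temporary directory so nothing in the working directory is touched.
outfile = joinpath(mktempdir(), "toy_trait.tsv")
open(outfile, "w") do io
    writedlm(io, ["id" "trait"], '\t')         # header row
    writedlm(io, [toy_ids toy_trait], '\t')    # one row per person
end

# Read the file back; `header = true` returns (data, header).
toy_data, toy_header = readdlm(outfile, '\t', header = true)
```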

In [37]:
Trait3_mixed = hcat(Fam_Person_id, trait_rare_20_snps, geno_rare20_converted)

UndefVarError: UndefVarError: Fam_Person_id not defined

In [38]:
Coefficients = DataFrame(Coefficients = simulated_effectsizes_chisq)
SNPs_rare = DataFrame(SNPs = rare_snps_for_simulation)
Trait3_mixed_sim = hcat(Coefficients, SNPs_rare)

UndefVarError: UndefVarError: simulated_effectsizes_chisq not defined

In [39]:
#cd("/Users") #change to home directory
using CSV
CSV.write("Trait3_mixed.csv", Trait3_mixed)
CSV.write("Trait3_mixed_sim.csv", Trait3_mixed_sim);

UndefVarError: UndefVarError: Trait3_mixed not defined