# Random k Rare SNPs Example

In the three examples below, we simulate traits from a prespecified number, k, of rare SNPs. Effect sizes are simulated based on minor allele frequencies and a chi-squared distribution.

Using SnpArrays.jl, we read in the set of PLINK files for analysis, filter them according to our desired parameters, and write the filtered files to our own machine.


In [1]:
using SnpArrays, TraitSimulation, LinearAlgebra, DataFrames



In [2]:
# User specifies the number of SNPs to use for simulation
k = 5

5

# Data

Our data come from the data directory of the SnpArrays.jl package. We can filter the set of PLINK files using SnpArrays.jl and save the filtered PLINK files to our own machine.


## Input PLINK files

In [3]:
filepath = SnpArrays.datadir("EUR_subset")

"/Users/sarahji/.julia/packages/SnpArrays/d0iJw/src/../data/EUR_subset"

In [4]:
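# Read the genotypes from the PLINK .bed file
EUR_snpdata = SnpArray(filepath * ".bed")
# Default filtering returns Boolean masks for the rows (samples) and columns (SNPs) to keep
rowmask, colmask = SnpArrays.filter(EUR_snpdata);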

In [5]:
# `filepath` is already the full path prefix of the PLINK files
SnpData(filepath);

# Output: Save Filtered PLINK files to local machine

In [6]:
# Minor allele frequency of every SNP in the data set
minor_allele_frequency = maf(EUR_snpdata)
# Keep rare SNPs with 0.002 < MAF ≤ 0.02
new_column_indices = (0.002 .< minor_allele_frequency .≤ 0.02)
sum(new_column_indices) # we should have 7171 filtered rare SNPs

7171
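
The frequency filter above is a broadcast, chained comparison that yields a Boolean mask. A minimal sketch on made-up frequencies (the `toy_` names and values are hypothetical and not taken from `EUR_subset`) shows how the boundary values are treated:

```julia
# Toy minor allele frequencies (hypothetical values, not from EUR_subset)
toy_maf = [0.001, 0.002, 0.01, 0.02, 0.3]

# Keep SNPs with 0.002 < MAF ≤ 0.02; both comparisons broadcast elementwise
toy_mask = 0.002 .< toy_maf .≤ 0.02

# Summing a Boolean mask counts the SNPs that pass the filter
n_rare = sum(toy_mask)
```

The lower bound is strict and the upper bound is inclusive, so a SNP with frequency exactly 0.002 is dropped while one with frequency exactly 0.02 is kept.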

In [7]:
# Write the filtered PLINK files with the prefix "tmp_filtered_EURdata"
filtsnpdata = SnpArrays.filter(filepath, rowmask, new_column_indices, des = "tmp_filtered_EURdata")

379×7171 SnpArray:
 0x03  0x03  0x03  0x03  0x03  0x03  …  0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x03  0x03  0x03  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x02  0x03  0x02  0x03  0x03  0x03
 0x03  0x03  0x03  0x02  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03  …  0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03     0x03  0x02  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x03  0x03  0x03  …  0x03  0x03  0x03  0x03  0x03  0x03
 0x03  0x03  0x03  0x02  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
 0x02  0x03  0x03  0x03  0x03  0x03     0x03  0x03  0x03  0x03  0x03  0x03
    ⋮ 

# Example 1: Rare Variant GLM model

We simulate a normal response with identity link and scale 1, matching the `GLMTrait` call below:

$$
Y_i \sim N(\mu_i, 1)
$$


In [8]:
meanformula_rare, df_rare = Generate_Random_Model_Chisq("tmp_filtered_EURdata", k)
df_rare

|  | rs112552500 | rs144235675 | rs8095054 | rs139491109 | rs141954310 |
|---|---|---|---|---|---|
|  | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 2.0 | 2.0 | 2.0 | 1.0 | 2.0 |
| 2 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 |
| 3 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 |
| 4 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 |
| 5 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 |
| 6 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 |
| 7 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 |
| 8 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 |
| 9 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 |
| 10 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 |


In [9]:
model_rare = GLMTrait(meanformula_rare, df_rare, NormalResponse(1), IdentityLink())
simulated_trait = simulate(model_rare)

379-element Array{Float64,1}:
 0.4652368823595425 
 0.7655860681351965 
 0.5599508456845363 
 0.4006010832726554 
 0.6480322706284275 
 2.2646979013610196 
 3.964307948566201  
 2.678959607200487  
 2.63145858273236   
 0.2740303353636089 
 1.4781426230247598 
 1.895650281427516  
 1.5301222104122967 
 ⋮                  
 2.4649905339074683 
 1.6654697128669342 
 1.7193265929303339 
 0.8470649373524373 
 0.837327899642764  
 2.083307930746231  
 4.124986276999239  
 0.20177212842668946
 1.2949833993082618 
 2.1253813863640185 
 1.2145390902632818 
 2.280332672267521  
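
To make the model in this example concrete, here is a minimal sketch of what simulating a normal trait with an identity link amounts to, written with the Julia standard library only. It is not TraitSimulation's implementation; the design matrix `X`, the effect sizes `β`, and the seed are hypothetical toy values.

```julia
using Random

rng = MersenneTwister(2020)        # fixed seed so the sketch is reproducible

n = 6                              # toy sample size (the tutorial data has 379 people)
# Toy genotype dosages for two SNPs (hypothetical values)
X = [2.0 2.0; 2.0 1.0; 1.0 2.0; 2.0 2.0; 2.0 2.0; 0.0 2.0]
β = [0.1, 0.25]                    # hypothetical effect sizes

μ = X * β                          # identity link: the mean equals the linear predictor
σ = 1.0                            # scale of the normal response
y = μ .+ σ .* randn(rng, n)        # independent normal draws centered at μ
```

Each person's trait is drawn independently here; the mixed models in the next examples differ by adding correlation between relatives.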

# Example 2: Single Rare Variant LMM model

## The Variance Covariance Matrix

Recall: $E(\mathbf{GRM}) = \Phi$
<br>
We use the [SnpArrays.jl](https://github.com/OpenMendel/SnpArrays.jl) package to compute the Genetic Relationship Matrix (GRM), which serves as an estimate of the kinship matrix $\Phi$.

We will use the same values of $\mathbf{GRM}$, $V_a$, and $V_e$ in the bivariate covariance matrix for both the mixed effect example and the rare variant example. In this example $V_a = 4$ and $V_e = 2$, so the trait follows $Y \sim N(\mu, 4 \cdot 2\mathbf{GRM} + 2I_n)$.

Note that the residual covariance between two relatives is the additive genetic variance, $V_a$, times twice the kinship coefficient, $\Phi$. The kinship matrix is estimated by the genetic relationship matrix $\mathbf{GRM}$, computed across the common SNPs with minor allele frequency of at least 0.05.
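
As a worked illustration of this covariance rule, take the variance components from the covariance `4*(2*GRM) + 2*(I_n)` used in this example. The pedigree relationships are textbook kinship values chosen for illustration, not estimates from the data. Entry by entry,

$$
\operatorname{Cov}(Y_i, Y_j) = 2\Phi_{ij} V_a + 1\{i = j\} V_e
$$

For two full siblings, $\Phi_{ij} = 1/4$, so the covariance of their traits is $2 \cdot \tfrac{1}{4} \cdot 4 = 2$. For a single non-inbred person, $\Phi_{ii} = 1/2$, so the trait variance is $2 \cdot \tfrac{1}{2} \cdot 4 + 2 = 6$.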

In [10]:
GRM = grm(EUR_snpdata, minmaf = 0.05)
I_n = Matrix{Float64}(I, size(GRM));

In [11]:
# Single-trait LMM with covariance 4 * (2GRM) + 2 * I_n
rare_snp_model = LMMTrait([meanformula_rare], df_rare, 4*(2*GRM) + 2*(I_n))
# Simulate 1000 replicates and keep the first one
trait_rare_snps = simulate(rare_snp_model, 1000)[:, :, 1]

379×1 Array{Float64,2}:
  2.6431041595815357 
  3.3626846214216117 
 -0.9706484488139182 
  2.1046879663559768 
  0.3869794804911375 
  0.47957779083280483
  6.032767688181237  
  0.7273717767383031 
  1.590854134991661  
 -2.8672496061842097 
  0.31440239788745633
  4.370953345193305  
  1.4903742349387976 
  ⋮                  
 -2.6562881484312797 
 -2.3388976486883966 
  3.523714373774819  
  2.9103143013125425 
  3.2784804825762928 
  1.4714345933897637 
  0.3330203049484175 
 -0.4135344211943712 
  5.313884144225529  
  4.645058572811591  
 -0.7250904858749383 
  3.341893171825407  
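
One standard way to draw a single replicate from a multivariate normal with a given covariance is to scale independent standard normals by a Cholesky factor. The sketch below uses only the Julia standard library and is not a description of TraitSimulation's internals; the toy kinship matrix `Φ`, the mean `μ`, and the seed are hypothetical.

```julia
using LinearAlgebra, Random

rng = MersenneTwister(2020)            # fixed seed so the sketch is reproducible

# Toy kinship matrix: persons 1 and 2 are half siblings, person 3 is unrelated
Φ = [0.5   0.125 0.0;
     0.125 0.5   0.0;
     0.0   0.0   0.5]

V = 4 * (2 * Φ) + 2 * I                # same form as the covariance in this example
μ = [1.0, 1.0, 2.0]                    # hypothetical mean vector

L = cholesky(Symmetric(V)).L           # lower-triangular factor with L * L' == V
y = μ + L * randn(rng, 3)              # one draw from N(μ, V)
```

Because `L * L'` reproduces `V`, the vector `L * z` has covariance `V` whenever `z` has independent standard normal entries.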

# Example 3: Multiple Rare Variant LMM

We simulate two correlated normal traits, controlling for family structure, with location $\mu$ and scale $\mathbf\Sigma$.

$$
Y \sim N(\mu, \mathbf\Sigma)
$$


$$
\mathbf\Sigma = V_a \otimes (2\mathbf{GRM}) + V_e \otimes I_n
$$
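
To see the structure of $\mathbf\Sigma$, the Kronecker products can be formed explicitly for a tiny case. This is a sketch with the standard library's `kron`, assuming a hypothetical two-person matrix `twoΦ` in place of `2GRM`; it is not how the `@vc` macro stores variance components.

```julia
using LinearAlgebra

V_a = [4.0 1.0; 1.0 4.0]               # additive genetic covariance between the two traits
V_e = [2.0 0.0; 0.0 2.0]               # environmental covariance between the two traits

# Hypothetical stand-in for 2GRM: two full siblings (twice the kinship matrix)
twoΦ = [1.0 0.5; 0.5 1.0]
n = size(twoΦ, 1)
I_n = Matrix{Float64}(I, n, n)

# Covariance of the stacked trait vector: trait 1 for all people, then trait 2
Σ = kron(V_a, twoΦ) + kron(V_e, I_n)
```

With two traits and `n` people, `Σ` is `2n × 2n`. The diagonal blocks hold the within-trait covariance between people, and the off-diagonal blocks hold the cross-trait covariance, which is driven here by the off-diagonal entry of `V_a`.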

Note: To create a trait with different variance components, change the elements of $\mathbf\Sigma$. We create the variance component object `variance_formula` below to simulate the traits in this example. This tutorial uses only two variance components, but the `@vc` macro is designed to handle as many as needed.

As long as each variance component is specified correctly, we can create a `VarianceComponent` Julia object for trait simulation.

Example: specifying more than two variance components (let `V_H` denote an additional household variance component and `V_D` a dominance genetic effect):

```{julia}
    multiple_variance_formula = @vc V_A ⊗ 2GRM + V_E ⊗ I_n + V_D ⊗ Δ + V_H ⊗ H;
```

In [12]:
# Additive genetic and environmental covariance matrices for the two traits
V_A = [4 1; 1 4]
V_E = [2.0 0.0; 0.0 2.0];
# @vc is a macro that creates a `VarianceComponent` object for simulation
variance_formula = @vc V_A ⊗ 2GRM + V_E ⊗ I_n;

In [13]:
mean_formulas2 = [meanformula_rare, meanformula_rare]
Multiple_LMM_traits_model = LMMTrait(mean_formulas2, df_rare, variance_formula)
Simulated_LMM_Traits = DataFrame(simulate(Multiple_LMM_traits_model))

|  | x1 | x2 |
|---|---|---|
|  | Float64 | Float64 |
| 1 | 0.718008 | 0.718008 |
| 2 | 1.43602 | 1.43602 |
| 3 | 1.43602 | 1.43602 |
| 4 | 1.43602 | 1.43602 |
| 5 | 1.43602 | 1.43602 |
| 6 | 1.43602 | 1.43602 |
| 7 | 1.43602 | 1.43602 |
| 8 | 1.43602 | 1.43602 |
| 9 | 1.43602 | 1.43602 |
| 10 | 1.43602 | 1.43602 |
