# Examples

In this section, we show how to set up a basic pipeline for analyzing GWAS data with MendelIHT.

## Installation

Press `]` to enter package manager mode and type the following (after `pkg>`):
```
(v1.0) pkg> add https://github.com/OpenMendel/SnpArrays.jl
(v1.0) pkg> add https://github.com/OpenMendel/MendelSearch.jl
(v1.0) pkg> add https://github.com/OpenMendel/MendelBase.jl
(v1.0) pkg> add https://github.com/biona001/MendelIHT.jl
```
The order of installation is important!

## Workflow

A typical analysis pipeline looks like the following:

### Step 1: Import data
We use [SnpArrays.jl](https://openmendel.github.io/SnpArrays.jl/latest/) as the backend to process genotype files. Internally, the genotype file is a memory-mapped [SnpArray type](https://openmendel.github.io/SnpArrays.jl/stable/#SnpArray-1), which is used to compute summary statistics without loading the full file into RAM. In step 4 (particularly with the function `L0_reg`), the genotype file must be converted to a [SnpBitMatrix type](https://openmendel.github.io/SnpArrays.jl/stable/#SnpBitMatrix-1) to run linear algebra routines in MendelIHT, and this consumes $n \times p \times 2$ bits of RAM, where $n$ is the number of samples and $p$ is the number of SNPs.
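
To make the $n \times p \times 2$ bits figure concrete, the following is a minimal sketch that converts it to megabytes and compares it with a dense `Float64` matrix of the same shape. The helper names and the sample and SNP counts are hypothetical and chosen only for illustration; the estimate ignores any overhead beyond the raw bits.

```julia
# Approximate storage, in bytes, of an n × p genotype matrix at 2 bits per entry.
snpbitmatrix_bytes(n, p) = n * p * 2 / 8

# The same matrix stored densely as Float64 (64 bits per entry), for comparison.
dense_float64_bytes(n, p) = n * p * 8

n, p = 1_000, 10_000      # hypothetical sample and SNP counts
bit_mb   = snpbitmatrix_bytes(n, p) / 1e6
dense_mb = dense_float64_bytes(n, p) / 1e6
println("2-bit storage: ", bit_mb, " MB; Float64 storage: ", dense_mb, " MB")
```

For these hypothetical sizes the 2-bit representation needs about 2.5 MB, versus 80 MB for the dense `Float64` matrix, a 32-fold difference.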

Non-genetic predictors should be read into Julia in the standard way and stored as a **matrix** of type Float64 (i.e. `Array{Float64, 2}`). We recommend including the intercept as the first column, although an intercept is not required to run IHT.

#### Example 1.1: Reading Genotype data

```julia
using SnpArrays, MendelIHT
x = SnpArray("../data/test1.bed")
xbm = SnpBitMatrix{Float64}(x, model=ADDITIVE_MODEL, center=true, scale=true);
```

!!! note

    (1) MendelIHT.jl assumes there are **NO missing genotypes**, and (2) the three PLINK files (`.bim`, `.bed`, `.fam`) must all be present in the same directory.
    
#### Example 1.2: Reading non-genetic covariate data

```julia
using DelimitedFiles, Statistics
z = readdlm("../data/test1_covariates.txt") # 1st column intercept, 2nd column sex

# standardize all covariates (other than intercept) to mean 0 variance 1
for i in 2:size(z, 2)
    col_mean = mean(z[:, i])
    col_std  = std(z[:, i])
    z[:, i] .= (z[:, i] .- col_mean) ./ col_std
end

z # 1st column is the intercept, 2nd column is the standardized sex covariate
```

```
1000×2 Array{Float64,2}:
 1.0   1.01969 
 1.0  -0.979706
 1.0   1.01969 
 1.0  -0.979706
 1.0  -0.979706
 1.0  -0.979706
 1.0   1.01969 
 1.0  -0.979706
 1.0  -0.979706
 1.0  -0.979706
 1.0  -0.979706
 1.0   1.01969 
 1.0  -0.979706
 ⋮             
 1.0  -0.979706
 1.0  -0.979706
 1.0  -0.979706
 1.0   1.01969 
 1.0   1.01969 
 1.0   1.01969 
 1.0  -0.979706
 1.0   1.01969 
 1.0  -0.979706
 1.0  -0.979706
 1.0   1.01969 
 1.0  -0.979706
```

!!! note

    Except for the intercept, one should standardize all covariates (including binary and categorical ones). This ensures that all covariates are penalized equally.
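
The standardization loop shown earlier can also be wrapped in a small reusable function. The sketch below is one possible way to do this; `standardize_covariates!` and the toy matrix `z_demo` are hypothetical names that are not part of MendelIHT. It assumes the first column is the intercept, and it skips constant columns (an added safeguard, not something the original loop does) to avoid dividing by zero.

```julia
using Statistics

# Standardize every column except the first (assumed to be the intercept) in place.
# Constant columns are left untouched to avoid dividing by a zero standard deviation.
function standardize_covariates!(z::AbstractMatrix{<:AbstractFloat})
    for j in 2:size(z, 2)
        col = view(z, :, j)
        s = std(col)
        s > 0 || continue
        col .= (col .- mean(col)) ./ s
    end
    return z
end

# Toy covariates: intercept, a binary column, and a constant column.
z_demo = [ones(6) [0.0, 1.0, 1.0, 0.0, 1.0, 0.0] fill(3.0, 6)]
standardize_covariates!(z_demo)
```

After the call, the binary column has mean 0 and standard deviation 1 (up to floating-point error), while the intercept and the constant column are unchanged.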

### Step 2: Decide which model sizes to test
This means deciding how many SNPs *may* be associated with the trait. Store all the model sizes you want to test in a vector, typically called `path`. Then decide how many folds of cross validation to use: more folds give higher accuracy but a longer compute time. Typically we use 3 to 5 folds.

#### Example 2.1
A complex trait such as obesity may have 100 associated SNPs. Thus we test model sizes $k \in \{1, 2, \ldots, 100\}$. We then run 5-fold [cross validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) to minimize overfitting. One is free to choose any number of folds and any model sizes, but the cost of running $a$ models across $b$ folds scales as $\mathcal{O}(ab)$; here that means $100 \times 5 = 500$ model fits.

```julia
path = collect(1:100)
num_folds = 5
folds = rand(1:num_folds, size(x, 1)) # assigns each sample to one of 5 disjoint subsets
```

```
1000-element Array{Int64,1}:
 1
 2
 5
 4
 4
 1
 2
 2
 1
 4
 1
 3
 1
 ⋮
 3
 1
 5
 3
 3
 3
 5
 3
 4
 5
 1
 4
```
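
Drawing fold labels with `rand` gives folds of random, generally unequal sizes. If balanced folds are preferred, one option is to repeat the labels `1:num_folds` and shuffle them. The sketch below assumes this alternative; `balanced_folds` is a hypothetical helper, not a MendelIHT function, and the seed is arbitrary.

```julia
using Random

# Assign n samples to num_folds folds whose sizes differ by at most one.
function balanced_folds(rng::AbstractRNG, n::Integer, num_folds::Integer)
    labels = repeat(1:num_folds, cld(n, num_folds))[1:n]
    return shuffle!(rng, labels)
end

rng = MersenneTwister(2019)   # fixed seed so the partition is reproducible
folds_balanced = balanced_folds(rng, 1000, 5)

# Number of samples in each fold.
fold_sizes = [count(x -> x == f, folds_balanced) for f in 1:5]
```

With 1000 samples and 5 folds, every fold contains exactly 200 samples.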


### Step 3: Call `cv_iht` to run cross validation across the different models

This step uses cross validation to select the best model size among those you specified in `path`.
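
To illustrate the idea of picking a model size by cross validation, here is a self-contained sketch on simulated data. It does not call `cv_iht` or run IHT: the fitting routine `fit_top_k` is a hypothetical stand-in (marginal screening followed by least squares), and `cv_mse`, the simulated data, and the seed are likewise assumptions made only for this example.

```julia
using Random, Statistics, LinearAlgebra

# Stand-in for a sparse fit (NOT the IHT algorithm): keep the k predictors most
# correlated with y, then fit those k predictors by ordinary least squares.
function fit_top_k(X, y, k)
    scores = abs.(X' * (y .- mean(y)))
    keep = partialsortperm(scores, 1:k, rev=true)
    beta = zeros(size(X, 2))
    beta[keep] = X[:, keep] \ y
    return beta
end

# Average held-out mean squared error for each model size in `path`.
function cv_mse(X, y, path, folds, num_folds)
    mse = zeros(length(path))
    for fold in 1:num_folds
        test = folds .== fold
        train = .!test
        for (i, k) in enumerate(path)
            beta = fit_top_k(X[train, :], y[train], k)
            mse[i] += mean(abs2, y[test] .- X[test, :] * beta) / num_folds
        end
    end
    return mse
end

# Simulated data with 3 truly associated predictors out of 50.
rng = MersenneTwister(1)
n, p, true_k = 500, 50, 3
X = randn(rng, n, p)
beta_true = zeros(p)
beta_true[1:true_k] .= [2.0, -2.0, 1.5]
y = X * beta_true .+ 0.5 .* randn(rng, n)

path = collect(1:10)
num_folds = 5
folds = rand(rng, 1:num_folds, n)

mse = cv_mse(X, y, path, folds, num_folds)
best_k = path[argmin(mse)]   # model size with the smallest cross-validated error
```

The model size with the lowest average held-out error is the one carried forward to the final fit in the next step.
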
### Step 4: Run `L0_reg` on the best model size

