# OpenMendel Tutorial on Iterative Hard Thresholding

### Last update: 4/24/2019

### Julia version

`MendelIHT.jl` currently supports Julia 1.0 and 1.1, but it is currently an unregistered package. To install, press `]` to enter package manager mode and install these packages by typing:

```
add https://github.com/OpenMendel/SnpArrays.jl
add https://github.com/OpenMendel/MendelSearch.jl
add https://github.com/OpenMendel/MendelBase.jl
add https://github.com/biona001/MendelIHT.jl
```

For this tutorial you will also need a few registered packages. Add them by typing:

```
add DataFrames, Distributions, BenchmarkTools, Random, LinearAlgebra, GLM
```

For reproducibility, the computer specs and Julia version are listed below.

In [28]:
versioninfo()

Julia Version 1.0.3
Commit 099e826241 (2018-12-18 01:34 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-3740QM CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, ivybridge)


### When to use Iterative Hard Thresholding

Continuous model selection is advantageous when the regressors act jointly, so that their multivariate nature plays a significant role. As an alternative to traditional SNP-by-SNP association testing, iterative hard thresholding (IHT) performs continuous model selection on a GWAS dataset $\mathbf{X} \in \{0, 1, 2\}^{n \times p}$ and continuous phenotype vector $\mathbf{y}$ by maximizing the loglikelihood $L(\beta)$ subject to the constraint that $\beta$ is $k$-sparse. This method has an edge over LASSO because IHT does not shrink estimated effect sizes. Parallel computing is offered through $q$-fold cross validation.
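To make the idea concrete, here is a minimal sketch of IHT for the least squares (Normal) case: take a gradient step on the loglikelihood, then keep only the $k$ largest entries in magnitude. This is not the implementation in `MendelIHT.jl`. The function names `project_k_sparse` and `iht_least_squares`, the dense random matrix, and the fixed step size are all illustrative assumptions.

```julia
using LinearAlgebra, Random

# Keep the k largest-magnitude entries of v and zero out the rest.
function project_k_sparse(v::AbstractVector, k::Integer)
    out = zero(v)
    keep = partialsortperm(abs.(v), 1:k, rev=true)
    out[keep] = v[keep]
    return out
end

# Plain IHT for least squares: gradient step followed by hard thresholding.
# The fixed step size 1/opnorm(X)^2 is a conservative, illustrative choice.
function iht_least_squares(X::AbstractMatrix, y::AbstractVector, k::Integer; iters::Integer=200)
    β = zeros(size(X, 2))
    step = 1 / opnorm(X)^2
    for _ in 1:iters
        gradient = X' * (y - X * β)   # gradient of the Gaussian loglikelihood (up to scale)
        β = project_k_sparse(β + step * gradient, k)
    end
    return β
end

Random.seed!(2019)
X = randn(200, 50)                    # hypothetical dense design, not genotype data
β_true = zeros(50)
β_true[[3, 17, 42]] = [2.0, -1.5, 1.0]
y = X * β_true + 0.1 * randn(200)
β_hat = iht_least_squares(X, y, 3)
println(findall(!iszero, β_hat))      # positions of the selected predictors
```

Because the projection copies the surviving entries unchanged, the selected coefficients are never pulled toward zero, which is the property the paragraph above contrasts with LASSO.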

### Appropriate Datasets and Example Inputs 

All genotype data **must** be stored in the [PLINK binary genotype format](https://www.cog-genomics.org/plink2/formats#bed), with the `.bim`, `.bed`, and `.fam` files all present. Additional non-genetic covariates should be imported separately by the user. In the examples below, we first simulate phenotypes from a given distribution (Normal, Bernoulli, and Poisson are shown), and then attempt to fit the corresponding model using our IHT implementation. We can examine reconstruction behavior as well as the ability of cross validation to find the true sparsity parameter.


### Missing Data

`MendelIHT` assumes there are no missing genotypes, since it uses linear algebra functions defined in [SnpArrays.jl](https://openmendel.github.io/SnpArrays.jl/latest/man/snparray/#linear-algebra-with-snparray). Therefore, you must impute missing genotypes before you use MendelIHT. SnpArrays.jl offers basic quality control routines such as filtering; for imputation, our own software [option 23 of Mendel](http://software.genetics.ucla.edu/download?package=1) is a reasonable choice. OpenMendel will soon provide a separate package `MendelImpute.jl` containing new imputation strategies such as alternating least squares.
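As a rough illustration of what "impute first" means, the sketch below fills each missing entry of a small dense matrix with the mean of the observed entries in its column. This is only a naive stand-in: it is not the method used by Mendel option 23 or planned for `MendelImpute.jl`, and the helper `impute_column_means` is a hypothetical name that operates on an ordinary Julia matrix rather than a `SnpArray`.

```julia
using Statistics

# Replace each missing entry with the mean of the observed entries in its column.
function impute_column_means(G::AbstractMatrix{Union{Missing, Float64}})
    out = Matrix{Float64}(undef, size(G))
    for j in 1:size(G, 2)
        column_mean = mean(skipmissing(G[:, j]))
        for i in 1:size(G, 1)
            out[i, j] = ismissing(G[i, j]) ? column_mean : G[i, j]
        end
    end
    return out
end

# Toy genotype dosages with two missing entries.
G = Union{Missing, Float64}[0.0 2.0; missing 1.0; 2.0 missing]
G_imputed = impute_column_means(G)
println(G_imputed)
```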

### Cross Validation and Regularization paths

We usually have very little information on how many SNPs affect the phenotype. In a typical GWAS, anywhere from one to thousands of SNPs could play a role. Ideally, then, we would test many different models to find the best one. MendelIHT provides 2 ways to do this automatically: user-specified regularization paths, and $q$-fold [cross validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)). Users should know that, in the first method, increasing the number of predictors will almost always decrease the error, but at the cost of overfitting. Therefore, in most practical situations, it is highly recommended to combine this method with cross validation. In $q$-fold cross validation, samples are divided into $q$ disjoint subsets, and IHT fits a model on $q-1$ of those subsets, then computes the [MSE](https://en.wikipedia.org/wiki/Mean_squared_error) on the remaining subset. Each of the $q$ subsets serves as the test set exactly once. This functionality of `MendelIHT.jl` natively supports parallel computing.
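The $q$-fold procedure described above can be sketched in a few lines. The example below uses ordinary least squares on a small dense matrix instead of IHT, and the function name `cross_validate_mse` is a hypothetical helper, so treat it as an illustration of the train/test bookkeeping rather than of what `cv_iht` does internally.

```julia
using LinearAlgebra, Random, Statistics

# Mean squared prediction error of least squares for each of q random folds.
function cross_validate_mse(X::AbstractMatrix, y::AbstractVector, q::Integer)
    folds = rand(1:q, length(y))          # random fold label for every sample
    mses = zeros(q)
    for fold in 1:q
        test = folds .== fold             # this fold is held out
        train = .!test
        β = X[train, :] \ y[train]        # fit on the other q - 1 folds
        mses[fold] = mean(abs2, y[test] - X[test, :] * β)
    end
    return mses
end

Random.seed!(1)
X = randn(300, 5)
y = X * [1.0, 0.0, -2.0, 0.0, 0.5] + randn(300)
mses = cross_validate_mse(X, y, 3)
println(mean(mses))                       # average held-out MSE across folds
```

Repeating this for every candidate model size and picking the size with the smallest average error is the selection rule used later in this tutorial.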

# Example 1: Quantitative Traits

Quantitative traits are continuous phenotypes that can take on essentially any real value. In this example, we first simulate $y_i = x_i^T\beta + \epsilon_i$ where $\epsilon_i \sim N(0, 1)$ and the nonzero entries of $\beta$ are drawn from $N(0, 1)$. Then, using just the genotype matrix $X$ and phenotype vector $y$, we use IHT to recover the simulated $\beta$.
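For readers who want to see the simulation model written out, here is a small sketch with a dense matrix in plain Julia. It does not reproduce `simulate_random_snparray` or `simulate_random_response`: genotypes are drawn uniformly from 0, 1, 2 purely for illustration, and the dimensions are deliberately small.

```julia
using Random

Random.seed!(1111)
n, p, k = 100, 500, 5
X = Float64.(rand(0:2, n, p))      # genotypes coded as 0, 1, or 2 (uniform draw, an assumption)
β = zeros(p)
causal = randperm(p)[1:k]          # positions of the k causal SNPs
β[causal] = randn(k)               # nonzero effects drawn from N(0, 1)
y = X * β + randn(n)               # y_i = x_i' β + ε_i with ε_i ~ N(0, 1)
println(sort(causal))
```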

In [1]:
#first add the workers needed for parallel computing. Add no more workers than you have CPU cores
using Distributed
addprocs(4)

4-element Array{Int64,1}:
 2
 3
 4
 5

In [2]:
#load necessary packages
using MendelIHT
using SnpArrays
using DataFrames
using Distributions
using BenchmarkTools
using Random
using LinearAlgebra
using GLM

### Step 1: Simulate data with k true predictors, from distribution d and with link l.

In [3]:
n = 1000
p = 10000
k = 10
d = Normal
l = canonicallink(d())

IdentityLink()

### Step 2: Construct snpmatrix, covariate files, and true model b

The SnpBitMatrix type (`xbm` below) is necessary for performing linear algebra directly on raw genotype files without expanding the matrix into floating point numbers. Here the SnpArray (`x` below) is memory-mapped to a file called `tmp.bed` stored on your disk, and hence does not need to be held in RAM.

In [4]:
Random.seed!(1111) #set random seed
x, = simulate_random_snparray(n, p, "tmp.bed")
xbm = SnpBitMatrix{Float64}(x, model=ADDITIVE_MODEL, center=true, scale=true); 
z = ones(n, 1) # only nongenetic covariate is the intercept

1000×1 Array{Float64,2}:
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 ⋮  
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0

### Step 3: Simulate response y, true model b, and the correct non-0 positions of b

In [5]:
y, true_b, correct_position = simulate_random_response(x, xbm, k, d, l)

([-2.02168, -2.56256, 1.2439, 0.304348, 1.70433, -2.7755, -0.948664, 0.166054, 1.58801, 1.03323  …  -2.1665, 7.97552, 0.324306, 1.60573, 1.59093, -2.50396, -3.46523, -0.346403, 1.07067, 0.292686], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [2384, 3352, 4093, 5413, 5455, 6729, 7403, 8753, 9089, 9132])

### Step 4: Run IHT

In [6]:
result = L0_reg(x, xbm, z, y, 1, k, d(), l, debias=false, init=false, use_maf=false)

IHT results:

Compute time (sec):     0.8107540607452393
Final loglikelihood:    -1407.2533232402275
Iterations:             12
Max number of groups:   1
Max predictors/group:   10
IHT estimated 10 nonzero SNP predictors and 0 non-genetic predictors.

Selected genetic predictors:
10×2 DataFrame
│ Row │ Position │ Estimated_β │
│     │ Int64    │ Float64     │
├─────┼──────────┼─────────────┤
│ 1   │ 2384     │ -1.26014    │
│ 2   │ 3352     │ -0.26742    │
│ 3   │ 3353     │ 0.141208    │
│ 4   │ 4093     │ 0.289956    │
│ 5   │ 5413     │ 0.366689    │
│ 6   │ 5609     │ -0.137181   │
│ 7   │ 7403     │ -0.308255   │
│ 8   │ 8753     │ 0.332881    │
│ 9   │ 9089     │ 0.964598    │
│ 10  │ 9132     │ -0.509461   │

Selected nongenetic predictors:
0×2 DataFrame


### Step 5: Check results

IHT found 8/10 predictors in this example. The 2 that were not found had relatively small effect sizes, and as far as IHT can tell, they are indistinguishable from noise.

In [7]:
compare_model = DataFrame(
    true_β      = true_b[correct_position], 
    estimated_β = result.beta[correct_position])
@show compare_model
println("Total iteration number was " * string(result.iter))
println("Total time was " * string(result.time))
println("Total found predictors = " * string(length(findall(!iszero, result.beta[correct_position]))))

#clean up
rm("tmp.bed", force=true)

compare_model = 10×2 DataFrame
│ Row │ true_β    │ estimated_β │
│     │ Float64   │ Float64     │
├─────┼───────────┼─────────────┤
│ 1   │ -1.19376  │ -1.26014    │
│ 2   │ -0.230351 │ -0.26742    │
│ 3   │ 0.257181  │ 0.289956    │
│ 4   │ 0.344827  │ 0.366689    │
│ 5   │ 0.155484  │ 0.0         │
│ 6   │ -0.126114 │ 0.0         │
│ 7   │ -0.286079 │ -0.308255   │
│ 8   │ 0.327039  │ 0.332881    │
│ 9   │ 0.931375  │ 0.964598    │
│ 10  │ -0.496683 │ -0.509461   │
Total iteration number was 12
Total time was 0.8107540607452393
Total found predictors = 8


# Example 2: Case-control study controlling for sex

Case-control studies are used when the phenotype is binary. In this example, we simulate a case-control study while controlling for sex as a non-genetic covariate.

The exact simulation code to generate the phenotype $y$ can be found at: https://github.com/biona001/MendelIHT.jl/blob/master/src/simulate_utilities.jl#L107

### Step 1: Simulate data with k true predictors, from distribution d and with link l.

In [8]:
n = 1000
p = 10000
k = 10
d = Bernoulli
l = canonicallink(d())

LogitLink()

### Step 2: construct snpmatrix, covariate files, and true model b

In [9]:
Random.seed!(1111) #set random seed 
x, = simulate_random_snparray(n, p, "tmp.bed")
xbm = SnpBitMatrix{Float64}(x, model=ADDITIVE_MODEL, center=true, scale=true); 
z = ones(n, 2) # first column is the intercept, second column the sex. 0 = male 1 = female
z[:, 2] .= rand(0:1, n)

1000-element view(::Array{Float64,2}, :, 2) with eltype Float64:
 1.0
 0.0
 0.0
 1.0
 0.0
 0.0
 1.0
 0.0
 1.0
 1.0
 0.0
 0.0
 1.0
 ⋮  
 0.0
 1.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 1.0
 0.0

### Step 3: simulate true models 

Here we used $k=8$ genetic predictors and 2 non-genetic predictors (intercept and sex). The simulation code in our package does not yet handle simulations with non-genetic predictors, so we must simulate these phenotypes manually. 

In [10]:
true_b = zeros(p) #genetic predictors
true_b[1:k-2] = randn(k-2)
shuffle!(true_b)
correct_position = findall(!iszero, true_b)
true_c = [1.0; 1.5] #non-genetic predictors: intercept & sex

2-element Array{Float64,1}:
 1.0
 1.5

### Step 4: simulate phenotype using genetic and nongenetic predictors

In [11]:
prob = linkinv.(l, xbm * true_b .+ z * true_c)
y = [rand(d(i)) for i in prob]
y = Float64.(y) #convert y to floating point numbers

1000-element Array{Float64,1}:
 1.0
 1.0
 1.0
 1.0
 1.0
 0.0
 1.0
 1.0
 1.0
 0.0
 0.0
 1.0
 0.0
 ⋮  
 0.0
 1.0
 1.0
 1.0
 0.0
 1.0
 0.0
 0.0
 1.0
 1.0
 1.0
 0.0
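The phenotype simulation in Step 4 relies on the inverse logit link to turn a linear predictor into a success probability. The sketch below spells this out in plain Julia using only the two non-genetic effects from `true_c` (intercept 1.0 and sex effect 1.5). The helper name `logistic` and the omission of the genetic term are simplifications for illustration.

```julia
using Random

logistic(η) = 1 / (1 + exp(-η))    # inverse of the logit link

Random.seed!(1111)
n = 1000
sex = Float64.(rand(0:1, n))       # 0 = male, 1 = female
η = 1.0 .+ 1.5 .* sex              # linear predictor: intercept plus sex effect
prob = logistic.(η)                # P(y_i = 1)
y = Float64.(rand(n) .< prob)      # Bernoulli draw: 1 with probability prob[i]
println(sum(y) / n)                # observed proportion of cases
```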

### Step 5: run IHT

In [12]:
result = L0_reg(x, xbm, z, y, 1, k, d(), l, debias=false, init=false, use_maf=false)

IHT results:

Compute time (sec):     2.4853720664978027
Final loglikelihood:    -286.3534665608417
Iterations:             48
Max number of groups:   1
Max predictors/group:   10
IHT estimated 8 nonzero SNP predictors and 2 non-genetic predictors.

Selected genetic predictors:
8×2 DataFrame
│ Row │ Position │ Estimated_β │
│     │ Int64    │ Float64     │
├─────┼──────────┼─────────────┤
│ 1   │ 98       │ 0.36567     │
│ 2   │ 983      │ 0.424646    │
│ 3   │ 2960     │ -2.26025    │
│ 4   │ 4461     │ 0.427904    │
│ 5   │ 4588     │ -0.67782    │
│ 6   │ 6086     │ 0.777424    │
│ 7   │ 6130     │ -0.948359   │
│ 8   │ 9283     │ -0.753285   │

Selected nongenetic predictors:
2×2 DataFrame
│ Row │ Position │ Estimated_β │
│     │ Int64    │ Float64     │
├─────┼──────────┼─────────────┤
│ 1   │ 1        │ 0.980127    │
│ 2   │ 2        │ 1.71288     │

### Step 6: check result

As we can see below, IHT finds 5/8 true genetic predictors and 2/2 true non-genetic predictors. Note that:

+ The coefficient estimates for found predictors are unbiased.
+ Larger effect sizes are easier to find.
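The first point reflects the difference between hard and soft thresholding. The sketch below compares the two operators on a toy vector. Note the assumption: IHT as described in this tutorial keeps the $k$ largest entries rather than thresholding at a fixed level, so the fixed cutoff `λ` and both function names are used here only to make the comparison with the LASSO-style soft threshold direct.

```julia
# Soft thresholding (the LASSO proximal step) pulls every surviving entry toward zero by λ.
soft_threshold(x, λ) = sign(x) * max(abs(x) - λ, 0)
# Hard thresholding keeps surviving entries exactly as they are.
hard_threshold(x, λ) = abs(x) > λ ? x : zero(x)

v = [-2.25, 0.0625, -0.75, 1.0]
println(soft_threshold.(v, 0.5))   # [-1.75, 0.0, -0.25, 0.5]
println(hard_threshold.(v, 0.5))   # [-2.25, 0.0, -0.75, 1.0]
```

Both operators drop the small entry, but only the soft threshold changes the magnitude of the entries that remain.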

In [13]:
compare_model_genetic = DataFrame(
    true_β      = true_b[correct_position], 
    estimated_β = result.beta[correct_position])

compare_model_nongenetic = DataFrame(
    true_c      = true_c[1:2], 
    estimated_c = result.c[1:2])

@show compare_model_genetic
println("\n")
@show compare_model_nongenetic

compare_model_genetic = 8×2 DataFrame
│ Row │ true_β    │ estimated_β │
│     │ Float64   │ Float64     │
├─────┼───────────┼─────────────┤
│ 1   │ -2.22637  │ -2.26025    │
│ 2   │ 0.0646127 │ 0.0         │
│ 3   │ -0.63696  │ -0.67782    │
│ 4   │ 1.08631   │ 0.777424    │
│ 5   │ -0.930103 │ -0.948359   │
│ 6   │ -0.283783 │ 0.0         │
│ 7   │ -0.206074 │ 0.0         │
│ 8   │ -0.553461 │ -0.753285   │


compare_model_nongenetic = 2×2 DataFrame
│ Row │ true_c  │ estimated_c │
│     │ Float64 │ Float64     │
├─────┼─────────┼─────────────┤
│ 1   │ 1.0     │ 0.980127    │
│ 2   │ 1.5     │ 1.71288     │



# Example 3: Cross Validation with Poisson using debiasing

In this example, we investigate IHT's cross validation routines using as many CPU cores as possible. We use Poisson regression as an example. Information on the current machine (4 cores available) is listed at the beginning of this tutorial. We also turn on debiasing to show that this functionality works.

### Step 1: Verify that multiple workers are available.

Workers were added in the first example with the Distributed.jl package. If `nprocs()` returns 1, restart the notebook and add workers before loading packages.

In [14]:
nprocs()

5

### Step 2: simulate data with k true predictors, from distribution d and with link l.

Here we chose a larger sample size to have better accuracy.

In [21]:
n = 5000
p = 30000
k = 10
d = Poisson
l = canonicallink(d())

LogLink()

### Step 3: construct snpmatrix, covariate files, and true model b

Note that using `undef` as the third argument instead creates a non-memory-mapped SnpArray, which must be held in RAM. This costs extra memory but gives quicker data access. It is therefore up to the user to decide when memory-mapped files are appropriate. If in doubt, we recommend always using memory mapping.

In [22]:
Random.seed!(1111) #set random seed
x, = simulate_random_snparray(n, p, undef)
xbm = SnpBitMatrix{Float64}(x, model=ADDITIVE_MODEL, center=true, scale=true); 
z = ones(n, 1) # the intercept

5000×1 Array{Float64,2}:
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 ⋮  
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
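To put the memory trade-off of a non-memory-mapped SnpArray in perspective, the arithmetic below compares the packed PLINK representation (2 bits per genotype) with a dense `Float64` matrix for the dimensions used in this example. It assumes the in-memory SnpArray uses the same 2-bit packing as the `.bed` file, and it ignores small constant overheads.

```julia
n, p = 5000, 30000

# PLINK .bed packs 4 genotypes into each byte (2 bits per genotype), column by column.
packed_bytes = cld(n, 4) * p
# A dense Float64 matrix needs 8 bytes per genotype.
dense_bytes = 8 * n * p

println(packed_bytes / 10^6)   # 37.5 (megabytes)
println(dense_bytes / 10^9)    # 1.2 (gigabytes)
```

At this size, holding the packed genotypes in RAM is inexpensive, whereas expanding them to floating point would cost roughly 32 times more.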

### Step 4: simulate response, true model b, and the correct non-0 positions of b

In [23]:
y, true_b, correct_position = simulate_random_response(x, xbm, k, d, l)

([0.0, 6.0, 0.0, 0.0, 6.0, 1.0, 1.0, 0.0, 0.0, 7.0  …  1.0, 2.0, 0.0, 0.0, 0.0, 0.0, 12.0, 1.0, 2.0, 9.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [1744, 6495, 10765, 12333, 16133, 17026, 17885, 21068, 22330, 29911])

### Step 5: specify path and folds

Here `path` lists all the model sizes you wish to test, and `folds` indicates how to partition the samples into disjoint groups. It is important to partition the training/testing data randomly, as opposed to chunk by chunk, to avoid sampling biases. Below we test $k = 1, 2, ..., 20$ across 3 folds. This is equivalent to running IHT on 60 different models, and hence is ideal for parallel computing.

In [24]:
path = collect(1:20)
num_folds = 3
folds = rand(1:num_folds, size(x, 1))

5000-element Array{Int64,1}:
 2
 3
 2
 3
 2
 3
 1
 2
 2
 3
 2
 1
 1
 ⋮
 2
 3
 3
 2
 3
 3
 1
 3
 2
 2
 3
 2
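The difference between chunk-by-chunk and random fold labels, and the count of 60 model fits, can be checked with a short sketch. The variable names `chunked` and `balanced_random` are illustrative. Shuffling a balanced label vector is one alternative to the independent `rand` draws used above and guarantees near-equal fold sizes.

```julia
using Random

n, num_folds = 5000, 3
path = collect(1:20)

# Chunk-by-chunk labels: samples 1..1667 in fold 1, the next 1667 in fold 2, and so on.
chunked = repeat(1:num_folds, inner=cld(n, num_folds))[1:n]

# Shuffling the same labels keeps fold sizes balanced but breaks any ordering in the data.
Random.seed!(2019)
balanced_random = shuffle(chunked)

fold_sizes = [count(balanced_random .== f) for f in 1:num_folds]
println(fold_sizes)                  # [1667, 1667, 1666]
println(length(path) * num_folds)    # 60 model fits in total
```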

### Step 6: Run IHT's cross validation routine

This returns a vector of deviance residuals, which is a generalization of the mean squared error. 

**Warning:** This step will generate intermediate files with names similar to `train.tmp` and `test.tmp`. These are necessary auxiliary files that will be automatically removed when cross validation completes. **Removing these files before the algorithm terminates will lead to errors.**

In [25]:
drs = cv_iht(d(), l, x, z, y, 1, path, folds, num_folds, debias=true, parallel=true)



Crossvalidation Results:
	k	MSE
	1	5581.143071419869
	2	4223.403648644891
	3	3239.104084235584
	4	2879.8855975535353
	5	2575.1195191721963
	6	2315.6174299543
	7	2230.341919415393
	8	1948.4954290886512
	9	1830.8021581831804
	10	1678.88047182858
	11	1695.310144524478
	12	1706.3305709652873
	13	1710.8323819723923
	14	1719.5785652367854
	15	1722.8525298462273
	16	1730.1535563414525
	17	1728.6745696861462
	18	1734.1884767964175
	19	1738.055904325496
	20	1741.9157245269835

The lowest MSE is achieved at k = 10 



20-element Array{Float64,1}:
 5581.143071419869 
 4223.403648644891 
 3239.104084235584 
 2879.8855975535353
 2575.1195191721963
 2315.6174299543   
 2230.341919415393 
 1948.4954290886512
 1830.8021581831804
 1678.88047182858  
 1695.310144524478 
 1706.3305709652873
 1710.8323819723923
 1719.5785652367854
 1722.8525298462273
 1730.1535563414525
 1728.6745696861462
 1734.1884767964175
 1738.055904325496 
 1741.9157245269835
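The sense in which deviance generalizes squared error can be seen from the standard unit deviance formulas for the Normal and Poisson families. The sketch below evaluates both on a toy vector. The function names are hypothetical, and how `cv_iht` aggregates or scales these quantities is not shown here.

```julia
# Unit deviance for one observation; the y == 0 case uses the limit y*log(y) -> 0.
poisson_unit_deviance(y, μ) = 2 * ((y == 0 ? 0.0 : y * log(y / μ)) - (y - μ))
# For the Normal family the unit deviance is just the squared error.
normal_unit_deviance(y, μ) = (y - μ)^2

y_obs = [0.0, 3.0, 1.0]              # toy observed counts
μ_fit = [0.5, 2.0, 1.0]              # toy fitted means
poisson_total = sum(poisson_unit_deviance.(y_obs, μ_fit))
normal_total = sum(normal_unit_deviance.(y_obs, μ_fit))
println(poisson_total)
println(normal_total)                # 1.25
```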

### Step 7: Run full model on the best estimated model size 

According to our cross validation result, the model size that minimizes the deviance residuals (i.e. the MSE on the held-out subset of samples) is $k = 10$. That is, cross validation suggests that a model with 10 SNPs fits best. Using this information, one can re-run IHT on the full data to obtain the estimated model.

In [26]:
k_est = path[argmin(drs)] #argmin returns an index into path, so map it back to a model size
result = L0_reg(x, xbm, z, y, 1, k_est, d(), l, debias=true)

IHT results:

Compute time (sec):     24.928004026412964
Final loglikelihood:    -6720.822219489936
Iterations:             30
Max number of groups:   1
Max predictors/group:   10
IHT estimated 10 nonzero SNP predictors and 0 non-genetic predictors.

Selected genetic predictors:
10×2 DataFrame
│ Row │ Position │ Estimated_β │
│     │ Int64    │ Float64     │
├─────┼──────────┼─────────────┤
│ 1   │ 1744     │ 0.842532    │
│ 2   │ 6495     │ 0.179672    │
│ 3   │ 10765    │ 0.183316    │
│ 4   │ 12333    │ -0.236543   │
│ 5   │ 16133    │ -0.254481   │
│ 6   │ 17026    │ 0.462852    │
│ 7   │ 17885    │ -0.299888   │
│ 8   │ 21068    │ -0.222807   │
│ 9   │ 22330    │ 0.202813    │
│ 10  │ 29911    │ -0.561777   │

Selected nongenetic predictors:
0×2 DataFrame


### Step 8: Check final model against simulation

In [27]:
compare_model = DataFrame(
    true_β      = true_b[correct_position], 
    estimated_β = result.beta[correct_position])
@show compare_model
println("Total iteration number was " * string(result.iter))
println("Total time was " * string(result.time))
println("Total found predictors = " * string(length(findall(!iszero, result.beta[correct_position]))))

#clean up
rm("tmp.bed", force=true)

compare_model = 10×2 DataFrame
│ Row │ true_β    │ estimated_β │
│     │ Float64   │ Float64     │
├─────┼───────────┼─────────────┤
│ 1   │ 0.842304  │ 0.842532    │
│ 2   │ 0.162081  │ 0.179672    │
│ 3   │ 0.165261  │ 0.183316    │
│ 4   │ -0.23034  │ -0.236543   │
│ 5   │ -0.253494 │ -0.254481   │
│ 6   │ 0.472102  │ 0.462852    │
│ 7   │ -0.315671 │ -0.299888   │
│ 8   │ -0.216082 │ -0.222807   │
│ 9   │ 0.199723  │ 0.202813    │
│ 10  │ -0.547869 │ -0.561777   │
Total iteration number was 30
Total time was 24.928004026412964
Total found predictors = 10


# Conclusion

This notebook demonstrated some of the basic features of IHT. It is important to note that in the real world, the effect sizes of genetic predictors are expected to be small. Thus, detecting them requires a reasonably large sample size (say $n$ in the thousands). Fortunately, this is commonplace nowadays.


# Extra features 

Due to limited space, we omitted illustrations of some functionalities that have already been implemented, listed below:

+ Negative binomial, gamma, inverse gaussian, and binomial regressions
+ Use of non-canonical link functions 
+ Initializing IHT at a good starting point (setting init=true)
+ Doubly sparse projection (requires group information)
+ Weighted projections (requires weight information)

Interested users can visit [our code to reproduce certain figures of our paper](https://github.com/biona001/MendelIHT.jl/tree/master/figures) on GitHub.