# OpenMendel: Iterative Hard Thresholding Tutorial

### Last update: 10/22/2018

### Julia version

For reproducibility, the computer spec and Julia version are listed below. The current code supports Julia version 0.6.4. It will not work in v0.7 or v1.0, but upgrading to v1.0 is very high on our TODO list.

The IHT.jl module must be installed on your computer before running this tutorial. Instructions on cloning, along with the latest IHT code, can be found here: https://github.com/biona001/IHT.jl

In [1]:
versioninfo()

Julia Version 0.6.4
Commit 9d11f62bcb (2018-07-09 19:09 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-3740QM CPU @ 2.70GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge MAX_THREADS=16)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, ivybridge)


### When to use Iterative Hard Thresholding

*Continuous* model selection is advantageous when the regressors act *jointly* on the response, so that their multivariate structure matters. Iterative hard thresholding (IHT) performs continuous model selection on a GWAS dataset $\mathbf{X} \in \{0, 1, 2\}^{n \times p}$ and continuous phenotype vector $\mathbf{y}$ by minimizing the residual sum of squares $f(\beta) = \frac{1}{2}||\mathbf{y} - \mathbf{X}\beta||^2$ subject to the constraint that $\beta$ is $k$-sparse, i.e. has at most $k$ non-zero entries. Parallel computing is offered through $q$-fold cross validation and, in the near future, through (dense genotype matrix)-(dense vector) multiplication.
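
To make the constrained minimization concrete, here is a minimal sketch of a projected-gradient IHT loop on a small dense matrix. It is an illustration under stated assumptions, not the IHT.jl implementation: the names `hard_threshold` and `iht_sketch`, the fixed step size, the fixed iteration count, and the toy data are all hypothetical, and the package operates on PLINK data rather than a dense `Float64` matrix.

```julia
# Keep the k largest-magnitude entries of b and set the rest to zero.
function hard_threshold(b, k)
    keep = sortperm(abs.(b), rev=true)[1:k]
    out = zeros(length(b))
    out[keep] = b[keep]
    return out
end

# Gradient step on f(b) = 0.5 * ||y - X*b||^2 followed by projection
# onto the set of k-sparse vectors.
function iht_sketch(X, y, k; iters=200)
    # Assumption: a fixed, conservative step size of 1 / ||X||_F^2.
    step = 1 / sum(abs2, X)
    b = zeros(size(X, 2))
    for iter in 1:iters
        grad = -X' * (y - X * b)
        b = hard_threshold(b - step * grad, k)
    end
    return b
end

# Hypothetical toy genotypes with entries in {0, 1, 2}.
X = [1.0 0 0 0;
     2.0 0 0 0;
     0 2.0 0 0;
     0 1.0 0 0;
     0 0 1.0 0;
     0 0 2.0 0;
     0 0 0 2.0;
     0 0 0 1.0]
b_true = [0.0, 1.0, 0.0, -0.5]                    # 2-sparse signal
noise = 0.01 * [1, -1, 1, -1, 1, -1, 1, -1]       # small deterministic noise
y = X * b_true + noise

b_hat = iht_sketch(X, y, 2)
```

In this toy problem the thresholding step discards the two small noise-driven coefficients and keeps predictors 2 and 4. The step $1/\|\mathbf{X}\|_F^2$ was chosen only because it is simple and never exceeds the reciprocal of the largest eigenvalue of $\mathbf{X}^T\mathbf{X}$; a practical implementation would normally pick the step adaptively.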

### Appropriate Datasets and Example Inputs 

All genotype data **must** be stored in the [PLINK binary genotype format](https://www.cog-genomics.org/plink2/formats#bed), where the `.bim`, `.bed` and `.fam` files must all be present. Additional non-genetic covariates should be stored in a separate file (e.g. a comma-separated file). In this tutorial, we use "gwas 1 data" (github repo: [here](https://github.com/OpenMendel/MendelGWAS.jl/tree/master/docs)) to illustrate the functionality of MendelIHT. This dataset has 2200 people and a modest 10000 simulated SNPs, with 2 SNPs (`rs1935681` and `rs2256412`) contributing to the response. One can obtain this dataset from the first example input of [MendelGWAS.jl](https://openmendel.github.io/MendelGWAS.jl/), or via option 24a of the free application [Mendel version 16](http://www.genetics.ucla.edu/software/mendel).


### Missing Data

`MendelIHT` assumes there are no missing genotypes, since it uses linear algebra functions defined in [SnpArrays.jl](https://openmendel.github.io/SnpArrays.jl/latest/man/snparray/#linear-algebra-with-snparray). Therefore, you must impute missing genotypes before using MendelIHT. SnpArrays.jl offers a naive imputation strategy; otherwise, our own software [option 23 of Mendel](http://www.genetics.ucla.edu/software/mendel) is a reasonable choice. OpenMendel will soon provide a separate package, `MendelImpute`, containing new imputation strategies such as alternating least squares.

### Cross Validation and Regularization paths

We usually have very little information on how many SNPs affect the phenotype. In a typical GWAS, anywhere from one to thousands of SNPs could play a role. Ideally, then, we would test many different models to find the best one. MendelIHT provides 2 ways to do this automatically: user-specified regularization paths, and $q$-fold [cross validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)). Users should know that, in the first method, increasing the number of predictors will almost always decrease the error, but at the cost of overfitting. In most practical situations it is therefore highly recommended to combine this method with cross validation. In $q$-fold cross validation, the samples are divided into $q$ disjoint subsets; IHT fits a model on $q-1$ of those subsets, then computes the [MSE](https://en.wikipedia.org/wiki/Mean_squared_error) on the remaining held-out subset. Each of the $q$ subsets serves as the test set exactly once. With noisy responses, IHT cross validation typically underestimates the true model size, as shown in an example below.
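
The $q$-fold procedure described above can be sketched in a few lines. This is a simplified illustration, not MendelIHT's cross validation code: the helper names (`fold_ids`, `mse`, `cv_error`, `least_squares`) are hypothetical, folds are assigned in round-robin order rather than however the package assigns them, and an ordinary least-squares fit stands in for IHT so that the block is self-contained.

```julia
# Assign each of n samples to one of q folds in round-robin order.
fold_ids(n, q) = [(i - 1) % q + 1 for i in 1:n]

# Mean squared error of the predictions X*b against y.
mse(X, y, b) = sum(abs2, y - X * b) / length(y)

# Average held-out MSE over q folds. `fit(X, y)` is any function that
# returns a coefficient vector; each fold is the test set exactly once.
function cv_error(fit, X, y, q)
    n = length(y)
    folds = fold_ids(n, q)
    errors = zeros(q)
    for f in 1:q
        test = [i for i in 1:n if folds[i] == f]
        train = [i for i in 1:n if folds[i] != f]
        b = fit(X[train, :], y[train])
        errors[f] = mse(X[test, :], y[test], b)
    end
    return sum(errors) / q
end

# Stand-in for IHT: ordinary least squares on noiseless toy data.
least_squares(Xt, yt) = Xt \ yt
X_demo = reshape(collect(1.0:6.0), 6, 1)
y_demo = 2.0 * collect(1.0:6.0)
demo_error = cv_error(least_squares, X_demo, y_demo, 3)
```

Selecting a model size then amounts to repeating `cv_error` for each candidate $k$ (with a $k$-sparse fit in place of `least_squares`) and keeping the $k$ with the smallest average error.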

### Analysis keywords available to users 

| Keyword | Default Value | Allowed value | Description |
| --- | --- | --- | --- |
|`predictors` | 0 | Positive integer | Max number of non-zero entries of $\beta$ |
|`non_genetic_covariates` | "" | File name on disk | Delimited file containing the non-genetic covariates for each sample |
|`run_cross_validation` | false | boolean | Whether the user wants to run cross-validation |
|`model_sizes` | "" | Integers stored in a string, separated by ',' | Model sizes on which the user wishes to run IHT |
|`cv_folds` | 0 | Positive integer | Number of disjoint subsets the samples should be divided into |
|`max_groups` (\*) | 1 | Integer | Total number of groups |
|`group_membership` (\*) | "" | File name on disk | File indicating group membership |
|`prior_weights` (\*) | "" | maf | How to scale predictors based on different weights |

+ (\*) Indicates experimental features. We currently have no theoretical guarantees on their performance, so illustrations of these features are omitted from this tutorial. Users should tread carefully with them.
+ A list of OpenMendel keywords common to most analysis packages can be found [here](https://openmendel.github.io/MendelBase.jl/#keywords-table)

# Example 1: Run IHT with Only Genotype Data

### Step 1: Preparing Input files

In OpenMendel, all analysis parameters are specified via the [Control file](https://openmendel.github.io/MendelBase.jl/#control-file). Genotype data must be supplied in the PLINK binary format. The most basic control file for running IHT looks like the following:

In [2]:
;cat "gwas 1 Control basic.txt"

#
# Input and Output files.
#
plink_input_basename = gwas 1 data

#
# Analysis parameters for IHT option.
#
predictors = 2

### Step 2: Run MendelIHT

To run `MendelIHT`, execute the following in the Julia REPL or in this notebook:

In [3]:
using IHT
MendelIHT("gwas 1 Control basic.txt") # change directory as necessary

 
 
     Welcome to OpenMendel's
      IHT analysis option
        version 0.4.0
 
 
Reading the data.

The current working directory is "/Users/biona001/.julia/v0.6/IHT/docs/MendelIHT_tutorial".

Keywords modified by the user:

  affected_designator = 2
  control_file = gwas 1 Control basic.txt
  pedigree_file = gwas 1 data.fam
  plink_input_basename = gwas 1 data
  predictors = 2
  snpdata_file = gwas 1 data.bed
  snpdefinition_file = gwas 1 data.bim
 


INFO: Reading in data
INFO: v1.0 BED file detected
INFO: Analyzing the data for model size k = 2

IHT results:

Compute time (sec):     2.152526323
Final loss:             1161.951128229144
Iterations:             12
Max number of groups:   1
Max predictors/group:   2
IHT estimated 2 nonzero coefficients.
2×3 DataFrames.DataFrame
│ Row │ Group │ Predictor │ Estimated_β │
├─────┼───────┼───────────┼─────────────┤
│ 1   │ 1     │ 3981      │ 0.149048    │
│ 2   │ 1     │ 7023      │ 0.272921    │

Intercept of model = 0.1405364595125695


### Step 3: Interpreting the results

Here the estimated model consists of the 3981st and 7023rd predictors, corresponding to rs1935681 and rs2256412 in the `gwas 1 data.bim` file, which are the correct SNPs. The intercept of the model is given below the table. The compute time covers only the computation of the optimal $\beta$; it does not include other necessary steps, such as importing the data.
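
The predictor numbers in the results table can be translated into SNP names by reading the matching lines of the `.bim` file, whose second whitespace-separated column holds the variant ID. The helper below is a hypothetical sketch, not part of IHT.jl, and it assumes that predictor numbering follows the row order of the `.bim` file. The temporary file and its IDs (`rsA`, `rsB`, `rsC`) are made up so that the block runs on its own.

```julia
# Return the variant ID (2nd column) found on each requested line of a
# PLINK .bim file. Assumes predictor i corresponds to line i of the file.
function snp_ids(bimfile, indices)
    lines = readlines(bimfile)
    return [String(split(lines[i])[2]) for i in indices]
end

# Tiny made-up .bim file: chromosome, variant ID, cM position,
# base-pair coordinate, allele 1, allele 2.
path, io = mktemp()
write(io, "1\trsA\t0\t100\tA\tG\n1\trsB\t0\t200\tC\tT\n1\trsC\t0\t300\tA\tC\n")
close(io)

ids = snp_ids(path, [1, 3])
```

For this tutorial the analogous call would be `snp_ids("gwas 1 data.bim", [3981, 7023])`, run from the directory that contains the data.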

**Note:** the `affected_designator = 2` simply indicates that the pedigree is a PLINK `.fam` file, which must always be the case for `MendelIHT` because this analysis option only accepts the PLINK binary format as input.

# Example 2: Including Non-Genetic Covariates

Non-genetic covariates must be stored in a comma-delimited file (or a file with another delimiter declared via `field_separator`, as in Step 2 below), with one row per sample. The intercept term (i.e. grand mean) must also be included in the file. If the user does not specify a non-genetic covariate file, `MendelIHT` will by default include an intercept in the estimated model.

### Step 1: Prepare Non-Genetic Covariate File

In this example, we generated one non-genetic covariate from a $N(0, 1)$ distribution. After including the grand mean, we saved the file as `gwas 1 noncov.txt`, with entries separated by a tab. The first few lines of this file look like the following:

In [4]:
;head -10 "gwas 1 noncov.txt"

1	-0.088704513339476
1	-0.9575873240069772
1	-0.9713258274139007
1	-0.9847900613424241
1	-0.5954781589540936
1	0.2124813875751884
1	2.28150775802523
1	1.7643235366779797
1	-0.3933262467789896
1	-0.1348394065324508
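
A file with this layout (a column of ones for the grand mean, then the covariate, separated by a tab) could be produced along the following lines. This is a sketch rather than the script used for the tutorial: `write_covariates` is a hypothetical helper, and the three covariate values are placeholders standing in for draws such as `randn(n)`.

```julia
# Write one row per sample: the intercept (1), a tab, then the covariate.
function write_covariates(path, covariate)
    open(path, "w") do io
        for value in covariate
            println(io, 1, '\t', value)
        end
    end
end

# Placeholder values; in practice this would hold one entry per sample.
covariate = [-0.5, 0.25, 1.5]
outfile = tempname()
write_covariates(outfile, covariate)
written = readlines(outfile)
```

The number of rows written must equal the number of samples in the `.fam` file, in the same order.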


### Step 2: Prepare Corresponding Control File 

We need to tell MendelIHT that the covariates are separated by tabs. This can be specified via the [MendelBase](https://openmendel.github.io/MendelBase.jl/) keyword `field_separator` in the control file as follows:

In [5]:
;cat "gwas 1 Control nongen.txt"

#
# Input and Output files.
#
plink_input_basename = gwas 1 data
non_genetic_covariates = gwas 1 noncov.txt
field_separator = '	'
#
# Analysis parameters for IHT option.
#
predictors = 2

### Step 3: Run IHT 

In [6]:
using IHT
MendelIHT("gwas 1 Control nongen.txt") # change directory as necessary

 
 
     Welcome to OpenMendel's
      IHT analysis option
        version 0.4.0
 
 
Reading the data.

The current working directory is "/Users/biona001/.julia/v0.6/IHT/docs/MendelIHT_tutorial".

Keywords modified by the user:

  affected_designator = 2
  control_file = gwas 1 Control nongen.txt
  field_separator = 	
  non_genetic_covariates = gwas 1 noncov.txt
  pedigree_file = gwas 1 data.fam
  plink_input_basename = gwas 1 data
  predictors = 2
  snpdata_file = gwas 1 data.bed
  snpdefinition_file = gwas 1 data.bim
 


INFO: Reading in data
INFO: v1.0 BED file detected
INFO: Analyzing the data for model size k = 2

IHT results:

Compute time (sec):     1.528576778
Final loss:             1161.9471299575046
Iterations:             12
Max number of groups:   1
Max predictors/group:   2
IHT estimated 2 nonzero coefficients.
2×3 DataFrames.DataFrame
│ Row │ Group │ Predictor │ Estimated_β │
├─────┼───────┼───────────┼─────────────┤
│ 1   │ 1     │ 3981      │ 0.149032    │
│ 2   │ 1     │ 7023      │ 0.272937    │

Intercept of model = 0.14055574614433


**Remark:** Observe that the resulting error and model changed very little, because the covariate we added is white noise.

# Example 3: Cross Validation

In this example, users can run IHT on any number of models in at most 8 parallel threads. Empirically, running on $n$ threads achieves roughly an $n/2$-fold speedup. Note that running $q$-fold cross validation on $r$ different models entails running IHT $q \times r$ times.

### Step 0: IMPORTANT To take advantage of multithreading, shut down this notebook and Julia, and open a terminal.

### Step 1: IMPORTANT Execute the following line in the terminal BEFORE starting the notebook (or Julia REPL)
export JULIA_NUM_THREADS=8

### Step 2: Start the notebook in the same terminal window and verify that it is indeed running with 8 threads:
Note that if your computer supports fewer than 8 threads, Julia will default to the largest number it can run.

In [7]:
Threads.nthreads()

8
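
To illustrate why the thread count matters, the sketch below spreads the $q \times r$ independent (fold, model size) fits over the available threads with `Threads.@threads`. It is only a schematic: `run_all!` is a hypothetical name, the loop body records a placeholder instead of fitting IHT, and MendelIHT's internal scheduling may differ.

```julia
# Visit every (fold, model size) pair once, in parallel across threads.
# Each iteration writes to its own cell, so no two threads share a target.
function run_all!(done)
    q, r = size(done)
    Threads.@threads for task in 1:(q * r)
        fold = (task - 1) % q + 1
        model = (task - 1) ÷ q + 1
        # A real run would fit IHT for this (fold, model) pair here.
        done[fold, model] = 1
    end
    return done
end

# 5 folds and 10 candidate model sizes give 50 independent fits.
done = run_all!(zeros(Int, 5, 10))
```

With a single thread the loop simply runs sequentially, so the same code works whether or not `JULIA_NUM_THREADS` was set.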

### Step 3: Specify the model sizes

The model sizes should be inside quotes and separated by commas, specified via the keyword `model_sizes`. Each entry must be an integer. In this example, we run IHT for model sizes $k = 1, 2, ..., 10$ with 5 folds. This is equivalent to running IHT 50 separate times, and hence is ideal for parallel computing.

In [8]:
;cat "gwas 1 Control cv.txt"

#
# Input and Output files.
#
plink_input_basename = gwas 1 data
#
# Cross Validation parameters
#
run_cross_validation = true
cv_folds = 5
model_sizes = "1,2,3,4,5,6,7,8,9,10"
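
The `model_sizes` value is simply a comma-separated list of integers. As a small sketch (this is not MendelIHT's own parser, and the variable names are hypothetical), such a string can be built from a range and read back as follows, which also makes the total number of IHT fits explicit:

```julia
sizes = collect(1:10)                    # candidate model sizes k
model_sizes = join(sizes, ",")           # text to place inside the quotes
parsed = [parse(Int, s) for s in split(model_sizes, ",")]

cv_folds = 5
total_fits = cv_folds * length(parsed)   # q folds times r model sizes
```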


### Step 4: Run Cross Validation to find best model size

In [9]:
using IHT
MendelIHT("gwas 1 Control cv.txt") # change directory as necessary

 
 
     Welcome to OpenMendel's
      IHT analysis option
        version 0.4.0
 
 
Reading the data.

The current working directory is "/Users/biona001/.julia/v0.6/IHT/docs/MendelIHT_tutorial".

Keywords modified by the user:

  affected_designator = 2
  control_file = gwas 1 Control cv.txt
  cv_folds = 5
  model_sizes = 1,2,3,4,5,6,7,8,9,10
  pedigree_file = gwas 1 data.fam
  plink_input_basename = gwas 1 data
  run_cross_validation = true
  snpdata_file = gwas 1 data.bed
  snpdefinition_file = gwas 1 data.bim
 


INFO: Reading in data
INFO: v1.0 BED file detected



Crossvalidation Results:
k	MSE
1	0.5866668270376938
2	0.66585824439109
3	0.712315413745251
4	0.7283336243892302
5	0.6555881430294945
6	0.6297867677979846
7	0.6173827609661935
8	0.6405844374720657
9	0.6627558645805942
10	0.6660236298886482

The lowest MSE is achieved at k = 1


INFO: Running 5-fold cross validation on the following model sizes:
1,2,3,4,5,6,7,8,9,10.
Ignoring keyword predictors.

### Step 5: Re-run ordinary IHT on the best model size

According to our cross validation result, the model size that minimizes the out-of-sample error (i.e. the MSE on the held-out subset of samples) is $k = 1$. Using this information, one can re-run the IHT code to obtain the estimated model. Importantly, cross validation in this case **did not** return the true model size $k = 2$. This reflects the general tendency of IHT to *under*estimate the true model size, as highlighted in our [previous tutorial](https://github.com/klkeys/IHT.jl/blob/master/docs/IHT_tutorial_full.ipynb). We are currently simulating datasets of varying model sizes to examine the behavior of IHT.
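
The selection rule is just an argmin over the cross validation table. The sketch below copies the MSE values from the output above into a vector (the variable names are hypothetical) and picks the model size with the smallest error:

```julia
ks = collect(1:10)
cv_mse = [0.5866668270376938, 0.66585824439109, 0.712315413745251,
          0.7283336243892302, 0.6555881430294945, 0.6297867677979846,
          0.6173827609661935, 0.6405844374720657, 0.6627558645805942,
          0.6660236298886482]

# findmin returns the smallest value together with its position.
best_mse, best_index = findmin(cv_mse)
best_k = ks[best_index]
```

Here `best_k` is 1, matching the line "The lowest MSE is achieved at k = 1" in the output.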

In [10]:
;cat "gwas 1 Control basic2.txt"

#
# Input and Output files.
#
plink_input_basename = gwas 1 data

#
# Analysis parameters for IHT option.
#
predictors = 1

In [11]:
MendelIHT("gwas 1 Control basic2.txt") # change directory as necessary

 
 
     Welcome to OpenMendel's
      IHT analysis option
        version 0.4.0
 
 
Reading the data.

The current working directory is "/Users/biona001/.julia/v0.6/IHT/docs/MendelIHT_tutorial".

Keywords modified by the user:

  affected_designator = 2
  control_file = gwas 1 Control basic2.txt
  pedigree_file = gwas 1 data.fam
  plink_input_basename = gwas 1 data
  predictors = 1
  snpdata_file = gwas 1 data.bed
  snpdefinition_file = gwas 1 data.bim
 


INFO: Reading in data
INFO: v1.0 BED file detected
INFO: Analyzing the data for model size k = 1

IHT results:

Compute time (sec):     1.53561531
Final loss:             1186.372885450256
Iterations:             12
Max number of groups:   1
Max predictors/group:   1
IHT estimated 1 nonzero coefficients.
1×3 DataFrames.DataFrame
│ Row │ Group │ Predictor │ Estimated_β │
├─────┼───────┼───────────┼─────────────┤
│ 1   │ 1     │ 7023      │ 0.276207    │

Intercept of model = 0.14054243272581285


### Step 6 (optional): Run different model sizes without cross validation

Instead of partitioning the samples into $q$ disjoint sets, IHT optionally allows users to fit models of various sizes on all the samples. This happens if the user provides `model_sizes` but does not set `run_cross_validation` to true. An example is as follows:

In [12]:
;cat "gwas 1 Control userpath.txt"

#
# Input and Output files.
#
plink_input_basename = gwas 1 data
#
# Analysis parameters for IHT option.
#
predictors = 2
model_sizes = "1,2,3,4,5,6,7,8,9,10"


In [13]:
using IHT
models, model_errors = MendelIHT("gwas 1 Control userpath.txt")

 
 
     Welcome to OpenMendel's
      IHT analysis option
        version 0.4.0
 
 
Reading the data.

The current working directory is "/Users/biona001/.julia/v0.6/IHT/docs/MendelIHT_tutorial".

Keywords modified by the user:

  affected_designator = 2
  control_file = gwas 1 Control userpath.txt
  model_sizes = 1,2,3,4,5,6,7,8,9,10
  pedigree_file = gwas 1 data.fam
  plink_input_basename = gwas 1 data
  predictors = 2
  snpdata_file = gwas 1 data.bed
  snpdefinition_file = gwas 1 data.bim
 


INFO: Reading in data
INFO: v1.0 BED file detected
INFO: Running the following model sizes: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

([0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0], [0.140542 0.140536 … 0.140545 0.140564])

**Remark:** We can check that IHT is stable given different sparsity constraints, in the sense that parameters selected with a sparser constraint will still be selected if given a weaker constraint. Below we list the 10 models in separate columns.  

In [14]:
idx = find(models[:, end])
full(models[idx, :])

10×10 Array{Float64,2}:
 0.0       0.0        0.0        …  -0.0766181  -0.0782086  -0.0796667
 0.0       0.0        0.0           -0.077302   -0.0787067  -0.0821465
 0.0       0.0        0.0            0.0         0.0        -0.08236  
 0.0       0.149048   0.151119       0.144214    0.143389    0.144931 
 0.0       0.0        0.0            0.080625    0.0810577   0.0783766
 0.0       0.0       -0.0856503  …  -0.0869424  -0.0890821  -0.0907393
 0.0       0.0        0.0            0.0857555   0.0875377   0.086168 
 0.0       0.0        0.0           -0.0737263  -0.0751238  -0.0757259
 0.276207  0.272921   0.272561       0.264556    0.2656      0.266956 
 0.0       0.0        0.0            0.0        -0.0832995  -0.0861502
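
The nesting property mentioned in the remark can also be checked programmatically rather than by eye. The following sketch uses hypothetical helpers (`support`, `is_nested`) and small made-up matrices; it tests whether every predictor selected in one column is still selected in the next column, where each column holds the model for the next larger $k$.

```julia
# Indices of the non-zero entries of a coefficient vector.
support(v) = [i for i in 1:length(v) if v[i] != 0]

# True if the support of each column is contained in that of the next.
function is_nested(models)
    for j in 2:size(models, 2)
        prev = support(models[:, j - 1])
        curr = support(models[:, j])
        if !all(i -> i in curr, prev)
            return false
        end
    end
    return true
end

# Made-up examples: columns are models of size 1, 2, 3.
nested_toy = [0.0 0.1  0.1;
              0.3 0.3  0.3;
              0.0 0.0 -0.2]
broken_toy = [0.3 0.0;
              0.0 0.2]

nested_ok = is_nested(nested_toy)
broken_ok = is_nested(broken_toy)
```

Applied to the dense matrix shown above, `is_nested` would confirm or refute the stability claim for all 10 model sizes at once.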

# Conclusion

This notebook demonstrated some of the basic features of IHT. It would be more informative to simulate known values of $\beta$, because then we could compare reconstruction results directly. In this notebook we instead chose to use a dataset shared by the other OpenMendel notebooks, for which the values of $\beta_{true}$ are unknown. Interested readers can visit our older notebooks [here](https://github.com/klkeys/IHT.jl/blob/master/docs/IHT_tutorial_full.ipynb) and [here](https://github.com/klkeys/IHT.jl/blob/master/docs/IHT_tutorial_brief.ipynb), which additionally run IHT on numeric data. In the near future, we will release a more detailed tutorial that covers all the experimental features of IHT on GWAS data, as well as direct comparison results on simulated data.