# IHT.jl tutorial (brief)

In this tutorial we explore some of the functionality of the IHT.jl package, which implements iterative hard thresholding on floating point arrays and binary PLINK data.

The first part of the tutorial demonstrates how to handle floating point data. Later we will show how to use [PLINK binary genotype files](http://pngu.mgh.harvard.edu/~purcell/plink/binary.shtml). IHT.jl offers two parallel computational frameworks for PLINK data: multicore CPUs or massively parallel GPUs. Although this tutorial features the CPU version, instructions for GPU use are included.

This tutorial targets novice users. Advanced users are encouraged to read the full tutorial located at `~/.julia/v0.4/IHT/IHT_tutorial_full.ipynb`.

We begin by adding processors and loading libraries:

In [1]:
addprocs(5)         # for later parallel XV
using IHT, PLINK    # PLINK.jl handles PLINK files

For this tutorial, we will use a script to simulate all data and variables. The default location is given in `simpath`:

In [2]:
simpath = expanduser("~/.julia/v0.4/IHT/sim/tutorial_simulation.jl")
include(simpath)

Invoking transpose with outfile /Users/kkeys/Desktop/tbed_1.bed


Simulation complete.


The script `tutorial_simulation.jl` generates floating point data `x` with a response vector `y`.

Using `x`, we now run iterative hard thresholding to capture the best model of size `k = 10`. The function call is `L0_reg`, which returns an object `output` with the following fields:

- `output.time`, the compute time in seconds
- `output.iter`, the number of iterations taken until convergence
- `output.loss`, the residual sum of squares at convergence
- `output.beta`, the $\boldsymbol{\beta}$ vector at convergence

In [3]:
output = L0_reg(x,y,k)  # run IHT with data x, response y, and desired model size k


Compute time:   1.08264232
Final loss:     25.12432779234657
Iterations::    13
IHT estimated a vector of type Array{Float64,1} with 10 nonzeroes.
10×2 DataFrames.DataFrame
│ Row │ Predictor │ β         │
├─────┼───────────┼───────────┤
│ 1   │ 3310      │ -0.346533 │
│ 2   │ 4460      │ 0.612489  │
│ 3   │ 5861      │ 0.175425  │
│ 4   │ 9731      │ -0.494717 │
│ 5   │ 11294     │ 1.36302   │
│ 6   │ 11378     │ -0.588894 │
│ 7   │ 14217     │ 0.154568  │
│ 8   │ 15118     │ 0.0318823 │
│ 9   │ 19815     │ 0.374349  │
│ 10  │ 23295     │ -1.38392  │
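
For readers curious about what happens inside such a call, here is a minimal sketch of plain iterative hard thresholding: take a gradient step on the least squares loss, then keep only the `k` largest coefficients in magnitude. It is not the code inside `L0_reg`. The names `hard_threshold` and `toy_iht` are hypothetical, the step size is a fixed conservative choice made for this sketch, and the syntax targets Julia 1.x rather than the v0.4 syntax used in this notebook.

```julia
using LinearAlgebra

# Keep the k largest-magnitude entries of b and set the rest to zero.
function hard_threshold(b::Vector{Float64}, k::Int)
    out  = zeros(length(b))
    keep = partialsortperm(abs.(b), 1:k, rev=true)  # indices of the k largest |b[i]|
    out[keep] = b[keep]
    return out
end

# Toy IHT for the loss 0.5 * ||y - x*b||^2 with at most k nonzero coefficients.
# Assumption: a fixed step size 1 / opnorm(x)^2, chosen only for simplicity.
function toy_iht(x::Matrix{Float64}, y::Vector{Float64}, k::Int;
                 maxiter::Int=200, tol::Float64=1e-8)
    b  = zeros(size(x, 2))
    mu = 1 / opnorm(x)^2
    for iter in 1:maxiter
        grad = x' * (y - x * b)                  # negative gradient of the loss
        bnew = hard_threshold(b + mu * grad, k)  # gradient step, then projection
        if norm(bnew - b) < tol                  # stop once the iterate stabilizes
            return bnew
        end
        b = bnew
    end
    return b
end

# Tiny example: with an identity design, IHT keeps the two largest effects.
x    = Matrix{Float64}(I, 5, 5)
y    = [3.0, 0.0, -2.0, 0.1, 0.0]
bhat = toy_iht(x, y, 2)   # nonzero only in positions 1 and 3
```

The projection step is what enforces the model size: every iterate has at most `k` nonzero entries, which is why the output above reports exactly 10 nonzeroes for `k = 10`.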

Let us compare the results against the true model `b`:

In [4]:
bk = copy(output.beta)  # copy the estimated β for later use
[β[bidx] bk[bidx]]      # compare true and estimated coefficients; b, bidx contain true model

10x2 Array{Float64,2}:
 -0.341588  -0.346533 
  0.612107   0.612489 
  0.17465    0.175425 
 -0.494423  -0.494717 
  1.36306    1.36302  
 -0.58849   -0.588894 
  0.155924   0.154568 
  0.031168   0.0318823
  0.374056   0.374349 
 -1.38168   -1.38392  

Observe that IHT returns all the correct nonzero coefficients. The coefficient values themselves are fairly close to their originals. We expect this since the noise level `s = 0.1` used in the simulation does not yield a very noisy `y`. Observe what happens when we use a noisier response `y2` simulated from the same data `x`:

In [5]:
output2 = L0_reg(x, y2, k)      # run IHT with noisier response y2
bk2     = copy(output2.beta)    # copy results for later use
[β[bidx] bk[bidx] bk2[bidx]]    # compare analysis of y, y2 against truth

10x3 Array{Float64,2}:
 -0.341588  -0.346533   -0.326644
  0.612107   0.612489    0.604358
  0.17465    0.175425    0.0     
 -0.494423  -0.494717   -0.612899
  1.36306    1.36302     1.39415 
 -0.58849   -0.588894   -0.613471
  0.155924   0.154568    0.0     
  0.031168   0.0318823   0.0     
  0.374056   0.374349    0.0     
 -1.38168   -1.38392    -1.33059 

Recall that `b` is the original model, `bk` is the estimated model with low noise (`s = 0.1`), and `bk2` is the estimated model with high noise (`s = 5`). The coefficient values are less accurate, and several nonzero values are not recovered. Note that a failure to recover the correct model does not indicate that IHT model selection performance is necessarily bad; the high noise level makes accurate estimation of the model quite difficult.
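
To make the comparison concrete, the sketch below copies the first and third columns of the table above into two vectors and counts which true predictors the high-noise fit zeroed out. The variable names are chosen for this sketch, and the syntax targets Julia 1.x rather than v0.4.

```julia
# True coefficients and high-noise estimates, copied from the table above.
btrue = [-0.341588, 0.612107, 0.17465, -0.494423, 1.36306,
         -0.58849, 0.155924, 0.031168, 0.374056, -1.38168]
bk2   = [-0.326644, 0.604358, 0.0, -0.612899, 1.39415,
         -0.613471, 0.0, 0.0, 0.0, -1.33059]

missed    = findall(iszero, bk2)    # rows of the true model that IHT set to zero
recovered = findall(!iszero, bk2)   # rows of the true model that IHT kept

# Largest absolute estimation error among the recovered coefficients.
maxerr = maximum(abs.(bk2[recovered] .- btrue[recovered]))

println("missed rows: ", missed)
println("recovered:   ", length(recovered), " of ", length(btrue))
println("max error:   ", maxerr)
```

Four of the ten true predictors (rows 3, 7, 8, and 9) are lost, and all four have true coefficients below 0.4 in magnitude.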

## IHT for GWAS

IHT.jl interfaces with [PLINK.jl](https://github.com/klkeys/PLINK.jl) for GWAS analysis. PLINK.jl handles PLINK binary genotype files (BED files). In this section of the tutorial we will demonstrate GWAS analysis with IHT.jl on simulated binary genotype data. Luckily, PLINK.jl ships with some simulated binary data.

The call to IHT is the same as with the floating point `x`. Here we use simulated binary genotype data housed in `xbed` and a corresponding simulated response `ybed`:

In [6]:
output = L0_reg(xbed, ybed, k) # run IHT with BED files


Compute time:   7.346981444
Final loss:     25.07111852912545
Iterations::    13
IHT estimated a vector of type Array{Float64,1} with 10 nonzeroes.
10×2 DataFrames.DataFrame
│ Row │ Predictor │ β         │
├─────┼───────────┼───────────┤
│ 1   │ 650       │ -0.179789 │
│ 2   │ 3288      │ 1.80377   │
│ 3   │ 6035      │ 0.137861  │
│ 4   │ 6931      │ -0.854269 │
│ 5   │ 7949      │ 1.02445   │
│ 6   │ 8886      │ 0.0930143 │
│ 7   │ 14799     │ -0.439589 │
│ 8   │ 14984     │ -1.46272  │
│ 9   │ 19620     │ 0.705692  │
│ 10  │ 19872     │ 0.0107966 │

As before, let us compare the estimated model to the simulated one:

In [7]:
bk = copy(output.beta)      # copy the beta for later use
[bbed[bidxbed] bk[bidxbed]] # did we get the correct model and coefficient values?

10x2 Array{Float64,2}:
 -0.176622   -0.179789 
  1.80336     1.80377  
  0.135377    0.137861 
 -0.85557    -0.854269 
  1.022       1.02445  
  0.0903269   0.0930143
 -0.442234   -0.439589 
 -1.46398    -1.46272  
  0.705111    0.705692 
  0.0109637   0.0107966

## Crossvalidation

These exploratory efforts are admittedly not illuminating. In a realistic setting, we wouldn't know the correct model size `k = 10`. IHT handles this by crossvalidating the model size with `cv_iht`. For example, to perform 5-fold crossvalidation over a range of model sizes given by `pathidx`, we would use

In [8]:
cv_output = cv_iht(x, y2, pathidx, nfolds, folds=folds, pids=pids) # nfolds = 5, pathidx = 1:20

An IHTCrossvalidationResults object with the following results:
Minimum MSE 16.161813213014682 occurs at k = 6.
Best model β has the following nonzero coefficients:
6×2 DataFrames.DataFrame
│ Row │ Predictor │ β         │
├─────┼───────────┼───────────┤
│ 1   │ 3310      │ -0.33289  │
│ 2   │ 4460      │ 0.615219  │
│ 3   │ 9731      │ -0.60089  │
│ 4   │ 11294     │ 1.40044   │
│ 5   │ 11378     │ -0.613623 │
│ 6   │ 23295     │ -1.35266  │


IHT.jl can use PLINK files directly in parallel crossvalidation. However, the interface of `cv_iht` changes in two crucial ways. Firstly, instead of feeding the data `x` and `y` directly, we must point `cv_iht` to the locations of all data files stored on disk. Secondly, all data except the covariates must be stored in *binary* format.

Storing data in binary format is actually quite simple. The call in Julia is `write`. For example, the script `tutorial_simulation.jl` saved the response variable `ybed` to the desktop in binary format by calling

    ypath     = expanduser("~/Desktop/y.bin") # path to save response ybed to desktop
    write(open(ypath, "w"), ybed)             # "w"rite ybed to file
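
As a quick check that a file written this way contains what we expect, the following sketch writes a small `Float64` vector to a temporary file and reads it back. It uses only Base Julia I/O in Julia 1.x syntax. The temporary path and the vector are made up for the example, and the reader must know the element type and length in advance because the file holds raw bytes with no header.

```julia
y    = [0.5, -1.25, 3.0]          # toy response vector
path = tempname()                 # temporary file, not the tutorial's ypath

# Write the raw bytes of y; the do-block closes the file handle afterwards.
open(path, "w") do io
    write(io, y)
end

# Read the bytes back into a preallocated vector of the same type and length.
yread = Vector{Float64}(undef, length(y))
open(path, "r") do io
    read!(io, yread)
end

nbytes = filesize(path)           # 3 values * 8 bytes each
rm(path)                          # clean up the temporary file
```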
 
The script also generated the correct filepaths to the simulated genotype data stored in the PLINK.jl module. Now let us run the crossvalidation routine with binary data:

In [9]:
srand(2016) # reset seed before crossvalidation to make reproducible results
cv_bed  = cv_iht(xpath, covpath, ypath, pathidx, folds, nfolds, pids=pids)

Invoking transpose with outfile /Users/kkeys/Desktop/tbed_4.bed
Invoking transpose with outfile /Users/kkeys/Desktop/tbed_2.bed
Invoking transpose with outfile /Users/kkeys/Desktop/tbed_5.bed
Invoking transpose with outfile /Users/kkeys/Desktop/tbed_6.bed
Invoking transpose with outfile /Users/kkeys/Desktop/tbed_3.bed
Invoking transpose with outfile /Users/kkeys/Desktop/tbed_1.bed


An IHTCrossvalidationResults object with the following results:
Minimum MSE 0.006067673766979373 occurs at k = 10.
Best model β has the following nonzero coefficients:
10×2 DataFrames.DataFrame
│ Row │ Predictor │ β         │
├─────┼───────────┼───────────┤
│ 1   │ 650       │ -0.179729 │
│ 2   │ 3288      │ 1.804     │
│ 3   │ 6035      │ 0.137926  │
│ 4   │ 6931      │ -0.854346 │
│ 5   │ 7949      │ 1.02459   │
│ 6   │ 8886      │ 0.0930185 │
│ 7   │ 14799     │ -0.439675 │
│ 8   │ 14984     │ -1.46289  │
│ 9   │ 19620     │ 0.705787  │
│ 10  │ 19872     │ 0.0109802 │


Here we see that IHT finds the correct model size. We can verify that IHT also captures the correct predictors and provides reasonably accurate estimates of the coefficients:

In [10]:
display([bidxbed cv_bed.bidx])    # compare the indices
display([bbed[bidxbed] cv_bed.b]) # compare the coefficients

10x2 Array{Int64,2}:
   650    650
  3288   3288
  6035   6035
  6931   6931
  7949   7949
  8886   8886
 14799  14799
 14984  14984
 19620  19620
 19872  19872

10x2 Array{Float64,2}:
 -0.176622   -0.179729 
  1.80336     1.804    
  0.135377    0.137926 
 -0.85557    -0.854346 
  1.022       1.02459  
  0.0903269   0.0930185
 -0.442234   -0.439675 
 -1.46398    -1.46289  
  0.705111    0.705787 
  0.0109637   0.0109802
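
To put a number on "reasonably accurate", the sketch below copies the two coefficient columns printed above and computes the largest absolute difference between them. The variable names are chosen for this sketch, and the syntax targets Julia 1.x.

```julia
# True and crossvalidated coefficients, copied from the output above.
btrue = [-0.176622, 1.80336, 0.135377, -0.85557, 1.022,
          0.0903269, -0.442234, -1.46398, 0.705111, 0.0109637]
bcv   = [-0.179729, 1.804, 0.137926, -0.854346, 1.02459,
          0.0930185, -0.439675, -1.46289, 0.705787, 0.0109802]

abserr = abs.(bcv .- btrue)       # elementwise absolute error
worst  = argmax(abserr)           # row with the largest error
println("largest absolute error: ", abserr[worst], " in row ", worst)
```

Every estimate lies within 0.005 of its true value, with the largest gap in the first row.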

## GPUs

As mentioned before, IHT.jl can use GPUs to accelerate computations. This is an advanced topic, so we will only outline its use. We will not execute any code here since we cannot assume that the user has a dedicated GPU on their machine.

Using GPUs with IHT.jl requires both a `BEDFile` object and a filepath to the GPU kernels. The GPU kernels ship with the PLINK.jl package in the package subdirectory `/src/kernels`. Assuming that a suitable GPU device is available on the machine, one can call `L0_reg` with GPU facilities using

    kernfile = open(readall, expanduser("~/.julia/v0.4/PLINK/src/kernels/iht_kernels64.cl")) # read Float64 kernels
    output   = L0_reg(xbed, ybed, k, kernfile) # use GPU for L0_reg

The previous code creates a long string object called `kernfile` that contains all of the OpenCL kernel code for the GPU.
It then calls `L0_reg` with a GPU.

Using GPUs in crossvalidation merely entails adding a `kernfile` argument to `cv_iht` between the `pathidx` and `folds` arguments. More explicitly, the GPU call is

    cv_bed = cv_iht(xpath, covpath, ypath, pathidx, kernfile, folds, nfolds, pids=pids)

Before we close, we must remove the GWAS response located at `ypath = ~/Desktop/y.bin`:

In [11]:
rm(ypath)

We hope that this tutorial is useful. Interested readers can consult the following references for further study:

- Kevin L. Keys, Gary K. Chen, Kenneth Lange. (2016) *Iterative Hard Thresholding for Model Selection in Genome-Wide Association Studies*. [(arXiv)](http://arxiv.org/abs/1608.01398)
- Thomas Blumensath and Mike E. Davies. (2010) "Normalized Iterative Hard Thresholding: Guaranteed Stability and Performance". *IEEE Journal of Selected Topics in Signal Processing* **4**:2, 298-309. [(pdf)](http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=5419091) [(preprint)](http://www.personal.soton.ac.uk/tb1m08/papers/BD_NIHT09.pdf)
- Thomas Blumensath and Mike E. Davies. (2009) "Iterative Hard Thresholding for Compressed Sensing". *Applied and Computational Harmonic Analysis* **27**:3, 265-274. [(pdf)](http://www.sciencedirect.com/science/article/pii/S1063520309000384) [(arXiv)](http://arxiv.org/abs/0805.0510)