# IHT.jl tutorial

In this tutorial we explore some of the functionality of the IHT.jl package, which implements iterative hard thresholding on floating point arrays and binary PLINK data.

The first part of the tutorial demonstrates how to handle floating point data. Later we will show how to use [PLINK binary genotype files](http://pngu.mgh.harvard.edu/~purcell/plink/binary.shtml). IHT.jl offers two parallel compute frameworks for PLINK data: multicore CPUs or massively parallel GPUs. This tutorial will only demonstrate the CPU version, but instructions for GPU use are included.

IHT minimizes the residual sum of squares $\frac{1}{2} \| \boldsymbol{y} - \boldsymbol{X} \boldsymbol{\beta} \|_2^2$ for the data matrix $\boldsymbol{X}$, response vector $\boldsymbol{y}$, and coefficient vector $\boldsymbol{\beta}$, subject to the constraint that $\boldsymbol{\beta}$ is $k$-sparse (at most $k$ nonzero entries). This is a sparse regression problem.
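To make the idea concrete, here is a minimal sketch of the textbook IHT iteration $\boldsymbol{\beta}^{+} = P_k\left(\boldsymbol{\beta} + \mu \boldsymbol{X}^T (\boldsymbol{y} - \boldsymbol{X} \boldsymbol{\beta})\right)$, where $P_k$ keeps the $k$ largest entries in magnitude and zeroes the rest. It is not the code behind `L0_reg`: the helper names `project_k` and `iht_sketch`, the fixed step size $\mu = 1 / \|\boldsymbol{X}\|_2^2$, and the stopping rule are all assumptions made for illustration. The sketch is written for Julia 1.x, whereas the cells in this tutorial target Julia v0.4.

```julia
using LinearAlgebra, Random

# Keep the k largest-magnitude entries of v and zero out the rest.
function project_k(v::AbstractVector, k::Integer)
    out = zeros(eltype(v), length(v))
    keep = partialsortperm(abs.(v), 1:k, rev=true) # indices of the top-k magnitudes
    out[keep] = v[keep]
    return out
end

# Plain IHT with a fixed step size (hypothetical helper, not IHT.jl's L0_reg).
function iht_sketch(X::AbstractMatrix, y::AbstractVector, k::Integer;
                    maxiter::Int=500, tol::Float64=1e-8)
    mu = 1 / opnorm(X)^2                 # assumed step size: inverse squared spectral norm
    beta = zeros(size(X, 2))
    for iter in 1:maxiter
        grad = X' * (y - X * beta)       # negative gradient of the loss
        next = project_k(beta + mu * grad, k)
        done = norm(next - beta) <= tol * (1 + norm(beta))
        beta = next
        done && break
    end
    return beta
end

# Small simulated problem, much smaller than the one used in this tutorial.
Random.seed!(2016)
n, p, k = 500, 50, 3
X = randn(n, p)
btrue = zeros(p)
btrue[[5, 17, 42]] = [1.5, -2.0, 1.0]
y = X * btrue + 0.1 * randn(n)

bhat = iht_sketch(X, y, k)
support = findall(!iszero, bhat)         # indices of the selected predictors
```

The projection step is what distinguishes IHT from ordinary gradient descent: every iterate has exactly `k` nonzero entries, so the model size is controlled directly rather than through a penalty parameter.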

Let us start by defining several simulation parameters:

In [1]:
addprocs(5) # for later parallel XV
using IHT, PLINK # PLINK.jl handles PLINK files
n = 5000   # number of cases
p = 23999 # number of predictors
k = 10  # number of true predictors
s = 0.1 # standard deviation of noise
srand(2016)
x_temp = randn(n,p) # data

5000x23999 Array{Float64,2}:
 -1.21187    -0.0372144   0.90941     …  -1.16745    -0.106107   0.492012
  0.429356    2.44578    -0.638042        1.64271    -0.438327  -1.05184 
  1.47077     0.404689   -0.597546       -0.0787781   0.622279  -0.931525
 -0.139799   -0.103057    0.163683       -2.75938     0.204186  -0.497543
 -0.494523   -0.28781    -0.00782332      1.03826    -0.85644    0.496935
 -1.05075     0.523897   -1.27745     …  -0.849481   -0.281006  -0.236835
 -0.974419   -1.27537     0.558627       -0.238681   -0.635668  -1.05491 
 -1.10666    -0.98015    -0.784369        0.0204989   1.22107    1.21211 
 -2.09573    -2.49759    -0.945998        0.565168    1.21262   -1.16107 
 -0.938486    0.439453    0.765375       -0.911794    0.642921   1.28155 
 -0.190322    0.388568   -1.27155     …  -0.49365     1.50609   -0.710576
  1.17238    -0.891597   -0.669788        0.507479   -1.41907   -0.141352
 -1.91608     1.44175     0.0441058      -3.92381    -0.242271   0.36265 
  ⋮      

The random seed set above makes the simulation reproducible. Now configure a $k$-sparse $\boldsymbol{\beta}$:

In [2]:
b = zeros(p)
b[1:k] = randn(k)
shuffle!(b)
bidx = find(b)

10-element Array{Int64,1}:
  3310
  4460
  5861
  9731
 11294
 11378
 14217
 15118
 19815
 23295

Now we make a noisy response $\boldsymbol{y}$:

In [3]:
y = x_temp*b + s*randn(n)

5000-element Array{Float64,1}:
  1.3482   
 -1.72093  
 -1.15338  
 -3.49947  
  1.86919  
 -0.808959 
 -4.29945  
 -0.525047 
 -1.43292  
  0.56259  
  2.81115  
 -1.39126  
 -0.0480848
  ⋮        
  0.288666 
  0.361842 
  4.86104  
 -0.745529 
  1.27571  
 -1.86058  
 -2.45003  
 -1.07489  
  2.40557  
  0.227523 
  0.891287 
  0.64908  

Next we configure a regression problem. In this case, we need a data matrix with a grand mean included:

In [4]:
x = zeros(n,p+1)
setindex!(x, x_temp, :, 1:p)
x[:,end] = 1.0
x

5000x24000 Array{Float64,2}:
 -1.21187    -0.0372144   0.90941     …  -0.106107   0.492012  1.0
  0.429356    2.44578    -0.638042       -0.438327  -1.05184   1.0
  1.47077     0.404689   -0.597546        0.622279  -0.931525  1.0
 -0.139799   -0.103057    0.163683        0.204186  -0.497543  1.0
 -0.494523   -0.28781    -0.00782332     -0.85644    0.496935  1.0
 -1.05075     0.523897   -1.27745     …  -0.281006  -0.236835  1.0
 -0.974419   -1.27537     0.558627       -0.635668  -1.05491   1.0
 -1.10666    -0.98015    -0.784369        1.22107    1.21211   1.0
 -2.09573    -2.49759    -0.945998        1.21262   -1.16107   1.0
 -0.938486    0.439453    0.765375        0.642921   1.28155   1.0
 -0.190322    0.388568   -1.27155     …   1.50609   -0.710576  1.0
  1.17238    -0.891597   -0.669788       -1.41907   -0.141352  1.0
 -1.91608     1.44175     0.0441058      -0.242271   0.36265   1.0
  ⋮                                   ⋱                           
  1.12193    -1.51104     1.18153

Using `x`, run iterative hard thresholding to extract the best model of size `k`. The function to call is `L0_reg`, which returns an `IHTResults` object `output` with the following fields:

    - output.time => compute time in seconds
    - output.iter => iterations taken until convergence
    - output.loss => residual sum of squares at convergence
    - output.beta => beta vector at convergence

In [5]:
output = L0_reg(x,y,k);
bk = copy(output.beta) # copy the beta for later use
[b[bidx] bk[bidx]]     # did we get the correct model and coefficient values?

10x2 Array{Float64,2}:
 -0.341588  -0.346533 
  0.612107   0.612489 
  0.17465    0.175425 
 -0.494423  -0.494717 
  1.36306    1.36302  
 -0.58849   -0.588894 
  0.155924   0.154568 
  0.031168   0.0318823
  0.374056   0.374349 
 -1.38168   -1.38392  

Observe that IHT returns all the correct nonzero coefficients. The coefficient values themselves are fairly close to their originals. We expect this since `s = 0.1` does not yield a very noisy `y`. Observe what happens when we increase the noise:

In [6]:
s = 5 # very noisy!
y2 = x_temp*b + s*randn(n)
output2 = L0_reg(x, y2, k);
bk2 = copy(output2.beta)
[b[bidx] bk[bidx] bk2[bidx]] # more noisy response = less accurate estimation

10x3 Array{Float64,2}:
 -0.341588  -0.346533   -0.326644
  0.612107   0.612489    0.604358
  0.17465    0.175425    0.0     
 -0.494423  -0.494717   -0.612899
  1.36306    1.36302     1.39415 
 -0.58849   -0.588894   -0.613471
  0.155924   0.154568    0.0     
  0.031168   0.0318823   0.0     
  0.374056   0.374349    0.0     
 -1.38168   -1.38392    -1.33059 

Recall that `b` is the original model, `bk` is the estimated model with small noise (`s = 0.1`), and `bk2` is the estimated model with high noise (`s = 5`). With high noise the coefficient estimates are less accurate, and four of the ten true nonzero coefficients are not recovered. A failure to recover the correct model does not by itself indicate poor model selection performance by IHT. To analyze model selection performance properly, we need a benchmark. A later section on regularization paths and LASSO addresses this issue.
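The recovery pattern can be tallied directly from the printed table. The sketch below copies the first and third columns of that output by hand (the names `btrue` and `bnoisy` are introduced here only for illustration) and counts which true coefficients survive. It is written for Julia 1.x.

```julia
# True coefficients b[bidx] and high-noise estimates bk2[bidx], copied from the output above.
btrue  = [-0.341588, 0.612107, 0.17465, -0.494423, 1.36306,
          -0.58849, 0.155924, 0.031168, 0.374056, -1.38168]
bnoisy = [-0.326644, 0.604358, 0.0, -0.612899, 1.39415,
          -0.613471, 0.0, 0.0, 0.0, -1.33059]

recovered = findall(!iszero, bnoisy)   # positions where IHT kept the true predictor
missed    = findall(iszero, bnoisy)    # positions where IHT dropped it

println("recovered ", length(recovered), " of ", length(btrue), " true predictors")
println("largest missed |coefficient|: ", maximum(abs.(btrue[missed])))
println("smallest recovered |coefficient|: ", minimum(abs.(btrue[recovered])))
```

The missed predictors are mostly the ones with the smallest true effects, which is what one would expect when the noise standard deviation is far larger than the coefficients.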

## IHT for GWAS

IHT.jl ships with facilities for using [PLINK.jl](https://github.com/klkeys/PLINK.jl) for GWAS analysis. PLINK.jl interfaces with PLINK binary genotype files (BED files). In this section of the tutorial we will demonstrate IHT.jl GWAS facilities on simulated binary genotype data. Luckily, PLINK.jl ships with some simulated binary data.

Note that PLINK.jl requires both a compressed BED file *and* its transpose in order to ensure fast linear algebra operations. Users who wish to use their own BED-BIM-FAM files should generate a transposed BED file before using PLINK.jl.

The aforementioned simulated genotype data contains 5000 cases and 24,000 SNPs:

In [7]:
fpath = expanduser("~/.julia/v0.4/PLINK/data/")
xpath = fpath * "x_test.bed"
xtpath = fpath * "xt_test.bed"
xbed = BEDFile(xpath, xtpath);
n,p = size(xbed)

(5000,24001)

Observe that there are actually 24,001 predictors. The `BEDFile` constructor above adds a nongenetic covariate of zeros by default. In our case, we want to use the grand mean (vector of ones). Neither IHT.jl nor PLINK.jl knows this, so we add the ones manually.

In [8]:
fill!(xbed.covar.x, 1.0)
fill!(xbed.covar.xt, 1.0)

1x5000 SharedArray{Float64,2}:
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0

We also need to calculate column means and precisions. Note that the `mean!` and `prec!` functions do not know any information about the predictors themselves. In the case of the grand mean, the functions return a mean of 1 and a precision of infinity. We must manually correct this in order to not penalize the grand mean.

In [9]:
mean!(xbed) # compute means in-place
prec!(xbed) # compute precisions in-place
xbed.means[end] = 0.0 # index "end" is the position of the grand mean in xbed.means
xbed.precs[end] = 1.0 # same as above

1.0
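The need for this manual correction can be seen with a plain vector. The sketch below assumes that a precision is the reciprocal of the column standard deviation and that a column is standardized as `(x - mean) * precision`; both are assumptions about the convention, chosen because they match the "mean of 1 and precision of infinity" behavior described above. It is written for Julia 1.x.

```julia
using Statistics

grand_mean = ones(5)             # a column of ones, as in the grand mean covariate
m = mean(grand_mean)             # the mean of a constant column of ones
prec = 1 / std(grand_mean)       # the standard deviation is zero, so the reciprocal is infinite

# Standardizing with the raw values multiplies zero by infinity.
raw = (grand_mean .- m) .* prec

# The manual correction (mean 0, precision 1) leaves the column of ones intact.
fixed = (grand_mean .- 0.0) .* 1.0
```

With the raw values every entry of `raw` is `NaN`, while `fixed` is still a vector of ones, so the grand mean enters the model unchanged.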

It is generally good practice to compute means and precisions once and then save them to file. Similarly, most applications will read nongenetic covariates from file. In fact, IHT.jl will **require** means, precisions, and covariates from file for crossvalidation functions, a topic that we will discuss later. In the meantime, observe that PLINK.jl contains a constructor for `BEDFile` objects when all data are stored on disk:

    covpath = fpath * "covfile.txt"; # this is a delimited TXT file
    mpath = fpath * "means.bin"; # note: this is a BINARY file, not a TEXT file; we will explain why later
    ppath = fpath * "precs.bin"; # same as means
    x = BEDFile(xpath, xtpath, covpath, mpath, ppath); # this loads a BEDFile entirely from hard disk
    
`BEDFile` objects are designed to facilitate linear algebra operations with binary PLINK data. Currently PLINK.jl supports matrix-vector multiplications $\boldsymbol{X}^T \boldsymbol{z}$ with **dense** $\boldsymbol{z}$ and $\boldsymbol{X} \boldsymbol{\beta}$ with **sparse** $\boldsymbol{\beta}$. This is sufficient to run IHT with `BEDFile`s. Let us make a new simulated model with our `BEDFile` object:

In [10]:
bbed = SharedArray(Float64, p, pids=procs(xbed)) # a model b to use with the BEDFile
bbed[1:k] = randn(k) # random coefficients
shuffle!(bbed) # random model
bidxbed = find(bbed) # store locations of nonzero coefficients
idx = bbed .!= 0.0 # need BitArray indices of nonzeroes in b for A_mul_B
xb = A_mul_B(xbed, bbed, idx, k, pids=procs(xbed)) # compute x*b
ybed2 = xb + 0.1*randn(n) # unfortunately this yields a Vector, which we must then convert to SharedVector
ybed = convert(SharedVector{Float64}, ybed2) # our response variable with the BEDFile

5000-element SharedArray{Float64,1}:
 -0.24977 
 -0.026113
 -1.85642 
  0.381802
  4.02698 
 -4.80148 
 -5.17884 
  0.115311
  3.6929  
  0.777447
 -0.59551 
 -1.23767 
 -5.22835 
  ⋮       
 -3.24521 
 -0.962122
  3.64983 
 -3.0133  
 -2.31627 
  2.52531 
 -2.11933 
 -5.07293 
 -3.3227  
  1.25493 
 -2.34029 
 -0.60029 
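The role of the `idx` mask in the product above can be pictured with ordinary dense arrays. The following sketch does not use PLINK.jl or `A_mul_B`; it only illustrates, under that simplification, that a product with a sparse coefficient vector needs just the columns where the vector is nonzero. It is written for Julia 1.x.

```julia
using Random

Random.seed!(1)
X = randn(6, 8)                  # small dense stand-in for the genotype matrix
b = zeros(8)
b[[2, 7]] = [0.5, -1.0]          # a 2-sparse coefficient vector

idx = b .!= 0.0                  # Boolean mask of the nonzero coefficients
xb_sparse = X[:, idx] * b[idx]   # touch only the two active columns
xb_dense  = X * b                # full product, for comparison
```

Both products agree up to floating point rounding, but the masked version reads only as many columns as there are nonzero coefficients, which matters when each column must be decompressed from a BED file.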

The call to IHT is the same as with the floating point `x`:

In [11]:
output = L0_reg(xbed, ybed, k)
bk = copy(output.beta) # copy the beta for later use
[bbed[bidxbed] bk[bidxbed]]     # did we get the correct model and coefficient values?

10x2 Array{Float64,2}:
 -0.176622   -0.179789 
  1.80336     1.80377  
  0.135377    0.137861 
 -0.85557    -0.854269 
  1.022       1.02445  
  0.0903269   0.0930143
 -0.442234   -0.439589 
 -1.46398    -1.46272  
  0.705111    0.705692 
  0.0109637   0.0107966

Let us close this section with a brief discussion about GPU use. Using GPUs with IHT.jl requires both a `BEDFile` object and a filepath to the GPU kernels. The GPU kernels ship with the PLINK.jl package in the package subdirectory `/src/kernels`. Assuming that a suitable GPU device is available on the machine, one can call `L0_reg` with GPU facilities using

    kernfile = open(readall, expanduser("~/.julia/v0.4/PLINK/src/kernels/iht_kernels64.cl")) # read Float64 kernels
    output = L0_reg(xbed, ybed, k, kernfile) # use GPU for L0_reg

The previous code creates a long string object called `kernfile` that contains all of the OpenCL kernel code for the GPU.
We will not execute this code here since we cannot assume that the user has a dedicated GPU on their machine.
Note that this example assumes that the PLINK module is installed in the default library directory (`~/.julia/v0.4/`). Nonstandard library installations should point to the correct location of the kernels.

## Regularization Paths and LASSO

Until this point we have merely shown how to use IHT to compute the best model of size $k$.
Often this information alone is not terribly useful.
Ideally we would like to test several models and find the best one.
Many tools exist to conduct this model selection procedure, but the [LASSO](https://en.wikipedia.org/wiki/Lasso_(statistics)) is by far the most popular.
In this section we will use the LASSO to benchmark the model selection capabilities of IHT.
[GLMNet.jl](https://github.com/simonster/GLMNet.jl) is one implementation of the LASSO in Julia:

In [12]:
using GLMNet
srand(2016)
dfmax = 11
nlambda = 30
lambda = [1.40759, 1.34361, 1.28254, 0.66872, 0.638325, 0.515862, 0.3, 0.27, 0.265, 0.263, 0.26] # lambda values
penalty_factor = ones(size(x,2)); penalty_factor[end] = 0.0 # do not penalize grand mean
lassopath = glmnet(x, y2, lambda=lambda, penalty_factor=penalty_factor) # vector lambda ensures entry with df = 10

Least Squares GLMNet Solution Path (11 solutions for 24000 predictors in 43 passes):
11×3 DataFrames.DataFrame
│ Row │ df │ pct_dev    │ λ        │
├─────┼────┼────────────┼──────────┤
│ 1   │ 0  │ 0.0        │ 1.40759  │
│ 2   │ 1  │ 0.00350737 │ 1.34361  │
│ 3   │ 2  │ 0.0125526  │ 1.28254  │
│ 4   │ 2  │ 0.0923299  │ 0.66872  │
│ 5   │ 2  │ 0.094976   │ 0.638325 │
│ 6   │ 5  │ 0.115147   │ 0.515862 │
│ 7   │ 6  │ 0.145076   │ 0.3      │
│ 8   │ 6  │ 0.148561   │ 0.27     │
│ 9   │ 9  │ 0.149201   │ 0.265    │
│ 10  │ 10 │ 0.149532   │ 0.263    │
│ 11  │ 12 │ 0.150085   │ 0.26     │

We want the estimates of $\boldsymbol{\beta}$ with `k = 10` nonzeroes. In our case, row 10 contains the model size that we want.

In [13]:
betas = convert(Matrix{Float64}, lassopath.betas) # must convert the `betas` object to a useable form
[b[bidx] bk2[bidx] betas[bidx,10]] # b -> true model, bk2 -> previous estimate from IHT

10x3 Array{Float64,2}:
 -0.341588  -0.326644  -0.0570387
  0.612107   0.604358   0.351269 
  0.17465    0.0        0.0      
 -0.494423  -0.612899  -0.33027  
  1.36306    1.39415    1.12938  
 -0.58849   -0.613471  -0.361091 
  0.155924   0.0        0.0      
  0.031168   0.0        0.0      
  0.374056   0.0        0.0      
 -1.38168   -1.33059   -1.07278  

LASSO selects the same nonzeroes as IHT and maintains the correct sign of the coefficients. However, the coefficients are "deflated" towards zero because LASSO is a shrinkage operator. Users should consider this fact when choosing a feature selection tool. In some scenarios, such as genome-wide association studies, the coefficient values are often quite small, and shrinkage can drive these estimated coefficients to zero. In doing so, selection with the LASSO complicates accurate estimation of the statistical model.
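The shrinkage effect is easiest to see in the idealized case of a single coordinate, where the LASSO update reduces to soft thresholding. The sketch below contrasts it with hard thresholding at the same level. The functions `soft` and `hard`, the input values, and the threshold are illustrative assumptions, and real LASSO and IHT fits on correlated predictors will not reduce to these formulas exactly. It is written for Julia 1.x.

```julia
# Soft thresholding: zero out small values and pull the rest toward zero by lambda.
soft(z, lambda) = sign(z) * max(abs(z) - lambda, 0.0)

# Hard thresholding: zero out small values and leave the rest untouched.
hard(z, lambda) = abs(z) > lambda ? z : 0.0

z = [1.36, -0.59, 0.17, 0.03]    # illustrative unpenalized estimates
lambda = 0.26                    # illustrative threshold

shrunk = soft.(z, lambda)        # surviving entries lose lambda in magnitude
kept   = hard.(z, lambda)        # surviving entries keep their original values
```

Both operators discard the same two small entries here, but only the soft version biases the survivors toward zero, which mirrors the difference between the LASSO and IHT columns in the table above.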

Users should note that the call to `glmnet` returns several models because LASSO is most efficient when used to compute a set of models instead of one model. IHT.jl provides `iht_path` to mimic this behavior.
Let us compute model sizes $1, 2, \ldots, 10$ with IHT:

In [14]:
nmodels = 10
pathidx = collect(1:nmodels)
ihtbetas = iht_path(x, y2, pathidx) # note that ihtbetas is a sparse matrix...
full(ihtbetas[bidx,:]) # progression of coefficients entering model; rightmost column is k = 10

10x10 Array{Float64,2}:
 0.0       0.0       0.0        0.0       …  -0.335262  -0.337671  -0.334219
 0.0       0.0       0.0        0.0           0.614141   0.613477   0.615553
 0.0       0.0       0.0        0.0           0.0        0.0        0.0     
 0.0       0.0       0.0       -0.597181     -0.611819  -0.612644  -0.612667
 1.38657   1.39234   1.38222    1.39867       1.40115    1.40283    1.40294 
 0.0       0.0      -0.625356  -0.621444  …  -0.60589   -0.60554   -0.60714 
 0.0       0.0       0.0        0.0           0.0        0.0        0.0     
 0.0       0.0       0.0        0.0           0.0        0.0        0.0     
 0.0       0.0       0.0        0.0           0.0        0.0        0.0     
 0.0      -1.32376  -1.32592   -1.33478      -1.34528   -1.34168   -1.3408  

Comparing the rightmost column with `bk2[bidx]`, we see that the estimated nonzeroes and their coefficients are essentially the same:

In [15]:
[bk2[bidx] full(ihtbetas[bidx,10])]

10x2 Array{Float64,2}:
 -0.326644  -0.334219
  0.604358   0.615553
  0.0        0.0     
 -0.612899  -0.612667
  1.39415    1.40294 
 -0.613471  -0.60714 
  0.0        0.0     
  0.0        0.0     
  0.0        0.0     
 -1.33059   -1.3408  

`iht_path` works with `BEDFile`s too:

In [16]:
ihtbetasbed = iht_path(xbed, ybed, pathidx)
[bk[bidxbed] full(ihtbetasbed[bidxbed,10])] # here we just view k = 10 for brevity

10x2 Array{Float64,2}:
 -0.179789   -0.179741 
  1.80377     1.80398  
  0.137861    0.137905 
 -0.854269   -0.854323 
  1.02445     1.02459  
  0.0930143   0.093021 
 -0.439589   -0.439666 
 -1.46272    -1.46291  
  0.705692    0.705768 
  0.0107966   0.0108142

## Crossvalidation

These exploratory efforts are admittedly not illuminating. In a realistic setting, we wouldn't know the correct model size `k = 10`. LASSO deals with this by crossvalidating the regularization path. IHT can do the same with `cv_iht`:

In [17]:
nfolds = 5 # number of crossvalidation folds
nlambda = 50 # number of lambda values to test
pmax = n # maximum number of predictors allowed in a LASSO model; here we allow up to one predictor per case
srand(2016)
folds = IHT.cv_get_folds(y2, nfolds) # fix the crossvalidation folds; LASSO and IHT will use same fold structure
cvlasso = glmnetcv(x, y2, dfmax=nmodels, pmax=pmax, nlambda=nmodels, nfolds=nfolds, folds=folds, parallel=true)
cv_output = cv_iht(x, y2, pathidx, nfolds, folds=folds);

Both IHT and LASSO refit their best model sizes. `cv_iht` returns the best $\boldsymbol{\beta}$ directly, but we need to extract it from LASSO:

In [18]:
b_cv_lasso = cvlasso.path.betas[:,indmin(cvlasso.meanloss)]
bidx_cv_lasso = find(b_cv_lasso)
display(cv_output.bidx)
display(bidx_cv_lasso)

6-element Array{Int64,1}:
  3310
  4460
  9731
 11294
 11378
 23295

6-element Array{Int64,1}:
  3310
  4460
  9731
 11294
 11378
 23295

IHT generally (but not always!) selects a more parsimonious set of features than LASSO, though IHT consistently underestimates the true model size. In this case, IHT and LASSO return the same models. The response `y2` is remarkably noisy, so correctly estimating the true model size will be quite hard. But take a peek at the estimated best $\boldsymbol{\beta}$ coefficient values compared to the true ones:

In [19]:
bfull_cv = zeros(size(b))
bfull_cv[cv_output.bidx] = cv_output.b
[b[bidx] bfull_cv[bidx] b_cv_lasso[bidx]]

10x3 Array{Float64,2}:
 -0.341588  -0.33289   -0.0208224
  0.612107   0.615219   0.316291 
  0.17465    0.0        0.0      
 -0.494423  -0.60089   -0.2941   
  1.36306    1.40044    1.09358  
 -0.58849   -0.613623  -0.327838 
  0.155924   0.0        0.0      
  0.031168   0.0        0.0      
  0.374056   0.0        0.0      
 -1.38168   -1.35266   -1.03589  

In crossvalidating the best model size, IHT loses the smallest components but estimates the remaining coefficients reasonably well. In contrast, LASSO coefficient estimates are biased towards zero. The effect of shrinkage is very stark.

IHT.jl can leverage PLINK.jl to use PLINK files directly in parallel crossvalidation. However, the interface of `cv_iht` changes dramatically in order to accommodate the `SharedArray` constructors that enable parallel computing. The interface changes in two crucial ways. Firstly, instead of feeding the data `x` and `y` directly, we must point `cv_iht` to the locations of all data files stored on disk. Secondly, all data except the covariates must be stored in *binary* format.

Storing data in binary format is actually quite simple. The call in Julia is `write`. For example, to save means and precisions from `xbed` in binary format, we would use

    mpath = fpath * "means.bin"
    ppath = fpath * "precs.bin"
    open(io -> write(io, xbed.means), mpath, "w") # open file `mpath`, "w"rite vector `xbed.means` to it, then close it
    open(io -> write(io, xbed.precs), ppath, "w") # same for `xbed.precs`
 
For this demonstration, we exploit the precomputed means and precisions from the PLINK.jl simulated data folder. Let us save `ybed` to the desktop for the time being:

In [20]:
ypath = expanduser("~/Desktop/y.bin")
open(io -> write(io, ybed), ypath, "w") # write `ybed` and close the file so that later reads see all of the data

40000
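The returned value is the number of bytes written: 5000 `Float64` values at 8 bytes each. A raw binary file stores no element type or length, so the reader must supply both. The round trip below is a sketch under the assumption that the response is a plain `Vector{Float64}` written to a temporary directory (the tutorial itself writes a `SharedArray` to the desktop). It is written for Julia 1.x.

```julia
v = randn(5000)                              # stand-in for the response vector
path = joinpath(mktempdir(), "y.bin")        # temporary location, not ~/Desktop

# Write the raw bytes; the do-block closes the file when it finishes.
nbytes = open(path, "w") do io
    write(io, v)
end

# Read the bytes back into a preallocated vector of the same type and length.
restored = open(path, "r") do io
    read!(io, Vector{Float64}(undef, length(v)))
end
```

Because no text formatting is involved, the values read back are bit-for-bit identical to the ones written.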

Now let us run the crossvalidation routine with binary data:

In [21]:
covpath = fpath * "covfile.txt"; # this is a delimited TXT file, in this case merely the grand mean
mpath = fpath * "means.bin"; # note: this is a BINARY file, not a TEXT file, as required by the file-based interface
ppath = fpath * "precs.bin"; # same as means
cv_bed = cv_iht(xpath, xtpath, covpath, ypath, mpath, ppath, pathidx, folds, nfolds)

IHT.IHTCrossvalidationResults{Float64}([21.7460922040304,20.447256597758287,19.83430694666251,19.391307662294984,19.101888891525793,18.986393164934988,18.9670406313367,18.955692783272347,18.950535468674527,18.950461574374113],[0.0],[0],10)
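One way to read this output is to treat the first printed vector as the crossvalidation error for each model size in `pathidx`. That interpretation is an assumption, since the field names of `IHTCrossvalidationResults` are not shown here. Under it, the best model size is the position of the smallest error, which the sketch below recovers from the printed values. It is written for Julia 1.x, where `argmin` replaces the `indmin` used earlier in this tutorial.

```julia
# Values copied from the printed result above; interpreting them as
# per-model-size crossvalidation errors is an assumption.
errors = [21.7460922040304, 20.447256597758287, 19.83430694666251,
          19.391307662294984, 19.101888891525793, 18.986393164934988,
          18.9670406313367, 18.955692783272347, 18.950535468674527,
          18.950461574374113]

pathidx = collect(1:10)          # model sizes tested, as defined earlier
best = pathidx[argmin(errors)]   # model size with the smallest error
```

The errors decrease across the whole path and flatten near the end, so the selected size agrees with the final value `10` in the printed result and with the true model size used to simulate `ybed`.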