# IHT.jl tutorial

In this tutorial we explore some of the functionality of the IHT.jl package, which implements iterative hard thresholding on floating point arrays.

IHT minimizes the residual sum of squares $\frac{1}{2} \| \boldsymbol{y} - \boldsymbol{X} \boldsymbol{\beta} \|_2^2$ for data matrix $\boldsymbol{X}$, response vector $\boldsymbol{y}$, and coefficient vector $\boldsymbol{\beta}$. If $\boldsymbol{\beta}$ is constrained to be $k$-sparse, meaning that it has at most $k$ nonzero entries, then we have a sparse regression problem.
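For intuition, one standard way to attack this sparse problem is projected gradient descent: take a gradient step on the loss, then keep only the $k$ largest-magnitude coefficients. The block below is a minimal sketch of that idea in Julia 1.x syntax, not the implementation inside IHT.jl. The names `hard_threshold` and `iht_sketch`, the fixed step size, and the iteration count are all assumptions made for illustration.

```julia
using LinearAlgebra, Random

# Keep the k largest-magnitude entries of v and zero out the rest.
function hard_threshold(v::AbstractVector, k::Integer)
    out = zero(v)
    keep = partialsortperm(abs.(v), 1:k, rev=true)
    out[keep] = v[keep]
    return out
end

# Hypothetical projected-gradient loop: a gradient step on
# 0.5 * ||y - X*b||^2 followed by projection onto k-sparse vectors.
function iht_sketch(X, y, k; iters=200)
    step = 1 / opnorm(X)^2           # conservative fixed step size (assumption)
    b = zeros(size(X, 2))
    for _ in 1:iters
        grad = -X' * (y - X * b)     # gradient of the loss at b
        b = hard_threshold(b - step * grad, k)
    end
    return b
end

# Tiny noiseless example: 3 true nonzeroes among 20 predictors.
Random.seed!(1)
X = randn(200, 20)
btrue = zeros(20); btrue[[3, 8, 15]] = [2.0, -1.5, 1.0]
bhat = iht_sketch(X, X * btrue, 3)
println(findall(!iszero, bhat))      # indices of the selected predictors
```

The projection step is what enforces the model size: every iterate has at most `k` nonzeroes, so the model size is chosen directly rather than through a penalty parameter.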

Let us start by defining several simulation parameters:

In [1]:
addprocs(5) # worker processes for the parallel crossvalidation later
using IHT
n = 5000
p = 23999
k = 10
s = 0.1
x_temp = randn(n,p)

5000x23999 Array{Float64,2}:
  0.257141  -0.724777   0.859458   …   0.07989    -0.0800017   0.595581 
  0.740772   0.268289  -0.145353       0.303898   -0.0645964  -1.64504  
  0.816615  -0.688146  -0.232626       1.40207    -0.125299    0.789344 
 -1.02271   -1.67805   -1.04104       -0.225945    0.692461   -0.692703 
  0.373389   0.832239  -0.589455       0.925253    0.777032   -0.97044  
  0.78856   -1.83465    2.15891    …   0.0815397  -0.0720443  -0.24342  
  1.92936    0.346339  -0.609725      -1.8729     -1.97128    -0.762739 
 -0.181114   1.84331   -1.84085        0.584192    0.979634   -1.74861  
  0.592886  -0.564346  -0.53547        0.489543   -1.02777    -1.46408  
  0.605932   0.459408   0.583701       0.905324   -0.0502125  -0.988436 
 -0.579504  -1.52012    0.603341   …  -0.810088   -0.990011   -0.33724  
 -1.79151   -0.150712   0.459638       0.561198    1.80619     1.44635  
  0.269643   0.575173   1.00205       -0.874883    0.84332     1.9512   
  ⋮                   

Since we want our simulation to be reproducible, configure $\boldsymbol{\beta}$ with a fixed random seed:

In [2]:
srand(2016)
b = zeros(p)
b[1:k] = randn(k)
shuffle!(b)
bidx = find(b)

10-element Array{Int64,1}:
  5912
  6593
 10073
 13599
 14572
 14929
 18057
 18357
 23140
 23528

Now we make a noisy response $\boldsymbol{y}$:

In [3]:
y = x_temp*b + s*randn(n)

5000-element Array{Float64,1}:
  1.80556  
  1.04619  
  3.7542   
  1.09357  
  0.411478 
  4.64548  
  1.55792  
  2.62063  
 -8.06077  
 -1.68568  
 -0.998167 
  1.17064  
 -0.502512 
  ⋮        
 -0.558193 
 -1.39937  
 -3.71941  
 -2.14715  
  1.70349  
 -0.0725569
  5.45024  
 -2.19511  
 -3.33305  
 -0.269835 
  1.07424  
 -0.153398 

Next we configure a regression problem. In this case, we need a data matrix that includes a column of ones for the grand mean (intercept):

In [4]:
x = zeros(n,p+1)
setindex!(x, x_temp, :, 1:p)
x[:,end] = 1.0
x

5000x24000 Array{Float64,2}:
  0.257141  -0.724777   0.859458    0.395087   …  -0.0800017   0.595581   1.0
  0.740772   0.268289  -0.145353    0.339015      -0.0645964  -1.64504    1.0
  0.816615  -0.688146  -0.232626   -0.413353      -0.125299    0.789344   1.0
 -1.02271   -1.67805   -1.04104     1.03776        0.692461   -0.692703   1.0
  0.373389   0.832239  -0.589455    0.600091       0.777032   -0.97044    1.0
  0.78856   -1.83465    2.15891     0.136858   …  -0.0720443  -0.24342    1.0
  1.92936    0.346339  -0.609725   -2.13567       -1.97128    -0.762739   1.0
 -0.181114   1.84331   -1.84085     0.123631       0.979634   -1.74861    1.0
  0.592886  -0.564346  -0.53547     1.30807       -1.02777    -1.46408    1.0
  0.605932   0.459408   0.583701   -0.0100444     -0.0502125  -0.988436   1.0
 -0.579504  -1.52012    0.603341    0.38614    …  -0.990011   -0.33724    1.0
 -1.79151   -0.150712   0.459638   -1.43802        1.80619     1.44635    1.0
  0.269643   0.575173   1.00205    
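
An equivalent way to append the intercept column is horizontal concatenation. The sketch below uses Julia 1.x syntax on a small stand-in matrix so that the result is easy to inspect; the names `xsmall` and `xaug` are illustrative only.

```julia
# Small stand-in for x_temp so the result is easy to inspect.
xsmall = [1.0 2.0; 3.0 4.0; 5.0 6.0]

# Append a column of ones; its coefficient plays the role of the grand mean.
xaug = hcat(xsmall, ones(size(xsmall, 1)))

println(size(xaug))      # (3, 3)
println(xaug[:, end])    # [1.0, 1.0, 1.0]
```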

Using `x`, run iterative hard thresholding to extract the best model of size `k`. The function call is `L0_reg`, which returns an `IHTResults` object with the following fields:

- `time`: compute time
- `iter`: iterations taken
- `loss`: residual sum of squares at convergence
- `beta`: beta vector at convergence

In [5]:
output = L0_reg(x,y,k);
bk = copy(output.beta)
[b[bidx] bk[bidx]]

10x2 Array{Float64,2}:
 -0.938486  -0.936704
 -1.05075   -1.05165 
  1.47077    1.47351 
 -0.139799  -0.142716
 -1.21187   -1.212   
 -0.974419  -0.97422 
 -1.10666   -1.10694 
  0.429356   0.428181
 -2.09573   -2.09561 
 -0.494523  -0.492939

Observe that IHT recovers all ten true nonzero coefficients, and the estimated values are close to the originals. We expect this since the noise level `s = 0.1` is small relative to the coefficients, so `y` is not very noisy. Observe what happens when we increase the noise:

In [6]:
s = 10
y2 = x_temp*b + s*randn(n)
output2 = L0_reg(x, y2, k);
bk2 = copy(output2.beta)
[b[bidx] bk[bidx] bk2[bidx]]

10x3 Array{Float64,2}:
 -0.938486  -0.936704  -0.881068
 -1.05075   -1.05165   -0.993175
  1.47077    1.47351    1.23718 
 -0.139799  -0.142716   0.0     
 -1.21187   -1.212     -1.27654 
 -0.974419  -0.97422   -1.0409  
 -1.10666   -1.10694   -1.3352  
  0.429356   0.428181   0.599866
 -2.09573   -2.09561   -1.97351 
 -0.494523  -0.492939  -0.53444 

The coefficient values are less accurate, and we lost one of the true nonzeroes (the smallest in magnitude). Still, the model recovery performance of IHT is not bad. To judge it properly, we need some sort of benchmark.
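
Because the true coefficients are known in a simulation, recovery can also be quantified directly by comparing supports, that is, the sets of nonzero positions. The following is a minimal sketch in Julia 1.x syntax; the helper `support_recovery` and the toy vectors are hypothetical and not part of IHT.jl.

```julia
# Compare the support (nonzero positions) of an estimate against the truth.
function support_recovery(btrue::AbstractVector, bhat::AbstractVector)
    truth = Set(findall(!iszero, btrue))
    found = Set(findall(!iszero, bhat))
    tp = length(intersect(truth, found))   # true predictors recovered
    fp = length(setdiff(found, truth))     # spurious predictors selected
    fn = length(setdiff(truth, found))     # true predictors missed
    return (tp=tp, fp=fp, fn=fn)
end

btoy = [0.0, 1.5, 0.0, -0.2, 0.0, 2.0]
best = [0.0, 1.4, 0.3,  0.0, 0.0, 1.9]
println(support_recovery(btoy, best))   # (tp = 2, fp = 1, fn = 1)
```

Counting false positives matters here because a fixed model size hides missed predictors: an estimate with exactly `k` nonzeroes that misses a true predictor must contain a spurious one in its place.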

## Regularization Paths and LASSO
[GLMNet.jl](https://github.com/simonster/GLMNet.jl) offers one way to benchmark model recovery with IHT:

In [7]:
using GLMNet
srand(2016)
nmodels = 11
lambda = [2.28262
1.94746
1.66151
1.41755
1.20941
1.03183
0.880322
0.751062
0.640782
0.546695
0.466422]
lassopath = glmnet(x, y2, lambda=lambda) # try to test only a few models

Least Squares GLMNet Solution Path (11 solutions for 24000 predictors in 46 passes):
11×3 DataFrames.DataFrame
│ Row │ df │ pct_dev   │ λ        │
├─────┼────┼───────────┼──────────┤
│ 1   │ 0  │ 0.0       │ 2.28262  │
│ 2   │ 0  │ 0.0       │ 1.94746  │
│ 3   │ 1  │ 0.0070533 │ 1.66151  │
│ 4   │ 1  │ 0.0136703 │ 1.41755  │
│ 5   │ 4  │ 0.0230167 │ 1.20941  │
│ 6   │ 5  │ 0.039381  │ 1.03183  │
│ 7   │ 7  │ 0.0546554 │ 0.880322 │
│ 8   │ 7  │ 0.0675024 │ 0.751062 │
│ 9   │ 7  │ 0.0768536 │ 0.640782 │
│ 10  │ 9  │ 0.0843812 │ 0.546695 │
│ 11  │ 34 │ 0.0974616 │ 0.466422 │

We want the estimates of $\boldsymbol{\beta}$ with `k = 10` nonzeroes. The LASSO does not let us fix the model size directly: the number of nonzeroes depends on $\lambda$, so the path may not contain a model of the precise size that we want (a pointed advantage of IHT!). Row 10, with 9 nonzeroes, should give us a decent approximation.
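
The choice of row can be checked mechanically by picking the path entry whose `df` is closest to `k`. A small sketch in Julia 1.x syntax, with the `df` column copied by hand from the solution path table above (the variable names are illustrative):

```julia
# Degrees of freedom copied from the solution path table.
df = [0, 0, 1, 1, 4, 5, 7, 7, 7, 9, 34]
k = 10

# Row whose model size is closest to the target size k.
row = argmin(abs.(df .- k))
println(row)       # 10
println(df[row])   # 9
```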

In [8]:
betas = convert(Matrix{Float64}, lassopath.betas)
[bk2[bidx] betas[bidx,10]]

10x2 Array{Float64,2}:
 -0.881068  -0.360174
 -0.993175  -0.44382 
  1.23718    0.697118
  0.0        0.0     
 -1.27654   -0.717244
 -1.0409    -0.554452
 -1.3352    -0.79415 
  0.599866   0.046781
 -1.97351   -1.38309 
 -0.53444    0.0     

LASSO selects eight of the ten true nonzeroes here, but the coefficients are shrunk toward zero because the LASSO penalty is a shrinkage operator. Users should consider this fact when choosing a feature selection tool. In some scenarios, such as genome-wide association studies, the coefficient values are often quite small, and shrinkage can complicate accurate estimation of the statistical model.
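
The source of the shrinkage can be seen in the elementwise thresholding rules. Coordinate-wise LASSO solvers apply soft thresholding, which moves every surviving coefficient toward zero by the threshold, whereas hard thresholding leaves survivors untouched. A minimal sketch in Julia 1.x syntax; the helper names `soft_threshold` and `hard_cutoff` are hypothetical, and neither is taken from GLMNet.jl or IHT.jl.

```julia
# Soft thresholding: the proximal operator of the L1 penalty.
# Survivors are pulled toward zero by lambda.
soft_threshold(z, lambda) =
    z > lambda ? z - lambda : (z < -lambda ? z + lambda : zero(z))

# Hard thresholding: survivors are kept unchanged.
hard_cutoff(z, tau) = abs(z) > tau ? z : zero(z)

z = [-2.0, -0.25, 0.125, 1.5]
println(soft_threshold.(z, 0.5))   # [-1.5, 0.0, 0.0, 1.0]
println(hard_cutoff.(z, 0.5))      # [-2.0, 0.0, 0.0, 1.5]
```

Both rules zero out the same small entries in this example, but only the soft rule biases the large ones.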

LASSO is most efficient when used to compute a regularization path. IHT.jl provides `iht_path` to mimic this behavior.
Let us compute model sizes $1, 2, \ldots, 11$ with IHT:

In [9]:
colidx  = countnz(sub(betas, :, 10))
pathidx = collect(1:nmodels)
ihtbetas = iht_path(x, y2, pathidx) # ihtbetas is a sparse matrix
[bk2[bidx] full(ihtbetas[bidx,10])]

10x2 Array{Float64,2}:
 -0.881068  -0.887491
 -0.993175  -1.00572 
  1.23718    1.2469  
  0.0        0.0     
 -1.27654   -1.29271 
 -1.0409    -1.06223 
 -1.3352    -1.33721 
  0.599866   0.598986
 -1.97351   -1.96237 
 -0.53444   -0.545949

## Crossvalidation

These exploratory efforts are admittedly not very illuminating, since they rely on knowing the correct model size `k = 10`. In a realistic setting, we would not know it. LASSO deals with this by crossvalidating the regularization path. IHT can do the same with `cv_iht`:

In [10]:
nfolds = 5
srand(2016)
folds = IHT.cv_get_folds(y, nfolds)
cvlasso = glmnetcv(x, y2, dfmax=nmodels, nlambda=nmodels, nfolds=nfolds, folds=folds)
cv_output = cv_iht(x, y2, pathidx, nfolds);
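
The idea behind crossvalidating a path of model sizes can be sketched generically: split the observations into folds, fit each candidate size on the training folds, and average the held-out squared error. The block below is an illustration in Julia 1.x syntax under stated assumptions, not the internals of `cv_iht` or `glmnetcv`. The helpers `make_folds`, `screen_and_fit`, and `cv_mse` are hypothetical, and the fitting routine is a simple stand-in rather than IHT.

```julia
using Random, Statistics

# Assign each of n observations to one of nfolds equal-sized folds.
make_folds(rng, n, nfolds) = shuffle(rng, [mod1(i, nfolds) for i in 1:n])

# Stand-in sparse fit (NOT IHT): keep the k predictors most correlated
# with y, then solve least squares on those columns only.
function screen_and_fit(X, y, k)
    keep = partialsortperm(abs.(X' * y), 1:k, rev=true)
    b = zeros(size(X, 2))
    b[keep] = X[:, keep] \ y
    return b
end

# Mean held-out squared error for each candidate model size.
function cv_mse(fit, X, y, folds, sizes)
    nfolds = maximum(folds)
    errs = zeros(length(sizes), nfolds)
    for f in 1:nfolds
        test = folds .== f
        train = .!test
        for (i, k) in enumerate(sizes)
            b = fit(X[train, :], y[train], k)
            errs[i, f] = mean(abs2, y[test] - X[test, :] * b)
        end
    end
    return vec(mean(errs, dims=2))
end

rng = MersenneTwister(2016)
X = randn(rng, 300, 30)
btrue = zeros(30); btrue[1:4] = [3.0, -2.5, 2.0, -1.5]
y = X * btrue + 0.5 * randn(rng, 300)
folds = make_folds(rng, 300, 5)
sizes = 1:8
mses = cv_mse(screen_and_fit, X, y, folds, sizes)
println("best model size: ", sizes[argmin(mses)])
```

Reusing the same fold assignment for every method under comparison keeps the held-out errors comparable across methods.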



Both IHT and LASSO refit their best model sizes. `cv_iht` returns the best $\boldsymbol{\beta}$ directly, but we need to extract it from LASSO:

In [11]:
b_cv_lasso = cvlasso.path.betas[:,indmin(cvlasso.meanloss)]
bidx_cv_lasso = find(b_cv_lasso)
display(cv_output.bidx)
display(bidx_cv_lasso)

7-element Array{Int64,1}:
  5912
  6593
 10073
 14572
 14929
 18057
 23140

7-element Array{Int64,1}:
  5912
  6593
 10073
 14572
 14929
 18057
 23140

In this run both methods select the same seven features. More generally, IHT tends (but not always!) to select a more parsimonious set of features than LASSO, though IHT consistently underestimates the true model size. Take a peek at the estimated best $\boldsymbol{\beta}$ values compared to the true ones:

In [12]:
bfull_cv = zeros(size(b))
bfull_cv[cv_output.bidx] = cv_output.b
[b[bidx] bfull_cv[bidx] b_cv_lasso[bidx]]

10x3 Array{Float64,2}:
 -0.938486  -0.887153  -0.163926
 -1.05075   -0.985036  -0.241787
  1.47077    1.24926    0.491653
 -0.139799   0.0        0.0     
 -1.21187   -1.28114   -0.505937
 -0.974419  -1.06152   -0.365709
 -1.10666   -1.32569   -0.595549
  0.429356   0.0        0.0     
 -2.09573   -1.96175   -1.16551 
 -0.494523   0.0        0.0     

In crossvalidating the best model size, IHT loses the three smallest components but estimates the remaining coefficients reasonably well. In general, LASSO tends to grab more of the true model but also tends to pull in a lot of garbage. The effect of shrinkage is very stark: the LASSO estimates in the third column are clearly biased towards zero.