# OpenMendel Tutorial on Iterative Hard Thresholding

### Last update: 2/11/2019

### Julia version

`MendelIHT.jl` currently supports Julia 1.0 and 1.1, but it is not yet a registered package. To install it, press `]` to enter package manager mode and add the following packages:

```
add https://github.com/OpenMendel/SnpArrays.jl
add https://github.com/OpenMendel/MendelSearch.jl
add https://github.com/OpenMendel/MendelBase.jl
add https://github.com/biona001/MendelIHT.jl
```

For this tutorial you will also need a few registered packages. Add them by typing:

```
add DataFrames Distributions BenchmarkTools Random LinearAlgebra
```

For reproducibility, the computer specifications and Julia version are listed below.

In [1]:
versioninfo()

Julia Version 1.0.3
Commit 099e826241 (2018-12-18 01:34 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-3740QM CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, ivybridge)
Environment:
  JULIA_NUM_THREADS = 8


### Control files for beginning users, direct function calls for advanced users

IHT provides numerous functions that differ in input/output formats and optimization options. Users are encouraged to explore these functions by calling them directly on imported data, as demonstrated in the last two examples. For users less comfortable with programming, we also provide the option of running an analysis through a 'control file' that specifies all analysis parameters. The latter option is less flexible than the former for manipulating datasets, but it is less error prone.

### When to use Iterative Hard Thresholding

Continuous model selection is advantageous in situations where the multivariate nature of the regressors plays a significant role. As an alternative to traditional SNP-by-SNP association testing, iterative hard thresholding (IHT) performs continuous model selection on a GWAS dataset $\mathbf{X} \in \{0, 1, 2\}^{n \times p}$ and continuous phenotype vector $\mathbf{y}$ by minimizing the residual sum of squares $f(\beta) = \frac{1}{2}\|\mathbf{y} - \mathbf{X}\beta\|_2^2$ subject to the constraint that $\beta$ is $k$-sparse. This method has an edge over LASSO (which also provides continuous model selection) because IHT does not shrink estimated effect sizes. Parallel computing is offered through $q$-fold cross validation and, in the near future, through (dense genotype matrix)-(dense vector) multiplication.
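To make the idea concrete, here is a minimal sketch of the IHT iteration on a small dense problem: take a gradient step on $f(\beta)$, then keep only the $k$ entries that are largest in magnitude. It is an illustration under simplifying assumptions (a dense `Float64` matrix, a fixed step size of $1/\|\mathbf{X}\|_2^2$, and a fixed number of iterations). The names `project_k_sparse` and `toy_iht` are hypothetical and are not part of `MendelIHT.jl`, whose implementation may differ.

```julia
using LinearAlgebra, Random

# Keep the k largest-magnitude entries of v and set the rest to zero.
function project_k_sparse(v::AbstractVector, k::Integer)
    out = zero(v)
    keep = partialsortperm(abs.(v), 1:k, rev=true)
    out[keep] = v[keep]
    return out
end

# Plain IHT for least squares with a fixed step size 1/‖X‖₂² (an assumption of this sketch).
function toy_iht(X::AbstractMatrix, y::AbstractVector, k::Integer; iters::Integer=200)
    β = zeros(size(X, 2))
    μ = 1 / opnorm(X)^2
    for _ in 1:iters
        gradient = -X' * (y - X * β)              # ∇f(β) for f(β) = ½‖y - Xβ‖²
        β = project_k_sparse(β - μ * gradient, k) # gradient step, then hard thresholding
    end
    return β
end

# Small simulated problem with 3 true predictors out of 50.
Random.seed!(2019)
X = randn(200, 50)
β_true = zeros(50)
β_true[[3, 17, 42]] = [2.0, -1.5, 1.0]
y = X * β_true + 0.01 * randn(200)
β_hat = toy_iht(X, y, 3)
```

The projection leaves the retained coefficients unchanged, so nothing in the iteration pulls them toward zero. This is the sense in which IHT avoids LASSO-style shrinkage.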

### Appropriate Datasets and Example Inputs 

All genotype data **must** be stored in the [PLINK binary genotype format](https://www.cog-genomics.org/plink2/formats#bed), where the `.bim`, `.bed` and `.fam` files must all be present. Additional non-genetic covariates should be stored in a separate file (e.g. a comma-separated file). In the first 3 examples of this tutorial, we use "gwas 1 data" (github repo: [here](https://github.com/OpenMendel/MendelGWAS.jl/tree/master/docs)) to illustrate the basic usage and functionality of MendelIHT. This dataset has 2200 people and a modest 10000 simulated SNPs, with 2 SNPs, `rs1935681` and `rs2256412` (plus an additional interaction term), contributing to the response. One can obtain this dataset from the first example input of [MendelGWAS.jl](https://openmendel.github.io/MendelGWAS.jl/), or via option 24a of the free application [Mendel version 16](http://software.genetics.ucla.edu/download?package=1). To examine the robustness and accuracy of IHT, examples 4 and 5 simulate data on the spot and immediately call IHT on the simulated data.


### Missing Data

`MendelIHT` assumes there are no missing genotypes, since it uses linear algebra functions defined in [SnpArrays.jl](https://openmendel.github.io/SnpArrays.jl/latest/man/snparray/#linear-algebra-with-snparray). Therefore, you must impute missing genotypes before you use MendelIHT. SnpArrays.jl offers a naive imputation strategy; otherwise, our own software [option 23 of Mendel](http://software.genetics.ucla.edu/download?package=1) is a reasonable choice. OpenMendel will soon provide a separate package, `MendelImpute.jl`, containing new imputation strategies such as alternating least squares.
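As an illustration of what a naive strategy can look like, the sketch below replaces each missing entry of a small dense genotype matrix with the mean of the observed entries in its column. This is only an assumption-laden example on an ordinary Julia matrix: the function name `impute_column_mean` is hypothetical, and the strategy shown is not necessarily the one SnpArrays.jl implements.

```julia
using Statistics

# Replace every missing entry with the mean of the observed entries in its column.
function impute_column_mean(G::AbstractMatrix{Union{Missing, Float64}})
    out = Matrix{Float64}(undef, size(G))
    for j in 1:size(G, 2)
        col = G[:, j]
        μ = mean(skipmissing(col))      # mean over non-missing genotypes only
        out[:, j] = coalesce.(col, μ)   # fill the gaps with that mean
    end
    return out
end

# Toy genotype matrix (4 samples, 2 SNPs) with two missing entries.
G = Union{Missing, Float64}[0 2; missing 1; 2 missing; 1 1]
G_imputed = impute_column_mean(G)
```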

### Cross Validation and Regularization paths

We usually have very little information on how many SNPs affect the phenotype. In a typical GWAS, anywhere from one to thousands of SNPs could play a role. Ideally, then, we test many different models to find the best one. MendelIHT provides 2 ways to do this automatically: user-specified regularization paths, and $q$-fold [cross validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)). Users should know that, with the first method alone, increasing the number of predictors will almost always decrease the error, at the cost of overfitting. Therefore, in most practical situations, it is highly recommended to combine this method with cross validation. In $q$-fold cross validation, samples are divided into $q$ disjoint subsets; IHT fits a model on $q-1$ of those subsets, then computes the [MSE](https://en.wikipedia.org/wiki/Mean_squared_error) on the remaining subset. Each of the $q$ subsets serves as the test set exactly once. This functionality of `MendelIHT.jl` natively supports parallel computing.
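The fold mechanics can be sketched in a few lines of plain Julia. To stay self-contained, this example swaps in ordinary least squares for IHT as the model being fitted, and assigns samples to balanced folds by shuffling. The function name `toy_cv_mse` is hypothetical and is not part of `MendelIHT.jl`.

```julia
using Random, Statistics

# Average out-of-sample MSE over q folds; each fold is the test set exactly once.
function toy_cv_mse(X, y, folds, q)
    mses = zeros(q)
    for fold in 1:q
        test  = findall(f -> f == fold, folds)
        train = findall(f -> f != fold, folds)
        β = X[train, :] \ y[train]                        # least squares fit on q-1 folds
        mses[fold] = mean(abs2, y[test] - X[test, :] * β) # MSE on the held-out fold
    end
    return mean(mses)
end

Random.seed!(2019)
n, q = 100, 5
X = randn(n, 3)
y = X * [1.0, 0.0, -2.0] + 0.1 * randn(n)
folds = shuffle([mod1(i, q) for i in 1:n])   # balanced, randomly ordered fold labels
avg_mse = toy_cv_mse(X, y, folds, q)
```

Repeating this for each candidate model size and keeping the size with the lowest average MSE is what guards against the overfitting described above.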

### Analysis keywords available to users 

| Keyword | Default Value | Allowed value | Description |
| --- | --- | --- | --- |
|`predictors` | 0 | Positive integer | Max number of non-zero entries of $\beta$ |
|`non_genetic_covariates` | "" | File name on disk | Delimited file containing the non-genetic covariates for each sample |
|`run_cross_validation` | false | boolean | Whether the user wants to run cross-validation |
|`model_sizes` | "" | Integers stored in string separated by ',' | Different model sizes users wish to run IHT |
|`cv_folds` | 0 | Positive integer | Number of disjoint subsets the samples should be divided into |
|`max_groups` (\*) | 1 | Integer | Total number of groups |
|`group_membership` (\*) | "" | File name on disk | File indicating group membership |
|`prior_weights` (\*) | "" | maf | How to scale predictors based on different weights |
|`glm` (\*) | "normal" | normal, logistic, poisson | Running generalized linear models for cases when $y$ is normal, binary, or count data |

+ (\*) Indicates experimental features. We currently have no theoretical guarantees on their performance, therefore illustrations of these functionalities are omitted from this tutorial. Users should tread carefully with these features. 
+ A list of OpenMendel keywords common to most analysis packages can be found [here](https://openmendel.github.io/MendelBase.jl/#keywords-table)

# Example 1: Run IHT with Only Genotype Data

### Step 1: Preparing Input files

In OpenMendel, all analysis parameters are specified via the [Control file](https://openmendel.github.io/MendelBase.jl/#control-file). Genotype data must be supplied in the PLINK binary format. The most basic control file to run IHT looks like the following:

In [2]:
;cat "./tutorial_data/gwas_1_Control_basic.txt"

#
# Input and Output files.
#
plink_input_basename = gwas_1_data

#
# Analysis parameters for IHT option.
#
predictors = 2

### Step 2: Run MendelIHT

To run IHT with a control file, execute the following in the Julia REPL or in this notebook:

In [3]:
using MendelIHT
IHT("./tutorial_data/gwas_1_Control_basic.txt")

┌ Info: Loading DataFrames support into Gadfly.jl
└ @ Gadfly /Users/biona001/.julia/packages/Gadfly/09PWZ/src/mapping.jl:228


 
 
     Welcome to OpenMendel's
      IHT analysis option
 
 
Reading the data.

The current working directory is "/Users/biona001/.julia/dev/MendelIHT/docs/tutorial_data".

Keywords modified by the user:

  affected_designator = 2
  control_file = ./tutorial_data/gwas_1_Control_basic.txt
  pedigree_file = gwas_1_data.fam
  plink_input_basename = gwas_1_data
  predictors = 2
  snpdata_file = gwas_1_data.bed
  snpdefinition_file = gwas_1_data.bim
 


┌ Info: Reading in data
└ @ MendelIHT /Users/biona001/.julia/dev/MendelIHT/src/IHT_wrapper.jl:42
┌ Info: Running normal IHT for model size k = 2 and groups J = 1
└ @ MendelIHT /Users/biona001/.julia/dev/MendelIHT/src/IHT_wrapper.jl:113


IHT results:

Compute time (sec):     1.142543077468872
Final loss:             1183.6890803570777
Iterations:             4
Max number of groups:   1
Max predictors/group:   2
IHT estimated 2 nonzero coefficients.
2×3 DataFrames.DataFrame
│ Row │ Group │ Predictor │ Estimated_β │
│     │ Int64 │ Int64     │ Float64     │
├─────┼───────┼───────────┼─────────────┤
│ 1   │ 1     │ 3981      │ 0.147624    │
│ 2   │ 1     │ 7023      │ 0.269147    │

Intercept of model = 0.0


In [4]:
cd("../") #change back to original directory

### Step 3: Interpreting the results

Here the estimated model consists of the 3981st and 7023rd predictors, corresponding to rs1935681 and rs2256412 in the `gwas_1_data.bim` file, which are the correct SNPs. The intercept of the model is given below the table. The compute time covers only the computation of the optimal $\beta$, so it does not include other necessary steps such as importing the data.

**Note:** the `affected_designator = 2` simply indicates that the pedigree is a PLINK `.fam` file, which must always be the case for `MendelIHT` because this analysis option only accepts PLINK binary format as input.

# Example 2: Including Non-Genetic Covariates

Non-genetic covariates must be stored in a delimited file (for example, comma-separated) with one row per sample. The intercept term (i.e. grand mean) must also be included in the file. If the user does not specify a non-genetic covariate file, `MendelIHT` will include an intercept in the estimated model by default.

### Step 1: Prepare Non-Genetic Covariate File

In this example, we generated one non-genetic covariate from a $N(0, 1)$ distribution. After including the grand mean, we saved the file as `gwas_1_noncov.txt`, where the entries are separated by a tab. The first few lines of this file look like the following:

In [5]:
;head -10 "./tutorial_data/gwas_1_noncov.txt"

1	-0.088704513339476
1	-0.9575873240069772
1	-0.9713258274139007
1	-0.9847900613424241
1	-0.5954781589540936
1	0.2124813875751884
1	2.28150775802523
1	1.7643235366779797
1	-0.3933262467789896
1	-0.1348394065324508
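A file with this layout (an intercept column of ones followed by one $N(0, 1)$ covariate, separated by tabs) could be produced along the following lines. This is a sketch only: the file name `example_noncov.txt` and the seed are hypothetical, so the values will not match the tutorial file shown above.

```julia
using DelimitedFiles, Random

Random.seed!(2019)
n = 2200                                   # number of samples in gwas 1 data
covariates = [ones(n) randn(n)]            # column 1: grand mean, column 2: N(0, 1) draws
path = joinpath(mktempdir(), "example_noncov.txt")   # hypothetical file name
writedlm(path, covariates, '\t')           # tab-delimited, one row per sample

# Read the file back to confirm its shape.
check = readdlm(path, '\t')
```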


### Step 2: Prepare Corresponding Control File 

We need to tell MendelIHT that the covariates are separated by tabs. This can be specified via the [MendelBase](https://openmendel.github.io/MendelBase.jl/) keyword `field_separator` in the control file as follows:

In [6]:
;cat "./tutorial_data/gwas_1_Control_nongen.txt"

#
# Input and Output files.
#
plink_input_basename = gwas_1_data
non_genetic_covariates = gwas_1_noncov.txt
field_separator = '	'
#
# Analysis parameters for IHT option.
#
predictors = 2

### Step 3: Run IHT 

In [7]:
using MendelIHT
IHT("./tutorial_data/gwas_1_Control_nongen.txt")

 
 
     Welcome to OpenMendel's
      IHT analysis option
 
 
Reading the data.

The current working directory is "/Users/biona001/.julia/dev/MendelIHT/docs/tutorial_data".

Keywords modified by the user:

  affected_designator = 2
  control_file = ./tutorial_data/gwas_1_Control_nongen.txt
  field_separator = 	
  non_genetic_covariates = gwas_1_noncov.txt
  pedigree_file = gwas_1_data.fam
  plink_input_basename = gwas_1_data
  predictors = 2
  snpdata_file = gwas_1_data.bed
  snpdefinition_file = gwas_1_data.bim
 


┌ Info: Reading in data
└ @ MendelIHT /Users/biona001/.julia/dev/MendelIHT/src/IHT_wrapper.jl:42
┌ Info: Running normal IHT for model size k = 2 and groups J = 1
└ @ MendelIHT /Users/biona001/.julia/dev/MendelIHT/src/IHT_wrapper.jl:113


IHT results:

Compute time (sec):     0.7715671062469482
Final loss:             1183.6890803570777
Iterations:             4
Max number of groups:   1
Max predictors/group:   2
IHT estimated 2 nonzero coefficients.
2×3 DataFrames.DataFrame
│ Row │ Group │ Predictor │ Estimated_β │
│     │ Int64 │ Int64     │ Float64     │
├─────┼───────┼───────────┼─────────────┤
│ 1   │ 1     │ 3981      │ 0.147624    │
│ 2   │ 1     │ 7023      │ 0.269147    │

Intercept of model = 0.0


In [8]:
cd("../") #change back to original directory

**Remark:** Observe that the resulting error and model are exactly the same as in Example 1. The covariate we added is white noise, so IHT did not select it as a significant predictor.

# Example 3: Cross Validation

In this example, users can run IHT on any number of models in at most 8 parallel threads. Empirically, running on $n$ threads achieves roughly an $n/2$-fold speedup. Note that running $q$-fold cross validation on $r$ different models entails running IHT $q \times r$ times.
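The $q \times r$ count, and why the work parallelizes well, can be seen by enumerating the independent (model size, fold) pairs and looping over them with Julia's built-in `Threads.@threads`. This is only an illustration with placeholder work in the loop body; how `MendelIHT.jl` schedules these runs internally is not shown in this tutorial.

```julia
model_sizes = 1:10      # r = 10 candidate model sizes
num_folds   = 5         # q = 5 folds

# Every (model size, fold) pair is an independent IHT run.
jobs = [(k, fold) for k in model_sizes for fold in 1:num_folds]

# Record which thread handled each job; a real run would fit a model here instead.
handled_by = zeros(Int, length(jobs))
Threads.@threads for i in 1:length(jobs)
    handled_by[i] = Threads.threadid()
end

println("Number of independent runs: ", length(jobs))
```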

### Step 0: IMPORTANT: to take advantage of multithreading, shut down this notebook and Julia, and open a terminal.

### Step 1: IMPORTANT: execute the following line in the terminal BEFORE starting the notebook (or Julia REPL)
`export JULIA_NUM_THREADS=8`

### Step 2: Start the notebook in the same terminal window and verify that it is indeed running with 8 threads
Note that if your computer's capacity is less than 8, Julia will default to the largest number it can run.

In [9]:
Threads.nthreads()

8

### Step 3: Specify the model sizes

The model sizes should be listed inside quotes and separated by commas, specified via the keyword `model_sizes`. Each entry must be an integer. In this example, we run IHT for model sizes $k = 1, 2, ..., 10$ and 5 folds. This is equivalent to running IHT 50 times, and hence is ideal for parallel computing.

In [10]:
;cat "./tutorial_data/gwas_1_Control_cv.txt"

#
# Input and Output files.
#
plink_input_basename = gwas_1_data
#
# Cross Validation parameters
#
run_cross_validation = true
cv_folds = 5
model_sizes = "1,2,3,4,5,6,7,8,9,10"


### Step 4: Run Cross Validation to find best model size

In [11]:
using MendelIHT
IHT("./tutorial_data/gwas_1_Control_cv.txt")

 
 
     Welcome to OpenMendel's
      IHT analysis option
 
 
Reading the data.

The current working directory is "/Users/biona001/.julia/dev/MendelIHT/docs/tutorial_data".

Keywords modified by the user:

  affected_designator = 2
  control_file = ./tutorial_data/gwas_1_Control_cv.txt
  cv_folds = 5
  model_sizes = 1,2,3,4,5,6,7,8,9,10
  pedigree_file = gwas_1_data.fam
  plink_input_basename = gwas_1_data
  run_cross_validation = true
  snpdata_file = gwas_1_data.bed
  snpdefinition_file = gwas_1_data.bim
 


┌ Info: Reading in data
└ @ MendelIHT /Users/biona001/.julia/dev/MendelIHT/src/IHT_wrapper.jl:42
┌ Info: Running 5-fold cross validation on the following model sizes:
│ 1,2,3,4,5,6,7,8,9,10.
│ Ignoring keyword predictors.
└ @ MendelIHT /Users/biona001/.julia/dev/MendelIHT/src/IHT_wrapper.jl:88




Crossvalidation Results:
k	MSE
1	0.5482980579413379
2	0.540387690843733
3	0.5281201972176272
4	0.5335837402453649
5	0.5356074075522838
6	0.5377695731430505
7	0.5395404937328938
8	0.5414253752022411
9	0.5476004945198947
10	0.5496105157367333

The lowest MSE is achieved at k = 3


3

In [12]:
cd("../") #change back to original directory
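The selection rule is simply the model size with the smallest cross-validated MSE. As a small check, the snippet below copies the MSE values printed in the cross validation output above and picks the minimizer with `argmin`. The variable names are illustrative.

```julia
# MSE values copied from the cross validation output above, for k = 1, ..., 10.
model_sizes = 1:10
mse = [0.5482980579413379, 0.540387690843733, 0.5281201972176272,
       0.5335837402453649, 0.5356074075522838, 0.5377695731430505,
       0.5395404937328938, 0.5414253752022411, 0.5476004945198947,
       0.5496105157367333]

best_k = model_sizes[argmin(mse)]   # model size with the lowest out-of-sample error
println("The lowest MSE is achieved at k = ", best_k)
```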

### Step 5: Re-run ordinary IHT on the best model size

According to our cross validation result, the model size that minimizes out-of-sample error (i.e. MSE on the held-out subset of samples) is $k = 3$. That is, cross validation likely detected that we need 3 predictors (one being the interaction term, captured via the intercept) to achieve the lowest out-of-sample error. Using this information, one can re-run IHT to obtain the estimated model.

In [13]:
;cat "./tutorial_data/gwas_1_Control_basic2.txt"

#
# Input and Output files.
#
plink_input_basename = gwas_1_data

#
# Analysis parameters for IHT option.
#
predictors = 3

In [14]:
IHT("./tutorial_data/gwas_1_Control_basic2.txt")


 
 
     Welcome to OpenMendel's
      IHT analysis option
 
 
Reading the data.

The current working directory is "/Users/biona001/.julia/dev/MendelIHT/docs/tutorial_data".

Keywords modified by the user:

  affected_designator = 2
  control_file = ./tutorial_data/gwas_1_Control_basic2.txt
  pedigree_file = gwas_1_data.fam
  plink_input_basename = gwas_1_data
  predictors = 3
  snpdata_file = gwas_1_data.bed
  snpdefinition_file = gwas_1_data.bim
 


┌ Info: Reading in data
└ @ MendelIHT /Users/biona001/.julia/dev/MendelIHT/src/IHT_wrapper.jl:42
┌ Info: Running normal IHT for model size k = 3 and groups J = 1
└ @ MendelIHT /Users/biona001/.julia/dev/MendelIHT/src/IHT_wrapper.jl:113


IHT results:

Compute time (sec):     0.8421881198883057
Final loss:             1161.9511204211908
Iterations:             4
Max number of groups:   1
Max predictors/group:   3
IHT estimated 2 nonzero coefficients.
2×3 DataFrames.DataFrame
│ Row │ Group │ Predictor │ Estimated_β │
│     │ Int64 │ Int64     │ Float64     │
├─────┼───────┼───────────┼─────────────┤
│ 1   │ 1     │ 3981      │ 0.147625    │
│ 2   │ 1     │ 7023      │ 0.269147    │

Intercept of model = 0.14057571565526017


In [15]:
cd("../") #change back to original directory

# Example 4: Using simulated results to examine IHT signal reconstruction

In this example, we want to explicitly examine the ability of IHT to recover signals. Note that the following 2 examples call functions of `MendelIHT.jl` directly instead of using control files. This gives end users greater flexibility because they can directly manipulate variables.

### Step 1: Simulate data

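We first simulate a random SNP matrix $X$, whose minor allele frequencies are drawn uniformly from $(0, 0.5)$, and a known model $\beta$ with $k$ non-zero entries as follows:
$$\beta_i \sim N(0, 1) \text{ for the } k \text{ non-zero entries},$$
$$y = X\beta + \epsilon,$$
$$\epsilon \sim N(0, s),$$
$$s \in \{0.1, 1, 10, 25\}.$$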

In [16]:
#load packages
using MendelIHT
using SnpArrays
using DataFrames
using Distributions
using BenchmarkTools
using Random
using LinearAlgebra

#set random seed
Random.seed!(1111)

#simulate data
n = 5000
p = 30000
bernoulli_rates = 0.5rand(p) #minor allele frequencies are drawn from uniform (0, 0.5)
x = simulate_random_snparray(n, p, bernoulli_rates)
xbm = SnpBitMatrix{Float64}(x, model=ADDITIVE_MODEL, center=true, scale=true); 

#specify dimension and noise of data
k = 10                          # number of true predictors per group
S = [0.1, 1.0, 10.0, 25.0]      # noise vector, from very little noise to a lot of noise

#construct non-genetic covariates and true model b
z           = ones(n, 1)                   # non-genetic covariates, just the intercept
true_b      = zeros(p)                     # model vector
true_b[1:k] = randn(k)                     # Initialize k non-zero entries in the true model
shuffle!(true_b)                           # Shuffle the entries
correct_position = findall(x -> x != 0, true_b) # keep track of what the true entries are
noise = [rand(Normal(0, s), n) for s in S] # noise vectors from N(0, s) for each s ∈ S = {0.1, 1, 10, 25}

#simulate phenotypes under different noises by: y = Xb + noise
y = [zeros(n) for i in 1:length(S)]
for i in 1:length(S)
    y[i] = xbm * true_b + noise[i]
end

### Observe the noisiness of our observations. 
Each column is the same response vector with varying levels of noise added. The farther right, the noisier the data.

In [17]:
[y[1] y[2] y[3] y[4]]

5000×4 Array{Float64,2}:
  4.0369     5.57741    -4.36356    -9.6449  
  0.191536  -1.04378    -2.07723   -47.0979  
 -4.38938   -3.34411    -1.42214    24.8877  
 -0.919674   0.399415  -11.4534    -41.1765  
  2.97062    3.56937    -4.50204     4.03327 
  2.02283    2.55848    13.8248     -4.50728 
 -5.78682   -5.10249    10.6672      5.34934 
 -2.31817   -1.59809     4.346       3.5354  
 -8.3404    -9.37824   -25.9527    -39.9853  
 -7.41212   -7.01681     3.36147   -10.8692  
 -1.88005   -0.969617   -4.15115   -40.838   
 -1.24192   -1.25763    -5.5715      6.39366 
 -3.44554   -3.29327     7.4746     16.119   
  ⋮                                          
 -2.89109   -2.74476     2.29241    15.7058  
 -5.08137   -4.66423    -7.16946   -11.6489  
  9.61788    9.90731    28.5531    -23.4735  
  8.41764    9.23892     8.4852    -19.3525  
  6.27949    5.94512    18.3532      8.64536 
 -1.43935   -3.03143   -11.0977     -0.475652
 -4.25876   -3.34605   -26.7249     17.2168  
 -1.69186

### Step 2: Examine reconstruction of IHT given true model size

In [18]:
#compute IHT result for each noise level
estimated_models = [zeros(k) for _ in 1:length(y)]
for i in 1:length(y)
    result = L0_reg(x, z, y[i], 1, k)
    estimated_models[i] .= result.beta[correct_position]
end

#compare and contrast
true_model = true_b[correct_position]
compare_model = DataFrame(
    correct_position = correct_position, 
    true_β           = true_model, 
    noise_level_1    = estimated_models[1],      #N(0, 0.1)
    noise_level_2    = estimated_models[2],      #N(0, 1.0)
    noise_level_3    = estimated_models[3],      #N(0, 10.0)
    noise_level_4    = estimated_models[4])      #N(0, 25.0)

| Row | correct_position | true_β | noise_level_1 | noise_level_2 | noise_level_3 | noise_level_4 |
| --- | --- | --- | --- | --- | --- | --- |
| | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 5929 | 0.994762 | 0.995545 | 1.00746 | 0.957444 | 0.0 |
| 2 | 10164 | 0.426441 | 0.427758 | 0.468934 | 0.0 | 0.0 |
| 3 | 11676 | -1.32497 | -1.32341 | -1.33334 | -1.30104 | -1.22798 |
| 4 | 14082 | 0.38481 | 0.386729 | 0.387764 | 0.0 | 0.0 |
| 5 | 17954 | -0.628232 | -0.62823 | -0.647939 | -0.83076 | 0.0 |
| 6 | 18967 | -1.30452 | -1.30594 | -1.29257 | -1.41359 | 0.0 |
| 7 | 19304 | -1.37284 | -1.37107 | -1.37061 | -1.29244 | -1.54338 |
| 8 | 20792 | 0.920098 | 0.917775 | 0.904498 | 0.855185 | 1.28701 |
| 9 | 26349 | -0.322137 | -0.32181 | -0.300086 | 0.0 | 0.0 |
| 10 | 27146 | 2.42516 | 2.42472 | 2.42196 | 2.30561 | 1.99963 |


### Step 3: Interpret reconstruction result

IHT finds the correct 10 predictors if we add little noise. With greater noise, IHT misses predictors with smaller effect sizes, and the remaining estimates become less accurate. However, the predictors it does find exhibit no shrinkage of effect size.
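To quantify this, one can count how many of the 10 true predictors received a non-zero estimate at each noise level. The snippet below copies the estimates from the comparison table above into a plain matrix. The variable names are illustrative.

```julia
# Estimates copied from the comparison table above.
# Rows: the 10 true predictors. Columns: noise levels s = 0.1, 1, 10, 25.
estimates = [
     0.995545   1.00746    0.957444   0.0
     0.427758   0.468934   0.0        0.0
    -1.32341   -1.33334   -1.30104   -1.22798
     0.386729   0.387764   0.0        0.0
    -0.62823   -0.647939  -0.83076    0.0
    -1.30594   -1.29257   -1.41359    0.0
    -1.37107   -1.37061   -1.29244   -1.54338
     0.917775   0.904498   0.855185   1.28701
    -0.32181   -0.300086   0.0        0.0
     2.42472    2.42196    2.30561    1.99963
]

# Number of true predictors recovered (non-zero estimate) per noise level.
recovered = [count(!iszero, estimates[:, j]) for j in 1:size(estimates, 2)]
println(recovered)   # prints [10, 10, 7, 4]
```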

# Example 5: Using simulated results to examine IHT Cross Validation

### Step 0: Follow steps 0~2 in example 3 to gain access to multi-threading

In [19]:
Threads.nthreads() # check that multiple threads are running

8

### Step 1: Simulate Data (as done in Example 4 step 1)

### Step 2: Run IHT cross validation code

In [20]:
#reset seed
Random.seed!(1111)

# specify number of fold and the different model sizes (path)
path = collect(1:20)
num_folds = 5
folds = rand(1:num_folds, size(x, 1))

#run cross validation on not-so-noisy data
k_est = cv_iht(x, z, y[1], 1, path, folds, num_folds, use_maf = false, debias=false)



Crossvalidation Results:
k	MSE
1	4.087386566654006
2	3.09896674881206
3	2.235286925818996
4	1.3398166385274393
5	0.8321683827986981
6	0.4285864356097345
7	0.23545804101151324
8	0.14787569877090037
9	0.07123394984266347
10	0.010553689682578828
11	0.013180945864318533
12	0.01321454529484015
13	0.013250436428910795
14	0.013243234830655166
15	0.013259486534082293
16	0.013264444634146878
17	0.013302845545591996
18	0.013308728293452542
19	0.013326603346227688
20	0.013345946393705187

The lowest MSE is achieved at k = 10


10

In [21]:
#reset seed
Random.seed!(1111)

#run cross validation on pretty noisy data
k_est = cv_iht(x, z, y[3], 1, path, folds, num_folds, use_maf = false, debias=false)



Crossvalidation Results:
k	MSE
1	54.45327294696108
2	53.65021406936241
3	52.773114592990716
4	51.694369916592805
5	51.34685719569556
6	51.19018782283655
7	50.71206240949455
8	50.45521918533876
9	50.62425944709808
10	50.72940821688856
11	50.82354603168742
12	50.99524891915468
13	51.22568897286627
14	51.403841463500804
15	51.451138301758064
16	51.57866320166185
17	51.69300160180016
18	51.77244195146731
19	52.0392801348529
20	52.52346277706467

The lowest MSE is achieved at k = 8


8

### Step 3: Interpreting results

With little noise, $\epsilon \sim N(0, 0.1)$, cross validation finds the true sparsity parameter: $k_{\text{estimate}} = k_{\text{true}} = 10$. For very noisy data, $\epsilon \sim N(0, 10)$, cross validation returns $k_{\text{estimate}} = 8$, which slightly underestimates the true sparsity parameter.

# Conclusion

This notebook demonstrated some of the basic features of IHT. The first 3 examples illustrate the basic usage of IHT for general audiences. The last 2 examples illustrate how to make function calls directly and test the robustness of IHT on reproducible simulated data. We selectively omitted several experimental features of IHT. In the near future, we will release a more detailed notebook that includes tutorials on all the experimental features of IHT on GWAS data, as well as the motivations for them.