# Normal: compare false positives/negatives with LASSO

LASSO is currently the de facto standard penalized least squares method for [feature selection](https://en.wikipedia.org/wiki/Feature_selection). Here we compare the performance (in terms of the number of false positives and false negatives) of LASSO and IHT on GWAS data, using the `glmnet` implementation of cyclic coordinate descent for LASSO. Since the focus here is not scalability, we test on moderately sized genotype matrices of 1,000 samples and 10,000 SNPs.
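
For orientation, the two estimators being compared can be written as below. This is the standard textbook formulation for the normal model (with the $1/(2n)$ scaling that `glmnet` uses for Gaussian responses), not something taken from this notebook; $\lambda$ denotes the LASSO penalty weight and $k$ the IHT model size.

$$
\hat\beta_{\text{LASSO}} = \arg\min_{\beta} \; \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1,
\qquad
\hat\beta_{\text{IHT}} \approx \arg\min_{\beta:\ \|\beta\|_0 \le k} \; \frac{1}{2}\|y - X\beta\|_2^2 .
$$

The $\ell_1$ penalty shrinks every selected coefficient toward zero, whereas the $\ell_0$ constraint only limits how many coefficients may be non-zero. The IHT problem is non-convex, so the algorithm is best read as seeking a good $k$-sparse solution rather than a guaranteed global minimizer.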

In [1]:
using Distributed
addprocs(4)
nprocs()

5

In [2]:
using MendelIHT
using SnpArrays
using DataFrames
using Distributions
using Random
using LinearAlgebra
using DelimitedFiles
using GLM
using RCall
R"library(glmnet)"

RObject{StrSxp}
 [1] "glmnet"    "foreach"   "Matrix"    "stats"     "graphics"  "grDevices"
 [7] "utils"     "datasets"  "methods"   "base"     


# How we ran LASSO

We use the R library `glmnet` (implemented in Fortran) to run LASSO. The [RCall.jl documentation](http://juliainterop.github.io/RCall.jl/stable/gettingstarted.html) explains how to transfer variables between R and Julia. Since `glmnet` does not operate on genotype files, we first convert `x` to a `Float64` matrix and then run `glmnet`. We run IHT across 50 model sizes $\{1,2,...,50\}$ because the best model selected by LASSO averages about 25 non-zero predictors, and we doubled that number to set the largest model size. 
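
Choosing the IHT model size from the cross-validation errors over this path can be sketched as follows. This is a minimal illustration, not MendelIHT's own routine: `select_model_size` is a hypothetical helper and the error curves are made up.

```julia
# Hypothetical helper: return the model size whose cross-validation error is smallest.
function select_model_size(path::AbstractVector{<:Integer}, mses::AbstractVector{<:Real})
    length(path) == length(mses) || throw(ArgumentError("path and mses must have equal length"))
    return path[argmin(mses)]  # map the index of the minimum back to a model size
end

# Made-up error curve over the path 1:50, minimized at k = 10.
path = collect(1:50)
mses = [abs(k - 10) + 1.0 for k in path]
best_k = select_model_size(path, mses)

# The same helper also works when the path does not start at 1.
coarse_path = collect(5:5:50)
coarse_mses = [abs(k - 10) + 1.0 for k in coarse_path]
coarse_best_k = select_model_size(coarse_path, coarse_mses)
```

Indexing back into `path` matters only when the path is not `1:50`; for the path used here, the index of the minimum and the model size coincide.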

In [3]:
function iht_lasso(n::Int64, p::Int64, k::Int64, d::UnionAll, l::Link)
    #construct snpmatrix, covariate files, and true model b
    x, = simulate_random_snparray(n, p, undef)
    xbm = SnpBitMatrix{Float64}(x, model=ADDITIVE_MODEL, center=true, scale=true); 
    z = ones(n, 1) # the intercept
    x_float = [convert(Matrix{Float64}, x, center=true, scale=true) z] #Float64 version of x

    # simulate response, true model b, and the correct non-0 positions of b
    y, true_b, correct_position = simulate_random_response(x, xbm, k, d, l)
 
    #specify cross validation folds
    num_folds = 3
    folds = rand(1:num_folds, size(x, 1));

    #run glmnet via RCall
    @rput x_float y folds num_folds #make variables visible to R
    R"lasso_cv_result = cv.glmnet(x_float, y, nfolds = num_folds, foldid = folds)"
    R"lasso_beta_tmp = glmnet(x_float, y, lambda=lasso_cv_result$lambda.min)$beta"
    R"lasso_beta = as.vector(lasso_beta_tmp)"
    @rget lasso_cv_result lasso_beta #pull result from R to Julia
    lasso_k_est = count(!iszero, lasso_beta)
    
    #candidate model sizes for IHT (about twice the average size of the best LASSO model)
    path = collect(1:50);
    
    #run IHT's cross validation routine 
    mses = cv_iht_distributed(d(), l, x, z, y, 1, path, folds, num_folds, use_maf=false, debias=false, showinfo=false, parallel=true);
    iht_k_est = path[argmin(mses)] #model size with the smallest cross validation error
    iht_result = L0_reg(x, xbm, z, y, 1, iht_k_est, d(), l, debias=false, init=false, use_maf=false)
    iht_beta = iht_result.beta
        
    #show lasso and IHT's reconstruction result
    compare_model = DataFrame(
        true_β  = true_b[correct_position], 
        IHT_β   = iht_beta[correct_position],
        lasso_β = lasso_beta[correct_position])
    @show compare_model
    
    #compute true/false positives/negatives for IHT and lasso
    iht_tp = count(!iszero, iht_beta[correct_position])
    iht_fp = iht_k_est - iht_tp
    iht_fn = k - iht_tp
    lasso_tp = count(!iszero, lasso_beta[correct_position])
    lasso_fp = lasso_k_est - lasso_tp
    lasso_fn = k - lasso_tp
    
    println("IHT false positives = $iht_fp")
    println("IHT false negatives = $iht_fn")
    println("LASSO false positives = $lasso_fp")
    println("LASSO false negatives = $lasso_fn" * "\n")
    
    return iht_fp, iht_fn, lasso_fp, lasso_fn
end

iht_lasso (generic function with 1 method)
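
The false positive and false negative counts computed at the end of `iht_lasso` can be isolated in a small, dependency-free sketch. The helper name `selection_errors` and the toy vectors are hypothetical and serve only to illustrate the counting logic.

```julia
# Hypothetical helper mirroring the counting done inside `iht_lasso`.
function selection_errors(beta_hat::AbstractVector{<:Real}, true_support::AbstractVector{<:Integer})
    selected  = count(!iszero, beta_hat)                # number of predictors the method kept
    true_pos  = count(!iszero, beta_hat[true_support])  # kept predictors that are truly non-zero
    false_pos = selected - true_pos                     # kept, but not in the true model
    false_neg = length(true_support) - true_pos         # in the true model, but dropped
    return (fp = false_pos, fn = false_neg)
end

# Toy example: 8 predictors with true signals at positions 2, 5 and 7.
true_support = [2, 5, 7]
beta_hat = [0.0, 1.3, 0.0, 0.4, -0.9, 0.0, 0.0, 0.0]  # finds 2 and 5, misses 7, adds 4
errors = selection_errors(beta_hat, true_support)
```

In this toy case the estimate has one false positive (position 4) and one false negative (position 7).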

# Normal response

In [8]:
#simulate data with k true predictors, from distribution d and with link l.
n = 1000
p = 10000
k = 10
d = Normal
l = canonicallink(d())

#set random seed
Random.seed!(2019)

#run function above, saving results in 4 vectors
total_runs = 50
iht_false_positives = zeros(total_runs)
iht_false_negatives = zeros(total_runs)
lasso_false_positives = zeros(total_runs)
lasso_false_negatives = zeros(total_runs)
for i in 1:total_runs
    println("current run = $i")
    iht_fp, iht_fn, lasso_fp, lasso_fn = iht_lasso(n, p, k, d, l)
    iht_false_positives[i] = iht_fp
    iht_false_negatives[i] = iht_fn
    lasso_false_positives[i] = lasso_fp
    lasso_false_negatives[i] = lasso_fn
end

current run = 1
compare_model = 10×3 DataFrame
│ Row │ true_β   │ IHT_β     │ lasso_β   │
│     │ Float64  │ Float64   │ Float64   │
├─────┼──────────┼───────────┼───────────┤
│ 1   │ -1.29964 │ -1.24117  │ -1.14111  │
│ 2   │ -0.2177  │ -0.234676 │ -0.124249 │
│ 3   │ 0.786217 │ 0.820139  │ 0.710232  │
│ 4   │ 0.599233 │ 0.583405  │ 0.481213  │
│ 5   │ 0.283711 │ 0.298299  │ 0.199136  │
│ 6   │ -1.12537 │ -1.14459  │ -1.0404   │
│ 7   │ 0.693374 │ 0.673006  │ 0.588039  │
│ 8   │ -0.67709 │ -0.709737 │ -0.626146 │
│ 9   │ 0.14727  │ 0.16866   │ 0.0654867 │
│ 10  │ 1.03477  │ 1.08116   │ 0.967405  │
IHT false positives = 0
IHT false negatives = 0
LASSO false positives = 23
LASSO false negatives = 0

current run = 2
compare_model = 10×3 DataFrame
│ Row │ true_β    │ IHT_β     │ lasso_β    │
│     │ Float64   │ Float64   │ Float64    │
├─────┼───────────┼───────────┼────────────┤
│ 1   │ 0.402907  │ 0.435489  │ 0.328459   │
│ 2   │ 1.0652    │ 1.114     │ 1.01424    │
│ 3   │ 1.70025   │ 

# Compute average number of false positives/negatives

In [20]:
#average each count over all runs
normal_iht_false_positives = sum(iht_false_positives) / total_runs
normal_iht_false_negatives = sum(iht_false_negatives) / total_runs
normal_lasso_false_positives = sum(lasso_false_positives) / total_runs
normal_lasso_false_negatives = sum(lasso_false_negatives) / total_runs
#order: IHT false positives, IHT false negatives, LASSO false positives, LASSO false negatives
result = [normal_iht_false_positives; normal_iht_false_negatives; normal_lasso_false_positives; normal_lasso_false_negatives]

4-element Array{Float64,1}:
  0.04
  1.28
 28.54
  0.8 
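
The same averages can be obtained with Julia's `Statistics` standard library, which also makes it easy to report the run-to-run spread. The sketch below uses made-up counts in place of the vectors from the simulation, and reporting the standard deviation is an optional addition, not something done in this notebook.

```julia
using Statistics  # standard library, provides mean and std

# Hypothetical per-run counts standing in for a vector such as `iht_false_positives`.
false_positives = [0.0, 1.0, 0.0, 0.0, 1.0]

avg = mean(false_positives)    # equals sum(false_positives) / length(false_positives)
spread = std(false_positives)  # sample standard deviation across runs
```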

In [1]:
IHT_did_not_converge = 0
LASSO_did_not_converge = 0

0