# Various illustrations of IHT-Poisson fitting

In this notebook we illustrate the functionality and behavior of IHT for high-dimensional Poisson regression. Specifically, we investigate:
+ What is a reasonable distribution for $\beta_{true}$?
+ Is initializing $\beta$ good in terms of speed and model selection?
+ Is debiasing good in terms of speed and model selection?
+ How does IHT compare to LASSO?


In [1]:
using Revise
using MendelIHT
using SnpArrays
using DataFrames
using Distributions
using BenchmarkTools
using Random
using LinearAlgebra
using StatsFuns: logistic



# What is a reasonable distribution for $\beta_{true}$?

Recall that when $\lambda \geq 20$, Poisson($\lambda$) can be [approximated by a normal distribution with a continuity correction](http://wiki.stat.ucla.edu/socr/index.php/AP_Statistics_Curriculum_2007_Limits_Norm2Poisson). Thus in our data simulation, it is a good idea to make sure the mean of the generated $\lambda_i$ across samples does not exceed this value, because otherwise it would be more appropriate to fit a normal or log-normal model. This can be roughly ensured by letting $\beta \sim N(0, s)$ where $s$ is small, but the smaller the entries of $\beta$, the harder they are to detect. This section investigates what distribution we should choose for $\beta$ so that the median of $y$ lies in a reasonable range.
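As a rough sanity check on the role of $s$, here is a minimal sketch under stated assumptions: the $k$ causal columns are standardized and roughly independent, $s$ is treated as the standard deviation of each nonzero $\beta_j$, and the linear predictor $x_i^T\beta$ is approximated as normal with mean 0 and variance about $k s^2$. Under that approximation $\lambda_i$ is log-normal with median 1 and mean $\exp(k s^2 / 2)$. The helper name `approx_mean_lambda` is ours and is not part of MendelIHT.

```julia
# Approximate E[λ] when x'β is treated as N(0, k * s^2) (mean of a log-normal).
# Assumption: standardized, roughly independent causal columns.
approx_mean_lambda(k, s) = exp(k * s^2 / 2)

for s in 0.1:0.1:1.0
    println("s = ", s, ", approx mean λ = ", round(approx_mean_lambda(10, s), digits=2))
end
```

With $k = 10$ the approximate mean is about 1.05 at $s = 0.1$ and exceeds 100 at $s = 1$, while the median stays at 1, so larger $s$ mainly produces a heavier right tail.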


## "Bad" Poisson reconstruction for extreme $\lambda$ values

To convince the skeptical reader, we first show Poisson reconstruction results that are sometimes bad because of a few extreme $\lambda_i$. These extreme values force extremely large gradients in the first few iterations. This introduces numerical instability and also pushes the model estimate to disproportionately large values, rendering IHT powerless. For $k = 10$, this happens around $50\%$ of the time (i.e. when mean$(y) \leq 20$), and this probability tends to increase with larger $k$ because $\lambda_i = \exp(x_i^T\beta)$.
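To see why a few extreme counts can dominate the early iterations, recall that the score (gradient of the log-likelihood) for Poisson regression with a log link is $X^T(y - \lambda)$ with $\lambda = \exp(X\beta)$. The sketch below uses a tiny made-up design matrix and a hypothetical helper `poisson_score`; it is the textbook formula and not necessarily how MendelIHT computes its gradient internally.

```julia
# Score of the Poisson log-likelihood with log link: X' * (y - exp(Xβ)).
poisson_score(X, y, β) = X' * (y .- exp.(X * β))

X  = [1.0 0.0; 0.0 1.0; 1.0 1.0]   # toy design: 3 samples, 2 predictors
β0 = zeros(2)                       # starting point, so λ = 1 for every sample

y_mild    = [1.0, 2.0, 3.0]
y_extreme = [1.0, 2.0, 1000.0]      # a single extreme count

println(poisson_score(X, y_mild, β0))     # [2.0, 3.0]
println(poisson_score(X, y_extreme, β0))  # [999.0, 1000.0]
```

A single outlying response inflates every gradient component that its sample contributes to, which is the mechanism behind the large first steps described above.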

In [54]:
#some function that runs poisson regression on different SNP matrices
function run_poisson_bad(n :: Int64, p :: Int64, k :: Int64)
    #set random seed
    Random.seed!(1111)

    #construct snpmatrix, covariate files, and true model b. Here we show N(0, 1) is a BAD MODEL 
    x, maf = simulate_random_snparray(n, p)
    xbm = SnpBitMatrix{Float64}(x, model=ADDITIVE_MODEL, center=true, scale=true); 
    z = ones(n, 1) #intercept
    true_b = zeros(p)
    true_b[1:k] = randn(k) #N(0, 1) is a BAD MODEL 
    shuffle!(true_b)
    correct_position = findall(x -> x != 0, true_b)

    # Simulate poisson data
    y_temp = xbm * true_b
    λ = exp.(y_temp)
    y = [rand(Poisson(x)) for x in λ]
    y = Float64.(y)

    #compute poisson IHT result
    result = L0_poisson_reg(x, z, y, 1, k, glm = "poisson", debias=false, convg=true, show_info=false, true_beta=true_b, scale=false, init=false)

    #check result
    estimated_models = result.beta[correct_position]
    true_model = true_b[correct_position]
    compare_model = DataFrame(
        true_β      = true_model, 
        estimated_β = estimated_models)
    
    #display results
    @show sort(compare_model, rev=true, by=abs)
    println("Total iteration number was " * string(result.iter))
    println("Total time was " * string(result.time))
    println("median of y = " * string(median(y)))
    println("maximum of y = " * string(maximum(y)))
    println("minimum of y = " * string(minimum(y)))
    println("skewness of y = " * string(skewness(y)))
    println("kurtosis of y = " * string(kurtosis(y)))
    println("mean of y = " * string(mean(y)))
    println("variance of y = " * string(var(y)))
    println("var / mean = " * string(var(y) / mean(y)) * "\n\n")
end

run_poisson_bad (generic function with 1 method)

In [55]:
for i = 1:50
    n = rand(2000:5000) 
    p = rand(1:10) * n
    k = 10
    println("Running the $i th model where " * "n, p = " * string(n) * ", " * string(p))
    run_poisson_bad(n, p, k)
end

Running the 1 th model where n, p = 2274, 15918
sort(compare_model, rev=true, by=abs) = 10×2 DataFrame
│ Row │ true_β    │ estimated_β │
│     │ Float64   │ Float64     │
├─────┼───────────┼─────────────┤
│ 1   │ 1.29571   │ 1.29473     │
│ 2   │ 1.25988   │ 1.25969     │
│ 3   │ -1.22584  │ -1.23833    │
│ 4   │ 0.745645  │ 0.734241    │
│ 5   │ -0.376801 │ -0.383461   │
│ 6   │ -0.305932 │ -0.303664   │
│ 7   │ 0.275386  │ 0.271365    │
│ 8   │ 0.186843  │ 0.184365    │
│ 9   │ 0.173627  │ 0.176104    │
│ 10  │ -0.145285 │ -0.137414   │
Total iteration number was 159
Total time was 31.70826482772827
median of y = 1.0
maximum of y = 1166.0
minimum of y = 0.0
skewness of y = 10.392052871583946
kurtosis of y = 142.86550411712287
mean of y = 14.214160070360599
variance of y = 4088.5458442788117
var / mean = 287.6389335733075


Running the 2 th model where n, p = 3074, 24592
sort(compare_model, rev=true, by=abs) = 10×2 DataFrame
│ Row │ true_β    │ estimated_β │
│     │ Float64   │ Float6

## What's the appropriate variance of our model?

Below we let $\beta \sim N(0, s)$ for $s = 0.1, 0.2, \ldots, 1.0$, where $s$ is the second argument passed to `Normal(0, s)` in the code. For each $s$ we simulate data from 30 different SNP matrices and compute some summary statistics of $y$. Overall, variance $= 0.3$ works best for keeping the $\beta$ coefficients large (i.e. making the problem easy) while limiting extreme outliers.
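One detail worth keeping in mind when reading the code that follows: in Distributions.jl the second argument of `Normal(μ, σ)` is the standard deviation, not the variance. The short check below illustrates this for the value 0.4.

```julia
using Distributions

d = Normal(0, 0.4)
println(std(d))   # the standard deviation, 0.4
println(var(d))   # the variance, i.e. 0.4 squared
```

So a call such as `rand(Normal(0, 0.4), k)` draws effects whose variance is roughly 0.16, and the values labelled "variance" in this notebook are best read as standard deviations.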

In [72]:
function compute_y_mean(variance :: Float64)     
    #simulate data
    n = rand(2000:5000)
    p = 10n
    k = 10 # number of true predictors

    #construct snpmatrix, covariate files, and true model b
    x, maf = simulate_random_snparray(n, p)
    xbm = SnpBitMatrix{Float64}(x, model=ADDITIVE_MODEL, center=true, scale=true); 
    z = ones(n, 1) # non-genetic covariates, just the intercept
    true_b = zeros(p)
    true_b[1:k] = rand(Normal(0, variance), k)
    shuffle!(true_b)
    correct_position = findall(x -> x != 0, true_b)

    #simulate data
    y_temp = xbm * true_b
    λ = exp.(y_temp) #inverse log link
    y = [rand(Poisson(x)) for x in λ]
    y = Float64.(y)
    
    return n, p, median(y), maximum(y), mean(y), var(y), skewness(y), kurtosis(y) 
end

compute_y_mean (generic function with 1 method)

In [73]:
Random.seed!(2019)
s = collect(0.1:0.1:1) #beta ~ N(0, s) for s in {0.1, 0.2,..., 1} 
repeats = 30
for i in 1:length(s)
    df = DataFrame(n = Int64[], p = Int64[], median = Float64[], max = Float64[], mean = Float64[], 
        var = Float64[], skew = Float64[], kur = Float64[])
    for j in 1:repeats
        # mx avoids shadowing Base.max; σ2 holds var(y), not a standard deviation
        n, p, med, mx, μ, σ2, skew, kur = compute_y_mean(s[i])
        push!(df, [n, p, med, mx, μ, σ2, skew, kur])
    end
    println("for variance = " * string(s[i]) * ", the result is as follows:")
    @show(df)
    println("\n\n")
end

for variance = 0.1, the result is as follows:
df = 30×8 DataFrame
│ Row │ n     │ p     │ median  │ max     │ mean    │ var      │ skew     │ kur      │
│     │ Int64 │ Int64 │ Float64 │ Float64 │ Float64 │ Float64  │ Float64  │ Float64  │
├─────┼───────┼───────┼─────────┼─────────┼─────────┼──────────┼──────────┼──────────┤
│ 1   │ 2101  │ 21010 │ 1.0     │ 9.0     │ 1.04998 │ 1.15512  │ 1.22887  │ 2.47232  │
│ 2   │ 2421  │ 24210 │ 1.0     │ 6.0     │ 1.0033  │ 1.08346  │ 1.02249  │ 0.936058 │
│ 3   │ 4679  │ 46790 │ 1.0     │ 6.0     │ 1.06305 │ 1.17495  │ 1.0925   │ 1.18712  │
│ 4   │ 2024  │ 20240 │ 1.0     │ 7.0     │ 1.03656 │ 1.12372  │ 1.145    │ 1.5472   │
│ 5   │ 3581  │ 35810 │ 1.0     │ 7.0     │ 1.0081  │ 1.09016  │ 1.10599  │ 1.43206  │
│ 6   │ 2451  │ 24510 │ 1.0     │ 7.0     │ 1.04366 │ 1.14626  │ 1.10676  │ 1.31603  │
│ 7   │ 2966  │ 29660 │ 1.0     │ 7.0     │ 1.02326 │ 1.09052  │ 1.20477  │ 2.13775  │
│ 8   │ 4298  │ 42980 │ 1.0     │ 8.0     │ 1.09981 │ 1.32795  │

## Examine reconstruction for variance = 0.4

From the above, variance = 0.4 seems to be a reasonable choice for $\beta$. Below we simulate 50 different SNP matrices, each with $\beta \sim N(0, 0.4)$, to examine reconstruction behavior. Overall, IHT struggles to find effect sizes $< 0.1$ but performs well for larger values.
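One plausible contributor to the missed small effects is structural: each IHT iteration projects the current estimate onto the set of $k$-sparse vectors by keeping only its $k$ largest entries in magnitude. A minimal sketch of that projection follows; `project_k` is a hypothetical name used here for illustration and is not taken from MendelIHT.

```julia
# Keep the k largest-magnitude entries of v and zero out the rest.
function project_k(v::AbstractVector, k::Int)
    out = zero(v)
    keep = partialsortperm(abs.(v), 1:k, rev=true)  # indices of the top-k magnitudes
    out[keep] = v[keep]
    return out
end

println(project_k([0.1, -2.0, 0.5, 0.0, 1.5], 2))  # [0.0, -2.0, 0.0, 0.0, 1.5]
```

A true predictor with a small coefficient has to outrank every null predictor's noisy estimate to survive this step, which becomes harder as $p$ grows.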

In [2]:
#some function that runs poisson regression on different SNP matrices
function run_poisson(n :: Int64, p :: Int64, k :: Int64)    
    #set random seed
    Random.seed!(1111)

    #construct snpmatrix, covariate files, and true model b
    x, maf = simulate_random_snparray(n, p)
    xbm = SnpBitMatrix{Float64}(x, model=ADDITIVE_MODEL, center=true, scale=true); 
    z = ones(n, 1) # non-genetic covariates, just the intercept
    true_b = zeros(p)
    true_b[1:k] = rand(Normal(0, 0.4), k)
    shuffle!(true_b)
    correct_position = findall(x -> x != 0, true_b)

    # Simulate poisson data
    y_temp = xbm * true_b
    λ = exp.(y_temp) #inverse log link
    y = [rand(Poisson(x)) for x in λ]
    y = Float64.(y)

    #compute poisson IHT result
    result = L0_poisson_reg(x, z, y, 1, k, glm = "poisson", debias=false, convg=false, show_info=false)

    #check result
    estimated_models = result.beta[correct_position]
    true_model = true_b[correct_position]
    compare_model = DataFrame(
        true_β      = true_model, 
        estimated_β = estimated_models)
    
    #display results
    @show sort(compare_model, rev=true, by=abs)
    println("Total iteration number was " * string(result.iter))
    println("Total time was " * string(result.time))
    println("median of y = " * string(median(y)))
    println("maximum of y = " * string(maximum(y)))
    println("minimum of y = " * string(minimum(y)))
    println("skewness of y = " * string(skewness(y)))
    println("kurtosis of y = " * string(kurtosis(y)))
    println("mean of y = " * string(mean(y)))
    println("variance of y = " * string(var(y)))
    println("var / mean = " * string(var(y) / mean(y)) * "\n\n")
end

run_poisson (generic function with 1 method)

In [3]:
k = 10
for i = 1:50
    println("running the $i th model")
    n = rand(500:2000) 
    p = rand(1:10) * n
    println("n, p = " * string(n) * ", " * string(p))
    run_poisson(n, p, k)
end

running the 1 th model
n, p = 803, 8030
sort(compare_model, rev=true, by=abs) = 10×2 DataFrame
│ Row │ true_β     │ estimated_β │
│     │ Float64    │ Float64     │
├─────┼────────────┼─────────────┤
│ 1   │ 1.09447    │ 1.12531     │
│ 2   │ 0.556104   │ 0.536675    │
│ 3   │ 0.429742   │ 0.429157    │
│ 4   │ -0.397894  │ 0.0         │
│ 5   │ -0.362368  │ 0.0         │
│ 6   │ 0.313917   │ 0.3287      │
│ 7   │ 0.26377    │ 0.237004    │
│ 8   │ 0.228922   │ 0.222751    │
│ 9   │ -0.219458  │ -0.24286    │
│ 10  │ -0.0312748 │ 0.0         │
Total iteration number was 57
Total time was 2.2881321907043457
median of y = 1.0
maximum of y = 87.0
minimum of y = 0.0
skewness of y = 5.516287937432489
kurtosis of y = 49.98787061852346
mean of y = 2.955168119551681
variance of y = 38.68377313254844
var / mean = 13.090210630188116


running the 2 th model
n, p = 1096, 7672
sort(compare_model, rev=true, by=abs) = 10×2 DataFrame
│ Row │ true_β     │ estimated_β │
│     │ Float64    │ Float64    

# Is initializing $\beta$ good in terms of speed and model selection?

When initializing the model, we can fit a series of univariate regressions to obtain a good initial approximation before starting the IHT algorithm. Doing so introduces extra computational cost, but could reduce the total number of iterations and/or improve model selection. This section examines whether this is useful (it is not).
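For intuition, a univariate initialization could look like the sketch below: for each predictor, fit a one-parameter Poisson regression by Newton's method and use the resulting coefficient as a starting value. This is an illustration under simplifying assumptions (no intercept, a fixed number of Newton steps, and the hypothetical function name `univariate_poisson_coef`), and it may differ from what `init=true` does inside MendelIHT.

```julia
# One-parameter Poisson regression (log link, no intercept) via Newton's method.
function univariate_poisson_coef(x::AbstractVector, y::AbstractVector; iters::Int=25)
    b = 0.0
    for _ in 1:iters
        μ = exp.(x .* b)
        score = sum(x .* (y .- μ))   # first derivative of the log-likelihood
        info  = sum(x .^ 2 .* μ)     # negative second derivative
        b += score / info
    end
    return b
end

x = [-1.0, -0.5, 0.0, 0.5, 1.0]
y = exp.(0.5 .* x)                   # noiseless responses with true coefficient 0.5
println(univariate_poisson_coef(x, y))
```

Repeating such a fit for all $p$ predictors costs $p$ small optimizations before IHT even starts, which is the extra computational cost mentioned above.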

In [4]:
#some function that runs poisson regression on different SNP matrices
function test_poisson_init(n :: Int64, p :: Int64, k :: Int64)    
    #set random seed
    Random.seed!(1111)

    #construct snpmatrix, covariate files, and true model b
    x, maf = simulate_random_snparray(n, p)
    xbm = SnpBitMatrix{Float64}(x, model=ADDITIVE_MODEL, center=true, scale=true); 
    z = ones(n, 1) # non-genetic covariates, just the intercept
    true_b = zeros(p)
    true_b[1:k] = rand(Normal(0, 0.4), k)
    shuffle!(true_b)
    correct_position = findall(x -> x != 0, true_b)

    # Simulate poisson data
    y_temp = xbm * true_b
    λ = exp.(y_temp) #inverse log link
    y = [rand(Poisson(x)) for x in λ]
    y = Float64.(y)

    #compute poisson IHT result
    yes_init = L0_poisson_reg(x, z, y, 1, k, glm = "poisson", debias=false, convg=false, show_info=false, init=true)
    no_init = L0_poisson_reg(x, z, y, 1, k, glm = "poisson", debias=false, convg=false, show_info=false, init=false)
    
    #check result
    est_model_init = yes_init.beta[correct_position]
    est_model_no_init = no_init.beta[correct_position]
    true_model = true_b[correct_position]
    compare_model = DataFrame(
        true_β      = true_model, 
        est_β_no_init = est_model_no_init,
        est_β_with_init = est_model_init)
        
    #display results
    @show sort(compare_model, rev=true, by=abs)
    println("No initilialization:")
    println("    Iter = " * string(no_init.iter))
    println("    Time = " * string(no_init.time))
    println("Yes initilialization:")
    println("    Iter = " * string(yes_init.iter))
    println("    Time = " * string(yes_init.time) * "\n\n")
    
    #return summary statistic
    yes_init_found = length(findall(!iszero, est_model_init))
    no_init_found = length(findall(!iszero, est_model_no_init))
    yes_init_time = yes_init.time
    no_init_time = no_init.time
    return yes_init_found, no_init_found, yes_init_time, no_init_time
end

test_poisson_init (generic function with 1 method)

In [8]:
k = 10
yes_init_total_found = 0
no_init_total_found = 0
yes_init_total_time = 0
no_init_total_time = 0

iter = 50
for i = 1:iter
    println("running the $i th model")
    n = rand(500:2000) 
    p = rand(1:10) * n
    println("n, p = " * string(n) * ", " * string(p))
    yif, nif, yit, nit = test_poisson_init(n, p, k)
    yes_init_total_found += yif
    no_init_total_found += nif
    yes_init_total_time += yit
    no_init_total_time += nit
end
println("With initialization, found $yes_init_total_found " * "predictors out of " * string(k*iter))
println("With initialization, average time was " * string(yes_init_total_time/iter))
println("Without initialization, found $no_init_total_found " * "predictors out of " * string(k*iter))
println("Without initialization, average time was " * string(no_init_total_time/iter))


running the 1 th model
n, p = 1350, 13500
sort(compare_model, rev=true, by=abs) = 10×3 DataFrame
│ Row │ true_β     │ est_β_no_init │ est_β_with_init │
│     │ Float64    │ Float64       │ Float64         │
├─────┼────────────┼───────────────┼─────────────────┤
│ 1   │ -0.548454  │ -0.533148     │ -0.526393       │
│ 2   │ 0.44135    │ 0.402357      │ 0.404395        │
│ 3   │ 0.439569   │ 0.426857      │ 0.421325        │
│ 4   │ -0.368719  │ -0.414614     │ -0.417164       │
│ 5   │ 0.296116   │ 0.297873      │ 0.297031        │
│ 6   │ 0.208152   │ 0.205636      │ 0.206739        │
│ 7   │ -0.196641  │ -0.203148     │ -0.19881        │
│ 8   │ 0.120121   │ 0.137878      │ 0.138934        │
│ 9   │ -0.0998402 │ 0.0           │ 0.0             │
│ 10  │ -0.0346985 │ 0.0           │ 0.0             │
No initilialization:
    Iter = 23
    Time = 2.4194629192352295
Yes initilialization:
    Iter = 120
    Time = 16.844393014907837


running the 2 th model
n, p = 870, 5220
sort(compare_m

# Is debiasing good in terms of speed and model selection?

Within each IHT iteration, we can fit a GLM (using the scoring algorithm) on just the support set of that iteration. This is known as debiasing. We now investigate whether this is a good idea, in terms of both speed and model selection performance (answer: yes).
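A debiasing step of this kind can be sketched as Fisher scoring restricted to the columns in the current support. For Poisson regression with the log link, the update is $\beta_S \leftarrow \beta_S + (X_S^T W X_S)^{-1} X_S^T (y - \mu)$ with $W = \mathrm{diag}(\mu)$ and $\mu = \exp(X_S \beta_S)$. The function name `debias_poisson`, the fixed iteration count, and the zero starting value are assumptions made for this illustration, not MendelIHT's actual implementation.

```julia
using LinearAlgebra

# Fisher scoring for Poisson regression (log link) using only the support columns.
function debias_poisson(X::AbstractMatrix, y::AbstractVector,
                        support::AbstractVector{Int}; iters::Int=50)
    Xs = X[:, support]
    βs = zeros(length(support))
    for _ in 1:iters
        μ = exp.(Xs * βs)
        score = Xs' * (y .- μ)
        info  = Xs' * (μ .* Xs)      # same as Xs' * Diagonal(μ) * Xs
        βs += info \ score
    end
    β = zeros(size(X, 2))            # entries off the support stay at zero
    β[support] = βs
    return β
end

X = [1.0 0.3 -1.0; 1.0 -0.2 0.0; 1.0 0.5 1.0; 1.0 0.1 0.5]
support = [1, 3]
y = exp.(X[:, support] * [0.2, -0.4])   # noiseless responses from a 2-sparse model
println(debias_poisson(X, y, support))
```

Because only $k$ columns enter the linear solve, each such refit is cheap relative to a gradient computation over all $p$ predictors.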

In [9]:
function test_poisson_debias(n :: Int64, p :: Int64, k :: Int64)    
    #set random seed
    Random.seed!(1111)

    #construct snpmatrix, covariate files, and true model b
    x, maf = simulate_random_snparray(n, p)
    xbm = SnpBitMatrix{Float64}(x, model=ADDITIVE_MODEL, center=true, scale=true); 
    z = ones(n, 1) # non-genetic covariates, just the intercept
    true_b = zeros(p)
    true_b[1:k] = rand(Normal(0, 0.4), k)
    shuffle!(true_b)
    correct_position = findall(x -> x != 0, true_b)

    # Simulate poisson data
    y_temp = xbm * true_b
    λ = exp.(y_temp) #inverse log link
    y = [rand(Poisson(x)) for x in λ]
    y = Float64.(y)

    #compute poisson IHT result
    no_debias = L0_poisson_reg(x, z, y, 1, k, glm = "poisson", debias=false, convg=false, show_info=false, init=false)
    yes_debias = L0_poisson_reg(x, z, y, 1, k, glm = "poisson", debias=true, convg=false, show_info=false, init=false)
    
    #check result
    est_model_yes = yes_debias.beta[correct_position]
    est_model_no = no_debias.beta[correct_position]
    true_model = true_b[correct_position]
    compare_model = DataFrame(
        true_β           = true_model, 
        est_β_no_debias  = est_model_no,
        est_β_yes_debias = est_model_yes)
        
    #display results
    @show sort(compare_model, rev=true, by=abs)
    println("No debiasing:")
    println("    Iter = " * string(no_debias.iter))
    println("    Time = " * string(no_debias.time))
    println("Yes debiasing:")
    println("    Iter = " * string(yes_debias.iter))
    println("    Time = " * string(yes_debias.time) * "\n\n")
    
    #return summary statistic
    yes_debias_found = length(findall(!iszero, est_model_yes))
    no_debias_found = length(findall(!iszero, est_model_no))
    yes_debias_time = yes_debias.time
    no_debias_time = no_debias.time
    return yes_debias_found, no_debias_found, yes_debias_time, no_debias_time
end

test_poisson_debias (generic function with 1 method)

In [10]:
k = 10
yes_debias_total_found = 0
no_debias_total_found = 0
yes_debias_total_time = 0
no_debias_total_time = 0

iter = 50
for i = 1:iter
    println("running the $i th model")
    n = rand(500:2000) 
    p = rand(1:10) * n
    println("n, p = " * string(n) * ", " * string(p))
    ydf, ndf, ydt, ndt = test_poisson_debias(n, p, k)
    yes_debias_total_found += ydf
    no_debias_total_found += ndf
    yes_debias_total_time += ydt
    no_debias_total_time += ndt
end
println("With debiasing, found $yes_debias_total_found " * "predictors out of " * string(k*iter))
println("With debiasing, average time was " * string(yes_debias_total_time/iter))
println("Without debiasing, found $no_debias_total_found " * "predictors out of " * string(k*iter))
println("Without debiasing, average time was " * string(no_debias_total_time/iter))


running the 1 th model
n, p = 1695, 10170
sort(compare_model, rev=true, by=abs) = 10×3 DataFrame
│ Row │ true_β     │ est_β_no_debias │ est_β_yes_debias │
│     │ Float64    │ Float64         │ Float64          │
├─────┼────────────┼─────────────────┼──────────────────┤
│ 1   │ -0.942554  │ -0.930659       │ -0.930799        │
│ 2   │ 0.68117    │ 0.691088        │ 0.69103          │
│ 3   │ -0.371658  │ -0.372763       │ -0.372735        │
│ 4   │ -0.341904  │ -0.328759       │ -0.328743        │
│ 5   │ -0.280735  │ -0.267045       │ -0.267026        │
│ 6   │ -0.255104  │ -0.24732        │ -0.247274        │
│ 7   │ -0.185256  │ -0.165724       │ -0.165698        │
│ 8   │ 0.113998   │ 0.122994        │ 0.122978         │
│ 9   │ 0.0876498  │ 0.102293        │ 0.10228          │
│ 10  │ -0.0591206 │ -0.078097       │ -0.0780855       │
No debiasing:
    Iter = 22
    Time = 2.0880799293518066
Yes debiasing:
    Iter = 24
    Time = 2.273534059524536


running the 2 th model
n, p = 9

# Comparison with LASSO: cross validation

In [1]:
using Revise
using GLMNet #Julia wrapper around the Fortran code that also underlies the glmnet R package
using GLM
using MendelIHT
using SnpArrays
using DataFrames
using Distributions
using StatsFuns: logistic
using Random
using LinearAlgebra
using DelimitedFiles



In [2]:
Threads.nthreads() #verify multiple threads are enabled

8

In [3]:
function iht_lasso_poisson(n :: Int64, p :: Int64, sim :: Int64)
    #define true model size
    k = 10
    
    #set random seed
    Random.seed!(1111)

    #construct snpmatrix, covariate files, and true model b
    x, maf = simulate_random_snparray(n, p)
    xbm = SnpBitMatrix{Float64}(x, model=ADDITIVE_MODEL, center=true, scale=true); 
    z           = ones(n, 1)                   # non-genetic covariates, just the intercept
    true_b      = zeros(p)                     # model vector
    true_b[1:k] = rand(Normal(0, 0.4), k)      # k true response
    shuffle!(true_b)                           # Shuffle the entries
    correct_position = findall(x -> x != 0, true_b) # keep track of what the true entries are

    # Simulate poisson data
    y_temp = xbm * true_b
    λ = exp.(y_temp) #inverse log link
    y = [rand(Poisson(x)) for x in λ]
    y = Float64.(y)

    #compute poisson IHT result
    cur_time = time()
    path = collect(1:20)
    num_folds = 3
    folds = rand(1:num_folds, size(x, 1))
    k_est_iht = cv_iht(x, z, y, 1, path, folds, num_folds, use_maf=false, glm="poisson", debias=true)
    iht_result = L0_poisson_reg(x, z, y, 1, k_est_iht, glm = "poisson", debias=true, convg=false, show_info=false, true_beta=true_b, init=false)
    iht_time = time() - cur_time
    
    #compute poisson lasso result
    x_float = [convert(Matrix{Float64}, x, center=true, scale=true) z]
    cur_time = time()
    cv = glmnetcv(x_float, y, Poisson(), pmax=20, nfolds=3, folds=folds)
    best = argmin(cv.meanloss)
    lasso_result = cv.path.betas[:, best]
    k_est_lasso = length(findall(!iszero, lasso_result))
    lasso_time = time() - cur_time
    
    #compute regular poisson regression using only true predictors
    x_true = [x_float[:, correct_position] z]
    regular_result = glm(x_true, y, Poisson(), LogLink())
    regular_result = coef(regular_result) # public accessor instead of the internal pp.beta0 field
    
    #check result
    IHT_model = iht_result.beta[correct_position]
    lasso_model = lasso_result[correct_position]
    true_model = true_b[correct_position]
    compare_model = DataFrame(
        true_β  = true_model, 
        iht_β   = IHT_model,
        lasso_β = lasso_model,
        regular_β = regular_result[1:k])
    @show compare_model

    #compute summary statistics
    lasso_num_correct_predictors = length(findall(!iszero, lasso_model))
    lasso_false_positives = k_est_lasso - lasso_num_correct_predictors
    lasso_false_negatives = k - lasso_num_correct_predictors
    iht_num_correct_predictors = length(findall(!iszero, IHT_model))
    iht_false_positives = k_est_iht - iht_num_correct_predictors
    iht_false_negatives = k - iht_num_correct_predictors
    println("IHT cv found $iht_false_positives" * " false positives and $iht_false_negatives" * " false negatives, and used $iht_time" * " seconds")
    println("lasso cv found $lasso_false_positives" * " false positives and $lasso_false_negatives" * " false negatives, and used $lasso_time" * " seconds \n\n")

    #write y to file to view distribution later
    #writedlm("./IHT_poisson_simulations_mean/data_simulation_$sim.txt", y)

    return lasso_num_correct_predictors, lasso_false_positives, lasso_false_negatives, lasso_time, iht_num_correct_predictors, iht_false_positives, iht_false_negatives, iht_time
end

iht_lasso_poisson (generic function with 1 method)
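The false positive and false negative counts in the function above come from comparing the estimated support with the true support. A compact, standalone version of that bookkeeping is sketched below; `selection_errors` is a hypothetical helper and is not part of MendelIHT or GLMNet.

```julia
# Compare an estimated coefficient vector with the truth by their supports.
function selection_errors(est::AbstractVector, truth::AbstractVector)
    est_support  = findall(!iszero, est)
    true_support = findall(!iszero, truth)
    true_pos  = length(intersect(est_support, true_support))
    false_pos = length(est_support) - true_pos    # selected but truly zero
    false_neg = length(true_support) - true_pos   # truly nonzero but missed
    return (true_pos = true_pos, false_pos = false_pos, false_neg = false_neg)
end

truth = [0.5, 0.0, -0.3, 0.0, 0.1]
est   = [0.4, 0.2,  0.0, 0.0, 0.1]
println(selection_errors(est, truth))  # (true_pos = 2, false_pos = 1, false_neg = 1)
```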

In [4]:
function run_iht_lasso_poisson()
    lasso_total_found = 0
    lasso_false_positives = 0
    lasso_false_negatives = 0
    lasso_total_time = 0
    iht_total_found = 0
    iht_false_positives = 0
    iht_false_negatives = 0
    iht_total_time = 0
    
    iter = 10
    for i = 1:iter
        n = rand(1000:3000) 
        p = rand(1:10) * n
        println("Running the $i th model where " * "n, p = " * string(n) * ", " * string(p))
        ltf, lfp, lfn, ltt, itf, ifp, ifn, itt = iht_lasso_poisson(n, p, i)

        lasso_total_found += ltf
        lasso_false_positives += lfp
        lasso_false_negatives += lfn
        lasso_total_time += ltt
        iht_total_found += itf
        iht_false_positives += ifp
        iht_false_negatives += ifn
        iht_total_time += itt
    end
    println("IHT  : Found $iht_total_found " * "correct predictors, out of " * string(10iter))
    println("IHT  : False positives = $iht_false_positives")
    println("IHT  : False negatives = $iht_false_negatives")   
    println("IHT  : Average time = " * string(iht_total_time / iter))   
    println("Lasso: Found $lasso_total_found " * "correct predictors, out of " * string(10iter))
    println("Lasso: False positives = $lasso_false_positives")
    println("Lasso: False negatives = $lasso_false_negatives")
    println("Lasso: Average time = " * string(lasso_total_time / iter))   
end

run_iht_lasso_poisson (generic function with 1 method)

In [5]:
Random.seed!(2019)
run_iht_lasso_poisson()

Running the 1 th model where n, p = 1101, 5505




compare_model = 10×4 DataFrame
│ Row │ true_β     │ iht_β     │ lasso_β    │ regular_β   │
│     │ Float64    │ Float64   │ Float64    │ Float64     │
├─────┼────────────┼───────────┼────────────┼─────────────┤
│ 1   │ 0.350726   │ 0.348157  │ 0.232491   │ 0.338618    │
│ 2   │ -0.0691278 │ 0.0       │ 0.0        │ -0.0423288  │
│ 3   │ 0.321821   │ 0.33175   │ 0.218928   │ 0.33577     │
│ 4   │ 0.0206803  │ 0.0       │ 0.0        │ -0.00955842 │
│ 5   │ -0.364041  │ -0.348065 │ -0.190411  │ -0.345578   │
│ 6   │ -0.0117973 │ 0.0       │ 0.0        │ -0.0224709  │
│ 7   │ 0.384137   │ 0.408748  │ 0.286903   │ 0.402484    │
│ 8   │ -0.0379634 │ 0.0       │ 0.0        │ -0.0453506  │
│ 9   │ 0.0114505  │ 0.0       │ 0.0        │ 0.0324529   │
│ 10  │ -0.19394   │ 0.0       │ -0.0653144 │ -0.2002     │
IHT cv found 0 false positives and 6 false negatives, and used 60.85119676589966 seconds
lasso cv found 0 false positives and 5 false negatives, and used 2.4331750869750977 seconds 


Runni



compare_model = 10×4 DataFrame
│ Row │ true_β     │ iht_β     │ lasso_β    │ regular_β │
│     │ Float64    │ Float64   │ Float64    │ Float64   │
├─────┼────────────┼───────────┼────────────┼───────────┤
│ 1   │ -0.166785  │ 0.0       │ -0.0381534 │ -0.137296 │
│ 2   │ -0.0966497 │ -0.106139 │ -0.0220758 │ -0.102758 │
│ 3   │ 0.585055   │ 0.591731  │ 0.543016   │ 0.585122  │
│ 4   │ -0.11679   │ -0.135514 │ -0.0418787 │ -0.132134 │
│ 5   │ -0.469501  │ -0.445559 │ -0.314045  │ -0.437911 │
│ 6   │ 0.179822   │ 0.173472  │ 0.0865679  │ 0.173136  │
│ 7   │ 0.266917   │ 0.280213  │ 0.195209   │ 0.269821  │
│ 8   │ -0.271124  │ -0.277599 │ -0.173511  │ -0.265895 │
│ 9   │ 0.268131   │ 0.281803  │ 0.213427   │ 0.271134  │
│ 10  │ -0.849929  │ -0.921018 │ -0.511578  │ -0.877314 │
IHT cv found 0 false positives and 1 false negatives, and used 77.47188878059387 seconds
lasso cv found 0 false positives and 0 false negatives, and used 10.86150598526001 seconds 


Running the 3 th model where n, 



compare_model = 10×4 DataFrame
│ Row │ true_β      │ iht_β     │ lasso_β    │ regular_β   │
│     │ Float64     │ Float64   │ Float64    │ Float64     │
├─────┼─────────────┼───────────┼────────────┼─────────────┤
│ 1   │ 0.185692    │ 0.194874  │ 0.11655    │ 0.196478    │
│ 2   │ 0.789222    │ 0.781809  │ 0.770326   │ 0.78866     │
│ 3   │ 0.0306279   │ 0.0       │ 0.0        │ 0.0176097   │
│ 4   │ -0.125003   │ -0.142287 │ -0.0313078 │ -0.145971   │
│ 5   │ -0.366029   │ -0.340488 │ -0.207299  │ -0.35054    │
│ 6   │ 0.116492    │ 0.111686  │ 0.0667364  │ 0.123994    │
│ 7   │ 0.38709     │ 0.375683  │ 0.349523   │ 0.385362    │
│ 8   │ -0.18326    │ -0.173212 │ -0.0716869 │ -0.177553   │
│ 9   │ 0.282194    │ 0.277962  │ 0.193002   │ 0.284255    │
│ 10  │ 0.000576499 │ 0.0       │ 0.0        │ -0.00663532 │
IHT cv found 6 false positives and 2 false negatives, and used 109.81043982505798 seconds
lasso cv found 3 false positives and 2 false negatives, and used 4.152333974838257 sec



compare_model = 10×4 DataFrame
│ Row │ true_β     │ iht_β      │ lasso_β   │ regular_β  │
│     │ Float64    │ Float64    │ Float64   │ Float64    │
├─────┼────────────┼────────────┼───────────┼────────────┤
│ 1   │ -0.539167  │ -0.550889  │ -0.304965 │ -0.555071  │
│ 2   │ -0.0968708 │ 0.0        │ 0.0       │ -0.0960337 │
│ 3   │ 0.846705   │ 0.821827   │ 0.719073  │ 0.821006   │
│ 4   │ -0.441297  │ -0.461406  │ -0.254502 │ -0.456925  │
│ 5   │ -0.128018  │ -0.0882122 │ 0.0       │ -0.122613  │
│ 6   │ 0.132842   │ 0.0        │ 0.0       │ 0.146273   │
│ 7   │ -0.158806  │ -0.137324  │ 0.0       │ -0.156672  │
│ 8   │ -0.987442  │ -1.01963   │ -0.732095 │ -1.0204    │
│ 9   │ 0.203632   │ 0.190123   │ 0.0770605 │ 0.189536   │
│ 10  │ -0.615237  │ -0.63718   │ -0.402071 │ -0.623348  │
IHT cv found 9 false positives and 2 false negatives, and used 32.19886112213135 seconds
lasso cv found 4 false positives and 4 false negatives, and used 2.7987639904022217 seconds 


Running the 5 th m



compare_model = 10×4 DataFrame
│ Row │ true_β    │ iht_β     │ lasso_β    │ regular_β │
│     │ Float64   │ Float64   │ Float64    │ Float64   │
├─────┼───────────┼───────────┼────────────┼───────────┤
│ 1   │ 0.149501  │ 0.179655  │ 0.0904689  │ 0.17243   │
│ 2   │ -0.315157 │ -0.360698 │ -0.251767  │ -0.347977 │
│ 3   │ -0.313771 │ 0.0       │ -0.0559417 │ -0.24464  │
│ 4   │ 0.596863  │ 0.596541  │ 0.481438   │ 0.566446  │
│ 5   │ 0.0613366 │ 0.0       │ 0.0        │ 0.011362  │
│ 6   │ 0.39443   │ 0.386599  │ 0.331247   │ 0.379504  │
│ 7   │ 0.046383  │ 0.0       │ 0.0        │ 0.0405169 │
│ 8   │ 0.391428  │ 0.381169  │ 0.283873   │ 0.370438  │
│ 9   │ 0.0246291 │ 0.0       │ 0.0        │ 0.0586396 │
│ 10  │ 0.0489828 │ 0.0       │ 0.0        │ 0.069528  │
IHT cv found 2 false positives and 5 false negatives, and used 38.80105805397034 seconds
lasso cv found 0 false positives and 4 false negatives, and used 4.259769916534424 seconds 


Running the 6 th model where n, p = 1893, 151



compare_model = 10×4 DataFrame
│ Row │ true_β     │ iht_β     │ lasso_β    │ regular_β   │
│     │ Float64    │ Float64   │ Float64    │ Float64     │
├─────┼────────────┼───────────┼────────────┼─────────────┤
│ 1   │ 0.0565106  │ 0.0       │ 0.0        │ 0.0664289   │
│ 2   │ -0.0288542 │ 0.0       │ 0.0        │ -0.0239858  │
│ 3   │ 0.0157203  │ 0.0       │ 0.0        │ 0.0324769   │
│ 4   │ -0.0170607 │ 0.0       │ 0.0        │ 0.0118499   │
│ 5   │ -0.248815  │ -0.266159 │ -0.176822  │ -0.260139   │
│ 6   │ 0.503222   │ 0.525     │ 0.457777   │ 0.519209    │
│ 7   │ -0.204717  │ -0.187565 │ -0.0914828 │ -0.181551   │
│ 8   │ -0.0144074 │ 0.0       │ 0.0        │ -0.00435644 │
│ 9   │ -0.588691  │ -0.625465 │ -0.490498  │ -0.616987   │
│ 10  │ -0.0416818 │ 0.0       │ 0.0        │ -0.0470831  │
IHT cv found 0 false positives and 6 false negatives, and used 25.94583797454834 seconds
lasso cv found 0 false positives and 6 false negatives, and used 5.142854928970337 seconds 


Runnin



compare_model = 10×4 DataFrame
│ Row │ true_β    │ iht_β     │ lasso_β    │ regular_β │
│     │ Float64   │ Float64   │ Float64    │ Float64   │
├─────┼───────────┼───────────┼────────────┼───────────┤
│ 1   │ -0.497047 │ -0.484773 │ -0.346357  │ -0.470824 │
│ 2   │ 0.537552  │ 0.534424  │ 0.418754   │ 0.517245  │
│ 3   │ -0.311976 │ -0.319035 │ -0.168462  │ -0.299785 │
│ 4   │ -0.111079 │ -0.10272  │ 0.0        │ -0.103387 │
│ 5   │ -0.419224 │ -0.41377  │ -0.310428  │ -0.400017 │
│ 6   │ -0.191882 │ -0.168608 │ -0.0558217 │ -0.169596 │
│ 7   │ 0.352124  │ 0.333572  │ 0.253353   │ 0.326187  │
│ 8   │ 0.193935  │ 0.180412  │ 0.0765412  │ 0.177762  │
│ 9   │ 0.256262  │ 0.262324  │ 0.197361   │ 0.260144  │
│ 10  │ 0.383622  │ 0.384063  │ 0.263412   │ 0.367638  │
IHT cv found 5 false positives and 0 false negatives, and used 156.70267391204834 seconds
lasso cv found 0 false positives and 1 false negatives, and used 5.286602020263672 seconds 


Running the 8 th model where n, p = 2292, 22



compare_model = 10×4 DataFrame
│ Row │ true_β    │ iht_β     │ lasso_β    │ regular_β │
│     │ Float64   │ Float64   │ Float64    │ Float64   │
├─────┼───────────┼───────────┼────────────┼───────────┤
│ 1   │ 0.0419999 │ 0.0       │ 0.0        │ 0.0275679 │
│ 2   │ 0.125858  │ 0.107102  │ 0.037222   │ 0.107911  │
│ 3   │ -0.794255 │ -0.79384  │ -0.675938  │ -0.79406  │
│ 4   │ 0.457812  │ 0.468226  │ 0.440908   │ 0.46645   │
│ 5   │ -0.192134 │ -0.195708 │ -0.12773   │ -0.193447 │
│ 6   │ -0.204108 │ -0.204849 │ -0.133766  │ -0.200409 │
│ 7   │ 0.26224   │ 0.274216  │ 0.212134   │ 0.274652  │
│ 8   │ -0.132775 │ -0.112193 │ -0.0367459 │ -0.112336 │
│ 9   │ 0.0217522 │ 0.0       │ 0.0        │ 0.022861  │
│ 10  │ 0.548122  │ 0.54901   │ 0.524515   │ 0.550788  │
IHT cv found 0 false positives and 2 false negatives, and used 9.282888889312744 seconds
lasso cv found 1 false positives and 2 false negatives, and used 1.6354539394378662 seconds 


Running the 9 th model where n, p = 2283, 91



compare_model = 10×4 DataFrame
│ Row │ true_β     │ iht_β     │ lasso_β   │ regular_β  │
│     │ Float64    │ Float64   │ Float64   │ Float64    │
├─────┼────────────┼───────────┼───────────┼────────────┤
│ 1   │ -0.35539   │ -0.329295 │ -0.257184 │ -0.328476  │
│ 2   │ 0.138335   │ 0.0       │ 0.0678976 │ 0.153547   │
│ 3   │ -0.370387  │ -0.344611 │ -0.251284 │ -0.342879  │
│ 4   │ 0.0358637  │ 0.0       │ 0.0       │ 0.0254869  │
│ 5   │ -0.0912084 │ 0.0       │ 0.0       │ -0.0895277 │
│ 6   │ 0.0378541  │ 0.0       │ 0.0       │ 0.0451899  │
│ 7   │ 0.417998   │ 0.418837  │ 0.369821  │ 0.421582   │
│ 8   │ -0.0813847 │ 0.0       │ 0.0       │ -0.0737161 │
│ 9   │ -0.239892  │ 0.0       │ -0.121925 │ -0.211919  │
│ 10  │ 0.488211   │ 0.490875  │ 0.481137  │ 0.486368   │
IHT cv found 1 false positives and 6 false negatives, and used 97.13794088363647 seconds
lasso cv found 0 false positives and 4 false negatives, and used 5.294406890869141 seconds 


Running the 10 th model where n,



compare_model = 10×4 DataFrame
│ Row │ true_β      │ iht_β     │ lasso_β   │ regular_β   │
│     │ Float64     │ Float64   │ Float64   │ Float64     │
├─────┼─────────────┼───────────┼───────────┼─────────────┤
│ 1   │ 0.163141    │ 0.153406  │ 0.117559  │ 0.15415     │
│ 2   │ -0.00928313 │ 0.0       │ 0.0       │ -0.00435653 │
│ 3   │ -0.640022   │ -0.648787 │ -0.529854 │ -0.63863    │
│ 4   │ 1.18325     │ 1.18939   │ 1.12608   │ 1.1773      │
│ 5   │ -0.259045   │ -0.272451 │ -0.193102 │ -0.268432   │
│ 6   │ -0.818365   │ -0.812832 │ -0.64731  │ -0.80088    │
│ 7   │ 0.01962     │ 0.0       │ 0.0       │ 0.00601566  │
│ 8   │ 0.026186    │ 0.0       │ 0.0       │ 0.0233668   │
│ 9   │ -0.216924   │ -0.226072 │ -0.173227 │ -0.228295   │
│ 10  │ -0.291759   │ 0.0       │ 0.0       │ -0.414913   │
IHT cv found 0 false positives and 4 false negatives, and used 138.42418003082275 seconds
lasso cv found 1 false positives and 4 false negatives, and used 17.067068099975586 seconds 


IHT 