# Various illustrations on IHT-logistic fitting

This notebook illustrates how iterative hard thresholding (IHT) behaves when fitting high-dimensional logistic regressions. Specifically, we investigate:
+ How large must an effect size be for logistic IHT to recover it?
+ Is initializing $\beta$ good in terms of speed and model selection?
+ Is debiasing good in terms of speed and model selection?
+ How does IHT compare to LASSO?
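
For readers new to the method, the core IHT update can be sketched in a few lines: take a gradient step on the logistic loglikelihood, then keep only the $k$ largest entries of $\beta$ in magnitude. This is a toy illustration and not MendelIHT's implementation; the names `sigmoid`, `project_k` and `toy_iht_logistic`, the fixed step size, and the fixed iteration count are all assumptions made for this sketch.

```julia
using Random

# Inverse logit link, defined locally so the sketch needs no packages.
sigmoid(t) = 1 / (1 + exp(-t))

# Keep the k largest-magnitude entries of b and set the rest to zero.
function project_k(b::AbstractVector, k::Integer)
    out = zeros(eltype(b), length(b))
    keep = partialsortperm(abs.(b), 1:k, rev=true)
    out[keep] = b[keep]
    return out
end

# Projected gradient ascent on the mean logistic loglikelihood.
# The fixed step size is a simplification chosen for this toy example.
function toy_iht_logistic(X, y, k; step=0.5, iters=200)
    n, p = size(X)
    b = zeros(p)
    for _ in 1:iters
        grad = X' * (y .- sigmoid.(X * b)) ./ n   # score of the mean loglikelihood
        b = project_k(b .+ step .* grad, k)       # hard thresholding step
    end
    return b
end

# Small simulated example: 3 true predictors among 50.
Random.seed!(2019)
X = randn(200, 50)
true_b = zeros(50)
true_b[1:3] = [2.0, -2.0, 1.5]
y = Float64.(rand(200) .< sigmoid.(X * true_b))
b_hat = toy_iht_logistic(X, y, 3)
println(findall(!iszero, b_hat))   # indices of the selected predictors
```

By construction the estimate never has more than $k$ nonzero entries, which is the property the experiments below rely on when counting recovered predictors.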

In [1]:
using Revise
using MendelIHT
using SnpArrays
using DataFrames
using Distributions
using BenchmarkTools
using Random
using LinearAlgebra
using StatsFuns: logistic

┌ Info: Loading DataFrames support into Gadfly.jl
└ @ Gadfly /Users/biona001/.julia/packages/Gadfly/09PWZ/src/mapping.jl:228


# How large must an effect size be for logistic IHT to recover it?

First we simulate 50 different SNP matrices, each with its own $\beta$ whose $k = 10$ nonzero entries are drawn from $N(0, 1)$, and examine how well IHT reconstructs them. Overall, IHT struggles to find effect sizes with magnitude $< 0.3$. For magnitudes in $(0.3, 0.5)$, IHT tends to overestimate the effect size, a consequence of the fact that the [MLE for logistic regression is biased in high dimensions](https://arxiv.org/abs/1803.06964).
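
One way to make these thresholds concrete is to tabulate, for each magnitude bin, the fraction of true effects that received a nonzero estimate. The sketch below is illustrative only: `recovery_by_bin` and its `edges` keyword are hypothetical names that are not part of MendelIHT, and the two vectors are copied from the output of the first simulated model shown further down.

```julia
# Fraction of truly nonzero effects that the estimate also marks as nonzero,
# split by the magnitude of the true effect. `edges` are the bin boundaries.
function recovery_by_bin(true_b, est_b; edges=[0.0, 0.3, 0.5, Inf])
    mags = abs.(true_b)
    rates = Float64[]
    for i in 1:length(edges)-1
        # indices of true effects whose magnitude falls in [edges[i], edges[i+1])
        in_bin = findall(m -> m != 0 && edges[i] <= m < edges[i+1], mags)
        found = count(j -> est_b[j] != 0, in_bin)
        push!(rates, isempty(in_bin) ? NaN : found / length(in_bin))
    end
    return rates
end

# Values from the first model's output table.
true_b = [-2.05135, -1.6632, 1.04428, -0.88058, -0.716923,
          -0.36553, -0.2244, -0.133693, -0.132844, 0.125008]
est_b  = [-2.10342, -1.84877, 1.07126, -0.974179, -0.913469,
          -0.466592, 0.0, 0.0, -0.294623, 0.0]

recovery_by_bin(true_b, est_b)   # returns [0.25, 1.0, 1.0]
```

In that single run, only one of the four effects below $0.3$ in magnitude is recovered, while every larger effect is found.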

In [6]:
#run logistic IHT on a simulated SNP matrix and compare the estimated effects with the truth
function run_logistic(n :: Int64, p :: Int64, k :: Int64)    
    #set random seed
    Random.seed!(1111)

    #construct snpmatrix, covariate files, and true model b
    x, maf = simulate_random_snparray(n, p)
    xbm = SnpBitMatrix{Float64}(x, model=ADDITIVE_MODEL, center=true, scale=true); 
    z = ones(n, 1) # non-genetic covariates, just the intercept
    true_b = zeros(p)
    true_b[1:k] = randn(k)
    shuffle!(true_b)
    correct_position = findall(x -> x != 0, true_b)

    #simulate bernoulli data
    y_temp = xbm * true_b
    prob = logistic.(y_temp) #inverse logit link
    y = [rand(Bernoulli(x)) for x in prob]
    y = Float64.(y)

    #compute logistic IHT result
    result = L0_logistic_reg(x, z, y, 1, k, glm = "logistic", debias=false, show_info=false)

    #check result
    estimated_models = result.beta[correct_position]
    true_model = true_b[correct_position]
    compare_model = DataFrame(
        true_β      = true_model, 
        estimated_β = estimated_models)
    
    
    #display results
    @show sort(compare_model, rev=true, by=abs)
    println("Total iteration number was " * string(result.iter))
    println("Total time was " * string(result.time) * "\n\n")
end

run_logistic (generic function with 1 method)

In [9]:
k = 10
for i = 1:50
    println("running the $i th model")
    n = rand(500:2000) 
    p = rand(1:10) * n
    println("n, p = " * string(n) * ", " * string(p))
    run_logistic(n, p, k)
end

running the 1 th model
n, p = 1624, 6496
sort(compare_model, rev=true, by=abs) = 10×2 DataFrame
│ Row │ true_β    │ estimated_β │
│     │ Float64   │ Float64     │
├─────┼───────────┼─────────────┤
│ 1   │ -2.05135  │ -2.10342    │
│ 2   │ -1.6632   │ -1.84877    │
│ 3   │ 1.04428   │ 1.07126     │
│ 4   │ -0.88058  │ -0.974179   │
│ 5   │ -0.716923 │ -0.913469   │
│ 6   │ -0.36553  │ -0.466592   │
│ 7   │ -0.2244   │ 0.0         │
│ 8   │ -0.133693 │ 0.0         │
│ 9   │ -0.132844 │ -0.294623   │
│ 10  │ 0.125008  │ 0.0         │
Total iteration number was 37
Total time was 2.4596590995788574


running the 2 th model
n, p = 901, 8109
sort(compare_model, rev=true, by=abs) = 10×2 DataFrame
│ Row │ true_β    │ estimated_β │
│     │ Float64   │ Float64     │
├─────┼───────────┼─────────────┤
│ 1   │ 1.35335   │ 1.54232     │
│ 2   │ 1.05911   │ 1.25674     │
│ 3   │ 1.00805   │ 1.11736     │
│ 4   │ 0.666356  │ 0.85089     │
│ 5   │ 0.624684  │ 0.768927    │
│ 6   │ 0.598996  │ 0.510321 

# Is initializing $\beta$ good in terms of speed and model selection?

When initializing the model, we can fit a univariate regression for each predictor to obtain an initial approximation of $\beta$ before starting the IHT algorithm. Doing so adds computational cost, but could reduce the total number of iterations and/or improve model selection. This section examines whether initialization is useful for logistic regression. (Answer: it is not. In the first run shown below, initialization selects the same predictors but takes 44 iterations instead of 20.)
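
As a rough picture of what such an initialization could look like, the sketch below takes one Newton step from zero for each univariate logistic regression and keeps the $k$ largest results. This is an assumption made for illustration, not a description of what MendelIHT does when `init=true`; the function name `univariate_init` is hypothetical, and the intercept is ignored for simplicity.

```julia
using LinearAlgebra

# One Newton step from b = 0 for each univariate logistic regression y ~ x_j.
# At b = 0 every fitted probability is 1/2, so the score is x_j'(y - 1/2)
# and the information is x_j'x_j / 4.
function univariate_init(X, y, k)
    p = size(X, 2)
    b0 = [4 * dot(X[:, j], y .- 0.5) / dot(X[:, j], X[:, j]) for j in 1:p]
    keep = partialsortperm(abs.(b0), 1:k, rev=true)   # k strongest marginal effects
    init = zeros(p)
    init[keep] = b0[keep]
    return init
end

# Tiny example: the first column separates y, the second is unrelated to it.
X = [1.0 1.0; -1.0 1.0; 1.0 -1.0; -1.0 -1.0]
y = [1.0, 0.0, 1.0, 0.0]
univariate_init(X, y, 1)   # returns [2.0, 0.0]
```

The extra pass over all $p$ predictors is the added cost mentioned above.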

In [16]:
function test_logistic_init(n :: Int64, p :: Int64, k :: Int64)    
    #set random seed
    Random.seed!(1111)

    #construct snpmatrix, covariate files, and true model b
    x, maf = simulate_random_snparray(n, p)
    xbm = SnpBitMatrix{Float64}(x, model=ADDITIVE_MODEL, center=true, scale=true); 
    z = ones(n, 1) # non-genetic covariates, just the intercept
    true_b = zeros(p)
    true_b[1:k] = randn(k)
    shuffle!(true_b)
    correct_position = findall(x -> x != 0, true_b)

    #simulate bernoulli data
    y_temp = xbm * true_b
    prob = logistic.(y_temp) #inverse logit link
    y = [rand(Bernoulli(x)) for x in prob]
    y = Float64.(y)

    #compute logistic IHT result with and without initialization
    no_init = L0_logistic_reg(x, z, y, 1, k, glm = "logistic", debias=false, convg=false, show_info=false, init=false)
    yes_init = L0_logistic_reg(x, z, y, 1, k, glm = "logistic", debias=false, convg=false, show_info=false, init=true)
    
    #check result
    est_model_init = yes_init.beta[correct_position]
    est_model_no_init = no_init.beta[correct_position]
    true_model = true_b[correct_position]
    compare_model = DataFrame(
        true_β      = true_model, 
        est_β_no_init = est_model_no_init,
        est_β_with_init = est_model_init)
        
    #display results
    @show sort(compare_model, rev=true, by=abs)
    println("No initialization:")
    println("    Iter = " * string(no_init.iter))
    println("    Time = " * string(no_init.time))
    println("Yes initialization:")
    println("    Iter = " * string(yes_init.iter))
    println("    Time = " * string(yes_init.time) * "\n\n")
    
    #return summary statistic
    yes_init_found = length(findall(!iszero, est_model_init))
    no_init_found = length(findall(!iszero, est_model_no_init))
    yes_init_time = yes_init.time
    no_init_time = no_init.time
    return yes_init_found, no_init_found, yes_init_time, no_init_time
end

test_logistic_init (generic function with 1 method)

In [17]:
k = 10
yes_init_total_found = 0
no_init_total_found = 0
yes_init_total_time = 0
no_init_total_time = 0

iter = 50
for i = 1:iter
    println("running the $i th model")
    n = rand(500:2000) 
    p = rand(1:10) * n
    println("n, p = " * string(n) * ", " * string(p))
    yif, nif, yit, nit = test_logistic_init(n, p, k)
    yes_init_total_found += yif
    no_init_total_found += nif
    yes_init_total_time += yit
    no_init_total_time += nit
end
println("With initialization, found $yes_init_total_found " * "predictors out of " * string(k*iter))
println("With initialization, average time was " * string(yes_init_total_time/iter))
println("Without initialization, found $no_init_total_found " * "predictors out of " * string(k*iter))
println("Without initialization, average time was " * string(no_init_total_time/iter))


running the 1 th model
n, p = 744, 5952
sort(compare_model, rev=true, by=abs) = 10×3 DataFrame
│ Row │ true_β    │ est_β_no_init │ est_β_with_init │
│     │ Float64   │ Float64       │ Float64         │
├─────┼───────────┼───────────────┼─────────────────┤
│ 1   │ -1.37566  │ -1.49858      │ -1.52421        │
│ 2   │ -0.912635 │ -1.16557      │ -1.26168        │
│ 3   │ -0.879982 │ -1.10409      │ -1.17207        │
│ 4   │ -0.65125  │ -0.814297     │ -0.8513         │
│ 5   │ 0.579642  │ 0.0           │ 0.0             │
│ 6   │ -0.5352   │ -0.719459     │ -0.749592       │
│ 7   │ 0.30467   │ 0.514603      │ 0.459092        │
│ 8   │ -0.234034 │ 0.0           │ 0.0             │
│ 9   │ 0.201716  │ 0.401632      │ 0.41397         │
│ 10  │ -0.153747 │ 0.0           │ 0.0             │
No initialization:
    Iter = 20
    Time = 0.5437190532684326
Yes initialization:
    Iter = 44
    Time = 2.3536880016326904


running the 2 th model
n, p = 693, 4851
sort(compare_model, rev=true, by

# Is debiasing good in terms of speed and model selection?

Within each IHT iteration, we can fit a GLM regression (using the scoring algorithm) on just the support set of that iteration. This is known as debiasing. We now investigate whether this is a good idea, in terms of both speed and model selection. (Answer: yes. In the first run shown below, debiasing selects the same predictors in 12 iterations instead of 35.)
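
The debiasing step can be pictured as follows: hold the current support fixed and run Fisher scoring on those columns only. This is a minimal sketch under that description, not MendelIHT's code; `debias_support`, `sigmoid` and the fixed number of scoring iterations are names and choices made up for this example.

```julia
# Inverse logit link, defined locally so the sketch needs no packages.
sigmoid(t) = 1 / (1 + exp(-t))

# Refit the nonzero entries of b by Fisher scoring (Newton's method for the
# logistic model), leaving entries outside the support at zero.
function debias_support(X, y, b; iters=10)
    S = findall(!iszero, b)          # current support
    XS = X[:, S]
    bS = b[S]
    for _ in 1:iters
        mu = sigmoid.(XS * bS)
        W = mu .* (1 .- mu)          # logistic variance weights
        score = XS' * (y .- mu)
        info = XS' * (W .* XS)       # expected information on the support
        bS = bS .+ info \ score
    end
    out = zeros(length(b))
    out[S] = bS
    return out
end

# Tiny example: only the intercept column is in the support.
X = [ones(4) [1.0, -1.0, 2.0, 0.0]]
y = [1.0, 1.0, 1.0, 0.0]
b_debiased = debias_support(X, y, [0.5, 0.0])
println(b_debiased)   # first entry approaches log(3), second stays at zero
```

Because the refit involves only the columns in the support, each scoring step solves a small linear system rather than one of size $p$.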

In [20]:
function test_logistic_debias(n :: Int64, p :: Int64, k :: Int64)    
    #set random seed
    Random.seed!(1111)

    #construct snpmatrix, covariate files, and true model b
    x, maf = simulate_random_snparray(n, p)
    xbm = SnpBitMatrix{Float64}(x, model=ADDITIVE_MODEL, center=true, scale=true); 
    z = ones(n, 1) # non-genetic covariates, just the intercept
    true_b = zeros(p)
    true_b[1:k] = randn(k)
    shuffle!(true_b)
    correct_position = findall(x -> x != 0, true_b)

    #simulate bernoulli data
    y_temp = xbm * true_b
    prob = logistic.(y_temp) #inverse logit link
    y = [rand(Bernoulli(x)) for x in prob]
    y = Float64.(y)

    #compute logistic IHT result with and without debiasing
    no_debias = L0_logistic_reg(x, z, y, 1, k, glm = "logistic", debias=false, convg=false, show_info=false, init=false)
    yes_debias = L0_logistic_reg(x, z, y, 1, k, glm = "logistic", debias=true, convg=false, show_info=false, init=false)
    
    #check result
    est_model_yes = yes_debias.beta[correct_position]
    est_model_no = no_debias.beta[correct_position]
    true_model = true_b[correct_position]
    compare_model = DataFrame(
        true_β           = true_model, 
        est_β_no_debias  = est_model_no,
        est_β_yes_debias = est_model_yes)
        
    #display results
    @show sort(compare_model, rev=true, by=abs)
    println("No debiasing:")
    println("    Iter = " * string(no_debias.iter))
    println("    Time = " * string(no_debias.time))
    println("Yes debiasing:")
    println("    Iter = " * string(yes_debias.iter))
    println("    Time = " * string(yes_debias.time) * "\n\n")
    
    #return summary statistic
    yes_debias_found = length(findall(!iszero, est_model_yes))
    no_debias_found = length(findall(!iszero, est_model_no))
    yes_debias_time = yes_debias.time
    no_debias_time = no_debias.time
    return yes_debias_found, no_debias_found, yes_debias_time, no_debias_time
end

test_logistic_debias (generic function with 1 method)

In [22]:
k = 10
yes_debias_total_found = 0
no_debias_total_found = 0
yes_debias_total_time = 0
no_debias_total_time = 0

iter = 50
for i = 1:iter
    println("running the $i th model")
    n = rand(500:2000) 
    p = rand(1:10) * n
    println("n, p = " * string(n) * ", " * string(p))
    ydf, ndf, ydt, ndt = test_logistic_debias(n, p, k)
    yes_debias_total_found += ydf
    no_debias_total_found += ndf
    yes_debias_total_time += ydt
    no_debias_total_time += ndt
end
println("With debiasing, found $yes_debias_total_found " * "predictors out of " * string(k*iter))
println("With debiasing, average time was " * string(yes_debias_total_time/iter))
println("Without debiasing, found $no_debias_total_found " * "predictors out of " * string(k*iter))
println("Without debiasing, average time was " * string(no_debias_total_time/iter))


running the 1 th model
n, p = 1773, 14184
sort(compare_model, rev=true, by=abs) = 10×3 DataFrame
│ Row │ true_β     │ est_β_no_debias │ est_β_yes_debias │
│     │ Float64    │ Float64         │ Float64          │
├─────┼────────────┼─────────────────┼──────────────────┤
│ 1   │ -1.45114   │ -1.28484        │ -1.28621         │
│ 2   │ -1.27653   │ -1.39328        │ -1.39483         │
│ 3   │ 1.2719     │ 1.25807         │ 1.25945          │
│ 4   │ -1.06845   │ -1.1152         │ -1.11638         │
│ 5   │ -1.05779   │ -1.20612        │ -1.2075          │
│ 6   │ -1.04699   │ -1.08232        │ -1.08359         │
│ 7   │ -0.599334  │ -0.482863       │ -0.483373        │
│ 8   │ -0.259217  │ -0.268291       │ -0.268572        │
│ 9   │ -0.0652767 │ 0.0             │ 0.0              │
│ 10  │ -0.0383971 │ 0.0             │ 0.0              │
No debiasing:
    Iter = 35
    Time = 5.318166971206665
Yes debiasing:
    Iter = 12
    Time = 1.9383821487426758


running the 2 th model
n, p = 1

# How does IHT compare to LASSO?

In [24]:
using Revise
using GLMNet #Julia wrapper for the Fortran glmnet code, which is also used by the R package
using GLM
using MendelIHT
using SnpArrays
using DataFrames
using Distributions
using StatsFuns: logistic
using Random
using LinearAlgebra
using DelimitedFiles

In [23]:
#check how many threads are available
Threads.nthreads()

1

In [None]:
function iht_lasso_logistic(n :: Int64, p :: Int64, sim :: Int64)
    #define true model size
    k = 10
    
    #set random seed
    Random.seed!(1111)

    #construct snpmatrix, covariate files, and true model b
    x, maf = simulate_random_snparray(n, p)
    xbm = SnpBitMatrix{Float64}(x, model=ADDITIVE_MODEL, center=true, scale=true); 
    z           = ones(n, 1)                   # non-genetic covariates, just the intercept
    true_b      = zeros(p)                     # model vector
    true_b[1:k] = rand(Normal(0, 0.25), k)     # k true response
    shuffle!(true_b)                           # Shuffle the entries
    correct_position = findall(x -> x != 0, true_b) # keep track of what the true entries are

    #simulate bernoulli data
    y_temp = xbm * true_b
    prob = logistic.(y_temp) #inverse logit link
    y = [rand(Bernoulli(x)) for x in prob]
    y = Float64.(y)

    #compute logistic IHT result
    path = collect(1:20)
    num_folds = 5
    folds = rand(1:num_folds, size(x, 1))
    k_est_iht = cv_iht(x, z, y, 1, path, folds, num_folds, use_maf=false, glm="logistic", debias=true)
    iht_result = L0_logistic_reg(x, z, y, 1, k_est_iht, glm = "logistic", debias=true, convg=false, show_info=false, init=false)

    #compute logistic lasso result
    x_float = [convert(Matrix{Float64}, x, center=true, scale=true) z]
    y_glmnet = [1 .- y y]
    cv = glmnetcv(x_float, y_glmnet, Binomial(), pmax=20, nfolds=5, folds=folds)
    best = argmin(cv.meanloss)
    lasso_result = cv.path.betas[:, best]
    k_est_lasso = length(findall(!iszero, lasso_result))
    
    #compute regular logistic regression using only true predictors
    x_true = [x_float[:, correct_position] z]
    regular_result = glm(x_true, y, Binomial(), LogitLink())
    regular_result = coef(regular_result)
    
    #check result
    IHT_model = iht_result.beta[correct_position]
    lasso_model = lasso_result[correct_position]
    true_model = true_b[correct_position]
    compare_model = DataFrame(
        true_β  = true_model, 
        iht_β   = IHT_model,
        lasso_β = lasso_model,
        regular_β = regular_result[1:10])
    @show compare_model

    #compute summary statistics
    lasso_num_correct_predictors = length(findall(!iszero, lasso_model))
    lasso_false_positives = k_est_lasso - lasso_num_correct_predictors
    lasso_false_negatives = k - lasso_num_correct_predictors
    iht_num_correct_predictors = length(findall(!iszero, IHT_model))
    iht_false_positives = k_est_iht - iht_num_correct_predictors
    iht_false_negatives = k - iht_num_correct_predictors
    println("IHT cv found $iht_false_positives" * " false positives and $iht_false_negatives" * " false negatives")
    println("lasso cv found $lasso_false_positives" * " false positives and $lasso_false_negatives" * " false negatives\n\n")

    #write y to file to view distribution later
    writedlm("./IHT_poisson_simulations_mean/data_simulation_$sim.txt", y)

    return lasso_num_correct_predictors, lasso_false_positives, lasso_false_negatives, iht_num_correct_predictors, iht_false_positives, iht_false_negatives
end
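
The false positive and false negative counts computed in `iht_lasso_logistic` can also be read directly off the two supports. The helper below is a small sketch of that bookkeeping; `selection_errors` is a hypothetical name, and it assumes that both vectors index the same predictors.

```julia
# Compare the support of an estimate with the support of the true model.
function selection_errors(true_b, est_b)
    truth = Set(findall(!iszero, true_b))
    found = Set(findall(!iszero, est_b))
    return (true_positives  = length(intersect(truth, found)),
            false_positives = length(setdiff(found, truth)),
            false_negatives = length(setdiff(truth, found)))
end

# Toy example: one hit, one spurious predictor, one miss.
errs = selection_errors([1.0, 0.0, 0.5, 0.0, 0.0], [0.9, 0.2, 0.0, 0.0, 0.0])
println(errs)   # (true_positives = 1, false_positives = 1, false_negatives = 1)
```

Applying the same helper to the IHT and lasso estimates would put both methods on identical footing when the summary counts are compared.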