# Simulations for IHT using various GLMs

IHT can be used to fit generalized linear models (GLMs) in the high-dimensional setting $p \gg n$, which is common in [genome-wide association studies](https://en.wikipedia.org/wiki/Genome-wide_association_study) (GWAS). In this tutorial, we simulate response data from 5 selected distributions and show how well IHT reconstructs the true model on various simulated GWAS datasets.
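
To make the method concrete, here is a minimal sketch of iterative hard thresholding for ordinary least squares: take a gradient step, then keep only the $k$ largest-magnitude coefficients. It is an illustration only and not how `L0_reg` is implemented; the names `hard_threshold` and `iht_least_squares`, the fixed step size, and the toy data are all assumptions of this sketch.

```julia
using LinearAlgebra, Random

# Keep the k largest-magnitude entries of v and set the rest to zero.
function hard_threshold(v::AbstractVector, k::Integer)
    out = zero(v)
    keep = partialsortperm(abs.(v), 1:k, rev=true)
    out[keep] = v[keep]
    return out
end

# Plain IHT for least squares: gradient step, then projection onto k-sparse vectors.
function iht_least_squares(X::AbstractMatrix, y::AbstractVector, k::Integer; iters::Integer=200)
    b = zeros(size(X, 2))
    step = 1 / opnorm(X)^2            # conservative fixed step size (assumption of this sketch)
    for _ in 1:iters
        grad = X' * (X * b - y)       # gradient of 0.5 * ||y - X*b||^2
        b = hard_threshold(b - step * grad, k)
    end
    return b
end

# Toy problem (hypothetical data, unrelated to the simulations in this tutorial)
Random.seed!(1)
X = randn(200, 50)
b_true = zeros(50)
b_true[[3, 17, 42]] = [2.0, -1.5, 1.0]
y = X * b_true                        # noiseless response, for illustration only
b_hat = iht_least_squares(X, y, 3)
println(findall(!iszero, b_hat))      # indices of the selected predictors
```

The projection step is what enforces the sparsity level $k$ directly, which is why the examples in this tutorial pass `k` to the fitting routine rather than a penalty weight.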


In [1]:
using MendelIHT
using SnpArrays
using DataFrames
using Distributions
using BenchmarkTools
using Random
using LinearAlgebra
using GLM


## Normal response (i.e. quantitative responses)

In [13]:
#simulate data with k true predictors, from distribution d and with link l.
n = 1000
p = 10000
k = 10
d = Normal
l = canonicallink(d())

#set random seed
Random.seed!(2019)

#construct snpmatrix, covariate files, and true model b
x, maf = simulate_random_snparray(n, p, "tmp.bed")
xbm = SnpBitMatrix{Float64}(x, model=ADDITIVE_MODEL, center=true, scale=true); 
z = ones(n, 1) # the intercept
true_b = zeros(p)
true_b[1:k] = randn(k)
shuffle!(true_b)
correct_position = findall(x -> x != 0, true_b)

#simulate phenotypes (i.e. response vector y)
prob = linkinv.(l, xbm * true_b)
y = [rand(d(i)) for i in prob]
y = Float64.(y)

#compute result
result = L0_reg(x, xbm, z, y, 1, k, d(), l, debias=false, init=false, show_info=false, convg=true)

IHT results:

Compute time (sec):     0.37221598625183105
Final loglikelihood:    -1406.8807652877347
Iterations:             6
Max number of groups:   1
Max predictors/group:   10
IHT estimated 10 nonzero coefficients.
10×3 DataFrame
│ Row │ Group │ Predictor │ Estimated_β │
│     │ Int64 │ Int64     │ Float64     │
├─────┼───────┼───────────┼─────────────┤
│ 1   │ 1     │ 853       │ -1.24117    │
│ 2   │ 1     │ 877       │ -0.234677   │
│ 3   │ 1     │ 924       │ 0.82014     │
│ 4   │ 1     │ 2703      │ 0.583402    │
│ 5   │ 1     │ 4241      │ 0.298304    │
│ 6   │ 1     │ 4783      │ -1.14459    │
│ 7   │ 1     │ 5094      │ 0.673013    │
│ 8   │ 1     │ 5284      │ -0.709734   │
│ 9   │ 1     │ 7760      │ 0.168659    │
│ 10  │ 1     │ 8255      │ 1.08117     │

Intercept of model = 0.0


In [7]:
#compare true model with reconstruction result
compare_model = DataFrame(
    true_β      = true_b[correct_position], 
    estimated_β = result.beta[correct_position])
@show compare_model
println("Total iteration number was " * string(result.iter))
println("Total time was " * string(result.time))
println("Total found predictors = " * string(length(findall(!iszero, result.beta[correct_position]))))

#clean up
rm("tmp.bed", force=true)

compare_model = 10×2 DataFrame
│ Row │ true_β   │ estimated_β │
│     │ Float64  │ Float64     │
├─────┼──────────┼─────────────┤
│ 1   │ -1.29964 │ -1.24117    │
│ 2   │ -0.2177  │ -0.234677   │
│ 3   │ 0.786217 │ 0.82014     │
│ 4   │ 0.599233 │ 0.583402    │
│ 5   │ 0.283711 │ 0.298304    │
│ 6   │ -1.12537 │ -1.14459    │
│ 7   │ 0.693374 │ 0.673013    │
│ 8   │ -0.67709 │ -0.709734   │
│ 9   │ 0.14727  │ 0.168659    │
│ 10  │ 1.03477  │ 1.08117     │
Total iteration number was 6
Total time was 0.3785209655761719
Total found predictors = 10


## Bernoulli response (i.e. case-control studies)
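
For reference, writing $\eta_i = x_i^T \beta$ for the linear predictor, the three links named in the cell below map $\eta_i$ to a success probability $\mu_i$ as follows. These are the standard definitions; $\Phi$ denotes the standard normal CDF.

$$
\begin{aligned}
\text{LogitLink:} \quad & \mu_i = \frac{1}{1 + e^{-\eta_i}} \\
\text{ProbitLink:} \quad & \mu_i = \Phi(\eta_i) \\
\text{CloglogLink:} \quad & \mu_i = 1 - \exp\left(-e^{\eta_i}\right)
\end{aligned}
$$

`linkinv.(l, xbm * true_b)` applies the chosen map elementwise, and each $y_i$ is then drawn from a Bernoulli distribution with success probability $\mu_i$.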

In [14]:
n = 1000
p = 15000
k = 10
d = Bernoulli
l = ProbitLink() #could use LogitLink() or CloglogLink() instead

#set random seed
Random.seed!(2019)

#construct snpmatrix, covariate files, and true model b
x, maf = simulate_random_snparray(n, p, "tmp.bed")
xbm = SnpBitMatrix{Float64}(x, model=ADDITIVE_MODEL, center=true, scale=true); 
z = ones(n, 1) # the intercept
true_b = zeros(p)
true_b[1:k] = randn(k)
shuffle!(true_b)
correct_position = findall(x -> x != 0, true_b)

#simulate phenotypes (i.e. response vector y)
prob = linkinv.(l, xbm * true_b)
y = [rand(d(i)) for i in prob]
y = Float64.(y)

#run IHT
result = L0_reg(x, xbm, z, y, 1, k, d(), l, debias=false, init=false, show_info=false, convg=true)

IHT results:

Compute time (sec):     0.5659849643707275
Final loglikelihood:    -276.7458944999689
Iterations:             6
Max number of groups:   1
Max predictors/group:   10
IHT estimated 10 nonzero coefficients.
10×3 DataFrame
│ Row │ Group │ Predictor │ Estimated_β │
│     │ Int64 │ Int64     │ Float64     │
├─────┼───────┼───────────┼─────────────┤
│ 1   │ 1     │ 297       │ -0.474122   │
│ 2   │ 1     │ 1229      │ -0.248409   │
│ 3   │ 1     │ 3979      │ 0.739422    │
│ 4   │ 1     │ 5859      │ 1.00449     │
│ 5   │ 1     │ 5973      │ 0.845822    │
│ 6   │ 1     │ 7005      │ 0.804647    │
│ 7   │ 1     │ 10514     │ -1.26911    │
│ 8   │ 1     │ 11089     │ -0.664177   │
│ 9   │ 1     │ 12536     │ -0.236307   │
│ 10  │ 1     │ 14460     │ -0.240234   │

Intercept of model = 0.0


In [10]:
#compare true model with reconstruction result
compare_model = DataFrame(
    true_β      = true_b[correct_position], 
    estimated_β = result.beta[correct_position])
@show compare_model
println("Total iteration number was " * string(result.iter))
println("Total time was " * string(result.time))
println("Total found predictors = " * string(length(findall(!iszero, result.beta[correct_position]))))

#clean up
rm("tmp.bed", force=true)

compare_model = 10×2 DataFrame
│ Row │ true_β    │ estimated_β │
│     │ Float64   │ Float64     │
├─────┼───────────┼─────────────┤
│ 1   │ -0.549133 │ -0.474122   │
│ 2   │ -0.185487 │ -0.248409   │
│ 3   │ 0.680247  │ 0.739422    │
│ 4   │ 0.918479  │ 1.00449     │
│ 5   │ 0.851805  │ 0.845822    │
│ 6   │ 0.750598  │ 0.804647    │
│ 7   │ -1.25515  │ -1.26911    │
│ 8   │ -0.600394 │ -0.664177   │
│ 9   │ 0.141611  │ 0.0         │
│ 10  │ 0.230208  │ 0.0         │
Total iteration number was 6
Total time was 1.8919661045074463
Total found predictors = 8
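
The count printed above only covers true predictors that were recovered. Here IHT reported 10 nonzero coefficients while 8 of the true predictors were found, so it can also help to tally false positives and misses. A minimal sketch, using a hypothetical helper `support_summary` and toy vectors that are not taken from the simulation:

```julia
# Summarize support recovery: true predictors found, spurious selections, and misses.
function support_summary(true_b::AbstractVector, est_b::AbstractVector)
    true_support = findall(!iszero, true_b)
    est_support  = findall(!iszero, est_b)
    return (
        true_positives  = length(intersect(true_support, est_support)),
        false_positives = length(setdiff(est_support, true_support)),
        false_negatives = length(setdiff(true_support, est_support)),
    )
end

# Toy vectors (hypothetical values chosen for illustration)
toy_true = [0.0, 1.2, 0.0, -0.7, 0.3]
toy_est  = [0.0, 1.1, 0.4, -0.6, 0.0]
println(support_summary(toy_true, toy_est))
```

In this notebook the same helper could be applied as `support_summary(true_b, result.beta)`.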


## Poisson response (i.e. count data)

In [15]:
#simulate data with k true predictors, from distribution d and with link l.
n = 1000
p = 10000
k = 10
d = Poisson
l = canonicallink(d())

#set random seed
Random.seed!(2019)

#construct snpmatrix, covariate files, and true model b
x, maf = simulate_random_snparray(n, p, "tmp.bed")
xbm = SnpBitMatrix{Float64}(x, model=ADDITIVE_MODEL, center=true, scale=true); 
z = ones(n, 1) # the intercept
true_b = zeros(p)
true_b[1:k] = rand(Normal(0, 0.3), k)
shuffle!(true_b)
correct_position = findall(x -> x != 0, true_b)

#simulate phenotypes (i.e. response vector y)
prob = linkinv.(l, xbm * true_b)
y = [rand(d(i)) for i in prob]
y = Float64.(y)


#run IHT
result = L0_reg(x, xbm, z, y, 1, k, d(), l, debias=false, init=false, show_info=false, convg=true)

IHT results:

Compute time (sec):     0.7138028144836426
Final loglikelihood:    -1315.1512095005749
Iterations:             6
Max number of groups:   1
Max predictors/group:   10
IHT estimated 10 nonzero coefficients.
10×3 DataFrame
│ Row │ Group │ Predictor │ Estimated_β │
│     │ Int64 │ Int64     │ Float64     │
├─────┼───────┼───────────┼─────────────┤
│ 1   │ 1     │ 853       │ -0.38744    │
│ 2   │ 1     │ 924       │ 0.240514    │
│ 3   │ 1     │ 2703      │ 0.225127    │
│ 4   │ 1     │ 2757      │ 0.0927794   │
│ 5   │ 1     │ 4241      │ 0.0894244   │
│ 6   │ 1     │ 4783      │ -0.307515   │
│ 7   │ 1     │ 5094      │ 0.215149    │
│ 8   │ 1     │ 5284      │ -0.209308   │
│ 9   │ 1     │ 6921      │ 0.0935756   │
│ 10  │ 1     │ 8255      │ 0.301717    │

Intercept of model = 0.0


In [16]:
#compare true model with reconstruction result
compare_model = DataFrame(
    true_β      = true_b[correct_position], 
    estimated_β = result.beta[correct_position])
@show compare_model
println("Total iteration number was " * string(result.iter))
println("Total time was " * string(result.time))
println("Total found predictors = " * string(length(findall(!iszero, result.beta[correct_position]))))

#clean up
rm("tmp.bed", force=true)

compare_model = 10×2 DataFrame
│ Row │ true_β     │ estimated_β │
│     │ Float64    │ Float64     │
├─────┼────────────┼─────────────┤
│ 1   │ -0.389892  │ -0.38744    │
│ 2   │ -0.0653099 │ 0.0         │
│ 3   │ 0.235865   │ 0.240514    │
│ 4   │ 0.17977    │ 0.225127    │
│ 5   │ 0.0851134  │ 0.0894244   │
│ 6   │ -0.33761   │ -0.307515   │
│ 7   │ 0.208012   │ 0.215149    │
│ 8   │ -0.203127  │ -0.209308   │
│ 9   │ 0.0441809  │ 0.0         │
│ 10  │ 0.310431   │ 0.301717    │
Total iteration number was 6
Total time was 0.7138028144836426
Total found predictors = 8


## Negative Binomial response

In [17]:
#simulate data with k true predictors from a negative binomial distribution, fixing the number of successes to nn
n = 1000
p = 10000
k = 10
d = NegativeBinomial
l = LogLink()
nn = 10 

#set random seed
Random.seed!(2019)

#construct snpmatrix, covariate files, and true model b
x, maf = simulate_random_snparray(n, p, "tmp.bed")
xbm = SnpBitMatrix{Float64}(x, model=ADDITIVE_MODEL, center=true, scale=true); 
z = ones(n, 1) # the intercept
true_b = zeros(p)
true_b[1:k] = rand(Normal(0, 0.3), k)
shuffle!(true_b)
correct_position = findall(x -> x != 0, true_b)

#simulate phenotypes (i.e. response vector y)
μ = linkinv.(l, xbm * true_b)
prob = 1 ./ (1 .+ μ ./ nn) #success probability chosen so that E[y] = μ
y = [rand(d(nn, i)) for i in prob] #number of failures before nn successes occur
y = Float64.(y)

#run IHT
result = L0_reg(x, xbm, z, y, 1, k, d(), l, debias=false, init=false, show_info=false, convg=false)

IHT results:

Compute time (sec):     1.1765589714050293
Final loglikelihood:    -1431.7171248415666
Iterations:             19
Max number of groups:   1
Max predictors/group:   10
IHT estimated 10 nonzero coefficients.
10×3 DataFrame
│ Row │ Group │ Predictor │ Estimated_β │
│     │ Int64 │ Int64     │ Float64     │
├─────┼───────┼───────────┼─────────────┤
│ 1   │ 1     │ 853       │ -0.391114   │
│ 2   │ 1     │ 924       │ 0.201996    │
│ 3   │ 1     │ 1003      │ 0.113302    │
│ 4   │ 1     │ 2703      │ 0.15804     │
│ 5   │ 1     │ 4241      │ 0.123563    │
│ 6   │ 1     │ 4783      │ -0.227633   │
│ 7   │ 1     │ 5094      │ 0.233542    │
│ 8   │ 1     │ 5284      │ -0.260375   │
│ 9   │ 1     │ 8255      │ 0.318128    │
│ 10  │ 1     │ 9183      │ -0.100754   │

Intercept of model = 0.0


In [18]:
#compare true model with reconstruction result
compare_model = DataFrame(
    true_β      = true_b[correct_position], 
    estimated_β = result.beta[correct_position])
@show compare_model
println("Total iteration number was " * string(result.iter))
println("Total time was " * string(result.time))
println("Total found predictors = " * string(length(findall(!iszero, result.beta[correct_position]))))

#clean up
rm("tmp.bed", force=true)

compare_model = 10×2 DataFrame
│ Row │ true_β     │ estimated_β │
│     │ Float64    │ Float64     │
├─────┼────────────┼─────────────┤
│ 1   │ -0.389892  │ -0.391114   │
│ 2   │ -0.0653099 │ 0.0         │
│ 3   │ 0.235865   │ 0.201996    │
│ 4   │ 0.17977    │ 0.15804     │
│ 5   │ 0.0851134  │ 0.123563    │
│ 6   │ -0.33761   │ -0.227633   │
│ 7   │ 0.208012   │ 0.233542    │
│ 8   │ -0.203127  │ -0.260375   │
│ 9   │ 0.0441809  │ 0.0         │
│ 10  │ 0.310431   │ 0.318128    │
Total iteration number was 19
Total time was 1.1765589714050293
Total found predictors = 8
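
The conversion `prob = 1 ./ (1 .+ μ ./ nn)` used in this section picks the success probability so that each simulated count has mean $\mu_i$. Assuming the `NegativeBinomial(r, p)` convention of Distributions.jl (number of failures before $r$ successes, each with success probability $p$) and writing $r$ for `nn`:

$$
p_i = \frac{1}{1 + \mu_i / r} = \frac{r}{r + \mu_i}, \qquad
\mathrm{E}[y_i] = \frac{r(1 - p_i)}{p_i} = r \cdot \frac{\mu_i}{r} = \mu_i, \qquad
\mathrm{Var}[y_i] = \frac{r(1 - p_i)}{p_i^2} = \mu_i + \frac{\mu_i^2}{r}.
$$

The extra $\mu_i^2 / r$ term is the overdispersion relative to a Poisson response with the same mean.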


## Gamma response

In [19]:
#simulate data with k true predictors, from distribution d and with link l.
n = 1000
p = 10000
k = 10
d = Gamma
l = LogLink()
α = 1 #shape parameter for gamma


#set random seed
Random.seed!(2019)

#construct snpmatrix, covariate files, and true model b
x, maf = simulate_random_snparray(n, p, "tmp.bed")
xbm = SnpBitMatrix{Float64}(x, model=ADDITIVE_MODEL, center=true, scale=true); 
z = ones(n, 1) # the intercept
true_b = zeros(p)
true_b[1:k] = rand(Normal(0, 0.3), k)
shuffle!(true_b)
correct_position = findall(x -> x != 0, true_b)

#simulate phenotypes (i.e. response vector y)
μ = linkinv.(l, xbm * true_b)
β = 1 ./ μ #intended as the rate parameter for the gamma distribution
y = [rand(d(α, i)) for i in β] #note: Gamma(α, θ) in Distributions.jl takes shape and scale, so E[y] = α ./ μ here

#run IHT
result = L0_reg(x, xbm, z, y, 1, k, d(), l, debias=false, init=false, show_info=false, convg=true)

IHT results:

Compute time (sec):     2.1839661598205566
Final loglikelihood:    -1005.6789334756007
Iterations:             32
Max number of groups:   1
Max predictors/group:   10
IHT estimated 10 nonzero coefficients.
10×3 DataFrame
│ Row │ Group │ Predictor │ Estimated_β │
│     │ Int64 │ Int64     │ Float64     │
├─────┼───────┼───────────┼─────────────┤
│ 1   │ 1     │ 796       │ 0.070883    │
│ 2   │ 1     │ 853       │ 0.312042    │
│ 3   │ 1     │ 1537      │ 0.0706703   │
│ 4   │ 1     │ 2703      │ -0.193457   │
│ 5   │ 1     │ 4000      │ 0.116691    │
│ 6   │ 1     │ 4536      │ -0.078473   │
│ 7   │ 1     │ 4783      │ 0.329159    │
│ 8   │ 1     │ 5094      │ -0.237703   │
│ 9   │ 1     │ 5284      │ 0.243424    │
│ 10  │ 1     │ 8255      │ -0.299223   │

Intercept of model = 0.0


In [20]:
#compare true model with reconstruction result
compare_model = DataFrame(
    true_β      = true_b[correct_position], 
    estimated_β = result.beta[correct_position])
@show compare_model
println("Total iteration number was " * string(result.iter))
println("Total time was " * string(result.time))
println("Total found predictors = " * string(length(findall(!iszero, result.beta[correct_position]))))

#clean up
rm("tmp.bed", force=true)

compare_model = 10×2 DataFrame
│ Row │ true_β     │ estimated_β │
│     │ Float64    │ Float64     │
├─────┼────────────┼─────────────┤
│ 1   │ -0.389892  │ 0.312042    │
│ 2   │ -0.0653099 │ 0.0         │
│ 3   │ 0.235865   │ 0.0         │
│ 4   │ 0.17977    │ -0.193457   │
│ 5   │ 0.0851134  │ 0.0         │
│ 6   │ -0.33761   │ 0.329159    │
│ 7   │ 0.208012   │ -0.237703   │
│ 8   │ -0.203127  │ 0.243424    │
│ 9   │ 0.0441809  │ 0.0         │
│ 10  │ 0.310431   │ -0.299223   │
Total iteration number was 32
Total time was 2.1839661598205566
Total found predictors = 6
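
The sign pattern in the table above follows from how the response was simulated. Distributions.jl documents `Gamma(α, θ)` as taking a shape $\alpha$ and a scale $\theta$, so passing $1/\mu_i$ as the second argument gives, with $\alpha = 1$ and the log link:

$$
\mathrm{E}[y_i] = \alpha\,\theta_i = \frac{1}{\mu_i} = \exp\left(-x_i^T \beta\right),
\qquad
\log \mathrm{E}[y_i] = x_i^T(-\beta).
$$

A log-link fit to these data therefore targets $-\beta$, which matches the flipped signs of the nonzero estimates. Using the scale $\theta_i = \mu_i / \alpha$ instead would give $\mathrm{E}[y_i] = \mu_i$.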


## Inverse Gaussian