# IHT simulations for various GLMs

IHT can fit generalized linear models in the high-dimensional setting $p \gg n$, which is common in [genome-wide association studies](https://en.wikipedia.org/wiki/Genome-wide_association_study) (GWAS). In this tutorial, we simulate response data from 5 distributions (normal, Bernoulli, Poisson, negative binomial, and gamma) and compare the coefficients IHT recovers with the true model on simulated GWAS datasets.
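To make the idea of iterative hard thresholding concrete, here is a minimal sketch of the textbook least-squares version: take a gradient step, then keep only the $k$ largest-magnitude coefficients. The helper names `project_k` and `iht_least_squares`, the fixed step size, and the toy data are assumptions made for illustration; this is not MendelIHT's implementation, which handles general GLMs and compressed genotype data.

```julia
using LinearAlgebra, Random

# Keep the k largest-magnitude entries of v and set the rest to zero.
# (Hypothetical helper for illustration; not part of MendelIHT.)
function project_k(v, k)
    out = zeros(float(eltype(v)), length(v))
    k <= 0 && return out
    keep = partialsortperm(abs.(v), 1:min(k, length(v)), rev=true)
    out[keep] = v[keep]
    return out
end

# Toy IHT for least squares: b <- P_k(b + s * X'(y - X*b)).
function iht_least_squares(X, y, k; iters=100)
    s = 1 / opnorm(X)^2      # conservative fixed step size (an assumption of this sketch)
    b = zeros(size(X, 2))
    for _ in 1:iters
        b = project_k(b + s * (X' * (y - X * b)), k)
    end
    return b
end

Random.seed!(1)
X = randn(50, 200)                        # n = 50 samples, p = 200 predictors
b_true = zeros(200)
b_true[[3, 77, 150]] = [2.0, -1.5, 1.0]   # 3 true predictors
y = X * b_true                            # noiseless toy response
b_hat = iht_least_squares(X, y, 3)
println(findall(!iszero, b_hat))          # indices of the selected predictors
```

By construction the returned vector has at most `k` nonzero entries, which is the same sparsity constraint the `k` argument imposes in the `L0_reg` calls in this tutorial.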


In [1]:
using MendelIHT
using SnpArrays
using DataFrames
using Distributions
using BenchmarkTools
using Random
using LinearAlgebra
using GLM



## Normal response (i.e. quantitative responses)

In [4]:
#simulate data with k true predictors, from distribution d and with link l.
n = 1000
p = 10000
k = 10
d = Normal
l = canonicallink(d())

#set random seed
Random.seed!(2019)

#construct snpmatrix, covariate files, and true model b
x, maf = simulate_random_snparray(n, p, "tmp.bed")
xbm = SnpBitMatrix{Float64}(x, model=ADDITIVE_MODEL, center=true, scale=true); 
z = ones(n, 1) # the intercept
true_b = zeros(p)
true_b[1:k] = randn(k)
shuffle!(true_b)
correct_position = findall(x -> x != 0, true_b)

#simulate phenotypes (i.e. response vector y)
prob = linkinv.(l, xbm * true_b)
y = [rand(d(i)) for i in prob]
y = Float64.(y)

#compute result
result = L0_reg(x, xbm, z, y, 1, k, d(), l, debias=false, init=false, show_info=false)

IHT results:

Compute time (sec):     0.3808889389038086
Final loglikelihood:    -1406.8807652877347
Iterations:             6
Max number of groups:   1
Max predictors/group:   10
IHT estimated 10 nonzero coefficients.
10×3 DataFrame
│ Row │ Group │ Predictor │ Estimated_β │
│     │ Int64 │ Int64     │ Float64     │
├─────┼───────┼───────────┼─────────────┤
│ 1   │ 1     │ 853       │ -1.24117    │
│ 2   │ 1     │ 877       │ -0.234677   │
│ 3   │ 1     │ 924       │ 0.82014     │
│ 4   │ 1     │ 2703      │ 0.583402    │
│ 5   │ 1     │ 4241      │ 0.298304    │
│ 6   │ 1     │ 4783      │ -1.14459    │
│ 7   │ 1     │ 5094      │ 0.673013    │
│ 8   │ 1     │ 5284      │ -0.709734   │
│ 9   │ 1     │ 7760      │ 0.168659    │
│ 10  │ 1     │ 8255      │ 1.08117     │

Intercept of model = 0.0


In [5]:
#compare true model with reconstruction result
compare_model = DataFrame(
    true_β      = true_b[correct_position], 
    estimated_β = result.beta[correct_position])
@show compare_model
println("Total iteration number was " * string(result.iter))
println("Total time was " * string(result.time))
println("Total found predictors = " * string(length(findall(!iszero, result.beta[correct_position]))))

#clean up
rm("tmp.bed", force=true)

compare_model = 10×2 DataFrame
│ Row │ true_β   │ estimated_β │
│     │ Float64  │ Float64     │
├─────┼──────────┼─────────────┤
│ 1   │ -1.29964 │ -1.24117    │
│ 2   │ -0.2177  │ -0.234677   │
│ 3   │ 0.786217 │ 0.82014     │
│ 4   │ 0.599233 │ 0.583402    │
│ 5   │ 0.283711 │ 0.298304    │
│ 6   │ -1.12537 │ -1.14459    │
│ 7   │ 0.693374 │ 0.673013    │
│ 8   │ -0.67709 │ -0.709734   │
│ 9   │ 0.14727  │ 0.168659    │
│ 10  │ 1.03477  │ 1.08117     │
Total iteration number was 6
Total time was 0.3808889389038086
Total found predictors = 10
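
The "Total found predictors" line counts only true positives, that is, true predictors whose estimate is nonzero. A fuller summary also reports false positives and false negatives. The sketch below uses a hypothetical helper, `support_recovery`, and toy vectors; neither comes from MendelIHT.

```julia
# Hypothetical helper (not part of MendelIHT): summarize how well the
# estimated support matches the true support.
function support_recovery(true_coef::AbstractVector, est_coef::AbstractVector)
    true_support = findall(!iszero, true_coef)
    est_support  = findall(!iszero, est_coef)
    true_pos  = length(intersect(true_support, est_support))  # correctly selected
    false_pos = length(setdiff(est_support, true_support))    # selected but not causal
    false_neg = length(setdiff(true_support, est_support))    # causal but missed
    return (true_pos=true_pos, false_pos=false_pos, false_neg=false_neg)
end

toy_true = [0.0, 1.2, 0.0, -0.4, 0.0, 0.3]
toy_est  = [0.0, 1.1, 0.2, -0.5, 0.0, 0.0]
println(support_recovery(toy_true, toy_est))
# prints (true_pos = 2, false_pos = 1, false_neg = 1)
```

In this notebook the same function could be called as `support_recovery(true_b, result.beta)`. When the fit reports exactly `k` nonzero coefficients and there are `k` true predictors, as in the runs shown here, the false-positive and false-negative counts are equal.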


## Bernoulli response (i.e. case-control studies)

In [6]:
n = 1000
p = 15000
k = 10
d = Bernoulli
l = ProbitLink() #could use LogitLink() or CloglogLink() instead

#set random seed
Random.seed!(2019)

#construct snpmatrix, covariate files, and true model b
x, maf = simulate_random_snparray(n, p, "tmp.bed")
xbm = SnpBitMatrix{Float64}(x, model=ADDITIVE_MODEL, center=true, scale=true); 
z = ones(n, 1) # the intercept
true_b = zeros(p)
true_b[1:k] = randn(k)
shuffle!(true_b)
correct_position = findall(x -> x != 0, true_b)

#simulate phenotypes (i.e. response vector y)
prob = linkinv.(l, xbm * true_b)
y = [rand(d(i)) for i in prob]
y = Float64.(y)

#run IHT
result = L0_reg(x, xbm, z, y, 1, k, d(), l, debias=false, init=false, show_info=false)

IHT results:

Compute time (sec):     1.5665409564971924
Final loglikelihood:    -269.26434907787626
Iterations:             16
Max number of groups:   1
Max predictors/group:   10
IHT estimated 10 nonzero coefficients.
10×3 DataFrame
│ Row │ Group │ Predictor │ Estimated_β │
│     │ Int64 │ Int64     │ Float64     │
├─────┼───────┼───────────┼─────────────┤
│ 1   │ 1     │ 297       │ -0.468073   │
│ 2   │ 1     │ 3979      │ 0.718565    │
│ 3   │ 1     │ 5859      │ 0.996407    │
│ 4   │ 1     │ 5973      │ 0.815285    │
│ 5   │ 1     │ 7005      │ 0.792825    │
│ 6   │ 1     │ 10514     │ -1.23212    │
│ 7   │ 1     │ 11089     │ -0.665506   │
│ 8   │ 1     │ 11921     │ 0.239418    │
│ 9   │ 1     │ 12536     │ -0.236023   │
│ 10  │ 1     │ 14460     │ -0.254735   │

Intercept of model = 0.0


In [7]:
#compare true model with reconstruction result
compare_model = DataFrame(
    true_β      = true_b[correct_position], 
    estimated_β = result.beta[correct_position])
@show compare_model
println("Total iteration number was " * string(result.iter))
println("Total time was " * string(result.time))
println("Total found predictors = " * string(length(findall(!iszero, result.beta[correct_position]))))

#clean up
rm("tmp.bed", force=true)

compare_model = 10×2 DataFrame
│ Row │ true_β    │ estimated_β │
│     │ Float64   │ Float64     │
├─────┼───────────┼─────────────┤
│ 1   │ -0.549133 │ -0.468073   │
│ 2   │ -0.185487 │ 0.0         │
│ 3   │ 0.680247  │ 0.718565    │
│ 4   │ 0.918479  │ 0.996407    │
│ 5   │ 0.851805  │ 0.815285    │
│ 6   │ 0.750598  │ 0.792825    │
│ 7   │ -1.25515  │ -1.23212    │
│ 8   │ -0.600394 │ -0.665506   │
│ 9   │ 0.141611  │ 0.239418    │
│ 10  │ 0.230208  │ 0.0         │
Total iteration number was 16
Total time was 1.5665409564971924
Total found predictors = 8
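
The code comment above mentions `LogitLink()` and `CloglogLink()` as alternatives to `ProbitLink()`. For reference, each link's inverse maps the linear predictor to a success probability as follows. The symbols $\eta_i = x_i^T\beta$ (linear predictor), $\mu_i$ (success probability), and $\Phi$ (standard normal CDF) are notation introduced here, not taken from the notebook.

$$
\begin{aligned}
\text{probit:} \quad & \mu_i = \Phi(\eta_i) \\
\text{logit:} \quad & \mu_i = \frac{1}{1 + e^{-\eta_i}} \\
\text{cloglog:} \quad & \mu_i = 1 - \exp\left(-e^{\eta_i}\right)
\end{aligned}
$$

In the simulation, `linkinv.(l, xbm * true_b)` evaluates the chosen inverse link elementwise, and each $y_i$ is then drawn from a Bernoulli distribution with probability $\mu_i$.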


## Poisson response (i.e. count data)

In [8]:
#simulate data with k true predictors, from distribution d and with link l.
n = 1000
p = 10000
k = 10
d = Poisson
l = canonicallink(d())

#set random seed
Random.seed!(2019)

#construct snpmatrix, covariate files, and true model b
x, maf = simulate_random_snparray(n, p, "tmp.bed")
xbm = SnpBitMatrix{Float64}(x, model=ADDITIVE_MODEL, center=true, scale=true); 
z = ones(n, 1) # the intercept
true_b = zeros(p)
true_b[1:k] = rand(Normal(0, 0.3), k)
shuffle!(true_b)
correct_position = findall(x -> x != 0, true_b)

#simulate phenotypes (i.e. response vector y)
prob = linkinv.(l, xbm * true_b)
y = [rand(d(i)) for i in prob]
y = Float64.(y)

#run IHT
result = L0_reg(x, xbm, z, y, 1, k, d(), l, debias=false, init=false, show_info=false)

IHT results:

Compute time (sec):     0.960075855255127
Final loglikelihood:    -1306.2397127817226
Iterations:             14
Max number of groups:   1
Max predictors/group:   10
IHT estimated 10 nonzero coefficients.
10×3 DataFrame
│ Row │ Group │ Predictor │ Estimated_β │
│     │ Int64 │ Int64     │ Float64     │
├─────┼───────┼───────────┼─────────────┤
│ 1   │ 1     │ 853       │ -0.388286   │
│ 2   │ 1     │ 924       │ 0.242203    │
│ 3   │ 1     │ 2703      │ 0.233816    │
│ 4   │ 1     │ 2757      │ 0.0984703   │
│ 5   │ 1     │ 4783      │ -0.303483   │
│ 6   │ 1     │ 5094      │ 0.213278    │
│ 7   │ 1     │ 5284      │ -0.212898   │
│ 8   │ 1     │ 6921      │ 0.0859946   │
│ 9   │ 1     │ 7006      │ -0.0941811  │
│ 10  │ 1     │ 8255      │ 0.301585    │

Intercept of model = 0.0


In [9]:
#compare true model with reconstruction result
compare_model = DataFrame(
    true_β      = true_b[correct_position], 
    estimated_β = result.beta[correct_position])
@show compare_model
println("Total iteration number was " * string(result.iter))
println("Total time was " * string(result.time))
println("Total found predictors = " * string(length(findall(!iszero, result.beta[correct_position]))))

#clean up
rm("tmp.bed", force=true)

compare_model = 10×2 DataFrame
│ Row │ true_β     │ estimated_β │
│     │ Float64    │ Float64     │
├─────┼────────────┼─────────────┤
│ 1   │ -0.389892  │ -0.388286   │
│ 2   │ -0.0653099 │ 0.0         │
│ 3   │ 0.235865   │ 0.242203    │
│ 4   │ 0.17977    │ 0.233816    │
│ 5   │ 0.0851134  │ 0.0         │
│ 6   │ -0.33761   │ -0.303483   │
│ 7   │ 0.208012   │ 0.213278    │
│ 8   │ -0.203127  │ -0.212898   │
│ 9   │ 0.0441809  │ 0.0         │
│ 10  │ 0.310431   │ 0.301585    │
Total iteration number was 14
Total time was 0.960075855255127
Total found predictors = 7


## Negative Binomial response

In [10]:
#simulate data with k true predictors from Negative Binomial, fixing the number of successes to be nn
n = 1000
p = 10000
k = 10
d = NegativeBinomial
l = LogLink()
nn = 10 

#set random seed
Random.seed!(2019)

#construct snpmatrix, covariate files, and true model b
x, maf = simulate_random_snparray(n, p, "tmp.bed")
xbm = SnpBitMatrix{Float64}(x, model=ADDITIVE_MODEL, center=true, scale=true); 
z = ones(n, 1) # the intercept
true_b = zeros(p)
true_b[1:k] = rand(Normal(0, 0.3), k)
shuffle!(true_b)
correct_position = findall(x -> x != 0, true_b)

#simulate phenotypes (i.e. response vector y)
μ = linkinv.(l, xbm * true_b)
prob = 1 ./ (1 .+ μ ./ nn)
y = [rand(d(nn, i)) for i in prob] #number of failures before nn successes occur
y = Float64.(y)

#run IHT
result = L0_reg(x, xbm, z, y, 1, k, d(), l, debias=false, init=false, show_info=false)

IHT results:

Compute time (sec):     1.2406079769134521
Final loglikelihood:    -1431.7171279141987
Iterations:             20
Max number of groups:   1
Max predictors/group:   10
IHT estimated 10 nonzero coefficients.
10×3 DataFrame
│ Row │ Group │ Predictor │ Estimated_β │
│     │ Int64 │ Int64     │ Float64     │
├─────┼───────┼───────────┼─────────────┤
│ 1   │ 1     │ 853       │ -0.391195   │
│ 2   │ 1     │ 924       │ 0.201995    │
│ 3   │ 1     │ 1003      │ 0.113309    │
│ 4   │ 1     │ 2703      │ 0.15803     │
│ 5   │ 1     │ 4241      │ 0.123561    │
│ 6   │ 1     │ 4783      │ -0.227624   │
│ 7   │ 1     │ 5094      │ 0.233531    │
│ 8   │ 1     │ 5284      │ -0.260377   │
│ 9   │ 1     │ 8255      │ 0.318111    │
│ 10  │ 1     │ 9183      │ -0.100758   │

Intercept of model = 0.0


In [11]:
#compare true model with reconstruction result
compare_model = DataFrame(
    true_β      = true_b[correct_position], 
    estimated_β = result.beta[correct_position])
@show compare_model
println("Total iteration number was " * string(result.iter))
println("Total time was " * string(result.time))
println("Total found predictors = " * string(length(findall(!iszero, result.beta[correct_position]))))

#clean up
rm("tmp.bed", force=true)

compare_model = 10×2 DataFrame
│ Row │ true_β     │ estimated_β │
│     │ Float64    │ Float64     │
├─────┼────────────┼─────────────┤
│ 1   │ -0.389892  │ -0.391195   │
│ 2   │ -0.0653099 │ 0.0         │
│ 3   │ 0.235865   │ 0.201995    │
│ 4   │ 0.17977    │ 0.15803     │
│ 5   │ 0.0851134  │ 0.123561    │
│ 6   │ -0.33761   │ -0.227624   │
│ 7   │ 0.208012   │ 0.233531    │
│ 8   │ -0.203127  │ -0.260377   │
│ 9   │ 0.0441809  │ 0.0         │
│ 10  │ 0.310431   │ 0.318111    │
Total iteration number was 20
Total time was 1.2406079769134521
Total found predictors = 8
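
The conversion `prob = 1 ./ (1 .+ μ ./ nn)` in the simulation can be checked with a short derivation. It assumes the convention documented for Distributions.jl's `NegativeBinomial(r, p)`: the number of failures before the $r$-th success, with success probability $p$. Here $r$ stands for `nn` and $\mu_i$ for an entry of `μ`.

$$
p_i = \frac{1}{1 + \mu_i / r} = \frac{r}{r + \mu_i}
\quad\Longrightarrow\quad
\frac{1 - p_i}{p_i} = \frac{\mu_i}{r}
\quad\Longrightarrow\quad
\mathbb{E}[y_i] = \frac{r(1 - p_i)}{p_i} = \mu_i .
$$

Under the same convention the variance is

$$
\operatorname{Var}(y_i) = \frac{r(1 - p_i)}{p_i^2} = \mu_i + \frac{\mu_i^2}{r},
$$

so the simulated counts are overdispersed relative to a Poisson response with the same mean.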


## Gamma response

In [12]:
#simulate data with k true predictors, from distribution d and with link l.
n = 1000
p = 10000
k = 10
d = Gamma
l = LogLink()
α = 1 #shape parameter for gamma


#set random seed
Random.seed!(2019)

#construct snpmatrix, covariate files, and true model b
x, maf = simulate_random_snparray(n, p, "tmp.bed")
xbm = SnpBitMatrix{Float64}(x, model=ADDITIVE_MODEL, center=true, scale=true); 
z = ones(n, 1) # the intercept
true_b = zeros(p)
true_b[1:k] = rand(Normal(0, 0.3), k)
shuffle!(true_b)
correct_position = findall(x -> x != 0, true_b)

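#simulate phenotypes (i.e. response vector y)
μ = linkinv.(l, xbm * true_b)
#caution: Distributions.jl's Gamma(α, θ) takes shape and scale, not rate, so with
#β = 1 ./ μ the responses have mean α ./ μ rather than μ; under the log link this
#negates the linear predictor, so the estimates below have the opposite sign of true_b
β = 1 ./ μ
y = [rand(d(α, i)) for i in β]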

#run IHT
result = L0_reg(x, xbm, z, y, 1, k, d(), l, debias=false, init=false, show_info=false)

IHT results:

Compute time (sec):     6.582669973373413
Final loglikelihood:    -994.1409105420939
Iterations:             113
Max number of groups:   1
Max predictors/group:   10
IHT estimated 10 nonzero coefficients.
10×3 DataFrame
│ Row │ Group │ Predictor │ Estimated_β │
│     │ Int64 │ Int64     │ Float64     │
├─────┼───────┼───────────┼─────────────┤
│ 1   │ 1     │ 853       │ 0.326666    │
│ 2   │ 1     │ 924       │ -0.201214   │
│ 3   │ 1     │ 2703      │ -0.189991   │
│ 4   │ 1     │ 3951      │ 0.157131    │
│ 5   │ 1     │ 4000      │ 0.117061    │
│ 6   │ 1     │ 4783      │ 0.277899    │
│ 7   │ 1     │ 5094      │ -0.22218    │
│ 8   │ 1     │ 5284      │ 0.248775    │
│ 9   │ 1     │ 8255      │ -0.286029   │
│ 10  │ 1     │ 9778      │ 0.139003    │

Intercept of model = 0.0


In [13]:
#compare true model with reconstruction result
compare_model = DataFrame(
    true_β      = true_b[correct_position], 
    estimated_β = result.beta[correct_position])
@show compare_model
println("Total iteration number was " * string(result.iter))
println("Total time was " * string(result.time))
println("Total found predictors = " * string(length(findall(!iszero, result.beta[correct_position]))))

#clean up
rm("tmp.bed", force=true)

compare_model = 10×2 DataFrame
│ Row │ true_β     │ estimated_β │
│     │ Float64    │ Float64     │
├─────┼────────────┼─────────────┤
│ 1   │ -0.389892  │ 0.326666    │
│ 2   │ -0.0653099 │ 0.0         │
│ 3   │ 0.235865   │ -0.201214   │
│ 4   │ 0.17977    │ -0.189991   │
│ 5   │ 0.0851134  │ 0.0         │
│ 6   │ -0.33761   │ 0.277899    │
│ 7   │ 0.208012   │ -0.22218    │
│ 8   │ -0.203127  │ 0.248775    │
│ 9   │ 0.0441809  │ 0.0         │
│ 10  │ 0.310431   │ -0.286029   │
Total iteration number was 113
Total time was 6.582669973373413
Total found predictors = 7
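
The nonzero estimates in the table above have the opposite sign of the true coefficients. This follows from how the gamma distribution is parameterized. As a sketch, assume the shape-scale convention that Distributions.jl documents for `Gamma(α, θ)`, and write $\eta_i = x_i^T\beta_{\text{true}}$ for the linear predictor (notation introduced here, distinct from the code variable `β`). The simulation passes $1/\mu_i$ as the second argument with $\alpha = 1$, so

$$
\mathbb{E}[y_i] = \alpha\,\theta_i = \frac{1}{\mu_i} = e^{-\eta_i},
\qquad\text{whereas the intended model is}\qquad
\mathbb{E}[y_i] = \mu_i = e^{\eta_i}.
$$

Fitting with a log link therefore recovers approximately $-\beta_{\text{true}}$ on the selected predictors. A scale that preserves the mean is

$$
\theta_i = \frac{\mu_i}{\alpha}
\quad\Longrightarrow\quad
\mathbb{E}[y_i] = \alpha\,\theta_i = \mu_i ,
$$

which corresponds to passing `μ ./ α` instead of `1 ./ μ`. That variant was not run for this tutorial, so no output is shown for it.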


## Inverse Gaussian

## Binomial