# Cross-model selection and averaging

In [1]:
using Revise
using MendelIHT
using GLM
using Random

┌ Info: Precompiling MendelIHT [921c7187-1484-5754-b919-5d3ed9ac03c4]
└ @ Base loading.jl:1423


Simulate some data

In [2]:
n = 1000            # number of samples
p = 10000           # number of SNPs
k = 10              # 10 causal SNPs
d = Normal
l = IdentityLink()

# set random seed
Random.seed!(0)

# simulate `sim.bed` file with no missing data
x = simulate_random_snparray("sim.bed", n, p)
xla = SnpLinAlg{Float64}(x, model=ADDITIVE_MODEL, center=true, scale=true, impute=true) 

# 2 nongenetic covariates: first column is the intercept, second column is sex (0 = male, 1 = female)
z = ones(n, 2) 
z[:, 2] .= rand(0:1, n)
standardize!(@view(z[:, 2:end])) 

# randomly set genetic predictors where causal βᵢ ~ N(0, 1)
true_b = zeros(p) 
true_b[1:k] = randn(k)
shuffle!(true_b)

# find correct position of genetic predictors
correct_position = findall(!iszero, true_b)

# define effect size of non-genetic predictors: intercept & sex
true_c = [1.0; 1.5] 

# simulate phenotype using genetic and nongenetic predictors
prob = GLM.linkinv.(l, xla * true_b .+ z * true_c) # note genotype-vector multiplication is done with `xla`
y = [rand(d(i)) for i in prob]
y = Float64.(y)

1000-element Vector{Float64}:
 -1.8997409590212744
  1.2243772748090584
  2.4192689435118764
  2.7948388319602397
 -1.4045085904368817
  5.5738136999828605
  2.679103975966605
  4.365412876072738
  5.295715346411713
 -3.1540481733120838
  4.539318079054713
 -3.105309364316101
  3.6936846743098983
  ⋮
 -0.34204344934988445
  7.9952776043608145
  4.089320835300376
  3.0971399310899925
 -3.9023325819317827
  3.605057276019452
 -7.332706623078501
 -2.65013014125772
 -6.379570954859518
  5.622602580893729
 -5.706664350827031
 -9.942739493398687
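
With `d = Normal` and `l = IdentityLink()`, the phenotype above is the linear predictor plus standard normal noise. The following is a minimal dense sketch of the same recipe using only the standard library; the toy sizes, the Gaussian stand-in `X` for the genotype matrix, and the manual standardization are assumptions for illustration, not what `simulate_random_snparray` or `standardize!` do internally.

```julia
using Random, Statistics

Random.seed!(0)
n, p, k = 200, 50, 5              # toy sizes, far smaller than the notebook's

# Dense stand-in for the standardized genotype matrix (assumption)
X = randn(n, p)

# Intercept plus one standardized binary covariate
Z = ones(n, 2)
sex = Float64.(rand(0:1, n))
Z[:, 2] .= (sex .- mean(sex)) ./ std(sex)

# k causal effects drawn from N(0, 1), placed at random positions
β = zeros(p)
β[1:k] .= randn(k)
shuffle!(β)
c = [1.0, 1.5]                    # intercept and sex effects

# Identity link: the mean is the linear predictor; Normal(μ) has unit variance
μ = X * β .+ Z * c
y = μ .+ randn(n)
```

Because the link is the identity, `GLM.linkinv` leaves the linear predictor unchanged, so drawing from `Normal(μ)` is the same as adding `randn()` to `μ`.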

Check that IHT recovers the causal effects when given the true sparsity level $k = 10$

In [3]:
result = fit_iht(y, xla, z, k=10)
[true_b[correct_position] result.beta[correct_position]]

****                   MendelIHT Version 1.4.7                  ****
****     Benjamin Chu, Kevin Keys, Chris German, Hua Zhou       ****
****   Jin Zhou, Eric Sobel, Janet Sinsheimer, Kenneth Lange    ****
****                                                            ****
****                 Please cite our paper!                     ****
****         https://doi.org/10.1093/gigascience/giaa044        ****

Running sparse linear regression
Number of threads = 8
Link functin = IdentityLink()
Sparsity parameter (k) = 10
Prior weight scaling = off
Doubly sparse projection = off
Debias = off
Max IHT iterations = 200
Converging when tol < 0.0001 and iteration ≥ 5:

Iteration 1: loglikelihood = -1797.077383122223, backtracks = 0, tol = 0.9819433117510975
Iteration 2: loglikelihood = -1431.8977997334912, backtracks = 0, tol = 0.16858048728230898
Iteration 3: loglikelihood = -1407.615958437318, backtracks = 0, tol = 0.040998511595788045
Iteration 4: loglikelihood = -1407.230100352904, back

10×2 Matrix{Float64}:
  2.16158    2.19796
 -0.506129  -0.466407
 -0.51458   -0.521152
 -0.194935  -0.199011
  0.833797   0.868312
 -1.67266   -1.67367
 -1.80487   -1.77787
  0.39327    0.447257
 -0.836212  -0.846399
 -1.85502   -1.82628
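
The side-by-side matrix can be summarized with a few numbers. The sketch below copies the printed values into plain vectors (the names `truth` and `estimate` are introduced here for illustration) and reports how many causal SNPs were kept, the largest absolute error, and whether every sign matches.

```julia
# Values copied from the 10×2 matrix printed above
# (column 1: simulated effects, column 2: IHT estimates).
truth    = [2.16158, -0.506129, -0.51458, -0.194935, 0.833797,
            -1.67266, -1.80487, 0.39327, -0.836212, -1.85502]
estimate = [2.19796, -0.466407, -0.521152, -0.199011, 0.868312,
            -1.67367, -1.77787, 0.447257, -0.846399, -1.82628]

n_recovered = count(!iszero, estimate)           # causal SNPs kept in the model
max_abs_err = maximum(abs.(estimate .- truth))   # worst-case estimation error
same_sign   = all(sign.(estimate) .== sign.(truth))

println("recovered: ", n_recovered, " of ", length(truth))
println("max |error|: ", round(max_abs_err, digits=3))
println("signs agree: ", same_sign)
```

On these printed values all 10 causal positions have nonzero estimates, and the largest gap is about 0.054.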

## Test CMSA procedure

In [20]:
result = cmsa_iht(y, xla, z, kmin=1, kmax=50000)
[result.k result.loss]

Successfully reached early stop! Exiting.


88×2 Matrix{Float64}:
     1.0  2552.11
     2.0  1855.05
     3.0  1322.32
     4.0   667.035
     5.0   493.88
     6.0   347.91
     7.0   302.453
     8.0   255.224
     9.0   208.199
    10.0   201.1
    11.0   207.598
    12.0   210.567
    14.0   216.485
     ⋮    
 15027.0    Inf
 16762.0    Inf
 18698.0    Inf
 20857.0    Inf
 23266.0    Inf
 25953.0    Inf
 28950.0    Inf
 32293.0    Inf
 36023.0    Inf
 40183.0    Inf
 44823.0    Inf
 50000.0    Inf
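
To read off the selected sparsity level from such a table, take the `k` with the smallest loss. The sketch below does this on the rows visible in the printed output only; the elided rows are not reproduced, and just two of the `Inf` rows are included as representatives. In the notebook session the same idea would presumably read `result.k[argmin(result.loss)]`, using the two fields shown above.

```julia
# Rows visible in the printed 88×2 matrix; elided rows (⋮) are not reproduced.
ks   = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15027, 50000]
loss = [2552.11, 1855.05, 1322.32, 667.035, 493.88, 347.91, 302.453,
        255.224, 208.199, 201.1, 207.598, 210.567, 216.485, Inf, Inf]

best   = argmin(loss)        # Inf entries can never be the minimum
best_k = ks[best]
println("lowest visible loss ", loss[best], " at k = ", best_k)
```

Among the visible rows the loss is lowest at k = 10, which matches the number of causal SNPs used in the simulation.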