# Testing the `initialize_beta` function

In IHT, we can initialize the beta values to their univariate estimates. That is, $\beta_i$ is set to the slope estimated by regressing the response $y$ on $x_i$ together with an intercept term.
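
For reference, the univariate fit for column $i$ solves the $2 \times 2$ normal equations below, where $x_{ji}$ is entry $j$ of $x_i$, $n$ is the number of samples, and $\hat\alpha_i$ denotes the intercept (a symbol introduced here for illustration only):

$$
\begin{pmatrix} n & \sum_j x_{ji} \\ \sum_j x_{ji} & \sum_j x_{ji}^2 \end{pmatrix}
\begin{pmatrix} \hat\alpha_i \\ \hat\beta_i \end{pmatrix}
=
\begin{pmatrix} \sum_j y_j \\ \sum_j x_{ji} y_j \end{pmatrix}
$$

Solving for the second component gives the usual simple-regression slope, with $\bar x_i$ and $\bar y$ the sample means:

$$
\hat\beta_i = \frac{\sum_j (x_{ji} - \bar x_i)(y_j - \bar y)}{\sum_j (x_{ji} - \bar x_i)^2}
$$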

In [1]:
using Revise
using MendelIHT
using SnpArrays
using Random
using GLM
using DelimitedFiles
using Test
using Distributions
using LinearAlgebra
using CSV
using DataFrames
using StatsBase
using TraitSimulation

In [2]:
"""
    linreg!(x, y, xtx_store, xty_store)

Performs linear regression with `y` as the response and `x` plus a vector of
ones (the intercept) as covariates. The estimate `β̂ = [intercept, slope]`
overwrites `xty_store`, which is also returned.

Code inspired from Doug Bates on Discourse:
https://discourse.julialang.org/t/efficient-way-of-doing-linear-regression/31232/28
"""
function linreg!(
    x::AbstractVector{T},
    y::AbstractVector{T},
    xtx_store::AbstractMatrix{T} = zeros(T, 2, 2),
    xty_store::AbstractVector{T} = zeros(T, 2)
    ) where {T<:AbstractFloat}
    N = length(x)
    N == length(y) || throw(DimensionMismatch())
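    # fill only the upper triangle of [1 x]'[1 x]; Symmetric(_, :U) ignores the rest
    xtx_store[1, 1] = N
    xtx_store[1, 2] = sum(x)
    xtx_store[2, 2] = sum(abs2, x)
    # right-hand side [1 x]'y
    xty_store[1] = sum(y)
    xty_store[2] = dot(x, y)
    # solve the 2×2 normal equations in place via Cholesky
    ldiv!(cholesky!(Symmetric(xtx_store, :U)), xty_store)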
    return xty_store
end

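"""
    initialize_beta(y::AbstractVector, x::AbstractMatrix)

Returns a vector whose `i`th entry is the slope from regressing `y` on
`x[:, i]` plus an intercept.
"""
function initialize_beta(y::AbstractVector, x::AbstractMatrix)
    n, p = size(x)
    # workspaces reused across columns to avoid allocating in the loop
    xtx_store = zeros(2, 2)
    xty_store = zeros(2)
    β = zeros(p)
    for i in 1:p
        linreg!(@view(x[:, i]), y, xtx_store, xty_store)
        β[i] = xty_store[2] # keep the slope, discard the intercept
    end
    return β
end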

initialize_beta (generic function with 1 method)
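
As a quick sanity check on the idea, the slope from a regression with an intercept should equal `cov(x, y) / var(x)`. The sketch below compares that closed form with a least-squares solve that uses an explicit column of ones. `univariate_slope` is a hypothetical helper written for this illustration and is not part of MendelIHT; the simulated data and seed are likewise arbitrary.

```julia
using LinearAlgebra, Random, Statistics

# Hypothetical helper: closed-form slope of the regression y ~ 1 + x
univariate_slope(x, y) = cov(x, y) / var(x)

Random.seed!(1)
x = randn(200)
y = 0.5 .+ 2.0 .* x .+ 0.1 .* randn(200)

# Reference solution: least squares with an explicit intercept column,
# so b[1] is the intercept and b[2] is the slope
b = [ones(200) x] \ y

# The two slopes should agree up to floating-point error
@assert isapprox(univariate_slope(x, y), b[2]; rtol=1e-8)
```

The same comparison could be run against `linreg!` above, whose second returned entry plays the role of `b[2]`.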

## General matrices

In [3]:
Random.seed!(111)
n = 10000
p = 10
x = randn(n, p)
βtrue = randn(p)
y = x * βtrue + 0.1randn(n);

In [4]:
# columns: univariate initial value, multiple linear regression (x\y), truth
[initialize_beta(y, x) x\y βtrue]

10×3 Array{Float64,2}:
  2.05791    1.96174    1.96003
 -0.291102  -0.227084  -0.226961
 -0.786065  -0.695332  -0.696983
 -0.457251  -0.471151  -0.470722
  0.231683   0.176485   0.177769
 -2.13299   -2.07862   -2.07842
  2.01815    1.91951    1.91859
  1.17988    1.18691    1.18838
  1.48591    1.43662    1.43707
  1.38753    1.41129    1.40999

In [5]:
βinit = initialize_beta(y, x)
βinit - βtrue

10-element Array{Float64,1}:
  0.09787941653099996
 -0.06414123513643402
 -0.08908232888485446
  0.01347151929359347
  0.05391369745308988
 -0.054565873990521485
  0.09955851125928827
 -0.008499816030518526
  0.04884058357021703
 -0.022461720603279556

In [6]:
all(abs.(βinit - βtrue) .< 0.1)

true

## SnpLinAlg

In [7]:
n = 1000
p = 10000
k = 10
d = Normal
l = canonicallink(d())

Random.seed!(2020)
x = simulate_random_snparray(undef, n, p)
xla = SnpLinAlg{Float64}(x, center=true, scale=true)
y, true_b, correct_position = simulate_random_response(xla, k, d, l);

In [8]:
result = fit_iht(y, xla, k=11, init_beta=false)

****                   MendelIHT Version 1.4.0                  ****
****     Benjamin Chu, Kevin Keys, Chris German, Hua Zhou       ****
****   Jin Zhou, Eric Sobel, Janet Sinsheimer, Kenneth Lange    ****
****                                                            ****
****                 Please cite our paper!                     ****
****         https://doi.org/10.1093/gigascience/giaa044        ****

Running sparse linear regression
Link functin = IdentityLink()
Sparsity parameter (k) = 11
Prior weight scaling = off
Doubly sparse projection = off
Debias = off
Max IHT iterations = 200
Converging when tol < 0.0001:

Iteration 1: loglikelihood = -1581.7768780898489, backtracks = 0, tol = 1.3552509908102688
Iteration 2: loglikelihood = -1409.6637302568977, backtracks = 0, tol = 0.12655245087839057
Iteration 3: loglikelihood = -1381.0402635243368, backtracks = 0, tol = 0.06377958391062873
Iteration 4: loglikelihood = -1379.5419533307588, backtracks = 0, tol = 0.045125306615027116


IHT estimated 11 nonzero SNP predictors and 0 non-genetic predictors.

Compute time (sec):     0.5957999229431152
Final loglikelihood:    -1379.2588565310857
SNP PVE:                0.8702634813628288
Iterations:             10

Selected genetic predictors:
11×2 DataFrame
 Row │ Position  Estimated_β
     │ Int64     Float64
─────┼───────────────────────
   1 │      173     0.231313
   2 │     4779    -1.09732
   3 │     5560     0.106919
   4 │     6260    -0.128241
   5 │     7121     0.145675
   6 │     7159     1.18797
   7 │     7357     1.62846
   8 │     8276     0.223418
   9 │     8529    -0.436094
  10 │     8592    -0.115408
  11 │     8942    -0.88203

Selected nongenetic predictors:
0×2 DataFrame

In [9]:
result2 = fit_iht(y, xla, k=11, init_beta=true)

****                   MendelIHT Version 1.4.0                  ****
****     Benjamin Chu, Kevin Keys, Chris German, Hua Zhou       ****
****   Jin Zhou, Eric Sobel, Janet Sinsheimer, Kenneth Lange    ****
****                                                            ****
****                 Please cite our paper!                     ****
****         https://doi.org/10.1093/gigascience/giaa044        ****

Running sparse linear regression
Link functin = IdentityLink()
Sparsity parameter (k) = 11
Prior weight scaling = off
Doubly sparse projection = off
Debias = off
Max IHT iterations = 200
Converging when tol < 0.0001:

Iteration 1: loglikelihood = -2491.2517588516744, backtracks = 0, tol = 0.5330137594941751
Iteration 2: loglikelihood = -1627.8632644711122, backtracks = 0, tol = 0.31821983069626636
Iteration 3: loglikelihood = -1414.2063646054137, backtracks = 0, tol = 0.11619629982267554
Iteration 4: loglikelihood = -1384.4575659795526, backtracks = 0, tol = 0.05082685059215144



IHT estimated 11 nonzero SNP predictors and 0 non-genetic predictors.

Compute time (sec):     0.5657999515533447
Final loglikelihood:    -1378.1857056788483
SNP PVE:                0.8705146483705939
Iterations:             10

Selected genetic predictors:
11×2 DataFrame
 Row │ Position  Estimated_β
     │ Int64     Float64
─────┼───────────────────────
   1 │      173     0.225838
   2 │     2126     0.118987
   3 │     4779    -1.09328
   4 │     6260    -0.12721
   5 │     7121     0.13656
   6 │     7159     1.19744
   7 │     7357     1.63647
   8 │     8276     0.217466
   9 │     8529    -0.436074
  10 │     8592    -0.119344
  11 │     8942    -0.884608

Selected nongenetic predictors:
0×2 DataFrame

In [10]:
# columns: univariate initial value, IHT estimate (init_beta=false), IHT estimate (init_beta=true), truth
[initialize_beta(y, xla)[correct_position] result.beta[correct_position] result2.beta[correct_position] true_b[correct_position]]

10×4 Array{Float64,2}:
  0.172514    0.231313   0.225838   0.290051
 -0.0833819   0.0        0.0        0.113896
 -1.12519    -1.09732   -1.09328   -1.09083
 -0.0668759   0.0        0.0        0.0326341
  1.16751     1.18797    1.19744    1.25615
  1.61687     1.62846    1.63647    1.5655
 -0.0150496   0.0        0.0       -0.0616128
  0.318712    0.223418   0.217466   0.240515
 -0.483093   -0.436094  -0.436074  -0.420895
 -0.838352   -0.88203   -0.884608  -0.893621

In [11]:
# SnpLinAlg
n = 1000
p = 10000
k = 10
d = Normal
l = canonicallink(d())

Random.seed!(2020)
x = simulate_random_snparray(undef, n, p)
xla = SnpLinAlg{Float64}(x, center=true, scale=true)
y, βtrue, correct_position = simulate_random_response(xla, k, d, l);

βinit = initialize_beta(y, xla)
# note: this bounds the signed difference from above only, not the absolute error
all(βinit[correct_position] - βtrue[correct_position] .< 0.1)

true
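
The check in the last cell compares the signed difference with 0.1, so it also passes when the initial values fall well below the truth. A minimal sketch of the distinction, using rounded values from the first two rows of the `In [10]` comparison (the `_demo` names are hypothetical and exist only for this illustration):

```julia
# Rounded values from the first two rows of the In [10] comparison
βinit_demo = [0.1725, -0.0834]   # univariate initial values
βtrue_demo = [0.2901, 0.1139]    # simulated true effects

signed = βinit_demo - βtrue_demo

# One-sided: passes whenever the initial values undershoot the truth
one_sided = all(signed .< 0.1)

# Two-sided: bounds the magnitude of the error
two_sided = all(abs.(signed) .< 0.1)

# Largest absolute deviation, often more informative than a pass/fail flag
max_abs_err = maximum(abs, signed)
```

For these two rows the one-sided check passes while the two-sided check does not, so a test meant to bound the initialization error may want `abs.` or `maximum(abs, ...)` with a tolerance chosen for the SNP setting.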