# Test the `initialize_beta` function

In IHT, we can initialize each beta value to its univariate estimate. That is, $\beta_i$ is set to the slope estimated from a simple linear regression with $y$ as the response and $x_i$ plus an intercept term as covariates.
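
As a worked equation, the univariate fit for SNP $i$ solves a $2 \times 2$ system of normal equations. The notation here ($\hat{\alpha}_i$ for the intercept, $x_{ji}$ for sample $j$'s value of predictor $i$) is introduced only for illustration:

$$\begin{bmatrix} n & \sum_j x_{ji} \\ \sum_j x_{ji} & \sum_j x_{ji}^2 \end{bmatrix} \begin{bmatrix} \hat{\alpha}_i \\ \hat{\beta}_i \end{bmatrix} = \begin{bmatrix} \sum_j y_j \\ \sum_j x_{ji} y_j \end{bmatrix}, \qquad \hat{\beta}_i = \frac{\sum_j (x_{ji} - \bar{x}_i)(y_j - \bar{y})}{\sum_j (x_{ji} - \bar{x}_i)^2}$$

Only the slope $\hat{\beta}_i$ is kept as the initial value; the intercept estimate is discarded.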

In [1]:
using Revise
using MendelIHT
using SnpArrays
using Random
using GLM
using DelimitedFiles
using Test
using Distributions
using LinearAlgebra
using CSV
using DataFrames
using StatsBase
using TraitSimulation

┌ Info: Precompiling MendelIHT [921c7187-1484-5754-b919-5d3ed9ac03c4]
└ @ Base loading.jl:1278


In [2]:
"""
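    linreg!(x, y, xtx_store, xty_store)

Performs simple linear regression with `y` as response and `x` plus a vector
of ones (the intercept) as covariates. The estimate `β̂ = [intercept, slope]`
is stored in, and returned as, `xty_store`; `xtx_store` is overwritten.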

Code inspired from Doug Bates on Discourse:
https://discourse.julialang.org/t/efficient-way-of-doing-linear-regression/31232/28
"""
function linreg!(
    x::AbstractVector{T},
    y::AbstractVector{T},
    xtx_store::AbstractMatrix{T} = zeros(T, 2, 2),
    xty_store::AbstractVector{T} = zeros(T, 2)
    ) where {T<:AbstractFloat}
    N = length(x)
    N == length(y) || throw(DimensionMismatch())
    xtx_store[1, 1] = N
    xtx_store[1, 2] = sum(x)
    xtx_store[2, 2] = sum(abs2, x)
    xty_store[1] = sum(y)
    xty_store[2] = dot(x, y)
    ldiv!(cholesky!(Symmetric(xtx_store, :U)), xty_store)
    return xty_store
end

function initialize_beta(y::AbstractVector, x::AbstractMatrix)
    n, p = size(x)
    xtx_store = zeros(2, 2)
    xty_store = zeros(2)
    β = zeros(p)
    for i in 1:p
        linreg!(@view(x[:, i]), y, xtx_store, xty_store)
        β[i] = xty_store[2]
    end
    return β
end

function initialize_beta(y::AbstractMatrix, x::AbstractMatrix)
    p, n = size(x)
    r = size(y, 1) # number of traits
    xtx_store = zeros(2, 2)
    xty_store = zeros(2)
    B = zeros(r, p)
    for j in 1:r # loop over each y
        yj = @view(y[j, :])
        for i in 1:p
            linreg!(@view(x[i, :]), yj, xtx_store, xty_store)
            B[j, i] = xty_store[2]
        end
    end
    return B
end

initialize_beta (generic function with 2 methods)
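
As a quick sanity check on the idea behind `linreg!`, the closed-form slope of a simple regression can be compared against a generic least-squares solve. This is a minimal sketch: `univariate_slope` is a hypothetical helper written for this illustration (it is not part of MendelIHT), and the data are simulated with arbitrary values.

```julia
using LinearAlgebra, Random

# Hypothetical helper (not part of MendelIHT): closed-form slope of a
# simple linear regression of y on x with an intercept.
function univariate_slope(x::AbstractVector, y::AbstractVector)
    length(x) == length(y) || throw(DimensionMismatch())
    xbar = sum(x) / length(x)
    ybar = sum(y) / length(y)
    return dot(x .- xbar, y .- ybar) / sum(abs2, x .- xbar)
end

Random.seed!(1)
x = randn(100)
y = 2 .+ 3 .* x .+ 0.1 .* randn(100)

# Reference: least squares on the design matrix [1 x];
# the second entry of the solution is the slope.
β_ref = [ones(length(x)) x] \ y
slope = univariate_slope(x, y)
```

Both routes solve the same least-squares problem, so `slope` and `β_ref[2]` should agree up to floating-point error.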

## Univariate general matrices

In [3]:
Random.seed!(111)
n = 10000
p = 10
x = randn(n, p)
βtrue = randn(p)
y = x * βtrue + 0.1randn(n);

In [4]:
# compare initialized value to multiple linear regression to truth
[initialize_beta(y, x) x\y βtrue]

10×3 Array{Float64,2}:
  2.05791    1.96174    1.96003
 -0.291102  -0.227084  -0.226961
 -0.786065  -0.695332  -0.696983
 -0.457251  -0.471151  -0.470722
  0.231683   0.176485   0.177769
 -2.13299   -2.07862   -2.07842
  2.01815    1.91951    1.91859
  1.17988    1.18691    1.18838
  1.48591    1.43662    1.43707
  1.38753    1.41129    1.40999

In [5]:
βinit = initialize_beta(y, x)
βinit - βtrue

10-element Array{Float64,1}:
  0.09787941653099996
 -0.06414123513643402
 -0.08908232888485446
  0.01347151929359347
  0.05391369745308988
 -0.054565873990521485
  0.09955851125928827
 -0.008499816030518526
  0.04884058357021703
 -0.022461720603279556

In [6]:
all(abs.(βinit - βtrue) .< 0.1)

true

## Univariate SnpLinAlg

In [95]:
n = 1000
p = 10000
k = 10
d = Normal
l = canonicallink(d())

Random.seed!(2021)
x = simulate_random_snparray(undef, n, p)
xla = SnpLinAlg{Float64}(x, center=true, scale=true)
y, true_b, correct_position = simulate_random_response(xla, k, d, l);

In [96]:
result = fit_iht(y, xla, k=11, init_beta=false)

****                   MendelIHT Version 1.4.0                  ****
****     Benjamin Chu, Kevin Keys, Chris German, Hua Zhou       ****
****   Jin Zhou, Eric Sobel, Janet Sinsheimer, Kenneth Lange    ****
****                                                            ****
****                 Please cite our paper!                     ****
****         https://doi.org/10.1093/gigascience/giaa044        ****

Running sparse linear regression
Link functin = IdentityLink()
Sparsity parameter (k) = 11
Prior weight scaling = off
Doubly sparse projection = off
Debias = off
Max IHT iterations = 200
Converging when tol < 0.0001:

Iteration 1: loglikelihood = -1593.9929287471011, backtracks = 0, tol = 1.20797234572672
Iteration 2: loglikelihood = -1487.2621799703284, backtracks = 0, tol = 0.12334401922028171
Iteration 3: loglikelihood = -1468.8965217422317, backtracks = 0, tol = 0.05510756549097026
Iteration 4: loglikelihood = -1467.3016301297482, backtracks = 0, tol = 0.05286328443463906
It


IHT estimated 11 nonzero SNP predictors and 0 non-genetic predictors.

Compute time (sec):     0.6524169445037842
Final loglikelihood:    -1466.455530287542
SNP PVE:                0.8438409927824799
Iterations:             12

Selected genetic predictors:
11×2 DataFrame
 Row │ Position  Estimated_β
     │ Int64     Float64
─────┼───────────────────────
   1 │      782    -0.443537
   2 │      901     0.753927
   3 │     1204     0.698528
   4 │     1306    -1.43028
   5 │     1655    -0.192022
   6 │     3160    -0.865703
   7 │     3936    -0.153925
   8 │     4201     0.334507
   9 │     4402    -0.128446
  10 │     6879    -1.21182
  11 │     8055     0.115916

Selected nongenetic predictors:
0×2 DataFrame

In [97]:
result2 = fit_iht(y, xla, k=11, init_beta=true)

****                   MendelIHT Version 1.4.0                  ****
****     Benjamin Chu, Kevin Keys, Chris German, Hua Zhou       ****
****   Jin Zhou, Eric Sobel, Janet Sinsheimer, Kenneth Lange    ****
****                                                            ****
****                 Please cite our paper!                     ****
****         https://doi.org/10.1093/gigascience/giaa044        ****

Running sparse linear regression
Link functin = IdentityLink()
Sparsity parameter (k) = 11
Prior weight scaling = off
Doubly sparse projection = off
Debias = off
Max IHT iterations = 200
Converging when tol < 0.0001:

Iteration 1: loglikelihood = -2502.977705625754, backtracks = 0, tol = 0.5052752478773777
Iteration 2: loglikelihood = -1611.5933226294037, backtracks = 0, tol = 0.3331559935047197
Iteration 3: loglikelihood = -1488.464806552407, backtracks = 0, tol = 0.11997454491586085
Iteration 4: loglikelihood = -1467.2420308134322, backtracks = 0, tol = 0.05554090722235994
Ite


IHT estimated 11 nonzero SNP predictors and 0 non-genetic predictors.

Compute time (sec):     0.5602378845214844
Final loglikelihood:    -1466.4555451229837
SNP PVE:                0.8438557209826311
Iterations:             10

Selected genetic predictors:
11×2 DataFrame
 Row │ Position  Estimated_β
     │ Int64     Float64
─────┼───────────────────────
   1 │      782    -0.443524
   2 │      901     0.753802
   3 │     1204     0.698492
   4 │     1306    -1.43032
   5 │     1655    -0.192114
   6 │     3160    -0.865765
   7 │     3936    -0.153948
   8 │     4201     0.334418
   9 │     4402    -0.128425
  10 │     6879    -1.21187
  11 │     8055     0.115835

Selected nongenetic predictors:
0×2 DataFrame

In [10]:
# compare initialized value to IHT's estimate to truth
[initialize_beta(y, xla)[correct_position] result.beta[correct_position] result2.beta[correct_position] true_b[correct_position]]

10×4 Array{Float64,2}:
  0.172514    0.231313   0.225838   0.290051
 -0.0833819   0.0        0.0        0.113896
 -1.12519    -1.09732   -1.09328   -1.09083
 -0.0668759   0.0        0.0        0.0326341
  1.16751     1.18797    1.19744    1.25615
  1.61687     1.62846    1.63647    1.5655
 -0.0150496   0.0        0.0       -0.0616128
  0.318712    0.223418   0.217466   0.240515
 -0.483093   -0.436094  -0.436074  -0.420895
 -0.838352   -0.88203   -0.884608  -0.893621

In [11]:
# SnpLinAlg
n = 1000
p = 10000
k = 10
d = Normal
l = canonicallink(d())

Random.seed!(2020)
x = simulate_random_snparray(undef, n, p)
xla = SnpLinAlg{Float64}(x, center=true, scale=true)
y, βtrue, correct_position = simulate_random_response(xla, k, d, l);

βinit = initialize_beta(y, xla)
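# NOTE: this check is one-sided; wrap the difference in abs.(...) to bound the error in both directions
all(βinit[correct_position] - βtrue[correct_position] .< 0.1)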

true

## Multivariate SnpLinAlg


With $r$ traits, each sample's phenotype $\mathbf{y}_{i} \in \mathbb{R}^{r \times 1}$ is simulated under

$$\mathbf{y}_{i}^{r \times 1} \sim N(\mathbf{B}^{r \times p}\mathbf{x}_{i}^{p \times 1}, \ \ \Sigma_{r \times r})$$

This model assumes each sample is independent.
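
The model above can be sketched in a few lines of plain Julia. This is only an illustration under assumed values: the sizes, the effect matrix `B`, and the covariance `Σ` are made up here, and the notebook itself relies on `simulate_random_response` rather than this code.

```julia
using LinearAlgebra, Random

Random.seed!(2021)
n, p, r = 200, 5, 2        # illustrative sizes, not those used in this notebook
B = randn(r, p)            # r × p matrix of effect sizes (assumed values)
Σ = [1.0 0.3; 0.3 1.0]     # r × r trait covariance (assumed values)
X = randn(p, n)            # column i is sample i's covariate vector x_i

# Draw correlated noise with covariance Σ via its Cholesky factor:
# if z ~ N(0, I), then L * z ~ N(0, L * L') = N(0, Σ).
L = cholesky(Σ).L
noise = L * randn(r, n)

# Column i of Y is y_i ~ N(B * x_i, Σ), independent across samples.
Y = B * X + noise
```

Here `Y` is $r \times n$, so each trait occupies a row and each sample a column.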

In [110]:
n = 1000  # number of samples
p = 10000 # number of SNPs
k = 10    # number of causal SNPs
r = 2     # number of traits

# set random seed for reproducibility
Random.seed!(2021)

# simulate `.bed` file with no missing data
x = simulate_random_snparray("multivariate_$(r)traits.bed", n, p)
xla = SnpLinAlg{Float64}(x, model=ADDITIVE_MODEL, center=true, scale=true) 

# intercept is the only nongenetic covariate
z = ones(n, 1)
intercepts = [10.0 1.0] # each trait has a different intercept

# simulate response y, true model b, and the correct non-0 positions of b
Y, true_Σ, true_b, correct_position = simulate_random_response(xla, k, r, Zu=z*intercepts, overlap=2)
correct_snps = [x[1] for x in correct_position] # causal snps
Yt = Matrix(Y'); # in MendelIHT, multivariate traits should be rows

In [118]:
result = fit_iht(Yt, Transpose(xla), k=12, init_beta=false)

****                   MendelIHT Version 1.4.0                  ****
****     Benjamin Chu, Kevin Keys, Chris German, Hua Zhou       ****
****   Jin Zhou, Eric Sobel, Janet Sinsheimer, Kenneth Lange    ****
****                                                            ****
****                 Please cite our paper!                     ****
****         https://doi.org/10.1093/gigascience/giaa044        ****

Running sparse Multivariate Gaussian regression
Link functin = IdentityLink()
Sparsity parameter (k) = 12
Prior weight scaling = off
Doubly sparse projection = off
Debias = off
Max IHT iterations = 200
Converging when tol < 0.0001:

Iteration 1: loglikelihood = 215.4892687838203, backtracks = 0, tol = 0.12434559043803152
Iteration 2: loglikelihood = 1382.37242885485, backtracks = 0, tol = 0.02701989420878901
Iteration 3: loglikelihood = 1477.7383135165255, backtracks = 0, tol = 0.014225517157910431
Iteration 4: loglikelihood = 1511.714843337414, backtracks = 0, tol = 0.004456457


Compute time (sec):     1.6148250102996826
Final loglikelihood:    1521.6067421001371
Iterations:             15
Trait 1's SNP PVE:      0.5545273580919192
Trait 2's SNP PVE:      0.6195879626449298

Trait 1: IHT estimated 4 nonzero SNP predictors
4×2 DataFrame
 Row │ Position  Estimated_β
     │ Int64     Float64
─────┼───────────────────────
   1 │     1197     0.121446
   2 │     5651    -0.200705
   3 │     5797    -1.09767
   4 │     8087     1.2791

Trait 1: IHT estimated 1 non-genetic predictors
1×2 DataFrame
 Row │ Position  Estimated_β
     │ Int64     Float64
─────┼───────────────────────
   1 │        1       10.027

Trait 2: IHT estimated 6 nonzero SNP predictors
6×2 DataFrame
 Row │ Position  Estimated_β
     │ Int64     Float64
─────┼───────────────────────
   1 │      326     0.331




In [119]:
result = fit_iht(Yt, Transpose(xla), k=12, init_beta=true)

****                   MendelIHT Version 1.4.0                  ****
****     Benjamin Chu, Kevin Keys, Chris German, Hua Zhou       ****
****   Jin Zhou, Eric Sobel, Janet Sinsheimer, Kenneth Lange    ****
****                                                            ****
****                 Please cite our paper!                     ****
****         https://doi.org/10.1093/gigascience/giaa044        ****

Running sparse Multivariate Gaussian regression
Link functin = IdentityLink()
Sparsity parameter (k) = 12
Prior weight scaling = off
Doubly sparse projection = off
Debias = off
Max IHT iterations = 200
Converging when tol < 0.0001:

Iteration 1: loglikelihood = 215.4892687838203, backtracks = 0, tol = 0.0874619371580851
Iteration 2: loglikelihood = 508.71372460708403, backtracks = 0, tol = 0.07593963516634877
Iteration 3: loglikelihood = 1347.5165861204946, backtracks = 0, tol = 0.02898119167266632
Iteration 4: loglikelihood = 1508.563389138361, backtracks = 0, tol = 0.006553722


Compute time (sec):     1.6375181674957275
Final loglikelihood:    1526.557065358178
Iterations:             15
Trait 1's SNP PVE:      0.5617472680179374
Trait 2's SNP PVE:      0.6885981059281754

Trait 1: IHT estimated 4 nonzero SNP predictors
4×2 DataFrame
 Row │ Position  Estimated_β
     │ Int64     Float64
─────┼───────────────────────
   1 │     5651    -0.204097
   2 │     5797    -1.09478
   3 │     6813    -0.24325
   4 │     8087     1.28163

Trait 1: IHT estimated 1 non-genetic predictors
1×2 DataFrame
 Row │ Position  Estimated_β
     │ Int64     Float64
─────┼───────────────────────
   1 │        1       10.027

Trait 2: IHT estimated 6 nonzero SNP predictors
6×2 DataFrame
 Row │ Position  Estimated_β
     │ Int64     Float64
─────┼───────────────────────
   1 │      326     0.3315




In [63]:
Binit = initialize_beta(Yt, Transpose(xla))

2×10000 Array{Float64,2}:
  0.0534366   0.0439406   0.0969756  …   0.0216708  -0.197072    -0.0302861
 -0.0490983  -0.0681586  -0.217925      -0.0520163   0.00796128  -0.0211869

In [64]:
# true beta1 vs initial beta1 
[true_b[correct_snps, 1] Binit[1, correct_snps]]

10×2 Array{Float64,2}:
 -0.224675  -0.162063
 -1.14044   -1.16669
 -0.14698   -0.244613
  1.25668    1.37809
  0.0        0.111058
  0.0        0.0532192
  0.0        0.00868089
 -1.14044   -1.16669
  0.0        0.0524192
 -0.14698   -0.244613

In [65]:
# true beta2 vs initial beta2
[true_b[correct_snps, 2] Binit[2, correct_snps]]

10×2 Array{Float64,2}:
 0.0       -0.0586803
 0.531549   0.55475
 1.43455    1.54564
 0.0        0.0325905
 0.315219   0.252608
 0.609812   0.636104
 1.20121    1.08516
 0.531549   0.55475
 0.808327   0.828278
 1.43455    1.54564

In [123]:
all(abs.(true_b[correct_snps, 1] - Binit[1, correct_snps]) .< 0.15)

true

In [67]:
true_b[correct_snps, 2] - Binit[2, correct_snps]

10-element Array{Float64,1}:
  0.05868027528864749
 -0.0232008980406212
 -0.11108382038222331
 -0.03259053645592301
  0.06261113915272487
 -0.02629238502256903
  0.11604924413534135
 -0.0232008980406212
 -0.01995118916877381
 -0.11108382038222331

In [122]:
all(abs.(true_b[correct_snps, 2] - Binit[2, correct_snps]) .< 0.15)

true