# Test debias

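After each IHT iteration, the estimate $\beta$ is sparse. We can then solve for the exact solution restricted to the nonzero indices (for a Gaussian model, the least-squares fit on the selected predictors), a process known as debiasing.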
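
As a minimal sketch of the idea for a univariate Gaussian model, the snippet below refits only the nonzero entries of a sparse estimate by least squares. The helper `debias_support` and the simulated data are illustrative assumptions, not part of MendelIHT.

```julia
using LinearAlgebra, Random

# Hypothetical helper (not a MendelIHT function): refit the nonzero entries of a
# sparse estimate `β` by ordinary least squares restricted to its support.
function debias_support(X::AbstractMatrix, y::AbstractVector, β::AbstractVector)
    S = findall(!iszero, β)               # indices of the current support
    β_new = zeros(eltype(β), length(β))   # entries off the support stay zero
    β_new[S] = X[:, S] \ y                # least-squares solution on the support only
    return β_new
end

Random.seed!(1)
n, p = 200, 50
X = randn(n, p)
β_true = zeros(p)
β_true[[3, 17, 42]] = [1.5, -2.0, 0.7]
y = X * β_true + 0.1 * randn(n)

# A deliberately shrunken estimate that has the correct support
β_biased = 0.5 .* β_true
β_debiased = debias_support(X, y, β_biased)
```

The support is left unchanged; only the magnitudes of the selected coefficients are re-estimated.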

In [1]:
using Revise
using MendelIHT
using SnpArrays
using Random
using GLM
using DelimitedFiles
using Test
using Distributions
using LinearAlgebra
using CSV
using DataFrames
using StatsBase
using TraitSimulation

In [2]:
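"""
    debias(Y::AbstractMatrix, X::AbstractMatrix)

Solves the multivariate linear regression `Y = XB + E` by `B̂ = inv(X'X) X'Y`,
where `Y` is `n × r` and `X` is `n × p`, using a Cholesky factorization of `X'X`.
"""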
function debias(Y::AbstractMatrix, X::AbstractMatrix)
    ldiv!(cholesky!(Symmetric(X'*X, :U)), Transpose(X) * Y)
end

debias
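
For reference, the estimator computed by `debias` follows from the normal equations of the least-squares problem. In the notation of the code, $X$ is $n \times p$ and $Y$ is $n \times r$; the factor $U$ below is the upper Cholesky factor, which is what the `:U` argument suggests but is stated here as an assumption about how the solve proceeds.

$$\hat{B} = \arg\min_{B} \|Y - XB\|_F^2 \quad \Longrightarrow \quad X^\top X \hat{B} = X^\top Y \quad \Longrightarrow \quad \hat{B} = (X^\top X)^{-1} X^\top Y$$

$$X^\top X = U^\top U, \qquad U^\top W = X^\top Y, \qquad U \hat{B} = W$$

The second line replaces the explicit inverse by two triangular solves, which requires $X^\top X$ to be positive definite (that is, $X$ has full column rank).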

## Test multivariate debiasing

In [3]:
Random.seed!(111)
n = 10000
p = 10
r = 2
X = randn(n, p)
βtrue = randn(p, r)
Y = X * βtrue + 0.1randn(n, r);

In [4]:
# compare debiased estimates (columns 1-2), the least-squares solution X\Y (columns 3-4), and the truth (columns 5-6)
[debias(Y, X) X\Y βtrue]

10×6 Array{Float64,2}:
  1.96152   -0.644561    1.96152   -0.644561    1.96003   -0.643827
 -0.227112  -2.03514    -0.227112  -2.03514    -0.226961  -2.03677
 -0.695466   0.0807036  -0.695466   0.0807036  -0.696983   0.0810192
 -0.471365   1.02773    -0.471365   1.02773    -0.470722   1.02645
  0.176366  -0.402362    0.176366  -0.402362    0.177769  -0.402138
 -2.07868   -0.591199   -2.07868   -0.591199   -2.07842   -0.591873
  1.91958    0.0392912   1.91958    0.0392912   1.91859    0.0385307
  1.18673   -1.66567     1.18673   -1.66567     1.18838   -1.66464
  1.43634   -0.522621    1.43634   -0.522621    1.43707   -0.523489
  1.41148    1.43384     1.41148    1.43384     1.40999    1.43506

## Debiasing with IHT (univariate)

In [5]:
n = 1000
p = 10000
k = 10
d = Normal
l = canonicallink(d())

Random.seed!(2021)
x = simulate_random_snparray(undef, n, p)
xla = SnpLinAlg{Float64}(x, center=true, scale=true)
y, true_b, correct_position = simulate_random_response(xla, k, d, l);

In [6]:
result = fit_iht(y, xla, k=11, debias=false)

****                   MendelIHT Version 1.4.0                  ****
****     Benjamin Chu, Kevin Keys, Chris German, Hua Zhou       ****
****   Jin Zhou, Eric Sobel, Janet Sinsheimer, Kenneth Lange    ****
****                                                            ****
****                 Please cite our paper!                     ****
****         https://doi.org/10.1093/gigascience/giaa044        ****

Running sparse linear regression
Link functin = IdentityLink()
Sparsity parameter (k) = 11
Prior weight scaling = off
Doubly sparse projection = off
Debias = off
Max IHT iterations = 200
Converging when tol < 0.0001:

Iteration 1: loglikelihood = -1593.9929287471011, backtracks = 0, tol = 1.20797234572672
Iteration 2: loglikelihood = -1487.2621799703284, backtracks = 0, tol = 0.12334401922028171
Iteration 3: loglikelihood = -1468.8965217422317, backtracks = 0, tol = 0.05510756549097026
Iteration 4: loglikelihood = -1467.3016301297482, backtracks = 0, tol = 0.05286328443463906
It


IHT estimated 11 nonzero SNP predictors and 0 non-genetic predictors.

Compute time (sec):     0.7337119579315186
Final loglikelihood:    -1466.455530287542
SNP PVE:                0.8438409927824799
Iterations:             12

Selected genetic predictors:
11×2 DataFrame
 Row │ Position  Estimated_β
     │ Int64     Float64
─────┼───────────────────────
   1 │      782    -0.443537
   2 │      901     0.753927
   3 │     1204     0.698528
   4 │     1306    -1.43028
   5 │     1655    -0.192022
   6 │     3160    -0.865703
   7 │     3936    -0.153925
   8 │     4201     0.334507
   9 │     4402    -0.128446
  10 │     6879    -1.21182
  11 │     8055     0.115916

Selected nongenetic predictors:
0×2 DataFrame

In [7]:
result2 = fit_iht(y, xla, k=11, debias=true)

****                   MendelIHT Version 1.4.0                  ****
****     Benjamin Chu, Kevin Keys, Chris German, Hua Zhou       ****
****   Jin Zhou, Eric Sobel, Janet Sinsheimer, Kenneth Lange    ****
****                                                            ****
****                 Please cite our paper!                     ****
****         https://doi.org/10.1093/gigascience/giaa044        ****

Running sparse linear regression
Link functin = IdentityLink()
Sparsity parameter (k) = 11
Prior weight scaling = off
Doubly sparse projection = off
Debias = on
Max IHT iterations = 200
Converging when tol < 0.0001:

Iteration 1: loglikelihood = -1593.9929287471011, backtracks = 0, tol = 1.400455921944636
Iteration 2: loglikelihood = -1559.4627688199794, backtracks = 0, tol = 0.07435655637351721
Iteration 3: loglikelihood = -1556.3289523410967, backtracks = 0, tol = 0.0
Iteration 4: loglikelihood = -1554.0165116269238, backtracks = 0, tol = 0.0
Iteration 5: loglikelihood = -1552


IHT estimated 11 nonzero SNP predictors and 0 non-genetic predictors.

Compute time (sec):     1.224865198135376
Final loglikelihood:    -1552.7059132114798
SNP PVE:                0.8376246556996713
Iterations:             5

Selected genetic predictors:
11×2 DataFrame
 Row │ Position  Estimated_β
     │ Int64     Float64
─────┼───────────────────────
   1 │      782   -0.437298
   2 │      901    0.739183
   3 │     1204    0.677758
   4 │     1306   -1.42774
   5 │     1655   -0.180674
   6 │     2341   -0.0447162
   7 │     3160   -0.840734
   8 │     4201    0.340794
   9 │     6879   -1.20063
  10 │     7410   -0.0199834
  11 │     9091   -0.0604253

Selected nongenetic predictors:
0×2 DataFrame

In [8]:
[result.beta[correct_position] result2.beta[correct_position]]

10×2 Array{Float64,2}:
 -0.443537  -0.437298
  0.753927   0.739183
  0.698528   0.677758
 -1.43028   -1.42774
 -0.192022  -0.180674
 -0.865703  -0.840734
  0.334507   0.340794
  0.0        0.0
  0.0        0.0
 -1.21182   -1.20063

## Multivariate debiasing


With $r$ traits, each sample's phenotype $\mathbf{y}_{i} \in \mathbb{R}^{r \times 1}$ is simulated under

$$\mathbf{y}_{i}^{r \times 1} \sim N(\mathbf{B}^{r \times p}\mathbf{x}_{i}^{p \times 1}, \ \ \Sigma_{r \times r})$$

This model assumes each sample is independent.
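
To make the model concrete, here is a small sketch that draws phenotypes from it using only the standard library. The sizes, the effect matrix `B`, and the covariance `Σ` are illustrative assumptions; this is not how `simulate_random_response` is implemented.

```julia
using LinearAlgebra, Random

Random.seed!(2020)
n, p, r = 500, 20, 2             # samples, predictors, traits (illustrative sizes)

X = randn(p, n)                  # column i is x_i (p × 1)
B = zeros(r, p)                  # r × p sparse effect matrix (assumed values)
B[1, 3] = 1.0
B[2, 3] = -0.5
B[2, 11] = 2.0

Σ = [1.0 0.3; 0.3 2.0]           # r × r positive definite covariance (assumed values)
L = cholesky(Symmetric(Σ)).L     # lower Cholesky factor, Σ = L * L'

# Column i is y_i = B * x_i + L * z_i with z_i ~ N(0, I_r), hence y_i ~ N(B x_i, Σ).
# Columns are drawn independently, matching the independence assumption.
Y = B * X + L * randn(r, n)      # r × n: traits are rows, samples are columns
```

Storing traits as rows matches the layout used for `Yt` in the cells that follow.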

In [25]:
n = 1000  # number of samples
p = 10000 # number of SNPs
k = 10    # number of causal SNPs
r = 2     # number of traits

# set random seed for reproducibility
Random.seed!(2020)

# simulate `.bed` file with no missing data
x = simulate_random_snparray("multivariate_$(r)traits.bed", n, p)
xla = SnpLinAlg{Float64}(x, model=ADDITIVE_MODEL, center=true, scale=true) 

# intercept is the only nongenetic covariate
z = ones(n, 1)
intercepts = [10.0 1.0] # each trait has a different intercept

# simulate response y, true model b, and the correct non-0 positions of b
Y, true_Σ, true_b, correct_position = simulate_random_response(xla, k, r, Zu=z*intercepts, overlap=2)
correct_snps = [x[1] for x in correct_position] # causal snps
Yt = Matrix(Y'); # in MendelIHT, multivariate traits should be rows

In [26]:
result = fit_iht(Yt, Transpose(xla), k=12, debias=false)

****                   MendelIHT Version 1.4.0                  ****
****     Benjamin Chu, Kevin Keys, Chris German, Hua Zhou       ****
****   Jin Zhou, Eric Sobel, Janet Sinsheimer, Kenneth Lange    ****
****                                                            ****
****                 Please cite our paper!                     ****
****         https://doi.org/10.1093/gigascience/giaa044        ****

Running sparse Multivariate Gaussian regression
Link functin = IdentityLink()
Sparsity parameter (k) = 12
Prior weight scaling = off
Doubly sparse projection = off
Debias = off
Max IHT iterations = 200
Converging when tol < 0.0001:

Iteration 1: loglikelihood = 311.122249277171, backtracks = 0, tol = 0.15846789223457347
Iteration 2: loglikelihood = 2368.959802261556, backtracks = 0, tol = 0.02724750792869442
Iteration 3: loglikelihood = 2604.2428515947267, backtracks = 0, tol = 0.0026097666071417153
Iteration 4: loglikelihood = 2616.3189330378927, backtracks = 0, tol = 0.0013206


Compute time (sec):     1.586827039718628
Final loglikelihood:    2619.4732108754683
Iterations:             15
Trait 1's SNP PVE:      0.7874775482319452
Trait 2's SNP PVE:      0.7474637310695169

Trait 1: IHT estimated 4 nonzero SNP predictors
4×2 DataFrame
 Row │ Position  Estimated_β
     │ Int64     Float64
─────┼───────────────────────
   1 │     3634    -0.21381
   2 │     4935     0.884535
   3 │     6526    -0.544904
   4 │     9269    -1.87866

Trait 1: IHT estimated 1 non-genetic predictors
1×2 DataFrame
 Row │ Position  Estimated_β
     │ Int64     Float64
─────┼───────────────────────
   1 │        1      10.0304

Trait 2: IHT estimated 6 nonzero SNP predictors
6×2 DataFrame
 Row │ Position  Estimated_β
     │ Int64     Float64
─────┼───────────────────────
   1 │     1624    -0.916




In [27]:
result2 = fit_iht(Yt, Transpose(xla), k=12, debias=true)

****                   MendelIHT Version 1.4.0                  ****
****     Benjamin Chu, Kevin Keys, Chris German, Hua Zhou       ****
****   Jin Zhou, Eric Sobel, Janet Sinsheimer, Kenneth Lange    ****
****                                                            ****
****                 Please cite our paper!                     ****
****         https://doi.org/10.1093/gigascience/giaa044        ****

Running sparse Multivariate Gaussian regression
Link functin = IdentityLink()
Sparsity parameter (k) = 12
Prior weight scaling = off
Doubly sparse projection = off
Debias = on
Max IHT iterations = 200
Converging when tol < 0.0001:

Iteration 1: loglikelihood = 311.122249277171, backtracks = 0, tol = 0.17051846008510982
Iteration 2: loglikelihood = 2368.959802261556, backtracks = 0, tol = 0.020586437386725893
Iteration 3: loglikelihood = 2559.7083449041525, backtracks = 0, tol = 2.0130182239936735e-17
Iteration 4: loglikelihood = 2595.741124173523, backtracks = 0, t


Compute time (sec):     0.53472900390625
Final loglikelihood:    2611.308544163625
Iterations:             5
Trait 1's SNP PVE:      0.7953165106913967
Trait 2's SNP PVE:      0.7736662179178111

Trait 1: IHT estimated 4 nonzero SNP predictors
4×2 DataFrame
 Row │ Position  Estimated_β
     │ Int64     Float64
─────┼───────────────────────
   1 │     3634    -0.227077
   2 │     4935     0.888378
   3 │     6526    -0.538614
   4 │     9269    -1.88919

Trait 1: IHT estimated 1 non-genetic predictors
1×2 DataFrame
 Row │ Position  Estimated_β
     │ Int64     Float64
─────┼───────────────────────
   1 │        1      10.0304

Trait 2: IHT estimated 6 nonzero SNP predictors
6×2 DataFrame
 Row │ Position  Estimated_β
     │ Int64     Float64
─────┼───────────────────────
   1 │     1624    -0.93505




In [29]:
# trait 1: true beta vs. estimate without debiasing vs. estimate with debiasing
[true_b[correct_snps, 1] result.beta[1, correct_snps] result2.beta[1, correct_snps]]

10×3 Array{Float64,2}:
 -0.203676  -0.21381   -0.227077
  0.892669   0.884535   0.888378
 -0.537023  -0.544904  -0.538614
 -1.93132   -1.87866   -1.88919
  0.0        0.0        0.0
  0.0        0.0        0.0
  0.892669   0.884535   0.888378
  0.0        0.0        0.0
  0.0        0.0        0.0
 -1.93132   -1.87866   -1.88919

In [30]:
# trait 2: true beta vs. estimate without debiasing vs. estimate with debiasing
[true_b[correct_snps, 2] result.beta[2, correct_snps] result2.beta[2, correct_snps]]

10×3 Array{Float64,2}:
  0.0        0.0        0.0
 -0.815928  -0.796363  -0.80035
  0.0        0.0        0.0
 -0.246635  -0.273261  -0.26297
 -0.910143  -0.916906  -0.935058
  1.00117    1.00181    1.01879
 -0.815928  -0.796363  -0.80035
  0.472513   0.469215   0.462194
 -1.00364   -0.993253  -1.02723
 -0.246635  -0.273261  -0.26297
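
One way to summarize a comparison like the trait 2 table above is the largest absolute deviation from the truth. The sketch below copies the distinct nonzero rows of that table by hand; `max_abs_error` is a hypothetical helper introduced only for illustration.

```julia
# Values copied from the trait 2 comparison table (distinct nonzero rows only)
truth    = [-0.815928, -0.246635, -0.910143, 1.00117, 0.472513, -1.00364]
nodebias = [-0.796363, -0.273261, -0.916906, 1.00181, 0.469215, -0.993253]
debiased = [-0.80035, -0.26297, -0.935058, 1.01879, 0.462194, -1.02723]

# Hypothetical helper: largest absolute deviation of an estimate from the truth
max_abs_error(est, truth) = maximum(abs.(est .- truth))

err_nodebias = max_abs_error(nodebias, truth)
err_debiased = max_abs_error(debiased, truth)
println("max |error| without debiasing: ", err_nodebias)
println("max |error| with debiasing:    ", err_debiased)
```

A single simulated data set like this one is not enough to conclude that either setting is more accurate in general.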