# Use IHT to estimate a phenotype's Percent Variation Explained (PVE)

In IHT, the **Percent Variation Explained (PVE)** for univariate traits is defined as

$$h = \frac{var(\hat{\mathbf{y}})}{var(\mathbf{y})}$$

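where $\mathbf{y}$ is the vector of phenotype values and $\hat{\mathbf{y}} = \mathbf{X}\hat{\beta}$ is the predicted phenotype. Despite the name, the value is reported as a proportion (for example, 0.84) rather than as a percentage.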
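
The definition above can be sketched in a few lines. This is a minimal illustration that only assumes the formula as written; the helper name `pve` and the toy vectors are hypothetical and are not part of MendelIHT.

```julia
using Statistics

# PVE as defined above: variance of the predicted phenotype
# divided by the variance of the observed phenotype
pve(y, yhat) = var(yhat) / var(y)

# toy numbers for illustration only (not taken from the simulation in this notebook)
y_toy    = [1.0, 2.0, 3.0, 4.0, 5.0]
yhat_toy = [1.5, 2.0, 3.0, 4.0, 4.5]
pve_toy  = pve(y_toy, yhat_toy)
```

Here `var` is the sample variance from the `Statistics` standard library, so the same normalization is applied to numerator and denominator.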

In [1]:
using Distributed
# addprocs(4)

@everywhere begin
    using Revise
    using MendelIHT
    using SnpArrays
    using Random
    using GLM
    using DelimitedFiles
    using Test
    using Distributions
    using LinearAlgebra
    using CSV
    using DataFrames
end

using VarianceComponentModels

┌ Info: Precompiling MendelIHT [921c7187-1484-5754-b919-5d3ed9ac03c4]
└ @ Base loading.jl:1278


# Simulate data

Our simple simulation involves 10 causal SNPs and an intercept of 10.0:

$$y_i = 10.0 + \mathbf{x}_i^t\beta + \epsilon_i$$

In [2]:
n = 1000  # number of samples
p = 10000 # number of SNPs
k = 10    # number of causal SNPs per trait
d = Normal
l = canonicallink(d())

# set random seed for reproducibility
Random.seed!(2021)

# simulate `.bed` file with no missing data
x = simulate_random_snparray("heritability/univariate.bed", n, p)
xla = SnpLinAlg{Float64}(x, model=ADDITIVE_MODEL, center=true, scale=true) 

# intercept is the only nongenetic covariate
z = ones(n)
intercept = 10.0

# simulate response y, true model b, and the correct non-0 positions of b
y, true_b, correct_position = simulate_random_response(xla, k, d, l, Zu=z*intercept);

# save true SNP's position and effect size
open("heritability/univariate_true_beta.txt", "w") do io
    println(io, "snpID,effectsize")
    println(io, "intercept,$intercept")
    for pos in correct_position
        println(io, "snp$pos,", true_b[pos])
    end
end

# create `.bim` and `.fam` files using phenotype
make_bim_fam_files(x, y, "heritability/univariate")

# create `.phen` file for GCTA
open("heritability/univariate.phen", "w") do io
    for i in 1:length(y)
        println(io, "$i\t1\t$(y[i])")
    end
end
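
For intuition about the target that each method is trying to recover, the PVE of the true model can be computed directly in a simulation. The sketch below is a small dense stand-in for the simulation above, not a call into `simulate_random_response`: the matrix `X`, the `_toy` sizes, the unit-variance noise, and the name `oracle_pve` are all assumptions made for illustration.

```julia
using Random, Statistics

Random.seed!(2021)
n_toy, p_toy, k_toy = 200, 50, 5          # much smaller than the notebook's n, p, k
X = randn(n_toy, p_toy)                   # dense stand-in for centered, scaled genotypes
beta = zeros(p_toy)
causal = randperm(p_toy)[1:k_toy]         # positions of the causal predictors
beta[causal] .= randn(k_toy)              # random effect sizes at those positions
y_toy = 10.0 .+ X * beta .+ randn(n_toy)  # intercept + genetic effect + N(0, 1) noise (assumed)

# PVE of the true model: variance of the genetic component over phenotype variance
oracle_pve = var(X * beta) / var(y_toy)
```

An estimate close to `oracle_pve` would indicate that a method recovers the variance attributable to the simulated causal effects.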

# Run IHT

In [4]:
ktrue = k + (intercept == 0 ? 0 : 1)
@time result = fit_iht(y, xla, z, d=d(), l=l, k=ktrue)

****                   MendelIHT Version 1.3.3                  ****
****     Benjamin Chu, Kevin Keys, Chris German, Hua Zhou       ****
****   Jin Zhou, Eric Sobel, Janet Sinsheimer, Kenneth Lange    ****
****                                                            ****
****                 Please cite our paper!                     ****
****         https://doi.org/10.1093/gigascience/giaa044        ****

Running sparse linear regression
Link functin = IdentityLink()
Sparsity parameter (k) = 11
Prior weight scaling = off
Doubly sparse projection = off
Debias = off
Max IHT iterations = 100
Converging when tol < 0.0001:

Iteration 1: loglikelihood = -1577.1707947596888, backtracks = 0, tol = 0.11179714706508592
Iteration 2: loglikelihood = -1484.8568136206184, backtracks = 0, tol = 0.025721679193033527
Iteration 3: loglikelihood = -1472.952963590493, backtracks = 0, tol = 0.012701500670561118
Iteration 4: loglikelihood = -1472.5366421393844, backtracks = 1, tol = 0.0009958001674466


IHT estimated 10 nonzero SNP predictors and 1 non-genetic predictors.

Compute time (sec):     0.5100820064544678
Final loglikelihood:    -1472.391164492576
SNP PVE:                0.8425967157821636
Iterations:             8

Selected genetic predictors:
10×2 DataFrame
 Row │ Position  Estimated_β
     │ Int64     Float64
─────┼───────────────────────
   1 │      782    -0.437843
   2 │      901     0.748033
   3 │     1204     0.690989
   4 │     1306    -1.42524
   5 │     1655    -0.194052
   6 │     3160    -0.861222
   7 │     3936    -0.14667
   8 │     4201     0.338804
   9 │     4402    -0.126361
  10 │     6879    -1.21894

Selected nongenetic predictors:
1×2 DataFrame
 Row │ Position  Estimated_β
     │ Int64     Float64
─────┼───────────────────────
   1 │        1      10.0202

In [5]:
# compare estimated vs true beta values
[result.beta[correct_position] true_b[correct_position]]

10×2 Array{Float64,2}:
 -0.437843  -0.402269
  0.748033   0.758756
  0.690989   0.729135
 -1.42524   -1.47163
 -0.194052  -0.172668
 -0.861222  -0.847906
  0.338804   0.296183
  0.0       -0.0034339
  0.0        0.125965
 -1.21894   -1.24972
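
In the comparison above, the two rows with an estimate of `0.0` are causal SNPs that IHT did not select; both have small true effects (-0.0034339 and 0.125965). One way to summarize support recovery is to count true positives, false positives, and false negatives from the two index sets. The sketch below uses hypothetical positions rather than the notebook's variables.

```julia
# hypothetical positions, for illustration only
selected = [782, 901, 3936]   # indices with a nonzero estimated effect
causal   = [782, 901, 5000]   # indices with a nonzero true effect

true_pos  = length(intersect(selected, causal))  # causal SNPs that were selected
false_pos = length(setdiff(selected, causal))    # selected SNPs that are not causal
false_neg = length(setdiff(causal, selected))    # causal SNPs that were missed
```

In this notebook, `findall(!iszero, result.beta)` and `correct_position` would presumably play the roles of `selected` and `causal`.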

# GEMMA estimated PVE

GEMMA's linear mixed model estimates the proportion of variance in phenotypes explained (PVE) as $pve = 0.444316$, with standard error $se(pve) = 0.132402$.

In [2]:
;cat pve/gemma_run.out

GEMMA 0.98.4 (2021-01-29) by Xiang Zhou and team (C) 2012-2021
Reading Files ... 
## number of total individuals = 1000
## number of analyzed individuals = 1000
## number of covariates = 1
## number of phenotypes = 1
## number of total SNPs/var        =    10000
## number of analyzed SNPs         =    10000
Start Eigen-Decomposition...
pve estimate =0.444316
se(pve) =0.132402
**** INFO: Done.


# Bayesian GEMMA estimated PVE

GEMMA's Bayesian sparse linear mixed model run (`-bslmm 1`) reports $pve = 0.461838$ with standard error $se(pve) = 0.132357$. The log labels both values as estimates in the null model.

In [3]:
;cat pve/gemma.pve.result.log.txt

##
## GEMMA Version    = 0.98.4 (2021-01-29)
## Build profile    = /gnu/store/8mkllydvkgfy6ydlrymrx8wj0dy1x6lm-profile
## GCC version      = 7.5.0
## GSL Version      = 2.6
## OpenBlas         = OpenBLAS 0.3.9  - OpenBLAS 0.3.9 DYNAMIC_ARCH NO_AFFINITY SkylakeX MAX_THREADS=128
##   arch           = SkylakeX
##   threads        = 24
##   parallel type  = threaded
##
## Command Line Input = gemma -bfile univariate -bslmm 1 -maf 0.0000001 -o gemma.pve.result 
##
## Date = Sat Mar 13 12:04:53 2021
##
## Summary Statistics:
## number of total individuals = 1000
## number of analyzed individuals = 1000
## number of covariates = 1
## number of phenotypes = 1
## number of total SNPs/var = 10000
## number of analyzed SNPs/var = 10000
## REMLE log-likelihood in the null model = -2389.39
## MLE log-likelihood in the null model = -2390.97
## pve estimate in the null model = 0.461838
## se(pve) in the null model = 0.132357
## vg estimate in the null model = 0
## ve estimate in the null model = 0
##

# GCTA estimated heritability

GCTA estimates the heritability as $V(G)/V_p = 0.725562$ (about 0.726), with standard error 0.132691.

In [11]:
;cat heritability/gcta.univariate.hsq

Source	Variance	SE
V(G)	5.131934	1.015303
V(e)	1.941108	0.923529
Vp	7.073042	0.324029
V(G)/Vp	0.725562	0.132691
logL	-1467.224
logL0	-1480.753
LRT	27.058
df	1
Pval	9.8735e-08
n	1000
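
The `V(G)/Vp` row follows directly from the variance components printed in the table above. The arithmetic is written out here as a check, using only those printed values:

$$\frac{V(G)}{V_p} = \frac{V(G)}{V(G) + V(e)} = \frac{5.131934}{5.131934 + 1.941108} = \frac{5.131934}{7.073042} \approx 0.7256$$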


# VarianceComponentModels.jl estimated heritability

In [6]:
Φgrm = grm(x; method = :Robust) # genetic relationship matrix
VCdata = VarianceComponentVariate(y, z, (2Φgrm, Matrix(1.0I, n, n)))

# pre-compute eigen-decomposition 
@time VCdata_rotated = TwoVarCompVariateRotate(VCdata)
fieldnames(typeof(VCdata_rotated))

# form data set for trait 
trait_data = TwoVarCompVariateRotate(VCdata_rotated.Yrot, 
    VCdata_rotated.Xrot, VCdata_rotated.eigval, VCdata_rotated.eigvec, 
    VCdata_rotated.logdetV2)

# initialize model parameters
trait_model = VarianceComponentModel(trait_data)

# estimate variance components
_, _, _, Σcov, = mle_fs!(trait_model, trait_data; solver=:Ipopt, verbose=false)
σ2a = trait_model.Σ[1][1] # additive genetic variance 
σ2e = trait_model.Σ[2][1] # environmental variance 
@show σ2a, σ2e
@show σ2a / (σ2a + σ2e)

  0.499204 seconds (1.17 M allocations: 81.571 MiB, 3.75% gc time)

******************************************************************************
This program contains Ipopt, a library for large-scale nonlinear optimization.
 Ipopt is released as open source code under the Eclipse Public License (EPL).
         For more information visit https://github.com/coin-or/Ipopt
******************************************************************************

(σ2a, σ2e) = (3.257729405950718, 3.8241316336013482)
σ2a / (σ2a + σ2e) = 0.4600103543060726


0.4600103543060726

# Conclusion (univariate traits)

| Method                     | Estimated PVE or heritability |
|----------------------------|-------------------------------|
| MendelIHT.jl               | 0.8426                        |
| GCTA                       | 0.7256                        |
| GEMMA (Bayesian LMM)       | 0.4618                        |
| VarianceComponentModels.jl | 0.4600                        |
| GEMMA (LMM)                | 0.4443                        |

# Multivariate traits 

For multivariate traits, $\mathbf{Y} = (\mathbf{y}_1, \ldots, \mathbf{y}_r)$, we define the PVE for trait $i$ as

$$h_i = \frac{var(\hat{\mathbf{y}}_i)}{var(\mathbf{y}_i)}.$$
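
The per-trait definition can be sketched the same way as the univariate one. This minimal illustration assumes traits are stored as columns of `Y` and `Yhat`; the helper name `pve_per_trait` and the toy matrices are hypothetical and not part of MendelIHT.

```julia
using Statistics

# per-trait PVE: column i of Yhat is compared against column i of Y
pve_per_trait(Y, Yhat) = [var(Yhat[:, i]) / var(Y[:, i]) for i in axes(Y, 2)]

# toy data with 4 samples and 2 traits (illustration only)
Y    = [1.0 2.0; 2.0 4.0; 3.0 6.0; 4.0 8.0]
Yhat = [1.5 2.0; 2.0 4.0; 3.0 6.0; 3.5 8.0]
h = pve_per_trait(Y, Yhat)
```

Note that the cells below pass the phenotypes to `fit_iht` with traits as rows (`Yt`), so a matrix in that layout would need to be transposed before using this sketch.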

In [7]:
n = 1000  # number of samples
p = 10000 # number of SNPs
k = 10    # number of causal SNPs
r = 2     # number of traits

# set random seed for reproducibility
Random.seed!(2021)

# simulate `.bed` file with no missing data
x = simulate_random_snparray("multivariate_$(r)traits.bed", n, p)
xla = SnpLinAlg{Float64}(x, model=ADDITIVE_MODEL, center=true, scale=true) 

# intercept is the only nongenetic covariate
z = ones(n, 1)
intercepts = 10.0 .* randn(r)' # each trait has a different intercept

# simulate response y, true model b, and the correct non-0 positions of b
Y, true_Σ, true_b, correct_position = simulate_random_response(xla, k, r, Zu=z*intercepts, overlap=2);

Yt = Matrix(Y')
Zt = Matrix(z');

In [13]:
ktrue = k + count(!iszero, intercepts)
@time result = fit_iht(Yt, Transpose(xla), Zt, k=ktrue, verbose=true)

****                   MendelIHT Version 1.3.3                  ****
****     Benjamin Chu, Kevin Keys, Chris German, Hua Zhou       ****
****   Jin Zhou, Eric Sobel, Janet Sinsheimer, Kenneth Lange    ****
****                                                            ****
****                 Please cite our paper!                     ****
****         https://doi.org/10.1093/gigascience/giaa044        ****

Running sparse Multivariate Gaussian regression
Link functin = IdentityLink()
Sparsity parameter (k) = 12
Prior weight scaling = off
Doubly sparse projection = off
Debias = off
Max IHT iterations = 100
Converging when tol < 0.0001:

Iteration 1: loglikelihood = 246.48266427359408, backtracks = 0, tol = 0.15459640295013669
Iteration 2: loglikelihood = 1468.5670281874275, backtracks = 0, tol = 0.036687758419846395
Iteration 3: loglikelihood = 1525.390857344177, backtracks = 0, tol = 0.0029580061822276106
Iteration 4: loglikelihood = 1526.6178339030791, backtracks = 0, tol = 0.0012


Compute time (sec):     1.8533520698547363
Final loglikelihood:    1527.1518814099722
Iterations:             21
Trait 1's SNP PVE:      0.6095385242121772
Trait 2's SNP PVE:      0.603060240834564

Trait 1: IHT estimated 4 nonzero SNP predictors
4×2 DataFrame
 Row │ Position  Estimated_β
     │ Int64     Float64
─────┼───────────────────────
   1 │     5668    -0.207284
   2 │     5797     0.575018
   3 │     6812     1.42102
   4 │     7988     1.26966

Trait 1: IHT estimated 1 non-genetic predictors
1×2 DataFrame
 Row │ Position  Estimated_β
     │ Int64     Float64
─────┼───────────────────────
   1 │        1     -1.69969

Trait 2: IHT estimated 6 nonzero SNP predictors
6×2 DataFrame
 Row │ Position  Estimated_β
     │ Int64     Float64
─────┼───────────────────────
   1 │      222     0.393




In [9]:
# trait 1: estimated vs true effect sizes at the causal positions
β1 = result.beta[1, :]
true_b1_idx = findall(!iszero, true_b[:, 1])
[β1[true_b1_idx] true_b[true_b1_idx, 1]]

4×2 Array{Float64,2}:
 -0.207284  -0.224675
  0.575018   0.531549
  1.42102    1.43455
  1.26966    1.25668

In [10]:
# trait 2: estimated vs true effect sizes at the causal positions
β2 = result.beta[2, :]
true_b2_idx = findall(!iszero, true_b[:, 2])
[β2[true_b2_idx] true_b[true_b2_idx, 2]]

6×2 Array{Float64,2}:
  0.393725   0.315219
  0.607747   0.609812
  1.18908    1.20121
  0.784013   0.812423
  0.771363   0.808327
 -0.547082  -0.589568

In [11]:
# nongenetic covariates: estimated vs true intercepts
[result.c intercepts']

2×2 Array{Float64,2}:
 -1.69969  -1.72668
  7.3276    7.29135

In [12]:
# covariance matrix: estimated vs true entries
[vec(result.Σ) vec(true_Σ)]

4×2 Array{Float64,2}:
  2.43694   2.53934
 -1.80781  -1.85399
 -1.80781  -1.85399
  2.39758   2.41416