In [1]:
using Distributed
# addprocs(4)

@everywhere begin
    using Revise
    using MendelIHT
    using SnpArrays
    using Random
    using GLM
    using DelimitedFiles
    using Test
    using Distributions
    using LinearAlgebra
    using CSV
    using DataFrames
end

┌ Info: Precompiling MendelIHT [921c7187-1484-5754-b919-5d3ed9ac03c4]
└ @ Base loading.jl:1278


# Simulate data & run IHT

In [9]:
n = 1000  # number of samples
p = 10000 # number of SNPs
k = 10    # number of causal SNPs per trait
d = Normal
l = canonicallink(d())

# set random seed for reproducibility
Random.seed!(2021)

# simulate `.bed` file with no missing data
x = simulate_random_snparray("heritability/univariate.bed", n, p)
xla = SnpLinAlg{Float64}(x, model=ADDITIVE_MODEL, center=true, scale=true) 

# intercept is the only nongenetic covariate
z = ones(n)
intercept = 1.0

# simulate response y, true model b, and the correct non-0 positions of b
y, true_b, correct_position = simulate_random_response(xla, k, d, l, Zu=z*intercept);

# save true SNP's position and effect size
open("heritability/univariate_true_beta.txt", "w") do io
    println(io, "snpID,effectsize")
    println(io, "intercept,$intercept")
    for pos in correct_position
        println(io, "snp$pos,", true_b[pos])
    end
end

# create `.bim` and `.fam` files using phenotype
make_bim_fam_files(x, y, "heritability/univariate")

# create `.phen` file for GCTA
open("heritability/univariate.phen", "w") do io
    for i in eachindex(y)
        println(io, "$i\t1\t$(y[i])")
    end
end

In [3]:
ktrue = k + (intercept == 0 ? 0 : 1)
@time result = fit_iht(y, xla, z, d=d(), l=l, k=ktrue)

****                   MendelIHT Version 1.3.3                  ****
****     Benjamin Chu, Kevin Keys, Chris German, Hua Zhou       ****
****   Jin Zhou, Eric Sobel, Janet Sinsheimer, Kenneth Lange    ****
****                                                            ****
****                 Please cite our paper!                     ****
****         https://doi.org/10.1093/gigascience/giaa044        ****

Running sparse linear regression
Link functin = IdentityLink()
Sparsity parameter (k) = 11
Prior weight scaling = off
Doubly sparse projection = off
Debias = off
Max IHT iterations = 100
Converging when tol < 0.0001:

Iteration 1: loglikelihood = -1577.170794759688, backtracks = 0, tol = 0.609864675283163
Iteration 2: loglikelihood = -1484.8568136206177, backtracks = 0, tol = 0.1269955771967065
Iteration 3: loglikelihood = -1472.9529635904933, backtracks = 0, tol = 0.05823372413927707
Iteration 4: loglikelihood = -1472.5366421393842, backtracks = 1, tol = 0.004508958581388835
It


IHT estimated 10 nonzero SNP predictors and 1 non-genetic predictors.

Compute time (sec):     3.279958963394165
Final loglikelihood:    -1472.3905989669602
SNP heritability:       0.8426823000266493
Iterations:             10

Selected genetic predictors:
10×2 DataFrame
 Row │ Position  Estimated_β
     │ Int64     Float64
─────┼───────────────────────
   1 │      782    -0.437828
   2 │      901     0.747956
   3 │     1204     0.691327
   4 │     1306    -1.42505
   5 │     1655    -0.19456
   6 │     3160    -0.861591
   7 │     3936    -0.147235
   8 │     4201     0.338606
   9 │     4402    -0.126472
  10 │     6879    -1.21895

Selected nongenetic predictors:
1×2 DataFrame
 Row │ Position  Estimated_β
     │ Int64     Float64
─────┼───────────────────────
   1 │        1      1.02016

In [4]:
# compare estimated vs true beta values
[result.beta[correct_position] true_b[correct_position]]

10×2 Array{Float64,2}:
 -0.437828  -0.402269
  0.747956   0.758756
  0.691327   0.729135
 -1.42505   -1.47163
 -0.19456   -0.172668
 -0.861591  -0.847906
  0.338606   0.296183
  0.0       -0.0034339
  0.0        0.125965
 -1.21895   -1.24972
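
The comparison above can be summarized numerically. The following is a minimal sketch that hard-codes the two printed columns, so it runs without MendelIHT; the names `estimated`, `truth`, `recovered`, `missed`, and `max_err` are illustrative and not part of the original notebook.

```julia
# Values copied from the 10×2 matrix printed above:
# column 1 = IHT estimate, column 2 = simulated true effect.
estimated = [-0.437828, 0.747956, 0.691327, -1.42505, -0.19456,
             -0.861591, 0.338606, 0.0, 0.0, -1.21895]
truth     = [-0.402269, 0.758756, 0.729135, -1.47163, -0.172668,
             -0.847906, 0.296183, -0.0034339, 0.125965, -1.24972]

# A causal SNP counts as recovered when its estimate is nonzero.
recovered = count(!iszero, estimated)

# True effect sizes of the causal SNPs that IHT set to zero.
missed = truth[iszero.(estimated)]

# Largest absolute estimation error among the recovered SNPs.
max_err = maximum(abs.(estimated .- truth)[.!iszero.(estimated)])

println("recovered: ", recovered, " of ", length(truth))
println("missed effects: ", missed)
println("max abs error among recovered: ", max_err)
```

With these values, 8 of the 10 causal SNPs are recovered, and the two missed SNPs are the ones with the smallest true effects in absolute value (-0.0034 and 0.126). Because IHT selected 10 SNPs in total, the other 2 selected SNPs are not in the simulated causal set.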

# GEMMA estimated heritability

GEMMA estimates the proportion of phenotypic variance explained (PVE) as $pve = 0.444316$, with standard error $se(pve) = 0.132402$.

In [12]:
;cat heritability/gemma_run.out

GEMMA 0.98.4 (2021-01-29) by Xiang Zhou and team (C) 2012-2021
Reading Files ... 
## number of total individuals = 1000
## number of analyzed individuals = 1000
## number of covariates = 1
## number of phenotypes = 1
## number of total SNPs/var        =    10000
## number of analyzed SNPs         =    10000
Start Eigen-Decomposition...
pve estimate =0.444316
se(pve) =0.132402
**** INFO: Done.


# GCTA estimated heritability

GCTA estimates heritability as $V(G)/V_p = 0.725562$ (about 0.726), with standard error 0.132691.

In [11]:
;cat heritability/gcta.univariate.hsq

Source	Variance	SE
V(G)	5.131934	1.015303
V(e)	1.941108	0.923529
Vp	7.073042	0.324029
V(G)/Vp	0.725562	0.132691
logL	-1467.224
logL0	-1480.753
LRT	27.058
df	1
Pval	9.8735e-08
n	1000
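
The derived rows of the GCTA table can be checked by hand from the other rows. As a sketch, assuming the usual REML conventions that $V_p = V(G) + V(e)$ and that the likelihood-ratio statistic is twice the log-likelihood difference:

$$
\begin{aligned}
V_p &= V(G) + V(e) = 5.131934 + 1.941108 = 7.073042, \\
\frac{V(G)}{V_p} &= \frac{5.131934}{7.073042} \approx 0.7256, \\
\mathrm{LRT} &= 2\,(\log L - \log L_0) = 2\,(-1467.224 + 1480.753) = 27.058.
\end{aligned}
$$

All three values agree with the `Vp`, `V(G)/Vp`, and `LRT` rows of the table.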


# Conclusion

In a (very simple) simulation with 10 causal SNPs and an intercept, the three methods give the following heritability estimates:

| Method | Estimated heritability | Standard error |
|--------|------------------------|----------------|
| IHT    | 0.843                  | not reported   |
| GCTA   | 0.726                  | 0.133          |
| GEMMA  | 0.444                  | 0.132          |
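
To put the three point estimates on a common footing, one can form approximate 95% intervals for GCTA and GEMMA from their reported standard errors. This is a rough sketch that assumes a normal approximation (estimate ± 1.96 × SE), which can be poor for a quantity bounded in [0, 1]. The helpers `wald_interval` and `inside` are hypothetical and not part of any package used above. The IHT output reports no standard error, so only its point estimate is compared.

```julia
# Hypothetical helper: normal-approximation (Wald) 95% interval.
wald_interval(est, se; z = 1.96) = (est - z * se, est + z * se)

# Point estimates and standard errors copied from the outputs above.
gcta  = wald_interval(0.725562, 0.132691)
gemma = wald_interval(0.444316, 0.132402)
iht   = 0.8426823000266493   # SNP heritability printed by fit_iht

# Hypothetical helper: is x within the closed interval?
inside(x, interval) = interval[1] <= x <= interval[2]

println("GCTA  interval: ", round.(gcta,  digits = 3))
println("GEMMA interval: ", round.(gemma, digits = 3))
println("IHT inside GCTA interval:  ", inside(iht, gcta))
println("IHT inside GEMMA interval: ", inside(iht, gemma))
```

Under this approximation the IHT estimate falls inside the GCTA interval (about 0.465 to 0.986) but above the GEMMA interval (about 0.185 to 0.704).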