In [1]:
using Distributed
addprocs(4)

@everywhere begin
    using Revise
    using MendelIHT
    using SnpArrays
    using Random
    using GLM
    using DelimitedFiles
    using Test
    using Distributions
    using LinearAlgebra
    using CSV
    using DataFrames
    using StatsBase
    BLAS.set_num_threads(1) # use 1 BLAS thread per process so the 4 workers do not oversubscribe CPU cores
#     using TraitSimulation, OrdinalMultinomialModels, VarianceComponentModels
end

┌ Info: Precompiling MendelIHT [921c7187-1484-5754-b919-5d3ed9ac03c4]
└ @ Base loading.jl:1278


# Univariate Gaussian trait

In [3]:
n = 1000  # number of samples
p = 10000 # number of SNPs
k = 10    # number of causal SNPs per trait
d = Normal
l = canonicallink(d())

# set random seed for reproducibility
Random.seed!(2021)

# simulate `.bed` file with no missing data
x = simulate_random_snparray(undef, n, p)
xla = SnpLinAlg{Float64}(x, model=ADDITIVE_MODEL, center=true, scale=true) 

# intercept is the only nongenetic covariate
z = ones(n)
intercept = 1.0

# simulate response y, true model b, and the correct non-0 positions of b
y, true_b, correct_position = simulate_random_response(xla, k, d, l, Zu=z*intercept);
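
For intuition about the kind of data simulated above, here is a minimal sketch of a sparse Gaussian model built with the standard library only. It is not how `simulate_random_response` is implemented: the dense matrix `X` is a stand-in for the standardized genotypes, the standard-normal effect sizes and unit noise variance are assumptions, and the small dimensions and all variable names are hypothetical.

```julia
using Random

rng = MersenneTwister(2021)
n, p, k = 100, 500, 10                 # small, illustrative dimensions

X = randn(rng, n, p)                   # dense stand-in for centered and scaled genotypes
b = zeros(p)                           # sparse effect vector
causal = sort(randperm(rng, p)[1:k])   # k distinct causal positions
b[causal] = randn(rng, k)              # assumed effect-size distribution

intercept = 1.0
y = X * b .+ intercept .+ randn(rng, n)  # identity link: mean is X*b + intercept

println("causal positions: ", causal)
```

Only the `k` entries of `b` indexed by `causal` are nonzero, which is the structure IHT tries to recover.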

## Run IHT

In [9]:
# k = 11 leaves room for the 10 causal SNPs plus the intercept
@time result = fit_iht(y, xla, z, k=11)

****                   MendelIHT Version 1.4.0                  ****
****     Benjamin Chu, Kevin Keys, Chris German, Hua Zhou       ****
****   Jin Zhou, Eric Sobel, Janet Sinsheimer, Kenneth Lange    ****
****                                                            ****
****                 Please cite our paper!                     ****
****         https://doi.org/10.1093/gigascience/giaa044        ****

Running sparse linear regression
Link functin = IdentityLink()
Sparsity parameter (k) = 11
Prior weight scaling = off
Doubly sparse projection = off
Debias = off
Max IHT iterations = 200
Converging when tol < 0.0001:

Iteration 1: loglikelihood = -1577.1707947596879, backtracks = 0, tol = 0.6098646752831619
Iteration 2: loglikelihood = -1484.856813620618, backtracks = 0, tol = 0.1269955771967064
Iteration 3: loglikelihood = -1472.9529635904933, backtracks = 0, tol = 0.05823372413927715
Iteration 4: loglikelihood = -1472.5366421393844, backtracks = 1, tol = 0.004508958581388824
I


IHT estimated 10 nonzero SNP predictors and 1 non-genetic predictors.

Compute time (sec):     0.1159050464630127
Final loglikelihood:    -1472.3907123942718
SNP PVE:                0.8426548660312022
Iterations:             10

Selected genetic predictors:
10×2 DataFrame
 Row │ Position  Estimated_β
     │ Int64     Float64
─────┼───────────────────────
   1 │      782    -0.437833
   2 │      901     0.74798
   3 │     1204     0.691217
   4 │     1306    -1.42511
   5 │     1655    -0.194394
   6 │     3160    -0.861465
   7 │     3936    -0.147045
   8 │     4201     0.338675
   9 │     4402    -0.126433
  10 │     6879    -1.21895

Selected nongenetic predictors:
1×2 DataFrame
 Row │ Position  Estimated_β
     │ Int64     Float64
─────┼───────────────────────
   1 │        1      1.02016
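
The core idea behind the fit above, a gradient step followed by projection onto the `k` largest-magnitude coefficients, can be sketched for plain least squares as follows. This is a simplified illustration and not MendelIHT's implementation: it uses a fixed step size instead of backtracking, treats all predictors alike, and the function names `project_k` and `iht_sketch` are hypothetical.

```julia
using LinearAlgebra, Random

# Keep the k largest-magnitude entries of v and set the rest to zero.
function project_k(v::AbstractVector, k::Integer)
    out = zeros(eltype(v), length(v))
    keep = partialsortperm(abs.(v), 1:k, rev=true)
    out[keep] = v[keep]
    return out
end

# Iterative hard thresholding for 0.5 * ||y - X*b||^2 with at most k nonzeros.
function iht_sketch(X, y, k; iters=200, tol=1e-4)
    b = zeros(size(X, 2))
    step = 1 / opnorm(X)^2               # conservative fixed step size
    for _ in 1:iters
        grad = X' * (y - X * b)          # negative gradient of the loss
        b_new = project_k(b + step * grad, k)
        converged = norm(b_new - b) < tol
        b = b_new
        converged && break
    end
    return b
end

# Small noiseless example with two nonzero effects.
rng = MersenneTwister(1)
X = randn(rng, 50, 20)
b_true = zeros(20)
b_true[[3, 7]] = [2.0, -3.0]
y = X * b_true
b_hat = iht_sketch(X, y, 2)
println("selected positions: ", findall(!iszero, b_hat))
```

The projection step is what enforces the sparsity parameter: every iterate has at most `k` nonzero entries.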

## Check answer

In [10]:
[true_b[correct_position] result.beta[correct_position]]

10×2 Array{Float64,2}:
 -0.402269   -0.437833
  0.758756    0.74798
  0.729135    0.691217
 -1.47163    -1.42511
 -0.172668   -0.194394
 -0.847906   -0.861465
  0.296183    0.338675
 -0.0034339   0.0
  0.125965    0.0
 -1.24972    -1.21895

In [11]:
# non genetic covariates
[result.c intercept]

1×2 Array{Float64,2}:
 1.02016  1.0
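
The comparison above can be summarized numerically. The sketch below hard-codes the two columns printed by `[true_b[correct_position] result.beta[correct_position]]` so that it runs on its own; the variable names are hypothetical, and in the live session the same summary could be computed from `true_b`, `result.beta`, and `correct_position` directly.

```julia
# Columns copied from the printed 10×2 comparison matrix.
true_effects = [-0.402269, 0.758756, 0.729135, -1.47163, -0.172668,
                -0.847906, 0.296183, -0.0034339, 0.125965, -1.24972]
estimated    = [-0.437833, 0.74798, 0.691217, -1.42511, -0.194394,
                -0.861465, 0.338675, 0.0, 0.0, -1.21895]

recovered = findall(!iszero, estimated)   # causal SNPs that IHT selected
missed    = findall(iszero, estimated)    # causal SNPs estimated as zero

# Largest absolute estimation error among the recovered effects.
max_err = maximum(abs.(true_effects[recovered] .- estimated[recovered]))

println("recovered ", length(recovered), " of ", length(true_effects), " causal SNPs")
println("missed rows: ", missed, ", true effects: ", true_effects[missed])
println("max abs error among recovered: ", max_err)
```

The two missed SNPs have the smallest true effects in absolute value. Because the fit selected 10 SNPs in total, the remaining two selected positions are not among the causal ones.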

## Test cross-validation

In [14]:
Random.seed!(2020)
@time mses = cv_iht(y, xla, z);

Cross validating...100%|████████████████████████████████| Time: 0:00:05

Crossvalidation Results:
	k	MSE
	1	1218.4976589426035
	2	842.5598738789062
	3	634.0396881925165
	4	487.58571842217555
	5	391.3888892988487
	6	305.3137200757752
	7	267.91129790103656
	8	243.05905296570523
	9	243.47674074722744
	10	245.6469865967401
	11	250.6216744074316
	12	253.98118656214808
	13	254.78915817386076
	14	255.89349140083132
	15	263.59632300410664
	16	269.0451615858919
	17	271.040578488372
	18	274.4169982210701
	19	279.30823157249597
	20	284.55829823829794

Best k = 8

  5.162944 seconds (32.17 k allocations: 6.922 MiB)
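
The reported best `k` is the sparsity level with the smallest cross-validated MSE. The sketch below reproduces that selection from the values printed above, assuming entry `i` of the vector corresponds to `k = i` (the table lists `k = 1` through `20`); the name `cv_mses` is hypothetical and stands in for the `mses` returned by `cv_iht`.

```julia
# MSE values copied from the cross-validation table, one per k = 1, ..., 20.
cv_mses = [1218.4976589426035, 842.5598738789062, 634.0396881925165,
           487.58571842217555, 391.3888892988487, 305.3137200757752,
           267.91129790103656, 243.05905296570523, 243.47674074722744,
           245.6469865967401, 250.6216744074316, 253.98118656214808,
           254.78915817386076, 255.89349140083132, 263.59632300410664,
           269.0451615858919, 271.040578488372, 274.4169982210701,
           279.30823157249597, 284.55829823829794]

best_k = argmin(cv_mses)   # index of the smallest MSE
println("Best k = ", best_k)

# In the live session, a natural next step would be to refit with this value,
# e.g. fit_iht(y, xla, z, k=best_k).
```

The MSE values for `k = 8` through `10` differ by only a few units, so the minimum is fairly flat around the chosen value.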
