# Let's profile the `fit` function for univariate and multivariate IHT

Currently, multivariate IHT is ~20x slower than univariate IHT. Is that due to bad code, or is estimating the covariance matrix really that much slower?

In [1]:
using Revise
using MendelIHT
using SnpArrays
using Random
using GLM
using DelimitedFiles
using Test
using Distributions
using LinearAlgebra
using CSV
using DataFrames
using StatsBase
using Profile
using ProfileView
BLAS.set_num_threads(1) # remember to set BLAS threads to 1 !!!

┌ Info: Precompiling MendelIHT [921c7187-1484-5754-b919-5d3ed9ac03c4]
└ @ Base loading.jl:1278


## Univariate response with SnpLinAlg

In [2]:
n = 1000  # number of samples
p = 10000 # number of SNPs
k = 10    # number of causal SNPs per trait
d = Normal
l = canonicallink(d())

# set random seed for reproducibility
Random.seed!(2021)

# simulate `.bed` file with no missing data
x = simulate_random_snparray(undef, n, p)
xla = SnpLinAlg{Float64}(x, model=ADDITIVE_MODEL, center=true, scale=true) 

# intercept is the only nongenetic covariate
z = ones(n)
intercept = 1.0

# simulate response y, true model b, and the correct non-0 positions of b
Y, true_b, correct_position = simulate_random_response(xla, k, d, l, Zu=z*intercept);

In [4]:
Random.seed!(2020)
@time result = fit_iht(Y, xla, z);
speed_per_iter = result.time / result.iter

****                   MendelIHT Version 1.4.0                  ****
****     Benjamin Chu, Kevin Keys, Chris German, Hua Zhou       ****
****   Jin Zhou, Eric Sobel, Janet Sinsheimer, Kenneth Lange    ****
****                                                            ****
****                 Please cite our paper!                     ****
****         https://doi.org/10.1093/gigascience/giaa044        ****

Running sparse linear regression
Link functin = IdentityLink()
Sparsity parameter (k) = 10
Prior weight scaling = off
Doubly sparse projection = off
Debias = off
Max IHT iterations = 200
Converging when tol < 0.0001:

Iteration 1: loglikelihood = -1555.2120188322158, backtracks = 0, tol = 0.6244599273449228
Iteration 2: loglikelihood = -1486.0751108703125, backtracks = 0, tol = 0.12833948300313805
Iteration 3: loglikelihood = -1480.1283538979592, backtracks = 0, tol = 0.05314824937761959
Iteration 4: loglikelihood = -1479.7262774611138, backtracks = 0, tol = 0.006247043188130052

0.006113835743495396

Univariate IHT runs at 0.006 seconds per iteration. Let's profile our fit function.

In [21]:
@profview fit_iht(Y, xla, z, verbose=false)  # run once to trigger compilation (ignore this one)
@profview fit_iht(Y, xla, z, verbose=false)


In [18]:
fit_iht(Y, xla, z, verbose=false)
Profile.clear()
@profile fit_iht(Y, xla, z, verbose=false);
Profile.print()

Overhead ╎ [+additional indent] Count File:Line; Function
 ╎30 @Base/task.jl:356; (::IJulia.var"#15#18")()
 ╎ 30 @IJulia/src/eventloop.jl:8; eventloop(::ZMQ.Socket)
 ╎  30 @Base/essentials.jl:709; invokelatest
 ╎   30 @Base/essentials.jl:710; #invokelatest#1
 ╎    30 ...c/execute_request.jl:67; execute_request(::ZMQ.Socket, ::...
 ╎     30 ...c/SoftGlobalScope.jl:65; softscope_include_string(::Modu...
 ╎    ╎ 30 @Base/loading.jl:1091; include_string(::Function, ::M...
 ╎    ╎  30 @MendelIHT/src/fit.jl:58; (::MendelIHT.var"#fit_iht##kw"...
 ╎    ╎   4  @MendelIHT/src/fit.jl:70; fit_iht(::Array{Float64,1}, ::...
 ╎    ╎    4  ...data_structures.jl:111; initialize
 ╎    ╎     4  ...ata_structures.jl:118; initialize(::SnpLinAlg{Float...
 ╎    ╎    ╎ 3  .../src/utilities.jl:312; init_iht_indices!(::IHTVari...
 ╎    ╎    ╎  3  .../src/utilities.jl:116; score!(::IHTVariable{Float6...
 ╎    ╎    ╎   3  ...linalg_direct.jl:160; mul!(::Array{Float64,1}, :...
 ╎    ╎    ╎    3  ...linalg_direct.j

**Conclusion:** In univariate IHT, most of the time (26/30 samples) is spent in the `score!` function, specifically on the line `mul!(v.df, Transpose(x), v.r)` (i.e. computing the gradient, which requires multiplying the full genotype matrix by a dense vector). This is expected.
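When the tree output is too deep to read comfortably, a flat listing sorted by sample count can make the hot spot easier to find. The following is a minimal sketch, assuming a dense `randn` matrix as a stand-in for the genotype matrix; `grad_loop!` is a hypothetical helper that only mimics the gradient step and is not part of MendelIHT.

```julia
using LinearAlgebra, Profile, Random

Random.seed!(2021)
A = randn(1000, 10_000)   # dense stand-in for the genotype matrix (assumption)
r = randn(1000)           # stand-in for the residual vector
g = zeros(10_000)         # gradient buffer

# hypothetical helper: repeats the transposed matrix-vector product
function grad_loop!(g, A, r, iters)
    for _ in 1:iters
        mul!(g, Transpose(A), r)
    end
    return g
end

grad_loop!(g, A, r, 1)    # warm up so compilation is not profiled
Profile.clear()
@profile grad_loop!(g, A, r, 100)

# flat listing sorted by sample count; `mincount` hides rarely sampled lines
Profile.print(format=:flat, sortedby=:count, mincount=5)
```

The lines with the largest counts appear last, so the end of the listing is where to look first.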

## Multivariate response with SnpLinAlg

In [33]:
n = 1000  # number of samples
p = 10000 # number of SNPs
k = 10    # number of causal SNPs per trait
r = 2     # number of traits

# set random seed for reproducibility
Random.seed!(2021)

# simulate `.bed` file with no missing data
x = simulate_random_snparray(undef, n, p)
xla = SnpLinAlg{Float64}(x, model=ADDITIVE_MODEL, center=true, scale=true) 

# intercept is the only nongenetic covariate
z = ones(n, 1)
intercepts = randn(r)' # each trait has a different intercept

# simulate response y, true model b, and the correct non-0 positions of b
Y, true_Σ, true_b, correct_position = simulate_random_response(xla, k, r, Zu=z*intercepts, overlap=2);

In [42]:
Random.seed!(2020)
Yt = Matrix(Y')
Zt = Matrix(z')
@time result = fit_iht(Yt, Transpose(xla), Zt);
speed_per_iter = result.time / result.iter

****                   MendelIHT Version 1.4.0                  ****
****     Benjamin Chu, Kevin Keys, Chris German, Hua Zhou       ****
****   Jin Zhou, Eric Sobel, Janet Sinsheimer, Kenneth Lange    ****
****                                                            ****
****                 Please cite our paper!                     ****
****         https://doi.org/10.1093/gigascience/giaa044        ****

Running sparse Multivariate Gaussian regression
Link functin = IdentityLink()
Sparsity parameter (k) = 10
Prior weight scaling = off
Doubly sparse projection = off
Debias = off
Max IHT iterations = 200
Converging when tol < 0.0001:

Iteration 1: loglikelihood = 246.48266427359385, backtracks = 0, tol = 0.7520043193454602
Iteration 2: loglikelihood = 1480.502027439885, backtracks = 0, tol = 0.03791016822142674
Iteration 3: loglikelihood = 1501.078241267321, backtracks = 0, tol = 0.017401495424306013
Iteration 4: loglikelihood = 1503.2783927518146, backtracks = 0, tol = 0.00747364

0.022071493996514216
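To put the two per-iteration timings side by side, the ratio can be computed directly. This sketch only reuses the two numbers printed above; it describes this particular run (`r = 2` traits, `n = 1000`, `p = 10000`) and may not carry over to other problem sizes.

```julia
# per-iteration times (seconds) copied from the two runs above
univariate_per_iter   = 0.006113835743495396
multivariate_per_iter = 0.022071493996514216

# how many times slower one multivariate iteration is in this run
slowdown = multivariate_per_iter / univariate_per_iter
println("multivariate / univariate = ", round(slowdown, digits=2))
```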

In [11]:
@profview fit_iht(Yt, Transpose(xla), Zt, verbose=false)  # run once to trigger compilation (ignore this one)
@profview fit_iht(Yt, Transpose(xla), Zt, verbose=false)


In [9]:
fit_iht(Yt, Transpose(xla), Zt, verbose=false)
Profile.clear()
@profile fit_iht(Yt, Transpose(xla), Zt, verbose=false);
Profile.print()

Overhead ╎ [+additional indent] Count File:Line; Function
  ╎121 @Base/task.jl:356; (::IJulia.var"#15#18")()
  ╎ 121 @IJulia/src/eventloop.jl:8; eventloop(::ZMQ.Socket)
  ╎  121 @Base/essentials.jl:709; invokelatest
  ╎   121 @Base/essentials.jl:710; #invokelatest#1
  ╎    121 .../execute_request.jl:67; execute_request(::ZMQ.Socket, :...
  ╎     121 .../SoftGlobalScope.jl:65; softscope_include_string(::Mod...
 1╎    ╎ 121 @Base/loading.jl:1091; include_string(::Function, :...
  ╎    ╎  120 @MendelIHT/src/fit.jl:58; (::MendelIHT.var"#fit_iht##kw...
  ╎    ╎   6   @MendelIHT/src/fit.jl:70; fit_iht(::Array{Float64,2}, ...
  ╎    ╎    6   ...ata_structures.jl:111; initialize
  ╎    ╎     6   ...ata_structures.jl:118; initialize(::Transpose{Floa...
  ╎    ╎    ╎ 6   .../multivariate.jl:334; init_iht_indices!(::Mendel...
  ╎    ╎    ╎  6   .../multivariate.jl:54; score!(::MendelIHT.mIHTVar...
  ╎    ╎    ╎   6   .../multivariate.jl:73; update_df!(::MendelIHT.mI...
  ╎    ╎    ╎    6   ...mul

**Conclusion:** In multivariate IHT, most of the time (106/121 samples) is spent in the `score!` function, specifically in `adhoc_mul!` (i.e. computing the gradient, which requires multiplying the full genotype matrix by a dense matrix holding one residual vector per trait). This is expected.

# `SnpLinAlg` does NOT support matrix-matrix mul

Instead, it falls back to the generic routine in `LinearAlgebra`'s `matmul.jl`, which is much slower (compare the two timings below).

In [35]:
# matrix-vector
v = randn(p)
y = zeros(n)
@time mul!(y, xla, v);

  0.009754 seconds (42 allocations: 1.828 KiB)


In [49]:
# matrix-matrix
v = randn(p, 2)
y = zeros(n, 2)
@time mul!(y, xla, v);

  0.797749 seconds (6 allocations: 336 bytes)
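One way to confirm which method a call dispatches to is `which`. The sketch below does not use `SnpLinAlg`. It defines a hypothetical wrapper type, `VecOnly`, that (like the situation described above) has a specialized matrix-vector `mul!` but no matrix-matrix method, and then asks Julia which method each call would hit.

```julia
using LinearAlgebra

# Hypothetical stand-in for SnpLinAlg: an AbstractMatrix with a specialized
# matrix-vector product and no matrix-matrix method.
struct VecOnly{T} <: AbstractMatrix{T}
    data::Matrix{T}
end
Base.size(A::VecOnly) = size(A.data)
Base.getindex(A::VecOnly, i::Int, j::Int) = A.data[i, j]

# specialized matrix-vector product (here it simply forwards to the dense routine)
function LinearAlgebra.mul!(out::AbstractVector{T}, A::VecOnly{T},
                            v::AbstractVector{T}) where T <: AbstractFloat
    return mul!(out, A.data, v)
end

# ask the dispatcher which method handles each argument combination
m_vec = which(mul!, Tuple{Vector{Float64}, VecOnly{Float64}, Vector{Float64}})
m_mat = which(mul!, Tuple{Matrix{Float64}, VecOnly{Float64}, Matrix{Float64}})

println(m_vec)   # the VecOnly method defined above
println(m_mat)   # a generic method defined inside LinearAlgebra

# the generic fallback still gives the right answer, just via element-wise getindex
A = VecOnly(randn(100, 200))
V = randn(200, 2)
Y = zeros(100, 2)
mul!(Y, A, V)
```

Running the same `which` query with `typeof(xla)` in place of `VecOnly{Float64}` should show where the `SnpLinAlg` matrix-matrix call ends up.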


## Thus we build a matrix-matrix `mul!` on top of the matrix-vector `mul!`

In [48]:
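function mul_test!(
    out::AbstractMatrix{T}, 
    sla::SnpLinAlg{T}, 
    v::AbstractMatrix{T}) where T <: AbstractFloat
    # dimension check: out is n × r, sla is n × p, v is p × r
    @assert size(out, 1) == size(sla, 1) && size(out, 2) == size(v, 2) && size(sla, 2) == size(v, 1)
    # multiply one column at a time so every product uses the fast matrix-vector `mul!`
    for i in 1:size(v, 2)
        outi = @view(out[:, i])
        vi = @view(v[:, i])
        mul!(outi, sla, vi)
    end
    return out
end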

mul_test! (generic function with 1 method)

In [55]:
# original matrix-matrix product falls back to matmul.jl, which is slow
Random.seed!(2020)
v = randn(p, 2)
y = zeros(n, 2)
@time mul!(y, xla, v);

  0.790923 seconds (6 allocations: 336 bytes)


In [56]:
# new matrix-matrix calls mul! in SnpArrays.jl for each column, which is fast!
Random.seed!(2020)
v = randn(p, 2)
y2 = zeros(n, 2)
@time mul_test!(y2, xla, v);

  0.055297 seconds (86 allocations: 4.062 KiB)


In [58]:
# check correctness
all(y .≈ y2)

true
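The element-wise check passes here, but `.≈` applies a relative tolerance to each entry separately, so it can report `false` when an entry is very close to zero in one array and exactly zero in the other. A small sketch with made-up vectors (not the `y` and `y2` above) shows the difference between the element-wise and whole-array comparisons:

```julia
using LinearAlgebra

a = [1.0, 0.0]
b = [1.0, 1e-12]      # differs from `a` only by tiny round-off near zero

elementwise = all(a .≈ b)                       # each entry uses rtol only, atol = 0
whole_array = a ≈ b                             # compares norm(a - b) with the norms of a and b
with_atol   = all(isapprox.(a, b; atol=1e-8))   # element-wise with an absolute tolerance

println((elementwise, whole_array, with_atol))
```

For comparing two full result matrices, `y ≈ y2` or an explicit `atol` is usually the more forgiving choice.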

/Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.5/LinearAlgebra/src/matmul.jl:345, mul! [inlined]


# Profile numeric matrices

## Univariate case

In [17]:
n = 1000  # number of samples
p = 10000 # number of SNPs
k = 10    # number of causal SNPs per trait
d = Normal
l = canonicallink(d())

# set random seed for reproducibility
Random.seed!(2021)

# simulate a dense numeric design matrix (standard normal entries, no SnpArray)
x = randn(n, p)

# intercept is the only nongenetic covariate
z = ones(n)
intercept = 1.0

# simulate response y, true model b, and the correct non-0 positions of b
Y, true_b, correct_position = simulate_random_response(x, k, d, l, Zu=z*intercept);

# run IHT
Random.seed!(2020)
@time result = fit_iht(Y, x, z);
speed_per_iter = result.time / result.iter

****                   MendelIHT Version 1.4.0                  ****
****     Benjamin Chu, Kevin Keys, Chris German, Hua Zhou       ****
****   Jin Zhou, Eric Sobel, Janet Sinsheimer, Kenneth Lange    ****
****                                                            ****
****                 Please cite our paper!                     ****
****         https://doi.org/10.1093/gigascience/giaa044        ****

Running sparse linear regression
Link functin = IdentityLink()
Sparsity parameter (k) = 10
Prior weight scaling = off
Doubly sparse projection = off
Debias = off
Max IHT iterations = 200
Converging when tol < 0.0001:

Iteration 1: loglikelihood = -1496.713961529228, backtracks = 0, tol = 0.42407482809494373
Iteration 2: loglikelihood = -1440.396428956392, backtracks = 0, tol = 0.11860044142480099
Iteration 3: loglikelihood = -1437.2177092287425, backtracks = 0, tol = 0.02302770129518664
Iteration 4: loglikelihood = -1437.1722070780304, backtracks = 0, tol = 0.0027889369108425118

0.008611162503560385

## Multivariate case

In [46]:
n = 1000  # number of samples
p = 10000 # number of SNPs
k = 10    # number of causal SNPs per trait
r = 2     # number of traits

# set random seed for reproducibility
Random.seed!(2022)

# simulate a dense numeric genotype matrix with entries in {0, 1, 2}
x = rand(0.:2., n, p)

# intercept is the only nongenetic covariate
z = ones(n, 1)
intercepts = randn(r)' # each trait has a different intercept

# simulate response y, true model b, and the correct non-0 positions of b
Y, true_Σ, true_b, correct_position = simulate_random_response(x, k, r, Zu=z*intercepts, overlap=2);

# run IHT
Random.seed!(2020)
Yt = Matrix(Y')
Zt = Matrix(z')
Xt = Matrix(x')
@time result = fit_iht(Yt, Transpose(x), Zt, k = 20, max_iter=500);
speed_per_iter = result.time / result.iter

****                   MendelIHT Version 1.4.0                  ****
****     Benjamin Chu, Kevin Keys, Chris German, Hua Zhou       ****
****   Jin Zhou, Eric Sobel, Janet Sinsheimer, Kenneth Lange    ****
****                                                            ****
****                 Please cite our paper!                     ****
****         https://doi.org/10.1093/gigascience/giaa044        ****

Running sparse Multivariate Gaussian regression
Link functin = IdentityLink()
Sparsity parameter (k) = 20
Prior weight scaling = off
Doubly sparse projection = off
Debias = off
Max IHT iterations = 500
Converging when tol < 0.0001:

Iteration 1: loglikelihood = 191.73602048640578, backtracks = 0, tol = 0.2292371125117518
Iteration 2: loglikelihood = 762.9277376172406, backtracks = 0, tol = 0.021360037643612592
Iteration 3: loglikelihood = 884.2331365763932, backtracks = 0, tol = 0.02091394090715768
Iteration 4: loglikelihood = 971.5768695868555, backtracks = 0, tol = 0.025650042

0.024981919833070017

In [47]:
# first beta
β1 = result.beta[1, :]
true_b1_idx = findall(!iszero, true_b[:, 1])
[β1[true_b1_idx] true_b[true_b1_idx, 1]]

3×2 Array{Float64,2}:
 -1.32085   -1.50987
 -0.457323  -0.619427
 -1.80573   -1.84578

In [48]:
# second beta
β2 = result.beta[2, :]
true_b2_idx = findall(!iszero, true_b[:, 2])
[β2[true_b2_idx] true_b[true_b2_idx, 2]]

7×2 Array{Float64,2}:
 -0.257583  -0.308054
 -0.665576  -0.54334
  0.629238   0.586241
  0.0       -0.0183675
 -1.49299   -1.51484
 -0.329601  -0.360181
  2.04007    2.0501

In [49]:
# non genetic covariates
[result.c intercepts']

2×2 Array{Float64,2}:
 -0.326767   0.900301
  0.0       -0.151044

In [50]:
# covariance matrix
[vec(result.Σ) vec(true_Σ)]

4×2 Array{Float64,2}:
  1.73172   1.69161
 -1.62934  -1.57315
 -1.62934  -1.57315
  1.77269   1.69077
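A single number can summarize how close the estimated covariance is to the truth. The sketch below rebuilds both 2×2 matrices from the values printed above (`vec` is column-major) and computes the relative Frobenius error and the implied trait correlations. The variable names are chosen here for illustration.

```julia
using LinearAlgebra

# values copied from the output above (column 1 = estimate, column 2 = truth)
Σ_est  = [1.73172 -1.62934; -1.62934 1.77269]
Σ_true = [1.69161 -1.57315; -1.57315 1.69077]

# relative error in the Frobenius norm
rel_err = norm(Σ_est - Σ_true) / norm(Σ_true)

# correlation between the two traits implied by each covariance matrix
cor_est  = Σ_est[1, 2]  / sqrt(Σ_est[1, 1]  * Σ_est[2, 2])
cor_true = Σ_true[1, 2] / sqrt(Σ_true[1, 1] * Σ_true[2, 2])

println("relative error = ", rel_err)
println("estimated vs true correlation = ", (cor_est, cor_true))
```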