# Multivariate QuasiCopula GWAS with Mixed Marginals

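Here we adopt the variance component model framework, in which the covariance parameter $\mathbf{\Gamma}_i$ of sample $i$ is a nonnegative combination of $m$ known matrices $\mathbf{V}_{ik}$: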

$$\mathbf{\Gamma}_i(\mathbf{\theta}) = \sum_{k=1}^m \theta_k\mathbf{V}_{ik}, \quad \theta_k \ge 0$$
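
For example, the simulation later in this notebook uses $m = 2$ components shared across all samples, $\mathbf{V}_1 = \mathbf{1}\mathbf{1}^\top$ (`ones(d, d)`) and $\mathbf{V}_2 = \mathbf{I}_d$, with $d = 3$ and $\theta_1 = \theta_2 = 0.4$. Working through the sum gives

$$\mathbf{\Gamma} = 0.4\,\mathbf{1}\mathbf{1}^\top + 0.4\,\mathbf{I}_3 = \begin{pmatrix} 0.8 & 0.4 & 0.4 \\ 0.4 & 0.8 & 0.4 \\ 0.4 & 0.4 & 0.8 \end{pmatrix}$$

so $\theta_1$ sets the shared off-diagonal dependence between phenotypes, while $\theta_2$ adds only to the diagonal.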

In [1]:
using Revise
using DataFrames, Random, GLM, QuasiCopula
using ForwardDiff, Test, LinearAlgebra
using LinearAlgebra: BlasReal, copytri!
using ToeplitzMatrices
using BenchmarkTools
using SnpArrays
using Statistics

BLAS.set_num_threads(1)
Threads.nthreads()

function simulate_random_snparray(s::Union{String, UndefInitializer}, n::Int64,
    p::Int64; mafs::Vector{Float64}=zeros(Float64, p), min_ma::Int = 5)

    # simulate two allele bit matrices; genotype A1 + A2 of SNP j is Binomial(2, mafs[j])
    A1 = BitArray(undef, n, p)
    A2 = BitArray(undef, n, p)
    for j in 1:p
        minor_alleles = 0
        maf = 0.0
        # redraw SNP j until it carries more than min_ma minor alleles
        while minor_alleles <= min_ma
            maf = 0.5rand()
            for i in 1:n
                A1[i, j] = rand(Bernoulli(maf))
                A2[i, j] = rand(Bernoulli(maf))
            end
            minor_alleles = sum(view(A1, :, j)) + sum(view(A2, :, j))
        end
        mafs[j] = maf
    end

    # pack the simulated genotypes into a SnpArray
    return _make_snparray(s, A1, A2)
end

function _make_snparray(s::Union{String, UndefInitializer}, A1::BitArray, A2::BitArray)
    n, p = size(A1)
    x = SnpArray(s, n, p)
    for i in 1:(n*p)
        c = A1[i] + A2[i]
        if c == 0
            x[i] = 0x00
        elseif c == 1
            x[i] = 0x02
        elseif c == 2
            x[i] = 0x03
        else
            throw(MissingException("matrix shouldn't have missing values!"))
        end
    end
    return x
end




_make_snparray (generic function with 1 method)

## Simulate data

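Given $n$ independent samples, we simulate each $d$-dimensional phenotype vector $\mathbf{y}_i$ from a quasi-copula (QC) distribution with covariance parameter $\mathbf{\Gamma}$ and marginal densities $f_1, \ldots, f_d$:
$$\mathbf{y}_i \sim \text{QC}(\mathbf{\Gamma}, f_1, \ldots, f_d)$$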
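
As a sketch of what this notation stands for (our reading of the quasi-copula literature, not something stated in this notebook), the QC density reweights the product of the marginals by a quadratic form in standardized residuals. Here $\mu_{ij}$ and $\sigma_{ij}$ denote the mean and standard deviation of marginal $f_j$ for sample $i$, and $\mathbf{r}_i$ is notation introduced only for this illustration:

$$p(\mathbf{y}_i) = \left[1 + \tfrac{1}{2}\operatorname{tr}(\mathbf{\Gamma})\right]^{-1} \prod_{j=1}^d f_j(y_{ij}) \left[1 + \tfrac{1}{2}\mathbf{r}_i^\top \mathbf{\Gamma}\, \mathbf{r}_i\right], \quad r_{ij} = \frac{y_{ij} - \mu_{ij}}{\sigma_{ij}}$$

In the simulation code that follows, each marginal mean is obtained by applying the inverse canonical link $h_j^{-1}$ to a linear predictor, mirroring the line `η = Btrue' * Xi + γtrue' * Gi`:

$$\boldsymbol{\eta}_i = \mathbf{B}^\top \mathbf{x}_i + \boldsymbol{\gamma}^\top \mathbf{g}_i, \quad \mu_{ij} = h_j^{-1}(\eta_{ij})$$

where $h_j$ is the log link for Poisson, the logit link for Bernoulli, and the identity link for Normal marginals.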

In [153]:
# simulate data
p = 5    # number of fixed effects, including intercept
m = 2    # number of variance components
n = 5000 # number of samples
d = 3    # number of phenotypes per sample
q = 1000 # number of SNPs
k = 0    # number of causal SNPs

# sample one marginal distribution for each of the d phenotypes
Random.seed!(1234)
possible_distributions = [Bernoulli, Poisson, Normal]
vecdist = rand(possible_distributions, d)
# vecdist = [Poisson, Bernoulli, Bernoulli] # this derivative test is fine
# vecdist = [Bernoulli, Bernoulli, Poisson] # this derivative test is wrong everywhere
veclink = [canonicallink(vecdist[j]()) for j in 1:d]

# simulate nongenetic coefficient and variance component params
Random.seed!(2022)
Btrue = rand(Uniform(-0.5, 0.5), p, d)
θtrue = fill(0.4, m)
V1 = ones(d, d)
V2 = Matrix(I, d, d)
Γ = m == 1 ? θtrue[1] * V1 : θtrue[1] * V1 + θtrue[2] * V2

# simulate non-genetic design matrix
Random.seed!(2022)
X = [ones(n) randn(n, p - 1)]

# simulate random SnpArray with q SNPs and randomly choose k SNPs to be causal
Random.seed!(2022)
G = simulate_random_snparray(undef, n, q)
Gfloat = convert(Matrix{Float64}, G, center=true, scale=true)
γtrue = zeros(q, d)
causal_snps = sample(1:q, k, replace=false) |> sort
for j in 1:d
    γtrue[causal_snps, j] .= rand([-1, 1], k)
end

# sample phenotypes
Y = zeros(n, d)
y = Vector{Float64}(undef, d)
for i in 1:n
    Xi = X[i, :]
    Gi = Gfloat[i, :]
    η = Btrue' * Xi + γtrue' * Gi
    vecd_tmp = Vector{UnivariateDistribution}(undef, d)
    for j in 1:d
        dist = vecdist[j]
        μj = GLM.linkinv(canonicallink(dist()), η[j])
        vecd_tmp[j] = dist(μj)
    end
    multivariate_dist = MultivariateMix(vecd_tmp, Γ)
    res = Vector{Float64}(undef, d)
    rand(multivariate_dist, y, res)
    Y[i, :] .= y
end

# form model
V = m == 1 ? [V1] : [V1, V2]
qc_model = MultivariateCopulaVCModel(Y, X, V, vecdist, veclink);

In [154]:
X

5000×5 Matrix{Float64}:
 1.0  -0.308648    1.70162   -0.89417      0.780388
 1.0   1.67671    -0.548034  -1.49492      0.252805
 1.0  -0.347153    0.736227   1.17736     -1.43366
 1.0   0.818666   -2.16009    0.0765732   -2.35217
 1.0  -1.71753    -0.273745  -1.47406      0.718669
 1.0  -0.238934    0.942883   0.0937242    0.936967
 1.0   0.701932    1.02868   -2.34081     -1.18712
 1.0  -0.166138   -0.278824   1.23083     -0.157013
 1.0  -0.609614    0.289359   0.86016      0.480996
 1.0   0.68791     0.209478   0.126272     1.35729
 1.0   0.0342303  -0.543192  -1.08445     -0.826963
 1.0  -0.479078   -0.865401  -0.761236    -0.032467
 1.0  -1.63537     0.348029  -0.0360208   -0.731861
 ⋮                                        
 1.0   0.200555    1.14607   -0.504242     0.167705
 1.0  -0.205806    1.98172    0.684624     1.33652
 1.0   1.17812     0.307879  -0.859732     0.578845
 1.0   1.60549     0.817788   2.07335     -1.95796
 1.0   1.63509    -0.960082   0.722145     0.021941
 1.

In [155]:
vecdist

3-element Vector{UnionAll}:
 Poisson
 Bernoulli
 Bernoulli

In [156]:
Y

5000×3 Matrix{Float64}:
 1.0  1.0  0.0
 2.0  1.0  0.0
 0.0  1.0  0.0
 0.0  1.0  0.0
 0.0  0.0  0.0
 1.0  0.0  1.0
 0.0  0.0  0.0
 3.0  0.0  0.0
 1.0  1.0  1.0
 2.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  1.0
 0.0  1.0  0.0
 ⋮         
 1.0  0.0  0.0
 2.0  1.0  0.0
 1.0  1.0  1.0
 1.0  0.0  1.0
 0.0  0.0  0.0
 1.0  0.0  0.0
 0.0  0.0  0.0
 3.0  0.0  1.0
 4.0  0.0  1.0
 0.0  0.0  1.0
 2.0  1.0  1.0
 0.0  0.0  0.0

In [157]:
Statistics.cor(Y)

3×3 Matrix{Float64}:
 1.0       0.140945  0.12974
 0.140945  1.0       0.143133
 0.12974   0.143133  1.0

## Fit null model

TODO: 

+ Initializing model parameters

In [158]:
# short run (max_iter = 4) with Ipopt's first-order derivative checker enabled
@time optm = QuasiCopula.fit!(qc_model,
    Ipopt.IpoptSolver(
        print_level = 5,
        tol = 1e-6,
        max_iter = 4,
        accept_after_max_steps = 10,
        warm_start_init_point = "yes",
        limited_memory_max_history = 6, # default value
        hessian_approximation = "limited-memory",
        derivative_test = "first-order"
    )
);

This is Ipopt version 3.13.4, running with linear solver mumps.
NOTE: Other linear solvers might be more efficient (see Ipopt documentation).

Starting derivative checker for first derivatives.


No errors detected by derivative checker.

Number of nonzeros in equality constraint Jacobian...:        0
Number of nonzeros in inequality constraint Jacobian.:        0
Number of nonzeros in Lagrangian Hessian.............:        0

Total number of variables............................:       17
                     variables with only lower bounds:        2
                variables with lower and upper bounds:        0
                     variables with only upper bounds:        0
Total number of equality constraints.................:        0
Total number of inequality constraints...............:        0
        inequality constraints with only lower bounds:        0
   inequality constraints with lower and upper bounds:        0
        inequality constraints with only upper bounds:  

└ @ QuasiCopula /Users/biona001/.julia/dev/QuasiCopula/src/gwas/multivariate.jl:155


In [159]:
# full run (max_iter = 200) with the derivative checker disabled
@time optm = QuasiCopula.fit!(qc_model,
    Ipopt.IpoptSolver(
        print_level = 5,
        tol = 1e-6,
        max_iter = 200,
        accept_after_max_steps = 10,
        warm_start_init_point = "yes",
        limited_memory_max_history = 6, # default value
        hessian_approximation = "limited-memory",
        # derivative_test = "first-order"
    )
);

This is Ipopt version 3.13.4, running with linear solver mumps.
NOTE: Other linear solvers might be more efficient (see Ipopt documentation).

Number of nonzeros in equality constraint Jacobian...:        0
Number of nonzeros in inequality constraint Jacobian.:        0
Number of nonzeros in Lagrangian Hessian.............:        0

Total number of variables............................:       17
                     variables with only lower bounds:        2
                variables with lower and upper bounds:        0
                     variables with only upper bounds:        0
Total number of equality constraints.................:        0
Total number of inequality constraints...............:        0
        inequality constraints with only lower bounds:        0
   inequality constraints with lower and upper bounds:        0
        inequality constraints with only upper bounds:        0

iter    objective    inf_pr   inf_du lg(mu)  ||d||  lg(rg) alpha_du alpha_pr  ls
   0  

└ @ QuasiCopula /Users/biona001/.julia/dev/QuasiCopula/src/gwas/multivariate.jl:155


In [160]:
@show qc_model.∇vecB
@show qc_model.∇θ;

qc_model.∇vecB = [-0.001397199822637195, -0.0007059656090901206, -0.0015525249093907587, 0.0004933656782029886, 0.0016486679130583365, -0.0013783118827231707, -0.0006964220665060016, -0.0015315372190629357, 0.000486696151672452, 0.0016263805208217349, -0.0008052444726535313, -0.0004068672894119861, -0.000894762568452213, 0.00028434013441259754, 0.0009501724110769414]
qc_model.∇θ = [-2.992042077668428, -2.995547361520035]


In [161]:
# estimated B (left column) vs. true B (right column)
[vec(qc_model.B) vec(Btrue)]

15×2 Matrix{Float64}:
 -1.81328   -0.280755
 -0.916199   0.167603
 -2.01486   -0.330996
  0.640288  -0.0469412
  2.13963    0.483384
 -1.8165    -0.269155
 -0.917828   0.144784
 -2.01844   -0.39274
  0.641426   0.0260867
  2.14344   -0.234534
 -1.95067   -0.0969734
 -0.985621   0.0955352
 -2.16753    0.0881894
  0.688803  -0.433561
  2.30176    0.250369

In [162]:
# estimated θ (left column) vs. true θ (right column)
[qc_model.θ θtrue]

2×2 Matrix{Float64}:
 0.000295881  0.4
 0.000295883  0.4

## Score test