# Poisson(1) with AR(1) rho = 0.9, sigma2 = 1.0

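This notebook uses a Poisson base distribution with $\lambda = 1$ and an AR(1) covariance structure with $\rho = 0.9$ and $\sigma^2 = 1.0$.

Under the quasi-copula (QC) model, the theoretical correlation between adjacent observations is a function of the cluster size $d_i$, $\rho$, $\sigma^2$, and the kurtosis $kt$ of the base distribution: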

$$\operatorname{Corr}(Y_1, Y_2)
 = \frac{\rho \sigma^2}{1 + \frac{1}{2} \sigma^2 \, kt + \frac{1}{2} \sigma^2 (d_i - 1)}$$

With $\lambda = 1$, the kurtosis of the Poisson base is $kt = kt(\lambda) = 3 + 1/\lambda = 4.0$.

Let's see what happens to the theoretical and empirical correlations when we fix $\rho = 0.9, \sigma^2 = 1.0$ and vary the cluster size $d_i$:

$$\operatorname{Corr}(Y_1, Y_2)
 = \frac{0.9}{1 + \frac{1}{2} \cdot 4.0 + \frac{1}{2} (d_i - 1)} = \frac{0.9}{1 + \frac{1}{2} (d_i + 3)}$$
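
As a quick numeric check of the expression above, the sketch below evaluates it at the four cluster sizes used in this notebook. The helper name `qc_theoretical_cor` is hypothetical and introduced only for illustration; it is not part of GLMCopula.

```julia
# Hypothetical helper (not part of GLMCopula): evaluates the theoretical
# quasi-copula correlation formula stated above.
function qc_theoretical_cor(ρ, σ2, kt, di)
    return (ρ * σ2) / (1 + 0.5 * σ2 * kt + 0.5 * σ2 * (di - 1))
end

cluster_sizes = [2, 5, 10, 25]
theoretical = [qc_theoretical_cor(0.9, 1.0, 4.0, di) for di in cluster_sizes]

# Print one line per cluster size
for (di, r) in zip(cluster_sizes, theoretical)
    println("di = ", di, "  theoretical correlation = ", round(r, digits = 4))
end
```

The values shrink as $d_i$ grows, which is the dependence on cluster size examined in the sections that follow.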

# We will show:
1. When simulating under the quasi-copula (QC) model, the theoretical and empirical correlation IS a function of the cluster size $d_i$.

2. When simulating under the GLMM, the theoretical and empirical correlation is NOT a function of the cluster size $d_i$.


## TOC:

# di = 2
* [AR(1) di = 2](#ex0)
* [simulate under GLMM AR(1) di = 2](#ex0g)

# di = 5
* [AR(1) di = 5](#ex1)
* [simulate under GLMM AR(1) di = 5](#ex1g)

# di = 10 
* [AR(1) di = 10](#ex2)
* [simulate under GLMM AR(1) di = 10](#ex2g)

# di = 25
* [AR(1) di = 25](#ex3)
* [simulate under GLMM AR(1) di = 25](#ex3g)

# Comparisons
* [Theoretical vs. Empirical Correlation Simulated under QC](#ex4)
* [Empirical Correlation Simulated under GLMM](#ex4g)

In [1]:
using GLMCopula, DelimitedFiles, LinearAlgebra, Random, GLM, MixedModels, CategoricalArrays
using Roots, SpecialFunctions, StatsBase
using DataFrames, Statistics, ToeplitzMatrices
import StatsBase: sem

In [2]:
# Build the n×n AR(1) correlation matrix with entries ρ^|i - j| as a symmetric Toeplitz matrix.
function get_V_AR(ρ, n)
    vec = zeros(n)
    vec[1] = 1.0
    for i in 2:n
        vec[i] = vec[i - 1] * ρ
    end
    V = ToeplitzMatrices.SymmetricToeplitz(vec)
    V
end

get_V_AR (generic function with 1 method)
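
For readers without ToeplitzMatrices installed, the same AR(1) structure can be sketched directly from its definition, where entry $(i, j)$ equals $\rho^{|i - j|}$. The name `ar1_dense` is hypothetical and used only for this illustration; it returns a dense matrix rather than a `SymmetricToeplitz` object.

```julia
# Dense AR(1) correlation matrix built from the definition ρ^|i - j|.
# `ar1_dense` is a hypothetical name, not a GLMCopula or ToeplitzMatrices function.
ar1_dense(ρ, n) = [ρ^abs(i - j) for i in 1:n, j in 1:n]

V5 = ar1_dense(0.9, 5)

# First row holds the powers 1, ρ, ρ², ρ³, ρ⁴
println(V5[1, :])
```

The first row should agree with the first row of the $d_i = 5$ matrix printed later in this notebook.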

In [3]:
# true parameter values
lambda = [1.0]
βtrue = [log(lambda[1])]
σ2true = [1.0]
ρtrue = [0.9]

samplesize = 100000 # number of sampling units

d = Poisson()
link = LogLink()
D = typeof(d)
Link = typeof(link)
T = Float64

Float64

# Kurtosis of each Poisson(1) base distribution is 4.0

With $\lambda = 1$, the kurtosis is $kt = kt(\lambda) = 4.0$, as the next cell confirms.

Let's see what happens to the theoretical and empirical correlations when we fix $\rho = 0.9, \sigma^2 = 1.0, kt = 4.0$ and vary the cluster size $d_i$.

In [4]:
d = Poisson(lambda[1])
# kurtosis(d, false) returns the non-excess kurtosis (excess kurtosis + 3)
μ, σ², sk, kt = mean(d), var(d), skewness(d), kurtosis(d, false)

(1.0, 1.0, 1.0, 4.0)
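
The tuple above matches the closed-form Poisson moments: mean $\lambda$, variance $\lambda$, skewness $1/\sqrt{\lambda}$, and non-excess kurtosis $3 + 1/\lambda$. A minimal sketch that needs no packages follows; `poisson_moments` is a hypothetical helper name.

```julia
# Closed-form moments of Poisson(λ). `poisson_moments` is a hypothetical helper.
# The last field is the non-excess kurtosis, 3 + 1/λ.
poisson_moments(λ) = (mean = λ, var = λ, skewness = 1 / sqrt(λ), kurtosis = 3 + 1 / λ)

m = poisson_moments(1.0)
println(m)
```

Because the kurtosis depends on $\lambda$, a different base mean would change $kt$ and therefore the theoretical quasi-copula correlation.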

## AR(1) di = 2 <a class="anchor" id="ex0"></a>

$d_i = 2$

$\lambda = 1, \rho = 0.9, \sigma^2 = 1.0$

In [5]:
di = 2 # number of observations per cluster
V_AR = get_V_AR(ρtrue[1], di)

# true Gamma
Γ_AR = σ2true[1] * V_AR

2×2 Matrix{Float64}:
 1.0  0.9
 0.9  1.0

In [6]:
vecd = Vector{DiscreteUnivariateDistribution}(undef, di)
for i in 1:di
    vecd[i] = Poisson(lambda[1])
end
nonmixed_multivariate_dist = NonMixedMultivariateDistribution(vecd, Γ_AR)

Random.seed!(12345)
@time Y_nsample = simulate_nobs_independent_vectors(nonmixed_multivariate_dist, samplesize)

gcs = Vector{GLMCopulaARObs{T, D, Link}}(undef, samplesize)
for i in 1:samplesize
    X = ones(di, 1)
    y = Float64.(Y_nsample[i])
    gcs[i] = GLMCopulaARObs(y, X, d, link)
end

# form model
gcm = GLMCopulaARModel(gcs);

N = length(gcm.data)
Y_AR = zeros(N, di)
for j in 1:di
    Y_AR[:, j] = [gcm.data[i].y[j] for i in 1:N]
end
empirical_cor_2 = StatsBase.cor(Y_AR)

  1.227417 seconds (5.64 M allocations: 517.991 MiB, 11.89% gc time, 53.20% compilation time)


2×2 Matrix{Float64}:
 1.0       0.230281
 0.230281  1.0

In [7]:
empirical_cor_2[1, 2]

0.23028050600548225

In [8]:
theoretical_rho_2_kurtosis = (ρtrue[1] * σ2true[1]) / (1 + ((di/2) * σ2true[1]) + (0.5 * (kt - 1) * σ2true[1]))

0.2571428571428572

## simulate under GLMM AR(1) di = 2 <a class="anchor" id="ex0g"></a>

$d_i = 2$

$\lambda = 1, \rho = 0.9, \sigma^2 = 1.0$

In [9]:
function __get_distribution(dist::Type{D}, μ) where D <: UnivariateDistribution
    return dist(μ)
end

for i in 1:samplesize
    X = ones(di, 1)
    η = X * βtrue
    # generate mvn response
    mvn_d = MvNormal(η, Γ_AR)
    mvn_η = rand(mvn_d)
    μ = GLM.linkinv.(link, mvn_η)
    y = Float64.(rand.(__get_distribution.(D, μ)))
    gcs[i] = GLMCopulaARObs(y, X, d, link)
end

# form model
gcm = GLMCopulaARModel(gcs);

N = length(gcm.data)
Y_AR = zeros(N, di)
for j in 1:di
    Y_AR[:, j] = [gcm.data[i].y[j] for i in 1:N]
end
empirical_cor_2_GLMM = StatsBase.cor(Y_AR)

2×2 Matrix{Float64}:
 1.0       0.631196
 0.631196  1.0

## AR(1) di = 5 <a class="anchor" id="ex1"></a>

$d_i = 5$

$\lambda = 1, \rho = 0.9, \sigma^2 = 1.0$

In [10]:
di = 5 # number of observations per cluster
V_AR = get_V_AR(ρtrue[1], di)

# true Gamma
Γ_AR = σ2true[1] * V_AR

5×5 Matrix{Float64}:
 1.0     0.9    0.81  0.729  0.6561
 0.9     1.0    0.9   0.81   0.729
 0.81    0.9    1.0   0.9    0.81
 0.729   0.81   0.9   1.0    0.9
 0.6561  0.729  0.81  0.9    1.0

In [11]:
vecd = Vector{DiscreteUnivariateDistribution}(undef, di)
for i in 1:di
    vecd[i] = Poisson(lambda[1])
end
nonmixed_multivariate_dist = NonMixedMultivariateDistribution(vecd, Γ_AR)

Random.seed!(12345)
@time Y_nsample = simulate_nobs_independent_vectors(nonmixed_multivariate_dist, samplesize)

gcs = Vector{GLMCopulaARObs{T, D, Link}}(undef, samplesize)
for i in 1:samplesize
    X = ones(di, 1)
    y = Float64.(Y_nsample[i])
    gcs[i] = GLMCopulaARObs(y, X, d, link)
end

# form model
gcm = GLMCopulaARModel(gcs);

N = length(gcm.data)
Y_AR = zeros(N, di)
for j in 1:di
    Y_AR[:, j] = [gcm.data[i].y[j] for i in 1:N]
end
empirical_cor_5 = StatsBase.cor(Y_AR)

  2.079530 seconds (9.00 M allocations: 1.031 GiB, 29.32% gc time)


5×5 Matrix{Float64}:
 1.0       0.167854  0.14656   0.134947  0.116821
 0.167854  1.0       0.15873   0.148804  0.134362
 0.14656   0.15873   1.0       0.167016  0.154419
 0.134947  0.148804  0.167016  1.0       0.16876
 0.116821  0.134362  0.154419  0.16876   1.0

In [12]:
empirical_cor_5[1, 2]

0.16785355920857664

In [13]:
theoretical_rho_5_kurtosis = (ρtrue[1] * σ2true[1]) / (1 + ((di/2) * σ2true[1]) + (0.5 * (kt - 1) * σ2true[1]))

0.18

## simulate under GLMM AR(1) di = 5 <a class="anchor" id="ex1g"></a>

$d_i = 5$

$\lambda = 1, \rho = 0.9, \sigma^2 = 1.0$

In [14]:
for i in 1:samplesize
    X = ones(di, 1)
    η = X * βtrue
    # generate mvn response
    mvn_d = MvNormal(η, Γ_AR)
    mvn_η = rand(mvn_d)
    μ = GLM.linkinv.(link, mvn_η)
    y = Float64.(rand.(__get_distribution.(D, μ)))
    gcs[i] = GLMCopulaARObs(y, X, d, link)
end

# form model
gcm = GLMCopulaARModel(gcs);

N = length(gcm.data)
Y_AR = zeros(N, di)
for j in 1:di
    Y_AR[:, j] = [gcm.data[i].y[j] for i in 1:N]
end
empirical_cor_5_GLMM = StatsBase.cor(Y_AR)

5×5 Matrix{Float64}:
 1.0       0.62035   0.541901  0.466411  0.407169
 0.62035   1.0       0.641879  0.543301  0.483319
 0.541901  0.641879  1.0       0.63027   0.546723
 0.466411  0.543301  0.63027   1.0       0.63731
 0.407169  0.483319  0.546723  0.63731   1.0

## AR(1) di = 10 <a class="anchor" id="ex2"></a>

$d_i = 10$

$\lambda = 1, \rho = 0.9, \sigma^2 = 1.0$

In [15]:
di = 10 # number of observations per cluster
V_AR = get_V_AR(ρtrue[1], di)

# true Gamma
Γ_AR = σ2true[1] * V_AR

10×10 Matrix{Float64}:
 1.0       0.9       0.81      0.729     …  0.478297  0.430467  0.38742
 0.9       1.0       0.9       0.81         0.531441  0.478297  0.430467
 0.81      0.9       1.0       0.9          0.59049   0.531441  0.478297
 0.729     0.81      0.9       1.0          0.6561    0.59049   0.531441
 0.6561    0.729     0.81      0.9          0.729     0.6561    0.59049
 0.59049   0.6561    0.729     0.81      …  0.81      0.729     0.6561
 0.531441  0.59049   0.6561    0.729        0.9       0.81      0.729
 0.478297  0.531441  0.59049   0.6561       1.0       0.9       0.81
 0.430467  0.478297  0.531441  0.59049      0.9       1.0       0.9
 0.38742   0.430467  0.478297  0.531441     0.81      0.9       1.0

In [16]:
vecd = Vector{DiscreteUnivariateDistribution}(undef, di)
for i in 1:di
    vecd[i] = Poisson(lambda[1])
end
nonmixed_multivariate_dist = NonMixedMultivariateDistribution(vecd, Γ_AR)

Random.seed!(12345)
@time Y_nsample = simulate_nobs_independent_vectors(nonmixed_multivariate_dist, samplesize)

gcs = Vector{GLMCopulaARObs{T, D, Link}}(undef, samplesize)
for i in 1:samplesize
    X = ones(di, 1)
    y = Float64.(Y_nsample[i])
    gcs[i] = GLMCopulaARObs(y, X, d, link)
end

# form model
gcm = GLMCopulaARModel(gcs);

N = length(gcm.data)
Y_AR = zeros(N, di)
for j in 1:di
    Y_AR[:, j] = [gcm.data[i].y[j] for i in 1:N]
end
empirical_cor_10 = StatsBase.cor(Y_AR)

  3.940578 seconds (17.50 M allocations: 2.481 GiB, 26.05% gc time)


10×10 Matrix{Float64}:
 1.0        0.119132   0.0978651  …  0.0552739  0.0560218  0.0372989
 0.119132   1.0        0.109722      0.0622919  0.0526199  0.0504044
 0.0978651  0.109722   1.0           0.071372   0.0659333  0.0612759
 0.0970557  0.100328   0.11962       0.080774   0.0718045  0.0615755
 0.0822284  0.091394   0.10549       0.0909923  0.0845213  0.0754962
 0.0712477  0.0811701  0.0937914  …  0.104688   0.0905186  0.0814601
 0.0615431  0.0723195  0.0827933     0.106277   0.103138   0.0987854
 0.0552739  0.0622919  0.071372      1.0        0.111207   0.105609
 0.0560218  0.0526199  0.0659333     0.111207   1.0        0.113012
 0.0372989  0.0504044  0.0612759     0.105609   0.113012   1.0

In [17]:
empirical_cor_10[1, 2]

0.11913180856400937

In [18]:
theoretical_rho_10_kurtosis = (ρtrue[1] * σ2true[1]) / (1 + ((di/2) * σ2true[1]) + (0.5 * (kt - 1) * σ2true[1]))

0.12000000000000001

## simulate under GLMM AR(1) di = 10 <a class="anchor" id="ex2g"></a>

$d_i = 10$

$\lambda = 1, \rho = 0.9, \sigma^2 = 1.0$

In [19]:
for i in 1:samplesize
    X = ones(di, 1)
    η = X * βtrue
    # generate mvn response
    mvn_d = MvNormal(η, Γ_AR)
    mvn_η = rand(mvn_d)
    μ = GLM.linkinv.(link, mvn_η)
    y = Float64.(rand.(__get_distribution.(D, μ)))
    gcs[i] = GLMCopulaARObs(y, X, d, link)
end

# form model
gcm = GLMCopulaARModel(gcs);

N = length(gcm.data)
Y_AR = zeros(N, di)
for j in 1:di
    Y_AR[:, j] = [gcm.data[i].y[j] for i in 1:N]
end
empirical_cor_10_GLMM = StatsBase.cor(Y_AR)

10×10 Matrix{Float64}:
 1.0       0.614001  0.531542  0.453661  …  0.271462  0.235331  0.201705
 0.614001  1.0       0.613227  0.525969     0.306441  0.269341  0.228206
 0.531542  0.613227  1.0       0.626663     0.349248  0.303113  0.259055
 0.453661  0.525969  0.626663  1.0          0.402216  0.349634  0.303411
 0.391184  0.447306  0.527552  0.616197     0.464395  0.401158  0.344439
 0.339693  0.38934   0.447763  0.525772  …  0.532263  0.457397  0.394042
 0.301549  0.340743  0.394943  0.456062     0.62656   0.53093   0.457262
 0.271462  0.306441  0.349248  0.402216     1.0       0.622138  0.537588
 0.235331  0.269341  0.303113  0.349634     0.622138  1.0       0.623448
 0.201705  0.228206  0.259055  0.303411     0.537588  0.623448  1.0

## AR(1) di = 25 <a class="anchor" id="ex3"></a>

$d_i = 25$

$\lambda = 1, \rho = 0.9, \sigma^2 = 1.0$

In [20]:
di = 25 # number of observations per cluster
V_AR = get_V_AR(ρtrue[1], di)

# true Gamma
Γ_AR = σ2true[1] * V_AR

25×25 Matrix{Float64}:
 1.0        0.9        0.81       …  0.0984771  0.0886294  0.0797664
 0.9        1.0        0.9           0.109419   0.0984771  0.0886294
 0.81       0.9        1.0           0.121577   0.109419   0.0984771
 0.729      0.81       0.9           0.135085   0.121577   0.109419
 0.6561     0.729      0.81          0.150095   0.135085   0.121577
 0.59049    0.6561     0.729      …  0.166772   0.150095   0.135085
 0.531441   0.59049    0.6561        0.185302   0.166772   0.150095
 0.478297   0.531441   0.59049       0.205891   0.185302   0.166772
 0.430467   0.478297   0.531441      0.228768   0.205891   0.185302
 0.38742    0.430467   0.478297      0.254187   0.228768   0.205891
 0.348678   0.38742    0.430467   …  0.28243    0.254187   0.228768
 0.313811   0.348678   0.38742       0.313811   0.28243    0.254187
 0.28243    0.313811   0.348678      0.348678   0.313811   0.28243
 0.254187   0.28243    0.313811      0.38742    0.348678   0.313811
 0.228768   0.254187   

In [21]:
vecd = Vector{DiscreteUnivariateDistribution}(undef, di)
for i in 1:di
    vecd[i] = Poisson(lambda[1])
end
nonmixed_multivariate_dist = NonMixedMultivariateDistribution(vecd, Γ_AR)

Random.seed!(12345)
@time Y_nsample = simulate_nobs_independent_vectors(nonmixed_multivariate_dist, samplesize)

gcs = Vector{GLMCopulaARObs{T, D, Link}}(undef, samplesize)
for i in 1:samplesize
    X = ones(di, 1)
    y = Float64.(Y_nsample[i])
    gcs[i] = GLMCopulaARObs(y, X, d, link)
end

# form model
gcm = GLMCopulaARModel(gcs);

N = length(gcm.data)
Y_AR = zeros(N, di)
for j in 1:di
    Y_AR[:, j] = [gcm.data[i].y[j] for i in 1:N]
end
empirical_cor_25 = StatsBase.cor(Y_AR)

 13.100701 seconds (43.00 M allocations: 13.301 GiB, 24.14% gc time)


25×25 Matrix{Float64}:
 1.0          0.0580281    0.0501365   …  0.00543746  0.00723569  0.0104024
 0.0580281    1.0          0.0553744      0.00630731  0.00103864  0.00665421
 0.0501365    0.0553744    1.0            0.00505777  0.00390679  0.00933252
 0.0441159    0.0544735    0.0542332      0.00410151  0.00604732  0.00673261
 0.0473615    0.0496608    0.0489248      0.0113963   0.00795998  0.00666845
 0.0398135    0.041681     0.0484749   …  0.00973893  0.00841461  0.00594014
 0.0308896    0.0399504    0.04008        0.00965495  0.00858988  0.00764117
 0.0300046    0.0337341    0.0368247      0.0149883   0.0111576   0.00766259
 0.0283822    0.0328022    0.0269309      0.0145076   0.0130394   0.00920751
 0.022914     0.0238827    0.0274403      0.0116083   0.0128386   0.00755216
 0.0199885    0.0241264    0.0271682   …  0.0175401   0.0172671   0.00995768
 0.0239559    0.021748     0.026265       0.0201211   0.0187431   0.0128662
 0.0158394    0.0210431    0.0198442      0.021745    0

In [22]:
empirical_cor_25[1, 2]

0.05802805633381166

In [23]:
theoretical_rho_25_kurtosis = (ρtrue[1] * σ2true[1]) / (1 + ((di/2) * σ2true[1]) + (0.5 * (kt - 1) * σ2true[1]))

0.060000000000000005

## simulate under GLMM AR(1) di = 25 <a class="anchor" id="ex3g"></a>

$d_i = 25$

$\lambda = 1, \rho = 0.9, \sigma^2 = 1.0$

In [24]:
for i in 1:samplesize
    X = ones(di, 1)
    η = X * βtrue
    # generate mvn response
    mvn_d = MvNormal(η, Γ_AR)
    mvn_η = rand(mvn_d)
    μ = GLM.linkinv.(link, mvn_η)
    y = Float64.(rand.(__get_distribution.(D, μ)))
    gcs[i] = GLMCopulaARObs(y, X, d, link)
end

# form model
gcm = GLMCopulaARModel(gcs);

N = length(gcm.data)
Y_AR = zeros(N, di)
for j in 1:di
    Y_AR[:, j] = [gcm.data[i].y[j] for i in 1:N]
end
empirical_cor_25_GLMM = StatsBase.cor(Y_AR)

25×25 Matrix{Float64}:
 1.0        0.638996   0.541917   …  0.0447875  0.0366016  0.0314837
 0.638996   1.0        0.624209      0.0549801  0.0464837  0.0424987
 0.541917   0.624209   1.0           0.0548328  0.0493031  0.0471143
 0.464366   0.536523   0.632274      0.068452   0.0573659  0.0540509
 0.401577   0.455418   0.533145      0.0701412  0.063146   0.0590902
 0.349568   0.404708   0.461767   …  0.0809532  0.0762609  0.0687859
 0.303511   0.357182   0.399666      0.085409   0.0795987  0.0761418
 0.274626   0.312835   0.349394      0.0934925  0.085116   0.0801193
 0.226635   0.26972    0.296422      0.104672   0.0958599  0.0849924
 0.196293   0.232976   0.259376      0.119929   0.10745    0.098592
 0.178106   0.206731   0.227978   …  0.133437   0.121569   0.110863
 0.155319   0.182124   0.200935      0.150629   0.134537   0.124199
 0.142219   0.166962   0.182324      0.174125   0.152411   0.138409
 0.117149   0.140712   0.15339       0.197999   0.178251   0.163024
 0.10154    0.12

## Comparisons 

##  1. Theoretical vs. Empirical Correlation simulated under QC <a class="anchor" id="ex4"></a>

### Takeaway: Quasi-copula correlation IS a function of $d_i$, $kt$, $\rho$, and $\sigma^2$

The theoretical correlation is a function of $d_i, \rho, \sigma^2, kt(\lambda)$:

$$\operatorname{Corr}(Y_1, Y_2)
 = \frac{\rho \sigma^2}{1 + \frac{1}{2} \sigma^2 \, kt + \frac{1}{2} \sigma^2 (d_i - 1)}$$

With $\lambda = 1$, the kurtosis is $kt = kt(\lambda) = 4.0$.

Fixing $\rho = 0.9, \sigma^2 = 1.0$ and varying the cluster size $d_i$, the theoretical correlation becomes

$$\operatorname{Corr}(Y_1, Y_2)
 = \frac{0.9}{1 + \frac{1}{2} \cdot 4.0 + \frac{1}{2} (d_i - 1)} = \frac{0.9}{1 + \frac{1}{2} (d_i + 3)}$$


In [25]:
# di = 2
[theoretical_rho_2_kurtosis empirical_cor_2[1, 2]]

1×2 Matrix{Float64}:
 0.257143  0.230281

In [26]:
# di = 5
[theoretical_rho_5_kurtosis empirical_cor_5[1, 2]]

1×2 Matrix{Float64}:
 0.18  0.167854

In [27]:
# di = 10
[theoretical_rho_10_kurtosis empirical_cor_10[1, 2]]

1×2 Matrix{Float64}:
 0.12  0.119132

In [28]:
# di = 25
[theoretical_rho_25_kurtosis empirical_cor_25[1, 2]]

1×2 Matrix{Float64}:
 0.06  0.0580281
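
To put the four comparisons side by side, the sketch below tabulates the absolute and relative gap between the theoretical and empirical values. The numbers are copied from the outputs of cells In [25] to In [28]; the variable names are introduced here for illustration only.

```julia
# Values copied from the outputs of cells In [25] to In [28] above.
reported_sizes       = [2, 5, 10, 25]
reported_theoretical = [0.257143, 0.18, 0.12, 0.06]
reported_empirical   = [0.230281, 0.167854, 0.119132, 0.0580281]

abs_gap = reported_theoretical .- reported_empirical   # theoretical minus empirical
rel_gap = abs_gap ./ reported_theoretical              # gap as a fraction of theoretical

for k in eachindex(reported_sizes)
    println("di = ", reported_sizes[k],
            "  gap = ", round(abs_gap[k], digits = 4),
            "  relative gap = ", round(100 * rel_gap[k], digits = 1), "%")
end
```

In these runs the empirical value sits below the theoretical one at every cluster size, with the largest relative gap at $d_i = 2$, and both columns decrease as $d_i$ grows.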

##  2. Empirical Correlation simulated under GLMM <a class="anchor" id="ex4g"></a>

### Takeaway: GLMM correlation is NOT a function of $d_i$

In [29]:
# glmm correlation di = 2 
empirical_cor_2_GLMM[1, 2]

0.631196434845501

In [30]:
# glmm correlation di = 5
empirical_cor_5_GLMM[1, 2]

0.6203499764734532

In [31]:
# glmm correlation di = 10
empirical_cor_10_GLMM[1, 2]

0.6140014172969105

In [32]:
# glmm correlation di = 25
empirical_cor_25_GLMM[1, 2]

0.6389964225572923
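
One way to see why the GLMM values barely move with $d_i$ is a closed-form check. Assuming the Poisson log-normal setup simulated above (log link, intercept $\log\lambda$, latent $\eta \sim N(\log(\lambda)\mathbf{1}, \Gamma)$ with $\Gamma_{jk} = \sigma^2 \rho^{|j-k|}$), the standard log-normal moment identities give a marginal correlation that depends only on the lag $|j - k|$. The helper `glmm_poisson_cor` is hypothetical, not a GLMCopula function, and the calculation is a sketch under those assumptions.

```julia
# Hypothetical helper (not part of GLMCopula): lag-k marginal correlation of
# Y_j | η_j ~ Poisson(exp(η_j)) with η ~ N(log(λ) * 1, Γ), Γ[j, k] = σ2 * ρ^|j - k|.
function glmm_poisson_cor(λ, σ2, ρ, lag)
    mean_μ = λ * exp(σ2 / 2)                        # E[exp(η_j)]
    var_μ  = λ^2 * exp(σ2) * (exp(σ2) - 1)          # Var[exp(η_j)]
    cov_μ  = λ^2 * exp(σ2) * (exp(σ2 * ρ^lag) - 1)  # Cov[exp(η_j), exp(η_k)]
    var_y  = mean_μ + var_μ                         # law of total variance
    return cov_μ / var_y                            # cluster size never enters
end

r_lag1 = glmm_poisson_cor(1.0, 1.0, 0.9, 1)
println("lag-1 GLMM correlation ≈ ", round(r_lag1, digits = 4))
```

With $\lambda = 1, \sigma^2 = 1.0, \rho = 0.9$ this evaluates to roughly 0.63, in line with the four empirical values above (0.614 to 0.639), and the cluster size does not appear anywhere in the expression.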