# Poisson(5) with AR(1) rho = 0.9, sigma2 = 1.0

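We use a Poisson base distribution with $\lambda = 5$ and an AR(1) covariance with $\rho = 0.9$ and $\sigma^2 = 1.0$.

Under the quasi-copula (QC) model, the theoretical correlation between the first two observations in a cluster is a function of the cluster size $d_i$, $\rho$, $\sigma^2$, and the kurtosis $kt$ of the base distribution: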

$$\operatorname{Corr}(Y_1, Y_2)
 = \frac{\rho \, \sigma^2}{1 + \frac{1}{2} \sigma^2 \, kt + \frac{1}{2} \sigma^2 (d_i - 1)}$$

With $\lambda = 5$, the kurtosis is $kt = kt(\lambda) = 3.2$.

Let's see what happens to the theoretical and empirical correlations when we fix $\rho = 0.9$ and $\sigma^2 = 1.0$ and vary the cluster size $d_i$:

$$\operatorname{Corr}(Y_1, Y_2)
 = \frac{0.9}{1 + \frac{1}{2} \cdot 3.2 + \frac{1}{2} (d_i - 1)} = \frac{0.9}{1 + \frac{1}{2} (d_i + 2.2)}$$
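
As a quick check of the formula above, here is a minimal sketch that evaluates it for the four cluster sizes used in this notebook. The helper name `qc_corr` is hypothetical (it is not part of GLMCopula); it simply restates the displayed expression.

```julia
# Hypothetical helper: theoretical QC correlation between Y_1 and Y_2,
# written exactly as in the displayed formula.
qc_corr(ρ, σ2, kt, di) = ρ * σ2 / (1 + 0.5 * σ2 * kt + 0.5 * σ2 * (di - 1))

# Evaluate at ρ = 0.9, σ² = 1.0, kt = 3.2 for each cluster size.
for di in (2, 5, 10, 25)
    println("di = ", di, ": ", round(qc_corr(0.9, 1.0, 3.2, di), digits = 4))
end
```

Since $d_i$ only appears in the denominator, the correlation shrinks as the cluster grows.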

# We will show:
1) When simulating under the quasi-copula (QC) model, the theoretical and empirical correlations ARE a function of the cluster size $d_i$.

2) When simulating under the GLMM, the theoretical and empirical correlations are NOT a function of the cluster size $d_i$.


## TOC:

# di = 2
* [AR(1) di = 2](#ex0)
* [simulate under GLMM AR(1) di = 2](#ex0g)

# di = 5
* [AR(1) di = 5](#ex1)
* [simulate under GLMM AR(1) di = 5](#ex1g)

# di = 10 
* [AR(1) di = 10](#ex2)
* [simulate under GLMM AR(1) di = 10](#ex2g)

# di = 25
* [AR(1) di = 25](#ex3)
* [simulate under GLMM AR(1) di = 25](#ex3g)

# Comparisons
* [Theoretical vs. Empirical Correlation Simulated under QC](#ex4)
* [Empirical Correlation Simulated under GLMM](#ex4g)

In [1]:
using GLMCopula, DelimitedFiles, LinearAlgebra, Random, GLM, MixedModels, CategoricalArrays
using Roots, SpecialFunctions, StatsBase
using DataFrames, Statistics, ToeplitzMatrices
import StatsBase: sem

In [2]:
function get_V_AR(ρ, n)
    vec = zeros(n)
    vec[1] = 1.0
    for i in 2:n
        vec[i] = vec[i - 1] * ρ
    end
    V = ToeplitzMatrices.SymmetricToeplitz(vec)
    V
end

get_V_AR (generic function with 1 method)
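
For readers without ToeplitzMatrices, the same AR(1) structure can be sketched with a plain comprehension, since entry $(i, j)$ is $\rho^{|i - j|}$. The name `ar1_matrix` is hypothetical and is only meant to illustrate what `get_V_AR` builds; it returns a dense `Matrix` rather than a `SymmetricToeplitz`.

```julia
# Hypothetical helper (not part of GLMCopula): dense AR(1) correlation matrix
# whose (i, j) entry is ρ^|i - j|.
ar1_matrix(ρ, n) = [ρ^abs(i - j) for i in 1:n, j in 1:n]

V = ar1_matrix(0.9, 5)
display(V)
```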

In [3]:
# true parameter values
lambda = [5.0]
βtrue = [log(lambda[1])]
σ2true = [1.0]
ρtrue = [0.9]

samplesize = 100000 # number of sampling units

d = Poisson()
link = LogLink()
D = typeof(d)
Link = typeof(link)
T = Float64

Float64

# Kurtosis of each Poisson(5) base distribution is 3.2 

With $\lambda = 5$, each Poisson base has kurtosis $kt = kt(\lambda) = 3.2$, which the next cell confirms.

We then fix $\rho = 0.9$, $\sigma^2 = 1.0$, $kt = 3.2$ and vary the cluster size $d_i$ to see how the theoretical and empirical correlations respond.

In [4]:
d = Poisson(lambda[1])
μ, σ², sk, kt = mean(d), var(d), skewness(d), kurtosis(d, false)

(5.0, 5.0, 0.4472135954999579, 3.2)
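
These values match the standard closed-form moments of the Poisson distribution, worked out here for $\lambda = 5$. Note that $kt$ is the non-excess kurtosis, which is why the cell passes `false` as the second argument of `kurtosis`.

$$\text{skewness} = \frac{1}{\sqrt{\lambda}} = \frac{1}{\sqrt{5}} \approx 0.4472, \qquad kt(\lambda) = 3 + \frac{1}{\lambda} = 3 + \frac{1}{5} = 3.2$$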

## AR(1) di = 2 <a class="anchor" id="ex0"></a>

$d_i = 2$

$\lambda = 5, \rho = 0.9, \sigma^2 = 1.0$

In [5]:
di = 2 # number of observations per cluster
V_AR = get_V_AR(ρtrue[1], di)

# true Gamma
Γ_AR = σ2true[1] * V_AR

2×2 Matrix{Float64}:
 1.0  0.9
 0.9  1.0

In [6]:
vecd = Vector{DiscreteUnivariateDistribution}(undef, di)
for i in 1:di
    vecd[i] = Poisson(lambda[1])
end
nonmixed_multivariate_dist = NonMixedMultivariateDistribution(vecd, Γ_AR)

Random.seed!(12345)
@time Y_nsample = simulate_nobs_independent_vectors(nonmixed_multivariate_dist, samplesize)

# wrap each simulated cluster in an observation object (intercept-only design)
gcs = Vector{GLMCopulaARObs{T, D, Link}}(undef, samplesize)
for i in 1:samplesize
    X = ones(di, 1)
    y = Float64.(Y_nsample[i])
    gcs[i] = GLMCopulaARObs(y, X, d, link)
end

# form model
gcm = GLMCopulaARModel(gcs);

N = length(gcm.data)
Y_AR = zeros(N, di)
for j in 1:di
    Y_AR[:, j] = [gcm.data[i].y[j] for i in 1:N]
end
empirical_cor_2 = StatsBase.cor(Y_AR)

  1.167570 seconds (5.64 M allocations: 627.854 MiB, 10.94% gc time, 45.82% compilation time)


2×2 Matrix{Float64}:
 1.0       0.285195
 0.285195  1.0

In [7]:
empirical_cor_2[1, 2]

0.2851948656513909

In [8]:
theoretical_rho_2_kurtosis = (ρtrue[1] * σ2true[1]) / (1 + ((di/2) * σ2true[1]) + (0.5 * (kt - 1) * σ2true[1]))

0.2903225806451613

## simulate under GLMM AR(1) di = 2 <a class="anchor" id="ex0g"></a>

$d_i = 2$

$\lambda = 5, \rho = 0.9, \sigma^2 = 1.0$

In [9]:
function __get_distribution(dist::Type{D}, μ) where D <: UnivariateDistribution
    return dist(μ)
end

for i in 1:samplesize
    X = ones(di, 1)
    η = X * βtrue
    # draw the linear predictor from a multivariate normal with AR(1) covariance
    mvn_d = MvNormal(η, Γ_AR)
    mvn_η = rand(mvn_d)
    μ = GLM.linkinv.(link, mvn_η)
    y = Float64.(rand.(__get_distribution.(D, μ)))
    gcs[i] = GLMCopulaARObs(y, X, d, link)
end

# form model
gcm = GLMCopulaARModel(gcs);

N = length(gcm.data)
Y_AR = zeros(N, di)
for j in 1:di
    Y_AR[:, j] = [gcm.data[i].y[j] for i in 1:N]
end
empirical_cor_2_GLMM = StatsBase.cor(Y_AR)

2×2 Matrix{Float64}:
 1.0       0.798455
 0.798455  1.0

## AR(1) di = 5 <a class="anchor" id="ex1"></a>

$d_i = 5$

$\lambda = 5, \rho = 0.9, \sigma^2 = 1.0$

In [10]:
di = 5 # number of observations per cluster
V_AR = get_V_AR(ρtrue[1], di)

# true Gamma
Γ_AR = σ2true[1] * V_AR

5×5 Matrix{Float64}:
 1.0     0.9    0.81  0.729  0.6561
 0.9     1.0    0.9   0.81   0.729
 0.81    0.9    1.0   0.9    0.81
 0.729   0.81   0.9   1.0    0.9
 0.6561  0.729  0.81  0.9    1.0

In [11]:
vecd = Vector{DiscreteUnivariateDistribution}(undef, di)
for i in 1:di
    vecd[i] = Poisson(lambda[1])
end
nonmixed_multivariate_dist = NonMixedMultivariateDistribution(vecd, Γ_AR)

Random.seed!(12345)
@time Y_nsample = simulate_nobs_independent_vectors(nonmixed_multivariate_dist, samplesize)

# wrap each simulated cluster in an observation object (intercept-only design)
gcs = Vector{GLMCopulaARObs{T, D, Link}}(undef, samplesize)
for i in 1:samplesize
    X = ones(di, 1)
    y = Float64.(Y_nsample[i])
    gcs[i] = GLMCopulaARObs(y, X, d, link)
end

# form model
gcm = GLMCopulaARModel(gcs);

N = length(gcm.data)
Y_AR = zeros(N, di)
for j in 1:di
    Y_AR[:, j] = [gcm.data[i].y[j] for i in 1:N]
end
empirical_cor_5 = StatsBase.cor(Y_AR)

  1.963463 seconds (9.00 M allocations: 1.299 GiB, 29.41% gc time)


5×5 Matrix{Float64}:
 1.0       0.196943  0.179219  0.159001  0.140843
 0.196943  1.0       0.193536  0.178693  0.154884
 0.179219  0.193536  1.0       0.191839  0.175218
 0.159001  0.178693  0.191839  1.0       0.197053
 0.140843  0.154884  0.175218  0.197053  1.0

In [12]:
empirical_cor_5[1, 2]

0.19694343677998818

In [13]:
theoretical_rho_5_kurtosis = (ρtrue[1] * σ2true[1]) / (1 + ((di/2) * σ2true[1]) + (0.5 * (kt - 1) * σ2true[1]))

0.1956521739130435

## simulate under GLMM AR(1) di = 5 <a class="anchor" id="ex1g"></a>

$d_i = 5$

$\lambda = 5, \rho = 0.9, \sigma^2 = 1.0$

In [14]:
for i in 1:samplesize
    X = ones(di, 1)
    η = X * βtrue
    # draw the linear predictor from a multivariate normal with AR(1) covariance
    mvn_d = MvNormal(η, Γ_AR)
    mvn_η = rand(mvn_d)
    μ = GLM.linkinv.(link, mvn_η)
    y = Float64.(rand.(__get_distribution.(D, μ)))
    gcs[i] = GLMCopulaARObs(y, X, d, link)
end

# form model
gcm = GLMCopulaARModel(gcs);

N = length(gcm.data)
Y_AR = zeros(N, di)
for j in 1:di
    Y_AR[:, j] = [gcm.data[i].y[j] for i in 1:N]
end
empirical_cor_5_GLMM = StatsBase.cor(Y_AR)

5×5 Matrix{Float64}:
 1.0       0.7895    0.676341  0.584516  0.498512
 0.7895    1.0       0.788338  0.675127  0.582794
 0.676341  0.788338  1.0       0.793215  0.673568
 0.584516  0.675127  0.793215  1.0       0.785478
 0.498512  0.582794  0.673568  0.785478  1.0

## AR(1) di = 10 <a class="anchor" id="ex2"></a>

$d_i = 10$

$\lambda = 5, \rho = 0.9, \sigma^2 = 1.0$

In [15]:
di = 10 # number of observations per cluster
V_AR = get_V_AR(ρtrue[1], di)

# true Gamma
Γ_AR = σ2true[1] * V_AR

10×10 Matrix{Float64}:
 1.0       0.9       0.81      0.729     …  0.478297  0.430467  0.38742
 0.9       1.0       0.9       0.81         0.531441  0.478297  0.430467
 0.81      0.9       1.0       0.9          0.59049   0.531441  0.478297
 0.729     0.81      0.9       1.0          0.6561    0.59049   0.531441
 0.6561    0.729     0.81      0.9          0.729     0.6561    0.59049
 0.59049   0.6561    0.729     0.81      …  0.81      0.729     0.6561
 0.531441  0.59049   0.6561    0.729        0.9       0.81      0.729
 0.478297  0.531441  0.59049   0.6561       1.0       0.9       0.81
 0.430467  0.478297  0.531441  0.59049      0.9       1.0       0.9
 0.38742   0.430467  0.478297  0.531441     0.81      0.9       1.0

In [16]:
vecd = Vector{DiscreteUnivariateDistribution}(undef, di)
for i in 1:di
    vecd[i] = Poisson(lambda[1])
end
nonmixed_multivariate_dist = NonMixedMultivariateDistribution(vecd, Γ_AR)

Random.seed!(12345)
@time Y_nsample = simulate_nobs_independent_vectors(nonmixed_multivariate_dist, samplesize)

# wrap each simulated cluster in an observation object (intercept-only design)
gcs = Vector{GLMCopulaARObs{T, D, Link}}(undef, samplesize)
for i in 1:samplesize
    X = ones(di, 1)
    y = Float64.(Y_nsample[i])
    gcs[i] = GLMCopulaARObs(y, X, d, link)
end

# form model
gcm = GLMCopulaARModel(gcs);

N = length(gcm.data)
Y_AR = zeros(N, di)
for j in 1:di
    Y_AR[:, j] = [gcm.data[i].y[j] for i in 1:N]
end
empirical_cor_10 = StatsBase.cor(Y_AR)

  3.745381 seconds (17.50 M allocations: 3.017 GiB, 24.30% gc time)


10×10 Matrix{Float64}:
 1.0        0.129212   0.111241   …  0.0654108  0.0627458  0.051693
 0.129212   1.0        0.130465      0.0734216  0.0676921  0.0627178
 0.111241   0.130465   1.0           0.0802944  0.074363   0.0613222
 0.101642   0.117368   0.12355       0.0924405  0.0835027  0.0757609
 0.0880405  0.106948   0.11251       0.100746   0.0886037  0.0801471
 0.0787966  0.0946752  0.0936823  …  0.115538   0.0999786  0.0867638
 0.0751109  0.0807032  0.0901093     0.123241   0.114441   0.104384
 0.0654108  0.0734216  0.0802944     1.0        0.12493    0.110968
 0.0627458  0.0676921  0.074363      0.12493    1.0        0.121423
 0.051693   0.0627178  0.0613222     0.110968   0.121423   1.0

In [17]:
empirical_cor_10[1, 2]

0.12921158248045786

In [18]:
theoretical_rho_10_kurtosis = (ρtrue[1] * σ2true[1]) / (1 + ((di/2) * σ2true[1]) + (0.5 * (kt - 1) * σ2true[1]))

0.1267605633802817

## simulate under GLMM AR(1) di = 10 <a class="anchor" id="ex2g"></a>

$d_i = 10$

$\lambda = 5, \rho = 0.9, \sigma^2 = 1.0$

In [19]:
for i in 1:samplesize
    X = ones(di, 1)
    η = X * βtrue
    # draw the linear predictor from a multivariate normal with AR(1) covariance
    mvn_d = MvNormal(η, Γ_AR)
    mvn_η = rand(mvn_d)
    μ = GLM.linkinv.(link, mvn_η)
    y = Float64.(rand.(__get_distribution.(D, μ)))
    gcs[i] = GLMCopulaARObs(y, X, d, link)
end

# form model
gcm = GLMCopulaARModel(gcs);

N = length(gcm.data)
Y_AR = zeros(N, di)
for j in 1:di
    Y_AR[:, j] = [gcm.data[i].y[j] for i in 1:N]
end
empirical_cor_10_GLMM = StatsBase.cor(Y_AR)

10×10 Matrix{Float64}:
 1.0       0.798802  0.686538  0.591867  …  0.339225  0.302694  0.267577
 0.798802  1.0       0.799824  0.683553     0.386718  0.341606  0.29824
 0.686538  0.799824  1.0       0.794703     0.440669  0.383323  0.335242
 0.591867  0.683553  0.794703  1.0          0.498116  0.431618  0.373987
 0.509367  0.583726  0.674347  0.786307     0.577579  0.505531  0.435616
 0.444698  0.506176  0.582279  0.669931  …  0.676812  0.587036  0.50528
 0.387235  0.43926   0.505446  0.578715     0.789131  0.674672  0.585027
 0.339225  0.386718  0.440669  0.498116     1.0       0.787131  0.678136
 0.302694  0.341606  0.383323  0.431618     0.787131  1.0       0.789833
 0.267577  0.29824   0.335242  0.373987     0.678136  0.789833  1.0

## AR(1) di = 25 <a class="anchor" id="ex3"></a>

$d_i = 25$

$\lambda = 5, \rho = 0.9, \sigma^2 = 1.0$

In [20]:
di = 25 # number of observations per cluster
V_AR = get_V_AR(ρtrue[1], di)

# true Gamma
Γ_AR = σ2true[1] * V_AR

25×25 Matrix{Float64}:
 1.0        0.9        0.81       …  0.0984771  0.0886294  0.0797664
 0.9        1.0        0.9           0.109419   0.0984771  0.0886294
 0.81       0.9        1.0           0.121577   0.109419   0.0984771
 0.729      0.81       0.9           0.135085   0.121577   0.109419
 0.6561     0.729      0.81          0.150095   0.135085   0.121577
 0.59049    0.6561     0.729      …  0.166772   0.150095   0.135085
 0.531441   0.59049    0.6561        0.185302   0.166772   0.150095
 0.478297   0.531441   0.59049       0.205891   0.185302   0.166772
 0.430467   0.478297   0.531441      0.228768   0.205891   0.185302
 0.38742    0.430467   0.478297      0.254187   0.228768   0.205891
 0.348678   0.38742    0.430467   …  0.28243    0.254187   0.228768
 0.313811   0.348678   0.38742       0.313811   0.28243    0.254187
 0.28243    0.313811   0.348678      0.348678   0.313811   0.28243
 0.254187   0.28243    0.313811      0.38742    0.348678   0.313811
 0.228768   0.254187   

In [21]:
vecd = Vector{DiscreteUnivariateDistribution}(undef, di)
for i in 1:di
    vecd[i] = Poisson(lambda[1])
end
nonmixed_multivariate_dist = NonMixedMultivariateDistribution(vecd, Γ_AR)

Random.seed!(12345)
@time Y_nsample = simulate_nobs_independent_vectors(nonmixed_multivariate_dist, samplesize)

# wrap each simulated cluster in an observation object (intercept-only design)
gcs = Vector{GLMCopulaARObs{T, D, Link}}(undef, samplesize)
for i in 1:samplesize
    X = ones(di, 1)
    y = Float64.(Y_nsample[i])
    gcs[i] = GLMCopulaARObs(y, X, d, link)
end

# form model
gcm = GLMCopulaARModel(gcs);

N = length(gcm.data)
Y_AR = zeros(N, di)
for j in 1:di
    Y_AR[:, j] = [gcm.data[i].y[j] for i in 1:N]
end
empirical_cor_25 = StatsBase.cor(Y_AR)

 13.607008 seconds (43.00 M allocations: 14.642 GiB, 24.84% gc time)


25×25 Matrix{Float64}:
 1.0         0.057877    0.054407    …  0.00334655  0.00380137  0.00818987
 0.057877    1.0         0.060857       0.00671293  0.00272152  0.00669421
 0.054407    0.060857    1.0            0.00539359  0.00731512  0.00349099
 0.0482577   0.0592987   0.0573111      0.00975628  0.00449522  0.00557292
 0.0476948   0.0448251   0.0535737      0.0119277   0.00800253  0.0119288
 0.0363996   0.0447774   0.0543074   …  0.0125646   0.00597484  0.00863699
 0.040403    0.039845    0.0488385      0.012551    0.0158989   0.0098112
 0.0297361   0.0338403   0.0413809      0.00883903  0.0133106   0.00768023
 0.0348425   0.0376329   0.0365287      0.00791255  0.0101563   0.00833693
 0.0218884   0.0288836   0.0292597      0.0163443   0.0115216   0.00678645
 0.0207626   0.0229643   0.0302499   …  0.0168207   0.0166431   0.016916
 0.0219535   0.0229721   0.0253146      0.0262654   0.0190957   0.0182259
 0.0228315   0.0158543   0.0249878      0.0177675   0.0159887   0.0210917
 0.02091

In [22]:
empirical_cor_25[1, 2]

0.057877033765906125

In [23]:
theoretical_rho_25_kurtosis = (ρtrue[1] * σ2true[1]) / (1 + ((di/2) * σ2true[1]) + (0.5 * (kt - 1) * σ2true[1]))

0.06164383561643836

## simulate under GLMM AR(1) di = 25 <a class="anchor" id="ex3g"></a>

$d_i = 25$

$\lambda = 5, \rho = 0.9, \sigma^2 = 1.0$

In [24]:
for i in 1:samplesize
    X = ones(di, 1)
    η = X * βtrue
    # draw the linear predictor from a multivariate normal with AR(1) covariance
    mvn_d = MvNormal(η, Γ_AR)
    mvn_η = rand(mvn_d)
    μ = GLM.linkinv.(link, mvn_η)
    y = Float64.(rand.(__get_distribution.(D, μ)))
    gcs[i] = GLMCopulaARObs(y, X, d, link)
end

# form model
gcm = GLMCopulaARModel(gcs);

N = length(gcm.data)
Y_AR = zeros(N, di)
for j in 1:di
    Y_AR[:, j] = [gcm.data[i].y[j] for i in 1:N]
end
empirical_cor_25_GLMM = StatsBase.cor(Y_AR)

25×25 Matrix{Float64}:
 1.0        0.794472   0.675906   …  0.052031   0.0434169  0.0379415
 0.794472   1.0        0.790503      0.0602639  0.0503706  0.0440087
 0.675906   0.790503   1.0           0.0681712  0.0568748  0.0501382
 0.579667   0.669935   0.791564      0.0796472  0.0667057  0.0584775
 0.505297   0.579128   0.678592      0.0849799  0.0703619  0.0629699
 0.443646   0.502518   0.578531   …  0.0952401  0.0822579  0.0724376
 0.392455   0.438296   0.505983      0.110663   0.0970143  0.0826461
 0.338989   0.381037   0.433112      0.128541   0.110231   0.0943535
 0.293002   0.331286   0.372569      0.151078   0.127954   0.109659
 0.258585   0.292119   0.329628      0.173826   0.148796   0.128923
 0.225392   0.253935   0.286773   …  0.184701   0.161044   0.139969
 0.201852   0.226394   0.253977      0.206255   0.17896    0.154341
 0.183614   0.205379   0.231355      0.229135   0.20219    0.174614
 0.162994   0.18122    0.205704      0.254181   0.227318   0.196551
 0.14108    0.157

## Comparisons 

##  1. Theoretical vs. Empirical Correlation simulated under QC <a class="anchor" id="ex4"></a>

### Takeaway: Quasi-Copula correlation IS a function of $d_i$, $kt$, $\rho$, and $\sigma^2$

The theoretical correlation is a function of $d_i$, $\rho$, $\sigma^2$, and $kt(\lambda)$:

$$\operatorname{Corr}(Y_1, Y_2)
 = \frac{\rho \, \sigma^2}{1 + \frac{1}{2} \sigma^2 \, kt + \frac{1}{2} \sigma^2 (d_i - 1)}$$

With $\lambda = 5$, the kurtosis is $kt = kt(\lambda) = 3.2$.

Fixing $\rho = 0.9$ and $\sigma^2 = 1.0$ and varying the cluster size $d_i$ gives

$$\operatorname{Corr}(Y_1, Y_2)
 = \frac{0.9}{1 + \frac{1}{2} \cdot 3.2 + \frac{1}{2} (d_i - 1)} = \frac{0.9}{1 + \frac{1}{2} (d_i + 2.2)}$$

Each cell below prints the theoretical value followed by the empirical lag-1 correlation for one cluster size.


In [25]:
# di = 2
[theoretical_rho_2_kurtosis empirical_cor_2[1, 2]]

1×2 Matrix{Float64}:
 0.290323  0.285195

In [26]:
# di = 5
[theoretical_rho_5_kurtosis empirical_cor_5[1, 2]]

1×2 Matrix{Float64}:
 0.195652  0.196943

In [27]:
# di = 10
[theoretical_rho_10_kurtosis empirical_cor_10[1, 2]]

1×2 Matrix{Float64}:
 0.126761  0.129212

In [28]:
# di = 25
[theoretical_rho_25_kurtosis empirical_cor_25[1, 2]]

1×2 Matrix{Float64}:
 0.0616438  0.057877
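
The four comparisons above can be summarized in one place. The sketch below recomputes the theoretical values with a hypothetical helper `qc_corr` (a restatement of the formula, not a GLMCopula function) and compares them with the empirical lag-1 correlations copied by hand from the printed outputs, so it runs without rerunning the simulations.

```julia
# Hypothetical helper: theoretical QC correlation between Y_1 and Y_2.
qc_corr(ρ, σ2, kt, di) = ρ * σ2 / (1 + 0.5 * σ2 * kt + 0.5 * σ2 * (di - 1))

cluster_sizes = [2, 5, 10, 25]
# Empirical lag-1 correlations, copied from the outputs printed above.
empirical = [0.285195, 0.196943, 0.129212, 0.057877]
theoretical = [qc_corr(0.9, 1.0, 3.2, di) for di in cluster_sizes]

# Relative difference of the empirical value from the theoretical one.
rel_diff = (empirical .- theoretical) ./ theoretical

for (di, t, e, r) in zip(cluster_sizes, theoretical, empirical, rel_diff)
    println("di = ", di, ": theoretical = ", round(t, digits = 4),
            ", empirical = ", round(e, digits = 4),
            ", relative difference = ", round(100 * r, digits = 1), "%")
end
```

Both columns fall together as $d_i$ grows, and the relative gap is largest at $d_i = 25$, where the correlation itself is smallest.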

##  2. Empirical Correlation simulated under GLMM <a class="anchor" id="ex4g"></a>

### Takeaway: GLMM correlation is NOT a function of $d_i$

In [29]:
# glmm correlation di = 2 
empirical_cor_2_GLMM[1, 2]

0.7984547527377245

In [30]:
# glmm correlation di = 5
empirical_cor_5_GLMM[1, 2]

0.7894996031710032

In [31]:
# glmm correlation di = 10
empirical_cor_10_GLMM[1, 2]

0.7988019832980287

In [32]:
# glmm correlation di = 25
empirical_cor_25_GLMM[1, 2]

0.7944721096751493
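
One way to see why these values do not move with $d_i$ is to work out the correlation implied by the simulation scheme used in the GLMM cells, where $Y_j \mid \eta_j \sim \text{Poisson}(e^{\eta_j})$ and $\eta$ is multivariate normal with mean $\beta$ and AR(1) covariance. The sketch below is not taken from the notebook or from GLMCopula: it applies the standard log-normal moment formulas, and the helper name `glmm_corr` is hypothetical.

```julia
# Hypothetical helper: lag-k correlation of Poisson counts whose log-means are
# multivariate normal with mean β, variance σ2, and AR(1) correlation ρ^k.
function glmm_corr(β, σ2, ρ, k)
    m = exp(β + σ2 / 2)              # marginal mean E[Y_j]
    v = m + m^2 * (exp(σ2) - 1)      # marginal variance Var(Y_j)
    c = m^2 * (exp(σ2 * ρ^k) - 1)    # covariance Cov(Y_j, Y_{j+k})
    return c / v
end

println(glmm_corr(log(5.0), 1.0, 0.9, 1))  # lag-1 correlation
println(glmm_corr(log(5.0), 1.0, 0.9, 2))  # lag-2 correlation
```

Under these assumptions the lag-1 value is roughly 0.79, in line with the empirical correlations above, and the cluster size $d_i$ does not appear anywhere in the expression.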