# Bernoulli(0.5) with AR(1) rho = 0.9, sigma2 = 1.0

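This notebook uses a Bernoulli base distribution with $p = 0.5$ and an AR(1) covariance structure with $\rho = 0.9$ and $\sigma^2 = 1.0$.

The theoretical correlation is a function of the cluster size $d_i$, the AR(1) parameters $\rho$ and $\sigma^2$, and the kurtosis $kt$ of the base distribution: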

$$Corr(Y_1, Y_2)
 = \frac{\rho * \sigma^2}{1 + \frac{1}{2} * \sigma^2 * kt + \frac{1}{2} * \sigma^2 (d_i - 1)}$$

Since $p = 0.5$, the kurtosis is $kt = kt(p) = 1$.

Let's see what happens to the theoretical and empirical correlations when we fix $\rho = 0.9, \sigma^2 = 1.0$ and vary the cluster size $d_i$:

$$Corr(Y_1, Y_2)
 = \frac{0.9}{1 + \frac{1}{2} * 1 + \frac{1}{2} * (d_i - 1)} = \frac{0.9}{1 + \frac{d_i}{2}}$$
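
The formula above can be evaluated directly for the four cluster sizes used in this notebook. The following is a minimal sketch in plain Julia; the function name `qc_cor` and the variable names are illustrative and are not part of GLMCopula.

```julia
# Theoretical adjacent-pair correlation under the quasi-copula AR(1) model,
# transcribed term by term from the formula above.
qc_cor(ρ, σ2, kt, di) = ρ * σ2 / (1 + 0.5 * σ2 * kt + 0.5 * σ2 * (di - 1))

cluster_sizes = [2, 5, 10, 25]
qc_values = [qc_cor(0.9, 1.0, 1.0, di) for di in cluster_sizes]

for (di, r) in zip(cluster_sizes, qc_values)
    println("di = ", di, "  Corr(Y_1, Y_2) = ", round(r, digits = 4))
end
```

The values shrink as $d_i$ grows because $d_i$ only enters the denominator.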


# We will show:

1. Simulating under the QC (quasi-copula) model, the theoretical and empirical correlation IS a function of the cluster size $d_i$.
2. Simulating under the GLMM, the empirical correlation is NOT a function of the cluster size $d_i$.



## TOC:

# di = 2
* [AR(1) di = 2](#ex0)
* [simulate under GLMM AR(1) di = 2](#ex0g)

# di = 5
* [AR(1) di = 5](#ex1)
* [simulate under GLMM AR(1) di = 5](#ex1g)

# di = 10 
* [AR(1) di = 10](#ex2)
* [simulate under GLMM AR(1) di = 10](#ex2g)

# di = 25
* [AR(1) di = 25](#ex3)
* [simulate under GLMM AR(1) di = 25](#ex3g)

# Comparisons
* [Theoretical vs. Empirical Correlation Simulated under QC](#ex4)
* [Empirical Correlation Simulated under GLMM](#ex4g)

In [1]:
using GLMCopula, DelimitedFiles, LinearAlgebra, Random, GLM, MixedModels, CategoricalArrays
using Roots, SpecialFunctions, StatsBase
using DataFrames, Statistics, ToeplitzMatrices
import StatsBase: sem

In [2]:
function get_V_AR(ρ, n)
    vec = zeros(n)
    vec[1] = 1.0
    for i in 2:n
        vec[i] = vec[i - 1] * ρ
    end
    V = ToeplitzMatrices.SymmetricToeplitz(vec)
    V
end

get_V_AR (generic function with 1 method)
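
As a cross-check on `get_V_AR`, the same AR(1) correlation matrix can be written entry by entry, since a symmetric Toeplitz matrix with first column $1, \rho, \rho^2, \dots$ has entries $\rho^{|i - j|}$. The sketch below uses only base Julia; `ar1_dense` is a hypothetical helper name introduced for illustration.

```julia
# AR(1) correlation matrix built entry by entry: V[i, j] = ρ^|i - j|.
ar1_dense(ρ, n) = [ρ^abs(i - j) for i in 1:n, j in 1:n]

V5 = ar1_dense(0.9, 5)
```

The first row of `V5` should agree with the 5×5 matrix printed later for `di = 5` (1.0, 0.9, 0.81, 0.729, 0.6561), up to floating-point rounding.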

In [3]:
# true parameter values
p = [0.5]
βtrue = [log(1.0)]
σ2true = [1.0]
ρtrue = [0.9]

samplesize = 100000 # number of sampling units

d = Bernoulli()
link = LogitLink()
D = typeof(d)
Link = typeof(link)
T = Float64

Float64

# Kurtosis of each Bernoulli(0.5) base distribution is 1.0

Since $p = 0.5$, the kurtosis is $kt = kt(p) = 1$.

With $\rho = 0.9, \sigma^2 = 1.0, kt = 1.0$ fixed, the sections below vary the cluster size $d_i$ and compare the theoretical and empirical correlations.

In [4]:
d = Bernoulli(p[1])
μ, σ², sk, kt = mean(d), var(d), skewness(d), kurtosis(d, false)

(0.5, 0.25, 0.0, 1.0)
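
The value `kt = 1.0` is the non-excess kurtosis (the second argument `false` in `kurtosis(d, false)` turns off the excess correction). It can be reproduced from the definition $E[(Y - p)^4] / Var(Y)^2$. The sketch below uses base Julia only; `bernoulli_kurtosis` is an illustrative name, not a library function.

```julia
# Non-excess kurtosis of Bernoulli(p): fourth central moment over variance squared.
function bernoulli_kurtosis(p)
    q = 1 - p
    m4 = p * q^4 + q * p^4   # E[(Y - p)^4]: Y = 1 w.p. p, Y = 0 w.p. q
    return m4 / (p * q)^2    # Var(Y) = p * q
end

kt_half = bernoulli_kurtosis(0.5)
```

For other values of $p$ the kurtosis is larger than 1, so the denominator of the theoretical correlation changes with $p$ as well.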

## AR(1) di = 2 <a class="anchor" id="ex0"></a>

$d_i = 2$

$p = 0.5, \rho = 0.9, \sigma^2 = 1.0$

In [5]:
di = 2 # number of observations per cluster
V_AR = get_V_AR(ρtrue[1], di)

# true Gamma
Γ_AR = σ2true[1] * V_AR

2×2 Matrix{Float64}:
 1.0  0.9
 0.9  1.0

In [6]:
vecd = Vector{DiscreteUnivariateDistribution}(undef, di)
for i in 1:di
    vecd[i] = Bernoulli(p[1])
end
nonmixed_multivariate_dist = NonMixedMultivariateDistribution(vecd, Γ_AR)

Random.seed!(12345)
@time Y_nsample = simulate_nobs_independent_vectors(nonmixed_multivariate_dist, samplesize)

gcs = Vector{GLMCopulaARObs{T, D, Link}}(undef, samplesize)
for i in 1:samplesize
    X = ones(di, 1)             # intercept-only design matrix
    y = Float64.(Y_nsample[i])  # i-th simulated response vector
    gcs[i] = GLMCopulaARObs(y, X, d, link)
end

# form model
gcm = GLMCopulaARModel(gcs);

N = length(gcm.data)
Y_AR = zeros(N, di)
for j in 1:di
    Y_AR[:, j] = [gcm.data[i].y[j] for i in 1:N]
end
empirical_cor_2 = StatsBase.cor(Y_AR)

  1.062309 seconds (5.52 M allocations: 360.511 MiB, 14.13% gc time, 64.17% compilation time)


2×2 Matrix{Float64}:
 1.0       0.446784
 0.446784  1.0

In [7]:
empirical_cor_2[1, 2]

0.4467842937984511

In [8]:
theoretical_rho_2_kurtosis = (ρtrue[1] * σ2true[1]) / (1 + ((di/2) * σ2true[1]) + (0.5 * (kt - 1) * σ2true[1]))

0.45

## simulate under GLMM AR(1) di = 2 <a class="anchor" id="ex0g"></a>

$d_i = 2$

$p = 0.5, \rho = 0.9, \sigma^2 = 1.0$

In [9]:
function __get_distribution(dist::Type{D}, μ) where D <: UnivariateDistribution
    return dist(μ)
end

for i in 1:samplesize
    X = ones(di, 1)
    η = X * βtrue
    # generate mvn response
    mvn_d = MvNormal(η, Γ_AR)
    mvn_η = rand(mvn_d)
    μ = GLM.linkinv.(link, mvn_η)
    y = Float64.(rand.(__get_distribution.(D, μ)))
    gcs[i] = GLMCopulaARObs(y, X, d, link)
end

# form model
gcm = GLMCopulaARModel(gcs);

N = length(gcm.data)
Y_AR = zeros(N, di)
for j in 1:di
    Y_AR[:, j] = [gcm.data[i].y[j] for i in 1:N]
end
empirical_cor_2_GLMM = StatsBase.cor(Y_AR)

2×2 Matrix{Float64}:
 1.0       0.163437
 0.163437  1.0

## AR(1) di = 5 <a class="anchor" id="ex1"></a>

$d_i = 5$

$p = 0.5, \rho = 0.9, \sigma^2 = 1.0$

In [10]:
di = 5 # number of observations per cluster
V_AR = get_V_AR(ρtrue[1], di)

# true Gamma
Γ_AR = σ2true[1] * V_AR

5×5 Matrix{Float64}:
 1.0     0.9    0.81  0.729  0.6561
 0.9     1.0    0.9   0.81   0.729
 0.81    0.9    1.0   0.9    0.81
 0.729   0.81   0.9   1.0    0.9
 0.6561  0.729  0.81  0.9    1.0

In [11]:
vecd = Vector{DiscreteUnivariateDistribution}(undef, di)
for i in 1:di
    vecd[i] = Bernoulli(p[1])
end
nonmixed_multivariate_dist = NonMixedMultivariateDistribution(vecd, Γ_AR)

Random.seed!(12345)
@time Y_nsample = simulate_nobs_independent_vectors(nonmixed_multivariate_dist, samplesize)

gcs = Vector{GLMCopulaARObs{T, D, Link}}(undef, samplesize)
for i in 1:samplesize
    X = ones(di, 1)             # intercept-only design matrix
    y = Float64.(Y_nsample[i])  # i-th simulated response vector
    gcs[i] = GLMCopulaARObs(y, X, d, link)
end

# form model
gcm = GLMCopulaARModel(gcs);

N = length(gcm.data)
Y_AR = zeros(N, di)
for j in 1:di
    Y_AR[:, j] = [gcm.data[i].y[j] for i in 1:N]
end
empirical_cor_5 = StatsBase.cor(Y_AR)

  1.216907 seconds (8.50 M allocations: 651.550 MiB, 39.56% gc time)


5×5 Matrix{Float64}:
 1.0       0.258186  0.229345  0.211046  0.18123
 0.258186  1.0       0.25702   0.22818   0.206759
 0.229345  0.25702   1.0       0.255039  0.229736
 0.211046  0.22818   0.255039  1.0       0.255299
 0.18123   0.206759  0.229736  0.255299  1.0

In [12]:
empirical_cor_5[1, 2]

0.25818607452235

In [13]:
theoretical_rho_5_kurtosis = (ρtrue[1] * σ2true[1]) / (1 + ((di/2) * σ2true[1]) + (0.5 * (kt - 1) * σ2true[1]))

0.2571428571428572
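
The notebook only states the formula for the adjacent pair $(Y_1, Y_2)$. If the numerator generalizes to the $(1, 1 + k)$ entry of $\Gamma$, that is $\sigma^2 \rho^k$ (an assumption made here for illustration, not something stated in this notebook), the whole first row of `empirical_cor_5` can be compared against theory. The empirical numbers below are copied from the printed matrix above.

```julia
ρ, σ2, kt, di = 0.9, 1.0, 1.0, 5
denom = 1 + 0.5 * σ2 * kt + 0.5 * σ2 * (di - 1)

# ASSUMPTION: the numerator for the pair (1, 1 + k) is Γ[1, 1 + k] = σ2 * ρ^k.
lag_theory = [σ2 * ρ^k / denom for k in 1:di-1]

# First row of empirical_cor_5 (off-diagonal entries), copied from the output above.
lag_empirical = [0.258186, 0.229345, 0.211046, 0.18123]

max_gap = maximum(abs.(lag_theory .- lag_empirical))
println("largest |theory - empirical| over lags 1 to 4: ", round(max_gap, digits = 4))
```

Under this assumption the correlations decay geometrically in the lag, in line with the roughly 0.9 ratio between neighbouring entries of the empirical row.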

## simulate under GLMM AR(1) di = 5 <a class="anchor" id="ex1g"></a>

$d_i = 5$

$p = 0.5, \rho = 0.9, \sigma^2 = 1.0$

In [14]:
for i in 1:samplesize
    X = ones(di, 1)
    η = X * βtrue
    # generate mvn response
    mvn_d = MvNormal(η, Γ_AR)
    mvn_η = rand(mvn_d)
    μ = GLM.linkinv.(link, mvn_η)
    y = Float64.(rand.(__get_distribution.(D, μ)))
    gcs[i] = GLMCopulaARObs(y, X, d, link)
end

# form model
gcm = GLMCopulaARModel(gcs);

N = length(gcm.data)
Y_AR = zeros(N, di)
for j in 1:di
    Y_AR[:, j] = [gcm.data[i].y[j] for i in 1:N]
end
empirical_cor_5_GLMM = StatsBase.cor(Y_AR)

5×5 Matrix{Float64}:
 1.0      0.15434   0.14428   0.12084   0.11234
 0.15434  1.0       0.156107  0.140023  0.127961
 0.14428  0.156107  1.0       0.15271   0.136977
 0.12084  0.140023  0.15271   1.0       0.151938
 0.11234  0.127961  0.136977  0.151938  1.0

## AR(1) di = 10 <a class="anchor" id="ex2"></a>

$d_i = 10$

$p = 0.5, \rho = 0.9, \sigma^2 = 1.0$

In [15]:
di = 10 # number of observations per cluster
V_AR = get_V_AR(ρtrue[1], di)

# true Gamma
Γ_AR = σ2true[1] * V_AR

10×10 Matrix{Float64}:
 1.0       0.9       0.81      0.729     …  0.478297  0.430467  0.38742
 0.9       1.0       0.9       0.81         0.531441  0.478297  0.430467
 0.81      0.9       1.0       0.9          0.59049   0.531441  0.478297
 0.729     0.81      0.9       1.0          0.6561    0.59049   0.531441
 0.6561    0.729     0.81      0.9          0.729     0.6561    0.59049
 0.59049   0.6561    0.729     0.81      …  0.81      0.729     0.6561
 0.531441  0.59049   0.6561    0.729        0.9       0.81      0.729
 0.478297  0.531441  0.59049   0.6561       1.0       0.9       0.81
 0.430467  0.478297  0.531441  0.59049      0.9       1.0       0.9
 0.38742   0.430467  0.478297  0.531441     0.81      0.9       1.0

In [16]:
vecd = Vector{DiscreteUnivariateDistribution}(undef, di)
for i in 1:di
    vecd[i] = Bernoulli(p[1])
end
nonmixed_multivariate_dist = NonMixedMultivariateDistribution(vecd, Γ_AR)

Random.seed!(12345)
@time Y_nsample = simulate_nobs_independent_vectors(nonmixed_multivariate_dist, samplesize)

gcs = Vector{GLMCopulaARObs{T, D, Link}}(undef, samplesize)
for i in 1:samplesize
    X = ones(di, 1)             # intercept-only design matrix
    y = Float64.(Y_nsample[i])  # i-th simulated response vector
    gcs[i] = GLMCopulaARObs(y, X, d, link)
end

# form model
gcm = GLMCopulaARModel(gcs);

N = length(gcm.data)
Y_AR = zeros(N, di)
for j in 1:di
    Y_AR[:, j] = [gcm.data[i].y[j] for i in 1:N]
end
empirical_cor_10 = StatsBase.cor(Y_AR)

  2.352045 seconds (16.50 M allocations: 1.691 GiB, 37.77% gc time)


10×10 Matrix{Float64}:
 1.0        0.152581   0.134902   …  0.0821512  0.076019   0.0656225
 0.152581   1.0        0.156398      0.0842331  0.0803215  0.0727975
 0.134902   0.156398   1.0           0.0994662  0.0922426  0.0823556
 0.117982   0.134158   0.150556      0.109225   0.101123   0.0869153
 0.109059   0.119562   0.133603      0.124536   0.106879   0.0966836
 0.102558   0.115143   0.122985   …  0.137842   0.123098   0.107385
 0.0935395  0.101401   0.112001      0.14949    0.133399   0.120562
 0.0821512  0.0842331  0.0994662     1.0        0.149575   0.131505
 0.076019   0.0803215  0.0922426     0.149575   1.0        0.152643
 0.0656225  0.0727975  0.0823556     0.131505   0.152643   1.0

In [17]:
empirical_cor_10[1, 2]

0.15258130933657935

In [18]:
theoretical_rho_10_kurtosis = (ρtrue[1] * σ2true[1]) / (1 + ((di/2) * σ2true[1]) + (0.5 * (kt - 1) * σ2true[1]))

0.15

## simulate under GLMM AR(1) di = 10 <a class="anchor" id="ex2g"></a>

$d_i = 10$

$p = 0.5, \rho = 0.9, \sigma^2 = 1.0$

In [19]:
for i in 1:samplesize
    X = ones(di, 1)
    η = X * βtrue
    # generate mvn response
    mvn_d = MvNormal(η, Γ_AR)
    mvn_η = rand(mvn_d)
    μ = GLM.linkinv.(link, mvn_η)
    y = Float64.(rand.(__get_distribution.(D, μ)))
    gcs[i] = GLMCopulaARObs(y, X, d, link)
end

# form model
gcm = GLMCopulaARModel(gcs);

N = length(gcm.data)
Y_AR = zeros(N, di)
for j in 1:di
    Y_AR[:, j] = [gcm.data[i].y[j] for i in 1:N]
end
empirical_cor_10_GLMM = StatsBase.cor(Y_AR)

10×10 Matrix{Float64}:
 1.0        0.156016   0.140935   …  0.0850477  0.0734744  0.0636699
 0.156016   1.0        0.159432      0.0881954  0.0814053  0.0692936
 0.140935   0.159432   1.0           0.0994758  0.0921649  0.0855342
 0.129055   0.141906   0.156065      0.117183   0.101938   0.0933635
 0.114229   0.124491   0.13637       0.123885   0.115276   0.105107
 0.104244   0.107038   0.124598   …  0.139079   0.122961   0.115499
 0.0907107  0.0989415  0.115384      0.156513   0.139453   0.121489
 0.0850477  0.0881954  0.0994758     1.0        0.154323   0.134737
 0.0734744  0.0814053  0.0921649     0.154323   1.0        0.153304
 0.0636699  0.0692936  0.0855342     0.134737   0.153304   1.0

## AR(1) di = 25 <a class="anchor" id="ex3"></a>

$d_i = 25$

$p = 0.5, \rho = 0.9, \sigma^2 = 1.0$

In [20]:
di = 25 # number of observations per cluster
V_AR = get_V_AR(ρtrue[1], di)

# true Gamma
Γ_AR = σ2true[1] * V_AR

25×25 Matrix{Float64}:
 1.0        0.9        0.81       …  0.0984771  0.0886294  0.0797664
 0.9        1.0        0.9           0.109419   0.0984771  0.0886294
 0.81       0.9        1.0           0.121577   0.109419   0.0984771
 0.729      0.81       0.9           0.135085   0.121577   0.109419
 0.6561     0.729      0.81          0.150095   0.135085   0.121577
 0.59049    0.6561     0.729      …  0.166772   0.150095   0.135085
 0.531441   0.59049    0.6561        0.185302   0.166772   0.150095
 0.478297   0.531441   0.59049       0.205891   0.185302   0.166772
 0.430467   0.478297   0.531441      0.228768   0.205891   0.185302
 0.38742    0.430467   0.478297      0.254187   0.228768   0.205891
 0.348678   0.38742    0.430467   …  0.28243    0.254187   0.228768
 0.313811   0.348678   0.38742       0.313811   0.28243    0.254187
 0.28243    0.313811   0.348678      0.348678   0.313811   0.28243
 0.254187   0.28243    0.313811      0.38742    0.348678   0.313811
 0.228768   0.254187   

In [21]:
vecd = Vector{DiscreteUnivariateDistribution}(undef, di)
for i in 1:di
    vecd[i] = Bernoulli(p[1])
end
nonmixed_multivariate_dist = NonMixedMultivariateDistribution(vecd, Γ_AR)

Random.seed!(12345)
@time Y_nsample = simulate_nobs_independent_vectors(nonmixed_multivariate_dist, samplesize)

gcs = Vector{GLMCopulaARObs{T, D, Link}}(undef, samplesize)
for i in 1:samplesize
    X = ones(di, 1)             # intercept-only design matrix
    y = Float64.(Y_nsample[i])  # i-th simulated response vector
    gcs[i] = GLMCopulaARObs(y, X, d, link)
end

# form model
gcm = GLMCopulaARModel(gcs);

N = length(gcm.data)
Y_AR = zeros(N, di)
for j in 1:di
    Y_AR[:, j] = [gcm.data[i].y[j] for i in 1:N]
end
empirical_cor_25 = StatsBase.cor(Y_AR)

  9.664454 seconds (40.50 M allocations: 11.326 GiB, 30.31% gc time)


25×25 Matrix{Float64}:
 1.0         0.0621801   0.0630552   …  0.00840259  0.00960318  0.00196945
 0.0621801   1.0         0.0703015      0.00922001  0.0127      0.0129602
 0.0630552   0.0703015   1.0            0.00323209  0.00747036  0.00837325
 0.0513142   0.0628203   0.068101       0.00732362  0.00748445  0.00164521
 0.0541197   0.05014     0.0602825      0.0148402   0.00920028  0.00369914
 0.044204    0.0511401   0.0559094   …  0.0123578   0.0189173   0.00610926
 0.0393521   0.0423803   0.0485472      0.0122049   0.011526    0.00552026
 0.0354796   0.04354     0.0475225      0.0160403   0.00656034  0.010019
 0.0343134   0.0353802   0.0417824      0.0168441   0.019965    0.0181639
 0.0298228   0.0372      0.0372125      0.0199385   0.0124581   0.0105266
 0.0294739   0.0285801   0.0319003   …  0.0214037   0.0199246   0.0227654
 0.0313145   0.0239401   0.0298184      0.0214434   0.0196041   0.0220868
 0.025957    0.02314     0.0254501      0.0279218   0.0243622   0.0218529
 0.0217899

In [22]:
empirical_cor_25[1, 2]

0.0621801371073535

In [23]:
theoretical_rho_25_kurtosis = (ρtrue[1] * σ2true[1]) / (1 + ((di/2) * σ2true[1]) + (0.5 * (kt - 1) * σ2true[1]))

0.06666666666666667

## simulate under GLMM AR(1) di = 25 <a class="anchor" id="ex3g"></a>

$d_i = 25$

$p = 0.5, \rho = 0.9, \sigma^2 = 1.0$

In [24]:
for i in 1:samplesize
    X = ones(di, 1)
    η = X * βtrue
    # generate mvn response
    mvn_d = MvNormal(η, Γ_AR)
    mvn_η = rand(mvn_d)
    μ = GLM.linkinv.(link, mvn_η)
    y = Float64.(rand.(__get_distribution.(D, μ)))
    gcs[i] = GLMCopulaARObs(y, X, d, link)
end

# form model
gcm = GLMCopulaARModel(gcs);

N = length(gcm.data)
Y_AR = zeros(N, di)
for j in 1:di
    Y_AR[:, j] = [gcm.data[i].y[j] for i in 1:N]
end
empirical_cor_25_GLMM = StatsBase.cor(Y_AR)

25×25 Matrix{Float64}:
 1.0        0.155622   0.141683   …  0.0152764  0.0180976  0.0157831
 0.155622   1.0        0.152166      0.021517   0.0130179  0.0194992
 0.141683   0.152166   1.0           0.0191773  0.0168381  0.0215975
 0.130843   0.143599   0.156738      0.0209031  0.0168822  0.0191628
 0.117988   0.129747   0.136206      0.029021   0.0245607  0.0235746
 0.103411   0.115489   0.123428   …  0.0329214  0.026101   0.0244324
 0.0990966  0.103377   0.113837      0.0307393  0.0267995  0.0265043
 0.0833483  0.0943503  0.105771      0.0331982  0.0341388  0.0235315
 0.0714781  0.081722   0.0903038     0.0424971  0.0309579  0.033001
 0.068143   0.0755424  0.0877622     0.0411003  0.0425602  0.0373786
 0.0633498  0.0661048  0.0758429  …  0.0478448  0.0372634  0.0336167
 0.0536684  0.0604669  0.0634063     0.0517012  0.0489608  0.0422939
 0.0479848  0.0549039  0.0655236     0.0594606  0.0548004  0.048617
 0.0428902  0.0450954  0.0561979     0.0647167  0.0581379  0.0531483
 0.0341522  0

## Comparisons 

##  1. Theoretical vs. Empirical Correlation simulated under QC <a class="anchor" id="ex4"></a>

### Takeaway: Quasi-Copula correlation IS a function of $d_i$, $kt$, $\rho$ and $\sigma^2$


The theoretical correlation is a function of $d_i, \rho, \sigma^2, kt(p)$:

$$Corr(Y_1, Y_2)
 = \frac{\rho * \sigma^2}{1 + \frac{1}{2} * \sigma^2 * kt + \frac{1}{2} * \sigma^2 (d_i - 1)}$$

Since $p = 0.5$, the kurtosis is $kt = kt(p) = 1$.

Let's see what happens to the theoretical and empirical correlations when we fix $\rho = 0.9, \sigma^2 = 1.0$ and vary the cluster size $d_i$:

$$Corr(Y_1, Y_2)
 = \frac{0.9}{1 + \frac{1}{2} * 1 + \frac{1}{2} * (d_i - 1)} = \frac{0.9}{1 + \frac{d_i}{2}}$$


In [25]:
# di = 2
[theoretical_rho_2_kurtosis empirical_cor_2[1, 2]]

1×2 Matrix{Float64}:
 0.45  0.446784

In [26]:
# di = 5
[theoretical_rho_5_kurtosis empirical_cor_5[1, 2]]

1×2 Matrix{Float64}:
 0.257143  0.258186

In [27]:
# di = 10
[theoretical_rho_10_kurtosis empirical_cor_10[1, 2]]

1×2 Matrix{Float64}:
 0.15  0.152581

In [28]:
# di = 25
[theoretical_rho_25_kurtosis empirical_cor_25[1, 2]]

1×2 Matrix{Float64}:
 0.0666667  0.0621801
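
To judge how close the four pairs above are, the gaps can be set against a rough sampling error. The sketch below copies the printed values and uses the large-sample approximation $(1 - r^2)/\sqrt{N}$ for the standard error of a sample correlation. That approximation is derived for bivariate normal data, so for binary responses it is only a yardstick and is an assumption of this sketch.

```julia
cluster_sizes = [2, 5, 10, 25]
theoretical = [0.45, 0.257143, 0.15, 0.0666667]         # copied from the outputs above
empirical   = [0.446784, 0.258186, 0.152581, 0.0621801] # copied from the outputs above

abs_gap = abs.(theoretical .- empirical)

# Rough standard error of a sample correlation with N = samplesize clusters.
N = 100_000
approx_se = (1 .- theoretical .^ 2) ./ sqrt(N)

for (di, g, se) in zip(cluster_sizes, abs_gap, approx_se)
    println("di = ", di, "  |gap| = ", round(g, digits = 4),
            "  approx SE = ", round(se, digits = 4))
end
```

All four gaps are below 0.005, the same order of magnitude as the approximate standard error.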

##  2. Empirical Correlation simulated under GLMM<a class="anchor" id="ex4g"></a>

### Takeaway: GLMM correlation is NOT a function of $d_i$

In [29]:
# glmm correlation di = 2 
empirical_cor_2_GLMM[1, 2]

0.16343678921893148

In [30]:
# glmm correlation di = 5
empirical_cor_5_GLMM[1, 2]

0.15434037602803538

In [31]:
# glmm correlation di = 10
empirical_cor_10_GLMM[1, 2]

0.1560161150813794

In [32]:
# glmm correlation di = 25
empirical_cor_25_GLMM[1, 2]

0.15562157141017954
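
One way to see why the GLMM values do not move with $d_i$: the pair $(\eta_1, \eta_2)$ is always bivariate normal with variance $\sigma^2$ and correlation $\rho$, whatever the size of the full vector, because marginals of a multivariate normal do not depend on the other coordinates. The sketch below simulates only that pair, assuming $\beta = 0$ and $\sigma^2 = 1$ as in this notebook. It uses the `Random` and `Statistics` standard libraries rather than GLMCopula, and the names `logistic` and `glmm_pair_cor` are illustrative.

```julia
using Random, Statistics

logistic(x) = 1 / (1 + exp(-x))

# Draw N pairs (Y1, Y2) whose latent linear predictors are bivariate normal
# with unit variances and correlation ρ, then return the sample correlation.
function glmm_pair_cor(ρ, N; rng = Random.default_rng())
    y1 = zeros(N)
    y2 = zeros(N)
    for i in 1:N
        z1, z2 = randn(rng), randn(rng)
        η1 = z1
        η2 = ρ * z1 + sqrt(1 - ρ^2) * z2   # Corr(η1, η2) = ρ
        y1[i] = rand(rng) < logistic(η1) ? 1.0 : 0.0
        y2[i] = rand(rng) < logistic(η2) ? 1.0 : 0.0
    end
    return cor(y1, y2)
end

r = glmm_pair_cor(0.9, 200_000; rng = MersenneTwister(2024))
```

The result should land close to the four GLMM values above, and no cluster size appears anywhere in the computation.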