# Real Data Examples
In this notebook we compare the fit of our quasi-copula model with that of the R packages geepack and gcmr on two example datasets.

## Table of Contents:
* [Example 1: Poisson Base (gcmr: Epilepsy)](#ex1)
* [Example 2: Bernoulli Base (geepack: Respiratory)](#ex2)

For these examples we will try both the autoregressive AR(1) and the compound symmetry (CS) parameterizations of the covariance matrix $\Gamma$, estimating the correlation parameter $\rho$ and the dispersion parameter $\sigma^2$. After fitting our models, for each example we compare our estimates to those of the geepack and gcmr packages, called from Julia using `RCall`.
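
For concreteness, the two structures can be written out for a small cluster. The sketch below assumes the covariance factors as $\Gamma = \sigma^2 V(\rho)$, where $V(\rho)$ is the AR(1) or CS correlation matrix; the helper names `ar1_corr` and `cs_corr` are illustrative and are not part of GLMCopula.

```julia
using LinearAlgebra

# AR(1) correlation: entry (i, j) is ρ^|i - j|, so correlation decays with lag.
ar1_corr(ρ, n) = [ρ^abs(i - j) for i in 1:n, j in 1:n]

# Compound symmetry: ones on the diagonal, a common ρ everywhere else.
cs_corr(ρ, n) = [i == j ? one(ρ) : ρ for i in 1:n, j in 1:n]

# Illustrative values only (not estimates from this notebook).
ρ, σ2, n = 0.5, 2.0, 4

Γ_ar = σ2 .* ar1_corr(ρ, n)   # assumed form Γ = σ² V(ρ)
Γ_cs = σ2 .* cs_corr(ρ, n)

display(Γ_ar)
display(Γ_cs)
```

Under AR(1) the off-diagonal entries shrink as observations get further apart in time, while under CS every pair of observations in a cluster shares the same covariance.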

    Note: For the dispersion parameter, we can add an L2 penalty to the loglikelihood to keep the estimates from going off to infinity. This notebook presents results from the unpenalized fit.
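
One way such a penalty could look is a ridge term subtracted from the loglikelihood. This is only a sketch: the penalty weight `λ` and the function name `penalized_loglik` are hypothetical, and the exact form used by GLMCopula may differ.

```julia
# Ridge-penalized objective: loglikelihood minus λ times the squared L2 norm of σ2.
# `λ` and `penalized_loglik` are illustrative names, not GLMCopula API.
penalized_loglik(loglik::Real, σ2::AbstractVector; λ::Real = 1.0) =
    loglik - λ * sum(abs2, σ2)

# The larger σ2 grows, the more the objective is pulled down.
small = penalized_loglik(-100.0, [1.0]; λ = 0.5)    # -100.5
large = penalized_loglik(-100.0, [10.0]; λ = 0.5)   # -150.0
```

With a term of this kind, a very large dispersion estimate is only accepted if it improves the loglikelihood by more than the penalty it incurs.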
    

In [1]:
versioninfo()

Julia Version 1.6.2
Commit 1b93d53fc4 (2021-07-14 15:36 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)


In [2]:
using CSV, DataFrames, GLMCopula, LinearAlgebra, GLM, RCall, RData, RDatasets

## Example 1: Poisson Base  <a class="anchor" id="ex1"></a>


We fit the quasi-copula model with Poisson base to the "epilepsy" dataset from the R package "gcmr", using the AR(1) and CS parameterizations of the covariance.


In [3]:
R"""
    library("gcmr")
    data("epilepsy", package = "gcmr")
"""
@rget epilepsy;



Let's preview the first 10 rows of the epilepsy dataset.

In [4]:
epilepsy[1:10, :]

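| Row | id | age | trt | counts | time | visit |
|---|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64 | Float64 | Float64 |
| 1 | 1 | 31 | 0 | 11 | 8.0 | 0.0 |
| 2 | 1 | 31 | 0 | 5 | 2.0 | 1.0 |
| 3 | 1 | 31 | 0 | 3 | 2.0 | 1.0 |
| 4 | 1 | 31 | 0 | 3 | 2.0 | 1.0 |
| 5 | 1 | 31 | 0 | 3 | 2.0 | 1.0 |
| 6 | 2 | 30 | 0 | 11 | 8.0 | 0.0 |
| 7 | 2 | 30 | 0 | 3 | 2.0 | 1.0 |
| 8 | 2 | 30 | 0 | 5 | 2.0 | 1.0 |
| 9 | 2 | 30 | 0 | 3 | 2.0 | 1.0 |
| 10 | 2 | 30 | 0 | 3 | 2.0 | 1.0 |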


### Forming the Models

To form the model, we give it the following arguments:

- named dataframe
- outcome variable name of interest as a symbol
- grouping variable name of interest as a symbol
- covariate names of interest as a vector of symbols
- base distribution
- link function

In [5]:
df = epilepsy
y = :counts
grouping = :id
covariates = [:visit, :trt]
d = Poisson()
link = LogLink()

# forming AR(1) model with Poisson base
Poisson_AR_model = AR_model(df, y, grouping, covariates, d, link);

# forming CS model with Poisson base
Poisson_CS_model = CS_model(df, y, grouping, covariates, d, link);

Fit the AR(1) model with Poisson base

In [6]:
GLMCopula.fit!(Poisson_AR_model, IpoptSolver(print_level = 3, max_iter = 100, tol = 10^-8, limited_memory_max_history = 20, hessian_approximation = "limited-memory"));

initializing β using Newton's Algorithm under Independence Assumption
initializing variance components using MM-Algorithm

******************************************************************************
This program contains Ipopt, a library for large-scale nonlinear optimization.
 Ipopt is released as open source code under the Eclipse Public License (EPL).
         For more information visit https://github.com/coin-or/Ipopt
******************************************************************************

Total number of variables............................:        5
                     variables with only lower bounds:        1
                variables with lower and upper bounds:        1
                     variables with only upper bounds:        0
Total number of equality constraints.................:        0
Total number of inequality constraints...............:        0
        inequality constraints with only lower bounds:        0
   inequality constraints with lower and up

Fit the CS model with Poisson base

In [7]:
GLMCopula.fit!(Poisson_CS_model, IpoptSolver(print_level = 3, max_iter = 100, tol = 10^-8, limited_memory_max_history = 20, hessian_approximation = "limited-memory"));

initializing β using Newton's Algorithm under Independence Assumption
initializing σ2 and ρ using method of moments
par0 = [3.4553888182001127, -1.3288345481797776, -0.026384675366883565, 0.2, 1.0]
Total number of variables............................:        5
                     variables with only lower bounds:        1
                variables with lower and upper bounds:        1
                     variables with only upper bounds:        0
Total number of equality constraints.................:        0
Total number of inequality constraints...............:        0
        inequality constraints with only lower bounds:        0
   inequality constraints with lower and upper bounds:        0
        inequality constraints with only upper bounds:        0


Number of Iterations....: 92

                                   (scaled)                 (unscaled)
Objective...............:   2.1695813266189330e+03    2.1695813266189330e+03
Dual infeasibility......:   9.1969454274476448

We can take a look at the MLEs for the AR(1) model with Poisson base

In [8]:
@show Poisson_AR_model.β
@show Poisson_AR_model.σ2
@show Poisson_AR_model.ρ;

Poisson_AR_model.β = [3.477001434224112, -1.3123828083857931, -0.06552858740032046]
Poisson_AR_model.σ2 = [96534.17924265226]
Poisson_AR_model.ρ = [0.9499485377184236]
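
Because the Poisson base uses a log link, exponentiating a coefficient gives its multiplicative effect on the mean count of the base distribution. The sketch below copies the AR(1) estimates printed above; the ordering (intercept, `visit`, `trt`) is an assumption based on the order of `covariates`, with the intercept first.

```julia
# AR(1) Poisson estimates copied from the output above.
β_ar = [3.477001434224112, -1.3123828083857931, -0.06552858740032046]

# Assumed coefficient order: intercept first, then the covariates as listed.
labels = ["(Intercept)", "visit", "trt"]

# exp(β) is the multiplicative change in the base mean per unit change in the covariate.
rate_ratios = exp.(β_ar)

for (label, rr) in zip(labels, rate_ratios)
    println(rpad(label, 12), round(rr, digits = 3))
end
```

A value below 1 for `visit` means the expected count at the follow-up visits is lower than at baseline, holding treatment fixed.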


We can take a look at the MLEs for the CS model with Poisson base

In [9]:
@show Poisson_CS_model.β
@show Poisson_CS_model.σ2
@show Poisson_CS_model.ρ;

Poisson_CS_model.β = [3.479229893235856, -1.3137359424301136, -0.05223916238672295]
Poisson_CS_model.σ2 = [133560.55732336888]
Poisson_CS_model.ρ = [0.9101785804195232]


Calculate the loglikelihood at the maximum for the AR(1) model with Poisson base

In [10]:
@show loglikelihood!(Poisson_AR_model, false, false);

loglikelihood!(Poisson_AR_model, false, false) = -2168.8986516850264


Calculate the loglikelihood at the maximum for the CS model with Poisson base

In [11]:
@show loglikelihood!(Poisson_CS_model, true, true);

loglikelihood!(Poisson_CS_model, true, true) = -2169.581326618933
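
Since both quasi-copula fits have the same number of parameters, they can also be put on the AIC scale, $\mathrm{AIC} = 2k - 2\ell$. The sketch below assumes $k = 5$ (three regression coefficients plus $\rho$ and $\sigma^2$, matching the "Total number of variables" line in the Ipopt output); the helper `aic` is illustrative and not a GLMCopula function.

```julia
# AIC from a maximized loglikelihood and a parameter count.
aic(loglik, k) = 2k - 2loglik

# Loglikelihoods copied from the output above; k = 5 is an assumption (β has 3 entries, plus ρ and σ2).
aic_ar = aic(-2168.8986516850264, 5)
aic_cs = aic(-2169.581326618933, 5)

println("AR(1): ", round(aic_ar, digits = 2))
println("CS:    ", round(aic_cs, digits = 2))
```

These values are meant for comparing the two quasi-copula fits with each other. The quasi-copula and Gaussian copula models define different likelihoods, so set any comparison against AIC values reported by other packages beside the parameter estimates rather than reading it alone.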


##### Using geepack

Analyzing the epilepsy data under the AR(1) covariance using geepack we have:

In [12]:
R"""
    library("geepack")
    data("epilepsy", package = "gcmr")
    gee.ar1 <- geeglm(counts ~ 1 + visit + trt,
    data = epilepsy, id = id, family = poisson,
    corstr = "ar1")

    summary(gee.ar1)
"""



RObject{VecSxp}

Call:
geeglm(formula = counts ~ 1 + visit + trt, family = poisson, 
    data = epilepsy, id = id, corstr = "ar1")

 Coefficients:
            Estimate  Std.err   Wald Pr(>|W|)    
(Intercept)  3.35097  0.18684 321.67   <2e-16 ***
visit       -1.29785  0.13039  99.07   <2e-16 ***
trt          0.08819  0.22758   0.15    0.698    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation structure = ar1 
Estimated Scale Parameters:

            Estimate Std.err
(Intercept)    19.52   8.395
  Link = identity 

Estimated Correlation Parameters:
      Estimate Std.err
alpha   0.8882 0.04193
Number of clusters:   59  Maximum cluster size: 5 


Analyzing the epilepsy data under the CS covariance using geepack we have:

In [13]:
R"""
    library("geepack")
    data("epilepsy", package = "gcmr")
    gee.cs <- geeglm(counts ~ 1 + visit + trt,
    data = epilepsy, id = id, family = poisson,
    corstr = "exchangeable")

    summary(gee.cs)
"""

RObject{VecSxp}

Call:
geeglm(formula = counts ~ 1 + visit + trt, family = poisson, 
    data = epilepsy, id = id, corstr = "exchangeable")

 Coefficients:
            Estimate Std.err   Wald Pr(>|W|)    
(Intercept)   3.4022  0.1684 408.16   <2e-16 ***
visit        -1.3288  0.1055 158.73   <2e-16 ***
trt           0.0737  0.2130   0.12     0.73    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation structure = exchangeable 
Estimated Scale Parameters:

            Estimate Std.err
(Intercept)     19.1    8.28
  Link = identity 

Estimated Correlation Parameters:
      Estimate Std.err
alpha     0.77  0.0848
Number of clusters:   59  Maximum cluster size: 5 


##### Using gcmr

Analyzing the epilepsy data under the AR(1) covariance using gcmr we have:

In [14]:
R"""
    library("gcmr")
    data("epilepsy", package = "gcmr")
    mod.ar <- gcmr(counts ~ 1 + visit + trt,
    data = epilepsy, marginal = poisson.marg(link = "log"),
    cormat = cluster.cormat(id, "ar1"))

    summary(mod.ar)
"""

RObject{VecSxp}

Call:
gcmr(formula = counts ~ 1 + visit + trt, data = epilepsy, marginal = poisson.marg(link = "log"), 
    cormat = cluster.cormat(id, "ar1"))


Coefficients marginal model:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   3.3439     0.0270  123.76   <2e-16 ***
visit        -1.2202     0.0271  -45.09   <2e-16 ***
trt          -0.2455     0.0285   -8.62   <2e-16 ***

Coefficients Gaussian copula:
    Estimate Std. Error z value Pr(>|z|)    
ar1   0.4527     0.0193    23.5   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

log likelihood =   1391,  AIC = 2790


Analyzing the epilepsy data under the CS covariance using gcmr we have:

In [15]:
R"""
    library("gcmr")
    data("epilepsy", package = "gcmr")
    mod.cs <- gcmr(counts ~ 1 + visit + trt,
    data = epilepsy, marginal = poisson.marg(link = "log"),
    cormat = cluster.cormat(id, "exchangeable"))

    summary(mod.cs)
"""

RObject{VecSxp}

Call:
gcmr(formula = counts ~ 1 + visit + trt, data = epilepsy, marginal = poisson.marg(link = "log"), 
    cormat = cluster.cormat(id, "exchangeable"))


Coefficients marginal model:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   3.4344     0.0132   259.5   <2e-16 ***
visit        -1.3057     0.0173   -75.5   <2e-16 ***
trt          -0.2639     0.0233   -11.3   <2e-16 ***

Coefficients Gaussian copula:
    Estimate Std. Error z value Pr(>|z|)    
tau   0.3615     0.0178    20.4   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

log likelihood =   1331,  AIC = 2670


## Example 2: Bernoulli Base  <a class="anchor" id="ex2"></a>

We fit the quasi-copula model with Bernoulli base to the "respiratory" dataset from the R package "geepack", using the AR(1) and CS parameterizations of the covariance.

In [16]:
R"""
    data(respiratory, package="geepack")
    respiratory_df <- respiratory[order(respiratory$id),]
"""

@rget respiratory_df;

Let's preview the first 10 rows of the respiratory dataset in long format.

In [17]:
respiratory_df[1:10, :]

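| Row | center | id | treat | sex | age | baseline | visit | outcome |
|---|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | Cat… | Cat… | Int64 | Int64 | Int64 | Int64 |
| 1 | 1 | 1 | P | M | 46 | 0 | 1 | 0 |
| 2 | 1 | 1 | P | M | 46 | 0 | 2 | 0 |
| 3 | 1 | 1 | P | M | 46 | 0 | 3 | 0 |
| 4 | 1 | 1 | P | M | 46 | 0 | 4 | 0 |
| 5 | 2 | 1 | P | F | 39 | 0 | 1 | 0 |
| 6 | 2 | 1 | P | F | 39 | 0 | 2 | 0 |
| 7 | 2 | 1 | P | F | 39 | 0 | 3 | 0 |
| 8 | 2 | 1 | P | F | 39 | 0 | 4 | 0 |
| 9 | 1 | 2 | P | M | 28 | 0 | 1 | 0 |
| 10 | 1 | 2 | P | M | 28 | 0 | 2 | 0 |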


### Forming the Models

To form the model, we give it the following arguments:

- named dataframe
- outcome variable name of interest as a symbol
- grouping variable name of interest as a symbol
- covariate names of interest as a vector of symbols
- base distribution
- link function

In [18]:
df = respiratory_df
y = :outcome
grouping = :id
covariates = [:center, :age, :baseline]
d = Bernoulli()
link = LogitLink()

# forming AR(1) model with Bernoulli base
Bernoulli_AR_model = AR_model(df, y, grouping, covariates, d, link);

# forming CS model with Bernoulli base
Bernoulli_CS_model = CS_model(df, y, grouping, covariates, d, link);

Fit the AR(1) model with Bernoulli base

In [19]:
GLMCopula.fit!(Bernoulli_AR_model, IpoptSolver(print_level = 3, max_iter = 100, tol = 10^-8, limited_memory_max_history = 20, hessian_approximation = "limited-memory"));

initializing β using Newton's Algorithm under Independence Assumption
initializing variance components using MM-Algorithm
Total number of variables............................:        6
                     variables with only lower bounds:        1
                variables with lower and upper bounds:        1
                     variables with only upper bounds:        0
Total number of equality constraints.................:        0
Total number of inequality constraints...............:        0
        inequality constraints with only lower bounds:        0
   inequality constraints with lower and upper bounds:        0
        inequality constraints with only upper bounds:        0


Number of Iterations....: 94

                                   (scaled)                 (unscaled)
Objective...............:   2.4055710825065927e+02    2.4055710825065927e+02
Dual infeasibility......:   1.4119798663614347e-09    1.4119798663614347e-09
Constraint violation....:   0.000000000000000

Fit the CS model with Bernoulli base

In [20]:
GLMCopula.fit!(Bernoulli_CS_model, IpoptSolver(print_level = 3, max_iter = 100, tol = 10^-8, limited_memory_max_history = 20, hessian_approximation = "limited-memory"));

initializing β using Newton's Algorithm under Independence Assumption
initializing σ2 and ρ using method of moments
par0 = [-0.7993115972643741, 0.6513519744878128, -0.018744798735221512, 1.6766993967179145, 0.2, 1.0]
Total number of variables............................:        6
                     variables with only lower bounds:        1
                variables with lower and upper bounds:        1
                     variables with only upper bounds:        0
Total number of equality constraints.................:        0
Total number of inequality constraints...............:        0
        inequality constraints with only lower bounds:        0
   inequality constraints with lower and upper bounds:        0
        inequality constraints with only upper bounds:        0


Number of Iterations....: 27

                                   (scaled)                 (unscaled)
Objective...............:   2.4900152024121175e+02    2.4900152024121175e+02
Dual infeasibility......: 

We can take a look at the MLEs of the AR(1) model with Bernoulli base

In [21]:
@show Bernoulli_AR_model.β
@show Bernoulli_AR_model.σ2
@show Bernoulli_AR_model.ρ;

Bernoulli_AR_model.β = [-0.858664409049024, 0.8334076581881305, -0.026953129746342567, 2.103267661442157]
Bernoulli_AR_model.σ2 = [306890.7562627383]
Bernoulli_AR_model.ρ = [0.7813892966990003]
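
With a logit link, the fitted coefficients map a covariate vector to a success probability of the Bernoulli base through the inverse logit. The sketch below uses the AR(1) estimates printed above and the covariates of the first row of the respiratory preview (center = 1, age = 46, baseline = 0); the coefficient ordering with the intercept first is an assumption based on the order of `covariates`.

```julia
using LinearAlgebra  # for dot

# AR(1) Bernoulli estimates copied from the output above.
β_ar = [-0.858664409049024, 0.8334076581881305, -0.026953129746342567, 2.103267661442157]

# Assumed order: intercept, center, age, baseline (values from the first previewed row).
x = [1.0, 1.0, 46.0, 0.0]

η = dot(x, β_ar)          # linear predictor on the logit scale
p = 1 / (1 + exp(-η))     # inverse logit gives the base-distribution mean

println("linear predictor η = ", round(η, digits = 3))
println("base probability p = ", round(p, digits = 3))
```

Setting the last entry of `x` to 1.0 instead shows how strongly a positive baseline status raises the fitted probability, since `baseline` has by far the largest coefficient.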


We can take a look at the MLEs of the CS model with Bernoulli base

In [22]:
@show Bernoulli_CS_model.β
@show Bernoulli_CS_model.σ2
@show Bernoulli_CS_model.ρ;

Bernoulli_CS_model.β = [-0.8073560280686407, 0.8553513879813671, -0.027821756706670475, 2.0702779503223048]
Bernoulli_CS_model.σ2 = [0.35242279695900536]
Bernoulli_CS_model.ρ = [0.8734724569746284]


Calculate the loglikelihood at the maximum of the AR(1) model with Bernoulli base

In [23]:
@show loglikelihood!(Bernoulli_AR_model, false, false);

loglikelihood!(Bernoulli_AR_model, false, false) = -240.55710825065927


Calculate the loglikelihood at the maximum of the CS model with Bernoulli base

In [25]:
@show loglikelihood!(Bernoulli_CS_model, true, true);

loglikelihood!(Bernoulli_CS_model, true, true) = -249.00152024121175


##### Using gcmr

Analyzing the respiratory data under the AR(1) covariance using gcmr we have:

In [26]:
R"""
    library("geepack")
    data(respiratory, package="geepack")
    respiratory_df <- respiratory[order(respiratory$id),]

    mod.ar <- gcmr(outcome ~ center + age + baseline,
     data = respiratory_df, marginal = binomial.marg(link = "logit"),
     cormat = cluster.cormat(id, "ar1"))
    summary(mod.ar)
"""

RObject{VecSxp}

Call:
gcmr(formula = outcome ~ center + age + baseline, data = respiratory_df, 
    marginal = binomial.marg(link = "logit"), cormat = cluster.cormat(id, 
        "ar1"))


Coefficients marginal model:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.9660     0.4555   -2.12    0.034 *  
center        0.7669     0.2581    2.97    0.003 ** 
age          -0.0179     0.0104   -1.72    0.085 .  
baseline      1.6069     0.2917    5.51  3.6e-08 ***

Coefficients Gaussian copula:
    Estimate Std. Error z value Pr(>|z|)    
ar1   0.5886     0.0634    9.28   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

log likelihood =  231.7,  AIC = 473.4


Analyzing the respiratory data under the CS covariance using gcmr we have:

In [27]:
R"""
    library("geepack")
    data(respiratory, package="geepack")
    respiratory_df <- respiratory[order(respiratory$id),]

    mod.cs <- gcmr(outcome ~ center + age + baseline,
     data = respiratory_df, marginal = binomial.marg(link = "logit"),
     cormat = cluster.cormat(id, "exchangeable"))
    summary(mod.cs)
"""

RObject{VecSxp}

Call:
gcmr(formula = outcome ~ center + age + baseline, data = respiratory_df, 
    marginal = binomial.marg(link = "logit"), cormat = cluster.cormat(id, 
        "exchangeable"))


Coefficients marginal model:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.74657    0.38807   -1.92    0.054 .  
center       0.68773    0.20897    3.29    0.001 ***
age         -0.01865    0.00897   -2.08    0.038 *  
baseline     1.41880    0.25665    5.53  3.2e-08 ***

Coefficients Gaussian copula:
    Estimate Std. Error z value Pr(>|z|)    
tau   0.2846     0.0761    3.74  0.00019 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

log likelihood =  247.3,  AIC = 504.6


##### Using geepack

Analyzing the respiratory data under the AR(1) covariance using geepack we have:

In [28]:
R"""
    library("geepack")
    data(respiratory, package="geepack")
    gee.ar1 <- geeglm(outcome ~ center + age + baseline, data=respiratory, id=id,
    family=binomial(), corstr="ar1")
    summary(gee.ar1)
"""

RObject{VecSxp}

Call:
geeglm(formula = outcome ~ center + age + baseline, family = binomial(), 
    data = respiratory, id = id, corstr = "ar1")

 Coefficients:
            Estimate Std.err  Wald Pr(>|W|)    
(Intercept)  -1.0124  0.5408  3.51    0.061 .  
center        0.7454  0.3414  4.77    0.029 *  
age          -0.0173  0.0121  2.04    0.153    
baseline      1.7248  0.3303 27.27  1.8e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation structure = ar1 
Estimated Scale Parameters:

            Estimate Std.err
(Intercept)     1.04   0.193
  Link = identity 

Estimated Correlation Parameters:
      Estimate Std.err
alpha    0.522   0.087
Number of clusters:   111  Maximum cluster size: 4 


Analyzing the respiratory data under the CS covariance using geepack we have:

In [29]:
R"""
    library("geepack")
    data(respiratory, package="geepack")
    gee.cs <- geeglm(outcome ~ center + age + baseline, data=respiratory, id=id,
    family=binomial(), corstr="exchangeable")
    summary(gee.cs)
"""

RObject{VecSxp}

Call:
geeglm(formula = outcome ~ center + age + baseline, family = binomial(), 
    data = respiratory, id = id, corstr = "exchangeable")

 Coefficients:
            Estimate Std.err  Wald Pr(>|W|)    
(Intercept)  -0.7993  0.5421  2.17    0.140    
center        0.6514  0.3353  3.77    0.052 .  
age          -0.0187  0.0121  2.41    0.121    
baseline      1.6767  0.3289 25.98  3.4e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation structure = exchangeable 
Estimated Scale Parameters:

            Estimate Std.err
(Intercept)     1.02   0.163
  Link = identity 

Estimated Correlation Parameters:
      Estimate Std.err
alpha     0.38  0.0803
Number of clusters:   111  Maximum cluster size: 4 
