# Real Example Datasets
In this notebook we compare the log-likelihoods of the intercept-only quasi-copula random intercept model with those of a GLM and a GLMM fit using `GLM.jl` and `MixedModels.jl`, respectively.

The example datasets come from various R packages and are loaded with the `RCall` and `RDatasets` packages.

In [1]:
using QuasiCopula, DataFrames, LinearAlgebra, RCall, RDatasets
using GLM, MixedModels

# Dyestuff data: lme4 (Gaussian)

In [2]:
df = dataset("lme4", "Dyestuff")
y = :Yield
grouping = :Batch

d = Normal()
link = IdentityLink()
Gaussian_VC_model = VC_model(df, y, grouping, d, link)

Quasi-Copula Variance Component Model
  * base distribution: Normal
  * link function: IdentityLink
  * number of clusters: 6
  * cluster size min, max: 5, 5
  * number of variance components: 1
  * number of fixed effects: 1

In [3]:
# fit using QuasiCopula
QuasiCopula.fit!(Gaussian_VC_model);

gcm.β = [1527.4999999999998]
initializing dispersion using residual sum of squares
gcm.τ = [0.0002604449267498643]
initializing variance components using MM-Algorithm
gcm.θ = [0.8636149140056555]

******************************************************************************
This program contains Ipopt, a library for large-scale nonlinear optimization.
 Ipopt is released as open source code under the Eclipse Public License (EPL).
         For more information visit https://github.com/coin-or/Ipopt
******************************************************************************

Total number of variables............................:        3
                     variables with only lower bounds:        2
                variables with lower and upper bounds:        0
                     variables with only upper bounds:        0
Total number of equality constraints.................:        0
Total number of inequality constraints...............:        0
        inequality constraints wi

In [4]:
# qc logl
logl_dyestuff_QC = logl(Gaussian_VC_model)

# fit with glm: GLM.jl
dyestuff_lm = lm(@formula(Yield ~ 1), df);
logl_dyestuff_LM = loglikelihood(dyestuff_lm)

# fit with lmm: MixedModels.jl
dyestuff_formula = @formula(Yield ~ 1 + (1|Batch));
mdl = LinearMixedModel(dyestuff_formula, df)
MixedModels.fit!(mdl);
logl_dyestuff_LMM = loglikelihood(mdl);


In [5]:
@show logl_dyestuff_QC
@show logl_dyestuff_LM
@show logl_dyestuff_LMM;

logl_dyestuff_QC = -163.35547256352643
logl_dyestuff_LM = -166.36494298739052
logl_dyestuff_LMM = -163.6635299405715


# Oxboys data: mlmRev (Gaussian)

In [6]:
df = dataset("mlmRev", "Oxboys")
y = :Height
grouping = :Subject

# d = Normal() and link = IdentityLink() carry over from the Dyestuff cell above
Gaussian_VC_model2 = VC_model(df, y, grouping, d, link)

Quasi-Copula Variance Component Model
  * base distribution: Normal
  * link function: IdentityLink
  * number of clusters: 26
  * cluster size min, max: 9, 9
  * number of variance components: 1
  * number of fixed effects: 1

In [7]:
# fit using QuasiCopula
QuasiCopula.fit!(Gaussian_VC_model2);

gcm.β = [149.51940170940168]
initializing dispersion using residual sum of squares
gcm.τ = [0.012119112908652608]
initializing variance components using MM-Algorithm
gcm.θ = [0.8429685809722144]
Total number of variables............................:        3
                     variables with only lower bounds:        2
                variables with lower and upper bounds:        0
                     variables with only upper bounds:        0
Total number of equality constraints.................:        0
Total number of inequality constraints...............:        0
        inequality constraints with only lower bounds:        0
   inequality constraints with lower and upper bounds:        0
        inequality constraints with only upper bounds:        0


Number of Iterations....: 10

                                   (scaled)                 (unscaled)
Objective...............:   8.2657948607668027e+02    8.2657948607668027e+02
Dual infeasibility......:   2.3491180187725940e-0

In [8]:
# qc logl
logl_Oxboys_QC = logl(Gaussian_VC_model2)
# -826.5794860766803

# fit with glm: GLM.jl
Oxboys_lm = lm(@formula(Height ~ 1), df);
logl_Oxboys_LM = loglikelihood(Oxboys_lm)
# -848.3492814947749

# fit with lmm: MixedModels.jl
Oxboys_formula = @formula(Height ~ 1 + (1|Subject));
mdl = LinearMixedModel(Oxboys_formula, df)
MixedModels.fit!(mdl);
logl_Oxboys_LMM = loglikelihood(mdl);

In [9]:
@show logl_Oxboys_QC
@show logl_Oxboys_LM
@show logl_Oxboys_LMM;

logl_Oxboys_QC = -826.5794860766803
logl_Oxboys_LM = -848.3492814947749
logl_Oxboys_LMM = -734.6676665245718


# Sleepstudy data: lme4 (Gaussian)

In [10]:
df = dataset("lme4", "sleepstudy")
y = :Reaction
grouping = :Subject

Gaussian_VC_model3 = VC_model(df, y, grouping, d, link)

Quasi-Copula Variance Component Model
  * base distribution: Normal
  * link function: IdentityLink
  * number of clusters: 18
  * cluster size min, max: 10, 10
  * number of variance components: 1
  * number of fixed effects: 1

In [11]:
# fit using QuasiCopula
QuasiCopula.fit!(Gaussian_VC_model3);

gcm.β = [298.5078916666667]
initializing dispersion using residual sum of squares
gcm.τ = [0.00031692692475632976]
initializing variance components using MM-Algorithm
gcm.θ = [0.17963770582374433]
Total number of variables............................:        3
                     variables with only lower bounds:        2
                variables with lower and upper bounds:        0
                     variables with only upper bounds:        0
Total number of equality constraints.................:        0
Total number of inequality constraints...............:        0
        inequality constraints with only lower bounds:        0
   inequality constraints with lower and upper bounds:        0
        inequality constraints with only upper bounds:        0


Number of Iterations....: 35

                                   (scaled)                 (unscaled)
Objective...............:   9.7311733466615874e+02    9.7311733466615874e+02
Dual infeasibility......:   1.4739948630225240e

In [12]:
# qc logl
logl_sleepstudy_QC = logl(Gaussian_VC_model3)

# fit with glm: GLM.jl
sleepstudy_lm = lm(@formula(Reaction ~ 1), df);
logl_sleepstudy_LM = loglikelihood(sleepstudy_lm)

# fit with lmm: MixedModels.jl
sleepstudy_formula = @formula(Reaction ~ 1 + (1|Subject));
mdl = LinearMixedModel(sleepstudy_formula, df)
MixedModels.fit!(mdl);
logl_sleepstudy_LMM = loglikelihood(mdl);

In [13]:
@show logl_sleepstudy_QC
@show logl_sleepstudy_LM
@show logl_sleepstudy_LMM;

logl_sleepstudy_QC = -973.1173346661587
logl_sleepstudy_LM = -980.5244758509474
logl_sleepstudy_LMM = -955.270529036532
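
As a sanity check on the `lm` numbers, the log-likelihood of an intercept-only Gaussian model, evaluated at the maximum-likelihood variance σ̂² = RSS/n, has the closed form -n/2 · (log(2πσ̂²) + 1). The sketch below assumes that the initial `gcm.τ` printed by QuasiCopula is the precision 1/σ̂² (an inference from the message "initializing dispersion using residual sum of squares", not something the notebook states), and takes n as the number of clusters times the cluster size from the model summaries. The helper name `gaussian_null_logl` is made up for illustration.

```julia
# Closed-form log-likelihood of an intercept-only Gaussian model at the
# ML variance estimate σ̂² = RSS / n:  -n/2 * (log(2π σ̂²) + 1).
# `gaussian_null_logl` is a hypothetical helper, not part of any package.
gaussian_null_logl(n, σ2) = -n / 2 * (log(2π * σ2) + 1)

# (dataset, n = clusters × cluster size, initial τ printed above, reported LM logl)
# Assumption: τ is the precision 1/σ̂² with σ̂² = RSS / n.
checks = [
    ("Dyestuff",   6 * 5,   0.0002604449267498643,  -166.36494298739052),
    ("Oxboys",     26 * 9,  0.012119112908652608,   -848.3492814947749),
    ("sleepstudy", 18 * 10, 0.00031692692475632976, -980.5244758509474),
]

for (name, n, τ, reported) in checks
    by_hand = gaussian_null_logl(n, 1 / τ)
    println(rpad(name, 12), " by hand = ", round(by_hand, digits = 4),
            "   reported = ", round(reported, digits = 4))
end
```

If the assumption about τ holds, the two columns should agree to the printed precision, which suggests the quasi-copula fit starts from the same independence estimate that `lm` reports.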


# Gcsemv data: mlmRev (Gaussian)

In [14]:
df = dataset("mlmRev", "Gcsemv")
df = df[completecases(df), :]
y = :Course
grouping = :School

Gaussian_VC_model4 = VC_model(df, y, grouping, d, link)

Quasi-Copula Variance Component Model
  * base distribution: Normal
  * link function: IdentityLink
  * number of clusters: 73
  * cluster size min, max: 1, 83
  * number of variance components: 1
  * number of fixed effects: 1

In [15]:
# fit using QuasiCopula
QuasiCopula.fit!(Gaussian_VC_model4);

gcm.β = [73.38138542350625]
initializing dispersion using residual sum of squares
gcm.τ = [0.003703899254589357]
initializing variance components using MM-Algorithm
gcm.θ = [0.5145178403876451]
Total number of variables............................:        3
                     variables with only lower bounds:        2
                variables with lower and upper bounds:        0
                     variables with only upper bounds:        0
Total number of equality constraints.................:        0
Total number of inequality constraints...............:        0
        inequality constraints with only lower bounds:        0
   inequality constraints with lower and upper bounds:        0
        inequality constraints with only upper bounds:        0


Number of Iterations....: 20

                                   (scaled)                 (unscaled)
Objective...............:   6.3611970355006688e+03    6.3611970355006688e+03
Dual infeasibility......:   6.7714640667166823e-08

In [16]:
# qc logl
logl_Gcsemv_QC = logl(Gaussian_VC_model4)

# fit with glm: GLM.jl
Gcsemv_lm = lm(@formula(Course ~ 1), df);
logl_Gcsemv_LM = loglikelihood(Gcsemv_lm)

# fit with lmm: MixedModels.jl
Gcsemv_formula = @formula(Course ~ 1 + (1|School));
mdl = LinearMixedModel(Gcsemv_formula, df)
MixedModels.fit!(mdl);
logl_Gcsemv_LMM = loglikelihood(mdl);

In [17]:
@show logl_Gcsemv_QC
@show logl_Gcsemv_LM
@show logl_Gcsemv_LMM;

logl_Gcsemv_QC = -6361.197035500669
logl_Gcsemv_LM = -6424.201502669516
logl_Gcsemv_LMM = -6247.905067227811


# respiratory data: geepack (Bernoulli)

In [18]:
R"""
    data(respiratory, package="geepack")
    respiratory_df <- respiratory[order(respiratory$id),]
"""
@rget respiratory_df;

df = respiratory_df
df[!, :id] = string.(df[!, :id])
y = :outcome
grouping = :id

# Bernoulli
d = Bernoulli()
link = LogitLink()

Bernoulli_VC_model = VC_model(df, y, grouping, d, link)

Quasi-Copula Variance Component Model
  * base distribution: Bernoulli
  * link function: LogitLink
  * number of clusters: 56
  * cluster size min, max: 4, 8
  * number of variance components: 1
  * number of fixed effects: 1

In [19]:
# fit using QuasiCopula
QuasiCopula.fit!(Bernoulli_VC_model);

initializing β using Newton's Algorithm under Independence Assumption
gcm.β = [0.23531408536427043]
initializing variance components using MM-Algorithm
gcm.θ = [0.2643698235606835]
Total number of variables............................:        2
                     variables with only lower bounds:        1
                variables with lower and upper bounds:        0
                     variables with only upper bounds:        0
Total number of equality constraints.................:        0
Total number of inequality constraints...............:        0
        inequality constraints with only lower bounds:        0
   inequality constraints with lower and upper bounds:        0
        inequality constraints with only upper bounds:        0


Number of Iterations....: 6

                                   (scaled)                 (unscaled)
Objective...............:   2.8889498858347241e+02    2.8889498858347241e+02
Dual infeasibility......:   6.4801335408759542e-09    6.48013354

In [20]:
# qc logl
logl_respiratory_QC = logl(Bernoulli_VC_model)

# fit with glm: GLM.jl
respiratory_glm = glm(@formula(outcome ~ 1), df, d, link);
logl_respiratory_GLM = loglikelihood(respiratory_glm)

# fit with glmm: MixedModels.jl
respiratory_formula = @formula(outcome ~ 1 + (1|id));
mdl = GeneralizedLinearMixedModel(respiratory_formula, df, d, link)
MixedModels.fit!(mdl, fast=true);
logl_respiratory_GLMM = loglikelihood(mdl);


In [21]:
@show logl_respiratory_QC
@show logl_respiratory_GLM
@show logl_respiratory_GLMM;

logl_respiratory_QC = -288.8949885834724
logl_respiratory_GLM = -304.70530346181135
logl_respiratory_GLMM = -280.997334355401


# Contraception data: mlmRev (Bernoulli)

In [22]:
df = dataset("mlmRev", "Contraception")
binary_use = map(x -> string(x) == "N" ? 0.0 : 1.0, df[!, :Use])
df[!, :outcome] = binary_use
y = :outcome
grouping = :District

# d = Bernoulli() and link = LogitLink() carry over from the respiratory cell above
Bernoulli_VC_model = VC_model(df, y, grouping, d, link)

Quasi-Copula Variance Component Model
  * base distribution: Bernoulli
  * link function: LogitLink
  * number of clusters: 60
  * cluster size min, max: 2, 118
  * number of variance components: 1
  * number of fixed effects: 1
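
The 0/1 recoding of `Use` in the cell above can be tried in isolation. This is a minimal sketch on a made-up vector; the toy values and the variable names `use` and `outcome` are illustrative only and are not taken from the Contraception data.

```julia
# Toy stand-in for the `Use` column; these four labels are hypothetical.
use = ["N", "Y", "Y", "N"]

# Same rule as in the cell above: "N" maps to 0.0, anything else to 1.0.
outcome = [string(u) == "N" ? 0.0 : 1.0 for u in use]

println(outcome)
```

Storing the response as `Float64` rather than as a categorical label is what lets the same column be passed to `VC_model`, `glm`, and `GeneralizedLinearMixedModel` in the cells that follow.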

In [23]:
# fit using QuasiCopula
QuasiCopula.fit!(Bernoulli_VC_model);

initializing β using Newton's Algorithm under Independence Assumption
gcm.β = [-0.43702156143913895]
initializing variance components using MM-Algorithm
gcm.θ = [0.08460354610678963]
Total number of variables............................:        2
                     variables with only lower bounds:        1
                variables with lower and upper bounds:        0
                     variables with only upper bounds:        0
Total number of equality constraints.................:        0
Total number of inequality constraints...............:        0
        inequality constraints with only lower bounds:        0
   inequality constraints with lower and upper bounds:        0
        inequality constraints with only upper bounds:        0


Number of Iterations....: 6

                                   (scaled)                 (unscaled)
Objective...............:   1.2764000166885739e+03    1.2764000166885739e+03
Dual infeasibility......:   3.4385887690431207e-09    3.438588

In [24]:
# qc logl
logl_Contraception_QC = logl(Bernoulli_VC_model)

# fit with glm: GLM.jl
Contraception_glm = glm(@formula(outcome ~ 1), df, d, link);
logl_Contraception_GLM = loglikelihood(Contraception_glm)

# fit with glmm: MixedModels.jl
Contraception_formula = @formula(outcome ~ 1 + (1|District));
mdl = GeneralizedLinearMixedModel(Contraception_formula, df, d, link)
MixedModels.fit!(mdl, fast=true);
logl_Contraception_GLMM = loglikelihood(mdl);

In [25]:
@show logl_Contraception_QC
@show logl_Contraception_GLM
@show logl_Contraception_GLMM;

logl_Contraception_QC = -1276.400016688574
logl_Contraception_GLM = -1295.4546621368916
logl_Contraception_GLMM = -1267.2367038450316


# epilepsy data: gcmr (Poisson)

In [26]:
R"""
    library("gcmr")
    data("epilepsy", package = "gcmr")
"""
@rget epilepsy;

df = epilepsy
y = :counts
grouping = :id

# Poisson
d = Poisson()
link = LogLink()
Poisson_VC_model = VC_model(df, y, grouping, d, link)

Quasi-Copula Variance Component Model
  * base distribution: Poisson
  * link function: LogLink
  * number of clusters: 59
  * cluster size min, max: 5, 5
  * number of variance components: 1
  * number of fixed effects: 1

In [27]:
# fit using QuasiCopula
QuasiCopula.fit!(Poisson_VC_model);

initializing β using Newton's Algorithm under Independence Assumption
gcm.β = [2.5544643397090105]
initializing variance components using MM-Algorithm
gcm.θ = [6.362206769115058]
Total number of variables............................:        2
                     variables with only lower bounds:        1
                variables with lower and upper bounds:        0
                     variables with only upper bounds:        0
Total number of equality constraints.................:        0
Total number of inequality constraints...............:        0
        inequality constraints with only lower bounds:        0
   inequality constraints with lower and upper bounds:        0
        inequality constraints with only upper bounds:        0


Number of Iterations....: 14

                                   (scaled)                 (unscaled)
Objective...............:   2.9453773284413337e+03    2.9453773284413337e+03
Dual infeasibility......:   4.3645371761158458e-09    4.364537176

In [28]:
# qc logl
logl_epilepsy_QC = logl(Poisson_VC_model)

# fit with glm: GLM.jl
epilepsy_glm = glm(@formula(counts ~ 1), df, Poisson(), LogLink());
logl_epilepsy_GLM = loglikelihood(epilepsy_glm)

# fit with glmm: MixedModels.jl
df[!, :id] = string.(df[!, :id])
epilepsy_formula = @formula(counts ~ 1 + (1|id));
mdl = GeneralizedLinearMixedModel(epilepsy_formula, df, Poisson(), LogLink())
MixedModels.fit!(mdl, fast=true);
logl_epilepsy_GLMM = loglikelihood(mdl);

In [29]:
@show logl_epilepsy_QC
@show logl_epilepsy_GLM
@show logl_epilepsy_GLMM;

logl_epilepsy_QC = -2945.3773284413337
logl_epilepsy_GLM = -3092.972481024413
logl_epilepsy_GLMM = -1785.335988439955


# Mmmec data: mlmRev (Poisson)

In [30]:
Mmmec = dataset("mlmRev", "Mmmec");

df = Mmmec
y = :Deaths
grouping = :Nation
# d = Poisson() and link = LogLink() carry over from the epilepsy cell above

Poisson_VC_model = VC_model(df, y, grouping, d, link)

Quasi-Copula Variance Component Model
  * base distribution: Poisson
  * link function: LogLink
  * number of clusters: 9
  * cluster size min, max: 3, 95
  * number of variance components: 1
  * number of fixed effects: 1

In [31]:
# fit using QuasiCopula
QuasiCopula.fit!(Poisson_VC_model);

initializing β using Newton's Algorithm under Independence Assumption
gcm.β = [3.326031339981156]
initializing variance components using MM-Algorithm
gcm.θ = [1.0]
Total number of variables............................:        2
                     variables with only lower bounds:        1
                variables with lower and upper bounds:        0
                     variables with only upper bounds:        0
Total number of equality constraints.................:        0
Total number of inequality constraints...............:        0
        inequality constraints with only lower bounds:        0
   inequality constraints with lower and upper bounds:        0
        inequality constraints with only upper bounds:        0


Number of Iterations....: 57

                                   (scaled)                 (unscaled)
Objective...............:   6.6230027620134115e+03    6.6230027620134115e+03
Dual infeasibility......:   1.1068841262247114e-11    1.1068841262247114e-11
Con

In [32]:
# qc logl
logl_Mmmec_QC = logl(Poisson_VC_model)

# fit with glm: GLM.jl
Mmmec_glm = glm(@formula(Deaths ~ 1), df, Poisson(), LogLink());
logl_Mmmec_GLM = loglikelihood(Mmmec_glm)

# fit with glmm: MixedModels.jl
Mmmec_formula = @formula(Deaths ~ 1 + (1|Nation));
mdl = GeneralizedLinearMixedModel(Mmmec_formula, df, Poisson(), LogLink())
MixedModels.fit!(mdl, fast=true);
logl_Mmmec_GLMM = loglikelihood(mdl);

In [33]:
@show logl_Mmmec_QC
@show logl_Mmmec_GLM
@show logl_Mmmec_GLMM;

logl_Mmmec_QC = -6623.0027620134115
logl_Mmmec_GLM = -6672.358128381808
logl_Mmmec_GLMM = -3762.8923979880137


# epilepsy data: gcmr (NB)

In [34]:
R"""
    library("gcmr")
    data("epilepsy", package = "gcmr")
"""
@rget epilepsy;

df = epilepsy
y = :counts
grouping = :id

# Negative binomial
d = NegativeBinomial()
link = LogLink()
NB_VC_model = VC_model(df, y, grouping, d, link)

Quasi-Copula Variance Component Model
  * base distribution: NegativeBinomial
  * link function: LogLink
  * number of clusters: 59
  * cluster size min, max: 5, 5
  * number of variance components: 1
  * number of fixed effects: 1

In [35]:
# fit using QuasiCopula
QuasiCopula.fit!(NB_VC_model);

initializing β using GLM.jl
gcm.β = [2.5544643337320343]
initializing variance components using MM-Algorithm
gcm.θ = [0.5657670131706649]
initializing r using Newton update
Converging when tol ≤ 1.0e-6 (max block iter = 10)
Block iter 1 r = 0.95, logl = -1041.21, tol = 1041.2064746554058
Block iter 2 r = 0.95, logl = -1041.2, tol = 9.169060760388625e-6


In [36]:
# qc logl
logl_epilepsy_QC_nb = logl(NB_VC_model)

# fit with glm: GLM.jl
epilepsy_glm = glm(@formula(counts ~ 1), df, d, link);
logl_epilepsy_GLM_nb = loglikelihood(epilepsy_glm)

# fit with glmm: MixedModels.jl
df[!, :id] = string.(df[!, :id])
epilepsy_formula = @formula(counts ~ 1 + (1|id));
mdl = GeneralizedLinearMixedModel(epilepsy_formula, df, d, link)
MixedModels.fit!(mdl, fast=true);
logl_epilepsy_GLMM_nb = loglikelihood(mdl);

│ It is best to avoid trying to fit such models in MixedModels until
│ the authors gain a better understanding of those cases.
└ @ MixedModels /Users/sarahji/.julia/packages/MixedModels/mag0C/src/generalizedlinearmixedmodel.jl:374


In [37]:
@show logl_epilepsy_QC_nb
@show logl_epilepsy_GLM_nb
@show logl_epilepsy_GLMM_nb;

logl_epilepsy_QC_nb = -1041.1965870627232
logl_epilepsy_GLM_nb = -1059.746665536632
logl_epilepsy_GLMM_nb = -1025.4137630663238


# Mmmec data: mlmRev (NB)

In [38]:
Mmmec = dataset("mlmRev", "Mmmec");

df = Mmmec
y = :Deaths
grouping = :Nation
# d = NegativeBinomial() and link = LogLink() carry over from the epilepsy (NB) cell above

NB_VC_model = VC_model(df, y, grouping, d, link)

Quasi-Copula Variance Component Model
  * base distribution: NegativeBinomial
  * link function: LogLink
  * number of clusters: 9
  * cluster size min, max: 3, 95
  * number of variance components: 1
  * number of fixed effects: 1

In [39]:
# fit using QuasiCopula
QuasiCopula.fit!(NB_VC_model);

initializing β using GLM.jl
gcm.β = [3.3260313399805215]
initializing variance components using MM-Algorithm
gcm.θ = [1.0]
initializing r using Newton update
Converging when tol ≤ 1.0e-6 (max block iter = 10)
Block iter 1 r = 0.95, logl = -1517.43, tol = 1517.4256786488381


In [40]:
# qc logl
logl_Mmmec_QC_nb = logl(NB_VC_model)

# fit with glm: GLM.jl
Mmmec_glm = glm(@formula(Deaths ~ 1), df, d, link);
logl_Mmmec_GLM_nb = loglikelihood(Mmmec_glm)

# fit with glmm: MixedModels.jl
Mmmec_formula = @formula(Deaths ~ 1 + (1|Nation));
mdl = GeneralizedLinearMixedModel(Mmmec_formula, df, d, link);
MixedModels.fit!(mdl, fast=true);
logl_Mmmec_GLMM_nb = loglikelihood(mdl);

│ It is best to avoid trying to fit such models in MixedModels until
│ the authors gain a better understanding of those cases.
└ @ MixedModels /Users/sarahji/.julia/packages/MixedModels/mag0C/src/generalizedlinearmixedmodel.jl:374


In [41]:
@show logl_Mmmec_QC_nb
@show logl_Mmmec_GLM_nb
@show logl_Mmmec_GLMM_nb;

logl_Mmmec_QC_nb = -1517.4254928293658
logl_Mmmec_GLM_nb = -1537.7008165848106
logl_Mmmec_GLMM_nb = -1452.072432624533
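
The comparisons in this notebook can be gathered into one table. The sketch below only re-enters the values printed by the `@show` cells and does not refit anything; the tuple layout and the variable names (`results`, `qc_beats_glm`, `qc_beats_glmm`) are chosen here for illustration.

```julia
# Log-likelihoods copied from the `@show` outputs in this notebook.
# Columns: dataset, base distribution, quasi-copula, GLM.jl, MixedModels.jl.
results = [
    ("Dyestuff",      "Normal",           -163.35547256352643, -166.36494298739052, -163.6635299405715),
    ("Oxboys",        "Normal",           -826.5794860766803,  -848.3492814947749,  -734.6676665245718),
    ("sleepstudy",    "Normal",           -973.1173346661587,  -980.5244758509474,  -955.270529036532),
    ("Gcsemv",        "Normal",           -6361.197035500669,  -6424.201502669516,  -6247.905067227811),
    ("respiratory",   "Bernoulli",        -288.8949885834724,  -304.70530346181135, -280.997334355401),
    ("Contraception", "Bernoulli",        -1276.400016688574,  -1295.4546621368916, -1267.2367038450316),
    ("epilepsy",      "Poisson",          -2945.3773284413337, -3092.972481024413,  -1785.335988439955),
    ("Mmmec",         "Poisson",          -6623.0027620134115, -6672.358128381808,  -3762.8923979880137),
    ("epilepsy",      "NegativeBinomial", -1041.1965870627232, -1059.746665536632,  -1025.4137630663238),
    ("Mmmec",         "NegativeBinomial", -1517.4254928293658, -1537.7008165848106, -1452.072432624533),
]

# Print a fixed-width table, rounded to two decimals for readability.
println(rpad("dataset", 15), rpad("distribution", 18),
        lpad("QC", 11), lpad("GLM", 11), lpad("GLMM", 11))
for (name, dist, ll_qc, ll_glm, ll_glmm) in results
    println(rpad(name, 15), rpad(dist, 18),
            lpad(string(round(ll_qc, digits = 2)), 11),
            lpad(string(round(ll_glm, digits = 2)), 11),
            lpad(string(round(ll_glmm, digits = 2)), 11))
end

# Datasets where the quasi-copula fit has the higher log-likelihood.
qc_beats_glm  = [r[1] for r in results if r[3] > r[4]]
qc_beats_glmm = [r[1] for r in results if r[3] > r[5]]
println("QC > GLM in ", length(qc_beats_glm), " of ", length(results), " fits")
println("QC > GLMM in ", length(qc_beats_glmm), " of ", length(results), " fits")
```

With the values reported in this notebook, the quasi-copula log-likelihood is higher than the GLM one in every fit and higher than the (G)LMM one only for Dyestuff.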
