# NHANES I (NHEFS): Bivariate Counts Real Data Example

In this notebook we analyze data from the [National Health and Nutrition Examination Survey Data I (NHANES I) Epidemiologic Follow-up Study (NHEFS)](https://wwwn.cdc.gov/nchs/nhanes/nhefs/).

The data are available in the [R package causaldata](https://cran.r-project.org/web/packages/causaldata/causaldata.pdf). We use the `RCall` package to load them and to transform them from wide to long format for analysis.

We compare the estimates, loglikelihoods, and run times of random intercept regression models with Poisson, negative binomial, and Bernoulli base distributions fit with QuasiCopula.jl against the corresponding generalized linear mixed models fit with MixedModels.jl.

Regression model:

* GROUPING: we cluster by the ID variable (`seqn`).
* COVARIATES: average price of tobacco in the state of residence (`price`).
* OUTCOMES: each outcome is a bivariate vector consisting of
    1. the number of cigarettes smoked per day in 1971, and
    2. the number of cigarettes smoked per day in 1982.
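
As a rough sketch of the models this setup implies (the notation $y_{ij}$, $\mu_{ij}$, $b_i$, $\sigma^2$ is ours, introduced only for illustration): for subject $i$ and time point $j$ (1 = 1971, 2 = 1982), the fixed-effects part under a log link, with $\mu_{ij}$ the mean of the base distribution, is

$$
\log \mu_{ij} = \beta_0 + \beta_1\,\mathrm{price}_{ij}, \qquad j = 1, 2.
$$

The mixed model adds a normally distributed random intercept per subject:

$$
\log \mathbb{E}[y_{ij} \mid b_i] = \beta_0 + \beta_1\,\mathrm{price}_{ij} + b_i, \qquad b_i \sim \mathcal{N}(0, \sigma^2).
$$

The quasi-copula model keeps the same fixed-effects part but, as we understand it, captures within-subject dependence through a single variance component $\theta$ rather than an explicit random intercept. For the Bernoulli base in Example 3, the logit link replaces the log link.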
    
### Table of Contents:
* [Read in the dataset](#Read-in-the-dataset)
* [Check overdispersion](#Check-overdispersion-using-empirical-mean-and-variance)



* [Example 1: Poisson Base Distribution](#Example-1:-Poisson-Base-Distribution)
* [Example 2: Negative Binomial Base Distribution](#Example-2:-NB-Base-Distribution)
* [Example 3: Bernoulli Base Distribution](#Example-3:-Bernoulli-Base-Distribution)



* [Comparisons](#Comparisons)

For the Bernoulli base distribution in Example 3, we set each outcome to 1 if the number of cigarettes smoked per day exceeds the sample mean number of cigarettes smoked per day, and to 0 otherwise.


In [1]:
versioninfo()

Julia Version 1.6.2
Commit 1b93d53fc4 (2021-07-14 15:36 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 8


In [2]:
using QuasiCopula, LinearAlgebra, DataFrames, GLM
using RCall, MixedModels, ProgressMeter
ProgressMeter.ijulia_behavior(:clear);

In [3]:
BLAS.set_num_threads(1)
Threads.nthreads()

8

# Read in the dataset

We load the `nhefs` data from the [R package causaldata](https://cran.r-project.org/web/packages/causaldata/causaldata.pdf) through `RCall`, reshape them from wide to long format (one row per subject and year), and keep only the complete cases.

In [4]:
R"""
    # load NHEFS data
    suppressWarnings(library(causaldata, warn.conflicts=FALSE))
    data(nhefs, package = "causaldata")
    
    # keep both count outcomes from 1971 and 1982
    nhefs$smokeintensity71 = nhefs$smokeintensity
    nhefs$smokeintensity82 = nhefs$smokeintensity + nhefs$smkintensity82_71
    
    # transform data from wide format to long format
    suppressWarnings(library(dplyr, warn.conflicts=FALSE))
    nhefs = nhefs %>% select(seqn, smokeintensity71, smokeintensity82, price71, price82)
    nhefs = as.data.frame(nhefs)
    nhefs_long = reshape(nhefs, direction="long",
    varying=list(c("smokeintensity71","smokeintensity82"), c("price71","price82")), 
    v.names=c("smoke","price"))
    df = nhefs_long[order(nhefs_long$seqn),]
    df = df[complete.cases(df),]
"""
@rget df
df[!, :seqn] .= string.(df[!, :seqn])
df

| Row | seqn | time | smoke | price | id |
|---|---|---|---|---|---|
| | String | Int64 | Float64 | Float64 | Int64 |
| 1 | 233.0 | 1 | 30.0 | 2.18359 | 1 |
| 2 | 233.0 | 2 | 20.0 | 1.73999 | 1 |
| 3 | 235.0 | 1 | 20.0 | 2.34668 | 2 |
| 4 | 235.0 | 2 | 10.0 | 1.79736 | 2 |
| 5 | 244.0 | 1 | 20.0 | 1.56958 | 3 |
| 6 | 244.0 | 2 | 6.0 | 1.51343 | 3 |
| 7 | 245.0 | 1 | 3.0 | 1.50659 | 4 |
| 8 | 245.0 | 2 | 7.0 | 1.4519 | 4 |
| 9 | 252.0 | 1 | 20.0 | 2.34668 | 5 |
| 10 | 252.0 | 2 | 20.0 | 1.79736 | 5 |
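
The reshape in this section is done in R. For readers who prefer to stay in Julia, here is a minimal sketch of the same wide-to-long step with DataFrames.jl. The toy values and the names `wide`, `long71`, `long82`, and `long` are ours, chosen for illustration; only the column names follow the notebook.

```julia
using DataFrames

# Toy wide table: one row per subject (values are made up for illustration).
wide = DataFrame(
    seqn = ["233", "235"],
    smokeintensity71 = [30.0, 20.0],
    smokeintensity82 = [20.0, 10.0],
    price71 = [2.18, 2.35],
    price82 = [1.74, 1.80],
)

# One long block per time point, with matching column names.
long71 = DataFrame(seqn = wide.seqn, time = fill(1, nrow(wide)),
                   smoke = wide.smokeintensity71, price = wide.price71)
long82 = DataFrame(seqn = wide.seqn, time = fill(2, nrow(wide)),
                   smoke = wide.smokeintensity82, price = wide.price82)

# Stack the blocks, order by subject then time, and drop incomplete rows.
long = sort!(vcat(long71, long82), [:seqn, :time])
dropmissing!(long)
```

Each subject contributes two consecutive rows, which is the layout the grouping variable `seqn` relies on in the model fits.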


# Check overdispersion using empirical mean and variance

If the count outcome is overdispersed (its variance exceeds its mean), then using the Poisson base distribution with the quasi-copula model may be a case of model misspecification. In that case the quasi-copula model will inflate the variance component to account for the additional dispersion.

In [5]:
empirical_mean = mean(df[!, :smoke])
empirical_variance = var(df[!, :smoke])

empirical_overdispersion = empirical_variance / empirical_mean

@show empirical_mean
@show empirical_variance
@show empirical_overdispersion;

empirical_mean = 18.277813923227065
empirical_variance = 178.364056918179
empirical_overdispersion = 9.75850053334429


The empirical variance is about 9.76 times the empirical mean, so the data are overdispersed, and the negative binomial base may be more appropriate for the analysis.
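
To see why, it helps to compare the variance functions of the two base distributions. In the usual mean and size parameterization (our notation: mean $\mu$, size $r$, which we assume corresponds to the `r` estimated in Example 2):

$$
\text{Poisson: } \operatorname{Var}(Y) = \mu, \qquad
\text{negative binomial: } \operatorname{Var}(Y) = \mu + \frac{\mu^2}{r}.
$$

The variance-to-mean ratio is therefore fixed at 1 for the Poisson but equals $1 + \mu / r > 1$ for the negative binomial, approaching the Poisson case only as $r \to \infty$.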

# Example 1: Poisson Base Distribution
### Form the random intercept model and fit using QuasiCopula.jl with Poisson Base

In [38]:
y = :smoke
grouping = :seqn
covariates = [:price]

d = Poisson()
link = LogLink()
QC_Poisson_model = VC_model(df, y, grouping, covariates, d, link)

Quasi-Copula Variance Component Model
  * base distribution: Poisson
  * link function: LogLink
  * number of clusters: 1537
  * cluster size min, max: 2, 2
  * number of variance components: 1
  * number of fixed effects: 2


In [39]:
QC_Poisson_fittime = @elapsed QuasiCopula.fit!(QC_Poisson_model)

initializing β using Newton's Algorithm under Independence Assumption
gcm.β = [2.1159279480362807, 0.39787703892248655]
initializing variance components using MM-Algorithm
gcm.θ = [5.569804005046703]
Total number of variables............................:        3
                     variables with only lower bounds:        1
                variables with lower and upper bounds:        0
                     variables with only upper bounds:        0
Total number of equality constraints.................:        0
Total number of inequality constraints...............:        0
        inequality constraints with only lower bounds:        0
   inequality constraints with lower and upper bounds:        0
        inequality constraints with only upper bounds:        0


Number of Iterations....: 18

                                   (scaled)                 (unscaled)
Objective...............:   7.4667262223557827e+02    2.1175610374308759e+04
Dual infeasibility......:   8.96197844787866

0.096910441

Without any penalty on the variance component, the fitted value (θ ≈ 7.12, printed in the next cell) is very large. This may indicate overdispersion in the dataset.

Now with the ridge penalty on the variance component.

In [40]:
@show QC_Poisson_fittime
@show QC_Poisson_model.β
@show QC_Poisson_model.θ;

QC_Poisson_fittime = 0.096910441
QC_Poisson_model.β = [2.0118278238060636, 0.43249136776609054]
QC_Poisson_model.θ = [7.11929400615054]


In [41]:
QC_Poisson_logl = logl(QC_Poisson_model)

-21175.61037430876

### Fit using MixedModels.jl

Now we fit the same model using MixedModels.jl with 25 adaptive Gauss-Hermite quadrature points (`nAGQ = 25`).
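
As a reminder of what the quadrature approximates (standard GLMM notation, ours rather than MixedModels.jl's): the contribution of subject $i$ to the marginal likelihood integrates the random intercept $b_i$ out of the conditional density $f$ of the two observations,

$$
L_i(\beta, \sigma^2) = \int_{-\infty}^{\infty} \prod_{j=1}^{2} f(y_{ij} \mid b_i; \beta)\, \phi(b_i; 0, \sigma^2)\, \mathrm{d}b_i,
$$

where $\phi(\cdot\,; 0, \sigma^2)$ is the normal density of the random intercept. For a Poisson response with a normal random intercept this integral has no closed form, so it is approximated numerically at a fixed number of nodes per subject, here 25.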

In [42]:
glmm_formula = @formula(smoke ~ 1 + price + (1|seqn));
mdl = GeneralizedLinearMixedModel(glmm_formula, df, d, link)
GLMM_Poisson_fittime = @elapsed MixedModels.fit!(mdl; nAGQ = 25);
GLMM_Poisson_β = mdl.beta
GLMM_Poisson_θ = mdl.σs[1][1]^2
@show GLMM_Poisson_fittime
@show GLMM_Poisson_β
@show GLMM_Poisson_θ;

Minimizing 76 	 Time: 0:00:00 ( 3.53 ms/it)


GLMM_Poisson_fittime = 0.27081742
GLMM_Poisson_β = [1.5148402125927902, 0.5984329913583817]
GLMM_Poisson_θ = 0.4823489292459855


In [43]:
GLMM_Poisson_logl = loglikelihood(mdl)

-15535.512192772478

The estimates differ between the two fits under the Poisson base, and the QuasiCopula loglikelihood (-21175.6) is lower than that of MixedModels (-15535.5). Because the data are overdispersed, the Poisson base may be a misspecified model.

# Example 2: NB Base Distribution
### Form the random intercept model and fit using QuasiCopula.jl with NB Base

In [44]:
y = :smoke
grouping = :seqn
covariates = [:price]

d = NegativeBinomial()
link = LogLink()
QC_NB_model = VC_model(df, y, grouping, covariates, d, link)

Quasi-Copula Variance Component Model
  * base distribution: NegativeBinomial
  * link function: LogLink
  * number of clusters: 1537
  * cluster size min, max: 2, 2
  * number of variance components: 1
  * number of fixed effects: 2


In [45]:
QC_NB_fittime = @elapsed QuasiCopula.fit!(QC_NB_model)

initializing β using GLM.jl
gcm.β = [2.1063055490508487, 0.4027219419973325]
initializing variance components using MM-Algorithm
gcm.θ = [2.152252921565018e-6]
initializing r using Newton update
Converging when tol ≤ 1.0e-6 (max block iter = 10)
Block iter 1 r = 1.11, logl = -12067.22, tol = 12067.219350099704


0.126614904

In [46]:
@show QC_NB_fittime
@show QC_NB_model.β
@show QC_NB_model.θ
@show QC_NB_model.r;

QC_NB_fittime = 0.126614904
QC_NB_model.β = [2.1063339299583452, 0.40270765918773577]
QC_NB_model.θ = [0.0]
QC_NB_model.r = [1.1138519954190327]


In [47]:
QC_NB_logl = logl(QC_NB_model)

-12067.219352849264

### Fit using MixedModels.jl

Now we fit the same model using MixedModels.jl with 25 adaptive Gauss-Hermite quadrature points (`nAGQ = 25`).

Both models estimate the variance component to be 0 and give nearly identical estimates of β.

In [48]:
glmm_formula = @formula(smoke ~ 1 + price + (1|seqn));
mdl = GeneralizedLinearMixedModel(glmm_formula, df, d, link)
GLMM_NB_fittime = @elapsed MixedModels.fit!(mdl; nAGQ = 25);
GLMM_NB_β = mdl.beta
GLMM_NB_θ = mdl.σs[1][1]^2
GLMM_NB_r = inv(mdl.σ)
@show GLMM_NB_fittime
@show GLMM_NB_β
@show GLMM_NB_θ
@show GLMM_NB_r;

Minimizing 82 	 Time: 0:00:00 ( 4.05 ms/it)


GLMM_NB_fittime = 0.334116097
GLMM_NB_β = [2.106309390981023, 0.4027195260064574]
GLMM_NB_θ = 0.0
GLMM_NB_r = 1.3827736598127374


In [49]:
GLMM_NB_logl = loglikelihood(mdl)

-12073.979452378717

There is little difference between the estimates under the negative binomial base, and the QuasiCopula loglikelihood (-12067.2) is higher than that of MixedModels (-12074.0).
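
To make the agreement between the two sets of coefficients concrete, the sketch below converts them to expected cigarettes per day under the log link. The price value of 2.0 and the helper `expected_count` are ours, chosen for illustration; the coefficients are copied from the output above. For the mixed model this is the mean at a random intercept of zero.

```julia
# Coefficients [intercept, price] copied from the fitted NB models above.
qc_beta   = [2.1063339299583452, 0.40270765918773577]
glmm_beta = [2.106309390981023, 0.4027195260064574]

# Inverse of the log link: mean = exp(β0 + β1 * price).
expected_count(beta, price) = exp(beta[1] + beta[2] * price)

price = 2.0  # hypothetical tobacco price, for illustration only
qc_mean   = expected_count(qc_beta, price)
glmm_mean = expected_count(glmm_beta, price)

println("QC:   ", round(qc_mean, digits = 2))
println("GLMM: ", round(glmm_mean, digits = 2))
println("relative difference: ", abs(qc_mean - glmm_mean) / glmm_mean)
```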

# Example 3: Bernoulli Base Distribution

Let's turn the outcome into a Bernoulli indicator that equals 1 if the number of cigarettes smoked per day is greater than the sample mean.

In [50]:
# sample mean of the number of cigarettes smoked per day, pooled over 1971 and 1982
mean_count = mean(df[!, :smoke])

18.277813923227065

In [51]:
# dichotomize the "smoke" count into the new variable "binary_outcome"
df[!, :binary_outcome] = df[!, :smoke] .> mean_count
df

| Row | seqn | time | smoke | price | id | binary_outcome |
|---|---|---|---|---|---|---|
| | String | Int64 | Float64 | Float64 | Int64 | Bool |
| 1 | 233.0 | 1 | 30.0 | 2.18359 | 1 | 1 |
| 2 | 233.0 | 2 | 20.0 | 1.73999 | 1 | 1 |
| 3 | 235.0 | 1 | 20.0 | 2.34668 | 2 | 1 |
| 4 | 235.0 | 2 | 10.0 | 1.79736 | 2 | 0 |
| 5 | 244.0 | 1 | 20.0 | 1.56958 | 3 | 1 |
| 6 | 244.0 | 2 | 6.0 | 1.51343 | 3 | 0 |
| 7 | 245.0 | 1 | 3.0 | 1.50659 | 4 | 0 |
| 8 | 245.0 | 2 | 7.0 | 1.4519 | 4 | 0 |
| 9 | 252.0 | 1 | 20.0 | 2.34668 | 5 | 1 |
| 10 | 252.0 | 2 | 20.0 | 1.79736 | 5 | 1 |


### Form the random intercept model and fit using QuasiCopula.jl with Bernoulli Base

In [52]:
y = :binary_outcome
grouping = :seqn
covariates = [:price]

d = Bernoulli()
link = LogitLink()
QC_Bernoulli_model = VC_model(df, y, grouping, covariates, d, link)

Quasi-Copula Variance Component Model
  * base distribution: Bernoulli
  * link function: LogitLink
  * number of clusters: 1537
  * cluster size min, max: 2, 2
  * number of variance components: 1
  * number of fixed effects: 2


In [53]:
QC_Bernoulli_fittime = @elapsed QuasiCopula.fit!(QC_Bernoulli_model)

initializing β using Newton's Algorithm under Independence Assumption
gcm.β = [-1.6639349088875555, 0.9824969792078235]
initializing variance components using MM-Algorithm
gcm.θ = [0.5385024551594434]
Total number of variables............................:        3
                     variables with only lower bounds:        1
                variables with lower and upper bounds:        0
                     variables with only upper bounds:        0
Total number of equality constraints.................:        0
Total number of inequality constraints...............:        0
        inequality constraints with only lower bounds:        0
   inequality constraints with lower and upper bounds:        0
        inequality constraints with only upper bounds:        0


Number of Iterations....: 15

                                   (scaled)                 (unscaled)
Objective...............:   1.9730262442597357e+03    1.9730262442597357e+03
Dual infeasibility......:   1.9131907169622

0.06998636

In [54]:
@show QC_Bernoulli_fittime
@show QC_Bernoulli_model.β
@show QC_Bernoulli_model.θ;

QC_Bernoulli_fittime = 0.06998636
QC_Bernoulli_model.β = [-3.8426699785621885, 2.1755990137575356]
QC_Bernoulli_model.θ = [0.6692358062462648]


In [55]:
QC_Bernoulli_logl =  logl(QC_Bernoulli_model)

-1973.0262442597357

### Fit using MixedModels.jl

Now we fit the same model using MixedModels.jl with 25 adaptive Gauss-Hermite quadrature points (`nAGQ = 25`).


In [56]:
glmm_formula = @formula(binary_outcome ~ 1 + price + (1|seqn));
mdl = GeneralizedLinearMixedModel(glmm_formula, df, d, link)
GLMM_Bernoulli_fittime = @elapsed MixedModels.fit!(mdl; nAGQ = 25);
GLMM_Bernoulli_β = mdl.beta
GLMM_Bernoulli_θ = mdl.σs[1][1]^2
@show GLMM_Bernoulli_fittime
@show GLMM_Bernoulli_β
@show GLMM_Bernoulli_θ;

Minimizing 78 	 Time: 0:00:00 ( 3.58 ms/it)


GLMM_Bernoulli_fittime = 0.281189228
GLMM_Bernoulli_β = [-3.349071450784314, 1.923119293142038]
GLMM_Bernoulli_θ = 3.780959140641813


In [57]:
GLMM_Bernoulli_logl = loglikelihood(mdl)

-2018.049580583702

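The estimates differ somewhat for the Bernoulli base, most visibly in the variance component (0.67 for QuasiCopula versus 3.78 for MixedModels), and the QuasiCopula loglikelihood (-1973.0) is higher than that of MixedModels (-2018.0).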

# Comparisons

Here we will just summarize the comparisons between the three models. 
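
As a compact overview, the sketch below collects the loglikelihoods and fit times reported in the examples above into a single table. The table layout and the derived columns `logl_diff` and `time_ratio` are ours; the numbers are copied verbatim from the cell outputs.

```julia
using DataFrames

# Loglikelihoods and fit times (seconds) copied from the cells above.
results = DataFrame(
    base = ["Poisson", "NegativeBinomial", "Bernoulli"],
    QC_logl = [-21175.61037430876, -12067.219352849264, -1973.0262442597357],
    GLMM_logl = [-15535.512192772478, -12073.979452378717, -2018.049580583702],
    QC_fittime = [0.096910441, 0.126614904, 0.06998636],
    GLMM_fittime = [0.27081742, 0.334116097, 0.281189228],
)

# Positive values mean the QuasiCopula fit has the higher loglikelihood.
results.logl_diff = results.QC_logl .- results.GLMM_logl
# Values above 1 mean the QuasiCopula fit was faster.
results.time_ratio = results.GLMM_fittime ./ results.QC_fittime

results
```

The fit times are single `@elapsed` measurements, so they may include compilation time and should be read as rough indications rather than benchmarks.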

### Poisson Base Distribution

* Using the Poisson base distribution to analyze these data is a case of model misspecification, because the counts are overdispersed.
* With the Poisson base, the quasi-copula model tries to account for the overdispersion by inflating the variance component.
* The loglikelihood of the GLMM is higher than that of QuasiCopula.

In [58]:
# QC estimates
@show QC_Poisson_model.β
@show QC_Poisson_model.θ;

QC_Poisson_model.β = [2.0118278238060636, 0.43249136776609054]
QC_Poisson_model.θ = [7.11929400615054]


In [59]:
# GLMM estimates
@show GLMM_Poisson_β
@show GLMM_Poisson_θ;

GLMM_Poisson_β = [1.5148402125927902, 0.5984329913583817]
GLMM_Poisson_θ = 0.4823489292459855


In [60]:
# Loglikelihoods
@show QC_Poisson_logl
@show GLMM_Poisson_logl;

QC_Poisson_logl = -21175.61037430876
GLMM_Poisson_logl = -15535.512192772478


In [61]:
# fittimes
@show QC_Poisson_fittime
@show GLMM_Poisson_fittime;

QC_Poisson_fittime = 0.096910441
GLMM_Poisson_fittime = 0.27081742


### Negative Binomial Base Distribution

* The negative binomial base distribution is more appropriate for these data than the Poisson base, because the negative binomial distribution allows the variance to exceed the mean.
* The QC and GLMM models estimate nearly the same betas.
* Both the QC and GLMM models estimate the variance component to be 0. This indicates that there is no overdispersion beyond what the negative binomial base distribution already accounts for.
* The loglikelihood of the GLMM is lower than that of QuasiCopula.

In [62]:
# QC estimates
@show QC_NB_model.β
@show QC_NB_model.θ
@show QC_NB_model.r;

QC_NB_model.β = [2.1063339299583452, 0.40270765918773577]
QC_NB_model.θ = [0.0]
QC_NB_model.r = [1.1138519954190327]


In [63]:
# GLMM estimates
@show GLMM_NB_β
@show GLMM_NB_θ
@show GLMM_NB_r;

GLMM_NB_β = [2.106309390981023, 0.4027195260064574]
GLMM_NB_θ = 0.0
GLMM_NB_r = 1.3827736598127374


In [64]:
# Loglikelihoods
@show QC_NB_logl
@show GLMM_NB_logl;

QC_NB_logl = -12067.219352849264
GLMM_NB_logl = -12073.979452378717


In [65]:
# fittimes
@show QC_NB_fittime
@show GLMM_NB_fittime;

QC_NB_fittime = 0.126614904
GLMM_NB_fittime = 0.334116097


### Bernoulli Base Distribution

* With the Bernoulli base distribution, QC and GLMM give broadly comparable estimates of β, while their variance component estimates differ.
* The loglikelihood of the GLMM is lower than that of QuasiCopula.

In [66]:
# QC estimates
@show QC_Bernoulli_model.β
@show QC_Bernoulli_model.θ;

QC_Bernoulli_model.β = [-3.8426699785621885, 2.1755990137575356]
QC_Bernoulli_model.θ = [0.6692358062462648]


In [67]:
# GLMM estimates
@show GLMM_Bernoulli_β
@show GLMM_Bernoulli_θ;

GLMM_Bernoulli_β = [-3.349071450784314, 1.923119293142038]
GLMM_Bernoulli_θ = 3.780959140641813


In [68]:
# Loglikelihoods
@show QC_Bernoulli_logl
@show GLMM_Bernoulli_logl;

QC_Bernoulli_logl = -1973.0262442597357
GLMM_Bernoulli_logl = -2018.049580583702


In [69]:
# fittimes
@show QC_Bernoulli_fittime
@show GLMM_Bernoulli_fittime;

QC_Bernoulli_fittime = 0.06998636
GLMM_Bernoulli_fittime = 0.281189228
