# NHANES: Medical Conditions Real Data Example

In this notebook I use the "Questionnaire data" from [NHANES 2017-2018 Medical Conditions](https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Questionnaire&Cycle=2017-2020). The data dictionary can be found [here](https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/P_MCQ.htm#).
    
The data are available through the [R package nhanesA](https://cran.r-project.org/web/packages/nhanesA/index.html). We use the `RCall` package to access it from Julia and reshape it from wide to long format for analysis.

We will compare the estimates from the random intercept model with a Bernoulli base distribution, fit using QuasiCopula.jl, against those from MixedModels.jl.

    - GROUPING: We will cluster by ID variable (SEQN)
    - COVARIATES: Body weight in kg (BMXWT)
    
    - OUTCOMES: Each subject's outcome vector has length 5 and holds the following indicators:
    (1) "MCQ010": Ever been told you have asthma {1 = Yes, 0 = No}
    (2) "MCQ080": Doctor ever said you were overweight {1 = Yes, 0 = No}
    (3) "MCQ160A": Doctor ever said you had arthritis {1 = Yes, 0 = No}
    (4) "MCQ371A": Are you now controlling or losing weight {1 = Yes, 0 = No}
    (5) "MCQ371B": Are you now increasing exercise {1 = Yes, 0 = No}
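
The wide-to-long reshaping is done in R further down, but the same idea can be sketched directly in Julia with `DataFrames.stack`. This is only an illustration on made-up values: the toy table `wide` and the helper name `cluster_sizes` are my own, and the column names simply mirror the ones selected in the R cell.

```julia
using DataFrames

# Toy wide table: one row per subject (values are made up for illustration)
wide = DataFrame(
    SEQN    = ["1", "2"],
    MCQ010  = [1.0, 0.0],
    MCQ080  = [0.0, 0.0],
    MCQ160A = [1.0, 1.0],
    MCQ371A = [1.0, 1.0],
    MCQ371B = [1.0, 0.0],
)

# Stack every column except the ID into (Condition, Outcome) pairs
long = stack(wide, Not(:SEQN); variable_name = :Condition, value_name = :Outcome)
sort!(long, [:SEQN, :Condition])

# Each subject should now contribute exactly five rows (one per indicator)
cluster_sizes = combine(groupby(long, :SEQN), nrow => :n)
```

Sorting by `SEQN` keeps the five indicators of a subject together, which matches the layout of the long table used for model fitting.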

In [1]:
versioninfo()

Julia Version 1.6.2
Commit 1b93d53fc4 (2021-07-14 15:36 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 8


In [2]:
using QuasiCopula, LinearAlgebra, DataFrames, GLM
using RCall, MixedModels, ProgressMeter, Suppressor
ProgressMeter.ijulia_behavior(:clear);

In [3]:
BLAS.set_num_threads(1)
Threads.nthreads()

8

### Read in the dataset

In [4]:
@suppress begin
R"""
    suppressWarnings(library(nhanesA, warn.conflicts=FALSE))
    suppressWarnings(library(tidyverse, warn.conflicts=FALSE))
    
    # get the questionnaire data from 2017 NHANES
    q <- nhanesTables('Q', 2017)
    # get the medical conditions questionnaire from 2017 NHANES
    qt <- suppressMessages(lapply("MCQ_J", nhanes))
    df = qt[[1]]

    # select the five outcomes of interest per subject
    MC_2017 = df %>% select(SEQN, MCQ010, MCQ080, MCQ160A, MCQ371A, MCQ371B)
    MC_2017 = MC_2017[complete.cases(MC_2017),]

    # recode to 0/1: 2 (No) -> 0, every other code -> 1
    # (this includes 7 = Refused and 9 = Don't know, if present)
    MC_2017$MCQ010 <- ifelse(MC_2017$MCQ010 == 2, 0, 1)
    MC_2017$MCQ080 <- ifelse(MC_2017$MCQ080 == 2, 0, 1)
    MC_2017$MCQ160A <- ifelse(MC_2017$MCQ160A == 2, 0, 1)
    MC_2017$MCQ371A <- ifelse(MC_2017$MCQ371A == 2, 0, 1)
    MC_2017$MCQ371B <- ifelse(MC_2017$MCQ371B == 2, 0, 1)

    MC_2017_long = pivot_longer(MC_2017, -c(SEQN), values_to = "Outcome", names_to = "Condition")

    # get the weight in kg from the examination data from NHANES 2017
    e = nhanesTables('EXAM', 2017)
    et <- suppressMessages(lapply("BMX_J", nhanes))
    df_exam = et[[1]]

    BW_2017 = df_exam %>% select(SEQN, BMXWT)
    BW_2017 = BW_2017[complete.cases(BW_2017),]

    NHANES_2017_long <- inner_join(MC_2017_long, BW_2017, by="SEQN")
    df = as.data.frame(NHANES_2017_long)
"""
end

@rget df
df[!, :SEQN] .= string.(df[!, :SEQN])
df

| Row | SEQN | Condition | Outcome | BMXWT |
|-----|------|-----------|---------|-------|
|     | String | String | Float64 | Float64 |
| 1 | 93705 | MCQ010 | 1.0 | 79.5 |
| 2 | 93705 | MCQ080 | 0.0 | 79.5 |
| 3 | 93705 | MCQ160A | 1.0 | 79.5 |
| 4 | 93705 | MCQ371A | 1.0 | 79.5 |
| 5 | 93705 | MCQ371B | 1.0 | 79.5 |
| 6 | 93708 | MCQ010 | 0.0 | 53.5 |
| 7 | 93708 | MCQ080 | 0.0 | 53.5 |
| 8 | 93708 | MCQ160A | 1.0 | 53.5 |
| 9 | 93708 | MCQ371A | 1.0 | 53.5 |
| 10 | 93708 | MCQ371B | 0.0 | 53.5 |
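
The R cell recodes each item with `ifelse(x == 2, 0, 1)`, which sends every code other than 2 to 1. If the raw items contain the usual NHANES codes 7 (Refused) or 9 (Don't know), a stricter recoding may be preferable. The sketch below contrasts the two on made-up values; `recode_yes_no` is a hypothetical helper of mine, not part of the notebook or of any package.

```julia
# Hypothetical helper: 1 -> Yes (1.0), 2 -> No (0.0), anything else -> missing
recode_yes_no(x) = x == 1 ? 1.0 : (x == 2 ? 0.0 : missing)

# Made-up raw codes, including 9 (Don't know) and 7 (Refused)
raw = [1, 2, 9, 7, 2]

# Strict recoding keeps non-answers as missing
strict = recode_yes_no.(raw)

# Recoding equivalent to the R cell: everything except 2 becomes 1
lenient = ifelse.(raw .== 2, 0.0, 1.0)
```

With the strict version, rows holding `missing` would have to be dropped before the long table is formed, so the number of clusters could differ from the one reported below.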


### Form the random intercept model and fit using QuasiCopula.jl

In [9]:
y = :Outcome
grouping = :SEQN
covariates = [:BMXWT]

d = Bernoulli()
link = LogitLink()
model = VC_model(df, y, grouping, covariates, d, link)

Quasi-Copula Variance Component Model
  * base distribution: Bernoulli
  * link function: LogitLink
  * number of clusters: 5185
  * cluster size min, max: 5, 5
  * number of variance components: 1
  * number of fixed effects: 2


In [10]:
@time QuasiCopula.fit!(model)

initializing β using Newton's Algorithm under Independence Assumption
gcm.β = [-1.7982115203934443, 0.017514205319296185]
initializing variance components using MM-Algorithm
gcm.θ = [0.05965457936271671]
  0.107321 seconds (765.16 k allocations: 14.945 MiB)


-17046.98900846835

In [11]:
@show model.β
@show model.θ;

model.β = [-1.9161185003275032, 0.018614705169489668]
model.θ = [0.06253219172041159]


### Fit using MixedModels.jl

Now we fit the analogous random intercept logistic model using MixedModels.jl with 25 adaptive Gauss-Hermite quadrature points (`nAGQ = 25`).
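
For reference, the random intercept logistic model implied by the formula `Outcome ~ 1 + BMXWT + (1|SEQN)` can be written as below. The notation is my own: $y_{ij}$ is indicator $j$ for subject $i$, $x_i$ is that subject's body weight (BMXWT), and $b_i$ is the subject-level intercept.

$$
\operatorname{logit} \Pr(y_{ij} = 1 \mid b_i) = \beta_0 + \beta_1 x_i + b_i, \qquad b_i \sim N(0, \sigma^2), \qquad j = 1, \dots, 5 .
$$

In the cell that follows, `GLMM_β` holds the estimates of $(\beta_0, \beta_1)$ and `GLMM_θ` holds the estimate of $\sigma^2$, obtained by squaring the random intercept standard deviation. Since $\sigma^2$ is a variance on the logit scale, I would not necessarily expect it to match `model.θ` from QuasiCopula.jl numerically; that the two are parametrized differently is my reading, not something shown in this notebook.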

In [12]:
glmm_formula = @formula(Outcome ~ 1 + BMXWT + (1|SEQN));
mdl = GeneralizedLinearMixedModel(glmm_formula, df, d, link)
@time MixedModels.fit!(mdl; nAGQ = 25);
GLMM_β = mdl.beta
GLMM_θ = mdl.σs[1][1]^2
@show GLMM_β
@show GLMM_θ;

Minimizing 84 	 Time: 0:00:02 (26.07 ms/it)


  2.197379 seconds (31.10 k allocations: 1.373 MiB)
GLMM_β = [-1.9006462097030323, 0.01847894081758789]
GLMM_θ = 0.2647153136431501
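
To put the two fits side by side, the fixed-effect estimates and timings printed above can be compared with a few lines of arithmetic. This is a rough sketch that copies the numbers from the outputs by hand; the variable names are mine, and the timings come from single `@time` calls, so they are indicative only.

```julia
# Fixed-effect estimates copied from the outputs above
qc_β   = [-1.9161185003275032, 0.018614705169489668]  # QuasiCopula.jl
glmm_β = [-1.9006462097030323, 0.01847894081758789]   # MixedModels.jl

# Absolute and relative differences, coefficient by coefficient
abs_diff = abs.(qc_β .- glmm_β)
rel_diff = abs_diff ./ abs.(glmm_β)

# Wall-clock times in seconds, as reported by @time
qc_time, glmm_time = 0.107321, 2.197379
time_ratio = glmm_time / qc_time

@show abs_diff rel_diff time_ratio;
```

Both coefficients differ by less than 1% in relative terms, while the variance component estimates (`model.θ` and `GLMM_θ`) are left out of this comparison on purpose.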
