# Practice Notebook for Regression Analysis with Dependent Data in NHANES

<img src="images/planets.jpg"/>

In [1]:
# Libraries

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import statsmodels.api as sm
import numpy as np

New variables:

* SDMVSTRA - Masked Variance Unit Pseudo-Stratum variable for variance estimation
    
* SDMVPSU - Masked Variance Unit Pseudo-PSU variable for variance estimation

In [2]:
# Read the NHANES 2015-2016 data
da = pd.read_csv("data/nhanes_2015_2016.csv")

# Drop unused columns, then drop rows with any missing values.
# (Named keep_cols to avoid shadowing the built-in vars().)
keep_cols = ["BPXSY1", "RIDAGEYR", "RIAGENDR", "RIDRETH1", "DMDEDUC2", "BMXBMI",
             "SMQ020", "SDMVSTRA", "SDMVPSU", "BPXDI1"]

da = da[keep_cols].dropna()

# This is the grouping variable
da["group"] = 10*da.SDMVSTRA + da.SDMVPSU

# Print out the head
da.head()

Unnamed: 0,BPXSY1,RIDAGEYR,RIAGENDR,RIDRETH1,DMDEDUC2,BMXBMI,SMQ020,SDMVSTRA,SDMVPSU,BPXDI1,group
0,128.0,62,1,3,5.0,27.8,1,125,1,70.0,1251
1,146.0,53,1,3,3.0,30.8,1,125,1,88.0,1251
2,138.0,78,1,3,3.0,28.8,1,131,1,46.0,1311
3,132.0,56,2,3,5.0,42.4,2,131,1,72.0,1311
4,100.0,42,2,4,4.0,20.3,2,126,2,70.0,1262
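
The grouping variable combines the stratum and PSU codes into one integer per cluster. The sketch below repeats that construction on a toy frame whose stratum and PSU values are copied from the `head()` output above, and checks that each (stratum, PSU) pair maps to exactly one ID. The frame `toy` and the single-digit PSU check are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy rows copied from the head() output above (stratum, PSU).
toy = pd.DataFrame({"SDMVSTRA": [125, 125, 131, 131, 126],
                    "SDMVPSU": [1, 1, 1, 1, 2]})

# Same construction as in the notebook: one integer ID per (stratum, PSU) pair.
toy["group"] = 10 * toy.SDMVSTRA + toy.SDMVPSU

# The ID is unambiguous only if every PSU code is a single digit (assumed here).
assert toy.SDMVPSU.between(0, 9).all()

# Each distinct (stratum, PSU) pair should map to exactly one group ID.
n_pairs = len(toy[["SDMVSTRA", "SDMVPSU"]].drop_duplicates())
n_groups = toy.group.nunique()
print(toy.group.tolist(), n_pairs, n_groups)
```

Running the same two counts on the full `da` frame would be a cheap way to confirm that no two clusters collapse into one ID.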


## Question 1: 

Build a marginal linear model using GEE for the first measurement of diastolic blood pressure (`BPXDI1`), accounting for the grouping variable defined above.  This initial model should have no covariates, and will be used to assess the ICC of this blood pressure measure.

In [3]:
# A right-hand side of "1" fits only an intercept,
# so the model has no covariates.

model = sm.GEE.from_formula("BPXDI1 ~ 1", groups="group",
           cov_struct=sm.cov_struct.Exchangeable(), data=da)

result = model.fit()

print(result.cov_struct.summary())

The correlation between two observations in the same cluster is 0.031


__Q1a.__ What is the ICC for diastolic blood pressure?  What can you
  conclude by comparing it to the ICC for systolic blood pressure?

In [4]:
model = sm.GEE.from_formula("BPXSY1 ~ 1", groups="group",
           cov_struct=sm.cov_struct.Exchangeable(), data=da)

result = model.fit()

print(result.cov_struct.summary())

The correlation between two observations in the same cluster is 0.030


**Answer.** The estimated ICC for `BPXDI1` is 0.031, which is very close to the 0.030 estimated for `BPXSY1`. Both blood pressure measures show weak, and nearly equal, clustering within groups.

## Question 2: 

Take your model from question 1, and add gender, age, and BMI to it as covariates.

In [5]:
for v in ["RIAGENDR","RIDAGEYR", "BMXBMI"]:
    model = sm.GEE.from_formula(v + " ~ 1", groups="group",
           cov_struct=sm.cov_struct.Exchangeable(), data=da)
    result = model.fit()
    print(v, result.cov_struct.summary())
    print()

RIAGENDR The correlation between two observations in the same cluster is -0.001

RIDAGEYR The correlation between two observations in the same cluster is 0.035

BMXBMI The correlation between two observations in the same cluster is 0.039



__Q2a.__ What is the ICC for this model?  What can you conclude by comparing it to the ICC for the model that you fit in question 1?

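**Answer.** The cell above estimates a separate ICC for each covariate rather than for a blood pressure model that includes them. The ICCs for age (`RIDAGEYR`, 0.035) and BMI (`BMXBMI`, 0.039) are similar to those for diastolic and systolic blood pressure (0.031 and 0.030). The ICC for gender (`RIAGENDR`, -0.001) is essentially zero, meaning that the gender mix barely differs from one cluster to another.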

## Question 3: 

Split the data into separate datasets for females and for males and fit two separate marginal linear models for diastolic blood pressure, one only for females, and one only for males.

In [6]:
# Create 2 groups: 'males' and 'females'

male = da[da['RIAGENDR']==1]
female = da[da['RIAGENDR']==2]


# Male Model

model = sm.GEE.from_formula("BPXDI1 ~ 1", groups="group",
           cov_struct=sm.cov_struct.Exchangeable(), data=male)

result = model.fit()

print("Male:")
print(result.cov_struct.summary())
print()


# Female Model

model2 = sm.GEE.from_formula("BPXDI1 ~ 1", groups="group",
           cov_struct=sm.cov_struct.Exchangeable(), data=female)

result2 = model2.fit()

print("Female:")
print(result2.cov_struct.summary())
print()

Male:
The correlation between two observations in the same cluster is 0.035

Female:
The correlation between two observations in the same cluster is 0.029



__Q3a.__ What do you learn by comparing these two fitted models?

**Answer.** *Intraclass correlation* (ICC) is a distinct form of correlation from Pearson's correlation. The ICC takes on values from 0 to 1. A value of 1 corresponds to "perfect clustering", where the values within a cluster are identical, and a value of 0 corresponds to "perfect independence", where the mean value within each cluster is identical across all the clusters. Here the estimated ICC for diastolic blood pressure is 0.035 among males and 0.029 among females. Both values are small and close to the pooled estimate of 0.031, so clustering is weak in both groups and only slightly stronger among males.
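
The two extremes of the ICC can be illustrated numerically. The sketch below uses the one-way ANOVA estimator for balanced clusters, which is a swapped-in technique and not the moment estimator that GEE uses, on two synthetic data sets: one where every member of a cluster shares a value, and one with no cluster structure at all. The helper `anova_icc` is hypothetical and written only for this illustration.

```python
import numpy as np

def anova_icc(values):
    """One-way ANOVA ICC estimate for a balanced (clusters x members) array."""
    k, n = values.shape
    msb = n * values.mean(axis=1).var(ddof=1)   # between-cluster mean square
    msw = values.var(axis=1, ddof=1).mean()     # within-cluster mean square
    return (msb - msw) / (msb + (n - 1) * msw)

rng = np.random.default_rng(2)
k, n = 200, 25

# Perfect clustering: every member of a cluster shares the same value.
perfect = np.repeat(rng.normal(size=(k, 1)), n, axis=1)

# No clustering: all values are independent draws from one distribution.
independent = rng.normal(size=(k, n))

icc_perfect = anova_icc(perfect)
icc_independent = anova_icc(independent)
print(round(icc_perfect, 3), round(icc_independent, 3))
```

The estimate for the independent data sits near 0 and can be slightly negative in a finite sample, which is also why an estimate such as -0.001 can appear in practice.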

## Question 4: 

Using the entire data set, fit a multilevel model for diastolic blood pressure, predicted by age, gender, BMI, and educational attainment.  Include a random intercept for groups.

In [7]:
# Create a labeled version of the gender variable

da["RIAGENDRx"] = da.RIAGENDR.replace({1: "Male", 2: "Female"})

da["DMDEDUC2x"] = da.DMDEDUC2.replace({1: "lt9", 2: "x9_11", 3: "HS", 
                                       4: "SomeCollege",5: "College", 
                                       7: np.nan, 9: np.nan})


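# Fit a multilevel (mixed effects) model with a random intercept for each group.
# Note: the education variable DMDEDUC2x is created above but is not in this formula.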

model = sm.MixedLM.from_formula("BPXDI1 ~ RIDAGEYR + RIAGENDRx + BMXBMI",
                                groups="group", data=da)
result = model.fit()
result.summary()

0,1,2,3
Model:,MixedLM,Dependent Variable:,BPXDI1
No. Observations:,5102,Method:,REML
No. Groups:,30,Scale:,154.2033
Min. group size:,106,Log-Likelihood:,-20122.6482
Max. group size:,226,Converged:,Yes
Mean group size:,170.1,,

0,1,2,3,4,5,6
,Coef.,Std.Err.,z,P>|z|,[0.025,0.975]
Intercept,65.316,1.006,64.901,0.000,63.344,67.289
RIAGENDRx[T.Male],2.743,0.350,7.841,0.000,2.058,3.429
RIDAGEYR,-0.041,0.010,-4.106,0.000,-0.061,-0.022
BMXBMI,0.184,0.025,7.207,0.000,0.134,0.234
group Var,4.304,0.110,,,,


__Q4a.__ How would you describe the strength of the clustering in this analysis?

In [8]:
# Explanation in the next block.
4.304/154.2

0.027911802853437098

**Answer.** Multilevel models can also be used to estimate ICC values. In a model with a single level of grouping, which is what we have here, the ICC is the variance of the random group intercepts (4.304) divided by the sum of that variance and the unexplained variance (154.2). The unexplained variance appears in the upper part of the output, labeled `Scale`. The cell above divides by the unexplained variance alone, which gives about 0.028; using the full denominator gives 4.304 / (4.304 + 154.2), or about 0.027. Either way the clustering is weak, and the value is very similar to the ICC estimated with GEE.
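
The ICC calculation can be written out with the two variance estimates from the summary table. This short sketch compares the simple ratio computed in the cell above with the variance-share definition; the variable names are illustrative.

```python
group_var = 4.304      # "group Var" from the summary table
resid_var = 154.2033   # "Scale" from the summary header

# Ratio computed in the cell above: group variance over residual variance.
ratio_used_above = group_var / resid_var

# Variance-share definition: group variance over total variance.
icc = group_var / (group_var + resid_var)

print(round(ratio_used_above, 4), round(icc, 4))
```

Because the group variance is small relative to the residual variance, the two quantities differ only in the third decimal place.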

__Q4b:__ Include a random slope for age, and describe your findings.

In [9]:
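# Center age within each cluster, then give each cluster its own random age slope
# through a variance component (vc_formula). The summary reports its variance
# as "age_cen Var".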

da["age_cen"] = da.groupby("group").RIDAGEYR.transform(lambda x: x - x.mean())

model = sm.MixedLM.from_formula("BPXDI1 ~ age_cen + RIDAGEYR + RIAGENDRx + BMXBMI",
           groups="group", vc_formula={"age_cen": "0+age_cen"}, data=da)

result = model.fit()

result.summary()



0,1,2,3
Model:,MixedLM,Dependent Variable:,BPXDI1
No. Observations:,5102,Method:,REML
No. Groups:,30,Scale:,156.3780
Min. group size:,106,Log-Likelihood:,-20146.1080
Max. group size:,226,Converged:,Yes
Mean group size:,170.1,,

0,1,2,3,4,5,6
,Coef.,Std.Err.,z,P>|z|,[0.025,0.975]
Intercept,80.739,2.568,31.445,0.000,75.707,85.771
RIAGENDRx[T.Male],2.756,0.352,7.825,0.000,2.065,3.446
age_cen,0.319,0.053,6.070,0.000,0.216,0.422
RIDAGEYR,-0.355,0.050,-7.044,0.000,-0.453,-0.256
BMXBMI,0.181,0.025,7.170,0.000,0.132,0.231
age_cen Var,0.004,0.000,,,,


In [10]:
# std deviation = sqrt(var)

np.sqrt(0.004)

0.06324555320336758

**Answer.** The estimated variance for the random age slopes is 0.004, which translates to a standard deviation of about 0.06. That is, from one cluster to another, the age slopes fluctuate by about $\pm 0.06$ to $\pm 0.12$ (1 to 2 standard deviations). These cluster-specific fluctuations are added to the fixed effect for `age_cen`, which is 0.319, so the `age_cen` coefficient ranges from about 0.319 - 0.06 = 0.259 in some clusters to about 0.319 + 0.06 = 0.379 in others. Because `RIDAGEYR` is also in the model with a coefficient of -0.355, the average within-cluster change in DBP per year of age is the sum of the two age coefficients, 0.319 - 0.355 = -0.036 mm Hg per year, which the random slopes shift by roughly 0.06 in either direction. Note also that the fitting algorithm produces a warning that the estimated variance parameter is close to the boundary. In this case, however, the algorithm seems to have converged to a point just short of the boundary.
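
The slope arithmetic can be checked directly from the coefficients in the summary table. The sketch below rests on one assumption worth stating: within a cluster, a one-year change in age changes both `age_cen` and `RIDAGEYR` by one, so the average within-cluster age slope is the sum of their two coefficients. Variable names are illustrative.

```python
import numpy as np

slope_var = 0.004                 # "age_cen Var" from the summary table
slope_sd = np.sqrt(slope_var)     # standard deviation of the random age slopes

b_age_cen = 0.319                 # fixed effect for age_cen
b_age = -0.355                    # fixed effect for RIDAGEYR

# Within a cluster, one more year of age raises both age_cen and RIDAGEYR by 1,
# so the average within-cluster age slope is the sum of the two coefficients.
net_slope = b_age_cen + b_age

# Cluster-specific slopes one standard deviation below and above the average.
low, high = net_slope - slope_sd, net_slope + slope_sd
print(round(slope_sd, 3), round(net_slope, 3), round(low, 3), round(high, 3))
```

Under that assumption the cluster-specific age slopes straddle zero, so the sign of the age effect on DBP is not the same in every cluster.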