# Multilevel modeling, a case study with the Russia Longitudinal Monitoring Survey data

This notebook demonstrates linear and generalized linear multilevel
regression, using data from the Russia Longitudinal Monitoring Survey
([RLMS](https://www.cpc.unc.edu/projects/rlms-hse/)).

We will focus here on linear mixed effects regression, a technique
that can be used to conduct linear regression analysis on datasets
with multilevel or longitudinal structure, or that have other forms of
statistical dependence among the observations.

The RLMS is a longitudinal study conducted in Russia starting in 1994.
It captures hundreds of different characteristics of the subjects.
Subjects in the RLMS may be assessed yearly, but most subjects
participate more sporadically.  This study is based on questionnaires
and interviews.  All responses are entered into the dataset as
reported by the subjects - there is no expert verification of the
responses (e.g. by clinical exam or review of administrative records).

To access the data files, it is necessary to register for an account
on the [UNC Dataverse](https://dataverse.unc.edu/) site, and accept
the terms of use.  Once the registration is complete, the data can be
retrieved from [this
site](https://dataverse.unc.edu/dataset.xhtml?persistentId=doi:10.15139/S3/12438).
The data are available in several formats; below we start with the
compressed data file '`IND_1994_2015_v2_STATA.7z`'.  This is a Stata
dta format file, compressed using the "7z" compression utility.  After
obtaining this file, uncompress it using a program like `7z` to get a
Stata dta file.  Then use the code below to convert the Stata file
into a csv file.  The code in the cell below is set up so that by
default it does not run.  You won't want this code to run every time
you use this notebook, but you can change `False` to `True` in the
'`if`' statement and run this cell once to create the data file named
'`rlms.csv.gz`' that we will use below.

In [None]:
if False: # Do not run by default, change False->True to run

    import pandas as pd
    import gzip

    # Edit this to contain the variables that will be retained.
    cols = ["idind", "year", "psu", "marst", "h5", "h6", "h7_2", "m1",
            "m2", "o38a", "j57", "j59"]

    df = pd.read_stata("RLMS_IND_1994_2015_v2_STATA.dta", columns=cols)

    # Write the selected columns to a gzip-compressed csv file.
    with gzip.open("rlms.csv.gz", "wt") as gid:
        df.to_csv(gid, index=False)

The '`cols`' list above contains the names of the variables that
are to be included in the final csv file (the raw data files have
several thousand columns but we will only retain a few of them here).
To understand what the variables mean, refer to the data documentation
file '`1994_2015_ind_codebook.pdf`', which can be obtained from the
same web page linked above where the data were obtained.  Note that
the names in the Stata file are in lower case, and decimal points in
the variable names have been changed to underscores.
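
As an aside on the conversion step, pandas can usually infer gzip compression from a file name ending in `.gz`, both when writing and when reading.  The sketch below illustrates this round trip on a tiny made-up table (the values and the file name `toy.csv.gz` are placeholders, not RLMS data).

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the RLMS data (values are made up for illustration).
toy = pd.DataFrame({"idind": [1, 1, 2],
                    "year": [1994, 1995, 1994],
                    "m1": [70.5, 71.0, None]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv.gz")
    # The ".gz" suffix lets pandas choose gzip compression when writing.
    toy.to_csv(path, index=False)
    # read_csv infers the compression from the file name in the same way.
    back = pd.read_csv(path)

print(back)
```

The missing weight in the third row comes back as `NaN`, which is how missing values appear in the csv file used below.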

Here is some more information about some of the key variables that we
will use below:

```
idind: person-level id
year: year of survey response
psu: sampling unit (location) of the respondent
marst: marital status
h5: gender
h6: birth year
h7_2: interview month
m1: body weight (kg)
m2: height (cm)
o38a: hours sleeping
j57: income last 30 days
j59: hours worked last 30 days
```


## Linear multilevel regression of BMI

Now that the data are prepared, we can begin with the analysis.  First
we import the usual libraries that we need for statistical analysis in
Python.

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import statsmodels.api as sm
import seaborn as sns

Next we read the dataset that we created above.  You should change the
path in the cell below to point to the location where you are keeping
the data file.  You may get a warning about "mixed dtypes" that can be
ignored here.

In [None]:
df = pd.read_csv("/nfs/kshedden/RLMS/rlms.csv.gz") # Change the path here

This is a longitudinal (repeated measures) data set in "long format",
meaning that each observation is contained in a separate row, with
multiple rows used to contain all the data for one subject.  In
contrast, in a "wide format" data file, the data for each subject is
contained in a single row, with multiple columns used to contain the
repeated measures.  For most forms of regression using repeated
measures data (including multilevel regression as discussed here), the
data must be in long format.
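
To make the distinction concrete, here is a minimal sketch that reshapes a small, hypothetical wide-format table into long format.  The column names `bmi_1994` and `bmi_1995` are invented for illustration; the RLMS file used here is already in long format, so this step is not needed for the analysis below.

```python
import pandas as pd

# Hypothetical wide-format data: one row per subject, one BMI column per year.
wide = pd.DataFrame({
    "idind": [1, 2],
    "bmi_1994": [22.0, 27.5],
    "bmi_1995": [22.4, 27.1],
})

# Reshape to long format: one row per subject-year combination.
long_df = wide.melt(id_vars="idind", var_name="year", value_name="bmi")

# Recover the numeric year from the original column names.
long_df["year"] = long_df["year"].str.replace("bmi_", "", regex=False).astype(int)
long_df = long_df.sort_values(["idind", "year"]).reset_index(drop=True)
print(long_df)
```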

Since different subjects in the RLMS respond different numbers of
times, we have "unbalanced repeated measures" here.  The following
gives us an initial sense of how many repeated measures we have (the
first column of the output is the number of repeated measurements for
a subject, the second column is the number of subjects with that
number of repeated measurements).

In [None]:
sz = df[["idind", "m1"]].dropna().groupby("idind").size().value_counts()
print(sz)
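
The chain of operations in the cell above may be easier to follow on a small example.  The sketch below applies the same steps to a made-up table with three subjects; the values are placeholders.

```python
import pandas as pd

# Made-up data: subject 1 has three rows, one of which has a missing weight.
toy = pd.DataFrame({
    "idind": [1, 1, 1, 2, 3, 3],
    "m1": [70.0, 71.0, None, 82.0, 65.0, 66.0],
})

# Number of non-missing weight measurements for each subject.
per_subject = toy.dropna().groupby("idind").size()

# Number of subjects having each count of measurements.
counts = per_subject.value_counts()
print(counts)
```

Here two subjects contribute two measurements each and one subject contributes a single measurement.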

One drawback of using text/csv for datasets is that it is not always
evident what data type (e.g. numeric, text, date/time) best represents
each column.  In the next cell, we convert all the columns that should
contain numeric values into a numeric format.

In [None]:
qv = ("idind", "year", "h6", "m1", "m2", "o38a", "j57", "j59")
for c in qv:
    df[c] = pd.to_numeric(df[c], errors="coerce")
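
As a small illustration of what `errors="coerce"` does, the sketch below converts a made-up column that mixes numeric strings with a non-numeric entry.  The label `"refused"` is a hypothetical example of a non-numeric code, not necessarily one that appears in the RLMS data.

```python
import pandas as pd

# A made-up text column containing numbers, a non-numeric label and a missing value.
raw = pd.Series(["72.5", "80", "refused", None])

# Entries that cannot be parsed as numbers become NaN instead of raising an error.
clean = pd.to_numeric(raw, errors="coerce")
print(clean)
```

Coerced entries are then treated as missing and are removed by the later calls to `dropna`.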

Next, we create several new variables and transform other variables in
ways that will be needed for the analysis below.

In [None]:
df["female"] = (df.h5 == "female").astype(int)  # np.int was removed from recent NumPy releases
df["age"] = df.year - df.h6
df["bmi"] = df.m1 / (df.m2 / 100)**2
year_mean = df.year.mean()
df["year_cen"] = df.year - year_mean
age_mean = df.age.mean()
df["age_cen"] = df.age - age_mean
df["birth_year_cen"] = df.h6 - df.h6.mean()
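
The following sketch works through the same transformations on two made-up records, so that the units are easy to check by hand (weight in kg, height in cm, BMI in kg/m^2).  The numbers are invented for illustration.

```python
import pandas as pd

# Two made-up records using the RLMS variable names.
toy = pd.DataFrame({
    "m1": [70.0, 90.0],      # body weight (kg)
    "m2": [175.0, 180.0],    # height (cm)
    "year": [2000, 2010],    # year of survey response
    "h6": [1960, 1975],      # birth year
})

# Height is converted from cm to m before squaring.
toy["bmi"] = toy.m1 / (toy.m2 / 100)**2
toy["age"] = toy.year - toy.h6

# Centering subtracts the sample mean, so the centered variable has mean zero.
toy["age_cen"] = toy.age - toy.age.mean()
print(toy)
```

Centering the covariates means that the intercept of the regression models below refers to a subject with average values of the centered variables, rather than to a subject of age zero in year zero.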

Now we create our final analysis dataset, dropping all rows with any
missing values.  Then we check again for the distribution of the
number of responses per person.

In [None]:
dx = df[["idind", "bmi", "age_cen", "female", "birth_year_cen",
         "year_cen", "psu"]].dropna()
dx.groupby("idind").size().value_counts()

Below we will be looking at regression analyses in which body mass
index (BMI) is the dependent variable.  BMI is anticipated to differ
by gender and age, and also may vary based on birth cohort.

Although our goal here is to conduct regression analyses that account for the
repeated measures aspect of the data, it is a convenient fact that the mean
structure parameters of a linear model can be estimated reasonably
well using ordinary least squares (OLS), even if multilevel structure
is present (the OLS standard errors, however, do not account for this
structure).  Thus, we start with an OLS fit:

In [None]:
model0 = sm.OLS.from_formula("bmi ~ age_cen*female + year_cen",
             data=dx)
result0 = model0.fit()
print(result0.summary())

Next we turn to using mixed linear regression for this analysis.  It
is possible, but somewhat slow, to fit these models to the entire data
set that we have here.  Therefore, we reduce the data size by taking a
random subsample of 2000 subjects, keeping all rows for each selected
subject.

In [None]:
idx = dx.idind.unique()
idx_use = np.random.choice(idx, 2000, replace=False)
dx = dx.loc[dx.idind.isin(idx_use), :]
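
Because the subsample is random, the fitted results will change slightly each time the notebook is run.  If reproducible results are wanted, one option is to draw the subsample from a seeded random generator.  The sketch below shows the idea on made-up data; the seed value 2019 and the toy table are arbitrary choices, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Made-up long-format data: 50 subjects with 3 rows each.
toy = pd.DataFrame({"idind": np.repeat(np.arange(50), 3),
                    "bmi": np.linspace(18, 35, 150)})

rng = np.random.default_rng(2019)  # the seed value is arbitrary

# Sample subjects (not rows), so that each selected subject keeps all rows.
ids = toy["idind"].unique()
ids_use = rng.choice(ids, size=10, replace=False)
sub = toy.loc[toy["idind"].isin(ids_use), :]

print(sub["idind"].nunique(), len(sub))
```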

## Random intercepts model for BMI

We start with a basic "random intercepts model".  This model has the
same mean structure as the model we fit above using OLS, but it also
gives each subject their own intercept for the regression of BMI on
age, gender, and year.  The standard deviation of this random intercept
reveals how much the different subjects' intercepts differ from each
other.

In [None]:
model1 = sm.MixedLM.from_formula("bmi ~ age_cen*female + year_cen",
                  groups="idind", data=dx)
result1 = model1.fit()
print(result1.summary())

Note that in the OLS fit, all the mean structure parameters are
strongly statistically significant, but the significance levels are
weaker in this mixed model.  Treating positively dependent
observations as if they were independent overstates the information in
the data, so properly accounting for the dependence typically reduces
the apparent statistical significance and more accurately depicts the
uncertainty in the analysis.  Part of the difference here also
reflects the smaller sample, since the mixed model was fit to a
subsample of the subjects.
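
The effect of ignoring dependence can be illustrated with a small simulation.  The sketch below uses the simplest possible setting, estimating a population mean from clustered data with no covariates, and compares the actual sampling variability of the mean to the standard error that would be reported if the observations were treated as independent.  The cluster sizes and variance parameters are hypothetical, and the "design effect" formula $1 + (m - 1) \cdot {\rm ICC}$ applies to equal-sized clusters of size $m$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_groups, m = 200, 5      # hypothetical: 200 subjects with 5 responses each
tau, sigma = 4.0, 1.5     # hypothetical SDs of the subject effects and the noise

means, naive_se = [], []
for _ in range(2000):
    # Each subject has its own random offset shared by all of its responses.
    theta = rng.normal(0, tau, size=(n_groups, 1))
    y = theta + rng.normal(0, sigma, size=(n_groups, m))
    means.append(y.mean())
    # Standard error of the mean computed as if all values were independent.
    naive_se.append(y.std(ddof=1) / np.sqrt(y.size))

true_se = np.std(means)           # actual variability of the sample mean
avg_naive_se = np.mean(naive_se)  # what an independence-based analysis reports

icc = tau**2 / (tau**2 + sigma**2)
deff = 1 + (m - 1) * icc          # design effect for equal cluster sizes

print(true_se / avg_naive_se, np.sqrt(deff))
```

The two printed values should be close to each other, and well above 1, indicating that the independence-based standard error is too small in this setting.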

In addition to modeling age, we also model the year in which the
response was recorded (`year_cen`).  This covariate has a
statistically significant positive relationship with BMI, indicating
that the BMI is increasing in the Russian population over the years of
this study.  According to this analysis, in around 50 years' time
the BMI of the average Russian male of average age will have increased
by around 1 ${\rm kg}/{\rm m}^2$.

The generating form for the random intercepts model is

$$
y_{ij} = \beta^\prime x_{ij} + \theta_i + \epsilon_{ij},
$$

where $y_{ij}$ is the observed response value for observation $j$
in cluster $i$, and $x_{ij}$ is the corresponding vector of covariates.
The regression parameters (sometimes called "mean structure parameters"
or "fixed effects parameters") are in $\beta$.  There are two random
terms in the model.  The $\theta_i$ have mean zero and variance $\tau^2$.
The $\epsilon_{ij}$ have mean zero and variance $\sigma^2$.  Note that
$\theta_i$ is shared by all observations within cluster $i$, but each
observation has its own "noise value" $\epsilon_{ij}$.
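
It may help to spell out what this generating form implies for the variances and covariances of the observations.  Assuming that the $\theta_i$ and $\epsilon_{ij}$ are all mutually uncorrelated (a standard assumption for this model, though not stated explicitly above), and treating the covariates as fixed, we have for two distinct observations $j \ne k$ in the same cluster $i$:

$$
{\rm Var}(y_{ij}) = \tau^2 + \sigma^2, \qquad
{\rm Cov}(y_{ij}, y_{ik}) = \tau^2, \qquad
{\rm Cor}(y_{ij}, y_{ik}) = \frac{\tau^2}{\tau^2 + \sigma^2}.
$$

Under the same assumption, observations in different clusters are uncorrelated, since they share neither $\theta_i$ nor $\epsilon_{ij}$.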

The variance structure of the random intercept model can be interpreted
using the notion of "intraclass correlation" (ICC).  In this model,
the ICC is the ratio of the variance of the random intercepts to the
total variance ($\tau^2 / (\tau^2 + \sigma^2)$):

In [None]:
pa = result1.params

icc = pa["idind Var"] / (pa["idind Var"] + result1.scale)
print(icc)
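
The ICC can also be read as the correlation between two observations made on the same subject.  The simulation sketch below checks this for a model with no covariates, using hypothetical variance parameters (the values 16 and 4 are chosen for illustration and are not estimates from the RLMS data).

```python
import numpy as np

rng = np.random.default_rng(1)
tau2, sigma2 = 16.0, 4.0   # hypothetical variance parameters
n = 20000                  # number of simulated clusters, two observations each

# Both observations in a cluster share the same random intercept theta.
theta = rng.normal(0, np.sqrt(tau2), size=n)
y1 = theta + rng.normal(0, np.sqrt(sigma2), size=n)
y2 = theta + rng.normal(0, np.sqrt(sigma2), size=n)

icc_formula = tau2 / (tau2 + sigma2)
icc_empirical = np.corrcoef(y1, y2)[0, 1]
print(icc_formula, icc_empirical)
```

The empirical within-cluster correlation should agree closely with the ratio of variances.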

The following plot shows what the "idealized" BMI trends are
for a sample of simulated subjects.  These idealized trends
do not include the independent, unexplained variation
corresponding to the term $\epsilon_{ij}$ in the model
given above.

In [None]:
a = pa["Intercept"]
b = pa["age_cen"]
s = np.sqrt(pa["idind Var"])

for k in range(10):
    x = np.r_[20, 60]
    y = a + b * (x - age_mean) + s * np.random.normal()
    ax = sns.lineplot(x=x, y=y, color='purple')  # recent seaborn versions require keyword arguments

ax.set_xlabel("Age", size=15)
_ = ax.set_ylabel("BMI", size=15)

## Random slopes model for BMI

Next we incorporate a random slope for age into the multilevel model.
Doing this allows the BMI of each subject to change in its own way as
the subject ages.  Note that while the variance of the random slope
for age appears small, the standard deviation (square root of the
variance) conveys its impact in the proper units (BMI units/year of
age).  The random slopes are realized independently for each subject,
and added to the common slope (the fixed effect for age) to obtain the
trend line between BMI and age for one subject.

In [None]:
vcf = {"i": "1", "s": "0 + age_cen"}
model2 = sm.MixedLM.from_formula("bmi ~ age_cen*female + year_cen",
              groups="idind", vc_formula=vcf, data=dx)
result2 = model2.fit()
print(result2.summary())

We can visualize the idealized trajectories for different simulated
subjects as we did above:

In [None]:
pa = result2.params

a = pa["Intercept"]
b = pa["age_cen"]
si = np.sqrt(pa["i Var"])
ss = np.sqrt(pa["s Var"])

for k in range(10):
    x = np.r_[20, 60]
    a1 = a + si * np.random.normal()
    b1 = b + ss * np.random.normal()
    y = a1 + b1 * (x - age_mean)
    ax = sns.lineplot(x=x, y=y, color='orange')  # recent seaborn versions require keyword arguments

ax.set_xlabel("Age", size=15)
_ = ax.set_ylabel("BMI", size=15)

We can visualize the estimated distribution of slopes of BMI on age as
follows (technically this only applies to males, but since the
interaction between age and gender is weak, the distribution of random
slopes for females would look similar).  Based on this plot, we see
that BMI increases by around 0.15 units per year in an average
subject, but for some subjects the BMI trend is around 0.3 units per
year, and for a small number of subjects the BMI trend is nearly zero
or slightly negative.

In [None]:
pa = result2.params

m = pa["age_cen"]
s = np.sqrt(pa["s Var"])

x = np.linspace(m - 3*s, m + 3*s, 100)
y = np.exp(-(x - m)**2 / (2 * s**2)) / np.sqrt(2 * np.pi * s**2)

ax = sns.lineplot(x=x, y=y)  # recent seaborn versions require keyword arguments
ax.set_xlabel("Slope", size=15)
_ = ax.set_ylabel("Density", size=15)
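
Statements about "some subjects" and "a small number of subjects" can be made more quantitative using the normal distribution for the random slopes.  The sketch below does this for hypothetical values: the mean slope of 0.15 matches the approximate figure quoted in the text, while the standard deviation of 0.07 is an assumed value chosen for illustration.  In practice these would be replaced with `pa["age_cen"]` and `np.sqrt(pa["s Var"])` from the fitted model.

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

m = 0.15   # hypothetical mean slope (BMI units per year of age)
s = 0.07   # hypothetical SD of the random slopes (assumed for illustration)

# Proportion of subjects whose BMI slope on age is negative.
p_negative = norm_cdf((0 - m) / s)

# Range containing roughly 95% of the subject-specific slopes.
lo, hi = m - 2 * s, m + 2 * s

print(p_negative, lo, hi)
```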

The RLMS is a survey in which "primary sampling units" (PSUs) are
selected first, then subjects are selected randomly from within the
PSUs.  As a
result, subjects in the same PSU may be more similar to each other
than subjects in two different PSUs.  This creates correlation in the
data that should be accounted for.  One way to do this is to include a
random intercept for PSU, along with the random intercept and random
slope that we have already included for subjects.

In [None]:
vcf = {"i": "0 + C(idind)", "s": "0 + C(idind):age_cen",
       "ipsu": "1"}
model3 = sm.MixedLM.from_formula("bmi ~ age_cen*female + year_cen",
              groups="psu", vc_formula=vcf, data=dx)
result3 = model3.fit()
print(result3.summary())

The results above indicate that PSUs may differ slightly in terms of
BMI at the mean age (this is reflected in the `ipsu Var` term).  But
individuals within a PSU vary to a much greater degree (based on the
`i Var` term).  This can be visualized through the plot below, which
shows the distribution of BMI at the mean age for males for two
different PSUs, one of which is 1 SD above the population mean, and
one of which is 1 SD below the population mean.

In [None]:
pa = result3.params

v1 = pa["i Var"]
v2 = pa["ipsu Var"]

s1 = np.sqrt(v1)
s2 = np.sqrt(v2)

x = np.linspace(-3*s1, 3*s1, 100)

y = np.exp(-(x - s2)**2 / (2 * s1**2)) / np.sqrt(2 * np.pi * s1**2)
ax = sns.lineplot(x=x, y=y)  # recent seaborn versions require keyword arguments

y = np.exp(-(x + s2)**2 / (2 * s1**2)) / np.sqrt(2 * np.pi * s1**2)
ax = sns.lineplot(x=x, y=y)

ax.set_xlabel("Intercept (BMI at mean age)", size=15)
_ = ax.set_ylabel("Density", size=15)
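
Another way to summarize the two variance components is as the share of the variation in subject-level intercepts that is attributable to PSUs.  The simulation sketch below illustrates this with hypothetical variance components (16 for subjects and 1 for PSUs, chosen only for illustration); with a fitted model, `pa["i Var"]` and `pa["ipsu Var"]` would take their place.

```python
import numpy as np

rng = np.random.default_rng(3)
v_subj, v_psu = 16.0, 1.0   # hypothetical variance components
n_psu, n_per = 40, 500      # hypothetical numbers of PSUs and subjects per PSU

# Each subject's intercept deviation is a PSU effect plus a subject effect.
psu_eff = rng.normal(0, np.sqrt(v_psu), size=(n_psu, 1))
subj_eff = rng.normal(0, np.sqrt(v_subj), size=(n_psu, n_per))
intercepts = psu_eff + subj_eff

# Share of intercept variance due to PSUs, from the variance components.
share_psu = v_psu / (v_psu + v_subj)

# Rough empirical counterpart: variance of PSU means relative to total variance.
emp_share = intercepts.mean(axis=1).var(ddof=1) / intercepts.var(ddof=1)

print(share_psu, emp_share)
```

With these values, PSUs account for only about 6% of the variation in intercepts, which corresponds to two heavily overlapping density curves in a plot like the one above.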

Above we established that PSUs differ somewhat in terms of the BMI at
the mean age.  We can also consider the possibility that the rate at
which BMI changes with age differs by PSU.

In [None]:
vcf = {"i": "0 + C(idind)", "s": "0 + C(idind):age_cen",
       "ipsu": "1", "spsu": "0 + age_cen"}
model4 = sm.MixedLM.from_formula("bmi ~ age_cen*female + year_cen",
              groups="psu", vc_formula=vcf, data=dx)
result4 = model4.fit()
print(result4.summary())

The results above indicate that the variance parameter for
PSU-specific slopes is estimated to be nearly zero.  Since variances
cannot be negative, this implies that the parameter estimate is
converging to the boundary of the parameter space.  Most standard
optimization algorithms have trouble in this regime.  Also, the
reported standard errors are generally not meaningful when the
estimated parameters are on the boundary of the parameter space.  The
warning messages indicate that the variance parameter for PSU-specific
slopes is likely to be small.  In general, it is best to remove
parameters like this from the model.

## Other related methods

Above we focused exclusively on linear multilevel models.  These
models are appropriate for data in which the marginal and conditional
mean structures can be modeled as linear functions of the covariates
and random effects.  As in the case of regression with independent
observations, in many settings it is desirable to use a nonlinear
model.  In the independent data setting, logistic and Poisson
regression are useful nonlinear models.  Combining these ideas, there
is a class of models called "generalized linear mixed models".  These
models allow data to follow a generalized linear model conditionally
on the random effects.  Currently (February 2019), Statsmodels has
partial support for mixed binomial and Poisson regression.  This
capability is not in the main Statsmodels release, but can be
conducted using a development snapshot of Statsmodels from
[Github](https://github.com/statsmodels).

Another technique that is sometimes useful is conditional fixed
effects modeling.  The development snapshot of Statsmodels has support
for conditional logistic and Poisson regression.