# Fitting a Multilevel Model
This analysis focuses on a longitudinal study of children with autism<sup>1</sup>. We will look at several variables and explore how different factors relate to the socialization of a child with autism during the early stages of life.

The variables we have from the study are:

* AGE is the age of a child which, for this dataset, is between two and thirteen years 
* VSAE measures a child's socialization 
* SICDEGP is the expressive language group at age two and can take on values ranging from one to three. Higher values indicate more expressive language.
* CHILDID is the unique ID that is given to each child and acts as their identifier within the dataset

We will first be fitting a multilevel model with explicit random effects of the children to account for the fact that we have repeated measurements on each child, which introduces correlation in our observations.

<sup>1</sup> Anderson, D., Oti, R., Lord, C., and Welch, K. (2009). Patterns of growth in adaptive social abilities among children with autism spectrum disorders. Journal of Abnormal Child Psychology, 37(7), 1019-1034.

### Importing Data and Packages
Before we begin, we need to include a few packages that will make working with the data a little easier.

In [1]:
# Upgrade to statsmodels 0.9.0
#!pip install --upgrade --user statsmodels

# https://stackoverflow.com/questions/34444607/how-to-ignore-statsmodels-maximum-likelihood-convergence-warning?noredirect=1&lq=1
import warnings
from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter('ignore', ConvergenceWarning)

# Import the libraries that we will need for the analysis
import csv 
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn import linear_model
import matplotlib.pyplot as plt

import patsy
from scipy.stats import chi2 # for sig testing
from IPython.display import display, HTML # for pretty printing

# Read in the Autism Data
dat = pd.read_csv("autism.csv")

# Drop NA's from the data
dat = dat.dropna()

In [2]:
# Print out the first few rows of the data
dat.head()

Unnamed: 0,age,vsae,sicdegp,childid
0,2,6.0,3,1
1,3,7.0,3,1
2,5,18.0,3,1
3,9,25.0,3,1
4,13,27.0,3,1


### Fit the Model without Centering
We begin by fitting the model without centering the age variable. This model has both random intercepts and random slopes on age.

In [3]:
# child's socialization against "age" and the "expressive language group"

# Build the model
mlm_mod = sm.MixedLM.from_formula(formula = 'vsae ~ age * C(sicdegp)', 
                                  groups = 'childid', 
                                  re_formula="1 + age", 
                                  data=dat)

# Run the fit
mlm_result = mlm_mod.fit()

# Print out the summary of the fit
mlm_result.summary()



0,1,2,3
Model:,MixedLM,Dependent Variable:,vsae
No. Observations:,610,Method:,REML
No. Groups:,158,Scale:,62.2945
Min. group size:,1,Likelihood:,-2348.7980
Max. group size:,5,Converged:,No
Mean group size:,3.9,,

0,1,2,3,4,5,6
,Coef.,Std.Err.,z,P>|z|,[0.025,0.975]
Intercept,1.901,1.601,1.188,0.235,-1.236,5.039
C(sicdegp)[T.2],-0.416,2.110,-0.197,0.844,-4.550,3.719
C(sicdegp)[T.3],-3.918,2.346,-1.670,0.095,-8.516,0.681
age,2.957,0.593,4.988,0.000,1.795,4.118
age:C(sicdegp)[T.2],0.741,0.783,0.946,0.344,-0.794,2.277
age:C(sicdegp)[T.3],4.356,0.868,5.017,0.000,2.654,6.058
childid Var,58.305,3.003,,,,
childid x age Cov,-28.728,0.698,,,,
age Var,14.186,0.283,,,,


We can see that the **model** fails to **converge**. Taking a step back, and thinking about the data, how should we expect children's socialization to vary at age zero? Would we expect the children to exhibit different socialization when they are first born? Or is the difference in socialization something that we would expect to manifest over time? 

We would **expect** the **socialization differences to be negligible at age zero** or, at the very least, **difficult to discern**. This homogeneity of newborns implies that the **variance of the random intercept would be close to zero**, and, as a result, the model has difficulty estimating the variance parameter of the random intercept. It may not make sense to include a random intercept in this model. **We will drop the random intercept and refit the model to see whether the convergence problem persists.**

In [4]:
# Build the model - note the re_formula definition now 
# has a 0 instead of a 1. This removes the intercept from 
# the model
mlm_mod = sm.MixedLM.from_formula(formula = 'vsae ~ age * C(sicdegp)', 
                                  groups = 'childid', 
                                  re_formula="0 + age", 
                                  data=dat)

# Run the fit
mlm_result = mlm_mod.fit()

# Print out the summary of the fit
mlm_result.summary()

0,1,2,3
Model:,MixedLM,Dependent Variable:,vsae
No. Observations:,610,Method:,REML
No. Groups:,158,Scale:,84.5319
Min. group size:,1,Likelihood:,-2427.0905
Max. group size:,5,Converged:,Yes
Mean group size:,3.9,,

0,1,2,3,4,5,6
,Coef.,Std.Err.,z,P>|z|,[0.025,0.975]
Intercept,2.482,1.271,1.952,0.051,-0.010,4.973
C(sicdegp)[T.2],-1.293,1.674,-0.773,0.440,-4.574,1.987
C(sicdegp)[T.3],-4.230,1.862,-2.272,0.023,-7.880,-0.580
age,2.822,0.470,6.006,0.000,1.901,3.743
age:C(sicdegp)[T.2],0.985,0.620,1.589,0.112,-0.230,2.199
age:C(sicdegp)[T.3],4.463,0.688,6.482,0.000,3.113,5.812
age Var,8.198,0.124,,,,


The model now converges, which is an indication that removing the random intercepts from the model was beneficial computationally. 

First, we notice that the **interaction term "age:C(sicdegp)[T.3]"** between the expressive language group and the age of children is **positive and significant for the third expressive language group**. This is an indication that the **increase** in socialization as a function of age for this group is **significantly larger relative to the first expressive language group** (i.e., the age slope is significantly larger for this group relative to the first expressive language group).

When we think about the interpretation of the parameters, however, we need to be cautious. 

The **intercept can be interpreted as the mean socialization when a child in the first expressive language group is zero years old. This may not be sensible to estimate. To improve this interpretation, we should center the age variable** and, again, fit the model.
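As a small worked example of these interpretations, the sketch below combines the fixed-effect estimates reported in the uncentered, slope-only summary above. The helper name `predicted_vsae` is hypothetical, and the calculation sets each child's random slope to zero, so it describes only the fixed-effects part of the model.

```python
# Fixed-effect estimates copied from the uncentered, slope-only fit above
coef = {
    "Intercept": 2.482,
    "C(sicdegp)[T.2]": -1.293,
    "C(sicdegp)[T.3]": -4.230,
    "age": 2.822,
    "age:C(sicdegp)[T.2]": 0.985,
    "age:C(sicdegp)[T.3]": 4.463,
}

def predicted_vsae(age, group):
    """Fixed-effects prediction of VSAE for language group 1, 2, or 3.

    Hypothetical helper: the random age slope is set to zero.
    """
    intercept = coef["Intercept"]
    slope = coef["age"]
    if group != 1:
        # Groups 2 and 3 shift both the intercept and the age slope
        intercept += coef[f"C(sicdegp)[T.{group}]"]
        slope += coef[f"age:C(sicdegp)[T.{group}]"]
    return intercept + slope * age

# Age slope implied for each expressive language group
age_slopes = {g: predicted_vsae(1, g) - predicted_vsae(0, g) for g in (1, 2, 3)}

print("Group 1 at age 0:", round(predicted_vsae(0, 1), 3))
print("Age slopes by group:", {g: round(s, 3) for g, s in age_slopes.items()})
```

The group 1 prediction at age zero is just the intercept, which is the quantity that is hard to interpret for this dataset because no child is observed before age two.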

In [5]:
# Center the age variable within each child:
# group the rows by childid, then subtract that child's mean age
# from each of the child's age values
dat["age"] = dat.groupby("childid")["age"].transform(lambda x: x - x.mean())

# Print out the head of the dataset to see the centered measure
dat.head()

Unnamed: 0,age,vsae,sicdegp,childid
0,-4.4,6.0,3,1
1,-3.4,7.0,3,1
2,-1.4,18.0,3,1
3,2.6,25.0,3,1
4,6.6,27.0,3,1


In [6]:
# testing
dat

Unnamed: 0,age,vsae,sicdegp,childid
0,-4.400000,6.0,3,1
1,-3.400000,7.0,3,1
2,-1.400000,18.0,3,1
3,2.600000,25.0,3,1
4,6.600000,27.0,3,1
5,-4.400000,17.0,3,3
6,-3.400000,18.0,3,3
7,-1.400000,12.0,3,3
8,2.600000,18.0,3,3
9,6.600000,24.0,3,3


In [7]:
# Refit the model, again, without the random intercepts
mlm_mod = sm.MixedLM.from_formula(formula = 'vsae ~ age * C(sicdegp)', 
                                  groups = 'childid', 
                                  re_formula="0 + age", 
                                  data=dat)

# Run the fit
mlm_result = mlm_mod.fit()

# Print out the summary of the fit
mlm_result.summary()

0,1,2,3
Model:,MixedLM,Dependent Variable:,vsae
No. Observations:,610,Method:,REML
No. Groups:,158,Scale:,410.7496
Min. group size:,1,Likelihood:,-2752.2106
Max. group size:,5,Converged:,Yes
Mean group size:,3.9,,

0,1,2,3,4,5,6
,Coef.,Std.Err.,z,P>|z|,[0.025,0.975]
Intercept,17.421,1.470,11.848,0.000,14.539,20.303
C(sicdegp)[T.2],6.359,1.942,3.274,0.001,2.552,10.166
C(sicdegp)[T.3],23.403,2.157,10.852,0.000,19.176,27.630
age,2.731,0.641,4.257,0.000,1.474,3.988
age:C(sicdegp)[T.2],1.188,0.843,1.409,0.159,-0.465,2.840
age:C(sicdegp)[T.3],4.555,0.931,4.891,0.000,2.730,6.381
age Var,9.609,0.112,,,,


Now, our **intercept represents the mean socialization of the children at the mean age** for their measurements. For most children, this corresponds to socialization at around 6.5 years of age. 
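To see where the value of roughly 6.5 years comes from, and why it holds only for most children, here is a minimal sketch on a hypothetical two-child frame. The frame `toy` and the column `age_centered` are illustrative names, not part of the dataset. Child 1 has all five measurement ages seen in the data, while child 2 is assumed to have only the first three.

```python
import pandas as pd

# Hypothetical data: child 1 has a complete schedule, child 2 does not
toy = pd.DataFrame({
    "childid": [1] * 5 + [2] * 3,
    "age": [2, 3, 5, 9, 13, 2, 3, 5],
})

# Mean age per child, broadcast back to every row of that child
child_mean_age = toy.groupby("childid")["age"].transform("mean")

# Within-child centering: each child's ages now average to zero
toy["age_centered"] = toy["age"] - child_mean_age

print(toy.assign(child_mean_age=child_mean_age))
```

Under these assumptions the complete schedule centers at 6.4 years, but a child with fewer measurements is centered at a different age, so the intercept refers to a slightly different age for such children.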

### Significance Testing
The next question that we need to ask is **if the addition of the random age effects is actually significant**; should we retain these random effects in the model? First, we will fit the multilevel model including centered age again. **This time, however, we will compare it to the model that does not have random effects:** 


In [8]:
# Random Effects Mixed Model
mlm_mod = sm.MixedLM.from_formula(formula = 'vsae ~ age * C(sicdegp)', 
                                  groups = 'childid', 
                                  re_formula="0 + age", 
                                  data=dat)

# OLS model - no mixed effects
ols_mod = sm.OLS.from_formula(formula = "vsae ~ age * C(sicdegp)",
                              data = dat)

# Run each of the fits
mlm_result = mlm_mod.fit()
ols_result = ols_mod.fit()

# Print out the summary of the fit
print(mlm_result.summary())
print(ols_result.summary())

            Mixed Linear Model Regression Results
Model:              MixedLM   Dependent Variable:   vsae      
No. Observations:   610       Method:               REML      
No. Groups:         158       Scale:                410.7496  
Min. group size:    1         Likelihood:           -2752.2106
Max. group size:    5         Converged:            Yes       
Mean group size:    3.9                                       
--------------------------------------------------------------
                    Coef.  Std.Err.   z    P>|z| [0.025 0.975]
--------------------------------------------------------------
Intercept           17.421    1.470 11.848 0.000 14.539 20.303
C(sicdegp)[T.2]      6.359    1.942  3.274 0.001  2.552 10.166
C(sicdegp)[T.3]     23.403    2.157 10.852 0.000 19.176 27.630
age                  2.731    0.641  4.257 0.000  1.474  3.988
age:C(sicdegp)[T.2]  1.188    0.843  1.409 0.159 -0.465  2.840
age:C(sicdegp)[T.3]  4.555    0.931  4.891 0.000  2.730  6.381
age V

Now, we perform the significance test with a mixture of chi-squared distributions. We repeat the information from the Likelihood Ratio Tests writeup for this week here:

* Null hypothesis: The variance of the random child effects on the slope of interest is zero (in other words, these random effects on the slope are not needed in the model)
* Alternative hypothesis: The **variance of the random child effects on the slope of interest** is greater than zero


* First, **fit the model WITH random child effects** on the slope of interest, using restricted maximum likelihood estimation
    * -2 REML log-likelihood = 4854.18
* Next, fit the nested model **WITHOUT the random child effects** on the slope:
    * -2 REML log-likelihood = 5524.20 (higher value = worse fit!)
* Compute the **positive difference** in the -2 REML log-likelihood values (“REML criterion”) for the models:
    * Test Statistic (TS) = 5524.20 – 4854.18 = 670.02
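* Refer the TS to a mixture of chi-square distributions with equal weights of 0.5. Because the model fitted here has a single random-effect variance (the age slope, with no random intercept and therefore no covariance), the mixture uses 0 and 1 DF: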


In [9]:
# Compute the p-value using a mixture of chi-squared distributions.
# The chi-squared distribution with zero degrees of freedom is a point mass
# at zero, so it adds nothing to the upper tail of a positive test statistic;
# only the one-degree-of-freedom component, weighted by 0.5, remains
pval = 0.5 * (1 - chi2.cdf(670.02, 1)) 
print("The p-value of our significance test is: {0}".format(pval))

The p-value of our significance test is: 0.0


The p-value is so small that we cannot distinguish it from zero. With a p-value this small, we can safely reject the null hypothesis. We have sufficient evidence to conclude that the variance of the random effects on the slope of interest is greater than zero.
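The printed value of exactly 0.0 comes from floating-point rounding in `1 - chi2.cdf(...)`. The sketch below is one possible alternative that uses SciPy's survival function `chi2.sf`, which evaluates the upper tail directly. The helper `mixture_pvalue` is a hypothetical name, and it takes the two DF values as arguments so that different mixtures can be evaluated.

```python
from scipy.stats import chi2

def mixture_pvalue(test_stat, df_low, df_high):
    """Upper-tail p-value under a 50:50 mixture of two chi-squared distributions."""
    def upper_tail(df):
        if df == 0:
            # Chi-squared with 0 DF is a point mass at zero
            return 1.0 if test_stat <= 0 else 0.0
        # Survival function: P(X > test_stat), computed without 1 - cdf
        return chi2.sf(test_stat, df)
    return 0.5 * upper_tail(df_low) + 0.5 * upper_tail(df_high)

# Test statistic from the -2 REML log-likelihood values quoted above
ts = round(5524.20 - 4854.18, 2)

pval = mixture_pvalue(ts, 0, 1)
print("Test statistic:", ts)
print("Mixture p-value:", pval)
```

The result is still vanishingly small, so the conclusion does not change; the survival function only avoids reporting the p-value as an exact zero.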

# Marginal Models

While we have accounted for correlation among observations from the same children using random age effects in the multilevel model, marginal models attempt to manage the correlation in a slightly different manner. This process of fitting a marginal model, utilizing a method known as Generalized Estimating Equations (GEEs), aims to explicitly model the within-child correlations of the observations. 

We will specify two types of covariance structures for this analysis. The first will be an exchangeable model. In the exchangeable model, the observations within a child have a constant correlation, and constant variance.

The other covariance structure that we will assume is independence. An independent covariance matrix implies that observations within the same child have zero correlation.

We will see how each of these covariance structures affects the fit of the model.
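To make the two structures concrete, the following sketch builds the working correlation matrix that each one implies for a single child with five measurements. The function names and the correlation value `rho = 0.3` are assumptions chosen for illustration; in a GEE fit the exchangeable correlation is estimated from the data.

```python
import numpy as np

def exchangeable_corr(n_obs, rho):
    """Working correlation with the same correlation rho between every pair."""
    return (1 - rho) * np.eye(n_obs) + rho * np.ones((n_obs, n_obs))

def independence_corr(n_obs):
    """Working correlation with zero correlation between observations."""
    return np.eye(n_obs)

rho = 0.3  # hypothetical within-child correlation
print("Exchangeable:\n", exchangeable_corr(5, rho))
print("Independence:\n", independence_corr(5))
```

Both matrices have ones on the diagonal; they differ only in whether the off-diagonal entries are a shared constant or zero.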






In [10]:
# Fit the exchangeable covariance GEE
model_exch = sm.GEE.from_formula(
    formula = "vsae ~ age * C(sicdegp)",
    groups="childid",
    cov_struct=sm.cov_struct.Exchangeable(), 
    data=dat
    ).fit()

# Fit the independent covariance GEE
model_indep = sm.GEE.from_formula(
    "vsae ~ age * C(sicdegp)",
    groups="childid",
    cov_struct = sm.cov_struct.Independence(), 
    data=dat
    ).fit()

# We cannot fit an autoregressive model, but this is how 
# we would fit it if we had equally spaced ages
# (the observations are still grouped by child)
# model_ar = sm.GEE.from_formula(
#     "vsae ~ age * C(sicdegp)",
#     groups="childid",
#     cov_struct = sm.cov_struct.Autoregressive(), 
#     data=dat
#     ).fit()

The autoregressive model cannot be fit because the ages are not uniformly spaced across each child's measurements (for example, every year or every two years). If they were, we could fit the model with the commented code above. We will now see how the model fits compare to one another:

In [11]:
# Construct a dataframe of the parameter estimates and their standard errors
x = pd.DataFrame(
    {
        "OLS_Params": ols_result.params,
        "OLS_SE": ols_result.bse,
        "MLM_Params": mlm_result.params,
        "MLM_SE": mlm_result.bse,
        "GEE_Exch_Params": model_exch.params, 
        "GEE_Exch_SE": model_exch.bse,
        "GEE_Indep_Params": model_indep.params, 
        "GEE_Indep_SE": model_indep.bse
    }
)

# Ensure the ordering is logical
x = x[["OLS_Params", "OLS_SE","MLM_Params", "MLM_SE","GEE_Exch_Params", 
       "GEE_Exch_SE", "GEE_Indep_Params", "GEE_Indep_SE"]]

# Round the results of the estimates to two decimal places
x = np.round(x, 2)
# Print out the results in a pretty way
display(HTML(x.to_html()))

Unnamed: 0,OLS_Params,OLS_SE,MLM_Params,MLM_SE,GEE_Exch_Params,GEE_Exch_SE,GEE_Indep_Params,GEE_Indep_SE
C(sicdegp)[T.2],6.36,2.24,6.36,1.94,5.64,2.95,6.36,2.88
C(sicdegp)[T.3],23.4,2.48,23.4,2.16,22.65,3.55,23.4,3.45
Intercept,17.42,1.69,17.42,1.47,17.59,1.99,17.42,1.91
age,2.6,0.46,2.73,0.64,2.6,0.51,2.6,0.51
age Var,,,0.02,0.01,,,,
age:C(sicdegp)[T.2],1.47,0.6,1.19,0.84,1.47,0.78,1.47,0.78
age:C(sicdegp)[T.3],4.48,0.66,4.56,0.93,4.48,0.88,4.48,0.88


We can see that the estimates for the **parameters are relatively consistent among each of the modeling methodologies**, but the **standard errors differ** from model to model. 

* Overall, the **two GEE models are mostly similar** and both exhibit **standard errors** for parameters that are **slightly larger than each of their corresponding values in the OLS model**. 
* The **multilevel model has the largest standard error for the age coefficient**, but the **smallest standard error for the intercept**. 
* Overall, we see that we would make similar inferences regarding the importance of these fixed effects, but remember that we **need to interpret the multilevel model's estimates conditional on a given child**. 

For example:
* Considering the age coefficient in the multilevel model, we would say that as age increases by one year *for a given child* (that is, within the same cluster) in the first expressive language group, VSAE is expected to increase by 2.73. 
* In the GEE and OLS models, we would say that as age increases by one year *in general* for the first expressive language group, the average VSAE is expected to increase by 2.60.