# SATs example

This notebook provides code for the SATs school example presented in the paper "A Fisher Scoring approach for crossed multiple-factor Linear Mixed Models". The analysis replicates that of [West et al (2014)](https://www.routledge.com/Linear-Mixed-Models-A-Practical-Guide-Using-Statistical-Software-Second/West-Welch-Galecki/p/book/9781466560994) and uses data previously analysed by [Hong and Raudenbush](https://journals.sagepub.com/doi/10.3102/1076998607307355). The data come from [the longitudinal evaluation of school change and performance (LESCP) dataset](https://www2.ed.gov/offices/OUS/PES/esed/lescp_highlights.html). They are freely available online and can be found, for example, among the example datasets included with the HLM software package.

## Python Imports

In [None]:
# Package imports
import numpy as np
import scipy
import os
import sys
import pandas as pd
import time

# Import modules from elsewhere in the repository.
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(os.path.join(module_path,"src","SATExample"))
    sys.path.append(os.path.join(module_path,"src","lib"))
    
from genTestDat import genTestData2D, prodMats2D
from est2d import *
from npMatrix2d import *

## Load Dataset

Load the data for the school with ID 67 from the LESCP dataset.

In [None]:
# Read in the data for the school with ID 67
data = pd.read_csv(os.path.join(module_path,"data","SATExample","school67.csv"))
print(data)

## Helper function

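The function below recodes a factor so that its levels are represented by $\{0,1,...,l-1\}$, where $l$ is the number of levels. The zero-based coding allows the recoded factor to be used directly as a column index when building $Z$.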

In [None]:
def recodeFactor(factor):

    # New copy of factor vector
    factor = np.array(factor)

    # Work out unique levels of the factor
    uniqueValues = np.unique(factor)
    
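    # Loop through the levels, replacing them with a 0,...,l-1
    # coding, where l is the number of levels. Recoding in place
    # like this assumes the levels are non-negative integers, so
    # that a new code never collides with a level not yet recoded.
    for i in np.arange(len(uniqueValues)):

        factor[factor==uniqueValues[i]]=i

    return factor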
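
As an illustration, the same zero-based recoding can be obtained in one step from `np.unique` with `return_inverse=True`, which returns, for every entry, the position of its level among the sorted unique levels. The IDs below are made up for the example and are not taken from the dataset.

```python
import numpy as np

# Hypothetical teacher IDs (illustrative only, not from the LESCP data)
ids = np.array([17, 42, 17, 8, 42])

# levels: sorted unique IDs; codes: position of each entry's ID in `levels`
levels, codes = np.unique(ids, return_inverse=True)

print(levels)  # [ 8 17 42]
print(codes)   # [1 2 1 0 2]
```

Unlike the in-place loop, this version does not depend on the original levels being non-negative integers.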

## Preprocessing

Construct the response vector, $Y$, the fixed effects design matrix, $X$, and the random effects design matrix, $Z$.

In [None]:
# Number of observations (rows of the dataset) in model
ns = len(data)

# Work out factors for model
studfac = recodeFactor(np.array(data['studid'].values))
tchrfac = recodeFactor(np.array(data['tchrid'].values))

# Work out math and year for model
math = np.array(data['math'].values).reshape(len(data),1)
year = np.array(data['year'].values).reshape(len(data),1)

# Construct X for model
X = np.concatenate((np.ones((ns,1)),year),axis=1)
Y = math

# Work out Z for the first random factor; student
Z_f1 = np.zeros((ns,len(np.unique(studfac))))
Z_f1[np.arange(ns),studfac] = 1

# Work out Z for the second random factor; teacher
Z_f2 = np.zeros((ns,len(np.unique(tchrfac))))
Z_f2[np.arange(ns),tchrfac] = 1

# Construct Z for model
Z = np.concatenate((Z_f1,Z_f2),axis=1)
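
To make the structure of $Z$ concrete, the sketch below builds the same kind of indicator matrix for a small, made-up pair of crossed factors. The factor codes and the helper `indicator` are hypothetical and exist only for this illustration.

```python
import numpy as np

# Hypothetical 0-based factor codes for 5 observations (not the real data)
stud = np.array([0, 0, 1, 2, 2])   # 3 students
tchr = np.array([0, 1, 1, 0, 1])   # 2 teachers

def indicator(codes):
    """Return an n x (number of levels) 0/1 matrix with a single 1 per row."""
    Z = np.zeros((len(codes), len(np.unique(codes))))
    Z[np.arange(len(codes)), codes] = 1
    return Z

# Concatenate the blocks for the two crossed factors
Z_toy = np.concatenate((indicator(stud), indicator(tchr)), axis=1)

print(Z_toy.shape)        # (5, 5)
print(Z_toy.sum(axis=1))  # [2. 2. 2. 2. 2.]
```

Each row sums to 2 because every observation belongs to exactly one student and one teacher, and the number of columns is the total number of levels across the two factors.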

The following variables must be set before parameter estimation.

In [None]:
# Convergence tolerance
tol = 1e-6

# number of levels for each factor in the model
nlevels = np.array([len(np.unique(studfac)),len(np.unique(tchrfac))])

# number of random effects for each factor in the model
nraneffs = np.array([1,1])

## FS method

The code below computes the parameter estimates for the FS method from $X$, $Y$ and $Z$. The calculation is timed and performed in two stages: the first stage computes the product matrices $X'X, X'Y, X'Z, Y'Y, Y'Z$ and $Z'Z$, and the second stage performs parameter estimation.

In [None]:
# Obtain the product matrices and start recording the computation time
t1 = time.time()
XtX, XtY, XtZ, YtX, YtY, YtZ, ZtX, ZtY, ZtZ = prodMats2D(Y,Z,X)

# Run Fisher Scoring
paramVector_FS,_,nit,llh = FS2D(XtX, XtY, ZtX, ZtY, ZtZ, XtZ, YtZ, YtY, YtX, nlevels, nraneffs, tol, ns, init_paramVector=None)
t2 = time.time()
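
`prodMats2D` is defined elsewhere in the repository and is not reproduced here. As a rough sketch of what the first stage involves, the product matrices could be formed directly with NumPy as follows. `product_matrices` is a hypothetical helper written for this illustration, and the data are random.

```python
import numpy as np

def product_matrices(Y, Z, X):
    """Hypothetical helper: form the cross-products of X, Y and Z."""
    return {
        "XtX": X.T @ X, "XtY": X.T @ Y, "XtZ": X.T @ Z,
        "YtY": Y.T @ Y, "YtZ": Y.T @ Z, "ZtZ": Z.T @ Z,
    }

# Random toy data: n observations, p fixed effects, q random effects
rng = np.random.default_rng(0)
n, p, q = 10, 2, 4
X_toy = np.column_stack((np.ones(n), rng.normal(size=n)))
Z_toy = rng.integers(0, 2, size=(n, q)).astype(float)
Y_toy = rng.normal(size=(n, 1))

pm = product_matrices(Y_toy, Z_toy, X_toy)
for name, mat in pm.items():
    print(name, mat.shape)
```

The remaining matrices passed to the estimation function (`YtX`, `ZtX` and `ZtY`) are transposes of these. Note that none of the shapes depends on the number of observations.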

Below are the time taken for, and number of iterations required by, parameter estimation using the FS method*. Also given is the maximized log-likelihood.

In [None]:
print('Time taken:        ', t2-t1)
print('Number iterations: ', nit)
print('Log-likelihood:    ', llh-ns/2*np.log(2*np.pi))
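
The print statement above subtracts $\frac{n}{2}\log(2\pi)$ from `llh`, which suggests that the value returned by the estimation function omits this constant (an inference from the code, not something confirmed here). The sketch below uses a made-up mean and covariance matrix to show that adding the constant back to the remaining terms recovers the full Gaussian log-likelihood, as computed by `scipy.stats.multivariate_normal`.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
n = 6

# Made-up mean, covariance and response (illustrative only)
mu = np.zeros(n)
A = rng.normal(size=(n, n))
V = A @ A.T + n * np.eye(n)     # symmetric positive definite
y = rng.normal(size=n)

# Log-likelihood without the constant term
r = y - mu
_, logdetV = np.linalg.slogdet(V)
llh_no_const = -0.5 * (logdetV + r @ np.linalg.solve(V, r))

# Add the constant back, as in the print statement above
llh_full = llh_no_const - n / 2 * np.log(2 * np.pi)

print(np.isclose(llh_full, multivariate_normal.logpdf(y, mean=mu, cov=V)))  # True
```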

*This code has been modified for demonstration purposes and only performs parameter estimation once. When generating results for the paper "A Fisher Scoring approach for crossed multiple-factor Linear Mixed Effects Models", this code was run $50$ times with the average time taken across runs reported. This can be done by putting a for loop around the above two blocks of code.
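
A minimal sketch of such a loop is given below. To keep the example self-contained, `fit_once` is a hypothetical stand-in for the two estimation steps (computing the product matrices and running the estimation function); it is not part of the repository.

```python
import time
import numpy as np

def fit_once():
    """Hypothetical stand-in for the product-matrix and estimation steps."""
    rng = np.random.default_rng(0)
    A = rng.normal(size=(50, 50))
    return np.linalg.inv(A.T @ A + np.eye(50))

# Number of repetitions to average over
n_runs = 50
times = np.zeros(n_runs)

for i in range(n_runs):
    # Time a single run, exactly as in the cell above
    t1 = time.time()
    fit_once()
    t2 = time.time()
    times[i] = t2 - t1

print('Average time taken: ', times.mean())
```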

The below are the fixed effects estimates:

In [None]:
print('beta 0 (Intercept): ', paramVector_FS[0,0])
print('beta 1 (Year):      ', paramVector_FS[1,0])

The below are the variance components:

In [None]:
print('Sigma^2_A: ', paramVector_FS[3,0]*paramVector_FS[2,0])
print('Sigma^2_C: ', paramVector_FS[4,0]*paramVector_FS[2,0])
print('Sigma^2_E: ', paramVector_FS[2,0])
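
The indexing above suggests that the parameter vector stores the fixed effects first, then the residual variance, then each random-effect variance divided by the residual variance. Under that assumption (inferred from the print statements, not from the library's documentation), the estimates could be unpacked as in the sketch below. `unpack_params` is a hypothetical helper and the numbers are made up.

```python
import numpy as np

def unpack_params(param_vector, p=2):
    """Hypothetical helper: split a (p+3) x 1 parameter vector into
    fixed effects, two variance components and the residual variance."""
    beta = param_vector[:p, 0]
    sigma2_E = param_vector[p, 0]
    # Entries after sigma^2_E are assumed to be scaled by sigma^2_E
    sigma2_A = param_vector[p + 1, 0] * sigma2_E
    sigma2_C = param_vector[p + 2, 0] * sigma2_E
    return beta, sigma2_A, sigma2_C, sigma2_E

# Made-up values, for illustration only
toy = np.array([[600.0], [30.0], [400.0], [2.0], [0.5]])
beta, s2A, s2C, s2E = unpack_params(toy)
print(beta, s2A, s2C, s2E)
```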

## FFS method

The code below computes the parameter estimates for the FFS method from $X$, $Y$ and $Z$. The calculation is timed and performed in two stages: the first stage computes the product matrices $X'X, X'Y, X'Z, Y'Y, Y'Z$ and $Z'Z$, and the second stage performs parameter estimation.

In [None]:
# Obtain the product matrices and start recording the computation time
t1 = time.time()
XtX, XtY, XtZ, YtX, YtY, YtZ, ZtX, ZtY, ZtZ = prodMats2D(Y,Z,X)

# Run Fisher Scoring
paramVector_FFS,_,nit,llh = fFS2D(XtX, XtY, ZtX, ZtY, ZtZ, XtZ, YtZ, YtY, YtX, nlevels, nraneffs, tol, ns, init_paramVector=None)
t2 = time.time()

Below are the time taken for, and number of iterations required by, parameter estimation using the FFS method*. Also given is the maximized log-likelihood.

In [None]:
print('Time taken:        ', t2-t1)
print('Number iterations: ', nit)
print('Log-likelihood:    ', llh-ns/2*np.log(2*np.pi))

*This code has been modified for demonstration purposes and only performs parameter estimation once. When generating results for the paper "A Fisher Scoring approach for crossed multiple-factor Linear Mixed Effects Models", this code was run $50$ times with the average time taken across runs reported. This can be done by putting a for loop around the above two blocks of code.

The below are the fixed effects estimates:

In [None]:
print('beta 0 (Intercept): ', paramVector_FFS[0,0])
print('beta 1 (Year):      ', paramVector_FFS[1,0])

The below are the variance components:

In [None]:
print('Sigma^2_A: ', paramVector_FFS[3,0]*paramVector_FFS[2,0])
print('Sigma^2_C: ', paramVector_FFS[4,0]*paramVector_FFS[2,0])
print('Sigma^2_E: ', paramVector_FFS[2,0])

## SFS method

The code below computes the parameter estimates for the SFS method from $X$, $Y$ and $Z$. The calculation is timed and performed in two stages: the first stage computes the product matrices $X'X, X'Y, X'Z, Y'Y, Y'Z$ and $Z'Z$, and the second stage performs parameter estimation.

In [None]:
# Obtain the product matrices and start recording the computation time
t1 = time.time()
XtX, XtY, XtZ, YtX, YtY, YtZ, ZtX, ZtY, ZtZ = prodMats2D(Y,Z,X)

# Run Fisher Scoring
paramVector_SFS,_,nit,llh = SFS2D(XtX, XtY, ZtX, ZtY, ZtZ, XtZ, YtZ, YtY, YtX, nlevels, nraneffs, tol, ns, init_paramVector=None)
t2 = time.time()

Below are the time taken for, and number of iterations required by, parameter estimation using the SFS method*. Also given is the maximized log-likelihood.

In [None]:
print('Time taken:        ', t2-t1)
print('Number iterations: ', nit)
print('Log-likelihood:    ', llh-ns/2*np.log(2*np.pi))

*This code has been modified for demonstration purposes and only performs parameter estimation once. When generating results for the paper "A Fisher Scoring approach for crossed multiple-factor Linear Mixed Effects Models", this code was run $50$ times with the average time taken across runs reported. This can be done by putting a for loop around the above two blocks of code.

The below are the fixed effects estimates:

In [None]:
print('beta 0 (Intercept): ', paramVector_SFS[0,0])
print('beta 1 (Year):      ', paramVector_SFS[1,0])

The below are the variance components:

In [None]:
print('Sigma^2_A: ', paramVector_SFS[3,0]*paramVector_SFS[2,0])
print('Sigma^2_C: ', paramVector_SFS[4,0]*paramVector_SFS[2,0])
print('Sigma^2_E: ', paramVector_SFS[2,0])

## FSFS method

The code below computes the parameter estimates for the FSFS method from $X$, $Y$ and $Z$. The calculation is timed and performed in two stages: the first stage computes the product matrices $X'X, X'Y, X'Z, Y'Y, Y'Z$ and $Z'Z$, and the second stage performs parameter estimation.

In [None]:
# Obtain the product matrices and start recording the computation time
t1 = time.time()
XtX, XtY, XtZ, YtX, YtY, YtZ, ZtX, ZtY, ZtZ = prodMats2D(Y,Z,X)

# Run Fisher Scoring
paramVector_FSFS,_,nit,llh = fSFS2D(XtX, XtY, ZtX, ZtY, ZtZ, XtZ, YtZ, YtY, YtX, nlevels, nraneffs, tol, ns, init_paramVector=None)
t2 = time.time()

Below are the time taken for, and number of iterations required by, parameter estimation using the FSFS method*. Also given is the maximized log-likelihood.

In [None]:
print('Time taken:        ', t2-t1)
print('Number iterations: ', nit)
print('Log-likelihood:    ', llh-ns/2*np.log(2*np.pi))

*This code has been modified for demonstration purposes and only performs parameter estimation once. When generating results for the paper "A Fisher Scoring approach for crossed multiple-factor Linear Mixed Effects Models", this code was run $50$ times with the average time taken across runs reported. This can be done by putting a for loop around the above two blocks of code.

The below are the fixed effects estimates:

In [None]:
print('beta 0 (Intercept): ', paramVector_FSFS[0,0])
print('beta 1 (Year):      ', paramVector_FSFS[1,0])

The below are the variance components:

In [None]:
print('Sigma^2_A: ', paramVector_FSFS[3,0]*paramVector_FSFS[2,0])
print('Sigma^2_C: ', paramVector_FSFS[4,0]*paramVector_FSFS[2,0])
print('Sigma^2_E: ', paramVector_FSFS[2,0])

## CSFS method

The code below computes the parameter estimates for the CSFS method from $X$, $Y$ and $Z$. The calculation is timed and performed in two stages: the first stage computes the product matrices $X'X, X'Y, X'Z, Y'Y, Y'Z$ and $Z'Z$, and the second stage performs parameter estimation.

In [None]:
# Obtain the product matrices and start recording the computation time
t1 = time.time()
XtX, XtY, XtZ, YtX, YtY, YtZ, ZtX, ZtY, ZtZ = prodMats2D(Y,Z,X)

# Run Fisher Scoring
paramVector_CSFS,_,nit,llh = cSFS2D(XtX, XtY, ZtX, ZtY, ZtZ, XtZ, YtZ, YtY, YtX, nlevels, nraneffs, tol, ns, init_paramVector=None)
t2 = time.time()

Below are the time taken for, and number of iterations required by, parameter estimation using the CSFS method*. Also given is the maximized log-likelihood.

In [None]:
print('Time taken:        ', t2-t1)
print('Number iterations: ', nit)
print('Log-likelihood:    ', llh-ns/2*np.log(2*np.pi))

*This code has been modified for demonstration purposes and only performs parameter estimation once. When generating results for the paper "A Fisher Scoring approach for crossed multiple-factor Linear Mixed Effects Models", this code was run $50$ times with the average time taken across runs reported. This can be done by putting a for loop around the above two blocks of code.

The below are the fixed effects estimates:

In [None]:
print('beta 0 (Intercept): ', paramVector_CSFS[0,0])
print('beta 1 (Year):      ', paramVector_CSFS[1,0])

The below are the variance components:

In [None]:
print('Sigma^2_A: ', paramVector_CSFS[3,0]*paramVector_CSFS[2,0])
print('Sigma^2_C: ', paramVector_CSFS[4,0]*paramVector_CSFS[2,0])
print('Sigma^2_E: ', paramVector_CSFS[2,0])

## *lmer* method

Because *lmer* is an R function, the code used to fit this model with *lmer* is not included in this notebook. It can instead be found in the file "src/SATExample/SATsExample.R".