# Parameter Estimation Simulations

This notebook provides an overview of how the parameter estimation simulations presented in the paper "A Fisher Scoring approach for crossed multiple-factor Linear Mixed Models" were performed using the code available in this repository. 

As the simulations in the paper were run on an SGE cluster (and the scripts used were specific to that cluster and the software employed), we do not provide scripts for submitting jobs to a cluster. Instead, we provide details on how to run individual simulations and how to combine the results from simulations which have already been run. This notebook should make it clear how the simulations can be run on a cluster but, to do so, you will need to write your own cluster-specific scripts for the actual job submission. The file `src/Simulations/LMMPaperSim.sh` provides some initial bash scripts which may help you to do this.

## Python Imports

In [None]:
# Package imports
import numpy as np
import scipy
import os
import shutil
import glob
import sys
import pandas as pd
import time

# Import modules from elsewhere in the repository.
module_path = os.path.abspath(os.path.join('..'))
for sub_dir in ("Simulations", "lib"):
    src_path = os.path.join(module_path, "src", sub_dir)
    # Only add each source directory to the search path once.
    if src_path not in sys.path:
        sys.path.append(src_path)
    
from genTestDat import genTestData2D, prodMats2D
from LMMPaperSim import *
from est2d import *
from npMatrix2d import *

## Running a simulation

The function `runSim`, which can be found in `LMMPaperSim.py`, can be used to run a single simulation instance. It takes the following inputs:

 - `simInd`: An index to represent the simulation. All output for this simulation will be saved in files with the index specified by this argument. The simulation with index 1 will also perform any necessary additional setup and should therefore be run before any others.
 - `desInd`: Integer value between 1 and 3 representing which design to run. The designs are as follows:
   - `Design 1: nlevels=[50], nraneffs=[2]`
   - `Design 2: nlevels=[50,10], nraneffs=[3,2]`
   - `Design 3: nlevels=[100,50,10], nraneffs=[4,3,2]`
 - `OutDir`: The output directory.
 - `mode`: String indicating whether to run parameter estimation simulations (`mode='param'`) or T statistic simulations (`mode='Tstat'`).
 - `REML`: Boolean indicating whether to use ReML estimation (`REML=True`) or ML estimation (`REML=False`).
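
To give a feel for the size of each design listed above, the following is a minimal sketch that computes the number of columns in the random effects design matrix and the number of unique covariance parameters. It assumes the usual LMM convention that each factor contributes `nlevels × nraneffs` columns and that each $D_k$ is a symmetric `nraneffs × nraneffs` matrix; the helper `design_sizes` is a hypothetical name and is not part of the repository.

```python
# Design settings copied from the list above.
designs = {
    1: {"nlevels": [50], "nraneffs": [2]},
    2: {"nlevels": [50, 10], "nraneffs": [3, 2]},
    3: {"nlevels": [100, 50, 10], "nraneffs": [4, 3, 2]},
}

def design_sizes(nlevels, nraneffs):
    """Return (number of random effects columns, number of unique elements of D).

    Hypothetical helper: assumes each factor contributes nlevels * nraneffs
    columns and each D_k is symmetric with nraneffs * (nraneffs + 1) / 2
    unique elements.
    """
    q = sum(l * r for l, r in zip(nlevels, nraneffs))
    n_cov = sum(r * (r + 1) // 2 for r in nraneffs)
    return q, n_cov

for des_ind, d in designs.items():
    q, n_cov = design_sizes(d["nlevels"], d["nraneffs"])
    print(f"Design {des_ind}: q = {q}, unique elements of D = {n_cov}")
```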

An example of how this function can be used is given below:

In [None]:
# Run a simulation with index 1
simInd = 1

# Use design 2
desInd = 2

# Set output directory
OutDir = os.path.join(module_path,"data","ParamSimulation")

# Make the output directory (and any missing parent directories) if it doesn't exist already.
os.makedirs(OutDir, exist_ok=True)

# Run a simulation
runSim(simInd, desInd, OutDir, mode='param', REML=False)

The example output from running one simulation is given below. The files output for the $i^{th}$ simulation running the $j^{th}$ design are as follows:

- `Sim{i}_Design{j}_X.csv`: The fixed effects design used for the simulation.
- `Sim{i}_Design{j}_Y.csv`: The response vector used for the simulation.
- `Sim{i}_Design{j}_Zfactor{k}.csv`: The factor vector for the $k^{th}$ random factor (see Appendix $6.1$ of the main paper for more details).
- `Sim{i}_Design{j}_Zdata{k}.csv`: The raw regressor matrix for the $k^{th}$ random factor (see Appendix $6.1$ of the main paper for more details).
- `Sim{i}_Design{j}_results.csv`: The results of the simulation.
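
The naming scheme above can be used to group an output directory listing by simulation and design. The following is a small sketch which assumes only the `Sim{i}_Design{j}_<name>.csv` pattern described above; `group_outputs` is a hypothetical helper and is not part of the repository.

```python
import re
from collections import defaultdict

# Matches e.g. "Sim1_Design2_results.csv" and captures (1, 2, "results").
FILE_PATTERN = re.compile(r"^Sim(\d+)_Design(\d+)_(\w+)\.csv$")

def group_outputs(filenames):
    """Group output file names by (simulation index, design index)."""
    grouped = defaultdict(list)
    for name in filenames:
        match = FILE_PATTERN.match(name)
        if match is None:
            # Skip anything that does not follow the naming scheme.
            continue
        key = (int(match.group(1)), int(match.group(2)))
        grouped[key].append(match.group(3))
    return {key: sorted(kinds) for key, kinds in grouped.items()}

# Illustrative file names only; in practice pass os.listdir(OutDir).
example = ["Sim1_Design2_X.csv", "Sim1_Design2_Y.csv",
           "Sim1_Design2_results.csv", "notes.txt"]
print(group_outputs(example))
```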

In [None]:
# List contents of output directory.
os.listdir(OutDir)

The below provides a display of what can be found in the 'results' file listed above. The first column provides the ground truth used for this simulation and the other columns provide the results of each Fisher Scoring method. The row indices are:

 - `Time`: Time (in seconds) taken for computation.
 - `nit`: Number of iterations performed.
 - `llh`: Maximised (restricted) log-likelihood value.
 - `beta{i}`: The estimate of the $i^{th}$ fixed effects parameter, $\beta_i$.
 - `sigma2`: The fixed effects variance, $\sigma^2$.
 - `D{k},{j}`: The $j^{th}$ estimated component of vech$(D_k)$.
 - `sigma2*D{k},{j}`: The fixed effects variance, $\sigma^2$, multiplied by the $j^{th}$ estimated component of vech$(D_k)$ (further computation is often more concerned with this quantity than with the separate $\sigma^2$ and $D$ components).
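
As a reminder of what vech$(D_k)$ contains, here is a minimal sketch assuming the usual definition of half-vectorisation (stacking the columns of the lower triangle of a symmetric matrix). The function `vech` below is written for illustration and is not necessarily the implementation used in the repository; the matrix and variance are toy values.

```python
import numpy as np

def vech(D):
    """Half-vectorisation: stack the lower-triangular part of D column by column."""
    D = np.asarray(D)
    # Reading the upper triangle of D.T row by row is the same as reading
    # the lower triangle of D column by column.
    return D.T[np.triu_indices(D.shape[0])]

# Toy values for illustration only.
D_k = np.array([[1.0, 0.5],
                [0.5, 2.0]])
sigma2 = 4.0

print(vech(D_k))           # unique elements of D_k
print(sigma2 * vech(D_k))  # corresponding sigma2*D quantities
```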

In [None]:
# Read in the results file.
results_table = pd.read_csv(os.path.join(OutDir,'Sim1_Design2_results.csv'),index_col=0)

# Display results
results_table

At first glance the above results may appear alarming, as the parameter estimates differ noticeably from the ground truth. However, comparing the log-likelihood values in the third row shows that the ground truth parameters do not correspond exactly to the maximum of the log-likelihood function. The observed differences are therefore inherent to likelihood maximisation in general rather than a byproduct of the Fisher Scoring method. In other words, as expected, the most likely parameter values given one specific dataset are not exactly equal to the true values used to generate that dataset.
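
The same behaviour can be seen in a much simpler setting. The sketch below is a toy example, unrelated to the repository code: it draws a sample from a normal distribution with known mean and variance and compares the log-likelihood at the true parameters with the log-likelihood at the maximum likelihood estimates. The latter is never smaller, even though the estimates differ from the truth.

```python
import numpy as np

def normal_llh(y, mu, sigma2):
    """Log-likelihood of i.i.d. N(mu, sigma2) observations."""
    n = y.size
    return -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * np.sum((y - mu) ** 2) / sigma2

# Toy ground truth and simulated data.
rng = np.random.default_rng(0)
true_mu, true_sigma2 = 1.0, 2.0
y = rng.normal(true_mu, np.sqrt(true_sigma2), size=50)

# Closed-form maximum likelihood estimates for the normal model.
mu_hat = y.mean()
sigma2_hat = np.mean((y - mu_hat) ** 2)

llh_truth = normal_llh(y, true_mu, true_sigma2)
llh_mle = normal_llh(y, mu_hat, sigma2_hat)

# The estimates maximise the likelihood for this dataset, so this is True.
print(llh_mle >= llh_truth)
```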

## Running multiple simulations in serial

The function `sim2D` can be used to run many simulations in serial. It takes the following inputs:

 - `desInd`: Integer value between 1 and 3 representing which design to run. The designs are as follows:
   - `Design 1: nlevels=[50], nraneffs=[2]`
   - `Design 2: nlevels=[50,10], nraneffs=[3,2]`
   - `Design 3: nlevels=[100,50,10], nraneffs=[4,3,2]`
 - `OutDir`: The output directory.
 - `nsim`: Number of simulations (default=`1000`).
 - `mode`: String indicating whether to run parameter estimation simulations (`mode='param'`) or T statistic simulations (`mode='Tstat'`).
 - `REML`: Boolean indicating whether to use ReML estimation (`REML=True`) or ML estimation (`REML=False`).

An example of how this function can be used is given below:

In [None]:
# Run 3 simulation instances
sim2D(desInd, OutDir, nsim=3, mode='param', REML=False)

# List contents of output directory.
os.listdir(OutDir)

## Running multiple simulations in parallel

As mentioned at the start of this document, the simulation functions for the Fisher Scoring methods have been designed with cluster computation in mind. The file `LMMPaperSim.sh` contains bash commands and commented suggestions for how to run these functions on a cluster. As cluster set-ups vary notably from lab to lab, we cannot provide exact commands for cluster submission, but we hope that this notebook and the surrounding files are comprehensive enough that anyone who wishes to submit the code to a cluster can do so.
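
Whatever scheduler is used, the one ordering constraint stated earlier is that the simulation with index 1 performs additional setup and should run before any others. The following is a rough local sketch of that ordering using a thread pool as a stand-in for a cluster; `run_all` and `run_one` are hypothetical names, and whether `runSim` can safely be called concurrently within one process is not confirmed here. On a real cluster each index would typically be a separate job.

```python
from concurrent.futures import ThreadPoolExecutor

def run_all(run_one, nsim, max_workers=4):
    """Run simulation 1 first, then simulations 2..nsim concurrently.

    `run_one` is any callable taking a simulation index (hypothetical
    wrapper; on a cluster this would be replaced by job submission).
    """
    # Index 1 performs the additional setup, so it must finish first.
    run_one(1)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list(...) forces evaluation so that any exception is re-raised here.
        list(pool.map(run_one, range(2, nsim + 1)))

# Possible usage in this notebook (assumes runSim, desInd and OutDir exist):
# run_all(lambda i: runSim(i, desInd, OutDir, mode='param', REML=False), nsim=3)
```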

## Running `lmer`

For comparison, we now run the same simulations in `lmer`; this is why we previously saved the `X`, `Y` and `Z` files for each simulation. To do this, the `LMMPaperSim.r` code must be run in `R`. Before running this code, we suggest you first open the file to check that the input options (e.g. the number of simulations the code expects to have been run and the design index) match those used above. Once you have done this, you can either run the file manually or source it from the `R` command line using the below command (with the path changed appropriately).

`source('~/Path/To/Repository/LMMPaper/src/Simulations/LMMPaperSim.r')`

Once this has been done, you will find a column has been added to the results file for `lmer`.

In [None]:
# Read in the results file.
results_table = pd.read_csv(os.path.join(OutDir,'Sim1_Design2_results.csv'),index_col=0)

# Display results
results_table

Note: The `R` code above must be run before moving on to the following sections of this notebook, because the Python code below expects the results from `lmer` to be recorded in the results file.
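
A quick way to confirm this before continuing is to check the header of each results file. The sketch below is illustrative only: the helper `sims_missing_column` is hypothetical, the column label `"lmer"` is an assumption about how the `R` script names its column, and simulation indices are assumed to run from 1 to `nsim`.

```python
import csv
import os

def sims_missing_column(out_dir, des_ind, nsim, column="lmer"):
    """Return the simulation indices whose results file lacks `column`.

    ASSUMPTIONS: the column label "lmer" and indices 1..nsim are guesses;
    adjust them to match your own results files.
    """
    missing = []
    for sim_ind in range(1, nsim + 1):
        path = os.path.join(out_dir, f"Sim{sim_ind}_Design{des_ind}_results.csv")
        if not os.path.isfile(path):
            # A missing results file also counts as missing the column.
            missing.append(sim_ind)
            continue
        with open(path, newline="") as f:
            # Only the header row is needed to check the column names.
            header = next(csv.reader(f), [])
        if column not in header:
            missing.append(sim_ind)
    return missing

# Possible usage in this notebook (assumes OutDir and desInd exist):
# print(sims_missing_column(OutDir, desInd, nsim=3))
```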

## Combining simulations

Once the simulations have been run, the results may be combined across simulations using the following functions:

 - `differenceMetrics`: This generates MAE and MRD values.
 - `performanceTables`: This generates the performance metrics.
 
These functions were used to produce the results displayed in the paper.

### Performance tables

The function `performanceTables` calculates the performance metrics (i.e. the computation time, number of iterations and maximized log-likelihood) for each simulation and stores the results in tables. The [pandas description](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) of each table is printed as well.

In [None]:
performanceTables(desInd, OutDir, nsim=3)

The below code can be used to see which files have now been created. `llhTable.csv` contains the maximized likelihood values for each method in each simulation. `nitTable.csv` contains the number of iterations used for each method in each simulation. `timesTable.csv` contains the times taken for each method in each simulation.

In [None]:
# List contents of output directory.
glob.glob(os.path.join(OutDir,'*Table.csv'))

One of these files, the table of computation times, is displayed below. The row index corresponds to simulation number and the column index corresponds to the method used for parameter estimation.

In [None]:
# Read in the results file.
times_table = pd.read_csv(os.path.join(OutDir,'timesTable.csv'),index_col=0)

# Display results
times_table

### Difference metrics

The function `differenceMetrics` calculates the MAE (Mean Absolute Error) and MRD (Mean Relative Difference) metrics for each simulation and stores the results in tables. The [pandas description](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) of each table is printed as well.
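
To make the two metrics concrete, the following is a minimal sketch on toy vectors. The MAE is the mean of the absolute differences. The normalisation used for the MRD below (absolute difference divided by the absolute reference value) is an assumption made for illustration; the definition used by `differenceMetrics` and in the paper may differ.

```python
import numpy as np

def mean_absolute_error(est, ref):
    """Mean of |est - ref| across parameters."""
    est, ref = np.asarray(est, dtype=float), np.asarray(ref, dtype=float)
    return np.mean(np.abs(est - ref))

def mean_relative_difference(est, ref):
    """Mean of |est - ref| / |ref| across parameters.

    ASSUMPTION: this normalisation is illustrative and may not match the
    definition used in the repository.
    """
    est, ref = np.asarray(est, dtype=float), np.asarray(ref, dtype=float)
    return np.mean(np.abs(est - ref) / np.abs(ref))

# Toy values for illustration only.
estimates = [1.1, 2.0]
reference = [1.0, 2.5]
print(mean_absolute_error(estimates, reference))
print(mean_relative_difference(estimates, reference))
```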

In [None]:
differenceMetrics(desInd, OutDir, nsim=3)

The below code can be used to see which files have now been created. The suffix `_abs` refers to MAE (mean ABSolute error) and the suffix `_rel` refers to MRD (mean RELative differences).

In [None]:
# List contents of output directory.
glob.glob(os.path.join(OutDir,'diff*'))

One of these files, containing the MAE values for the beta estimates taken with respect to the ground truth used for simulation, is displayed below. The row index corresponds to simulation number and the column index corresponds to the method used for parameter estimation.

In [None]:
# Read in the results file.
MAE_table = pd.read_csv(os.path.join(OutDir,'diffTableBetas_truth_abs.csv'),index_col=0)

# Display results
MAE_table