# CROSS VALIDATION: IMPLEMENTATION

This notebook describes how to do cross validation for simulation models.

Cross validation partitions the observed data into *folds*.
A fold consists of a pair of training data and test data.
For simulation models,
the training data are used to estimate parameters, and the quality of these estimates is evaluated based on predictions of the test data.

The main challenges with cross validation for simulation models are: creating folds, estimating parameters on data subsets, and evaluating parameter estimates on a data subset.

# Programming Preliminaries

In [41]:
IS_COLAB = True

In [42]:
if IS_COLAB:
    !pip install -q tellurium
    !pip install -q SBstoat

In [43]:
%matplotlib inline
import tellurium as te
import numpy as np
import math
import random
import matplotlib.pyplot as plt
import urllib.request
from SBstoat import ModelFitter, NamedTimeseries, Parameter

In [44]:
def getSharedCodes(moduleName):
    """
    Obtains common codes from the github repository.

    Parameters
    ----------
    moduleName: str
        name of the python module in the src directory
    """
    if IS_COLAB:
        url = "https://github.com/sys-bio/network-modeling-summer-school-2021/raw/main/src/%s.py" % moduleName
        local_python = "python.py"
        _, _ = urllib.request.urlretrieve(url=url, filename=local_python)
    else:
        local_python = "../src/%s.py" % moduleName
    with open(local_python, "r") as fd:
        codeStr = fd.read()
    print(codeStr)
    exec(codeStr, globals())

# Acquire codes
getSharedCodes("util")

# TESTS
assert(isinstance(LINEAR_PATHWAY_DF, pd.DataFrame))

import pandas as pd
import urllib.request

# Linear pathway data
BASE_URL = "https://github.com/vporubsky/network-modeling-summer-school/raw/main/"
BASE_DATA_URL = "%sdata/" % BASE_URL
BASE_MODULE_URL = "%ssrc/" % BASE_URL
BASE_MODEL_URL = "%smodels/" % BASE_URL
LOCAL_FILE = "local_file.txt"


def getData(csvFilename):
    """
    Creates a dataframe from a CSV structured URL file.

    Parameters
    ----------
    csvFilename: str
        Name of the CSV file (w/o ".csv" extension)

    Returns
    -------
    pd.DataFrame
    """
    url = "%s%s.csv" % (BASE_DATA_URL, csvFilename)
    filename, _ = urllib.request.urlretrieve(url, filename=LOCAL_FILE)
    return pd.read_csv(LOCAL_FILE)

def getModule(moduleName):
    """
    Obtains common codes from the github repository.
  
    Parameters
    ----------
    moduleName: str
        name of the python module in the src directory
    """
    url = "%s%s.py" % (BASE_MODULE_URL, moduleName)
    _, _ = urllib.request.urlretrieve(url, fil

In [45]:
def findCloseMatchingValues(longArr, shortArr):
    """
    Finds the indices in longArr that are closest to the values in shortArr.

    Parameters
    ----------
    longArr: np.array
    shortArr: np.array

    Returns
    -------
    array-int
    """
    indices = []
    for val in shortArr:
        distances = (longArr - val)**2
        minDistance = np.min(distances)
        distancesLst = list(distances)
        idx = distancesLst.index(minDistance)
        indices.append(idx)
    return np.array(indices)

# TESTS
longArr = np.array(range(10))
shortArr = np.array([2.1, 2.9, 4.3])
indexArr = findCloseMatchingValues(longArr, shortArr)
expectedArr = [2, 3, 4]
assert(all([v1 == v2 for v1, v2 in zip(indexArr, expectedArr)]))
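
The loop above can also be written with NumPy broadcasting. The following is a minimal sketch, not part of the shared `util` module; the name `findCloseMatchingValuesVectorized` is hypothetical. It builds a matrix of absolute differences with one row per value in `shortArr` and takes `argmin` along each row. As in the loop version, ties resolve to the lowest index.

```python
import numpy as np

def findCloseMatchingValuesVectorized(longArr, shortArr):
    """
    Hypothetical vectorized variant of findCloseMatchingValues.
    Returns, for each value in shortArr, the index of the closest value in longArr.
    """
    longArr = np.asarray(longArr, dtype=float)
    shortArr = np.asarray(shortArr, dtype=float)
    # Rows correspond to shortArr values, columns to longArr values
    distances = np.abs(shortArr[:, np.newaxis] - longArr[np.newaxis, :])
    return np.argmin(distances, axis=1)

indexArr = findCloseMatchingValuesVectorized(np.arange(10), [2.1, 2.9, 4.3])
print(indexArr)  # [2 3 4]
```

The difference matrix has `len(shortArr) * len(longArr)` entries, which is small for the time series used here but worth keeping in mind for very long simulations.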

In [46]:
# Experimental conditions
END_TIME = max(LINEAR_PATHWAY_ARR[:, 0])
NUM_POINT = 15
NOISE_STD = 1.0
NUM_FOLD = 3

# Constants

In [47]:
# Constants
FOLD_TRAINING = 0 # Training data in a fold
FOLD_TEST = 1 # Test data in a fold
RSQ = "rsq" # R-squared value in a dataframe

# Model

In [48]:
print(LINEAR_PATHWAY_MODEL)

R1:  S1 -> S2; k1*S1  
R2: S2 -> S3; k2*S2
R3: S3 -> S4; k3*S3
R4: S4 -> S5; k4*S4

S1 = 10

// Parameters
k1 = 0; # Nominal value of parameter
k2 = 0; # Nominal value of parameter
k3 = 0; # Nominal value of parameter
k4 = 0; # Nominal value of parameter



# Helper Functions

The following functions provide capabilities used in this notebook.

In [49]:
def runSimulation(simTime=END_TIME, numPoint=NUM_POINT, roadRunner=None,
                  parameterDct=None, model=LINEAR_PATHWAY_MODEL):
    """
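    Runs the simulation model for the parameters.

    Parameters
    ----------
    simTime: float
        End time for the simulation
    numPoint: int
        Number of points in the simulation
    roadRunner: ExtendedRoadRunner
        Existing simulator to reset and reuse; if None, one is loaded from model
    parameterDct: dict
        key: parameter name
        value: parameter value
    model: str
        Antimony model loaded when roadRunner is None

    Returns
    -------
    NamedArray
        results of simulation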
    """
    if roadRunner is None:
        roadRunner = te.loada(model)
    else:
        roadRunner.reset()
    if parameterDct is not None:
        # Set the simulation constants for all parameters
        for name in parameterDct.keys():
            roadRunner[name] = parameterDct[name]
    return roadRunner.simulate(0, simTime, numPoint)

# TESTS
numPoint = int(10*END_TIME)
fittedData = runSimulation(parameterDct={"k1": 0.1}, numPoint=numPoint)
numCol = np.shape(fittedData)[1]
assert(np.size(fittedData) == numPoint*numCol)
assert(fittedData[0, 1] == 10)

In [50]:
FITTED_DATA = runSimulation(parameterDct={"k1": 0.1}, numPoint=numPoint)

# Cross Validation Algorithm

Cross validation divides the data into training and test sets.
The training sets are used to estimate parameters.
The test sets are used to evaluate the quality of the
parameter fits.
That is, for each set of test indices, we have a companion set of training indices. The training indices select a subset of the observational data that are used to estimate parameter values.

Below is pseudo code that calculates the $R^2$ value for each fold.

    def crossValidate(model, observedData, numFold):
        folds = generateFolds(observedData, numFold)
        foldQualities = []
        for fold in folds:
            foldQuality = evaluateFold(model, fold)
            foldQualities.append(foldQuality)
        return foldQualities
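
To see the pseudo code run end to end without a simulation model, here is a minimal sketch that substitutes a straight-line fit (`np.polyfit`) for parameter estimation. The names `toyGenerateFolds`, `toyEvaluateFold`, and `toyCrossValidate`, and the synthetic data, are assumptions made for illustration; they are not the functions defined later in this notebook.

```python
import numpy as np

def toyGenerateFolds(observedData, numFold):
    """Assigns row n to the test set of fold n % numFold."""
    indices = np.arange(len(observedData))
    folds = []
    for remainder in range(numFold):
        isTest = indices % numFold == remainder
        folds.append((observedData[~isTest], observedData[isTest]))
    return folds

def toyEvaluateFold(fold):
    """Fits a line to the training data and scores it on the test data."""
    trainData, testData = fold
    # "Parameter estimation": slope and intercept from the training rows
    slope, intercept = np.polyfit(trainData[:, 0], trainData[:, 1], 1)
    predicted = slope * testData[:, 0] + intercept
    residuals = testData[:, 1] - predicted
    return 1 - np.var(residuals) / np.var(testData[:, 1])

def toyCrossValidate(observedData, numFold):
    """Returns one quality score (R-squared) per fold."""
    return [toyEvaluateFold(fold)
            for fold in toyGenerateFolds(observedData, numFold)]

# Synthetic observations: column 0 is time, column 1 is a noisy line
rng = np.random.default_rng(0)
times = np.linspace(0, 10, 30)
values = 2.0 * times + 1.0 + rng.normal(0, 0.5, size=len(times))
foldQualities = toyCrossValidate(np.column_stack([times, values]), 3)
print(foldQualities)
```

The structure is the same as in the pseudo code: only the model and the fitting step differ.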

# Generating Folds

The first step is to generate the folds.
A fold is represented by a tuple.
The first element is the training data;
the second element is the test data.
The collection of these tuples is itself a list.
``generateFolds`` returns such a list.
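
As a small illustration of how rows can be assigned to folds, the sketch below uses the interleaving rule that ``generateFolds`` applies: row `n` belongs to the test set of fold `n % numFold` and to the training set of every other fold. The variable names and the choice of 7 rows are only for illustration.

```python
numPoint = 7   # number of rows of observed data (illustrative)
numFold = 3

foldIndices = []
for remainder in range(numFold):
    # Rows whose index leaves this remainder form the test set
    testIndices = [n for n in range(numPoint) if n % numFold == remainder]
    # All remaining rows form the training set
    trainIndices = [n for n in range(numPoint) if n % numFold != remainder]
    foldIndices.append((trainIndices, testIndices))
    print("fold %d: train=%s test=%s" % (remainder, trainIndices, testIndices))
```

Every row lands in exactly one test set, so each observation is used for evaluation once.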



In [51]:
def generateFolds(observedData, numFold):
    """
    Generates pairs of training and test data by assigning
    rows to test sets in an interleaved fashion: row n is in
    the test data of fold n % numFold.
    
    Parameters:
    ----------
    observedData: np.array(N, M) or NamedTimeseries
    numFold: int
        number of (training data, test data) pairs
    
    Returns:
    --------
    list-tuple(array, array)
        each tuple is (training data, test data)
    """
    if isinstance(observedData, NamedTimeseries):
        df = observedData.to_dataframe()
        df = df.reset_index()
        observedData = df.to_numpy()
    result = []
    numPoint, numCol = np.shape(observedData)
    indices = range(numPoint)
    for remainder in range(numFold):
        testIndices = np.array([n for n in indices if n % numFold == remainder])
        testData = observedData[testIndices, :]
        # Sort explicitly: set iteration order is not guaranteed to follow time order
        trainIndices = np.array(sorted(set(indices).difference(testIndices)))
        trainData = observedData[trainIndices, :]
        entry = (trainData, testData)
        result.append(entry)
    return result

# TESTS
numFold = 3
observedData = NamedTimeseries(dataframe=LINEAR_PATHWAY_DF)
folds = generateFolds(observedData.to_dataframe().to_numpy(), numFold)
assert(len(folds) == numFold)
fold = folds[0]
assert(len(fold) == 2)
trainData = fold[0]
testData = fold[1]
assert(len(observedData) == (len(trainData) + len(testData)))
#
fold2s = generateFolds(observedData, numFold)
assert(len(folds) == len(fold2s))

In [52]:
folds

[(array([[7.45391115e+00, 7.61656564e-01, 7.37219347e-02, 6.60778277e-03,
          4.82334671e-04],
         [8.28286758e+00, 1.21277213e+00, 2.60522199e-01, 3.87223027e-02,
          7.03598959e-03],
         [6.31537918e+00, 2.25440468e+00, 6.99869349e-01, 1.59682183e-01,
          8.52162442e-02],
         [7.60029359e+00, 2.53934763e+00, 1.00050901e+00, 3.01052446e-01,
          2.08922107e-01],
         [5.41140224e+00, 2.98080193e+00, 1.20718732e+00, 6.36913521e-01,
          4.75265940e-01],
         [4.56449516e+00, 2.01172053e+00, 1.32967397e+00, 6.53568960e-01,
          5.78270309e-01],
         [3.27222225e+00, 2.26336096e+00, 1.41342726e+00, 8.08617233e-01,
          1.19246083e+00],
         [4.40101417e+00, 2.31026849e+00, 1.75426175e+00, 8.45330383e-01,
          1.56057274e+00],
         [3.12149948e+00, 1.96647005e+00, 1.66763659e+00, 9.53115806e-01,
          2.03565210e+00],
         [2.94982466e+00, 1.74039878e+00, 1.54935687e+00, 1.09908406e+00,
          2.53792

In [53]:
trainData

array([[7.45391115e+00, 7.61656564e-01, 7.37219347e-02, 6.60778277e-03,
        4.82334671e-04],
       [8.28286758e+00, 1.21277213e+00, 2.60522199e-01, 3.87223027e-02,
        7.03598959e-03],
       [6.31537918e+00, 2.25440468e+00, 6.99869349e-01, 1.59682183e-01,
        8.52162442e-02],
       [7.60029359e+00, 2.53934763e+00, 1.00050901e+00, 3.01052446e-01,
        2.08922107e-01],
       [5.41140224e+00, 2.98080193e+00, 1.20718732e+00, 6.36913521e-01,
        4.75265940e-01],
       [4.56449516e+00, 2.01172053e+00, 1.32967397e+00, 6.53568960e-01,
        5.78270309e-01],
       [3.27222225e+00, 2.26336096e+00, 1.41342726e+00, 8.08617233e-01,
        1.19246083e+00],
       [4.40101417e+00, 2.31026849e+00, 1.75426175e+00, 8.45330383e-01,
        1.56057274e+00],
       [3.12149948e+00, 1.96647005e+00, 1.66763659e+00, 9.53115806e-01,
        2.03565210e+00],
       [2.94982466e+00, 1.74039878e+00, 1.54935687e+00, 1.09908406e+00,
        2.53792931e+00],
       [2.29400209e+00, 1.5232

In [54]:
testData

array([[8.51678036e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00],
       [7.61303021e+00, 1.81005756e+00, 3.52075303e-01, 8.37075579e-02,
        3.15814521e-02],
       [5.38910939e+00, 2.35260007e+00, 8.49514531e-01, 3.69980464e-01,
        3.06653894e-01],
       [3.53071936e+00, 2.00169360e+00, 1.17378690e+00, 8.68272208e-01,
        9.88267376e-01],
       [3.07306943e+00, 2.27302181e+00, 1.36486858e+00, 8.81426831e-01,
        1.69128357e+00],
       [2.90997538e+00, 1.65727651e+00, 1.43374486e+00, 1.05886613e+00,
        2.80729400e+00],
       [2.21348241e+00, 1.58784304e+00, 1.23041023e+00, 1.01995933e+00,
        5.03545291e+00],
       [1.56750857e+00, 1.13860530e+00, 8.76931269e-01, 1.04437228e+00,
        5.36343076e+00],
       [9.92643738e-01, 9.07052903e-01, 8.10704898e-01, 8.92309880e-01,
        7.13967029e+00],
       [1.00759058e+00, 8.65721716e-01, 7.69303033e-01, 7.08649052e-01,
        6.17415439e+00],
       [6.65970364e-01, 5.4869

# Evaluating Folds

In [55]:
def evaluateFold(model, colnames, parametersToFit, fold,
                 **fittingArgs):
    """
    Calculates the R-squared value for the fit of the predicted test data,
    whose parameters are estimated from the training data, with the observed
    test data.

    Parameters
    ----------
    model: antimony/ExtendedRoadRunner
    colnames: list-str
        names of data columns
    parametersToFit: list-SBstoat.Parameter
        parameters to estimate from the training data
    fold: tuple(np.array, np.array)
        train data, test data
    fittingArgs: dict
        optional arguments for ModelFitter

    Returns
    -------
    float: R squared
    ModelFitter
    """
    # Estimate the parameters
    observedTS = NamedTimeseries(colnames=colnames, array=fold[FOLD_TRAINING])
    fitter = ModelFitter(modelSpecification=model,
                                    parametersToFit=parametersToFit,
                                    observedData=observedTS,
                                    **fittingArgs,
                                    )
    fitter.fitModel()
    parameterDct = dict(fitter.params.valuesdict())
    # Obtain the fitted values for the estimated parameters
    if "endTime" in fittingArgs:
        endTime = fittingArgs["endTime"]
    else:
        endTime = END_TIME
    numPoint = int(10*endTime)
    fittedData = runSimulation(simTime=endTime, numPoint=numPoint,
                                parameterDct=parameterDct,
                                roadRunner=model)
    # Find the time indices that correspond to the test data
    testData = fold[FOLD_TEST]
    testTimes = testData[:, 0]
    fittedTimes = fittedData[:, 0]
    indices = findCloseMatchingValues(fittedTimes, testTimes)
    # Calculate residuals for the corresponding times
    indexArr = np.array(indices)
    fittedTestData = fittedData[indexArr, 1:]
    flatFittedTestData = fittedTestData.flatten()
    flatTestData = (testData[:, 1:]).flatten()
    residualsArr = flatTestData - flatFittedTestData
    rsq = 1 - np.var(residualsArr)/np.var(flatTestData)
    #
    return rsq, fitter

# TESTS
model = te.loada(LINEAR_PATHWAY_MODEL)
colnames = list(LINEAR_PATHWAY_DF.columns)
parametersToFit = [
                Parameter("k1", lower=0, value=0, upper=10),
                Parameter("k2", lower=0, value=0, upper=10),
                Parameter("k3", lower=0, value=0, upper=10),
                Parameter("k4", lower=0, value=0, upper=10),
    ]
folds = generateFolds(LINEAR_PATHWAY_ARR, NUM_FOLD)
rsq, fitter = evaluateFold(model, colnames, parametersToFit, folds[0],
                   fitterMethods=["differential_evolution"])
assert(rsq > 0.85)
assert(isinstance(fitter, ModelFitter))
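
The quality score in `evaluateFold` is `1 - np.var(residualsArr)/np.var(flatTestData)`. Because `np.var` subtracts the mean of the residuals, this score matches the more common sum-of-squares form, $1 - SS_{res}/SS_{tot}$, only when the residuals average to zero. The sketch below compares the two on made-up numbers with a constant prediction bias; the helper names are hypothetical.

```python
import numpy as np

def rsqVariance(observed, predicted):
    """R-squared as computed in evaluateFold: uses the variance of the residuals."""
    residuals = observed - predicted
    return 1 - np.var(residuals) / np.var(observed)

def rsqSumOfSquares(observed, predicted):
    """R-squared based on sums of squares: 1 - SS_res / SS_tot."""
    residuals = observed - predicted
    ssRes = np.sum(residuals**2)
    ssTot = np.sum((observed - np.mean(observed))**2)
    return 1 - ssRes / ssTot

observed = np.array([1.0, 2.0, 3.0, 4.0])
biased = observed + 0.5   # every prediction is too high by 0.5
rsqVar = rsqVariance(observed, biased)      # 1.0: a constant offset has zero variance
rsqSS = rsqSumOfSquares(observed, biased)   # 0.8: the offset is penalized
print(rsqVar, rsqSS)
```

If a systematic offset between predictions and test data should count against the fit, the sum-of-squares form is the one to use.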

# Complete Workflow

In [56]:
def crossValidate(model, observedData, parametersToFit, colnames, numFold,
                  **fitterArgs):
    """
    Performs cross validation on the model.

    Parameters
    ----------
    model: ExtendedRoadrunner
    observedData: NamedTimeseries
    parametersToFit: list-SBstoat.Parameter
    colnames: list-str
    numFold: int

    Returns
    -------
    pd.DataFrame
        Index: fold
        Columns
            rsq: R squared value
            values of parameters
    """
    folds = generateFolds(observedData, numFold)
    result = {p.name: [] for p in parametersToFit}
    result[RSQ] = []
    for fold in folds:
        foldQuality, fitter = evaluateFold(model, colnames, parametersToFit,
                                           fold, **fitterArgs)
        result[RSQ].append(foldQuality)
        valueDct = fitter.params.valuesdict()
        for parameter in parametersToFit:
            result[parameter.name].append(valueDct[parameter.name])
    df = pd.DataFrame(result)
    return df

# TESTS
model = te.loada(LINEAR_PATHWAY_MODEL)
colnames = list(LINEAR_PATHWAY_DF.columns)
parametersToFit = [
                Parameter("k1", lower=0, value=0, upper=10),
                Parameter("k2", lower=0, value=0, upper=10),
                Parameter("k3", lower=0, value=0, upper=10),
                Parameter("k4", lower=0, value=0, upper=10),
    ]
observedData = NamedTimeseries(dataframe=LINEAR_PATHWAY_DF)
resultDF = crossValidate(model, observedData, parametersToFit,
                          colnames, NUM_FOLD,
                          fitterMethods=["differential_evolution"])
trues = [q > 0.85 for q in resultDF[RSQ]]
assert(all(trues))

In [57]:
resultDF

| fold | k1 | k2 | k3 | k4 | rsq |
|------|----|----|----|----|-----|
| 0 | 1.067128 | 2.266265 | 3.098574 | 4.675904 | 0.8616332751370328 |
| 1 | 1.01253 | 2.241219 | 3.162902 | 4.2184 | 0.881954662823421 |
| 2 | 1.021861 | 2.05966 | 3.071075 | 4.111133 | 0.886724133099888 |
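
A per-fold table like this is usually summarized by the mean and spread of the quality scores. The sketch below does so with the standard library, using the three `rsq` values from the output above; with the DataFrame in hand, `list(resultDF[RSQ])` would supply the same list.

```python
import statistics

# R-squared values copied from the resultDF output above
rsqValues = [0.8616332751370328, 0.881954662823421, 0.886724133099888]

meanRsq = statistics.mean(rsqValues)
stdRsq = statistics.stdev(rsqValues)   # sample standard deviation (n - 1 denominator)
print("mean = %.3f, std = %.3f" % (meanRsq, stdRsq))
```

With only three folds the standard deviation is a rough estimate, which is one reason the exercises below vary the number of folds.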


# Exercises

This exercise uses the ``WOLF_MODEL`` and the data ``WOLF_DF``.
Only fit the parameters ``J1_k1``, ``J1_Ki``, and ``J1_n``.

1. Do 2, 10, 20, and 100 fold cross validations. Save the fold qualities in a python dictionary.

1. Some questions about the results in (1).
   1. How does the mean value of quality change with the number of folds?

   1. Can you explain the trend in the standard deviations of the quality scores?

1. Plot the distribution of quality scores for 20 and 100 folds.
What can you say about these distributions?
   1. Hint: If ``resultDF`` is the DataFrame returned by ``crossValidate``,
   then ``list(resultDF[RSQ])`` is the list of $R^2$ values for the folds.


## (1) Calculate Cross Validations

In [58]:
model = te.loada(WOLF_MODEL)
colnames = list(WOLF_DF.columns)
observedData = NamedTimeseries(dataframe=WOLF_DF)

In [59]:
upper = 1e5
parametersToFit = [
      Parameter("J1_k1", lower=0, value=1, upper=upper), #550
      Parameter("J1_Ki", lower=0, value=1, upper=upper), #1
      Parameter("J1_n", lower=0, value=1, upper=upper), #4
]

In [None]:
resultDct = {}
for numFold in [2, 10, 20, 100]:
    qualityDF = crossValidate(model, observedData, parametersToFit,
                          colnames, numFold,
                          fitterMethods=["differential_evolution"],
                          endTime=5)
    resultDct[numFold] = qualityDF


2768.433245:     (Roadrunner exception: : CVODE Error: CV_CONV_FAILURE: Convergence test failures occurred too many times (= MXNCF = 10) during one internal timestep or occurred with |h| = hmin.; In virtual double rr::CVODEIntegrator::integrate(double, double))

2768.443611:     (Roadrunner exception: : CVODE Error: CV_CONV_FAILURE: Convergence test failures occurred too many times (= MXNCF = 10) during one internal timestep or occurred with |h| = hmin.; In virtual double rr::CVODEIntegrator::integrate(double, double))

2768.454061:     (Roadrunner exception: : CVODE Error: CV_CONV_FAILURE: Convergence test failures occurred too many times (= MXNCF = 10) during one internal timestep or occurred with |h| = hmin.; In virtual double rr::CVODEIntegrator::integrate(double, double))

2768.464208:     (Roadrunner exception: : CVODE Error: CV_CONV_FAILURE: Convergence test failures occurred too many times (= MXNCF = 10) during one internal timestep or occurred with |h| = hmin.; In virtual dou

In [None]:
resultDct[100]

## (2) Questions

1. The mean values of $R^2$ change little as the number of folds increases, although there is a small decrease at 100 folds.
1. The standard deviation of $R^2$ increases with the number of folds.
This is because more folds mean fewer data values in each test set,
and with fewer test values there is more variability in how well the model fits.
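
The effect of test-set size on the spread of $R^2$ can be checked with a small simulation that is independent of the Wolf model. The sketch below is an illustration under stated assumptions: it draws noisy observations around a known straight line, scores that line on test sets of different sizes, and compares the standard deviation of the scores. The function name, noise level, and test-set sizes are all hypothetical choices.

```python
import numpy as np

def rsqSpread(numTestPoint, numTrial=500, noiseStd=1.0, seed=0):
    """Standard deviation of R-squared over many random test sets of a given size."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(numTrial):
        times = rng.uniform(0, 10, size=numTestPoint)
        predicted = 1.0 * times   # the "model": a known line
        observed = predicted + rng.normal(0, noiseStd, size=numTestPoint)
        residuals = observed - predicted
        scores.append(1 - np.var(residuals) / np.var(observed))
    return np.std(scores)

spreadSmall = rsqSpread(numTestPoint=5)     # few test points, as with many folds
spreadLarge = rsqSpread(numTestPoint=100)   # many test points, as with few folds
print(spreadSmall, spreadLarge)
```

The model is held fixed here, so the difference in spread comes only from the size of the test set.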

## (3) Plot the distributions

In [None]:
def plotHist(numFold):
    """
    Plots a histogram for the number of folds.

    Parameters
    ----------
    numFold: int
    """
    _ = plt.hist(resultDct[numFold][RSQ])
    plt.xlim([0, 1.0])
    plt.title("Folds: %d" % numFold)

In [None]:
plotHist(20)

In [None]:
plotHist(100)

The histogram plots are consistent with the observation that variance increases with the number of folds.