<span style="font-family:Papyrus; font-size:3em;">Homework 2</span>

<span style="font-family:Papyrus; font-size:2em;">Cross Validation</span>

# Problem

In this homework, you will use cross validation to analyze the effect on model quality
of the number of model parameters and the noise in the observational data.
You will do this analysis in the context of design of experiments.
The two factors are (i) number of model parameters and (ii) the noise in the observational data;
the response will be the $R^2$ of the model (actually the $R^2$ averaged across the folds of
cross validation).
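
As a point of reference, the response can be sketched as follows. This is a minimal illustration that assumes the common definition $R^2 = 1 - SSR/SST$ computed on each fold's held-out data; the definition actually used inside ``CrossValidator`` is not shown in this notebook and may differ. The names ``calcRsq`` and ``calcAvgRsqOfFolds`` are hypothetical.

```python
import numpy as np

def calcRsq(observed, predicted):
    """R^2 = 1 - SSR/SST for one fold (assumed definition)."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ssr = np.sum((observed - predicted) ** 2)          # residual sum of squares
    sst = np.sum((observed - np.mean(observed)) ** 2)  # total sum of squares
    return 1.0 - ssr / sst

def calcAvgRsqOfFolds(foldResults):
    """Averages R^2 over folds; foldResults is a list of (observed, predicted) pairs."""
    return float(np.mean([calcRsq(obs, pred) for obs, pred in foldResults]))
```

A fold predicted perfectly contributes 1, and a fold whose predictions are no better than the mean of its observations contributes 0.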

You will investigate models of linear pathways with 2, 4, 6, 8, and 10 parameters.
For example, a two-parameter model is $S_1 \xrightarrow{v_1} S_2 \xrightarrow{v_2} S_3$,
where $v_i = k_i s_i$, $k_i$ is a parameter to estimate, and $s_i$ is the concentration of $S_i$.
The initial concentration of $S_1$ is 10, and the true value of $k_i$ is $i$. Thus, for a two-parameter model,
$k_1 = 1$, $k_2 = 2$.
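
One way to produce such a model programmatically is sketched below. The function name ``makeModelString`` and the reaction names ``J1``, ``J2``, ... are hypothetical, and setting the initial concentrations of $S_2$ onward to 0 is an assumption, since the problem statement only specifies $S_1$.

```python
def makeModelString(numParameter, initialS1=10):
    """
    Builds an Antimony string for a linear pathway with numParameter
    reactions: S1 -> S2 -> ... -> S(numParameter+1), with v_i = k_i*S_i.
    """
    lines = []
    # One mass-action reaction per parameter
    for idx in range(1, numParameter + 1):
        lines.append("J%d: S%d -> S%d; k%d*S%d" % (idx, idx, idx + 1, idx, idx))
    lines.append("")
    # Initial concentrations (all but S1 are ASSUMED to be 0)
    lines.append("S1 = %d" % initialS1)
    for idx in range(2, numParameter + 2):
        lines.append("S%d = 0" % idx)
    # True parameter values: k_i = i
    for idx in range(1, numParameter + 1):
        lines.append("k%d = %d" % (idx, idx))
    return "\n".join(lines)

print(makeModelString(2))
```

The resulting string can be passed to ``te.loada`` to obtain a simulatable model.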

You will generate the synthetic data by adding a
noise term to the true model.
The noise term is drawn from a normal distribution with mean 0
and standard deviations of 0.2, 0.5, 0.8, 1.0, and 1.5, depending on the experiment.
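
The noise step can be sketched as below. This is only an illustration: ``addNoise`` and its ``seed`` argument are hypothetical names, and the "true" values here come from the analytical solution $s_1(t) = 10 e^{-k_1 t}$ for the first species rather than from a full simulation.

```python
import numpy as np

def addNoise(trueValues, std, seed=None):
    """Returns trueValues plus independent N(0, std^2) noise at every point."""
    rng = np.random.default_rng(seed)
    trueValues = np.asarray(trueValues, dtype=float)
    return trueValues + rng.normal(0.0, std, size=trueValues.shape)

# Illustration with S1 only: dS1/dt = -k1*S1, S1(0) = 10, k1 = 1
times = np.linspace(0, 5, 100)
trueS1 = 10 * np.exp(-1.0 * times)
noisyS1 = addNoise(trueS1, std=0.5, seed=0)
```

Passing a fixed ``seed`` makes an experiment reproducible; leaving it as ``None`` gives a fresh draw on every call. Note that additive normal noise can produce negative concentrations when the true values are near zero.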

You will design experiments, implement code to run them, run the experiments, and interpret the results.
The raw output of these experiments will be
a table structured as the one below.
Cell values will be the average $R^2$ across the folds of the cross validation done with
one level for each factor.

| Noise std (rows) / no. parameters (columns) | 2 | 4 | 6 | 8 | 10 |
| -- | -- | -- | -- | -- | -- |
| 0.2 | ? | ? | ? | ? | ? |
| 0.5 | ? | ? | ? | ? | ? |
| 0.8 | ? | ? | ? | ? | ? |
| 1.0 | ? | ? | ? | ? | ? |
| 1.5 | ? | ? | ? | ? | ? |
 

1. (2 pt) **Generate Models.** Write (or generate) the models in Antimony, and produce plots for their true values. Use a simulation time
of 10 and 100 points.

1. (1 pt) **Generate Synthetic Data.** Write a function that creates synthetic data given the parameters ``std``
and ``numParameter``.

1. (1 pt) **Extend ``CrossValidator``.** You will extend ``CrossValidator`` (in ``common/util_crossvalidation.py``)
by creating a subclass ``ExtendedCrossValidator`` that has the method
``calcAvgRsq``. The method takes no argument (except ``self``) and returns the average value of
$R^2$ for the folds. Don't forget to document the method and include at least one test.

1. (4 pt) **Implement ``runExperiments``.** This function has inputs: (a) list of the number of parameters for the
models to study and (b) list of the standard deviations of the noise terms.
It returns a dataframe with: columns are the number of parameters; rows (index) are the standard deviations of noise;
and values are the average $R^2$ for the folds defined by the levels of the factors.
Run experiments that produce the table described above using five-fold cross validation and 100 simulation points.

1. (4 pt) **Calculate Effects.** Using a baseline noise standard deviation of 0.8 and a baseline of 6 parameters, calculate $\mu$, $\alpha_{i,k_i}$, and
$\gamma_{i,k_i,j,k_j}$.

1. (3 pt) **Analysis.** Answer the following questions:
   1. What is the effect on $R^2$ as the number of parameters increases? Why?
   1. How does the noise standard deviation affect $R^2$? Why?
   1. What are the interaction effects and how do they influence the response (average $R^2$)?
   
**Please do your homework in a copy of this notebook, maintaining the sections.**

# Programming Preliminaries
This section provides the setup needed to run your Python code.

In [1]:
IS_COLAB = False
# Install dependencies when running on Google Colab
if IS_COLAB:
    !pip install tellurium
    !pip install SBstoat
# Locate the shared course code (contains util_crossvalidation.py)
if not IS_COLAB:
    CODE_DIR = "/home/ubuntu/advancing-biomedical-models/common"
else:
    from google.colab import drive
    drive.mount('/content/drive')
    CODE_DIR = "/content/drive/My Drive/Winter 2021/common"
# Make the shared course code importable
import sys
sys.path.insert(0, CODE_DIR)

In [2]:
import util_crossvalidation as ucv
from SBstoat.namedTimeseries import NamedTimeseries, TIME

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tellurium as te

In [3]:
END_TIME = 5
NUM_POINT = 100
NOISE_STD = 0.5
# Column names
C_NOISE_STD = "noisestd"
C_NUM_PARAMETER = "no. parameters"
C_VALUE = "value"
# Levels of the two factors
NOISE_STDS = [0.2, 0.5, 0.8, 1.0, 1.5]
NUM_PARAMETERS = [2, 4, 6, 8, 10]

In [4]:
def isSame(collection1, collection2):
    """
    Determines if two collections have the same elements.
    """
    diff = set(collection1).symmetric_difference(collection2)
    return len(diff) == 0
    
# Tests
assert(isSame(range(3), [0, 1, 2]))
assert(not isSame(range(4), range(3)))

# Generate Models

# Generate Synthetic Data

# ``ExtendedCrossValidator``

Hint: Subclass using ``class ExtendedCrossValidator(ucv.CrossValidator):``.

# Implement ``runExperiments``
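
The shape of the required result can be sketched as follows. This is a skeleton only: ``runExperimentsSketch`` is a hypothetical name, and its third argument ``calcResponse`` is a hypothetical stand-in for the work you must implement (generate the model and synthetic data for one pair of levels, cross validate, and return the average $R^2$). The assignment's ``runExperiments`` takes only the two lists.

```python
import pandas as pd

def runExperimentsSketch(numParameters, noiseStds, calcResponse):
    """
    Builds the experiment table: columns are numbers of parameters,
    the index is the noise standard deviations, and each value is
    calcResponse(numParameter, std) for that pair of levels.
    """
    data = {
        numParameter: [calcResponse(numParameter, std) for std in noiseStds]
        for numParameter in numParameters
    }
    return pd.DataFrame(data, index=noiseStds)

# Stand-in response, used only to show the table layout (NOT an R^2 value)
layoutDF = runExperimentsSketch([2, 4], [0.2, 0.5], lambda n, s: n + s)
print(layoutDF)
```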

# Calculate Effects
Here, we calculate $\mu$, $\alpha_{i, k_i}$, and $\gamma_{i, k_i, j, k_j}$.
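
The sketch below shows one way to organize this calculation. It assumes baseline-relative definitions, which are not spelled out in this notebook: with $y$ denoting the response, $\mu$ is $y$ at the baseline levels of both factors; $\alpha_{i,k_i}$ is $y$ with factor $i$ at level $k_i$ and the other factor at baseline, minus $\mu$; and $\gamma_{i,k_i,j,k_j} = y_{k_i,k_j} - \mu - \alpha_{i,k_i} - \alpha_{j,k_j}$. The function name ``calcEffects`` and the table values are placeholders, not experimental results.

```python
import pandas as pd

def calcEffects(df, baseStd, baseNumParameter):
    """
    df: rows (index) are noise stds, columns are numbers of parameters.
    Returns mu, main effects of noise std, main effects of number of
    parameters, and a table of interaction effects.
    """
    mu = df.loc[baseStd, baseNumParameter]
    alphaStd = df[baseNumParameter] - mu   # vary noise std, baseline parameters
    alphaNum = df.loc[baseStd, :] - mu     # vary parameters, baseline noise std
    # gamma = y - alpha (row) - alpha (column) - mu, element by element
    gamma = df.sub(alphaStd, axis=0).sub(alphaNum, axis=1) - mu
    return mu, alphaStd, alphaNum, gamma

# Placeholder table; the entry at (0.2, 2) is additive effects plus 0.25
placeholderDF = pd.DataFrame({2: [1.5, 0.75], 6: [1.125, 0.625]}, index=[0.2, 0.8])
mu, alphaStd, alphaNum, gamma = calcEffects(placeholderDF, 0.8, 6)
```

Under these definitions, every interaction effect involving a baseline level is zero by construction, so only the non-baseline combinations carry information about interactions.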

# Analysis

**What is the effect on $R^2$ as the number of parameters increases? Why?**
  

**How does the noise standard deviation affect $R^2$? Why?**

**What are the interaction effects and how do they influence the response (average $R^2$)?**