# Biol 359A | Parameter Estimation and Regularization
### Spring 2024, Week 5
Objectives:
- gain intuition for parameter estimation strategy
- gain intuition for cost function landscapes
- contextualize MLR parameters (coefficients)


In [None]:
import numpy as np
from scipy.integrate import odeint
import matplotlib.pyplot as plt
import pandas as pd
import ipywidgets as widgets
from ipywidgets import interact, Dropdown, Checkbox, HBox, VBox, Button, Output, Layout, IntSlider
from mpl_toolkits.mplot3d import Axes3D
from sklearn import datasets
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
! rm -rf week5_modelselection/
! git clone https://github.com/BIOL359A-FoundationsOfQBio-Spr24/week5_modelselection.git
! cp -r week5_modelselection/* .
! ls

## Parameter Estimation

We have spoken about using a cost function like SSE for estimating the parameters (coefficients) of linear regression models. Today we will begin by looking at parameter estimation in the context of a different type of model.

The [SIR model](https://en.wikipedia.org/wiki/Compartmental_models_in_epidemiology) is an ODE model used to study the rate of spread of infectious diseases in a population. Its simplest form (excluding births, deaths, and spatial spread of a population) is as follows:

$\frac{dS}{dt} = -\beta\frac{SI}{N}$

$\frac{dI}{dt} = \beta \frac{SI}{N} - \gamma I$

$\frac{dR}{dt} = \gamma I$

Where:

$S$ = susceptible individuals

$I$ = infected individuals

$R$ = recovered individuals

$N$ = total population

$\beta$ = transmission rate

$\gamma$ = recovery rate


Let's say we have data that describe the number of infected individuals over time:

![Infected Individuals](images/infected_individuals.png)


Let's say we know the population size is 1000 and that one person was infected initially. So at time 0, the number of susceptible individuals was 999, the number of infected individuals was 1, and the number of recovered individuals was 0. How can we use these data to learn about the transmission rate and recovery rate of this disease?

In [None]:
# these are synthetic data
data_df = pd.read_csv('data/individuals_infected.csv')
data_df

In [None]:
# SIR model differential equations
def sir_model(y, t, N, beta, gamma):
    S, I, R = y
    dSdt = -beta * S * I / N
    dIdt = beta * S * I / N - gamma * I
    dRdt = gamma * I
    return dSdt, dIdt, dRdt
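
As a quick sanity check on these equations, note that the three derivatives sum to zero, so $S + I + R$ should stay equal to $N$ for the whole simulation. The sketch below repeats the model under a different name (`sir_rhs`) so it runs on its own; the values of `beta_demo` and `gamma_demo` and the time grid are illustrative assumptions, not estimates from the data.

```python
import numpy as np
from scipy.integrate import odeint

# Same right-hand side as sir_model, repeated so this cell is self-contained
def sir_rhs(y, t, N, beta, gamma):
    S, I, R = y
    dSdt = -beta * S * I / N
    dIdt = beta * S * I / N - gamma * I
    dRdt = gamma * I
    return dSdt, dIdt, dRdt

N_demo = 1000
beta_demo, gamma_demo = 0.3, 0.1      # illustrative values only
t_demo = np.linspace(0, 100, 101)     # arbitrary time grid in days

solution = odeint(sir_rhs, [N_demo - 1, 1, 0], t_demo,
                  args=(N_demo, beta_demo, gamma_demo))
S_demo, I_demo, R_demo = solution.T

# The three compartments should always add up to the total population
total = S_demo + I_demo + R_demo
print("max deviation from N:", np.max(np.abs(total - N_demo)))
```

If the deviation is not close to zero, the most likely cause is a sign error in one of the three equations.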

In [None]:
# Assumed data and initial conditions
data = data_df['infected'].values
t = data_df['time'].values
N = 1000  # Total population
I0 = 1  # Initial number of infected individuals
S0 = N - I0  # Initial number of susceptible individuals
R0 = 0  # Initial number of recovered individuals

# Define the cost functions:
# mean squared error:
def mse(actual, predicted):
    return np.mean((actual - predicted)**2)
# sum of squared errors:
def sse(actual, predicted):
    return np.sum((actual - predicted)**2)
# mean absolute error:
def mae(actual, predicted):
    return np.mean(np.abs(actual - predicted))
# sum of absolute errors:
def sae(actual, predicted):
    return np.sum(np.abs(actual - predicted))
# mean error
def me(actual, predicted):
    return np.mean(actual - predicted)
# sum of errors
def se(actual, predicted):
    return np.sum(actual - predicted)


# Function to perform parameter estimation
def estimate_parameters(beta_range, gamma_range, num_samples, cost_func):
    # Initialize the parameter space
    beta_values = np.linspace(beta_range[0], beta_range[1], num_samples)
    gamma_values = np.linspace(gamma_range[0], gamma_range[1], num_samples)

    # Initialize the best cost and parameters
    best_cost = float('inf')
    best_params = None
    best_fit = None
    
    # For every value in the parameter space
    for beta in beta_values:
        for gamma in gamma_values:
            # Solve SIR model
            ret = odeint(sir_model, [S0, I0, R0], t, args=(N, beta, gamma))
            S, I, R = ret.T
            
            # Compute the cost
            cost = cost_func(data, I)
            # Keep track of lowest cost parameters
            if cost < best_cost:
                best_cost = cost
                best_params = (beta, gamma)
                best_fit = I
    
    # Plot the best fit
    plt.figure(figsize=(10, 6))
    plt.plot(t, data, '.', label='Data')
    plt.plot(t, best_fit, '-', label=f'Best SIR Fit using {cost_func.__name__.upper()}')
    plt.xlabel('Time (days)')
    plt.ylabel('Number of Infected Individuals')
    plt.title(f'Best Fit SIR Model using {cost_func.__name__.upper()}')
    plt.legend()
    plt.show()

    # Print the best parameters and cost
    print("Best parameters: Beta =", best_params[0], ", Gamma =", best_params[1])
    print("Best Cost (using", cost_func.__name__.upper(), "):", best_cost)  
    
    return best_params, best_cost


The code above sets up a brute-force function you can use to estimate the SIR parameters that best fit these data. It takes a user-specified range of potential $\beta$ values, a range of potential $\gamma$ values, a number of samples, and a cost function. In pseudocode, the function does the following:

1. Creates a list of num_samples $\beta$ values evenly spaced across the user-specified $\beta$ range
2. Creates a list of num_samples $\gamma$ values evenly spaced across the user-specified $\gamma$ range
3. Solves the SIR ODEs once for each possible pairing of a value from the $\beta$ list with a value from the $\gamma$ list
4. Calculates the cost of the model output for each parameter pairing
5. Plots the SIR model prediction for the number of infected individuals over time, using the best parameters found, on top of the actual data
6. Prints and returns the parameters that yield the lowest cost, along with that cost

An example usage of the function is below:

In [None]:
# Example usage
beta_range = (0.1, .2)  # range of beta to test
gamma_range = (0.01, 0.02)  # range of gamma to test
num_samples = 10  # number of values to sample in each range
cost_function = mse  # You can change this to mse (mean of squared errors), sse (sum of squared errors), mae (mean of absolute errors), sae (sum of absolute errors), me (mean error), or se (sum of errors)

best_params, best_cost = estimate_parameters(beta_range, gamma_range, num_samples, cost_function)

Change the parameter ranges, the number of samples, and the cost function above to try to get the lowest cost you can. Find parameter search spaces (ranges for $\beta$ and $\gamma$) and a number of samples that gives you a good fit and then answer the following discussion questions:

DISCUSSION QUESTIONS:
- What are the lowest-cost parameters you can find?
- How does the best fit set of parameters found using me and se as cost functions compare to the best fit set of parameters found using mae and sae?

ASSIGNMENT QUESTIONS:
- please complete questions 10 and 11 in the Jupyter Notebook

### Understanding Cost Function Landscapes

A cost function landscape represents the values of a cost function over a range of parameter settings. In parameter estimation, our goal is often to find the parameter values that minimize this cost function. The shape of the landscape can greatly affect the ease and reliability of finding the global minimum. In a convex landscape, a line segment between any two points on the surface never dips below the surface, which ensures that any local minimum is also a global minimum. The plot below shows a simple convex function. Notice that the straight line connecting any two points on the curve always stays on or above the curve. You can move the line around using the sliders.



In [None]:
# Function to plot a convex function and a line segment between two points
def plot_convex(a, b):
    x = np.linspace(-10, 10, 400)
    y = x**2
    
    plt.figure(figsize=(8, 4))
    plt.plot(x, y, label='$f(x) = x^2$')
    
    # Points and line segment
    ya = a**2
    yb = b**2
    plt.plot([a, b], [ya, yb], 'ro-')
    
    plt.title('Interactive Convex Function: $f(x) = x^2$')
    plt.xlabel('x')
    plt.ylabel('f(x)')
    plt.legend()
    plt.grid(True)
    plt.show()

# Widgets for the points a and b
a_slider = widgets.FloatSlider(value=-5, min=-10, max=10, step=0.1, description='Point a:')
b_slider = widgets.FloatSlider(value=5, min=-10, max=10, step=0.1, description='Point b:')

# Display the widgets and output
widgets.interactive(plot_convex, a=a_slider, b=b_slider)


The plot below shows a non-convex function, $f(x) = x^3 - 3x$. It has one local maximum (at $x = -1$) and one local minimum (at $x = 1$), and it keeps decreasing as $x$ becomes more negative, so the local minimum is not a global minimum. A line segment connecting two points on either side of the local maximum dips below the curve, illustrating non-convex behavior.

In [None]:
# Function to plot a non-convex function and a line segment between two points
def plot_non_convex(a, b):
    x = np.linspace(-3, 3, 400)
    y = x**3 - 3*x
    
    plt.figure(figsize=(8, 4))
    plt.plot(x, y, label='$f(x) = x^3 - 3x$')
    
    # Points and line segment
    ya = a**3 - 3*a
    yb = b**3 - 3*b
    plt.plot([a, b], [ya, yb], 'ro-')
    
    plt.title('Interactive Non-Convex Function: $f(x) = x^3 - 3x$')
    plt.xlabel('x')
    plt.ylabel('f(x)')
    plt.legend()
    plt.grid(True)
    plt.show()

# Widgets for the points a and b
a_slider_nonconvex = widgets.FloatSlider(value=-2, min=-3, max=3, step=0.1, description='Point a:')
b_slider_nonconvex = widgets.FloatSlider(value=2, min=-3, max=3, step=0.1, description='Point b:')

# Display the widgets and output
widgets.interactive(plot_non_convex, a=a_slider_nonconvex, b=b_slider_nonconvex)
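
The chord test shown by the two slider plots can also be checked numerically. The sketch below uses a hypothetical helper, `chord_stays_above`, that samples points between `a` and `b` and tests whether the line segment ever falls below the function. Passing the test for one pair of points does not prove convexity (that requires every pair), but failing it for any single pair is enough to show a function is not convex.

```python
import numpy as np

def chord_stays_above(f, a, b, num_points=101):
    """Return True if the chord from (a, f(a)) to (b, f(b)) never dips below f."""
    lam = np.linspace(0, 1, num_points)
    x = lam * a + (1 - lam) * b              # points between a and b
    chord = lam * f(a) + (1 - lam) * f(b)    # height of the line segment at those points
    # Small tolerance so floating-point round-off at the endpoints is not counted
    return bool(np.all(chord >= f(x) - 1e-12))

def convex_f(x):
    return x**2

def non_convex_f(x):
    return x**3 - 3*x

print(chord_stays_above(convex_f, -5, 5))      # True
print(chord_stays_above(non_convex_f, -2, 0))  # False
```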


In the context of parameter estimation, a convex cost function landscape is ideal because it guarantees that any local minimum found is also the global minimum. It ensures that the optimization process will not get trapped in a local minimum that is not the best possible solution.
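
To connect the idea of a landscape back to the SIR exercise, here is a minimal sketch that stores the SSE for every ($\beta$, $\gamma$) pair on a grid instead of keeping only the best one. It fits synthetic data generated from made-up parameters (`true_beta`, `true_gamma`) so the location of the minimum is known in advance; the grid ranges and helper names are assumptions for illustration.

```python
import numpy as np
from scipy.integrate import odeint

def sir_rhs(y, t, N, beta, gamma):
    S, I, R = y
    return -beta * S * I / N, beta * S * I / N - gamma * I, gamma * I

def simulate_infected(beta, gamma, t, N=1000, I0=1):
    """Solve the SIR model and return only the infected curve."""
    sol = odeint(sir_rhs, [N - I0, I0, 0], t, args=(N, beta, gamma))
    return sol[:, 1]

# Synthetic "data" from known, made-up parameters
t_grid = np.linspace(0, 60, 61)
true_beta, true_gamma = 0.4, 0.1
synthetic = simulate_infected(true_beta, true_gamma, t_grid)

beta_values = np.linspace(0.2, 0.6, 21)      # grid includes 0.4
gamma_values = np.linspace(0.05, 0.15, 21)   # grid includes 0.10

# landscape[i, j] holds the SSE for beta_values[i] and gamma_values[j]
landscape = np.empty((len(beta_values), len(gamma_values)))
for i, b in enumerate(beta_values):
    for j, g in enumerate(gamma_values):
        predicted = simulate_infected(b, g, t_grid)
        landscape[i, j] = np.sum((synthetic - predicted) ** 2)

# Locate the lowest point of the landscape
i_best, j_best = np.unravel_index(np.argmin(landscape), landscape.shape)
print("lowest-cost grid point:", beta_values[i_best], gamma_values[j_best])
```

The array can be visualized with, for example, `plt.contourf(gamma_values, beta_values, landscape)`, which makes it easier to see whether the surface has a single basin or several.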

In this exercise we have been conceptualizing parameter estimation as a process of trying many different parameter values and choosing the ones with the lowest cost. However, when we know the cost function is convex and differentiable, we can often instead solve analytically for the parameter values that set the derivative of the cost function to 0. For such a cost function, any point where the derivative is 0 is a global minimum.
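
Linear regression with SSE is the standard case where this works. Setting the derivative of the SSE with respect to the coefficients $w$ to 0 gives the normal equations, $(X^T X)\,w = X^T y$, which can be solved directly with no search. The sketch below uses made-up data (a line with intercept 2 and slope 3 plus noise) and checks the closed-form answer against NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: y = 2 + 3*x plus noise
x = rng.uniform(0, 10, size=50)
y_obs = 2 + 3 * x + rng.normal(0, 1, size=50)

# Design matrix with a column of ones for the intercept
X_design = np.column_stack([np.ones_like(x), x])

# Setting d(SSE)/dw = 0 gives the normal equations: (X^T X) w = X^T y
w_closed_form = np.linalg.solve(X_design.T @ X_design, X_design.T @ y_obs)

# Cross-check against NumPy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X_design, y_obs, rcond=None)

print("intercept, slope from normal equations:", w_closed_form)
print("intercept, slope from lstsq:           ", w_lstsq)
```

No analogous closed form is available for the SIR fit above, because the model output depends on $\beta$ and $\gamma$ through the solution of an ODE rather than linearly.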


ASSIGNMENT QUESTIONS:
- please complete questions 12, 13, and 14 in the Jupyter Notebook

## Multiple Linear Regression Revisited

Just as in the SIR model, where we estimate parameters like the transmission and recovery rates, in multiple linear regression (MLR) we estimate coefficients that multiply the predictor (independent) variables. These coefficients determine how much each predictor variable affects the response (dependent) variable. The challenge is to determine the set of coefficients that best fits the observed data, typically by minimizing a cost function such as the sum of squared errors.


#### Challenges of Large or Small Coefficients in MLR

In multiple linear regression, encountering coefficients that are very large or very small can indicate issues such as multicollinearity or that certain predictors have minimal influence on the response variable. Large coefficients can lead to model instability, where small changes in data lead to large changes in parameter estimates, while very small coefficients might suggest that some predictors are not contributing much to the model.
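
The instability caused by multicollinearity can be seen in a small simulation. The sketch below is built entirely on made-up data: it refits ordinary least squares many times with fresh noise in the response and measures how much each coefficient moves, once for two independent predictors and once for two predictors that are almost copies of each other. The helper name `coefficient_spread` and all numeric settings are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 200

def coefficient_spread(X, n_refits=200):
    """Refit OLS on freshly noised responses; return the std of each coefficient."""
    coefs = []
    for _ in range(n_refits):
        # Made-up true relationship: y = x1 + x2 + noise
        y_sim = X[:, 0] + X[:, 1] + rng.normal(0, 1, size=len(X))
        coefs.append(LinearRegression().fit(X, y_sim).coef_)
    return np.std(coefs, axis=0)

x1 = rng.normal(size=n)
X_independent = np.column_stack([x1, rng.normal(size=n)])
# Second column is x1 plus a tiny perturbation, so the two are nearly collinear
X_collinear = np.column_stack([x1, x1 + 0.01 * rng.normal(size=n)])

spread_independent = coefficient_spread(X_independent)
spread_collinear = coefficient_spread(X_collinear)

print("coefficient std, independent predictors:", spread_independent)
print("coefficient std, collinear predictors:  ", spread_collinear)
```

With nearly collinear predictors the data pin down the sum of the two coefficients well but not how that sum is split between them, so the individual estimates swing widely from one refit to the next.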


### Data
We will again be working on real breast cancer data from the [Wisconsin Diagnostic Breast Cancer Database (WDBC)](https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic). We will replace the "malignant" and "benign" values in the 'Diagnosis' column with 0s and 1s. A 0 will indicate malignant and a 1 will indicate benign.

In [None]:
# Load the Breast Cancer dataset
cancer = datasets.load_breast_cancer()
X = cancer.data
y = cancer.target
features = cancer.feature_names

# Create a DataFrame for easier manipulation
df = pd.DataFrame(data=X, columns=features)
df['Diagnosis'] = y  # 0 for malignant, 1 for benign

features = ['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness', 'mean compactness', 'mean concavity', 'mean concave points', 'mean symmetry', 'mean fractal dimension', 'Diagnosis']

df = df[features]
# Display the first few rows of the dataframe
df.head()

To remind ourselves of the distribution of these features for cancerous vs noncancerous cells, let's plot each feature against the diagnosis.

In [None]:
# Plotting correlation plots of a selection of independent variables against the dependent variable
fig, axes = plt.subplots(nrows=4, ncols=3, figsize=(15, 10))
fig.subplots_adjust(hspace=0.5)
fig.suptitle('Correlation Plots')

selected_features = features[:10]
for i, ax in enumerate(axes.flatten()):
    if i < len(selected_features):
        ax.scatter(df[selected_features[i]], df['Diagnosis'])
        ax.set_title(f'{selected_features[i]} vs Diagnosis')
        ax.set_xlabel(selected_features[i])
        ax.set_ylabel('Diagnosis')

plt.show()


The code below normalizes the data and performs MLR with the selected features as independent variables and diagnosis as the dependent variable. Select different features (hold Shift or Command, or Ctrl on Windows/Linux, while clicking to select multiple features) to see how the coefficients estimated by MLR change. Note the R-squared value printed below the coefficients.

In [None]:
from sklearn.preprocessing import StandardScaler

# Function to update the model based on selected features
def update_model(selected_features):
    # Normalize the data for the selected features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(df[list(selected_features)])

    # Initialize models
    mlr = LinearRegression()

    # Fit models
    mlr.fit(X_scaled, y)

    # Predictions
    y_pred_mlr = mlr.predict(X_scaled)

    # Coefficients
    mlr_coefs = pd.Series(mlr.coef_, index=selected_features)

    # R-squared values
    r2_mlr = r2_score(y, y_pred_mlr)

    # Display the coefficients and R-squared values
    coefs_df = pd.DataFrame({
        'MLR Coefficients': mlr_coefs,
    })
    display(coefs_df)
    print(f"MLR R-squared: {r2_mlr:.2f}")

# Create widget for selecting features
select_features = widgets.SelectMultiple(
    options=features[:10],
    value=[features[0]],
    description='Features',
    disabled=False,
    style={'description_width': 'initial'}
)

# Interactive widget
widgets.interactive(update_model, selected_features=select_features)
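
The normalization step in `update_model` is a z-score: each feature has its mean subtracted and is then divided by its standard deviation. The sketch below checks this on a small made-up array by comparing `StandardScaler` with the manual calculation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up measurements for two features on very different scales
raw = np.array([[10.0, 200.0],
                [12.0, 180.0],
                [14.0, 260.0],
                [20.0, 240.0]])

scaled = StandardScaler().fit_transform(raw)

# Manual z-score: subtract each column's mean, divide by its population std (ddof=0)
manual = (raw - raw.mean(axis=0)) / raw.std(axis=0)

print(np.allclose(scaled, manual))               # True
print(scaled.mean(axis=0), scaled.std(axis=0))   # means near 0, stds near 1
```

Because every feature ends up on the same scale, the sizes of the fitted coefficients can be compared across features, which would not be meaningful for raw values such as mean area and mean smoothness.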

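Regularized forms of regression treat coefficients differently: Lasso regression may shrink some coefficients exactly to zero, suggesting these features are less important for predicting the outcome, whereas Ridge tends to reduce the magnitude of all coefficients more uniformly without setting them to zero.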
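
A minimal sketch of this comparison on the ten "mean" features is below. It reloads the dataset under new names so it runs on its own, and the `alpha` values are arbitrary choices for illustration rather than tuned settings; how many coefficients Lasso sets to zero depends on `alpha`.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.preprocessing import StandardScaler

cancer_demo = datasets.load_breast_cancer()
# The first ten columns are the "mean ..." features used above
X_demo = StandardScaler().fit_transform(cancer_demo.data[:, :10])
y_demo = cancer_demo.target

# alpha values are illustrative assumptions, not tuned
models = {
    "OLS": LinearRegression(),
    "Lasso": Lasso(alpha=0.05),
    "Ridge": Ridge(alpha=10.0),
}

for label, model in models.items():
    model.fit(X_demo, y_demo)
    n_zero = int(np.sum(model.coef_ == 0))          # coefficients that are exactly zero
    total_size = np.abs(model.coef_).sum()          # overall coefficient magnitude
    print(f"{label:5s} | zero coefficients: {n_zero} | sum of |coef|: {total_size:.3f}")
```

Try a few values of `alpha` for each model and watch how the number of exact zeros and the overall coefficient size change.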


DISCUSSION QUESTIONS:
- In the model using all the features, why is the feature with the largest MLR coefficient not the same as the feature with the greatest predictive capacity on its own?
- Several features can each be dropped on their own without reducing the model's predictive capacity. Why do you think that is?

ASSIGNMENT QUESTIONS:
- Please complete question 15