<a href="https://colab.research.google.com/github/MaralAminpour/ML-BME-Course-UofA-Fall-2023/blob/main/Week-4-Regression-models/4.1-Regression-intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Regression

## Application: neonatal brain growth

- Around the time of birth the brain grows very quickly
- Preterm birth alters brain development

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/DevelopingPretermBrain.png" width = "300" style="float: left;">

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/Baby.png" width = "200" style="float: right;">

We will demonstrate the regression concepts using the application of brain growth in preterm neonates. Around the time of birth the brain grows very quickly. Preterm birth can disrupt this process, and therefore preterm brain development is a subject of extensive research.

To investigate the changes caused by preterm birth, we can acquire MRI scans of newborn babies. We can perform automatic segmentations of various brain structures and measure their volumes. The features in the datasets we'll be working with are brain volumes, either for (1) the whole brain, (2) six brain tissues, or (3) 86 brain structures.

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/BrainMRI-BrainSegmentation.png" width = "700">

The machine learning regression task is to predict the age (target value) of a baby from the volumes of brain structures (features).

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/AgeVolumePlot.png" width = "400">

In this plot each circle corresponds to a baby (a sample). The x-axis shows the brain volume (a feature scaled with `StandardScaler`), and the age (target value) is plotted on the y-axis. The red line is a univariate linear regression model with two parameters (slope and intercept). It achieves an $R^2$ score of 0.85, which is quite good, so the linear model fits this dataset well.

## How can we increase the performance of this model?

So, you're wondering if we can boost the performance of our model, right?
Specifically, you're looking to get that R2 score closer to the perfect 1. Well, let's dig into the type of error we're dealing with in our dataset. What's your guess?

- bias

- variance

- noise

**Good news:**

1. We're not dealing with much bias here. Our linear model seems to capture the general trend quite nicely, so there's no glaring systematic issue.

2. And as for variance, our model is a simple one, with only two parameters, so it's unlikely that overfitting is the culprit.

3. That leaves us with noise, which, unfortunately, we can't really get rid of by switching up the model.

Now, what about cranking up the complexity by using a dataset with 6 features or even a whopping 86 features?

Sure, we could venture into **multivariate linear regression** territory.

But remember, adding more features doesn't always mean better predictions, especially when the main issue is noise. Keep that in mind, and happy modeling!

## Multivariate linear regression

Can we improve the predictions by using the 6-feature or 86-feature datasets? We can use multivariate linear regression for these datasets.

Multiple features: $$x_{i1}, ... , x_{iD}$$

Target values: $$y_i$$

Samples: $$i = 1, ..., N$$

Prediction model: $$\hat{y} = w_0  + w_1x_1 + ... + w_D x_D$$

Loss function: Sum of squared errors

$$F(\textbf{w}) = \frac{1}{2} \sum_{i}(y_i - \sum_{k} w_k x_{ik} - w_0)^2$$

The model is fitted by minimizing the loss function (sum of squared errors).

In the realm of linear regression, we refer to it as multivariate when there's more than just one feature to consider. For instance, let's say we are examining the volumes of D different structures in the brain. These volumes are represented as $ x_{i1}, \ldots, x_{iD} $. For each of these D-dimensional feature vectors, there is an associated target value, $ y_i $, which in this specific application, corresponds to the age at the time of the scan.

In this scenario, the index $ i $ identifies individual samples in our training set. In the context of our example, each sample corresponds to an individual baby. We have a total of $ N $ samples in our dataset. The predictive model we're working with is based on a linear equation. In this equation, $ w_0 $ serves as the intercept and the weights $ w_1 $ through $ w_D $ are the slopes associated with the $ D $ features.

Our goal is to fine-tune these weights, which are essentially the parameters of our model, to make the most accurate predictions possible. The predicted target value in this context is represented as $ \hat{y} $. To find the best-fitting model, we minimize the sum of squared errors between the predicted and the actual target values.



### Matrix formulation

It is also useful to express the linear regression problem using matrix formulation.

In this version, we use a feature matrix, denoted by $X$, which contains all samples and their respective features.

To simplify the calculations, we include an extra column of ones in this matrix. This extra column corresponds to the model's intercept, so that the number of columns of $X$ matches the number of parameters in $w$, namely $D+1$.

- Feature matrix:
  $$
  X = \begin{pmatrix}
    1 & x_{11} & \cdots & x_{1D} \\
    \vdots & \vdots & \ddots & \vdots \\
    1 & x_{N1} & \cdots & x_{ND}
  \end{pmatrix}
  $$

The target values are represented by the vector $y$, while the weight parameters are in the vector $w$.

- Target vector:
  $$
  y = \begin{pmatrix}
    y_1 \\
    \vdots \\
    y_N
  \end{pmatrix}
  $$

- Weight vector:
  $$
  w = \begin{pmatrix}
    w_0 \\
    \vdots \\
    w_D
  \end{pmatrix}
  $$

The predictive model calculates the estimated target values, $\hat{y}$, through simple matrix multiplication: $X \times w$.

- Prediction model:
  $$
  \hat{y} = X w
  $$

The loss function is then half the squared length of the residual vector $y - Xw$, that is, half the sum of squared differences between the actual and predicted target values:


- Loss function:
  $$
  F(w) = \frac{1}{2} (y - Xw)^T (y - Xw)
  $$

In summary, $X$ is your feature matrix, $y$ is the target vector you're trying to predict, and $w$ are the weights you'll adjust to make your predictions as accurate as possible.
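As a minimal sketch of the matrix formulation, assuming NumPy and a small random toy dataset (the names `features` and `y_hat_sum` and the sizes are illustrative, not taken from the course data), the following checks that the matrix form and the summation form give the same predictions and the same loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: N samples, D features (not the course dataset).
N, D = 5, 3
features = rng.normal(size=(N, D))
y = rng.normal(size=N)

# Prepend a column of ones so that w[0] plays the role of the intercept w_0.
X = np.hstack([np.ones((N, 1)), features])    # shape (N, D + 1)
w = rng.normal(size=D + 1)                    # arbitrary weights, shape (D + 1,)

# Matrix form of the prediction and of the loss F(w).
y_hat = X @ w
residual = y - y_hat
loss_matrix = 0.5 * residual @ residual

# Summation form: w_0 + sum_k w_k x_ik, written out explicitly.
y_hat_sum = np.array([
    w[0] + sum(w[k + 1] * features[i, k] for k in range(D))
    for i in range(N)
])
loss_sum = 0.5 * np.sum((y - y_hat_sum) ** 2)

print(X.shape, np.isclose(loss_matrix, loss_sum))
```

The column of ones is what lets a single matrix product handle the intercept and the slopes together.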



## How to fit the multivariate linear regression model to the training data

To fit the multivariate linear regression model to the training data, our goal is to find the weight vector $\hat{w}$ that minimizes the loss function. We accomplish this by taking the derivative of the loss function with respect to $w$ and setting it to zero. This leads us to the **Normal Equation**:

$$\hat{\textbf{w}} = (\textbf{X}^T \textbf{X})^{-1} \textbf{X}^T \textbf{y}$$

In this equation, $\hat{\textbf{w}}$ is the weight vector that minimizes the loss, $\textbf{X}$ is the feature matrix (which includes a first column of ones for the intercept), and $\textbf{y}$ is the vector of target values.

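Note that $X^T X$ is a square matrix with dimensions $(D+1) \times (D+1)$. Being square is necessary but not sufficient for the inverse to exist: $X^T X$ is invertible only when it is non-singular, that is, when the columns of $X$ are linearly independent. We'll come back to the singular case later.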
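The Normal Equation can be sketched in a few lines of NumPy. This is an illustration on synthetic data with made-up weights (`true_w` is an assumption, not a course result). It solves the linear system $(X^T X) w = X^T y$ rather than forming the inverse explicitly, which is the usual numerical practice, and compares the result with NumPy's least-squares routine:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical synthetic data with known weights (not the course dataset).
N, D = 200, 3
features = rng.normal(size=(N, D))
true_w = np.array([40.0, 2.0, -1.0, 0.5])          # intercept first
X = np.hstack([np.ones((N, 1)), features])
y = X @ true_w + rng.normal(scale=0.1, size=N)     # small additive noise

# Normal equation: solve (X^T X) w = X^T y instead of computing the inverse.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.round(w_normal, 2))
```

Both routes should agree up to floating-point error, and with this little noise the estimated weights land close to `true_w`.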




## Gradient Descent: An Alternative to the Normal Equation

When dealing with a massive number of features (say, upwards of 100,000), the Normal Equation starts to lose its appeal due to the computational cost of matrix inversion. That's where gradient descent comes in handy. Unlike the Normal Equation, **gradient descent** is an iterative method. You start with a random initial guess for the weight vector and then iteratively update it, stepping in the direction that decreases the loss function, until it converges to a minimum. For the sum of squared errors loss, which is convex, a local minimum is also a global minimum.


<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/GradientDescent.png" width = "300">


### The importance of the learning rate

One crucial factor to consider is the learning rate.

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-4-Regression-models/imgs/learning_rate_comic.jpg" width = "300">


If the learning rate is too small, the algorithm may take a very large number of steps to converge. If the learning rate is too large the algorithm may oscillate instead of converging.

So, you've got to find that Goldilocks zone for the learning rate, otherwise the gradient descent will not converge. Here you can see examples of different learning rates.

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/LearrningRate.png" width = "700">

**Scenario 1**

**The Ideal Learning Rate:** When set just right, the gradient descent algorithm will efficiently converge to the optimal solution, represented by the red star in this case.

**Scenario 2**

**Too Small a Learning Rate:** A too-small learning rate is like walking towards a destination in baby steps. You'll get there eventually, but it will take a very long time. Worse still, you might run out of time (or computational resources) before you reach the optimum solution.

**Scenario 3**

**Too Large a Learning Rate:** On the flip side, a too-large learning rate will make the algorithm oscillate around the minimum like a pendulum that's been pushed too hard. Instead of settling at the lowest point, it'll swing from one side to the other, never truly converging.
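The three scenarios can be reproduced with a small batch gradient descent sketch. Everything here is an assumption made for illustration: the data are synthetic, the three learning rates are hand-picked, and the gradient is divided by $N$ so that the step size does not depend on the number of samples.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical synthetic data: one standardized feature plus an intercept column.
N = 100
x = rng.normal(size=N)
x = (x - x.mean()) / x.std()
X = np.column_stack([np.ones(N), x])
y = 40.0 + 3.0 * x + rng.normal(scale=0.5, size=N)


def gradient_descent(X, y, learning_rate, n_steps=100):
    """Batch gradient descent on F(w) / N, starting from w = 0."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        gradient = -X.T @ (y - X @ w) / len(y)   # gradient of F(w) / N
        w = w - learning_rate * gradient
    return w


def loss(X, y, w):
    """Half the sum of squared errors, F(w)."""
    r = y - X @ w
    return 0.5 * r @ r


w_best = np.linalg.solve(X.T @ X, X.T @ y)       # normal-equation reference
for lr in (0.5, 0.001, 2.5):
    w = gradient_descent(X, y, lr)
    print(f"learning rate {lr}: loss = {loss(X, y, w):.3g} "
          f"(optimum {loss(X, y, w_best):.3g})")
```

In this setup the middle value reaches the optimum within 100 steps, the tiny value is still far away, and the large value overshoots on every step so the loss grows instead of shrinking.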




Another issue to consider with gradient descent is how much of the training data to use at each iteration.


**Batch gradient descent:**
In the classical gradient descent, also called batch gradient descent, we use all the samples to update the weight vector at each iteration. If we have a large number of samples, this process can be very slow.

**Stochastic gradient descent:** Stochastic gradient descent uses only one random sample at each iteration. It is therefore very fast, but it can oscillate and is therefore error prone (unstable).

**Mini-batch gradient descent:** A compromise between batch and stochastic, this uses a subset of the training data at each iteration.
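The three variants differ only in how many samples feed each update, so one function with a `batch_size` argument can cover all of them. The sketch below is an illustration under assumptions (synthetic data, a fixed learning rate of 0.05, reshuffling every epoch) and is not the course's implementation:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical synthetic data (not the course dataset).
N, D = 200, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])
true_w = np.array([1.0, 2.0, -3.0])
y = X @ true_w + rng.normal(scale=0.1, size=N)


def minibatch_gd(X, y, batch_size, learning_rate=0.05, n_epochs=200, seed=0):
    """Gradient descent where each update uses `batch_size` samples.

    batch_size = len(y) gives batch gradient descent,
    batch_size = 1 gives stochastic gradient descent.
    """
    shuffler = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        order = shuffler.permutation(len(y))       # reshuffle every epoch
        for start in range(0, len(y), batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            gradient = -Xb.T @ (yb - Xb @ w) / len(idx)
            w = w - learning_rate * gradient
    return w


results = {
    name: minibatch_gd(X, y, batch_size=size)
    for name, size in [("batch", N), ("mini-batch", 32), ("stochastic", 1)]
}
for name, w in results.items():
    print(f"{name:>10}: {np.round(w, 2)}")
```

For the same number of passes over the data, smaller batches make many more (noisier) updates, which is the trade-off described above.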

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/GradientDescentBatches.png" width = "700">

## Evaluating performance of regression models

### $R^2$ score


Last week, we discussed how to assess the performance of a regression model using the $ R^2 $ score. This score tells us how well the model explains the variation in the data. A perfect model would have an $ R^2 $ score of 1.

To calculate $ R^2 $, we start by finding the unexplained variance, which is essentially the sum of squared differences between the actual and predicted target values. We then divide this by the total variance in the data and subtract the result from 1. Although we usually divide these by the number of samples to get variances, these factors cancel out when calculating $ R^2 $.

In code, calculating the $ R^2 $ score is straightforward using Scikit-learn. To find the $ R^2 $ score for all data, you can use the `model.score` method. If you want to use cross-validation, the `cross_val_score` function can help.

This is the proportion of variance in the data explained by the model. A perfect fit would have $R^2 = 1$.

$$R^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}$$

where the $\hat{y}_i$ are the predicted target values and $\bar{y}$ is the average target value:

- **Predicted Target Values:**
$$
\hat{y}_i = w_0 + \sum_{k} w_k x_{ik}
$$

- **Average Target Value:**
$$
\bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i
$$

In these formulas:
- $\hat{y}_i$ is the predicted target value for the $i$-th sample.
- $w_0$ is the intercept, and $w_k$ are the weights for the features.
- $x_{ik}$ are the feature values for the $i$-th sample.
- $\bar{y}$ is the average target value.
- $N$ is the total number of samples.
- $y_i$ is the actual target value for the $i$-th sample.
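The definition can be sketched directly in NumPy: one minus the sum of squared residuals divided by the sum of squared deviations from the mean. The function name `r2_score_manual` and the four example ages are made up for illustration.

```python
import numpy as np


def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)           # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation
    return 1.0 - ss_res / ss_tot


# Hypothetical ages in weeks and predictions, for illustration only.
y_true = np.array([30.0, 34.0, 38.0, 42.0])
y_pred = np.array([31.0, 33.0, 39.0, 41.0])

print(r2_score_manual(y_true, y_pred))               # SS_res = 4, SS_tot = 80
print(r2_score_manual(y_true, y_true))               # perfect fit
print(r2_score_manual(y_true, np.full(4, 36.0)))     # always predicting the mean
```

The last two calls show the reference points: a perfect model scores 1, and a model that always predicts the mean scores 0.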

Let's use the $R^2$ score to compare linear regression models for each of the datasets.

Now, let's explore if increasing the number of features can improve the $R^2$ score. When we look at the $R^2$ score for the entire dataset, it does increase as we add more features. However, similar to what we observed with polynomial degrees, it's not clear if this improvement is due to a better fit or just overfitting.

To clarify this, we can calculate the cross-validated $R^2$ score. Our findings show that the performance improves as we increase the number of features from one to six, but then declines when we jump to 86 features. This leads us to conclude that the optimal model has 6 features, with a cross-validated $R^2$ score of 0.89. On the other hand, the model with 86 features appears to be overfitting, as its cross-validated $R^2$ score drops to 0.68.

Keep in mind that adding more features also increases the complexity of the model, as it has to fit more parameters.

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/R2_scores.png" width = "700">

**Summary:** The $R^2$ scores on the whole dataset improve with a larger number of features. However, when we calculate cross-validated $R^2$ scores, the best performance is with the 6-feature dataset. This indicates that the model for the 86-feature dataset is overfitted to the noise in the data.
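The same comparison can be sketched with scikit-learn's `score` method and `cross_val_score`. The data below are synthetic stand-ins (150 samples, 86 random features of which only the first 6 carry signal), so the numbers will not match the course results; the point is only the pattern of whole-data versus cross-validated $R^2$.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)

# Hypothetical synthetic data: only the first 6 of 86 features carry signal.
N = 150
X_all = rng.normal(size=(N, 86))
signal_weights = np.array([3.0, 1.0, 1.0, 1.0, 1.0, 1.0])
y = X_all[:, :6] @ signal_weights + rng.normal(scale=2.0, size=N)

scores = {}
for n_features in (1, 6, 86):
    X = X_all[:, :n_features]
    model = LinearRegression()
    r2_all = model.fit(X, y).score(X, y)                               # R^2 on all data
    r2_cv = cross_val_score(model, X, y, cv=5, scoring="r2").mean()    # cross-validated R^2
    scores[n_features] = (r2_all, r2_cv)
    print(f"{n_features:2d} features: R2 (all data) = {r2_all:.2f}, "
          f"R2 (CV) = {r2_cv:.2f}")
```

Because the feature sets are nested, the whole-data $R^2$ can only go up as features are added, while the cross-validated score reveals when the extra features stop helping.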

### Root mean squared error (RMSE)

Another method for evaluating the performance of regression models is to use the Root Mean Squared Error (RMSE). RMSE provides an average error in the same units as the target values, with a perfect model yielding an RMSE of zero.

In simple terms, RMSE tells you the average mistake your model makes in the same units as what you're trying to predict. A perfect model would have an RMSE of zero, meaning it makes no mistakes.

$$ \text{RMSE} = \sqrt{\frac{1}{N} \sum_{i} (y_i - \hat{y}_i)^2}$$

where the formula for predicted target values can be rewritten as:

$$
\hat{y}_i = w_0 + \sum_{k} w_k x_{ik}
$$

In this formula:
- $\hat{y}_i$ represents the predicted target value for the $i$-th sample.
- $w_0$ is the intercept term.
- $w_k$ are the feature weights.
- $x_{ik}$ are the feature values for the $i$-th sample.
- The sum runs over all features $k$.

Here the $y_i$ are the target values and the $\hat{y}_i$ are the predicted values.

Note that unlike $R^2$, the RMSE is in the same units as the target values (i.e. weeks in our example).

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/RMSE_scores.png" width = "700">

Here is the performance for the different datasets measured by RMSE. Better RMSE scores are lower values (closer to 0). The RMSE scores agree with the $R^2$ scores. We see that RMSE on the whole set decreases with an increasing number of features. When we use cross-validation, the lowest error, 1.27 weeks, is achieved for 6 features, and the model with 86 features is overfitted to the noise in the data, because it has a large cross-validated RMSE.

**Summary:** The best model is on the 6-feature dataset. The 86-feature dataset has been overfitted.



To compute RMSE using Scikit-learn, you first fit the model and make predictions. After that, you calculate the mean squared error between the actual and predicted target values and take the square root to get the RMSE. You can also evaluate the model's performance by training on one dataset and testing it on another.

For cross-validated RMSE, you can use Scikit-learn's `cross_val_score` function. However, you'll need to set the `scoring` parameter to `"neg_mean_squared_error"` (negative mean squared error). This function returns an array of scores for all the cross-validation folds. These scores are negative, so you'll need to negate them and then take the square root to get the RMSE for each fold. Finally, you average these values to get the overall cross-validated RMSE.
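The two recipes above might look as follows. This is a sketch on synthetic data (random features standing in for brain volumes, made-up ages in weeks), using `LinearRegression`, `mean_squared_error` and `cross_val_score` from scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)

# Hypothetical synthetic data standing in for volumes (X) and ages in weeks (y).
N, D = 120, 6
X = rng.normal(size=(N, D))
y = 38.0 + X @ rng.normal(size=D) + rng.normal(scale=1.0, size=N)

# RMSE on the data the model was fitted to.
model = LinearRegression().fit(X, y)
rmse_all = np.sqrt(mean_squared_error(y, model.predict(X)))

# Cross-validated RMSE: the scorer returns the negative MSE of each fold.
neg_mse = cross_val_score(LinearRegression(), X, y, cv=5,
                          scoring="neg_mean_squared_error")
rmse_folds = np.sqrt(-neg_mse)        # negate, then take the root per fold
rmse_cv = rmse_folds.mean()           # average over the folds

print(f"RMSE (all data): {rmse_all:.2f} weeks, RMSE (CV): {rmse_cv:.2f} weeks")
```

Scikit-learn reports the negated error because its scorers follow a "higher is better" convention.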

### Relationship between the $R^2$ score and RMSE

So what is the relationship between the $R^2$ score and RMSE?

Both are related to the sum of squared errors, which is also our loss function. While the $R^2$ score is normalised and therefore comparable between models and datasets, RMSE is interpretable, because it is expressed in units of the target values.

- **Sum of Squared Errors (SSE) Loss Function:**

$$
F(w) = \frac{1}{2} \sum_{i} \left( y_i - \hat{y}_i \right)^2
$$

- **R-squared ($R^2$) Score:**

$$
R^2 = 1 - \frac{2 F(w)}{\sum_{i} \left( y_i - \bar{y} \right)^2} = 1 - \frac{2 F(w)}{N \sigma^2}
$$

- **Root Mean Squared Error (RMSE):**

$$
RMSE = \sqrt{\frac{2}{N} F(w)}
$$

In these formulas:

- $F(w)$ is half the sum of squared errors between the actual ($y_i$) and predicted ($\hat{y}_i$) target values, so the sum of squared errors itself is $2F(w)$.

- $R^2$ is a normalized score that can be compared between different models or datasets.
- $RMSE$ is interpretable because it's in the same units as the target values.
- $N$ is the total number of samples.
- $\sigma^2 = \frac{1}{N} \sum_{i} (y_i - \bar{y})^2$ is the variance of the target values, and $\bar{y}$ is their average.
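A quick numeric check makes the link concrete. Because $F(w)$ carries a factor of $\frac{1}{2}$, the sum of squared errors is $2F(w)$, so $R^2 = 1 - 2F(w) / \sum_i (y_i - \bar{y})^2$ and $RMSE = \sqrt{2F(w)/N}$. The four target values and predictions below are made up for illustration.

```python
import numpy as np

# Hypothetical target values (weeks) and predictions, for illustration only.
y = np.array([30.0, 34.0, 38.0, 42.0])
y_hat = np.array([31.0, 33.0, 39.0, 41.0])
N = len(y)

F = 0.5 * np.sum((y - y_hat) ** 2)      # loss: half the sum of squared errors
ss_tot = np.sum((y - y.mean()) ** 2)    # equals N times the variance of y

r2 = 1.0 - 2.0 * F / ss_tot             # normalized by the spread of the targets
rmse = np.sqrt(2.0 * F / N)             # back in the units of the targets

print(F, r2, rmse)
```

Both scores are computed from the same quantity $F(w)$; they differ only in how it is normalized.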

## Penalised Regression

### Overfitting

Overfitting is like the classic "too much of a good thing" problem in machine learning, including in multivariate linear regression. You'd think that adding extra features is like adding more spices to your cooking—it'll only make things tastier, right? Well, not necessarily. Too many features can cause your model to learn the "noise," or random fluctuations in the data, rather than the real pattern you're interested in. When that happens, your model may ace the training data but will likely bomb on new, unseen data.

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-4-Regression-models/imgs/overfitting_comic.png" width = "400">

We have seen that adding more features can reduce the cross-validated performance of multivariate linear regression. This is due to overfitting. So what is the relationship between the number of samples, the number of features and the performance of the model?

- The typical way to get the model parameters $w$ in multivariate linear regression is by using the normal equation:

$$
w = (X^T X)^{-1} X^T y
$$

- In this equation, $X$ is the feature matrix, which has dimensions $N \times (D+1)$. Here, $N$ is the number of samples and $D$ is the number of features. The vector $y$ contains your target or output values. The $(D+1)$ bit comes in because we usually add a "bias" term to the features.

- Now, here's where it gets tricky. You have to invert the matrix $X^T X$, and this is a $(D+1) \times (D+1)$ matrix. Inverting matrices isn't just a click of a button; it's computationally heavy and can be problematic if $X^T X$ is nearly singular (which is a fancy term for non-invertible or ill-conditioned).

- Also, don't forget that the rank of the matrix $X^T X$ can't be higher than $N$, the number of your samples. So if you have more features than samples ($D+1 > N$), then you're in trouble: $X^T X$ becomes singular and you can't invert it. Even if you have fewer features ($D+1 \leq N$), a low rank could set the stage for overfitting.

- This is when Ridge regression becomes the hero of the day. It adds a penalty term to the loss function, effectively converting the matrix to be inverted into $X^T X + \lambda I$. This not only makes it invertible but also helps in keeping overfitting at bay.
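The rank argument is easy to check numerically. In this sketch the sizes ($N = 5$ samples, $D + 1 = 9$ parameters) and the value `lam = 0.1` are arbitrary choices made for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical case with fewer samples than parameters: N = 5, D + 1 = 9.
N, D = 5, 8
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])

gram = X.T @ X                                   # (D + 1) x (D + 1)
rank = np.linalg.matrix_rank(gram)
print(f"X^T X is {gram.shape[0]}x{gram.shape[1]} with rank {rank}")

# Adding lambda * I (the ridge modification) restores full rank.
lam = 0.1                                        # illustrative value, not tuned
gram_ridge = gram + lam * np.eye(D + 1)
rank_ridge = np.linalg.matrix_rank(gram_ridge)
print(f"X^T X + lambda I has rank {rank_ridge}")
```

The rank of $X^T X$ cannot exceed $N$, so the $9 \times 9$ matrix is singular here, while the ridge-modified matrix has full rank and can be inverted.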

### The relationship between the number of samples $N$ and the number of features $D+1$

The number of samples $N$ and the number of features $D+1$ play a vital role in how well a multivariate linear regression model performs, particularly when you're worried about overfitting. Here's a quick rundown:

- **Fewer Samples than Features ($N < D+1$)**
  - In this tricky situation, the matrix $X^T X$ isn't invertible. This happens because you've got more unknowns than equations, leaving you with an underdetermined system. Your options are limited: you can either reduce the number of features or introduce regularization methods to make the matrix invertible.

- **Slightly More Samples than Features ($N \approx D+1$)**
  - Okay, so here the matrix $X^T X$ is invertible, but it's not time to celebrate just yet. Your model might get too cozy with the noise in the data, picking up random fluctuations instead of the real trend. The result? It might do great on the training data but stumble when it sees new, unseen data. To handle this, you could use regularization techniques like Ridge or Lasso.

- **Significantly More Samples than Features ($N \gg D+1$)**
  - Now you're talking! In this case, the matrix $X^T X$ is happily invertible, and the risk of overfitting drops. With a heap of data at your disposal, your model can do a much better job capturing the underlying trends, setting you up for success on new data.

- **Increasing the Number of Samples**
  - If you're still wringing your hands about overfitting, there's a simple fix—just add more samples to the training set. This beefs up your model's understanding of the underlying data distribution and helps it generalize better to new, out-of-sample data.

So, the balance between your number of samples and features is crucial for your model's performance, particularly when overfitting is on the radar.

### Ridge regression

Overfitting can be avoided by reducing the number of parameters in the model (in the case of linear regression models this can be done by reducing the number of features). An alternative is **regularization**, where a penalty term is added to the loss function.

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/RidgePenalty.png" width = "500">

The **ridge penalty** penalizes weights with a large magnitude. It is based on the **L2 norm** of the weight vector (more precisely, its square). $\lambda$ is a hyperparameter that determines the strength of the regularization.

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/L2_norm.png" width = "200">

The ridge penalty rises quadratically as the magnitude of the weight increases.

Ridge regression is a form of regularization that tackles the overfitting problem by penalizing the magnitude of the model weights. Let's delve into some of the key aspects:

- **Penalizes Weights with Large Magnitude**
  Ridge regression uses the squared L2 norm of the weight vector $w$ as a penalty term. The idea is to discourage the model from assigning excessively large weights to the features. The larger the weight, the higher the penalty.
  
- **Hyperparameter $ \lambda $**

  The strength of this regularization is controlled by the hyperparameter $\lambda$. A high value of $\lambda$ will result in a stronger penalty, pushing the weights closer to zero. Conversely, a lower value of $\lambda$ will make the regularization less severe.

- **Matrix Formulation**

  The loss function $ F(w) $ for Ridge regression can be expressed in matrix form as:

$$
F(w) = \frac{1}{2} (y - Xw)^T (y - Xw) + \frac{\lambda}{2} w^T w
$$

This function combines the standard loss term, which measures how well the model fits the data, with the regularization term, which penalizes large weights. The $\frac{1}{2}$ in front of each term is often added for mathematical convenience, making it easier to differentiate the function.

In a nutshell, Ridge regression provides a way to balance fitting the data well while keeping the model weights in check, effectively helping to manage overfitting.

We have seen that overfitting in multivariate linear regression can be addressed by reducing the number of features. An alternative is to keep the same features and regularise the model instead. To do that we can add a penalty to the loss function that will discourage weights with very large magnitude. If the penalty is calculated using the squared L2 norm of the weight vector (which is equal to the sum of squares of the individual weights), it is called the Ridge penalty. The hyperparameter $\lambda$ determines the strength of the penalty term. The ridge penalty can be expressed in matrix form as $w^T w$. The penalty for a single weight is plotted in the graph above. You can see that it is zero for a zero weight, and rises quadratically as the magnitude of $w$ increases, whether it is negative or positive.

### **Finding the Weight Vector $ \hat{w} $**

Ridge regression aims to find the weight vector $ \hat{w} $ that minimizes the loss function. Let's explore the main components:

- **Finding the Weight Vector $ \hat{w} $**
  In Ridge regression, you're looking to minimize the following loss function:
  $$
  \hat{w} = \arg \min_{w} \left[ \frac{1}{2} (y - Xw)^T (y - Xw) + \frac{\lambda}{2} w^T w \right]
  $$
  Here, $ \arg \min_{w} $ denotes the value of $ w $ that minimizes the expression inside the brackets. The loss function comprises two parts: the original loss term and the regularization term.

- **Setting First Derivative to Zero**
  Differentiating the above loss function with respect to $ w $ and setting it equal to zero, you can derive the closed-form expression for $ \hat{w} $:
  $$
  \hat{w} = (X^T X + \lambda I)^{-1} X^T y
  $$
  This equation shows how to find the optimal $ \hat{w} $ that minimizes the Ridge regression loss function. $ I $ is the identity matrix, making $ X^T X + \lambda I $ invertible and thus solving the problem of non-invertibility we discussed earlier.

- **Gradient Descent Solution**
  If you prefer an iterative approach, you can use gradient descent. The gradient of the loss is $-X^T y + X^T X w + \lambda w$, so stepping against it gives the update rule:
  $$
  w_{n+1} = w_n + \eta (X^T y - X^T X w_n - \lambda w_n)
  $$
  Here, $ \eta $ is the learning rate, which controls the step size in the direction of the steepest decrease of the loss function. $ w_n $ and $ w_{n+1} $ are the weight vectors before and after the $ (n+1) $th iteration, respectively.

In summary, Ridge regression offers multiple ways to find the optimal weight vector $ \hat{w} $, be it through closed-form solutions or iterative methods like gradient descent. The regularization term helps in managing overfitting, especially when you're dealing with a high-dimensional feature space.
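Both routes can be compared in a short NumPy sketch. The assumptions are: synthetic data, no intercept column (for brevity), an arbitrary `lam = 10.0`, and a step size set to one over the largest curvature of the loss. The gradient of the ridge loss is $-(X^T y - X^T X w) + \lambda w$, so the descent step subtracts $\lambda w$.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical synthetic data (not the course dataset), intercept omitted.
N, D = 100, 5
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + rng.normal(scale=0.5, size=N)
lam = 10.0                                          # illustrative penalty strength

# Closed form: solve (X^T X + lambda I) w = X^T y.
w_closed = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Gradient descent on the same loss.
# Step size: 1 / (largest eigenvalue of X^T X + lambda I), a safe choice.
eta = 1.0 / (np.linalg.norm(X, 2) ** 2 + lam)
w = np.zeros(D)
for _ in range(2000):
    w = w + eta * (X.T @ y - X.T @ X @ w - lam * w)

# Ordinary least squares for comparison: ridge weights are shrunk towards zero.
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w, w_closed), np.linalg.norm(w_closed) < np.linalg.norm(w_ols))
```

The iterative solution converges to the closed-form one, and the ridge weight vector is shorter than the unpenalized least-squares solution.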

### How the hyperparameter $\lambda$ influences the performance of Ridge regression

The behavior of ridge regression depends on hyperparameter $\lambda$.

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/RidgeRegressionBehavior.png" width = "600">


The hyperparameter λ is typically determined through cross-validation. In a sample plot, we depict the Root Mean Square Error (RMSE) as a function of the logarithm of λ. This is calculated both for the training set (illustrated by a blue dashed line) and through cross-validation (represented by a red solid line).

Let's break it down in terms of the graph:


- **Very Small $ \lambda $**
  When $ \lambda $ is close to zero, the regularization term has little effect. Consequently, the model focuses more on minimizing the original loss term, leading to a model that's potentially overfitted. This is reflected in the graph as a small training Root Mean Square Error (RMSE) but a much larger Cross-Validation (CV) RMSE (blue dashed line for training, red solid line for CV).

- **Optimal $ \lambda $**
  As you increase $ \lambda $, the regularization term starts to play a more significant role. This causes the training RMSE to increase, but it also generally results in the CV RMSE reaching a minimum point. This is the "sweet spot" where the model generalizes well to new data. In terms of the graph, this is where the CV RMSE curve reaches its minimum value. At this point, you're penalizing the large weights just enough to avoid overfitting but not so much that you underfit the data.

- **Large $ \lambda $**
  For larger values of $ \lambda $, the model starts to become too simplistic, essentially underfitting the data. Both training and CV RMSE increase, and they may stabilize at some higher value as $ \lambda $ increases further. At this point, the regularization term dominates, and the model can't fit even the training data well.

Choosing the optimal $ \lambda $ through cross-validation is a practical approach. You typically look for the value of $ \lambda $ where the CV RMSE is minimized, as it indicates a good generalization performance.

Plotting RMSE against $ \log(\lambda) $ in this way is often called a validation curve. It's a very useful tool for understanding the impact of regularization and choosing an appropriate $ \lambda $. (The related "regularization path" plot shows how the individual weights change with $ \lambda $.)
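The shape of these curves can be reproduced with a small sweep over $\lambda$. This sketch makes several simplifying assumptions: synthetic data with few samples and many features, the closed-form ridge solution without an intercept, and a single held-out validation set in place of cross-validation.

```python
import numpy as np

rng = np.random.default_rng(9)

# Hypothetical synthetic data: few samples, many features, so overfitting is likely.
N_train, N_val, D = 60, 200, 40
w_true = np.zeros(D)
w_true[:5] = 1.0                                     # only 5 features matter
X_train = rng.normal(size=(N_train, D))
X_val = rng.normal(size=(N_val, D))
y_train = X_train @ w_true + rng.normal(scale=1.0, size=N_train)
y_val = X_val @ w_true + rng.normal(scale=1.0, size=N_val)


def rmse(y, y_hat):
    """Root mean squared error."""
    return np.sqrt(np.mean((y - y_hat) ** 2))


lambdas = np.logspace(-3, 4, 15)
train_rmse, val_rmse = [], []
for lam in lambdas:
    # Closed-form ridge solution for this lambda.
    w = np.linalg.solve(X_train.T @ X_train + lam * np.eye(D),
                        X_train.T @ y_train)
    train_rmse.append(rmse(y_train, X_train @ w))
    val_rmse.append(rmse(y_val, X_val @ w))

best = int(np.argmin(val_rmse))
print(f"lowest validation RMSE at log10(lambda) = {np.log10(lambdas[best]):.1f}")
```

The training RMSE can only grow as $\lambda$ increases, whereas the validation RMSE is what identifies a useful amount of regularization.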

### Grid search

You can use a grid search to find the optimal lambda values.

Grid search is a commonly used technique to find the optimal value for hyperparameters like $ \lambda $ in Ridge regression. In a grid search, you specify a range of possible values for $ \lambda $, and then the algorithm evaluates the model performance for each value within that range. Usually, this is done through cross-validation to ensure that the selected $ \lambda $ results in a model that generalizes well to unseen data.

Here's a simplified outline of how you might use grid search to find the optimal $ \lambda $:

1. **Specify the Grid**: Choose a range of $ \lambda $ values to explore. This could be a linear or logarithmic scale depending on what you think is most appropriate.
  
2. **Cross-Validation**: For each $ \lambda $ in the specified grid, perform cross-validation and compute the average Cross-Validation (CV) RMSE.

3. **Compare Performances**: Look at the CV RMSE for each $ \lambda $ and choose the one that minimizes this value. This is your optimal $ \lambda $.

4. **Train Final Model**: Using the optimal $ \lambda $, train the Ridge regression model on the entire dataset.

5. **Validation and Testing**: Optionally, you may validate the final model using a separate validation set or additional cross-validation techniques.

Grid search is straightforward but can be computationally expensive if the grid is large and the dataset is sizable. However, it's often worth the computational cost to ensure that you're selecting the best hyperparameter values for your model.

### Ridge regression in Scikit learn

In scikit-learn, Ridge regression can be executed using the `Ridge` class. The hyperparameter $\lambda$ corresponds to the `alpha` parameter within this class. To find the optimal value for $\lambda$, one can utilize the `GridSearchCV` class. It's advisable to specify the range for $\lambda$ on a logarithmic scale, which can be done using NumPy's `logspace` function. This covers several orders of magnitude with a small number of candidate values.

Feature scaling becomes crucial when implementing Ridge regression, as the magnitudes of the features directly influence the penalty on the weights. Larger weights will incur greater penalties, making scaling essential for consistent regularization.

An additional point to consider is that scikit-learn's implementation of Ridge regression automatically excludes the intercept $w_0$ from regularization. This is done to minimize the bias error that could arise from including the intercept in the regularization term.
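Put together, the points above might look like the following sketch. The data are synthetic (86 random features on deliberately different scales), and the grid bounds and the number of folds are arbitrary choices rather than the course's settings. Placing `StandardScaler` inside a pipeline is one way to make sure each cross-validation fold is scaled using only its own training part.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(8)

# Hypothetical synthetic data: 86 features on very different scales.
N, D = 120, 86
X = rng.normal(size=(N, D)) * rng.uniform(1.0, 100.0, size=D)
y = 38.0 + 0.05 * X[:, :6].sum(axis=1) + rng.normal(scale=1.0, size=N)

# Scaling followed by ridge regression, treated as a single estimator.
pipeline = make_pipeline(StandardScaler(), Ridge())

# Candidate lambda values (called `alpha` in scikit-learn) on a logarithmic grid.
param_grid = {"ridge__alpha": np.logspace(-3, 3, 13)}
search = GridSearchCV(pipeline, param_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X, y)          # also refits the best pipeline on all the data

best_alpha = search.best_params_["ridge__alpha"]
best_rmse = np.sqrt(-search.best_score_)     # root of the mean CV MSE
print(f"best alpha = {best_alpha:.3g}, CV RMSE = {best_rmse:.2f}")
```

After fitting, `search.best_estimator_` holds the pipeline retrained on the whole dataset with the selected `alpha`, ready for `predict`.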

**Now going back to our problem:**

Does ridge regression help avoid overfitting on our sample datasets?

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/RidgeRegresion_scores.png" width = "600">

In the table, we present a performance comparison of Ridge regression across the three datasets previously examined in the context of multivariate linear regression. Interestingly, the datasets with 1 and 6 features exhibit performance metrics similar to those observed before. However, Ridge regression truly shines when leveraging a larger feature set. In such cases, it achieves its best performance to date, with an $R^2$ score of 0.91 and an RMSE of 1.17 weeks.

**Answer:** Yes, with an optimal lambda, the overfitting on the 86-feature dataset is avoided.

### Lasso regression

An alternative type of regularization is **Lasso regression**. This uses the **Lasso penalty** or **L1 norm**.

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/LassoPenalty.png" width = "500">

The **Lasso penalty** penalizes the sum of the absolute values of the weights. It is also called the **L1 norm**. As with Ridge, $\lambda$ is a hyperparameter that determines the strength of the regularization.

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/L1_norm.png" width = "200">

Compared with Ridge, Lasso tends to reduce weights to 0 and thus produces sparse solutions (many weights with value = 0).


**Matrix formulation of Lasso regression**

The matrix formulation of the objective function $F(w)$ for Lasso regression looks like the Ridge objective, with the additional twist of introducing the "sign" function. It can be written as:

$$
F(w) = \frac{1}{2} (y - Xw)^T(y - Xw) + \frac{\lambda}{2} w^T \text{sign}(w)
$$

Here, $\text{sign}(w)$ is a vector that contains the sign of each element $w_k$ in the weight vector $w$. It is defined as:

$$
\text{sign}(w) = [\text{sign}(w_1), \ldots, \text{sign}(w_K)]^T
$$

Where the function $\text{sign}(w_k)$ is defined as:

$$
\text{sign}(w_k) = \begin{cases}
1 & \text{if } w_k > 0 \\
0 & \text{if } w_k = 0 \\
-1 & \text{if } w_k < 0
\end{cases}
$$

Note that this is a bit different from the classic Ridge regression formulation, which uses the squared $L2$ norm $w^Tw$ instead of $w^T \text{sign}(w)$. The inclusion of the sign function adds an additional non-linearity into the optimization problem.


**To wrap it up:** The Least Absolute Shrinkage and Selection Operator, commonly known as LASSO, serves as another regularization technique, distinct from Ridge regression. In LASSO, the penalty imposed on the weights is based on the $L1$ norm, which is simply the sum of the absolute values of the weights. As we'll demonstrate later, the $L1$ penalty tends to generate sparser solutions compared to the $L2$ penalty used in Ridge regression.

In matrix form, the loss function for LASSO can be articulated by incorporating the $\text{sign}(w)$ vector. This vector contains elements of 1, -1, or 0 depending on the sign of each corresponding weight, thus enabling the representation of the absolute value of each weight in the penalty term.
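A quick numerical check of this identity, using a small hypothetical weight vector chosen only for illustration:

```python
import numpy as np

# Hypothetical weight vector with positive, negative and zero entries
w = np.array([1.5, -2.0, 0.0, 0.25])

l1_from_sign = w @ np.sign(w)   # w^T sign(w), as in the Lasso penalty
l1_direct = np.sum(np.abs(w))   # sum of absolute values (L1 norm)
l2_squared = w @ w              # w^T w, as in the Ridge penalty

print(l1_from_sign, l1_direct, l2_squared)
```

The first two numbers agree, because multiplying each weight by its own sign gives its absolute value. The third shows that the Ridge penalty weighs the large entry $-2.0$ much more heavily than the small entry $0.25$.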

### How to find the weight vector $\hat{w}$ that minimizes the loss function?

In LASSO regression, the goal is to find the weight vector $\hat{w}$ that minimizes the loss function, which can be represented as:

$$
\hat{w} = \arg\min_w \left[ \frac{1}{2} (y - Xw)^T(y - Xw) + \frac{\lambda}{2} w^T \text{sign}(w) \right]
$$

- The LASSO penalty employs the $L1$ norm, which is not differentiable wherever a weight equals zero. As a result, there is in general no closed-form analytical solution for minimizing this loss function.

- However, a subgradient of the $L1$ norm can still be computed. In this case, it is the $\text{sign}(w)$ vector, which contains the sign of each weight.

- Despite the non-differentiability, one can use gradient descent with this subgradient in place of the gradient to find an optimal solution. The iterative update rule in this context would be:

$$
w_{n+1} = w_n + \eta \left(X^T y - X^T X w_n - \frac{\lambda}{2} \text{sign}(w_n)\right)
$$

Here, $\eta$ is the learning rate, $w_n$ and $w_{n+1}$ are the weight vectors before and after the $(n+1)$-th iteration, respectively. This allows us to iteratively converge to an optimal solution, even without a closed-form analytical solution.

**Explanation:** While the LASSO penalty incorporates a non-differentiable $L1$ norm, making it impossible to find an analytical solution, there's no need to worry. We can still work around this by calculating what's known as the subgradient of the $L1$ norm, which is essentially the sign of each individual weight in our model. This is symbolized by $\text{sign}(w)$.

Armed with this subgradient, we're not at a dead-end! We can still apply the gradient descent algorithm to find the optimal set of weights that minimize our loss function. This way, we're iteratively adjusting our model to perform better, even though a neat, closed-form solution isn't available. So, while LASSO throws us a curveball with its non-differentiability, we've got the tools to hit it out of the park!
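The update rule can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the data are synthetic, the function name `lasso_subgradient_descent` is hypothetical, the penalty uses the $\frac{\lambda}{2}$ factor from the loss function above, and the fixed step size is a simple choice made for stability.

```python
import numpy as np

def lasso_subgradient_descent(X, y, lam, eta, n_iter=5000):
    """Minimise 1/2 ||y - Xw||^2 + lam/2 * ||w||_1 with fixed-size subgradient steps."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # Negative subgradient of the loss at the current weights
        step = X.T @ y - X.T @ X @ w - 0.5 * lam * np.sign(w)
        w = w + eta * step
    return w

# Synthetic data (hypothetical): only the first two features matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 0.0]) + rng.normal(scale=0.1, size=200)

# A step size below 1 / (largest eigenvalue of X^T X) keeps the updates stable
eta = 0.5 / np.linalg.eigvalsh(X.T @ X).max()

w_hat = lasso_subgradient_descent(X, y, lam=20.0, eta=eta)
print(np.round(w_hat, 3))
```

With a fixed learning rate the irrelevant weights end up hovering close to zero rather than being exactly zero. Library implementations such as scikit-learn's `Lasso` use coordinate descent instead, which does produce exact zeros.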

### Lasso regression implementation in scikit-learn

The scikit-learn `Lasso` class implements Lasso regression. We fit it the same way as Ridge regression: the hyperparameter $\lambda$ is set through the parameter `alpha`, and a good value is found using `GridSearchCV` with a parameter grid defined on a logarithmic scale.

- First off, Scikit-learn has got you covered with its Lasso object, making it super easy to implement Lasso regression.
  
- Wondering about the hyperparameter $\lambda$? You can set it using the parameter named 'alpha'. That's Scikit-learn's way of saying $\lambda$.

- Not sure which $\lambda$ is your Goldilocks value? No worries! The GridSearchCV object is your trusty sidekick for figuring that out.

- Ah, a quick heads-up! Feature scaling is pretty crucial this time. The model's weights are sensitive to the range of your features. A higher weight will get more of a penalty, so you want to make sure everything's on the same scale.

- Last but not least, Scikit-learn gives the intercept term, $w_0$, a free pass by not penalizing it. This little nuance actually helps in improving the performance of your model.
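Here is how these points might look in code. It is a sketch on synthetic stand-in data (not the 86-structure dataset); the grid range, `max_iter` and the number of folds are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): 20 features, only 3 are informative
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -1.5, 1.0]
y = X @ true_w + 40.0 + rng.normal(scale=0.5, size=120)

# Scale the features, then fit Lasso; alpha plays the role of lambda
model = make_pipeline(StandardScaler(), Lasso(max_iter=10000))
param_grid = {"lasso__alpha": np.logspace(-3, 1, 9)}

search = GridSearchCV(model, param_grid, cv=5, scoring="r2")
search.fit(X, y)

best_lasso = search.best_estimator_.named_steps["lasso"]
print("best alpha:", search.best_params_["lasso__alpha"])
print("non-zero weights:", np.count_nonzero(best_lasso.coef_))
print("intercept (not penalised):", round(best_lasso.intercept_, 2))
```

One thing to keep in mind: scikit-learn's `Lasso` divides the squared-error term by the number of samples, so its `alpha` is not numerically identical to the $\lambda$ in the formulas above, even though it plays the same role.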


### Comparison of Ridge and Lasso

- Ridge and Lasso penalties both decrease the magnitude of weights
- Ridge penalises weights with large magnitude more
- Lasso penalises weights with small magnitude more

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/ComparisonRidgeLasso.png" width = "700">

Examining the differences between Ridge and Lasso regression reveals how each approach affects the weight magnitudes. When plotting the weights corresponding to 86 brain structures, the coefficients for multivariate linear regression (represented by blue circles) exhibit greater magnitudes compared to those from Ridge and Lasso regression, which are denoted by red stars and green triangles, respectively.

On comparing the Ridge and Lasso penalties, Ridge is observed to penalize large weights more harshly than Lasso. Conversely, Lasso imposes a heavier penalty on small weights, leading to a sparse solution. In other words, Ridge tends to have non-zero weights for all features, as indicated by red stars, while Lasso results in only a few non-zero weights, resulting in what is termed a "sparse solution."

This sparsity in Lasso makes it suitable for feature selection. However, in the specific application of predicting age based on the volumes of 86 brain structures, Ridge regression outperformed Lasso.
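The difference in sparsity can be reproduced on a small synthetic example. The data, the fixed `alpha` values and the `summary` dictionary below are assumptions for illustration; they are not the settings used for the plot above.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): 30 features, only 4 are informative
rng = np.random.default_rng(2)
n_samples, n_features = 60, 30
X = StandardScaler().fit_transform(rng.normal(size=(n_samples, n_features)))
true_w = np.zeros(n_features)
true_w[:4] = [3.0, -2.0, 1.5, 1.0]
y = X @ true_w + rng.normal(scale=1.0, size=n_samples)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=10.0),   # alpha values picked by hand for the demo
    "lasso": Lasso(alpha=0.2),
}

summary = {}
for name, model in models.items():
    model.fit(X, y)
    w = model.coef_
    summary[name] = {
        "l2_norm": float(np.linalg.norm(w)),   # overall weight magnitude
        "n_zero": int(np.sum(w == 0)),         # weights that are exactly zero
    }
    print(name, summary[name])
```

Ridge shrinks the weights but leaves all of them non-zero, while Lasso sets some of them exactly to zero, which is what makes it usable for feature selection.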

## Non-linear Regression

Let’s now have a look at how we can formulate a non-linear regression model. In machine learning it is common to apply a non-linear _feature transformation_ followed by a multivariate linear regression model. We have already seen this when we introduced polynomial regression. In general terms, we denote the non-linear feature transformation by $\phi$ (_phi_), and the prediction model then becomes:

$$\hat{y} = \phi(x)^T \textbf{w}$$

For example, if we use a polynomial regression model of the second degree, then:

Feature transformation for 1D feature vector x:

$$\phi(x) = (1, x, x^2)^T$$

Prediction model:

$$\hat{y} = \phi(x)^T \textbf{w} = w_0 + x w_1 + x^2 w_2$$

Note: Feature transformation usually increases the dimension of the feature vector. In univariate polynomial regression of degree $M$, the 1D feature $x$ is transformed into an $(M + 1)$-dimensional vector $(1, x, x^2, ..., x^M)^T$.
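As a small sketch of this transformation, the helper below (a hypothetical function named `polynomial_features`) builds one row $\phi(x_i)^T$ per sample and evaluates the second-degree prediction model with made-up weights.

```python
import numpy as np

def polynomial_features(x, degree):
    """Map a 1D array of samples x to rows (1, x, x^2, ..., x^degree)."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, N=degree + 1, increasing=True)

x = np.array([0.0, 1.0, 2.0, 3.0])
Phi = polynomial_features(x, degree=2)   # shape (4, 3): one row phi(x_i)^T per sample

w = np.array([1.0, -2.0, 0.5])           # hypothetical weights (w_0, w_1, w_2)
y_hat = Phi @ w                          # w_0 + x w_1 + x^2 w_2 for every sample

print(Phi)
print(y_hat)
```

The model is still linear in the weights $\textbf{w}$; only the features have been transformed, which is why the linear regression machinery carries over unchanged.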

Loss function for non-linear regression:

$$F(\textbf{w}) = \frac{1}{2} \sum_{i}(y_i - \phi(\textbf{x}_i)^T \textbf{w})^2$$

Non-linear ridge regression (to avoid overfitting)

$$F(\textbf{w}) = \frac{1}{2} \sum_{i}(y_i - \phi(\textbf{x}_i)^T \textbf{w})^2 + \frac{\lambda}{2}\textbf{w}^T\textbf{w}$$

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/PolynomialRidgeRegression2.png" width = "700">

The loss function then becomes a sum of squared errors between the expected target values $y_i$ and the target values predicted using the modified model. Because feature transformation tends to increase the number of features significantly, we include the ridge penalty in our non-linear regression model.

In this plot you can see how the ridge penalty can be useful for polynomial regression. In the plot on the left we have fitted a polynomial of 10th degree to predict age from cortical volume. We can see that the model slightly overfits the data and the CV $R^2$ is 0.88. If we add a ridge penalty with $\lambda = 0.25$, the overfitting is reduced and the CV $R^2$ increases to 0.9.
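A comparable experiment can be set up with a scikit-learn pipeline. This is a sketch on synthetic 1D data, not the cortical-volume dataset, so the scores will differ from those in the plot; the helper `poly_ridge` and the two penalty values are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic 1D data (hypothetical stand-in for volume vs. age)
rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, size=80)
y = 35 + 3 * x - 0.8 * x**2 + rng.normal(scale=1.0, size=80)
X = x.reshape(-1, 1)

def poly_ridge(degree, lam):
    # include_bias=False: the intercept is fitted (unpenalised) by Ridge itself
    return make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        StandardScaler(),
        Ridge(alpha=lam),
    )

# Compare a very weak penalty with lambda = 0.25 for a 10th-degree polynomial
scores = {}
for lam in [1e-3, 0.25]:
    cv_r2 = cross_val_score(poly_ridge(10, lam), X, y, cv=5, scoring="r2")
    scores[lam] = cv_r2.mean()
    print(f"lambda={lam}: mean CV R2 = {scores[lam]:.3f}")
```

Scaling the polynomial features before the ridge step matters here, because $x^{10}$ and $x$ live on very different ranges and would otherwise be penalised very unevenly.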

How does polynomial ridge regression change with increasing number of features?

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/PolynomialRidgeRegressionPerformance.png" width = "700">


### Kernel trick

The kernel trick helps us to design more versatile non-linear regression models.

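We define a **dual representation** $a$ of our model parameter vector $\textbf{w}$. Here $\Phi$ denotes the matrix whose $i$-th row is the transformed feature vector $\phi^T(x_i)$ of sample $i$:

$$\textbf{w} = \Phi^T a = \sum^N_{i=1} \phi(x_i) a_i$$

The prediction model then becomes

$$\hat{y} = \phi^T(x) \textbf{w} = \phi^T(x) \Phi^T a = \sum^N_{i=1} \phi^T(x) \phi(x_i) a_i$$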

We now define a **kernel** $\kappa$

$$\kappa(x, x_i) = \phi^T(x)\phi(x_i)$$

The kernel represents a similarity between two feature vectors.

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/GaussianKernel.png" width = "300"  align="right">

For example, we can use a **Gaussian kernel**

$$\kappa(x, x_i) = e^{-\frac{\lVert x - x_i \rVert_2^2}{2\sigma^2}}$$

$$\lVert x - x_i \rVert_2^2 = \sum_k(x_k- x_{ik})^2$$

Note:
- The original parameter vector $\textbf{w}$ has $D$ elements (one per feature)
- The dual parameter vector $a$ has $N$ elements (one per each sample)

The resulting **dual prediction model** is a linear combination of kernels placed around the feature vectors $x_i$

$$\hat{y} = \sum^N_{i=1} \kappa(x, x_i) a_i$$

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/DualPredictionModel.png" width = "700">

In this plot we can see three samples with feature vectors $x_1$, $x_2$ and $x_3$. The Gaussian kernels are placed around them, multiplied by the coefficients $a_i$ and then summed to produce the prediction model. We can also see that wider kernels (larger $\sigma$) produce smoother models.
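The dual prediction model can be sketched in NumPy as follows. Several things here are assumptions for illustration: the three training samples are made up, the kernel is written with the common $2\sigma^2$ denominator, and the dual coefficients are obtained with the kernel ridge regression solution $a = (K + \lambda I)^{-1} y$, which has not been derived in this notebook.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Kernel matrix with entries exp(-||a - b||^2 / (2 sigma^2))."""
    sq_dist = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=2)
    return np.exp(-sq_dist / (2.0 * sigma**2))

# Three 1D training samples, as in the plot (values are hypothetical)
X_train = np.array([[-1.0], [0.0], [2.0]])
y_train = np.array([1.0, 2.0, 0.5])

sigma, lam = 1.0, 1e-3
K = gaussian_kernel(X_train, X_train, sigma)   # N x N matrix of similarities

# ASSUMPTION: dual coefficients from kernel ridge regression, a = (K + lam I)^-1 y
a = np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)

# Dual prediction: y_hat(x) = sum_i kappa(x, x_i) a_i
X_new = np.linspace(-2, 3, 6).reshape(-1, 1)
y_new = gaussian_kernel(X_new, X_train, sigma) @ a
print(np.round(y_new, 3))
```

Note that the transformation $\phi$ never appears in the code: only kernel evaluations between pairs of samples are needed, and there is one coefficient $a_i$ per training sample.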

We will see more details on non-linear regression in Notebook 4.4.