<a href="https://colab.research.google.com/github/MaralAminpour/ML-BME-Course-UofA-Fall-2023/blob/main/Week-4-Regression-models/4.4-Nonlinear-regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Non-linear regression

Linear regression assumes a linear relationship between the input variables (or features) and the output variable. However, in many real-world scenarios, this assumption does not hold, and the relationship between input and output might be non-linear. For these cases, non-linear regression techniques are more suitable.

### Polynomial Ridge regression

One way to capture non-linear relationships is to transform the original features into a higher-degree polynomial space. The transformation $\boldsymbol{\phi}(\mathbf{x})$ maps the original feature vector $\mathbf{x}$ to a polynomial feature space, so instead of predicting $y$ directly from $\mathbf{x}$, we predict it from the transformed features $\boldsymbol{\phi}(\mathbf{x})$.

Non-linear regression can be achieved by combining a non-linear feature transformation $\boldsymbol{\phi}(\mathbf{x})$ with linear regression models. The prediction model will become $y=\boldsymbol{\phi}(\mathbf{x})^T\mathbf{w}$. An example is a polynomial feature transformation $\boldsymbol{\phi}(x)=(1,x,...,x^M)$.

Now, when Ridge regularization is applied to this polynomial regression, it becomes Non-linear Ridge regression.

The **Non-linear Ridge regression** is obtained by minimising the loss

$$ F(\mathbf{w})=\frac{1}{2}\sum_{i=1}^N(y_i-\boldsymbol{\phi}(\mathbf{x}_i)^T\mathbf{w})^2+\frac{\lambda}{2} \mathbf{w}^T\mathbf{w}$$

The loss function $F(w)$ consists of two terms: the usual least squares error and a penalty term on the coefficients $w$ to prevent overfitting.
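As a minimal sketch of how this loss can be minimised, the NumPy code below uses the closed-form solution $\mathbf{w}=(\boldsymbol{\Phi}^T\boldsymbol{\Phi}+\lambda\mathbf{I})^{-1}\boldsymbol{\Phi}^T\mathbf{y}$ for a univariate polynomial transformation. The function names and the toy data are illustrative assumptions, and, as in the formula above, every weight (including the constant term) is penalised.

```python
import numpy as np

def poly_features(x, M):
    """Map a 1-D array x to the polynomial features (1, x, ..., x^M)."""
    return np.vander(x, M + 1, increasing=True)

def fit_poly_ridge(x, y, M, lam):
    """Minimise 0.5 * sum_i (y_i - phi(x_i)^T w)^2 + 0.5 * lam * w^T w in closed form."""
    Phi = poly_features(x, M)
    A = Phi.T @ Phi + lam * np.eye(M + 1)
    return np.linalg.solve(A, Phi.T @ y)

# Toy data, for illustration only
x_toy = np.array([-1.0, 0.0, 1.0])
y_toy = np.array([2.0, 3.0, 2.0])

w_unreg = fit_poly_ridge(x_toy, y_toy, M=2, lam=0.0)  # exact quadratic through the points
w_reg = fit_poly_ridge(x_toy, y_toy, M=2, lam=1.0)    # the penalty shrinks the weights

print("weights without penalty:", np.round(w_unreg, 3))
print("weights with lam = 1:   ", np.round(w_reg, 3))
```

With three points and a quadratic, the unpenalised fit interpolates the data exactly; increasing `lam` trades that exact fit for smaller weights.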

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/KRRsigma.gif" width = "450" style="float: right;">

### Kernel Ridge Regression

Kernel Ridge Regression (KRR) provides another approach to capture non-linear patterns. **Instead of explicitly transforming the features to a higher-dimensional space, KRR employs a kernel function $κ$ that computes the similarity between data points.**

The regression model can alternatively be defined through a **dual representation** with coefficients $a_i$, one per training sample:

$$\hat{y}=\sum_{i=1}^N\kappa(\mathbf{x},\mathbf{x}_i)a_i$$

where $\kappa(\mathbf{x},\mathbf{x}_i)$ is a kernel measuring similarity between samples $\mathbf{x}$ and $\mathbf{x}_i$, such as the Gaussian kernel.

(**NOTE:** Most optimization problems have a **primal** form, which is the original formulation of the problem. The **dual** form of an optimization problem, on the other hand, is derived from the primal problem using a set of **mathematical transformations**, often involving Lagrange multipliers. The dual problem provides an alternative way to look at the original problem. More about this in the section on dual representation.)

Predictions using Kernel Ridge regression are evaluated as

$$\hat{y}=\mathbf{k(x)}^T(\mathbf{K}+\lambda\mathbf{I})^{-1}\mathbf{y}$$

where $\mathbf{k(x)}=(\kappa(\mathbf{x},\mathbf{x}_1),...,\kappa(\mathbf{x},\mathbf{x}_N))^T$ is a vector of similarities of the training samples with the new sample $\mathbf{x}$ and

$$\mathbf{K}=\begin{pmatrix}\kappa(\mathbf{x}_1,\mathbf{x}_1)&\dots&\kappa(\mathbf{x}_1,\mathbf{x}_N)\\\vdots& \ddots & \vdots\\ \kappa(\mathbf{x}_N,\mathbf{x}_1)&\dots&\kappa(\mathbf{x}_N,\mathbf{x}_N) \end{pmatrix}$$

is the matrix of pair-wise similarities between training samples. A Kernel Ridge regression model fitted to three datapoints with Gaussian kernel with increasing value of $\sigma$ is shown on the right.
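The prediction formula above can be sketched directly in NumPy. This is an illustrative implementation, not the one used by any library; the helper names `gaussian_kernel`, `krr_fit` and `krr_predict` and the toy data are assumptions made for the example.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix between the rows of A (n, d) and the rows of B (m, d)."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def krr_fit(X, y, sigma, lam):
    """Dual coefficients a = (K + lam * I)^{-1} y."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_new, X, a, sigma):
    """Predictions y_hat = k(x)^T a for every row x of X_new."""
    return gaussian_kernel(X_new, X, sigma) @ a

# Toy data, for illustration only
X_train = np.array([[-1.0], [0.0], [1.0]])
y_train = np.array([2.0, 3.0, 2.0])

a = krr_fit(X_train, y_train, sigma=1.0, lam=0.02)
y_hat = krr_predict(np.array([[0.5]]), X_train, a, sigma=1.0)
print("prediction at x = 0.5:", np.round(y_hat, 3))
```

Solving the linear system with `np.linalg.solve` avoids forming the matrix inverse explicitly, which is usually the numerically safer choice.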

In this notebook we will demonstrate fitting of **Polynomial Ridge** and **Kernel Ridge Regression** to predict the GA from a single feature (volume of cortex) and from volumes of 6 brain structures.

### Dual Representation (Dual, Primal)

Want to know more about it? Here it is in simple words:

The term "dual representation" often arises in the context of **optimization problems**, particularly in machine learning and support vector machines. To explain it, let's first establish a foundation:

Most optimization problems have a **primal** form, which is the original formulation of the problem. For instance, in linear regression, the primal problem might be to minimize the squared error of predictions on a dataset.

The **dual** form of an optimization problem, on the other hand, **is derived from the primal problem using a set of mathematical transformations**, often involving Lagrange multipliers. **The dual problem provides an alternative way to look at the original problem**.

There are several reasons and advantages to considering the dual representation of a problem:

1. **Computational Benefits**: Sometimes, especially in the context of Support Vector Machines (SVM), solving the dual problem can be more computationally efficient than solving the primal, especially when dealing with **large feature spaces or when using kernel tricks**.
   
2. **Insights and Intuitions**: The dual form can sometimes offer different insights or interpretations about the problem. For example, in SVMs, the dual form provides insight into the importance of individual data points (support vectors) in defining the decision boundary.

3. **Kernelization**: The dual representation allows for the introduction of kernels in algorithms like SVM. This is particularly useful when **the data isn't linearly separable in its original space**. By using a kernel, **the data can implicitly be mapped to a higher-dimensional space where it might be linearly separable, without the need to explicitly compute the transformation.**

As for Kernel Ridge Regression, the dual representation re-formulates the problem in terms of the kernel (or similarity) between data points, rather than in terms of the data points (or features) themselves. This is especially advantageous when working with non-linear data, as it allows for the implicit mapping of data into higher-dimensional spaces using kernels, facilitating the capture of complex patterns in the data.

**How does the Gaussian kernel represent similarity between two data points?**

The Gaussian kernel, also known as the Radial Basis Function (RBF) kernel, is a popular method used in machine learning to measure the similarity between two data points or features. Here's how it achieves this:

### Gaussian (RBF) Kernel Formula:

Given two data points $x$ and $x'$, the Gaussian kernel is defined as:

$$ K(x, x') = \exp\left(-\frac{||x - x'||^2}{2\sigma^2}\right) $$

Where:
- $||x - x'||^2$ is the squared Euclidean distance between the two points.
- $\sigma$ is a parameter that determines the width of the Gaussian function.

### Intuition:

1. **Closeness in Original Space**: If $x$ and $x'$ are close to each other (i.e., the Euclidean distance between them is small), the value inside the exponential function will be close to zero. Thus, $K(x, x')$ will be close to 1, indicating high similarity.

2. **Distant in Original Space**: Conversely, if $x$ and $x'$ are far apart, the value inside the exponential will be a large negative number. This makes $K(x, x')$ close to 0, indicating low similarity.

3. **Role of $\sigma$**: The parameter $\sigma$ determines the sensitivity of the kernel. A small $\sigma$ will make the kernel's value decay rapidly (i.e., the similarity drops off quickly as the distance between points increases), making the kernel sensitive to distances between points. A larger $\sigma$ makes the decay slower, meaning points that are further apart can still be considered somewhat similar.

In essence, the Gaussian kernel translates the concept of "distance" into a measure of similarity. Points close in the input space yield high kernel values (indicating strong similarity), while distant points yield low kernel values (indicating weak similarity). This characteristic allows machine learning algorithms, especially those like Support Vector Machines, to capture intricate, non-linear patterns in the data when used in combination with the Gaussian kernel.
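To make the intuition concrete, here is a small sketch that evaluates the Gaussian kernel for a pair of nearby points and a pair of distant points at several kernel widths. The function name `gaussian_similarity` and the chosen points are assumptions made for illustration.

```python
import numpy as np

def gaussian_similarity(x, x_prime, sigma=1.0):
    """Gaussian (RBF) kernel value exp(-||x - x'||^2 / (2 sigma^2))."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    x_prime = np.atleast_1d(np.asarray(x_prime, dtype=float))
    return float(np.exp(-np.sum((x - x_prime) ** 2) / (2.0 * sigma ** 2)))

for sigma in (0.5, 1.0, 5.0):
    close = gaussian_similarity(0.0, 0.1, sigma)  # points 0.1 apart
    far = gaussian_similarity(0.0, 3.0, sigma)    # points 3.0 apart
    print(f"sigma={sigma}: close pair -> {close:.3f}, distant pair -> {far:.3f}")
```

The nearby pair stays close to 1 for every width, while the similarity of the distant pair depends strongly on $\sigma$: it is almost 0 for a narrow kernel and still large for a wide one.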



---


### The standard deviation of the Gaussian kernel

The standard deviation of the Gaussian kernel is denoted as $\sigma $.

In the context of the Gaussian (or RBF) kernel, $ \sigma $ plays a critical role. The formula for the kernel is:


$$ K(x, x') = \exp\left(-\frac{||x - x'||^2}{2\sigma^2}\right) $$

Here, **$ \sigma $ determines the width of the Gaussian function**.

- A smaller $ \sigma $ means the kernel function will be **narrow**, making the **model sensitive to individual data points** and resulting in a more flexible model.

- Conversely, a larger $ \sigma $ gives a **broader** kernel function, leading to a **smoother and more general** model.

This $ \sigma $ in the kernel function is similar to the standard deviation in a Gaussian or normal distribution, which determines **the spread or width of the distribution**.

### Example: Kernel ridge regression

In Kernel Ridge Regression, the idea is to extend the capabilities of Ridge Regression by using a kernel to implicitly map the input data to a higher-dimensional feature space. The example below shows how such a model behaves when fitted to a small dataset.

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/KRRsigma.gif" width = "450" style="float: right;">

In this movie, we demonstrate a Kernel Ridge Regression model that's been trained on just three samples, while varying the standard deviation ($\sigma$) of the Gaussian kernel. Here's a breakdown of what you'll notice:

- **Small $\sigma$:** With a tiny standard deviation in the Gaussian kernel, the similarity between different points becomes highly localized. As a result, the model's predictions between samples drop to zero. Essentially, the model becomes too "narrow" in its focus, clinging too closely to the training samples. This is classic overfitting: the model is too tuned to the training data and isn't generalizing well to unseen data.

- **Large $\sigma$:** On the flip side, as we crank up the standard deviation, the Gaussian kernel starts treating disparate points as if they're quite similar, effectively smoothing out the differences between them. This leads to a model that's almost too "easygoing," predicting similar outputs for a wide range of inputs. Here, the risk is underfitting—our model becomes too simple to capture the underlying complexity of the data.

So, as you'll see in the movie, kernel size matters—a lot. Too small, and your model obsesses over the training data. Too large, and your model becomes a laid-back generalist, failing to pick up the nuances in the data. Finding the "Goldilocks zone" for $\sigma$ (and the regularization parameter $\lambda$) is crucial for a model that's just right.

## Details of the example

Let's go over the details of the example.


### Given Data

- Feature matrix $X = [-1, 0, 1]^T$

- Target vector $y = [2, 3, 2]^T$

### Hyperparameters

- Standard deviation of the Gaussian kernel, denoted as $\sigma$

- Regularization parameter for Ridge penalty, denoted as $\lambda$

### Procedure

1. Use a Gaussian kernel to compute the similarity between points. The Gaussian kernel is defined as:

   $$
   K(x_i, x_j) = e^{-\frac{(x_i - x_j)^2}{2\sigma^2}}
   $$
   
2. Apply Ridge Regression in the dual (kernel) representation. With $K$ the Gram matrix of the training samples and $a$ the vector of dual coefficients, the optimization problem becomes:

   $$
   \min_a \frac{1}{2} ||Ka - y||^2 + \frac{\lambda}{2} a^T K a
   $$

### Observations

- **When $\sigma$ is small and $\lambda = 0.02$:** The model tends to make some predictions close to zero. This is likely because the kernel values are too sensitive to differences between points, leading to a less general model.
  
- **When $\sigma$ is large:** The model ends up underfitting the data. In this case, the kernel values become more uniform, making it hard for the model to capture any complex patterns in the data.

So, you'll want to select $\sigma$ and $\lambda$ carefully. Typically, you would use something like cross-validation to select these hyperparameters. This way, you could find a good trade-off between fitting the training data well while also generalizing to new, unseen data.
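The observations above can be reproduced with a short sketch using scikit-learn's `KernelRidge` on the three data points. The values of $\sigma$ tried here and the test point $x=0.5$ are assumptions chosen for illustration; the conversion `gamma = 1 / (2 * sigma**2)` follows from scikit-learn defining the `rbf` kernel as $\exp(-\gamma\,||x-x'||^2)$.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

X_toy = np.array([[-1.0], [0.0], [1.0]])
y_toy = np.array([2.0, 3.0, 2.0])
x_mid = np.array([[0.5]])  # a point between two training samples

lam = 0.02
preds_mid, preds_train = {}, {}
for sigma in (0.1, 1.0, 10.0):
    # scikit-learn's rbf kernel is exp(-gamma * ||x - x'||^2), so gamma = 1 / (2 sigma^2)
    model = KernelRidge(kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2), alpha=lam)
    model.fit(X_toy, y_toy)
    preds_train[sigma] = model.predict(X_toy)
    preds_mid[sigma] = float(model.predict(x_mid)[0])
    print(f"sigma={sigma}: training predictions={np.round(preds_train[sigma], 2)}, "
          f"prediction at 0.5={preds_mid[sigma]:.2f}")
```

For the smallest $\sigma$ the prediction between samples collapses to roughly zero even though the training points are fitted closely, while for the largest $\sigma$ the three training predictions become nearly identical.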



### Example: Change Ridge regularization parameter ($\lambda$)

In Kernel Ridge Regression, both the standard deviation of the Gaussian kernel ($\sigma$) and the Ridge regularization parameter ($\lambda$) play pivotal roles in model performance. In the illustration of these concepts, here's what to look out for as we adjust $\lambda$ while keeping $\sigma$ constant at 1:

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-4-Regression-models/imgs/landa.png" width = "450" style="float: right;">


- **Large $\lambda$**: With a high regularization parameter, you'll notice the model's predictions gradually approaching zero. This isn't just a coincidence. When $\lambda$ is large, the Ridge penalty becomes dominant, and the model essentially aims to make the weights as small as possible to minimize the loss function.

- **Kernel Ridge penalizes the intercept**: Unlike linear `Ridge`, where the intercept term is left unpenalised, Kernel Ridge Regression has no separate unpenalised intercept, so the regularization term acts on the whole model. This is significant because it pulls the predictions themselves, not only the slopes, toward zero when $\lambda$ is large.

So, the challenge here is to strike the right balance. You want to choose a $\lambda$ that's neither too large (which would drive your predictions toward zero) nor too small (which would risk overfitting). It's another layer of complexity to consider as you try to optimize your model's performance.
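A quick way to see this shrinkage is to refit the model for several values of $\lambda$ and look at the size of the predictions. The sketch below assumes the same three-point toy data as the earlier example and fixes $\sigma=1$ (that is, `gamma=0.5`); the grid of $\lambda$ values is an arbitrary choice.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

X_toy = np.array([[-1.0], [0.0], [1.0]])
y_toy = np.array([2.0, 3.0, 2.0])

lambdas = (1e-2, 1.0, 1e2, 1e4)
max_abs_pred = {}
for lam in lambdas:
    # sigma = 1 corresponds to gamma = 1 / (2 * 1**2) = 0.5
    model = KernelRidge(kernel="rbf", gamma=0.5, alpha=lam).fit(X_toy, y_toy)
    max_abs_pred[lam] = float(np.abs(model.predict(X_toy)).max())
    print(f"lambda={lam:g}: largest |prediction| on the training points = {max_abs_pred[lam]:.4f}")
```

Because there is no unpenalised intercept to fall back on, the predictions shrink toward zero rather than toward the mean of the targets.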



## Kernel Ridge Regression and Gram matrices

**Kernel Ridge Regression** combines the ridge regression (linear least squares with L2-norm regularization) with the kernel trick. This union allows us to tackle the problem in the feature space without ever computing the coordinates of the data in that space, but rather by simply computing the inner products between the images of all pairs of data in the feature space. This trick is encapsulated by the kernel function, and a variety of kernel functions can be used.

### Dual Representation in Kernel Ridge Regression:

When it comes to regression, we usually aim to find weights for each feature in our dataset, represented as the vector $w$. However, with the **dual representation**, we express the problem in terms of the training data itself. Instead of a weight for each feature, we get a coefficient (or weight) for each data point in our training set. This set of coefficients is often referred to as $a$.

The transformation to a dual representation is particularly beneficial when combined with the kernel trick. This is because the kernel function lets us compute similarities between data points without needing to deal directly with the potentially high-dimensional transformed features.

### Gram Matrix:

The Gram matrix, denoted as $K$, is central to the dual representation in kernel methods. Each element $K_{ij}$ of the Gram matrix represents the kernel function applied to the $i^{th}$ and $j^{th}$ data points. It essentially captures the similarity (via the chosen kernel) between each pair of training samples.

### Analytical Solution:

The kernel ridge regression problem in its dual form can be expressed as an optimization problem where we aim to minimize a combination of the fit to the data and a regularization term. The solution to this problem provides us the optimal dual coefficients $a$.

Given the formulation of the problem, we can derive an analytical solution for $a$, which can then be used to make predictions on new data points using the kernel function and the training data.
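As a sketch of that derivation, assume the dual objective is written with the Gram matrix $\mathbf{K}$ in both the data-fit and the penalty term, and that $\mathbf{K}$ is symmetric and invertible:

$$ F(\mathbf{a})=\frac{1}{2}\|\mathbf{y}-\mathbf{K}\mathbf{a}\|^2+\frac{\lambda}{2}\mathbf{a}^T\mathbf{K}\mathbf{a} $$

Setting the gradient with respect to $\mathbf{a}$ to zero gives

$$ \nabla_{\mathbf{a}}F=-\mathbf{K}(\mathbf{y}-\mathbf{K}\mathbf{a})+\lambda\mathbf{K}\mathbf{a}=\mathbf{K}\left((\mathbf{K}+\lambda\mathbf{I})\mathbf{a}-\mathbf{y}\right)=\mathbf{0} $$

and therefore

$$ \mathbf{a}=(\mathbf{K}+\lambda\mathbf{I})^{-1}\mathbf{y},\qquad \hat{y}=\mathbf{k(x)}^T\mathbf{a}=\mathbf{k(x)}^T(\mathbf{K}+\lambda\mathbf{I})^{-1}\mathbf{y} $$

which matches the prediction formula given at the start of this notebook.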


In summary, Kernel Ridge Regression offers an elegant and computationally efficient approach to non-linear regression, leveraging the power of the kernel trick and the duality principle. By working in the dual space, we can handle high-dimensional transformations with ease, and the Gram matrix acts as a compact representation of all pairwise similarities in the dataset.

## Example: Linear Kernel

The Linear Kernel plays a special role in the context of Kernel Ridge Regression. The feature transformation for a linear kernel is given by:

$$
\phi(x) = [1, x^T]^T
$$

When you calculate the linear kernel $ \kappa(x, x') $ using this transformation, it turns out to be:

$$
\kappa(x, x') = \phi(x)^T \phi(x') = 1 + x^T x'
$$

The interesting part is that applying Kernel Ridge Regression with this particular linear kernel gives the same predictions as Linear Ridge Regression on the features $[1, x^T]$, with the intercept penalised along with the other weights. Kernel Ridge Regression with a linear kernel is therefore simply the dual form of Linear Ridge Regression, which offers a bridge between the two methods.

So, if the relationship in your data is approximately linear or you want to simplify your model, utilizing the linear kernel can be a solid strategy. It allows you to take advantage of Ridge regularization while sticking to the basics of linear regression.

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-4-Regression-models/imgs/kernel1.png" width = "200" style="float: left;">


<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-4-Regression-models/imgs/kernel2.png" width = "200" style="float: center;">


<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-4-Regression-models/imgs/kernel3.png" width = "200" style="float: right;">


If the feature transformation $ \phi(x) $ is $ [1, x] $, then you can easily compute the kernel $ \kappa(x, x') $ by taking the dot product of $ \phi(x) $ and $ \phi(x') $. In mathematical terms:

$$
\kappa(x, x') = \phi(x)^T \phi(x') = [1, x^T] [1, x']^T = 1 + x^T x' = 1 + x \cdot x'
$$

By using this linear kernel in the framework of Kernel Ridge Regression, you effectively end up performing Linear Ridge Regression. It's an elegant way to understand how kernel methods can generalize linear methods!

## Example: Polynomial Kernel

When using a polynomial feature transformation of the second degree, $ \phi(x) $ becomes $ [1, x, x^2] $. This allows us to calculate the quadratic kernel $ \kappa(x, x') $ between two vectors $ x $ and $ x' $ as follows:

$$
\kappa(x, x') = \phi(x)^T \phi(x') = [1, x, x^2] [1, x', x'^2]^T = 1 + x x' + (x^2) (x'^2)
$$

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-4-Regression-models/imgs/3_Regression_lecture_notes.jpg" width = "400" style="float: left;">



In this case, the Kernel Ridge Regression using this polynomial kernel effectively translates into Polynomial Ridge Regression. This showcases the flexibility and extensibility of using kernel methods to model various types of relationships in the data.


To summarise the same line of thinking: for a second-degree polynomial feature transformation, the scalar feature $x$ is mapped to the vector $[1, x, x^2]$. Consequently, the polynomial kernel of the second degree can be expressed as:

$$
\kappa(x, x') = 1 + x \times x' + x^2 \times (x')^2
$$

In this case, using this polynomial kernel in Kernel Ridge Regression is essentially the same as executing Polynomial Ridge Regression. This just further demonstrates the versatility of kernel methods in modeling different types of data relationships.
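The equivalence claimed for the linear and quadratic kernels can be checked numerically. The sketch below compares the primal solution (weights $w$ on the explicit features) with the dual solution (coefficients $a$ on the kernel $\kappa(x,x')=\phi(x)^T\phi(x')$) for $M=1$ and $M=2$. The toy data, the value of $\lambda$ and the helper name `phi_poly` are assumptions made for illustration, and the constant feature is penalised in both formulations.

```python
import numpy as np

def phi_poly(x, M):
    """Explicit polynomial features (1, x, ..., x^M) for a 1-D array x."""
    return np.vander(x, M + 1, increasing=True)

# Toy data, for illustration only
x = np.array([-1.0, 0.0, 1.0, 2.0])
y = np.array([2.0, 3.0, 2.0, 0.5])
x_new = np.array([-0.5, 0.5, 1.5])
lam = 0.1

max_gap = {}
for M in (1, 2):  # M=1: linear kernel, M=2: quadratic kernel
    Phi, Phi_new = phi_poly(x, M), phi_poly(x_new, M)

    # Primal: ridge regression on the explicit features
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ y)
    y_primal = Phi_new @ w

    # Dual: kernel ridge regression with kappa(x, x') = phi(x)^T phi(x')
    K = Phi @ Phi.T
    a = np.linalg.solve(K + lam * np.eye(len(x)), y)
    y_dual = (Phi_new @ Phi.T) @ a

    max_gap[M] = float(np.abs(y_primal - y_dual).max())
    print(f"M={M}: largest difference between primal and dual predictions = {max_gap[M]:.2e}")
```

The two sets of predictions agree up to floating-point error, which is the sense in which Kernel Ridge with these kernels reproduces Linear and Polynomial Ridge.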

## Kernel Ridge Regression

The prediction model, as we know, can be expressed as a linear combination of kernels. In matrix notation, this can be represented as $\mathbf{k}(\mathbf{x})^T\mathbf{a}$, where $\mathbf{k}(\mathbf{x})$ is a vector that stores the similarities between a new feature vector and all feature vectors in the training set.

To evaluate the model on a new sample, it's necessary to calculate these similarities between the new sample and all samples in the training set. While this makes the model highly adaptable and flexible, it also comes with a downside: if the number of samples is large, evaluating the model could become computationally expensive and slow. In such scenarios, it might be more efficient to revert to the original parameterization using $w$.

**Note:**

- **Evaluation Method:** To compute the model equation, we find the similarities between a new sample and every sample in the training set. This makes the model non-parametric.

- **Advantage:** The model is highly flexible and can adapt well to the data. This adaptability makes it well-suited for complex and nuanced data patterns.

- **Disadvantage:** The model's computational efficiency takes a hit when the number of samples, $N$, is much larger than the number of features, $D$. In such cases, the model becomes slower to evaluate compared to traditional models that are parameterized by $w$.

### Kernel Ridge Regression in Scikit learn

When it comes to using Scikit-learn for Kernel Ridge Regression, here's what you need to know:

The Kernel Ridge Regression is implemented in the KernelRidge object.

The kernel parameter specifies the type of kernel you want to use (e.g., linear, polynomial, Gaussian, etc.)

The **gamma parameter** sets the (inverse) width of the kernel, defined as $\gamma = \frac{1}{2\sigma^2}$.

The **alpha parameter** controls the strength of the Ridge penalty.

In summary, Scikit-learn provides a convenient and efficient way to work with Kernel Ridge Regression, giving you the flexibility to customize it according to your needs.

Kernel Ridge Regression in Scikit-learn is implemented in the `KernelRidge` object. We set the type of kernel through the parameter `kernel`; `rbf` stands for radial basis function and is implemented by a Gaussian kernel.

The parameter `gamma` sets the inverse width of the kernel, so a small `gamma` means a wide kernel and vice versa. The parameter `alpha` sets the strength of the ridge penalty. The model can be tuned using `GridSearchCV` as usual.
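As a hedged sanity check of how `gamma` and `alpha` map onto $\sigma$ and $\lambda$, the sketch below fits `KernelRidge` on synthetic data and compares it with the closed-form dual solution computed by hand. The synthetic data and the chosen values of `sigma` and `lam` are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics.pairwise import rbf_kernel

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 1))
y_demo = np.sin(2.0 * X_demo[:, 0])

sigma, lam = 0.5, 0.1
gamma = 1.0 / (2.0 * sigma ** 2)  # rbf kernel in scikit-learn: exp(-gamma * ||x - x'||^2)

model = KernelRidge(kernel="rbf", gamma=gamma, alpha=lam).fit(X_demo, y_demo)

# Manual dual solution a = (K + lam * I)^{-1} y and predictions k(x)^T a
K = rbf_kernel(X_demo, X_demo, gamma=gamma)
a_manual = np.linalg.solve(K + lam * np.eye(len(X_demo)), y_demo)
X_new = np.array([[0.0], [1.0]])
y_manual = rbf_kernel(X_new, X_demo, gamma=gamma) @ a_manual

print("dual coefficients match:", np.allclose(model.dual_coef_, a_manual))
print("predictions match:      ", np.allclose(model.predict(X_new), y_manual))
```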

### Load data

The code below imports the essential libraries and loads the dataset with 6 brain volumes and creates feature matrix `X` and target vector `y`. Run the code.

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
%matplotlib inline
import requests

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.kernel_ridge import KernelRidge

In [None]:
# This code will download the required data files from GitHub
def download_data(source, dest):
    base_url = 'https://raw.githubusercontent.com'  # no trailing slash: the format string adds it
    owner = 'SirTurtle'
    repo = 'ML-BME-UofA-data'
    branch = 'main'
    url = '{}/{}/{}/{}/{}'.format(base_url, owner, repo, branch, source)
    r = requests.get(url)
    r.raise_for_status()  # stop early if the download failed
    with open(dest, 'wb') as f:
        f.write(r.content)

# Create the temp directory, if it doesn't already exist
import os
os.makedirs('temp', exist_ok=True)

download_data('Week-4-Regression-models/data/GA-brain-volumes-1-feature.csv', 'temp/GA-brain-volumes-1-feature.csv')
download_data('Week-4-Regression-models/data/GA-brain-volumes-6-features.csv', 'temp/GA-brain-volumes-6-features.csv')

In [None]:
def CreateFeaturesTargets(filename):

    df = pd.read_csv(filename,header=None)

    # convert from 'DataFrame' to numpy array
    data = df.values

    # Features are in columns one to end
    X = data[:,1:]

    # Scale features
    X = StandardScaler().fit_transform(X)

    # Labels are in the column zero
    y = data[:,0]

    # return Features and Labels
    return X, y

X,y = CreateFeaturesTargets('temp/GA-brain-volumes-6-features.csv')

print('Number of samples is', X.shape[0])
print('Number of features is', X.shape[1])

## Univariate non-linear regression

We will explore univariate non-linear regression to understand the behaviour of Polynomial Ridge and Gaussian Kernel Ridge regression models. We predict age at scan from the volume of cortex, the first feature in the dataset of six brain tissue volumes.

### Create univariate dataset

First we extract the cortical volumes. Run the code below.

In [None]:
# Extract volume of cortex
X_cortex = X[:,0].reshape(-1,1)

# Print dimensions
print('Number of samples is', X_cortex.shape[0])
print('Number of features is', X_cortex.shape[1])

# Plot the dataset
plt.scatter(X_cortex, y)
plt.xlabel('Cortical Volume')
plt.ylabel('Age at scan (weeks)')
plt.title('Univariate dataset')

## Univariate regression model

In the next cell you are given functions to calculate the cross-validated RMSE and to plot a univariate regression model. Look at them and run the cell; you will need these functions later.

In [None]:
def RMSE_CV(model,X,y):
    scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
    print('Average cross-validated RMSE: {} weeks '.format(round(np.sqrt(-scores.mean()),2)))

def PlotRegressionCurve(model, X, y):
    # Plot the data
    plt.scatter(X, y)
    plt.xlabel('Volume')
    plt.ylabel('GA')

    # Plot the model on an evenly spaced grid over the feature range
    x_grid = np.linspace(X.min(), X.max(), 101).reshape(-1, 1)
    y_grid = model.predict(x_grid)  # separate name so the data y is not overwritten
    plt.plot(x_grid, y_grid, 'r-')
    plt.ylim([25,46])

### Polynomial Ridge Regression

This code fits, plots and evaluates the polynomial ridge regression model. Note that:

* The model is a `Pipeline` of `PolynomialFeatures`, `StandardScaler` and `Ridge`.
* `StandardScaler` normalises the features after the polynomial transformation, which improves the performance of Ridge regression.
* We exclude the constant feature 1 from `PolynomialFeatures` (`include_bias=False`), because `Ridge` will create an intercept that is not penalised. This also improves the performance of the model.

__Questions:__ Play with the parameters `degree` and `alpha` to see the effect on the curve and performance. Answer the following questions:

* Can you find a setting with the lowest error?
* Set polynomial `degree` to 10 and `alpha` to zero. Is the model overfitted?
* Find the parameter `alpha` to reduce overfitting for `degree` 10.

__Answers:__



In [None]:
# Create model
model = Pipeline([  # steps are passed as a list of (name, estimator) pairs
    ("poly_features", PolynomialFeatures(degree=2, include_bias=False)),
    ("scaler", StandardScaler()),
    ("ridge", Ridge(alpha=0)),
])

# Fit model
model.fit(X_cortex,y)

# Evaluate model
RMSE_CV(model,X_cortex,y)

# Plot model
PlotRegressionCurve(model, X_cortex, y)

### Pipeline in Python

In Python, particularly in the context of the Scikit-learn library (often used for machine learning tasks), a `Pipeline` is a tool that **simplifies the process of building and evaluating data preprocessing steps and machine learning models together sequentially**. It ensures that the same sequence of preprocessing steps is applied **to both training and test/validation datasets**. This is crucial to avoid data leakage and other common pitfalls during model training and evaluation.

Here are some key points about pipelines:

1. **Sequential Workflow:** A pipeline bundles a sequence of data processing steps and modeling into a single scikit-learn estimator. Each step in the pipeline is represented as a tuple, where the first element is a name (a string), and the second element is an instance of an estimator (transformer or model).

2. **Avoid Data Leakage:** By ensuring that data transformations and preprocessing are encapsulated within the pipeline, it guarantees that the same procedures are applied to both training and testing data. This avoids common mistakes like fitting a scaler to test data.

3. **Convenience and Reproducibility:** Pipelines help in ensuring that the workflow is implemented consistently, aiding in reproducibility. This is especially useful when sharing code or deploying models.

4. **Compatibility with Other Scikit-learn Tools:** Pipelines are compatible with scikit-learn functions such as `cross_val_score`, `GridSearchCV`, and `RandomizedSearchCV`. This is especially handy when tuning hyperparameters, as the entire workflow, from data preprocessing to modeling, is considered during cross-validation.

5. **Custom Transformers:** Along with built-in transformers in scikit-learn, custom preprocessing steps can be included in a pipeline using the `FunctionTransformer` or by creating a custom transformer class that implements the `fit`, `transform`, and optionally `fit_transform` methods.
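As a small sketch of points 1, 4 and 5, the pipeline below wraps a NumPy function in a `FunctionTransformer`, chains it with a scaler and a Ridge model, and evaluates the whole workflow with `cross_val_score`. The synthetic data and the step names are assumptions made for illustration; they are not part of the brain-volume example.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data, for illustration only: the target depends on log(x)
rng = np.random.default_rng(0)
X_demo = rng.uniform(1.0, 10.0, size=(60, 1))
y_demo = 2.0 * np.log(X_demo[:, 0]) + rng.normal(scale=0.1, size=60)

pipe = Pipeline([
    ("log", FunctionTransformer(np.log)),  # custom preprocessing step
    ("scaler", StandardScaler()),          # re-fitted on each training fold
    ("ridge", Ridge(alpha=1.0)),
])

# The whole pipeline is cloned and fitted inside every cross-validation fold,
# so the scaler never sees the held-out samples
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="neg_mean_squared_error")
rmse = np.sqrt(-scores.mean())
print("Cross-validated RMSE:", round(rmse, 3))
```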


The code below performs the grid search to find the best parameters for the Polynomial Ridge Regression model that we defined above. Run it to find the best fit.

In [None]:
# Define parameter grid
parameters = {"poly_features__degree": range(1,15),
             "ridge__alpha":np.logspace(-3,3,100)}

# Perform grid search
grid_search = GridSearchCV(model, parameters, cv=5)
grid_search.fit(X_cortex, y)

# Calculate best CV RMSE
RMSE_CV(grid_search.best_estimator_, X_cortex, y)

# Print best parameters
print('Best degree: ', grid_search.best_estimator_.named_steps["poly_features"].degree)
print('Best alpha: ', round(grid_search.best_estimator_.named_steps["ridge"].alpha,3))

Average cross-validated RMSE: 1.16 weeks 
Best degree:  10
Best alpha:  0.305


### Kernel Ridge Regression

This code fits, plots and evaluates the kernel ridge regression model [`KernelRidge`](https://scikit-learn.org/stable/modules/generated/sklearn.kernel_ridge.KernelRidge.html). Note that:

* The kernel has been set to Gaussian using parameter `kernel='rbf'`
* Parameter `gamma` represents $\frac{1}{2\sigma^2}$, where $\sigma$ is the standard deviation of the Gaussian kernel. Note that small values of `gamma` correspond to a wide kernel and vice versa.
* Parameter `alpha` determines the strength of the regularisation.

__Questions:__ Play with the parameters `gamma` and `alpha` to see the effect on the curve and performance. Answer the following questions:

* Keep `alpha` fixed to `1e-5` and change values of `gamma` to see the effect of the kernel size on the curve. You can for example try settings `1e-5`, `1e-3`, `1e-1`, `1e1` and `1e3`. Which setting performs the best?
* Set `gamma=1` while keeping `alpha=1e-5`. Is the model overfitted?
* Find the parameter `alpha` to reduce overfitting for `gamma=1e1`.

__Answers:__



In [None]:
# Create model
model = KernelRidge(kernel='rbf', gamma=1e0, alpha=1e-5)

# Fit model
model.fit(X_cortex,y)

# Evaluate model
RMSE_CV(model, X_cortex,y)

# Plot model
PlotRegressionCurve(model, X_cortex, y)

The code below performs the grid search to find the best parameters for the Kernel Ridge Regression model that we defined above. Run it to find the best fit.

In [None]:
# Define parameter grid
parameters = {"alpha": np.logspace(-5, 5, num=11), # You can increase the value of num, but it will slow things down
              "gamma": np.logspace(-5, 5, num=11)}

# Perform grid search
grid_search = GridSearchCV(model, parameters, cv=5)
grid_search.fit(X_cortex, y)

# Calculate best CV RMSE
RMSE_CV(grid_search.best_estimator_, X_cortex, y)

# Print best parameters
print('Gamma: {} Alpha: {}'.format(round(grid_search.best_estimator_.gamma,3), round(grid_search.best_estimator_.alpha,3)))

Average cross-validated RMSE: 1.14 weeks 
Gamma: 1.0 Alpha: 0.01


## Exercise

In this exercise we will fit multivariate non-linear regression models to the dataset with volumes of 6 brain tissues, calculate the cross-validated RMSE and check the bias error in the model. We will compare three models:

* Linear Ridge Regression
* Polynomial Ridge Regression
* Kernel Ridge Regression with Gaussian Kernel


Answer:

When diving into a multivariate non-linear regression example, let's say we're looking at predicting gestational age (GA) from the volumes of six different brain tissues. Here are some important considerations:

<img src="https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-4-Regression-models/imgs/3_Regression_lecture_notes_2.jpg" width = "700" style="float: left;">



- A useful way to evaluate the model's performance is by plotting the expected versus predicted target values. This helps us to identify any bias error in the model.
  
- In the case of Linear Ridge Regression, you might notice that there's a bias in the predictions for younger babies. Essentially, the model may not be capturing the complexities associated with lower gestational ages.

- Switching to non-linear models like Kernel Ridge Regression can help remove this bias and improve the cross-validated root mean square error (CV RMSE), offering a more accurate and nuanced representation of the data.

- After various tests, you might find that the best-performing model for this particular application is Kernel Ridge Regression. This model allows you to capture the non-linear relationships in the data more effectively, thus improving overall prediction quality.


Revisiting the example of predicting age at the time of a scan based on the volumes of six different brain tissues provides some insightful observations:

- In multivariate regression problems like this, visualizing the model directly can be challenging. However, plotting the expected versus predicted target values is a great alternative. This helps us identify any bias errors.

- Linear Ridge Regression appears to be biased for younger babies: it tends to predict higher ages than those actually observed. This is a critical issue if accurate age prediction is crucial for your study or application.

- The good news is that non-linear models like Polynomial Ridge and Kernel Ridge can mitigate this bias. This improvement is quantifiable: The cross-validation root mean square error (CV RMSE) drops from 1.27 weeks with Linear Ridge to 0.83 weeks with Polynomial Ridge, and even further to 0.77 weeks with Kernel Ridge.

- Among all models, Kernel Ridge Regression with six features emerges as the top performer, achieving the lowest CV RMSE.

So, it's clear that for this specific application, Kernel Ridge Regression offers the most accurate and least biased estimates, demonstrating the power and flexibility of non-linear models in capturing complex relationships in the data.

In a nutshell, non-linear models like Kernel Ridge Regression can be more adept at capturing complex relationships, making them a strong candidate for tasks where linear models show noticeable bias or limitations.
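
The bias that the expected versus predicted plot reveals can also be summarised numerically, by averaging the prediction error within bins of the expected target. The sketch below illustrates this on synthetic data with a saturating relationship, not on the brain-volume dataset. The helper `binned_bias` and the hyperparameter values are assumptions chosen for illustration, not tuned settings from the notebook.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

# Synthetic data with a saturating (non-linear) relationship.
# This is an assumption for illustration, not the brain-volume dataset.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 4, size=(200, 1))
y_toy = 30 + 10 * np.log1p(X_toy[:, 0]) + rng.normal(scale=0.5, size=200)

def binned_bias(y_true, y_pred, n_bins=5):
    """Mean of (predicted - expected) within equal-count bins of y_true.

    Hypothetical helper: a value far from zero suggests bias in that bin.
    """
    order = np.argsort(y_true)
    return [float(np.mean(y_pred[idx] - y_true[idx]))
            for idx in np.array_split(order, n_bins)]

models = {"Linear Ridge": Ridge(alpha=1.0),
          "Kernel Ridge": KernelRidge(kernel="rbf", gamma=0.5, alpha=1e-2)}

bias = {}
for name, model in models.items():
    # Out-of-fold predictions: each point is predicted by a model
    # that did not see it during fitting
    y_pred = cross_val_predict(model, X_toy, y_toy, cv=5)
    bias[name] = binned_bias(y_toy, y_pred)
    print(name, [round(b, 2) for b in bias[name]])
```

A positive value in the first bin means the model predicts higher targets than expected at the low end of the range, which is the kind of bias described above for the linear model.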



### Task: Linear Ridge Regression

The code below tunes a Linear Ridge Regression model on the dataset with 6 features, with feature matrix `X` and target vector `y`. Run the code and note the performance. The tuned model is saved in `model_lin`.

In [None]:
print('Linear Ridge Regression:')

# grid for hyperparameter alpha
parameters = {"alpha": np.logspace(-3,3,7)}

# create ridge model
model = Ridge()

# perform grid search
grid_search = GridSearchCV(model, parameters,cv=5)
grid_search.fit(X, y)

# remember optimised model
model_lin = grid_search.best_estimator_

# Print optimal alpha
print('Best alpha =', round(model_lin.alpha,3))

# Calculate RMSE_CV
rmse_cv = RMSE_CV(model_lin,X,y)

Linear Ridge Regression:
Best alpha = 0.1
Average cross-validated RMSE: 1.27 weeks 


Your task is now to plot expected vs predicted target values to see whether there is bias in the linear model. Complete the function `PlotTargets` to do that.

In [None]:
def PlotTargets(model,X,y):

    # Predict targets
    y_pred = None # Edit this line

    # Plot expected targets on x axis and predicted targets on y axis
    #plt.plot(None, None, 'o', label='Target values') # Edit this line
    plt.plot([27,45], [27,45], 'r', label=r'$y=\hat{y}$')  # raw string avoids the invalid escape sequence '\h'
    plt.xlabel('Expected target values')
    plt.ylabel('Predicted target values')
    plt.legend()

#PlotTargets(model_lin, X, y)
plt.title('Linear Ridge Regression')
plt.show()

**Question:** Does the plot show bias error?

**Answer:**

### Task: Polynomial Ridge Regression

Next, you will tune the polynomial ridge regression, measure its performance and plot the target values to see whether there is still bias.

Complete the code below to tune the model. Note that it is saved in `model_poly`.

In [None]:
# create model
model = Pipeline([
    ("poly_features", PolynomialFeatures(include_bias=False)),
    ("scaler", StandardScaler()),
    ("ridge", Ridge())])

# define parameter grid
parameters = {"poly_features__degree": range(1,5),
             "ridge__alpha":np.logspace(-3,3,7)}

# perform grid search
#grid_search = None

# remember optimised model
#model_poly = grid_search.best_estimator_

Complete the code below to print the optimal parameters and evaluate performance.

In [None]:
print('Polynomial Ridge Regression:')

# print optimal parameters
print('Best degree:', None)
print('Best alpha:', None)
# Remember from Notebook 2.4 that you can use the named_steps attribute of your Pipeline to access the variables of each step

# Calculate CV RMSE
#RMSE_CV(model_poly, X, y)

Polynomial Ridge Regression:
Best degree: None
Best alpha: None
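
As a reminder of how `named_steps` works, here is a minimal sketch on a toy pipeline fitted to synthetic quadratic data. It is not the exercise solution: the names `toy_pipeline`, `toy_grid`, `toy_search`, `X_toy` and `y_toy` are assumptions introduced for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic quadratic data (illustrative assumption, not the course dataset)
rng = np.random.default_rng(0)
X_toy = rng.uniform(-2, 2, size=(100, 1))
y_toy = 1.0 + X_toy[:, 0] ** 2 + rng.normal(scale=0.1, size=100)

toy_pipeline = Pipeline([
    ("poly_features", PolynomialFeatures(include_bias=False)),
    ("scaler", StandardScaler()),
    ("ridge", Ridge()),
])

# Grid keys follow the "<step name>__<parameter>" convention
toy_grid = {"poly_features__degree": [1, 2, 3],
            "ridge__alpha": [1e-3, 1e-1, 1e1]}

toy_search = GridSearchCV(toy_pipeline, toy_grid, cv=5)
toy_search.fit(X_toy, y_toy)

# best_estimator_ is a fitted Pipeline; named_steps maps step names to estimators
best = toy_search.best_estimator_
best_degree = best.named_steps["poly_features"].degree
best_alpha = best.named_steps["ridge"].alpha
print("Best degree:", best_degree)
print("Best alpha:", best_alpha)
```

The same values are also available in `toy_search.best_params_`, keyed by the names used in the parameter grid.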


**Question:** Is the performance better than for Linear Ridge?

**Answer:**

Plot the expected vs predicted target values using the function `PlotTargets` that you created.

In [None]:
# Plot
# Add your code here
plt.title('Polynomial Ridge Regression')

**Question:** Is there less bias than for the linear model?

**Answer:**

### Task: Kernel Ridge Regression

Finally, you will tune the Gaussian Kernel Ridge Regression, measure its performance and plot the target values to see whether the bias error is further reduced.

Complete the code below to tune the model. Note that it is saved in `model_kernel`.

In [None]:
# Create model
model = None

# Define parameter grid
parameters = {"alpha": np.logspace(-5, 5, num=11),
              "gamma": np.logspace(-5, 5, num=21)}

# Perform grid search
grid_search = None

# Remember optimised model
model_kernel = None

Complete the code below to print the optimal parameters and evaluate performance.

In [None]:
print('Kernel Ridge Regression:')

# print optimal parameters
#print('Best gamma: ', round(None,3))
#print('Best alpha: ', round(None,5))

# Calculate CV RMSE


Kernel Ridge Regression:


**Question:** Is the performance better than for Polynomial Ridge?

**Answer:** The CV RMSE decreased from 0.84 weeks to 0.77 weeks, so the performance is better.

Plot the expected vs predicted target values using the function `PlotTargets` that you created.

In [None]:
# Plot
plt.title('Kernel Ridge Regression')
plt.show()

**Question:** Is there less bias than for the polynomial model?

**Answer:**