# Regression

## Application: neonatal brain growth

- Around the time of birth the brain grows very quickly
- Preterm birth alters brain development

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/DevelopingPretermBrain.png" width = "300" style="float: left;">

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/Baby.png" width = "200" style="float: right;">

We will demonstrate the regression concepts using the application of brain growth in preterm neonates. Around the time of birth the brain grows very quickly. Preterm birth can disrupt this process, and therefore preterm brain development is a subject of extensive research.

To investigate the changes caused by preterm birth, we can acquire MRI scans of newborn babies. We can perform automatic segmentations of various brain structures and measure their volumes. The features in the datasets we'll be working with are brain volumes, either for (1) the whole brain, (2) six brain tissues, or (3) 86 brain structures.

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/BrainMRI-BrainSegmentation.png" width = "700">

The machine learning regression task is to predict the age of a baby (the target value) from the volumes of brain structures (the features).

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/AgeVolumePlot.png" width = "400">

In this plot each circle corresponds to a baby (a sample). The x-axis shows the brain volume (a feature, scaled with `StandardScaler`) and the y-axis shows the age (the target value). The red line is a univariate linear regression model with two parameters (slope and intercept). It achieves an $R^2$ score of 0.85, which is quite good, so the linear model fits this dataset well.
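A fit like the one in the plot can be sketched with scikit-learn as follows. The data here are synthetic stand-ins (the volume range, the slope and the noise level are made-up assumptions, not the course dataset), so the printed numbers will not match the figure.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the course dataset): volume in mL, age in weeks
rng = np.random.default_rng(0)
volume = rng.uniform(100.0, 400.0, size=100)
age = 25.0 + 0.05 * volume + rng.normal(0.0, 1.0, size=100)

# Scale the single feature to zero mean and unit variance
X = StandardScaler().fit_transform(volume.reshape(-1, 1))

# Fit the two parameters: slope (coef_) and intercept (intercept_)
model = LinearRegression().fit(X, age)
r2 = model.score(X, age)  # R^2 on the same data that was used for fitting

print(f"slope={model.coef_[0]:.2f}, intercept={model.intercept_:.2f}, R2={r2:.2f}")
```

Because the feature is centred by `StandardScaler`, the fitted intercept equals the mean age of the samples.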

## Multivariate linear regression

Can we improve the predictions by using the 6-feature or 86-feature datasets? We can use multivariate linear regression for these datasets.

Multiple features: $$x_{i1}, ... , x_{iD}$$

Target values: $$y_i$$

Samples: $$i = 1, ..., N$$

Prediction model: $$\hat{y} = w_0  + w_1x_1 + ... + w_D x_D$$

Loss function: Sum of squared errors

$$F(\textbf{w}) = \frac{1}{2} \sum_{i}(y_i - \sum_{k} w_k x_{ik} - w_0)^2$$

The model is fitted by minimizing the loss function (sum of squared errors).

One way to minimize the loss function is by taking its derivative with respect to $\textbf{w}$ and finding the solution where it is equal to 0.

This results in the **Normal Equation**:

$$\hat{\textbf{w}} = (\textbf{X}^T \textbf{X})^{-1} \textbf{X}^T \textbf{y}$$

where $\hat{\textbf{w}}$ is the weight vector that minimizes the loss function, $\textbf{X}$ is the feature matrix (with a first column of ones to model the intercept), and $\textbf{y}$ is the vector of target values.
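As a minimal sketch, the normal equation can be evaluated directly with NumPy. The data and the weights in `true_w` are synthetic and made up for illustration. The code solves the linear system $(\textbf{X}^T \textbf{X})\textbf{w} = \textbf{X}^T \textbf{y}$ rather than forming the inverse explicitly, which gives the same result and is usually more stable numerically.

```python
import numpy as np

# Synthetic data: N samples, D features, made-up weights [w_0, w_1, w_2, w_3]
rng = np.random.default_rng(0)
N, D = 50, 3
features = rng.normal(size=(N, D))
true_w = np.array([2.0, 1.0, -0.5, 3.0])
y = true_w[0] + features @ true_w[1:] + rng.normal(0.0, 0.1, size=N)

# Prepend a column of ones so that w_0 acts as the intercept
X = np.hstack([np.ones((N, 1)), features])

# Solve (X^T X) w = X^T y instead of computing the inverse explicitly
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

print(w_hat)  # should be close to true_w
```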

## Gradient descent

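Unfortunately, if the number of features is very large (e.g. 100,000+), using the normal equation becomes impractical because of the computational cost of inverting the matrix $\textbf{X}^T \textbf{X}$, whose size grows with the number of features. An alternative is to use gradient descent. Gradient descent is an iterative process in which the weight vector is initialised to a random value and then repeatedly updated by stepping in the direction of the negative gradient of the loss function until it converges to a minimum. For the sum of squared errors loss, which is convex, this is the global minimum.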

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/GradientDescent.png" width = "300">

It is important that the learning rate is set well for the algorithm to converge. If the learning rate is too small, the algorithm may take a very large number of steps to converge. If the learning rate is too large the algorithm may oscillate instead of converging.
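The effect of the learning rate can be explored with a small sketch of batch gradient descent. The function `gradient_descent` below is a hypothetical helper written for this illustration. It assumes the squared-error loss with the gradient averaged over the samples, and the three learning rates are arbitrary choices meant to show slow convergence, good convergence and divergence on this synthetic problem.

```python
import numpy as np

def gradient_descent(X, y, learning_rate, n_iters=100, seed=0):
    """Batch gradient descent on the squared-error loss (gradient averaged over samples)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])  # random initialisation of the weights
    for _ in range(n_iters):
        residual = X @ w - y
        grad = X.T @ residual / len(y)  # gradient of (1 / 2N) * sum of squared errors
        w = w - learning_rate * grad    # step against the gradient
    return w

# Synthetic data: y = 1 + 2x + noise, with a column of ones for the intercept
rng = np.random.default_rng(1)
x = rng.normal(size=200)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, size=200)

for lr in (0.001, 0.1, 2.5):
    w = gradient_descent(X, y, learning_rate=lr)
    loss = 0.5 * np.mean((X @ w - y) ** 2)
    print(f"learning rate {lr}: w={w}, loss={loss:.3g}")
```

With the smallest rate the weights barely move in 100 iterations, with the middle rate they approach the generating values (1, 2), and with the largest rate the loss grows at every step.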

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/LearrningRate.png" width = "700">

Another issue to consider with gradient descent is how much of the training data to use at each iteration.

**Batch gradient descent:** Uses all the training data at each iteration but can be very slow if we have a large number of samples.

**Stochastic gradient descent:** Uses only one random training sample at each iteration. This is very fast but can be unstable.

**Mini-batch gradient descent:** A compromise between batch and stochastic, this uses a subset of the training data at each iteration.
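The three variants differ only in how many samples contribute to each update, so one sketch can cover all of them. The helper `minibatch_gradient_descent` and its default values are assumptions made for this illustration. It also reshuffles the data once per epoch and walks through it in chunks, which is a common variant of drawing random samples at each iteration.

```python
import numpy as np

def minibatch_gradient_descent(X, y, batch_size, learning_rate=0.05, n_epochs=200, seed=0):
    """Gradient descent where each update uses `batch_size` samples."""
    rng = np.random.default_rng(seed)
    n_samples = len(y)
    w = rng.normal(size=X.shape[1])
    for _ in range(n_epochs):
        order = rng.permutation(n_samples)  # reshuffle the samples every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            grad = X[idx].T @ residual / len(idx)  # gradient averaged over the batch
            w = w - learning_rate * grad
    return w

# Synthetic data: y = 1 + 2x + noise
rng = np.random.default_rng(1)
x = rng.normal(size=200)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, size=200)

w_batch = minibatch_gradient_descent(X, y, batch_size=len(y))  # batch: all samples
w_sgd = minibatch_gradient_descent(X, y, batch_size=1)         # stochastic: one sample
w_mini = minibatch_gradient_descent(X, y, batch_size=32)       # mini-batch: a subset
print(w_batch, w_sgd, w_mini)
```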

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/GradientDescentBatches.png" width = "700">

## Evaluating performance of regression models

### $R^2$ score

This is the proportion of variance in the data explained by the model. A perfect fit would have $R^2 = 1$.

$$R^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}$$

where the $y_i$s are the target values, the $\hat{y}_i$s are the predicted target values and $\bar{y}$ is the average target value.
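As a small worked example, the $R^2$ score can be computed by hand as one minus the ratio of the sum of squared residuals to the sum of squared deviations from the mean target value. The ages and predictions below are made-up numbers chosen to keep the arithmetic easy to follow.

```python
import numpy as np

# Made-up target ages (weeks) and model predictions
y = np.array([30.0, 32.0, 35.0, 38.0, 40.0])
y_pred = np.array([31.0, 32.0, 34.0, 38.0, 41.0])

ss_res = np.sum((y - y_pred) ** 2)    # sum of squared residuals: 3
ss_tot = np.sum((y - y.mean()) ** 2)  # sum of squared deviations from the mean: 68
r2 = 1.0 - ss_res / ss_tot

print(f"R^2 = {r2:.3f}")  # prints R^2 = 0.956
```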

Let's use the $R^2$ score to compare linear regression models for each of the datasets.

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/R2_scores.png" width = "700">

The $R^2$ scores computed on the training data improve as the number of features grows. However, when we calculate cross-validated $R^2$ scores, the best performance is achieved with the 6-feature dataset. This indicates that the model fitted on the 86-feature dataset is overfitting the noise in the data.
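The gap between training and cross-validated scores can be reproduced on synthetic data. In this sketch only the first two of 40 features carry signal (an assumption made purely for illustration, not a property of the brain volume datasets), and 5-fold cross-validation is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: 60 samples, 40 features, only the first two are informative
rng = np.random.default_rng(0)
n_samples = 60
X_all = rng.normal(size=(n_samples, 40))
y = 2.0 * X_all[:, 0] - 1.0 * X_all[:, 1] + rng.normal(0.0, 1.0, size=n_samples)

results = {}
for n_features in (2, 40):
    X = X_all[:, :n_features]
    model = LinearRegression()
    train_r2 = model.fit(X, y).score(X, y)  # R^2 on the training data
    cv_r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()  # 5-fold CV R^2
    results[n_features] = (train_r2, cv_r2)
    print(f"{n_features} features: training R2={train_r2:.2f}, CV R2={cv_r2:.2f}")
```

Adding the 38 uninformative features raises the training score but lowers the cross-validated score, which is the signature of overfitting.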

### Root mean squared error (RMSE)

Another way to measure performance of regression models is with root mean squared error (RMSE).

$$ \text{RMSE} = \sqrt{\frac{1}{N} \sum_{i} (y_i - \hat{y}_i)^2}$$

where the $y_i$s are the target values and the $\hat{y}_i$s are the predicted values.

Note that unlike $R^2$, the RMSE is in the same units as the target values (i.e. weeks in our example).
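A minimal sketch of the RMSE computation, using made-up ages and predictions in weeks:

```python
import numpy as np

# Made-up target ages (weeks) and model predictions
y = np.array([30.0, 32.0, 35.0, 38.0, 40.0])
y_pred = np.array([31.0, 32.0, 34.0, 38.0, 41.0])

mse = np.mean((y - y_pred) ** 2)  # mean squared error: 3 / 5 = 0.6
rmse = np.sqrt(mse)               # back in the units of the target (weeks)

print(f"RMSE = {rmse:.3f} weeks")  # prints RMSE = 0.775 weeks
```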

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/RMSE_scores.png" width = "700">

Lower RMSE values (closer to 0) are better. The RMSE scores agree with the $R^2$ scores: the best model is the one fitted on the 6-feature dataset, while the model fitted on the 86-feature dataset overfits.

## Penalised Regression

### Ridge regression

Overfitting can be avoided by reducing the number of parameters in the model (in the case of linear regression models this can be done by reducing the number of features). An alternative is **regularization**, where a penalty term is added to the loss function.

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/RidgePenalty.png" width = "500">

The **ridge penalty** penalizes weights with a large magnitude. It is based on the (squared) **L2 norm** of the weight vector. $\lambda$ is a hyperparameter that determines the strength of the regularization.

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/L2_norm.png" width = "200">

The ridge penalty rises quadratically as the magnitude of the weight increases.

The behavior of ridge regression depends on hyperparameter $\lambda$.

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/RidgeRegressionBehavior.png" width = "600">

You can use a grid search to find the optimal lambda values.
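A grid search over $\lambda$ might look like the sketch below. Note that scikit-learn's `Ridge` calls this hyperparameter `alpha`. The synthetic data, the logarithmic grid of 13 candidate values and the 5-fold cross-validation are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 60 samples, 40 features, only the first two are informative
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 40))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0.0, 1.0, size=60)

# Scale the features, then fit ridge regression
pipeline = make_pipeline(StandardScaler(), Ridge())

# Candidate lambda values (called alpha in scikit-learn), from 0.001 to 1000
param_grid = {"ridge__alpha": np.logspace(-3, 3, 13)}

# Pick the value with the best mean cross-validated R^2
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="r2")
search.fit(X, y)

print("best lambda:", search.best_params_["ridge__alpha"])
print("best CV R2:", search.best_score_)
```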

Does ridge regression help avoid overfitting on our sample datasets?

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/RidgeRegresion_scores.png" width = "600">

Yes, with an optimal lambda, the overfitting on the 86-feature dataset is avoided.

### Lasso regression

An alternative type of regularization is **Lasso regression**. This uses the **Lasso penalty** or **L1 norm**.

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/LassoPenalty.png" width = "500">

The **Lasso penalty** penalizes the sum of the absolute values of the weights. As with ridge regression, $\lambda$ is a hyperparameter that determines the strength of the regularization.

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/L1_norm.png" width = "200">

Compared with Ridge, Lasso tends to reduce weights to 0 and thus produces sparse solutions (many weights with value = 0).
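The sparsity of Lasso solutions can be checked by counting weights that are exactly zero. This is a sketch on synthetic data in which only the first two of 20 features are informative. The `alpha` values are arbitrary, and scikit-learn scales the ridge and Lasso objectives differently, so the two values are not directly comparable.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 100 samples, 20 features, only the first two are informative
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0.0, 0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.2).fit(X, y)

# Count the weights that are exactly zero in each model
n_zero_ridge = int(np.sum(ridge.coef_ == 0.0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0.0))

print("zero weights (ridge):", n_zero_ridge)
print("zero weights (lasso):", n_zero_lasso)
print("lasso weights of the two informative features:", lasso.coef_[:2])
```

Ridge shrinks the weights of the uninformative features towards zero without setting them to zero, whereas Lasso removes most of them from the model entirely.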

### Comparison of Ridge and Lasso

- Ridge and Lasso penalties both decrease the magnitude of weights
- Ridge penalises weights with large magnitude more
- Lasso penalises weights with small magnitude more

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/ComparisonRidgeLasso.png" width = "700">

## Non-linear Regression

Let's now look at how we can formulate a non-linear regression model. In machine learning it is common to apply a non-linear _feature transformation_ followed by a multivariate linear regression model. We have already seen this when we introduced polynomial regression. In general terms, we denote the non-linear feature transformation by $\phi$ (_phi_), and the prediction model then becomes:

$$\hat{y} = \phi(x)^T \textbf{w}$$

For example, if we use a polynomial regression model of the second degree, then:

Feature transformation for 1D feature vector x:

$$\phi(x) = (1, x, x^2)^T$$

Prediction model:

$$\hat{y} = \phi(x)^T \textbf{w} = w_0 + x w_1 + x^2 w_2$$

Note: Feature transformation usually increases the dimension of the feature vector. In the case of univariate polynomial regression of degree $M$, the 1D feature $x$ is transformed to an $(M + 1)$-dimensional vector $(1, x, x^2, ..., x^M)^T$.
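The polynomial feature transformation can be sketched in a few lines of NumPy. The helper name `polynomial_features` and the weights `w` are hypothetical, chosen only to make the arithmetic easy to check.

```python
import numpy as np

def polynomial_features(x, degree):
    """Map each scalar x_i to the row (1, x_i, x_i^2, ..., x_i^degree)."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, N=degree + 1, increasing=True)

x = np.array([0.5, 1.0, 2.0])
Phi = polynomial_features(x, degree=2)  # each row is phi(x_i)^T
print(Phi)

# Prediction y_hat = phi(x)^T w for made-up weights (w_0, w_1, w_2)
w = np.array([1.0, 2.0, 3.0])
y_hat = Phi @ w
print(y_hat)  # for x = 2: 1 + 2*2 + 3*4 = 17
```

The model is non-linear in $x$ but still linear in the weights, which is why the linear regression machinery continues to apply.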

Loss function for non-linear regression:

$$F(\textbf{w}) = \frac{1}{2} \sum_{i}(y_i - \phi(\textbf{x}_i)^T \textbf{w})^2$$

Non-linear ridge regression (to avoid overfitting):

$$F(\textbf{w}) = \frac{1}{2} \sum_{i}(y_i - \phi(\textbf{x}_i)^T \textbf{w})^2 + \frac{\lambda}{2}\textbf{w}^T\textbf{w}$$

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/PolynomialRidgeRegression2.png" width = "700">

The loss function is then the sum of squared errors between the true target values $y_i$ and the target values predicted using the modified model. Because feature transformation tends to significantly increase the number of features, we include a ridge penalty in our non-linear regression model.

In the plot above you can see how the ridge penalty can be useful for polynomial regression. In the plot on the left we have fitted a polynomial of the 10th degree to predict age from cortical volume. The model slightly overfits the data and the cross-validated (CV) $R^2$ is 0.88. If we add a ridge penalty with $\lambda = 0.25$, the overfitting is reduced and the CV $R^2$ increases to 0.9.
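Setting the derivative of the ridge loss to zero gives the closed-form solution $\textbf{w} = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T \textbf{y}$, where $\Phi$ is the matrix whose rows are the transformed feature vectors $\phi(x_i)^T$. The sketch below applies this to a polynomial of degree 10 with $\lambda = 0.25$. The helper `fit_polynomial_ridge` is hypothetical and the sine-shaped data are synthetic, not the cortical volume data from the plot.

```python
import numpy as np

def fit_polynomial_ridge(x, y, degree, lam):
    """Closed-form minimiser of the polynomial ridge loss."""
    Phi = np.vander(x, N=degree + 1, increasing=True)  # rows are phi(x_i)^T
    A = Phi.T @ Phi + lam * np.eye(degree + 1)
    return np.linalg.solve(A, Phi.T @ y)

# Synthetic 1D data with a smooth non-linear trend
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=40)
y = np.sin(2.0 * x) + rng.normal(0.0, 0.1, size=40)

# Degree-10 polynomial with and without the ridge penalty
w_ridge = fit_polynomial_ridge(x, y, degree=10, lam=0.25)
Phi = np.vander(x, N=11, increasing=True)
w_plain = np.linalg.lstsq(Phi, y, rcond=None)[0]

print("weight norm without penalty:", np.linalg.norm(w_plain))
print("weight norm with ridge penalty:", np.linalg.norm(w_ridge))
```

The penalised weights have a smaller norm, which is what keeps the high-degree polynomial from oscillating to follow the noise.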

How does the performance of polynomial ridge regression change as the number of features increases?

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/PolynomialRidgeRegressionPerformance.png" width = "700">


### Kernel trick

The kernel trick helps us to design more versatile non-linear regression models.

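We define a **dual representation** $a$ of our model parameter vector $\textbf{w}$, where $\phi$ in the expression below denotes the matrix whose rows are the transformed training feature vectors $\phi(x_i)^T$: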

$$\textbf{w} = \phi^T a$$

The prediction model then becomes


$$\hat{y} = \phi^T(x) \textbf{w} = \phi^T(x) \phi^T a = \sum^N_{i=1} \phi^T(x) \phi(x_i) a_i$$

We now define a **kernel** $\kappa$

$$\kappa(x, x_i) = \phi^T(x)\phi(x_i)$$

The kernel represents a similarity between two feature vectors.

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/GaussianKernel.png" width = "300"  align="right">

For example, we can use a **Gaussian kernel** of width $\sigma$:

$$\kappa(x, x_i) = e^{-\frac{\lVert x - x_i \rVert_2^2}{2\sigma^2}}$$

$$\lVert x - x_i \rVert_2^2 = \sum_k(x_k- x_{ik})^2$$

Note:
- The original parameter vector $\textbf{w}$ has $D$ elements (one per feature)
- The dual parameter vector $a$ has $N$ elements (one per sample)

The resulting **dual prediction model** is a linear combination of kernels placed around the feature vectors $x_i$

$$\hat{y} = \sum^N_{i=1} \kappa(x, x_i) a_i$$

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/DualPredictionModel.png" width = "700">

In this plot we can see three samples with feature vectors $x_1$, $x_2$ and $x_3$. The Gaussian kernels are placed around them, multiplied by the coefficients $a_i$ and then summed up to produce the prediction model. We can also see that wider kernels (larger $\sigma$) produce smoother models.
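The dual prediction model can be sketched for three samples as follows. The kernel uses the common $2\sigma^2$ denominator. The text above does not say how the coefficients $a_i$ are obtained, so this sketch assumes the kernel ridge regression solution $a = (K + \lambda I)^{-1} \textbf{y}$, where $K$ is the matrix of kernel values between training samples. The helper names, the sample values, $\sigma$ and $\lambda$ are all made up for illustration.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Kernel matrix with entries exp(-||a - b||^2 / (2 * sigma^2))."""
    sq_dists = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=2)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def dual_predict(X_new, X_train, a, sigma):
    """Dual prediction model: y_hat(x) = sum_i kappa(x, x_i) * a_i."""
    return gaussian_kernel(X_new, X_train, sigma) @ a

# Three made-up samples with one feature each
X_train = np.array([[-1.0], [0.0], [1.5]])
y_train = np.array([0.5, 1.0, -0.5])
sigma, lam = 0.5, 1e-3

# Assumption: coefficients from kernel ridge regression, a = (K + lam * I)^{-1} y
K = gaussian_kernel(X_train, X_train, sigma)
a = np.linalg.solve(K + lam * np.eye(len(y_train)), y_train)

# Evaluate the model on a few new points
X_grid = np.linspace(-2.0, 2.5, 5).reshape(-1, 1)
print(dual_predict(X_grid, X_train, a, sigma))
```

Note that `a` has one element per training sample, so the cost of a prediction grows with the number of samples rather than with the number of features.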

We will see more details on non-linear regression in Notebook 4.4.