# Regression

## Application: neonatal brain growth

- Around the time of birth the brain grows very quickly
- Preterm birth alters brain development

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/DevelopingPretermBrain.png" width = "300" style="float: left;">

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/Baby.png" width = "200" style="float: right;">

We will demonstrate the regression concepts using the application of brain growth in preterm neonates. Around the time of birth the brain grows very quickly. Preterm birth can disrupt this process, and therefore preterm brain development is a subject of extensive research.

To investigate the changes caused by preterm birth, we can acquire MRI scans of newborn babies. We can perform automatic segmentations of various brain structures and measure their volumes. The features in the datasets we'll be working with are brain volumes, either for (1) the whole brain, (2) six brain tissues, or (3) 86 brain structures.

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/BrainMRI-BrainSegmentation.png" width = "700">

The machine learning regression task is to predict the age (target value) of a baby from the volumes of brain structures (features).

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/AgeVolumePlot.png" width = "400">

In this plot, each circle corresponds to a baby (a sample). The x-axis shows the brain volume (a feature scaled with StandardScaler) and the y-axis shows the age (the target value). The red line is a univariate linear regression model with two parameters (slope and intercept). It achieves an $R^2$ score of 0.85, which is quite good, so the linear model fits this dataset well.
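As a minimal sketch of the kind of fit described above, the following uses scikit-learn on synthetic data. The volumes, ages, and noise level are made-up assumptions standing in for the course dataset, so the numbers it prints will not match the plot.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the course data (assumption, not real measurements):
# brain volume in mL and age in weeks, linearly related plus noise.
rng = np.random.default_rng(0)
volume = rng.uniform(150, 450, size=100)
age = 25 + 0.05 * (volume - 150) + rng.normal(0, 1.0, size=100)

# Scale the single feature to zero mean and unit variance.
X = StandardScaler().fit_transform(volume.reshape(-1, 1))

# Fit the two-parameter model: age = intercept + slope * scaled_volume.
model = LinearRegression().fit(X, age)
slope, intercept = model.coef_[0], model.intercept_
r2 = model.score(X, age)

print(f"slope={slope:.2f}, intercept={intercept:.2f}, R^2={r2:.2f}")
```

Because the feature is standardised, the fitted intercept equals the mean age of the samples, which makes the two parameters easy to interpret.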

## Types of errors

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/Errors.png" width = "700">

We can classify the errors into three types:

**Bias:** Error between the average predicted values and the true model

**Variance:** Variance of the values predicted by different models

**Noise:** Error between sample target values and the true model

In the case of the linear model, what type of error do we have? It is mainly noise.
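The three error types can be estimated numerically when the true model is known, which is only possible in a simulation. The sketch below assumes a hypothetical linear `true_model` and a noise standard deviation of 1, fits a straight line to many independently simulated datasets, and compares the resulting bias, variance, and noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_model(x):
    # Hypothetical ground truth, chosen for illustration only.
    return 30.0 + 4.0 * x

noise_sd = 1.0
x_test = np.linspace(-2, 2, 50)
n_datasets, n_samples = 200, 30

# Fit one linear model per simulated dataset and store its predictions.
predictions = np.empty((n_datasets, x_test.size))
for d in range(n_datasets):
    x = rng.normal(size=n_samples)
    y = true_model(x) + rng.normal(0, noise_sd, size=n_samples)
    slope, intercept = np.polyfit(x, y, deg=1)  # highest degree first
    predictions[d] = intercept + slope * x_test

# Bias: average prediction versus the true model (squared, averaged over x_test).
bias_sq = np.mean((predictions.mean(axis=0) - true_model(x_test)) ** 2)
# Variance: spread of the predictions across the different fitted models.
variance = np.mean(predictions.var(axis=0))
# Noise: variance of the sample targets around the true model.
noise = noise_sd ** 2

print(f"bias^2={bias_sq:.4f}, variance={variance:.4f}, noise={noise:.4f}")
```

In this simulated setting, where a linear model is fitted to linear data, the noise term is the largest of the three.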

## Multivariate linear regression

Can we improve the predictions by using the 6-feature or 86-feature datasets? We can use multivariate linear regression for these datasets.

Multiple features: $$x_{i1}, ... , x_{iD}$$

Target values: $$y_i$$

Samples: $$i = 1, ..., N$$

Prediction model: $$\hat{y} = w_0  + w_1x_1 + ... + w_D x_D$$

Loss function: Sum of squared errors

$$F(\textbf{w}) = \frac{1}{2} \sum_{i}(y_i - \sum_{k} w_k x_{ik} - w_0)^2$$

The model is fitted by minimizing the loss function (sum of squared errors).

One way to minimize the loss function is to take its derivative with respect to the weights $\textbf{w}$ and solve for the point where the derivative equals 0.

This results in the **Normal Equation**:

$$\hat{\textbf{w}} = (\textbf{X}^T \textbf{X})^{-1} \textbf{X}^T \textbf{y}$$

Where $\hat{\textbf{w}}$ is the weight vector that minimizes the loss function, $\textbf{X}$ is the feature matrix (with a first column of ones to model the intercept), and $\textbf{y}$ is the target values vector.
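The normal equation can be sketched in a few lines of NumPy. The data below are synthetic and the weights in `true_w` are made up for illustration. The sketch solves the linear system $(\textbf{X}^T \textbf{X}) \textbf{w} = \textbf{X}^T \textbf{y}$ instead of forming the inverse explicitly, which gives the same result and is usually more stable numerically.

```python
import numpy as np

# Synthetic data (assumption): N samples, D features, known weights plus noise.
rng = np.random.default_rng(0)
N, D = 200, 3
X_features = rng.normal(size=(N, D))
true_w = np.array([2.0, 1.0, -0.5, 3.0])  # [w_0, w_1, ..., w_D], made up
y = true_w[0] + X_features @ true_w[1:] + rng.normal(0, 0.1, size=N)

# Prepend a column of ones so that w_0 models the intercept.
X = np.column_stack([np.ones(N), X_features])

# Normal equation: solve (X^T X) w = X^T y for w.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

print("estimated weights:", np.round(w_hat, 2))
```

With low noise and enough samples, `w_hat` should land close to `true_w`.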

## Gradient descent

Unfortunately, if the number of features is very large (e.g. 100,000+), using the normal equation becomes impractical because of the computational cost of inverting the matrix $\textbf{X}^T \textbf{X}$. An alternative is to use gradient descent. Gradient descent is an iterative process in which the weight vector is initialised to a random value and then repeatedly updated in the direction of the negative gradient of the loss until it converges to a local minimum. For the sum of squared errors loss, which is convex, this is also the global minimum.

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/GradientDescent.png" width = "300">

The learning rate must be set well for the algorithm to converge. If the learning rate is too small, the algorithm may take a very large number of steps to converge. If the learning rate is too large, the algorithm may oscillate or diverge instead of converging.
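The effect of the learning rate can be seen in a small experiment. The sketch below is an illustrative implementation of batch gradient descent for linear regression, not the course's reference code. It assumes a mean (rather than summed) squared error loss so that the step size does not depend on the number of samples, and the three learning rates are values picked for this synthetic dataset.

```python
import numpy as np

def gradient_descent(X, y, learning_rate, n_iter=100, seed=0):
    """Batch gradient descent on the loss 0.5 * mean((Xw - y)^2)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])          # random initialisation
    losses = []
    for _ in range(n_iter):
        residual = X @ w - y
        losses.append(0.5 * np.mean(residual ** 2))
        gradient = X.T @ residual / len(y)   # gradient of the loss above
        w = w - learning_rate * gradient     # step against the gradient
    return w, losses

# Synthetic data (assumption): one standardised feature plus an intercept column.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
x = (x - x.mean()) / x.std()
y = 30 + 4 * x + rng.normal(0, 1.0, size=200)
X = np.column_stack([np.ones_like(x), x])

w_small, loss_small = gradient_descent(X, y, learning_rate=0.001)
w_good, loss_good = gradient_descent(X, y, learning_rate=0.1)
w_large, loss_large = gradient_descent(X, y, learning_rate=2.1)

for name, losses in [("0.001", loss_small), ("0.1", loss_good), ("2.1", loss_large)]:
    print(f"learning rate {name}: first loss {losses[0]:.2f}, last loss {losses[-1]:.2f}")
```

On this dataset the smallest rate has barely moved after 100 iterations, the middle rate reaches the least-squares solution, and the largest rate makes the loss grow from step to step.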

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/LearrningRate.png" width = "700">

Another issue to consider with gradient descent is how much of the training data to use at each iteration.

**Batch gradient descent:** Uses all the training data at each iteration but can be very slow if we have a large number of samples.

**Stochastic gradient descent:** Uses only one random training sample at each iteration. This is very fast but can be unstable.

**Mini-batch gradient descent:** A compromise between batch and stochastic, this uses a subset of the training data at each iteration.
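The three variants differ only in how many samples contribute to each update, so a single function with a `batch_size` parameter can sketch all of them. The function name, the default learning rate, and the synthetic data are assumptions for illustration. This sketch also shuffles the data once per epoch and walks through it in order, which is a common variant of picking one random sample per iteration.

```python
import numpy as np

def minibatch_gradient_descent(X, y, batch_size, learning_rate=0.05,
                               n_epochs=200, seed=0):
    """Gradient descent where each update uses `batch_size` samples."""
    rng = np.random.default_rng(seed)
    n_samples, n_weights = X.shape
    w = rng.normal(size=n_weights)                 # random initialisation
    for _ in range(n_epochs):
        order = rng.permutation(n_samples)         # reshuffle every epoch
        for start in range(0, n_samples, batch_size):
            batch = order[start:start + batch_size]
            residual = X[batch] @ w - y[batch]
            # Mean squared error gradient estimated on the current batch only.
            w -= learning_rate * X[batch].T @ residual / len(batch)
    return w

# Synthetic data (assumption): one standardised feature plus an intercept column.
rng = np.random.default_rng(1)
x = rng.normal(size=200)
x = (x - x.mean()) / x.std()
y = 30 + 4 * x + rng.normal(0, 0.5, size=200)
X = np.column_stack([np.ones_like(x), x])

w_batch = minibatch_gradient_descent(X, y, batch_size=len(y))  # batch
w_sgd = minibatch_gradient_descent(X, y, batch_size=1)         # stochastic
w_mini = minibatch_gradient_descent(X, y, batch_size=20)       # mini-batch

print("batch:     ", np.round(w_batch, 2))
print("stochastic:", np.round(w_sgd, 2))
print("mini-batch:", np.round(w_mini, 2))
```

With a fixed learning rate, the stochastic and mini-batch estimates keep fluctuating around the least-squares solution, while the batch version settles on it smoothly.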

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/GradientDescentBatches.png" width = "700">

## Evaluating performance of regression models

### $R^2$ score

This is the proportion of variance in the data explained by the model. A perfect fit would have $R^2 = 1$.

$$R^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}$$

where the $y_i$s are the target values, the $\hat{y}_i$s are the predicted target values, and $\bar{y}$ is the average target value.
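As a small worked example, the $R^2$ score can be computed directly as one minus the ratio of the residual sum of squares to the total sum of squared deviations from the mean. The five ages and predictions below are made-up numbers for illustration.

```python
import numpy as np

# Made-up target ages (weeks) and model predictions, for illustration only.
y_true = np.array([30.0, 32.0, 35.0, 38.0, 40.0])
y_pred = np.array([31.0, 32.0, 34.0, 38.0, 41.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares: 3
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares: 68
r2 = 1 - ss_res / ss_tot

# A model that always predicts the mean target value scores exactly 0.
baseline = np.full_like(y_true, y_true.mean())
r2_baseline = 1 - np.sum((y_true - baseline) ** 2) / ss_tot

print(f"R^2 = {r2:.3f}, baseline R^2 = {r2_baseline:.3f}")
```

Here the residual sum of squares is 3 and the total sum of squares is 68, so $R^2 = 1 - 3/68 \approx 0.956$.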

Let's use the $R^2$ score to compare linear regression models for each of the datasets.

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/R2_scores.png" width = "700">

The $R^2$ scores on the training data improve as the number of features increases. However, when we calculate cross-validated $R^2$ scores, the best performance is obtained with the 6-feature dataset. This indicates that the model trained on the 86-feature dataset is overfitting to the noise in the data.
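The gap between training and cross-validated scores can be reproduced on synthetic data. The sketch below assumes a target that depends on 6 informative features and then adds 80 pure-noise features to mimic a larger feature set. It is not the course dataset, and the sample size and fold count are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (assumption): 6 informative features, 80 irrelevant ones.
rng = np.random.default_rng(0)
n_samples = 150
X_informative = rng.normal(size=(n_samples, 6))
X_irrelevant = rng.normal(size=(n_samples, 80))
weights = rng.uniform(0.5, 1.5, size=6)
y = X_informative @ weights + rng.normal(0, 1.0, size=n_samples)

datasets = {
    "6 features": X_informative,
    "86 features": np.hstack([X_informative, X_irrelevant]),
}

results = {}
for name, X in datasets.items():
    model = LinearRegression()
    train_r2 = model.fit(X, y).score(X, y)  # scored on the training data
    cv_r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    results[name] = (train_r2, cv_r2)
    print(f"{name}: training R^2 = {train_r2:.2f}, cross-validated R^2 = {cv_r2:.2f}")
```

Adding features can never lower the training $R^2$ of an ordinary least squares fit, so only the cross-validated score reveals that the extra features hurt generalisation.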

### Root mean squared error (RMSE)

Another way to measure performance of regression models is with root mean squared error (RMSE).

$$ \text{RMSE} = \sqrt{\frac{1}{N} \sum_{i} (y_i - \hat{y}_i)^2}$$

where the $y_i$s are the target values and the $\hat{y}_i$s are the predicted values.

Note that unlike $R^2$, the RMSE is in the same units as the target values (i.e. weeks in our example).
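The RMSE formula translates directly into NumPy. The ages and predictions below are made-up values in weeks, used only to show the calculation.

```python
import numpy as np

# Made-up target ages (weeks) and model predictions, for illustration only.
y_true = np.array([30.0, 32.0, 35.0, 38.0, 40.0])
y_pred = np.array([31.0, 32.0, 34.0, 38.0, 41.0])

# Mean of the squared errors, then the square root to return to weeks.
mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)

print(f"RMSE = {rmse:.3f} weeks")
```

The squared errors sum to 3 over 5 samples, so the RMSE is $\sqrt{0.6} \approx 0.775$ weeks, which can be read as a typical prediction error in the units of the target.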

<img src="https://raw.githubusercontent.com/SirTurtle/ML-BME-UofA-imgs/main/Week-4-Regression-models/imgs/RMSE_scores.png" width = "700">

Lower RMSE values (closer to 0) are better. The RMSE scores agree with the $R^2$ scores: the best model is the one trained on the 6-feature dataset, while the model trained on the 86-feature dataset has overfitted.

## Penalised Regression

## Nonlinear Regression