(ml-sup)=
# Supervised Machine Learning

## Introduction

In this chapter, we're going to look at *supervised* machine learning, which we might just as easily call *prediction*. In maths, we are trying to find an $ \hat{f}(\mathbf{x}) $ such that

$$
y = \hat{y} + \varepsilon = \hat{f}(\mathbf{x}) + \varepsilon
$$

for a set of variables $\mathbf{x}$ and an outcome (continuous or discrete) $y$. It's also possible for this to be a multi-valued problem, eg $\mathbf{\hat{y}} = \hat{\mathbf{f}}(\mathbf{x})$.

In the introduction to this section, we saw that there are a number of reasons why we might turn to machine learning. For *supervised* machine learning, the most important application is prediction.

Supervised machine learning is usually split into two types: *regression*, which covers prediction on a continuous interval, and *classification*, which is about predicting a class from a finite set of possible discrete classes.

In supervised machine learning, we will talk a lot about:

- in-sample data, the data used to train a model (the data on which a model learns)
- out-of-sample data, data held back to assess the performance of a model (data on which a model has not learned but that can be used for prediction)
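
As a minimal sketch of the split described above, the snippet below shuffles a toy dataset and holds back a quarter of it. The helper name `split_sample`, the 25% test fraction, and the toy data are assumptions made for illustration; in practice you would more likely reach for a library routine such as scikit-learn's `train_test_split`.

```python
import random


def split_sample(rows, test_fraction=0.25, seed=0):
    """Shuffle rows, then split them into in-sample and out-of-sample parts.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    shuffled = list(rows)  # copy so the caller's data is left untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    # The first n_test shuffled rows are held back; the rest are used for training
    return shuffled[n_test:], shuffled[:n_test]


# Toy data: (x, y) pairs from an assumed relationship y = 2x + 1
data = [(x, 2.0 * x + 1.0) for x in range(20)]
in_sample, out_of_sample = split_sample(data)
print(len(in_sample), len(out_of_sample))
```

The important design point is that the shuffle happens before the split, so the held-out rows are not systematically different from the training rows.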

The other key fact to know about supervised machine learning is that the typical way to assess a model is out-of-sample goodness-of-fit. Usually, this is captured by the *mean squared error*,

$$
\text{MSE} = \frac{1}{N} \displaystyle\sum_{i=1}^N (y_i - \hat{y}_i)^2
$$
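
To make the formula concrete, here is a small sketch that computes the mean squared error directly from its definition. The function name and the example numbers are made up for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between outcomes and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true)


y = [3.0, -0.5, 2.0, 7.0]      # observed outcomes (illustrative values)
y_hat = [2.5, 0.0, 2.0, 8.0]   # model predictions (illustrative values)
print(mean_squared_error(y, y_hat))  # 0.375
```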

Applying this metric only to in-sample data would encourage a lot of overfitting, because machine learning models are very flexible function approximators and are very good at fitting whatever data they are shown, noise included. Overfitting is a problem because it gives undue confidence that a model represents reality when, in truth, it would perform poorly on new data. Imagine you had only ever seen blue aquatic species that are fish; you might conclude that *only* fish can be blue and swim. But your model of the world would perform poorly in the real world because there are mammals that are blue and swim (not least blue whales!).

To ensure that we are not overfitting, there are typically two approaches, which can also be combined. The first is to use penalties for complexity in the model, which push the algorithm to find a succinct $\hat{f}(\mathbf{x})$. The second is to test model performance on 'held out' or 'out-of-sample' data, ie data that is not used to train the model. This gives a better reflection of how the model will perform in reality. There are a whole host of ways that we can hoodwink ourselves into thinking that our model is performing better than it is, so it's important to be on guard for ways that a machine learning model might pick up clues that you did not expect. There's a famous example in which an image recognition algorithm was identifying which images came from particular scanners as a proxy for which patients had cancer, rather than learning to recognise the cancer itself; you can see how such errors can be serious. On the other hand, when used properly, machine learning is extremely powerful.
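
The gap between in-sample and out-of-sample performance can be seen in a small simulation. The sketch below is illustrative only: the data-generating process ($y = 2x + 1$ plus noise), the sample sizes, and the two toy models are all assumptions. One model is a simple least-squares line; the other is a 'memoriser' that returns the outcome of the nearest training point, which is about as flexible as a model can be.

```python
import random

rng = random.Random(42)


def simulate(n):
    """Draw n points from y = 2x + 1 + noise (an assumed data-generating process)."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]
    return xs, ys


def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def fit_line(xs, ys):
    """Ordinary least squares with one regressor; returns a prediction function."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def fit_memoriser(xs, ys):
    """A maximally flexible model: predict the y of the nearest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


x_in, y_in = simulate(50)    # in-sample data
x_out, y_out = simulate(50)  # out-of-sample data from the same process

results = {}
for name, fit in [("line", fit_line), ("memoriser", fit_memoriser)]:
    model = fit(x_in, y_in)
    results[name] = (
        mse(y_in, [model(x) for x in x_in]),
        mse(y_out, [model(x) for x in x_out]),
    )
    print(
        f"{name}: in-sample MSE = {results[name][0]:.3f}, "
        f"out-of-sample MSE = {results[name][1]:.3f}"
    )
```

The memoriser reproduces its training data exactly, so its in-sample error is zero, yet it has also memorised the noise, and that shows up as soon as it is asked about data it has not seen. Judged in-sample alone, it would look like the perfect model.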

Although we're trying to keep the theory to a minimum here, it would be remiss not to mention the bias-variance trade-off in machine learning models. Throwing darts at a dartboard helps illustrate the concepts:
- bias is a measure of how far the average location of your shots is from the target
- variance is how spread out your shots are around their average location, regardless of where that average location is

If you think of each prediction of a machine learning model as a 'shot', then you can see how this applies. The following equation breaks down the mean squared error of a prediction problem on ${\displaystyle y=f(x)+\varepsilon }$, where we wish to minimise the expected value of ${\displaystyle (y-{\hat {f}}(x))^{2}}$:

$$
{\displaystyle {\begin{aligned}{\text{MSE}}&=f^{2}+\sigma ^{2}-2f\operatorname {E} [{\hat {f}}]+\operatorname {Var} [{\hat {f}}]+\operatorname {E} [{\hat {f}}]^{2}\\&=(f-\operatorname {E} [{\hat {f}}])^{2}+\sigma ^{2}+\operatorname {Var} {\big [}{\hat {f}}{\big ]}\\[5pt]&=\operatorname {Bias} [{\hat {f}}]^{2}+\sigma ^{2}+\operatorname {Var} {\big [}{\hat {f}}{\big ]}\end{aligned}}}
$$

There are three parts here: an irreducible error, $\mathbb{E}[\varepsilon^2] = \sigma^2$; a squared bias, $\operatorname {Bias} [{\hat {f}}]^{2} = (f-\mathbb{E}[\hat{f}])^2$; and finally a variance, $\operatorname {Var} {\big [}{\hat {f}}{\big ]} = \mathbb{E}[\hat{f}^2]-\mathbb{E}[\hat{f}]^2$.

Ideally, we want to just have the irreducible error. In practice, reducing the variance tends to increase the bias and vice versa, and this is why people talk about the bias-variance trade-off. Overfitting is a great example: it decreases bias (the model tracks the training data more closely) but at the cost of having a more complex, varying set of guesses (higher variance). If you're hungry for more, the [Wikipedia page](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff) on this is great.
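
The decomposition can be checked numerically. The following is a rough sketch under stated assumptions: the true value $f = 2$ at a single fixed point, noise with $\sigma = 1$, and an estimator that multiplies the training-sample mean by a shrinkage factor are all choices made purely for illustration. Shrinking the estimate towards zero lowers its variance but introduces bias.

```python
import random
import statistics

rng = random.Random(0)
f_true = 2.0       # assumed true value of f at a single fixed x
sigma = 1.0        # standard deviation of the noise
n_train = 10       # observations in each simulated training set
n_trials = 10_000  # number of simulated training sets


def decompose(shrink):
    """Simulate f_hat = shrink * (training mean); return bias^2, variance, MSE."""
    f_hats, sq_errors = [], []
    for _ in range(n_trials):
        train = [f_true + rng.gauss(0, sigma) for _ in range(n_train)]
        f_hat = shrink * statistics.fmean(train)
        y_new = f_true + rng.gauss(0, sigma)  # a fresh, unseen outcome
        f_hats.append(f_hat)
        sq_errors.append((y_new - f_hat) ** 2)
    bias_sq = (f_true - statistics.fmean(f_hats)) ** 2
    variance = statistics.pvariance(f_hats)
    return bias_sq, variance, statistics.fmean(sq_errors)


for shrink in (1.0, 0.8, 0.5):
    bias_sq, variance, mse = decompose(shrink)
    print(
        f"shrink={shrink}: bias^2={bias_sq:.3f}, var={variance:.3f}, "
        f"bias^2 + sigma^2 + var={bias_sq + sigma**2 + variance:.3f}, "
        f"simulated MSE={mse:.3f}"
    )
```

Up to simulation noise, the simulated MSE should sit close to the sum of the three components, with the variance falling and the squared bias rising as the shrinkage factor moves away from one.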

## Machine learning regression

