# SLU12 - Validation metrics for regression: Learning Notebook

In this notebook, you will learn about:
* [Mean Absolute Error (MAE)]()
* [Mean Squared Error (MSE)]()
* [Root Mean Squared Error (RMSE)]()
* [Coefficient of Determination (R²)]()
* [Adjusted R²]()
* [Regularization]()

In SLU9, you were introduced to one of the most intuitive and widely used regression models. You were also introduced to a function ($J$) that measures how good the linear regression model is. In this SLU, we will take a more in-depth look at that function, and at others.

But before we start, there is something we should make clear. While studying and practising data science, you will often hear or read these two words: **loss** and **metric**. Both of them refer to functions that evaluate the **quality** of a model. Sometimes people use them as if they were the same thing, but there are important differences between them:
* **Loss** is the function that your model will minimize;
* **Metric** is the function that you really want to use to evaluate how good your model is.

Sometimes, the loss and metric functions are the same (example: linear regression). But other times, e.g. in classification, the metric will be really different from the loss. 

## Mean Absolute Error (MAE)

$$MAE = \frac{1}{N} \sum_{n=1}^N \left|y_n - \hat{y}_n\right|$$

In [253]:
import numpy as np  # required by all the metric helpers in this notebook

mae = lambda y, y_hat: np.abs(y - y_hat).mean()

* The lower, the better.
* It is less sensitive to outliers than MSE (the next metric).
* The output can be interpreted as the expected error measured in the same units as the target.
* It can be used as both a metric and a loss function. There are some important caveats to take into consideration when using MAE as a loss function: (a) the minimizer may not be unique, (b) the parameter values can make large jumps and (c) the absolute value has no derivative at points where an error is exactly 0.
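To make the units and caveat (a) concrete, here is a minimal sketch. The numbers are made up for illustration, and `mae` is rewritten as a plain function so the block runs on its own. With an even number of targets, every constant prediction between the two middle values reaches the same MAE, so the minimizer is not unique.

```python
import numpy as np

def mae(y, y_hat):
    """Mean absolute error between targets y and predictions y_hat."""
    return np.abs(y - y_hat).mean()

# Hypothetical house prices (in thousands of $) and model predictions.
y = np.array([200.0, 150.0, 320.0, 275.0])
y_hat = np.array([210.0, 140.0, 300.0, 280.0])

# Absolute errors are 10, 10, 20 and 5, so the MAE is 45 / 4 = 11.25,
# expressed in the same units as the target (thousands of $).
mae_value = mae(y, y_hat)
print(mae_value)

# Caveat (a): for the targets 1, 2, 3, 4, any constant prediction
# between 2 and 3 yields exactly the same MAE of 1.0.
targets = np.array([1.0, 2.0, 3.0, 4.0])
constant_maes = [mae(targets, c) for c in (2.0, 2.5, 3.0)]
print(constant_maes)
```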

## Mean Squared Error (MSE)

In SLU9, we already explored this metric:

$$MSE = \frac{1}{N} \sum_{n=1}^N (y_n - \hat{y}_n)^2$$

In [255]:
mse = lambda y, y_hat: ((y - y_hat)**2).mean()

* The lower, the better.
* MSE can be used as both a metric and a loss function (e.g. linear regression).
* It is sensitive to outliers in its original form.
* The units of the metric are not the same as the ones used in the target. For example, if you are predicting house prices, i.e. the target is in \$, then the MSE would be in \$².
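The sensitivity to outliers can be illustrated with a small, made-up example. The data below is hypothetical, and both helpers are redefined as plain functions so the block is self-contained. A single large error moves the MSE far more than the MAE, because the error is squared before averaging.

```python
import numpy as np

def mae(y, y_hat):
    """Mean absolute error."""
    return np.abs(y - y_hat).mean()

def mse(y, y_hat):
    """Mean squared error."""
    return ((y - y_hat) ** 2).mean()

y_hat = np.array([11.0, 11.0, 12.0, 12.0])

# Clean targets: every prediction is off by exactly 1.
y_clean = np.array([10.0, 12.0, 11.0, 13.0])
# Same targets, but the last one is an outlier (error of 21).
y_outlier = np.array([10.0, 12.0, 11.0, 33.0])

mae_clean, mse_clean = mae(y_clean, y_hat), mse(y_clean, y_hat)
mae_out, mse_out = mae(y_outlier, y_hat), mse(y_outlier, y_hat)

# MAE goes from 1 to 6, while MSE goes from 1 to 111.
print(mae_clean, mae_out)
print(mse_clean, mse_out)
```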

## Root Mean Squared Error (RMSE)

$$RMSE = \sqrt{MSE}$$

In [256]:
rmse = lambda y, y_hat: np.sqrt(mse(y, y_hat))

* The lower, the better.
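* Its output can be interpreted as having the same units as the targets.
* RMSE can be used as both a metric and a loss function. In fact, by the chain rule, its partial derivative is the MSE derivative rescaled by a positive factor, $\frac{\partial RMSE}{\partial \hat{y}} = \frac{1}{2 \sqrt{MSE}} \frac{\partial MSE}{\partial \hat{y}}$, so wherever MSE > 0 both losses point in the same direction and share the same minimizer.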
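As a sanity check on the relation between the two derivatives, the sketch below compares the chain-rule gradient, $\frac{\partial RMSE}{\partial \hat{y}} = \frac{1}{2\sqrt{MSE}} \frac{\partial MSE}{\partial \hat{y}}$, against a finite-difference approximation. The data and the step size `eps` are arbitrary choices made for illustration.

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error."""
    return ((y - y_hat) ** 2).mean()

def rmse(y, y_hat):
    """Root mean squared error."""
    return np.sqrt(mse(y, y_hat))

y = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.5, 5.5, 8.0])

# Analytic gradients with respect to each prediction.
grad_mse = -2.0 * (y - y_hat) / len(y)
grad_rmse = grad_mse / (2.0 * rmse(y, y_hat))  # chain rule

# Central finite differences as an independent check.
eps = 1e-6
numeric = np.zeros_like(y_hat)
for i in range(len(y_hat)):
    step = np.zeros_like(y_hat)
    step[i] = eps
    numeric[i] = (rmse(y, y_hat + step) - rmse(y, y_hat - step)) / (2 * eps)

gradients_match = np.allclose(grad_rmse, numeric, atol=1e-6)
print(gradients_match)
```

Because the scaling factor is positive, the RMSE gradient always has the same sign as the MSE gradient in every coordinate.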

## Coefficient of Determination (R²)

R² measures how much better your regression model is than a predictor that always outputs the mean of the targets:

$$\bar{y} = \frac{1}{N} \sum_{n=1}^N y_n$$

$$R^2 = 1 - \frac{MSE(y, \hat{y})}{MSE(y, \bar{y})} 
= 1 - \frac{\frac{1}{N} \sum_{n=1}^N (y_n - \hat{y}_n)^2}{\frac{1}{N} \sum_{n=1}^N (y_n - \bar{y})^2}
= 1 - \frac{\sum_{n=1}^N (y_n - \hat{y}_n)^2}{\sum_{n=1}^N (y_n - \bar{y})^2}$$

In [202]:
r2 = lambda y, y_hat: 1 - (mse(y, y_hat) / mse(y, np.mean(y)))

The higher the R², the larger the share of the variation in the dependent variable that your model explains using the independent variables. For example, if you got an R² of 0.7, you can say that the set of features you used explains 70% of the variance of the target variable.

The highest R² you can get is 1. If you get R² = 0, your model is doing no better than always predicting the mean of the targets, i.e. it doesn't explain anything in the target using the features you selected. If you get R² < 0, your model is doing worse than that mean predictor: you are probably suffering from too much overshooting (remember the SGDRegressor example in SLU9?). Another possible reason for getting R² < 0 is that the model you used doesn't make sense for the data you have.
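These reference values can be checked on a toy example. The targets and predictions below are invented for illustration, and `r2` is written as a plain function so the block runs on its own.

```python
import numpy as np

def r2(y, y_hat):
    """Coefficient of determination: 1 - MSE(y, y_hat) / MSE(y, mean(y))."""
    return 1 - np.mean((y - y_hat) ** 2) / np.mean((y - np.mean(y)) ** 2)

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

r2_perfect = r2(y, y)                            # exact predictions
r2_good = r2(y, np.array([1.1, 1.9, 3.2, 3.9, 5.1]))
r2_mean = r2(y, np.full_like(y, y.mean()))       # always predict the mean
r2_reversed = r2(y, y[::-1])                     # worse than the mean

# Perfect predictions give 1, the mean predictor gives 0,
# and predictions worse than the mean give a negative score.
print(r2_perfect, r2_good, r2_mean, r2_reversed)
```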

Also, when using R², there are some important [caveats](https://en.wikipedia.org/wiki/Coefficient_of_determination#Caveats) to take into account. One of them is that, depending on the model, using more features can inflate the R² even when those features are just noise, meaning the model is actually fitting the noise.
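The inflation effect can be reproduced with a small simulation. This is a sketch under stated assumptions: the data is synthetic, the model is ordinary least squares fitted with `np.linalg.lstsq`, the helper `fit_predict` is hypothetical, and R² is measured on the training data itself. For such a model, adding columns can never lower the training R², even when the columns are pure noise.

```python
import numpy as np

def r2(y, y_hat):
    """Coefficient of determination."""
    return 1 - np.mean((y - y_hat) ** 2) / np.mean((y - np.mean(y)) ** 2)

def fit_predict(X, y):
    """Fit least squares with an intercept and predict on the same data."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return X1 @ coef

rng = np.random.default_rng(0)
N = 30
x = rng.normal(size=(N, 1))               # the one informative feature
y = 2.0 * x[:, 0] + rng.normal(size=N)    # target = signal + noise
noise = rng.normal(size=(N, 20))          # 20 features unrelated to y

r2_base = r2(y, fit_predict(x, y))
r2_noisy = r2(y, fit_predict(np.hstack([x, noise]), y))

# The training R² with the noise features is at least as high as without.
print(r2_base, r2_noisy)
```

Evaluating both models on held-out data is the usual way to reveal that the extra features did not add real predictive power.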

## Adjusted R²

In order to account for the addition of useless variables, we can use the adjusted R² score:

$$R_{adj}^2 = 1 - \frac{N - 1}{N - K - 1} (1 - R^2)$$

where $N$ is the number of observations in the training dataset and $K$ is the number of features your model is using.

In [207]:
adjusted_r2 = lambda y, y_hat, N, K: 1 - ((N - 1) / (N - K - 1)) * (1 - r2(y, y_hat))
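To see the penalty at work, the sketch below applies the formula directly to an R² value. The helper name `adjusted_r2_from_score` and the numbers are hypothetical. For a fixed R² and $N$, increasing the number of features $K$ lowers the adjusted score.

```python
def adjusted_r2_from_score(r2_value, N, K):
    """Adjusted R² computed from a plain R² value, N observations and K features."""
    return 1 - ((N - 1) / (N - K - 1)) * (1 - r2_value)

# Same R² of 0.9 on 50 observations, obtained with 5 or with 20 features.
adj_few = adjusted_r2_from_score(0.9, N=50, K=5)    # 1 - (49 / 44) * 0.1
adj_many = adjusted_r2_from_score(0.9, N=50, K=20)  # 1 - (49 / 29) * 0.1

# Roughly 0.889 with 5 features versus roughly 0.831 with 20 features.
print(round(adj_few, 3), round(adj_many, 3))
```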

## Regularization