# Regression Performance

Up until this point, we have covered a few machine learning techniques focused on regression. However, we have not yet analysed **how to rank different algorithms in terms of their predictivity**, or how to find the best hyperparameters for a given algorithm. This is the focus of this section.

In this regard, we aim to be practical, without dwelling on the underlying mathematical subtleties of different choices. Instead, we will briefly show how this is done using `scikit-learn`.

## Metrics

**Metrics are the collective term given to the family of analysis techniques used to analyse and measure the performance of different machine learning algorithms**. In short, each metric gives a number (or set of numbers), also called <b>scores</b>, that can be used to measure the performance of a given model. 

Keep in mind that metrics differ in which direction is better: for scores such as R<sup>2</sup>, a higher value means higher predictivity, whereas for error metrics such as MSE or MAE, a lower value is better. It is therefore always good to check their exact definition, also to understand whether a metric fits your specific purpose, which typically depends on the details of the problem you are trying to solve.

Here we list some of the metrics used for regression techniques. Although most of them are probably intuitive, you can find their definitions [here](https://towardsdatascience.com/20-popular-machine-learning-metrics-part-1-classification-regression-evaluation-metrics-1ca3e282a2ce).
     
- SE - squared error
- MSE - mean squared error
- RMSE - root mean squared error
- rMSE - relative mean squared error
- R<sup>2</sup> - coefficient of determination
- AE - absolute error
- MAE - mean absolute error
- Adjusted R<sup>2</sup>
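As a rough illustration of how several of these quantities relate to each other, the sketch below computes them by hand with `numpy` and compares the results with the corresponding `sklearn.metrics` functions. The arrays `y_true` and `y_pred` and the value of `n_features` are made up for this example; they do not come from the dataset used later in the section.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true values and predictions, for illustration only
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

errors = y_true - y_pred
se = errors ** 2          # SE: squared error of each point
mse = se.mean()           # MSE: mean of the squared errors
rmse = np.sqrt(mse)       # RMSE: back in the same units as the target
ae = np.abs(errors)       # AE: absolute error of each point
mae = ae.mean()           # MAE: mean of the absolute errors

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
r2 = 1 - se.sum() / ((y_true - y_true.mean()) ** 2).sum()

# Adjusted R^2 penalises the number of features; n_features is hypothetical here
n_samples, n_features = len(y_true), 1
adjusted_r2 = 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# The hand-computed values should agree with scikit-learn
print(f"MSE  by hand = {mse:.3f}, sklearn = {mean_squared_error(y_true, y_pred):.3f}")
print(f"MAE  by hand = {mae:.3f}, sklearn = {mean_absolute_error(y_true, y_pred):.3f}")
print(f"R2   by hand = {r2:.3f}, sklearn = {r2_score(y_true, y_pred):.3f}")
print(f"RMSE = {rmse:.3f}, adjusted R2 = {adjusted_r2:.3f}")
```

Note that MSE, RMSE and MAE are errors (lower is better), while R<sup>2</sup> and adjusted R<sup>2</sup> are scores (higher is better).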



# How to evaluate a regression algorithm (a short guide):

1) **Choose a (few) metric(s)**.<br>
Typically, for regression the R<sup>2</sup> or RMSE values are used, and it is normal to use only 1 or 2 of the above metrics to compare different models, and no more. However, the right choice of metric is problem specific, and will depend on the reason why you want to deploy your regression model. For example, you might simply want, on average, to make a prediction with the lowest possible error. In other cases, you might accept a larger average error, but want to avoid very large errors, even if they occur only rarely.

2) **Calculate the metric for your algorithm**.<br>
Here it is of absolute importance to remember that, in order to check how well an algorithm performs, the metric must be evaluated **on the test set NOT on the training set**.  
The reason for using the test set is that an algorithm might have enough parameters to fit every one of the points that it trains on exactly. However, such accuracy would be useless if it cannot predict the value of other data outside what it has seen in training!

> Obvious as it sounds, this is often forgotten, especially when you take your first steps into ML!

3) **Rank your choices from best to worst, in the order induced by the metric.**<br>
Whether you want to choose between different algorithms (say, linear regression vs random forest) or between different sets of hyperparameters (for example, the $L_2$ regularisation coefficient in Ridge regression, the maximum depth of a tree, or the minimum number of datapoints in each leaf of a tree), just see how their predictivity changes using the metric, and **pick the one with the best value...and other application-dependent performances**, such as computational cost (i.e., the time or computing resources the algorithm requires to make a prediction).

What do we mean by other "application-dependent performances"?  
For certain parameters, the best-ranking algorithm might also correspond to a model that is computationally too demanding and takes a long time to return an answer. Thus, you might need to consider performance also in terms of time, or cost per application (especially when this algorithm might be used billions of times each day!). In other words, **the right choice is not always purely a matter of performance metric, and a bit of fine tuning might be necessary to find the right compromise.**
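One possible way to look at both aspects at once is sketched below: a few candidate models are ranked by a test-set metric, and the time each one needs to make its predictions is recorded alongside. The synthetic data from `make_regression`, the choice of candidates and the use of MAE are all assumptions made for this illustration, not part of the lecture's dataset.

```python
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data, used only so that the example is self-contained
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Hypothetical candidates: two algorithms, one of them with two hyperparameter settings
candidates = {
    "LinearRegression": LinearRegression(),
    "RandomForest(max_depth=3)": RandomForestRegressor(max_depth=3, random_state=0),
    "RandomForest(max_depth=None)": RandomForestRegressor(random_state=0),
}

results = []
for name, model in candidates.items():
    model.fit(X_train, y_train)
    start = time.perf_counter()
    y_pred = model.predict(X_test)  # time only the prediction step
    predict_seconds = time.perf_counter() - start
    results.append((name, mean_absolute_error(y_test, y_pred), predict_seconds))

# MAE is an error, so lower is better: sort in ascending order
results.sort(key=lambda row: row[1])
for name, mae, predict_seconds in results:
    print(f"{name:30s} MAE = {mae:8.2f}   predict time = {predict_seconds * 1e3:.2f} ms")
```

The timings depend on the machine and on the size of the test set, so they should be read as a relative comparison between the candidates rather than as absolute figures.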

> **Think:** If algorithm A is marginally better than algorithm B in terms of the metric but takes much longer to return a prediction, which one would you choose for i) predicting car movement for self-driving cars vs ii) predicting the toxic dose of a drug?
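Returning to step 2 of the guide, the sketch below illustrates why the metric must be evaluated on the test set. It assumes synthetic data from `make_regression` and an unrestricted `DecisionTreeRegressor`, chosen here only because such a tree can fit its training points exactly.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Noisy synthetic data, used only so that the example is self-contained
X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# With no depth limit, the tree can keep splitting until every training point is fitted
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

r2_train = r2_score(y_train, tree.predict(X_train))
r2_test = r2_score(y_test, tree.predict(X_test))

print(f"R2 on the training set = {r2_train:.2f}")
print(f"R2 on the test set     = {r2_test:.2f}")
```

The training score looks perfect because the model has memorised the training points, noise included; only the test score says something about data the model has not seen.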

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

from sklearn.metrics import (
    explained_variance_score,
    max_error,
    mean_absolute_error,
    mean_squared_error,
    median_absolute_error,
    r2_score,
)
```

## Comparing different metrics

The example below shows how different metrics can be called. Many of the metric functions are called in exactly the same way, but it is up to the user to choose which metric is the most appropriate for their application.

The snippet at the end of the code can be broken down in the following way:

1. Loop over a list of the metric functions

```python
for metric in [
    explained_variance_score,
    max_error,
    mean_absolute_error,
    mean_squared_error,
    median_absolute_error,
    r2_score,
]:
```

2. Within a string, print the name of the function

```python
f'{metric.__name__}'
```

3. Print the resulting value of the metric, rounded to 2 decimal places - notice how each of the different functions can be called just using the `metric` variable, because it is within the loop.

```python
f'{metric(y_test, y_pred):.2f}'
```

```python
# import the data and do the train test split
data = pd.read_csv('data tasks/400-fish-preprocessed.csv')
# shuffle the rows (no random_state is set, so the results vary between runs)
data = data.sample(frac=1).reset_index(drop=True)
columns = list(data.columns)
y_col = columns.pop(0)  # the first column is the target
y = data[y_col].to_numpy()
X = data[columns].to_numpy()
training_fraction = 0.1 # we will use 10% of the total data to train the model (this is arbitrarily chosen for now)
training_size = int(training_fraction * len(X))
X_train = X[:training_size]
X_test = X[training_size:]
y_train = y[:training_size]
y_test = y[training_size:]

# create regressor
regressor = RandomForestRegressor().fit(X_train, y_train)
y_pred = regressor.predict(X_test)

# analyse metrics (evaluated on the test set)
for metric in [
    explained_variance_score,
    max_error,
    mean_absolute_error,
    mean_squared_error,
    median_absolute_error,
    r2_score,
]:
    print(f'Metric[{metric.__name__}] = {metric(y_test, y_pred):.2f}')
```

```
Metric[explained_variance_score] = 0.82
Metric[max_error] = 2.41
Metric[mean_absolute_error] = 0.22
Metric[mean_squared_error] = 0.17
Metric[median_absolute_error] = 0.10
Metric[r2_score] = 0.82
```


## Summary

In this section we have shown how different types of performance metrics can be used by taking advantage of the `sklearn.metrics` module. These metrics are specific to regression algorithms, but other types might be more appropriate for other tasks (e.g., classification), as we shall see in the next lecture.

# Conclusion

In Lecture 4 we have covered an introduction to different regression techniques, using the `scikit-learn` package. As an extension of the preprocessing stage, we have shown how data can be split into a training set and a test set, to allow the performance of a model to be analysed. Additionally, we have presented an overview of learning models within `scikit-learn`, which use the `.fit` and `.predict` methods to do machine learning on training data, and how their performance can be measured by using the `.score` method or by using the `sklearn.metrics` module.
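As a small check of how the two approaches relate, the sketch below compares the `.score` method of a regressor with `r2_score` from `sklearn.metrics`. For `scikit-learn` regressors, `.score` returns the R<sup>2</sup> of the predictions, so the two numbers should coincide. The synthetic data and the choice of `LinearRegression` are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data, used only so that the example is self-contained
X, y = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# Two routes to the same number: the estimator's own method and the metrics module
score_from_method = model.score(X_test, y_test)
score_from_metrics = r2_score(y_test, model.predict(X_test))

print(f".score   = {score_from_method:.4f}")
print(f"r2_score = {score_from_metrics:.4f}")
print("identical:", np.isclose(score_from_method, score_from_metrics))
```

Any other metric (for example MAE or max error) has to be computed through `sklearn.metrics`, since `.score` reports only R<sup>2</sup> for regressors.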