## <font color='#eb3483'>Evaluating Regression models</font>

 Evaluating models is an essential step in doing Machine Learning for two reasons:

- So we can compare models and find the one that performs best.
- So we can calculate the expected margin of error in our predictions.

We are going to look at various error metrics that judge how well regression models make predictions. 

Before we begin, let's load the necessary libraries for data manipulation.

In [1]:
from IPython.display import Image
import pandas as pd
import numpy as np
import seaborn as sns
sns.set(rc={'figure.figsize':(6,6)}) 

import warnings
warnings.simplefilter("ignore")

We're still using the Boston housing dataset from the `sklearn` library.

In [2]:
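# import datasets from sklearn
from sklearn import datasets

In [3]:
# load the boston dataset (note: load_boston was removed in scikit-learn 1.2, so this needs an older version)
boston = datasets.load_boston()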

We split the data into training and testing sets, fit the model on the training set and make predictions on the test set.

P.S. To learn more about the `random_state` parameter, see [this Stack Overflow answer](https://stackoverflow.com/a/28069274).

In [5]:
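# import train_test_split from sklearn.model_selection
from sklearn.model_selection import train_test_split

In [6]:
# split the data using a test_size of 0.33 and set random_state to 13
X_train, X_test, y_train, y_test = train_test_split(boston["data"], boston["target"], test_size=0.33, random_state=13)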

We instantiate the model using the `LinearRegression` class.

In [None]:
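# import LinearRegression from sklearn.linear_model
from sklearn.linear_model import LinearRegression

In [None]:
# create an object of the LinearRegression class and call it model
model = LinearRegression()

In [None]:
# train the model
model.fit(X_train, y_train)

# make predictions
predictions = model.predict(X_test)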


# see predictions
for y, y_pred in list(zip(y_test, predictions))[:5]:
    print("Real value: {:.3f} Estimated value: {:.5f}".format(y, y_pred))

## We now look at metrics to evaluate our predictions.

This allows us to get an indication of how good or how poor our predictions actually were - without just eyeballing some visualisations :)

Model evaluation metrics in `scikit-learn` are available under the module `metrics`

In [None]:
# import metrics from sklearn
from sklearn import metrics

### <font color='#eb3483'>Mean Absolute Error (MAE)</font>

The Mean Absolute Error (MAE) is defined as:

$$\frac{1}{n}\sum_{i=1}^{n}|y_i -\hat{y}_i|$$

In other words, it is the average of the absolute differences between the real values of the target variable and the predictions (taking the absolute value turns negative differences into positive ones).

MAE is a robust metric, which means it doesn't change dramatically when there are outliers. The MAE is expressed in the same units as the target variable (so for example, if the target variable is in dollars, the MAE will also be in dollars).
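To make the formula and the robustness claim concrete, here is a minimal sketch that computes the MAE by hand with NumPy. The values and the helper names (`toy_mae`, `toy_mse`) are made up for illustration and are not part of the Boston data; the mean of the squared differences is included only as a point of comparison.

```python
import numpy as np

# Hypothetical true values and predictions, made up for illustration
toy_true = np.array([24.0, 21.0, 34.0, 33.0, 36.0])
toy_pred = np.array([25.0, 20.0, 30.0, 35.0, 36.0])

def toy_mae(y_true, y_pred):
    """Mean of the absolute differences, as in the formula above."""
    return np.mean(np.abs(y_true - y_pred))

def toy_mse(y_true, y_pred):
    """Mean of the squared differences, used here only for comparison."""
    return np.mean((y_true - y_pred) ** 2)

mae_clean = toy_mae(toy_true, toy_pred)
mse_clean = toy_mse(toy_true, toy_pred)

# Same predictions, except that the first one is now badly wrong
toy_pred_outlier = toy_pred.copy()
toy_pred_outlier[0] = 74.0

mae_outlier = toy_mae(toy_true, toy_pred_outlier)
mse_outlier = toy_mse(toy_true, toy_pred_outlier)

print(f"MAE: {mae_clean:.2f} -> {mae_outlier:.2f}")
print(f"MSE: {mse_clean:.2f} -> {mse_outlier:.2f}")
```

A single bad prediction moves both numbers, but the squared-error metric grows far more than the MAE does.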

In [None]:
metrics.mean_absolute_error?

In [None]:
mae = metrics.mean_absolute_error(y_test, predictions)
print("The Mean Absolute Error is {:.3f} (in thousands of dollars)".format(mae))

![title](media/train_test_split.png)
![title](media/train_test.png)

### <font color='#eb3483'>Mean Squared Error (MSE)</font>

The Mean Squared Error (MSE) is defined as:

$$\frac{1}{n}\sum_{i=1}^{n}(y_i -\hat{y}_i)^2$$

Similar to the MAE, but using the square of the difference between the true target and the prediction. 

MSE gives more weight to larger errors than MAE does (so it is not robust to outliers). For example, let's imagine we are predicting housing prices using the Boston dataset (where `MEDV` is in thousands of dollars) and we have the following observations:

```
observation1: MEDV: 10   MEDV_pred: 15    squared error: (10-15)²=25
observation2: MEDV: 1000 MEDV_pred: 1010  squared error: (1000-1010)²=100
```

By using MSE we are giving more weight to the error on observation2 than on observation1, even though a 5,000`$` error on a 10,000`$` house is a much worse error than a 10,000`$` error on a 1,000,000`$` house.
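The comparison between the two observations can be checked numerically. The sketch below recomputes the squared errors and, as an assumption of this example (it is not a metric used elsewhere in this notebook), the error as a fraction of the true value.

```python
import numpy as np

# The two observations from the example above (MEDV in thousands of dollars)
obs_true = np.array([10.0, 1000.0])
obs_pred = np.array([15.0, 1010.0])

# What MSE averages: the squared difference for each observation
squared_errors = (obs_true - obs_pred) ** 2

# Error relative to the true value (illustrative only)
relative_errors = np.abs(obs_true - obs_pred) / obs_true

for i, (sq_err, rel_err) in enumerate(zip(squared_errors, relative_errors), start=1):
    print(f"observation{i}: squared error = {sq_err:.0f}, relative error = {rel_err:.0%}")
```

The second observation contributes four times as much to the MSE, although its prediction is off by a much smaller fraction of the house's value.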

The MSE is measured in squared units (squared dollars?), which is hard to interpret, so there is another metric called Root Mean Squared Error (RMSE) that is just the square root of the MSE.

In [None]:
mse = metrics.mean_squared_error(y_test, predictions)
print("The Mean Squared Error is {:.3f} (in thousands of dollars, squared)".format(mse))

### <font color='#eb3483'>Root Mean Squared Error (RMSE)</font>

Root Mean Squared Error (RMSE) is just the square root of the MSE, and it is measured in the same units as the target variable.

$$\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i -\hat{y}_i)^2}$$

Similarly to MSE, RMSE is more sensitive than MAE to variations in errors. Here we can see an example of this.

![mae_vs_mse](media/mse_vs_mae.png)

In this example we can see that in all 3 cases MAE remains the same, while RMSE can be equal to MAE or much larger depending on the error distribution.

So in those cases where we care about avoiding big mistakes we can use RMSE. For example, if we are predicting student grades, we might not care that much about individual errors (predicting a 10 when the truth was a 2 is not a big deal), but about the overall performance (and we could use MAE). If we are predicting house values in order to purchase them, predicting a million dollars when the actual house price is 20,000`$` is an error we can't afford!
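The same effect can be reproduced with a few numbers. This is a minimal sketch with three made-up sets of errors (they are not the values shown in the figure) that all have the same MAE but a different RMSE.

```python
import numpy as np

# Three made-up sets of prediction errors with the same average size
error_cases = {
    "evenly spread": np.array([2.0, 2.0, 2.0, 2.0]),
    "slightly uneven": np.array([1.0, 1.0, 3.0, 3.0]),
    "one big miss": np.array([0.0, 0.0, 0.0, 8.0]),
}

case_metrics = {}
for name, errors in error_cases.items():
    case_mae = np.mean(np.abs(errors))        # mean absolute error
    case_rmse = np.sqrt(np.mean(errors ** 2))  # root mean squared error
    case_metrics[name] = (case_mae, case_rmse)
    print(f"{name:16s} MAE = {case_mae:.3f}  RMSE = {case_rmse:.3f}")
```

RMSE equals MAE only when every error has the same size; the more the total error is concentrated in a few observations, the larger RMSE becomes.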

In [None]:
# compute the rmse "manually" as the square root of the MSE
rmse = np.sqrt(metrics.mean_squared_error(y_test, predictions))
print("The Root Mean Squared Error is {:.3f} (in thousands of dollars)".format(rmse))

### <font color='#eb3483'>R2 (Coefficient of Determination)</font>

The Coefficient of Determination (R2, pronounced *R-squared*) measures the proportion of the variance of the target that can be explained by the model.

R2 is at most 1 (a model explaining all the variance would have an $r^2$ of 1). A model that always predicts the mean of the target gets 0, and the value returned by `metrics.r2_score` can be negative for a model that does worse than that.

[There are many ways to measure $r^2$](https://en.wikipedia.org/wiki/Coefficient_of_determination), but one of the simplest is the squared [Pearson correlation](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) between the true target and the prediction. For a linear regression with an intercept, evaluated on the data it was fitted on, this is the same as:

$$r^2=\frac{\sum_{i=1}^n(\hat y_i-\bar y)^2}{\sum_{i=1}^n(y_i-\bar y)^2}.$$
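Another common definition, and the one `metrics.r2_score` computes, is one minus the ratio between the residual sum of squares and the total sum of squares. The sketch below implements that definition with NumPy on made-up numbers (the helper name `r2_from_residuals` is hypothetical) to show which values it takes for a perfect model, a mean-only model and a model that is worse than the mean.

```python
import numpy as np

def r2_from_residuals(y_true, y_pred):
    """R2 computed as 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1 - ss_res / ss_tot

# Made-up target values
r2_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

r2_perfect = r2_from_residuals(r2_true, r2_true.copy())
r2_mean_only = r2_from_residuals(r2_true, np.full_like(r2_true, r2_true.mean()))
r2_reversed = r2_from_residuals(r2_true, r2_true[::-1])

print(f"perfect predictions:   {r2_perfect:.1f}")
print(f"always predict mean:   {r2_mean_only:.1f}")
print(f"predictions reversed:  {r2_reversed:.1f}")
```

With this definition the score has no lower bound: predictions that are worse than simply guessing the mean give a negative value.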

#### <font color='#eb3483'>Adjusted R2</font>

There is an updated version of R2 called **Adjusted R-squared** that takes into consideration model complexity (so it penalizes a complex model versus a simple one).

$$1 - \frac{(1 - R^2)(n-1)}{(n-k-1)}$$

where `n` is the number of observations and `k` is the number of model coefficients.

<!--Contrary to `MAE` and `MSE` the higher the R2 the better.-->

In [None]:
model_r2 = metrics.r2_score(y_test, predictions)
model_r2

In [None]:
# np.corrcoef returns the 2x2 correlation matrix, so we take the off-diagonal entry
pearson_corr = np.corrcoef(y_test, predictions)[0, 1]
model_r2_ = pearson_corr**2
model_r2_

We can easily calculate the adjusted r² value by hand, to account for model complexity. This value depends on the number of coefficients and imposes a penalty for additional coefficients.
<!--(number of coefficients). 
, the regularization lesson goes further into the issue of model complexity.-->

In [None]:
len(model.coef_)

In [None]:
n = len(X_test)
k = len(model.coef_)
r2 = model_r2

adjusted_r2_model = 1 - ((1-r2)*(n-1)/(n-k-1))
adjusted_r2_model

We see the adjusted r² of the model is smaller than the original r².

We can now evaluate our models. First we create a dictionary to store the results, and a function to help us evaluate models. 

In [None]:
RESULTS = {}

def evaluate_model(y_true, y_pred):
    """Function to evaluate models, you could add more metrics here!"""
    return {
        "mae": metrics.mean_absolute_error(y_true, y_pred),
    }

In [None]:
# create the model
model_ols = LinearRegression()

In [None]:
# first train and predict without splitting data
model_ols.fit(X=boston['data'], y=boston['target'])
model_ols_preds = model_ols.predict(boston['data'])

# store results
RESULTS["ols"] = evaluate_model(
    boston["target"],
    model_ols_preds,
)
RESULTS

In [None]:
# train and predict using just training dataset
model_ols.fit(X=X_train, y=y_train)
model_ols_train_preds = model_ols.predict(X_train)

RESULTS["ols_train"] = evaluate_model(
    y_train,
    model_ols_train_preds
)
RESULTS

In [None]:
# predict using the test dataset
model_ols_test_preds = model_ols.predict(X_test)
RESULTS["ols_test"] = evaluate_model(
    y_test,
    model_ols_test_preds
)
RESULTS

In [None]:
pd.DataFrame(RESULTS)

We can see that we get worse results on the test set than on the training set.

We could just stop here and say *"Our model MAE on the test dataset is 3.56..."*, and we could think everything is ok with this because that is the error on unseen data.

However, this would be a big mistake. Why?

Remember, we used a specific random state, `random_state=13`. What would happen if we used another seed, for example `random_state=42`?

**Let's run everything again with a different random seed**

In [None]:
RESULTS = {}

X_train, X_test, y_train, y_test = train_test_split(
     boston["data"],   # X
     boston["target"], # y
     test_size=0.33, 
     random_state=42
)

model_ols = LinearRegression()

model_ols.fit(X=X_train, y=y_train)
model_ols_train_preds = model_ols.predict(X_train)
model_ols_test_preds = model_ols.predict(X_test)


RESULTS["ols_train2"] = evaluate_model(
    y_train,
    model_ols_train_preds
)

RESULTS["ols_test2"] = evaluate_model(
    y_test,
    model_ols_test_preds
)

pd.DataFrame(RESULTS)

**The test MAE is lower than the train one!!, WHAAAT!!??** How can that even be possible???

Very simple: it just turns out that the split generated with seed 42 puts observations that are easier to predict in the test dataset.
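To see how much a single split can move the result, the sketch below repeats the split with ten different seeds and records the test MAE each time. It uses synthetic data from `make_regression` as a stand-in (an assumption of this example, not the Boston dataset), and the variable names are chosen so they don't overwrite the ones used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Boston dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=20.0, random_state=0)

seed_maes = []
for seed in range(10):
    # Same data and same model, only the split changes
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.33, random_state=seed
    )
    demo_model = LinearRegression().fit(X_tr, y_tr)
    seed_maes.append(mean_absolute_error(y_te, demo_model.predict(X_te)))

seed_maes = np.array(seed_maes)
print(f"test MAE over 10 seeds: min={seed_maes.min():.2f}, "
      f"max={seed_maes.max():.2f}, std={seed_maes.std():.2f}")
```

The spread between the smallest and largest value is due only to which observations happened to land in the test set.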

## <font color='#eb3483'> Cross-validation </font>

One way to avoid relying on a single lucky split is **Cross Validation**. When we do cross validation we split the data into **N** partitions; for each partition, we train the model on the remaining N-1 partitions and evaluate on the one left out. That way we get N evaluation errors, each from a model trained and evaluated on different data, so we don't rely on a single split. Finally, we make predictions on the test set to get an unbiased view of model performance.
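The procedure can be written out by hand to see what happens in each fold. This is a minimal sketch using only NumPy: the data is synthetic, and an ordinary least squares fit with `np.linalg.lstsq` stands in for the linear regression model (both are assumptions of this example).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 3*x1 - 2*x2 + 1 + noise
X_cv = rng.normal(size=(100, 2))
y_cv = 3 * X_cv[:, 0] - 2 * X_cv[:, 1] + 1 + rng.normal(scale=0.5, size=100)

n_folds = 5
shuffled = rng.permutation(len(X_cv))     # shuffle the row indices once
folds = np.array_split(shuffled, n_folds)  # N partitions of row indices

fold_maes = []
for i, val_idx in enumerate(folds):
    # Train on the remaining N-1 partitions
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    A_train = np.column_stack([np.ones(len(train_idx)), X_cv[train_idx]])
    coef, *_ = np.linalg.lstsq(A_train, y_cv[train_idx], rcond=None)

    # Evaluate on the partition that was left out
    A_val = np.column_stack([np.ones(len(val_idx)), X_cv[val_idx]])
    fold_preds = A_val @ coef
    fold_maes.append(np.mean(np.abs(y_cv[val_idx] - fold_preds)))

print("MAE per fold:", np.round(fold_maes, 3))
print("average MAE: ", round(float(np.mean(fold_maes)), 3))
```

Every observation is used for evaluation exactly once, and never by a model that saw it during training.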

For example, a 5 fold cross validation would look like this:

![title](media/cross_validation.png)

<!--![title](media/grid_search_cross_validation.png)-->

`scikit-learn` has a function `cross_val_score` that evaluates a model doing cross validation.

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
cross_val_score?

To use `cross_val_score`, we need a model, the independent and dependent variables (X and y). We also have to define an evaluation metric (`scoring` argument) and the number of splits (`cv` argument).

In [None]:
model = LinearRegression()

We can choose any of the scorers defined on `sklearn.metrics.SCORERS` or we can create our own scoring function.

In [None]:
from sklearn.metrics import SCORERS
SCORERS.keys()

For example, if we want to perform cross validation using the mean absolute error (MAE), we can use the scoring `neg_mean_absolute_error`, which is the name of the mean absolute error in the `SCORERS` dictionary above.

**Note**: In scikit-learn, the "bad" scores (i.e. errors) are returned as negative numbers, so that a higher score is always better.

In [None]:
results_cross_validation = cross_val_score(
    estimator=model, 
    X=boston["data"],
    y=boston["target"],
    scoring="neg_mean_absolute_error", 
    cv=5,
)

`cross_val_score` returns the score on the held-out partition for each one of the splits (in this case, 5 splits). Typical numbers of splits are between 3 and 10.

In [None]:
results_cross_validation

We can finally average the errors over the partitions to get a better estimate of the model's performance.

In [None]:
# we use abs to get positive MAE values
model_mae = abs(results_cross_validation.mean())
print(f"the average MAE for 5 split cross validation is : {model_mae}")

We can also define our own cross validation evaluation metric, which is simply a function that receives the trained estimator, X and y as arguments and returns the error.

For example, if we want to use RMSE as the scoring for cross validation, we can create our own scoring function:

In [None]:
def rmse_cross_val(estimator, X, y):
    y_pred = estimator.predict(X)
    return np.sqrt(metrics.mean_squared_error(y, y_pred))


In [None]:
results_cross_validation_rmse = cross_val_score(
    estimator=model, 
    X=boston["data"],
    y=boston["target"],
    scoring=rmse_cross_val, 
    cv=5, 
)

mean_rmse_cv = results_cross_validation_rmse.mean()
print(f"the average RMSE for 5 split cross validation is : {mean_rmse_cv}")
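Another way to build a custom scorer is `sklearn.metrics.make_scorer`, which wraps a plain metric function of `(y_true, y_pred)`. The sketch below is one possible variant, assuming synthetic data from `make_regression` and a hypothetical helper called `rmse_metric`; passing `greater_is_better=False` makes the scorer return the negated error, following the same sign convention as the built-in `neg_*` scorers.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import cross_val_score

def rmse_metric(y_true, y_pred):
    """Plain RMSE on true values and predictions."""
    return np.sqrt(mean_squared_error(y_true, y_pred))

# greater_is_better=False flips the sign, so a higher score still means a better model
neg_rmse_scorer = make_scorer(rmse_metric, greater_is_better=False)

# Synthetic stand-in data (not the Boston dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

neg_rmse_scores = cross_val_score(
    LinearRegression(), X_demo, y_demo, scoring=neg_rmse_scorer, cv=5
)
print("scores per split:", np.round(neg_rmse_scores, 3))
print("average RMSE:    ", round(float(-neg_rmse_scores.mean()), 3))
```

The sign convention matters mostly when the scorer is used to pick between models, because scikit-learn's model selection tools treat the highest score as the best one.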

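**A bit on hyperparameters.** The function `cross_val_score` has the argument `cv`, which sets the number of partitions. Changing its value does not change the model itself, only how its performance is estimated, so the scores we get will vary with it.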

In [None]:
cross_val_score?
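As a rough illustration of what `cv` controls, the sketch below runs `cross_val_score` with 3, 5 and 10 partitions on the same model. The data is synthetic (from `make_regression`, an assumption of this example) and the names are chosen so they don't overwrite the notebook's own variables.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (not the Boston dataset)
X_demo, y_demo = make_regression(n_samples=150, n_features=4, noise=15.0, random_state=1)

cv_summary = {}
for n_splits in (3, 5, 10):
    fold_scores = cross_val_score(
        LinearRegression(), X_demo, y_demo,
        scoring="neg_mean_absolute_error", cv=n_splits,
    )
    # number of scores returned, average MAE, spread between partitions
    cv_summary[n_splits] = (len(fold_scores), -fold_scores.mean(), fold_scores.std())

for n_splits, (n_scores, mean_mae, spread) in cv_summary.items():
    print(f"cv={n_splits:2d}: {n_scores} scores, mean MAE={mean_mae:.2f}, std={spread:.2f}")
```

More partitions mean more models to train, each on a larger share of the data and evaluated on a smaller held-out part.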

## <font color='#eb3483'>BONUS: The function cross_validate</font>

If we want to get more information about each split, we can use the function `cross_validate` that returns more information besides the test error. It also accepts multiple scoring functions instead of just one. Think of `cross_val_score` as the simplified version of `cross_validate`.

In [None]:
from sklearn.model_selection import cross_validate

scoring_functions = {"mae": "neg_mean_absolute_error", "rmse": rmse_cross_val}

scores = cross_validate(
    model,                # estimator
    boston["data"],            # X
    boston["target"],          # y
    scoring=scoring_functions, # a single scorer or a dict with multiple scoring functions
    cv=10,                      # number of partitions
    return_train_score=True    # return the training error, not only the test error
)

In [None]:
results_df = pd.DataFrame(scores)

In [None]:
results_df

We get results for each one of the partitions:
- fit time, how long it takes to train the model
- score time, how long it takes to predict
- test and train scores for each one of the scoring functions

Now we can calculate the averages for all the partitions:

In [None]:
pd.DataFrame(scores).mean()
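Since the built-in error scorers are negative, it can help to flip their sign and to look at the spread as well as the mean. The sketch below does that for a `cross_validate` result; it uses synthetic data and a single `"mae"` scorer (both assumptions of this example), which produces the columns `test_mae` and `train_mae`.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (not the Boston dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

demo_scores = cross_validate(
    LinearRegression(), X_demo, y_demo,
    scoring={"mae": "neg_mean_absolute_error"},
    cv=5,
    return_train_score=True,
)
demo_df = pd.DataFrame(demo_scores)

# Turn the negated errors back into positive MAE values
error_cols = ["test_mae", "train_mae"]
demo_df[error_cols] = -demo_df[error_cols]

# Mean and standard deviation across the partitions
demo_summary = demo_df[error_cols].agg(["mean", "std"])
print(demo_summary)
```

A large standard deviation relative to the mean suggests that the estimate depends heavily on which partition is held out.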