<a href="https://colab.research.google.com/github/ML-Challenge/week3-supervised-learning/blob/master/L2.Model%20Validation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" /></a>

## Setup

In [None]:
# Download utils.py to working directory
import urllib.request
urllib.request.urlretrieve('https://raw.githubusercontent.com/ML-Challenge/week3-supervised-learning/master/utils.py', 'utils.py')

In [1]:
# Import utils
# We'll be using this module throughout the lesson
import utils

In [2]:
# Import dependencies
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
# Set the default size of all plots
plt.rcParams['figure.figsize'] = [11, 7]

## Creating train, test and validation datasets

We define a holdout dataset as any data that is not used for training and is only used to assess model performance. The available data is split into two datasets: one used for training, and one that is simply off limits while we are training our models, called a test (or holdout) dataset.

This step is vital to model validation and is the single most important step we can take to get a trustworthy estimate of our model's performance. We use the holdout sample as a testing dataset so that we have an unbiased estimate of our model's performance after we are completely done training.

Generally, a good rule of thumb is using an `80:20` split. This equates to setting aside twenty percent of the data for the test set and using the rest for training. We might choose to use more training data when the overall data is limited (`90:10`), or less training data if the modeling method is computationally expensive (`70:30`).

### Dataset for preliminary testing?

We know that the test set is off limits until we are completely done training, but what do we do when testing model parameters? For example, if we run a random forest model with 100 trees and one with 1000 trees, which dataset do we use to test these results? When testing parameters, tuning hyper-parameters, or anytime we are frequently evaluating model performance we need to create a second holdout sample, called the validation dataset.

For this dataset, the available data is the original training dataset, which is then split in the same manner used to split the original complete dataset. We use the validation sample to assess our model's performance when using different parameter values.

To create both holdout samples, the testing and the validation datasets, we use scikit-learn's `train_test_split()` function twice. The first call creates the training and testing datasets as usual:

```
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
```

In the second call, we split this so-called temporary training dataset into the final training and validation datasets:

```
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size = 0.25, random_state = 42)
```

In the above example, we first used an 80/20 split to create the test set. With the remaining 80% of the data, we used a 75/25 split to create a validation dataset. This leaves us with 60% of the data for training, 20% for validation, and 20% for testing.
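The two-step split can be checked on toy data. The sketch below is only an illustration: `X` and `y` are synthetic arrays made up for this example, not the lesson's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration): 100 rows, 3 features
X = np.arange(300).reshape(100, 3)
y = np.arange(100)

# First split: hold out 20% of all rows as the test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Second split: 25% of the remaining 80% becomes the validation set
# (0.25 * 0.8 = 0.2 of the original data)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

With 100 rows, the row counts read directly as percentages of the original data.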

![Train, test, validation](assets/train_test_validation.png)
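One way the three datasets might be used together is sketched below. Everything here is an assumption for illustration: the data is synthetic (`make_regression`), the candidate tree counts are reduced to 10 and 100 so the example runs quickly, and the mean absolute error (covered in the next section) is just one possible score.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption for illustration)
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=42)

# 60/20/20 split into training, validation, and test sets
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42)

# Compare candidate settings using the validation set only
val_errors = {}
for n_trees in (10, 100):
    model = RandomForestRegressor(n_estimators=n_trees, random_state=42)
    model.fit(X_train, y_train)
    val_errors[n_trees] = mean_absolute_error(y_val, model.predict(X_val))

best_n_trees = min(val_errors, key=val_errors.get)

# Touch the test set once, after the choice has been made
final_model = RandomForestRegressor(n_estimators=best_n_trees, random_state=42)
final_model.fit(X_train, y_train)
test_error = mean_absolute_error(y_test, final_model.predict(X_test))

print(val_errors, best_n_trees, test_error)
```

The test set is scored exactly once, so its error is not influenced by the choice between the candidate models.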

## Accuracy metrics: regression models

Now that we've learned about holdout samples let's discuss accuracy metrics used when validating models - starting with regression models. Remember, regression models are built for continuous variables. This could be predicting the number of points a player will score tomorrow, or the number of puppies a dog is about to have!

### Mean absolute error (MAE)

To assess the performance of a regression model, we can use the mean absolute error. It is the simplest and most intuitive error metric: the average absolute difference between the actual values $y_i$ and the predictions $\hat{y}_i$:

$$ MAE = \frac{\sum_{i=1}^n |y_i - \hat{y}_i|}{n} $$

If a dog had six puppies, but we predicted only four, the absolute difference would be two. This metric treats all points equally and is less sensitive to outliers than metrics that square the errors. When dealing with applications where we don't want large errors to have a major impact, the mean absolute error can be used. An example could be predicting a car's monthly gas bill, where an outlier may have been caused by a one-time road trip.


Communicating modeling results can be difficult. However, most clients understand that, on average, a predictive model was off by some number. This makes explaining the mean absolute error easy. For example, when predicting the number of wins for a basketball team, if we predict 42, and they end up with 40, we can easily explain that the error was two wins.

In this example, we have two arrays: `y_test`, the true number of wins for all 30 NBA teams in 2017, and `predictions`, which contains a prediction for each team. Let's calculate the MAE both manually and using `sklearn`.

In [3]:
from sklearn.metrics import mean_absolute_error

In [4]:
# Manually calculate the MAE
n = len(utils.nba_predictions)
mae_one = sum(abs(utils.nba_y_test - utils.nba_predictions)) / n
print('With a manual calculation, the error is {}'.format(mae_one))

With a manual calculation, the error is 5.9


In [5]:
# Use scikit-learn to calculate the MAE
mae_two = mean_absolute_error(utils.nba_y_test, utils.nba_predictions)
print('Using scikit-learn, the error is {}'.format(mae_two))

Using scikit-learn, the error is 5.9


### Mean squared error (MSE)

Next is the mean squared error (MSE). It is the most widely used error metric for regression models. It is calculated similarly to the mean absolute error, but this time we square the difference term.

$$ MSE = \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n} $$

The MSE allows larger errors to have a larger impact on the model. Using the previous car example, if we knew once a year we might go on a road trip, we might expect to occasionally have a large error and would want our model to pick up on these trips.

Let's focus on the 2017 NBA predictions again. Every year, there are at least a couple of NBA teams that win way more games than expected. The MAE does not reflect these bad predictions as strongly as the MSE does. Squaring the large errors from bad predictions makes the accuracy look worse.
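To make the effect of one bad prediction concrete, here is a minimal sketch. The win totals and predictions are made up for illustration and are not the lesson's NBA data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up win totals (an assumption for illustration)
y_true = np.array([40, 42, 50, 35])
close_preds = np.array([41, 40, 49, 36])   # absolute errors: 1, 2, 1, 1
one_bad_pred = np.array([41, 40, 49, 55])  # absolute errors: 1, 2, 1, 20

mae_close = mean_absolute_error(y_true, close_preds)
mse_close = mean_squared_error(y_true, close_preds)
mae_bad = mean_absolute_error(y_true, one_bad_pred)
mse_bad = mean_squared_error(y_true, one_bad_pred)

print(mae_close, mse_close)  # 1.25 1.75
print(mae_bad, mse_bad)      # 6.0 101.5
```

A single prediction that misses by 20 wins raises the MAE by a factor of about 5, while the MSE grows by a factor of 58.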

In [6]:
from sklearn.metrics import mean_squared_error

In [7]:
n = len(utils.nba_predictions)
# Manually calculate the MSE
mse_one = sum((utils.nba_y_test - utils.nba_predictions)**2) / n
print('With a manual calculation, the error is {}'.format(mse_one))

With a manual calculation, the error is 49.1


In [8]:
# Use the scikit-learn function to calculate MSE
mse_two = mean_squared_error(utils.nba_y_test, utils.nba_predictions)
print('Using scikit-learn, the error is {}'.format(mse_two))

Using scikit-learn, the error is 49.1


### MAE vs MSE

Picking between the MAE and the MSE comes down to the application. The two results are in different units, though (the MAE is in the units of the target, while the MSE is in squared units), and should not be directly compared!
To practice these metrics, let's use 

### Performance on data subsets

In professional basketball, there are two conferences, the East and the West. Coaches and fans often only care about how teams in their own conference will do this year.

We have been working on an NBA prediction model and would like to determine if the predictions were better for the East or West conference. We added a third array to the data called `utils.nba_labels`, which contains an "E" for the East teams, and a "W" for the West.

In [9]:
# Find the East conference teams
east_teams = utils.nba_labels == "E"

In [10]:
# Create arrays for the true and predicted values
true_east = utils.nba_y_test[east_teams]
preds_east = utils.nba_predictions[east_teams]

In [11]:
# Print the accuracy metrics
print('The MAE for East teams is {}'.format(mean_absolute_error(true_east, preds_east)))

The MAE for East teams is 6.733333333333333


In [12]:
# Create arrays for the true and predicted values
true_west = utils.nba_y_test[~east_teams]
preds_west = utils.nba_predictions[~east_teams]

In [13]:
# Print the accuracy metrics
print('The MAE for West teams is {}'.format(mean_absolute_error(true_west, preds_west)))

The MAE for West teams is 5.066666666666666


It looks like the Western conference predictions were about 1.7 games better on average (an MAE of 5.07 versus 6.73). Over the past few seasons, the Western teams have generally won the same number of games as the experts have predicted. Teams in the East are just not as predictable as those in the West.
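The same per-group comparison can also be written as a grouped computation. This is a minimal sketch, assuming a small made-up table: the column names and values are hypothetical and do not come from `utils`.

```python
import pandas as pd

# Made-up results (an assumption for illustration)
results = pd.DataFrame({
    "conference": ["E", "E", "W", "W"],
    "actual": [50, 30, 60, 40],
    "predicted": [45, 36, 58, 41],
})

# Absolute error per team, then the mean within each conference
results["abs_error"] = (results["actual"] - results["predicted"]).abs()
mae_by_conference = results.groupby("conference")["abs_error"].mean()

print(mae_by_conference["E"], mae_by_conference["W"])  # 5.5 1.5
```

Grouping scales to any number of subsets without writing a separate mask for each one.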

## Classification metrics

We already understand classification models; now let's look at their accuracy metrics. Classification accuracy metrics are quite a bit different than regression ones. Remember, with classification models, we are predicting what category an observation falls into. There are a lot of accuracy metrics available: `precision`, `recall` (also called `sensitivity`), `accuracy`, `specificity`, `F1-Score` and its variations, and several others.

We will focus on precision, recall, and accuracy, as each of these is easy to understand and has very practical applications. One way to calculate these metrics is to use the values from the confusion matrix.

### Confusion matrices

When making predictions, especially if there is a binary outcome, this matrix is one of the first outputs we should preview. When we have a binary outcome, the confusion matrix is a 2x2 matrix that shows how our predictions fared across the two outcomes. For example, for predictions of `0` that were actually `0` (or true negatives), we look at the `0,0` square of the matrix.

![Confusion matrix](assets/confusion_matrix.png)

All of the above accuracy metrics can be calculated using the values from this matrix, and it is a great way to visualize the initial results of our classification model.

We can create a confusion matrix using `scikit-learn`'s `confusion_matrix()` function. When dealing with binary data, this will print out a 2x2 array which represents the confusion matrix. In this matrix, the row index represents the true category, and the column index represents the predicted category. Therefore, the `1,0` entry of the array represents the number of true `1s` that were predicted to be `0`, or 8 in this example.
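To see the orientation in practice, here is a minimal sketch. The label arrays are constructed for illustration so that they reproduce the counts used in this section (23 true negatives, 7 false positives, 8 false negatives, 62 true positives). They are not the original data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Constructed labels (an assumption for illustration): 30 actual 0s, 70 actual 1s
y_true = np.array([0] * 30 + [1] * 70)
y_pred = np.array([0] * 23 + [1] * 7 + [0] * 8 + [1] * 62)

# Rows are the true category, columns are the predicted category
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[23  7]
#  [ 8 62]]

# For binary labels, ravel() flattens the matrix row by row
tn, fp, fn, tp = cm.ravel()
print(cm[1, 0])  # 8 true 1s that were predicted to be 0
```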

### Accuracy

Accuracy is the easiest metric to understand and represents the overall ability of the model to predict the correct classification. Using the confusion matrix, we add the values that were predicted `0` and are actually `0` (which are called true negatives) to the values predicted to be `1` that are `1` (called true positives), and then divide by the total number of observations:

$$ Accuracy = \frac{TN + TP}{N} = \frac{23 + 62}{23 + 7 + 8 + 62} = 0.85 $$

In this case, our accuracy was 85%. Here, a true positive is a predicted 1 that is actually a 1. However, if our categories were win or loss, a true positive might be a predicted win that was actually a win.

### Precision

Next is precision or the number of true positives out of all predicted positive values:

$$ Precision = \frac{TP}{TP + FP} = \frac{62}{62 + 7} = 0.90 $$

Precision is used when we don't want to over-predict positive values. If it costs $2,000 to fly in a potential new employee, a company may only hold on-campus interviews with individuals that they really believe are going to join their company. In the example, almost 9 out of 10 predicted 1's would have joined the company.

### Recall

The recall metric is about finding all positive values:

$$ Recall = \frac{TP}{TP + FN} = \frac{62}{62 + 8} = 0.886 $$

Here we correctly predicted 62 true positives and had 8 false negatives. Our recall is 62 out of 70. Recall is used when we can't afford to miss any positive values. For example, even if a patient has a small chance of having cancer, we may want to give them additional tests. The cost of missing a patient who has cancer is far greater than the cost of additional screenings for that patient.

In scikit-learn, accuracy, precision, and recall are all called the same way: pass the true values first, followed by the predicted values.

```
from sklearn.metrics import accuracy_score, precision_score, recall_score
accuracy_score(y_test, test_predictions) # .85
precision_score(y_test, test_predictions) # .8986
recall_score(y_test, test_predictions) # .8857
```

Each function returns a single value. In this example, we got the same values that we calculated using the confusion matrix.
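A self-contained way to check this agreement is sketched below. The label arrays are constructed for illustration so that they match the counts in the formulas above (TN = 23, FP = 7, FN = 8, TP = 62). They stand in for `y_test` and `test_predictions`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Constructed labels (an assumption for illustration) matching the counts above
y_true = np.array([0] * 30 + [1] * 70)
y_pred = np.array([0] * 23 + [1] * 7 + [0] * 8 + [1] * 62)
tn, fp, fn, tp = 23, 7, 8, 62

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

# The library results agree with the formulas computed from the counts
assert np.isclose(accuracy, (tn + tp) / (tn + fp + fn + tp))
assert np.isclose(precision, tp / (tp + fp))
assert np.isclose(recall, tp / (tp + fn))

print(round(accuracy, 4), round(precision, 4), round(recall, 4))  # 0.85 0.8986 0.8857
```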

### Confusion matrices, again

Creating a confusion matrix in Python is simple. The biggest challenge is making sure we understand the orientation of the matrix. This exercise checks our understanding of the `sklearn` implementation of confusion matrices. Here, we have created a `classifier` model using the `tic_tac_toe` dataset to predict outcomes of 0 (a loss) or 1 (a win) for Player One.

In [14]:
from sklearn.metrics import confusion_matrix

# Create predictions
test_predictions = utils.classifier.predict(utils.X_test)

# Create and print the confusion matrix
cm = confusion_matrix(utils.y_test, test_predictions)
print(cm)

[[113 191]
 [  7 552]]


In [15]:
# Print the true positives (actual 1s that were predicted 1s)
print("The number of true positives is: {}".format(cm[1, 1]))

The number of true positives is: 552


### Precision vs. recall

The accuracy metrics we use to evaluate our model should always be based on the specific application. For this example, let's assume we are a really sore loser when it comes to playing Tic-Tac-Toe, but only when we are certain that we are going to win.

We need to choose the most appropriate accuracy metric, either precision or recall, to complete this example. But remember, if we think we are going to win, we better win!

In [18]:
from sklearn.metrics import precision_score

In [19]:
# Calculate the precision score from the true and predicted values
score = precision_score(utils.y_test, test_predictions)

# Print the final result
print("The precision value is {0:.2f}".format(score))

The precision value is 0.74
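For contrast, both metrics can be recomputed by hand from the confusion matrix printed earlier in this section. This is a sketch that simply copies those four counts into an array.

```python
import numpy as np

# Confusion matrix printed earlier (rows: actual, columns: predicted)
cm = np.array([[113, 191],
               [7, 552]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # 552 / 743
recall = tp / (tp + fn)     # 552 / 559

print("Precision: {:.2f}, recall: {:.2f}".format(precision, recall))
# Precision: 0.74, recall: 0.99
```

The model finds almost every actual win (high recall), but the 191 false positives mean that roughly one in four predicted wins is actually a loss, which is what matters to a sore loser.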


## The bias-variance tradeoff

### Error due to under/over-fitting

### Are we underfitting?