# Chapter #3: Cross Validation

In [1]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, KFold, cross_val_score, RandomizedSearchCV
from sklearn.metrics import mean_absolute_error as mae
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, precision_score, make_scorer

## 1. The problems with holdout sets

1. The problems with holdout sets
> Hello again - let's continue our quest for validating machine learning models by discussing why traditional validation approaches still have pitfalls.

2. Transition validation
> The typical modeling procedure looks something like this. We take a dataset, use, say, 80% for training, and the remaining 20% for testing. We learned how to do this a couple of lessons ago using scikit-learn. Using the train_test_split() function, we split our data and fit a random forest model on this single split. Here we have output the mean absolute error (MAE), which was 10.24.

3. Traditional training splits
> If we repeat this process with a different random seed though, we might get different results. Consider the following two samples from the ultimate candy-power-ranking dataset: s1 and s2. This dataset consists of 85 data points about candy characteristics, and we have randomly selected 60 candies for each sample. Only 39 of the 60 candies overlap between the two datasets.

4. Traditional training splits
> Furthermore, the first sample contains 34 chocolate candies, and the second sample only contains 30.

5. The split matters
> Why is this important? Well, we have already seen that selecting 60 candies for a sample can be highly variable. If we split the candy dataset into 60 candies for training and 25 candies for testing, and build the exact same machine learning model, we will probably get varying results. In this example alone, the second testing error is over 12% worse. Using the first sample, you would report an error of 10.32. The second gives an error of 11.56. These results are way too different.

6. Train, validation, test
> Even the train, validation, test procedure we discussed earlier is not safe from the problems we could have with holdout samples, especially when we have limited data. Consider this example. We created a train, test, and validation split. We fit a random forest model, and maybe we even did some hyperparameter tuning or testing of various models. In the end, we decided on this random forest regressor model. Look at how close the validation and testing errors are to each other: 9.18 and 8.98. This is awesome, right?

7. Round 2
> Let's run the same model again, but this time we will run it with a different random seed. The errors were 8.73 and 10.91, which is a big problem. This can happen when using the traditional validation approach, especially with limited data. We think our model is validated, but if we just change the sample we use, we get drastically different results. This random forest model with only 25 trees and 4 features does not seem to generalize as well to new data as we would expect.
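
The seed sensitivity described above can be reproduced with a small sketch. The data here is synthetic (generated with `make_regression` as a stand-in for the candy dataset, which is an assumption), and the seeds and variable names are arbitrary. The only point is that an identical model reports a different holdout error for each split.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a small dataset (85 rows, like the candy data).
X_demo, y_demo = make_regression(
    n_samples=85, n_features=11, noise=10.0, random_state=0
)

seeds = [1, 2, 3, 4, 5]
holdout_errors = []
for seed in seeds:
    # Only the split changes between runs; the model settings stay fixed.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    model = RandomForestRegressor(n_estimators=25, random_state=1111)
    model.fit(X_tr, y_tr)
    holdout_errors.append(mean_absolute_error(y_te, model.predict(X_te)))

for seed, err in zip(seeds, holdout_errors):
    print(f"seed={seed}: MAE={err:.2f}")
print(f"Spread (max - min): {max(holdout_errors) - min(holdout_errors):.2f}")
```

The exact numbers depend on the synthetic data, but the spread between the best and worst split is the quantity to watch.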

8. Holdout set exercises
> To overcome this limitation of holdout sets, we use something called cross-validation, which is the gold-standard for model validation! Before we fully introduce cross-validation, let's discover why we need it with a couple of exercises.

### 1.1. Two samples 

After building several classification models based on the `tic_tac_toe` dataset, you realize that some models do not generalize as well as others. You have created training and testing splits just as you have been taught, so you are curious why your validation process is not working.

After trying a different train/test split, you noticed differing accuracies for your machine learning model. Before getting too frustrated with the varying results, you have decided to see what else could be going on.

- Getting everything ready.

In [2]:
# Reading the data:
tic_tac_toe = pd.read_csv("./data/tic-tac-toe.csv")

In [3]:
# Exploring the data shape:
tic_tac_toe.shape

(958, 10)

In [4]:
# Exploring the first 5 rows:
tic_tac_toe.head()

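| | Top-Left | Top-Middle | Top-Right | Middle-Left | Middle-Middle | Middle-Right | Bottom-Left | Bottom-Middle | Bottom-Right | Class |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | x | x | x | x | o | o | x | o | o | positive |
| 1 | x | x | x | x | o | o | o | x | o | positive |
| 2 | x | x | x | x | o | o | o | o | x | positive |
| 3 | x | x | x | x | o | o | o | b | b | positive |
| 4 | x | x | x | x | o | o | b | o | b | positive |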


In [5]:
# Encoding the target column into (0 / 1):
tic_tac_toe['Class'] = tic_tac_toe['Class'].apply(lambda x : 1 if x == 'positive' else 0)

In [6]:
# One-hot encoding the features into (0 / 1) columns:
features = [col for col in tic_tac_toe.columns if col != 'Class']
tic_tac_toe = pd.get_dummies(data=tic_tac_toe, columns=features)

- Create samples `sample1` and `sample2` with 200 observations that could act as possible testing datasets.

In [7]:
# Creating 2 different samples with 2 different seeds:
sample1 = tic_tac_toe.sample(n=200, random_state=1111)
sample2 = tic_tac_toe.sample(n=200, random_state=1171)

- Use the list comprehension statement to find out how many observations these samples have in common.

In [8]:
# Counting how many common observations by list comprehension:
len([idx for idx in sample1.index if idx in sample2.index])

40

In [9]:
# Or we could do it this way:
len(sample1.index.intersection(sample2.index))

40

- Use the Series.value_counts() method to print the values in both samples for column Class.

In [10]:
# Counting the frequency of values in the target column for both samples:
print(f"Sample #1:\n{sample1['Class'].value_counts()}")
print(f"Sample #2:\n{sample2['Class'].value_counts()}")

Sample #1:
1    134
0     66
Name: Class, dtype: int64
Sample #2:
1    123
0     77
Name: Class, dtype: int64


### 1.2. Potential problems

Which of the following statements are TRUE regarding potential problems with holdout samples?

> - A: Using different data splitting methods may lead to varying data in the final holdout samples.
> - B: If you have limited data, your holdout accuracy may be misleading.
> - C: There are no problems. Creating a single train and test sample is the only way to validate models.
> - D: You shouldn't use holdout samples with limited data because you are limiting the potential training data.

Possible Answers:
- A & D.
- C & D.
- A & B.
- A, B, & D.

> A & B.

## 2. Cross-validation 

1. Cross-validation
> Hello everyone - let's push validation a step further and discuss the gold-standard: cross-validation.

2. Cross-validation
> Before, we talked about using 80% of our data for training and 20% for testing. We took this a step further by splitting the 80% of training data into training and validation splits. Previously, we learned that our accuracy metric on this validation set may be misleading, or if we split this data differently, we might get different results.

3. Cross-validation
> For cross-validation we don't just need one of these training/validation splits; we need a bunch of them. This method has us run our single model on various training/validation combinations and gives us a lot more confidence in our final metrics. For this example, we have 5-fold cross-validation. Each time we run the model, a different 80% of the data will be used for training, and a different 20% will be used for validation. And we can do this in such a manner that each data point appears in only one of the validation sets. This ensures that every point is used for validation exactly one time. Although using each point in only one validation set is not required for cross-validation, it is often good practice to do so. Fortunately for us, understanding what this should look like, how it can be done, and why it is important is the hardest part. Actually implementing it is very straightforward.

4. KFold cross-validation with scikit-learn
> scikit-learn's KFold() class gives us a few options for splitting data into several training and validation sets. We can specify the number of splits that we want, whether the data should be shuffled, and, to make our results reproducible, a random state. Here I have generated two arrays to use as data. The X array consists of the numbers 0 through 39, and the y array consists of 20 zeros followed by 20 ones. Next, we create the object kf, which will split our data. It uses KFold() with five splits and no shuffling. To actually split our data, we call kf.split() on X. This only generates indices for us to use. So I don't want you to think that we have generated five training and validation datasets. All we have done is create a generator of indices that can be used for our splits.

5. Accessing indices
> So what's actually in splits if it doesn't contain datasets? The splits variable yields the training and validation indices for the five different splits of X. If we print the length of the indices inside a loop, we see train_index has 32 values, and test_index has eight values, and this is repeated five times. If we print out what these arrays look like after the loop, we see the final split: train_index has the numbers 0 through 31, and test_index has the numbers 32 through 39. Indexing X and y with these indices will give us training and validation data.
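
The index behaviour described above can be checked with a short sketch. It rebuilds the toy arrays from the description; the variable names (`X_toy`, `kf_toy`, and so on) are made up for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data matching the description: 0..39, and 20 zeros followed by 20 ones.
X_toy = np.arange(40)
y_toy = np.array([0] * 20 + [1] * 20)

kf_toy = KFold(n_splits=5)  # five splits, no shuffling

fold_sizes = []
seen_val = []
for train_index, test_index in kf_toy.split(X_toy):
    fold_sizes.append((len(train_index), len(test_index)))
    seen_val.extend(test_index.tolist())

print(fold_sizes)

# After the loop, the variables still hold the indices of the final split.
print(train_index)
print(test_index)

# Indexing the arrays with these indices yields the data for that split.
X_train_toy, X_val_toy = X_toy[train_index], X_toy[test_index]
y_train_toy, y_val_toy = y_toy[train_index], y_toy[test_index]

# Every observation lands in a validation set exactly once.
print(sorted(seen_val) == list(range(40)))
```

Without shuffling, the folds are contiguous blocks, so the last validation fold contains only ones from `y_toy`. That is one reason shuffling is often switched on for ordered data.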

6. Example using splits
> KFold is generally used when we want to fit the same model using KFold cross-validation. We would create the splits, using kf.split(). We would then loop through the train and validation indices, and fit the same model using the new training data. Finally, we create the predictions and keep track of the errors. To see how well the model performed across the five splits that we created, we can look at the mean of the final error scores.
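
A minimal sketch of that loop is shown below. It assumes synthetic regression data from `make_regression` and NumPy arrays, so plain `[]` indexing works; a pandas DataFrame would need `.iloc` instead. All variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic data standing in for a real dataset (an assumption).
X_cv, y_cv = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=0)

kf_cv = KFold(n_splits=5, shuffle=True, random_state=1111)
model_cv = RandomForestRegressor(n_estimators=25, random_state=1111)

fold_errors = []
for train_index, val_index in kf_cv.split(X_cv):
    # Build the training and validation data for this fold.
    X_tr, y_tr = X_cv[train_index], y_cv[train_index]
    X_va, y_va = X_cv[val_index], y_cv[val_index]

    # Refit the same model on the new training data and score the held-out fold.
    model_cv.fit(X_tr, y_tr)
    preds = model_cv.predict(X_va)
    fold_errors.append(mean_squared_error(y_va, preds))

print(f"Per-fold MSE: {np.round(fold_errors, 2)}")
print(f"Mean MSE across folds: {np.mean(fold_errors):.2f}")
```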

7. Practice time
> Let's get started and fold some data!

### 2.1. scikit-learn's KFold()

You just finished running a colleague's code that creates a random forest model and calculates an out-of-sample accuracy. You noticed that your colleague's code did not have a random state, and the errors you found were completely different from the errors your colleague reported.

To get a better estimate for how accurate this random forest model will be on new data, you have decided to generate some indices to use for KFold cross-validation.

- Getting everything ready.

In [11]:
# Reading the data:
candy = pd.read_csv("./data/candy-data.csv")

In [12]:
# Exploring the data shape:
candy.shape

(85, 13)

In [13]:
# Exploring the first 5 rows:
candy.head()

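| | competitorname | chocolate | fruity | caramel | peanutyalmondy | nougat | crispedricewafer | hard | bar | pluribus | sugarpercent | pricepercent | winpercent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100 Grand | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0.732 | 0.86 | 66.971725 |
| 1 | 3 Musketeers | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0.604 | 0.511 | 67.602936 |
| 2 | One dime | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.011 | 0.116 | 32.261086 |
| 3 | One quarter | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.011 | 0.511 | 46.116505 |
| 4 | Air Heads | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.906 | 0.511 | 52.341465 |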


In [14]:
# Dropping the first column:
candy.drop(columns='competitorname', inplace=True)

In [15]:
# defining a function to split the data into X & y:
def split_data(data, y_col):
    
    features = [col for col in data.columns if col != y_col]
    
    X = data[features].copy()
    y = data[y_col].copy()
    
    return X, y

In [16]:
# Splitting the data into feature matrix (X) & target column (y):
X, y = split_data(candy, 'winpercent')

- Call the `KFold()` method to split data using five splits, shuffling, and a random state of 1111.

In [17]:
# Creating a kfold object:
kf = KFold(n_splits=5, shuffle=True, random_state=1111)

- Use the `split()` method of `KFold` on `X`.

In [18]:
# Creating a generator for splitting the data:
splits = kf.split(X)

- Print the number of indices in both the train and validation indices lists.

In [19]:
# Exploring the length of generated arrays of indices:
for train_idx, val_idx in splits:
    print(f"Number of training indices: {len(train_idx)}")
    print(f"Number of validation indices: {len(val_idx)}")

Number of training indices: 68
Number of validation indices: 17
Number of training indices: 68
Number of validation indices: 17
Number of training indices: 68
Number of validation indices: 17
Number of training indices: 68
Number of validation indices: 17
Number of training indices: 68
Number of validation indices: 17


### 2.2. Using KFold indices

You have already created `splits`, which contains indices for the candy-data dataset to complete 5-fold cross-validation. To get a better estimate for how well a colleague's random forest model will perform on new data, you want to run this model on the five different training and validation indices you just created.

In this exercise, you will use these indices to check the accuracy of this model using the five different splits. A for loop has been provided to assist with this process.

- Use `train_idx` and `val_idx` to call the correct indices of `X` and `y` when creating training and validation data.

In [20]:
# Initiating the model:
rfr = RandomForestRegressor(n_estimators=25, random_state=1111)

- Fit `rfr` using the training dataset.

- Use `rfr` to create predictions for the validation dataset and print the validation error.

In [21]:
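# Re-creating the generator, because the loop in cell [19] already exhausted `splits`:
splits = kf.split(X)

for train_idx, val_idx in splits:

    # Setting the training & validation datasets (positional row selection needs .iloc):
    X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
    X_val, y_val = X.iloc[val_idx], y.iloc[val_idx]

    # Fitting the model on the training fold and predicting on the validation fold:
    y_pred = rfr.fit(X_train, y_train).predict(X_val)

    # Evaluating performance:
    print(f"Error: {mse(y_val, y_pred)}")

> The original version of this cell had two problems: `X[train_idx]` selects DataFrame columns by label rather than rows by position (use `.iloc`), and the `splits` generator had already been consumed by the loop in cell [19], so it has to be re-created before it can be looped over again.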

## 3. sklearn's cross_val_score()

1. sklearn's cross_val_score()
> Hello again. Next, we are going to discuss cross-validation in scikit-learn.

2. cross_val_score()
> We have seen that KFold() is a great way to create indices that we can use for cross-validation. If you just want to jump straight into cross-validation and don't want to mess with the indices, you can use scikit-learn's cross_val_score() function. It takes four main parameters. First, we have the estimator, or the specific model that you want to use. In this example, we have a RandomForestClassifier() with the default model settings. Next, we use X to specify the complete training dataset and y to specify the response values. Lastly, the parameter cv allows us to specify the number of cross-validation splits (or folds). In this example, we have set the parameter cv to 5, allowing us to perform 5-fold cross-validation. By default, cross_val_score() will use the default scoring function of whichever model you have specified. For example, if you have a RandomForestClassifier as the estimator, the default scoring function is the mean overall accuracy. For most regression models, it will return the R-squared value.
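
As a rough illustration of the call described above, the sketch below runs 5-fold cross-validation on a classifier without passing `scoring`, so each fold is scored by accuracy. The dataset comes from `make_classification` and `n_estimators=25` is chosen only to keep the run fast; both are assumptions, not the settings from the lesson.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (an assumption for this sketch).
X_clf, y_clf = make_classification(n_samples=200, n_features=8, random_state=0)

rfc = RandomForestClassifier(n_estimators=25, random_state=1111)

# No `scoring` argument: the classifier's default score (accuracy) is used per fold.
acc_scores = cross_val_score(estimator=rfc, X=X_clf, y=y_clf, cv=5)

print(acc_scores)
print(f"Mean accuracy: {acc_scores.mean():.3f}")
```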

3. Using scoring and make_scorer
> If you want to use a different scoring function, you can create a scorer by using the make_scorer() method, and specifying the scoring metric that you want to use. Here we create a scorer for the mean_absolute_error() function by calling make_scorer() on scikit-learn's method for calculating the mean absolute error. Finally, we set the scoring parameter equal to the newly created mae_scorer inside the function.
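
The sketch below shows two ways of wrapping `mean_absolute_error` with `make_scorer()`. By default the scorer returns the raw metric. With `greater_is_better=False` the sign is flipped so that larger scores are still better, which is why error-based scorers can report negative values. The data is synthetic and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import cross_val_score

# Synthetic regression data (an assumption for this sketch).
X_reg, y_reg = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=0)
rfr_demo = RandomForestRegressor(n_estimators=25, random_state=1111)

# Default behaviour: the scorer returns the metric value as-is.
mae_scorer_raw = make_scorer(mean_absolute_error)

# greater_is_better=False negates the metric so "higher is better" still holds.
mae_scorer_neg = make_scorer(mean_absolute_error, greater_is_better=False)

raw = cross_val_score(rfr_demo, X_reg, y_reg, cv=5, scoring=mae_scorer_raw)
neg = cross_val_score(rfr_demo, X_reg, y_reg, cv=5, scoring=mae_scorer_neg)

print(raw)
print(neg)
print(np.allclose(raw, -neg))  # same magnitudes, opposite sign
```

The negated form matters mainly when the scorer is handed to a search routine such as `RandomizedSearchCV`, which always maximizes the score.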

4. Full example
> Let's run through a full example of using scikit-learn's cross_val_score() for a regression model. The first step is to load all of the necessary methods. We load the model, cross_val_score(), and both the make_scorer() and mean_squared_error() methods. Next, we specify the regression model we want to use, with the specific parameters, as well as create the scorer that should be used when running the regression model. Finally, we call cross_val_score() on the estimator, rfr, the dataset X, the response values y, and set scoring equal to the scorer we generated. In this example, we set cv to 5 to complete 5-fold cross-validation.

5. Accessing the results
> Let's look at the results. Notice how varied the mean squared errors are. The lowest was almost 86, while the highest was well over 200. If we had chosen an 80/20 split on the data at random, we might have reported an error as low as 86, or an error as high as 223. When we use cross-validation, we usually report the mean of the errors. In this case, it was 150. This is a much more realistic estimate of the out-of-sample error that we can expect to see on new data. Eighty-six was probably way too low of an error, while 223 was way too high. Finally, we can look at the standard deviation to see how varied the five results were. The smaller the standard deviation, the tighter your five scores were. This indicates that the actual error on new data will probably match the mean of the cross-validation scores fairly well.
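
The reporting step can be sketched end to end as follows. The data is synthetic, so the numbers will not match the 86, 150, and 223 quoted above; the sketch only shows how the lowest fold, highest fold, mean, and standard deviation would be read off the array returned by `cross_val_score()`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption), sized like the candy dataset.
X_full, y_full = make_regression(
    n_samples=85, n_features=11, noise=10.0, random_state=0
)

rfr_full = RandomForestRegressor(n_estimators=25, random_state=1111)
mse_scorer = make_scorer(mean_squared_error)

# One MSE value per fold.
cv_results = cross_val_score(rfr_full, X_full, y_full, cv=5, scoring=mse_scorer)

print(f"Per-fold MSE: {np.round(cv_results, 1)}")
print(f"Lowest fold:  {cv_results.min():.1f}")
print(f"Highest fold: {cv_results.max():.1f}")
print(f"Mean:         {cv_results.mean():.1f}")
print(f"Std. dev.:    {cv_results.std():.1f}")
```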

6. Let's practice!
> Let's now use cross_val_score() to perform cross-validation.

### 3.1. scikit-learn's methods

You have decided to build a regression model to predict the number of new employees your company will successfully hire next month. You open up a new Python script to get started, but you quickly realize that `sklearn` has a lot of different modules. Let's make sure you understand the names of the modules, the methods, and which module contains which method.

Follow the instructions below to load in all of the necessary methods for completing cross-validation using `sklearn`. You will use modules:

> - `metrics`
> - `model_selection`
> - `ensemble`

- Load the method for calculating the scores of cross-validation.
- Load the random forest regression method.
- Load the mean square error metric.
- Load the method for creating a scorer to use with cross-validation.

> Done

### 3.2. Implement cross_val_score()

Your company has created several new candies to sell, but they are not sure if they should release all five of them. To predict the popularity of these new candies, you have been asked to build a regression model using the `candy` dataset. Remember that the response value is a head-to-head win-percentage against other candies.

Before you begin trying different regression models, you have decided to run cross-validation on a simple random forest model to get a baseline error to compare with any future results.

- Fill in cross_val_score().
> - Use `X_train` for the training data, and `y_train` for the response.
> - Use `rfr` as the model, 10-fold cross-validation, and `mse` for the scoring function.

In [22]:
# Initializing the model:
rfr = RandomForestRegressor(n_estimators=25, random_state=1111)

In [23]:
# Creating a scoring strategy:
scorer = make_scorer(mse, greater_is_better=False)

In [24]:
# Implementing 10-fold cross-validation:
cv = cross_val_score(estimator=rfr, X=X, y=y , cv=10, scoring=scorer)

- Print the mean of the `cv` results.

In [25]:
# Printing the mean of cross-validation scores:
print(cv.mean())

-155.4061992697056

> The score is negative because the scorer was created with `greater_is_better=False`, which negates the metric so that higher scores are always better. The mean squared error itself is about 155.4.


## 4. Leave-one-out-cross-validation (LOOCV)

1. Leave-one-out-cross-validation (LOOCV)
> Welcome back - in this lesson we take KFold cross-validation another step forward and discuss leave-one-out-cross-validation.

2. LOOCV
> The name says it all. In leave-one-out-cross-validation, we are going to implement KFold cross-validation, where k is equal to n, the number of observations in the data. This means that every single point will be used in a validation set, completely by itself. For the first model, we will use all of the data for training, except for the first point, which will be used for validation. In model 2, we leave only the second data point out, in model three, the third data point, and so on. We create n models, for n-observations in the data. It might seem odd to use a single point as a complete validation set, but recall what you will do after leave-one-out-cross-validation is complete. You will present the average error of the n model runs.

3. When to use LOOCV?
> You can use this technique when your data is limited, and you want to use as much training data as possible when fitting the model. This method is also used because it provides the best error estimate possible for a single new point. Consider that you just ran n-models, where each time you left out a single point. If you are given a single new point and need to estimate your error, leave-one-out-cross-validation is the right method to use. Unfortunately, this method is very computationally expensive. You should be careful using it if you have a lot of data, or if you are planning on testing a lot of different parameter sets. The best way to judge if this method is even possible is to run KFold cross-validation with a large K, maybe 25 or 50, and gauge how long it would take you to actually run Leave-one-out-cross-validation with the n-observations in your data.
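
One rough way to do that gauging is sketched below: time a 25-fold run, divide by the number of fits, and scale up to n fits. This is a linear extrapolation under stated assumptions (synthetic data, a small forest, and ignoring that each LOOCV fit trains on slightly more rows), so treat the result as an order-of-magnitude guide only.

```python
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the real dataset (an assumption).
X_time, y_time = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=0)
n_obs = X_time.shape[0]
k = 25

rfr_time = RandomForestRegressor(n_estimators=10, random_state=1111)

# Time a K-fold run with a fairly large K.
start = time.perf_counter()
cross_val_score(rfr_time, X_time, y_time, cv=k, scoring="neg_mean_absolute_error")
elapsed = time.perf_counter() - start

# Extrapolate: LOOCV needs one fit per observation instead of k fits.
seconds_per_fit = elapsed / k
estimated_loocv_seconds = seconds_per_fit * n_obs

print(f"{k}-fold run took {elapsed:.2f}s")
print(f"Estimated LOOCV time for {n_obs} rows: {estimated_loocv_seconds:.2f}s")
```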

4. LOOCV Example
> Implementing leave-one-out-cross-validation can be done using cross_val_score(). You only need to set the parameter cv equal to the number of observations in your dataset. We can find the number of observations by looking at the shape of the X dataset. The result of running leave-one-out-cross-validation will be a list of errors that stand for the error of running a model and leaving a single point out. The list will have n values, where n is the number of observations. Finally, we print the mean and use this as our overall error metric.
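
A minimal sketch of this, assuming synthetic data and a deliberately small forest to keep the run short, is shown below. It also uses scikit-learn's `LeaveOneOut` splitter, which is not mentioned in the lesson but expresses the same scheme explicitly; with a regressor and a fixed `random_state`, both calls should give the same scores.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic dataset (an assumption for this sketch).
X_loo, y_loo = make_regression(n_samples=30, n_features=4, noise=5.0, random_state=0)
n = X_loo.shape[0]  # number of observations

rfr_loo = RandomForestRegressor(n_estimators=10, random_state=1111)
mae_scorer_loo = make_scorer(mean_absolute_error)

# cv=n gives one fold per observation, which is leave-one-out.
loo_scores = cross_val_score(rfr_loo, X_loo, y_loo, cv=n, scoring=mae_scorer_loo)

# The same scheme written with the dedicated splitter.
loo_scores_explicit = cross_val_score(
    rfr_loo, X_loo, y_loo, cv=LeaveOneOut(), scoring=mae_scorer_loo
)

print(len(loo_scores))  # one error per observation
print(f"Mean absolute error: {loo_scores.mean():.2f}")
print(np.allclose(loo_scores, loo_scores_explicit))
```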

5. Let's practice
> Let's start practicing leave-one-out-cross-validation.

### 4.1. When to use LOOCV

Which of the following are reasons you might NOT run LOOCV on the provided `X` dataset? The `X` data has been loaded for you to explore as you see fit.

> - A: The `X` dataset has 122,624 data points, which might be computationally expensive and slow.
> - B: You cannot run LOOCV on classification problems.
> - C: You want to test different values for 15 different parameters.

Possible Answers:
- A & B.
- B & C.
- A & C.
- A.

> A & C.

### 4.2. Leave-one-out-cross-validation

Let's assume your favorite candy is not in the candy dataset, and that you are interested in the popularity of this candy. Using 5-fold cross-validation will train on only 80% of the data at a time. The candy dataset only has 85 rows though, and leaving out 20% of the data could hinder our model. However, using leave-one-out-cross-validation allows us to make the most out of our limited dataset and will give you the best estimate for your favorite candy's popularity!

In this exercise, you will use `cross_val_score()` to perform LOOCV.

- Create a scorer using `mean_absolute_error` for `cross_val_score()` to use.

In [26]:
# Creating a scoring strategy:
mae_scorer = make_scorer(mae, greater_is_better=False)

- Fill out `cross_val_score()` so that the model `rfr`, the newly defined `mae_scorer`, and LOOCV are used.

In [27]:
# Initializing the model:
rfr = RandomForestRegressor(n_estimators=15, random_state=1111)

In [28]:
# Implementing leave-one-out cross-validation (cv equals the number of observations):
scores = cross_val_score(estimator=rfr, X=X, y=y, scoring=mae_scorer, cv=len(X))

- Print the mean and the standard deviation of scores using numpy (loaded as np).

In [29]:
# Printing the mean and standard deviation of cross-validation scores:
print(f"The mean of the errors is: {np.mean(scores)}.")
print(f"The standard deviation of the errors is: {np.std(scores)}.")

The mean of the errors is: -9.52044832324183.
The standard deviation of the errors is: 7.349020637882744.

> The sign is negative only because `mae_scorer` was created with `greater_is_better=False`; the mean absolute error itself is about 9.52.
