# Random forest tuning & cross-validation 

Topics of focus include:


*   Relevant import statements
*   Encoding of categorical features as dummies
*   Stratification during data splitting
*   Fitting a model
*   Using `GridSearchCV` to cross-validate the model and tune the following hyperparameters:  
    - `max_depth`  
    - `max_features`  
    - `min_samples_split`
    - `n_estimators`  
    - `min_samples_leaf`  
*   Model evaluation using precision, recall, and f1 score



## Review

This notebook is a continuation of the bank churn project. Below is a recap of the considerations and decisions already made.

- **Modeling objective:** To predict whether a customer will churn&mdash;a binary classification task.

- **Target variable:** `Exited` column&mdash;0 or 1.  

- **Class balance:** The data is imbalanced 80/20 (not churned/churned), but we will not perform class balancing.

- **Primary evaluation metric:** F1 score.

- **Modeling workflow and model selection:** The champion model will be the model with the best validation F1 score. Only the champion model will be used to predict on the test data. See the annotated decision tree notebook for details and limitations of this approach.

## A note on cross-validation/validation

This notebook includes two approaches to validation: cross-validating the training data and validating using a separate validation dataset. In practice, only one would be used for a given project.

Cross-validation is more rigorous, because it maximizes the usage of the training data, but with a very large dataset or limited computing resources, it may be better to validate with a separate validation dataset.

## Import statements

Before we begin with the exercises and analyzing the data, we need to import all libraries and extensions required for this programming exercise. Throughout the course, we will be using numpy and pandas for operations, and matplotlib for plotting. 

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

# This lets us see all of the columns, preventing Jupyter from truncating them.
pd.set_option('display.max_columns', None)

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score,\
f1_score, confusion_matrix, ConfusionMatrixDisplay

from sklearn.ensemble import RandomForestClassifier

# This module lets us save our models once we fit them.
import pickle

## Read in the data

In [2]:
# Read in data
file = 'Churn_Modelling.csv'
df_original = pd.read_csv(file)
df_original.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


## Feature engineering

### Feature selection

In this step, prepare the data for modeling. Notice from above that a number of columns don't look like they offer any predictive signal to the model. These columns include `RowNumber`, `CustomerId`, and `Surname`. Drop these columns to avoid introducing noise to the model.

Also drop the `Gender` column, to avoid making predictions based on gender.

In [3]:
# Drop useless and sensitive (Gender) cols
churn_df = df_original.drop(['RowNumber', 'CustomerId', 'Surname', 'Gender'], axis=1)
churn_df.head()

Unnamed: 0,CreditScore,Geography,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,42,2,0.0,1,1,1,101348.88,1
1,608,Spain,41,1,83807.86,1,0,1,112542.58,0
2,502,France,42,8,159660.8,3,1,0,113931.57,1
3,699,France,39,1,0.0,2,0,0,93826.63,0
4,850,Spain,43,2,125510.82,1,1,1,79084.1,0


### Feature transformation

Next, dummy encode the `Geography` variable, which is categorical. Do this with the `pd.get_dummies()` function, setting `drop_first=True`. This replaces the `Geography` column with two new Boolean columns called `Geography_Germany` and `Geography_Spain`; the dropped first category, France, is implied when both are `False`.

In [4]:
# Dummy encode categoricals (drop_first takes a Boolean, not the string 'True')
churn_df2 = pd.get_dummies(churn_df, drop_first=True)
churn_df2.head()

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_Germany,Geography_Spain
0,619,42,2,0.0,1,1,1,101348.88,1,False,False
1,608,41,1,83807.86,1,0,1,112542.58,0,False,True
2,502,42,8,159660.8,3,1,0,113931.57,1,False,False
3,699,39,1,0.0,2,0,0,93826.63,0,False,False
4,850,43,2,125510.82,1,1,1,79084.1,0,False,True
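
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a small made-up frame (the `toy` data is hypothetical and not part of the churn dataset). It shows that the first category in sorted order is the one that gets dropped.

```python
import pandas as pd

# Toy frame (hypothetical data) with one categorical column
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Age": [42, 41, 39, 50],
})

# drop_first=True drops the first category in sorted order ("France"),
# so a French customer is encoded as both dummy columns being False
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
# ['Age', 'Geography_Germany', 'Geography_Spain']
```

Dropping one level avoids a redundant column: with three categories, two dummies carry all of the information.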


## Split the data

Split the data into features and target variable, and into training data and test data using the `train_test_split()` function. 

Don't forget to include the `stratify=y` parameter, as this is what ensures that the 80/20 class ratio of the target variable is maintained in both the training and test datasets after splitting.

Lastly, set a random seed so we and others can reproduce our work.

In [5]:
# Define the y (target) variable
y = churn_df2["Exited"]

# Define the X (predictor) variables
X = churn_df2.copy()
X = X.drop("Exited", axis = 1)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=42)
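
One way to confirm that stratification did its job is to compare the share of the positive class before and after splitting. The sketch below uses a synthetic 80/20 target as a stand-in for `Exited` (the `*_demo` names are illustrative assumptions), so it runs without the churn file.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn target: 80% zeros, 20% ones
y_demo = pd.Series([0] * 800 + [1] * 200)
X_demo = pd.DataFrame({"feature": np.arange(len(y_demo))})

X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

# With stratify=y_demo, the positive-class share is preserved in each split
print(y_demo.mean(), y_tr_demo.mean(), y_te_demo.mean())
```

The same check on the real data would be `y_train.mean()` and `y_test.mean()`, which should both sit close to the overall churn rate.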

## Modeling

### Cross-validated hyperparameter tuning

The cross-validation process is the same as it was for the decision tree model. The only difference is that more hyperparameters are tuned now. The steps are included below. For details on cross-validating with `GridSearchCV`, refer to the [GridSearchCV documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) in scikit-learn.

1. Instantiate the classifier (and set the `random_state`). 

2. Create a dictionary of hyperparameters to search over.

3. Create a set of scoring metrics to capture. 

4. Instantiate the `GridSearchCV` object. Pass as arguments:
  - The classifier (`rf`)
  - The dictionary of hyperparameters to search over (`cv_params`)
  - The set of scoring metrics (`scoring`)
  - The number of cross-validation folds you want (`cv=5`)
  - The scoring metric that you want GridSearch to use when it selects the "best" model (i.e., the model that performs best on average over all validation folds) (`refit='f1'`)

5. Fit the data (`X_train`, `y_train`) to the `GridSearchCV` object (`rf_cv`).

Note the use of `%%time` magic at the top of the cell. This outputs the final runtime of the cell. (Magic commands, often just called "magics," are commands that are built into IPython to simplify common tasks. They begin with `%` or `%%`.)


In [6]:
%%time

rf = RandomForestClassifier(random_state=0)

cv_params = {'max_depth': [2,3,4,5, None], 
             'min_samples_leaf': [1,2,3],
             'min_samples_split': [2,3,4],
             'max_features': [2,3,4],
             'n_estimators': [75, 100, 125, 150]
             }  

scoring = {'accuracy':'accuracy', 'precision':'precision', 'recall':'recall', 'f1':'f1'}

rf_cv = GridSearchCV(rf, cv_params, scoring=scoring, cv=5, refit='f1')

rf_cv.fit(X_train, y_train)

CPU times: total: 31min 22s
Wall time: 31min 51s


0,1,2
,estimator,RandomForestC...andom_state=0)
,param_grid,"{'max_depth': [2, 3, ...], 'max_features': [2, 3, ...], 'min_samples_leaf': [1, 2, ...], 'min_samples_split': [2, 3, ...], ...}"
,scoring,"{'accuracy': 'accuracy', 'f1': 'f1', 'precision': 'precision', 'recall': 'recall'}"
,n_jobs,
,refit,'f1'
,cv,5
,verbose,0
,pre_dispatch,'2*n_jobs'
,error_score,
,return_train_score,False

0,1,2
,n_estimators,125
,criterion,'gini'
,max_depth,
,min_samples_split,2
,min_samples_leaf,2
,min_weight_fraction_leaf,0.0
,max_features,4
,max_leaf_nodes,
,min_impurity_decrease,0.0
,bootstrap,True
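
The long runtime follows from the size of the grid. As a rough sketch, scikit-learn's `ParameterGrid` can count the hyperparameter combinations, and multiplying by the number of folds gives the number of forests that are fit (the names `n_candidates`, `n_folds`, and `n_fits` are illustrative).

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above
cv_params = {'max_depth': [2, 3, 4, 5, None],
             'min_samples_leaf': [1, 2, 3],
             'min_samples_split': [2, 3, 4],
             'max_features': [2, 3, 4],
             'n_estimators': [75, 100, 125, 150]}

n_candidates = len(ParameterGrid(cv_params))  # 5 * 3 * 3 * 3 * 4 combinations
n_folds = 5

# One fit per combination per fold, plus one final refit of the best combination
n_fits = n_candidates * n_folds + 1
print(n_candidates, n_fits)
# 540 2701
```

`GridSearchCV` also accepts an `n_jobs` argument (for example, `n_jobs=-1`) to run these fits in parallel; how much that helps depends on the machine.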


# Random forest validation on separate dataset 


Topics of focus include:

  * Using `pickle` to save a fit model
  * Using a separate dataset to tune hyperparameters and validate the model
    * Splitting the training data to create a validation dataset
    * Creating a list of split indices to use with `PredefinedSplit` so `GridSearchCV` performs validation on this defined validation set

## Pickle  

When models take a long time to fit, it is not desirable to fit them more than once. If the kernel disconnects, or you shut down the notebook and lose the cell's output, you must refit the model, which can be frustrating and time-consuming.

`pickle` is a tool that saves the fit model object to a specified location, then quickly reads it back in. It also lets you use models that were fit somewhere else, without having to train them.

This step will ***W***rite (i.e., save) the model, in ***B***inary (hence, `wb`), to the folder designated by the path defined in the cells below. In this case, the name of the file we're writing is `rf_cv_model.pickle`.

In [7]:
# import library
import os

In [8]:
# get the current working directory (cwd)
path = os.getcwd()

In [9]:
# Define the folder where you want to save the model
folder = "pickle"

In [10]:
# define the full path to the folder
full_path = os.path.join(path, folder)

In [11]:
# create the folder if it doesn't exist
os.makedirs(full_path, exist_ok=True)

In [49]:
# define the full path to the pickle file
model_filename = os.path.join(full_path, 'rf_cv_model.pickle')

In [50]:
# pickle the model
with open(model_filename, 'wb') as to_write:
    pickle.dump(rf_cv, to_write)

Once the model is saved, refitting it won't be necessary when rerunning this notebook. Ideally, you could open the notebook, select "Run all," and the cells would run successfully all the way to the end without any model retraining.

For this to happen, return to the cell where the grid search was defined and comment out the line that fits the model. Otherwise, rerunning the notebook would refit the model.

Similarly, go back to the cell where the model is saved as a pickle and comment out those lines.

Next, add a new cell that reads in the saved model from the folder already specified. For this, use `rb` (read binary) and be sure to assign the model to the same variable name as used above, `rf_cv`.

In [51]:
# Read in pickled model
with open(model_filename, 'rb') as to_read:
    rf_cv = pickle.load(to_read)
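
As an alternative to commenting cells in and out, the fit-or-load decision can be made in code. The sketch below is one possible pattern, not part of the original notebook: `load_or_fit` is a hypothetical helper, and the demo uses a plain dictionary as a stand-in for a fit model so that it runs in a temporary folder.

```python
import os
import pickle
import tempfile


def load_or_fit(model_path, fit_fn):
    """Hypothetical helper: load a pickled object if the file exists;
    otherwise call fit_fn(), pickle the result, and return it."""
    if os.path.exists(model_path):
        with open(model_path, "rb") as to_read:
            return pickle.load(to_read)
    model = fit_fn()
    with open(model_path, "wb") as to_write:
        pickle.dump(model, to_write)
    return model


# Demo with a stand-in "model" (a dict) in a temporary folder
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_model.pickle")
    calls = []

    def expensive_fit():
        calls.append(1)  # record each time the slow step runs
        return {"best_params": {"n_estimators": 125}}

    first = load_or_fit(demo_path, expensive_fit)   # fits and saves
    second = load_or_fit(demo_path, expensive_fit)  # loads from disk
    n_fit_calls = len(calls)

print(n_fit_calls)
```

In this notebook, the equivalent call would look something like `rf_cv = load_or_fit(model_filename, lambda: rf_cv.fit(X_train, y_train))`, since `fit` returns the fitted search object.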

Now everything above is ready to run quickly and without refitting. Continue by using the model's `best_params_` attribute to check the hyperparameters that had the best average F1 score across all the cross-validation folds.

In [15]:
# calling best params
rf_cv.best_params_

{'max_depth': None,
 'max_features': 4,
 'min_samples_leaf': 2,
 'min_samples_split': 2,
 'n_estimators': 125}

To check the best average F1 score of this model on the validation folds, use the `best_score_` attribute. Remember, if `refit='recall'` had been set when the `GridSearchCV` object was instantiated earlier, then `best_score_` would return the best recall score, and the best parameters might differ from those in the cell above, because the search would be optimizing for a different metric.

In [16]:
rf_cv.best_score_

0.580528563620339
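
The link between `refit`, `best_score_`, and the other captured metrics can be seen on a small example. This is a sketch on synthetic data with a deliberately tiny grid (all `demo` names are assumptions): `best_score_` is the mean validation score of the `refit` metric at `best_index_`, and the other metrics for that same candidate sit in `cv_results_`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic, imbalanced dataset so the search runs in seconds
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, weights=[0.8, 0.2], random_state=0
)

demo_search = GridSearchCV(
    RandomForestClassifier(n_estimators=10, random_state=0),
    {"max_depth": [2, None]},
    scoring={"recall": "recall", "f1": "f1"},
    cv=3,
    refit="f1",  # the metric used to pick the "best" candidate
)
demo_search.fit(X_demo, y_demo)

# Row of cv_results_ that belongs to the selected candidate
best_row = demo_search.best_index_
mean_f1 = demo_search.cv_results_["mean_test_f1"][best_row]
mean_recall = demo_search.cv_results_["mean_test_recall"][best_row]

print(demo_search.best_score_, mean_f1, mean_recall)
```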

The model had an F1 score of **0.5805**; not terrible. Recall that when the grid search was run, it was specified that it would also capture precision, recall, and accuracy.

The reason for doing this is that it's difficult to interpret an F1 score. These other metrics are much more directly interpretable, so they're worth knowing. 

The following cell defines a helper function that extracts these scores from the fit `GridSearchCV` object and returns a pandas dataframe with all four scores from the model with the best average F1 score during validation.

In [17]:
def make_results(model_name, model_object):
    '''
    Accepts as arguments a model name (your choice - string) and
    a fit GridSearchCV model object.

    Returns a pandas df with the F1, recall, precision, and accuracy scores
    for the model with the best mean F1 score across all validation folds.
    '''

    # Get all the results from the CV and put them in a df
    cv_results = pd.DataFrame(model_object.cv_results_)

    # Isolate the row of the df with the max(mean f1 score)
    # (idxmax returns an index label, so select the row with .loc)
    best_estimator_results = cv_results.loc[cv_results['mean_test_f1'].idxmax(), :]

    # Extract accuracy, precision, recall, and f1 score from that row
    f1 = best_estimator_results.mean_test_f1
    recall = best_estimator_results.mean_test_recall
    precision = best_estimator_results.mean_test_precision
    accuracy = best_estimator_results.mean_test_accuracy

    # Create table of results
    table = pd.DataFrame({'Model': [model_name],
                          'F1': [f1],
                          'Recall': [recall],
                          'Precision': [precision],
                          'Accuracy': [accuracy]
                         }
                        )

    return table

In [18]:
# Make a results table for the rf_cv model using above function
rf_cv_results = make_results('Random Forest CV', rf_cv)
rf_cv_results

Unnamed: 0,Model,F1,Recall,Precision,Accuracy
0,Random Forest CV,0.580529,0.472517,0.756289,0.861333
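
F1 is the harmonic mean of precision and recall, which is one way to relate it to the more interpretable metrics. The short check below plugs in the precision and recall from the table above. Because every value in the table is an average over five folds, the harmonic mean of the averaged precision and recall is close to, but not exactly, the averaged F1.

```python
# Values copied from the results table above
precision = 0.756289
recall = 0.472517
reported_mean_f1 = 0.580529

# Harmonic mean of precision and recall
f1_from_means = 2 * precision * recall / (precision + recall)

print(round(f1_from_means, 4), reported_mean_f1)
```

The small gap arises because F1 is computed per fold and then averaged, which is not the same operation as averaging precision and recall first.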


Concatenate these results to the master results table from when the single decision tree model was built.

In [19]:
# Read in master results table
results = pd.read_csv('results1.csv', index_col=0)
results

Unnamed: 0,Model,F1,Recall,Precision,Accuracy
0,Tuned Decision Tree,0.560655,0.469255,0.701608,0.8504


In [20]:
# Concatenate the random forest results to the master table
results = pd.concat([rf_cv_results, results])
results

Unnamed: 0,Model,F1,Recall,Precision,Accuracy
0,Random Forest CV,0.580529,0.472517,0.756289,0.861333
0,Tuned Decision Tree,0.560655,0.469255,0.701608,0.8504


The scores in the above table tell us that the random forest model performs better than the single decision tree model on every metric.

Now, build another random forest model, only this time tune the hyperparameters using a separate validation dataset.

## Modeling

### Hyperparameters tuned with separate validation set  

Begin by splitting the training data to create a validation dataset. Remember, do not touch the test data at all.  

Use `train_test_split` to divide `X_train` and `y_train` into 80% training data (`X_tr`, `y_tr`) and 20% validation data (`X_val`, `y_val`). Don't forget to stratify it and set the random state.

In [21]:
# Create separate validation data
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, 
                                            stratify=y_train, random_state=10)

When tuning hyperparameters with `GridSearchCV` using a separate validation dataset, a few extra steps are needed. `GridSearchCV` wants to cross-validate the data. In fact, if the `cv` argument were left blank, it would split the data into five folds for cross-validation by default. 

That is not needed. Instead, indicate exactly which rows of `X_train` are for training, and which rows are for validation.  

To do this, make a list of length `len(X_train)` where each element is either a 0 or -1. A 0 in index _i_ will indicate to `GridSearchCV` that index _i_ of `X_train` is to be held out for validation. A -1 at a given index will indicate that that index of `X_train` is to be used as training data. 

Make this list using a list comprehension that looks at the index number of each row in `X_train`. If that index number is in `X_val`'s list of index numbers, then the list comprehension appends a 0. If it's not, then it appends a -1.

So if the training data is:  
[A, B, C, D],  
and the list is:   
[-1, 0, 0, -1],  
then `GridSearchCV` will use a training set of [A, D] and validation set of [B, C].
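
The four-item example above can be checked directly with `PredefinedSplit`. This is a minimal sketch in which `data` and `test_fold` mirror the letters and list from the example.

```python
from sklearn.model_selection import PredefinedSplit

data = ["A", "B", "C", "D"]
test_fold = [-1, 0, 0, -1]  # -1 = always train, 0 = validation fold 0

demo_split = PredefinedSplit(test_fold)

# There is exactly one split because only one fold number (0) appears
for train_idx, val_idx in demo_split.split():
    train_items = [data[i] for i in train_idx]
    val_items = [data[i] for i in val_idx]
    print(train_items, val_items)
# ['A', 'D'] ['B', 'C']
```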

In [23]:
# Create list of split indices
split_index = [0 if x in X_val.index else -1 for x in X_train.index]
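
For larger datasets, the same list can be built without a Python-level loop. The sketch below is an optional, equivalent formulation using `Index.isin` and `np.where`; `train_index` and `val_index` are small made-up stand-ins for `X_train.index` and `X_val.index`.

```python
import numpy as np
import pandas as pd

# Stand-ins for X_train.index and X_val.index
train_index = pd.Index([10, 11, 12, 13, 14])
val_index = pd.Index([11, 13])

# List-comprehension version, as used in the notebook
loop_version = [0 if x in val_index else -1 for x in train_index]

# Vectorized version: 0 where the row is in the validation set, else -1
vectorized = np.where(train_index.isin(val_index), 0, -1)

print(loop_version, vectorized.tolist())
# [-1, 0, -1, 0, -1] [-1, 0, -1, 0, -1]
```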

Now that this list is ready, import a new class called `PredefinedSplit`. This class is what lets us pass the list we just made to `GridSearchCV`. (You can read more about it in the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.PredefinedSplit.html#sklearn.model_selection.PredefinedSplit).)

In [24]:
from sklearn.model_selection import PredefinedSplit

Now build the model. Everything is the same as when cross-validation was done, except this time pass the `split_index` list to `PredefinedSplit` and assign the resulting object to a new variable called `custom_split`.

Then use this variable for the `cv` argument when instantiating `GridSearchCV`.

In [27]:
rf = RandomForestClassifier(random_state=0)

cv_params = {'max_depth': [2,3,4,5, None], 
             'min_samples_leaf': [1,2,3],
             'min_samples_split': [2,3,4],
             'max_features': [2,3,4],
             'n_estimators': [75, 100, 125, 150]
             }  

scoring = {'accuracy':'accuracy', 'precision':'precision', 'recall':'recall', 'f1':'f1'}

custom_split = PredefinedSplit(split_index)

rf_val = GridSearchCV(rf, cv_params, scoring=scoring, cv=custom_split, refit='f1')

Now fit the model.

In [28]:
rf_val.fit(X_train, y_train)

0,1,2
,estimator,RandomForestC...andom_state=0)
,param_grid,"{'max_depth': [2, 3, ...], 'max_features': [2, 3, ...], 'min_samples_leaf': [1, 2, ...], 'min_samples_split': [2, 3, ...], ...}"
,scoring,"{'accuracy': 'accuracy', 'f1': 'f1', 'precision': 'precision', 'recall': 'recall'}"
,n_jobs,
,refit,'f1'
,cv,"PredefinedSpl......, -1, -1]))"
,verbose,0
,pre_dispatch,'2*n_jobs'
,error_score,
,return_train_score,False

0,1,2
,n_estimators,150
,criterion,'gini'
,max_depth,
,min_samples_split,3
,min_samples_leaf,1
,min_weight_fraction_leaf,0.0
,max_features,4
,max_leaf_nodes,
,min_impurity_decrease,0.0
,bootstrap,True


Notice that this took less time than when the cross-validation was done; about 1/5 of the time. This is because _during cross-validation_ the training data was divided into five folds. An ensemble of trees was grown with a particular combination of hyperparameters on four folds of data, and validated on the fifth fold that was held out. This whole process happened for each of five holdout folds. Then, another ensemble was trained with the next combination of hyperparameters, repeating the whole process. This continued until there were no more combinations of hyperparameters to run.  

<img src="./cross_validation_diagram.svg"/>

But now that _a separate validation set is being used,_ an ensemble is built for each combination of hyperparameters. Each ensemble is trained on the new training set and validated on the validation set. But this only happens _one time_ for each combination of hyperparameters, instead of _five times_ with cross-validation. That’s why the training time was only a fifth as long.

<img src="./single_validation_diagram.svg"/>

Let's pickle the model...

In [34]:
# define the full path to the pickle file
model_filename = os.path.join(full_path, 'rf_val_model.pickle')

In [35]:
# Pickle the model
with open(model_filename, 'wb') as to_write:
    pickle.dump(rf_val, to_write)

... then comment out the lines that fit the model and wrote the pickle, and read the pickled model back in.

In [37]:
# Open pickled model
with open(model_filename, 'rb') as to_read:
    rf_val = pickle.load(to_read)

Now check the parameters of the best-performing model on the validation set:

In [38]:
rf_val.best_params_

{'max_depth': None,
 'max_features': 4,
 'min_samples_leaf': 1,
 'min_samples_split': 3,
 'n_estimators': 150}

Notice that the best hyperparameters are slightly different from those of the cross-validated model.

Now, generate the model results using the `make_results` function, add them to the master table, and then sort them by F1 score in descending order.

In [39]:
# Create model results table
rf_val_results = make_results('Random Forest Validated', rf_val)

# Concatenate model results table with master results table
results = pd.concat([rf_val_results, results])

# Sort master results by F1 score in descending order
results.sort_values(by=['F1'], ascending=False)

Unnamed: 0,Model,F1,Recall,Precision,Accuracy
0,Random Forest CV,0.580529,0.472517,0.756289,0.861333
0,Random Forest Validated,0.57551,0.460784,0.766304,0.861333
0,Tuned Decision Tree,0.560655,0.469255,0.701608,0.8504


Save the new master table to use later when we build more models. 

In [48]:
# Save the master results table
# (os.path.join keeps the path portable across operating systems)
results.to_csv(os.path.join(path, 'results2.csv'), index=False)

## Model selection and final results

There are now three models. If the decision has been made to stop trying to improve them, the best model can be used to make predictions on the test holdout data. In this case, the model that was cross-validated without a depth limit will be used. But if the model chosen was the one validated on a separate validation set, it would need to be retrained using all of the training data (both the training and validation sets).
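
Retraining with the chosen hyperparameters on all of the training data could look like the sketch below. It uses synthetic data so that it runs on its own, and `best_params` is a hand-copied stand-in for `rf_val.best_params_`. Note also that with `refit='f1'`, `GridSearchCV` refits the best candidate on all the data passed to `fit`, which here was `X_train`, so `rf_val.best_estimator_` may already satisfy this requirement.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn features and target
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

# Stand-in for rf_val.best_params_ (values copied from the output above)
best_params = {'max_depth': None,
               'max_features': 4,
               'min_samples_leaf': 1,
               'min_samples_split': 3,
               'n_estimators': 150}

# Fit a fresh forest on ALL training rows (training + validation portions)
final_model = RandomForestClassifier(random_state=0, **best_params)
final_model.fit(X_train_demo, y_train_demo)

n_trees = len(final_model.estimators_)
print(n_trees)
```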

**Note**: _It might seem like a good idea to test all three models on the test data and pick the one that performs best. While this **can** be done, it introduces bias because the test data would influence the choice of model. The test set should represent **unseen** data. For example, in competitions, you have to submit your final model before seeing the test data._

The table above shows that the cross-validated random forest model performs slightly better than the one trained with a separate validation set. It has good precision and accuracy, but its recall is 0.4725. This means it correctly identifies about 47% of the people who actually left the bank in the validation folds.

The model won’t be applied to the test data yet because there is still one more model to build. That model will be introduced soon. Once it's trained, the best model will be used to make predictions on the test set.