# <center> **Random Forest**
## <center> **Tuning & Cross-Validation** 


- **Topics of focus include:**
    *   Relevant import statements
    *   Encoding of categorical features as dummies
    *   Stratification during data splitting
    *   Fitting a model
    *   Using **`GridSearchCV`** to cross-validate the model and tune the following hyperparameters:  
        - **`max_depth`**  
        - **`max_features`**  
        - **`min_samples_split`**
        - **`n_estimators`**  
        - **`min_samples_leaf`**  
    *   Model evaluation using precision, recall, and f1 score



## **1. Review**

- This notebook is a continuation of the bank churn project. Below is a recap of the considerations and decisions that we've already made. 
* **`Modeling objective:`** To predict whether a customer will churn&mdash;a binary classification task.
* **`Target variable:`** **`Exited`** column; 0 or 1.  
* **`Class balance:`** The data is imbalanced 80/20 (not churned/churned), but we will not perform class balancing.
* **`Primary evaluation metric:`** F1 score.
* **`Modeling workflow and model selection:`** The champion model will be the model with the best validation F1 score. 
* Only the champion model will be used to predict on the test data. See the annotated decision tree notebook for details and limitations of this approach.
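
To see why accuracy alone would be misleading with an 80/20 class split, here is a minimal sketch. It uses 100 made-up labels (not the bank data) and a hypothetical model that always predicts "not churned"; the variable names are our own.

```python
# Toy illustration: 80 customers who stayed (0) and 20 who churned (1).
y_true = [0] * 80 + [1] * 20
# A hypothetical "model" that never predicts churn.
y_pred = [0] * 100

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# Guard against division by zero when there are no positive predictions.
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = (2 * precision * recall / (precision + recall)
      if (precision + recall) else 0.0)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}, f1={f1:.2f}")
```

The baseline reaches 80% accuracy without finding a single churner, while its F1 score is 0. This is the kind of failure that F1 exposes and accuracy hides.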

## **2. Imports**

- Before we start working with the data, we need to import the libraries that this notebook relies on. 
- We use numpy and pandas for data manipulation, matplotlib for plotting, scikit-learn for modeling and evaluation, and pickle for saving the fitted model. 

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

# This lets us see all of the columns, preventing Jupyter from truncating them.
pd.set_option('display.max_columns', None)

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, ConfusionMatrixDisplay)

from sklearn.ensemble import RandomForestClassifier

# This module lets us save our models once we fit them.
import pickle

#### **2.1. The data**

In [None]:
# Read in data
file = r"C:\Users\barba\OneDrive\Documents\AIO Python\Datasets\Churn_Modelling.csv" # Churn_dataset.csv on github
churn_df = pd.read_csv(file)
churn_df.head()

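| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |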


## **3. Feature engineering**

#### **3.1. Feature selection**

- In this step, we'll prepare the data for modeling.  
- Notice from above that there are a number of columns that we wouldn't expect to offer any predictive signal to the model. 
- These columns include **`RowNumber`**, **`CustomerId`**, and **`Surname`**. 
- We'll drop these columns so they don't introduce noise to our model.  
- We'll also drop the **`Gender`** column, because we don't want our model to make predictions based on gender.

In [3]:
# Drop useless and sensitive (Gender) cols
churn_df = churn_df.drop(['RowNumber', 'CustomerId', 'Surname', 'Gender'], axis=1)
churn_df.head()

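| | CreditScore | Geography | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | France | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 608 | Spain | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 502 | France | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 699 | France | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 850 | Spain | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |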


#### **3.2. Feature transformation**

- Next, we'll dummy encode the **`Geography`** variable, which is categorical. 
- We do this with the **`pd.get_dummies()`** function, setting **`drop_first=True`**,  
which replaces the **`Geography`** column with two new Boolean columns called **`Geography_Germany`** and **`Geography_Spain`**.

In [4]:
# Dummy encode categoricals
churn_df = pd.get_dummies(churn_df, drop_first=True)
churn_df.head()

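| | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Geography_Germany | Geography_Spain |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 | False | False |
| 1 | 608 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | False | True |
| 2 | 502 | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 | False | False |
| 3 | 699 | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 | False | False |
| 4 | 850 | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 | False | True |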
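
The encoded output has no `Geography_France` column because France is the dropped first category. The plain-Python sketch below mimics that effect on a single toy column. The `dummy_encode` helper and the sample values are our own illustration, not how pandas implements `get_dummies`.

```python
def dummy_encode(values, prefix):
    """Return {column_name: [bool, ...]}, dropping the first sorted category."""
    categories = sorted(set(values))
    kept = categories[1:]  # the first category becomes the implicit baseline
    return {f"{prefix}_{cat}": [v == cat for v in values] for cat in kept}


# Toy values only; the real column has one entry per customer.
geography = ["France", "Spain", "France", "France", "Spain", "Germany"]
encoded = dummy_encode(geography, "Geography")

for name, column in encoded.items():
    print(name, column)
```

A customer from France is represented by `False` in both remaining columns, so no information is lost by dropping the first category.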


## **4. Split the data**

- We'll split the data into features and target variable, and into training data and test data using the **`train_test_split()`** function. 
- Don't forget to include the **`stratify=y`** parameter, as this is what ensures that the 80/20 class ratio of the target  
variable is maintained in both the training and test datasets after splitting.
- Lastly, we set a random seed so we and others can reproduce our work.

In [5]:
# Define the y (target) variable
y = churn_df["Exited"]
# Define the X (predictor) variables
X = churn_df.copy()
X = X.drop("Exited", axis = 1)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=42)
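
To make the effect of **`stratify=y`** concrete, here is a plain-Python sketch of the idea: split each class separately so both subsets keep the same class ratio. The `stratified_split` helper and the toy labels are assumptions for illustration; scikit-learn's own implementation differs in its details.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx), splitting each class in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within the class only
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


labels = [0] * 80 + [1] * 20  # toy target with the same 80/20 imbalance
train_idx, test_idx = stratified_split(labels, test_size=0.25, seed=42)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(f"train churn rate={train_rate:.2f}, test churn rate={test_rate:.2f}")
```

Both subsets end up with a 20% churn rate. Without stratification, a purely random split could leave the smaller test set with noticeably more or fewer churners than the training set.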

## **5. Modeling**

#### **5.1. Cross-validated hyperparameter tuning**

- The cross-validation process is the same as it was for the decision tree model. 
- The only difference is that we're tuning more hyperparameters now. 
- The steps are included below as a review. 
- For details on cross-validating with **`GridSearchCV`**, refer back to the decision tree notebook, or to the [**GridSearchCV documentation**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) in scikit-learn.
1. Instantiate the classifier (and set the **`random_state`**). 
2. Create a dictionary of hyperparameters to search over.
3. Create a dictionary of scoring metrics to capture. 
4. Instantiate the **`GridSearchCV`** object. Pass as arguments:
    - The classifier (**`rf`**)
    - The dictionary of hyperparameters to search over (**`cv_params`**)
    - The dictionary of scoring metrics (**`scoring`**)
    - The number of cross-validation folds you want (**`cv=5`**)
    - The scoring metric that you want GridSearch to use when it selects the "best" model  
    (i.e., the model that performs best on average over all validation folds) (**`refit='f1'`**)
5. Fit the data **(`X_train`, `y_train`)** to the **`GridSearchCV`** object **(`rf_cv`)**.

- Note that we use the **`%%time`** magic at the top of the cell. This outputs the final runtime of the cell. 

In [6]:
%%time
rf = RandomForestClassifier(random_state=0)
cv_params = {'max_depth': [2,3,4,5, None], 
             'min_samples_leaf': [1,2,3],
             'min_samples_split': [2,3,4],
             'max_features': [2,3,4],
             'n_estimators': [75, 100, 125, 150]
             }  
scoring = {'accuracy':'accuracy', 
           'precision':'precision', 
           'recall':'recall', 
           'f1':'f1'}
rf_cv = GridSearchCV(rf, cv_params, scoring=scoring, cv=5, refit='f1')
rf_cv.fit(X_train, y_train)

CPU times: total: 26min 10s
Wall time: 26min 27s
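
The long runtime follows from the size of the grid. This short sketch counts the fits using only the standard library. The grid is copied from the cell above, and `n_folds` is our own name for the `cv=5` setting.

```python
from math import prod

# Same grid as in the search above.
cv_params = {'max_depth': [2, 3, 4, 5, None],
             'min_samples_leaf': [1, 2, 3],
             'min_samples_split': [2, 3, 4],
             'max_features': [2, 3, 4],
             'n_estimators': [75, 100, 125, 150]}
n_folds = 5  # matches cv=5

# Every combination of values is tried once per fold.
n_combinations = prod(len(values) for values in cv_params.values())
n_cv_fits = n_combinations * n_folds

print(n_combinations, n_cv_fits)  # 540 2700
```

With `refit='f1'`, one more fit follows at the end, on all of the training data. Each of these fits trains a forest of 75 to 150 trees, which is why the search took about 26 minutes. If runtime matters, `GridSearchCV` accepts an `n_jobs` argument to run fits in parallel.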


#### **5.2. Pickle**  

- When models take a long time to fit, you don’t want to have to fit them more than once. 
- If your kernel disconnects, or you shut down the notebook and lose the cell’s output, you’ll have to refit the model, which can be frustrating and time-consuming. 
- **`pickle`** is a tool that saves the fit model object to a specified location, then quickly reads it back in. 
- It also allows you to use models that were fit somewhere else, without having to train them yourself.

In [7]:
# Define a path to the folder where you want to save the model
path = r'C:\Users\barba\OneDrive\Documents\AIO Python\Google\Machine Learning\\'

In [None]:
# Pickle the model
with open(path + 'rf_cv_model.pickle', 'wb') as to_write:
    pickle.dump(rf_cv, to_write)

- Once we save the model, we'll never have to re-fit it when we run this notebook. 
- Ideally, we could open the notebook, select "Run all," and the cells would run successfully all the way to the end without any model retraining. 
- For this to happen, we'll need to return to the cell where we defined our grid search and comment out the line where we fit the model. 
- Otherwise, when we re-run the notebook, it would refit the model. 
- Similarly, we'll also need to go back to where we saved the model as a pickle and comment out those lines.  
- Next, we'll add a new cell that reads in the saved model from the folder we already specified. 
- For this, we'll use **`rb`** (read binary) and be sure to assign the model to the same variable name as we used above, **`rf_cv`**.

In [11]:
# Read in pickled model
with open(path + 'rf_cv_model.pickle', 'rb') as to_read:
    rf_cv = pickle.load(to_read)
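
An alternative to commenting cells in and out is to check whether the pickle file already exists. The sketch below shows one way to do that. The `load_or_fit` helper is hypothetical (it is not part of pickle or scikit-learn), and a small dictionary stands in for the fitted model so the example runs in a temporary folder.

```python
import os
import pickle
import tempfile


def load_or_fit(model_path, fit_fn):
    """Load a pickled model if the file exists; otherwise fit, save, and return it."""
    if os.path.exists(model_path):
        with open(model_path, 'rb') as to_read:
            return pickle.load(to_read)
    model = fit_fn()
    with open(model_path, 'wb') as to_write:
        pickle.dump(model, to_write)
    return model


calls = []  # records how many times the "expensive" fit actually runs


def slow_fit():
    calls.append(1)
    return {'best_params': {'n_estimators': 125}}  # stand-in for a fitted model


with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'rf_cv_model.pickle')
    first = load_or_fit(model_path, slow_fit)   # fits and saves
    second = load_or_fit(model_path, slow_fit)  # loads from disk

print(len(calls))  # 1
```

In this notebook, the same pattern could be written as `rf_cv = load_or_fit(path + 'rf_cv_model.pickle', lambda: rf_cv.fit(X_train, y_train))`, so that "Run all" only triggers the grid search when no saved model is found.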

In [12]:
#rf_cv.fit(X_train, y_train)
rf_cv.best_params_

{'max_depth': None,
 'max_features': 4,
 'min_samples_leaf': 2,
 'min_samples_split': 2,
 'n_estimators': 125}

- To check the best **`average F1 score`** of this model on the validation folds, we can use the **`best_score_`** attribute. 
- Remember, if we had instead set **`refit='recall'`** when we instantiated our **`GridSearchCV`** object, **`best_score_`** would return the best recall score. The best parameters might also differ from those in the cell above, because the search would be selecting on a different metric.

In [13]:
rf_cv.best_score_

0.580528563620339

- Our model had an **`F1 score of 0.5805`**; not terrible. 
- Recall that when we ran our grid search, we specified that we also wanted to capture precision, recall, and accuracy. 
- The reason for doing this is that it's difficult to interpret an F1 score. 
- These other metrics are much more directly interpretable, so they're worth knowing. 
- The following cell defines a helper function that extracts these scores from the fit **`GridSearchCV`** object and returns  
 a pandas dataframe with all four scores from the model with the best average F1 score during validation.

In [14]:
def make_results(model_name, model_object):
    '''
    Accepts as arguments a model name (your choice - string) and
    a fit GridSearchCV model object.

    Returns a pandas df with the F1, recall, precision, and accuracy scores
    for the model with the best mean F1 score across all validation folds.
    '''

    # Get all the results from the CV and put them in a df
    cv_results = pd.DataFrame(model_object.cv_results_)

    # Isolate the row of the df with the max(mean f1 score)
    best_estimator_results = cv_results.loc[cv_results['mean_test_f1'].idxmax(), :]

    # Extract accuracy, precision, recall, and f1 score from that row
    f1 = best_estimator_results.mean_test_f1
    recall = best_estimator_results.mean_test_recall
    precision = best_estimator_results.mean_test_precision
    accuracy = best_estimator_results.mean_test_accuracy

    # Create table of results
    table = pd.DataFrame({'Model': [model_name],
                          'F1': [f1],
                          'Recall': [recall],
                          'Precision': [precision],
                          'Accuracy': [accuracy]
                         }
                        )

    return table

In [15]:
# Make a results table for the rf_cv model using above function
rf_cv_results = make_results('Random Forest CV', rf_cv)
rf_cv_results

| | Model | F1 | Recall | Precision | Accuracy |
|---|---|---|---|---|---|
| 0 | Random Forest CV | 0.580529 | 0.472517 | 0.756289 | 0.861333 |
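
As an aid to reading these numbers, recall that F1 is the harmonic mean of precision and recall. The sketch below plugs the mean precision and mean recall from the table into that formula. The variable names are our own.

```python
# Mean cross-validated scores copied from the results table.
precision = 0.756289
recall = 0.472517

# F1 is the harmonic mean of precision and recall.
f1_from_means = 2 * precision * recall / (precision + recall)
print(round(f1_from_means, 4))
```

The result is about 0.58, close to but not identical to the reported F1 of 0.580529. The small gap is expected: the table reports the mean of the per-fold F1 scores, which is not the same as the F1 of the mean precision and mean recall. Because the harmonic mean is pulled toward the lower of its two inputs, the F1 here sits closer to the recall (0.47) than to the precision (0.76).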


In [16]:
# Read in master results table
results = pd.read_csv('Results.csv', index_col=0)
results

| | Model | F1 | Recall | Precision | Accuracy |
|---|---|---|---|---|---|
| 0 | Tuned Decision Tree | 0.560655 | 0.469255 | 0.701608 | 0.8504 |


In [17]:
# Concatenate the random forest results to the master table
results = pd.concat([rf_cv_results, results])
results

| | Model | F1 | Recall | Precision | Accuracy |
|---|---|---|---|---|---|
| 0 | Random Forest CV | 0.580529 | 0.472517 | 0.756289 | 0.861333 |
| 0 | Tuned Decision Tree | 0.560655 | 0.469255 | 0.701608 | 0.8504 |


#### **5.3. Model selection and final results**

- Now we have two models. 
- If we've decided that we're done trying to optimize them, then we can now use our best model to predict on the test holdout data. 
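- We'll use the cross-validated random forest, whose best configuration sets **`max_depth`** to `None` (no depth limit).
- Because we set **`refit='f1'`**, **`GridSearchCV`** has already retrained that best configuration on all of the training data, so no further retraining is needed.
- If we had instead validated against a separate validation dataset, we'd now go back and retrain the model on the full training set **(training + validation sets)**.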

**Note**: _It might be tempting to see how all models perform on the test holdout data, and then to choose the one that performs best. While this **can** be done, it biases the final model, because you used your test data to go back and make an upstream decision. The test data should represent **unseen** data. In competitions, for example, you must submit your final model before receiving the test data._