***
***
***

<br><h2>Session 7b | Ensemble Modeling</h2>
<h4>DAT-5303 | Machine Learning</h4>
Chase Kusterer - Faculty of Analytics<br>
Hult International Business School<br><br><br>

***
***
***

<h3>Part I: Preparation</h3><br>
Run the following code to import necessary packages, load data, and set display options for pandas. 

In [None]:
########################################
# importing packages
########################################
import matplotlib.pyplot as plt                      # data visualization
import pandas as pd                                  # data science essentials
from sklearn.model_selection import train_test_split # train-test split
from sklearn.metrics import roc_auc_score            # auc score
from sklearn.model_selection import GridSearchCV     # hyperparameter tuning
from sklearn.metrics import make_scorer              # customizable scorer


# new packages
from sklearn.ensemble import RandomForestClassifier     # random forest
from sklearn.ensemble import GradientBoostingClassifier # gbm


########################################
# loading data and setting display options
########################################
# loading data
titanic = pd.read_excel('titanic_feature_rich.xlsx')


# loading model performance
model_performance = pd.read_excel('Classification Model Performance.xlsx')


# setting pandas print options
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', 100)


########################################
# explanatory variable sets
########################################
candidate_dict = {

 # full model
 'logit_full'   : ['age', 'sibsp', 'parch', 'fare', 'm_age', 'm_cabin',
                   'm_home.dest', 'potential_youth', 'child',
                   'number_of_names', 'pclass_1', 'pclass_2', 'female'],
 
 # significant variables only
 'logit_sig'    : ['age' , 'sibsp', 'm_cabin', 'number_of_names',
                   'pclass_1', 'female']

}


########################################
# checking previous model performances
########################################
model_performance

***
***

<br>
<strong>User-Defined Functions</strong><br>
Run the following code to load the user-defined functions used throughout this Notebook.

In [None]:
########################################
# plot_feature_importances
########################################
def plot_feature_importances(model, train, export = False):
    """
    Plots the importance of features from a tree-based model
    (CART, random forest, or GBM).
    
    PARAMETERS
    ----------
    model  : fitted tree-based model with a feature_importances_ attribute
    train  : explanatory variable training data
    export : whether or not to export as a .png image, default False
    """
    
    # declaring the number of features
    n_features = train.shape[1]
    
    # setting plot window
    fig, ax = plt.subplots(figsize=(12,9))
    
    # plotting one horizontal bar per feature
    # (range is used because pd.np is unavailable in current pandas)
    plt.barh(range(n_features), model.feature_importances_, align='center')
    plt.yticks(range(n_features), train.columns)
    plt.xlabel("Feature importance")
    plt.ylabel("Feature")
    
    # exporting the plot if requested
    if export:
        plt.savefig('Feature_Importance.png')

***
***

<br>
<strong>Challenge 1</strong><br>
Complete the code to split the dataset into training and testing sets using the <em>logit_sig</em> set of explanatory variables.

In [None]:
# declaring explanatory and response data (logit_sig variables)
titanic_data   =  titanic.loc[ : , candidate_dict['logit_sig']]
titanic_target =  titanic.loc[ : , 'm_boat']


# train/test split
X_train, X_test, y_train, y_test = train_test_split(
            titanic_data,
            titanic_target,
            random_state = 802,
            test_size    = 0.25,
            stratify     = titanic_target)

***
***

<br>

<h3>Part II: Random Forest</h3><br>
A random forest can be thought of as a group of decision trees that are all slightly different. Each tree is built on a random (bootstrap) sample of the observations, and only a random subset of the explanatory variables is considered as the tree is grown. After building several trees, each observation has several different results for its prediction. This can be thought of as giving each tree a vote as to what to predict for each observation.

For example, one observation may have been voted positive 80% of the time (the event in question occurred), and voted negative 20% of the time (the event in question did not occur). After all votes have been cast, whichever class (positive or negative) has the most votes wins, and prediction on the observation is complete.<br><br><br>
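As a rough illustration of the ideas above, the sketch below draws a random sample of rows and a random subset of columns for each tree, then settles one observation's prediction by majority vote. The column names come from <em>logit_sig</em>; the row count, the number of trees, and the votes themselves are made-up values for illustration, not output from a fitted model.

```python
import random
from collections import Counter

random.seed(802)

# hypothetical sizes, chosen only for illustration
n_rows   = 8
features = ['age', 'sibsp', 'm_cabin', 'number_of_names', 'pclass_1', 'female']
n_trees  = 5

# each tree gets its own bootstrap sample of rows and random subset of columns
# (simplified: columns are drawn once per tree in this sketch)
for tree in range(n_trees):
    rows = [random.randrange(n_rows) for _ in range(n_rows)]  # with replacement
    cols = random.sample(features, k = 2)                     # without replacement
    print(f"tree {tree}: rows = {sorted(rows)}, columns = {cols}")

# hypothetical votes from the five trees for a single observation
votes = [1, 1, 0, 1, 1]

# counting votes and taking the most common class
vote_counts = Counter(votes)
prediction  = vote_counts.most_common(1)[0][0]
share       = vote_counts[prediction] / len(votes)

print('Predicted class:', prediction)
print('Share of votes :', share)
```

Note that scikit-learn's RandomForestClassifier differs from this sketch in two ways: it redraws the candidate columns at every split rather than once per tree, and it averages the trees' predicted class probabilities rather than counting hard votes.<br><br>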
<strong>Challenge 2</strong><br>
Build a random forest model using the significant set of explanatory variables ( <em>logit_sig</em> ) and default values for the hyperparameters listed below. Remember, default values are documented in help( ) files.

In [None]:
# INSTANTIATING a random forest model with the values listed below
# (n_estimators defaulted to 10 before scikit-learn 0.22; the default is now 100)
rf_default = RandomForestClassifier(n_estimators     = 10,
                                    criterion        = 'gini',
                                    max_depth        = None,
                                    min_samples_leaf = 1,
                                    bootstrap        = True,
                                    warm_start       = False,
                                    random_state     = 802)

***

In [None]:
# FITTING the training data
rf_default_fit = rf_default.fit(X_train, y_train)


# PREDICTING based on the testing set
rf_default_fit_pred = rf_default_fit.predict(X_test)


# SCORING the results
print('Training ACCURACY:', rf_default_fit.score(X_train, y_train).round(4))
print('Testing  ACCURACY:', rf_default_fit.score(X_test, y_test).round(4))
print('AUC Score        :', roc_auc_score(y_true  = y_test,
                                          y_score = rf_default_fit_pred).round(4))

***
***

<br>
<strong>Challenge 3</strong><br>
Write and run the feature importance function in the code cell below.

In [None]:
plot_feature_importances(rf_default_fit,
                         train = X_train,
                         export = False)

***
***

<br>
Run the following code to write the results of the default random forest model to model_performance.

In [None]:
# declaring model performance objects
rf_train_acc = rf_default_fit.score(X_train, y_train).round(4)
rf_test_acc  = rf_default_fit.score(X_test, y_test).round(4)
rf_auc       = roc_auc_score(y_true  = y_test,
                             y_score = rf_default_fit_pred).round(4)


# appending to model_performance
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
model_performance = pd.concat(
                        [model_performance,
                         pd.DataFrame([{'Model'             : 'Random Forest',
                                        'Training Accuracy' : rf_train_acc,
                                        'Testing Accuracy'  : rf_test_acc,
                                        'AUC Value'         : rf_auc}])],
                        ignore_index = True)


# checking the results
model_performance

***
***

<br>
<strong>Challenge 4</strong><br>
Prepare the code below so that it splits the data into training and testing sets on the full set of explanatory variables (<em>logit_full</em>).

In [None]:
# declaring explanatory and response data (logit_full variables)
titanic_data   =  titanic.loc[ : , candidate_dict['logit_full']]
titanic_target =  titanic.loc[ : , 'm_boat']


# train/test split
X_train_full, X_test_full, y_train_full, y_test_full = train_test_split(
            titanic_data,
            titanic_target,
            random_state = 802,
            test_size    = 0.25,
            stratify     = titanic_target)

***
***

<br>
Run the following code to develop a random forest using default values for each hyperparameter listed below.

In [None]:
# INSTANTIATING a random forest model with the values listed below
# (n_estimators defaulted to 10 before scikit-learn 0.22; the default is now 100)
rf_default_full = RandomForestClassifier(n_estimators     = 10,
                                         criterion        = 'gini',
                                         max_depth        = None,
                                         min_samples_leaf = 1,
                                         bootstrap        = True,
                                         warm_start       = False,
                                         random_state     = 802)


# FITTING the training data
rf_default_full_fit = rf_default_full.fit(X_train_full, y_train_full)


# PREDICTING based on the testing set
rf_default_full_pred = rf_default_full_fit.predict(X_test_full)


# SCORING the results
print('Training ACCURACY:', rf_default_full_fit.score(X_train_full, y_train_full).round(4))
print('Testing  ACCURACY:', rf_default_full_fit.score(X_test_full, y_test_full).round(4))
print('AUC Score        :', roc_auc_score(y_true  = y_test_full,
                                          y_score = rf_default_full_pred).round(4))

***
***

<br>
<strong>Challenge 5</strong><br>
Write and run the feature importance function in the code cell below.

In [None]:
# plotting feature importance
plot_feature_importances(rf_default_full_fit,
                         train = X_train_full,
                         export = False)

***
***

<br>

<strong>Random Forest with Tuned Hyperparameters</strong><br>
Run the following code to automate hyperparameter optimization for a random forest model.

In [None]:
# declaring a hyperparameter space
# (plain ranges are used because pd.np is unavailable in current pandas)
estimator_space  = list(range(100, 1100, 250))
leaf_space       = list(range(1, 31, 10))
criterion_space  = ['gini', 'entropy']
bootstrap_space  = [True, False]
warm_start_space = [True, False]


# creating a hyperparameter grid
param_grid = {'n_estimators'     : estimator_space,
              'min_samples_leaf' : leaf_space,
              'criterion'        : criterion_space,
              'bootstrap'        : bootstrap_space,
              'warm_start'       : warm_start_space}


# INSTANTIATING the model object without hyperparameters
full_forest_grid = RandomForestClassifier(random_state = 802)


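# GridSearchCV object
# (make_scorer scores hard class predictions by default; the needs_threshold
#  argument was deprecated in scikit-learn 1.4, so it is omitted here)
full_forest_cv = GridSearchCV(estimator  = full_forest_grid,
                              param_grid = param_grid,
                              cv         = 3,
                              scoring    = make_scorer(roc_auc_score))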


# FITTING to the FULL DATASET (due to cross-validation)
full_forest_cv.fit(titanic_data, titanic_target)


# PREDICT step is not needed


# printing the optimal parameters and best (mean cross-validated) score
print("Tuned Parameters  :", full_forest_cv.best_params_)
print("Tuned CV AUC      :", full_forest_cv.best_score_.round(4))

***
***
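
To get a feel for why the search above is slow, the sketch below counts the candidate models in its grid. It rebuilds the same hyperparameter values as plain Python lists and multiplies the number of combinations by the three cross-validation folds.

```python
from itertools import product

# same values as the hyperparameter space above, written as plain lists
estimator_space  = list(range(100, 1100, 250))  # [100, 350, 600, 850]
leaf_space       = list(range(1, 31, 10))       # [1, 11, 21]
criterion_space  = ['gini', 'entropy']
bootstrap_space  = [True, False]
warm_start_space = [True, False]

# every combination that the grid search will try
combinations = list(product(estimator_space, leaf_space, criterion_space,
                            bootstrap_space, warm_start_space))

# each combination is fit once per cross-validation fold
n_folds = 3
n_fits  = len(combinations) * n_folds

print('Candidate models:', len(combinations))
print('Model fits      :', n_fits)
```

With the default refit = True, GridSearchCV also fits one more model on all of the data using the best combination.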

<br>
Automated hyperparameter optimization can take a long time. In order to avoid having to run this each time a script is loaded, it is a good practice to:<br>

1. Run an automated hyperparameter optimization technique and record its results
2. Comment out (but not delete) the hyperparameter optimization code
3. Manually set each hyperparameter when building a tuned model

<br>
This will help alleviate processing bottlenecks while allowing you to uncomment and rerun the optimization algorithm as needed.<br><br><br>
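One way to record the results (step 1) is to write the winning hyperparameters to a small file and read them back later. A minimal sketch, assuming the values were copied from the best_params_ attribute of a grid search; the values mirror those entered in Challenge 6 and the file name 'rf_best_params.json' is hypothetical.

```python
import json
import os
import tempfile

# values standing in for full_forest_cv.best_params_
best_params = {'bootstrap'        : True,
               'criterion'        : 'gini',
               'min_samples_leaf' : 11,
               'n_estimators'     : 850,
               'warm_start'       : True}

# writing to a temporary folder to keep the working directory clean
path = os.path.join(tempfile.mkdtemp(), 'rf_best_params.json')

with open(path, mode = 'w') as file:
    json.dump(best_params, file, indent = 4)

# later: reading the hyperparameters instead of rerunning the search
with open(path) as file:
    saved_params = json.load(file)

print(saved_params)
```

The saved dictionary can then be unpacked when instantiating a model, for example RandomForestClassifier(**saved_params, random_state = 802). Values taken directly from best_params_ may be NumPy types, which json cannot serialize, so they may need converting with int( ) or float( ) first.<br><br>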
<strong>Challenge 6</strong><br>
Instead of utilizing the <em>.best_estimator_</em> attribute of <em>GridSearchCV</em>, manually input the optimal set of hyperparameters when instantiating the model.

In [None]:
# INSTANTIATING the model object with tuned hyperparameters
full_rf_tuned = RandomForestClassifier(bootstrap        = True,
                                       criterion        = 'gini',
                                       min_samples_leaf = 11,
                                       n_estimators     = 850,
                                       warm_start       = True,
                                       random_state     = 802)


# FIT step is needed as we are not using .best_estimator_
full_rf_tuned_fit = full_rf_tuned.fit(X_train, y_train)


# PREDICTING based on the testing set
full_rf_tuned_pred = full_rf_tuned_fit.predict(X_test)


# SCORING the results
print('Training ACCURACY:', full_rf_tuned_fit.score(X_train, y_train).round(4))
print('Testing  ACCURACY:', full_rf_tuned_fit.score(X_test, y_test).round(4))
print('AUC Score        :', roc_auc_score(y_true  = y_test,
                                          y_score = full_rf_tuned_pred).round(4))

***
***

<br>
Run the following code to write the results of the tuned classification model to model_performance.

In [None]:
# declaring model performance objects
rf_train_acc = full_rf_tuned_fit.score(X_train, y_train).round(4)
rf_test_acc  = full_rf_tuned_fit.score(X_test, y_test).round(4)
rf_auc       = roc_auc_score(y_true  = y_test,
                             y_score = full_rf_tuned_pred).round(4)


# appending to model_performance
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
model_performance = pd.concat(
                        [model_performance,
                         pd.DataFrame([{'Model'             : 'Tuned Random Forest',
                                        'Training Accuracy' : rf_train_acc,
                                        'Testing Accuracy'  : rf_test_acc,
                                        'AUC Value'         : rf_auc}])],
                        ignore_index = True)


# checking the results
model_performance

***
***

<br>
<h3>Part III: Gradient Boosted Machines</h3><br>
Gradient boosted machines (GBMs) are built from decision trees, but instead of starting fresh with each iteration, each new tree learns from the mistakes made in previous iterations. Unlike a random forest, which varies its trees by sampling observations and explanatory variables, a GBM builds its trees in sequence, with each tree concentrating on the observations (rows) that the model so far predicts poorly.<br><br>
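The idea of learning from previous mistakes can be sketched in a few lines. The example below is a simplified illustration for a numeric target with squared-error loss, not scikit-learn's implementation: each round fits a one-split tree (a stump) to the current residuals and adds a shrunken copy of it to the running prediction. The data, the learning rate, and the helper fit_stump are made up for illustration.

```python
# toy data: one explanatory variable and a numeric target
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]


def mean(values):
    return sum(values) / len(values)


def fit_stump(x, residuals):
    """Finds the single split on x that minimizes squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left  = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi >  threshold]
        sse   = (sum((r - mean(left))  ** 2 for r in left) +
                 sum((r - mean(right)) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, mean(left), mean(right))
    return best[1:]


learning_rate = 0.5
prediction    = [mean(y)] * len(y)   # starting point: predict the average
errors        = []

for round_number in range(5):
    # each new stump is fit to what the model still gets wrong
    residuals = [yi - pi for yi, pi in zip(y, prediction)]
    errors.append(mean([r ** 2 for r in residuals]))
    threshold, left_value, right_value = fit_stump(x, residuals)

    # adding a shrunken correction to the running prediction
    prediction = [pi + learning_rate * (left_value if xi <= threshold else right_value)
                  for xi, pi in zip(x, prediction)]

# mean squared error at the start of each round
print([round(e, 4) for e in errors])
```

The printed errors shrink from round to round, which is the sense in which each tree corrects the ones before it. A smaller learning rate makes each correction more cautious, which is why learning_rate and n_estimators are usually tuned together.<br><br>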
<strong>Challenge 7</strong><br>
Develop a <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html">GradientBoostingClassifier</a> model with default values for the hyperparameters listed below. Remember, default values are documented in help( ) files.

In [None]:
# INSTANTIATING the model object with default hyperparameters
# (loss = 'log_loss' was named 'deviance' before scikit-learn 1.1)
full_gbm_default = GradientBoostingClassifier(loss          = 'log_loss',
                                              learning_rate = 0.1,
                                              n_estimators  = 100,
                                              criterion     = 'friedman_mse',
                                              max_depth     = 3,
                                              warm_start    = False,
                                              random_state  = 802)


# FITTING the training data
full_gbm_default_fit = full_gbm_default.fit(X_train, y_train)


# PREDICTING based on the testing set
full_gbm_default_pred = full_gbm_default_fit.predict(X_test)


# SCORING the results
print('Training ACCURACY:', full_gbm_default_fit.score(X_train, y_train).round(4))
print('Testing  ACCURACY:', full_gbm_default_fit.score(X_test, y_test).round(4))
print('AUC Score        :', roc_auc_score(y_true  = y_test,
                                          y_score = full_gbm_default_pred).round(4))

***
***

<br>
Notice above that we are using <em>friedman_mse</em> as the criterion. This is mean squared error combined with the improvement score proposed by Friedman, which is used to judge the quality of each candidate split within the tree rather than relying on one MSE value for the entire tree. See <a href="https://statweb.stanford.edu/~jhf/ftp/stobst.pdf">this research paper</a> for more details.
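
As a rough sketch of what such a criterion measures, Friedman's improvement score rates a candidate split by how far apart the two child means are, weighted by the child sizes. The example below applies that formula to made-up data; the helper name friedman_improvement is hypothetical and this is not scikit-learn's internal code.

```python
def friedman_improvement(left, right):
    """Improvement score for splitting a node into left and right children."""
    n_left, n_right = len(left), len(right)
    mean_left  = sum(left)  / n_left
    mean_right = sum(right) / n_right

    # (n_l * n_r) / (n_l + n_r) * (mean_l - mean_r) ** 2
    return n_left * n_right / (n_left + n_right) * (mean_left - mean_right) ** 2


# made-up target values, already sorted by a single explanatory variable
y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]

# scoring every possible split position
scores = {position : friedman_improvement(y[:position], y[position:])
          for position in range(1, len(y))}

# the split with the highest score separates the two groups best
best_position = max(scores, key = scores.get)

print('Best split position:', best_position)
```

The score is largest where the split leaves two sizeable groups with clearly different averages, which here is between the third and fourth values.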

In [None]:
# declaring model performance objects
gbm_train_acc = full_gbm_default_fit.score(X_train, y_train).round(4)
gbm_test_acc  = full_gbm_default_fit.score(X_test, y_test).round(4)
gbm_auc       = roc_auc_score(y_true  = y_test,
                              y_score = full_gbm_default_pred).round(4)


# appending to model_performance
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
model_performance = pd.concat(
                        [model_performance,
                         pd.DataFrame([{'Model'             : 'GBM',
                                        'Training Accuracy' : gbm_train_acc,
                                        'Testing Accuracy'  : gbm_test_acc,
                                        'AUC Value'         : gbm_auc}])],
                        ignore_index = True)


# checking the results
model_performance

***
***

<br>
<strong>Challenge 8</strong><br>
Complete the code to perform hyperparameter optimization on a GBM model using the full dataset.

In [None]:
# declaring a hyperparameter space
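# (plain lists are used because pd.np is unavailable in current pandas)
learn_space     = [0.1, 0.4, 0.7, 1.0, 1.3]     # from 0.1 up to (not including) 1.6 in steps of 0.3
estimator_space = list(range(50, 250, 50))
depth_space     = list(range(1, 10))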


# creating a hyperparameter grid
param_grid = {'learning_rate' : learn_space,
              'max_depth'     : depth_space,
              'n_estimators'  : estimator_space}


# INSTANTIATING the model object without hyperparameters
full_gbm_grid = GradientBoostingClassifier(random_state = 802)


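# GridSearchCV object
# (make_scorer scores hard class predictions by default; the needs_threshold
#  argument was deprecated in scikit-learn 1.4, so it is omitted here)
full_gbm_cv = GridSearchCV(estimator  = full_gbm_grid,
                           param_grid = param_grid,
                           cv         = 3,
                           scoring    = make_scorer(roc_auc_score))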


# FITTING to the FULL DATASET (due to cross-validation)
full_gbm_cv.fit(titanic_data, titanic_target)


# PREDICT step is not needed


# printing the optimal parameters and best (mean cross-validated) score
print("Tuned Parameters  :", full_gbm_cv.best_params_)
print("Tuned CV AUC      :", full_gbm_cv.best_score_.round(4))

***
***

<br>
<strong>Challenge 9</strong><br>
Manually input the optimal set of hyperparameters when instantiating the model.

In [None]:
# INSTANTIATING the model object with tuned hyperparameters
gbm_tuned = GradientBoostingClassifier(learning_rate = 0.1,
                                       max_depth     = 2,
                                       n_estimators  = 100,
                                       random_state  = 802)


# FIT step is needed as we are not using .best_estimator_
gbm_tuned_fit = gbm_tuned.fit(X_train, y_train)


# PREDICTING based on the testing set
gbm_tuned_pred = gbm_tuned_fit.predict(X_test)


# SCORING the results
print('Training ACCURACY:', gbm_tuned_fit.score(X_train, y_train).round(4))
print('Testing  ACCURACY:', gbm_tuned_fit.score(X_test, y_test).round(4))
print('AUC Score        :', roc_auc_score(y_true  = y_test,
                                          y_score = gbm_tuned_pred).round(4))

***
***

<br>
<strong>Challenge 10</strong><br>
Write the results of the tuned classification model to model_performance.

In [None]:
# declaring model performance objects
gbm_train_acc = gbm_tuned_fit.score(X_train, y_train).round(4)
gbm_test_acc  = gbm_tuned_fit.score(X_test, y_test).round(4)
gbm_auc       = roc_auc_score(y_true  = y_test,
                              y_score = gbm_tuned_pred).round(4)


# appending to model_performance
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
model_performance = pd.concat(
                        [model_performance,
                         pd.DataFrame([{'Model'             : 'Tuned GBM',
                                        'Training Accuracy' : gbm_train_acc,
                                        'Testing Accuracy'  : gbm_test_acc,
                                        'AUC Value'         : gbm_auc}])],
                        ignore_index = True)


# checking the results
model_performance

***
***

<br>
Run the following code to view each model's performance metrics.

In [None]:
model_performance.sort_values(by = 'AUC Value',
                              ascending = False)

***

<br>

Run the following code to save model_performance as an Excel file.

In [None]:
# saving the DataFrame to Excel
model_performance.to_excel('Classification Model Performance.xlsx',
                           index = False)