# Bagging, Boosting and Random Forest

Decision trees are already quite flexible but they are limited by how the splits are selected. The *greedy algorithm* of selecting the best split for the current node can affect the performance of nodes lower in the tree. Due to this some enhancements have been made to these to help in the bias-variance tradeoff.

Bagging stands for "*bootstrap aggregation*" which consists of using bootstrap sampling to grow various trees with a partial sample of the data to then average the results of each tree. This is particularly helpful to reduce variance in predictions. 

Boosting consists of using many "weak" small trees with very *low variance* but to add them together so that the *bias* can be reduced. 

Finally, Random forests are the growing of multiple trees but changing the variable selected to do the split every time so there is no bias from the *greedy algorithm*. We proceed to fit all 3 decision tree enhancements in this section.

# Libraries

In [40]:
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text
from sklearn.model_selection import KFold, train_test_split, GridSearchCV, ParameterGrid
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

from yellowbrick.classifier import classification_report, confusion_matrix
from yellowbrick.classifier.rocauc import roc_auc
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.metrics import roc_curve
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Import Data

In [41]:
X_train = pd.read_csv('0_X_train.csv', index_col='Id')
X_valid = pd.read_csv('1_X_valid.csv', index_col='Id')
X_test  = pd.read_csv('2_X_test.csv', index_col='Id')

y_train = pd.read_csv('0_y_train.csv', index_col='Id')
y_valid = pd.read_csv('1_y_valid.csv', index_col='Id')
y_test  = pd.read_csv('2_y_test.csv', index_col='Id')

#full training set
X = pd.concat([X_train, X_valid, X_test], axis=0)
y = pd.concat([y_train, y_valid, y_test], axis=0)

We modify the Y values to format correctly as per the  `scikitlearn` functions.

In [42]:
y_train = np.array(y_train)
y_train = y_train.ravel()

y = np.array(y)
y = y.ravel()

# Build Bagged Tree

For each of the sections, we proceed to find the parameters that generate the best model by comparing their prediction accuracy scores. 

There are a series of parameters we can adjust. `min_samples_leaf` is the final number of observations in a determined terminal node. `n_estimators` is the amount of trees to be fitted for each round of model generation. `max_features` here is set to none which indicates the function can automatically decide which parameters to split on. This parameter will be modified in Random Forests.

In [43]:
bag_hyperparam_grid={'min_samples_leaf':[1, 3, 5, 7, 9,], "n_estimators":[500, 750, 1000, 1250, 1500, 2000,]}
bag = RandomForestClassifier(oob_score=True, max_features=None, criterion="log_loss",
                             warm_start=False, random_state=1, n_jobs=-2)
best_score=0.5

for g in ParameterGrid(bag_hyperparam_grid):
    bag.set_params(**g)
    bag.fit(X_train,y_train)
    # save if best
    if bag.oob_score_ > best_score:
        best_score = bag.oob_score_
        best_params = g

print(f"OOB: %0.5f" % best_score)
print("Best parameters:", best_params)

#Student note: takes over 7 mins to run

OOB: 0.85174
Best parameters: {'min_samples_leaf': 3, 'n_estimators': 2000}


Once the best parameters have been found we can safe to a new variable and print the predictive accuracy of this model. For this data set the best parameters are: `min_samples_leaf` = 3, `n_estimators` = 2000.

In [44]:
bag = RandomForestClassifier(min_samples_leaf = best_params["min_samples_leaf"], 
                             n_estimators = best_params["n_estimators"],oob_score=True, 
                             max_features=None, criterion="log_loss",warm_start=False, 
                             random_state=1, n_jobs=-2)

#create the bagged tree
# bag = RandomForestClassifier(min_samples_leaf = 3, n_estimators = 200,oob_score=True, 
#                              max_features=None, criterion="log_loss",warm_start=False, 
#                              random_state=1, n_jobs=-2)

## Results: Bagged Tree Best Model

In [45]:
#Print the results
print("Bagged tree Best Model")
print(" ")

#fit and predict on partitioned data set
bag.fit(X_train,y_train)
print("Training accuracy:", round(bag.score(X_train, y_train),4))
print("Validation accuracy:", round(bag.score(X_valid, y_valid),4))
print("Test accuracy:", round(bag.score(X_test, y_test),4))
print("X accuracy for Partially Trained Model:", round(bag.score(X, y),4))

#fit and predict on all data
bag.fit(X,y)
print("X accuracy on Fully Trained Model:", round(bag.score(X, y),4))

Bagged tree Best Model
 
Training accuracy: 0.9764
Validation accuracy: 0.8362
Test accuracy: 0.8474
X accuracy for Partially Trained Model: 0.936
X accuracy on Fully Trained Model: 0.9734


# Build Boosted Tree

For the boosted trees, an additional parameter is indicated. The learning rate ($\lambda$) adjusts the size of the tree at a certain rate for this it is called a *slow learner*. The `learning_rate` is  avalue between 0 and 1 and in finding the best model we indicate various values of this new parameter. Additionally, `max_depth` changes the number of splits done in the tree. As seen from regular trees, the number of nodes affects the prediction accuracy on all sets, so finding the most appropriate one is also important.

In [46]:
boost_hyperparam_grid={'max_depth':[1, 3, 5, 7, 9,], 
                       "n_estimators":[500, 750, 1000, 1250, 1500, 2000,], 
                       "learning_rate":[0.001, 0.01, 0.1, 0.5, 1]}

boost = GradientBoostingClassifier(max_features=None, 
                             warm_start=False, random_state=1)

best_boost_score=0.5

for bg in ParameterGrid(boost_hyperparam_grid):
    boost.set_params(**bg)
    boost.fit(X_train,y_train)
    # save if best
    if boost.score(X_train, y_train) > best_boost_score:
        best_boost_score = boost.score(X_train, y_train)
        best_boost_params = bg

print(f"OOB: %0.5f" % best_boost_score)
print("Best parameters:", best_boost_params)

#Student note: takes over 55 mins to run

OOB: 1.00000
Best parameters: {'learning_rate': 0.01, 'max_depth': 9, 'n_estimators': 1250}


As so we can see that the best combination tha best maximizes the accuracy is  `learning_rate`= 0.01, `max_depth`= 9, and `n_estimators`= 1250.

With this we train a new model and find the accuracy of the validation and test sets.

In [47]:
boost = GradientBoostingClassifier(max_depth = best_boost_params["max_depth"], 
                                  n_estimators = best_boost_params["n_estimators"],
                                  learning_rate = best_boost_params["learning_rate"],
                                  random_state=1)

# boost = GradientBoostingClassifier(max_depth = 9, 
#                                    n_estimators = 1250,
#                                    learning_rate = 0.01,
#                                    random_state=1)

## Results: Boosted Tree Best Model

In [48]:
#Print the results
print("Boosted tree Best Model")
print(" ")

#fit and predict on partitioned data set
boost.fit(X_train,y_train)
print("Training accuracy:", round(boost.score(X_train, y_train),4))
print("Validation accuracy:", round(boost.score(X_valid, y_valid),4))
print("Test accuracy:", round(boost.score(X_test, y_test),4))
print("X accuracy for Partially Trained Model:", round(boost.score(X, y),4))

#fit and predict on all data
boost.fit(X,y)
print("X accuracy on Fully Trained Model:", round(boost.score(X, y),4))

Boosted tree Best Model
 
Training accuracy: 1.0
Validation accuracy: 0.8392
Test accuracy: 0.86
X accuracy for Partially Trained Model: 0.9549
X accuracy on Fully Trained Model: 0.9977


For matters of comparison, we now run the best boosted model, as it has had a very good performance so far, with the predictors selected from Lasso variable selection to see if there is any improvement.

In [49]:
#establish unrequired variable names
cols_lasso = ['banner_views_old', 'days_elapsed_old', 'X3', 'marital_divorced', 'job_entrepreneur', 'job_freelance',
             'job_housekeeper', 'job_technology', 'job_unemployed']

#create new partitions with corresponding lasso variables
x_train_mod = X_train.drop(columns=cols_lasso)
x_valid_mod = X_valid.drop(columns=cols_lasso)
x_test_mod = X_test.drop(columns=cols_lasso)

In [50]:
#fit the model on new training data
boost.fit(x_train_mod, y_train)

#show results of model
print("Training accuracy:", round(boost.score(x_train_mod, y_train),4))
print("Validation acccuracy:", round(boost.score(x_valid_mod, y_valid),4))
print("Test accuracy:", round(boost.score(x_test_mod, y_test),4))

Training accuracy: 0.9998
Validation acccuracy: 0.8295
Test accuracy: 0.8511


As we can see there is no improvement from full variable boosting and the variable selection. Now we proceed to the same procedure with Random Forests.

# Random Forest


The new parameter to choose in Random Forest is the number of features. This refers to the number of variables used in the random selection for growing new trees. To establish an appropopriate amount, rule of thumb is to start with the $\sqrt{p}$, where $p$ is the number of predictors in the data set.

In [51]:
default_max_features_param = np.sqrt(X_train.shape[1])
default_max_features_param

5.5677643628300215

As in this case its close to 5.5 we go two units below 5 and 2 units above 6 to have a decent set to choose from.

In [52]:
#set hyperparameter grid
rf_hyperparam_grid={"max_features":[3, 4, 5, 6, 7, 8],
                 'min_samples_leaf':[1, 3, 5, 7, 9],
                 "n_estimators":[500, 750, 1000, 1250, 1500, 2000]}

#instantiate the randfortest
rfc = RandomForestClassifier(criterion = "log_loss", oob_score=True, warm_start=False, random_state=1, n_jobs=-2)
best_score=0.5

#loop over parameters. running duration 24mins.
for rfg in ParameterGrid(rf_hyperparam_grid):
    rfc.set_params(**rfg)
    rfc.fit(X_train,y_train)
    # save if best
    if rfc.oob_score_ > best_score:
        best_rfc_score = rfc.oob_score_
        best_rfc_params = rfg

#print best results
print(f"OOB: %0.5f" % best_rfc_score)
print("Best parameters:", best_rfc_params)

#Student note: takes over 30 mins to run

OOB: 0.84583
Best parameters: {'max_features': 8, 'min_samples_leaf': 9, 'n_estimators': 2000}


We can see that the best combination is `max_features` = 7, `min_samples_leaf`= 9, and `n_estimators`= 2000. Now the final model is saved to a variable.

In [53]:
# rfc = RandomForestClassifier(n_estimators=best_rfc_params['max_features'], 
#                              max_features=best_rfc_params['max_features'],
#                              min_samples_leaf=best_rfc_params['min_samples_leaf'], 
#                              criterion="log_loss",oob_score=True, warm_start=False, random_state=1)

rfc = RandomForestClassifier(n_estimators=2000, 
                             max_features=7,
                             min_samples_leaf=9, 
                             criterion="log_loss",oob_score=True, warm_start=False, random_state=1)

## Results: Random Forest Best Model

In [54]:
#Print the results
print("Random Forest Best Model")
print(" ")

#fit and predict on partitioned data set
rfc.fit(X_train,y_train)
print("Training accuracy:", round(rfc.oob_score_,4))
print("Validation accuracy:", round(rfc.score(X_valid, y_valid),4))
print("Test accuracy:", round(rfc.score(X_test, y_test),4))
print("X accuracy for Partially Trained Model:", round(rfc.score(X, y),4))

#fit and predict on all data
rfc.fit(X,y)
print("X accuracy on Fully Trained Model:", round(rfc.score(X, y),4))

Random Forest Best Model
 
Training accuracy: 0.8438
Validation accuracy: 0.8258
Test accuracy: 0.8496
X accuracy for Partially Trained Model: 0.873
X accuracy on Fully Trained Model: 0.8859


Finally, we can see that despite the result being better than the original tree, it did not improve upon the boosting results obtained previously.