## Modeling with Ensemble Method: Bagging

In this notebook I will try out a few iterations of the **Bagging algorithm** and see which parameters and features maximize our **Class 2 recall score**.

In [3]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [4]:
#import
import os, sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.model_selection import train_test_split, cross_validate, cross_val_score
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.metrics import confusion_matrix, plot_confusion_matrix
from sklearn.metrics import classification_report, recall_score, make_scorer
from sklearn.dummy import DummyClassifier
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import GridSearchCV

module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)
    
#import customized functions
from src.data_cleaning import cleaning_functions as cfs
from src.data_cleaning import exploration_functions as efs
from src.data_cleaning import processing_functions as pfs
from src.data_cleaning import modeling_functions as mfs

In [5]:
X_train, X_test, y_train, y_test, classes_dict = pfs.processed_dataset()

In [6]:
X_train, X_test = pfs.ohe_train_and_test_features(X_train, X_test)

In [12]:
scorer = mfs.scorer()
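
The body of `mfs.scorer()` is not shown in this notebook; the grid-search repr further down prints it as `make_scorer(class_2_recall)`. Here is a minimal sketch of what such a helper might look like, assuming `class_2_recall` simply restricts scikit-learn's `recall_score` to label 2. The function bodies below are a guess for illustration, not the project's actual code.

```python
from sklearn.metrics import make_scorer, recall_score


def class_2_recall(y_true, y_pred):
    """Recall for class 2 only: TP_2 / (TP_2 + FN_2)."""
    # labels=[2] restricts the computation to class 2; with a single label,
    # the macro average is just that label's recall.
    return recall_score(y_true, y_pred, labels=[2], average="macro")


def scorer():
    """Hypothetical stand-in for mfs.scorer()."""
    return make_scorer(class_2_recall)


# Three true class-2 rows, two of them predicted as class 2.
example_recall = class_2_recall([0, 2, 2, 1, 2], [0, 2, 0, 1, 2])
```

Passing the resulting object as `scoring=` makes `cross_val_score` and `GridSearchCV` rank models by class 2 recall alone, ignoring the other two classes.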

#### Let's check out our class distribution again

In [14]:
y_train.value_counts(normalize=True)

0    0.544310
2    0.384646
1    0.071044
Name: target, dtype: float64

Let's use **SMOTE** to fix this imbalance first.

In [15]:
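# Oversample the minority classes. fit_resample is the current imbalanced-learn
# method name; fit_sample was an alias that later releases removed.
X_train_smoted, y_train_smoted = SMOTE(random_state=2020).fit_resample(X_train.values, y_train)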

In [17]:
y_train_smoted.value_counts()

2    24249
1    24249
0    24249
Name: target, dtype: int64
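
All three classes now have the same count because SMOTE tops up each minority class with synthetic rows. The core idea is interpolation: pick a minority sample, pick one of its nearest minority-class neighbors, and create a new point somewhere on the line segment between them. The sketch below shows only that interpolation step; the neighbor search is omitted, and the function name and the two points are made up for illustration.

```python
import random


def smote_like_sample(point, neighbor, rng):
    """Return a synthetic point on the segment between two minority samples."""
    gap = rng.random()  # uniform in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]


rng = random.Random(2020)
minority_point = [1.0, 4.0]     # hypothetical minority-class row
minority_neighbor = [3.0, 2.0]  # hypothetical nearest neighbor from the same class
synthetic = smote_like_sample(minority_point, minority_neighbor, rng)
```

Each coordinate of `synthetic` lies between the corresponding coordinates of the two input points, so the new row stays inside the region the minority class already occupies.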

#### We'll use both the un-smoted and smoted sets for modeling from here on

#### Starting our modeling with BaggingClassifier

>A Bagging classifier is an ensemble meta-estimator that fits base classifiers each on random subsets of the original dataset and then aggregate their individual predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used as a way to reduce the variance of a black-box estimator (e.g., a decision tree), by introducing randomization into its construction procedure and then making an ensemble out of it.
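
To make the quoted description concrete, here is a toy sketch of the two steps it names: fit each base learner on a bootstrap resample of the data, then combine their predictions by majority vote. This is not how `BaggingClassifier` is implemented internally; the nearest-mean base learner, the function names, and the tiny dataset are all invented for illustration.

```python
import random
from collections import Counter


def fit_nearest_mean(samples):
    """Toy base learner: remember the mean feature value of each class."""
    sums, counts = {}, Counter()
    for x, label in samples:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}


def predict_nearest_mean(model, x):
    """Predict the class whose stored mean is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))


def fit_bagging(samples, n_estimators=10, seed=2020):
    """Fit one base learner per bootstrap resample (drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_estimators):
        bootstrap = [rng.choice(samples) for _ in samples]
        models.append(fit_nearest_mean(bootstrap))
    return models


def predict_bagging(models, x):
    """Aggregate the base learners' predictions by majority vote."""
    votes = Counter(predict_nearest_mean(m, x) for m in models)
    return votes.most_common(1)[0][0]


# One feature per row, labels 0 and 2.
data = [(0.1, 0), (0.3, 0), (0.2, 0), (2.0, 2), (2.2, 2), (1.9, 2)]
ensemble = fit_bagging(data)
low_prediction = predict_bagging(ensemble, 0.0)
high_prediction = predict_bagging(ensemble, 2.5)
```

Because every base learner sees a slightly different resample, their individual mistakes tend to differ, and the vote averages some of that variance away.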

In [18]:
# Instantiating a BaggingClassifier
bagc = BaggingClassifier(random_state=2020)
bagc.fit(X_train_smoted, y_train_smoted)

BaggingClassifier(random_state=2020)

In [21]:
cross_val_score(bagc, X_train_smoted, y_train_smoted, scoring=scorer, cv=3)

array([0.75132995, 0.73611283, 0.56625015])

Class 2 recall is roughly 75% and 74% on two of the three folds but drops to about 57% on the third. Not a totally awful start, so let's keep at it.
We'll change a few parameters and see if the scores move.

Let's first compare it with the un-smoted data plus balanced class weights and see if that does anything.

In [37]:
dt = DecisionTreeClassifier(random_state=2020, class_weight='balanced')
bagc2 = BaggingClassifier(dt, random_state=2020)
bagc2.fit(X_train, y_train)
print('New Class 2 recall score :','\n',cross_val_score(bagc2, X_train, y_train, scoring=scorer, cv=3))

New Class 2 recall score : 
 [0.7359944  0.73564426 0.73897059]
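
For reference, `class_weight='balanced'` weights each class inversely to its frequency, using `n_samples / (n_classes * class_count)`. The sketch below applies that formula to the class proportions printed earlier in this notebook; since only the ratios matter, proportions work as well as raw counts. The helper name is made up for illustration.

```python
def balanced_class_weights(class_sizes):
    """Weights per class as n_samples / (n_classes * class_size)."""
    n_samples = sum(class_sizes.values())
    n_classes = len(class_sizes)
    return {
        label: n_samples / (n_classes * size)
        for label, size in class_sizes.items()
    }


# Class proportions from y_train.value_counts(normalize=True) above.
proportions = {0: 0.544310, 2: 0.384646, 1: 0.071044}
weights = balanced_class_weights(proportions)
```

The rarest class (1) ends up with the largest weight and the majority class (0) with the smallest, so errors on rare classes cost the tree more without any resampling.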


Wow, that was much more consistent across the folds. Interesting!

Now let's try changing the parameters. I'll try a grid search first; if it takes too long, I'll tune the hyperparameters manually.

In [24]:
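# Defining param grid.
# Note: integer max_samples values are absolute row counts per base estimator,
# so each tree here is trained on only 5, 15, or 20 rows (floats would be fractions).
param_grid = {
    'n_estimators': [20, 50, 80],
    'max_samples': [5, 15, 20]
}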

# Initializing gridsearch with 3-fold cross-validation.
gs = GridSearchCV(estimator=bagc2, param_grid=param_grid, cv=3, scoring=scorer)

In [25]:
gs.fit(X_train, y_train)

GridSearchCV(cv=3,
             estimator=BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight='balanced',
                                                                               random_state=2020),
                                         random_state=2020),
             param_grid={'max_samples': [5, 15, 20],
                         'n_estimators': [20, 50, 80]},
             scoring=make_scorer(class_2_recall))

In [31]:
# Score the refit best estimator on the training set (class 2 recall).
gs.score(X_train, y_train)

0.5106792717086834

hmmm.... 

In [27]:
gs.best_params_

{'max_samples': 15, 'n_estimators': 20}

In [28]:
gs.best_score_

0.6046918767507002

In [29]:
gs.best_estimator_

BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight='balanced',
                                                        random_state=2020),
                  max_samples=15, n_estimators=20, random_state=2020)
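
One thing worth checking before reading too much into these scores: scikit-learn interprets an integer `max_samples` as an absolute number of rows drawn per base estimator, and a float as a fraction of the training set. The grid above therefore gave each tree only 5, 15, or 20 rows, which may explain the weak recall. The sketch below illustrates the difference on synthetic data; `X_demo`, `y_demo`, and the two models are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Small synthetic problem with 200 rows, only for demonstrating max_samples.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, random_state=2020
)

# Integer: each base estimator is fit on exactly 15 drawn rows.
by_count = BaggingClassifier(
    n_estimators=3, max_samples=15, random_state=2020
).fit(X_demo, y_demo)

# Float: each base estimator is fit on 50% of the rows.
by_fraction = BaggingClassifier(
    n_estimators=3, max_samples=0.5, random_state=2020
).fit(X_demo, y_demo)

# estimators_samples_ holds the row indices drawn for each base estimator.
rows_by_count = len(by_count.estimators_samples_[0])
rows_by_fraction = len(by_fraction.estimators_samples_[0])
```

If the intent was to subsample a share of the training set, fractional values such as `[0.5, 0.75, 1.0]` would be the grid to try instead.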

#### Trying cross-validation with the newly discovered `best_estimator_` on the un-smoted dataset

In [34]:
dt = DecisionTreeClassifier(random_state=2020, class_weight='balanced')
bagc3 = BaggingClassifier(base_estimator=dt, max_samples= 15, n_estimators= 20, random_state=2020)
bagc3.fit(X_train, y_train)
print('Class 2 Recall Score:','\n',cross_val_score(bagc3, X_train, y_train, scoring=scorer, cv=3))

Class 2 Recall Score: 
 [0.5852591  0.60696779 0.62184874]


#### Unimpressive across the board. Giving it one more go with the smoted dataset, knowing these best parameters were found on the un-smoted dataset

In [35]:
dt = DecisionTreeClassifier(random_state=2020)
bagc4 = BaggingClassifier(base_estimator=dt, max_samples= 15, n_estimators= 20, random_state=2020)
bagc4.fit(X_train_smoted, y_train_smoted)
# Cross-validate bagc4, the model defined in this cell.
print('Class 2 Recall Score:','\n',cross_val_score(bagc4, X_train_smoted, y_train_smoted, scoring=scorer, cv=3))

Class 2 Recall Score: 
 [0.34813807 0.51193864 0.41346035]


As expected, it didn't get any better.

#### Wondering if there's anything else we could do to get a better score...
I'll try changing a few more hyperparameters before moving on to the next ensemble method.

In [36]:
#Unsmoted dataset
dt = DecisionTreeClassifier(random_state=2020, class_weight='balanced')
bagc5 = BaggingClassifier(base_estimator=dt, max_samples= 20, n_estimators= 100, random_state=2020)
bagc5.fit(X_train, y_train)
print('Class 2 Recall Score:','\n',cross_val_score(bagc5, X_train, y_train, scoring=scorer, cv=3))

Class 2 Recall Score: 
 [0.34733894 0.43434874 0.52538515]


In [42]:
#smoted dataset
dt = DecisionTreeClassifier(random_state=2020)
bagc6 = BaggingClassifier(base_estimator=dt, max_samples= 20, n_estimators= 100, random_state=2020)
bagc6.fit(X_train_smoted, y_train_smoted)
print('Class 2 Recall Score:','\n',cross_val_score(bagc6, X_train_smoted, y_train_smoted, scoring=scorer, cv=3))

Class 2 Recall Score: 
 [0.46022516 0.63973772 0.60336509]


**just can't catch a break with this model...**

### Let's try out our test set with the best performer in this lot, `bagc2`

In [38]:
recall_score(y_test, bagc2.predict(X_test), average=None)[2]

0.740857946554149

**Not too bad!!!**

#### Moving on to Random Forest in the next notebook