## Modeling with Ensemble Method: Boosting

In this notebook I will try a few iterations of **boosting algorithms** one last time to see which parameters and features maximize our **Class 2 recall score**.

In [None]:
%load_ext autoreload
%autoreload 2

In [3]:
#import
import os, sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.model_selection import train_test_split, cross_validate, cross_val_score
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.metrics import confusion_matrix, plot_confusion_matrix
from sklearn.metrics import classification_report, recall_score, make_scorer
from sklearn.dummy import DummyClassifier
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import GridSearchCV

module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)
    
#import customized functions
from src.data_cleaning import cleaning_functions as cfs
from src.data_cleaning import exploration_functions as efs
from src.data_cleaning import processing_functions as pfs
from src.data_cleaning import modeling_functions as mfs

In [2]:
X_train, X_test, y_train, y_test, classes_dict = pfs.processed_dataset()
X_train, X_test = pfs.ohe_train_and_test_features(X_train, X_test)
scorer = mfs.scorer()
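
`mfs.scorer()` lives in the project's own `src` package, so its definition is not visible here. As a rough sketch only, a scorer that reports recall for a single class could be built as below, assuming the class of interest is encoded as the integer label `2` (an assumption, as are the names `class2_recall`, `y_true_demo` and `y_pred_demo`).

```python
import numpy as np
from sklearn.metrics import make_scorer, recall_score

# Hypothetical stand-in for mfs.scorer(): recall restricted to label 2.
# With a single label listed, 'macro' averaging returns that label's recall.
class2_recall = make_scorer(recall_score, labels=[2], average='macro')

# Tiny hand-made example: three true class-2 rows, two of them predicted as 2
y_true_demo = np.array([0, 1, 2, 2, 2, 1])
y_pred_demo = np.array([0, 2, 2, 0, 2, 1])
demo_recall = recall_score(y_true_demo, y_pred_demo, labels=[2], average='macro')
print('Class 2 recall on the toy labels:', demo_recall)
```

A scorer built this way can be passed as `scoring=` to `cross_val_score`, which is how `scorer` is used in the cells that follow.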

In [3]:
# fit_sample was removed in newer imbalanced-learn releases; fit_resample is the replacement
X_train_smoted, y_train_smoted = SMOTE(random_state=2020).fit_resample(X_train.values, y_train)

### Boosting Algorithm

> The fundamental idea of boosting is to start with a weak learner and then to use information about its errors to build a new model that can supplement the original model.
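
To make that idea concrete, here is a minimal from-scratch sketch of discrete AdaBoost. It assumes binary labels coded as -1/+1, depth-1 trees as the weak learner, and synthetic data from `make_classification` (not the project's dataset). All names in it are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem, labels recoded to {-1, +1}
X_demo, y01 = make_classification(n_samples=500, n_features=8,
                                  n_informative=4, random_state=0)
y_demo = 2 * y01 - 1

n_rounds = 20
weights = np.full(len(y_demo), 1 / len(y_demo))  # start with uniform weights
ensemble_margin = np.zeros(len(y_demo))          # weighted vote of all stumps
train_errors = []

for _ in range(n_rounds):
    # Weak learner: a decision stump fitted on the current sample weights
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X_demo, y_demo, sample_weight=weights)
    pred = stump.predict(X_demo)

    # Weighted error of this stump, clipped to keep the log finite
    err = np.clip(np.sum(weights[pred != y_demo]), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # the stump's say in the vote

    # Misclassified rows gain weight, so the next stump focuses on them
    weights *= np.exp(-alpha * y_demo * pred)
    weights /= weights.sum()

    ensemble_margin += alpha * pred
    train_errors.append(np.mean(np.sign(ensemble_margin) != y_demo))

print('training error after round 1 :', train_errors[0])
print('training error after round 20:', train_errors[-1])
```

`AdaBoostClassifier` wraps this kind of loop (with multiclass support), which is why it only needs a weak base learner to get started.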

In [4]:
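# Instantiate an AdaBoostClassifier with default settings
# Note: `dt` is defined but never passed to AdaBoostClassifier,
# so the default base learner (a depth-1 decision tree) is used
dt = DecisionTreeClassifier(random_state=2020, class_weight='balanced')
abc1 = AdaBoostClassifier(random_state=2020)
abc1.fit(X_train, y_train)
print('Class 2 recall score :','\n',cross_val_score(abc1, X_train, y_train, scoring=scorer, cv=3))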

Class 2 recall score : 
 [0.59121148 0.61116947 0.61677171]


Seems to have high bias: recall sits at about 0.59 to 0.62 across the folds, whereas we've been getting a recall score of about 77%. Let's try it out on SMOTE-resampled data.

In [5]:
dt = DecisionTreeClassifier(random_state=2020)
abc2 = AdaBoostClassifier(random_state=2020)
abc2.fit(X_train_smoted, y_train_smoted)
print('Class 2 recall score :','\n',cross_val_score(abc2, X_train_smoted, y_train_smoted, scoring=scorer, cv=3))

Class 2 recall score : 
 [0.59507609 0.61796363 0.4469875 ]


Not good either, and the third fold drops to about 0.45. I'll try increasing `n_estimators` on the original (non-SMOTE) data and give it another go.

In [6]:
dt = DecisionTreeClassifier(random_state=2020, class_weight='balanced')
abc1 = AdaBoostClassifier(random_state=2020, n_estimators=200)
abc1.fit(X_train, y_train)
print('Class 2 recall score :','\n',cross_val_score(abc1, X_train, y_train, scoring=scorer, cv=3))

Class 2 recall score : 
 [0.6547619  0.65791317 0.67436975]


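Similar to the first `abc1` run: low variance but high bias. Raising `n_estimators` from the default 50 to 200 lifted recall to about 0.65 to 0.67, which is still short of the 77% mark.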
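
One knob not explored above is the depth of the base learner: the `dt` trees are never handed to `AdaBoostClassifier`, so every run boosts depth-1 stumps. A hedged sketch of how deeper base trees could be compared, on synthetic data and with `'recall_macro'` standing in for the project's `scorer`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data, not the project's dataset
X_demo, y_demo = make_classification(n_samples=600, n_features=20,
                                     n_informative=5, n_classes=3,
                                     random_state=0)

depth_scores = {}
for depth in (1, 2, 3):
    base = DecisionTreeClassifier(max_depth=depth, random_state=2020)
    # Base learner passed positionally: the keyword is `estimator` in recent
    # scikit-learn releases and `base_estimator` in older ones
    model = AdaBoostClassifier(base, n_estimators=50, random_state=2020)
    depth_scores[depth] = cross_val_score(model, X_demo, y_demo,
                                          scoring='recall_macro', cv=3).mean()

for depth, score in depth_scores.items():
    print(f'max_depth={depth}: mean recall {score:.3f}')
```

Whether deeper trees help on the real data is an open question; they reduce bias but can also overfit.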

Instead of Adaptive Boosting let's try out Gradient Boosting and see if that helps

In [7]:
# Instantiate a GradientBoostingClassifier with default settings
gbc1 = GradientBoostingClassifier(random_state=2020)
gbc1.fit(X_train, y_train)
print('Class 2 recall score :','\n',cross_val_score(gbc1, X_train, y_train, scoring=scorer, cv=3))

Class 2 recall score : 
 [0.62027311 0.6285014  0.63707983]


In [9]:
gbc1.n_estimators_

100

This one took too long to run, and its recall (about 0.62 to 0.64) is no better than AdaBoost with 200 estimators, so it's not worth our time or trouble.
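
If gradient boosting is revisited, one way to cut the run time is early stopping. `n_estimators_` came back as the full 100 above because no early stopping was configured. The sketch below, on synthetic data, sets `n_iter_no_change` so that fitting stops once a held-out validation slice stops improving. The parameter values are illustrative, not tuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic three-class data, not the project's dataset
X_demo, y_demo = make_classification(n_samples=1000, n_features=20,
                                     n_informative=5, n_classes=3,
                                     random_state=0)

gbc_es = GradientBoostingClassifier(
    n_estimators=300,         # upper bound on boosting rounds
    n_iter_no_change=5,       # stop after 5 rounds without improvement
    validation_fraction=0.1,  # share of training data held out for that check
    subsample=0.8,            # each tree is fit on 80% of the rows
    random_state=2020,
)
gbc_es.fit(X_demo, y_demo)

# Number of rounds actually fitted (at most n_estimators)
print('Rounds fitted :', gbc_es.n_estimators_)
```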
