In [1]:
%load_ext autoreload
%autoreload 2

In [18]:
#import
import os, sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.model_selection import train_test_split, cross_validate, cross_val_score
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.metrics import confusion_matrix, plot_confusion_matrix
from sklearn.metrics import classification_report, recall_score, make_scorer
from sklearn.dummy import DummyClassifier

In [3]:
module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)

In [4]:
#import customized functions
from src.data_cleaning import cleaning_functions as cfs
from src.data_cleaning import exploration_functions as efs
from src.data_cleaning import processing_functions as pfs

In [5]:
X_train, X_test, y_train, y_test, classes_dict = pfs.processed_dataset()

In [6]:
X_train, X_test = pfs.ohe_train_and_test_features(X_train, X_test)

### Dummy Classifier
We start with a dummy model to establish a baseline for comparison.

In [7]:
dummy_model = DummyClassifier(strategy='most_frequent', random_state=2020)

In [8]:
dummy_model.fit(X_train, y_train)

DummyClassifier(random_state=2020, strategy='most_frequent')

In [9]:
cross_validate(dummy_model, X_train, y_train, cv=3, scoring='recall_micro')['test_score']

array([0.54430976, 0.54430976, 0.54430976])

So our baseline micro-averaged recall is 54%. In other words, if we guessed class 0 for every well in our dataset, we would be right 54% of the time. For a single-label multiclass problem, micro-averaged recall is the same number as overall accuracy, so it says nothing about any individual class. The built-in `'recall_micro'` scorer returns one aggregate value per fold; getting the recall of one specific class requires a custom scorer, which we had not set up at this point, so we report the aggregate score here.
> **Flash forward: `make_scorer` to the rescue! We can now get the exact recall for our class of interest (more on this in the next notebook).**
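
As a minimal sketch of the point above, run on synthetic data rather than the wells dataset (the data, the class weights and the scorer name `recall_class_2` are illustrative assumptions), `cross_validate` accepts a dict of scorers, so micro-averaged recall, accuracy and the recall of a single class can be reported for each fold in one call:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import cross_validate

# Synthetic, imbalanced 3-class data standing in for the real features (assumption).
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, weights=[0.55, 0.38, 0.07],
                           random_state=2020)

def class_2_recall(y_true, y_pred):
    # labels=[2] restricts the per-class result to class 2 only.
    return recall_score(y_true, y_pred, labels=[2], average=None,
                        zero_division=0)[0]

# One entry per metric; results come back under the keys 'test_<name>'.
scoring = {
    'recall_micro': 'recall_micro',
    'accuracy': 'accuracy',
    'recall_class_2': make_scorer(class_2_recall),
}

baseline = DummyClassifier(strategy='most_frequent')
scores = cross_validate(baseline, X, y, cv=3, scoring=scoring)

for name in scoring:
    print(name, scores['test_' + name])
```

On this toy data the micro-averaged recall and the accuracy columns are identical, while the class 2 recall of the majority-class baseline is zero, which is the information the aggregate score hides.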

### Modeling with Sklearn Ensemble methods

Let's try out a few sklearn models.

First, let's try some boosting classifiers, tuning a few hyperparameters along the way.

In [10]:
dt = DecisionTreeClassifier(random_state=2020, class_weight='balanced', max_depth=1)
abc = AdaBoostClassifier(base_estimator=dt, random_state=2020)

In [12]:
abc.fit(X_train, y_train)

AdaBoostClassifier(base_estimator=DecisionTreeClassifier(class_weight='balanced',
                                                         max_depth=1,
                                                         random_state=2020),
                   random_state=2020)

In [14]:
cross_val_score(abc, X_train, y_train, scoring='recall_micro', cv=3)

array([0.61084175, 0.63232323, 0.63602694])

In [16]:
gbc = GradientBoostingClassifier(random_state=2020)
gbc.fit(X_train, y_train)

GradientBoostingClassifier(random_state=2020)

In [17]:
cross_val_score(gbc, X_train, y_train, scoring='recall_micro', cv=3)

array([0.75292929, 0.75387205, 0.7573064 ])

**Gradient boosting raises our micro-averaged recall to about 75%, up from 54% for the baseline and roughly 61-64% for AdaBoost.**

Let's try a grid search and see if we can fine-tune the hyperparameters.

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
#gs = GridSearchCV(estimator=gbc,
#                  param_grid={
#                      'n_estimators': [25, 50, 100],
#                      # 'exponential' loss only supports binary targets, so only 'deviance' applies here
#                      'loss': ['deviance']
#                  }, cv=3)

In [None]:
#gs.fit(X_train, y_train)

**The grid search had to be cancelled because it took too long to run.**
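
One possible way to keep such a search tractable is sketched below on synthetic data. The grid values, the sample size and `n_jobs=-1` are assumptions for illustration, not settings used in this project; the idea is simply to search a small grid of cheap models and let the folds run in parallel.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic 3-class problem standing in for the real data (assumption).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=2020)

# A deliberately small grid: 4 candidates x 3 folds = 12 fits.
param_grid = {'n_estimators': [10, 25], 'max_depth': [1, 2]}

gs = GridSearchCV(GradientBoostingClassifier(random_state=2020),
                  param_grid=param_grid,
                  scoring='recall_micro',  # same metric as the cells above
                  cv=3,
                  n_jobs=-1)               # use all available cores
gs.fit(X, y)

best = gs.best_params_
print(best, gs.best_score_)
```

If the full grid is still too slow, `RandomizedSearchCV` with a fixed `n_iter` is another option, since it caps the number of candidates regardless of grid size.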

## Recall for Class 2

In [None]:
def class_2_recall(y_true, y_pred):
    # average=None returns one recall value per class; index 2 picks out class 2
    return recall_score(y_true, y_pred, average=None)[2]

In [None]:
scorer = make_scorer(class_2_recall)
cross_val_score(abc, X_train, y_train.target, scoring=scorer)
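
To make concrete what the custom function returns, here is a small hand-checkable sketch on made-up labels (the arrays are illustrative, not taken from the wells data). It compares the per-class recall array with the micro-averaged value used earlier in the notebook.

```python
import numpy as np
from sklearn.metrics import recall_score

# Six made-up samples: two of class 0, one of class 1, three of class 2.
y_true = np.array([0, 0, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 2, 0, 2])

# One recall per class, in label order [0, 1, 2].
per_class = recall_score(y_true, y_pred, average=None)

# Class 2: two of its three samples are predicted correctly.
class_2 = per_class[2]

# Micro average: four of six samples are correct overall.
micro = recall_score(y_true, y_pred, average='micro')

print(per_class, class_2, micro)
```

Here the class 2 recall and the micro average happen to coincide at 2/3, but the per-class array also exposes that class 0 is recalled only half the time, which the single aggregate number cannot show.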

#### Takeaways from this notebook

* Boosting may not help us much here; we will come back to it after trying the other ensemble methods

### Next Steps:
* Use `make_scorer` with a custom function to calculate the recall for our specific class
* Try all of the ensemble methods, with a plan for iterating through them in a logical way
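
One way the second step could look is sketched below: put the candidate ensembles in a dict and score each with the same cross-validation settings. The synthetic data, the estimator settings and the column names are assumptions for illustration, not the plan actually followed in the next notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

# Synthetic 3-class data standing in for X_train / y_train (assumption).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=2020)

# Small n_estimators values keep the sketch fast; tune them for real use.
models = {
    'bagging': BaggingClassifier(n_estimators=10, random_state=2020),
    'random_forest': RandomForestClassifier(n_estimators=25, random_state=2020),
    'adaboost': AdaBoostClassifier(n_estimators=25, random_state=2020),
    'gradient_boosting': GradientBoostingClassifier(n_estimators=25,
                                                    random_state=2020),
}

rows = []
for name, model in models.items():
    # Same metric and fold count for every model so the rows are comparable.
    fold_scores = cross_val_score(model, X, y, scoring='recall_micro', cv=3)
    rows.append({'model': name,
                 'mean_recall_micro': fold_scores.mean(),
                 'std': fold_scores.std()})

results = pd.DataFrame(rows).sort_values('mean_recall_micro', ascending=False)
print(results)
```

Swapping the `scoring` argument for a `make_scorer` object built from a class-specific recall function would rank the same models by the recall of a single class instead.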