# **Cancel Culture - Classification Modeling Notebook**

---

**Post-Cleaning Modeling Notebook**

---

# -- > 🛑 **FIX**: Add cmts re: post-cleaning, modeling

---

>

---

# **Imports**

---

> Most of my modeling relies on scikit-learn. Toward the end of the modeling process, I also test models from XGBoost and imbalanced-learn (imblearn).
>
> I include the usual packages as well: pandas, numpy, matplotlib, and seaborn. Additionally, I use several personal functions (from `bmc_functions`) during the modeling process.

---

In [1]:
## Jupyter Notebook setting to reload functions when called
%load_ext autoreload
%autoreload 2

In [2]:
## Data Handling
import pandas as pd
import numpy as np

## Visualizations
import matplotlib.pyplot as plt
import seaborn as sns

## Personal functions
from bmc_functions import classification as clf

## SKLearn and Modeling Tools
from sklearn import metrics
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression 
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              AdaBoostClassifier, GradientBoostingClassifier)
from sklearn import set_config
set_config(display='diagram')

from xgboost import XGBClassifier

from imblearn.ensemble import (BalancedBaggingClassifier,
                               BalancedRandomForestClassifier,
                               EasyEnsembleClassifier, RUSBoostClassifier)

In [3]:
## Settings
%matplotlib inline
plt.style.use('seaborn-v0_8-talk')  # named 'seaborn-talk' before matplotlib 3.6
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: f'{x:,.2f}')
pd.set_option('display.max_rows', 50)

# **Reading the DataFrames**

---

> In my prior EDA notebook, I reviewed, cleaned, and pre-processed the data. I saved the result as a gzip-compressed .pickle file to preserve the datatypes; here I reload it for modeling.
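
A minimal sketch of why a pickle round trip preserves datatypes, using a toy frame rather than the notebook's data. The column values and the file name `demo.pickle` are illustrative assumptions.

```python
import os
import tempfile

import pandas as pd

# Toy frame with dtypes that a plain CSV round trip would not keep
df = pd.DataFrame({
    "hotel": pd.Categorical(["Resort Hotel", "City Hotel", "Resort Hotel"]),
    "arrival": pd.to_datetime(["2015-07-01", "2015-07-02", "2015-07-03"]),
    "adr": [0.0, 75.0, 98.0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.pickle")  # hypothetical file name
    df.to_pickle(path, compression="gzip")
    restored = pd.read_pickle(path, compression="gzip")

# Category and datetime columns come back with the same dtypes
dtypes_match = list(df.dtypes) == list(restored.dtypes)
print(restored.dtypes)
```

The same `compression` value has to be passed on both the write and the read, since the `.pickle` extension alone does not tell pandas that the file is gzipped.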

---

# -- > 🛑 **FIX**: update data in EDA notebook - keep type, outliers.

In [5]:
data = pd.read_pickle('./data/data_prepped.pickle',
                           compression = 'gzip')
data.head(5)

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,meal,country,market_segment,distribution_channel,is_repeated_guest,previous_cancellations,previous_bookings_not_canceled,reserved_room_type,assigned_room_type,booking_changes,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,0,C,C,3,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,0.0,0,BB,PRT,Direct,Direct,0,0,0,C,C,4,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,0.0,0,BB,GBR,Direct,Direct,0,0,0,A,C,0,No Deposit,,,0,Transient,75.0,0,0,Check-Out,2015-07-02
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,0.0,0,BB,GBR,Corporate,Corporate,0,0,0,A,A,0,No Deposit,304.0,,0,Transient,75.0,0,0,Check-Out,2015-07-02
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,0.0,0,BB,GBR,Online TA,TA/TO,0,0,0,A,A,0,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03


# **Train/Test Split**

# -- > 🛑 **FIX**: Make sure to encode cats as "category"

In [None]:
## Identifying target
target= 'is_canceled'

In [None]:
## Dropping target and "reservation_status" (nearly identical indicator)

X = data.drop(columns = [target, 'reservation_status']).copy()
y = data[target].copy()

In [None]:
## Checking for missing values
print(f'Missing values for X:\n {X.isna().sum()[X.isna().sum() >0]}\n')
print(f'Missing values for y: {y.isna().sum()}')

In [None]:
## Splitting - stratify to maintain class balance b/t X_train/_test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .25, 
                                                    random_state=42, 
                                                    stratify=y)
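
The effect of `stratify` can be checked on a small synthetic target. This is a sketch with made-up labels (30% positives), not the booking data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 280 negatives, 120 positives (30% positive)
y_demo = np.array([0] * 280 + [1] * 120)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# With stratification, both splits keep the positive rate of the full target
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train: {train_rate:.2f}, test: {test_rate:.2f}")
```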

In [None]:
## Specifying numeric columns for preprocessing
num_cols = X_train.select_dtypes('number').columns.to_list()
num_cols

In [None]:
## Specifying categorical columns for preprocessing
cat_cols = X_train.select_dtypes(include='object').columns.to_list()
cat_cols
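
If the categorical columns are later stored with the `category` dtype (as the FIX note at the top of this section suggests), `select_dtypes(include='object')` would no longer pick them up. A small sketch on a toy frame, with assumed column values, that selects both dtypes:

```python
import pandas as pd

demo = pd.DataFrame({
    "lead_time": [342, 7, 13],
    "hotel": pd.Series(["Resort Hotel", "City Hotel", "Resort Hotel"],
                       dtype="category"),
    "meal": pd.Series(["BB", "BB", "HB"], dtype=object),
})

# Numeric columns, as in the cell above
num_demo = demo.select_dtypes("number").columns.to_list()

# Including both dtypes keeps the selection stable after a dtype conversion
cat_demo = demo.select_dtypes(include=["object", "category"]).columns.to_list()

print(num_demo, cat_demo)
```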

# **Prepping the Pipeline**

---

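> Pipeline to streamline modeling steps:
* Preprocessing: one-hot encoding (OHE), scaling, and possibly outlier handling via a `FunctionTransformer`
* Modeling: RandomForestClassifier (RFC), BalancedRandomForestClassifier (BRFC)
* GridSearchCV (GSCV): include as part of the pipeline
* Get results:
    * Feature importances - **SHAP**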

---

---

>**FIX** Update GSCV:
>* function transformer for outliers?
>* add GSCV to final step?

---
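
One way the plan above could fit together is sketched below on synthetic data. The `clip_outliers` helper, the toy columns, and the small parameter grid are all assumptions for illustration, not the notebook's final design.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (FunctionTransformer, OneHotEncoder,
                                   StandardScaler)

rng = np.random.default_rng(42)
n = 120
demo_X = pd.DataFrame({
    "lead_time": rng.integers(0, 400, n),
    "adr": rng.normal(100, 30, n),
    "deposit_type": rng.choice(["No Deposit", "Non Refund"], n),
})
# Toy target fully determined by one feature, so the search is easy
demo_y = (demo_X["deposit_type"] == "Non Refund").astype(int).to_numpy()


def clip_outliers(X):
    """Hypothetical outlier step: clip each column to its 1st-99th percentile.

    FunctionTransformer is stateless, so the bounds are recomputed on
    whatever data is passed in (they are not learned from the training set).
    """
    X = np.asarray(X, dtype=float)
    lo, hi = np.percentile(X, [1, 99], axis=0)
    return np.clip(X, lo, hi)


num_pipe = Pipeline([("clip", FunctionTransformer(clip_outliers)),
                     ("scaler", StandardScaler())])
cat_pipe = Pipeline([("ohe", OneHotEncoder(handle_unknown="ignore"))])

preprocessor = ColumnTransformer([
    ("num", num_pipe, ["lead_time", "adr"]),
    ("cat", cat_pipe, ["deposit_type"]),
])

model = Pipeline([
    ("prep", preprocessor),
    ("rfc", RandomForestClassifier(class_weight="balanced", random_state=42)),
])

# Step-name prefixes ("rfc__") route each parameter to the right pipeline step
param_grid = {"rfc__n_estimators": [10, 20], "rfc__min_samples_leaf": [1, 2]}
search = GridSearchCV(model, param_grid, scoring="balanced_accuracy", cv=3)
search.fit(demo_X, demo_y)

print(search.best_params_, round(search.best_score_, 2))
```

Wrapping the preprocessor inside the searched pipeline means the scaler and encoder are refit on each training fold, which avoids leaking information from the validation fold.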

In [None]:
## Creating ColumnTransformer and sub-transformers for imputation and encoding

### --- Creating column pipelines --- ###

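## `sparse_output` replaced `sparse` in scikit-learn 1.2 (use `sparse=False` on older versions)
cat_pipe = Pipeline(steps=[('ohe', OneHotEncoder(handle_unknown='ignore',
                                                 sparse_output=False))])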

num_pipe = Pipeline(steps=[('scaler', StandardScaler())])

## Instantiating the ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[('num', num_pipe, num_cols),
                  ('cat', cat_pipe, cat_cols)
                  ])

preprocessor

In [None]:
## Fitting feature preprocessor
preprocessor.fit(X_train)

## Getting feature names from OHE (`get_feature_names` was removed in scikit-learn 1.2)
ohe_cat_names = preprocessor.named_transformers_['cat'].named_steps['ohe'].get_feature_names_out(cat_cols)

## Generating list for column index
final_cols = [*num_cols, *ohe_cat_names]

final_cols

In [None]:
## Transform the data via the ColumnTransformer preprocessor

X_train_tf = preprocessor.transform(X_train)
X_train_tf_df = pd.DataFrame(X_train_tf, columns=final_cols, index=X_train.index)

X_test_tf = preprocessor.transform(X_test)
X_test_tf_df = pd.DataFrame(X_test_tf, columns=final_cols, index=X_test.index)

display(X_train_tf_df.head(5),X_test_tf_df.head(5))

# **--->** 🛑 Add Comment: Class Imbalance

---

> Because the classes are imbalanced, I will try setting `class_weight='balanced'` on the models that support it to compensate.

---
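
For reference, the "balanced" heuristic weights each class by `n_samples / (n_classes * class_count)`. The sketch below checks that formula against scikit-learn's helper on toy labels; the 63/37 split is an arbitrary example, not this dataset's actual ratio.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (arbitrary imbalance for illustration)
y_demo = np.array([0] * 63 + [1] * 37)
classes = np.unique(y_demo)

weights = compute_class_weight(class_weight="balanced",
                               classes=classes, y=y_demo)

# Same result computed directly from the formula
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

print(dict(zip(classes.tolist(), np.round(weights, 3).tolist())))
```

The minority class receives the larger weight, so its errors count for more in the training loss.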

# **Baseline Model**

In [None]:
## Creating baseline classifier model

base = DummyClassifier(strategy='stratified', random_state = 42)

base.fit(X_train_tf_df, y_train)

clf.evaluate_classification(base,X_train = X_train_tf_df, y_train = y_train,
                           X_test = X_test_tf_df, y_test = y_test, 
                           metric = 'balanced accuracy')
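
Balanced accuracy is the average of the per-class recalls, which makes it less flattering than plain accuracy when one class dominates. A small worked example with made-up predictions:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             recall_score)

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

recall_0 = recall_score(y_true, y_pred, pos_label=0)  # 5 of 6 negatives found
recall_1 = recall_score(y_true, y_pred, pos_label=1)  # 2 of 4 positives found

bal_acc = balanced_accuracy_score(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)

print(f"accuracy: {acc:.2f}, balanced accuracy: {bal_acc:.2f}")
```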

# -- > 🛑 **FIX**: Re-run all models w/ new data, kfold CV
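
A possible shape for the k-fold re-run, sketched on synthetic data from `make_classification`. The fold count, the estimator settings, and the two scorers are assumptions chosen to mirror the metrics reported in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, mildly imbalanced data standing in for the transformed features
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     weights=[0.65, 0.35], random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_validate(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    X_demo, y_demo,
    cv=cv,
    scoring=["balanced_accuracy", "neg_log_loss"],
    return_train_score=True,  # keeps the train/test comparison per fold
)

print("test balanced accuracy:", scores["test_balanced_accuracy"].round(2))
print("test log loss:", (-scores["test_neg_log_loss"]).round(2))
```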

# **Logistic Regression Model**

---

**Results:**

> Training balanced accuracy score: 0.82
> 
> Testing balanced accuracy score: 0.82
> 
> * *The training and testing scores are equal.*
>
> Training data log loss: 0.37
>
> Testing data log loss: 0.37

---
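
For context on the log loss figures, here is a sketch of binary log loss (mean negative log-likelihood) in NumPy. It assumes the evaluation helper reports the standard definition; the probabilities are made up.

```python
import numpy as np


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the true labels under p_pred."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip to avoid log(0) for fully confident predictions
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))


y_demo = [0, 0, 1, 1]
confident = binary_log_loss(y_demo, [0.1, 0.2, 0.8, 0.9])
uninformed = binary_log_loss(y_demo, [0.5, 0.5, 0.5, 0.5])

print(f"confident: {confident:.3f}, uninformed: {uninformed:.3f}")
```

Predicting 0.5 for every row gives a log loss of ln 2 (about 0.69), which is a useful reference point when reading the values reported for each model.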

**Interpretation**

> 

---

In [None]:
# ## LogReg Model

# logreg = LogisticRegression(class_weight='balanced', n_jobs=-1,random_state = 42)

# logreg.fit(X_train_tf_df, y_train)

In [None]:
# clf.evaluate_classification(logreg, X_train = X_train_tf_df, y_train = y_train,
#                            X_test = X_test_tf_df, y_test = y_test,
#                           metric = 'balanced recall')

# **Random Forest Model**

---

**Results:**

> Training balanced recall score: 0.99
> 
> Testing balanced recall score: 0.88
>
> * *The training score is larger by 0.11 points.*
>
> Training data log loss: 0.08
>
> Testing data log loss: 0.27

---

**Interpretation**

> 

---

In [None]:
# rfc = RandomForestClassifier(class_weight = 'balanced', n_jobs=-1,
#                              random_state=42)

# rfc.fit(X_train_tf_df, y_train)

In [None]:
# clf.evaluate_classification(rfc, X_train = X_train_tf_df, y_train = y_train,
#                            X_test = X_test_tf_df, y_test = y_test,
#                           metric = 'balanced recall')

In [None]:
# clf.plot_importances(rfc, X_train_tf_df)

# **AdaBoost Model**

---

**Results:**

> Training balanced recall score: 0.82
>
> Testing balanced recall score: 0.81
>
> * *The training score is larger by 0.01 points.*
>
> Training data log loss: 0.68
>
> Testing data log loss: 0.68

---

**Interpretation**

> 

---

In [None]:
# abc = AdaBoostClassifier(n_estimators=100, random_state=42)

# abc.fit(X_train_tf_df, y_train)

In [None]:
# clf.evaluate_classification(abc, X_train = X_train_tf_df, y_train = y_train,
#                            X_test = X_test_tf_df, y_test = y_test,
#                           metric = 'balanced recall')

# **Gradient Boosting Model**

---

**Results:**

> Training score: 0.81
> 
> Testing score: 0.80
> 
> * *The training score is larger by 0.01 points.*
>
> Training data log loss: 0.4
>
> Testing data log loss: 0.4

---

**Interpretation**

> 

---

In [None]:
# gbc = GradientBoostingClassifier(learning_rate=1.0, max_depth=1, random_state=42)

# gbc.fit(X_train_tf_df, y_train)

In [None]:
# clf.evaluate_classification(gbc, X_train = X_train_tf_df, y_train = y_train,
#                            X_test = X_test_tf_df, y_test = y_test,
#                           metric = 'balanced recall')

# **ExtraTreesClassifier**

---

**Results:**

> Training balanced recall score: 1.0
> 
> Testing balanced recall score: 0.87
> 
> * *The training score is larger by 0.13 points.*
>
> Training data log loss: 0.01
>
> Testing data log loss: 0.33

---

**Interpretation**

> 

---

In [None]:
# etc = ExtraTreesClassifier(class_weight = 'balanced', n_jobs=-1, random_state=42)

# etc.fit(X_train_tf_df, y_train)

In [None]:
# clf.evaluate_classification(etc, X_train = X_train_tf_df, y_train = y_train,
#                            X_test = X_test_tf_df, y_test = y_test,
#                           metric = 'balanced recall')

# **XGBoost Classifier**

---

**Results:**

> Training balanced accuracy score: 0.86
> 
> Testing balanced accuracy score: 0.86
> 
> * *The training and testing scores are equal.*
>
> Training data log loss: 0.24
>
> Testing data log loss: 0.26

---

**Interpretation**

> 

---

In [None]:
# xbgc = XGBClassifier(n_jobs = 4, random_state=42)
# xbgc.fit(X = X_train_tf_df, y=y_train)

In [None]:
# clf.evaluate_classification(xbgc,X_train_tf_df, y_train, X_test_tf_df, y_test,
#                             metric= 'balanced accuracy')

# **GridSearchCV Results**

# -- > 🛑 **FIX**: Rerun w/ KFCV, new data

## **GridSearchCV - Logistic Regression**

---

**Results:**

> *Results unavailable - did not run model due to time required*

---

**Interpretation**

> 

---

In [None]:
# lr_params = {
#  'C': [.001, .01, .1, 1],
#     'penalty':['l1', 'l2'],#, 'elasticnet', 'none'],
#     'solver':['newton-cg', 'liblinear', 'sag', 'saga', 'lbfgs'],
#     'max_iter':[100, 300, 500]}

# gscv = GridSearchCV(LogisticRegression(class_weight='balanced'), lr_params,
#                     scoring = 'balanced_accuracy', cv=3, n_jobs = -1)
# gscv

# gscv.fit(X_train_tf_df, y_train)

In [None]:
# clf.evaluate_classification(gscv.best_estimator_, X_train = X_train_tf_df, y_train = y_train,
#                            X_test = X_test_tf_df, y_test = y_test,
#                           metric = 'balanced accuracy')

In [None]:
# logreg_params = gscv.best_params_

# logreg_params

In [None]:
# gscv.best_estimator_
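
One likely contributor to the run time is that a single flat grid pairs every penalty with every solver, including combinations that `LogisticRegression` does not support (to my knowledge, `l1` works only with `liblinear` and `saga`). A sketch of a list-of-dicts grid that keeps only valid pairs, reusing the values from the grid above:

```python
from sklearn.model_selection import ParameterGrid

C_values = [0.001, 0.01, 0.1, 1]
max_iters = [100, 300, 500]
all_solvers = ["newton-cg", "liblinear", "sag", "saga", "lbfgs"]

# Each dict is expanded separately, so invalid pairs are never generated
lr_param_grid = [
    {"penalty": ["l2"], "solver": all_solvers,
     "C": C_values, "max_iter": max_iters},
    {"penalty": ["l1"], "solver": ["liblinear", "saga"],
     "C": C_values, "max_iter": max_iters},
]

n_valid = len(ParameterGrid(lr_param_grid))
n_full = len(ParameterGrid({"C": C_values, "penalty": ["l1", "l2"],
                            "solver": all_solvers, "max_iter": max_iters}))

print(f"{n_valid} valid candidates vs. {n_full} in the flat grid")
```

The list can be passed to `GridSearchCV` in place of the single dictionary.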

## **GSCV - RandomForest**

---

**Scores**

> Training balanced accuracy score: 0.99
> 
> Testing balanced accuracy score: 0.88
> 
> * *The training score is larger by 0.11 points.*
>
> Training data log loss: 0.10
>
> Testing data log loss: 0.26

---

**Best Parameters**

> 'criterion': 'entropy'
> 
> 'max_depth': None
> 
> 'min_samples_leaf': 2
> 
> 'min_samples_split': 2

**Interpretation**

> 

---

In [None]:
# rfc_params = {
#     'criterion': ['gini', 'entropy'],
#     'max_depth': [25,50, None],
#     'min_samples_split': [2,3,4],
#     'min_samples_leaf': [1,2]
# }

In [None]:
# rfgs = GridSearchCV(RandomForestClassifier(class_weight = 'balanced',
#                                            random_state = 42, n_jobs=-1),
#                     rfc_params,scoring = 'balanced_accuracy',
#                     cv=3,verbose = 4)

# rfgs.fit(X_train_tf_df, y_train)

In [None]:
# rfgs.best_params_

In [None]:
# rfgs.best_score_

In [None]:
# rfc_new = rfgs.best_estimator_

In [None]:
# clf.evaluate_classification(rfc_new, X_train_tf_df, y_train, X_test_tf_df, 
#                            y_test, 'balanced recall')

# Additional Modeling

# -- > 🛑 **FIX**: Re-run w/ new data, KFCV

## Balanced Bagging Classifier

---

**MODEL: BalancedBaggingClassifier**

**Scores**

> Training balanced accuracy score: 0.98
> 
> Testing balanced accuracy score: 0.89
> 
> * *The training score is larger by 0.09 points.*
>
> Training data log loss: 0.11
>
> Testing data log loss: 0.30

---

**Interpretation**

> 

---

In [None]:
## Testing balanced bagging classifier
balanced_bagging = BalancedBaggingClassifier(n_estimators=100, n_jobs=4, random_state=42)

balanced_bagging.fit(X_train_tf_df, y_train)

In [None]:
clf.evaluate_classification(balanced_bagging, X_train = X_train_tf_df, y_train = y_train,
                           X_test = X_test_tf_df, y_test = y_test,
                          metric = 'balanced recall')
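
As I understand the imbalanced-learn design, `BalancedBaggingClassifier` rebalances the data for each bagged estimator, by default through random undersampling. The NumPy sketch below illustrates that resampling idea on toy labels; `undersample_majority` is a hypothetical helper, not the library's implementation.

```python
import numpy as np

rng = np.random.default_rng(42)
y_demo = np.array([0] * 63 + [1] * 37)  # toy labels, arbitrary imbalance


def undersample_majority(y, rng):
    """Return sorted row indices of a class-balanced subsample."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    # Draw n_min rows from every class without replacement
    picked = [rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
              for c in classes]
    return np.sort(np.concatenate(picked))


idx = undersample_majority(y_demo, rng)
balanced_counts = np.bincount(y_demo[idx])
print(balanced_counts)
```

Each estimator in the ensemble would see a different balanced draw, so the majority class is still covered across the ensemble even though each individual sample discards part of it.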

### GSCV: Balanced Bagging

---

**MODEL:** GridSearchCV - BalancedBaggingClassifier

**Scores**

> Training balanced accuracy score: 0.96
> 
> Testing balanced accuracy score: 0.89
> 
> * *The training score is larger by 0.07 points.*
>
> Training data log loss: 0.20
>
> Testing data log loss: 0.30

---

**Best Parameters**

> 'base_estimator': RandomForestClassifier()
>
> 'replacement': 'True'
>
> 'sampling_strategy': 'majority'

**Interpretation**

> 

---

In [None]:
# bal_bag_params = {'base_estimator': [RandomForestClassifier()],
#                   'replacement': [True],  # boolean, not the string 'True'
#                   'sampling_strategy': ['majority', 'not minority',
#                                         'not majority', 'all']
#                  }

In [None]:
# bbgs = GridSearchCV(BalancedBaggingClassifier(n_jobs=-1,random_state=42), 
#                     bal_bag_params, scoring = 'balanced_accuracy',
#                     cv=3, verbose = 4)

# bbgs.fit(X_train_tf_df, y_train)

In [None]:
# clf.evaluate_classification(bbgs, X_train_tf_df, y_train, X_test_tf_df, 
#                            y_test, 'balanced accuracy')

In [None]:
# bbgs_params = bbgs.best_params_

# bbgs_params

## Testing Balanced Random Forest Classifier

---

**MODEL: BalancedRandomForestClassifier**

**Scores**

> Training balanced accuracy score: 0.97
> 
> Testing balanced accuracy score: 0.89
> 
> * *The training score is larger by 0.08 points.*
>
> Training data log loss: 0.17
>
> Testing data log loss: 0.30

---

**Best Parameters**

> 

**Interpretation**

> 

---

In [None]:
balanced_forest = BalancedRandomForestClassifier(n_jobs = -1, random_state=42)

balanced_forest.fit(X_train_tf_df, y_train)

In [None]:
clf.evaluate_classification(balanced_forest, X_train = X_train_tf_df, y_train = y_train,
                           X_test = X_test_tf_df, y_test = y_test,
                          metric = 'balanced recall')

### GSCV: Balanced Random Forest

---

**MODEL:** GridSearchCV - BalancedRandomForestClassifier

**Scores**

> Training balanced accuracy score: 0.96
> 
> Testing balanced accuracy score: 0.89
> 
> * *The training score is larger by 0.07 points.*
>
> Training data log loss: 0.20
>
> Testing data log loss: 0.30

---

**Best Parameters**

> * 'criterion': 'entropy'
>
> * 'max_features': None
>
> * 'min_samples_leaf': 1
>
> * 'min_samples_split': 2

**Interpretation**

> 

---

In [None]:
# brfc_params = {'criterion': ['gini', 'entropy'],
#                'min_samples_split': [2, 3, 4],
#                'min_samples_leaf': [1,2,3],
#                'max_features': ['sqrt', 'log2', None]  # 'auto' (same as 'sqrt' here) was removed in scikit-learn 1.3
#               }

In [None]:
# brfc = GridSearchCV(BalancedRandomForestClassifier(n_jobs=-1,random_state=42), 
#                     brfc_params,scoring = 'balanced_accuracy',
#                     cv=3,verbose = 3)

# brfc.fit(X_train_tf_df, y_train)

In [None]:
# clf.evaluate_classification(brfc, X_train_tf_df, y_train, X_test_tf_df, 
#                            y_test, 'balanced accuracy')

In [None]:
# brfc_params = brfc.best_params_

# brfc_params

## EasyEnsembleClassifier

---

**MODEL: EasyEnsembleClassifier**

**Scores**

> Training balanced accuracy score: 0.86
> 
> Testing balanced accuracy score: 0.85
> 
> * *The training score is larger by 0.01 points.*
>
> Training data log loss: 0.64
>
> Testing data log loss: 0.64

---

**Interpretation**

> 

---

In [None]:
# easy_ensemble = EasyEnsembleClassifier(n_jobs = -1, random_state=42)

# easy_ensemble.fit(X_train_tf_df, y_train)

In [None]:
# clf.evaluate_classification(easy_ensemble, X_train = X_train_tf_df, y_train = y_train,
#                            X_test = X_test_tf_df, y_test = y_test,
#                           metric = 'balanced recall')

## RUSBoostClassifier

---

**MODEL: RUSBoostClassifier**

**Scores**

> Training balanced accuracy score: 0.85
> 
> Testing balanced accuracy score: 0.85
> 
> * *The training and testing scores are equal.*
>
> Training data log loss: 0.64
>
> Testing data log loss: 0.64

---

**Interpretation**

> 

---

In [None]:
# rus_boost = RUSBoostClassifier(n_estimators=200, algorithm='SAMME.R',
#                                random_state=42)

# rus_boost.fit(X_train_tf_df, y_train)

In [None]:
# clf.evaluate_classification(rus_boost, X_train = X_train_tf_df, y_train = y_train,
#                            X_test = X_test_tf_df, y_test = y_test,
#                           metric = 'balanced recall')

# Testing azure

In [None]:
from azureml.opendatasets import PublicHolidays

from datetime import datetime
from dateutil import parser
from dateutil.relativedelta import relativedelta