In this notebook, we fit two common boosting algorithms, XGBoost and AdaBoost, as well as the categorical Naive Bayes algorithm, to our dataset.

As we saw earlier, the significant features for classifying pharmacy claim approval are the bin and the drug type, so we start with these two features, fit our models, and compute the metric values.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("darkgrid")

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [None]:
from google.colab import files
uploaded = files.upload()

Saving train.csv to train (1).csv


In [None]:
import io
drugs_train = pd.read_csv(io.BytesIO(uploaded['train.csv']),index_col=1, parse_dates=True)

In [None]:
from google.colab import files
uploaded = files.upload()

Saving test.csv to test (1).csv


In [None]:
import io
drugs_test = pd.read_csv(io.BytesIO(uploaded['test.csv']),index_col=1, parse_dates=True)

In [None]:
pa_columns = ['correct_diagnosis','tried_and_failed','contraindication','pa_approved','reject_code']
id_columns = ['dim_pa_id','dim_date_id','dim_claim_id','Unnamed: 0']
date_columns = ['calendar_year']
drugs_train = drugs_train.drop(columns=pa_columns+id_columns+date_columns)
drugs_train = drugs_train.dropna()

In [None]:
pa_columns = ['correct_diagnosis','tried_and_failed','contraindication','pa_approved','reject_code']
id_columns = ['dim_pa_id','dim_date_id','dim_claim_id','Unnamed: 0']
date_columns = ['calendar_year']
drugs_test = drugs_test.drop(columns=pa_columns+id_columns+date_columns)
drugs_test = drugs_test.dropna()

In [None]:
# Drop the remaining calendar-derived features in a single call
calendar_columns = ['calendar_month','calendar_day','day_of_week','is_weekday','is_workday','is_holiday']
drugs_train = drugs_train.drop(columns=calendar_columns)

In [None]:
# Drop the remaining calendar-derived features in a single call
calendar_columns = ['calendar_month','calendar_day','day_of_week','is_weekday','is_workday','is_holiday']
drugs_test = drugs_test.drop(columns=calendar_columns)

By this point, we have dropped all features save the bin and drug type. As before, we need to encode these two columns since they are categorical features.

In [None]:
one_hot_encoded_traindata = pd.get_dummies(drugs_train, columns = ['bin', 'drug'])

In [None]:
one_hot_encoded_testdata = pd.get_dummies(drugs_test, columns = ['bin', 'drug'])
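
Because the training and test frames are encoded separately, a category that appears in only one of them would produce mismatched columns. The sketch below shows one possible guard: reindexing the encoded test frame to the training columns. The toy frames and their bin and drug values are hypothetical placeholders, not values from our dataset.

```python
import pandas as pd

# Toy frames; the bin and drug values are hypothetical placeholders.
train_demo = pd.DataFrame({"bin": ["A", "B", "C"], "drug": ["X", "Y", "X"]})
test_demo = pd.DataFrame({"bin": ["A", "B"], "drug": ["X", "X"]})

train_encoded = pd.get_dummies(train_demo, columns=["bin", "drug"])
test_encoded = pd.get_dummies(test_demo, columns=["bin", "drug"])

# Give the test frame exactly the training columns, in the same order,
# filling any dummy column that is absent from the test data with 0.
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(test_aligned.columns))
```

If both frames already contain every bin and drug value, this reindexing leaves the columns unchanged.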

In [None]:
X = one_hot_encoded_traindata.loc[:, one_hot_encoded_traindata.columns != 'pharmacy_claim_approved']
y = one_hot_encoded_traindata.loc[:, one_hot_encoded_traindata.columns == 'pharmacy_claim_approved']

In [None]:
Xtest = one_hot_encoded_testdata.loc[:, one_hot_encoded_testdata.columns != 'pharmacy_claim_approved']
ytest = one_hot_encoded_testdata.loc[:, one_hot_encoded_testdata.columns == 'pharmacy_claim_approved']

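Next we split the training data (given by X and y) into a training portion and a holdout portion using k-fold splitting with k=5. Note that the loop below reassigns the same variables on every iteration, so only the split from the last of the five folds is kept.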

In [None]:
from sklearn.model_selection import KFold 
from sklearn.model_selection import StratifiedKFold

In [None]:
kfold = KFold(n_splits=5, 
                 shuffle = True,
                 random_state=614)

In [None]:
kfold.split(X, y)

<generator object _BaseKFold.split at 0x7f5726bb2bd0>

In [None]:
for train_index, test_index in kfold.split(X, y):
    X_train = X.iloc[train_index,:]
    y_train = y.iloc[train_index]
    X_holdout = X.iloc[test_index,:]
    y_holdout = y.iloc[test_index]
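
Since the loop above overwrites its variables on each pass, only the final fold survives. As a small illustration on synthetic data (the arrays and names here are made up for the sketch and are not part of our dataset), each fold's indices could be collected like this if all five splits were needed:

```python
import numpy as np
from sklearn.model_selection import KFold

# Synthetic stand-ins for X and y: 10 rows, 2 features.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

kfold_demo = KFold(n_splits=5, shuffle=True, random_state=614)

# Keep every (train, holdout) index pair instead of only the last one.
folds = []
for train_idx, holdout_idx in kfold_demo.split(X_demo, y_demo):
    folds.append((train_idx, holdout_idx))

fold_sizes = [(len(tr), len(ho)) for tr, ho in folds]
all_holdout = np.sort(np.concatenate([ho for _, ho in folds]))

print(fold_sizes)
```

Each row lands in exactly one holdout fold, so averaging a metric over the five folds would use every training row once for evaluation.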

Now let's fit XGBoost's built-in classifier to X_train and y_train, the training portion of this split.

In [None]:
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

In [None]:
xgbc = XGBClassifier()

In [None]:
xgbc.fit(X_train, y_train)

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


XGBClassifier()
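
The `column_or_1d` lines in the output above appear to be the tail of a scikit-learn warning that is raised when the target is passed as a single-column 2D object rather than a 1D array. A minimal sketch of the difference, using a toy frame whose `bin_demo` column is a hypothetical placeholder:

```python
import pandas as pd

# Toy frame; 'bin_demo' is a hypothetical placeholder column.
frame = pd.DataFrame({
    "pharmacy_claim_approved": [1, 0, 1],
    "bin_demo": [1, 0, 0],
})

# Boolean column selection keeps a 2D, one-column DataFrame.
y_frame = frame.loc[:, frame.columns == "pharmacy_claim_approved"]

# Selecting by name gives a 1D Series instead.
y_series = frame["pharmacy_claim_approved"]

# An existing one-column frame can also be flattened to a 1D array.
y_flat = y_frame.to_numpy().ravel()

print(y_frame.shape, y_series.shape, y_flat.shape)
```

Passing either 1D form to `fit` should avoid the warning without changing the fitted model.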

We now let our XGBoost model predict the y-values for the holdout set and then the y-values for the test dataset as well.

In [None]:
ypredholdout = xgbc.predict(X_holdout)

In [None]:
ypred = xgbc.predict(Xtest)

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
cm = confusion_matrix(y_holdout,ypredholdout) 
print(cm)

[[ 62750  11577]
 [     0 104640]]


In [None]:
cm = confusion_matrix(ytest,ypred) 
print(cm)

[[155332  28434]
 [     0 256975]]


In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import roc_auc_score

With the metrics imported, we compute them for our fitted model, first for the predictions on the holdout set and then for the test data.

In [None]:
print(accuracy_score(y_holdout, ypredholdout))
print(f1_score(y_holdout, ypredholdout))
print(precision_score(y_holdout, ypredholdout))
print(recall_score(y_holdout, ypredholdout))
print(roc_auc_score(y_holdout, ypredholdout))

0.9353120966435153
0.9475814667409229
0.9003846253129921
1.0
0.9221211672743418
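
The values printed above can be checked by hand from the holdout confusion matrix, where rows are true labels and columns are predicted labels. The sketch below recomputes each metric from the four counts. Because the ROC AUC here is computed from hard 0/1 predictions rather than probabilities, it reduces to the average of the true-positive rate and the true-negative rate.

```python
# Holdout confusion matrix printed earlier:
# rows are true labels, columns are predicted labels.
tn, fp = 62750, 11577
fn, tp = 0, 104640

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)           # share of predicted approvals that are correct
recall = tp / (tp + fn)              # share of actual approvals that we catch
f1 = 2 * precision * recall / (precision + recall)

# With hard 0/1 predictions, the ROC curve has a single interior point,
# so its area is the mean of the true-positive and true-negative rates.
specificity = tn / (tn + fp)
auc_from_labels = (recall + specificity) / 2

print(round(accuracy, 4), round(f1, 4), round(precision, 4),
      recall, round(auc_from_labels, 4))
```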


In [None]:
print(accuracy_score(ytest, ypred))
print(f1_score(ytest, ypred))
print(precision_score(ytest, ypred))
print(recall_score(ytest, ypred))
print(roc_auc_score(ytest, ypred))

0.9354859203023999
0.9475758871943125
0.900374550206896
1.0
0.9226353079459747


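Now that we have fitted XGBoost, we see that the metric values are strong: on both the holdout set and the test set, accuracy is about 0.935, precision about 0.90, and recall 1.0.

Does AdaBoost or the Naive Bayes algorithm do better? We try to find out.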

In [None]:
from sklearn.naive_bayes import CategoricalNB
cnb = CategoricalNB()

In [None]:
cnb.fit(X_train, y_train)

  y = column_or_1d(y, warn=True)


CategoricalNB()

In [None]:
y_predholdout_cnb = cnb.predict(X_holdout)

In [None]:
y_pred_cnb = cnb.predict(Xtest)

In [None]:
print(accuracy_score(y_holdout, y_predholdout_cnb))
print(f1_score(y_holdout, y_predholdout_cnb))
print(precision_score(y_holdout, y_predholdout_cnb))
print(recall_score(y_holdout, y_predholdout_cnb))
print(roc_auc_score(y_holdout, y_predholdout_cnb))

0.7999407712036297
0.8121609569277582
0.9003489589391648
0.7397075688073395
0.8122233136460716


In [None]:
print(accuracy_score(ytest, y_pred_cnb))
print(f1_score(ytest, y_pred_cnb))
print(precision_score(ytest, y_pred_cnb))
print(recall_score(ytest, y_pred_cnb))
print(roc_auc_score(ytest, y_pred_cnb))

0.8013345706435299
0.8130817180428657
0.9005745631664814
0.7410837630119661
0.8133359783465304


It's clear that XGBoost performs better overall than Naive Bayes. The precision scores are nearly identical (about 0.90 for both), but XGBoost's recall is 1.0 against about 0.74 for Naive Bayes, so XGBoost should be chosen.

Now let's check the AdaBoost algorithm.

In [None]:
from sklearn.ensemble import AdaBoostClassifier

In [None]:
model = AdaBoostClassifier()

In [None]:
model.fit(X_train, y_train)

  y = column_or_1d(y, warn=True)


AdaBoostClassifier()

In [None]:
ypredada = model.predict(Xtest)

In [None]:
ypredholdoutada = model.predict(X_holdout)

In [None]:
print(accuracy_score(y_holdout, ypredholdoutada))
print(f1_score(y_holdout, ypredholdoutada))
print(precision_score(y_holdout, ypredholdoutada))
print(recall_score(y_holdout, ypredholdoutada))
print(roc_auc_score(y_holdout, ypredholdoutada))

0.7619002385914722
0.78415342066073
0.8342818340554873
0.7397075688073395
0.7664256896332633


In [None]:
print(accuracy_score(ytest, ypredada))
print(f1_score(ytest, ypredada))
print(precision_score(ytest, ypredada))
print(recall_score(ytest, ypredada))
print(roc_auc_score(ytest, ypredada))

0.7635368617850393
0.7851593179948834
0.8348091388893759
0.7410837630119661
0.7680093129133163


AdaBoost performs worse, so among the boosting algorithms we go with XGBoost. Its precision is essentially tied with that of Naive Bayes (about 0.900 for both, with Naive Bayes marginally ahead on the test set), and it performs best on all the other metrics.

Recall that for the claim approval problem, precision is the fraction of claims we predict as approved that actually get approved. So the higher the precision, the fewer the false positives (Type I errors): cases where we predict approval but the claim is actually rejected. The opposite case, where we predict rejection but the claim actually gets approved, is a false negative (Type II error) and is measured by recall, not precision.

Since we want to minimise the Type I errors, which we take to mean lost revenue for CMM, we choose the model with the best precision (in this case XGBoost and Categorical Naive Bayes are nearly tied) and among them, choose the one with the best metrics overall, which in this case is XGBoost. Its recall of 1.0 also means it produces no false negatives on either the holdout or the test set.
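
The selection rule described above can be sketched in a few lines, using the test-set precision and F1 values printed earlier. Two choices here are assumptions made for illustration: the tolerance that decides which precisions count as close, and the use of F1 as the stand-in for the best metrics overall.

```python
# Test-set metrics copied from the outputs printed earlier in this notebook.
test_metrics = {
    "XGBoost": {"precision": 0.900374550206896, "f1": 0.9475758871943125},
    "CategoricalNB": {"precision": 0.9005745631664814, "f1": 0.8130817180428657},
    "AdaBoost": {"precision": 0.8348091388893759, "f1": 0.7851593179948834},
}

# Assumption: precisions within 0.01 of the best are treated as tied.
PRECISION_TOLERANCE = 0.01

best_precision = max(m["precision"] for m in test_metrics.values())
shortlist = [
    name for name, m in test_metrics.items()
    if best_precision - m["precision"] <= PRECISION_TOLERANCE
]

# Assumption: F1 serves as the tie-breaker for overall performance.
chosen = max(shortlist, key=lambda name: test_metrics[name]["f1"])

print(shortlist, chosen)
```

With these values, AdaBoost drops out on precision and the tie-break between the remaining two models favours XGBoost.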