In this notebook, we fit a random forest classifier to our data. First we look at the fit with the default parameter values; optimising the parameters follows afterwards.

As before, we take bin and drug as our significant features and split our train data using 5-fold cross-validation.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("darkgrid")

In [None]:
from sklearn.model_selection import KFold 
from sklearn.model_selection import StratifiedKFold

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
from google.colab import files
uploaded = files.upload()

Saving train.csv to train.csv


In [None]:
import io
drugs_train = pd.read_csv(io.BytesIO(uploaded['train.csv']),index_col=1, parse_dates=True)

In [None]:
from google.colab import files
uploaded = files.upload()

Saving test.csv to test.csv


In [None]:
import io
drugs_test = pd.read_csv(io.BytesIO(uploaded['test.csv']),index_col=1, parse_dates=True)

In [None]:
pa_columns = ['correct_diagnosis','tried_and_failed','contraindication','pa_approved','reject_code']
id_columns = ['dim_pa_id','dim_date_id','dim_claim_id','Unnamed: 0']
date_columns = ['calendar_year']
drugs_test = drugs_test.drop(columns=pa_columns+id_columns+date_columns)
drugs_test = drugs_test.dropna()

In [None]:
pa_columns = ['correct_diagnosis','tried_and_failed','contraindication','pa_approved','reject_code']
id_columns = ['dim_pa_id','dim_date_id','dim_claim_id','Unnamed: 0']
date_columns = ['calendar_year']
drugs_train = drugs_train.drop(columns=pa_columns+id_columns+date_columns)
drugs_train = drugs_train.dropna()

In [None]:
# Drop the remaining calendar features in one call.
calendar_columns = ['calendar_month','calendar_day','day_of_week','is_weekday','is_workday','is_holiday']
drugs_train = drugs_train.drop(columns=calendar_columns)

In [None]:
# Drop the same calendar features from the test data.
calendar_columns = ['calendar_month','calendar_day','day_of_week','is_weekday','is_workday','is_holiday']
drugs_test = drugs_test.drop(columns=calendar_columns)

At this point, we have retained only bin and drug as our features. Recall that both are categorical variables, so we one-hot encode each of them.

In [None]:
one_hot_encoded_traindata = pd.get_dummies(drugs_train, columns = ['bin', 'drug'])

In [None]:
one_hot_encoded_testdata = pd.get_dummies(drugs_test, columns = ['bin', 'drug'])
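Because the train and test frames are encoded separately, the two sets of dummy columns match only if every bin and drug value occurs in both files. The following is a minimal sketch of one way to guard against a mismatch: reindex the encoded test frame onto the training columns. The toy frames and their values are made up for illustration and stand in for drugs_train and drugs_test.

```python
import pandas as pd

# Toy stand-ins for drugs_train / drugs_test (values are made up for illustration).
toy_train = pd.DataFrame({"bin": [417380, 999001, 417740], "drug": ["A", "B", "C"]})
toy_test = pd.DataFrame({"bin": [417380, 999001], "drug": ["A", "A"]})

train_encoded = pd.get_dummies(toy_train, columns=["bin", "drug"])
test_encoded = pd.get_dummies(toy_test, columns=["bin", "drug"])

# Reindex the test frame onto the training columns:
# dummies that are missing from the test data are created and filled with 0.
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(test_aligned.columns))
```

If both files contain the same categories, the reindex changes nothing, so it is cheap to apply as a precaution.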

In [None]:
# Features: every column except the target.
X = one_hot_encoded_traindata.drop(columns='pharmacy_claim_approved')
# Target as a 1-D Series, which is the shape scikit-learn expects for y.
y = one_hot_encoded_traindata['pharmacy_claim_approved']

In [None]:
# Same split into features and 1-D target for the test data.
Xtest = one_hot_encoded_testdata.drop(columns='pharmacy_claim_approved')
ytest = one_hot_encoded_testdata['pharmacy_claim_approved']

We now set up cross-validation, splitting our train data into k = 5 folds.

In [None]:
kfold = KFold(n_splits=5, 
                 shuffle = True,
                 random_state=614)
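StratifiedKFold is imported above but not used. As a hedged aside, the sketch below shows what it would change: each holdout fold keeps the same share of approved claims as the full data. The toy target is made up, and whether stratification matters for the claims data is not something this notebook establishes.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced target: 80 approved claims (1) and 20 rejected claims (0).
y_toy = np.array([1] * 80 + [0] * 20)
X_toy = np.zeros((len(y_toy), 1))  # placeholder features; only y drives the stratification

skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=614)

# Share of approved claims in each holdout fold.
fold_rates = [y_toy[holdout_idx].mean() for _, holdout_idx in skfold.split(X_toy, y_toy)]
print(fold_rates)
```

With plain KFold the per-fold share can drift away from the overall share, which mostly matters when one class is rare.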

In [None]:
kfold.split(X, y)

<generator object _BaseKFold.split at 0x7f6480365f50>

In [None]:
for train_index, test_index in kfold.split(X, y):
    print("Train index:", train_index)
    print("Test index:", test_index)
    print()
    print()

Train index: [     1      2      3 ... 894829 894831 894832]
Test index: [     0      4      6 ... 894830 894833 894834]


Train index: [     0      2      3 ... 894832 894833 894834]
Test index: [     1      7     13 ... 894823 894827 894828]


Train index: [     0      1      4 ... 894832 894833 894834]
Test index: [     2      3      5 ... 894824 894826 894829]


Train index: [     0      1      2 ... 894832 894833 894834]
Test index: [     9     14     28 ... 894800 894801 894812]


Train index: [     0      1      2 ... 894830 894833 894834]
Test index: [     8     18     20 ... 894822 894831 894832]




In [None]:
for train_index, test_index in kfold.split(X, y):
    X_train = X.iloc[train_index,:]
    y_train = y.iloc[train_index]
    X_holdout = X.iloc[test_index,:]
    y_holdout = y.iloc[test_index]
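The loop above reassigns the same four variables on every pass, so only the split from the last fold survives. To use all five folds, a model can be fitted and scored inside the loop and the scores averaged. The sketch below shows the idea on synthetic data; the names X_demo, y_demo and fold_precisions are hypothetical, and the forest is kept small so that it runs quickly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import KFold

# Synthetic stand-in for X and y (hypothetical data, not the claims data).
X_demo, y_demo = make_classification(n_samples=500, n_features=7, random_state=614)

kfold_demo = KFold(n_splits=5, shuffle=True, random_state=614)
fold_precisions = []

for train_index, holdout_index in kfold_demo.split(X_demo, y_demo):
    # Fit a fresh forest on this fold's training part ...
    model = RandomForestClassifier(n_estimators=25, random_state=614)
    model.fit(X_demo[train_index], y_demo[train_index])
    # ... and score it on the part that was held out.
    holdout_pred = model.predict(X_demo[holdout_index])
    fold_precisions.append(precision_score(y_demo[holdout_index], holdout_pred))

print("Precision per fold:", np.round(fold_precisions, 3))
print("Mean precision:", np.mean(fold_precisions))
```

With the notebook's own objects, the NumPy indexing would become X.iloc[train_index, :] and y.iloc[train_index], as in the cell above.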

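This is what the predictor variables of our train set look like at this point. Note that the loop above overwrites the split on every pass, so X_train and X_holdout come from the last of the five folds.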

In [None]:
X_train

date_val,bin_417380,bin_417614,bin_417740,bin_999001,drug_A,drug_B,drug_C
2017-04-07,0,0,0,1,1,0,0
2017-01-30,0,0,0,1,0,0,1
2019-11-11,0,0,1,0,0,1,0
2019-06-28,0,0,0,1,1,0,0
2017-02-27,0,0,0,1,1,0,0
...,...,...,...,...,...,...,...
2019-07-07,0,0,0,1,0,1,0
2018-10-03,0,0,1,0,1,0,0
2017-04-11,1,0,0,0,0,0,1
2018-08-14,0,0,0,1,0,0,1


Let's now fit the random forest classifier with default parameters to our train data.

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
clf=RandomForestClassifier(n_estimators=100)

In [None]:
# np.ravel flattens y_train to the 1-D shape scikit-learn expects for the target.
clf.fit(X_train, np.ravel(y_train))

RandomForestClassifier()

Using the fitted model, our next task is to predict the y-values (whether or not a claim is approved) for the holdout set and then for the test set.

In [None]:
y_predholdout=clf.predict(X_holdout)

In [None]:
from sklearn import metrics

For the holdout set, we obtain the following values for the metrics, which are already pretty good!

In [None]:
print("Accuracy:",metrics.accuracy_score(y_holdout, y_predholdout))
print("F 1 score:",metrics.f1_score(y_holdout, y_predholdout))
print("Precision score:",metrics.precision_score(y_holdout, y_predholdout))
print("Recall score:",metrics.recall_score(y_holdout, y_predholdout))
print("ROC AUC score:",metrics.roc_auc_score(y_holdout, y_predholdout))

Accuracy: 0.9353120966435153
F 1 score: 0.9475814667409229
Precision score: 0.9003846253129921
Recall score: 1.0
ROC AUC score: 0.9221211672743418
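One caveat on the last metric: roc_auc_score is given the hard 0/1 predictions here. For binary labels that reduces to the mean of recall and specificity, in other words the balanced accuracy. The usual ROC AUC is computed from scores such as predict_proba. The sketch below contrasts the two on synthetic data; all names in it are hypothetical and the numbers it prints say nothing about the claims data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical, not the claims data).
X_demo, y_demo = make_classification(n_samples=600, n_features=7, random_state=614)
X_fit, X_eval, y_fit, y_eval = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=614
)

model = RandomForestClassifier(n_estimators=50, random_state=614).fit(X_fit, y_fit)

hard_labels = model.predict(X_eval)                   # 0/1 predictions
approval_scores = model.predict_proba(X_eval)[:, 1]   # estimated probability of class 1

auc_from_labels = roc_auc_score(y_eval, hard_labels)
auc_from_scores = roc_auc_score(y_eval, approval_scores)

print("ROC AUC from hard labels:  ", auc_from_labels)
print("Balanced accuracy:         ", balanced_accuracy_score(y_eval, hard_labels))
print("ROC AUC from probabilities:", auc_from_scores)
```

In the notebook itself, the probability-based version would be metrics.roc_auc_score(y_holdout, clf.predict_proba(X_holdout)[:, 1]).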


We now predict for the test set and note the metric values in this case. We expect them to be close to what we got when predicting on the holdout set, and they are indeed very similar.

In [None]:
y_pred=clf.predict(Xtest)

In [None]:
print("Accuracy:",metrics.accuracy_score(ytest, y_pred))
print("F 1 score:",metrics.f1_score(ytest, y_pred))
print("Precision score:",metrics.precision_score(ytest, y_pred))
print("Recall score:",metrics.recall_score(ytest, y_pred))
print("ROC AUC score:",metrics.roc_auc_score(ytest, y_pred))

Accuracy: 0.9354859203023999
F 1 score: 0.9475758871943125
Precision score: 0.900374550206896
Recall score: 1.0
ROC AUC score: 0.9226353079459747


Recall that for claim approval we want to optimize precision (see the justification at the end of the notebook 'Boosting and Naive Bayes'); for the random forest with default parameters it is around 0.9, as we see here.
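As a reminder of what these numbers mean, precision is TP / (TP + FP) and recall is TP / (TP + FN). A recall of 1.0 with a precision of about 0.9 therefore means that no approved claim is missed, while roughly one in ten predicted approvals is wrong. The toy labels below are made up to reproduce that pattern and are not taken from the claims data.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels (made up): 9 approved claims, all predicted approved,
# and 6 rejected claims, one of which is wrongly predicted approved.
y_true = [1] * 9 + [0] * 6
y_hat = [1] * 9 + [1] + [0] * 5

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

print("Precision:", tp / (tp + fp), "=", precision_score(y_true, y_hat))
print("Recall:   ", tp / (tp + fn), "=", recall_score(y_true, y_hat))
```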

These values are the same as those we obtained with XGBoost, and our next goal is to see whether we can improve on them by changing the default parameters.

We do that in the notebook titled 'Parameter tuning for Random Forest'.