In this notebook, we will see how the KNN classifier performs on our dataset.

As before, bin and drug are the important features, so we prepare our dataset accordingly.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("darkgrid")

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
from google.colab import files
uploaded = files.upload()

Saving train.csv to train.csv


In [None]:
import io
drugs_train = pd.read_csv(io.BytesIO(uploaded['train.csv']),index_col=1, parse_dates=True)

In [None]:
from google.colab import files
uploaded = files.upload()

Saving test.csv to test.csv


In [None]:
import io
drugs_test = pd.read_csv(io.BytesIO(uploaded['test.csv']),index_col=1, parse_dates=True)

We throw away unimportant features and prepare our dataset for the model.

In [None]:
pa_columns = ['correct_diagnosis','tried_and_failed','contraindication','pa_approved','reject_code']
id_columns = ['dim_pa_id','dim_date_id','dim_claim_id','Unnamed: 0']
date_columns = ['calendar_year']
drugs_train = drugs_train.drop(columns=pa_columns+id_columns+date_columns)
drugs_train = drugs_train.dropna()

In [None]:
pa_columns = ['correct_diagnosis','tried_and_failed','contraindication','pa_approved','reject_code']
id_columns = ['dim_pa_id','dim_date_id','dim_claim_id','Unnamed: 0']
date_columns = ['calendar_year']
drugs_test = drugs_test.drop(columns=pa_columns+id_columns+date_columns)
drugs_test = drugs_test.dropna()

In [None]:
# Drop the remaining date features in one call.
drugs_train = drugs_train.drop(columns=['calendar_month', 'calendar_day', 'day_of_week'])

In [None]:
# Drop the remaining date features in one call.
drugs_test = drugs_test.drop(columns=['calendar_month', 'calendar_day', 'day_of_week'])
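
The train and test cells above repeat the same steps, so one option is to wrap them in a small function and call it on both frames. The sketch below assumes a hypothetical helper named `prepare` and a toy frame with made-up values. It keeps the same order as the cells above: drop, `dropna`, then drop the remaining date columns.

```python
import numpy as np
import pandas as pd

# Column names are taken from the cells above.
FIRST_DROP = ['correct_diagnosis', 'tried_and_failed', 'contraindication',
              'pa_approved', 'reject_code',
              'dim_pa_id', 'dim_date_id', 'dim_claim_id', 'Unnamed: 0',
              'calendar_year']
SECOND_DROP = ['calendar_month', 'calendar_day', 'day_of_week']


def prepare(frame):
    """Hypothetical helper: drop, dropna, drop, in the same order as above."""
    frame = frame.drop(columns=FIRST_DROP).dropna()
    return frame.drop(columns=SECOND_DROP)


# Toy frame with made-up values, only to show the call.
toy = pd.DataFrame({name: [1, 2] for name in FIRST_DROP + SECOND_DROP})
toy['bin'] = ['bin_1', np.nan]   # the second row has a missing value
toy['drug'] = ['A', 'B']
toy['pharmacy_claim_approved'] = [1, 0]

prepared = prepare(toy)
print(prepared.columns.tolist())
print(len(prepared))
```

With the real data, the calls would be `prepare(drugs_train)` and `prepare(drugs_test)` on the freshly loaded frames.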

Recall that both bin and drug are categorical variables, so we encode them.

In [None]:
one_hot_encoded_traindata = pd.get_dummies(drugs_train, columns = ['bin', 'drug'])

In [None]:
one_hot_encoded_testdata = pd.get_dummies(drugs_test, columns = ['bin', 'drug'])
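
Encoding the train and test frames separately only gives matching columns if every `bin` and `drug` value appears in both sets. A minimal sketch of one way to guard against a mismatch is to reindex the test frame to the training columns. The toy frames and their values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for drugs_train / drugs_test (values are made up).
toy_train = pd.DataFrame({'bin': ['A', 'B', 'A'],
                          'drug': ['X', 'Y', 'Z'],
                          'pharmacy_claim_approved': [1, 0, 1]})
toy_test = pd.DataFrame({'bin': ['A', 'A'],
                         'drug': ['X', 'Y'],
                         'pharmacy_claim_approved': [1, 0]})

train_enc = pd.get_dummies(toy_train, columns=['bin', 'drug'])
test_enc = pd.get_dummies(toy_test, columns=['bin', 'drug'])

# Reindex the test frame to the training columns. Categories that never
# occur in the test set (here bin_B and drug_Z) become all-zero columns.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns) == list(test_enc.columns))
```

If the two frames already share all categories, the reindex only fixes the column order and changes nothing else.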

In [None]:
X = one_hot_encoded_traindata.loc[:, one_hot_encoded_traindata.columns != 'pharmacy_claim_approved']
y = one_hot_encoded_traindata.loc[:, one_hot_encoded_traindata.columns == 'pharmacy_claim_approved']

In [None]:
Xtest = one_hot_encoded_testdata.loc[:, one_hot_encoded_testdata.columns != 'pharmacy_claim_approved']
ytest = one_hot_encoded_testdata.loc[:, one_hot_encoded_testdata.columns == 'pharmacy_claim_approved']

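Next, we use k-fold cross-validation with k=5 to split our training data. The loop keeps only the split from its final pass, which gives us one training set and one holdout set.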

In [None]:
from sklearn.model_selection import KFold

In [None]:
kfold = KFold(n_splits=5,
              shuffle=True,
              random_state=614)

In [None]:
kfold.split(X, y)

<generator object _BaseKFold.split at 0x7f4a9567c7d0>

In [None]:
for train_index, test_index in kfold.split(X, y):
    X_train = X.iloc[train_index,:]
    y_train = y.iloc[train_index]
    X_holdout = X.iloc[test_index,:]
    y_holdout = y.iloc[test_index]
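
To use all five folds rather than a single split, we could fit and score a fresh model inside the loop and average the results. The sketch below illustrates this on synthetic data from `make_classification`, not on the claims data. The names `X_toy`, `y_toy` and `fold_scores` are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for X and y (assumption: not the real dataset).
X_toy, y_toy = make_classification(n_samples=200, n_features=8,
                                   random_state=614)

kfold = KFold(n_splits=5, shuffle=True, random_state=614)

fold_scores = []
for train_index, holdout_index in kfold.split(X_toy, y_toy):
    # Fit a fresh model on each training split.
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X_toy[train_index], y_toy[train_index])
    # Score it on the matching holdout split.
    preds = model.predict(X_toy[holdout_index])
    fold_scores.append(accuracy_score(y_toy[holdout_index], preds))

print(len(fold_scores), np.mean(fold_scores))
```

The mean over folds is usually a steadier estimate than the score from any single split.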

Now we fit the KNN classifier with k=5 on our training set.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)


In [None]:
# y_train is a one-column DataFrame; flatten it to the 1-D array scikit-learn expects.
knn.fit(X_train, y_train.values.ravel())

KNeighborsClassifier(n_jobs=-1)

We first predict on the holdout set, and then on the test set.

In [None]:
ypredholdout = knn.predict(X_holdout)

In [None]:
ypred = knn.predict(Xtest)

How well did the model perform? We report five metrics, in this order: accuracy, F1 score, precision, recall, and ROC AUC. We compute them first for the predictions on the holdout data, and then for the predictions on the test data.

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import roc_auc_score

In [None]:
print(accuracy_score(y_holdout, ypredholdout))
print(f1_score(y_holdout, ypredholdout))
print(precision_score(y_holdout, ypredholdout))
print(recall_score(y_holdout, ypredholdout))
print(roc_auc_score(y_holdout, ypredholdout))

0.9353120966435153
0.9475814667409229
0.9003846253129921
1.0
0.9221211672743418


In [None]:
print(accuracy_score(ytest, ypred))
print(f1_score(ytest, ypred))
print(precision_score(ytest, ypred))
print(recall_score(ytest, ypred))
print(roc_auc_score(ytest, ypred))

0.9354859203023999
0.9475758871943125
0.900374550206896
1.0
0.9226353079459747
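
The ROC AUC values above are computed from hard 0/1 predictions. `roc_auc_score` also accepts scores such as the output of `predict_proba`, which lets it take the ranking of predictions into account. The sketch below compares the two on synthetic data and also prints the confusion matrix counts behind precision and recall. The data and variable names are illustrative assumptions, not the claims data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: not the real dataset).
X_toy, y_toy = make_classification(n_samples=300, n_features=8,
                                   random_state=614)
X_tr, X_ho, y_tr, y_ho = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=614, stratify=y_toy)

model = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)

hard_labels = model.predict(X_ho)              # 0/1 predictions
positive_proba = model.predict_proba(X_ho)[:, 1]  # share of neighbors in class 1

auc_from_labels = roc_auc_score(y_ho, hard_labels)
auc_from_proba = roc_auc_score(y_ho, positive_proba)

# Counts of true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = confusion_matrix(y_ho, hard_labels).ravel()

print(auc_from_labels, auc_from_proba)
print(tn, fp, fn, tp)
```

On the real data, the equivalent call would be `knn.predict_proba(X_holdout)[:, 1]`.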


We note that we get the same metric values as with XGBoost or the Random Forest Classifier. However, KNN performs badly in the sense that it takes much longer to run than the other models with similar performance metrics.
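
A run-time comparison like this can be made concrete by timing each model with `time.perf_counter`. The sketch below uses synthetic data and a hypothetical helper `time_fit_predict`. The timings it prints depend on the machine and on the data size, so they say nothing about the timings on the claims data.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: not the real dataset).
X_toy, y_toy = make_classification(n_samples=2000, n_features=20,
                                   random_state=614)


def time_fit_predict(model, X, y):
    """Return wall-clock seconds for one fit followed by one predict."""
    start = time.perf_counter()
    model.fit(X, y)
    model.predict(X)
    return time.perf_counter() - start


timings = {
    'knn': time_fit_predict(KNeighborsClassifier(n_neighbors=5),
                            X_toy, y_toy),
    'random_forest': time_fit_predict(RandomForestClassifier(random_state=614),
                                      X_toy, y_toy),
}

for name, seconds in timings.items():
    print(f'{name}: {seconds:.3f} s')
```

KNN does most of its work at prediction time, because it has to search the training set for neighbors, so it matters that the timing covers `predict` as well as `fit`.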

Taking both run time and the performance metrics into account, we choose the Random Forest Classifier and XGBoost as the best candidates. So far, we have fitted all the algorithms with their default parameters.

We now optimise the parameters for Random Forest and XGBoost and choose the one which performs the best.