In this notebook, we look at the problem of predicting whether an ePA (electronic prior authorization) will be approved. First, we import the metrics, classifiers and other utilities we are going to need.

In [56]:
import numpy as np
import pandas as pd

#Metrics
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, accuracy_score, precision_score, recall_score, f1_score

##Classifiers
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold

#Hyperparameters
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from pprint import pprint
from hyperopt import hp, fmin, tpe, STATUS_OK, Trials

from sklearn.model_selection import cross_val_score

Next we import the training data and drop the NaN values.

In [57]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [58]:
from google.colab import files
uploaded = files.upload()

Saving train.csv to train (4).csv


In [74]:
import io
drugs_train = pd.read_csv(io.BytesIO(uploaded['train.csv']),index_col=1, parse_dates=True)

In [75]:
drugs_train = drugs_train.dropna()

Now we drop the columns which don't have much significance for ePA approvals. These are 'is_weekday', 'is_workday', 'is_holiday', 'correct_diagnosis', 'calendar_year', 'dim_claim_id', 'dim_pa_id', 'dim_date_id' and 'pharmacy_claim_approved'.

Further, we one-hot encode the categorical columns 'bin', 'drug' and 'reject_code'.
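To make the encoding step concrete, here is a minimal sketch of what `pd.get_dummies` does to categorical columns. The toy rows are invented for illustration and only mimic the column names of the real data; they are an assumption, not a sample of `train.csv`.

```python
import pandas as pd

# Toy rows (assumption): made-up values that only mimic the real column names.
toy = pd.DataFrame({
    'bin': [417380, 999001, 417380],
    'drug': ['A', 'B', 'C'],
    'tried_and_failed': [1.0, 0.0, 1.0],
})

# Each listed column is replaced by one indicator column per distinct value,
# named <column>_<value>; columns not listed are left untouched.
encoded = pd.get_dummies(toy, columns=['bin', 'drug'])
print(encoded.columns.tolist())
print(encoded)
```

Depending on the pandas version, the indicator columns are printed either as 0/1 integers or as booleans; the information is the same in both cases.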

In [76]:
X = drugs_train.loc[:, drugs_train.columns != 'pa_approved']
y = drugs_train.loc[:, drugs_train.columns == 'pa_approved']

In [78]:
X = X.drop(columns=['is_weekday', 'is_workday', 'is_holiday', 'correct_diagnosis','calendar_year','dim_claim_id','dim_pa_id','dim_date_id'])

In [79]:
X = X.drop([X.columns[0]], axis='columns')

In [80]:
X = pd.get_dummies(X, columns = ['bin', 'drug','reject_code'])

In [82]:
X=X.drop(columns='pharmacy_claim_approved')

We check what X and y look like (the target y here is the 'pa_approved' column).

We also check that they have the same number of rows.

In [83]:
X

| date_val | calendar_month | calendar_day | day_of_week | tried_and_failed | contraindication | bin_417380 | bin_417614 | bin_417740 | bin_999001 | drug_A | drug_B | drug_C | reject_code_70 | reject_code_75 | reject_code_76 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2019-11-11 | 11 | 11 | 2 | 1.0 | 0.0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2019-06-28 | 6 | 28 | 6 | 0.0 | 0.0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 2019-07-31 | 7 | 31 | 4 | 0.0 | 0.0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 2018-02-21 | 2 | 21 | 4 | 1.0 | 0.0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2018-06-28 | 6 | 28 | 5 | 1.0 | 0.0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2019-04-19 | 4 | 19 | 6 | 0.0 | 0.0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2017-05-06 | 5 | 6 | 7 | 1.0 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 2019-07-07 | 7 | 7 | 1 | 0.0 | 0.0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| 2017-04-11 | 4 | 11 | 3 | 1.0 | 0.0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |


In [84]:
X.shape

(372185, 15)

In [85]:
y.shape

(372185, 1)

In [86]:
import warnings
warnings.filterwarnings('ignore')

from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import AdaBoostClassifier

Now, let's define our baseline model (AdaBoost with default parameters) and evaluate it with repeated stratified k-fold CV, using 10 splits and 3 repeats.

The baseline model accuracy comes to 81.4%.

In [87]:
model = AdaBoostClassifier()

In [88]:
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

In [89]:
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')

In [90]:
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Accuracy: 0.814 (0.001)
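As a small illustration of what the CV object used above does, the sketch below runs `RepeatedStratifiedKFold` on toy labels. The 80/20 class split is an assumption chosen for readability and is not the class balance of the ePA data. It shows that 10 splits with 3 repeats give 30 fits per model, and that stratification keeps the class ratio the same in every test fold.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Toy labels (assumption): 80 positives and 20 negatives, not the ePA data.
y_toy = np.array([1] * 80 + [0] * 20)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not matter for splitting

cv_toy = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
print('fits per model:', cv_toy.get_n_splits())

# Share of positives in each test fold; stratification keeps it at the overall rate.
test_pos_rates = [y_toy[test_idx].mean() for _, test_idx in cv_toy.split(X_toy, y_toy)]
print('distinct positive rates across test folds:', sorted(set(test_pos_rates)))
```

This is also why the standard deviation reported by `cross_val_score` is taken over 30 scores rather than 10.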


In [101]:
from sklearn.model_selection import GridSearchCV,RandomizedSearchCV


Next we try to improve by tuning parameters. For this problem, the metric of interest is recall. In a different notebook (see Miriam's branch), the optimal parameters for stratified k-fold CV with k=5 have been obtained.

Here, we instead use repeated stratified k-fold CV with k=10 and 3 repeats to see whether the recall improves.

We define our grid aiming to optimise the number of estimators and the learning rate. We check 10 values for the number of estimators between 1 and 50, and learning rates from 0.1 to 2.1 in steps of 0.2.
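To see exactly which candidate values this grid contains, the short sketch below evaluates the same `np.linspace` and `np.arange` expressions used in the next cells and prints the results. The rounding of the learning rates is added here for display only and is not part of the original grid.

```python
import numpy as np

# Same expressions as in the grid definition; int() truncates the linspace values.
n_estimators = [int(x) for x in np.linspace(start=1, stop=50, num=10)]

# np.arange excludes the stop value, so the largest learning rate is 2.1, not 2.2.
# Rounded to one decimal for display only (assumption: cosmetic, not in the original).
learning_rates = [round(float(x), 1) for x in np.arange(0.1, 2.2, 0.2)]

print(n_estimators)
print(learning_rates)
print('combinations in the full grid:', len(n_estimators) * len(learning_rates))
```

Since `RandomizedSearchCV` is called without `n_iter`, it uses its default of 10 sampled combinations, so only a fraction of this grid is actually evaluated.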

In [102]:
model = AdaBoostClassifier()

In [103]:
grid = dict()
grid['n_estimators'] = [int(x) for x in np.linspace(start = 1, stop = 50, num = 10)]
grid['learning_rate'] = [x for x in np.arange(0.1, 2.2, 0.2)]

In [104]:
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

In [107]:
random_search = RandomizedSearchCV(estimator=model, param_distributions=grid, n_jobs=-1, cv=cv, scoring='recall')

With recall as the metric to be optimised, the randomized search (which by default samples 10 parameter combinations from the grid) selects 28 estimators and a learning rate of 0.1.

Compare this with 6 estimators and a learning rate of 0.7, the optimum found for 5-fold CV.

In [108]:
random_result = random_search.fit(X, y)

In [109]:
random_result.best_params_

{'learning_rate': 0.1, 'n_estimators': 28}
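Besides `best_params_`, the fitted search object also stores the score of every sampled combination in `cv_results_`, which helps judge how far ahead the winner really is. The sketch below shows one way to inspect it. It runs on a small synthetic dataset from `make_classification` with a reduced toy grid, both of which are assumptions made so the example runs quickly; it does not reproduce the search on the ePA data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic stand-in data (assumption): NOT the ePA dataset.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# Reduced toy grid (assumption) so the example finishes in seconds.
toy_grid = {'n_estimators': [5, 10, 20], 'learning_rate': [0.1, 0.5, 1.0]}

search = RandomizedSearchCV(
    estimator=AdaBoostClassifier(),
    param_distributions=toy_grid,
    n_iter=4,                      # number of sampled combinations
    cv=StratifiedKFold(n_splits=3),
    scoring='recall',
    random_state=0,
)
search.fit(X_toy, y_toy)

# One row per sampled combination, best rank first.
cols = ['param_n_estimators', 'param_learning_rate',
        'mean_test_score', 'std_test_score', 'rank_test_score']
ranking = pd.DataFrame(search.cv_results_)[cols].sort_values('rank_test_score')
print(ranking)
```

On the real search, the same table would be built from `random_result.cv_results_`.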

With the optimal parameters obtained, let's fit our model and evaluate it.

In [116]:
metrics = {'accuracy': accuracy_score, 'precision': precision_score, 'recall': recall_score,
           'f1': f1_score, 'roc_auc': roc_auc_score}

def evaluate_model(model):
    # Collect one score per metric for each of the 30 train/test splits.
    scores = {metric: [] for metric in metrics}
    RSKF = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    y_flat = y.values.ravel()  # 1-D target, as scikit-learn expects

    for k, (train_idx, test_idx) in enumerate(RSKF.split(X.values, y_flat), start=1):
        print('On data split # ', k)
        model.fit(X.values[train_idx], y_flat[train_idx])
        ykpred = model.predict(X.values[test_idx])
        # Note: every metric, including roc_auc, is computed from hard class predictions.
        for metric, score in metrics.items():
            scores[metric].append(score(y_flat[test_idx], ykpred))

    return pd.DataFrame(scores)

In [117]:
random_tunedresult = evaluate_model(AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
            n_estimators = 28,
            algorithm="SAMME.R",
            learning_rate = 0.1))

On data split #  1
On data split #  2
On data split #  3
On data split #  4
On data split #  5
On data split #  6
On data split #  7
On data split #  8
On data split #  9
On data split #  10
On data split #  11
On data split #  12
On data split #  13
On data split #  14
On data split #  15
On data split #  16
On data split #  17
On data split #  18
On data split #  19
On data split #  20
On data split #  21
On data split #  22
On data split #  23
On data split #  24
On data split #  25
On data split #  26
On data split #  27
On data split #  28
On data split #  29
On data split #  30


We obtain the following metrics, averaged over the 30 runs of the model (3 repeats with 10 splits each).

In [118]:
random = random_tunedresult.mean()
random

accuracy     0.799833
precision    0.803709
recall       0.962604
f1           0.876009
roc_auc      0.655991
dtype: float64
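One detail worth keeping in mind when reading the `roc_auc` row: `evaluate_model` passes the hard 0/1 output of `model.predict` to `roc_auc_score`. For binary hard labels this quantity coincides with balanced accuracy, and it generally differs from the AUC computed from predicted probabilities. The sketch below illustrates the difference on made-up numbers; the labels and scores are invented for illustration and are not taken from the ePA data.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Invented labels and scores (assumption), only to illustrate the two computations.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
proba = np.array([0.2, 0.6, 0.4, 0.9, 0.7, 0.55, 0.45, 0.8])
hard = (proba >= 0.5).astype(int)  # what a predict() call would return

auc_from_labels = roc_auc_score(y_true, hard)   # AUC of a single operating point
auc_from_proba = roc_auc_score(y_true, proba)   # AUC of the full ranking
bal_acc = balanced_accuracy_score(y_true, hard)

print('AUC from hard labels  :', auc_from_labels)
print('balanced accuracy     :', bal_acc)
print('AUC from probabilities:', auc_from_proba)
```

If a probability-based AUC is wanted for the tuned model, one option would be to score `model.predict_proba(X_test)[:, 1]` instead of `model.predict(X_test)` for that metric only.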

For the sake of completeness, let's compare what we got here with what we get with 5-fold CV (as done by Miriam).

| | 5-fold CV | 10-fold repeated CV (3 repeats) |
| --- | --- | --- |
| Optimal estimators | 6 | 28 |
| Optimal learning rate | 0.7 | 0.1 |
| Accuracy | 0.806595 | 0.799833 |
| Precision | 0.815649 | 0.803709 |
| Recall | 0.952041 | 0.962604 |
| F1 score | 0.878519 | 0.876009 |
| ROC AUC score | 0.678180 | 0.655991 |

Note that even though we have increased the computational cost of the procedure by increasing the number of splits and repetitions, we have gained little, if anything.

We set out to increase the recall, and it did rise slightly (95.20% for 5-fold CV versus 96.26% for 10-fold CV with 3 repeats). However, every other metric dropped a little in exchange for that increase in recall.

We conclude that 5-fold CV is good enough, except when we really need to maximise recall at a slight expense to the other metrics. In that case, the repeated 10-fold setup increases the recall by about 1 percentage point.
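The size of each trade-off can be read off directly from the comparison table. As a quick check, the sketch below recomputes the differences in percentage points from the mean scores reported there; the dictionary names are made up for this example.

```python
# Mean scores copied from the comparison table above.
five_fold = {'accuracy': 0.806595, 'precision': 0.815649, 'recall': 0.952041,
             'f1': 0.878519, 'roc_auc': 0.678180}
repeated_ten_fold = {'accuracy': 0.799833, 'precision': 0.803709, 'recall': 0.962604,
                     'f1': 0.876009, 'roc_auc': 0.655991}

# Difference (repeated 10-fold minus 5-fold) in percentage points.
diff_pp = {m: round((repeated_ten_fold[m] - five_fold[m]) * 100, 2) for m in five_fold}

for metric, delta in diff_pp.items():
    print(f'{metric:>9}: {delta:+.2f} percentage points')
```

Recall is the only metric that moves up; the largest drop is in the ROC AUC score.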