In this notebook, we attempt to identify the best values for certain parameters of the XGBoost model for the claim approvals.

First let's load the data and preprocess it by retaining the bin, drug as significant features.

In [56]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("darkgrid")

In [57]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [58]:
from google.colab import files
uploaded = files.upload()

Saving train.csv to train (2).csv


In [59]:
import io
drugs_train = pd.read_csv(io.BytesIO(uploaded['train.csv']),index_col=1, parse_dates=True)

In [61]:
from google.colab import files
uploaded = files.upload()

Saving test.csv to test (3).csv


In [62]:
import io
drugs_test = pd.read_csv(io.BytesIO(uploaded['test.csv']),index_col=1, parse_dates=True)

In [63]:
pa_columns = ['correct_diagnosis','tried_and_failed','contraindication','pa_approved','reject_code']
id_columns = ['dim_pa_id','dim_date_id','dim_claim_id','Unnamed: 0']
date_columns = ['calendar_year']
drugs_train = drugs_train.drop(columns=pa_columns+id_columns+date_columns)
drugs_train = drugs_train.dropna()

In [64]:
pa_columns = ['correct_diagnosis','tried_and_failed','contraindication','pa_approved','reject_code']
id_columns = ['dim_pa_id','dim_date_id','dim_claim_id','Unnamed: 0']
date_columns = ['calendar_year']
drugs_test = drugs_test.drop(columns=pa_columns+id_columns+date_columns)
drugs_test = drugs_test.dropna()

In [65]:
drugs_train=drugs_train.drop(columns='calendar_month')
drugs_train=drugs_train.drop(columns='calendar_day')
drugs_train=drugs_train.drop(columns='day_of_week')
drugs_train=drugs_train.drop(columns='is_weekday')
drugs_train=drugs_train.drop(columns='is_workday')
drugs_train=drugs_train.drop(columns='is_holiday')

In [66]:
drugs_test=drugs_test.drop(columns='calendar_month')
drugs_test=drugs_test.drop(columns='calendar_day')
drugs_test=drugs_test.drop(columns='day_of_week')
drugs_test=drugs_test.drop(columns='is_weekday')
drugs_test=drugs_test.drop(columns='is_workday')
drugs_test=drugs_test.drop(columns='is_holiday')

Next, we encode the categorical variables bin and drug.

In [67]:
one_hot_encoded_traindata = pd.get_dummies(drugs_train, columns = ['bin', 'drug'])
one_hot_encoded_testdata = pd.get_dummies(drugs_test, columns = ['bin', 'drug'])

In [68]:
X_train = one_hot_encoded_traindata.loc[:, one_hot_encoded_traindata.columns != 'pharmacy_claim_approved']
y_train = one_hot_encoded_traindata.loc[:, one_hot_encoded_traindata.columns == 'pharmacy_claim_approved']

In [69]:
X_test = one_hot_encoded_testdata.loc[:, one_hot_encoded_testdata.columns != 'pharmacy_claim_approved']
y_test = one_hot_encoded_testdata.loc[:, one_hot_encoded_testdata.columns == 'pharmacy_claim_approved']

In [70]:
import xgboost as xgb
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import roc_auc_score

In [71]:
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe

We try to optimize the following parameters-max_depth, gamma, reg_alpha, reg_lambda, colsample_bytree, min_child_weight, n_estimators, learning_rate.

In [95]:
space={'max_depth': hp.quniform("max_depth", 3, 18, 1),
        'gamma': hp.uniform ('gamma', 1,9),
        'reg_alpha' : hp.quniform('reg_alpha', 40,180,1),
        'reg_lambda' : hp.uniform('reg_lambda', 0,1),
        'colsample_bytree' : hp.uniform('colsample_bytree', 0.5,1),
        'min_child_weight' : hp.quniform('min_child_weight', 0, 10, 1),
        'n_estimators': hp.quniform('n_estimators', 0, 180, 1),
        'learning_rate': hp.quniform('learning_rate', 0, 0.5, 0.01),
        'seed': 0
    }

In [96]:
def objective(space):
    clf=xgb.XGBClassifier(
                    learning_rate=int(space['learning_rate']), n_estimators =int(space['n_estimators']), max_depth = int(space['max_depth']), gamma = space['gamma'],
                    reg_alpha = int(space['reg_alpha']),min_child_weight=int(space['min_child_weight']),
                    colsample_bytree=int(space['colsample_bytree']))
    
    evaluation = [( X_train, y_train), ( X_test, y_test)]
    
    clf.fit(X_train, y_train,
            eval_set=evaluation, eval_metric="auc",
            early_stopping_rounds=10,verbose=False)
    pred = clf.predict(X_test)
    precision = precision_score(y_test, pred>0.5)
    print ("SCORE:", precision)
    return {'loss': -precision, 'status': STATUS_OK }

We run 10 trials aiming for the best AUC, but even for larger trials like 100, the optimum values are identical as can be easily checked.

In [97]:
trials = Trials()

best_hyperparams = fmin(fn = objective,
                        space = space,
                        algo = tpe.suggest,
                        max_evals = 10,
                        trials = trials)

  0%|          | 0/10 [00:00<?, ?it/s, best loss: ?]

  y = column_or_1d(y, warn=True)

  y = column_or_1d(y, warn=True)



SCORE:
0.0
 10%|█         | 1/10 [00:05<00:52,  5.88s/it, best loss: -0.0]

  _warn_prf(average, modifier, msg_start, len(result))

  y = column_or_1d(y, warn=True)

  y = column_or_1d(y, warn=True)



SCORE:
0.0
 20%|██        | 2/10 [00:11<00:46,  5.83s/it, best loss: -0.0]

  _warn_prf(average, modifier, msg_start, len(result))

  y = column_or_1d(y, warn=True)

  y = column_or_1d(y, warn=True)



SCORE:
0.0
 30%|███       | 3/10 [00:17<00:40,  5.83s/it, best loss: -0.0]

  _warn_prf(average, modifier, msg_start, len(result))

  y = column_or_1d(y, warn=True)

  y = column_or_1d(y, warn=True)



SCORE:
0.0
 40%|████      | 4/10 [00:23<00:34,  5.81s/it, best loss: -0.0]

  _warn_prf(average, modifier, msg_start, len(result))

  y = column_or_1d(y, warn=True)

  y = column_or_1d(y, warn=True)



SCORE:
0.0
 50%|█████     | 5/10 [00:28<00:28,  5.78s/it, best loss: -0.0]

  _warn_prf(average, modifier, msg_start, len(result))

  y = column_or_1d(y, warn=True)

  y = column_or_1d(y, warn=True)



SCORE:
0.0
 60%|██████    | 6/10 [00:34<00:23,  5.80s/it, best loss: -0.0]

  _warn_prf(average, modifier, msg_start, len(result))

  y = column_or_1d(y, warn=True)

  y = column_or_1d(y, warn=True)



SCORE:
0.0
 70%|███████   | 7/10 [00:40<00:17,  5.77s/it, best loss: -0.0]

  _warn_prf(average, modifier, msg_start, len(result))

  y = column_or_1d(y, warn=True)

  y = column_or_1d(y, warn=True)



SCORE:
0.0
 80%|████████  | 8/10 [00:46<00:11,  5.80s/it, best loss: -0.0]

  _warn_prf(average, modifier, msg_start, len(result))

  y = column_or_1d(y, warn=True)

  y = column_or_1d(y, warn=True)



SCORE:
0.0
 90%|█████████ | 9/10 [00:52<00:05,  5.82s/it, best loss: -0.0]

  _warn_prf(average, modifier, msg_start, len(result))

  y = column_or_1d(y, warn=True)

  y = column_or_1d(y, warn=True)



SCORE:
0.0
100%|██████████| 10/10 [00:58<00:00,  5.81s/it, best loss: -0.0]


  _warn_prf(average, modifier, msg_start, len(result))



We list the optimum values of our chosen parameters. The best learning rate comes to 0.35 and the optimal number of estimators comes to 38.

In [98]:
print("The best hyperparameters are : ","\n")
print(best_hyperparams)

The best hyperparameters are :  

{'colsample_bytree': 0.5076333004368242, 'gamma': 6.165452026737914, 'learning_rate': 0.35000000000000003, 'max_depth': 6.0, 'min_child_weight': 9.0, 'n_estimators': 38.0, 'reg_alpha': 137.0, 'reg_lambda': 0.004250572694391885}


We now fit XGBoost with the optimised parameters and the other parameters held at theire default values.

In [87]:
from xgboost import XGBClassifier

In [99]:
xgbc = XGBClassifier(colsample_bytree=0.5076333004368242, gamma=6.165452026737914, learning_rate=0.35, max_depth=6, min_child_weight=9, n_estimators=38, reg_alpha=137, reg_lambda=0.004250572694391885)

In [100]:
xgbc.fit(X_train, y_train)

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


XGBClassifier(colsample_bytree=0.5076333004368242, gamma=6.165452026737914,
              learning_rate=0.35, max_depth=6, min_child_weight=9,
              n_estimators=38, reg_alpha=137, reg_lambda=0.004250572694391885)

We now predict on the test set and note down the metric values.

In [101]:
ypred = xgbc.predict(X_test)

In [102]:
print(accuracy_score(y_test, ypred))
print(f1_score(y_test, ypred))
print(precision_score(y_test, ypred))
print(recall_score(y_test, ypred))
print(roc_auc_score(y_test, ypred))

0.9354859203023999
0.9475758871943125
0.900374550206896
1.0
0.9226353079459747


We see that the metric values are same as before, so tuning hyperparameters didn't improve performance.

Let us now find the optimal learning rate and estimators using Random Search and Grid Search.

In [103]:
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from pprint import pprint

In [104]:
pprint(xgbc.get_params())

{'base_score': 0.5,
 'booster': 'gbtree',
 'colsample_bylevel': 1,
 'colsample_bynode': 1,
 'colsample_bytree': 0.5076333004368242,
 'gamma': 6.165452026737914,
 'learning_rate': 0.35,
 'max_delta_step': 0,
 'max_depth': 6,
 'min_child_weight': 9,
 'missing': None,
 'n_estimators': 38,
 'n_jobs': 1,
 'nthread': None,
 'objective': 'binary:logistic',
 'random_state': 0,
 'reg_alpha': 137,
 'reg_lambda': 0.004250572694391885,
 'scale_pos_weight': 1,
 'seed': None,
 'silent': None,
 'subsample': 1,
 'verbosity': 1}


We search for the best learning rate between 0 and 2.2 at intervals of 0.2, and the ideal number of estimators between 0 and 50. The metric of interest is precision.

In [112]:
n_estimators = [int(x) for x in np.linspace(start = 1, stop = 180, num = 180)]
learning_rate = [x for x in np.arange(0.1, 2.2, 0.2)]

In [113]:
search_grid = {'n_estimators': n_estimators,
               'learning_rate': learning_rate,
               }
pprint(search_grid)

{'learning_rate': [0.1,
                   0.30000000000000004,
                   0.5000000000000001,
                   0.7000000000000001,
                   0.9000000000000001,
                   1.1000000000000003,
                   1.3000000000000003,
                   1.5000000000000004,
                   1.7000000000000004,
                   1.9000000000000004,
                   2.1000000000000005],
 'n_estimators': [1,
                  2,
                  3,
                  4,
                  5,
                  6,
                  7,
                  8,
                  9,
                  10,
                  11,
                  12,
                  13,
                  14,
                  15,
                  16,
                  17,
                  18,
                  19,
                  20,
                  21,
                  22,
                  23,
                  24,
                  25,
                  26,
                  27,

In [114]:
xgbcrandom = RandomizedSearchCV(estimator=xgbc, param_distributions=search_grid, scoring='precision', n_jobs=-1)

In [115]:
xgbcrandom.fit(X_train, y_train)

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


RandomizedSearchCV(estimator=XGBClassifier(colsample_bytree=0.5076333004368242,
                                           gamma=6.165452026737914,
                                           learning_rate=0.35, max_depth=6,
                                           min_child_weight=9, n_estimators=38,
                                           reg_alpha=137,
                                           reg_lambda=0.004250572694391885),
                   n_jobs=-1,
                   param_distributions={'learning_rate': [0.1,
                                                          0.30000000000000004,
                                                          0.5000000000000001,
                                                          0.7000000000000001,
                                                          0.9000000000000001,
                                                          1.1000000000000003,
                                                          1.3000000000000003,

In [116]:
xgbcrandom.best_params_

{'learning_rate': 0.9000000000000001, 'n_estimators': 174}

With other parameters at default values, 174 is the optimal number of estimators and 0.9 is the optimal learning rate. So we fit XGBoost using these two values, while keeping other parameters at their default value.

In [117]:
xgbcrandomtuned = XGBClassifier(learning_rate=0.9, n_estimators=174)

In [118]:
xgbcrandomtuned.fit(X_train, y_train)

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


XGBClassifier(learning_rate=0.9, n_estimators=174)

We now predict on the test set using the fitted model and write down the different metric values.

In [119]:
ypredrandomtuned = xgbcrandomtuned.predict(X_test)

In [120]:
print(accuracy_score(y_test, ypredrandomtuned))
print(f1_score(y_test, ypredrandomtuned))
print(precision_score(y_test, ypredrandomtuned))
print(recall_score(y_test, ypredrandomtuned))
print(roc_auc_score(y_test, ypredrandomtuned))

0.9354859203023999
0.9475758871943125
0.900374550206896
1.0
0.9226353079459747


Again we obtain the same values, so it seems tuning is not that useful for this problem.

We haven't used cross validation here, but we repeat the processes (Randomized and Grid Search) with cross validation in another notebook.