# Regression Model for the Prediction of Patient Pay

This model is based on the HistGradientBoostingRegressor in sklearn
HistGradientBoostingRegressor has native support for categorical variables allowing for minimal preprocessing of the features in the data set. The categorical variables for insurance (bin, pcn, and group) and drug are transformed into ordinal values using the label encoder.

Cross validation of model fitted only on the insurance information and drug name show a mean average precentage error of ~1.5%.


### Relevant libraries are imported.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

from sklearn.ensemble import HistGradientBoostingRegressor

from sklearn.model_selection import KFold, cross_val_score, cross_validate

from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error, mean_squared_error

### Data set is imported and split into a training and test set.

Note that rejected claims are also removed from the data set as rejected claims do not offer any information on patient pay.

In [2]:
cmm_data = pd.read_csv('../pharmacy_tx.csv')

#cmm_data_no_dups = cmm_data.drop_duplicates()

cmm_train, cmm_test = train_test_split(cmm_data, shuffle=True, 
                             random_state= 614, test_size=0.001)

### Rejected claims are removed from the data set

Rejected claims do not offer any information on patient pay.

In [3]:
cmm_paid = cmm_test[cmm_test.rejected == 0].copy()

### Minimal feature engineering is perfomed

The month information is extracted from the 'tx_date' column and all categorical variables are encoded using the OrdinalEncoder.

In [4]:
cmm_paid['tx_date'] = pd.to_datetime(cmm_paid.tx_date)
cmm_paid['month'] = cmm_paid.tx_date.dt.month
#cmm_paid['day_of_year'] = cmm_paid.tx_date.dt.dayofyear

In [5]:
cmm_paid.to_csv("cmm_paid.csv", index= False)

In [6]:
cmm_paid = pd.read_csv("cmm_paid.csv")

In [7]:
cmm_paid

Unnamed: 0,tx_date,pharmacy,diagnosis,drug,bin,pcn,group,rejected,patient_pay,month
0,1/2/22,Pharmacy #47,H60.83,branded colifunene,664344,KBOSN,,False,22.74,1
1,1/2/22,Pharmacy #13,V97.67,generic cupitelol,757349,MSCXSG,DGLGRYP,False,17.56,1
2,1/2/22,Pharmacy #16,D13.25,branded plazamiglutic,664344,BIZF,QK6BI1N61,False,120.17,1
3,1/2/22,Pharmacy #45,Q85.91,generic pucomalol,664344,T17LNK,Z01MLD4I,False,16.89,1
4,1/3/22,Pharmacy #54,U41.19,branded foxivelule,725700,,DYGBI610ZY,False,16.48,1
...,...,...,...,...,...,...,...,...,...,...
12815,12/29/22,Pharmacy #29,H36.57,branded antimab,322463,T52GV,EVD4X5,False,6.23,12
12816,12/29/22,Pharmacy #52,H36.57,branded antimab,756120,7THOQ5,,False,22.60,12
12817,12/29/22,Pharmacy #56,G99.93,generic rulfalol,160389,RB7UU,RS5RB3YA,False,5.32,12
12818,12/29/22,Pharmacy #28,O91.95,branded hozirol,664344,REGLCC,BH2Q8B3GY2GAV,False,8.81,12


In [8]:
import json
cmm_paid_json1= cmm_paid.to_json(orient="records")

In [9]:
parsed = json.loads(cmm_paid_json1)
json.dumps(parsed, indent=4) 

'[\n    {\n        "tx_date": "1/2/22",\n        "pharmacy": "Pharmacy #47",\n        "diagnosis": "H60.83",\n        "drug": "branded colifunene",\n        "bin": 664344,\n        "pcn": "KBOSN",\n        "group": null,\n        "rejected": false,\n        "patient_pay": 22.74,\n        "month": 1\n    },\n    {\n        "tx_date": "1/2/22",\n        "pharmacy": "Pharmacy #13",\n        "diagnosis": "V97.67",\n        "drug": "generic cupitelol",\n        "bin": 757349,\n        "pcn": "MSCXSG",\n        "group": "DGLGRYP",\n        "rejected": false,\n        "patient_pay": 17.56,\n        "month": 1\n    },\n    {\n        "tx_date": "1/2/22",\n        "pharmacy": "Pharmacy #16",\n        "diagnosis": "D13.25",\n        "drug": "branded plazamiglutic",\n        "bin": 664344,\n        "pcn": "BIZF",\n        "group": "QK6BI1N61",\n        "rejected": false,\n        "patient_pay": 120.17,\n        "month": 1\n    },\n    {\n        "tx_date": "1/2/22",\n        "pharmacy": "Pharmacy

In [10]:
# Writing to sample.json
with open("cmm_paid.json", "w") as outfile:
    outfile.write(cmm_paid_json1)

In [11]:
enc = OrdinalEncoder()

In [6]:
encoded = enc.fit_transform(cmm_paid[['diagnosis', 'drug', 'bin', 'pcn', 'group']])
cmm_paid['diagnosis_encoded'] = encoded[:,0]
cmm_paid['drug_encoded'] = encoded[:,1]
cmm_paid['bin_encoded'] = encoded[:,2]
cmm_paid['pcn_encoded'] = encoded[:,3]
cmm_paid['group_encoded'] = encoded[:,4]

### Model Perfomance

Five models are trained below. The second model shows the importance of defining categorical features in the HistGradientBoostingRegressor.

In [19]:
## Model 1
## Baseline model using only insurance info and the drug name.

features = ['bin_encoded', 'pcn_encoded', 'group_encoded', 'drug_encoded'] # features
hgbr = HistGradientBoostingRegressor(random_state=412)
base_scores = cross_validate(hgbr, cmm_paid[features], cmm_paid.patient_pay, 
                         cv=5, scoring=['r2', 'neg_root_mean_squared_error', 
                                        'neg_mean_absolute_percentage_error'])

In [29]:
## Model 2
## Improved model using only insurance info with defined categorical features.

features = ['bin_encoded', 'pcn_encoded', 'group_encoded', 'drug_encoded'] # features
hgbr = HistGradientBoostingRegressor(random_state=412, categorical_features=[0, 1, 2, 3])
cat_scores = cross_validate(hgbr, cmm_paid[features], cmm_paid.patient_pay, 
                         cv=5, scoring=['r2', 'neg_root_mean_squared_error', 
                                        'neg_mean_absolute_percentage_error'])

In [33]:
## Model 3
## This model is the same as the model 2 with no bound on the number of leaf nodes.
## Training this model is slower (order of minutes) but shows greatly increased metrics.)

features = ['bin_encoded', 'pcn_encoded', 'group_encoded', 'drug_encoded'] # features
hgbr = HistGradientBoostingRegressor(random_state=412, categorical_features=[0, 1, 2, 3], max_leaf_nodes=None)
scores = cross_validate(hgbr, cmm_paid[features], cmm_paid.patient_pay, 
                         cv=5, scoring=['r2', 'neg_root_mean_squared_error', 
                                        'neg_mean_absolute_percentage_error'])

In [36]:
## Model 4
## This model includes day of year as a feature with no bound on the number of leaf nodes.
## Training this model is slower (order of minutes) but shows greatly increased metrics.)

features = ['bin_encoded', 'pcn_encoded', 'group_encoded', 'drug_encoded', 'day_of_year'] # features
hgbr = HistGradientBoostingRegressor(random_state=412, categorical_features=[0, 1, 2, 3], max_leaf_nodes=None)
scores_day = cross_validate(hgbr, cmm_paid[features], cmm_paid.patient_pay, 
                         cv=5, scoring=['r2', 'neg_root_mean_squared_error', 
                                        'neg_mean_absolute_percentage_error'])

In [None]:
## Model 5
## This model includes month as a feature with no bound on the number of leaf nodes.
## Training this model is slower (order of minutes) but shows greatly increased metrics.)

features = ['bin_encoded', 'pcn_encoded', 'group_encoded', 'drug_encoded', 'month'] # features
hgbr = HistGradientBoostingRegressor(random_state=412, categorical_features=[0, 1, 2, 3, 4], max_leaf_nodes=None)
scores_month = cross_validate(hgbr, cmm_paid[features], cmm_paid.patient_pay, 
                         cv=5, scoring=['r2', 'neg_root_mean_squared_error', 
                                        'neg_mean_absolute_percentage_error'])

### Comparrison of Models

In [97]:
column_names = ['Model', 'Mean Training Time (s)', 'Mean Scoring Time (s)', 'Mean R2', 
                'Mean RMSE', 'Mean MAPE']
model_metrics = [base_scores, cat_scores, scores, scores_day, scores_month]
summary = np.zeros((5, 6))
i,j = 0,1
for metric in model_metrics:
    for value in metric.values():
        summary[i, j] = value.mean()
        j+=1
    j=1
    i+=1
    
summary_metrics = pd.DataFrame(summary, columns=column_names)
summary_metrics['Model'] = ['Model 1', 'Model 2', 'Model 3', 'Model 4', 'Model 5']
summary_metrics['Mean RMSE'] = summary_metrics['Mean RMSE'] * -1
summary_metrics['Mean MAPE'] = summary_metrics['Mean MAPE'] * -100
summary_metrics

Unnamed: 0,Model,Mean Training Time (s),Mean Scoring Time (s),Mean R2,Mean RMSE,Mean MAPE
0,Model 1,15.820838,1.729416,0.941824,9.771663,32.683561
1,Model 2,43.561767,3.04391,0.99259,3.487719,9.543922
2,Model 3,123.590078,10.088337,0.996171,2.507217,1.588028
3,Model 4,494.743763,8.572789,0.997202,2.143141,1.505368
4,Model 5,349.339357,11.606112,0.997166,2.156936,1.410123


Model 4 and Model 5 show very similar statistics. Model 5 outperforms Model 4 in two key areas training time and mean MAPE.

### Build Pipeline

Pipeline allows for the preprocessing and fitting/predicting on the data with a single object. This may be useful for building an application with the model or just predicting on the test set.

In [4]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [5]:
from sklearn.base import BaseEstimator, TransformerMixin

class MonthTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(Self, X, y = None):
        X = pd.to_datetime(X)
        X = X.dt.month.values.reshape(-1,1)
        return X

In [6]:
ct1 = ColumnTransformer(transformers=[('mt', MonthTransformer(), 0)], remainder='passthrough')
ct2 = ColumnTransformer(transformers=[('enc', OrdinalEncoder(), 
                                       [1, 2, 3, 4])], remainder='passthrough')

In [7]:
hgbr_pipe = Pipeline([('months_column', ct1), ('remainder_columns', ct2), 
                      ('hgbr', HistGradientBoostingRegressor(categorical_features=[0,1,2,3,4], 
                                                             max_leaf_nodes=None))])

In [8]:
X = cmm_paid[['tx_date', 'bin', 'pcn', 'group', 'drug']].copy()

In [9]:
y = cmm_paid['patient_pay'].copy()

In [10]:
hgbr_pipe.fit(X, y)

In [11]:
predicted = hgbr_pipe.predict(X)

In [12]:
print('MAPE:', mean_absolute_percentage_error(y, predicted)*100)
print('MSE:', np.sqrt(mean_squared_error(y, predicted)))

MAPE: 1.393584226893579
MSE: 2.129510464623409


### Check Model on the Test Data

In [60]:
cmm_test_paid = cmm_test[cmm_test.rejected == 0].copy()
X_test = cmm_test_paid[['tx_date', 'bin', 'pcn', 'group', 'drug']].copy()
y_test = cmm_test_paid['patient_pay'].copy()

In [70]:
predicted_test = cls.predict(X_test)
print('MAPE:', mean_absolute_percentage_error(y_test, predicted_test)*100)
print('MSE:', np.sqrt(mean_squared_error(y_test, predicted_test)))

MAPE: 1.4016888684336466
MSE: 2.146045445630724


The HGBR model does quite well with a minimal number of inputs.

### Save Model and Model Parameters

In [13]:
import pickle as pkl

In [42]:
with open('hgbr_model', 'wb') as f:
    pkl.dump(hgbr_pipe, f)

#with open('hgbr_model', 'rb') as f:
#    cls = pkl.load(f)

In [15]:
#with open('hgbr_params', 'wb') as f:
#    pkl.dump(hgbr_pipe.get_params(deep=True), f)
    
#with open('hgbr_params', 'rb') as f:
#    parameters = pkl.load(f)

### New Pipeline for the web site

This pipeline does not transform date column to month.

In [47]:
hgbr_pipe_for_app = Pipeline([('remainder_columns', ct2), 
                              ('hgbr', HistGradientBoostingRegressor(categorical_features=[0,1,2,3,4], 
                                                                     max_leaf_nodes=None))])

In [45]:
X_web = cmm_paid[['tx_date', 'bin', 'pcn', 'group', 'drug']].copy()
X_web = ct1.fit_transform(X_web)

In [48]:
hgbr_pipe_for_app.fit(X_web, y)

In [51]:
hgbr_pipe_for_app.predict(X_web)

array([ 8.2108047 , 11.48065761, 17.1105145 , ..., 12.74060093,
       10.7206918 ,  6.57087847])

In [111]:
example = X_web[2,:]
example

array([2, 664344, nan, '77MAJF66DGD', 'branded tanoclolol'], dtype=object)

In [79]:
hgbr_pipe_for_app.predict([example])[0]

6.500881620073781

In [112]:
with open('hgbr_web', 'wb') as f:
    pkl.dump(hgbr_pipe_for_app, f)