# Regression Model for the Prediction of Patient Pay

This model is based on the HistGradientBoostingRegressor in sklearn.
HistGradientBoostingRegressor has native support for categorical variables, which allows for minimal preprocessing of the features in the data set. The categorical variables for insurance (bin, pcn, and group) and drug are transformed into ordinal values using the LabelEncoder.

Cross-validation of a model fitted only on the insurance information and drug name shows a mean absolute percentage error (MAPE) of ~1.6%.
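As a small worked example of the metric, the sketch below computes MAPE by hand and with sklearn on made-up values (the numbers are hypothetical and not taken from the data set). Note that `mean_absolute_percentage_error` returns a fraction, which is why the scores later in this notebook are multiplied by 100.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Hypothetical patient pay amounts, made up for illustration only.
y_true = np.array([10.0, 20.0, 40.0])
y_pred = np.array([11.0, 18.0, 40.0])

# Manual MAPE: mean of |error| / |true value|.
manual_mape = np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

# sklearn returns the same quantity as a fraction, not a percentage.
sklearn_mape = mean_absolute_percentage_error(y_true, y_pred)

print(f"MAPE: {sklearn_mape * 100:.2f}%")
```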


### Relevant libraries are imported.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

from sklearn.ensemble import HistGradientBoostingRegressor

from sklearn.model_selection import KFold, cross_val_score, cross_validate

from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error, mean_squared_error

### Data set is imported and split into a training and test set.

In [2]:
cmm_data = pd.read_csv('/home/ruggiec/Downloads/pharmacy_tx.csv')

#cmm_data_no_dups = cmm_data.drop_duplicates()

cmm_train, cmm_test = train_test_split(cmm_data, shuffle=True, 
                             random_state= 614, test_size=0.2)

### Minimal feature engineering is performed

The date information is extracted from the 'tx_date' column and all categorical variables are encoded using the LabelEncoder.

In [11]:
## For the models below, this step is not necessary.

cmm_train['tx_date'] = pd.to_datetime(cmm_train.tx_date)
cmm_train['day_of_week'] = cmm_train.tx_date.dt.day_of_week
cmm_train['day_of_year'] = cmm_train.tx_date.dt.day_of_year

In [4]:
## The models below do not use all of the encoded columns and rely only on bin, pcn, group, and drug.

le_diagnosis = LabelEncoder()
cmm_train['diagnosis_encoded'] = le_diagnosis.fit_transform(cmm_train['diagnosis'])

le_bin = LabelEncoder()
cmm_train['bin_encoded'] = le_bin.fit_transform(cmm_train['bin'])

le_pcn = LabelEncoder()
cmm_train['pcn_encoded'] = le_pcn.fit_transform(cmm_train['pcn'])

le_group= LabelEncoder()
cmm_train['group_encoded'] = le_group.fit_transform(cmm_train['group'])

# Each column gets its own encoder so that it can be reused later
# (for example, to transform the test set).
le_pharmacy = LabelEncoder()
cmm_train['pharmacy_encoded'] = le_pharmacy.fit_transform(cmm_train.pharmacy)

le_day_of_week = LabelEncoder()
cmm_train['day_of_week_encoded'] = le_day_of_week.fit_transform(cmm_train.day_of_week)

le_drug = LabelEncoder()
cmm_train['drug_encoded'] = le_drug.fit_transform(cmm_train.drug)
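One thing to keep in mind when these encoders are later applied to the test set: `LabelEncoder.transform` raises an error on labels it did not see during fitting. A minimal sketch of one possible alternative, assuming `OrdinalEncoder` is acceptable here, is shown below on made-up toy frames (the `train` and `test` frames and their values are hypothetical). Unseen categories are mapped to NaN, which HistGradientBoostingRegressor treats as a missing value.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-ins for the training and test frames (hypothetical values).
train = pd.DataFrame({'drug': ['a', 'b', 'a'], 'bin': ['1', '2', '2']})
test = pd.DataFrame({'drug': ['b', 'c'], 'bin': ['1', '3']})

# Categories not seen during fit are encoded as NaN instead of raising.
encoder = OrdinalEncoder(handle_unknown='use_encoded_value',
                         unknown_value=np.nan)

train_encoded = encoder.fit_transform(train)  # fit on training data only
test_encoded = encoder.transform(test)        # reuse the fitted mapping

print(test_encoded)
```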

### Rejected claims are removed from the data set

Rejected claims do not offer any information on patient pay.

In [5]:
cmm_paid = cmm_train[cmm_train.rejected == 0]

### Model Performance

Three models are trained below. The second model shows the importance of defining categorical features in the HistGradientBoostingRegressor.

In [12]:
## Model 1
## Baseline model using only insurance info and the drug name.

features = ['bin_encoded', 'pcn_encoded', 'group_encoded', 'drug_encoded'] # features
hgbr = HistGradientBoostingRegressor(random_state=412)
base_scores = cross_val_score(hgbr, cmm_paid[features], cmm_paid.patient_pay, 
                         cv=5, scoring='neg_mean_absolute_percentage_error')

In [16]:
print('Default parameters with no categorical features defined--MAPE:', -base_scores.mean()*100)

Default parameters with no categorical features defined--MAPE: 32.39207362582866


In [21]:
## Model 2
## Improved model using only insurance info and the drug name, with categorical features defined.

features = ['bin_encoded', 'pcn_encoded', 'group_encoded', 'drug_encoded'] # features
hgbr = HistGradientBoostingRegressor(random_state=412, categorical_features=[0, 1, 2, 3])
cat_scores = cross_val_score(hgbr, cmm_paid[features], cmm_paid.patient_pay, 
                         cv=5, scoring='neg_mean_absolute_percentage_error')

In [22]:
print('Default parameters with categorical features defined--MAPE:', -cat_scores.mean()*100)

Default parameters with categorical features defined--MAPE: 9.543921744956608


In [6]:
## Model 3
## This model is the same as Model 2 but with no bound on the number of leaf nodes.
## Training this model is slower (on the order of minutes) but gives a much lower MAPE.

features = ['bin_encoded', 'pcn_encoded', 'group_encoded', 'drug_encoded'] # features
hgbr = HistGradientBoostingRegressor(random_state=412, categorical_features=[0, 1, 2, 3], max_leaf_nodes=None)
scores = cross_val_score(hgbr, cmm_paid[features], cmm_paid.patient_pay, 
                         cv=5, scoring='neg_mean_absolute_percentage_error')

In [20]:
print('No upper bound on leaf nodes in model and categorical features--MAPE:', -scores.mean()*100)

No upper bound on leaf nodes in model and categorical features--MAPE: 1.588027916051193


My feeling is that max_leaf_nodes can be tuned to a finite value so that the model trains faster while keeping most of the improvement.

### Further work includes adding day of year and diagnosis information into the model

Initial tests show only minimal gains (~0.1%).

Additionally, this model has only been evaluated by cross-validation on the training set; the held-out test set has not been used yet.