# Predict Patient No-Show

### Problem
A binary classification problem: predict whether a patient will miss a scheduled appointment, i.e. the value of the `No-show` column. Labels = {Yes, No}

### Dataset
110527 (88208 Show / 22319 No-show) patients with appointments ranging in date from 4/29/2016 - 6/07/2016
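
The class counts above imply a baseline that is useful when reading the accuracy figures later in the notebook. The following is a minimal sketch of that arithmetic, using only the counts quoted in this section; the variable names are illustrative and not part of the original notebook.

```python
# Class counts as stated in the Dataset section
n_show = 88208
n_no_show = 22319
n_total = n_show + n_no_show

# Share of appointments that were missed
no_show_rate = n_no_show / n_total

# Accuracy of a trivial model that always predicts "Show"
majority_baseline = n_show / n_total

print(f"total={n_total}")
print(f"no-show rate={no_show_rate:.3f}")
print(f"majority-class baseline accuracy={majority_baseline:.3f}")
```

Because roughly four out of five appointments are kept, a classifier has to beat about 0.80 accuracy before it is doing better than always predicting "Show".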

### Features
- APP_DELTA : Days from when appointment scheduled to appointment day
- HOLIDAY_DELTA : Days from appointment day to nearest US Federal Holiday
- NAT_HOLIDAY : 1 if appointment day is a US Federal Holiday, 0 otherwise
- DAY_OF_MONTH : Day of month from 1-31
- DAY_OF_WEEK : Day of week from 0 (Monday) - 6 (Sunday)
- WEEKEND : 1 if Day of week == 5 or 6 (Saturday or Sunday), 0 otherwise
- GENDER : 1 - Female, 0 - Male
- AGE : Bucketized 1 = 'xx-24', 2='25-34',3='35-49',4='50-xx'
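
As a quick illustration of the date-derived features listed above, here is a small sketch on a toy frame. The three rows are made up for the example (they only mimic the timestamp format of the Kaggle file), and the vectorized `.dt` accessors are one possible way to compute the features, not necessarily how the cells below do it.

```python
import pandas as pd

# Hypothetical toy rows; the notebook itself reads these columns from the Kaggle CSV
toy = pd.DataFrame({
    "ScheduledDay": ["2016-04-27T08:36:51Z", "2016-04-29T16:08:27Z", "2016-05-20T10:00:00Z"],
    "AppointmentDay": ["2016-04-29T00:00:00Z", "2016-04-30T00:00:00Z", "2016-05-30T00:00:00Z"],
})

# Normalize both timestamps to midnight so the difference is a whole number of days
scheduled = pd.to_datetime(toy["ScheduledDay"]).dt.normalize()
appointment = pd.to_datetime(toy["AppointmentDay"]).dt.normalize()

toy["APP_DELTA"] = (appointment - scheduled).dt.days.abs()
toy["DAY_OF_MONTH"] = appointment.dt.day
toy["DAY_OF_WEEK"] = appointment.dt.weekday            # 0 = Monday ... 6 = Sunday
toy["WEEKEND"] = (toy["DAY_OF_WEEK"] >= 5).astype(int)  # Saturday or Sunday

print(toy[["APP_DELTA", "DAY_OF_MONTH", "DAY_OF_WEEK", "WEEKEND"]])
```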

### Algorithm
- Logistic Regression

### Literature Review
- [A probabilistic model for predicting the probability of no-show in hospital appointments](https://link.springer.com/article/10.1007/s10729-011-9148-9)
- [Machine-Learning-Based No Show Prediction in Outpatient Visits](http://www.ijimai.org/journal/sites/default/files/files/2017/03/ijimai_4_7_4_pdf_11885.pdf)
- [Patient No-Show Predictive Model Development using Multiple Data Sources for an Effective Overbooking Approach](https://www.thieme-connect.com/products/ejournals/abstract/10.4338/ACI-2014-04-RA-0026)
- [Predicting appointment misses in hospitals using data analytics](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5427184/)


In [2]:
import pandas as pd
import numpy as np
import time
from datetime import datetime, date
import matplotlib.pyplot as plt
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn import model_selection
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns

In [5]:
#Read in data
data = pd.read_csv("../input/KaggleV2-May-2016.csv", header=0)
data.head()

# Features

### DAY_OF_WEEK

In [6]:
data['DAY_OF_WEEK'] = pd.to_datetime(data['AppointmentDay']).dt.weekday
data.head()

In [7]:
data.DAY_OF_WEEK.hist()
plt.title('Days of Week Histogram')
plt.xlabel('Day of Week')
plt.ylabel('Freq')
plt.show()

### WEEKEND

In [8]:
data['WEEKEND'] = data['DAY_OF_WEEK'].apply(lambda x: 1 if (x== 5 or x==6) else 0)

### DAY_OF_MONTH

In [9]:
data['DAY_OF_MONTH'] = pd.to_datetime(data['AppointmentDay']).dt.day

In [10]:
data.DAY_OF_MONTH.hist()
plt.title('Days of Month Histogram')
plt.xlabel('Day of Month')
plt.ylabel('Freq')
plt.show()

### APP_DELTA

In [11]:
data['AppointmentDay'] = pd.to_datetime(data['AppointmentDay']).dt.date
data['ScheduledDay'] = pd.to_datetime(data['ScheduledDay']).dt.date

In [12]:
#Calculate distance from when appointment scheduled to appointment day
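#Take the difference as a timedelta and keep whole days (casting a timedelta to int64 would give nanoseconds)
data['APP_DELTA'] = (pd.to_datetime(data['AppointmentDay']) - pd.to_datetime(data['ScheduledDay'])).abs().dt.days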

In [13]:
#Sanity check
data[['ScheduledDay','AppointmentDay','APP_DELTA']].head()
data['APP_DELTA'].mean()

In [14]:
data.APP_DELTA.hist()
plt.title('APP_DELTA Histogram')
plt.xlabel('APP_DELTA')
plt.ylabel('Freq')
plt.show()

### HOLIDAY_DELTA

In [15]:
dr = pd.date_range(start='2016-04-28', end='2016-06-08')
df = pd.DataFrame()
df['date'] = dr
cal = calendar()
holidays = cal.holidays(start=dr.min(), end=dr.max())
holidays=pd.to_datetime(holidays).date

In [16]:
#Sanity check : https://www.timeanddate.com/holidays/us/2016
holidays

In [17]:
def time_delta(holidays, pivot):
    nearest=min(holidays, key=lambda x: abs(x-pivot))
    timedelta=abs(nearest-pivot).days
    return timedelta
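
To make the behaviour of `time_delta` concrete, here is a small usage sketch. It redefines the same function so that the block runs on its own, and it uses a hand-written holiday list (`example_holidays` is a name introduced for this example) containing only Memorial Day 2016, the one US federal holiday that falls inside the date range used above.

```python
from datetime import date

def time_delta(holidays, pivot):
    # Pick the holiday closest to the pivot date, then return the gap in days
    nearest = min(holidays, key=lambda x: abs(x - pivot))
    return abs(nearest - pivot).days

# Memorial Day 2016 (hand-written here instead of using the pandas holiday calendar)
example_holidays = [date(2016, 5, 30)]

on_holiday = time_delta(example_holidays, date(2016, 5, 30))   # appointment on the holiday
two_after = time_delta(example_holidays, date(2016, 6, 1))     # two days after the holiday
four_weeks = time_delta(example_holidays, date(2016, 5, 2))    # four weeks before the holiday

print(on_holiday, two_after, four_weeks)
```

The function returns an unsigned distance, so an appointment two days before a holiday and one two days after it get the same `HOLIDAY_DELTA`.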

In [18]:
data['AppointmentDay'] = pd.to_datetime(data['AppointmentDay'])
data['HOLIDAY_DELTA'] = data['AppointmentDay'].dt.date.apply(lambda x: time_delta(holidays,x))

In [19]:
data.HOLIDAY_DELTA.hist()
plt.title('HOLIDAY_DELTA Histogram')
plt.xlabel('HOLIDAY_DELTA')
plt.ylabel('Freq')
plt.show()

### NAT_HOLIDAY

In [20]:
#Compare date objects with date objects; holidays is an array of datetime.date
data['NAT_HOLIDAY'] = data['AppointmentDay'].dt.date.isin(holidays).astype(int)

### Gender
1 - F, 0 - M

In [21]:
data['Gender'] = data['Gender'].apply(lambda x: 1 if x=='F' else 0)

### Age
1 = 'xx-24', 2='25-34',3='35-49',4='50-xx'
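
The bucket boundaries stated above can also be expressed with `pd.cut`, which makes the edges explicit. This is a sketch of an alternative, not the notebook's own method; the sample ages and the name `age_bucket` are illustrative.

```python
import numpy as np
import pandas as pd

# Sample ages chosen to sit on both sides of each bucket boundary
ages = pd.Series([0, 24, 25, 34, 35, 49, 50, 90])

# Right-inclusive bins: (-inf, 24], (24, 34], (34, 49], (49, inf)
age_bucket = pd.cut(
    ages,
    bins=[-np.inf, 24, 34, 49, np.inf],
    labels=[1, 2, 3, 4],
).astype(int)

print(dict(zip(ages.tolist(), age_bucket.tolist())))
```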

In [22]:
#Thresholds follow the bucket labels above: age 24 belongs to bucket 1 ('xx-24')
data['AGE_B'] = data.Age.apply(lambda x: 4 if x > 49 else (3 if x > 34 else (2 if x > 24 else 1)))

In [23]:
#Drop rows with missing values, then sanity check the age buckets
data.dropna(how='any', inplace=True)
data[['Age','AGE_B']].head()

In [24]:
data.Age.hist()
plt.title('Age Histogram')
plt.xlabel('Age')
plt.ylabel('Freq')
plt.show()

In [25]:
data.AGE_B.hist()
plt.title('Age Histogram (Bucketized)')
plt.xlabel('Age bucket')
plt.ylabel('Freq')
plt.show()

## Target

In [26]:
data['No-show'] = data['No-show'].apply(lambda x: 1 if x=='Yes' else 0)
#Sanity check
data['No-show'].sum()

In [27]:
#Sanity check
data.groupby('No-show').count()

# Feature Selection
"the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a coef_ attribute or through a feature_importances_ attribute. Then, the least important features are pruned from current set of features."
- [Documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html)
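
To illustrate the elimination procedure described in the quote, here is a minimal sketch on synthetic data. The dataset from `make_classification`, the choice of six features, and `n_features_to_select=2` are assumptions made for the example and have nothing to do with the no-show data.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 6 features, of which only 2 carry signal
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=2, n_redundant=0,
    random_state=0,
)

# Keep 2 features; by default RFE removes one feature per iteration (step=1)
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
selector.fit(X_demo, y_demo)

# ranking_ is 1 for every selected feature; larger values were eliminated earlier
print("ranking:", selector.ranking_)
print("selected mask:", selector.support_)
```

When `n_features_to_select` is left unset, RFE keeps half of the features, so several features share rank 1 in the output.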


In [28]:
data_in = data.drop(['ScheduledDay','AppointmentDay','PatientId', 'AppointmentID','Neighbourhood', 'Age'], axis=1)

In [29]:
#Separate the predictors from the target column (RFE is already imported above)
data_in = data_in.drop(['No-show'], axis=1)
names = data_in.columns.values

In [30]:
logreg = LogisticRegression()
rfe=RFE(logreg)
rfe=rfe.fit(data_in, data['No-show'])
print("Features sorted by rank:")
print(sorted(zip(map(int, rfe.ranking_), names)))

# Train/Test Data Split

In [31]:
X = data_in[['AGE_B','APP_DELTA','DAY_OF_MONTH','DAY_OF_WEEK','Gender','HOLIDAY_DELTA','Hipertension','SMS_received','Scholarship','Diabetes','Alcoholism','Handcap','NAT_HOLIDAY','WEEKEND']]
y = data['No-show']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=0)

# Fit Model

In [32]:
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Predictions

In [33]:
#Column 1 of predict_proba is the probability of the positive class (No-show = 1)
y_prob = logreg.predict_proba(X_test)[:, 1]
#Hard labels at the 0.5 threshold
y_pred = (y_prob >= 0.5).astype(int)

print('Accuracy: {:.2f}'.format(logreg.score(X_test, y_test)))

# 10-fold cross validation

In [34]:
kfold = model_selection.KFold(n_splits=10)
scoring = 'accuracy'
results = model_selection.cross_val_score(logreg, X_train, y_train, cv=kfold, scoring=scoring)
print("10-fold cross validation\n   average accuracy: %.3f" % (results.mean()))
print("   max accuracy: %.3f" %(results.max()))
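
Since the classes are imbalanced (roughly 80/20 by the counts in the Dataset section), one possible variation is to use shuffled, stratified folds so that each fold keeps about the same class ratio. The sketch below shows this on synthetic data; `StratifiedKFold` is a swapped-in alternative to the plain `KFold` used in this notebook, and the data and names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced data (about 80% negatives), standing in for the real features
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.8, 0.2], random_state=0,
)

# Each fold preserves the overall class ratio; shuffling removes any ordering effect
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv, scoring="accuracy",
)

print("mean accuracy: %.3f" % scores.mean())
print("std accuracy:  %.3f" % scores.std())
```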


# Confusion Matrix

In [35]:
from sklearn.metrics import confusion_matrix
#Store the result under a separate name so the imported function is not shadowed
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Classification Report

In [36]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
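
The precision and recall values in the report follow directly from the four cells of the confusion matrix. Here is a small worked sketch with hand-made labels (the arrays `y_true` and `y_hat` are hypothetical, with 1 standing for no-show) that shows the arithmetic.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = no-show, 0 = show
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels scikit-learn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted no-shows, how many were real
recall = tp / (tp + fn)      # of the real no-shows, how many were caught

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

With imbalanced classes, accuracy can look respectable while recall on the no-show class stays low, which is why the per-class rows of the report are worth reading.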

# ROC Curve

In [37]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

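#ROC analysis needs continuous scores, so use the predicted probability of the positive class
y_score = logreg.predict_proba(X_test)[:, 1]
logit_roc_auc = roc_auc_score(y_test, y_score)
fpr, tpr, thresholds = roc_curve(y_test, y_score)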
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="upper left")
plt.show()
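
A ROC curve is traced by sweeping a threshold over continuous scores, so feeding it already-thresholded 0/1 labels collapses the curve to a single operating point. The following sketch, on a hand-made example with illustrative names, compares the two inputs.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical ground truth and model scores (probability of the positive class)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.45, 0.7, 0.9])

# Hard labels obtained by thresholding the scores at 0.5
hard = (scores >= 0.5).astype(int)

auc_scores = roc_auc_score(y_true, scores)
auc_hard = roc_auc_score(y_true, hard)

# With hard labels the curve has at most three points: start, one operating point, end
fpr_h, tpr_h, _ = roc_curve(y_true, hard)

print(f"AUC from scores:      {auc_scores:.3f}")
print(f"AUC from hard labels: {auc_hard:.3f}")
print("points on hard-label curve:", len(fpr_h))
```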

# AUC

In [38]:
print(logit_roc_auc)