# Predicting medical appointments using Python

In this notebook, I'll use Python and its libraries to predict whether someone will show up for a medical appointment. I'll develop an artificial neural network and train it on the data.

In [17]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('/content/dataset.csv')

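# Parse both timestamps and compute the waiting time in whole days
df['ScheduledDay'] = pd.to_datetime(df['ScheduledDay'])
df['AppointmentDay'] = pd.to_datetime(df['AppointmentDay'])
df['WaitingDays'] = (df['AppointmentDay'] - df['ScheduledDay']).dt.days
# Keep only non-negative waits; .copy() avoids SettingWithCopyWarning on the assignments below
df = df[df['WaitingDays'] >= 0].copy()

# Encode the target and gender as integers, then derive weekday and age-group features
df['No-show'] = df['No-show'].map({'Yes': 1, 'No': 0})
df['Gender'] = LabelEncoder().fit_transform(df['Gender'])
df['AppointmentWeekday'] = df['AppointmentDay'].dt.day_name()
# include_lowest=True keeps Age == 0 in the first bin instead of turning it into NaN
df['AgeGroup'] = pd.cut(df['Age'], bins=[0,18,35,50,65,100], include_lowest=True, labels=['0-18','19-35','36-50','51-65','65+'])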

df.to_csv('dataset_modified.csv', index=False)
print("Cleaned dataset saved as 'dataset_modified.csv'")

Cleaned dataset saved as 'dataset_modified.csv'
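One detail of the `WaitingDays` step is worth checking on the raw data: if `ScheduledDay` carries a time of day while `AppointmentDay` is recorded at midnight, a same-day booking yields a negative difference and is dropped by the `>= 0` filter. The sketch below uses two made-up timestamps (not rows from the dataset) to show the effect and one possible fix, normalizing both columns to midnight before subtracting.

```python
import pandas as pd

# Two made-up rows: a same-day booking and a booking made three calendar days ahead.
toy = pd.DataFrame({
    "ScheduledDay": ["2016-04-29T18:38:08Z", "2016-04-26T09:15:00Z"],
    "AppointmentDay": ["2016-04-29T00:00:00Z", "2016-04-29T00:00:00Z"],
})
toy["ScheduledDay"] = pd.to_datetime(toy["ScheduledDay"])
toy["AppointmentDay"] = pd.to_datetime(toy["AppointmentDay"])

# Raw difference: the time of day makes the same-day booking negative.
raw_days = (toy["AppointmentDay"] - toy["ScheduledDay"]).dt.days

# Normalizing to midnight compares calendar dates only.
date_days = (toy["AppointmentDay"].dt.normalize()
             - toy["ScheduledDay"].dt.normalize()).dt.days

print(raw_days.tolist())
print(date_days.tolist())
```

Whether this matters depends on how the timestamps are stored in `dataset.csv`, so it is a check to run rather than a required change.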


## Import libraries

I'll import the deep learning libraries, Keras and TensorFlow, along with the scikit-learn helpers for metrics, scaling and splitting the data.

In [1]:
import pandas as pd

from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

import os
import tensorflow
os.environ['KERAS_BACKEND'] = 'tensorflow'

import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout

## Import dataset

In [4]:
dataset = pd.read_csv('/content/dataset_modified.csv')
dataset.head(5)

| | PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | Showed_up | Date.diff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29872500000000.0 | 5642903 | F | 2016-04-29 | 2016-04-29 | 62 | JARDIM DA PENHA | False | True | False | False | False | False | True | 0 |
| 1 | 558997800000000.0 | 5642503 | M | 2016-04-29 | 2016-04-29 | 56 | JARDIM DA PENHA | False | False | False | False | False | False | True | 0 |
| 2 | 4262962000000.0 | 5642549 | F | 2016-04-29 | 2016-04-29 | 62 | MATA DA PRAIA | False | False | False | False | False | False | True | 0 |
| 3 | 867951200000.0 | 5642828 | F | 2016-04-29 | 2016-04-29 | 8 | PONTAL DE CAMBURI | False | False | False | False | False | False | True | 0 |
| 4 | 8841186000000.0 | 5642494 | F | 2016-04-29 | 2016-04-29 | 56 | JARDIM DA PENHA | False | True | True | False | False | False | True | 0 |


## Data engineering

If a person has missed an appointment before, they may be more likely to miss another one. Let's check whether the two are correlated.
I took this idea from a kernel on [Kaggle](https://www.kaggle.com/belagoesr/predicting-no-show-downsampling-approach-with-rf).

In [5]:
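# Sum of Showed_up per patient, i.e. the number of appointments each patient attended
# (despite the variable name, this counts attended rather than missed appointments)
missed_appointment = dataset.groupby('PatientId')['Showed_up'].sum()
missed_appointment = missed_appointment.to_dict()
# Flag is 1 if the patient attended at least one appointment, including the current row
dataset['missed_appointment_before'] = dataset.PatientId.map(lambda x: 1 if missed_appointment[x]>0 else 0)
# Correlation between the flag and the target
dataset['missed_appointment_before'].corr(dataset['Showed_up'])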

np.float64(0.6102086525921399)

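The correlation is high (about 0.61), so I'll keep this column. Note that the flag is computed over all of a patient's appointments, including the current row, so part of this correlation comes from the target itself.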
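Because the flag above is built from every appointment of a patient, including the row being predicted, a stricter variant would use only earlier appointments. The following is a minimal sketch on made-up data, not the notebook's method: the toy frame, the `missed` helper column and the `missed_before` name are assumptions for illustration, and it assumes the rows can be ordered by `AppointmentDay`.

```python
import pandas as pd

# Made-up appointment history for two patients.
toy = pd.DataFrame({
    "PatientId": [1, 1, 1, 2, 2],
    "AppointmentDay": pd.to_datetime(
        ["2016-05-01", "2016-05-03", "2016-05-10", "2016-05-02", "2016-05-04"]),
    "Showed_up": [True, False, True, False, False],
})

# Order each patient's appointments chronologically.
toy = toy.sort_values(["PatientId", "AppointmentDay"]).reset_index(drop=True)
toy["missed"] = (~toy["Showed_up"]).astype(int)

# The cumulative sum counts misses up to and including the current row;
# subtracting the current row's own value leaves only earlier misses.
prior_misses = toy.groupby("PatientId")["missed"].cumsum() - toy["missed"]
toy["missed_before"] = (prior_misses > 0).astype(int)

print(toy[["PatientId", "Showed_up", "missed_before"]])
```

Built this way, the feature for a given row never sees that row's own outcome, so any remaining correlation with `Showed_up` reflects past behaviour only.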

As we don't need all the columns, I'll start omitting them.

In [6]:
dataset = dataset.drop(['PatientId', 'AppointmentID', 'ScheduledDay', 'AppointmentDay'], axis = 1)
print("Columns: {}".format(dataset.columns))

Columns: Index(['Gender', 'Age', 'Neighbourhood', 'Scholarship', 'Hipertension',
       'Diabetes', 'Alcoholism', 'Handcap', 'SMS_received', 'Showed_up',
       'Date.diff', 'missed_appointment_before'],
      dtype='object')


Let's create dummy columns to accommodate all neighbourhoods.

In [7]:
dataset = pd.concat([dataset.drop('Neighbourhood', axis = 1),
           pd.get_dummies(dataset['Neighbourhood'])], axis=1)
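As a small illustration of what `pd.get_dummies` produces, the sketch below one-hot encodes a made-up three-row frame (the values resemble the preview above, but the frame itself is hypothetical). Each distinct neighbourhood becomes its own indicator column.

```python
import pandas as pd

# Hypothetical miniature of the dataset with two distinct neighbourhoods.
toy = pd.DataFrame({
    "Age": [62, 56, 8],
    "Neighbourhood": ["JARDIM DA PENHA", "MATA DA PRAIA", "JARDIM DA PENHA"],
})

# Same pattern as the cell above: drop the text column, append its indicators.
encoded = pd.concat(
    [toy.drop("Neighbourhood", axis=1), pd.get_dummies(toy["Neighbourhood"])],
    axis=1,
)

print(encoded.columns.tolist())
print(encoded)
```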

Now, let's map the Gender column to numeric values. The assignment is arbitrary, here 'M' as 0 and 'F' as 1.

In [8]:
gender_map = {'M': 0, 'F': 1}
dataset['Gender'] = dataset['Gender'].map(gender_map)

Next, I'll split the dataset into train and test data.

In [9]:
y = dataset.loc[:, 'Showed_up']
X = dataset.drop(['Showed_up'], axis = 1)

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 42)

In [11]:
print("Final shape: {}".format(X_train.shape))

Final shape: (71681, 91)
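Since the two classes are not equally frequent, one optional refinement is to stratify the split so that train and test keep the same class ratio. This is a sketch on synthetic labels, not what the notebook does: `stratify` is a standard `train_test_split` parameter, while the toy arrays are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 80% positive and 20% negative.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 80 + [0] * 20)

# stratify=y_toy keeps the class ratio roughly equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42, stratify=y_toy)

print(X_tr.shape, X_te.shape)
print(round(y_tr.mean(), 2), round(y_te.mean(), 2))
```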


Let's now scale the data to make it ready for the Neural Network.

In [12]:
standardScaler = StandardScaler()
X_train = standardScaler.fit_transform(X_train)
X_test = standardScaler.transform(X_test)
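To make the fit/transform distinction concrete: the scaler learns each column's mean and standard deviation from the training data only and then applies those same statistics to the test data. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data.
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[20.0], [40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the train statistics

# The same transformation by hand, using the training mean and std.
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(test_scaled.ravel())
print(np.allclose(test_scaled, manual))
```

Fitting on the training set alone keeps information about the test distribution out of the preprocessing step.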

## Model generation

I'll build an artificial neural network with three hidden dense layers using ReLU activations, each followed by dropout, and a single sigmoid output unit for the binary prediction.

In [13]:
classifier = Sequential()
# Declare the input shape explicitly instead of passing input_dim to the first Dense layer
classifier.add(keras.Input(shape = (X_train.shape[1],)))
classifier.add(Dense(units = 64, activation = 'relu'))
classifier.add(Dropout(rate = 0.5))
classifier.add(Dense(units = 128, activation = 'relu'))
classifier.add(Dropout(rate = 0.5))
classifier.add(Dense(units = 128, activation = 'relu'))
classifier.add(Dropout(rate = 0.5))
# A single sigmoid unit outputs the probability of showing up
classifier.add(Dense(units = 1, activation = 'sigmoid'))
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
classifier.summary()
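The size of this network can be worked out by hand: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases, and dropout layers add no parameters. The helper below is a hypothetical illustration, not part of Keras.

```python
def dense_params(n_in, n_out):
    """Trainable parameters of a dense layer: weights plus biases."""
    return n_in * n_out + n_out

# Input width followed by the units of each Dense layer in the model above.
layer_sizes = [91, 64, 128, 128, 1]

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)
print(total)
```

This comes to 30,849 trainable parameters, most of them in the 128-by-128 hidden layer.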


In [14]:
history = classifier.fit(X_train, y_train, epochs = 5, validation_split = 0.1)

Epoch 1/5
2016/2016 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - accuracy: 0.8336 - loss: 0.4544 - val_accuracy: 0.8827 - val_loss: 0.3422
Epoch 2/5
2016/2016 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - accuracy: 0.8845 - loss: 0.3436 - val_accuracy: 0.8827 - val_loss: 0.3471
Epoch 3/5
2016/2016 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.8849 - loss: 0.3383 - val_accuracy: 0.8828 - val_loss: 0.3345
Epoch 4/5
2016/2016 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - accuracy: 0.8833 - loss: 0.3373 - val_accuracy: 0.8827 - val_loss: 0.3383
Epoch 5/5
2016/2016 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - accuracy: 0.8846 - loss: 0.3330 - val_accuracy: 0.8827 - val_loss: 0.3328
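The `2016` steps per epoch in the log follow from the data sizes: `validation_split = 0.1` holds out the last 10% of the 71,681 training rows, and the remainder is processed in batches of 32, the Keras default when `batch_size` is not given. The arithmetic, as a quick check:

```python
import math

n_train_rows = 71681      # rows in X_train, from the shape printed earlier
validation_split = 0.1    # fraction held out by fit()
batch_size = 32           # Keras default batch size

# Rows used for fitting, rows held out for validation, and batches per epoch.
n_fit = math.floor(n_train_rows * (1 - validation_split))
n_val = n_train_rows - n_fit
steps_per_epoch = math.ceil(n_fit / batch_size)

print(n_fit, n_val, steps_per_epoch)
```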


## Model prediction

Now that the model is trained, let's evaluate it on the test data.
For a baseline, the row totals of the confusion matrix give the number of test samples in each class.

In [15]:
y_pred = classifier.predict(X_test)
y_pred = y_pred > 0.5

print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred))
print("-"*50)
print("Accuracy: {:.2f}%".format(accuracy_score(y_test, y_pred)*100))

1104/1104 ━━━━━━━━━━━━━━━━━━━━ 1s 746us/step
Confusion matrix:
[[ 3033  4126]
 [    0 28147]]
--------------------------------------------------
Accuracy: 88.31%
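Accuracy alone hides how the errors are distributed, so it helps to read a few more numbers off the confusion matrix above. The sketch below hard-codes the four counts from that output (rows are true classes, columns are predictions, following scikit-learn's convention) and derives the majority-class baseline and the per-class recall. The variable names are my own.

```python
import numpy as np

# Counts copied from the confusion matrix printed above.
cm = np.array([[3033, 4126],
               [0, 28147]])
tn, fp, fn, tp = cm.ravel()
total = cm.sum()

accuracy = (tn + tp) / total
# Baseline: always predict the larger true class (largest row total).
baseline = cm.sum(axis=1).max() / total
# Recall per class: share of each true class that was predicted correctly.
recall_no_show = tn / (tn + fp)
recall_show = tp / (tp + fn)

print("Accuracy:         {:.2%}".format(accuracy))
print("Baseline:         {:.2%}".format(baseline))
print("Recall (no-show): {:.2%}".format(recall_no_show))
print("Recall (show):    {:.2%}".format(recall_show))
```

On these counts the always-show baseline is about 79.7%, and the model identifies roughly 42% of the patients who did not show up while never misclassifying a patient who did.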


## Conclusion

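Using an artificial neural network with some feature engineering, I was able to achieve an accuracy of 88.31% on the test set in predicting whether someone would show up for their appointment. According to the confusion matrix, all of the errors are no-shows that the model predicted as show-ups.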