## No Show Dataset

The data was collected and published by Joni Hoppen, a data scientist at Aquarela Advanced Analytics in Brazil, and covers November 2015 to June 2016. The dataset has 110,527 medical appointments with 14 associated characteristics (columns).


In [60]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from datetime import datetime as dt
import time
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score, fbeta_score
from sklearn.model_selection  import train_test_split
from sklearn.ensemble import RandomForestClassifier
%matplotlib inline

In [90]:
noShow = pd.read_csv('KaggleV2-May-2016.csv')
noShow.head()

| | PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29872500000000.0 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 1 | 558997800000000.0 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 2 | 4262962000000.0 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29T00:00:00Z | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 3 | 867951200000.0 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29T00:00:00Z | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 4 | 8841186000000.0 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No |


### Looking at the first few rows, some questions come to mind:

1. #### Who is most likely to miss their appointment, and who is not?

2. #### Do reminder messages help patients remember their appointments?

3. #### Can we predict who is going to show up at the appointment?

But before answering any questions, we should clean and prepare the data:

###  Missing values
No missing values

In [62]:
noShow.isna().sum().sum()

0

###  Columns type and info

We can see that although ['ScheduledDay'] and ['AppointmentDay'] contain date and time values, they have the object data type, so we need to convert them.

In [63]:
noShow.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
PatientId         110527 non-null float64
AppointmentID     110527 non-null int64
Gender            110527 non-null object
ScheduledDay      110527 non-null object
AppointmentDay    110527 non-null object
Age               110527 non-null int64
Neighbourhood     110527 non-null object
Scholarship       110527 non-null int64
Hipertension      110527 non-null int64
Diabetes          110527 non-null int64
Alcoholism        110527 non-null int64
Handcap           110527 non-null int64
SMS_received      110527 non-null int64
No-show           110527 non-null object
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB


### Clean ['ScheduledDay'] and ['AppointmentDay'] columns

In [64]:
def clean_time(column):
    '''
    Take a column of timestamp strings such as '2016-04-29T18:38:08Z',
    remove the unnecessary 'T' and 'Z' characters, and
    return a list of datetime objects.
    '''
    date_time_objects = []
    for t in column:  # iterate over the column itself, not the global noShow
        # '2016-04-29T18:38:08Z' -> '2016-04-29 18:38:08'
        t = t.replace('T', ' ').replace('Z', '')
        date_time_objects.append(dt.strptime(t, '%Y-%m-%d %H:%M:%S'))
    return date_time_objects
    

Check the result:

In [91]:
AppointmentDay_object = clean_time(noShow['AppointmentDay'])
AppointmentDay_object[0:3]


[datetime.datetime(2016, 4, 29, 0, 0),
 datetime.datetime(2016, 4, 29, 0, 0),
 datetime.datetime(2016, 4, 29, 0, 0)]

In [92]:
ScheduledDay_object = clean_time(noShow['ScheduledDay'])
ScheduledDay_object[0:3]

[datetime.datetime(2016, 4, 29, 18, 38, 8),
 datetime.datetime(2016, 4, 29, 16, 8, 27),
 datetime.datetime(2016, 4, 29, 16, 19, 4)]
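
The conversion above can also be sketched with pandas' own parser instead of an explicit loop. This is an alternative, not the code used in this notebook. It assumes the timestamps are ISO 8601 strings ending in `Z` (UTC), as in the rows shown earlier, and it uses a two-row stand-in called `sample` (a hypothetical name) rather than the full file.

```python
import pandas as pd

# Two illustrative rows copied from the head of the dataset.
sample = pd.DataFrame({
    'ScheduledDay': ['2016-04-29T18:38:08Z', '2016-04-29T16:08:27Z'],
    'AppointmentDay': ['2016-04-29T00:00:00Z', '2016-04-29T00:00:00Z'],
})

def to_naive_utc(column):
    """Parse ISO 8601 strings as UTC, then drop the timezone information."""
    return pd.to_datetime(column, utc=True).dt.tz_convert(None)

scheduled = to_naive_utc(sample['ScheduledDay'])
appointment = to_naive_utc(sample['AppointmentDay'])

# The .dt accessor exposes day and month without an explicit loop.
appoint_day = appointment.dt.day
appoint_month = appointment.dt.month
```

With both columns parsed this way, `(appointment - scheduled).dt.days` gives the gap in whole days as a Series.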

In [93]:
day = [] # to separate day of the appointment from the rest of the AppointmentDay_object
month = [] # to separate month of the appointment from the rest of the AppointmentDay_object 

for dm in range(len(AppointmentDay_object)):
    t = AppointmentDay_object[dm]
    day.append(t.day)
    month.append(t.month)


I will use ScheduledDay_object and AppointmentDay_object to find the difference between the actual day of the appointment and the day the appointment was scheduled.

In [94]:
time_between_A_S = [] # time difference between the appointment and the scheduling day,
                      # in seconds.

for t in range(len(noShow)):
    time_between_A_S.append(AppointmentDay_object[t] - ScheduledDay_object[t])
    time_between_A_S[t] = time_between_A_S[t].days*24*60*60 + time_between_A_S[t].seconds

time_between_A_S[0:10]

[-67088,
 -58107,
 -58744,
 -62971,
 -58043,
 141789,
 118488,
 116402,
 -28936,
 126695]

Some values are negative, which would mean the appointment was scheduled after it took place. In the rows shown above this happens because AppointmentDay carries no time of day (it is 00:00:00) while ScheduledDay does, so an appointment booked on the same day yields a negative difference. I treat these values as outliers and set them to zero in the next step, assuming these patients scheduled their appointment on the same day they saw their doctor.

In [95]:
day_difference = [] # convert seconds in time_between_A_S to days.

for i in range(len(noShow)):
    t = time_between_A_S[i]
    if t >= 0:
        day_difference.append(time.strftime('%d', time.gmtime(t)))
    else:
        t = 0
        day_difference.append(t)
        
day_difference[0:10]  

[0, 0, 0, 0, 0, '02', '02', '02', 0, '02']
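
One caveat about the cell above: `time.strftime('%d', time.gmtime(t))` returns the day of the month of the date `t` seconds after 1 January 1970, not the number of elapsed days. A gap of 141789 seconds (between one and two days) is therefore reported as `'02'`, and gaps of 31 days or more wrap around into the next month. The sketch below shows one way to count whole days directly. It is not the computation used for the results in this notebook, and `whole_days_between` is a hypothetical helper name.

```python
import time
from datetime import datetime, timedelta

def whole_days_between(scheduled, appointment):
    """Whole days from scheduling to appointment, with negative gaps set to 0."""
    delta = appointment - scheduled
    # timedelta.days is floored, so any negative gap has days <= -1.
    return max(delta.days, 0)

scheduled = datetime(2016, 4, 29, 18, 38, 8)    # first row of the dataset
same_day = datetime(2016, 4, 29, 0, 0)          # its appointment day
later = scheduled + timedelta(seconds=141789)   # illustrative gap of 141789 s

gap_same_day = whole_days_between(scheduled, same_day)
gap_later = whole_days_between(scheduled, later)

# For comparison: '%d' is a day of the month counted from the epoch date.
day_of_month = time.strftime('%d', time.gmtime(141789))
```

Switching to a count like this would change the values in `Time_difference`, so the plots and model scores reported later would need to be regenerated.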

Using 'day_difference', 'day', and 'month' I made 3 new columns:

In [96]:
noShow['Time_difference'] = pd.Series(day_difference, index=noShow.index)
noShow['Time_difference'] = noShow['Time_difference'].astype(str).astype(int)
noShow['Appoint_day'] = pd.Series(day, index=noShow.index)
noShow['Appoint_month'] = pd.Series(month, index=noShow.index)

### Remove outliers from ['Age']

In [97]:
print('Age:',sorted(noShow.Age.unique()))

Age: [-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 102, 115]


We can see that Age has an impossible value (-1) and unlikely values such as 100, 102 and 115. I removed these outliers to improve the prediction process.

In [98]:
noShow = noShow[noShow.Age >= 0]
noShow = noShow[noShow.Age < 100]

### Examine ['Handcap'], ['Alcoholism'], ['Hipertension'], ['Diabetes']


In [99]:
print('Diabetes:',noShow.Diabetes.unique())
print('Alcoholism:',noShow.Alcoholism.unique())
print('Hipertension:',noShow.Hipertension.unique())
print('Handcap:',noShow.Handcap.unique())

Diabetes: [0 1]
Alcoholism: [0 1]
Hipertension: [1 0]
Handcap: [0 1 2 3 4]


### Combine all types of Handcap into one type

I did this step because there is no information to distinguish the 4 different handicap conditions. So I used '0' if the patient has no handicap and '1' if they have one.

In [100]:
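# Assign the result back instead of using inplace=True on a column accessor,
# which is not guaranteed to modify noShow itself.
noShow['Handcap'] = noShow['Handcap'].replace([0,1,2,3,4],[0,1,1,1,1])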
noShow.Handcap.unique()

array([0, 1])

### Examine ['Neighbourhood']

In [101]:
print('Neighbourhood:',noShow.Neighbourhood.unique())

Neighbourhood: ['JARDIM DA PENHA' 'MATA DA PRAIA' 'PONTAL DE CAMBURI' 'REPÚBLICA'
 'GOIABEIRAS' 'ANDORINHAS' 'CONQUISTA' 'NOVA PALESTINA' 'DA PENHA'
 'TABUAZEIRO' 'BENTO FERREIRA' 'SÃO PEDRO' 'SANTA MARTHA' 'SÃO CRISTÓVÃO'
 'MARUÍPE' 'GRANDE VITÓRIA' 'SÃO BENEDITO' 'ILHA DAS CAIEIRAS'
 'SANTO ANDRÉ' 'SOLON BORGES' 'BONFIM' 'JARDIM CAMBURI' 'MARIA ORTIZ'
 'JABOUR' 'ANTÔNIO HONÓRIO' 'RESISTÊNCIA' 'ILHA DE SANTA MARIA'
 'JUCUTUQUARA' 'MONTE BELO' 'MÁRIO CYPRESTE' 'SANTO ANTÔNIO' 'BELA VISTA'
 'PRAIA DO SUÁ' 'SANTA HELENA' 'ITARARÉ' 'INHANGUETÁ' 'UNIVERSITÁRIO'
 'SÃO JOSÉ' 'REDENÇÃO' 'SANTA CLARA' 'CENTRO' 'PARQUE MOSCOSO'
 'DO MOSCOSO' 'SANTOS DUMONT' 'CARATOÍRA' 'ARIOVALDO FAVALESSA'
 'ILHA DO FRADE' 'GURIGICA' 'JOANA D´ARC' 'CONSOLAÇÃO' 'PRAIA DO CANTO'
 'BOA VISTA' 'MORADA DE CAMBURI' 'SANTA LUÍZA' 'SANTA LÚCIA'
 'BARRO VERMELHO' 'ESTRELINHA' 'FORTE SÃO JOÃO' 'FONTE GRANDE'
 'ENSEADA DO SUÁ' 'SANTOS REIS' 'PIEDADE' 'JESUS DE NAZARETH'
 'SANTA TEREZA' 'CRUZAMENTO' 'ILHA DO PRÍNCIPE' 'RO

### One-hot encode the 'Neighbourhood' data using pandas.get_dummies()

In [102]:
# One indicator column per neighbourhood
Neighbour_features = pd.get_dummies(noShow['Neighbourhood'])

In [103]:
Neighbour_features.head()

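| | AEROPORTO | ANDORINHAS | ANTÔNIO HONÓRIO | ARIOVALDO FAVALESSA | BARRO VERMELHO | BELA VISTA | BENTO FERREIRA | BOA VISTA | BONFIM | CARATOÍRA | ... | SANTOS REIS | SEGURANÇA DO LAR | SOLON BORGES | SÃO BENEDITO | SÃO CRISTÓVÃO | SÃO JOSÉ | SÃO PEDRO | TABUAZEIRO | UNIVERSITÁRIO | VILA RUBIM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |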


In [104]:
result = pd.concat([noShow, Neighbour_features], axis=1, join='inner')

In [105]:
noShow = result
noShow.head()

| | PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | ... | SANTOS REIS | SEGURANÇA DO LAR | SOLON BORGES | SÃO BENEDITO | SÃO CRISTÓVÃO | SÃO JOSÉ | SÃO PEDRO | TABUAZEIRO | UNIVERSITÁRIO | VILA RUBIM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29872500000000.0 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 558997800000000.0 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 4262962000000.0 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29T00:00:00Z | 62 | MATA DA PRAIA | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 867951200000.0 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29T00:00:00Z | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 8841186000000.0 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
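
To make the encoding step concrete, here is a minimal sketch on a three-row toy frame. The frame `toy` and its values are illustrative stand-ins, not part of the dataset pipeline.

```python
import pandas as pd

toy = pd.DataFrame({
    'Age': [62, 56, 8],
    'Neighbourhood': ['JARDIM DA PENHA', 'MATA DA PRAIA', 'JARDIM DA PENHA'],
})

# One indicator column per distinct neighbourhood, ordered by name.
dummies = pd.get_dummies(toy['Neighbourhood'])

# Both frames share the same index, so a column-wise concat lines rows up one-to-one.
encoded = pd.concat([toy, dummies], axis=1)
```

Each row ends up with exactly one truthy indicator, in the column matching its neighbourhood. The original 'Neighbourhood' text column is still present after the concat, which is why it is dropped later before training.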


### Let us visualize to answer our questions:

In [106]:
def probStatus_Scatter(group_by):
    '''
    Draw a scatter graph to illustrate the relationship between
    the probability of showing up to the appointment and the group_by column.
    '''
    df = pd.crosstab(index = noShow[group_by], columns = noShow['No-show']).reset_index()
    df['probShowUp'] = df['No'] / (df['No'] + df['Yes'])
    fig = px.scatter(df, x=group_by, y='probShowUp')
    fig.show()

In [107]:
def probStatus_bar(group_by):
    '''
    Draw a bar graph to illustrate the relationship between
    the probability of showing up to the appointment and the group_by column.
    '''
    df = pd.crosstab(index = noShow[group_by], columns = noShow['No-show']).reset_index()
    df['probShowUp'] = df['No'] / (df['No'] + df['Yes'])
    fig = px.bar(df, x=group_by, y='probShowUp')
    fig.show()
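
Both plotting helpers rest on the same calculation: a cross-tabulation of the grouping column against 'No-show', followed by the share of 'No' (the patient showed up) in each group. The sketch below isolates that calculation on a toy frame without plotting. The name `prob_show_up` and the toy values are assumptions made for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'SMS_received': [0, 0, 0, 1, 1, 1, 1],
    'No-show':      ['No', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No'],
})

def prob_show_up(frame, group_by):
    """Share of appointments with No-show == 'No' for each value of group_by."""
    counts = pd.crosstab(index=frame[group_by], columns=frame['No-show'])
    return counts['No'] / (counts['No'] + counts['Yes'])

rates = prob_show_up(toy, 'SMS_received')
```

In this toy data, group 0 has two attended appointments out of three and group 1 has two out of four, so `rates` holds 2/3 and 0.5.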

### Q1: Who is most likely to miss their appointment, and who is not?

In [108]:
probStatus_Scatter('Time_difference')

We can see that people who made their appointment on the same day they saw their doctor have a higher probability of showing up. In general, the closer the scheduling day is to the appointment day, the higher the probability that patients show up.

In [109]:
probStatus_Scatter('Appoint_day')

Although there is no obvious pattern, appointments in the first 3 days or the last 5 days of the month have a slightly higher probability of being attended.

In [110]:
probStatus_bar('Appoint_month')

Here all months have roughly the same probability.

In [111]:
probStatus_Scatter('Age')

Patient age has a clear effect on the attendance rate. As shown in the graph, infants, very young children, and adults between the ages of 45 and 90 have a higher probability of showing up to their appointment than the other groups.

In [112]:
probStatus_bar('Handcap')

In [113]:
probStatus_bar('Hipertension')

In [114]:
probStatus_bar('Alcoholism')

In [115]:
probStatus_bar('Diabetes')

None of these conditions appears to affect the probability that a patient shows up to their appointment.

In [116]:
probStatus_Scatter('Neighbourhood')

There is no apparent relationship between the neighbourhood and the probability of attending an appointment.

In [117]:
probStatus_bar('Gender')

There is no apparent relationship between gender and the probability of showing up either.

### Q2: Do reminder messages help patients remember their appointments?

In [118]:
probStatus_bar('SMS_received')

It looks like receiving a reminder does not improve patients' attendance that much as shown in the graph.

### Re-encode binary variables ['Gender'] and ['No-show']

We have two binary categorical variables that need to be transformed into numbers to make the prediction process easier.

In [119]:
print('Gender:',noShow.Gender.unique())
print('No-show:',noShow['No-show'].unique())

Gender: ['F' 'M']
No-show: ['No' 'Yes']


In [120]:
# Gender: M -> 0, F -> 1. No-show: Yes (missed) -> 0, No (showed up) -> 1.
noShow['Gender'] = noShow['Gender'].map({'M': 0, 'F': 1})
noShow['No-show'] = noShow['No-show'].map({'Yes': 0, 'No': 1})

### Split and drop 

Split the data into features and target label and drop the unwanted values:

In [121]:
# Split the data into features and target label

features = noShow.drop(['No-show','PatientId','AppointmentID','Neighbourhood','ScheduledDay','AppointmentDay'], axis = 1)
label = noShow['No-show']

### Normalize skewed features


In [122]:
# Log-transform the skewed features
skewed_feat =['Age','Gender']
features_log_transformed = pd.DataFrame(data = features)
features_log_transformed[skewed_feat] = features[skewed_feat].apply(lambda x: np.log(x + 1))


In [123]:
scaler = MinMaxScaler() # default=(0, 1)
numerical = ['Age','Gender']

features_transform = pd.DataFrame(data = features_log_transformed)
features_transform[numerical] = scaler.fit_transform(features_log_transformed[numerical])
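
The two cells above apply `log(x + 1)` and then min-max scaling. As a rough illustration of what those steps do to the numbers, the sketch below repeats them by hand on a few made-up ages using only the standard library. It assumes the scaler's default range of (0, 1), where each value becomes `(x - min) / (max - min)`.

```python
import math

ages = [0, 8, 56, 62, 99]  # illustrative values, not the full column

# Step 1: log(x + 1) compresses large values; log1p(0) is 0, so an age of 0 is safe.
logged = [math.log1p(a) for a in ages]

# Step 2: min-max scaling maps the smallest value to 0 and the largest to 1.
lo, hi = min(logged), max(logged)
scaled = [(v - lo) / (hi - lo) for v in logged]
```

For a binary column such as Gender, the same two steps map 0 to 0 and 1 to 1, so the transformation leaves it effectively unchanged.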


### Predict

In [124]:
# Split the 'features' and 'label' data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, 
                                                    label, 
                                                    test_size = 0.2, 
                                                    random_state = 47)

In [125]:
clf = RandomForestClassifier()

model = clf.fit(X_train,y_train)



The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.



In [132]:
predictions = clf.predict(X_test)
accuracy =accuracy_score(y_test, predictions)
fscore=fbeta_score(y_test, predictions,beta = 0.5)

In [131]:
accuracy

0.767995294756368

A fair result!
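
One way to put an accuracy figure in context is to compare it with the accuracy of always predicting the most frequent label. The sketch below shows that comparison on made-up labels. `majority_baseline_accuracy` is a hypothetical helper, and the toy labels are not taken from this dataset.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

toy_labels = [1, 1, 1, 1, 0]  # illustrative; 1 = showed up, 0 = no-show
baseline = majority_baseline_accuracy(toy_labels)
```

Calling it as `majority_baseline_accuracy(list(y_test))` would give the reference value for the test split used here.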

In [133]:
fscore

0.8423770395209067
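
For reference, the F-beta score combines precision $P$ and recall $R$ as $F_\beta = (1 + \beta^2) P R / (\beta^2 P + R)$, so `beta = 0.5` weighs precision more heavily than recall. The sketch below computes it from confusion counts. The function name and the counts are illustrative and are not taken from the model above.

```python
def fbeta_from_counts(tp, fp, fn, beta=0.5):
    """F-beta score from true positives, false positives and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Illustrative counts: precision = 0.8, recall = 8/9.
score = fbeta_from_counts(tp=80, fp=20, fn=10)
```

Because the labels were encoded with 1 meaning the patient showed up, and `fbeta_score` treats label 1 as the positive class by default, the score reported above describes how well the model identifies patients who attend.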

### Q3: Can we predict who is going to show up at the appointment?

In [50]:
importances = model.feature_importances_
X_features = features.columns
list_of_importances = list(zip(X_features, model.feature_importances_))

In [51]:
importances = pd.DataFrame(model.feature_importances_,
                                   index = features.columns,
                                  columns=['importance']).sort_values('importance', ascending=False)
importances[0:3]

| | importance |
|---|---|
| Age | 0.271615 |
| Time_difference | 0.230455 |
| Appoint_day | 0.151618 |


This table shows that Age, Time_difference, and Appoint_day are the top 3 predictive features of my model. In other words, if I know the patient's age, how far in advance they scheduled the appointment, and the day of the month on which the appointment falls, I can predict whether they are going to show up.