# Project: Patient Arrival Prediction Model


#### In this project I analyze, wrangle, and clean a dataset of hospital patients that records information about each patient and whether the patient showed up for their scheduled appointment. Once the data is ready for processing, I test several candidate models and choose the one that best predicts arrival. The final model is then exported to a PKL file so it can be served on a web UI page using Flask.

Project Dataset:
This dataset is from Kaggle.com

● PatientId: Identification of a patient

● AppointmentID: Identification of each appointment

● Gender: Male or Female

● DataMarcacaoConsulta (loaded below as AppointmentDay): The day of the actual appointment, when the patient has to visit the doctor

● DataAgendamento (loaded below as ScheduledDay): The day someone called or registered the appointment

● Age: How old is the patient

● Neighbourhood: Where the appointment takes place

● Scholarship: True or False, indicates if the patient is in the Bolsa Familia program

● Hipertension: True or False

● Diabetes: True or False

● Alcoholism: True or False

● Handcap: Handicap severity level (5 levels, 0 to 4)

● SMS_received: 1 if one or more messages were sent to the patient, otherwise 0

● No-show: "No" indicates that the patient showed up to their appointment, and "Yes" that they did not show up

Project Summary:
This project analyzes a dataset that holds information about the patients of a hospital and records whether each patient arrived at their booked appointment. The goal is to build a machine learning model that can predict whether a patient with the given information will arrive or not.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('Patient.csv')

In [3]:
df.head()

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,29872500000000.0,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,558997800000000.0,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,4262962000000.0,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,867951200000.0,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,8841186000000.0,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,1,1,0,0,0,No


### Let's take a look into our data.

In [4]:
df.shape

(110527, 14)

The dataset has 110,527 rows and 14 columns.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   PatientId       110527 non-null  float64
 1   AppointmentID   110527 non-null  int64  
 2   Gender          110527 non-null  object 
 3   ScheduledDay    110527 non-null  object 
 4   AppointmentDay  110527 non-null  object 
 5   Age             110527 non-null  int64  
 6   Neighbourhood   110527 non-null  object 
 7   Scholarship     110527 non-null  int64  
 8   Hipertension    110527 non-null  int64  
 9   Diabetes        110527 non-null  int64  
 10  Alcoholism      110527 non-null  int64  
 11  Handcap         110527 non-null  int64  
 12  SMS_received    110527 non-null  int64  
 13  No-show         110527 non-null  object 
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB


In [7]:
df.describe()

Unnamed: 0,PatientId,AppointmentID,Age,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received
count,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0
mean,147496300000000.0,5675305.0,37.088874,0.098266,0.197246,0.071865,0.0304,0.022248,0.321026
std,256094900000000.0,71295.75,23.110205,0.297675,0.397921,0.258265,0.171686,0.161543,0.466873
min,39217.84,5030230.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,4172614000000.0,5640286.0,18.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,31731840000000.0,5680573.0,37.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,94391720000000.0,5725524.0,55.0,0.0,0.0,0.0,0.0,0.0,1.0
max,999981600000000.0,5790484.0,115.0,1.0,1.0,1.0,1.0,4.0,1.0


In [8]:
df['Diabetes'].value_counts()


Diabetes
0    102584
1      7943
Name: count, dtype: int64

In [9]:
df['No-show'].value_counts()

No-show
No     88208
Yes    22319
Name: count, dtype: int64

In [10]:
df['Handcap'].value_counts()

Handcap
0    108286
1      2042
2       183
3        13
4         3
Name: count, dtype: int64

In [11]:
df['Neighbourhood'].value_counts()

Neighbourhood
JARDIM CAMBURI                 7717
MARIA ORTIZ                    5805
RESISTÊNCIA                    4431
JARDIM DA PENHA                3877
ITARARÉ                        3514
                               ... 
ILHA DO BOI                      35
ILHA DO FRADE                    10
AEROPORTO                         8
ILHAS OCEÂNICAS DE TRINDADE       2
PARQUE INDUSTRIAL                 1
Name: count, Length: 81, dtype: int64

As we can see, [Scholarship, Hipertension, Diabetes, Alcoholism, SMS_received, No-show] all hold boolean-like values (1/0 or Yes/No). The exception is ['Handcap'], which has 5 different values; this may indicate how severe the handicap is.
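A quick way to confirm which columns are binary is to count the distinct values per column. The sketch below uses a tiny made-up frame (`sample` is hypothetical, not the real data) rather than the full dataset.

```python
import pandas as pd

# Tiny hypothetical sample that mimics a few columns of the dataset.
sample = pd.DataFrame({
    "Scholarship": [0, 1, 0, 0],
    "Diabetes": [0, 0, 1, 0],
    "Handcap": [0, 1, 2, 4],
    "No-show": ["No", "Yes", "No", "No"],
})

# Number of distinct values per column; exactly 2 means the column is binary.
distinct_counts = sample.nunique()
binary_columns = distinct_counts[distinct_counts == 2].index.tolist()
print(binary_columns)
```

On the real frame the same two lines can be run on `df` directly.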

## Data Cleaning and Preprocessing

Some columns are not useful for prediction, such as _PatientId, AppointmentID and ScheduledDay_, so we will drop them.

In [12]:
df.drop(['PatientId', 'AppointmentID', 'ScheduledDay'],axis=1, inplace=True)

In [13]:
df.head()

Unnamed: 0,Gender,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,F,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,M,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,F,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,F,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,F,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,1,1,0,0,0,No


## Checking for null values.

In [14]:
df.isna().sum()

Gender            0
AppointmentDay    0
Age               0
Neighbourhood     0
Scholarship       0
Hipertension      0
Diabetes          0
Alcoholism        0
Handcap           0
SMS_received      0
No-show           0
dtype: int64

### Let's have a look at *Age* column and its different values.

In [17]:
df['Age'].value_counts()

Age
 0      3539
 1      2273
 52     1746
 49     1652
 53     1651
        ... 
 115       5
 100       4
 102       2
 99        1
-1         1
Name: count, Length: 104, dtype: int64

It looks like we have one invalid entry with an age of -1, so we will remove it.

In [18]:
invalid_entry = df[df['Age'] < 0]
df.drop(invalid_entry.index, inplace=True)
df[df['Age'] < 0]

Unnamed: 0,Gender,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
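The same cleanup can also be written as a boolean filter that keeps the valid rows instead of dropping the invalid ones in place. This is only an alternative sketch on a made-up frame (`people` is hypothetical).

```python
import pandas as pd

# Hypothetical mini-frame with one invalid (negative) age.
people = pd.DataFrame({"Age": [62, -1, 8, 0], "Gender": ["F", "F", "M", "M"]})

# Keep only rows with a non-negative age; the original frame is left untouched.
valid = people[people["Age"] >= 0].reset_index(drop=True)
print(valid)
```

Note that an age of 0 is kept here, as in the notebook, since it can represent an infant.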


In [20]:
df['Age'].value_counts()

Age
0      3539
1      2273
52     1746
49     1652
53     1651
       ... 
98        6
115       5
100       4
102       2
99        1
Name: count, Length: 103, dtype: int64

In [21]:
# Now let's drop duplicate entries.
df.drop_duplicates(inplace=True)
df.shape

(95151, 11)

### We removed 15,375 duplicated entries, approximately 14% of the whole dataset.
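Before dropping duplicates it can help to count them first. The sketch below does this on a made-up frame (`visits` is hypothetical). Keep in mind that once the identifier columns are gone, two different appointments with identical attributes also count as duplicates.

```python
import pandas as pd

# Hypothetical mini-frame: row 1 repeats row 0, row 3 differs only in No-show.
visits = pd.DataFrame({
    "Gender": ["F", "F", "M", "F"],
    "Age": [62, 62, 56, 62],
    "No-show": ["No", "No", "No", "Yes"],
})

# duplicated() flags every repeat of an earlier row (the first copy is not flagged).
n_duplicates = int(visits.duplicated().sum())
share = n_duplicates / len(visits)
deduplicated = visits.drop_duplicates()
print(n_duplicates, f"{share:.0%}", deduplicated.shape)
```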

Since we don't need the time of day of the appointment, we keep only the date part of the full timestamp.

In [22]:
df['AppointmentDay'] = df['AppointmentDay'].apply(lambda x: x.split("T")[0])


In [23]:
df['AppointmentDay'].head(15)

0     2016-04-29
1     2016-04-29
2     2016-04-29
3     2016-04-29
4     2016-04-29
5     2016-04-29
6     2016-04-29
7     2016-04-29
8     2016-04-29
9     2016-04-29
10    2016-04-29
11    2016-04-29
12    2016-04-29
13    2016-04-29
14    2016-04-29
Name: AppointmentDay, dtype: object
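As an alternative to splitting the string, the timestamps can be parsed directly and truncated to the day. This is a sketch on two made-up values (`stamps` is hypothetical), not the approach used in the notebook.

```python
import pandas as pd

# Hypothetical timestamps in the same ISO 8601 format as AppointmentDay.
stamps = pd.Series(["2016-04-29T00:00:00Z", "2016-05-03T00:00:00Z"])

# Parse as UTC, drop the timezone information, and truncate to midnight.
dates = pd.to_datetime(stamps, utc=True).dt.tz_localize(None).dt.normalize()
date_strings = dates.dt.strftime("%Y-%m-%d").tolist()
print(date_strings)
```

The result is already a datetime column, so no separate conversion step is needed afterwards.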

Now let's rename the columns to something more informative and correct some spelling errors.

In [24]:
df.head()

Unnamed: 0,Gender,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,F,2016-04-29,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,M,2016-04-29,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,F,2016-04-29,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,F,2016-04-29,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,F,2016-04-29,56,JARDIM DA PENHA,0,1,1,0,0,0,No


In [25]:
df = df.rename(columns={'AppointmentDay': 'Appointment_Date', 'Hipertension': 'Hypertension', 'Handcap': 'Handicap' })


In [26]:
df.head()

Unnamed: 0,Gender,Appointment_Date,Age,Neighbourhood,Scholarship,Hypertension,Diabetes,Alcoholism,Handicap,SMS_received,No-show
0,F,2016-04-29,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,M,2016-04-29,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,F,2016-04-29,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,F,2016-04-29,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,F,2016-04-29,56,JARDIM DA PENHA,0,1,1,0,0,0,No


Let's convert the date column to datetime so that we can use the pandas datetime methods on it.

In [27]:
df['Appointment_Date'] = pd.to_datetime(df['Appointment_Date'])
df['Appointment_Date'].head()

0   2016-04-29
1   2016-04-29
2   2016-04-29
3   2016-04-29
4   2016-04-29
Name: Appointment_Date, dtype: datetime64[ns]

Now we can easily extract the weekday from the converted date.

In [29]:
df['Weekday'] = df['Appointment_Date'].dt.day_name()
df['Weekday'].head(15)

0     Friday
1     Friday
2     Friday
3     Friday
4     Friday
5     Friday
6     Friday
7     Friday
8     Friday
9     Friday
10    Friday
11    Friday
12    Friday
13    Friday
14    Friday
Name: Weekday, dtype: object
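The `.dt` accessor offers both the weekday name and a numeric weekday (Monday = 0). A small sketch on made-up dates (`days` is hypothetical):

```python
import pandas as pd

# Hypothetical dates; 2016-04-29 is the Friday seen in the output above.
days = pd.to_datetime(pd.Series(["2016-04-29", "2016-04-30", "2016-05-02"]))

names = days.dt.day_name().tolist()     # weekday names as strings
numbers = days.dt.dayofweek.tolist()    # Monday = 0 ... Sunday = 6
print(names, numbers)
```

The numeric form avoids a later encoding step, although it imposes an ordering on the days that a model may or may not use sensibly.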

In [30]:
df.head()

Unnamed: 0,Gender,Appointment_Date,Age,Neighbourhood,Scholarship,Hypertension,Diabetes,Alcoholism,Handicap,SMS_received,No-show,Weekday
0,F,2016-04-29,62,JARDIM DA PENHA,0,1,0,0,0,0,No,Friday
1,M,2016-04-29,56,JARDIM DA PENHA,0,0,0,0,0,0,No,Friday
2,F,2016-04-29,62,MATA DA PRAIA,0,0,0,0,0,0,No,Friday
3,F,2016-04-29,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No,Friday
4,F,2016-04-29,56,JARDIM DA PENHA,0,1,1,0,0,0,No,Friday


In [31]:
df.isna().sum()

Gender              0
Appointment_Date    0
Age                 0
Neighbourhood       0
Scholarship         0
Hypertension        0
Diabetes            0
Alcoholism          0
Handicap            0
SMS_received        0
No-show             0
Weekday             0
dtype: int64

### Now let's check the data types and convert them into ones that are compatible with machine learning models and libraries.

In [32]:
attrs = ['Neighbourhood', 'No-show', 'Gender']

df[attrs] = df[attrs].apply(lambda x: pd.factorize(x)[0] + 1)

In [33]:
df.head()

Unnamed: 0,Gender,Appointment_Date,Age,Neighbourhood,Scholarship,Hypertension,Diabetes,Alcoholism,Handicap,SMS_received,No-show,Weekday
0,1,2016-04-29,62,1,0,1,0,0,0,0,1,Friday
1,2,2016-04-29,56,1,0,0,0,0,0,0,1,Friday
2,1,2016-04-29,62,2,0,0,0,0,0,0,1,Friday
3,1,2016-04-29,8,3,0,0,0,0,0,0,1,Friday
4,1,2016-04-29,56,1,0,1,1,0,0,0,1,Friday
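`pd.factorize` assigns codes in order of first appearance, so it is worth keeping the label-to-code mapping around. The sketch below shows how to recover it on a made-up series (`no_show` is hypothetical).

```python
import pandas as pd

# Hypothetical target values in the same format as the No-show column.
no_show = pd.Series(["No", "No", "Yes", "No"])

# factorize returns integer codes (starting at 0) and the unique labels
# in order of first appearance.
codes, uniques = pd.factorize(no_show)
shifted = (codes + 1).tolist()
mapping = {label: position + 1 for position, label in enumerate(uniques)}
print(shifted, mapping)
```

Because the order depends on which value appears first, the same label can receive a different code on a differently sorted file.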


In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 95151 entries, 0 to 110525
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Gender            95151 non-null  int64         
 1   Appointment_Date  95151 non-null  datetime64[ns]
 2   Age               95151 non-null  int64         
 3   Neighbourhood     95151 non-null  int64         
 4   Scholarship       95151 non-null  int64         
 5   Hypertension      95151 non-null  int64         
 6   Diabetes          95151 non-null  int64         
 7   Alcoholism        95151 non-null  int64         
 8   Handicap          95151 non-null  int64         
 9   SMS_received      95151 non-null  int64         
 10  No-show           95151 non-null  int64         
 11  Weekday           95151 non-null  object        
dtypes: datetime64[ns](1), int64(10), object(1)
memory usage: 9.4+ MB


In [35]:
df = pd.get_dummies(df)

In [36]:
df.head()

Unnamed: 0,Gender,Appointment_Date,Age,Neighbourhood,Scholarship,Hypertension,Diabetes,Alcoholism,Handicap,SMS_received,No-show,Weekday_Friday,Weekday_Monday,Weekday_Saturday,Weekday_Thursday,Weekday_Tuesday,Weekday_Wednesday
0,1,2016-04-29,62,1,0,1,0,0,0,0,1,True,False,False,False,False,False
1,2,2016-04-29,56,1,0,0,0,0,0,0,1,True,False,False,False,False,False
2,1,2016-04-29,62,2,0,0,0,0,0,0,1,True,False,False,False,False,False
3,1,2016-04-29,8,3,0,0,0,0,0,0,1,True,False,False,False,False,False
4,1,2016-04-29,56,1,0,1,1,0,0,0,1,True,False,False,False,False,False
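`pd.get_dummies` expands every object column it finds. Passing `columns` and `dtype` makes the behaviour explicit and yields 0/1 integers instead of True/False. A sketch on a made-up frame (`frame` is hypothetical):

```python
import pandas as pd

# Hypothetical mini-frame with one numeric and one categorical column.
frame = pd.DataFrame({"Age": [62, 8, 56], "Weekday": ["Friday", "Monday", "Friday"]})

# Only the listed column is expanded; dtype=int yields 0/1 instead of booleans.
encoded = pd.get_dummies(frame, columns=["Weekday"], dtype=int)
print(encoded.columns.tolist())
print(encoded["Weekday_Friday"].tolist())
```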


In [37]:
df.info()   

<class 'pandas.core.frame.DataFrame'>
Index: 95151 entries, 0 to 110525
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   Gender             95151 non-null  int64         
 1   Appointment_Date   95151 non-null  datetime64[ns]
 2   Age                95151 non-null  int64         
 3   Neighbourhood      95151 non-null  int64         
 4   Scholarship        95151 non-null  int64         
 5   Hypertension       95151 non-null  int64         
 6   Diabetes           95151 non-null  int64         
 7   Alcoholism         95151 non-null  int64         
 8   Handicap           95151 non-null  int64         
 9   SMS_received       95151 non-null  int64         
 10  No-show            95151 non-null  int64         
 11  Weekday_Friday     95151 non-null  bool          
 12  Weekday_Monday     95151 non-null  bool          
 13  Weekday_Saturday   95151 non-null  bool          
 14  Weekday_Th

In [38]:
df['No-show'].value_counts()

No-show
1    74458
2    20693
Name: count, dtype: int64

### No (the patient showed up) is 1

### Yes (the patient did not show up) is 2
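With the counts above, class 2 makes up 20693 of 95151 rows, roughly 22%. Some readers may find a 0/1 target easier to work with; the sketch below shows one way to derive it from the 1/2 coding on a made-up series (`encoded_target` and `missed` are hypothetical names).

```python
import pandas as pd

# Hypothetical target in the notebook's coding: 1 = "No", 2 = "Yes".
encoded_target = pd.Series([1, 1, 2, 1], name="No-show")

# 1 means the appointment was missed, 0 means the patient showed up.
missed = (encoded_target == 2).astype(int)
missed_share = float(missed.mean())
print(missed.tolist(), missed_share)
```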

Now that the data is prepared, we can move on to the machine learning part and experiment with different models.

## Model Building and Experimenting

In [39]:
# Import some classification models
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import GradientBoostingClassifier

# import needed functions
from sklearn.model_selection import cross_validate
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn import metrics

from sklearn.metrics import f1_score

In [40]:
x = df.drop(['No-show', 'Appointment_Date'] , axis=1)
y = df['No-show']


In [41]:
x.head()

Unnamed: 0,Gender,Age,Neighbourhood,Scholarship,Hypertension,Diabetes,Alcoholism,Handicap,SMS_received,Weekday_Friday,Weekday_Monday,Weekday_Saturday,Weekday_Thursday,Weekday_Tuesday,Weekday_Wednesday
0,1,62,1,0,1,0,0,0,0,True,False,False,False,False,False
1,2,56,1,0,0,0,0,0,0,True,False,False,False,False,False
2,1,62,2,0,0,0,0,0,0,True,False,False,False,False,False
3,1,8,3,0,0,0,0,0,0,True,False,False,False,False,False
4,1,56,1,0,1,1,0,0,0,True,False,False,False,False,False


In [42]:
y.head()

0    1
1    1
2    1
3    1
4    1
Name: No-show, dtype: int64

In [43]:
y.shape

(95151,)

In [44]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=42)
# Setting random_state (here 42) makes the split reproducible across different runs.
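Since the two classes are imbalanced (74458 vs 20693 rows), it may be worth passing `stratify` so that the test set keeps the same class ratio. The sketch below uses made-up arrays (`features` and `labels` are hypothetical) with an 80/20 class split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 80 of class 1 and 20 of class 2.
features = np.arange(200).reshape(100, 2)
labels = np.array([1] * 80 + [2] * 20)

# stratify=labels preserves the 80/20 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.1, random_state=42, stratify=labels
)
print(len(y_te), int((y_te == 2).sum()))
```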

In [45]:
RF = RandomForestClassifier()
results = cross_validate(RF, x, y, cv=5, scoring='accuracy', return_train_score=True)
print('Random forest')
print("Accuracy:" , 'train: ', results['train_score'].mean(), '/// test: ', results['test_score'].mean())


Random forest
Accuracy: train:  0.9047566507476347 /// test:  0.7125936825755406
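The gap between the train score (about 0.90) and the test score (about 0.71) suggests that the forest overfits. One common remedy is to limit tree depth. The sketch below shows the mechanics on synthetic data; the `max_depth=3` and `n_estimators=50` values are arbitrary assumptions, not tuned settings for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in data; the real notebook would use x and y instead.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# A shallower forest (hypothetical hyperparameters) usually narrows the train/test gap.
shallow = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0)
scores = cross_validate(
    shallow, X_demo, y_demo, cv=5, scoring="accuracy", return_train_score=True
)
gap = scores["train_score"].mean() - scores["test_score"].mean()
print("train/test gap:", gap)
```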


In [46]:
DT = DecisionTreeClassifier()
results = cross_validate(DT, x, y, cv=5, scoring='accuracy', return_train_score=True)
print('DecisionTree: ')
print("Accuracy: train: " , results['train_score'].mean(), '/// test: ', results['test_score'].mean())



DecisionTree: 
Accuracy: train:  0.9047986891158567 /// test:  0.6856680273822674


# ---------------


In [51]:
model = RandomForestClassifier()
model.fit(X_train, y_train)
y_preds = model.predict(X_test)

print('Random Forest Accuracy is =', accuracy_score(y_test, y_preds))


Random Forest Accuracy is = 0.7182639764606977


In [52]:
cm_RF = metrics.confusion_matrix(y_test, y_preds)
print('Confusion Matrix for Random Forest : \n')
print(cm_RF)

Confusion Matrix for Random Forest : 

[[6546  976]
 [1705  289]]
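In scikit-learn's confusion matrix, rows are the true classes and columns are the predicted classes, in sorted label order (1, then 2). The numbers above can be turned into accuracy, and into recall and precision for class 2 (the no-shows), as sketched here.

```python
import numpy as np

# Random forest confusion matrix from the output above
# (rows = true class 1/2, columns = predicted class 1/2).
cm = np.array([[6546, 976],
               [1705, 289]])

accuracy = np.trace(cm) / cm.sum()                 # correct predictions / all samples
recall_no_show = cm[1, 1] / cm[1].sum()            # caught no-shows / actual no-shows
precision_no_show = cm[1, 1] / cm[:, 1].sum()      # caught no-shows / predicted no-shows
print(round(accuracy, 3), round(recall_no_show, 3), round(precision_no_show, 3))
```

The accuracy reproduces the 0.718 reported above, while only about 14% of the actual no-shows are identified.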


In [53]:
print(' the F1 score is : ')
f1_score(y_test, y_preds, average='weighted')


 the F1 score is : 


0.6932641266200209
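The weighted F1 averages the per-class scores by class frequency, so a strong majority class can hide a weak minority class. The sketch below rebuilds label vectors from the random forest confusion matrix shown earlier (the individual rows are not the real test rows, only the counts match) and compares per-class and weighted F1.

```python
import numpy as np
from sklearn.metrics import f1_score

# Counts taken from the random forest confusion matrix above.
cm = np.array([[6546, 976],
               [1705, 289]])

# Rebuild label vectors that produce exactly this matrix.
y_true_demo = np.repeat([1, 1, 2, 2], cm.ravel())
y_pred_demo = np.repeat([1, 2, 1, 2], cm.ravel())

per_class = f1_score(y_true_demo, y_pred_demo, average=None)   # order: class 1, class 2
weighted = f1_score(y_true_demo, y_pred_demo, average="weighted")
print(per_class.round(3), round(weighted, 4))
```

Class 2 alone scores far lower than the weighted figure suggests, so reporting the per-class values alongside it gives a fuller picture.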

In [54]:
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_preds = model.predict(X_test)

print('Decision Tree Accuracy is =', accuracy_score(y_test, y_preds))



Decision Tree Accuracy is = 0.6952501050861707


In [55]:
cm_DT = metrics.confusion_matrix(y_test, y_preds)
print('Confusion Matrix for Decision Tree : \n')
print(cm_DT)

Confusion Matrix for Decision Tree : 

[[6253 1269]
 [1631  363]]


In [56]:
print(' the F1 score is : ')
f1_score(y_test, y_preds, average='weighted')


 the F1 score is : 


0.6836182298155963

Of the two models evaluated so far, the random forest beats the decision tree on both accuracy (0.718 vs 0.695) and weighted F1 (0.693 vs 0.684).

In [58]:
model = GradientBoostingClassifier()
model.fit(X_train, y_train)
y_preds = model.predict(X_test)

print('Accuracy:', accuracy_score(y_test, y_preds))

Accuracy: 0.7904581757040774


In [59]:
cm_GB = metrics.confusion_matrix(y_test, y_preds)
print('Confusion Matrix for Gradient Boosting  : \n')
print(cm_GB)

Confusion Matrix for Gradient Boosting  : 

[[7521    1]
 [1993    1]]
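The matrix above shows that almost every sample is predicted as class 1. A useful reference point is the accuracy of a model that always predicts the majority class. The sketch below rebuilds the test labels from the row sums of the matrix (7522 and 1994) and, purely for illustration, fits scikit-learn's `DummyClassifier` on those same labels; in practice it would be fitted on the training split.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Test labels reconstructed from the confusion matrix row sums.
y_true_demo = np.array([1] * 7522 + [2] * 1994)
X_placeholder = np.zeros((len(y_true_demo), 1))  # features are ignored by the dummy model

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_true_demo)
baseline_accuracy = accuracy_score(y_true_demo, baseline.predict(X_placeholder))
print(baseline_accuracy)
```

The baseline lands at about 0.79, which is the same accuracy the gradient boosting model reports.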


In [60]:
print(' the F1 score is : ')
f1_score(y_test, y_preds, average='weighted')


 the F1 score is : 


0.6981479682599265

In [61]:
model = LogisticRegression()
model.fit(X_train, y_train)
y_preds = model.predict(X_test)

print('Accuracy:', accuracy_score(y_test, y_preds))

Accuracy: 0.7904581757040774


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
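The warning above suggests scaling the data or raising `max_iter`. One way to do both is a pipeline with `StandardScaler`. The sketch below uses synthetic data and an arbitrary `max_iter=1000`; it illustrates the mechanics and is not a tuned configuration for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with one feature on a much larger scale.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
X_demo[:, 0] *= 1000

# Scaling inside the pipeline means the scaler is fitted only on the data passed to fit().
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_demo, y_demo)
predictions = clf.predict(X_demo)
print(predictions[:10])
```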


In [62]:
cm_LR = metrics.confusion_matrix(y_test, y_preds)
print('Confusion Matrix for Logistic regression  : \n')
print(cm_LR)

Confusion Matrix for Logistic regression  : 

[[7522    0]
 [1994    0]]


In [63]:
print(' the F1 score is : ')
f1_score(y_test, y_preds, average='weighted')


 the F1 score is : 


0.6979488669616234

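The aim is a model that makes reasonable decisions rather than one that always predicts the same class. Note that the logistic regression confusion matrix above has an empty second column: every test sample is predicted as class 1 (showed up), so its 0.79 accuracy is simply the share of class 1 in the test set (7522 of 9516). With that caveat in mind, I will be picking this model as my final model.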
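The introduction mentions exporting the chosen model to a PKL file. A minimal sketch of that step with the standard `pickle` module follows; the model, the synthetic data, and the file name `model.pkl` are placeholders, not the notebook's actual artifacts.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder model trained on synthetic data.
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
final_model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")  # hypothetical file name
    with open(path, "wb") as fh:
        pickle.dump(final_model, fh)           # serialize the fitted model
    with open(path, "rb") as fh:
        restored = pickle.load(fh)             # what the Flask app would do at startup

same_predictions = bool((restored.predict(X_demo) == final_model.predict(X_demo)).all())
print(same_predictions)
```

A pickled model should be loaded with the same scikit-learn version that saved it, and pickle files should only be loaded from trusted sources.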
