
# Project: Investigate a Dataset - No-show appointments

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

### Dataset Description 

This dataset represents the data collected from over 100 thousand medical appointments and it includes several parameters:
- Unique identifiers: "Patient ID" and "Appointment ID"
- Dates and times of scheduling and the actual appointment: "Scheduled Day" and "Appointment Day"
- Demographic data: "Age", "Gender", and "Neighbourhood"
- Medical condition parameters: "Hypertension", "Diabetes", "Alcoholism", and "Handicap"
- Additional data: "Scholarship", which flags the patients who are enrolled in the Brazilian welfare program Bolsa Família, and "SMS Received", which shows whether the patient received a confirmation message for the medical appointment
- The key value: "No Show", which records whether the patient missed the medical appointment ("Yes" means the patient did not show up)

We'll examine some of these parameters in relation to the key value to determine whether any of them is associated with missed medical appointments. Such an analysis could help predict whether a patient will show up for a future medical appointment.



### Question(s) for Analysis

- Are the patient's demographics related to him/her not showing up for the medical appointment?
- Is it possible for the patient's medical condition to prevent him/her from showing up?
- Can having a scholarship or sending a confirmation SMS be important factors to ensure that the patient shows up?
- Will we be able to predict the probability of the patient not showing up based on the available data in this dataset?


In [None]:
# importing the packages that will be used in this analysis

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline


In [None]:
# Upgrade pandas to use dataframe.explode() function. 
# no need to use the explode() function for this dataset as there are no list-like entries in the cells
#!pip install --upgrade pandas==0.25.0

# Upgrading seaborn to use the histplot function (added in seaborn 0.11)
!pip install -U seaborn


<a id='wrangling'></a>
## Data Wrangling


### General Properties


In [None]:
# Reading the dataset and loading it in a dataframe object
df = pd.read_csv('noshowappointments-kagglev2-may-2016.csv')
# displaying the first 5 rows to validate that the dataset was imported correctly
df.head()

In [None]:
# Displaying columns and datatypes, and number of entries for each column
df.info()

In [None]:
# finding duplicated entries
print(df.duplicated().sum())
# no duplicated rows

In [None]:
# Displaying statistical data
df.describe()

In [None]:
# displaying the number of unique values for each column
df.nunique()

In [None]:
# listing the values of the Handcap column and the counts for each value
df['Handcap'].value_counts()

#### The initial investigation of the dataset revealed the following:
- The dataset has 110527 rows and 14 columns
- There are no missing data (every column has 110527 non-null entries)
- ScheduledDay and AppointmentDay are stored as strings (object datatype) and should be converted to datetime format
- Scholarship, Hipertension, Diabetes, Alcoholism, and SMS_received are all entered as 0 for No and 1 for Yes, so they can be converted to the Boolean datatype
- No-show is a string column whose only entries are Yes and No, so it should also be converted to the Boolean True/False datatype
- Handcap has 5 unique values, possibly the degree of the handicap (ranging from 0 for no handicap to 4 for the highest degree of handicap/disability)
- There are no duplicated rows in the dataset
- The Age column contains suspicious values that need checking: the minimum is -1, which is invalid, and the maximum is 115, which is unusually high
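
The Yes/No to Boolean conversion mentioned above has a pitfall worth illustrating. The sketch below uses a small made-up Series (not rows from the dataset) to show why a direct `astype('bool')` is not suitable for a string column and how an explicit mapping behaves instead:

```python
import pandas as pd

# Toy stand-in for the No-show column (illustrative values, object dtype as read from a CSV)
no_show = pd.Series(['No', 'Yes', 'No', 'No'], dtype=object)

# astype('bool') treats every non-empty string as True, so the Yes/No information is lost
naive = no_show.astype('bool')

# An explicit mapping keeps the meaning: 'Yes' means the patient missed the appointment
mapped = no_show.map({'No': False, 'Yes': True})

print(naive.tolist())
print(mapped.tolist())
```

Comparing `value_counts()` before and after the conversion, as done later in this notebook, is a simple way to confirm that no entries changed meaning.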

#### Further analysis that can be performed:
- PatientId should be investigated to check for repetitive patients and the likelihood of them not showing up for their appointment
- ScheduledDay column (unlike the AppointmentDay column) has a timestamp. Since there are 110527 unique Appointment IDs and only 103549 unique entries in the ScheduledDay, the 6978 difference can be investigated to check for incorrect data entry.
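
As a rough sketch of how these two follow-up checks could be started, the example below works on a tiny hypothetical table (the values are invented for illustration; only the column names follow the dataset). It counts appointments and missed appointments per patient, and lists the rows that share a scheduling timestamp with another row:

```python
import pandas as pd

# Hypothetical miniature of the appointments table (values are made up)
toy = pd.DataFrame({
    'PatientId': [1, 1, 1, 2, 3, 3],
    'ScheduledDay': pd.to_datetime([
        '2016-04-29 08:00:00', '2016-04-29 08:00:00', '2016-05-02 09:30:00',
        '2016-05-03 10:15:00', '2016-05-03 11:00:00', '2016-05-04 07:45:00',
    ]),
    'No-show': ['Yes', 'No', 'Yes', 'No', 'No', 'No'],
})

# Appointments and missed appointments per patient
per_patient = (
    toy.assign(missed=toy['No-show'].eq('Yes'))
       .groupby('PatientId')['missed']
       .agg(appointments='count', missed='sum')
)
repeat_patients = per_patient[per_patient['appointments'] > 1]

# Rows whose scheduling timestamp is shared with at least one other row
shared_timestamp = toy[toy.duplicated('ScheduledDay', keep=False)]

print(repeat_patients)
print(shared_timestamp)
```

A shared timestamp is not necessarily a data-entry error; it may simply mean that several appointments were booked in one session, so the flagged rows would still need to be inspected.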


### Data Cleaning


In [None]:
# Converting the datatypes of the ScheduledDay and AppointmentDay columns to datetime format
# and verifying the successful conversion

df['ScheduledDay'] = pd.to_datetime(df['ScheduledDay'])
df['AppointmentDay'] = pd.to_datetime(df['AppointmentDay'])
df.info()

In [None]:
# Displaying the first 5 rows to emphasize the successful conversion
df.head()

In [None]:
# printing the value counts of each of the binary fields before converting to boolean datatype
print(df['Scholarship'].value_counts())
print(df['Hipertension'].value_counts())
print(df['Diabetes'].value_counts())
print(df['Alcoholism'].value_counts())
print(df['SMS_received'].value_counts())

In [None]:
# Converting the Scholarship, Hipertension, Diabetes, Alcoholism, and SMS_received binary integer values to boolean
# then printing the value counts after conversion to verify it is the same value counts as listed above
df['Scholarship'] = df['Scholarship'].astype('bool')
df['Hipertension'] = df['Hipertension'].astype('bool')
df['Diabetes'] = df['Diabetes'].astype('bool')
df['Alcoholism'] = df['Alcoholism'].astype('bool')
df['SMS_received'] = df['SMS_received'].astype('bool')
print(df['Scholarship'].value_counts())
print(df['Hipertension'].value_counts())
print(df['Diabetes'].value_counts())
print(df['Alcoholism'].value_counts())
print(df['SMS_received'].value_counts())

# Note: astype('bool') must not be applied directly to the No-show column, because every
# non-empty string ('Yes' as well as 'No') converts to True and the original values are lost.
# The Yes/No values are converted explicitly in a later cell instead.
# df['No-show'] = df['No-show'].astype('bool')


In [None]:
# Renaming the No-show column
df.rename(columns = {'No-show':'noshow'}, inplace = True)

In [None]:
# displaying the No-show value counts before conversion
df['noshow'].value_counts()

In [None]:
# https://stackoverflow.com/questions/17702272/convert-pandas-series-containing-string-to-boolean
# converting the noshow Yes/No string values to True/False boolean values, then verifying the conversion
# by checking the value counts after conversion against the value counts before conversion above
# (the result is assigned back to the column, because an inplace replace on a column selection
# may not update the dataframe in recent pandas versions)
df['noshow'] = df['noshow'].replace({'No': False, 'Yes': True}).astype('bool')
df['noshow'].value_counts()

In [None]:
# verifying datatypes after conversion
df.info()

In [None]:
# checking the data for patients over 99 years old
df.loc[df['Age'] > 99]

The records for patients over 99 years old look plausible: these patients showed up for their medical appointments, except for a 115-year-old female patient who scheduled 4 appointments and showed up only once.
There is no data to drop here, as these entries should be included in the analysis.

In [None]:
# checking data for patients with negative age
df.loc[df['Age'] < 0]

In [None]:
# Dropping the row(s) with a negative age, selected by condition rather than by a hard-coded index label
df.drop(df[df['Age'] < 0].index, inplace=True)

Now that the dataset is cleaned, we proceed to the next section of the data analysis process.

<a id='eda'></a>
## Exploratory Data Analysis



### Are the patient's demographics related to him/her not showing up for the medical appointment?

Evaluating the Age, Gender, and Neighbourhood parameters and their relationship to No Show.

In [None]:
df.groupby('Neighbourhood')['noshow'].count()

In [None]:
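# Summary statistics of the number of no shows per neighbourhood
# (the column is selected before summing so that non-numeric columns are not summed)
df.groupby('Neighbourhood')['noshow'].sum().describe()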

Now that we know the 81 neighbourhoods in the dataset and the number of appointments in each of them, we'll examine how many no-show appointments came from each neighbourhood.
For that purpose, we'll subset the original dataset to select the no-show appointments (No Show = True) and then express them as a proportion of all appointments in each neighbourhood.

In [None]:
# Filtering the original dataframe to keep only the appointments that were missed
# and assigning the result to another dataframe labelled 'noshows'
# .copy() makes 'noshows' an independent dataframe, so adding columns to it later
# does not raise a SettingWithCopyWarning
noshows = df.loc[df['noshow'] == True].copy()
noshows.groupby('Neighbourhood')['noshow'].count().plot(kind='bar', title='Number of No Shows per Neighbourhood', figsize=(20,20), alpha=.7);

In [None]:
# https://seaborn.pydata.org/generated/seaborn.histplot.html
sns.set(rc={'figure.figsize':(20,20)})
sns.histplot(data=df, y='Neighbourhood', bins=10, hue='noshow')
plt.title("Number of appointments per Neighbourhood by No Show status")
plt.show()

In [None]:
# Calculating the number of patients from each neighbourhood not showing up
# in proportion to the total number of patients in that neighbourhood
totals = df.groupby('Neighbourhood')['noshow'].count()
counts = noshows.groupby('Neighbourhood')['noshow'].count()
proportions = counts / totals
proportions[proportions > 0.25]

Patients from the above 5 neighbourhoods are the most likely to miss their appointments.
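
A proportion computed from only a handful of appointments can be misleading, so it may be worth checking how many appointments stand behind each neighbourhood's figure. The sketch below is one possible way to do that on made-up data; the neighbourhood labels, the rows, and the `MIN_APPOINTMENTS` cut-off are all assumptions for illustration:

```python
import pandas as pd

# Made-up rows; 'noshow' is True when the appointment was missed
toy = pd.DataFrame({
    'Neighbourhood': ['A'] * 8 + ['B'] * 6 + ['C'] * 2,
    'noshow': [True, False, False, False, False, False, True, False,  # A: 2 of 8
               True, True, False, False, True, False,                  # B: 3 of 6
               True, True],                                            # C: 2 of 2
})

# The mean of a boolean column is the share of True values,
# so a single groupby yields both the sample size and the no-show rate
summary = toy.groupby('Neighbourhood')['noshow'].agg(appointments='count', noshow_rate='mean')

# Hypothetical cut-off: ignore neighbourhoods with too few appointments for a stable rate
MIN_APPOINTMENTS = 5
reliable = summary[summary['appointments'] >= MIN_APPOINTMENTS]
high = reliable[reliable['noshow_rate'] > 0.25]

print(summary)
print(high)
```

In this toy table, neighbourhood C has the highest rate but only two appointments, so it is filtered out before the 25% threshold is applied.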

Now for the age analysis, we'll split the no-show records into 4 age groups to check whether a certain age group is more likely to miss medical appointments.

In [None]:
# Statistical Age data in the noshows dataframe
noshows['Age'].describe()

In [None]:
# Creating age groups based on the above age statistics
# to examine the relationship between the age group and no show
# assign() returns a new dataframe instead of writing a column into a filtered selection,
# which is what triggered the SettingWithCopyWarning
# note: pd.cut uses right-closed intervals by default, so rows with Age == 0
# fall outside the first bin and get NaN as their age group
bin_edges = [0.0, 16.0, 33.0, 51.0, 115.0]
bin_names = ['Child', 'Youth', 'Adult', 'Senior']
noshows = noshows.assign(Age_Group=pd.cut(noshows['Age'], bin_edges, labels=bin_names))

In [None]:
# Value counts for the age groups in the noshows dataframe
noshows['Age_Group'].value_counts()

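From the above counts, the Youth group (ages 17 to 33) accounts for the most missed appointments, while the Child group (ages up to 16) accounts for slightly fewer than the other groups.
Note that these are counts of missed appointments, not rates: to say which age group is most likely to miss an appointment, each count has to be divided by the total number of appointments in that group.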
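
One way to turn the age groups into no-show rates is to bin the full set of appointments rather than only the missed ones, and then average the boolean flag per group. The following is a minimal sketch on invented data; it reuses the bin edges and labels from above, and `include_lowest=True` is an added assumption so that age 0 lands in the first group:

```python
import pandas as pd

# Invented ages and outcomes; 'noshow' is True when the appointment was missed
toy = pd.DataFrame({
    'Age': [0, 5, 16, 17, 25, 33, 40, 51, 60, 115],
    'noshow': [False, True, False, True, True, False, False, True, False, False],
})

bin_edges = [0, 16, 33, 51, 115]
bin_names = ['Child', 'Youth', 'Adult', 'Senior']

# include_lowest=True keeps Age == 0 in the first bin; without it those rows become NaN
toy['Age_Group'] = pd.cut(toy['Age'], bin_edges, labels=bin_names, include_lowest=True)

# The mean of the boolean flag is the no-show rate of each group;
# observed=False lists every labelled group, even one that happens to be empty
rate_by_group = toy.groupby('Age_Group', observed=False)['noshow'].mean()
print(rate_by_group)
```

Because each rate is divided by the size of its own group, the result does not depend on how many appointments each age group happens to contain.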

In [None]:
# plotting the ratio of no shows for each gender
sns.set(rc={'figure.figsize':(5,5)})
sns.histplot(data=df, x='Gender', bins=10, hue='noshow')
plt.title("Number of appointments per Gender by No Show status")
plt.show()

In [None]:
# calculating the percentage of no shows for each gender in proportion to the total numbers for the gender
gender_percent = noshows['Gender'].value_counts() / df['Gender'].value_counts()
print(gender_percent)

As seen from the above plot and calculation, the patient's gender has a negligible effect on the no-show status.

### Is it possible for the patient's medical condition to prevent him/her from showing up?

In [None]:
# Applying the same proportion calculation for each of the medical parameters
htn_percent = noshows['Hipertension'].value_counts() / df['Hipertension'].value_counts()
dbt_percent = noshows['Diabetes'].value_counts() / df['Diabetes'].value_counts()
alc_percent = noshows['Alcoholism'].value_counts() / df['Alcoholism'].value_counts()
hcp_percent = noshows['Handcap'].value_counts() / df['Handcap'].value_counts()
print("Hypertension compared to No Show\n{}".format(htn_percent))
print("\nDiabetes compared to No Show\n{}".format(dbt_percent))
print("\nAlcoholism compared to No Show\n{}".format(alc_percent))
print("\nDegree of disability compared to No Show\n{}".format(hcp_percent))

From the above figures, we conclude that the patient's medical condition in terms of hypertension, diabetes, and alcoholism has a negligible effect on showing up for the medical appointment.
However, it appears that the higher the patient's degree of disability, the more likely he/she is to miss the medical appointment.

### Can having a scholarship or sending a confirmation SMS be important factors to ensure that the patient shows up?

In [None]:
# Applying the same proportion calculation for the Scholarship program and SMS received
scholar_percent = noshows['Scholarship'].value_counts() / df['Scholarship'].value_counts()
sms_percent = noshows['SMS_received'].value_counts() / df['SMS_received'].value_counts()
print("Scholarship compared to No Show\n{}".format(scholar_percent))
print("\nSMS received compared to No Show\n{}".format(sms_percent))

Examining the above figures reveals unexpected results. There appears to be a positive association between each of Scholarship and SMS received and the no-show status. In other words, the patients who are enrolled in the scholarship program or received the appointment confirmation SMS are more likely to miss their medical appointments.
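
The direction of such an association can be read from a row-normalised cross-tabulation. The sketch below uses a small invented table (the counts are not from the dataset) and compares the no-show rate of the group that received an SMS with the rate of the group that did not:

```python
import pandas as pd

# Invented flags; 'noshow' is True when the appointment was missed
toy = pd.DataFrame({
    'SMS_received': [True, True, True, True, False, False, False, False, False, False],
    'noshow':       [True, True, False, False, True, False, False, False, False, False],
})

# normalize='index' turns each row into proportions that sum to 1
rates = pd.crosstab(toy['SMS_received'], toy['noshow'], normalize='index')

# No-show rate with and without an SMS, computed with boolean masks
rate_sms = toy.loc[toy['SMS_received'], 'noshow'].mean()
rate_no_sms = toy.loc[~toy['SMS_received'], 'noshow'].mean()

# A positive gap means the SMS group missed more often
# (this describes an association, not a cause)
rate_gap = rate_sms - rate_no_sms

print(rates)
print(rate_gap)
```

A gap like this only describes how the two groups differ; other factors that differ between the groups could explain it, so it should not be read as the effect of the SMS itself.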

<a id='conclusions'></a>
## Conclusions

After exploring the patients' demographic data, medical conditions, and additional parameters, we concluded the following:
- Patients from 5 neighbourhoods (Ilhas Oceânicas De Trindade, Itararé, Santa Cecília, Santa Clara, Santos Dumont) are more likely to miss their medical appointments than patients from other neighbourhoods
- Patients with a high degree of disability are more likely to miss their medical appointments than patients with no disability or a low degree of disability
- The patient's gender, age, and medical condition cannot be used as indicators for predicting whether he/she shows up for the medical appointment
- Patients who received a confirmation SMS for their medical appointment are more likely to miss it, and patients enrolled in the scholarship program are more likely to miss their medical appointments than patients who are not enrolled
- Further analysis can be performed on Patient ID to identify patients who repeatedly miss their medical appointments. The gap between the scheduling date and the appointment date can also be explored to determine whether a longer gap makes the patient more likely to forget the appointment and not show up
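
The date-gap analysis suggested in the last point could start from something like the sketch below. The rows, the `waiting_days` and `wait_bucket` column names, and the bucket boundaries are hypothetical choices made for illustration:

```python
import pandas as pd

# Invented appointments; 'noshow' is True when the appointment was missed
toy = pd.DataFrame({
    'ScheduledDay': pd.to_datetime(['2016-04-29 18:38:08', '2016-04-27 08:36:51',
                                    '2016-05-10 07:51:42', '2016-04-01 14:01:00']),
    'AppointmentDay': pd.to_datetime(['2016-04-29', '2016-05-04',
                                      '2016-05-10', '2016-05-20']),
    'noshow': [False, True, False, True],
})

# AppointmentDay carries no time of day, so only the calendar dates are compared
toy['waiting_days'] = (
    toy['AppointmentDay'].dt.normalize() - toy['ScheduledDay'].dt.normalize()
).dt.days

# Hypothetical buckets for the gap: same day, within a week, longer
toy['wait_bucket'] = pd.cut(toy['waiting_days'], [-1, 0, 7, 365],
                            labels=['same day', '1-7 days', '8+ days'])

rate_by_wait = toy.groupby('wait_bucket', observed=False)['noshow'].mean()
print(toy[['waiting_days', 'wait_bucket']])
print(rate_by_wait)
```

On the real data, any negative `waiting_days` values would point to appointments recorded as scheduled after they took place and would need to be checked before bucketing.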

### Limitations

Most of the parameters or patient characteristics collected in this dataset are not viable indicators for predicting whether a patient will show up for a future medical appointment.
In my opinion, more data needs to be collected, such as the urgency of the appointment based on the patient's condition, the medical specialty of the appointment, or information on seasonal diseases.


## Submitting your Project 


In [None]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])