# Project: Investigate a Dataset - No Show Appointments

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

### Dataset Description 
This dataset is provided by Kaggle.com under the title "Medical Appointment No Shows: Why do 30% of patients miss their scheduled appointments?". It collects information from about 100k medical appointments in Brazil and focuses on the question of whether or not patients show up for their appointment. Each row includes a number of characteristics about the patient. Columns in this dataset include:
<ul>
    <li>The patient and appointment ID</li>
    <li>The gender of the patient</li>
    <li>The age of the patient</li>
    <li>The time for scheduling the appointment as well as the appointment time itself</li> 
    <li>The hospital neighbourhood</li>
    <li>Whether they are part of a Brazilian welfare program called Bolsa Família</li>
    <li>The different ailments the patient might suffer from, which include:
        <ul>
        <li>Diabetes</li>
        <li>Hypertension</li>
        <li>Alcoholism</li>
        <li>Handicap</li>
        </ul>
    </li>
    <li>If they got an SMS to remind them of the appointment</li>
    <li>And of course, finally, whether they showed up at the appointment or not</li>
</ul>
Some interesting questions can be posed about this dataset, and those will be discussed in the following section.
    

### Question(s) for Analysis
Before questions can be asked, one must identify the independent variables and the dependent ones. The dependent variable for this set is the "show/no-show" data, because that is what is being questioned and it is the variable that may change depending on the other variables. The independent variables are every other variable in the data set.  
With that done, it is now possible to ask questions about why a patient might not show up to their appointment.  
  
Some questions that can be asked are:
<ul>
    <li>Does gender affect whether or not the patient shows up?</li>
    <li>Does age affect if they apply for welfare?</li>
    <li>Does a person with a handicap have a higher chance of missing their appointment?</li>
    <li>Will a person reminded by SMS still miss their appointment?</li>
</ul>

Now that the questions are defined, the data can be imported and cleaned according to the variables needed.

In [4]:
# Use this cell to set up import statements for all of the packages that you
#   plan to use.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Remember to include a 'magic word' so that your visualizations are plotted
#   inline with the notebook. See this page for more:
#   http://ipython.readthedocs.io/en/stable/interactive/magics.html


AttributeError: module 'matplotlib' has no attribute 'rcParams'

In [None]:
# Upgrade pandas to use dataframe.explode() function. 
#!pip install --upgrade pandas

<a id='wrangling'></a>
## Data Wrangling


In [None]:
# Load your data and print out a few lines. Perform operations to inspect data
#   types and look for instances of missing or possibly errant data.
# pd.options.display.max_rows = 1000000
df = pd.read_csv('Database_No_show_appointments/noshowappointments-kagglev2-may-2016.csv')

In [None]:
df.shape

In [None]:
df.describe()

<a></a>
> After some further searching, I found that the errors "TypeError: \_\_init\_\_() got an unexpected keyword argument 'max_rows'" and "ImportError: cannot import name 'is_url'" seemed to appear because the dataset has a very large number of rows (110527 according to `df.shape`), which caused a problem with pandas. I tried to fix this with the commented-out line above the `df` definition (taken from <a href="https://stackoverflow.com/questions/71369078/init-got-an-unexpected-keyword-argument-max-rows">this</a> Stack Overflow page), but it made little difference, so I commented it out again. The error behaves more like a warning than an error: it does not stop the code from running and functioning as it should, so it can be ignored.

The above-mentioned error was resolved after a kernel restart and a pandas upgrade beyond the given 0.23.0 version. I have left the above block in case the error returns when you review the notebook and you would like an explanation of what was happening.

In [None]:
df.info()

In [None]:
df.duplicated().sum()

In [None]:
df.drop(['PatientId','AppointmentID','Neighbourhood','ScheduledDay','AppointmentDay'],axis=1,inplace=True)


### Data Cleaning
We did a quick analysis of the dataset to check for missing values, duplicated rows and wrong data types. Fortunately, the data set has no missing values or duplicated rows, and all the variables we will work with have the correct data type. We then dropped the columns that our questions do not need. These include the patient and appointment IDs, which only identify the individual patient and appointment and do not help in answering our questions. The appointment and scheduled dates are not needed either, because our scope does not depend on the dates or on the duration between scheduling and the appointment. Finally, we dropped the hospital neighbourhood column, because where the hospital is situated is outside the scope of this analysis. Next, we will rename the columns to lower case with underscores, where appropriate, to make the naming consistent and easy to use.
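The three checks described above can be sketched on a small stand-in frame. The `toy` DataFrame and its values are made up for illustration and are not rows from the real CSV; only the pandas calls mirror what the notebook does.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real CSV (values are made up).
toy = pd.DataFrame({
    "Gender": ["F", "M", "F", "F"],
    "Age": [23, 45, 23, 61],
    "No-show": ["No", "Yes", "No", "No"],
})

missing_per_column = toy.isnull().sum()      # number of missing values in each column
n_duplicates = int(toy.duplicated().sum())   # rows identical to an earlier row
column_types = toy.dtypes                    # data type pandas inferred for each column

print(missing_per_column)
print(n_duplicates)
print(column_types)
```

Note that the duplicate check is only meaningful before identifier columns are dropped; once IDs are removed, many distinct appointments can look identical.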
 

In [None]:
# After discussing the structure of the data and any problems that need to be
#   cleaned, perform those cleaning steps in the second part of this section.
df.rename(columns=lambda x: x.lower().strip().replace('-','_'),inplace=True)

In [None]:
# The handicap column had a typo and needed to be renamed from "handcap" to "handicap"
df.rename({'handcap':'handicap'},axis=1,inplace=True)

In [None]:
df.head()

<a id='eda'></a>
## Exploratory Data Analysis

In this section, the actual Exploratory Data Analysis (EDA) will be carried out. Some questions are posed below and they will be assessed using some statistics, mainly using the describe and groupby methods to reach some conclusions. Some graphs for visualising the data will also be used, which might include histograms, bar charts and scatterplots.



In [None]:
# Firstly, I will look at a general histogram of the whole dataframe to see how the results are distributed, and
# look at the distribution numerically using .describe()
df.describe()

In [None]:
df.hist(figsize=(15,15));

The minimum age is -1 and the maximum age is 115. Both may look like erroneous data points, but only the -1 entry will be dropped, because a negative age is impossible and would distort the statistics. There are several entries above 100, so we assume that these ages, while rather unusual, are not erroneous outliers.

Looking at the rest of the histograms, we can see that most people were not on welfare and did not have handicaps or other ailments (in fact more than 75%, as shown by the describe method).

We can also see that almost 50% of the patients received an SMS reminder to show up.

The age of patients seems skewed to the right, i.e. most of the data sits on the left side, with a longer tail towards the older ages.

Next, the questions posed will be answered.
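As a small illustration of what right skew means numerically, the sketch below uses a made-up list of ages (not taken from the dataset). In a right-skewed sample the long tail of large values typically pulls the mean above the median, and `Series.skew()` returns a positive number.

```python
import pandas as pd

# Made-up ages with a long tail towards high values (illustrative only).
ages = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 60, 90])

mean_age = ages.mean()      # pulled upwards by the two large values
median_age = ages.median()  # stays with the bulk of the data
skewness = ages.skew()      # positive for a right-skewed sample

print(mean_age, median_age, skewness)
```

Running the same three calls on `df.age` would be one way to check the visual impression from the histogram.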

In [None]:
# Drop every row with the impossible age of -1 (not just the first match)
df.drop(index=df.query('age == -1').index, inplace=True)

### Does Gender affect whether or not the patient shows up?

In [None]:
# to find out what genders we have:
df.gender.unique()

In [None]:
#split in 2 dfs for ease of access
male = df.query('gender =="M"')
female = df.query('gender == "F"')
male,female

In [None]:
male_tot = male.shape[0]
female_tot = female.shape[0]
plt.bar([1,2],[male_tot,female_tot],tick_label=['Males','Females'])
plt.title('Gender split in patients'.title())
plt.xlabel('Gender')
plt.ylabel('Number');

In [None]:
split = df.groupby(['gender','no_show']).count()['age']    # Any arbitrary column chosen to show counts
split

In [None]:
# create a proportions function to avoid repeating code multiple times
def get_prop(var,counts,tots):
    prop = counts[var] / tots
    return prop

In [None]:
male_prop = get_prop('M',split,male_tot)
female_prop = get_prop('F',split,female_tot)
male_prop,female_prop

In [None]:
def draw_bar(prop1,prop2,label1,label2,var_name):
    bars1 = plt.bar([1,3], prop1, 0.5, color='b', alpha=.7, label='{}'.format(label1))
    bars2 = plt.bar([1.5,3.5], prop2, 0.5, color='r', alpha=.7, label='{}'.format(label2))

    # title and labels
    plt.ylabel('Proportion of {}'.format(var_name))
    plt.xlabel('Did They Show?')
    plt.title('Proportion by {} and No Show'.format(var_name))
    locations = [1.25 , 3.25]  # xtick locations
    labels = ['Yes', 'No']  # xtick labels (switched Yes and No for clarity: now 'Yes' means they did show up and vice versa)
    plt.xticks(locations, labels)

    # legend
    plt.legend();

In [None]:
# plot bars
draw_bar(male_prop,female_prop,'Males','Females','Gender')

From the above plot we can see that the percentage of males showing up to their appointments was higher than that of females, and the percentage missing their appointments was higher for females than for males.

This suggests that gender is associated with whether the patient shows up or not, with females more likely to miss their appointment. The comparison is descriptive only and does not by itself show that gender causes the difference.
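The same row-wise proportions can also be obtained in one step with `pd.crosstab` and `normalize="index"`. This is an alternative to the `get_prop` helper, not the method used in this notebook, and the `toy` frame below is made up for illustration. As in the dataset, a `no_show` value of "No" means the patient did show up.

```python
import pandas as pd

# Made-up sample: 4 female and 2 male appointments (illustrative only).
toy = pd.DataFrame({
    "gender": ["F", "F", "F", "F", "M", "M"],
    "no_show": ["No", "No", "No", "Yes", "No", "Yes"],
})

# Each row is divided by its own total, so every gender's row sums to 1.
rates = pd.crosstab(toy["gender"], toy["no_show"], normalize="index")
print(rates)
```

Because each row is normalised by its own group size, groups of very different sizes can be compared directly.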

### Does age affect if they apply for welfare?

In [None]:
df.describe()

In [None]:
yes_schol = df.groupby('age')['scholarship'].sum()
yes_schol

In [None]:
yes_schol.plot(kind='bar',width=1, color='b', figsize=(20,20))
plt.title('Amount of people signed up for welfare based on age'.title())
plt.ylabel('Amount of People');

In [None]:
# Alternative method to create histogram, but less clean so didn't use it:

#df.query('scholarship == 1').age.plot.hist(by="scholarship", bins=100, figsize=(10, 8)); 

In the bar chart above (used here as a histogram), most welfare patients are middle-aged adults, with the peak at age 38. There are other local peaks at 16 and 5. These might be children enrolled in the welfare program along with their parents, since <a href='https://en.wikipedia.org/wiki/Bolsa_Fam%C3%ADlia#:~:text=In%20order%20to%20be%20eligible%2C%20families%20had%20to%20ensure%20that%20children%20attend%20school'>Bolsa Familia</a> was only awarded to poor families with children enrolled in school (the relevant text is highlighted in the linked page). It then makes sense that more adults (parents) are enrolled than children or seniors, but more children than seniors, because children are the main factor in obtaining the family welfare.
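A count per age depends on how many patients of each age are in the data, so a possible complement is the enrolment rate per age. The sketch below is an assumption about how that could be done, using a made-up `toy` frame: `sum()` of the 0/1 `scholarship` column gives the count, while `mean()` gives the share of patients of that age who are enrolled.

```python
import pandas as pd

# Made-up sample (illustrative only): age 38 has more patients than age 5.
toy = pd.DataFrame({
    "age": [5, 5, 38, 38, 38, 38, 70],
    "scholarship": [1, 0, 1, 0, 0, 0, 0],
})

enrolled_count = toy.groupby("age")["scholarship"].sum()   # number enrolled per age
enrolled_rate = toy.groupby("age")["scholarship"].mean()   # share enrolled per age

print(enrolled_count)
print(enrolled_rate)
```

In this toy sample ages 5 and 38 have the same count, yet the rate at age 5 is twice as high, which shows why the two views can tell different stories.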

### Does a person with a handicap have a higher chance of missing their appointment?

In [None]:
df.handicap.unique()

That is unexpected. The column should only contain 0 and 1 (False or True), so further exploration is required.
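One way to explore such a column is to count how often each value occurs and then collapse it to a binary flag. The sketch below uses a made-up series and assumes that any value above 0 means the patient has at least one handicap; that interpretation is an assumption, not something confirmed in this notebook.

```python
import pandas as pd

# Made-up handicap values, including unexpected levels above 1 (illustrative only).
handicap = pd.Series([0, 0, 1, 2, 0, 3, 1, 4])

level_counts = handicap.value_counts().sort_index()  # how many rows per level
is_handicapped = (handicap > 0).astype(int)          # assumption: >0 means "has a handicap"

print(level_counts)
print(is_handicapped.tolist())
```

Comparing with `> 0` avoids having to list every unexpected level by hand.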

In [None]:
handi_error = df.query('handicap > 1')
handi_error

In [None]:
# Will turn non zero values larger than 1 into a 1 to signify True
df['handicap'] = df['handicap'].replace([2,3,4], 1)

In [None]:
df.handicap.unique()

In [None]:
handi = df.query('handicap == 1')
non = df.query('handicap == 0')
handi,non

In [None]:
handi_tot = handi.shape[0]
non_tot = non.shape[0]
plt.bar([1,2],[handi_tot,non_tot],tick_label=['Handicapped','Non-Handicapped'])
plt.title('Handicapped vs Able Patients'.title())
plt.xlabel('Handicapped/Abled')
plt.ylabel('Number');

Because the two groups differ so much in size, a comparison of proportions is more informative than a comparison of raw counts.

In [None]:
counts = df.groupby(['handicap','no_show']).count()['age']    # Any arbitrary column chosen to show counts
counts

In [None]:
#get proportions handicapped and able
handi_prop = get_prop(1,counts,handi_tot)
non_prop = get_prop(0,counts,non_tot)
handi_prop,non_prop

In [None]:
#plot bars
draw_bar(handi_prop,non_prop,'Handicapped','Abled','Handicapped')

From the comparative bar chart above, it is clear that the answer to the question: "Does a person with a handicap have a higher chance of missing their appointment?" is No. From the graphs, it is evident that the percentage of patients missing their appointment is higher for non-handicapped patients.

### Will a person reminded by SMS still miss their appointment

In [None]:
# to find out what values sms_received takes:
df.sms_received.unique()

In [None]:
rec = df.query('sms_received == 1')
no_rec = df.query('sms_received == 0')
rec,no_rec

In [None]:
rec_tot = rec.shape[0]
no_rec_tot = no_rec.shape[0]
plt.bar([1,2],[rec_tot,no_rec_tot],tick_label=['Received SMS','Didn\'t Receive SMS'])
plt.title('SMS ' + 'reception in patients'.title())
plt.xlabel('SMS Received?')
plt.ylabel('Number');

In [None]:
SMS_count = df.groupby(['sms_received','no_show']).count()['age']    # Any arbitrary column chosen to show counts
SMS_count

In [None]:
#get proportions received and not
rec_prop = get_prop(1,SMS_count,rec_tot)
no_rec_prop = get_prop(0,SMS_count,no_rec_tot)
rec_prop,no_rec_prop

In [None]:
# plot bars
draw_bar(rec_prop,no_rec_prop,'Received','Didn\'t Receive','SMS Reception')

Surprisingly, the percentage of people who received an SMS reminder and showed up was lower than the percentage of those who did not receive an SMS at all. The bar chart shows that about 70% of those who received an SMS showed up, while more than 80% of those who didn't receive an SMS showed up!

<a id='conclusions'></a>
## Conclusions

The Questions asked at the beginning of the report were:
<ul>
    <li>Does gender affect whether or not the patient shows up?</li>
    <li>Does age affect if they apply for welfare?</li>
    <li>Does a person with a handicap have a higher chance of missing their appointment?</li>
    <li>Will a person reminded by SMS still miss their appointment?</li>
</ul>

They have been analysed and answered above, but to summarise, the final conclusion for each question is given here:

#### Does gender affect whether or not the patient shows up?
From the gender plot above we can see that the percentage of males showing up to their appointments was higher than that of females, and the percentage missing their appointments was higher for females than for males.

This suggests that gender is associated with whether the patient shows up or not, with females more likely to miss their appointment.

#### Does age affect if they apply for welfare?
In the histogram above, most welfare patients are middle-aged adults, with the peak at age 38. There are other local peaks at 16 and 5. These might be children enrolled in the welfare program along with their parents, since <a href='https://en.wikipedia.org/wiki/Bolsa_Fam%C3%ADlia#:~:text=In%20order%20to%20be%20eligible%2C%20families%20had%20to%20ensure%20that%20children%20attend%20school'>Bolsa Familia</a> was only awarded to poor families with children enrolled in school. It then makes sense that more adults (parents) are enrolled than children or seniors, but more children than seniors, because children are the main factor in obtaining the family welfare.

#### Does a person with a handicap have a higher chance of missing their appointment?
From the comparative bar chart above, it is clear that the answer to the question "Does a person with a handicap have a higher chance of missing their appointment?" is No. From the graphs, it is evident that the percentage of patients missing their appointment is higher for non-handicapped patients.

#### Will a person reminded by SMS still miss their appointment?
Surprisingly, the percentage of people who received an SMS reminder and showed up was lower than the percentage of those who did not receive an SMS at all. The bar chart shows that about 70% of those who received an SMS showed up, while more than 80% of those who didn't receive an SMS showed up! Thus, to answer the question: yes, a person reminded by SMS is still likely to miss their appointment.


### General Limitations and Improvements
A limitation of this analysis is the very large difference in size between the groups being compared. Most of the analysis had to be done using proportions, and a proportion computed from a small group is less stable than one computed from a large group. Comparing two percentages, especially when one describes a very small group and the other a very large one, can therefore be unfair: the smaller group's percentage can shift much more than the larger group's at the slightest change in values.

Further research could also look at the neighbourhood of the hospitals: if a neighbourhood is unsafe, could that drive patients to skip their appointments? Another direction would be the jobs and incomes of patients: is a lower-income patient more likely to attend their appointment because it was relatively expensive for them?
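The effect of group size on a percentage can be made concrete with the usual standard error of a proportion, sqrt(p(1 - p) / n). The group sizes and the 20% rate below are assumed round numbers chosen for illustration, not values measured in this notebook.

```python
import math

def proportion_standard_error(p, n):
    """Standard error of a sample proportion p estimated from n observations."""
    return math.sqrt(p * (1 - p) / n)

# Assumed numbers for illustration: the same 20% rate in a small and a large group.
small_group_se = proportion_standard_error(0.2, 100)
large_group_se = proportion_standard_error(0.2, 10000)

print(small_group_se, large_group_se)
```

With 100 times fewer observations the uncertainty is 10 times larger, so a gap of a few percentage points is far less convincing when one of the groups is small.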

In [None]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])