# Project: Predicting Medical Appointment No-Shows in Brazil

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

The No-Show dataset contains information on more than 110,000 medical appointments in Brazil (110,527 rows in the file used here). It records a variety of variables, including patient age, gender, neighborhood, and certain diagnoses. It also records whether or not the patient showed up for their scheduled appointment.

This project endeavors to explore this dataset and answer a few key questions. 

<ul>
    <li> Are age and opting to get an SMS message correlated? Does receiving an SMS message before the appointment help prevent no-shows?
    <li> Are those on welfare more likely to follow through on their appointments?
    <li> Are younger men without a diagnosis more likely to be no-shows? This question is based on a <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1121551/">2001 report in the United Kingdom</a> that showed that men tended to go to the doctor less often than women, especially for general wellness visits.
</ul>


First, I will import the required packages for this project. 

In [43]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#%matplotlib

<a id='wrangling'></a>
## Data Wrangling

First, the data from Kaggle is loaded and some basic characteristics of the dataset are explored.

### General Properties

In [44]:
df = pd.read_csv("noshowdata.csv")
df.head()

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,29872500000000.0,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,558997800000000.0,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,4262962000000.0,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,867951200000.0,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,8841186000000.0,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,1,1,0,0,0,No


In [45]:
df.shape

(110527, 14)

In [46]:
df.columns[df.isnull().any()]

Index([], dtype='object')

Based on this result, we see that there is no missing data and all columns are fully filled out. 
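
The same check can also report how many values are missing in each column, which is useful when the answer is not simply "none". The sketch below uses a small made-up frame (the name `toy` and its values are assumptions for illustration, not rows from the appointments file).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the appointments data (values are made up).
toy = pd.DataFrame({
    "Age": [62, 56, np.nan, 8],
    "Gender": ["F", "M", "F", None],
    "SMS_received": [0, 0, 1, 0],
})

# Count missing values per column; a count of 0 means the column is complete.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Same expression as in the notebook: names of columns with at least one null.
columns_with_nulls = list(toy.columns[toy.isnull().any()])
print(columns_with_nulls)
```

On the real dataset the second expression returned an empty index, which is why no imputation or row dropping is needed here.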

Next, we can check for obviously incorrect values using describe().

In [47]:
df.describe()

Unnamed: 0,PatientId,AppointmentID,Age,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received
count,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0
mean,147496300000000.0,5675305.0,37.088874,0.098266,0.197246,0.071865,0.0304,0.022248,0.321026
std,256094900000000.0,71295.75,23.110205,0.297675,0.397921,0.258265,0.171686,0.161543,0.466873
min,39217.84,5030230.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,4172614000000.0,5640286.0,18.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,31731840000000.0,5680573.0,37.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,94391720000000.0,5725524.0,55.0,0.0,0.0,0.0,0.0,0.0,1.0
max,999981600000000.0,5790484.0,115.0,1.0,1.0,1.0,1.0,4.0,1.0


In [91]:
noshows = df[df["No-show"] == "Yes"]
len(noshows)

22319

In [93]:
percentnoshows = len(noshows) / len(df)
percentnoshows

0.20193255946510807

From this description, we can determine a range of information about the dataset:

<ul>
    <li> The average age is 37.
    <li> The mean of Scholarship is about 0.10 and every reported percentile is 0, so only about 10% of appointments involve patients enrolled in the Bolsa Familia welfare program.
    <li> Based on the mean values, hypertension (column Hipertension) is the most common of the four listed conditions, at about 20% of appointments.
    <li> Approximately 20% of all appointments ended with a no-show.
</ul>

### Data Cleaning
The describe() output also reveals an error in the Age column: the minimum age is -1, which is not possible. The adult-only filter applied below excludes this record.
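
Before filtering, it can help to look at the offending rows directly. The following is a minimal sketch on made-up ages (the frame `toy` is an assumption for illustration); it separates impossible negative ages from valid ones.

```python
import pandas as pd

# Made-up ages, including one impossible value like the -1 seen in describe().
toy = pd.DataFrame({"Age": [62, -1, 0, 115, 8]})

# Rows whose age cannot be real.
invalid_age = toy[toy["Age"] < 0]
print(invalid_age)

# Keep only rows with a plausible (non-negative) age.
valid = toy[toy["Age"] >= 0]
print(valid["Age"].min())
```

An age of 0 is kept here on the assumption that it denotes an infant rather than a data-entry error.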

Ultimately, we want to determine what factors impact whether a specific patient will be a no-show for their appointment. Therefore, we only want to consider patients who have autonomy. In other words, we will only consider adult patients (18 or older). Whether or not a minor patient goes to a medical appointment is strongly affected by their parents, which would affect our ability to draw conclusions from this dataset. 


In [48]:
df_adult = df[df.Age >= 18]
df_adult.head()

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,29872500000000.0,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,558997800000000.0,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,4262962000000.0,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0,0,0,0,0,0,No
4,8841186000000.0,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,1,1,0,0,0,No
5,95985130000000.0,5626772,F,2016-04-27T08:36:51Z,2016-04-29T00:00:00Z,76,REPÚBLICA,0,1,0,0,0,0,No


In [49]:
df_adult.shape

(83147, 14)

In [50]:
df_adult.describe()

Unnamed: 0,PatientId,AppointmentID,Age,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received
count,83147.0,83147.0,83147.0,83147.0,83147.0,83147.0,83147.0,83147.0,83147.0
mean,146990500000000.0,5674828.0,46.887428,0.093798,0.261741,0.094916,0.040122,0.026134,0.33027
std,255654500000000.0,72617.47,17.672053,0.291549,0.439585,0.293101,0.196246,0.172503,0.470313
min,39217.84,5030230.0,18.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,4172457000000.0,5640372.0,32.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,31683630000000.0,5680670.0,46.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,94225830000000.0,5725467.0,60.0,0.0,1.0,0.0,0.0,0.0,1.0
max,999968600000000.0,5790484.0,115.0,1.0,1.0,1.0,1.0,4.0,1.0


We see that df_adult now contains only adult patients (age 18 or older), which the minimum age of 18 in the describe() output confirms.

The mean values for all diagnoses also went up. This is expected, since children are less likely to be diagnosed with conditions such as hypertension, diabetes, and alcoholism.

<a id='eda'></a>
## Exploratory Data Analysis


### Research Question 1: Are age and opting to get an SMS message correlated? Does receiving an SMS message before the appointment help prevent no-shows?

To answer this question, we will build two datasets, one for appointments where the patient received an SMS message (SMS_received equal to 1) and one for appointments where they did not.

First, the age distribution of both groups will be visualized with a histogram.

Then, no-show behavior will be explored.

In [51]:
adultsms = df_adult[df_adult.SMS_received == 1]
adultnosms = df_adult[df_adult.SMS_received == 0]

In [52]:
adultsms.shape

(27461, 14)

In [53]:
adultnosms.shape

(55686, 14)

Because these datasets have different sizes, they will be normalized when constructing a histogram. 
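
To illustrate what normalization does, the sketch below computes density-normalized histograms with NumPy for two random groups of very different sizes (the groups, seed, and bin edges are assumptions for illustration). With density normalization, bar height times bin width sums to 1 for each group, so the shapes are comparable regardless of group size.

```python
import numpy as np

# Two made-up age samples of different sizes (ages 18 to 99).
rng = np.random.default_rng(0)
small_group = rng.integers(18, 100, size=200)
large_group = rng.integers(18, 100, size=2000)

# Shared bin edges: 15 equal-width bins from 18 to 100.
bins = np.linspace(18, 100, 16)

small_density, edges = np.histogram(small_group, bins=bins, density=True)
large_density, _ = np.histogram(large_group, bins=bins, density=True)

# The area under each normalized histogram is 1.
widths = np.diff(edges)
small_area = float((small_density * widths).sum())
large_area = float((large_density * widths).sum())
print(small_area, large_area)
```

Passing an integer such as `bins=15` to each call separately lets each group pick its own bin edges from its own data range; sharing explicit edges, as in this sketch, is one way to make the bars line up.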

In [104]:
plt.figure(figsize=(8,6))
plt.hist(adultnosms.Age, bins=15, alpha=0.5, label="Declined SMS", density=True)
plt.hist(adultsms.Age, bins=15, alpha=0.5, label="Received SMS", density=True)
plt.legend(loc='upper right')
plt.xlabel("Age")
plt.ylabel("Normalized Count")
plt.title("Histogram of Ages for Adults Who Receive SMS Messages and Adults Who Don't")
plt.show()


From the histogram, we see no obvious difference in age distribution between patients who receive SMS reminders about their appointments and those who do not.

Now we will determine whether receiving the SMS has any impact on no-show status.

In [80]:
sms_show = adultsms[adultsms['No-show'] == "No"]
sms_no = adultsms[adultsms['No-show'] == "Yes"]

nosms_show = adultnosms[adultnosms['No-show'] == "No"]
nosms_no = adultnosms[adultnosms['No-show'] == "Yes"]

In [81]:
print(sms_show.shape)
print(sms_no.shape)

(20102, 14)
(7359, 14)


In [82]:
print(nosms_show.shape)
print(nosms_no.shape)

(46723, 14)
(8963, 14)


In [83]:
sms_noshow_percent = len(sms_no) / len(adultsms)
sms_noshow_percent

0.2679800444266414

In [84]:
nosms_noshow_percent = len(nosms_no) / len(adultnosms)
nosms_noshow_percent

0.16095607513558166
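
The filtering and division steps above can also be expressed as a single grouped mean. The sketch below shows the idea on a made-up frame (the name `toy` and its rows are assumptions for illustration): the mean of a boolean "missed" flag within each group is that group's no-show rate.

```python
import pandas as pd

# Made-up appointments: 4 with an SMS, 6 without.
toy = pd.DataFrame({
    "SMS_received": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "No-show": ["Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "No", "No"],
})

# True where the appointment was missed.
missed = toy["No-show"].eq("Yes")

# Mean of a boolean series per group = fraction of missed appointments.
rate_by_sms = missed.groupby(toy["SMS_received"]).mean()
print(rate_by_sms)
```

This avoids keeping four intermediate frames in memory and extends naturally to grouping by other columns.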

In [102]:
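labels = ['All Patients', 'Adults with SMS Message', 'Adults without SMS Message']
# Use the rates computed above (converted to percentages) instead of retyping them
percentdata = [100 * percentnoshows, 100 * sms_noshow_percent, 100 * nosms_noshow_percent]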
plt.figure(figsize = (10,6))
plt.bar(labels, percentdata)
plt.title("Percentage Appointment No-Shows")
plt.ylabel("Percentage")
plt.show()

Interestingly, adults who received SMS messages were more likely to be no-shows (about 26.8%) than adults who did not (about 16.1%).
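
One way to gauge whether a gap of this size could be sampling noise is a two-proportion z-test. This test is not part of the original analysis; the sketch below applies it to the counts printed above (7,359 no-shows out of 27,461 with an SMS, and 8,963 out of 55,686 without), under the assumption that appointments are independent, which may not hold when one patient has several appointments. The helper name `two_proportion_z` is hypothetical.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """z statistic for the difference between two proportions (pooled variance)."""
    p1 = x1 / n1
    p2 = x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / standard_error

# Counts taken from the cell outputs above.
z = two_proportion_z(7359, 27461, 8963, 55686)
print(z)
```

A large positive z would suggest the difference is unlikely to be chance alone, but it says nothing about why it exists. For example, reminders might be sent mainly for appointments booked far in advance, which could themselves be missed more often.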

### Research Question 2: Are those on welfare more likely to follow through on their appointments?

Next, we will perform a similar analysis to determine whether those who are on welfare are more likely to show up to their appointments. 

Children will be re-included in this analysis because if a child is on welfare, they are likely covered under their parent's plan.

In [106]:
welfare = df[df.Scholarship == 1]

In [108]:
welfare_noshow = welfare[welfare["No-show"] == 'Yes']

In [109]:
welfare_noshow_percent = len(welfare_noshow) / len(welfare)
welfare_noshow_percent

0.2373630420771568

In [111]:
labels = ['All Patients', 'Patients on Welfare']
percentdata = [100 * percentnoshows, 100 * welfare_noshow_percent]
plt.bar(labels, percentdata)
plt.title("Percentage Appointment No-Shows")
plt.ylabel("Percentage")
plt.show()

Interestingly, patients on welfare are also more likely to be no-shows (about 23.7%, versus 20.2% for all patients).
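
Because the same "filter, then divide" pattern recurs for every subgroup, it could be wrapped in a small helper. The function `no_show_rate` and the frame `toy` below are hypothetical names with made-up data, shown only as a sketch of the pattern.

```python
import pandas as pd

def no_show_rate(frame, mask=None):
    """Fraction of rows in `frame` (optionally restricted by `mask`) marked as no-shows."""
    subset = frame if mask is None else frame[mask]
    return float((subset["No-show"] == "Yes").mean())

# Made-up appointments: two on welfare, three not.
toy = pd.DataFrame({
    "Scholarship": [1, 1, 0, 0, 0],
    "No-show": ["Yes", "No", "No", "No", "Yes"],
})

overall = no_show_rate(toy)
on_welfare = no_show_rate(toy, toy["Scholarship"] == 1)
print(overall, on_welfare)
```

Computing the plotted values through one function keeps the bar chart in sync with the data if the dataset or filters change.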

### Research Question 3: Are younger men without a diagnosis more likely to be no-shows? 

To answer this question, we will classify men under 50 as "young" and men aged 50 or older as "older."
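
The age split can also be stored as a label column instead of two separate frames. The sketch below uses `pd.cut` on made-up ages (the frame `toy_men` and the column `AgeGroup` are assumptions for illustration); `right=False` makes each bin include its left edge, so 49 is "young" and 50 is "older".

```python
import pandas as pd

# Made-up adult ages around the cutoff.
toy_men = pd.DataFrame({"Age": [18, 35, 49, 50, 62, 100]})

# Bins are [18, 50) and [50, inf): left edge included, right edge excluded.
toy_men["AgeGroup"] = pd.cut(
    toy_men["Age"],
    bins=[18, 50, float("inf")],
    right=False,
    labels=["young", "older"],
)

age_groups = toy_men["AgeGroup"].astype(str).tolist()
print(age_groups)
```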

First, we will determine no-show rates for both young and old men with and without diagnoses. 

In [119]:
men = df_adult[df_adult.Gender == "M"]

In [116]:
young_men = men[men.Age < 50]
young_men.describe()

Unnamed: 0,PatientId,AppointmentID,Age,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received
count,13258.0,13258.0,13258.0,13258.0,13258.0,13258.0,13258.0,13258.0,13258.0
mean,150515600000000.0,5678916.0,34.255996,0.030397,0.095339,0.02836,0.060567,0.02919,0.285639
std,257502300000000.0,66372.74,9.263914,0.171683,0.293693,0.166006,0.238544,0.178353,0.451735
min,43741.76,5122866.0,18.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,4522082000000.0,5644137.0,26.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,35218330000000.0,5682391.0,35.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,95645670000000.0,5725706.0,42.0,0.0,0.0,0.0,0.0,0.0,1.0
max,999479200000000.0,5790466.0,49.0,1.0,1.0,1.0,1.0,4.0,1.0


In [118]:
older_men = men[men.Age >= 50]
older_men.describe()

Unnamed: 0,PatientId,AppointmentID,Age,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received
count,11948.0,11948.0,11948.0,11948.0,11948.0,11948.0,11948.0,11948.0,11948.0
mean,151639600000000.0,5675104.0,63.539672,0.022012,0.433545,0.16304,0.110228,0.042434,0.303984
std,259072800000000.0,76922.46,10.036571,0.146729,0.495585,0.369417,0.313186,0.22438,0.459995
min,61433780.0,5134227.0,50.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,4413892000000.0,5642030.0,55.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,31525110000000.0,5683170.0,62.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,95458720000000.0,5728932.0,70.0,0.0,1.0,0.0,0.0,0.0,1.0
max,999946500000000.0,5790461.0,100.0,1.0,1.0,1.0,1.0,3.0,1.0


From the above describe functions, we see that we properly divided the men into older and younger groups. 

Next, we will isolate men who do not have a diagnosis from those that do. 

In [125]:
young_men = young_men.copy()  # explicit copy avoids SettingWithCopyWarning
young_men['Diagscore'] = young_men[['Hipertension', 'Diabetes', 'Alcoholism', 'Handcap']].sum(axis=1)


In [128]:
older_men = older_men.copy()  # explicit copy avoids SettingWithCopyWarning
older_men['Diagscore'] = older_men[['Hipertension', 'Diabetes', 'Alcoholism', 'Handcap']].sum(axis=1)


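The new column Diagscore (Diagnostic Score) is the sum of the four condition columns (Hipertension, Diabetes, Alcoholism, Handcap). If Diagscore is zero, the patient has none of the listed diagnoses. Note that Handcap ranges from 0 to 4 in this dataset, so Diagscore is not a simple count of conditions. This column can be used for further analysis.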
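
A small worked example may make the score easier to read. In the sketch below (made-up rows; the column `HasDiagnosis` and the constant `DIAGNOSIS_COLUMNS` are assumptions for illustration), a patient with only a Handcap level of 2 gets the same score as a patient with two separate conditions, which is why only the zero versus non-zero distinction is used.

```python
import pandas as pd

DIAGNOSIS_COLUMNS = ["Hipertension", "Diabetes", "Alcoholism", "Handcap"]

# Made-up patients: none, two conditions, and a Handcap level of 2.
toy = pd.DataFrame({
    "Hipertension": [0, 1, 0],
    "Diabetes": [0, 1, 0],
    "Alcoholism": [0, 0, 0],
    "Handcap": [0, 0, 2],
})

# Row-wise sum of the four columns.
toy["Diagscore"] = toy[DIAGNOSIS_COLUMNS].sum(axis=1)

# Only "zero or not" is meaningful for the analysis.
toy["HasDiagnosis"] = toy["Diagscore"] > 0
print(toy)
```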

In [136]:
young_no = young_men[young_men.Diagscore == 0]
older_no = older_men[older_men.Diagscore == 0]

In [137]:
young_diag = young_men[young_men.Diagscore > 0]
older_diag = older_men[older_men.Diagscore > 0]

In [140]:
young_no_noshow = young_no[young_no['No-show'] == "Yes"]
older_no_noshow = older_no[older_no['No-show'] == "Yes"]

In [143]:
# Percent of young males with no diagnoses that are no-shows
percent_young_no_noshow = len(young_no_noshow) / len(young_no)
print(percent_young_no_noshow)

# Percent of older males with no diagnoses that are no-shows
percent_older_no_noshow = len(older_no_noshow) / len(older_no)
print(percent_older_no_noshow)

0.22476016445865693
0.15136563876651982


In [147]:
young_diag_noshow = young_diag[young_diag["No-show"] == "Yes"]
older_diag_noshow = older_diag[older_diag["No-show"] == "Yes"]

In [148]:
# Percent of young males with diagnoses that are no-shows
percent_young_diag_noshow = len(young_diag_noshow) / len(young_diag)
print(percent_young_diag_noshow)

# Percent of older males with diagnoses that are no-shows
percent_older_diag_noshow = len(older_diag_noshow) / len(older_diag)
print(percent_older_diag_noshow)

0.19152615650670127
0.16387693288697594


In [153]:
labels = ["Young Men w/o Diagnosis", "Young Men with Diagnosis", "Older Men w/o Diagnosis", "Older Men with Diagnosis"]
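percentdata = [100 * percent_young_no_noshow, 100 * percent_young_diag_noshow, 100 * percent_older_no_noshow, 100 * percent_older_diag_noshow]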
plt.figure(figsize=(11,4))
plt.bar(labels, percentdata)
plt.title("Percentage Appointment No-Shows among Men")
plt.ylabel("Percentage")
plt.show()

From the data, we see that young men with no diagnosis are the most likely to be no-shows (about 22.5%). Interestingly, a diagnosis is associated with a lower no-show rate among young men (19.2% versus 22.5%) but a slightly higher one among older men (16.4% versus 15.1%).
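
The four rates compared here form a two-by-two table (age group by diagnosis status), which can be produced in one step by grouping on both keys. The sketch below uses made-up rows and hypothetical column names (`AgeGroup`, `Diagnosis`, `Missed`) to show the shape of the result.

```python
import pandas as pd

# Made-up appointments: two per combination of age group and diagnosis status.
toy = pd.DataFrame({
    "AgeGroup": ["young", "young", "young", "young",
                 "older", "older", "older", "older"],
    "Diagnosis": ["no", "no", "yes", "yes", "no", "no", "yes", "yes"],
    "No-show": ["Yes", "No", "No", "No", "No", "No", "Yes", "No"],
})

# Boolean flag whose mean is the no-show rate.
toy["Missed"] = toy["No-show"].eq("Yes")

# Rows: age group; columns: diagnosis status; values: no-show rate.
rates = toy.groupby(["AgeGroup", "Diagnosis"])["Missed"].mean().unstack("Diagnosis")
print(rates)
```

Laying the rates out this way makes it easier to read the effect of a diagnosis within each age group separately.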

<a id='conclusions'></a>
## Conclusions



In this project, we answered three key research questions. 

First, we determined that **receiving an SMS message before the appointment did not help prevent no-shows**. Adults who received SMS messages had a higher no-show rate (about 26.8%) than those who did not (about 16.1%). The age distributions of the two groups were similar, so age does not appear to explain this gap.

Second, we determined that **patients on welfare were more likely to be no-shows**. This may be because they are not paying for the appointment and therefore feel less of an obligation to go, or because they may have to work during the time of the appointment. 

Finally, we saw that **young men with no diagnosis are more likely to be no-shows**. If a young man had a diagnosis, his likelihood of being a no-show was lower. Interestingly, a diagnosis was associated with fewer no-shows among young men but slightly more among older men.

**Recommendations**

<ul>
    <li> SMS messages were not associated with fewer no-shows in this data, so budgeting for this option should not be increased.
    <li> Those on welfare are more likely to be no-shows, so it may be beneficial to determine what factors are preventing these people from following through on their appointments (e.g., having to work extra hours).
    <li> Diagnoses are associated with a lower rate of no-shows among men, but only for younger patients. Other options should be considered for older patients.
</ul>

In [154]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])

0