# Project: Investigate a Dataset - Medical Appointments No Shows

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

### Dataset Description  

This dataset contains records of over 110,000 medical appointments in Brazil, each described by 14 variables, and focuses on whether or not patients show up for their appointments.

The data dictionary is given below:
1. **PatientId**: Identification of a patient
2. **AppointmentID**: Identification of each appointment
3. **Gender**: Male or Female
4. **ScheduledDay**: Date appointment was booked
5. **AppointmentDay**: Date for the appointment visit
6. **Age**: Age of the patient
7. **Neighbourhood**: Where the appointment takes place (city, town, province, etc.)
8. **Scholarship**: True or False if patient receives Government health benefits/subsidies
9. **Hipertension**: True or False if patient is hypertensive
10. **Diabetes**: True or False if patient is diabetic
11. **Alcoholism**: True or False if patient has an alcoholism condition
12. **Handcap**: True or False if patient is handicapped
13. **SMS_received**: 1 or more messages sent to the patient
14. **No-show**: Whether the patient missed the appointment: "Yes" if the patient did not show up, "No" if the patient showed up
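
The `No-show` column records absence rather than attendance ("No" means the patient attended), and two column names are misspelled, so both are easy to trip over later. The following is a minimal sketch on a small made-up frame, not the real file, of one way to rename the columns and derive an explicit attendance flag. The names `Hypertension`, `Handicap`, `No_show` and `showed_up` are illustrative assumptions and do not exist in the dataset.

```python
import pandas as pd

# Toy rows that reuse the dataset's column names; the values are made up.
toy = pd.DataFrame({
    "Hipertension": [0, 1, 0],
    "Handcap": [0, 0, 1],
    "No-show": ["No", "Yes", "No"],
})

# Hypothetical cleaner names; the original file keeps the misspellings.
toy = toy.rename(columns={
    "Hipertension": "Hypertension",
    "Handcap": "Handicap",
    "No-show": "No_show",
})

# "No" means the patient attended, so invert it into an explicit boolean flag.
toy["showed_up"] = toy["No_show"].eq("No")
print(toy)
```

Keeping the original `No_show` column alongside the derived flag makes it easy to cross-check the inversion before dropping either one.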

### Question(s) for Analysis
1. What factors are important for us to know in order to predict if a patient will show up for their scheduled appointment?
2. How does the presence of a health condition affect whether a patient shows up?
3. Is there a relationship between the number of days to the appointment and failure to show up for the appointment?

In [1]:
# Upgrade pandas to use dataframe.explode() function. 
!pip install --upgrade pandas==0.25.0

ERROR: Could not find a version that satisfies the requirement pandas==0.25.0 (from versions: none)
ERROR: No matching distribution found for pandas==0.25.0


In [2]:
# import statements for necessary packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling

In [3]:
# load the data and preview the last five rows
df = pd.read_csv("Database_No_show_appointments/noshowappointments-kagglev2-may-2016.csv")
df.tail()

| | PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 110522 | 2572134000000.0 | 5651768 | F | 2016-05-03T09:15:35Z | 2016-06-07T00:00:00Z | 56 | MARIA ORTIZ | 0 | 0 | 0 | 0 | 0 | 1 | No |
| 110523 | 3596266000000.0 | 5650093 | F | 2016-05-03T07:27:33Z | 2016-06-07T00:00:00Z | 51 | MARIA ORTIZ | 0 | 0 | 0 | 0 | 0 | 1 | No |
| 110524 | 15576630000000.0 | 5630692 | F | 2016-04-27T16:03:52Z | 2016-06-07T00:00:00Z | 21 | MARIA ORTIZ | 0 | 0 | 0 | 0 | 0 | 1 | No |
| 110525 | 92134930000000.0 | 5630323 | F | 2016-04-27T15:09:23Z | 2016-06-07T00:00:00Z | 38 | MARIA ORTIZ | 0 | 0 | 0 | 0 | 0 | 1 | No |
| 110526 | 377511500000000.0 | 5629448 | F | 2016-04-27T13:30:56Z | 2016-06-07T00:00:00Z | 54 | MARIA ORTIZ | 0 | 0 | 0 | 0 | 0 | 1 | No |


In [4]:
# inspect the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   PatientId       110527 non-null  float64
 1   AppointmentID   110527 non-null  int64  
 2   Gender          110527 non-null  object 
 3   ScheduledDay    110527 non-null  object 
 4   AppointmentDay  110527 non-null  object 
 5   Age             110527 non-null  int64  
 6   Neighbourhood   110527 non-null  object 
 7   Scholarship     110527 non-null  int64  
 8   Hipertension    110527 non-null  int64  
 9   Diabetes        110527 non-null  int64  
 10  Alcoholism      110527 non-null  int64  
 11  Handcap         110527 non-null  int64  
 12  SMS_received    110527 non-null  int64  
 13  No-show         110527 non-null  object 
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB


In [5]:
# check for duplicate records
df.duplicated().sum()

0
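
Note that `df.duplicated()` with no arguments only flags rows that match on every column. As a hedged illustration on made-up data, the `subset` parameter can be used to check a single key column instead, for example to confirm that appointment IDs are unique while still allowing the same patient to appear more than once. The toy values and variable names below are assumptions for demonstration only.

```python
import pandas as pd

# Toy frame: one patient with two distinct appointments, plus another patient.
toy = pd.DataFrame({
    "PatientId": [1.0, 1.0, 2.0],
    "AppointmentID": [100, 101, 102],
    "Age": [30, 30, 45],
})

# Rows identical across every column.
full_row_dupes = toy.duplicated().sum()

# Rows that repeat an appointment ID; each appointment should appear once.
appointment_dupes = toy.duplicated(subset="AppointmentID").sum()

# Rows that repeat a patient ID; repeats here are expected, not errors.
repeat_patients = toy.duplicated(subset="PatientId").sum()

print(full_row_dupes, appointment_dupes, repeat_patients)
```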

In [6]:
# check for null records
df.isna().sum()

PatientId         0
AppointmentID     0
Gender            0
ScheduledDay      0
AppointmentDay    0
Age               0
Neighbourhood     0
Scholarship       0
Hipertension      0
Diabetes          0
Alcoholism        0
Handcap           0
SMS_received      0
No-show           0
dtype: int64

In [7]:
# inspect the ScheduledDay and AppointmentDay columns
df[['ScheduledDay', 'AppointmentDay']][:5]

| | ScheduledDay | AppointmentDay |
|---|---|---|
| 0 | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z |
| 1 | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z |
| 2 | 2016-04-29T16:19:04Z | 2016-04-29T00:00:00Z |
| 3 | 2016-04-29T17:29:31Z | 2016-04-29T00:00:00Z |
| 4 | 2016-04-29T16:07:23Z | 2016-04-29T00:00:00Z |
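
Both date columns are stored as strings, and `AppointmentDay` carries no time of day (always `00:00:00`), while `ScheduledDay` does. A minimal sketch, assuming the goal is a whole-day waiting period for the third analysis question, is to parse both columns and compare calendar dates only. The two rows below are copied from the outputs shown earlier; the column name `wait_days` is a hypothetical choice.

```python
import pandas as pd

# Two timestamp pairs taken from the notebook outputs above.
toy = pd.DataFrame({
    "ScheduledDay": ["2016-04-29T18:38:08Z", "2016-04-27T16:03:52Z"],
    "AppointmentDay": ["2016-04-29T00:00:00Z", "2016-06-07T00:00:00Z"],
})

# Parse the ISO 8601 strings into timezone-aware datetimes.
for col in ["ScheduledDay", "AppointmentDay"]:
    toy[col] = pd.to_datetime(toy[col])

# normalize() resets the time of day to midnight, so a same-day booking
# yields 0 days instead of a negative difference.
toy["wait_days"] = (
    toy["AppointmentDay"].dt.normalize() - toy["ScheduledDay"].dt.normalize()
).dt.days

print(toy)
```

Without the normalization step, the first row would produce a negative waiting time, because the booking timestamp (18:38) falls after the midnight timestamp recorded for the appointment on the same date.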


### Data Cleaning

The visual and programmatic assessment above shows that the dataset has a few quality issues and no major tidiness concerns. These are identified and wrangled in the cells below.