# Project: Investigate Data Set - No Show Appointment


## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

> **1.0**: I have selected the No-show appointments dataset (original source on
> Kaggle). This dataset collects information from about 100k medical appointments in
> Brazil and focuses on whether or not patients show up for their appointment.
> Each row includes a number of characteristics about the patient.
>
> - ‘ScheduledDay’ tells us on what day the patient set up their appointment.
> - ‘Neighbourhood’ indicates the location of the hospital.
> - ‘Scholarship’ indicates whether or not the patient is enrolled in the Brazilian welfare
>   program Bolsa Família.
> - Be careful about the encoding of the last column: it says ‘No’ if the patient showed up to their
>   appointment, and ‘Yes’ if they did not show up.


| Name (Value(s)) | Description |
| --- | --- |
| PatientId (number) | Identification of a patient |
| AppointmentID (number) | Identification of each appointment |
| Gender (F or M) | 'F' if female or 'M' if male |
| ScheduledDay (date) | The day the patient set up their appointment |
| AppointmentDay (date) | The day of the actual appointment, when the patient has to visit the doctor |
| Age (number) | How old the patient is |
| Neighbourhood (string) | The location of the hospital |
| Scholarship (0 or 1) | Whether or not the patient is enrolled in the Brazilian welfare program Bolsa Família |
| Hipertension (0 or 1) | Whether the patient has hypertension |
| Diabetes (0 or 1) | Whether the patient has diabetes |
| Alcoholism (0 or 1) | Whether the patient is an alcoholic |
| Handcap (0 to 4) | The number of disabilities the patient has |
| SMS_received (0 or 1) | 1 if one or more messages were sent to the patient |
| No-show (Yes or No) | ‘No’ if the patient showed up to their appointment, and ‘Yes’ if they did not show up |


## Metadata
The dataset has 110527 rows (medical appointments) and 14 columns (variables) associated with each record.

PatientId: Identification of a patient.

AppointmentID: Identification of each appointment.

Gender: Male (M) or Female (F).

ScheduledDay: The day someone called or registered the appointment. This is on or before the appointment day.

AppointmentDay: The day of the actual appointment, when the patient has to visit the doctor.

Age: How old the patient is.

Neighbourhood: Where the appointment takes place.

Scholarship: True (1) or False (0).

Hypertension (column `Hipertension`): True (1) or False (0).

Diabetes: True (1) or False (0).

Alcoholism: True (1) or False (0).

Handicap (column `Handcap`): The number of disabilities a person has.
For example, if the person is blind and can't walk, the total is 2.

SMS_received: 1 if one or more messages were sent to the patient, otherwise 0.

No-show: Yes if the patient did not show up for the appointment; No if the patient showed up.




The most important variable in this research is "No-show".
This variable describes whether or not the patient showed up for the appointment.
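
Because ‘No’ means the patient attended and ‘Yes’ means they missed the appointment, the column is easy to misread. The sketch below shows one possible way to make the encoding explicit. The toy rows and the column name `showed_up` are assumptions for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy rows standing in for the real data (made-up values);
# only the 'No-show' column matters for this illustration.
toy = pd.DataFrame({"No-show": ["No", "Yes", "No", "No"]})

# 'No' means the patient attended, so showed_up is True where the value is 'No'.
# 'showed_up' is a hypothetical column name, not one from the dataset.
toy["showed_up"] = toy["No-show"].eq("No")

# Share of appointments that were missed ('Yes').
no_show_rate = toy["No-show"].eq("Yes").mean()

print(toy)
print(no_show_rate)
```

A boolean column like this tends to read more naturally in later group comparisons than the original Yes/No strings.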

# Dependent Variable: No-show

# Independent Variables: age, gender, scholarship, and waiting_period.

# Research Questions
1. What factors are important for us to know in order to predict whether a patient will
show up for their scheduled appointment?
2. Is there any correlation between age and showing up for appointments?
3. Is there any correlation between gender and showing up for appointments?
4. Does receiving a scholarship have a significant impact on showing up for appointments?
5. Does receiving messages have a significant impact on showing up for appointments?
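
Questions 4 and 5 compare attendance between two groups. A minimal sketch of that comparison is to compute the share of missed appointments for each value of a column. The helper `no_show_rate_by` and the toy rows are hypothetical and only illustrate the idea; they are not results from the dataset.

```python
import pandas as pd

# Made-up rows for illustration; the column names match the dataset.
toy = pd.DataFrame({
    "SMS_received": [0, 0, 1, 1, 1, 0],
    "Scholarship": [0, 1, 0, 0, 1, 0],
    "No-show": ["No", "Yes", "No", "Yes", "Yes", "No"],
})


def no_show_rate_by(frame, column):
    """Return the share of missed appointments for each value of `column`."""
    missed = frame["No-show"].eq("Yes")  # True where the patient did not show up
    return missed.groupby(frame[column]).mean()


rates = no_show_rate_by(toy, "SMS_received")
print(rates)
```

A difference in these rates is descriptive only. Judging whether it is significant would need a statistical test on the real data.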

In [1]:
# Import the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('C:/Users/USER/Desktop/Udacity Nanodegree projects/noshowappointments-kagglev2-may-2016.csv')
df.head()

|   | PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 29872500000000.0 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 1 | 558998000000000.0 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 2 | 4262960000000.0 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29T00:00:00Z | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 3 | 867951000000.0 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29T00:00:00Z | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 4 | 8841190000000.0 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No |


In [3]:
df.shape

(110527, 14)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   PatientId       110527 non-null  float64
 1   AppointmentID   110527 non-null  int64  
 2   Gender          110527 non-null  object 
 3   ScheduledDay    110527 non-null  object 
 4   AppointmentDay  110527 non-null  object 
 5   Age             110527 non-null  int64  
 6   Neighbourhood   110527 non-null  object 
 7   Scholarship     110527 non-null  int64  
 8   Hipertension    110527 non-null  int64  
 9   Diabetes        110527 non-null  int64  
 10  Alcoholism      110527 non-null  int64  
 11  Handcap         110527 non-null  int64  
 12  SMS_received    110527 non-null  int64  
 13  No-show         110527 non-null  object 
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB


In [5]:
df.columns

Index(['PatientId', 'AppointmentID', 'Gender', 'ScheduledDay',
       'AppointmentDay', 'Age', 'Neighbourhood', 'Scholarship', 'Hipertension',
       'Diabetes', 'Alcoholism', 'Handcap', 'SMS_received', 'No-show'],
      dtype='object')
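
The column list shows a few misspelled or awkward names (`Hipertension`, `Handcap`, and the hyphenated `No-show`). One optional cleanup, sketched here on an empty frame with the same columns, is to rename them. The new names are an assumption for illustration, and the cells in this section keep the original names.

```python
import pandas as pd

# Same column names as the dataset, on an empty frame for illustration.
columns = ['PatientId', 'AppointmentID', 'Gender', 'ScheduledDay',
           'AppointmentDay', 'Age', 'Neighbourhood', 'Scholarship',
           'Hipertension', 'Diabetes', 'Alcoholism', 'Handcap',
           'SMS_received', 'No-show']
empty = pd.DataFrame(columns=columns)

# Hypothetical replacement names; any consistent spelling would do.
new_names = {
    "Hipertension": "Hypertension",
    "Handcap": "Handicap",
    "No-show": "No_show",  # an underscore allows attribute-style access
}
renamed = empty.rename(columns=new_names)

print(list(renamed.columns))
```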

In [6]:
df.isna().sum()

PatientId         0
AppointmentID     0
Gender            0
ScheduledDay      0
AppointmentDay    0
Age               0
Neighbourhood     0
Scholarship       0
Hipertension      0
Diabetes          0
Alcoholism        0
Handcap           0
SMS_received      0
No-show           0
dtype: int64

In [7]:
df.describe()

|   | PatientId | AppointmentID | Age | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 |
| mean | 147496300000000.0 | 5675305.0 | 37.088874 | 0.098266 | 0.197246 | 0.071865 | 0.0304 | 0.022248 | 0.321026 |
| std | 256094900000000.0 | 71295.75 | 23.110205 | 0.297675 | 0.397921 | 0.258265 | 0.171686 | 0.161543 | 0.466873 |
| min | 39200.0 | 5030230.0 | -1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 4172615000000.0 | 5640286.0 | 18.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 31731800000000.0 | 5680573.0 | 37.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 94391700000000.0 | 5725524.0 | 55.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| max | 999982000000000.0 | 5790484.0 | 115.0 | 1.0 | 1.0 | 1.0 | 1.0 | 4.0 | 1.0 |


Looking closely at the Age variable, I found a problem: the minimum value is -1, which cannot be someone's age. This will be fixed later. The average age in this dataset is about 37, and the highest age is 115.
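
One possible way to handle the invalid age, sketched on made-up values, is to drop rows where `Age` is negative. This is an assumption about the eventual fix, not necessarily the approach taken later in the notebook. Ages of 0 are kept here on the assumption that they represent infants.

```python
import pandas as pd

# Made-up ages for illustration, including one invalid negative value.
toy = pd.DataFrame({"Age": [62, -1, 8, 115, 0]})

# Flag rows with an impossible (negative) age.
invalid = toy["Age"] < 0
print(invalid.sum(), "row(s) with a negative age")

# Keep only the valid rows and rebuild a clean 0..n-1 index.
cleaned = toy.loc[~invalid].reset_index(drop=True)
print(cleaned)
```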

In [8]:
df.dtypes

PatientId         float64
AppointmentID       int64
Gender             object
ScheduledDay       object
AppointmentDay     object
Age                 int64
Neighbourhood      object
Scholarship         int64
Hipertension        int64
Diabetes            int64
Alcoholism          int64
Handcap             int64
SMS_received        int64
No-show            object
dtype: object

In [9]:
type(df['No-show'][0])

str

In [10]:
type(df['ScheduledDay'][0])

str

In [11]:
type(df['AppointmentDay'][0])

str

In [12]:
# ScheduledDay and AppointmentDay shouldn't be strings, so both are converted to datetime
df['ScheduledDay'] = pd.to_datetime(df['ScheduledDay'])
df['AppointmentDay'] = pd.to_datetime(df['AppointmentDay'])

In [13]:
df.dtypes

PatientId                     float64
AppointmentID                   int64
Gender                         object
ScheduledDay      datetime64[ns, UTC]
AppointmentDay    datetime64[ns, UTC]
Age                             int64
Neighbourhood                  object
Scholarship                     int64
Hipertension                    int64
Diabetes                        int64
Alcoholism                      int64
Handcap                         int64
SMS_received                    int64
No-show                        object
dtype: object

In [14]:
df.head(2)

|   | PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 29872500000000.0 | 5642903 | F | 2016-04-29 18:38:08+00:00 | 2016-04-29 00:00:00+00:00 | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 1 | 558998000000000.0 | 5642503 | M | 2016-04-29 16:08:27+00:00 | 2016-04-29 00:00:00+00:00 | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
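
With both columns stored as datetimes, the `waiting_period` listed among the independent variables can be derived from them. The source does not define how it is computed, so the sketch below assumes it means whole days between scheduling and the appointment. Note that `AppointmentDay` carries no time of day (it is always midnight) while `ScheduledDay` does, so both are truncated to the date first. Otherwise a same-day appointment would come out negative. The second toy row is made up.

```python
import pandas as pd

# First row uses timestamps from the table above; the second row is made up.
toy = pd.DataFrame({
    "ScheduledDay": ["2016-04-29T18:38:08Z", "2016-04-25T09:15:00Z"],
    "AppointmentDay": ["2016-04-29T00:00:00Z", "2016-05-02T00:00:00Z"],
})

toy["ScheduledDay"] = pd.to_datetime(toy["ScheduledDay"])
toy["AppointmentDay"] = pd.to_datetime(toy["AppointmentDay"])

# Truncate both timestamps to midnight, then count whole days between them.
# Assumption: waiting_period = days from scheduling to appointment.
toy["waiting_period"] = (
    toy["AppointmentDay"].dt.normalize() - toy["ScheduledDay"].dt.normalize()
).dt.days

print(toy[["ScheduledDay", "AppointmentDay", "waiting_period"]])
```

Any negative value left after this step would mean an appointment dated before it was scheduled, which is worth checking for in the real data.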
