# Project: Hospital Data Analysis

## Table of Contents
<ul>
  <li><a href="#introduction">Introduction</a></li>
  <li><a href="#wrangling">Data Wrangling</a></li>
  <li><a href="">Exploratory Data Analysis</a></li>
  <li><a href="">Conclusion</a></li>
</ul>

<a id="introduction"></a>
## Introduction
> This analysis examines the **No-show appointments** dataset from Kaggle, which contains about 110,000 medical appointment records from Brazil. Each record captures the following fields: **PatientId, AppointmentID, Gender, ScheduledDay, AppointmentDay, Age, Neighbourhood, Scholarship, Hipertension, Diabetes, Alcoholism, Handcap, SMS_received, No-show**.
> ### Definition of important variables.
>> _Gender_: Describes whether the patient is **male** or **female**. <br>
>> _ScheduledDay_: Tells us on what day the patient set up their appointment. <br>
>> _Age_: Indicates the patient's age. <br>
>> _Neighbourhood_: Indicates the location of the hospital. <br>
>> _Scholarship_: Indicates whether or not the patient is enrolled in the Brazilian welfare program Bolsa Família. <br>
>> _Hipertension_: Indicates whether a patient has hypertension or not. <br>
>> _Diabetes_: Indicates whether a patient is diabetic or not. <br>
>> _Alcoholism_: Indicates whether a patient is alcoholic or not. <br>
>> _Handcap_: Indicates whether a patient is handicapped or not. <br>
>> _SMS_received_: Indicates whether a patient received SMS notifications about the appointment or not. <br>
>> _No-show_: Indicates whether a patient showed up for their appointment or not. <br>
>>> #### Important Points to Note:
>>>> 1. For the _Scholarship_, _Hipertension_, _Diabetes_, _Alcoholism_, _Handcap_, and _SMS_received_ fields, **1 = Yes** and **0 = No**
>>>> 2. For the _No-show_ field, **No = The patient showed up for the appointment** and **Yes = The patient did not show up for the appointment**
> ### Questions to be answered.
>> Q1: Are male patients more likely to show up for an appointment as compared to female patients? <br>
>> Q2: Are older patients more likely to show up for an appointment as compared to the younger patients? <br>
>> Q3: How do patients from different neighbourhoods differ in terms of showing up for doctor's appointments? <br>
>> Q4: Are patients who are enrolled in the Brazilian welfare program more likely to show up for scheduled appointments? <br>
>> Q5: How do positive diagnoses of health conditions such as hypertension, diabetes, alcoholism, and handicap affect appointment attendance? <br>
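
Because `No-show` is phrased negatively, the questions above are easier to reason about with a positively named flag. The following is a minimal sketch, not the method used in this analysis: the column name `showed_up` is hypothetical, and the four rows are made up to mirror the dataset's `Yes`/`No` encoding.

```python
import pandas as pd

# Toy rows (made up for illustration) using the dataset's Yes/No encoding
toy = pd.DataFrame({
    "Gender": ["F", "M", "F", "M"],
    "No-show": ["No", "Yes", "No", "No"],
})

# "No" in No-show means the patient attended, so the label is inverted here.
# `showed_up` is a hypothetical column name, not part of the dataset.
toy["showed_up"] = toy["No-show"].eq("No")

# Share of attended appointments per gender (a Q1-style comparison)
attendance_rate = toy.groupby("Gender")["showed_up"].mean()
print(attendance_rate)
```

The mean of a boolean column is the fraction of `True` values, so the same pattern could be grouped by any of the other fields in Q2 to Q5.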

In [1]:
# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

<a id="wrangling"></a>
# Data Wrangling

## General Properties

In [14]:
# Load the data. 'AppointmentID' is kept as a regular column because later cells reference it by name
df = pd.read_csv('Dataset/appointments_data.csv')

# Verify that the data was loaded successfully
df.head()

| | PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29872500000000.0 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 1 | 558997800000000.0 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 2 | 4262962000000.0 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29T00:00:00Z | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 3 | 867951200000.0 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29T00:00:00Z | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 4 | 8841186000000.0 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No |


In [3]:
# Examine the datatypes of the data in every column in the dataset
df.dtypes

PatientId         float64
AppointmentID       int64
Gender             object
ScheduledDay       object
AppointmentDay     object
Age                 int64
Neighbourhood      object
Scholarship         int64
Hipertension        int64
Diabetes            int64
Alcoholism          int64
Handcap             int64
SMS_received        int64
No-show            object
dtype: object

> All the numerical columns are stored as integers except `PatientId`, which is a float. The text columns are stored as `object` (string), and so are the two date columns, `ScheduledDay` and `AppointmentDay`.
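
The `dtypes` output lists `ScheduledDay` and `AppointmentDay` as `object`, so any date arithmetic would first need a conversion. The sketch below shows one way to do that with `pd.to_datetime`. It is an illustration on two made-up rows in the dataset's timestamp format, and `waiting_days` is a hypothetical column name.

```python
import pandas as pd

# Two made-up rows that follow the dataset's ISO 8601 timestamp format
toy = pd.DataFrame({
    "ScheduledDay": ["2016-04-27T18:38:08Z", "2016-04-29T16:08:27Z"],
    "AppointmentDay": ["2016-04-29T00:00:00Z", "2016-04-29T00:00:00Z"],
})

# Parse the strings into timezone-aware datetimes
for col in ["ScheduledDay", "AppointmentDay"]:
    toy[col] = pd.to_datetime(toy[col])

# Compare calendar dates only, because AppointmentDay carries no time of day.
# `waiting_days` is a hypothetical column name used for illustration.
toy["waiting_days"] = (
    toy["AppointmentDay"].dt.normalize() - toy["ScheduledDay"].dt.normalize()
).dt.days
print(toy)
```

Normalizing both columns to midnight avoids negative gaps for same-day appointments, where the scheduling timestamp is later in the day than the midnight appointment timestamp.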

In [4]:
# Examine the number of columns and rows in the data
df.shape

(110527, 14)

> The dataset has 14 columns and 110,527 rows.

In [5]:
# Check whether the data has any duplicated records
df.duplicated().sum()

0

> The dataset has no fully duplicated rows.
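
`df.duplicated()` with no arguments flags only rows that match in every column, so a column that is unique per row (such as an appointment identifier) guarantees a count of zero. The sketch below contrasts that with a subset check on made-up rows. Which columns define a "repeat booking" is an assumption for illustration, and whether such rows are errors is a judgment call.

```python
import pandas as pd

# Made-up rows: appointments 1 and 2 share patient and timestamps
toy = pd.DataFrame({
    "AppointmentID": [1, 2, 3],
    "PatientId": [10.0, 10.0, 11.0],
    "ScheduledDay": ["2016-04-29T16:08:27Z"] * 3,
    "AppointmentDay": ["2016-04-29T00:00:00Z"] * 3,
})

# Full-row check: never fires here because AppointmentID differs on every row
full_row_dupes = toy.duplicated().sum()

# Subset check: ignores AppointmentID (the subset choice is an assumption)
same_booking = toy.duplicated(
    subset=["PatientId", "ScheduledDay", "AppointmentDay"]
).sum()

print(full_row_dupes, same_booking)
```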

In [10]:
# check for records with null values.
df.isnull().sum()

PatientId         0
AppointmentID     0
Gender            0
ScheduledDay      0
AppointmentDay    0
Age               0
Neighbourhood     0
Scholarship       0
Hipertension      0
Diabetes          0
Alcoholism        0
Handcap           0
SMS_received      0
No-show           0
dtype: int64

> There are no null values in the dataset.

In [6]:
# Count the unique appointment records by appointment ID
df['AppointmentID'].nunique()

110527

> There are 110,527 unique appointment records in the dataset, one per row.

In [7]:
# How many unique patients are captured in the dataset?
df['PatientId'].nunique()

62299

> 62,299 unique patients have been captured in the dataset. <br>
> - Since this is fewer than the 110,527 appointments, some patients booked more than one appointment.
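
To see how appointments are spread across patients, `value_counts` on the patient identifier gives the number of appointments per patient. This is a small sketch on made-up IDs. The variable names are hypothetical, and the numbers say nothing about the real dataset.

```python
import pandas as pd

# Made-up patient IDs: patient 10 appears twice, patient 12 three times
toy = pd.DataFrame({"PatientId": [10.0, 10.0, 11.0, 12.0, 12.0, 12.0]})

# Number of appointments booked by each patient
appointments_per_patient = toy["PatientId"].value_counts()

# Patients with more than one appointment
repeat_patients = (appointments_per_patient > 1).sum()

print(appointments_per_patient)
print(repeat_patients)
```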

In [8]:
# How many appointments were made by patients with hypertension in the dataset?
df['Hipertension'].value_counts()

0    88726
1    21801
Name: Hipertension, dtype: int64

> Of all the appointments that were recorded, 21,801 were made by patients who had hypertension.

In [9]:
# How many appointments were made by patients with Scholarship?
df['Scholarship'].value_counts()

0    99666
1    10861
Name: Scholarship, dtype: int64

> Of all the appointments that were recorded, 10,861 were made by patients enrolled in the Brazilian welfare program Bolsa Família.
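
Raw counts are easier to compare as shares of the total. The sketch below recomputes the shares from the `value_counts` outputs shown above. Only the counts come from the notebook, and the variable names are hypothetical. On the full DataFrame, `value_counts(normalize=True)` would give the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts outputs above (0 = No, 1 = Yes)
hipertension_counts = pd.Series({0: 88726, 1: 21801})
scholarship_counts = pd.Series({0: 99666, 1: 10861})

# Convert counts to proportions of all appointments
hipertension_share = hipertension_counts / hipertension_counts.sum()
scholarship_share = scholarship_counts / scholarship_counts.sum()

print(hipertension_share.round(3))
print(scholarship_share.round(3))
```

Roughly a fifth of the appointments involve patients with hypertension, and roughly a tenth involve patients enrolled in Bolsa Família.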

### _**Remarks:**_ The dataset:
> 1. Does not have any fully duplicated rows. <br>
> 2. Has no null values. <br>
> 3. Has datatypes that are adequate for this analysis, although `PatientId` is stored as a float and the date columns as strings. <br>
**The only data cleaning step needed is to drop the unnecessary columns.**

## Data cleaning

In [11]:
# Make a copy of the original dataset
df_new = df.copy()

# Verify that the data was copied successfully
df_new.head(3)

| | PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29872500000000.0 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 1 | 558997800000000.0 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 2 | 4262962000000.0 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29T00:00:00Z | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |


In [12]:
# Display all the columns for ease of reference
df_new.columns

Index(['PatientId', 'AppointmentID', 'Gender', 'ScheduledDay',
       'AppointmentDay', 'Age', 'Neighbourhood', 'Scholarship', 'Hipertension',
       'Diabetes', 'Alcoholism', 'Handcap', 'SMS_received', 'No-show'],
      dtype='object')

> We need to drop the `PatientId` and `AppointmentDay` columns.

In [13]:
# Drop the columns. drop() returns a new DataFrame, so assign the result back
df_new = df_new.drop(columns=['PatientId', 'AppointmentDay'])

# Verify that the columns were dropped successfully
df_new.columns

Index(['AppointmentID', 'Gender', 'ScheduledDay', 'Age', 'Neighbourhood',
       'Scholarship', 'Hipertension', 'Diabetes', 'Alcoholism', 'Handcap',
       'SMS_received', 'No-show'],
      dtype='object')
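
`DataFrame.drop` returns a new DataFrame and leaves the original unchanged unless the result is assigned back (or `inplace=True` is passed). The sketch below demonstrates the difference on a made-up three-column frame with hypothetical variable names.

```python
import pandas as pd

# Made-up frame with a subset of the dataset's column names
toy = pd.DataFrame({
    "PatientId": [10.0, 11.0],
    "AppointmentDay": ["2016-04-29T00:00:00Z", "2016-04-29T00:00:00Z"],
    "Age": [62, 56],
})

# Calling drop without assignment discards the result; `toy` is unchanged
toy.drop(columns=["PatientId", "AppointmentDay"])
columns_without_assignment = list(toy.columns)

# Assigning the result back keeps the reduced frame
toy_clean = toy.drop(columns=["PatientId", "AppointmentDay"])
columns_with_assignment = list(toy_clean.columns)

print(columns_without_assignment)
print(columns_with_assignment)
```

Checking `.columns` right after the drop, as the cell above does, is a quick way to confirm that the change actually took effect.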