## Introduction
---
The dataset provided by [Kaggle](https://www.kaggle.com/joniarroba/noshowappointments) comprises observational data from about 100k medical appointments in Brazil and focuses on whether patients show up for their appointments. It includes several patient characteristics, for instance Gender and Age, as well as each patient's health designations, e.g., Hypertension, Alcoholism, and Diabetes. This notebook examines half a year of appointment data.

Questions to be explored:

- What is the patient age distribution of no-shows versus shows?

- What is the patient health designation breakdown of no-shows versus shows?

- Which neighbourhoods have the highest number of no-shows?

- What is the breakdown of no-shows versus shows among patients who received SMS updates?

- Do no-shows have a longer gap between the schedule date and the appointment date?

- Do appointment no-shows occur on specific days of the week?

- Are there any patterns in appointment no-shows versus shows over time?

### Section I: Data Wrangling
---

__Importing the data analysis and visualization packages__ used in this analysis:

- Pandas.
- NumPy.
- Matplotlib.
- Seaborn.

In [1]:
# Data analysis packages:
import pandas as pd
import numpy as np
#from datetime import datetime as dt

# Visualization packages:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline



__Loading data__ into the Jupyter notebook

In [2]:
df = pd.read_csv("noshowappointments-kagglev2-may-2016.csv")
df.head()

| | PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29872500000000.0 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 1 | 558997800000000.0 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 2 | 4262962000000.0 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29T00:00:00Z | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 3 | 867951200000.0 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29T00:00:00Z | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 4 | 8841186000000.0 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No |


__Information about the Dataset__

In [3]:
# Read general information about the dataset:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
PatientId         110527 non-null float64
AppointmentID     110527 non-null int64
Gender            110527 non-null object
ScheduledDay      110527 non-null object
AppointmentDay    110527 non-null object
Age               110527 non-null int64
Neighbourhood     110527 non-null object
Scholarship       110527 non-null int64
Hipertension      110527 non-null int64
Diabetes          110527 non-null int64
Alcoholism        110527 non-null int64
Handcap           110527 non-null int64
SMS_received      110527 non-null int64
No-show           110527 non-null object
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB


The dataset includes 14 columns and 110527 rows.
We will explore the no-show appointments dataset using these variables:

- PatientId: Identification of a patient 
- AppointmentID: Identification of every appointment
- Gender: Male or Female 
- ScheduledDay: The day somebody called or registered the appointment
- AppointmentDay: The day of the actual appointment, when the patient needs to visit the doctor
- Age: How old is the patient 
- Neighbourhood: Where the appointment takes place
- Scholarship: True or False, shows if the patient is in the Bolsa Familia program 
- Hipertension: True or False 
- Diabetes: True or False 
- Alcoholism: True or False 
- Handcap: documented as True or False (the `df.describe()` output below shows values from 0 to 4)
- SMS_received: 1 or more messages sent to the patient
- No-show: "No" indicates that the patient showed up for the appointment, and "Yes" indicates that they did not

The `df.info()` output above suggests that the dataset has no missing values, since every attribute has 110527 non-null entries; this is verified below. However, a non-null count does not guarantee that the values themselves are consistent.

__Observations:__

- Converting the PatientId data type from float to int will make the dataset easier to work with
- Separating out the time component of the ScheduledDay column will make the dataset easier to work with
- Separating out the time component of the AppointmentDay column will make the dataset easier to work with
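
As a rough illustration of these observations, the sketch below applies the conversions to two rows copied from the `df.head()` output above. The frame name `sample` and the helper columns `ScheduledDate` and `ScheduledTime` are assumptions made for this example, not names used elsewhere in the notebook.

```python
import pandas as pd

# Two rows copied from the df.head() output; other columns omitted for brevity.
sample = pd.DataFrame({
    "PatientId": [29872500000000.0, 558997800000000.0],
    "ScheduledDay": ["2016-04-29T18:38:08Z", "2016-04-29T16:08:27Z"],
    "AppointmentDay": ["2016-04-29T00:00:00Z", "2016-04-29T00:00:00Z"],
})

# float64 -> int64. Note that astype truncates any fractional part,
# so fractional IDs would need to be reviewed before converting.
sample["PatientId"] = sample["PatientId"].astype("int64")

# Parse the ISO 8601 strings into timezone-aware timestamps.
for col in ["ScheduledDay", "AppointmentDay"]:
    sample[col] = pd.to_datetime(sample[col])

# Hypothetical helper columns: calendar date and time of day of the booking.
sample["ScheduledDate"] = sample["ScheduledDay"].dt.date
sample["ScheduledTime"] = sample["ScheduledDay"].dt.time

print(sample.dtypes)
```

Keeping the date and the time in separate columns is only one option; comparing the parsed timestamps directly would also work for measuring the gap between scheduling and appointment.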

__Checking__ for __null values__

In [4]:
# Check whether there are any missing values
df.isnull().sum()

PatientId         0
AppointmentID     0
Gender            0
ScheduledDay      0
AppointmentDay    0
Age               0
Neighbourhood     0
Scholarship       0
Hipertension      0
Diabetes          0
Alcoholism        0
Handcap           0
SMS_received      0
No-show           0
dtype: int64

__Checking__ for __duplicate values__

In [5]:
# Check for duplicate rows 
df.duplicated().sum()

0

In [6]:
# Check for duplicate appointmentID
df.AppointmentID.duplicated().sum()

0

__Observations:__
- No missing values exist in the dataset.
- No duplicate rows or duplicate AppointmentID values exist in the dataset.

__Checking__ whether the dataset has valid values

In [7]:
df.describe()

| | PatientId | AppointmentID | Age | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received |
|---|---|---|---|---|---|---|---|---|---|
| count | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 |
| mean | 147496300000000.0 | 5675305.0 | 37.088874 | 0.098266 | 0.197246 | 0.071865 | 0.0304 | 0.022248 | 0.321026 |
| std | 256094900000000.0 | 71295.75 | 23.110205 | 0.297675 | 0.397921 | 0.258265 | 0.171686 | 0.161543 | 0.466873 |
| min | 39217.84 | 5030230.0 | -1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 4172614000000.0 | 5640286.0 | 18.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 31731840000000.0 | 5680573.0 | 37.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 94391720000000.0 | 5725524.0 | 55.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| max | 999981600000000.0 | 5790484.0 | 115.0 | 1.0 | 1.0 | 1.0 | 1.0 | 4.0 | 1.0 |


__Observations:__

- Minimum Age appears as -1. Any such patient will be assumed to be a newborn.
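
Out-of-range values such as the -1 minimum Age, or the Handcap maximum of 4 in the same `df.describe()` output, can be located with boolean masks. The sketch below is a minimal example on a small made-up frame; `sample`, `negative_age`, and `non_binary_handcap` are hypothetical names, and the rows are illustrative values that echo the reported extremes rather than actual records.

```python
import pandas as pd

# Illustrative rows only: the values echo the extremes reported by df.describe()
# (Age from -1 to 115, Handcap from 0 to 4); they are not actual records.
sample = pd.DataFrame({
    "Age": [-1, 0, 37, 115],
    "Handcap": [0, 1, 4, 0],
})

# Rows whose Age is below zero.
negative_age = sample[sample["Age"] < 0]

# Rows where a column documented as True/False holds something other than 0 or 1.
non_binary_handcap = sample[~sample["Handcap"].isin([0, 1])]

print(len(negative_age), "row(s) with negative Age")
print(sample["Handcap"].value_counts().sort_index())
```

Running the same masks against `df` would show how many rows are affected, which helps in deciding whether to keep, recode, or drop them.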

### Data Cleaning

__Converting__ the 