
# Project: Investigate a Dataset (No-Show Appointments)

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
# Introduction

> **I explore the Medical Appointment No Show dataset, which collects information from about 100k medical appointments in Brazil. I will use it to explore what helps predict whether a patient will show up, by answering these questions:**
>* Q1: Does age have any impact on whether a patient shows up?
>* Q2: Does receiving an SMS have any impact on whether a patient shows up?



In [8]:
# importing libraries
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline





<a id='wrangling'></a>
## Data Wrangling

> In this section I load the data, inspect it, and clean it.

### General Properties

In [9]:
# Load the data and print the first few rows, then inspect the data
#   types and look for missing or possibly errant values.
df = pd.read_csv('noshowappointments.csv')
df.head(5)

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,29872500000000.0,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,558997800000000.0,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,4262962000000.0,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,867951200000.0,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,8841186000000.0,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,1,1,0,0,0,No


In [10]:
# see the general information 
# checking for data types and missing data

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
PatientId         110527 non-null float64
AppointmentID     110527 non-null int64
Gender            110527 non-null object
ScheduledDay      110527 non-null object
AppointmentDay    110527 non-null object
Age               110527 non-null int64
Neighbourhood     110527 non-null object
Scholarship       110527 non-null int64
Hipertension      110527 non-null int64
Diabetes          110527 non-null int64
Alcoholism        110527 non-null int64
Handcap           110527 non-null int64
SMS_received      110527 non-null int64
No-show           110527 non-null object
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB


**No missing values: every column has 110527 non-null entries.**
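The same check can be made explicit by counting missing values per column. The following is a minimal sketch on a small made-up frame (the values are illustrative, not rows from the dataset), and `missing_per_column` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; not rows from noshowappointments.csv
toy = pd.DataFrame({
    'Age': [62, 56, np.nan],
    'Gender': ['F', 'M', 'F'],
})

def missing_per_column(frame):
    """Return the number of missing values in each column."""
    return frame.isnull().sum()

missing = missing_per_column(toy)
print(missing)
```

On the real data, a result of zero for every column would agree with the `df.info()` output above.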

In [11]:
df.describe()

Unnamed: 0,PatientId,AppointmentID,Age,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received
count,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0
mean,147496300000000.0,5675305.0,37.088874,0.098266,0.197246,0.071865,0.0304,0.022248,0.321026
std,256094900000000.0,71295.75,23.110205,0.297675,0.397921,0.258265,0.171686,0.161543,0.466873
min,39217.84,5030230.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,4172614000000.0,5640286.0,18.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,31731840000000.0,5680573.0,37.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,94391720000000.0,5725524.0,55.0,0.0,0.0,0.0,0.0,0.0,1.0
max,999981600000000.0,5790484.0,115.0,1.0,1.0,1.0,1.0,4.0,1.0


> **As the summary shows, there is an obvious data error: the minimum age is -1, which is clearly wrong.**
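Before changing anything, it can help to count how many rows are affected. The sketch below uses a small made-up frame and a hypothetical helper, `count_invalid_ages`; it assumes that only negative ages are invalid, which is a choice rather than something the dataset dictates.

```python
import pandas as pd

# Toy data for illustration only; not rows from noshowappointments.csv
sample = pd.DataFrame({'Age': [62, 56, -1, 8, 0, 115]})

def count_invalid_ages(frame, column='Age'):
    """Return how many rows have a negative value in the given column."""
    return int((frame[column] < 0).sum())

invalid_count = count_invalid_ages(sample)
print(invalid_count)
```

Running the same helper on `df` would show whether the -1 is a single bad entry or a wider problem.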

In [12]:
df.shape

(110527, 14)

In [13]:
# check for any duplicated rows
sum(df.duplicated())

0

### Data Cleaning
Drop the unused columns, rename the awkwardly named column, and correct the invalid ages.

In [14]:
# dropping the columns I don't need
df.drop(['PatientId', 'AppointmentID', 'AppointmentDay', 'Hipertension', 'Neighbourhood'], axis=1, inplace=True)

In [15]:
# rename the awkwardly named 'No-show' column ('No' means the patient showed up)
df.rename(columns={'No-show': 'Noshowing'}, inplace=True)
df.head(3)

Unnamed: 0,Gender,ScheduledDay,Age,Scholarship,Diabetes,Alcoholism,Handcap,SMS_received,Noshowing
0,F,2016-04-29T18:38:08Z,62,0,0,0,0,0,No
1,M,2016-04-29T16:08:27Z,56,0,0,0,0,0,No
2,F,2016-04-29T16:19:04Z,62,0,0,0,0,0,No


In [None]:
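# Correct the age mistake: replace invalid ages with the median age.
# Note: the condition `< 1` also replaces ages of 0, not only the -1 entry.
median_age = df['Age'].median()  # compute once instead of on every call

def replaceAge(i):
    if i < 1:
        return median_age
    else:
        return i

# apply the function to every value in the column
df['Age'] = df['Age'].apply(replaceAge)

df.describe()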

* The age mistake is corrected, so the data is now ready for analysis.
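The same replacement can also be written without a per-row function. This is a minimal sketch using `Series.mask` on made-up ages; the helper name `replace_low_ages` and its `threshold` parameter are hypothetical, with the default chosen to mirror the `< 1` condition used in the cell above.

```python
import pandas as pd

# Toy data for illustration only; not values from the dataset
ages = pd.Series([62, 56, -1, 8, 0, 115], name='Age')

def replace_low_ages(series, threshold=1):
    """Replace values below `threshold` with the median of the original series."""
    median_age = series.median()
    # mask() swaps in median_age wherever the condition is True
    return series.mask(series < threshold, median_age)

cleaned = replace_low_ages(ages)
print(cleaned.tolist())
```

Passing `threshold=0` would leave ages of 0 untouched and replace only negative values, if infants should be kept as valid patients.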

<a id='eda'></a>
## Exploratory Data Analysis

> In this section I visualize the data to answer my questions.



In [None]:
# histograms of every numeric column, for an overview of the whole dataset
df.hist(figsize=(10, 8));

### Research Question 1 (Does age affect showing up?)

* To answer this question, I compare the ages of patients who showed up with those who did not, looking at the average age and at the age ranges with the most and the fewest attendees.

In [None]:
# separate the rows into show and no-show masks
# note: Noshowing == 'No' means the patient showed up
show = df.Noshowing == 'No'
noshow = df.Noshowing == 'Yes'

In [None]:
df.Age[show].mean()

In [None]:
df.Age[noshow].mean()

In [None]:
# overlay the age histograms of both groups to compare their distributions
plt.figure(figsize=(20,20))
df.Age[show].hist(label='show', alpha=0.5, color='c')
df.Age[noshow].hist(label='noshow', alpha=0.5, color='r')
plt.legend()
plt.title('Showing up by age')
plt.xlabel('Age')
plt.ylabel('Count');

* Patients between 37 and 44 are the largest group among those who showed up, while the oldest patients show up the least.
* The average age of patients who showed up is about 37.78.
* The average age of patients who did not show up is about 34.28.
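Histogram counts depend on how many patients fall in each age range, so it may also help to look at the show-up rate inside each range. The sketch below is one way to do that with `pd.cut`; the data is made up, and the helper name `show_rate_by_age_group` and the bin edges are assumptions for illustration.

```python
import pandas as pd

# Toy data for illustration only; 'No' means the patient showed up
toy = pd.DataFrame({
    'Age': [5, 12, 25, 33, 41, 47, 58, 66, 72, 90],
    'Noshowing': ['No', 'Yes', 'Yes', 'No', 'No', 'No', 'No', 'Yes', 'No', 'No'],
})

def show_rate_by_age_group(frame, bins):
    """Return the share of kept appointments within each age bin."""
    showed_up = frame['Noshowing'].eq('No')
    # right=False makes each bin include its left edge, e.g. [0, 20)
    age_group = pd.cut(frame['Age'], bins=bins, right=False)
    return showed_up.groupby(age_group, observed=False).mean()

rates = show_rate_by_age_group(toy, bins=[0, 20, 40, 60, 120])
print(rates)
```

A rate per bin separates "many patients of this age exist" from "patients of this age tend to show up", which a count histogram cannot do on its own.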



### Research Question 2 (Does receiving an SMS have any impact on whether a patient shows up?)
* Percentages give a clearer view of the data, so I will use a pie chart.

In [None]:
# use groupby to count every combination of the two variables
rn = df.groupby(['Noshowing','SMS_received'])['Noshowing'].count()
rn
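The grouped counts come back sorted by the index pairs, that is (No, 0), (No, 1), (Yes, 0), (Yes, 1), where `Noshowing == 'No'` means the patient showed up. To keep chart labels in step with that order, one option is to build them from the index itself. This is a sketch on made-up data; `describe_group` is a hypothetical helper.

```python
import pandas as pd

# Toy data for illustration only; 'No' means the patient showed up
toy = pd.DataFrame({
    'Noshowing':    ['No', 'No', 'No', 'Yes', 'Yes', 'No'],
    'SMS_received': [0,    1,    0,    0,     1,     0],
})

counts = toy.groupby(['Noshowing', 'SMS_received'])['Noshowing'].count()

def describe_group(noshow_value, sms_value):
    """Build a readable label from one (Noshowing, SMS_received) index pair."""
    sms_text = 'received SMS' if sms_value == 1 else 'did not receive SMS'
    show_text = 'showed up' if noshow_value == 'No' else 'did not show up'
    return f'{sms_text} and {show_text}'

# Labels are generated in the same order as the grouped counts
labels = [describe_group(noshow, sms) for noshow, sms in counts.index]
print(list(zip(labels, counts.tolist())))
```

Labels derived this way cannot drift out of order with the slices, which a hand-written list can.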

In [None]:
# create the pie chart
# labels follow the groupby order: (No, 0), (No, 1), (Yes, 0), (Yes, 1); Noshowing == 'No' means the patient showed up
labels = ['did not receive SMS and showed up', 'received SMS and showed up', 'did not receive SMS and did not show up', 'received SMS and did not show up']
colors = ['#ff87b0', '#fffe87', '#46d7f1', '#fda945']
# styling
plt.pie(rn,
        labels=labels,
        colors=colors,
        shadow=True,
        wedgeprops={'edgecolor': 'grey'},
        explode=[0.02, 0.02, 0.02, 0.02],
        radius=1,
        autopct='%1.1f%%')
plt.title("Showing up for the appointment");

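* As the chart shows, patients who received an SMS showed up, which suggests that receiving an SMS encourages patients to attend their appointments. Because each slice is a share of all appointments, this reading should be confirmed by comparing show-up rates within each SMS group.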
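A pie chart of all four combinations shows each group as a share of all appointments. To compare patients who received an SMS with those who did not, one option is to normalize within each SMS group using `pd.crosstab`. The following is a minimal sketch on made-up data; `show_rate_by_sms` is a hypothetical helper.

```python
import pandas as pd

# Toy data for illustration only; 'No' means the patient showed up
toy = pd.DataFrame({
    'SMS_received': [0, 0, 0, 0, 1, 1, 1, 1],
    'Noshowing':    ['No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes'],
})

def show_rate_by_sms(frame):
    """Return the share of each Noshowing outcome within every SMS group."""
    # normalize='index' makes every row (SMS group) sum to 1
    return pd.crosstab(frame['SMS_received'], frame['Noshowing'], normalize='index')

rates = show_rate_by_sms(toy)
print(rates)
```

Each row of the result sums to 1, so the `No` column can be read directly as the show-up rate of that SMS group.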


<a id='conclusions'></a>

### Limitations
 * I was not sure which chart types were best suited to visualizing and answering my questions.
 * I did not know how to draw a pie chart at first, but I learned.
 * The dataset has few numerical variables (only age), which made it difficult to find correlations, since most of the other variables are categorical or binary flags.


## Conclusions

* Women outnumber men in this dataset.
* Patients between 37 and 44 show up to their appointments the most.
* Few patients received an SMS, but those who did still showed up.






In [None]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])
