

# Project: Investigate a Dataset 

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

This dataset collects information from more than 100K medical appointments in Brazil and focuses on whether or not patients show up for their appointment.

This notebook works through the following steps:
* general information about the dataset.
* how many unique values each column has.
* missing or duplicate values in the dataset.
* negative values in age and how to drop them from the dataset.
* how to rename the columns.
* creating dummy values for gender and merging them with the original df.
* creating a binning function for cutting the age column into groups.

For the analysis in this notebook I have chosen to look at the following 3 questions (some sub-questions are examined in detail as part of the analysis):
* are there more no-shows among patients with alcoholism, or vice versa?
* how many patients are in each group?
* does gender influence whether a patient shows up to an appointment?
### Dataset Description 
* PatientId: identifier of the patient.
* AppointmentID: identifier of the appointment.
* Gender: Female or Male.
* ScheduledDay: the date and time at which the appointment was booked.
* AppointmentDay: the day of the appointment itself.
* Age: how old the patient is.
* Neighbourhood: where the appointment takes place.
* Scholarship: whether the patient is enrolled in the Bolsa Família welfare program.
* Hipertension: whether the patient has hypertension.
* Diabetes: whether the patient has diabetes.
* Alcoholism: whether the patient suffers from alcoholism.
* Handcap: whether the patient has a handicap.
* SMS_received: whether the patient received an SMS about the appointment.
* No-show: "No" if the patient showed up to the appointment, "Yes" if the patient did not show up.


### Question(s) for Analysis
* general information about the dataset.
* how many unique values each column has.
* missing or duplicate values in the dataset.
* negative values in age and how to drop them from the dataset.
* how to rename the columns.
* creating dummy values for gender and merging them with the original df.
* creating a binning function for cutting the age column into groups.

In [None]:
# Use this cell to set up import statements for all of the packages that you
#   plan to use.

# Remember to include a 'magic word' so that your visualizations are plotted
#   inline with the notebook. See this page for more:
#   http://ipython.readthedocs.io/en/stable/interactive/magics.html
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# Upgrade pandas to use dataframe.explode() function. 
!pip install --upgrade pandas==0.25.0

<a id='wrangling'></a>
## Data Wrangling

In this section we:
- load the data and print out a few lines.
- inspect data types and look for instances of missing or possibly errant data.
- review the dataset information.
- check the dataset shape.
- count the unique values in each column.
- check for missing values.
- check for duplicate rows.
- check for negative values in age.

Tip: in this section we load the data, check for missing values, and clean the dataset for analysis. Document each step carefully and justify it.

In [None]:
# Load your data and print out a few lines. Perform operations to inspect data
#   types and look for instances of missing or possibly errant data.
df = pd.read_csv('noshowappointments-kagglev2-may-2016.csv')
df.head()

In [None]:
#information of dataset
df.info()

There is no missing data.

In [None]:
#dataset shape
df.shape

In [None]:
df.describe()

- The mean age is 37 years.
- The maximum age is 115 years.
- The minimum age is -1, which is probably a mistake in one record.

In [None]:
#number of unique values in each column
df.nunique()

In [None]:
#missing value in dataset
df.isnull().sum()

In [None]:
#check whether there are duplicate rows
sum(df.duplicated())

In [None]:
#check for negative values in age
df.Age.max(), df.Age.min()


### Data Cleaning

In this part we will:
+ drop the negative age from the dataset.
+ rename the columns.
+ create dummy values for gender.
+ merge the dummy values into the original df.
+ drop the gender column.
+ create a binning function for cutting the age column into groups.

In [None]:
# Find the rows with a negative age (Age is numeric, so compare against a number)
df.query('Age < 0')

In [None]:
#drop negative age from dataset
# select the rows by condition instead of a hard-coded position
df.drop(df[df['Age'] < 0].index, inplace=True)
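
A minimal sketch of removing rows with a negative age through a boolean mask, shown on a small made-up frame. The `toy` data and the `clean` name are hypothetical and only illustrate the step; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative only.
toy = pd.DataFrame({"Age": [34, -1, 62, 0], "No-show": ["No", "No", "Yes", "No"]})

# Keep only rows whose age is non-negative, then renumber the index.
clean = toy[toy["Age"] >= 0].reset_index(drop=True)
print(clean)
```

Filtering by condition does not depend on where the bad row sits in the file, so it keeps working if the data is re-sorted or reloaded.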

In [None]:
df.info()

In [None]:
df.head(1)

In [None]:
#rename the columns
df.rename(columns = {'PatientId':'Patient_Id', 'AppointmentID':'Appointment_ID', 'ScheduledDay':'Scheduled_Day', 'AppointmentDay':'Appointment_Day', 'Hipertension':'Hypertension', 'No-show':'No_show'}, inplace=True)
df.head()

In [None]:
#creating dummy values of gender
dummy = pd.get_dummies(df['Gender'])
dummy.head()

In [None]:
#merge dummy values to original df
df = pd.concat([df, dummy], axis = 1)
df.head()

In [None]:
#drop gender column 
df.drop(['Gender'], axis = 1, inplace = True)
df.head(3)
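
The three cells above (create dummies, concatenate, drop the original column) can be sketched in one place. This is an illustration on made-up data, assuming a `Gender` column with the values `F` and `M`; the `toy` and `encoded` names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for the appointments table.
toy = pd.DataFrame({"Gender": ["F", "M", "F"], "Age": [25, 40, 8]})

# One indicator column per gender value, cast to int so the columns hold 0/1.
dummies = pd.get_dummies(toy["Gender"]).astype(int)

# Attach the indicators and drop the original text column.
encoded = pd.concat([toy, dummies], axis=1).drop(columns=["Gender"])
print(encoded)
```

After this step the `F` column is 1 for female patients and 0 for male patients, which is how it is used later in the gender question.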

In [None]:
#creating a binning function for cutting the age column into groups

def binning(dataframe, cut_points, labels = None):
    #Define min and max values of the column being binned
    minimum = dataframe.min()
    maximum = dataframe.max()
    
    #create list by adding min and max to cut_points
    break_points = [minimum] + cut_points + [maximum]
    
    #Binning using cut function of pandas (include_lowest keeps the minimum value in the first bin)
    dataframeBin = pd.cut(dataframe, bins = break_points, labels = labels, include_lowest = True)
    return dataframeBin

#Binning age: 10 cut points plus min and max give 11 bins, so 11 labels are needed
cut_points = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
labels = ["child", "teen", "young_adult", "40s", "50s", "retirement", "70s", "80s", "90s", "100s", "centenarian"]
df["Age_Bin"] = binning(df["Age"],cut_points, labels)
print(df["Age_Bin"].value_counts(sort = False))
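
To see how `pd.cut` assigns values to bins, here is a small worked sketch. The ages, edges and label names are made up for illustration; only `pd.cut` and its `bins`, `labels` and `include_lowest` parameters come from the cell above.

```python
import pandas as pd

ages = pd.Series([0, 5, 10, 11, 35, 115])  # hypothetical ages
edges = [0, 10, 20, 40, 115]               # bin edges, illustrative only
names = ["0-10", "11-20", "21-40", "41+"]  # one label per interval

# Intervals are right-closed: (0, 10], (10, 20], and so on.
# include_lowest=True makes the first interval also contain the lowest edge (age 0).
groups = pd.cut(ages, bins=edges, labels=names, include_lowest=True)
print(groups.value_counts(sort=False))
```

Note that `n + 1` edges define `n` intervals, so the number of labels must be one less than the number of edges.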

### General look at the data

In [None]:
# histogram for all dataset
df.hist(figsize=(15, 12))

- 18% or more of the patients suffered from hypertension.
- The number of patients who received an SMS is about half the number of those who did not.
- Most of the patients did not suffer from any of the listed diseases.

<a id='eda'></a>
## Exploratory Data Analysis

There are three questions:
* are there more no-shows among patients with alcoholism, or vice versa?
* how many patients are in each group?
* does gender influence whether a patient shows up to an appointment?

### Research Question 1: are there more no-shows among patients with alcoholism, or vice versa?

In [None]:
df.shape

In [None]:
# Count appointments by attendance status and by whether an SMS was received
df.groupby(["No_show", "SMS_received"]).size()
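
The question above concerns alcoholism, so a cross-tabulation of attendance against the `Alcoholism` column is one way to address it directly. The sketch below uses a made-up `toy` frame (hypothetical values); in the notebook the same `pd.crosstab` calls could be run on `df` after the columns have been renamed.

```python
import pandas as pd

# Hypothetical toy data; 1 marks a patient with alcoholism.
toy = pd.DataFrame({
    "Alcoholism": [0, 0, 0, 0, 1, 1],
    "No_show":    ["No", "No", "No", "Yes", "No", "Yes"],
})

# Raw counts per (Alcoholism, No_show) pair.
counts = pd.crosstab(toy["Alcoholism"], toy["No_show"])

# Row-wise shares: within each alcoholism group, the fraction per attendance status.
rates = pd.crosstab(toy["Alcoholism"], toy["No_show"], normalize="index")
print(counts)
print(rates)
```

Comparing the `Yes` column of `rates` between the two rows is a fairer comparison than raw counts, because the two groups differ greatly in size.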

### Research Question 2: how many patients do not show up?

In [None]:
# recreate the Age_Bin column to make sure that the groupby function is able to run
df["Age_Bin"] = binning(df["Age"],cut_points, labels)
# run the groupby function: count appointments by attendance status and age group
df.groupby(['No_show', 'Age_Bin']).size()
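
Counts per age group depend on how many patients each group contains, so a no-show rate per group can be easier to compare. A minimal sketch on made-up data (the `toy` frame and its values are hypothetical), assuming `No_show == 'Yes'` marks a missed appointment:

```python
import pandas as pd

# Hypothetical toy data with two age groups.
toy = pd.DataFrame({
    "Age_Bin": ["child", "child", "teen", "teen", "teen", "teen"],
    "No_show": ["Yes", "No", "Yes", "No", "No", "No"],
})

# The mean of a boolean column is the share of True values, i.e. the no-show rate.
missed = toy["No_show"].eq("Yes")
rate_by_bin = missed.groupby(toy["Age_Bin"]).mean()
print(rate_by_bin)
```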

In [None]:
#plot the age histogram of patients who missed their appointment and set the xtick properties
ax = df.loc[df['No_show'] == 'Yes', 'Age'].plot(kind = 'hist')
ax.set_xticks(cut_points)
#one label per tick: each tick marks the upper edge of its age group
ax.set_xticklabels(labels[:len(cut_points)], rotation = 60)
#set histogram labels and titles
plt.xlabel('Age Group', fontsize = 16)
plt.ylabel('Number of Patients', fontsize = 16)
plt.title('Number of "No_Show" Patients by Age Group', fontsize = 22)
plt.xlim([0, 120])
plt.show()


Here:
- children are the group that misses the most appointments.
- younger patients attend less regularly than older patients, who appear more committed to their appointments.

### Research Question 3: does gender influence whether a patient shows up to an appointment?

In [None]:
# count appointments by attendance status and gender (these counts feed the pie chart)
# column F is 1 for female and 0 for male
df.groupby(['No_show', 'F']).size()

In [None]:
#set pie chart properties (separate names so the binning labels and cut_points are not overwritten)
#No_show == 'No' means the patient showed up
pie_labels = 'Female, Showed', 'Female, No show', 'Male, Showed', 'Male, No show'
pie_sizes = [ 57245, 14594, 30962, 7725]
colors = ['red', 'green', 'blue', 'pink']

#set pie chart title and axis
fig1, ax = plt.subplots()
ax.pie(pie_sizes, labels = pie_labels, colors = colors, autopct = '%1.1f%%')
plt.title('Proportion of Female and Male Patients', fontsize = 18)

#equal aspect ratio ensures that pie is drawn as circle
ax.axis('equal')

#plot pie chart
plt.show()



Here:
- Women who showed up account for 51.8% of all appointments, while women who did not show account for 13.2%, so most women attended.
- Men who showed up account for 28.0% of all appointments, while men who did not show account for 7.0%, so most men attended as well.
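
The pie chart shows shares of all appointments, which mixes group size with attendance. To compare genders directly, the no-show rate can be computed within each gender. The sketch below reuses the four counts from the pie chart; reading `No_show == 'No'` as "attended" and the resulting mapping of counts to groups are assumptions.

```python
# Counts taken from the pie chart values; the mapping to groups is an assumption.
female_showed, female_missed = 57245, 14594
male_showed, male_missed = 30962, 7725

# No-show rate within each gender: missed / (showed + missed).
female_rate = female_missed / (female_showed + female_missed)
male_rate = male_missed / (male_showed + male_missed)

print(f"Female no-show rate: {female_rate:.1%}")
print(f"Male no-show rate:   {male_rate:.1%}")
```

Under these assumptions both rates come out close to 20%, which suggests that gender by itself has little influence on attendance.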



In [None]:
# define show and noshow
show = df.No_show == 'No'
noshow = df.No_show == 'Yes'

In [None]:
# histogram comparing patients who showed up with those who did not, by age
plt.figure(figsize=[14.70, 8.27])
df.Age[show].hist(alpha = 0.5, label = 'show')
df.Age[noshow].hist(alpha = 0.5, label = 'noshow')
plt.legend()
plt.title('Comparison of patients who showed up and those who did not, by age')
plt.xlabel('Age')
plt.ylabel('Patient Number')



Here:
- the 0-10 age group has the highest number of patients.
- the 80-100 age group has the lowest number of patients.



<a id='conclusions'></a>
## Conclusions





here:
- Age plays an important role: the 0-10 age group attends the most, followed by the 35-70 age group, while the elderly are the group that attends least often.
- Although it sounds a bit strange, more patients showed up at the clinic without having received the SMS.
- Limitation: the analysis could not detect a correlation between patients showing up or not and characteristics such as gender, chronic diseases, and disabilities.

In [None]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])