
# Project: No-Show Appointment Dataset

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

> The **No-show Appointment dataset** is a collection of over 100k medical appointments in Brazil. It aims to explain why patients do or do not show up for their scheduled appointments. It records the patients' neighbourhood, whether or not they are enrolled in the Brazilian welfare program, and the illnesses they have.
> Some other columns included are:
<ul>
    <li><b>PatientId</b></li>
    <li><b>AppointmentDay</b></li>
    <li><b>Age</b> - in years</li>
    <li><b>Gender</b></li>
    <li><b>No-show</b> (No - showed up/Yes - No-show)</li> 
</ul>

### Questions:
<ol>
    <li><a href='#q1'>What Age group(s) barely show up?</a></li>
    <li><a href='#q2'>What Day(s) of the week are people hardly showing up?</a></li>
    <li><a href='#q3'>What Illness(es)/Condition(s) have a higher rate of not showing up?</a></li>
</ol>

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling

### General Properties

In [None]:
df = pd.read_csv('noshowappointments.csv')

In [None]:
df.head()

Getting a first glance at how each data column is represented

In [None]:
df.info()

Checking the datatypes for each column and null values if any

In [None]:
df.duplicated().sum()
#checking for duplicated records

In [None]:
df.describe()

Checking the statistical properties for the data

<b>The data is relatively clean, but to make it easier to work with, the AppointmentDay column needs to be converted to datetime, the row with a negative age needs to be dropped, and the No-show column has to be converted to an integer datatype for analysis.</b>

### Data Cleaning:

<ul>
    <li>Dropping the row where Age is -1</li>
    <li>Converting the No-show column to binary</li>
    <li>Converting the AppointmentDay column to datetime object</li>
</ul>
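
As a rough sketch, the three steps listed above could also be bundled into one helper that filters by condition instead of by a hard-coded row label. The function name `clean_appointments` and the toy rows are made up for illustration; only the column names come from the dataset.

```python
import pandas as pd

def clean_appointments(raw):
    """Hypothetical helper: return a cleaned copy of the appointments frame."""
    # 1. keep only rows with a non-negative age
    cleaned = raw[raw['Age'] >= 0].copy()
    # 2. encode No-show as 0 (showed up) / 1 (no-show)
    cleaned['No-show'] = cleaned['No-show'].map({'No': 0, 'Yes': 1})
    # 3. parse the appointment date
    cleaned['AppointmentDay'] = pd.to_datetime(cleaned['AppointmentDay'])
    return cleaned

# Toy rows, invented for this example (not taken from the real file)
toy = pd.DataFrame({
    'Age': [34, -1, 7],
    'No-show': ['No', 'Yes', 'Yes'],
    'AppointmentDay': ['2016-04-29', '2016-06-06', '2016-05-03'],
})
cleaned = clean_appointments(toy)
print(cleaned.dtypes)
```

Working on a copy leaves the raw frame untouched, which makes it easy to re-run the cleaning from scratch.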

In [None]:
df.query('Age == -1')

In [None]:
df.drop(labels=99832, axis=0, inplace=True)
# dropped the row with Age equal to -1

In [None]:
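df['No-show'] = df['No-show'].map({'No': 0, 'Yes': 1})
# converting No-show column to a binary integer column (0 = showed up, 1 = no-show);
# assigning the result back avoids relying on an inplace replace through a column selection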

In [None]:
df['AppointmentDay'] = pd.to_datetime(df['AppointmentDay'])
# converting the whole column to datetime datatype in one vectorised call

In [None]:
df.head()

<a id='eda'></a>
## Exploratory Data Analysis


## Functions

#### plot function

In [None]:
def viz_plot(data, color, label_x, label_y, title):
    """Draw a bar chart from a dict mapping category labels to values."""
    categories = list(data.keys())   # x-axis categories
    values = list(data.values())     # bar heights
    
    fig = plt.figure(figsize=(12,8))
    plt.bar(categories, values, color=color)
    plt.xlabel(label_x, fontsize=14)
    plt.ylabel(label_y, fontsize=14)
    plt.title(title, fontsize=18)
    # hide the tick marks on both axes
    plt.tick_params(left=False, bottom=False)

<a id='q1'></a>
### The Age Effect

In [None]:
child_df = df.query('Age >= 1 and Age <=12')
teens_df = df.query('Age >=13 and Age <=19')
youths_df = df.query('Age >=20 and Age<=30')
Adults_df = df.query('Age >=31 and Age <=50')
Aged_df = df.query('Age >=51')

In [None]:
child_mean = child_df['No-show'].mean()
teens_mean = teens_df['No-show'].mean()
youths_mean = youths_df['No-show'].mean()
adults_mean = Adults_df['No-show'].mean()
aged_mean = Aged_df['No-show'].mean()

In [None]:
mean_no_show_age = {'Children':child_mean, 'Teens':teens_mean, 'Youths':youths_mean, 'Adults':adults_mean, 'Aged':aged_mean}

viz_plot(data=mean_no_show_age, color='maroon', label_x='Age Category', label_y='mean_No_show', title='Distribution of mean No-show Age Category')

<b>This shows the mean distribution of No-show across age categories</b>
<p>The Aged, Adults and Children attend more of their scheduled appointments than the Teens and Youths.</p>

In [None]:
fig = plt.figure(figsize=(10,8))
means = [child_mean, teens_mean, youths_mean, adults_mean, aged_mean]
labels = ['Children', 'Teens', 'Youths', 'Adults', 'Aged']
plt.plot(labels, means)
plt.xticks(fontsize=12)
plt.xlabel('Age Category', fontsize=14)
plt.ylabel('mean_No-show', fontsize=14)
plt.title('Line distribution of Mean_No-Show vs Age Category', fontsize=18)

This means <b>Youths</b> have a higher mean rate of not showing up for appointments

#### Age group:
<ul>
    <li>Children - 1-12 </li>
    <li>Teens - 13-19 </li>
    <li>Youths - 20-30</li>
    <li>Adults - 31-50</li>
    <li>Aged - 51 and above</li>
</ul>
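
The age groups listed above could also be assigned in a single step with `pd.cut`, followed by a `groupby`, rather than one query per group. The sketch below runs on a small made-up frame (`toy`), not on the real data, and the variable names are illustrative only.

```python
import pandas as pd

# Toy data, invented for illustration; the notebook itself would use `df`
toy = pd.DataFrame({
    'Age':     [0, 5, 12, 13, 19, 20, 30, 31, 50, 51, 80],
    'No-show': [0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1],
})

# Bin edges follow the age groups listed above. Right edges are inclusive,
# so (0, 12] covers ages 1-12 and age 0 falls outside every bin (NaN).
bins = [0, 12, 19, 30, 50, float('inf')]
labels = ['Children', 'Teens', 'Youths', 'Adults', 'Aged']
toy['age_group'] = pd.cut(toy['Age'], bins=bins, labels=labels)

# The mean of a 0/1 column is the share of ones, i.e. the no-show rate
rate_by_age = toy.groupby('age_group', observed=True)['No-show'].mean()
print(rate_by_age)
```

As with the `Age >= 1` query used earlier, patients with an age of 0 are left out of every group here.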

<a id='q2'></a>
### Day of the week Effect

In [None]:
df['day'] = df['AppointmentDay'].dt.day_name()
# making a column for the day of the week of the appointment

In [None]:
df.head()

In [None]:
mon_df = df.query('day == "Monday"')
tue_df = df.query('day == "Tuesday"')
wed_df = df.query('day == "Wednesday"')
thur_df = df.query('day == "Thursday"')
fri_df = df.query('day == "Friday"')
sat_df = df.query('day == "Saturday"')

> Making dataframes that contain respective days of the week only

In [None]:
mon_mean = mon_df['No-show'].mean()
tue_mean = tue_df['No-show'].mean()
wed_mean = wed_df['No-show'].mean()
thur_mean = thur_df['No-show'].mean()
fri_mean = fri_df['No-show'].mean()
sat_mean = sat_df['No-show'].mean()

> finding the mean no-show appointments for each day

In [None]:
mean_no_show_day = {'Mon':mon_mean,'Tue':tue_mean, 'Wed':wed_mean, 'Thurs':thur_mean, 'Fri':fri_mean, 'Sat':sat_mean}
viz_plot(data=mean_no_show_day, color='green', label_x='Day of week', label_y='No-show_mean', title='mean No-show vs Day of week')

> The graph shows that people miss more of the appointments scheduled towards the end of the week (Friday and Saturday) than on any other day. Thursday has the highest show-up rate.
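
A per-day mean says nothing about how many appointments sit behind it, so it can help to look at both together. The following is a sketch on made-up rows (the dates and outcomes in `toy` are invented), using one `groupby` in place of one dataframe per day.

```python
import pandas as pd

# Toy appointments, invented for illustration
toy = pd.DataFrame({
    'AppointmentDay': pd.to_datetime([
        '2016-05-02', '2016-05-02', '2016-05-03',
        '2016-05-06', '2016-05-06', '2016-05-07',
    ]),
    'No-show': [0, 1, 0, 1, 1, 0],
})
toy['day'] = toy['AppointmentDay'].dt.day_name()

# Number of appointments and no-show rate per weekday, in calendar order
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
summary = (
    toy.groupby('day')['No-show']
       .agg(appointments='size', no_show_rate='mean')
       .reindex(day_order)   # days with no rows show up as NaN
)
print(summary)
```

A day with very few appointments gives a noisy rate, which is worth keeping in mind when comparing days.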

In [None]:
show_days = df[df['No-show']==0]['day'].value_counts()

# data series trend for days of the week people show up for appointment

In [None]:
fig = plt.figure(figsize=(12,8))
# take counts and labels from the same Series so they always stay matched
data = show_days.values
labels = show_days.index
# pull out the first and fifth slices (value_counts sorts from largest to smallest)
explode = [0.1, 0, 0, 0, 0.1, 0]
plt.pie(data, labels=labels, autopct='%1.2f%%', explode=explode, shadow=True, startangle=60,textprops={'fontsize': 14})
plt.title('Days of the week with attended Appointments\n', fontsize=18)
plt.show()

Saturday has the fewest attended appointments, while Wednesday has the most. Because these are counts rather than rates, they also depend on how many appointments were scheduled on each day.

<a id='q3'></a>
### Illness Effect

In [None]:
hipertension = df[df['Hipertension'] ==1]
diabetes = df[df['Diabetes']==1]
alcoholism = df[df['Alcoholism']==1]
handicap = df[df['Handcap'] >0]

In [None]:
hiper_mean = hipertension['No-show'].mean()
diab_mean = diabetes['No-show'].mean()
alco_mean = alcoholism['No-show'].mean()
hand_mean = handicap['No-show'].mean()

In [None]:
illness_data = {'Hipertension':hiper_mean, 'Diabetes':diab_mean, 'Alcoholism':alco_mean, 'Handcap':hand_mean}
viz_plot(data=illness_data, color='#00008b', label_x='Illness', label_y='No-show_mean', title='mean No-show vs Illness type')

This means people with alcoholism tend to miss their appointments more often than people with the other conditions
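
One way to put these means in context is to compare, for each condition, the no-show rate of patients with the condition against those without it. The sketch below uses a tiny invented frame (`toy`); the column names match the dataset, everything else is an assumption made for illustration.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    'Hipertension': [1, 1, 0, 0, 0, 0],
    'Diabetes':     [0, 1, 1, 0, 0, 0],
    'Alcoholism':   [0, 0, 0, 1, 1, 0],
    'Handcap':      [0, 0, 2, 0, 0, 1],
    'No-show':      [0, 1, 0, 1, 1, 0],
})

rows = {}
for col in ['Hipertension', 'Diabetes', 'Alcoholism', 'Handcap']:
    has_condition = toy[col] > 0   # Handcap may hold values above 1
    rows[col] = {
        'with': toy.loc[has_condition, 'No-show'].mean(),
        'without': toy.loc[~has_condition, 'No-show'].mean(),
        'n_with': int(has_condition.sum()),
    }

# One row per condition: rate with, rate without, and group size
comparison = pd.DataFrame.from_dict(rows, orient='index')
print(comparison)
```

The `n_with` column shows how many patients each rate is based on.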

<a id='conclusions'></a>
## Conclusions

Based on the analysis performed, the factors examined suggest that:
<ul>
    <li>Patients aged 13-30 miss their appointments more often than the other age groups</li>
    <li>Appointments scheduled for Friday and Saturday are missed more often than those on other days</li>
    <li>Patients with alcoholism miss their appointments more often than patients with the other conditions examined</li>
 </ul>
 
 > Other factors, alone or in combination (for example age together with illness), may also contribute to patients missing their appointments. This analysis considered only the three factors listed above, although the data is rich enough to support many further analyses.
 
 ### Limitations
 > The data is imbalanced: there are more entries for patients who attended their appointments than for those who did not.
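
The size of this imbalance can be read off with `value_counts`. A minimal sketch on a made-up Series (the five values below are invented, not the real proportions):

```python
import pandas as pd

# Toy No-show column, invented for illustration: 0 = showed up, 1 = no-show
no_show = pd.Series([0, 0, 0, 0, 1], name='No-show')

counts = no_show.value_counts()                 # absolute counts per class
shares = no_show.value_counts(normalize=True)   # same, as fractions of the total
print(counts)
print(shares)
```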
 
 > The Handcap column has values greater than 1 (i.e. 2, 3, 4), unlike the other illness columns
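
One possible way to handle this, sketched on invented values, is to inspect the distinct levels first and then collapse them into a 0/1 flag, which mirrors the `Handcap > 0` filter used in the illness analysis. Whether collapsing the levels is appropriate depends on what they encode, which is not established here.

```python
import pandas as pd

# Toy Handcap values, invented for illustration
handcap = pd.Series([0, 1, 2, 0, 4, 3], name='Handcap')

# How often each level occurs
level_counts = handcap.value_counts().sort_index()

# Collapse every level above 0 into a single "has a handicap" flag
has_handicap = (handcap > 0).astype(int)
print(level_counts)
print(has_handicap.tolist())
```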
