

# Project: No-Show Appointments Analysis

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

### Dataset Description 

> In this project I work with the No-Show Appointments dataset. It records information on about 100k medical appointments of patients in Brazil. The focus is on whether or not patients show up for their appointments.
The dataset consists of 14 columns and 110527 rows. The columns are:
   <ol>
    <li>PatientId:         identifies the patient.</li>
    <li>AppointmentID:     uniquely identifies each appointment.</li>
    <li>Gender:            the sex of the patient.</li>
    <li>ScheduledDay:      the day on which the patient booked the appointment.</li>
    <li>AppointmentDay:    the day on which the appointment takes place.</li>
    <li>Age:               the age of the patient.</li>
    <li>Neighbourhood:     the location of the hospital.</li>
    <li>Scholarship:       whether or not the patient is enrolled in the Brazilian welfare program Bolsa Família.</li>
    <li>Hipertension:      whether the patient has hypertension (True or False).</li>
    <li>Diabetes:          whether the patient has diabetes (True or False).</li>
    <li>Alcoholism:        whether the patient suffers from alcoholism (True or False).</li>
    <li>Handcap:           whether the patient is handicapped (True or False).</li>
    <li>SMS_received:      whether the patient received an SMS (True or False).</li>
    <li>No-show:           whether the patient missed the appointment: "No" means the patient showed up, "Yes" means the patient did not.</li>
</ol>
 


### Project Questions
><p>What is the relationship between patients who attended their appointments and those who did not?</p>
 <p>Which gender shows up for appointments more often?</p>
 <p>Which age group shows up for appointments most often?</p>
 <p>Which disease has the highest incidence rate?</p>


In [1]:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns
%matplotlib inline


In [None]:
# load the dataset
df = pd.read_csv('noshowappointments-kagglev2-may-2016.csv')
df.head(1)

<a id='wrangling'></a>
## Data Wrangling




### General Properties

In [None]:
# load the dataset
df = pd.read_csv('noshowappointments-kagglev2-may-2016.csv')
df.head()

In [None]:
#the last 20 rows of the dataset
df.tail(20)

In [None]:
# showing the dataframe size
df.shape

In [None]:
# checking for duplicates in the dataset
df.duplicated().sum()

In [None]:
# checking for missing values
df.isnull().sum().any()

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
# unique values in each columns
df.nunique()

In [None]:
df.Gender.value_counts()

In [None]:
# age cannot be negative, so we locate the row where Age is -1
df.query('Age==-1')


### Data Cleaning


In [None]:
# drop columns that are not needed for this study:
# PatientId and AppointmentID are identifiers only
df.drop(['PatientId', 'AppointmentID'], axis=1, inplace=True)


In [None]:
# confirm changes
df.head(1)

In [None]:
# change column names to lowercase and - to _ for consistency in the code
df.rename(columns=lambda x: x.strip().lower().replace("-", "_"), inplace=True)


In [None]:
df.head()

In [None]:
# introduce _day to scheduledday appointmentday
df.rename(columns=lambda x: x.strip().lower().replace("day", "_day"), inplace=True)

In [None]:
# correct misspelt column names: handcap to handicap and hipertension to hypertension
df.rename(columns={'handcap' : 'handicap', 'hipertension':'hypertension'}, inplace=True)

In [None]:
# confirm changes
df.head(1)
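
The renaming steps above can be checked without the CSV file. The sketch below applies them to an empty DataFrame that carries the 14 column names from the dataset description. The `normalize` helper and the `demo` frame are hypothetical names introduced here for illustration; they are not part of the notebook.

```python
import pandas as pd

# original column names, as listed in the dataset description
original = ['PatientId', 'AppointmentID', 'Gender', 'ScheduledDay',
            'AppointmentDay', 'Age', 'Neighbourhood', 'Scholarship',
            'Hipertension', 'Diabetes', 'Alcoholism', 'Handcap',
            'SMS_received', 'No-show']

def normalize(name):
    # hypothetical helper: lowercase, "-" to "_", then "day" to "_day"
    return name.strip().lower().replace('-', '_').replace('day', '_day')

demo = pd.DataFrame(columns=original)
demo = demo.drop(columns=['PatientId', 'AppointmentID'])
demo = demo.rename(columns=normalize)
demo = demo.rename(columns={'handcap': 'handicap', 'hipertension': 'hypertension'})
print(list(demo.columns))
```

Only `scheduledday` and `appointmentday` contain the substring `day`, so the last replacement does not touch any other column.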

In [None]:
# drop the row with a negative age (the Age == -1 row found earlier)
df.drop(df[df['age'] < 0].index, inplace=True)

In [None]:
# confirm changes
df.query('age==-1')

In [None]:
df.hist(figsize=(10,8))

<a id='eda'></a>
## Exploratory Data Analysis



### 1. What is the relationship between patients who attended their appointments and those who did not?

In [None]:
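# helper that returns a count as a whole-number percentage of all appointments
def percentage_calculation(n):
    total = len(df)  # row count after cleaning, instead of a hard-coded total
    percent = (n / total) * 100
    return int(percent)  # int() truncates the decimal part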
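
Because `int()` truncates rather than rounds, the helper can understate a share by almost a full percentage point. A minimal sketch, using the gender counts that are passed to `percentage_calculation` further down in this notebook; the variable names are introduced here for illustration:

```python
# counts per gender: showed up + missed
women = 57245 + 14594
men = 30962 + 7725
total = women + men  # 110526 appointments

for label, n in [('women', women), ('men', men)]:
    share = n / total * 100
    # truncated value next to the value rounded to one decimal place
    print(label, int(share), round(share, 1))
```

The truncated shares for women and men add up to 99 rather than 100, which is why `round` may be the safer choice when the figures are quoted side by side.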


In [None]:
# making masks for the attended and not_attended
attended = df.no_show=='No'
not_attended = df.no_show=='Yes'

In [None]:
# compare the number of attended and missed appointments
# (separate names keep the boolean masks defined above intact)
attended_count = (df['no_show']=='No').sum()
not_attended_count = (df['no_show']=='Yes').sum()
value = [attended_count, not_attended_count]
label = ('show up patients', 'no show patients')

# set the figure size before drawing; calling plt.figure afterwards opens a new, empty figure
plt.figure(figsize=(10,10))
plt.pie(value, labels=label, autopct='%.1f%%')
plt.title('Percentage of no-show patients')
plt.show()

><p>Most patients showed up for their appointments: roughly four out of five attended, while about one in five missed.</p>
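
The same split can be read off as numbers instead of from the pie chart. A minimal sketch on made-up data, assuming the cleaned column name `no_show`; the `toy` frame is a hypothetical stand-in for `df`:

```python
import pandas as pd

# toy stand-in for the cleaned dataframe; values are made up
toy = pd.DataFrame({'no_show': ['No', 'No', 'No', 'No', 'Yes']})

# share of each outcome in percent ('No' means the patient attended)
shares = toy['no_show'].value_counts(normalize=True) * 100
print(shares.round(1))
```

Running the same two lines on `df` should give the percentages shown in the pie chart labels.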

### 2. Which gender shows up for appointments more often?

In [None]:
df.gender.value_counts()

In [None]:
sns.countplot(x ='no_show',hue='gender', data = df)
plt.xlabel('Distribution of No show appointment') 
plt.ylabel('No. of Patients')
plt.title("Gender Distribution for No Show")# title
plt.legend()
plt.show()

><p>More women than men had appointments, so the number of women who showed up is also higher than the number of men who showed up.</p>

In [None]:
# the count for the gender distribution
df.groupby(['no_show'])['gender'].value_counts()

In [None]:
# women, as a percentage of all appointments
percentage_calculation(57245+14594)

In [None]:
# men, as a percentage of all appointments
percentage_calculation(30962+7725)

In [None]:
# women who showed up, as a percentage of all appointments
percentage_calculation(57245)

In [None]:
# men who showed up, as a percentage of all appointments
percentage_calculation(30962)
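
The percentages above are shares of all appointments, so they mostly reflect how many women and men are in the dataset. To compare attendance between genders, a rate within each gender is more direct. A minimal sketch, reusing the four counts passed to `percentage_calculation` above; the `counts` and `rate` names are introduced here for illustration:

```python
import pandas as pd

# counts taken from the calls above: showed up and missed, per gender
counts = pd.DataFrame(
    {'showed_up': [57245, 30962], 'missed': [14594, 7725]},
    index=['F', 'M'],
)

# row-wise rate: showed_up / (showed_up + missed) for each gender, in percent
rate = counts['showed_up'] / counts.sum(axis=1) * 100
print(rate.round(1))
```

On these counts the two rates come out within half a percentage point of each other, at roughly 80% for both genders.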

### 3. Which age group shows up for appointments most often?

In [None]:
# box distribution for age
df['age'].plot(kind = 'box') 
plt.ylabel('Age Range')
plt.title("Distribution of Age");

><p>Ages range from 0 to about 100, with most patients between 20 and 60.</p>

In [None]:
# grouping the ages into clusters for a clearer distribution
levels = [0,18,36,60,100]

# labelling the age levels
level_name= ['Children', 'Teenagers', 'Adult', 'Aged']

In [None]:
# adding the range to the column; include_lowest=True keeps age 0 in the first bin
df['age_range'] = pd.cut(df['age'], levels, labels=level_name, include_lowest=True)

In [None]:
# confirm changes
df.head(1)

In [None]:
# count plot for age_range
sns.countplot(x ='no_show', hue = "age_range", data = df)
plt.xlabel('Distribution for No show appointment') 
plt.ylabel('No. of Patients')
plt.title("Age Range Distribution for No Show")

><p>The numbers of children and teenagers who showed up were similar. Adults and the aged attended their appointments more often than the younger groups.</p>
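
A count plot reflects group size as much as behaviour, so an attendance rate per age group can make the comparison clearer. The sketch below runs on made-up data, so its numbers say nothing about the real dataset; the `toy` frame and the `attended` column are hypothetical names introduced here.

```python
import pandas as pd

# toy data; ages and outcomes are made up for illustration
toy = pd.DataFrame({
    'age': [0, 10, 25, 30, 45, 50, 70, 80],
    'no_show': ['No', 'Yes', 'Yes', 'No', 'No', 'No', 'No', 'Yes'],
})

levels = [0, 18, 36, 60, 100]
level_name = ['Children', 'Teenagers', 'Adult', 'Aged']

# include_lowest=True keeps age 0 in the first bin instead of turning it into NaN
toy['age_range'] = pd.cut(toy['age'], levels, labels=level_name, include_lowest=True)

# attendance rate per group: mean of a boolean column, in percent
toy['attended'] = toy['no_show'] == 'No'
rate = toy.groupby('age_range', observed=True)['attended'].mean() * 100
print(rate)
```

Without `include_lowest=True`, `pd.cut` treats the first bin as open on the left, so rows with age 0 would get no label and drop out of the grouped result.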

### 4. Which disease has the highest incidence rate?

In [None]:
hypertension_proportion = df.hypertension.mean()*100
hypertension_proportion

In [None]:
alcoholism_proportion = df.alcoholism.mean()*100
alcoholism_proportion

In [None]:
diabetes_proportion = df.diabetes.mean()*100
diabetes_proportion

In [None]:
# count any non-zero value as handicapped, in case the column holds levels rather than 0/1
handicap_proportion = (df.handicap > 0).mean()*100
handicap_proportion

In [None]:
colors = ['brown', 'red', 'pink', 'green']
locations = [1, 2, 3, 4]
heights = [hypertension_proportion, diabetes_proportion, alcoholism_proportion, handicap_proportion]
labels = ['Hypertension', 'Diabetes', 'Alcoholism', 'Handicap']

plt.bar(locations,heights,color=colors, alpha=0.7);
plt.title('Rate of Disease Incidents')
plt.xlabel('Disease', fontsize=18)
plt.ylabel('Rate', fontsize=18)
plt.xticks(locations, labels);

><p>Hypertension was the most common condition, followed by diabetes, then alcoholism. Handicap was the least common.</p>
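
The four proportions can also be computed and ranked in one step. A minimal sketch on made-up 0/1 flags, assuming the cleaned column names used above; `toy`, `conditions` and `prevalence` are names introduced here for illustration:

```python
import pandas as pd

# toy 0/1 flags; values are made up
toy = pd.DataFrame({
    'hypertension': [1, 1, 1, 0, 0],
    'diabetes':     [1, 1, 0, 0, 0],
    'alcoholism':   [0, 0, 0, 0, 1],
    'handicap':     [0, 0, 0, 0, 0],
})
conditions = ['hypertension', 'diabetes', 'alcoholism', 'handicap']

# share of rows with a non-zero flag per condition, in percent, highest first
prevalence = (toy[conditions] > 0).mean().mul(100).sort_values(ascending=False)
print(prevalence)
```

The resulting Series can be passed straight to a bar chart, for example with `prevalence.plot(kind='bar')`.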

<a id='conclusions'></a>
## Conclusions
<h3>Findings</h3>
    <ol>
        <li>For question 1, 79.8% of the patients showed up for their scheduled appointments, against 20.2% who did not.</li>
        <li>For question 2, more females than males visited the hospital. The share of each gender that showed up is almost the same, at roughly 80% (57245 of 71839 females and 30962 of 38687 males).</li>
        <li>Females who showed up account for 51% of all appointments, compared to 28% for males who showed up. The gap reflects the larger number of females in the dataset.</li>
        <li>For question 3, adults and the aged showed up for their appointments more often than children and teenagers.</li>
        <li>For question 4, hypertension was the most prevalent condition among the patients, followed by diabetes, then alcoholism, with handicap the least prevalent.</li>
      </ol>
  
## Limitations
><ol><li>The dataset contains some ages of 0, which is implausible unless these rows represent infants under one year old.</li>
    <li>There were more females than males, which indicates an underrepresentation of males in the dataset.</li>
</ol>

## References
<p>https://www.kaggle.com/code/mostafarefaatm/medical-appointment-no-show</p>
<p>https://www.kaggle.com/code/lucasfrancisco/starter-medical-appointment-no-shows-f1e029d1-c</p>
