## My First Data Exploration and Visualization

#### This project covers the following:
#### 1. Dataset Description
#### 2. Data Wrangling
#### 3. Exploratory Data Analysis
#### 4. Conclusion

### Dataset Description

In this project, we analyse a data set on patients missing their medical appointments,
looking in particular for trends that influence whether patients show up.
The data set contains 110,527 medical appointments with 14 associated variables (characteristics).
The most important variable records whether the patient showed up for the appointment or not.

### Column List and Description

1. PatientId - Patient’s unique Identification number.
2. AppointmentID - Patient’s Appointment Identification number.
3. Gender - Patient’s Gender (Male or Female).
4. ScheduledDay - The day the patient registered for the appointment.
5. AppointmentDay - The day of the patient’s appointment with a doctor.
6. Age - The patient’s age.
7. Neighbourhood - Place of the appointment.
8. Scholarship - Whether the patient is on free health care (0 or 1).
9. Hipertension - Patient’s diagnosis (0 or 1).
10. Diabetes - Patient’s diagnosis (0 or 1).
11. Alcoholism - Patient’s diagnosis (0 or 1).
12. Handcap - Patient’s diagnosis (0 or 1).
13. SMS_received - Whether a reminder message was sent for the appointment (0 or 1).
14. No-show - Whether the patient missed the appointment (Yes or No).

N.B. In No-show, ’Yes’ means the patient did not show up and ’No’ means the patient showed up. In the diagnosis columns, 1 means the condition is present and 0 means it is absent.
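Because ’Yes’ in No-show means the patient missed the appointment, it can help to derive a flag with the more intuitive polarity before analysing. The snippet below is a minimal sketch on made-up rows; the `showed_up` column name is hypothetical and not part of the data set.

```python
import pandas as pd

# Toy rows standing in for the real data (values are made up)
toy = pd.DataFrame({'No-show': ['No', 'Yes', 'No', 'No']})

# 'No' in No-show means the patient attended, so invert the flag
toy['showed_up'] = toy['No-show'].eq('No')

print(toy)
```

With a boolean flag like this, `toy['showed_up'].mean()` gives the attendance rate directly.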



### Questions Answered During the Analysis

This project addresses the following questions:
1. What percentage of patients showed up for their appointment compared to those who didn’t?
2. What is the relationship between the scheduled date and the appointment date?
3. Do SMS reminders have any impact on patients showing up for their appointment?
4. Does a medical scholarship (free medical care) have any impact on patients showing up for their medical appointment?


In [None]:
# Packages used for the project
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Data Wrangling
### General Properties

In [None]:
# Loading the dataset
df = pd.read_csv('/kaggle/input/noshowappointments/KaggleV2-May-2016.csv')

# Checking the shape of the data
df.shape

In [None]:
#checking information on the characteristics of the data set
df.info()

In [None]:
#Checking for missing values in the dataset
df.isnull().sum()

In [None]:
# Description of the data set
df.describe()


In [None]:
# Counting the unique values in each column
df.nunique()


### Data Cleaning

In [None]:
# Changing all column names to lowercase
df.rename(columns=lambda x:x.lower(), inplace=True)
df.head()

In [None]:
# Dropping extraneous columns from the dataset
df.drop(['patientid', 'appointmentid', 'neighbourhood'], axis=1, inplace=True)
df.head()

In [None]:
# Converting the scheduledday and appointmentday columns to datetime
for column in ['scheduledday', 'appointmentday']:
    df[column] = pd.to_datetime(df[column])
df.info()


In [None]:
# Renaming columns for easier identification
df = df.rename(columns= {'no-show':'no_show',
                         'scheduledday' : 'scheduled_day',
                         'appointmentday' : 'appointment_day',
                         'hipertension' : 'hypertension',
                         'handcap': 'handicap'})
df.columns

In [None]:
# Creating a new column to know the difference in time between appointment and scheduled days
df['days_difference'] = df['appointment_day'] - df['scheduled_day']
df.head()

In [None]:
# Changing the datatype in the days_difference column
df.days_difference = df.days_difference.dt.days.astype(np.int64)
df.info()
df.head()

In [None]:
# Checking days with negative values
df[df.days_difference < 0].shape


In [None]:
# Dropping rows where the days difference is negative
df.drop(index=df[df.days_difference < 0].index, inplace=True)

# Checking that negative days were dropped
df[df.days_difference < 0].shape
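One thing worth checking before dropping rows: if `scheduled_day` carries a time of day while `appointment_day` is stored at midnight, a same-day booking produces a negative timedelta, and `.dt.days` rounds it down to -1. The sketch below uses made-up timestamps to show the effect and one possible way around it, comparing calendar dates with `.dt.normalize()`. Whether the real columns are stored this way is an assumption that should be verified against `df.head()`.

```python
import pandas as pd

# Made-up timestamps: the first booking is made at 18:38 for that same date
scheduled = pd.Series(pd.to_datetime(['2016-04-29 18:38:08', '2016-04-27 08:00:00']))
appointment = pd.Series(pd.to_datetime(['2016-04-29 00:00:00', '2016-04-29 00:00:00']))

# Raw subtraction: the same-day booking yields a negative timedelta
raw_days = (appointment - scheduled).dt.days

# Comparing calendar dates only: the same-day booking yields 0
date_days = (appointment.dt.normalize() - scheduled.dt.normalize()).dt.days

print(raw_days.tolist())
print(date_days.tolist())
```

If the raw difference is used, same-day bookings would be removed along with genuinely invalid rows, which would change every percentage computed afterwards.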

In [None]:
# Checking the shape of the remaining entries
df.shape

In [None]:
# Overview of the data set
df.head(20)


# Exploratory Data Analysis

In [None]:
# Creating a function to plot a histogram of every numeric column
def histogram(data):
    data.hist(figsize=(15, 15))

# Plotting histograms to show the distribution of each variable
histogram(df)
plt.suptitle('The Distribution of each variable', fontsize=20);

From the charts above, we can deduce the following:
1. The age chart shows that children and patients around 50 years old are scheduled for medical appointments most often.
2. The alcoholism and handicap charts show that very few patients are alcoholic or handicapped.
3. The difference in days shows that most appointments are fixed for the same day as, or close to, the date of scheduling.
4. Most of the patients are not diabetic.
5. Only a minority of the patients are on a medical scholarship.
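Readings taken from histograms can be cross-checked numerically. For a 0/1 column, the mean is the fraction of rows equal to 1, so one line gives the share of patients with each attribute. This is a minimal sketch on made-up values, not the figures from the data set.

```python
import pandas as pd

# Toy 0/1 columns (values are made up)
toy = pd.DataFrame({
    'scholarship': [0, 0, 0, 1],
    'diabetes':    [0, 1, 0, 0],
    'alcoholism':  [0, 0, 0, 0],
})

# The mean of a 0/1 column is the fraction of rows equal to 1
share_pct = toy.mean() * 100
print(share_pct.round(1))
```

On the real data the same idea would be applied to a selection of the binary columns of `df`.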

#### Research Question 1: What percentage of patients showed up for their appointment?

In [None]:
#Checking the total count of patients in the data set
df.count()


In [None]:
# Filtering the data set to keep the patients that showed up for their appointments
df1 = df.query('no_show == "No"')
df1

In [None]:
# Filtering the data set to keep the patients that did not show up for their appointments
df2 = df.query('no_show == "Yes"')
df2


In [None]:
# Counting the scheduled_day entries to get the number of patients that showed up
show_up = df1['scheduled_day'].count()
show_up

In [None]:
# Counting the scheduled_day entries to get the number of patients that did not show up
No_show = df2['scheduled_day'].count()
No_show

In [None]:
# Visual representation of the patients that showed up and those that did not
locations = ['Showed Up', 'Did Not Show Up']
Heights = [show_up, No_show]
plt.barh(locations, Heights)
plt.xlabel('Number of Patients')
plt.ylabel('Attendance')
plt.title('Patients that showed up compared to those that did not');

In [None]:
# Deriving the total count by adding both counts
total_count = show_up + No_show
total_count


In [None]:
# Percentage of those patients that showed Up
percent_show_up = (show_up / total_count) * 100
percent_show_up

In [None]:
# Percentage of those patients that did not show Up
percent_no_show = (No_show / total_count) * 100
percent_no_show

In [None]:
# Visual representation of the percentage of show-ups and no-shows
locations = ['% Showed Up', '% Did Not Show Up']
Heights = [percent_show_up, percent_no_show]
plt.bar(locations, Heights)
# Vertical bars: categories sit on the x-axis, percentages on the y-axis
plt.xlabel('Attendance')
plt.ylabel('Percentage of Patients')
plt.title('Percentage of patients that showed up compared to those that did not');

The charts above show that 71.5% of the patients showed up for their medical appointment and 28.5% did not show up despite being scheduled for one.
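The same percentages can be obtained in one step with `value_counts(normalize=True)`, which returns proportions instead of raw counts. A minimal sketch on a made-up series; the numbers here are illustrative and unrelated to the data set.

```python
import pandas as pd

# Toy no_show values (made up)
toy = pd.Series(['No', 'No', 'Yes', 'No', 'Yes'], name='no_show')

# normalize=True returns proportions instead of raw counts
pct = toy.value_counts(normalize=True) * 100
print(pct)
```

This avoids keeping separate count variables for each group and adding them back together.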

#### Research Question 2: What is the relationship between the scheduled date and the appointment date?

In [None]:
# Checking for unique values in Days Difference column
df['days_difference'].unique()


In [None]:
# Description of days difference
df['days_difference'].describe()

In [None]:
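# Bin edges for grouping the days difference; np.inf keeps waits longer than 100 days
bin_edges = [0, 10, 20, 30, 40, 50, np.inf]

# Labels describing the range each bin covers
bin_names = ['0-10 days', '11-20 days', '21-30 days', '31-40 days', '41-50 days', 'Over 50 days']

In [None]:
# Grouping the days difference; include_lowest=True keeps 0-day waits in the first bin
df['days_difference_group'] = pd.cut(df['days_difference'], bin_edges, labels=bin_names, include_lowest=True)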

df.head()


In [None]:
#Visual Representation of the difference in days between Schedule Day and Appointment Day
df['days_difference_group'].value_counts().plot(kind='pie', figsize=(8,6));

In [None]:
# A sneak peek at the distribution of days difference between scheduled day and appointment day
df['days_difference'].hist(bins=65)
plt.xlabel('Number of Days')
plt.ylabel('Number of Appointments')
plt.title('Distribution of days difference between Scheduling and Appointment');

The distribution shows that over 50% of the patients were scheduled for an appointment on the same day they visited the medical centre.
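A share like this is hard to read precisely off a histogram, so it can be worth computing it directly. The sketch below uses a made-up series of waiting times; `same_day_pct` and `within_week_pct` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Toy waiting times in days (values are made up)
days = pd.Series([0, 0, 0, 1, 4, 15, 30], name='days_difference')

# The mean of a boolean mask is the fraction of rows where it is True
same_day_pct = days.eq(0).mean() * 100
within_week_pct = days.le(7).mean() * 100

print(round(same_day_pct, 1), round(within_week_pct, 1))
```

Applied to `df['days_difference']`, the first figure would confirm or correct the same-day share stated above.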

#### Research Question 3: Do SMS reminders have any impact on patients showing up for their appointment?

In [None]:
# Checking for SMS value Count
df['sms_received'].value_counts()


In [None]:
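# Distribution of SMS count
x_axis = ['No SMS sent', 'SMS sent']
# sort_index() orders the counts as 0, 1 so they line up with the labels in x_axis
ax = df['sms_received'].value_counts().sort_index().plot(kind='bar', title='SMS distribution', rot=0)
ax.set_xticklabels(x_axis)
plt.xlabel('SMS Received')
plt.ylabel('Number of Patients');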

In [None]:
# Grouping SMS received according to patients that showed up and those that didn't
sms = df.groupby(['sms_received','no_show'])['no_show'].count()
sms

In [None]:
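# Percentage of patients that showed up, without and with an SMS reminder
# sms_no_show: no SMS sent and showed up; sms_show: SMS sent and showed up
sms_no_show = sms.loc[(0, 'No')] / sms.loc[0].sum() * 100
sms_show = sms.loc[(1, 'No')] / sms.loc[1].sum() * 100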

In [None]:
# Visual representation of the show-up rate by SMS reminder
locations = ['SMS and Showed Up', 'No SMS and Showed Up']
heights = [sms_show, sms_no_show]
colors = ['Red', 'Blue']
plt.bar(locations, heights, color=colors)
plt.ylabel('Share of Patients that Showed Up')
plt.xlabel('SMS')
plt.title('Show-up rate with and without an SMS reminder');


From the chart above, we can deduce that patients who did not receive an SMS reminder from the medical centre showed up more often than those who did. In this data set, SMS reminders are therefore not associated with better attendance, although this comparison alone does not show whether the reminders themselves have an effect.
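The grouped comparison can also be expressed as a row-normalised cross-tabulation, which puts the show-up and no-show rates for each SMS group side by side. This is a minimal sketch on made-up rows, assuming the same `sms_received` and `no_show` column names as above.

```python
import pandas as pd

# Toy rows (values are made up)
toy = pd.DataFrame({
    'sms_received': [0, 0, 0, 0, 1, 1, 1, 1],
    'no_show':      ['No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes'],
})

# normalize='index' makes each row (each sms_received group) sum to 1
rates = pd.crosstab(toy['sms_received'], toy['no_show'], normalize='index') * 100
print(rates)
```

A table like this shows an association only. A gap between the rows could reflect which appointments receive reminders rather than the effect of the reminder itself.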


#### Research Question 4: Does a medical scholarship (free medical care) have any impact on patients showing up for their medical appointment?


In [None]:
#Scholarship value counts
df['scholarship'].value_counts()


In [None]:
#Grouping Scholarship
df_s = df.groupby('scholarship')['no_show'].value_counts()
df_s

In [None]:
# Percentage of patients that showed up, with and without a scholarship
scholarship_show = df_s.loc[(1, 'No')] / df_s.loc[1].sum() * 100
no_scholarship_show = df_s.loc[(0, 'No')] / df_s.loc[0].sum() * 100

In [None]:
# Visual representation of the show-up rate by scholarship status
locations = ['Scholarship and Showed Up', 'No Scholarship and Showed Up']
heights = [scholarship_show, no_scholarship_show]
colors = ['Red', 'Blue']
plt.bar(locations, heights, color=colors)
plt.xlabel('Scholarship')
plt.ylabel('Share of Patients that Showed Up')
plt.title('Show-up rate of patients with and without a scholarship');

From the chart above, we can say that a scholarship (free medical care) has little to no effect on a patient’s willingness to show up for a scheduled medical appointment. The chart shows that patients without a medical scholarship showed up more often than those on a scholarship.
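When two groups differ a lot in size, it helps to report the group size next to the rate, since a rate based on few patients is less stable. The sketch below does this with a named aggregation on made-up rows; the `showed_up`, `patients` and `show_up_rate` names are hypothetical.

```python
import pandas as pd

# Toy rows (values are made up)
toy = pd.DataFrame({
    'scholarship': [0, 0, 0, 0, 0, 0, 1, 1],
    'no_show':     ['No', 'No', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes'],
})

# Group size and show-up rate per scholarship status
summary = (
    toy.assign(showed_up=toy['no_show'].eq('No'))
       .groupby('scholarship')['showed_up']
       .agg(patients='count', show_up_rate='mean')
)
print(summary)
```

Reading the two columns together makes it easier to judge whether a gap between the groups is large enough to matter.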

## CONCLUSIONS

1. About 28.5 percent of the patients in the data set did not show up for their medical appointment.
2. Most of the patients were scheduled and had their appointment on the same day.
3. SMS reminders do not appear to increase attendance at medical appointments.
4. A scholarship (free medical care) has little to no effect on a patient’s willingness to show up for a scheduled medical appointment.
