#Medical-appointment-dataset-analysis

Dataset Description

A person makes a doctor's appointment, receives all the instructions, and then does not show up. Who is to blame? This dataset collects information from 100k medical appointments in Brazil and focuses on the question of whether or not patients show up for their appointment. Each row includes a number of characteristics about the patient.

Columns Description

PatientId: Identification of a patient.

AppointmentID: Identification of each appointment.

Gender: Male or Female.

AppointmentDay: The day of the actual appointment, when the patient has to visit the doctor.

ScheduledDay: The day someone called or registered the appointment; this is, of course, before the appointment.

Age: How old is the patient.

Neighbourhood: Where the appointment takes place.

Scholarship: True or False, indicates whether or not the patient is enrolled in the Brazilian welfare program Bolsa Família.

Hipertension: True or False.

Diabetes: True or False.

Alcoholism: True or False.

Handcap: True or False.

SMS_received: 1 or more messages sent to the patient.

No-show: True (if the patient did not show up), or False (if the patient did show up).

EDA Questions

Q1: How often do men go to hospitals compared to women? Which of them is more likely to show up?

Q2: Does receiving an SMS as a reminder affect whether or not a patient shows up? Is it correlated with the number of days before the appointment?

Q3: Does having a scholarship affect showing up for a hospital appointment? What are the age groups affected by this?

Q4: Does having certain diseases affect whether or not a patient shows up for their appointment? Is it affected by gender?

#Environment set-up

In [None]:
# importing lib.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# getting the csv file directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

#Data Wrangling

In this section, we'll load our data from a CSV file into a pandas dataframe, and then take a quick dive into exploring our dataset in detail.

In [None]:
# loading dataset from csv file and showing its first 5 rows
df = pd.read_csv('noshowappointments-kagglev2-may-2016.csv')
df.head()

We'll move next into exploring our dataset by going through its data types, NaNs or duplicated rows, and any columns that may need to be dropped or parsed.

In [None]:
# viewing main info about df
df.info()

We can notice that there are no NaNs at all in our data.

The PatientId and AppointmentID columns won't be helpful during analysis.

ScheduledDay and AppointmentDay need to be cast to a date data type.

We may append a new column for the number of days until the appointment.

Gender needs to be converted into a category type.

Scholarship, Hipertension, Diabetes, Alcoholism and Handcap are better stored as boolean data type.

No-show needs to be parsed and cast to boolean too.

In [None]:
# checking for duplicates
df.duplicated().sum()

Our dataset has no duplicated rows either.

In [None]:
# exploring the unique values of each column
df.nunique()

The Handcap and Age columns have inconsistent unique values.
SMS_received will be cast to boolean data type.

In [None]:
# exploring handcap values
df['Handcap'].value_counts()

We are only interested in 0 or 1 values for this column.

In [None]:
# exploring age column distribution
df['Age'].describe()

The Age column contains values that need to be handled (for example, ages below 0).

#Exploration Summary

Our dataset consists of 110527 rows and 14 columns, and has neither NaNs nor duplicated rows.

The PatientId and AppointmentID columns won't be helpful during analysis.

ScheduledDay and AppointmentDay need to be cast to a date data type.

We may append a new column for the number of days until the appointment.

Gender needs to be cast to a category type.

Scholarship, Hipertension, Diabetes, Alcoholism and SMS_received are better stored as boolean data type.

The No-show column needs to be parsed and cast to boolean type.

The Handcap column needs to be cleaned so that it has only 0 and 1 values.

The Age column has inconsistent unique values that need to be handled.

#Data Cleaning

In this section, we'll perform some operations on our dataset based on the previous findings to make our analysis more accurate and clear.


Dropping PatientId and AppointmentID columns

In [None]:
columns_to_drop = ['PatientId', 'AppointmentID']
df.drop(columns=[col for col in columns_to_drop if col in df.columns], inplace=True)
df.columns



Handling date data type

In [None]:
# listing the distinct appointment timestamps (unique is a method, so it must be called)
df['AppointmentDay'].unique()

It looks like all hours are set to 00:00:00, so we only want to extract the year, month and day.

In [None]:
# extracting only day, month and year values
df['ScheduledDay'] = df['ScheduledDay'].str[:10]
df['AppointmentDay'] = df['AppointmentDay'].str[:10]

# changing data type
df['ScheduledDay'] = pd.to_datetime(df['ScheduledDay'])
df['AppointmentDay'] = pd.to_datetime(df['AppointmentDay'])

# confirming changes
print(df[['AppointmentDay', 'ScheduledDay']].dtypes)
df.head()
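As an alternative to slicing the strings, the dates can also be parsed first and then truncated to midnight. The following is a minimal sketch, assuming the raw columns hold ISO 8601 timestamps; the sample values and the `raw` frame are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample rows; the real values come from the CSV file
raw = pd.DataFrame({
    'ScheduledDay': ['2016-04-29T18:38:08Z', '2016-04-27T08:36:51Z'],
    'AppointmentDay': ['2016-04-29T00:00:00Z', '2016-05-03T00:00:00Z'],
})

# Parse the full timestamp, then drop the time-of-day part with normalize()
sched = pd.to_datetime(raw['ScheduledDay']).dt.normalize()
appt = pd.to_datetime(raw['AppointmentDay']).dt.normalize()

print(sched.dt.strftime('%Y-%m-%d').tolist())
print(appt.dt.strftime('%Y-%m-%d').tolist())
```

Both approaches discard the hour of scheduling; parsing first has the side benefit of failing loudly on malformed timestamps instead of silently keeping a truncated string.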

Now, we'd move into appending a new column that holds number of days to the appointment.

In [None]:
# making new due days column
df['due-days'] = df['AppointmentDay'] - df['ScheduledDay']

# converting data type
df['due-days'] = df['due-days'].dt.days

# dropping the ScheduledDay and AppointmentDay columns, which are no longer needed
df.drop(['AppointmentDay', 'ScheduledDay'], axis = 1, inplace = True)

We'll move into exploring this new column.

In [None]:
# viewing summary statistics
df['due-days'].describe()

We seem to have some negative values here, so we'll drop them.

In [None]:
# viewing negative days values
df[df['due-days'] < 0 ]

In [None]:
# dropping these values and confirming changes
df.drop(df[df['due-days'] < 0].index, inplace = True)
df['due-days'].describe()


Converting Gender to a categorical variable

In [None]:
# converting column and confirming changes
df['Gender'] = df['Gender'].astype('category')

df['Gender'].dtypes


Converting Scholarship, Hipertension, Diabetes, Alcoholism and SMS_received to boolean data type

In [None]:
# converting columns to bool and confirming changes
cols = ['Scholarship', 'Hipertension', 'Diabetes', 'Alcoholism', 'SMS_received']
df[cols] = df[cols].astype('bool')
df[cols].dtypes

Parsing and casting No-show column

In [None]:
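# mapping values to be more familiar: 'Yes' (missed the appointment) becomes 0 and 'No' (showed up) becomes 1,
# so after the cast below, True in this column means the patient showed up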
df.loc[df['No-show'] == 'Yes', 'No-show'] = 0
df.loc[df['No-show'] == 'No', 'No-show'] = 1

# casting dt type and confirming changes
df['No-show'] = df['No-show'].astype(bool)
df['No-show'].dtypes

#Cleaning Handcap column

In [None]:
# viewing rows with values of handcap > 1
df[df['Handcap'] > 1]

We have 199 rows with inconsistent values; we'll replace them with 1 to treat these patients as being handicapped.

In [None]:
# filling the bigger values with 1
df.loc[df['Handcap'].isin([2, 3, 4]), 'Handcap'] = 1

# casting type and confirming changes
df['Handcap'] = df['Handcap'].astype('bool')
df['Handcap'].unique()
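The same flag can be derived with a single comparison, without listing the larger values explicitly. This is a minimal sketch on a made-up Series; the name `is_handicapped` is only illustrative.

```python
import pandas as pd

# Hypothetical Handcap values, including the levels above 1 seen earlier
handcap = pd.Series([0, 1, 2, 3, 4, 0], name='Handcap')

# Any value above 0 is treated as "has a handicap"
is_handicapped = handcap > 0

print(is_handicapped.tolist())
```

A comparison like this keeps working if a level other than 2, 3 or 4 ever appears in the column.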

#Cleaning Age column

In [None]:
#exploring values below 0
df[df['Age'] < 0]

We have one row with a negative age, so we will drop it.

In [None]:
# dropping row with negative age and confirming changes
df.drop(df[df['Age'] < 0].index, inplace = True)
df[df['Age'] < 0]

In [None]:
df.head()

In [None]:
df.info()


We ended up with a dataframe of 110521 rows and 11 columns, and everything looks tidy and clean. We'll proceed to visualizing it to extract meaningful insights.

#Data Visualization and EDA

Now that our data is clean, we'd perform some EDA on it in order to extract useful insights from it.

In [None]:
# setting seaborn configurations
sns.set_style("whitegrid")

How often do men go to hospitals compared to women? Which of them is more likely to show up?

In [None]:
# viewing count plot of gender distribution in our dataset
sns.countplot(x = 'Gender', data = df)
plt.title("Gender Distribution")
plt.show()

In [None]:
# viewing box plot of age distribution per gender in our dataset
sns.boxplot(x = 'Gender', y = 'Age', data = df)
plt.title("Gender Distribution against Age")
plt.show()

We can notice that women make up the larger share of our dataset, with a wider age distribution and some outliers.

In [None]:
df['No-show'].value_counts()

We can see that 79.8% of our patients showed up for their appointments and about 20.2% of them did not.
Let's dive deeper to see if this is related to gender.
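The shares quoted above can be read directly from the column rather than computed by hand from the counts. Here is a minimal sketch on a made-up frame; it assumes the mapping used in this notebook, where True in No-show means the patient showed up.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned dataframe
toy = pd.DataFrame({'No-show': [True, True, True, False, True]})

# normalize=True returns shares instead of raw counts
print(toy['No-show'].value_counts(normalize=True))

# The mean of a boolean column is the share of True values
show_rate = toy['No-show'].mean()
print(f"showed up: {show_rate:.1%}")
```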

In [None]:
# showing the gender distribution with respect to the no-show column
sns.countplot(x = 'Gender', data = df, hue = 'No-show')
plt.title('Gender distribution with respect to no-show column')
plt.show()

From the above chart, we can conclude that women show up for their appointments more often than men do in absolute numbers, but this may be affected by the percentage of women in this dataset.
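Because the two groups differ in size, comparing rates rather than counts removes the effect of the gender imbalance. The sketch below uses a made-up frame and again assumes that True in No-show means the patient showed up.

```python
import pandas as pd

# Hypothetical data with more women than men, as a stand-in for df
toy = pd.DataFrame({
    'Gender': ['F', 'F', 'F', 'F', 'M', 'M'],
    'No-show': [True, True, True, False, True, False],
})

# Share of patients who showed up within each gender
rate_by_gender = toy.groupby('Gender')['No-show'].mean()
print(rate_by_gender)
```

In this toy data women have more show-ups in absolute terms simply because there are more of them; the per-group rate is the fairer comparison.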


Does receiving an SMS as a reminder affect whether or not a patient shows up? Is it correlated with the number of days before the appointment?

In [None]:
# viewing count plot of SMS_received distribution in our dataset
sns.countplot(x = 'SMS_received', data = df)
plt.title("SMS received distribution")
plt.show()

In [None]:
df['SMS_received'].value_counts()

We can see that 67.8% of our patients did not receive any SMS reminder of their appointments. Could this be affecting whether they show up?

In [None]:
# showing the SMS distribution with respect to the no-show column
sns.countplot(x = 'SMS_received', data = df, hue = 'No-show')
plt.title('SMS distribution with respect to no-show column')
plt.show()

We can see that our previous guess was not quite correct: most of our patients did not receive any SMS reminder, and yet most of them showed up for their appointments.
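To check whether reminders relate to the waiting time, the show-up rate and the typical number of due days can be summarised side by side for each SMS group. This is a minimal sketch on made-up values; the output column names `show_rate` and `median_wait` are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned dataframe
toy = pd.DataFrame({
    'SMS_received': [False, False, False, False, True, True],
    'No-show':      [True, True, True, False, True, False],
    'due-days':     [0, 1, 0, 3, 10, 20],
})

# One row per SMS group: share who showed up and median waiting time
summary = (
    toy.groupby('SMS_received', as_index=False)
       .agg(show_rate=('No-show', 'mean'), median_wait=('due-days', 'median'))
)
print(summary)
```

If the SMS group also has much longer waits, a difference in show-up rate between the groups cannot be attributed to the reminder alone.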

In [None]:
# viewing the correlation between no-show and due-days without outliers
sns.boxplot(x = 'No-show', y = 'due-days', data = df, showfliers = False)
plt.title('no-show against due-days')
plt.show()

It is clear that there is a relationship between the number of due days and whether a patient shows up or not.
Patients with appointments 0 to 30 days away tend to show up more regularly, while patients with a higher number of days tend not to show up.
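The box plot impression can be backed with numbers by bucketing the waiting time and computing the show-up rate per bucket. The bin edges and labels below are arbitrary choices for illustration, and the frame is made up.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned dataframe
toy = pd.DataFrame({
    'due-days': [0, 0, 2, 5, 12, 20, 35, 60],
    'No-show':  [True, True, True, False, True, False, False, False],
})

# Right-closed bins: (-1, 0], (0, 7], (7, 30], (30, inf]
bins = [-1, 0, 7, 30, float('inf')]
labels = ['same day', '1-7', '8-30', '31+']
toy['wait_bucket'] = pd.cut(toy['due-days'], bins=bins, labels=labels)

# Share of patients who showed up in each waiting-time bucket
rate_by_bucket = toy.groupby('wait_bucket', observed=True)['No-show'].mean()
print(rate_by_bucket)
```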

In [None]:
# viewing the correlation between no-show and due-days without outliers with respect to gender
sns.boxplot(x = 'No-show', y = 'due-days', data = df, hue = 'Gender', showfliers = False)
plt.title('no-show against due-days')
plt.show()

Gender does not affect number of due days and showing up at an appointment that much.


Does having a scholarship affect showing up for a hospital appointment? What are the age groups affected by this?

In [None]:
# plotting having a scholarship against age
sns.boxplot(x = 'Scholarship', y = 'Age', data = df)
plt.title('Scholarship vs. age')
plt.show()

In [None]:
# plotting having a scholarship against no show with respect to gender
sns.barplot(x = 'Scholarship', y = 'No-show', hue = 'Gender', data = df)
plt.title('Scholarship vs. no-show with respect to gender')
plt.show()

In [None]:
# plotting age distribution
df['Age'].hist()

We can see that having a scholarship does not affect showing up for a doctor's appointment that much, and that a wide range of ages is enrolled in the scholarship, including babies enrolled by their parents.
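To see which age groups are enrolled, the share of scholarship holders can be computed per age band. The band edges and labels here are arbitrary assumptions, and the data is made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned dataframe
toy = pd.DataFrame({
    'Age': [0, 3, 10, 25, 34, 40, 70, 80],
    'Scholarship': [True, False, True, True, False, False, False, False],
})

# Left-closed bands: [0, 12), [12, 18), [18, 40), [40, 65), [65, 120)
edges = [0, 12, 18, 40, 65, 120]
labels = ['0-11', '12-17', '18-39', '40-64', '65+']
toy['age_band'] = pd.cut(toy['Age'], bins=edges, labels=labels, right=False)

# Share of patients enrolled in the scholarship within each band
enrol_by_band = toy.groupby('age_band', observed=True)['Scholarship'].mean()
print(enrol_by_band)
```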


Does having certain diseases affect whether or not a patient shows up for their appointment? Is it affected by gender?

In [None]:
# plotting diseases against no-show
plt.figure(figsize=(15,10))
plt.subplot(2,2,1)
sns.countplot(x = 'Hipertension', data = df, hue= 'No-show')
plt.subplot(2,2,2)
sns.countplot(x = 'Diabetes', data = df, hue= 'No-show')
plt.subplot(2,2,3)
sns.countplot(x = 'Alcoholism', data = df, hue= 'No-show')
plt.subplot(2,2,4)
sns.countplot(x = 'Handcap', data = df, hue= 'No-show')
plt.show()

In [None]:
# plotting diseases against no-show with respect to age
plt.figure(figsize=(15,10))
plt.subplot(2,2,1)
sns.barplot(x = 'Hipertension', y = 'Age', data = df, hue= 'No-show')
plt.subplot(2,2,2)
sns.barplot(x = 'Diabetes', y = 'Age', data = df, hue= 'No-show')
plt.subplot(2,2,3)
sns.barplot(x = 'Alcoholism', y = 'Age', data = df, hue= 'No-show')
plt.subplot(2,2,4)
sns.barplot(x = 'Handcap', y = 'Age', data = df, hue= 'No-show')
plt.show()

From the previous set of plots, we can conclude that the vast majority of patients in our dataset do not have chronic diseases, yet these diseases are present in many young people.
Having a chronic disease may affect whether a patient shows up for a hospital appointment.
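One way to make this comparison concrete is to tabulate the show-up rate for patients with and without each condition. The sketch below uses a made-up frame with two of the condition columns and assumes that True in No-show means the patient showed up.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned dataframe
toy = pd.DataFrame({
    'Hipertension': [True, True, False, False],
    'Diabetes':     [True, False, False, False],
    'No-show':      [True, True, True, False],
})

conditions = ['Hipertension', 'Diabetes']
rows = []
for col in conditions:
    rows.append({
        'condition': col,
        # show-up rate among patients who have the condition
        'with': toy.loc[toy[col], 'No-show'].mean(),
        # show-up rate among patients who do not have it
        'without': toy.loc[~toy[col], 'No-show'].mean(),
    })

comparison = pd.DataFrame(rows).set_index('condition')
print(comparison)
```

The same loop extends to Alcoholism and Handcap once those columns are boolean.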

#Conclusion

Q1: How often do men go to hospitals compared to women? Which of them is more likely to show up?
Women make up the larger share of our dataset, with a wider age distribution and some outliers.

79.8% of our patients showed up for their appointments and about 20.2% of them did not.

Women show up for their appointments more often than men do in absolute numbers, but this may be affected by the percentage of women in this dataset.

Q2: Does receiving an SMS as a reminder affect whether or not a patient shows up? Is it correlated with the number of days before the appointment?
67.8% of our patients did not receive any SMS reminder of their appointments, yet most of them showed up.
The number of due days is clearly related to whether a patient shows up: the longer the wait, the less likely the patient is to show up.
Patients with appointments 0 to 30 days away tend to show up more regularly, while patients with a higher number of days tend not to show up.
Gender does not affect the number of due days or showing up for an appointment that much.

Q3: Does having a scholarship affect showing up for a hospital appointment? What are the age groups affected by this?
Having a scholarship does not affect showing up for a doctor's appointment that much.
A wide range of ages is enrolled in the scholarship, including babies enrolled by their parents.

Q4: Does having certain diseases affect whether or not a patient shows up for their appointment? Is it affected by gender?
We can conclude that the vast majority of patients in our dataset do not have chronic diseases, yet these diseases are present in many young people.
Having a chronic disease may affect whether a patient shows up for a hospital appointment.