# Project: Investigate a Dataset - No show appointments

## Introduction
### Dataset Description
This dataset collects information from 100k medical appointments in Brazil and is focused on the question of whether or not patients show up for their appointment. A number of characteristics about the patient are included in each row.

‘ScheduledDay’ tells us on what day the patient set up their appointment. ‘Neighbourhood’ indicates the location of the hospital. ‘Scholarship’ indicates whether or not the patient is enrolled in the Brazilian welfare program Bolsa Família. The last column, ‘No-show’, is encoded counter-intuitively: it says ‘No’ if the patient showed up to their appointment and ‘Yes’ if they did not. Most of the other attributes are denoted with either a 1 or a 0: 1 if the attribute is true and 0 if it is false. For example, a Scholarship value of 1 means the person is on the scholarship and 0 means they are not.

### Question(s) for Analysis
#### What are the main factors influencing whether or not a person will show up for an appointment?
First we will analyse the relationship between sickness and the probability that a person shows up. Then we will analyse how attributes independent of a person's condition affect their probability of showing up. From this we will be able to see the most important factors that influence whether or not a person is likely to show up, and make recommendations to increase the patient turn up rate.

#### Dependent variable : No-show
Independent variables : Gender, ScheduledDay, AppointmentDay, Age, Neighbourhood, Scholarship, Hipertension, Diabetes, Alcoholism, Handcap, SMS_received
### Goal : Find the impact of each of these factors on whether a person turns up for an appointment

# Importing the relevant libraries

In [None]:
import pandas as pd
%matplotlib inline
from matplotlib import pyplot as plt

## Data Wrangling

In [None]:
#Import our data set and get a general overview of its structure
df = pd.read_csv("data.csv")
df.head(15)

### Let's look at the shape of our data

In [None]:
df.shape

In [None]:
#What's the general structure of our attributes?
df.info();
#From the output we see that our data has no missing values. Note that ScheduledDay and AppointmentDay are read in as strings, not dates
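
The missing-value claim can also be checked directly instead of being read off `df.info()`. Below is a minimal sketch on a small made-up frame; the `toy` rows are assumptions for illustration and are not taken from the dataset. It also shows that a date column loaded from text stays a string until it is parsed.

```python
import pandas as pd

# Toy stand-in for the appointments data (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [23, 56, None],
    "ScheduledDay": [
        "2016-04-29T18:38:08Z",
        "2016-04-29T16:08:27Z",
        "2016-04-29T16:19:04Z",
    ],
    "No-show": ["No", "Yes", "No"],
})

# Number of missing values per column; all zeros would mean nothing is missing.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Dates stored as text only become real timestamps after explicit parsing.
parsed_days = pd.to_datetime(toy["ScheduledDay"])
print(parsed_days.dtype)
```

On the real data the same two calls would be applied to `df` rather than `toy`.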

In [None]:
#Let's also look at the summary statistics of any info that might be relevant
df.describe()
#The summary statistics are not very useful here, since most numeric columns are identifiers or 0/1 flags

## Data Cleaning
### Now that we have a general intuition of what our data looks like, let's see if there is a need for cleaning

In [None]:
#Let's drop duplicate rows if any exist
df.drop_duplicates(inplace=True)
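
To actually see whether any duplicates exist before dropping them, the rows can be counted first. This is a small sketch on made-up rows (the `toy` frame is an assumption, not the real data):

```python
import pandas as pd

# Toy frame in which the second row repeats the first (made-up values).
toy = pd.DataFrame({
    "PatientId": [1, 1, 2, 3],
    "AppointmentID": [10, 10, 11, 12],
    "No-show": ["No", "No", "Yes", "No"],
})

# duplicated() flags every row that repeats an earlier row exactly.
n_duplicates = int(toy.duplicated().sum())
print(f"duplicate rows: {n_duplicates}")

# drop_duplicates() returns a new frame; the original is left untouched.
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(len(toy), "->", len(deduplicated))
```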

In [None]:
#Get an overview of all the columns to see if we might need to drop irrelevant ones
print(df.columns.tolist())

##### Since we are looking for factors influencing a person's likelihood of showing up, we won't need PatientId or AppointmentID. We might need to look further at 'ScheduledDay', 'AppointmentDay' and 'Neighbourhood'


In [None]:
df.drop(['PatientId','AppointmentID'],axis=1,inplace=True)
df.head()

#### From the above we also see that 'ScheduledDay' and 'AppointmentDay' might not be relevant, since they differ by minutes or hours and would be cumbersome to analyse, so we drop them

In [None]:
#Let's also check the neighbourhoods
df['Neighbourhood'].value_counts()
#From the output we have 81 neighbourhoods. It would be very cumbersome to analyse all 81 values, since we don't have
#any other demographic info to describe the neighbourhoods

In [None]:
#Let's drop the irrelevant columns
df.drop(['ScheduledDay', 'AppointmentDay','Neighbourhood'],axis=1,inplace=True)

In [None]:
df.head(20)

##### Looking at the above, we see that the No-show column is not consistent with the other columns. Its labelling is also confusing, so we rename it to Showedup, since that is the trait we are most interested in. We recode 'No' to 1 (True), since it means the person showed up, and 'Yes' to 0 (False), since it means the person didn't show up

In [None]:
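#map returns a new column, so we assign it back instead of relying on inplace=True on a selected column
df['No-show'] = df['No-show'].map({'No':1,'Yes':0})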

In [None]:
df.rename(columns={"No-show":"Showedup"},inplace=True)
df.head()
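
The recoding step can be sanity-checked on a tiny example. The sketch below assumes the only labels are 'No' and 'Yes'; the `labels` series is made up for illustration. Any label missing from the mapping would turn into NaN, so counting NaNs afterwards is a cheap safeguard.

```python
import pandas as pd

# Made-up labels in the same format as the original No-show column.
labels = pd.Series(["No", "Yes", "No", "No"], name="No-show")

# 'No' (did not miss the appointment) becomes 1, 'Yes' becomes 0.
mapping = {"No": 1, "Yes": 0}
showed_up = labels.map(mapping).rename("Showedup")

# Labels that are not in the mapping would show up here as NaN.
unmapped = int(showed_up.isna().sum())

print(showed_up.tolist())
print("unmapped labels:", unmapped)
```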

##### We also want to analyse two aspects of our data, i.e. sicknesses and other factors. We want to see the impact of sickness on whether people show up, and also the influence of other factors, so let's make two datasets: one for analysing sicknesses and the other for the remaining factors

In [None]:
#copy() gives us an independent dataframe that we can safely modify later
sickDf = df[['Hipertension','Diabetes','Alcoholism','Handcap']].copy()
sickDf.head()

In [None]:
otherDf = df[['Gender','Scholarship','Age','SMS_received']]
otherDf.head()

In [None]:
#We then create masks for the people who showed up and those who missed, so that we can easily pick out our rows
showedup = df.Showedup==1
missed = df.Showedup ==0

# Exploratory Data Analysis

## Research question 1 : What is the relationship between sickness and the probability that a person shows up? The sicknesses are Hipertension, Alcoholism, Diabetes and Handcap. The dependent variable is Showedup

In [None]:
#Let's pull in the sickness data set
sickDf.head(10)

In [None]:
#We will need the percentage of people who showed up for a given attribute, so let's create a function
#to help us do that
#a is the number of people with the attribute who showed up, b is the total number of people with the attribute
def percentage(a,b):
    return a/b * 100

In [None]:
#Before we analyse the individual traits, let's look at combinations of traits, since some people have more than one sickness
sickDf[showedup].value_counts().plot(kind="bar",figsize=(15,8));
plt.title("Dependence Among Sicknesses")
plt.ylabel("Total People");

From the above we see that, among those with a sickness, people with Hipertension alone are the largest group who showed up. Note that the bars are counts, not rates.
There are very few people who have a combination of all sicknesses.
We also see that there seem to be several levels (1, 2, 3, 4) of Handcap, but what we are most interested in is whether or not a person is handicapped: 0 means they are not, and any number above 0 means they are, at different levels. So let's convert that.

In [None]:
#Change any level of handicap to just 1, to show that the person is handicapped
#assign returns a new dataframe, and clip(upper=1) caps the levels 2, 3 and 4 at 1
sickDf = sickDf.assign(Handcap=sickDf['Handcap'].clip(upper=1))
sickDf[showedup].value_counts().plot(kind="bar",figsize=(15,8));
plt.title("Dependence Among Sicknesses")
plt.ylabel("Total People");

##### From the above we see that most of the people who showed up weren't sick. This supports our assumption that there might be factors other than sickness that influence whether a person will show up
##### We also see that, among those with a sickness, people with Hipertension alone are the largest group who showed up, followed by people with both Hipertension and Diabetes
##### Handicapped people are the smallest groups among those who showed up, as shown by the last bars. Their contribution to the overall turn up is almost insignificant.
##### Now let's take a closer look at Hipertension and Diabetes

In [None]:
sickDf.Hipertension.value_counts() , sickDf.Hipertension[showedup].value_counts()

In [None]:
#Let's calculate the % of people with Hipertension who showed up
percentage(18029,21801)
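
Instead of typing the counts in by hand, they can be computed from the data and passed to `percentage`. A minimal sketch on made-up rows (the `toy` frame and the variable names are assumptions for illustration):

```python
import pandas as pd

def percentage(a, b):
    # a: number who showed up, b: total number with the attribute
    return a / b * 100

# Toy data: four people have the condition, three of them showed up.
toy = pd.DataFrame({
    "Hipertension": [1, 1, 1, 0, 0, 1],
    "Showedup":     [1, 1, 0, 1, 0, 1],
})

has_condition = toy["Hipertension"] == 1
total_with_condition = int(has_condition.sum())
showed_with_condition = int((has_condition & (toy["Showedup"] == 1)).sum())

rate = percentage(showed_with_condition, total_with_condition)
print(rate)
```

Computing the two counts this way keeps the result in sync with the data if rows are added or removed during cleaning.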

In [None]:
#Let's look at how many people have Diabetes and how many of them showed up
sickDf.Diabetes.value_counts(),  sickDf.Diabetes[showedup].value_counts()

In [None]:
# percentage of people with diabetes who showed up
percentage(6513,7943)

##### Thus we see that these people have high appointment turn up rates, i.e. approx 82.7% for Hipertension and 82.0% for Diabetes. In other words, a person with one of these sicknesses has a probability of roughly 0.82 of showing up
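
Because the bar charts above show counts, it can help to put the turn up rate next to the group size for each combination of sicknesses. The sketch below uses made-up rows and only two of the sickness columns; the `toy` frame and the output column names `patients` and `show_rate` are assumptions for illustration.

```python
import pandas as pd

# Toy data: each row is one appointment (made-up values).
toy = pd.DataFrame({
    "Hipertension": [0, 0, 0, 0, 1, 1, 1, 1],
    "Diabetes":     [0, 0, 0, 0, 0, 0, 1, 1],
    "Showedup":     [1, 1, 1, 0, 1, 0, 1, 1],
})

# For every combination: how many appointments, and what share showed up.
summary = (
    toy.groupby(["Hipertension", "Diabetes"])["Showedup"]
       .agg(patients="count", show_rate="mean")
)
summary["show_rate"] = summary["show_rate"] * 100  # fraction -> percent

print(summary)
```

Since Showedup is coded as 0/1, its mean within a group is exactly the fraction of that group who showed up.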

### Research Question 2 : How do attributes independent of a person's condition affect their probability of showing up? The attributes are Age, Gender, Scholarship and SMS received

In [None]:
#Let's take a quick scan of our dataset again
otherDf.head()

In [None]:
otherDf.hist();
#A general overview shows that most people were not encouraged to come, i.e. they were not on welfare and did not get
#an SMS. Now let's look at how this affected the turn up rates

In [None]:
otherDf.Scholarship[showedup].value_counts().plot(kind='bar', label="showedUp");
plt.title("Turn Up Vs Scholarship")
plt.xlabel("Scholarship Status");
plt.ylabel("Total People");
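#From this we see that few of the people who showed up have a scholarship, so the incentive does not seem to have a major impact on turn up
#Keep in mind that these are counts, so the small bar also reflects how few people are on scholarship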

In [None]:
otherDf.SMS_received[showedup].value_counts().plot(kind='bar', label="showedUp");
plt.title("Turn Up Vs SMS_received")
plt.xlabel("SMS Received");
plt.ylabel("Total People");

##### Although fewer people received an SMS, there seems to be a significantly higher turn out among those who received one. Thus this incentive appears to have been a little more effective than the scholarship
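
A bar of counts depends on how many people are in each group, so one way to compare the two incentives fairly is to compute the turn up rate within each group. The helper `show_rate_by` and the `toy` rows below are hypothetical and only illustrate the idea; on the real data the helper would be applied to `df`.

```python
import pandas as pd

def show_rate_by(frame, column, target="Showedup"):
    """Percentage of rows with target == 1 within each value of column."""
    return frame.groupby(column)[target].mean() * 100

# Toy data with made-up values.
toy = pd.DataFrame({
    "Scholarship":  [0, 0, 0, 0, 1, 1],
    "SMS_received": [0, 0, 1, 1, 0, 1],
    "Showedup":     [1, 1, 1, 0, 1, 0],
})

scholarship_rates = show_rate_by(toy, "Scholarship")
sms_rates = show_rate_by(toy, "SMS_received")

print(scholarship_rates.round(1))
print(sms_rates.round(1))
```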

In [None]:
df.Age[showedup].hist( alpha=0.5, bins=20 ,color="yellow", label="Showedup")
df.Age[missed].hist(alpha=0.5, bins=20 ,color="red", label="Missed");
plt.title("Turn Up Across Ages")
plt.xlabel("Age")
plt.ylabel("Total People")
plt.legend();


##### From the above we can see that children under one year have the highest appointment turn up, and people in their mid twenties have the lowest. People in their seventies also have a high turn out.
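
Overlaid histograms compare counts, so age groups with many patients dominate the picture. A hedged alternative is to bin ages and compute the turn up rate per bin. The bin edges, the labels and the `toy` rows below are assumptions chosen for illustration only.

```python
import pandas as pd

# Toy data with made-up ages and outcomes.
toy = pd.DataFrame({
    "Age":      [0, 3, 17, 25, 26, 40, 64, 72, 80, 90],
    "Showedup": [1, 1, 0, 0, 1, 1, 1, 1, 1, 0],
})

# Left-closed bins: [0, 18), [18, 40), [40, 65), [65, 120).
# Ages outside these bins become NaN and are left out of the result.
bins = [0, 18, 40, 65, 120]
labels = ["0-17", "18-39", "40-64", "65+"]
toy["AgeBand"] = pd.cut(toy["Age"], bins=bins, labels=labels, right=False)

rate_by_band = toy.groupby("AgeBand", observed=True)["Showedup"].mean() * 100
print(rate_by_band.round(1))
```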

In [None]:
otherDf.Gender[showedup].value_counts().plot(kind="bar");
plt.title("Turn Up By Gender")
plt.xlabel("Gender")
plt.ylabel("Total People");

###### Because we might have more of one gender, our previous method will not yield correct results if applied here. So we first look at the structure of the data with regard to who turned up and who did not. Then we look at the % of each gender that turns up rather than the number. This will give us a more accurate picture of which gender is more likely to turn up

In [None]:
df.groupby("Showedup")['Gender'].value_counts().plot(kind="bar",color="green",ylabel="Total");


In [None]:
#How  many of each gender do we have?
otherDf.Gender.value_counts()

In [None]:
#How many of each gender showed up
otherDf.Gender[showedup].value_counts()

In [None]:
#What percentage of males showed up?
maleR = percentage(30962,38687)
maleR

In [None]:
#What percentage of females showed up
femaleR = percentage(57246,71840)
femaleR

##### From the above we see that gender doesn't have a noticeable impact on the turn out rate, since the % for each gender is nearly the same, i.e. 80.0% for males and 79.7% for females to 1 dp
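
The per-gender percentages can also be obtained in one step with a row-normalised cross tabulation, which avoids copying counts by hand. This is a sketch on made-up rows (the `toy` frame is an assumption):

```python
import pandas as pd

# Toy data with made-up values.
toy = pd.DataFrame({
    "Gender":   ["F", "F", "F", "F", "M", "M"],
    "Showedup": [1, 1, 1, 0, 1, 0],
})

# normalize="index" makes every row (gender) sum to 1; multiply by 100 for percent.
rates = pd.crosstab(toy["Gender"], toy["Showedup"], normalize="index") * 100
print(rates)
```

The column labelled 1 holds the percentage of each gender that showed up, and the column labelled 0 the percentage that missed.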

In [None]:
#On a final note, let's also look at the percentage of people who showed up overall
df.Showedup.value_counts()

In [None]:
#Total people = 38687 males + 71840 females = 110527
percentage(88208,110527)

###### From this, about 80% of the people showed up, which is a very good sign. This can also be seen in our very first histogram
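
The overall figure can be cross-checked from the gender counts quoted in the cells above, since the two genders together make up everyone. The sketch only redoes that arithmetic; the variable names are chosen for illustration.

```python
def percentage(a, b):
    # a: number who showed up, b: total number in the group
    return a / b * 100

# Counts quoted in the gender cells above.
males_total, females_total = 38687, 71840
males_showed, females_showed = 30962, 57246

# Everyone is in exactly one of the two groups, so the totals add up.
total = males_total + females_total        # 110527
showed = males_showed + females_showed     # 88208

male_rate = round(percentage(males_showed, males_total), 1)
female_rate = round(percentage(females_showed, females_total), 1)
overall_rate = round(percentage(showed, total), 1)

print(male_rate, female_rate, overall_rate)
```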

# Conclusions

In our analysis we looked at the factors that influence a person's probability of turning up. We looked at the influence of objective factors like sickness, then at other incentives that might complement them and influence a person's turn up.
We found that people with high risk illnesses are more likely to turn up than the rest. These sicknesses are Hipertension and Diabetes. This is plausible because failure to come for a check up or to get treatment might be fatal, so these people have to show up.
Age is another great contributor, and we see that vulnerable people (very young babies or very old people) are more likely to show up than the rest of the population.
We noticed, however, that regardless of sickness there were other factors that contributed. From our analysis, sending people a reminder SMS was more effective and resulted in more people turning up for their appointments.
Although we might have expected people with Scholarship to turn up more, they actually turn up less. This might be because they don't bear the responsibility of paying and might not take appointments as seriously, since they can always book anytime without having to worry about getting money for the appointment.
## So the main factors influencing whether or not a person will show up are : Type of Sickness, Age and whether they received an SMS

###### Recommendations : The hospital should send more messages to remind people of their appointments. More analysis can be directed towards why some people didn't receive messages and what can be done about that


#### Limitations

###### Our data didn't give us any background on the demographics of the different neighbourhoods. If we had this, we could have analysed the impact of neighbourhood on turn up. That would have been an important insight, because neighbourhood highlights economic status, which is also an important factor to consider. Usually people in the same neighbourhood share similar attitudes towards our dependent variable.