# Project 1 - Medical Appointment No Shows

---

- Data: https://www.kaggle.com/datasets/joniarroba/noshowappointments?select=KaggleV2-May-2016.csv
- Question: Is it possible to predict whether a patient will miss (no-show) an appointment?
  
---
 

## Clean Data 

In [1]:
# Dependencies and Setup
import pandas as pd
from datetime import datetime

# Read in data
df = pd.read_csv("KaggleV2-May-2016.csv")

In [2]:
# Column names
df.columns


Index(['PatientId', 'AppointmentID', 'Gender', 'ScheduledDay',
       'AppointmentDay', 'Age', 'Neighbourhood', 'Scholarship', 'Hipertension',
       'Diabetes', 'Alcoholism', 'Handcap', 'SMS_received', 'No-show'],
      dtype='object')

In [3]:
# Standardize the column names (the misspelled 'Handcap' is left unchanged)
df.rename(columns={'AppointmentID':'Appointment_ID',
                   'PatientId':'Patient_ID',
                   'ScheduledDay':'Scheduled_Day',
                   'AppointmentDay':'Appointment_Day',
                   'Hipertension':'Hypertension',
                   'SMS_received':'SMS_Received',
                   'No-show':'No_Show'
                   },
                   inplace=True)
df.columns

Index(['Patient_ID', 'Appointment_ID', 'Gender', 'Scheduled_Day',
       'Appointment_Day', 'Age', 'Neighbourhood', 'Scholarship',
       'Hypertension', 'Diabetes', 'Alcoholism', 'Handcap', 'SMS_Received',
       'No_Show'],
      dtype='object')

In [4]:
# Show the first 5 rows of the dataset
df.head()

Unnamed: 0,Patient_ID,Appointment_ID,Gender,Scheduled_Day,Appointment_Day,Age,Neighbourhood,Scholarship,Hypertension,Diabetes,Alcoholism,Handcap,SMS_Received,No_Show
0,29872500000000.0,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,558997800000000.0,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,4262962000000.0,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,867951200000.0,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,8841186000000.0,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,1,1,0,0,0,No


In [5]:
# Info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   Patient_ID       110527 non-null  float64
 1   Appointment_ID   110527 non-null  int64  
 2   Gender           110527 non-null  object 
 3   Scheduled_Day    110527 non-null  object 
 4   Appointment_Day  110527 non-null  object 
 5   Age              110527 non-null  int64  
 6   Neighbourhood    110527 non-null  object 
 7   Scholarship      110527 non-null  int64  
 8   Hypertension     110527 non-null  int64  
 9   Diabetes         110527 non-null  int64  
 10  Alcoholism       110527 non-null  int64  
 11  Handcap          110527 non-null  int64  
 12  SMS_Received     110527 non-null  int64  
 13  No_Show          110527 non-null  object 
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB


The data has no null values.

In [6]:
# Check for duplicated rows in the data
df.duplicated().sum()

0

The data has no fully duplicated rows.
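
Called without arguments, `df.duplicated()` only flags rows that match on every column. If each row carries its own `Appointment_ID`, that check cannot reveal patients who appear more than once. A minimal sketch on a made-up three-row frame (the values are illustrative, not taken from the dataset) shows how the `subset` argument changes the answer:

```python
import pandas as pd

# Toy frame with hypothetical values, mimicking the two ID columns
toy = pd.DataFrame({
    "Patient_ID": [1, 1, 2],
    "Appointment_ID": [100, 101, 102],
})

# Full-row duplicates: none, because Appointment_ID differs on every row
full_row_dupes = int(toy.duplicated().sum())

# Duplicates on Patient_ID only: the second visit of patient 1 is flagged
repeat_patient_rows = int(toy.duplicated(subset=["Patient_ID"]).sum())

print(full_row_dupes, repeat_patient_rows)
```

Whether repeat patients matter depends on the analysis; the point is only that a result of 0 from the full-row check does not rule them out.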

In [7]:
# Describe the data 
df.describe().T.style.background_gradient()

| | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Patient_ID | 110527.0 | 147496265710394.06 | 256094920291738.84 | 39217.84439 | 4172614444192.0 | 31731838713978.0 | 94391720898175.02 | 999981631772427.0 |
| Appointment_ID | 110527.0 | 5675305.123427 | 71295.75154 | 5030230.0 | 5640285.5 | 5680573.0 | 5725523.5 | 5790484.0 |
| Age | 110527.0 | 37.088874 | 23.110205 | -1.0 | 18.0 | 37.0 | 55.0 | 115.0 |
| Scholarship | 110527.0 | 0.098266 | 0.297675 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Hypertension | 110527.0 | 0.197246 | 0.397921 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Diabetes | 110527.0 | 0.071865 | 0.258265 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Alcoholism | 110527.0 | 0.0304 | 0.171686 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| Handcap | 110527.0 | 0.022248 | 0.161543 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 |
| SMS_Received | 110527.0 | 0.321026 | 0.466873 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |


- Age: the average patient age is about 37 years; the minimum is -1, which is not a valid age and is most likely a data-entry error.
- Scholarship: 9.8% of patients are enrolled in the welfare program.
- Hypertension: 19.7% of patients have hypertension.
- Diabetes: 7.2% of patients have diabetes.
- Alcoholism: 3.0% of patients have alcoholism.
- Handcap: the mean is 0.022, but the column ranges from 0 to 4 rather than 0 to 1, so 2.2% is an upper bound on the share of patients with a handicap, not the share itself.
- SMS_Received: 32.1% of patients received one or more SMS messages.
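
The percentages above are read from the `mean` column of `describe()`, which equals a proportion only when a column is strictly 0/1. `Handcap` has a maximum of 4, so its mean is not a plain share. The sketch below uses a small made-up series (hypothetical values, not the real column) to show the difference between the mean and the share of non-zero rows:

```python
import pandas as pd

# Toy column with hypothetical values; levels above 1 occur, as in Handcap
handcap = pd.Series([0, 0, 0, 0, 0, 0, 1, 0, 2, 0])

# The mean weights each row by its level, so the level-2 row counts twice
mean_value = handcap.mean()

# Share of rows with any non-zero level
share_nonzero = (handcap > 0).mean()

# Full breakdown: share of rows at each level
level_shares = handcap.value_counts(normalize=True).sort_index()

print(mean_value, share_nonzero)
print(level_shares)
```

On the real data, `(df['Handcap'] > 0).mean()` or `df['Handcap'].value_counts(normalize=True)` would give the corresponding figures.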

In [8]:
# Drop the single row where Age is -1 (not a valid age)
df = df[df['Age'] != -1]
df = df.reset_index(drop=True)
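
Before dropping rows, it can help to count how many fail the check. The following is a minimal sketch on made-up ages; the upper bound of 120 is an assumption chosen for illustration, not a rule from the dataset:

```python
import pandas as pd

# Toy ages (hypothetical values), including one invalid entry
toy = pd.DataFrame({"Age": [62, -1, 8, 115, 0]})

# Flag ages outside an assumed plausible range of 0 to 120 (inclusive)
invalid = ~toy["Age"].between(0, 120)
n_invalid = int(invalid.sum())

# Keep only the valid rows and rebuild a clean 0..n-1 index
cleaned = toy.loc[~invalid].reset_index(drop=True)

print(n_invalid)
print(cleaned)
```

A range check like this also catches other impossible values, whereas `!= -1` removes only that one value.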

In [9]:
# Describe each "object" type 
df.describe(include='object').T

| | count | unique | top | freq |
| --- | --- | --- | --- | --- |
| Gender | 110526 | 2 | F | 71839 |
| Scheduled_Day | 110526 | 103548 | 2016-05-06T07:09:54Z | 24 |
| Appointment_Day | 110526 | 27 | 2016-06-06T00:00:00Z | 4691 |
| Neighbourhood | 110526 | 81 | JARDIM CAMBURI | 7717 |
| No_Show | 110526 | 2 | No | 88207 |


- Gender: 2 unique values. The most frequent is Female (F), with 71,839 of 110,526 appointments (about 65%).
- Neighbourhood: 81 unique values. The most frequent is JARDIM CAMBURI, with 7,717 appointments.
- No_Show: 2 unique values. The most frequent is No, with 88,207 appointments (about 79.8%), so most appointments were kept.
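
Because `No` in `No_Show` means the appointment was kept, the no-show rate is the share of `Yes` values. A minimal sketch on a made-up five-row series (hypothetical values) shows two equivalent ways to get it:

```python
import pandas as pd

# Toy outcome column with hypothetical values
no_show = pd.Series(["No", "No", "Yes", "No", "Yes"], name="No_Show")

# Share of each label
shares = no_show.value_counts(normalize=True)

# Same rate computed directly from a boolean mask
no_show_rate = (no_show == "Yes").mean()

print(shares)
print(no_show_rate)
```

On the cleaned frame, the same two lines applied to `df['No_Show']` would give the real rate.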

## Add columns

In [10]:

# Convert the Scheduled_Day and Appointment_Day to datetime
df['Scheduled_Day'] = pd.to_datetime(df['Scheduled_Day'])
df['Appointment_Day'] = pd.to_datetime(df['Appointment_Day'])
# Extract the date (as a string) and the hour from Scheduled_Day
df['Scheduled_Day_Date'] = df['Scheduled_Day'].dt.strftime("%Y-%m-%d")
df['Scheduled_Hour'] = df['Scheduled_Day'].dt.hour
# Get appointment day of week
df['Appointment_Day_of_Week'] = df['Appointment_Day'].dt.day_name()
# Remove timestamp from Appointment Day (since it is set at midnight for all rows)
df['Appointment_Day_Date'] = df['Appointment_Day'].dt.strftime("%Y-%m-%d")
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110526 entries, 0 to 110525
Data columns (total 18 columns):
 #   Column                   Non-Null Count   Dtype              
---  ------                   --------------   -----              
 0   Patient_ID               110526 non-null  float64            
 1   Appointment_ID           110526 non-null  int64              
 2   Gender                   110526 non-null  object             
 3   Scheduled_Day            110526 non-null  datetime64[ns, UTC]
 4   Appointment_Day          110526 non-null  datetime64[ns, UTC]
 5   Age                      110526 non-null  int64              
 6   Neighbourhood            110526 non-null  object             
 7   Scholarship              110526 non-null  int64              
 8   Hypertension             110526 non-null  int64              
 9   Diabetes                 110526 non-null  int64              
 10  Alcoholism               110526 non-null  int64              
 11  Handcap      

Unnamed: 0,Patient_ID,Appointment_ID,Gender,Scheduled_Day,Appointment_Day,Age,Neighbourhood,Scholarship,Hypertension,Diabetes,Alcoholism,Handcap,SMS_Received,No_Show,Scheduled_Day_Date,Scheduled_Hour,Appointment_Day_of_Week,Appointment_Day_Date
0,29872500000000.0,5642903,F,2016-04-29 18:38:08+00:00,2016-04-29 00:00:00+00:00,62,JARDIM DA PENHA,0,1,0,0,0,0,No,2016-04-29,18,Friday,2016-04-29
1,558997800000000.0,5642503,M,2016-04-29 16:08:27+00:00,2016-04-29 00:00:00+00:00,56,JARDIM DA PENHA,0,0,0,0,0,0,No,2016-04-29,16,Friday,2016-04-29
2,4262962000000.0,5642549,F,2016-04-29 16:19:04+00:00,2016-04-29 00:00:00+00:00,62,MATA DA PRAIA,0,0,0,0,0,0,No,2016-04-29,16,Friday,2016-04-29
3,867951200000.0,5642828,F,2016-04-29 17:29:31+00:00,2016-04-29 00:00:00+00:00,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No,2016-04-29,17,Friday,2016-04-29
4,8841186000000.0,5642494,F,2016-04-29 16:07:23+00:00,2016-04-29 00:00:00+00:00,56,JARDIM DA PENHA,0,1,1,0,0,0,No,2016-04-29,16,Friday,2016-04-29


In [11]:
df["Days_Between"] = ""

In [12]:
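def days_between(d1, d2):
    # Parse both "YYYY-MM-DD" strings and return the absolute gap in whole days
    d1 = datetime.strptime(d1, "%Y-%m-%d")
    d2 = datetime.strptime(d2, "%Y-%m-%d")
    # Note: abs() discards the sign, so an appointment dated before its scheduling date still yields a positive gap
    return abs((d2 - d1).days)

# Fill Days_Between row by row (the index was reset earlier, so labels run 0..n-1)
for i in range(df.shape[0]):
    df.loc[i, "Days_Between"] = days_between(df["Scheduled_Day_Date"][i], df["Appointment_Day_Date"][i])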


In [13]:
# Check min and max days between
[df["Days_Between"].min(),df["Days_Between"].max()]

[0, 179]
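
The row-by-row loop calls `df.loc` once per appointment, which can be slow on roughly 110,000 rows, and `abs()` hides any row where the appointment date precedes the scheduling date. As a hedged alternative, the sketch below computes the same gap with vectorized datetime arithmetic on a made-up three-row frame (hypothetical values) and keeps the signed result so negative gaps can be counted:

```python
import pandas as pd

# Toy frame with hypothetical timestamps in the same format as the raw data
toy = pd.DataFrame({
    "Scheduled_Day": ["2016-04-27T08:36:51Z", "2016-04-29T16:08:27Z", "2016-05-10T09:00:00Z"],
    "Appointment_Day": ["2016-04-29T00:00:00Z", "2016-04-29T00:00:00Z", "2016-05-09T00:00:00Z"],
})

# Parse to datetimes and truncate to midnight so only calendar dates are compared
scheduled = pd.to_datetime(toy["Scheduled_Day"]).dt.normalize()
appointment = pd.to_datetime(toy["Appointment_Day"]).dt.normalize()

# Signed gap in whole days; negative means the appointment predates scheduling
signed_days = (appointment - scheduled).dt.days

# Absolute gap, matching what days_between() returns
toy["Days_Between"] = signed_days.abs()

# Number of rows whose sign was discarded by abs()
n_negative = int((signed_days < 0).sum())

print(toy["Days_Between"].tolist(), n_negative)
```

This also stores `Days_Between` as an integer column from the start, rather than initializing it with empty strings.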

In [14]:
df.to_csv('data_cleaned.csv', index=False)  

## Load Cleaned Data for Analysis (do this in a separate notebook)

In [15]:
# Dependencies and Setup
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
import scipy.stats as st # for linear regression model
import numpy as np

# Read in data
df = pd.read_csv("data_cleaned.csv")
df.head()

Unnamed: 0,Patient_ID,Appointment_ID,Gender,Scheduled_Day,Appointment_Day,Age,Neighbourhood,Scholarship,Hypertension,Diabetes,Alcoholism,Handcap,SMS_Received,No_Show,Scheduled_Day_Date,Scheduled_Hour,Appointment_Day_of_Week,Appointment_Day_Date,Days_Between
0,29872500000000.0,5642903,F,2016-04-29 18:38:08+00:00,2016-04-29 00:00:00+00:00,62,JARDIM DA PENHA,0,1,0,0,0,0,No,2016-04-29,18,Friday,2016-04-29,0
1,558997800000000.0,5642503,M,2016-04-29 16:08:27+00:00,2016-04-29 00:00:00+00:00,56,JARDIM DA PENHA,0,0,0,0,0,0,No,2016-04-29,16,Friday,2016-04-29,0
2,4262962000000.0,5642549,F,2016-04-29 16:19:04+00:00,2016-04-29 00:00:00+00:00,62,MATA DA PRAIA,0,0,0,0,0,0,No,2016-04-29,16,Friday,2016-04-29,0
3,867951200000.0,5642828,F,2016-04-29 17:29:31+00:00,2016-04-29 00:00:00+00:00,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No,2016-04-29,17,Friday,2016-04-29,0
4,8841186000000.0,5642494,F,2016-04-29 16:07:23+00:00,2016-04-29 00:00:00+00:00,56,JARDIM DA PENHA,0,1,1,0,0,0,No,2016-04-29,16,Friday,2016-04-29,0
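
A CSV file stores no dtype information, so datetime columns written with `to_csv` are read back as plain text unless they are parsed again. The sketch below illustrates this with an in-memory CSV; the two rows are a cut-down, illustrative stand-in for `data_cleaned.csv`:

```python
import io
import pandas as pd

# In-memory stand-in for a few columns of the cleaned CSV (illustrative rows)
csv_text = (
    "Appointment_ID,Scheduled_Day,Appointment_Day\n"
    "5642903,2016-04-29 18:38:08+00:00,2016-04-29 00:00:00+00:00\n"
    "5642503,2016-04-29 16:08:27+00:00,2016-04-29 00:00:00+00:00\n"
)

# Plain read: the timestamp columns come back as text
plain = pd.read_csv(io.StringIO(csv_text))

# Re-parse the two columns as timezone-aware datetimes
parsed = plain.copy()
for col in ["Scheduled_Day", "Appointment_Day"]:
    parsed[col] = pd.to_datetime(parsed[col], utc=True)

print(plain.dtypes)
print(parsed.dtypes)
```

After re-parsing, accessors such as `.dt.hour` and `.dt.day_name()` work again in the analysis notebook.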


In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110526 entries, 0 to 110525
Data columns (total 19 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   Patient_ID               110526 non-null  float64
 1   Appointment_ID           110526 non-null  int64  
 2   Gender                   110526 non-null  object 
 3   Scheduled_Day            110526 non-null  object 
 4   Appointment_Day          110526 non-null  object 
 5   Age                      110526 non-null  int64  
 6   Neighbourhood            110526 non-null  object 
 7   Scholarship              110526 non-null  int64  
 8   Hypertension             110526 non-null  int64  
 9   Diabetes                 110526 non-null  int64  
 10  Alcoholism               110526 non-null  int64  
 11  Handcap                  110526 non-null  int64  
 12  SMS_Received             110526 non-null  int64  
 13  No_Show                  110526 non-null  object 
 14  Sche