# Load
#### The appointments table was created via the command line, and the data was copied in from missing_appointment_csv.csv

In [14]:
import os
import glob
import psycopg2
import pandas as pd
from sqlalchemy import create_engine

In [15]:
%load_ext sql

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


#### Connect to Postgres Database

In [16]:
%sql postgresql://postgres:trublue6@localhost/dataengineering

# Data Exploration: What is the reason for missed appointments?

### 1. PatientId Analysis: Discover whether the missed appointments come from the same patients.

In [43]:
%%sql 
-- how many missed appointments belong to repeat patients (more than 2 missed appointments each)
SELECT COUNT(DISTINCT patientid) unique_patients, ROUND(AVG(sum_patients),2) avg_repeat_app, SUM(sum_patients) sum_patients
FROM
	-- one row per patient with more than 2 missed appointments
	(SELECT patientid, COUNT(*) sum_patients
	FROM appointments
	WHERE noshow = 'Yes'
	GROUP BY 1
	HAVING COUNT(*) > 2) patients;

 * postgresql://postgres:***@localhost/dataengineering
1 rows affected.


unique_patients,avg_repeat_app,sum_patients
808,3.77,3046


In [40]:
%%sql 
-- distinct patients with at least one missed appointment, and total missed appointments
SELECT COUNT(DISTINCT patientid) all_missedapp_patients, COUNT(appointmentid) total_missedapp
FROM appointments
WHERE noshow = 'Yes';

 * postgresql://postgres:***@localhost/dataengineering
1 rows affected.


all_missedapp_patients,total_missedapp
17663,22319


#### Patient Analysis Result:

> I wanted to analyze the patient IDs to quantify how much of the missed appointments came from repeat patients
> (those with more than two missed appointments). I found that repeat patients accounted for about 13.6%
> (3,046 / 22,319) of the total missed appointments, while making up about 4.6% (808 / 17,663) of the patients who
> missed at least one appointment. Repeat patients do not account for a significant portion of the missed appointments.
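
The two shares in this result can be re-derived from the query outputs. The following is a minimal sketch in plain Python that uses only the four numbers printed by the two cells; the variable names are illustrative and do not come from the notebook.

```python
# Figures copied from the two query outputs above.
repeat_patients = 808          # patients with more than 2 missed appointments
repeat_missed = 3046           # missed appointments belonging to those patients
all_missed_patients = 17663    # distinct patients with at least one missed appointment
total_missed = 22319           # all missed appointments

# Share of missed appointments that belong to repeat patients.
share_of_missed = repeat_missed / total_missed
# Share of no-show patients who are repeat patients.
share_of_patients = repeat_patients / all_missed_patients

print(f"Share of missed appointments: {share_of_missed:.1%}")
print(f"Share of no-show patients: {share_of_patients:.1%}")
```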

### 2. Neighborhood Analysis: Discover the correlation between missed appointments & neighborhoods.

In [26]:
%%sql

SELECT neighborhood, count(*) 
FROM appointments
WHERE noshow = 'Yes'
GROUP BY 1
ORDER BY 2 desc;

 * postgresql://postgres:***@localhost/dataengineering
80 rows affected.


neighborhood,count
JARDIM CAMBURI,1465
MARIA ORTIZ,1219
ITARARÉ,923
RESISTÊNCIA,906
CENTRO,703
JESUS DE NAZARETH,696
JARDIM DA PENHA,631
CARATOÍRA,591
TABUAZEIRO,573
BONFIM,550


#### Neighborhood Analysis Result:

> I wanted to analyze the neighborhoods to identify what percentage of the missed appointments come from specific
> neighborhoods. I did not discover an obvious significant correlation: each of the top ten neighborhoods made up under
> 10% of the total missed appointments (the largest, JARDIM CAMBURI, about 6.6%), although together they account for
> about 37%. There may still be opportunities for a shuttle or another location for those areas.
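
To put the neighborhood counts in proportion, each one can be divided by the total number of missed appointments. This sketch assumes the ten rows shown in the output are the ten largest of the 80 returned, and reuses the total of 22,319 from the patient analysis; the names `top_ten` and `shares` are illustrative.

```python
# Counts copied from the ten rows displayed in the query output.
top_ten = {
    "JARDIM CAMBURI": 1465,
    "MARIA ORTIZ": 1219,
    "ITARARÉ": 923,
    "RESISTÊNCIA": 906,
    "CENTRO": 703,
    "JESUS DE NAZARETH": 696,
    "JARDIM DA PENHA": 631,
    "CARATOÍRA": 591,
    "TABUAZEIRO": 573,
    "BONFIM": 550,
}
total_missed = 22319  # total missed appointments from the patient analysis

# Per-neighborhood share and the combined share of the ten rows shown.
shares = {name: count / total_missed for name, count in top_ten.items()}
combined_share = sum(top_ten.values()) / total_missed
largest = max(shares, key=shares.get)

print(f"Largest: {largest} at {shares[largest]:.1%}")
print(f"Top ten combined: {combined_share:.1%}")
```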

### 3. Day of Week Analysis: Discover the correlation between missed appointments & weekday of appointment.

In [29]:
%%sql

SELECT scheduleddayofweek, COUNT(*)
FROM appointments
WHERE noshow = 'Yes'
GROUP BY 1
ORDER BY 2 desc;

 * postgresql://postgres:***@localhost/dataengineering
6 rows affected.


scheduleddayofweek,count
Tuesday,5291
Wednesday,4879
Monday,4561
Friday,3887
Thursday,3700
Saturday,1


#### Day of Week Analysis Result:

> After analyzing the correlation between day of week and missed appointments, I discovered the missed appointments are
> fairly consistent across the weekdays, ranging from 3,700 (Thursday) to 5,291 (Tuesday), with a single missed
> appointment on Saturday. There is no specific day of the week with a substantial increase in missed appointments.
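
The spread across days is easier to judge as shares of the total than as raw counts. A minimal sketch using the six rows from the output; the dictionary name `missed_by_day` is illustrative. These shares do not account for how many appointments exist per day, which the query does not report.

```python
# Counts copied from the query output above.
missed_by_day = {
    "Tuesday": 5291,
    "Wednesday": 4879,
    "Monday": 4561,
    "Friday": 3887,
    "Thursday": 3700,
    "Saturday": 1,
}

total_missed = sum(missed_by_day.values())
# Share of all missed appointments that fall on each day.
day_shares = {day: count / total_missed for day, count in missed_by_day.items()}

for day, share in day_shares.items():
    print(f"{day}: {share:.1%}")
```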

### 4. SMS Analysis: Discover the correlation between patients who received an SMS text and missed appointments.

In [34]:
%%sql

-- share of missed appointments where the patient did not receive an SMS
SELECT ROUND(AVG(CASE WHEN sms = 'No' THEN 1 ELSE 0 END), 2) AS sms_total_percentage
FROM appointments
WHERE noshow = 'Yes';


 * postgresql://postgres:***@localhost/dataengineering
1 rows affected.


sms_total_percentage
0.56


#### SMS Analysis Result:

> I compared missed appointments for patients who received an SMS versus those who did not. I discovered that 56% of
> missed appointments were from patients who did not receive an SMS, compared to 44% from patients who did. While both
> percentages are significant, the facility may be able to decrease the number of missed appointments by ensuring SMS
> messages are consistently sent to patients.
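
The query above measures how missed appointments split by SMS status. A complementary check is the no-show rate within each SMS group, which needs all appointments rather than only the missed ones. The sketch below shows the shape of such a query against an in-memory SQLite table filled with made-up rows; the toy data says nothing about the real table, and only the column names `sms` and `noshow` are taken from the notebook. The same SQL should carry over to the Postgres table, though that is untested here.

```python
import sqlite3

# Made-up rows (sms, noshow) purely for illustration; not taken from the real table.
toy_rows = [
    ("No", "Yes"), ("No", "No"), ("No", "No"), ("No", "No"),
    ("Yes", "Yes"), ("Yes", "Yes"), ("Yes", "No"), ("Yes", "No"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE appointments (sms TEXT, noshow TEXT)")
conn.executemany("INSERT INTO appointments VALUES (?, ?)", toy_rows)

# No-show rate within each SMS group: missed appointments / all appointments in the group.
query = """
SELECT sms,
       AVG(CASE WHEN noshow = 'Yes' THEN 1.0 ELSE 0.0 END) AS noshow_rate,
       COUNT(*) AS appointments
FROM appointments
GROUP BY sms
ORDER BY sms
"""
noshow_rate_by_sms = {sms: rate for sms, rate, _ in conn.execute(query)}
conn.close()

print(noshow_rate_by_sms)
```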

### 5. Underlying Condition Analysis: Discover the correlation between one or more underlying conditions and missed appointments.

In [37]:
%%sql
SELECT
	SUM(CASE WHEN Hypertension = 'Yes' THEN 1
	WHEN Diabetes = 'Yes' THEN 1
	WHEN Alcoholism = 'Yes' THEN 1
	WHEN Handicap = 'Yes' THEN 1
	ELSE 0
	END) AS underlying_condition
FROM appointments
WHERE noshow ='Yes';

 * postgresql://postgres:***@localhost/dataengineering
1 rows affected.


underlying_condition
4700


#### Underlying Condition Analysis Result:

> I analyzed all patients who missed an appointment and had one or more underlying conditions. I
> found that about 21% (4,700 / 22,319) of missed appointments were from patients with an underlying condition.
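
The `CASE` expression in the query counts a missed appointment once if any of the four condition columns is 'Yes', no matter how many conditions apply. A small Python sketch of that rule, followed by the share calculation; the helper `has_underlying_condition` and the example row are hypothetical, and the total of 22,319 is reused from the patient analysis.

```python
CONDITIONS = ("hypertension", "diabetes", "alcoholism", "handicap")

def has_underlying_condition(row):
    """Return True if any condition column is 'Yes', mirroring the CASE expression."""
    return any(row.get(col) == "Yes" for col in CONDITIONS)

# Hypothetical row with two conditions: it still counts only once.
example = {"hypertension": "No", "diabetes": "Yes", "alcoholism": "Yes", "handicap": "No"}
counted = 1 if has_underlying_condition(example) else 0

# Share of missed appointments, using the query output and the earlier total.
with_condition = 4700
total_missed = 22319
share = with_condition / total_missed

print(f"Counted {counted} time(s); share with a condition: {share:.1%}")
```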