## Data Dictionary
Age: age of patient
Gender: M or F
Register Time: date and time appointment was made
Apointment: date of appointment
Day: day of the week of appointment

Diabetes: 0 or 1 for condition (1 means patient was scheduled to treat condition)
Drinks: 0 or 1 for alcoholism
HyperTension: 0 or 1 for condition
Handicap: 0 or 1 for condition

Smoker: 0 or 1 for non-smoker / smoker
Scholarship: 0 or 1 indicating whether the family of the patient takes part in the Bolsa Familia Program, an initiative that provides families with small cash transfers in exchange for keeping children in school and completing health care visits

Tuberculosis: 1 or 0 for condition (1 means patient was scheduled to treat condition)
Sms_Reminder: 0, 1, or 2 for number of text message reminders sent to patient about appointment
WaitingTime: integer number of days between when the appointment was made and when the appointment took place

Show Up: Yes or No for whether the patient attended the appointment

** Information on scholarship
https://www.worldbank.org/en/news/opinion/2013/11/04/bolsa-familia-Brazil-quiet-revolution
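
The WaitingTime entry above is a day count between registration and appointment. A minimal sketch of that calculation follows. The timestamp formats and the helper name `waiting_days` are assumptions for illustration, as is the choice to count whole calendar days; the source does not say how partial days or the sign are handled in the raw file.

```python
from datetime import datetime


def waiting_days(register_time: str, appointment_date: str) -> int:
    """Calendar days from registration to appointment (hypothetical helper)."""
    # Assumed formats: registration carries a time of day, the appointment does not.
    registered = datetime.strptime(register_time, "%Y-%m-%d %H:%M:%S").date()
    appointment = datetime.strptime(appointment_date, "%Y-%m-%d").date()
    # Subtracting two dates yields a timedelta; .days is the whole-day difference.
    return (appointment - registered).days


# Made-up example values, not rows from the dataset.
print(waiting_days("2014-12-16 14:46:25", "2015-01-14"))
```

Dropping the time of day before subtracting means a same-day appointment counts as 0 days regardless of the registration hour.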

DATA CLEANING

In [18]:
import pandas as pd
import sqlite3
from werkzeug.security import generate_password_hash
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 20)
pd.set_option('display.width', 300)
pd.set_option('max_colwidth', 20)

In [29]:
df = pd.read_csv('https://raw.githubusercontent.com/ongcp97/BC3407-Team-7/main/appointmentData.csv')
df = df.rename({'index': 'appointment_id',
                'Show Up':'Show_Up',
                'Waiting Time':'Waiting_Time',
                'Register Time':'Register_Time',
                }, axis=1)

df['Show_Up'] = df['Show_Up'].replace('Yes', 1).replace('No', 0)
df['Gender'] = df['Gender'].replace('M', 1).replace('F', 0)
df['Day'] = df['Day'].replace(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'],
                              [1, 2, 3, 4, 5, 6, 7])
df['Waiting_Time'] = df['Waiting_Time'] * -1  # flip the sign of Waiting_Time
df = df[df['Age'] >= 0]  # remove negative ages
df['Handicap'] = df['Handicap'].replace([2, 3, 4], 1)  # Replace handicap >1 with handicap==1
df['Register_Time'] = pd.to_datetime(df['Register_Time'])
df = df.rename({'Apointment': 'Appointment'}, axis=1)
df['Appointment'] = pd.to_datetime(df['Appointment'])

df['Appointment_Month'] = df['Appointment'].dt.month
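# .dt.week was removed in pandas 2.0; .dt.isocalendar().week is its replacement
df['Appointment_Week_Number'] = df['Appointment'].dt.isocalendar().week.astype(int)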

cols = ['Gender', 'Day', 'Diabetes', 'Drinks', 'HyperTension', 'Handicap', 'Smoker', 'Scholarship', 'Tuberculosis',
        'Sms_Reminder', 'Show_Up', 'Appointment_Week_Number', 'Appointment_Month']
df.loc[:, cols] = df.loc[:, cols].astype('category')
# print(df.info())



NORMALIZING DATA INTO PATIENTS AND APPOINTMENTS TABLES + GENERATING USERS TABLE IN SQLITE DB

In [20]:
df1 = df.copy()

# Each row is an Appointment ID, sorted by registered time
df1 = df1.sort_values(by=['Register_Time'], ascending=True).reset_index(drop=True)
df1 = df1.reset_index(drop=False)
df1 = df1.rename({'index': 'appointment_id',
                }, axis=1)

# Creating Patient IDs for Appointments: rows with identical values in the
# `comparison` columns are treated as the same patient
comparison = ['Age', 'Gender', 'Diabetes', 'Drinks', 'HyperTension', 'Handicap', 'Smoker', 'Scholarship',
              'Tuberculosis']
df1['patient_id'] = df1.groupby(comparison).ngroup()
first_appts = df1.groupby(['patient_id'])['Appointment'].min().reset_index(drop=False)
first_appts.columns = ['patient_id','first_appt']
df1 = df1.merge(first_appts,how='left',on='patient_id')
df1 = df1.sort_values(by=['patient_id', 'appointment_id'], ascending=True)

#normalizing tables into (1) patients table (2) appointments table
df1_patients = df1[['patient_id','first_appt'] + comparison]
df1_patients = df1_patients.drop_duplicates(subset=['patient_id'] + comparison, keep='first')

appt_cols = ['appointment_id', 'patient_id', 'Register_Time', 'Appointment', 'Day', 'Sms_Reminder', 'Waiting_Time',
             'Show_Up', 'Appointment_Month', 'Appointment_Week_Number']
df1_appointments = df1[appt_cols]

# Transferring to sqlite
conn = sqlite3.connect('dash_app/assets/hospital_database.db')
c = conn.cursor()
df1_patients.to_sql('patients', conn, if_exists='replace', index=False)
df1_appointments.to_sql('appointments', conn, if_exists='replace', index=False)

# Initializing Data Portal Login User Accounts
login_df1 = pd.DataFrame([['admin', 0, generate_password_hash('admin1234')],
                         ['nurse', 1, generate_password_hash('nurse1234')],
                         ],
                        columns=['user_id', 'access_level', 'password'])
login_df1.to_sql('users', conn, if_exists='replace', index=False)
conn.commit()
conn.close()  # release the database file once all three tables are written
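
Once the patients and appointments tables are stored separately, they can be recombined on `patient_id`. The sketch below uses an in-memory database with made-up rows and only a few of the columns, so it illustrates the join, not the real data. The per-patient show-up rate query is an assumed use case.

```python
import sqlite3

# In-memory stand-in for hospital_database.db; all rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patients (patient_id INTEGER PRIMARY KEY, Age INTEGER, Gender INTEGER);
CREATE TABLE appointments (appointment_id INTEGER PRIMARY KEY,
                           patient_id INTEGER, Show_Up INTEGER);
INSERT INTO patients VALUES (0, 34, 1), (1, 61, 0);
INSERT INTO appointments VALUES (0, 0, 1), (1, 1, 0), (2, 0, 0), (3, 1, 1), (4, 1, 1);
""")

# Join each appointment to its patient, then aggregate per patient.
rows = conn.execute("""
SELECT p.patient_id, p.Age, COUNT(*) AS n_appts, AVG(a.Show_Up) AS show_rate
FROM appointments AS a
JOIN patients AS p ON p.patient_id = a.patient_id
GROUP BY p.patient_id, p.Age
ORDER BY p.patient_id
""").fetchall()
conn.close()

for patient_id, age, n_appts, show_rate in rows:
    print(patient_id, age, n_appts, round(show_rate, 2))
```

Because `Show_Up` is stored as 0 or 1, its average is directly the fraction of attended appointments.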

CART MODEL

In [21]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
import numpy
from sklearn import preprocessing,tree

In [22]:
df_cart = df.copy()
df_cart = df_cart[[c for c in df.columns if c not in ['Show_Up','Appointment','Register_Time']]+['Show_Up']]
df_cart = df_cart.reset_index(drop=True)
Y_position = len(df_cart.columns)-1 # Last column = Show Up

# fix random seed for reproducibility
numpy.random.seed(7)

# Train Test Split
X = df_cart.iloc[:, 0:Y_position]
    # Day Sms_Reminder  Waiting_Time Appointment_Month Appointment_Week_Number  Age Gender Diabetes Drinks HyperTension Handicap Smoker Scholarship Tuberculosis
Y = df_cart.iloc[:, Y_position]
    # Show_Up
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30, random_state=2020)

# Normalizing X-variables
scaler = preprocessing.StandardScaler().fit(X_train)
scaled_X_train = scaler.transform(X_train)
scaled_X_test = scaler.transform(X_test)

# create model
clf = tree.DecisionTreeClassifier()
clf = clf.fit(scaled_X_train, y_train)
y_pred_train2 = clf.predict(scaled_X_train)
cm2_train = confusion_matrix(y_train,y_pred_train2)
print("Decision Tree")
print("================================")
print(cm2_train)
acc_train2 = (cm2_train[0,0] + cm2_train[1,1]) / cm2_train.sum()
print("Decision Tree Trainset: Accuracy %.2f%%" % (acc_train2*100))
print("================================") # 97.38%
y_pred2 = clf.predict(scaled_X_test)
cm2 = confusion_matrix(y_test,y_pred2)
acc2 = (cm2[0,0] + cm2[1,1]) / cm2.sum()
print(cm2)
print("Decision Tree Testset: Accuracy %.2f%%" % (acc2*100))
print("================================") # 58.40%

#save the model to a file
import joblib
joblib.dump(clf, "dash_app/assets/Cart Model")


Decision Tree
[[ 62619    782]
 [  4773 141821]]
Decision Tree Trainset: Accuracy 97.35%
[[ 9969 17359]
 [20047 42624]]
Decision Tree Testset: Accuracy 58.44%


['dash_app/assets/Cart Model']
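
The test-set confusion matrix printed above can be broken down further than a single accuracy figure. The sketch below takes the four counts from that output and derives per-class recall plus the accuracy of always predicting the majority class. The helper name `summarize` is hypothetical. It relies on scikit-learn's convention that rows are true labels and columns are predictions, with 0 = no-show and 1 = show.

```python
# Test-set confusion matrix copied from the printed output above.
cm_test = [[9969, 17359],
           [20047, 42624]]


def summarize(cm):
    """Derive summary metrics from a 2x2 confusion matrix (hypothetical helper)."""
    (tn, fp), (fn, tp) = cm  # row 0: true no-shows, row 1: true shows
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "recall_no_show": tn / (tn + fp),
        "recall_show": tp / (fn + tp),
        # Accuracy obtained by always predicting the more frequent class.
        "majority_baseline": max(tn + fp, fn + tp) / total,
    }


metrics = summarize(cm_test)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

On these counts the majority-class baseline (about 69.6%) is higher than the tree's test accuracy (about 58.4%), which, together with the 97% training accuracy, points to an overfitted tree.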

In [32]:
# Test Run of trained CART Model
import joblib
loaded_model = joblib.load('dash_app/assets/Cart Model')
trial_prediction=df.copy()
trial_prediction = trial_prediction[[c for c in df.columns if c not in ['Show_Up','Appointment','Register_Time']]+['Show_Up']]
trial_prediction = trial_prediction.reset_index(drop=True)
Y_position = len(trial_prediction.columns)-1 # Last column = Show Up
X = trial_prediction.iloc[:, 0:Y_position]
# Caveat: the scaler is refit on the full dataset here instead of reusing the one fitted on X_train,
# and the full dataset mixes training and test rows, so this accuracy is not a held-out estimate.
scaler = preprocessing.StandardScaler().fit(X)
scaled_X = scaler.transform(X)
y_pred = loaded_model.predict(scaled_X)

new_df = df.copy()
new_df['pred'] = y_pred
difference = abs(new_df['pred'].astype(int)-new_df['Show_Up'].astype(int)).sum()
accuracy = (len(df.index)-difference)/(len(df.index))
print(accuracy)
# df.to_csv('dash_app/assets/trial_prediction.csv',index=False)

0.8567338013426935
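
The test run above fits a fresh StandardScaler on the full dataset before calling the loaded model, because only the tree was saved. One possible alternative, sketched here on synthetic data, is to wrap the scaler and the tree in a scikit-learn `Pipeline` so both are saved and restored together. The data, the file name `cart_pipeline.joblib`, and `random_state=0` are assumptions for illustration and not part of the original notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the appointment features (not the real data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=2020
)

# The scaler is fitted on the training rows only, inside the pipeline.
pipeline = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))
pipeline.fit(X_tr, y_tr)

# Save and reload the whole pipeline, scaler included.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "cart_pipeline.joblib")
    joblib.dump(pipeline, model_path)
    restored = joblib.load(model_path)

# The restored pipeline applies the training-set scaling automatically.
same_predictions = bool((restored.predict(X_te) == pipeline.predict(X_te)).all())
held_out_accuracy = restored.score(X_te, y_te)
print(same_predictions, round(held_out_accuracy, 3))
```

Scoring only on `X_te` keeps the reported accuracy a held-out estimate, unlike a score computed over rows the model was trained on.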
