# 5. Machine Learning


After the EDA & Visualization phase, we can start predicting.
To do this, we first import the required libraries:

In [1]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import sklearn
from sklearn import linear_model, metrics, preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import r2_score, f1_score
from sklearn.model_selection import train_test_split
import warnings
%matplotlib inline
warnings.simplefilter(action="ignore", category=FutureWarning)

Load the dataset:

In [2]:
df = pd.read_csv('data/data_after_encoding.csv')

> # 5.1 The question

### Which features characterize doctors with a long wait, and is it possible to get an appointment with a shorter wait?

First, let's define "long wait".  
There are 2 types of treatment that we should separate:  
1. Family / Pediatric / First Medicine  
2. Specialist medicine  

We can easily separate them thanks to the visit cost rates.  
As mentioned earlier, first medicine costs 0 NIS per visit.  
All other treatments cost 30 NIS per visit.  
Let's use that to define a boolean column and append it to our dataframe:  
a wait counts as long when it exceeds 7 days for first medicine, or 21 days for specialist medicine.  
Later we're going to use that as the prediction column.

In [3]:
longWait = ((df["visitCost"] == 0) & (df["daysToAppointment"] > 7)) | ((df["visitCost"] > 0) & (df["daysToAppointment"] > 21))
df["longWait"] = longWait.astype(int)
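To make the rule concrete, here is a minimal sketch on a toy dataframe. The column names follow the notebook, but the four rows are made up for illustration, and `np.where` is just one possible way to express the per-row threshold.

```python
import numpy as np
import pandas as pd

# Toy rows (made up); column names follow the notebook.
toy = pd.DataFrame({
    "visitCost": [0, 0, 30, 30],
    "daysToAppointment": [3, 10, 14, 30],
})

# First medicine (cost 0) is "long" above 7 days, everything else above 21 days.
threshold = np.where(toy["visitCost"] == 0, 7, 21)
toy["longWait"] = (toy["daysToAppointment"] > threshold).astype(int)

print(toy)
```

Only the second and fourth rows are flagged: 10 days is long for first medicine, while 14 days is still short for a specialist.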

> # 5.2 The Method

Since this is a classification problem, I chose the **LogisticRegression** algorithm, using **F1 scoring** (macro-averaged) to evaluate the performance.
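As a small illustration of what the macro-averaged F1 score measures, the sketch below computes the F1 score of each class separately and checks that the macro score is their plain average. The labels are made up for the example.

```python
from sklearn.metrics import f1_score

# Made-up labels: 1 = long wait, 0 = short wait.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

f1_long = f1_score(y_true, y_pred, pos_label=1)    # F1 of the "long wait" class
f1_short = f1_score(y_true, y_pred, pos_label=0)   # F1 of the "short wait" class
macro_f1 = f1_score(y_true, y_pred, average="macro")

# The macro score weights both classes equally, regardless of their size.
print(f1_long, f1_short, macro_f1)
```

Because both classes count equally, a model that ignores the smaller class is penalized, which is useful when long and short waits are not balanced.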

> #  5.3 Pre-Implementation

## Important functions:

In [4]:
def split_to_X_and_y(df, target_column):
    # Separate the target column (y) from the remaining feature columns (X)
    dfcpy = df.copy()
    y = pd.Series(dfcpy[target_column])
    del dfcpy[target_column]
    X = dfcpy
    return X, y

def split_to_train_and_test(X, y, test_ratio, rand_state):
    # Reproducible random split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_ratio, random_state=rand_state)
    return X_train, X_test, y_train, y_test

def train_model(X_train, y_train):
    # class_weight='balanced' compensates for unequal class sizes
    model = linear_model.LogisticRegression(class_weight='balanced', max_iter=10000).fit(X_train, y_train)
    return model

def predict(trained_model, X_test):
    return trained_model.predict(X_test)

def evaluate_performance(y_test, y_predicted):
    # The confusion matrix is computed here but neither returned nor displayed
    metrics.confusion_matrix(y_test, y_predicted)
    # Macro-averaged F1: the unweighted mean of the per-class F1 scores
    return metrics.f1_score(y_test, y_predicted, average="macro")
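`evaluate_performance` returns only the F1 score. If the confusion matrix is also of interest, a variant could return both. The helper name `evaluate_with_confusion` below is hypothetical and the labels are made up; this is a sketch, not part of the original notebook.

```python
from sklearn import metrics

def evaluate_with_confusion(y_test, y_predicted):
    # Hypothetical variant: return the confusion matrix together with macro F1.
    cm = metrics.confusion_matrix(y_test, y_predicted)
    f1 = metrics.f1_score(y_test, y_predicted, average="macro")
    return cm, f1

# Made-up labels for illustration.
cm, f1 = evaluate_with_confusion([0, 0, 1, 1], [0, 1, 1, 1])
print(cm)   # rows are true classes, columns are predicted classes
print(f1)
```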

> #  5.4 Implementation

Let's split to X and y.

In [5]:
X, y = split_to_X_and_y(df, "longWait")
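One point worth checking at this step: `longWait` was computed from `visitCost` and `daysToAppointment`, and both columns remain in `X`. Whether that is acceptable depends on the question being asked; if the goal is to predict the wait from the doctor's characteristics alone, the waiting time itself should probably be excluded. A minimal sketch of that variant, on a made-up toy dataframe, assuming `daysToAppointment` is the column to drop:

```python
import pandas as pd

# Toy data (made up); column names follow the notebook.
toy = pd.DataFrame({
    "visitCost": [0, 30, 30],
    "daysToAppointment": [10, 5, 40],
    "graduationYear": [1999, 2010, 2005],
    "longWait": [1, 0, 1],
})

target_column = "longWait"
# Assumption: columns that directly encode the waiting time are left out of X.
source_columns = ["daysToAppointment"]

y_toy = toy[target_column]
X_toy = toy.drop(columns=[target_column] + source_columns)

print(X_toy.columns.tolist())
```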

Great, now let's split the dataframe into train and test sets using the function we created earlier.

In [6]:
X_train, X_test, y_train, y_test = split_to_train_and_test(X, y, 0.20, 42)

Now we're going to train the model and predict:

In [7]:
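# Create the model (train_model also fits it once on the unscaled training data)
trained_model = train_model(X_train, y_train)
# Standardize the training features and refit the model on the scaled data
scaler = preprocessing.StandardScaler().fit(X_train)
X_scaled = scaler.transform(X_train)
trained_model.fit(X_scaled, y_train)
# Predict on the test set; note that X_test is passed as-is, without scaler.transform
pred_vals = predict(trained_model, X_test)
y_pred = pd.Series(pred_vals, index=X_test.index)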
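The cell above fits the model on scaled training data. When a scaler is used, the test features are normally passed through the same fitted scaler before predicting. One way to keep the two steps together is a scikit-learn pipeline. The sketch below uses synthetic data from `make_classification` rather than the notebook's dataset, so its score says nothing about the results reported here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the notebook's dataframe).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42)

# The pipeline fits the scaler on the training data only and reuses it at predict time.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=10000),
)
pipe.fit(X_tr, y_tr)

demo_pred = pipe.predict(X_te)   # X_te is scaled inside the pipeline
demo_score = f1_score(y_te, demo_pred, average="macro")
print("macro F1 on synthetic data:", demo_score)
```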

Let's check our work!  

We are going to evaluate the model using the F1 score:

In [8]:
evaluate_results = evaluate_performance(y_test, y_pred)
print("f1 score:", evaluate_results)

f1 score: 0.7532550156832574


That's a nice score, but I think we can improve on it :)  
Let's try to improve it by adding new features.

In [9]:
# Number of days per week the doctor works

workDaysInWeek = X[["receptionOnSunday", "receptionOnMonday", "receptionOnTuesday", "receptionOnWednesday", "receptionOnThursday", "receptionOnFriday", "receptionOnSaturday"]].copy()
X["workDaysInWeek"] = workDaysInWeek.astype(bool).sum(axis=1)

# More than 15 years after graduation (graduated before 2007)

yrsAfterGraduation15 = (X.graduationYear < 2007)
X["yrsAfterGraduation15"] = yrsAfterGraduation15.astype(int)

# Clinic located in one of the 5 biggest cities
# Note: value_counts().nlargest(n=5).tolist() returns the five largest counts;
# .index.tolist() would return the corresponding clinicCity values instead.

primeCities = X["clinicCity"].value_counts().nlargest(n=5).tolist()
primeLocation = ((X.clinicCity ==  primeCities[0]) | (X.clinicCity ==  primeCities[1]) | (X.clinicCity == primeCities[2]) | (X.clinicCity == primeCities[3]) | (X.clinicCity == primeCities[4]))
X["primeLocation"] = primeLocation.astype(int)
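For the city feature, it helps to see the difference between the counts and the labels returned by `value_counts()`. The sketch below uses a made-up toy column and the top 2 cities instead of 5; it shows one possible way to select the most frequent city values with `.index` and flag them with `isin`.

```python
import pandas as pd

# Toy data (made up): encoded city values 10, 20, 30.
toy = pd.DataFrame({"clinicCity": [10, 10, 10, 20, 20, 30]})

counts = toy["clinicCity"].value_counts()
top_counts = counts.nlargest(2).tolist()        # how many rows each top city has
top_labels = counts.nlargest(2).index.tolist()  # which city values those are

# Flag rows whose city is among the most frequent ones.
toy["primeLocation"] = toy["clinicCity"].isin(top_labels).astype(int)

print(top_counts, top_labels)
print(toy)
```

Here `top_counts` holds row counts while `top_labels` holds city values, so only the latter is meaningful to compare against `clinicCity`.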

Let's check if our score improved:

In [10]:
# Same steps as before, now with the three new features included in X
X_train, X_test, y_train, y_test = split_to_train_and_test(X, y, 0.20, 42)
trained_model = train_model(X_train, y_train)
scaler = preprocessing.StandardScaler().fit(X_train)
X_scaled = scaler.transform(X_train)
trained_model.fit(X_scaled, y_train)
# As before, X_test is passed to the model without scaler.transform
pred_vals = predict(trained_model, X_test)
y_pred = pd.Series(pred_vals, index=X_test.index)
evaluate_results = evaluate_performance(y_test, y_pred)
print("f1 score:", evaluate_results)

f1 score: 0.9305567441639329


# 6. Report & Conclusions

In conclusion,  
After obtaining the data, optimizing it and analyzing it in depth,  
we ran the logistic regression algorithm to answer the research question.  
Our model can predict whether an appointment with a particular doctor  
will involve a long or a short wait, based on the various features.

From the improvement in the score after adding them, we can deduce that features such as  
**the doctor's experience**, the **number of working days per week**  
and a **central location** are strongly related to the waiting time for the doctor.  
Compromising on one or more of them may lead to a shorter wait.  
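One way to back up a statement like this is to look at the fitted coefficients of a logistic regression on standardized features: a larger absolute value means the feature moves the prediction more. The sketch below uses entirely synthetic data in which, by construction, only `workDaysInWeek` drives the target; the feature names follow the notebook but the values and the relationship are made up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features (made up), named after the engineered columns.
demo = pd.DataFrame({
    "workDaysInWeek": rng.integers(1, 7, size=300),
    "yrsAfterGraduation15": rng.integers(0, 2, size=300),
    "primeLocation": rng.integers(0, 2, size=300),
})
# Synthetic target that depends only on workDaysInWeek plus noise.
y_demo = (demo["workDaysInWeek"] + rng.normal(0, 1, size=300) > 3.5).astype(int)

# Standardize so that coefficient magnitudes are comparable across features.
X_demo = StandardScaler().fit_transform(demo)
model = LogisticRegression(max_iter=10000).fit(X_demo, y_demo)

coefs = pd.Series(model.coef_[0], index=demo.columns)
coefs = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(coefs)
```

Applied to the real model, the same listing would show which of the added features carry the most weight, keeping in mind that coefficients describe association, not cause.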

The algorithm predicted long versus short waits with a macro F1 score of about **0.93**, and I am very pleased with this result.