# **Predicting Outcome for Diabetes**

## Objectives

* Build a Support Vector Machine (SVM) model to predict whether a patient is Diabetic or Non-Diabetic.
* Train the Machine Learning model.
* Evaluate the accuracy score of the Machine Learning model.
* Answer Business Requirements 2 & 3:
    * 2 - The client requires a machine learning tool that their healthcare practitioners can use to identify whether a patient has diabetes.
    * 3 - The client expects an accuracy score of 75% or higher in predicting the outcome of diabetes.

## Inputs

* outputs/datasets/collection/diabetes.csv


## Outputs

* x_train and y_train datasets
* x_test and y_test datasets
* Trained Support Vector Machine pipeline (classifier_SVM.pkl and trained_svm.sav)

## Additional Comments

* This notebook falls under the Modeling and Evaluation phases of CRISP-DM. It also repeats a small part of the Data Preparation from the previous notebook.
* A Machine Learning model will be created using an SVM, and we will then evaluate its accuracy score.


---

# Change working directory

* The notebooks are stored in the subfolder 'jupyter_notebooks', so when running a notebook in the editor we need to change the working directory.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/pp5-diabetes-prediction/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/pp5-diabetes-prediction'
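
One thing to watch with the cells above: running the `os.chdir` cell a second time moves up another level, out of the project folder. A minimal sketch of a guard is shown below, assuming the notebooks folder is always named 'jupyter_notebooks'. The helper name `project_root` is hypothetical and not part of the notebook, and `PurePosixPath` is used only so the example behaves the same on any machine.

```python
from pathlib import PurePosixPath

def project_root(cwd, notebooks_dir="jupyter_notebooks"):
    """Return the parent folder only when cwd is the notebooks folder (hypothetical helper)."""
    path = PurePosixPath(cwd)
    return str(path.parent) if path.name == notebooks_dir else str(path)

# First call moves up one level; a repeated call leaves the path unchanged
first = project_root("/workspace/pp5-diabetes-prediction/jupyter_notebooks")
second = project_root(first)
print(first)
print(second)
```

In the notebook this could be used as `os.chdir(project_root(os.getcwd()))`, so re-running the cell is harmless.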

# Importing the Libraries

* Here we import the libraries/dependencies that will be used for creation of the Machine Learning Model

In [4]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn import svm

%matplotlib inline

# Loading the Datasets

* We will load the Diabetes dataset, along with the cleaned data from the previous notebook.

#### Diabetes Source Dataset

In [5]:
df = pd.read_csv(f"outputs/datasets/collection/diabetes.csv")
df.head(15)
df.shape

(768, 9)

---

# Dataset Collection

* We will run through the steps we took in the Feature Engineering notebook to get the dataset ready for the train test split.

In [6]:
df = pd.read_csv(f"inputs/datasets/raw/diabetes.csv")

print(df.head())
print(df.shape)

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  
(768, 9)


In [7]:
df.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


In [8]:
df['Outcome'].value_counts()

0    500
1    268
Name: Outcome, dtype: int64

---

# Data Processing

* We go through the same steps we took in the Feature Engineering notebook to pre-process the data, ready for the train test split. Refer back to 03-FeatureEngineering.ipynb to see these steps in detail.

* We will begin by dropping the Outcome column to separate the feature variables from the target variable

In [9]:
x = df.drop(columns='Outcome')
y = df['Outcome']
print(x)
print(y)

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0              6      148             72             35        0  33.6   
1              1       85             66             29        0  26.6   
2              8      183             64              0        0  23.3   
3              1       89             66             23       94  28.1   
4              0      137             40             35      168  43.1   
..           ...      ...            ...            ...      ...   ...   
763           10      101             76             48      180  32.9   
764            2      122             70             27        0  36.8   
765            5      121             72             23      112  26.2   
766            1      126             60              0        0  30.1   
767            1       93             70             31        0  30.4   

     DiabetesPedigreeFunction  Age  
0                       0.627   50  
1                       0.351   31  


* Next we revisit the process of standardising the data, using StandardScaler() and its fit() and transform() methods.

In [10]:
df_scaler = StandardScaler()

In [11]:
df_scaler.fit(x)

StandardScaler()

In [12]:
stnd_data = df_scaler.transform(x)
print(stnd_data)

[[ 0.63994726  0.84832379  0.14964075 ...  0.20401277  0.46849198
   1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575 ... -0.68442195 -0.36506078
  -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 ... -1.10325546  0.60439732
  -0.10558415]
 ...
 [ 0.3429808   0.00330087  0.14964075 ... -0.73518964 -0.68519336
  -0.27575966]
 [-0.84488505  0.1597866  -0.47073225 ... -0.24020459 -0.37110101
   1.17073215]
 [-0.84488505 -0.8730192   0.04624525 ... -0.20212881 -0.47378505
  -0.87137393]]
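
To make the standardisation concrete, the sketch below recomputes one scaled value by hand from the `df.describe()` figures shown earlier. It relies on two facts: `StandardScaler` divides by the population standard deviation, while `describe()` reports the sample standard deviation, so a small correction factor is needed. The variable names are illustrative only.

```python
import math

# Figures taken from the df.describe() output for the Glucose column
n_rows = 768
glucose_mean = 120.894531
glucose_sample_std = 31.972618   # describe() uses the sample std (ddof=1)

# StandardScaler uses the population std (ddof=0), so convert first
glucose_pop_std = glucose_sample_std * math.sqrt((n_rows - 1) / n_rows)

# Glucose of the first row in the dataset is 148
z_first_row = (148 - glucose_mean) / glucose_pop_std
print(round(z_first_row, 4))  # ≈ 0.8483, the first Glucose value in the scaled output
```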


In [13]:
x = stnd_data
y = df['Outcome']
print(x)
print(y)

[[ 0.63994726  0.84832379  0.14964075 ...  0.20401277  0.46849198
   1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575 ... -0.68442195 -0.36506078
  -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 ... -1.10325546  0.60439732
  -0.10558415]
 ...
 [ 0.3429808   0.00330087  0.14964075 ... -0.73518964 -0.68519336
  -0.27575966]
 [-0.84488505  0.1597866  -0.47073225 ... -0.24020459 -0.37110101
   1.17073215]
 [-0.84488505 -0.8730192   0.04624525 ... -0.20212881 -0.47378505
  -0.87137393]]
0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64


* Having processed the data, we can now move on to the Train Test Split

---

# Train Test Split

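* As in the previous notebook, we carry out a Train Test Split so that we can train the model on one part of the data (80%) and evaluate it on the rest (20%). Setting stratify=y keeps the proportion of diabetic and non-diabetic cases the same in both sets, and random_state=2 makes the split reproducible.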


In [14]:
x_train, x_test, y_train, y_test = train_test_split(x, y, stratify=y, test_size=0.2, random_state=2)
print(f"Total Records: {x.shape} \nTrain set: {x_train.shape} \nTest set: {x_test.shape}")

Total Records: (768, 8) 
Train set: (614, 8) 
Test set: (154, 8)
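
As a quick check on the effect of `stratify=y`, the sketch below compares the share of diabetic cases in the full dataset with the share in each split. The counts are taken from the `value_counts()` output above and from the support column of the classification reports later in this notebook; the helper `diabetic_share` is hypothetical.

```python
# Class counts: key 0 = Non-Diabetic, key 1 = Diabetic
total = {0: 500, 1: 268}   # from df['Outcome'].value_counts()
train = {0: 400, 1: 214}   # support values in the train classification report
test = {0: 100, 1: 54}     # support values in the test classification report

def diabetic_share(counts):
    """Fraction of records labelled Diabetic (1)."""
    return counts[1] / (counts[0] + counts[1])

for name, counts in (("full", total), ("train", train), ("test", test)):
    print(f"{name}: {diabetic_share(counts):.3f}")
```

All three shares come out close to 0.35, which is what a stratified split is meant to achieve.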


---

# Creating the Model

* We will now need to create the pipeline for the Support Vector Machine model which we will be training and using to predict the Outcome of Diabetes.

In [15]:
# Support Vector Machine pipeline for predicting output of Diabetes
class SVM_pipeline():

    # Initiates the hyperparameters
    def __init__(self, tuning_parameter, iteration_no, lambda_parameter):
        self.tuning_parameter = tuning_parameter
        self.iteration_no = iteration_no
        self.lambda_parameter = lambda_parameter

    # Fits the diabetes dataset to the SVM classifier model
    def fit(self, x, y):
        # m is the number of data points (rows) and n is the number of input features (columns)
        self.m, self.n = x.shape

        # Initiate weight and bias values
        self.w = np.zeros(self.n)
        self.b = 0
        self.x = x
        self.y = y

        # Optimisation Algorithm
        for i in range(self.iteration_no):
            self.update_weight_value()
    
    # Performs one pass over the training data, updating the weights and bias
    def update_weight_value(self):
        # Encode the labels: Outcome 0 becomes -1, Outcome 1 stays 1
        label_y = np.where(self.y <= 0, -1, 1)

        # Conditions for the gradients (dw, db)
        for index, x_i in enumerate(self.x):
            constraint = label_y[index] * (np.dot(x_i, self.w) - self.b) >= 1

            if constraint:
                dw = 2 * self.lambda_parameter * self.w
                db = 0
            
            else:
                dw = 2 * self.lambda_parameter * self.w - np.dot(x_i, label_y[index])
                db = label_y[index]
                
            # Formula used for updating weight and bias values
            self.w = self.w - self.tuning_parameter * dw
            self.b = self.b - self.tuning_parameter * db
    
    # Predicts the Outcome label when using an input value
    def diabetes_prediction(self, x):
        result = np.dot(x, self.w) - self.b
        label_prediction = np.sign(result)
        predicted_outcome = np.where(label_prediction <= -1, 0, 1)

        return predicted_outcome
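
The update rule in `update_weight_value` can be read as per-sample subgradient descent on a regularised hinge loss. The summary below is a reading of the code above rather than a formula stated in the notebook; here α stands for `tuning_parameter` (the learning rate) and λ for `lambda_parameter` (the regularisation strength).

```latex
\begin{aligned}
&\text{labels: } y_i \in \{-1, +1\} \quad (\text{Outcome } 0 \mapsto -1,\ \text{Outcome } 1 \mapsto +1) \\
&\text{per-sample objective: } J_i(w, b) = \lambda \lVert w \rVert^2 + \max\bigl(0,\ 1 - y_i (w \cdot x_i - b)\bigr) \\
&\text{if } y_i (w \cdot x_i - b) \ge 1: \quad dw = 2 \lambda w, \qquad db = 0 \\
&\text{otherwise:} \quad dw = 2 \lambda w - y_i x_i, \qquad db = y_i \\
&\text{update: } w \leftarrow w - \alpha\, dw, \qquad b \leftarrow b - \alpha\, db \\
&\text{prediction: } \hat{y} = \begin{cases} 0 & \text{if } w \cdot x - b < 0 \\ 1 & \text{otherwise} \end{cases}
\end{aligned}
```

Samples that already sit outside the margin only shrink the weights through the regularisation term, while samples inside the margin also pull the weights and bias towards classifying them correctly.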

---

# Training the Model

* Now we will train the SVM classifier we just created, using the training set from the Train Test Split in the previous step.

In [16]:
classifier_SVM = SVM_pipeline(tuning_parameter=0.001, iteration_no=1000, lambda_parameter=0.01)

classifier_SVM.fit(x_train, y_train)

---

# Performance Evaluation

### Evaluating the Model

In [17]:
x_train_predict = classifier_SVM.diabetes_prediction(x_train)
x_test_predict = classifier_SVM.diabetes_prediction(x_test)

#### Train Set Accuracy Score

* Now we will gather an accuracy score for the training data.

In [18]:
train_accuracy = accuracy_score(y_train, x_train_predict)
print('Train dataset Accuracy Score: ', train_accuracy)

Train dataset Accuracy Score:  0.7866449511400652


* The training accuracy is 0.787, which is above the 0.75 required by Business Requirement 3. Training accuracy alone does not show how the model performs on unseen data, so the test set score below is the one that decides whether the requirement is met.

#### Test Set Accuracy Score

In [19]:
test_accuracy = accuracy_score(y_test, x_test_predict)
print('Test dataset Accuracy Score: ', test_accuracy)

Test dataset Accuracy Score:  0.7727272727272727


* The test accuracy is 0.773, which is just above the minimum of 0.75 set by Business Requirement 3, so the requirement is met. A score below 0.75 would have been deemed a fail.

* The training and test sets produce similar accuracy scores (0.787 and 0.773), which is a good indication that the model is not overfitted. A high score on the training data combined with a low score on the test data would signal overfitting.

* One limitation is the small size of the dataset (768 records): with little training data available, it is difficult for the model to reach a high accuracy on the test data.

#### Confusion Matrix

* Next we will calculate the confusion matrix to evaluate the performance of the SVM model on the training and test datasets
* The confusion matrix is a table which shows the number of correct and incorrect predictions made by the Support Vector Machine model.

##### Train dataset

In [20]:
confusion_matrix_train = confusion_matrix(y_train, x_train_predict)

# Custom template string for the confusion matrix output for clearer readability.
# sklearn orders rows (actual) and columns (predicted) by label, 0 then 1,
# so with Diabetic (1) as the positive class the layout is [[TN, FP], [FN, TP]].
train_template = (
    "Confusion Matrix Train Output:\n"
    "True Negative Predictions: {}\n"
    "False Positive Predictions: {}\n"
    "False Negative Predictions: {}\n"
    "True Positive Predictions: {}"
    )

# Inserts the values from the confusion matrix into the custom template
train_output_string = train_template.format(
    confusion_matrix_train[0][0],
    confusion_matrix_train[0][1],
    confusion_matrix_train[1][0],
    confusion_matrix_train[1][1]
    )

print(train_output_string)

Confusion Matrix Train Output:
True Negative Predictions: 359
False Positive Predictions: 41
False Negative Predictions: 90
True Positive Predictions: 124


##### Test dataset

In [22]:
confusion_matrix_test = confusion_matrix(y_test, x_test_predict)

# Custom template string for the confusion matrix output for clearer readability.
# sklearn orders rows (actual) and columns (predicted) by label, 0 then 1,
# so with Diabetic (1) as the positive class the layout is [[TN, FP], [FN, TP]].
test_template = (
    "Confusion Matrix Test Output:\n"
    "True Negative Predictions: {}\n"
    "False Positive Predictions: {}\n"
    "False Negative Predictions: {}\n"
    "True Positive Predictions: {}"
    )

# Inserts the values from the confusion matrix into the custom template
test_output_string = test_template.format(
    confusion_matrix_test[0][0],
    confusion_matrix_test[0][1],
    confusion_matrix_test[1][0],
    confusion_matrix_test[1][1]
    )

print(test_output_string)

Confusion Matrix Test Output:
True Negative Predictions: 91
False Positive Predictions: 9
False Negative Predictions: 26
True Positive Predictions: 28


* The true positive value represents the number of times the model correctly predicted the Diabetic outcome (1).

* The false negative value represents the number of times the model incorrectly predicted a Non-Diabetic outcome (0) for records that were actually Diabetic (1).

* The false positive value represents the number of times the model incorrectly predicted the Diabetic outcome (1) for records that were actually Non-Diabetic (0).

* The true negative value represents the number of times the model correctly predicted the Non-Diabetic outcome (0).

* As we can see above, the correct predictions (483 of 614 on the training set and 119 of 154 on the test set) outnumber the incorrect ones on both datasets, which is consistent with the accuracy scores. The errors are not split evenly between the two classes, so the per-class metrics in the classification report below are needed to judge how well diabetic cases are detected.

### Predictive Power Score

#### Classification Report

* Next we will use the classification_report function from the sklearn library to generate a report that includes evaluation metrics such as precision, recall and the F1 score for the Support Vector Machine model on the training and test sets. This will allow us to assess the overall performance and **Predictive Power Score**, and to identify areas for improvement.

##### Train dataset

In [23]:
train_report = classification_report(y_train, x_train_predict)
print(train_report)

              precision    recall  f1-score   support

           0       0.80      0.90      0.85       400
           1       0.75      0.58      0.65       214

    accuracy                           0.79       614
   macro avg       0.78      0.74      0.75       614
weighted avg       0.78      0.79      0.78       614



##### Test dataset

In [24]:
test_report = classification_report(y_test, x_test_predict)
print(test_report)

              precision    recall  f1-score   support

           0       0.78      0.91      0.84       100
           1       0.76      0.52      0.62        54

    accuracy                           0.77       154
   macro avg       0.77      0.71      0.73       154
weighted avg       0.77      0.77      0.76       154



* The classification report provides a detailed breakdown of the evaluation metrics for the output of "Diabetic" (1) and "Non-Diabetic" (0) in both the training and test datasets.

* Let's now assess what each score means:
    * Precision: Precision is the number of true Diabetic predictions made by the model, divided by the total number of Diabetic predictions made by the model. It measures the proportion of Diabetic predictions that are actually correct.
    * Recall: Recall is the number of true Diabetic predictions made by the model, divided by the total number of actual Diabetic cases in the data. It measures the proportion of actual Diabetic cases that were correctly predicted by the model.
    * f1-score: The f1-score is the harmonic mean of precision and recall. It balances the two and reaches its best value at 1.
    * Support: Support is the number of actual samples in each class, Diabetic (1) and Non-Diabetic (0).
    * The report gives these metrics for each class in turn, so the row for 0 treats Non-Diabetic as the class being predicted.


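* The report also returns the overall accuracy score, which is the proportion of correct predictions and confirms what we saw above. Overall accuracy meets the requirement, but recall for the Diabetic class (0.58 on the training set, 0.52 on the test set) shows that the model misses a large share of the actual diabetic cases (about 42% in training and 48% in testing), which is the main area for improvement.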
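
The metric definitions above can be checked by hand against the test report. The sketch below takes the test-set confusion matrix as `[[91, 9], [26, 28]]`, with rows as actual labels and columns as predicted labels, both ordered 0 then 1 (scikit-learn's convention), and treats Diabetic (1) as the positive class. The variable names are illustrative.

```python
# Test-set confusion matrix: rows = actual (0, 1), columns = predicted (0, 1)
cm = [[91, 9],
      [26, 28]]
(tn, fp), (fn, tp) = cm

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # share of Diabetic predictions that were correct
recall = tp / (tp + fn)      # share of actual Diabetic cases that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

The rounded values agree with the row for class 1 and the accuracy line in the test classification report (0.77, 0.76, 0.52 and 0.62).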

---

# Testing the Predictive Outcome

* Now we will perform a manual test on a single record to produce a predicted Outcome, as the model would in use.

* We will take a randomly chosen row of data from the original dataset to predict the outcome.

In [25]:
# Random row of data from source dataset
manual_input = (1,97,66,15,140,23.2,0.487,22)


* Then we need to convert the data to a numpy array and reshape it for only one record to be used rather than the whole dataset (768 records).

In [26]:
manual_input_nparray = np.asarray(manual_input)
manual_input_shaped = manual_input_nparray.reshape(1, -1)

* Next we need to standardise the manually input data. The reason we do this is because the model was trained on a standardised set of data so if we use the raw data then we will get an inaccurate prediction.

In [27]:
stnd_manual_input = df_scaler.transform(manual_input_shaped)
print(stnd_manual_input)

[[-0.84488505 -0.74783062 -0.16054575 -0.3472913   0.52271486 -1.11594738
   0.04567536 -0.95646168]]


* Lastly, we will use our trained model to predict the target Outcome for the input data.

In [28]:
predict = classifier_SVM.diabetes_prediction(stnd_manual_input)
print(predict)

[0]


* An output of '0' means the model predicts that the person is non-diabetic, which is the correct outcome for this row. However, an output of '0' isn't descriptive or clear for the end user, so we will create an if statement to print whether the person is likely to have diabetes or not.

In [29]:
if predict[0] == 0:
    print('Based on the data entered, this person does not show signs of being diabetic.')
else:
    print('Based on the data entered, this person shows signs of being diabetic.')

Based on the data entered, this person does not show signs of being diabetic.


* As we can see, we have a user-friendly output message. The model is not 100% accurate (its test accuracy is about 77%), so we cannot guarantee that the person is diabetic or non-diabetic, and further review by the medical worker may be required.

---

# Push files to Repo

* We will be pushing the following files to the repository
    * Train set data
    * Test set data
    * Trained SVM model

In [53]:
import joblib

version = 'v1.0'
file_path = f"outputs/svm_pipeline/predict_diabetes/{version}"
# Create the output folder; exist_ok avoids an error when re-running the notebook
os.makedirs(file_path, exist_ok=True)


##### Train data set

In [54]:
pd.DataFrame(x_train).head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | -1.141852 | -0.059293 | -3.572597 | -1.288212 | -0.692891 | 0.05171 | -0.999286 | -0.786286 |
| 1 | 0.639947 | -0.497453 | 0.046245 | 0.719086 | -0.102454 | -0.151361 | -1.056668 | 0.319855 |
| 2 | -0.844885 | 2.131507 | -0.470732 | 0.154533 | 6.652839 | -0.240205 | -0.223115 | 2.191785 |
| 3 | -0.547919 | -0.497453 | 0.563223 | 1.534551 | 0.965543 | 0.216705 | 0.722182 | -0.360847 |
| 4 | -1.141852 | 1.849832 | -0.160546 | 1.158182 | -0.692891 | 1.270134 | 4.291962 | -0.701198 |


In [55]:

pd.DataFrame(x_train).to_csv(f"{file_path}/x_train.csv", index=False)


In [56]:
pd.DataFrame(y_train).head()

| | Outcome |
|---|---|
| 619 | 1 |
| 329 | 0 |
| 13 | 1 |
| 476 | 1 |
| 45 | 1 |


In [57]:
pd.DataFrame(y_train).to_csv(f"{file_path}/y_train.csv", index=False)

##### Test data set

In [58]:
pd.DataFrame(x_test).head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | -0.250952 | -0.466156 | 0.149641 | -1.288212 | -0.692891 | -0.785957 | -0.799958 | -0.531023 |
| 1 | -0.250952 | -0.247076 | -1.297896 | -0.472747 | -0.692891 | -1.217483 | -1.002306 | -0.956462 |
| 2 | 0.342981 | 0.817027 | 0.459827 | -1.288212 | -0.692891 | 0.216705 | -0.766737 | 2.702312 |
| 3 | -0.250952 | 1.536861 | -0.263941 | 1.032726 | 1.260761 | 0.31824 | -0.34996 | -0.27576 |
| 4 | -0.250952 | -1.154694 | 0.149641 | 0.719086 | -0.692891 | 0.660922 | -0.618751 | -0.445935 |


In [59]:
pd.DataFrame(x_test).to_csv(f"{file_path}/x_test.csv", index=False)

In [60]:
pd.DataFrame(y_test).head()

| | Outcome |
|---|---|
| 615 | 0 |
| 80 | 0 |
| 148 | 0 |
| 132 | 1 |
| 501 | 0 |


In [61]:
pd.DataFrame(y_test).to_csv(f"{file_path}/y_test.csv", index=False)

* We run the code below to save the trained model in our repository for use in the Streamlit dashboard

In [62]:
joblib.dump(value=classifier_SVM, filename=f"{file_path}/classifier_SVM.pkl")

['outputs/svm_pipeline/predict_diabetes/v1.0/classifier_SVM.pkl']

#### Saving the trained model

In [64]:
import pickle

filename = 'outputs/svm_pipeline/predict_diabetes/v1.0/trained_svm.sav'
# The with block closes the file once the model has been written
with open(filename, 'wb') as f:
    pickle.dump(classifier_SVM, f)

In [65]:
# Reload the saved model to confirm it can be read back
with open('outputs/svm_pipeline/predict_diabetes/v1.0/trained_svm.sav', 'rb') as f:
    load_svm_model = pickle.load(f)

In [66]:
manual_input = (1,97,66,15,140,23.2,0.487,22)

manual_input_nparray = np.asarray(manual_input)
manual_input_shaped = manual_input_nparray.reshape(1, -1)

stnd_manual_input = df_scaler.transform(manual_input_shaped)
print(stnd_manual_input)

predict = load_svm_model.diabetes_prediction(stnd_manual_input)
print(predict)

if predict[0] == 0:
    print('Based on the data entered, this person does not show signs of being diabetic.')
else:
    print('Based on the data entered, this person shows signs of being diabetic.')

[[-0.84488505 -0.74783062 -0.16054575 -0.3472913   0.52271486 -1.11594738
   0.04567536 -0.95646168]]
[0]
Based on the data entered, this person does not show signs of being diabetic.
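
Two points about the saved model are worth keeping in mind. First, unpickling an instance of a custom class such as `SVM_pipeline` requires the class definition to be importable wherever the file is loaded. Second, the instance also carries the training data it stored in `self.x` and `self.y`. A lighter alternative, sketched below as an assumption rather than as what this notebook does, is to save only the learned parameters and re-apply the same decision rule as `diabetes_prediction`. The parameter values, the file name `svm_params.pkl` and the helper `predict_from_params` are all hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical learned parameters; in the notebook these would come from
# classifier_SVM.w (eight weights) and classifier_SVM.b (a single float)
params = {"w": [0.4, 1.1, -0.2], "b": 0.3}

def predict_from_params(params, row):
    """Apply the rule used in diabetes_prediction: 0 if w.x - b < 0, else 1."""
    score = sum(w_j * x_j for w_j, x_j in zip(params["w"], row)) - params["b"]
    return 0 if score < 0 else 1

# Round trip through a temporary file so the example leaves nothing behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "svm_params.pkl"
    with open(path, "wb") as f:
        pickle.dump(params, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

same_params = restored == params
example_prediction = predict_from_params(restored, [1.0, -0.5, 2.0])
print(same_params, example_prediction)
```

Whichever format is used, any code that loads the model for new inputs would also need the fitted `df_scaler`, since the model was trained on standardised data.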
