# **Predicting Outcome for Diabetes**

## Objectives

* Build a Support Vector Machine (SVM) model to predict whether a patient is diabetic or non-diabetic.
* Train the machine learning model.
* Evaluate the accuracy score of the machine learning model.
* Answer Business Requirements 2 and 3:
    * 2 - The client requires a machine learning tool that their healthcare practitioners can use to identify whether a patient has diabetes.
    * 3 - The client expects an accuracy score of 75% or higher in predicting the outcome of diabetes.

## Inputs

* outputs/datasets/collection/diabetes.csv

* outputs/datasets/cleaned/x_train_cleaned.csv
* outputs/datasets/cleaned/x_test_cleaned.csv
* outputs/datasets/cleaned/y_train_cleaned.csv
* outputs/datasets/cleaned/y_test_cleaned.csv

## Outputs

* x_train dataset
* y_train dataset
* Support Vector Machine Pipeline

## Additional Comments

* This notebook covers the Modeling and Evaluation phases of CRISP-DM. It also relies on a small amount of data preparation carried out in the previous notebook.
* We create a machine learning model using an SVM and then evaluate its accuracy score.


---

# Change working directory

* The notebooks are stored in the subfolder 'jupyter_notebooks', so when running the notebook in the editor we need to change the working directory.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/pp5-diabetes-prediction/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/pp5-diabetes-prediction'
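
Re-running the directory change moves the working directory up one more level each time. A minimal sketch of a guard that only changes directory when the current folder is the notebooks subfolder is shown below; the helper name `move_to_project_root` is hypothetical and not part of the original notebook.

```python
import os


def move_to_project_root(notebook_folder="jupyter_notebooks"):
    """Change to the parent directory only if we are inside the notebooks folder.

    `notebook_folder` defaults to the subfolder name used in this project.
    Returns the working directory after the (possible) change.
    """
    current_dir = os.getcwd()
    if os.path.basename(current_dir) == notebook_folder:
        os.chdir(os.path.dirname(current_dir))
    return os.getcwd()
```

Calling `move_to_project_root()` a second time leaves the working directory unchanged, because the folder name no longer matches.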

# Importing the Libraries

* Here we import the libraries/dependencies that will be used to create the Machine Learning Model

In [4]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn import svm

%matplotlib inline

# Loading the Datasets

* We will load the Diabetes Dataset along with the cleaned data from the previous notebook

#### Diabetes Source Dataset

In [5]:
# pandas is already imported above; only the final expression of a cell is displayed
df = pd.read_csv("outputs/datasets/collection/diabetes.csv")
df.shape

(768, 9)

#### Cleaned Train Datasets

In [6]:
x_train_path = "outputs/datasets/cleaned/x_train_cleaned.csv"
x_train = pd.read_csv(x_train_path)

x_train.head(15)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,-1.141852,-0.087934,0.134013,-0.046266,-0.849452,-0.004933,-0.999286,-0.786286
1,0.639947,-0.54779,-0.19666,0.49048,-0.220117,-0.23762,-1.056668,0.319855
2,-0.844885,2.211346,-1.023342,-0.475663,6.980205,-0.339421,-0.223115,2.191785
3,-0.547919,-0.54779,0.630022,1.88602,0.918237,0.184126,0.722182,-0.360847
4,-1.141852,1.915724,-0.527333,1.241924,-0.849452,1.391191,4.291962,-0.701198
5,-0.547919,0.207688,0.464686,1.027226,0.834943,1.580249,2.271503,-0.190672
6,-0.547919,0.010607,-0.19666,-0.046266,-0.48851,0.634957,-0.398282,-0.531023
7,1.23388,-0.055087,0.134013,-0.046266,-0.849452,-0.353964,-0.872441,0.404942
8,1.23388,-1.007646,0.795359,-0.690362,-0.48851,0.460442,0.347687,2.957575
9,-0.547919,-0.777718,-1.023342,-1.119759,0.261138,0.329555,-0.827139,-0.956462


In [7]:
y_train_path = "outputs/datasets/cleaned/y_train_cleaned.csv"
y_train = pd.read_csv(y_train_path)

y_train.head(5)

Unnamed: 0,Outcome
0,1
1,0
2,1
3,1
4,1


#### Cleaned Test Datasets

In [8]:
x_test_path = "outputs/datasets/cleaned/x_test_cleaned.csv"
x_test = pd.read_csv(x_test_path)

x_test.head(15)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,-0.250952,-0.514943,-0.031324,-0.690362,-0.48851,-0.964768,-0.799958,-0.531023
1,-0.250952,-0.285015,-2.346033,-1.549156,-0.48851,-1.459228,-1.002306,-0.956462
2,0.342981,0.831778,0.464686,-0.690362,-0.48851,0.184126,-0.766737,2.702312
3,-0.250952,1.587256,-0.692669,1.027226,1.232904,0.300469,-0.34996,-0.27576
4,-0.250952,-1.237573,-0.031324,0.49048,-0.48851,0.693129,-0.618751,-0.445935
5,-0.844885,-1.139033,-1.023342,1.027226,-0.155333,0.693129,0.112118,-0.956462
6,-1.141852,0.569003,-0.858006,-1.119759,1.09408,-1.502857,-0.799958,-1.041549
7,2.718712,0.240535,0.134013,0.275781,-0.849452,1.085789,0.293325,0.915469
8,0.639947,-0.777718,-1.188678,0.597829,0.908982,0.227754,-0.126471,0.830381
9,0.639947,0.404769,-0.19666,-0.475663,0.353687,0.431356,0.211782,-0.360847


In [9]:
y_test_path = "outputs/datasets/cleaned/y_test_cleaned.csv"
y_test = pd.read_csv(y_test_path)

y_test.head(5)

Unnamed: 0,Outcome
0,0
1,0
2,0
3,1
4,0


Dataset shape

In [10]:
print(df.shape, x_train.shape, x_test.shape)

(768, 9) (614, 8) (154, 8)


Double checking the mean of each variable for the non-diabetic (0) and diabetic (1) groups

In [11]:
df.groupby('Outcome').mean()

Outcome,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,3.298,109.98,68.184,19.664,68.792,30.3042,0.429734,31.19
1,4.865672,141.257463,70.824627,22.164179,100.335821,35.142537,0.5505,37.067164


---

# Train Test Split

* We carried out this process in the previous notebook and pushed the resulting datasets to our repo. They have been loaded above as x_train, x_test, y_train, and y_test.

* We are now ready to begin training the model in the next steps.
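
For reference, a split with the same proportions as the shapes reported earlier (614 training rows and 154 test rows out of 768) can be sketched as follows. This is a minimal sketch on synthetic data, not the previous notebook's actual code: the `test_size=0.2`, `stratify`, and `random_state=42` settings are assumptions, as is fitting the scaler on the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 768-row, 8-feature diabetes data (values are random).
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.normal(size=(768, 8)),
                        columns=[f"feature_{i}" for i in range(8)])
target = pd.Series(rng.integers(0, 2, size=768), name="Outcome")

# Assumed settings: 20% test split, stratified on the target, fixed seed.
x_tr, x_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.2, stratify=target, random_state=42
)

# Fit the scaler on the training rows only, then apply it to both sets.
scaler = StandardScaler()
x_tr_scaled = scaler.fit_transform(x_tr)
x_te_scaled = scaler.transform(x_te)

print(x_tr_scaled.shape, x_te_scaled.shape)
```

Fitting the scaler on the training rows alone keeps information about the test rows out of the preprocessing step.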

---

# Creating the Model

* We will now need to create the pipeline for the Support Vector Machine model which we will be training and using to predict the Outcome of Diabetes.

In [None]:
# Support Vector Machine pipeline for predicting the Outcome of diabetes
class SVM_pipeline():

    # Initialises the hyperparameters
    def __init__(self, tuning_parameter, iteration_no, lambda_parameter):
        self.tuning_parameter = tuning_parameter
        self.iteration_no = iteration_no
        self.lambda_parameter = lambda_parameter
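
The class above only stores hyperparameters. As a point of comparison, and not as the author's implementation, scikit-learn's own `Pipeline` can chain scaling and a linear `SVC` into one estimator. The sketch below uses synthetic data from `make_classification`; the step names `scaler` and `svc` are arbitrary labels chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-class data standing in for the diabetes features.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Each step is a (name, estimator) pair; the last step is the classifier.
svm_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="linear")),
])

svm_pipeline.fit(X_demo, y_demo)
demo_predictions = svm_pipeline.predict(X_demo)
print(demo_predictions[:10])
```

Because the datasets loaded in this notebook are already scaled, the scaling step would only be needed when starting from the raw data.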

# Training the Model

* Now we will begin creating and training the model using the cleaned datasets.
* We create an SVC classifier with a linear kernel, which is used to separate the data into the two Outcome classes.

In [12]:
classifier = svm.SVC(kernel='linear')

classifier.fit(x_train, y_train.values.ravel())

SVC(kernel='linear')
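
`y_train` is loaded as a single-column DataFrame, while `fit` expects a one-dimensional array of labels; `.values.ravel()` performs that conversion. A minimal illustration on a three-row toy frame (the values are made up):

```python
import pandas as pd

# Toy single-column target, shaped like y_train but with made-up values.
y_frame = pd.DataFrame({"Outcome": [1, 0, 1]})

print(y_frame.shape)                  # two-dimensional: (3, 1)
print(y_frame.values.ravel().shape)   # one-dimensional: (3,)
```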

## Evaluating the Model

In [13]:
x_train_predict = classifier.predict(x_train)
x_test_predict = classifier.predict(x_test)

#### Train Set Accuracy Score

* Now we will gather an accuracy score for the training data.

In [14]:
# accuracy_score expects the true labels first, then the predictions
train_accuracy = accuracy_score(y_train, x_train_predict)
print('Train dataset Accuracy Score: ', train_accuracy)

Train dataset Accuracy Score:  0.7915309446254072


* The training accuracy score is 0.7915, which is above the minimum of 0.75 set by Business Requirement 3. A score below 0.75 would be deemed a fail, so this can be considered a success.
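
Accuracy is the fraction of rows whose predicted label matches the true label. The sketch below recomputes it by hand on a made-up five-row example and compares the result with `accuracy_score`. It also checks that the printed training score corresponds to 486 correct predictions out of 614 rows, a count inferred here from the score itself rather than stated in the notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration only.
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])

manual_accuracy = (y_true == y_pred).mean()    # 4 matches out of 5 rows
library_accuracy = accuracy_score(y_true, y_pred)
print(manual_accuracy, library_accuracy)

# Inferred from the printed training score: 486 correct out of 614 rows.
print(486 / 614)
```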

#### Test Set Accuracy Score

In [15]:
# accuracy_score expects the true labels first, then the predictions
test_accuracy = accuracy_score(y_test, x_test_predict)
print('Test dataset Accuracy Score: ', test_accuracy)

Test dataset Accuracy Score:  0.7532467532467533


* The test accuracy score is 0.7532, which is just above the minimum of 0.75 set by Business Requirement 3. A score below 0.75 would be deemed a fail, so the requirement has been met.

* The training and test accuracy scores are similar (0.7915 and 0.7532), which is a good indication that the model is not overfitted. A high training score combined with a low test score would signal overfitting.

* One limitation is the small size of the dataset (768 rows). With so little training data, it is difficult for the model to reach a high accuracy on the test data.
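
With a small dataset, a single train/test split gives a noisy accuracy estimate. One common way to get a steadier figure is k-fold cross-validation. The sketch below is an illustration on synthetic data, not a result from this project; the five folds and `random_state=0` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in sized like the diabetes dataset (768 rows, 8 features).
X_cv, y_cv = make_classification(n_samples=768, n_features=8, random_state=0)

# Five stratified folds keep the class balance similar in every fold.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(SVC(kernel="linear"), X_cv, y_cv,
                            cv=folds, scoring="accuracy")

print("Fold accuracies:", cv_scores)
print("Mean accuracy:", cv_scores.mean())
```

Each row is used for testing exactly once, so the mean across folds depends less on which rows happen to land in a single test set.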

---

# Push files to Repo

In [16]:
import os
try:
  # create here your folder
  # os.makedirs(name='')
  pass  # keeps the try block valid while the folder name above is left unset
except Exception as e:
  print(e)
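
The Outputs section lists the trained model as a deliverable. A minimal sketch of how a fitted classifier could be written to disk and reloaded with `joblib` follows. It uses a temporary directory and a toy model so that it runs on its own; the file name `svm_classifier.pkl` is hypothetical, and the project's real output folder is not specified here.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Toy model standing in for the trained classifier.
X_toy, y_toy = make_classification(n_samples=100, n_features=8, random_state=0)
toy_classifier = SVC(kernel="linear").fit(X_toy, y_toy)

# Temporary directory used only for this demonstration.
output_dir = tempfile.mkdtemp()
model_path = os.path.join(output_dir, "svm_classifier.pkl")  # hypothetical file name

joblib.dump(toy_classifier, model_path)
reloaded_classifier = joblib.load(model_path)

# The reloaded model should reproduce the original predictions.
same_predictions = (reloaded_classifier.predict(X_toy)
                    == toy_classifier.predict(X_toy)).all()
print(same_predictions)
```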