# **Predicting Outcome for Diabetes**

## Objectives

* Build a Support Vector Machine (SVM) model to predict whether a patient is Diabetic or Non-Diabetic.
* Evaluate the accuracy score of the model.

## Inputs

* outputs/datasets/collection/diabetes.csv

* outputs/datasets/cleaned/x_train_cleaned.csv
* outputs/datasets/cleaned/x_test_cleaned.csv
* outputs/datasets/cleaned/y_train_cleaned.csv
* outputs/datasets/cleaned/y_test_cleaned.csv

## Outputs

* x_train dataset
* y_train dataset
* Support Vector Machine Pipeline

## Additional Comments

* This notebook covers the Modeling and Evaluation phases of CRISP-DM. It also relies on a small amount of data preparation carried out in the previous notebook.
* A Machine Learning model will be created using an SVM, and its accuracy score will then be evaluated.


---

# Change working directory

* The notebooks are stored in the subfolder 'jupyter_notebooks', so when running a notebook in the editor we need to change the working directory.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/pp5-diabetes-prediction/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/pp5-diabetes-prediction'
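
Re-running the `os.chdir` cell above moves up one more level each time it executes. A minimal sketch of a guard against that, assuming the notebook folder is always named 'jupyter_notebooks' (the helper name `move_to_project_root` is hypothetical and not part of this project):

```python
import os


def move_to_project_root(notebook_folder="jupyter_notebooks"):
    """Move to the parent directory only when currently inside the notebook folder."""
    current_dir = os.getcwd()
    # Compare the last path component with the expected folder name (an assumption)
    if os.path.basename(current_dir) == notebook_folder:
        os.chdir(os.path.dirname(current_dir))
    # Return the working directory after the (possible) change
    return os.getcwd()
```

Calling `move_to_project_root()` a second time leaves the working directory unchanged, because the current folder no longer matches the notebook folder name.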

# Importing the Libraries

* Here we import the libraries/dependencies that will be used to create the Machine Learning model.

In [5]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn import svm

%matplotlib inline

# Loading the Datasets

* We load the Diabetes dataset along with the cleaned data from the previous notebook.

#### Diabetes Source Dataset

In [16]:
# pandas is already imported as pd in the cell above
df = pd.read_csv("outputs/datasets/collection/diabetes.csv")

# Only the last expression in a cell is displayed, so we show the shape here
df.shape

(768, 9)

#### Cleaned Train Datasets

In [7]:
x_train_path = "outputs/datasets/cleaned/x_train_cleaned.csv"
x_train = pd.read_csv(x_train_path)

x_train.head(15)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,0.342981,0.569003,-0.692669,0.812528,0.446236,-0.557565,-0.183854,-0.616111
1,-0.547919,-0.482096,0.134013,0.275781,0.07604,0.169583,-0.204994,-0.871374
2,-0.250952,-0.449249,-0.858006,-0.368314,-0.48851,-0.935682,-0.751636,-0.701198
3,-0.844885,-0.219321,-0.19666,0.275781,0.03902,0.315012,0.17252,-0.105584
4,-0.844885,-1.139033,0.464686,-0.046266,-0.553294,0.315012,-1.12009,-0.956462
5,0.639947,-0.941952,-1.850024,0.275781,-0.257137,-0.543022,-0.34996,-0.871374
6,-0.547919,-0.712024,-0.527333,-0.797711,-0.016509,0.067782,1.19332,-0.445935
7,0.046014,-0.810564,-1.023342,-0.475663,-0.48851,-0.615737,-0.08721,-0.956462
8,-1.141852,-0.64633,-1.684688,-0.690362,-0.48851,-1.066568,-1.189553,-1.041549
9,-1.141852,0.043454,1.291368,1.027226,-0.48851,0.40227,-0.830159,-0.360847


In [8]:
y_train_path = "outputs/datasets/cleaned/y_train_cleaned.csv"
y_train = pd.read_csv(y_train_path)

y_train.head(5)

Unnamed: 0,Outcome
0,0
1,0
2,0
3,1
4,0


#### Cleaned Test Datasets

In [9]:
x_test_path = "outputs/datasets/cleaned/x_test_cleaned.csv"
x_test = pd.read_csv(x_test_path)

x_test.head(15)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,0.342981,0.831778,0.464686,-0.690362,-0.48851,0.184126,-0.766737,2.702312
1,-0.844885,-0.876258,0.795359,-0.260965,0.816433,0.373184,-0.721435,0.830381
2,-0.250952,-0.350709,-0.858006,-0.690362,-0.48851,-1.430142,-0.996266,-1.041549
3,0.639947,-0.416402,-1.023342,-0.046266,-0.48851,-1.081111,-0.802978,-0.531023
4,-0.250952,-0.219321,-0.527333,1.241924,0.446236,0.824015,-0.972105,-0.445935
5,-0.844885,-1.040492,-0.858006,-1.01241,-0.303412,-1.066568,2.404388,-0.701198
6,-0.844885,0.798931,-1.354015,-0.690362,-0.48851,-0.397593,0.278225,-0.360847
7,1.23388,-0.055087,0.464686,-0.690362,-0.48851,-1.081111,-0.189894,2.617224
8,-1.141852,0.50331,0.960695,-0.046266,-0.48851,-0.746623,-0.727475,2.191785
9,0.046014,0.240535,-1.023342,-1.656505,1.288433,-0.717538,0.16648,-0.190672


In [11]:
y_test_path = "outputs/datasets/cleaned/y_test_cleaned.csv"
y_test = pd.read_csv(y_test_path)

y_test.head(5)

Unnamed: 0,Outcome
0,0
1,1
2,0
3,0
4,0


Check the shape of each dataset

In [14]:
print(df.shape, x_train.shape, x_test.shape)

(768, 9) (691, 8) (77, 8)
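
The shapes above can be sanity-checked with a little arithmetic. The sketch below hard-codes the three shapes from the output rather than reloading the files, so it only illustrates the check; the variable names are chosen here for illustration.

```python
# Shapes copied from the output above: (rows, columns)
df_shape, x_train_shape, x_test_shape = (768, 9), (691, 8), (77, 8)

# Every source row should land in exactly one of the two splits
rows_match = x_train_shape[0] + x_test_shape[0] == df_shape[0]

# The feature sets have one column fewer than the source (no 'Outcome' column)
features_match = x_train_shape[1] == x_test_shape[1] == df_shape[1] - 1

# Share of rows held out for testing
test_fraction = x_test_shape[0] / df_shape[0]

print(rows_match, features_match, round(test_fraction, 3))
```

Since 691 + 77 = 768 and 77 / 768 is about 0.10, the figures are consistent with roughly 10% of the rows being held out for testing.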


Double-check the mean of each variable for diabetic (Outcome = 1) and non-diabetic (Outcome = 0) patients

In [20]:
df.groupby('Outcome').mean()

Outcome,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,3.298,109.98,70.844,19.664,68.792,30.3042,0.429734,31.19
1,4.865672,141.257463,75.242537,22.164179,100.335821,35.142537,0.5505,37.067164


---

# Train Test Split

* We carried out this process in the previous notebook and pushed the datasets to our repo; they have been loaded above as x_train, x_test, y_train, and y_test.

* We are now ready to begin training the model in the next steps.

---

# Training the Model

* We now create and train the model using the cleaned datasets.
* We create an SVC classifier with a linear kernel, which classifies each sample as Diabetic or Non-Diabetic.

In [24]:
classifier = svm.SVC(kernel='linear')

classifier.fit(x_train, y_train.values.ravel())

SVC(kernel='linear')
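
The objectives call for evaluating the accuracy score, and `accuracy_score` is already imported above. The following is a minimal sketch of that step, not the notebook's recorded results. It uses synthetic data from `make_classification` as a stand-in for the cleaned CSV files (an assumption made so the example runs on its own), so the printed scores say nothing about the diabetes model. All `_demo` names are hypothetical.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 768 rows and 8 features, like the source dataset
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=0)

# Hold out 10% of the rows, keeping the class proportions similar in both splits
x_train_demo, x_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.1, stratify=y_demo, random_state=2
)

# Fit the scaler on the training rows only, then apply it to both splits
scaler_demo = StandardScaler().fit(x_train_demo)
x_train_scaled = scaler_demo.transform(x_train_demo)
x_test_scaled = scaler_demo.transform(x_test_demo)

# Same classifier settings as in the cell above
classifier_demo = svm.SVC(kernel='linear')
classifier_demo.fit(x_train_scaled, y_train_demo)

# Accuracy is the fraction of rows whose predicted class matches the true class
train_accuracy = accuracy_score(y_train_demo, classifier_demo.predict(x_train_scaled))
test_accuracy = accuracy_score(y_test_demo, classifier_demo.predict(x_test_scaled))

print("Accuracy on training data:", train_accuracy)
print("Accuracy on test data:", test_accuracy)
```

In this notebook, the same two `accuracy_score` calls could be applied to `classifier` with x_train and y_train, and then with x_test and y_test. A training accuracy far above the test accuracy would suggest overfitting.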


---

# Push files to Repo


In [None]:
import os
try:
  # create here your folder
  # os.makedirs(name='')
  pass  # keeps the try block valid until a folder name is supplied
except Exception as e:
  print(e)
