# Predicting Type-II Diabetes Occurrence Using Supervised Machine Learning

## Introduction

In this project, we predict whether an individual is likely to be diabetic based on the following variables:

- Pregnancies: Number of times pregnant
- Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- BloodPressure: Diastolic blood pressure (mm Hg)
- SkinThickness: Triceps skin fold thickness (mm)
- Insulin: 2-Hour serum insulin (mu U/ml)
- BMI: Body mass index (weight in kg/(height in m)^2)
- DiabetesPedigreeFunction: Diabetes pedigree function (a score based on family history of diabetes)
- Age: Age (years)
- Outcome: Class variable (0 = not diabetic, 1 = diabetic)

The dataset used was obtained from Kaggle: https://www.kaggle.com/datasets/mathchi/diabetes-data-set

Project by: Syed Muhammad Farzan Hussain (https://www.linkedin.com/in/farzanhussain/)

In [13]:
# Importing the libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split 
from sklearn import svm
from sklearn.metrics import accuracy_score

## Exploratory Data Analysis

In [15]:
# Loading the CSV file into a pandas DataFrame
diabetes = pd.read_csv("diabetes.csv")

# Let's see what variables we are working with
print(diabetes.head())

# Determining the dimensions of the dataframe
print(diabetes.shape)

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  


(768, 9)

In [19]:
# Getting the dataset's statistical information
diabetes.describe()

       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin         BMI  \
count   768.000000  768.000000     768.000000     768.000000  768.000000  768.000000
mean      3.845052  120.894531      69.105469      20.536458   79.799479   31.992578
std       3.369578   31.972618      19.355807      15.952218  115.244002    7.884160
min       0.000000    0.000000       0.000000       0.000000    0.000000    0.000000
25%       1.000000   99.000000      62.000000       0.000000    0.000000   27.300000
50%       3.000000  117.000000      72.000000      23.000000   30.500000   32.000000
75%       6.000000  140.250000      80.000000      32.000000  127.250000   36.600000
max      17.000000  199.000000     122.000000      99.000000  846.000000   67.100000

       DiabetesPedigreeFunction         Age     Outcome
count                768.000000  768.000000  768.000000
mean                   0.471876   33.240885    0.348958
std                    0.331329   11.760232    0.476951
min                    0.078000   21.000000    0.000000
25%                    0.243750   24.000000    0.000000
50%                    0.372500   29.000000    0.000000
75%                    0.626250   41.000000    1.000000
max                    2.420000   81.000000    1.000000


In [23]:
# How many people in the dataset have diabetes?
diabetes["Outcome"].value_counts()

0    500
1    268
Name: Outcome, dtype: int64


Based on the output above, 500 people in the dataset were not diagnosed with diabetes, while 268 were.
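
Because the two classes are not equally represented, a useful reference point is the accuracy of a trivial model that always predicts the majority class. The sketch below computes it from the counts shown above; the variable names are illustrative and not part of the original notebook.

```python
# Class counts taken from the value_counts() output above
n_negative = 500  # Outcome == 0
n_positive = 268  # Outcome == 1
n_total = n_negative + n_positive

# Share of diabetic individuals in the dataset
positive_rate = n_positive / n_total

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = max(n_negative, n_positive) / n_total

print(f"Positive rate: {positive_rate:.1%}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.1%}")
```

A classifier needs to clearly exceed this baseline (roughly 65%) before its accuracy can be considered informative.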

In [35]:
diabetes.groupby("Outcome").mean()

         Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin        BMI  \
Outcome
0           3.298000  109.980000      68.184000      19.664000   68.792000  30.304200
1           4.865672  141.257463      70.824627      22.164179  100.335821  35.142537

         DiabetesPedigreeFunction        Age
Outcome
0                        0.429734  31.190000
1                        0.550500  37.067164



The table above shows that, on average, diabetic people have a higher plasma glucose level (about 141 versus 110).

They are also older on average (about 37 versus 31 years), which suggests that the likelihood of being diabetic increases with age.


## Building a Predictive Model

In [42]:
# Separating the features (X) from the target (y)
X = diabetes.drop(columns = "Outcome")
y = diabetes["Outcome"]

# The features are on very different scales (compare Insulin with
# DiabetesPedigreeFunction), so we standardize the data
scaler = StandardScaler()

# Fitting the scaler to the data, then transforming the data
scaler.fit(X)
standardized_data = scaler.transform(X)

# Reassign X to represent the standardized data
X = standardized_data
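
Standardization rescales each column to zero mean and unit variance using z = (x - mean) / std. As a minimal sketch, the snippet below checks this formula against `StandardScaler` on the Glucose and BMI values of the five rows printed by `diabetes.head()`; the small array and its variable names are for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Glucose and BMI for the first five rows shown by diabetes.head()
sample = np.array([
    [148, 33.6],
    [85, 26.6],
    [183, 23.3],
    [89, 28.1],
    [137, 43.1],
])

# Manual z-score: subtract the column mean and divide by the population
# standard deviation (ddof=0), which is what StandardScaler uses
manual = (sample - sample.mean(axis=0)) / sample.std(axis=0)

scaled = StandardScaler().fit_transform(sample)

# Both approaches should agree, and each scaled column should have
# a mean close to 0 and a standard deviation close to 1
print(np.allclose(manual, scaled))
print(np.allclose(scaled.mean(axis=0), 0), np.allclose(scaled.std(axis=0), 1))
```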

In [47]:
# Splitting the data into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, stratify = y, random_state = 42)

# Checking the shapes of X_train and X_test
print(X_train.shape, X_test.shape)

(614, 8) (154, 8)


In [50]:
# Training a support vector machine classifier with a linear kernel
classifier = svm.SVC(kernel = "linear")

classifier.fit(X_train, y_train)

SVC(kernel='linear')
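
In the cells above, the scaler is fitted on the full dataset before the split, so the test rows contribute to the scaling statistics. One common alternative, sketched here on synthetic data (an assumption for illustration, not the notebook's method), is to wrap the scaler and the classifier in a pipeline so that the scaling is learned from the training rows only. The names `X_demo`, `y_demo`, and `model` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the diabetes data: 768 rows, 8 features,
# roughly a 65% / 35% class balance (all values are made up)
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, weights=[0.65, 0.35], random_state=42
)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# The pipeline fits StandardScaler on X_tr only, then trains the SVM;
# at prediction time the same training statistics are applied to X_te
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(X_tr, y_tr)

print(X_tr.shape, X_te.shape)
print("Test accuracy on synthetic data:", model.score(X_te, y_te))
```

With the real data, `X_demo` and `y_demo` would be replaced by the unscaled feature table and the `Outcome` column.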

## Model Evaluation

In [57]:
# Is our model any good?
# Calculating model accuracy on the training data
# accuracy_score expects the true labels first, then the predictions
X_train_pred = classifier.predict(X_train)
X_train_acc = accuracy_score(y_train, X_train_pred)
print("The accuracy of the model for training data is " + str(X_train_acc*100) + " percent.")

The accuracy of the model for training data is 79.15309446254072 percent.


In [82]:
# Calculating model accuracy on the test data
# accuracy_score expects the true labels first, then the predictions
X_test_pred = classifier.predict(X_test)
X_test_acc = accuracy_score(y_test, X_test_pred)
print("The accuracy of the model for test data is " + str(X_test_acc*100) + " percent.")

The accuracy of the model for test data is 72.07792207792207 percent.
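
Accuracy alone does not show how the model treats each class, which matters when there are 500 negatives and 268 positives. As a hedged illustration, the sketch below computes a confusion matrix, precision, and recall on small made-up label vectors; in the notebook, `y_test` and `X_test_pred` would take their place.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for ten people (not taken from the dataset)
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()

precision = precision_score(y_true_demo, y_pred_demo)  # TP / (TP + FP)
recall = recall_score(y_true_demo, y_pred_demo)        # TP / (TP + FN)

print(cm)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

For a screening task, recall on the diabetic class is often the more relevant figure, since a missed case is usually costlier than a false alarm.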


In [90]:
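# Making a prediction for a single person
# We use an arbitrarily chosen row of feature values from the CSV file, in column order:
# Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age
input_data = [11,143,94,33,146,36.6,0.254,51]

# Changing the input into a numpy array
# (named input_data to avoid shadowing Python's built-in input function)
input_np = np.asarray(input_data)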

# Reshaping the data to one row with eight feature columns, as the scaler and classifier expect
reshaped_input_np = input_np.reshape(1, -1)

# Standardizing the input with the scaler fitted earlier
sd_data = scaler.transform(reshaped_input_np)

# Prediction
prediction = classifier.predict(sd_data)

if prediction[0] == 1:
    print("The person is diabetic")
else:
    print("The person is not diabetic")
        

The person is diabetic


