
# **Diabetes Prediction**

The objective of this project is to predict whether a patient has diabetes from features such as insulin level, skin thickness, age, blood pressure, number of pregnancies, glucose level and BMI.


# Importing the necessary packages

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
import numpy as np
from sklearn.model_selection import train_test_split


# Viewing the first five rows of the dataset

In [None]:
df=pd.read_csv('/content/diabetes.csv')
df.head()

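|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |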


# Checking if there are any null values present in the dataset

The check below finds no null values in any column, so nothing needs to be deleted or imputed on that basis.

In [None]:
df.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

# Obtaining statistical data of each feature (column) in the dataset 

In [None]:
df.describe()

|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |
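
The `min` row above is 0.0 for Glucose, BloodPressure, SkinThickness, Insulin and BMI. A zero is not a null, so `isnull()` does not flag it. Whether these zeros stand for missing measurements is an assumption; nothing shown here confirms it. A minimal sketch for counting them, using only the five rows printed by `df.head()` (the names `sample` and `zero_counts` are illustrative and not part of the notebook):

```python
import pandas as pd

# The five rows printed by df.head(), re-typed so the sketch runs on its own.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Count exact zeros per column; isnull() reports none for these columns.
zero_counts = (sample == 0).sum()
print(zero_counts)
```

On the full dataset the same expression could be applied to the corresponding columns of `df`.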


In [None]:
df.shape

(768, 9)

# Checking whether there is sufficient data for both classes: people with diabetes and people without

In this dataset, 268 people had diabetes, while 500 people didn't.

In [None]:
df["Outcome"].value_counts()

0    500
1    268
Name: Outcome, dtype: int64

`df.drop(columns="Outcome")` returns every column except `Outcome`; these feature columns are assigned to `X1`.

The `Outcome` column is assigned to `y`.

In [None]:
# Features: every column except the target
X1 = df.drop(columns="Outcome")
# Target: 1 = has diabetes, 0 = does not
y = df["Outcome"]

# **Standardizing the Data**

Standardization puts all features on the same scale, so that no feature dominates merely because of its units. SVMs rely on distances (or inner products) between samples, hence it is necessary to scale the variables before fitting the final Support Vector Machine (SVM) model.

`StandardScaler` standardizes features by removing the mean and scaling to unit variance.

In [None]:
X=StandardScaler().fit_transform(X1)
print(X)

[[ 0.63994726  0.84832379  0.14964075 ...  0.20401277  0.46849198
   1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575 ... -0.68442195 -0.36506078
  -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 ... -1.10325546  0.60439732
  -0.10558415]
 ...
 [ 0.3429808   0.00330087  0.14964075 ... -0.73518964 -0.68519336
  -0.27575966]
 [-0.84488505  0.1597866  -0.47073225 ... -0.24020459 -0.37110101
   1.17073215]
 [-0.84488505 -0.8730192   0.04624525 ... -0.20212881 -0.47378505
  -0.87137393]]
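
To make "removing the mean and scaling to unit variance" concrete, the sketch below standardizes two columns by hand and compares the result with `StandardScaler`. It uses only the Glucose and BMI values from the five rows shown by `df.head()`; the names `values`, `manual` and `scaled` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Glucose and BMI values from the five rows shown by df.head().
values = np.array([
    [148, 33.6],
    [85, 26.6],
    [183, 23.3],
    [89, 28.1],
    [137, 43.1],
])

# Manual z-score: subtract the column mean, divide by the population std (ddof=0).
manual = (values - values.mean(axis=0)) / values.std(axis=0)

# The same transformation through scikit-learn.
scaled = StandardScaler().fit_transform(values)

print(np.allclose(manual, scaled))
```

After the transformation each column has mean 0 and standard deviation 1, which is why the printed array above mixes positive and negative values of similar magnitude.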


# **Splitting the dataset into a training and testing dataset**

**test_size = 0.3 :**

- 30% of the dataset is held out for testing, and the model is trained on the remaining 70%.

**stratify = y :**

- The split preserves the ratio of positive to negative outcomes, so the training and testing datasets each contain about the same proportion of diabetic and non-diabetic patients.

In [None]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=3,stratify=y)
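
In this notebook `StandardScaler` is fitted on the full dataset before the split, so the test rows contribute to the mean and variance used for scaling. A common alternative is to fit the scaler on the training rows only and reuse those statistics for the test rows. The sketch below illustrates that ordering on synthetic data from `make_classification`, not on the diabetes file; all variable names are illustrative, and the results reported in this notebook were not produced this way.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes data (assumption: 768 rows, 8 features).
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=3)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=3, stratify=y_demo
)

# Learn the mean and variance from the training rows only...
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
# ...and reuse those training statistics on the held-out rows.
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
print(np.allclose(X_tr_scaled.mean(axis=0), 0.0))
```

With 768 rows and `test_size=0.3`, the split yields 537 training rows and 231 testing rows, matching the counts in this notebook.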

# It can be seen that the proportion of 0's and 1's is approximately the same in the testing and training datasets.

In [None]:
y_test.value_counts()

0    150
1     81
Name: Outcome, dtype: int64

In [None]:
y_train.value_counts()

0    350
1    187
Name: Outcome, dtype: int64

# **Using Logistic Regression :**

Logistic regression is a statistical analysis method to predict a binary outcome, such as yes or no, based on prior observations of a data set.

Logistic regression essentially uses the sigmoid function defined below to model a binary output variable.

sigmoid(z) = 1/ (1 + e<sup>-z</sup>)
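
As a small illustration of the formula, the sketch below evaluates the sigmoid at a few points and turns each value into a class label. The helper name `sigmoid` and the 0.5 cut-off are assumptions made for this example.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid: maps any real z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


for z in (-4.0, 0.0, 4.0):
    p = sigmoid(z)
    # Assumed decision rule: predict class 1 when the probability reaches 0.5.
    label = 1 if p >= 0.5 else 0
    print(z, round(p, 4), label)
```

Large negative inputs map close to 0, large positive inputs map close to 1, and `z = 0` maps to exactly 0.5.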

In [None]:
from sklearn.linear_model import LogisticRegression
logreg=LogisticRegression()
logreg.fit(X_train,y_train)
y_pred=logreg.predict(X_test)

In [None]:
from sklearn import metrics
cnf=metrics.confusion_matrix(y_test,y_pred)
cnf

array([[131,  19],
       [ 40,  41]])

In [None]:
# Unpack the 2x2 confusion matrix (rows: actual class, columns: predicted class)
tn, fp, fn, tp = cnf.ravel()
accuracy = (tn + tp) / (tn + tp + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

In [None]:
print("accuracy",accuracy,"\n")
print("precision",precision,"\n")
print("recall",recall,"\n")

accuracy 0.7445887445887446 

precision 0.6833333333333333 

recall 0.5061728395061729 



In [None]:
print("accuracy",metrics.accuracy_score(y_test,y_pred),"\n")
print("precision",metrics.precision_score(y_test,y_pred),"\n")
print("recall",metrics.recall_score(y_test,y_pred),"\n")
print("F1 score",2*(precision*recall)/(precision+recall))
print("misclassifications",fp+fn)

accuracy 0.7445887445887446 

precision 0.6833333333333333 

recall 0.5061728395061729 

F1 score 0.5815602836879433
misclassifications 59


# **Using SVM :**

In [None]:
from sklearn import svm
sv=svm.SVC(kernel='linear')
sv.fit(X_train,y_train)
y_pred=sv.predict(X_test)

In [None]:
cnf=metrics.confusion_matrix(y_test,y_pred)
cnf

array([[133,  17],
       [ 38,  43]])

In [None]:
print("accuracy",metrics.accuracy_score(y_test,y_pred))
print("precision",metrics.precision_score(y_test,y_pred))
print("recall",metrics.recall_score(y_test,y_pred))

accuracy 0.7619047619047619
precision 0.7166666666666667
recall 0.5308641975308642
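
To put the two models side by side, the sketch below recomputes the metrics directly from the two confusion matrices printed in this notebook. The helper `summarize` and the `results` dictionary are illustrative names, not part of the original code.

```python
def summarize(tn, fp, fn, tp):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts copied from the confusion matrices above, laid out as [[tn, fp], [fn, tp]].
results = {
    "logistic regression": summarize(tn=131, fp=19, fn=40, tp=41),
    "linear SVM": summarize(tn=133, fp=17, fn=38, tp=43),
}

for name, scores in results.items():
    print(name, {k: round(v, 4) for k, v in scores.items()})
```

On this particular split the linear SVM misclassifies 55 test samples against 59 for logistic regression. The gap is small and comes from a single train/test split, so it should be read as indicative only.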
