# Importing Necessary Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
plt.style.use('ggplot')

# Importing Diabetes Dataset

In [2]:
df = pd.read_csv("../datasets/diabetes.csv")
df.head()

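| Unnamed: 0 | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |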


# Naive Bayes Algorithm

Naive Bayes is a machine learning technique for classification that is built on Bayes' theorem.<br>
Bayes' theorem states:

$$P(A \mid B) = \frac{P(A) \, P(B \mid A)}{P(B)}$$

where<br>
$P(A \mid B)$ = posterior probability, i.e. the probability of A given condition B (B may combine several conditions)<br>
$P(A)$ = prior probability, i.e. the probability of A before B is observed<br>
$P(B \mid A)$ = likelihood, i.e. the probability of observing B given that the outcome is A<br>
$P(B)$ = evidence, i.e. the probability of B in the data<br>

The algorithm is called "naive" because it assumes that the conditions in B are independent of each other given the class, so the likelihood becomes a product of per-feature terms. The Naive Bayes algorithm assigns the class in A that has the highest posterior probability.

For the diabetes dataset<br>
$$P(Outcome=0 \mid Pregnancies\cap Glucose\cap \ldots \cap Age) = \frac{P(Outcome=0) \, P(Pregnancies\cap Glucose\cap \ldots \cap Age \mid Outcome=0)}{P(Pregnancies\cap Glucose\cap \ldots \cap Age)}$$

<br>
Similarly, $P(Outcome=1 \mid \text{conditions})$ is calculated, and the class with the higher probability is selected as the predicted class. The denominator is the same for both classes, so only the numerators need to be compared.
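To make the posterior calculation concrete, here is a minimal from-scratch sketch in plain Python. It uses only the `Glucose` and `BMI` values of the five rows shown by `df.head()`, models each feature with a Gaussian per class, and multiplies the prior by the per-feature likelihoods. The function names (`gaussian_pdf`, `fit`, `posterior`) and the query points are hypothetical, and the sketch uses the plain population variance with no smoothing, so it illustrates the idea rather than reproducing what `GaussianNB` does internally.

```python
import math

# (Glucose, BMI) values from the five rows of df.head(), grouped by Outcome
toy_data = {
    0: [(85.0, 26.6), (89.0, 28.1)],
    1: [(148.0, 33.6), (183.0, 23.3), (137.0, 43.1)],
}

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fit(data):
    """Estimate the prior and the per-feature mean and variance for each class."""
    total = sum(len(rows) for rows in data.values())
    params = {}
    for label, rows in data.items():
        columns = list(zip(*rows))
        means = [sum(col) / len(col) for col in columns]
        variances = [
            sum((v - m) ** 2 for v in col) / len(col)  # population variance
            for col, m in zip(columns, means)
        ]
        params[label] = (len(rows) / total, means, variances)
    return params

def posterior(params, x):
    """Return P(class | x) for every class via Bayes' theorem."""
    joint = {}
    for label, (prior, means, variances) in params.items():
        likelihood = 1.0
        # Naive assumption: features are independent given the class
        for value, m, v in zip(x, means, variances):
            likelihood *= gaussian_pdf(value, m, v)
        joint[label] = prior * likelihood  # numerator of Bayes' theorem
    evidence = sum(joint.values())  # denominator, identical for all classes
    return {label: p / evidence for label, p in joint.items()}

params = fit(toy_data)
post = posterior(params, (150.0, 35.0))  # hypothetical patient
predicted = max(post, key=post.get)
print(post, predicted)
```

With only five rows the estimates are far too rough to be meaningful; the point is the structure of the computation: prior times product of likelihoods, normalized by the evidence.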

# Implementation

In [3]:
# Split the dataset into features and target variable

X = df.drop(['Outcome'], axis=1)
y = df['Outcome']

In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Building Naive Bayes Model with all features

In [5]:
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()

# Train the model on the training set
model.fit(X_train, y_train)

# Predict the class labels for the test set
y_pred = model.predict(X_test)
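The link between `predict` and the posterior probabilities described earlier can be checked directly: `predict_proba` returns one posterior per class, and the predicted label should be the class with the largest one. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, so that it runs without the CSV file), and the names `X_demo`, `y_demo` and `demo_model` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (an assumption; not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

demo_model = GaussianNB().fit(X_tr, y_tr)

# One row per sample, one column per class; each row sums to 1
proba = demo_model.predict_proba(X_te)

# Pick the class with the highest posterior for every sample
labels_from_proba = demo_model.classes_[np.argmax(proba, axis=1)]
hard_labels = demo_model.predict(X_te)

agree = bool(np.array_equal(labels_from_proba, hard_labels))
print(proba[:3])
print("predict matches argmax of predict_proba:", agree)
```

The same two calls can be applied to `model` and `X_test` from the cell above to inspect how confident each diabetes prediction is.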

# Evaluating the Model

In [6]:
from sklearn.metrics import classification_report, confusion_matrix

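class_names = [0, 1]

print("Classification Report:")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
# Rows are true classes, columns are predicted classes, ordered as in class_names
print(confusion_matrix(y_test, y_pred, labels=class_names))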

Classification Report:
              precision    recall  f1-score   support

           0       0.80      0.88      0.84       130
           1       0.67      0.53      0.59        62

    accuracy                           0.77       192
   macro avg       0.74      0.70      0.71       192
weighted avg       0.76      0.77      0.76       192

Confusion Matrix:
[[114  16]
 [ 29  33]]
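The figures in the classification report follow directly from the confusion matrix. As a quick check, the sketch below recomputes precision, recall, F1 and accuracy from the four counts printed above; the helper name `per_class_metrics` is hypothetical.

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = [[114, 16],
      [29, 33]]

def per_class_metrics(cm, k):
    """Precision, recall and F1 for class k of a square confusion matrix."""
    tp = cm[k][k]
    fp = sum(cm[i][k] for i in range(len(cm))) - tp  # predicted k, actually another class
    fn = sum(cm[k]) - tp                             # actually k, predicted another class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))

for k in range(len(cm)):
    p, r, f = per_class_metrics(cm, k)
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.2f}")
```

For class 1, for example, precision is 33 / (33 + 16) and recall is 33 / (33 + 29), which round to the 0.67 and 0.53 shown in the report. The low recall means the model misses 29 of the 62 diabetic cases in the test set.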


# Using Cross Validation

In [7]:
from sklearn.model_selection import cross_val_score

gnb = GaussianNB()

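# 5-fold cross-validation on the full dataset; for a classifier the default score is accuracy
cv_results = cross_val_score(gnb, X, y, cv=5)

# Accuracy on each fold
print(cv_results)

# Mean accuracy across the folds
print(np.mean(cv_results))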

[0.75324675 0.72727273 0.74675325 0.78431373 0.74509804]
0.7513368983957219
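The mean alone hides how much the score varies from fold to fold. A small sketch, using the fold accuracies printed above and only the standard library, reports the spread alongside the mean; the variable names are illustrative.

```python
import statistics

# Fold accuracies copied from the cross-validation output above
fold_scores = [0.75324675, 0.72727273, 0.74675325, 0.78431373, 0.74509804]

mean_score = statistics.mean(fold_scores)
std_score = statistics.stdev(fold_scores)  # sample standard deviation (n - 1)

print(f"accuracy: {mean_score:.3f} +/- {std_score:.3f}")
print(f"range: {min(fold_scores):.3f} to {max(fold_scores):.3f}")
```

The cross-validated mean of about 0.75 is close to the 0.77 accuracy from the single train/test split, which suggests that the single-split result was not an unusually lucky draw.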
