In [1]:
import pandas as pd

Naive Bayes Classifier
This is a classification technique based on Bayes' theorem, combined with an assumption of independence between predictors. In simple terms, a Naive Bayes classifier assumes that, within a given class, the presence of a particular feature is unrelated to the presence of any other feature.
For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, a Naive Bayes Classifier would consider all of these properties to independently contribute to the probability that this fruit is an apple.
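To make the independence assumption concrete, here is a minimal sketch that scores the fruit from the example by multiplying a class frequency with one probability per feature. All numbers are made up for illustration, and the second class (`orange`) and the helper `naive_score` are hypothetical names introduced only for this sketch.

```python
from math import prod

# Hypothetical numbers for illustration only (not estimated from any data).
prior = {"apple": 0.5, "orange": 0.5}
likelihood = {
    "apple":  {"red": 0.7, "round": 0.9, "about_3_inches": 0.6},
    "orange": {"red": 0.1, "round": 0.9, "about_3_inches": 0.5},
}

def naive_score(label, features):
    """Class frequency times the product of per-feature probabilities.

    Each feature contributes its own factor, as if it were independent
    of the other features within the class.
    """
    return prior[label] * prod(likelihood[label][f] for f in features)

observed = ["red", "round", "about_3_inches"]
scores = {label: naive_score(label, observed) for label in prior}

# Normalise so the scores sum to 1 and can be read as class probabilities.
total = sum(scores.values())
posterior = {label: s / total for label, s in scores.items()}
print(posterior)
```

With these assumed numbers the fruit is scored as far more likely to be an apple, because "red" is much more probable under that class while the other two features barely separate the classes.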
A Naive Bayes model is simple to build and is particularly useful for very large data sets. Despite its simplicity, Naive Bayes can perform competitively with more sophisticated classification methods on some problems.
Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and P(x|c). The expression for the posterior probability is as follows.
$$P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)}$$
Here,
P(c|x) is the posterior probability of class (target) given predictor (attribute). 
P(c) is the prior probability of class. 
P(x|c) is the likelihood which is the probability of predictor given class. 
P(x) is the prior probability of predictor.
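With several predictors, the independence assumption described earlier lets the likelihood factor into one term per predictor. As a sketch in the same notation (the indexed symbols $x_i$ and $n$, and the predicted class $\hat{c}$, are introduced here for illustration):

```latex
P(c \mid x_1, \dots, x_n)
  = \frac{P(c)\,\prod_{i=1}^{n} P(x_i \mid c)}{P(x_1, \dots, x_n)}
\quad\Longrightarrow\quad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

The denominator does not depend on the class, so it can be ignored when only the most probable class is needed.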

In [3]:
df = pd.read_csv("diabetes.csv")
df.head()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [4]:
# Drop three columns in place so that only five predictors and the Outcome label remain
df.drop(['Pregnancies', 'SkinThickness', 'DiabetesPedigreeFunction'], axis='columns', inplace=True)

In [5]:
df.head()

,Glucose,BloodPressure,Insulin,BMI,Age,Outcome
0,148,72,0,33.6,50,1
1,85,66,0,26.6,31,0
2,183,64,0,23.3,32,1
3,89,66,94,28.1,21,0
4,137,40,168,43.1,33,1


In [6]:
inputs = df.drop('Outcome',axis='columns')
target = df.Outcome

In [7]:
df.shape

(768, 6)

In [8]:
inputs.Age[:10]

0    50
1    31
2    32
3    21
4    33
5    30
6    26
7    29
8    53
9    54
Name: Age, dtype: int64

In [9]:
# Replace any missing Age values with the column mean
inputs.Age = inputs.Age.fillna(inputs.Age.mean())
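Before imputing, it can help to check whether a column has missing values at all. The following is a minimal sketch on a small made-up frame (`demo` is a hypothetical stand-in for `inputs`, and its values are illustrative only); it counts missing entries, applies the same mean-imputation pattern as the cell above, and counts again.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for `inputs`; the values are illustrative only.
demo = pd.DataFrame({
    "Glucose": [148, 85, 183, 89],
    "Age": [50, np.nan, 32, 21],
})

# Count missing values per column before deciding whether imputation is needed.
missing_before = demo.isna().sum()
print(missing_before)

# Same mean-imputation pattern as above, applied to the demo frame.
demo["Age"] = demo["Age"].fillna(demo["Age"].mean())
missing_after = demo.isna().sum()
print(missing_after)
```

If the count is already zero for a column, `fillna` leaves that column unchanged.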

In [10]:
inputs.head()

,Glucose,BloodPressure,Insulin,BMI,Age
0,148,72,0,33.6,50
1,85,66,0,26.6,31
2,183,64,0,23.3,32
3,89,66,94,28.1,21
4,137,40,168,43.1,33


In [15]:
from sklearn.model_selection import train_test_split
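# Hold out 30% of the rows for testing (no random_state is set, so the split changes between runs)
X_train, X_test, y_train, y_test = train_test_split(inputs, target, test_size=0.3)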
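Because the split above is random, the scores reported later will vary from run to run. A minimal sketch of a reproducible variant, using synthetic stand-ins for `inputs` and `target` (the names `X_demo`, `y_demo` and the seed `42` are assumptions for illustration, not part of the notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for `inputs` and `target` (illustrative data only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = np.array([0] * 65 + [1] * 35)

# random_state fixes the shuffle; stratify keeps the class ratio similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)
print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

Fixing `random_state` makes the rows in each part identical across runs, and `stratify` can be useful when one class is noticeably rarer than the other.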

In [16]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()

In [17]:
model.fit(X_train,y_train)

GaussianNB()
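To show roughly what a Gaussian Naive Bayes fit involves, here is a minimal from-scratch sketch on synthetic data: it estimates a mean and variance per class and per feature, then scores each row with the log prior plus the sum of per-feature Gaussian log likelihoods. The helper `gaussian_nb_log_scores` and the data are hypothetical; scikit-learn's `GaussianNB` additionally applies a small variance smoothing (its `var_smoothing` parameter), so this is an approximation of its behaviour rather than a copy of its implementation.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data (illustrative only, not the diabetes data).
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                    rng.normal(3.0, 1.0, size=(50, 2))])
y_demo = np.array([0] * 50 + [1] * 50)

def gaussian_nb_log_scores(X_fit, y_fit, X_new):
    """Return the classes and, per row, log P(c) + sum_i log N(x_i | mean_ci, var_ci)."""
    classes = np.unique(y_fit)
    scores = []
    for c in classes:
        Xc = X_fit[y_fit == c]
        mean, var = Xc.mean(axis=0), Xc.var(axis=0)
        log_prior = np.log(len(Xc) / len(X_fit))
        # Log density of an independent Gaussian per feature.
        log_lik = -0.5 * (np.log(2 * np.pi * var) + (X_new - mean) ** 2 / var)
        scores.append(log_prior + log_lik.sum(axis=1))
    return classes, np.column_stack(scores)

classes, log_scores = gaussian_nb_log_scores(X_demo, y_demo, X_demo)
manual_pred = classes[np.argmax(log_scores, axis=1)]

# Compare against scikit-learn on the same data.
sk_pred = GaussianNB().fit(X_demo, y_demo).predict(X_demo)
print((manual_pred == sk_pred).mean())
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.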

In [18]:
model.score(X_test,y_test)

0.7402597402597403

In [19]:
X_test[0:10]

,Glucose,BloodPressure,Insulin,BMI,Age
55,73,50,0,23.0,21
10,110,92,0,37.6,30
205,111,72,0,23.9,27
434,90,68,0,24.5,36
229,117,80,53,45.2,24
435,141,0,0,42.4,29
443,108,70,0,30.5,33
76,62,78,0,32.6,41
609,111,62,182,24.0,23
319,194,78,0,23.5,59


In [23]:
X_test[10:100]

,Glucose,BloodPressure,Insulin,BMI,Age
155,152,88,0,50.0,36
671,99,58,0,25.4,21
751,121,78,74,39.0,28
705,80,80,0,39.8,28
12,139,80,0,27.1,57
...,...,...,...,...,...
160,151,90,0,29.7,36
639,100,74,46,19.5,28
389,100,68,81,31.6,28
761,170,74,0,44.0,43


In [20]:
y_test[0:10]

55     0
10     0
205    0
434    0
229    0
435    1
443    1
76     0
609    0
319    1
Name: Outcome, dtype: int64

In [21]:
model.predict(X_test[0:10])

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1], dtype=int64)
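Comparing these ten predictions with the true labels printed for the same rows gives a quick sanity check. A small sketch, with both arrays copied by hand from the outputs above:

```python
import numpy as np

# Values copied from the y_test[0:10] and model.predict(X_test[0:10]) outputs above.
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 0, 0, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

accuracy = (y_true == y_pred).mean()
false_negatives = int(((y_true == 1) & (y_pred == 0)).sum())
false_positives = int(((y_true == 0) & (y_pred == 1)).sum())
print(accuracy, false_negatives, false_positives)  # 0.8 2 0
```

On this small slice both errors are false negatives (rows 435 and 443), meaning positive cases that the model predicted as negative.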

In [24]:
model.predict(X_test[10:100])

array([1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0,
       1, 0], dtype=int64)

In [22]:
model.predict_proba(X_test[:10])

array([[0.9855911 , 0.0144089 ],
       [0.76822339, 0.23177661],
       [0.94469423, 0.05530577],
       [0.94647178, 0.05352822],
       [0.71291092, 0.28708908],
       [0.57018098, 0.42981902],
       [0.87986133, 0.12013867],
       [0.88994119, 0.11005881],
       [0.93127221, 0.06872779],
       [0.06564683, 0.93435317]])
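Each row holds the estimated probability of class 0 and class 1, and the predicted label is the column with the larger value. A short sketch that recovers the labels from the probabilities, with the array copied by hand from the output above:

```python
import numpy as np

# Probabilities copied from the predict_proba output above (columns: class 0, class 1).
proba = np.array([
    [0.9855911,  0.0144089],
    [0.76822339, 0.23177661],
    [0.94469423, 0.05530577],
    [0.94647178, 0.05352822],
    [0.71291092, 0.28708908],
    [0.57018098, 0.42981902],
    [0.87986133, 0.12013867],
    [0.88994119, 0.11005881],
    [0.93127221, 0.06872779],
    [0.06564683, 0.93435317],
])

# The predicted class is the index of the largest probability in each row.
pred_from_proba = proba.argmax(axis=1)
print(pred_from_proba)  # [0 0 0 0 0 0 0 0 0 1]
```

The result matches the `model.predict` output for the same ten rows. The sixth row (index 435) is the closest call, at about 0.43 for class 1, and it is one of the rows whose true label is 1.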

Calculate the score using cross-validation

In [27]:
from sklearn.model_selection import cross_val_score
cross_val_score(GaussianNB(),X_train, y_train, cv=5)

array([0.7962963 , 0.74074074, 0.73831776, 0.74766355, 0.77570093])
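The five fold scores are usually summarised by their mean and spread. A minimal sketch, with the values copied by hand from the output above:

```python
import numpy as np

# Fold accuracies copied from the cross_val_score output above.
scores = np.array([0.7962963, 0.74074074, 0.73831776, 0.74766355, 0.77570093])

print(f"mean accuracy: {scores.mean():.3f}")
print(f"std across folds: {scores.std():.3f}")
```

The mean comes out at roughly 0.76, in line with the single-split test accuracy of about 0.74 reported earlier.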

In [28]:
from sklearn.naive_bayes import GaussianNB

# Refit on the training split and compare training and test accuracy
gnb = GaussianNB()
gnb.fit(X_train, y_train)
print(f'Accuracy of GNB classifier on training set: {gnb.score(X_train, y_train):.2f}')
print(f'Accuracy of GNB classifier on test set: {gnb.score(X_test, y_test):.2f}')

Accuracy of GNB classifier on training set: 0.77
Accuracy of GNB classifier on test set: 0.74
