## Naive Bayes Classifier Using the scikit-learn API

## Exploratory Data Analysis

In [37]:
#----------------------------------Importing all the required libraries------------------------------------------#
from sklearn import datasets
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
import numpy as np
import seaborn as sns

In [38]:
#---------------------------Reading in the file and looking at the head of the data---------------------------#
email_data = pd.read_csv('data.csv')
email_data

Unnamed: 0,x1,x2,x3,x4,x5,y
0,0,0,1,1,0,-1
1,1,1,0,1,0,-1
2,0,1,1,1,1,-1
3,1,1,1,1,0,-1
4,0,1,0,0,0,-1
5,1,0,1,1,1,1
6,0,0,1,0,0,1
7,1,0,0,0,0,1
8,1,0,1,1,0,1
9,1,1,1,1,1,-1


In [39]:
#------------------------We see that there are 10 observations and 6 columns in the dataset------------------------------#
email_data.shape

(10, 6)

In [40]:
#------------------------------------Getting descriptive statistics----------------------------------------------------#
email_data.describe()

Unnamed: 0,x1,x2,x3,x4,x5,y
count,10.0,10.0,10.0,10.0,10.0,10.0
mean,0.6,0.5,0.7,0.7,0.3,-0.2
std,0.516398,0.527046,0.483046,0.483046,0.483046,1.032796
min,0.0,0.0,0.0,0.0,0.0,-1.0
25%,0.0,0.0,0.25,0.25,0.0,-1.0
50%,1.0,0.5,1.0,1.0,0.0,-1.0
75%,1.0,1.0,1.0,1.0,0.75,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0


## Defining X's and Y for the model

In [41]:
#-----------------------------------------Dropping the Y column-------------------------------------------------------#
X = email_data.drop(columns = ['y'])
X.head()

Unnamed: 0,x1,x2,x3,x4,x5
0,0,0,1,1,0
1,1,1,0,1,0
2,0,1,1,1,1
3,1,1,1,1,0
4,0,1,0,0,0


In [42]:
#------------------------------------------Assigning actual Y variable-------------------------------------------#
Y = email_data['y']
Y.head()

0   -1
1   -1
2   -1
3   -1
4   -1
Name: y, dtype: int64

## Question 1: Frequency table

In [43]:
#--------------------------- Frequency table for variable x1--------------------------------------------#
count_x1 = email_data.groupby(['x1', 'y']).size().unstack()
print(count_x1) 

y   -1   1
x1        
0    3   1
1    3   3


In [14]:
#--------------------------- Creating a frequency table using pivot table for x1---------------------------------#
pd.pivot_table(
    email_data, 
    values='y', 
    index=email_data['x1'], 
    columns=email_data['y'], 
    aggfunc=np.size, 
    fill_value=0
)

y   -1   1
x1        
0    3   1
1    3   3


In [15]:
#--------------------------- Creating a frequency table using pivot table for x2---------------------------------#
pd.pivot_table(
    email_data, 
    values='y', 
    index=email_data['x2'], 
    columns=email_data['y'], 
    aggfunc=np.size, 
    fill_value=0
)

y   -1   1
x2        
0    1   4
1    5   0


In [16]:
#--------------------------- Creating a frequency table using pivot table for x3---------------------------------#
pd.pivot_table(
    email_data, 
    values='y', 
    index=email_data['x3'], 
    columns=email_data['y'], 
    aggfunc=np.size, 
    fill_value=0
)

y   -1   1
x3        
0    2   1
1    4   3


In [17]:
#--------------------------- Creating a frequency table using pivot table for x4---------------------------------#
pd.pivot_table(
    email_data, 
    values='y', 
    index=email_data['x4'], 
    columns=email_data['y'], 
    aggfunc=np.size, 
    fill_value=0
)

y   -1   1
x4        
0    1   2
1    5   2


In [18]:
#--------------------------- Creating a frequency table using pivot table for x5---------------------------------#
pd.pivot_table(
    email_data, 
    values='y', 
    index=email_data['x5'], 
    columns=email_data['y'], 
    aggfunc=np.size, 
    fill_value=0
)

y   -1   1
x5        
0    4   3
1    2   1
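
The five pivot-table cells above differ only in the feature name, so the same counts can be produced in a single loop. The sketch below uses `pd.crosstab`, which counts co-occurrences directly and fills missing combinations with 0. It rebuilds the ten rows inline (an assumption, so that the example runs without `data.csv`), and the `tables` dictionary is a name introduced here for illustration only.

```python
import pandas as pd

# The ten rows shown earlier, rebuilt inline so the example runs without data.csv.
email_data = pd.DataFrame(
    {
        "x1": [0, 1, 0, 1, 0, 1, 0, 1, 1, 1],
        "x2": [0, 1, 1, 1, 1, 0, 0, 0, 0, 1],
        "x3": [1, 0, 1, 1, 0, 1, 1, 0, 1, 1],
        "x4": [1, 1, 1, 1, 0, 1, 0, 0, 1, 1],
        "x5": [0, 0, 1, 0, 0, 1, 0, 0, 0, 1],
        "y": [-1, -1, -1, -1, -1, 1, 1, 1, 1, -1],
    }
)

features = ["x1", "x2", "x3", "x4", "x5"]

# One frequency table per feature: rows are feature values, columns are classes.
tables = {f: pd.crosstab(email_data[f], email_data["y"]) for f in features}

for name, table in tables.items():
    print(table, end="\n\n")
```

Each entry of `tables` can then be indexed as `tables["x2"].loc[1, -1]` (feature value first, class second) when the counts are needed for the probability calculations.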


## Frequency Tables

Next, compute all the probabilities required for a Naïve Bayes classifier. Here y = +1 means Read = Yes and y = -1 means Read = No (discard).

Class probabilities (the same for every feature):
- P(Read = Yes) = 4/10 = 0.4
- P(Read = No) = 6/10 = 0.6

Feature probabilities P(xi | y) for x1 = Known author:
- P(Known author = False | Read = Yes) = 1/4 = 0.25
- P(Known author = True | Read = Yes) = 3/4 = 0.75
- P(Known author = False | Read = No) = 3/6 = 0.5
- P(Known author = True | Read = No) = 3/6 = 0.5

Feature probabilities P(xi | y) for x2 = Is long:
- P(Is long = False | Read = Yes) = 4/4 = 1 (unsmoothed; with the smoothing used in the next line this would be (4 + 1) / (4 + 2) = 5/6 ≈ 0.833)
- P(Is long = True | Read = Yes) = 0/4 = 0. A zero here would force the whole product for Read = Yes to zero, so we apply Laplace smoothing with c = 2 (the number of values the feature can take): (0 + 1) / (4 + 2) = 1/6 ≈ 0.167
- P(Is long = False | Read = No) = 1/6 ≈ 0.167
- P(Is long = True | Read = No) = 5/6 ≈ 0.833

Feature probabilities P(xi | y) for x3 = has 'research':
- P(has 'research' = False | Read = Yes) = 1/4 = 0.25
- P(has 'research' = True | Read = Yes) = 3/4 = 0.75
- P(has 'research' = False | Read = No) = 2/6 ≈ 0.333
- P(has 'research' = True | Read = No) = 4/6 ≈ 0.667

Feature probabilities P(xi | y) for x4 = has 'grade':
- P(has 'grade' = False | Read = Yes) = 2/4 = 0.5
- P(has 'grade' = True | Read = Yes) = 2/4 = 0.5
- P(has 'grade' = False | Read = No) = 1/6 ≈ 0.167
- P(has 'grade' = True | Read = No) = 5/6 ≈ 0.833

Feature probabilities P(xi | y) for x5 = has 'lottery':
- P(has 'lottery' = False | Read = Yes) = 3/4 = 0.75
- P(has 'lottery' = True | Read = Yes) = 1/4 = 0.25
- P(has 'lottery' = False | Read = No) = 4/6 ≈ 0.667
- P(has 'lottery' = True | Read = No) = 2/6 ≈ 0.333
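
The conditional probabilities in this section can be checked mechanically from the raw rows. The following is a minimal sketch using only the standard library; the helper name `conditional` and its `alpha` parameter are introduced here for illustration, and `alpha=1` reproduces the Laplace correction with c = 2 used for the zero count.

```python
from fractions import Fraction

# The ten rows of the dataset shown earlier: (x1, x2, x3, x4, x5, y).
rows = [
    (0, 0, 1, 1, 0, -1),
    (1, 1, 0, 1, 0, -1),
    (0, 1, 1, 1, 1, -1),
    (1, 1, 1, 1, 0, -1),
    (0, 1, 0, 0, 0, -1),
    (1, 0, 1, 1, 1, 1),
    (0, 0, 1, 0, 0, 1),
    (1, 0, 0, 0, 0, 1),
    (1, 0, 1, 1, 0, 1),
    (1, 1, 1, 1, 1, -1),
]


def conditional(feature, value, label, alpha=0):
    """Estimate P(x_feature = value | y = label).

    feature is a 0-based column index (0 means x1). alpha is the add-alpha
    smoothing constant; each feature takes 2 values, hence 2 * alpha below.
    """
    in_class = [r for r in rows if r[-1] == label]
    hits = sum(1 for r in in_class if r[feature] == value)
    return Fraction(hits + alpha, len(in_class) + 2 * alpha)


# Print the unsmoothed table for every feature, value, and class.
for feature in range(5):
    for label in (1, -1):
        for value in (0, 1):
            p = conditional(feature, value, label)
            print(f"P(x{feature + 1} = {value} | y = {label:+d}) = {p}")

# The one zero count, smoothed as in the hand calculation.
print("smoothed P(x2 = 1 | y = +1) =", conditional(1, 1, 1, alpha=1))
```

Working with `Fraction` keeps the values exact, which makes it easy to compare them against the hand-written fractions.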

## Question 2

In [44]:
#----------------------------Import Gaussian Naive Bayes model--------------------------------------------------#
from sklearn.naive_bayes import GaussianNB

#------------------------------Create a Gaussian Classifier-----------------------------------------------------#
model = GaussianNB()

In [57]:
#-----------------------------Training the model--------------------------------------------------------------#
model.fit(X,Y)

GaussianNB(priors=None, var_smoothing=1e-09)
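
`GaussianNB` models each feature with a normal distribution per class, so its probabilities will generally not match frequency-based estimates computed by hand from the tables in Question 1. As a point of comparison only (this is not the model used in this notebook), the sketch below fits scikit-learn's `BernoulliNB` with `alpha=1.0`, which applies add-one smoothing to binary features. The data is rebuilt inline as an assumption so the example runs without `data.csv`; the names `bnb`, `likelihoods`, and `priors` are illustrative.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# The ten rows shown earlier: columns are x1..x5.
X = np.array([
    [0, 0, 1, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [0, 1, 0, 0, 0],
    [1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 1, 1, 1],
])
y = np.array([-1, -1, -1, -1, -1, 1, 1, 1, 1, -1])

# alpha=1.0 adds one pseudo-count to each (feature value, class) cell.
bnb = BernoulliNB(alpha=1.0)
bnb.fit(X, y)

# classes_ is sorted, so row 0 refers to y = -1 and row 1 to y = +1.
likelihoods = np.exp(bnb.feature_log_prob_)  # smoothed P(x_i = 1 | y)
priors = np.exp(bnb.class_log_prior_)        # P(y)

print("classes:", bnb.classes_)
print("priors:", priors)
print("P(x_i = 1 | y):")
print(likelihoods)
```

Note that `BernoulliNB` smooths every cell, not only the zero count, so most of its likelihoods differ slightly from unsmoothed fractions such as 3/4.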

### Method predict:
Returns an array of predicted class labels (-1 or +1), one for each row of the input.

In [46]:
#--------------------------------Predict Output for x1 = (0 0 0 0 0)-------------------------------------------#
predicted= model.predict([[0,0,0,0,0]]) 
print ("Predicted Class:", predicted)

Predicted Class: [1]


In [47]:
#------------------------------- Predict Output for x2 = (1 1 0 1 0)-------------------------------------------#
predicted= model.predict([[1,1,0,1,0]]) # x1 = 1, x2 = 1, x3 = 0, x4 = 1, x5 = 0 
print ("Predicted Class:", predicted)

Predicted Class: [-1]


### Method predict_proba:
- Returns an n-by-2 matrix of class probabilities, one row per input sample.
- The columns follow the order of `model.classes_`, which is sorted: here `[-1, 1]`.
- Entry (i,0) is the probability that sample i belongs to class -1 (discard).
- Entry (i,1) is the probability that sample i belongs to class +1 (read).
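
The link between the probability columns and the class labels can be made explicit through `model.classes_`. The sketch below is a small illustration under the assumption that the data is rebuilt inline (so it runs without `data.csv`); `proba` and `best` are names introduced here.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# The ten rows shown earlier: columns are x1..x5.
X = np.array([
    [0, 0, 1, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [0, 1, 0, 0, 0],
    [1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 1, 1, 1],
])
y = np.array([-1, -1, -1, -1, -1, 1, 1, 1, 1, -1])

model = GaussianNB().fit(X, y)

# Column j of predict_proba corresponds to model.classes_[j].
print("column order:", model.classes_)

proba = model.predict_proba([[0, 0, 0, 0, 0]])

# Picking the most probable column and mapping it back to a label
# gives the same answer as calling predict directly.
best = model.classes_[np.argmax(proba, axis=1)]
print("probabilities:", proba)
print("most probable class:", best)
```

Looking up the label through `model.classes_` avoids hard-coding the assumption that column 0 means "discard".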

In [58]:
#---------------------------------Predicted probability for x1 = (0 0 0 0 0)--------------------------------------#
predicted_prob= model.predict_proba([[0,0,0,0,0]]) 
print ("Predicted Probability:", predicted_prob)

Predicted Probability: [[2.85780302e-06 9.99997142e-01]]


- Here the predicted probability for read (+1) is higher than that for discard (-1), so the predicted class is +1.

In [59]:
#------------------------------- Predicted Probability for x = (1 1 0 1 0)-------------------------------------------#
predicted_proba= model.predict_proba([[1,1,0,1,0]]) # x1 = 1, x2 = 1, x3 = 0, x4 = 1, x5 = 0 
print ("Predicted Probability:", predicted_proba)

Predicted Probability: [[1. 0.]]


- Here the predicted probability for discard (-1) is higher than that for read (+1), so the predicted class is -1.

In [18]:
email_data[['x1','y']]

Unnamed: 0,x1,y
0,0,-1
1,1,-1
2,0,-1
3,1,-1
4,0,-1
5,1,1
6,0,1
7,1,1
8,1,1
9,1,-1


### Trial and error: calculating the posterior probability

In [19]:
# calculate P(A|B) given P(A), P(B|A), P(B|not A)
def bayes_theorem(p_a, p_b_given_a, p_b_given_not_a):
	# calculate P(not A)
	not_a = 1 - p_a
	# calculate P(B)
	p_b = p_b_given_a * p_a + p_b_given_not_a * not_a
	# calculate P(A|B)
	p_a_given_b = (p_b_given_a * p_a) / p_b
	return p_a_given_b
 
# P(A)
p_a = 0.0002
# P(B|A)
p_b_given_a = 0.85
# P(B|not A)
p_b_given_not_a = 0.05
# calculate P(A|B)
result = bayes_theorem(p_a, p_b_given_a, p_b_given_not_a)
# summarize
print('P(A|B) = %.3f%%' % (result * 100))

P(A|B) = 0.339%


In [20]:
# calculate P(y|xi) given P(y), P(xi|y), P(xi|not y)
def bayes_theorem(p_y, p_xi_given_y, p_xi_given_not_y):
	# calculate P(not y)
	not_y = 1 - p_y
	# calculate P(xi) with the law of total probability
	p_xi = p_xi_given_y * p_y + p_xi_given_not_y * not_y
	# calculate P(y|xi)
	p_y_given_xi = (p_xi_given_y * p_y) / p_xi
	return p_y_given_xi
 
# P(y)
p_y = 0.0002
# P(xi|y)
p_xi_given_y = 0.85
# P(xi|not y)
p_xi_given_not_y = 0.05
# calculate P(y|xi)
result = bayes_theorem(p_y, p_xi_given_y, p_xi_given_not_y)
# summarize
print('P(y|xi) = %.3f%%' % (result * 100))

P(y|xi) = 0.339%

In [None]:
# calculate P(y|xi) given P(y), P(xi|y), P(xi)
def bayes_theorem(p_y, p_xi_given_y, p_xi):
	# calculate P(y|xi) directly from the marginal P(xi)
	p_y_given_xi = (p_xi_given_y * p_y) / p_xi
	return p_y_given_xi
 
# P(y = +1), from the class counts: 4/10
p_y = 0.4
# P(x1 = 1 | y = +1), from the x1 frequency table: 3/4
p_xi_given_y = 0.75
# P(x1 = 1), from the x1 frequency table: 6/10
p_xi = 0.6
# calculate P(y = +1 | x1 = 1)
result = bayes_theorem(p_y, p_xi_given_y, p_xi)
# summarize
print('P(y|xi) = %.3f%%' % (result * 100))
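
The single-feature Bayes rule above extends to all five features by multiplying the per-feature likelihoods, which is the naive independence assumption. The sketch below works the two query points from Question 2 through by hand. The probability values are copied from the frequency-based hand calculation (including the smoothed 1/6 for Is long = True given Read = Yes, and the unsmoothed 4/4 for Is long = False); the names `prior`, `p_true`, `p_false`, and `posterior` are introduced here for illustration. Because this uses frequency estimates rather than Gaussian densities, the probabilities are not expected to equal those reported by `GaussianNB`.

```python
from fractions import Fraction

# Class priors: +1 is Read = Yes, -1 is Read = No.
prior = {1: Fraction(4, 10), -1: Fraction(6, 10)}

# P(x_i = 1 | y) for x1..x5, copied from the hand calculation.
# The x2 entry for y = +1 is the Laplace-smoothed value 1/6.
p_true = {
    1: [Fraction(3, 4), Fraction(1, 6), Fraction(3, 4), Fraction(2, 4), Fraction(1, 4)],
    -1: [Fraction(3, 6), Fraction(5, 6), Fraction(4, 6), Fraction(5, 6), Fraction(2, 6)],
}

# P(x_i = 0 | y) for x1..x5, copied likewise.
# The x2 entry for y = +1 is the unsmoothed 4/4, as in the hand calculation.
p_false = {
    1: [Fraction(1, 4), Fraction(4, 4), Fraction(1, 4), Fraction(2, 4), Fraction(3, 4)],
    -1: [Fraction(3, 6), Fraction(1, 6), Fraction(2, 6), Fraction(1, 6), Fraction(4, 6)],
}


def posterior(x):
    """Return {label: P(y = label | x)} under the naive independence assumption."""
    joint = {}
    for label in prior:
        p = prior[label]
        for i, value in enumerate(x):
            p *= p_true[label][i] if value == 1 else p_false[label][i]
        joint[label] = p
    total = sum(joint.values())  # P(x), the normalising constant
    return {label: p / total for label, p in joint.items()}


for x in [(0, 0, 0, 0, 0), (1, 1, 0, 1, 0)]:
    post = posterior(x)
    predicted = max(post, key=post.get)
    print(x, "P(read) =", float(post[1]), "predicted class:", predicted)
```

Under these estimates the first point is classified as +1 and the second as -1, the same classes that `model.predict` returned in Question 2.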

## Question 4

## Question 5

### Verifying the actual and predicted values for Y using all the features

In [66]:
model.fit(X,Y)

GaussianNB(priors=None, var_smoothing=1e-09)

In [67]:
Y_pred= model.predict(X) 
print ("Predicted Class:", Y_pred)

Predicted Class: [ 1 -1 -1 -1 -1  1  1  1  1 -1]


In [68]:
#---------------------------------- Actual vs Predicted values for Y---------------------------------------#
df1 = pd.DataFrame({'Actual': Y, 'Predicted': Y_pred})
df1

Unnamed: 0,Actual,Predicted
0,-1,1
1,-1,-1
2,-1,-1
3,-1,-1
4,-1,-1
5,1,1
6,1,1
7,1,1
8,1,1
9,-1,-1


In [69]:
#-------------------------------- Calculating the accuracy score----------------------------------------------#
from sklearn.metrics import accuracy_score
acc_score = accuracy_score(Y,Y_pred)
print('The accuracy score is ' + str(acc_score))

The accuracy score is 0.9
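
An accuracy of 0.9 does not say which class the single error falls in. A confusion matrix makes that visible; `confusion_matrix` is already imported at the top of the notebook. The sketch below copies the actual and predicted labels from the table above so that it runs on its own. Note that these predictions were made on the same ten rows used for fitting, so this is training accuracy rather than an estimate of performance on new emails.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Labels copied from the Actual vs Predicted table above.
actual = [-1, -1, -1, -1, -1, 1, 1, 1, 1, -1]
predicted = [1, -1, -1, -1, -1, 1, 1, 1, 1, -1]

# Rows are actual classes, columns are predicted classes, both ordered [-1, 1].
cm = confusion_matrix(actual, predicted, labels=[-1, 1])
print(cm)
# [[5 1]
#  [0 4]]

print(accuracy_score(actual, predicted))  # 0.9
```

The only mistake is one discard (-1) email predicted as read (+1), which is row 0 of the dataset.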


In [35]:
predicted = model.predict([[1,1,0,1,0]]) # x1 = 1, x2 = 1, x3 = 0, x4 = 1, x5 = 0 
print ("Predicted Class:", predicted)

Predicted Class: [-1]


### Verifying the actual and predicted values for Y using features x2, x3, x4 and x5

In [50]:
#-----------------------------------------Dropping the x1 and Y column--------------------------------------------------#
X_new = email_data.drop(columns = ['x1','y'])
X_new.head()

Unnamed: 0,x2,x3,x4,x5
0,0,1,1,0
1,1,0,1,0
2,1,1,1,1
3,1,1,1,0
4,1,0,0,0


In [51]:
#------------------------------------------Assigning actual Y variable-------------------------------------------#
Y_new = email_data['y']
Y_new.head()

0   -1
1   -1
2   -1
3   -1
4   -1
Name: y, dtype: int64

In [71]:
#----------------------------Import Gaussian Naive Bayes model--------------------------------------------------#
from sklearn.naive_bayes import GaussianNB

#------------------------------Create a Gaussian Classifier-----------------------------------------------------#
model = GaussianNB()

In [73]:
#-----------------------------Training the model--------------------------------------------------------------#
model.fit(X_new,Y_new)

GaussianNB(priors=None, var_smoothing=1e-09)

In [76]:
#---------------------------Predicting Y without using feature x1--------------------------------#
Y_pred_no_x1= model.predict(X_new) 
print ("Predicted Class:", Y_pred_no_x1)

Predicted Class: [ 1 -1 -1 -1 -1  1  1  1  1 -1]


In [33]:
#---------------------------------- Actual vs Predicted values for Y for no x1 model---------------------------------------#
df2 = pd.DataFrame({'Actual': Y_new, 'Predicted': Y_pred_no_x1})
df2

Unnamed: 0,Actual,Predicted
0,-1,1
1,-1,-1
2,-1,-1
3,-1,-1
4,-1,-1
5,1,1
6,1,1
7,1,1
8,1,1
9,-1,-1


In [77]:
#-------------------------------- Calculating the accuracy score----------------------------------------------#
from sklearn.metrics import accuracy_score
acc_score_no_x1 = accuracy_score(Y_new,Y_pred_no_x1)
print('The accuracy score is ' + str(acc_score_no_x1))

The accuracy score is 0.9


### Verifying the predicted values for x1 and x2 from question 2

In [74]:
#--------------------------------Predict Output for x = (0 0 0 0)-------------------------------------------#
predicted= model.predict([[0,0,0,0]]) 
print ("Predicted Class:", predicted)

Predicted Class: [1]


- The predicted class is the same as in Question 2, even without feature x1.

In [63]:
#--------------------------------Predicted probability for x = (0 0 0 0)-------------------------------------------#
predicted_prob= model.predict_proba([[0,0,0,0]]) 
print ("Predicted Probability:", predicted_prob)

Predicted Probability: [[1.21396982e-06 9.99998786e-01]]


- Here the predicted probability for read (+1) is higher than that for discard (-1), so the predicted class is +1.

In [64]:
#------------------------------- Predict Output for x = (1 0 1 0), i.e. (1 1 0 1 0) without x1-------------------------------------------#
predicted= model.predict([[1,0,1,0]]) # x2 = 1, x3 = 0, x4 = 1, x5 = 0 
print ("Predicted Class:", predicted)

Predicted Class: [-1]


- The predicted class is the same as in Question 2, even without feature x1.

In [65]:
#--------------------------------Predicted probability for x = (1 0 1 0)-------------------------------------------#
predicted_proba= model.predict_proba([[1,0,1,0]]) 
print ("Predicted Probability:", predicted_proba)

Predicted Probability: [[1. 0.]]


- Here the predicted probability for discard (-1) is higher than that for read (+1), so the predicted class is -1.