# Naive Bayes

Naive Bayes is a classification technique based on Bayes’ Theorem that assumes the predictors are independent of one another. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or on the existence of the other features, the classifier treats each of them as contributing independently to the probability that the fruit is an apple, which is why it is known as ‘Naive’.

A Naive Bayes model is easy to build and particularly useful for very large data sets. Despite its simplicity, it can perform competitively with, and on some problems outperform, far more sophisticated classification methods.

Naive Bayes is a classification algorithm for binary (two-class) and multi-class classification problems. The technique is easiest to understand when described using binary or categorical input values.

It is called naive Bayes because the calculation of the probabilities for each hypothesis is simplified to make it tractable. Rather than attempting to estimate the joint probability of the attribute values, P(d1, d2, d3|h), the attributes are assumed to be conditionally independent given the target value, so the joint probability is calculated as P(d1|h) * P(d2|h) * P(d3|h).
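
The factorization can be sketched in a few lines of Python. The probabilities below are made-up numbers for the fruit example, not estimates from any dataset, and `naive_score` is a hypothetical helper name. The point is only that each hypothesis is scored as its prior multiplied by one per-attribute likelihood at a time.

```python
# Illustrative, made-up probabilities for the fruit example (assumption)
priors = {"apple": 0.5, "other": 0.5}             # P(h)
likelihoods = {                                    # P(d_i | h) for each attribute
    "apple": {"red": 0.8, "round": 0.9, "3_inch": 0.7},
    "other": {"red": 0.3, "round": 0.5, "3_inch": 0.2},
}

def naive_score(h, observed):
    """Return P(h) * P(d1|h) * P(d2|h) * ... for the observed attributes."""
    score = priors[h]
    for d in observed:
        score *= likelihoods[h][d]
    return score

observed = ["red", "round", "3_inch"]
scores = {h: naive_score(h, observed) for h in priors}

# Normalise the scores so that the posteriors sum to 1
total = sum(scores.values())
posteriors = {h: s / total for h, s in scores.items()}
best = max(posteriors, key=posteriors.get)
print(scores, posteriors, best)
```

Dividing by `total` plays the role of P(d); it does not change which hypothesis scores highest, so it can be skipped when only the predicted class is needed.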

## Bayes' Theorem
Bayes’ Theorem is stated as:

P(h|d) = (P(d|h) * P(h)) / P(d)

Where

###### P(h|d) is the probability of hypothesis h given the data d. This is called the posterior probability.
###### P(d|h) is the probability of data d given that the hypothesis h was true.
###### P(h) is the probability of hypothesis h being true (regardless of the data). This is called the prior probability of h.
###### P(d) is the probability of the data (regardless of the hypothesis).
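
As a worked example of the formula, the sketch below plugs in invented numbers. The values of `p_h`, `p_d_given_h` and `p_d_given_not_h` are assumptions chosen for illustration, and P(d) is obtained from them with the law of total probability.

```python
def posterior(p_d_given_h, p_h, p_d):
    """Bayes' Theorem: P(h|d) = P(d|h) * P(h) / P(d)."""
    return p_d_given_h * p_h / p_d

p_h = 0.3              # prior P(h), assumed
p_d_given_h = 0.6      # likelihood P(d|h), assumed
p_d_given_not_h = 0.2  # P(d|not h), assumed

# P(d) summed over the two cases: h is true, h is false
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

p_h_given_d = posterior(p_d_given_h, p_h, p_d)
print(round(p_d, 4), round(p_h_given_d, 4))
```

With these numbers, observing d raises the probability of h from the prior 0.3 to a posterior of about 0.56.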

### Useful Libraries

#### Load Dataset. Use "bank-data.csv"

In [164]:
# import dataset
import pandas as pd
df  =pd.read_csv('bank-data.csv',index_col=0)
df.head()

| id | age | sex | region | income | married | children | car | save_act | current_act | mortgage | pep |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ID12101 | 48 | FEMALE | INNER_CITY | 17546.0 | NO | 1 | NO | NO | NO | NO | YES |
| ID12102 | 40 | MALE | TOWN | 30085.1 | YES | 3 | YES | NO | YES | YES | NO |
| ID12103 | 51 | FEMALE | INNER_CITY | 16575.4 | YES | 0 | YES | YES | YES | NO | NO |
| ID12104 | 23 | FEMALE | TOWN | 20375.4 | YES | 3 | NO | NO | YES | NO | NO |
| ID12105 | 57 | FEMALE | RURAL | 50576.3 | YES | 0 | NO | YES | NO | NO | NO |


#### Preprocess the data

In [165]:
# import library for preprocessing
from sklearn import preprocessing

In [166]:
# Transform each categorical column to integer codes using "fit_transform(attribute)"
labels = preprocessing.LabelEncoder()
df.mortgage = labels.fit_transform(df.mortgage)
df.current_act = labels.fit_transform(df.current_act)
df.save_act = labels.fit_transform(df.save_act)
df.car = labels.fit_transform(df.car)
df.married = labels.fit_transform(df.married)
df.region = labels.fit_transform(df.region)
df.sex = labels.fit_transform(df.sex)
df.head()

| id | age | sex | region | income | married | children | car | save_act | current_act | mortgage | pep |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ID12101 | 48 | 0 | 0 | 17546.0 | 0 | 1 | 0 | 0 | 0 | 0 | YES |
| ID12102 | 40 | 1 | 3 | 30085.1 | 1 | 3 | 1 | 0 | 1 | 1 | NO |
| ID12103 | 51 | 0 | 0 | 16575.4 | 1 | 0 | 1 | 1 | 1 | 0 | NO |
| ID12104 | 23 | 0 | 3 | 20375.4 | 1 | 3 | 0 | 0 | 1 | 0 | NO |
| ID12105 | 57 | 0 | 1 | 50576.3 | 1 | 0 | 0 | 1 | 0 | 0 | NO |
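
To see how the integer codes in the output above come about, the sketch below encodes a small hypothetical sample of the `region` column. The sample list is invented, and `SUBURBAN` is an assumed fourth category that does not appear in the rows shown. `LabelEncoder` sorts the category names and uses each name's position in `classes_` as its code.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the 'region' column (SUBURBAN is an assumed category)
region = ["INNER_CITY", "TOWN", "RURAL", "SUBURBAN", "TOWN"]

encoder = LabelEncoder()
codes = encoder.fit_transform(region)

# classes_ holds the sorted category names; the index of a name is its code
mapping = {str(name): int(code) for code, name in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())

# inverse_transform recovers the original strings from the codes
restored = encoder.inverse_transform(codes).tolist()
print(restored)
```

Because the preprocessing cell refits a single `labels` object for every column, only the mapping of the last column fitted remains in `labels.classes_` afterwards. Keeping one encoder per column would be needed to invert the codes later.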


#### Select independent variables and target column

In [167]:
# Select the independent variables and the target attribute
from sklearn.model_selection import train_test_split
x = df.iloc[:, :-1].values #independent
y = df.iloc[:, -1].values #target
x_train, x_test , y_train, y_test = train_test_split(x,y,test_size = 0.30,random_state=42)
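
The effect of `test_size` and `random_state` can be checked on stand-in data. The arrays below are synthetic placeholders with 479 rows, the row count the dataframe printout in this lab sheet reports; the names `x_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with 479 rows (synthetic values, not the bank data)
x_demo = np.arange(479 * 2).reshape(479, 2)
y_demo = np.arange(479) % 2

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.30, random_state=42
)
print(x_tr.shape, x_te.shape)

# Repeating the call with the same random_state gives the same partition
x_tr2, x_te2, _, _ = train_test_split(
    x_demo, y_demo, test_size=0.30, random_state=42
)
same_partition = np.array_equal(x_te, x_te2)
print(same_partition)
```

A 30% test share of 479 rows gives 144 test rows, which matches the total of the confusion matrix reported for the bank data.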

#### Import Naive Bayes Classifier library 

In [168]:
# import Classifier library
from sklearn.naive_bayes import GaussianNB

In [169]:
# Call the Classifier
model = GaussianNB()
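
For intuition about what a Gaussian Naive Bayes model estimates, here is a minimal from-scratch sketch. It is not scikit-learn's implementation: the helper names `fit_gaussian_nb` and `predict_one` are hypothetical, the toy rows are invented, and the small constant added to the variance is an assumption of this sketch.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate the prior and the per-feature mean and variance of each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        n = len(members)
        means = [sum(col) / n for col in zip(*members)]
        # Small constant keeps every variance strictly positive (assumption)
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(zip(*members), means)
        ]
        model[label] = (n / len(rows), means, variances)
    return model

def predict_one(model, row):
    """Pick the class with the largest log P(h) + sum of log P(d_i | h)."""
    def log_posterior(label):
        prior, means, variances = model[label]
        total = math.log(prior)
        for v, m, var in zip(row, means, variances):
            # Log of the Gaussian density with mean m and variance var
            total += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return total
    return max(model, key=log_posterior)

# Toy data with two features (age, income), invented for illustration
rows = [[25, 20000.0], [30, 22000.0], [28, 21000.0],
        [50, 52000.0], [55, 50000.0], [60, 58000.0]]
labels = ["NO", "NO", "NO", "YES", "YES", "YES"]

toy_model = fit_gaussian_nb(rows, labels)
pred_young = predict_one(toy_model, [27, 20500.0])
pred_old = predict_one(toy_model, [58, 55000.0])
print(pred_young, pred_old)
```

Working with log probabilities avoids numerical underflow when many small likelihoods are multiplied. `GaussianNB` guards against zero variance in a similar spirit through its `var_smoothing` parameter.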

#### Predict the target column and find the performance of the model

In [170]:
y_pred = model.fit(x_train, y_train).predict(x_test)

In [171]:
# Print Number of mislabeled points
print("mislabeled points : %d" % ( (y_test != y_pred).sum()))

mislabeled points : 56


### Prediction and Evaluation

In [172]:
# import required libraries
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

In [173]:
# Calculate and print confusion matrix and other performance measures (Refer previous labsheet)
#accuracy score
print(accuracy_score(y_test, y_pred))
#confusion matrix
print(confusion_matrix(y_test, y_pred))

0.6111111111111112
[[58 17]
 [39 30]]
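
The accuracy and the number of mislabeled points can both be read off the confusion matrix. The sketch below recomputes them from the matrix printed above, using the fact that `confusion_matrix` puts true classes on the rows and predicted classes on the columns, in sorted label order (`NO`, then `YES`). The variable names are illustrative.

```python
# Confusion matrix printed above: rows are true classes, columns are predictions
cm = [[58, 17],
      [39, 30]]

total = sum(sum(row) for row in cm)
correct = cm[0][0] + cm[1][1]          # diagonal entries
accuracy = correct / total
mislabeled = total - correct

# Metrics for the YES class (second row and column)
precision_yes = cm[1][1] / (cm[0][1] + cm[1][1])
recall_yes = cm[1][1] / (cm[1][0] + cm[1][1])
print(total, mislabeled, accuracy, precision_yes, recall_yes)
```

The recall for `YES` is well below the overall accuracy: of the 69 test rows whose true class is `YES`, only 30 are predicted as `YES`.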


#### Q1: Consider "current_act" as an irrelevant attribute. Remove it and find the accuracy of Naive Bayes classifier

In [174]:
# display dataframe first 5 columns
print(df.iloc[:, : 5])
#deleting column 
df1 =df.drop(['current_act'], axis=1)

         age  sex  region   income  married
id                                         
ID12101   48    0       0  17546.0        0
ID12102   40    1       3  30085.1        1
ID12103   51    0       0  16575.4        1
ID12104   23    0       3  20375.4        1
ID12105   57    0       1  50576.3        1
...      ...  ...     ...      ...      ...
ID12575   31    0       3  22678.1        0
ID12576   33    0       3  12178.5        1
ID12577   43    1       1  26106.7        0
ID12578   40    1       0  27417.6        1
ID12579   47    1       3  23337.2        1

[479 rows x 5 columns]


In [175]:
from sklearn.model_selection import train_test_split
# Selecting the independent variables
x1 = df1.iloc[:, :-1].values #independent
# selecting only the target labeled column
y1 = df1.iloc[:, -1].values #target
x_train1, x_test1 , y_train1, y_test1 = train_test_split(x1,y1,test_size = 0.30,random_state=42)

In [176]:
# Apply the classifier and Print Number of mislabeled points
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
y_pred1 = model.fit(x_train1, y_train1).predict(x_test1)
print("mislabeled points : %d" % ( (y_test1 != y_pred1).sum()))

mislabeled points : 56


In [177]:
# accuracy of the new classifier
# Calculate and print confusion matrix and other performance measures
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
#accuracy score
print(accuracy_score(y_test1, y_pred1))
#confusion matrix
print(confusion_matrix(y_test1, y_pred1))

0.6111111111111112
[[58 17]
 [39 30]]


#### Q2: Write your observation

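Removing `current_act` does not change the result: the accuracy stays at 0.611, with the same confusion matrix and 56 mislabeled points. The attribute is irrelevant to the class, so it contributes almost nothing to the classification. Naive Bayes is robust to an irrelevant attribute like this one because its likelihood is nearly the same for every class, so it has little influence on which class receives the highest probability, whether or not the attribute is included.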
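
The robustness to an irrelevant attribute can be illustrated with a small calculation. The likelihoods and the prior below are invented, `posterior_yes` is a hypothetical helper, and the sketch assumes the extra attribute has exactly the same likelihood under both classes, which a real attribute would only approximate.

```python
def posterior_yes(prior_yes, lik_yes, lik_no):
    """P(YES | d) for two classes, given per-attribute likelihood lists."""
    score_yes, score_no = prior_yes, 1.0 - prior_yes
    for p in lik_yes:
        score_yes *= p
    for p in lik_no:
        score_no *= p
    return score_yes / (score_yes + score_no)

# Invented likelihoods for two informative attributes
lik_yes = [0.7, 0.4]
lik_no = [0.2, 0.5]
without_extra = posterior_yes(0.45, lik_yes, lik_no)

# Add an attribute whose likelihood (0.75) is identical under both classes
with_extra = posterior_yes(0.45, lik_yes + [0.75], lik_no + [0.75])
print(without_extra, with_extra)
```

The common factor multiplies both class scores and cancels in the normalisation, so the posterior is the same with or without the extra attribute.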

### Load "car.csv" dataset. 

#### Q3: Apply Naive Bayes classifier on this dataset

In [206]:
# Load the data
df  =pd.read_csv('car.csv')
df.columns = ["price","maintenance_cost","doors","person_capacity","luggage_boot_size","saftey","class"]
# shuffle the DataFrame rows 
df = df.sample(frac = 1) 
df

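|  | price | maintenance_cost | doors | person_capacity | luggage_boot_size | saftey | class |
|---|---|---|---|---|---|---|---|
| 200 | vhigh | high | 6 | 4 | med | low | unacc |
| 1197 | med | low | 2 | 4 | small | med | acc |
| 962 | med | vhigh | 6 | 6 | small | low | unacc |
| 790 | high | low | 3 | 2 | big | high | unacc |
| 1328 | low | vhigh | 3 | 2 | big | low | unacc |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 218 | vhigh | med | 2 | 2 | med | low | unacc |
| 1495 | low | high | 6 | 4 | small | high | acc |
| 1405 | low | high | 2 | 2 | small | high | unacc |
| 1654 | low | low | 3 | 2 | big | high | unacc |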


In [207]:
# Preprocess and transform data using "fit_transform(attribute)" function
labels = preprocessing.LabelEncoder()
df.price = labels.fit_transform(df.price)
df.maintenance_cost = labels.fit_transform(df.maintenance_cost)
df.doors= labels.fit_transform(df.doors)
df.person_capacity = labels.fit_transform(df.person_capacity)
df.luggage_boot_size = labels.fit_transform(df.luggage_boot_size)
df.saftey = labels.fit_transform(df.saftey)
df.head()

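|  | price | maintenance_cost | doors | person_capacity | luggage_boot_size | saftey | class |
|---|---|---|---|---|---|---|---|
| 200 | 3 | 0 | 3 | 1 | 1 | 1 | unacc |
| 1197 | 2 | 1 | 0 | 1 | 2 | 2 | acc |
| 962 | 2 | 3 | 3 | 2 | 2 | 1 | unacc |
| 790 | 0 | 1 | 1 | 0 | 0 | 0 | unacc |
| 1328 | 1 | 3 | 1 | 0 | 0 | 1 | unacc |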


In [208]:
# Select the independent variables and the target attribute
# Selecting the independent variables
x = df.iloc[:, :-1].values #independent
# selecting only the target labeled column
y = df.iloc[:, -1].values #target

In [209]:
# Apply the classifier
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()

In [210]:
# Divide the dataset into training and testing partition
from sklearn.model_selection import train_test_split
x_train, x_test , y_train, y_test = train_test_split(x,y,test_size = 0.30,random_state=42)

# predictions for testing partition
y_pred = model.fit(x_train, y_train).predict(x_test)

In [211]:
# Print Number of mislabeled points
print("mislabeled points : %d" % ( (y_test != y_pred).sum()))

mislabeled points : 199


In [212]:
# Calculate and print confusion matrix and other performance measures

#accuracy score
print(accuracy_score(y_test, y_pred))
#confusion matrix
print(confusion_matrix(y_test, y_pred))

0.6165703275529865
[[  9   0  48  54]
 [  4   0   5  11]
 [ 13   0 293  64]
 [  0   0   0  18]]
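
With four classes, the overall accuracy hides how unevenly the classes are handled. The sketch below recomputes the accuracy from the matrix printed above and adds per-class recall and the number of predictions per class. Classes are referred to by row and column index only, since the output does not show which label each row corresponds to.

```python
# Confusion matrix printed above: rows are true classes, columns are predictions
cm = [[9, 0, 48, 54],
      [4, 0, 5, 11],
      [13, 0, 293, 64],
      [0, 0, 0, 18]]

n_classes = len(cm)
total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(n_classes)) / total

# Recall: share of each true class (row) that was predicted correctly
recall = [cm[i][i] / sum(cm[i]) for i in range(n_classes)]

# Column sums: how many test rows were assigned to each class
predicted_counts = [
    sum(cm[r][c] for r in range(n_classes)) for c in range(n_classes)
]
print(total, accuracy, recall, predicted_counts)
```

The second class is never predicted (its column sum is 0), and the first class is recovered in only 9 of 111 cases, while the third class accounts for most of the correct predictions.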


#### Q4: Find the correlation between the attributes of the dataset.

In [213]:
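# Find the pairwise correlation of attributes and arrange in ascending order
# numeric_only=True skips the string-valued 'class' column, which pandas 2.0
# and later would otherwise fail to convert
c = df.corr(numeric_only=True).abs()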
s = c.unstack()
so = s.sort_values(kind='quicksort')
print(so)

luggage_boot_size  saftey               1.928308e-18
saftey             luggage_boot_size    1.928308e-18
doors              saftey               2.535047e-18
saftey             doors                2.535047e-18
price              saftey               3.520899e-18
saftey             price                3.520899e-18
                   person_capacity      8.484556e-18
person_capacity    saftey               8.484556e-18
maintenance_cost   saftey               1.323858e-17
saftey             maintenance_cost     1.323858e-17
luggage_boot_size  person_capacity      8.693132e-04
person_capacity    luggage_boot_size    8.693132e-04
maintenance_cost   person_capacity      9.523677e-04
person_capacity    maintenance_cost     9.523677e-04
price              luggage_boot_size    9.523677e-04
luggage_boot_size  price                9.523677e-04
maintenance_cost   luggage_boot_size    9.523677e-04
luggage_boot_size  maintenance_cost     9.523677e-04
                   doors                9.5236
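
The unstacked listing above contains every pair twice, once in each order, and also each attribute paired with itself. One possible way to list each pair once is to keep only the part of the matrix above the diagonal, sketched here on a small made-up frame (`demo` and its columns `a`, `b`, `c` are hypothetical).

```python
import numpy as np
import pandas as pd

# Small made-up numeric frame standing in for the encoded dataset
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 13],   # nearly proportional to a
    "c": [3, 1, 2, 3, 1, 2],
})

corr = demo.corr().abs()

# Boolean mask that is True strictly above the diagonal
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Masked-out cells become NaN and are dropped, leaving one entry per pair
pairs = corr.where(upper).stack().dropna().sort_values()
print(pairs)
```

The last entry of `pairs` is then the most strongly correlated pair, which is the natural candidate when one attribute of a correlated pair is to be removed.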

#### Q5: Remove one of the highly correlated attributes and apply Naive Bayes classifier

In [214]:
# Drop highly correlated attribute
#deleting column 
df =df.drop(['doors'], axis=1)
df

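|  | price | maintenance_cost | person_capacity | luggage_boot_size | saftey | class |
|---|---|---|---|---|---|---|
| 200 | 3 | 0 | 1 | 1 | 1 | unacc |
| 1197 | 2 | 1 | 1 | 2 | 2 | acc |
| 962 | 2 | 3 | 2 | 2 | 1 | unacc |
| 790 | 0 | 1 | 0 | 0 | 0 | unacc |
| 1328 | 1 | 3 | 0 | 0 | 1 | unacc |
| ... | ... | ... | ... | ... | ... | ... |
| 218 | 3 | 2 | 0 | 1 | 1 | unacc |
| 1495 | 1 | 0 | 1 | 2 | 0 | acc |
| 1405 | 1 | 0 | 0 | 2 | 0 | unacc |
| 1654 | 1 | 1 | 0 | 0 | 0 | unacc |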


In [215]:
# Apply the classifier
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
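# Divide the dataset into training and testing partition
# NOTE: x and y are still the arrays built in In [208], before 'doors' was dropped,
# so this split still contains the 'doors' column. Rebuild them from the reduced df
# (df.iloc[:, :-1].values and df.iloc[:, -1].values) to train without 'doors'.
from sklearn.model_selection import train_test_split
x_train, x_test , y_train, y_test = train_test_split(x,y,test_size = 0.30,random_state=42)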
# predictions for testing partition
y_pred = model.fit(x_train, y_train).predict(x_test)
# Print Number of mislabeled points
print("mislabeled points : %d" % ( (y_test != y_pred).sum()))

mislabeled points : 199


In [216]:
# Calculate and print confusion matrix and other performance measures
#accuracy score
print(accuracy_score(y_test, y_pred))
#confusion matrix
print(confusion_matrix(y_test, y_pred))

0.6165703275529865
[[  9   0  48  54]
 [  4   0   5  11]
 [ 13   0 293  64]
 [  0   0   0  18]]
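
Note that the Q5 cell passes `x` and `y` to `train_test_split` without reassigning them after the `drop`. Arrays obtained with `.values` are not updated when the dataframe they came from is reduced later, so they have to be extracted again. The sketch below shows this on a made-up four-row frame; `demo`, `x_before` and `x_after` are hypothetical names.

```python
import pandas as pd

# Made-up frame with three feature columns and a target column
demo = pd.DataFrame({
    "price": [3, 2, 2, 0],
    "doors": [3, 0, 3, 1],
    "saftey": [1, 2, 1, 0],
    "class": ["unacc", "acc", "unacc", "unacc"],
})

x_before = demo.iloc[:, :-1].values     # extracted before the drop
demo = demo.drop(["doors"], axis=1)     # returns a new frame; x_before is untouched

x_after = demo.iloc[:, :-1].values      # extracted again from the reduced frame
y_after = demo.iloc[:, -1].values
print(x_before.shape, x_after.shape)
```

Only `x_after` reflects the reduced feature set, so it is the array that would need to be split and passed to the classifier to measure the effect of dropping a column.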


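#### Q6: Write your observation on the performance of the model in Q3 and Q5

The results recorded for Q3 and Q5 are identical: accuracy 0.6166, 199 mislabeled points, and the same confusion matrix. The Q5 cell splits the `x` and `y` arrays that were created before `doors` was dropped, so the classifier was trained on the same features in both runs, and these outputs do not measure the effect of removing the `doors` attribute.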