## Assignment Questions

__Q1. A company conducted a survey of its employees and found that 70% of the employees use the
company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the
probability that an employee is a smoker given that he/she uses the health insurance plan?__

__Ans)__ The question states this probability directly, so Bayes' theorem is not needed.

Let's define the events:
A: Employee is a smoker.
B: Employee uses the health insurance plan.

We are given:
P(B) = 0.7 (probability that an employee uses the health insurance plan)
P(A|B) = 0.4 (probability that an employee is a smoker given that they use the health insurance plan)

We want to find:
P(A|B) (probability that an employee is a smoker given that they use the health insurance plan)

The statement "40% of the employees who use the plan are smokers" is already a conditional probability, so:

P(A|B) = 0.4

As a consistency check, the multiplication rule gives the joint probability P(A ∩ B) = P(A|B) * P(B) = 0.4 * 0.7 = 0.28, and the definition of conditional probability recovers P(A|B) = P(A ∩ B) / P(B) = 0.28 / 0.7 = 0.4.

Bayes' theorem, P(A|B) = (P(B|A) * P(A)) / P(B), would only be required if we were given P(B|A) and P(A) instead. Neither value is provided, and neither is needed here.
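
The 40% figure in the question is itself a conditional probability (smokers among plan users), so the arithmetic can be checked in a few lines. The snippet below is a minimal sketch; the variable names are illustrative and not part of the original assignment.

```python
# Probabilities stated in the question.
p_plan = 0.7                 # P(B): employee uses the health insurance plan
p_smoker_given_plan = 0.4    # P(A|B): smoker among plan users

# Joint probability from the multiplication rule: P(A and B) = P(A|B) * P(B).
p_smoker_and_plan = p_smoker_given_plan * p_plan

# The definition of conditional probability recovers the given value.
recovered = p_smoker_and_plan / p_plan

print(f"P(A and B) = {p_smoker_and_plan:.2f}")
print(f"P(A|B)     = {recovered:.2f}")
```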

__Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?__

__Ans)__ The main difference between Bernoulli Naive Bayes and Multinomial Naive Bayes lies in the type of data they are suitable for and the assumptions they make.

1. Bernoulli Naive Bayes:

- Suitable for binary or boolean features (e.g., presence or absence of a word).
- Like every Naive Bayes variant, assumes the features are conditionally independent given the class label; in addition, it models each feature as a Bernoulli variable, so the absence of a feature also counts as evidence.
- The feature vectors are binary, where each feature can take only two values (0 or 1). Count data must be binarized first, which discards how often a feature occurs.
- Often used in text classification tasks where only the presence or absence of certain words is considered.
- Works well with sparse feature vectors.

2. Multinomial Naive Bayes:

- Suitable for discrete or count-based features (e.g., word frequencies in a document).
- Assumes that the feature counts of each sample are drawn from a multinomial distribution given the class label.
- The feature vectors are typically non-negative integers, representing the counts or frequencies of each feature.
- Commonly used in text classification, document categorization, and spam filtering tasks.
- Uses how many times each feature occurs, so repeated occurrences of a word add evidence, whereas the absence of a word is not modeled explicitly.

__In summary, Bernoulli Naive Bayes is appropriate when dealing with binary features, while Multinomial Naive Bayes is suitable for discrete or count-based features. Choosing between the two depends on the nature of the data and the specific problem at hand.__
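
The distinction can be seen on a small example. The sketch below uses a hypothetical toy count matrix (not the assignment data) and relies on scikit-learn's `BernoulliNB` default of `binarize=0.0`, which turns any positive count into 1 before fitting.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical toy word-count matrix: rows are documents, columns are words.
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 2, 3],
])
y_toy = np.array([0, 0, 1, 1])

# BernoulliNB binarizes its input by default, so fitting on raw counts
# and fitting on presence/absence indicators gives the same model.
bnb_counts = BernoulliNB().fit(X_counts, y_toy)
bnb_binary = BernoulliNB().fit((X_counts > 0).astype(int), y_toy)
same_model = np.allclose(bnb_counts.feature_log_prob_, bnb_binary.feature_log_prob_)
print("BernoulliNB ignores count magnitude:", same_model)

# MultinomialNB uses the counts themselves, so scaling one document's
# counts changes the estimated per-class word probabilities.
mnb = MultinomialNB().fit(X_counts, y_toy)
X_scaled = X_counts.copy()
X_scaled[0] *= 10
mnb_scaled = MultinomialNB().fit(X_scaled, y_toy)
counts_matter = not np.allclose(mnb.feature_log_prob_, mnb_scaled.feature_log_prob_)
print("MultinomialNB depends on count magnitude:", counts_matter)
```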

__Q3. How does Bernoulli Naive Bayes handle missing values?__

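__Ans)__ Bernoulli Naive Bayes has no built-in mechanism for missing values. It assumes that the feature vectors are binary, where each feature can take only two values (0 or 1), and scikit-learn's `BernoulliNB` raises an error if the input contains NaN. Missing values therefore have to be handled before training, typically by imputing them (for example with the most frequent value of the feature), by encoding "missing" as its own binary indicator feature, or by dropping the affected rows or features.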
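
A minimal sketch of one preprocessing option follows. The data are hypothetical, and most-frequent imputation via `SimpleImputer` is only one of several reasonable choices, not a method prescribed by the assignment.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical binary feature matrix with two missing entries.
X_missing = np.array([
    [1.0, 0.0, np.nan],
    [1.0, 1.0, 0.0],
    [0.0, np.nan, 1.0],
    [0.0, 0.0, 1.0],
])
y_toy = np.array([1, 1, 0, 0])

# Fitting directly on NaN values is rejected by input validation.
try:
    BernoulliNB().fit(X_missing, y_toy)
    nan_rejected = False
except ValueError:
    nan_rejected = True
print("NaN rejected:", nan_rejected)

# One option: fill each column with its most frequent value, then fit.
imputer = SimpleImputer(strategy="most_frequent")
X_filled = imputer.fit_transform(X_missing)
model = BernoulliNB().fit(X_filled, y_toy)
print(model.predict(X_filled))
```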

__Q4. Can Gaussian Naive Bayes be used for multi-class classification?__

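__Ans)__
Yes, Gaussian Naive Bayes can be used for multi-class classification, and it does so natively. For each class it estimates a prior probability and a per-feature mean and variance, then computes the posterior for every class and predicts the class with the highest value. No "one-vs-all" (or "one-vs-rest") wrapper is required, because Bayes' rule already yields a score for each of the K classes.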
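
As a quick illustration, the sketch below fits scikit-learn's `GaussianNB` on the three-class Iris dataset bundled with the library. Iris is used purely as a convenient example and is not part of the assignment.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Iris has three classes and ships with scikit-learn (no download needed).
X_iris, y_iris = load_iris(return_X_y=True)

gnb = GaussianNB().fit(X_iris, y_iris)

# One mean vector is estimated per class, and predict_proba
# returns one posterior column per class.
print("classes:", gnb.classes_)
print("per-class means shape:", gnb.theta_.shape)
proba = gnb.predict_proba(X_iris[:5])
print("posterior shape:", proba.shape)
```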

__Q5. Assignment:__

Data preparation:
Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase).
This dataset contains email messages, where the goal is to predict whether a message
is spam or not based on several input features.


Implementation:
Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the
dataset. You should use the default hyperparameters for each classifier.

Results:
Report the following performance metrics for each classifier:
Accuracy
Precision
Recall
F1 score

Discussion:
Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
the case? Are there any limitations of Naive Bayes that you observed?

Conclusion:
Summarise your findings and provide some suggestions for future work.


Note: This dataset contains a binary classification problem with multiple features. The dataset is
relatively small, but it can be used to demonstrate the performance of the different variants of Naive
Bayes on a real-world problem.

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # for visualization
import plotly.express as px # for visualization
import matplotlib.pyplot as plt # for visualization
%matplotlib inline

# To display all the columns of dataframe
pd.set_option('display.max_columns', 500)
import warnings
warnings.filterwarnings("ignore")

In [2]:
df=pd.read_csv('emails.csv')

In [3]:
df.head()

Unnamed: 0,Email No.,the,to,ect,and,for,of,a,you,hou,in,on,is,this,enron,i,be,that,will,have,with,your,at,we,s,are,it,by,com,as,from,gas,or,not,me,deal,if,meter,hpl,please,re,e,any,our,corp,can,d,all,has,was,know,need,an,forwarded,new,t,may,up,j,mmbtu,should,do,am,get,out,see,no,there,price,daren,but,been,company,l,these,let,so,would,m,into,xls,farmer,attached,us,information,they,message,day,time,my,one,what,only,http,th,volume,mail,contract,which,month,more,robert,sitara,about,texas,nom,energy,pec,questions,www,deals,volumes,pm,ena,now,their,file,some,email,just,also,call,change,other,here,like,b,flow,net,following,p,production,when,over,back,want,original,them,below,o,ticket,c,he,could,make,inc,report,march,contact,were,days,list,nomination,system,who,april,number,sale,don,its,first,thanks,business,help,per,through,july,forward,font,free,daily,use,order,today,r,had,fw,set,plant,statements,go,gary,oil,line,sales,w,effective,well,tenaska,take,june,x,within,nbsp,she,how,north,america,being,under,next,week,than,january,last,two,service,purchase,name,less,height,off,agreement,k,work,tap,group,year,based,transport,after,think,made,each,available,changes,due,f,h,services,smith,send,management,stock,sent,ll,co,office,needs,cotten,did,actuals,u,money,before,looking,then,pills,online,request,...,square,danny,gepl,hydrocarbon,alpine,christmas,muscle,souza,relating,begins,ecf,forth,answers,audit,approve,lunch,types,starts,difficult,le,lasts,series,till,edge,growing,covered,shipper,sometime,republic,filter,sooner,increasing,nelson,percentage,returned,pop,interface,kin,experienced,prime,merger,obtain,ryan,servers,attachments,achieve,effects,gov,examples,procedure,explore,caribbean,rally,amounts,comfort,attempt,greatly,amelia,engel,delay,fare,der,cove,filing,fletcher,leth,undervalued,cents,esther,hlavaty,reid,lls,troy,palmer,metals,las,carter,luis,migration,brief,hess,therein,ur,pond,joanne,community,tglo,eogi,ml,wysak,felipe,errors,affect,convenient,minimal,boost,incremental,de
cide,reserve,superior,kerr,willing,quite,wild,unlimited,sans,mother,computers,unfortunately,ordered,satisfaction,priority,traded,testing,portal,ward,lets,aren,knows,refer,shot,fda,tue,saying,cancel,forecast,cousino,bass,permanent,phones,technical,whose,objective,cards,distributed,learning,fire,drill,towards,forget,explosion,gloria,formula,redelivery,audio,visual,encoding,approach,doubt,staffing,excite,corel,tm,enronavailso,contacting,alland,heavy,economic,nigeria,milwaukee,phillip,curve,returns,padre,kathy,buttons,sir,vary,sounds,disclose,authority,flw,straight,worldnet,beemer,ooo,defs,thorough,officers,flight,prefer,awesome,macintosh,feet,constitutes,formosa,porn,armstrong,driscoll,watches,newsietter,twenty,tommy,fields,method,setup,allocating,initially,missed,clarification,especially,dorcheus,del,millions,insurance,pooling,trial,tennessee,ellis,direction,bold,catch,performing,accepted,matters,batch,continuing,winning,symbol,offsystem,decisions,produced,ended,greatest,degree,solmonson,imbalances,fall,fear,hate,fight,reallocated,debt,reform,australia,plain,prompt,remains,ifhsc,enhancements,connevey,jay,valued,lay,infrastructure,military,allowing,ff,dry,Prediction
0,Email 1,0,0,1,0,0,0,2,0,0,0,0,1,0,0,2,0,0,0,0,0,0,0,0,3,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,4,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Email 2,8,13,24,6,6,2,102,1,27,18,21,13,0,1,61,4,2,0,0,2,0,12,9,95,4,3,3,3,12,3,1,21,1,12,0,1,0,0,2,15,141,0,21,1,1,39,1,0,0,0,0,14,3,0,91,0,2,8,0,0,0,7,2,1,1,5,0,0,3,0,0,0,35,0,0,7,1,46,0,0,3,1,2,0,0,0,2,2,1,3,0,0,0,15,0,1,0,0,0,0,0,0,1,0,2,0,2,0,0,0,0,0,0,1,0,0,2,0,0,0,1,0,0,0,0,19,0,0,1,17,3,0,0,0,0,0,0,0,131,0,68,13,0,0,5,0,0,1,0,0,0,2,0,0,0,2,0,0,0,0,1,0,0,0,0,0,3,0,0,0,0,0,0,122,0,0,1,0,0,1,2,0,0,0,23,1,2,0,0,0,6,0,0,1,1,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,1,0,0,0,0,1,0,0,0,25,57,0,0,0,0,0,0,5,8,0,0,0,0,0,53,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,24,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,3,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,Email 3,0,0,1,0,0,0,8,0,0,4,2,0,0,0,8,0,0,0,0,0,0,2,0,2,0,0,0,0,2,0,2,0,0,0,0,0,0,0,0,0,3,0,0,0,0,3,0,0,0,0,0,0,0,0,3,0,0,1,0,0,1,0,0,0,0,2,0,0,0,0,0,0,4,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,2,0,0,0,0,0,0,0,0,5,0,4,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Email 4,0,5,22,0,5,1,51,2,10,1,5,9,2,0,16,2,0,0,1,1,0,2,1,36,3,1,2,0,2,3,0,10,2,5,2,0,1,0,0,10,79,0,0,0,1,21,0,2,0,0,0,3,2,0,49,0,0,5,0,0,1,9,0,0,1,3,0,0,2,0,1,0,27,0,0,1,0,24,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,4,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,13,0,0,0,8,0,0,0,0,0,0,0,1,48,0,50,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,2,0,0,0,0,0,0,46,0,0,0,0,0,0,0,0,0,0,11,0,0,0,1,0,1,0,0,3,5,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,7,1,0,0,0,0,0,0,0,0,0,0,0,0,11,29,0,0,0,0,0,0,6,1,0,0,0,1,0,28,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,7,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Email 5,7,6,17,1,5,2,57,0,9,3,12,2,2,0,30,8,0,0,2,0,0,7,0,19,2,4,2,0,4,1,2,6,0,6,0,0,3,0,1,10,71,0,0,0,1,11,8,0,1,0,0,9,2,0,63,0,0,3,0,1,0,1,1,0,0,9,3,0,1,0,1,0,34,1,0,0,0,30,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,12,3,0,4,0,0,0,0,0,0,0,1,0,0,0,0,0,1,4,0,0,1,0,0,0,0,0,0,0,0,3,0,14,0,0,0,9,0,0,0,0,0,0,0,0,58,0,37,7,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,2,0,0,0,0,0,0,41,0,0,0,0,0,0,0,0,0,0,5,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,6,0,0,0,0,0,0,0,0,0,0,0,0,0,10,28,0,2,0,0,0,0,8,4,0,0,0,0,0,26,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0


In [4]:
df.columns

Index(['Email No.', 'the', 'to', 'ect', 'and', 'for', 'of', 'a', 'you', 'hou',
       ...
       'connevey', 'jay', 'valued', 'lay', 'infrastructure', 'military',
       'allowing', 'ff', 'dry', 'Prediction'],
      dtype='object', length=3002)

In [5]:
print('Checking for Null values in the dataframe:','\n',df.isnull().sum(),'\n')

Checking for Null values in the dataframe: 
 Email No.     0
the           0
to            0
ect           0
and           0
             ..
military      0
allowing      0
ff            0
dry           0
Prediction    0
Length: 3002, dtype: int64 



In [6]:
# how many total missing values do we have?
total_cells = np.prod(df.shape)  # np.product is deprecated and was removed in NumPy 2.0
total_missing = df.isnull().sum().sum()

# percent of data that is missing
percent_missing = (total_missing/total_cells) * 100
print(percent_missing)

0.0


In [7]:
# Suppress warnings before training the Naive Bayes models
import warnings
warnings.filterwarnings('ignore')

In [8]:
df['Email No.']

0          Email 1
1          Email 2
2          Email 3
3          Email 4
4          Email 5
           ...    
5167    Email 5168
5168    Email 5169
5169    Email 5170
5170    Email 5171
5171    Email 5172
Name: Email No., Length: 5172, dtype: object

In [9]:
from sklearn.preprocessing import LabelEncoder

In [11]:
encoder = LabelEncoder()
df['Email_Encoded'] = encoder.fit_transform(df['Email No.'])

In [12]:
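# Segregate the independent variables (X) and the dependent variable (y).
# Note: 'Email_Encoded' (a label-encoded row identifier) is not dropped here,
# so it remains in X as a feature alongside the word counts.
X = df.drop(columns = ['Prediction','Email No.'])
y = df['Prediction']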

In [13]:
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

# Bernoulli Naive Bayes classifier
bernoulli_nb = BernoulliNB()
bernoulli_scores = cross_validate(bernoulli_nb, X, y, cv=10, scoring=['accuracy', 'precision_macro', 'recall_macro', 'f1_macro'])
print("Bernoulli Naive Bayes:")
print("Accuracy:", bernoulli_scores['test_accuracy'].mean())
print("Precision:", bernoulli_scores['test_precision_macro'].mean())
print("Recall:", bernoulli_scores['test_recall_macro'].mean())
print("F1 Score:", bernoulli_scores['test_f1_macro'].mean())
print()

Bernoulli Naive Bayes:
Accuracy: 0.8696832781939164
Precision: 0.8518062894474481
Recall: 0.8336952375310982
F1 Score: 0.8395844017744422



In [18]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Bernoulli Naive Bayes classifier
bernoulli_nb = BernoulliNB()
bernoulli_nb.fit(X_train, y_train)  # Fit the classifier with the training data

# Confusion matrix for Bernoulli Naive Bayes
bernoulli_pred = bernoulli_nb.predict(X_test)
bernoulli_cm = confusion_matrix(y_test, bernoulli_pred)
print("Confusion Matrix - Bernoulli Naive Bayes:")
print(bernoulli_cm)
print()

Confusion Matrix - Bernoulli Naive Bayes:
[[695  44]
 [ 63 233]]
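
The per-class metrics can be read off this confusion matrix directly. The sketch below copies the matrix from the output above and assumes that class 1 denotes spam, which the notebook does not state explicitly. These are single-split, positive-class figures, so they are not expected to match the macro-averaged cross-validation scores.

```python
import numpy as np

# Confusion matrix copied from the Bernoulli Naive Bayes output
# (rows: true class, columns: predicted class; class 1 assumed to be spam).
cm = np.array([[695, 44],
               [63, 233]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # share of all emails classified correctly
precision = tp / (tp + fp)           # share of predicted spam that is spam
recall = tp / (tp + fn)              # share of actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```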



In [23]:
# Multinomial Naive Bayes classifier
multinomial_nb = MultinomialNB()
multinomial_scores = cross_validate(multinomial_nb, X, y, cv=10, scoring=['accuracy', 'precision_macro', 'recall_macro', 'f1_macro'])
print("Multinomial Naive Bayes:")
print("Accuracy:", multinomial_scores['test_accuracy'].mean())
print("Precision:", multinomial_scores['test_precision_macro'].mean())
print("Recall:", multinomial_scores['test_recall_macro'].mean())
print("F1 Score:", multinomial_scores['test_f1_macro'].mean())
print()

Multinomial Naive Bayes:
Accuracy: 0.7447865245737587
Precision: 0.707219074228888
Recall: 0.6615416814753386
F1 Score: 0.6598958496231238



In [24]:
# Multinomial Naive Bayes classifier
multinomial_nb = MultinomialNB()
multinomial_nb.fit(X_train, y_train)  # Fit the classifier with the training data

# Confusion matrix for Multinomial Naive Bayes
multinomial_pred = multinomial_nb.predict(X_test)
multinomial_cm = confusion_matrix(y_test, multinomial_pred)
print("Confusion Matrix - Multinomial Naive Bayes:")
print(multinomial_cm)
print()

Confusion Matrix - Multinomial Naive Bayes:
[[660  79]
 [140 156]]



In [15]:
# Gaussian Naive Bayes classifier
gaussian_nb = GaussianNB()
gaussian_scores = cross_validate(gaussian_nb, X, y, cv=10, scoring=['accuracy', 'precision_macro', 'recall_macro', 'f1_macro'])
print("Gaussian Naive Bayes:")
print("Accuracy:", gaussian_scores['test_accuracy'].mean())
print("Precision:", gaussian_scores['test_precision_macro'].mean())
print("Recall:", gaussian_scores['test_recall_macro'].mean())
print("F1 Score:", gaussian_scores['test_f1_macro'].mean())

Gaussian Naive Bayes:
Accuracy: 0.9601693016586633
Precision: 0.9464521384830574
Recall: 0.9595299283260277
F1 Score: 0.9523692492047605


In [25]:
gaussian_nb = GaussianNB()
gaussian_nb.fit(X_train, y_train)  # Fit the classifier with the training data

gaussian_pred = gaussian_nb.predict(X_test)
gaussian_cm = confusion_matrix(y_test, gaussian_pred)
print("Confusion Matrix - Gaussian Naive Bayes:")
print(gaussian_cm)

Confusion Matrix - Gaussian Naive Bayes:
[[711  28]
 [  4 292]]
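
The three cross-validation cells repeat the same steps, so they could be factored into one helper. The following is a minimal sketch under stated assumptions: the `evaluate` function, the `SCORING` list, and the synthetic Poisson count data are all hypothetical stand-ins, and the numbers it prints are unrelated to the results reported for `emails.csv`.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Synthetic stand-in for word counts (hypothetical data, not emails.csv).
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
rates = np.ones((200, 20))
rates[y_demo == 1, :10] = 3.0        # class 1 uses the first 10 "words" more often
X_demo = rng.poisson(rates)

SCORING = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]

def evaluate(model, X, y, cv=10):
    """Return the mean of each cross-validated metric for one classifier."""
    scores = cross_validate(model, X, y, cv=cv, scoring=SCORING)
    return {name: scores[f"test_{name}"].mean() for name in SCORING}

# Evaluate all three variants with default hyperparameters.
results = {
    type(model).__name__: evaluate(model, X_demo, y_demo)
    for model in (BernoulliNB(), MultinomialNB(), GaussianNB())
}
for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Keeping the scoring list in one place makes it harder for the three evaluations to drift apart, for example by averaging precision differently for one classifier.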


Based on the performance metrics obtained, __the Gaussian Naive Bayes variant performed the best among the three classifiers. It achieved the highest accuracy, precision, recall, and F1 score.__

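The superior performance of Gaussian Naive Bayes can be attributed to how well its model fits this data: it describes each feature, per class, with a normal distribution defined by a mean and a variance. If the features follow a continuous, approximately normal distribution, Gaussian Naive Bayes can model the data effectively and make accurate predictions. The features here are word counts rather than truly continuous values, so this explanation is a hypothesis rather than a verified cause; what the results do show is that the per-class means and variances separated the two classes well (about 96% cross-validated accuracy).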

Bernoulli Naive Bayes came second, with about 87% accuracy and a macro F1 score of about 0.84, roughly nine percentage points of accuracy below Gaussian Naive Bayes. It assumes binary features and works well when dealing with binary data; applied to word counts, it only uses whether a word is present and discards how often it occurs.

Multinomial Naive Bayes, which is suitable for discrete features, had the lowest performance among the three classifiers (about 74% accuracy and a macro F1 score of about 0.66). This could be due to a mismatch between the features actually passed to the classifier and its multinomial assumption. One possible contributor, not tested here, is the `Email_Encoded` column: it is a label-encoded identifier that was left in X, and its values are far larger than typical word counts.

Limitations of Naive Bayes that can be observed from the results include its assumption of feature independence, which may not hold true in all cases. Word occurrences in emails are often correlated, and the classifier cannot capture those dependencies, which can lead to suboptimal performance. Additionally, Naive Bayes can be biased toward the majority class when the class distribution is imbalanced; in the held-out split used for the confusion matrices, 739 emails belong to class 0 and 296 to class 1.

__In conclusion, the Gaussian Naive Bayes variant performed the best in terms of accuracy, precision, recall, and F1 score. However, it is important to consider the characteristics of the dataset and the assumptions made by each variant of Naive Bayes when selecting the most suitable classifier. For future work, exploring feature engineering techniques, addressing feature dependencies, and handling imbalanced datasets could potentially enhance the performance of Naive Bayes classifiers. Additionally, comparing Naive Bayes with other classification algorithms could provide further insights into the model selection process.__