Q1. A company conducted a survey of its employees and found that 70% of the employees use the
company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the
probability that an employee is a smoker given that he/she uses the health insurance plan?

In this problem, we want to find the probability of an employee being a smoker given that he/she uses the health insurance plan. The problem statement gives this quantity directly. Let's define the events:

A = employee is a smoker
B = employee uses the health insurance plan

Using the information given in the problem, we have:

P(B) = 0.7 (70% of employees use the health insurance plan)
P(A|B) = 0.4 (40% of the employees who use the plan are smokers)

The 40% figure is measured among plan users only, so it is P(A|B), not P(B|A). No extra information, such as the overall share of smokers P(A), is needed, and Bayes' theorem is not required.

As a check, the definition of conditional probability gives the same value:

P(A and B) = P(A|B) * P(B) = 0.4 * 0.7 = 0.28
P(A|B) = P(A and B) / P(B) = 0.28 / 0.7 = 0.4

Therefore, the probability that an employee is a smoker given that he/she uses the health insurance plan is 0.4, or 40%.
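
The conditional probability asked for in Q1 can also be worked out by counting employees. The sketch below assumes a hypothetical workforce of 1,000 employees (the headcount and all variable names are illustrative; only the 70% and 40% figures come from the question) and applies the definition P(A|B) = P(A and B) / P(B).

```python
from fractions import Fraction

# Hypothetical headcount, chosen only so that the counts are whole numbers.
total_employees = 1000

# Figures taken from the question.
share_using_plan = Fraction(70, 100)            # P(B)
share_smokers_among_users = Fraction(40, 100)   # share of plan users who smoke

plan_users = total_employees * share_using_plan               # employees in B
smoking_plan_users = plan_users * share_smokers_among_users   # employees in A and B

p_b = plan_users / total_employees
p_a_and_b = smoking_plan_users / total_employees
p_a_given_b = p_a_and_b / p_b   # definition of conditional probability

print(f"P(B)       = {float(p_b)}")
print(f"P(A and B) = {float(p_a_and_b)}")
print(f"P(A|B)     = {float(p_a_given_b)}")
```

Exact fractions are used instead of floats so that the division does not introduce rounding noise.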

Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?


Bernoulli Naive Bayes and Multinomial Naive Bayes are two commonly used variants of the Naive Bayes classification algorithm.

The main difference between them lies in the type of data they are best suited for.

Bernoulli Naive Bayes is typically used for binary data, where each feature can take on only one of two possible values, typically 0 or 1. It is particularly useful for text classification, where the presence or absence of a particular word in a document can be represented as a binary feature.

On the other hand, Multinomial Naive Bayes is typically used for discrete count data, where each feature represents the count of a particular word or feature in a given document. This makes it a good choice for text classification tasks where the frequency of occurrence of words matters more than just their presence or absence.
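
To make the difference concrete, the sketch below turns one short document into the two feature representations: word counts (the kind of input Multinomial Naive Bayes expects) and presence/absence flags (the kind Bernoulli Naive Bayes expects). The vocabulary, the document, and the helper names are made up for illustration.

```python
from collections import Counter


def count_features(tokens, vocabulary):
    """Return how many times each vocabulary word occurs in the document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def binary_features(tokens, vocabulary):
    """Return 1 if a vocabulary word occurs at least once, otherwise 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


# Illustrative vocabulary and document (not taken from any dataset).
vocabulary = ["free", "meeting", "money", "offer"]
document = "free money free offer free".split()

count_vector = count_features(document, vocabulary)
binary_vector = binary_features(document, vocabulary)

print(count_vector)   # [3, 0, 1, 1]
print(binary_vector)  # [1, 0, 1, 1]
```

The word "free" contributes 3 to the count vector but only 1 to the binary vector, which is exactly the information Bernoulli Naive Bayes discards.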

Q3. How does Bernoulli Naive Bayes handle missing values?


Bernoulli Naive Bayes assumes that the input data is binary, with each feature taking one of two possible values, typically 0 or 1. The model has no notion of a missing entry, so missing values have to be handled before training, and we need to decide how to represent them.

One approach is to impute a default value, such as 0 (feature absent) or the most frequent value of that feature. An imputed entry is then treated exactly like an observed one, so any information carried by the fact that the value was missing is discarded. This works well when values are missing at random across the data; when missingness is related to the class, the imputed values can bias the estimated feature probabilities.
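
A minimal sketch of this default-value approach is shown below. It assumes missing entries are encoded as `None`; the function name `impute_binary` and its two strategies are hypothetical helpers written for this example, not part of any library.

```python
def impute_binary(rows, strategy="zero"):
    """Fill None entries in rows of binary features.

    strategy="zero"          -> every missing entry becomes 0
    strategy="most_frequent" -> a missing entry becomes the most common
                                observed value in its column (ties give 0)
    """
    if strategy not in ("zero", "most_frequent"):
        raise ValueError(f"unknown strategy: {strategy}")

    n_features = len(rows[0])
    fill_values = []
    for j in range(n_features):
        observed = [row[j] for row in rows if row[j] is not None]
        if strategy == "most_frequent" and observed:
            ones = sum(observed)
            zeros = len(observed) - ones
            fill_values.append(1 if ones > zeros else 0)
        else:
            fill_values.append(0)

    return [
        [fill_values[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]


# Toy data with one missing entry per row.
rows = [
    [1, None, 0],
    [1, 1, None],
    [None, 1, 0],
]

filled_zero = impute_binary(rows, strategy="zero")
filled_mode = impute_binary(rows, strategy="most_frequent")

print(filled_zero)  # [[1, 0, 0], [1, 1, 0], [0, 1, 0]]
print(filled_mode)  # [[1, 1, 0], [1, 1, 0], [1, 1, 0]]
```

The two strategies give different matrices for the same data, which is why the choice of default value should be made deliberately rather than left implicit.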

Q4. Can Gaussian Naive Bayes be used for multi-class classification?

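Yes. Gaussian Naive Bayes handles multi-class classification directly. For each class it models every feature as an independent univariate Gaussian (equivalently, a multivariate Gaussian with a diagonal covariance matrix), estimates a prior probability for the class, and uses Bayes' theorem to compute the posterior probability of every class. The predicted label is the class with the highest posterior.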
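
The sketch below illustrates the idea on a toy three-class problem using only the standard library. It is a simplified illustration rather than scikit-learn's implementation: the function names, the toy data, and the fixed `min_variance` floor are assumptions made for this example.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, min_variance=1e-9):
    """Estimate a prior plus a per-feature mean and variance for every class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # min_variance keeps the variance strictly positive (illustrative choice).
        variances = [
            sum((v - m) ** 2 for v in col) / n + min_variance
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log-posterior for one sample."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        # log prior + sum of per-feature Gaussian log-likelihoods
        score = math.log(prior)
        for value, mean, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data: three well-separated classes, two features each.
X = [[1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
     [5.0, 5.2], [5.1, 4.9], [4.9, 5.0],
     [9.0, 1.0], [9.2, 1.1], [8.9, 0.8]]
y = ["a"] * 3 + ["b"] * 3 + ["c"] * 3

model = fit_gaussian_nb(X, y)
predictions = [predict_gaussian_nb(model, p) for p in [[1.0, 1.0], [5.0, 5.0], [9.0, 1.0]]]
print(predictions)  # ['a', 'b', 'c']
```

Nothing in the prediction loop depends on the number of classes: adding a class only adds one more entry to `model`.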

Q5. Assignment:
Data preparation:
Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is to predict whether a message
is spam or not based on several input features.

Implementation:
Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the
dataset. You should use the default hyperparameters for each classifier.

Results:
Report the following performance metrics for each classifier:
Accuracy
Precision
Recall
F1 score
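
For reference, the four metrics can be computed from the confusion counts as sketched below. The helper `binary_metrics` and the toy labels are assumptions made for this example; the sketch also shows that passing the arguments in the wrong order swaps precision and recall.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels: 4 true positives in the data, 3 positive predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

metrics = binary_metrics(y_true, y_pred)
swapped = binary_metrics(y_pred, y_true)  # arguments in the wrong order

print(metrics)
print(swapped)
```

Accuracy and F1 are unchanged by the swap, but precision and recall trade places, so the argument order matters whenever those two are reported separately.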

Discussion:
Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
the case? Are there any limitations of Naive Bayes that you observed?

Conclusion:
Summarise your findings and provide some suggestions for future work.
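
One possible way to obtain the four requested metrics under 10-fold cross-validation is scikit-learn's `cross_validate`, sketched below with default hyperparameters for all three classifiers. The function name `evaluate_nb_variants` is hypothetical, and the demo runs on synthetic non-negative count data generated on the spot, which is an assumption made so the block is self-contained; it is not the Spambase data, and its scores say nothing about Spambase.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB


def evaluate_nb_variants(X, y, n_splits=10, seed=0):
    """Return the mean cross-validated accuracy, precision, recall and F1 per model."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scoring = ["accuracy", "precision", "recall", "f1"]
    models = {
        "GaussianNB": GaussianNB(),
        "MultinomialNB": MultinomialNB(),
        "BernoulliNB": BernoulliNB(),
    }
    results = {}
    for name, model in models.items():
        scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
        results[name] = {m: float(scores[f"test_{m}"].mean()) for m in scoring}
    return results


# Synthetic stand-in data (assumption): Poisson counts whose rate depends on the label.
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=400)
rates = np.where(y_demo[:, None] == 1, 3.0, 1.0)
X_demo = rng.poisson(lam=rates, size=(400, 5)).astype(float)

demo_results = evaluate_nb_variants(X_demo, y_demo)
for name, metrics in demo_results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

To apply the same function to Spambase, pass the feature matrix and label column loaded from the downloaded file in place of `X_demo` and `y_demo`. Features must be non-negative for `MultinomialNB`.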

In [32]:
# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV

# Load the dataset: every column except the last is a feature, the last is the spam label
df = pd.read_csv('spambase.data.csv')
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Each grid below holds a single candidate (the default hyperparameters), so
# GridSearchCV only runs 10-fold cross-validation on the training split and
# then refits the model; a separate fit() call beforehand is not needed.
# force_alpha is left out of the grids because its default differs between
# scikit-learn releases.
# Note: scikit-learn's metric functions take (y_true, y_pred). The calls below
# pass the predictions first, so in each printed report the precision and
# recall columns are interchanged and 'support' counts predicted labels.

# Gaussian Naive Bayes
gnb1 = GaussianNB()
parameters1 = {'priors': [None], 'var_smoothing': [1e-09]}
grid1 = GridSearchCV(gnb1, param_grid=parameters1, cv=10)
grid1.fit(X_train, y_train)
y_pred1 = grid1.predict(X_test)
print(accuracy_score(y_pred1, y_test))
print(classification_report(y_pred1, y_test))

# Multinomial Naive Bayes
gnb2 = MultinomialNB()
parameters2 = {'alpha': [1.0], 'fit_prior': [True], 'class_prior': [None]}
grid2 = GridSearchCV(gnb2, param_grid=parameters2, cv=10)
grid2.fit(X_train, y_train)
y_pred2 = grid2.predict(X_test)
print(accuracy_score(y_pred2, y_test))
print(classification_report(y_pred2, y_test))

# Bernoulli Naive Bayes (features are binarized at 0.0: present vs. absent)
gnb3 = BernoulliNB()
parameters3 = {'alpha': [1.0], 'binarize': [0.0], 'fit_prior': [True], 'class_prior': [None]}
grid3 = GridSearchCV(gnb3, param_grid=parameters3, cv=10)
grid3.fit(X_train, y_train)
y_pred3 = grid3.predict(X_test)
print(accuracy_score(y_pred3, y_test))
print(classification_report(y_pred3, y_test))

0.8282608695652174
              precision    recall  f1-score   support

           0       0.73      0.98      0.84       615
           1       0.97      0.71      0.82       765

    accuracy                           0.83      1380
   macro avg       0.85      0.84      0.83      1380
weighted avg       0.86      0.83      0.83      1380

0.8050724637681159
              precision    recall  f1-score   support

           0       0.83      0.84      0.84       815
           1       0.77      0.76      0.76       565

    accuracy                           0.81      1380
   macro avg       0.80      0.80      0.80      1380
weighted avg       0.80      0.81      0.80      1380

0.8797101449275362
              precision    recall  f1-score   support

           0       0.94      0.87      0.90       884
           1       0.80      0.90      0.84       496

    accuracy                           0.88      1380
   macro avg       0.87      0.88      0.87      1380
weighted avg     