## Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?


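**40%**

The question states this conditional probability directly: 40% of the employees who use the plan are smokers, so P(smoker | uses plan) = 0.40. The 70% figure is the marginal probability of using the plan and is not needed for the answer.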
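The arithmetic can be checked with a short sketch. The variable names are illustrative, and the joint probability is derived from the two figures in the question rather than given in it.

```python
from fractions import Fraction

# Figures given in the question
p_plan = Fraction(70, 100)               # P(uses plan)
p_smoker_given_plan = Fraction(40, 100)  # P(smoker | uses plan), stated directly

# Joint probability implied by the two figures
p_smoker_and_plan = p_smoker_given_plan * p_plan

# Recover the conditional probability from its definition:
# P(smoker | plan) = P(smoker and plan) / P(plan)
recovered = p_smoker_and_plan / p_plan

print(float(p_smoker_and_plan))  # 0.28
print(float(recovered))          # 0.4
```

Exact fractions are used here only to avoid floating-point rounding in the check.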


## Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

Bernoulli Naive Bayes is used when the independent features follow a Bernoulli distribution, meaning the features are binary: they can take only two possible values, such as yes or no, pass or fail, male or female, etc. In this model, each feature is either present (1) or absent (0). Therefore, to use the Bernoulli Naive Bayes algorithm, the independent features must be binary in nature (or be binarized first). Under the Bernoulli distribution, within a given class a feature takes the value 1 with probability p and the value 0 with probability (1 - p); the model estimates one such p for each feature in each class.

Multinomial Naive Bayes, on the other hand, is used when the independent features represent discrete counts, such as the frequency of words in text data. This model is commonly used in Natural Language Processing (NLP) tasks. A typical use case is spam email classification, where the input features are the number of times specific words occur in an email (i.e., term frequency). Unlike Bernoulli, the Multinomial model considers how often a word occurs rather than just whether it is present or not.

## Q3. How does Bernoulli Naive Bayes handle missing values?


In [1]:
import numpy as np
import pandas as pd

In [6]:
# np.nan marks the missing entries (the np.NaN alias was removed in NumPy 2.0)
data={
    'A':[0, 0, 1, 1, np.nan],
    'B':[1, np.nan, 0, 1, 0],
    'target':[1, 0, 1, 0, 1]
}

In [7]:
df=pd.DataFrame(data)
df

|   | A | B | target |
|---|---|---|---|
| 0 | 0.0 | 1.0 | 1 |
| 1 | 0.0 | NaN | 0 |
| 2 | 1.0 | 0.0 | 1 |
| 3 | 1.0 | 1.0 | 0 |
| 4 | NaN | 0.0 | 1 |


In [8]:
X=df.iloc[: , :-1]
y=df.iloc[: , -1]

In [15]:
from sklearn.naive_bayes import BernoulliNB

In [12]:
from sklearn.model_selection import train_test_split

X_train , X_test , y_train , y_test=train_test_split(X , y , test_size=0.2)

In [13]:
X_train

|   | A | B |
|---|---|---|
| 2 | 1.0 | 0.0 |
| 1 | 0.0 | NaN |
| 0 | 0.0 | 1.0 |
| 3 | 1.0 | 1.0 |


In [16]:
bnb=BernoulliNB()
bnb.fit(X_train , y_train)

ValueError: Input X contains NaN.
BernoulliNB does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

Bernoulli Naive Bayes does not inherently handle missing values. It expects a value for every feature in all training and prediction instances (binary values, or values that it binarizes with its `binarize` threshold). If a feature value is missing (e.g., NaN), the likelihood cannot be computed, and scikit-learn's `BernoulliNB` raises a `ValueError`, as shown above.

🔧 Ways to Handle Missing Values Before Using Bernoulli Naive Bayes:

**Imputation:**

Replace missing values with a default (usually 0, assuming absence of the feature).

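You could also use mode (most frequent value) imputation, which keeps the feature binary, or predictive imputation if justified. Mean imputation is less suitable here because it produces fractional values that are no longer 0 or 1.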

**Remove rows with missing values:**

This is simple but only viable if the missing rate is very low.
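Both options can be sketched on the toy data from this question. The pipeline layout and the choice of `strategy="most_frequent"` are assumptions made for illustration, not the only reasonable setup.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Same toy data as above, with np.nan marking the missing entries
df = pd.DataFrame({
    "A": [0, 0, 1, 1, np.nan],
    "B": [1, np.nan, 0, 1, 0],
    "target": [1, 0, 1, 0, 1],
})
X = df[["A", "B"]]
y = df["target"]

# Option 1: impute inside a pipeline.
# strategy="most_frequent" keeps the features binary;
# strategy="constant", fill_value=0 would instead treat missing as "absent".
model = make_pipeline(SimpleImputer(strategy="most_frequent"), BernoulliNB())
model.fit(X, y)
preds = model.predict(X)
print(preds)

# Option 2: drop rows that contain any missing value
complete = df.dropna()
bnb = BernoulliNB().fit(complete[["A", "B"]], complete["target"])
print(len(complete))  # 3 rows remain out of 5
```

Putting the imputer inside the pipeline means the fill values are learned from the training data only, which matters once the data is split or cross-validated.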

## Q4. Can Gaussian Naive Bayes be used for multi-class classification?

Yes, Gaussian Naive Bayes (GNB) can be used for multi-class classification problems.

In fact, Naive Bayes models—including Gaussian, Multinomial, and Bernoulli variants—are naturally designed to handle multiple classes. During training, the model learns the distribution of features for each class, and during prediction, it calculates the posterior probability for each class and selects the class with the highest probability.
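A minimal sketch of multi-class prediction with `GaussianNB`. The Iris dataset (three classes) is used purely as an illustration and is not part of the original analysis.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Iris has three classes (labels 0, 1, 2); used here only as an example
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=7, stratify=y
)

gnb = GaussianNB().fit(X_train, y_train)
print(gnb.classes_)  # [0 1 2]

# One posterior probability per class for a single test sample
proba = gnb.predict_proba(X_test[:1])
print(proba.shape)   # (1, 3)

# predict() returns the class with the highest posterior
pred = gnb.predict(X_test[:1])
print(pred[0] == gnb.classes_[proba.argmax()])  # True
```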

In [18]:
df=pd.read_csv("spambase.data")

In [19]:
df.head()

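|   | 0 | 0.64 | 0.64.1 | 0.1 | 0.32 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | ... | 0.41 | 0.42 | 0.43 | 0.778 | 0.44 | 0.45 | 3.756 | 61 | 278 | 1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.21 | 0.28 | 0.5 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.0 | 0.94 | ... | 0.0 | 0.132 | 0.0 | 0.372 | 0.18 | 0.048 | 5.114 | 101 | 1028 | 1 |
| 1 | 0.06 | 0.0 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ... | 0.01 | 0.143 | 0.0 | 0.276 | 0.184 | 0.01 | 9.821 | 485 | 2259 | 1 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.137 | 0.0 | 0.137 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.135 | 0.0 | 0.135 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 1.85 | 0.0 | 0.0 | 1.85 | 0.0 | 0.0 | ... | 0.0 | 0.223 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 15 | 54 | 1 |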
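One thing to note about this output: the column labels (0, 0.64, 0.64.1, ...) look like data values, which suggests that `spambase.data` has no header row and `read_csv` used the first record as column names. The sketch below shows the effect on a small in-memory CSV made up for illustration, not on the real file; passing `header=None` is one possible way to keep the first record as data.

```python
import io
import pandas as pd

# Three records, no header row (values made up for illustration)
raw = "0,0.64,1\n0.21,0.28,1\n0.06,0.0,0\n"

# Default: the first record is consumed as column names
df_default = pd.read_csv(io.StringIO(raw))

# header=None: all records are kept and columns are numbered 0, 1, 2
df_no_header = pd.read_csv(io.StringIO(raw), header=None)

print(df_default.shape)          # (2, 3)
print(df_no_header.shape)        # (3, 3)
print(list(df_default.columns))  # ['0', '0.64', '1']
```

The rest of this notebook keeps the original column labels, so the code that follows is unchanged.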


In [24]:
X=df.iloc[: , :-1]
y=df.iloc[: , -1]

In [25]:
from sklearn.model_selection import train_test_split

X_train , X_test , y_train , y_test = train_test_split(X , y , test_size=0.3 , random_state=7)

In [26]:
X_train.shape

(3220, 57)

In [29]:
from sklearn.naive_bayes import BernoulliNB , GaussianNB , MultinomialNB
from sklearn.model_selection import cross_val_score
bnb=BernoulliNB()
gnb=GaussianNB()
mnnb=MultinomialNB()

bnb_cv=cross_val_score(bnb , X_train , y_train , cv=10 , scoring='accuracy')
gnb_cv=cross_val_score(gnb , X_train , y_train , cv=10 , scoring='accuracy')
mnnb_cv=cross_val_score(mnnb , X_train , y_train , cv=10 , scoring='accuracy')

print(f"BNB : {bnb_cv}")
print(f"GNB : {gnb_cv}")
print(f"MNNB : {mnnb_cv}")

BNB : [0.90993789 0.89751553 0.90062112 0.86335404 0.84782609 0.87888199
 0.89440994 0.91304348 0.84782609 0.89440994]
GNB : [0.80434783 0.84161491 0.83850932 0.80434783 0.83540373 0.82608696
 0.81055901 0.82608696 0.80434783 0.85093168]
MNNB : [0.74223602 0.80124224 0.75776398 0.81055901 0.80745342 0.77950311
 0.78571429 0.81055901 0.74534161 0.80124224]


In [30]:
print(f"BNB : {bnb_cv.mean()}")
print(f"GNB : {gnb_cv.mean()}")
print(f"MNNB : {mnnb_cv.mean()}")

BNB : 0.8847826086956522
GNB : 0.8242236024844722
MNNB : 0.7841614906832297


The highest average cross-validation accuracy is observed for Bernoulli Naive Bayes (about 88.5%), followed by Gaussian (about 82.4%) and Multinomial (about 78.4%). These scores are computed on the training split only.
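The comparison can also be written as a loop that reports the mean and spread of the fold scores. This is a sketch on synthetic, mostly-zero count data generated only for illustration, so the numbers it prints are unrelated to the results above.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.default_rng(0)

# Synthetic stand-in data: non-negative counts, mostly zero,
# with class 1 having more non-zero entries than class 0
y = rng.integers(0, 2, size=300)
rates = np.where(y[:, None] == 1, 0.6, 0.2) * np.ones((300, 20))
X = rng.poisson(rates)

models = {"BNB": BernoulliNB(), "GNB": GaussianNB(), "MNB": MultinomialNB()}

summary = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the standard deviation alongside the mean helps judge whether the gap between two models is larger than the fold-to-fold variation.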

In [32]:
bnb.fit(X_train , y_train)
gnb.fit(X_train , y_train)
mnnb.fit(X_train ,y_train)

In [33]:
y_pred1=bnb.predict(X_test)
y_pred2=gnb.predict(X_test)
y_pred3=mnnb.predict(X_test)

In [35]:
from sklearn.metrics import accuracy_score , recall_score , precision_score , f1_score

## Bernoulli

In [36]:
print(accuracy_score(y_test , y_pred1))
print(recall_score(y_test , y_pred1))
print(precision_score(y_test , y_pred1))
print(f1_score(y_test , y_pred1))

0.8934782608695652
0.8333333333333334
0.8986615678776291
0.8647654093836246


## Gaussian

In [37]:
print(accuracy_score(y_test , y_pred2))
print(recall_score(y_test , y_pred2))
print(precision_score(y_test , y_pred2))
print(f1_score(y_test , y_pred2))

0.8311594202898551
0.9414893617021277
0.7264021887824897
0.8200772200772201


## Multinomial

In [38]:
print(accuracy_score(y_test , y_pred3))
print(recall_score(y_test , y_pred3))
print(precision_score(y_test , y_pred3))
print(f1_score(y_test , y_pred3))

0.8007246376811594
0.7109929078014184
0.7816764132553606
0.744661095636026


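### From the above, Bernoulli Naive Bayes has the highest accuracy on the test data (about 89%), followed by Gaussian (about 83%) and Multinomial (about 80%)

Bernoulli Naive Bayes also has the highest precision and F1 score, while Gaussian Naive Bayes has the highest recall (about 94%) at the cost of lower precision (about 73%).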
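The four metrics can be collected into one table for easier comparison. The helper `summarize` and the labels below are hypothetical, made up for illustration; in the notebook, the arguments would be `y_test` and the three prediction arrays.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def summarize(y_true, predictions):
    """Return one row of metrics per model (hypothetical helper)."""
    rows = {}
    for name, y_pred in predictions.items():
        rows[name] = {
            "accuracy": accuracy_score(y_true, y_pred),
            "recall": recall_score(y_true, y_pred),
            "precision": precision_score(y_true, y_pred),
            "f1": f1_score(y_true, y_pred),
        }
    return pd.DataFrame.from_dict(rows, orient="index")

# Made-up labels and predictions, only to exercise the helper
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
predictions = {
    "model_a": [1, 1, 0, 0, 0, 0, 0, 1],  # misses one positive
    "model_b": [1, 1, 1, 1, 1, 0, 0, 1],  # two false positives
}

table = summarize(y_true, predictions)
print(table.round(3))
```

Seeing the metrics side by side makes trade-offs visible, such as a model with higher recall but lower precision.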

In [41]:
df.head()

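|   | 0 | 0.64 | 0.64.1 | 0.1 | 0.32 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | ... | 0.41 | 0.42 | 0.43 | 0.778 | 0.44 | 0.45 | 3.756 | 61 | 278 | 1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.21 | 0.28 | 0.5 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.0 | 0.94 | ... | 0.0 | 0.132 | 0.0 | 0.372 | 0.18 | 0.048 | 5.114 | 101 | 1028 | 1 |
| 1 | 0.06 | 0.0 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ... | 0.01 | 0.143 | 0.0 | 0.276 | 0.184 | 0.01 | 9.821 | 485 | 2259 | 1 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.137 | 0.0 | 0.137 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.135 | 0.0 | 0.135 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 1.85 | 0.0 | 0.0 | 1.85 | 0.0 | 0.0 | ... | 0.0 | 0.223 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 15 | 54 | 1 |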


In [44]:
col_subset=['0.64', '0.64.1', '0.1', '0.32', '0.2', '0.3', '0.4', '0.5', '0.6']

In [47]:
for i in col_subset:
    print(df[i].value_counts())
    print()

0.00     3703
14.28      35
0.08       27
0.10       24
0.17       24
         ... 
2.46        1
4.16        1
2.59        1
0.91        1
2.01        1
Name: 0.64, Length: 171, dtype: int64

0.00    2713
0.32      49
0.29      41
0.55      39
0.36      29
        ... 
2.25       1
2.91       1
1.79       1
1.43       1
2.35       1
Name: 0.64.1, Length: 214, dtype: int64

0.00     4553
0.58        2
0.42        2
0.17        2
0.21        2
35.46       2
0.57        1
0.44        1
7.07        1
1.33        1
1.29        1
19.73       1
0.04        1
0.60        1
1.35        1
0.11        1
0.14        1
0.15        1
0.87        1
0.13        1
0.55        1
42.73       1
19.16       1
0.06        1
0.52        1
0.16        1
0.19        1
0.95        1
5.03        1
7.18        1
13.63       1
0.81        1
1.16        1
1.26        1
0.10        1
0.49        1
1.91        1
40.13       1
0.91        1
9.16        1
4.31        1
42.81       1
0.31        1
Name: 0.1, dtype: int

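Based on the value counts above, these features are zero in the majority of rows (roughly 80% of rows for column `0.64`, 59% for `0.64.1`, and 99% for `0.1`). For data dominated by zeros like this (more than 50%), Bernoulli Naive Bayes performed better than the other two variants in this experiment. A likely reason is that `BernoulliNB` binarizes each feature at a threshold of 0 by default (`binarize=0.0`), so it models only whether a value is non-zero, which is where much of the signal in such sparse features lies.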
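The share of zeros per column can be computed directly instead of reading it off the value counts. The small frame below is made up for illustration; with the notebook's data, the same expression could be applied to `df[col_subset]`.

```python
import pandas as pd

# Made-up sparse features, only to illustrate the calculation
df_demo = pd.DataFrame({
    "word_a": [0.0, 0.0, 0.0, 14.28, 0.0],
    "word_b": [0.0, 0.32, 0.0, 0.0, 0.29],
    "word_c": [0.5, 0.1, 0.0, 0.2, 0.3],
})

# Fraction of rows where each feature is exactly zero
zero_fraction = (df_demo == 0).mean()
print(zero_fraction)

# Columns where zeros dominate (more than 50% of rows)
mostly_zero = zero_fraction[zero_fraction > 0.5].index.tolist()
print(mostly_zero)  # ['word_a', 'word_b']

# Presence/absence view of the data, mirroring a binarize threshold of 0
X_binary = (df_demo > 0).astype(int)
print(X_binary)
```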