Q1. To find the probability that an employee is a smoker given that he/she uses the health insurance plan, we first set up the notation. Let's denote:

A: Employee is a smoker
B: Employee uses the health insurance plan
We are given:

P(B) = 0.7 (probability of an employee using the health insurance plan)
P(A|B) = 0.4 (probability of an employee being a smoker given that they use the health insurance plan)
We want to find P(A|B), which is exactly the second quantity given above, so no inversion with Bayes' theorem is required:

P(A|B) = 0.4

Bayes' theorem, P(B|A) = (P(A|B) * P(B)) / P(A), would only be needed for the reverse question: the probability that a smoker uses the plan. That cannot be computed here because P(A), the overall share of smokers, is not given. What the given values do determine is the joint probability:

P(A and B) = P(A|B) * P(B) = 0.4 * 0.7 = 0.28

Therefore, the probability that an employee is a smoker given that they use the health insurance plan is 0.4, or 40%.
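
As a quick sanity check, P(A|B) can be recovered from counts using the definition P(A|B) = P(A and B) / P(B). The sketch below assumes a hypothetical workforce of 1,000 employees; the headcount is an illustration only and cancels out of the result.

```python
# Hypothetical headcount, chosen only so that the counts are whole numbers.
total_employees = 1000

# 70% of employees use the health insurance plan: P(B) = 0.7.
plan_users = total_employees * 70 // 100

# 40% of plan users are smokers: P(A|B) = 0.4.
smoking_plan_users = plan_users * 40 // 100

# Definition of conditional probability: P(A|B) = P(A and B) / P(B).
p_b = plan_users / total_employees
p_a_and_b = smoking_plan_users / total_employees
p_a_given_b = p_a_and_b / p_b

print(f"P(B)       = {p_b:.2f}")
print(f"P(A and B) = {p_a_and_b:.2f}")
print(f"P(A|B)     = {p_a_given_b:.2f}")
```

Restricting attention to the 700 plan users and counting the 280 smokers among them gives the same ratio, 280 / 700 = 0.4.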

Q2. The main difference between Bernoulli Naive Bayes and Multinomial Naive Bayes lies in the assumptions they make about the feature distributions:

Bernoulli Naive Bayes assumes that features are binary, representing the presence or absence of certain characteristics. It models each feature as an independent Bernoulli variable within each class, using the values 0 and 1, and it explicitly accounts for a feature's absence as well as its presence.

Multinomial Naive Bayes assumes that features follow a multinomial distribution, which means they represent counts or frequencies of discrete events. It is commonly used for text classification, where features often represent word frequencies or occurrences.

In summary, Bernoulli Naive Bayes is suitable for binary (present/absent) features, while Multinomial Naive Bayes is suitable for count-based features.
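
To make the distinction concrete, the sketch below encodes one short message in both ways. The vocabulary and the message are made-up illustrations, not data from the Spambase set.

```python
from collections import Counter

# Hypothetical vocabulary and message, for illustration only.
vocabulary = ["free", "money", "meeting", "now"]
message = "free money free now now now"

counts = Counter(message.split())

# Multinomial view: how many times each vocabulary word occurs.
count_features = [counts[word] for word in vocabulary]

# Bernoulli view: whether each vocabulary word occurs at all.
binary_features = [1 if counts[word] > 0 else 0 for word in vocabulary]

print("Multinomial (counts):", count_features)
print("Bernoulli (presence):", binary_features)
```

The word "meeting" is 0 in both encodings, but the two models use that zero differently: Bernoulli Naive Bayes multiplies in the probability that the word is absent, while Multinomial Naive Bayes simply gets no contribution from a zero count.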

Q3. Bernoulli Naive Bayes has no built-in mechanism for missing values; the model expects every feature to be 0 or 1. Missing values therefore have to be handled around the model, and there are three common options. First, missingness can be treated as information in its own right by adding a binary indicator feature that records whether the original value was missing, so instances with missing values do not have to be discarded. Second, the missing value can be imputed, for example with the most frequent value of that feature. Third, because the features are assumed conditionally independent given the class, a missing feature can be left out of the likelihood product at prediction time, so the prediction is based on the available information from the other features.
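
One way to treat missingness as its own state is to encode it before training. The sketch below is a minimal illustration on made-up rows; `add_missing_indicators` is a hypothetical helper, not part of any library.

```python
# Hypothetical rows of binary features; None marks a missing value.
rows = [
    [1, 0, None],
    [None, 1, 1],
    [0, 0, 1],
]

def add_missing_indicators(rows):
    """Replace each feature with two binary columns: (value, is_missing).

    A missing value becomes value=0 with is_missing=1, so the output is
    fully binary and suitable as input to a Bernoulli Naive Bayes model.
    """
    encoded = []
    for row in rows:
        new_row = []
        for value in row:
            is_missing = 1 if value is None else 0
            new_row.extend([0 if value is None else value, is_missing])
        encoded.append(new_row)
    return encoded

encoded_rows = add_missing_indicators(rows)
for row in encoded_rows:
    print(row)
```

Each original feature now occupies two columns, so the classifier can learn whether the fact that a value is missing is itself associated with a class.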

Q4. Yes, Gaussian Naive Bayes can be used for multi-class classification. It assumes that the continuous features follow a Gaussian (normal) distribution within each class, with a separate mean and variance estimated per feature and per class. Each feature is modeled independently, and the score for a class is its prior probability multiplied by the product of the individual feature likelihoods. By comparing these scores across all classes, however many there are, Gaussian Naive Bayes picks the most likely class for a given instance.
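
The procedure can be sketched from scratch in a few lines. The example below assumes a single continuous feature and three made-up classes purely for illustration; with several features, the per-feature log-likelihoods would be summed.

```python
import math
from collections import defaultdict

# Hypothetical one-feature, three-class training data, for illustration only.
train = [
    (1.0, "low"), (1.2, "low"), (0.8, "low"),
    (5.0, "mid"), (5.3, "mid"), (4.7, "mid"),
    (9.0, "high"), (9.4, "high"), (8.6, "high"),
]

def fit(samples):
    """Estimate the prior, mean and variance of the feature for each class."""
    by_class = defaultdict(list)
    for x, label in samples:
        by_class[label].append(x)
    params = {}
    for label, xs in by_class.items():
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        params[label] = (len(xs) / len(samples), mean, var)
    return params

def log_gaussian(x, mean, var):
    """Log of the normal density with the given mean and variance at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict(params, x):
    """Return the class with the highest log prior plus log likelihood."""
    scores = {
        label: math.log(prior) + log_gaussian(x, mean, var)
        for label, (prior, mean, var) in params.items()
    }
    return max(scores, key=scores.get)

params = fit(train)
for x in (1.1, 5.2, 8.0):
    print(x, "->", predict(params, x))
```

Nothing in `predict` depends on the number of classes: it scores every class found during fitting and returns the best one, which is why the same model covers binary and multi-class problems.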

In [3]:
##Q5

import pandas as pd
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Step 1: Load the Spambase dataset
data = pd.read_csv("spambase.data", header=None)




In [4]:
# Step 2: Split the dataset into features (X) and labels (y)
X = data.iloc[:, :-1]  
y = data.iloc[:, -1]   


In [5]:
# Step 4: Create instances of the Naive Bayes classifiers
bernoulli_nb = BernoulliNB()
multinomial_nb = MultinomialNB()
gaussian_nb = GaussianNB()

In [6]:
# Step 5: Perform 10-fold cross-validation
cv = 10
bernoulli_scores = cross_val_score(bernoulli_nb, X, y, cv=cv)
multinomial_scores = cross_val_score(multinomial_nb, X, y, cv=cv)
gaussian_scores = cross_val_score(gaussian_nb, X, y, cv=cv)


In [None]:
# Step 6: Calculate performance metrics
from sklearn.model_selection import cross_val_predict

def print_metrics(model, scores):
    # cross_val_score fits clones of the estimator, so `model` itself is
    # still unfitted here and model.predict(X) would raise an error.
    # Use out-of-fold predictions for this specific model instead.
    y_pred = cross_val_predict(model, X, y, cv=cv)

    accuracy = scores.mean()
    precision = precision_score(y, y_pred, average='macro')
    recall = recall_score(y, y_pred, average='macro')
    f1 = f1_score(y, y_pred, average='macro')

    print("Accuracy:", accuracy)
    print("Precision:", precision)
    print("Recall:", recall)
    print("F1 score:", f1)

print("Bernoulli Naive Bayes:")
print_metrics(bernoulli_nb, bernoulli_scores)

print("\nMultinomial Naive Bayes:")
print_metrics(multinomial_nb, multinomial_scores)

print("\nGaussian Naive Bayes:")
print_metrics(gaussian_nb, gaussian_scores)
