Q1. A company conducted a survey of its employees and found that 70% of the employees use the
company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the
probability that an employee is a smoker given that he/she uses the health insurance plan?

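Ans:
The 40% figure is already conditional on using the plan, so the answer is 0.40. The same value follows from the definition of conditional probability, with P(Smoker and Uses Plan) = 0.40 × 0.70 = 0.28:

$$P(\text{Smoker} \mid \text{Uses Plan}) = \frac{P(\text{Smoker and Uses Plan})}{P(\text{Uses Plan})} = \frac{0.28}{0.70} = 0.40$$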
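
The arithmetic for Q1 can be checked with a few lines of Python. This is only a sketch using the two figures from the question; the variable names are illustrative.

```python
# Figures taken from the question
p_uses_plan = 0.70            # P(Uses Plan)
p_smoker_given_plan = 0.40    # P(Smoker | Uses Plan), stated directly in the question

# Joint probability via the product rule
p_smoker_and_plan = p_smoker_given_plan * p_uses_plan

# Recover the conditional probability from the joint and the marginal
p_conditional = p_smoker_and_plan / p_uses_plan

print(f"P(Smoker and Uses Plan) = {p_smoker_and_plan:.2f}")
print(f"P(Smoker | Uses Plan)   = {p_conditional:.2f}")
```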

Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

Ans:
The key difference lies in the kind of feature data each model assumes. Bernoulli Naive Bayes models binary features (present/absent, yes/no): each feature takes one of two values, and the absence of a feature contributes explicitly to the likelihood. Multinomial Naive Bayes models discrete counts, such as the number of times each word occurs in a text document, so a feature can take any non-negative count and more frequent features carry more weight.
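
To make the contrast concrete, here is a minimal plain-Python sketch of the two class-conditional log-likelihoods for a single class. The vocabulary, the probability values, and the function names are made up for illustration; this is not scikit-learn's implementation.

```python
import math

# Hypothetical parameters for one class (illustrative values only)
theta_bernoulli = {"free": 0.8, "meeting": 0.1, "offer": 0.6}    # P(word present | class)
theta_multinomial = {"free": 0.5, "meeting": 0.1, "offer": 0.4}  # P(word | class), sums to 1

def bernoulli_log_likelihood(counts, theta):
    # Every vocabulary word contributes: p if present, (1 - p) if absent
    total = 0.0
    for word, p in theta.items():
        present = counts.get(word, 0) > 0
        total += math.log(p if present else 1.0 - p)
    return total

def multinomial_log_likelihood(counts, theta):
    # Only observed words contribute, each weighted by its count
    # (the multinomial coefficient is omitted because it is the same for every class)
    return sum(c * math.log(theta[w]) for w, c in counts.items() if c > 0)

doc_counts = {"free": 3, "meeting": 0, "offer": 1}
print(bernoulli_log_likelihood(doc_counts, theta_bernoulli))
print(multinomial_log_likelihood(doc_counts, theta_multinomial))
```

Changing the count of "free" from 3 to 1 leaves the Bernoulli value unchanged, since only presence matters, but changes the Multinomial value.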

Q3. How does Bernoulli Naive Bayes handle missing values?

Ans:
Bernoulli Naive Bayes has no built-in mechanism for missing values. In theory, because the features are assumed conditionally independent given the class, a missing feature can simply be left out of the likelihood product, which is valid when values are missing at random. In practice, implementations differ: scikit-learn's BernoulliNB, for example, raises an error when the input contains NaN. The usual options are therefore to impute the missing values (for example with the most frequent value), to encode "missing" as its own binary indicator feature, or to drop the affected rows, which reduces the amount of training data.
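
The idea of ignoring a missing feature during the probability calculation can be sketched in plain Python. This assumes missing values are marked with `None` and are missing at random; the parameters and the function name are hypothetical, and this is not how any particular library behaves.

```python
import math

def bernoulli_log_posterior(x, log_prior, theta):
    """Unnormalized log-posterior for one class.

    x     : list of 0, 1, or None (None marks a missing value)
    theta : list of P(feature = 1 | class), one entry per feature
    """
    total = log_prior
    for value, p in zip(x, theta):
        if value is None:
            # Summing a missing feature out of the product contributes a factor of 1
            continue
        total += math.log(p if value == 1 else 1.0 - p)
    return total

# Hypothetical parameters for two classes (illustrative values only)
params = {
    "spam": (math.log(0.5), [0.8, 0.1, 0.6]),
    "ham": (math.log(0.5), [0.2, 0.7, 0.3]),
}

x = [1, None, 1]  # the second feature is missing
scores = {label: bernoulli_log_posterior(x, lp, th) for label, (lp, th) in params.items()}
prediction = max(scores, key=scores.get)
print(prediction)
```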

Q4. Can Gaussian Naive Bayes be used for multi-class classification?

Ans:
Yes. Gaussian Naive Bayes supports multi-class classification directly: it estimates a prior plus a per-feature mean and variance for each class, computes the posterior probability of every class for a data point, and predicts the class with the highest posterior. No one-vs-rest scheme is needed, so it suits problems with more than two possible outcomes.
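
A minimal plain-Python sketch of Gaussian Naive Bayes on a three-class toy problem illustrates this. The data, the class labels, and the small variance-smoothing constant are assumptions made for the example; it is not scikit-learn's GaussianNB.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    # Group the rows by class label
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        cols = list(zip(*rows))
        means = [sum(c) / len(c) for c in cols]
        # Small constant keeps the variance strictly positive (assumed value)
        variances = [sum((v - m) ** 2 for v in c) / len(c) + 1e-9
                     for c, m in zip(cols, means)]
        model[label] = (math.log(len(rows) / len(X)), means, variances)
    return model

def predict(model, row):
    def log_posterior(params):
        log_prior, means, variances = params
        # Log prior plus the sum of per-feature Gaussian log-densities
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances))
    # Pick the class with the highest posterior, however many classes there are
    return max(model, key=lambda label: log_posterior(model[label]))

# Toy data with three well-separated classes (made up for illustration)
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0],
     [5.0, 5.2], [5.1, 4.8], [4.9, 5.0],
     [9.0, 1.0], [9.2, 0.8], [8.8, 1.1]]
y = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]

model = fit_gaussian_nb(X, y)
print(predict(model, [5.0, 5.0]))
```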


Q5 Ans:


In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
columns = ['feature_' + str(i) for i in range(57)] + ['label']
data = pd.read_csv(url, header=None, names=columns)

# Split the dataset into features (X) and target variable (y)
X = data.drop('label', axis=1)
y = data['label']

data.head()

,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,...,feature_48,feature_49,feature_50,feature_51,feature_52,feature_53,feature_54,feature_55,feature_56,label
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


In [5]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


In [6]:
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.model_selection import cross_validate
from sklearn.preprocessing import StandardScaler

# Initialize classifiers
bernoulli_nb = BernoulliNB()
multinomial_nb = MultinomialNB()
gaussian_nb = GaussianNB()

# Split the data into features (X) and target (y)
X = data.drop('label', axis=1)
y = data['label']

# Scaling for Gaussian Naive Bayes only
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Define a function to calculate performance metrics
def evaluate_classifier(clf, X, y):
    # Run one 10-fold cross-validation that scores all four metrics on the same folds
    scores = cross_validate(clf, X, y, cv=10,
                            scoring=['accuracy', 'precision', 'recall', 'f1'])

    # Average each metric over the 10 folds
    return (scores['test_accuracy'].mean(), scores['test_precision'].mean(),
            scores['test_recall'].mean(), scores['test_f1'].mean())

# For Bernoulli and Multinomial Naive Bayes, use raw features
X_bernoulli = X  # No scaling: BernoulliNB binarizes features at 0.0 by default
X_multinomial = X  # No scaling: MultinomialNB requires non-negative features

# Evaluate each classifier
results = {}
for clf, name, X_data in zip([bernoulli_nb, multinomial_nb, gaussian_nb],
                              ['Bernoulli Naive Bayes', 'Multinomial Naive Bayes', 'Gaussian Naive Bayes'],
                              [X_bernoulli, X_multinomial, X_scaled]):
    accuracy, precision, recall, f1 = evaluate_classifier(clf, X_data, y)
    results[name] = {
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1 Score': f1
    }

# Print results
for name, metrics in results.items():
    print(f"{name}:")
    for metric, value in metrics.items():
        print(f"  {metric}: {value:.4f}")


Bernoulli Naive Bayes:
  Accuracy: 0.8839
  Precision: 0.8870
  Recall: 0.8152
  F1 Score: 0.8481
Multinomial Naive Bayes:
  Accuracy: 0.7863
  Precision: 0.7393
  Recall: 0.7215
  F1 Score: 0.7283
Gaussian Naive Bayes:
  Accuracy: 0.8187
  Precision: 0.7063
  Recall: 0.9575
  F1 Score: 0.8106
