# Module 69 Naive Bayes Assignment 2

Q1. A company conducted a survey of its employees and found that 70% of the employees use the
company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the
probability that an employee is a smoker given that he/she uses the health insurance plan?

A1. We are given,

P(Insurance) = 0.70

P(Smoker | Insurance) = 0.40

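The question asks for P(Smoker | Insurance), which is given directly: 40% of the employees who use the plan are smokers, so P(Smoker | Insurance) = 0.40. The 70% figure, P(Insurance), is not needed to answer it.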
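As a sanity check, the definition of conditional probability can be applied to the survey figures. The sketch below uses exact fractions so the arithmetic is free of rounding; the variable names are illustrative and not part of the original answer.

```python
from fractions import Fraction

# Figures given in the question
p_insurance = Fraction(70, 100)               # P(Insurance)
p_smoker_given_insurance = Fraction(40, 100)  # P(Smoker | Insurance)

# Product rule: P(Smoker and Insurance) = P(Smoker | Insurance) * P(Insurance)
p_smoker_and_insurance = p_smoker_given_insurance * p_insurance

# Definition: P(Smoker | Insurance) = P(Smoker and Insurance) / P(Insurance)
recovered = p_smoker_and_insurance / p_insurance

print(float(p_smoker_and_insurance))  # 0.28
print(float(recovered))               # 0.4
```

Dividing the joint probability by P(Insurance) returns the 0.40 that the question already states, which is why no further calculation is required.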

Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

A2. Bernoulli and Multinomial Naive Bayes differ in the following aspects:

1.) **Data Type:**

Bernoulli NB - Designed for binary features (0 or 1).

Multinomial NB - Works with count data or frequency data (e.g., word counts).

2.) **Feature Handling:**

Bernoulli NB - Considers the presence or absence of a feature.

Multinomial NB - Considers the frequency of a feature in the data.

3.) **Use case:**

Bernoulli NB - Best for text classification where binary presence matters.

Multinomial NB - Best for text classification where word frequencies matter.
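The difference can be seen on a toy word-count matrix. This is a minimal sketch with made-up data (`X_counts`, `y_toy`, `doc_once`, and `doc_many` are illustrative names): scikit-learn's `BernoulliNB` binarizes its input at the default threshold `binarize=0.0`, while `MultinomialNB` uses the counts themselves.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy word-count matrix (rows = documents, columns = vocabulary terms).
# The values and labels are made up for illustration.
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 1, 2],
])
y_toy = np.array([1, 1, 0, 0])
X_binary = (X_counts > 0).astype(int)

# BernoulliNB only looks at presence/absence, so fitting on counts or on
# the binarized matrix gives the same class probabilities.
bnb_counts = BernoulliNB().fit(X_counts, y_toy)
bnb_binary = BernoulliNB().fit(X_binary, y_toy)
same_probs = np.allclose(
    bnb_counts.predict_proba(X_counts),
    bnb_binary.predict_proba(X_binary),
)
print("BernoulliNB ignores count magnitude:", same_probs)

# MultinomialNB uses the counts, so repeating a word more often
# changes the class probabilities it reports.
mnb = MultinomialNB().fit(X_counts, y_toy)
doc_once = np.array([[1, 0, 1]])
doc_many = np.array([[5, 0, 1]])
print(mnb.predict_proba(doc_once))
print(mnb.predict_proba(doc_many))
```

The two `MultinomialNB` outputs differ because the first term occurs five times instead of once, whereas `BernoulliNB` would treat both documents identically.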

Q3. How does Bernoulli Naive Bayes handle missing values?

A3. Bernoulli Naive Bayes does not inherently handle missing values. You need to preprocess the data before applying the model:

1.) **Imputation:** Replace missing values with 0, 1, or another representative value.

2.) **Feature Engineering:** Drop rows or columns with too many missing values or use domain knowledge for imputation.
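The imputation step above can be sketched with a scikit-learn pipeline. This is one possible approach under stated assumptions, not the only one: the toy matrix is made up, and the most-frequent-value strategy is chosen only because it keeps the features binary.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Toy binary feature matrix with missing entries (np.nan); values are illustrative.
X_missing = np.array([
    [1, 0, np.nan],
    [1, np.nan, 0],
    [0, 1, 1],
    [np.nan, 1, 1],
    [1, 0, 0],
    [0, 1, np.nan],
])
y_toy = np.array([1, 1, 0, 0, 1, 0])

# Fill each column's gaps with its most frequent value, then fit the classifier.
model = make_pipeline(SimpleImputer(strategy="most_frequent"), BernoulliNB())
model.fit(X_missing, y_toy)

predictions = model.predict(X_missing)
print(predictions)
```

Putting the imputer inside the pipeline means that, under cross-validation, the fill values are learned from the training folds only.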


Q4. Can Gaussian Naive Bayes be used for multi-class classification?

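A4. Yes, Gaussian Naive Bayes handles multi-class classification natively. No "one-vs-rest" (OvR) wrapper is needed, because the model estimates a prior and per-feature Gaussian parameters (mean and variance) for every class.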

It computes the posterior probability for each class and assigns the label with the highest probability.
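A short sketch on the three-class Iris data bundled with scikit-learn illustrates this. The dataset choice is an assumption made for illustration; it is not part of the assignment.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Iris has three classes and four continuous features.
X_iris, y_iris = load_iris(return_X_y=True)

gnb = GaussianNB().fit(X_iris, y_iris)

# One posterior column per class, produced by a single model.
posteriors = gnb.predict_proba(X_iris)
print(gnb.classes_)       # the three class labels
print(posteriors.shape)   # (n_samples, n_classes)

# predict() returns the class whose posterior is largest.
argmax_labels = gnb.classes_[np.argmax(posteriors, axis=1)]
print(np.array_equal(argmax_labels, gnb.predict(X_iris)))
```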

Q5. Assignment:
Data preparation:
Download the "Spambase Data Set" from the UCI Machine Learning Repository
(https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is
to predict whether a message is spam or not based on several input features.

Implementation:
Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the
dataset. You should use the default hyperparameters for each classifier.

Results:
Report the following performance metrics for each classifier:
Accuracy
Precision
Recall
F1 score

Discussion:
Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
the case? Are there any limitations of Naive Bayes that you observed?

Conclusion:
Summarise your findings and provide some suggestions for future work.

Note: This dataset contains a binary classification problem with multiple features. The dataset is
relatively small, but it can be used to demonstrate the performance of the different variants of Naive
Bayes on a real-world problem.

A5.

# Steps to Implement the Naive Bayes Classifiers

**Step 1: Data Preparation:**
Download the Spambase Data Set from the UCI Machine Learning Repository and save it locally. Assume the file is named spambase.csv.

**Step 2: Implementation in Python:**

In [3]:
from google.colab import files

uploaded = files.upload()

Saving spambase.csv to spambase.csv


In [4]:
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

# Load the dataset. This assumes spambase.csv has a header row; the raw UCI
# file (spambase.data) has none, in which case pass header=None.
data = pd.read_csv('spambase.csv')
X = data.iloc[:, :-1]  # Features: every column except the last
y = data.iloc[:, -1]   # Target: last column (1 = spam, 0 = not spam)

# Classifiers with default hyperparameters
models = {
    'BernoulliNB': BernoulliNB(),
    'MultinomialNB': MultinomialNB(),
    'GaussianNB': GaussianNB()
}

# Evaluation: 10-fold cross-validation on the full dataset
results = {}
for model_name, model in models.items():
    # Mean of each metric across the 10 folds
    accuracy = cross_val_score(model, X, y, cv=10, scoring='accuracy').mean()
    precision = cross_val_score(model, X, y, cv=10, scoring='precision').mean()
    recall = cross_val_score(model, X, y, cv=10, scoring='recall').mean()
    f1 = cross_val_score(model, X, y, cv=10, scoring='f1').mean()
    results[model_name] = {'Accuracy': accuracy, 'Precision': precision, 'Recall': recall, 'F1 Score': f1}

# Display Results
print(pd.DataFrame(results).T)


               Accuracy  Precision    Recall  F1 Score
BernoulliNB    0.883913   0.886914  0.815124  0.848071
MultinomialNB  0.786087   0.739029  0.720797  0.727751
GaussianNB     0.821739   0.710275  0.956939  0.813000
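The loop above runs a separate 10-fold cross-validation for each metric. As an alternative sketch, scikit-learn's `cross_validate` accepts a list of scorers and evaluates all of them on the same fitted folds. To stay self-contained, the example uses a synthetic stand-in for the Spambase features (`X_demo`, `y_demo` are assumptions), so its numbers are not comparable to the table above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Synthetic stand-in (assumption): sparse, non-negative, right-skewed features
# with a label that depends on the first two columns.
rng = np.random.default_rng(0)
mask = rng.binomial(1, 0.5, size=(500, 10))
X_demo = rng.exponential(1.0, size=(500, 10)) * mask
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 1.0).astype(int)

metrics = ["accuracy", "precision", "recall", "f1"]
models = {
    "BernoulliNB": BernoulliNB(),
    "MultinomialNB": MultinomialNB(),
    "GaussianNB": GaussianNB(),
}

rows = {}
for name, model in models.items():
    # Each fold is fitted once and scored on every metric in the list
    scores = cross_validate(model, X_demo, y_demo, cv=10, scoring=metrics)
    rows[name] = {m: scores[f"test_{m}"].mean() for m in metrics}

summary = pd.DataFrame(rows).T
print(summary)
```

With the real data, `X_demo` and `y_demo` would be replaced by the `X` and `y` loaded from spambase.csv.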


**Step 3: Results**

The table above reports four metrics for each classifier, each averaged over the 10 folds:

1.) **Accuracy:** Measures overall correctness.

2.) **Precision:** Measures the proportion of true positives out of predicted positives.

3.) **Recall:** Measures the proportion of true positives out of actual positives.

4.) **F1 Score:** Harmonic mean of precision and recall.
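The four definitions can be checked by hand on a small example. The labels below are made up for illustration (1 = spam, 0 = not spam), and the hand-computed values are compared against scikit-learn's metric functions.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy labels; values are illustrative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# Confusion-matrix counts
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)  # 0.75 0.75 0.75 0.75

# The library functions should agree with the hand computation
print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```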


**Step 4: Discussion**

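1.) **Best Performing Classifier:** BernoulliNB performed best overall, with the highest accuracy (0.884), precision (0.887), and F1 score (0.848). GaussianNB had the highest recall (0.957) but the lowest precision (0.710). MultinomialNB had the lowest accuracy (0.786) and F1 score (0.728).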

2.) **Reasoning:**

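BernoulliNB: By default it binarizes each feature at 0, so it models only whether a word or character occurs in an email. On this dataset that presence/absence signal appears to be the most reliable one.

MultinomialNB: Suited to count features. The Spambase features are percentages and capital-letter run lengths on very different scales rather than raw counts, which likely hurts it here.

GaussianNB: Handles continuous features but assumes each one is normally distributed within a class. The Spambase features are heavily skewed with many zeros, which may explain why it flags most spam (high recall) but also many legitimate emails (low precision).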

**Limitations Observed:**

Naive Bayes assumes feature independence, which might not hold true for correlated features.

Sensitive to imbalanced data.
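The effect of violating the independence assumption can be illustrated with a small synthetic experiment. This is a sketch on generated data, not a result from Spambase: one informative feature is copied five times, and GaussianNB treats each copy as fresh evidence.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic, balanced two-class data with one informative feature (assumption).
rng = np.random.default_rng(0)
n_per_class = 200
x = np.concatenate([
    rng.normal(0.0, 1.0, n_per_class),
    rng.normal(1.0, 1.0, n_per_class),
]).reshape(-1, 1)
y_demo = np.repeat([0, 1], n_per_class)

# Copy the same feature five times: perfectly correlated columns
x_dup = np.repeat(x, 5, axis=1)

# Confidence = probability assigned to the predicted class
conf_single = GaussianNB().fit(x, y_demo).predict_proba(x).max(axis=1)
conf_dup = GaussianNB().fit(x_dup, y_demo).predict_proba(x_dup).max(axis=1)

# The copies add no information, yet the posteriors move toward 0 or 1.
print("mean confidence, 1 column :", conf_single.mean())
print("mean confidence, 5 copies :", conf_dup.mean())
```

The predicted labels stay the same in this setup; only the reported probabilities become more extreme, which is why Naive Bayes probabilities are often poorly calibrated when features are correlated.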


**Step 5: Conclusion**

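1.) Summary: Bernoulli Naive Bayes performed best on the Spambase dataset under 10-fold cross-validation (accuracy 0.884, F1 score 0.848). Gaussian Naive Bayes achieved the highest recall (0.957) at the cost of precision, and Multinomial Naive Bayes scored lowest on accuracy and F1 score.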

2.) Suggest improvements:

Feature engineering.

Using ensemble methods (e.g., combining Naive Bayes with other models).

Handling data imbalance if applicable.
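Two of these suggestions can be sketched together. The example below is a hypothetical starting point rather than a tested improvement: it applies a `log1p` transform before GaussianNB (feature engineering) and averages its probabilities with BernoulliNB through a soft-voting ensemble. It combines two Naive Bayes variants instead of a different model family, and it runs on synthetic stand-in data (`X_demo`, `y_demo` are assumptions), so the score says nothing about Spambase.

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# Synthetic stand-in (assumption): sparse, non-negative, right-skewed features
rng = np.random.default_rng(0)
mask = rng.binomial(1, 0.5, size=(500, 10))
X_demo = rng.exponential(1.0, size=(500, 10)) * mask
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 1.0).astype(int)

# Feature engineering: log1p compresses the long right tail before GaussianNB
gnb_log = make_pipeline(FunctionTransformer(np.log1p), GaussianNB())

# Ensemble: average the predicted probabilities of the two models
ensemble = VotingClassifier(
    estimators=[("bnb", BernoulliNB()), ("gnb_log", gnb_log)],
    voting="soft",
)

f1_scores = cross_val_score(ensemble, X_demo, y_demo, cv=10, scoring="f1")
print("mean F1 over 10 folds:", f1_scores.mean())
```

Whether either change helps on the real data would need to be checked with the same 10-fold protocol used in Step 2.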