Q1. A company conducted a survey of its employees and found that 70% of the employees use the
company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the
probability that an employee is a smoker given that he/she uses the health insurance plan?

To find the probability that an employee is a smoker given that he/she uses the health insurance plan, we can use conditional probability.

Let's denote:

A: Event that an employee uses the health insurance plan.
B: Event that an employee is a smoker.
We want to find P(B|A), the probability of an employee being a smoker given that he/she uses the health insurance plan.

We are given:

P(A) = 0.70 (Probability that an employee uses the health insurance plan)
P(B|A) = 0.40 (Probability that an employee is a smoker given that he/she uses the health insurance plan)


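The conditional probability is defined as

P(B∣A) = P(A∩B) / P(A)

In this problem P(B∣A) is given directly: "40% of the employees who use the plan are smokers" is already a statement about plan users only. No further calculation is needed, so P(B∣A) = 0.40.

For reference, rearranging the formula gives the joint probability that an employee both uses the plan and is a smoker:

P(A∩B) = P(B∣A) × P(A) = 0.40 × 0.70 = 0.28

So, the probability that an employee is a smoker given that he/she uses the health insurance plan is 0.40, or 40%. The value 0.28 (28%) is the probability that a randomly chosen employee both uses the plan and smokes, which is a different quantity.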
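
The relationship between the joint and the conditional probability can be checked numerically. The following is a minimal sketch using the two values given in the question; the variable names are illustrative only.

```python
import math

# Values given in the question
p_a = 0.70           # P(A): employee uses the health insurance plan
p_b_given_a = 0.40   # P(B|A): employee smokes, given plan use

# Joint probability: employee uses the plan AND smokes
p_a_and_b = p_b_given_a * p_a

# Dividing the joint probability by P(A) recovers the conditional probability
p_b_given_a_check = p_a_and_b / p_a

print(f"P(A and B) = {p_a_and_b:.2f}")
print(f"P(B | A)   = {p_b_given_a_check:.2f}")
```

The joint probability is smaller than the conditional one because it is measured against all employees rather than against plan users only.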

Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

Bernoulli Naive Bayes:

In Bernoulli Naive Bayes, features are binary variables (0 or 1), representing whether a particular feature is present or absent in a document or sample.
It assumes that the features are generated from independent Bernoulli distributions.
It's commonly used in text classification tasks where the presence or absence of a word in a document matters more than its frequency.

Multinomial Naive Bayes:

In Multinomial Naive Bayes, features represent the frequencies with which certain events have been generated by a multinomial distribution.
It's typically used when the features (e.g., word counts) represent the occurrence counts of words or other items within the document.
It's widely used in text classification tasks, especially when the frequency of words matters.
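
To make the distinction concrete, here is a small sketch on a hypothetical word-count matrix (the data and variable names are invented for illustration). It relies on the fact that scikit-learn's BernoulliNB thresholds its input at `binarize` (default 0.0), so fitting it on raw counts should be equivalent to fitting it on presence/absence indicators, whereas MultinomialNB uses the counts themselves.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical toy data: rows are documents, columns are word counts
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 1, 2],
])
y = np.array([1, 1, 0, 0])

# Presence/absence view of the same documents
X_binary = (X_counts > 0).astype(int)

# BernoulliNB binarizes internally, so both fits see the same 0/1 features
bnb_counts = BernoulliNB().fit(X_counts, y)
bnb_binary = BernoulliNB(binarize=None).fit(X_binary, y)
same_bernoulli_model = np.allclose(
    bnb_counts.feature_log_prob_, bnb_binary.feature_log_prob_
)

# MultinomialNB keeps the counts, so a word seen 4 times weighs more than one seen once
mnb = MultinomialNB().fit(X_counts, y)

print("Bernoulli fits identical:", same_bernoulli_model)
print("Multinomial per-class word log-probabilities:")
print(mnb.feature_log_prob_)
```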

Q3. How does Bernoulli Naive Bayes handle missing values?

In Bernoulli Naive Bayes, missing values can be handled in different ways, depending on the implementation and the specific requirements of the problem. The algorithm itself does not define a treatment for them, and scikit-learn's BernoulliNB, for example, rejects input containing NaN, so with that library the handling must happen before fitting. Here are some common approaches:

Constant Imputation: Missing values can be replaced with a fixed value, usually 0 for a binary feature. This approach assumes that a missing value indicates the feature is absent.

Mode Imputation: For binary features, missing values can be imputed with the mode of the feature (i.e., the most common value) across the dataset. This approach assumes that a missing entry most likely takes the feature's most frequent value.

Ignoring Missing Values: In some implementations, missing features are simply skipped when computing the likelihood during training and prediction, which the independence assumption makes straightforward because each feature contributes its own factor. This approach works well if missing values are relatively rare and do not significantly affect the overall performance of the classifier.

Explicit Handling: Some implementations of Bernoulli Naive Bayes may include explicit handling of missing values, treating them as a separate category or introducing a separate parameter to model the probability of missing values.
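
As an illustration of mode imputation, the sketch below chains scikit-learn's `SimpleImputer(strategy="most_frequent")` with `BernoulliNB` in a pipeline. The toy data is hypothetical, and this is only one reasonable way to wire the two together, not a prescribed method.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical binary features with some missing entries (np.nan)
X = np.array([
    [1, 0, np.nan],
    [1, np.nan, 0],
    [0, 1, 1],
    [np.nan, 1, 1],
    [1, 0, 0],
    [0, 1, 1],
])
y = np.array([1, 1, 0, 0, 1, 0])

# Fill each missing entry with the column's most frequent value, then classify
model = make_pipeline(SimpleImputer(strategy="most_frequent"), BernoulliNB())
model.fit(X, y)

# The fitted imputer is reused at prediction time, so new rows may also contain NaN
X_new = np.array([[1, np.nan, 0]])
pred = model.predict(X_new)

print("Imputation values per column:", model.named_steps["simpleimputer"].statistics_)
print("Predicted class:", pred)
```

Keeping the imputer inside the pipeline means that, under cross-validation, the fill values are learned from the training folds only.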

Q4. Can Gaussian Naive Bayes be used for multi-class classification?

Yes, Gaussian Naive Bayes can be used for multi-class classification tasks. Gaussian Naive Bayes is a variant of the Naive Bayes algorithm that assumes the features follow a Gaussian (normal) distribution within each class. It's particularly useful when dealing with continuous features.

No special adaptation is needed for more than two classes: the classifier applies Bayes' theorem to estimate the probability of each class given the input features and then selects the class with the highest probability as the predicted class.

Here's how it works:

Model Training: During training, Gaussian Naive Bayes estimates the parameters (mean and variance) of the Gaussian distribution for each class based on the training data. For each feature in each class, it computes the mean and variance of the feature values.

Prediction: To predict the class label for a new instance, Gaussian Naive Bayes calculates the likelihood of the observed feature values under each class's Gaussian distribution. It then combines these likelihoods with the prior probabilities of the classes to compute the posterior probability of each class given the input features using Bayes' theorem. Finally, it selects the class with the highest posterior probability as the predicted class.
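
The training and prediction steps above can be sketched by hand with NumPy and compared against scikit-learn's `GaussianNB`. This sketch assumes the three-class Iris dataset bundled with scikit-learn as example data. The small variance term `eps` is added to mirror the library's default `var_smoothing=1e-9`, and the helper name `log_posterior` is hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)   # 3 classes, 4 continuous features
classes = np.unique(y)

# Training: class priors plus per-class, per-feature mean and variance
priors = np.array([np.mean(y == c) for c in classes])
means = np.array([X[y == c].mean(axis=0) for c in classes])
eps = 1e-9 * X.var(axis=0).max()    # smoothing term, assumed to match GaussianNB's default
variances = np.array([X[y == c].var(axis=0) for c in classes]) + eps

def log_posterior(X_new):
    """Unnormalized log posterior of each class, shape (n_samples, n_classes)."""
    diff = X_new[:, None, :] - means[None, :, :]
    log_lik = -0.5 * (
        np.log(2 * np.pi * variances)[None, :, :] + diff ** 2 / variances[None, :, :]
    ).sum(axis=2)                   # independence: sum log-likelihoods over features
    return log_lik + np.log(priors)[None, :]

# Prediction: pick the class with the highest posterior
manual_pred = classes[np.argmax(log_posterior(X), axis=1)]
sk_pred = GaussianNB().fit(X, y).predict(X)
agreement = np.mean(manual_pred == sk_pred)

print("Number of classes:", len(classes))
print("Agreement with GaussianNB:", agreement)
```

Nothing in the computation depends on the number of classes, which is why the same procedure covers the binary and multi-class cases.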

Q5. Assignment:
Data preparation:
Download the "Spambase Data Set" from the UCI Machine Learning Repository
(https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is
to predict whether a message is spam or not based on several input features.
Implementation:
Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the
dataset. You should use the default hyperparameters for each classifier.
Results:
Report the following performance metrics for each classifier:
Accuracy
Precision
Recall
F1 score
Discussion:
Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
the case? Are there any limitations of Naive Bayes that you observed?
Conclusion:
Summarise your findings and provide some suggestions for future work.



# Step 1: Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

# Step 2: Load the dataset (local copy of spambase.data, which has no header row)
path = "spambase.data"
names = [f"attr_{i}" for i in range(57)] + ['is_spam']
data = pd.read_csv(path, names=names)

# Step 3: Split features and target
X = data.drop('is_spam', axis=1)
y = data['is_spam']

# Step 4: Initialize classifiers with default hyperparameters
bernoulli_nb = BernoulliNB()
multinomial_nb = MultinomialNB()
gaussian_nb = GaussianNB()


# Step 5: Evaluate performance using 10-fold cross-validation
scoring = ['accuracy', 'precision', 'recall', 'f1']

bernoulli_scores = cross_validate(bernoulli_nb, X, y, cv=10, scoring=scoring)
multinomial_scores = cross_validate(multinomial_nb, X, y, cv=10, scoring=scoring)
gaussian_scores = cross_validate(gaussian_nb, X, y, cv=10, scoring=scoring)

# Step 6: Report performance metrics
print("Bernoulli Naive Bayes Metrics:")
print("Accuracy:", np.mean(bernoulli_scores['test_accuracy']))
print("Precision:", np.mean(bernoulli_scores['test_precision']))
print("Recall:", np.mean(bernoulli_scores['test_recall']))
print("F1 score:", np.mean(bernoulli_scores['test_f1']))
print("\n")

print("Multinomial Naive Bayes Metrics:")
print("Accuracy:", np.mean(multinomial_scores['test_accuracy']))
print("Precision:", np.mean(multinomial_scores['test_precision']))
print("Recall:", np.mean(multinomial_scores['test_recall']))
print("F1 score:", np.mean(multinomial_scores['test_f1']))
print("\n")

print("Gaussian Naive Bayes Metrics:")
print("Accuracy:", np.mean(gaussian_scores['test_accuracy']))
print("Precision:", np.mean(gaussian_scores['test_precision']))
print("Recall:", np.mean(gaussian_scores['test_recall']))
print("F1 score:", np.mean(gaussian_scores['test_f1']))

# Step 7: Discussion and conclusion

Bernoulli Naive Bayes Metrics:
Accuracy: 0.8839380364047911
Precision: 0.8869617393737383
Recall: 0.8152389047416673
F1 score: 0.8481249015095276


Multinomial Naive Bayes Metrics:
Accuracy: 0.7863496180326323
Precision: 0.7393175533565436
Recall: 0.7214983911116508
F1 score: 0.7282909724016348


Gaussian Naive Bayes Metrics:
Accuracy: 0.8217730830896915
Precision: 0.7103733928118492
Recall: 0.9569516119239877
F1 score: 0.8130660909542995
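
To support the discussion, the reported means can be placed side by side. The sketch below copies the scores from the output above (rounded to four decimals) into a pandas DataFrame and picks the best classifier per metric. The row labels and variable names are chosen here for illustration.

```python
import pandas as pd

# Mean 10-fold cross-validation scores copied from the output above (rounded)
results = pd.DataFrame(
    {
        "accuracy": [0.8839, 0.7863, 0.8218],
        "precision": [0.8870, 0.7393, 0.7104],
        "recall": [0.8152, 0.7215, 0.9570],
        "f1": [0.8481, 0.7283, 0.8131],
    },
    index=["BernoulliNB", "MultinomialNB", "GaussianNB"],
)

# Classifier with the highest score for each metric
best_per_metric = results.idxmax()

print(results)
print(best_per_metric)
```

On these numbers Bernoulli Naive Bayes leads in accuracy, precision, and F1, while Gaussian Naive Bayes has the highest recall and the lowest precision, meaning it catches more spam at the cost of flagging more legitimate messages.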
