# Assignment | 10th April 2023

Q1. A company conducted a survey of its employees and found that 70% of the employees use the
company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the
probability that an employee is a smoker given that he/she uses the health insurance plan?

Ans.

The question asks for P(S|H), the probability that an employee is a smoker given that they use the health insurance plan. This conditional probability is stated directly in the problem, so Bayes' theorem is not needed. Let's denote the following events:

S = the employee is a smoker

H = the employee uses the health insurance plan

From the given information, we have:

P(H) = 0.70 (70% of the employees use the health insurance plan)

P(S|H) = 0.40 (40% of the employees who use the plan are smokers)

Therefore:

P(S|H) = 0.40

As a check, the definition of conditional probability gives the joint probability:

P(S and H) = P(S|H) * P(H) = 0.40 * 0.70 = 0.28

Dividing back by P(H) recovers P(S|H) = 0.28 / 0.70 = 0.40.

Bayes' theorem, P(S|H) = (P(H|S) * P(S)) / P(H), would only be required if we were given P(H|S) and P(S) instead of P(S|H). Neither P(H|S) nor the overall smoking rate P(S) can be determined here, because the smoking rate among employees who do not use the plan, P(S|~H), is not provided. They are not needed to answer the question.
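The arithmetic can be checked in a few lines. This is a minimal sketch that uses only the two figures stated in the question; the variable names are chosen here for illustration. It shows that the conditional probability asked for is the 40% given in the problem statement, and that the joint probability of being a smoker and using the plan is 0.28.

```python
# Figures stated in the question.
p_h = 0.70          # P(H): employee uses the health insurance plan
p_s_given_h = 0.40  # P(S|H): plan user is a smoker

# Joint probability from the definition of conditional probability.
p_s_and_h = p_s_given_h * p_h

# Dividing the joint probability by P(H) recovers the conditional probability.
p_s_given_h_check = p_s_and_h / p_h

print(f"P(S and H) = {p_s_and_h:.2f}")
print(f"P(S|H)     = {p_s_given_h_check:.2f}")
```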

Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

Ans.

Bernoulli Naive Bayes and Multinomial Naive Bayes are two variants of the Naive Bayes algorithm, which is a popular and simple probabilistic classification algorithm.

The main difference between Bernoulli Naive Bayes and Multinomial Naive Bayes lies in the types of features they handle and the assumptions they make about the underlying data.

1. Bernoulli Naive Bayes:

- Bernoulli Naive Bayes is suitable for binary features, where each feature takes one of two values (usually 0 and 1).
- It models each feature as a Bernoulli variable and, like every Naive Bayes variant, assumes that the features are conditionally independent given the class variable.
- Because absence is modeled explicitly, a feature with value 0 also contributes to the likelihood, through 1 - P(feature = 1 | class).
- It works well with features that represent the presence or absence of certain characteristics. For example, in text classification, each feature could represent the presence or absence of a specific word in a document.

2. Multinomial Naive Bayes:

- Multinomial Naive Bayes is suitable for features that represent discrete counts, such as word frequencies or occurrence counts in text data.
- It assumes that the feature counts for each class are generated from a multinomial distribution.
- Like Bernoulli Naive Bayes, it assumes that the features are conditionally independent given the class variable.
- It is commonly used in text classification tasks, where documents are represented as vectors of word counts.

In summary, the key difference between Bernoulli Naive Bayes and Multinomial Naive Bayes lies in the type of features they handle and the assumptions they make about the data. Bernoulli Naive Bayes is suitable for binary features, while Multinomial Naive Bayes is suitable for discrete count features.
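The contrast can be seen on a tiny example. The sketch below uses a hypothetical document-term matrix (the data and variable names are made up for illustration) and scikit-learn's `BernoulliNB` and `MultinomialNB` with default settings. It shows that `MultinomialNB` uses the raw counts, while `BernoulliNB` thresholds them at zero by default and therefore only sees whether each word occurs.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical toy data: rows are documents, columns are counts of three words.
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 2, 2],
])
y_toy = np.array([1, 1, 0, 0])

# MultinomialNB consumes the raw counts.
mnb = MultinomialNB().fit(X_counts, y_toy)

# BernoulliNB binarizes at 0.0 by default, so it only sees presence/absence.
bnb_on_counts = BernoulliNB().fit(X_counts, y_toy)

# Binarizing by hand and disabling the internal threshold is equivalent.
X_binary = (X_counts > 0).astype(int)
bnb_on_binary = BernoulliNB(binarize=None).fit(X_binary, y_toy)

# A new document in which the first word occurs five times.
x_new = np.array([[5, 0, 0]])
print("Multinomial:", mnb.predict_proba(x_new))
print("Bernoulli:  ", bnb_on_counts.predict_proba(x_new))
```

For the Bernoulli model, a count of 5 and a count of 1 are the same input, whereas the multinomial model's probabilities change with the count.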

Q3. How does Bernoulli Naive Bayes handle missing values?

Ans.

Bernoulli Naive Bayes has no built-in mechanism for missing values. Like the other Naive Bayes variants in scikit-learn, it expects complete data, so missing values must be handled before training. There are a few approaches:

1. Deleting instances with missing values: One straightforward approach is to remove instances (rows) that have missing values. This can be done if the missing values are relatively few and random, without significant bias. However, this approach can lead to information loss if the deleted instances contain valuable information for the classification task.

2. Imputation: Another approach is to replace missing values with suitable substitutes. Because Bernoulli Naive Bayes works with binary features, each missing value is imputed as either 0 or 1. The most common choice is the mode (most frequent value) of the feature. The mean is not directly usable because it is generally not 0 or 1 and would have to be rounded or thresholded. More sophisticated options include regression-based imputation and K-nearest neighbors imputation.

The right choice depends on the characteristics of the dataset, the amount of missing data, and the nature of the missingness (for example, whether values are missing at random). It is worth checking how each strategy affects the results before settling on one.
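Mode imputation can be sketched with scikit-learn's `SimpleImputer` placed in front of `BernoulliNB` in a pipeline. The data below is a hypothetical toy matrix with `np.nan` marking missing entries, and the variable names are illustrative only; this is one possible setup, not the only way to do it.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical binary feature matrix; np.nan marks a missing value.
X_missing = np.array([
    [1, 0, np.nan],
    [1, np.nan, 0],
    [0, 1, 1],
    [np.nan, 1, 1],
    [1, 0, 0],
    [0, 1, 1],
])
y_missing = np.array([1, 1, 0, 0, 1, 0])

# Replace each missing value with the most frequent value of its column,
# which keeps every feature binary.
imputer = SimpleImputer(strategy="most_frequent")
X_filled = imputer.fit_transform(X_missing)
print(X_filled)

# Putting the imputer in a pipeline means the column modes are learned from
# the training data only, which matters when cross-validating.
model = make_pipeline(SimpleImputer(strategy="most_frequent"), BernoulliNB())
model.fit(X_missing, y_missing)
predictions = model.predict(X_missing)
print(predictions)
```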

Q4. Can Gaussian Naive Bayes be used for multi-class classification?

Ans.

Yes, Gaussian Naive Bayes can be used for multi-class classification tasks. Naive Bayes algorithms, including Gaussian Naive Bayes, handle multiple classes natively: Bayes' theorem yields a posterior probability for every class, so no one-vs-all (one-vs-rest) decomposition is needed.

In the context of multi-class classification, Gaussian Naive Bayes assumes that the features follow a Gaussian (normal) distribution within each class. To train a Gaussian Naive Bayes classifier for multi-class problems, the following steps can be followed:

1. Training:

- For each class in the dataset, calculate the class prior probability, which is the proportion of instances belonging to that class in the training set.
- Estimate the mean and variance of each feature within each class, using only the instances that belong to that class.

2. Prediction:

- Given a new instance with feature values, calculate the posterior probability for each class using Bayes' theorem, which involves multiplying the class prior probability by the likelihood of the features given the class (estimated using the Gaussian distribution parameters).
- The class with the highest posterior probability is assigned as the predicted class for the instance.

A single model therefore covers all classes: it stores one prior and one set of per-feature means and variances for each class, and prediction simply compares the posteriors across all of them. The procedure is identical for two classes or for many.

It's worth noting that Gaussian Naive Bayes assumes independence between features given the class, which might not always hold in real-world scenarios. Nevertheless, it is a simple and computationally efficient algorithm that can be used for multi-class classification tasks.
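The training and prediction steps above can be sketched directly in NumPy and compared with scikit-learn's `GaussianNB`. The three-class dataset below is synthetic and generated only for illustration, and the helper `log_posterior` is a hypothetical name. The hand-written version omits the small variance smoothing that scikit-learn applies, so it is an approximation of the library's behavior rather than a copy of it.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic three-class data: one well-separated 2-D cluster per class.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
X_demo = np.vstack([rng.normal(c, 1.0, size=(50, 2)) for c in centers])
y_demo = np.repeat([0, 1, 2], 50)

# Training: one prior and one mean/variance per feature for each class.
classes = np.unique(y_demo)
priors = np.array([np.mean(y_demo == c) for c in classes])
means = np.array([X_demo[y_demo == c].mean(axis=0) for c in classes])
variances = np.array([X_demo[y_demo == c].var(axis=0) for c in classes])

def log_posterior(X):
    """Unnormalized log posterior, shape (n_samples, n_classes)."""
    diff = X[:, None, :] - means[None, :, :]
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                      + diff ** 2 / variances[None, :, :])
    # Independence assumption: sum the per-feature log-likelihoods.
    return np.log(priors)[None, :] + log_lik.sum(axis=2)

# Prediction: pick the class with the highest posterior.
manual_pred = classes[np.argmax(log_posterior(X_demo), axis=1)]

# The same data fitted with a single scikit-learn model, no one-vs-rest wrapper.
sk_pred = GaussianNB().fit(X_demo, y_demo).predict(X_demo)
print("Agreement with scikit-learn:", np.mean(manual_pred == sk_pred))
```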

Q5. Assignment:

1.  Data preparation:
Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is to predict whether a message
is spam or not based on several input features.

2. Implementation:
Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the
dataset. You should use the default hyperparameters for each classifier.

3. Results:
Report the following performance metrics for each classifier:
- Accuracy
- Precision
- Recall
- F1 score

4. Discussion:
Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
the case? Are there any limitations of Naive Bayes that you observed?

5. Conclusion:
Summarise your findings and provide some suggestions for future work.

Note: This dataset contains a binary classification problem with multiple features. The dataset is
relatively small, but it can be used to demonstrate the performance of the different variants of Naive
Bayes on a real-world problem.

In [1]:
# Import the necessary libraries:

import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.model_selection import cross_val_score

In [2]:
# Load the dataset:

data = np.loadtxt('spambase.data', delimiter=',')
X = data[:, :-1]
y = data[:, -1]

In [3]:
# Initialize the classifiers:

bernoulli_nb = BernoulliNB()
multinomial_nb = MultinomialNB()
gaussian_nb = GaussianNB()

In [4]:
# Perform 10-fold cross-validation and compute the performance metrics for each classifier:

from sklearn.model_selection import cross_validate

def evaluate_classifier(classifier, X, y):
    # Fit each fold once and score all four metrics on the same 10 folds,
    # instead of re-running cross-validation separately for every metric.
    scores = cross_validate(classifier, X, y, cv=10,
                            scoring=('accuracy', 'precision', 'recall', 'f1'))

    return (np.mean(scores['test_accuracy']), np.mean(scores['test_precision']),
            np.mean(scores['test_recall']), np.mean(scores['test_f1']))

# Evaluate Bernoulli Naive Bayes
accuracy_b, precision_b, recall_b, f1_b = evaluate_classifier(bernoulli_nb, X, y)

# Evaluate Multinomial Naive Bayes
accuracy_m, precision_m, recall_m, f1_m = evaluate_classifier(multinomial_nb, X, y)

# Evaluate Gaussian Naive Bayes
accuracy_g, precision_g, recall_g, f1_g = evaluate_classifier(gaussian_nb, X, y)


In [5]:
# Print the performance metrics:

print("Bernoulli Naive Bayes:")
print("Accuracy:", accuracy_b)
print("Precision:", precision_b)
print("Recall:", recall_b)
print("F1 Score:", f1_b)

print("\nMultinomial Naive Bayes:")
print("Accuracy:", accuracy_m)
print("Precision:", precision_m)
print("Recall:", recall_m)
print("F1 Score:", f1_m)

print("\nGaussian Naive Bayes:")
print("Accuracy:", accuracy_g)
print("Precision:", precision_g)
print("Recall:", recall_g)
print("F1 Score:", f1_g)


Bernoulli Naive Bayes:
Accuracy: 0.8839380364047911
Precision: 0.8869617393737383
Recall: 0.8152389047416673
F1 Score: 0.8481249015095276

Multinomial Naive Bayes:
Accuracy: 0.7863496180326323
Precision: 0.7393175533565436
Recall: 0.7214983911116508
F1 Score: 0.7282909724016348

Gaussian Naive Bayes:
Accuracy: 0.8217730830896915
Precision: 0.7103733928118492
Recall: 0.9569516119239877
F1 Score: 0.8130660909542995


- Discussion:

The three variants make different assumptions about the features. Bernoulli Naive Bayes models the presence or absence of each feature with a Bernoulli distribution, so it suits binary data such as whether a certain word is present in an email. Multinomial Naive Bayes assumes a multinomial distribution and suits discrete data such as word counts. Gaussian Naive Bayes assumes that each feature is normally distributed within a class and suits continuous numerical data.

Bernoulli Naive Bayes performed best overall, with the highest accuracy (0.884), precision (0.887), and F1 score (0.848). Gaussian Naive Bayes had the highest recall (0.957) but the lowest precision (0.710), so it catches most spam at the cost of flagging many legitimate emails. Multinomial Naive Bayes had the lowest accuracy (0.786), recall (0.721), and F1 score (0.728). A likely explanation is that the Spambase features are mostly word and character frequencies, and what matters most is whether a word occurs at all. With its default settings, scikit-learn's BernoulliNB thresholds each feature at zero, so it works on exactly this presence/absence signal. The features are neither integer counts nor close to normally distributed, which probably hurts the Multinomial and Gaussian variants.

The main limitation of Naive Bayes is its assumption that features are independent given the class, which rarely holds for words in real emails. Naive Bayes can also struggle with sparse data, because its probability estimates are based on observed frequencies. In addition, each variant depends on its distributional assumption: the gap between the three sets of scores on the same data shows how much a mismatch between that assumption and the actual features can cost.

- Conclusion:

In conclusion, we implemented Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers on the spambase dataset and evaluated them with 10-fold cross-validation using default hyperparameters. Bernoulli Naive Bayes gave the best accuracy (0.884), precision (0.887), and F1 score (0.848), while Gaussian Naive Bayes gave the highest recall (0.957) at the cost of low precision (0.710). The results also illustrate the limitations of Naive Bayes, in particular its feature-independence assumption and its sensitivity to the assumed feature distribution, which should be considered when applying the algorithm to real-world problems. Suggestions for future work include tuning the hyperparameters that were left at their defaults, transforming the features to better match each variant's assumptions, and comparing against other classification algorithms and feature engineering techniques.
