Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?

Given:
Probability that an employee uses the health insurance plan: P(Uses Plan) = 0.70
Probability that an employee who uses the plan is a smoker: P(Smoker | Uses Plan) = 0.40

The question already states the required conditional probability, so the answer is 0.40. The formula for conditional probability confirms this:

P(Smoker | Uses Plan) = P(Smoker and Uses Plan) / P(Uses Plan)
                      = [P(Uses Plan) * P(Smoker | Uses Plan)] / P(Uses Plan)
                      = [0.70 * 0.40] / 0.70
                      = 0.28 / 0.70
                      = 0.40

So, the probability that an employee is a smoker given that he/she uses the health insurance plan is 0.40, or 40%.
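
As a quick numerical check of the arithmetic above, here is a minimal sketch in Python. The variable names are illustrative only; the two probabilities are the ones given in the question.

```python
# Values given in the question
p_plan = 0.70                # P(Uses Plan)
p_smoker_given_plan = 0.40   # P(Smoker | Uses Plan)

# Joint probability via the multiplication rule
p_smoker_and_plan = p_plan * p_smoker_given_plan

# Conditional probability recovered from the joint and the marginal
p_check = p_smoker_and_plan / p_plan

print(round(p_smoker_and_plan, 2))  # 0.28
print(round(p_check, 2))            # 0.4
```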

Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

The main differences between Bernoulli Naive Bayes and Multinomial Naive Bayes are:

1. Bernoulli Naive Bayes:
    - The Bernoulli model assumes that every feature is binary (0 or 1), for example whether a word is present in a document or absent from it. Each feature is modelled with a Bernoulli distribution, and the absence of a feature explicitly counts as evidence when scoring a class.
    
    - Example: It is often used for short texts or other binary feature sets, such as spam detection based only on whether particular words occur in an email, regardless of how many times they occur.

2. Multinomial Naive Bayes:
    - The Multinomial Naive Bayes classifier is used when the features are counts that follow a multinomial distribution. It is primarily used for document classification, that is, deciding which category (such as Sports, Politics, or Education) a document belongs to. The classifier uses word frequencies as its predictors, so how often a word occurs matters, not just whether it occurs.
    
    - Example: Text classification problems like spam email detection, sentiment analysis, or document categorization are typical applications of Multinomial Naive Bayes.
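
The contrast between the two models can be sketched with scikit-learn on a tiny, made-up word-count matrix. The data and class labels below are hypothetical and exist only to show what each model accumulates during fitting: with its default `binarize=0.0`, `BernoulliNB` reduces each count to present/absent, whereas `MultinomialNB` keeps the counts.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical word-count matrix: rows are documents, columns are words
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 2, 0],
])
y = np.array([0, 0, 1, 1])  # two made-up classes

# BernoulliNB binarizes its input (counts > 0 become 1 with binarize=0.0)
bnb = BernoulliNB().fit(X_counts, y)

# MultinomialNB works with the counts themselves
mnb = MultinomialNB().fit(X_counts, y)

# feature_count_ shows what each model accumulated per class:
# documents containing each word (Bernoulli) vs. total occurrences (Multinomial)
print(bnb.feature_count_)
print(mnb.feature_count_)
```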
    

Q3. How does Bernoulli Naive Bayes handle missing values?

Bernoulli Naive Bayes, like other Naive Bayes variants, has no built-in mechanism for missing values (scikit-learn's BernoulliNB, for instance, raises an error on NaN input), so they must be dealt with before training. Because its features are binary, the simplest option is to treat a missing feature as absent (0). The right approach, however, depends on the context and the nature of the data. Here are a few common strategies:

1. Missing Values as Absent (0): This is the simplest approach. You treat missing values as if the corresponding feature is absent (0). This assumes that the values are missing completely at random, so the fact that a value is missing carries no information about the class. The approach often works well when missing values are infrequent, but the model can no longer distinguish "missing" from "truly absent".

2. Imputation: Instead of treating missing values as absent, you can use imputation techniques to estimate or replace the missing values. For Bernoulli Naive Bayes, you might replace missing values with the mode (most common value) of the feature or use more sophisticated imputation methods based on the distribution of the data. Imputation can help retain some information from the missing values, but it may introduce bias if not done carefully.

3. Creating a "Missing" Category: In some cases, missing values may carry their own information. You can create a new category or level for each feature, specifically for missing values. Because Bernoulli Naive Bayes expects binary features, this is usually done by adding a separate 0/1 indicator feature that records whether the original value was missing. This way, you don't lose the fact that the data was missing, and the classifier can learn from this information if it is informative for the classification task.

4. Ignore Instances with Missing Values: Depending on the severity of missing data, you might choose to exclude instances with missing values from your analysis. This can be a valid strategy if the number of instances with missing data is relatively small, and you have enough data remaining for meaningful analysis.

It's important to note that the choice of how to handle missing values should depend on your specific problem, the amount of missing data, and the potential impact on the classification results.
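
The four strategies above can be sketched as follows. This is a minimal illustration on a made-up binary feature matrix, assuming missing entries are encoded as NaN; the array and variable names are hypothetical. Any of the resulting matrices could then be passed to a classifier such as `BernoulliNB`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical binary feature matrix with missing entries marked as NaN
X = np.array([
    [1.0, 0.0, np.nan],
    [1.0, np.nan, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, np.nan],
])

# Strategy 1: treat missing as absent (0)
X_zero = np.nan_to_num(X, nan=0.0)

# Strategy 2: impute with the most frequent value (mode) of each column
X_mode = SimpleImputer(strategy="most_frequent").fit_transform(X)

# Strategy 3: keep a separate 0/1 "was missing" indicator per feature
X_flagged = np.hstack([X_zero, np.isnan(X).astype(float)])

# Strategy 4: drop rows that contain any missing value
X_complete = X[~np.isnan(X).any(axis=1)]

print(X_zero.shape, X_mode.shape, X_flagged.shape, X_complete.shape)
```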

Q4. Can Gaussian Naive Bayes be used for multi-class classification?

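Yes, Gaussian Naive Bayes can be used for multi-class classification. Gaussian Naive Bayes is the Naive Bayes variant designed for continuous or real-valued features, which it assumes follow a Gaussian (normal) distribution within each class.

In multi-class classification, the goal is to classify instances into one of several possible classes or categories. Gaussian Naive Bayes handles this naturally: it estimates a prior probability and a per-feature mean and variance for every class, computes the posterior probability of each class for a new instance using Bayes' theorem, and predicts the class with the highest posterior. Nothing in this procedure limits the number of classes to two.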
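
As an illustration, here is a minimal sketch that fits scikit-learn's `GaussianNB` on a three-class problem. The Iris dataset is used purely as a convenient example and is not part of the original assignment.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Iris has three classes and four continuous features
X, y = load_iris(return_X_y=True)

clf = GaussianNB().fit(X, y)

# One mean vector is estimated per class
print(clf.classes_)       # the three class labels
print(clf.theta_.shape)   # (n_classes, n_features)

# predict_proba returns one posterior probability per class for each sample
proba = clf.predict_proba(X[:1])
print(proba.shape)
```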

Q5. Assignment:

Data preparation:
Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset describes 4,601 email messages, each by 57 numeric features (mostly word and character frequencies, plus statistics on runs of capital letters), and the goal is to predict whether a message is spam or not based on these input features.

Implementation:
Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the dataset. You should use the default hyperparameters for each classifier.

Results:
Report the following performance metrics for each classifier:
Accuracy
Precision
Recall
F1 score

Discussion:
Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is the case? Are there any limitations of Naive Bayes that you observed?

Conclusion:
Summarise your findings and provide some suggestions for future work.

Note: This dataset contains a binary classification problem with multiple features. The dataset is relatively small, but it can be used to demonstrate the performance of the different variants of Naive Bayes on a real-world problem.

In [1]:
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
import pandas as pd

# Load the dataset (spambase.data has no header row)
data = pd.read_csv("spambase.data", header=None)

# Split the data into features (all but the last column) and target (last column)
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values

# Initialize classifiers with default hyperparameters
classifiers = {
    "Bernoulli Naive Bayes": BernoulliNB(),
    "Multinomial Naive Bayes": MultinomialNB(),
    "Gaussian Naive Bayes": GaussianNB(),
}

# Metrics to report, keyed by the label used when printing
scoring = {
    "Accuracy": "accuracy",
    "Precision": "precision",
    "Recall": "recall",
    "F1 Score": "f1",
}

# Perform 10-fold cross-validation once per classifier, computing all
# four metrics on the same folds, and print the mean of each metric
for i, (name, clf) in enumerate(classifiers.items()):
    scores = cross_validate(clf, X, y, cv=10, scoring=scoring)
    print(("\n" if i else "") + name + ":")
    for label in scoring:
        print(f"{label}:", scores[f"test_{label}"].mean())


Bernoulli Naive Bayes:
Accuracy: 0.8839380364047911
Precision: 0.8869617393737383
Recall: 0.8152389047416673
F1 Score: 0.8481249015095276

Multinomial Naive Bayes:
Accuracy: 0.7863496180326323
Precision: 0.7393175533565436
Recall: 0.7214983911116508
F1 Score: 0.7282909724016348

Gaussian Naive Bayes:
Accuracy: 0.8217730830896915
Precision: 0.7103733928118492
Recall: 0.9569516119239877
F1 Score: 0.8130660909542995


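Bernoulli Naive Bayes performed the best in terms of accuracy, precision, and F1 score for this spam classification task. A plausible explanation is that scikit-learn's BernoulliNB binarizes each feature at 0 by default, so it only models whether a word or character occurs in an email. That presence/absence signal appears to be more robust here than the raw frequency values, which are skewed and are neither true counts (as Multinomial Naive Bayes assumes) nor normally distributed (as Gaussian Naive Bayes assumes).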

Limitations and Observations:

- Imbalanced Data: The dataset is moderately imbalanced (roughly 39% spam and 61% non-spam), which can impact the classifiers' performance. The high recall of Gaussian Naive Bayes, together with its low precision, suggests that the model predicts spam for many instances, capturing most actual spam emails at the cost of more false positives.

- Choice of Features: The choice of features and feature engineering can significantly affect the performance of Naive Bayes classifiers. There might be room for improvement by selecting more relevant features or, if the raw email text were available, by using techniques like TF-IDF for text-based features.

- Hyperparameter Tuning: These results are based on default hyperparameters. Hyperparameter tuning could potentially improve the performance of all three classifiers.

- Data Preprocessing: Depending on the quality of data preprocessing (e.g., handling missing values, text cleaning), the performance of the classifiers can vary.
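
The hyperparameter tuning point above could be explored along these lines. This is only a sketch: it uses synthetic stand-in data rather than Spambase so that it runs on its own, and the candidate values for `alpha` and `binarize` are arbitrary illustrations, not recommendations.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB

# Synthetic stand-in data (NOT Spambase): non-negative, frequency-like features
rng = np.random.default_rng(0)
X = rng.exponential(scale=1.0, size=(200, 10))
y = (X[:, 0] + X[:, 1] > 2.0).astype(int)

# Candidate values are arbitrary illustrations
param_grid = {
    "alpha": [0.1, 0.5, 1.0],      # additive smoothing strength
    "binarize": [0.0, 0.5, 1.0],   # threshold for mapping features to 0/1
}

# 5-fold cross-validated grid search, selecting by F1 score
search = GridSearchCV(BernoulliNB(), param_grid, cv=5, scoring="f1")
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

On the real dataset, `X` and `y` would be replaced by the Spambase features and labels, and the tuned model should be evaluated on data not used for the search.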

Conclusion:

- Bernoulli Naive Bayes outperformed the other two variants in terms of accuracy, precision, and F1 score. It achieved an accuracy of approximately 88.4%, indicating its ability to correctly classify a large portion of emails.

- Multinomial Naive Bayes had the lowest accuracy (about 78.6%) and F1 score (about 0.73) of the three. Its precision (about 0.74) was slightly higher than that of Gaussian Naive Bayes, but it could not match the Bernoulli variant on any metric.

- Gaussian Naive Bayes had the highest recall (about 0.96) but the lowest precision (about 0.71), giving an F1 score of about 0.81. It correctly identified most spam emails but also produced more false positives than the other two variants.

