#### Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?

##### To find the probability that an employee is a smoker given that he/she uses the health insurance plan, you can use conditional probability. The notation for this probability is P(Smoker | Uses insurance). The formula for conditional probability is:

**P(Smoker | Uses insurance) = P(Smoker ∩ Uses insurance) / P(Uses insurance)**


**From the information provided:**

- P(Uses insurance) is the probability that an employee uses the health insurance plan, which is 70% of all employees (0.7).

- "40% of the employees who use the plan are smokers" is already a statement about plan users only, so it gives the conditional probability directly: P(Smoker | Uses insurance) = 0.4.

- P(Smoker ∩ Uses insurance), the probability that an employee both smokes and uses the plan, is therefore 0.4 × 0.7 = 0.28.


**Plugging in the values:**
- P(Smoker | Uses insurance) = 0.28 / 0.7 = 0.4

**So, the probability that an employee is a smoker given that he/she uses the health insurance plan is 0.4, or 40%.**
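The arithmetic can be checked in a few lines of Python. This is a minimal sketch that reads the question's 40% figure as the share of smokers among plan users; the variable names are illustrative only.

```python
# Probabilities stated in the question
p_uses = 0.70                # P(Uses insurance)
p_smoker_given_uses = 0.40   # P(Smoker | Uses insurance): 40% of plan users smoke

# Joint probability via the multiplication rule
p_smoker_and_uses = p_smoker_given_uses * p_uses

# Conditional probability recovered from its definition
p_check = p_smoker_and_uses / p_uses

print(f"P(Smoker and Uses insurance) = {p_smoker_and_uses:.2f}")
print(f"P(Smoker | Uses insurance)   = {p_check:.2f}")
```

Dividing the joint probability by P(Uses insurance) returns the 40% that the question supplies, so the 70% figure is not needed to answer it.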

#### Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

#### Bernoulli Naive Bayes and Multinomial Naive Bayes are both variants of the Naive Bayes algorithm used for classification tasks, but they differ in terms of the type of data they are suitable for:

**Bernoulli Naive Bayes:**

- Suitable for binary or multivariate Bernoulli-distributed data.
- Assumes that features are binary variables (0 or 1), representing the presence or absence of a particular term or feature.
- Commonly used in text classification tasks, where the presence or absence of words in a document is considered.

**Multinomial Naive Bayes:**

- Suitable for discrete data, often used for document classification tasks.
- Assumes that features represent the frequency of words or other discrete data (integer counts), and it works well when the data can be modeled with a multinomial distribution.

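**In summary, the main difference lies in the type of data they handle: Bernoulli Naive Bayes is for binary presence/absence data, while Multinomial Naive Bayes is for discrete count data. Bernoulli Naive Bayes also explicitly models the absence of a feature, whereas in Multinomial Naive Bayes a feature with a zero count contributes nothing to the likelihood.**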
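The contrast can be seen on a tiny example. The sketch below uses a made-up word-count matrix (not data from the assignment) and relies on the fact that scikit-learn's `BernoulliNB` binarizes its input by default (`binarize=0.0`), while `MultinomialNB` uses the counts as given.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy word-count matrix: rows are documents, columns are vocabulary terms (made-up data)
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 2, 2],
])
y = np.array([0, 0, 1, 1])

# MultinomialNB models the counts themselves
mnb = MultinomialNB().fit(X_counts, y)

# BernoulliNB thresholds its input at 0 by default, so any count > 0 becomes 1
bnb = BernoulliNB().fit(X_counts, y)

# Same model fitted on explicitly binarized data, with internal binarization switched off
bnb_explicit = BernoulliNB(binarize=None).fit((X_counts > 0).astype(int), y)

# Both Bernoulli fits see identical presence/absence data
same = np.allclose(bnb.feature_log_prob_, bnb_explicit.feature_log_prob_)
print("Bernoulli fits agree:", same)

new_doc = np.array([[5, 0, 0]])
print("MultinomialNB:", mnb.predict(new_doc), "BernoulliNB:", bnb.predict(new_doc))
```

For the Bernoulli model, a term appearing five times is indistinguishable from one appearing once; only the Multinomial model uses the magnitude of the count.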

#### Q3. How does Bernoulli Naive Bayes handle missing values?

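**Bernoulli Naive Bayes has no built-in mechanism for missing values: the model expects every feature to be either present (1) or absent (0). Missing entries therefore have to be handled before or around the model. Common options are to impute them (for example, with the most frequent value of the feature), to leave the affected feature out of the likelihood product for that instance, or to add a separate binary indicator that flags the value as missing. The indicator approach is useful when values are not missing completely at random, because the missingness itself may carry some information.**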
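In scikit-learn specifically, `BernoulliNB` does not accept NaN inputs, so missing values must be dealt with before fitting. The following is one possible sketch, not the only approach: it imputes each gap with the column's most frequent value and uses `add_indicator=True` so that a binary "was missing" column is kept for each affected feature. The data is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Toy binary features with missing entries marked as NaN (made-up data)
X = np.array([
    [1.0, 0.0],
    [1.0, np.nan],
    [0.0, 1.0],
    [np.nan, 1.0],
    [1.0, 0.0],
    [0.0, 1.0],
])
y = np.array([1, 1, 0, 0, 1, 0])

# Fill each gap with the column's most frequent value and append
# binary "was missing" indicator columns so missingness stays visible
model = make_pipeline(
    SimpleImputer(strategy="most_frequent", add_indicator=True),
    BernoulliNB(),
)
model.fit(X, y)

# The pipeline applies the same imputation to new rows before predicting
print(model.predict(np.array([[np.nan, 1.0]])))
```

Because the indicator columns are themselves binary, they fit naturally into the Bernoulli model as extra presence/absence features.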

#### Q4. Can Gaussian Naive Bayes be used for multi-class classification?

**Yes. Gaussian Naive Bayes is suited to continuous features and supports multi-class classification without modification. For each class, the model estimates a prior together with a mean and variance for every feature. At prediction time it computes the posterior probability of each class and predicts the class with the highest one, whether there are two classes or many.**

- In summary, yes, Gaussian Naive Bayes handles multi-class classification directly: it fits one set of parameters per class and predicts the most probable class.
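As a quick illustration, the sketch below fits scikit-learn's `GaussianNB` on the three-class Iris dataset that ships with the library. Iris is used here only as a convenient multi-class example and is not part of the assignment.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Iris ships with scikit-learn: 150 samples, 4 continuous features, 3 classes
X, y = load_iris(return_X_y=True)

gnb = GaussianNB().fit(X, y)

print("Classes:", gnb.classes_)
# One row of feature means per class: shape is (n_classes, n_features)
print("Per-class feature means shape:", gnb.theta_.shape)

# predict_proba returns one posterior probability per class; predict picks the largest
proba = gnb.predict_proba(X[:1])
print("Posterior for first sample:", proba.round(3))
print("Predicted class:", gnb.predict(X[:1]))
```

No special multi-class option is needed: the number of classes is inferred from the labels passed to `fit`.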

#### Q5. Assignment:
**Data preparation:**
Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains numeric features extracted from email messages, where the goal is to predict whether a message is spam or not based on those input features.

**Implementation:**

- Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the dataset. You should use the default hyperparameters for each classifier.

**Results:**
- Report the following performance metrics for each classifier:
  - Accuracy
  - Precision
  - Recall
  - F1 score

**Discussion:**
- Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is the case? Are there any limitations of Naive Bayes that you observed? 

**Conclusion:**
- Summarise your findings and provide some suggestions for future work.

**Note: This dataset contains a binary classification problem with multiple features. The dataset is relatively small, but it can be used to demonstrate the performance of the different variants of Naive Bayes on a real-world problem.**

In [1]:
import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

# Load the dataset (the file has no header row)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
data = pd.read_csv(url, header=None)

# The last column is the target variable (1 = spam, 0 = not spam)
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

# Initialize classifiers with default hyperparameters
bernoulli_nb = BernoulliNB()
multinomial_nb = MultinomialNB()
gaussian_nb = GaussianNB()

# Evaluate performance using 10-fold cross-validation
def evaluate_classifier(classifier, X, y):
    # cross_validate fits each fold once and scores all four metrics on it,
    # instead of re-running the full cross-validation once per metric
    scores = cross_validate(classifier, X, y, cv=10,
                            scoring=['accuracy', 'precision', 'recall', 'f1'])
    accuracy = scores['test_accuracy'].mean()
    precision = scores['test_precision'].mean()
    recall = scores['test_recall'].mean()
    f1 = scores['test_f1'].mean()
    return accuracy, precision, recall, f1

# Get performance metrics for each classifier
accuracy_bernoulli, precision_bernoulli, recall_bernoulli, f1_bernoulli = evaluate_classifier(bernoulli_nb, X, y)
accuracy_multinomial, precision_multinomial, recall_multinomial, f1_multinomial = evaluate_classifier(multinomial_nb, X, y)
accuracy_gaussian, precision_gaussian, recall_gaussian, f1_gaussian = evaluate_classifier(gaussian_nb, X, y)

# Print results
print("Bernoulli Naive Bayes:")
print(f"Accuracy: {accuracy_bernoulli}, Precision: {precision_bernoulli}, Recall: {recall_bernoulli}, F1: {f1_bernoulli}")

print("\nMultinomial Naive Bayes:")
print(f"Accuracy: {accuracy_multinomial}, Precision: {precision_multinomial}, Recall: {recall_multinomial}, F1: {f1_multinomial}")

print("\nGaussian Naive Bayes:")
print(f"Accuracy: {accuracy_gaussian}, Precision: {precision_gaussian}, Recall: {recall_gaussian}, F1: {f1_gaussian}")


Bernoulli Naive Bayes:
Accuracy: 0.8839380364047911, Precision: 0.8869617393737383, Recall: 0.8152389047416673, F1: 0.8481249015095276

Multinomial Naive Bayes:
Accuracy: 0.7863496180326323, Precision: 0.7393175533565436, Recall: 0.7214983911116508, F1: 0.7282909724016348

Gaussian Naive Bayes:
Accuracy: 0.8217730830896915, Precision: 0.7103733928118492, Recall: 0.9569516119239877, F1: 0.8130660909542995
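
To make the comparison easier to read, the printed scores can be collected into a small table. The sketch below copies the mean scores from the output above, rounded to four decimals, and reports which classifier leads on each metric; the `results` name is just for illustration.

```python
import pandas as pd

# Mean 10-fold scores copied from the printed output, rounded to four decimals
results = pd.DataFrame(
    {
        "accuracy":  [0.8839, 0.7863, 0.8218],
        "precision": [0.8870, 0.7393, 0.7104],
        "recall":    [0.8152, 0.7215, 0.9570],
        "f1":        [0.8481, 0.7283, 0.8131],
    },
    index=["BernoulliNB", "MultinomialNB", "GaussianNB"],
)

print(results)

# Name of the best-scoring classifier for each metric
print(results.idxmax())
```

On these numbers, Bernoulli Naive Bayes has the highest accuracy, precision, and F1 score, while Gaussian Naive Bayes has the highest recall but the lowest precision.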
