# Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?

The question asks for a conditional probability that the survey already reports, so Bayes' theorem is not needed. Let A be the event that an employee is a smoker and B be the event that an employee uses the company's health insurance plan. We want to find the probability of A given B, i.e., P(A|B).

The survey gives two facts:

P(B) = 0.7

P(A|B) = 0.4

The second figure, "40% of the employees who use the plan are smokers", is measured among plan users only, so it is P(A|B) itself and not P(B|A). Treating it as P(B|A) would require P(A), which the survey does not give, and no assumption about P(A) is needed.

As a check, the definition of conditional probability gives the same value. The fraction of all employees who both smoke and use the plan is:

P(A and B) = P(A|B) * P(B) = 0.4 * 0.7 = 0.28

Dividing by P(B) recovers the conditional probability:

P(A|B) = P(A and B) / P(B) = 0.28 / 0.7 = 0.4

Therefore, the probability that an employee is a smoker given that he/she uses the health insurance plan is **0.4**.
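
Because the 40% figure is measured among plan users, it can be read as P(A|B) directly. The short sketch below checks this with exact fractions by forming the joint probability and dividing by P(B) again. The variable names are illustrative and not part of the original problem.

```python
from fractions import Fraction

# Figures reported by the survey
p_plan = Fraction(7, 10)               # P(B): employee uses the plan
p_smoker_given_plan = Fraction(4, 10)  # P(A|B): smoker among plan users

# Joint probability that an employee smokes AND uses the plan
p_smoker_and_plan = p_smoker_given_plan * p_plan

# Definition of conditional probability: P(A|B) = P(A and B) / P(B)
p_check = p_smoker_and_plan / p_plan

print(float(p_smoker_and_plan))  # 0.28
print(float(p_check))            # 0.4
```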

# Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

Bernoulli Naive Bayes and Multinomial Naive Bayes are two variants of the Naive Bayes algorithm that are used for classification tasks. The main difference between the two is the type of data they are designed to handle ¹.

Bernoulli Naive Bayes is used for **discrete data** where the features are only in binary form (i.e., 0 or 1) ¹. It is commonly used in text classification problems where the presence or absence of a word is used as a feature ¹. Bernoulli Naive Bayes explicitly models the presence/absence of a feature ³.

Multinomial Naive Bayes, on the other hand, is used for **discrete data** where the features are counts (i.e., non-negative integer values) ¹. It is commonly used in text classification problems where the frequency of a word is used as a feature ¹. Its likelihood depends only on how often each feature occurs, so features that do not occur contribute nothing, whereas Bernoulli Naive Bayes penalizes absent features explicitly ³.

In summary, Bernoulli Naive Bayes is used when the features are binary, while Multinomial Naive Bayes is used when the features are counts ¹. 
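
The contrast can be made concrete with the two class-conditional log-likelihoods. The following is a minimal sketch in plain Python, not scikit-learn's implementation; the function names, the three-word vocabulary, and the probability values are all hypothetical.

```python
import math

def bernoulli_log_likelihood(present, p):
    """Sum log P(x_i | class) over ALL words; absent words contribute log(1 - p_i)."""
    return sum(math.log(pi) if xi else math.log(1 - pi)
               for xi, pi in zip(present, p))

def multinomial_log_likelihood(counts, theta):
    """Sum count_i * log(theta_i); words with a zero count contribute nothing.
    The multinomial coefficient is omitted because it does not depend on the class."""
    return sum(c * math.log(t) for c, t in zip(counts, theta))

def to_presence(counts):
    """Reduce word counts to 0/1 presence indicators."""
    return [1 if c > 0 else 0 for c in counts]

# Hypothetical parameters for one class over a three-word vocabulary
p = [0.8, 0.1, 0.5]      # P(word appears at least once | class)
theta = [0.5, 0.1, 0.4]  # P(next word is word i | class), sums to 1

doc_a = [3, 0, 1]  # word counts
doc_b = [1, 0, 1]  # same words present, fewer repeats

# Bernoulli sees only presence/absence, so both documents score the same
print(bernoulli_log_likelihood(to_presence(doc_a), p) ==
      bernoulli_log_likelihood(to_presence(doc_b), p))  # True

# Multinomial uses the counts, so the scores differ
print(multinomial_log_likelihood(doc_a, theta) ==
      multinomial_log_likelihood(doc_b, theta))  # False
```

Raising the probability of the absent second word (for example from 0.1 to 0.9) lowers the Bernoulli score, which is what "explicitly models absence" means in practice.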


# Q3. How does Bernoulli Naive Bayes handle missing values?

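Bernoulli Naive Bayes has no built-in rule for missing values, so the handling depends on the implementation. In some implementations, **missing values (NAs) are omitted** when constructing the probability tables, and the corresponding predict function excludes all NAs from the calculation of posterior probabilities ¹. More generally, when training a naive Bayes classifier, you can choose to either omit records with any missing values or omit only the missing attributes, which means skipping that feature's term when calculating probabilities for the affected record ¹. scikit-learn's BernoulliNB does neither: it raises an error when the input contains NaN, so missing values must be imputed or the affected rows dropped before fitting.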
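
For scikit-learn specifically, the sketch below shows one possible workaround: imputing each missing entry with the most frequent value of its column before fitting. The tiny array is made up for illustration, and most-frequent imputation is only one of several reasonable choices.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Made-up binary features with two missing entries
X = np.array([[1, 0, 1],
              [0, np.nan, 1],
              [1, 1, 0],
              [0, 0, np.nan]], dtype=float)
y = np.array([1, 0, 1, 0])

# Fitting directly on data containing NaN is rejected
try:
    BernoulliNB().fit(X, y)
    raised = False
except ValueError:
    raised = True
print(raised)  # True

# Impute first, then fit; the pipeline applies the same imputation at predict time
model = make_pipeline(SimpleImputer(strategy="most_frequent"), BernoulliNB())
model.fit(X, y)
predictions = model.predict(X)
```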

# Q4. Can Gaussian Naive Bayes be used for multi-class classification?

Yes, Gaussian Naive Bayes can be used for multi-class classification ¹. In scikit-learn, the GaussianNB class implements the Gaussian Naive Bayes algorithm for classification tasks ¹. It can be used for both binary and multi-class classification problems ¹. 

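In multi-class classification, the algorithm fits a Gaussian distribution (a mean and a variance) to each feature within each class, combines these likelihoods with the class prior to obtain a posterior probability for every class, and then selects the class with the highest posterior probability ².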
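
As a small illustration, the sketch below fits GaussianNB on the three-class iris dataset that ships with scikit-learn. The dataset is chosen here only for convenience and is not part of the original question.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Iris has 150 samples, 4 continuous features, and 3 classes
X, y = load_iris(return_X_y=True)

clf = GaussianNB().fit(X, y)

posteriors = clf.predict_proba(X)  # one posterior probability per class
predictions = clf.predict(X)       # class with the highest posterior

print(clf.classes_)      # [0 1 2]
print(clf.theta_.shape)  # (3, 4): one mean per class and feature
print(posteriors.shape)  # (150, 3)
```

No multi-class wrapper such as one-vs-rest is involved: the same fit and predict calls work for two classes or for many.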


# Q5. Assignment:

- Data preparation:

Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is to predict whether a message is spam or not based on several input features.

- Implementation:

Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the dataset. You should use the default hyperparameters for each classifier.

- Results:

Report the following performance metrics for each classifier: accuracy, precision, recall, and F1 score.

- Discussion:

Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is the case? Are there any limitations of Naive Bayes that you observed?

- Conclusion:

Summarise your findings and provide some suggestions for future work.

To implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the scikit-learn library in Python, you can follow these steps:

1. Download the Spambase dataset from the UCI Machine Learning Repository.
2. Load the dataset into Python using pandas.
3. Split the dataset into training and testing sets.
4. Train the Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the training set.
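5. Evaluate each classifier on the testing set and run 10-fold cross-validation. The snippet below runs the cross-validation on the testing split only; running it on the full dataset would use all of the data and match the assignment more closely.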
6. Report the following performance metrics for each classifier: Accuracy, Precision, Recall, and F1 score.

Here are some code snippets to get you started:

import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Load the dataset: 57 feature columns followed by the 0/1 spam label, no header row
data = pd.read_csv('spambase.data', header=None)
X, y = data.iloc[:, :-1], data.iloc[:, -1]

# Split the dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The three classifiers, all with default hyperparameters
classifiers = {
    'Bernoulli Naive Bayes': BernoulliNB(),
    'Multinomial Naive Bayes': MultinomialNB(),
    'Gaussian Naive Bayes': GaussianNB(),
}

for name, clf in classifiers.items():
    # Train on the training set and predict the testing set once
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    # cross_val_score clones the classifier and refits it on folds of the data
    # it is given, so these are 10-fold accuracies within the testing split only
    cv_scores = cross_val_score(clf, X_test, y_test, cv=10)

    # Report the performance metrics for this classifier
    print(f'{name}:')
    print('Accuracy:', accuracy_score(y_test, y_pred))
    print('Precision:', precision_score(y_test, y_pred))
    print('Recall:', recall_score(y_test, y_pred))
    print('F1 score:', f1_score(y_test, y_pred))
    print('Cross-validation scores:', cv_scores)
    print()

Bernoulli Naive Bayes:
Accuracy: 0.8805646036916395
Precision: 0.9069767441860465
Recall: 0.8
F1 score: 0.8501362397820164
Cross-validation scores: [0.88172043 0.84782609 0.81521739 0.91304348 0.90217391 0.88043478
 0.93478261 0.88043478 0.92391304 0.89130435]

Multinomial Naive Bayes:
Accuracy: 0.7861020629750272
Precision: 0.7643835616438356
Recall: 0.7153846153846154
F1 score: 0.7390728476821192
Cross-validation scores: [0.79569892 0.7173913  0.79347826 0.73913043 0.83695652 0.84782609
 0.81521739 0.79347826 0.80434783 0.80434783]

Gaussian Naive Bayes:
Accuracy: 0.8208469055374593
Precision: 0.7192982456140351
Recall: 0.9461538461538461
F1 score: 0.8172757475083057
Cross-validation scores: [0.79569892 0.79347826 0.86956522 0.85869565 0.86956522 0.83695652
 0.85869565 0.83695652 0.84782609 0.80434783]
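
The assignment asks for all four metrics under 10-fold cross-validation. One way to get them in a single pass is scikit-learn's cross_validate with a list of scorers, applied to the full dataset. The sketch below is one possible arrangement, not the code that produced the numbers above: the helper name evaluate_naive_bayes is hypothetical, and the demo data is a synthetic stand-in so the block runs without the Spambase file.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

def evaluate_naive_bayes(X, y, cv=10):
    """Return the mean cross-validated accuracy, precision, recall and F1
    for each Naive Bayes variant, using default hyperparameters."""
    scoring = ["accuracy", "precision", "recall", "f1"]
    classifiers = {
        "Bernoulli Naive Bayes": BernoulliNB(),
        "Multinomial Naive Bayes": MultinomialNB(),
        "Gaussian Naive Bayes": GaussianNB(),
    }
    results = {}
    for name, clf in classifiers.items():
        # An integer cv with a classifier gives stratified folds
        scores = cross_validate(clf, X, y, cv=cv, scoring=scoring)
        results[name] = {m: float(scores[f"test_{m}"].mean()) for m in scoring}
    return results

# Synthetic stand-in data (NOT Spambase): non-negative counts whose rate
# depends on the label, so MultinomialNB accepts them
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
X_demo = rng.poisson(lam=1.0 + 2.0 * y_demo[:, None], size=(200, 5)).astype(float)

results = evaluate_naive_bayes(X_demo, y_demo)
for name, metrics in results.items():
    print(name, metrics)
```

With the real data, the call would be evaluate_naive_bayes(data.iloc[:, :-1], data.iloc[:, -1]) after loading spambase.data as in the snippet above. The resulting scores would differ from the test-split figures reported here.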


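On the held-out test set, Bernoulli Naive Bayes performed best, with the highest accuracy (0.881), precision (0.907), and F1 score (0.850). Gaussian Naive Bayes had the highest recall (0.946) but the lowest precision (0.719), so it catches most spam at the cost of many false positives. Multinomial Naive Bayes was the weakest overall (accuracy 0.786, F1 score 0.739).

In general, the performance of each variant of Naive Bayes depends on the nature of the data and the problem at hand. Bernoulli Naive Bayes is designed for binary data, Multinomial Naive Bayes for count data, and Gaussian Naive Bayes for continuous data. The Spambase features are word and character frequencies expressed as percentages, plus three capital-letter run-length statistics. With its default settings, scikit-learn's BernoulliNB binarizes each feature at zero, so it only uses whether a word or character appears at all, which seems to be a robust signal here. The frequencies tend to be mostly zero and strongly skewed, which fits the Gaussian assumption poorly, and the run-length features are on a much larger scale than the percentages, which may distort the Multinomial model.

One limitation of Naive Bayes is that it assumes that the features are independent given the class, which may not always be the case in real-world problems; word frequencies in email are often correlated. Possible future work includes binarizing or log-transforming the features before fitting the Multinomial and Gaussian variants, and running the 10-fold cross-validation on the full dataset rather than on the test split.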