### [Q1.] A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?
##### [Ans]

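Given:

The probability that an employee uses the company's health insurance plan is:
$$
P(\text{Insurance}) = 0.7
$$
The probability that an employee is a smoker, given that they use the plan, is:
$$
P(\text{Smoker} \mid \text{Insurance}) = 0.4
$$
The question states this conditional probability directly ("40% of the employees who use the plan are smokers"), so the answer is 0.4. The value $P(\text{Insurance}) = 0.7$ is not needed to answer it.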
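As a quick sanity check, the sketch below derives the joint probability from the two given values and then recovers the conditional probability from its definition. The variable names are illustrative, and exact fractions are used only to avoid floating-point rounding.

```python
from fractions import Fraction

# Values taken from the question.
p_insurance = Fraction(7, 10)               # P(Insurance)
p_smoker_given_insurance = Fraction(4, 10)  # P(Smoker | Insurance)

# Joint probability: P(Smoker and Insurance) = P(Smoker | Insurance) * P(Insurance)
p_smoker_and_insurance = p_smoker_given_insurance * p_insurance

# Recover the conditional from the definition P(A | B) = P(A and B) / P(B).
recovered = p_smoker_and_insurance / p_insurance

print(float(p_smoker_and_insurance))  # 0.28
print(float(recovered))               # 0.4
```

So 28% of all employees both use the plan and smoke, and dividing by the 70% who use the plan gives back 0.4.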

### [Q2.] What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?
##### [Ans]

| Aspect | Bernoulli Naive Bayes | Multinomial Naive Bayes |
| ------ | --------------------- | ----------------------- |
| Data Type | Works with binary features (0 or 1). | Works with count-based features (non-negative integers or frequencies). |
| Feature Assumption | Models each feature as present (1) or absent (0); the absence of a feature also contributes to the likelihood. | Models each feature as a count or frequency; only how often a feature occurs matters. |
| Use Case | Text classification where word presence/absence is what matters, such as spam detection on short messages. | Text classification with word counts, such as topic classification or sentiment analysis. |
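The difference in input representation can be sketched on a toy corpus. The documents and labels below are made up for illustration; `CountVectorizer(binary=True)` is used to produce presence/absence features for `BernoulliNB`, while the default `CountVectorizer()` produces raw counts for `MultinomialNB`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy corpus (an assumption for illustration): 1 = spam, 0 = not spam.
docs = [
    "free free free prize",
    "meeting at noon",
    "free prize inside",
    "lunch meeting today",
]
labels = [1, 0, 1, 0]

# Multinomial NB sees how often each word occurs.
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(docs)
mnb = MultinomialNB().fit(X_counts, labels)

# Bernoulli NB sees only whether each word occurs.
binary_vec = CountVectorizer(binary=True)
X_binary = binary_vec.fit_transform(docs)
bnb = BernoulliNB().fit(X_binary, labels)

new_doc = ["free prize"]
mnb_pred = mnb.predict(count_vec.transform(new_doc))
bnb_pred = bnb.predict(binary_vec.transform(new_doc))
print(X_counts.max(), X_binary.max())
print(mnb_pred, bnb_pred)
```

In the count matrix the repeated word "free" is stored as 3, whereas the binary matrix records it as 1, which is exactly the distinction in the table above.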

### [Q3.] How does Bernoulli Naive Bayes handle missing values?
##### [Ans]

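Bernoulli Naive Bayes does not inherently handle missing values; scikit-learn's `BernoulliNB`, for example, raises an error if the input contains NaN. Missing values therefore have to be dealt with before fitting. Common strategies include: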

- Imputation: Replacing missing values with the most common value (mode) or based on domain knowledge.
- Data Preparation: Dropping rows or columns with missing values before applying the classifier.
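The imputation strategy can be sketched with a scikit-learn pipeline. The small array below is made-up data, and `SimpleImputer(strategy="most_frequent")` is one possible choice of mode imputation, not the only one.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Toy binary features with missing entries (made-up data for illustration).
X = np.array(
    [
        [1, 0, np.nan],
        [0, 1, 1],
        [1, np.nan, 0],
        [0, 1, 1],
    ],
    dtype=float,
)
y = np.array([1, 0, 1, 0])

# Replace each NaN with the most frequent value of its column, then fit the classifier.
model = make_pipeline(SimpleImputer(strategy="most_frequent"), BernoulliNB())
model.fit(X, y)

# Inspect what the imputer filled in.
X_imputed = model.named_steps["simpleimputer"].transform(X)
print(X_imputed)
print(model.predict(X))
```

Keeping the imputer inside the pipeline means the column modes are learned from the training data only, which matters once the model is evaluated with a train/test split or cross-validation.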

### [Q4.] Can Gaussian Naive Bayes be used for multi-class classification?
##### [Ans]

Yes, Gaussian Naive Bayes can be used for multi-class classification, and it handles multiple classes natively: no one-vs-rest decomposition is needed. For each class, the model estimates a prior probability and a Gaussian distribution (mean and variance) per feature, and scikit-learn's `GaussianNB` fits all classes in a single model. At prediction time it returns the class with the highest posterior probability.
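A minimal sketch of multi-class use, assuming the three-class Iris dataset bundled with scikit-learn as example data. A single fitted `GaussianNB` stores one row of feature means per class in `theta_` and returns one posterior probability per class.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Iris has 3 classes and 4 numeric features (used here only as example data).
X, y = load_iris(return_X_y=True)

clf = GaussianNB().fit(X, y)

print(clf.classes_)      # the class labels seen during fit
print(clf.theta_.shape)  # (n_classes, n_features): per-class feature means

# One posterior probability per class for each sample.
proba = clf.predict_proba(X[:5])
pred = clf.predict(X[:5])
print(proba.shape)
print(pred)
```

The predicted label for each sample is the class whose column in `proba` is largest.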

### [Q5.] Assignment:
- **Data preparation:**

Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is to predict whether a message is spam or not based on several input features.<br>
- **Implementation:**

Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the
dataset. You should use the default hyperparameters for each classifier.<br>
- **Results:**

Report the following performance metrics for each classifier:<br>
1. Accuracy<br>
2. Precision<br>
3. Recall<br>
4. F1 score<br>
- **Discussion:**

Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
the case? Are there any limitations of Naive Bayes that you observed?
- **Conclusion:**

Summarise your findings and provide some suggestions for future work.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Load Spambase: 57 feature columns followed by the label (1 = spam, 0 = not spam).
data = pd.read_csv("spambase.data", header=None)
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

# Hold out 30% of the rows as a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# All three classifiers use scikit-learn's default hyperparameters.
classifiers = {
    "BernoulliNB": BernoulliNB(),
    "MultinomialNB": MultinomialNB(),
    "GaussianNB": GaussianNB(),
}

# Fit each classifier on the training split and score it on the held-out split.
results = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    results[name] = {
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred),
        "Recall": recall_score(y_test, y_pred),
        "F1 Score": f1_score(y_test, y_pred),
    }

for name, metrics in results.items():
    print(f"\n{name} Performance :")
    for metric, value in metrics.items():
        print(f"{metric} : {value:.4f}")
```


```
BernoulliNB Performance :
Accuracy : 0.8791
Precision : 0.8883
Recall : 0.8128
F1 Score : 0.8489

MultinomialNB Performance :
Accuracy : 0.7820
Precision : 0.7624
Recall : 0.6950
F1 Score : 0.7271

GaussianNB Performance :
Accuracy : 0.8248
Precision : 0.7207
Recall : 0.9480
F1 Score : 0.8189
```
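The numbers above come from a single 70/30 hold-out split, whereas the assignment asks for 10-fold cross-validation. One way that evaluation could be written is sketched below. The helper names `load_spambase` and `cross_validate_nb` are hypothetical, and the use of shuffled, stratified folds with `random_state=42` is an assumption rather than something the assignment specifies. No results are shown because this variant was not run here.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Metric names understood by scikit-learn's scoring API.
SCORING = ["accuracy", "precision", "recall", "f1"]


def load_spambase(path="spambase.data"):
    """Hypothetical helper: read the headerless CSV, last column is the label."""
    data = pd.read_csv(path, header=None)
    return data.iloc[:, :-1], data.iloc[:, -1]


def cross_validate_nb(X, y, n_splits=10, seed=42):
    """Return the mean of each metric over the folds for each classifier."""
    # Assumption: stratified folds with shuffling, so each fold keeps the class balance.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    classifiers = {
        "BernoulliNB": BernoulliNB(),
        "MultinomialNB": MultinomialNB(),
        "GaussianNB": GaussianNB(),
    }
    results = {}
    for name, clf in classifiers.items():
        scores = cross_validate(clf, X, y, cv=cv, scoring=SCORING)
        results[name] = {m: scores[f"test_{m}"].mean() for m in SCORING}
    return results


# Example usage (assumes spambase.data is in the working directory):
# X, y = load_spambase()
# for name, metrics in cross_validate_nb(X, y).items():
#     print(name, {m: round(v, 4) for m, v in metrics.items()})
```

Averaging over ten folds uses every row for testing exactly once, so the resulting scores depend less on one particular split than the hold-out figures do.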
