In [11]:
# Q1 -->
# Given probabilities
P_A = 0.70  # P(A): probability that an employee uses the health insurance plan
P_B_given_A = 0.40  # P(B|A): probability that an employee is a smoker, given that they use the plan

# The requested quantity P(B|A) is given directly. As a consistency check,
# recover it from the joint probability: P(B|A) = P(A and B) / P(A).
P_A_and_B = P_B_given_A * P_A  # joint probability P(A and B), approximately 0.28
P_B_given_A = P_A_and_B / P_A

print(f"The probability that an employee is a smoker given that they use the health insurance plan is: {P_B_given_A:.2f}")


The probability that an employee is a smoker given that they use the health insurance plan is: 0.40


### Q2. Bernoulli Naive Bayes vs. Multinomial Naive Bayes:

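- **Bernoulli Naive Bayes:**
  - Models each feature as binary (0 or 1), e.g. whether a word occurs in a document.
  - Explicitly penalizes the absence of a feature that is typical of a class.
  - Used in tasks like spam filtering and classification of short documents.

- **Multinomial Naive Bayes:**
  - Models discrete, non-negative features, typically integer counts or frequencies (e.g. how often a word occurs).
  - Commonly used in text classification and topic modeling.
  - Ignores absent features: a feature with count 0 contributes nothing to the likelihood.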
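
To make the difference concrete, here is a minimal sketch on a made-up count matrix (the data and labels are hypothetical, chosen only for illustration). It fits scikit-learn's `BernoulliNB` and `MultinomialNB` on the same counts and compares their class probabilities for two new rows that differ only in how often the first word occurs.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical word counts for 4 tiny documents over 3 vocabulary words.
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 1, 2],
])
y_toy = np.array([1, 1, 0, 0])  # illustrative labels: 1 = spam, 0 = not spam

# MultinomialNB works on the counts themselves.
mnb = MultinomialNB().fit(X_counts, y_toy)

# BernoulliNB only sees presence/absence: binarize=0.0 (the default)
# maps every value greater than 0 to 1 and everything else to 0.
bnb = BernoulliNB(binarize=0.0).fit(X_counts, y_toy)

# These two rows have the same presence/absence pattern,
# but the first word occurs once in one row and five times in the other.
X_new = np.array([[1, 0, 0], [5, 0, 0]])
p_mnb = mnb.predict_proba(X_new)
p_bnb = bnb.predict_proba(X_new)
print("Multinomial P(class):\n", p_mnb)
print("Bernoulli   P(class):\n", p_bnb)
```

The Bernoulli model assigns both rows identical probabilities because they binarize to the same vector, while the multinomial model becomes more confident as the count grows.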

### Q3. Handling Missing Values in Bernoulli Naive Bayes:

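- Bernoulli Naive Bayes has no built-in mechanism for missing values; each feature is expected to be 0 or 1.
- Because the features are treated as conditionally independent, a missing feature can in principle be left out of the likelihood product for that instance.
- scikit-learn's `BernoulliNB` does not do this: it rejects `NaN` inputs, so missing values must be imputed (or the affected rows dropped) before training and prediction.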
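
One way to handle this in practice is to impute before fitting. The sketch below assumes a small hypothetical binary matrix with `NaN` marking missing entries and uses scikit-learn's `SimpleImputer` with the `most_frequent` strategy. The imputation strategy is an illustrative choice, not something the notebook prescribes.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical binary features with missing entries marked as NaN.
X_missing = np.array([
    [1, 0, np.nan],
    [1, 1, 0],
    [0, np.nan, 1],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
])
y_missing = np.array([1, 1, 0, 0, 1, 0])

# Replace each NaN with the most frequent value of its column,
# then fit Bernoulli Naive Bayes on the completed binary matrix.
model = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    BernoulliNB(),
)
model.fit(X_missing, y_missing)

# The same imputation is applied automatically at prediction time.
X_query = np.array([[1, np.nan, 0]])
pred_missing = model.predict(X_query)
print(pred_missing)
```

Wrapping both steps in a pipeline keeps the imputation values learned on the training data and reuses them for new samples.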

### Q4. Gaussian Naive Bayes for Multi-Class Classification:

- Yes, Gaussian Naive Bayes can handle multi-class classification.
- It assumes that, within each class, every continuous feature follows a Gaussian distribution.
- Each class has its own mean and variance parameters for every feature, plus a class prior.
- During prediction, it assigns the instance to the class with the highest posterior probability.
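
As a small illustration of the multi-class case, the sketch below fits `GaussianNB` on the three-class Iris dataset bundled with scikit-learn. Using Iris is an assumption of this example; it is not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Iris ships with scikit-learn: 150 samples, 4 continuous features, 3 classes.
X_iris, y_iris = load_iris(return_X_y=True)

gnb = GaussianNB().fit(X_iris, y_iris)

# theta_ holds one row of per-feature means for each class.
print(gnb.classes_)
print(gnb.theta_.shape)  # (n_classes, n_features)

# predict_proba returns one posterior per class;
# predict picks the class with the largest posterior.
proba_iris = gnb.predict_proba(X_iris[:5])
pred_iris = gnb.predict(X_iris[:5])
print(proba_iris.round(3))
print(pred_iris)
```

Nothing in the API changes between the binary and the multi-class case; the model simply stores one set of parameters per class.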

In [12]:
from ucimlrepo import fetch_ucirepo 

# fetch dataset 
spambase = fetch_ucirepo(id=94) 

# data (as pandas dataframes) 
X = spambase.data.features 
y = spambase.data.targets 


In [None]:
X.dtypes


In [25]:
X.shape,y.shape

((4601, 57), (4601, 1))

In [23]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Split the data into training and testing sets.
# y is a one-column DataFrame; flatten it to 1-D, the shape scikit-learn expects for labels.
X_train, X_test, y_train, y_test = train_test_split(
    X, y.values.ravel(), test_size=0.2, random_state=42
)

# Initialize classifiers
bernoulli_nb = BernoulliNB()
multinomial_nb = MultinomialNB()
gaussian_nb = GaussianNB()

# Train classifiers
bernoulli_nb.fit(X_train, y_train)
multinomial_nb.fit(X_train, y_train)
gaussian_nb.fit(X_train, y_train)

# Predictions
y_pred_b = bernoulli_nb.predict(X_test)
y_pred_m = multinomial_nb.predict(X_test)
y_pred_g = gaussian_nb.predict(X_test)

# Evaluate classifiers
accuracy_b = accuracy_score(y_test, y_pred_b)
precision_b = precision_score(y_test, y_pred_b)
recall_b = recall_score(y_test, y_pred_b)
f1_b = f1_score(y_test, y_pred_b)

accuracy_m = accuracy_score(y_test, y_pred_m)
precision_m = precision_score(y_test, y_pred_m)
recall_m = recall_score(y_test, y_pred_m)
f1_m = f1_score(y_test, y_pred_m)

accuracy_g = accuracy_score(y_test, y_pred_g)
precision_g = precision_score(y_test, y_pred_g)
recall_g = recall_score(y_test, y_pred_g)
f1_g = f1_score(y_test, y_pred_g)

# Display results
print("Bernoulli Naive Bayes:")
print(f"Accuracy: {accuracy_b:.4f}")
print(f"Precision: {precision_b:.4f}")
print(f"Recall: {recall_b:.4f}")
print(f"F1 Score: {f1_b:.4f}\n")

print("Multinomial Naive Bayes:")
print(f"Accuracy: {accuracy_m:.4f}")
print(f"Precision: {precision_m:.4f}")
print(f"Recall: {recall_m:.4f}")
print(f"F1 Score: {f1_m:.4f}\n")

print("Gaussian Naive Bayes:")
print(f"Accuracy: {accuracy_g:.4f}")
print(f"Precision: {precision_g:.4f}")
print(f"Recall: {recall_g:.4f}")
print(f"F1 Score: {f1_g:.4f}")


Bernoulli Naive Bayes:
Accuracy: 0.8806
Precision: 0.9070
Recall: 0.8000
F1 Score: 0.8501

Multinomial Naive Bayes:
Accuracy: 0.7861
Precision: 0.7644
Recall: 0.7154
F1 Score: 0.7391

Gaussian Naive Bayes:
Accuracy: 0.8208
Precision: 0.7193
Recall: 0.9462
F1 Score: 0.8173


### Conclusion:

1. **Bernoulli Naive Bayes:**
   - **Accuracy:** 88.06%
   - **Precision:** 90.70%
   - **Recall:** 80.00%
   - **F1 Score:** 85.01%
   - **Observation:** Bernoulli Naive Bayes has the highest accuracy, precision, and F1 score of the three. Its high precision means few legitimate emails are flagged as spam (few false positives), although a recall of 80% means about one in five spam emails is missed.

2. **Multinomial Naive Bayes:**
   - **Accuracy:** 78.61%
   - **Precision:** 76.44%
   - **Recall:** 71.54%
   - **F1 Score:** 73.91%
   - **Observation:** Multinomial Naive Bayes scores lowest on accuracy, recall, and F1. It is designed for count features, whereas the Spambase features are continuous frequencies and run lengths on very different scales, which likely explains the gap to Bernoulli NB.

3. **Gaussian Naive Bayes:**
   - **Accuracy:** 82.08%
   - **Precision:** 71.93%
   - **Recall:** 94.62%
   - **F1 Score:** 81.73%
   - **Observation:** Gaussian Naive Bayes has the highest recall, catching about 95% of spam emails. Its precision is the lowest of the three, however: roughly 28% of the emails it flags as spam are legitimate, which pulls its accuracy below that of Bernoulli NB.
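
As a quick consistency check, the F1 scores reported above can be recomputed from the precision and recall values, since F1 is their harmonic mean. The helper name `f1_from` is introduced here for illustration only.

```python
def f1_from(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) copied from the results above.
reported = {
    "Bernoulli": (0.9070, 0.8000, 0.8501),
    "Multinomial": (0.7644, 0.7154, 0.7391),
    "Gaussian": (0.7193, 0.9462, 0.8173),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: recomputed F1 = {f1_from(p, r):.4f}, reported F1 = {f1:.4f}")
```

Because the inputs are rounded to four decimals, the recomputed values should agree with the reported ones only up to rounding.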

### Overall Observations:

- **Bernoulli Naive Bayes** is the most effective in this context, with the best accuracy, precision, and F1 score. With scikit-learn's default `binarize=0.0`, it reduces each feature to present (greater than 0) or absent, and this presence/absence signal appears to work well for spam classification.

- **Multinomial Naive Bayes** lags behind Bernoulli NB on every metric. It is better suited to tasks involving discrete count data, such as raw word counts.

- **Gaussian Naive Bayes** excels in recall, indicating its strength in capturing spam instances, but it comes at the cost of precision. It may benefit from further tuning or feature engineering.

### Limitations:

- Naive Bayes assumes feature independence, which may not always hold in real-world scenarios.
- The results come from a single 80/20 split of 4,601 emails, so the metrics may shift with a different split; cross-validation or a larger dataset would give more robust estimates.
- The choice of Naive Bayes variant depends on the nature of the features in the dataset.

### Future Work:

- Experiment with feature engineering to enhance the performance of Gaussian Naive Bayes.
- Explore additional hyperparameter tuning to optimize the performance of each classifier.
- Consider using more sophisticated models or ensemble methods to further improve classification accuracy.
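
A hyperparameter search for one of the classifiers could look like the following sketch. It uses synthetic stand-in data from `make_classification` so that it runs on its own, and the grid values are illustrative assumptions, not tuned recommendations. In the notebook, `X_train` and `y_train` would take the place of `X_demo` and `y_demo`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB

# Synthetic stand-in data so the sketch is self-contained.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=8, random_state=42
)

# alpha is the smoothing strength; binarize is the threshold above which
# a feature value counts as "present". The grid values are illustrative.
param_grid = {
    "alpha": [0.01, 0.1, 1.0],
    "binarize": [0.0, 0.5, 1.0],
}

# 5-fold cross-validated grid search, scored by F1.
search = GridSearchCV(BernoulliNB(), param_grid, scoring="f1", cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Best cross-validated F1: {search.best_score_:.4f}")
```

Scoring by F1 rather than accuracy keeps the search focused on the precision/recall trade-off discussed above.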

In summary, while each Naive Bayes variant has its strengths and weaknesses, Bernoulli Naive Bayes is the strongest overall for this spam classification task, with the highest accuracy, precision, and F1 score in the results above.