<div class="alert alert-block alert-info" align="center" style="padding: 10px;">    
    <h1><b><u>Naive Bayes-2</u></b></h1>
</div>

**Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?**

To find the probability that an employee is a smoker given that they use the health insurance plan, you can use conditional probability. In this case, you want to calculate:

$$ P(\text{Smoker} \mid \text{Uses Health Insurance Plan}) = ? $$

**Using the conditional probability formula:**

$$ P(A|B) = \frac{P(A \text{ and } B)}{P(B)} $$

In this case:

- $A$ represents being a smoker.
- $B$ represents using the health insurance plan.

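Given:

$$ P(\text{Uses Health Insurance Plan}) = 0.70 $$

(70% of employees use the plan)

$$ P(\text{Smoker} \mid \text{Uses Health Insurance Plan}) = 0.40 $$

(40% of plan users are smokers)

The second figure is already the conditional probability the question asks for. As a check, the multiplication rule gives the joint probability $P(A \text{ and } B) = 0.40 \times 0.70 = 0.28$, and substituting it into the formula recovers the same value:

$$ P(\text{Smoker} \mid \text{Uses Health Insurance Plan}) = \frac{0.28}{0.70} = 0.40 $$

So, the probability that an employee is a smoker given that they use the health insurance plan is **40%**.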
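
The arithmetic above can be checked in a few lines. This is only a sketch; the variable names are illustrative and not part of the original question.

```python
# Survey figures from the question.
p_plan = 0.70                # P(uses plan)
p_smoker_given_plan = 0.40   # P(smoker | uses plan), stated directly in the survey

# Multiplication rule: P(smoker and plan) = P(smoker | plan) * P(plan).
p_smoker_and_plan = p_smoker_given_plan * p_plan

# Conditional probability formula: P(A | B) = P(A and B) / P(B).
p_recovered = p_smoker_and_plan / p_plan

print(f"P(smoker and plan) = {p_smoker_and_plan:.2f}")
print(f"P(smoker | plan)   = {p_recovered:.2f}")
```

Note that the 70% figure cancels out: it is needed to compute the joint probability but not the conditional one.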

---

**Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?**

**Bernoulli Naive Bayes:**

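Designed for binary data, where each feature is either 0 (absence) or 1 (presence), and often used in document classification. Each feature is modeled as a Bernoulli variable, so the absence of a feature explicitly contributes to the class likelihood.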

**Multinomial Naive Bayes:**

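Suitable for discrete count or frequency data, commonly used in text classification with features such as word counts or term frequencies. The feature vector is modeled as draws from a multinomial distribution, so a feature with a count of 0 contributes nothing to the class likelihood.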
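
The difference in how the two models score a document can be sketched by writing out the class-conditional log-likelihoods by hand. The vocabulary size and the parameter values below are hypothetical, chosen only for illustration.

```python
import numpy as np

def bernoulli_log_likelihood(x, p):
    """log P(x | class) for binary x, where p[j] = P(feature j present | class)."""
    x = np.asarray(x, dtype=float)
    p = np.asarray(p, dtype=float)
    # Present features add log(p); absent features add log(1 - p).
    return float(np.sum(x * np.log(p) + (1 - x) * np.log(1 - p)))

def multinomial_log_likelihood(counts, theta):
    """log P(counts | class) up to a class-independent constant; theta sums to 1."""
    counts = np.asarray(counts, dtype=float)
    theta = np.asarray(theta, dtype=float)
    # Each occurrence of feature j adds log(theta[j]); zero counts add nothing.
    return float(np.sum(counts * np.log(theta)))

# Hypothetical three-word vocabulary and parameters for a single class.
p_present = [0.8, 0.5, 0.1]   # Bernoulli: P(word appears at least once | class)
theta = [0.6, 0.3, 0.1]       # Multinomial: P(a drawn word is word j | class)

doc_counts = [3, 0, 0]        # only the first word appears, three times
doc_binary = [1, 0, 0]        # the same document as presence/absence

print("Bernoulli:  ", bernoulli_log_likelihood(doc_binary, p_present))
print("Multinomial:", multinomial_log_likelihood(doc_counts, theta))
```

The Bernoulli score includes terms for the two missing words, while the multinomial score depends only on the word that occurs and how often it occurs.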

---

**Q3. How does Bernoulli Naive Bayes handle missing values?**

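Bernoulli Naive Bayes has no built-in mechanism for missing values: the model expects every feature to be 0 or 1, so missing entries must be dealt with before training. Two common approaches are:

1. Impute the missing values so that every feature is again 0 or 1. For binary features, mode (most frequent value) imputation is the natural choice, since mean or median imputation can produce values that are neither 0 nor 1.

2. Keep the information that a value was missing by adding a separate binary indicator feature (1 if the original value was missing, 0 otherwise) alongside the imputed feature. This stays within the binary representation that Bernoulli Naive Bayes requires.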
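
A minimal sketch of imputing before fitting, assuming scikit-learn's `SimpleImputer` is used for the imputation step. The toy data is hypothetical; `add_indicator=True` appends one binary "was missing" column for each feature that had missing values during fitting.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical binary data with missing entries marked as NaN.
X = np.array([
    [1, 0, np.nan],
    [1, 1, 0],
    [0, np.nan, 1],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
], dtype=float)
y = np.array([1, 1, 0, 0, 1, 0])

# Mode imputation keeps every feature binary; the indicator columns record
# which entries were originally missing.
model = make_pipeline(
    SimpleImputer(strategy="most_frequent", add_indicator=True),
    BernoulliNB(),
)
model.fit(X, y)

# A new sample with a missing value passes through the same imputation step.
X_new = np.array([[1, np.nan, 0]])
pred = model.predict(X_new)
print(pred)
```

Putting the imputer inside the pipeline means that, under cross-validation, the imputation values are learned from the training folds only.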

---

**Q4. Can Gaussian Naive Bayes be used for multi-class classification?**

Yes. Gaussian Naive Bayes can be used for multi-class classification tasks. It is appropriate when features are continuous and approximately follow a Gaussian (normal) distribution within each class. The algorithm estimates a mean and variance for every feature in every class, uses Bayes' theorem to compute the posterior probability of each class for a new sample, and predicts the class with the highest posterior. Nothing in this procedure limits the number of classes to two.
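
As a small illustration, the sketch below fits scikit-learn's `GaussianNB` on the three-class Iris dataset. Iris is not part of this assignment and is used here only because it ships with scikit-learn.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Iris has 3 classes and 4 continuous features.
X, y = load_iris(return_X_y=True)

gnb = GaussianNB().fit(X, y)

# One row of per-feature means is estimated for each class.
print(gnb.classes_)       # class labels seen during fit
print(gnb.theta_.shape)   # (n_classes, n_features)

# predict_proba returns one posterior probability per class for each sample.
proba = gnb.predict_proba(X[:3])
print(proba.shape)
```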

---

### **Q5. Assignment:** ###

### Data Preparation:

1. **Download the "Spambase Data Set"** from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, and the goal is to predict whether a message is spam or not based on several input features.

### Implementation:

2. **Implement Naive Bayes Classifiers:**
    - Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the scikit-learn library in Python.
    - Use 10-fold cross-validation to evaluate the performance of each classifier on the dataset.
    - Use default hyperparameters for each classifier.

### Results:

3. **Report Performance Metrics:**
    - Report the following performance metrics for each classifier:
        - Accuracy
        - Precision
        - Recall
        - F1 score

### Discussion:

4. **Discuss Results:**
    - Discuss the obtained results.
    - Analyze which variant of Naive Bayes performed the best and why.
    - Observe any limitations of Naive Bayes.

### Conclusion:

5. **Conclusion:**
    - Summarize your findings.
    - Provide suggestions for future work.



In [15]:
import pandas as pd

# Load the Spambase data and preview the first five rows
data = pd.read_csv('spambase_csv.csv')
data.head()


| Unnamed: 0 | word_freq_make | word_freq_address | word_freq_all | word_freq_3d | word_freq_our | word_freq_over | word_freq_remove | word_freq_internet | word_freq_order | word_freq_mail | ... | char_freq_%3B | char_freq_%28 | char_freq_%5B | char_freq_%21 | char_freq_%24 | char_freq_%23 | capital_run_length_average | capital_run_length_longest | capital_run_length_total | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.64 | 0.64 | 0.0 | 0.32 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.778 | 0.0 | 0.0 | 3.756 | 61 | 278 | 1 |
| 1 | 0.21 | 0.28 | 0.5 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.0 | 0.94 | ... | 0.0 | 0.132 | 0.0 | 0.372 | 0.18 | 0.048 | 5.114 | 101 | 1028 | 1 |
| 2 | 0.06 | 0.0 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ... | 0.01 | 0.143 | 0.0 | 0.276 | 0.184 | 0.01 | 9.821 | 485 | 2259 | 1 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.137 | 0.0 | 0.137 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.135 | 0.0 | 0.135 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |


In [18]:
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

data = pd.read_csv('spambase_csv.csv')

# Separate features and target. 'Unnamed: 0' is the row index saved with the
# CSV, not an email feature, so it is dropped along with the label.
X = data.drop(columns=['class', 'Unnamed: 0'], errors='ignore')
y = data['class']

# Evaluate each Naive Bayes variant with 10-fold cross-validation and default
# hyperparameters. With an integer cv and a classifier, scikit-learn uses
# stratified folds, so every metric below is computed on the same splits.

# Bernoulli Naive Bayes (binarizes each feature at 0 by default)
bnb = BernoulliNB()
accuracy_bnb = cross_val_score(bnb, X, y, cv=10, scoring='accuracy')
precision_bnb = cross_val_score(bnb, X, y, cv=10, scoring='precision')
recall_bnb = cross_val_score(bnb, X, y, cv=10, scoring='recall')
f1_score_bnb = cross_val_score(bnb, X, y, cv=10, scoring='f1')

# Multinomial Naive Bayes (requires non-negative features)
mnb = MultinomialNB()
accuracy_mnb = cross_val_score(mnb, X, y, cv=10, scoring='accuracy')
precision_mnb = cross_val_score(mnb, X, y, cv=10, scoring='precision')
recall_mnb = cross_val_score(mnb, X, y, cv=10, scoring='recall')
f1_score_mnb = cross_val_score(mnb, X, y, cv=10, scoring='f1')

# Gaussian Naive Bayes (fits a per-class mean and variance for each feature)
gnb = GaussianNB()
accuracy_gnb = cross_val_score(gnb, X, y, cv=10, scoring='accuracy')
precision_gnb = cross_val_score(gnb, X, y, cv=10, scoring='precision')
recall_gnb = cross_val_score(gnb, X, y, cv=10, scoring='recall')
f1_score_gnb = cross_val_score(gnb, X, y, cv=10, scoring='f1')

# Report the performance metrics for each classifier
print("Bernoulli Naive Bayes:")
print("Accuracy:", accuracy_bnb.mean())
print("Precision:", precision_bnb.mean())
print("Recall:", recall_bnb.mean())
print("F1 Score:", f1_score_bnb.mean())

print("\nMultinomial Naive Bayes:")
print("Accuracy:", accuracy_mnb.mean())
print("Precision:", precision_mnb.mean())
print("Recall:", recall_mnb.mean())
print("F1 Score:", f1_score_mnb.mean())

print("\nGaussian Naive Bayes:")
print("Accuracy:", accuracy_gnb.mean())
print("Precision:", precision_gnb.mean())
print("Recall:", recall_gnb.mean())
print("F1 Score:", f1_score_gnb.mean())

Bernoulli Naive Bayes:
Accuracy: 0.8839380364047911
Precision: 0.8869617393737383
Recall: 0.8152389047416673
F1 Score: 0.8481249015095276

Multinomial Naive Bayes:
Accuracy: 0.7863496180326323
Precision: 0.7393175533565436
Recall: 0.7214983911116508
F1 Score: 0.7282909724016348

Gaussian Naive Bayes:
Accuracy: 0.8217730830896915
Precision: 0.7103733928118492
Recall: 0.9569516119239877
F1 Score: 0.8130660909542995
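
The evaluation above runs cross-validation once per metric. As an alternative sketch, scikit-learn's `cross_validate` accepts a list of scorers and computes all four metrics in a single pass per classifier, which also makes it easy to report the spread across folds. The data below is a synthetic stand-in (an assumption, so the snippet runs without `spambase_csv.csv`); its numbers are not comparable to the Spambase results.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Synthetic stand-in data: non-negative counts so MultinomialNB accepts them.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=400)
X = rng.poisson(lam=1.0 + 2.0 * y[:, None], size=(400, 10)).astype(float)

metrics = ["accuracy", "precision", "recall", "f1"]
models = {
    "Bernoulli Naive Bayes": BernoulliNB(),
    "Multinomial Naive Bayes": MultinomialNB(),
    "Gaussian Naive Bayes": GaussianNB(),
}

summary = {}
for name, model in models.items():
    # One 10-fold run per model; results are keyed as "test_<metric>".
    cv_results = cross_validate(model, X, y, cv=10, scoring=metrics)
    summary[name] = {
        m: (cv_results[f"test_{m}"].mean(), cv_results[f"test_{m}"].std())
        for m in metrics
    }
    print(f"\n{name}:")
    for m, (mean, std) in summary[name].items():
        print(f"  {m}: {mean:.4f} (std {std:.4f})")
```

To apply this to the assignment data, `X` and `y` would be replaced with the Spambase features and labels loaded earlier.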


### Summary of Naive Bayes Performance Metrics

#### Bernoulli Naive Bayes:

- **Accuracy:** 88.39%
- **Precision:** 88.70%
- **Recall:** 81.52%
- **F1 Score:** 84.81%

#### Multinomial Naive Bayes:

- **Accuracy:** 78.63%
- **Precision:** 73.93%
- **Recall:** 72.15%
- **F1 Score:** 72.83%

#### Gaussian Naive Bayes:

- **Accuracy:** 82.18%
- **Precision:** 71.04%
- **Recall:** 95.70%
- **F1 Score:** 81.31%

### Discussion

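Among the three Naive Bayes variants, **Bernoulli Naive Bayes** achieved the highest accuracy (88.39%), precision (88.70%), and F1 score (84.81%). By default, scikit-learn's `BernoulliNB` binarizes every feature at 0, so the model effectively uses whether a word or character appears in an email rather than how often. This presence/absence signal appears to be a good fit for spam detection on this dataset.

**Multinomial Naive Bayes** had the lowest accuracy (78.63%) and F1 score (72.83%) of the three. A likely reason is that the Spambase features are percentages and capital-run-length statistics on very different scales rather than raw word counts, which does not match the count data the multinomial model assumes.

**Gaussian Naive Bayes** showed the highest recall (95.70%), meaning it misses few spam emails (few false negatives). However, it had the lowest precision (71.04%), meaning it flags many legitimate emails as spam (more false positives). Its F1 score (81.31%) therefore falls below that of Bernoulli Naive Bayes, although it remains above that of Multinomial Naive Bayes.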
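
To make the precision/recall trade-off concrete, the sketch below computes the three metrics from confusion-matrix counts. The counts are hypothetical and only mimic a classifier that flags spam aggressively; they are not taken from the Spambase results.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)   # share of flagged emails that are really spam
    recall = tp / (tp + fn)      # share of spam emails that were flagged
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical counts: almost all spam is caught (few false negatives),
# but many legitimate emails are flagged too (many false positives).
p, r, f1 = precision_recall_f1(tp=95, fp=40, fn=5)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two values, which is why very high recall cannot fully compensate for low precision.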

### Observations

- Bernoulli Naive Bayes is suitable when the features are binary or binary-like (e.g., presence or absence of words in text classification).

- Multinomial Naive Bayes is typically used for discrete data, often in text classification where features represent word counts.

- Gaussian Naive Bayes assumes that features follow a Gaussian distribution, which may not hold for all types of data.
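
The first observation can be checked directly. `BernoulliNB` has a `binarize` parameter (default `0.0`) that thresholds the inputs before fitting, so continuous frequency features are turned into presence/absence indicators automatically. The sketch below uses synthetic data (an assumption, not the Spambase file) to show that the default model matches one trained on explicitly binarized features.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Synthetic non-negative "frequency" features with many zeros.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.poisson(lam=0.3 + 0.7 * y[:, None], size=(200, 5)) * rng.random((200, 5))

# Default: values greater than 0 are mapped to 1, everything else to 0.
default_model = BernoulliNB().fit(X, y)

# Equivalent: binarize by hand and switch the internal thresholding off.
X_binary = (X > 0).astype(float)
explicit_model = BernoulliNB(binarize=None).fit(X_binary, y)

same = np.array_equal(default_model.predict(X), explicit_model.predict(X_binary))
print("Predictions identical:", same)
```

A different threshold, such as `BernoulliNB(binarize=0.5)`, would change which values count as "present" and is one of the hyperparameters that could be tuned.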

### Limitations

- Naive Bayes classifiers assume independence between features, which may not always hold in real-world data.

- The choice of Naive Bayes variant should be based on the nature of the data and problem requirements.

- These results represent the performance with default hyperparameters; fine-tuning and feature engineering may further improve performance.
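
As a sketch of what such fine-tuning could look like, the snippet below runs a small grid search with scikit-learn's `GridSearchCV`. The parameter grids and the synthetic data are illustrative assumptions, not settings evaluated in this notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB, GaussianNB

# Synthetic stand-in data (non-negative counts); replace with the Spambase X, y.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
X = rng.poisson(lam=1.0 + 2.0 * y[:, None], size=(300, 8)).astype(float)

# Hypothetical grids: smoothing strength and binarization threshold for
# BernoulliNB, variance smoothing for GaussianNB.
searches = {
    "BernoulliNB": GridSearchCV(
        BernoulliNB(),
        {"alpha": [0.01, 0.1, 1.0], "binarize": [0.0, 0.5, 1.0]},
        cv=5,
        scoring="f1",
    ),
    "GaussianNB": GridSearchCV(
        GaussianNB(),
        {"var_smoothing": [1e-9, 1e-6, 1e-3]},
        cv=5,
        scoring="f1",
    ),
}

for name, search in searches.items():
    search.fit(X, y)
    print(name, search.best_params_, round(search.best_score_, 3))
```

Scores selected this way are optimistic estimates; a fair comparison would evaluate the chosen settings on data not used for the search, for example with nested cross-validation.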

### Conclusion

In this evaluation, **Bernoulli Naive Bayes** performed the best overall in classifying spam and non-spam emails.

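**Multinomial Naive Bayes** had the lowest accuracy and F1 score of the three with default settings.

**Gaussian Naive Bayes** had the highest recall but the lowest precision, so it catches most spam at the cost of flagging more legitimate emails, which might not be ideal for all applications.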

The choice of the most suitable Naive Bayes variant depends on the specific characteristics of the data and the desired trade-offs between precision, recall, and overall accuracy.
