# Naive Bayes Classification - Theory and Assignment
This notebook covers key questions on Naive Bayes classifiers and an assignment using the Spambase dataset.

## Q1: Probability of a smoker given they use the health insurance plan
We are given:
- Probability of using the health insurance plan: \( P(H) = 0.7 \)
- Probability of being a smoker given use of the plan: \( P(S | H) = 0.4 \)

Bayes' theorem relates this conditional probability to its reverse:
\[
P(S | H) = \frac{P(H | S) P(S)}{P(H)}
\]
No inversion is needed here, because the question asks for \( P(S | H) \), which is given directly:
\[
P(S | H) = 0.4 \quad \text{(or 40\%)}
\]
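
For contrast, here is a short worked step under the same givens. The joint probability that a person both uses the plan and smokes is a different quantity, obtained from the product rule:
\[
P(S \cap H) = P(S | H) \, P(H) = 0.4 \times 0.7 = 0.28
\]
Only this joint probability depends on \( P(H) = 0.7 \); the conditional probability asked for in the question does not.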

## Q2: Difference between Bernoulli Naive Bayes and Multinomial Naive Bayes
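- **Bernoulli Naive Bayes** models binary features (0 or 1), such as the presence or absence of each word in a document. The absence of a feature counts explicitly as evidence.
- **Multinomial Naive Bayes** models count features, such as how many times each word occurs in a document. Only the features that occur contribute to the likelihood, weighted by their counts.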
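
As a minimal sketch of this difference, the toy example below fits both models on a made-up word-count matrix. The vocabulary, counts, labels, and variable names are invented for illustration. `BernoulliNB` thresholds the counts to presence/absence through its `binarize` parameter, while `MultinomialNB` uses the counts themselves.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical word counts for 4 documents over the vocabulary ["free", "win", "meeting"]
X_counts = np.array([
    [3, 2, 0],  # spam
    [1, 4, 0],  # spam
    [0, 0, 2],  # not spam
    [0, 1, 3],  # not spam
])
y_toy = np.array([1, 1, 0, 0])

# BernoulliNB turns every count > 0 into 1, so only presence/absence is modeled
bnb_toy = BernoulliNB(binarize=0.0).fit(X_counts, y_toy)

# MultinomialNB keeps the counts, so repeated words carry more weight
mnb_toy = MultinomialNB().fit(X_counts, y_toy)

# New document: "free" appears 5 times, "meeting" once, "win" never
new_doc = np.array([[5, 0, 1]])
print("BernoulliNB   [P(not spam), P(spam)]:", bnb_toy.predict_proba(new_doc)[0])
print("MultinomialNB [P(not spam), P(spam)]:", mnb_toy.predict_proba(new_doc)[0])
```

In this toy setup the two models disagree on the new document. The Bernoulli model sees only that "free" and "meeting" are present and "win" is absent, whereas the multinomial model weights "free" five times.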

## Q3: Handling Missing Values in Bernoulli Naive Bayes
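Bernoulli Naive Bayes does not handle missing values directly (scikit-learn's `BernoulliNB`, for example, rejects input containing NaN), so they must be filled in before training. Some common approaches:
- Impute missing values with 0, which treats a missing entry as the absence of the feature.
- Impute missing values with the mode (most frequent value) of the feature, which keeps the data binary.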
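
The two approaches above can be sketched with scikit-learn's `SimpleImputer` placed in front of `BernoulliNB`. The data and variable names below are hypothetical and only illustrate the mechanics; which strategy is appropriate depends on why the values are missing.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical binary features with missing entries marked as NaN
X_missing = np.array([
    [1.0, 0.0, np.nan],
    [1.0, np.nan, 0.0],
    [0.0, 1.0, 1.0],
    [np.nan, 1.0, 1.0],
])
y_missing = np.array([1, 1, 0, 0])

# Option 1: treat a missing value as "feature absent" (fill with 0)
zero_fill = make_pipeline(
    SimpleImputer(strategy="constant", fill_value=0),
    BernoulliNB(),
)

# Option 2: fill with the most frequent value (the mode) of each column
mode_fill = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    BernoulliNB(),
)

for label, model in [("zero fill", zero_fill), ("mode fill", mode_fill)]:
    model.fit(X_missing, y_missing)
    print(label, "->", model.predict(X_missing))
```

Wrapping the imputer and the classifier in one pipeline means the fill values are learned from the training data only, which matters once the model is evaluated with cross-validation.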

## Q4: Can Gaussian Naive Bayes be used for multi-class classification?
Yes. Gaussian Naive Bayes estimates a mean and variance for every feature within each class, uses Bayes' theorem to compute the posterior probability of each class, and predicts the class with the highest posterior. Nothing in this procedure is limited to two classes.
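
A minimal sketch of the multi-class case, assuming the three-class Iris dataset bundled with scikit-learn as a stand-in (it is not part of the assignment). The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Iris has three classes and four continuous features; it ships with scikit-learn
X_iris, y_iris = load_iris(return_X_y=True)

gnb_multi = GaussianNB().fit(X_iris, y_iris)

# One posterior probability per class for each sample; each row sums to 1
posteriors = gnb_multi.predict_proba(X_iris[:3])
print("Classes:", gnb_multi.classes_)
print("Posteriors for the first three samples:")
print(posteriors.round(3))

# The prediction is the class with the highest posterior
predicted = gnb_multi.classes_[np.argmax(posteriors, axis=1)]
print("Predicted classes:", predicted)
```

The same `fit` and `predict` calls work regardless of the number of classes; the model simply stores one set of per-feature means and variances per class.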

## Q5: Assignment - Spam Classification using Naive Bayes
### Steps:
1. Load the Spambase dataset.
2. Preprocess the data.
3. Implement Bernoulli, Multinomial, and Gaussian Naive Bayes classifiers.
4. Perform 10-fold cross-validation.
5. Evaluate and compare performance metrics.

In [None]:
import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load the Spambase dataset (download spambase.data manually from the UCI repository)
data = pd.read_csv('spambase.data', header=None)

# Split features and labels (the last column is the label: 1 = spam, 0 = not spam)
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

# Initialize classifiers.
# BernoulliNB binarizes each feature at its default threshold (binarize=0.0).
# MultinomialNB requires non-negative features, so it gets the raw data.
# GaussianNB is standardized inside a pipeline so that the scaler is fit
# on the training folds only, not on the held-out fold.
classifiers = {
    'BernoulliNB': BernoulliNB(),
    'MultinomialNB': MultinomialNB(),
    'GaussianNB': make_pipeline(StandardScaler(), GaussianNB()),
}

# Perform 10-fold cross-validation once per classifier and collect all metrics
scoring = {'Accuracy': 'accuracy', 'Precision': 'precision', 'Recall': 'recall', 'F1-score': 'f1'}
metrics = {}

for name, clf in classifiers.items():
    cv_results = cross_validate(clf, X, y, cv=10, scoring=scoring)
    # cross_validate stores each metric's per-fold scores under the key 'test_<name>'
    metrics[name] = {metric: cv_results[f'test_{metric}'].mean() for metric in scoring}

# Display results
for model, scores in metrics.items():
    print(f"{model} Performance:")
    for metric, value in scores.items():
        print(f"{metric}: {value:.4f}")
    print("\n")

## Discussion and Conclusion
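- Which classifier performs best depends on how well its feature model matches the data, so compare the cross-validated metrics rather than assuming a winner.
- Multinomial Naive Bayes usually works well with text data represented as word counts.
- Bernoulli Naive Bayes suits binary features. With its default settings, `BernoulliNB` reduces each Spambase feature to whether it is greater than zero.
- Gaussian Naive Bayes handles continuous features by modeling each one as normally distributed within each class.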
### Limitations of Naive Bayes:
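- Assumes features are conditionally independent given the class, which is rarely true in practice.
- Correlated features are effectively counted more than once, which tends to make the predicted probabilities overconfident.
- Can be sensitive to class imbalance, since skewed class priors can pull predictions toward the majority class.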
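
The effect of correlated features can be illustrated with a small synthetic experiment. This is a sketch on made-up data, not a result from Spambase. One informative feature is copied five times, which adds no information, yet Gaussian Naive Bayes treats the copies as independent evidence and becomes more confident.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Synthetic, balanced two-class data with one informative feature (illustrative only)
n_per_class = 200
x = np.concatenate([
    rng.normal(0.0, 1.0, n_per_class),  # class 0
    rng.normal(1.0, 1.0, n_per_class),  # class 1
])
y_syn = np.repeat([0, 1], n_per_class)

X_single = x.reshape(-1, 1)
# Five identical copies of the same feature: perfectly correlated, no new information
X_dup = np.repeat(X_single, 5, axis=1)

# Confidence = probability assigned to the predicted class for each sample
conf_single = GaussianNB().fit(X_single, y_syn).predict_proba(X_single).max(axis=1)
conf_dup = GaussianNB().fit(X_dup, y_syn).predict_proba(X_dup).max(axis=1)

print(f"Mean confidence, 1 copy of the feature:   {conf_single.mean():.3f}")
print(f"Mean confidence, 5 copies of the feature: {conf_dup.mean():.3f}")
```

Because the classes are balanced, duplicating the feature leaves the predicted labels unchanged here; only the probabilities are pushed toward 0 or 1.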

**Future Work:**
- Experiment with feature engineering.
- Try alternative models like Logistic Regression or Random Forest.