# Naive Bayes - Complete Guide

## From Probability Theory to Implementation

Naive Bayes is a probabilistic classifier based on **Bayes' Theorem** with a "naive" assumption that features are conditionally independent given the class.

### What You'll Learn
1. Bayes' Theorem foundations
2. Naive independence assumption
3. Gaussian Naive Bayes
4. Multinomial Naive Bayes (text classification)
5. Bernoulli Naive Bayes
6. Implementation from scratch
7. Real-world applications

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris, load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from scipy.stats import norm

plt.style.use('seaborn-v0_8-whitegrid')
np.random.seed(42)

## 1. Bayes' Theorem

$$P(C_k|\mathbf{x}) = \frac{P(\mathbf{x}|C_k) \cdot P(C_k)}{P(\mathbf{x})}$$

Where:
- $P(C_k|\mathbf{x})$ = **Posterior**: Probability of class $C_k$ given features $\mathbf{x}$
- $P(\mathbf{x}|C_k)$ = **Likelihood**: Probability of features given class
- $P(C_k)$ = **Prior**: Probability of class $C_k$
- $P(\mathbf{x})$ = **Evidence**: Probability of features (normalization constant)

### Naive Assumption

Features are conditionally independent given the class:

$$P(\mathbf{x}|C_k) = P(x_1, x_2, ..., x_n|C_k) = \prod_{i=1}^{n} P(x_i|C_k)$$
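
As a rough illustration of the factorization above, the sketch below multiplies per-feature likelihoods for two classes and compares the result with the equivalent sum of logs. All numbers and variable names are hypothetical; the point is only that the log form gives the same ranking while avoiding underflow when many small probabilities are multiplied.

```python
import numpy as np

# Hypothetical per-feature likelihoods P(x_i | C_k) for one sample: 2 classes, 3 features
toy_likelihoods = np.array([
    [0.20, 0.50, 0.10],   # class 0
    [0.05, 0.40, 0.30],   # class 1
])
toy_priors = np.array([0.6, 0.4])  # hypothetical P(C_k)

# Naive assumption: the joint likelihood is the product over features
joint = toy_likelihoods.prod(axis=1)
unnormalized = joint * toy_priors

# Same quantity computed in log space
log_unnormalized = np.log(toy_priors) + np.log(toy_likelihoods).sum(axis=1)

# Dividing by the evidence P(x) turns the scores into a posterior
toy_posterior = unnormalized / unnormalized.sum()
print("Posterior:", toy_posterior)
print("Predicted class:", int(np.argmax(log_unnormalized)))
```

Because the evidence $P(\mathbf{x})$ is the same for every class, the predicted class can be read directly from the unnormalized log scores.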

In [None]:
# Visualize Bayes' Theorem concept
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Example: Medical test
# P(Disease) = 0.01
# P(Positive|Disease) = 0.9 (sensitivity)
# P(Positive|No Disease) = 0.1 (false positive rate)

prior_disease = 0.01
prior_no_disease = 0.99
likelihood_pos_given_disease = 0.9
likelihood_pos_given_no_disease = 0.1

# Calculate posterior using Bayes' theorem
evidence = (likelihood_pos_given_disease * prior_disease + 
            likelihood_pos_given_no_disease * prior_no_disease)

posterior_disease = (likelihood_pos_given_disease * prior_disease) / evidence

# Prior vs Posterior
categories = ['Disease', 'No Disease']
priors = [prior_disease, prior_no_disease]
posteriors = [posterior_disease, 1 - posterior_disease]

x = np.arange(len(categories))
width = 0.35

axes[0].bar(x - width/2, priors, width, label='Prior P(C)', color='lightblue')
axes[0].bar(x + width/2, posteriors, width, label='Posterior P(C|Test+)', color='coral')
axes[0].set_ylabel('Probability', fontsize=12)
axes[0].set_title('Bayes Theorem: Updating Beliefs\n(Medical Test Example)', fontsize=14)
axes[0].set_xticks(x)
axes[0].set_xticklabels(categories)
axes[0].legend()
axes[0].grid(True, alpha=0.3, axis='y')

# Flow diagram
axes[1].text(0.5, 0.9, 'Bayes\' Theorem Flow', ha='center', fontsize=16, fontweight='bold')
axes[1].text(0.5, 0.75, 'Prior P(C)', ha='center', fontsize=12, 
            bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.8))
axes[1].arrow(0.5, 0.7, 0, -0.08, head_width=0.05, head_length=0.03, fc='black')
axes[1].text(0.5, 0.55, 'Likelihood P(X|C)', ha='center', fontsize=12,
            bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.8))
axes[1].arrow(0.5, 0.5, 0, -0.08, head_width=0.05, head_length=0.03, fc='black')
axes[1].text(0.5, 0.35, 'Evidence P(X)', ha='center', fontsize=12,
            bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.8))
axes[1].arrow(0.5, 0.3, 0, -0.08, head_width=0.05, head_length=0.03, fc='black')
axes[1].text(0.5, 0.15, 'Posterior P(C|X)', ha='center', fontsize=12,
            bbox=dict(boxstyle='round', facecolor='coral', alpha=0.8))
axes[1].set_xlim(0, 1)
axes[1].set_ylim(0, 1)
axes[1].axis('off')

plt.tight_layout()
plt.show()

print(f"Prior probability of disease: {prior_disease:.1%}")
print(f"Posterior probability given positive test: {posterior_disease:.1%}")

## 2. Gaussian Naive Bayes

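Assumes each feature follows a **Gaussian (normal) distribution** within each class, with its own mean $\mu_{k,i}$ and variance $\sigma_{k,i}^2$ for class $C_k$ and feature $i$:

$$P(x_i|C_k) = \frac{1}{\sqrt{2\pi\sigma_{k,i}^2}} \exp\left(-\frac{(x_i - \mu_{k,i})^2}{2\sigma_{k,i}^2}\right)$$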

Used for continuous features.
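
A minimal sketch of evaluating this density for a single feature, assuming hypothetical class statistics (the means and standard deviations below are invented, not estimated from any dataset). It writes the formula out by hand and checks it against `scipy.stats.norm.logpdf`.

```python
import numpy as np
from scipy.stats import norm

x_value = 5.0  # observed value of one feature (hypothetical)

# Hypothetical per-class mean and standard deviation for this feature
class_mu = np.array([4.8, 6.0])
class_sigma = np.array([0.4, 0.5])

# Density written out from the formula
manual_density = (
    np.exp(-(x_value - class_mu) ** 2 / (2 * class_sigma ** 2))
    / np.sqrt(2 * np.pi * class_sigma ** 2)
)

# Same density via SciPy; logpdf is the numerically safer form
log_density = norm.logpdf(x_value, loc=class_mu, scale=class_sigma)

print("P(x_i | C_k):", manual_density)
print("Most likely class for this feature alone:", int(np.argmax(log_density)))
```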

In [None]:
# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names
target_names = iris.target_names

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Visualize feature distributions by class
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

for idx in range(len(feature_names)):
    for class_idx in range(len(target_names)):
        feature_values = X_train[y_train == class_idx, idx]
        
        # Plot histogram
        axes[idx].hist(feature_values, bins=20, alpha=0.5, label=target_names[class_idx])
        
        # Fit Gaussian and plot
        mu, sigma = feature_values.mean(), feature_values.std()
        x = np.linspace(feature_values.min(), feature_values.max(), 100)
        axes[idx].plot(x, norm.pdf(x, mu, sigma) * len(feature_values) * 
                      (feature_values.max() - feature_values.min()) / 20, linewidth=2)
    
    axes[idx].set_xlabel(feature_names[idx], fontsize=12)
    axes[idx].set_ylabel('Frequency', fontsize=12)
    axes[idx].set_title(f'Distribution: {feature_names[idx]}', fontsize=14)
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 3. Gaussian Naive Bayes Implementation from Scratch

In [None]:
class GaussianNaiveBayesScratch:
    """Gaussian Naive Bayes from scratch"""
    
    def fit(self, X, y):
        """Calculate priors, means, and variances"""
        n_samples, n_features = X.shape
        self.classes = np.unique(y)
        n_classes = len(self.classes)
        
        # Initialize storage
        self.priors = np.zeros(n_classes)
        self.means = np.zeros((n_classes, n_features))
        self.vars = np.zeros((n_classes, n_features))
        
        # Calculate statistics for each class
        for idx, c in enumerate(self.classes):
            X_c = X[y == c]
            self.priors[idx] = X_c.shape[0] / n_samples
            self.means[idx, :] = X_c.mean(axis=0)
            self.vars[idx, :] = X_c.var(axis=0)
        
        return self
    
    def _gaussian_log_probability(self, x, mean, var):
        """Calculate the log of the Gaussian probability density"""
        var = var + 1e-9  # Variance smoothing: avoid division by zero
        return -0.5 * np.log(2.0 * np.pi * var) - (x - mean)**2 / (2.0 * var)
    
    def predict(self, X):
        """Predict class labels"""
        predictions = [self._predict_single(x) for x in X]
        return np.array(predictions)
    
    def _predict_single(self, x):
        """Predict single sample"""
        posteriors = []
        
        for idx, c in enumerate(self.classes):
            # Log prior
            prior = np.log(self.priors[idx])
            
            # Log likelihood: sum of per-feature log densities
            likelihood = np.sum(
                self._gaussian_log_probability(x, self.means[idx, :], self.vars[idx, :])
            )
            
            # Log posterior
            posterior = prior + likelihood
            posteriors.append(posterior)
        
        return self.classes[np.argmax(posteriors)]

# Train and evaluate
gnb_scratch = GaussianNaiveBayesScratch()
gnb_scratch.fit(X_train, y_train)
y_pred_scratch = gnb_scratch.predict(X_test)

print(f"Accuracy (from scratch): {accuracy_score(y_test, y_pred_scratch):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_scratch, target_names=target_names))
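
The scratch class above returns only labels. One way to turn per-class log posteriors into normalized probabilities is the log-sum-exp trick, sketched below with `scipy.special.logsumexp`. The helper name `log_scores_to_proba` and the input values are hypothetical and not part of the class above.

```python
import numpy as np
from scipy.special import logsumexp

def log_scores_to_proba(log_scores):
    """Normalize unnormalized log posteriors (n_samples, n_classes) into probabilities."""
    log_scores = np.asarray(log_scores, dtype=float)
    # Subtracting log(sum(exp(.))) per row is a stable way to divide by the evidence P(x)
    return np.exp(log_scores - logsumexp(log_scores, axis=1, keepdims=True))

# Hypothetical log prior + log likelihood values; exp() of these alone would underflow to 0
toy_log_scores = np.array([
    [-1000.0, -1002.0, -1010.0],
    [-750.0, -745.0, -760.0],
])
toy_proba = log_scores_to_proba(toy_log_scores)
print(toy_proba.round(4))
```

Exponentiating the raw scores directly would give `0 / 0` here, which is why the normalization is done in log space first.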

## 4. Scikit-learn Gaussian Naive Bayes

In [None]:
# Using sklearn's GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
y_prob = gnb.predict_proba(X_test)

print(f"Accuracy (sklearn): {accuracy_score(y_test, y_pred):.4f}")
print(f"\nClass Priors: {gnb.class_prior_}")
print(f"\nFeature Means per Class:")
print(pd.DataFrame(gnb.theta_, columns=feature_names, index=target_names))
print(f"\nFeature Variances per Class:")
print(pd.DataFrame(gnb.var_, columns=feature_names, index=target_names))

In [None]:
# Visualize results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0],
            xticklabels=target_names, yticklabels=target_names)
axes[0].set_xlabel('Predicted', fontsize=12)
axes[0].set_ylabel('Actual', fontsize=12)
axes[0].set_title('Confusion Matrix', fontsize=14)

# Prediction Probabilities
sample_probs = y_prob[:10]
x_pos = np.arange(len(sample_probs))
width = 0.25

for i, class_name in enumerate(target_names):
    axes[1].bar(x_pos + i*width, sample_probs[:, i], width, label=class_name)

axes[1].set_xlabel('Sample', fontsize=12)
axes[1].set_ylabel('Probability', fontsize=12)
axes[1].set_title('Prediction Probabilities (First 10 samples)', fontsize=14)
axes[1].set_xticks(x_pos + width)
axes[1].set_xticklabels(range(10))
axes[1].legend()
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

## 5. Multinomial Naive Bayes (Text Classification)

Used for **discrete count data** (e.g., word counts in text):

$$P(x_i|C_k) = \frac{N_{x_i,C_k} + \alpha}{N_{C_k} + \alpha n}$$

Where:
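- $N_{x_i,C_k}$ = count of feature $x_i$ in class $C_k$
- $N_{C_k}$ = total count of all features in class $C_k$
- $n$ = number of features (vocabulary size for text)
- $\alpha$ = smoothing parameter ($\alpha = 1$ is Laplace smoothing)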
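
To make the smoothed estimate concrete, the sketch below computes it by hand on a hypothetical word-count matrix and compares the result with the `feature_log_prob_` attribute of a fitted `MultinomialNB`. The toy data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word-count matrix: 4 documents, 3 vocabulary terms (hypothetical data)
X_toy = np.array([
    [2, 1, 0],
    [3, 0, 0],
    [0, 2, 1],
    [0, 1, 3],
])
y_toy = np.array([0, 0, 1, 1])
alpha = 1.0
n_features = X_toy.shape[1]

# Smoothed estimate: (N_{x_i,C_k} + alpha) / (N_{C_k} + alpha * n)
manual_probs = []
for c in np.unique(y_toy):
    counts = X_toy[y_toy == c].sum(axis=0)  # N_{x_i, C_k} for every feature
    manual_probs.append((counts + alpha) / (counts.sum() + alpha * n_features))
manual_probs = np.array(manual_probs)

# Compare against scikit-learn's fitted parameters
mnb_toy = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)
matches = np.allclose(manual_probs, np.exp(mnb_toy.feature_log_prob_))
print(manual_probs)
print("Matches scikit-learn:", matches)
```

Note that the third term never appears in class 0, yet its smoothed probability is 1/9 rather than 0, so a single unseen word cannot zero out the whole product.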

In [None]:
# Simple text classification example
documents = [
    "Python is great for machine learning",
    "I love Python programming",
    "Machine learning is fascinating",
    "Java is used for enterprise applications",
    "I prefer Java over other languages",
    "Enterprise software often uses Java",
    "Deep learning uses neural networks",
    "Python is perfect for data science"
]

labels = ['Python', 'Python', 'ML', 'Java', 'Java', 'Java', 'ML', 'Python']

# Convert to numerical features
vectorizer = CountVectorizer()
X_text = vectorizer.fit_transform(documents)
feature_names_text = vectorizer.get_feature_names_out()

# Train Multinomial NB
mnb = MultinomialNB(alpha=1.0)  # alpha is Laplace smoothing
mnb.fit(X_text, labels)

# Test
test_docs = [
    "Python is amazing",
    "Java enterprise development",
    "Machine learning with Python"
]

X_test_text = vectorizer.transform(test_docs)
predictions = mnb.predict(X_test_text)
probabilities = mnb.predict_proba(X_test_text)

print("Text Classification Results:\n")
for doc, pred, prob in zip(test_docs, predictions, probabilities):
    print(f"Document: '{doc}'")
    print(f"Prediction: {pred}")
    print(f"Probabilities: {dict(zip(mnb.classes_, prob))}")
    print()

In [None]:
# Visualize word importance per class
log_prob = mnb.feature_log_prob_
classes = mnb.classes_

fig, axes = plt.subplots(1, len(classes), figsize=(16, 5))

for idx, class_name in enumerate(classes):
    # Get top 10 words for this class
    top_indices = np.argsort(log_prob[idx])[-10:]
    top_features = [feature_names_text[i] for i in top_indices]
    top_scores = log_prob[idx][top_indices]
    
    axes[idx].barh(top_features, np.exp(top_scores), color='skyblue')
    axes[idx].set_xlabel('Probability', fontsize=12)
    axes[idx].set_title(f'Top Words for "{class_name}"', fontsize=14)
    axes[idx].grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

## 6. Real-World Example: Spam Detection

In [None]:
# Simulate spam/ham emails
spam_emails = [
    "Win money now! Click here for free cash!",
    "Congratulations! You won a lottery!",
    "Free viagra! Buy now!",
    "Make money fast! Limited time offer!",
    "Click here to claim your prize money!",
    "Get rich quick! Amazing opportunity!"
]

ham_emails = [
    "Meeting scheduled for tomorrow at 3pm",
    "Can you send me the project report?",
    "Let's have lunch next week",
    "The presentation looks great",
    "Please review the attached document",
    "Thank you for your help yesterday"
]

all_emails = spam_emails + ham_emails
email_labels = ['spam'] * len(spam_emails) + ['ham'] * len(ham_emails)

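# Convert to features using TF-IDF
# Note: the vectorizer is fitted on all emails before the split, so test emails
# influence the vocabulary and IDF weights; acceptable only for a toy demo
tfidf = TfidfVectorizer(max_features=50)
X_email = tfidf.fit_transform(all_emails)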

# Train-test split
X_train_email, X_test_email, y_train_email, y_test_email = train_test_split(
    X_email, email_labels, test_size=0.3, random_state=42
)

# Train Multinomial NB
spam_classifier = MultinomialNB()
spam_classifier.fit(X_train_email, y_train_email)

# Evaluate
y_pred_email = spam_classifier.predict(X_test_email)
print(f"Spam Classifier Accuracy: {accuracy_score(y_test_email, y_pred_email):.4f}")

# Test on new emails
new_emails = [
    "Free money! Win now!",
    "Can we reschedule the meeting?"
]

X_new = tfidf.transform(new_emails)
predictions_new = spam_classifier.predict(X_new)
probs_new = spam_classifier.predict_proba(X_new)

print("\nNew Email Predictions:\n")
for email, pred, prob in zip(new_emails, predictions_new, probs_new):
    print(f"Email: '{email}'")
    print(f"Prediction: {pred}")
    print(f"Confidence: {max(prob):.2%}\n")
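
The cell above fits the TF-IDF vectorizer on every email before splitting. One possible alternative, sketched here under the assumption that raw text is split first, is to wrap the vectorizer and classifier in a scikit-learn pipeline so the vocabulary is learned from training text only. The miniature corpus and all variable names below are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical miniature corpus, for illustration only
texts = [
    "Win free cash now", "Claim your prize money", "Free lottery win",
    "Make money fast", "Meeting at 3pm tomorrow", "Please send the report",
    "Lunch next week", "Review the attached document",
]
targets = ["spam"] * 4 + ["ham"] * 4

# Split the raw text first, keeping one class balance in both parts
texts_train, texts_test, t_train, t_test = train_test_split(
    texts, targets, test_size=0.25, random_state=0, stratify=targets
)

# The vectorizer is fitted inside the pipeline, on training text only
spam_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
spam_pipeline.fit(texts_train, t_train)

pipeline_preds = spam_pipeline.predict(texts_test)
print(list(zip(texts_test, pipeline_preds)))
```

With so few documents the predictions themselves mean little; the sketch is about where `fit` sees the data, not about accuracy.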

## 7. Bernoulli Naive Bayes

Used for **binary/boolean features**:

$$P(x_i|C_k) = P(i|C_k)x_i + (1-P(i|C_k))(1-x_i)$$

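Here $P(i|C_k)$ is the probability that feature $i$ is present in class $C_k$. Each feature is either present (1) or absent (0), so the absence of a feature also counts as evidence.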
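
The formula above can be evaluated directly with NumPy. The sketch below uses hypothetical presence probabilities for two classes and one binary feature vector; none of the numbers come from the email data.

```python
import numpy as np

# Hypothetical P(feature i present | C_k) for 2 classes and 4 binary features
p_present = np.array([
    [0.8, 0.1, 0.6, 0.3],   # class 0
    [0.2, 0.7, 0.5, 0.9],   # class 1
])
x_binary = np.array([1, 0, 1, 0])  # observed presence/absence vector

# Per-feature likelihood: p if the feature is present, (1 - p) if it is absent
per_feature = p_present * x_binary + (1 - p_present) * (1 - x_binary)

# Naive Bayes multiplies across features; summing logs is the stable equivalent
bernoulli_log_likelihood = np.log(per_feature).sum(axis=1)
print(per_feature)
print("Log-likelihood per class:", bernoulli_log_likelihood)
```

Features 2 and 4 are absent in this sample, and class 1 expects both to be present, so their absence lowers the class 1 likelihood. A multinomial model would simply ignore them.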

In [None]:
# Binary features example
from sklearn.preprocessing import Binarizer

# Binarize the email data
binarizer = Binarizer()
X_binary = binarizer.transform(X_email.toarray())

X_train_bin, X_test_bin, y_train_bin, y_test_bin = train_test_split(
    X_binary, email_labels, test_size=0.3, random_state=42
)

# Compare Multinomial vs Bernoulli
bnb = BernoulliNB()
bnb.fit(X_train_bin, y_train_bin)

mnb_score = MultinomialNB().fit(X_train_email, y_train_email).score(X_test_email, y_test_email)
bnb_score = bnb.score(X_test_bin, y_test_bin)

print(f"Multinomial NB Accuracy: {mnb_score:.4f}")
print(f"Bernoulli NB Accuracy: {bnb_score:.4f}")

## 8. Comparing All Naive Bayes Variants

In [None]:
# Load breast cancer dataset
cancer = load_breast_cancer()
X_cancer, y_cancer = cancer.data, cancer.target

X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X_cancer, y_cancer, test_size=0.3, random_state=42
)

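# Train different variants
# Note: these features are continuous, so only Gaussian NB matches the data.
# Multinomial NB runs because all values are non-negative, and Bernoulli NB
# thresholds every feature at 0 by default (binarize=0.0), discarding most information.
models = {
    'Gaussian NB': GaussianNB(),
    'Multinomial NB': MultinomialNB(),
    'Bernoulli NB': BernoulliNB()
}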

results = []
for name, model in models.items():
    # Cross-validation
    cv_scores = cross_val_score(model, X_train_c, y_train_c, cv=5)
    
    # Train and test
    model.fit(X_train_c, y_train_c)
    test_score = model.score(X_test_c, y_test_c)
    
    results.append({
        'Model': name,
        'CV Mean': cv_scores.mean(),
        'CV Std': cv_scores.std(),
        'Test Score': test_score
    })

results_df = pd.DataFrame(results)
print(results_df.to_string(index=False))

# Visualize comparison
fig, ax = plt.subplots(figsize=(10, 6))
x_pos = np.arange(len(results_df))
ax.bar(x_pos, results_df['Test Score'], yerr=results_df['CV Std'], 
       alpha=0.7, capsize=10, color=['skyblue', 'lightgreen', 'coral'])
ax.set_xticks(x_pos)
ax.set_xticklabels(results_df['Model'])
ax.set_ylabel('Accuracy', fontsize=12)
ax.set_title('Naive Bayes Variants Comparison', fontsize=14)
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

## Summary

### Key Takeaways

1. **Bayes' Theorem**: Foundation of probabilistic classification
2. **Naive Assumption**: Features are conditionally independent (rarely true but works well)
3. **Three Variants**:
   - **Gaussian**: Continuous features (normal distribution)
   - **Multinomial**: Discrete counts (text, word frequencies)
   - **Bernoulli**: Binary features (presence/absence)
4. **Laplace Smoothing**: Prevents zero probabilities for feature values never seen with a class
5. **Fast Training**: Only needs per-class counts or means and variances, computed in one pass over the data

### Pros and Cons

**Pros:**
- Fast training and prediction
- Works well with high-dimensional data
- Can work with relatively small training sets
- Naturally handles multi-class problems
- Provides probability estimates
- Relatively robust to irrelevant features

**Cons:**
- Strong independence assumption (rarely true)
- Can be outperformed by more sophisticated models
- Zero-frequency problem (mitigated with Laplace smoothing)
- Gaussian NB assumes each feature is normally distributed within each class

### When to Use Naive Bayes

**Best for:**
- Text classification (spam detection, sentiment analysis)
- Document categorization
- Real-time prediction (fast)
- High-dimensional data
- Baseline model

**Avoid when:**
- Features are highly correlated
- You need well-calibrated probability estimates (NB probabilities tend to be pushed toward 0 or 1)
- Features have complex interactions
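
The first two points are linked, and a small sketch can show why. It assumes two hypothetical one-dimensional Gaussian classes with equal priors and counts the same feature several times, as Naive Bayes effectively does with perfectly correlated features. The function name `posterior_class1` is invented for this illustration.

```python
import numpy as np
from scipy.stats import norm

def posterior_class1(x_value, n_copies):
    """P(C_1 | x) when the same feature is counted n_copies times (equal priors)."""
    # Hypothetical class-conditional Gaussians for a single feature
    ll0 = norm.logpdf(x_value, loc=0.0, scale=1.0)
    ll1 = norm.logpdf(x_value, loc=1.0, scale=1.0)
    # Treating the copies as independent multiplies the likelihoods,
    # which scales the log odds by n_copies
    log_odds = n_copies * (ll1 - ll0)
    return 1.0 / (1.0 + np.exp(-log_odds))

for n_copies in (1, 2, 5):
    print(n_copies, round(posterior_class1(1.0, n_copies), 3))
```

The evidence is identical in every case, yet the reported confidence grows with each duplicated feature. The predicted class stays the same, which is why accuracy often holds up even when the probabilities are overconfident.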

### Practice Problems

1. Implement Laplace smoothing in the scratch version
2. Build a sentiment analyzer for movie reviews
3. Compare NB with Logistic Regression on text data
4. Analyze the effect of alpha (smoothing parameter) on performance