# Module 11: Naive Bayes

**Difficulty**: ⭐⭐ Intermediate  
**Estimated Time**: 60 minutes  
**Prerequisites**: [Module 04 - Logistic Regression](04_logistic_regression.ipynb), [Module 06 - Model Evaluation](06_model_evaluation_metrics.ipynb)

## Learning Objectives
By the end of this notebook, you will be able to:
1. Understand Bayes' theorem and its application to classification
2. Explain the "naive" conditional independence assumption
3. Apply Gaussian Naive Bayes to continuous features
4. Use Multinomial Naive Bayes for count/frequency data
5. Apply Bernoulli Naive Bayes to binary features
6. Recognize when Naive Bayes excels (text classification, spam detection)
7. Handle zero probabilities using Laplace smoothing
8. Appreciate the speed and efficiency advantages of Naive Bayes

## 1. Introduction: The Power of Probability

### What is Naive Bayes?

Naive Bayes is a **probability-based classification algorithm** built on Bayes' theorem. Despite its simplicity (and the "naive" assumption), it works surprisingly well in many real-world applications!

### Real-World Example: Email Spam Detection

Imagine you're building a spam filter:
- You receive an email containing the word "FREE"
- Question: Is this email spam?
- Naive Bayes answers: "What's the probability this is spam, given it contains 'FREE'?"

**Key insight**: Instead of learning complex decision boundaries, Naive Bayes calculates probabilities!

### Why "Naive"?

The algorithm assumes that **all features are independent** given the class label. 

Example with spam:
- It assumes that, within a class such as spam, seeing "FREE" tells you nothing about whether "MONEY" also appears
- In reality, these words often appear together in spam
- **But**: This "naive" assumption simplifies calculations enormously
- **And**: It works well in practice despite being unrealistic!

### Advantages of Naive Bayes

✅ **Extremely fast** training and prediction  
✅ **Works well with high dimensions** (thousands of features)  
✅ **Requires little training data** compared to other algorithms  
✅ **Handles multi-class classification** naturally  
✅ **Probabilistic predictions** (not just class labels)  
✅ **Great for text classification** and categorical data

## 2. Setup and Data Loading

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.metrics import roc_curve, auc, roc_auc_score
import warnings

# Configuration
warnings.filterwarnings('ignore')
np.random.seed(42)
%matplotlib inline

# Set plot style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('Set2')

print('✓ All libraries imported successfully!')
print('✓ Random seed set to 42 for reproducibility')

## 3. Bayes' Theorem: The Foundation

### The Formula

Bayes' theorem tells us how to update our beliefs based on new evidence:

$$P(Class|Features) = \frac{P(Features|Class) \times P(Class)}{P(Features)}$$

**In words:**
- **P(Class|Features)**: Probability of class given features (what we want!)
- **P(Features|Class)**: Probability of seeing these features if it's this class
- **P(Class)**: Prior probability of the class (how common is it?)
- **P(Features)**: Probability of seeing these features (normalization constant)
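
The denominator is what makes the posteriors sum to one. Expanding it with the law of total probability (a standard identity, written here with a class index $c$ that is introduced only for this illustration) shows why it can be ignored when classes are merely compared:

$$P(Features) = \sum_{c} P(Features|c) \times P(c)$$

Because this sum is the same for every class, the predicted class is simply the one with the largest numerator $P(Features|Class) \times P(Class)$.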

### Intuitive Example: Medical Diagnosis

**Scenario**: Testing for a rare disease
- Disease affects 1% of population: P(Disease) = 0.01
- Test is 99% accurate: P(Positive|Disease) = 0.99
- Test has 1% false positive rate: P(Positive|Healthy) = 0.01

**Question**: If you test positive, what's the probability you have the disease?

**Naive guess**: 99% (the test accuracy)

**Bayes' theorem answer**: Much lower! Let's calculate...

In [None]:
# Medical diagnosis example with Bayes' theorem
# Given information
P_disease = 0.01          # 1% of people have the disease (prior)
P_healthy = 0.99          # 99% are healthy
P_pos_given_disease = 0.99  # Test correctly identifies disease 99% of the time
P_pos_given_healthy = 0.01  # Test incorrectly says positive 1% of the time

# Calculate P(Positive) - probability of testing positive overall
# This happens if: (you have disease AND test positive) OR (you're healthy AND false positive)
P_positive = (P_pos_given_disease * P_disease) + (P_pos_given_healthy * P_healthy)

# Apply Bayes' theorem: P(Disease|Positive)
P_disease_given_positive = (P_pos_given_disease * P_disease) / P_positive

print("Medical Diagnosis Example:")
print("=" * 50)
print(f"Prior probability of disease: {P_disease:.1%}")
print(f"Test accuracy (sensitivity):  {P_pos_given_disease:.1%}")
print(f"False positive rate:          {P_pos_given_healthy:.1%}")
print("\n" + "=" * 50)
print(f"\nProbability of disease GIVEN positive test: {P_disease_given_positive:.1%}")
print("\n" + "=" * 50)

print("\n🔍 Key Insight:")
print("Even with a 99% accurate test, if the disease is rare,")
print("a positive test only means ~50% chance of actually having it!")
print("This is because false positives are as common as true positives here.")
print("\nThis is the power of Bayes' theorem - it accounts for base rates!")

## 4. The Naive Assumption Explained

### Feature Independence

For classification with multiple features, we need:

$$P(x_1, x_2, ..., x_n | Class)$$

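**Problem**: Estimating this joint probability directly requires a separate probability for every combination of feature values, and that number grows exponentially with the number of features!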

**Naive assumption**: Features are conditionally independent given the class

$$P(x_1, x_2, ..., x_n | Class) = P(x_1|Class) \times P(x_2|Class) \times ... \times P(x_n|Class)$$

**Example**: Email spam with features "FREE" and "MONEY"
- Reality: If "FREE" appears, "MONEY" is more likely (they're correlated)
- Naive Bayes: Assumes they're independent
- P(FREE, MONEY | Spam) = P(FREE | Spam) × P(MONEY | Spam)
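
The factorization above can be sketched in a few lines. The word probabilities and the prior below are made-up numbers for illustration, not estimates from any dataset, and the helper name `naive_posterior` is hypothetical. Working in log space is a common way to avoid underflow when many small probabilities are multiplied.

```python
import math

# Hypothetical per-class word probabilities (illustrative values only)
likelihoods = {
    "spam": {"FREE": 0.80, "MONEY": 0.60},
    "ham":  {"FREE": 0.10, "MONEY": 0.05},
}
priors = {"spam": 0.3, "ham": 0.7}

def naive_posterior(words, likelihoods, priors):
    """Return P(class | words), assuming words are independent given the class."""
    # log P(class) + sum of log P(word | class): the naive factorization in log space
    log_scores = {
        cls: math.log(priors[cls]) + sum(math.log(likelihoods[cls][w]) for w in words)
        for cls in priors
    }
    # Normalize so the posteriors sum to 1 (subtract the max for numerical stability)
    max_log = max(log_scores.values())
    unnormalized = {cls: math.exp(s - max_log) for cls, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {cls: v / total for cls, v in unnormalized.items()}

posterior = naive_posterior(["FREE", "MONEY"], likelihoods, priors)
for cls, p in posterior.items():
    print(f"P({cls} | FREE, MONEY) = {p:.3f}")
```

With these assumed numbers the spam score is 0.3 × 0.8 × 0.6 and the ham score is 0.7 × 0.1 × 0.05, so spam dominates after normalization.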

### Why Does This Work?

Even though the assumption is "naive" (usually violated), Naive Bayes often works because:
1. We only need to rank classes, not get exact probabilities
2. The relative ordering is often correct even if absolute values aren't
3. The bias introduced often helps prevent overfitting!
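
A minimal sketch of points 1 and 2, using made-up numbers: counting the same piece of evidence twice (a perfectly correlated duplicate feature) leaves the class ranking unchanged in this example but pushes the probability toward an extreme. The function name `posterior_spam` is hypothetical, and equal priors are assumed.

```python
def posterior_spam(p_x_spam, p_x_ham, copies=1, prior_spam=0.5):
    """Posterior P(spam | x) when the same feature is counted `copies` times."""
    spam_score = prior_spam * p_x_spam ** copies
    ham_score = (1 - prior_spam) * p_x_ham ** copies
    return spam_score / (spam_score + ham_score)

# Illustrative likelihoods: the word is 4x more likely in spam than in ham
once = posterior_spam(0.8, 0.2, copies=1)
twice = posterior_spam(0.8, 0.2, copies=2)  # duplicate feature treated as independent

print(f"Feature counted once:  P(spam | x) = {once:.3f}")
print(f"Feature counted twice: P(spam | x) = {twice:.3f}")
```

Both cases predict spam, so the ranking survives, but the second probability is overconfident because the evidence was double-counted.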

## 5. Gaussian Naive Bayes: For Continuous Features

**Use when**: Features are continuous and approximately normally distributed within each class

**How it works**:
1. For each feature and each class, calculate mean (μ) and variance (σ²)
2. Assume feature values follow a Gaussian (normal) distribution
3. Calculate P(feature|class) using the Gaussian probability density function

**Best for**: Iris classification, medical measurements, sensor data
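
The three steps above can be sketched by hand with NumPy. This is an illustrative re-implementation on synthetic data, not scikit-learn's code: the data, the variable names, and the helper `gaussian_log_pdf` are all assumptions made for the example. scikit-learn's `GaussianNB` additionally adds a small `var_smoothing` term to the variances, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class data (illustrative): 2 features, class means at 0 and 5
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(5.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

classes = np.unique(y)
# Step 1: per-class mean and variance of every feature, plus class priors
means = np.array([X[y == c].mean(axis=0) for c in classes])
variances = np.array([X[y == c].var(axis=0) for c in classes])
priors = np.array([(y == c).mean() for c in classes])

def gaussian_log_pdf(x, mean, var):
    """Log of the normal density, evaluated element-wise."""
    return -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)

# Steps 2-3: sum per-feature log densities (naive assumption) and add the log prior
log_joint = np.array([
    np.log(priors[k]) + gaussian_log_pdf(X, means[k], variances[k]).sum(axis=1)
    for k in range(len(classes))
]).T  # shape: (n_samples, n_classes)

manual_pred = classes[np.argmax(log_joint, axis=1)]
manual_accuracy = (manual_pred == y).mean()
print(f"Manual Gaussian NB training accuracy: {manual_accuracy:.3f}")
```

Summing log densities rather than multiplying densities keeps the computation numerically stable when there are many features.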

In [None]:
# Load Iris dataset for Gaussian Naive Bayes
iris_df = pd.read_csv('data/sample/iris.csv')

print("Iris Dataset for Gaussian Naive Bayes:")
print(f"Shape: {iris_df.shape}")
print(f"\nFeatures: Continuous measurements (perfect for Gaussian NB!)")
print(iris_df.head())
print("\nClass distribution:")
print(iris_df['species'].value_counts())

In [None]:
# Prepare data for Gaussian Naive Bayes
X_iris = iris_df.drop('species', axis=1).values
y_iris = iris_df['species'].values

# Split into train and test
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42, stratify=y_iris
)

print(f"Training set: {X_train_iris.shape}")
print(f"Test set: {X_test_iris.shape}")

In [None]:
# Train Gaussian Naive Bayes
# Note: Unlike KNN, we don't need to scale features for Naive Bayes!
# (But it doesn't hurt either)
gnb = GaussianNB()

# Fit the model - this is extremely fast!
gnb.fit(X_train_iris, y_train_iris)

# Make predictions
y_pred_iris = gnb.predict(X_test_iris)

# Get probability estimates
y_prob_iris = gnb.predict_proba(X_test_iris)

# Evaluate
accuracy_iris = accuracy_score(y_test_iris, y_pred_iris)

print("Gaussian Naive Bayes Results:")
print("=" * 50)
print(f"Accuracy: {accuracy_iris:.3f}")
print("\nClassification Report:")
print(classification_report(y_test_iris, y_pred_iris))

print("\n✓ Training was instant! Naive Bayes is very fast.")

In [None]:
# Visualize probability predictions
# Show the first 10 test samples with their predicted probabilities
print("Probability Predictions for First 10 Test Samples:")
print("=" * 70)
print(f"{'True Label':<15} {'Predicted':<15} {'Probabilities':<40}")
print("=" * 70)

classes = gnb.classes_
for i in range(10):
    true_label = y_test_iris[i]
    pred_label = y_pred_iris[i]
    probs = y_prob_iris[i]
    
    prob_str = ", ".join([f"{cls}: {p:.2%}" for cls, p in zip(classes, probs)])
    
    match = "✓" if true_label == pred_label else "✗"
    print(f"{true_label:<15} {pred_label:<15} {prob_str:<40} {match}")

print("\n💡 Insight: Naive Bayes gives probability estimates, not just predictions!")
print("   These can feed custom decision thresholds, but use them with care: they are often poorly calibrated.")

In [None]:
# Confusion matrix
cm_iris = confusion_matrix(y_test_iris, y_pred_iris)

plt.figure(figsize=(8, 6))
sns.heatmap(cm_iris, annot=True, fmt='d', cmap='Greens',
            xticklabels=classes,
            yticklabels=classes)
plt.xlabel('Predicted', fontsize=12)
plt.ylabel('Actual', fontsize=12)
plt.title('Confusion Matrix - Gaussian Naive Bayes on Iris', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("Excellent performance! Gaussian NB works well with continuous features.")

## 6. Multinomial Naive Bayes: For Count Data

**Use when**: Features represent counts or frequencies

**How it works**:
- Features are counts (non-negative integers)
- Example: Word counts in text documents
- Assumes features follow a multinomial distribution

**Best for**: Text classification, document categorization, word frequency analysis

**Common applications**:
- Spam detection
- Sentiment analysis
- Topic classification
- Language detection
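
As a rough sketch of the text workflow, raw documents are typically converted to word counts with `CountVectorizer` and then passed to `MultinomialNB`. The six-document corpus and its labels below are invented purely for illustration, and the variable name `text_clf` is hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hypothetical corpus (illustrative only)
train_docs = [
    "free money now",
    "win free prize money",
    "claim your free prize",
    "meeting at noon tomorrow",
    "project report attached",
    "lunch tomorrow at noon",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# CountVectorizer turns each document into a vector of word counts,
# which is the input format Multinomial NB expects
text_clf = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
text_clf.fit(train_docs, train_labels)

new_docs = ["free prize money", "report for tomorrow meeting"]
predictions = text_clf.predict(new_docs)
for doc, label in zip(new_docs, predictions):
    print(f"{doc!r} -> {label}")
```

Words that never appeared in training (such as "for" here) are simply dropped by the fitted vectorizer, so they do not affect the prediction.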

In [None]:
# Simulate text classification data (word counts)
# In practice, you'd use CountVectorizer or TfidfVectorizer on real text

# For demonstration, let's use the synthetic classification dataset
# We'll treat it as "word count" data
text_df = pd.read_csv('data/sample/synthetic_classification.csv')

print("Synthetic Data for Multinomial Naive Bayes:")
print(f"Shape: {text_df.shape}")
print(f"\nFirst few rows (imagine these are word counts):")
print(text_df.head())
print(f"\nTarget distribution:")
print(text_df['target'].value_counts())

In [None]:
# Prepare data for Multinomial Naive Bayes
X_text = text_df.drop('target', axis=1).values
y_text = text_df['target'].values

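# Split first so the shift below is computed from training data only
X_train_text_raw, X_test_text_raw, y_train_text, y_test_text = train_test_split(
    X_text, y_text, test_size=0.3, random_state=42, stratify=y_text
)

# Multinomial NB requires non-negative features
# Shift by the training minimum (as if they were counts);
# clip guards against test values that fall below the training minimum
shift = X_train_text_raw.min()
X_train_text = X_train_text_raw - shift + 1
X_test_text = np.clip(X_test_text_raw - shift + 1, 0, None)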

print(f"Training set: {X_train_text.shape}")
print(f"Test set: {X_test_text.shape}")
print("\n✓ Features are now non-negative (required for Multinomial NB)")

In [None]:
# Train Multinomial Naive Bayes
# alpha parameter: additive smoothing strength (more on this later!)
mnb = MultinomialNB(alpha=1.0)  # alpha=1.0 is Laplace (add-one) smoothing

# Fit the model
mnb.fit(X_train_text, y_train_text)

# Make predictions
y_pred_text = mnb.predict(X_test_text)

# Evaluate
accuracy_text = accuracy_score(y_test_text, y_pred_text)

print("Multinomial Naive Bayes Results:")
print("=" * 50)
print(f"Accuracy: {accuracy_text:.3f}")
print("\nClassification Report:")
print(classification_report(y_test_text, y_pred_text))

print("\n💡 In practice, Multinomial NB is a popular go-to baseline for text classification!")

## 7. Bernoulli Naive Bayes: For Binary Features

**Use when**: Features are binary (0 or 1, True or False, present or absent)

**How it works**:
- Each feature is either present (1) or absent (0)
- Example: Word appears in document (1) or doesn't (0)
- Assumes features follow a Bernoulli distribution

**Difference from Multinomial**:
- **Multinomial**: "How many times does word X appear?" (counts)
- **Bernoulli**: "Does word X appear at all?" (presence/absence)

**Best for**: Binary feature vectors, short text classification
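
To make the difference from Multinomial concrete, here is a small hand calculation of the Bernoulli likelihood. The vocabulary and the per-class probabilities are invented for illustration, and `bernoulli_log_likelihood` is a hypothetical helper, not a scikit-learn function.

```python
import math

# Hypothetical P(word present | class) for a 3-word vocabulary (illustrative values)
vocabulary = ["free", "money", "meeting"]
p_present = {
    "spam": [0.8, 0.6, 0.1],
    "ham":  [0.1, 0.1, 0.7],
}

def bernoulli_log_likelihood(x, p):
    """log P(x | class): present words contribute log(p), absent words log(1 - p)."""
    return sum(math.log(pj) if xj == 1 else math.log(1 - pj) for xj, pj in zip(x, p))

# Document containing only the word "free"
doc = [1, 0, 0]
log_lik = {cls: bernoulli_log_likelihood(doc, p) for cls, p in p_present.items()}
for cls, ll in log_lik.items():
    print(f"log P(doc | {cls}) = {ll:.3f}  (likelihood = {math.exp(ll):.3f})")
```

Note that the absent words ("money" and "meeting") also change the score through the `1 - p` terms, which is what distinguishes the Bernoulli model from the Multinomial one.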

In [None]:
# Create binary features from breast cancer dataset
bc_df = pd.read_csv('data/sample/breast_cancer.csv')

print("Breast Cancer Dataset for Bernoulli Naive Bayes:")
print(f"Shape: {bc_df.shape}")
print(f"\nFirst few rows:")
print(bc_df.head())
print(f"\nTarget distribution:")
print(bc_df['target'].value_counts())

In [None]:
# Prepare data and binarize features
X_bc = bc_df.drop('target', axis=1).values
y_bc = bc_df['target'].values

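# Split first so the binarization thresholds come from training data only
X_train_raw_bc, X_test_raw_bc, y_train_bc, y_test_bc = train_test_split(
    X_bc, y_bc, test_size=0.3, random_state=42, stratify=y_bc
)

# For Bernoulli NB, we need binary features
# Let's binarize: 1 if above the training median, 0 otherwise
bc_medians = np.median(X_train_raw_bc, axis=0)
X_train_bc = (X_train_raw_bc > bc_medians).astype(int)
X_test_bc = (X_test_raw_bc > bc_medians).astype(int)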

print(f"Training set: {X_train_bc.shape}")
print(f"Test set: {X_test_bc.shape}")
print(f"\nFeatures binarized: {np.unique(X_train_bc)}")
print("(0 = below median, 1 = above median)")

In [None]:
# Train Bernoulli Naive Bayes
bnb = BernoulliNB(alpha=1.0)

# Fit the model
bnb.fit(X_train_bc, y_train_bc)

# Make predictions
y_pred_bc = bnb.predict(X_test_bc)
y_prob_bc = bnb.predict_proba(X_test_bc)

# Evaluate
accuracy_bc = accuracy_score(y_test_bc, y_pred_bc)

print("Bernoulli Naive Bayes Results:")
print("=" * 50)
print(f"Accuracy: {accuracy_bc:.3f}")
print("\nClassification Report:")
print(classification_report(y_test_bc, y_pred_bc))

print("\n💡 Note: Binarizing continuous features can lose information,")
print("   but Bernoulli NB is great for truly binary features!")

## 8. Comparing the Three Variants

In [None]:
# Compare all three Naive Bayes variants on the same dataset (breast cancer)
# Using original continuous features for fair comparison

X_compare = bc_df.drop('target', axis=1).values
y_compare = bc_df['target'].values

X_train_cmp, X_test_cmp, y_train_cmp, y_test_cmp = train_test_split(
    X_compare, y_compare, test_size=0.3, random_state=42, stratify=y_compare
)

# Prepare different feature versions
X_train_binary = (X_train_cmp > np.median(X_train_cmp, axis=0)).astype(int)
X_test_binary = (X_test_cmp > np.median(X_train_cmp, axis=0)).astype(int)

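# Shift by the training minimum so Multinomial NB sees non-negative values;
# clip guards against test values below the training minimum
X_train_pos = X_train_cmp - X_train_cmp.min() + 1
X_test_pos = np.clip(X_test_cmp - X_train_cmp.min() + 1, 0, None)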

# Train all three variants
models = {
    'Gaussian NB': (GaussianNB(), X_train_cmp, X_test_cmp),
    'Multinomial NB': (MultinomialNB(), X_train_pos, X_test_pos),
    'Bernoulli NB': (BernoulliNB(), X_train_binary, X_test_binary)
}

print("Comparing Naive Bayes Variants on Breast Cancer Data:")
print("=" * 60)
print(f"{'Model':<20} {'Accuracy':<15} {'CV Score (mean)':<20}")
print("=" * 60)

results = {}
for name, (model, X_tr, X_te) in models.items():
    # Train and evaluate
    model.fit(X_tr, y_train_cmp)
    accuracy = model.score(X_te, y_test_cmp)
    
    # Cross-validation
    cv_scores = cross_val_score(model, X_tr, y_train_cmp, cv=5)
    cv_mean = cv_scores.mean()
    
    results[name] = {'accuracy': accuracy, 'cv_mean': cv_mean}
    
    print(f"{name:<20} {accuracy:<15.3f} {cv_mean:<20.3f}")

print("=" * 60)

print("\n📊 Insights:")
print("- Gaussian NB: Best for continuous features (our case!)")
print("- Multinomial NB: Better for count data (word frequencies)")
print("- Bernoulli NB: Better for binary features (word presence/absence)")
print("\n✅ Always choose the variant that matches your data type!")

## 9. Laplace Smoothing: Handling Zero Probabilities

### The Zero Probability Problem

**Problem**: What if a feature value never appears in training data for a class?
- P(feature|class) = 0
- Since we multiply probabilities: 0 × anything = 0
- The entire probability becomes 0!
- Model can't make sensible predictions

**Example**:
- Training spam: Never saw the word "quantum"
- Test email contains "quantum"
- P("quantum" | spam) = 0
- P(spam | email) = 0 (even if all other words suggest spam!)

### Laplace Smoothing Solution

**Add a small constant (α) to all counts**:
- α = 0: No smoothing (can have zero probabilities)
- α = 1: Laplace smoothing (add-one smoothing)
- α < 1: Less smoothing
- α > 1: More smoothing

**Effect**: 
- Prevents zero probabilities
- Gives unseen features a small probability
- Acts as regularization
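
The usual additive-smoothing estimate is (count + α) / (total + α × vocabulary size). The sketch below applies it to invented spam word counts in which "quantum" never appears; the counts and the helper name `smoothed_probability` are assumptions for illustration only.

```python
def smoothed_probability(word_count, total_count, vocab_size, alpha=1.0):
    """Additive-smoothing estimate of P(word | class)."""
    return (word_count + alpha) / (total_count + alpha * vocab_size)

# Hypothetical spam training counts (illustrative): "quantum" was never seen
spam_counts = {"free": 30, "money": 20, "quantum": 0}
total = sum(spam_counts.values())
vocab_size = len(spam_counts)

for alpha in [0.0, 0.1, 1.0, 10.0]:
    probs = {w: smoothed_probability(c, total, vocab_size, alpha)
             for w, c in spam_counts.items()}
    print(f"alpha={alpha:<5} " + ", ".join(f"P({w}|spam)={p:.3f}" for w, p in probs.items()))
```

With α = 0 the unseen word gets probability exactly 0. Any positive α gives it a small non-zero share, and larger α values pull all the estimates toward a uniform distribution.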

In [None]:
# Demonstrate the effect of smoothing parameter (alpha)
alpha_values = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
smoothing_results = []

print("Effect of Smoothing Parameter (alpha) on Multinomial NB:")
print("=" * 50)
print(f"{'Alpha':<10} {'Train Accuracy':<20} {'Test Accuracy':<20}")
print("=" * 50)

for alpha in alpha_values:
    # Train with different alpha values
    mnb_smooth = MultinomialNB(alpha=alpha)
    mnb_smooth.fit(X_train_text, y_train_text)
    
    train_acc = mnb_smooth.score(X_train_text, y_train_text)
    test_acc = mnb_smooth.score(X_test_text, y_test_text)
    
    smoothing_results.append({'alpha': alpha, 'train': train_acc, 'test': test_acc})
    
    print(f"{alpha:<10} {train_acc:<20.3f} {test_acc:<20.3f}")

print("=" * 50)

print("\n📊 Observations:")
print("- Small alpha (< 1): Less smoothing, may overfit")
print("- Alpha = 1: Standard Laplace smoothing (good default)")
print("- Large alpha (> 10): Heavy smoothing, may underfit")
print("\n✅ Alpha is a regularization parameter - tune it with cross-validation!")

In [None]:
# Visualize the effect of alpha
alphas = [r['alpha'] for r in smoothing_results]
train_accs = [r['train'] for r in smoothing_results]
test_accs = [r['test'] for r in smoothing_results]

plt.figure(figsize=(10, 6))
plt.semilogx(alphas, train_accs, 'o-', label='Training Accuracy', linewidth=2, markersize=8)
plt.semilogx(alphas, test_accs, 's-', label='Test Accuracy', linewidth=2, markersize=8)
plt.xlabel('Alpha (Smoothing Parameter)', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.title('Effect of Laplace Smoothing on Model Performance', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("If the model overfits, the gap between train and test accuracy typically narrows with more smoothing (regularization effect).")

## 10. When Naive Bayes Excels

### Perfect Use Cases

**1. Text Classification** ⭐⭐⭐⭐⭐
- Spam detection
- Sentiment analysis
- Topic categorization
- Language detection
- Why: High dimensions (thousands of words), independence assumption works reasonably

**2. Real-Time Prediction**
- Need instant predictions
- Training and prediction are extremely fast
- O(n × d) training and O(c × d) prediction per sample (n samples, d features, c classes)

**3. Small Training Datasets**
- Works well with limited data
- Less prone to overfitting than complex models
- Good baseline model

**4. High-Dimensional Data**
- Handles thousands of features well
- Doesn't suffer from curse of dimensionality like KNN
- No feature scaling needed

**5. Multi-Class Problems**
- Naturally handles multiple classes
- No need for one-vs-rest strategy

### When to Avoid Naive Bayes

❌ **Features are highly correlated**
   - Independence assumption is severely violated
   - Consider logistic regression or decision trees

❌ **Need precise probability estimates**
   - Naive Bayes probabilities are often poorly calibrated
   - Rankings are good, absolute values aren't

❌ **Complex feature interactions**
   - Can't capture feature combinations
   - Use tree-based methods or neural networks

❌ **Numerical data with non-linear patterns**
   - Gaussian assumption may not fit
   - Try SVM or random forests

## Exercises

Now it's your turn to practice! Complete these exercises to reinforce your understanding.

### Exercise 1: Manual Bayes' Theorem Calculation

Given this information about email classification:
- 30% of emails are spam: P(Spam) = 0.3
- The word "FREE" appears in 80% of spam emails: P("FREE"|Spam) = 0.8
- The word "FREE" appears in 10% of legitimate emails: P("FREE"|Legitimate) = 0.1

**Tasks:**
1. Calculate P("FREE") - probability of seeing "FREE" overall
2. Calculate P(Spam|"FREE") - probability email is spam given it contains "FREE"
3. If an email contains "FREE", should you classify it as spam?
4. Verify your answer by coding the calculation below

In [None]:
# Your code here
# Hint: Use Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)



### Exercise 2: Comparing Naive Bayes Variants on Wine Dataset

Apply all three Naive Bayes variants to the wine classification dataset.

**Tasks:**
1. Load `wine.csv` dataset
2. Split into train/test (70/30)
3. Train GaussianNB on original features
4. Train MultinomialNB on non-negative features
5. Train BernoulliNB on binarized features
6. Compare accuracies and cross-validation scores
7. Which variant works best for this dataset? Why?

In [None]:
# Your code here
# Hint: Follow the comparison pattern from section 8



### Exercise 3: Hyperparameter Tuning with GridSearchCV

Find the optimal smoothing parameter (alpha) for MultinomialNB on synthetic data.

**Tasks:**
1. Use the synthetic classification dataset
2. Create a parameter grid testing alpha values: [0.001, 0.01, 0.1, 1.0, 10.0]
3. Use GridSearchCV with 5-fold cross-validation
4. Report the best alpha and best score
5. Visualize how performance changes with alpha
6. Does the optimal alpha prevent overfitting?

In [None]:
# Your code here
# Hint: from sklearn.model_selection import GridSearchCV
# param_grid = {'alpha': [...]}



### Exercise 4: Speed Comparison

Compare training and prediction speed of Naive Bayes vs other algorithms.

**Tasks:**
1. Load the breast cancer dataset
2. Time the training of: GaussianNB, Logistic Regression, KNN (K=5), Decision Tree
3. Time the prediction on test set for each model
4. Compare accuracies
5. Create a table showing: Model, Train Time, Predict Time, Accuracy
6. Which model is fastest? Which is most accurate?
7. When would you choose Naive Bayes despite lower accuracy?

In [None]:
# Your code here
# Hint: import time; start = time.perf_counter(); ... ; elapsed = time.perf_counter() - start



## Summary

### Key Concepts Learned

1. **Bayes' Theorem**
   - Foundation of probabilistic classification
   - P(Class|Features) = P(Features|Class) × P(Class) / P(Features)
   - Updates beliefs based on new evidence

2. **The Naive Assumption**
   - Features are conditionally independent given the class
   - Simplifies computation enormously
   - Works surprisingly well despite being "naive"

3. **Three Naive Bayes Variants**
   - **Gaussian NB**: Continuous features (normal distribution)
   - **Multinomial NB**: Count/frequency data (text classification)
   - **Bernoulli NB**: Binary features (presence/absence)

4. **Laplace Smoothing**
   - Prevents zero probability problem
   - Alpha parameter controls smoothing strength
   - Acts as regularization

5. **Advantages**
   - Extremely fast training and prediction
   - Works well with high dimensions
   - Requires little training data
   - Provides probability estimates
   - No feature scaling needed

6. **Best Use Cases**
   - Text classification (spam, sentiment, topics)
   - Real-time prediction systems
   - Baseline model for comparison
   - Multi-class classification

### Best Practices

- **Choose the right variant** for your data type
- **Use alpha=1.0** as starting point (Laplace smoothing)
- **Tune alpha with cross-validation** for optimal performance
- **Consider feature independence** - if violated, try other algorithms
- **Use for baseline** - always try NB as a quick first model
- **Don't trust absolute probabilities** - rankings are reliable, values aren't

### Common Pitfalls to Avoid

- ❌ Using wrong variant for data type
- ❌ Forgetting about zero probability problem
- ❌ Trusting probability estimates for threshold decisions
- ❌ Applying to data with strong feature correlations
- ❌ Using Multinomial/Bernoulli NB with negative features

### What's Next

In **Module 12: Clustering (K-Means, DBSCAN)**, you'll learn:
- Unsupervised learning for finding patterns
- K-Means algorithm and choosing optimal K
- Density-based clustering with DBSCAN
- Cluster evaluation metrics
- When to use different clustering algorithms

### Additional Resources

**Videos:**
- [StatQuest: Naive Bayes](https://www.youtube.com/watch?v=O2L2Uv9pdDA)
- [Bayes Theorem Explained](https://www.youtube.com/watch?v=HZGCoVF3YvM)

**Documentation:**
- [scikit-learn Naive Bayes Guide](https://scikit-learn.org/stable/modules/naive_bayes.html)
- [GaussianNB API](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html)

**Articles:**
- [Naive Bayes for Machine Learning](https://machinelearningmastery.com/naive-bayes-for-machine-learning/)
- [Why Naive Bayes Works So Well](https://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf)