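## Data Preparation Guidelines
* **Text data:** Vectorize with `CountVectorizer` (word counts) or TF-IDF.
* **Categorical data:** Use one-hot encoding (suits `BernoulliNB`) or ordinal encoding (suits `CategoricalNB`).
* **Continuous data:** Use `GaussianNB`, or consider binning the values before applying `MultinomialNB`.
* **Missing values:** Some implementations skip missing features when computing the likelihood, but scikit-learn's Naive Bayes estimators do not accept NaNs, so impute or drop them first.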
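
As a minimal sketch of the text-data guideline, the example below chains `CountVectorizer` and `MultinomialNB` with scikit-learn's `make_pipeline`. The four-document corpus and its labels are invented for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus, for illustration only.
texts = [
    "win a free prize now",
    "free money claim your prize",
    "meeting agenda for monday",
    "please review the project report",
]
labels = ["spam", "spam", "ham", "ham"]

# CountVectorizer turns raw text into word counts; MultinomialNB models those counts.
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(texts, labels)

prediction = model.predict(["claim your free prize"])[0]
print(prediction)
```

Keeping the vectorizer inside the pipeline means the same vocabulary is reused at prediction time, which avoids a mismatch between training and test features.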


## Handling Common Issues
* **Zero Frequency Problem** (when a feature value doesn't appear in a class)
    * Solution: Apply Laplace smoothing ($\alpha > 0$).
* **Numerical Underflow** (multiplying many small probabilities)
    * Solution: Use log probabilities.
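
The underflow issue is easy to reproduce. The sketch below uses 1,000 hypothetical per-feature likelihoods of 0.01 each (an invented value) and compares the direct product with the sum of logarithms.

```python
import math

# 1,000 hypothetical per-feature likelihoods, each equal to 0.01.
likelihoods = [0.01] * 1000

# Direct product: 0.01**1000 = 1e-2000, far below the smallest positive float.
product = 1.0
for p in likelihoods:
    product *= p

# Sum of logs: the same quantity, stored as a log-probability.
log_sum = sum(math.log(p) for p in likelihoods)

print(product)  # the product has underflowed to 0.0
print(log_sum)  # a finite negative number
```

Because the logarithm is monotonic, comparing log scores across classes selects the same class as comparing the raw products would.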

## Strengths & Weaknesses
### Advantages
* Extremely fast training and prediction.
* Works well with high-dimensional data (such as text).
* Requires only a small amount of training data.
* Handles both continuous and discrete data, given the appropriate variant.
* Simple to implement and interpret.
* Performs well with categorical features.

### Disadvantages
* Strong independence assumption (rarely true in practice).
* Zero frequency problem without smoothing.
* Not ideal for regression tasks (primarily for classification).
* Can be outperformed by more complex models with large datasets.
* Sensitive to irrelevant features (no feature selection built-in).

## Real-World Applications
### 1. Text Classification
* Spam detection (Gmail, Outlook)
* Sentiment analysis
* News categorization
* Language detection

### 2. Medical Diagnosis
* Disease prediction based on symptoms
* Risk assessment

### 3. Recommendation Systems
* Product recommendations
* Content filtering

### 4. Fraud Detection
* Credit card fraud
* Insurance claim fraud

# Naive Bayes

Naive Bayes is a probabilistic classifier based on **Bayes' Theorem** with the "naive" assumption that all features are **conditionally independent** given the class label.



### 1. Bayes' Theorem
This is the mathematical foundation of the algorithm.

$$P(y|X) = \frac{P(X|y) \cdot P(y)}{P(X)}$$

* **$P(y|X)$ (Posterior):** The probability of class $y$ given features $X$ (what we want to calculate).
* **$P(X|y)$ (Likelihood):** The probability of observing features $X$ given that the class is $y$.
* **$P(y)$ (Prior):** The probability of class $y$ before observing the features $X$ (in practice, usually the class frequency in the training data).
* **$P(X)$ (Evidence):** The total probability of observing features $X$ (normalization constant).
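
A small worked example may help fix the four terms. The numbers below are hypothetical: a single binary feature ("the message contains the word *free*") and two classes.

```python
# Hypothetical numbers for a single binary feature: the message contains "free".
prior = {"spam": 0.3, "ham": 0.7}        # P(y)
likelihood = {"spam": 0.6, "ham": 0.05}  # P(X | y)

# Evidence P(X): sum over classes of P(X | y) * P(y).
evidence = sum(likelihood[c] * prior[c] for c in prior)

# Posterior P(y | X) for each class.
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}
print(posterior)
```

With these assumed values the evidence is $0.6 \cdot 0.3 + 0.05 \cdot 0.7 = 0.215$, so the posterior for spam is $0.18 / 0.215 \approx 0.84$ even though the prior was only 0.3.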

### 2. The "Naive" Assumption
The classifier assumes that, once the class is known, the value of one feature carries no information about the value of any other feature. This lets the joint likelihood factor into a product of per-feature terms, which simplifies the computation significantly.

$$P(x_1, x_2, \dots, x_n | y) = P(x_1|y) \cdot P(x_2|y) \cdot \dots \cdot P(x_n|y)$$

> **Note:** This assumption is rarely true in real-world data (e.g., in text, words *are* correlated), but the algorithm still performs surprisingly well.

### 3. Prediction Formula
For classification, we predict the class $\hat{y}$ that has the highest posterior probability. Because the evidence $P(X)$ is the same for every class, it can be dropped from the comparison:

$$\hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i|y)$$
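
The prediction rule can be sketched in a few lines of plain Python. The helper `predict` and the probability tables below are hypothetical; the scores are computed in log space so that the product becomes a sum.

```python
import math

def predict(priors, likelihoods, x):
    """Return the class maximizing log P(y) + sum_i log P(x_i | y).

    priors: {class: P(y)}
    likelihoods: {class: [{value: P(x_i = value | y)} for each feature i]}
    x: observed feature values, one per feature
    """
    scores = {}
    for y, prior in priors.items():
        score = math.log(prior)
        for i, value in enumerate(x):
            score += math.log(likelihoods[y][i][value])
        scores[y] = score
    return max(scores, key=scores.get), scores

# Hypothetical, already-smoothed probabilities for two features.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": [{"yes": 0.8, "no": 0.2}, {"yes": 0.3, "no": 0.7}],
    "ham": [{"yes": 0.1, "no": 0.9}, {"yes": 0.6, "no": 0.4}],
}
label, scores = predict(priors, likelihoods, ["yes", "no"])
print(label)
```

Here the unnormalized scores are $0.4 \cdot 0.8 \cdot 0.7 = 0.224$ for spam and $0.6 \cdot 0.1 \cdot 0.4 = 0.024$ for ham, so spam is chosen despite its lower prior.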

---

## Types of Naive Bayes Classifiers

### 1. Gaussian Naive Bayes
**Use Case:** Continuous features that are assumed to follow a **Normal (Gaussian) Distribution** (e.g., Iris dataset features).

**Probability Density Function:**
$$P(x_i|y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$$
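
To make the density concrete, the sketch below estimates the mean and variance of one feature within one class and evaluates the formula at a new point. The sample values are made up, and the variance uses the maximum-likelihood estimate (dividing by $n$), which is an assumption of this example. Note that the mean and variance are estimated separately for every feature and class pair.

```python
import math

def gaussian_pdf(x, mean, var):
    """Normal density with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Made-up values of one feature for the training samples of one class.
values = [4.9, 5.1, 5.0, 5.4, 4.6]

mean = sum(values) / len(values)
# Maximum-likelihood variance (divide by n, not n - 1).
var = sum((v - mean) ** 2 for v in values) / len(values)

density = gaussian_pdf(5.2, mean, var)
print(mean, var, density)
```

The result is a density rather than a probability, so it can exceed 1 when the variance is small; only its relative size across classes matters for prediction.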

```python
from sklearn.naive_bayes import GaussianNB

# Initialize and train
gnb = GaussianNB()
# gnb.fit(X_train, y_train)
```

### 2. Multinomial Naive Bayes
**Use Case:** Discrete counts. This is the standard algorithm for **Text Classification** (e.g., word counts, spam detection).



**Likelihood with Smoothing:**
$$P(x_i|y) = \frac{N_{yi} + \alpha}{N_y + \alpha \cdot n}$$

* **$N_{yi}$:** Count of feature $i$ in class $y$.
* **$N_y$:** Total count of all features in class $y$.
* **$n$:** Number of distinct features (for text, the vocabulary size).
* **$\alpha$ (alpha):** Smoothing parameter.
    * If $\alpha=1$, it is **Laplace Smoothing**; values with $0 < \alpha < 1$ are called Lidstone smoothing.
    * This prevents zero probabilities when a word never appears in the training examples of a class (zero frequency problem).
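
The smoothed likelihood can be checked by hand. The helper `smoothed_likelihoods` and the word counts below are hypothetical and only mirror the formula above for a single class.

```python
def smoothed_likelihoods(counts, alpha=1.0):
    """Return P(x_i | y) = (N_yi + alpha) / (N_y + alpha * n) for one class."""
    n = len(counts)               # number of features (vocabulary size)
    total = sum(counts.values())  # N_y
    return {w: (c + alpha) / (total + alpha * n) for w, c in counts.items()}

# Hypothetical word counts in the "spam" class; "meeting" was never seen there.
spam_counts = {"free": 3, "prize": 2, "meeting": 0}

unsmoothed = smoothed_likelihoods(spam_counts, alpha=0.0)
laplace = smoothed_likelihoods(spam_counts, alpha=1.0)
print(unsmoothed["meeting"], laplace["meeting"])
```

Without smoothing, the unseen word gets probability 0 and would zero out the whole product; with $\alpha = 1$ it gets $(0 + 1) / (5 + 3) = 0.125$, and the three probabilities still sum to 1.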

```python
from sklearn.naive_bayes import MultinomialNB

# alpha=1.0 is default (Laplace smoothing)
mnb = MultinomialNB(alpha=1.0)
# mnb.fit(X_train_counts, y_train)
```

### 3. Bernoulli Naive Bayes
**Use Case:** Binary/Boolean features (e.g., Presence vs. Absence of a word, simple True/False features).

**Formula:**
Unlike the multinomial variant, which simply ignores features that do not occur, it explicitly penalizes the non-occurrence of a feature $i$ that is an indicator for class $y$.

$$P(x_i|y) = P(i|y) \cdot x_i + (1 - P(i|y)) \cdot (1 - x_i)$$
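
The effect of the absence term can be seen in a short sketch. The function `bernoulli_likelihood` and the probabilities in `p_spam` are hypothetical; the function multiplies the per-feature formula over all features of one sample.

```python
def bernoulli_likelihood(x, p):
    """Product over features of p_i * x_i + (1 - p_i) * (1 - x_i)."""
    result = 1.0
    for x_i, p_i in zip(x, p):
        result *= p_i * x_i + (1 - p_i) * (1 - x_i)
    return result

# Hypothetical P(i | y) for three words in one class.
p_spam = [0.8, 0.6, 0.1]

both_present = bernoulli_likelihood([1, 1, 0], p_spam)  # 0.8 * 0.6 * 0.9
one_missing = bernoulli_likelihood([1, 0, 0], p_spam)   # 0.8 * 0.4 * 0.9
print(both_present, one_missing)
```

Dropping the second word, which is likely in this class, replaces the factor 0.6 with 0.4, so the absence of an expected word lowers the likelihood.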

```python
from sklearn.naive_bayes import BernoulliNB

# binarize=0.0 (the default) maps values greater than 0 to 1 and all others to 0
bnb = BernoulliNB(binarize=0.0)
# bnb.fit(X_train, y_train)
```

## Quick Reference Card

### When to Use Which Variant

| Variant | Input Data Type | Example Use Case |
| :--- | :--- | :--- |
| **GaussianNB** | Continuous features (assumed Normal Distribution). | Iris flower dimensions, physical measurements. |
| **MultinomialNB** | Discrete counts. | Word counts in text, event frequencies. |
| **BernoulliNB** | Binary features (True/False, 0/1). | Word presence vs. absence, boolean flags. |

### Preprocessing Checklist
* **Handle missing values** (scikit-learn's Naive Bayes estimators do not accept NaNs, so impute or drop them first).
* **Encode categorical variables.**
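* **Transform skewed features** if needed (for example, with a log or power transform) so they look more Gaussian. Standardization is generally unnecessary for **GaussianNB**, because it fits a separate mean and variance for each feature and class.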
* **Apply text vectorization** (CountVectorizer or TF-IDF) for text data.
* **Balance classes** if needed (though Naive Bayes handles imbalances reasonably well, extreme cases need attention).
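
As one way to cover the missing-value item, the sketch below puts scikit-learn's `SimpleImputer` in front of `GaussianNB`. The data are made up, and mean imputation is only one of several reasonable strategies.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

# Made-up data with one missing value (np.nan).
X = np.array([[1.0, 2.0], [1.2, np.nan], [3.8, 4.1], [4.0, 3.9]])
y = np.array([0, 0, 1, 1])

# Replace NaNs with the column mean before fitting the classifier.
model = make_pipeline(SimpleImputer(strategy="mean"), GaussianNB())
model.fit(X, y)

# The same imputation is applied automatically at prediction time.
predictions = model.predict([[1.1, np.nan], [3.9, 4.0]])
print(predictions)
```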

### Common Pitfalls to Avoid
* **Feeding strongly non-Gaussian features** to `GaussianNB` without transforming them (rescaling alone has little effect on its predictions).
* **Not applying smoothing** (alpha) for unseen feature-class combinations (Zero Frequency Problem).
* **Using raw counts without normalization** for text data (long documents might dominate short ones).
* **Ignoring feature correlations** when they are strong (remember, the "Naive" assumption assumes independence).
* **Assuming probabilities are well-calibrated** (Naive Bayes is known for being a great classifier but a poor probability estimator; the `predict_proba` outputs are often too extreme).
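
The last two pitfalls are linked, as the sketch below suggests. It uses hypothetical numbers: a single weakly informative feature is copied several times, which is the extreme case of correlated features, and the function `posterior_spam` recomputes the posterior as if the copies were independent.

```python
def posterior_spam(n_copies):
    """Posterior P(spam | X) when one feature is repeated n_copies times."""
    prior_spam, prior_ham = 0.5, 0.5
    like_spam, like_ham = 0.6, 0.4  # hypothetical P(x | y) for one feature
    spam = prior_spam * like_spam ** n_copies
    ham = prior_ham * like_ham ** n_copies
    return spam / (spam + ham)

for n in (1, 5, 20):
    print(n, round(posterior_spam(n), 4))
```

The evidence never changes, yet the posterior climbs from 0.6 toward 1 as copies are added, because each copy is counted as fresh evidence. The predicted class stays the same, which is why accuracy can remain good while the probabilities become overconfident.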