# 📚 Summary: Generative Classifiers (Naive Bayes + Discriminant Analysis)

All these models are *generative*: they model the class-conditional distributions
$P(x \mid y=c)$, then apply Bayes’ rule with priors $P(y=c)$ to predict:

$$
\hat{y} = \arg\max_c \; P(y=c) \, P(x \mid y=c)
$$
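
In practice the product is evaluated in log space, so the rule becomes an argmax over $\log P(y=c) + \log P(x \mid y=c)$. A minimal sketch, assuming the priors and likelihood values are already known; the numbers and the helper name `predict_map` are purely illustrative:

```python
import math

def predict_map(log_priors, log_likelihoods):
    """Return the class maximizing log P(y=c) + log P(x | y=c), plus all scores."""
    scores = {c: log_priors[c] + log_likelihoods[c] for c in log_priors}
    return max(scores, key=scores.get), scores

# Illustrative numbers only: two classes, one observation x.
log_priors = {"spam": math.log(0.3), "ham": math.log(0.7)}
log_likelihoods = {"spam": math.log(0.05), "ham": math.log(0.01)}

label, scores = predict_map(log_priors, log_likelihoods)
print(label)
```

Here the less frequent class wins because its likelihood for this $x$ is five times larger, which outweighs the prior.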

---

## 🔸 Naive Bayes family

**Shared idea:** Assume features are conditionally independent given the class.
The joint likelihood therefore factorizes over features:

$$
P(x \mid y=c) = \prod_j P(x_j \mid y=c)
$$

---

### 1. GaussianNB
- **Assumption:** Each feature is Gaussian:
  $$
  x_j \mid y=c \sim \mathcal{N}(\mu_{jc}, \sigma_{jc}^2)
  $$
- **Parameters:** mean $\mu_{jc}$ and variance $\sigma_{jc}^2$ for each feature $j$ in class $c$.
- **Best for:** continuous, real-valued features (e.g. sensor data, normally distributed features).
- **Limitation:** ignores correlations between features (covariance assumed diagonal).
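
The parameter estimates and the resulting score can be sketched in plain Python. This is a toy illustration, not a library implementation: the function names, the tiny dataset, and the variance floor `1e-9` are all assumptions made for the example.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate the prior, per-feature means, and per-feature variances per class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for c, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # Maximum-likelihood variance (divide by n). The small constant guards
        # against zero variance and is an arbitrary choice for this sketch.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(zip(*rows), means)
        ]
        model[c] = (n / len(X), means, variances)
    return model

def log_posterior_scores(model, x):
    """Unnormalized log posterior: log prior + sum of per-feature Gaussian log densities."""
    scores = {}
    for c, (prior, means, variances) in model.items():
        s = math.log(prior)
        for xj, m, v in zip(x, means, variances):
            s += -0.5 * math.log(2 * math.pi * v) - (xj - m) ** 2 / (2 * v)
        scores[c] = s
    return scores

# Toy data: two well-separated classes with two features each.
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 4.1], [3.2, 3.9], [2.8, 4.0]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(X, y)
scores = log_posterior_scores(model, [1.1, 2.0])
pred = max(scores, key=scores.get)
```

Summing per-feature log densities is exactly the diagonal-covariance assumption: no cross-feature term ever enters the score.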

---

### 2. MultinomialNB
- **Assumption:** Features are counts/frequencies (e.g. word counts).
- **Likelihood:**
  $$
  P(x \mid y=c) \propto \prod_j \theta_{jc}^{\,x_j}
  $$
- **Parameter estimation:**
  $$
  \theta_{jc} = \frac{N_{jc} + \alpha}{\sum_k (N_{kc} + \alpha)}
  $$
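  where $N_{jc}$ is the total count of feature $j$ in class $c$, and $\alpha$ is the additive smoothing constant ($\alpha = 1$ gives Laplace smoothing).
- **Best for:** text classification (bag-of-words counts; TF-IDF weights also tend to work in practice).
- **Limitation:** expects non-negative, count-like features; fractional values such as TF-IDF weights no longer correspond to a true multinomial count model.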
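
The smoothed estimate and the log-likelihood $\sum_j x_j \log \theta_{jc}$ can be sketched as follows. The three-word vocabulary, the count matrix, and the function names are made up for illustration.

```python
import math

def fit_multinomial_nb(X, y, alpha=1.0):
    """Return {class: (log prior, [log theta_jc for each feature j])}."""
    n_features = len(X[0])
    model = {}
    for c in sorted(set(y)):
        rows = [row for row, label in zip(X, y) if label == c]
        counts = [sum(col) for col in zip(*rows)]      # N_jc
        total = sum(counts) + alpha * n_features       # sum_k (N_kc + alpha)
        log_theta = [math.log((n + alpha) / total) for n in counts]
        model[c] = (math.log(len(rows) / len(X)), log_theta)
    return model

def predict(model, x):
    """Pick the class maximizing log prior + sum_j x_j * log theta_jc."""
    scores = {
        c: log_prior + sum(xj * lt for xj, lt in zip(x, log_theta))
        for c, (log_prior, log_theta) in model.items()
    }
    return max(scores, key=scores.get)

# Toy vocabulary (hypothetical): ["free", "money", "meeting"]
X = [[3, 2, 0], [2, 1, 0], [0, 0, 3], [0, 1, 2]]
y = ["spam", "spam", "ham", "ham"]
model = fit_multinomial_nb(X, y)

print(predict(model, [1, 1, 0]))
print(predict(model, [0, 0, 2]))
```

Without smoothing, "meeting" would have $\theta = 0$ in the spam class, and any document containing it would get a spam likelihood of exactly zero.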

---

### 3. BernoulliNB
- **Assumption:** Features are binary (present/absent).
- **Likelihood:**
  $$
  P(x \mid y=c) = \prod_j \theta_{jc}^{\,x_j} (1-\theta_{jc})^{1-x_j}
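  $$
  where $\theta_{jc} = P(x_j = 1 \mid y=c)$. Absent features contribute a factor $(1-\theta_{jc})$, so absence counts as evidence too.
- **Best for:** binary indicators (e.g. "does this email contain word X?").
- **Limitation:** discards frequency information (a word that appears ten times counts the same as one that appears once).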
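
A short sketch of the Bernoulli log-likelihood, showing that features with $x_j = 0$ still affect the score. The per-class probabilities below are hypothetical values chosen for the example, not estimates from data.

```python
import math

def bernoulli_log_likelihood(x, theta):
    """log P(x | y=c) for binary x, where theta[j] = P(x_j = 1 | y=c)."""
    return sum(
        math.log(t) if xj == 1 else math.log(1.0 - t)
        for xj, t in zip(x, theta)
    )

# Hypothetical per-class probabilities for three indicator words.
theta_spam = [0.8, 0.6, 0.1]
theta_ham = [0.1, 0.2, 0.7]

x = [1, 0, 0]  # only the first word is present
ll_spam = bernoulli_log_likelihood(x, theta_spam)
ll_ham = bernoulli_log_likelihood(x, theta_ham)
```

For this $x$, the third word being absent lowers the ham likelihood (factor $0.3$) far more than the spam likelihood (factor $0.9$).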

---

### 4. CategoricalNB
- **Assumption:** Features are categorical, each taking values from a finite set.
- **Likelihood:**
  $$
  P(x \mid y=c) = \prod_j P(X_j = x_j \mid y=c) = \prod_j \theta_{j,x_j,c}
  $$
- **Parameter estimation:**
  $$
  \theta_{j,v,c} = \frac{N_{jvc} + \alpha}{N_{c} + \alpha K_j}
  $$
  where $N_{jvc}$ = number of samples in class $c$ with feature $j=v$,
  $N_{c}$ = total number of samples in class $c$, $K_j$ = number of categories for feature $j$.
- **Best for:** categorical/tabular data (e.g. color ∈ {red, green, blue}).
- ⚠️ If used on raw continuous floats, each unique value is treated as its own category → memorization.
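
The smoothed estimate for a single categorical feature can be sketched like this. The denominator is the number of samples in the class plus $\alpha K_j$; the function name and the toy data are assumptions made for illustration.

```python
from collections import Counter

def fit_categorical_feature(values, labels, categories, alpha=1.0):
    """Smoothed P(x_j = v | y = c) for one categorical feature j."""
    class_sizes = Counter(labels)         # samples per class
    joint = Counter(zip(values, labels))  # samples with (value, class)
    k = len(categories)                   # K_j
    return {
        c: {v: (joint[(v, c)] + alpha) / (n_c + alpha * k) for v in categories}
        for c, n_c in class_sizes.items()
    }

colors = ["red", "red", "green", "blue", "blue", "blue"]
labels = ["a", "a", "a", "b", "b", "b"]
theta = fit_categorical_feature(colors, labels, ["red", "green", "blue"])
```

Class `"b"` never shows `"red"` in this data, yet smoothing still assigns it probability $1/6$ rather than zero, and each class's probabilities still sum to one.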

---

## 🔸 Discriminant Analysis family

**Shared idea:** Assume class-conditional distributions are *multivariate Gaussian*:

$$
P(x \mid y=c) = \mathcal{N}(x; \mu_c, \Sigma_c)
$$

Differences come from assumptions about the covariance $\Sigma_c$.

---

### 5. Linear Discriminant Analysis (LDA)
- **Assumption:** Each class is Gaussian with the **same covariance** $\Sigma$:
  $$
  x \mid y=c \sim \mathcal{N}(\mu_c, \Sigma)
  $$
- **Discriminant function:**
  $$
  \delta_c(x) = x^\top \Sigma^{-1}\mu_c \;-\; \tfrac{1}{2}\mu_c^\top \Sigma^{-1}\mu_c \;+\; \log P(y=c)
  $$
  Prediction = class with largest $\delta_c(x)$.
  Boundaries are **linear** hyperplanes.

- **Fisher’s dimension reduction view:**
  Define *within-class scatter*:
  $$
  S_W = \sum_{c=1}^K \sum_{i: y_i=c} (x_i - \mu_c)(x_i - \mu_c)^\top
  $$
  Define *between-class scatter*:
  $$
  S_B = \sum_{c=1}^K n_c (\mu_c - \mu)(\mu_c - \mu)^\top
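  $$
  where $n_c$ is the number of samples in class $c$, $\mu_c$ is the class mean, and $\mu$ is the overall mean.
  Maximize the Fisher criterion: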
  $$
  J(w) = \frac{w^\top S_B w}{w^\top S_W w}
  $$
  This leads to the generalized eigenproblem:
  $$
  S_B w = \lambda S_W w
  $$
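  Keep the eigenvectors with the largest eigenvalues, at most $K-1$ of them (the remaining eigenvalues are zero, since $\mathrm{rank}(S_B) \le K-1$).
  Projecting onto them gives at most $K-1$ dimensions in which between-class scatter is largest relative to within-class scatter.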
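
Fisher's construction can be sketched with NumPy. This is an illustrative sketch rather than a production implementation: it assumes $S_W$ is invertible, uses synthetic Gaussian data, and the function name `fisher_directions` is made up for the example.

```python
import numpy as np

def fisher_directions(X, y):
    """Solve S_B w = lambda S_W w and keep the top K-1 directions."""
    classes = np.unique(y)
    mu = X.mean(axis=0)  # overall mean
    d = X.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        centered = Xc - mu_c
        S_W += centered.T @ centered              # within-class scatter
        diff = (mu_c - mu).reshape(-1, 1)
        S_B += len(Xc) * (diff @ diff.T)          # between-class scatter
    # Assumes S_W is invertible. S_W^{-1} S_B is not symmetric in general,
    # so the general eigen-solver is used and real parts are kept.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(eigvals.real)[::-1]
    keep = order[: len(classes) - 1]
    return eigvals.real[keep], eigvecs.real[:, keep]

# Synthetic two-class data (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0, 0], 1.0, size=(50, 2)),
    rng.normal([4, 1], 1.0, size=(50, 2)),
])
y = np.array([0] * 50 + [1] * 50)

eigvals, W = fisher_directions(X, y)
Z = X @ W  # projected data, one column because K - 1 = 1
```

With two classes there is a single discriminant direction, which is proportional to $S_W^{-1}(\mu_1 - \mu_0)$.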

---

### 6. Quadratic Discriminant Analysis (QDA)
- **Assumption:** Each class is Gaussian with its **own covariance** $\Sigma_c$:
  $$
  x \mid y=c \sim \mathcal{N}(\mu_c, \Sigma_c)
  $$
- **Discriminant function:**
  $$
  \delta_c(x) = -\tfrac{1}{2}\log|\Sigma_c| - \tfrac{1}{2}(x-\mu_c)^\top \Sigma_c^{-1}(x-\mu_c) + \log P(y=c)
  $$
  Boundaries are **quadratic curves/surfaces**.
- **More flexible** than LDA, but with $d$ features each $\Sigma_c$ has $d(d+1)/2$ free parameters, so QDA needs more data per class to estimate them reliably.
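
The QDA discriminant can be evaluated directly from its formula. A minimal sketch, assuming the class means, covariances, and priors are already known; the parameter values and the name `qda_score` are hypothetical.

```python
import numpy as np

def qda_score(x, mu, sigma, prior):
    """delta_c(x) = -0.5 log|Sigma_c| - 0.5 (x - mu_c)^T Sigma_c^{-1} (x - mu_c) + log P(y=c)."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)            # numerically stable log-determinant
    maha = diff @ np.linalg.solve(sigma, diff)      # squared Mahalanobis distance
    return -0.5 * logdet - 0.5 * maha + np.log(prior)

# Hypothetical parameters for two classes in 2D: (mean, covariance, prior).
params = {
    0: (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5),
    1: (np.array([3.0, 3.0]), np.array([[2.0, 0.5], [0.5, 1.0]]), 0.5),
}

x = np.array([0.5, 0.2])
scores = {c: qda_score(x, *p) for c, p in params.items()}
pred = max(scores, key=scores.get)
```

If every class shared the same covariance, the $\log|\Sigma_c|$ and $x^\top \Sigma^{-1} x$ terms would be identical across classes and cancel in the comparison, leaving a score that is linear in $x$, which is the LDA case.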

---

## ✅ Big Picture

- **Naive Bayes:** assumes conditional independence.
  - GaussianNB → continuous, independent features.
  - MultinomialNB → count/frequency features.
  - BernoulliNB → binary features.
  - CategoricalNB → categorical features.

- **Discriminant Analysis:** assumes multivariate Gaussian class-conditionals.
  - LDA → shared covariance → linear boundaries, dimension reduction.
  - QDA → separate covariance → quadratic boundaries, more flexible.

All follow the same recipe:
1. Estimate $P(x \mid y)$ under some assumption.
2. Multiply by prior $P(y)$.
3. Pick the class with max posterior.
