# generative vs. discriminative models


Generative models and discriminative models are two families of machine learning models; they differ in which probability distribution they model.

|                | Generative Models | Discriminative Models |
|----------------|-------------------|-----------------------|
| Modeling goal  | Joint probability $p(x, y)$ | Posterior probability $p(y \mid x)$ |
| Approach       | Apply Bayes' rule to obtain $p(y \mid x)$ | Minimize classification error |
| Examples       | Naïve Bayes, Hidden Markov Models (HMM), Probabilistic Context-Free Grammar (PCFG) | Logistic Regression, SVM, Decision Tree |
| Pros           | Can make use of unlabeled data | Directly optimizes classification accuracy |
| Cons           | Harder modeling task | Sensitive to feature selection; can overfit when data is sparse |


# Bayes rule

the posterior probability of $y$ given $\mathbf{x}$ is the joint probability of $\mathbf{x}$ and $y$ divided by the marginal probability of $\mathbf{x}$; the joint probability factors into likelihood times prior

$$
\mathbb{P}(y|\mathbf{x})=\frac{\mathbb{P}(\mathbf{x},y)}{\mathbb{P}(\mathbf{x})}=\frac{\mathbb{P}(\mathbf{x}|y)\mathbb{P}(y)}{\mathbb{P}(\mathbf{x})}
$$

where $\mathbb{P}(y)=\left\{\begin{matrix}
\pi_0 & y=0\\ \pi_1 & y=1\end{matrix}\right.\ $,  $\mathbb{P}(\mathbf{x})=\sum_{j}\mathbb{P}(\mathbf{x}|y=j)\mathbb{P}(y=j)$

# modeling choice

- Naive Bayes:

    conditional probability of $\mathbf{x}\in \mathbb{R}^d$ given the label $y \in \{0,1\}$ is the product of the conditional probabilities of its $d$ coordinates $\mathbf{x}_j$ given $y$, which are assumed to be conditionally independent given $y$

$$
\mathbb{P}(\mathbf{x}|y)=\prod_{j=1}^{d}\mathbb{P}(\mathbf{x}_j|y)
$$


- we can pick any univariate distribution for $\mathbf{x}_j|y$

- Gaussian:  explore in HW3 Q2

    distribution of $\mathbf{x}\in \mathbb{R}^d$ conditioned on the label $y\in \{0,1\}$ is multivariate normal with mean $\mathbf{\mu}_y \in \mathbb{R}^d$ and covariance $\Sigma_y \in \mathbb{R}^{d \times d}$

$$
\mathbf{x}|y \sim N(\mathbf{\mu}_y,\Sigma_y)
$$
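One way to connect the two modeling choices above: when $\Sigma_y$ is diagonal, the Gaussian density factorizes into $d$ univariate normals, which is exactly the Naive Bayes form. The sketch below checks this numerically. The diagonal covariance, the function names, and the numbers are assumptions made for illustration and are not taken from the notes.

```python
import math
from statistics import NormalDist


def diag_gaussian_logpdf(x, mu, var):
    """ln N(x; mu, diag(var)), written as a single multivariate density."""
    d = len(x)
    log_det = sum(math.log(v) for v in var)  # ln|Sigma| for a diagonal Sigma
    quad = sum((xj - mj) ** 2 / v for xj, mj, v in zip(x, mu, var))
    return -0.5 * (d * math.log(2 * math.pi) + log_det + quad)


def naive_bayes_logpdf(x, mu, var):
    """Sum over coordinates j of ln N(x_j; mu_j, var_j)."""
    return sum(
        math.log(NormalDist(mj, math.sqrt(v)).pdf(xj))
        for xj, mj, v in zip(x, mu, var)
    )


# hypothetical point, class mean, and per-coordinate variances
x = [0.5, -1.0, 2.0]
mu_y = [0.0, 0.0, 1.0]
var_y = [1.0, 4.0, 0.25]

joint = diag_gaussian_logpdf(x, mu_y, var_y)
factored = naive_bayes_logpdf(x, mu_y, var_y)
print(joint, factored)  # the two values agree up to floating-point error
```

With a full (non-diagonal) $\Sigma_y$ the coordinates are correlated and this factorization no longer holds.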

# decision rule of Naive Bayes classifier

the decision rule for binary classification predicts $y=1$ when its posterior probability is larger than that of $y=0$:

$$
\hat{y}=\mathbb{1}[\mathbb{P}(y=1|\mathbf{x})>\mathbb{P}(y=0|\mathbf{x})]
$$

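use Bayes rule to simplify the condition inside the indicator: the common denominator $\mathbb{P}(\mathbf{x})>0$ cancels, $\ln$ is monotone increasing so it preserves the inequality, and the last two steps plug in the Naive Bayes factorization and the priors $\pi_0$, $\pi_1$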

$$
\begin{align}
&\frac{\mathbb{P}(\mathbf{x}|y=1)\mathbb{P}(y=1)}{\mathbb{P}(\mathbf{x})} > \frac{\mathbb{P}(\mathbf{x}|y=0)\mathbb{P}(y=0)}{\mathbb{P}(\mathbf{x})}\\[1em]

& \mathbb{P}(\mathbf{x}|y=1)\mathbb{P}(y=1)>\mathbb{P}(\mathbf{x}|y=0)\mathbb{P}(y=0)\\[1em]

& \ln \left[\mathbb{P}(\mathbf{x}|y=1)\mathbb{P}(y=1)\right] > \ln \left[\mathbb{P}(\mathbf{x}|y=0)\mathbb{P}(y=0) \right] \\[1em]

&\ln \left[\mathbb{P}(\mathbf{x}|y=1) \right] + \ln \left[ \mathbb{P}(y=1)\right] > \ln \left[\mathbb{P}(\mathbf{x}|y=0) \right] + \ln \left[ \mathbb{P}(y=0)\right]\\[1em]

&\ln(\mathbb{P}(\mathbf{x}|y=1))-\ln(\mathbb{P}(\mathbf{x}|y=0))>\ln(\mathbb{P}(y=0))-\ln(\mathbb{P}(y=1))\\[1em]

&\ln\left[\frac{\mathbb{P}(\mathbf{x}|y=1)}{\mathbb{P}(\mathbf{x}|y=0)}\right]>\ln\left[\frac{\mathbb{P}(y=0)}{\mathbb{P}(y=1)}\right]\\[1em]

&\ln \left [ \frac{\prod_{j=1}^{d}\mathbb{P}(\mathbf{x}_j|y=1)}{\prod_{j=1}^{d}\mathbb{P}(\mathbf{x}_j|y=0)} \right ] > \ln \left ( \frac {\pi_0}{\pi_1}\right )\\[1em]

&\sum_{j=1}^{d}\ln\left[\frac{\mathbb{P}(\mathbf{x}_j|y=1)}{\mathbb{P}(\mathbf{x}_j|y=0)} \right]> \ln \left ( \frac {\pi_0}{\pi_1}\right )\\[1em]
\end{align}
$$
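The first and last lines of the derivation above should give the same prediction. A small Python sketch that compares them, assuming the per-coordinate probabilities $\mathbb{P}(\mathbf{x}_j|y)$ have already been evaluated at the observed $\mathbf{x}$; the function names and numbers are hypothetical.

```python
import math


def posterior_rule(lik1, lik0, pi1, pi0):
    """Compare P(x|y=1)P(y=1) with P(x|y=0)P(y=0) directly."""
    return int(math.prod(lik1) * pi1 > math.prod(lik0) * pi0)


def log_ratio_rule(lik1, lik0, pi1, pi0):
    """Compare the sum of per-coordinate log-likelihood ratios with ln(pi0/pi1)."""
    llr = sum(math.log(a / b) for a, b in zip(lik1, lik0))
    return int(llr > math.log(pi0 / pi1))


# hypothetical values of P(x_j | y=1) and P(x_j | y=0) for d = 2 coordinates
lik1 = [0.8, 0.3]
lik0 = [0.1, 0.6]

print(posterior_rule(lik1, lik0, 0.3, 0.7), log_ratio_rule(lik1, lik0, 0.3, 0.7))
print(posterior_rule(lik1, lik0, 0.1, 0.9), log_ratio_rule(lik1, lik0, 0.1, 0.9))
```

The log form is usually preferred in code because a product of many small probabilities can underflow, while a sum of logs does not.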

## Spam detection

label $y\in \left \{0,1 \right \}$: an email is Spam ($y=1$) or NOT Spam ($y=0$)

features $\mathbf{x}\in \{0,1\}^d$, whose coordinates are conditionally independent given $y$

each feature $\mathbf{x}_j|y\sim Bernoulli(p_{j,y})$ indicates whether word $j$ occurs in the email

$$
\mathbf{x}_j=\left\{\begin{matrix}
1 & \text{word } j \text{ appears} \\ 
0 & \text{word } j \text{ does not appear}
\end{matrix}\right.
$$

Plug in

$$
\mathbb{P}(\mathbf{x}_j|y)=(p_{j,y})^{\mathbf{x}_j}\ (1-p_{j,y})^{1-\mathbf{x}_j}
$$

to the previous decision rule

$$
\sum_{j=1}^{d}\ln\left[\frac{\mathbb{P}(\mathbf{x}_j|y=1)}{\mathbb{P}(\mathbf{x}_j|y=0)} \right]> \ln \left ( \frac {\pi_0}{\pi_1}\right )
$$

we have

$$
\sum_{j=1}^{d}\ln\left[\frac{(p_{j,1})^{\mathbf{x}_j}\ (1-p_{j,1})^{1-\mathbf{x}_j}}{(p_{j,0})^{\mathbf{x}_j}\ (1-p_{j,0})^{1-\mathbf{x}_j}} \right]> \ln \left ( \frac {\pi_0}{\pi_1}\right )
$$

simplify this,

$$
\begin{align}

&\sum_{j=1}^{d}\ln\left[ \left ( \frac{p_{j,1}}{p_{j,0}}\right )^{\mathbf{x}_j}\left ( \frac{1-p_{j,1}}{1-p_{j,0}}\right )^{1-\mathbf{x}_j}\right]> \ln \left ( \frac {\pi_0}{\pi_1}\right )\\[1em]

&\sum_{j=1}^{d}\ln\left[ \left ( \frac{p_{j,1}}{p_{j,0}}\right )^{\mathbf{x}_j}\left ( \frac{1-p_{j,0}}{1-p_{j,1}}\right )^{\mathbf{x}_j}\left ( \frac{1-p_{j,1}}{1-p_{j,0}} \right )\right]> \ln \left ( \frac {\pi_0}{\pi_1}\right )\\[1em]

&\sum_{j=1}^{d}\ln\left[ \left ( \frac{p_{j,1}}{p_{j,0}}\frac{1-p_{j,0}}{1-p_{j,1}}\right )^{\mathbf{x}_j}\left ( \frac{1-p_{j,1}}{1-p_{j,0}} \right )\right]> \ln \left ( \frac {\pi_0}{\pi_1}\right )\\[1em]

&\sum_{j=1}^{d}\ln\left[ \left ( \frac{p_{j,1}}{1-p_{j,1}}\frac{1-p_{j,0}}{p_{j,0}}\right )^{\mathbf{x}_j}\left ( \frac{1-p_{j,1}}{1-p_{j,0}} \right )\right]> \ln \left ( \frac {\pi_0}{\pi_1}\right )\\[1em]

&\sum_{j=1}^{d}\ln\left ( \frac{p_{j,1}}{1-p_{j,1}}\frac{1-p_{j,0}}{p_{j,0}}\right )^{\mathbf{x}_j}+\sum_{j=1}^{d}\ln\left( \frac{1-p_{j,1}}{1-p_{j,0}} \right) > \ln \left ( \frac {\pi_0}{\pi_1}\right )\\[1em]


&\sum_{j=1}^{d}\mathbf{x}_j\ln\left ( \frac{p_{j,1}}{1-p_{j,1}}\frac{1-p_{j,0}}{p_{j,0}}\right )+\sum_{j=1}^{d}\ln\left( \frac{1-p_{j,1}}{1-p_{j,0}} \right) > \ln \left ( \frac {\pi_0}{\pi_1}\right )\\[1em]


&\sum_{j=1}^{d}\mathbf{x}_j\ln\left ( \frac{p_{j,1}}{1-p_{j,1}}\frac{1-p_{j,0}}{p_{j,0}}\right )> \ln \left ( \frac {\pi_0}{\pi_1}\right )-\sum_{j=1}^{d}\ln\left( \frac{1-p_{j,1}}{1-p_{j,0}} \right) \\[1em]
\end{align}
$$

## decision rule is a **linear classifier**

set $\mathbf{\beta} \in \mathbb{R}^d$, $\mathbf{\beta}_j=\ln\left ( \frac{p_{j,1}}{1-p_{j,1}}\frac{1-p_{j,0}}{p_{j,0}}\right )\in \mathbb{R}$, $C=\ln \left ( \frac {\pi_0}{\pi_1}\right )-\sum_{j=1}^{d}\ln\left( \frac{1-p_{j,1}}{1-p_{j,0}} \right)\in \mathbb{R}$

then the decision rule can be written as

$$
\sum_{j=1}^{d}\mathbf{x}_j \mathbf{\beta}_j > C
$$

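whether the presence of word $j$ pushes the prediction toward Spam depends on the sign of $\mathbf{\beta}_j$: $\mathbf{\beta}_j>0$ when the word is more likely in Spam ($p_{j,1}>p_{j,0}$), and $\mathbf{\beta}_j<0$ when it is more likely in NOT-Spam

odds of word $j$ appearing in a Spam email are $\frac{p_{j,1}}{1-p_{j,1}}$

odds of word $j$ appearing in a NOT-Spam email are $\frac{p_{j,0}}{1-p_{j,0}}$

so $\mathbf{\beta}_j$ is the log of the odds ratio between the two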
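A short Python check that the linear form $\sum_j \mathbf{x}_j\mathbf{\beta}_j > C$ agrees with comparing $\mathbb{P}(\mathbf{x}|y)\mathbb{P}(y)$ directly. The parameter values below are invented for illustration and are treated as known; nothing here is estimated from data.

```python
import math
from itertools import product

# hypothetical Bernoulli parameters for d = 3 words
p1 = [0.8, 0.1, 0.5]  # p_{j,1} = P(x_j = 1 | y = 1), Spam
p0 = [0.2, 0.3, 0.5]  # p_{j,0} = P(x_j = 1 | y = 0), NOT Spam
pi1, pi0 = 0.4, 0.6   # priors

# beta_j = log odds ratio of word j, C = threshold from the derivation
beta = [math.log(a / (1 - a) * (1 - b) / b) for a, b in zip(p1, p0)]
C = math.log(pi0 / pi1) - sum(math.log((1 - a) / (1 - b)) for a, b in zip(p1, p0))


def linear_rule(x):
    """Predict 1 when sum_j x_j * beta_j > C."""
    return int(sum(xj * bj for xj, bj in zip(x, beta)) > C)


def bernoulli_lik(x, p):
    """P(x | y) = prod_j p_j^{x_j} (1 - p_j)^{1 - x_j}."""
    return math.prod(pj ** xj * (1 - pj) ** (1 - xj) for xj, pj in zip(x, p))


def bayes_rule(x):
    """Predict 1 when P(x | y=1) pi_1 > P(x | y=0) pi_0."""
    return int(bernoulli_lik(x, p1) * pi1 > bernoulli_lik(x, p0) * pi0)


# the two rules should agree on every binary feature vector
agree = all(linear_rule(x) == bayes_rule(x) for x in product([0, 1], repeat=3))
print(beta, C, agree)
```

In this toy setup the first word has $\mathbf{\beta}_j>0$ (evidence for Spam), the second has $\mathbf{\beta}_j<0$ (evidence against), and the third has $\mathbf{\beta}_j=0$ because it is equally likely in both classes.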

### in practice

- in practice, can't directly use the decision rule because the true parameters $p_{j,y}=\mathbb{P}(\mathbf{x}_j=1|y)$ (and the priors $\pi_y$) are unknown

- but can estimate $p_{j,y}$ from training data, then **plug in** the estimates to $\mathbf{\beta}_j$ and $C$ in the linear classifier

- words may depend on each other, so the coordinates $\mathbf{x}_j$ are not truly independent given $y$

    e.g., in an email about Disney, the words "Mickey" and "Mouse" almost always appear together
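The plug-in idea can be sketched in Python as follows. This is one possible estimator, not necessarily the one intended by the notes: $p_{j,y}$ is estimated by the fraction of class-$y$ emails that contain word $j$, and $\pi_y$ by the class frequency. The add-one smoothing (`alpha`) is an extra assumption introduced here so that no estimate is exactly 0 or 1, which would make the logarithms undefined. The function names and the toy data are hypothetical.

```python
import math


def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate beta and C from binary features X and binary labels y.

    alpha is a smoothing pseudo-count (an assumption, not from the notes).
    """
    d = len(X[0])
    n = {0: 0, 1: 0}                   # number of emails per class
    counts = {0: [0] * d, 1: [0] * d}  # emails per class containing word j
    for xi, yi in zip(X, y):
        n[yi] += 1
        for j, xij in enumerate(xi):
            counts[yi][j] += xij

    # smoothed plug-in estimates of p_{j,y} and pi_y
    p = {c: [(counts[c][j] + alpha) / (n[c] + 2 * alpha) for j in range(d)]
         for c in (0, 1)}
    pi = {c: n[c] / len(y) for c in (0, 1)}

    beta = [math.log(p[1][j] / (1 - p[1][j]) * (1 - p[0][j]) / p[0][j])
            for j in range(d)]
    C = math.log(pi[0] / pi[1]) - sum(
        math.log((1 - p[1][j]) / (1 - p[0][j])) for j in range(d))
    return beta, C


def predict(x, beta, C):
    """Linear decision rule: 1 (Spam) when sum_j x_j * beta_j > C."""
    return int(sum(xj * bj for xj, bj in zip(x, beta)) > C)


# toy data: column 0 = "free", column 1 = "meeting"; label 1 = Spam
X = [[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 0]]
y = [1, 1, 1, 0, 0, 0]

beta, C = fit_bernoulli_nb(X, y)
print(beta, C)
print(predict([1, 0], beta, C), predict([0, 1], beta, C))
```

The independence caveat still applies: if two words always co-occur, their $\mathbf{\beta}_j$ terms effectively count the same evidence twice.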