# Chapter 1: Generative Models

## Generative Versus Discriminative Modeling

- A _discriminative model_ estimates $p(y|\mathbf{x})$, i.e. the probability of a label $y$ given an observation $\mathbf{x}$.

- A _generative model_ estimates $p(\mathbf{x})$, i.e. the probability of observing $\mathbf{x}$. If the dataset is labeled, a generative model can instead estimate the class-conditional distribution $p(\mathbf{x}|y)$.

## The Generative Modeling Framework

- We have a dataset of observations $\mathbf{X}$.

- We assume the observations have been generated according to some unknown distribution, $p_\text{data}$.

- A generative model $p_\text{model}$ tries to mimic $p_\text{data}$ so that we can generate new observations.

We are generally impressed with $p_\text{model}$ if:

1. It can generate observations that appear to have been drawn from $p_\text{data}$.

2. It can generate examples suitably different from the observations it has seen, i.e. it does not just copy previously seen observations.

## Probabilistic Generative Models

### Sample Space

A _sample space_ is the complete set of all values that $\mathbf{x}$ can take.

### Probability Density Function

A _probability density function_ (or _density function_), $p(\mathbf{x})$, is a function that maps each point $\mathbf{x}$ in the sample space, $S$, to a non-negative number such that

$$ \int_{\mathbf{x}\,\in\,S} p(\mathbf{x})\,d\mathbf{x} = 1 $$

if $S$ is continuous or

$$ \sum\limits_{\mathbf{x}\,\in\,S} p(\mathbf{x}) = 1 $$

if $S$ is discrete.

### Parametric Modeling

A _parametric model_, $p_\theta(\mathbf{x})$, is a family of density functions that can be described using a finite number of parameters represented by the vector, $\theta$.

### Likelihood

The _likelihood_, $\mathcal{L}(\theta | \mathbf{x})$, of a parameter set $\theta$ is a function of $\theta$ that measures how plausible $\theta$ is given some observed point $\mathbf{x}$. It is defined as the value at $\mathbf{x}$ of the density function parameterized by $\theta$:

$$ \mathcal{L}(\theta|\mathbf{x}) = p_\theta(\mathbf{x}). $$

Given a dataset, $\mathbf{X}$, of independent observations, the likelihood of $\theta$ is given by

$$ \mathcal{L}(\theta|\mathbf{X}) = \prod\limits_{\mathbf{x}\,\in\,\mathbf{X}} p_\theta (\mathbf{x}). $$

It is more common to work with the _log-likelihood_ function instead, which turns the product into a sum and is therefore cheaper to compute and less prone to numerical underflow:

$$ \ell(\theta|\mathbf{X}) = \sum\limits_{\mathbf{x}\,\in\,\mathbf{X}} \log p_\theta(\mathbf{x}). $$
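
To illustrate why the log-likelihood is preferred in practice, here is a minimal sketch that evaluates both quantities on a toy dataset. The Gaussian density, the helper names (`gaussian_pdf`, `likelihood`, `log_likelihood`) and the synthetic data are assumptions made for this example, not part of the chapter.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a univariate normal distribution evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def likelihood(data, mu, sigma):
    """Product of the densities of all observations."""
    result = 1.0
    for x in data:
        result *= gaussian_pdf(x, mu, sigma)
    return result

def log_likelihood(data, mu, sigma):
    """Sum of the log-densities of all observations."""
    return sum(math.log(gaussian_pdf(x, mu, sigma)) for x in data)

# Hypothetical toy data: 2000 points spread evenly over [-1, 1].
data = [0.1 * (i % 21) - 1.0 for i in range(2000)]

print(likelihood(data, 0.0, 1.0))      # the product underflows to 0.0
print(log_likelihood(data, 0.0, 1.0))  # a finite negative number
```

Every factor in the product is below 0.4, so after 2000 multiplications the result is too small for a double-precision float, while the sum of logarithms stays well within range.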

### Maximum Likelihood Estimation

The _maximum likelihood estimate_ (MLE) is the set of parameters, $\hat{\theta}$, that maximizes the likelihood function, i.e.

$$ \hat{\theta} = \underset{\theta}{\text{argmax}} \, \mathcal{L}(\theta|\mathbf{X}). $$
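
As a rough illustration of the argmax, the sketch below finds the MLE of a single parameter by brute-force search over a grid. It assumes a Bernoulli model (a biased coin with $p_\theta(x=1) = \theta$), which is not discussed in the chapter, and it maximizes the log-likelihood, which has the same maximizer as the likelihood because the logarithm is monotonically increasing. The function name and data are hypothetical.

```python
import math

def bernoulli_log_likelihood(theta, data):
    """Log-likelihood of theta for binary observations (1 = heads, 0 = tails)."""
    return sum(math.log(theta) if x == 1 else math.log(1.0 - theta) for x in data)

# Hypothetical dataset: 7 heads and 3 tails.
data = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

# Candidate parameter values strictly between 0 and 1.
grid = [i / 1000 for i in range(1, 1000)]

# The MLE is the candidate with the highest log-likelihood.
theta_hat = max(grid, key=lambda theta: bernoulli_log_likelihood(theta, data))

print(theta_hat)              # 0.7
print(sum(data) / len(data))  # 0.7, the fraction of heads
```

For this simple model the grid search agrees with the closed-form answer, the observed fraction of heads. For models with many parameters a grid is infeasible and the maximization is usually done with gradient-based optimizers.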

## Probabilistic Generative Models

### Multinomial Distribution

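The _multinomial distribution_ assigns a probability to each possible combination of feature values. Writing $n_j$ for the number of times combination $j$ appears in the original dataset and $N$ for the total number of observations, the MLE, $\hat{\theta}_j$, is the proportion of observations equal to combination $j$: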

$$ \hat{\theta}_j = \frac{n_j}{N}. $$

The one clear downside of this model is that it cannot generate anything new: any combination that does not appear in the dataset is assigned probability zero. This can be remedied by adding a _pseudocount_ of 1 to each possible combination. This changes the estimate to

$$ \hat{\theta}_j = \frac{n_j + 1}{N + d} $$

where $d$ is the total number of possible combinations of features. This technique is known as _additive smoothing_. It ensures that every combination has some probability of being generated, but it also means that all unobserved combinations are equally likely to be drawn.
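
The two estimates can be compared on a toy dataset with the sketch below. The function `fit_multinomial`, its `pseudocount` argument and the colour/shape features are hypothetical names chosen for this example. A pseudocount of 0 gives the plain MLE and a pseudocount of 1 gives the smoothed estimate $(n_j + 1)/(N + d)$.

```python
from collections import Counter
from itertools import product

def fit_multinomial(observations, feature_values, pseudocount=0):
    """Estimate one probability per possible combination of feature values."""
    combinations = list(product(*feature_values))  # all d possible combinations
    counts = Counter(observations)                 # n_j for each observed combination
    n = len(observations)                          # N
    d = len(combinations)
    return {
        combo: (counts[combo] + pseudocount) / (n + pseudocount * d)
        for combo in combinations
    }

# Hypothetical dataset with two features, so d = 2 * 2 = 4.
feature_values = [["red", "blue"], ["round", "square"]]
observations = [
    ("red", "round"),
    ("red", "round"),
    ("blue", "square"),
    ("red", "square"),
]

mle = fit_multinomial(observations, feature_values)
smoothed = fit_multinomial(observations, feature_values, pseudocount=1)

print(mle[("blue", "round")])       # 0.0, never observed so never generated
print(smoothed[("blue", "round")])  # 0.125, i.e. (0 + 1) / (4 + 4)
print(smoothed[("red", "round")])   # 0.375, i.e. (2 + 1) / (4 + 4)
```

The table of probabilities has one entry per combination, so its size grows multiplicatively with each added feature.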

### Naive Bayes

The _Naive Bayes_ assumption states that for any two distinct features of the dataset, $x_j$ and $x_k$, we have

$$ p\left(x_j | x_k\right) = p\left(x_j\right) $$

i.e. that the values of any two features are independent of each other. The chain rule of probability lets us write the probability of an observation, $\mathbf{x}$, as

$$ \begin{align} p(\mathbf{x}) & = p\left(x_1,...,x_K\right) \\ & = p\left(x_2,...,x_K|x_1\right)p\left(x_1\right) \\ & = p\left(x_3,...,x_K|x_1,x_2\right) p\left(x_2|x_1\right) p\left(x_1\right) \\ & = \prod\limits_{k=1}^K p\left(x_k | x_1,...,x_{k-1}\right) \end{align} $$

where $K$ is the number of features. Using the Naive Bayes assumption, we can simplify this expression to

$$ p(\mathbf{x}) = \prod\limits_{k=1}^K p\left(x_k\right) $$

This is the Naive Bayes model. Finding the best model then amounts to finding the best parameters

$$ \theta_{kl} = p\left(x_k = l\right). $$

The MLE $\hat{\theta}_{kl}$ is given by

$$ \hat{\theta}_{kl} = \frac{n_{kl}}{N} $$

where $n_{kl}$ is the number of times $x_k$ takes the value $l$ in the dataset of observations and $N$ is the total number of observations.
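
A minimal sketch of fitting and evaluating this model on a toy dataset follows. The function names `fit_naive_bayes` and `probability` and the colour/shape features are assumptions made for illustration. Each feature gets its own table of estimates $\hat{\theta}_{kl} = n_{kl}/N$, and the probability of an observation is the product of the per-feature entries.

```python
from collections import Counter

def fit_naive_bayes(observations):
    """Return a list theta where theta[k][l] estimates p(x_k = l) as n_kl / N."""
    n = len(observations)
    num_features = len(observations[0])
    theta = []
    for k in range(num_features):
        counts = Counter(obs[k] for obs in observations)  # n_kl for feature k
        theta.append({value: count / n for value, count in counts.items()})
    return theta

def probability(theta, x):
    """p(x) as the product over features of theta[k][x_k]."""
    p = 1.0
    for k, value in enumerate(x):
        # A value never seen for feature k has an estimated probability of 0.
        p *= theta[k].get(value, 0.0)
    return p

# Hypothetical dataset with two features (colour, shape).
observations = [
    ("red", "round"),
    ("red", "round"),
    ("blue", "square"),
    ("red", "square"),
]

theta = fit_naive_bayes(observations)

print(theta[0]["blue"])                       # 0.25
print(theta[1]["round"])                      # 0.5
print(probability(theta, ("blue", "round")))  # 0.125
```

The combination `("blue", "round")` never appears in the dataset, yet it receives non-zero probability because each of its feature values was observed individually. This is how the independence assumption lets the model generate combinations it has not seen.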