# Deep Generative Model
This is a study note for Stanford CS236 Deep Generative Model.
Additional Resources:
[Course Github Notes](https://deepgenerativemodels.github.io/notes/)

## Module 1: Introduction to Generative Models
- Suggested Readings:
    - [Deep Generative Models](https://ermongroup.github.io/generative-models/)
    - [Generative Modeling by Estimating Gradients of the Data Distribution](https://yang-song.net/blog/2021/score/)
    - [Tutorial on Deep Generative Models](https://www.youtube.com/watch?v=JrO5fSskISY)
    - [Learning Deep Generative Models](https://www.cs.cmu.edu/~rsalakhu/papers/annrev.pdf)

### What is Generative Modeling?

**Generative models** involve two parts:
- **Generation (graphics)**: from high-level descriptions to raw sensory outputs
- **Inference (vision as inverse graphics)**: from raw sensory outputs to high-level descriptions

**Statistical** Generative Models are **learned from data**.
Statistical generative models rely more on data and less on prior knowledge, whereas computer graphics relies more on prior knowledge.

#### Statistical Generative Models


A **statistical generative** model is a probability distribution $p(x)$:
- **Data**: Samples
- **Prior Knowledge**: parametric form, loss function, optimization algorithm

Image $x$ $\rightarrow$ A probability distribution $p(x)$ $\rightarrow$ scalar probability $p(x)$.

It is generative because sampling from $p(x)$ generates new images.

It can be used to build a simulator for the data-generating process.

Control Signals/Potential datapoints $\rightarrow$ Data Simulator = Statistical Model = Generative Model $\rightarrow$ New datapoints/Probability values


### Audio and Image Applications of Generative Models

- Data Generation in the real world
    - Text to Image: [Language-guided artwork creation](https://chainbreakers.kath.io/)
    - Drawn Image to Realistic Image [Meng, He, Song et al ICLR 2022](https://arxiv.org/abs/2108.01073)
- Solving inverse problems with generative models
    - Medical image reconstruction [Song et al ICLR 2022](https://arxiv.org/abs/2111.08005)
- Outlier Detection with generative models
    - Outlier Detection [Song et al ICLR 2018](https://arxiv.org/abs/1710.10766)
- Progress in Generative Models of Images
    - GANs [Ian Goodfellow 2019](https://arxiv.org/abs/1406.2661)
    - Diffusion Models [Song et al 2021](https://arxiv.org/abs/2101.09258)
        - Text2Image Diffusion Models
- Progress in Inverse Problems
    -  Low Resolution $\rightarrow$ High resolution [Menon et al, 2020](https://arxiv.org/abs/2003.03808)
    -  Mask $\rightarrow$ Full Image [Liu et al 2018](https://arxiv.org/abs/1804.07723)
    -  Greyscale Images $\rightarrow$  Color Image
    -  Sketch $\rightarrow$ Fine Image
    -  Original Images $\rightarrow$ Edited Images
-  Audio
    - WaveNet [van den Oord et al 2016c](https://arxiv.org/abs/1609.03499)
    - Diffusion Text2Speech [Betker, Better Speech Synthesis through scaling 2023](https://arxiv.org/abs/2305.07243)
    - Conditional Generative Model: Low-Resolution Audio Signal $\rightarrow$ High-Resolution Audio Signal

### Language, Video, and Robotic Applications of Generative Models

- Language Generation [Radford et al 2019](https://insightcivic.s3.us-east-1.amazonaws.com/language-models.pdf)
    - Conditional Generative Model *P(next word | previous words)*
    - ChatGPT
- Machine Translation
    - Conditional Generative Model *P(English Text | Chinese Text)*
- Code Generation
- Video Generation
- Imitation Learning
    - Conditional Generative Model *P(actions | past observations)* [Li et al 2017](https://arxiv.org/abs/1701.01036) | [Janner et al 2022](https://arxiv.org/abs/2205.09991)
- Molecule Generation
- DeepFake

### Roadmap and Challenges in Generative Modeling

**Representation**: how do we model the joint distribution of many random variables?

**Learning**: what is the right way to compare probability distributions?

**Inference**: how do we invert the generation process?

### Generative Model Curse of Dimensionality and Bayesian Networks

#### Overview

- What is a generative model
- Representing probability distributions
    - Curse of dimensionality
    - Crash course on graphical models (Bayesian networks)
    - Generative vs discriminative models
    - Neural models

#### Learning  a generative model

We want to learn a probability distribution $p(x)$ over images x such that

**Generation**: If we sample $x_{new} \sim p(x)$, $x_{new}$ should look like a dog (sampling)

**Density Estimation**: $p(x)$ should be high if $x$ looks like a dog and low otherwise (anomaly detection)

**Unsupervised Representation Learning**: We should be able to learn what these images have in common, e.g. ears, tail, etc. (features)

#### How to Represent $p(x)$                   

**Basic Discrete Distribution**
- Bernoulli Distribution: (biased) coin flip
    - $D = \{Heads,Tails\}$
    - $P(X=Heads) = p, P(X=Tails) = 1-p$
    - $X \sim Ber(p)$
- Categorical Distribution: (biased) m-sided dice
    - $D=\{1,...,m\}$
    - $P(Y=i) = p_i, \sum_i p_i=1$
    - $Y \sim Cat(p_1,...,p_m)$

Example of joint distribution: 
- Modeling one RGB pixel: $Val(R) = Val(G) = Val(B) = \{0,...,255\}$, so specifying $p(R,G,B)$ requires $256 \times 256 \times 256 - 1$ parameters.
- Modeling a binary (black/white) image with $n$ pixels: $Val(X_i)=\{0,1\}$, so specifying $p(x_1,...,x_n)$ requires $2^n - 1$ parameters.

A distribution over $k$ outcomes needs $k-1$ parameters because the $k$ probabilities must sum to 1; the last one is determined by the other $k-1$.



**Assumption**: Independence
$$p(x_1,...,x_n) = p(x_1)p(x_2)...p(x_n) $$

- There are still $2^n$ possible states.
- Each marginal $p(x_i)$ needs only 1 parameter, so the joint $p(x_1,...,x_n)$ needs only $n$ parameters.
- $2^n$ entries can be described by just $n$ numbers (if $|Val(X_i)| = 2$).

Problem: the independence assumption is too strong, so the model may not be useful.

**Two Important Rules**
- Chain Rule
    - $P(S_1 \cap S_2 \cap \cdots \cap S_n) = p(S_1)p(S_2 | S_1) \cdots p(S_n | S_1 \cap \cdots \cap S_{n-1})$
- Bayes' Rule
    - $p(S_1 | S_2) = \frac{p(S_1 \cap S_2)}{p(S_2)} = \frac{p(S_2 | S_1)p(S_1)}{p(S_2)}$ 

**Structure Through Conditional Independence**
$$p(x_1,...,x_n) = p(x_1)p(x_2 |x_1)p(x_3 | x_1,x_2) \cdots p(x_n | x_1,...,x_{n-1})$$

- $p(x_1)$ requires 1 parameter.
- $p(x_2 | x_1 = 0)$ requires 1 parameter, $p(x_2 | x_1 = 1)$ requires 1 parameter
- In total, we need $1+2+ ... +2^{n-1}= 2^n-1$ parameters
- This is still exponential in $n$: the chain rule alone does not reduce the number of parameters.

  

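Now suppose each variable depends only on its immediate predecessor, $X_{i+1} \perp X_1,\cdots,X_{i-1} | X_i$ (a Markov assumption), e.g. predicting the next word from the previous word only.
$$\begin{align}
p(x_1,...,x_n) & = p(x_1)p(x_2 |x_1)p(x_3 | x_1,x_2) \cdots p(x_n | x_1,\cdots,x_{n-1}) \\
& = p(x_1) p(x_2 |x_1)p(x_3 | x_2)\cdots p(x_n | x_{n-1})
\end{align}$$

It requires $2n-1$ parameters: 1 for $p(x_1)$ and 2 for each of the remaining $n-1$ conditionals.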
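The parameter counts discussed so far can be checked numerically. The following is a minimal sketch for binary variables; the function names are illustrative and not part of the course material.

```python
def num_params_full_joint(n: int) -> int:
    """Unrestricted joint over n binary variables: one probability per state, minus 1."""
    return 2 ** n - 1


def num_params_independent(n: int) -> int:
    """Fully independent model: one Bernoulli parameter per variable."""
    return n


def num_params_chain_rule(n: int) -> int:
    """Chain rule with tabular conditionals: p(x_i | x_<i) needs 2^(i-1) parameters."""
    return sum(2 ** (i - 1) for i in range(1, n + 1))


def num_params_markov(n: int) -> int:
    """First-order Markov chain: 1 for p(x_1), 2 for each later conditional."""
    return 1 + 2 * (n - 1)


for n in (3, 10, 20):
    print(n, num_params_full_joint(n), num_params_chain_rule(n),
          num_params_independent(n), num_params_markov(n))
```

The chain-rule count always equals the full-joint count, which is why a factorization only helps once conditional independence assumptions are added.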

**Bayesian Network** General Idea

- Use conditional parameterization (instead of joint parameterization)
- For each random variable $X_i$ specify $p(x_i |  \mathbf{x_{A_i}})$ for set $\mathbf{X_{A_i}}$ of random variables.

$$p(x_1,\cdots,x_n) = \prod_i p(x_i |\mathbf{x_{A_i}})$$

- We need to guarantee it is a legal probability distribution.
  

**Bayesian Network** Formal

A **Bayesian Network** is specified by a *directed* **acyclic** graph (DAG), $G=(V,E)$, with
- One node $i \in V$ for each random variable $X_i$
- One conditional probability distribution (CPD) per node, $p(x_i |\mathbf{x_{Pa_i}})$, specifying the variable's probability conditioned on its parents' value.

Graph $G=(V,E)$ is called the structure of the Bayesian Network

Defines a joint distribution:
$$p(x_1, \cdots, x_n) = \prod_i p(x_i |\mathbf{x_{Pa_i}})$$

**Claim**: $p(x_1,...,x_n)$ is a valid probability distribution because of the ordering implied by the DAG.

**Economical Representation**: the number of parameters is exponential in $|Pa(i)|$, not in $|V|$.
$$p(x_1,\cdots,x_n) = \prod_i p(x_i |\mathbf{x_{Pa_i}})$$

### Generative vs Discriminative Models

**Naive Bayes for Single Label Prediction**

- Words are conditionally independent given Y
    - Let $1:n$ index the words in our vocabulary
    - $X_i = 1$ if word $i$ appears in an email, and 0 otherwise
    - E-mails are drawn according to some distribution $p(Y,X_1,\cdots,X_n)$

Then,
$$p(y,x_1,\cdots,x_n) = p(y) \prod_{i=1}^n p(x_i |y)$$

**Estimate** parameters from training data. **Predict** with Bayes rule:
$$p(Y=1 | x_1,\cdots,x_n) = \frac{p(Y=1)\prod_{i=1}^n p(x_i | Y = 1)}{\sum_{y=\{0,1\}} p(Y=y) \prod_{i=1}^n p(x_i | Y=y)}$$
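The Bayes-rule prediction above can be sketched in a few lines. This is a minimal illustration assuming binary features and already-estimated parameters; the function name and the example numbers are hypothetical. It works in log space to avoid underflow when $n$ is large.

```python
import numpy as np


def naive_bayes_posterior(x, prior_y1, p_x_given_y1, p_x_given_y0):
    """Return p(Y=1 | x) for binary features x under the Naive Bayes factorization.

    p_x_given_y1[i] = p(X_i = 1 | Y = 1), p_x_given_y0[i] = p(X_i = 1 | Y = 0).
    """
    x = np.asarray(x, dtype=float)

    def log_joint(prior, p):
        p = np.asarray(p, dtype=float)
        # log p(y) + sum_i log p(x_i | y)
        return np.log(prior) + np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

    l1 = log_joint(prior_y1, p_x_given_y1)
    l0 = log_joint(1 - prior_y1, p_x_given_y0)
    m = max(l1, l0)  # subtract the max before exponentiating for stability
    return np.exp(l1 - m) / (np.exp(l1 - m) + np.exp(l0 - m))


# Hypothetical parameters for a 3-word vocabulary.
print(naive_bayes_posterior([1, 0, 1], 0.4, [0.7, 0.2, 0.6], [0.1, 0.3, 0.2]))
```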

Chain Rule $p(Y,\mathbf{X}) = p(\mathbf{X} | Y) p(Y) = p(Y|\mathbf{X}) p(\mathbf{X})$

Corresponding Bayesian Networks:<br>
*Generative* $Y \rightarrow X$ <br>
*Discriminative* $X \rightarrow Y$ <br>

Suppose all we need for prediction is $p(Y|\mathbf{X})$

In the generative model, we need to specify both $p(Y)$ and $p(\mathbf{X}|Y)$, then compute $p(Y | \mathbf{X})$ via the Bayes rule.

In the discriminative model, it suffices to estimate just the **conditional distribution** $p(Y|\mathbf{X})$ 
- We never need to model/learn/use $p(\mathbf{X})$!
- Called a **discriminative** model because it is only useful for discriminating the label $Y$ when given $\mathbf{X}$.

$$p(Y,\mathbf{X}) = p(Y)p(X_1| Y )p(X_2 | Y,X_1) \cdots p(X_n | Y,X_1,\cdots,X_{n-1})$$
$$p(Y,\mathbf{X}) = p(X_1)p(X_2| X_1 )p(X_3 | X_1,X_2) \cdots p(Y | X_1,\cdots,X_{n-1},X_n)$$
- In the generative model, $p(Y)$ is simple, but how do we parameterize $p(X_i | \mathbf{X}_{pa(i)}, Y)$?
- In the discriminative model, how do we parameterize $p(Y | \mathbf{X})$? Here we assume we don't care about modeling $p(\mathbf{X})$ because $\mathbf{X}$ is always given to us in a classification problem.

**Naive Bayes**
- For the generative model, assume that $X_i \perp X_{-i} | Y$ (the **Naive Bayes** assumption)

<div style="text-align:center;">
  <img height="100%" width="50%" src="sources/M1_1_6.png" />
</div>

**Logistic Regression**

Discriminative Model: $$p(Y=1|\mathbf{x};\alpha) = f(\mathbf{x},\alpha)$$

It is a parameterized function of $\mathbf{x}$ (regression). Its output has to be between 0 and 1.

Linear Dependence: 

let $z(\alpha;x)= \alpha_0 + \sum_{i=1}^n \alpha_i x_i$. 

Then, $p(Y=1 | x;\alpha) = \sigma (z (\alpha, x))$ where $\sigma(z) = \frac{1}{1+e^{-z}}$ is called **logistic function**.


- Decision Boundary: the region $p(Y=1 | x;\alpha) > 0.5$ is the half-space $z(\alpha;x) > 0$, so the boundary is linear in x
- Equal Probability contours are straight lines.
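The logistic regression model above can be written directly in NumPy. This is a minimal sketch; the weights in the example are made up for illustration.

```python
import numpy as np


def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def predict_proba(x, alpha0, alpha):
    """p(Y=1 | x; alpha) = sigmoid(alpha_0 + sum_i alpha_i * x_i)."""
    return sigmoid(alpha0 + np.dot(alpha, x))


# Hypothetical weights: the decision boundary is the line x_1 = x_2.
alpha0, alpha = 0.0, np.array([1.0, -1.0])
print(predict_proba(np.array([2.0, 0.0]), alpha0, alpha))  # above 0.5
print(predict_proba(np.array([0.0, 2.0]), alpha0, alpha))  # below 0.5
```

Points with $z(\alpha;x) = 0$ get probability exactly 0.5, which is why the decision boundary is a hyperplane.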

The logistic model does not assume $X_i \perp X_{-i} | Y$. For example, in spam classification, let $X_1$ = "bank" in email and $X_2$ = "account" in email. Assume that, regardless of whether the email is spam, these words always appear together, i.e. $X_1 = X_2$.

Learning in naive Bayes results in $p(X_1 | Y) = p(X_2 | Y)$. Thus naive Bayes double counts the evidence. Logistic regression can instead set $\alpha_1 = 0$ or $\alpha_2 = 0$ and count the evidence once.

Using a conditional model is only possible when $X$ is observed. When some $X_i$ variables are unobserved, the generative model allows us to compute $p(Y | X_{evidence})$ by marginalizing over the unseen variables.

**Logistic Regression** often outperforms **Naive Bayes** when data is plentiful because it makes weaker assumptions. With little data, Naive Bayes can be the better choice: its stronger assumptions leave fewer relationships to be learned from the data.

### Neural Models

**Non-linear dependence**: let $h(A,b,x) = f(Ax+b)$ be a non-linear transformation of the inputs (features).
$$p_{Neural} (Y=1 | x;\alpha,A,b) = \sigma(\alpha_0 + \sum_{i=1}^h \alpha_ih_i)$$
- More Flexible
- More Parameters: $A, b, \alpha$
- Can repeat multiple times to get a neural network

<div style="text-align:center;">
  <img height="100%" width="50%" src="sources/M1_1_7.png" />
</div>

**Continuous Variables**
If X is a continuous random variable, we can use its *probability density function* $p_X: \mathbb{R} \rightarrow \mathbb{R}^+$. Typically, consider parameterized densities:
- Gaussian: $X \sim N(\mu,\sigma)$ if $p_X(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{\frac{-(x-\mu)^2}{2\sigma^2}}$
- Uniform: $X \sim \mathcal{U}(a,b)$ if $p_X(x)=\frac{1}{b-a} 1[a\leq x\leq b]$

If X is a continuous random vector, we can usually represent it using its **joint probability density function**.
- Gaussian: if $p_X(x) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right)$

Chain Rule, Bayes Rule still apply.
$$p_{X,Y,Z}(x,y,z) = p_X(x)p_{Y|X}(y|x) p_{Z|\{X,Y\}}(z|x,y)$$

We can still use Bayesian Networks with continuous (and discrete) variables.

For example: **Mixture of 2 Gaussians**: Bayes net $Z \rightarrow X$ with factorization $p_{Z,X}(z,x) = p_Z(z)p_{X|Z}(x|z)$ and
- $Z \sim Bernoulli(p)$
- $X|(Z=0) \sim N(\mu_0,\sigma_0)$, $X|(Z=1)\sim N(\mu_1,\sigma_1)$
- The parameters are $p,\mu_0,\sigma_0,\mu_1,\sigma_1$
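Sampling from this Bayes net follows the factorization: first draw $Z$, then draw $X$ given $Z$ (ancestral sampling). A minimal sketch, with illustrative parameter values that are not from the course:

```python
import numpy as np


def sample_mixture(num_samples, p, mu0, sigma0, mu1, sigma1, seed=0):
    """Ancestral sampling from Z -> X with Z ~ Bernoulli(p), X | Z=z ~ N(mu_z, sigma_z)."""
    rng = np.random.default_rng(seed)
    z = rng.binomial(1, p, size=num_samples)   # sample the parent first
    mu = np.where(z == 1, mu1, mu0)            # pick the component mean per sample
    sigma = np.where(z == 1, sigma1, sigma0)   # pick the component std per sample
    x = rng.normal(mu, sigma)                  # then sample the child given the parent
    return z, x


z, x = sample_mixture(1000, p=0.3, mu0=-2.0, sigma0=0.5, mu1=3.0, sigma1=1.0)
print(z.mean(), x[z == 0].mean(), x[z == 1].mean())
```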
  
Bayes Net $Z \rightarrow X$ with factorization  $p_{Z,X}(z,x) = p_Z(z)p_{X|Z}(x|z)$ and
- $Z \sim \mathcal{U}(a,b)$
- $X|(Z=z) \sim N(z,\sigma)$
- The parameters are $a,b,\sigma$

Variational Autoencoder: Bayes net $Z \rightarrow X$ with factorization $p_{Z,X}(z,x) = p_Z(z)p_{X|Z}(x|z)$ and
- $Z\sim N(0,1)$
- $X|(Z=z) \sim N(\mu_{\theta}(z),e^{\sigma_{\phi}(z)})$, where $\mu_{\theta}: \mathbb{R} \rightarrow \mathbb{R}$ and $\sigma_{\phi}$ are neural networks with parameters (weights) $\theta$, $\phi$ respectively.

## Module 2: Autoregressive Models

### FVSBN

#### Neural Models for Classification

**Setting**: 

Binary classification of $Y \in \{0,1\}$ given input features $X \in \{0,1\}^n$

For classification, we care about $P(Y|x)$, and assume that $P(Y=1 | x;\alpha) = f(x,\alpha)$

Motivating Example: MNIST

**Given**: a dataset $D$ of handwritten digits; each image has $n = 28 \times 28 = 784$ pixels, each black (0) or white (1).

**Goal**: Learn a probability distribution $p(x) = p(x_1,\cdots,x_{784})$ over $x \in \{0,1\}^{784}$ such that when $x \sim p(x)$, x looks like a digit.

Process:
1. Parameterize a model family $\{p_{\theta}(x),\theta \in \Theta\}$
2. Search for model parameters $\theta$ based on training data $D$

#### Autoregressive Models

We can pick an ordering of all the random variables, e.g. raster scan ordering of pixels from top-left ($X_1$) to bottom-right ($X_{n=784}$)

$$p(x_1,\cdots,x_{784}) = p(x_1)p(x_2| x_1) p(x_3 | x_1,x_2)\cdots p(x_n|x_1,\cdots,x_{n-1})$$

Some conditionals are too complex to be stored in tabular form. Instead, we assume 
$$p(x_1,\cdots,x_{784}) = p_{CPT}(x_1;\alpha^1) p_{logit}(x_2 | x_1;\alpha^2) p_{logit}(x_3 | x_1,x_2;\alpha^3)\cdots p_{logit}(x_n | x_1,\cdots,x_{n-1};\alpha^n)$$


More explicitly:

$p_{CPT}(X_1=1;\alpha^1) = \alpha^1, p(X_1 = 0) = 1-\alpha^1$

$p_{logit}(X_2=1 | x_1;\alpha^2)=\sigma(\alpha_0^2 + \alpha_1^2x_1)$

$p_{logit}(X_3=1 | x_1,x_2;\alpha^3)=\sigma(\alpha_0^3 + \alpha_1^3x_1 + \alpha_2^3x_2)$

This is a **modeling assumption**. Given all the previous pixels, we use a parameterized function to predict the next pixel. Whether it works well depends on how simple the relationship between pixels actually is.

#### **Fully Visible Sigmoid Belief Network (FVSBN)**

The conditional variables $X_i | X_1,\cdots,X_{i-1}$ are Bernoulli with parameters $\hat{x}_i$:
$$\hat{x}_i = p(X_i = 1 | x_1,\cdots,x_{i-1};\alpha^i) = p(X_i = 1 | x_{<i}; \alpha^i) = \sigma(\alpha_0^i + \sum_{j=1}^{i-1} \alpha_j^ix_j)$$



How to evaluate $p(x_1,\cdots,x_{784})$?
$$\begin{align*}
p(X_1 = 0,X_2 =1,X_3=1,X_4 = 0) & = (1-\hat{x}_1) \times \hat{x}_2 \times \hat{x}_3 \times (1-\hat{x}_4)\\
& = (1-\hat{x}_1) \times \hat{x}_2(X_1 = 0) \times \hat{x}_3(X_1 = 0,X_2 = 1) \times (1-\hat{x}_4(X_1 = 0,X_2 = 1,X_3 = 1))
\end{align*}$$

How to sample from $p(x_1,\cdots,x_{784})$?

- Sample $\bar{x}_1 \sim p(x_1)$, e.g. `np.random.choice([1, 0], p=[x_hat_1, 1 - x_hat_1])` where `x_hat_1` stands for $\hat{x}_1$
- Sample $\bar{x}_2 \sim p(x_2 |x_1 = \bar{x}_1 )$
- Sample $\bar{x}_3 \sim p(x_3 |x_1 = \bar{x}_1 ,x_2 = \bar{x}_2)$

This model performs poorly in practice because a logistic regression per pixel is too simple to capture the dependencies between pixels.
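Both operations, likelihood evaluation and sampling, can be sketched for a small FVSBN. This is a minimal illustration with 0-based indices and randomly initialized (untrained) parameters; the function names are hypothetical. Note that evaluation computes all conditionals at once, while sampling must proceed one variable at a time.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fvsbn_conditionals(x, bias, weights):
    """Return x_hat[i] = p(X_i = 1 | x_<i) for all i in one shot.

    Only the strictly lower triangle of weights is used, so row i
    sees x_0..x_{i-1} and never x_i or later variables.
    """
    return sigmoid(bias + np.tril(weights, k=-1) @ x)


def fvsbn_log_prob(x, bias, weights):
    """log p(x) = sum_i log Bernoulli(x_i; x_hat_i)."""
    x_hat = fvsbn_conditionals(x, bias, weights)
    return np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))


def fvsbn_sample(bias, weights, rng):
    """Sample x_0, then x_1 given x_0, and so on."""
    n = len(bias)
    x = np.zeros(n)
    for i in range(n):
        p_i = sigmoid(bias[i] + weights[i, :i] @ x[:i])
        x[i] = float(rng.random() < p_i)
    return x


rng = np.random.default_rng(0)
n = 6
bias, weights = rng.normal(size=n), rng.normal(size=(n, n))
x = fvsbn_sample(bias, weights, rng)
print(x, fvsbn_log_prob(x, bias, weights))
```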

### NADE: Neural Autoregressive Density Estimation

To improve the FVSBN model, use a one-layer neural network instead of logistic regression.

$$h_i = \sigma(A_i x_{<i}+c_i)$$
$$\hat{x}_i = p(x_i | x_1,\cdots,x_{i-1};A_i,c_i,\alpha_i,b_i) = \sigma(\alpha_ih_i+b_i)$$
where $A_i,c_i,\alpha_i,b_i$ are parameters

For example $h_2 = \sigma(A_2x_1+c_2), h_3 = \sigma(A_3x_{1,2} + c_3)$ 

Tie weights to reduce the number of parameters and speed up computation.
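$$h_i = \sigma(W_{\cdot,<i}x_{<i}+c)$$
$$\hat{x}_i = p(x_i | x_1,\cdots,x_{i-1}) = \sigma(\alpha_ih_i+b_i)$$
where $W, c$ are shared across all conditionals, $\alpha_i,b_i$ are per-conditional parameters, and $W_{\cdot,<i}$ denotes the first $i-1$ columns of $W$.

For example: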
<div style="text-align:center;">
  <img height="100%" width="50%" src="sources/M2_1_2.png" />
</div>

If $h_i \in \mathbb{R}^d$, how many parameters?
- Linear in n: Weights $W \in \mathbb{R}^{d \times n}$, biases $c \in \mathbb{R}^d$, n logistic regression coefficient vectors $\alpha_i, b_i \in \mathbb{R}^{d+1}$
- Probability is evaluated in O(nd)
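The $O(nd)$ cost comes from reusing the hidden pre-activation: moving from $h_i$ to $h_{i+1}$ only adds one column of $W$. A minimal sketch of this idea, using 0-based indices and random (untrained) parameters; the function name is illustrative.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def nade_log_prob(x, W, c, alpha, b):
    """log p(x) for a binary vector x of length n under NADE with tied weights.

    Shapes: W (d, n), c (d,), alpha (n, d), b (n,).
    """
    a = c.astype(float).copy()   # pre-activation W[:, :i] @ x[:i] + c, for i = 0
    log_p = 0.0
    for i in range(len(x)):
        h = sigmoid(a)                               # hidden units for conditional i
        x_hat = sigmoid(alpha[i] @ h + b[i])         # p(X_i = 1 | x_<i)
        log_p += x[i] * np.log(x_hat) + (1 - x[i]) * np.log(1 - x_hat)
        a += W[:, i] * x[i]                          # O(d) update instead of O(i * d)
    return log_p


rng = np.random.default_rng(0)
n, d = 8, 5
W, c = rng.normal(size=(d, n)), rng.normal(size=d)
alpha, b = rng.normal(size=(n, d)), rng.normal(size=n)
x = rng.integers(0, 2, size=n).astype(float)
print(nade_log_prob(x, W, c, alpha, b))
```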

#### General Discrete Distribution

How to model non-binary discrete random variables $X_i \in \{1,\cdots, K\}$?
- One Solution: Let $\hat{x}_i$ parameterize a categorical distribution

$$h_i = \sigma(W_{\cdot,<i}x_{<i}+c)$$
$$p(x_i | x_1,\cdots,x_{i-1}) = Cat(p_i^1,\cdots,p_i^K)$$
$$\hat{x}_i = (p_i^1,\cdots p_i^K) = softmax(A_ih_i+b_i)$$

Softmax generalizes the sigmoid/logistic function $\sigma(\cdot)$ and transforms a vector of K numbers into a vector of K probabilities (non-negative, sum to 1).

$$softmax(a) = softmax(a^1,\cdots,a^K) = (\frac{exp(a^1)}{\sum_i exp(a^i)},\cdots,\frac{exp(a^K)}{\sum_i exp(a^i)})$$

Python: 
> np.exp(a)/np.sum(np.exp(a))
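The one-liner above overflows for large entries of `a`. A common remedy, shown here as a minimal sketch, is to subtract the maximum before exponentiating; this leaves the result unchanged because softmax is invariant to adding a constant to every entry.

```python
import numpy as np


def softmax(a):
    """Numerically stable softmax of a 1-D array."""
    a = np.asarray(a, dtype=float)
    shifted = a - np.max(a)      # largest entry becomes 0, so exp never overflows
    e = np.exp(shifted)
    return e / np.sum(e)


print(softmax([1.0, 2.0, 3.0]))
print(softmax([1000.0, 1000.0]))   # the naive version would produce nan here
```

With $K = 2$ and logits $(0, z)$, the second output equals $\sigma(z)$, which is the sense in which softmax generalizes the sigmoid.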

### RNADE

How to model continuous random variables $X_i \in \mathbb{R}$? E.g. speech signals.

Solution: Let $\hat{x}_i$ parameterize a continuous distribution.

$$p(x_i | x_1,\cdots,x_{i-1}) = \sum_{j=1}^K \frac{1}{K} N(x_i;\mu^j_i,\sigma^j_i)$$

$$h_i = \sigma(W_{\cdot,<i}x_{<i}+c)$$

$$\hat{x}_i = (\mu_i^1,\cdots,\mu_i^K,\sigma_i^1,\cdots,\sigma_i^K) = f(h_i)$$


$\hat{x}_i$ defines the mean and standard deviation $(\mu_i^j, \sigma_i^j)$ of each of the K Gaussians.

The exponential $exp(\cdot)$ can be used to ensure the standard deviations are non-negative.

E.g. a uniform mixture of K Gaussians.
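A minimal sketch of evaluating one such conditional density. It assumes the network outputs unconstrained values that are exponentiated to obtain standard deviations; the argument names are illustrative and the example numbers are made up.

```python
import numpy as np


def mixture_density(x, means, log_stds):
    """Density at scalar x of a uniform mixture of K Gaussians.

    The standard deviations are exp(log_stds), which keeps them positive
    regardless of what the network outputs.
    """
    means = np.asarray(means, dtype=float)
    stds = np.exp(np.asarray(log_stds, dtype=float))
    components = np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    return np.mean(components)   # each component has weight 1/K


# Hypothetical network outputs for K = 3 components.
print(mixture_density(0.5, means=[-2.0, 1.0, 4.0], log_stds=[0.0, -0.5, 0.3]))
```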


### Autoregressive Models vs Autoencoder

FVSBN and NADE look similar to **Autoencoder**
- an encoder $e(\cdot)$, e.g. $e(x) = \sigma(W^2 (W^1x +b^1)+b^2)$
- a decoder such that $d(e(x)) \approx x$, e.g. $d(h) = \sigma(Vh+c)$
- Loss function for dataset D
    - Binary: $\min_{W^1,W^2,b^1,b^2,V,c} \sum_{x \in D} \sum_{i} -x_i log \hat{x}_i -(1-x_i) log(1-\hat{x}_i)$
    - Continuous: $\min_{W^1,W^2,b^1,b^2,V,c} \sum_{x \in D} \sum_{i} (x_i - \hat{x}_i)^2$
- e and d are constrained so that we don't learn identity mappings. Hope that $e(x)$ is a meaningful, compressed representation of x (feature learning)
- A vanilla autoencoder is *not* a generative model: it does not define a distribution over x we can sample from to generate new data points.


We need to ensure it corresponds to a valid Bayesian Network (DAG structure), i.e. we need an ordering for chain rule. If ordering is 1,2,3, then:
- $\hat{x}_1$ cannot depend on any input $x=(x_1,x_2,x_3)$; then, at generation time, we don't need any input to get started.
- $\hat{x}_2$ can only depend on $x_1$

Bonus: we can use a single neural network (with n inputs and n outputs) to produce all parameters $\hat{x}$ in a single pass. In contrast, NADE requires n passes. This is much more efficient on modern hardware.

#### MADE: Masked Autoencoder for Distribution Estimation

**Challenge**: An autoencoder that is autoregressive (DAG structure)

**Solution**: use masks to disallow certain paths (Germain et al 2015). Suppose ordering is $x_2,x_3,x_1$ so $p(x_1,x_2,x_3) = p(x_2)p(x_3|x_2)p(x_1 |x_2,x_3)$
- The unit producing the parameters for $\hat{x}_2 = p(x_2)$ is not allowed to depend on any input.
- For each unit in a hidden layer, pick a random integer $i$ in $[1,n-1]$. That unit is allowed to depend only on the first $i$ inputs (according to the chosen ordering)
- Add a mask to preserve this invariant: connect to all units in the previous layer with a smaller or equal assigned number.
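The masking rule can be sketched for a network with one hidden layer. This is a minimal illustration that assumes the natural ordering $x_1, \dots, x_n$ (rather than the permuted ordering in the example above) and uses hypothetical function and variable names. The final check multiplies the masks to confirm that output $i$ is connected only to inputs that come earlier in the ordering.

```python
import numpy as np


def made_masks(n, num_hidden, seed=0):
    """Build input->hidden and hidden->output masks for a one-hidden-layer MADE."""
    rng = np.random.default_rng(seed)
    input_degree = np.arange(1, n + 1)                    # position of each input in the ordering
    hidden_degree = rng.integers(1, n, size=num_hidden)   # random integer in [1, n-1]
    # Hidden unit k may see input j only if degree(k) >= position(j).
    mask_in = (hidden_degree[:, None] >= input_degree[None, :]).astype(float)
    # Output i may see hidden unit k only if position(i) > degree(k).
    mask_out = (input_degree[:, None] > hidden_degree[None, :]).astype(float)
    return mask_in, mask_out


mask_in, mask_out = made_masks(n=5, num_hidden=16)
connectivity = mask_out @ mask_in   # entry (i, j) counts paths from input j to output i
print(connectivity)
```

In a full model these masks would be multiplied elementwise with the weight matrices before each forward pass.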

#### RNN: Recurrent Neural Nets

**Challenge**: model $p(x_t | x_{1:t-1};\alpha^t)$. "History" $x_{1:t-1}$ keeps getting longer.

**Idea**: keep a summary and recursively update it
<div style="text-align:center;">
  <img height="100%" width="50%" src="sources/M2_1_3.png" />
</div>

- Summary update rule: $h_{t+1} = tanh(W_{hh}h_t + W_{xh}x_{t+1})$
- Prediction: $o_{t+1} = W_{hy} h_{t+1}$
- Summary initialization: $h_0 = b_0$

Hidden Layer $h_t$ is a summary of the inputs seen till time t

Output layer $o_{t-1}$ specifies parameters for conditional $p(x_t | x_{1:t-1})$

Parameterized by $b_0$ (initialization), and matrices $W_{hh},W_{xh}, W_{hy}$. Constant number of parameters with regard to n!
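The recurrence can be sketched as a short loop. This is a minimal illustration with random (untrained) weights and one-hot inputs; the function name and dimensions are assumptions for the example.

```python
import numpy as np


def rnn_forward(xs, W_hh, W_xh, W_hy, b0):
    """Run the recurrence over inputs xs of shape (T, input_dim).

    Returns an array of shape (T, output_dim) whose row t is the output
    computed after reading inputs 0..t (0-based).
    """
    h = b0                                    # summary initialization
    outputs = []
    for x in xs:
        h = np.tanh(W_hh @ h + W_xh @ x)      # summary update
        outputs.append(W_hy @ h)              # prediction from the new summary
    return np.stack(outputs)


rng = np.random.default_rng(0)
hidden, vocab, T = 8, 4, 5
W_hh = rng.normal(size=(hidden, hidden))
W_xh = rng.normal(size=(hidden, vocab))
W_hy = rng.normal(size=(vocab, hidden))
b0 = np.zeros(hidden)
xs = np.eye(vocab)[rng.integers(0, vocab, size=T)]   # one-hot encoded sequence
print(rnn_forward(xs, W_hh, W_xh, W_hy, b0).shape)
```

The same four parameter arrays are reused at every step, so the parameter count does not grow with the sequence length.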

**Example: Character RNN (from Andrej Karpathy)**

- Use one-hot encoding for $x_i \in \{h,e,l,o\}$
- **Autoregressive**: $p(x = hello) = p(x_1 = h)p(x_2 = e|x_1 = h)p(x_3 = l | x_1 = h,x_2 = e) \cdots p(x_5 = o | x_1 = h,x_2 = e,x_3 = l,x_4 = l)$

$$p(x_2 = e | x_1 = h) = softmax(o_1) = \frac{exp(2.2)}{exp(1.0) + \cdots + exp(4.1)}$$
$$o_1 = W_{hy}h_1$$
$$h_1 = tanh(W_{hh}h_0 + W_{xh}x_1)$$

**Pros**
- Can be applied to sequences of arbitrary length
- Very general: For every computable function, there exists a finite RNN that can compute it

**Cons**
- Still requires an ordering
- Sequential likelihood evaluation (very slow for training)
- Sequential generation (unavoidable in an autoregressive model)

**Issue with RNN models**
- A single hidden vector needs to summarize all the (growing) history. For example, $h^4$ needs to summarize the meaning of "My friend opened the"
- Sequential Evaluation, cannot be parallelized
- Exploding/Vanishing gradients when accessing information from many steps back.

### Attention-based Models vs RNNs

#### Attention based models

<div style="text-align:center;">
  <img height="50%" width="50%" src="sources/M2_1_4.png" />
</div>

Attention mechanism to compare a *query* vector to a set of *key* vectors.
- Compare current hidden state (*query*) to all past hidden states (*keys*)
- Construct attention distribution to figure out what parts of the history are relevant, e.g. via a softmax
- Construct a summary of the history, e.g. by weighted sum
- Use summary and current hidden state to predict the next token/word

#### Generative Transformers

Current state of the art (GPTs): replace RNN with Transformer
- Attention mechanisms to adaptively focus only on relevant context
- Avoid recursive computation. Use only self-attention to enable parallelization
- Needs **masked** self-attention to preserve autoregressive structure
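A minimal sketch of masked self-attention for a single head. It assumes scaled dot-product scores, which the notes do not specify, and uses random projection matrices as placeholders for learned weights. Setting the scores of future positions to $-\infty$ before the softmax gives them zero attention weight, so position $t$ only uses positions $\leq t$.

```python
import numpy as np


def causal_self_attention(X, W_q, W_k, W_v):
    """Single-head masked self-attention over a sequence X of shape (T, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    T, d_k = Q.shape[0], K.shape[1]
    scores = Q @ K.T / np.sqrt(d_k)                       # compare each query to every key
    future = np.triu(np.ones((T, T), dtype=bool), k=1)    # True where key comes after query
    scores = np.where(future, -np.inf, scores)            # disallow looking ahead
    scores = scores - scores.max(axis=1, keepdims=True)   # stabilize the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ V, weights                           # weighted sum of values


rng = np.random.default_rng(0)
T, d = 5, 4
X = rng.normal(size=(T, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out, weights = causal_self_attention(X, W_q, W_k, W_v)
print(np.round(weights, 2))   # lower-triangular attention pattern
```

All positions are computed with matrix products rather than a loop over time, which is what makes training parallelizable.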


#### Pixel RNN (Oord et al 2016)

Model images pixel by pixel using raster scan order.

Each pixel conditional $p(x_t | x_{1:t-1})$ needs to specify 3 colors:
$$p(x_t | x_{1:t-1}) = p(x_t^{red} | x_{1:t-1} )p(x_t^{green} | x_{1:t-1};x_t^{red} )p(x_t^{blue} | x_{1:t-1} ;x_t^{red};x_t^{green})$$

and each conditional is a categorical random variable with 256 possible values.

Conditionals are modeled using RNN variants: LSTMs + masking (like MADE).



#### Convolutional Architectures - Pixel CNN (Oord et al 2016)

**Idea**: Use convolutional architecture to predict next pixel given context (a neighborhood of pixels).

**Challenge**: Has to be autoregressive. Masked convolutions preserve raster scan order. Additional masking for colors order.

#### Application in Adversarial Attacks and Anomaly detection

Machine Learning methods are vulnerable to adversarial examples.

dog + noise = ostrich

**PixelDefend (Song et al 2018)**
- Train a generative model $p(x)$ on clean inputs (PixelCNN)
- Given a new input $\bar{x}$, evaluate $p(\bar{x})$
- Adversarial examples are significantly less likely under $p(x)$

#### Summary of Autoregressive Models

Easy to sample from:
1. Sample $\bar{x}_0 \sim p(x_0)$
2. Sample   $\bar{x}_1 \sim p(x_1 | x_0 =\bar{x}_0)$

Easy to compute probability $p(x=\bar{x})$
1. Compute $p(x_0 = \bar{x}_0)$
2. Compute $p(x_1 = \bar{x}_1 | x_0 = \bar{x}_0)$
3. Multiply together (sum their logarithms)
4. ...
5. Ideally, can compute all these terms in parallel for fast training

Easy to extend to continuous variables. For example, we can choose Gaussian conditionals $p(x_t | x_{<t}) = N(\mu_{\theta}(x_{<t}),\Sigma_{\theta}(x_{<t}))$ or a mixture of logistics.

No natural way to get features, cluster points, do unsupervised learning.

### KL Divergence - Learning

**Goal of Learning**:  return a model $P_{\theta}$ that precisely captures the distribution $P_{data}$ from which our data was sampled.

What is **"best"**?
- Density Estimation: we are interested in the full distribution (so later we can compute whatever conditional probabilities we want)
- Specific prediction tasks: we are using the distribution to make a prediction
    - **Structured prediction**: Predict next frame in a video, or caption given an image
-  Structure or Knowledge Discovery: we are interested in the model itself
    -   How do some genes interact with each other?
    -   What causes cancer?

#### Learning as **Density Estimation**

We want to construct $P_{\theta}$ as "close" as possible to $P_{data}$ (recall we assume we are given a dataset $D$ of samples from $P_{data}$)

How do we evaluate "closeness"?
- [KL-Divergence](#KL-Divergence)
$$D(P_{data} || P_{\theta}) = E_{x\sim P_{data}}[log \frac{P_{data}(x)}{P_{\theta}(x)}]$$

$D(P_{data} || P_{\theta}) = 0$ iff the two distributions are the same.

#### KL-Divergence

We use KL-divergence to measure how far apart two distributions are.

**Kullback-Leibler Divergence** (KL-divergence) between two distributions $p$ and $q$ is defined as
$$D(p ||q) = \sum_x p(x)log \frac{p(x)}{q(x)}$$

$D(p || q ) \geq 0$ for all $p,q$, with equality if and only if $p=q$.

Proof (by Jensen's inequality, since $-log$ is convex):
$$E_{x\sim p}\left[-log \frac{q(x)}{p(x)}\right] \geq -log\left(E_{x\sim p}\left[\frac{q(x)}{p(x)}\right]\right) = - log \left(\sum_x p(x)\frac{q(x)}{p(x)}\right) = -log 1 = 0$$

Note that KL-divergence is **asymmetric**, i.e. $D(p || q ) \neq D(q||p)$

Measures the expected number of extra bits required to describe samples from $p(x)$ using a compression code based on $q$ instead of $p$
- If your data comes from $p$, but you use a scheme optimized for $q$, the divergence $D_{KL}(p||q)$ is the number of extra bits you'll need on average.
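A minimal sketch of the definition for discrete distributions. It uses base-2 logarithms so the result is in bits (an assumption made to match the compression interpretation), and it assumes $q(x) > 0$ wherever $p(x) > 0$.

```python
import numpy as np


def kl_divergence(p, q):
    """D(p || q) in bits for discrete distributions given as probability vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    support = p > 0                       # terms with p(x) = 0 contribute nothing
    return np.sum(p[support] * np.log2(p[support] / q[support]))


p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q), kl_divergence(q, p))   # the two directions differ
print(kl_divergence(p, p))                        # zero when the distributions match
```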

#### Expected Log-likelihood

$$\begin{align*}
D(P_{data} || P_{\theta}) & = E_{x\sim P_{data}}[log \frac{P_{data}(x)}{P_{\theta}(x)}]\\
& = E_{x\sim P_{data}}[log P_{data}(x)]-E_{x\sim P_{data}}[log P_{\theta}(x)]
\end{align*}$$

The first term does not depend on $P_{\theta}$

Then, *minimizing* KL divergence is equivalent to *maximizing* the **expected log-likelihood**
$$argmin_{P_{\theta}}D(P_{data} || P_{\theta}) = argmin_{P_{\theta}} -E_{x\sim P_{data}}[log P_{\theta}(x)] = argmax_{P_{\theta}} E_{x\sim P_{data}}[log P_{\theta}(x)]$$
- $P_{\theta}$ should assign high probability to instances sampled from $P_{data}$ so as to reflect the true distribution
- Because of the log, samples x where $P_{\theta}(x) \approx 0$ weigh heavily in the objective
- **Problem**: Although we can now compare models, since we are ignoring $H(P_{data}) = - E_{x\sim P_{data}}[log P_{data}(x)]$, we don't know how close we are to the optimum



#### Maximum Likelihood

Approximate the expected log-likelihood $$E_{x\sim P_{data}}[log P_{\theta}(x)]$$
with the empirical log-likelihood:
$$E_{D}[log P_{\theta}(x)] = \frac{1}{|D|}\sum_{x\in D}log P_{\theta}(x)$$

Maximum likelihood learning is then:
$$max_{P_{\theta}} \frac{1}{|D|}\sum_{x \in D} log P_{\theta}(x)$$

Equivalently, maximize likelihood of the data
$$P_{\theta}(x^1,\cdots,x^m) = \prod_{x \in D} P_{\theta}(x)$$

### Monte Carlo Estimation

#### Main idea in Monte Carlo Estimation

Express the quantity of interest as the expected value of a random variable
$$E_{x\sim P}[g(x)] = \sum_x g(x)P(x)$$

Generate T samples $x^1,\cdots x^T$ from the distribution P with respect to which the expectation was taken.

Estimate the expected value from the samples using:
$$\hat{g}(x^1,\cdots x^T) = \frac{1}{T}\sum_{t=1}^T g(x^t)$$
where $x^1,\cdots x^T$ are independent samples from P.

#### Properties of Monte Carlo Estimate

- **Unbiased**
  $$E_{P}[\hat{g}] = E_P[g(x)]$$
- **Convergence**: By law of large numbers
  $$\hat{g} =  \frac{1}{T}\sum_{t=1}^T g(x^t) \rightarrow E_P[g(x)] \text{ for } T \rightarrow \infty$$ 
- **Variance**
  $$V_P[\hat{g} ] = V_P [\frac{1}{T}\sum_{t=1}^T g(x^t)] = \frac{V_P [g(x)]}{T}$$

Thus, the variance of the estimator can be reduced by increasing the number of samples.
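A minimal sketch of the estimator, using $E_{x \sim N(0,1)}[x^2] = 1$ as a test case. The helper names are illustrative; the choice of $g$ and $P$ is only an example.

```python
import numpy as np


def monte_carlo_estimate(g, sample, T, rng):
    """Estimate E_{x~P}[g(x)] by averaging g over T independent samples from P."""
    xs = sample(rng, T)
    return np.mean(g(xs))


def sample_standard_normal(rng, T):
    return rng.standard_normal(T)


def square(x):
    return x ** 2


rng = np.random.default_rng(0)
for T in (10, 1000, 100000):
    # Estimates fluctuate less around the true value 1 as T grows.
    print(T, monte_carlo_estimate(square, sample_standard_normal, T, rng))
```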

#### Expanding the MLE principle to autoregressive models

Given an autoregressive model with $n$ variables and factorization
$$P_{\theta}(x) = \prod_{i=1}^n p_{neural}(x_i | x_{<i}; \theta_i)$$

$\theta = (\theta_1,\cdots,\theta_n)$ are parameters of all the conditionals.

Training data $D=\{x^1,\cdots,x^m\}$. Maximum likelihood estimate of the parameters $\theta$:
- Decomposition of likelihood function
$$L(\theta,D) = \prod_{j=1}^m P_{\theta}(x^j) = \prod_{j=1}^m \prod_{i=1}^n p_{neural} (x_i^j | x^j_{<i};\theta_i)$$

- Goal: find $argmax_{\theta} L(\theta,D) = argmax_{\theta} log L(\theta,D)$
- We no longer have a closed form solution

#### MLE Learning: Gradient Descent

$$L(\theta,D) = \prod_{j=1}^m P_{\theta}(x^j) = \prod_{j=1}^m \prod_{i=1}^n p_{neural} (x_i^j | x^j_{<i};\theta_i)$$

Goal: find $argmax_{\theta} L(\theta,D) = argmax_{\theta} log L(\theta,D)$

$l(\theta) = log L(\theta,D) = \sum_{j=1}^m \sum_{i=1}^n log p_{neural} (x_i^j | x^j_{<i};\theta_i)$
- Initialize $\theta^0 = (\theta_1,\cdots,\theta_n)$ at random
- Compute $\nabla_\theta l(\theta)$ (by back propagation)
- $\theta^{t+1} = \theta^t + \alpha_t \nabla_\theta l(\theta)$ (gradient ascent on $l$, equivalently gradient descent on $-l$)

Non-convex optimization problem, but often works well in practice.


What is the gradient with respect to $\theta_i$?

$\nabla_{\theta_i} l(\theta) =\sum_{j=1}^m \nabla_{\theta_i} \sum_{k=1}^n log p_{neural} (x_k^j | x^j_{<k};\theta_k) =\sum_{j=1}^m \nabla_{\theta_i} log p_{neural} (x_i^j | x^j_{<i};\theta_i)$

Each conditional $p_{neural}(x_i | x_{<i};\theta_i)$ can be optimized separately if there is no parameter sharing.

In practice, parameters are often shared across conditionals (e.g. NADE), and the gradient sums over all of them:

$\nabla_{\theta} l(\theta)= \sum_{j=1}^m \sum_{i=1}^n \nabla_{\theta}  log p_{neural} (x_i^j | x^j_{<i};\theta_i) $

What if $m=|D|$ is huge?

$$\begin{align}
\nabla_{\theta} l(\theta) &= m \sum_{j=1}^m \frac{1}{m} \sum_{i=1}^n \nabla_{\theta}  log p_{neural} (x_i^j | x^j_{<i};\theta_i) \\
& = m E_{x^j \sim D} [\sum_{i=1}^n \nabla_{\theta}  log p_{neural} (x_i^j | x^j_{<i};\theta_i)]
\end{align}$$

Here $x^j \sim D$ denotes sampling uniformly from the dataset.

Monte Carlo: Sample $x^j \sim D$; $\nabla_{\theta} l(\theta) \approx m [\sum_{i=1}^n \nabla_{\theta}  log p_{neural} (x_i^j | x^j_{<i};\theta_i)]$
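A minimal sketch of stochastic gradient ascent on the log-likelihood. To keep the gradient one line long, it swaps in a much simpler model than a neural autoregressive one: independent Bernoulli variables with $p(x_i = 1) = \sigma(\theta_i)$, for which $\nabla_{\theta_i} log\, p_{\theta}(x) = x_i - \sigma(\theta_i)$. It also averages the gradient over a mini-batch instead of scaling by $m$, which only changes the effective step size. All names and hyperparameters are illustrative.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sgd_mle(data, batch_size=32, steps=2000, lr=0.1, seed=0):
    """Fit logits theta of an independent Bernoulli model by mini-batch gradient ascent."""
    rng = np.random.default_rng(seed)
    m, n = data.shape
    theta = np.zeros(n)
    for _ in range(steps):
        batch = data[rng.integers(0, m, size=batch_size)]   # sample x^j uniformly from D
        grad = np.mean(batch - sigmoid(theta), axis=0)      # Monte Carlo gradient estimate
        theta += lr * grad                                  # ascent step on the log-likelihood
    return theta


rng = np.random.default_rng(0)
data = (rng.random((2000, 3)) < np.array([0.2, 0.5, 0.9])).astype(float)
theta = sgd_mle(data)
print(sigmoid(theta), data.mean(axis=0))   # learned probabilities vs empirical frequencies
```

For this toy model the maximum likelihood solution is known in closed form (the empirical frequencies), which makes it easy to see that the stochastic updates approach it.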

#### Empirical Risk and Overfitting

Empirical risk minimization can easily **overfit** the data

**Generalization**: the data is only a sample; usually there is a vast number of samples that you have never seen. Your model should generalize well to these "never-seen" samples.

Thus we typically restrict the **hypothesis space** of distributions that we search over.

**Bias-Variance Trade Off**
- If the hypothesis space is very limited, it might not be able to represent $P_{data}$ even with unlimited data
    - This type of limitation is called **bias**, as the learning is limited on how close it can approximate the target distribution 
- If we select a highly expressive hypothesis class, we might represent the data better
    - When we have a small amount of data, multiple models can fit well, or even better than the true model. Moreover, small perturbations on $D$ will result in very different estimates. This limitation is called **variance**.

#### How to avoid overfitting?

- Hard constraints:
    - smaller neural networks with fewer parameters
    - weight sharing
- Soft preference for "simpler" models
- Augment the objective functions with regularizations
  $$objective(x,M) = loss(x,M)+R(M)$$