# Deep Generative Models
These are study notes for Stanford CS236, Deep Generative Models.
Additional Resources:
[Course Github Notes](https://deepgenerativemodels.github.io/notes/)

## Module 1: Introduction to Generative Models
- Suggested Readings:
    - [Deep Generative Models](https://ermongroup.github.io/generative-models/)
    - [Generative Modeling by Estimating Gradients of the Data Distribution](https://yang-song.net/blog/2021/score/)
    - [Tutorial on Deep Generative Models](https://www.youtube.com/watch?v=JrO5fSskISY)
    - [Learning Deep Generative Models](https://www.cs.cmu.edu/~rsalakhu/papers/annrev.pdf)

### What is Generative Modeling?

**Generative models** involve two parts:
- **Generation (graphics)**: from high-level descriptions to raw sensory outputs
- **Inference (vision as inverse graphics)**: from raw sensory outputs to high-level descriptions

**Statistical** Generative Models are **learned from data**.
This course relies more on data and less on prior knowledge, whereas computer graphics relies more on prior knowledge.

#### Statistical Generative Models


A **statistical generative** model is a probability distribution $p(x)$:
- **Data**: Samples
- **Prior Knowledge**: parametric form, loss function, optimization algorithm

Image $x$ $\rightarrow$ A probability distribution $p(x)$ $\rightarrow$ scalar probability $p(x)$.

It is generative because sampling from $p(x)$ generates new images.

It can be used to build a simulator for the data-generating process.

Control Signals/Potential datapoints $\rightarrow$ Data Simulator = Statistical Model = Generative Model $\rightarrow$ New datapoints/Probability values


### Audio and Image Applications of Generative Models

- Data Generation in the real world
    - Text to Image: [Language-guided artwork creation](https://chainbreakers.kath.io/)
    - Drawn Image to Realistic Image [Meng, He, Song et al ICLR 2022](https://arxiv.org/abs/2108.01073)
- Solving inverse problems with generative models
    - Medical image reconstruction [Song et al ICLR 2022](https://arxiv.org/abs/2111.08005)
- Outlier Detection with generative models
    - Outlier Detection [Song et al ICLR 2018](https://arxiv.org/abs/1710.10766)
- Progress in Generative Models of Images
    - GANs [Ian Goodfellow 2019](https://arxiv.org/abs/1406.2661)
    - Diffusion Models [Song et al 2021](https://arxiv.org/abs/2101.09258)
        - Text2Image Diffusion Models
- Progress in Inverse Problems
    -  Low Resolution $\rightarrow$ High resolution [Menon et al, 2020](https://arxiv.org/abs/2003.03808)
    -  Mask $\rightarrow$ Full Image [Liu et al 2018](https://arxiv.org/abs/1804.07723)
    -  Greyscale Images $\rightarrow$  Color Image
    -  Sketch $\rightarrow$ Fine Image
    -  Original Images $\rightarrow$ Edited Images
-  Audio
    - WaveNet [van den Oord et al 2016c](https://arxiv.org/abs/1609.03499)
    - Diffusion Text2Speech [Betker, Better Speech Synthesis through scaling 2023](https://arxiv.org/abs/2305.07243)
    - Conditional Generative Model: Low-Resolution Audio Signal $\rightarrow$ High-Resolution Audio Signal

### Language, Video, and Robotic Applications of Generative Models

- Language Generation [Radford et al 2019](https://insightcivic.s3.us-east-1.amazonaws.com/language-models.pdf)
    - Conditional Generative Model *P(next word | previous words)*
    - ChatGPT
- Machine Translation
    - Conditional Generative Model *P(English Text | Chinese Text)*
- Code Generation
- Video Generation
- Imitation Learning
    - Conditional Generative Model *P(actions | past observations)* [Li et al 2017](https://arxiv.org/abs/1701.01036) | [Janner et al 2022](https://arxiv.org/abs/2205.09991)
- Molecule Generation
- DeepFake

### Roadmap and Challenges in Generative Modeling

**Representation**: how do we model the joint distribution of many random variables?

**Learning**: what is the right way to compare probability distributions?

**Inference**: how do we invert the generation process?

### Generative Model Curse of Dimensionality and Bayesian Networks

#### Overview

- What is a generative model
- Representing probability distributions
    - Curse of dimensionality
    - Crash course on graphical models (Bayesian networks)
    - Generative vs discriminative models
    - Neural models

#### Learning a generative model

We want to learn a probability distribution $p(x)$ over images x such that

**Generation**: If we sample $x_{new} \sim p(x)$, $x_{new}$ should look like a dog (sampling).

**Density Estimation**: $p(x)$ should be high if $x$ looks like a dog and low otherwise (anomaly detection).

**Unsupervised Representation Learning**: We should be able to learn what these images have in common, e.g. ears, tail, etc. (features).

#### How to Represent $p(x)$                   

**Basic Discrete Distribution**
- Bernoulli Distribution: (biased) coin flip
    - $D = \{Heads, Tails\}$
    - $P(X=Heads) = p, P(X=Tails) = 1-p$
    - $X \sim Ber(p)$
- Categorical Distribution: (biased) m-sided die
    - $D=\{1,\dots,m\}$
    - $P(Y=i) = p_i, \sum_{i=1}^m p_i=1$
    - $Y \sim Cat(p_1,...,p_m)$
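
As a rough illustration of the two distributions above, the sketch below draws samples using only Python's standard library. The function names and the probability values are hypothetical choices made for this example, not part of the course material.

```python
import random

def sample_bernoulli(p, rng):
    """Return 'Heads' with probability p, otherwise 'Tails'."""
    return "Heads" if rng.random() < p else "Tails"

def sample_categorical(probs, rng):
    """Return a face in {1, ..., m}; face i has probability probs[i - 1]."""
    faces = list(range(1, len(probs) + 1))
    return rng.choices(faces, weights=probs, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is reproducible
coin_flips = [sample_bernoulli(0.7, rng) for _ in range(10_000)]
die_rolls = [sample_categorical([0.5, 0.3, 0.2], rng) for _ in range(10_000)]

# Empirical frequencies should be close to the specified probabilities.
heads_frequency = coin_flips.count("Heads") / len(coin_flips)
face_one_frequency = die_rolls.count(1) / len(die_rolls)
print(heads_frequency, face_one_frequency)
```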

Examples of joint distributions:
- Modeling one pixel's color with Red, Green, and Blue channels: $Val(R) = Val(G) = Val(B) = \{0,\dots,255\}$, so the joint $p(r,g,b)$ needs $256 \cdot 256 \cdot 256 - 1$ parameters.
- Modeling a black-and-white image of a digit with $n$ pixels: each pixel is Bernoulli with $Val(X_i)=\{0,1\}$, so there are $2^n$ possible images and the joint needs $2^n - 1$ parameters.

A distribution over $k$ outcomes needs only $k-1$ parameters because the $k$ probabilities must sum to 1; the last one is determined by the other $k-1$.
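
The parameter counts above can be checked with a few lines of arithmetic. This is a minimal sketch; `joint_table_params` is a hypothetical helper name, and $n = 10$ pixels is an arbitrary small choice.

```python
def joint_table_params(num_values_per_variable):
    """Free parameters in a full joint table over discrete variables."""
    states = 1
    for k in num_values_per_variable:
        states *= k
    # The probabilities must sum to 1, so the last entry is determined.
    return states - 1

rgb_params = joint_table_params([256, 256, 256])
binary_image_params = joint_table_params([2] * 10)  # n = 10 binary pixels

print(rgb_params)           # 16777215
print(binary_image_params)  # 1023
```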



**Assumption**: Independence
$$p(x_1,...,x_n) = p(x_1)p(x_2) \cdots p(x_n) $$

- There are still $2^n$ possible states.
- Each marginal distribution $p(x_i)$ needs only 1 parameter, so $p(x_1,\dots,x_n)$ needs only $n$ parameters.
- The $2^n$ entries can be described by just $n$ numbers (if $|Val(X_i)| = 2$).

Problem: the independence assumption is too strong, so the model may not be useful.
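
To make the independence factorization concrete, the sketch below builds the full table of $2^n$ joint probabilities from only $n$ numbers. The marginal values are made up for illustration.

```python
import itertools

def independent_joint(marginals):
    """marginals[i] = P(X_i = 1). Returns a dict mapping each state to its probability."""
    joint = {}
    for state in itertools.product([0, 1], repeat=len(marginals)):
        prob = 1.0
        for x_i, p_i in zip(state, marginals):
            prob *= p_i if x_i == 1 else 1.0 - p_i
        joint[state] = prob
    return joint

marginals = [0.2, 0.5, 0.9]  # hypothetical values: n = 3 parameters
joint = independent_joint(marginals)
total = sum(joint.values())

print(len(joint))  # 8 states described by 3 numbers
```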

**Two Important Rules**
- Chain Rule
    - $p(S_1 \cap S_2 \cap \cdots \cap S_n) = p(S_1)p(S_2 | S_1) \cdots p(S_n | S_1 \cap \cdots \cap S_{n-1})$
- Bayes' Rule
    - $p(S_1 | S_2) = \frac{p(S_1 \cap S_2)}{p(S_2)} = \frac{p(S_2 | S_1)p(S_1)}{p(S_2)}$ 

**Structure Through Conditional Independence**
$$p(x_1,...,x_n) = p(x_1)p(x_2 |x_1)p(x_3 | x_1,x_2) \cdots p(x_n | x_1,...,x_{n-1})$$

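- $p(x_1)$ requires 1 parameter.
- $p(x_2 | x_1)$ requires 2 parameters: 1 for $p(x_2 | x_1 = 0)$ and 1 for $p(x_2 | x_1 = 1)$.
- In general, $p(x_i | x_1,\dots,x_{i-1})$ requires $2^{i-1}$ parameters, so in total we need $1+2+ \cdots +2^{n-1}= 2^n-1$ parameters.
- This is still exponential: the chain rule alone does not reduce the number of parameters.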

  

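Now assume conditional independence, $X_{i+1} \perp X_1,\dots,X_{i-1} \mid X_i$: each variable depends only on its immediate predecessor, as when predicting the next word from the previous word alone.
$$\begin{align}
p(x_1,...,x_n) & = p(x_1)p(x_2 |x_1)p(x_3 | x_1,x_2) \cdots p(x_n | x_1,\cdots,x_{n-1}) \\
& = p(x_1) p(x_2 |x_1)p(x_3 | x_2)\cdots p(x_n | x_{n-1})
\end{align}$$

This requires only $2n-1$ parameters: 1 for $p(x_1)$ and 2 for each of the remaining $n-1$ conditionals.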
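
The sketch below evaluates the first-order factorization above for binary variables and compares its parameter count with the full chain rule. The transition values and helper names are hypothetical.

```python
import itertools

def bernoulli(p_one, value):
    """Probability of `value` under a Bernoulli with P(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one

def markov_joint_prob(x, p_first, transitions):
    """p(x_1) * prod_i p(x_i | x_{i-1}) for a binary sequence x.

    transitions[i - 1][prev] = P(X_{i+1} = 1 | X_i = prev).
    """
    prob = bernoulli(p_first, x[0])
    for i in range(1, len(x)):
        prob *= bernoulli(transitions[i - 1][x[i - 1]], x[i])
    return prob

n = 4
p_first = 0.3
transitions = [(0.1, 0.8), (0.4, 0.6), (0.5, 0.9)]  # made-up values

markov_params = 1 + 2 * len(transitions)                # 2n - 1
chain_rule_params = sum(2 ** i for i in range(n))       # 1 + 2 + ... + 2^(n-1)
total = sum(
    markov_joint_prob(x, p_first, transitions)
    for x in itertools.product([0, 1], repeat=n)
)

print(markov_params, chain_rule_params)  # 7 15
```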

**Bayesian Network**: General Idea

- Use conditional parameterization (instead of joint parameterization)
- For each random variable $X_i$, specify $p(x_i |  \mathbf{x_{A_i}})$ for a set $\mathbf{X_{A_i}}$ of random variables.

$$p(x_1,\cdots,x_n) = \prod_i p(x_i |\mathbf{x_{A_i}})$$

- We need to guarantee it is a legal probability distribution.
  

**Bayesian Network**: Formal Definition

A **Bayesian Network** is specified by a *directed* **acyclic** graph (DAG), $G=(V,E)$, with
- One node $i \in V$ for each random variable $X_i$
- One conditional probability distribution (CPD) per node, $p(x_i |\mathbf{x_{Pa_i}})$, specifying the variable's probability conditioned on its parents' values.

Graph $G=(V,E)$ is called the structure of the Bayesian Network

Defines a joint distribution:
$$p(x_1, \cdots, x_n) = \prod_i p(x_i |\mathbf{x_{Pa_i}})$$

**Claim**: $p(x_1,...,x_n)$ is a valid probability distribution because of the ordering implied by the DAG.

**Economical Representation**: The number of parameters is exponential in $|Pa(i)|$, not in $|V|$.
$$p(x_1,\cdots,x_n) = \prod_i p(x_i |\mathbf{x_{A_i}})$$
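
As an illustration of the factorization, the sketch below defines a tiny Bayesian network over three binary variables and computes the joint as a product of CPDs. The graph ($A \rightarrow C \leftarrow B$) and all table entries are hypothetical, chosen only for this example.

```python
import itertools

# Hypothetical DAG: A -> C <- B, all variables binary.
parents = {"A": [], "B": [], "C": ["A", "B"]}

# cpd[node][parent_values] = P(node = 1 | parents = parent_values)
cpd = {
    "A": {(): 0.3},
    "B": {(): 0.6},
    "C": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.7, (1, 1): 0.9},
}

def joint_prob(assignment):
    """Product of one CPD entry per node, following the DAG."""
    prob = 1.0
    for node, pa in parents.items():
        pa_values = tuple(assignment[p] for p in pa)
        p_one = cpd[node][pa_values]
        prob *= p_one if assignment[node] == 1 else 1.0 - p_one
    return prob

nodes = list(parents)
total = sum(
    joint_prob(dict(zip(nodes, values)))
    for values in itertools.product([0, 1], repeat=len(nodes))
)
num_params = sum(len(table) for table in cpd.values())

print(num_params)  # 6 parameters instead of 2^3 - 1 = 7 for the full table
```

The saving is small with three variables, but it grows quickly when each node has few parents relative to the total number of nodes.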

### Generative vs. Discriminative Models

**Naive Bayes for Single Label Prediction**

Task: classify e-mails as spam ($Y=1$) or not spam ($Y=0$).

- Words are conditionally independent given $Y$
    - Let $1:n$ index the words in our vocabulary
    - $X_i = 1$ if word $i$ appears in an e-mail, and 0 otherwise
    - E-mails are drawn according to some distribution $p(Y,X_1,\cdots,X_n)$

Then,
$$p(y,x_1,\cdots,x_n) = p(y) \prod_{i=1}^n p(x_i |y)$$

**Estimate** parameters from training data. **Predict** with Bayes rule:
$$p(Y=1 | x_1,\cdots,x_n) = \frac{p(Y=1)\prod_{i=1}^n p(x_i | Y = 1)}{\sum_{y \in \{0,1\}} p(Y=y) \prod_{i=1}^n p(x_i | Y=y)}$$
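
The prediction rule above can be sketched as follows. The parameter values are invented for illustration; in practice they would be estimated from training data.

```python
def naive_bayes_posterior(x, prior_spam, p_word_given_spam, p_word_given_ham):
    """Return p(Y = 1 | x) under the naive Bayes factorization.

    x[i] is 1 if word i appears in the e-mail, else 0.
    p_word_given_spam[i] = p(X_i = 1 | Y = 1), and likewise for ham (Y = 0).
    """
    def likelihood(p_words):
        out = 1.0
        for x_i, p_i in zip(x, p_words):
            out *= p_i if x_i == 1 else 1.0 - p_i
        return out

    joint_spam = prior_spam * likelihood(p_word_given_spam)
    joint_ham = (1.0 - prior_spam) * likelihood(p_word_given_ham)
    return joint_spam / (joint_spam + joint_ham)

# Hypothetical two-word vocabulary.
posterior = naive_bayes_posterior(
    x=[1, 0],
    prior_spam=0.5,
    p_word_given_spam=[0.8, 0.5],
    p_word_given_ham=[0.2, 0.5],
)
print(posterior)
```

For long e-mails the product of many small probabilities can underflow, so a practical version would likely work with sums of log-probabilities instead.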

By the chain rule, $p(Y,\mathbf{X}) = p(\mathbf{X} | Y) p(Y) = p(Y|\mathbf{X}) p(\mathbf{X})$.

Corresponding Bayesian Networks:<br>
*Generative* $Y \rightarrow X$ <br>
*Discriminative* $X \rightarrow Y$ <br>

Suppose all we need for prediction is $p(Y|\mathbf{X})$

In the generative model, we need to specify both $p(Y)$ and $p(\mathbf{X}|Y)$, then compute $p(Y | \mathbf{X})$ via Bayes' rule.

In the discriminative model, it suffices to estimate just the **conditional distribution** $p(Y|\mathbf{X})$.
- We never need to model/learn/use $p(\mathbf{X})$!
- It is called a **discriminative** model because it is only useful for discriminating $Y$'s label when given $\mathbf{X}$.

$$p(Y,\mathbf{X}) = p(Y)p(X_1| Y )p(X_2 | Y,X_1) \cdots p(X_n | Y,X_1,\cdots,X_{n-1})$$
$$p(Y,\mathbf{X}) = p(X_1)p(X_2| X_1 )p(X_3 | X_1,X_2) \cdots p(Y | X_1,\cdots,X_{n-1},X_n)$$
- In the generative model, $p(Y)$ is simple, but how do we parameterize $p(X_i | \mathbf{X}_{pa(i)}, Y)$?
- In the discriminative model, how do we parameterize $p(Y | \mathbf{X})$? Here we assume we don't care about modeling $p(\mathbf{X})$ because  $\mathbf{X}$ is always given to us in a classification problem

**Naive Bayes**
- For the generative model, assume that $X_i \perp \mathbf{X}_{-i} \mid Y$ (the naive Bayes assumption).

<div style="text-align:center;">
  <img height="100%" width="50%" src="sources/M1_1_6.png" />
</div>

**Logistic Regression**

For the discriminative model, assume
$$p(Y=1|\mathbf{x};\alpha) = f(\mathbf{x},\alpha)$$

Here $f$ is a parameterized function of $\mathbf{x}$ (regression) whose output must lie between 0 and 1.

Linear Dependence: 

Let $z(\alpha, x)= \alpha_0 + \sum_{i=1}^n \alpha_i x_i$.

Then, $p(Y=1 | x;\alpha) = \sigma (z (\alpha, x))$, where $\sigma(z) = \frac{1}{1+e^{-z}}$ is called the **logistic function**.


- The decision boundary $p(Y=1 | x;\alpha) = 0.5$, i.e. $z(\alpha, x) = 0$, is linear in $x$.
- Equal-probability contours are straight lines.
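
A minimal sketch of the logistic model above, with made-up weights. It also shows that the predicted probability exceeds 0.5 exactly when the linear score $z$ is positive, which is why the decision boundary is linear.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def linear_score(x, alpha0, alpha):
    """z(alpha, x) = alpha_0 + sum_i alpha_i * x_i."""
    return alpha0 + sum(a_i * x_i for a_i, x_i in zip(alpha, x))

def logistic_predict(x, alpha0, alpha):
    """p(Y = 1 | x; alpha)."""
    return sigmoid(linear_score(x, alpha0, alpha))

alpha0, alpha = -1.0, [1.0, 2.0]                          # hypothetical weights
on_boundary = logistic_predict([1.0, 0.0], alpha0, alpha)  # z = 0
positive_side = logistic_predict([1.0, 1.0], alpha0, alpha)  # z = 2
negative_side = logistic_predict([0.0, 0.0], alpha0, alpha)  # z = -1
print(on_boundary, positive_side, negative_side)
```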

Unlike naive Bayes, the logistic model does not assume $X_i \perp X_{-i} | Y$. For example, in spam classification, let $X_1 = 1$ if "bank" appears in the e-mail and $X_2 = 1$ if "account" appears in the e-mail. Assume that, regardless of whether the e-mail is spam, these words always appear together, i.e. $X_1 = X_2$.

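Learning in naive Bayes results in $p(X_1 | Y) = p(X_2 | Y)$, so naive Bayes double counts the evidence. Logistic regression can avoid this, for example by setting one of the two weights $\alpha_1, \alpha_2$ to zero.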
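
The double counting can be seen numerically. In the sketch below, the likelihood values are hypothetical; the second feature is an exact copy of the first, so it carries no new information, yet naive Bayes multiplies its likelihood in again.

```python
def posterior_spam(likelihood_spam, likelihood_ham, prior_spam=0.5):
    """p(Y = 1 | evidence) from the two class-conditional likelihoods."""
    joint_spam = prior_spam * likelihood_spam
    joint_ham = (1.0 - prior_spam) * likelihood_ham
    return joint_spam / (joint_spam + joint_ham)

# Hypothetical values for p("bank" appears | spam) and p("bank" appears | ham).
p_bank_spam, p_bank_ham = 0.8, 0.4

# Correct posterior: "account" always co-occurs with "bank", so it adds nothing.
correct = posterior_spam(p_bank_spam, p_bank_ham)

# Naive Bayes treats the two words as independent given Y and squares the evidence.
double_counted = posterior_spam(p_bank_spam ** 2, p_bank_ham ** 2)

print(correct, double_counted)  # roughly 0.667 versus 0.8
```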

Using a conditional model is only possible when $X$ is fully observed. When some $X_i$ variables are unobserved, the generative model allows us to compute $p(Y | X_{evidence})$ by marginalizing over the unseen variables.
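
As a sketch of that marginalization, assume the naive Bayes factorization from earlier and mark unobserved words with `None`. Summing an unobserved $X_i$ out of $p(x_i | y)$ contributes a factor of 1, so the missing feature simply drops out. All names and numbers here are illustrative.

```python
def bernoulli(p_one, value):
    """Probability of `value` under a Bernoulli with P(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one

def class_likelihood(x, p_words):
    """p(observed part of x | y), summing out entries of x that are None."""
    out = 1.0
    for x_i, p_i in zip(x, p_words):
        if x_i is None:
            # Marginalize: sum over both possible values of the unseen word.
            out *= bernoulli(p_i, 0) + bernoulli(p_i, 1)
        else:
            out *= bernoulli(p_i, x_i)
    return out

def posterior_y1(x, prior_y1, p_words_y1, p_words_y0):
    """p(Y = 1 | X_evidence) under the naive Bayes model."""
    joint_y1 = prior_y1 * class_likelihood(x, p_words_y1)
    joint_y0 = (1.0 - prior_y1) * class_likelihood(x, p_words_y0)
    return joint_y1 / (joint_y1 + joint_y0)

# Word 1 is observed, word 2 is unobserved.
posterior = posterior_y1([1, None], 0.5, [0.8, 0.9], [0.2, 0.1])
print(posterior)
```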

**Logistic regression** often outperforms **naive Bayes** when enough data is available, because it makes weaker assumptions. With little data, naive Bayes can be the better choice: its stronger assumptions mean fewer dependencies have to be learned from the data.

### Neural Models

**Non-linear dependence**: let $h(A,b,x) = f(Ax+b)$ be a non-linear transformation of the inputs (features).
$$p_{Neural} (Y=1 | x;\alpha,A,b) = \sigma(\alpha_0 + \sum_{i=1}^h \alpha_ih_i)$$
- More Flexible
- More Parameters: $A, b, \alpha$
- Can repeat multiple times to get a neural network

<div style="text-align:center;">
  <img height="100%" width="50%" src="sources/M1_1_7.png" />
</div>
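
The one-hidden-layer model $p_{Neural}$ can be sketched in plain Python as below. The choice of `tanh` for the non-linearity $f$ and all weight values are assumptions made for this example; the notes do not fix them.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_features(x, A, b, f=math.tanh):
    """h = f(Ax + b), computed row by row; f is applied elementwise."""
    return [
        f(sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + b_i)
        for row, b_i in zip(A, b)
    ]

def neural_predict(x, A, b, alpha0, alpha):
    """p(Y = 1 | x) = sigmoid(alpha_0 + sum_i alpha_i * h_i)."""
    h = hidden_features(x, A, b)
    return sigmoid(alpha0 + sum(a_i * h_i for a_i, h_i in zip(alpha, h)))

# Hypothetical weights: 2 inputs, 3 hidden units.
A = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
b = [0.0, -0.5, 0.1]
alpha0, alpha = 0.2, [1.0, -2.0, 0.5]

prob = neural_predict([1.0, 2.0], A, b, alpha0, alpha)
print(prob)
```

Stacking further `hidden_features` layers before the final sigmoid would give a deeper network.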

**Continuous Variables**
If $X$ is a continuous random variable, we can use its *probability density function* $p_X: \mathbb{R} \rightarrow \mathbb{R}^+$. Typically, we consider parameterized densities:
- Gaussian: $X \sim N(\mu,\sigma)$ if $p_X(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
- Uniform: $X \sim \mathcal{U}(a,b)$ if $p_X(x)=\frac{1}{b-a} 1[a\leq x\leq b]$

If $X$ is a continuous random vector, we can usually represent it using its **joint probability density function**.
- Gaussian: if $p_X(x) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right)$
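
For reference, the two univariate densities above can be written directly from their formulas. This is a small sketch using only the standard library; the function names are arbitrary.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x, with sigma the standard deviation."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coefficient * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def uniform_pdf(x, a, b):
    """Density of U(a, b) at x: 1 / (b - a) inside [a, b], zero outside."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

peak = gaussian_pdf(0.0, 0.0, 1.0)   # 1 / sqrt(2 * pi)
inside = uniform_pdf(0.5, 0.0, 2.0)
outside = uniform_pdf(3.0, 0.0, 2.0)
print(peak, inside, outside)
```

Note that a density can exceed 1 (for example `gaussian_pdf(0.0, 0.0, 0.1)`), since only its integral has to equal 1.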

Chain Rule, Bayes Rule still apply.
$$p_{X,Y,Z}(x,y,z) = p_X(x)p_{Y|X}(y|x) p_{Z|\{X,Y\}}(z|x,y)$$

We can still use Bayesian Networks with continuous (and discrete) variables.

For example: **Mixture of 2 Gaussians**: Bayes net $Z \rightarrow X$ with factorization $p_{Z,X}(z,x) = p_Z(z)p_{X|Z}(x|z)$ and
- $Z \sim Bernoulli(p)$
- $X|(Z=0) \sim N(\mu_0,\sigma_0)$, $X|(Z=1)\sim N(\mu_1,\sigma_1)$
- The parameters are $p,\mu_0,\sigma_0,\mu_1,\sigma_1$
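
Sampling from this Bayes net follows the arrow $Z \rightarrow X$: first draw $Z$, then draw $X$ from the Gaussian selected by $Z$. The sketch below does this and also evaluates the marginal density of $X$; the parameter values are hypothetical.

```python
import math
import random

def gaussian_pdf(x, mu, sigma):
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coefficient * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def sample_mixture(p, mu0, sigma0, mu1, sigma1, rng):
    """Ancestral sampling: z ~ Bernoulli(p), then x ~ N(mu_z, sigma_z)."""
    z = 1 if rng.random() < p else 0
    mu, sigma = (mu1, sigma1) if z == 1 else (mu0, sigma0)
    return z, rng.gauss(mu, sigma)

def mixture_density(x, p, mu0, sigma0, mu1, sigma1):
    """Marginal p_X(x) = sum_z p_Z(z) * p_{X|Z}(x | z)."""
    return (1.0 - p) * gaussian_pdf(x, mu0, sigma0) + p * gaussian_pdf(x, mu1, sigma1)

params = dict(p=0.3, mu0=-2.0, sigma0=0.5, mu1=3.0, sigma1=1.0)  # made-up values
rng = random.Random(0)
samples = [sample_mixture(rng=rng, **params) for _ in range(10_000)]
fraction_z1 = sum(z for z, _ in samples) / len(samples)
print(fraction_z1, mixture_density(0.0, **params))
```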
  
Bayes Net $Z \rightarrow X$ with factorization  $p_{Z,X}(z,x) = p_Z(z)p_{X|Z}(x|z)$ and
- $Z \sim \mathcal{U}(a,b)$
- $X|(Z=z) \sim N(z,\sigma)$
- The parameters are $a,b,\sigma$

Variational Autoencoder: Bayes net $Z \rightarrow X$ with factorization $p_{Z,X}(z,x) = p_Z(z)p_{X|Z}(x|z)$ and
- $Z\sim N(0,1)$
- $X|(Z=z) \sim N(\mu_{\theta}(z),e^{\sigma_{\phi}(z)})$, where $\mu_{\theta}: \mathbb{R} \rightarrow \mathbb{R}$ and $\sigma_{\phi}$ are neural networks with parameters (weights) $\theta$, $\phi$ respectively.
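
The generative process of this model can be sketched as below. This covers sampling only, not training. The two "networks" are stand-ins with a single tanh hidden unit and invented weights, and the exponentiated output is treated as a standard deviation, following the $N(\mu,\sigma)$ convention used in these notes.

```python
import math
import random

def tiny_network(z, weights):
    """Hypothetical scalar network: w2 * tanh(w1 * z + b1) + b2."""
    w1, b1, w2, b2 = weights
    return w2 * math.tanh(w1 * z + b1) + b2

def sample_x(theta, phi, rng):
    """Draw z ~ N(0, 1), then x ~ N(mu_theta(z), exp(sigma_phi(z)))."""
    z = rng.gauss(0.0, 1.0)
    mean = tiny_network(z, theta)
    std = math.exp(tiny_network(z, phi))  # exp keeps the scale positive
    x = rng.gauss(mean, std)
    return z, x

theta = (1.0, 0.0, 2.0, 0.5)    # made-up weights for mu_theta
phi = (0.5, 0.0, 0.3, -1.0)     # made-up weights for sigma_phi
rng = random.Random(0)
samples = [sample_x(theta, phi, rng) for _ in range(1_000)]
print(samples[0])
```

Because the mean and scale both depend on $z$ through non-linear functions, the marginal of $X$ can be far more flexible than a single Gaussian, even though each conditional is Gaussian.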

## Module 2 Autoregressive Models

### FVSBN