# Generative models

Given $\mathcal{D}=\{X_i\}_{i\in [n]}\sim p_{\text{data}}(x)$, we want to learn the underlying data distribution $p_{\text{data}}(x)$ from the training samples $X_1, X_2,..., X_n$. One way of doing this is to consider a family of parameterized distributions $p_{\theta}(x)$ and then find the parameter $\theta$ such that

$$p_{\theta}(x)\approx p_{\text{data}}(x)$$

One way of framing this mathematically is to find the $\theta$ that minimizes the KL divergence between the two distributions:

$$\text{KL}(p_{\text{data}}||p_{\theta}) = \mathbb{E}_{p_{\text{data}}}[-\log p_{\theta}(X)] - H(p_{\text{data}})$$

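Note that the entropy term $H(p_{\text{data}})$ does not depend on $\theta$, so we can drop it from the minimization. Replacing the expectation with a sample average over the training data, we have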

$$\theta^* = \underset{\theta}{\text{argmin}} \frac{1}{n}\sum_{i=1}^n -\log p_{\theta}(X_i)\tag{1}$$
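The decomposition of the KL divergence used above can be checked numerically. The following is a minimal sketch, assuming two small discrete distributions whose values are made up purely for illustration; the helper names `entropy`, `cross_entropy`, and `kl` are hypothetical and not part of any library.

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) in nats for a discrete distribution p."""
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """E_p[-log q(X)] for discrete distributions p and q."""
    return -np.sum(p * np.log(q))

def kl(p, q):
    """KL(p || q) computed directly from its definition."""
    return np.sum(p * np.log(p / q))

# Hypothetical three-outcome distributions, chosen only for illustration.
p_data = np.array([0.5, 0.3, 0.2])
p_theta = np.array([0.4, 0.4, 0.2])

lhs = kl(p_data, p_theta)
rhs = cross_entropy(p_data, p_theta) - entropy(p_data)

# The two numbers should agree up to floating-point error.
print(lhs, rhs)
```

Since `entropy(p_data)` is the same for every choice of `p_theta`, ranking candidate models by `cross_entropy` gives the same ordering as ranking them by `kl`.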

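Another interpretation of this optimization problem is maximum likelihood estimation (MLE): we want to maximize the likelihood of the observed samples. Assuming the samples are i.i.d., the joint likelihood factorizes into a product, and since the logarithm is monotonic and scaling by $1/n$ does not change the maximizer, this is the same as maximizing the average log-likelihood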

$$\theta^* = \underset{\theta}{\text{argmax}} p_{\theta}(X_1, X_2,..., X_n) = \underset{\theta}{\text{argmax}} \frac{1}{n}\sum_{i=1}^n \log p_{\theta}(X_i)\tag{2}$$

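Both formulations give the same optimization problem, since minimizing the average negative log-likelihood in (1) is the same as maximizing the average log-likelihood in (2). The remaining question is how to choose a suitable parameterized family $p_{\theta}(x)$.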
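As a concrete illustration of objective (1), here is a minimal sketch that assumes a one-dimensional Gaussian family $p_{\theta} = \mathcal{N}(\mu, \sigma^2)$ with $\theta = (\mu, \sigma)$. The Gaussian choice, the synthetic data, and the function name `avg_nll` are assumptions made for this example only. For this family the minimizer of the average negative log-likelihood is available in closed form (the sample mean and the uncorrected sample standard deviation), so no iterative optimizer is needed.

```python
import numpy as np

def avg_nll(x, mu, sigma):
    """Average negative log-likelihood of samples x under N(mu, sigma^2)."""
    return np.mean(0.5 * np.log(2 * np.pi * sigma**2)
                   + (x - mu) ** 2 / (2 * sigma**2))

# Synthetic training set; the "true" parameters here are illustrative.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)

# Closed-form minimizer of objective (1) for the Gaussian family.
mu_hat = x.mean()
sigma_hat = x.std()  # ddof=0, i.e. the maximum likelihood estimate

best = avg_nll(x, mu_hat, sigma_hat)

# Any other parameter setting should give an objective at least as large.
print(best, avg_nll(x, mu_hat + 0.1, sigma_hat), avg_nll(x, 2.0, 1.5))
```

Negating `avg_nll` gives the average log-likelihood in (2), so the same `mu_hat` and `sigma_hat` maximize it.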

## Autoregressive models

Assume the data $X$ is $K$-dimensional, say $X=[x_1, x_2,..., x_K]$, so that the parameterized distribution can be written as $p_{\theta}(x_1, x_2,..., x_K)$. Modeling this joint distribution directly is hard, especially because data points cover a high-dimensional space only sparsely (the curse of dimensionality). We therefore need to decompose the distribution into simpler pieces. One option is to use the idea of a Bayesian network (Bayes net), which factorizes the joint as

$$p_{\theta}(x_1, x_2,..., x_K) = \prod_{i=1}^K p_{\theta}(x_i|\text{Parent}(x_i))$$

However, it's not obvious how to choose the parent nodes. A safe option is therefore to consider the fully expressive Bayes net structure, in which each variable depends on all the variables that come before it:

$$p_{\theta}(x_1, x_2,..., x_K) = \prod_{i=1}^K p_{\theta}(x_i|x_{<i})$$

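Here $x_{<i}$ denotes the variables $x_1, x_2,..., x_{i-1}$ (for $i=1$ the conditioning set is empty). This factorization is the chain rule of probability, so it holds for any joint distribution and imposes no independence assumptions. A model that parameterizes $p_{\theta}$ in this form is called an autoregressive model, since the value of $x_i$ depends on $x_1, x_2,..., x_{i-1}$.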
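The factorization above can be sketched in code. The example below assumes binary variables and models each conditional $p_{\theta}(x_i = 1 \mid x_{<i})$ with a logistic function of the earlier coordinates. This particular parameterization, the random weights, and the names `log_prob` and `sample` are illustrative assumptions, not something the text prescribes. The weight matrix is strictly lower triangular, so that coordinate $i$ only sees $x_1, ..., x_{i-1}$.

```python
import itertools
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_prob(x, W, b):
    """log p_theta(x) = sum_i log p_theta(x_i | x_{<i}) for binary x."""
    # W is strictly lower triangular, so row i only uses x_1..x_{i-1}.
    probs = sigmoid(W @ x + b)  # probs[i] = p(x_i = 1 | x_{<i})
    return np.sum(x * np.log(probs) + (1 - x) * np.log(1 - probs))

def sample(W, b, rng):
    """Draw x one coordinate at a time, each conditioned on earlier ones."""
    K = len(b)
    x = np.zeros(K)
    for i in range(K):
        p_i = sigmoid(np.dot(W[i, :i], x[:i]) + b[i])
        x[i] = float(rng.random() < p_i)
    return x

K = 4
rng = np.random.default_rng(0)
W = np.tril(rng.normal(size=(K, K)), k=-1)  # zero on and above the diagonal
b = rng.normal(size=K)

# Summing p_theta(x) over all 2^K binary vectors should give 1, because
# every conditional in the product is itself a normalized distribution.
total = sum(np.exp(log_prob(np.array(bits, dtype=float), W, b))
            for bits in itertools.product([0, 1], repeat=K))
print(total)

x_new = sample(W, b, rng)
print(x_new, log_prob(x_new, W, b))
```

Evaluating the likelihood needs only one pass over the coordinates, while sampling is inherently sequential because each $x_i$ must be drawn before the next conditional can be computed.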