# Notes

### Problem Scenario

| symbol | description |
| ----: | ----: |
| $\mathbf{x}$ | input variable |
| $\mathbf{z}$ | latent variable |
| $p_{\boldsymbol{\theta}}(\mathbf{x})$ | model used to approximate the distribution of $\mathbf{x}$ |
| $p^*(\mathbf{x})$ | true distribution of $\mathbf{x}$ |
| $q_{\boldsymbol{\phi}}(\mathbf{z}\vert\mathbf{x})$ | inference model, encoder, recognition model |
| $p_{\boldsymbol{\theta}}(\mathbf{z})$ | prior |
| $p_{\boldsymbol{\theta}}(\mathbf{z}\vert\mathbf{x})$ | posterior |
| $p_{\boldsymbol{\theta}}(\mathbf{x},\mathbf{z})$ | generative model $p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})=p_{\boldsymbol{\theta}}(\mathbf{x}\vert\mathbf{z})p_{\boldsymbol{\theta}}(\mathbf{z})$ |
| $p_{\boldsymbol{\theta}}(\mathbf{x}\vert\mathbf{z})$ | decoder |


| full name | abbreviation |
|---|----|
| Maximum Likelihood | ML |
| Maximum A Posteriori | MAP |
| Variational Bayes | VB |
| Stochastic Gradient VB | SGVB |
| Auto-Encoding VB | AEVB |
| Deep Latent Variable Model $p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})=p_{\boldsymbol{\theta}}(\mathbf{x}\vert\mathbf{z})p_{\boldsymbol{\theta}}(\mathbf{z})$ | DLVM |

Let us consider some dataset $\mathbf{X}=\{\mathbf{x}^{(i)}\}_{i=1}^N$ consisting of $N$ i.i.d. samples of some continuous or discrete variable $\mathbf{x}$. We assume that the data are generated by some random process, involving an unobserved continuous random variable $\mathbf{z}$. The process consists of two steps: (1) a value $\mathbf{z}^{(i)}$ is generated from some prior distribution $p_{\boldsymbol{\theta}}^*(\mathbf{z})$; (2) a value $\mathbf{x}^{(i)}$ is generated from some conditional distribution $p_{\boldsymbol{\theta}}^*(\mathbf{x}|\mathbf{z})$. We assume that the prior $p_{\boldsymbol{\theta}}^*(\mathbf{z})$ and likelihood $p_{\boldsymbol{\theta}}^*(\mathbf{x}|\mathbf{z})$ come from parametric families of distributions $p_{\boldsymbol{\theta}}(\mathbf{z})$ and $p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})$, and their PDFs are differentiable almost everywhere w.r.t. both $\boldsymbol{\theta}$ and $\mathbf{z}$.

We are interested in three related problems:
1. Efficient approximate ML or MAP estimation for the parameters $\boldsymbol{\theta}$.

2. Efficient approximate posterior inference of the latent variable $\mathbf{z}$ given an observed value $\mathbf{x}$ for a choice of parameters $\boldsymbol{\theta}$.

3. Efficient approximate marginal inference of the variable $\mathbf{x}$.

For the purpose of solving the above problems, let us introduce a recognition model $q_{\phi}(\mathbf{z}|\mathbf{x})$: an approximation to the intractable true posterior $p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x})$.

<mark>From a coding theory perspective, the unobserved variables $\mathbf{z}$ have an interpretation as a latent representation or *code*. In this paper, we will therefore also refer to the recognition model $q_{\phi}(\mathbf{z}|\mathbf{x})$ as a probabilistic *encoder*, since given a datapoint $\mathbf{x}$ it produces a distribution over the possible values of the code $\mathbf{z}$ from which the datapoint $\mathbf{x}$ could have been generated. In a similar vein we will refer to $p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})$ as a probabilistic *decoder*, since given a code $\mathbf{z}$ it produces a distribution over the possible corresponding values of $\mathbf{x}$.</mark>

### The Variational Bound

The log marginal likelihood is a sum over the log marginal likelihoods of individual datapoints, $\log p_{\boldsymbol{\theta}}(\mathbf{x}^{(1)},\cdots,\mathbf{x}^{(N)})=\sum\limits_{i=1}^N\log p_{\boldsymbol{\theta}}(\mathbf{x}^{(i)})$ <font color="red">(because the samples are i.i.d.)</font>, each of which can be rewritten as:
$$
\log p_{\boldsymbol{\theta}}(\mathbf{x}^{(i)})=D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}^{(i)})\|p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x}^{(i)}))+\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\phi};\mathbf{x}^{(i)}).\ \ \ \ \ \ \ \ \ \ \ \ (1)
$$

Since the KL divergence is non-negative, the second Right Hand Side (RHS) term $\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\phi};\mathbf{x}^{(i)})$ is called the **(variational) lower bound** on the marginal likelihood of datapoint $i$, and can be written as
$$
\log p_{\boldsymbol{\theta}}(\mathbf{x}^{(i)})\geq \mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\phi};\mathbf{x}^{(i)})=\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}[-\log q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})+\log p_{\boldsymbol{\theta}}(\mathbf{x},\mathbf{z})],\ \ \ \ \ \ \ \ \ \ (2)
$$
which can also be written as
$$
\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\phi};\mathbf{x}^{(i)})=-D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}^{(i)})\|p_{\boldsymbol{\theta}}(\mathbf{z}))+\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}^{(i)})}\left[\log p_{\boldsymbol{\theta}}(\mathbf{x}^{(i)}|\mathbf{z})\right].\ \ \ \ \ \ \ \ \ (3)
$$

We want to differentiate and optimize the lower bound $\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\phi};\mathbf{x}^{(i)})$ w.r.t. both the variational parameters $\boldsymbol{\phi}$ and generative parameters $\boldsymbol{\theta}$.
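Eqs. (1)-(3) can be checked numerically on a tiny example. The sketch below assumes a hypothetical model with a discrete latent variable taking three values (the text above assumes continuous $\mathbf{z}$; a discrete one is used here only so that every expectation is an exact finite sum). The numbers in `prior`, `likelihood` and `q` are made up for illustration.

```python
import math

# Hypothetical toy model: discrete latent z in {0, 1, 2} and one fixed observation x.
prior = [0.5, 0.3, 0.2]          # p(z)
likelihood = [0.10, 0.40, 0.70]  # p(x | z) evaluated at the observed x
q = [0.2, 0.3, 0.5]              # q(z | x), an arbitrary approximate posterior

joint = [pz * px for pz, px in zip(prior, likelihood)]  # p(x, z)
evidence = sum(joint)                                   # p(x) = sum_z p(x, z)
posterior = [j / evidence for j in joint]               # p(z | x)


def kl(a, b):
    """KL(a || b) for two discrete distributions given as lists."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)


# Eq. (2): ELBO = E_q[-log q(z|x) + log p(x, z)]
elbo_eq2 = sum(qi * (math.log(ji) - math.log(qi)) for qi, ji in zip(q, joint))

# Eq. (3): ELBO = -KL(q || p(z)) + E_q[log p(x | z)]
elbo_eq3 = -kl(q, prior) + sum(qi * math.log(li) for qi, li in zip(q, likelihood))

# Eq. (1): log p(x) = KL(q || p(z|x)) + ELBO
log_evidence = math.log(evidence)
gap = kl(q, posterior)

print(log_evidence, gap + elbo_eq2, elbo_eq3)
```

The first two printed values should agree (eq. (1)), and the third should equal `elbo_eq2` (eq. (3)), up to floating-point error.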

<font color="red">
Proof of eq. (2):
Substituting $\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\phi};\mathbf{x}^{(i)})=\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}[-\log q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})+\log p_{\boldsymbol{\theta}}(\mathbf{x},\mathbf{z})]$ into eq. (1) gives
$$
\log p_{\boldsymbol{\theta}}(\mathbf{x}^{(i)})=D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x}^{(i)})\|p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x}^{(i)}))+\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}[-\log q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})+\log p_{\boldsymbol{\theta}}(\mathbf{x},\mathbf{z})]
$$
By the definition of the KL divergence, the right-hand side can be transformed as follows:
$$
\begin{split}
\log p_{\boldsymbol{\theta}}(\mathbf{x}^{(i)})&=\int\limits_{\mathbf{z}}q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})\log\frac{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}{p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x})}d\mathbf{z}+\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}[-\log q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})+\log p_{\boldsymbol{\theta}}(\mathbf{x},\mathbf{z})]\\
&=\int\limits_{\mathbf{z}}q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})[\log q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}) -\log p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x})]d\mathbf{z}+\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}[-\log q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})+\log p_{\boldsymbol{\theta}}(\mathbf{x},\mathbf{z})]\\
&=\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}[\log q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}) -\log p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x})]+\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}[-\log q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})+\log p_{\boldsymbol{\theta}}(\mathbf{x},\mathbf{z})]\\
&=\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}[\log p_{\boldsymbol{\theta}}(\mathbf{x},\mathbf{z})-\log p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x})]\\
&=\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}\left[\log\frac{p_{\boldsymbol{\theta}}(\mathbf{x},\mathbf{z})}{p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x})}\right]\\
&=\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}[\log p_{\boldsymbol{\theta}}(\mathbf{x})]\\
&=\log p_{\boldsymbol{\theta}}(\mathbf{x})
\end{split}
$$
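The result equals the left-hand side of eq. (1), which completes the proof of eq. (2).

The proof of eq. (3) is similar to that of eq. (2); for details see *https://zhuanlan.zhihu.com/p/161277762*.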
</font>

### The SGVB Estimator and AEVB Algorithm

<img src="figs/vae_flow.png" width="800px"/>

# Introduction

One major division in machine learning is generative versus discriminative modeling. While in discriminative modeling one aims to learn a predictor given the observations, in generative modeling one aims to solve the more general problem of learning a joint distribution over all the variables. Generative modeling can be useful more generally. One can think of it as an auxiliary task. 

The VAE can be viewed as two coupled, but independently parameterized models: the encoder or recognition model, and the decoder or generative model. These two models support each other. 

The generative model is a Bayesian network of the form $p(\mathbf{x}|\mathbf{z})p(\mathbf{z})$, or, if there are multiple stochastic latent layers, a hierarchy such as $p(\mathbf{x}|\mathbf{z}_L)p(\mathbf{z}_L|\mathbf{z}_{L-1})\cdots p(\mathbf{z}_1|\mathbf{z}_0)$. Similarly, the recognition model is also a conditional Bayesian network of the form $q(\mathbf{z}|\mathbf{x})$ or as a hierarchy, such as $q(\mathbf{z}_0|\mathbf{z}_1)\cdots q(\mathbf{z}_L|\mathbf{x})$.

Let's use $\mathbf{x}$ as the vector representing the set of all observed variables whose joint distribution we would like to model. We assume the observed variable $\mathbf{x}$ is a random sample from an *unknown underlying process*, whose true distribution $p^*(\mathbf{x})$ is unknown. We attempt to approximate this underlying process with a chosen model $p_{\boldsymbol{\theta}}(\mathbf{x})$, with parameters $\boldsymbol{\theta}$:

$$
\mathbf{x}\sim p_{\boldsymbol{\theta}}(\mathbf{x})\ \ \ \ \ \ \ \ \ \ \ \ \ (1.1)
$$

*Learning* is, most commonly, the process of searching for a value of the parameters $\boldsymbol{\theta}$ such that the probability distribution function given by the model, $p_{\boldsymbol{\theta}}(\mathbf{x})$, approximates the true distribution of the data, denoted by $p^*(\mathbf{x})$, such that for any observed $\mathbf{x}$:
$$
p_{\boldsymbol{\theta}}(\mathbf{x})\approx p^*(\mathbf{x})\ \ \ \ \ \ \ \ \ \ \ (1.2)
$$

Naturally, we wish $p_{\boldsymbol{\theta}}(\mathbf{x})$ to be sufficiently *flexible* to be able to adapt to the data, such that we have a chance of obtaining a sufficiently accurate model. At the same time, we wish to be able to incorporate knowledge about the distribution of data into the model that is known a priori.

We often collect a dataset $\mathcal{D}$ consisting of $N\geq 1$ datapoints:
$$
\mathcal{D}=\{\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \dots, \mathbf{x}^{(N)}\}\equiv \{\mathbf{x}^{(i)}\}_{i=1}^N\equiv \mathbf{x}^{(1:N)}\ \ \ \ \ \ \ \ \ \ \ \ \ (1.9)
$$

The log-probability assigned to the data by the model is therefore given by:
$$
\log p_{\boldsymbol{\theta}}(\mathcal{D})=\sum\limits_{\mathbf{x}\in\mathcal{D}}\log p_{\boldsymbol{\theta}}(\mathbf{x})\ \ \ \ \ \ \ \ \ \ \ \ \ (1.10)
$$

If we compute gradients using all datapoints, $\nabla_{\boldsymbol{\theta}}\log p_{\boldsymbol{\theta}}(\mathcal{D})$, then this is known as *batch* gradient descent. Computation of this derivative is, however, an expensive operation for large dataset size $N_{\mathcal{D}}$, since it scales linearly with $N_{\mathcal{D}}$.

*Stochastic gradient descent (SGD)* randomly draws minibatches of data $\mathcal{M}\subset \mathcal{D}$ of size $N_{\mathcal{M}}$. With such minibatches we can form an unbiased estimator of the ML criterion:
$$
\frac{1}{N_{\mathcal{D}}}\log p_{\boldsymbol{\theta}}(\mathcal{D})\simeq \frac{1}{N_{\mathcal{M}}}\log p_{\boldsymbol{\theta}}(\mathcal{M})=\frac{1}{N_{\mathcal{M}}}\sum\limits_{\mathbf{x}\in\mathcal{M}}\log p_{\boldsymbol{\theta}}(\mathbf{x})\ \ \ \ \ \ \ \ \ \ \ \ \ (1.11)
$$

The unbiased estimator $\log p_{\boldsymbol{\theta}}(\mathcal{M})$ is differentiable, yielding the unbiased stochastic gradients:
$$
\frac{1}{N_{\mathcal{D}}}\nabla_{\boldsymbol{\theta}}\log p_{\boldsymbol{\theta}}(\mathcal{D})\simeq \frac{1}{N_{\mathcal{M}}}\nabla_{\boldsymbol{\theta}}\log p_{\boldsymbol{\theta}}(\mathcal{M})=\frac{1}{N_{\mathcal{M}}}\sum\limits_{\mathbf{x}\in\mathcal{M}}\nabla_{\boldsymbol{\theta}}\log p_{\boldsymbol{\theta}}(\mathbf{x})\ \ \ \ \ \ \ \ \ \ \ \ \ (1.12)
$$
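The unbiasedness of the minibatch estimators can be verified exhaustively on a small dataset. The sketch below assumes a hypothetical 1-D dataset and a unit-variance Gaussian model $p_{\boldsymbol{\theta}}(x)=\mathcal{N}(x;\theta,1)$ (both are illustrative choices, not from the source), and averages the minibatch estimate over every possible minibatch of size 2.

```python
import itertools
import math

# Hypothetical 1-D dataset and a unit-variance Gaussian model p_theta(x) = N(x; theta, 1).
data = [0.3, -1.2, 2.5, 0.8, 1.9]
theta = 0.5


def log_p(x, theta):
    """Log-density of N(x; theta, 1)."""
    return -0.5 * math.log(2 * math.pi) - 0.5 * (x - theta) ** 2


def grad_log_p(x, theta):
    """Derivative of log_p with respect to theta."""
    return x - theta


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Full-data (batch) quantities, normalised by the dataset size N_D.
full_objective = mean(log_p(x, theta) for x in data)
full_gradient = mean(grad_log_p(x, theta) for x in data)

# Average the minibatch estimators over every possible minibatch of size N_M = 2.
# Uniform sampling without replacement makes each minibatch equally likely.
minibatches = list(itertools.combinations(data, 2))
expected_objective = mean(mean(log_p(x, theta) for x in m) for m in minibatches)
expected_gradient = mean(mean(grad_log_p(x, theta) for x in m) for m in minibatches)

print(full_objective, expected_objective)
print(full_gradient, expected_gradient)
```

Each pair of printed values should match up to rounding: a single minibatch gives a noisy estimate, but its expectation over minibatches equals the full-data quantity.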

We typically use $\mathbf{z}$ to denote latent variables. The marginal distribution over the observed variables, $p_{\boldsymbol{\theta}}(\mathbf{x})$, is given by
$$
p_{\boldsymbol{\theta}}(\mathbf{x})=\int p_{\boldsymbol{\theta}}(\mathbf{x},\mathbf{z})d\mathbf{z}\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1.13)
$$
This is also called the *marginal likelihood* or the *model evidence*, when taken as a function of $\boldsymbol{\theta}$.
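For a one-dimensional latent variable the integral in eq. (1.13) can be approximated by simple quadrature. The sketch below assumes a toy linear-Gaussian model, $p(z)=\mathcal{N}(z;0,1)$ and $p(x|z)=\mathcal{N}(x;z,1)$, chosen only because its marginal $\mathcal{N}(x;0,2)$ is known in closed form; the function names are illustrative.

```python
import math


def normal_pdf(value, mean, variance):
    """Density of a univariate Gaussian N(value; mean, variance)."""
    return math.exp(-0.5 * (value - mean) ** 2 / variance) / math.sqrt(2 * math.pi * variance)


def marginal_likelihood(x, lower=-10.0, upper=10.0, steps=4000):
    """Approximate p(x) = integral of p(x|z) p(z) dz with the trapezoidal rule.

    Assumed toy model: p(z) = N(z; 0, 1) and p(x|z) = N(x; z, 1).
    """
    width = (upper - lower) / steps
    total = 0.0
    for k in range(steps + 1):
        z = lower + k * width
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * normal_pdf(z, 0.0, 1.0) * normal_pdf(x, z, 1.0)
    return total * width


x = 1.3
numeric = marginal_likelihood(x)
exact = normal_pdf(x, 0.0, 2.0)  # for this model the marginal is N(x; 0, 2)
print(numeric, exact)
```

Grid-based integration like this only works in very low dimensions; for high-dimensional $\mathbf{z}$ and neural-network likelihoods the integral has no practical quadrature, which is why the marginal likelihood is treated as intractable.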

# Variational Autoencoders

We introduce a parametric *inference model* $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$. This model is also called the *encoder* or *recognition model*. With $\boldsymbol{\phi}$ we indicate the parameters of this inference model, also called the *variational parameters*. We optimize the variational parameters $\boldsymbol{\phi}$ such that: 
$$
q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})\approx p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x})\ \ \ \ \ \ \ \ \ \ \ \ (2.1)
$$

The generative model learns a joint distribution $p_{\boldsymbol{\theta}}(\mathbf{x},\mathbf{z})$ that is often factorized as $p_{\boldsymbol{\theta}}(\mathbf{x},\mathbf{z})=p_{\boldsymbol{\theta}}(\mathbf{z})p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})$, with a prior distribution over latent space $p_{\boldsymbol{\theta}}(\mathbf{z})$, and a stochastic decoder $p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})$. The stochastic encoder $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$, also called *inference model*, approximates the true but intractable posterior $p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x})$ of the generative model.

### Evidence Lower Bound (ELBO)

For any choice of inference model $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$, including the choice of variational parameters $\boldsymbol{\phi}$, we have:
$$
\begin{split}
\log p_{\boldsymbol{\theta}}(\mathbf{x})&=\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}[\log p_{\boldsymbol{\theta}}(\mathbf{x})]\\
&=\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}\left[\log\left(\frac{p_{\boldsymbol{\theta}}(\mathbf{z},\mathbf{x})}{p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x})}\right)\right]\\
&=\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}\left[\log\left(\frac{p_{\boldsymbol{\theta}}(\mathbf{z},\mathbf{x})}{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}\frac{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}{p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x})}\right)\right]\\
&=\underbrace{\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}\left[\log\left(\frac{p_{\boldsymbol{\theta}}(\mathbf{z},\mathbf{x})}{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}\right)\right]}_{=\mathcal{L}_{\boldsymbol{\theta},\boldsymbol{\phi}}(\mathbf{x})}+\underbrace{\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}\left[\log\left(\frac{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}{p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x})}\right)\right]}_{=D_{KL}(q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})\|p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x}))}\ \ \ \ \ \ \ \ \ \ \ \ \ (2.8)
\end{split}
$$

The second term in eq. (2.8) is the Kullback-Leibler (KL) divergence.

The first term is ELBO:
$$
\mathcal{L}_{\boldsymbol{\theta},\boldsymbol{\phi}}(\mathbf{x})=\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}[\log p_{\boldsymbol{\theta}}(\mathbf{z},\mathbf{x})-\log q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})]\ \ \ \ \ \ \ \ \ \ \ \ \ \ (2.10)
$$

Due to the non-negativity of the KL divergence, the ELBO is a lower bound on the log-likelihood of the data:
$$
\begin{split}
\mathcal{L}_{\boldsymbol{\theta},\boldsymbol{\phi}}(\mathbf{x})&=\log p_{\boldsymbol{\theta}}(\mathbf{x})-D_{KL}(q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})\|p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x}))\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2.11)\\
&\leq \log p_{\boldsymbol{\theta}}(\mathbf{x})\ \ \ \ \ \ \ \ \ \ \ \ \ \ (2.12)
\end{split}
$$

So, interestingly, the KL divergence determines two distances:
1. By definition, the KL divergence of the approximate posterior from the true posterior;
2. The gap between the ELBO $\mathcal{L}_{\boldsymbol{\theta},\boldsymbol{\phi}}(\mathbf{x})$ and the log marginal likelihood $\log p_{\boldsymbol{\theta}}(\mathbf{x})$; this is also called the *tightness* of the bound. The better $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$ approximates the true posterior distribution $p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x})$, in terms of the KL divergence, the smaller the gap.

By looking at eq. 2.11, it can be understood that maximization of the ELBO $\mathcal{L}_{\boldsymbol{\theta},\boldsymbol{\phi}}(\mathbf{x})$ w.r.t. the parameters $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$, will concurrently optimize the two things we care about:
1. It will approximately maximize the marginal likelihood $p_{\boldsymbol{\theta}}(\mathbf{x})$. This means that our generative model will become better.
2. It will minimize the KL divergence of the approximation $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$ from the true posterior $p_{\boldsymbol{\theta}}(\mathbf{z}|\mathbf{x})$, so $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$ becomes better.
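The tightness of the bound can be illustrated in closed form. The sketch below assumes a toy linear-Gaussian model, $p(z)=\mathcal{N}(0,1)$, $p(x|z)=\mathcal{N}(x;z,1)$, and a Gaussian encoder $q(z|x)=\mathcal{N}(m,s^2)$ with hand-picked $(m,s)$; for this model the true posterior is $\mathcal{N}(x/2,1/2)$, so the gap and the KL term can both be computed exactly.

```python
import math

# Assumed toy model: p(z) = N(0, 1), p(x|z) = N(x; z, 1), q(z|x) = N(m, s^2).
# Then p(x) = N(x; 0, 2) and the true posterior is p(z|x) = N(x/2, 1/2).


def log_evidence(x):
    return -0.5 * math.log(2 * math.pi * 2.0) - x ** 2 / 4.0


def gaussian_kl(m_q, var_q, m_p, var_p):
    """KL(N(m_q, var_q) || N(m_p, var_p)) for univariate Gaussians."""
    return 0.5 * (var_q / var_p + (m_q - m_p) ** 2 / var_p - 1.0 + math.log(var_p / var_q))


def elbo(x, m, s):
    """Closed-form ELBO: E_q[log p(x|z)] - KL(q(z|x) || p(z))."""
    expected_log_lik = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + s ** 2)
    return expected_log_lik - gaussian_kl(m, s ** 2, 0.0, 1.0)


x = 1.3
for m, s in [(0.0, 1.0), (0.5, 0.9), (x / 2, math.sqrt(0.5))]:
    gap = log_evidence(x) - elbo(x, m, s)
    kl_to_posterior = gaussian_kl(m, s ** 2, x / 2, 0.5)
    print(f"m={m:.2f} s={s:.2f} gap={gap:.6f} KL={kl_to_posterior:.6f}")
```

In each row the gap equals the KL divergence to the true posterior, and it vanishes in the last row, where $q$ is set to the exact posterior.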

### Stochastic Gradient-based Optimization of the ELBO

An important property of the ELBO is that it allows *joint* optimization w.r.t. all parameters ($\boldsymbol{\theta}$ and $\boldsymbol{\phi}$) using SGD.

Given a dataset with i.i.d. data, the ELBO objective is the sum (or average) of individual-datapoint ELBOs:
$$
\mathcal{L}_{\boldsymbol{\theta},\boldsymbol{\phi}}(\mathcal{D})=\sum\limits_{\mathbf{x}\in\mathcal{D}}\mathcal{L}_{\boldsymbol{\theta},\boldsymbol{\phi}}(\mathbf{x})\ \ \ \ \ \ \ \ \ \ \ \ (2.13)
$$

The individual-datapoint ELBO and its gradient $\nabla_{\boldsymbol{\theta},\boldsymbol{\phi}}\mathcal{L}_{\boldsymbol{\theta},\boldsymbol{\phi}}(\mathbf{x})$ are, in general, intractable. However, good unbiased estimators $\tilde{\nabla}_{\boldsymbol{\theta},\boldsymbol{\phi}}\mathcal{L}_{\boldsymbol{\theta},\boldsymbol{\phi}}(\mathbf{x})$ exist, such that we can still perform minibatch SGD.

Unbiased gradients of the ELBO w.r.t. the generative model parameters $\boldsymbol{\theta}$ are simple to obtain:
$$
\begin{split}
\nabla_{\boldsymbol{\theta}}\mathcal{L}_{\boldsymbol{\theta},\boldsymbol{\phi}}(\mathbf{x})&=\nabla_{\boldsymbol{\theta}}\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}[\log p_{\boldsymbol{\theta}}(\mathbf{z},\mathbf{x})-\log q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})]\ \ \ \ \ \ \ \ \ \ \ (2.14)\\
&=\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}[\nabla_{\boldsymbol{\theta}}(\log p_{\boldsymbol{\theta}}(\mathbf{z},\mathbf{x})-\log q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}))]\ \ \ \ \ \ \ \ \ \ \ \ (2.15)\\
&\simeq \nabla_{\boldsymbol{\theta}}(\log p_{\boldsymbol{\theta}}(\mathbf{z},\mathbf{x})-\log q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}))\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2.16)\\
&=\nabla_{\boldsymbol{\theta}}(\log p_{\boldsymbol{\theta}}(\mathbf{z},\mathbf{x}))\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2.17)
\end{split}
$$

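The last line (eq. (2.17)) is a simple Monte Carlo estimator of the second line (eq. (2.15)), where $\mathbf{z}$ in the last two lines (eq. (2.16) and eq. (2.17)) is a random sample from $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$. <font color="red">(This step was unclear to me!)</font> The reasoning is as follows: the distribution $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$ under the expectation does not depend on $\boldsymbol{\theta}$, so the gradient can be moved inside the expectation (eq. (2.15)); an expectation can be estimated without bias by evaluating its integrand at a single sample (eq. (2.16)); and $\log q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$ has zero gradient w.r.t. $\boldsymbol{\theta}$ (eq. (2.17)).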
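A small numeric check of eqs. (2.15)-(2.17) may help here. The sketch below assumes a toy model $p_\theta(x,z)=\mathcal{N}(z;0,1)\,\mathcal{N}(x;\theta z,1)$ with a fixed Gaussian encoder $q(z|x)=\mathcal{N}(m,s^2)$; all numbers are illustrative. For this model $\nabla_\theta\log p_\theta(x,z)=(x-\theta z)z$, and its expectation under $q$ is $xm-\theta(m^2+s^2)$.

```python
import random

# Assumed toy model: p_theta(x, z) = N(z; 0, 1) * N(x; theta * z, 1),
# with a fixed encoder q(z|x) = N(m, s^2) that does not depend on theta.
x, theta = 1.3, 0.7
m, s = 0.4, 0.8


def grad_theta_log_joint(z):
    """d/dtheta log p_theta(x, z) = (x - theta * z) * z; log q has no theta-dependence."""
    return (x - theta * z) * z


# Exact expectation under q, i.e. eq. (2.15): E_q[(x - theta z) z].
exact_gradient = x * m - theta * (m ** 2 + s ** 2)

rng = random.Random(0)
single_sample = grad_theta_log_joint(rng.gauss(m, s))  # eq. (2.17): one noisy estimate

# Averaging many single-sample estimates approaches the exact gradient.
n = 200_000
averaged = sum(grad_theta_log_joint(rng.gauss(m, s)) for _ in range(n)) / n
print(exact_gradient, single_sample, averaged)
```

A single sample is noisy, but the average of many such samples should be close to `exact_gradient`, which is what "unbiased" means for this estimator.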

Unbiased gradients w.r.t. the variational parameters $\boldsymbol{\phi}$ are more difficult to obtain, since the ELBO's expectation is taken w.r.t. the distribution $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$, which is a function of $\boldsymbol{\phi}$; i.e., in general:
$$
\begin{split}
\nabla_{\boldsymbol{\phi}}\mathcal{L}_{\boldsymbol{\theta},\boldsymbol{\phi}}(\mathbf{x})&=\nabla_{\boldsymbol{\phi}}\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}[\log p_{\boldsymbol{\theta}}(\mathbf{z},\mathbf{x})-\log q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})]\\
&\neq \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}[\nabla_{\boldsymbol{\phi}}(\log p_{\boldsymbol{\theta}}(\mathbf{z},\mathbf{x})-\log q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}))]
\end{split}
$$

In the case of continuous latent variables, we can use a reparameterization trick for computing unbiased estimates of $\nabla_{\boldsymbol{\theta},\boldsymbol{\phi}}\mathcal{L}_{\boldsymbol{\theta},\boldsymbol{\phi}}(\mathbf{x})$.

### Reparameterization Trick

For continuous latent variables and a differentiable encoder and generative model, the ELBO can be straightforwardly differentiated w.r.t. both $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$ through a change of variables, also called the *reparameterization trick*.

First, we express the random variable $\mathbf{z}\sim q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$ as some differentiable (and invertible) transformation of another random variable $\boldsymbol{\epsilon}$, given $\mathbf{x}$ and $\boldsymbol{\phi}$:
$$
\mathbf{z}=\mathbf{g}(\boldsymbol{\epsilon},\boldsymbol{\phi},\mathbf{x})\ \ \ \ \ \ \ \ \ \ \ \ (2.20)
$$
where the distribution of $\boldsymbol{\epsilon}$ is independent of $\boldsymbol{\phi}$ and $\mathbf{x}$.

Given such a change of variable, expectations can be rewritten in terms of $\boldsymbol{\epsilon}$:
$$
\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}[f(\mathbf{z})]=\mathbb{E}_{p(\boldsymbol{\epsilon})}[f(\mathbf{z})]\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2.21)
$$
where $\mathbf{z}=\mathbf{g}(\boldsymbol{\epsilon},\boldsymbol{\phi},\mathbf{x})$. The expectation and gradient operators then become commutative, and we can form a simple Monte Carlo estimator:
$$
\begin{split}
\nabla_{\boldsymbol{\phi}}\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}[f(\mathbf{z})]&=\nabla_{\boldsymbol{\phi}}\mathbb{E}_{p(\boldsymbol{\epsilon})}[f(\mathbf{z})]\ \ \ \ \ \ \ \ \ \ \ (2.22)\\
&=\mathbb{E}_{p(\boldsymbol{\epsilon})}[\nabla_{\boldsymbol{\phi}}f(\mathbf{z})]\ \ \ \ \ \ \ \ \ \ \ (2.23)\\
&\simeq\nabla_{\boldsymbol{\phi}}f(\mathbf{z})\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2.24)
\end{split}
$$
where in the last line, $\mathbf{z}=\mathbf{g}(\boldsymbol{\epsilon},\boldsymbol{\phi},\mathbf{x})$ with random noise sample $\boldsymbol{\epsilon}\sim p(\boldsymbol{\epsilon})$.
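Eqs. (2.22)-(2.24) can be tried out on a case where the answer is known. The sketch below assumes $q_{\boldsymbol{\phi}}(z)=\mathcal{N}(\mu,\sigma^2)$ with $\boldsymbol{\phi}=(\mu,\sigma)$, the transformation $z=\mu+\sigma\epsilon$ with $\epsilon\sim\mathcal{N}(0,1)$, and $f(z)=z^2$ (all illustrative choices). Since $\mathbb{E}_q[z^2]=\mu^2+\sigma^2$, the exact gradient is $(2\mu, 2\sigma)$.

```python
import random

# Assumed example: q_phi(z) = N(mu, sigma^2) with phi = (mu, sigma), and f(z) = z^2.
# Then E_q[f(z)] = mu^2 + sigma^2, so the exact gradients are (2*mu, 2*sigma).
mu, sigma = 0.5, 1.5


def reparam_gradient(eps):
    """Single-sample estimate of grad_phi f(g(eps, phi)) with z = mu + sigma * eps."""
    z = mu + sigma * eps      # eq. (2.20): z = g(eps, phi)
    df_dz = 2.0 * z           # f'(z) for f(z) = z^2
    return df_dz * 1.0, df_dz * eps   # chain rule: dz/dmu = 1, dz/dsigma = eps


rng = random.Random(0)
n = 200_000
sum_mu = sum_sigma = 0.0
for _ in range(n):
    g_mu, g_sigma = reparam_gradient(rng.gauss(0.0, 1.0))  # eps ~ p(eps) = N(0, 1)
    sum_mu += g_mu
    sum_sigma += g_sigma

estimate = (sum_mu / n, sum_sigma / n)
exact = (2.0 * mu, 2.0 * sigma)
print(estimate, exact)
```

The randomness now enters only through $\epsilon$, whose distribution does not involve $\boldsymbol{\phi}$, so the derivative can be taken sample by sample with the ordinary chain rule.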

The ELBO can be rewritten as
$$
\begin{split}
\mathcal{L}_{\boldsymbol{\theta},\boldsymbol{\phi}}(\mathbf{x})&=\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}[\log p_{\boldsymbol{\theta}}(\mathbf{z},\mathbf{x})-\log q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})]\ \ \ \ \ \ \ \ \ \ \ \ \ (2.25)\\
&=\mathbb{E}_{p(\boldsymbol{\epsilon})}[\log p_{\boldsymbol{\theta}}(\mathbf{z},\mathbf{x})-\log q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})]\ \ \ \ \ \ \ \ \ \ \ \ \ (2.26)
\end{split}
$$
where $\mathbf{z}=\mathbf{g}(\boldsymbol{\epsilon},\boldsymbol{\phi},\mathbf{x})$.

As a result, we can form a simple Monte Carlo estimator $\tilde{\mathcal{L}}_{\boldsymbol{\theta},\boldsymbol{\phi}}(\mathbf{x})$ of the individual-datapoint ELBO where we use a single noise sample $\boldsymbol{\epsilon}$ from $p(\boldsymbol{\epsilon})$:
$$
\begin{split}
\boldsymbol{\epsilon}&\sim p(\boldsymbol{\epsilon})\ \ \ \ \ \ \ \ \ \ \ \ \ \ (2.27)\\
\mathbf{z}&=\mathbf{g}(\boldsymbol{\epsilon},\boldsymbol{\phi},\mathbf{x})\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2.28)\\
\tilde{\mathcal{L}}_{\boldsymbol{\theta},\boldsymbol{\phi}}(\mathbf{x})&=\log p_{\boldsymbol{\theta}}(\mathbf{z},\mathbf{x})-\log q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (2.29)
\end{split}
$$

The resulting gradient $\nabla_{\boldsymbol{\phi}}\tilde{\mathcal{L}}_{\boldsymbol{\theta},\boldsymbol{\phi}}(\mathbf{x})$ is used to optimize the ELBO using minibatch SGD.
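The single-sample estimator of eqs. (2.27)-(2.29) can be sketched as follows, again under an assumed toy model: $p(z)=\mathcal{N}(0,1)$, $p(x|z)=\mathcal{N}(x;z,1)$ and encoder $q_{\boldsymbol{\phi}}(z|x)=\mathcal{N}(m,s^2)$ with $z=m+s\epsilon$. This model is chosen only because its ELBO has a closed form to compare against; a real VAE would compute $m$ and $s$ with a neural network.

```python
import math
import random

# Assumed toy model: p(z) = N(0, 1), p(x|z) = N(x; z, 1), encoder q_phi(z|x) = N(m, s^2).
x, m, s = 1.3, 0.5, 0.9


def log_normal(value, mean, std):
    """Log-density of N(value; mean, std^2)."""
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((value - mean) / std) ** 2


def elbo_single_sample(rng):
    """One draw of the estimator in eqs. (2.27)-(2.29)."""
    eps = rng.gauss(0.0, 1.0)                                    # eps ~ p(eps)
    z = m + s * eps                                              # z = g(eps, phi, x)
    log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)  # log p(z, x)
    return log_joint - log_normal(z, m, s)                       # minus log q(z|x)


# Closed-form ELBO for this Gaussian model, used only as a reference value.
exact_elbo = (-0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + s ** 2)
              - 0.5 * (m ** 2 + s ** 2 - 1.0 - 2.0 * math.log(s)))

rng = random.Random(0)
n = 100_000
averaged = sum(elbo_single_sample(rng) for _ in range(n)) / n
print(exact_elbo, averaged)
```

In training, each datapoint in a minibatch would contribute one such sample, and the noise averages out over many SGD steps rather than within a single step.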

<img src="./figs/vae_alg1.png" width="500px"/>
<img src="./figs/vae_alg2.png" width="500px"/>

# Appendixes

# Marginal Likelihood

In many applications, the goal is to make inference about a variable of interest, $\mathbf{x}$, given a set of observed measurements $\mathbf{y}$. In the Bayesian framework, one complete model $\mathbf{M}$ is formed by a likelihood function $l(\mathbf{y}|\mathbf{x},\mathbf{M})$ and a prior probability density function $g(\mathbf{x}|\mathbf{M})$. All the statistical information is summarized by the posterior pdf, i.e.,
$$
\bar{\pi}(\mathbf{x}|\mathbf{M})=p(\mathbf{x}|\mathbf{y},\mathbf{M})=\frac{l(\mathbf{y}|\mathbf{x},\mathbf{M})g(\mathbf{x}|\mathbf{M})}{p(\mathbf{y}|\mathbf{M})},\ \ \ \ \ \ \ \ \ \ (1)
$$
where 
$$
Z=p(\mathbf{y}|\mathbf{M})=\int_{\mathbf{X}}l(\mathbf{y}|\mathbf{x},\mathbf{M})g(\mathbf{x}|\mathbf{M})d\mathbf{x},\ \ \ \ \ \ \ \ \ \ (2)
$$
is the so-called marginal likelihood, a.k.a., Bayesian evidence. However, usually $p(\mathbf{y}|\mathbf{M})$ is unknown and difficult to approximate, so that in many cases we are only able to evaluate the unnormalized target function,
$$
\pi(\mathbf{x}|\mathbf{M})=l(\mathbf{y}|\mathbf{x},\mathbf{M})g(\mathbf{x}|\mathbf{M}).\ \ \ \ \ \ \ \ \ (3)
$$
Note that $\bar{\pi}(\mathbf{x}|\mathbf{M})\varpropto \pi(\mathbf{x}|\mathbf{M})$.

# KL Divergence

To measure the difference between two probability distributions over the same variable $x$, a measure called the **Kullback-Leibler divergence**, or simply the KL divergence, has been popularly used in the data mining literature. The KL divergence of pdf $q(x)$ from pdf $p(x)$, denoted by $D_{KL}(p(x)\|q(x))$, is a measure of the information lost when $q(x)$ is used to approximate $p(x)$.

If $p(x)$ and $q(x)$ are distributions of a discrete random variable $x$, $D_{KL}(p(x)\|q(x))$ is defined as
$$
D_{KL}(p(x)\|q(x))=\sum\limits_{x\in X}p(x)\ln\frac{p(x)}{q(x)}.
$$

The KL divergence measures the expected number of extra bits (or nats, when the natural logarithm is used as in the definition above) required to code samples from $p(x)$ when using a code based on $q(x)$, rather than using a code based on $p(x)$. Typically $p(x)$ represents the "true" distribution of data, observations, or a precisely calculated theoretical distribution. The measure $q(x)$ typically represents a theory, model, description, or approximation of $p(x)$.
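The discrete definition is short enough to compute directly. The sketch below uses two made-up distributions over three outcomes and a hypothetical helper `kl_divergence`; with the natural logarithm the result is in nats, and with base 2 it is in bits.

```python
import math


def kl_divergence(p, q, base=math.e):
    """D_KL(p || q) for discrete distributions given as equal-length lists.

    Terms with p(x) = 0 contribute zero; q(x) must be positive wherever p(x) > 0.
    """
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.25, 0.25]      # "true" distribution
q = [1 / 3, 1 / 3, 1 / 3]  # uniform approximation

kl_pq_nats = kl_divergence(p, q)
kl_qp_nats = kl_divergence(q, p)
kl_pq_bits = kl_divergence(p, q, base=2)
print(kl_pq_nats, kl_qp_nats, kl_pq_bits)
```

The two directions give different values, which shows that the KL divergence is not symmetric and therefore not a distance in the metric sense. In bits, `kl_pq_bits` equals $\log_2 3 - 1.5$: the cost of a uniform code (about 1.585 bits per symbol) minus the entropy of $p$ (1.5 bits).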

The continuous version of the KL divergence is 
$$
D_{KL}(p(x)\|q(x))=\int\limits_{-\infty}^{+\infty}p(x)\ln\frac{p(x)}{q(x)}dx.
$$

# References

[1] D. P. Kingma and M. Welling, "Auto-Encoding Variational Bayes," in *Proc. ICLR*, 2014.

[2] F. Llorente, L. Martino, D. Delgado, and J. Lopez-Santiago, "Marginal likelihood computation for model selection and hypothesis testing: an extensive review," *arXiv*, 2020.

[3] J. Han, "Kullback-Leibler Divergence," *http://hanj.cs.illinois.edu/cs412/bk3/KL-divergence.pdf*.

[4] D. P. Kingma and M. Welling, "An Introduction to Variational Autoencoders," *Foundations and Trends in Machine Learning*, 2019.