# Variational auto-encoder



Reference: Tutorial on Variational Autoencoders, https://arxiv.org/abs/1606.05908


# Definition and notation


$X$: datapoints in possibly high-dimensional space $\mathcal{X}$

$P(X)$: the probability distribution over datapoints $X$ defined by the generative model

$z$: latent variable in a high-dimensional space $\mathcal{Z}$

$f(z, \theta)$: deterministic function parameterized by $\theta \in \Theta$, where $f: \mathcal{Z} \times \Theta \to \mathcal{X}$

$\mathcal{D}[P(x)\|Q(x)] = -\int P(x)\log \frac{Q(x)}{P(x)}dx = -\mathbb{E}_{x \sim P}\left[\log \frac{Q(x)}{P(x)}\right]$: KL divergence
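To make the KL divergence definition concrete, here is a minimal sketch of its discrete analogue, where the integral becomes a sum. The function name `kl_divergence` and the example distributions `p` and `q` are illustrative assumptions, not part of the tutorial.

```python
import math

def kl_divergence(p, q):
    """D[P || Q] for two discrete distributions given as lists of probabilities."""
    # Terms with p_i == 0 contribute nothing (convention: 0 * log 0 = 0).
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical example distributions over two outcomes.
p = [0.5, 0.5]
q = [0.9, 0.1]

print(kl_divergence(p, p))                        # 0.0
print(kl_divergence(p, q), kl_divergence(q, p))   # two different non-negative values
```

The two printed values in the last line differ, which illustrates that the KL divergence is not symmetric in its arguments.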

# Goal

Maximize the following marginal likelihood by optimizing $\theta$.

$P(X) = \int P(X|z;\theta)P(z)dz$

Typically, the likelihood is an isotropic Gaussian centered on $f(z, \theta)$:

$P(X|z;\theta) = \mathcal{N}(X|f(z, \theta), \sigma^2 I)$
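As a rough illustration of the integral for $P(X)$, the sketch below estimates it by averaging $P(X|z;\theta)$ over samples $z \sim \mathcal{N}(0, 1)$ in one dimension. The linear decoder `f(z, theta) = theta * z`, the standard normal prior, and all numeric values are assumptions chosen only because the exact answer is then known: $P(X) = \mathcal{N}(X|0, \theta^2 + \sigma^2)$.

```python
import math
import random

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(x | mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def f(z, theta):
    # Hypothetical linear decoder, used only so the exact P(X) is available.
    return theta * z

def marginal_likelihood_mc(x, theta, sigma2, n_samples=100_000, seed=0):
    """Monte Carlo estimate of P(X) = E_{z ~ N(0,1)}[ N(X | f(z, theta), sigma2) ]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(0.0, 1.0)                    # sample from the prior P(z)
        total += gaussian_pdf(x, f(z, theta), sigma2)
    return total / n_samples

x, theta, sigma2 = 1.0, 2.0, 0.5
estimate = marginal_likelihood_mc(x, theta, sigma2)
exact = gaussian_pdf(x, 0.0, theta ** 2 + sigma2)  # closed form for the linear case
print(estimate, exact)
```

In one dimension the estimate is close to the exact value. With a high-dimensional $z$ and a nonlinear $f$, most prior samples contribute almost nothing to the average, which is one motivation for introducing $Q$ in the following sections.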

# Relation between $P(X)$ and $\mathbb{E}_{z \sim Q}[\log P(X|z)]$

$\log P(X) - \mathcal{D}[Q(z)\|P(z|X)] = \mathbb{E}_{z \sim Q}[\log P(X|z)] - \mathcal{D}[Q(z)\|P(z)]$

If we change $Q(z)$ to $Q(z|X)$,

$\log P(X) - \mathcal{D}[Q(z|X)\|P(z|X)] = \mathbb{E}_{z \sim Q}[\log P(X|z)] - \mathcal{D}[Q(z|X)\|P(z)]$ ... (5)

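We want to maximize $\log P(X)$ while simultaneously minimizing $\mathcal{D}[Q(z|X)\|P(z|X)]$.

The left-hand side cannot be evaluated directly because it involves the posterior $P(z|X)$. Instead, we can optimize the right-hand side via stochastic gradient descent, given the right choice of $Q$. Since the KL divergence is non-negative, the right-hand side is a lower bound on $\log P(X)$.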

---

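Derivation: start from the definition of the KL divergence.

$\mathcal{D}[Q(z)\|P(z|X)] = \mathbb{E}_{z \sim Q}[\log Q(z) - \log P(z|X)]$

$ = \mathbb{E}_{z \sim Q}[\log Q(z) - \log P(X|z) - \log P(z)] + \log P(X)$  (by applying Bayes' rule to $P(z|X)$; $\log P(X)$ does not depend on $z$, so it moves outside the expectation.)

Rearranging,

$\log P(X) - \mathcal{D}[Q(z)\|P(z|X)] = -\mathbb{E}_{z \sim Q}[\log Q(z) - \log P(X|z) - \log P(z)]$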

$ = \mathbb{E}_{z \sim Q}[\log P(X|z)] + \mathbb{E}_{z \sim Q}[\log P(z) - \log Q(z)]$

$ = \mathbb{E}_{z \sim Q}[\log P(X|z)] - \mathcal{D}[Q(z)\|P(z)]$
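The identity derived above can be checked numerically. The sketch below uses a latent variable with three discrete values so that every term is an exact finite sum; the numbers in `prior`, `likelihood`, and `q` are arbitrary assumptions made up for the check.

```python
import math

def kl(p, q):
    """D[P || Q] for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical toy model with z in {0, 1, 2} and one fixed datapoint X.
prior = [0.5, 0.3, 0.2]           # P(z)
likelihood = [0.10, 0.40, 0.70]   # P(X | z)
q = [0.2, 0.3, 0.5]               # an arbitrary Q(z)

# P(X) = sum_z P(X | z) P(z), and P(z | X) by Bayes' rule.
evidence = sum(l * p for l, p in zip(likelihood, prior))
posterior = [l * p / evidence for l, p in zip(likelihood, prior)]
log_evidence = math.log(evidence)

lhs = log_evidence - kl(q, posterior)
rhs = sum(qi * math.log(l) for qi, l in zip(q, likelihood)) - kl(q, prior)
print(lhs, rhs)            # equal up to floating-point rounding
print(rhs <= log_evidence) # the right-hand side never exceeds log P(X)
```

Changing `q` changes both sides together, and the gap to `log_evidence` is exactly the KL divergence between `q` and the posterior.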

# Optimizing the objective (1): KL divergence of two Gaussians


We optimize the right-hand side of (5):

$\mathbb{E}_{z \sim Q}[\log P(X|z)] - \mathcal{D}[Q(z|X)\|P(z)]$

First we compute the second term $\mathcal{D}[Q(z|X)\|P(z)]$.

Let

$P(z) = \mathcal{N}(z|0, I)$

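$Q(z|X) = \mathcal{N}(z|\mu(X;\vartheta), \Sigma(X;\vartheta))$, where $\vartheta$ denotes the parameters of $Q$, which are distinct from the parameters $\theta$ of $f$ (they are omitted in the equations that follow)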

Then 

$\mathcal{D}[Q(z|X)\|P(z)] = \mathcal{D}[\mathcal{N}\left(z|\mu(X), \Sigma(X)\right) \| \mathcal{N}(0, I)]$

$ = \frac{1}{2}\left\{ \mathrm{tr}\left(\Sigma(X)\right) + \left(- \mu(X)\right)^\top(- \mu(X)) - k + \log\left(\frac{1}{\det\Sigma(X)}\right) \right\}$

$ = \frac{1}{2}\left\{ \mathrm{tr}\left(\Sigma(X)\right) + \mu(X)^\top\mu(X) - k - \log\det\left(\Sigma(X)\right) \right\}$

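This follows from the general formula for the KL divergence between two Gaussians, with $\mu_1 = \mu(X)$, $\Sigma_1 = \Sigma(X)$, $\mu_2 = 0$, $\Sigma_2 = I$, and $k$ the dimensionality of $z$: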

$\mathcal{D}[\mathcal{N}(\mu_1, \Sigma_1)\|\mathcal{N}(\mu_2, \Sigma_2)] =
\frac{1}{2}\left\{ \mathrm{tr}\left(\Sigma_2^{-1}\Sigma_1\right) + \left(\mu_2 - \mu_1\right)^\top \Sigma_2^{-1}(\mu_2 - \mu_1) - k + \log\left(\frac{\det \Sigma_2}{\det\Sigma_1}\right) \right\}$
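The following sketch checks that the simplified expression agrees with the general formula. It assumes diagonal covariance matrices, represented as lists of variances, so that the trace, inverse, and determinant reduce to sums over coordinates; this restriction and the function names are assumptions made for the example.

```python
import math

def kl_diag_gaussians(mu1, var1, mu2, var2):
    """General Gaussian KL formula, specialised to diagonal covariances."""
    k = len(mu1)
    trace = sum(v1 / v2 for v1, v2 in zip(var1, var2))                      # tr(S2^-1 S1)
    maha = sum((m2 - m1) ** 2 / v2 for m1, m2, v2 in zip(mu1, mu2, var2))  # quadratic term
    log_det_ratio = sum(math.log(v2) - math.log(v1) for v1, v2 in zip(var1, var2))
    return 0.5 * (trace + maha - k + log_det_ratio)

def kl_to_standard_normal(mu, var):
    """Simplified form for D[N(mu, diag(var)) || N(0, I)]."""
    k = len(mu)
    return 0.5 * (sum(var) + sum(m * m for m in mu) - k - sum(math.log(v) for v in var))

# Hypothetical encoder outputs for a 2-dimensional latent space.
mu, var = [0.5, -1.0], [0.8, 1.5]
simplified = kl_to_standard_normal(mu, var)
general = kl_diag_gaussians(mu, var, [0.0, 0.0], [1.0, 1.0])
print(simplified, general)   # the two values agree

print(kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))  # 0.0
```

Because this term has a closed form, it can be differentiated with respect to $\mu(X)$ and $\Sigma(X)$ directly, without sampling.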



# Optimizing the objective (2): Reparametrization trick

We want to compute the first term by sampling.

$\mathbb{E}_{z \sim Q}[\log P(X|z)]$

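Sampling $z$ from $Q$ is not a differentiable operation, so in this form the dependence on the parameters of $Q$, such as $\mu(X)$ and $\Sigma(X)$, disappears from the computation. This prevents backpropagation through the sampling step.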

Let

$\epsilon \sim \mathcal{N}(0, I)$

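Then $z$ can be written as a deterministic, differentiable function of $\mu(X)$, $\Sigma(X)$, and the noise $\epsilon$:

$z = \mu(X) + \Sigma^{1/2}(X)\,\epsilon$

The expectation is now taken over $\epsilon$, whose distribution does not depend on the parameters of $Q$:

$\mathbb{E}_{z \sim Q}[\log P(X|z)] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0,I)}[\log P(X|z=\mu(X) + \Sigma^{1/2}(X)\,\epsilon)]$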
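A minimal one-dimensional sketch of why this rewriting helps: once $z$ is expressed through $\epsilon$, the chain rule gives gradients with respect to $\mu$ and $\sigma = \Sigma^{1/2}$ for each sample. As an assumption made only for this example, the function $g(z) = z^2$ stands in for $\log P(X|z)$, because then $\mathbb{E}[g(z)] = \mu^2 + \sigma^2$ and the exact gradients $2\mu$ and $2\sigma$ are known.

```python
import random

def reparam_gradients(mu, sigma, n_samples=100_000, seed=0):
    """Estimate d/dmu and d/dsigma of E[g(z)] with z = mu + sigma * eps, g(z) = z**2."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)   # noise; its distribution ignores mu and sigma
        z = mu + sigma * eps        # deterministic and differentiable in mu, sigma
        dg_dz = 2.0 * z             # derivative of the stand-in g(z) = z**2
        grad_mu += dg_dz * 1.0      # chain rule: dz/dmu = 1
        grad_sigma += dg_dz * eps   # chain rule: dz/dsigma = eps
    return grad_mu / n_samples, grad_sigma / n_samples

g_mu, g_sigma = reparam_gradients(mu=1.5, sigma=0.7)
print(g_mu, g_sigma)   # close to the exact gradients 2*mu = 3.0 and 2*sigma = 1.4
```

In a real model the hand-written derivative `dg_dz` would be replaced by backpropagation through the decoder; the point of the sketch is only that the sampling step no longer blocks the gradient path.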