# Variational Autoencoders
### Faustine Li and Luyang Wang 

--------------------

### Abstract

Variational autoencoders (VAEs) are a popular unsupervised learning method. Built on neural networks, which are powerful function approximators, VAEs can perform efficient approximate posterior inference on complicated distributions. Because the data is encoded in a lower-dimensional latent variable space, dimensionality reduction and compression are common applications. VAEs are often trained on images, where they encode hundreds or thousands of pixels into a single-digit number of latent variables. VAEs can also be thought of as generative probability models: realistic new images can be produced by decoding a sample from the latent variable space. In this project, we build a variational autoencoder from scratch in Python and train it on MNIST handwritten digit images.

### 1. Introduction

####  1.1 Background

Variational autoencoders are primarily generative models: they produce data that is similar to, but not exactly like, the data they are trained on. The data is often very high dimensional, such as images, text, or speech. Trying to design a model from scratch that can produce, for example, realistic images of cats would be incredibly daunting. Not only is there considerable variation among different breeds of cats, but cats are notoriously deformable. VAEs have become a popular research topic because they can reduce complex data to a low-dimensional latent variable representation. Generating a new data point then amounts to taking a sample from the latent probability distribution and decoding it.

As an example, consider a data point $x \in X$. This could be the set of pixel intensities in an image of a handwritten digit. We assume that there is an unknown distribution with parameters $\theta$ that produced the image. We also assume that there is a latent variable $z$, drawn from some distribution with parameters $\phi$, that is unobserved but influences the data generation step. Therefore $x$ comes from the conditional distribution $p(x \mid z, \ \theta)$. The figure below illustrates the probability model as a diagram. Each data point $x_i \in \{ x_1, \cdots, x_N \}$ is influenced by a random variable $z_i$ and the unknown parameters $\theta$.

![](resources/figure1.png)
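
The generative story in the diagram can be sketched as ancestral sampling: draw $z_i$ from the prior, then draw $x_i$ given $z_i$ and $\theta$. The snippet below is a minimal illustration, not the model used in this project. The function name, the linear-plus-sigmoid form of $p(x \mid z, \theta)$, and the random values standing in for $\theta$ are all assumptions made for the example.

```python
import numpy as np

def sample_generative_model(n_points, weights, bias, rng):
    """Ancestral sampling: z_i ~ N(0, I), then x_i ~ Bernoulli(sigmoid(z_i W + b)).

    `weights` and `bias` play the role of theta (hypothetical toy values).
    """
    n_latent = weights.shape[0]
    z = rng.standard_normal((n_points, n_latent))    # latent variables z_i
    logits = z @ weights + bias
    pixel_probs = 1.0 / (1.0 + np.exp(-logits))      # parameters of p(x | z, theta)
    x = rng.binomial(1, pixel_probs)                 # binary "pixels" x_i
    return z, x

rng = np.random.default_rng(0)
n_latent, n_pixels = 2, 784                          # 784 = 28 x 28 MNIST-sized image
theta_weights = rng.normal(size=(n_latent, n_pixels))
theta_bias = np.zeros(n_pixels)

z, x = sample_generative_model(5, theta_weights, theta_bias, rng)
print(z.shape, x.shape)
```

With random parameters the samples look like noise; the point of training is to find a $\theta$ for which they look like digits.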


#### 1.2 Variational Inference

If we could estimate the parameters $\theta$ of the marginal likelihood $p(x; \theta)$, we would have the parameters of the generative model and could sample new data points from it. By the rules of probability, the marginal likelihood is obtained by integrating out the latent variable.

$$p(x; \theta) = \int p(z) \ p(x \mid z, \ \theta) \ dz$$

The distribution $p(z)$ is assumed to come from some parametric family; in the case of VAEs it is often just $N(0, I)$. In Bayesian inference this is called the prior. The term $p(x \mid z)$ is the conditional distribution of the data given the latent variable. For very complex distributions, however, the integral quickly becomes intractable. Other inference methods such as MCMC are computationally expensive and scale poorly to very large datasets. Variational inference instead turns inference into an optimization problem, which makes it efficient even for very complicated models.
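
To make the integral concrete, here is a small sketch on a toy model where the answer is known. It assumes a one-dimensional linear-Gaussian model, $z \sim N(0, 1)$ and $x \mid z \sim N(wz, s^2)$, which is not the model used in this project; the function names and the values of $w$, $s$, and $x$ are hypothetical. In this special case the marginal is $N(0, w^2 + s^2)$, so a naive Monte Carlo estimate of the integral can be compared against the exact value.

```python
import numpy as np

def normal_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def marginal_likelihood_mc(x, w, s, n_samples, rng):
    """Estimate p(x) = E_{z ~ p(z)}[p(x | z)] by averaging over prior samples."""
    z = rng.standard_normal(n_samples)
    return normal_pdf(x, w * z, s).mean()

def marginal_likelihood_exact(x, w, s):
    """Closed form for the toy model: x ~ N(0, w^2 + s^2)."""
    return normal_pdf(x, 0.0, np.sqrt(w ** 2 + s ** 2))

rng = np.random.default_rng(0)
x_obs, w, s = 1.5, 2.0, 0.5   # hypothetical toy values
mc_estimate = marginal_likelihood_mc(x_obs, w, s, 200_000, rng)
exact_value = marginal_likelihood_exact(x_obs, w, s)
print(mc_estimate, exact_value)
```

Sampling from the prior works here because $z$ is one-dimensional. For images, almost every prior sample explains a given $x$ poorly, so the estimate needs far too many samples to be useful, which is one way to see why a better-targeted distribution over $z$ is needed.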

Variational inference works by approximating the posterior $p(z \mid X)$. We introduce an approximating distribution $q(z \mid \phi)$, where $\phi$ are variational parameters that need to be estimated. The distribution $q$ does not need to be in the same family as $p$; we simply tune $\phi$ so that $q$ is close to $p$. Closeness is measured using the Kullback–Leibler (KL) divergence, with the expectation taken over $z \sim q$.

$$KL(q \ || \ p) = E_{q}\left[\log \frac{q(z)}{p(z \mid x)}\right]$$
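
This divergence involves the very posterior we cannot compute, so it helps to spell out the standard decomposition of the log marginal likelihood. Expanding $p(z \mid x) = p(x \mid z) \ p(z) / p(x)$ inside the logarithm and rearranging gives:

$$\log p(x) = E_{q}[\log p(x \mid z)] - KL(q(z) \ || \ p(z)) + KL(q(z) \ || \ p(z \mid x))$$

Since $\log p(x)$ does not depend on $q$, maximizing the first two terms (the evidence lower bound) is equivalent to minimizing $KL(q(z) \ || \ p(z \mid x))$. Neither of those two terms requires the intractable posterior: the first is an expected reconstruction log-likelihood and the second is a divergence between $q$ and the prior.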

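In a VAE the approximate posterior is assumed to be a normal distribution $N(\mu, \sigma^2 I)$ and the prior is $N(0, I)$. The KL divergence between the two then has a closed form, where the sum runs over the latent dimensions:

$$KL(N(\mu, \sigma^2) \ || \  N(0, 1)) = \frac{-1}{2} \ \sum (1 + \log(\sigma^2) - \mu^2 - \sigma^2)$$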
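
The closed-form KL divergence between a diagonal Gaussian and the standard normal can be checked numerically. The sketch below assumes the variance is parameterized by its logarithm, `log_var` $= \log \sigma^2$, a common convention under which the formula contains `exp(log_var)`. The function names and test values are hypothetical. It compares the closed form with a Monte Carlo estimate of $E_q[\log q(z) - \log p(z)]$.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dims."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=-1)

def kl_monte_carlo(mu, log_var, n_samples, rng):
    """Monte Carlo estimate of the same KL, using samples z ~ q."""
    sigma = np.exp(0.5 * log_var)
    z = mu + sigma * rng.standard_normal((n_samples, mu.shape[-1]))
    log_2pi = np.log(2.0 * np.pi)
    log_q = -0.5 * np.sum(log_2pi + log_var + ((z - mu) / sigma) ** 2, axis=-1)
    log_p = -0.5 * np.sum(log_2pi + z ** 2, axis=-1)
    return np.mean(log_q - log_p)

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])                  # hypothetical encoder outputs
log_var = np.log(np.array([1.0, 0.25]))

kl_exact = kl_to_standard_normal(mu, log_var)
kl_sampled = kl_monte_carlo(mu, log_var, 200_000, rng)
print(kl_exact, kl_sampled)
```

The divergence is zero exactly when $\mu = 0$ and $\sigma^2 = 1$, that is, when the approximate posterior equals the prior.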

Now all we need is a function approximation technique to minimize the KL divergence.

#### 1.3 Variational Autoencoders in Practice

In practice, we use neural networks for function approximation. That means that variational autoencoders can be built using the large body of existing neural network packages and resources. In addition, they can borrow techniques for efficient training, such as stochastic gradient descent and mini-batching.

The architecture is similar to that of an autoencoder. The encoder part of the network encodes high-dimensional data into a lower-dimensional representation. The decoder turns that lower-dimensional representation into a reconstruction of the input data. Below is a schematic of the neural network architecture.

<img src="resources/figure2.png" width="40%">
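
As a rough illustration of the encoder and decoder, the forward pass can be sketched in NumPy as follows. This is a minimal sketch under assumptions, not the implementation developed later in the project: the layer sizes, the `tanh` hidden activation, the sigmoid output, and all function and parameter names are hypothetical choices.

```python
import numpy as np

def init_params(n_in, n_hidden, n_latent, rng):
    """Small random weights and zero biases for a one-hidden-layer encoder and decoder."""
    def dense(n_rows, n_cols):
        return rng.normal(0.0, 0.01, size=(n_rows, n_cols)), np.zeros(n_cols)
    return {
        "enc_h": dense(n_in, n_hidden),
        "enc_mu": dense(n_hidden, n_latent),
        "enc_log_var": dense(n_hidden, n_latent),
        "dec_h": dense(n_latent, n_hidden),
        "dec_out": dense(n_hidden, n_in),
    }

def encode(params, x):
    """Map inputs to the mean and log-variance of the approximate posterior."""
    w, b = params["enc_h"]
    h = np.tanh(x @ w + b)
    w_mu, b_mu = params["enc_mu"]
    w_lv, b_lv = params["enc_log_var"]
    return h @ w_mu + b_mu, h @ w_lv + b_lv

def decode(params, z):
    """Map latent codes back to pixel intensities in (0, 1)."""
    w, b = params["dec_h"]
    h = np.tanh(z @ w + b)
    w_out, b_out = params["dec_out"]
    return 1.0 / (1.0 + np.exp(-(h @ w_out + b_out)))

rng = np.random.default_rng(0)
params = init_params(n_in=784, n_hidden=128, n_latent=2, rng=rng)
x_batch = rng.random((4, 784))        # stand-in for four flattened 28 x 28 images
mu, log_var = encode(params, x_batch)
x_hat = decode(params, mu)            # decode the posterior mean (no sampling yet)
print(mu.shape, log_var.shape, x_hat.shape)
```

Unlike a plain autoencoder, the encoder here outputs the parameters of a distribution over $z$ rather than a single code.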

The loss function is the sum of the reconstruction loss (cross-entropy or squared error) and the KL divergence loss. We can take the gradient of the loss with respect to the weights and train using backpropagation. The only snag is that we cannot backpropagate through a stochastic node. We get around this with the reparameterization trick: instead of sampling $z \sim N(\mu, \sigma^2)$ directly, we write $z = \mu + \sigma \epsilon$ with $\epsilon \sim N(0, 1)$. The randomness is then confined to $\epsilon$, $z$ is a deterministic function of $\mu$ and $\sigma$, and we can train the network in the usual way.
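
The reparameterized sample and the two-part loss can be sketched as below. This is an illustration under assumptions: the encoder is taken to output `log_var` $= \log \sigma^2$, so that $\sigma = \exp(\texttt{log\_var} / 2)$, the reconstruction term is the cross-entropy option, and the function names and test values are hypothetical.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I); all randomness lives in eps."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def vae_loss(x, x_hat, mu, log_var):
    """Mean over the batch of cross-entropy reconstruction loss plus KL loss."""
    x_hat = np.clip(x_hat, 1e-7, 1.0 - 1e-7)   # avoid log(0)
    recon = -np.sum(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat), axis=-1)
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=-1)
    return np.mean(recon + kl)

rng = np.random.default_rng(0)

# Samples drawn through the trick have the requested mean and standard deviation.
mu = np.tile(np.array([1.0, -2.0]), (100_000, 1))
log_var = np.tile(np.log(np.array([0.25, 4.0])), (100_000, 1))
z = reparameterize(mu, log_var, rng)
print(z.mean(axis=0), z.std(axis=0))

# With an uninformative reconstruction (all 0.5) and q equal to the prior,
# the loss reduces to 784 * log(2).
x = np.zeros((3, 784))
loss = vae_loss(x, np.full((3, 784), 0.5), np.zeros((3, 2)), np.zeros((3, 2)))
print(loss)
```

Because $z$ depends on $\mu$ and $\sigma$ only through addition and multiplication, its derivatives are simple: $\partial z / \partial \mu = 1$ and $\partial z / \partial \sigma = \epsilon$.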

### 2. Implementation

### 4. Optimization

### 5. Results

### 6. Comparison

### 7. Conclusion

### References 

1. [Tutorial on Variational Autoencoders]()
2. [Auto-Encoding Variational Bayes]()
3. [The Truth about Cats and Dogs](https://www.robots.ox.ac.uk/~vgg/publications/2011/Parkhi11/parkhi11.pdf)
4. [Variational Inference - Blei](https://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf)