
## Variational Autoencoders Explained

[tutorial link](http://kvfrans.com/variational-autoencoders-explained/)

There were a couple of downsides to using a plain GAN:

First, the images are generated off some arbitrary noise. If you wanted to generate a picture with specific features, there's no way of determining which initial noise values would produce that picture, other than searching over the entire distribution.

Second, a generative adversarial model only discriminates between "real" and "fake" images. There are no constraints that an image of a cat has to look like a cat. This leads to results where there's no actual object in a generated image, but the style just looks like a picture.

In this post, I'll go over the variational autoencoder, a type of network that solves these two problems.

### What is a variational autoencoder?

To get an understanding of a VAE, we'll first start from a simple network and add parts step by step.

A common way of describing a neural network is as an approximation of some function we wish to model. However, a network can also be thought of as a data structure that holds information.

Let's say we had a network made up of a few deconvolution layers. We set the input to always be a vector of ones. Then, we can train the network to reduce the mean squared error between its output and one target image. The "data" for that image is now contained within the network's parameters.

![](http://kvfrans.com/content/images/2016/08/dat.jpg)

Now, let's try it on multiple images. Instead of a vector of ones, we'll use a one-hot vector for the input. [1, 0, 0, 0] could mean a cat image, while [0, 1, 0, 0] could mean a dog. This works, but we can only store up to 4 images. Using a longer vector means adding in more and more parameters so the network can memorize the different images.
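To make the "network as a data structure" idea concrete, here is a minimal sketch that swaps the deconvolution layers for a single linear layer, which is an assumption made purely to keep the example small. The function names `decode` and `memorize` and the two 4-pixel "images" are made up for illustration. Each one-hot slot gets its own row of weights, so storing more images means adding more parameters.

```python
def decode(weights, code):
    """Linear 'decoder': output pixel p is the code-weighted sum of column p."""
    n_pixels = len(weights[0])
    return [sum(c * row[p] for c, row in zip(code, weights)) for p in range(n_pixels)]


def memorize(images, steps=200, lr=0.5):
    """Fit one weight row per image so that one-hot code i reproduces images[i]."""
    n_images, n_pixels = len(images), len(images[0])
    # One row per one-hot slot: the parameter count grows with the number of images.
    weights = [[0.0] * n_pixels for _ in range(n_images)]
    codes = [[1.0 if j == i else 0.0 for j in range(n_images)] for i in range(n_images)]
    for _ in range(steps):
        for code, target in zip(codes, images):
            output = decode(weights, code)
            for j, c in enumerate(code):
                for p in range(n_pixels):
                    # Gradient of the mean squared error with respect to weights[j][p].
                    grad = 2.0 * (output[p] - target[p]) * c / n_pixels
                    weights[j][p] -= lr * grad
    return weights, codes


# Two made-up 4-pixel "images" standing in for the cat and the dog.
cat, dog = [0.0, 1.0, 1.0, 0.0], [1.0, 0.0, 0.0, 1.0]
weights, codes = memorize([cat, dog])
print(decode(weights, codes[0]))  # close to the cat pixels
print(len(weights) * len(weights[0]), "parameters for", len(codes), "images")
```

After training, the images live entirely in `weights`; the one-hot code only selects which row to read back.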

To fix this, we use a vector of real numbers instead of a one-hot vector. We can think of this as a code for an image, which is where the terms encode/decode come from. For example, [3.3, 4.5, 2.1, 9.8] could represent the cat image, while [3.4, 2.1, 6.7, 4.2] could represent the dog. This initial vector is known as our latent variables.

Choosing the latent variables randomly, like I did above, is obviously a bad idea. In an autoencoder, we add in another component that takes in the original images and encodes them into vectors for us. The deconvolutional layers then "decode" the vectors back to the original images.

![](http://kvfrans.com/content/images/2016/08/autoenc.jpg)

We've finally reached a stage where our model has some hint of a practical use. We can train our network on as many images as we want. If we save the encoded vector of an image, we can reconstruct it later by passing it into the decoder portion. What we have is the standard autoencoder.

However, we're trying to build a generative model here, not just a fuzzy data structure that can "memorize" images. We can't generate anything yet, since we don't know how to create latent vectors other than encoding them from images.



There's a simple solution here. We add a constraint on the encoding network that forces it to generate latent vectors that roughly follow a unit gaussian distribution. It is this constraint that separates a variational autoencoder from a standard one.

Generating new images is now easy: all we need to do is sample a latent vector from the unit gaussian and pass it into the decoder.
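As a rough sketch of that generation step: draw every latent value from a unit gaussian, then hand the vector to the decoder. The `toy_decoder` below is a hypothetical stand-in (a fixed linear map from 2 latents to 4 "pixels"), not a trained network; only the sampling step reflects the text.

```python
import random


def toy_decoder(z):
    """Hypothetical stand-in for a trained decoder: fixed linear map, 2 latents -> 4 pixels."""
    weights = [[0.5, -0.2], [0.1, 0.9], [-0.7, 0.3], [0.4, 0.4]]
    return [sum(w * zi for w, zi in zip(row, z)) for row in weights]


def generate(decoder, n_latent, rng):
    """Sample a latent vector from the unit gaussian and decode it."""
    z = [rng.gauss(0.0, 1.0) for _ in range(n_latent)]
    return z, decoder(z)


rng = random.Random(0)  # seeded so the sketch is repeatable
z, image = generate(toy_decoder, n_latent=2, rng=rng)
print("latent:", z)
print("decoded:", image)
```

With a real VAE, `toy_decoder` would be replaced by the trained decoder network; the sampling line stays the same.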

In practice, there's a tradeoff between how accurate our network can be and how closely its latent variables can match the unit gaussian distribution.

We let the network decide this itself. For our loss term, we sum up two separate losses: the generative loss, which is a mean squared error that measures how accurately the network reconstructed the images, and a latent loss, which is the KL divergence that measures how closely the latent variables match a unit gaussian.

```
generation_loss = mean(square(generated_image - real_image))  
latent_loss = KL-Divergence(latent_variable, unit_gaussian)  
loss = generation_loss + latent_loss 
```

In order to optimize the KL divergence, we need to apply a simple reparameterization trick: instead of the encoder generating a vector of real values, it will generate a vector of means and a vector of standard deviations.

![](http://kvfrans.com/content/images/2016/08/vae.jpg)

This lets us calculate KL divergence as follows:

```
# z_mean and z_stddev are two vectors generated by the encoder network
# (TensorFlow 1.x spelled the log op tf.log; tf.math.log is the current name)
latent_loss = 0.5 * tf.reduce_sum(tf.square(z_mean) + tf.square(z_stddev) - tf.math.log(tf.square(z_stddev)) - 1, 1)
```
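To sanity-check the expression above without TensorFlow, here is a minimal pure-Python version of the same closed form for one latent vector. The function name `kl_to_unit_gaussian` is made up for this sketch. It should return zero when the encoder outputs exactly a unit gaussian and grow as the means drift from 0 or the standard deviations drift from 1.

```python
import math


def kl_to_unit_gaussian(z_mean, z_stddev):
    """KL(N(mean, stddev^2) || N(0, 1)), summed over the latent dimensions."""
    return 0.5 * sum(
        m * m + s * s - math.log(s * s) - 1.0
        for m, s in zip(z_mean, z_stddev)
    )


print(kl_to_unit_gaussian([0.0, 0.0], [1.0, 1.0]))  # already a unit gaussian
print(kl_to_unit_gaussian([1.0], [1.0]))            # mean shifted by one
print(kl_to_unit_gaussian([0.0], [2.0]))            # too wide
print(kl_to_unit_gaussian([0.0], [0.1]))            # too narrow
```

Note that both "too wide" and "too narrow" are penalized, so the encoder cannot cheat by shrinking the standard deviations toward zero.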

When we're calculating loss for the decoder network, we sample unit gaussian noise, scale it by the standard deviations, and add the means. The result is our latent vector:

```
# unit gaussian noise, one value per latent dimension (tf.random_normal in TensorFlow 1.x)
samples = tf.random.normal([batchsize, n_z], 0, 1, dtype=tf.float32)
sampled_z = z_mean + (z_stddev * samples)
```
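The same sampling step can be checked in plain Python. This is a sketch with a made-up helper name, `sample_latent`: scaling unit gaussian noise by the standard deviation and shifting by the mean should yield draws whose empirical mean and spread match the encoder's outputs.

```python
import random


def sample_latent(z_mean, z_stddev, rng):
    """Draw z = mean + stddev * eps with eps ~ N(0, 1), one eps per dimension."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(z_mean, z_stddev)]


rng = random.Random(0)  # seeded so the check is repeatable
draws = [sample_latent([2.0], [0.5], rng)[0] for _ in range(20000)]
empirical_mean = sum(draws) / len(draws)
empirical_std = (sum((d - empirical_mean) ** 2 for d in draws) / len(draws)) ** 0.5
print(empirical_mean, empirical_std)  # should land near 2.0 and 0.5
```

Because the randomness lives entirely in `eps`, the sampled value is a simple function of the mean and standard deviation, which is what lets gradients flow back into the encoder.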

In addition to allowing us to generate random latent variables, this constraint also improves the generalization of our network.

To visualize this, we can think of the latent variable as a transfer of data.

Let's say you were given a bunch of real numbers in [0, 10], each paired with a name. For example, 5.43 means apple, and 5.44 means banana. When someone gives you the number 5.43, you know for sure they are talking about an apple. We can essentially encode infinite information this way, since there's no limit on how many different real numbers we can have in [0, 10].

However, what if gaussian noise with a standard deviation of one was added every time someone tried to tell you a number? Now when you receive the number 5.43, the original number could have been anywhere around [4.4 ~ 6.4], so the other person could just as well have meant banana (5.44).

The greater the standard deviation of the added noise, the less information we can pass using that one variable.
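A quick simulation of the apple/banana example shows the effect. This is a sketch under simple assumptions: the sender picks one of the two codes at random, gaussian noise is added, and the receiver guesses the nearest code. The helper name `decode_accuracy` and the trial count are made up for illustration.

```python
import random

NAMES = {5.43: "apple", 5.44: "banana"}  # the two codes from the example


def decode_accuracy(noise_std, trials=10000, seed=0):
    """Fraction of noisy messages that the nearest-code rule decodes correctly."""
    rng = random.Random(seed)
    codes = list(NAMES)
    correct = 0
    for _ in range(trials):
        sent = rng.choice(codes)
        received = sent + rng.gauss(0.0, noise_std)
        guess = min(codes, key=lambda c: abs(c - received))
        correct += guess == sent
    return correct / trials


for std in (0.001, 0.01, 1.0):
    print(f"noise std {std}: accuracy {decode_accuracy(std):.3f}")
```

With tiny noise the two codes are easy to tell apart; with a standard deviation of one the receiver does little better than a coin flip, so codes this close together carry almost no information.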

Now we can apply this same logic to the latent variable passed between the encoder and decoder. The more efficiently we can encode the original image, the higher we can raise the standard deviation on our gaussian until it reaches one.

This constraint forces the encoder to be very efficient, creating information-rich latent variables. This improves generalization, so latent variables that we either generate randomly or obtain by encoding non-training images will produce a nicer result when decoded.

## Variational Autoencoder in PyTorch

[tutorial link](https://vxlabs.com/2017/12/08/variational-autoencoder-in-pytorch-commented-and-annotated/)

The general idea of the autoencoder (AE) is to squeeze information through a narrow bottleneck between the mirrored encoder (input) and decoder (output) parts of a neural network. (see the diagram below)

Because the network architecture and loss function are set up so that the output tries to emulate the input, the network has to learn how to encode input data in the very limited space represented by the bottleneck.
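The bottleneck idea can be illustrated with a deliberately tiny linear autoencoder in plain Python. This is a toy sketch, not the PyTorch model from the tutorial: the data, starting weights, learning rate, and function names are all assumptions chosen for illustration. The 2-D points lie on a line, so a single latent number is enough to reconstruct them once the weights are trained.

```python
def encode(enc, x):
    """Bottleneck: squeeze a 2-D point into a single number."""
    return enc[0] * x[0] + enc[1] * x[1]


def decode(dec, z):
    """Expand the single latent number back into a 2-D point."""
    return [dec[0] * z, dec[1] * z]


def reconstruction_loss(enc, dec, data):
    """Mean squared error between inputs and their reconstructions."""
    total = 0.0
    for x in data:
        x_hat = decode(dec, encode(enc, x))
        total += sum((a - b) ** 2 for a, b in zip(x_hat, x)) / len(x)
    return total / len(data)


def train(data, epochs=500, lr=0.05):
    enc, dec = [0.5, 0.5], [0.5, 0.5]  # arbitrary starting weights
    for _ in range(epochs):
        for x in data:
            z = encode(enc, x)
            residual = [h - t for h, t in zip(decode(dec, z), x)]
            # Gradients of the per-sample mean squared error.
            grad_z = sum(r * d for r, d in zip(residual, dec))
            grad_dec = [r * z for r in residual]
            grad_enc = [grad_z * xi for xi in x]
            dec = [d - lr * g for d, g in zip(dec, grad_dec)]
            enc = [e - lr * g for e, g in zip(enc, grad_enc)]
    return enc, dec


# Made-up data: points on the line y = 2x, so one number describes each point.
data = [[t, 2.0 * t] for t in (-1.0, -0.5, 0.5, 1.0)]
initial_loss = reconstruction_loss([0.5, 0.5], [0.5, 0.5], data)
enc, dec = train(data)
final_loss = reconstruction_loss(enc, dec, data)
print(initial_loss, final_loss)
```

The output is forced to match the input, yet everything has to pass through the single value `z`, so training pushes the encoder to find the one direction that explains the data.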

Variational Autoencoders, or VAEs, are an extension of AEs that additionally force the network to ensure that samples are normally distributed over the space represented by the bottleneck.

They do this by having the encoder output two n-dimensional vectors (where n is the number of dimensions in the latent space) representing the mean and the standard deviation. These Gaussians are sampled, and the samples are sent through the decoder. This is the reparameterization step; also see my comments in the reparameterize() function.

The loss function has a term for input-output similarity, and, importantly, it has a second term that uses the Kullback–Leibler divergence to test how close the learned Gaussians are to unit Gaussians.

In other words, this extension to AEs enables us to derive Gaussian distributed latent spaces from arbitrary data. Given for example a large set of shapes, the latent space would be a high-dimensional space where each shape is represented by a single point, and the points would be normally distributed over all dimensions. With this one can represent existing shapes, but one can also synthesise completely new and plausible shapes by sampling points in latent space.

### Results using MNIST


Below you see 64 random samples of a two-dimensional latent space of MNIST digits that I made with the example below, with ZDIMS=2.

![](https://vxlabs.com/wp-content/uploads/2017/12/pytorch-vae-sample-z2-epoch10.png?w=660&ssl=1)

Next is the reconstruction of 8 random unseen test digits via a more reasonable 20-dimensional latent space. Keep in mind that the VAE has learned a 20-dimensional normal distribution for any input digit, from which samples are drawn and passed through the decoder to produce outputs that appear similar to the input.

![](https://vxlabs.com/wp-content/uploads/2017/12/pytorch-vae-reconstruction-z10-epoch10.png?w=660&ssl=1)

### A diagram of a simple VAE

An example VAE, incidentally also the one implemented in the PyTorch code below, looks like this:

![](https://vxlabs.com/wp-content/uploads/2017/12/pytorch-vae-arch-2.png?resize=660%2C317&ssl=1)

```
import os
import torch
import torch.utils.data
from torch import nn, optim
from torch.autograd import Variable
from torch.nn import functional as F
from torchvision import datasets, transforms
from torchvision.utils import save_image
```