<a href="https://colab.research.google.com/github/ShaunakSen/Deep-Learning/blob/master/Intro_to_VAE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Variational Autoencoders Explained

[tutorial link](http://kvfrans.com/variational-autoencoders-explained/)

There were a couple of downsides to using a plain GAN.

First, the images are generated off some arbitrary noise. If you wanted to generate a picture with specific features, there's no way of determining which initial noise values would produce that picture, other than searching over the entire distribution.

Second, a generative adversarial model only discriminates between "real" and "fake" images. There are no constraints that an image of a cat has to look like a cat. This leads to results where there's no actual object in a generated image, but the style just looks like a picture.

In this post, I'll go over the variational autoencoder, a type of network that solves these two problems.

### What is a variational autoencoder?

To get an understanding of a VAE, we'll first start from a simple network and add parts step by step.

A common way of describing a neural network is as an approximation of some function we wish to model. However, networks can also be thought of as a data structure that holds information.

Let's say we had a network comprised of a few deconvolution layers. We set the input to always be a vector of ones. Then, we can train the network to reduce the mean squared error between its output and one target image. The "data" for that image is now contained within the network's parameters.

![](http://kvfrans.com/content/images/2016/08/dat.jpg)
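
As a rough sketch of this idea in PyTorch, a tiny deconvolutional network can be trained to reproduce one image from a constant input. The layer sizes, the random 8x8 "target image", and the training settings are illustrative assumptions, not taken from the tutorial.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical 8x8 "target image"; any fixed tensor with values in [0, 1] would do.
target = torch.rand(1, 1, 8, 8)

# Constant input: a vector of ones, shaped as a 4-channel 1x1 feature map.
ones = torch.ones(1, 4, 1, 1)

# Two transposed-convolution ("deconvolution") layers: 1x1 -> 4x4 -> 8x8.
net = nn.Sequential(
    nn.ConvTranspose2d(4, 8, kernel_size=4),            # 1x1 -> 4x4
    nn.ReLU(),
    nn.ConvTranspose2d(8, 1, kernel_size=2, stride=2),  # 4x4 -> 8x8
    nn.Sigmoid(),
)

optimizer = torch.optim.Adam(net.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

initial_loss = loss_fn(net(ones), target).item()
for _ in range(500):
    optimizer.zero_grad()
    loss = loss_fn(net(ones), target)  # MSE between network output and the target
    loss.backward()
    optimizer.step()
final_loss = loss_fn(net(ones), target).item()

print("MSE before training:", initial_loss)
print("MSE after training: ", final_loss)
```

The input never changes, so whatever the network outputs after training is stored entirely in its weights.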

Now, let's try it on multiple images. Instead of a vector of ones, we'll use a one-hot vector for the input. [1, 0, 0, 0] could mean a cat image, while [0, 1, 0, 0] could mean a dog. This works, but we can only store up to 4 images. Using a longer vector means adding in more and more parameters so the network can memorize the different images.

To fix this, we use a vector of real numbers instead of a one-hot vector. We can think of this as a code for an image, which is where the terms encode/decode come from. For example, [3.3, 4.5, 2.1, 9.8] could represent the cat image, while [3.4, 2.1, 6.7, 4.2] could represent the dog. This initial vector is known as our latent variables.

Choosing the latent variables randomly, like I did above, is obviously a bad idea. In an autoencoder, we add in another component that takes in the original images and encodes them into vectors for us. The deconvolutional layers then "decode" the vectors back to the original images.

![](http://kvfrans.com/content/images/2016/08/autoenc.jpg)
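
A minimal sketch of such an encoder/decoder pair is shown below. For brevity it uses fully connected layers instead of the deconvolutional layers described above, and the class name `AutoEncoder`, the layer widths, and the random stand-in batch are all assumptions made for illustration.

```python
import torch
from torch import nn


class AutoEncoder(nn.Module):
    """Encoder compresses a flattened image to a short code; decoder inverts it."""

    def __init__(self, n_pixels: int = 784, n_latent: int = 4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_pixels, 64), nn.ReLU(), nn.Linear(64, n_latent)
        )
        self.decoder = nn.Sequential(
            nn.Linear(n_latent, 64), nn.ReLU(), nn.Linear(64, n_pixels), nn.Sigmoid()
        )

    def forward(self, x):
        code = self.encoder(x)            # latent variables chosen by the network
        return self.decoder(code), code   # reconstruction and its code


torch.manual_seed(0)
model = AutoEncoder()
images = torch.rand(16, 784)              # stand-in for a batch of flattened images
recon, code = model(images)
loss = nn.functional.mse_loss(recon, images)  # reconstruction error to minimise

print(code.shape, recon.shape, loss.item())
```

Saving `code` for an image and later calling `model.decoder(code)` is the "fuzzy data structure" use described in the next paragraph.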

We've finally reached a stage where our model has some hint of a practical use. We can train our network on as many images as we want. If we save the encoded vector of an image, we can reconstruct it later by passing it into the decoder portion. What we have is the standard autoencoder.

However, we're trying to build a generative model here, not just a fuzzy data structure that can "memorize" images. We can't generate anything yet, since we don't know how to create latent vectors other than encoding them from images.



There's a simple solution here. We add a constraint on the encoding network, that forces it to generate latent vectors that roughly follow a unit gaussian distribution. It is this constraint that separates a variational autoencoder from a standard one.

Generating new images is now easy: all we need to do is sample a latent vector from the unit gaussian and pass it into the decoder.

In practice, there's a tradeoff between how accurate our network can be and how close its latent variables can match the unit gaussian distribution.

We let the network decide this itself. For our loss term, we sum up two separate losses: the generative loss, which is a mean squared error that measures how accurately the network reconstructed the images, and a latent loss, which is the KL divergence that measures how closely the latent variables match a unit gaussian.

```
generation_loss = mean(square(generated_image - real_image))  
latent_loss = KL-Divergence(latent_variable, unit_gaussian)  
loss = generation_loss + latent_loss 
```

In order to optimize the KL divergence, we need to apply a simple reparameterization trick: instead of the encoder generating a vector of real values, it will generate a vector of means and a vector of standard deviations.

![](http://kvfrans.com/content/images/2016/08/vae.jpg)

This lets us calculate KL divergence as follows:

```
# z_mean and z_stddev are two vectors generated by encoder network
latent_loss = 0.5 * tf.reduce_sum(tf.square(z_mean) + tf.square(z_stddev) - tf.log(tf.square(z_stddev)) - 1,1)  
```
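
As a sanity check on this formula, here is a small NumPy sketch that compares the closed-form expression with a Monte Carlo estimate of the same KL divergence. The example means and standard deviations are arbitrary assumptions, and `kl_to_unit_gaussian` is a name introduced here for illustration.

```python
import numpy as np


def kl_to_unit_gaussian(z_mean, z_stddev):
    """Closed-form KL( N(mean, stddev^2) || N(0, 1) ), summed over latent dims."""
    return 0.5 * np.sum(z_mean**2 + z_stddev**2 - np.log(z_stddev**2) - 1, axis=-1)


z_mean = np.array([0.5, -1.0])
z_stddev = np.array([0.8, 1.5])

closed_form = kl_to_unit_gaussian(z_mean, z_stddev)

# Monte Carlo estimate: average of log q(z) - log p(z) over samples z ~ q.
rng = np.random.default_rng(0)
z = z_mean + z_stddev * rng.standard_normal((200_000, 2))
log_q = -0.5 * ((z - z_mean) / z_stddev) ** 2 - np.log(z_stddev) - 0.5 * np.log(2 * np.pi)
log_p = -0.5 * z**2 - 0.5 * np.log(2 * np.pi)
monte_carlo = np.mean(np.sum(log_q - log_p, axis=1))

print("closed form:", closed_form)
print("monte carlo:", monte_carlo)
```

The divergence is exactly zero when the mean is 0 and the standard deviation is 1, which is the point the latent loss pulls the encoder towards.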

When we're calculating loss for the decoder network, we can just sample unit Gaussian noise, scale it by the standard deviations, add the means, and use the result as our latent vector:

```
samples = tf.random_normal([batchsize,n_z],0,1,dtype=tf.float32)  
sampled_z = z_mean + (z_stddev * samples)  
```
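
The same two lines can be mimicked in NumPy to confirm that the resulting samples have the requested means and standard deviations. The values of `z_mean` and `z_stddev` below are made up for illustration; in the real network they come from the encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
batchsize, n_z = 50_000, 3

# Hypothetical encoder outputs for one input (one value per latent dimension).
z_mean = np.array([2.0, -1.0, 0.0])
z_stddev = np.array([0.5, 1.0, 2.0])

# Draw unit Gaussian noise, then shift and scale it.
samples = rng.standard_normal((batchsize, n_z))
sampled_z = z_mean + z_stddev * samples

empirical_mean = sampled_z.mean(axis=0)
empirical_std = sampled_z.std(axis=0)
print("empirical mean:  ", empirical_mean)
print("empirical stddev:", empirical_std)
```

Because the randomness enters only through `samples`, the result is a plain differentiable function of `z_mean` and `z_stddev`, which is what lets gradients reach the encoder.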

In addition to allowing us to generate random latent variables, this constraint also improves the generalization of our network.

To visualize this, we can think of the latent variable as a transfer of data.

Let's say you were given a bunch of real numbers between 0 and 10, each paired with a name. For example, 5.43 means apple, and 5.44 means banana. When someone gives you the number 5.43, you know for sure they are talking about an apple. We can essentially encode infinite information this way, since there's no limit on how many different real numbers we can have between 0 and 10.

However, what if Gaussian noise with a standard deviation of one was added every time someone tried to tell you a number? Now when you receive the number 5.43, the original number could have been anywhere around [4.4 ~ 6.4], so the other person could just as well have meant banana (5.44).

The greater the standard deviation of the added noise, the less information we can pass using that one variable.
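
The apple/banana example can be simulated directly. In this sketch the receiver decodes each noisy number to the nearest known code; the decoding rule, the trial count, and the function name `decoding_accuracy` are assumptions added for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

codes = {"apple": 5.43, "banana": 5.44}
names = list(codes)
values = np.array([codes[name] for name in names])


def decoding_accuracy(noise_std, trials=20_000):
    """Fraction of messages decoded correctly under Gaussian noise."""
    sent = rng.integers(0, len(names), size=trials)
    received = values[sent] + noise_std * rng.standard_normal(trials)
    # Decode to whichever known code is closest to the received number.
    decoded = np.abs(received[:, None] - values[None, :]).argmin(axis=1)
    return np.mean(decoded == sent)


for noise_std in (0.0, 0.001, 1.0):
    print("noise stddev", noise_std, "-> accuracy", decoding_accuracy(noise_std))
```

With no noise every message gets through, while at a standard deviation of one the two codes are so close that decoding is barely better than guessing.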

Now we can apply this same logic to the latent variable passed between the encoder and decoder. The more efficiently we can encode the original image, the higher we can raise the standard deviation on our gaussian until it reaches one.

This constraint forces the encoder to be very efficient, creating information-rich latent variables. This improves generalization, so latent variables that we either randomly generated, or we got from encoding non-training images, will produce a nicer result when decoded.

## Variational Autoencoder in PyTorch

[tutorial link](https://vxlabs.com/2017/12/08/variational-autoencoder-in-pytorch-commented-and-annotated/)

The general idea of the autoencoder (AE) is to squeeze information through a narrow bottleneck between the mirrored encoder (input) and decoder (output) parts of a neural network. (see the diagram below)

Because the network architecture and loss function are set up so that the output tries to emulate the input, the network has to learn how to encode input data in the very limited space represented by the bottleneck.

Variational Autoencoders, or VAEs, are an extension of AEs that additionally force the network to ensure that samples are normally distributed over the space represented by the bottleneck.

They do this by having the encoder output two n-dimensional vectors (where n is the number of dimensions in the latent space) representing the mean and the standard deviation. These Gaussians are sampled, and the samples are sent through the decoder. This is the reparameterization step; also see my comments in the reparameterize() function.

The loss function has a term for input-output similarity, and, importantly, it has a second term that uses the Kullback–Leibler divergence to test how close the learned Gaussians are to unit Gaussians.

In other words, this extension to AEs enables us to derive Gaussian distributed latent spaces from arbitrary data. Given for example a large set of shapes, the latent space would be a high-dimensional space where each shape is represented by a single point, and the points would be normally distributed over all dimensions. With this one can represent existing shapes, but one can also synthesise completely new and plausible shapes by sampling points in latent space.

### Results using MNIST


Below you see 64 random samples of a two-dimensional latent space of MNIST digits that I made with the example below, with ZDIMS=2.

![](https://vxlabs.com/wp-content/uploads/2017/12/pytorch-vae-sample-z2-epoch10.png?w=660&ssl=1)

Next is the reconstruction of 8 random unseen test digits via a more reasonable 20-dimensional latent space. Keep in mind that the VAE has learned a 20-dimensional normal distribution for any input digit; samples drawn from it are passed through the decoder to produce output that appears similar to the input.

![](https://vxlabs.com/wp-content/uploads/2017/12/pytorch-vae-reconstruction-z10-epoch10.png?w=660&ssl=1)

### A diagram of a simple VAE

An example VAE, incidentally also the one implemented in the PyTorch code below, looks like this:

![](https://vxlabs.com/wp-content/uploads/2017/12/pytorch-vae-arch-2.png?resize=660%2C317&ssl=1)

In [0]:
import os
import torch
import torch.utils.data
from torch import nn, optim
from torch.autograd import Variable
from torch.nn import functional as F
from torchvision import datasets, transforms
from torchvision.utils import save_image

In [0]:
# changed configuration to this instead of argparse for easier interaction
CUDA = torch.cuda.is_available()  # use the GPU only when one is present
SEED = 1
BATCH_SIZE = 128
LOG_INTERVAL = 10
EPOCHS = 10
# connections through the autoencoder bottleneck
# in the pytorch VAE example, this is 20
ZDIMS = 20

In [21]:
torch.manual_seed(SEED)
if CUDA:
    print ("yes") 
    torch.cuda.manual_seed(SEED)

# pin_memory keeps loaded batches in page-locked host memory,
# which speeds up the later copy to the GPU
kwargs = {'num_workers': 1, 'pin_memory': True} if CUDA else {}

yes


In [0]:
# Download or load downloaded MNIST dataset
# shuffle data at every epoch

train_loader = torch.utils.data.DataLoader(dataset=datasets.MNIST('data', train=True, download=True, transform=transforms.ToTensor()),
                                            batch_size=BATCH_SIZE, shuffle=True, **kwargs)


# each training batch holds 128 examples with 1 input channel (grayscale) of 28x28 pixels, so [128,1,28,28]

# Same for test data
test_loader = torch.utils.data.DataLoader(
    datasets.MNIST('data', train=False, transform=transforms.ToTensor()),
    batch_size=BATCH_SIZE, shuffle=True, **kwargs)

In [23]:
train_loader.dataset.data[0].shape



torch.Size([28, 28])

In [0]:
class VAE(nn.Module):
  def __init__(self):
    super(VAE, self).__init__()

    #-----------ENCODER-----------------

    # 28 x 28 pixels = 784 input pixels, 400 outputs

    self.fc1 = nn.Linear(784, 400)
    # rectified linear unit layer from 400 to 400
    # max(0, x)
    self.relu = nn.ReLU()

    self.fc21 = nn.Linear(400, ZDIMS) # mu layer
    self.fc22 = nn.Linear(400, ZDIMS) # logvariance layer
    # this last layer bottlenecks through ZDIMS connections

    #-----------DECODER-----------------


    # from bottleneck to hidden 400

    self.fc3 = nn.Linear(ZDIMS, 400)

    # from hidden 400 to 784 outputs
    self.fc4 = nn.Linear(400, 784)

    self.sigmoid = nn.Sigmoid()
  
  

In [0]:
def encode(self, x: Variable) -> (Variable, Variable):
  """
  Input vector x -> fully connected 1 -> ReLU -> (fully connected
  21, fully connected 22)

  Parameters
  ----------
  x : [128, 784] matrix; 128 digits of 28x28 pixels each

  Returns
  -------

  (mu, logvar) : ZDIMS (here 20) mean units, one for each latent dimension, and
      ZDIMS log-variance units, one for each latent dimension

  mu : [128, ZDIMS] mean matrix
  logvar : [128, ZDIMS] log-variance matrix

  """

  # h1 is [128, 400]
  h1 = self.relu(self.fc1(x))  # type: Variable

  return self.fc21(h1), self.fc22(h1)

In [0]:
def reparameterize(self, mu: Variable, logvar: Variable) -> Variable:
  
  """
  THE REPARAMETERIZATION IDEA:
  
  For each training sample (we get 128 images batched at a time)
  
  - take the current learned mu, stddev for each of the ZDIMS
    dimensions and draw a random sample from that distribution
  
  - the whole network is trained so that these randomly drawn
    samples decode to output that looks like the input
    
  - which will mean that the std, mu will be learned
    *distributions* that correctly encode the inputs
    
  - due to the additional KLD term (see loss_function() below)
    the distribution will tend to unit Gaussians
  
  Parameters
  ----------
  mu : [128, ZDIMS] mean matrix
  logvar : [128, ZDIMS] log-variance matrix

  Returns
  -------

  During training random sample from the learned ZDIMS-dimensional
  normal distribution; during inference its mean.

  """
  
  if self.training:
    # logvar is log(sigma^2), so sigma = exp(0.5 * logvar):
    # multiply the log variance by 0.5, then exponentiate,
    # yielding the standard deviation
    
    std = torch.exp(0.5 * logvar) # type: Variable
    
    # - eps is [128,ZDIMS] with all elements drawn from a mean 0
    #   and stddev 1 normal distribution, i.e. 128 samples
    #   of random ZDIMS-float vectors, on the same device as std
    
    eps = torch.randn_like(std)
    
    # - sample from a normal distribution with standard
    #   deviation = std and mean = mu by multiplying mean 0
    #   stddev 1 sample with desired std and mu, see
    #   https://stats.stackexchange.com/a/16338
    # - so we have 128 sets (the batch) of random ZDIMS-float
    #   vectors sampled from normal distribution with learned
    #   std and mu for the current input
    
    return eps.mul(std).add_(mu)
  
  else:
    # During inference, we simply spit out the mean of the
    # learned distribution for the current input.  We could
    # use a random sample from the distribution, but mu of
    # course has the highest probability.
    return mu
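
To see why this formulation matters for training, the following sketch checks that gradients flow back to `mu` and `logvar` through the sampled `z`. The tensor shapes are arbitrary assumptions; the arithmetic mirrors the training branch of reparameterize() above.

```python
import torch

torch.manual_seed(0)

# Stand-ins for encoder outputs: a batch of 4 inputs, 2 latent dimensions.
mu = torch.zeros(4, 2, requires_grad=True)
logvar = torch.zeros(4, 2, requires_grad=True)

std = torch.exp(0.5 * logvar)   # sigma = exp(0.5 * log(sigma^2))
eps = torch.randn_like(std)     # the only source of randomness
z = mu + eps * std              # sample from N(mu, sigma^2)

z.sum().backward()

# dz/dmu = 1 and dz/dlogvar = 0.5 * eps * std
print(mu.grad)
print(logvar.grad)
```

Since `eps` is treated as a constant input, `z` is an ordinary differentiable function of the two encoder outputs.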

In [0]:
def decode(self, z:Variable) -> Variable:
  #[128, 20] -> [128, 400] -> [128, 784]
  
  h3 = self.relu(self.fc3(z))
  return self.sigmoid(self.fc4(h3))
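
The decoder can also be used on its own to generate digits: draw latent vectors from the unit Gaussian and decode them. The sketch below uses an untrained stand-in `decoder` with the same layer shapes as fc3 and fc4, so its output is noise; with a trained model one would presumably call `model.decode(z)` instead.

```python
import torch
from torch import nn

torch.manual_seed(0)
ZDIMS = 20  # latent size used in this notebook

# Stand-in decoder (an assumption for illustration): ZDIMS -> 400 -> 784.
decoder = nn.Sequential(
    nn.Linear(ZDIMS, 400), nn.ReLU(), nn.Linear(400, 784), nn.Sigmoid()
)

with torch.no_grad():                         # no gradients needed for sampling
    z = torch.randn(64, ZDIMS)                # 64 latent vectors from N(0, I)
    images = decoder(z).view(64, 1, 28, 28)   # reshape to image batches

print(images.shape)
```

A batch shaped like this can be written to disk as a grid with the `save_image` helper imported at the top of the notebook.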

In [0]:
def forward(self, x:Variable) -> (Variable, Variable, Variable):
  mu, logvar = self.encode(x.view(-1, 784))
  z = self.reparameterize(mu, logvar)
  return self.decode(z), mu, logvar

In [0]:
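# encode(), reparameterize(), decode() and forward() were defined above as
# plain functions in separate cells, so attach them to the class as methods
VAE.encode, VAE.reparameterize = encode, reparameterize
VAE.decode, VAE.forward = decode, forward

# init the model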

model = VAE()

if CUDA:
  model.cuda()
  
# define ADAM as optimizer
optimizer = optim.Adam(model.parameters(), lr=1e-3)
  
def loss_function(recon_x, x, mu, logvar) -> Variable:
  # how well do input x and output recon_x agree?
  
  BCE = F.binary_cross_entropy(recon_x, x.view(-1, 784))
  
  # KLD is Kullback–Leibler divergence -- how much does one learned
  # distribution deviate from another, in this specific case the
  # learned distribution from the unit Gaussian

  # see Appendix B from VAE paper:
  # Kingma and Welling. Auto-Encoding Variational Bayes. ICLR, 2014
  # https://arxiv.org/abs/1312.6114
  # - D_{KL} = 0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
  # note the negative D_{KL} in appendix B of the paper
  KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
  # Normalise by same number of elements as in reconstruction
  # (use the actual batch: the last batch can be smaller than BATCH_SIZE)
  KLD /= x.numel()

  # BCE tries to make our reconstruction as accurate as possible
  # KLD tries to push the distributions as close as possible to unit Gaussian
  return BCE + KLD
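
The KLD line above is written in terms of the log variance, whereas the TensorFlow snippet earlier uses the standard deviation. This small check, with randomly generated stand-in values for `mu` and `logvar`, suggests the two forms agree and that the divergence vanishes at the unit Gaussian. The helper names are introduced here for illustration.

```python
import torch


def kld_from_logvar(mu, logvar):
    # Form used in loss_function(), before normalisation
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())


def kld_from_stddev(mu, stddev):
    # Form used in the earlier TensorFlow snippet, summed over everything
    return 0.5 * torch.sum(mu.pow(2) + stddev.pow(2) - torch.log(stddev.pow(2)) - 1)


torch.manual_seed(0)
mu = torch.randn(128, 20)
logvar = torch.randn(128, 20)

a = kld_from_logvar(mu, logvar)
b = kld_from_stddev(mu, torch.exp(0.5 * logvar))
print(a.item(), b.item())

# At mu = 0, logvar = 0 the learned distribution is already the unit Gaussian.
kld_at_prior = kld_from_logvar(torch.zeros(128, 20), torch.zeros(128, 20))
print(kld_at_prior.item())
```

Predicting the log variance lets the layer output any real number while the implied variance stays positive.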

In [67]:
len(train_loader.dataset)

60000

In [0]:
def train(epoch):
  # toggle model to train mode
  
  model.train()
  
  train_loss = 0
  
  # in the case of MNIST, len(train_loader.dataset) is 60000
  # each `data` is of BATCH_SIZE samples and has shape [128, 1, 28, 28]
  
  for batch_idx, (data, _) in enumerate(train_loader):
    data = Variable(data)
    if CUDA:
      data = data.cuda()
    optimizer.zero_grad()
    
    # push whole batch of data through VAE.forward() to get the reconstruction
    
    recon_batch, mu, logvar = model(data) # equivalent to calling forward(data)
    
    # calculate scalar loss
    
    loss = loss_function(recon_batch, data, mu, logvar)
    
    # calculate the gradient of the loss w.r.t. the graph leaves
    # i.e. the model parameters -- by the power of pytorch!
    
    loss.backward()
    
    # .item() extracts the Python float from the 0-dim loss tensor
    train_loss += loss.item()
    
    # perform the optimization step
    optimizer.step()
    
    if batch_idx % LOG_INTERVAL == 0:
      # loss is already averaged over every pixel in the batch
      print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
          epoch, batch_idx * len(data), len(train_loader.dataset),
          100. * batch_idx / len(train_loader),
          loss.item()))
      
  # average the per-batch mean losses over the number of batches
  print("======> Epoch: {} Average loss: {:.4f}".format(epoch, train_loss / len(train_loader)))
      