<a name='0'></a>

# Variational Autoencoders (VAE)

Variational autoencoders (VAEs) are a class of deep generative networks based on autoencoders and likelihood estimation. They were introduced in the 2014 paper [Auto-Encoding Variational Bayes](https://arxiv.org/abs/1312.6114).

***Outline:***

* [1. Introduction to Normal Autoencoders](#1)
* [2. Introduction to Variational Autoencoders](#2)
* [3. Variational Autoencoder Architectures](#3)
  * [3.1 VQ-VAE](#3-1)
  * [3.2 VAE2](#3-2)
  * [3.3 VAE3](#3-3)
* [4. Advantages and Disadvantages of VAE](#4)
* [5. Final notes](#5)
* [6. Further Learning](#6)

<a name='1'></a>

## 1. Introduction to Normal Autoencoders

Normal autoencoders are neural network architectures used to learn feature representations from unlabelled datasets. They are unsupervised learning networks that learn a latent representation of the images. The latent features learned by autoencoders can be used for downstream recognition tasks such as image classification.

Essentially, autoencoders compress and reconstruct images. They are made of an encoder and a decoder. The encoder transforms the unlabelled input images into a latent vector (or latent feature representation), and the decoder reconstructs the original image from that latent vector. Both the encoder and the decoder can be any type of neural network that can compress and decompress data, such as fully-connected networks or convolutional neural networks (CNNs). Modern image autoencoders are typically based on CNNs. With CNNs, you can use strided convolutions to downsample the image into the latent representation (max pooling could also work but is not recommended), and transposed convolutions to reconstruct the input image from the latent representation.
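To make the downsampling and upsampling concrete, here is a small sketch of the shape arithmetic for strided and transposed convolutions. The input size (28), kernel size (3), stride (2), and padding values are illustrative assumptions, not settings taken from this notebook; the formulas are the standard ones for convolutions without dilation.

```python
def conv_out_size(n, kernel, stride, padding=0):
    """Spatial size after a standard (strided) convolution, no dilation."""
    return (n + 2 * padding - kernel) // stride + 1


def conv_transpose_out_size(n, kernel, stride, padding=0, output_padding=0):
    """Spatial size after a transposed convolution, no dilation."""
    return (n - 1) * stride - 2 * padding + kernel + output_padding


# Hypothetical encoder: two stride-2 convolutions on a 28x28 image.
size = 28
encoder_sizes = [size]
for _ in range(2):
    size = conv_out_size(size, kernel=3, stride=2, padding=1)
    encoder_sizes.append(size)

# Hypothetical decoder: two stride-2 transposed convolutions that undo the downsampling.
decoder_sizes = [size]
for _ in range(2):
    size = conv_transpose_out_size(size, kernel=3, stride=2, padding=1, output_padding=1)
    decoder_sizes.append(size)

print(encoder_sizes)  # [28, 14, 7]
print(decoder_sizes)  # [7, 14, 28]
```

Each stride-2 convolution halves the spatial size, and each matching transposed convolution doubles it again, so the decoder output has the same size as the encoder input.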

The architecture of an autoencoder is shown below:

![image](https://drive.google.com/uc?export=view&id=1AdqlF48zeG90GZRzFPyVuh6wVlN3_wTS)

The overall goal of an autoencoder is to learn a latent feature representation of the data and to reconstruct the input image from that representation. Thus, autoencoders are trained to minimize the reconstruction loss, i.e. the squared L2 distance between the input data and the reconstructed data. If the input training images are represented as $x$ and the reconstructed images as $\hat{x}$, the reconstruction loss is simply the sum of the squared pixel-wise differences between $x$ and $\hat{x}$:

$$ \text{reconstruction loss} = \lVert x - \hat{x} \rVert^2 $$
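As a minimal sketch, the reconstruction loss can be computed with NumPy as follows. The function name `reconstruction_loss`, the toy arrays, and the choice to average over the batch are assumptions made for illustration.

```python
import numpy as np


def reconstruction_loss(x, x_hat):
    """Squared L2 distance per image, averaged over the batch."""
    # Flatten each image so the sum runs over all of its pixels.
    diff = (x - x_hat).reshape(x.shape[0], -1)
    return np.mean(np.sum(diff ** 2, axis=1))


# Toy batch of two 4x4 "images" and a reconstruction that is off by 0.5 everywhere.
x = np.zeros((2, 4, 4))
x_hat = np.full((2, 4, 4), 0.5)

loss = reconstruction_loss(x, x_hat)
print(loss)  # 16 pixels * 0.5**2 = 4.0 per image
```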

After training the autoencoder, the decoder is typically removed, and the remaining encoder, which produces the latent feature representation, can be used for downstream tasks such as image classification on a small dataset, in the form of transfer learning.


Autoencoders are a beautiful idea in feature representation learning since they allow us to learn from massive amounts of unlabelled data. They can also be used for data compression (or dimensionality reduction), [denoising images](https://www.cs.toronto.edu/~larocheh/publications/icml-2008-denoising-autoencoders.pdf), and anomaly detection. However, plain autoencoders are not well suited to image generation because they are not probabilistic: nothing constrains the latent space to follow a known distribution, so we cannot reliably sample a latent vector and decode it into a new image. This limitation motivates the variational autoencoder, a probabilistic variant of the autoencoder that can actually generate new images. Let's look at variational autoencoders (VAEs) in the next section!



<a name='2'></a>

## 2. Introduction to Variational Autoencoders (VAE)

Variational autoencoders (VAEs) are a probabilistic version of standard autoencoders. The idea of a VAE is to learn the parameters of a probability distribution over latent variables from the input image, and then to sample from that distribution to generate new images at test time. VAE-based models have been shown to generate photorealistic images and are currently used in state-of-the-art image generation systems.

Normal autoencoders merely transform an image into a latent feature representation. A variational autoencoder instead takes an image and maps it to a low-dimensional latent space parameterized by a probability distribution. The specific kind of probability distribution learnt by the VAE encoder is a [Gaussian distribution](https://en.wikipedia.org/wiki/Normal_distribution) (characterized by its mean and standard deviation).


As we said previously, the fundamental building block of VAEs is the autoencoder. VAEs also have an encoder and a decoder, with a learnable probability distribution between them. The encoder computes the probability distribution of the latent feature conditioned on the input image, and the decoder computes the probability distribution of the reconstructed output image conditioned on the latent feature.
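The sketch below illustrates how a latent vector can be sampled from the Gaussian produced by the encoder, written in the reparameterized form $z = \mu + \sigma \cdot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ used in the original VAE paper. It assumes the encoder outputs a mean and a log-variance per latent dimension (a common convention, not something specified in this notebook); `sample_latent` and the toy values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_latent(mu, log_var, rng):
    """Draw z ~ N(mu, sigma^2) as z = mu + sigma * eps, with eps ~ N(0, I)."""
    std = np.exp(0.5 * log_var)  # sigma = exp(log(sigma^2) / 2)
    eps = rng.standard_normal(size=mu.shape)
    return mu + std * eps


# Pretend the encoder returned the same 2-D Gaussian for 10,000 inputs.
mu = np.tile([0.0, 2.0], (10_000, 1))
log_var = np.tile([0.0, -2.0], (10_000, 1))

z = sample_latent(mu, log_var, rng)
print(z.shape)         # (10000, 2)
print(z.mean(axis=0))  # close to [0, 2]
print(z.std(axis=0))   # close to [1, exp(-1)]
```

Writing the sample as a deterministic function of `mu`, `log_var`, and external noise is what lets gradients flow back into the encoder during training.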


![image](https://drive.google.com/uc?export=view&id=1ZbN6YcE70tnNj1aTo-FNaH-2JGRGZsH5)

Variational autoencoders are trained to maximize a [variational lower bound](https://en.wikipedia.org/wiki/Evidence_lower_bound) on the likelihood of the input data. The training loss factors in both a reconstruction loss and a regularization loss (defined by the parameters of the distribution of the latent feature $z$). There is a lot of mathematics behind the VAE loss function, but that is beyond the scope of this notebook. If you are interested in diving deep into VAEs, check the further resources at the end of this notebook.
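Without going into the derivation, the two terms of the loss can be sketched as follows. This assumes a diagonal Gaussian encoder, a standard normal prior on $z$, and a squared-error reconstruction term; under those assumptions the regularization term is a KL divergence with a closed form. The function names and the unweighted sum of the two terms are illustrative choices, not a definitive implementation.

```python
import numpy as np


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), one value per example."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=1)


def vae_loss(x, x_hat, mu, log_var):
    """Reconstruction term plus KL regularization term, averaged over the batch."""
    recon = np.sum((x - x_hat).reshape(x.shape[0], -1) ** 2, axis=1)
    kl = kl_to_standard_normal(mu, log_var)
    return np.mean(recon + kl)


# Toy example: one 4-pixel "image" and a 2-D latent distribution.
x = np.zeros((1, 4))
x_hat = np.full((1, 4), 0.5)   # reconstruction term = 4 * 0.25 = 1.0
mu = np.array([[1.0, 0.0]])    # KL term = 0.5 * 1.0**2 = 0.5
log_var = np.zeros((1, 2))     # unit variance in every latent dimension

loss = vae_loss(x, x_hat, mu, log_var)
print(loss)  # 1.5
```

The KL term is zero only when the encoder outputs exactly the prior (zero mean, unit variance), so it pulls the latent distributions toward a shape we can sample from at test time.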

Let's look at some well-known image generation architectures based on VAEs.



<a name='3'></a>

## 3. Variational Autoencoder Architectures

Variational autoencoders alone are not enough to create photorealistic images. They have often been combined with other generative networks, such as the autoregressive networks we saw previously (for example PixelCNN) and generative approaches we haven't seen yet, such as generative adversarial networks (GANs). Below we will look at three popular VAE-based generative networks.


<a name='3-1'></a>

### 3.1 Neural Discrete Representation Learning (VQ-VAE)

[VQ-VAE](https://arxiv.org/abs/1711.00937), which stands for Vector Quantized Variational Autoencoder, is one of the earliest generative models to combine variational autoencoders with discrete latent representations and autoregressive models. Unlike standard VAEs, whose encoder learns a continuous distribution of the latent feature given the input data, VQ-VAE learns a discrete latent representation. To learn the discrete latent representation, the authors used [vector quantization](https://en.wikipedia.org/wiki/Vector_quantization), a technique that converts a continuous variable into a discrete one by mapping each continuous vector to the nearest entry of a finite codebook. VQ-VAE can thus be seen as a new and powerful way of training variational autoencoders, and upon its introduction it achieved excellent generation capabilities in various modalities (images, video, speech).

Discrete representations are a natural fit for many modalities (VQ-VAE was also tested on speech generation). For example, a digital image is a finite grid of pixels, each taking one of a finite set of intensity values. Thus, the idea of using vector quantization to get a discrete latent representation makes sense.
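The core lookup step of vector quantization can be sketched as a nearest-neighbour search in a codebook. The codebook values, the encoder outputs, and the function name `quantize` below are made up for illustration; the sketch only shows the lookup, not how the codebook is learned or how gradients are passed through the quantization step during training.

```python
import numpy as np


def quantize(z_e, codebook):
    """Map each continuous vector in z_e to its nearest codebook entry.

    z_e:      (N, D) continuous encoder outputs
    codebook: (K, D) embedding vectors
    Returns the (N,) discrete indices and the (N, D) quantized vectors.
    """
    # Squared Euclidean distance between every vector and every code: shape (N, K).
    dists = np.sum((z_e[:, None, :] - codebook[None, :, :]) ** 2, axis=2)
    indices = np.argmin(dists, axis=1)
    return indices, codebook[indices]


# Hypothetical codebook with K=3 codes of dimension D=2.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
# Hypothetical continuous encoder outputs.
z_e = np.array([[0.9, 1.2], [-0.1, 0.2], [-0.8, 0.7]])

indices, z_q = quantize(z_e, codebook)
print(indices)  # [1 0 2]
print(z_q)      # the matching rows of the codebook
```

The integer `indices` are the discrete latent representation; an autoregressive model can then be trained over these indices instead of over raw pixels.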

TO BE CONTINUED!