# Variational Autoencoders

In technical terms, here is how a variational autoencoder works. First, an encoder module turns the input samples input_img into two parameters in a latent space of representations, which we will denote z_mean and z_log_variance. Then, we randomly sample a point z from the latent normal distribution that is assumed to generate the input image, via z = z_mean + exp(0.5 * z_log_variance) * epsilon, where epsilon is a random tensor drawn from a standard normal distribution (the factor 0.5 converts the log-variance into a standard deviation). Finally, a decoder module maps this point in the latent space back to the original input image.

Because epsilon is random, the process ensures that every point close to the latent location where we encoded input_img (z_mean) can be decoded to something similar to input_img, thus forcing the latent space to be continuously meaningful: any two close points in the latent space will decode to highly similar images. Continuity, combined with the low dimensionality of the latent space, forces every direction in the latent space to encode a meaningful axis of variation of the data, making the latent space very structured and thus highly suitable for manipulation via concept vectors.
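The sampling step can be illustrated outside of Keras. The following is a minimal NumPy sketch, not the notebook's implementation; it assumes that z_log_var holds the log of the variance, in which case the standard deviation is exp(0.5 * z_log_var). The function name `sample_latent` and the example values are hypothetical.

```python
import numpy as np

def sample_latent(z_mean, z_log_var, rng):
    """Draw z = mean + std * epsilon, with std = exp(0.5 * log_var) (assumed convention)."""
    epsilon = rng.standard_normal(z_mean.shape)  # unit Gaussian noise
    return z_mean + np.exp(0.5 * z_log_var) * epsilon

rng = np.random.default_rng(0)
z_mean = np.array([[1.0, -2.0]])
# A very negative log-variance means almost no noise: std = exp(-10)
z_log_var = np.array([[-20.0, -20.0]])
z = sample_latent(z_mean, z_log_var, rng)
```

With such a small variance the sample stays very close to z_mean; larger values of z_log_var spread the samples over a wider neighbourhood, which is what pushes nearby latent points to decode to similar images.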

The parameters of a VAE are trained via two loss functions: a reconstruction loss, which forces the decoded samples to match the initial inputs, and a regularization loss, which helps in learning well-formed latent spaces and reducing overfitting to the training data.
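As a rough illustration of the two terms, here is a NumPy sketch under stated assumptions: the reconstruction loss is taken to be a pixel-wise binary cross-entropy, and the regularization loss is taken to be the closed-form KL divergence between the encoded Gaussian and a unit Gaussian. The function names are hypothetical, and the relative weighting of the two terms is a design choice that is not shown here.

```python
import numpy as np

def reconstruction_loss(x, x_decoded, eps=1e-7):
    """Mean binary cross-entropy between pixel values in [0, 1]."""
    x_decoded = np.clip(x_decoded, eps, 1.0 - eps)  # avoid log(0)
    bce = -(x * np.log(x_decoded) + (1.0 - x) * np.log(1.0 - x_decoded))
    return float(np.mean(bce))

def kl_loss(z_mean, z_log_var):
    """KL(N(mean, var) || N(0, I)), summed over latent dims, averaged over the batch."""
    per_sample = -0.5 * np.sum(
        1.0 + z_log_var - z_mean ** 2 - np.exp(z_log_var), axis=-1)
    return float(np.mean(per_sample))

# The KL term vanishes when the encoder outputs exactly the prior ...
kl_at_prior = kl_loss(np.zeros((4, 2)), np.zeros((4, 2)))
# ... and grows as the encoded mean moves away from the origin.
kl_shifted = kl_loss(np.ones((4, 2)), np.zeros((4, 2)))

pixels = np.array([0.0, 1.0, 1.0, 0.0])
perfect_reconstruction = reconstruction_loss(pixels, pixels)
```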

Let's go over a Keras implementation of a VAE. Schematically, it looks like this:

In [None]:
# Encode the input into a mean and a log-variance parameter
z_mean, z_log_variance = encoder(input_img)

# Draw a latent point using a random epsilon drawn from a standard normal
z = z_mean + exp(0.5 * z_log_variance) * epsilon

# Then decode z back to an image
reconstructed_img = decoder(z)

# Instantiate a model
model = Model(input_img, reconstructed_img)

# Then train the model using 2 losses:
# a reconstruction loss and a regularization loss

**Build the encoder network as follows (it is a simple convnet that maps the input image to two vectors, `z_mean` and `z_log_var`).**

- Convolution layer with 32 filters, a 3x3 kernel, and ReLU activation;
- Convolution layer with 64 filters, a 3x3 kernel, a 2x2 stride, and ReLU activation;
- Convolution layer with 64 filters, a 3x3 kernel, and ReLU activation;
- Convolution layer with 64 filters, a 3x3 kernel, and ReLU activation;
- Flatten layer;
- Fully connected layer with 32 units and ReLU activation;
- Fully connected layer with `latent_dim` units (assign the result to a variable called `z_mean`);
- Fully connected layer with `latent_dim` units (assign the result to a variable called `z_log_var`).
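Before writing the layers, it can help to trace the feature-map shapes through the convolution stack. The pure-Python sketch below assumes `padding='same'` for every convolution, which the list above does not specify; with a different padding the numbers change. The helper `conv_same_out` is hypothetical.

```python
def conv_same_out(size, stride=1):
    """Spatial output size of a convolution with 'same' padding: ceil(size / stride)."""
    return -(-size // stride)

height, width = 28, 28  # MNIST-sized input, as in img_shape
shapes = []
# (filters, stride) for the four convolution layers in the list
for filters, stride in [(32, 1), (64, 2), (64, 1), (64, 1)]:
    height = conv_same_out(height, stride)
    width = conv_same_out(width, stride)
    shapes.append((height, width, filters))

shape_before_flatten = shapes[-1]
flat_units = shape_before_flatten[0] * shape_before_flatten[1] * shape_before_flatten[2]
```

Under this assumption only the stride-2 layer changes the spatial size, so the feature map entering the Flatten layer is 14x14 with 64 channels. The decoder needs this shape later to undo the flattening.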

In [None]:
import keras
from keras import layers
from keras import backend as K
from keras.models import Model
import numpy as np

img_shape = (28, 28, 1)
batch_size = 16
latent_dim = 2  # Dimensionality of the latent space: a plane

input_img = keras.Input(shape=img_shape)

### TO DO: build the encoder here. ###

z_mean = layers.Dense(latent_dim)(x)
z_log_var = layers.Dense(latent_dim)(x)

Here is the code for using z_mean and z_log_var, the parameters of the statistical distribution assumed to have produced input_img, to generate a latent space point z.

In [None]:
def sampling(args):
    # z_log_var is the log of the variance, so exp(0.5 * z_log_var) is the standard deviation
    z_mean, z_log_var = args
    epsilon = K.random_normal(shape=(K.shape(z_mean)[0], latent_dim),
                              mean=0., stddev=1.)
    return z_mean + K.exp(0.5 * z_log_var) * epsilon

z = layers.Lambda(sampling)([z_mean, z_log_var])

**Build the decoder network as follows:**
- Upsample to the correct number of units;
- Reshape into an image of the same shape as before the last `Flatten` layer;
- Apply the reverse operation to the initial stack of convolution layers: a `Conv2DTranspose` layer with corresponding parameters;
- Convolution layer with 1 filter, a 3x3 kernel, and `sigmoid` activation.

We end up with a feature map of the same size as the original input.

This is the decoder implementation: we reshape the vector z to the dimensions of an image, then we use a few convolution layers to obtain a final image output that has the same dimensions as the original input_img.
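The decoder steps can be checked with simple shape arithmetic before building any layers. This sketch assumes the encoder's last feature map has shape (14, 14, 64), which holds for a 28x28 input only if all convolutions use `padding='same'`, and it assumes 32 filters and a stride of 2 for the transposed convolution; none of these values are fixed by the text, so adjust them to your own encoder.

```python
import math

# Assumed shape of the encoder's last feature map (see the assumptions above)
shape_before_flatten = (14, 14, 64)

# Step 1: the first dense layer must output this many units
# so that the following reshape is valid.
dense_units = math.prod(shape_before_flatten)

# Step 2: a stride-2 transposed convolution with 'same' padding
# multiplies the spatial size by the stride.
stride = 2
transposed_filters = 32  # assumed value
upsampled_shape = (shape_before_flatten[0] * stride,
                   shape_before_flatten[1] * stride,
                   transposed_filters)

# Step 3: a final 3x3 convolution with 1 filter and 'same' padding
# keeps the spatial size and produces a single-channel image.
output_shape = (upsampled_shape[0], upsampled_shape[1], 1)
```

If `output_shape` does not match `img_shape`, the reconstruction loss cannot compare the decoded image with the input, so this check is worth doing first.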

In [None]:
# This is the input where we will feed `z`.
decoder_input = layers.Input(K.int_shape(z)[1:])


### TO DO: build the decoder here ####


# This is our decoder model.
decoder = Model(decoder_input, x)

# We then apply it to `z` to recover the decoded `z`.
z_decoded = decoder(z)

The dual loss of a VAE doesn't fit the traditional expectation of a sample-wise function of the form loss(input, target). Thus, we set up the loss by writing a custom layer that internally uses the built-in add_loss layer method to create an arbitrary loss.

In [None]:
class CustomVariationalLayer(keras.layers.Layer):

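    def vae_loss(self, x, z_decoded):
        x = K.flatten(x)
        z_decoded = K.flatten(z_decoded)
        # Reconstruction term: pixel-wise binary cross-entropy
        xent_loss = keras.metrics.binary_crossentropy(x, z_decoded)
        # Regularization term: KL divergence to a unit Gaussian, weighted by a
        # small constant. z_mean and z_log_var are the encoder tensors defined above.
        kl_loss = -5e-4 * K.mean(
            1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=-1)
        return K.mean(xent_loss + kl_loss)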

    def call(self, inputs):
        x = inputs[0]
        z_decoded = inputs[1]
        loss = self.vae_loss(x, z_decoded)
        self.add_loss(loss, inputs=inputs)
        # We don't use this output.
        return x

# We call our custom layer on the input and the decoded output,
# to obtain the final model output.
y = CustomVariationalLayer()([input_img, z_decoded])

Finally, we instantiate and train the model. Since the loss is handled in our custom layer, we don't specify an external loss at compile time (loss=None), which in turn means that we won't pass target data during training (we pass only x_train to the model in fit).

**Create a VAE model and pass `input_img` and `y` (the output) to it.**

**Import the MNIST dataset from Keras. Remember that we do not use the training labels! Perform some data pre-processing: use `astype('float32')` and then `reshape`.**
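The pre-processing can be sketched on a stand-in array so that it runs without downloading anything. The random `raw_images` array below is hypothetical data with the same dtype and layout as MNIST images (uint8, shape (N, 28, 28)). Dividing by 255 is an assumption that the instruction above does not state; it is included because a sigmoid output with a binary cross-entropy loss expects pixel values in [0, 1].

```python
import numpy as np

# Hypothetical stand-in for the MNIST image array: uint8 values in [0, 255]
rng = np.random.default_rng(0)
raw_images = rng.integers(0, 256, size=(8, 28, 28), dtype=np.uint8)

# Convert to float32 and scale to [0, 1] (the scaling is an assumption, see above)
x = raw_images.astype('float32') / 255.0
# Add the trailing channel axis expected by img_shape = (28, 28, 1)
x = x.reshape(x.shape + (1,))
```

The same two lines would apply to both the training and the test images.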

**Fit the VAE model for 10 epochs, using the test data as validation data.**

Once such a model is trained (on MNIST, in our case), we can use the decoder network to turn arbitrary latent space vectors into images:

In [None]:
import matplotlib.pyplot as plt
from scipy.stats import norm

# Display a 2D manifold of the digits
n = 15  # figure with 15x15 digits
digit_size = 28
figure = np.zeros((digit_size * n, digit_size * n))
# Linearly spaced coordinates on the unit square were transformed
# through the inverse CDF (ppf) of the Gaussian
# to produce values of the latent variables z,
# since the prior of the latent space is Gaussian
grid_x = norm.ppf(np.linspace(0.05, 0.95, n))
grid_y = norm.ppf(np.linspace(0.05, 0.95, n))

for i, yi in enumerate(grid_x):
    for j, xi in enumerate(grid_y):
        z_sample = np.array([[xi, yi]])
        z_sample = np.tile(z_sample, batch_size).reshape(batch_size, 2)
        x_decoded = decoder.predict(z_sample, batch_size=batch_size)
        digit = x_decoded[0].reshape(digit_size, digit_size)
        figure[i * digit_size: (i + 1) * digit_size,
               j * digit_size: (j + 1) * digit_size] = digit

plt.figure(figsize=(10, 10))
plt.imshow(figure, cmap='Greys_r')
plt.show()

The grid of sampled digits shows a completely continuous distribution of the different digit classes, with one digit morphing into another as you follow a path through latent space. Specific directions in this space have a meaning, e.g. there is a direction for "four-ness", "one-ness", etc.
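The grid coordinates used for the figure come from pushing evenly spaced probabilities through the inverse CDF of a standard Gaussian. The sketch below reproduces that idea with the standard library's `statistics.NormalDist` in place of `scipy.stats.norm`; this is a swapped-in tool chosen only to keep the example dependency-free, and the variable names are hypothetical.

```python
from statistics import NormalDist

n = 15
# n evenly spaced probabilities between 0.05 and 0.95, the same endpoints as the grid
probs = [0.05 + 0.90 * k / (n - 1) for k in range(n)]
# Inverse CDF of the standard normal maps each probability to a latent coordinate
grid = [NormalDist().inv_cdf(p) for p in probs]

center_step = grid[8] - grid[7]  # spacing near the middle of the grid
edge_step = grid[1] - grid[0]    # spacing at the edge of the grid
```

The coordinates are packed more tightly around zero and spread out toward the edges, so the grid spends more of its cells where the Gaussian prior puts most of its mass.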