<a href="https://colab.research.google.com/github/khipu-ai/practicals-2019/blob/master/3b_generative_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Practical 3b: Deep Generative Models

© Deep Learning Indaba. Apache License 2.0.


## Introduction

In this practical, we will investigate two kinds of deep generative models, namely Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). We will first train a GAN to generate images of clothing, and then apply VAEs to the same problem.

## Learning Objectives

* Get a feel for what a generative model is.
* Train a GAN, and a VAE for generating images of clothing.
* Understand the difficulties in training GANs.
* Investigate at least one improvement to the training of GANs.
* Understand the differences between autoencoders and VAEs.
* Be able to explain the differences between GANs and VAEs.

**IMPORTANT: Please fill out the exit ticket form before you leave the practical: https://forms.gle/ZLiuTPct4q3BzKtY8**

## Running on GPU

For this practical, you will need to use a GPU to speed up training. To do this, go to the "Runtime" menu in Colab, select "Change runtime type" and then in the popup menu, choose "GPU" in the "Hardware accelerator" box. This is all you need to do; Colab and TensorFlow will take care of the rest!

In [None]:
#@title Imports (RUN ME!) { display-mode: "form" }
!pip install tensorflow-gpu==2.0.0-beta0 > /dev/null 2>&1

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import time

print("TensorFlow executing eagerly: {}".format(tf.executing_eagerly()))

## The data



For this practical, we will use the Fashion MNIST dataset consisting of 70,000 greyscale images of clothing items and their labels. The dataset is divided into 60,000 training images and 10,000 test images. However, unlike in the *supervised learning* setting that we have been exposed to in the previous practicals, because we are doing *unsupervised learning* in this practical, we will not make use of the labels. In this practical, our goal is not to learn labels for images, but rather to learn to generate (hopefully) new and unique images.

Regarding our task of generating images, you may be wondering why we care about doing this! One simple reason we're spending time on this is that it is a fun application of generative models, which gives us an excellent visual way of exploring how different models work. However, generating images itself has many interesting applications ranging from creating new fonts to making people smile in pictures, and from assisting with interactive design to creating photo-realistic images from drawn outlines (see [this cool article](https://distill.pub/2017/aia/) for more information on these applications). It should also be noted that many of the techniques presented in this practical apply to generative modelling in general and not just to generating images. Generative modelling has many interesting applications, ranging from anomaly detection to missing data imputation, as well as leveraging abundant unlabelled data for representation learning (e.g. [BigBiGAN](https://www.google.com/url?q=https://arxiv.org/abs/1907.02544) and [BERT](https://arxiv.org/abs/1810.04805)).

In [None]:
fashion_mnist = tf.keras.datasets.fashion_mnist
# We are getting the labels only for the purposes of exploring the data,
# we won't use them to train our models (or will we?).
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

text_labels = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

#### Visualize the data



Each image in the dataset consists of a 28 x 28 matrix of greyscale pixels. The values are between 0 and 255 where 0 represents black, 255 represents white, and there are many shades of grey in-between. Each image is assigned a corresponding numerical label, between 0 and 9, so the image in `train_images[i]` has its corresponding label stored in `train_labels[i]`. We also have a lookup array called `text_labels` to associate a text description with each of the numerical labels. For example, the label 1 is associated with the text description "Trouser".

In [None]:
# Show 25 randomly selected images at a time
plt.figure(figsize=(10,10))
for i in range(25):
    plt.subplot(5,5,i+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)

    # sample from the whole training set (60,000 images)
    img_index = np.random.randint(0, len(train_images))
    plt.imshow(train_images[img_index], cmap="gray_r")
    plt.xlabel(text_labels[train_labels[img_index]])

## Generative Adversarial Networks



[Generative Adversarial Networks](https://arxiv.org/abs/1406.2661) (GANs) are a very popular class of deep generative models which have shown some incredible results in generating images. For example, take a look at the progress that has been made in generating pictures of people's faces, taken from Ian Goodfellow's [Twitter](https://www.google.com/url?q=https://twitter.com/goodfellow_ian/status/1084973596236144640&sa=D&ust=1565613603406000&usg=AFQjCNEtcXzfgDqPL0vPBkDJhl-x2Iup_Q_):

![faces](https://pbs.twimg.com/media/Dw6ZIOlX4AMKL9J.jpg)

The idea behind GANs is to train two networks simultaneously. Firstly, a **generator network**, which we can train to take a random input (often called $z$) and produce a sample from some distribution, in this case, the distribution of 28 x 28 grayscale clothing images. Secondly, we have a **discriminator network**, which will learn to tell apart the real (training) data and the fake (generated) data. The generator is trained to trick the discriminator and the discriminator is trained to avoid being tricked. This process is described in the following diagram:

![GAN](https://i.imgur.com/OUPd4Av.png)

As you can see in the above diagram, the discriminator's loss is calculated using the labels (real or fake) for the generated and training images, so we can use the standard cross-entropy loss function for a binary classifier. More interestingly, the generator's loss comes from performing back-propagation through the discriminator. In a sense, **the discriminator is the loss function for the generator**. While it may seem strange to train a neural network using another neural network as the loss function, rather than some pre-defined function, it turns out that training this way approximately minimizes something called the Jensen-Shannon divergence (JSD), which is a well-motivated choice of loss function for generative models that we would be unable to calculate analytically.


### Optional extra reading: The Jensen-Shannon Divergence

As mentioned above, the discriminator is used as a loss function to train the generator, and it approximates something called the Jensen-Shannon divergence (JSD). Let's take a closer look at what the JSD is and how it is approximated.

We can, and will in just a moment, prove that using an optimal discriminator (one that makes the fewest mistakes possible) to train the generator is equivalent to minimizing the JS-divergence between the distributions of the generated and real images. Of course, in practice, we do not necessarily have an optimal discriminator, and so we end up with some approximation of the JSD. 

Okay, but what exactly is the Jensen-Shannon Divergence? In short, **the JSD is simply a method for measuring the similarity between two probability distributions**. It is defined as:

\begin{equation}
 D_{JS}(P||Q) = \frac{1}{2}D_{KL}(P||M) + \frac{1}{2}D_{KL}(Q||M)
\end{equation}

where $P$ and $Q$ are the distributions we want to measure the similarity between, $M = \frac{P + Q}{2}$ is the average of $P$ and $Q$, and $D_{KL}$ is the Kullback-Leibler Divergence (KLD). Don't worry too much about the funny $||$ notation; you can think of $D(P||Q)$ as being the same as $D(P,Q)$, just a function that takes two distributions as inputs and returns a scalar similarity between them.

Okay, but what exactly is the Kullback-Leibler Divergence? As you might have guessed from the word "divergence" in both the KLD and JSD, the KLD is also a measure of similarity between probability distributions. It is defined as:

\begin{equation}
D_{KL}(P||Q) = \sum_x P(x)\log\frac{P(x)}{Q(x)}
\end{equation}

in the case where $P$ and $Q$ are discrete distributions (the sum is replaced by an integral for continuous distributions). From the formula above we can see that the KLD will be 0 if and only if $P = Q$ and will be greater than 0 otherwise. We can also see that, in general, $D_{KL}(P,Q) \neq D_{KL}(Q,P)$.
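
To make these definitions concrete, here is a minimal NumPy sketch that evaluates both divergences for two small, made-up discrete distributions. The helper names `kl_divergence` and `js_divergence` are our own (they are not part of the practical), and the sketch assumes that $Q(x) > 0$ wherever $P(x) > 0$.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P||Q) for discrete distributions given as probability vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with P(x) = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js_divergence(p, q):
    """D_JS(P||Q) = 0.5 * D_KL(P||M) + 0.5 * D_KL(Q||M), with M = (P + Q) / 2."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# two made-up distributions over three outcomes
p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]

print("KL(P||P):", kl_divergence(p, p))   # zero when the distributions match
print("KL(P||Q):", kl_divergence(p, q))
print("KL(Q||P):", kl_divergence(q, p))   # differs from KL(P||Q)
print("JS(P||Q):", js_divergence(p, q))
print("JS(Q||P):", js_divergence(q, p))   # the JSD is symmetric
```

Note that this only works because we wrote down the probabilities of both distributions explicitly, which is worth keeping in mind for the advanced exercise below.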

**Exercise:** explain to the person next to you why $D_{KL}(P,Q) = 0$ iff $P = Q$.

**Exercise:** explain to the person next to you why $D_{KL}(P,Q) \neq D_{KL}(Q,P)$.

**Advanced Exercise:** why can we not calculate the JSD in practice? If we could, we wouldn't need to do GAN training. *Hint:* think about what quantities we need to calculate the JSD.

#### Proof that using the optimal discriminator as the loss function for the generator is the same as minimizing the JSD

The loss we are minimizing is

$$\mathbb{E}_{x \sim p_d(x)}[\log D^*(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D^*(G(z)))]$$ 

where $p_d(x)$ is the training data distribution, $p_z(z)$ is the random noise distribution from which we draw samples to pass through our generator, $D$ and $G$ are the discriminator and generator, and $D^*$ is the optimal discriminator which has the form:

$$ D^*(x) = \frac{p_d(x)}{p_d(x) + p_g(x)}.$$

Here $p_g(x)$ is the distribution of the data sampled from the generator. The form for this optimal discriminator comes from Bayes' rule.

**Exercise:** derive the form of the optimal discriminator given above. *Hint:* The discriminator has to output the probability that an image ($x$) is real, $p_d(x)$ is the probability of the image under the real/training data distribution,  $p_g(x)$ is the probability of the image under the fake/generated data distribution, and we have no prior knowledge about whether a given image is real or fake. 

Substituting $D^*(x)$ and $p_g(x)$ into our loss, we can rewrite it as

$$ \mathbb{E}_{x \sim p_d(x)}[\log \frac{p_d(x)}{p_d(x) + p_g(x)}] + \mathbb{E}_{x \sim p_g(x)}[\log \frac{p_g(x)}{p_d(x) + p_g(x)}]. $$

Now we can multiply the values inside the logs by $1 = \frac{0.5}{0.5}$ to get

 $$ \mathbb{E}_{x \sim p_d(x)}[\log \frac{0.5 p_d(x)}{0.5(p_d(x) + p_g(x))}] + \mathbb{E}_{x \sim p_g(x)}[\log \frac{0.5 p_g(x)}{0.5(p_d(x) + p_g(x))}]. $$

Recalling that $\log(ab) = \log(a) + \log(b)$ (log multiplication law) and defining $p_m(x) = \frac{p_d(x) + p_g(x)}{2}$, each expectation contributes a constant $\log 0.5 = -\log 2$ term, and we get

 $$ \mathbb{E}_{x \sim p_d(x)}[\log \frac{p_d(x)}{p_m(x)}] + \mathbb{E}_{x \sim p_g(x)}[\log \frac{p_g(x)}{p_m(x)}] - 2\log2. $$

Using the definition of the KL-Divergence, this simplifies to

$$ D_{KL}(p_d||p_m) + D_{KL}(p_g||p_m) - 2\log2. $$

Now, noting that for the purposes of optimization the $2\log2$ term can be ignored and that similarly we can multiply the above by a constant factor of 0.5, we get

$$ \frac{1}{2} D_{KL}(p_d||p_m) + \frac{1}{2} D_{KL}(p_g||p_m) $$

Finally, using the definition of the JS-Divergence, we get

$$ D_{JS}(p_d||p_g).$$
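
As a sanity check on the derivation, the following sketch evaluates both sides numerically for two small, made-up discrete distributions. The values of `p_d` and `p_g` are arbitrary illustrations, not anything learned from data; the point is only that the loss under the optimal discriminator equals $2 D_{JS}(p_d||p_g) - 2\log 2$.

```python
import numpy as np

# hypothetical "data" and "generator" distributions over three outcomes
p_d = np.array([0.5, 0.3, 0.2])
p_g = np.array([0.2, 0.2, 0.6])

# optimal discriminator: D*(x) = p_d(x) / (p_d(x) + p_g(x))
d_star = p_d / (p_d + p_g)

# E_{x~p_d}[log D*(x)] + E_{x~p_g}[log(1 - D*(x))]
value = np.sum(p_d * np.log(d_star)) + np.sum(p_g * np.log(1.0 - d_star))

# Jensen-Shannon divergence computed from its definition
p_m = 0.5 * (p_d + p_g)
kl_d = np.sum(p_d * np.log(p_d / p_m))
kl_g = np.sum(p_g * np.log(p_g / p_m))
jsd = 0.5 * kl_d + 0.5 * kl_g

print("loss with optimal discriminator:", value)
print("2 * JSD - 2 * log(2):           ", 2 * jsd - 2 * np.log(2))
```

The two printed numbers should agree up to floating-point error. If you set `p_g` equal to `p_d`, the JSD is zero and the loss reaches its minimum of $-2\log 2$.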

## Training a GAN



Now that we know what a GAN is, let's try training one to generate images of clothing! Before we train our model, we first need to process our training data, and we'll need to decide on our model architecture. Remember that a GAN consists of two networks, so we'll have to define both of them.

#### Some data pre-processing




We need to do a little data pre-processing before training our GAN:

1. We add an extra dimension because many of the layers we will use expect a channel dimension. 
2. We normalize the images to be in the range [-1, 1] because we are using the DCGAN guidelines, described below.
3. We randomize and batch the data.

In [None]:
gan_train_images = np.expand_dims(train_images, axis=3) # add a channel dim
gan_train_images = gan_train_images.astype('float32') # convert to float32
gan_train_images = (gan_train_images - 127.5) / 127.5 # normalize the images to [-1, 1]
batch_size = 256
gan_train_dataset = tf.data.Dataset.from_tensor_slices(gan_train_images).shuffle(batch_size*10).batch(batch_size)
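
Because the generator will later produce pixel values in [-1, 1], it can be handy to map images back to the usual [0, 255] range. The sketch below checks the normalization on random stand-in images and inverts it; `normalize` and `denormalize` are hypothetical helper names of our own, not functions used elsewhere in this practical.

```python
import numpy as np

def normalize(images_uint8):
    """Map pixel values from [0, 255] to [-1, 1], as in the cell above."""
    return (images_uint8.astype("float32") - 127.5) / 127.5

def denormalize(images):
    """Map pixel values from [-1, 1] back to integers in [0, 255]."""
    return np.clip(np.round(images * 127.5 + 127.5), 0, 255).astype("uint8")

# random stand-in "images" with the same layout as the training batches
fake_images = np.random.randint(0, 256, size=(4, 28, 28, 1), dtype=np.uint8)

normalized = normalize(fake_images)
print("min/max after normalizing:", normalized.min(), normalized.max())

restored = denormalize(normalized)
print("round trip recovers the input:", np.array_equal(restored, fake_images))
```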

### The model

Let's think about how our two networks must be structured. 

The *discriminator* should take a 28x28x1 array of numbers in the range [-1, 1] as its input and return a scalar between 0 and 1. This scalar will be interpreted as the probability the image came from the training dataset rather than being generated by the generator, i.e. the probability that the image is real rather than fake.

The *generator* should take a vector and return a 28x28x1 array of numbers in the range [-1, 1]. The size of the vector is the *latent dimension* and is a hyperparameter that we can tune. 

One of the most challenging parts of deep learning is choosing the architecture for our models. This is especially true for GANs, which are notoriously tricky to train. A GAN might work correctly with one architecture and fail to train at all with a slightly different architecture. Furthermore, instabilities and pathologies are very common in GAN training; for example, the generator and discriminator might end up chasing each other in circles rather than converging. The good news is that GANs are an active area of research and several methods exist for making GANs train more reliably. One such method that works very well for images is the DCGAN architecture by _Radford et al._ which we will draw inspiration from to design our networks. Here are some tips from the [git repo](https://github.com/Newmu/dcgan_code) for DCGAN:

* Replace any pooling layers with strided convolutions.
* Use batchnorm in both the generator and the discriminator.
* Remove fully connected hidden layers for deeper architectures. Just use average pooling at the end. 
* Use ReLU activations in the generator for all layers except for the output, which uses Tanh. *(This means they are generating pixel values between -1 and 1.)*
* Use LeakyReLU activation in the discriminator for all layers.

We will use these guidelines to help us define our discriminator and generator, but you should experiment and see what happens if you make other choices.

**Exercise:** why do you think we use LeakyReLU and avoid the use of pooling layers? *Hint:* we want to make training as easy as possible.

In [None]:
# some useful constants and hyper-parameters
img_shape = (28, 28, 1)
latent_dim = 100

#### The generator

In [None]:
def make_generator(latent_dim):
  generator = tf.keras.Sequential(name="generator")
  
  # we start with 7 * 7 * XXX so that we can apply two blocks that include
  # UpSampling layers, which will eventually give us a 28 * 28 image, since
  # 7 * 2 * 2 = 28
  generator.add(tf.keras.layers.Dense(7 * 7 * 128,
                                activation="relu",
                                input_dim=latent_dim))

  assert generator.input_shape == (None, latent_dim), "input is wrong shape"
  
  generator.add(tf.keras.layers.Reshape([7, 7, 128]))
  
  generator.add(tf.keras.layers.UpSampling2D())
  generator.add(tf.keras.layers.Conv2D(128,
                                 kernel_size=3,
                                 padding="same"))
  generator.add(tf.keras.layers.BatchNormalization(momentum=0.8))
  generator.add(tf.keras.layers.Activation("relu"))
  
  generator.add(tf.keras.layers.UpSampling2D())
  generator.add(tf.keras.layers.Conv2D(64,
                                 kernel_size=3,
                                 padding="same"))
  generator.add(tf.keras.layers.BatchNormalization(momentum=0.8))
  generator.add(tf.keras.layers.Activation("relu"))
  
  generator.add(tf.keras.layers.Conv2D(1,
                                 kernel_size=3,
                                 padding="same"))
  generator.add(tf.keras.layers.Activation("tanh"))
  
  assert generator.output_shape == (None, 28, 28, 1), "output is wrong shape"
  
  return generator
  
generator = make_generator(latent_dim)    

generator.summary()
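
To see why two upsampling blocks take us from 7 x 7 to 28 x 28, here is a small NumPy sketch that mimics what an upsampling layer with a factor of 2 and nearest-neighbour interpolation does to the spatial dimensions. The function `upsample_nearest` is our own stand-in for illustration; it is not the Keras implementation.

```python
import numpy as np

def upsample_nearest(x, factor=2):
    """Repeat rows and columns of a (batch, height, width, channels) array."""
    return np.repeat(np.repeat(x, factor, axis=1), factor, axis=2)

# a single 7 x 7 feature map with one channel
x = np.arange(7 * 7, dtype="float32").reshape(1, 7, 7, 1)

once = upsample_nearest(x)      # 7 -> 14
twice = upsample_nearest(once)  # 14 -> 28

print(x.shape, once.shape, twice.shape)
```

Each upsampling step simply copies every pixel into a 2 x 2 block; it is the convolution that follows each upsampling layer that learns how to turn this blocky result into something smoother.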

We can (try to) generate an image with our un-trained generator.

**Exercise:** What do you think will happen when you run the code block below?

In [None]:
noise = tf.random.normal([1, latent_dim])
generated_image = generator(noise, training=False)

plt.imshow(generated_image[0, :, :, 0], cmap='gray_r')
plt.axis('off')
plt.show()

#### The discriminator

In [None]:
def make_discriminator(img_shape):
  discriminator = tf.keras.Sequential(name="discriminator")
  
  discriminator.add(tf.keras.layers.Conv2D(32,
                                  kernel_size=3,
                                  strides=2,
                                  padding="same",
                                  input_shape=img_shape))

  assert discriminator.input_shape == (None, 28, 28, 1), "input is wrong shape"
  
  discriminator.add(tf.keras.layers.LeakyReLU(alpha=0.2))
  discriminator.add(tf.keras.layers.Dropout(0.25))
  discriminator.add(tf.keras.layers.Conv2D(64,
                                  kernel_size=3,
                                  strides=2,
                                  padding="same"))
  discriminator.add(tf.keras.layers.BatchNormalization(momentum=0.8))
  
  discriminator.add(tf.keras.layers.LeakyReLU(alpha=0.2))
  discriminator.add(tf.keras.layers.Dropout(0.25))
  discriminator.add(tf.keras.layers.Conv2D(128,
                                  kernel_size=3,
                                  strides=2,
                                  padding="same"))
  discriminator.add(tf.keras.layers.BatchNormalization(momentum=0.8))
  
  discriminator.add(tf.keras.layers.LeakyReLU(alpha=0.2))
  discriminator.add(tf.keras.layers.Dropout(0.25))
  discriminator.add(tf.keras.layers.Conv2D(256,
                                  kernel_size=3,
                                  strides=1,
                                  padding="same"))
  discriminator.add(tf.keras.layers.BatchNormalization(momentum=0.8))
  
  discriminator.add(tf.keras.layers.LeakyReLU(alpha=0.2))
  discriminator.add(tf.keras.layers.Dropout(0.25))
  discriminator.add(tf.keras.layers.Flatten())
  discriminator.add(tf.keras.layers.Dense(1, activation='sigmoid'))
  
  assert discriminator.output_shape == (None, 1), "output is wrong shape"
    
  return discriminator
  
discriminator = make_discriminator(img_shape)    

discriminator.summary()
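
The spatial sizes reported in the summary can be worked out by hand. The sketch below uses the usual output-size rules for convolutions, which we assume here: with `"same"` padding the output size is the input size divided by the stride, rounded up, and with `"valid"` padding it is `(size - kernel_size) // stride + 1`. The helper `conv_output_size` is our own name.

```python
import math

def conv_output_size(size, kernel_size, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    if padding == "same":
        return math.ceil(size / stride)
    if padding == "valid":
        return (size - kernel_size) // stride + 1
    raise ValueError("padding must be 'same' or 'valid'")

# (filters, stride) for the four convolutions in make_discriminator
conv_layers = [(32, 2), (64, 2), (128, 2), (256, 1)]

size = 28
for filters, stride in conv_layers:
    size = conv_output_size(size, kernel_size=3, stride=stride, padding="same")
    print("after Conv2D({:3d}, strides={}): {} x {} x {}".format(
        filters, stride, size, size, filters))

# number of inputs to the final Dense layer after Flatten
flattened = size * size * conv_layers[-1][0]
print("flattened size:", flattened)
```

The strided convolutions play the role that pooling layers would otherwise play: they shrink the image from 28 to 14 to 7 to 4 pixels per side while the number of channels grows.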

We can also (try to) make predictions about whether an image is real or fake with our un-trained discriminator.

**Exercise:** What do you think will happen when you run the code block below? What does the prediction mean?


In [None]:
decision = discriminator(generated_image)
print(decision)

### Optional extra reading: generative vs discriminative models



You should already be starting to notice some of the differences between generative and discriminative models. For example, we've said that we are now doing unsupervised learning and that we do not need labels for our data. So you might now be wondering if this is always the case and whether there are other differences between the two kinds of models. In general, *generative models are not always unsupervised*; they can also be supervised or even semi-supervised. In these cases, labels are often used. Similarly, unsupervised learning is not always generative, for example, k-means clustering. The main difference between generative and discriminative modelling is the goal. The purpose of a discriminative model is to predict an output $y$ given an input $x$, as accurately as possible. On the other hand, the purpose of a generative model is to describe the process in which data is generated. This lets us create new examples of the data, often written $\hat x$, but it can also help us to understand the underlying nature of the data.

We can summarise the above by saying discriminative models learn $p(y | x)$, whilst generative models learn *something* about $p(x)$. But what do we mean by 'something'? Let's get an idea by taking a look at what a few different kinds of models learn:

* GANs learn to sample from $p(x)$.
* VAEs learn to approximate $p(x)$.
* Conditional VAEs learn to approximate $p(x|y)$.
* Normalizing flows learn $p(x)$ exactly.
* Auto-regressive models learn $p(x_i|x_{i-1},x_{i-2},...,x_1)$.

**Partner exercise:** considering the following list of applications for generative modelling, which of the models above are applicable, and which aren't: anomaly detection, filling in missing data, generating images of numeric digits, generating images of specific digits, unsupervised representation learning, generating time-series data.



### Training the model

#### Defining the loss functions and optimizers

Let's first define the discriminator loss. Recall that the discriminator is trained to determine whether images are real or fake. In other words, the discriminator is a binary classifier that predicts either "real" or "fake" for a given image. In this case, we will train the discriminator to predict `1` for a real image and `0` for a fake image so we can use a standard loss for a binary classifier: cross entropy. The only thing we need to take care of is combining the losses for a batch of generated images and a batch of real images.

In [None]:
# define the loss function for the discriminator
cross_entropy = tf.keras.losses.BinaryCrossentropy(from_logits=False)

def discriminator_loss(real_output, fake_output):
    # real_output is the prediction of the discriminator for a batch of real images
    # fake_output is the same but for images from the generator
    real_loss = cross_entropy(tf.ones_like(real_output), real_output)
    fake_loss = cross_entropy(tf.zeros_like(fake_output), fake_output)
    total_loss = real_loss + fake_loss
    return total_loss

Now we can define the generator loss. Recall that the generator is trained to trick the discriminator. As a result, our loss function must make use of the discriminator's predictions for the fake/generated images. However, in this case, we are training the generator such that the discriminator predicts `1` for a fake image (which is the opposite of what we did for the discriminator above).

In [None]:
def generator_loss(fake_output):
    return cross_entropy(tf.ones_like(fake_output), fake_output)
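
To get a feel for the numbers these two losses produce, here is a NumPy sketch that re-implements binary cross-entropy by hand and evaluates both losses in two situations: a discriminator that cannot tell real from fake, and one that is confidently correct. The functions `binary_cross_entropy`, `disc_loss_np` and `gen_loss_np` are our own stand-ins and only approximate what the Keras loss does.

```python
import numpy as np

def binary_cross_entropy(targets, predictions, eps=1e-7):
    """Mean binary cross-entropy between 0/1 targets and probabilities."""
    predictions = np.clip(predictions, eps, 1.0 - eps)  # avoid log(0)
    losses = -(targets * np.log(predictions)
               + (1.0 - targets) * np.log(1.0 - predictions))
    return float(np.mean(losses))

def disc_loss_np(real_output, fake_output):
    # real images should be classified as 1, generated images as 0
    real_loss = binary_cross_entropy(np.ones_like(real_output), real_output)
    fake_loss = binary_cross_entropy(np.zeros_like(fake_output), fake_output)
    return real_loss + fake_loss

def gen_loss_np(fake_output):
    # the generator wants generated images to be classified as 1
    return binary_cross_entropy(np.ones_like(fake_output), fake_output)

# case 1: the discriminator outputs 0.5 for everything
undecided = np.full((4, 1), 0.5)
print("undecided: disc loss {:.3f}, gen loss {:.3f}".format(
    disc_loss_np(undecided, undecided), gen_loss_np(undecided)))

# case 2: the discriminator is confident and correct
real_output = np.full((4, 1), 0.99)
fake_output = np.full((4, 1), 0.01)
print("confident: disc loss {:.3f}, gen loss {:.3f}".format(
    disc_loss_np(real_output, fake_output), gen_loss_np(fake_output)))
```

In the undecided case the discriminator loss is $2\log 2 \approx 1.386$ and the generator loss is $\log 2 \approx 0.693$. These are useful reference values to keep in mind when you watch the losses during training.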

Finally, because we are training two networks for two different tasks, we need two different optimizers. Here the optimizers are identical, but the choice of the optimizer, learning rate, and other parameters are all things we can tweak to try to improve training.

In [None]:
generator_optimizer = tf.keras.optimizers.Adam(1e-4)
discriminator_optimizer = tf.keras.optimizers.Adam(1e-4)

#### The training loop

Because GAN training requires that we simultaneously train a generator network and a discriminator network, the Keras `model.fit()` function is awkward to use here. For more complex training setups like this, it is easier to define the training loop ourselves using `tf.GradientTape()`.

In [None]:
# if you change the definitions for the generator and/or discriminator,
# or if you want to restart training...
# RE-RUN THIS CELL!

gen_loss_mean = tf.keras.metrics.Mean(name='gen_loss_mean')
disc_loss_mean = tf.keras.metrics.Mean(name='disc_loss_mean')

@tf.function
def train_step(images):                                       
    noise = tf.random.normal([batch_size, latent_dim])

    with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
      generated_images = generator(noise, training=True)

      real_output = discriminator(images, training=True)
      fake_output = discriminator(generated_images, training=True)

      gen_loss = generator_loss(fake_output)
      disc_loss = discriminator_loss(real_output, fake_output)

    gen_loss_mean(gen_loss)
    disc_loss_mean(disc_loss)
      
    gradients_of_generator = gen_tape.gradient(gen_loss, generator.trainable_variables)
    gradients_of_discriminator = disc_tape.gradient(disc_loss, discriminator.trainable_variables)

    generator_optimizer.apply_gradients(zip(gradients_of_generator, generator.trainable_variables))
    discriminator_optimizer.apply_gradients(zip(gradients_of_discriminator, discriminator.trainable_variables))
    
# Exercise: why do we define this as a constant?
noise = np.random.normal(0, 1, (8, latent_dim))

def generate_and_display_images(generator):
    predictions = generator(noise, training=False)
      
    fig = plt.figure(figsize=(16,2))

    for i in range(predictions.shape[0]):
      plt.subplot(1, 8, i+1)
      plt.imshow(predictions[i, :, :, 0], cmap='gray_r')
      plt.axis('off')

    plt.show()
    
def train_GAN(epochs):
  generate_and_display_images(generator)
  
  for epoch in range(epochs):
    start = time.time()

    for image_batch in gan_train_dataset:
      train_step(image_batch)
      
    template = 'Epoch {:03d}: time {:.3f} sec, gen loss {:.3f}, disc loss {:.3f}'
    print(template.format(epoch + 1, time.time()-start, gen_loss_mean.result(), disc_loss_mean.result()))
    
    generate_and_display_images(generator)
    
generator = make_generator(latent_dim)
discriminator = make_discriminator(img_shape)

Let's train our GAN! We start to see reasonable results after around 20 epochs, which takes about 4-5 minutes to train using a GPU in Colab. If you have time, try training for 50 epochs and see how much of a difference there is (or isn't).

In [None]:
num_epochs = 20
train_GAN(num_epochs)

There you have it; we've successfully trained a GAN to generate images of clothing. There is certainly room for improvement, but comparing the results after our last epoch to the "image" generated by the untrained generator shows that our GAN training worked.

**Exercise:** What relationships, if any, can you see between the generator loss, discriminator loss, and the generated image quality?

### Optional extra reading: GAN generalization

You may have noticed that we didn't make use of the FashionMNIST test set to check if our GAN is generalizing well. The images that we generated above could be imperfect copies of images from the training set. 

It would be very difficult to tell whether this is the case because there are 60000 images in the training set and we can't manually check them all against each image that the generator produces. In general, it can be quite tricky to evaluate whether or not a GAN is generalizing.

One method, which you can explore in a task below, is to interpolate between two random codes and observe the outputs of the generator as you do so. If the images produced by the generator interpolate naturally, then that would suggest that the GAN is generalizing. 
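
As a starting point, the codes for such an interpolation can be built with a few lines of NumPy. This is only a sketch of simple linear interpolation; `interpolate_codes` is a hypothetical helper of our own, and other schemes (for example, spherical interpolation) are also used in practice.

```python
import numpy as np

def interpolate_codes(z1, z2, num_steps=8):
    """Return num_steps codes evenly spaced on the line from z1 to z2."""
    alphas = np.linspace(0.0, 1.0, num_steps).reshape(-1, 1)
    return (1.0 - alphas) * z1 + alphas * z2

latent_dim = 100
rng = np.random.default_rng(0)

# two random codes drawn from the same distribution used during training
z1 = rng.normal(size=latent_dim)
z2 = rng.normal(size=latent_dim)

codes = interpolate_codes(z1, z2).astype("float32")
print(codes.shape)  # (8, 100)
```

The rows of `codes` can then be passed through a trained generator, e.g. `generator(codes, training=False)`, and the resulting images displayed side by side.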

You can read more about issues with evaluating generative models (not just GANs)  [here](https://arxiv.org/pdf/1511.01844.pdf).

## GAN Tasks




1.   **[ALL]** Experiment with different architectures for the generator and discriminator. Try breaking the DCGAN guidelines. How successful were your other architectures? (Hint: don't remove the `assert` statements from the `make_generator` and `make_discriminator` functions).
2. **[ALL]** Experiment with interpolating between two random codes $z_1$ and $z_2$ and examine what happens to the output of the generator as you do so.
3.   **[INTERMEDIATE]** Modify the GAN implementation (the discriminator, generator and training loop) to make use of the data labels to improve training. More specifically, implement an [Auxiliary Classifier GAN](https://arxiv.org/abs/1610.09585). If you get stuck, you can look at [this Keras implementation](https://github.com/eriklindernoren/Keras-GAN/blob/master/acgan/acgan.py) of AC-GAN.
4. **[ADVANCED]** Modify the loss function of the discriminator to be relativistic. More specifically, implement the [Relativistic Discriminator](https://arxiv.org/abs/1807.00734). How does this change the training and final results of the GAN?
5. **[ADVANCED]** Convert the GAN to a [Wasserstein GAN](https://arxiv.org/abs/1701.07875). How does this change the training and final results of the GAN?



## Extra reading on GANs



* [GAN Lab](https://poloclub.github.io/ganlab/) (an interactive GAN visualization, **highly recommended**)
* [Generative Adversarial Nets](https://arxiv.org/pdf/1406.2661.pdf) (Goodfellow's original paper) 
* [Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks](https://arxiv.org/abs/1511.06434) (the DCGAN architecture by _Radford et al._)
* [GAN Hacks](https://github.com/soumith/ganhacks) (more tips and tricks for training GANs from various authors of influential GAN papers)
* [NIPS 2016 Tutorial: Generative Adversarial Networks](https://arxiv.org/abs/1701.00160) (a tutorial by Goodfellow)
* [Depth First Learning: Wasserstein GAN](http://www.depthfirstlearning.com/2019/WassersteinGAN) (a self-study guide to WGAN with plenty of excellent resources for understanding generative models and GANs)
* [TF 2.0 GAN Example](https://www.tensorflow.org/beta/tutorials/generative/dcgan)

## Variational Autoencoders

VAEs, also known as the [Auto-Encoding Variational Bayes](https://arxiv.org/abs/1312.6114) method, are extensions of a kind of neural network architecture called an autoencoder (AE). We'll start by building and training an autoencoder. After we've done that, we'll introduce a few changes that make an AE into a VAE. 

Autoencoders are networks trained unsupervised on input data (often images) that attempt to find a simple *code* to represent the input data. The main idea is to jointly train two sub-networks to cooperate: the *encoder* network, which takes an image and turns it into a code, and the *decoder* network, which takes a code and turns it back into a reconstructed image. (For non-image data, the basic idea is the same). The code is often called the *latent code* or *latent variable*, which lives in a *latent space*. The latent code uses the symbol $\mathbf z$, whereas data typically uses the symbol $\mathbf x$, and the reconstructed data uses the symbol $\mathbf{\hat{x}}$. In the case of Fashion MNIST, the input data is made of $28\times28$ grayscale images. A typical latent code might be 100 dimensional, but the exact size is a hyperparameter that can be tuned; the only thing that matters is that it is smaller than the input dimension.

![autoencoder](https://i.imgur.com/yj8f5jS.jpg)

Why do we want to learn a latent code for the data? There are a number of applications. An obvious one is that it allows us to *compress* our data, in a process called *dimensionality reduction*. By making the dimensionality of our latent code smaller than that of the data, we force the model to compress the important pieces of information into the code and discard the unimportant information. Another application is *representation learning*, which we mentioned briefly in the introduction. The idea behind representation learning is that we want different parts of our code to represent different aspects of our data. For example, one dimension of the code might represent whether or not a person is smiling, while another might represent a person's hair colour. Learning good representations of our data not only allows us to have fine control over the images we generate but also lets us build powerful image classifiers.





## Training an Autoencoder

Training an autoencoder is extremely simple: we take an input example $\mathbf x$, apply the encoder $e(\mathbf{x})$ to get a code $\mathbf z$, apply the decoder $d(\mathbf{z})$ to get a reconstructed input example $\mathbf{\hat x}$. The perfect autoencoder will yield $\mathbf{\hat x} = \mathbf{x}$, or equivalently $d(e(\mathbf{x})) = \mathbf{x}$, meaning that the input can be reconstructed perfectly from the code. Therefore we can train an autoencoder to minimize a reconstruction loss which is the difference between $\mathbf{x}$ and $\mathbf{\hat{x}}$. A typical choice for reconstruction loss $\mathcal{L_R}$ would be mean squared error (MSE): 

$$\mathcal{L_R}=\frac{1}{N}\sum_{n=1}^N(x_n-\hat{x}_n)^2$$

where $N$ is the number of dimensions in $\mathbf x$ and $x_n$ is the $n$th component of $\mathbf x$.
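
As a quick sanity check of the formula, here is a minimal NumPy sketch. The helper name `reconstruction_mse` and the toy vectors are made up for illustration and are not part of the practical's code.

```python
import numpy as np

def reconstruction_mse(x, x_hat):
    """Mean squared error between an input and its reconstruction."""
    x = np.asarray(x, dtype=np.float64).ravel()
    x_hat = np.asarray(x_hat, dtype=np.float64).ravel()
    return np.mean((x - x_hat) ** 2)

# a toy 4-dimensional "image" and an imperfect reconstruction of it
x = np.array([0.0, 0.5, 1.0, 0.25])
x_hat = np.array([0.1, 0.5, 0.8, 0.25])

print(reconstruction_mse(x, x_hat))  # average of the squared errors 0.01, 0, 0.04, 0
print(reconstruction_mse(x, x))      # a perfect reconstruction has zero loss
```

The loss is zero only when every component is reconstructed exactly, which matches the "perfect autoencoder" condition $d(e(\mathbf{x})) = \mathbf{x}$.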

Let's train an autoencoder to reconstruct the same images of clothing from FashionMNIST. We'll start by defining our encoder and decoder networks.

In [None]:
# some useful constants and hyper-parameters
img_shape = (28, 28, 1)
latent_dim = 100

In [None]:
def make_encoder(latent_dim, img_shape):
  encoder = tf.keras.Sequential(name="encoder")
  
  encoder.add(tf.keras.layers.Conv2D(32,
                                     kernel_size=3,
                                     strides=(2, 2),
                                     activation='relu',
                                     input_shape=img_shape))
  
  assert encoder.input_shape == (None, 28, 28, 1), "input is wrong shape"
  
  encoder.add(tf.keras.layers.Conv2D(64,
                                     kernel_size=3,
                                     strides=(2, 2),
                                     activation='relu'))
  
  encoder.add(tf.keras.layers.Flatten())
  encoder.add(tf.keras.layers.Dense(latent_dim))
  
  assert encoder.output_shape == (None, latent_dim), "output is wrong shape"
  
  return encoder

encoder = make_encoder(latent_dim, img_shape)

encoder.summary()

In [None]:
def make_decoder(latent_dim):
  decoder = tf.keras.Sequential(name="decoder")
  
  decoder.add(tf.keras.layers.Dense(7*7*32,
                                    activation='relu',
                                    input_dim=latent_dim))
  
  assert decoder.input_shape == (None, latent_dim), "input is wrong shape"

  
  decoder.add(tf.keras.layers.Reshape(target_shape=(7, 7, 32)))
  decoder.add(tf.keras.layers.Conv2DTranspose(64,
                                              kernel_size=3,
                                              strides=(2, 2),
                                              padding="SAME",
                                              activation='relu'))
  
  decoder.add(tf.keras.layers.Conv2DTranspose(32,
                                              kernel_size=3,
                                              strides=(2, 2),
                                              padding="SAME",
                                              activation='relu'))
  
  decoder.add(tf.keras.layers.Conv2DTranspose(1,
                                             kernel_size=3,
                                             strides=(1, 1),
                                             padding="SAME"))
  
  assert decoder.output_shape == (None, 28, 28, 1), "output is wrong shape"
  
  return decoder

decoder = make_decoder(latent_dim)

decoder.summary()
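
The shapes reported by the two summaries can also be worked out by hand. The sketch below assumes the standard output-size rules for a convolution with `'valid'` padding (the Keras default, used in the encoder) and for a transposed convolution with `'same'` padding (used in the decoder). The helper names are hypothetical.

```python
def conv_out(size, kernel, stride):
    """Output size of a convolution with 'valid' padding."""
    return (size - kernel) // stride + 1

def conv_transpose_out_same(size, stride):
    """Output size of a transposed convolution with 'same' padding."""
    return size * stride

# encoder: two 3x3, stride-2 convolutions applied to a 28x28 image
enc_size = 28
for _ in range(2):
    enc_size = conv_out(enc_size, kernel=3, stride=2)
encoder_flat = enc_size * enc_size * 64  # 64 filters in the last conv layer

# decoder: starts from a 7x7 feature map and uses strides 2, 2 and 1
dec_size = 7
for stride in (2, 2, 1):
    dec_size = conv_transpose_out_same(dec_size, stride)

print(enc_size, encoder_flat, dec_size)
```

This is why the decoder's first dense layer has `7*7*32` units: two stride-2 transposed convolutions take a $7\times7$ feature map back up to $28\times28$.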

Let's do some pre-processing of our data. We can measure the reconstruction error on the test set to see if our autoencoder is generalizing, so we'll pre-process the test set too.

In [None]:
ae_train_images = np.expand_dims(train_images, axis=3) # add a channel dim
ae_train_images = ae_train_images.astype('float32') # convert to float32
ae_train_images = ae_train_images / 255 # normalize the images to [0, 1]


ae_test_images = np.expand_dims(test_images, axis=3) 
ae_test_images = ae_test_images.astype('float32') 
ae_test_images = ae_test_images / 255.

And now we can train our autoencoder:

In [None]:
# remember to re-run this cell if you want to restart training
encoder = make_encoder(latent_dim, img_shape)
decoder = make_decoder(latent_dim)
  
autoencoder = tf.keras.Sequential([
    encoder,
    decoder
], name="autoencoder")

In [None]:
optimizer = tf.keras.optimizers.Adam(1e-4)

autoencoder.compile(optimizer=optimizer, loss='mse')

autoencoder.fit(ae_train_images, ae_train_images,
                epochs=20,
                batch_size=256,
                shuffle=True,
                validation_data=(ae_test_images, ae_test_images))

**Exercise:** Why does the call to `autoencoder.fit()` receive the training images twice?

Let's compare some of the input images (top) with their reconstructions (bottom):


In [None]:
#@title Comparison (RUN ME!) {display-mode: "form"}
decoded_images = np.array(autoencoder(ae_test_images, training=False))

n = 10
plt.figure(figsize=(20, 4))
for i in range(n):
    # original
    ax = plt.subplot(2, n, i + 1)
    plt.imshow(ae_test_images[i].reshape(28, 28), cmap="gray_r")
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)

    # reconstruction
    ax = plt.subplot(2, n, i + n + 1)
    plt.imshow(decoded_images[i].reshape(28, 28), cmap="gray_r")
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
plt.show()

As you can see, the autoencoder does a reasonable job of reconstructing images from our 100-dimensional latent code. However, there are a few things to note here:



1.   Some of the fine details have been lost because autoencoders perform **lossy compression**. We could probably mitigate this problem by training for longer with a more complex model, and possibly with a larger latent code, but there will always be some degree of loss if the latent code is smaller than the input dimension. However, this isn't always a bad thing. In some cases, when we have noisy images, we might want to throw away the noisy parts of the inputs. In this light, *autoencoders can be seen as a non-linear form of PCA*. In fact, with a single-layer encoder and decoder, tied weights, and no non-linearities, minimizing MSE recovers the same subspace as ordinary PCA.
2.   We haven't generated any new images because **an autoencoder is not a generative model**. It is simply a model which has been trained to compress its inputs into codes and then reconstruct them as accurately as possible. As a result, if we run our decoder on a randomly sampled code, we should not expect to get an output that looks like clothing:


In [None]:
noise = tf.random.normal([1, latent_dim])
decoded_image = decoder(noise, training=False)

plt.imshow(decoded_image[0, :, :, 0], cmap='gray_r')
plt.axis('off')
plt.show()

If we want to be able to generate *new* images of clothing, we'll need to use a VAE!

## From AE to VAE

A VAE is a more probabilistic take on a standard autoencoder (AE), and the two approaches share many similarities. The main difference is that rather than the encoder producing a code for a given input and the decoder using that code to reconstruct that input, **the encoder of a VAE produces the parameters of a probability distribution that explains the data**. More specifically, it produces an approximation to the posterior distribution $p(z|x)$ over latent codes. Then, given samples from this distribution, **the decoder generates new input data samples**. In other words, we have learnt a latent variable model for the data. Typically we model each dimension of our latent space with a Gaussian distribution with some mean $\mu_z$ and variance $\sigma_z^2$.

As a result of this small modelling difference, the training of a VAE differs somewhat from the training of an AE. The best way to show these differences is to jump right into it, so let's try to train a VAE on Fashion MNIST. Since the VAE is a little more complicated to train than an AE, we'll define the training loop using `tf.GradientTape()` rather than using `model.fit`. As you'll see, other than having to define the training loop and a slightly modified encoder architecture, a VAE is almost the same as an AE. Let's start by defining the encoder and decoder for the VAE:

In [None]:
# some useful constants and hyper-parameters
img_shape = (28, 28, 1)
latent_dim = 100

Let's first define our encoder, which in the context of VAEs is often called the **inference network**:

In [None]:
def make_inference_net(latent_dim, img_shape):
  inference_net = tf.keras.Sequential(name="inference_net")
  
  inference_net.add(tf.keras.layers.Conv2D(32,
                                     kernel_size=3,
                                     strides=(2, 2),
                                     activation='relu',
                                     input_shape=img_shape))
  
  assert inference_net.input_shape == (None, 28, 28, 1), "input is wrong shape"
  
  inference_net.add(tf.keras.layers.Conv2D(64,
                                     kernel_size=3,
                                     strides=(2, 2),
                                     activation='relu'))
  
  inference_net.add(tf.keras.layers.Flatten())
  inference_net.add(tf.keras.layers.Dense(latent_dim*2))
  
  assert inference_net.output_shape == (None, latent_dim*2), "output is wrong shape"
  
  return inference_net

inference_net = make_inference_net(latent_dim, img_shape)

inference_net.summary()

And now the decoder, which is also called the **generative network** in the context of VAEs.

In [None]:
def make_generative_net(latent_dim):
  generative_net = tf.keras.Sequential(name="generative_net")
  
  generative_net.add(tf.keras.layers.Dense(7*7*32,
                                    activation='relu',
                                    input_dim=latent_dim))
  
  assert generative_net.input_shape == (None, latent_dim), "input is wrong shape"
  
  generative_net.add(tf.keras.layers.Reshape(target_shape=(7, 7, 32)))
  generative_net.add(tf.keras.layers.Conv2DTranspose(64,
                                              kernel_size=3,
                                              strides=(2, 2),
                                              padding="SAME",
                                              activation='relu'))
  
  generative_net.add(tf.keras.layers.Conv2DTranspose(32,
                                              kernel_size=3,
                                              strides=(2, 2),
                                              padding="SAME",
                                              activation='relu'))
  
  generative_net.add(tf.keras.layers.Conv2DTranspose(1,
                                             kernel_size=3,
                                             strides=(1, 1),
                                             padding="SAME"
                    ))
  
  assert generative_net.output_shape == (None, 28, 28, 1), "output is wrong shape"
  
  return generative_net

generative_net = make_generative_net(latent_dim)
generative_net.summary()


**Exercise:** What differences can you see between our definitions for the encoder and decoder above and the definitions for the standard autoencoder?


Because we are defining the training loop, we'll use `tf.data` to batch and shuffle the data.

In [None]:
batch_size = 256
vae_train_dataset = tf.data.Dataset.from_tensor_slices(ae_train_images).shuffle(batch_size*10).batch(batch_size)
vae_test_dataset = tf.data.Dataset.from_tensor_slices(ae_test_images).batch(batch_size)


Now let's define the loss function and the model optimizer. We are no longer using mean squared error for the loss because rather than only trying to reconstruct the input image, we are also trying to learn a distribution from which we can generate new images. Our new loss, called the ELBO, is a little more complicated than MSE. It is described in more detail below for those of you who are interested, but we'll quickly go over some of the important details now. The ELBO contains two terms:

1. a reconstruction term, which does the same job as MSE in a regular autoencoder, and 
2. a Kullback-Leibler divergence (KLD) term, which helps our VAE to learn a distribution for the data.

There is one slight complication: the ELBO contains an expected value, which means that we can only calculate an approximation of it by taking samples and averaging. Sampling is not directly differentiable, but a little trick (called the **reparameterization trick**) lets us write the sample in a form that TensorFlow can optimize.

In [None]:
optimizer = tf.keras.optimizers.Adam(1e-4)

# helper function
def log_normal_pdf(sample, mean, logvar, raxis=1):
  log2pi = tf.math.log(2. * np.pi)
  return tf.reduce_sum(
      -.5 * ((sample - mean) ** 2. * tf.exp(-logvar) + logvar + log2pi),
      axis=raxis)

cross_entropy = tf.keras.losses.BinaryCrossentropy(from_logits=False)

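# loss function: returns the *negative* ELBO averaged over the batch, so that
# minimizing this value maximizes the ELBO
def compute_ELBO(inference_net, generative_net, x):
  # the inference net outputs the mean and log-variance of q(z|x), concatenated
  mean, logvar = tf.split(inference_net(x), num_or_size_splits=2, axis=1)
  
  # reparameterization trick: z = mean + eps * std, with eps ~ N(0, I)
  eps = tf.random.normal(shape=tf.shape(mean))
  z = eps * tf.exp(logvar * .5) + mean
  
  x_logit = generative_net(z)

  # reconstruction term: log p(x|z) under a per-pixel Bernoulli likelihood
  cross_ent = tf.nn.sigmoid_cross_entropy_with_logits(logits=x_logit, labels=x)
  logpx_z = -tf.reduce_sum(cross_ent, axis=[1, 2, 3])
  
  # KLD term, estimated from the single sample z: log p(z) - log q(z|x)
  logpz = log_normal_pdf(z, 0., 0.)
  logqz_x = log_normal_pdf(z, mean, logvar)
  
  return -tf.reduce_mean(logpx_z + logpz - logqz_x)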
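
To see why the negative sigmoid cross-entropy plays the role of $\log p(x|z)$ in the reconstruction term, the NumPy sketch below compares it with a Bernoulli log-likelihood written out directly. It assumes the numerically stable formula `max(l, 0) - l * y + log(1 + exp(-|l|))` given in the TensorFlow documentation for `tf.nn.sigmoid_cross_entropy_with_logits`. The function names and toy values are for illustration only.

```python
import numpy as np

def sigmoid_cross_entropy_with_logits(logits, labels):
    """Numerically stable form: max(l, 0) - l * y + log(1 + exp(-|l|))."""
    return (np.maximum(logits, 0.0) - logits * labels
            + np.log1p(np.exp(-np.abs(logits))))

def bernoulli_log_likelihood(logits, labels):
    """y * log(p) + (1 - y) * log(1 - p), with p = sigmoid(logits)."""
    p = 1.0 / (1.0 + np.exp(-logits))
    return labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p)

logits = np.array([-3.0, -0.5, 0.0, 2.0])  # decoder outputs for four pixels
labels = np.array([0.0, 0.2, 1.0, 0.9])    # pixel intensities in [0, 1]

print(-sigmoid_cross_entropy_with_logits(logits, labels))
print(bernoulli_log_likelihood(logits, labels))
```

Both lines print the same values, so summing the negative cross-entropy over all pixels gives the log-likelihood of the image under independent per-pixel Bernoulli distributions.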


### Optional extra reading: the ELBO

Let's take a closer look at what the ELBO is and how it is derived. 

ELBO stands for **E**vidence **L**ower **BO**und. That is a pretty loaded term so it might not be immediately apparent what that means. Let's unpack it by first looking at the term *evidence* and then at *lower bound*. 

Evidence, also known as marginal likelihood, comes from Bayes' rule:

\begin{align}
p_{\theta}(z|x) & = \frac{p_{\theta}(x|z) \, p(z)}{p_{\theta}(x)} \\
 & = \frac{p_{\theta}(x|z) \, p(z)}{\sum_{z\in\mathcal{Z}}p_{\theta}(x|z) \, p(z)} \\
\textrm{posterior} & = \frac{\textrm{likelihood} \times \textrm{prior}}{\textrm{evidence (or marginal likelihood)}}
\end{align}

where $\theta$ are the parameters of some distribution $p$, and $\mathcal{Z}$ is the set of all possible latent codes ($z$). The marginal likelihood tells us how well the data supports our model. In other words, it tells us how good our model is based on the evidence provided by the data. **The higher the evidence, the better our model is**. 
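
A tiny worked example may help make the terms concrete. The sketch below assumes a toy model with only two possible latent codes. The prior and likelihood values are made up for illustration.

```python
import numpy as np

prior = np.array([0.7, 0.3])       # p(z) for z in {0, 1}
likelihood = np.array([0.2, 0.9])  # p(x|z) for one fixed observation x

evidence = np.sum(likelihood * prior)      # p(x): marginalize out z
posterior = likelihood * prior / evidence  # p(z|x) via Bayes' rule

print(evidence)
print(posterior, posterior.sum())
```

With only two codes the sum in the denominator is trivial. The difficulty discussed next arises because, in a VAE, $z$ is continuous and high-dimensional, so the sum becomes an integral over the whole latent space.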

We want to optimize the parameters of our model to maximize the evidence. Unfortunately, there is a problem with doing that. As shown above, the evidence $p_\theta(x)$ is calculated by marginalizing (summing or integrating) out $z$ from the term $p_{\theta}(x|z) \, p(z)$. This calculation is almost always computationally infeasible, which brings us to the *lower bound*.

The lower bound refers to a lower bound of the evidence. In optimization, when we are unable to calculate some quantity, let's call it $A$, which we wish to maximize, we can use a trick to perform the optimization indirectly. If we have some other quantity, let's call it $B$, which by definition is always lower than or equal to $A$ **and we can calculate it**, then we can maximize $B$ instead: since $A$ can never be smaller than $B$, pushing $B$ up also pushes up a guaranteed minimum for $A$. ($B$ is called a **lower bound** of $A$.)

Okay, so we want to maximize the evidence but can't because we are unable to calculate it. So all we need to do is find a lower bound which we can calculate. Let's do this by taking a closer look at the marginal likelihood. More specifically, we will take a look at the log-marginal likelihood. We can do this because the $\log$ function is monotonically increasing, which means that a positive function $f(x)$ and $\log f(x)$ attain their maxima/minima at the same values of $x$.

\begin{align}
\log p_{\theta}(x) & = \log \sum_{z\in\mathcal{Z}} p_{\theta}(x, z) \\
& = \log \sum_{z\in\mathcal{Z}} p_{\theta}(x | z) \, p(z) \\
& = \log \sum_{z\in\mathcal{Z}} q_{\phi}(z | x) \left[ \frac{ p_{\theta}(x | z) \, p(z) }{q_{\phi}(z | x)} \right] && \triangleright \textrm{Multiply by } 1 = \frac{q_{\phi}(z|x)}{q_{\phi}(z|x)}.\\
& \ge \sum_{z\in\mathcal{Z}} q_{\phi}(z | x) \log \left[ \frac{ p_{\theta}(x | z) \, p(z) }{q_{\phi}(z | x)} \right] && \triangleright \textrm{Jensen's Inequality.}\\
& = \mathbb{E}_{z \sim q_{\phi}(z|x)}\Big[ \log p_{\theta}(x | z) + \log p(z) - \log q_{\phi}(z | x)\Big]  && \triangleright \textrm{Apply log laws.}\\
& \equiv \mathcal{L}(\theta, \phi)
\end{align}

Okay, so what have we accomplished here? Well, we have a derivation for some quantity $\mathcal{L}(\theta, \phi)$ which is less than or equal to the log-marginal likelihood. In other words, $\mathcal{L}(\theta, \phi)$ is a lower bound for the (log) evidence, which means that by maximizing $\mathcal{L}(\theta, \phi)$ we push up a guaranteed minimum for the evidence. Hopefully, it is clear that $\mathcal{L}(\theta, \phi)$ is the ELBO. However, it is not clear that we can calculate $\mathcal{L}(\theta, \phi)$. So let's look at each term in $\mathcal{L}(\theta, \phi)$, so that we can figure out what they are and if we can calculate them:

* $\log p_{\theta}(x | z)$ is the **log-likelihood** of the data ($x$) given some code ($z$). It is computed from the output of our decoder, also called the *generative network* in the VAE literature, so we can easily calculate it.
* $\log p(z)$ is the log of the **prior** over the codes. The prior is something that we can choose, which means we can choose it to be easy to calculate.
* But what about $\log q_{\phi}(z | x)$? We seemingly added this at random in our derivation of the ELBO. What is it? Can we calculate it? $q_{\phi}(z | x)$ is an **approximation of the true posterior** $p_{\theta}(z | x)$. The reason we need an approximation is that without knowing $p_{\theta}(x)$ we cannot calculate $p_{\theta}(z | x)$ (see Bayes' rule above). Since we are making an approximation, we can choose $q_{\phi}(z | x)$ to ensure that we can calculate it.

**Exercise:** Can you think of a class of extremely flexible functions which we can both train and calculate and which we can use for our approximation $q_{\phi}(z | x)$?

You guessed it; we are going to use another neural network for $q_{\phi}(z | x)$. This network is our encoder, also called the *inference network* in the VAE literature.

Fantastic, this means that we can indeed maximize the ELBO and therefore push up our lower bound on the evidence $p_\theta(x)$. The beautiful thing is that we can use TensorFlow to optimize $\mathcal{L}(\theta, \phi)$, by tuning the parameters of our encoder and decoder, $\phi$ and $\theta$ respectively, using good old gradient descent on the negative ELBO!
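
To make the bound tangible, here is a small numerical check on a toy model with three discrete latent codes, where the evidence can still be computed exactly. All numbers and names are invented for illustration. The candidate distributions stand in for different choices of $q_{\phi}(z|x)$.

```python
import numpy as np

prior = np.array([0.5, 0.3, 0.2])       # p(z)
likelihood = np.array([0.1, 0.6, 0.3])  # p(x|z) for one fixed x

log_evidence = np.log(np.sum(likelihood * prior))

def elbo(q):
    """E_q[log p(x|z) + log p(z) - log q(z|x)] for a discrete q."""
    q = np.asarray(q, dtype=np.float64)
    return np.sum(q * (np.log(likelihood) + np.log(prior) - np.log(q)))

q_uniform = np.array([1 / 3, 1 / 3, 1 / 3])
q_skewed = np.array([0.8, 0.1, 0.1])
true_posterior = likelihood * prior / np.sum(likelihood * prior)

for name, q in [("uniform", q_uniform),
                ("skewed", q_skewed),
                ("true posterior", true_posterior)]:
    print(name, elbo(q), "<=", log_evidence)
```

Every choice of $q$ gives a value no larger than the log-evidence, and in this toy example the bound is met exactly when $q$ equals the true posterior.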

**Exercise:** Can you think of any issues with maximizing the ELBO rather than directly maximizing the evidence? (*Hint*: think about the fact that it is a *lower bound*.)



#### Optional extra reading: the reparameterization trick

There is one more detail of the ELBO that we need to take care of before we can implement a VAE, and you might already have noticed it, namely the expectation in its definition:

\begin{equation}
\mathcal{L}(\theta, \phi) = \mathbb{E}_{z \sim q_{\phi}(z|x)}\Big[ \log p_{\theta}(x | z) + \log p(z) - \log q_{\phi}(z | x)\Big] 
\end{equation}

The expectation of the $\log p(z) - \log q_{\phi}(z | x)$ term is just the negative KL-divergence between $q_{\phi}(z | x)$ and $p(z)$, which can be calculated analytically when both distributions are Gaussian, which means we are left with

\begin{equation}
\mathcal{L}(\theta, \phi) = \mathbb{E}_{z \sim q_{\phi}(z|x)}\Big[ \log p_{\theta}(x | z)\Big] - D_{KL}\Big(q_{\phi}(z | x) \,\|\, p(z)\Big)
\end{equation}

which still has an expectation in it.

Remember that $\mathcal{L}(\theta, \phi)$ is our optimization objective; in other words, its negative is the loss when training a VAE. Unless you're familiar with reinforcement learning or variational inference, it might seem a bit weird that we are optimizing an expectation. You might be asking yourself questions like "Is sampling a differentiable operation?" or "Can we take the gradients?", and these are good questions to ask. Their answers are "Not directly" and "Yes, but with a trick".

It turns out that this is a common issue in stochastic optimization, and luckily for us, there is a simple solution. The solution is called the **reparameterization trick**. Here is how it works. Rather than sampling $z$ directly from $q_{\phi}(z|x)$, we can sample $\epsilon$ from a Gaussian distribution with zero mean and unit variance and then transform $\epsilon$ to get $z$ as follows:

\begin{equation}
z = \mu_\phi(x) + \epsilon \times \sigma_\phi(x)  
\end{equation}

where $\mu_\phi(x)$ and $\sigma_\phi(x)$ are the outputs of our encoder for an input $x$.
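
The transformation can be checked numerically. In the NumPy sketch below, `mu` and `sigma` are fixed stand-ins for the encoder outputs (an assumption for illustration; in the VAE they depend on $x$ and $\phi$), and all the randomness comes from `eps`.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.5, -2.0])    # stand-in for the encoder's mean output
sigma = np.array([0.5, 2.0])  # stand-in for the encoder's standard deviation

# eps is drawn from N(0, I) and does not depend on any model parameters
eps = rng.standard_normal(size=(100_000, 2))

# z is a deterministic, differentiable function of mu and sigma given eps
z = mu + eps * sigma

print(z.mean(axis=0))  # close to mu
print(z.std(axis=0))   # close to sigma
```

The samples have the desired mean and standard deviation, yet for a fixed `eps` the derivative of `z` with respect to `mu` is 1 and with respect to `sigma` is `eps`, so gradients can flow back into the encoder.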

Now we can rewrite the ELBO as

\begin{equation}
\mathcal{L}(\theta, \phi) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0,1)}\Big[ \log p_{\theta}(x | z = \mu_\phi(x) + \epsilon \times \sigma_\phi(x) )\Big]  - D_{KL}\big(q_{\phi}(z | x) \,\|\, p(z)\big)
\end{equation}

which we can approximate using a Monte-Carlo estimate of the expectation over $N$ samples $\epsilon_n \sim \mathcal{N}(0,1)$ (note that $N$ now denotes the number of samples, not the number of input dimensions)

\begin{equation}
\mathcal{L}(\theta, \phi) \approx \frac{1}{N}\sum_{n = 1}^{N}\Big[ \log p_{\theta}(x | z = \mu_\phi(x) + \epsilon_n \times \sigma_\phi(x) ) \Big]  - D_{KL}\big(q_{\phi}(z | x) \,\|\, p(z)\big)
\end{equation}

and this is finally something that TensorFlow can work with.

Note that although we have kept the KLD term separate from our Monte-Carlo estimate since it can be calculated analytically, we can also estimate it using Monte-Carlo, which is what we will be doing in this practical for simplicity.
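
The two ways of computing the KLD term can be compared directly. The sketch below uses the standard closed form for the KL-divergence between a diagonal Gaussian and a standard normal prior, $\frac{1}{2}\sum_i(\mu_i^2 + \sigma_i^2 - 1 - \log\sigma_i^2)$, and a NumPy port of the `log_normal_pdf` helper. The mean and log-variance values are made up for illustration.

```python
import numpy as np

def log_normal_pdf(sample, mean, logvar):
    """Log-density of a diagonal Gaussian, summed over the last axis."""
    log2pi = np.log(2.0 * np.pi)
    return np.sum(
        -0.5 * ((sample - mean) ** 2 * np.exp(-logvar) + logvar + log2pi),
        axis=-1)

rng = np.random.default_rng(0)
mean = np.array([0.5, -1.0, 0.0])    # stand-in for the encoder's mean
logvar = np.array([-0.5, 0.3, 0.0])  # stand-in for the encoder's log-variance

# closed form of KL(q(z|x) || p(z)) with p(z) = N(0, I)
kl_exact = 0.5 * np.sum(mean ** 2 + np.exp(logvar) - 1.0 - logvar)

# Monte-Carlo estimate: average of log q(z|x) - log p(z) over z ~ q(z|x)
eps = rng.standard_normal(size=(200_000, 3))
z = mean + eps * np.exp(0.5 * logvar)
kl_mc = np.mean(log_normal_pdf(z, mean, logvar) - log_normal_pdf(z, 0.0, 0.0))

print(kl_exact, kl_mc)
```

With many samples the two values agree closely. The practical's loss uses a single sample per image, which is a much noisier estimate but is averaged over the batch and over many training steps.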

### Training the VAE

In [None]:
# if you change the definitions for the inference_net and/or generative_net,
# or if you want to restart training...
# RE-RUN THIS CELL!

train_loss = tf.keras.metrics.Mean(name="train_loss")
test_loss = tf.keras.metrics.Mean(name="test_loss")

@tf.function
def train_step(inference_net, generative_net, images):
  with tf.GradientTape() as tape:
    loss = compute_ELBO(inference_net, generative_net, images)

  trainable_variables = inference_net.trainable_variables + generative_net.trainable_variables
  gradients = tape.gradient(loss, trainable_variables)

  optimizer.apply_gradients(zip(gradients, trainable_variables))
  
  train_loss(loss)
  
# fixed latent codes, reused after every epoch so that the images are comparable
noise = np.random.normal(0, 1, (8, latent_dim)).astype('float32')

def generate_and_display_images(generative_net):
    predictions = generative_net(noise, training=False)
    # normalize to be in the range [0, 1]
    predictions = tf.sigmoid(predictions)

      
    fig = plt.figure(figsize=(16,2))

    for i in range(predictions.shape[0]):
      plt.subplot(1, 8, i+1)
      plt.imshow(predictions[i, :, :, 0], cmap='gray_r')
      plt.axis('off')

    plt.show()
    
def train_VAE(inference_net, generative_net, epochs):
  generate_and_display_images(generative_net)

  for epoch in range(epochs):
    start = time.time()

    # reset the running means so that each epoch reports its own average loss
    train_loss.reset_states()
    test_loss.reset_states()

    for image_batch in vae_train_dataset:
      train_step(inference_net, generative_net, image_batch)
            
    end = time.time()
    
    for image_batch in vae_test_dataset:
      test_loss(compute_ELBO(inference_net, generative_net, image_batch))
      
    # compute_ELBO returns the negative ELBO, so negate it for reporting
    template = 'Epoch {:03d}: time {:.3f} sec, train ELBO {:.3f}, test ELBO {:.3f}'
    print(template.format(epoch + 1, end-start, -train_loss.result(), -test_loss.result()))
    
    generate_and_display_images(generative_net)

inference_net = make_inference_net(latent_dim, img_shape)
generative_net = make_generative_net(latent_dim)

Let's train our VAE! We start to see reasonable results after around 20 epochs, which takes about 2-3 minutes to train using a GPU in Colab. If you have time, try training for 50 epochs and see how much of a difference it makes.

In [None]:
num_epochs = 20
train_VAE(inference_net, generative_net, num_epochs)

Now, let's compare images decoded from samples of the posterior distribution $p(z|x)$, as approximated by our inference network (bottom), with the corresponding data $x$ from the test dataset (middle) and images decoded from samples of the prior distribution $p(z)$ (top).

**Exercise:** Were we displaying samples from the prior or the posterior during the training of our VAE above?

In [None]:
#@title Comparison (RUN ME!)
fig = plt.figure(figsize=(16,6))

for i in range(8):
  prior_sample = noise[i:i+1, :]
  prior_image = generative_net(prior_sample, training=False)
  prior_image = tf.sigmoid(prior_image)

  plt.subplot(3, 8, i+1)
  plt.imshow(prior_image[0, :, :, 0], cmap='gray_r')
  plt.axis('off')

  data = np.reshape(ae_test_images[np.random.randint(10000)], (1,28,28,1))
  plt.subplot(3, 8, i+9)
  plt.imshow(data[0, :, :, 0], cmap='gray_r')
  plt.axis('off')

  mean, logvar = tf.split(inference_net(data), num_or_size_splits=2, axis=1)
  posterior_sample = prior_sample * tf.exp(logvar * .5) + mean
  posterior_image = generative_net(posterior_sample, training=False)
  posterior_image = tf.sigmoid(posterior_image)

  plt.subplot(3, 8, i+17)
  plt.imshow(posterior_image[0, :, :, 0], cmap='gray_r')
  plt.axis('off')

plt.show()



Notice that the prior samples do not necessarily look like any particular item of clothing, which means that there is still plenty of room for improvement in our model. Similarly, while the posterior samples do look something like the corresponding test dataset images, there is plenty of room for improvement. That said, it should be clear that we have successfully trained a VAE to model images of clothing if we compare the prior samples here to the random samples we tried to take from the standard autoencoder.

## VAE Tasks

1. **[All]** Try and improve the performance of our VAE. (*Hint*: try making the inference and generative networks more complex and train them for longer. Don't remove the assert statements from the `make_inference_net` and `make_generative_net` functions.) 
2. **[All]** Experiment with interpolating between two random codes $z_1$ and $z_2$, and examine what happens to the output of the generative network as you do so.
3. **[Advanced]** Modify the VAE code above to implement a [Beta-VAE](https://openreview.net/forum?id=Sy2fzU9gl). You can read more about Beta-VAE [here](https://lilianweng.github.io/lil-log/2018/08/12/from-autoencoder-to-beta-vae.html#beta-vae).
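
As a possible starting point for task 2, the sketch below builds a straight-line path between two random codes using NumPy. The helper name `interpolate_codes` and the choice of linear interpolation are assumptions made for illustration; other interpolation schemes are worth trying too.

```python
import numpy as np

def interpolate_codes(z1, z2, num_steps=8):
    """Return num_steps codes evenly spaced on the line from z1 to z2."""
    alphas = np.linspace(0.0, 1.0, num_steps).reshape(-1, 1)
    return ((1.0 - alphas) * z1 + alphas * z2).astype("float32")

rng = np.random.default_rng(0)
latent_dim = 100
z1 = rng.standard_normal(latent_dim).astype("float32")
z2 = rng.standard_normal(latent_dim).astype("float32")

codes = interpolate_codes(z1, z2)
print(codes.shape)  # one row per interpolation step
```

The rows of `codes` could then be decoded in one batch, for example with `tf.sigmoid(generative_net(codes, training=False))`, and plotted side by side in the same way as the prior samples above.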

## VAE Extra Reading



* [Keras blog](https://blog.keras.io/building-autoencoders-in-keras.html) on various kinds of autoencoders.
* [2017's VAE Prac](https://github.com/deep-learning-indaba/practicals2017/blob/master/practical5.ipynb).
* [Auto-Encoding Variational Bayes](https://arxiv.org/pdf/1312.6114.pdf) (the VAE paper).
* [TF 2.0 VAE Example](https://www.tensorflow.org/beta/tutorials/generative/cvae)

**IMPORTANT: Please fill out the exit ticket form before you leave the practical: https://forms.gle/ZLiuTPct4q3BzKtY8**

## What we didn't talk about


This practical has given an overview of two kinds of deep generative models, namely GANs and VAEs. However, these are not the only methods out there. Other important kinds of generative models include:

*  [Normalizing flows](https://lilianweng.github.io/lil-log/2018/10/13/flow-based-deep-generative-models.html) (e.g. [RealNVP](https://arxiv.org/abs/1605.08803), [Glow](https://arxiv.org/abs/1807.03039))
*  Auto-regressive models (e.g. [WaveNet](https://arxiv.org/pdf/1609.03499.pdf), [PixelRNN](https://arxiv.org/pdf/1601.06759.pdf), [PixelCNN](https://arxiv.org/pdf/1606.05328.pdf))
*  and a whole range of models that are not based on deep learning, such as [Latent Dirichlet Allocation](http://ai.stanford.edu/~ang/papers/jair03-lda.pdf).


