# Machine Vision - Assignment 5: Generating Images with Variational Autoencoders

---

Prof. Dr. Markus Enzweiler, Esslingen University of Applied Sciences

markus.enzweiler@hs-esslingen.de

---

This is the fifth assignment for the "Machine Vision" lecture. 
It covers:
* training variational autoencoders and sampling them to generate new images
* starting with existing TensorFlow / Keras code and adapting it to new problems
* working with the [MNIST](http://yann.lecun.com/exdb/mnist/) and [Labeled Faces in the Wild](http://vis-www.cs.umass.edu/lfw/) datasets

**Make sure that "GPU" is selected in Runtime -> Change runtime type**

This assignment assumes that you already have some experience with Python and NumPy. You can either use [Google Colab](https://colab.research.google.com/) for free with a private (dedicated) Google account (recommended) or a local Jupyter installation.

---


## Exercise 1 - Using Variational Autoencoders (VAE) on MNIST (10 points)

This exercise involves working with existing TensorFlow / Keras code and experimenting with it.



1.   Work through the [TensorFlow Convolutional Variational Autoencoder Tutorial](https://www.tensorflow.org/tutorials/generative/cvae) that trains a VAE on MNIST. It involves encoding 28x28 pixel MNIST images into **2(!) latent dimensions** and reconstructing the images from those 2 dimensions. Copy the tutorial to your Colab (or local) workspace and run the Jupyter notebook. Try to understand the main concepts in this tutorial, such as data, encoder/decoder CNN architecture, latent feature space, as we have seen in the lecture. **Do not get lost in details, such as the exact loss formulation and exact sampling procedure from the model (although it might be hard for TIB / SWB students ...).** 
2.   Make the following adaptations to the code:

*   Visualize the original input test samples next to their reconstructed versions in the function ```generate_and_save_images()```
*   Create a function that draws random samples from the VAE model. Sample 100 images and display them. *Hint: Have a look at ```model.sample()```*
* Experiment with different values for the dimensions of the latent feature space. Besides 2 latent dimensions, also try 8, 32, and 128. How does the quality of the reconstructions and random samples change with the dimensionality of the latent feature space? *Hint: You need to disable ```plot_latent_images()``` for latent feature spaces with more than 2 dimensions.*
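
The sampling and display step can be sketched independently of TensorFlow. The following is a minimal sketch under stated assumptions: the decoder maps latent vectors of shape `(N, latent_dim)` to images of shape `(N, H, W, C)`. `fake_decode` is a hypothetical stand-in for that decoder (it is not a trained model), and `tile_images` is a helper name introduced here only for illustration.

```python
import numpy as np


def tile_images(images, rows, cols):
    """Arrange images of shape (N, H, W, C) into one (rows*H, cols*W, C) mosaic."""
    n, h, w, c = images.shape
    assert n == rows * cols, "need exactly rows * cols images"
    grid = images.reshape(rows, cols, h, w, c)
    # Reorder to (rows, H, cols, W, C) so that merging axes yields a 2D mosaic.
    grid = grid.transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * h, cols * w, c)


def fake_decode(z, height=28, width=28):
    """Hypothetical stand-in for the VAE decoder: NOT a trained model."""
    # Squash the mean of each latent vector into (0, 1) and paint a flat image.
    level = 1.0 / (1.0 + np.exp(-z.mean(axis=1)))
    flat = np.broadcast_to(level[:, None, None, None],
                           (z.shape[0], height, width, 1))
    return flat.copy()


latent_dim = 2
rng = np.random.default_rng(0)
# Draw 100 latent vectors from the standard normal prior N(0, I).
z = rng.standard_normal((100, latent_dim)).astype("float32")
samples = fake_decode(z)
mosaic = tile_images(samples, rows=10, cols=10)
print(mosaic.shape)
```

With matplotlib, `plt.imshow(mosaic[:, :, 0], cmap='gray')` would then show all 100 samples in a single figure. Only `latent_dim` and the decoder need to change when trying 8, 32, or 128 latent dimensions.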









## Exercise 2 - Generating virtual faces with Variational Autoencoders (10 points) 

---

Adapt your code from exercise 1 to use the [Labeled Faces in the Wild](http://vis-www.cs.umass.edu/lfw/) dataset instead of MNIST. The goal is to train a VAE model that can reconstruct existing face images and sample novel ones. 

1. Integrate the code provided below to download, extract and preprocess the LFW dataset into your existing VAE MNIST code. We will be using images of size 64 x 64 pixels (instead of 28 x 28 pixels in the case of MNIST). Additionally, LFW involves three-channel RGB color images. We need to account for both differences in the encoder / decoder CNN architecture. Also, when displaying images, make sure to display all channels, e.g. ```plt.imshow(lfwDataset[i, :, :, :])```. 

2. Split the LFW dataset (13233 images) into 11233 training images and 2000 test images. *Hint: You can use [sklearn.model_selection.train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) for that. Your output should then correspond to ```train_images``` and ```test_images``` in the VAE MNIST tutorial (exercise 1).*
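
The split in step 2 can be illustrated with a NumPy-only sketch. It uses a shuffled index permutation as a plain stand-in for `train_test_split()` and a small dummy array in place of `lfwDataset` (both are assumptions made to keep the example cheap and self-contained).

```python
import numpy as np

# Dummy stand-in for lfwDataset: 13233 tiny "images" (assumption: the real
# array has shape (13233, 64, 64, 3); small images keep this sketch cheap).
num_images = 13233
dataset = np.zeros((num_images, 4, 4, 3), dtype="float32")
dataset[:, 0, 0, 0] = np.arange(num_images)  # tag each image with its index

num_test = 2000
rng = np.random.default_rng(42)        # fixed seed for a reproducible split
indices = rng.permutation(num_images)  # shuffle before splitting

test_images = dataset[indices[:num_test]]
train_images = dataset[indices[num_test:]]

print(train_images.shape, test_images.shape)
```

With scikit-learn, passing an integer `test_size=2000` to `train_test_split()` requests the same absolute number of test images.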

3. Adapt the encoder / decoder CNN architecture from the MNIST VAE as follows:

 **Encoder (64x64x3 images -> 2*32 outputs: mean and log-variance per latent dimension, as in the tutorial)**
 *   [InputLayer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/InputLayer) with ```input_shape=(64,64,3)``` (64x64 images with three channels, RGB)
 *   [Conv2D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv2D) with 32 filters, 3x3 kernels, strides of 2x2, and ReLU activation (tensor dimensions: 64x64x3 -> 31x31x32) 
 *   [Conv2D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv2D) with 32 filters, 3x3 kernels, strides of 2x2, and ReLU activation (tensor dimensions: 31x31x32 -> 15x15x32) 
 *   [Conv2D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv2D) with 64 filters, 3x3 kernels, strides of 2x2, and ReLU activation (tensor dimensions: 15x15x32 -> 7x7x64) 
 *   [Conv2D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv2D) with 128 filters, 3x3 kernels, strides of 2x2, and ReLU activation (tensor dimensions: 7x7x64 -> 3x3x128) 
 * [Flatten](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Flatten)  (tensor dimensions: 3x3x128 -> 1152)
 * [Dense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense) with ```latent_dim + latent_dim``` dimensions (predict mean and log-variance per latent dimension, as in the tutorial) (tensor dimensions: 1152 -> 64)

 The encoder should look as follows (output of ```self.encoder.summary()```):

 ```
    Model: "sequential_6"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #   
    =================================================================
    conv2d_15 (Conv2D)           (None, 31, 31, 32)        896       
    _________________________________________________________________
    conv2d_16 (Conv2D)           (None, 15, 15, 32)        9248      
    _________________________________________________________________
    conv2d_17 (Conv2D)           (None, 7, 7, 64)          18496     
    _________________________________________________________________
    conv2d_18 (Conv2D)           (None, 3, 3, 128)         73856     
    _________________________________________________________________
    flatten_3 (Flatten)          (None, 1152)              0         
    _________________________________________________________________
    dense_6 (Dense)              (None, 64)                73792     
    =================================================================
    Total params: 176,288
    Trainable params: 176,288
    Non-trainable params: 0
 ```
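
The tensor dimensions and parameter counts in the encoder summary can be checked by hand. The sketch below assumes the Keras default `"valid"` padding for `Conv2D`; the helper names `conv_out_size` and `conv_params` are introduced here for illustration only.

```python
def conv_out_size(size, kernel, stride):
    """Output size of a convolution with 'valid' padding (no padding)."""
    return (size - kernel) // stride + 1


def conv_params(kernel, in_channels, filters):
    """Weights (kernel x kernel x in_channels per filter) plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


size, channels = 64, 3      # 64x64 RGB input
total = 0
for filters in (32, 32, 64, 128):
    params = conv_params(3, channels, filters)
    size, channels = conv_out_size(size, 3, 2), filters
    total += params
    print(f"Conv2D: {size}x{size}x{channels}, params={params}")

flat = size * size * channels                    # Flatten
latent_dim = 32
dense_params = flat * (2 * latent_dim) + 2 * latent_dim
total += dense_params
print(f"Flatten: {flat}, Dense params={dense_params}, total={total}")
```

If your own `self.encoder.summary()` reports different shapes, the padding or stride settings are the first thing to compare against this arithmetic.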

 **Decoder (32 latent dimensions -> 64x64x3 images)**

 The overall decoder architecture uses the following principle: Fully connect 32 input neurons (the number of latent feature dimensions) to a grid of 16x16x32 neurons. With transposed convolutions (with the *correct* strides, kernel sizes and padding), we can subsequently increase the tensor dimensions from 16x16x32 up to 64x64x128. For the last layer, a convolutional layer with three different filter kernels of size 1x1x128 (!) is used to shrink the 64x64x128 tensor to 64x64x3, our output image size. 1x1 convolutions might not seem intuitive, but they can be used to shrink feature maps in the ```depth``` dimension without affecting the spatial x,y dimensions of the output; see [this explanation](https://machinelearningmastery.com/introduction-to-1x1-convolutions-to-reduce-the-complexity-of-convolutional-neural-networks/) for more information (if you want). Remember that in CNNs, all convolution filters always extend to the full depth of the input tensor volume, as we have seen in the lecture. 

 *   [InputLayer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/InputLayer) with ```input_shape=(32,)``` (32 latent dimensions; note the trailing comma, since ```(32)``` is just the integer 32 and not a tuple)
 * [Dense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense) with ```units=16*16*32, activation=tf.nn.relu``` (tensor dimensions: 32 -> 8192)
 * [Reshape](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Reshape) with ```target_shape=(16, 16, 32)``` (tensor dimensions: 8192 -> 16x16x32)
 *  [Conv2DTranspose](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv2DTranspose) with 64 filters, 3x3 kernels, strides of 2x2, "same" padding, and ReLU activation (tensor dimensions: 16x16x32 -> 32x32x64) 
 *  [Conv2DTranspose](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv2DTranspose) with 128 filters, 3x3 kernels, strides of 2x2, "same" padding, and ReLU activation (tensor dimensions: 32x32x64 -> 64x64x128) 
 * [Conv2D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv2D) with 3 filters, 1x1 kernels, strides of 1x1, "same" padding and **no activation function** (tensor dimensions: 64x64x128 -> 64x64x3)
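
To make the final 1x1 convolution less mysterious, here is a minimal NumPy sketch of what it computes. It is an illustration with random numbers, not the Keras layer itself: a 1x1 convolution with three filters is the same 128 -> 3 linear map applied independently at every pixel.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.standard_normal((64, 64, 128)).astype("float32")  # H x W x depth
weights = rng.standard_normal((128, 3)).astype("float32")        # three 1x1x128 filters
bias = np.zeros(3, dtype="float32")                              # one bias per filter

# The matrix product acts on the last (depth) axis only, so the spatial
# dimensions stay 64x64 while the depth shrinks from 128 to 3.
output = features @ weights + bias

print(output.shape)
print(weights.size + bias.size)  # number of trainable parameters
```

The parameter count printed at the end matches the `conv2d` row of the decoder summary.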

  The decoder should look as follows (output of ```self.decoder.summary()```):

    ```
    Model: "sequential_11"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #   
    =================================================================
    dense_11 (Dense)             (None, 8192)              270336    
    _________________________________________________________________
    reshape_5 (Reshape)          (None, 16, 16, 32)        0         
    _________________________________________________________________
    conv2d_transpose_20 (Conv2DT (None, 32, 32, 64)        18496     
    _________________________________________________________________
    conv2d_transpose_21 (Conv2DT (None, 64, 64, 128)       73856     
    _________________________________________________________________
    conv2d_29 (Conv2D)           (None, 64, 64, 3)         387       
    =================================================================
    Total params: 363,075
    Trainable params: 363,075
    Non-trainable params: 0
    ```
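
The decoder summary can be verified the same way. This sketch relies on two facts about Keras: with `"same"` padding, a transposed convolution multiplies the spatial size by its stride, and its parameter count follows the same formula as a regular convolution. The helper names are introduced here for illustration only.

```python
def dense_params(inputs, units):
    """One weight per input-unit pair plus one bias per unit."""
    return inputs * units + units


def conv_params(kernel, in_channels, filters):
    """Same count for Conv2D and Conv2DTranspose: weights plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


latent_dim = 32
size, channels = 16, 32                      # grid produced by Dense + Reshape
total = dense_params(latent_dim, size * size * channels)

for filters in (64, 128):
    total += conv_params(3, channels, filters)
    # With "same" padding, a stride-2 transposed convolution doubles the size.
    size, channels = size * 2, filters
    print(f"Conv2DTranspose: {size}x{size}x{channels}")

total += conv_params(1, channels, 3)         # final 1x1 Conv2D to 3 channels
print(f"Output: {size}x{size}x3, total params={total}")
```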



4. Train your LFW VAE with 32 latent feature dimensions for 50 epochs on your training set. As in exercise 1, visualize the original input test samples next to their reconstructed versions in the function ```generate_and_save_images()``` and generate 100 random virtual face samples from your VAE model. Which attributes of a person (e.g. hair, head pose, skin color, clothing, ...) has the model learned, and which are missing?
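
One simple way to show originals next to their reconstructions is to concatenate each pair along the width axis before plotting. The sketch below is only an illustration: `side_by_side` is a hypothetical helper name, and the "reconstructions" are a darkened copy of random data standing in for real VAE output.

```python
import numpy as np


def side_by_side(originals, reconstructions):
    """Concatenate each original (left) with its reconstruction (right)."""
    assert originals.shape == reconstructions.shape
    # Axis 2 is the width axis of an (N, H, W, C) batch.
    return np.concatenate([originals, reconstructions], axis=2)


rng = np.random.default_rng(1)
originals = rng.random((4, 64, 64, 3), dtype=np.float32)   # values in [0, 1)
# Hypothetical stand-in for the VAE output: a slightly darkened copy.
reconstructions = np.clip(originals * 0.8, 0.0, 1.0)

pairs = side_by_side(originals, reconstructions)
print(pairs.shape)
```

Each `pairs[i]` can be passed directly to `plt.imshow()`. Since the last decoder layer has no activation function, real decoder outputs would need to be passed through a sigmoid first so that the values fall into the 0-1 range expected for float RGB images.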

### Code for LFW dataset handling

In [None]:
# Installs & Imports

!pip install progress
!pip install scikit-learn

# Python stuff
import os
import glob
import tarfile
import urllib.request
import time
from IPython import display

# OpenCV and other image handling
import cv2   
import imageio
import PIL

# NumPy                    
import numpy as np   

# TensorFlow
import tensorflow as tf
import tensorflow_probability as tfp

# Matplotlib    
import matplotlib.pyplot as plt
import matplotlib.patches as patches
# make sure we show all plots directly below each cell
%matplotlib inline 

# scikit-learn
from sklearn.model_selection import train_test_split

In [None]:
# Code for loading, extracting and pre-processing the LFW dataset. 
# Returns lfwDataset[i,:,:,:] with i being the image index.  


def loadLFWDataset(imgWidth, imgHeight):
  '''
  Load, extract and pre-process the LFW dataset. Images will be converted to 
  RGB and re-scaled to 0-1 (float). 

  Args: 
    imgWidth: target image width
    imgHeight: target image height

  Returns:
    lfwDataset[i,:,:,:] with i being the image index, followed by height, width, channels
  '''

  # load and untar the LFW dataset into the runtime
  url = "http://vis-www.cs.umass.edu/lfw/lfw.tgz"
  pathToTar = "./lfw.tgz"
  print("Downloading LFW data from {} to {}".format(url, pathToTar))
  urllib.request.urlretrieve(url, pathToTar)

  print("Extracting {}".format(pathToTar))
  with tarfile.open(pathToTar) as tar_ref:
    tar_ref.extractall(".")  # extract into the current working directory

  # get a list of paths to the individual images
  images = glob.glob('lfw/*/' + "*.jpg")
  print("Number of extracted images : {}".format(len(images)))

  # preprocess the images
  lfwDataset = np.zeros([len(images), imgHeight, imgWidth, 3], dtype='float32')

  for i, imagePath in enumerate(images):
    # read image (OpenCV loads images in BGR channel order)
    image = cv2.imread(imagePath)  
    # convert to RGB
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    # resize (cv2.resize expects the target size as (width, height)) and rescale to 0-1 (float)
    resized = cv2.resize(image, (imgWidth, imgHeight), interpolation = cv2.INTER_AREA)
    lfwDataset[i,:,:,:] = (resized / 255.0).astype('float32')
    if not ((i + 1) % 2500):
      print("Preprocessing images: {} / {}".format(i + 1, len(images)))

  print("Done \nDataset shape is {}".format(lfwDataset.shape))
  return lfwDataset

In [None]:
# load LFW dataset
imgWidth  = 64
imgHeight = 64
lfwDataset = loadLFWDataset(imgWidth, imgHeight)
numLfwImages = lfwDataset.shape[0]

In [None]:
# visualize some random samples from the dataset
plt.figure(figsize=(10,10))
indices = np.arange(numLfwImages)
np.random.shuffle(indices)
for count, i in enumerate(indices[0:25]):
    plt.subplot(5,5,count+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(lfwDataset[i,:,:,:])
plt.show()