# Generating Images of Faces with Generative Adversarial Networks
Report by Alexander Lercher, 01560095

## 1. Project Overview
Our goal was the generation of images of humans with artificial neural networks (ANNs). We found that images of whole human bodies are very hard to generate: most research papers currently focus on faces only, and training data for full-body postures is rare and often not aligned.
We already identified this problem in the proposal and defined the generation of images of faces as a fallback plan.

For our project we used a Generative Adversarial Network (GAN) to create images of faces. Here, two ANNs are connected in a way that they try to outsmart each other during training, hence the term adversary. The _generator_ generates new images of faces while the _discriminator_ tries to distinguish them from real ones [1].

Both networks and the training process were implemented from scratch in Python by using the machine learning framework TensorFlow [2]. <br />
The tasks for the project were split as follows:

| Task                       | Team member       | Description                     |
|:---------------------------|:------------------|:--------------------------------|
| Training Data Collection   | Manuel; Alexander | Collection of raw images        |
| Training Data Alignment    | Manuel            | Preparation of images for learning      |
| GAN Training Pipeline      | Alexander         | Training process implementation |
| GAN Reference Architecture | Alexander         | Image generation from [3, 4]       |
| GAN VAE Noise Input        | Manuel            | Image generation from [5]       |

The remainder of this report contains my contributions including their theory. If I describe Manuel's contributions they will be explicitly marked as his.

## 2. Training Data
We needed real images of faces for the training process so the GAN can learn what newly generated images should look like.

First, we implemented a web crawling script to download images from public websites featuring models. Unfortunately, this approach had the problem that most images were discarded during Manuel's alignment process as no distinct face was found in the picture.

Next, we downloaded multiple professional datasets from research papers and Kaggle competitions. In the end, the training process was done with the CelebA dataset [6], which contains more than 200,000 images of celebrity faces.

### Data Preparation
The dataset was prepared in a way that 
- images are converted to greyscale (1 channel),
- image dimensions match our desired size (256x256), 
- and faces are in the middle of the cropped image. 

Methods for face detection and cropping were implemented by Manuel.
After these steps, the images were combined into batches of size 64 and stored on disk as large numpy arrays with resulting shapes (64, 256, 256, 1).
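
The batching step can be sketched as follows. This is a minimal illustration with NumPy, not the project code: `images_to_batch` is a hypothetical helper, and scaling the pixel values to [-1, 1] is an assumption chosen to match a `tanh` generator output.

```python
import numpy as np

def images_to_batch(images):
    """Stack greyscale images of shape (H, W) into one array of shape (N, H, W, 1)."""
    batch = np.stack(images).astype(np.float32)   # (N, H, W)
    batch = (batch - 127.5) / 127.5               # assumption: scale [0, 255] to [-1, 1]
    return batch[..., np.newaxis]                 # add the single greyscale channel axis

# 64 dummy greyscale images stand in for preprocessed face crops
dummy_images = [np.zeros((256, 256), dtype=np.uint8) for _ in range(64)]
batch = images_to_batch(dummy_images)
print(batch.shape)  # (64, 256, 256, 1)
```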

### Reasons for Additional Preparation
If we could not detect a face, the image was discarded from the training set. For all other images, the face was moved to the center.

This aids the training process as the discriminator must learn what a valid face looks like. As all faces are in the center of the image, the discriminator learns to focus on that part. At the same time, this helps the generator to improve the generation of facial features without focusing on the background.

## 3. Generative Adversarial Networks
As already explained, a GAN consists of two ANNs.
The _generator_ generates new images of faces while the _discriminator_ tries to distinguish them from real ones [1].
This basic architecture is visualized in Figure 1.

![Basic Architecture](https://developers.google.com/machine-learning/gan/images/gan_diagram.svg)
_Figure 1: Basic GAN Architecture (Source: [1])_

### Discriminator Goal
The discriminator's goal is to correctly classify input images from two classes: fake and real. Real images are taken from the CelebA dataset and fake images are generated by the generator.
The loss function for the discriminator is the following [4, 7]:

\begin{equation*}
L(D) = -\frac{1}{2} \mathbb{E}_x [ \log (D(x)) ] -\frac{1}{2} \mathbb{E}_z [ \log (1 - D(G(z))) ]
\end{equation*}

Intuitively, the first part optimizes predictions for real data $x$ which should be classified as real, with $D(x)=1$. The second part represents predictions for fake data from the generator which should be classified as fake, with $D(G(z))=0$.

### Generator Goal
The generator's goal is the generation of realistic images based on the real ones. Therefore, its loss is opposite from the discriminator's [7]:

\begin{equation*}
L(G) = - \mathbb{E}_z [ \log (D(G(z))) ]
\end{equation*}

Here, the generator's output should be classified as real, with $D(G(z))=1$, where $z$ is random input to the generator.
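
To make the two loss functions more tangible, the following sketch evaluates them for a handful of made-up discriminator outputs. The probabilities in `d_real` and `d_fake` are hypothetical values, and the expectations are approximated by the mean over the samples.

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """L(D) = -1/2 E[log D(x)] - 1/2 E[log(1 - D(G(z)))]"""
    return -0.5 * np.mean(np.log(d_real)) - 0.5 * np.mean(np.log(1.0 - d_fake))

def generator_loss(d_fake):
    """L(G) = -E[log D(G(z))]"""
    return -np.mean(np.log(d_fake))

# hypothetical discriminator outputs (probability that the input is real)
d_real = np.array([0.9, 0.8, 0.95])   # predictions for real images
d_fake = np.array([0.1, 0.2, 0.05])   # predictions for generated images

print(discriminator_loss(d_real, d_fake))  # small: D separates both classes well
print(generator_loss(d_fake))              # large: G does not fool D yet
```

In this situation the discriminator is winning. If the generator improves and pushes $D(G(z))$ towards 1, its loss shrinks while the second term of $L(D)$ grows.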

## 4. Chosen GAN Architectures
We chose two GAN architectures for implementation, namely *Deep Convolutional GAN* (DCGAN) [3] and *High-Resolution Deep Convolutional GAN* (HDCGAN or HR-DCGAN) [4].

### Use of Convolutions
Both networks employ two-dimensional convolutions for generator and discriminator. 
The discriminator takes the image as input, e. g. with dimensions (256, 256, 1), and applies convolutions with stride 2 to create a classification result with one single output, i. e. real or not real. 
The generator takes a noise vector, e. g. of size 100, and uses transposed convolutions with stride 2 for upsampling to the desired resolution, e. g. (256, 256, 1). 

Figure 2 shows a convolution with a kernel size of 3 and stride of 2. The kernel size means that every new value is calculated from 3x3 pixel values. The stride of 2 defines that the kernel is moved 2 pixels for each new calculation. Therefore, only every second 3x3 window is evaluated along each axis, and the output matrix has roughly half the width and height of the input [8].

<img alt="2D Convolution Stride 2"
     src="https://miro.medium.com/max/294/1*BMngs93_rm2_BpJFH2mS0Q.gif" 
     width="200" />
_Figure 2: 2D convolution with kernel size 3 and stride 2 (Source: [8])_
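
The strided convolution described above can be sketched naively in NumPy. This is an illustration for a single channel and a square image, not how TensorFlow implements it; `conv2d` and its `padding` parameter are hypothetical names.

```python
import numpy as np

def conv2d(image, kernel, stride=2, padding=1):
    """Naive single-channel 2D convolution (cross-correlation, as used in deep learning)."""
    k = kernel.shape[0]
    padded = np.pad(image, padding)                    # zero padding on all sides
    out_size = (padded.shape[0] - k) // stride + 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            r, c = i * stride, j * stride              # the kernel jumps `stride` pixels
            out[i, j] = np.sum(padded[r:r + k, c:c + k] * kernel)
    return out

image = np.arange(64, dtype=float).reshape(8, 8)
kernel = np.ones((3, 3)) / 9.0                         # simple averaging kernel
print(conv2d(image, kernel).shape)                     # (4, 4)
```

With a padding of 1, the 8x8 input shrinks to 4x4, i.e. half the width and height.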

Figure 3 shows a transposed convolution with a kernel size of 3 and stride of 2. Here, the original matrix is padded to increase its size. Executing the convolution will result in an output matrix twice the original size [8].

<img alt="2D Transposed Convolution Stride 2"
     src="https://miro.medium.com/max/395/1*Lpn4nag_KRMfGkx1k6bV-g.gif" 
     width="200" />
_Figure 3: 2D transposed convolution with kernel size 3 and stride 2 (Source [8])_
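
A transposed convolution can equivalently be seen as every input pixel "stamping" a scaled copy of the kernel onto a larger output. The sketch below uses this formulation for a single channel. Both function names are hypothetical, and the centered crop only imitates the effect of `padding='same'`; the exact crop offsets used by a framework may differ.

```python
import numpy as np

def conv2d_transpose(image, kernel, stride=2):
    """Naive single-channel transposed convolution without cropping."""
    n, k = image.shape[0], kernel.shape[0]
    out_size = (n - 1) * stride + k
    out = np.zeros((out_size, out_size))
    for i in range(n):
        for j in range(n):
            r, c = i * stride, j * stride
            out[r:r + k, c:c + k] += image[i, j] * kernel   # stamp the scaled kernel
    return out

def conv2d_transpose_same(image, kernel, stride=2):
    """Crop the full result so that the output is exactly `stride` times the input size."""
    full = conv2d_transpose(image, kernel, stride)
    target = image.shape[0] * stride
    start = (full.shape[0] - target) // 2
    return full[start:start + target, start:start + target]

image = np.ones((7, 7))
kernel = np.ones((5, 5))
print(conv2d_transpose_same(image, kernel).shape)  # (14, 14)
```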

In a GAN, the discriminator and generator have to learn the weights for convolution and transposed convolution, respectively.


### Deep Convolutional GAN
The DCGAN's architecture chains multiple convolutions with stride 2 to reach the desired output shape (Figure 4). In the generator, each transposed convolution doubles the image size and halves the number of filters; in the discriminator, each convolution halves the image size and doubles the number of filters. After each convolution a batch normalization layer is added to normalize the values and stabilize the gradients. The activation functions are ReLU for the generator and LeakyReLU for the discriminator. The authors used the Adam optimizer with a learning rate of 0.0002 and a momentum term $\beta_1$ of 0.5 [3].

![DCGAN Generator Architecture](dcgan_generator_architecture.png)
_Figure 4: DCGAN generator architecture with a resulting image shape of (64, 64, 3) (Source: [3])_
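
The doubling and halving rule can be checked with a few lines of arithmetic. The sketch below lists the feature-map shapes of the generator, assuming the starting shape (4, 4, 1024) shown in Figure 4; `dcgan_generator_shapes` is a hypothetical helper.

```python
def dcgan_generator_shapes(start_size=4, start_filters=1024, target_size=64, out_channels=3):
    """List feature-map shapes when each stride-2 layer doubles the size and halves the filters."""
    shapes = [(start_size, start_size, start_filters)]
    size, filters = start_size, start_filters
    while size < target_size:
        size *= 2
        # the last layer outputs the image channels instead of halving again
        filters = out_channels if size == target_size else filters // 2
        shapes.append((size, size, filters))
    return shapes

for shape in dcgan_generator_shapes():
    print(shape)
# (4, 4, 1024), (8, 8, 512), (16, 16, 256), (32, 32, 128), (64, 64, 3)
```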

### High-Resolution Deep Convolutional GAN
The HR-DCGAN is an improved version of the DCGAN for high-resolution images. It uses the same convolution chain but replaces all activation functions with a Scaled Exponential Linear Unit (SELU). Furthermore, the authors introduce additional convolution layers with stride of 1 as the first layer of the generator and last layer of the discriminator [4].
Figure 5 shows the generator and discriminator architectures from the paper.

<img alt="HR-DCGAN Architecture"
     src="hrdcgan_architecture.png" 
     width="350" />
_Figure 5: HR-DCGAN architecture for generator and discriminator (Source: [4])_
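
The SELU activation used by the HR-DCGAN can be written in a few lines. This is a minimal NumPy sketch for illustration only; the two constants are the standard values from the SELU definition, and `selu` is a hypothetical helper (in Keras the activation is available as `'selu'`).

```python
import numpy as np

SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(x):
    """Scaled Exponential Linear Unit: scaled identity for x > 0, scaled exponential otherwise."""
    x = np.asarray(x, dtype=float)
    negative_part = SELU_ALPHA * (np.exp(np.minimum(x, 0.0)) - 1.0)  # minimum avoids overflow
    return SELU_SCALE * np.where(x > 0, x, negative_part)

print(selu([-2.0, 0.0, 2.0]))
```

Unlike ReLU, negative inputs are not cut off at zero but saturate at roughly -1.76.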

## 5. Implementation
The implementation was done in Python 3 with TensorFlow [2].
As the project got rather large most of the code is object-oriented for better maintenance. The training process was executed with the Jupyter Notebook ```lercher.ipynb```.

### Training Data Preparation
My part of the data preparation was mostly done in ```training_images/training_data_provider.py```. The following method loads all image paths from disk, preprocesses individual images as explained in section 2, and combines them into batches. Once a batch is complete, it is yielded directly to the caller.

```python
def get_all_training_images_in_batches(self) -> Iterable[np.ndarray]:
    '''
    Preprocesses all training images and returns them in multiple batches, where one batch is one numpy array.

    :returns: arrays with shape (self.batch_size, self.image_width, self.image_height, 1)
    '''
    current_batch = [] # the current batch to fill

    for image_path in self._load_all_images_from_disk():
        image: Image = Image.open(image_path)
        try:
            image = self._preprocess_image(image)
        except LookupError:
            continue # no face was found in image

        current_batch.append(np.array(image))

        if len(current_batch) == self.batch_size:
            yield self._convert_image_batch_to_training_array(current_batch)
            current_batch = []

    if len(current_batch) > 0:
        yield self._convert_image_batch_to_training_array(current_batch)
```

This implementation avoids loading all images into memory. Unfortunately, the preprocessing takes a long time and must be done for each epoch. Therefore, I decided to preprocess the images only once and store complete batches on disk as numpy arrays, which is the data type used during training.
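
The one-time preprocessing could be sketched as follows. This is an illustration under assumptions, not the project code: `save_batches_to_disk`, the file naming scheme, and the small dummy batches are all hypothetical.

```python
import os
import tempfile
import numpy as np

def save_batches_to_disk(batches, target_dir):
    """Store every batch as its own .npy file and return the number of files written."""
    os.makedirs(target_dir, exist_ok=True)
    count = 0
    for count, batch in enumerate(batches, start=1):
        np.save(os.path.join(target_dir, f"batch_{count:05d}.npy"), batch)
    return count

def dummy_batches(n):
    """Stands in for the batch generator shown above."""
    for _ in range(n):
        yield np.zeros((4, 8, 8, 1), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp:
    written = save_batches_to_disk(dummy_batches(3), tmp)
    loaded = np.load(os.path.join(tmp, "batch_00001.npy"))

print(written, loaded.shape)  # 3 (4, 8, 8, 1)
```

As the batches are consumed from a generator, only one batch is held in memory at a time while writing.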

The following method loads the prepared image batches one by one and was used during final training.

```python
def get_all_training_images_in_batches_from_disk(self) -> Iterable[np.ndarray]:
    '''
    Loads preprocessed image arrays from files. 
    Memory load is not that high, as only individual batches are read in.
    The batch size is fixed to 64.

    :returns: same as self.get_all_training_images_in_batches() but faster.
    '''
    if not os.path.exists(self.npy_data_path):
        raise IOError(f"The training arrays folder {self.npy_data_path} does not exist.")

    for _, _, files in os.walk(self.npy_data_path):
        random.shuffle(files)
        for file_ in files:
            yield np.load(os.path.join(self.npy_data_path, file_), allow_pickle=True)
```

### GAN Training Process
The training process is implemented once in ```gan/gan.py``` and subsequently used for all GAN architectures. We applied the strategy pattern to add new architectures more easily. As seen in Figure 6, a new network only needs to provide the new architecture, noise dimension, and optimizer hyperparameters.

<img alt="GAN Strategy Pattern"
     src="gan_strategy_pattern.png" 
     width="350" />

_Figure 6: Implementation of the strategy pattern for GAN architectures_
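
The idea behind the strategy pattern can be sketched without TensorFlow. The method names `get_noise_dim`, `build_generator` and `build_discriminator` appear in the project code, while the classes `GANSketch` and `TinyDCGAN`, the method `describe`, and the placeholder return values are hypothetical and only serve as illustration.

```python
from abc import ABC, abstractmethod

class GANSketch(ABC):
    """Holds the shared logic; subclasses only supply the architecture-specific parts."""

    @abstractmethod
    def get_noise_dim(self) -> int: ...

    @abstractmethod
    def build_generator(self): ...

    @abstractmethod
    def build_discriminator(self): ...

    def describe(self) -> str:
        # stands in for the shared training code that calls the strategy methods
        return f"{type(self).__name__}: noise_dim={self.get_noise_dim()}"

class TinyDCGAN(GANSketch):
    def get_noise_dim(self) -> int:
        return 100

    def build_generator(self):
        return "generator placeholder"

    def build_discriminator(self):
        return "discriminator placeholder"

print(TinyDCGAN().describe())  # TinyDCGAN: noise_dim=100
```

A new architecture then only requires a new subclass, while the training loop stays untouched.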

#### Implementation of Loss Functions
As explained in section 3 there are two losses, one for generator and one for discriminator. The following two functions use binary cross-entropy to calculate the losses, where the discriminator should detect fake images as fake and the generator should have them detected as real.

```python
@staticmethod
def discriminator_loss(real_output, generated_output):
    '''The discriminator loss function, where real output should be classified as 1 and generated as 0.'''
    return GAN.bce(tf.ones_like(real_output), real_output) + GAN.bce(tf.zeros_like(generated_output), generated_output)

@staticmethod
def generator_loss(generated_output):
    '''The generator loss function, where generated output should be classified as 1 (real).'''
    return GAN.bce(tf.ones_like(generated_output), generated_output)
```
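
The definition of `GAN.bce` is not shown here. As the discriminator ends in a `Dense(1)` layer without activation, it presumably operates on raw logits, e.g. via `tf.keras.losses.BinaryCrossentropy(from_logits=True)` as in the tutorial [9]; this is an assumption. Under that assumption, the computation can be reproduced in plain NumPy with made-up logits:

```python
import numpy as np

def bce_from_logits(labels, logits):
    """Numerically stable mean binary cross-entropy on raw logits."""
    labels = np.asarray(labels, dtype=float)
    logits = np.asarray(logits, dtype=float)
    per_sample = np.maximum(logits, 0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))
    return per_sample.mean()

real_logits = np.array([2.0, 3.0])    # hypothetical D outputs for real images
fake_logits = np.array([-2.0, -1.0])  # hypothetical D outputs for generated images

# same structure as discriminator_loss and generator_loss above
d_loss = (bce_from_logits(np.ones_like(real_logits), real_logits)
          + bce_from_logits(np.zeros_like(fake_logits), fake_logits))
g_loss = bce_from_logits(np.ones_like(fake_logits), fake_logits)
print(d_loss, g_loss)
```

Note that the implemented discriminator loss sums both terms instead of averaging them, so it differs from the formula in section 3 by a constant factor of 2, which does not change the optimum.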

#### Implementation of Training Steps 
For actual training we implemented the loss calculation, gradient calculation and weight updates manually as built-in methods did not work properly.

In the beginning, we trained the discriminator and generator with built-in methods, taking turns for each epoch. For instance, we trained each network for 20 iterations while its adversary was kept fixed, i. e. not learning. This was done by Manuel but unfortunately yielded inferior results.

The solution was to train both ANNs truly in parallel: in each step, we first generate both networks' outputs, then calculate both losses and both gradients, and finally update both sets of weights [9]. This is done by the following TensorFlow function.

```python
@tf.function
def train_step(self, real_data_batch: np.ndarray) -> '(disc_loss, gen_loss)':
    # prepare real data and noise input
    batch_size = real_data_batch.shape[0]
    noise = tf.random.normal([batch_size, self.get_noise_dim()])

    with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
        # Predict images with G
        gen_data = self.generator(noise, training=True)

        # Predict classes with D
        d_real_predicted_labels = self.discriminator(real_data_batch, training=True)
        d_fake_predicted_labels = self.discriminator(gen_data, training=True)

        # Compute losses
        d_loss_value = GAN.discriminator_loss(real_output=d_real_predicted_labels, generated_output=d_fake_predicted_labels)
        g_loss_value = GAN.generator_loss(generated_output=d_fake_predicted_labels)

    # Now that we have computed the losses, we can compute the gradients (using the tapes)
    gradients_of_discriminator = disc_tape.gradient(d_loss_value, self.discriminator.trainable_variables)
    gradients_of_generator = gen_tape.gradient(g_loss_value, self.generator.trainable_variables)

    # Apply gradients to variables
    self.d_optimizer.apply_gradients(zip(gradients_of_discriminator, self.discriminator.trainable_variables))
    self.g_optimizer.apply_gradients(zip(gradients_of_generator, self.generator.trainable_variables))

    return d_loss_value, g_loss_value
```

### GAN Architectures
The architectures for the individual ANNs were implemented as Keras Sequential Models [10]. As seen in Figure 6, we implemented two GANs per paper, a small version for the generation of MNIST handwritten digits [11] with image shape (28, 28, 1) and a larger version for the generation of facial images with shape (256, 256, 1).

The architecture for images of faces was changed during implementation as training for the proposed architectures took unreasonably long.
We reduced the number of convolution layers (including their batch-norm and activation layers) by half and instead chose a stride of 4. Interestingly, this also improved training results during the first 5 epochs.
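
The effect of the larger stride on the network depth is simple arithmetic. The sketch below counts the strided layers needed to scale between a small feature map and the 256x256 image; the starting size of 4 is an assumption for illustration, and `num_strided_layers` is a hypothetical helper.

```python
def num_strided_layers(start_size, target_size, stride):
    """Count the strided layers needed to scale start_size up to target_size."""
    layers, size = 0, start_size
    while size < target_size:
        size *= stride
        layers += 1
    return layers

print(num_strided_layers(4, 256, stride=2))  # 6
print(num_strided_layers(4, 256, stride=4))  # 3
```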

The following two snippets taken from ```gan/dcgan_mnist.py``` show the smaller architecture for the DCGAN working with the MNIST dataset.

```python
def build_generator(self):
    noise_shape = (self.get_noise_dim(),)

    model = Sequential([

        # project and reshape
        layers.Dense(7*7*128, use_bias=False, input_shape=noise_shape),
        layers.BatchNormalization(),
        layers.ReLU(),
        layers.Reshape((7, 7, 128)),
        # shape (7, 7, 128)

        # stride 2 -> larger image
        # 64 filters -> number of output channels
        layers.Conv2DTranspose(64, (5,5), strides=(2,2), padding='same', use_bias=False),
        layers.BatchNormalization(),
        layers.ReLU(),
        # shape (14, 14, 64)

        layers.Conv2DTranspose(1, (5,5), strides=(2,2), padding='same', use_bias=False, activation='tanh')
        # shape (28, 28, 1)

    ])

    return model
```

```python
def build_discriminator(self):
    img_shape = (28, 28, 1)

    model = Sequential([

        layers.Conv2D(64, (5,5), strides=(2,2), padding='same', input_shape=img_shape),
        layers.LeakyReLU(alpha=.2),
        # shape (14, 14, 64)

        layers.Conv2D(128, (5,5), strides=(2,2), padding='same'),
        layers.BatchNormalization(),
        layers.LeakyReLU(alpha=.2),
        # shape (7, 7, 128)

        layers.Flatten(),
        layers.Dense(1)
    ])

    return model
```

## 6. Results
This section shows some results generated by our GANs during 50 epochs with 60,000 MNIST images or 50,000 faces from CelebA.

Overall, our results after just 50 epochs look promising especially for the DCGAN. The center of generated images already contains the outline of a round face including eyes, nose and mouth. Even hair is already generated in various hairstyles. 

### DCGAN
<div>
    <img alt="DCGAN MNIST"
         src="../../gan/models/mnist/dcgan/progress.gif"
         style="width: 30%; float: left;"/>
    <img alt="DCGAN Faces"
         src="../../gan/models/faces/dcgan/progress.gif"
         style="width: 30%; margin-left:20px"/>
</div>

### HR-DCGAN
<div>
    <img alt="HR-DCGAN MNIST"
         src="../../gan/models/mnist/hr_dcgan/progress.gif"
         style="width: 30%; float: left;"/>
    <img alt="HR-DCGAN Faces"
         src="../../gan/models/faces/hr_dcgan/progress.gif"
         style="width: 30%; margin-left:20px"/>
</div>

### Comparison with Existing Networks
Our generator has inferior performance compared to existing approaches. Figure 7 shows results from the authors of HR-DCGAN after 150 epochs with image size 512x512x3 on a different dataset. 
    
![HR-DCGAN 150 Epochs](hrdcgan_150_epochs.png)
_Figure 7: Results from the HR-DCGAN paper after 150 epochs (Source: [4])_

### Calculation of Fréchet Inception Distance
In addition to qualitative results we calculated the Fréchet Inception Distance (FID), which measures the distance of feature distributions between real and generated images (lower is better) [4, 12].

The FID is defined as

\begin{equation*}
FID((m,C),(m_w,C_w)) = \left \| m-m_w \right \|^2_2 + Tr(C + C_w - 2(CC_w)^\frac{1}{2})
\end{equation*}

where $(m,C)$ and $(m_w,C_w)$ are mean and covariance of generated and real-world image features respectively and $Tr(.)$ sums up the diagonal elements of the matrix [12]. The formula is implemented in Python as function ```calculate_fid(.)```. 

```python
from numpy import cov, iscomplexobj, trace
from scipy.linalg import sqrtm

def calculate_fid(act1, act2):
    # calculate mean and covariance statistics
    mu1, sigma1 = act1.mean(axis=0), cov(act1, rowvar=False)
    mu2, sigma2 = act2.mean(axis=0), cov(act2, rowvar=False)
    
    # calculate sum squared difference between means
    ssdiff = sum((mu1 - mu2)**2.0)
    
    # calculate sqrt of product between cov
    covmean = sqrtm(sigma1.dot(sigma2))
    if iscomplexobj(covmean):
        covmean = covmean.real
        
    # calculate score
    fid = ssdiff + trace(sigma1 + sigma2 - 2.0 * covmean)
    
    return fid
```
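
As a sanity check of the formula, the sketch below computes the FID for random feature vectors. It is a NumPy-only variant and not the function used for our results: instead of `sqrtm` it uses the fact that $Tr((CC_w)^\frac{1}{2})$ equals the sum of the square roots of the eigenvalues of $CC_w$. The 8-dimensional random features are a stand-in for real Inception activations.

```python
import numpy as np

def fid_from_activations(act1, act2):
    """FID between two feature sets of shape (n_samples, n_features)."""
    mu1, mu2 = act1.mean(axis=0), act2.mean(axis=0)
    sigma1 = np.cov(act1, rowvar=False)
    sigma2 = np.cov(act2, rowvar=False)
    # trace of the matrix square root via the eigenvalues of the covariance product
    eigenvalues = np.linalg.eigvals(sigma1 @ sigma2)
    trace_sqrt = np.sqrt(np.clip(eigenvalues.real, 0.0, None)).sum()
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(sigma1) + np.trace(sigma2) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 8))
shifted = real + 1.0   # same covariance, every feature mean moved by 1

print(fid_from_activations(real, real))     # close to 0
print(fid_from_activations(real, shifted))  # close to 8, the squared distance of the means
```

Identical distributions give a distance of zero, and a pure shift of the mean contributes exactly its squared length.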

To extract features from real and generated images we use Keras' pretrained InceptionV3 model. For this, we must resize the images to (299, 299, 3) [13]. 
The actual script to load and prepare the images and to calculate the FID was excluded for brevity.
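
The shape conversion for InceptionV3 could look roughly like the following sketch. It is not the excluded script: `to_inception_input` is a hypothetical helper, and nearest-neighbour sampling is an assumption chosen to keep the example free of image libraries.

```python
import numpy as np

def to_inception_input(images, size=299):
    """Resize greyscale images (N, H, W, 1) to (N, size, size, 3) by nearest-neighbour sampling."""
    n, h, w, _ = images.shape
    rows = np.arange(size) * h // size             # source row for every target row
    cols = np.arange(size) * w // size             # source column for every target column
    resized = images[:, rows][:, :, cols]          # (N, size, size, 1)
    return np.repeat(resized, 3, axis=-1)          # copy the grey channel into R, G and B

faces = np.random.default_rng(0).random((2, 256, 256, 1)).astype(np.float32)
print(to_inception_input(faces).shape)  # (2, 299, 299, 3)
```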

```python
import fid

# sets the sample size to represent the distribution 
# (tradeoff: representative vs mem/cpu intensive)
fid.NR_IMAGES = 2000

for dataset in ['mnist', 'faces']:
    for arc in ['dcgan', 'hr_dcgan']:
        print(f"FID for {dataset} with {arc}: {fid.calculate_fid_for_dataset(architecture=arc, dataset=dataset)}")
```

```
Using TensorFlow backend.
Loading models from file
FID for mnist with dcgan: 36.839734822259544
Loading models from file
FID for mnist with hr_dcgan: 22.92706040474824
Loading models from file
FID for faces with dcgan: 500.22448981244565
Loading models from file
FID for faces with hr_dcgan: 564.4752399097662
```


The FID results for a sample size of 2,000 images are approximately the following:

|          | MNIST | Faces |
|:---------|:------|:------|
| DCGAN    | 36.8  | 500.2 |
| HR-DCGAN | 22.9  | 564.5 |

The HR-DCGAN authors stated a Fréchet Inception Distance of 8.44 for the CelebA dataset [4]. Compared to this value, our FID for faces is far higher, which means that we still have to improve our generator.

### Image Generation with our Models
Finally, we provide a short script to create images with our trained generators.

```python
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 
import tensorflow as tf

# Execute this to avoid internal tf error (only applies if a GPU is available)
physical_devices = tf.config.list_physical_devices('GPU') 
if physical_devices:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)
```

```python
from ipywidgets import interact_manual

# add the project's base dir to interpreter paths
import sys
sys.path.insert(1, '../../')
from gan import GAN

# use this hack as Jupyter Notebook 
# cannot create new reference in interact_manual
gan = {'gan': None} 

print("Select Generator:")
@interact_manual
def test(architecture=['dcgan', 'hr_dcgan'], dataset=['mnist', 'faces']):
    gan['gan'] = GAN.import_gan(path=f"../../gan/models/{dataset}/{architecture}{'_reduced_architecture' if dataset=='faces' else ''}/")
```

```python
import matplotlib.pyplot as plt

@interact_manual
def generate():
    img = gan['gan'].generate()
    plt.imshow((img[0, :,:,0]), cmap='gray')
    plt.axis('off')
    plt.show()
```

## 7. Discussion
The results from our normal DCGAN look better than from our HR-DCGAN. 
I assume this is related to the size of the training images, as 256x256x1 is not really high-resolution. In their paper, the authors used training images of size 512x512x3 [4]. I had regarded this as a similar size, since it only requires a single additional convolution layer that multiplies or divides our current resolution by 2x2 once more.

In future work, we would like to train the GAN on a server with the original network architectures and the full set of 200,000 celebrity faces. This should give more variety and allow the discriminator to learn when a face is _consistent_; currently, generated faces look like a mixture of several real faces, e. g. with different hairstyles or two different eyes.
Then, we could also add color channels, which increases the size per image by an additional factor of 3.


## 8. Sources

[1] https://developers.google.com/machine-learning/gan

[2] https://www.tensorflow.org/

[3] https://arxiv.org/pdf/1511.06434.pdf

[4] https://arxiv.org/pdf/1711.06491.pdf

[5] https://arxiv.org/pdf/1807.03923.pdf

[6] http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html

[7] https://arxiv.org/pdf/1406.2661.pdf

[8] https://towardsdatascience.com/types-of-convolutions-in-deep-learning-717013397f4d

[9] https://www.tensorflow.org/tutorials/generative/dcgan

[10] https://keras.io/guides/sequential_model/

[11] http://yann.lecun.com/exdb/mnist/

[12] https://arxiv.org/pdf/1706.08500.pdf

[13] https://keras.io/api/applications/inceptionv3/