# Generating Images of Faces with Generative Adversarial Networks
by Manuel Herold and Alexander Lercher

## Project Overview
Our goal was to generate images of humans with artificial neural networks (ANNs). We found that images of whole human bodies are extremely hard to generate: most research papers currently focus on faces only, and training data for whole-body postures is rare and often not aligned.
We had already identified this risk in the proposal and chose images of faces as a fallback plan.

For our project we used a Generative Adversarial Network (GAN) to create images of faces. Here, two ANNs are connected in such a way that they try to outsmart each other during training, hence the term _adversarial_. The _generator_ generates new images of faces while the _discriminator_ tries to distinguish them from real ones [1].

Both networks and the training process were implemented from scratch in Python by using TensorFlow's machine learning framework [2].

The tasks for the project were split as follows:

| Task                       | Team member       | Description                     |
|----------------------------|-------------------|---------------------------------|
| Training Data Collection   | Manuel; Alexander | Collection of raw images        |
| Training Data Alignment    | Manuel            | Training data preparation       |
| GAN Training Pipeline      | Alexander         | Training process implementation |
| GAN Reference Architecture | Alexander         | Image generation with [3, 4]    |
| GAN VAE Noise Input        | Manuel            | Image generation with [5]       |

The remainder of this report contains my contributions including their theory. If Manuel's contributions are needed to understand mine they will be explicitly marked as his.

[1] https://developers.google.com/machine-learning/gan

[2] https://www.tensorflow.org/

[3] https://arxiv.org/pdf/1511.06434.pdf

[4] https://arxiv.org/pdf/1711.06491.pdf

[5] Guoqiang Zhong, Wei Gao, Yongbin Liu, Youzhao Yang. Generative Adversarial Networks with Decoder-Encoder Output Noise. arXiv:1807.03923 [cs.CV], 2018.

## Training Data
We needed real images of faces for the training process so that the GAN can learn what newly generated images should look like.

First, we implemented a web crawling script to download images from public websites featuring models. Unfortunately, most of these images were discarded during alignment.

Next, we downloaded multiple professional datasets from research papers or Kaggle competitions. In the end, the training process was done with the [CelebA](http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html) dataset, which contains more than 200,000 images of celebrity faces.

### Preparation
The dataset was prepared in a way that 
- images are converted to greyscale (1 channel),
- the image dimensions match our desired size (256x256), 
- and faces are in the middle of the cropped image. 

Methods for face detection and cropping were implemented by Manuel.
Furthermore, the images were combined into batches of size 64 and stored on disk as large NumPy arrays with resulting shapes (64, 256, 256, 1).
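
The batching and storage step can be illustrated with a minimal sketch. It assumes the preprocessed images are already available as 2D greyscale arrays; the helper name `to_training_array`, the random stand-in images, and the file name are hypothetical and not taken from the project code.

```python
import os
import tempfile

import numpy as np

BATCH_SIZE = 64    # batch size stated in the text
IMAGE_SIZE = 256   # image width and height stated in the text


def to_training_array(images):
    """Stack 2D greyscale images into one array of shape (N, H, W, 1)."""
    batch = np.stack(images).astype(np.float32)
    return batch[..., np.newaxis]  # append the single channel axis


# Random pixels stand in for preprocessed face images (assumption).
rng = np.random.default_rng(seed=0)
fake_images = [rng.random((IMAGE_SIZE, IMAGE_SIZE)) for _ in range(BATCH_SIZE)]

batch = to_training_array(fake_images)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "batch_0000.npy")  # hypothetical file name
    np.save(path, batch)        # one file per batch
    restored = np.load(path)    # what a training epoch would read back

print(restored.shape)
```

Storing one file per batch means that an epoch only needs to hold a single batch in memory at a time.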

### Considerations
If we could not detect a face, the image was discarded from the training set. For all other images the face was moved to the center.
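
How a face can be moved to the center is sketched below. This is not Manuel's implementation; it only assumes that a face detector returns a bounding box and computes a square crop window around its center. The function name `centered_crop_box` and the example coordinates are hypothetical.

```python
def centered_crop_box(face_box, image_size, crop_size):
    """Return (left, top, right, bottom) of a square crop around the face.

    face_box is (left, top, right, bottom) of the detected face,
    image_size is (width, height) of the source image.
    """
    width, height = image_size
    if crop_size > width or crop_size > height:
        raise ValueError("crop does not fit into the image")

    center_x = (face_box[0] + face_box[2]) // 2
    center_y = (face_box[1] + face_box[3]) // 2

    # Shift the window back inside the image if the face is near a border.
    left = min(max(center_x - crop_size // 2, 0), width - crop_size)
    top = min(max(center_y - crop_size // 2, 0), height - crop_size)
    return left, top, left + crop_size, top + crop_size


print(centered_crop_box((300, 200, 400, 320), (800, 600), 256))
# (222, 132, 478, 388)
```

In this sketch a face close to the image border is only approximately centered, because the window is clamped to the image area.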

This aids the training process, in which the discriminator must learn what a valid face looks like. As all faces are in the center, the discriminator can learn to focus on this part of the image.

## GAN Basics
As already explained, the GAN consists of two ANNs.
The _generator_ generates new images of faces while the _discriminator_ tries to distinguish them from real ones [1].
The basic architecture is visualized in figure 1.

![Basic Architecture](https://developers.google.com/machine-learning/gan/images/gan_diagram.svg)
_Figure 1: Basic GAN Architecture_

### Discriminator Goal
The discriminator's goal is to correctly classify input images into two classes: fake and real. Real images are taken from the CelebA dataset. Fake images are generated by the generator.
The loss function for the discriminator is the following [6]:

\begin{equation*}
L(D) = -\frac{1}{2} \log (D(x)) -\frac{1}{2} \log (1 - D(G(z)))
\end{equation*}

Intuitively, the first part represents real data x which should be classified as real with D(x)=1. The second part represents fake data from the generator which should be classified as fake with D(G(z))=0.
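
To make the behaviour of this loss concrete, the following sketch evaluates L(D) for a single real and a single generated sample. It is a plain-Python illustration with hypothetical probability values, not the TensorFlow code used in the project.

```python
import math


def discriminator_loss(d_real, d_fake):
    """L(D) for one real and one generated sample.

    d_real = D(x) and d_fake = D(G(z)) are probabilities in (0, 1).
    """
    return -0.5 * math.log(d_real) - 0.5 * math.log(1.0 - d_fake)


# Hypothetical discriminator outputs for three situations.
confident = discriminator_loss(d_real=0.99, d_fake=0.01)  # classifies both correctly
undecided = discriminator_loss(d_real=0.5, d_fake=0.5)    # cannot tell them apart
fooled = discriminator_loss(d_real=0.01, d_fake=0.99)     # classifies both wrongly

print(round(confident, 4), round(undecided, 4), round(fooled, 4))
```

The loss approaches 0 when the discriminator is right on both samples, equals log(2) when it outputs 0.5 for everything, and grows without bound as it becomes confidently wrong.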

### Generator Goal
The generator's goal is the generation of realistic images that resemble the real images. Its objective is therefore opposite to the discriminator's:

\begin{equation*}
L(G) = - \log (D(G(z)))
\end{equation*}

Here, the generator's output should be classified as real with D(G(z))=1, where _z_ is random input for the generator.
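
The opposing goals can be seen by evaluating both objectives on the same discriminator output. The sketch below uses hypothetical values for D(G(z)) and only the second term of L(D), i.e. the part that depends on generated images.

```python
import math


def generator_loss(d_fake):
    """L(G) for one generated sample, with d_fake = D(G(z)) in (0, 1)."""
    return -math.log(d_fake)


def discriminator_fake_term(d_fake):
    """The part of L(D) that depends on generated samples."""
    return -0.5 * math.log(1.0 - d_fake)


# Hypothetical discriminator outputs for a generated image.
for d_fake in (0.1, 0.5, 0.9):
    print(d_fake,
          round(generator_loss(d_fake), 4),
          round(discriminator_fake_term(d_fake), 4))
```

As D(G(z)) rises towards 1, the generator loss falls while the discriminator's term rises, so an improvement for one network is a setback for the other.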

[6] https://arxiv.org/pdf/1711.06491.pdf

## Chosen GAN Architectures
We chose two architectures for generator and discriminator, namely DCGAN [7] and HDCGAN [8] (also called HR-DCGAN). The abbreviations stand for _Deep Convolutional GAN_ and _High-Resolution Deep Convolutional GAN_ respectively.

### Convolutions
Both architectures employ two-dimensional convolutions for generator and discriminator. The discriminator takes the image as input, e.g. with dimensions (256, 256, 1), and applies convolutions with stride 2 to produce a classification result with a single output, i.e. real or not real.

The generator takes a noise vector of size 100 and uses transposed convolutions with stride 2 for upsampling to the desired resolution, e.g. (256, 256, 1).

Figure 2 shows a convolution with a kernel size of 3 and stride of 2. The kernel size means that a single output value is calculated from each 3x3 block of pixel values. The stride of 2 defines that the kernel is moved 2 pixels for each new calculation. Therefore, not every 3x3 block is evaluated but only every second one in each direction, and the resulting matrix has roughly half the width and height of the original [9].
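
The effect of kernel size, stride, and padding on the output size follows the standard convolution arithmetic, sketched below for one spatial dimension. The function names are chosen for illustration only.

```python
import math


def conv_output_size(input_size, kernel_size, stride, padding):
    """Output length of a convolution; padding is the number of zeros per side."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


def same_padding_output_size(input_size, stride):
    """Output length when the framework pads automatically ('same' padding)."""
    return math.ceil(input_size / stride)


print(conv_output_size(5, kernel_size=3, stride=2, padding=0))    # 2
print(conv_output_size(256, kernel_size=3, stride=2, padding=1))  # 128
print(same_padding_output_size(28, stride=2))                     # 14
```

Without padding the result is slightly smaller than half the input; with one pixel of padding (or `'same'` padding) a stride of 2 halves the size exactly for even inputs.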

<img alt="2D Convolution Stride 2"
     src="https://miro.medium.com/max/294/1*BMngs93_rm2_BpJFH2mS0Q.gif" 
     width="200" />
_Figure 2: 2D convolution with kernel size 3 and stride 2 (Source: [9])_

Figure 3 shows a transposed convolution with a kernel size of 3 and stride of 2. Here, the original matrix is padded to increase its size. Executing the convolution will result in a matrix twice the original size [9]. 
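
The padding idea can be reproduced in one dimension with plain Python. The sketch below is an illustration, not the TensorFlow implementation: it inserts zeros between the input values, pads the borders, and then runs an ordinary stride-1 convolution with the flipped kernel.

```python
def transposed_conv1d(values, kernel, stride):
    """1D transposed convolution via zero insertion and a stride-1 convolution."""
    # 1) Insert (stride - 1) zeros between neighbouring input values.
    dilated = []
    for index, value in enumerate(values):
        dilated.append(value)
        if index < len(values) - 1:
            dilated.extend([0.0] * (stride - 1))

    # 2) Pad (kernel size - 1) zeros on both sides.
    k = len(kernel)
    padded = [0.0] * (k - 1) + dilated + [0.0] * (k - 1)

    # 3) Ordinary stride-1 convolution with the flipped kernel.
    flipped = kernel[::-1]
    return [
        sum(padded[start + offset] * flipped[offset] for offset in range(k))
        for start in range(len(padded) - k + 1)
    ]


print(transposed_conv1d([1.0, 2.0], kernel=[1.0, 1.0, 1.0], stride=2))
# [1.0, 1.0, 3.0, 2.0, 2.0]
```

The unpadded result has length (n - 1) * stride + k, here 5. With `padding='same'`, as used in the Keras layers later in this report, the output is trimmed to exactly `stride` times the input size.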

<img alt="2D Transposed Convolution Stride 2"
     src="https://miro.medium.com/max/395/1*Lpn4nag_KRMfGkx1k6bV-g.gif" 
     width="200" />
_Figure 3: 2D transposed convolution with kernel size 3 and stride 2 (Source: [9])_

In a GAN, the discriminator and generator have to learn the weights for convolution and transposed convolution respectively.




[7]: https://arxiv.org/pdf/1511.06434.pdf

[8]: https://arxiv.org/pdf/1711.06491.pdf

[9]: https://towardsdatascience.com/types-of-convolutions-in-deep-learning-717013397f4d


### Deep Convolutional GAN
The DCGAN's architecture chains multiple convolutions with stride 2 to reach the desired output shape. In the generator, each transposed convolution doubles the image size and halves the number of filters; in the discriminator, each convolution halves the image size and doubles the number of filters. After each convolution a batch normalization layer is added to normalize the values and help with gradient flow. The activation functions are ReLU and LeakyReLU for generator and discriminator respectively [7].
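
The doubling and halving pattern of the generator can be traced with a few lines of Python. The starting shape of (4, 4, 1024) and the four upsampling steps are taken from the generator figure in [7] and are assumptions here, not values from our own implementation.

```python
def generator_shapes(start_size, start_filters, num_upsamplings, out_channels):
    """List the feature map shapes of a DCGAN-style generator."""
    shapes = [(start_size, start_size, start_filters)]
    size, filters = start_size, start_filters
    for step in range(num_upsamplings):
        size *= 2  # a stride-2 transposed convolution doubles width and height
        is_last = step == num_upsamplings - 1
        # the last layer maps to the image channels, all others halve the filters
        filters = out_channels if is_last else filters // 2
        shapes.append((size, size, filters))
    return shapes


for shape in generator_shapes(4, 1024, num_upsamplings=4, out_channels=3):
    print(shape)
```

Under these assumptions the chain ends at (64, 64, 3). The discriminator mirrors the chain in the opposite direction.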
Figure 4 visualizes the convolutions for the generator.

![DCGAN Generator Architecture](dcgan_generator_architecture.png)
_Figure 4: DCGAN generator architecture with a resulting image shape of (64, 64, 3) (Source: [7])_

The authors used the Adam optimizer with a learning rate of 0.0002 and a momentum term β1 of 0.5 [7].
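
What these two hyperparameters control can be seen in a single Adam update for one scalar weight, sketched below in plain Python. The values for `beta_2` and `eps` are common defaults and an assumption here, as are the example weight and gradient. In TensorFlow the same settings would presumably be passed as `tf.keras.optimizers.Adam(learning_rate=0.0002, beta_1=0.5)`.

```python
import math


def adam_step(weight, grad, m, v, t,
              lr=0.0002, beta_1=0.5, beta_2=0.999, eps=1e-8):
    """One Adam update for a scalar weight; t is the step number (from 1)."""
    m = beta_1 * m + (1 - beta_1) * grad        # moving average of gradients
    v = beta_2 * v + (1 - beta_2) * grad ** 2   # moving average of squared gradients
    m_hat = m / (1 - beta_1 ** t)               # bias correction
    v_hat = v / (1 - beta_2 ** t)
    weight -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return weight, m, v


# Hypothetical weight and gradient for the first update step.
w, m, v = adam_step(weight=1.0, grad=3.0, m=0.0, v=0.0, t=1)
print(w, m)
```

With β1 = 0.5, half of each new gradient enters the moving average directly, so the optimizer reacts faster to changing gradients than with the usual default of 0.9.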

### High-Resolution Deep Convolutional GAN
The HR-DCGAN is an improved version of the DCGAN for high-resolution images. The authors build on the same convolution chain but replace all activation functions with the Scaled Exponential Linear Unit (SELU). Furthermore, they add convolution layers with a stride of 1 as the first layer of the generator and the last layer of the discriminator [8].
Figure 5 shows the generator and discriminator architectures from the paper.

<img alt="HR-DCGAN Architecture"
     src="hrdcgan_architecture.png" 
     width="350" />
_Figure 5: HR-DCGAN architecture for generator and discriminator (Source: [8])_
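
For reference, the SELU activation can be written in a few lines. The sketch below uses the standard constants of the SELU definition; it is a scalar illustration, whereas in Keras the activation would typically be selected with `activation='selu'`.

```python
import math

SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def selu(x):
    """Scaled Exponential Linear Unit for a single value."""
    if x > 0:
        return SELU_SCALE * x
    return SELU_SCALE * SELU_ALPHA * (math.exp(x) - 1.0)


for value in (-5.0, -1.0, 0.0, 1.0):
    print(value, round(selu(value), 4))
```

Positive inputs are scaled linearly, while negative inputs saturate at about -1.7581 instead of being cut off at 0 as with ReLU.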

## Implementation
The implementation was done in Python 3 with TensorFlow [2].
As the project grew rather large, most of the code is object-oriented for better maintainability. The training process was executed with the Jupyter Notebook ```lercher.ipynb```.

[2] https://www.tensorflow.org/

### Training Data Preparation
My part of the data preparation was mostly done in ```training_images/training_data_provider.py```. The following method loads all image paths from disk, preprocesses the individual images as explained in the theory, and combines them into batches. Once a batch is complete, it is yielded to the caller directly.

In [None]:
def get_all_training_images_in_batches(self) -> Iterable[np.ndarray]:
    '''
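    Preprocesses all training images and returns them in multiple batches, where one batch is one numpy array.
    Images without a detectable face are skipped. The last batch may contain fewer than self.batch_size images.

    :returns: arrays with shape (batch, self.image_width, self.image_height, 1), where batch <= self.batch_size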
    '''
    current_batch = [] # the current batch to fill

    for image_path in self._load_all_images_from_disk():
        image: Image = Image.open(image_path)
        try:
            image = self._preprocess_image(image)
        except LookupError:
            continue # no face was found in image

        current_batch.append(np.array(image))

        if len(current_batch) == self.batch_size:
            yield self._convert_image_batch_to_training_array(current_batch)
            current_batch = []

    if len(current_batch) > 0:
        yield self._convert_image_batch_to_training_array(current_batch)


This implementation was chosen to avoid loading all images into memory at once. Unfortunately, the preprocessing takes a lot of time and would have to be repeated for each epoch. I therefore decided to preprocess the images only once and store complete batches on disk as NumPy arrays, which is the data type used during training.

The following method loads the prepared image batches one by one and was used during the final training.

In [None]:
def get_all_training_images_in_batches_from_disk(self) -> Iterable[np.ndarray]:
    '''
    Loads preprocessed image arrays from files. Memory load is not that high, as only individual batches are read in.
    The batch size is fixed to 64.

    :returns: same as self.get_all_training_images_in_batches() but slightly faster.
    '''
    if not os.path.exists(self.npy_data_path):
        raise IOError(f"The training arrays folder {self.npy_data_path} does not exist.")

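    # os.walk also descends into subfolders, so join with the folder that actually contains the file
    for root, _, files in os.walk(self.npy_data_path):
        random.shuffle(files)  # shuffles the order of the batches, not the images within a batch
        for file_ in files:
            yield np.load(os.path.join(root, file_), allow_pickle=True)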


### GAN Training Process
The training process is implemented once in ```gan/gan.py``` and subsequently used for all GAN architectures. We applied the strategy pattern so that new networks can be added more easily. As seen in figure 6, a new network only needs to provide its architecture, noise dimension, and optimizer hyperparameters.

<img alt="GAN Strategy Pattern"
     src="gan_strategy_pattern.png" 
     width="350" />

_Figure 6: Implementation of the strategy pattern for GAN architectures_

#### Implementation of Loss Functions
As explained in the theory there are two losses, one for generator and one for discriminator. The following two functions use binary cross-entropy to calculate the losses, where the discriminator should detect fake images as fake and the generator should have fakes detected as real.

In [None]:
@staticmethod
def discriminator_loss(real_output, generated_output):
    '''The discriminator loss function, where real output should be classified as 1 and generated as 0.'''
    return GAN.bce(tf.ones_like(real_output), real_output) + GAN.bce(tf.zeros_like(generated_output), generated_output)

@staticmethod
def generator_loss(generated_output):
    '''The generator loss function, where generated output should be classified as 1 (real) by the discriminator.'''
    return GAN.bce(tf.ones_like(generated_output), generated_output)
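
The definition of `GAN.bce` is not shown above. Since the discriminator shown later in this report ends in a `Dense(1)` layer without a sigmoid, it presumably operates on raw scores (logits). Under that assumption, the following plain-Python sketch shows what the two loss functions compute for a single sample; the helper `bce_from_logits` is hypothetical.

```python
import math


def bce_from_logits(label, logit):
    """Numerically stable binary cross-entropy for one raw score (logit)."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def discriminator_loss(real_logit, fake_logit):
    """Real output is compared with label 1, generated output with label 0."""
    return bce_from_logits(1.0, real_logit) + bce_from_logits(0.0, fake_logit)


def generator_loss(fake_logit):
    """The generator wants its output to be compared favourably with label 1."""
    return bce_from_logits(1.0, fake_logit)


print(round(discriminator_loss(real_logit=0.0, fake_logit=0.0), 4))  # 1.3863
print(round(generator_loss(fake_logit=5.0), 4))
print(round(generator_loss(fake_logit=-5.0), 4))
```

Unlike the formula in the theory section, the implementation sums the two discriminator terms instead of weighting each with 1/2. This only rescales the loss by a constant factor.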

#### Implementation of Training Steps 
For the actual training we implemented the loss calculation, gradient calculation, and weight updates manually.

In the beginning, we trained the discriminator and the generator in alternating phases. For instance, each was trained for 20 iterations while its adversary was kept fixed, i.e. not learning. This was done by Manuel but unfortunately yielded inferior results.

The solution was to train both ANNs simultaneously within each step: first computing both networks' outputs, then calculating the losses and the gradients, and finally updating the weights of both networks side by side [10]. This is done by the following TensorFlow function.

[10] https://www.tensorflow.org/tutorials/generative/dcgan

In [None]:
@tf.function
def train_step(self, real_data_batch: np.ndarray) -> '(disc_loss, gen_loss)':
    # prepare real data and noise input
    batch_size = real_data_batch.shape[0]
    noise = tf.random.normal([batch_size, self.get_noise_dim()])

    with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
        # Predict images with G
        gen_data = self.generator(noise, training=True)

        # Predict classes with D
        d_real_predicted_labels = self.discriminator(real_data_batch, training=True)
        d_fake_predicted_labels = self.discriminator(gen_data, training=True)

        # Compute losses
        d_loss_value = GAN.discriminator_loss(real_output=d_real_predicted_labels, generated_output=d_fake_predicted_labels)
        g_loss_value = GAN.generator_loss(generated_output=d_fake_predicted_labels)

    # Now that we have computed the losses, we can compute the gradients (using the tapes)
    gradients_of_discriminator = disc_tape.gradient(d_loss_value, self.discriminator.trainable_variables)
    gradients_of_generator = gen_tape.gradient(g_loss_value, self.generator.trainable_variables)

    # Apply gradients to variables
    self.d_optimizer.apply_gradients(zip(gradients_of_discriminator, self.discriminator.trainable_variables))
    self.g_optimizer.apply_gradients(zip(gradients_of_generator, self.generator.trainable_variables))

    return d_loss_value, g_loss_value
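
How `train_step` might be driven over several epochs is sketched below. The loop is an assumption and not taken from ```lercher.ipynb```; it only relies on the two method names shown in this report. The stub classes are hypothetical stand-ins so that the sketch runs without TensorFlow or training data.

```python
def train(gan, data_provider, epochs):
    """Call train_step on every stored batch and collect mean losses per epoch."""
    history = []
    for _ in range(epochs):
        d_losses, g_losses = [], []
        for batch in data_provider.get_all_training_images_in_batches_from_disk():
            d_loss, g_loss = gan.train_step(batch)
            d_losses.append(float(d_loss))
            g_losses.append(float(g_loss))
        if d_losses:  # skip epochs without any batch
            history.append((sum(d_losses) / len(d_losses),
                            sum(g_losses) / len(g_losses)))
    return history


class StubProvider:
    """Hypothetical stand-in that yields three dummy batches per epoch."""

    def get_all_training_images_in_batches_from_disk(self):
        for _ in range(3):
            yield [0.0] * 4


class StubGAN:
    """Hypothetical stand-in that returns constant losses."""

    def train_step(self, batch):
        return 0.5, 2.0


history = train(StubGAN(), StubProvider(), epochs=2)
print(history)  # [(0.5, 2.0), (0.5, 2.0)]
```

Tracking both mean losses per epoch makes it easier to notice when one network starts to dominate the other.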

### Implementation of GAN Architectures
The architectures for the ANNs were implemented as Keras Sequential models [11]. As the strategy pattern figure shows, we implemented two architectures per paper: a small version for the generation of MNIST handwritten digits with image shape (28, 28, 1) and a larger version for the generation of facial images with shape (256, 256, 1).

The following snippet taken from ```gan/dcgan_mnist.py``` shows the architecture for a DCGAN working with the MNIST dataset.

[11] https://keras.io/guides/sequential_model/

In [None]:
def build_generator(self):
    noise_shape = (self.get_noise_dim(),)

    model = Sequential([

        # project and reshape
        layers.Dense(7*7*128, use_bias=False, input_shape=noise_shape),
        layers.BatchNormalization(),
        layers.ReLU(),
        layers.Reshape((7, 7, 128)),
        # shape (7, 7, 128)

        # stride 2 -> doubles the image width and height
        # 64 filters -> number of output channels
        layers.Conv2DTranspose(64, (5,5), strides=(2,2), padding='same', use_bias=False),
        layers.BatchNormalization(),
        layers.ReLU(),
        # shape (14, 14, 64)

        layers.Conv2DTranspose(1, (5,5), strides=(2,2), padding='same', use_bias=False, activation='tanh')
        # shape (28, 28, 1)

    ])

    return model

def build_discriminator(self):
    img_shape = (28, 28, 1)

    model = Sequential([

        layers.Conv2D(64, (5,5), strides=(2,2), padding='same', input_shape=img_shape),
        layers.LeakyReLU(alpha=.2),
        # shape (14, 14, 64)

        layers.Conv2D(128, (5,5), strides=(2,2), padding='same'),
        layers.BatchNormalization(),
        layers.LeakyReLU(alpha=.2),
        # shape (7, 7, 128)

        layers.Flatten(),
        layers.Dense(1)
    ])

    return model


## Additional Sources for Report


https://datagrid.co.jp/en/all/release/386/