<img align="right" style="max-width: 200px; height: auto" src="https://github.com/GitiHubi/courseML/blob/master/lab_06/hsg_logo.png?raw=1">

### Lab 07 - "Deep Learning - Generative Adversarial Networks"

Machine Learning (BBWL), University of St. Gallen, Spring Term 2021

In the last labs you learned about using **supervised** deep learning techniques that aim at classifying data based on learned patterns.

In this lab, we will learn how deep neural networks can work together to generate fake yet realistic data. To do this, we use a modern architecture called **Generative Adversarial Networks**, or simply **GANs**. The intuition behind a **GAN** is that by training two neural networks against each other, we enhance data generation with an artificial feedback loop. Real data is shown to a **discriminative model**, which is trained as a binary classifier to distinguish whether a sample is real or fake. A **generative model** submits fake samples to the discriminative model and uses the latter's output as feedback to improve its generated samples. This process will, of course, be further explained during the lab. **GANs** have been recognized as a powerful tool to generate very realistic data. Their applications are diverse: data augmentation, text-to-image translation, artificial music, photos to emojis and many others.

We will again use the functionality of the `PyTorch` library to implement and train a **GAN**. The networks will be trained on a dataset you should now be familiar with, namely the **FashionMNIST** dataset. After training, we will examine what kind of images we have been able to generate artificially.

The figure below illustrates a high-level view on the machine learning process we aim to establish in this lab.

<img align="center" style="max-width: 500px; height: 300px" src="gan_pipeline.png">

As always, please don't hesitate to ask your questions during the lab, post them in our CANVAS (StudyNet) forum (https://learning.unisg.ch), or send us an email (using the course email).

## 1. Lab Objectives

After today's lab, you should be able to:

> 1. Understand the basic concepts, intuitions and major building blocks of **Generative Adversarial Networks (GANs)**.
> 2. Know how to **implement** and to **train a GAN** to generate new images of fashion articles.
> 3. Understand the concepts and interactions between a **Generative Network** and a **Discriminative Network** in the context of a **GAN**.
> 4. Know how to **interpret and visualize** the model's outputs and results.

Before we start let's watch a motivational video:

In [None]:
from IPython.display import YouTubeVideo
# Aiva: "I am AI - AI Composed Music by AIVA"
YouTubeVideo('Emidxpkyk6o', width=800, height=600)

## 2. Setup of the Jupyter Notebook Environment

Similarly to the previous labs, we need to import a couple of Python libraries that allow for data analysis and data visualization. We will mostly use `PyTorch`, `Numpy`, `Matplotlib` and a few utility libraries throughout this lab.

We start by importing `numpy` and utility libraries. Here, we also import the `pickle` module to save and reuse some Python objects. `pickle` "serializes" an object before writing it to a file: the object is converted to a byte stream and can then be reconstructed either later on in the script, or in another script.

In [None]:
# import standard python libraries
import os
import datetime as time
from datetime import datetime
import numpy as np
import pickle as pkl
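
To make the serialization round trip concrete, here is a minimal sketch that is independent of the lab's data. The dictionary `example_object` and its contents are made up for illustration. The sketch uses `dumps` and `loads`, which work on an in-memory byte string; writing to and reading from a file works the same way via `dump` and `load`.

```python
import pickle as pkl

# hypothetical object to store (not part of the lab's data)
example_object = {'epoch': 3, 'losses': [0.91, 0.74, 0.69]}

# serialize the object into a byte string ...
serialized = pkl.dumps(example_object)

# ... and reconstruct an equal copy from those bytes
restored = pkl.loads(serialized)

print(type(serialized).__name__, restored == example_object)
```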

Importing PyTorch data download and transform libraries:

In [None]:
# import PyTorch datasets and transforms
import torchvision.datasets as datasets
from torchvision import transforms

Importing Python ML/DL libraries:

In [None]:
# import the PyTorch deep learning library
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

Importing `Matplotlib` plotting library and enabling notebook inline plotting:

In [None]:
# import matplotlib and enabling notebook inline plotting:
import matplotlib.pyplot as plt
%matplotlib inline

Import Google's GDrive connector and mount your GDrive directories:

In [None]:
# import the Google Colab GDrive connector
from google.colab import drive

# mount GDrive inside the Colab notebook
drive.mount('/content/drive')

Create a structure of Colab Notebook sub-directories inside of GDrive to store (1) the data as well as (2) the trained neural network models:

In [None]:
# create Colab Notebooks directory
notebook_directory = '/content/drive/MyDrive/Colab Notebooks'
if not os.path.exists(notebook_directory): os.makedirs(notebook_directory)

# create data sub-directory inside the Colab Notebooks directory
data_directory = '/content/drive/MyDrive/Colab Notebooks/data'
if not os.path.exists(data_directory): os.makedirs(data_directory)

# create models sub-directory inside the Colab Notebooks directory
models_directory = '/content/drive/MyDrive/Colab Notebooks/models'
if not os.path.exists(models_directory): os.makedirs(models_directory)

Set a random seed value to obtain reproducible results:

In [None]:
# init deterministic seed
seed_value = 123
np.random.seed(seed_value) # set numpy seed
torch.manual_seed(seed_value); # set pytorch seed CPU

Google Colab provides the use of free GPUs for running notebooks. However, if you just execute this notebook as is, it will use your device's CPU. To run the lab on a GPU, go to `Runtime` > `Change runtime type` and set the Runtime type to `GPU` in the drop-down. Running this lab on a CPU is fine, but you will find that GPU computing is faster. *CUDA* indicates that the lab is being run on a GPU.

Enable GPU computing by setting the device flag and init a CUDA seed:

In [None]:
# set cpu or gpu enabled device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu').type

# init deterministic GPU seed
torch.cuda.manual_seed(seed_value)

# log type of device enabled
print('[LOG] notebook with {} computation enabled'.format(str(device)))

Let's determine if we have access to a GPU provided by e.g. Google's Colab environment:

In [None]:
!nvidia-smi

## 3. Dataset Download and Data Assessment

Download based on courseML lab 4

In this lab, we will use the popular **FashionMNIST** dataset, which you have already seen in lab 04 "Artificial Neural Network Classification". Back then, we used the dataset to train a simple neural network to classify the fashion articles. In this lab, we are going to train a model - consisting of 2 networks - to create its own images, based on the **FashionMNIST** items.

The **Fashion-MNIST database** is a large database of Zalando articles that is commonly used for training various image processing systems. The database is widely used for training and testing in the field of machine learning. Let's have a brief look into a couple of sample images contained in the dataset:

<img align="center" style="max-width: 500px; height: 300px" src="FashionMNIST.png">

Source: https://www.kaggle.com/c/insar-fashion-mnist-challenge

Further details on the dataset can be obtained via Zalando research's [github page](https://github.com/zalandoresearch/fashion-mnist).

The **Fashion-MNIST database** is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. Zalando created this dataset with the intention of providing a replacement for the popular **MNIST** handwritten digits dataset. It is a useful addition as it is a bit more complex, but still very easy to use. It shares the same image size and train/test split structure as MNIST, and can therefore be used as a drop-in replacement. It requires minimal efforts on preprocessing and formatting the distinct images.

Let's download, transform and inspect the training images of the dataset. Therefore, let's first define the directory in which we aim to store the data:

In [None]:
# store the images inside the data sub-directory created above
data_path = os.path.join(data_directory, 'train_fashion_mnist')

Now, let's download the data accordingly:

In [None]:
# define pytorch transformation into tensor format
transf = transforms.Compose([transforms.ToTensor()])

# download and transform images
fashion_mnist_data = datasets.FashionMNIST(root=data_path, train=True, transform=transf, download=True)

Verify the number of images downloaded:

In [None]:
# determine the number of training data images
len(fashion_mnist_data)

Furthermore, let's inspect a couple of the downloaded images:

In [None]:
# select and set a (random) image id
image_id = 7779

# retrieve image exhibiting the image id
fashion_mnist_data[image_id]

Ok, that doesn't seem right :). Let's now separate the image from its label information:

In [None]:
fashion_mnist_image, fashion_mnist_label = fashion_mnist_data[image_id]

We can verify the label that our selected image has:

In [None]:
fashion_mnist_label

Ok, we know that the numerical label is 1. Each image is associated with a label from 0 to 9, and this number represents one of the fashion items. So what does 1 mean? Is 1 a bag? A pullover? The order of the classes can be found on Zalando research's [github page](https://github.com/zalandoresearch/fashion-mnist). We need to map each numerical label to its fashion item, which will be useful throughout the lab:

In [None]:
fashion_classes = {0: 'T-shirt/top',
                    1: 'Trouser',
                    2: 'Pullover',
                    3: 'Dress',
                    4: 'Coat',
                    5: 'Sandal',
                    6: 'Shirt',
                    7: 'Sneaker',
                    8: 'Bag',
                    9: 'Ankle boot'}

So, we can determine the fashion item that the label represents:

In [None]:
# define tensor to image transformation
trans = transforms.ToPILImage()

# set image plot title 
plt.title('Example: {}, Label: {}'.format(str(image_id), fashion_classes[fashion_mnist_label]))

# plot the FashionMNIST sample image
plt.imshow(trans(fashion_mnist_image), cmap='gray')

That's it! In this lab, we will not use any test dataset. Unlike in our previous labs, where we built classifiers that had to be tested on some evaluation data, this lab has no use for test data. We will train a model to generate new images, which, by definition, cannot be compared and validated against a test set. To train this lab's model, we only use `FashionMNIST`'s training set, as it already contains a lot of images.

## 4. Neural Network Implementation

In this section, we present the theoretical background on **GAN**s, and we implement the architectures of the 2 **neural networks** that will constitute our **GAN** model. We aim to train our **GAN** to generate new images - artificial images based on the **FashionMNIST** dataset that no human has ever drawn, created or dreamt of. Before we start with the theory, let's briefly revisit the process to be established. The following cartoon provides a bird's-eye view:

<img align="center" style="max-width: 500px; height: 300px" src="gan_pipeline.png">

### 4.1 GAN Theoretical Background

*“Generative Adversarial Networks is the most interesting idea in the last 10 years in Machine Learning.”*

 — Yann LeCun, Chief AI scientist at Facebook

Up until now, we have focused on using deep neural networks to understand objects by learning a mapping from some data (e.g. an image) to its label. We then used this learning to assign a label to new objects, i.e. to classify. This type of learning is referred to as *discriminative* learning - we discriminate between different classes. Machine Learning, however, offers more than discriminative applications.

A big question in the recent history of Machine Learning has been: is it possible to use learned patterns to **generate artificial objects?** We know, and we have seen throughout the labs, that we can learn a model that captures the characteristics of some data (e.g. we can create a model to distinguish cats from dogs). Given such a model, could we sample synthetic data examples that resemble the distribution of the training data? For instance, given a large corpus of images of clothes, we might want to generate a new photorealistic image that looks like it might plausibly have come from the same dataset. This kind of learning is called *generative modeling*.

**Ian Goodfellow** et al. published a [paper](https://arxiv.org/pdf/1406.2661.pdf) in 2014 that introduced Generative Adversarial Networks (**GANs**) to do exactly that. The model uses discriminative techniques (often a multilayer perceptron - the type of artificial neural network we have already seen) to build a generative model. The central idea behind GANs is simple and elegant: a data generator is good if we cannot distinguish its generated data from real data. If we cannot say whether a sample $x$ belongs to a real dataset $X = [x_{1}, .., x_{n}]$ or a generated, 'fake' distribution $X' = [x'_{1}, .., x'_{n}]$, then the generator is doing a great job.

How do we put this idea into practice? Goodfellow and his colleagues provide the answer: we train two models against each other (hence 'adversarial'). A generative model $G$ is built to generate data from some noise $z$ ('noise' just means random values). Its output is an attempt (very bad at first, because it is random) at producing the desired data - an image in our case. A discriminative model $D$, which has some knowledge of what the real data looks like, is simultaneously trained to estimate the probability that a sample comes from the real data rather than from the fake data generated by $G$. As such, this is an unsupervised learning setup, where our generator learns unknown patterns.

The goal for $G$ is thus to fool $D$ by making it unsure about the origin of a sample. In other words, it is to maximize the probability of $D$ making a mistake. The solution of this minimax two-player game is reached when $D$ assigns a $50$% probability that a given sample comes from $G$ or from the real dataset.

In this lab, we cover an example where a **GAN** is built to generate artificial images of fashion articles. You could however apply this concept to the generation of audio sequences for speech, sequences of characters for text, or for further use cases such as video, financial data and many more. **GAN**s can also be used for data augmentation, modification of natural images or the likes. They are also one of the main tools to create deep fakes. We will not cover that in this lab though :-)

<img align="center" style="max-width: 500px; height: 300px" src="gan_process.png">

Source: https://d2l.ai/chapter_generative-adversarial-networks/gan.html

As suggested by the figure, a **GAN** architecture is composed of the two elements discussed above - the generative model $G$, and the discriminative model $D$. In this lab, we will call them `Generator` and `Discriminator`. The `Generator` attempts to generate some data, and the `Discriminator` attempts to distinguish fake and real data from each other. Let's dive deeper into the two pieces of a **GAN** architecture.

#### 4.1.1 The Discriminator

The `Discriminator` $D$ is a binary classifier that aims at detecting whether an input sample $X$ is real (from the actual dataset, **FashionMNIST** in our case), or fake (created artificially by the `Generator`). $D$ outputs a scalar prediction $sp \in \mathbb{R}$, which is then fed to a *sigmoid* function to get a probability. This result is the probability that the input belongs to the real data rather than the fake data.

The *sigmoid* function is used for **binary** classification problems - which is what we need here, as the `Discriminator` attempts to decide between the labels:
- $1$ - the sample belongs to the real data, and
- $0$ - the sample is fake and has been generated by $G$

This *sigmoid* function is defined as follows:

$S(x) = {\frac {1}{1 + e^{-x}}}$

Consequently, the probability computed by the `Discriminator` is:

$$D(x) = {\frac {1}{1 + e^{-sp}}}$$
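
As a small illustration of the formula above, the following sketch evaluates the sigmoid for a few raw scores. The scores are made-up values, not outputs of the lab's `Discriminator`, and the helper name `sigmoid` is chosen for this example only.

```python
import math

def sigmoid(sp):
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-sp))

# made-up discriminator scores, for illustration only
for sp in (-3.0, 0.0, 3.0):
    print('sp = {:+.1f} -> D(x) = {:.3f}'.format(sp, sigmoid(sp)))
```

A strongly negative score maps to a probability near 0 (fake), a score of 0 maps to exactly 0.5 (undecided), and a strongly positive score maps to a probability near 1 (real).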

To optimize this neural network, $D$ is subject to a loss function. In this lab we use the **'Binary Cross Entropy (BCE) with Logits'** loss.
>- Unlike the *Negative Log-Likelihood (NLL) Loss*, which doesn’t punish based on prediction confidence, *Cross-Entropy* punishes incorrect but confident predictions, as well as correct but less confident predictions. 
>- Binary Cross Entropy is used for binary classification problems, which consist of only 2 classes. Here the two classes are 0 (an image that does not correspond to the original data) and 1 (an image that corresponds to the original data).
>- By using **BCEWithLogitsLoss** instead of **BCELoss**, we remove the need to define a sigmoid activation in the network. The activation is already included and applied in the loss function.
>- The function is defined as $L = -\frac {1}{N}\sum_{i=1}^{N} [y_{i}*log(ŷ_{i}) + (1-y_{i})*log(1-ŷ_{i})]$,
where $y_{i}$ is the true label of sample $i$ (0 or 1), $ŷ_{i}$ is the corresponding prediction, and $N$ is the number of samples.
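
The relationship between the two PyTorch losses mentioned above can be checked on a toy example. This is a sketch with made-up logits and labels: feeding raw scores to `nn.BCEWithLogitsLoss` should give the same value as applying a sigmoid first and then using `nn.BCELoss`.

```python
import torch
import torch.nn as nn

# made-up raw scores (logits) and labels, for illustration only
logits = torch.tensor([[2.0], [-1.0], [0.5]])
labels = torch.tensor([[1.0], [0.0], [1.0]])

# variant 1: the loss applies the sigmoid internally
loss_with_logits = nn.BCEWithLogitsLoss()(logits, labels)

# variant 2: explicit sigmoid followed by plain binary cross-entropy
loss_manual = nn.BCELoss()(torch.sigmoid(logits), labels)

print(loss_with_logits.item(), loss_manual.item())
```

Both variants average the per-sample losses over the batch, which corresponds to the factor $\frac{1}{N}$ in the formula.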

That means that for a given sample ($N = 1$), the optimization process is defined as:

$$min_{D}[-y*log(D(x)) - (1-y)*log(1-D(x))]$$

#### 4.1.2 The Generator

The `Generator` $G$, unlike the Deep Learning models we have studied so far, is not a classifier. Its aim is to generate data. To do so, it takes as input a vector $z \in \mathbb{R}^d$ drawn from a random source, i.e. from a fixed distribution (e.g. Gaussian or uniform). $z$ is also called the *latent variable*. In this lab, we will draw our samples from a **uniform** distribution **$z ∼ U[0, 1]$** (for a discussion about Gaussian vs. uniform in the context of GANs, check [here](https://stats.stackexchange.com/questions/295880/importance-of-choice-of-latent-distribution-in-gans)).
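
Drawing such latent vectors might look as follows. This is a sketch under assumptions: the batch size of 4 is an arbitrary illustrative value, and the latent dimension of 100 is the input size used for the `Generator` later in this lab.

```python
import torch

torch.manual_seed(123)  # same seed value as used earlier in the lab

batch_size = 4     # illustrative value only
latent_dim = 100   # size of the latent vector z

# torch.rand draws from the uniform distribution on [0, 1)
z = torch.rand(batch_size, latent_dim)

print(z.shape, z.min().item() >= 0.0, z.max().item() < 1.0)
```

Each row of `z` is one latent vector, so a batch of latent vectors yields a batch of generated images.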

The `Generator`, which ultimately applies a function (although long and complex), generates $X' = G(z)$: artificial data that aims at mimicking the real data $X$. $G$'s goal is to fool $D$ into classifying $X' = G(z)$ as true data, i.e., we want $D(G(z)) ≈ 1$. In other words, for a given discriminator $D$, we update the parameters of the generator $G$ to maximize the cross-entropy loss when $y=0$, i.e., when the data is fake (as it has been created by $G$). Since $G$ wants to maximize $D$'s cross-entropy loss when $y = 0$, we have:

$$max_{G}[-y*log(D(G(z))) - (1-y)*log(1-D(G(z)))]$$

which simplifies to 

$$max_{G}[-log(1-D(G(z)))]$$

as, of course, $y = 0$ for fake images.

What we end up with is a 'minimax' game with the comprehensive value (objective) function $V$:



$$min_{D}max_{G}[- \mathbb{E}_{x∼Data} logD(x) - \mathbb{E}_{z∼Noise} log(1-D(G(z)))]$$

Which translates to, as found in the original paper, the following elegant equation:

$$min_{G}max_{D} V(D, G) = \mathbb{E}_{x∼Data} [logD(x)] + \mathbb{E}_{z∼Noise} [log(1-D(G(z)))]$$

$\mathbb{E}$ stands for *expectation*, or *average value*. Notice that we switched the minimization/maximization objectives and switched the signs within the function accordingly. This function can be interpreted as follows: 

On the one hand, we try to maximize the log probability that the `Discriminator` makes a correct prediction on real data $x$. On the other, the `Generator` cannot directly affect $log(D(x))$, so minimizing its loss is equivalent to minimizing $log(1 - D(G(z)))$.
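
To get a feeling for the value function, the sketch below evaluates a single-sample estimate of $V(D, G)$ for a few hand-picked discriminator outputs. The function name `value` and all probabilities are assumptions made for this example, not quantities computed by the lab's models.

```python
import math

def value(d_real, d_fake):
    """Single-sample estimate of V(D, G).

    d_real: D(x), the probability D assigns to a real sample being real
    d_fake: D(G(z)), the probability D assigns to a fake sample being real
    """
    return math.log(d_real) + math.log(1.0 - d_fake)

# D is confident and correct: V is close to its maximum of 0
v_strong_d = value(d_real=0.99, d_fake=0.01)

# D is fooled by G: V becomes strongly negative
v_fooled_d = value(d_real=0.99, d_fake=0.99)

# D outputs 1/2 everywhere: V = log(1/2) + log(1/2) = -log(4)
v_equilibrium = value(d_real=0.5, d_fake=0.5)

print(v_strong_d, v_fooled_d, v_equilibrium)
```

$D$ pushes $V$ up towards 0 by classifying correctly, while $G$ pushes $V$ down by making $D(G(z))$ large.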

To update the `Generator`, we use the same loss as for the `Discriminator`: **BCEWithLogitsLoss**. We update $G$ based on the output from $D$ when it has been fed fake data $G(z)$. If $D$ does a good job classifying data generated by $G$ as fake, $G$'s loss will be significant. If $D$ is fooled and classifies $G(z)$ as real data, $G$'s loss will be small.

When a real image $x$ is fed to the `Discriminator`, only the `Discriminator` is updated. But when a fake sample $G(z)$ is fed to $D$, both models' parameters are updated. More on this in the next subchapter.

<img align="center" style="max-width: 500px; height: 300px" src="model_update.png">

#### 4.1.3 Two Scenarios

Let's denote $D$'s loss as $L_{D}$ and $G$'s loss as $L_{G}$. Both players have loss functions that are defined in terms of both players’ parameters $θ$, i.e. $L_{D}(θ_{D}, θ_{G})$ and $L_{G}(θ_{D}, θ_{G})$. $D$ wishes to minimize $L_{D}(θ_{D}, θ_{G})$ and must do so while controlling only $θ_{D}$, and the same holds for $G$ and its parameters. The solution to this game is a (local) minimum, a point in the parameter space where all neighboring points have greater or equal cost. More specifically, this point is a Nash equilibrium with a local minimum of $L_{D}$ with respect to $θ_{D}$ and a local minimum of $L_{G}$ with respect to $θ_{G}$. It should also be noted that this game is a *zero-sum game*, as any advantage gained by one of the models is lost by the other.



To make things more tangible, let's take a look at two concrete examples. In the first example, $D$ is fed real data $X$ from the training set. In the second, we give it fake data $X'$, which has been generated by $G$.

*Scenario 1*: training samples are taken from the real dataset and used as input for the `Discriminator`. $D$'s goal is to output the probability that its input is real rather than fake. Since the input comes from the real dataset, $y = 1$. $D$'s goal, therefore, is to successfully classify it as such: $ŷ = D(x) ≈ 1$. When $y = 1$, only the `Discriminator` is tested and updated. The `Generator` does not come into play yet.

<img align="center" style="max-width: 500px; height: 300px" src="real_scenario.png">

*Scenario 2*: inputs $z$ to the `Generator` are randomly sampled latent variables, drawn from our uniform distribution. $G$ outputs $G(z)$, which is fake data $X'$, an attempt at imitating real data $X$. The `Discriminator` then receives $G(z)$ as input. Here, both players participate. $D$ strives to make $D(G(z))$ approach $0$, while $G$ tries to make the same quantity approach $1$. If both models have sufficient capacity, then the Nash equilibrium of this game corresponds to $G(z)$ being drawn from the same distribution as the training data, and $D(x) = 1/2$ for all $x$.

<img align="center" style="max-width: 500px; height: 300px" src="fake_scenario.png">

How do we specify this when writing the training loop? When we iterate over our mini-batches, we first train the `Discriminator`. $D$'s training is separated into two distinct parts: first, we feed it only real data. For each of the images in the batch, we give the label $1$ (real data). From this, we get $D$'s error on real data. We then train $D$ on fake data. To do so, we generate a batch of latent vectors (latent variables), feed them to $G$ which computes $G(z)$, associate each of these outputs with label $0$ (fake data), and feed them to $D$. We get $D$'s error on fake data. $D$'s total loss is the sum of its error on real and fake data.

Second, we train the `Generator`. We take the previously generated $G(z)$, and again feed it to $D$. But, very importantly, we give each element in the fake data the label $1$ here. This is of crucial importance. It is also what allows us to minimize the loss according to $G$, instead of maximizing it (see sub-chapter above). By associating the label $1$ to the fake data for $G$, we enable the computation of $G$'s loss. That means that when feeding $G(z)$ to the `Discriminator`, its label is $0$ when training $D$ (as explained above) and is $1$ when training $G$. The label is changed to $1$ when training $G$ because in $G$'s eyes, $G(z)$ is real and is the target. If $D$ outputs a probability of $0.8$ that a fake sample $G(z)$ belongs to the real data, we want to punish that severely, and do so thanks to the loss function which sees that $0.8$ is far from the sample's label $0$. However, for the same output from $D$ on the same sample, we want to reward $G$ for fooling $D$. We successfully do so thanks to the loss function that results in a small loss since the output $0.8$ is close to the sample's label $1$ in this case. This is how we get $G$'s loss, which then enables us to update the model's parameters.
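
The numerical example above can be worked through with a few lines of plain Python. The helper `bce` is a hypothetical name used only in this sketch; it implements the single-sample binary cross-entropy from section 4.1.1 on a probability rather than on a logit.

```python
import math

def bce(prediction, label):
    """Binary cross-entropy for one sample; prediction is a probability in (0, 1)."""
    return -(label * math.log(prediction) + (1 - label) * math.log(1 - prediction))

d_output = 0.8  # D's probability that the fake sample G(z) is real

loss_for_d = bce(d_output, label=0)  # D is trained with label 0 for fake data
loss_for_g = bce(d_output, label=1)  # G is trained with label 1 for the same sample

print('loss seen by D: {:.3f}'.format(loss_for_d))  # equals -log(0.2)
print('loss seen by G: {:.3f}'.format(loss_for_g))  # equals -log(0.8)
```

The same discriminator output therefore produces a large loss for $D$ and a small loss for $G$, purely because of the label attached to the sample.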

If this is still confusing, it will hopefully become clearer when going through the actual training loop in the code down below.
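
As a preview, one training iteration following the two phases described above could be sketched as below. This is not the lab's actual training loop: the two networks are tiny stand-ins with arbitrary hidden sizes, the "real" mini-batch is random data, and the use of `detach()` in the first phase is one common way to keep the `Discriminator` update from propagating into the `Generator`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# tiny stand-in networks (hypothetical sizes, not the lab's architecture)
D = nn.Sequential(nn.Linear(784, 16), nn.LeakyReLU(0.2), nn.Linear(16, 1))
G = nn.Sequential(nn.Linear(100, 16), nn.LeakyReLU(0.2), nn.Linear(16, 784), nn.Tanh())

criterion = nn.BCEWithLogitsLoss()
d_optimizer = torch.optim.SGD(D.parameters(), lr=0.02)
g_optimizer = torch.optim.Adam(G.parameters(), lr=0.002)

batch_size = 8
real_images = torch.rand(batch_size, 784)  # random stand-in for a real mini-batch
real_labels = torch.ones(batch_size, 1)    # label 1: real data
fake_labels = torch.zeros(batch_size, 1)   # label 0: fake data

# phase 1: train the discriminator on real and on fake data
d_optimizer.zero_grad()
d_loss_real = criterion(D(real_images), real_labels)

z = torch.rand(batch_size, 100)            # latent vectors, z ~ U[0, 1)
fake_images = G(z)

# detach so that this step does not backpropagate into G
d_loss_fake = criterion(D(fake_images.detach()), fake_labels)

d_loss = d_loss_real + d_loss_fake
d_loss.backward()
d_optimizer.step()

# phase 2: train the generator with the SAME fake images, now labelled 1
g_optimizer.zero_grad()
g_loss = criterion(D(fake_images), real_labels)
g_loss.backward()
g_optimizer.step()

print('d_loss: {:.4f}, g_loss: {:.4f}'.format(d_loss.item(), g_loss.item()))
```

Note how the only difference between the fake-data term of $D$'s loss and $G$'s loss is the label passed to the criterion.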

### 4.2 GAN Architecture Implementation

In this subchapter, we will build the two models that together constitute the **GAN** architecture. We will start with the construction of the `Discriminator`, and then go on to the `Generator`. After this, we will also instantiate both models and the other required components.

You will see that in both models we use `LeakyReLU` rather than `ReLU` as our activation layer (non-linearity), as this avoids [*sparse gradients*](https://www.quora.com/What-does-it-mean-in-deep-learning-and-optimization-problem-that-the-gradient-is-sparse). We also use `dropout`, which deactivates each neuron with a probability of 30% during training to prevent overfitting.
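
The difference between the two activations can be seen on a handful of scalar inputs. The functions below are plain-Python illustrations written for this sketch (PyTorch's own layers operate on tensors); the slope of 0.2 matches the value passed to `nn.LeakyReLU` in the networks of this lab.

```python
def relu(x):
    # negative inputs are clipped to zero, so their gradient is zero as well
    return max(0.0, x)

def leaky_relu(x, negative_slope=0.2):
    # negative inputs are scaled down instead of clipped,
    # so a small gradient (the slope) still flows through
    return x if x >= 0 else negative_slope * x

for x in (-2.0, -0.5, 0.0, 1.5):
    print(x, relu(x), leaky_relu(x))
```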

#### 4.2.1 Discriminator

Our discriminative network $D$, which we name `Discriminator`, consists of four **fully-connected layers**. This type of layer aims at learning **non-linear feature combinations** that allow the detection of patterns. In this type of layer, all inputs are connected to all activation units of the next layer.

Let's implement the `Discriminator`. This is a binary classifier described above. The input size to the first layer is 28x28 = 784, as our **FashionMNIST** images are 28x28 pixels. The output size of the last layer is 1, which corresponds to the model's classification of the input.

In [None]:
# implement the Discriminator network architecture
class Discriminator(nn.Module):

    # define the class constructor
    def __init__(self):

        # call super class constructor
        super(Discriminator, self).__init__()
        
        # specify fc layer 1: in 28*28, out 128
        self.fc1 = nn.Linear(28*28, 128) # the linearity W*x+b
        self.activation1 = nn.LeakyReLU(0.2, inplace=True) # the non-linearity
        
        # specify fc layer 2: in 128, out 64
        self.fc2 = nn.Linear(128, 64) # the linearity W*x+b
        self.activation2 = nn.LeakyReLU(0.2, inplace=True) # the non-linearity

        # specify fc layer 3: in 64, out 32
        self.fc3 = nn.Linear(64, 32) # the linearity W*x+b
        self.activation3 = nn.LeakyReLU(0.2, inplace=True) # the non-linearity
        
        # specify fc layer 4: in 32, out 1
        self.fc4 = nn.Linear(32, 1) # the linearity W*x+b

        # dropout layer
        self.dropout = nn.Dropout(0.3)
        
    # define network forward pass
    def forward(self, x):

        # flatten image
        x = x.view(-1, 28*28)

        # define fc layer 1 forward pass and add dropout
        x = self.activation1(self.fc1(x))
        x = self.dropout(x)

        # define fc layer 2 forward pass and add dropout
        x = self.activation2(self.fc2(x))
        x = self.dropout(x)

        # define fc layer 3 forward pass and add dropout
        x = self.activation3(self.fc3(x))
        x = self.dropout(x)
        
        # define fc layer 4 forward pass
        out = self.fc4(x)

        # return forward pass result
        return out

#### 4.2.2 Generator

Our generative network $G$, which we name `Generator`, consists of four **fully-connected layers**. This type of layer aims at learning **non-linear feature combinations** that allow the detection and, later, generation of patterns. In this type of layer, all inputs are connected to all activation units of the next layer.

The input size to the first layer is 100, as the size of our latent vector (latent variable, uniformly distributed) $z$ is 100. The output size of the last layer is 784, which corresponds to the size of a 28x28 **FashionMNIST** image.

You might notice that we use the `tanh` function as the last layer of our `Generator`. It bounds every generated pixel value to the range $[-1, 1]$, which we do because the `Discriminator` expects a normalized input.

In [None]:
# implement the Generator network architecture
class Generator(nn.Module):

    # define the class constructor
    def __init__(self):

        # call super class constructor
        super(Generator, self).__init__()
        
        # specify fc layer 1: in 100, out 32
        self.fc1 = nn.Linear(100, 32) # the linearity W*x+b
        self.activation1 = nn.LeakyReLU(0.2, inplace=True) # the non-linearity

        # specify fc layer 2: in 32, out 64
        self.fc2 = nn.Linear(32, 64) # the linearity W*x+b
        self.activation2 = nn.LeakyReLU(0.2, inplace=True) # the non-linearity

        # specify fc layer 3: in 64, out 128
        self.fc3 = nn.Linear(64, 128) # the linearity W*x+b
        self.activation3 = nn.LeakyReLU(0.2, inplace=True) # the non-linearity
        
        # specify fc layer 4: in 128, out 28*28
        self.fc4 = nn.Linear(128, 28*28) # the linearity W*x+b
       
        # dropout layer 
        self.dropout = nn.Dropout(0.3)

    # define network forward pass
    def forward(self, x):

        # define fc layer 1 forward pass and add dropout
        x = self.activation1(self.fc1(x))
        x = self.dropout(x)

        # define fc layer 2 forward pass and add dropout
        x = self.activation2(self.fc2(x))
        x = self.dropout(x)

        # define fc layer 3 forward pass and add dropout
        x = self.activation3(self.fc3(x))
        x = self.dropout(x)

        # define fc layer 4 with tanh applied
        out = self.fc4(x).tanh()

        # return forward pass result
        return out

#### 4.2.3 Model Instantiation

Now that we have implemented our first **GAN** and its two models, we are ready to instantiate both models to be trained:

In [None]:
D = Discriminator()
G = Generator()

Let's push the initialized `Discriminator` and `Generator` models to the computing `device` that is enabled:

In [None]:
D = D.to(device)
G = G.to(device)

Let's double check if our model was deployed to the GPU if available:

In [None]:
!nvidia-smi

Once the models are initialized, we can visualize the model structures and review the implemented network architectures by execution of the following cells. We start with the `Discriminator`:

In [None]:
print('[LOG] Discriminator architecture:\n\n{}\n'.format(D))

And now the `Generator`:

In [None]:
print('[LOG] Generator architecture:\n\n{}\n'.format(G))

Looks like intended? Brilliant! Finally, let's have a look into the number of model parameters that we aim to train in the next steps of the notebook. Again, we start with the `Discriminator`. The number of parameters, if everything was defined successfully, should be: 
$$(784+1)*128 + (128+1)*64 + (64+1)*32 + (32+1)*1 = 110'849$$
Don't hesitate to revisit our **CNN** lab if you are unsure as to how to count the number of parameters. Let's verify:

In [None]:
# init the number of model parameters
num_params_d = 0

# iterate over the distinct parameters
for param in D.parameters():

    # collect number of parameters
    num_params_d += param.numel()
    
# print the number of model parameters
print('[LOG] Number of Discriminator model parameters to be trained: {}.'.format(num_params_d))

Now, the `Generator`. The number of parameters, if everything was defined successfully, should be: 
$$(100+1)*32 + (32+1)*64 + (64+1)*128 + (128+1)*784 = 114'800$$
Let's verify:



In [None]:
# init the number of model parameters
num_params_g = 0

# iterate over the distinct parameters
for param in G.parameters():

    # collect number of parameters
    num_params_g += param.numel()

# print the number of model parameters
print('[LOG] Number of Generator model parameters to be trained: {}.'.format(num_params_g))

Ok, our 'simple' **GAN** model already encompasses an impressive number: 110'849 + 114'800 = **225'649** model parameters to be trained.
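
The hand calculations above can also be reproduced without instantiating any network. The helper `count_fc_parameters` is a hypothetical name used only for this sketch; it assumes that every layer is fully connected and has a bias term, as is the case for `nn.Linear` with default settings.

```python
def count_fc_parameters(layer_sizes):
    """Count weights and biases of a stack of fully-connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += (n_in + 1) * n_out  # n_in * n_out weights plus n_out biases
    return total

d_params = count_fc_parameters([28 * 28, 128, 64, 32, 1])
g_params = count_fc_parameters([100, 32, 64, 128, 28 * 28])

print(d_params, g_params, d_params + g_params)
```

The activation and dropout layers do not contribute, since they have no trainable parameters.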

Now that we have implemented the GAN, we are ready to train the networks. However, before starting the training, we need to define an appropriate loss function. Remember, we discussed in the theory part above (see 4.1.1) that we want to use the **Binary Cross-Entropy (BCE)** loss, with logits, so that we do not have to manually define a sigmoid function in the network.

Let's instantiate the **BCEWithLogitsLoss** via the execution of the following PyTorch command:

In [None]:
# define the optimization criterion / loss function
criterion = nn.BCEWithLogitsLoss()

Let's also push the initialized `criterion` computation to the computing `device` that is enabled:

In [None]:
criterion = criterion.to(device)
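
To see what "with logits" means, the following sketch compares the two-step computation (sigmoid first, then BCE) with a one-step computation directly on the raw network output. It is written in plain Python for illustration only and is not PyTorch's actual implementation. The function names and example values are our own assumptions.

```python
import math


def sigmoid(x):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def bce(p, y):
    """Binary cross-entropy for a probability p and a label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def bce_with_logits(x, y):
    """Same loss computed directly from the logit x in a numerically stable form."""
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))


# a few hypothetical (logit, label) pairs
for logit, label in [(2.0, 1), (2.0, 0), (-0.5, 1), (0.0, 0)]:
    two_step = bce(sigmoid(logit), label)
    one_step = bce_with_logits(logit, label)
    print(f'logit={logit:+.1f} label={label} two-step={two_step:.6f} one-step={one_step:.6f}')
```

Both columns agree for moderate logits. The one-step form additionally stays finite for very large logits, where the two-step version would have to take the logarithm of a value rounded to zero.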

Based on the loss magnitude of a certain mini-batch PyTorch automatically computes the gradients. But even better, based on the gradient, the library also helps us in the optimization and update of the network parameters $\theta$.

Following the advice of [Soumith Chintala](https://github.com/soumith/ganhacks), we use the **Stochastic Gradient Descent** (`SGD`) optimizer for our **Discriminator** and the `Adam` optimizer for our **Generator**. In the cell below, the learning rate is 0.02 for `SGD` and 0.002 for `Adam`.

In [None]:
# set learning rate of the generator
lr = 0.002

# create optimizers for the discriminator and generator
# note: the discriminator uses a ten times larger learning rate
d_optimizer = optim.SGD(D.parameters(), lr=0.02)
g_optimizer = optim.Adam(G.parameters(), lr=lr)

That's it! We are finally done with the implementation, now let's get down to training!

## 5. Training the Neural Networks

In this section, we will train our neural network model (as implemented in the section above) using the **FashionMNIST** images. More specifically, we will have a detailed look into the distinct training steps as well as how to monitor the training progress.

### 5.1 Preparing the Network Training

So far, we have pre-processed the dataset, implemented the GAN and defined the loss function. Let's now train the model for **20 epochs** with a **mini-batch size of 64** FashionMNIST images per batch. This implies that the whole dataset will be fed to the networks 20 times in chunks of 64 images, yielding **938 mini-batches** (60'000 images / 64 images per mini-batch, rounded up) per epoch. After the processing of each mini-batch, the parameters of the networks will be updated.

In [None]:
# specify the training parameters
num_epochs = 20 # number of training epochs
mini_batch_size=64 # size of the mini-batches

Furthermore, let's specify and instantiate a corresponding PyTorch data loader that feeds the image tensors to our neural networks:

In [None]:
train_loader = torch.utils.data.DataLoader(fashion_mnist_data, 
                                           batch_size=mini_batch_size,
                                           shuffle=True
                                           )

We can verify the length of the training `DataLoader`, which should correspond to **938 mini-batches**:

In [None]:
len(train_loader)
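
The number 938 follows from simple arithmetic, sketched below. We assume the 60'000 training images mentioned above and the `DataLoader` default of keeping the final, incomplete mini-batch (`drop_last=False`).

```python
import math

num_images = 60000      # size of the FashionMNIST training set
mini_batch_size = 64    # as defined above

# number of mini-batches per epoch; the last one may be smaller
num_batches = math.ceil(num_images / mini_batch_size)

# size of the final, incomplete mini-batch
last_batch_size = num_images - (num_batches - 1) * mini_batch_size

print(num_batches)      # 938
print(last_batch_size)  # 32
```

Because the last mini-batch holds only 32 images, the training loop further below reads the batch size from the data itself instead of assuming 64.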

Remember that our `Discriminator` will attempt to classify samples as either *real* or *fake*. We therefore have to define what these labels will be.

As this is a binary classification task, we define:
>- $1$ as the label for real images: $y = 1$, and
>- $0$ as the label for fake images: $y = 0$

In [None]:
# establish convention for real and fake labels during training
real_label = 1
fake_label = 0
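
These two labels are enough to express both training objectives with the same BCE criterion. The sketch below uses a hypothetical `Discriminator` output probability of 0.1 for a generated image and evaluates it once against `fake_label` (the `Discriminator`'s target) and once against `real_label` (the target used when updating the `Generator`). The helper `bce` and the value 0.1 are our own assumptions for illustration.

```python
import math

real_label = 1
fake_label = 0


def bce(p, y):
    """Binary cross-entropy for a predicted probability p and a label y."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


# hypothetical Discriminator probability that a generated image is real
p_fake_is_real = 0.1

# Discriminator objective on fake images: target is fake_label
d_loss_fake = bce(p_fake_is_real, fake_label)

# Generator objective on the same images: target is real_label
g_loss = bce(p_fake_is_real, real_label)

print(round(d_loss_fake, 4))  # 0.1054
print(round(g_loss, 4))       # 2.3026
```

A confident and correct `Discriminator` therefore has a low loss on fake images, while the very same prediction translates into a high loss for the `Generator`.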

Lastly, we create a batch of latent vectors of size 100 that we will use later on to visualize the progress of the `Generator`. We call it `fixed_noise`, as it will remain fixed throughout training. It will allow us to generate the same 4 images (we define a sample size of 4) repeatedly and to see how the results evolve in the evaluation section.

In [None]:
# define size of latent vector
z_size = 100

# define sample size
sample_size = 4

# uniformly distribute data of size z_size over an interval of -1; 1
fixed_noise = np.random.uniform(-1, 1, size=(sample_size, z_size))

# convert numpy array into tensor, and convert data to float
fixed_noise = torch.from_numpy(fixed_noise).float()

# push the fixed vector to the device that's enabled
fixed_noise = fixed_noise.to(device)

### 5.2 Running the Network Training

Finally, we start training the model. The training procedure for each mini-batch is explained in detail in section 4.1.3, but here are the main steps:

>1. Train the `Discriminator` on the real images,
>2. Generate fake images with the `Generator` and train the `Discriminator` on them,
>3. Do a backward pass through the `Discriminator` and update its parameters $\theta_{D}$,
>4. Train the `Generator` based on the `Discriminator`'s output on fake data,
>5. Do a backward pass through the `Generator` and update its parameters $\theta_{G}$.

To keep track of learning while training our **GAN**, we monitor the losses of both networks as training progresses. To do so, we record the `Discriminator` and `Generator` loss after each mini-batch and average them at the end of each epoch. Based on this, we can judge the training progress and whether the losses are converging (indicating that the models might not improve any further).

The following elements of the network training code below should be given particular attention:
 
>- `loss.backward()` computes the gradients based on the magnitude of the classification (BCE) loss,
>- `optimizer.step()` updates the network parameters based on the gradient.
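
What these two calls do can be sketched by hand for a model with a single weight. The example below is plain Python and only mimics the mechanics. The one-weight model, the learning rate and the training example are hypothetical and not taken from the lab.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def loss_and_grad(w, x, y):
    """BCE loss of a one-weight model and its gradient with respect to w."""
    p = sigmoid(w * x)
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    grad = (p - y) * x   # what backward() would store in w.grad
    return loss, grad


w, lr = 0.0, 0.5        # hypothetical initial weight and learning rate
x, y = 1.0, 1           # a single training example with label 'real'

for step in range(3):
    loss, grad = loss_and_grad(w, x, y)   # forward pass + "backward()"
    w = w - lr * grad                      # "optimizer.step()" for plain SGD
    print(f'step {step}: loss={loss:.4f} grad={grad:+.4f} w={w:.4f}')
```

Note that PyTorch accumulates gradients across `backward()` calls until they are reset. This is why the training loop calls `zero_grad()` once and can then run two backward passes (real and fake batch) before a single `Discriminator` update.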

In [None]:
# initialize list of the generated (fake) images
fake_images = []

# initialize collection of batch losses
D_batch_losses = []
G_batch_losses = []

# initialize collection of epoch losses
D_epoch_losses = []
G_epoch_losses = []

# set networks to training mode
D.train()
G.train()

# define time right before training
start = time.datetime.now()

# train the GANs
for epoch in range(num_epochs):

    # iterate over mini batches
    for i, data in enumerate(train_loader, 0):

        # define real images and push to computation device
        real_images = data[0].to(device)

        # use the actual size of this mini-batch, since the last mini-batch of an epoch may be smaller
        batch_size = real_images.size(0)

        # --------------------------------------------------------------------------
        # (1) Update Discriminator network

        #### train with real images

        # create tensor of same size as mini-batch and filled with 1's (real_label)
        label = torch.full((batch_size,), real_label, dtype=torch.float, device=device)

        # rescaling input images from [0,1) to [-1, 1), which is needed for network
        real_images = real_images*2 - 1

        # run forward pass through Discriminator
        output = D(real_images).view(-1)

        # reset graph gradients
        D.zero_grad()

        # determine loss on Discriminator
        errD_real = criterion(output, label)

        # run backward pass
        errD_real.backward()
    
        #### train with fake images

        # generate batch of latent vectors
        z = np.random.uniform(-1, 1, size=(batch_size, z_size))
        z = torch.from_numpy(z).float()
        z = z.to(device)

        # generate fake image batch with Generator
        fake = G(z)

        # fills label tensor with 0's (fake_label)
        label.fill_(fake_label)

        # classify the fake batch with Discriminator (detached, so no gradients flow into the Generator)
        output = D(fake.detach()).view(-1)

        # get discriminator loss on the fake batch
        errD_fake = criterion(output, label)

        # run backward pass
        errD_fake.backward()

        # compute error of Discriminator as sum of loss over the fake and the real batches
        errD = errD_fake + errD_real

        # update Discriminator parameters
        d_optimizer.step()


        # --------------------------------------------------------------------------
        # (2) Update Generator network

        # reset graph gradients
        G.zero_grad()

        # fake labels are real for generator
        label.fill_(real_label)

        # since we just updated D, perform another forward pass of fake batch through the Discriminator
        output = D(fake).view(-1)

        # get Generator loss based on this output
        errG = criterion(output, label)

        # run backward pass
        errG.backward()

        # update Generator parameters
        g_optimizer.step()

        # --------------------------------------------------------------------------

        # every 500 iterations (twice per epoch), print losses
        if i % 500 == 0:
          now = datetime.utcnow().strftime("%H:%M:%S")
          print('[LOG {}] Epoch [{}/{}] \t[{}/{}] \t d_loss: {} \t g_loss: {}'.format(
              now, epoch+1, num_epochs,i, len(train_loader), errD.item(), errG.item()))
          
        # save losses for plotting later
        D_batch_losses.append(errD.item())
        G_batch_losses.append(errG.item())

        # set Generator to eval mode for generating samples (equivalent to 'testing' the model)
        G.eval() 

        # make Generator generate samples from the fixed noise distribution
        samples = G(fixed_noise.float())

        # if you are using a GPU, copy tensor to host memory (cpu) - needed for later operations
        if device == 'cuda':
          samples = samples.cpu()

        # append generated fixed samples to the fake_images list
        fake_images.append(samples)

        # set Generator back to train mode
        G.train()

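    # determine mean mini-batch loss of the current epoch (last len(train_loader) entries)
    D_epoch_loss = np.mean(D_batch_losses[-len(train_loader):])

    D_epoch_losses.append(D_epoch_loss)

    # determine mean mini-batch loss of the current epoch (last len(train_loader) entries)
    G_epoch_loss = np.mean(G_batch_losses[-len(train_loader):])

    G_epoch_losses.append(G_epoch_loss)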

    # set filename of actual model
    d_model_name = 'gan_d_model_epoch_{}.pth'.format(str(epoch+1))

    # set filename of actual model
    g_model_name = 'gan_g_model_epoch_{}.pth'.format(str(epoch+1))

    # save current model to GDrive models directory
    torch.save(D.state_dict(), os.path.join(models_directory, d_model_name))

    # save current model to GDrive models directory
    torch.save(G.state_dict(), os.path.join(models_directory, g_model_name))

# save generated samples with pickle
with open('fake_images.pkl', 'wb') as f:
  pkl.dump(fake_images, f)

# print total training time
print('\nTotal training time:', time.datetime.now() - start)

## 6. Evaluation

As we do not have a test set, the evaluation of a **GAN** does not resemble that of a 'normal' classifier. First, we base our evaluation on the progression of the losses of our two adversarial models. Then, and more interestingly, we look at the images that were generated by the `Generator` from the fixed noise we created. In this regard, you could say that the training and the testing of the network happen simultaneously. There exist some more complex methods to evaluate **GAN**s, but in this lab we will stick to those mentioned here.

### 6.1 Model Loss

Let's visualize and inspect the loss per training iteration (mini-batch). We start with the `Discriminator`'s loss:

In [None]:
# prepare plot
fig, ax = plt.subplots(figsize=(20,8))

# convert losses to numpy arrays
D_batch_losses = np.array(D_batch_losses)

# add grid
ax.grid(linestyle='dotted')

# plot losses of Discriminator
plt.plot(D_batch_losses, label='Discriminator', c = 'tab:green')

# add axis legends
ax.set_xlabel("[Training mini-batch $mb_i$]", fontsize=10)
ax.set_ylabel("[Classification Error of Discriminator $D$, $L^{BCE}$]", fontsize=10)

# add plot legends
plt.legend()

# add plot title
plt.title('Training Iterations $mb_i$ vs. Classification Error of Discriminator $D$, $L^{BCE}$', fontsize=10);

Now the `Generator`'s loss:

In [None]:
# prepare plot
fig, ax = plt.subplots(figsize=(20,8))

# convert losses to numpy arrays
G_batch_losses = np.array(G_batch_losses)

# add grid
ax.grid(linestyle='dotted')

# plot losses of Generator
plt.plot(G_batch_losses, label='Generator', c = 'tab:orange')

# add axis legends
ax.set_xlabel("[Training mini-batch $mb_i$]", fontsize=10)
ax.set_ylabel("[Classification Error of Generator $G$, $L^{BCE}$]", fontsize=10)

# add plot legends
plt.legend()

# add plot title
plt.title('Training Iterations $mb_i$ vs. Classification Error of Generator $G$, $L^{BCE}$', fontsize=10);
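
Since the per-batch losses of a **GAN** tend to be noisy, a moving average can make the underlying trend easier to read. The helper below is a minimal sketch in plain Python. The function name, the window size and the toy values are our own choices and not part of the lab code.

```python
def moving_average(values, window):
    """Return the mean of each run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError('window must be between 1 and len(values)')
    averaged = []
    running_sum = sum(values[:window])
    averaged.append(running_sum / window)
    for i in range(window, len(values)):
        # slide the window one step: add the new value, drop the oldest
        running_sum += values[i] - values[i - window]
        averaged.append(running_sum / window)
    return averaged


# toy loss values, purely illustrative
toy_losses = [1.0, 3.0, 2.0, 4.0, 3.0, 5.0]
print(moving_average(toy_losses, window=2))  # [2.0, 2.5, 3.0, 3.5, 4.0]
```

Applied to the recorded losses, a call such as `moving_average(list(G_batch_losses), window=100)` could be passed to `plt.plot` in place of the raw values.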

Although the batch losses fluctuate strongly, they seem to indicate that the `Discriminator` starts off with a low loss which progressively increases, while the `Generator`'s loss decreases over the course of training. Let's plot the mean epoch losses of both models to get a clearer overview:

In [None]:
# prepare plot
fig, ax = plt.subplots(figsize=(20,8))

# convert losses to numpy arrays
D_epoch_losses = np.array(D_epoch_losses)
G_epoch_losses = np.array(G_epoch_losses)

# add grid
ax.grid(linestyle='dotted')

# plot losses of Discriminator and Generator
plt.plot(D_epoch_losses, label='Discriminator', c = 'tab:green')
plt.plot(G_epoch_losses, label='Generator', c = 'tab:orange')

# add axis legends
ax.set_xlabel("[Training epoch $e_i$]", fontsize=10)
ax.set_ylabel("[Classification Error $L^{BCE}$]", fontsize=10)

# add plot legends
plt.legend()

# add plot title
plt.title('Training Epochs $e_i$ vs. Classification Error $L^{BCE}$', fontsize=10);

Ok, fantastic. The training error converges nicely for both networks. The `Discriminator` starts off strong during the first few epochs with a very low loss, while the `Generator` has a very high loss. It is very apparent that the Generator has no idea what to do at that point. Then, the trends change as the Generator gets better at knowing what fools the Discriminator - i.e. gets better at faking images. 

We see that the loss of the Generator is consistently a bit lower than that of the Discriminator after a few epochs. You could then hypothesize that the Generator is often able to fool the Discriminator.
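
When reading these curves, it helps to know the loss of a `Discriminator` that cannot tell real from fake and outputs a probability of 0.5 for every image. The short calculation below is a sketch of that reference point. It relies on the fact that, in the training loop above, the `Discriminator` loss is the sum of a real and a fake term, while the `Generator` loss is a single term.

```python
import math

# BCE loss of a classifier that outputs probability 0.5 for every sample
undecided_loss = -math.log(0.5)

# the Discriminator loss above is the sum of a real and a fake term
undecided_d_loss = 2 * undecided_loss

print(round(undecided_loss, 4))    # 0.6931
print(round(undecided_d_loss, 4))  # 1.3863
```

Part of the gap between the two curves may therefore simply reflect that the `Discriminator` loss sums two terms. Comparing each curve with its own reference value is likely more informative than comparing the curves with each other.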

### 6.2 Images generated on fixed noise

Let's now inspect the images that were generated during the training of our **GAN**. To do so, we start by defining a function that we will use to display the generated samples. These are the samples that were created from the fixed noise, which helps us see the evolution of the `Generator`'s progress on a fixed set of latent vectors.

In [None]:
# create function to view the image samples
def view_samples(epoch, samples):

    # initialize plot
    fig, axes = plt.subplots(figsize=(10,7), nrows=1, ncols=4, sharey=True, sharex=True)
    
    # adjust padding between subplots
    fig.tight_layout(pad=5.0)

    # iterate over the 4 fixed-noise images saved at the first mini-batch of the given epoch
    # (one epoch spans 938 mini-batches, and 4 images are saved at each iteration)
    for i, (ax, img) in enumerate(zip(axes.flatten(), samples[epoch*938])):
        
        # create title for each subplot
        ax.set_title(f'Epoch {epoch+1}, Sample {i+1}')

        # detach image
        img = img.detach()

        # disable axes
        ax.xaxis.set_visible(False)
        ax.yaxis.set_visible(False)

        # show 28 by 28 grayscale image
        im = ax.imshow(img.reshape((28,28)), cmap='Greys_r')
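
The index `epoch*938` used in the function deserves a short note. The training loop stores one snapshot of 4 images per mini-batch, so the list of snapshots has one entry per iteration. The sketch below shows how an epoch and a mini-batch position map to a list index. The helper `snapshot_index` is hypothetical and not part of the lab code.

```python
batches_per_epoch = 938   # mini-batches per epoch, as computed earlier


def snapshot_index(epoch, batch=0):
    """Position in the snapshot list for a zero-based epoch and mini-batch."""
    if not 0 <= batch < batches_per_epoch:
        raise ValueError('batch out of range')
    return epoch * batches_per_epoch + batch


print(snapshot_index(0))                           # 0: after the very first update
print(snapshot_index(1))                           # 938: start of the second epoch
print(snapshot_index(0, batches_per_epoch - 1))    # 937: end of the first epoch
```

The function above therefore shows the `Generator` at the start of each epoch. In particular, the images labelled 'Epoch 1' are produced after a single parameter update.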

We now 'unpickle' the samples we had saved with the `pickle` library:

In [None]:
# load the previously pickled fixed-noised samples
with open('fake_images.pkl', 'rb') as f:
    samples = pkl.load(f)

And call our function at each epoch, to inspect the progress:

In [None]:
# iterate over epochs
for i in range(num_epochs):
  
    # call function to view the 4 samples
    view_samples(i, samples)

Cool, right? The samples generated by the `Generator` start by being very bad. They are totally random in the first epoch, and quite bad in the second - although we witness clear progress. They then progressively improve, to the point where we can clearly see them representing clothes similar to those in the **FashionMNIST** dataset. The quality stabilizes. This rapid improvement and stabilization perfectly corresponds to the progression of the `Generator`'s loss. 

Interestingly, we sometimes see the same sample switching fashion class between epochs. As you can see, the model - the parameters of which are updated at each iteration - does not care whether it outputs a shoe or a shirt, it simply aims at minimizing its loss.

## 7. Exercises

We recommend you try the following exercises as part of the lab:

**1. Train the network a couple more epochs and evaluate its performance.**

> Increase the number of training epochs up to 40 or 50 epochs and re-run the network training. Load and evaluate the model exhibiting the lowest training loss. What do you notice? Is the quality of the generated images improving? What about the losses? How could we improve the networks?

In [None]:
# ***************************************************
# INSERT YOUR CODE HERE
# ***************************************************

**2. Optimizers and learning rate.**

> Try changing the **optimizers** (currently *SGD* and *Adam*) and the **learning rates** (currently 0.02 for *SGD* and 0.002 for *Adam*) used in the network training. How does this affect the training and the results?

In [None]:
# ***************************************************
# INSERT YOUR CODE HERE
# ***************************************************

**3. Find out about the Sigmoid function.**
> A *Sigmoid* function is applied to the output of the `Discriminator` (inside `BCEWithLogitsLoss`) to get a prediction probability. Why did we use this function and not a *Softmax* function, as we did in previous labs?

In [None]:
# ***************************************************
# INSERT YOUR ANSWER HERE
# ***************************************************

## 8. Lab Summary

In this lab, we presented a step-by-step introduction into the **design, implementation, training and evaluation** of Generative Adversarial Networks (**GANs**) to generate tiny images of fashion objects. The code and exercises presented in this lab may serve as a starting point for developing more complex, deeper and more tailored GANs.

You may want to execute the content of your lab outside of the Jupyter notebook environment, e.g. on a compute node or a server. The cell below converts the lab notebook into a standalone and executable python script. Pls. note that to convert the notebook, you need to install Python's **nbconvert** library and its extensions:

In [None]:
# installing the nbconvert library
!pip install nbconvert
!pip install jupyter_contrib_nbextensions

Let's now convert the Jupyter notebook into a plain Python script:

In [None]:
!jupyter nbconvert --to script ml_lab_07.ipynb

## 9. References

>- Theory: Goodfellow et al. (2014). 'Generative Adversarial Nets'. Available at https://arxiv.org/pdf/1406.2661.pdf
>- Theory: Goodfellow, I. (2017). 'NIPS 2016 Tutorial: Generative Adversarial Networks'. Available at https://arxiv.org/pdf/1701.00160.pdf
>- Theory/mathematical notation: Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. (2020). Dive into Deep Learning. Available at https://d2l.ai/index.html
>- Code snippets: PyTorch DCGAN tutorial, available at https://pytorch.org/tutorials/beginner/dcgan_faces_tutorial.html
>- Pickled images: Agrawal, T. (2019). 'Train your first GAN model from scratch using PyTorch'. Available at https://blog.usejournal.com/train-your-first-gan-model-from-scratch-using-pytorch-9b72987fd2c0