- Introducing generative models for synthesizing new data;
- Autoencoders, variational autoencoders, and their relationship to GANs;
- Understanding the building blocks of GANs;
- Implementing a simple GAN model to generate handwritten digits;
- Understanding transposed convolution and batch normalization;
- Improving GANs: deep convolutional GANs and GANs using the Wasserstein distance.

# Introducing generative adversarial networks

The overall objective of a GAN is to synthesize new data that has the same distribution as its training dataset. GANs
are considered part of the unsupervised learning category of machine learning tasks, since no labeled data is required.

The GAN architecture proposed in the original paper was based on fully connected layers, similar to multilayer
perceptron architectures, and was trained to generate low-resolution MNIST-like handwritten digits. It served more as a
proof of concept to demonstrate the feasibility of this new approach.

# Starting with autoencoders

While standard autoencoders cannot generate new data, understanding their function will help you better understand GANs.

Autoencoders are composed of two networks concatenated together: an encoder network and a decoder network. The encoder
network receives a $d$-dimensional input feature vector associated with example $x$ and encodes it into a
$p$-dimensional vector, $z$. In other words, the role of the encoder is to learn how to model the function $z = f(x)$.

The encoded vector, $z$, is also called the latent vector, or the latent feature representation. Typically, the
dimensionality of the latent vector is less than that of the input examples ($p<d$). Hence, we can say that the encoder
acts as a data compression function.

Then, the decoder decompresses $\hat{x}$ from the lower-dimensional latent vector, $z$, where we can think of the
decoder as a function $\hat{x}=g(z)$.
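
The encoder/decoder composition described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not a tuned model: the sizes `d = 784` and `p = 32`, the single linear layer per network, and the mean squared error reconstruction loss are all assumptions chosen for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d, p = 784, 32  # assumed input and latent dimensionalities (p < d)

# Encoder f: compresses a d-dimensional input into a p-dimensional latent vector z
encoder = nn.Sequential(nn.Linear(d, p), nn.ReLU())
# Decoder g: reconstructs a d-dimensional output from the latent vector z
decoder = nn.Sequential(nn.Linear(p, d), nn.Sigmoid())

x = torch.rand(4, d)   # a batch of 4 stand-in examples with values in [0, 1)
z = encoder(x)         # z = f(x)
x_hat = decoder(z)     # x_hat = g(z)

# Training would minimize a reconstruction loss between x_hat and x
reconstruction_loss = nn.functional.mse_loss(x_hat, x)
print(tuple(z.shape), tuple(x_hat.shape))
```

The latent batch has `p` columns and the reconstruction has `d` columns again, which is the compression and decompression behavior described in the text.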

> Note that some autoencoder variants use a latent space of higher dimensionality than the input. This is especially
> useful in the context of denoising.

# Generative model for synthesizing new data
Autoencoders are deterministic models, so they can only reconstruct an input from its latent feature
representation. They cannot generate new data beyond reconstructing their input through the transformation of the
compressed representation.

A generative model, on the other hand, can generate a new example, $\tilde{x}$, from a random vector, $z$
(corresponding to the latent representation).

We can notice some similarities between the decoder of the autoencoder and a generative model. However, a major
difference between the two is that we do not know the distribution of $z$ in the autoencoder, while in the generative
model, the distribution of $z$ is fully characterizable. One approach to generalizing an autoencoder into a generative
model is the variational autoencoder (VAE).

In a VAE receiving an input example, $x$, the encoder network is modified in such a way that it computes two moments of
the distribution of the latent vector: the mean, $\mu$, and the variance, $\sigma^2$. During the training of a VAE, the
network is forced to match these moments with those of a standard normal distribution (zero mean and unit variance).
Then, after the VAE model is trained, the encoder is discarded and we can use the decoder network to generate new
examples, $\tilde{x}$, by feeding random $z$ vectors from the "learned" Gaussian distribution.
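
The notes above do not say how the moments are "forced" to match. A common choice, assumed here, is to add the KL divergence between $N(\mu, \sigma^2)$ and the standard normal distribution to the loss, with the encoder predicting $\log \sigma^2$. The helper names `kl_to_standard_normal` and `sample_latent` are hypothetical.

```python
import torch


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ) per example, with log_var = log(sigma^2)."""
    return -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=1)


def sample_latent(mu, log_var):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1) (reparameterization)."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps


torch.manual_seed(0)

# Moments that already match the standard normal give zero penalty
matched = kl_to_standard_normal(torch.zeros(2, 3), torch.zeros(2, 3))
# A shifted mean is penalized
shifted = kl_to_standard_normal(torch.ones(2, 3), torch.zeros(2, 3))

z_train = sample_latent(torch.zeros(2, 3), torch.zeros(2, 3))  # used during training
z_new = torch.randn(5, 3)  # after training: sample z directly and feed it to the decoder
print(matched, shifted)
```

The penalty is zero only when the predicted mean is 0 and the predicted variance is 1, which is what pulls the latent distribution toward the one we later sample from.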

# Generating new samples with GANs

Let's assume we have a generator network, $G$, such that $\tilde{x} = G(z)$, where $z$ is a random vector sampled from
a known distribution. As always, we will initialize this network with random weights. Therefore, the first output
images, before the weights are adjusted, will look like white noise.

Now, imagine there is a function that can assess the quality of images (assessor function). We can use the feedback from
that function to tell our generator network how to adjust its weights to improve the quality of the generated images.

While an assessor function, as described in the previous paragraph, would make the image generation task very easy, the
question is whether such a universal function to assess the quality of images exists and how it is defined.

A GAN model includes an additional neural network called the discriminator ($D$), which is a classifier that learns to
distinguish a synthesized image, $\tilde{x}$, from a real image, $x$.

In a GAN model, the two networks, generator and discriminator, are trained together. Over time, both networks become
better as they interact with each other. In fact, the two networks play an adversarial game, where the generator learns
to improve its output to be able to fool the discriminator. At the same time, the discriminator becomes better at
detecting the synthesized images.

## Understanding the loss function of the generator and discriminator networks in a GAN model
$$
V(\theta^{(D)},\theta^{(G)}) = E_{x \sim p_{data}(x)}[\log{D(x)}] + E_{z \sim p_z(z)}[\log{(1-D(G(z)))}]
$$

Here, $V(\theta^{(D)},\theta^{(G)})$ is called the value function, which can be interpreted as a payoff: we want to
maximize its value with respect to the discriminator, while minimizing its value with respect to the generator.

$D(x)$ is the probability that indicates whether the input example, $x$, is real or fake (that is, generated).

The expression $E_{x \sim p_{data}(x)}[\log{D(x)}]$ refers to the expected value (averaging over all the samples) of
the quantity in brackets with respect to the examples from the data distribution (distribution of the real examples).
$E_{z \sim p_z(z)}[\log{(1-D(G(z)))}]$ refers to the expected value of the quantity with respect to the distribution of
the input vectors, $z$.
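
In practice, the two expectations are estimated by averaging over a batch. The sketch below uses made-up discriminator outputs (the tensors `d_real` and `d_fake` are hypothetical) to evaluate the value function, and also checks that the sum of two binary cross-entropy losses, with label 1 for real and label 0 for generated examples, equals $-V$.

```python
import torch
import torch.nn.functional as F


def value_function(d_real, d_fake):
    """Batch estimate of V: mean log D(x) plus mean log(1 - D(G(z)))."""
    return torch.log(d_real).mean() + torch.log(1 - d_fake).mean()


# Hypothetical discriminator outputs (probabilities of "real")
d_real = torch.tensor([0.90, 0.80, 0.95])  # D(x) on real examples
d_fake = torch.tensor([0.10, 0.20, 0.05])  # D(G(z)) on generated examples

v_confident = value_function(d_real, d_fake)
# A discriminator that always outputs 0.5 cannot tell the two apart
v_chance = value_function(torch.full((3,), 0.5), torch.full((3,), 0.5))

# BCE with label 1 for real and label 0 for generated examples equals -V
bce_total = (F.binary_cross_entropy(d_real, torch.ones(3))
             + F.binary_cross_entropy(d_fake, torch.zeros(3)))

print(f"V (confident D): {v_confident.item():.4f}")
print(f"V (chance-level D): {v_chance.item():.4f}")
```

A discriminator that separates the two batches well yields a larger $V$ than one stuck at 0.5, which is why the discriminator maximizes $V$ while the generator tries to push it down.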

A practical way of training GANs is to alternate between these two optimization steps:
1. Freeze the parameters of one network and optimize the weights of the other one;
2. Freeze the second network and optimize the first one;
3. Repeat at each training iteration.
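
One iteration of the alternating scheme listed above can be sketched with two toy networks. Everything here is an illustrative assumption: the tiny `G` and `D`, the SGD optimizers, the stand-in `real` batch, and the helpers `set_trainable`, `snapshot`, and `same_as`. Freezing is done explicitly with `requires_grad_`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins (hypothetical, not the notebook's models)
G = nn.Linear(2, 2)
D = nn.Sequential(nn.Linear(2, 1), nn.Sigmoid())
g_opt = torch.optim.SGD(G.parameters(), lr=0.1)
d_opt = torch.optim.SGD(D.parameters(), lr=0.1)
bce = nn.BCELoss()

real = torch.randn(8, 2) + 3.0  # stand-in "real" batch
ones, zeros = torch.ones(8, 1), torch.zeros(8, 1)


def set_trainable(model, flag):
    """Freeze (flag=False) or unfreeze (flag=True) all parameters of a model."""
    for param in model.parameters():
        param.requires_grad_(flag)


def snapshot(model):
    """Copy the current parameter values for later comparison."""
    return [param.detach().clone() for param in model.parameters()]


def same_as(model, saved):
    """True if every parameter still equals its saved copy."""
    return all(torch.equal(p, s) for p, s in zip(model.parameters(), saved))


g_start, d_start = snapshot(G), snapshot(D)

# Step 1: freeze G, optimize D on real (label 1) and generated (label 0) batches
set_trainable(G, False)
set_trainable(D, True)
d_opt.zero_grad()
d_loss = bce(D(real), ones) + bce(D(G(torch.randn(8, 2))), zeros)
d_loss.backward()
d_opt.step()
g_unchanged_after_d_step = same_as(G, g_start)
d_changed = not same_as(D, d_start)

# Step 2: freeze D, optimize G so that D labels its outputs as real
d_mid = snapshot(D)
set_trainable(G, True)
set_trainable(D, False)
g_opt.zero_grad()
g_loss = bce(D(G(torch.randn(8, 2))), ones)
g_loss.backward()
g_opt.step()
d_unchanged_after_g_step = same_as(D, d_mid)
g_changed = not same_as(G, g_start)

print(g_unchanged_after_d_step, d_changed, d_unchanged_after_g_step, g_changed)
```

The notebook's own `d_train` and `g_train` functions get the same effect more simply: each one calls `step()` on only one optimizer, so the other network's weights are left untouched even though it is not explicitly frozen.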

Let's assume that the generator network is fixed, and we want to optimize the discriminator. Both terms in the value
function contribute to optimizing the discriminator, where the first term corresponds to the loss associated with the
real examples, and the second term is the loss for the fake examples.

Therefore, when $G$ is fixed, our objective is to maximize the value function, which means making the discriminator
better at distinguishing between real and generated images.

After optimizing the discriminator using the loss terms for real and fake samples, we then fix the discriminator and
optimize the generator. In this case, only the second term of the value function contributes to the gradients of the
generator. As a result, when $D$ is fixed, our objective is to minimize the value function.

However, $\log{(1-D(G(z)))}$ suffers from vanishing gradients in the early training stages: early in the learning
process, generated outputs look nothing like real examples, so $D(G(z))$ will be close to zero. This phenomenon
is called saturation. To resolve this issue, we can reformulate the minimization objective,
$\min_G E_{z \sim p_z(z)}[\log{(1-D(G(z)))}]$, as the maximization objective $\max_G E_{z \sim p_z(z)}[\log{(D(G(z)))}]$.
In practice, this means that when training the generator we swap the labels of real and fake examples (generated images
are assigned the label 1) and minimize the binary cross-entropy loss as usual.
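
The saturation effect can be checked numerically. The sketch below assumes the discriminator ends in a sigmoid and picks an illustrative pre-sigmoid output (`logit = -6.0`) for a generated example that is confidently rejected. It then compares the gradient of each generator objective with respect to that logit.

```python
import torch
import torch.nn.functional as F

# Assumed pre-sigmoid discriminator output for a generated example early in
# training, when D confidently rejects it (illustrative value)
logit = torch.tensor(-6.0, requires_grad=True)
d_out = torch.sigmoid(logit)  # D(G(z)), close to zero

# Original generator objective: minimize log(1 - D(G(z)))
saturating = torch.log(1 - d_out)
(grad_saturating,) = torch.autograd.grad(saturating, logit, retain_graph=True)

# Reformulated objective: maximize log D(G(z)), i.e. minimize -log D(G(z))
non_saturating = -torch.log(d_out)
(grad_non_saturating,) = torch.autograd.grad(non_saturating, logit, retain_graph=True)

# Binary cross-entropy against the flipped label 1 is the same quantity
bce_flipped = F.binary_cross_entropy(d_out.reshape(1), torch.ones(1))

print(f"gradient magnitude, log(1 - D): {grad_saturating.abs().item():.4f}")
print(f"gradient magnitude, -log(D):    {grad_non_saturating.abs().item():.4f}")
```

With the original objective the gradient magnitude equals $D(G(z))$ itself, so it vanishes exactly when the generator most needs a learning signal. The reformulated objective has magnitude $1 - D(G(z))$, which stays close to 1 in the same situation.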

## Implementing a GAN from scratch

```python
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt


def make_generator_network(input_size=20, num_hidden_layers=1,
                           num_hidden_units=100, num_output_units=784):
    """Fully connected generator: latent vector z -> flattened image."""
    model = nn.Sequential()
    for i in range(num_hidden_layers):
        model.add_module(f"hidden{i}", nn.Sequential(
            nn.Linear(input_size, num_hidden_units),
            nn.LeakyReLU(),
        ))
        input_size = num_hidden_units
    # Tanh keeps the generated pixel values in [-1, 1]
    model.add_module("prediction_head", nn.Sequential(
        nn.Linear(input_size, num_output_units),
        nn.Tanh(),
    ))
    return model


def make_discriminator_network(input_size, num_hidden_layers=1,
                               num_hidden_units=100, num_output_units=1):
    """Fully connected discriminator: flattened image -> probability of being real."""
    model = nn.Sequential()
    for i in range(num_hidden_layers):
        model.add_module(f"hidden{i}", nn.Sequential(
            nn.Linear(input_size, num_hidden_units),
            nn.LeakyReLU(),
            nn.Dropout(),
        ))
        input_size = num_hidden_units
    # Sigmoid turns the single output unit into a probability
    model.add_module("prediction_head", nn.Sequential(
        nn.Linear(input_size, num_output_units),
        nn.Sigmoid(),
    ))
    return model
```

```python
import torch

image_size = (28, 28)
z_size = 20
gen_hidden_layers = 1
gen_hidden_size = 100
disc_hidden_layers = 1
disc_hidden_size = 100

torch.manual_seed(1)
# int(...) converts NumPy's integer type into a plain Python int for nn.Linear
gen_model = make_generator_network(
    z_size, gen_hidden_layers, gen_hidden_size, int(np.prod(image_size))
)
disc_model = make_discriminator_network(
    int(np.prod(image_size)), disc_hidden_layers, disc_hidden_size
)
```

```python
import torchvision
import torchvision.transforms.v2 as t

image_path = '../NNs with PyTorch/'
transform = t.Compose([
    t.ToImage(),
    t.ToDtype(torch.float32, scale=True),
    t.Normalize(mean=[0.5], std=[0.5]),
])
mnist_dataset = torchvision.datasets.MNIST(
    root=image_path, train=True, transform=transform, download=False
)
```
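
The `t.Normalize(mean=[0.5], std=[0.5])` step maps pixel values from $[0, 1]$ to $[-1, 1]$, the same range the generator's `nn.Tanh()` output covers. The `(images+1)/2` step in `create_samples` later in the notebook undoes this mapping for display. A minimal sketch with three illustrative pixel values:

```python
import torch

# Pixel intensities after scaling to [0, 1] (illustrative values)
pixels = torch.tensor([0.0, 0.25, 1.0])

# Per-pixel arithmetic of Normalize(mean=[0.5], std=[0.5]): (x - mean) / std
normalized = (pixels - 0.5) / 0.5

# Inverse mapping used when displaying generated images
restored = (normalized + 1) / 2

print(normalized, restored)
```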

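```python
def create_noise(batch_size, z_size, mode_z):
    """Sample a batch of latent vectors from a uniform or normal distribution."""
    if mode_z == "uniform":
        # Uniform in [-1, 1)
        input_z = torch.rand(batch_size, z_size) * 2 - 1
    elif mode_z == "normal":
        input_z = torch.randn(batch_size, z_size)
    else:
        raise ValueError(f"Unknown mode_z: {mode_z!r}")
    return input_z
```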

```python
from torch.utils.data import DataLoader

batch_size = 64
torch.manual_seed(1)
np.random.seed(1)
# Use Apple's Metal backend when available; otherwise fall back to the CPU
device = "mps" if torch.backends.mps.is_available() else "cpu"
mnist_dl = DataLoader(
    mnist_dataset, batch_size=batch_size, shuffle=True, drop_last=True
)
gen_model = make_generator_network(
    input_size=z_size,
    num_hidden_layers=gen_hidden_layers,
    num_hidden_units=gen_hidden_size,
    num_output_units=int(np.prod(image_size)),
).to(device)
disc_model = make_discriminator_network(
    input_size=int(np.prod(image_size)),
    num_hidden_layers=disc_hidden_layers,
    num_hidden_units=disc_hidden_size,
).to(device)
loss_fn = nn.BCELoss()
g_optimizer = torch.optim.Adam(gen_model.parameters())
d_optimizer = torch.optim.Adam(disc_model.parameters())
mode_z = "uniform"
```

```python
def d_train(x):
    """One discriminator update on a real batch and a generated batch."""
    disc_model.zero_grad()

    # Train discriminator with a real batch (label 1)
    batch_size = x.size(0)
    x = x.view(batch_size, -1).to(device)
    d_labels_real = torch.ones(batch_size, 1, device=device)

    d_proba_real = disc_model(x)
    d_loss_real = loss_fn(d_proba_real, d_labels_real)

    # Train discriminator on a fake batch (label 0)
    input_z = create_noise(batch_size, z_size, mode_z).to(device)
    g_output = gen_model(input_z)

    d_proba_fake = disc_model(g_output)
    d_labels_fake = torch.zeros(batch_size, 1, device=device)
    d_loss_fake = loss_fn(d_proba_fake, d_labels_fake)

    # Gradient backprop & optimize ONLY D's parameters
    d_loss = d_loss_real + d_loss_fake
    d_loss.backward()
    d_optimizer.step()

    return d_loss.item(), d_proba_real.detach(), d_proba_fake.detach()


def g_train(x):
    """One generator update using flipped labels (generated images labeled 1)."""
    gen_model.zero_grad()

    batch_size = x.size(0)
    input_z = create_noise(batch_size, z_size, mode_z).to(device)
    g_labels_real = torch.ones(batch_size, 1, device=device)

    g_output = gen_model(input_z)
    d_proba_fake = disc_model(g_output)
    g_loss = loss_fn(d_proba_fake, g_labels_real)

    # Gradient backprop & optimize ONLY G's parameters
    g_loss.backward()
    g_optimizer.step()

    return g_loss.item()
```

```python
# Fixed latent vectors, reused every epoch so the samples are comparable
fixed_z = create_noise(batch_size, z_size, mode_z).to(device)


def create_samples(g_model, input_z):
    """Generate images from input_z and rescale them from [-1, 1] to [0, 1]."""
    g_output = g_model(input_z)
    images = torch.reshape(g_output, (input_z.size(0), *image_size))
    return (images + 1) / 2


epoch_samples = []

all_d_losses = []
all_g_losses = []

all_d_real = []
all_d_fake = []

num_epochs = 100
torch.manual_seed(1)
for epoch in range(1, num_epochs + 1):
    d_losses, g_losses = [], []
    d_vals_real, d_vals_fake = [], []
    for i, (x, _) in enumerate(mnist_dl):
        # Alternate: one discriminator step, then one generator step
        d_loss, d_proba_real, d_proba_fake = d_train(x)
        d_losses.append(d_loss)
        g_losses.append(g_train(x))

        d_vals_real.append(d_proba_real.mean().cpu())
        d_vals_fake.append(d_proba_fake.mean().cpu())

    all_d_losses.append(torch.tensor(d_losses).mean())
    all_g_losses.append(torch.tensor(g_losses).mean())
    all_d_real.append(torch.tensor(d_vals_real).mean())
    all_d_fake.append(torch.tensor(d_vals_fake).mean())
    print(
        f"Epoch {epoch:03d} | Avg Losses >>"
        f" G/D {all_g_losses[-1]:.4f}/{all_d_losses[-1]:.4f}"
        f" [D-Real: {all_d_real[-1]:.4f} D-Fake: {all_d_fake[-1]:.4f}]"
    )
    epoch_samples.append(create_samples(gen_model, fixed_z).detach().cpu().numpy())
```

Epoch 001 | Avg Losses >> G/D 1.4675/0.5866 [D-Real: 0.8773 D-Fake: 0.3288]
Epoch 002 | Avg Losses >> G/D 0.8591/1.1609 [D-Real: 0.6106 D-Fake: 0.4532]
Epoch 003 | Avg Losses >> G/D 1.1555/1.0438 [D-Real: 0.6454 D-Fake: 0.3814]
Epoch 004 | Avg Losses >> G/D 0.9994/1.1473 [D-Real: 0.5993 D-Fake: 0.4003]
Epoch 005 | Avg Losses >> G/D 1.0814/1.1487 [D-Real: 0.6015 D-Fake: 0.3959]
Epoch 006 | Avg Losses >> G/D 1.0299/1.1625 [D-Real: 0.5965 D-Fake: 0.4001]
Epoch 007 | Avg Losses >> G/D 1.0570/1.1688 [D-Real: 0.5965 D-Fake: 0.3988]
Epoch 008 | Avg Losses >> G/D 1.0057/1.1898 [D-Real: 0.5879 D-Fake: 0.4105]
Epoch 009 | Avg Losses >> G/D 1.0240/1.1773 [D-Real: 0.5925 D-Fake: 0.4045]
Epoch 010 | Avg Losses >> G/D 1.0201/1.1883 [D-Real: 0.5897 D-Fake: 0.4086]
Epoch 011 | Avg Losses >> G/D 1.0036/1.1658 [D-Real: 0.5987 D-Fake: 0.4063]
Epoch 012 | Avg Losses >> G/D 1.0372/1.1803 [D-Real: 0.5950 D-Fake: 0.4065]
Epoch 013 | Avg Losses >> G/D 1.0200/1.1680 [D-Real: 0.5986 D-Fake: 0.4056]
Epoch 014 | 