# GAN

A GAN (Generative Adversarial Network) is made of two models: a generator and a discriminator.

The generator's role is to produce realistic images (or data) from random noise. The discriminator's role is to tell whether the image (or data) it receives is fake or real.

To produce fake data, the model is trained on a lot of real images and the generator is asked to produce more images that come from the same probability distribution. To do that, we use an approximation: the second model, the discriminator, which is a regular neural net classifier.

The two models compete with each other. By trying to fool the discriminator, the generator learns to produce very realistic data over time. The discriminator also gets better at detecting fake data over time, and this ongoing competition is what produces good results.

## Game Theory

Game theory is a branch of mathematics used to model competitive strategies, where payoffs and changes in strategy help us understand how to optimize and reach an equilibrium. The equilibrium is reached when neither player can improve their payoff by changing their own strategy while the other player's strategy stays the same.

You want to find a point that is a local **maximum** for the discriminator. The discriminator reaches it when it accurately estimates the probability that its input is real rather than fake. At equilibrium, the generator density is equal to the true data density, so the discriminator should always output $\frac{1}{2}$, because it can no longer tell real from fake.
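To make the $\frac{1}{2}$ claim concrete, here is the standard result for the optimal discriminator against a fixed generator. The symbols $D^*$, $p_{data}$ and $p_g$ are notation introduced here for illustration: $p_{data}$ is the true data density and $p_g$ is the generator density.

$$D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$$

When $p_g = p_{data}$, this gives $D^*(x) = \frac{p_{data}(x)}{2\,p_{data}(x)} = \frac{1}{2}$ for every $x$.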

Unfortunately, even though an equilibrium might exist, in practice training only gets close to it.

## Tips for training GAN

1. Use Leaky ReLU activations.
2. Use a hyperbolic tangent (tanh) output for the generator, so its output ranges from -1 to +1.
3. Use a sigmoid function to get a probability as the output of the discriminator. If you train with `nn.BCEWithLogitsLoss`, the discriminator should return raw logits, because that loss applies the sigmoid internally.
4. The Adam optimizer is a good choice for both the generator and the discriminator.
5. Use a Binary Cross Entropy (BCE) loss criterion for the discriminator, e.g. `nn.BCEWithLogitsLoss()(logits, labels * 0.9)`. Multiplying the real labels by 0.9 is label smoothing, a strategy also used to regularize ordinary classifiers. It helps the discriminator generalize better.
6. Use a BCE loss criterion for the generator too, with flipped labels: `nn.BCEWithLogitsLoss()(logits, flipped_labels)`. The generator's fake images are labeled as real.
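The loss tips above can be sketched as follows. This is a minimal sketch, not a full training loop: the helper names `real_loss` and `fake_loss` and the toy logits are assumptions made for illustration, and the discriminator is assumed to return raw logits.

```python
import torch
import torch.nn as nn

# BCEWithLogitsLoss is a module: instantiate it once, then call it.
# It applies the sigmoid internally, so it expects raw logits.
criterion = nn.BCEWithLogitsLoss()

def real_loss(d_logits, smooth=False):
    """BCE against 'real' targets (1.0, or 0.9 with label smoothing)."""
    labels = torch.ones_like(d_logits)
    if smooth:
        labels = labels * 0.9
    return criterion(d_logits, labels)

def fake_loss(d_logits):
    """BCE against 'fake' targets (0.0)."""
    labels = torch.zeros_like(d_logits)
    return criterion(d_logits, labels)

# Toy discriminator outputs (hypothetical values, shape: batch x 1).
d_real_logits = torch.tensor([[2.0], [1.5]])
d_fake_logits = torch.tensor([[-1.0], [0.5]])

# Discriminator: real images -> smoothed 1s, fake images -> 0s.
d_loss = real_loss(d_real_logits, smooth=True) + fake_loss(d_fake_logits)

# Generator: flipped labels, its fake images are scored against 'real'.
g_loss = real_loss(d_fake_logits)
```

In a real training loop, `d_loss` would be backpropagated through the discriminator only and `g_loss` through the generator only, each with its own optimizer.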

## Scaling GAN to work on large images

The main trick is to use convolutional networks. For the input of the generator we usually use a random vector Z. The problem is that a convolutional layer expects a 4D tensor: one axis for the different examples in the mini-batch, one axis for the features, and one axis each for the height and the width. So the vector Z has to be reshaped near the start of the generator.

The idea is to do the opposite of what a classic CNN does. Instead of starting from an image with a few feature channels and ending up with a small, very deep feature map, we start from the vector Z, reshape it into a small, deep feature map, and upscale it until the output is an image.
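A minimal sketch of that reshape step in PyTorch. The sizes (`z_size = 100`, `depth = 128`, a 4x4 starting map) and the layer names are assumptions chosen for illustration, not values from these notes.

```python
import torch
import torch.nn as nn

z_size = 100   # assumed length of the noise vector Z
depth = 128    # assumed number of channels after the reshape
batch_size = 16

# A fully connected layer produces enough values to fill a
# (depth, 4, 4) feature map for every example in the batch.
fc = nn.Linear(z_size, depth * 4 * 4)

z = torch.randn(batch_size, z_size)   # 2D: (batch, z_size)
x = fc(z).view(-1, depth, 4, 4)       # 4D: (batch, channels, height, width)

# One transposed convolution with stride 2 then doubles height and width.
upsample = nn.ConvTranspose2d(depth, depth // 2, kernel_size=4, stride=2, padding=1)
y = upsample(x)

print(tuple(x.shape), tuple(y.shape))
```

Stacking more stride-2 transposed convolutions keeps doubling the spatial size until the desired image resolution is reached.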

<img src="img/reshapeup.png" width=30% />

You want to use Batch Normalization on every layer except the output layer of the generator and the input layer of the discriminator.

# Deep Convolutional GAN

## Introduction

<img src="img/GANs.png" width=60% />

The main focus of a GAN model is to generate data from scratch, mostly images, but it can be anything, like sound. A DCGAN is made of two networks, a **generator** and a **discriminator**. The role of the generator is to generate data from random noise, but on its own it would be useless. This is where the discriminator comes into play. It tells whether the generated data looks real or fake by comparing it to real data. By training the model with real data, the generator can become really good at generating data that looks real.

## Discriminator

<img src='img/discriminatorfull.png' width=55% />

There are no max pooling layers in the discriminator conv net. The downsampling is done entirely by convolutional layers with stride=2.

If the convolutional kernel moves 2 pixels at a time (a stride of 2), the output is half the width and height of the input image.
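The halving can be checked with the usual output-size formula for a convolution. The kernel size of 4, padding of 1 and the 32x32 input below are assumptions chosen so that the division is exact. The helper name is hypothetical.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Output width (or height) of a convolution, assuming dilation 1."""
    return (size + 2 * padding - kernel_size) // stride + 1

# Three stride-2 convolutions applied to a 32x32 image.
down_sizes = [32]
for _ in range(3):
    down_sizes.append(conv_output_size(down_sizes[-1], kernel_size=4, stride=2, padding=1))

print(down_sizes)  # [32, 16, 8, 4]
```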

Each convolutional layer is followed by a **batch normalization** (so that its outputs have **mean = 0** and **variance = 1**) and a **leaky ReLU activation**. The normalization step helps the network train faster and reduces problems due to poor parameter initialization.

<div style="display: flex;">
    <div style="margin: auto;">
        <img src='img/convlayerkernel2.png' style="max-width: 200px;"/>
    </div>
    <div style="margin: auto;">
        <img src='img/leakyrelu.png' style="max-width: 200px;"/>
    </div>
</div>

## Generator

A generator is here to turn a random noise vector into a realistic image.
<img src="img/generatorfull.png" width=60% />

With transposed convolutional layers you go from narrow and deep inputs, like vectors, to wide and flat outputs, like images.
A transposed convolutional layer with a stride of 2 upsamples its input: the output is twice the width and height of the input.
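The doubling follows from the output-size formula for a transposed convolution. As a sketch, with an assumed kernel size of 4, padding of 1 and a 4x4 starting feature map (the helper name is hypothetical):

```python
def conv_transpose_output_size(size, kernel_size, stride=1, padding=0, output_padding=0):
    """Output width (or height) of a transposed convolution, assuming dilation 1."""
    return (size - 1) * stride - 2 * padding + kernel_size + output_padding

# Three stride-2 transposed convolutions applied to a 4x4 feature map.
up_sizes = [4]
for _ in range(3):
    up_sizes.append(conv_transpose_output_size(up_sizes[-1], kernel_size=4, stride=2, padding=1))

print(up_sizes)  # [4, 8, 16, 32]
```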


<div style="display: flex;">
    <div style="margin: auto;">
        <img src="img/transposedconvlayer.png" style="max-width: 400px;"/>
    </div>
    <div style="margin: auto;">
        <img src="img/upsamplestride2.png" style="max-width: 400px;"/>
    </div>
</div>

## What is Batch Normalization?

Instead of normalizing only the inputs of the network, we normalize the inputs to every layer _within_ the network.
We use the _mean_ and _standard deviation_ (or variance) of the values in the current batch.
The output from a previous layer is normalized by subtracting the batch mean and dividing by the batch standard deviation.


### Getting the mean and variance

We represent the batch average as $\mu_B$, which is simply the sum of all of the values $x_i$ divided by the number of values, $m$.

$$\mu_B \leftarrow \frac{1}{m}\sum_{i=1}^m x_i$$

We then need to calculate the variance, or mean squared deviation, represented as $\sigma_{B}^{2}$.

For each value $x_i$, we subtract the average value (calculated earlier as $\mu_B$), which gives us what's called the "deviation" for that value. We square the result to get the squared deviation. Sum up the results of doing that for each of the values, then divide by the number of values, again $m$, to get the average, or mean, squared deviation.

$$\sigma_{B}^{2} \leftarrow \frac{1}{m}\sum_{i=1}^m (x_i - \mu_B)^2$$

Once we have the mean and variance, we can use them to normalize the values with the following equation. For each value, it subtracts the mean and divides by the (almost) standard deviation.

$$\hat{x_i} \leftarrow \frac{x_i - \mu_B}{\sqrt{\sigma_{B}^{2} + \epsilon}}$$
 
I said "almost" standard deviation because the real standard deviation for the batch is calculated by
$$\sqrt{\sigma_{B}^{2}}$$

but the above formula adds the term $\epsilon$ before taking the square root. The epsilon can be any small, positive constant, e.g. 0.001. It is there partially to make sure we don't try to divide by zero, but it also acts to increase the variance slightly for each batch.

Why add this extra value and mimic an increase in variance? Statistically, this makes sense because even though we are normalizing one batch at a time, we are also trying to estimate the population distribution – the total training set, which itself is an estimate of the larger population of inputs your network wants to handle. The variance of a population is typically higher than the variance for any sample taken from that population, especially when you use a small sample size (a small sample is more likely to include values near the peak of a population distribution), so increasing the variance a little bit for each batch helps take that into account.

At this point, we have a normalized value, represented as
$$\hat{x_i}$$

But rather than use it directly, we multiply it by a gamma value, and then add a beta value. Both gamma and beta are learnable parameters of the network and serve to scale and shift the normalized value, respectively. Because they are learnable just like weights, they give your network some extra knobs to tweak during training to help it learn the function it is trying to approximate.

$$y_i \leftarrow \gamma \hat{x_i} + \beta$$

We now have the final batch-normalized output of our layer, which we would then pass to a non-linear activation function like sigmoid, tanh, ReLU, Leaky ReLU, etc.
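The four equations above can be followed step by step in plain Python. This is a sketch for a single feature over one batch. The function name, the toy batch and the default `eps=1e-3` are assumptions for illustration, and a real layer learns `gamma` and `beta` instead of taking them as arguments.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-3):
    """Batch-normalize a list of values for one feature."""
    m = len(values)
    mu = sum(values) / m                                        # batch mean
    var = sum((x - mu) ** 2 for x in values) / m                # batch variance
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in values]   # normalize
    return [gamma * xh + beta for xh in x_hat]                  # scale and shift

batch = [1.0, 2.0, 3.0, 4.0]
normalized = batch_norm(batch)
shifted = batch_norm(batch, gamma=2.0, beta=1.0)

out_mean = sum(normalized) / len(normalized)
out_var = sum((v - out_mean) ** 2 for v in normalized) / len(normalized)
print(out_mean, out_var)  # mean is ~0, variance is slightly below 1 because of eps
```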

### To add batch normalization layers to a PyTorch model:

* You add batch normalization layers inside the `__init__` function.
* Layers followed by batch normalization do not need a bias term. So, for linear or convolutional layers, set `bias=False` if you plan to add batch normalization on the outputs.
* You can use PyTorch's `BatchNorm1d` layer to handle the math on linear outputs, or `BatchNorm2d` for 2D outputs, like filtered images from convolutional layers.
* You add the batch normalization layer before calling the activation function, so it always goes layer > batch norm > activation.

Finally, when you test your model, set it to `.eval()` mode, which ensures that the batch normalization layers use the population mean and variance rather than the batch mean and variance (as they do during training).
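A minimal sketch of these points in one module. The class name `ConvBlock`, the channel counts, the kernel settings and the leaky ReLU slope of 0.2 are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBlock(nn.Module):
    """conv > batch norm > leaky ReLU, with layers declared in __init__."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        # bias=False: the batch norm layer that follows has its own shift (beta).
        self.conv = nn.Conv2d(in_channels, out_channels,
                              kernel_size=4, stride=2, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)

    def forward(self, x):
        # Batch norm is applied before the activation function.
        return F.leaky_relu(self.bn(self.conv(x)), negative_slope=0.2)

block = ConvBlock(3, 16)
images = torch.randn(8, 3, 32, 32)   # random stand-in for a batch of images

block.train()                        # uses the batch mean and variance
train_out = block(images)

block.eval()                         # uses the running (population) estimates
with torch.no_grad():
    eval_out = block(images)
```

Because the running estimates differ from the statistics of any single batch, `train_out` and `eval_out` are generally not identical for the same input.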

**Benefits of Batch Normalization**
1. Networks train faster
2. Allows higher learning rates
3. Makes weights easier to initialize
4. Makes more activation functions viable
5. Simplifies the creation of deeper networks
6. Provides a bit of regularization
7. May give better results overall

# Cycle GAN and Pix2Pix

CycleGAN and Pix2Pix do image-to-image translation: they can generate a realistic picture of a cat from a hand drawing, or transform a real-time video of a horse into a zebra. Pretty neat, right?

## Applications

The idea is to provide an image as input, apply a transformation, and get an image as output.
Other image-to-image applications include semantic segmentation (labeling all the different things in an image) and edge detection.

<img src="img/imgtransfo.png" width=80% />

You can also do automatic colorization, or make an image sharper.

<img src="img/imgtransfo2.png" width=80% />

## Pix2Pix
For image-to-image translation, Pix2Pix takes an image as input, then uses an encoder-decoder to produce the desired new image as output.

<img src="img/pix2pix.png" width=60% />

Quick recap on encoder / decoder : 

The encoder tries to compress and encode an image to a smaller feature representation :

<img src="img/encoder.png" width=60% />

The decoder looks at this feature-level representation and uses it to generate a new, realistic output image :

<img src="img/decoder.png" width=60% />

We then link the output of this encoder / decoder to a discriminator which will say if this image is real or fake as usual.

The discriminator will look at pairs of images. It labels a pair of images as real or fake. This way the network learns how to create a mapping between an input image (a sketch for example) and a real target image (a real image of the sketch).

<img src="img/pix2pixapplication.png" width=60% />

The generator wants the error of the discriminator to be large: it wants the pair made of the input and the generated image to be classified as real. As the discriminator also learns to classify better, the quality of the generated images keeps improving.

The discriminator acts as the loss function. Because its decision is conditioned on both the input image and the output image, this setup is called a Conditional GAN.
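One way to show a pair of images to a discriminator is to concatenate them along the channel axis, so that the network always sees the input image next to the real or generated output. The sketch below assumes this pairing scheme. The tensor names, the 1-channel sketch and the 64x64 size are made up for illustration, and random tensors stand in for real data and for the generator output.

```python
import torch

batch_size = 4
sketch = torch.randn(batch_size, 1, 64, 64)      # hypothetical input images (1 channel)
real_photo = torch.randn(batch_size, 3, 64, 64)  # hypothetical matching target images
generated = torch.randn(batch_size, 3, 64, 64)   # stand-in for the generator output

# dim=1 is the channel axis in the (batch, channels, height, width) layout.
real_pair = torch.cat([sketch, real_photo], dim=1)  # should be labeled real
fake_pair = torch.cat([sketch, generated], dim=1)   # should be labeled fake

print(tuple(real_pair.shape), tuple(fake_pair.shape))
```

With this layout the first convolution of the discriminator takes 1 + 3 = 4 input channels.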

## Cycle GAN

Paired, labeled data is not always available. It is hard to ask a zebra to strike the same pose as a horse in the same environment, for example. How can we learn from unpaired data then? We want to find a mapping G that does its best to translate an image from domain X into an image in domain Y.

The risk when learning from unpaired data with an encoder-decoder is **mode collapse**, where many different inputs get mapped to the same output.

After mapping from X to Y, we apply the inverse mapping from Y to X to the generated image. We can then compare the original image with the cycled image and measure the difference between the two. The goal is to get the original image back, with no difference, after applying a mapping and then its inverse.


<img src="img/cycleganexample.png" width=50% />

In this example if we translate from French to English, and then from English back to French we should have the same original sentence.

$$G_{YtoX}(G_{XtoY}(x)) \approx x$$

### Cycle Consistency Loss

<img src="img/consistencyloss.png" width=60% />

The complete loss in a CycleGAN is $$L_Y + L_X + \lambda L_{cyc}$$

It is the sum of the two adversarial losses and the cycle consistency loss. $\lambda$ is a weight that controls the importance of the cycle consistency term relative to the adversarial terms.
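A toy sketch of the cycle consistency term and the complete loss in plain Python. Images are represented as short lists of numbers, the two "generators" are hand-written functions rather than trained networks, the difference is measured with a mean absolute (L1) distance, and the adversarial loss values and `lambda_weight=10.0` are assumed numbers for illustration.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally sized 'images'."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)

def cycle_consistency_loss(x, y, g_x_to_y, g_y_to_x):
    """Compare x with G_YtoX(G_XtoY(x)) and y with G_XtoY(G_YtoX(y))."""
    x_cycled = g_y_to_x(g_x_to_y(x))
    y_cycled = g_x_to_y(g_y_to_x(y))
    return l1_distance(x, x_cycled) + l1_distance(y, y_cycled)

def total_loss(adv_loss_y, adv_loss_x, cyc_loss, lambda_weight=10.0):
    """L_Y + L_X + lambda * L_cyc."""
    return adv_loss_y + adv_loss_x + lambda_weight * cyc_loss

# Hypothetical mappings: g_y_to_x is an exact inverse, g_y_to_x_bad is not.
g_x_to_y = lambda img: [2 * p + 1 for p in img]
g_y_to_x = lambda img: [(p - 1) / 2 for p in img]
g_y_to_x_bad = lambda img: [p / 2 for p in img]

x = [0.0, 0.5, 1.0]
y = [1.0, 2.0, 3.0]

perfect_cycle = cycle_consistency_loss(x, y, g_x_to_y, g_y_to_x)
broken_cycle = cycle_consistency_loss(x, y, g_x_to_y, g_y_to_x_bad)
loss = total_loss(0.7, 0.6, broken_cycle)

print(perfect_cycle, broken_cycle)  # 0.0 1.5
```

The exact inverse gives a cycle loss of zero, while the imperfect inverse is penalized, and the penalty is scaled by `lambda_weight` in the complete loss.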

A CycleGAN will produce only one mapping for a given input image. Research is exploring ways to produce multiple styles from one input with networks like Paired CycleGAN, cross-domain models or StarGAN.

# GANs applications

Medium article about a list of cool GANs applications\
https://medium.com/@jonathan_hui/gan-some-cool-applications-of-gans-4c9ecca35900

Tulips generator\
https://www.fastcompany.com/90237233/this-ai-dreams-in-tulips

Semi-supervised learning video explanation made by Ian Goodfellow\
https://www.youtube.com/watch?v=_LRpHPxZaX0