# The core idea

The core idea behind **generative adversarial networks** is to use discriminative learning in a minimax zero-sum game to guide unsupervised learning towards high-quality probabilistic generative models of the data, without any explicit prior constraint on the distribution.

To get closer to the meaning of this dense description, let's first elaborate on the "traditional" metaphor for GANs.

## The metaphor

<img src="https://miro.medium.com/max/1437/1*-gFsbymY9oJUQJ-A3GTfeg.png" width=45%>

Imagine a "minimax" zero-sum game between forgers and detectives. The forgers are in the business of creating authentic-looking fake paintings in the style of the great masters, while the detectives try to ensure that no fake images get into circulation.

One interesting extension of the metaphor: the best forged image is probably not a copy of a classic piece by a grand master. Hardly anyone would believe that I own the Mona Lisa and want to sell it. Maybe a __new__ image, adhering exactly to the style of the old masters, is the way to go!

So __the best generated image is one for which I cannot tell whether it really comes from the original distribution or not__.

To reformulate this, we are interested in __new__, that is, basically __random__ images, where all of the images are plausible. The forger learns the patterns and stylistic elements of the "masters": the composition, the individual figures. So if I tell him that I would like a new Leonardo with two people in the background and a smiling lady in the front, it can be generated. Basically, **all "knowledge" is in the generator, the input parameters** (my wish) can be arbitrary, **a.k.a. random**.

### Detour: Normalizing flows

**The learning procedure basically takes the complex distribution of observed pictures and projects it back onto a mixture of simple distributions that serve as latent factors.** (Oftentimes these are Gaussians.)

**(Properties of these latent factors, such as sparsity and our control over them, are an important later concern.)**

This concept is closely connected to the idea of **normalizing flows**:

<img src="https://siboehm.com/assets/img/nfn/normalizing_flow_network.png" width=55%>

"Normalizing Flows (NFs) (Rezende & Mohamed, 2015) learn an _invertible_ mapping $f: X \rightarrow Z$, where $X$ is our data distribution and $Z$ is a chosen latent-distribution."

[source](https://gebob19.github.io/normalizing-flows/)

More information on normalizing flows can be found [here](https://akosiorek.github.io/ml/2018/04/03/norm_flows.html) and [here](https://gebob19.github.io/normalizing-flows/).

### Detour: (inverse) transform method

Another technique for generating complex random variables from simple, well-understood ones is also enlightening in the case of GANs: the (inverse) transform method.

Simply put, a "complicated" random variable can be generated by starting from a really simple one, e.g. a uniform distribution on $[0,1]$, and then **applying a function to the simple variable**, basically to "restructure" the distribution.

<img src="https://miro.medium.com/max/7230/1*Xoz06MKgbw7CZ8aNbMCt6A.jpeg" width=55%>

In this sense, the generator part of the GAN can be considered to be such a function mapping a simple random variable to the complex random variable of real objects. 
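
To make the transform method concrete, here is a minimal sketch in NumPy. The exponential distribution is chosen purely for illustration (it is not part of the original text), because its inverse CDF has a simple closed form; the function name `sample_exponential` is hypothetical.

```python
import numpy as np

def sample_exponential(rate, size, rng):
    """Draw exponential samples by pushing uniform noise through the inverse CDF."""
    u = rng.uniform(0.0, 1.0, size=size)   # the "simple" variable: U ~ Uniform[0, 1)
    # CDF: F(x) = 1 - exp(-rate * x), hence F^{-1}(u) = -ln(1 - u) / rate
    return -np.log1p(-u) / rate

rng = np.random.default_rng(0)
samples = sample_exponential(rate=2.0, size=100_000, rng=rng)
print(samples.mean())  # should be close to 1 / rate = 0.5
```

In a GAN the closed-form inverse CDF is replaced by a neural network, since for images nobody can write that function down explicitly.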

A nice elaboration of the transform method can be found [here](https://towardsdatascience.com/understanding-generative-adversarial-networks-gans-cd6e4651a29), which we quote in detail for motivation:

"Suppose that we are interested in generating black and white square images of dogs with a size of n by n pixels. We can reshape each data as a N=nxn dimensional vector (by stacking columns on top of each others) such that an image of dog can then be represented by a vector. However, it doesn’t mean that all vectors represent a dog once shaped back to a square! So, we can say that the N dimensional vectors that effectively give something that look like a dog are distributed according to a very specific probability distribution over the entire N dimensional vector space (some points of that space are very likely to represent dogs whereas it is highly unlikely for some others). In the same spirit, there exists, over this N dimensional vector space, probability distributions for images of cats, birds and so on.

Then, the problem of generating a new image of dog is equivalent to the problem of generating a new vector following the “dog probability distribution” over the N dimensional vector space. So we are, in fact, facing a problem of generating a random variable with respect to a specific probability distribution.

At this point, we can mention two important things. First the “dog probability distribution” we mentioned is a very complex distribution over a very large space. Second, even if we can assume the existence of such underlying distribution (there actually exists images that looks like dog and others that doesn’t) we obviously don’t know how to express explicitly this distribution. Both previous points make the process of generating random variables from this distribution pretty difficult. Let’s then try to tackle these two problems in the following.
… so let’s use transform method with a neural network as function!

Our first problem when trying to generate our new image of dog is that the “dog probability distribution” over the N dimensional vector space is a very complex one and we don’t know how to directly generate complex random variables. However, as we know pretty well how to generate N uncorrelated uniform random variables, we could make use of the transform method. To do so, we need to express our N dimensional random variable as the result of a very complex function applied to a simple N dimensional random variable!"

## The model

<img src="https://miro.medium.com/max/2426/1*XKanAdkjQbg1eDDMF2-4ow.png" width=55%>

### Formal definition, the loss function


From the formulation above we can already anticipate that we will see two deep neural models locked in a competition, that is, in a minimax setting, where an increase in one player's (the discriminator's) loss is the direct cause of a decrease in the other's. Or, to put it more formally, the classical loss function is as follows:

<img src="http://drive.google.com/uc?export=view&id=1PE_XEredzvOBeLuFXyDA7fDpph8A8Gru" width=35%>

"In this function:

- $D(x)$ is the discriminator's estimate of the probability that real data instance $x$ is real.
- $E_x$ is the expected value over all real data instances.
- $G(z)$ is the generator's output when given noise $z$.
- $D(G(z))$ is the discriminator's estimate of the probability that a fake instance is real.
- $E_z$ is the expected value over all random inputs to the generator (in effect, the expected value over all generated fake instances $G(z))$.

The formula derives from the cross-entropy between the real and generated distributions.

The generator can't directly affect the $log(D(x))$ term in the function, so, for the generator, minimizing the loss is equivalent to minimizing $log(1 - D(G(z)))$."

The original GAN paper from 2014 can be found [here](https://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf); the source of the description above is [here](https://developers.google.com/machine-learning/gan/loss).
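
Written out, the objective that the quoted symbols refer to is usually stated as the following minimax value function. This is a sketch of the standard form from the original paper; $p_{data}$ and $p_z$ (the data and noise distributions) are notation assumed here, not defined elsewhere in this text:

$$
\min_G \max_D V(D, G) = E_{x \sim p_{data}}\left[\log D(x)\right] + E_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]
$$

The discriminator $D$ tries to maximize both terms, while the generator $G$ can only influence the second one.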

One of the main avenues for innovation in GANs will be the usage of innovative loss functions to counteract negative effects that arise during GAN training. More on that later.


### Some background

Inspiration for the loss function of the GAN framework came from the idea of __[Noise contrastive estimation](http://proceedings.mlr.press/v9/gutmann10a/gutmann10a.pdf)__.

A good grasp of noise contrastive estimation can be gained from [here](https://blog.zakjost.com/post/nce-intro/), original paper [here](http://proceedings.mlr.press/v9/gutmann10a/gutmann10a.pdf).

The motivation is easiest to see in the huge burden of softmax layers in word2vec training (thus, mind you, representation learning). With a decent vocabulary size, the computation of the full output probability distribution over all words is infeasibly slow (given the sum over exponentials), which is why **negative sampling** is used there. This way, the calculation of the full output probability distribution can be avoided; instead, the model only has to use **discriminative learning**, that is, to discriminate between the "true points" and some "negative samples". Negative sampling is a simplified variant of the more general idea behind noise contrastive estimation: a **classification task between noise and the data itself**.
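
As a rough illustration of how discriminating against noise samples replaces the full softmax, here is a minimal NumPy sketch of a negative-sampling style loss for a single word pair. The vectors are random stand-ins and the function name is hypothetical; real word2vec implementations differ in many details.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(center, context, negatives):
    """Binary-classification loss for one (center, context) pair.

    center:    (d,)   vector of the center word
    context:   (d,)   vector of the true context word (the "true point")
    negatives: (k, d) vectors of k sampled noise words
    """
    pos = np.log(sigmoid(context @ center))               # true pair, target 1
    neg = np.log(sigmoid(-(negatives @ center))).sum()    # noise pairs, target 0
    return -(pos + neg)

rng = np.random.default_rng(0)
d, k = 8, 5
center = rng.normal(size=d)
negatives = rng.normal(size=(k, d))
loss_aligned = negative_sampling_loss(center, center, negatives)    # context agrees with center
loss_opposed = negative_sampling_loss(center, -center, negatives)   # context points the other way
```

The cost per training pair depends only on the number of negatives `k`, not on the vocabulary size.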

The quality of the representations learned through noise contrastive estimation is remarkably good, so this approach was considered a strong choice for such tasks. Hence its influence on Goodfellow and consequently on GANs.

This approach is in a sense the opposite of the normalizing flows case: we learn to discriminate between the data and some noise, so if we are able to classify correctly over all possible datapoints and a huge amount of noise, we have learned the properties of the data.

# Steps of training a basic GAN model

The basic steps of training defined by Goodfellow in his original paper are:

<img src="http://drive.google.com/uc?export=view&id=1JU7NhsgLqgjoOt8b_3ttdHsWiPa8iSSO" width=55%>

Or to put it visually:

(Blue modules are "frozen", not being updated.)

## GAN discriminator training:

<img src="http://drive.google.com/uc?export=view&id=13d7YsCFl_z0_VoKdKw1eJOHQ0L05Vmji" width=55%>

Take a "half batch" worth of real datapoints, freeze the generator, use it to generate another "half batch" worth of images based on random vectors as inputs. 

Then feed the combined batch to the discriminator, calculate a **binary classification loss** on it, and update the discriminator's weights with the resulting gradients. The generator stays frozen.
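
The discriminator step can be sketched on a deliberately tiny 1-D toy problem with hand-written gradients. Everything here is an assumption made for illustration: the linear "generator" `a*z + b`, the logistic "discriminator" `sigmoid(w*x + c)`, the real data drawn from a Gaussian around 4, and the learning rate.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
gen = {"a": 1.0, "b": 0.0}    # toy generator G(z) = a*z + b (frozen in this step)
disc = {"w": 0.1, "c": 0.0}   # toy discriminator D(x) = sigmoid(w*x + c)

def discriminator_step(disc, gen, real_half, lr=0.05):
    half = len(real_half)
    z = rng.normal(size=half)
    fake_half = gen["a"] * z + gen["b"]                    # generator is used, never updated
    x = np.concatenate([real_half, fake_half])
    y = np.concatenate([np.ones(half), np.zeros(half)])    # 1 = real, 0 = fake
    p = np.clip(sigmoid(disc["w"] * x + disc["c"]), 1e-7, 1 - 1e-7)
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))  # binary cross-entropy
    # Gradient of the mean BCE w.r.t. each logit is (p - y) / N; chain rule gives the rest.
    grad_logit = (p - y) / len(x)
    disc["w"] -= lr * np.sum(grad_logit * x)
    disc["c"] -= lr * np.sum(grad_logit)
    return loss

losses = [discriminator_step(disc, gen, rng.normal(4.0, 1.0, size=32)) for _ in range(200)]
```

In a deep learning framework the "freezing" is done by simply not passing the generator's parameters to the optimizer used in this step.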


## GAN generator training:

<img src="http://drive.google.com/uc?export=view&id=1aU3TpQLgJ3fY7dLte-af6yTw-PkJlWQU" width=55%>

Now reverse the procedure.

Freeze the discriminator, then generate a complete batch worth of noise vectors, feed it as input to the generator.

Then take the generator's output and feed it to the frozen discriminator with the **"real" label** (we know that all images are fake at this point, but we "fool" the discriminator by giving it the label "true", effectively inverting the loss). Calculate the poor discriminator's binary loss and backpropagate it to the generator's weights, leaving the discriminator unmodified.
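
A matching sketch of the generator step, again on a hypothetical 1-D toy model with hand-written gradients (the linear generator, the logistic discriminator and all constants are assumptions for illustration). Note that training against the "real" label means minimizing $-\log D(G(z))$ instead of $\log(1 - D(G(z)))$.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
gen = {"a": 1.0, "b": 0.0}     # toy generator G(z) = a*z + b
disc = {"w": 1.0, "c": -2.0}   # toy discriminator D(x) = sigmoid(w*x + c), frozen here

def generator_step(gen, disc, batch_size=64, lr=0.05):
    z = rng.normal(size=batch_size)              # a complete batch of noise
    fake = gen["a"] * z + gen["b"]
    p = np.clip(sigmoid(disc["w"] * fake + disc["c"]), 1e-7, 1 - 1e-7)
    loss = -np.mean(np.log(p))                   # BCE against the "real" label (y = 1)
    # Backpropagate through the frozen discriminator into the generator:
    # dL/dlogit = (p - 1) / N, dlogit/dfake = w, dfake/da = z, dfake/db = 1
    grad_fake = (p - 1.0) / batch_size * disc["w"]
    gen["a"] -= lr * np.sum(grad_fake * z)
    gen["b"] -= lr * np.sum(grad_fake)
    return loss

disc_before = dict(disc)
losses = [generator_step(gen, disc) for _ in range(200)]
```

The gradients flow through the discriminator, but only the generator's parameters are changed.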

"Rinse and repeat" and you have a GAN trained!

(Naturally, the generator and discriminator can be of any architecture, thus for images, unsurprisingly convolutional solutions emerged under the name of [DCGAN, or deep convolutional GAN](https://arxiv.org/abs/1511.06434).)


# How to keep the training stable?

## Avoid imbalance!

Avoid "one-sidedness" in the distributions and in the generator! Remember, we would not like to make its task more difficult than it already is, so practically:

- Normalize data between $[-1,1]$ instead of $[0,1]$
- Use `tanh` for the generator output
- Sample noise from a nice Gaussian distribution
- In convolutional GANs use BatchNorm to stabilize distributions
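
The first three points can be sketched in a few lines of NumPy. The image batch is random stand-in data, the helper names are hypothetical, and the "generator" is just an untrained linear map followed by `tanh`.

```python
import numpy as np

def to_tanh_range(images_uint8):
    """Map pixel values from [0, 255] to [-1, 1], matching a tanh output layer."""
    return images_uint8.astype(np.float32) / 127.5 - 1.0

def from_tanh_range(images):
    """Map values in [-1, 1] back to displayable [0, 255] pixels."""
    return np.clip((images + 1.0) * 127.5, 0, 255).round().astype(np.uint8)

rng = np.random.default_rng(0)
batch = rng.integers(0, 256, size=(16, 28, 28), dtype=np.uint8)  # stand-in for real images
scaled = to_tanh_range(batch)

noise = rng.standard_normal(size=(16, 100))            # zero-mean Gaussian latent vectors
weights = rng.normal(0.0, 0.02, size=(100, 28 * 28))   # untrained toy "generator" weights
fake = np.tanh(noise @ weights)                        # outputs live in [-1, 1] like the data
```

With this scaling, real and generated images share the same symmetric value range, so the discriminator cannot separate them by range alone.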

## Avoid sparsity!

It is not healthy, if there is sparsity introduced during the generation process, so counterintuitively, some methods that work well in normal discriminative training are not the best options here. 

Practically:
- Avoid ReLUs, use a more symmetric variant like LeakyReLU
- Don't use pooling, prefer strided convolution
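
A quick numeric sketch of the sparsity argument for the activation function: on random pre-activations, ReLU zeroes out roughly half of the values, while LeakyReLU keeps a small signal everywhere. The slope of 0.2 is a commonly used value and an assumption here.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

rng = np.random.default_rng(0)
pre_activations = rng.standard_normal(10_000)
relu_zero_fraction = np.mean(relu(pre_activations) == 0.0)         # about half are exactly zero
leaky_zero_fraction = np.mean(leaky_relu(pre_activations) == 0.0)  # essentially none
```

Wherever the output is exactly zero, the gradient is zero as well, so those units pass no learning signal back towards the generator.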

## Other practicalities

- Adaptive gradient methods like Adam (or its newer "siblings") can help
- Use "one-sided" label smoothing for the discriminator (replacing the target 1 for real samples with something like 0.9). (For a more in-depth discussion of why label smoothing works in general, see [here](https://arxiv.org/pdf/1906.02629.pdf), and for the reason for one-sided label smoothing in GANs see [Goodfellow's tutorial](https://arxiv.org/abs/1701.00160).)
- Use Gaussian initialization (the original DCGAN paper suggests mean 0 and std. dev. of 0.02).
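
The last two points are easy to sketch in NumPy. The batch size and the shape of the convolution kernel are hypothetical; only the values 0.9 and 0.02 come from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)
half = 32

# One-sided smoothing: only the "real" target is softened, the "fake" target stays 0.
real_targets = np.full(half, 0.9)
fake_targets = np.zeros(half)
targets = np.concatenate([real_targets, fake_targets])

# Gaussian initialization with mean 0 and standard deviation 0.02.
conv_kernel = rng.normal(loc=0.0, scale=0.02, size=(64, 3, 4, 4))  # hypothetical layer shape
```

These `targets` would simply replace the hard 0/1 labels in the discriminator's binary cross-entropy.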

# Control over the generation process - again

Since we assume that the representation space a GAN learns is complex and interesting, the question naturally arises whether we can discern any kind of **regularities in the latent space**, and thus try to **exercise control** over the generation process.

At this point, we can recall that in the learned representation of word2vec, Mikolov and colleagues found surprisingly systematic relationships. (This is what made the model famous in popular culture.)

<img src="https://www.distilled.net/uploads/word2vec_chart.jpg" width=65%>

What if we try to do the same in GAN space, that is to **detect "directions"** influencing certain properties, and then do **"arithmetic" style operations** on them?

Well, this is exactly what the authors of the original [DCGAN paper](https://arxiv.org/abs/1511.06434) did:

<img src="http://drive.google.com/uc?export=view&id=1acEau_R9um4pp0Ogq2YOXg0ZMMkuhWIj" width=85%>
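
Mechanically, such latent "arithmetic" is nothing more than adding and subtracting noise vectors before feeding them to the generator. A minimal sketch, where the latent codes are random stand-ins for codes whose generated images were labelled by hand (a hypothetical setup):

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim = 100

# Stand-ins for a few latent codes per concept (hypothetical, hand-picked in practice).
z_smiling_woman = rng.standard_normal((3, latent_dim))
z_neutral_woman = rng.standard_normal((3, latent_dim))
z_neutral_man = rng.standard_normal((3, latent_dim))

# Averaging a few codes per concept gives a more stable direction than a single sample.
smile_direction = z_smiling_woman.mean(axis=0) - z_neutral_woman.mean(axis=0)
z_smiling_man = z_neutral_man.mean(axis=0) + smile_direction

# Moving along the direction in small steps gives a gradual edit of the same sample.
steps = np.linspace(0.0, 1.0, 5)[:, None]
interpolated = z_neutral_man.mean(axis=0) + steps * smile_direction  # shape (5, latent_dim)
```

Each row of `interpolated` would then be passed through the trained generator to produce an image.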

After this remarkable demonstration, the question arises:

What if we could actually control these factors, e.g. the class, as conditions?

Enter conditional GANs!

## ACGAN

If one has DC, one has to have AC also, so soon after the introduction of DCGAN, a **class conditional variant**, [ACGAN](https://arxiv.org/abs/1610.09585), emerged, very much in the spirit of conditional autoencoders, where we **use the additionally available class labels for conditioning the generating process**.

This requires some modifications to the architecture, in inputs, outputs and losses alike.

### AC-GAN discriminator training:

<img src="http://drive.google.com/uc?export=view&id=1oA1eL4IiMGCPt4Ohs6xkmuwofwFguvtF" width=55%>

The main change from the discriminator's perspective is the **additional class input** for the training, as well as the switch to a **multi-head output**, whereby the discriminator has to produce, beyond the previous binary classification, also a **probability distribution over the classes**, with the added trickery that the **classes have been extended with an explicit class for "fake"**.

This in turn necessitates two losses: a binary crossentropy for the real/fake decision and a categorical crossentropy for the classification task. Though elaborate schemes can be conjured up to weight these losses, as a baseline practice the **simple mean or sum of the losses** is sufficient for training.
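
The combined loss can be sketched in NumPy on a toy batch. The probabilities are made-up numbers standing in for the two output heads, the function names are hypothetical, and the extra "fake" class is placed at the last index by assumption.

```python
import numpy as np

def binary_crossentropy(p_real, is_real):
    p = np.clip(p_real, 1e-7, 1 - 1e-7)
    return -np.mean(is_real * np.log(p) + (1 - is_real) * np.log(1 - p))

def categorical_crossentropy(class_probs, class_ids):
    p = np.clip(class_probs[np.arange(len(class_ids)), class_ids], 1e-7, 1.0)
    return -np.mean(np.log(p))

def acgan_discriminator_loss(p_real, class_probs, is_real, class_ids):
    # Baseline: plain sum of the real/fake loss and the classification loss.
    return binary_crossentropy(p_real, is_real) + categorical_crossentropy(class_probs, class_ids)

# Toy batch: two real images of classes 0 and 1, two generated images with the "fake" class 2.
p_real = np.array([0.9, 0.8, 0.2, 0.1])
class_probs = np.array([[0.8, 0.1, 0.1],
                        [0.1, 0.8, 0.1],
                        [0.1, 0.1, 0.8],
                        [0.2, 0.1, 0.7]])
is_real = np.array([1.0, 1.0, 0.0, 0.0])
class_ids = np.array([0, 1, 2, 2])
loss = acgan_discriminator_loss(p_real, class_probs, is_real, class_ids)
```

Weighting the two terms differently would only require multiplying one of them by a constant before summing.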


### AC-GAN generator training:

<img src="http://drive.google.com/uc?export=view&id=1LexlaEEBwqMT9t60IDz1bIea7-tRJIcH" width=55%>

From the generator's perspective, the addition of the class input is the main challenge. Since we would like to treat the class labels as something coming from the underlying distribution, we have to **mix the input noise and the input labels** somehow. The most natural way to do this is to **learn an embedding from the class input into the vector space of the noise** (projecting from a single class integer to a real vector). Again, many considerations and variations can apply, but **multiplication of the noise vector with the embedded class vector** is a decent baseline.
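
The embed-and-multiply baseline can be sketched as follows. The embedding table is only randomly initialized here (in a real model it is trained together with the generator), and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, latent_dim, batch_size = 10, 100, 4

# Embedding table: one vector per class, living in the same space as the noise.
class_embedding = rng.normal(0.0, 1.0, size=(num_classes, latent_dim))

noise = rng.standard_normal((batch_size, latent_dim))
labels = np.array([3, 3, 7, 0])          # the requested classes
embedded = class_embedding[labels]       # lookup: class integer -> real vector
generator_input = noise * embedded       # element-wise product mixes noise and class
```

The generator then receives `generator_input` instead of the raw noise.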

The learned and systematic "distortion" of the noise with the class labels explicitly simulates some kind of **semantic control over the generator**. This kind of conditioning will go on to have a _huge_ career, with extremely powerful variants.

### Historic remark: cGAN

<img src="https://paper-attachments.dropbox.com/s_D85DDA7D01FD04AEE96825C4B90F1126BC7D080CA4F2947D4A5DEC07FAD6122C_1559840765144_Screenshot+2019-06-06+at+10.35.29+PM.png" width=55%>

Historically, the conditioning on class labels was carried out first as a concatenation of the class vector with the noise, and no additional discrimination heads. This model was called cGAN, or conditional GAN, but ACGAN came along, and offered better performance...
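
For contrast with the embedding approach, concatenation-style conditioning is a one-liner. A sketch with hypothetical sizes, using a one-hot class vector:

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, latent_dim = 10, 100
labels = np.array([3, 7])

one_hot = np.eye(num_classes)[labels]                        # (2, 10) class vectors
noise = rng.standard_normal((len(labels), latent_dim))       # (2, 100) noise vectors
generator_input = np.concatenate([noise, one_hot], axis=1)   # (2, 110)
```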

(Remark: **class conditioning helps the networks to learn interesting objects, not just backgrounds.** If only images are used and the background occupies much of the image, it is realistic enough to just produce grass and fool the discriminator... see some discussion [here](https://youtu.be/Z6rxFNMGdn0?t=2666))

# Advantages

To quote Goodfellow:

"Adversarial models may also gain some statistical advantage from the generator network not being updated directly with data examples, but only with gradients flowing through the discriminator. This means that components of the input are not copied directly into the generator’s parameters. Another advantage of adversarial networks is that **they can represent very sharp, even degenerate distributions**, while methods based on Markov chains require that the distribution be somewhat blurry in order for the chains to be able to mix between modes."

When he wrote this, he probably did not yet grasp how true this is, and how powerful GANs' ability to model these very deformed distributions is!

## Detour: Why do GANs generalize?

It is an interesting and not too deeply studied question why GANs, and especially the generator, produce __new__ images instead of just memorizing the exact training set and giving it back as output for any noise input. Though we don't have a perfect answer, there are [theoretical works](https://colinraffel.com/publications/idlt2018theoretical.pdf) showing that this kind of memorization is statistically hard, so it is far less likely than producing novel images.

# Problems with GANs

## Non-convergence

To quote Goodfellow's NIPS 2016 GAN workshop ([transcript on ArXiv](https://arxiv.org/pdf/1701.00160.pdf)):

"The largest problem facing GANs that researchers should try to resolve is the issue of non-convergence. Most deep models are trained using an optimization algorithm that seeks out a low value of a cost function. While many problems can interfere with optimization, optimization algorithms usually make reliable downhill progress. GANs require finding the equilibrium to a game with two players. Even if each player successfully moves downhill on that player’s update, the same update might move the other player uphill. Sometimes the two players eventually reach an equilibrium, but in other scenarios they repeatedly undo each others’ progress without arriving anywhere useful. This is a general problem with games not unique to GANs, so a general solution to this problem would have wide-reaching applications. ... 
Simultaneous gradient descent converges for some games but not all of them. In the case of the minimax GAN game (section 3.2.2), Goodfellow et al. (2014b) showed that simultaneous gradient descent converges if the updates are made in function space. In practice, the updates are made in parameter space, so the convexity properties that the proof relies on do not apply. Currently, there is neither a theoretical argument that GAN games should converge when the updates are made to parameters of deep neural networks, nor a theoretical argument that the games should not converge.  
In practice, GANs often seem to oscillate, somewhat like what happens in the toy example in section 8.2, meaning that they progress from generating one kind of sample to generating another kind of sample without eventually reaching an equilibrium. Probably the most common form of harmful non-convergence encountered in the GAN game is mode collapse."

So some very careful parameter tuning is in order to avoid oscillation and non-convergence!
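
The "undoing each other's progress" effect can be seen in a very small sketch. The bilinear game $V(x, y) = xy$ is chosen here for illustration (an assumption, as are the starting point and the learning rate): one player minimizes over $x$, the other maximizes over $y$, and the equilibrium is at $(0, 0)$.

```python
import numpy as np

def simultaneous_gradient_steps(x, y, lr=0.1, steps=100):
    """Player x minimizes V(x, y) = x * y, player y maximizes it."""
    radii = []
    for _ in range(steps):
        grad_x, grad_y = y, x                      # dV/dx = y, dV/dy = x
        x, y = x - lr * grad_x, y + lr * grad_y    # both players update at once
        radii.append(np.hypot(x, y))               # distance from the equilibrium (0, 0)
    return np.array(radii)

radii = simultaneous_gradient_steps(x=1.0, y=1.0)
```

Each step multiplies the distance from the equilibrium by $\sqrt{1 + lr^2}$, so the players circle around the solution and slowly spiral away from it instead of converging.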

<img src="https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2019/07/Line-Plots-of-Loss-and-Accuracy-for-a-Generative-Adversarial-Network-with-Mode-Collapse.png" width=55%>

### Some forms of non-convergence:

A good tutorial on [GAN failure modes](https://machinelearningmastery.com/practical-guide-to-gan-failure-modes/) has some interesting remarks:

"A stable GAN will have a discriminator loss around 0.5, typically between 0.5 and maybe as high as 0.7 or 0.8. The generator loss is typically higher and may hover around 1.0, 1.5, 2.0, or even higher.

The accuracy of the discriminator on both real and generated (fake) images will not be 50%, but should typically hover around 70% to 80%.

For both the discriminator and generator, behaviors are likely to start off erratic and move around a lot before the model converges to a stable equilibrium."

<img src="https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2019/07/Line-Plots-of-Loss-and-Accuracy-for-a-Stable-Generative-Adversarial-Network.png" width=55%>

"We can see that all three losses are somewhat erratic early in the run before stabilizing around epoch 100 to epoch 300. Losses remain stable after that, although the variance increases.

This is an example of the normal or expected loss during training. Namely, discriminator loss for real and fake samples is about the same at or around 0.5, and loss for the generator is slightly higher between 0.5 and 2.0. If the generator model is capable of generating plausible images, then the expectation is that those images would have been generated between epochs 100 and 300 and likely between 300 and 450 as well."

Whereas in mode collapse we see:

<img src="https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2019/07/Line-Plots-of-Loss-and-Accuracy-for-a-Generative-Adversarial-Network-with-Mode-Collapse.png" width=55%>

And in convergence failure:
<img src="https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2019/07/Line-Plots-of-Loss-and-Accuracy-for-a-Generative-Adversarial-Network-with-a-Convergence-Failure.png" width=55%>

or:
<img src="https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2019/07/Sample-of-100-Generated-Images-of-a-Handwritten-Number-8-at-Epoch-450-from-a-GAN-that-has-a-Convergence-Failure-via-Aggressive-Optimization.png" width=55%>

"We can review the properties of a convergence failure as follows:

- The loss for the discriminator is expected to rapidly decrease to a value close to zero where it remains during training.
- The loss for the generator is expected to either decrease to zero or continually decrease during training.
- The generator is expected to produce extremely low-quality images that are easily identified as fake by the discriminator."

Many times the discriminator has too easy a job telling fakes from reals, and thus overpowers the poor generator:

<img src="https://i.stack.imgur.com/rNI4P.png" width=45%>

So it is **quite challenging to keep GAN training stable**!

Other tutorials on [improving GAN performance](https://towardsdatascience.com/gan-ways-to-improve-gan-performance-acf37f9f59b) and on [why it is so hard to train GANs](https://medium.com/@jonathan_hui/gan-why-it-is-so-hard-to-train-generative-advisory-networks-819a86b3750b) are very well worth studying!


But even in case of careful training, our old friend, mode collapse haunts us again!


## Mode collapse - still!

Although we started out with the promise that GANs will be able to learn an arbitrarily complicated distribution, and thus prevent all the blurriness resulting from the strict assumptions plaguing VAEs, in practice, mode collapse pops up quite frequently.

<img src="https://qph.fs.quoracdn.net/main-qimg-2b851bba61af56aba197813073bf3276.webp" width=55%>

If we look at the image above, we see, that for a given condition, the generator (without additional help) tends to stick with one solution that worked.

This represents a kind of **exploration-exploitation tradeoff scenario for the generator**: in the "game" it does not pay to search for new, "experimental" images, only to produce one that has already successfully fooled the discriminator. **Ensuring diversity** is thus one of the big challenges of GAN training.

Remember, the Nash equilibrium between D and G is only approximated, and when mode collapse occurs, the loss basically gets switched: the generator gets maximal loss everywhere except in a few places, and the discriminator "wins". This "overpoweredness" of the discriminator is quite problematic, especially in the early stages of training.

And even when mitigated, a "mini mode collapse" can absolutely occur in certain parts of the distribution, as demonstrated by the images below:

<img src="https://4.bp.blogspot.com/-qZCsaRY8i1I/WeRu-fBatBI/AAAAAAAA5DY/ycXqRGaOCOkvxv6yBl6A80S18SKZJ5q9wCLcBGAs/s640/20170521132113990.png" width=55%>


## Counting problems

<img src="http://drive.google.com/uc?export=view&id=1Kzvz6pVz4G_wUhD1mOfAMmV065rcEDu4" width=55%>

Seemingly, convolutional GANs have a pretty hard time with counting. It is generally true that some eyes should be present for a dog, but sadly an approximate solution of "between 0 and 4" is not really appropriate; the exact number of 2 (or 0 or 1 in case of occlusion) is highly desired!


## Perspective problems

<img src="http://drive.google.com/uc?export=view&id=1wHI7TjK1Nt4u1LYEvP-co2F9klqVl6Dw" width=55%>

Well, the right perspective on life matters a lot, don't you think? GANs still have to learn some things about life! 


## Global structure problems

<img src="http://drive.google.com/uc?export=view&id=1_EvsnjFB_yLcQ8ML99RBr-lBaa3UIBsL" width=55%>

Ok, we got all the parts, but maybe, the constellation matters! So sadly, anatomy is a ... 

Happily enough, huge resources have been committed to GANs, since they are such a promising paradigm, so we will see many of these problems simply wiped away as we study the progress of GANs.

# Where can we experiment with GANs?

Good question!

For example in your browser, if you open [GAN Lab](https://poloclub.github.io/ganlab/), a tool for interactive exploration of GAN training!

<img src="https://miro.medium.com/max/1142/1*X9Nhi_ECPmrQ7FyhA4gwnQ.png" width=55%>

[GAN dissection](https://gandissect.csail.mit.edu/), a paper and set of tools from MIT can also be very instructive!

<img src="https://gandissect.csail.mit.edu/img/framework-d2.svg" width=55%>

# Additional resources 

[Goodfellow's workshop on GANs at NIPS 2016](https://www.youtube.com/watch?v=AJVyzd0rqdc) 

[The book GANs in action on Amazon](https://www.amazon.com/GANs-Action-learning-Generative-Adversarial/dp/1617295566/)

[The GAN short course at ZENVA](https://academy.zenva.com/product/generative-adversarial-networks/)