# Table of contents

1. What are they?
2. What makes them work?
3. What is their future?

# What are GANs?

## What are neural nets?

We've all seen diagrams like this:

![](img/neural_network_diagram.png)

But what are neural nets, mathematically?

They are:

* Universal function approximators
* Differentiable

If each layer is written as $a$, $b$, $c$, with weights $V$ and $W$, then the prediction can be written as:

$$ P = p(c(b(a(x, V)), W)) $$

And the loss can be written as:

$$ L = l(p(c(b(a(x, V)), W))) $$


What does differentiable mean? It means we can compute:

$$ \frac{\partial L}{\partial W} $$

$$ \frac{\partial L}{\partial V} $$

and so on. This is exactly the information we need to "train" the neural network.

**In addition**, it also means we can compute:

$$ \frac{\partial L}{\partial x} $$

In other words, how much the loss would change if the _input_ changed.
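To make this concrete, here is a minimal sketch of a one-hidden-unit network where all three gradients are computed by hand with the chain rule. The network shape (a `tanh` hidden unit followed by a sigmoid output), the cross-entropy loss, and all function names are assumptions chosen for illustration; they are not taken from any particular library.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def loss(x, V, W, y=1.0):
    """Toy network: hidden = tanh(V * x), prediction = sigmoid(W * hidden)."""
    h = math.tanh(V * x)
    p = sigmoid(W * h)
    # Binary cross-entropy between prediction p and label y
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def gradients(x, V, W, y=1.0):
    """Return (dL/dW, dL/dV, dL/dx) using the chain rule."""
    h = math.tanh(V * x)
    p = sigmoid(W * h)
    dL_ds = p - y                  # s = W * h is the score fed to the sigmoid
    dL_dW = dL_ds * h
    dL_dh = dL_ds * W
    dL_du = dL_dh * (1 - h * h)    # u = V * x is the input to tanh
    dL_dV = dL_du * x
    dL_dx = dL_du * V              # gradient with respect to the *input*
    return dL_dW, dL_dV, dL_dx

def numeric_derivative(f, t, eps=1e-6):
    """Central finite difference, used only to sanity-check the formulas."""
    return (f(t + eps) - f(t - eps)) / (2 * eps)
```

Comparing `gradients(0.5, 0.8, -1.2)` against `numeric_derivative` applied to each argument of `loss` in turn is a quick way to check that the gradient with respect to the input is as well defined as the gradients with respect to the weights.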

It was _this_ insight that led Ian Goodfellow to investigate GANs:

Could a machine learning algorithm use this information to learn how to "trick" another algorithm by producing examples that reduced this loss?

## Origin story

Ian Goodfellow and Yoshua Bengio were about to run a speech synthesis contest. They wanted a discriminator network that could listen to artificially generated speech and decide whether it was real or not.

They concluded that people would just game the system by generating examples that fooled this particular discriminator.

Then, in a bar one night, Ian Goodfellow asked the question: could this be fixed by having the discriminator learn as well?

## How could you do it?

### Part 1

First: randomly generate a feature vector; feed the feature vector through a randomly initialized neural network to produce an output image.

$$ \begin{bmatrix}z_1 \\
                  z_2 \\
                  \vdots \\
                  z_{100}
                  \end{bmatrix} $$

![](img/gan_1.png)

Let's denote the matrix of pixels in this image $X$.

Then, feed this image (the matrix of pixels $X$) into a second network, the discriminator, and get a prediction:

![](img/gan_2.png)

Compare this prediction with the correct label ("fake") to get a loss $L$, and use this loss to train the discriminator.

Critically, also compute $\frac{\partial L}{\partial X}$.

Then, backpropagate $-\frac{\partial L}{\partial X}$ through the generator to update its weights. The sign is negative because the generator wants the opposite of what the discriminator wants: it should keep making the discriminator more likely to say that its images are real.

![](img/gan_3.png)

Generate _new_ random noise, and repeat the process.
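The generator update above can be sketched on a one-dimensional toy problem. Everything here is an assumption made for illustration: the "image" is a single number, the generator is the linear map `a * z + b`, the discriminator is a fixed logistic unit, and $L$ is taken to be the discriminator's cross-entropy loss on a fake sample labelled fake.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def generator(z, a, b):
    """Hypothetical 1-D generator: maps noise z to a fake 'image' x."""
    return a * z + b

def discriminator(x, w, c):
    """Hypothetical 1-D discriminator: probability that x is real."""
    return sigmoid(w * x + c)

def generator_step(z, a, b, w, c, lr=0.1):
    """One generator update using -dL/dX, where L = -log(1 - D(X))."""
    x = generator(z, a, b)
    d = discriminator(x, w, c)
    dL_dx = d * w            # derivative of -log(1 - sigmoid(w*x + c)) w.r.t. x
    upstream = -dL_dx        # negative: the generator wants L to go *up*
    # Chain rule through the generator: dx/da = z, dx/db = 1
    a_new = a - lr * upstream * z
    b_new = b - lr * upstream
    return a_new, b_new

def train_generator(steps=100, seed=0, w=2.0, c=-1.0):
    """Repeat the step with fresh noise each time; the discriminator stays fixed."""
    rng = random.Random(seed)
    a, b = 1.0, 0.0
    for _ in range(steps):
        z = rng.gauss(0.0, 1.0)   # generate new random noise
        a, b = generator_step(z, a, b, w, c)
    return a, b
```

After a single `generator_step`, the discriminator's output on the same noise value is higher than before, which is the behaviour the negative sign is meant to produce.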

### What's missing?

This will train the generator to generate good fake images, but the discriminator will likely not be a very smart classifier, since we only gave it one of the two classes it is trying to tell apart. So we also have to give it images from the true class.

![](img/gans_4.png)
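A minimal sketch of a discriminator update that sees both classes, again on a one-dimensional toy problem. The logistic discriminator, the sample values, and the function names are assumptions for illustration only.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def discriminator_loss(real, fake, w, c):
    """Cross-entropy over both classes: real samples labelled 1, fake labelled 0."""
    loss_real = -sum(math.log(sigmoid(w * x + c)) for x in real) / len(real)
    loss_fake = -sum(math.log(1.0 - sigmoid(w * x + c)) for x in fake) / len(fake)
    return loss_real + loss_fake

def discriminator_step(real, fake, w, c, lr=0.1):
    """One gradient-descent step on the loss above."""
    grad_w = grad_c = 0.0
    for samples, label in ((real, 1.0), (fake, 0.0)):
        for x in samples:
            p = sigmoid(w * x + c)
            # For sigmoid + cross-entropy, dL/d(score) = p - label
            grad_w += (p - label) * x / len(samples)
            grad_c += (p - label) / len(samples)
    return w - lr * grad_w, c - lr * grad_c

def train_discriminator(real, fake, steps=200):
    w, c = 0.0, 0.0
    for _ in range(steps):
        w, c = discriminator_step(real, fake, w, c)
    return w, c
```

With, say, `real = [2.0, 2.5, 3.0]` and `fake = [-1.0, 0.0, 0.5]`, the trained discriminator scores the real samples above 0.5 and the clearly fake ones below it. In a full GAN this step alternates with the generator update, and the fake batch is regenerated each time.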

[Original GitHub repo with Ian Goodfellow's code](https://github.com/goodfeli/galatea/commit/d960968919b0856ba6753198a0e035228d7c03e6)

# Let's code one up

See the notebook [here](GAN_example/dlnd_face_generation.ipynb).

## Convolutions

## Batch normalization

# DC-GAN Architecture



Three tricks:

## Deconvolutions:

Architecture:

1. Start with a random noise vector of length 100.
2. Pass it through a fully connected layer and reshape the output to $4 \times 4 \times 512$, say.
3. Apply a series of deconvolution layers, each with its own set of filters, to upsample these feature maps into an image of the shape you want, say 28 x 28.
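The three steps above can be traced as pure shape arithmetic. This is only a sketch: the padding names follow the TensorFlow convention ("valid" and "same"), and the specific kernel sizes, strides, and channel counts (256, 128, 1) are hypothetical choices that happen to land on 28 x 28, not values from the notebook.

```python
def transposed_conv_size(size, kernel, stride, padding):
    """Spatial output size of a transposed convolution.

    Assumes TensorFlow-style padding names and kernel >= stride.
    """
    if padding == "valid":
        return (size - 1) * stride + kernel
    if padding == "same":
        return size * stride
    raise ValueError(f"unknown padding: {padding}")

def generator_shapes(noise_dim=100, start=4, start_channels=512):
    """List the (name, shape) of each stage of a hypothetical DC-GAN generator."""
    shapes = [
        ("noise", (noise_dim,)),
        ("fully connected", (start * start * start_channels,)),
        ("reshape", (start, start, start_channels)),
    ]
    # (kernel, stride, padding, output channels): illustrative layer choices
    layers = [(4, 1, "valid", 256), (5, 2, "same", 128), (5, 2, "same", 1)]
    size = start
    for kernel, stride, padding, channels in layers:
        size = transposed_conv_size(size, kernel, stride, padding)
        shapes.append(("transposed conv", (size, size, channels)))
    return shapes
```

Iterating over `generator_shapes()` walks the spatial size through 4, 7, 14, and 28 while the channel count shrinks to a single greyscale channel.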

Let's go through an illustration here:

[Theano documentation](http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html)

What do deconvolution operations do? Convolutions can be represented as matrix multiplications. Deconvolutions (more precisely, transposed convolutions) multiply by the transpose of that matrix: they undo the change in shape, not the values, so they are not a true inverse. So, when choosing the filter, padding, and stride of a transposed convolution, what output shape results? The answer is the input shape that an ordinary convolution with the same filter, padding, and stride would map to the shape we give this layer.

Let's look at an example. If we give a Conv 2D transpose a 4x4 input with:

* Stride 1
* Filter size 4
* Valid padding

we get a 7x7. Why?

Consider doing an ordinary convolution with the same settings on a 7x7. It is clear that we'll get a 4x4, since $7 - 4 + 1 = 4$.
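The 4x4 to 7x7 example can be checked with a small pure-Python sketch of a stride-1, valid-padding convolution and its transposed counterpart. The function names and test values are made up for illustration, and only square, single-channel inputs are handled. The sketch also shows in what sense the two operations are paired: the transposed convolution is the matrix transpose of the ordinary one, so the two inner products $\langle \text{conv}(x), y \rangle$ and $\langle x, \text{conv}^T(y) \rangle$ agree.

```python
def conv2d_valid(image, kernel):
    """Ordinary stride-1 convolution (cross-correlation) with valid padding."""
    n, k = len(image), len(kernel)
    out = n - k + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(k) for b in range(k))
             for j in range(out)]
            for i in range(out)]

def conv2d_transpose_valid(image, kernel):
    """Stride-1 transposed convolution: each input pixel stamps a scaled kernel."""
    n, k = len(image), len(kernel)
    out = n + k - 1
    result = [[0] * out for _ in range(out)]
    for i in range(n):
        for j in range(n):
            for a in range(k):
                for b in range(k):
                    result[i + a][j + b] += image[i][j] * kernel[a][b]
    return result

def inner(p, q):
    """Sum of elementwise products of two equally sized grids."""
    return sum(pv * qv for pr, qr in zip(p, q) for pv, qv in zip(pr, qr))

# Illustrative integer data so the comparison below is exact
kernel = [[a + 2 * b + 1 for b in range(4)] for a in range(4)]
x = [[i * 7 + j for j in range(7)] for i in range(7)]   # 7x7
y = [[i - j for j in range(4)] for i in range(4)]       # 4x4

down = conv2d_valid(x, kernel)            # 7x7 -> 4x4
up = conv2d_transpose_valid(y, kernel)    # 4x4 -> 7x7
```

Here `down` is 4x4 and `up` is 7x7, and `inner(down, y) == inner(x, up)` holds, which is the transpose relationship. Note that `conv2d_transpose_valid(down, kernel)` does not in general give back `x`.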

## Batch normalization

See [here](http://www.deeplearningbook.org/contents/optimization.html), and especially the batch normalization lesson in the Udacity repo. [This post](http://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html) works through the gradients for batch normalization step by step. Also worth a look: [this CS231 repo](https://github.com/cthorey/CS231).

# What is their future?

## Applications:

* Drug discovery

* Semi-supervised learning


