# Table of contents

1. What are they?
2. What makes them work?
3. What is their future?

# What are GANs?

## What are neural nets?

We've all seen diagrams like this:

![](img/neural_network_diagram.png)

But what are neural nets, mathematically?

They are:

* Universal function approximators
* Differentiable

If each layer is written as $a$, $b$, $c$, with weights $V$ and $W$, then the prediction can be written as:

$$ P = p(c(b(a(x, V)), W)) $$

And the loss can be written as:

$$ L = l(p(c(b(a(x, V)), W))) $$


What does differentiable mean? It means we can compute:

$$ \frac{\partial L}{\partial W} $$

$$ \frac{\partial L}{\partial V} $$

and so on. Indeed, this is the information we need to "train" the neural network.

**In addition**, it also means we can compute:

$$ \frac{\partial L}{\partial x} $$

In other words, how much the loss would change if the _input_ changed.
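To make the gradient with respect to the input concrete, here is a minimal sketch. It assumes a "network" that is just a single logistic unit with a negative log-likelihood loss; the weights and input are arbitrary illustrative values, not anything from a real model. The analytic gradient is checked against finite differences.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def loss(x, w, b):
    """Negative log-likelihood of the label 1 for input x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -math.log(p)

def grad_loss_wrt_input(x, w, b):
    """Analytic dL/dx: by the chain rule, dL/dx_i = -(1 - p) * w_i."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [-(1.0 - p) * wi for wi in w]

def numeric_grad_wrt_input(x, w, b, eps=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    grads = []
    for i in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[i] += eps
        lo[i] -= eps
        grads.append((loss(hi, w, b) - loss(lo, w, b)) / (2 * eps))
    return grads

w, b = [0.5, -1.0, 2.0], 0.1   # illustrative weights, chosen arbitrarily
x = [1.0, 2.0, -0.5]           # illustrative input
analytic = grad_loss_wrt_input(x, w, b)
numeric = numeric_grad_wrt_input(x, w, b)
```

Nudging `x` a small step against `analytic` lowers the loss, which is exactly the lever a generator can pull.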

It was _this_ insight that sparked Ian Goodfellow to investigate GANs:

Could a machine learning algorithm use this information to learn how to "trick" another algorithm by producing examples that reduced this loss?

## Origin story

Ian Goodfellow and Yoshua Bengio were about to run a speech synthesis contest. They wanted a discriminator network that could listen to artificially generated speech and decide whether it was real or not.

They concluded that people would just game the system by generating examples that fooled this particular discriminator.

Then, in a bar one night, Ian Goodfellow asked the question: could this be fixed by having the discriminator learn as well?

## How could you do it?

### Part 1

First: randomly generate a feature vector; feed the feature vector through a randomly initialized neural network to produce an output image.

$$ \begin{bmatrix}z_1 \\
                  z_2 \\
                  \vdots \\
                  z_{100}
                  \end{bmatrix} $$

![](img/gan_1.png)

Let's denote the matrix of pixels in this image $X$.

Then, feed this image (the matrix of pixels $X$) into a second network, the discriminator, and get a prediction:

![](img/gan_2.png)

The prediction gives us a loss $L$. Use this loss to train the discriminator.

Critically, also compute $\frac{\partial L}{\partial X}$.

Then, update the generator using $-\frac{\partial L}{\partial X}$. The sign is negative because we want the generator to keep making the discriminator more likely to say that its images are real.

Backpropagate this gradient through the generator to update its weights:

![](img/gan_3.png)

Generate _new_ random noise, and repeat the process.
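The generator update can be sketched with a toy one-dimensional example. Everything here is an assumption made for illustration: the generator is a linear map `a * z + b`, the discriminator is a single fixed logistic unit, and the loss is measured against the "real" label, so that stepping along the negative gradient raises the discriminator's "real" probability.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def generator(z, a, b):
    """Toy generator: maps noise z to a 1-D 'image' X."""
    return a * z + b

def discriminator(x, w, c):
    """Toy discriminator: probability that x is real."""
    return sigmoid(w * x + c)

def generator_step(z, a, b, w, c, lr=0.1):
    """One gradient step on the generator with the discriminator held fixed."""
    x = generator(z, a, b)
    p_real = discriminator(x, w, c)
    # L = -log(p_real): loss measured against the "real" label (an assumption).
    dL_dx = -(1.0 - p_real) * w   # gradient of L with respect to the input X
    dL_da = dL_dx * z             # chain rule through X = a*z + b
    dL_db = dL_dx
    return a - lr * dL_da, b - lr * dL_db

random.seed(0)
a, b = 1.0, 0.0     # generator parameters (arbitrary starting point)
w, c = 2.0, -1.0    # fixed discriminator parameters (arbitrary)
z = random.gauss(0.0, 1.0)
before = discriminator(generator(z, a, b), w, c)
a, b = generator_step(z, a, b, w, c)
after = discriminator(generator(z, a, b), w, c)
```

After the step, the discriminator rates the same noise sample as more likely to be real (`after > before`), even though none of the discriminator's own weights changed.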

### What's missing?

This will train the generator to generate good fake images, but it will likely result in the discriminator not being a very smart classifier since we only gave it one of the two classes it is trying to classify. So, we'll have to give it images from the true class as well.

![](img/gans_4.png)
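A minimal sketch of the discriminator's side of the loop, showing a batch that mixes both classes. The one-dimensional Gaussian "images", the single logistic unit, and the learning rate are all stand-ins chosen for illustration.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def discriminator_step(real, fake, w, c, lr=0.1):
    """One gradient step on binary cross-entropy with real=1 and fake=0."""
    batch = [(x, 1.0) for x in real] + [(x, 0.0) for x in fake]
    grad_w = grad_c = 0.0
    for x, y in batch:
        p = sigmoid(w * x + c)
        grad_w += (p - y) * x / len(batch)   # dL/dw for sigmoid + cross-entropy
        grad_c += (p - y) / len(batch)       # dL/dc
    return w - lr * grad_w, c - lr * grad_c

def bce(real, fake, w, c):
    """Mean binary cross-entropy over the combined batch."""
    losses = [-math.log(sigmoid(w * x + c)) for x in real]
    losses += [-math.log(1.0 - sigmoid(w * x + c)) for x in fake]
    return sum(losses) / len(losses)

random.seed(0)
real = [random.gauss(3.0, 0.5) for _ in range(64)]   # stand-in "true" data
fake = [random.gauss(0.0, 0.5) for _ in range(64)]   # stand-in generator output
w, c = 0.0, 0.0
loss_before = bce(real, fake, w, c)
for _ in range(200):
    w, c = discriminator_step(real, fake, w, c)
loss_after = bce(real, fake, w, c)
```

In a full GAN the `fake` batch would come from the current generator, and the two networks would take turns updating.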

[Original GitHub repo with Ian Goodfellow's code](https://github.com/goodfeli/galatea/commit/d960968919b0856ba6753198a0e035228d7c03e6)

# Let's code one up

See the notebook [here](GAN_example/dlnd_face_generation.ipynb).

## Convolutions

We've all seen diagrams like this in the context of convolutional neural nets:

![](img/AlexNet_0.jpg)

What's really going on here?

Let's say we have an input layer of size $224 \times 224 \times 3$, as we do in ImageNet. The next layer appears to be $96$ deep. What does that mean?

For each of 96 _filters_, the following happens:

For each of the 3 _input channels_, a slice of the _filter_, which happens to be of dimension $11 \times 11$ in this case, is slid over that channel, "detecting the presence of different features" at each location. We'll call the image that results the "output image".

The values of these three output images are then summed together to generate the first of the 96 output images in the following layer.

In addition, the three filter slices (the ones slid over the red, green, and blue color channels) can be combined and visualized as if they were a mini $11 \times 11$ color image.

![](img/AlexNet_filt1.png)
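The per-channel sliding and summing can be sketched in plain Python. The sizes below are tiny stand-ins (a 4x4 image with 3 channels and a 2x2 filter) rather than AlexNet's real dimensions, and the helper names are made up for this example.

```python
def correlate_channel(channel, kernel, stride=1):
    """Slide one square 2-D kernel over one square 2-D channel (no padding)."""
    f = len(kernel)
    out_size = (len(channel) - f) // stride + 1
    return [[sum(channel[r * stride + i][c * stride + j] * kernel[i][j]
                 for i in range(f) for j in range(f))
             for c in range(out_size)]
            for r in range(out_size)]

def apply_filter(image, filt, stride=1):
    """Apply one multi-channel filter: the per-channel maps are summed into one output map."""
    per_channel = [correlate_channel(ch, k, stride) for ch, k in zip(image, filt)]
    size = len(per_channel[0])
    return [[sum(m[r][c] for m in per_channel) for c in range(size)]
            for r in range(size)]

# Toy stand-in for an RGB image: channel 0 is all 1s, channel 1 all 2s, channel 2 all 3s.
image = [[[ch + 1] * 4 for _ in range(4)] for ch in range(3)]
# Toy stand-in for one filter: three 2x2 slices of ones, one per input channel.
filt = [[[1, 1], [1, 1]] for _ in range(3)]
feature_map = apply_filter(image, filt)   # 3x3 map, every entry 4 + 8 + 12 = 24
```

A layer with 96 filters simply repeats `apply_filter` 96 times with different weights, producing 96 output images.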

That's a review of convolutions.

The point is: for each convolution, the **size of the output** from convolving a filter over an image will be a function of:

1. The input size, $I$
2. The filter size, $F$
3. The stride, $S$
4. The padding, $P$

There are formulas relating these quantities that you can look up; however, I think it's best to just reason through what the output shape should be each time.
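For reference, the usual formula for one spatial dimension is $\lfloor (I - F + 2P) / S \rfloor + 1$. A small helper (the function name is just for this sketch) makes it easy to check your reasoning:

```python
def conv_output_size(i, f, s=1, p=0):
    """Output width of a convolution: floor((I - F + 2P) / S) + 1."""
    return (i - f + 2 * p) // s + 1

small = conv_output_size(4, 3)                 # 2: a 3x3 filter fits a 4x4 image at 2 positions
valid = conv_output_size(7, 4)                 # 4
same = conv_output_size(28, 5, s=1, p=2)       # 28: padding of 2 preserves the size
strided = conv_output_size(28, 5, s=2, p=2)    # 14: stride 2 halves it
```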

## What does this have to do with GANs?

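Doing these convolutions is relatively straightforward, but how do we do deconvolutions?

Meaning, if we start with a $4 \times 4$ input, how do we scale it up to, say, $28 \times 28$?

Well, we can represent convolutions by a matrix transformation. See [here](http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html#convolution-as-a-matrix-operation).

After all, a convolution that transforms a $4 \times 4$ image into a $2 \times 2$ is a linear transformation: flatten the image into a vector of length 16, and the convolution can be represented by a $4 \times 16$ matrix.

To "deconvolve" a $2 \times 2$ matrix into a $4 \times 4$, we multiply it by the transpose of that matrix, which is $16 \times 4$. The original matrix is not square, so it has no true inverse; the transpose undoes the change of shape, not the values, which is why this operation is more accurately called a transposed convolution.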
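A minimal sketch of the convolution-as-matrix view, assuming a 3x3 kernel, stride 1, and no padding (the kernel values and helper names are arbitrary). Note that the upsampling step multiplies by the transpose of the convolution matrix, which restores the 4x4 shape but not the original pixel values.

```python
def conv_matrix(kernel, in_size):
    """Build the matrix C such that C times the flattened image equals the
    flattened stride-1, unpadded convolution of a square image with a square kernel."""
    f = len(kernel)
    out_size = in_size - f + 1
    rows = []
    for r in range(out_size):
        for c in range(out_size):
            row = [0] * (in_size * in_size)
            for i in range(f):
                for j in range(f):
                    row[(r + i) * in_size + (c + j)] = kernel[i][j]
            rows.append(row)
    return rows

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

kernel = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # arbitrary 3x3 kernel
C = conv_matrix(kernel, in_size=4)            # 4 rows, 16 columns
image = list(range(16))                       # flattened 4x4 image
small = matvec(C, image)                      # 4 values: the flattened 2x2 output
upsampled = matvec(transpose(C), small)       # 16 values: a flattened 4x4
```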

```python
import tensorflow as tf
```

```python
def generator(z, out_channel_dim, is_train=True):
    """
    Create the generator network
    :param z: Input z
    :param out_channel_dim: The number of channels in the output image
    :param is_train: Boolean if generator is being used for training
    :return: The tensor output of the generator
    """
    with tf.variable_scope('generator', reuse=not is_train):
        # First fully connected layer
        x1 = tf.layers.dense(z, 4*4*512)
        # Reshape it to start the convolutional stack
        x1 = tf.reshape(x1, (-1, 4, 4, 512))
        x1 = tf.layers.batch_normalization(x1, training=is_train)
        x1 = tf.maximum(0.2 * x1, x1)  # leaky ReLU

        # First transposed convolution
        x2 = tf.layers.conv2d_transpose(x1, 256, 4, strides=1, padding='same')
        x2 = tf.layers.batch_normalization(x2, training=is_train)
        x2 = tf.maximum(0.2 * x2, x2)  # leaky ReLU

        # The remaining layers are commented out so that the shape of x2 can be inspected
        # x3 = tf.layers.conv2d_transpose(x2, 128, 5, strides=2, padding='same')
        # x3 = tf.layers.batch_normalization(x3, training=is_train)
        # x3 = tf.maximum(0.2 * x3, x3)

        # logits = tf.layers.conv2d_transpose(x3, out_channel_dim, 5, strides=2, padding='same')

        # out = tf.tanh(logits)

    return x2
```

```python
from copy import deepcopy
from unittest import mock


class TmpMock():
    """
    Mock an attribute.  Restore the attribute when exiting scope.
    """
    def __init__(self, module, attrib_name):
        self.original_attrib = deepcopy(getattr(module, attrib_name))
        setattr(module, attrib_name, mock.MagicMock())
        self.module = module
        self.attrib_name = attrib_name

    def __enter__(self):
        return getattr(self.module, self.attrib_name)

    def __exit__(self, type, value, traceback):
        setattr(self.module, self.attrib_name, self.original_attrib)
```

```python
with TmpMock(tf, 'variable_scope') as mock_variable_scope:
    z = tf.placeholder(tf.float32, [None, 100])
    out_channel_dim = 3
    output = generator(z, out_channel_dim)
    print(output.shape)
```

```
(?, 4, 4, 256)
```


First, let's review two common terms when it comes to "padding" in TensorFlow, or in convolutional neural nets more generally:

* "Same" padding means pad the input so that, at stride 1, the output shape is the same as the input shape. If we have a 5x5 kernel, that means padding two units on each edge of the image.
* "Valid" padding means don't apply any padding. On a forward convolution pass, having no padding means the output image will be smaller than the input image.

So, why is it that applying a `conv2d_transpose` operation with kernel size 4 and stride 1 to a $4 \times 4$ image with `valid` padding results in a $7 \times 7$, whereas applying that operation with `same` padding results in a $4 \times 4$? For a forward convolution this is easy to reason about, but for a transposed convolution it is trickier.

The reason is that, as far as shapes go, deconvolutions are just convolutions run in reverse: applying a convolution of kernel size 4 and stride 1 with `valid` padding to a 7x7 image would result in a 4x4. Similar reasoning applies to `same` padding.
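One way to check this reasoning is to write the two shape rules side by side. The sketch below follows the `valid` and `same` conventions as I understand TensorFlow to apply them; treat the formulas as assumptions to verify against your version, and the function names as made up for this example.

```python
import math

def conv_output_size(i, f, s, padding):
    """Forward convolution output width for 'valid' or 'same' padding."""
    if padding == "valid":
        return (i - f) // s + 1
    return math.ceil(i / s)           # 'same'

def conv_transpose_output_size(i, f, s, padding):
    """Transposed convolution output width: an input width that the forward
    convolution with the same settings would map back to i."""
    if padding == "valid":
        return i * s + max(f - s, 0)
    return i * s                      # 'same'

up_valid = conv_transpose_output_size(4, 4, 1, "valid")   # 7
up_same = conv_transpose_output_size(4, 4, 1, "same")     # 4
# Round trip: the forward convolution takes each result back to 4.
back_valid = conv_output_size(up_valid, 4, 1, "valid")    # 4
back_same = conv_output_size(up_same, 4, 1, "same")       # 4
```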

Getting the dimensions right on these deconvolutions is one of the trickiest parts of getting GANs to work!

# DC-GAN Architecture



Three tricks:

## Deconvolutions:

Architecture:

1. Start with a random noise vector of length 100.
2. Pass it through a fully connected layer and reshape the result to $4 \times 4 \times 512$, say.
3. Apply deconvolution steps, each with a bunch of filters, to produce images of the shape you want, say $28 \times 28$.
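To see how such a stack can get from 4x4 to 28x28, it helps to trace the shapes layer by layer. The layer plan below is hypothetical (one `valid` layer followed by two stride-2 `same` layers, ending in 3 output channels), and the shape rule is the TensorFlow-style convention as I understand it.

```python
def transpose_out(size, kernel, stride, padding):
    """Output width of a transposed convolution ('valid' or 'same' padding)."""
    if padding == "valid":
        return size * stride + max(kernel - stride, 0)
    return size * stride

# (filters, kernel, stride, padding) for each transposed-convolution layer.
# This plan is an illustrative assumption, not a prescribed architecture.
layers = [(256, 4, 1, "valid"), (128, 5, 2, "same"), (3, 5, 2, "same")]

shape = (4, 4, 512)   # after the fully connected layer and the reshape
shapes = [shape]
for filters, kernel, stride, padding in layers:
    side = transpose_out(shape[0], kernel, stride, padding)
    shape = (side, side, filters)
    shapes.append(shape)
# shapes: (4, 4, 512) -> (7, 7, 256) -> (14, 14, 128) -> (28, 28, 3)
```

With `same` padding on the first layer instead, the sequence would be 4, 8, 16, which never lands on 28.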

Let's go through an illustration here:

[Theano documentation](http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html)

What do deconvolution operations do? Convolutions can be represented as matrix multiplications, and deconvolutions are multiplications by the transposes of those matrices. So, when choosing the filter, padding, and stride of an inverse convolution operation, what output shape results? The answer is that the output has the shape that a forward convolution with the same filter, padding, and stride would turn into the input we give this layer.

Let's look at an example. If we give a `conv2d_transpose` a 4x4 and do an inverse convolution operation with:

* Stride 1
* Filter size 4
* Valid padding

we get a 7x7. Why?

Consider doing the forward convolution with these settings on a 7x7. It is clear that we'll get a 4x4: the filter fits at $7 - 4 + 1 = 4$ positions along each side.
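The same 4x4 to 7x7 result can be seen by computing the operation directly. In this plain-Python sketch (the function name is made up, and only unpadded square inputs are handled), every input pixel stamps a scaled copy of the filter onto the output and overlapping stamps are added together.

```python
def conv_transpose2d(image, kernel, stride=1):
    """Transposed convolution without padding: each input pixel adds a scaled
    copy of the kernel to the output, offset by its position times the stride."""
    n, f = len(image), len(kernel)
    out_size = (n - 1) * stride + f
    out = [[0.0] * out_size for _ in range(out_size)]
    for r in range(n):
        for c in range(n):
            for i in range(f):
                for j in range(f):
                    out[r * stride + i][c * stride + j] += image[r][c] * kernel[i][j]
    return out

image = [[1.0] * 4 for _ in range(4)]    # 4x4 input of ones
kernel = [[1.0] * 4 for _ in range(4)]   # 4x4 filter of ones
out = conv_transpose2d(image, kernel)    # 7x7 output
```

The corner of `out` receives a single stamp, while the center receives 16 overlapping ones, which is one reason learned filters matter so much for image quality.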

## Batch normalization

See [here](http://www.deeplearningbook.org/contents/optimization.html), and especially the batch normalization lesson in the Udacity repo. The author of [this post](http://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html) actually figured out the gradients for batch normalization. Also: [unbelievable](https://github.com/cthorey/CS231).

# What is their future?

## Applications:

* Drug discovery

* Semi-supervised learning


