# Table of contents

1. What are they?
2. What makes them work?
3. What is their future?

# What are GANs?

## What are neural nets?

We've all seen diagrams like this:

![](img/neural_network_diagram.png)

But what are neural nets, mathematically?

They are:

* Universal function approximators
* Differentiable

If each layer is written as $a$, $b$, $c$, with weights $V$ and $W$, then the prediction can be written as:

$$ P = p(c(b(a(x, V)), W)) $$

And the loss can be written as:

$$ L = l(p(c(b(a(x, V)), W))) $$


What does differentiable mean? It means we can compute:

$$ \frac{\partial l}{\partial W} $$

$$ \frac{\partial l}{\partial V} $$

etc. Indeed, this is the information we need to "train" the neural network. 

**In addition**, it also means we can compute:

$$ \frac{\partial l}{\partial x} $$

In other words, how much the loss would change if the _input_ changed.

It was _this_ insight that led Ian Goodfellow to investigate GANs:

Could a machine learning algorithm use this information to learn how to "trick" another algorithm by producing examples that reduced this loss?
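To make the "gradient with respect to the input" idea concrete, here is a minimal sketch. It assumes a hypothetical one-layer logistic model with fixed weights `w` and `b` (none of these names come from the original); it computes how the loss changes with the input, checks that against finite differences, and then nudges the _input_ to reduce the loss.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def loss(x, w, b):
    """Cross-entropy loss for target label 1 (hypothetical one-layer model)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -math.log(p)

def grad_wrt_input(x, w, b):
    """Analytic d(loss)/dx; for this particular model it equals (p - 1) * w."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - 1.0) * wi for wi in w]

w, b = [0.5, -1.0, 2.0], 0.1   # fixed weights: the model itself is not trained here
x = [1.0, 2.0, -0.5]           # the input we are allowed to change

g = grad_wrt_input(x, w, b)

# Sanity check against finite differences.
eps = 1e-6
numeric = []
for i in range(len(x)):
    x_plus = list(x)
    x_plus[i] += eps
    numeric.append((loss(x_plus, w, b) - loss(x, w, b)) / eps)

# Move the input against the gradient: the loss goes down.
step = 0.5
x_new = [xi - step * gi for xi, gi in zip(x, g)]
print(loss(x, w, b), loss(x_new, w, b))
```

The weights never change in this sketch; only the input does, which is exactly the information a generator can exploit.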

## Origin story

Ian Goodfellow and Yoshua Bengio were about to run a speech synthesis contest. They wanted a discriminator network that could listen to artificially generated speech and decide whether it was real or not.

They concluded that people would just game the system by generating examples that fooled this particular discriminator.

Then, in a bar one night, Ian Goodfellow asked the question: could this be changed by letting the discriminator learn too?

## How could you do it?

### Part 1

First: randomly generate a feature vector; feed the feature vector through a randomly initialized neural network to produce an output image.

$$ \begin{bmatrix}z_1 \\
                  z_2 \\
                  \vdots \\
                  z_{100}
                  \end{bmatrix} $$

![](img/gan_1.png)

Let's denote the matrix of pixels in this image $X$.

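Then, feed this image (the matrix of pixels $X$) into a second network, the discriminator, and get a prediction of whether the image is real or fake:

![](img/gan_2.png)

Compare this prediction with the correct label ("fake") to get a loss $L$, and use this loss to train the discriminator.

Critically, also compute $\frac{\partial L}{\partial X}$: how much the discriminator's loss would change if the generated image changed.

Then, backpropagate $-\frac{\partial L}{\partial X}$ into the generator and use it to update the weights of the generator. The sign is negative because we want the generator to be continually making the discriminator more likely to say that its images are real.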

![](img/gan_3.png)

Generate _new_ random noise, and repeat the process.

### What's missing?

This will train the generator to generate good fake images, but it will likely result in the discriminator not being a very smart classifier since we only gave it one of the two classes it is trying to classify. So, we'll have to give it images from the true class as well.

![](img/gans_4.png)
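The whole loop, including the real examples, can be sketched in a few lines. This is a toy illustration under loud assumptions, not Goodfellow's code: the "images" are single numbers, the real data is drawn from a normal distribution with mean 4, the generator is `X = a * z + b`, and the discriminator is a logistic regression. All names here are hypothetical.

```python
import math
import random

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

random.seed(0)

a, b = 1.0, 0.0    # generator:      X = a * z + b
w, c = 0.1, 0.0    # discriminator:  D(X) = sigmoid(w * X + c) = P(X is real)
lr = 0.05

for step in range(300):
    z = random.gauss(0.0, 1.0)         # new random noise every step
    x_fake = a * z + b                 # generator forward pass
    x_real = random.gauss(4.0, 1.0)    # an example from the true class

    d_fake = sigmoid(w * x_fake + c)
    d_real = sigmoid(w * x_real + c)

    # Discriminator loss: L = -log(1 - D(x_fake)) - log(D(x_real)).
    # Its gradients with respect to the discriminator's parameters:
    grad_w = d_fake * x_fake + (d_real - 1.0) * x_real
    grad_c = d_fake + (d_real - 1.0)

    # Gradient of the fake-image loss with respect to the image itself.
    dL_dX = d_fake * w

    # Generator: backpropagate -dL/dX through X = a * z + b.
    grad_a = -dL_dX * z
    grad_b = -dL_dX

    # Plain gradient descent on both networks.
    w -= lr * grad_w
    c -= lr * grad_c
    a -= lr * grad_a
    b -= lr * grad_b

print(f"generator offset b = {b:.2f} (the real data has mean 4.0)")
```

The generator never sees a real example; it only ever receives $-\frac{\partial L}{\partial X}$ from the discriminator, and that is enough to pull its outputs toward the real data.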

[Original GitHub repo with Ian Goodfellow's code](https://github.com/goodfeli/galatea/commit/d960968919b0856ba6753198a0e035228d7c03e6)

# Let's code one up

See the notebook [here](GAN_example/dlnd_face_generation.ipynb). TODO: get this running on AWS (easy)

## Convolutions

We've all seen diagrams like this in the context of convolutional neural nets:

![](img/AlexNet_0.jpg)

What's really going on here?

Let's say we have an input layer of size $224 \times 224 \times 3$, as we do in ImageNet. This next layer seems to be $96$ deep. What does that mean?

For each of 96 _filters_, the following happens:

For each of the 3 _input channels_, the _filter_, which happens to be of dimension $11 \times 11$ in this case, is slid over the image, "detecting the presence of different features" at each location. We'll call the image that results the "output image".

At the next layer, the values of these three output images are summed together to generate the first of 96 output images in the following layer. 

In addition, the three filters - the one slid over the red, green, and blue color channels - can be combined together and visualized as if they were a mini 11x11 image.

![](img/AlexNet_filt1.png)

That's a review of convolutions.

Point is: for each convolution, the **size of the output** from convolving a filter over an image will be a function of:

1. The input size, $I$
2. The filter size, $F$
3. The stride, $S$
4. The padding, $P$

There are formulas relating these quantities that you can look up; however, I think it's best to just reason through what the output shape should be each time.
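For reference, the standard formula for one spatial dimension is $O = \lfloor (I - F + 2P) / S \rfloor + 1$. A small helper (the function name is just for illustration) makes it easy to check your reasoning:

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Output width (or height) of a convolution along one spatial dimension."""
    return (input_size - filter_size + 2 * padding) // stride + 1

print(conv_output_size(7, 4))                        # 4
print(conv_output_size(4, 3))                        # 2
print(conv_output_size(28, 5, stride=1, padding=2))  # 28
print(conv_output_size(28, 4, stride=2, padding=1))  # 14
```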

## What does this have to do with GANs?

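Doing these convolutions is relatively straightforward - but how do we do deconvolutions?

Meaning, if we start with a $4 \times 4$ input, how do we scale it up to, say, $28 \times 28$?

Well, we can represent convolutions by a matrix transformation. See [here](http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html#convolution-as-a-matrix-operation)

After all, a convolution that transforms a $4 \times 4$ image into a $2 \times 2$ is a linear transformation: flatten the image into a vector of length 16, and the convolution can be represented by a $4 \times 16$ matrix.

To "deconvolve" a $2 \times 2$ matrix into a $4 \times 4$, we multiply it by the _transpose_ of that matrix, which is $16 \times 4$. The original matrix is not square, so it has no inverse; this is why the operation is more accurately called a "transposed convolution". It recovers the original _shape_, not the original values.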
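Here is a small sketch of that matrix view in plain Python. It assumes a stride of 1, no padding, and the cross-correlation convention used by deep learning libraries; the kernel values and helper names are made up for illustration.

```python
def conv_matrix(kernel, in_size):
    """Matrix C such that C @ flatten(image) equals the flattened
    'valid', stride-1 convolution of a square image with the kernel."""
    k = len(kernel)
    out_size = in_size - k + 1
    rows = []
    for oi in range(out_size):
        for oj in range(out_size):
            row = [0.0] * (in_size * in_size)
            for ki in range(k):
                for kj in range(k):
                    row[(oi + ki) * in_size + (oj + kj)] = kernel[ki][kj]
            rows.append(row)
    return rows

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

kernel = [[1.0, 0.0, -1.0],
          [2.0, 0.0, -2.0],
          [1.0, 0.0, -1.0]]              # hypothetical 3x3 filter
C = conv_matrix(kernel, in_size=4)       # 4 rows, 16 columns
image = [float(i) for i in range(16)]    # a flattened 4x4 image

small = matvec(C, image)                 # length 4:  a flattened 2x2
big = matvec(transpose(C), small)        # length 16: a flattened 4x4

print(len(C), len(C[0]))      # 4 16
print(len(small), len(big))   # 4 16
```

Note that `big` has the shape of the original image but not its values, which is the difference between a transpose and an inverse.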

```python
# Note: this code uses the TensorFlow 1.x API (tf.layers, tf.variable_scope, tf.placeholder).
import tensorflow as tf
```

```python
def generator(z, out_channel_dim, is_train=True):
    """
    Create the generator network
    :param z: Input z
    :param out_channel_dim: The number of channels in the output image
    :param is_train: Boolean if generator is being used for training
    :return: The tensor output of the generator
    """
    with tf.variable_scope('generator', reuse=not is_train):
        # First fully connected layer
        x1 = tf.layers.dense(z, 4*4*512)
        # Reshape it to start the convolutional stack
        x1 = tf.reshape(x1, (-1, 4, 4, 512))
        x1 = tf.layers.batch_normalization(x1, training=is_train)
        x1 = tf.maximum(0.2 * x1, x1)  # leaky ReLU

        # First transposed convolution: kernel size 4, stride 1, 'same' padding
        x2 = tf.layers.conv2d_transpose(x1, 256, 4, strides=1, padding='same')
        x2 = tf.layers.batch_normalization(x2, training=is_train)
        x2 = tf.maximum(0.2 * x2, x2)  # leaky ReLU

        # The remaining layers are disabled so that we can inspect the shape of x2.
        # x3 = tf.layers.conv2d_transpose(x2, 128, 5, strides=2, padding='same')
        # x3 = tf.layers.batch_normalization(x3, training=is_train)
        # x3 = tf.maximum(0.2 * x3, x3)
        #
        # logits = tf.layers.conv2d_transpose(x3, out_channel_dim, 5, strides=2, padding='same')
        # out = tf.tanh(logits)

    return x2
```

```python
from copy import deepcopy
from unittest import mock


class TmpMock():
    """
    Mock an attribute.  Restore the attribute when exiting scope.
    """
    def __init__(self, module, attrib_name):
        self.original_attrib = deepcopy(getattr(module, attrib_name))
        setattr(module, attrib_name, mock.MagicMock())
        self.module = module
        self.attrib_name = attrib_name

    def __enter__(self):
        return getattr(self.module, self.attrib_name)

    def __exit__(self, exc_type, exc_value, traceback):
        setattr(self.module, self.attrib_name, self.original_attrib)
```

```python
# Build the generator with tf.variable_scope mocked out, then inspect the output shape.
with TmpMock(tf, 'variable_scope') as mock_variable_scope:
    z = tf.placeholder(tf.float32, [None, 100])
    out_channel_dim = 3
    output = generator(z, out_channel_dim)
    print(output.shape)
```

```
(?, 4, 4, 256)
```


First, let's review two common terms when it comes to "padding" in TensorFlow or convolutional neural nets more generally:

* "Same" padding means pad the input so that, with a stride of 1, the output shape is the same as the input shape. If we have a 5x5 kernel, that means padding two units on each edge of the image.
* "Valid" padding means don't apply any padding - on a forward convolution pass, having no padding means the output image will be smaller than the input image.

So, why is it that applying a `conv2d_transpose` operation with kernel size 4 and stride 1 to a $4 \times 4$ image with `valid` padding results in a $7 \times 7$, whereas applying that operation with `same` padding results in a $4 \times 4$? For an ordinary (forward) convolution, the output shape is easy to reason about, but for a transposed convolution, which runs the convolution "backwards", it is trickier.

The reason is that a deconvolution reverses the _shape_ change of the corresponding convolution - and applying a convolution of kernel size 4 and stride 1 with `valid` padding to a 7x7 image would result in a 4x4. Similar reasoning applies to `same` padding: with a stride of 1, a `same` convolution maps a 4x4 to a 4x4, so the transposed version does too. TODO: add illustration.

Getting the dimensions right on these deconvolutions is one of the trickiest parts of getting GANs to work!
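One way to keep the dimensions straight is a small helper like the one below. It is a sketch that follows the `valid`/`same` conventions described above and assumes the filter is at least as large as the stride; the function name is made up for illustration.

```python
def conv_transpose_output_size(input_size, filter_size, stride=1, padding="valid"):
    """Output size of a transposed convolution along one spatial dimension.
    Assumes filter_size >= stride."""
    if padding == "valid":
        return (input_size - 1) * stride + filter_size
    if padding == "same":
        return input_size * stride
    raise ValueError("padding must be 'valid' or 'same'")

print(conv_transpose_output_size(4, 4, stride=1, padding="valid"))  # 7
print(conv_transpose_output_size(4, 4, stride=1, padding="same"))   # 4

# Chaining layers: 7x7 -> 14x14 -> 28x28 with stride 2 and 'same' padding.
size = 7
sizes = [size]
for _ in range(2):
    size = conv_transpose_output_size(size, 5, stride=2, padding="same")
    sizes.append(size)
print(sizes)  # [7, 14, 28]
```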

## Batch normalization

What do we do when we train neural nets? At each iteration, we update the weights based on $\frac{\partial l}{\partial W}$.

Is this really accurate? Well, it's a first order approximation - literally.

We can indeed figure out how much the weights should change _holding everything else in the neural net constant_.

However, all the weights in the neural net are changing at once!

This can lead to significant "second order effects", where changing one weight can have a much different impact on the loss than we expect because its change interacts with all the other weights' changes as well.

To get concrete, imagine that for a given weight $w$ we compute $\frac{\partial l}{\partial w} = 0.1$: to first order, increasing this weight by 1 unit will increase the loss by 0.1 units. Once we account for the interactions with all the other weight changes, the actual increase could be significantly different from that.
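A tiny numerical example shows the effect. The two-weight "network" below is hypothetical and chosen only because its weights interact (they are multiplied together):

```python
def loss(w1, w2):
    """Hypothetical two-weight 'network': it predicts w1 * w2, and the target is 1."""
    return (w1 * w2 - 1.0) ** 2

w1, w2 = 2.0, 1.0

# Analytic partial derivatives at (w1, w2).
d_w1 = 2 * (w1 * w2 - 1.0) * w2   # 2.0
d_w2 = 2 * (w1 * w2 - 1.0) * w1   # 4.0

lr = 0.1
step1, step2 = -lr * d_w1, -lr * d_w2     # gradient descent steps

base = loss(w1, w2)
predicted = d_w1 * step1 + d_w2 * step2   # change promised by the first-order approximation
only_w1 = loss(w1 + step1, w2) - base     # change if only w1 moves
both = loss(w1 + step1, w2 + step2) - base  # change when both weights move at once

print(round(predicted, 4))  # -2.0
print(round(only_w1, 4))    # -0.36
print(round(both, 4))       # -0.9936
```

The gradient promises a drop of 2.0, which is impossible here because the loss starts at 1.0 and cannot go below zero; the actual drop is about half of what was promised.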

## Solution

**Batch normalization** significantly mitigates this problem. It involves the following:

1. For each batch of images (typically 64 or 128) fed through the network: at each layer, calculate the mean $\mu$ and the standard deviation $\sigma$ of each neuron's value across the batch.
2. Normalize each neuron by subtracting the mean and dividing by the standard deviation:

$$N' = \frac{N - \mu}{\sigma}$$

Continue propagating through the network.
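In code, the two steps above might look like the sketch below for a fully connected layer. The small `eps` added inside the square root is an assumption that is not in the formula above; it is commonly added to avoid dividing by zero.

```python
import math

def batch_norm(batch, eps=1e-5):
    """batch[i][j] is the value of neuron j for example i in the batch."""
    n = len(batch)
    num_neurons = len(batch[0])
    normalized = [[0.0] * num_neurons for _ in range(n)]
    for j in range(num_neurons):
        column = [example[j] for example in batch]
        mu = sum(column) / n
        var = sum((v - mu) ** 2 for v in column) / n
        sigma = math.sqrt(var + eps)   # eps guards against division by zero
        for i in range(n):
            normalized[i][j] = (batch[i][j] - mu) / sigma
    return normalized

batch = [[1.0, 10.0],
         [2.0, 20.0],
         [3.0, 30.0]]
normalized = batch_norm(batch)
print(normalized)
```

Both neurons end up on the same scale even though the second one started out ten times larger.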

Can anyone think of an issue with this?

For convolutional networks, the "neurons" are pixels in output images that have been convolved with a filter. These images are important - they contain spatial information about what is present in the images. If we modify pixels in these images by different amounts, this spatial information could get modified. 

So, for each batch we have to calculate one mean and one standard deviation per filter map ($F$ of each, where $F$ is here the number of filter maps), so that in a given filter map, each pixel will be "normalized" by the same amount.
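A sketch of the per-filter-map version follows. It assumes each filter map has been flattened into a list of pixels, and again adds a small `eps` that is not part of the formula above; the names are hypothetical.

```python
import math

def batch_norm_conv(batch, eps=1e-5):
    """batch[n][f] is the flattened list of pixels of filter map f for example n.
    One mean and one standard deviation are computed per filter map,
    pooled over every example and every pixel position."""
    num_filters = len(batch[0])
    out = [[None] * num_filters for _ in batch]
    for f in range(num_filters):
        pixels = [p for example in batch for p in example[f]]
        mu = sum(pixels) / len(pixels)
        var = sum((p - mu) ** 2 for p in pixels) / len(pixels)
        sigma = math.sqrt(var + eps)
        for n, example in enumerate(batch):
            out[n][f] = [(p - mu) / sigma for p in example[f]]
    return out

# Two examples, two filter maps, each map a flattened 2x2 image.
batch = [[[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 20.0, 20.0]],
         [[5.0, 6.0, 7.0, 8.0], [30.0, 30.0, 40.0, 40.0]]]
out = batch_norm_conv(batch)
print(out[0][0])
```

Because every pixel in a filter map is shifted and scaled by the same $\mu$ and $\sigma$, the spatial pattern within each map is preserved.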

## There's more

We don't stop there. We further modify $N'$ to be defined as:

$$ \gamma * N' + \beta $$

We initialize $\gamma$ to 1 and $\beta$ to 0. And then these become parameters that are learned along with all the others in the course of the network training. 

Note: why does this work? Why do we normalize _and then_ apply these parameters?

Let's suppose that the mean of a given layer of features is significant to determining the behavior of the following layer - you can either think of the mean of a hidden layer of neurons, or the mean value across a filter in a convolutional layer. Without normalizing and then applying $\gamma$ and $\beta$, the network will have to learn the mean of this layer by adjusting individual weights. 

By applying these transformations, however, the network can simply learn one parameter, $\beta$, that determines the mean of the layer.

Section 8.7 of [the definitive textbook on Deep Learning](http://www.deeplearningbook.org) explains this well:

> ...the new parametrization can represent the same family of functions of the input as the old parametrization, but the new parametrization has different learning dynamics. In the old parametrization, the mean of [the layer] was determined by a complicated interaction between the parameters in the layers below. In the new parametrization, the mean of [$\gamma N' + \beta$] is determined solely by $\beta$. The new parametrization is much easier to learn with gradient descent.
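A quick check of that claim, as a sketch with made-up numbers: whatever the incoming activations are, after normalizing and applying $\gamma$ and $\beta$ the mean of the output is simply $\beta$. The `eps` term is an assumption added for numerical safety.

```python
import math

def normalize(values, eps=1e-5):
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return [(v - mu) / math.sqrt(var + eps) for v in values]

def scale_and_shift(normalized, gamma=1.0, beta=0.0):
    return [gamma * n + beta for n in normalized]

activations = [3.0, 7.0, 8.0, 10.0]   # one neuron's values across a batch of 4
n_prime = normalize(activations)

out = scale_and_shift(n_prime, gamma=2.0, beta=5.0)
mean_out = sum(out) / len(out)
print(round(mean_out, 6))   # 5.0
```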

# DC-GAN Architecture



Three tricks:

## Deconvolutions:

Architecture:

1. Start with a random noise vector of length 100.
2. Transform it with a fully connected layer and reshape the result to $4 \times 4 \times 512$, say.
3. Perform deconvolution steps that apply a bunch of filters to these images to produce images of the shape you want, say $28 \times 28$.

Let's go through an illustration here:

[Theano documentation](http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html)

What do deconvolution operations do?
Convolutions can be represented as matrix multiplications.
Deconvolutions are multiplications by the _transpose_ of those matrices (not the inverse, which in general does not exist).
So, when choosing the filter, padding, and stride of a deconvolution operation, what output shape will result?
The answer is that the output will have the shape that, when fed through the corresponding forward convolution, would produce the shape of the input that we give this layer.

Let's look at an example. If we give a Conv 2D transpose a 4x4, and do a deconvolution with:

* Stride 1
* Filter size 4
* Valid padding

We get a 7x7. Why?

Consider doing the forward convolution with these settings on a 7x7. The filter fits in $7 - 4 + 1 = 4$ positions along each dimension, so we get a 4x4.
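Another way to see the 7x7 is to compute a transposed convolution by hand. In the sketch below (plain Python, `valid` padding, a hypothetical filter of all ones), each input pixel "stamps" a scaled copy of the filter onto the output, and overlapping stamps are summed:

```python
def conv_transpose2d(image, kernel, stride=1):
    """'Valid' transposed convolution of a square image with a square kernel."""
    n, k = len(image), len(kernel)
    out_size = (n - 1) * stride + k
    out = [[0.0] * out_size for _ in range(out_size)]
    for i in range(n):
        for j in range(n):
            for ki in range(k):
                for kj in range(k):
                    out[i * stride + ki][j * stride + kj] += image[i][j] * kernel[ki][kj]
    return out

image = [[1.0] * 4 for _ in range(4)]    # 4x4 input
kernel = [[1.0] * 4 for _ in range(4)]   # hypothetical 4x4 filter of ones
out = conv_transpose2d(image, kernel)

print(len(out), len(out[0]))   # 7 7
print(out[0][0], out[3][3])    # 1.0 16.0
```

The corner of the output is touched by only one stamp, while the center is touched by all sixteen, which is why the values differ even though every input pixel is 1.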

## Batch normalization

See [here](http://www.deeplearningbook.org/contents/optimization.html), and especially the batch normalization lesson in the Udacity repo. This guy actually worked out the gradients for batch normalization [here](http://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html). Also: [unbelievable](https://github.com/cthorey/CS231).

# What is their future?

## Applications:

* Drug discovery

* Semi-supervised learning




# Semi-supervised learning

Semi-supervised learning is a third type of machine learning, other than supervised learning and unsupervised learning.

SSL:

1. The discriminator computes both the probability of the image belonging to each of 10 classes and the probability of it being real. Note: it computes the probability of the images being real on both real and fake data.

How do we make the generator smarter? "Feature matching".

1. The last layer of a convolutional neural network, before it gets reshaped to a fully connected layer, is usually a long, narrow layer. In the network we just looked at, the last layer was 2x2x128. We can average each of these 128 filter maps over its 2x2 spatial positions to get a vector of dimension 128. This vector "should" represent the features that the network has extracted from the image that ultimately determine what class the image belongs to, whether it is real or fake, etc.
2. We can compute this vector for both real and generated images fed through the discriminator.
3. We can compare the values of these features for the real images vs. the generated images to get a loss.

This is called **"feature matching"** - the main idea is to consider the features that the discriminator uses while "discriminating" between the real and fake images.
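As a sketch, the loss in step 3 could be the squared distance between the batch-averaged feature vectors. That particular choice of distance is an assumption here, and the three-dimensional feature vectors are made up to keep the example small (in the network above they would have 128 entries).

```python
def mean_features(batch):
    """Average each feature over a batch of feature vectors."""
    n = len(batch)
    return [sum(example[k] for example in batch) / n for k in range(len(batch[0]))]

def feature_matching_loss(real_features, fake_features):
    """Squared distance between the batch-averaged feature vectors."""
    mu_real = mean_features(real_features)
    mu_fake = mean_features(fake_features)
    return sum((r - f) ** 2 for r, f in zip(mu_real, mu_fake))

real_features = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]   # discriminator features of real images
fake_features = [[0.0, 0.0, 0.0], [2.0, 2.0, 2.0]]   # discriminator features of generated images

print(feature_matching_loss(real_features, fake_features))  # 3.0
```

The generator is then trained to shrink this number, rather than directly trying to fool the discriminator's real/fake output.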

This was the trick that led to breakthrough performance using semi-supervised learning to build powerful classifiers: 

Salimans et al. from OpenAI in mid-2016 used this approach to get a 6% error rate on the Street View House Numbers dataset _with just 1,000 labeled images_. For comparison, the state of the art using the entire dataset of roughly 600,000 images is 2%, simply using supervised learning with very deep convolutional networks.

Since then, hybrid approaches that used both feature matching _and_ the traditional GAN loss have been shown to work - see [here](http://aiden.nibali.org/blog/2017-01-18-mode-collapse-gans/) for example. 

More recently, researchers from CMU have shown that a semi-supervised learning approach that prevents the GAN from "getting too good" by penalizing it for generating images too similar to those in the training set - using a "bad GAN" as they call it - outperforms the feature matching approach. 

Again, the central idea - that the GAN shouldn't actually simply be trained to generate the best possible images - is more important than the specific details of the penalty they used. (see the paper [here](https://www.cs.cmu.edu/~wcohen/postscript/nips-2017-badgan.pdf) - especially the second-to-last paragraph of the introduction).

## Progressive GANs

People have been trying to improve the resolution of GANs since their inception.

[Here](http://research.nvidia.com/sites/default/files/publications/karras2017gan-paper.pdf) is the Progressive GAN paper. This technique was able to generate high quality 1024x1024 images mimicking those from the CelebA dataset.

## How did they do it?

The fundamental idea was:

1. Begin by downsampling the images to be simply _4x4_.
2. Train a GAN to generate "high quality" 4x4 images. 
3. Then, using the weights already learned in the initial layers, add a layer after the generator and before the discriminator so that this GAN now generates _8x8_ images (see illustration).
4. Even better, when these larger layers are initially added, there's a "grace period" where the generated images are still _mostly_ a function of the weights of the layers that have already been trained. 

"When new layers are added to the networks, we fade them in smoothly...This avoids sudden shocks to the already well-trained, smaller-resolution layers."

This allows them to train the network using something known as "Wasserstein loss", a more sophisticated measure of how similar two distributions are. (See the Introduction to the Wasserstein GAN paper [here](https://arxiv.org/pdf/1701.07875.pdf)).
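The "fade in" of a new layer described above can be sketched as a weighted blend between the upsampled output of the already-trained layers and the output of the new, higher-resolution layer. This is a simplified illustration with hypothetical names, not the paper's implementation; `alpha` is assumed to grow from 0 to 1 over the grace period.

```python
def upsample_2x(image):
    """Nearest-neighbour upsampling: every pixel becomes a 2x2 block."""
    out = []
    for row in image:
        wide = [p for p in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def fade_in(old_image, new_image, alpha):
    """alpha = 0 uses only the old layers; alpha = 1 uses only the new layer."""
    up = upsample_2x(old_image)
    return [[(1 - alpha) * u + alpha * n for u, n in zip(up_row, new_row)]
            for up_row, new_row in zip(up, new_image)]

old_image = [[1.0, 2.0],
             [3.0, 4.0]]                        # output of the trained low-resolution layers
new_image = [[0.0] * 4 for _ in range(4)]       # output of the freshly added layer

print(fade_in(old_image, new_image, alpha=0.0)[0])   # [1.0, 1.0, 2.0, 2.0]
print(fade_in(old_image, new_image, alpha=0.25)[0])  # [0.75, 0.75, 1.5, 1.5]
```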

There are several other clever tricks they use. See the paper, especially Sections 2, 3, and 4 (a total of two pages) for details. 

## Scoring

How do we know how "good" GANs are?

It turns out if you score GANs naively (e.g. using mean squared error), they don't turn out to be much "better" than other popular image generation methods, such as Variational Autoencoders.

As Ian Goodfellow himself noted in his 2016 NIPS tutorial on GANs, there isn't an obvious way to score GANs, and one of their advantages is that they produce images that simply "look" better, even if it is hard to quantify exactly what this means. 

### Inception score

A clever method for scoring GANs was developed by Tim Salimans at OpenAI, that illustrates well some properties that we want GANs to have.

First, observe that if a GAN generated some images, and those images were then fed through a pre-trained neural network, and the resulting probability distribution over the class labels for an image was:

`[0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]`

this probably isn't a very good GAN.

The way we formalize this is that this resulting vector should have _low entropy_ - that is, _not_ an even distribution over class labels. 

However, let's say that every time, the "most likely class" that the pre-trained network was predicting was a zero. We don't want this: we want the generator to generate diverse images.

The way we formalize this is that we want the vector of the frequency of the predictions to have _high entropy_: that is, class balance. So, if our generator generates 1000 MNIST images, we want 100 of them to be predicted to be 0s, 100 to be predicted to be 1s etc.

Inception score - named after the architecture used to make predictions with the generated images - is actually used to score the different models in the [Progressive GAN paper](http://research.nvidia.com/sites/default/files/publications/karras2017gan-paper.pdf) (see page 15).
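Both requirements can be combined into one number. The sketch below uses the usual formulation of the score, the exponential of the average KL divergence between each image's predicted class distribution and the overall class frequencies. The three-class predictions are made up for illustration; a real evaluation would use a pre-trained classifier's outputs.

```python
import math

def inception_score(predictions):
    """predictions[i] is the classifier's probability vector for generated image i."""
    n = len(predictions)
    num_classes = len(predictions[0])
    # Overall class frequencies across all generated images.
    marginal = [sum(p[k] for p in predictions) / n for k in range(num_classes)]
    total_kl = 0.0
    for p in predictions:
        total_kl += sum(pk * math.log(pk / mk)
                        for pk, mk in zip(p, marginal) if pk > 0)
    return math.exp(total_kl / n)

unsure = [[1 / 3, 1 / 3, 1 / 3]] * 3                                  # high entropy per image
collapsed = [[1.0, 0.0, 0.0]] * 3                                     # confident, always the same class
diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]         # confident and balanced

for name, preds in [("unsure", unsure), ("collapsed", collapsed), ("diverse", diverse)]:
    print(name, round(inception_score(preds), 4))
```

Only the generator that is both confident per image and balanced across classes gets a high score; the other two score the minimum of 1.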

## Wasserstein distance

The Progressive GAN paper proposes a pixel-based measure, built on the Wasserstein distance, that represents a totally different way of measuring how similar the generated images are to the real images.
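To get a feel for the distance itself: in one dimension, the Wasserstein distance between two equally sized samples is just the average gap after sorting both. For sets of vectors, a common trick is to project them onto random directions and average the 1-D distances (a "sliced" Wasserstein distance). The sketch below shows only this distance computation with hypothetical names and toy data; it leaves out any image preprocessing the paper performs.

```python
import math
import random

def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two equally sized 1-D samples."""
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

def sliced_wasserstein(vectors_a, vectors_b, num_projections=64, seed=0):
    """Average 1-D Wasserstein distance over random unit-length projections."""
    rng = random.Random(seed)
    dim = len(vectors_a[0])
    total = 0.0
    for _ in range(num_projections):
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(d * d for d in direction))
        direction = [d / norm for d in direction]
        proj_a = [sum(v * d for v, d in zip(vec, direction)) for vec in vectors_a]
        proj_b = [sum(v * d for v, d in zip(vec, direction)) for vec in vectors_b]
        total += wasserstein_1d(proj_a, proj_b)
    return total / num_projections

print(wasserstein_1d([0.0, 1.0, 2.0], [1.0, 2.0, 3.0]))   # 1.0

real = [[0.0, 0.0], [1.0, 1.0]]        # toy stand-ins for flattened image patches
fake = [[5.0, 5.0], [6.0, 6.0]]
print(sliced_wasserstein(real, real))  # 0.0
print(sliced_wasserstein(real, fake) > 0)  # True
```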

The innovation, however, is that the similarity measure used is not a normal similarity measure, but instead is based on examining the similarity between the 16x16 versions of the images, the 32x32 versions, and so on up to the final 1024x1024 versions. More specifically, many random 7x7 patches from all these different images (both the generated and real images) are sampled, and the similarity between all these 7x7 patches is calculated. As the authors of the paper put it:

"Intuitively a small Wasserstein distance indicates that the distribution of the patches is similar, meaning that the training images and generator samples appear similar in both appearance and variation at this spatial resolution. In particular, the distance between the patch sets extracted from the lowest-resolution 16 × 16 images indicate similarity in large-scale image structures, while the finest-level patches encode information about pixel-level attributes such as sharpness of edges and noise."