In [20]:
from IPython.display import HTML
style = """
<style>
.expo {
  line-height: 150%;
}

.visual {
  width: 600px;
}

.red {
  color: red;
  display:inline;
}

.blue {
  color: blue;
  display:inline;
}

.green {
  color: green;
  display:inline;
}

</style>
"""
HTML(style)

# GANs

## What are they, what makes them work, and what is their future.

**Seth Weidman, Boston Machine Learning Meetup**

February 1, 2018

# Agenda

* What are GANs
* What makes them work
* What are the latest and greatest cutting edge results
* What is their future

# What are GANs?

"GAN" stands for "<span class='blue'>Generative</span> <span class='green'>Adversarial</span> <span class='red'>Network</span>". 

They are a method of training <span class='red'> neural networks</span> to <span class='blue'>generate</span> images similar to those in the data the neural network is trained on. This training is done via an <span class='green'>adversarial</span> process.

### Basic example

![](img/mnist_gan_8s.png)

Not digits written by a human. Generated by a neural network.

### Cutting edge example, October 2017

![](img/progressive_gan_example.png)

Not real people: images generated by a neural network.

> "[GANs], and the variations that are now being proposed, are the most interesting idea in the last 10 years in ML [machine learning], in my opinion." 

-- Yann LeCun, Director of AI Research at Facebook, [in 2016 on Quora](https://www.quora.com/What-are-some-recent-and-potentially-upcoming-breakthroughs-in-deep-learning/answer/Yann-LeCun)

## What are neural networks?

We've all seen diagrams like this when trying to understand neural nets:

![](img/neural_network_diagram.png)

But what are they _really_?

_Mathematically_, neural nets are:

* Nested functions

* Universal function approximators

* Differentiable

So, if the layers are denoted as $l_1$, $l_2$, $l_3$, with weights $V$ and $W$, then the predictions they make can be written as:

$$ P = p(l_3(l_2(l_1(X, V)), W)) $$

And the loss can be written as:

$$ L = l(p(l_3(l_2(l_1(X, V)), W))) $$

**In other words, the loss, the prediction, and so on, are just some mathematical function of the weights $W$ and $V$ and the input $X$.**

What does differentiable mean? It means that if we have a loss $L$, we can compute:

$$ \frac{\partial L}{\partial W} $$

$$ \frac{\partial L}{\partial V} $$

Indeed, this is the information we need to update the weights so that we can "train" the neural network. 

**In addition**, differentiability means we can compute:

$$ \frac{\partial L}{\partial X} $$

In other words, how much the loss would change if the individual pixels of the _input_ changed.

(_This_ turns out to be the key fact that allows GANs to work)
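
To make the three derivatives concrete, here is a minimal sketch in plain Python. The tiny network (one `tanh` hidden unit, a sigmoid output and a squared-error loss) and all of the numbers are assumptions chosen purely for illustration; they are not taken from the talk's example code. It applies the chain rule by hand and checks $\frac{\partial L}{\partial X}$ against a finite-difference estimate.

```python
import math


def forward(x, v, w, y):
    """Tiny nested function: h = tanh(v . x), p = sigmoid(w * h), L = (p - y)^2."""
    a = sum(vi * xi for vi, xi in zip(v, x))
    h = math.tanh(a)
    p = 1.0 / (1.0 + math.exp(-w * h))
    loss = (p - y) ** 2
    return loss, h, p


def gradients(x, v, w, y):
    """Chain rule applied by hand: returns dL/dW, dL/dV and dL/dX."""
    _, h, p = forward(x, v, w, y)
    dL_ds = 2.0 * (p - y) * p * (1.0 - p)   # s = w * h is the sigmoid's input
    dL_dw = dL_ds * h
    dL_da = dL_ds * w * (1.0 - h ** 2)      # a = v . x is the tanh's input
    dL_dv = [dL_da * xi for xi in x]        # gradient with respect to the weights V
    dL_dx = [dL_da * vi for vi in v]        # gradient with respect to the input X
    return dL_dw, dL_dv, dL_dx


x = [0.5, -1.0, 2.0]   # the "pixels" of the input
v = [0.1, 0.4, -0.3]   # first-layer weights V
w = 0.7                # last-layer weight W
y = 1.0                # target

dL_dw, dL_dv, dL_dx = gradients(x, v, w, y)

# Check dL/dX numerically: nudge each input pixel and watch the loss change.
eps = 1e-6
base_loss = forward(x, v, w, y)[0]
numeric_dL_dx = []
for i in range(len(x)):
    bumped = list(x)
    bumped[i] += eps
    numeric_dL_dx.append((forward(bumped, v, w, y)[0] - base_loss) / eps)

print("dL/dX (chain rule):       ", dL_dx)
print("dL/dX (finite difference):", numeric_dL_dx)
```

The two printed vectors should agree to several decimal places, which is all that "differentiable with respect to the input" means in practice.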


![](img/ian_goodfellow_beer.png)

In 2013, Ian Goodfellow (inventor of GANs, then a grad student at the University of Montreal) and Yoshua Bengio (one of the leading researchers on neural networks in the world) were about to run a speech synthesis contest.

Their idea was to have a "discriminator network" that would listen to artificially generated speech and decide whether it was real or not.

They decided not to run the contest, concluding that people would just game the system by generating examples that fooled _this particular_ discriminator network, rather than trying to produce _generally_ good speech.

Then, one night in a bar, Ian Goodfellow asked the question: **can this be fixed by having the _discriminator network_ learn as well**?

## What he came up with

### Part 1

First: randomly generate a feature vector; feed the feature vector through a randomly initialized neural network to produce an output image.

$$ \begin{bmatrix}z_1 \\
                  z_2 \\
                  \vdots \\
                  z_{100}
                  \end{bmatrix} $$

![](img/gan_1.png)

Let's denote the matrix of pixels in this image $X$.

Then, feed this image (matrix of pixels $X$) into a second network and get a prediction:

![](img/gan_2.png)

Compare this prediction with the true label (this image is fake) to get a loss $L$, and use this loss to train this second network, called the "**discriminator**".

Critically, also compute $\frac{\partial L}{\partial X}$: how much each of the _pixels generated_ affects the loss.

Then, update the first network, called the **generator**, by backpropagating $-\frac{\partial L}{\partial X}$ through it to its weights. The sign is negative because we want the generator to continually make the discriminator _more_ likely to say that the images it is generating are real.

![](img/gan_3.png)

Finally, generate a _new_ random noise vector $Z$, and repeat the process, so that the generator will learn to turn _any_ random noise vector into an image that the discriminator thinks is real.

### What's missing?

This will train the generator to generate good fake images, but it will likely result in the discriminator not being a very smart classifier since we only gave it one of the two classes it is trying to classify - that is, only fake images, and no real images. 

So, we'll have to give it real images as well.

### Part 2

![](img/gans_4.png)

Quote from the original paper on GANs:

> "The generative model can be thought of as analogous to a team of counterfeiters, trying to produce fake currency and use it without detection, while the discriminative model is analogous to the police, trying to detect the counterfeit currency. Competition in this game drives both teams to improve their methods until the counterfeits are indistinguishable from the genuine articles." 

-Goodfellow et al., "Generative Adversarial Networks" (2014)
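
Parts 1 and 2 can be sketched end to end on a toy problem. The following is a minimal sketch, not the code from the talk or the paper: it assumes "images" are single numbers drawn from around 4, a linear generator `x = a * z + b`, a logistic discriminator, and hand-derived gradients. All names and hyperparameters are hypothetical.

```python
import math
import random


def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def generate(gen, z):
    """Generator: turns a noise value z into a fake sample x = a * z + b."""
    return gen["a"] * z + gen["b"]


def discriminate(disc, x):
    """Discriminator: estimated probability that x is real."""
    return sigmoid(disc["w"] * x + disc["c"])


def discriminator_step(disc, x_real, x_fake, lr):
    """Gradient descent on L = -log D(x_real) - log(1 - D(x_fake))."""
    d_real = discriminate(disc, x_real)
    d_fake = discriminate(disc, x_fake)
    grad_w = (d_real - 1.0) * x_real + d_fake * x_fake
    grad_c = (d_real - 1.0) + d_fake
    disc["w"] -= lr * grad_w
    disc["c"] -= lr * grad_c


def generator_step(gen, disc, z, lr):
    """Update the generator with -dL/dX, where L = -log(1 - D(x_fake))
    is the discriminator's loss on the fake sample."""
    x_fake = generate(gen, z)
    dL_dx = discriminate(disc, x_fake) * disc["w"]
    # Descending along -dL/dX means stepping *up* the discriminator's loss,
    # passed back to the generator's weights via dX/da = z and dX/db = 1.
    gen["a"] += lr * dL_dx * z
    gen["b"] += lr * dL_dx


rng = random.Random(0)
gen = {"a": 1.0, "b": 0.0}
disc = {"w": 0.1, "c": 0.0}
for step in range(500):
    x_real = rng.gauss(4.0, 0.5)                  # a "real" example
    x_fake = generate(gen, rng.gauss(0.0, 1.0))   # a fake example from fresh noise
    discriminator_step(disc, x_real, x_fake, lr=0.01)
    generator_step(gen, disc, rng.gauss(0.0, 1.0), lr=0.01)

print("generator parameters after training:", gen)
```

The structure is the point here rather than the result: the discriminator sees both real and fake samples, and the generator only ever learns through the discriminator's gradient with respect to its output.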

Cool aside: this is the [Original GitHub repo with Ian Goodfellow's code](https://github.com/goodfeli/galatea/commit/d960968919b0856ba6753198a0e035228d7c03e6) that he used to generate MNIST digits.

# Let's code one up

Let's check one out!

## What makes GANs work?

There are a lot of tricks that make GANs work. I'm going to discuss two of the most fundamental ones in detail:

* Deep Convolutional-Deconvolutional Architecture

* Batch Normalization

## Convolutions

Let's first cover the Deep Convolutional architecture. We'll review:

* What convolutions are
* What deconvolutions are

## Convolutions

We've all seen diagrams like this in the context of convolutional neural nets:

![](img/AlexNet_0.jpg)

This is the famous [AlexNet](https://en.wikipedia.org/wiki/AlexNet) architecture.

What's really going on here?

Let's say we have an input layer of size $224 \times 224 \times 3$, as we do in the ImageNet dataset that AlexNet was trained on. This next layer seems to be $96$ deep. What does that mean?

## Review of convolutions

"_Filters_" are slid over images using the convolution operation. 

In theory, these filters can act as _feature detectors_, and the images that result from convolving these filters with the image can be thought of as versions of the original image where the detected features have been "highlighted."

See the visual [here](http://cs231n.github.io/convolutional-networks/).

In practice, the neural network _learns_ filters that are useful to solving the particular problem it has been given.

We can then _visualize_ these filters once the network is done training to see the features it has learned.
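
As a small illustration of a filter acting as a feature detector, here is a sketch in plain Python. The hand-written vertical-edge filter and the tiny square images are assumptions for the example; a real network would learn its filters. As in deep learning libraries, the filter is not flipped before sliding.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over a square `image` (no padding, stride 1).
    At each location, multiply the overlapping values and add them up."""
    k = len(kernel)
    out_size = len(image) - k + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(k) for b in range(k))
            for j in range(out_size)
        ]
        for i in range(out_size)
    ]


# Responds when values increase from left to right, i.e. at a vertical edge.
vertical_edge_filter = [[-1, 0, 1],
                        [-1, 0, 1],
                        [-1, 0, 1]]

image_with_edge = [[0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [0, 0, 1, 1]]
flat_image = [[1, 1, 1, 1] for _ in range(4)]

edge_response = convolve2d(image_with_edge, vertical_edge_filter)
flat_response = convolve2d(flat_image, vertical_edge_filter)

print(edge_response)  # [[3, 3], [3, 3]]: the edge is "highlighted"
print(flat_response)  # [[0, 0], [0, 0]]: nothing to detect
```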

Let's return to the concrete example of the AlexNet architecture:

For each of 96 _filters_, the following happens:

For each of the 3 _input channels_, one slice of the _filter_, which happens to be of dimension $11 \times 11$ in this case, is slid over the image, "detecting the presence of different features" at each location. The three per-channel results are then added together, producing one output image per filter, which is why the next layer is $96$ deep.

So, there are actually a total of 96 * 3 convolution operations that take place, and 96 filters, each of which has a red, green, and blue component.

We can combine the red, green, and blue components of each filter and visualize them as if they were a mini $11 \times 11$ image:

### The 96 AlexNet filters:

![](img/AlexNet_filt1.png)

## Convolutional-Deconvolutional Architecture 

These convolution operations change the size of the input image.

In order to build deep convolutional architectures, we have to understand how both convolution and deconvolution operations affect image size.

The way convolutions affect image size is a function of:

1. Their filter size
2. Their "stride" - how much we move the filter by as we convolve it with the image  
3. How much "padding" or space around the image we use

[The Theano documentation](http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html) has a very in-depth look at convolutions.

Padding is particularly important; here are two pieces of terminology that TensorFlow and other libraries use to describe padding in convolutions:

**"SAME" padding:** pad the image so that, with a stride of 1, the output image size is equal to the input image size. Here's an example with:

* Input image size 5x5
* Filter size 3x3 (and thus padding 1)
* Stride 1
* Output image size 5x5

In [13]:
HTML('<img src="./img/same_padding_no_strides.gif">')

**"VALID" padding:** use no padding, so that the output image size will be smaller than the input image. Here's an example with:

* Input image size 4x4
* Filter size 3x3
* Stride 1
* Output image size 2x2

In [14]:
HTML('<img src="./img/no_padding_no_strides.gif">')

While there are formulas relating the filter size, input size, stride, padding and so on, it is often better to just reason your way through what is going on to figure out what the output size should be.
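
For reference, the formulas can be written down in a few lines. This is a sketch of the output-size rule as I understand TensorFlow's "SAME" and "VALID" conventions; the function name `conv_output_size` is made up for this example and is not a library call.

```python
import math


def conv_output_size(input_size, filter_size, stride, padding):
    """Output width (or height) of a convolution along one dimension."""
    if padding == "same":
        # Enough padding is added that only the stride shrinks the image.
        return math.ceil(input_size / stride)
    if padding == "valid":
        # No padding: the filter must fit entirely inside the image.
        return math.ceil((input_size - filter_size + 1) / stride)
    raise ValueError("padding must be 'same' or 'valid'")


print(conv_output_size(5, 3, 1, "same"))   # 5, the "SAME" example above
print(conv_output_size(4, 3, 1, "valid"))  # 2, the "VALID" example above
```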

## What does this have to do with GANs?

Doing these convolutions is relatively straightforward - but how do we do **_de_**-convolutions?

Reasoning about convolutions, in the way just presented, is straightforward, but reasoning about deconvolutions requires thinking about convolutions in a different way, namely as **matrix multiplications where almost all the values are zeros**.

[Back to the Theano documentation](http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html#convolution-as-a-matrix-operation).

So, if we represent convolution operations as a matrix multiplication, we can represent the corresponding **_de_**-convolution operation as a **_transposed_** matrix multiplication.

So, when reasoning about deconvolution operations - when deciding what stride and padding to use, starting from some image size A - you should think **"What starting image B, using this stride and padding would transform B into A?"**
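
Here is a minimal sketch of that idea in plain Python, using the same setting as the Theano documentation (a 3x3 filter over a 4x4 image with no padding and stride 1). The helper names and the filter values are hypothetical. Multiplying by the matrix `C` takes a 4x4 image to a 2x2 one; multiplying by its transpose takes a 2x2 image back up to 4x4.

```python
def conv_matrix(kernel, in_size):
    """Build the matrix C such that C times the flattened image equals a
    no-padding, stride-1 convolution of a square image with a square kernel."""
    k = len(kernel)
    out_size = in_size - k + 1
    rows = []
    for i in range(out_size):
        for j in range(out_size):
            row = [0.0] * (in_size * in_size)
            for a in range(k):
                for b in range(k):
                    row[(i + a) * in_size + (j + b)] = kernel[a][b]
            rows.append(row)
    return rows


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


kernel = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0],
          [7.0, 8.0, 9.0]]
C = conv_matrix(kernel, in_size=4)          # 4 rows x 16 columns
image = [float(value) for value in range(16)]   # a flattened 4x4 image

small = matvec(C, image)                    # convolution: 16 values -> 4 (2x2)
big = matvec(transpose(C), small)           # transposed: 4 values -> 16 (4x4)

zeros = sum(1 for row in C for value in row if value == 0.0)
print(len(small), len(big))                 # 4 16
print(zeros, "of", len(C) * len(C[0]), "entries of C are zero")
```

In this tiny case 28 of the 64 entries are zero; the fraction of zeros grows quickly as the image gets larger relative to the filter. Note that the transposed multiplication restores the *shape* of the original image, not its values.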

Let's do a concrete example, starting with a $4 \times 4$ image and doing a deconvolution operation to it.

In [2]:
import tensorflow as tf
from deconv import tf_deconv

In [4]:
def generator(z, out_channel_dim, padding):
    """
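    Create a one-layer generator network to show how padding affects deconvolution
    :param z: Input z
    :param out_channel_dim: The number of channels in the output image (unused here)
    :param padding: Padding mode for the transposed convolution, "same" or "valid"
    :return: The tensor output of the generator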
    """

    # First fully connected layer
    x1 = tf.layers.dense(z, 4*4*256)

    # Reshape it to start the convolutional stack
    x1 = tf.reshape(x1, (-1, 4, 4, 256))
    
    print("Input shape", x1.shape[1:3])
    # Perform an inverse convolutional operation
    x2 = tf.layers.conv2d_transpose(x1, 128, 4, strides=1, padding=padding)
    
    return x2

In [5]:
tf_deconv(generator, "same")

Input shape (4, 4)
Output shape:  (4, 4)


In [6]:
tf_deconv(generator, "valid")

Input shape (4, 4)
Output shape:  (7, 7)


Question:

**Why does "same" padding result in an output shape of 4x4 whereas "valid" padding results in an output shape of 7x7?**

Think about this: if we wanted our _output_ to be 4x4, and we were using "same" padding, then the _input_ shape would be 4x4 as well.

## Illustration - same padding

<img src="./img/convolution_same.png" class="visual">

<img src="./img/deconvolution_same.png" class="visual" >

## Illustration - valid padding

<img src="./img/convolution_valid.png" class="visual">

<img src="./img/deconvolution_valid.png" class="visual">

However, if we wanted our _output_ to be 4x4 using "valid" padding (in other words, no padding) and a 4x4 filter, then we would need an input shape of 7x7 to achieve this.

Here's a concrete example with valid (no) padding transforming a 7x7 image into a 5x5 image using a 3x3 filter.

In [10]:
HTML('<img src="./img/full_padding_no_strides_transposed.gif">')

To see an example of a few deconvolution operations that take a 4x4 image up to a 28x28, see the `generator` architecture in the `GAN_example` folder.

In [7]:
def generator(z, out_channel_dim, is_train=True):
    """
    Create the generator network
    :param z: Input z
    :param out_channel_dim: The number of channels in the output image
    :param is_train: Boolean if generator is being used for training
    :return: The tensor output of the generator
    """

    with tf.variable_scope('generator', reuse=not is_train):
        # First fully connected layer
        x1 = tf.layers.dense(z, 4*4*512)
        # Reshape it to start the convolutional stack: 4x4x512
        x1 = tf.reshape(x1, (-1, 4, 4, 512))
        x1 = tf.layers.batch_normalization(x1, training=is_train)
        # Leaky ReLU with slope 0.2
        x1 = tf.maximum(0.2 * x1, x1)

        # Transposed convolution: 4x4x512 -> 7x7x256
        x2 = tf.layers.conv2d_transpose(x1, 256, 4, strides=1, padding='valid')
        x2 = tf.layers.batch_normalization(x2, training=is_train)
        x2 = tf.maximum(0.2 * x2, x2)

        # Transposed convolution: 7x7x256 -> 14x14x128
        x3 = tf.layers.conv2d_transpose(x2, 128, 5, strides=2, padding='same')
        x3 = tf.layers.batch_normalization(x3, training=is_train)
        x3 = tf.maximum(0.2 * x3, x3)

        # Transposed convolution: 14x14x128 -> 28x28xout_channel_dim
        logits = tf.layers.conv2d_transpose(x3, out_channel_dim, 5, strides=2, padding='same')

        out = tf.tanh(logits)
    
    return out
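
To check how this generator gets from 4x4 to 28x28, the image sizes can be traced layer by layer. This is a sketch of the output-size rule for transposed convolutions as I understand the "same" and "valid" conventions; `deconv_output_size` is a hypothetical helper, not a TensorFlow function.

```python
def deconv_output_size(input_size, filter_size, stride, padding):
    """Output width (or height) of a transposed convolution along one dimension."""
    if padding == "same":
        return input_size * stride
    if padding == "valid":
        return input_size * stride + max(filter_size - stride, 0)
    raise ValueError("padding must be 'same' or 'valid'")


# (filter size, stride, padding) of the three conv2d_transpose layers above
layers = [(4, 1, "valid"), (5, 2, "same"), (5, 2, "same")]

size = 4
sizes = [size]
for filter_size, stride, padding in layers:
    size = deconv_output_size(size, filter_size, stride, padding)
    sizes.append(size)

print(sizes)  # [4, 7, 14, 28]
```

The same rule reproduces the earlier experiment: a 4x4 filter with stride 1 gives 4x4 with "same" padding and 7x7 with "valid" padding.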

## Batch Normalization

Batch normalization is one of the most powerful and simple tricks to come along in the history of the training of deep neural networks.

<img src="./img/deep_neural_network.png" class="visual">

We know that normalizing the input to a neural network helps with training: the network doesn't have to "learn" that one feature is on a scale from 0-1000 and another is on a scale from 0-1 and change its weights accordingly, for example.

The same thing applies further down in the network:

<img src="./img/neural_network_weights_hidden.png" class="visual">

Intuitively, batch normalization works for the same reasons that normalizing data before feeding it into a neural network works.

**How is it actually done?**

When passing data through a neural network, we do so in batches - say, 64 or 128 images at a time.

Thus, at every step of the neural network, each neuron has a value _for each observation that is being passed through_.

We normalize _across these observations_, so that _for each batch_, each neuron will have a mean 0 and standard deviation 1. Specifically, we replace the value of the neuron $N$ with:

$$N' = \frac{N - \mu}{\sigma}$$

**Can anyone think of an issue with this?**

Hint:

![](img/one_filter.png)

For convolutional networks, the "neurons" are pixels in output images that have been convolved with a filter. These images are important - they contain spatial information about what is present in the images. If we modify pixels in these images by different amounts, this spatial information could get modified. 

So, instead of calculating means and standard deviations for each _neuron_ in each batch, we calculate means and standard deviations for each _filter_ map in each batch, so that **in a given filter map, each pixel will be modified by the same amount**.
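
A minimal sketch of per-filter-map normalization in plain Python follows. The nested-list layout `batch[image][row][column][channel]`, the function name, and the small constant `eps` (added for numerical safety, as real implementations do) are assumptions for this example.

```python
import math


def batch_norm_per_map(batch, eps=1e-5):
    """Normalize batch[n][h][w][c] using ONE mean and standard deviation per
    channel c, computed over every image and every pixel position."""
    n_channels = len(batch[0][0][0])
    stats = []
    for c in range(n_channels):
        values = [pixel[c] for image in batch for row in image for pixel in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        stats.append((mean, math.sqrt(var + eps)))
    normalized = [
        [
            [
                [(pixel[c] - stats[c][0]) / stats[c][1] for c in range(n_channels)]
                for pixel in row
            ]
            for row in image
        ]
        for image in batch
    ]
    return normalized, stats


# Two 2x2 "images" with two filter maps each; map 1 is on a 10x larger scale.
batch = [
    [[[1.0, 10.0], [2.0, 20.0]],
     [[3.0, 30.0], [4.0, 40.0]]],
    [[[5.0, 50.0], [6.0, 60.0]],
     [[7.0, 70.0], [8.0, 80.0]]],
]

normalized, stats = batch_norm_per_map(batch)
print("per-map (mean, std):", stats)
print("first image, normalized:", normalized[0])
```

Because every pixel in a map is shifted and scaled by the same two numbers, the relative differences between pixels, and therefore the spatial pattern, are preserved.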

## There's more

We don't stop there. We further modify $N'$ to be defined as:

$$ \gamma * N' + \beta $$

We initialize $\gamma$ to 1 and $\beta$ to 0. And then these become parameters that are learned along with all the others in the course of the network training. 

Question: why does this work? Why do we normalize _and then_ apply these parameters?

Let's suppose that the mean of a given layer of features is significant to determining the behavior of the following layer - you can either think of the mean of a hidden layer of neurons, or the mean value across a filter in a convolutional layer. Without normalizing and then applying $\gamma$ and $\beta$, the network will have to learn the mean of this layer by adjusting individual weights. 

By applying these transformations, however, the network can simply learn one parameter $\beta$ that determines the mean of the layer.
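
A small numeric sketch of this point, in plain Python: after normalization the mean is 0, so after scaling and shifting the mean is exactly $\beta$ and the standard deviation is approximately $\gamma$. The activation values and the "learned" values of `gamma` and `beta` are made up for illustration.

```python
import math


def normalize(values, eps=1e-5):
    """Subtract the batch mean and divide by the batch standard deviation."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def scale_and_shift(normalized, gamma=1.0, beta=0.0):
    """The extra step: gamma * N' + beta."""
    return [gamma * n + beta for n in normalized]


activations = [2.0, 4.0, 6.0, 8.0]   # one neuron's values across a batch of 4
normalized = normalize(activations)

at_init = scale_and_shift(normalized)                        # gamma=1, beta=0
learned = scale_and_shift(normalized, gamma=3.0, beta=5.0)   # hypothetical learned values

mean_learned = sum(learned) / len(learned)
std_learned = math.sqrt(sum((v - mean_learned) ** 2 for v in learned) / len(learned))
print("mean:", mean_learned, "std:", std_learned)   # close to 5 and 3
```

At initialization the extra step changes nothing; after training, a single number controls the layer's mean regardless of what the weights below are doing.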

Section 8.7 of the [Goodfellow et al. Deep Learning textbook](http://www.deeplearningbook.org) explains this well:

> ...the new parametrization can represent the same family of functions of the input as the old parametrization, but the new parametrization has different learning dynamics. In the old parametrization, the mean of [the layer] was determined by a complicated interaction between the parameters in the layers below. In the new parametrization, the mean of [the layer] is determined solely by $\beta$. The new parametrization is much easier to learn with gradient descent.

## Cool Application of GANs: 

# "Pose Guided Person Image Generation"

[NIPS 2017 paper](https://papers.nips.cc/paper/6644-pose-guided-person-image-generation.pdf)

Based on the [DeepFashion Dataset](http://mmlab.ie.cuhk.edu.hk/projects/DeepFashion.html)

<img src="./img/deep_fashion.png" class="visual">

> "It contains over 800,000 images, which are richly annotated with massive attributes, clothing landmarks, and correspondence of images taken under different scenarios including store, street snapshot, and consumer. Such rich annotations enable the development of powerful algorithms in clothes recognition and facilitating future researches."

![](img/deep_fashion_clothing_locations.png)

![](img/pose_generation_1.png)

![](img/pose_generation_2.png)

## One of the most important applications of GANs

# Semi-Supervised Learning

![](img/semi-supervised_gans.png)

Semi-supervised learning is a third type of machine learning, in addition to supervised learning and unsupervised learning.

At a high level:

* The goal of supervised learning is to learn from _labeled_ data.
* The goal of unsupervised learning is to learn from _unlabeled_ data.

Semi-supervised learning asks the question: can you learn from a _combination_ of both labeled and unlabeled data? 

With GANs, the answer turns out to be yes!

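How does it work? The basic idea is:

In a standard GAN, the discriminator outputs a single probability that an image is real. Here, we start instead from a discriminator that, like an ordinary classifier, outputs the probability of an image being one of ten classes: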

![](img/ssl_discriminator_1.png)

This is compared with the real values, turned into a loss vector, and backpropagated through the network to train it.

With semi-supervised learning, we simply add another class to this output:

![](img/ssl_discriminator_2.png)

Then, data points are fed through as before:

* _Real_, _labeled_ examples are given labels simply of 0 for all the digits they are not, 1 for the digit they are, and 0 for $P(real)$

* _Fake_ examples generated by the generator are given labels of 0 across the board, including for $P(real)$.

* _Real_, _un_-labeled examples are given labels of 0 for all the classes and 1 for the probability of the image being real.

This allows the discriminator to learn from real, labeled examples as well as from both fake examples and real, unlabeled examples!

In practice, fake examples are used more often than real, unlabeled examples.

In this framework, how do we train the generator? Another innovation that made this semi-supervised learning approach work was a unique way of training the generator called "feature matching".

# Feature matching

Feature matching, a technique for training GANs, was proposed in the same paper that proposed using GANs for Semi-Supervised Learning: [Improved Techniques for Training GANs](https://papers.nips.cc/paper/6125-improved-techniques-for-training-gans.pdf), by Salimans et al. from OpenAI.

## Idea

The last layer of a convolutional neural network, before the values get fed through a fully connected layer, is typically a layer with many features that have been detected.

![](img/gan_layer.png)

For example, in the convolutional architecture used in the example GAN, the last layer is $2 \times 2 \times 128$ - the result of 128 "features of features of features" that the network has learned.

This is then "flattened" to a single layer of $2 * 2 * 128 = 512$ neurons, and these 512 neurons are then fed through a fully connected layer to produce an output of length 10.

<img src="./img/last_layer_fc.png" class="visual">

Their idea was to train the generator, not simply by using the discriminator's prediction of whether the image was real or fake, but on **how similar this 512-dimensional vector was between _real_ images fed through the discriminator compared to _fake_ images fed through the discriminator.**

The delta between these two sets of _features_ is the loss used to train the generator.
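
One way to turn this "delta" into a number, sketched in plain Python: average each feature over the batch of real images and over the batch of fake images, then take the squared distance between the two averages. This follows my reading of the feature matching objective in the paper; the function names are hypothetical and 3-dimensional features stand in for the 512-dimensional ones to keep the example readable.

```python
def mean_features(feature_batch):
    """Average each feature over a batch of feature vectors."""
    n = len(feature_batch)
    return [sum(column) / n for column in zip(*feature_batch)]


def feature_matching_loss(real_features, fake_features):
    """Squared distance between the batch-averaged discriminator features
    of real images and of generated images."""
    real_mean = mean_features(real_features)
    fake_mean = mean_features(fake_features)
    return sum((r - f) ** 2 for r, f in zip(real_mean, fake_mean))


# Discriminator features for two real and two fake images (made-up numbers).
real = [[1.0, 0.0, 2.0],
        [3.0, 0.0, 2.0]]   # batch mean: [2, 0, 2]
fake = [[0.0, 1.0, 2.0],
        [2.0, 1.0, 2.0]]   # batch mean: [1, 1, 2]

loss = feature_matching_loss(real, fake)
print(loss)  # 2.0
```

The generator is then trained to make this loss small, so that its images excite the discriminator's features, on average, the way real images do.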

Aside: why does this work? Even the authors of the paper don't fully understand it:

> "This approach introduces an interaction [between the discriminator and the generator] that we do not fully understand yet, but empirically we find that optimizing G using feature matching GAN works very well for semi-supervised learning, while training G using GAN with minibatch discrimination does not work at all. Here we present our empirical results using this approach; developing a full theoretical understanding of the interaction between D and G using this approach is left for future work."

Nevertheless, feature matching was the trick that led to breakthrough performance using semi-supervised learning to build powerful classifiers: 

Salimans et al. from OpenAI in mid-2016 used this approach to get just under a **6%** error rate on the [Street View House Numbers dataset](http://ufldl.stanford.edu/housenumbers/) _with just 1,000 labeled images_. Prior approaches achieved roughly **16%** error.

State-of-the-art error, using the entire dataset of roughly 600,000 images, simply using supervised learning with very deep convolutional networks, is roughly **2%**. 

Semi-supervised learning is perhaps the most important _application_ of GANs - what is the cutting edge of building GANs themselves?

# Progressive GANs

People have been trying to improve the resolution of the images GANs generate since their invention. Progressive GANs, published by a few folks at NVIDIA Research in October 2017, are a huge step forward in doing so:

In [22]:
HTML('<img src="./img/progressive_gan.gif">')

[Here](http://research.nvidia.com/sites/default/files/publications/karras2017gan-paper.pdf) is the Progressive GAN paper, describing how they generated high quality 1024x1024 images mimicking those from the CelebA dataset. The findings even made the New York Times!

<img src="./img/progressive_gans_nyt.png" class="visual">

What is the main idea behind Progressive GANs?

1. Begin by downsampling the images to be simply _4x4_.
2. Train a GAN to generate "high quality" 4x4 images.
3. Then, keeping the weights already learned in the initial layers, add a layer at the end of the generator and at the start of the discriminator so that this GAN now generates _8x8_ images, and so on.

![](img/progressive_gans_technique.png)

In addition, when these larger layers are initially added, there's a "grace period" where the generated images are still _mostly_ a function of the weights of the layers that have already been trained. 

![](img/progressive_gans_grace.png)
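
One way to implement such a grace period is to blend the upsampled output of the already-trained layers with the output of the new layer, using a weight that moves from 0 to 1 during training. The following is a sketch under that assumption; the names `upsample_2x`, `fade_in` and `alpha` are hypothetical and the pixel values are made up.

```python
def upsample_2x(image):
    """Nearest-neighbour upsampling: every pixel becomes a 2x2 block."""
    out = []
    for row in image:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def fade_in(old_low_res, new_high_res, alpha):
    """Blend the old output (upsampled) with the new layer's output.
    alpha = 0: only the already-trained layers matter.
    alpha = 1: only the newly added layer matters."""
    upsampled = upsample_2x(old_low_res)
    return [
        [(1.0 - alpha) * u + alpha * n for u, n in zip(u_row, n_row)]
        for u_row, n_row in zip(upsampled, new_high_res)
    ]


old_4x4 = [[float(i)] * 4 for i in range(4)]   # output of the trained 4x4 GAN
new_8x8 = [[10.0] * 8 for _ in range(8)]       # output of the freshly added 8x8 layer

start = fade_in(old_4x4, new_8x8, alpha=0.0)   # beginning of the grace period
halfway = fade_in(old_4x4, new_8x8, alpha=0.5)
end = fade_in(old_4x4, new_8x8, alpha=1.0)     # grace period over

print(len(start), "x", len(start[0]))  # 8 x 8
```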

## An aside: how do we score GANs?

How do we know that these samples are "good"? They "look good", but how can we quantify this?

"Generative Adversarial Networks are generally regarded as producing the best samples [compared to other generative methods such as variational autoencoders] but there is no good way to quantify this."

--Ian Goodfellow, NIPS tutorial 2016

Since then, a couple of methods have been proposed:

### Inception score

A clever method for scoring GANs, developed by Tim Salimans at OpenAI, illustrates some properties that we want GANs to have.

Consider a GAN that was intended to generate images that come from one of a finite number of classes, such as MNIST digits.

### Inception score (cont.)

Let's say that the generator generated an image, that this image was then fed through a pre-trained neural network, and that the resulting probability distribution over classes was:

`[0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]`

In other words, the pre-trained model has no idea which class this image should belong to. In this case, we conclude that all else equal, this likely isn't a very good generator.

The way we formalize this is that this resulting vector should have _low entropy_ - that is, _not_ an even distribution over class labels. 

### Inception score (cont.)

There is another way we can use this pre-trained neural network. Let's say that for every image generated, we recorded the "most likely class" that the pre-trained network was predicting. And let's say that 91% of the time, the pre-trained network was classifying the images that our model was generating as zeros, so that the vector of frequencies of the "most likely class" looked like:

`[0.91, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]`

A generator that produces almost nothing but zeros is not capturing the variety in the data. The way we formalize this is that we want the vector of the frequency of the predictions to have _high entropy_: that is, we _do_ want the classes to be balanced.
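
Both requirements can be computed directly. The sketch below, in plain Python, first evaluates the entropy of the two example vectors above, then combines the two ideas into a single number in the way I understand the Inception score to be defined: the exponential of the average KL divergence between each image's class distribution and the overall class distribution. The three toy "generators" are made up for illustration.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0)


def inception_score(predictions):
    """exp of the average KL divergence between each image's class
    distribution p(y|x) and the overall class distribution p(y)."""
    n = len(predictions)
    marginal = [sum(column) / n for column in zip(*predictions)]
    kl_total = 0.0
    for p in predictions:
        kl_total += sum(q * math.log(q / m) for q, m in zip(p, marginal) if q > 0)
    return math.exp(kl_total / n)


def one_hot(k, n_classes=10):
    return [1.0 if i == k else 0.0 for i in range(n_classes)]


uniform = [0.1] * 10
mostly_zeros = [0.91] + [0.01] * 9
print(entropy(uniform))        # about 2.30: bad for a single image, good for class frequencies
print(entropy(mostly_zeros))   # about 0.50: good for a single image, bad for class frequencies

confident_and_varied = [one_hot(k) for k in range(10)]      # what we want
confident_but_collapsed = [one_hot(0) for _ in range(10)]   # only ever draws zeros
unsure = [uniform for _ in range(10)]                       # classifier can't tell

score_varied = inception_score(confident_and_varied)
score_collapsed = inception_score(confident_but_collapsed)
score_unsure = inception_score(unsure)
print(score_varied, score_collapsed, score_unsure)
```

With ten classes the score ranges from 1 (either failure mode) to 10 (every image is confidently classified and the classes are balanced).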

"Inception" simply refers to the neural network architecture used to score these generated images.

Progressive GANs did indeed show a record Inception score on the CIFAR-10 dataset:

![](img/progressive_gans_inception.png)

However, we can't do this on the CelebA dataset: there are no classes!

### Patch similarity

The authors propose a new way of assessing their GANs to identify improvement:

They randomly sample 7x7 patches from the 16x16 versions of the images, the 32x32 versions, etc., up to the 1024x1024 version. They then use a metric called the "sliced Wasserstein distance" to measure the similarity between generated patches and the corresponding real patches.

> "...the distance between the patch sets extracted from the lowest-resolution 16 × 16 images indicate similarity in large-scale image structures, while the finest-level patches encode information about pixel-level attributes such as sharpness of edges and noise."
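
To give a feel for this kind of metric, here is a rough sketch in plain Python of a sliced Wasserstein distance between two sets of flattened 7x7 patches: project both sets onto random directions, and for each direction compare the sorted projections. This is a simplified illustration under my own assumptions (random patches, 32 projections, equal-sized sets), not the authors' implementation.

```python
import math
import random


def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two equal-sized sets of numbers:
    sort both and average the gaps between matching ranks."""
    assert len(a) == len(b)
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def sliced_wasserstein(set_a, set_b, n_projections=32, seed=0):
    """Average 1-D Wasserstein distance over random unit directions."""
    rng = random.Random(seed)
    dim = len(set_a[0])
    total = 0.0
    for _ in range(n_projections):
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(d * d for d in direction))
        direction = [d / norm for d in direction]
        proj_a = [sum(p * d for p, d in zip(patch, direction)) for patch in set_a]
        proj_b = [sum(p * d for p, d in zip(patch, direction)) for patch in set_b]
        total += wasserstein_1d(proj_a, proj_b)
    return total / n_projections


data_rng = random.Random(1)
# 50 flattened 7x7 patches standing in for patches from real images
real_patches = [[data_rng.random() for _ in range(49)] for _ in range(50)]
# "Generated" patches that are systematically too bright
shifted_patches = [[value + 0.5 for value in patch] for patch in real_patches]

print(sliced_wasserstein(real_patches, real_patches))      # 0.0
print(sliced_wasserstein(real_patches, shifted_patches))   # greater than 0
```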

# The future

What is the future of GANs? More generally, what is the future of Deep Learning? Can we predict it?

I asked Ian Goodfellow in a LinkedIn message if he was surprised by how quickly Progressive GANs were able to get close to photorealistic image quality on 1024x1024 images. He replied:

> I'm actually surprised at how slow it's been. Back in 2015 I thought that getting to photorealistic video was mostly going to be an engineering effort of scaling the model up and training on more data.

-Ian Goodfellow, in a LinkedIn message to me

## What is the future of GANs?

![](img/question_mark.png)

Nobody knows!

**Thanks!**

![](img/professional_headshot.png)
(real photo, not generated)

[Website](https://www.sethweidman.com) | [Medium](https://medium.com/@sethweidman) | [GitHub](https://github.com/sethHWeidman/) | [Twitter](https://twitter.com/SethHWeidman) | [LinkedIn](https://www.linkedin.com/in/sethhweidman/)

seth@sethweidman.com if you have any questions.