In [1]:
from IPython.display import HTML
style = """
<style>
.expo {
  line-height: 150%;
}

.visual {
  width: 600px;
}

.red {
  color: red;
  display:inline;
}

.blue {
  color: blue;
  display:inline;
}

.green {
  color: green;
  display:inline;
}

</style>
"""
HTML(style)

# GANs

## What are they, what makes them work, and what is their future.

**Seth Weidman, ODSC West 2018**

November 2, 2018

# Agenda

1. Quick Neural Networks Review

2. GANs' origin story

# Agenda (continued)

3. GAN extensions and applications
    * Conditional GANs
    * DCGAN
    * Semi-Supervised Learning
    * Scoring GANs

4. Future of GANs (and of Deep Learning in general)

# What are GANs?

You may not know what a GAN is: when the conversation turns to GANs, you may feel like [Homer Simpson](https://www.youtube.com/watch?v=PGLzm-Gy0dQ) does in this clip.

"GAN" stands for "<span class='blue'>Generative</span> <span class='green'>Adversarial</span> <span class='red'>Network</span>". 

They are a method of training <span class='red'> neural networks</span> to <span class='blue'>generate</span> images similar to those in the data the neural network is trained on. This training is done via an <span class='green'>adversarial</span> process.

### Basic example (Goodfellow et al., 2014)

<img src="img/mnist_gan_8s.png" width=300>

Not digits written by a human. Generated by a neural network.

### "Cutting edge" example: (NVIDIA Research, October 2017)

<img src="img/progressive_gan_example.png" width=500>

Not real people: images generated by a neural network.

### Why you should care

> "[GANs], and the variations that are now being proposed, are the most interesting idea in the last 10 years in ML [machine learning], in my opinion." 

-- Yann LeCun, Director of AI Research at Facebook, [in 2016 on Quora](https://www.quora.com/What-are-some-recent-and-potentially-upcoming-breakthroughs-in-deep-learning/answer/Yann-LeCun)

## What _are_ GANs?

## In fact, what are neural networks?

### Neural network review

We've all seen diagrams like this when trying to understand neural nets:

<img src="img/neural_network_diagram.png" width=500>

But what are they _really_? There are many different ways of explaining what a neural net is. _Mathematically_, they are:

* **Nested functions** (like $f(g(x_1, x_2, ...))$ etc., if the $x_i$ are pixels of the original image)

* **Universal function approximators** (if we nest them in the right way, we can in theory approximate any function, no matter how complex)

* **Differentiable** (this allows us to "train" them to actually accomplish things)

This means that you can think of a neural net as being a mathematical function that takes in:

* An input image (or batch) that we could denote:

$X = \begin{bmatrix}
x_{11} & x_{12} & x_{13} & \ldots & x_{1n} \\
x_{21} & x_{22} & x_{23} & \ldots & x_{2n} \\
x_{31} & x_{32} & x_{33} & \ldots & x_{3n} \\
\vdots & \vdots & \vdots & \ddots & \vdots\\
x_{m1} & x_{m2} & x_{m3} & \ldots & x_{mn} \\
\end{bmatrix} $

As well as:

* Several weight matrices that we could denote $W$:

$W = \begin{bmatrix}
w_{11} & w_{12} & w_{13} & \ldots & w_{1p} \\
w_{21} & w_{22} & w_{23} & \ldots & w_{2p} \\
w_{31} & w_{32} & w_{33} & \ldots & w_{3p} \\
\vdots & \vdots & \vdots & \ddots & \vdots\\
w_{n1} & w_{n2} & w_{n3} & \ldots & w_{np} \\
\end{bmatrix} $

The result is a _number_: for example, a number representing the probability that an image contains a cat.

This number is computed by some extremely complicated - yet still **differentiable** - ***function*** of these original pixels and weights.

Something like:

$$ f(X, W) = x_{11}^2 \cdot (w_{11} + w_{12}) \cdot \log(x_{12}) + \ldots $$
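To make the "nested function" idea concrete, here is a minimal sketch of a two-layer network written as plain Python functions. The layer sizes, the weights, and the choice of ReLU and sigmoid are all assumptions made for illustration; nothing here is specific to any real model.

```python
import math

def relu(values):
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in values]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def matvec(x, W):
    """Multiply a row vector x (length n) by a matrix W (n rows, p columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def tiny_network(x, W1, W2):
    hidden = relu(matvec(x, W1))    # inner function g(x)
    score = matvec(hidden, W2)[0]   # outer function applied to g(x)
    return sigmoid(score)           # squashed to a number between 0 and 1

# Hypothetical "pixels" and weights, chosen only for illustration.
x = [0.5, -1.0, 2.0]
W1 = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.6]]
W2 = [[0.7], [-0.8]]

p_cat = tiny_network(x, W1, W2)
print(p_cat)
```

Every operation in `tiny_network` is differentiable (almost everywhere, in the case of ReLU), which is what the rest of this review relies on.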

Every time we feed a set of inputs and weights through this network, we get **predictions $P$**; we compare these predictions to the **target $Y$** to get a **loss vector $L$** of the same "shape" (in terms of a multidimensional array) as the predictions.

This loss is the key data we need on **how much we "missed" by**.

These facts mean we can train a neural network using the following procedure:

1. Feed a bunch of data points through the neural network.
2. Compute the loss $L$
3. Compute, for every single weight $w$ in the network: $$ \frac{\partial L}{\partial w} $$

And then we can update each _individual weight_ in the network $w_i$ according to the equation:

$$ w_i = w_i - \frac{\partial L}{\partial w_i} $$

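(This is the standard "gradient descent" equation, with the step size, or "learning rate", set to 1 for simplicity. In practice the gradient is multiplied by a small learning rate, and many other modifications of this equation exist.)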
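The three steps and the update rule can be sketched on the smallest possible "network": a single weight $w$ with prediction $w \cdot x$. The data, the squared-error loss, and the step size `learning_rate` are assumptions chosen for illustration.

```python
def train(xs, ys, learning_rate=0.1, steps=200):
    """Fit a one-weight model p = w * x by gradient descent on squared error."""
    w = 0.0
    loss = None
    for _ in range(steps):
        # 1. Feed the data points through the "network".
        preds = [w * x for x in xs]
        # 2. Compute the loss L (mean squared error here).
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
        # 3. Compute dL/dw.
        grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
        # Update the weight by stepping against the gradient.
        w = w - learning_rate * grad
    return w, loss

# Hypothetical data generated by the rule y = 2 * x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
w, final_loss = train(xs, ys)
print(w, final_loss)
```

A real network does the same thing for every weight at once, with the derivatives computed by backpropagation rather than by hand.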

**In addition**, differentiability means we can compute, for every _pixel_ $x_i$ in the input image:

$$ \frac{\partial L}{\partial x_i} $$

In other words, how much the loss would change if this pixel in the _input_ image changed.
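As a sanity check on this idea, here is a small sketch that computes $\frac{\partial L}{\partial x_i}$ for a one-layer network and compares it with a finite-difference estimate. The model (a sigmoid of a weighted sum), the cross-entropy loss, and all the numbers are assumptions made for illustration.

```python
import math

def predict(x, w):
    """One-layer network: p = sigmoid(w . x)."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def loss(x, w, target):
    """Cross-entropy between the prediction and a 0/1 target."""
    p = predict(x, w)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

def input_gradient(x, w, target):
    """Analytic dL/dx_i for this model: (p - target) * w_i."""
    p = predict(x, w)
    return [(p - target) * wi for wi in w]

# Hypothetical three-pixel "image" and weights.
x = [0.2, 0.7, 0.1]
w = [1.5, -2.0, 0.5]
analytic = input_gradient(x, w, target=1)

# Finite-difference estimate: nudge one pixel at a time and watch the loss.
eps = 1e-6
numeric = []
for i in range(len(x)):
    bumped = list(x)
    bumped[i] += eps
    numeric.append((loss(bumped, w, 1) - loss(x, w, 1)) / eps)

print(analytic)
print(numeric)
```

The two lists should agree closely; the analytic version is what backpropagation gives us for free.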

_This_ fact turns out to be:

* What allows GANs to work
* Why adversarial examples are a thing (ask me about this afterwards if curious)

# How were GANs invented?

<img src="img/ian_goodfellow_beer.png" width=500>

In 2013, Ian Goodfellow (inventor of GANs, then a grad student at the University of Montreal) and Yoshua Bengio (one of the leading researchers on neural networks in the world) are about to run a speech synthesis contest.

Their idea is to have a "discriminator network" that can listen to artificially generated speech and decide if it is real or not.

They decide not to run the contest, concluding that people will just game the system by generating examples that will fool _this particular_ discriminator network, rather than trying to produce _generally_ good speech.

Then, Ian Goodfellow was in a bar one night, and asked the question: **can this be fixed by the _discriminator network_ learning**?

This led him to develop what ultimately became the GAN framework. Let's dive in and see how it works:

### How GANs work

## What he came up with

### Part 1

First: randomly generate a vector of data.

$$ \begin{bmatrix}z_1 \\
                  z_2 \\
                  \vdots \\
                  z_{100}
                  \end{bmatrix} $$

Feed this "feature vector" through a randomly initialized neural network to produce an output image.

![](img/gan_1.png)



Denote the matrix of pixels in this image - generated by the first neural network - $X_G$.

Then, feed this image (or matrix of pixels $X_G$) into a second network and get a prediction, which we can call $P_{real}^{X_G}$.

<img src="img/gan_prediction.png" width=600>

What now?

### Are neural nets and GANs "AI"?

If not, what is a better mental model?

<img src="img/dog.png" width=500>

They're more like dogs - we can train them to do whatever we want.

<img src="img/gan_prediction_2.png" width=500>

Compare this prediction with _0_ (the image is fake, so the correct answer is "not real") to get a loss, $L_0^{X_G}$, and use it to train this second network, called the "**discriminator**".

Critically, also compute $L_1^{X_G}$ - the penalty for the discriminator making a prediction different from _1_. (Keep in mind that the discriminator's outputs will be normalized to be between 0 and 1.)

Then, update the first network, called the **generator**, with

$$ \frac{\partial L_1^{X_G}}{\partial X_G} $$

so the generator will be trained to continually make the discriminator _more_ likely to say that the images it is generating are real.

<img src="img/gan_3.png" width=500>

Finally, generate a _new_ random noise vector $Z$, and repeat the process, so that the generator will learn to turn _any_ random noise vector into an image that the discriminator thinks is real.

Repeat!

### What's missing?



This will train the generator to generate good fake images, but it will likely result in the discriminator not being a very smart classifier since we only gave it one of the two classes it is trying to classify - that is, only fake images, and no real images. 

So, we'll have to give it real images as well.

### Part 2

Give the discriminator some real images:

<img src="img/gans_4.png" width=500>

Description from the original paper on GANs:

> "The generative model can be thought of as analogous to a team of counterfeiters, trying to produce fake currency and use it without detection, while the discriminative model is analogous to the police, trying to detect the counterfeit currency. Competition in this game drives both teams to improve their methods until the counterfeits are indistinguishable from the genuine articles." 

-Goodfellow et al., "Generative Adversarial Networks" (2014)

[Original GitHub repo with Ian Goodfellow's code](https://github.com/goodfeli/galatea/commit/d960968919b0856ba6753198a0e035228d7c03e6)
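Parts 1 and 2 together can be sketched as a toy training loop. This is emphatically not the original code: it is a one-dimensional stand-in in which the "images" are single numbers, the generator is $x = a z + b$, and the discriminator is a sigmoid of $w x + c$. All names, sizes, and hyperparameters are assumptions, and the gradients are written out by hand.

```python
import math
import random

def sigmoid(a):
    """Numerically stable sigmoid."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def train_toy_gan(steps=1000, batch=64, lr=0.05,
                  real_mean=4.0, real_std=0.5, seed=0):
    rng = random.Random(seed)
    a, b = 1.0, 0.0   # generator parameters: x_fake = a * z + b
    w, c = 0.0, 0.0   # discriminator parameters: D(x) = sigmoid(w * x + c)
    for _ in range(steps):
        zs = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        fakes = [a * z + b for z in zs]
        reals = [rng.gauss(real_mean, real_std) for _ in range(batch)]

        # Discriminator step: compare D(real) with 1 and D(fake) with 0.
        grad_w = grad_c = 0.0
        for x in reals:
            d = sigmoid(w * x + c)
            grad_w += (d - 1.0) * x   # gradient of -log D(x)
            grad_c += (d - 1.0)
        for x in fakes:
            d = sigmoid(w * x + c)
            grad_w += d * x           # gradient of -log(1 - D(x))
            grad_c += d
        w -= lr * grad_w / batch
        c -= lr * grad_c / batch

        # Generator step: push the gradient of -log D(fake) with respect to
        # each fake sample back into the generator's parameters.
        grad_a = grad_b = 0.0
        for z, x in zip(zs, fakes):
            d = sigmoid(w * x + c)
            dloss_dx = -(1.0 - d) * w
            grad_a += dloss_dx * z
            grad_b += dloss_dx
        a -= lr * grad_a / batch
        b -= lr * grad_b / batch
    return a, b, w, c

a, b, w, c = train_toy_gan()
print("generator: x = %.3f * z + %.3f" % (a, b))
print("discriminator: D(x) = sigmoid(%.3f * x + %.3f)" % (w, c))
```

Don't expect clean convergence from a toy like this: the two networks chase each other, and the generator's output tends to oscillate around the real data rather than settle on it.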

## Extensions of GANs

* Conditional GANs
* DCGAN
* Semi-Supervised Learning

## Conditional GANs

GANs can be modified slightly to generate images from a particular class. How?

$$ \begin{bmatrix}0 \\
                  0 \\
                  \vdots \\
                  1 \\
                  \vdots \\
                  0
\end{bmatrix} $$

Add a one-hot encoded vector representing the class to both the generator and the discriminator.
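One simple way to "add" the class vector is to concatenate it onto the noise vector before it enters the generator. The sketch below assumes 10 classes and a 100-dimensional noise vector; concatenation is just one common choice, and the function names are made up for illustration.

```python
import random

def one_hot(label, num_classes):
    """Vector of zeros with a single 1 at position `label`."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec

def conditional_generator_input(label, num_classes=10, noise_dim=100, rng=random):
    """Random noise followed by the one-hot class vector."""
    noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return noise + one_hot(label, num_classes)

g_in = conditional_generator_input(label=3)
print(len(g_in))
```

The discriminator gets the same one-hot vector alongside the image, so it can judge "is this a realistic 3?" rather than just "is this realistic?".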

## How it works

#### Step 1:

<img src="img/cond_gan_1.png" width=500>

#### Step 2 - feed through discriminator:

<img src="img/cond_gan_2.png" width=500>

#### Step 3 - as before, use $\frac{\partial L_1^{X_G}}{\partial X_G}$ to train generator:

<img src="img/gan_3.png" width=500>

#### Step 4 - ensure discriminator gets smarter

<img src="img/cond_gan_4.png" width=500>

## Conditional GAN results

Conditioning GANs on a class from the training data is a common way to illustrate the output of GANs.

Here are some examples from [Self-Attention GAN](https://arxiv.org/pdf/1805.08318.pdf), the most recent GAN architecture from Ian Goodfellow's group at Google Brain.

#### Conditionally generated images from the [ImageNet dataset](http://imagenet.stanford.edu)

<img src="img/cond_gan_example.png" width=700>

## DCGAN - Deep Convolutional GAN (January 2016)

<img src="img/dcgan_1.png" width=700>

All 32 x 32 color images of bedrooms. All fake.

## DCGAN paper

Important for two sets of reasons:

Introduced key concepts that pushed GANs forward:

* Deep Convolutional/Deconvolutional architecture 
* Batch normalization (which had been invented earlier in 2015) first applied to GANs

Included famous visuals of the "latent space" GANs were learning (more on this shortly).

Crazy fact: the lead author of the DCGAN paper (Alec Radford) was still in college when it was published.

### Latent space visualization

Idea: what are GANs actually learning?

<img src="img/gans_latent_space.png" width=600>

Smooth transitions (via linear interpolations) in the latent (100-dimensional input to generator) space:

<img src="img/dcgan_smooth_transition.png" width=800>
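The interpolation itself is simple. Here is a minimal sketch, using two-dimensional vectors instead of 100-dimensional ones so the numbers are readable; the function name and the example vectors are assumptions. Each point on the path would then be fed through the trained generator to produce one frame of the transition.

```python
def interpolate(z_start, z_end, num_steps):
    """Evenly spaced points on the straight line from z_start to z_end.

    Requires num_steps >= 2 so that both endpoints are included.
    """
    path = []
    for k in range(num_steps):
        t = k / (num_steps - 1)
        path.append([(1 - t) * s + t * e for s, e in zip(z_start, z_end)])
    return path

# Hypothetical latent vectors.
path = interpolate([0.0, 1.0], [1.0, 3.0], num_steps=5)
for z in path:
    print(z)
```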

Arithmetic in the latent space:

<img src="img/dcgan_arithmetic.png" width=700>

### How did they do it?

## DCGAN Trick #1: Deep Convolutional/Deconvolutional Architecture

### Discriminator

<img src="img/dcgan_discriminator.png" width=700>

### Generator

<img src="img/dcgan_generator.png" width=700>

**What's going on in these convolutions anyway?**

## Convolutions deep dive

We've all seen diagrams like this in the context of convolutional neural nets:

<img src="img/AlexNet_0.jpg" width=700>

This is the famous [AlexNet](https://en.wikipedia.org/wiki/AlexNet) architecture.

What's really going on here?

Let's say we have an input layer of size $224 \times 224 \times 3$, as we do in the ImageNet dataset that AlexNet was trained on. This next layer seems to be $96$ deep. What does that mean?

## Review of convolutions

"_Filters_" are slid over images using the convolution operation. 

In theory, these filters can act as _feature detectors_, and the images that result from convolving these filters with the image can be thought of as versions of the original image where the detected features have been "highlighted."

<img src="./img/same_padding_no_strides.gif">

<font color="blue">Blue</font> = original image
<font color="gray">Gray</font> = convolutional "filter"
<font color="green">Green</font> = output image

In practice, the neural network _learns_ filters that are useful to solving the particular problem it has been given.

We _typically_ visualize the _results_ of applying these filters to the images. However, in certain cases we visualize the filters themselves.
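Here is a minimal sketch of sliding one filter over one single-channel image, with zero padding so the output is the same size as the input (as in the animation above). Like most deep learning libraries, it does not flip the filter. The image and the hand-picked "vertical edge" filter are assumptions for illustration; a real network would learn its filters.

```python
def convolve2d_same(image, kernel):
    """Slide `kernel` over `image` with zero padding and stride 1.

    Both arguments are lists of lists; the kernel is assumed to have odd
    height and width so that it has a well-defined center.
    """
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    pad_h, pad_w = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for di in range(kh):
                for dj in range(kw):
                    r, c = i + di - pad_h, j + dj - pad_w
                    # Positions outside the image count as zeros.
                    if 0 <= r < h and 0 <= c < w:
                        total += image[r][c] * kernel[di][dj]
            out[i][j] = total
    return out

# A 4x4 image that is dark on the left and bright on the right.
image = [[0, 0, 1, 1] for _ in range(4)]
# A filter that responds to dark-to-bright transitions from left to right.
vertical_edge = [[-1, 0, 1] for _ in range(3)]

for row in convolve2d_same(image, vertical_edge):
    print(row)
```

The output is largest in the columns where the dark-to-bright edge sits, which is the sense in which the filter "highlights" that feature.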

Let's return to the concrete example of the AlexNet architecture:

For each of 96 _filters_, the following happens:

For each of the 3 _input channels_ - usually **red, green, and blue** for a color image - one of these _filters_, which happens to be of dimension $11 \times 11$ in this case, is slid over the image, "detecting the presence of different features" at each location. 

So, there are actually a total of 96 * 3 convolution operations that take place, resulting in 96 filters, each of which has a red, green, and blue component.

We can combine the red, green, and blue filters together and visualize them as if they were a mini $11 \times 11$ image:

### The 96 AlexNet filters:

<img src="img/AlexNet_filt1.png" width=500>

## DCGAN Trick #2: Batch Normalization

Batch normalization is one of the most powerful and simple tricks to come along in the history of the training of deep neural networks. It was [introduced](https://arxiv.org/abs/1502.03167) by two researchers from Google in February 2015, just nine months before the [DCGAN paper](https://arxiv.org/abs/1511.06434) came out.

Regular neural network:

<img src="img/deep_neural_network.png" width=600>

We know that normalizing the input to a neural network helps with training: the network doesn't have to "learn" that one feature is on a scale from 0-1000 and another is on a scale from 0-1 and change its weights accordingly, for example.

The same thing applies further down in the network:

<img src="img/neural_network_weights_hidden.png" width=600>

Intuitively, batch normalization works for the same reasons that normalizing data before feeding it into a neural network works.

**How is it actually done?**

When passing data through a neural network, we do so in batches - say, 64 or 128 images at a time.

Thus, at every step of the neural network, each neuron has a value _for each observation that is being passed through_.

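We normalize _across these observations_, so that _for each batch_, each neuron will have mean 0 and standard deviation 1. Specifically, with $\mu$ and $\sigma$ denoting the mean and standard deviation of the neuron's values across the batch, we replace the value of the neuron $N$ with: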

$$N' = \frac{N - \mu}{\sigma}$$
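A minimal sketch of this per-neuron normalization, for a batch stored as a list of observations. The small constant `eps` is an assumption added to avoid dividing by zero, and the learnable scale and shift that batch normalization layers usually include are left out to keep the focus on the formula above.

```python
import math

def batch_norm(batch, eps=1e-5):
    """Normalize each neuron (column) across the observations (rows)."""
    n = len(batch)
    num_neurons = len(batch[0])
    normalized = [[0.0] * num_neurons for _ in range(n)]
    for j in range(num_neurons):
        column = [row[j] for row in batch]
        mu = sum(column) / n
        var = sum((v - mu) ** 2 for v in column) / n
        sigma = math.sqrt(var + eps)
        for i in range(n):
            normalized[i][j] = (batch[i][j] - mu) / sigma
    return normalized

# Hypothetical batch of 3 observations and 2 neurons on very different scales.
batch = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
normalized = batch_norm(batch)
for row in normalized:
    print(row)
```

After normalization both neurons live on the same scale, even though one started out 100 times larger than the other.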

**What's wrong with this in _convolutional neural networks specifically_**?

Hint: convolutional neural networks learn by learning groups of neurons which are really filters:

<img src="img/activations_mnist.png" width=700>

This is one image, convolved with 10 different filters in a CNN.

For convolutional networks, the "neurons" are pixels in output images that have been convolved with a filter. These images are important - they contain spatial information about what is present in the images. If we modify pixels in these images by different amounts, this spatial information could get modified. 

<img src="img/four_pixels.png" width=700>

So, instead of calculating means and standard deviations for each _neuron_ in each batch, we calculate a single mean and standard deviation per _filter_, across all pixels of all the output images that filter produced for a given batch, so that for **a given image, each pixel will be modified by the same amount**.
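The convolutional version can be sketched the same way. The layout `batch[n][c][i][j]` (image, filter output, row, column) and the constant `eps` are assumptions for illustration; the learnable scale and shift are again omitted.

```python
import math

def batch_norm_conv(batch, eps=1e-5):
    """One mean and standard deviation per filter output (channel),
    shared by every pixel of every image in the batch."""
    num_channels = len(batch[0])
    out = [[[list(row) for row in fmap] for fmap in image] for image in batch]
    for c in range(num_channels):
        values = [v for image in batch for row in image[c] for v in row]
        mu = sum(values) / len(values)
        var = sum((v - mu) ** 2 for v in values) / len(values)
        sigma = math.sqrt(var + eps)
        for n in range(len(batch)):
            for i, row in enumerate(batch[n][c]):
                for j, v in enumerate(row):
                    out[n][c][i][j] = (v - mu) / sigma
    return out

# Hypothetical batch: 2 images, 1 filter output each, 2x2 pixels.
batch = [
    [[[1.0, 2.0], [3.0, 4.0]]],
    [[[5.0, 6.0], [7.0, 8.0]]],
]
out = batch_norm_conv(batch)
print(out)
```

Because every pixel in a channel is shifted and scaled identically, differences between neighboring pixels keep their relative sizes, so the spatial pattern in each output image survives.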

<img src="img/activations_mnist.png" width=700>

**Enough theory!**

# Latest and Greatest Results

### GAN Result #1

# "Pose Guided Person Image Generation"

[NIPS 2017 paper](https://papers.nips.cc/paper/6644-pose-guided-person-image-generation.pdf)

> This paper proposes the novel Pose Guided Person Generation Network (PG2) that allows to synthesize person images in arbitrary poses, based on an image of that person and a novel pose.

Based on the [DeepFashion Dataset](http://mmlab.ie.cuhk.edu.hk/projects/DeepFashion.html)

<img src="img/deep_fashion.png" width=700>

### DeepFashion dataset description

> “It contains over 800,000 images, which are richly annotated with massive attributes, clothing landmarks, and correspondence of images taken under different scenarios including store, street snapshot, and consumer. Such rich annotations enable the development of powerful algorithms in clothes recognition and facilitating future researches.”

### Example data

<img src="img/deep_fashion_clothing_locations.png" width=700>

### Generated poses

<img src="img/pose_generation_1.png" width=700>

<img src="img/pose_generation_2.png" width=700>

How are these generated? Basically: conditional GANs!

# Semi-Supervised Learning

<img src="img/semi-supervised_gans.png" width=700>

Semi-supervised learning made it into Jeff Bezos' most recent letter to Amazon's shareholders!

> "...in the U.S., U.K., and Germany, we’ve improved Alexa’s spoken language understanding by more than 25% over the last 12 months through enhancements in Alexa’s machine learning components and the use of semi-supervised learning techniques. (These semi-supervised learning techniques reduced the amount of labeled data needed to achieve the same accuracy improvement by 40 times!)"

Semi-supervised learning is a third type of machine learning, in addition to supervised learning and unsupervised learning.

At a high level:

* The goal of supervised learning is to learn from _labeled_ data.
* The goal of unsupervised learning is to learn from _unlabeled_ data.

Semi-supervised learning asks the question: can you learn from a _combination_ of both labeled and unlabeled data? 

With GANs, the answer turns out to be yes! The paper that introduced this idea was [Improved Techniques for Training GANs](https://papers.nips.cc/paper/6125-improved-techniques-for-training-gans.pdf), from a team at OpenAI in 2016.

How does it work? The basic idea is as follows.

Let's say we're trying to classify MNIST digits. The discriminator will output a probability vector of an image belonging to one of ten classes: 

<img src="img/ssl_discriminator_1.png" width=700>

This is compared with the real values, turned into a loss vector, and backpropagated through the network to train it.

With semi-supervised learning, our neural network outputs two things:

* Normalized probabilities of the image being one of the digits 0-9 _or_ being a generated image (11 classes).
* Independently: the probability of the image being real.

<img src="img/ssl_discriminator_2.png" width=700>

Then, data points are fed through a classifier, with the following labels:

* _Real_, _labeled_ examples are given labels simply of 0 for all the digits they are not, 1 for the digit they are, and 1 for $P(real)$.

* _Fake_ examples generated by the generator are given labels of 0 across the board, including for $P(real)$.

* _Real_, *un*-labeled examples are given labels of 0 for all the classes and 1 for the probability of the image being real.

This allows the classifier to learn from real, labeled examples, as well as from both fake examples and real, **un**-labeled examples!

### An issue

The classifier in this case is acting like a discriminator at the same time it is acting like a classifier. It is trying to learn _both_:

* How to discriminate between fake images it is getting from the generator and real images
* How to classify digits as 1, 2, 3 etc.

This "hybrid discriminator-classifier" is potentially problematic:

As the authors of the paper point out:

> This approach introduces an interaction between G and our classifier that we do not fully understand yet

This leads them to one of their key innovations that allowed this procedure to work: **feature matching**.

## Semi-supervised learning trick: feature matching

## Idea

The last layer of a convolutional neural network, before the values get fed through a fully connected layer, is typically a layer with many features that have been detected.

<img src="img/gan_layer.png" width=700>

For example, in the convolutional architecture used in the discriminator of the DCGAN architecture described above, the last layer is $2 \times 2 \times 128$ - the result of 128 "features of features of features" that the network has learned. 

This is then "flattened" to a single layer of $2 * 2 * 128 = 512$ neurons, and these 512 neurons are then fed through a fully connected layer to produce an output of length 10.

<img src="img/last_layer_fc.png" width=500>

Their idea was to train the generator, not simply by using the discriminator's prediction of whether the image was real or fake, but on **how similar this 512-dimensional vector was between _real_ images fed through the discriminator compared to _fake_ images fed through the discriminator.** 

The delta between these two sets of _features_ is the loss used to train the generator.
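One way to turn this "delta" into a single number is the squared distance between the average feature vector of a batch of real images and that of a batch of fake images, which is my reading of the feature matching objective in the paper. The sketch below uses 2-dimensional features instead of 512-dimensional ones, and the numbers are made up.

```python
def feature_matching_loss(real_features, fake_features):
    """Squared distance between the batch-averaged feature vectors.

    Each argument is a list of feature vectors, one per image in the batch.
    """
    dim = len(real_features[0])
    real_mean = [sum(f[k] for f in real_features) / len(real_features)
                 for k in range(dim)]
    fake_mean = [sum(f[k] for f in fake_features) / len(fake_features)
                 for k in range(dim)]
    return sum((r - f) ** 2 for r, f in zip(real_mean, fake_mean))

# Hypothetical features from the discriminator's last convolutional layer.
real = [[1.0, 2.0], [3.0, 4.0]]
fake = [[0.0, 1.0], [2.0, 1.0]]
fm_loss = feature_matching_loss(real, fake)
print(fm_loss)
```

The generator is then updated to shrink this number, instead of being updated to flip the discriminator's final real-or-fake prediction.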

### Aside: why does this work? 

Even the authors of the paper don't fully understand it:

> "[The approach in this paper] introduces an interaction between G and [the hybrid classifier-discriminator] that we do not fully understand yet, but empirically we find that optimizing G using feature matching GAN works very well for semi-supervised learning, while training G using GAN with minibatch discrimination does not work at all. Here we present our empirical results using this approach; developing a full theoretical understanding of the interaction between D and G using this approach is left for future work."

Nevertheless, feature matching was the trick that led to breakthrough performance using semi-supervised learning to build powerful classifiers: 

Salimans et al. from OpenAI in mid-2016 used this approach to get just under a **6%** error rate on the [Street View House Numbers dataset](http://ufldl.stanford.edu/housenumbers/) _with just 1,000 labeled images_. Prior approaches achieved roughly **16%** error. 

State-of-the-art error, using the entire dataset of roughly 600,000 images, simply using supervised learning with very deep convolutional networks, is roughly **2%**. 

<img src="img/bezos.png" width=500>

## Getting better all the time

Generating more realistic, higher resolution images using GANs has been an active research area since their inception.

### Example: Progressive GANs from NVIDIA Research

<img src="img/progressive_gan.gif">

What is the main idea behind Progressive GANs?

1. Begin by downsampling the images to be simply _4x4_.
2. Train a GAN to generate "high quality" 4x4 images. 
3. Then, using the weights already learned in the initial layers, add a layer after the generator and before the discriminator so that this GAN now generates _8x8_ images, etc.

<img src="img/progressive_gans_technique.png" width=700>

**But how do we _know_ these GANs are any good?**

## An aside: how do we score GANs?

How do we know that these samples are "good"? They "look good", but how can we quantify this?

> "Generative Adversarial Networks are generally regarded as producing the best samples [compared to other generative methods such as variational autoencoders] but there is no good way to quantify this."

--Ian Goodfellow, NIPS tutorial 2016

Since then, several methods have been proposed, the most prominent of which is the **Inception Score**:

### Inception Score

In the same paper that introduced feature matching, a technique for scoring GANs called **Inception Score** was introduced.

Consider a GAN that was intended to generate images that come from one of a finite number of classes, such as MNIST digits.

### Inception score (cont.)

Let's say that the generator generated some images, and those generated images were then fed through a pre-trained neural network, and the resulting probability distribution over classes for one of the images was:

`[0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]`

In other words, the pre-trained model has no idea which class this image should belong to. 

In this case, we conclude that all else equal, this likely isn't a very good generator.

The way we formalize this is that this resulting vector should have _low entropy_ - that is, _not_ an even distribution over class labels. 
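To put numbers on "low entropy", here is a minimal sketch that computes the entropy of the flat prediction above and of a much more confident one. The confident vector is a made-up example for contrast.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

uniform = [0.1] * 10               # "no idea which class this is"
confident = [0.91] + [0.01] * 9    # hypothetical confident prediction

print(entropy(uniform))
print(entropy(confident))
```

The flat vector has the largest entropy a 10-class distribution can have; a good generator should produce images whose predictions look more like the second vector.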

### Inception score (cont.)

There is another way we can use this pre-trained neural network. Let's say that for every image generated, we recorded the "most likely class" that the pre-trained network was predicting. And let's say that about 90% of the time, the pre-trained network was classifying the images that our model was generating as zeros, so that the vector of how often each class was the "most likely class" looked like:

`[0.91, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]`

Again, this would not be a very good GAN!

The way we formalize this is that we want the vector of the frequency of the predictions to have _high entropy_: that is, we *do* want the classes to be balanced.
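The two requirements are usually combined into a single number: the exponential of the average KL divergence between each image's predicted distribution and the average prediction over all images. That formula is not spelled out on these slides, so treat the sketch below as my summary of the usual definition; the prediction vectors are made up, and the average predicted distribution stands in for the "most likely class" frequencies.

```python
import math

def inception_score(pred_rows):
    """exp(mean over images of KL(p(y|x) || p(y))), where p(y) is the
    average of the per-image predictions."""
    n = len(pred_rows)
    k = len(pred_rows[0])
    marginal = [sum(row[j] for row in pred_rows) / n for j in range(k)]
    total_kl = 0.0
    for row in pred_rows:
        total_kl += sum(p * math.log(p / marginal[j])
                        for j, p in enumerate(row) if p > 0)
    return math.exp(total_kl / n)

def one_hot(j, k=10):
    return [1.0 if i == j else 0.0 for i in range(k)]

# Confident and balanced: one perfectly classified image per class.
balanced = [one_hot(j) for j in range(10)]
# Confident but unbalanced: every image looks like a zero.
all_zeros = [one_hot(0) for _ in range(10)]
# Balanced but unconfident: every prediction is flat.
unsure = [[0.1] * 10 for _ in range(10)]

print(inception_score(balanced))
print(inception_score(all_zeros))
print(inception_score(unsure))
```

Only the first case, which is both confident per image and balanced across classes, gets a high score.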

"Inception" simply refers to the neural network architecture used to score these generated images.

Progressive GANs did indeed show a record Inception score on the CIFAR-10 dataset:

<img src="img/progressive_gans_inception.png" width=700>

However, we can't do this in the Celeb-A dataset: there are no classes!

## Patch similarity

The authors propose a new way of assessing their GANs to identify improvement:

They randomly sample 7x7 patches from the 16x16 versions of the images, the 32x32 versions, etc., up to the 1024x1024 version. They then use a metric called the "Wasserstein distance" to compute the similarity between generated patches and the corresponding real patches.

> "...the distance between the patch sets extracted from the lowest-resolution 16 × 16 images indicate similarity in large-scale image structures, while the finest-level patches encode information about pixel-level attributes such as sharpness of edges and noise."

Using this metric, their method does indeed outperform other GANs that have come before.

# The future

What is the future of GANs? More generally, what is the future of Deep Learning? Can we predict it?

I asked Ian Goodfellow in a LinkedIn message if he was surprised by how quickly Progressive GANs were able to get close to photorealistic image quality on 1024x1024 images. He replied:

> I'm actually surprised at how slow it's been. Back in 2015 I thought that getting to photorealistic video was mostly going to be an engineering effort of scaling the model up and training on more data.

-Ian Goodfellow, in a LinkedIn message to me

### Ian Goodfellow's background

<img src="img/goodfellow_li_1.png" width=500>

<img src="img/goodfellow_li_2.png" width=500>

<img src="img/goodfellow_li_3.png" width=500>

### Knowledge over time

<img src="img/knowledge_over_time.png" width=500>

## What is the future of GANs?

<img src="img/question_mark.png" width=500>

Nobody knows!

# Thanks!

Contact me if you'd like: seth@sethweidman.com