from IPython.display import HTML
style = """
<style>
.expo {
  line-height: 150%;
}

.visual {
  width: 600px;
}

.red {
  color: red;
  display:inline;
}

.blue {
  color: blue;
  display:inline;
}

.green {
  color: green;
  display:inline;
}

</style>
"""
HTML(style)

# GANs

# https://github.com/SethHWeidman/Conference_talks/tree/master/AirBNB_Lunch_July_2018



## What are they, what makes them work, and what is their future.

**Seth Weidman**
 
Airbnb, July 31, 2018

# Agenda

1. Quick Neural Networks Review

2. GANs' origin story

3. GAN extensions and applications
    * Conditional GANs
    * DCGAN
    * Semi-Supervised Learning
    * Scoring GANs

4. Future of GANs (and of Deep Learning in general)

# What are GANs?

You may not know what a GAN is: when the conversation turns to GANs, you may feel like [Homer Simpson](https://www.youtube.com/watch?v=PGLzm-Gy0dQ) does in this clip.

"GAN" stands for "<span class='blue'>Generative</span> <span class='green'>Adversarial</span> <span class='red'>Network</span>". 

They are a method of training <span class='red'> neural networks</span> to <span class='blue'>generate</span> images similar to those in the data the neural network is trained on. This training is done via an <span class='green'>adversarial</span> process.

### Basic example (Goodfellow et al., 2014)

![](img/mnist_gan_8s.png)

Not digits written by a human. Generated by a neural network.

### "Cutting edge" example: (NVIDIA Research, October 2017)

![](img/progressive_gan_example.png)

Not real people: images generated by a neural network.

### Why you should care

> "[GANs], and the variations that are now being proposed, are the most interesting idea in the last 10 years in ML [machine learning], in my opinion." 

-- Yann LeCun, Director of AI Research at Facebook, [in 2016 on Quora](https://www.quora.com/What-are-some-recent-and-potentially-upcoming-breakthroughs-in-deep-learning/answer/Yann-LeCun)

## What _are_ GANs?

## In fact, what are neural networks?

### Neural network review

We've all seen diagrams like this when trying to understand neural nets:

![](img/neural_network_diagram.png)

But what are they _really_? There are many different ways of explaining what a neural net is. _Mathematically_, they are:

* **Nested functions** (like $f(g(x_1, x_2, ...))$ etc., if the $x_i$ are pixels of the original image)

* **Universal function approximators** (if we nest them in the right way, we can in theory approximate any function, no matter how complex)

* **Differentiable** (this allows us to "train" them to actually accomplish things)

This means that you can think of a neural net as being a mathematical function that takes in:

* An input image (or batch) that we could denote:

$X = \begin{bmatrix}
x_{11} & x_{12} & x_{13} & \ldots & x_{1n} \\
x_{21} & x_{22} & x_{23} & \ldots & x_{2n} \\
x_{31} & x_{32} & x_{33} & \ldots & x_{3n} \\
\vdots & \vdots & \vdots & \ddots & \vdots\\
x_{m1} & x_{m2} & x_{m3} & \ldots & x_{mn} \\
\end{bmatrix} $

As well as:

* Several weight matrices that we could denote $W$:

$W = \begin{bmatrix}
w_{11} & w_{12} & w_{13} & \ldots & w_{1p} \\
w_{21} & w_{22} & w_{23} & \ldots & w_{2p} \\
w_{31} & w_{32} & w_{33} & \ldots & w_{3p} \\
\vdots & \vdots & \vdots & \ddots & \vdots\\
w_{n1} & w_{n2} & w_{n3} & \ldots & w_{np} \\
\end{bmatrix} $

The result is a _number_: for example, a number representing the probability that an image contains a cat.

This number is computed by some extremely complicated - yet still **differentiable** - ***function*** of these original pixels and weights.

Something like:

$$ f(X, W) = x_{11}^2 \cdot (w_{11} + w_{12}) \cdot \log(x_{12}) + \ldots $$
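
To make the "nested functions" view concrete, here is a minimal sketch in plain Python of a two-layer network written as $f(g(x))$. The layer sizes, the weights, and the helper names (`sigmoid`, `dense`, `tiny_network`) are all made up for illustration; a real network has far more layers and weights.

```python
import math


def sigmoid(v):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def dense(inputs, weights):
    """One layer: a weighted sum of the inputs for each output unit."""
    return [sum(x * w for x, w in zip(inputs, unit_weights))
            for unit_weights in weights]


def tiny_network(pixels, w_hidden, w_out):
    """f(g(x)): g is a hidden layer, f turns its output into a probability."""
    hidden = [sigmoid(h) for h in dense(pixels, w_hidden)]   # g(x)
    return sigmoid(dense(hidden, w_out)[0])                  # f(g(x))


# Hypothetical 4-pixel "image" and hand-picked weights.
pixels = [0.0, 0.5, 1.0, 0.25]
w_hidden = [[0.1, -0.2, 0.4, 0.3], [-0.3, 0.2, 0.1, 0.5]]   # 2 hidden units
w_out = [[1.5, -1.0]]                                        # 1 output unit

p_cat = tiny_network(pixels, w_hidden, w_out)
print(f"P(cat) = {p_cat:.3f}")
```

Every operation here (multiply, add, sigmoid) is differentiable, which is what makes the whole function differentiable.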

Every time we feed a set of inputs and weights through this network, we get **predictions $P$**; we compare these predictions to the **target $Y$** to get a **loss vector $L$** of the same "shape" (in terms of a multidimensional array) as the predictions.

This loss is the key data we need on **how much we "missed" by**.

These facts mean we can train a neural network using the following procedure:

1. Feed a bunch of data points through the neural network.
2. Compute the loss $L$
3. Compute, for every single weight $w$ in the network: $$ \frac{\partial L}{\partial w} $$

And then we can update each _individual weight_ $w_i$ in the network according to the equation:

$$ w_i = w_i - \alpha \frac{\partial L}{\partial w_i} $$

(This is the standard "gradient descent" equation, where $\alpha$ is a small step size called the learning rate. We could also use one of the many modifications of this equation that exist.)
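
The procedure above can be sketched on the smallest possible "network": a single weight $w$ in the model $y = w \cdot x$. This is only an illustration; the data, the squared-error loss, and the step size `learning_rate` are assumptions chosen to keep the arithmetic easy to follow.

```python
def loss_and_gradient(w, xs, ys):
    """Mean squared error of the model y = w * x, and its derivative dL/dw."""
    n = len(xs)
    loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
    return loss, grad


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # generated by the "true" weight w = 2

w = 0.0                # starting guess
learning_rate = 0.1    # step size (an assumption of this sketch)
losses = []
for step in range(20):
    loss, grad = loss_and_gradient(w, xs, ys)   # steps 1-3 of the procedure
    losses.append(loss)
    w = w - learning_rate * grad                # the update equation

print(f"learned w = {w:.4f}")
```

The recorded `losses` shrink at every step, and `w` ends up very close to 2.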

**In addition**, differentiability means we can compute, for every _pixel_ $x_i$ in the input image:

$$ \frac{\partial L}{\partial x_i} $$

In other words, how much the loss would change if this pixel in the _input_ image changed.

_This_ fact turns out to be:

* What allows GANs to work
* Why adversarial examples are a thing (ask me about this afterwards if curious)
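
Here is a small sketch of what $\frac{\partial L}{\partial x_i}$ looks like in practice, assuming a one-layer model with a sigmoid output and a binary cross-entropy loss (both assumptions made for illustration). For this model the gradient has the closed form $(p - y) \cdot w_i$, and the code checks it against the "nudge one pixel and see how the loss changes" definition.

```python
import math


def predict(pixels, weights):
    """A one-layer 'network': sigmoid of a weighted sum of the pixels."""
    z = sum(x * w for x, w in zip(pixels, weights))
    return 1.0 / (1.0 + math.exp(-z))


def loss(pixels, weights, target):
    """Binary cross-entropy between the prediction and a 0/1 target."""
    p = predict(pixels, weights)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))


def input_gradient(pixels, weights, target):
    """Analytic dL/dx_i for this model: (p - target) * w_i."""
    p = predict(pixels, weights)
    return [(p - target) * w for w in weights]


pixels = [0.2, 0.7, 0.1]      # hypothetical 3-pixel image
weights = [0.5, -1.0, 2.0]    # hypothetical fixed weights
target = 1

analytic = input_gradient(pixels, weights, target)

# Finite-difference check: nudge one pixel at a time and watch the loss.
eps = 1e-6
base = loss(pixels, weights, target)
numeric = []
for i in range(len(pixels)):
    bumped = list(pixels)
    bumped[i] += eps
    numeric.append((loss(bumped, weights, target) - base) / eps)

print(analytic)
print(numeric)
```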

# How were GANs invented?

![](img/ian_goodfellow_beer.png)

In 2013, Ian Goodfellow (inventor of GANs, then a grad student at the University of Montreal) and Yoshua Bengio (one of the leading researchers on neural networks in the world) were about to run a speech synthesis contest.

Their idea was to have a "discriminator network" that could listen to artificially generated speech and decide if it was real or not.

They decided not to run the contest, concluding that people would just game the system by generating examples that would fool _this particular_ discriminator network, rather than trying to produce _generally_ good speech.

Then, Ian Goodfellow was in a bar one night, and asked the question: **can this be fixed by having the _discriminator network_ learn as well**?

This led him to develop what ultimately became the GAN framework. Let's dive in and see how it works:

### How GANs work

## What he came up with

### Part 1

First: randomly generate a vector of data.

$$ \begin{bmatrix}z_1 \\
                  z_2 \\
                  \vdots \\
                  z_{100}
                  \end{bmatrix} $$

Feed this "feature vector" through a randomly initialized neural network to produce an output image.

![](img/gan_1.png)

Denote the matrix of pixels in this image - generated by the first neural network - $X$.

Then, feed this image (or matrix of pixels $X$) into a second network and get a prediction:

![](img/gan_prediction.png)
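
As a rough sketch of these two forward passes, here is a one-layer "generator" and a one-layer "discriminator" in plain Python. The 28 x 28 image size, the single-layer architectures, and the weight scale are all assumptions for illustration; the networks in the diagrams have more layers.

```python
import math
import random

random.seed(0)

NOISE_DIM = 100      # length of the random input vector z
IMAGE_SIDE = 28      # assumed MNIST-sized output, 28 x 28 pixels
N_PIXELS = IMAGE_SIDE * IMAGE_SIDE


def random_matrix(n_rows, n_cols, scale=0.05):
    """Randomly initialized weights, as in an untrained network."""
    return [[random.gauss(0, scale) for _ in range(n_cols)] for _ in range(n_rows)]


def generator(z, weights):
    """Map a noise vector to pixel values in (-1, 1) with one tanh layer."""
    return [math.tanh(sum(zi * w for zi, w in zip(z, row))) for row in weights]


def discriminator(pixels, weights):
    """Map an image to a single number: the predicted P(image is real)."""
    score = sum(x * w for x, w in zip(pixels, weights))
    return 1.0 / (1.0 + math.exp(-score))


z = [random.gauss(0, 1) for _ in range(NOISE_DIM)]
g_weights = random_matrix(N_PIXELS, NOISE_DIM)
d_weights = [random.gauss(0, 0.05) for _ in range(N_PIXELS)]

X = generator(z, g_weights)               # flat list of 784 generated pixels
prediction = discriminator(X, d_weights)  # one number between 0 and 1
print(len(X), round(prediction, 3))
```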

What now?

### Are neural nets and GANs "AI"?

If not, what is a better mental model?

![](img/dog.png)

They're more like dogs - we can train them to do whatever we want.

![](img/gan_prediction_2.png)

Use the loss from comparing this prediction with _0_ (the label for a fake image) to train this second network, called the "**discriminator**". 

Critically, also compute $$ \frac{\partial L}{\partial X} $$ - how much each of the _pixels generated_ affects the loss.

Then, update the first network, called the **generator**, with

$$ -\frac{\partial L}{\partial X} $$

negative because we want the generator to be continually making the discriminator _more_ likely to say that the images it is generating are real.

![](img/gan_3.png)

Finally, generate a _new_ random noise vector $Z$, and repeat the process, so that the generator will learn to turn _any_ random noise vector into an image that the discriminator thinks is real.

Repeat!
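
The sign of the generator's update is easy to get backwards, so here is a one-step sketch with a toy one-layer discriminator (an assumption for illustration). The loss is $L = -\log(1 - D(X))$, the loss against the label 0. In a real GAN the signal $-\frac{\partial L}{\partial X}$ is backpropagated further into the generator's weights; here it is applied directly to the pixels to show which way it pushes.

```python
import math


def discriminator(pixels, weights):
    """Toy discriminator: sigmoid of a weighted sum, read as P(real)."""
    score = sum(x * w for x, w in zip(pixels, weights))
    return 1.0 / (1.0 + math.exp(-score))


def fake_loss(pixels, weights):
    """Loss from comparing the prediction with the label 0 ('fake')."""
    return -math.log(1.0 - discriminator(pixels, weights))


def fake_loss_input_gradient(pixels, weights):
    """dL/dX for the loss above; for this toy model it equals D(X) * w."""
    d = discriminator(pixels, weights)
    return [d * w for w in weights]


d_weights = [0.8, -0.5, 0.3, 1.2]   # hypothetical, held fixed for this step
X = [0.1, 0.4, -0.2, 0.0]           # hypothetical generated pixels
step_size = 0.5

grad_X = fake_loss_input_gradient(X, d_weights)
generator_signal = [-g for g in grad_X]   # the -dL/dX from the text

# Gradient descent on the generator's signal moves X along +dL/dX.
X_new = [x - step_size * s for x, s in zip(X, generator_signal)]

before = discriminator(X, d_weights)
after = discriminator(X_new, d_weights)
print(f"P(real) before: {before:.3f}, after: {after:.3f}")
```

After the step, the discriminator's loss on the fake image is higher and its predicted P(real) has gone up, which is exactly what the generator wants.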

### What's missing?



This will train the generator to generate good fake images, but it will likely result in the discriminator not being a very smart classifier since we only gave it one of the two classes it is trying to classify - that is, only fake images, and no real images. 

So, we'll have to give it real images as well.

### Part 2

Give the discriminator some real images:

![](img/gans_4.png)
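
With both classes in the batch, the discriminator's loss can be sketched as an ordinary binary cross-entropy: real images are compared with the label 1 and fake images with the label 0. The prediction values below are made up for illustration.

```python
import math


def binary_cross_entropy(prediction, label):
    """Loss for one prediction in (0, 1) against a 0/1 label."""
    return -(label * math.log(prediction) + (1 - label) * math.log(1 - prediction))


def discriminator_loss(real_predictions, fake_predictions):
    """Average loss over a batch: real images labeled 1, fake images labeled 0."""
    losses = [binary_cross_entropy(p, 1) for p in real_predictions]
    losses += [binary_cross_entropy(p, 0) for p in fake_predictions]
    return sum(losses) / len(losses)


# Hypothetical discriminator outputs, P(real), for two real and two fake images.
confident_and_right = discriminator_loss([0.9, 0.8], [0.1, 0.2])
fooled_by_fakes = discriminator_loss([0.9, 0.8], [0.7, 0.9])

print(f"{confident_and_right:.3f} vs {fooled_by_fakes:.3f}")
```

The loss is low when the discriminator separates the two classes and high when the fakes fool it.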

Description from the original paper on GANs:

> "The generative model can be thought of as analogous to a team of counterfeiters, trying to produce fake currency and use it without detection, while the discriminative model is analogous to the police, trying to detect the counterfeit currency. Competition in this game drives both teams to improve their methods until the counterfeits are indistinguishable from the genuine articles." 

-Goodfellow et al., "Generative Adversarial Networks" (2014)

Generated MNIST digits with simple neural network architectures with just fully connected layers - no convolutions in either:

![](img/mnist_gan_8s.png)

[Original GitHub repo with Ian Goodfellow's code](https://github.com/goodfeli/galatea/commit/d960968919b0856ba6753198a0e035228d7c03e6)

## Extensions of GANs

* Conditional GANs
* DCGAN
* Semi-Supervised Learning

## Conditional GANs

GANs can be modified slightly to generate images from a particular class. How?

$$ \begin{bmatrix}0 \\
                  0 \\
                  \vdots \\
                  1 \\
                  \vdots \\
                  0
\end{bmatrix} $$

Add a one-hot encoded vector representing the class to both the generator and the discriminator.
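
One common way to "add" the class vector is simple concatenation, which is what this sketch assumes: the one-hot vector is appended to the noise vector going into the generator and to the flattened image going into the discriminator. The function names are hypothetical.

```python
import random


def one_hot(class_index, n_classes):
    """Vector of zeros with a single 1 at position class_index."""
    return [1.0 if i == class_index else 0.0 for i in range(n_classes)]


def conditional_generator_input(z, class_index, n_classes):
    """Append the class label to the noise vector before it enters the generator."""
    return z + one_hot(class_index, n_classes)


def conditional_discriminator_input(pixels, class_index, n_classes):
    """Append the same label to the flattened image the discriminator sees."""
    return pixels + one_hot(class_index, n_classes)


random.seed(0)
z = [random.gauss(0, 1) for _ in range(100)]
g_input = conditional_generator_input(z, class_index=7, n_classes=10)
print(len(g_input), g_input[-10:])
```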

## How it works

#### Step 1:

![](img/cond_gan_1.png)

#### Step 2 - feed through discriminator:

![](img/cond_gan_2.png)

#### Step 3 - as before, use $- \frac{\partial L}{\partial X}$ to train generator:

![](img/gan_3.png)

#### Step 4 - ensure discriminator gets smarter

![](img/cond_gan_4.png)

## Conditional GAN results

Conditioning GANs on a class from the training data is a common way to illustrate the output of GANs.

Here are some examples from [Self-Attention GAN](https://arxiv.org/pdf/1805.08318.pdf), the most recent GAN architecture from Ian Goodfellow's group at Google Brain.

#### Conditionally generated images from the [ImageNet dataset](http://imagenet.stanford.edu)

![](img/cond_gan_example.png)

## DCGAN - Deep Convolutional GAN (January 2016)

![](img/dcgan_1.png)

All 64 x 64 color images of bedrooms. All fake.

## DCGAN paper

Important for two reasons:

1. It introduced key concepts that pushed GANs forward:
    * Deep Convolutional/Deconvolutional architecture
    * Batch normalization (which had been invented earlier in 2015) first applied to GANs

2. It included famous visuals of the "latent space" GANs were learning (more on this shortly).

Crazy fact: the lead author of the DCGAN paper (Alec Radford) was still in college when it was published.

### Latent space visualization

Idea: what are GANs actually learning?

![](img/gans_latent_space.png)

Linear interpolation produces smooth transitions in the latent space (the 100-dimensional input to the generator):

![](img/dcgan_smooth_transition.png)
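
The interpolation itself is just a straight line between two noise vectors. A minimal sketch, assuming two random 100-dimensional endpoints and nine frames (both arbitrary choices); a trained generator, not shown here, would turn each point into an image.

```python
import random


def interpolate(z_start, z_end, n_steps):
    """Evenly spaced points on the straight line from z_start to z_end."""
    path = []
    for step in range(n_steps):
        t = step / (n_steps - 1)          # goes from 0.0 to 1.0
        path.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path


random.seed(0)
z_a = [random.gauss(0, 1) for _ in range(100)]
z_b = [random.gauss(0, 1) for _ in range(100)]

# Each point on the path would be fed through the trained generator
# to produce one frame of the transition.
path = interpolate(z_a, z_b, n_steps=9)
print(len(path), len(path[0]))
```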

Arithmetic in the latent space:

![](img/dcgan_arithmetic.png)

### How did they do it?

## DCGAN Trick #1: Deep Convolutional/Deconvolutional Architecture

### Reminder:

Convolutions are a way of doing _local_ feature detection that is widely used for image understanding within Deep Learning.

<img src="./img/same_padding_no_strides.gif">

<font color="blue">Blue</font> = original image
<font color="gray">Gray</font> = convolutional "filter"
<font color="green">Green</font> = output image
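
What the animation shows can be written out in a few lines of plain Python. This is a sketch with "same" zero padding and stride 1, matching the picture; the 4 x 4 image and the hand-made edge filter are assumptions for illustration. (As in most deep learning libraries, the filter is not flipped, so strictly speaking this is a cross-correlation.)

```python
def convolve_same(image, kernel):
    """Slide a square kernel over a 2-D image, zero-padding the border
    so the output has the same height and width as the input."""
    height, width = len(image), len(image[0])
    k = len(kernel)
    pad = k // 2
    output = [[0.0] * width for _ in range(height)]
    for row in range(height):
        for col in range(width):
            total = 0.0
            for i in range(k):
                for j in range(k):
                    r, c = row + i - pad, col + j - pad
                    if 0 <= r < height and 0 <= c < width:   # outside = zero padding
                        total += image[r][c] * kernel[i][j]
            output[row][col] = total
    return output


image = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
# A hand-made filter that responds where the image gets brighter from left to right.
edge_filter = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

for row in convolve_same(image, edge_filter):
    print(row)
```

Each output value depends only on a 3 x 3 neighborhood of the input, which is the "local" in local feature detection. In a trained network the filter values are learned rather than hand-made.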

### Discriminator 

![](img/dcgan_discriminator.png)

### Discriminator in PyTorch:

```python
import torch
from torch import nn


class Discriminator(torch.nn.Module):
    
    def __init__(self):
        super(Discriminator, self).__init__()
        
        self.conv1 = nn.Sequential(
            nn.Conv2d(
                in_channels=1, out_channels=128, kernel_size=4, 
                stride=2, padding=1, bias=False
            ),
            nn.LeakyReLU(0.2, inplace=True)
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(
                in_channels=128, out_channels=256, kernel_size=4,
                stride=2, padding=1, bias=False
            ),
            nn.BatchNorm2d(256),
            nn.LeakyReLU(0.2, inplace=True)
        )
        self.conv3 = nn.Sequential(
            nn.Conv2d(
                in_channels=256, out_channels=512, kernel_size=4,
                stride=2, padding=1, bias=False
            ),
            nn.BatchNorm2d(512),
            nn.LeakyReLU(0.2, inplace=True)
        )
        self.conv4 = nn.Sequential(
            nn.Conv2d(
                in_channels=512, out_channels=1024, kernel_size=4,
                stride=2, padding=1, bias=False
            ),
            nn.BatchNorm2d(1024),
            nn.LeakyReLU(0.2, inplace=True)
        )
        self.out = nn.Sequential(
            nn.Linear(1024*4*4, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # Convolutional layers
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.conv4(x)
        # Flatten and apply sigmoid
        x = x.view(-1, 1024*4*4)
        x = self.out(x)
        return x
```   

### Generator

![](img/dcgan_generator.png)

```python
import torch
from torch import nn


class Generator(torch.nn.Module):
    
    def __init__(self):
        super(Generator, self).__init__()
        
        self.linear = torch.nn.Linear(100, 1024*4*4)
        
        self.conv1 = nn.Sequential(
            nn.ConvTranspose2d(
                in_channels=1024, out_channels=512, kernel_size=4,
                stride=2, padding=1, bias=False
            ),
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True)
        )
        self.conv2 = nn.Sequential(
            nn.ConvTranspose2d(
                in_channels=512, out_channels=256, kernel_size=4,
                stride=2, padding=1, bias=False
            ),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True)
        )
        self.conv3 = nn.Sequential(
            nn.ConvTranspose2d(
                in_channels=256, out_channels=128, kernel_size=4,
                stride=2, padding=1, bias=False
            ),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True)
        )
        self.conv4 = nn.Sequential(
            nn.ConvTranspose2d(
                in_channels=128, out_channels=1, kernel_size=4,
                stride=2, padding=1, bias=False
            )
        )
        self.out = torch.nn.Tanh()

    def forward(self, x):
        # Project and reshape
        x = self.linear(x)
        x = x.view(x.shape[0], 1024, 4, 4)
        # Convolutional layers
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.conv4(x)
        # Apply Tanh
        return self.out(x)
```   
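
The `1024*4*4` that appears in both classes follows from the layer settings. Here is a quick arithmetic check in plain Python using the standard output-size formulas for convolutions and transposed convolutions (no dilation, no output padding). The 64 x 64 input size is an assumption: it is the size for which four stride-2 convolutions end at 4 x 4.

```python
def conv_output_size(size, kernel_size, stride, padding):
    """Spatial size after a standard convolution (no dilation)."""
    return (size + 2 * padding - kernel_size) // stride + 1


def conv_transpose_output_size(size, kernel_size, stride, padding):
    """Spatial size after a transposed convolution (no dilation or output padding)."""
    return (size - 1) * stride - 2 * padding + kernel_size


# Discriminator: four Conv2d layers with kernel_size=4, stride=2, padding=1.
size = 64                     # assumed input height/width
discriminator_sizes = [size]
for _ in range(4):
    size = conv_output_size(size, kernel_size=4, stride=2, padding=1)
    discriminator_sizes.append(size)

# Generator: four ConvTranspose2d layers with the same settings, starting at 4 x 4.
size = 4
generator_sizes = [size]
for _ in range(4):
    size = conv_transpose_output_size(size, kernel_size=4, stride=2, padding=1)
    generator_sizes.append(size)

print(discriminator_sizes)
print(generator_sizes)
```

Each discriminator layer halves the height and width, and each generator layer doubles them, so the two networks are mirror images of each other.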

## DCGAN Trick #2: Batch Normalization

Batch normalization is one of the most powerful and simple tricks to come along in the history of the training of deep neural networks. It was [introduced](https://arxiv.org/abs/1502.03167) by two researchers from Google in February 2015, just nine months before the [DCGAN paper](https://arxiv.org/abs/1511.06434) came out.

Regular neural network:

<img src="img/deep_neural_network.png" class="visual">

We know that normalizing the input to a neural network helps with training: the network doesn't have to "learn" that one feature is on a  scale from 0-1000 and another is on a scale from 0-1 and change its weights accordingly, for example.

The same thing applies further down in the network:

<img src="img/neural_network_weights_hidden.png" class="visual">

Intuitively, batch normalization works for the same reasons that normalizing data before feeding it into a neural network works.
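
The core of the operation is short enough to sketch in plain Python: for one hidden unit, subtract the batch mean and divide by the batch standard deviation. The activation values are made up, and this sketch leaves out the learned scale and shift parameters that a full batch normalization layer also applies.

```python
import math


def batch_normalize(values, epsilon=1e-5):
    """Shift and scale one feature's values across a batch to mean 0, variance ~1."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return [(v - mean) / math.sqrt(variance + epsilon) for v in values]


# One hidden unit's activations across a hypothetical batch of five examples.
activations = [120.0, 80.0, 100.0, 95.0, 105.0]
normalized = batch_normalize(activations)
print([round(v, 3) for v in normalized])
```

The small `epsilon` only guards against dividing by zero when every value in the batch is identical.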

# Semi-Supervised Learning

![](img/semi-supervised_gans.png)

Semi-supervised learning made it into Jeff Bezos' most recent letter to Amazon's shareholders!

> "...in the U.S., U.K., and Germany, we’ve improved Alexa’s spoken language understanding by more than 25% over the last 12 months through enhancements in Alexa’s machine learning components and the use of semi-supervised learning techniques. (These semi-supervised learning techniques reduced the amount of labeled data needed to achieve the same accuracy improvement by 40 times!)"

Semi-supervised learning is a third type of machine learning, in addition to supervised learning and unsupervised learning.

At a high level:

* The goal of supervised learning is to learn from _labeled_ data.
* The goal of unsupervised learning is to learn from _unlabeled_ data.

Semi-supervised learning asks the question: can you learn from a _combination_ of both labeled and unlabeled data? 

With GANs, the answer turns out to be yes! The paper that introduced this idea was [Improved Techniques for Training GANs](https://papers.nips.cc/paper/6125-improved-techniques-for-training-gans.pdf), from a team at OpenAI in 2016.

How does it work? The basic idea is: 

Let's say we're trying to classify MNIST digits. The discriminator will output a probability vector of an image belonging to one of ten classes: 

![](img/ssl_discriminator_1.png)

This is compared with the real values, turned into a loss vector, and backpropagated through the network to train it.

With semi-supervised learning, we simply add another class to this output:

![](img/ssl_discriminator_2.png)

Then, data points are fed through a classifier, with the following labels:

* _Real_, _labeled_ examples are given a label of 1 for the digit they are, 0 for all the other digits, and 1 for $P(real)$.

* _Fake_ examples generated by the generator are given labels of 0 across the board, including for $P(real)$.

* _Real_, *un*-labeled examples are given labels of 0 for all the digit classes and 1 for $P(real)$.

This allows the classifier to learn from real, labeled examples, as well as from both fake examples and real, **un**-labeled examples! 
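
The three labeling rules can be written as one small function. This sketch follows the scheme exactly as described above (ten digit slots followed by a $P(real)$ slot); the function name and layout are hypothetical, and the paper's own formulation may differ in its details.

```python
N_DIGITS = 10


def target_vector(is_real, digit=None):
    """Build the 11-element target: ten digit slots followed by P(real).

    digit is None for images with no digit label (fake or real-but-unlabeled).
    """
    digits = [0.0] * N_DIGITS
    if digit is not None:
        digits[digit] = 1.0
    return digits + [1.0 if is_real else 0.0]


real_labeled = target_vector(is_real=True, digit=3)
fake = target_vector(is_real=False)
real_unlabeled = target_vector(is_real=True)

for name, target in [("real, labeled", real_labeled),
                     ("fake", fake),
                     ("real, unlabeled", real_unlabeled)]:
    print(f"{name:16s} {target}")
```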

### An issue

The classifier in this case is acting like a discriminator at the same time it is acting like a classifier. It is trying to learn _both_:

* How to discriminate between fake images it is getting from the generator and real images
* How to classify digits as 1, 2, 3 etc.

This "hybrid discriminator-classifier" is potentially problematic:

As the authors of the paper point out:

> This approach introduces an interaction between G and our classifier that we do not fully understand yet

This leads them to one of their key innovations that allowed this procedure to work: **feature matching**.

## Semi-supervised learning trick: feature matching

## Idea

The last layer of a convolutional neural network, before the values get fed through a fully connected layer, is typically a layer with many features that have been detected.

![](img/gan_layer.png)

For example, in a convolutional discriminator like the DCGAN one described above, the last convolutional layer might be $2 \times 2 \times 128$: the result of 128 "features of features of features" that the network has learned. 

This is then "flattened" to a single layer of $2 \times 2 \times 128 = 512$ neurons, and these 512 neurons are then fed through a fully connected layer to produce an output of length 10.

<img src="./img/last_layer_fc.png" class="visual">

Their idea was to train the generator not simply on the discriminator's prediction of whether the image was real or fake, but on **how similar this 512-dimensional vector was between _real_ images fed through the discriminator and _fake_ images fed through the discriminator.** 

The delta between these two sets of _features_ is the loss used to train the generator.
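
One way to turn this "delta" into a number is the squared distance between the batch-averaged feature vectors of real and fake images. The sketch below assumes that form, and uses tiny 3-dimensional features (instead of 512) with made-up values so the arithmetic is easy to check by hand.

```python
def mean_features(feature_batch):
    """Average each feature across a batch of feature vectors."""
    n = len(feature_batch)
    return [sum(column) / n for column in zip(*feature_batch)]


def feature_matching_loss(real_features, fake_features):
    """Squared distance between the batch-average real and fake features."""
    real_mean = mean_features(real_features)
    fake_mean = mean_features(fake_features)
    return sum((r - f) ** 2 for r, f in zip(real_mean, fake_mean))


# Hypothetical discriminator features for two real and two fake images.
real = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
fake = [[2.0, 1.0, 0.0], [2.0, 1.0, 0.0]]

print(feature_matching_loss(real, fake))
```

The loss is zero when the fake batch produces the same average features as the real batch, even if no individual fake image matches any individual real one.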

### Aside: why does this work? 

Even the authors of the paper don't fully understand it:

 > "[The approach in this paper] introduces an interaction between G and [the hybrid classifier-discriminator] that we do not fully understand yet, but empirically we find that optimizing G using feature matching GAN works very well for semi-supervised learning, while training G using GAN with minibatch discrimination does not work at all. Here we present our empirical results using this approach; developing a full theoretical understanding of the interaction between D and G using this approach is left for future work."

Nevertheless, feature matching was the trick that led to breakthrough performance using semi-supervised learning to build powerful classifiers: 

Salimans et al. from OpenAI in mid-2016 used this approach to get just under a **6%** error rate on the [Street View House Numbers dataset](http://ufldl.stanford.edu/housenumbers/) _with just 1,000 labeled images_. Prior approaches achieved roughly **16%** error. 

State-of-the-art error, using the entire dataset of roughly 600,000 images, simply using supervised learning with very deep convolutional networks, is roughly **2%**. 

![](img/bezos.png)

## Getting better all the time

Generating more realistic, higher resolution images using GANs has been an active research area since their inception.

### Example: Progressive GANs from NVIDIA Research

<img src="./img/progressive_gan.gif">

#### NYT article on Progressive GANs:

<img src="./img/progressive_gans_nyt.png" class="visual">

Link to paper [here](http://research.nvidia.com/sites/default/files/publications/karras2017gan-paper.pdf) 

**But how do we _know_ these GANs are any good?**

## An aside: how do we score GANs?

How do we know that these samples are "good"? They "look good", but how can we quantify this?

> "Generative Adversarial Networks are generally regarded as producing the best samples [compared to other generative methods such as variational autoencoders] but there is no good way to quantify this."

--Ian Goodfellow, NIPS tutorial 2016

Since then, several methods have been proposed, the most prominent of which is the **Inception Score**:

### Inception Score

A way to score GANs also introduced in the [Improved Techniques for Training GANs](https://papers.nips.cc/paper/6125-improved-techniques-for-training-gans.pdf) paper.

Consider a GAN that was intended to generate images that come from one of a finite number of classes, such as MNIST digits.

### Inception score (Part 1)

Let's say that the generator generated some images, and those generated images were then fed through a pre-trained neural network, and the resulting probability distribution over classes for one of the images was:

`[0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]`

In other words, the pre-trained model has no idea which class this image should belong to. 

Is this a good GAN? Probably not.

The way we formalize this is that this resulting vector should have _low entropy_ - that is, _not_ an even distribution over class labels. 

### Inception score (Part 2)

There is another way we can use this pre-trained neural network. Let's say that for every image generated, we recorded the "most likely class" that the pre-trained network was predicting. And let's say that 91% of the time, the pre-trained network was classifying the images that our model was generating as zeros, so that the vector of frequencies of the "most likely class" looked like:

`[0.91, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]`

Would this be a very good GAN? Again, probably not.

The way we formalize this is that we want _the vector of the frequency of the predictions_ to have **high** entropy: that is, we *do* want balance across the classes that the generator generates.
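
Both parts combine into a single number. A common way to write the score is $\exp(H(\text{average prediction}) - \text{average } H(\text{per-image prediction}))$, which is high only when each image's prediction has low entropy (Part 1) and the predictions overall have high entropy (Part 2). The sketch below assumes that form and uses hand-made prediction vectors; a real evaluation would get them from a pre-trained Inception network.

```python
import math


def entropy(distribution):
    """Shannon entropy in nats, treating 0 * log(0) as 0."""
    return -sum(p * math.log(p) for p in distribution if p > 0)


def inception_score(predictions):
    """exp(H(average prediction) - average H(per-image prediction)).

    predictions: one class-probability vector per generated image.
    """
    n = len(predictions)
    marginal = [sum(column) / n for column in zip(*predictions)]
    average_conditional_entropy = sum(entropy(p) for p in predictions) / n
    return math.exp(entropy(marginal) - average_conditional_entropy)


n_classes = 10
uniform = [1.0 / n_classes] * n_classes

# Part 1 failure: the classifier cannot tell what any image is.
unsure = [uniform for _ in range(20)]

# Part 2 failure: every image is confidently classified as a zero.
one_hot_zero = [1.0] + [0.0] * (n_classes - 1)
collapsed = [one_hot_zero for _ in range(20)]

# Good case: confident predictions, spread evenly over all ten classes.
balanced = [[1.0 if c == i % n_classes else 0.0 for c in range(n_classes)]
            for i in range(20)]

for name, preds in [("unsure", unsure), ("collapsed", collapsed), ("balanced", balanced)]:
    print(name, round(inception_score(preds), 3))
```

Both failure cases get the minimum score of 1, while the balanced, confident case gets the maximum for ten classes, which is 10.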

Progressive GANs did indeed show a record Inception score on the CIFAR-10 dataset (32 x 32 images of 10 classes: cars, planes, and so on):

![](img/progressive_gans_inception.png)

# The future

What is the future of GANs? More generally, what is the future of Deep Learning? Can we predict it?

I asked Ian Goodfellow in a LinkedIn message if he was surprised by how quickly Progressive GANs were able to get close to photorealistic image quality on 1024x1024 images. He replied:

> I'm actually surprised at how slow it's been. Back in 2015 I thought that getting to photorealistic video was mostly going to be an engineering effort of scaling the model up and training on more data.

-Ian Goodfellow, in a LinkedIn message to me

### Ian Goodfellow's background

<img src="img/goodfellow_li_1.png" class="visual">

<img src="img/goodfellow_li_2.png" class="visual">

<img src="img/goodfellow_li_3.png" class="visual">

### Knowledge over time

<img src="img/knowledge_over_time.png" class="visual">

## What is the future of GANs?

![](img/question_mark.png)

Nobody knows - not even the experts!

## Next Steps

Other resources: 
* [This repo](https://github.com/eriklindernoren/PyTorch-GAN) has lots of great implementations of GANs in PyTorch. 
    * Simple [conditional GAN](https://github.com/eriklindernoren/PyTorch-GAN/blob/master/implementations/cgan/cgan.py)
    * Simple [Semi-Supervised Learning Example](https://github.com/eriklindernoren/PyTorch-GAN/blob/master/implementations/sgan/sgan.py)
    * Simple [DCGAN implementation](https://github.com/diegoalejogm/gans/blob/master/2.%20DC-GAN%20PyTorch.ipynb)
    

seth@sethweidman.com if you have any questions.

**Thanks!**

![](img/professional_headshot.png)