# Generative Adversarial Networks

Right now, _generative adversarial networks_, or GANs, are one of the hottest topics in deep learning. A method for unsupervised learning, GANs have shown tremendous potential for learning complicated distributions (such as natural images). In practice, though, GANs are extremely difficult to train. After reading several tutorials and trying out several GAN implementations with mixed results, I decided to write my own GAN implementation. This was extremely frustrating. Below, I share the code for the two networks that make up the GAN, as well as most of the things I tried before I got something that worked to my satisfaction.

A GAN consists of two neural networks, $D$ and $G$. The network $G$ is called a _generator_, and the network $D$ is called a _discriminator_. In the simplest case, our data consists of a set $\mathcal{D}$ of unlabeled data points (for example, images). The goal of the generator is to take random noise as an input and produce an output that "looks real," as if it came from $\mathcal{D}$. The goal of the discriminator is to take an input and decide whether it came from a generator network or a real data set. We train these networks together, with the hope (often misplaced) that each network will force the other to improve, with the end result that the generator learns to generate highly realistic outputs that consistently "fool" the discriminator.



In [None]:
import numpy as np
import tensorflow as tf
%pylab inline

To demonstrate how GANs work, we'll use the MNIST dataset. Our generator will learn how to generate small pictures of handwritten digits, and our discriminator will try to distinguish between real images and fakes. TensorFlow makes working with MNIST pretty trivial. The only thing worth noting in the next cell is that we use the "one hot" encoding for our labels. That means that if the label for our image is 4, then the label is represented as a vector of length 10 that contains zeros everywhere except for a one in the fifth position (assuming we put 0 at the beginning of the vector).

In [None]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
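
The one-hot encoding described above can be sketched in a few lines of plain Python. The helpers `one_hot` and `from_one_hot` are hypothetical names I made up for illustration; they are not part of TensorFlow or of the MNIST loader.

```python
def one_hot(label, num_classes=10):
    """Return a list of zeros with a single 1.0 at index `label`."""
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec

def from_one_hot(vec):
    """Recover the integer label: the index of the largest entry."""
    return max(range(len(vec)), key=lambda i: vec[i])

encoded = one_hot(4)
print(encoded)                # the 1.0 sits in the fifth position (index 4)
print(from_one_hot(encoded))
```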

The two variables below mostly control how long things take. Playing with extreme values for the `batch_size` led mostly to frustration. A nice, small power of two seems safe here.

The number of training steps can be very large if you'd like (more passes through the data), but you should get decent results with the default settings I have here.

In [None]:
n_train_steps = 100000
batch_size = 128

### The Generator

The goal of our _generator_ network is to take random noise as its input and to produce an image.

Our generator will have a very simple structure. The input will be a 100-dimensional noise vector $z$, and we'll use three layers. The final output of the network will have 784 components. The hope is that after training the generator network will produce outputs that look like MNIST images. 

I encountered a number of frustrations here, and there are a few things to explore:

1. The leaky relu seems like a good nonlinearity. I tried tanh, sigmoid, and relu. I didn't notice much difference between relu and leaky relu. The tanh and sigmoid seemed to make things harder to train.
2. I don't know if that 0.2 in the leaky relu makes much difference - might be worth playing with.
3. I tried a range of shapes for the variables. It looks like 1200 works nicely. I'm sure other values do too, but I didn't notice anything substantial enough to document.
4. The variable initialization seems to matter a lot. I tried the following: a random uniform with "large" bounds U(-1.0, 1.0), a standard normal N(0.0, 1.0), Xavier initialization, and random uniform with bounds from Ian Goodfellow's pylearn2 code. For both the "large" uniform and the standard normal, I had to make my learning rate very, very small to get anything even remotely resembling reasonable behavior. The Xavier initialization seemed ok-ish, but ultimately just using the bounds from the original code was the only thing that seemed to produce consistently good results. The smallness of the bounds surprised me.
5. For the output of the network, I tried both sigmoid and tanh. When I tried tanh I also messed around with normalizing the inputs to lie in the range [-1,1]. This didn't appear to help much, and sigmoid ended up working just fine with all the other hyperparameter choices.
6. I tried initializing the biases to 0.0 and 0.1. Using 0.1 is supposed to be better for relu because it lets more gradient info propagate back into the network (a neat tip from the Deep Learning book). I don't know whether that tip is as helpful with leaky relu, but when I started getting reasonable results I froze my configuration and the 0.1 initialization made it to the end :)
7. I have not yet played with the dimensionality of the input noise. I went with 100 because that seems to be a common choice in a lot of what I've read. I would love to know if this makes a difference.
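8. I set up L2 regularization on the weights, thinking it would be helpful. Note, though, that in TensorFlow 1.x a `regularizer` passed to `tf.get_variable` only adds the penalty to the `tf.GraphKeys.REGULARIZATION_LOSSES` collection, and the objectives defined later in this notebook never add those terms, so as written the penalty does not enter training. It may be worth wiring it in (or dropping it) and comparing.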
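
To put the small bounds from item 4 in context, here is a rough comparison against the Xavier (Glorot) uniform limit for the generator's layer shapes. This is a sketch under one assumption: that the uniform variant of Xavier initialization draws from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)). The helper `xavier_uniform_limit` is a hypothetical name, not a TensorFlow function.

```python
import math

def xavier_uniform_limit(fan_in, fan_out):
    """Assumed Glorot/Xavier uniform bound: sqrt(6 / (fan_in + fan_out))."""
    return math.sqrt(6.0 / (fan_in + fan_out))

# Weight shapes of the three generator layers in this notebook.
generator_shapes = [(100, 1200), (1200, 1200), (1200, 784)]

for fan_in, fan_out in generator_shapes:
    limit = xavier_uniform_limit(fan_in, fan_out)
    print("%4d x %4d: Xavier limit %.4f vs. 0.05 used here"
          % (fan_in, fan_out, limit))
```

Under that assumption, the hand-picked 0.05 is in the same range as the Xavier limits for these shapes, so for the generator the two schemes may differ less than the "small bounds" impression suggests.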

In [None]:
def lrelu(x):
    return tf.maximum(x, 0.2 * x)

z = tf.placeholder(tf.float32, shape=(None, 100))
g_w1 = tf.get_variable("g_w1", [100,1200], initializer=tf.random_uniform_initializer(-0.05, 0.05),
                      regularizer=tf.contrib.layers.l2_regularizer(0.8))
g_b1 = tf.get_variable("g_b1", [1200], initializer=tf.constant_initializer(0.1))
g_w2 = tf.get_variable("g_w2", [1200,1200], initializer=tf.random_uniform_initializer(-0.05, 0.05),
                       regularizer=tf.contrib.layers.l2_regularizer(0.8))
g_b2 = tf.get_variable("g_b2", [1200], initializer=tf.constant_initializer(0.1))
g_w3 = tf.get_variable("g_w3", [1200, 784], initializer=tf.random_uniform_initializer(-0.05, 0.05),
                       regularizer=tf.contrib.layers.l2_regularizer(0.8))
g_b3 = tf.get_variable("g_b3", [784], initializer=tf.constant_initializer(0.1))

g_params = [g_w1, g_b1, g_w2, g_b2, g_w3, g_b3]

def generator(z):
    g_y1 = lrelu(tf.matmul(z, g_w1) + g_b1)
    g_y2 = lrelu(tf.matmul(g_y1, g_w2) + g_b2)
    G = tf.nn.sigmoid(tf.matmul(g_y2, g_w3) + g_b3)
    return G
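
The `lrelu` helper relies on a small trick: for a slope between 0 and 1, `max(x, alpha * x)` is the same function as the usual piecewise leaky relu. The scalar sketch below checks that in plain Python, and shows that the trick stops working for a slope above 1, which is worth keeping in mind when playing with the 0.2. The `alpha` parameter and `lrelu_piecewise` are names added here for illustration.

```python
def lrelu(x, alpha=0.2):
    """Scalar leaky ReLU written the same way as the notebook: max(x, alpha * x)."""
    return max(x, alpha * x)

def lrelu_piecewise(x, alpha=0.2):
    """Textbook piecewise definition, for comparison."""
    return x if x >= 0 else alpha * x

# For 0 < alpha < 1 the two definitions agree everywhere.
for value in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    assert lrelu(value) == lrelu_piecewise(value)
    print(value, lrelu(value))

# With alpha > 1 the max() form no longer matches the piecewise definition.
print(lrelu(-1.0, alpha=1.5), lrelu_piecewise(-1.0, alpha=1.5))
```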

With the `noise_prior`, the choices to explore are the distribution used to generate the noise and the parameters of that distribution. The two obvious choices are the normal and uniform distribution. It seemed like a standard normal was a little more consistent in getting reasonable results.

In [None]:
def noise_prior(batch_size, dim):
    return np.random.normal(0.0, 1.0, size=(batch_size, dim))
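
For anyone who wants to try the uniform alternative mentioned above, a drop-in variant might look like the sketch below. The function name `noise_prior_uniform` and the bounds of -1 and 1 are my assumptions for illustration; they are not the values I tested with.

```python
import numpy as np

def noise_prior_uniform(batch_size, dim, low=-1.0, high=1.0):
    """Uniform alternative to the normal prior; the bounds are an assumption."""
    return np.random.uniform(low, high, size=(batch_size, dim))

sample = noise_prior_uniform(128, 100)
print(sample.shape, sample.min(), sample.max())
```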

### The Discriminator

The architecture of the discriminator is similar to that of the generator. The output can be interpreted as the probability that the input came from the data rather than from the generator (hence the sigmoid at the end). Other items of note:

1. The hidden layers are substantially narrower in the discriminator than in the generator. This appears to matter. There is some talk in a paper or two about the need to make the discriminator slightly weaker than the generator. I'm not sure if this is why the narrower layers are helpful.
2. The story with the initializers is similar to the case of the generator, but notice that the bounds are an order of magnitude smaller. This is taken from Goodfellow's pylearn2 code.
3. Same story here with the L2 regularization as we saw in the generator case. 
4. Contrary to the suggestion in the Deep Learning book, I didn't use dropout. I tried it, but it didn't appear to help. I may not have applied it consistently enough, though. Worth experimenting.

In [None]:
x = tf.placeholder(tf.float32, shape=(None, 784))
d_w1 = tf.get_variable("d_w1", [784,200], initializer=tf.random_uniform_initializer(-0.005, 0.005),
                       regularizer=tf.contrib.layers.l2_regularizer(0.8))
d_b1 = tf.get_variable("d_b1", [200], initializer=tf.constant_initializer(0.1))
d_w2 = tf.get_variable("d_w2", [200,200], initializer=tf.random_uniform_initializer(-0.005,0.005),
                       regularizer=tf.contrib.layers.l2_regularizer(0.8))
d_b2 = tf.get_variable("d_b2", [200], initializer=tf.constant_initializer(0.1))
d_w3 = tf.get_variable("d_w3", [200,1], initializer=tf.random_uniform_initializer(-0.005,0.005),
                       regularizer=tf.contrib.layers.l2_regularizer(0.8))
d_b3 = tf.get_variable("d_b3", [1], initializer=tf.constant_initializer(0.1))

d_params = [d_w1, d_b1, d_w2, d_b2, d_w3, d_b3]

def discriminator(x):
    d_y1 = lrelu(tf.matmul(x, d_w1) + d_b1)
    d_y2 = lrelu(tf.matmul(d_y1, d_w2) + d_b2)
    D = tf.nn.sigmoid(tf.matmul(d_y2, d_w3) + d_b3)
    return D
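
To get a feel for how much smaller the discriminator is, we can count trainable parameters for both networks using the layer sizes from the code in this notebook. `dense_param_count` is a hypothetical helper written for this comparison.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    pairs = zip(layer_sizes[:-1], layer_sizes[1:])
    return sum(n_in * n_out + n_out for n_in, n_out in pairs)

generator_params = dense_param_count([100, 1200, 1200, 784])
discriminator_params = dense_param_count([784, 200, 200, 1])

print(generator_params)        # 2503984
print(discriminator_params)    # 197401
print(generator_params / float(discriminator_params))
```

So the generator has roughly twelve to thirteen times as many parameters as the discriminator, which is one concrete sense in which the discriminator here is the "weaker" network.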


Now we have to define the outputs we're going to train. We define `G` by passing a placeholder for random noise into the `generator` function. We then define two terms for the discriminator: one that takes a placeholder for data, and one that takes output from the generator. Notice that even though we call `discriminator` twice, we are reusing the TensorFlow variables that make up the weights of the network, so `D_real` and `D_fake` share weights.

In [None]:
G = generator(z)
D_real = discriminator(x)
D_fake = discriminator(G)

### The Training Objectives

We have an objective for $D$ and an objective for $G$. The objective for $D$ consists of a term that rewards correctly classifying actual data samples, and a term that rewards correctly picking out the fakes generated by $G$.

The objective for $G$ is designed to reward it for fooling $D$. We follow the heuristic of minimizing $-E[\log(D_{fake})]$ rather than $E[\log(1-D_{fake})]$, which is the generator's term in the original minimax formulation. This consistently leads to drastically better results.
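
One way to see why the heuristic might help is to compare gradients. Write $D_{fake} = \sigma(a)$, where $a$ is the discriminator's pre-sigmoid output (a symbol introduced here for illustration). Then the derivative of the minimax term $\log(1-\sigma(a))$ with respect to $a$ is $-\sigma(a)$, while the derivative of the heuristic term $-\log(\sigma(a))$ is $-(1-\sigma(a))$. The sketch below evaluates both in plain Python; it only looks at this one scalar derivative, not at the full network.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    """d/da of log(1 - sigmoid(a)), the minimax generator term."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da of -log(sigmoid(a)), the heuristic term used in this notebook."""
    return -(1.0 - sigmoid(a))

# Very negative a means the discriminator confidently rejects the fake.
for a in [-6.0, -2.0, 0.0, 2.0]:
    print("a=%5.1f  D_fake=%.4f  saturating=%.4f  non-saturating=%.4f"
          % (a, sigmoid(a), grad_saturating(a), grad_non_saturating(a)))
```

When the discriminator is winning (very negative $a$), the minimax term passes almost no gradient back to the generator, whereas the heuristic term keeps a gradient of magnitude close to 1.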

In [None]:
obj_d = -tf.reduce_mean(tf.log(D_real) + tf.log(1-D_fake))
obj_g = -tf.reduce_mean(tf.log(D_fake))
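
As a reference point when reading the values of `obj_d` and `obj_g`: if the discriminator cannot tell real from fake and outputs 0.5 for every input, the two objectives take fixed values. The plain-Python sketch below mirrors the two formulas on lists of probabilities; the function names and the `undecided` list are illustrative only.

```python
import math

def obj_d(d_real, d_fake):
    """Mean of -(log D_real + log(1 - D_fake)) over paired probabilities."""
    terms = [math.log(r) + math.log(1.0 - f) for r, f in zip(d_real, d_fake)]
    return -sum(terms) / len(terms)

def obj_g(d_fake):
    """Mean of -log D_fake."""
    return -sum(math.log(f) for f in d_fake) / len(d_fake)

undecided = [0.5] * 4
print(obj_d(undecided, undecided))  # 2 * log(2), about 1.386
print(obj_g(undecided))             # log(2), about 0.693
```

Values of `obj_d` well below about 1.386 suggest the discriminator is winning; values of `obj_g` well below about 0.693 suggest the generator is.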

The actual optimization follows. There are a lot of magic numbers here. Quite honestly, I'm not even sure if they're necessary. They are based on the approach in Goodfellow's code. There are a few key points:

1. The learning rate decay is based on code from pylearn2 for exponential decay. The magic numbers passed to `exp_decay` are taken from Goodfellow's code. I did not play with these numbers at all.
2. The momentum adjustor is again based on pylearn2 code. The momentum jumps to its maximum value pretty early in training the way I have it implemented. There's probably room to play with this a bit, though I didn't do that at all.
3. We use `MomentumOptimizer` for both the generator and discriminator. I played with other optimizers, but this led to a lot of wasted time and frustration. In particular, Adam was not as helpful here as I have found it in other settings. Don't know why.
4. The scalar summaries let you see how the learning rates and momentum adjustments change over time.

In [None]:
def exp_decay(initial_rate, step, decay_factor, min_lr):
    return tf.maximum(initial_rate / tf.pow(decay_factor, tf.to_float(step)), min_lr)

def momentum_adjustor(initial_momentum, step, final_momentum, saturation_point):
    m = initial_momentum + (final_momentum - initial_momentum) * (tf.to_float(step) / saturation_point)
    return tf.minimum(m, final_momentum)

time_step = tf.placeholder(tf.int32)

d_batch = tf.Variable(0, trainable=False)
d_learning_rate = exp_decay(0.01, time_step, 1.000004, 0.000001)
d_momentum = momentum_adjustor(0.5, d_batch, 0.7, 250)
opt_d = tf.train.MomentumOptimizer(d_learning_rate, d_momentum).minimize(obj_d, 
                                                                         var_list=d_params, 
                                                                         global_step=d_batch)

g_batch = tf.Variable(0, trainable=False)
g_learning_rate = exp_decay(0.01, time_step, 1.000004, 0.000001)
g_momentum = momentum_adjustor(0.5, g_batch, 0.7, 250)
opt_g = tf.train.MomentumOptimizer(g_learning_rate, g_momentum).minimize(obj_g, 
                                                                         var_list=g_params,
                                                                         global_step=g_batch)

d_momentum_summary = tf.summary.scalar('d_momentum', d_momentum)
g_momentum_summary = tf.summary.scalar('g_momentum', g_momentum)
d_learning_rate_summary = tf.summary.scalar('d_learning_rate', d_learning_rate)
g_learning_rate_summary = tf.summary.scalar('g_learning_rate', g_learning_rate)
obj_d_summary = tf.summary.scalar('obj_d', obj_d)
obj_g_summary = tf.summary.scalar('obj_g', obj_g)
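
To see what the magic numbers actually do, the two schedules can be mirrored in plain Python and evaluated at a few steps. This is a scalar re-implementation of `exp_decay` and `momentum_adjustor` for inspection only; it does not touch the TensorFlow graph.

```python
def exp_decay(initial_rate, step, decay_factor, min_lr):
    """Exponential decay with a floor, mirroring the TensorFlow version."""
    return max(initial_rate / decay_factor ** step, min_lr)

def momentum_adjustor(initial_momentum, step, final_momentum, saturation_point):
    """Linear ramp from initial to final momentum, capped at the final value."""
    ramp = float(step) / saturation_point
    m = initial_momentum + (final_momentum - initial_momentum) * ramp
    return min(m, final_momentum)

for step in [0, 125, 250, 50000, 100000]:
    lr = exp_decay(0.01, step, 1.000004, 0.000001)
    momentum = momentum_adjustor(0.5, step, 0.7, 250)
    print("step %6d  learning rate %.6f  momentum %.3f" % (step, lr, momentum))
```

With these settings the learning rate only drifts from 0.01 down to roughly 0.0067 over 100000 steps, so the floor of 0.000001 is never reached in this run, and the momentum hits its cap of 0.7 after just 250 steps.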

### The Training Process

In [None]:
sess=tf.InteractiveSession()

merged = tf.summary.merge_all()
train_writer = tf.summary.FileWriter("./summaries/train", sess.graph)

tf.global_variables_initializer().run()

The training process is pretty standard. Because we're training two networks, we do have a few choices to make. Here are some of the things I had to work through:

1. We set `n_train_steps` earlier in the notebook. This isn't the best way to ensure we're working through all of the training data, but for the purposes of seeing a GAN work, it gets the job done.
2. We have to choose when and where we generate the random noise. We can either do it once and use the same noise for both networks, or we can generate a new noise vector for each of the two training steps. I tried both ways, and it seems to matter very little which choice you make here.
3. We have the option of taking multiple steps with the discriminator and the generator. You could (in principle) avoid having one network overpower the other by letting the weaker network train until it was strong enough. This doesn't reliably work. I just take one step for each of the networks.
4. With the configuration I have in this notebook and a GTX 980M on my laptop, training takes about 15 minutes. You can make `n_train_steps` smaller if this is too long, but if you go lower than about 50000 you're much less likely to get anything sensible out of the setup.
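
Regarding item 1, it can help to translate `n_train_steps` into passes over the data. The sketch below assumes the training split holds 55,000 images (what I believe `read_data_sets` gives by default after holding out 5,000 for validation); check `mnist.train.num_examples` to confirm. Both helper names are hypothetical.

```python
def epochs_covered(n_train_steps, batch_size, n_train_examples):
    """Approximate number of passes through the training set."""
    return n_train_steps * batch_size / float(n_train_examples)

def steps_for_epochs(n_epochs, batch_size, n_train_examples):
    """Number of steps needed for a whole number of passes."""
    steps_per_epoch = -(-n_train_examples // batch_size)  # ceiling division
    return n_epochs * steps_per_epoch

# Assumed training-set size: 55,000 images.
print(epochs_covered(100000, 128, 55000))
print(steps_for_epochs(100, 128, 55000))
```

Under that assumption, the default settings amount to a bit over 230 passes through the training data.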

In [None]:
for i in range(n_train_steps):
    x_data, t_data = mnist.train.next_batch(batch_size)
    noise = noise_prior(batch_size, 100)

    _, summary = sess.run([opt_d, merged], {x : x_data, z : noise, time_step : i})
    train_writer.add_summary(summary, i)
    sess.run([opt_g], {z : noise, time_step : i})
    
    # Report progress as a fraction of the total, ten times over the run.
    if i % (n_train_steps // 10) == 0:
        print(float(i) / n_train_steps)

### Visualizing The Result

The code below just runs the generator to produce an image of a random digit. Two items of note:

1. Very often the output looks like a digit. However, often the image looks like a smudge. It may be possible to reduce this with more training, but I haven't checked.
2. You'll notice that the network has a tendency to "miss" certain digits. In particular, it seems to generate images of 2 much less often than other digits. This is an example of a known issue with GANs, usually discussed under the name mode collapse. More training is probably the answer.

In [None]:
out_im = sess.run(G, {z : noise_prior(1, 100)})
out_im.shape = (28,28)
imshow(out_im, cmap="gray")

### More Reading

I was only able to write this notebook because of some amazing blog posts and code repositories. In no particular order, I found the following helpful:

1. [An introduction to Generative Adversarial Networks (with code in TensorFlow)](http://blog.aylien.com/introduction-generative-adversarial-networks-code-tensorflow/)
2. [Generative Adversarial Nets in TensorFlow (Part I)](http://blog.evjang.com/2016/06/generative-adversarial-nets-in.html)
3. [Image Completion with Deep Learning in TensorFlow](https://bamos.github.io/2016/08/09/deep-completion/)
4. [Generative Adversarial Nets in TensorFlow](http://wiseodd.github.io/techblog/2016/09/17/gan-tensorflow/)
5. [How to Train a GAN? Tips and tricks to make GANs work](https://github.com/soumith/ganhacks)
6. [goodfeli / adversarial](https://github.com/goodfeli/adversarial)