
# Pix2Pix

In the final notebook of this course, we explore a particularly amusing application of GANs: **Pix2Pix**, which goes back to a paper by [Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros](https://arxiv.org/abs/1611.07004).

<img id="gab" src="images/e2c.jpg"  width="700">
https://knowyourmeme.com/photos/1225207-edges2cats

Pix2Pix is a generative model in the sense that it employs deep learning to generate images. However, the generated images are not arbitrary: the model is trained so that its output is consistent with a given input. For instance, the above example is trained on a large number of cat images, each paired with a corresponding sketched outline.

<img id="gab" src="images/e2c.png"  width="700">
https://laughingsquid.com/edges2cats/

After the generative model is trained, it produces what it believes is the closest thing to a picture of a cat under the constraint of the sketched outline.

To explain the principle, we follow the [outstanding blog post](https://affinelayer.com/pix2pix/) by Christopher Hesse. We also discuss code snippets from the associated [GitHub repository](https://github.com/affinelayer/pix2pix-tensorflow). For the purpose of presentation, we often provide just pseudo-code and leave out technical details. Everybody is encouraged to look up the details in the [pix2pix.py-file](https://affinelayer.com/pix2pix/)!

Consider the following illustrations from [Christopher Hesse's blog post](https://affinelayer.com/pix2pix/):

<table>
    <tr>
        <td><img id="gab" src="images/p2pInp.png"  width="200"></td>
        <td><img id="gab" src="images/p2pOut.png"  width="200"></td>
    </tr>
    </table>
https://laughingsquid.com/edges2cats/

Here the task is to train a network for the purpose of image colorization. As training data, we give the color images -- called *targets* -- and their gray-scale transformations -- called *inputs*. The purpose of the generator is to transform an unseen input into a correctly colored *output*.
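
To make the notion of a training pair concrete, here is a minimal NumPy sketch of how an *input* could be derived from a *target*. The function name `make_training_pair` and the luminance weights (ITU-R BT.601) are assumptions for illustration; the preprocessing in the repository may differ.

```python
import numpy as np

def make_training_pair(color_image):
    """Return (input, target) for colorization from an RGB array in [0, 1]."""
    # Assumed luminance weights (ITU-R BT.601); not taken from the original code.
    weights = np.array([0.299, 0.587, 0.114])
    gray = color_image @ weights               # weighted channel sum, shape (H, W)
    return gray[..., np.newaxis], color_image  # input (H, W, 1), target (H, W, 3)

rng = np.random.default_rng(0)
target = rng.random((4, 4, 3))  # random stand-in for a real color photo
inp, tgt = make_training_pair(target)
print(inp.shape, tgt.shape)
```

The generator only ever sees the one-channel input, while the three-channel target is used exclusively in the loss.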

## Training

Before discussing the precise architecture of discriminator and generator, we quickly go over the training mechanism.

Essentially, the training follows the classical GAN paradigm. The discriminator is trained via cross-entropy, whereas the generator uses the standard adaptation of minimizing `-log(predict_fake)` instead of `log(1 - predict_fake)`, which avoids the vanishing-gradient problem.
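
The effect of this adaptation can be checked with a few lines of plain Python. The sketch below assumes that the discriminator output is a sigmoid of a logit `z` (a hypothetical name) and compares the gradient of the original generator objective, log(1 - D), with that of the adapted objective, -log(D), both taken with respect to `z`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_saturating(z):
    # d/dz log(1 - sigmoid(z)) = -sigmoid(z)
    return -sigmoid(z)

def grad_non_saturating(z):
    # d/dz [-log(sigmoid(z))] = -(1 - sigmoid(z))
    return -(1.0 - sigmoid(z))

z = -6.0  # the discriminator is confident that the sample is fake
print(grad_saturating(z), grad_non_saturating(z))
```

When the discriminator easily rejects a generated sample, the first gradient is close to zero, whereas the second stays close to -1, so the generator keeps receiving a useful training signal.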

We stress, however, that in addition to the adversarial loss, the generator's objective also contains an L1 reconstruction loss, which enforces that the generated output is close to the target.

In [6]:
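import tensorflow as tf  # TensorFlow 1.x API (tf.log)

def create_model(inputs, targets):
    # create generator; it emits as many channels as the targets have
    outputs = create_generator(inputs, targets.get_shape()[-1])

    # create discriminators: one sees real pairs, the other generated pairs
    # (both copies have to share the same weights)
    predict_real = create_discriminator(inputs, targets)
    predict_fake = create_discriminator(inputs, outputs)

    # discriminator loss: cross-entropy, real pairs labeled 1, fake pairs labeled 0
    discrim_loss = tf.reduce_mean(-(tf.log(predict_real) + tf.log(1 - predict_fake)))

    # generator loss: adversarial term plus L1 reconstruction term
    # (pix2pix.py additionally weights the two terms and guards the logs with a small epsilon)
    gen_loss_GAN = tf.reduce_mean(-tf.log(predict_fake))
    gen_loss_L1 = tf.reduce_mean(tf.abs(targets - outputs))
    gen_loss = gen_loss_GAN + gen_loss_L1

    return discrim_loss, gen_loss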
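
To see the loss formulas of `create_model` at work, the following sketch re-implements them in NumPy and evaluates them on toy numbers. The helper name `pix2pix_losses`, the constant `EPS` and the input values are assumptions for illustration only; the two generator terms are summed without weights, as in the pseudo-code.

```python
import numpy as np

EPS = 1e-12  # assumed small constant guarding against log(0)

def pix2pix_losses(predict_real, predict_fake, targets, outputs):
    # Discriminator: cross-entropy over real and generated pairs.
    discrim_loss = np.mean(-(np.log(predict_real + EPS) + np.log(1 - predict_fake + EPS)))
    # Generator: non-saturating adversarial term plus L1 reconstruction term.
    gen_loss_GAN = np.mean(-np.log(predict_fake + EPS))
    gen_loss_L1 = np.mean(np.abs(targets - outputs))
    return discrim_loss, gen_loss_GAN + gen_loss_L1

targets = np.array([0.2, 0.8])
outputs = np.array([0.3, 0.6])
d_loss, g_loss = pix2pix_losses(np.array([0.9]), np.array([0.1]), targets, outputs)
print(d_loss, g_loss)
```

In this toy situation the discriminator is doing well (small `d_loss`), and the generator is penalized both for being detected and for the pixel-wise deviation from the target.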

## Discriminator

We provide the discriminator with an input and a target. Its task is to decide whether the input and the target form a real data pair or whether the target was produced by the generator.

To achieve this goal, the discriminator is designed as a deep CNN. As discussed in the notebook on GANs, it shows some peculiarities in comparison to CNNs used in classification. For instance, stride-2 convolutions replace max-pooling, and leaky ReLUs are used instead of ordinary ReLUs.
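
Both peculiarities are easy to illustrate in NumPy. The sketch below defines a leaky ReLU with slope 0.2 and computes the spatial size after a number of stride-2 convolutions. The helper `downsample_size`, the "same"-style padding and the image size 256 are assumptions made for this example.

```python
import numpy as np

def lrelu(x, a=0.2):
    # Leaky ReLU: identity for positive inputs, slope a for negative ones.
    return np.where(x > 0, x, a * x)

def downsample_size(size, n_layers, stride=2):
    # Assuming "same"-style padding, each strided convolution divides the
    # spatial size by the stride (rounding up).
    for _ in range(n_layers):
        size = -(-size // stride)  # ceiling division
    return size

print(lrelu(np.array([-1.0, 0.0, 2.0])))
print(downsample_size(256, 3))
```

Unlike max-pooling, the strided convolution learns how to downsample, and the leaky ReLU keeps a nonzero gradient for negative activations.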

In [2]:
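def create_discriminator(discrim_inputs, discrim_targets, n_layers=3, ndf=64):
    # n_layers and ndf are illustrative defaults; see pix2pix.py for the exact setup
    # stack the input and the (real or generated) target along the channel axis
    input = tf.concat([discrim_inputs, discrim_targets], axis=3)
    layers = [input]

    for i in range(n_layers):
        # the number of channels doubles per layer, capped at 8 * ndf
        out_channels = ndf * min(2 ** i, 8)
        convolved = discrim_conv(layers[-1], out_channels, stride=2)
        normalized = batchnorm(convolved)
        rectified = lrelu(normalized, 0.2)
        layers.append(rectified)

    # final 1-channel convolution with sigmoid, so that the output lies in (0, 1)
    convolved = discrim_conv(layers[-1], out_channels=1, stride=1)
    layers.append(tf.sigmoid(convolved))

    return layers[-1]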

## Generator

It is the task of the generator to convert an uncolored image into a colored image. Hence, both input and output are images of the same spatial size, so it makes sense to rely on a [U-net architecture](https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/u-net-architecture.png), which is typically used in image segmentation.

<img id="gab" src="images/unet.png"  width="700">
https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/

The first half of the U-net is also known as the *encoder*, whereas the second half is typically called the *decoder*. Skip connections between each encoder layer and the corresponding decoder layer transfer information directly.
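
A skip connection amounts to a concatenation along the channel axis. The following NumPy sketch shows the effect on tensor shapes; the array names and the shape `(1, 64, 64, 128)` are hypothetical and only serve as an illustration.

```python
import numpy as np

# Hypothetical activations in NHWC layout (batch, height, width, channels).
encoder_feature = np.zeros((1, 64, 64, 128))  # saved on the way down
decoder_feature = np.ones((1, 64, 64, 128))   # produced on the way up

# A skip connection stacks both tensors along the channel axis.
merged = np.concatenate([decoder_feature, encoder_feature], axis=3)
print(merged.shape)
```

The spatial size is unchanged, but the next decoder layer sees twice as many channels: the upsampled features together with the fine-grained encoder features of the same resolution.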

In [7]:
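def create_generator(generator_inputs, generator_outputs_channels):
    # ngf, encoder_specs (list of out_channels) and decoder_specs (list of
    # (out_channels, dropout) pairs) are hyperparameters, see pix2pix.py
    # first encoder layer: plain convolution without activation or batchnorm
    layers = [gen_conv(generator_inputs, ngf)]

    # encoder: every layer halves the spatial resolution
    for out_channels in encoder_specs:
        rectified = lrelu(layers[-1], 0.2)
        convolved = gen_conv(rectified, out_channels)
        output = batchnorm(convolved)
        layers.append(output)

    # decoder: every layer doubles the spatial resolution
    num_encoder_layers = len(layers)
    for decoder_layer, (out_channels, dropout) in enumerate(decoder_specs):
        skip_layer = num_encoder_layers - decoder_layer - 1

        if decoder_layer == 0:
            # the innermost layer is fed directly by the encoder, no skip connection
            input = layers[-1]
        else:
            input = tf.concat([layers[-1], layers[skip_layer]], axis=3)

        rectified = tf.nn.relu(input)
        output = gen_deconv(rectified, out_channels)
        output = batchnorm(output)
        if dropout > 0.0:
            output = tf.nn.dropout(output, keep_prob=1 - dropout)

        layers.append(output)

    # final layer: map to the requested number of channels, squashed to [-1, 1]
    input = tf.concat([layers[-1], layers[0]], axis=3)
    output = tf.tanh(gen_deconv(tf.nn.relu(input), generator_outputs_channels))
    layers.append(output)

    return layers[-1]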
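
The index arithmetic `skip_layer = num_encoder_layers - decoder_layer - 1` in the decoder loop pairs the layers from the inside out. The small sketch below lists these pairs; the helper `skip_pairs` and the layer counts 8 and 7 are assumptions chosen for illustration.

```python
def skip_pairs(num_encoder_layers, num_decoder_layers):
    """List (decoder_layer, skip_layer) index pairs as computed in the decoder loop."""
    return [(d, num_encoder_layers - d - 1) for d in range(num_decoder_layers)]

print(skip_pairs(8, 7))
```

For decoder layer 0 the index points at the last encoder layer, which is the decoder's own input. Every later decoder layer is matched with the encoder layer of the same spatial resolution.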