<h1>CS4618: Artificial Intelligence I</h1>
<h1>Neural Network Architectures</h1>
<h2>
    Derek Bridge<br>
    School of Computer Science and Information Technology<br>
    University College Cork
</h2>

<h1>Initialization</h1>
$\newcommand{\Set}[1]{\{#1\}}$ 
$\newcommand{\Tuple}[1]{\langle#1\rangle}$ 
$\newcommand{\v}[1]{\pmb{#1}}$ 
$\newcommand{\cv}[1]{\begin{bmatrix}#1\end{bmatrix}}$ 
$\newcommand{\rv}[1]{[#1]}$ 
$\DeclareMathOperator{\argmax}{arg\,max}$ 
$\DeclareMathOperator{\argmin}{arg\,min}$ 
$\DeclareMathOperator{\dist}{dist}$
$\DeclareMathOperator{\abs}{abs}$

<h1>Acknowledgments</h1>
<ul>
    <li>The diagrams and code are based on diagrams and code in: A. G&eacute;ron: 
        <i>Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow (2nd edn)</i>, O'Reilly, 2019
    </li>
</ul>

<h1>Warning</h1>
<ul>
    <li>The code does not run. I'm providing snippets only.</li>
</ul>

<h1>Introduction</h1>
<ul>
    <li>Mostly, we've been looking at quite simple neural network architectures: stacks of layers.
        <ul>
            <li>In Keras, <code>Sequential</code> models are ideal for this.</li>
        </ul>
    </li>
    <li>But the architecture could be more complicated.
        <ul>
            <li>E.g. you might want multiple inputs:
                <img src="images/mult_inputs_nn.png" /> 
                This might be useful if your dataset comprises images and text, which might
                best be handled by different subnetworks.
            </li>
            <li>E.g. you might want multiple outputs:
                <img src="images/mult_outputs_nn.png" />
                This might be useful if, from pictures of faces, you want to classify the expression
                (smiling, surprised) but you also want to classify by eyewear (wearing glasses or not).
            </li>
            <li>In Keras, the <code>Model</code> class makes this possible.</li>
        </ul>
    </li>
    <li>We'll look at some examples. We won't show all the code (just snippets), so don't try to execute
        the snippets. And you don't need to learn the code.
    </li>
</ul>

<h1>Classification &amp; Localization, and Object Detection</h1>
<ul>
    <li>You may want to <b>locate</b> and <b>classify</b> the main object in a picture.
        <ul>
            <li>This is a classification task: what kind of object is in the image (cat, dog, &hellip;)</li>
            <li>It is also a regression task. In fact, you will want to predict four numbers that describe
                a bounding box around the object:
                <ul>
                    <li>$x$-coordinate of the centre of the bounding box;</li>
                    <li>$y$-coordinate of the centre of the bounding box;</li>
                    <li>width of the bounding box;</li>
                    <li>height of the bounding box.</li>
                </ul>
            </li>
        </ul>
    </li>
    <li>Hence, this requires a network with multiple outputs. How many outputs?</li>
</ul>
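
<p>
    To make the four regression targets concrete, here is a minimal sketch that converts a bounding box from
    the (centre, width, height) description above into corner coordinates. The function name
    <code>centre_to_corners</code> and the example numbers are my own, purely for illustration.
</p>

```python
def centre_to_corners(cx, cy, width, height):
    """Convert (centre x, centre y, width, height) to (x_min, y_min, x_max, y_max)."""
    half_w, half_h = width / 2, height / 2
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)

# A hypothetical box centred in a 224 x 224 image
box = centre_to_corners(112.0, 112.0, 100.0, 60.0)
print(box)  # (62.0, 82.0, 162.0, 142.0)
```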

In [None]:
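from tensorflow.keras import Input, Model
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.optimizers import RMSprop

# Pretrained convolutional base, without ImageNet's own classification layers
resnet50_base = ResNet50(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

inputs = Input(shape=(224, 224, 3))
x = preprocess_input(inputs)
x = resnet50_base(x)
x = GlobalAveragePooling2D()(x)  # collapse the feature maps to one vector per image
class_outputs = Dense(n_classes, activation="softmax")(x)  # n_classes is defined elsewhere
location_outputs = Dense(4, activation="linear")(x)  # centre x, centre y, width, height
model = Model(inputs, outputs=[class_outputs, location_outputs])

# One loss per output; the weights say how much each loss contributes to the total
model.compile(optimizer=RMSprop(learning_rate=0.00003),
              loss=["sparse_categorical_crossentropy", "mse"],
              loss_weights=[0.7, 0.3])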

<ul>
    <li>In this snippet you can see the <code>Model</code> class has multiple outputs.</li>
    <li>And you can see that each output requires its own loss function.
        <ul>
            <li>By default, Keras sums the losses.</li>
            <li>If you care more about one loss than another, you supply weights.</li>
        </ul>
    </li>
    <li>Getting a labelled dataset is now even harder: every image must come with the class and bounding box
        of the main object in the image. 
    </li>
</ul>
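
<p>
    As a rough illustration of how the two losses are combined, the sketch below computes a classification
    loss and a bounding-box loss by hand in NumPy and then takes their weighted sum with weights 0.7 and 0.3.
    The predictions and targets are made up, and the helper functions are simplified stand-ins for what
    Keras does, not its actual implementation.
</p>

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, probs):
    """Mean negative log-probability that the model assigns to the true class."""
    return float(np.mean(-np.log(probs[np.arange(len(y_true)), y_true])))

def mse(y_true, y_pred):
    """Mean squared error over all predicted numbers."""
    return float(np.mean((y_true - y_pred) ** 2))

# Hypothetical predictions for 2 images and 3 classes
class_probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
class_labels = np.array([0, 1])

# Hypothetical bounding boxes: centre x, centre y, width, height
box_pred = np.array([[0.5, 0.5, 0.2, 0.3], [0.4, 0.6, 0.3, 0.3]])
box_true = np.array([[0.5, 0.5, 0.2, 0.2], [0.5, 0.6, 0.3, 0.3]])

class_loss = sparse_categorical_crossentropy(class_labels, class_probs)  # about 0.29
box_loss = mse(box_true, box_pred)                                       # about 0.0025

# Weighted sum: the classification loss counts for more than the localization loss
total_loss = 0.7 * class_loss + 0.3 * box_loss
print(class_loss, box_loss, total_loss)
```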

<ul>
    <li>What we've been discussing is classification and localization of the <em>main object</em> in an image.
        But we can use convolutional neural networks for <b>object detection</b>, which refers to classifying 
        and localizing <em>multiple objects</em> in an image, e.g. to
        say that an image contains a car and a pedestrian and to give bounding boxes for each.
    </li>
    <li>One way to do object detection:
        <ul>
            <li>Train a convolutional neural network to classify and locate a single object.</li>
            <li>Slide it across the image, i.e. predict and shift, e.g., for each $3\times 3$ region of
                a grid laid over the image.
            </li>
            <li>Perhaps repeat this for each $4 \times 4$ region, since objects come in different sizes.</li>
            <li>Post-process the results because you will have detected the same object multiple times.</li>
        </ul>
    </li>
    <li>There are other clever ways of doing object detection, but we don't have time to look at them
        (look up Fully Convolutional Networks and YOLO, if you are interested).
    </li>
</ul>
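
<p>
    The post-processing step is not spelled out above. One common choice is <em>non-max suppression</em>:
    keep the highest-scoring box and discard any box that overlaps it heavily, then repeat. The sketch below
    is a minimal pure-Python version. It assumes boxes are given as corners
    (<code>x_min, y_min, x_max, y_max</code>), and the threshold of 0.5 and the example boxes are arbitrary.
</p>

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep box i only if it does not overlap too much with a box we already kept
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept

# Two near-duplicate detections of one object, plus one separate object
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
print(kept)  # [0, 2]
```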

<h1>Autoencoders</h1>
<ul>
    <li>Autoencoders learn to copy their inputs to their outputs.
        <ul>
            <li>They have the same number of outputs as inputs.</li>
            <li>Reconstruction loss: They are trained with a loss function that penalizes them if the output 
                for each example is
                not the same as the input.
            </li>
            <li>The network will have constraints that prevent the autoencoder from simply copying inputs to
                outputs; the constraints force the autoencoder 
                to learn an efficient representation, rather than just memorizing; the autoencoder must
                find features that capture the inputs and allow it to reconstruct them as outputs.
            </li>
        </ul>
    </li>
    <li>Stacked autoencoders, for example, have
        <ul>
            <li>an encoder: layers that convert the input to a more compact internal representation; and</li>
            <li>a decoder: layers that convert the internal representation to the outputs.</li>
        </ul>
        <img src="images/stacked_autoencoder.png" />
    </li>
</ul>

In [None]:
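from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import RMSprop

# Encoder: 200 input features squeezed down to a 30-dimensional representation
stacked_encoder = Sequential([
    Dense(100, activation="relu", input_shape=(200,)),
    Dense(30, activation="relu")
])
# Decoder: the 30-dimensional representation expanded back to 200 outputs
stacked_decoder = Sequential([
    Dense(100, activation="relu", input_shape=(30,)),
    Dense(200, activation="linear")
])
stacked_autoencoder = Sequential([stacked_encoder, stacked_decoder])

stacked_autoencoder.compile(optimizer=RMSprop(learning_rate=0.00003), loss="mse")

# The inputs are also the targets (X_train and X_valid are defined elsewhere)
stacked_autoencoder.fit(X_train, X_train, epochs=10, validation_data=(X_valid, X_valid))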

<ul>
    <li>In the snippet, we create two submodels (encoder and decoder) and combine.
    </li>
    <li>I have pretended we have 200 numeric-valued features. Hence, I use linear as the activation function
        on the output layer, and mse as the loss function.
    </li>
    <li>Note that <code>X_train</code> is used for both the inputs and the targets.</li>
    <li>Autoencoders have several uses, and we will look at one.</li>
</ul>
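
<p>
    To see what the reconstruction loss measures, here is a small NumPy sketch that computes the mean squared
    error between each input and its reconstruction, one value per example. The function name and the tiny
    arrays are invented for illustration; in practice the reconstructions would come from the autoencoder.
</p>

```python
import numpy as np

def reconstruction_loss(X, X_reconstructed):
    """Mean squared error between each input and its reconstruction (one value per example)."""
    return np.mean((X - X_reconstructed) ** 2, axis=1)

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
# Pretend reconstructions: the first is perfect, the second gets one feature wrong by 3
X_hat = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 9.0]])

per_example = reconstruction_loss(X, X_hat)
print(per_example)         # [0. 3.]
print(per_example.mean())  # 1.5
```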

<h2>Unsupervised Pretraining using Stacked Autoencoders</h2>
<ul>
    <li>Suppose you want to build, e.g., an image classifier.</li>
    <li>Suppose you have lots of unlabeled data and a little labeled data.
        <ul>
            <li>E.g. you've downloaded millions of images from the web, but you've manually labeled only
                a small subset.
            </li>
        </ul>
    </li>
    <li>First train a stacked autoencoder on all your data: hopefully you learn an autoencoder that 
        is good at detecting features 
        in the images.
    </li>
    <li>Then reuse its lower layers (the encoder) like we reused lower layers of a pretrained network in a previous lecture:
        <ul>
            <li>Create a network that has the lower layers of the autoencoder and then a few additional
                layers that implement a classifier.
            </li>
            <li>Train this new network on the labeled data, but with all or most of the encoder's weights frozen.
            </li>
        </ul>
    </li>
</ul>

<h2>Some Other Autoencoders</h2>
<ul>
    <li>We force stacked autoencoders to learn useful features by giving the internal layers lower
        dimensionality. But there are other ways of constructing autoencoders.
    </li>
    <li>In a denoising autoencoder:
        <ul>
            <li>all the layers might have the same number of neurons;</li>
            <li>to force it to learn useful features, noise is added to the inputs during training
                (a bit like dropout);
            </li>
            <li>but the network is trained to reconstruct the noise-free inputs.</li>
        </ul>
    </li>
    <li>In a sparse autoencoder:
        <ul>
            <li>a regularization term is added to the loss function to reduce the number of active
                neurons;
            </li>
            <li>this forces the autoencoder to represent each input as a combination of a small number of
                activations.
            </li>
        </ul>
    </li>
    <li>Variational autoencoders can generate new instances that look like they were sampled from the training
        set. However, a new kind of network has become more popular for this &hellip;
    </li>
</ul>
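
<p>
    The two ideas above can be sketched in a few lines of NumPy. The first part builds training pairs for a
    denoising autoencoder: noisy inputs, clean targets. The second part computes one possible sparsity
    penalty, an L1 penalty on the activations of the coding layer. The noise level, the penalty weight and
    the example activations are all assumptions made for illustration, and an L1 penalty is only one of
    several ways to encourage sparsity.
</p>

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# --- Denoising autoencoder: noisy inputs, clean targets ---
X_clean = rng.uniform(0.0, 1.0, size=(5, 8))               # 5 examples, 8 features
noise = rng.normal(loc=0.0, scale=0.1, size=X_clean.shape)
X_noisy = X_clean + noise
# Training would then look like: autoencoder.fit(X_noisy, X_clean, ...)

# --- Sparse autoencoder: penalize active neurons in the coding layer ---
# Hypothetical coding-layer activations for 2 examples and 4 neurons
activations = np.array([[0.0, 0.0, 0.9, 0.0],
                        [0.5, 0.0, 0.0, 0.0]])
l1_weight = 1e-3
sparsity_penalty = l1_weight * np.sum(np.abs(activations))
# total loss = reconstruction loss + sparsity_penalty
print(sparsity_penalty)
```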

<h1>Generative Adversarial Networks</h1>
<ul>
    <li>Generative Adversarial Networks (GANs) allow for the <em>generation</em> of fairly realistic
        synthetic images.
    </li>
    <li>They comprise:
        <ul>
            <li>a generator network: takes a random vector (think of it as if it were samples from the
                coded representations in the middle of an autoencoder) and decodes it into a synthetic
                image; and
            </li>
            <li>a discriminator network: takes an image which might be real (from the training set) or 
                might be synthetic (from the generator) and
                predicts whether it is real or synthetic.
            </li>
        </ul>
        Analogy: a forger and an art expert.
    </li>
    <li>Training:
        <ul>
            <li>The generator is trained to fool the discriminator, so it must produce ever more realistic
                images.
            </li>
            <li>The discriminator is trained to tell synthetic from real images with high accuracy.</li>
        </ul>
        As one network gets better, the other will have to get better.
        <ul>
            <li>Because it's a dynamic system, there isn't a fixed minimum.</li>
            <li>Instead of seeking a minimum, we are seeking an equilibrium between two adversaries.</li>
            <li>Hence, GANs are very difficult to train successfully.</li>
        </ul>
    </li>
</ul>

<h2>Training (in more detail)</h2>
<ul>
    <li>In the first phase, we train the discriminator:
        <img src="images/gan_first.png" />
        <ul>
            <li>Sample real images from the training set, labeled 1.</li>
            <li>Generate an equal number of synthetic images, labeled 0.</li>
            <li>Run one step of backprop (on this batch) on the discriminator only, with
                binary cross-entropy as the loss function.
            </li>
        </ul>
    </li>
    <li>In the second phase, we train the generator:
        <img src="images/gan_second.png" />
        <ul>
            <li>Generate some synthetic images, label them all 1 (yes, 1!).</li>
            <li>Run one step of backprop (on this batch) on the whole GAN, with binary cross-entropy as the
                loss function but with the weights of the discriminator frozen within the GAN.
            </li>
        </ul>
    </li>   
</ul>
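
<p>
    The labelling in the two phases can be illustrated with NumPy alone. The sketch below builds the labels
    for each phase and shows why the synthetic images are labelled 1 in the second phase: the generator's
    loss is low exactly when the discriminator is fooled. The discriminator outputs here are invented
    numbers, not the outputs of a trained network.
</p>

```python
import numpy as np

def binary_crossentropy(labels, probs, eps=1e-7):
    """Mean binary cross-entropy; probs are predicted probabilities of 'real'."""
    probs = np.clip(probs, eps, 1 - eps)
    return float(np.mean(-(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))))

batch_size = 4

# Phase 1 (discriminator): real images labelled 1, synthetic images labelled 0
d_labels = np.concatenate([np.ones(batch_size), np.zeros(batch_size)])

# Phase 2 (generator): synthetic images only, all labelled 1
g_labels = np.ones(batch_size)

# Hypothetical discriminator outputs on a batch of synthetic images
fooled = np.full(batch_size, 0.9)      # discriminator thinks they are real
not_fooled = np.full(batch_size, 0.1)  # discriminator spots the fakes

loss_when_fooled = binary_crossentropy(g_labels, fooled)          # about 0.105
loss_when_not_fooled = binary_crossentropy(g_labels, not_fooled)  # about 2.303
print(loss_when_fooled, loss_when_not_fooled)
```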

<p>
    These code snippets (adapted from Chollet's book, 2nd edn) assume that the coded representation has 128 features 
    and that the images are $64 \times 64$.
</p>

In [None]:
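import tensorflow as tf
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import (Conv2D, Conv2DTranspose, Dense, Dropout,
                                     Flatten, LeakyReLU, Reshape)

latent_dim = 128  # size of the random vector fed to the generator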

generator = Sequential([
    Input(shape=(latent_dim,)),
    Dense(8 * 8 * 128),
    Reshape((8, 8, 128)),
    Conv2DTranspose(128, kernel_size=4, strides=2, padding="same"),
    LeakyReLU(alpha=0.2),
    Conv2DTranspose(256, kernel_size=4, strides=2, padding="same"),
    LeakyReLU(alpha=0.2),
    Conv2DTranspose(512, kernel_size=4, strides=2, padding="same"),
    LeakyReLU(alpha=0.2),
    Conv2D(3, kernel_size=5, padding="same", activation="sigmoid"),
])

In [None]:
discriminator = Sequential([
    Input(shape=(64, 64, 3)),
    Conv2D(64, kernel_size=4, strides=2, padding="same"),
    LeakyReLU(alpha=0.2),
    Conv2D(128, kernel_size=4, strides=2, padding="same"),
    LeakyReLU(alpha=0.2),
    Conv2D(128, kernel_size=4, strides=2, padding="same"),
    LeakyReLU(alpha=0.2),
    Flatten(),
    Dropout(0.2),
    Dense(1, activation="sigmoid"),
])

In [None]:
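# The regular fit method cannot be used. We need a custom fit method that calls this
# function on each batch. The names below are assumptions about how the loss,
# optimizers and metrics might be set up; the learning rates are illustrative.
loss_fn = tf.keras.losses.BinaryCrossentropy()
d_optimizer = tf.keras.optimizers.Adam(learning_rate=0.0001)
g_optimizer = tf.keras.optimizers.Adam(learning_rate=0.0001)
d_loss_metric = tf.keras.metrics.Mean(name="d_loss")
g_loss_metric = tf.keras.metrics.Mean(name="g_loss")
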
def train_step(real_images):
    batch_size = tf.shape(real_images)[0]
    random_latent_vectors = tf.random.normal(shape=(batch_size, latent_dim))
    generated_images = generator(random_latent_vectors)
    combined_images = tf.concat([generated_images, real_images], axis=0)
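    # Note: this snippet labels synthetic images 1 and real images 0,
    # which is the opposite of the convention used in the bullet points
    labels = tf.concat(
        [tf.ones((batch_size, 1)), tf.zeros((batch_size, 1))], axis=0
    )
    # Add a little random noise to the labels
    labels += 0.05 * tf.random.uniform(tf.shape(labels))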

    with tf.GradientTape() as tape:
        predictions = discriminator(combined_images)
        d_loss = loss_fn(labels, predictions)
    grads = tape.gradient(d_loss, discriminator.trainable_weights)
    d_optimizer.apply_gradients(
        zip(grads, discriminator.trainable_weights)
    )

    random_latent_vectors = tf.random.normal(
        shape=(batch_size, latent_dim))

    # All zeros, which in this snippet means "real" (synthetic images are labelled 1 above)
    misleading_labels = tf.zeros((batch_size, 1))

    with tf.GradientTape() as tape:
        predictions = discriminator(generator(random_latent_vectors))
        g_loss = loss_fn(misleading_labels, predictions)
    grads = tape.gradient(g_loss, generator.trainable_weights)
    g_optimizer.apply_gradients(zip(grads, generator.trainable_weights))

    d_loss_metric.update_state(d_loss)
    g_loss_metric.update_state(g_loss)
    return {"d_loss": d_loss_metric.result(), "g_loss": g_loss_metric.result()}

<p>
    The code snippet is just a rough idea. As already said, successful training is hard, and would require
    a lot of tweaks, which you can look up if you're interested.
</p>

GANs are now used for these and other purposes:
<ul>
    <li>increasing the resolution of images;</li>
    <li>colorization;</li>
    <li>image editing, e.g. replacing photo-bombers with realistic backgrounds;</li>
    <li>turning sketches into photo-like images;</li>
    <li>augmenting image datasets;</li>
    <li>&hellip;</li>
</ul>
<!--
https://www.ft.com/content/56dde36c-aa40-11e9-984c-fac8325aaa04 (paywall now)
https://www.forbes.com/sites/korihale/2019/05/28/google-microsoft-banking-on-africas-ai-labeling-workforce/?sh=34298493541c
https://scale.com/
https://www.sama.com/
https://www.amazon.co.uk/Old-Ireland-Colour-John-Breslin/dp/1785373706
https://www.insight-centre.org/our-team/dr-john-breslin/
http://oldirelandincolour.com/
https://www.slideshare.net/Cloud/old-ireland-in-colour
https://thispersondoesnotexist.com/
https://www.whichfaceisreal.com/index.php
https://thisrentaldoesnotexist.com/
https://thisxdoesnotexist.com/
-->

<p>
    This ends AI1! We start AI2 with lectures on getting computers to understand human languages.
</p>