In this practical session we will answer questions about autoencoders and build one that compresses and reconstructs Fashion MNIST items.

![](images/autoencoder_schema.jpg)

Autoencoders learn a *data-specific* compression function in an unsupervised manner. Depending on model complexity, they can produce very realistic reconstructions, but they are always **lossy**: some information about the original image is lost.

Autoencoders are unsupervised, which is a very useful property. Why do you think so?

Mathematically, an autoencoder can be denoted as follows:

$$\hat{x} = d(e(x)) = d(z)$$

That is, it consists of an encoding function $e(x)$ and a decoding function $d(z)$, where $z = e(x) \in \mathbb{R}^k$ is a compressed representation of $x$ with latent dimension $k$. Note that in all previous sessions we denoted our model output as $\hat{y}$. This time we do not, since our loss function compares $\hat{x}$ (our reconstructed image) to $x$ (the original image) instead of to a target vector $y$.

If we parameterize $d$ and $e$ as neural networks, and use a differentiable loss function that measures how closely the reconstruction matches the original input, we can simply use backpropagation to optimize both networks at once!
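
One way to write this training objective explicitly is sketched below. The symbols $\phi$ and $\theta$ (encoder and decoder parameters), $\mathcal{L}$ (the chosen loss) and $N$ (the number of training images) are introduced here for illustration only:

$$\min_{\theta, \phi} \; \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(x_i,\, d_\theta(e_\phi(x_i))\big)$$

Note that the target of each prediction is the input $x_i$ itself, which is why no labels are needed.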

Autoencoders are mainly used for three tasks:

* If we add stochasticity (variational autoencoder), we can sample new data points.
* We can visualize our data distribution more clearly using the compressed representation. Simpler algorithms (t-SNE, PCA) might fall short here.

  ![](images/compression.png)
* We can use it for denoising our data, which we will do in this session.
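
To make the idea of a lossy compressed representation concrete, here is a minimal base R sketch that uses PCA as a linear "encoder" and "decoder" on made-up data. It is only an illustration of compress-then-reconstruct, not the autoencoder built in this session. All variable names and the toy data are assumptions.

```r
set.seed(42)

# Toy data: 100 observations of 5 features that mostly live in a 2D subspace.
latent <- matrix(rnorm(100 * 2), nrow = 100)
mixing <- matrix(rnorm(2 * 5), nrow = 2)
x <- latent %*% mixing + matrix(rnorm(100 * 5, sd = 0.1), nrow = 100)

pca <- prcomp(x, center = TRUE, scale. = FALSE)
k <- 2  # size of the compressed representation

# "Encode": keep only the scores on the first k principal components.
z <- pca$x[, 1:k, drop = FALSE]

# "Decode": map the scores back to the original space and undo the centering.
x_hat <- sweep(z %*% t(pca$rotation[, 1:k, drop = FALSE]), 2, pca$center, "+")

# The reconstruction error is small but not zero: the compression is lossy.
reconstruction_mse <- mean((x - x_hat)^2)
reconstruction_mse
```

An autoencoder follows the same pattern, but replaces the two linear maps with neural networks that can capture non-linear structure.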

The following picture shows an autoencoder in the same style we used to visualize neural networks before.

![](images/autoencoder_neurons.png)

Let's build one using keras!

In [None]:
library(keras)
source("06-helpers.R")

use_multi_cpu()

data <- dataset_fashion_mnist()
data_train <- data$train
data_test <- data$test

Let's start with the preprocessing. Note that we don't need the labels this time. Try to do the following steps with the dataset:

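* Extract the training and testing images.
* Normalize them.
* Flatten each 28×28 image into a vector of length 784, since the code below indexes the images as 2D matrices (for example `x_test[1:10,]`).
* Set aside a slice of the training data for validation. The bonus section assumes 48,000 training and 12,000 validation images.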

If you can't remember how to do these steps, try looking at notebook 03b!
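
As a reminder of the general pattern, here is a minimal base R sketch of these operations on a tiny made-up array rather than on Fashion MNIST itself. The `toy_*` names, the 4×4 image size and the 20% validation fraction are assumptions chosen for illustration. In the notebook you would typically flatten with `array_reshape` instead.

```r
# Toy stand-in for the image array: 10 "images" of 4 x 4 pixels, values 0..255.
set.seed(1)
toy_images <- array(sample(0:255, 10 * 4 * 4, replace = TRUE), dim = c(10, 4, 4))

# Normalize pixel intensities to the range [0, 1].
toy_scaled <- toy_images / 255

# Flatten every image into one row (row-major order within each image).
toy_flat <- t(apply(toy_scaled, 1, function(img) as.vector(t(img))))

# Hold out the first 20% of the rows for validation.
n_val <- round(0.2 * nrow(toy_flat))
val_idx <- seq_len(n_val)
toy_val <- toy_flat[val_idx, , drop = FALSE]
toy_train <- toy_flat[-val_idx, , drop = FALSE]

dim(toy_train)  # 8 16
dim(toy_val)    # 2 16
```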

In [None]:
<YOUR ANSWER>

Let's continue with constructing the autoencoder model. As can be seen from the previous image, an autoencoder can be constructed using a single sequence of operations. 

This time, we only provide the bare skeleton of the model. Try using hidden dimensions of 8 neurons and a latent dimension (that is, the dimensionality of the vector that the encoder outputs) of 4 neurons. Use ReLU activation for the intermediate layers, and sigmoid activation at the final output.

Why do you think we need sigmoid activation at the end?

In [None]:
model <- keras_model_sequential() %>%
    <FILL IN>

cat(summary(model))

What accounts for the difference between the number of encoder parameters and the number of decoder parameters?

Let's compile the model. Use `adam` as the optimizer with learning rate `0.01`. Next, use the `binary_crossentropy` loss function. That is, we treat every output neuron as a Bernoulli distribution with parameter $p$ and compare the original and reconstructed distributions.
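
To get a feel for what this loss measures, here is a minimal base R sketch of the mean binary cross-entropy, $-\frac{1}{n}\sum_j \big[x_j \log \hat{x}_j + (1 - x_j)\log(1 - \hat{x}_j)\big]$, on a handful of made-up pixel values. The helper name `binary_crossentropy` and the clipping constant `eps` are assumptions for illustration. This is not the Keras implementation.

```r
# Mean binary cross-entropy between target intensities x and reconstructions x_hat.
binary_crossentropy <- function(x, x_hat, eps = 1e-7) {
  # Clip predictions away from 0 and 1 so that log() stays finite.
  x_hat <- pmin(pmax(x_hat, eps), 1 - eps)
  -mean(x * log(x_hat) + (1 - x) * log(1 - x_hat))
}

x <- c(0.0, 0.2, 0.9, 1.0)          # toy "pixels" in [0, 1]
good <- c(0.05, 0.25, 0.85, 0.95)   # close reconstruction
bad <- c(0.5, 0.5, 0.5, 0.5)        # uninformative reconstruction

binary_crossentropy(x, good)  # the smaller of the two losses
binary_crossentropy(x, bad)   # equals log(2) for this constant prediction
```

Because the targets are intensities rather than hard 0/1 labels, the loss of a perfect reconstruction is generally not zero, but it is still minimized when $\hat{x} = x$.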

In [None]:
model %>% compile(
    <FILL IN>
)



In [None]:
history <- model %>% fit(
    x = x_train,
    y = x_train,
    validation_data = list(x_val, x_val),
    epochs = 80,
    batch_size = 4096,
    callbacks=list(Progress$new())
)
plot(history)

Let's see what we have generated!

In [None]:
to_predict <- x_test[1:10,]

predictions <- model %>% predict(to_predict, batch_size = 1)
predictions <- array_reshape(predictions, c(10, 28, 28))

index <- 2

Change this index to compare different images.

Original:

In [None]:
library(ggplot2)
library(reshape2)


options(repr.plot.width = 3, repr.plot.height = 3)
ggplot(melt(t(apply(array_reshape(to_predict[index,], c(28, 28)), 2, rev)), varnames=c('x', 'y')), aes(x=x, y=y, fill=value)) +
    geom_raster() +
    scale_x_continuous(expand = c(0, 0)) +
    scale_y_continuous(expand = c(0, 0)) +
    scale_fill_gradient(low="#000000", high="#FFFFFF") +
    theme_void() +
    theme(legend.position = "none") +
    ggtitle(paste('Label:', data_test$y[index]))  # x_test holds pixels; labels live in data_test$y
options(repr.plot.width = 6, repr.plot.height = 5)

Generated:

In [None]:
library(ggplot2)
library(reshape2)

options(repr.plot.width = 3, repr.plot.height = 3)
ggplot(melt(t(apply(predictions[index,,], 2, rev)), varnames=c('x', 'y')), aes(x=x, y=y, fill=value)) +
    geom_raster() +
    scale_x_continuous(expand = c(0, 0)) +
    scale_y_continuous(expand = c(0, 0)) +
    scale_fill_gradient(low="#000000", high="#FFFFFF") +
    theme_void() +
    theme(legend.position = "none") +
    ggtitle(paste('Label:', data_test$y[index]))  # x_test holds pixels; labels live in data_test$y
options(repr.plot.width = 6, repr.plot.height = 5)

It seems like the model is trying, but the reconstructions aren't of high quality... Try to beef up the network by increasing the number of layers, the layer sizes and the latent dimension.

In [None]:
model <- keras_model_sequential() %>%
    <FILL IN>

cat(summary(model))

In [None]:
model %>% compile(
    optimizer = optimizer_adam(lr = 0.01),
    loss = "binary_crossentropy"
)

In [None]:
history <- model %>% fit(
    x = x_train,
    y = x_train,
    validation_data = list(x_val, x_val),
    epochs = 80,
    batch_size = 4096,
    callbacks=list(Progress$new())
)
plot(history)

Let's see if we improved.

In [None]:
to_predict <- x_test[1:10,]

predictions <- model %>% predict(to_predict, batch_size = 1)
predictions <- array_reshape(predictions, c(10, 28, 28))

index <- 2

In [None]:
library(ggplot2)
library(reshape2)

options(repr.plot.width = 3, repr.plot.height = 3)
ggplot(melt(t(apply(predictions[index,,], 2, rev)), varnames=c('x', 'y')), aes(x=x, y=y, fill=value)) +
    geom_raster() +
    scale_x_continuous(expand = c(0, 0)) +
    scale_y_continuous(expand = c(0, 0)) +
    scale_fill_gradient(low="#000000", high="#FFFFFF") +
    theme_void() +
    theme(legend.position = "none") +
    ggtitle(paste('Label:', data_test$y[index]))  # x_test holds pixels; labels live in data_test$y
options(repr.plot.width = 6, repr.plot.height = 5)

Looks a bit better!

## Denoising Autoencoder

Autoencoders can be used to denoise data. 'Noise' can be interpreted loosely: we can alter the original data with almost any transformation and train an autoencoder to recover the originals. For example, you could convert color images to grayscale and train an autoencoder to reverse this. Once the model has converged, you could use it to colorize old grayscale photos!

Since we are dealing with grayscale data here, we will focus on denoising.

Generate Gaussian noise vectors for train, validation and test using the `rnorm` function. Use a standard deviation of 1. Reshape these vectors into the same shape as train, validation and test data using `array_reshape`.
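
The general pattern of corrupting data and clipping it back into the valid range can be sketched in base R on a tiny made-up matrix. The `toy_*` names and the 6×16 shape are assumptions for illustration. Since the noise values are independent draws, `matrix` is used here in place of `array_reshape`, because the fill order does not matter.

```r
set.seed(123)

# Toy stand-in for a flattened, normalized image matrix: 6 images x 16 pixels.
toy_x <- matrix(runif(6 * 16), nrow = 6, ncol = 16)

# One independent Gaussian draw per pixel, arranged in the same shape as the data.
toy_noise <- matrix(rnorm(length(toy_x), mean = 0, sd = 1), nrow = nrow(toy_x))

# Corrupt the images and clip the result back into the valid range [0, 1].
toy_x_noisy <- pmin(pmax(toy_x + toy_noise, 0), 1)

range(toy_x_noisy)
```

With a standard deviation of 1 on data in $[0, 1]$, many pixels end up clipped to exactly 0 or 1, so this is a fairly aggressive corruption.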

In [None]:
noise_train <- <FILL_IN>
noise_val <- <FILL_IN>
noise_test <- <FILL_IN>

In [None]:
x_train_noisy <- x_train + noise_train
x_val_noisy <- x_val + noise_val
x_test_noisy <- x_test + noise_test

# We make sure max and min values are within the valid image range.
x_train_noisy[x_train_noisy > 1] = 1
x_train_noisy[x_train_noisy < 0] = 0

x_val_noisy[x_val_noisy > 1] = 1
x_val_noisy[x_val_noisy < 0] = 0

x_test_noisy[x_test_noisy > 1] = 1
x_test_noisy[x_test_noisy < 0] = 0


Create a denoising autoencoder model. You could use the same architecture as in the previous exercise.

In [None]:
model <- keras_model_sequential() %>%
    <FILL IN>

cat(summary(model))

In [None]:
model %>% compile(
    optimizer = optimizer_adam(lr = 0.01),
    loss = "binary_crossentropy"
)

In [None]:
history <- model %>% fit(
    x = x_train_noisy,
    y = x_train,
    validation_data = list(x_val_noisy, x_val),
    epochs = 80,
    batch_size = 4096,
    callbacks=list(Progress$new())
)
plot(history)

Let's take a look at the results.

In [None]:
to_predict <- x_test_noisy[1:10,]

predictions <- model %>% predict(to_predict, batch_size = 1)
predictions <- array_reshape(predictions, c(10, 28, 28))

index <- 4

In [None]:
library(ggplot2)
library(reshape2)


options(repr.plot.width = 3, repr.plot.height = 3)
ggplot(melt(t(apply(array_reshape(to_predict[index,], c(28, 28)), 2, rev)), varnames=c('x', 'y')), aes(x=x, y=y, fill=value)) +
    geom_raster() +
    scale_x_continuous(expand = c(0, 0)) +
    scale_y_continuous(expand = c(0, 0)) +
    scale_fill_gradient(low="#000000", high="#FFFFFF") +
    theme_void() +
    theme(legend.position = "none") +
    ggtitle(paste('Label:', data_test$y[index]))  # x_test holds pixels; labels live in data_test$y
options(repr.plot.width = 6, repr.plot.height = 5)

In [None]:
library(ggplot2)
library(reshape2)

options(repr.plot.width = 3, repr.plot.height = 3)
ggplot(melt(t(apply(predictions[index,,], 2, rev)), varnames=c('x', 'y')), aes(x=x, y=y, fill=value)) +
    geom_raster() +
    scale_x_continuous(expand = c(0, 0)) +
    scale_y_continuous(expand = c(0, 0)) +
    scale_fill_gradient(low="#000000", high="#FFFFFF") +
    theme_void() +
    theme(legend.position = "none") +
    ggtitle(paste('Label:', data_test$y[index]))  # x_test holds pixels; labels live in data_test$y
options(repr.plot.width = 6, repr.plot.height = 5)

We've effectively removed the noise! 

As you might expect, this is partly because our model is not complex enough to reproduce fine detail: it cannot even learn the text on the shirt (index 2). In the bonus sections we will try to tackle this.


## Bonus 1: Going convolutional

In the presentations we have seen that convolutional neural networks (CNNs) can be extremely powerful for image data. Let's change our network to use convolutions and increase the number of parameters to generate images of higher quality.


First, we reshape the data such that it is compatible with a convolutional approach.


In [None]:
x_train <- array_reshape(x_train, c(48000, 28, 28, 1))
x_val <- array_reshape(x_val, c(12000, 28, 28, 1))
x_test <- array_reshape(x_test, c(10000, 28, 28, 1))


In [None]:
noise_train <- array_reshape(0.1 * rnorm(48000 * 28 * 28), c(48000, 28, 28, 1))
noise_val <- array_reshape(0.1 * rnorm(12000 * 28 * 28), c(12000, 28, 28, 1))
noise_test <- array_reshape(0.1 * rnorm(10000 * 28 * 28), c(10000, 28, 28, 1))

In [None]:
x_train_noisy <- x_train + noise_train
x_val_noisy <- x_val + noise_val
x_test_noisy <- x_test + noise_test


x_train_noisy[x_train_noisy > 1] = 1
x_train_noisy[x_train_noisy < 0] = 0

x_val_noisy[x_val_noisy > 1] = 1
x_val_noisy[x_val_noisy < 0] = 0

x_test_noisy[x_test_noisy > 1] = 1
x_test_noisy[x_test_noisy < 0] = 0

Create the Convolutional Autoencoder model. Use two convolutional layers `layer_conv_2d` with 32 filters, kernel sizes of 3, 'SAME' padding and ReLU activation. After each convolutional layer use a `layer_average_pooling_2d` layer to reduce the spatial dimensionality. Then flatten the activations using `layer_flatten` and map to a dense vector with dimensionality 64. 

Try to 'mirror' the encoder into a decoder (i.e. the number of layers and parameters should be roughly the same), using `layer_upsampling_2d` instead of average pooling.

Your final layer should be a convolutional layer with 1 filter and sigmoid activation.
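
Before writing the layers, it can help to work out the tensor shapes by hand. The following base R arithmetic is a sketch under two assumptions that are not stated above: each pooling layer uses a 2×2 window, and each upsampling layer doubles the spatial size. It only counts the encoder layers described in the text.

```r
input_side <- 28   # Fashion MNIST images are 28 x 28 pixels with 1 channel
n_filters <- 32
kernel <- 3
latent_dim <- 64

# 'same' padding keeps the spatial size; each 2 x 2 pooling step halves it.
side_after_pool1 <- input_side / 2        # 14
side_after_pool2 <- side_after_pool1 / 2  # 7

# Number of activations entering the dense layer after flattening.
flat_units <- side_after_pool2^2 * n_filters  # 7 * 7 * 32 = 1568

# Trainable parameters: weights plus one bias per filter or output unit.
params_conv1 <- kernel^2 * 1 * n_filters + n_filters          # 320
params_conv2 <- kernel^2 * n_filters * n_filters + n_filters  # 9248
params_dense <- flat_units * latent_dim + latent_dim          # 100416

# Decoder side: two doubling upsampling steps restore the original resolution.
side_after_up2 <- side_after_pool2 * 2 * 2  # 28
```

The value of `flat_units` is also what a mirrored decoder would need as the size of its first dense layer, so that the result can be reshaped to 7×7×32 before upsampling.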

In [None]:
model_cnn <- keras_model_sequential() %>%
    <FILL IN>
cat(summary(model_cnn))

In [None]:
model_cnn %>% compile(
    optimizer = optimizer_adam(lr = 0.01),
    loss = "binary_crossentropy"
)



Because of the 'heavy' network, we reduce the number of epochs and the batch size.

In [None]:
history <- model_cnn %>% fit(
    x = x_train_noisy,
    y = x_train,
    validation_data = list(x_val_noisy, x_val),
    epochs = 5,
    batch_size = 64,
    callbacks=list(Progress$new())
)
plot(history)

In [None]:

to_predict <- array_reshape(x_test_noisy[1:10,,,], c(10, 28, 28, 1))
predictions <- model_cnn %>% predict(to_predict, batch_size = 1)
predictions <- array_reshape(predictions, c(10, 28, 28))


In [None]:
library(ggplot2)
library(reshape2)
index=4

options(repr.plot.width = 3, repr.plot.height = 3)
ggplot(melt(t(apply(to_predict[index,,,], 2, rev)), varnames=c('x', 'y')), aes(x=x, y=y, fill=value)) +
    geom_raster() +
    scale_x_continuous(expand = c(0, 0)) +
    scale_y_continuous(expand = c(0, 0)) +
    scale_fill_gradient(low="#000000", high="#FFFFFF") +
    theme_void() +
    theme(legend.position = "none") +
    ggtitle(paste('Label:', data_test$y[index]))  # x_test holds pixels; labels live in data_test$y
options(repr.plot.width = 6, repr.plot.height = 5)

In [None]:
library(ggplot2)
library(reshape2)

options(repr.plot.width = 3, repr.plot.height = 3)
ggplot(melt(t(apply(predictions[index,,], 2, rev)), varnames=c('x', 'y')), aes(x=x, y=y, fill=value)) +
    geom_raster() +
    scale_x_continuous(expand = c(0, 0)) +
    scale_y_continuous(expand = c(0, 0)) +
    scale_fill_gradient(low="#000000", high="#FFFFFF") +
    theme_void() +
    theme(legend.position = "none") +
    ggtitle(paste('Label:', data_test$y[index]))  # x_test holds pixels; labels live in data_test$y
options(repr.plot.width = 6, repr.plot.height = 5)

Very nice! This network could probably become much better if we weren't computationally constrained.

## Bonus 2: Batch normalization

Try using batch normalization (`layer_batch_normalization`) to speed up training.
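
As a rough sketch of what a batch normalization layer computes during training, the base R function below standardizes every feature over a mini-batch and then applies a scale and a shift. The helper `batch_norm`, the toy data and the value of `eps` are assumptions for illustration. The real layer additionally learns `gamma` and `beta` and keeps running statistics for use at inference time.

```r
set.seed(7)

# Toy mini-batch: 32 examples with 4 features on very different scales.
batch <- sapply(c(0, 5, -3, 100), function(m) rnorm(32, mean = m, sd = 2))

batch_norm <- function(x, gamma = 1, beta = 0, eps = 1e-3) {
  mu <- colMeans(x)                 # per-feature mean over the batch
  centered <- sweep(x, 2, mu, "-")
  sigma2 <- colMeans(centered^2)    # per-feature (biased) variance
  normalized <- sweep(centered, 2, sqrt(sigma2 + eps), "/")
  gamma * normalized + beta         # scale and shift
}

normalized <- batch_norm(batch)
round(colMeans(normalized), 6)      # every feature is now centered at 0
```

Keeping the activations of each layer on a comparable scale like this is the usual intuition for why training tends to converge in fewer epochs.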