# Autoencoders:

<table>
    <tr>
        <td width="70%">            
            <img src="images/autoencoder.png" width="100%">
        </td>
        <td>
            <img src="images/auto-encoder-labeled.png" width="100%">
            <p style="text-align: center;">If the encoder is given input $x$ and encodes it as $z=f(x)$, then the decoder computes output $g(f(x))$. The goal is to minimise some distance function between input $x$ and the reconstructed output $g(f(x))$:</p>
            <p style="text-align: center;">$E = L(x, g(f(x)))$</p>
        </td>
    </tr>
</table>


The input is forced through a bottleneck hidden layer which requires features to be extracted and compressed to a smaller dimension.
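As a minimal sketch of the quantities above, the following NumPy snippet builds an untrained encoder $f$ and decoder $g$ around a bottleneck and evaluates $E = L(x, g(f(x)))$. The layer sizes, the `tanh` activation, the random weights, and the choice of mean squared error for $L$ are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_hidden = 8, 3  # assumed sizes; the bottleneck has fewer units than the input

# Randomly initialised (untrained) parameters, for illustration only.
W_enc = rng.normal(scale=0.1, size=(n_hidden, n_in))
b_enc = np.zeros(n_hidden)
W_dec = rng.normal(scale=0.1, size=(n_in, n_hidden))
b_dec = np.zeros(n_in)

def f(x):
    """Encoder: compress x into the hidden vector z = f(x)."""
    return np.tanh(W_enc @ x + b_enc)

def g(z):
    """Decoder: map the hidden vector back to input space."""
    return W_dec @ z + b_dec

def reconstruction_error(x):
    """E = L(x, g(f(x))), with L assumed to be the mean squared error."""
    return np.mean((x - g(f(x))) ** 2)

x = rng.normal(size=n_in)
z = f(x)
E = reconstruction_error(x)
print(z.shape, E)
```

Training would adjust the four parameter arrays by backpropagation so that `E` shrinks over the dataset.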

This can be considered an alternative to deep Boltzmann machines, which do almost the same job. Deep Boltzmann machines are trained by layerwise unsupervised learning, while autoencoders are trained by backpropagation.

#### Autoencoder Training as Pre-Training:

Once an autoencoder has been trained, its decoder network can be replaced with a classification network. In this sense, autoencoder training is a pre-training method for other networks: the weights of the pre-trained encoder become the initial weights of another network used for another task.

Autoencoders can be trained in a *greedy layerwise* manner, similar to deep Boltzmann machines. We can train an autoencoder with one hidden layer to reconstruct the input, then train the next hidden layer to reproduce the input it receives, and so on. In other words, each subsequent layer is trained to reconstruct the activations of the previous layer. The resulting network and its pre-trained parameters can then be attached to a classification layer.
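The greedy layerwise procedure can be sketched roughly as follows. This is an illustrative NumPy example, not a reference implementation: the helper names `train_autoencoder` and `encode`, the `tanh` encoder with a linear decoder, the squared-error loss, full-batch gradient descent, and the random toy data are all assumptions.

```python
import numpy as np

def train_autoencoder(X, n_hidden, steps=300, lr=0.05, seed=0):
    """Train a one-hidden-layer autoencoder on X by gradient descent.

    Returns the encoder parameters (W1, b1) and the loss history.
    The decoder is discarded, since only the encoder is stacked.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(scale=0.1, size=(d, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.1, size=(n_hidden, d))
    b2 = np.zeros(d)
    losses = []
    for _ in range(steps):
        H = np.tanh(X @ W1 + b1)   # encoder activations
        R = H @ W2 + b2            # linear decoder output
        diff = R - X
        losses.append(0.5 * np.mean(np.sum(diff ** 2, axis=1)))
        # Backpropagate the squared-error loss.
        dR = diff / n
        dW2 = H.T @ dR
        db2 = dR.sum(axis=0)
        dA = (dR @ W2.T) * (1.0 - H ** 2)   # through tanh
        dW1 = X.T @ dA
        db1 = dA.sum(axis=0)
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2
    return (W1, b1), losses

def encode(X, params):
    """Apply a trained encoder layer."""
    W, b = params
    return np.tanh(X @ W + b)

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 6))   # toy data standing in for real inputs

# Layer 1 learns to reconstruct the raw input.
layer1, losses1 = train_autoencoder(X, n_hidden=4)
H1 = encode(X, layer1)

# Layer 2 learns to reconstruct layer 1's activations.
layer2, losses2 = train_autoencoder(H1, n_hidden=2)
H2 = encode(H1, layer2)   # features that could feed a classification layer
```

After this, `layer1` and `layer2` would serve as the initial weights of a deeper network, with a classification layer attached on top of `H2` and the whole stack fine-tuned.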

## Regularised Autoencoders:

If there are more hidden units than there are input units, then there is a risk that the network will learn a trivial identity mapping from input to output. Regularised autoencoders combat this problem in several ways:
- Introducing dropout at hidden layers
- Sparse autoencoders
- Contractive autoencoders
- Denoising autoencoders

### Sparse Autoencoder:

The sparse autoencoder introduces a penalty term to the cost function, determined by the hidden unit activations. This is similar to the weight decay strategy which aims to tone down larger weights, except here we're trying to tone down the hidden unit activations.

Note: if the hidden unit activations, $z=f(x)$, of the autoencoder follow a normal distribution with $\mu=0$ and $\sigma=1$, we can generate new images: sample a random vector $\tilde{z}$ from the standard normal distribution and pass it through the decoder to produce a completely new image, similar to the training items.


$L_1\texttt{-regularisation}$ penalises the sum of the absolute values of the hidden unit activations:

$$
    E = \underbrace{L(x, g(f(x)))}_{\text{Distance function}} + \underbrace{\lambda \sum_i |h_i|}_{\text{Penalty}}, \tag{1}
$$
for each hidden unit activation $h_i$. This is a common choice for how sparse autoencoders penalise the hidden unit activations.

$L_2\texttt{-regularisation}$ penalises the square $h_i^2$ instead of the absolute value $|h_i|$.

With $L_1\texttt{-regularisation}$, some hidden units tend towards zero, producing a sparse hidden unit activation vector.
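To make the penalty concrete, here is a small sketch of error function $(1)$ and its $L_2$ variant. The function name `sparse_autoencoder_loss`, the use of mean squared error as the distance function, and the toy vectors are assumptions for illustration.

```python
import numpy as np

def sparse_autoencoder_loss(x, x_hat, h, lam=0.1, norm="l1"):
    """Reconstruction error plus a penalty on the hidden activations h.

    x     : original input
    x_hat : reconstruction g(f(x))
    h     : hidden unit activations
    lam   : penalty weight (lambda in the formula)
    """
    distance = np.mean((x - x_hat) ** 2)   # assumed distance function: MSE
    if norm == "l1":
        penalty = lam * np.sum(np.abs(h))  # sum_i |h_i|
    elif norm == "l2":
        penalty = lam * np.sum(h ** 2)     # sum_i h_i^2
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    return distance + penalty

x = np.array([1.0, 0.0, -1.0])
x_hat = np.array([0.5, 0.0, -0.5])
h = np.array([0.5, -2.0, 0.0])

loss_l1 = sparse_autoencoder_loss(x, x_hat, h, norm="l1")
loss_l2 = sparse_autoencoder_loss(x, x_hat, h, norm="l2")
print(loss_l1, loss_l2)
```

Note how the large activation $h_2 = -2$ dominates the $L_2$ penalty, whereas the $L_1$ penalty grows only linearly with it.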

### Contractive Autoencoder:

The penalty term for contractive autoencoders is the squared $L_2\texttt{-norm}$ of the gradient of each hidden unit activation with respect to the input:
$$
    E = \underbrace{L(x, g(f(x)))}_{\text{Distance function}} + \underbrace{\lambda\sum_i \parallel \nabla_x h_i \parallel^2}_{\text{Penalty}}. \tag{2}
$$

Here, $\nabla_x h_i$ is the vector of derivatives of the hidden unit activation $h_i$ with respect to each input $x_j$. The term $\parallel \nabla_x h_i \parallel^2$ is the square of the length of this gradient vector.

With error function $(2)$, we minimise the effect of small changes in the input on the hidden unit activations. Similar inputs get mapped to similar points in the hidden unit space. This is why it's called 'contractive'.
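For a concrete case, assume a single-layer sigmoid encoder $h = \sigma(Wx + b)$ (an assumption, since the notes do not fix the encoder). Then $\partial h_i / \partial x_j = h_i(1-h_i)W_{ij}$, so the penalty in $(2)$ has a closed form. The sketch below computes it (without the factor $\lambda$) and compares it against a finite-difference estimate; the function names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def contractive_penalty(x, W, b):
    """sum_i ||grad_x h_i||^2 for h = sigmoid(W x + b), in closed form."""
    h = sigmoid(W @ x + b)
    dh = h * (1.0 - h)                        # sigmoid derivative per hidden unit
    return np.sum(dh ** 2 * np.sum(W ** 2, axis=1))

def numerical_penalty(x, W, b, eps=1e-6):
    """Same quantity, estimated with central finite differences."""
    J = np.zeros((W.shape[0], x.size))        # J[i, j] = d h_i / d x_j
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        h_plus = sigmoid(W @ (x + step) + b)
        h_minus = sigmoid(W @ (x - step) + b)
        J[:, j] = (h_plus - h_minus) / (2.0 * eps)
    return np.sum(J ** 2)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))
b = rng.normal(size=3)
x = rng.normal(size=5)

analytic = contractive_penalty(x, W, b)
numeric = numerical_penalty(x, W, b)
print(analytic, numeric)
```

The two values should agree closely, which is a quick sanity check on the closed-form expression.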

### Denoising Autoencoder:

Denoising the input is another regularisation method, similar in spirit to the contractive autoencoder in that it aims to minimise the effect of small variations in the input layer nodes on the output prediction.

Denoising works by first superimposing noise on the input layer, then training the network to recover the original input from the noisy version.

1. Pick a training item $x$ and add a noise vector to it, producing $\tilde{x}$
2. Train the network to minimise $E=L(x, g(f(\tilde{x})))$
3. Repeat for all training items in the dataset
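The steps above can be sketched as a short training loop. This is a minimal illustration under several assumptions: Gaussian noise with standard deviation `noise_std`, a `tanh` encoder with a linear decoder, squared error for $L$, random toy data, and full-batch updates (all items are corrupted and used at once rather than one at a time).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 5))   # toy data standing in for the training set
n, d = X.shape
n_hidden = 3

W1 = rng.normal(scale=0.1, size=(d, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, d))
b2 = np.zeros(d)

def reconstruct(X_in):
    """g(f(x)) with the current parameters."""
    return np.tanh(X_in @ W1 + b1) @ W2 + b2

def denoising_loss(X_clean, X_noisy):
    """E = L(x, g(f(x_tilde))): the target is the clean input."""
    diff = reconstruct(X_noisy) - X_clean
    return 0.5 * np.mean(np.sum(diff ** 2, axis=1))

noise_std, lr = 0.3, 0.05
X_eval_noisy = X + rng.normal(scale=noise_std, size=X.shape)
loss_before = denoising_loss(X, X_eval_noisy)

for _ in range(300):
    # Step 1: corrupt the inputs with fresh noise.
    X_tilde = X + rng.normal(scale=noise_std, size=X.shape)
    # Step 2: reconstruct from the noisy input, compare with the clean input.
    H = np.tanh(X_tilde @ W1 + b1)
    R = H @ W2 + b2
    dR = (R - X) / n
    dA = (dR @ W2.T) * (1.0 - H ** 2)
    W2 -= lr * (H.T @ dR)
    b2 -= lr * dR.sum(axis=0)
    W1 -= lr * (X_tilde.T @ dA)
    b1 -= lr * dA.sum(axis=0)

loss_after = denoising_loss(X, X_eval_noisy)
print(loss_before, loss_after)
```

The only difference from ordinary autoencoder training is that the forward pass sees `X_tilde` while the error is measured against the clean `X`.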

## Variational Autoencoder:

For autoencoders, the decoder can be seen as defining a conditional probability distribution $p_\theta(x|z)$ of possible output images given a particular hidden layer vector $z$, based on the decoder network with parameters $\theta$. 

The encoder part can be seen as defining a conditional probability distribution $q_\phi (z | x)$ of possible hidden layer vectors given input image $x$, based on the encoder network with parameters $\phi$.

For *variational autoencoders*, instead of producing a single hidden layer vector $z$ for each $x^{(i)}$, the encoder network with parameters $\phi$ outputs the mean $\mu_{z|x^{(i)}}$ and standard deviation $\sigma_{z | x^{(i)}}$ of a normal probability distribution $q_\phi (z | x^{(i)})$ over the possible hidden layer vectors, given the input $x^{(i)}$.

We can then train the system to maximise:

$$
    \underbrace{\mathbb{E}_{z \sim q_\phi(z|x^{(i)})}\big[ \log p_\theta(x^{(i)}|z) \big]}_{\text{Expected log-probability of reproducing image }x^{(i)} \text{ given }z} - \underbrace{D_{KL}\big( q_\phi(z|x^{(i)}) \parallel p(z)\big)}_{\text{KL-divergence between } q_\phi(z|x^{(i)}) \text{ and standard normal dist}}.
$$

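The first term rewards the decoder for assigning high probability to the original image $x^{(i)}$ when $z$ is sampled from the encoder's distribution, so it plays the role of the reconstruction error. The second term penalises the encoder's distribution $q_\phi(z|x^{(i)})$ for straying from the standard normal distribution $p(z)$. Keeping the hidden layer vectors close to a standard normal distribution is what allows a random $\tilde{z}$ sampled from that distribution to be decoded into a plausible new image.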
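The objective can be estimated numerically for a single input, as in the rough sketch below. Several things here are assumptions rather than part of the notes: the encoder distribution is a diagonal Gaussian (which gives the KL term a closed form), the decoder likelihood is Gaussian with unit variance (so $\log p_\theta(x|z)$ is a negative squared error up to a constant), and the linear `decode` function, the helper names, and all numbers are hypothetical.

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """D_KL( N(mu, diag(sigma^2)) || N(0, I) ) in closed form."""
    return 0.5 * np.sum(mu ** 2 + sigma ** 2 - 1.0 - 2.0 * np.log(sigma))

def sample_z(mu, sigma, rng):
    """Draw z ~ q(z|x) as mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def objective_estimate(x, mu, sigma, decode, rng, n_samples=10):
    """Monte Carlo estimate of E_q[log p(x|z)] - KL(q || p)."""
    log_px = []
    for _ in range(n_samples):
        z = sample_z(mu, sigma, rng)
        x_hat = decode(z)
        # Gaussian log-likelihood with unit variance, additive constant dropped.
        log_px.append(-0.5 * np.sum((x - x_hat) ** 2))
    return np.mean(log_px) - kl_to_standard_normal(mu, sigma)

rng = np.random.default_rng(0)
W_dec = rng.normal(size=(4, 2))   # hypothetical linear decoder weights

def decode(z):
    return W_dec @ z

x = rng.normal(size=4)
mu = np.array([0.5, -0.5])        # would come from the encoder network
sigma = np.array([0.8, 1.2])      # would come from the encoder network

value = objective_estimate(x, mu, sigma, decode, rng)
print(value)
```

In training, `mu` and `sigma` would be outputs of the encoder, and both networks would be updated to increase this value; writing the sample as `mu + sigma * eps` keeps it differentiable with respect to the encoder outputs.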