# Autoencoders 

### Sources:

Deep Learning - Bishop and Bishop

Autoencoders, or autoassociative neural networks, are one way to discover representations of data for many subsequent, downstream tasks.

Autoencoders consist of neural networks that are trained to generate an output $y$ that is similar to the input $x$. Once the model has been trained to minimize the difference between the reconstruction and the original input, the activations of an internal layer, $z(x)$, give a representation of each input. This representation, often called the latent representation, is compressed* so that the network cannot simply copy its input and must instead find a non-trivial representation of the data.

There are two main parts to an autoencoder:

1. Encoder: defines a mapping between the input $x$ and the latent variable $z(x)$.

2. Decoder: defines a mapping between the latent representation $z$ and the output $y(z)$.
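
As a minimal sketch of this two-part structure, the NumPy snippet below builds an untrained encoder and decoder and composes them. The layer sizes, the `tanh` activation, the weight names and the sum-of-squares error are illustrative assumptions, not choices taken from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M = 8, 3  # input and latent dimensionality (illustrative values)

# Randomly initialised, untrained weights; the names are hypothetical.
W_enc = rng.normal(scale=0.1, size=(M, D))
b_enc = np.zeros(M)
W_dec = rng.normal(scale=0.1, size=(D, M))
b_dec = np.zeros(D)

def encoder(x):
    """Map inputs x of shape (N, D) to latent codes z(x) of shape (N, M)."""
    return np.tanh(x @ W_enc.T + b_enc)

def decoder(z):
    """Map latent codes z of shape (N, M) to outputs y(z) of shape (N, D)."""
    return z @ W_dec.T + b_dec

def reconstruction_error(x):
    """Sum-of-squares difference between the output y(z(x)) and the input x."""
    y = decoder(encoder(x))
    return 0.5 * np.sum((y - x) ** 2)

x = rng.normal(size=(5, D))
z = encoder(x)   # latent representation, shape (5, 3)
y = decoder(z)   # reconstruction, shape (5, 8)
```

Training would adjust the weights to reduce `reconstruction_error`; after training, `encoder(x)` is the representation passed to downstream tasks.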

*Note: the representation may instead be sparse. Alternatively, the network may be forced to learn non-trivial representations by reconstructing corrupted inputs, for example inputs with added noise or missing values. The hope is that these constraints (compression, sparsity, corruption) force the network to learn interesting representations of the data.


## Deterministic Autoencoders

By using a neural network with the same number of outputs as inputs and minimizing the reconstruction error, it is possible to learn a latent manifold of the input data. Unlike PCA, which is a linear method, neural networks can learn non-linear latent manifolds. Simple autoencoders are now rarely used in the literature, but they are conceptually important for understanding more powerful deep generative models such as variational autoencoders.

### Linear autoencoders 

![Linear Autoencoder](./figures/linear_autoencoder.png)

The linear autoencoder above consists of an MLP with $D$ inputs, $D$ outputs and $M$ hidden units, where $M < D$, and the objective is to map the input onto itself. This is known as an *auto-associative* mapping. The model is trained by minimizing the sum-of-squares error between the input $x$ and the output $y$:

\begin{equation}
E(w) = \frac{1}{2} \sum_{n=1}^N ||y(x_n, w) - x_n||^2
\end{equation}

"Even with nonlinear units in the hidden layer, such a network is equivalent to linear principal component analysis." (Bishop and Bishop, 2024). It has been shown that if the hidden units have linear activations, the error function has a unique global minimum, and at this minimum the network performs a projection onto the $M$-dimensional subspace spanned by the first $M$ principal components of the data (see Bishop for references). Even if non-linear activations are used in the hidden layer, the minimum-error solution is still given by projection onto the principal subspace. (Note: the weight vectors that span this subspace need not be orthogonal or normalized, as they are in PCA.) There is therefore no advantage in using a two-layer neural network for dimensionality reduction; use PCA instead.

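The equivalence with PCA can be illustrated numerically. The sketch below uses synthetic data (an assumption made purely for illustration) and compares three linear autoencoders without training any of them: one built from the orthonormal PCA basis, one whose latent space has been mixed by an arbitrary invertible matrix, and one that projects onto a random subspace. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, M = 500, 5, 2

# Synthetic zero-mean data with correlated features (illustrative only).
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))
X = X - X.mean(axis=0)

# Principal directions from the SVD of the centred data matrix.
_, S, Vt = np.linalg.svd(X, full_matrices=False)
U_M = Vt[:M].T  # shape (D, M): first M principal components as columns

def sum_of_squares(W_enc, W_dec):
    """E = 1/2 sum_n ||y_n - x_n||^2 for the linear map y = W_dec W_enc x."""
    Y = X @ W_enc.T @ W_dec.T
    return 0.5 * np.sum((Y - X) ** 2)

# Solution 1: encoder and decoder built from the orthonormal PCA basis.
E_pca = sum_of_squares(U_M.T, U_M)

# Solution 2: mix the latent space with an arbitrary invertible matrix A.
# The weight vectors are no longer orthonormal, but they span the same
# subspace, so the reconstruction is unchanged.
A = rng.normal(size=(M, M))
E_mixed = sum_of_squares(A @ U_M.T, U_M @ np.linalg.inv(A))

# Projection onto a random M-dimensional subspace gives a larger error.
Q, _ = np.linalg.qr(rng.normal(size=(D, M)))
E_random = sum_of_squares(Q.T, Q)

# The PCA error equals half the sum of the discarded squared singular values.
E_theory = 0.5 * np.sum(S[M:] ** 2)
```

Here `E_pca` and `E_mixed` agree to numerical precision, which is the sense in which the hidden-unit weight vectors need not be orthogonal or normalized.
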
## Deep Autoencoders

The result changes if additional nonlinear layers are added to the network, which is then capable of learning non-linear basis components. This is, however, more computationally intensive, since nonlinear optimization techniques must be used. The dimensionality of the latent subspace must also be specified before training the network.


![Deep Autoencoder](./figures/deep_autoencoder.png)

## Sparse Autoencoders

Another way to constrain the internal representation is to use a regularizer to encourage sparse representations, leading to a lower effective dimensionality. For example, the L1 regularizer encourages sparseness by using the additional term in the loss function:

\begin{equation}
\tilde{E}(w) = E(w) + \lambda \sum_{k=1}^K |z_k|
\end{equation}

where $E(w)$ is the reconstruction loss and the sum runs over the $K$ activation values of all the units in one of the hidden layers. (Note: regularization is usually applied to the parameters of a network, whereas here it is applied to the unit activations.)
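
A small sketch of this regularized loss is given below. It assumes that the hidden activations have already been computed and that the penalty is summed over data points as well as over units; the function name and the toy arrays are hypothetical.

```python
import numpy as np

def sparse_autoencoder_loss(x, y, z, lam):
    """Reconstruction error E(w) plus an L1 penalty on hidden activations.

    x: inputs, shape (N, D); y: reconstructions, shape (N, D);
    z: activations of the chosen hidden layer, shape (N, K);
    lam: regularization strength lambda (a hyperparameter).
    """
    reconstruction = 0.5 * np.sum((y - x) ** 2)
    sparsity = lam * np.sum(np.abs(z))  # summed over units and data points
    return reconstruction + sparsity

# Two sets of activations with the same reconstruction error.
x = np.array([[1.0, 2.0], [0.0, -1.0]])
y = np.array([[1.0, 1.0], [0.0, -1.0]])
z_dense = np.array([[0.5, -0.5, 0.5], [0.5, 0.5, -0.5]])
z_sparse = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])

loss_dense = sparse_autoencoder_loss(x, y, z_dense, lam=0.1)
loss_sparse = sparse_autoencoder_loss(x, y, z_sparse, lam=0.1)
```

With equal reconstruction error, the code with fewer active units receives the lower loss, which is how the penalty pushes the network towards sparse representations.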

## Denoising autoencoders 

Another way to force an autoencoder to learn interesting internal structure in the data is to train it as a denoising autoencoder. Each input vector $x_n$ is corrupted with noise to give a modified vector $\tilde{x}_n$. This corrupted vector is fed into the autoencoder to produce an output $y(\tilde{x}_n, w)$, and the network is trained to reconstruct the original, uncorrupted vector by minimizing an error function such as the sum-of-squares:

\begin{equation}
E(w) = \sum_{n=1}^{N} ||y(\tilde{x}_n, w) - x_n||^2
\end{equation}

One form of noise is to set a randomly chosen fraction $\mu$ of the input variables to zero, where $0 < \mu < 1$. Alternatively, each input vector could be perturbed by adding zero-mean Gaussian noise. To undo the corruption, the network is forced to learn aspects of the structure of the data.
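
The two corruption schemes and the denoising error can be sketched as follows. The masking function zeroes each variable independently with the given probability, which matches the target fraction only on average; this, the function names and the identity "model" used as a placeholder for a trained network are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_corrupt(x, fraction, rng):
    """Set each input variable to zero independently with probability `fraction`."""
    keep = rng.random(x.shape) >= fraction
    return x * keep

def gaussian_corrupt(x, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation `sigma`."""
    return x + sigma * rng.normal(size=x.shape)

def denoising_error(model, x_clean, x_corrupt):
    """Sum-of-squares error between y(x_corrupt, w) and the clean input."""
    y = model(x_corrupt)
    return np.sum((y - x_clean) ** 2)

def identity_model(v):
    """Placeholder standing in for a trained autoencoder."""
    return v

x = rng.normal(size=(4, 6))
x_masked = mask_corrupt(x, fraction=0.5, rng=rng)
x_noisy = gaussian_corrupt(x, sigma=0.1, rng=rng)
err = denoising_error(identity_model, x, x_noisy)
```

The important detail is that the network sees the corrupted vector but is scored against the clean one.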

Formally, the training of denoising autoencoders is related to 'score matching', where the score is defined by $s(x) = \nabla_x \ln p(x)$. Some intuition is shown in the figure below:

![Denoising Autoencoder](./figures/denoising.png)

In this figure, the autoencoder learns to reverse the distortion vector $\tilde{x}_n - x_n$. This means that the network learns, for each point in data space, a vector that points towards the data manifold and therefore towards a region of high data density. (Note: this is an important concept for diffusion models.)
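
The link between denoising and the score can be made concrete in a toy case where everything is available in closed form. The sketch below assumes one-dimensional data drawn from $\mathcal{N}(0, s^2)$ and Gaussian corruption with variance $\sigma^2$, so that the corrupted data have density $\mathcal{N}(0, s^2 + \sigma^2)$. A linear denoiser fitted by least squares is compared with the denoiser obtained by stepping from $\tilde{x}$ along the score of the corrupted density, scaled by $\sigma^2$. The setup and variable names are illustrative and not taken from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
s, sigma = 2.0, 0.5  # data and noise standard deviations (illustrative)

x = s * rng.normal(size=200_000)
x_noisy = x + sigma * rng.normal(size=x.shape)

# Least-squares fit of a linear denoiser y = a * x_noisy to the clean targets.
a_fit = np.sum(x_noisy * x) / np.sum(x_noisy ** 2)

def score_noisy(v):
    """Score of the corrupted density N(0, s^2 + sigma^2)."""
    return -v / (s ** 2 + sigma ** 2)

# Denoiser implied by the score: y(v) = v + sigma^2 * score(v) = a_score * v.
a_score = 1.0 - sigma ** 2 / (s ** 2 + sigma ** 2)
```

In this Gaussian setting the fitted coefficient `a_fit` approaches `a_score` as the number of samples grows, so the learned correction $y(\tilde{x}) - \tilde{x}$ is proportional to the score and points towards higher data density.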

## Masked autoencoders 

Masked autoencoders reconstruct an image from a corrupted version of that image, similar to denoising autoencoders, where the corruption consists of masking, or dropping out, regions of the input. (This is usually done with a vision transformer.) Omitting a large fraction of the input patches also saves significant computation, particularly for transformers, whose cost scales poorly with sequence length $N$, as $O(N^2)$. A masked autoencoder is therefore a good choice for pre-training large transformer encoders. Usually the decoder is discarded for downstream tasks.
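
A rough sketch of random patch masking and its effect on the quadratic attention cost is given below. The image size, patch size and masking ratio are assumed values chosen for illustration, and `random_patch_mask` is a hypothetical helper rather than part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_patch_mask(patches, mask_ratio, rng):
    """Keep a random subset of patches; return the kept patches and their indices.

    patches: array of shape (N, P) holding N flattened patches.
    mask_ratio: fraction of patches to drop (an assumed hyperparameter).
    """
    n = patches.shape[0]
    n_keep = int(round(n * (1.0 - mask_ratio)))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    return patches[keep_idx], keep_idx

# A 224x224 RGB image split into 16x16 patches gives 14 * 14 = 196 patches.
patches = rng.normal(size=(196, 16 * 16 * 3))
visible, keep_idx = random_patch_mask(patches, mask_ratio=0.75, rng=rng)

# Self-attention cost grows as N^2, so encoding only the visible patches
# scales that term by (number visible / number total)^2.
attention_cost_ratio = (visible.shape[0] / patches.shape[0]) ** 2
```

With three quarters of the patches masked, the encoder processes 49 of the 196 patches, and the quadratic attention term shrinks to one sixteenth of its full-sequence value.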

## Variational autoencoder (VAE)


The likelihood function for a latent-variable model is given by:

\begin{equation}
p(x|w) = \int p(x|z, w) p(z) dz
\end{equation}

where $p(x|z, w)$ is defined by a deep neural network. This likelihood is intractable because the integral over $z$ cannot be evaluated analytically. A VAE works with an approximation to this likelihood when training the model. The VAE combines three ideas:

1) Use of the evidence lower bound (ELBO) to approximate the likelihood function (closely related to the EM algorithm).

2) Amortized inference: a second model, called the encoder, is used to approximate the posterior distribution over the latent variables in the E step, rather than evaluating the posterior distribution for each data point exactly.

3) Use of the *reparameterization trick* to make training of the encoder tractable.

Consider a generative model with a conditional distribution $p(x|z,w)$ over some $D$-dimensional data $x$, which is controlled by the output of a deep neural network $g(z, w)$. ($g(z, w)$ may represent the mean of a Gaussian conditional distribution, for example.) Further, consider the distribution over an $M$-dimensional latent variable $z$ that is given by a zero-mean unit-variance Gaussian:

\begin{equation}
p(z) = \mathcal{N}(z| 0, I)
\end{equation}

For an arbitrary probability distribution $q(z)$ over the space of the latent variable $z$, the log likelihood can be decomposed into the evidence lower bound (or variational lower bound) and a KL divergence between the variational distribution $q(z)$ and the true posterior $p(z|x, w)$:

\begin{equation}
\ln p(x|w) = \mathcal{L}(w) + KL \ (q(z)||p(z|x, w))
\end{equation}

where the ELBO is given by:

\begin{equation}
\mathcal{L}(w) = \int q(z) \ln \frac{p(x|z, w) \ p(z)}{q(z)} dz
\end{equation}

The KL divergence $KL(\cdot || \cdot)$ term is given by:

\begin{equation}
KL \ (q(z)||p(z|x, w)) = - \int q(z) \ln \frac{p(z| x, w)}{q(z)} dz
\end{equation}

As $KL \ (q||p) \geq 0$, it follows that

\begin{equation}
\ln p(x|w) \geq \mathcal{L}
\end{equation}

and so $\mathcal{L}$ is a lower bound on $\ln p(x|w)$. Although the log likelihood $\ln p(x|w)$ is intractable, the ELBO can be estimated by Monte Carlo sampling, which gives an approximation to the true log likelihood.
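
The decomposition can be checked numerically in a toy model where every quantity is available in closed form. The sketch below assumes a one-dimensional latent variable with $p(z) = \mathcal{N}(z|0, 1)$ and a linear-Gaussian decoder $p(x|z) = \mathcal{N}(x|z, s^2)$ in place of a neural network, with a Gaussian $q(z) = \mathcal{N}(z|m, v)$. The numerical values and the helper name `kl_gauss` are illustrative.

```python
import numpy as np

def kl_gauss(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) ) for one-dimensional Gaussians."""
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

x, s2 = 1.3, 0.5  # observation and decoder noise variance (illustrative)
m, v = 0.2, 0.8   # mean and variance of an arbitrary q(z)

# Exact log likelihood: marginally, x ~ N(0, 1 + s2).
log_px = -0.5 * np.log(2 * np.pi * (1 + s2)) - x ** 2 / (2 * (1 + s2))

# Exact posterior p(z | x) is Gaussian with these parameters.
post_mean, post_var = x / (1 + s2), s2 / (1 + s2)

# ELBO = E_q[ln p(x|z)] - KL(q(z) || p(z)), both in closed form here.
expected_log_lik = -0.5 * np.log(2 * np.pi * s2) - ((x - m) ** 2 + v) / (2 * s2)
elbo = expected_log_lik - kl_gauss(m, v, 0.0, 1.0)

# Gap between the log likelihood and the bound.
gap = kl_gauss(m, v, post_mean, post_var)
```

Here `elbo + gap` reproduces `log_px`, and the gap vanishes only when $q(z)$ equals the exact posterior.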

For a set of training data points $\mathcal{D} = \{x_1, x_2, ..., x_N\}$, the log likelihood is given by: 

\begin{equation}
\ln p(\mathcal{D}|w) = \sum_{n=1}^N \mathcal{L}_n + \sum_{n=1}^N KL \ (q_n(z_n)||p(z_n|x_n, w))
\end{equation}

where $\mathcal{L}_n$ is

\begin{equation}
\mathcal{L}_n = \int q_n(z_n) \ln \frac{p(x_n|z_n, w) \ p(z_n)}{q_n(z_n)} dz_n
\end{equation}

and $q_n(z_n)$ is the variational distribution over the latent variable for the $n$th data point. Note that this introduces a separate latent variable $z_n$ for each data vector $x_n$; consequently, each latent variable has its own distribution, which can be optimized separately.

The exact posterior distribution of $z_n$ is given from Bayes' theorem by:

\begin{equation}
p(z_n |x_n, w) = \frac{p(x_n | z_n, w) \ p(z_n)}{p(x_n|w)}
\end{equation}

The numerator is straightforward to evaluate for a deep generative model. However, the denominator is the likelihood $p(x_n|w)$, which is intractable, so an approximation to the posterior distribution is required. It is possible, though very inefficient, to create a separate parameterized model for each of the distributions $q_n(z_n)$ and to optimize each one numerically. A more efficient approximation framework requires the introduction of a second neural network.

In the VAE, instead of trying to evaluate a separate posterior distribution $p(z_n|x_n, w)$ for each data point $x_n$ individually, an 'encoder' network is trained to approximate all of these distributions. This is known as 'amortized' inference. The encoder produces a single distribution $q(z|x, \phi)$ that is conditioned on $x$, where $\phi$ represents the parameters of the encoder network. The objective function, given by the ELBO, now depends on $\phi$ as well as $w$, and gradient-based optimization can be used to maximize the bound jointly with respect to both sets of parameters: the weights $w$ of the generative network and the parameters $\phi$ of the approximate posterior distribution over the latent space.

A VAE therefore comprises two neural networks that have independent parameters but are trained jointly: an encoder that takes a data vector and maps it to the latent space, and the original network that takes a latent-space vector and maps it back to the data space. This latter network can be interpreted as a decoder.

This architecture is similar to the simple autoencoder described above, with the important distinction that there is now a probability distribution defined over the latent space. (The encoder computes an approximate probabilistic inverse of the decoder, in the sense of Bayes' theorem.)

A typical choice for the encoder is a Gaussian distribution with a diagonal covariance matrix whose mean and variance parameters, $\mu_j$ and $\sigma_j^2$, are given by the outputs of a neural network that takes $x$ as input:

\begin{equation}
q(z | x, \phi) = \prod_{j=1}^{M} \mathcal{N}(z_j | \mu_j(x, \phi), \sigma_j^2(x, \phi))
\end{equation}

Note: the means $\mu_j(x, \phi)$ lie in the range $(-\infty, \infty)$, so the corresponding output units can use linear activations. The variances $\sigma_j^2(x, \phi)$ must be non-negative, so the associated output units typically use $\exp(\cdot)$ as their activation function.
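
A minimal sketch of such an encoder output layer is shown below: a shared hidden layer feeds one linear head for the means and one exponential head for the variances. The layer sizes, the `tanh` hidden activation and the weight names are assumptions made for illustration, and the weights are untrained.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, M = 6, 16, 2  # input, hidden and latent sizes (illustrative)

# Untrained, randomly initialised encoder parameters phi (hypothetical names).
W_h = rng.normal(scale=0.1, size=(H, D))
W_mu = rng.normal(scale=0.1, size=(M, H))
W_a = rng.normal(scale=0.1, size=(M, H))

def encode(x):
    """Return the means and variances of q(z | x, phi) for inputs of shape (N, D)."""
    h = np.tanh(x @ W_h.T)
    mu = h @ W_mu.T          # linear output units: any real value
    var = np.exp(h @ W_a.T)  # exponential output units: always positive
    return mu, var

x = rng.normal(size=(3, D))
mu, var = encode(x)
```

The pre-activation of the variance head plays the role of a log-variance, which is why it can be left unconstrained during optimization.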

Although $\phi$ and $w$ are optimized jointly, it can be helpful to imagine alternating between updates of $\phi$ and updates of $w$, in the spirit of the EM algorithm, although the analogy is not exact. A key difference from the EM algorithm is that, for a given value of $w$, optimizing with respect to the encoder parameters $\phi$ does not in general reduce the KL divergence to zero: the encoder network is not a perfect predictor of the posterior latent distribution, so there is a residual gap between the lower bound and the true log likelihood. Although the encoder is flexible, it is not expected to model the true posterior exactly, for the following reasons:

1) The true posterior distribution will not, in general, be a factorized Gaussian.

2) Even a large neural network has limited flexibility.

3) The training process is only an approximate optimization. 

![EM vs VAE](./figures/em_v_vae_training.png)


The lower bound is still intractable as it involves integrals over the latent variables $\{ z_n \}$ in which the integrand has a complicated dependence on the latent variables because of the decoder network. For data point $x_n$ we can write the contribution to the lower bound in the form 

\begin{equation}
\mathcal{L}_n(w, \phi) = \int q(z_n | x_n, \phi) \ln \left\{ \frac{p(x_n|z_n, w) \ p(z_n)}{q(z_n|x_n, \phi)} \right\} dz_n
\end{equation}
\begin{equation}
= \int q(z_n|x_n, \phi) \ln p(x_n|z_n, w) dz_n - KL \ (q(z_n|x_n, \phi) || p(z_n))
\end{equation}

The second term on the right-hand side is a KL divergence between two Gaussian distributions and can be evaluated analytically:

\begin{equation}
KL \ (q(z_n | x_n, \phi) || p(z_n)) = -\frac{1}{2} \sum_{j=1}^{M} \left\{ 1 + \ln \sigma^2_j (x_n) - \mu_j^2 (x_n) - \sigma_j^2 (x_n) \right\}
\end{equation}
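
For a diagonal Gaussian $q$ and a standard normal prior, the KL divergence has the closed form $\frac{1}{2} \sum_j (\mu_j^2 + \sigma_j^2 - \ln \sigma_j^2 - 1)$, which is non-negative. The sketch below compares this expression with a Monte Carlo estimate of $\mathbb{E}_q[\ln q(z) - \ln p(z)]$; the particular means and variances are arbitrary illustrative values and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_to_standard_normal(mu, var):
    """KL( N(mu, diag(var)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(mu ** 2 + var - np.log(var) - 1.0)

def log_normal(z, mu, var):
    """Log density of a diagonal Gaussian, summed over the last axis."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (z - mu) ** 2 / var, axis=-1)

mu = np.array([0.5, -1.0, 0.0])
var = np.array([0.3, 1.5, 1.0])

# Monte Carlo estimate of E_q[ ln q(z) - ln p(z) ] using samples from q.
z = mu + np.sqrt(var) * rng.normal(size=(200_000, 3))
log_q = log_normal(z, mu, var)
log_p = log_normal(z, np.zeros(3), np.ones(3))
kl_mc = np.mean(log_q - log_p)

kl_exact = kl_to_standard_normal(mu, var)
```

The two values agree up to sampling noise, and the closed form is exactly zero when $\mu_j = 0$ and $\sigma_j^2 = 1$ for every $j$.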

The first term of the lower bound, the expected log likelihood under the encoder distribution, can be approximated using a Monte Carlo estimator:

\begin{equation}
\int q(z_n|x_n, \phi) \ln p(x_n|z_n, w) dz_n \approx \frac{1}{L} \sum_{l=1}^{L} \ln p(x_n|z_n^{(l)}, w)
\end{equation}

where $\{ z_n^{(l)} \}$ are samples drawn from the encoder distribution $q(z_n|x_n, \phi)$. This can be easily differentiated with respect to $w$, but the gradient with respect to $\phi$ is problematic because changes to $\phi$ change the distribution $q(z_n | x_n, \phi)$ from which the samples are drawn. The samples themselves, however, are fixed values, so there is no way to obtain their derivatives with respect to $\phi$. We can think of the process of fixing $z_n$ to a specific sample value as blocking the back-propagation of the error signal to the encoder network.

This can be resolved by using the reparameterization trick, in which we reformulate the Monte Carlo sampling procedure such that derivatives with respect to $\phi$ can be calculated explicitly. If $\epsilon$ is a Gaussian random variable with zero mean and unit variance, then

\begin{equation}
z = \sigma \epsilon + \mu
\end{equation}

will have a Gaussian distribution with mean $\mu$ and variance $\sigma^2$. This is applied to the samples in the Monte Carlo estimator, with $\mu$ and $\sigma$ determined by the encoder outputs $\mu_j(x_n, \phi)$ and $\sigma^2_j(x_n, \phi)$, which are the means and variances of the approximate posterior over the latent variable $z_n$.

So instead of drawing samples of $z_n$ directly, we draw samples for $\epsilon$ and use $z = \sigma \epsilon + \mu$ to evaluate corresponding samples for $z_n$:

\begin{equation}
z_{nj}^{(l)} = \sigma_j(x_n, \phi) \, \epsilon_{nj}^{(l)} + \mu_j(x_n, \phi)
\end{equation}

where $l = 1, \dots, L$ indexes the samples. This makes the dependence on $\phi$ explicit and allows gradients with respect to $\phi$ to be evaluated. Although techniques exist for evaluating these gradients without the reparameterization trick, the resulting estimators tend to have high variance. Reparameterization can therefore be viewed as a variance reduction technique.
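
The effect of reparameterization on gradients can be seen in a scalar sketch. Below, the test function $f(z) = z^2$ stands in for $\ln p(x_n|z_n, w)$ (an assumption made so that the exact answer is known), and $\mu$ and $\sigma$ are treated directly as the parameters to differentiate with respect to. Since $\mathbb{E}[z^2] = \mu^2 + \sigma^2$, the exact gradients are $2\mu$ and $2\sigma$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.7, 1.2  # illustrative parameter values
L = 200_000           # number of Monte Carlo samples

eps = rng.normal(size=L)  # noise that does not depend on (mu, sigma)
z = mu + sigma * eps      # reparameterized samples from N(mu, sigma^2)

# Chain rule through the sample: df/dmu = f'(z) * dz/dmu, and so on.
df_dz = 2.0 * z
grad_mu = np.mean(df_dz * 1.0)     # dz/dmu = 1
grad_sigma = np.mean(df_dz * eps)  # dz/dsigma = eps

exact_grad_mu, exact_grad_sigma = 2.0 * mu, 2.0 * sigma
```

Because the randomness now enters only through `eps`, each sample is a differentiable function of the parameters and the averaged derivatives approximate the exact gradients.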

The full objective for the VAE, which is the Monte Carlo estimate of the lower bound that is maximized during training, can be written as follows:

\begin{equation}
\mathcal{L} = \sum_n \left\{ \frac{1}{2} \sum_{j=1}^M \left\{ 1 + \ln \sigma_{nj}^2 - \mu_{nj}^2 - \sigma_{nj}^2 \right\} + \frac{1}{L} \sum_{l=1}^{L} \ln p(x_n | z_n^{(l)}, w) \right\}
\end{equation}

where $z_n^{(l)}$ has components $z_{nj}^{(l)} = \sigma_{nj} \epsilon_{nj}^{(l)} + \mu_{nj}$, in which $\mu_{nj} = \mu_{j} (x_n, \phi)$ and $\sigma_{nj} = \sigma_j(x_n, \phi)$, and the sum over $n$ runs over the data points in a mini-batch. Typically, the number of samples is $L=1$. Although this gives a noisy estimate of the bound, it forms part of a stochastic gradient optimization step that is already noisy, and overall it leads to more efficient optimization.
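
Putting the pieces together, the sketch below evaluates this objective for one mini-batch with $L = 1$. It assumes single-layer linear encoder and decoder maps and a unit-variance Gaussian decoder, purely to keep the example short; the parameter names are hypothetical and the weights are untrained. Gradients are not computed here, since that would require an automatic differentiation library.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, N = 6, 2, 4  # data dimension, latent dimension, mini-batch size

# Untrained, randomly initialised parameters (hypothetical names).
phi = {"W_mu": rng.normal(scale=0.1, size=(M, D)),
       "W_a": rng.normal(scale=0.1, size=(M, D))}
w = {"W": rng.normal(scale=0.1, size=(D, M))}

def encoder(x, phi):
    """Means and variances of q(z | x, phi): linear and exp output units."""
    mu = x @ phi["W_mu"].T
    var = np.exp(x @ phi["W_a"].T)
    return mu, var

def decoder_log_lik(x, z, w):
    """ln p(x | z, w) for a unit-variance Gaussian with mean g(z, w) = W z."""
    mean = z @ w["W"].T
    return -0.5 * np.sum((x - mean) ** 2 + np.log(2 * np.pi), axis=1)

def elbo_estimate(x, phi, w, rng):
    """Single-sample (L = 1) estimate of the lower bound, summed over the batch."""
    mu, var = encoder(x, phi)
    eps = rng.normal(size=mu.shape)
    z = mu + np.sqrt(var) * eps  # reparameterized sample
    neg_kl = 0.5 * np.sum(1.0 + np.log(var) - mu ** 2 - var, axis=1)
    return np.sum(neg_kl + decoder_log_lik(x, z, w))

x = rng.normal(size=(N, D))
value = elbo_estimate(x, phi, w, rng)
```

In practice the same computation would be written in a framework with automatic differentiation so that the gradients with respect to both `phi` and `w` can be obtained by back-propagation.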

VAE summary:

1) Each data point in a mini-batch is forwarded through the encoder to evaluate the means and variances of the appropriate latent distributions.

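2) A sample is drawn from each latent distribution using the reparameterization trick.

3) The samples are propagated through the decoder to evaluate the ELBO.

4) The gradients of the ELBO with respect to $w$ and $\phi$ are evaluated by back-propagation and used to update both networks.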

![VAE arch](./figures/vae_architecture.png)