In [1]:
%load_ext autoreload
%autoreload 2
%load_ext autotime

# AutoEncoders 101 

AutoEncoders typically have a hidden layer $h$ that represents the input vector $\textbf{x}$. They consist of two parts: an encoder that models $h = f(x)$ and a decoder that produces a reconstruction $r = g(h)$. AutoEncoders are trained in the hope that $h$ will take on a useful representation of the training data, and are therefore usually restricted in some way (for example, by making $h$ lower-dimensional than $\textbf{x}$) so that the network cannot simply learn to copy the input to the output.

AutoEncoders are trained to minimize a loss objective such as

$$
L(x, g(f(x)))
$$

where $L$ is a loss function that penalizes $g(f(x))$ for being dissimilar from $x$, such as mean squared error or cross-entropy loss.
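As a rough illustration of the objective above, the following NumPy sketch runs one forward pass of a tiny autoencoder and evaluates the loss. The layer sizes, the `tanh` activation, the random weights, and the choice of mean squared error for $L$ are all assumptions made for the example, not something prescribed by these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, W_enc, b_enc):
    # h = f(x): project the input into a smaller hidden code
    return np.tanh(x @ W_enc + b_enc)

def decoder(h, W_dec, b_dec):
    # r = g(h): map the hidden code back to the input space
    return h @ W_dec + b_dec

def reconstruction_loss(x, r):
    # L(x, g(f(x))), here assumed to be mean squared error
    return np.mean((x - r) ** 2)

# Hypothetical sizes: 8 input features compressed to 3 hidden units
n_features, n_hidden = 8, 3
x = rng.normal(size=(16, n_features))
W_enc = rng.normal(scale=0.1, size=(n_features, n_hidden))
b_enc = np.zeros(n_hidden)
W_dec = rng.normal(scale=0.1, size=(n_hidden, n_features))
b_dec = np.zeros(n_features)

h = encoder(x, W_enc, b_enc)
r = decoder(h, W_dec, b_dec)
loss = reconstruction_loss(x, r)
```

Training would adjust the weights to reduce `loss`; the bottleneck (`n_hidden < n_features`) is what keeps the network from copying its input.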

## Denoising AutoEncoders (DAEs)

Denoising AutoEncoders are very similar to AutoEncoders, with one change: the reconstruction is computed from a corrupted copy of the input, while the loss still compares against the clean input:

$$
L(x, g(f(\tilde{x})))
$$

where $\tilde{x}$ is a copy of $\textbf{x}$ that has been corrupted by some form of noise. Because the reconstruction is still scored against the clean $\textbf{x}$, the autoencoder has to undo the corruption and cannot simply learn the identity function.
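A minimal sketch of the corruption step, assuming additive Gaussian noise followed by random masking. The `corrupt` and `denoising_loss` helpers and their default noise levels are hypothetical, chosen only for illustration; the loss is again assumed to be mean squared error.

```python
import numpy as np

def corrupt(x, rng, noise_std=0.1, drop_prob=0.2):
    # Build x_tilde: add Gaussian noise, then zero out a random subset of entries
    noisy = x + rng.normal(scale=noise_std, size=x.shape)
    keep_mask = rng.random(x.shape) >= drop_prob
    return noisy * keep_mask

def denoising_loss(x, x_tilde, reconstruct):
    # L(x, g(f(x_tilde))): reconstruct from the corrupted copy,
    # but compare against the clean input
    r = reconstruct(x_tilde)
    return np.mean((x - r) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 5))
x_tilde = corrupt(x, rng)

# The identity function is no longer a perfect solution
identity_loss = denoising_loss(x, x_tilde, lambda v: v)
```

With a plain autoencoder the identity map would reach zero loss; here `identity_loss` is strictly positive, which is the point of the corruption.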

## Variational AutoEncoders

Variational AutoEncoders (VAEs) were introduced by Diederik Kingma and Max Welling in 2013. 

The key innovation is that they can be trained by maximizing the variational lower bound $\mathrm{L}(q)$ associated with a data point $x$:

$$
\begin{aligned}
\mathrm{L}(q) &= \mathbb{E}_{z \sim q(z | x)} \log p_{model}(z, x) + H(q(z | x)) \\
&= \mathbb{E}_{z \sim q(z | x)} \log p_{model}(x | z) - D_{KL}(q(z | x) \,\|\, p_{model}(z))
\end{aligned}
$$

In the second form, the first term is the reconstruction log-likelihood found in other autoencoders, while the second term pulls the approximate posterior distribution $q(z | x)$ and the model prior $p_{model}(z)$ toward each other.

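When $q$ is chosen to be Gaussian, with noise added to a predicted mean, maximizing the entropy term $H(q(z | x))$ encourages a larger standard deviation for that noise. More generally, this encourages the VAE to place high probability mass on many $z$ values that could have generated $x$, rather than collapsing onto the single most likely point.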
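To make the two terms of the lower bound concrete, here is a hedged NumPy sketch. It assumes a diagonal Gaussian $q(z | x)$ parameterized by `mu` and `log_var`, a standard normal prior $p_{model}(z)$, and a unit-variance Gaussian likelihood for $p_{model}(x | z)$; under those assumptions the KL term has a closed form. The function names are hypothetical.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var):
    # D_KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

def gaussian_log_likelihood(x, r):
    # log p(x | z) for an assumed unit-variance Gaussian centered on the reconstruction r
    d = x.shape[-1]
    return -0.5 * np.sum((x - r) ** 2, axis=-1) - 0.5 * d * np.log(2 * np.pi)

def elbo(x, r, mu, log_var):
    # Single-sample estimate: reconstruction term minus KL term
    return gaussian_log_likelihood(x, r) - gaussian_kl_to_standard_normal(mu, log_var)

# Toy values: a perfect reconstruction and a posterior shifted away from the prior
x = np.array([[0.5, -0.5]])
r = x.copy()
mu = np.array([[1.0, 0.0]])
log_var = np.zeros((1, 2))

kl = gaussian_kl_to_standard_normal(mu, log_var)
bound = elbo(x, r, mu, log_var)
```

Even with a perfect reconstruction, `bound` is reduced by `kl` because the posterior mean has moved away from the prior.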

Benefits: 

* Can represent much more complex relationships than traditional dimensionality reduction e.g., PCA
* No need for MCMC
* q is user defined 

### The reparameterization trick 

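The reparameterization trick makes the sampling step differentiable, so gradients can flow back through $z$ to the encoder parameters. Assuming the posterior is Gaussian, the randomness is moved into an auxiliary noise variable $\epsilon \sim \mathcal{N}(0, I)$; with a Gaussian prior as well, the second term in the loss function can also be computed analytically. $z$ is reparameterized as: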

$$
z = \mathbf{\epsilon}\mathbf{\sigma_x} + \mu_x
$$


<img src="./www/reparam_trick.png" alt="Reparameterization Trick" style="width: 400px;"/>
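The reparameterization can be sketched in a few lines of NumPy. This assumes the encoder outputs a mean `mu` and a log-variance `log_var` (a common but not mandatory parameterization); the `reparameterize` helper is a hypothetical name.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    # z = eps * sigma + mu, with eps drawn from N(0, I).
    # All randomness lives in eps, so z is a deterministic,
    # differentiable function of mu and sigma.
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mu.shape)
    return eps * sigma + mu

rng = np.random.default_rng(0)
mu = np.full((100_000, 2), 1.5)
log_var = np.full((100_000, 2), np.log(0.25))  # sigma = 0.5

z = reparameterize(mu, log_var, rng)
```

The samples in `z` have mean close to 1.5 and standard deviation close to 0.5, matching a direct draw from $\mathcal{N}(\mu_x, \sigma_x^2)$. In an autodiff framework the same expression lets gradients reach `mu` and `log_var`, which a direct draw would not.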


## Adversarial AutoEncoders





# Resources/References

[1 - AutoEncoding Variational Bayes](https://arxiv.org/abs/1312.6114)

[2 - Deep Learning Chapter 14, Goodfellow](https://www.deeplearningbook.org/contents/autoencoders.html)

[3 - Adversarial AutoEncoders](https://arxiv.org/abs/1511.05644)

[4 - An Introduction to Variational AutoEncoders](https://arxiv.org/pdf/1906.02691.pdf)