# Motivation: Representation learning as a side-effect

As a recap, let us remind ourselves what happens inside a deep neural network when it learns to classify a dataset that is not linearly separable:

In deep neural networks there is a powerful effect at work: in a sense, a combination of the piecewise construction of decision boundaries with the learning of successive, hierarchical "kernel operations" that "embed" the data space into a meaningful representation, enabling easy separability.

<img src="http://drive.google.com/uc?export=view&id=1tQu8JagtQKjd7xVbB5uDBA0CebjQcZ2B" width=50%>
<img src="http://drive.google.com/uc?export=view&id=1q6TEXhcZ0hU9nv4CycGcNJyUb9RqC_Xy" width=50%>
<img src="http://drive.google.com/uc?export=view&id=1UFV35b84geZTymaTKQBpLXva8efafloW" width=50%>
<img src="http://drive.google.com/uc?export=view&id=1jAyFn9iKhjSADG-YViN73goVic1x8iu5" width=50%>
<img src="http://drive.google.com/uc?export=view&id=1XSrsBdnan08LVjcVjiJwRwn6_u3HyzvA" width=50%>
<img src="http://drive.google.com/uc?export=view&id=1Aqx6qLy9pVt1-p2IG_CM0SKLnh7-Y2cI" width=50%>

[source](https://github.com/random-forests/applied-dl/blob/master/examples/twist-and-fold-moons.ipynb)

For some interactive / video illustrations see [here](https://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html) and [here](https://srome.github.io/Visualizing-the-Learning-of-a-Neural-Network-Geometrically/).

**One could argue that this in itself is the decisive feature behind the success of deep learning.**

In this sense, **representation learning** is the key to the potential of modern machine learning.

What if we could achieve the same representations without an explicit supervised task?

# Unsupervised learning in general

<img src="https://media.geeksforgeeks.org/wp-content/uploads/clusteringg.jpg" width=55%>

In unsupervised learning, the only information we have comes in the form of measurements $X$, without any labels, so the main objective is the **discovery of structure** inside the dataset, or to put it another way: **to discover relations of similarity** among the datapoints.

This, in turn, can be understood as **modeling the probability distribution** of the data, or, in the case of a generative model, trying to estimate the true generative distribution behind our observations, that is, the (probabilistic) mechanism that gives rise to our data as we were able to observe it.

In this sense, unsupervised learning has deep connections with generative modeling.

On the other hand, the discovery of such relations and associations presupposes an **efficient representation** of the data, in which relationships and similarities ("distances") can be measured well. This parallels techniques of **dimensionality reduction**.

Coming up with the optimal representation is, in turn, an exercise in **capturing the information content** of the dataset, which is deeply connected with **compressibility** (see the [Hutter Prize](https://en.wikipedia.org/wiki/Hutter_Prize)) and probability (see entropy and likelihood).

The core question will be: **How will we utilize the representation abilities of deep networks for unsupervised learning?**

## Sidenote: What follows is not the first idea, nor the last...

It is worth noting that autoencoders were by no means the first unsupervised neural models (think: [restricted Boltzmann machines](https://en.wikipedia.org/wiki/Restricted_Boltzmann_machine)), nor even the first deep ones (think: [deep belief networks](https://en.wikipedia.org/wiki/Deep_belief_network)), so for a full historical coverage of unsupervised learning, those have to be taken into consideration.

Interestingly, the idea of autoencoders in a sense goes back to the "original" deep learning solution of [Bengio et al.](https://papers.nips.cc/paper/3048-greedy-layer-wise-training-of-deep-networks.pdf), which was the layerwise pre-training of neural nets with "autoencoders" or "restricted Boltzmann machines". They thus started out as technical tools, and only later rose to prominence as full-fledged models.

On the other hand, new [efficient representation learning techniques](https://medium.com/syncedreview/geoffrey-hinton-google-brain-unsupervised-learning-algorithm-improves-sota-accuracy-on-imagenet-f0537f5b716a) are still being experimented with.

# Autoencoders and their variants (recap)

With all this in mind, let's do a short recap on the evolution of unsupervised representation learning with deep neural models.

## Simple AE

The most basic approach to representation learning with deep networks is to use an encoder-decoder framework, whereby the encoder turns the input into a dense vector representation (the code), and the decoder in turn tries to reconstruct the original data as well as possible. The reconstruction error is often taken to be a simple squared error (e.g. pixelwise error in the case of images).

<img src="https://miro.medium.com/max/3148/1*44eDEuZBEsmG_TCAKRI3Kw@2x.png" width=55%>

The most notable structural property of this setup is the relatively **low-dimensional bottleneck** formed by the code itself. Since only a severely limited "storage space" is available for representing the data, the code acts as an **[information bottleneck](https://en.wikipedia.org/wiki/Information_bottleneck_method)**, which enforces compression and hence the extraction of information.
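To make the encoder, bottleneck and decoder concrete, here is a minimal sketch of a purely linear autoencoder trained with plain gradient descent on the squared reconstruction error. The toy data, the layer sizes, the learning rate and all variable names are assumptions chosen for illustration. A real model would use nonlinear layers and a deep learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (an assumption for illustration): 200 points in 8 dimensions
# that actually lie on a 2-dimensional linear subspace.
latent_true = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 8))
X = latent_true @ mixing

code_dim = 2  # width of the bottleneck
W_enc = rng.normal(scale=0.1, size=(8, code_dim))  # encoder weights
W_dec = rng.normal(scale=0.1, size=(code_dim, 8))  # decoder weights

def reconstruction_loss(X, W_enc, W_dec):
    """Mean squared reconstruction error per datapoint."""
    code = X @ W_enc        # encoder: input -> code
    X_hat = code @ W_dec    # decoder: code -> reconstruction
    return np.mean(np.sum((X_hat - X) ** 2, axis=1))

initial_loss = reconstruction_loss(X, W_enc, W_dec)

learning_rate = 0.005
for step in range(3000):
    code = X @ W_enc
    X_hat = code @ W_dec
    grad_out = 2.0 * (X_hat - X) / len(X)   # dL/dX_hat
    grad_dec = code.T @ grad_out            # dL/dW_dec
    grad_enc = X.T @ (grad_out @ W_dec.T)   # dL/dW_enc
    W_dec -= learning_rate * grad_dec
    W_enc -= learning_rate * grad_enc

final_loss = reconstruction_loss(X, W_enc, W_dec)
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.6f}")
```

Because the toy data really is two-dimensional, a code of width 2 is enough for a near-perfect reconstruction. With a narrower bottleneck the network would be forced to discard information.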

Naturally, neither this encoder-decoder structure nor the loss / learning method imposes any structural constraints, so any kind of "transposable" operation can be part of the network. Once one figures out what the transpose of the convolution and pooling operators is, one can easily imagine a convolutional network serving as the encoder and the decoder as well.

#### Detour: The transpose of convolution

The task of the decoder is to somehow **upsample** the code into an actually usable image. In a convolutional architecture, this is not entirely trivial.

Enter "deconvolution" layers!

<img src="https://github.com/vdumoulin/conv_arithmetic/raw/master/gif/padding_strides_transposed.gif" width=35%>

It is well worth noting that, in the mathematical sense, this **cannot be considered [deconvolution](https://en.wikipedia.org/wiki/Deconvolution) proper**, so the name "deconvolution layer" is somewhat of a misnomer. What happens instead is that these **fractionally strided convolutions** (also called transposed convolutions) learn upsampling patterns ("filters") that increase the image dimensions (and typically decrease the number of channels).
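The following one-dimensional sketch illustrates why "transposed" is the more accurate name. The function names and the fixed example kernel are assumptions for illustration, and in a real layer the kernel would be learned. A strided convolution shrinks the signal, while its transpose lets every input value "stamp" a scaled copy of the kernel into a longer output.

```python
import numpy as np

def conv1d(x, kernel, stride):
    """Strided 'valid' convolution (cross-correlation): shrinks the signal."""
    k = len(kernel)
    out_len = (len(x) - k) // stride + 1
    return np.array([np.dot(x[i * stride:i * stride + k], kernel)
                     for i in range(out_len)])

def conv_transpose1d(y, kernel, stride):
    """Transposed convolution: each input value adds a scaled kernel copy."""
    k = len(kernel)
    out = np.zeros((len(y) - 1) * stride + k)
    for i, value in enumerate(y):
        out[i * stride:i * stride + k] += value * kernel
    return out

code = np.array([1.0, 2.0, 3.0])
kernel = np.array([1.0, 2.0, 1.0])

upsampled = conv_transpose1d(code, kernel, stride=2)
print(upsampled)  # [1. 2. 3. 4. 5. 6. 3.]  (3 values upsampled to 7)
```

The two functions are transposes of each other in the linear-algebra sense: for any `x` of length 7, `np.dot(conv1d(x, kernel, 2), code)` equals `np.dot(x, conv_transpose1d(code, kernel, 2))`. Applying one after the other does not recover the original signal, however, which is why this is not a deconvolution.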

<img src="https://www.researchgate.net/profile/Xifeng_Guo/publication/320658590/figure/fig1/AS:614154637418504@1523437284408/The-structure-of-proposed-Convolutional-AutoEncoders-CAE-for-MNIST-In-the-middle-there.png" width=55%>

##### The problem with pooling

One of the main problems with the pooling operator in classical convnets is that it is not invertible, since information is lost during the pooling itself.

Or to put it another way:

"Unpooling: In the convnet, the max pooling operation is non-invertible, however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variables. In the deconvnet, the unpooling operation uses these switches to place the reconstructions from the layer above into appropriate locations, preserving the structure of the stimulus."

[source](https://arxiv.org/pdf/1311.2901v3.pdf)
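The "switch" mechanism from the quote can be sketched in one dimension as follows. The function names are hypothetical, and the sketch assumes non-overlapping windows and an input length divisible by the window size.

```python
import numpy as np

def max_pool_with_switches(x, size):
    """Max pooling over non-overlapping windows, remembering the argmax positions."""
    windows = x.reshape(-1, size)
    switches = windows.argmax(axis=1)   # where the maximum sat in each window
    pooled = windows.max(axis=1)
    return pooled, switches

def unpool(pooled, switches, size):
    """Approximate inverse: put each value back at its recorded position, zeros elsewhere."""
    out = np.zeros((len(pooled), size))
    out[np.arange(len(pooled)), switches] = pooled
    return out.reshape(-1)

x = np.array([1.0, 5.0, 2.0, 0.0, 7.0, 3.0])
pooled, switches = max_pool_with_switches(x, size=2)
restored = unpool(pooled, switches, size=2)

print(pooled)    # [5. 2. 7.]
print(switches)  # [1 0 0]
print(restored)  # [0. 5. 2. 0. 7. 0.]
```

The maxima return to their original positions, but the non-maximal values (1, 0 and 3 here) are lost for good, which is exactly the information loss that makes pooling non-invertible.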

Introducing pooling and unpooling into convolutional autoencoder models would thus complicate the model considerably. Instead, **strided convolution** is often used as an alternative. (Some people suggest that, even for other use cases, using strides instead of explicit pooling is [not a bad idea](https://stats.stackexchange.com/questions/387482/pooling-vs-stride-for-downsampling).)

<img src="https://cdn-images-1.medium.com/max/1000/0*MN0gWvDpIthuBZqT.gif" width=55%>

### Limitations of AE

<img src="https://i.stack.imgur.com/2gSs1.png" width=55%>

One of the main limitations of the simple autoencoder can be observed well in the picture above.

There are densely packed regions of the latent space in which the same digits are placed near each other. So far, so good. On the other hand, the space between these regions is poorly covered, so if we tried to generate new datapoints by decoding arbitrary codes, we would be out of luck for many regions (see the gaps between the classes).

Or to put it another way: **the model is in a sense "overfit" to the observed data**.

What can help prevent overfitting?

Well, for example noise!

### Making it more robust: Denoising AE

The underlying motivation behind denoising autoencoders is to use input noise ("corruption") to enforce the stability and generalization of the learned inner representations of the network, much along the lines of how noise acts positively in deep learning training in general, in the form of data augmentation, dropout, minibatch gradient noise, etc.

The main trick is to apply noise to the input before it is fed to the network, but to compare the output to the original, clean datapoints rather than the noisy ones.

<img src="https://camo.githubusercontent.com/a183220225ec8f7ed632488b3e09a84d78a21d29/68747470733a2f2f63646e2d696d616765732d312e6d656469756d2e636f6d2f6d61782f313830302f312a47305634647a34524b544b47706562656f53574230412e706e67" width=55%>

The noise can be of the classical, Gaussian type:

<img src="https://static.packt-cdn.com/products/9781788399906/graphics/f9b44226-662e-43a1-aaa8-f9f952d8ce60.png" width=55%>

But it can also take more curious forms, e.g. "occlusion":

<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcSNCJLO3y3YqfwLJ8AhGATjhQkzFYGWYT7b2LBErslhvf-b7BX7" width=45%>

This scenario gives rise to the use case of [inpainting](https://en.wikipedia.org/wiki/Inpainting), which demonstrates that the networks are forced to learn contextual dependencies between the visible objects, and thus have to acquire some kind of "knowledge of the world".
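The two kinds of corruption, and the fact that the loss is computed against the clean target, can be sketched as follows. The helper names, the noise level, the size of the occluded patch and the stand-in `identity` "model" are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(x, std, rng):
    """Classical corruption: add zero-mean Gaussian noise to every pixel."""
    return x + rng.normal(scale=std, size=x.shape)

def occlude(x, top, left, height, width):
    """Occlusion corruption: blank out a rectangular patch of the image."""
    corrupted = x.copy()
    corrupted[top:top + height, left:left + width] = 0.0
    return corrupted

def denoising_loss(model, clean, corrupted):
    """The model sees the corrupted input, but is scored against the clean one."""
    reconstruction = model(corrupted)
    return np.mean((reconstruction - clean) ** 2)

clean = rng.uniform(size=(8, 8))          # stand-in for an image
noisy = add_gaussian_noise(clean, std=0.1, rng=rng)
masked = occlude(clean, top=2, left=2, height=4, width=4)

identity = lambda x: x                    # a "model" that just copies its input
print(denoising_loss(identity, clean, noisy))   # > 0: copying is no longer optimal
print(denoising_loss(identity, clean, masked))  # > 0: the patch has to be filled in
```

The identity mapping, which would be a perfect solution for a plain autoencoder with enough capacity, is penalized here. To do well, the network has to use the surrounding context to undo the corruption.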

The topic of "distinguishing noise from the real data" will definitely come back into play in GANs!

### Making it interpretable (?): Sparse autoencoders

In everyday settings, we make the strong assumption that a finite number of real causes underlie the great variety of experiences we are faced with. In a visual setting, for example, it is rational to assume that the number of objects in the room is far lower than the number of pixel patterns resulting from different viewing angles, lighting conditions, etc. Thus, it is a very natural urge to **disentangle the input**, that is, **to find a _sparse_, low-dimensional representation of its underlying causes** (hopefully).

The most suitable solution for enforcing sparsity of the code is to introduce an **additional regularization / penalty term in the loss function**.

<img src="https://miro.medium.com/max/1412/1*9eaYup-JypPPKJ3KfxZYbA.gif" width=45%>

A nice intro to sparse autoencoders can be found [here](https://medium.com/@syoya/what-happens-in-sparse-autencoder-b9a5a69da5c6).

The most common form of regularization is the **L1 penalty** on the code activations, which pushes most of them toward zero, so that only a few nodes in the hidden layers are active for any given input.

<img src="https://miro.medium.com/max/1648/1*k9RX5_kDYt2kG0u9ZREu5w.png" width=55%>

More on sparsity and causality in the context of conducting experiments can be found in [this paper](https://cims.nyu.edu/~bramley/publications/magic_box_cogsci.pdf), which also details the interaction between people's beliefs about sparsity and their experimentation.


There are [other options for adding sparsity](https://ieeexplore.ieee.org/document/7280364) constraints to the network; one of them is the **Kullback-Leibler (KL) divergence**, which we will discuss in a slightly different context below. Here, the underlying distribution is assumed to be Bernoulli, see [here](https://web.stanford.edu/class/cs294a/sparseAutoencoder.pdf).
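Both sparsity penalties can be written in a few lines. This is a sketch under assumptions: the function names and the target activation rate of 0.05 are illustrative, and the KL variant assumes the code activations lie in $(0, 1)$ (e.g. sigmoid outputs), so that the average activation of a unit can be read as the parameter of a Bernoulli distribution.

```python
import numpy as np

def l1_penalty(code):
    """L1 penalty: sum of absolute activations, averaged over the batch."""
    return np.mean(np.sum(np.abs(code), axis=1))

def kl_sparsity_penalty(code, target_rate=0.05, eps=1e-8):
    """KL divergence between Bernoulli(target_rate) and the mean activation of each unit."""
    rho = target_rate
    rho_hat = np.clip(code.mean(axis=0), eps, 1.0 - eps)  # average activation per unit
    kl_per_unit = (rho * np.log(rho / rho_hat)
                   + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))
    return np.sum(kl_per_unit)

sparse_code = np.full((10, 4), 0.05)   # units are active about 5% of the time
dense_code = np.full((10, 4), 0.5)     # units are active half of the time

print(kl_sparsity_penalty(sparse_code))  # approximately 0: matches the target rate
print(kl_sparsity_penalty(dense_code))   # clearly positive: too much activity
```

Either penalty would be added to the reconstruction error with a weighting coefficient, which then controls the trade-off between reconstruction quality and sparsity.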



## Variational Autoencoders (VAE)

A further bold step in the direction of modeling the data in a generative way is to relax the constraint that the autoencoding process be fully deterministic. This, in turn, allows us to think of the problem as **modeling the underlying data distribution**, together with the generative process that **draws a sample from the distribution**, thus creating a new datapoint.

This is the domain of **variational autoencoders**.

### The model

The basic assumption here is that the latent distribution can be defined as a multivariate distribution (often a Gaussian, for simplicity), and that the code learned by the encoder only serves as the parameters of this distribution. Or to put it another way: the encoder maps the datapoint to the parameters of a Gaussian in latent space, a sample is drawn from that Gaussian, and the decoder turns this sample into a new, unseen example that nonetheless resembles the original input.

<img src="https://camo.githubusercontent.com/a1ee306e347488d13d7c4dc1cb72ad84e0578595/68747470733a2f2f6b656974616b75726974612e66696c65732e776f726470726573732e636f6d2f323031372f31322f7661655f636f6d706c6574652e706e673f726573697a653d373530253243333232" width=55%>


#### How can we backprop through random variables? (reparametrization trick)

The main hurdle in training such models with gradient-based methods is that **the random sampling operator is not differentiable**.

A generally useful tool called the **reparametrization trick** can be applied here. The main idea is that a random variable can be decomposed into a random sample $\epsilon$ from a distribution with zero mean and unit variance, and the parameters: a standard deviation $\sigma$ by which we multiply, and a mean $\mu$ that we add, so that $z = \mu + \sigma \cdot \epsilon$.

<img src="https://i.stack.imgur.com/Djjr1.png" width=45%>

Observe that with this seemingly unremarkable reordering of the operations, we have separated the differentiable part (the parameters) from the non-differentiable one (the random draw), hence we can calculate gradients and do backpropagation without any problems.
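In code, the trick is a one-liner. The sketch below assumes that the encoder outputs the mean and the logarithm of the variance, a common but not mandatory parametrization, and `sample_latent` is a hypothetical helper name.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Draw z ~ N(mu, sigma^2) as a deterministic function of (mu, sigma) and noise."""
    eps = rng.standard_normal(mu.shape)   # the only random, non-differentiable draw
    sigma = np.exp(0.5 * log_var)         # standard deviation from log-variance
    z = mu + sigma * eps                  # differentiable in mu and sigma
    return z, eps

rng = np.random.default_rng(0)
mu = np.array([2.0, -1.0])
log_var = np.log(np.array([0.25, 4.0]))   # sigma = 0.5 and 2.0

z, eps = sample_latent(mu, log_var, rng)
# Given eps, the gradients are trivial: dz/dmu = 1 and dz/dsigma = eps.
print(z)
```

The randomness enters only through `eps`, which does not depend on any learned parameter, so gradients can flow from `z` back to `mu` and `log_var`.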

### The loss function

From a practical standpoint, and for the sake of simplicity, we posited the original distribution, which we reparametrize with the mean(s) and variance(s) (in the multivariate case), to be a Gaussian. For all intents and purposes, we would thus like to ensure that the learned distribution is not far from the one we posited.

We ensure this by adding a "regularization" term to the VAE loss (besides the reconstruction error) that measures the "distance" between the learned and the posited distribution with the **Kullback-Leibler divergence**:

$\frac{1}{2} \sum_{i=1}^n \left( \sigma^2_i + \mu_i^2 - \log(\sigma_i^2) - 1 \right)$

In fact, this will act as a regularizer towards a distribution with mean zero and unit variance.
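As a quick sanity check, the closed-form KL divergence between a diagonal Gaussian $N(\mu, \sigma^2)$ and the standard normal $N(0, 1)$ can be evaluated directly. It is written here with the conventional factor $\frac{1}{2}$ and with $\log \sigma_i^2$, and the function name is a hypothetical one.

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over the latent dimensions."""
    return 0.5 * np.sum(sigma ** 2 + mu ** 2 - np.log(sigma ** 2) - 1.0, axis=-1)

print(kl_to_standard_normal(np.zeros(3), np.ones(3)))                     # 0.0
print(kl_to_standard_normal(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # 0.5
```

The penalty vanishes exactly when every mean is zero and every variance is one, and it grows as the encoder moves away from this in either direction.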

A very nice in-depth discussion of the VAE loss can be found [here](https://wiseodd.github.io/techblog/2016/12/10/variational-autoencoder/).

#### Connections with variational inference

It is to be noted that the idea of positing some "family" of distributions and then finding the specific parameters of the distribution via optimization methods is a much more general concept, called [variational inference](https://en.wikipedia.org/wiki/Variational_Bayesian_methods). Thus, the VAE model can be considered an example of the general class of variational inference methods.

For a detailed introduction to variational inference see further [here](https://arxiv.org/abs/1601.00670).


### Controlling the latent space

Since we are interested in the "true" generative factors behind the data, and would like to exert control over the latent representation, one form we can choose is, paradoxically, to **train a VAE on a supervised dataset**, where **we use the class labels as additional conditions** which constrain the generation process, and thus ensure some kind of control over the latent representation.

This approach is called the **conditional variational autoencoder**, and the gist of the idea is to add the original class labels as additional input vectors for both the encoder and the decoder:

<img src="https://3.bp.blogspot.com/-X-2BF2ZJzlE/WVDBq5YaD_I/AAAAAAAADE0/IXhdDWg8L9oS_fR5kT9iTK8gQOoLullXgCLcBGAs/s1600/conditional_vae.png" width=45%>

We will see the resurfacing of this idea in the case of conditional GANs as well.
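Mechanically, the conditioning usually amounts to concatenating a one-hot label vector to the inputs of the two sub-networks. The sketch below shows only this step. The helper names and the sizes (784-dimensional images, a 2-dimensional latent code and 10 classes) are assumptions for illustration.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot row vectors."""
    return np.eye(num_classes)[labels]

def condition(inputs, labels, num_classes):
    """Append the one-hot label to every row of the input batch."""
    return np.concatenate([inputs, one_hot(labels, num_classes)], axis=1)

rng = np.random.default_rng(0)
images = rng.uniform(size=(4, 784))   # a batch of flattened images
latents = rng.normal(size=(4, 2))     # the corresponding latent samples
labels = np.array([3, 0, 7, 3])

encoder_input = condition(images, labels, num_classes=10)   # shape (4, 794)
decoder_input = condition(latents, labels, num_classes=10)  # shape (4, 12)
print(encoder_input.shape, decoder_input.shape)
```

At generation time we can then pick the label ourselves: sample a latent vector, append the one-hot code of the desired class, and let the decoder produce an example of that class.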

### Comparison of AE and VAE models

If we compare the latent representations of AEs and VAEs, we can see some very stark differences, mainly arising from the constraints we impose on the VAE model in the form of the pre-defined distribution and the KL divergence regularization term:

<img src="https://t1.daumcdn.net/cfile/tistory/99F936435CD4EBB235" width=55%>

A very nice analysis of the latent representations of AEs and VAEs can be found [here](https://thilospinner.com/towards-an-interpretable-latent-space/).

Some takeaways:

"
- Our visualization illustrates the large difference between AEs and VAEs regarding their behavior in latent space. For the given input images, the value range in latent space is $[−5.93,5.98]$ for the VAE in contrast to $[−22.59,74.46]$ for the AE.
- This makes clear why generating new content with AEs is difficult. To come up with new latent vectors, we need a lot of luck to find a value for which the decoder knows a useful reconstruction. In case of our VAE, we simply need to draw random samples from a standard normal distribution with a good chance that the decoder produces valid output.
- With respect to the full range in which latent values occur, the VAE produces reasonable outputs over a larger range than the AE.
- To our own surprise, interpolations between multiple images work quite well with both VAE and AE. It is interesting where the compression locates values in latent space. By just modifying the value of a single dimension, it is possible to generate multiple images of different digits."

### Limitations of VAEs

The strong constraints we put on the VAE model come at a cost.

It is far more difficult to represent a complex distribution if you are penalized strongly with a KL term, since two regions of high probability that no longer fit a single Gaussian well will get "averaged out", so the multiple **"modes"** of a multimodal distribution **"collapse"** into one rather "average" distribution.
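The averaging effect itself is easy to reproduce outside of a VAE. The sketch below is only an illustration of what happens when a single Gaussian has to cover two modes, not a VAE experiment. It fits one Gaussian by maximum likelihood (i.e. by matching mean and standard deviation) to samples from an assumed bimodal mixture with modes at $-3$ and $3$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Bimodal toy data (an assumption): an equal mixture of N(-3, 0.5^2) and N(3, 0.5^2).
n = 10000
centers = rng.choice([-3.0, 3.0], size=n)
samples = centers + rng.normal(scale=0.5, size=n)

# Maximum-likelihood fit of a single Gaussian = matching mean and standard deviation.
mu_fit = samples.mean()
sigma_fit = samples.std()

def normal_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def mixture_density(x):
    """True density of the bimodal data."""
    return 0.5 * normal_pdf(x, -3.0, 0.5) + 0.5 * normal_pdf(x, 3.0, 0.5)

print(f"fitted mean: {mu_fit:.2f}, fitted std: {sigma_fit:.2f}")
print(f"true density at fitted mean: {mixture_density(mu_fit):.2e}")
print(f"true density at a mode:      {mixture_density(3.0):.2e}")
```

The single Gaussian places its peak between the two modes, precisely where the true data has almost no probability mass. Samples from such an "average" distribution correspond to in-between datapoints.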

<img src="https://3.bp.blogspot.com/-oNG_y0SUvg4/V6ghC_9q9MI/AAAAAAAAFFs/q4kH7gxJIrQqBsyM-wIUJ_xNeeys2nZqACLcB/s1600/forward-KL.png" width=45%>

This leads to a high level of blurring in image generation:

<img src="https://www.researchgate.net/profile/Mark_Hasegawa-Johnson/publication/305654682/figure/fig9/AS:739387545513990@1553295136791/Images-generated-by-a-VAE-and-a-DCGAN-First-row-samples-from-a-VAE-Second-row-samples.jpg" width=55%>


**The biggest drawback of VAEs is the inflexibility of the distribution and the KL penalty. We would like to relax the assumptions on the form of the distribution, as well as retain as many modes as possible.**

This is where GANs will come to the rescue.

# Did "disentanglement" succeed?

In search of the real generative factors, multiple specialized models like [$\beta$-VAE](https://openreview.net/references/pdf?id=Sy2fzU9gl) were created, with the explicit aim of constraining / guiding the VAE training process so that it learns disentangled representations of the input, hopefully corresponding to the real causes we assume.
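In the case of $\beta$-VAE, the "guidance" is a single extra hyperparameter: the KL term of the VAE loss is weighted by a factor $\beta$, typically chosen larger than 1. A minimal sketch follows, in which the function name and the default value of `beta` are illustrative assumptions.

```python
def beta_vae_loss(reconstruction_error, kl_divergence, beta=4.0):
    """VAE loss with the KL term weighted by beta (beta = 1 gives the plain VAE loss)."""
    return reconstruction_error + beta * kl_divergence

print(beta_vae_loss(1.0, 0.5, beta=1.0))  # 1.5, the ordinary VAE loss
print(beta_vae_loss(1.0, 0.5, beta=4.0))  # 3.0, stronger pressure toward the prior
```

A larger $\beta$ pushes the latent code harder toward the factorized prior, at the price of a worse reconstruction.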

<img src="https://github.com/google-research/disentanglement_lib/raw/master/sample.gif?raw=true" width=45%>

In their exceptionally thorough investigative paper [Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations](https://arxiv.org/abs/1811.12359), the authors examined 12000 model variants with "approximately 2.52 GPU years (NVIDIA P100)" in a simulated environment (seen above), where we know which objects and properties cause the changes in the visual field.

(The code for their work is openly [available](https://github.com/google-research/disentanglement_lib).)

Sadly, what they found is quite discouraging:

"We analyze our experimental results and challenge common beliefs in unsupervised disentanglement learning: 
1. While all considered methods prove effective at ensuring that the individual dimensions of the aggregated posterior (which is sampled) are not correlated, we observe that the dimensions of the representation (which is taken to be the mean) are correlated. 
2. We do not find any evidence that the considered models can be used to reliably learn disentangled representations in an unsupervised manner as random seeds and hyperparameters seem to matter more than the model choice. Furthermore, good trained models seemingly cannot be identified without access to ground-truth labels even if we are allowed to transfer good hyperparameter values across data sets. 
3. For the considered models and data sets, we cannot validate the assumption that disentanglement is useful for downstream tasks, for example through a decreased sample complexity of learning."

So basically: we are not even close. The search continues.