# Book Part II: Neural Networks & Deep Learning
   
<img src="res/book.jpg" width = 25% align = "right">
   
---
**CH 15 - Autoencoders** **<---  <span style="color: #FF0000">THIS WEEK !</span>**


---
**CH 16 - Reinforcement Learning** <--- <span style="color: #0000FF">NEXT WEEK</span>

---


# What is an Autoencoder?

Let's start with a dataset with many features that we feed into a network. In the next layer, we **choke** the number of neurons to limit the amount of information the layer can store. Finally, an output layer fans back out so that it has the **same number of outputs** as there were **inputs**.

<img src="./images/autoencoder.png" />

The network's goal is to reproduce its **input** at the output despite the information lost in the middle. If it can do that accurately, then it has **learned a pattern** in the dataset that can be exploited!

This is an autoencoder.

### Autoencoders are great at uncovering patterns in unlabeled data.

### Use Cases
- dimensionality reduction
- feature extraction 
- unsupervised pretraining
- generative models

An autoencoder is always composed of two parts: an **encoder** (or recognition network) that converts the inputs to an internal representation, followed by a **decoder** (or generative network) that converts the internal representation to the outputs.

<img src="./images/undercomplete_autoencoder.png" />

Because the internal representation has a **lower dimensionality** than the input data (it is 2D instead of 3D), the autoencoder is said to be **undercomplete**. An undercomplete autoencoder cannot trivially copy its inputs to the codings, yet it must find a way to output a copy of its inputs. It is **forced to learn** the most important features in the input data (and drop the unimportant ones). 

## Simple Autoencoder... 

<img src="./images/pca_autoencoder_output.png" />

### PCA Dimension Reduction using an Undercomplete Autoencoder

In [None]:
import tensorflow as tf  # TensorFlow 1.x API
from tensorflow.contrib.layers import fully_connected  # tf.contrib was removed in TensorFlow 2

n_inputs = 3  # 3D inputs 
n_hidden = 2  # 2D codings

In [None]:
n_outputs = n_inputs  # important: the output layer must match the input size
learning_rate = 0.01

X = tf.placeholder(tf.float32, shape=[None, n_inputs])
hidden = fully_connected(X, n_hidden, activation_fn=None)  # activation_fn=None -> linear neurons
outputs = fully_connected(hidden, n_outputs, activation_fn=None)

In [None]:
reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))  # MSE

optimizer = tf.train.AdamOptimizer(learning_rate) 
training_op = optimizer.minimize(reconstruction_loss)
init = tf.global_variables_initializer() 

In [None]:
X_train, X_test = [...] # load the dataset
n_iterations = 1000 
codings = hidden  # the output of the hidden layer provides the codings

with tf.Session() as sess:
    init.run()
    for iteration in range(n_iterations):
        training_op.run(feed_dict={X: X_train})  # no labels (unsupervised)
    codings_val = codings.eval(feed_dict={X: X_test})  # evaluate once, after training

## Stacked Autoencoders (Deep Autoencoders)
<img src="./images/stacked_autoencoder.png" />

Autoencoders can have multiple hidden layers. In this case they are called stacked autoencoders (or deep autoencoders). 

The architecture of a stacked autoencoder is typically symmetrical with regard to the central hidden layer (the coding layer). To put it simply, it looks like a sandwich.



## Techniques

### Tying Weights

When an autoencoder is neatly symmetrical, we can 'tie' the weights of the decoder layers to the weights of the encoder layers: each decoder layer reuses the transposed weight matrix of its mirror encoder layer. This halves the number of weights, which speeds up training and reduces the risk of overfitting (fewer degrees of freedom, less complexity).
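
As a rough illustration of weight tying, here is a minimal NumPy sketch of a single-hidden-layer autoencoder with linear activations. The layer sizes and the names `W_enc`, `b_enc`, `b_dec` and `forward` are assumptions made for this example, not part of the notebook's TensorFlow code.

```python
import numpy as np

rng = np.random.default_rng(0)

n_inputs, n_hidden = 6, 3                       # illustrative sizes (assumption)
W_enc = rng.normal(size=(n_inputs, n_hidden))   # the only weight matrix to train
b_enc = np.zeros(n_hidden)                      # biases are not tied
b_dec = np.zeros(n_inputs)

def forward(X):
    """Encode, then decode with the transposed encoder weights."""
    codings = X @ W_enc + b_enc
    outputs = codings @ W_enc.T + b_dec         # decoder weights = W_enc transposed
    return codings, outputs

X = rng.normal(size=(4, n_inputs))
codings, outputs = forward(X)

untied_weights = 2 * n_inputs * n_hidden        # separate encoder and decoder matrices
tied_weights = W_enc.size                       # one shared matrix
```

Only `W_enc` and the two bias vectors would be updated during training, so the weight count is half that of the untied version.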


### Training One Autoencoder at a Time

Training an entire stacked autoencoder at once can take a long time. It is often much faster to train shallow autoencoders separately and assemble them afterwards, with comparable results.

#### Phase 1. Train a simple Autoencoder

The first autoencoder learns to reconstruct the inputs.

<img src="./images/phase-1.png" />

#### Phase 2. Train an Inner Autoencoder (Repeat for however many nested autoencoders...)

The second autoencoder learns to reconstruct the output of the first autoencoder’s hidden layer.

<img src="./images/phase-2.png" />

#### Phase 3. Encoders Assemble

Stack all the autoencoders.

<img src="./images/phase-3.png" />

<img src="./images/one-at-a-time.png" />
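
The three phases can be sketched in plain NumPy. This is only a toy version under stated assumptions: the autoencoders are linear, they are trained with simple batch gradient descent, and the helper `train_linear_autoencoder`, the toy data and the layer sizes (8 → 4 → 2) are all made up for illustration.

```python
import numpy as np

def train_linear_autoencoder(X, n_hidden, n_iterations=3000, learning_rate=0.02, seed=0):
    """Train one shallow linear autoencoder with batch gradient descent on squared error."""
    rng = np.random.default_rng(seed)
    n_inputs = X.shape[1]
    W_enc = rng.normal(scale=0.1, size=(n_inputs, n_hidden))
    W_dec = rng.normal(scale=0.1, size=(n_hidden, n_inputs))
    for _ in range(n_iterations):
        hidden = X @ W_enc
        error = hidden @ W_dec - X                   # reconstruction error
        grad_dec = hidden.T @ error / len(X)         # gradient w.r.t. decoder weights
        grad_enc = X.T @ (error @ W_dec.T) / len(X)  # gradient w.r.t. encoder weights
        W_enc -= learning_rate * grad_enc
        W_dec -= learning_rate * grad_dec
    return W_enc, W_dec

# Toy data (assumption): 8 features with decreasing variance
rng = np.random.default_rng(42)
scales = np.array([2.0, 1.5, 1.0, 0.8, 0.3, 0.2, 0.1, 0.1])
X_train = rng.normal(size=(200, 8)) * scales

# Phase 1: the first autoencoder learns to reconstruct the inputs (8 -> 4 -> 8)
W1_enc, W1_dec = train_linear_autoencoder(X_train, n_hidden=4)

# Phase 2: the second autoencoder learns to reconstruct the first hidden layer (4 -> 2 -> 4)
hidden1 = X_train @ W1_enc
W2_enc, W2_dec = train_linear_autoencoder(hidden1, n_hidden=2, seed=1)

# Phase 3: stack the trained layers (8 -> 4 -> 2 -> 4 -> 8)
def stacked_autoencoder(X):
    codings = X @ W1_enc @ W2_enc
    outputs = codings @ W2_dec @ W1_dec
    return codings, outputs

codings, outputs = stacked_autoencoder(X_train)
stacked_mse = np.mean((outputs - X_train) ** 2)
baseline_mse = np.mean(X_train ** 2)                 # error of always outputting zeros
```

Each call to the helper only ever trains one shallow network, which is what keeps the individual phases cheap.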


### Unsupervised Pretraining with Stacked Autoencoders

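We can reuse the layers of a trained autoencoder to quickly build a neural network that has already learned the patterns in an unlabeled dataset. The new layers added on top can then be trained with whatever labeled data is available.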

<img src="./images/pretrain-autoencoder.png" />


### Other Types of Autoencoders

#### Denoising Autoencoders
The autoencoder is trained to remove noise from its inputs: noise is added to the input data, but the reconstruction is evaluated against the original, noise-free inputs. This helps make sure the autoencoder learns patterns rather than memorizing the data, since simply copying the noisy input hurts its score. The noise is usually Gaussian noise added to the inputs, or inputs that are randomly switched off, as in dropout.
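
A minimal sketch of the idea in NumPy, assuming the autoencoder is any function that maps a batch of inputs to a batch of outputs. The helper names, the noise level and the drop rate are illustrative, and `drop_inputs` is a simplified dropout that zeroes inputs without rescaling the rest.

```python
import numpy as np

rng = np.random.default_rng(42)

def add_gaussian_noise(X, noise_std=0.1):
    """Corrupt the inputs with zero-mean Gaussian noise."""
    return X + rng.normal(scale=noise_std, size=X.shape)

def drop_inputs(X, drop_rate=0.3):
    """Corrupt the inputs by switching a random subset off (dropout-style, no rescaling)."""
    keep_mask = rng.random(X.shape) >= drop_rate
    return X * keep_mask

def denoising_loss(autoencoder, X_clean, corrupt):
    """The network only sees corrupted inputs but is scored against the clean ones."""
    X_noisy = corrupt(X_clean)
    outputs = autoencoder(X_noisy)
    return np.mean((outputs - X_clean) ** 2)

X_clean = rng.normal(size=(100, 5))

def identity(X):
    """Stand-in 'autoencoder' that just copies its (noisy) input."""
    return X

loss_gauss = denoising_loss(identity, X_clean, add_gaussian_noise)
loss_drop = denoising_loss(identity, X_clean, drop_inputs)
```

The identity function stands in for a network that merely copies what it sees: both losses come out greater than zero, so copying the corrupted input is penalized.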

#### Sparse Autoencoders

<img src='./images/kl_divergence_sparsity_loss.PNG' />

These autoencoders limit the number of neurons that are active at any one time, pushing the autoencoder to represent each input with fewer activations. To do this, measure the **mean activation per neuron** over the training batch and penalize neurons that are too active by adding a **sparsity loss** to the cost function.
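
One common choice for the sparsity loss is the KL divergence between a target sparsity and each neuron's mean activation. The sketch below assumes the coding layer outputs values strictly between 0 and 1 (for example sigmoid activations). The target of 0.1 and the function names are illustrative.

```python
import numpy as np

def kl_divergence(p, q):
    """KL divergence between two Bernoulli distributions with means p and q."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def sparsity_loss(hidden_activations, sparsity_target=0.1, eps=1e-7):
    """Sum over neurons of KL(target || mean activation over the batch)."""
    mean_activation = hidden_activations.mean(axis=0)          # one value per neuron
    mean_activation = np.clip(mean_activation, eps, 1 - eps)   # avoid log(0)
    return np.sum(kl_divergence(sparsity_target, mean_activation))

rng = np.random.default_rng(0)
hidden_activations = rng.random((32, 4))   # stand-in for sigmoid outputs in [0, 1)
loss = sparsity_loss(hidden_activations)
```

The total cost would then be something like `reconstruction_loss + sparsity_weight * loss`, where `sparsity_weight` is a hyperparameter to tune.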

#### Variational Autoencoders (VAE)
These are **generative autoencoders**, meaning that they can generate new instances that look like they were sampled from the training set. They use Gaussian noise to create their codings.

<img src='./images/variational_autoencoder.PNG' />

Instead of directly producing a coding for a given input, the encoder produces a **mean coding (μ)** and a **standard deviation (σ)**. The actual coding is then sampled randomly from a Gaussian distribution with mean μ and standard deviation σ.

<img src='./images/variational_autoencoder2.PNG' />
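
The sampling step can be written in a couple of lines. In this NumPy sketch, `mu` and `sigma` stand in for the two encoder outputs; their values here are made up, since in a real VAE they would be computed by the encoder for each input.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_codings(mu, sigma):
    """Draw codings from a Gaussian with mean mu and standard deviation sigma."""
    noise = rng.normal(size=mu.shape)   # standard Gaussian noise
    return mu + sigma * noise

# Stand-in encoder outputs for a batch of 3 instances with 2D codings (assumption)
mu = np.array([[0.0, 1.0], [2.0, -1.0], [0.5, 0.5]])
sigma = np.array([[1.0, 0.1], [0.5, 0.5], [0.2, 2.0]])

codings = sample_codings(mu, sigma)
```

Writing the sample as `mu + sigma * noise` keeps the randomness in `noise`, so the result stays a simple function of `mu` and `sigma`.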

#### Contractive Autoencoders (CAE)
The autoencoder is constrained during training so that the derivatives of the codings with regard to the inputs are small. In other words, two similar inputs must have similar codings.
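
A usual way to express this constraint is to penalize the squared Frobenius norm of the Jacobian of the codings with respect to the inputs. The sketch below assumes a single sigmoid coding layer, for which the Jacobian has a simple closed form. The names and sizes are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_penalty(X, W, b):
    """Mean over the batch of the squared Frobenius norm of d(codings)/d(inputs).

    For codings h = sigmoid(X @ W + b), dh_j/dx_i = h_j * (1 - h_j) * W[i, j].
    """
    h = sigmoid(X @ W + b)                    # (batch, n_hidden)
    slope_sq = (h * (1 - h)) ** 2             # squared sigmoid derivative
    w_sq = np.sum(W ** 2, axis=0)             # squared norm of each neuron's weights
    return np.mean(slope_sq @ w_sq)

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 5))
W = rng.normal(size=(5, 3))
b = np.zeros(3)

penalty = contractive_penalty(X, W, b)
```

This penalty would be added to the reconstruction loss with a weighting hyperparameter, so that the network trades reconstruction quality against insensitivity to small input changes.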

#### Stacked Convolutional Autoencoders
Autoencoders that learn to extract visual features by reconstructing images processed through convolutional layers. 

#### Generative Stochastic Networks (GSN)
A generalization of denoising autoencoders, with the added capability to generate data. 

#### Winner-Take-All (WTA) Autoencoders
During training, after computing the activations of all the neurons in the coding layer, only the top k% activations for each neuron over the training batch are preserved, and the rest are set to zero. Naturally this leads to sparse codings. Moreover, a similar WTA approach can be used to produce sparse convolutional autoencoders. 
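
The selection step might look like the following NumPy sketch, where rows are instances in the batch and columns are coding neurons. The function name, the batch size and the choice of k = 10% are assumptions, and tied activations at the threshold would all be kept.

```python
import numpy as np

def winner_take_all(activations, k_percent=10):
    """For each neuron (column), keep its top k% activations over the batch; zero the rest."""
    batch_size = activations.shape[0]
    n_keep = max(1, int(np.ceil(batch_size * k_percent / 100)))
    # Per-neuron threshold: the n_keep-th largest activation in that column
    thresholds = np.sort(activations, axis=0)[-n_keep]
    return np.where(activations >= thresholds, activations, 0.0)

rng = np.random.default_rng(0)
activations = rng.random((20, 5))        # stand-in coding-layer activations for one batch
sparse_codings = winner_take_all(activations, k_percent=10)
```

With a batch of 20 and k = 10%, each neuron keeps its 2 largest activations and the other 18 are set to zero.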

#### Adversarial Autoencoders
One network is trained to reproduce its inputs, and at the same time another is trained to find inputs that the first network is unable to properly reconstruct. This pushes the first autoencoder to learn robust codings.