# Encoders and Decoders



# Autoencoder

A very simple but useful MLP architecture is the autoencoder. At its core it is a model that encodes vectors and then decodes them back to themselves, i.e. it is trained to approximate the identity function $f(x) = x$. At first glance this might not seem useful *at all*: given everything we know about neural networks, spending a large amount of compute to learn the simplest possible function seems like a waste of time. However, it can be incredibly valuable as a space-saving tool.

Consider the human genome with its roughly $3 \times 10^9$ base pairs, and suppose that we have a dataset of 400,000 humans. That is a very large amount of data, which would incur substantial storage costs and long read times during analysis. We also know there is a large amount of redundancy in the data: the human genome is mostly the same from person to person. Therefore, we would like to strip away this redundancy by encoding our genomes for storage, and to create a decoder to reconstruct them when required. Composing the two gives the identity function mentioned before. The autoencoder has two parts, an encoder $f_{\text{enc}}$ and a decoder $f^{\text{dec}}$, obeying the relationship:

$$f(x) = f^{\text{dec}} \circ f_{\text{enc}} (x) = x$$

Ideally the decoder inverts the encoder on the data we care about. In practice the relationship holds only approximately, and when $m < n$ the encoder cannot be inverted on all of $\mathbb{R}^n$, only on the lower-dimensional set where the data lie. We proceed in the normal fashion, stating that both are feed-forward neural networks, with the encoder mapping vectors of length $n$ into a latent space of dimension $m$: $f_{\text{enc}}: \mathbb{R}^n \rightarrow \mathbb{R}^m$. The decoder is the reverse: $f^{\text{dec}}: \mathbb{R}^m \rightarrow \mathbb{R}^n$. There are several trade-offs to be made: a lower $m$ gives a smaller storage size for the latents but typically requires more layers, and thus more parameters, to train. A higher $m$ requires fewer parameters but more storage.

## 1.x Creating the Network
Let's proceed in the normal fashion by loading our data:

Now that we have our data and have inspected its basic properties, let's define our encoder and decoder neural networks. Together they make up the complete autoencoder, but since they perform distinct (and usually independent) tasks it is useful to keep them as two different objects. We will use the general layer structure that we created for MLPs to build a 4-layer network that maps into a latent space of dimension 3 as the encoder, and a 2-layer network that maps back into the original dimension of 10 as the decoder. We will use ``tanh`` as the activation function for all of the layers and initialise the weights randomly.
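
The layer layout described above can be sketched in plain Python as follows. This is only a minimal illustration, not the MLP structure from the earlier notebook: the helper names (`init_layer`, `init_mlp`, `forward`), the hidden widths (8, 6, 4 and 6), the weight scale and the example input are all assumptions made for the sketch.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) for one dense layer with small random weights."""
    scale = 1.0 / math.sqrt(n_in)  # assumed scaling, chosen to keep tanh unsaturated
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def init_mlp(sizes, seed=0):
    """Build a list of layers; sizes = [input dim, hidden dims..., output dim]."""
    rng = random.Random(seed)
    return [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]


def forward(layers, x):
    """Apply each layer in turn: x <- tanh(W x + b)."""
    for weights, biases in layers:
        x = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(weights, biases)]
    return x


# Hidden widths are illustrative assumptions; only 10 -> 3 -> 10 comes from the text.
encoder = init_mlp([10, 8, 6, 4, 3], seed=1)  # 4 layers: R^10 -> R^3
decoder = init_mlp([3, 6, 10], seed=2)        # 2 layers: R^3 -> R^10

x = [0.1 * i for i in range(10)]  # made-up input vector of length 10
z = forward(encoder, x)           # latent code, length 3
x_hat = forward(decoder, z)       # reconstruction, length 10
```

Keeping `encoder` and `decoder` as separate lists of layers mirrors the idea that they are two different objects. Because ``tanh`` is also applied on the final decoder layer, every reconstructed entry lies in $(-1, 1)$, so in this sketch the data would need to be scaled into that range.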

## 1.x Training the Network

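To train the network we have to differentiate through all of the layers. We will use the forward-mode automatic differentiation routine we defined in the previous notebook for all of our derivatives. Note that we are doing this for simplicity: forward-mode AD needs one pass per input direction, so it is best suited to maps with few inputs and many outputs, such as the decoder, while reverse-mode AD is better suited to maps with many inputs and few outputs, such as the encoder and, more generally, the gradient of a scalar loss with respect to many parameters.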

In [1]:
# import routine
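
For readers without the previous notebook to hand, forward-mode AD can be illustrated with dual numbers. The sketch below is not that notebook's routine: the `Dual` class, the lifted `tanh` and the single-weight `neuron` function are hypothetical names introduced here, and only addition and multiplication are supported.

```python
import math


class Dual:
    """A number val + der*eps with eps**2 == 0; `der` carries the derivative."""

    def __init__(self, val, der=0.0):
        self.val = val
        self.der = der

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.der + other.der)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)

    __rmul__ = __mul__


def tanh(x):
    """tanh lifted to dual numbers via the chain rule: tanh' = 1 - tanh^2."""
    t = math.tanh(x.val)
    return Dual(t, (1.0 - t * t) * x.der)


def neuron(w, x, b):
    """A one-weight 'layer': tanh(w * x + b)."""
    return tanh(w * x + b)


# Seeding der=1.0 on w yields d(output)/dw in a single forward pass.
out = neuron(Dual(0.5, 1.0), 2.0, 0.1)
```

Here `out.val` holds the neuron's output and `out.der` its derivative with respect to `w`. Obtaining the derivative with respect to a different parameter requires another pass with that parameter seeded instead, which is why the cost of forward mode grows with the number of inputs.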

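Now we also need to define the loss function. We care about the precise base at each position, so we will use the Hamming distance between the original sequence $x$ and its reconstruction $\hat{x}$ as the loss: ``L(x, xhat) = sum((x_i == xhat_i) ? 0 : 1 for i in 1:n)``. Note the use of the ternary operator here. Because the Hamming distance is piecewise constant, its gradient is zero almost everywhere, so gradient-based training needs a differentiable surrogate in practice. Fortunately, we know precisely how the data ought to be mapped: to itself! This lets us define the backpropagation routine: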

In [2]:
# define Loss

# define back-prop
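
As a small illustration of the Hamming distance described above, here is a sketch in Python, where the conditional expression `0 if a == b else 1` plays the role of the ternary operator. The function name `hamming_loss` and the two example sequences are made up for this sketch and are not taken from the dataset.

```python
def hamming_loss(x, x_hat):
    """Count the positions at which two equal-length sequences differ."""
    if len(x) != len(x_hat):
        raise ValueError("sequences must have the same length")
    return sum(0 if a == b else 1 for a, b in zip(x, x_hat))


original = "GATTACA"        # made-up example sequence
reconstruction = "GACTATA"  # made-up reconstruction with two mismatches
loss = hamming_loss(original, reconstruction)
```

The count only changes when a reconstructed symbol flips, so small changes in the network weights usually leave it untouched. This makes it a natural evaluation metric for reconstruction quality, even where a smoother loss is used to compute gradients.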

All