### Recurrent Neural Networks (RNN): a better version of neural networks for sequential data 
Recurrent neural networks are a kind of neural network specialized for sequential data. In this model, instead of taking a whole document as input, we read the document character by character. This is useful because it reduces the complexity of the model: the input is a single character instead of thousands of characters.[3] By studying the relationship between RNNs and the Enigma machine, we will notice two gaps in machine learning research that apply to cryptography problems. So let's start by understanding how an RNN works.

At its core, a recurrent neural network performs dot products between its input and its internal weights. We represent the data as a matrix $x \in R^{n \times m}$, with $n$ being the length of the data (e.g., a document with 1000 characters has length 1000) and $m$ being the dimension of each character (26 letters plus a null character, so $m = 27$). Then, we sequentially feed the model one row $x_t \in R^m$ at a time, each representing a character. The recurrent neural network concatenates that input with a hidden vector $h_t \in R^h$, which is initially full of zeros (later on, it will store a representation of the input seen so far). Then, it multiplies the concatenated vector by a weight matrix $w_{hid} \in R^{h \times (m + h)}.$ Finally, it applies a non-linear function to the result, which allows the model to represent more complex functions. The function it applies is the hyperbolic tangent (tanh), which "squashes" its input into the range $(-1, 1).$ It looks as follows.
<img src="http://mathworld.wolfram.com/images/interactive/TanhReal.gif" width="500px">
<div align="center">
    Source: mathworld.wolfram.com
</div>
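
To make the $n \times m$ representation concrete, here is a minimal sketch of how a document could be turned into one vector per character. It assumes a one-hot encoding, with the null character at index 0 followed by the 26 lowercase letters; the names `ALPHABET`, `one_hot`, and `encode` are hypothetical and only serve this illustration.

```python
import string

# Assumed alphabet: a null character (index 0) plus 26 lowercase letters,
# so each character becomes a vector of dimension m = 27.
NULL = "\0"
ALPHABET = NULL + string.ascii_lowercase
M = len(ALPHABET)

def one_hot(char):
    """Return the one-hot vector of a character; unknown characters map to NULL."""
    index = ALPHABET.find(char.lower())
    if index == -1:
        index = 0
    vector = [0.0] * M
    vector[index] = 1.0
    return vector

def encode(text):
    """Encode a document of n characters as an n x m list of one-hot rows."""
    return [one_hot(c) for c in text]

x = encode("enigma")
print(len(x), len(x[0]))  # 6 27
```

Each row of `x` is what the network receives at one time step.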

Formally, we have that 
$$h_{t+1} = tanh(w_{hid} \cdot [x_t h_t]^T)$$

This hidden state is the model's internal representation of the input it has seen so far. But it's not enough for the model to know its state; we also want it to generate something we can interpret. So we introduce a new matrix $w_{out} \in R^{k \times h}$, where $k$ is the size of the output (for instance, $k = m$ if the model outputs a character), that yields the output when multiplied by the state of the model.

Formally, we have that
$$y_t = tanh(w_{out} \cdot h_t)$$
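
The two equations can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the sizes (27 inputs, 8 hidden units, 27 outputs), the random weights, and the helper names `matvec` and `rnn_step` are all made up for the example, and `w_out` is taken to have one row per output and one column per hidden unit so that its product with $h_t$ is defined.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def rnn_step(w_hid, w_out, x_t, h_t):
    """Compute y_t = tanh(w_out . h_t) and h_{t+1} = tanh(w_hid . [x_t, h_t])."""
    y_t = [math.tanh(a) for a in matvec(w_out, h_t)]
    h_next = [math.tanh(a) for a in matvec(w_hid, x_t + h_t)]
    return h_next, y_t

# Hypothetical sizes: m = 27 input dimensions, 8 hidden units, 27 outputs.
m, hidden, out = 27, 8, 27
rng = random.Random(0)
w_hid = [[rng.uniform(-0.1, 0.1) for _ in range(m + hidden)] for _ in range(hidden)]
w_out = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)] for _ in range(out)]

h = [0.0] * hidden        # the hidden state starts full of zeros
for index in [5, 14, 9]:  # "e", "n", "i", assuming null = 0, a = 1, ..., z = 26
    x_t = [1.0 if j == index else 0.0 for j in range(m)]
    h, y = rnn_step(w_hid, w_out, x_t, h)
```

The same two weight matrices are reused at every step; only the hidden state `h` changes as characters are read.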

The idea is to have several of these units, so we can compute several features at the same time. One property of the data in our universe is that it is often hierarchical. For instance, a picture of a neighborhood is composed of houses and streets, which are composed of triangles and squares, which in turn are composed of lines, which are composed of dots. This is why it's useful to stack several layers of these units, where each layer computes higher-level features by taking as input the features computed by the previous layer.

Now, there are some problems with this recurrent neural network. The principal problem is that the hidden state vector, which has a fixed size, has to remember an ever larger amount of information. As time goes on, new inputs overwrite older ones, so the model starts to forget the first things it received.

Note that solving this problem is relevant for having a model that cracks the Enigma. Only a model capable of long-term memory can solve difficult puzzles like the Enigma. To illustrate this point, think about this: how could we decode the Enigma if we could only remember the last 10 characters we have seen? It's difficult to find any patterns with such a small amount of data.

Note that processing the whole input at every step is not feasible. When humans read a book, we don't read every page at once. Instead, we want to exploit the sequential property of the data. There are several methods to solve this problem. We are going to analyze only two.
* More complex units: Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU)
* Attention: referring to the whole input, paying more "attention" to some parts of the input than to others.


### More complex unit: LSTM
We said that we want a way to carry dependencies over time. What we want is gates that can leave the hidden state unchanged when the unit needs to. Note that in the standard recurrent neural network, if two units share the same hidden state, they will output the same vector. Instead, we want a recurrent neural network whose units can produce different outputs without modifying the hidden state.

Thus, instead of directly computing the hidden state from the previous hidden states, we compute the values of four vectors.

We are going to use the sigmoid function, which we can picture as follows: 

<img src="https://cdn-images-1.medium.com/max/1600/1*Xu7B5y9gp0iL5ooBj7LtWw.png" width="500px">
<div align="center">
    Source: http://towardsdatascience.com
</div>

This function "squashes" its input into the range $(0, 1).$ This is useful because the LSTM applies this function to three values to convert them into "gates." These gates allow us to combine different variables. For instance, the LSTM can decide to add the new input entirely to the internal state (using a value of 1 for the gate), to discard the new input entirely (using a value of 0 for the gate), or to do something in between (using a value between 0 and 1). Note that, in theory, the sigmoid function never outputs exactly 0 or 1, but in practice we get these values because of floating-point rounding.
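
The rounding behaviour is easy to check. The sketch below uses a numerically stable form of the sigmoid (an implementation choice made here, not something the text prescribes) and 64-bit floats; the helper `apply_gate` and the sample values are hypothetical.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def apply_gate(gate, values):
    """Scale each value by the matching gate entry (elementwise product)."""
    return [g * v for g, v in zip(gate, values)]

new_input = [0.5, 0.3, 0.9]
# A large positive input rounds to exactly 1, a very negative one underflows to 0.
gate = [sigmoid(40.0), sigmoid(-750.0), sigmoid(0.0)]
print(gate)                         # [1.0, 0.0, 0.5]
print(apply_gate(gate, new_input))  # [0.5, 0.0, 0.45]
```

The first entry passes through untouched, the second is discarded, and the third is halved, which are the three cases described above.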

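Now, $w_{hid} \in R^{4h \times 2h}$, and we compute the four vectors as follows, where $h_{t-1}^l$ is the hidden state of layer $l$ at the previous time step and $h_t^{l-1}$ is the output of the layer below at the current time step: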

$$
\begin{bmatrix}
    i \\
    f \\
    o \\
    g \\
\end{bmatrix} =
\begin{bmatrix}
    sigmoid \\
    sigmoid \\
    sigmoid \\
    tanh \\
\end{bmatrix} \Bigg( w_{hid} \cdot 
\begin{bmatrix} 
    h_{t-1}^l \\
    h_t^{l-1}
\end{bmatrix}
\Bigg)
$$

Note that $i, f,$ and $o$ are the gates, with values between 0 and 1. The vector $g$ isn't a gate. It represents the new memory we are going to store. Note that the tanh function on $g$ is a way of normalizing it, preventing it from becoming too large.

We then apply the following operations:

$$
\begin{aligned}
c_t^l &= f \odot c_{t - 1}^l + i \odot g \\
h_t^l &= o \odot \tanh(c_t^l)
\end{aligned}
$$

In the first equation, the vector $f$ stands for the forget term and $i$ for the input term. Thus, in that equation we decide how much we forget from the previous state vector $c_{t - 1}^l$ and how much new information we add from the vector $g.$ Then, in the second equation, we compute the next hidden state vector based on the state vector $c_t^l.$
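
The gate computation and the two update equations can be put together as follows. This is a minimal sketch that follows the formulation in this section (no bias terms, a single matrix $w_{hid}$ with $4h$ rows and $2h$ columns, so the input from the layer below must also have dimension $h$). The function name `lstm_step`, the size of 4, and the random weights are assumptions made for the example.

```python
import math
import random

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def lstm_step(w_hid, h_prev, h_below, c_prev):
    """One LSTM step for layer l at time t.

    h_prev is h_{t-1}^l, h_below is h_t^{l-1}, c_prev is c_{t-1}^l.
    """
    size = len(h_prev)
    stacked = h_prev + h_below  # concatenation, length 2h
    pre = [sum(w * v for w, v in zip(row, stacked)) for row in w_hid]
    i = [sigmoid(a) for a in pre[0:size]]               # input gate
    f = [sigmoid(a) for a in pre[size:2 * size]]        # forget gate
    o = [sigmoid(a) for a in pre[2 * size:3 * size]]    # output gate
    g = [math.tanh(a) for a in pre[3 * size:4 * size]]  # new memory
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c

size = 4  # hypothetical number of hidden units h
rng = random.Random(0)
w_hid = [[rng.uniform(-0.5, 0.5) for _ in range(2 * size)] for _ in range(4 * size)]

h, c = [0.0] * size, [0.0] * size
for _ in range(3):
    h_below = [rng.uniform(-1.0, 1.0) for _ in range(size)]  # stand-in for the layer below
    h, c = lstm_step(w_hid, h, h_below, c)
```

Note that `c` is only ever scaled by `f` and added to, which is what lets information survive over many steps when the forget gate stays close to 1.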

The process of calculating the value of the equations can be visually summarized as follows.

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png" width="700">

Source: (http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png)

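In a more compact notation, where $gate(a, b)$ denotes a gated combination of $a$ and $b$, the same update reads:

$$
\begin{aligned}
c_t &= gate(c_{t-1}, 0) + gate(nn(x), 0) \\
h_t &= gate(nn(c_t), 0)
\end{aligned}
$$

Gates are parameterized by $h_{t-1}$ and $x.$ (The peephole LSTM variant also uses $c_{t-1}$ and $c_t$ to parameterize the gates.) Here, $nn(x) = \tanh(affine(x)).$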

### Another unit: GRU
We generate the next state of the GRU as a gated combination of the previous state and a new state. The new state is generated by combining the previous state and the input and then applying a transformation.

$$
\begin{aligned}
\tilde h &= [gate(h_{t-1}, 0), x_t] \\
h_t &= gate(h_{t-1}, nn(\tilde h))
\end{aligned}
$$

Compared to the LSTM:
* The GRU merges the input and forget gates into a single gate.
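
The GRU equations above can be sketched as follows. The sketch reads $gate(a, b)$ as $s \odot a + (1 - s) \odot b$ for a sigmoid gate $s$ computed from $h_{t-1}$ and $x_t$; that reading, the weight names `w_r`, `w_z`, `w_new`, the absence of bias terms, and the sizes are all assumptions made for this illustration.

```python
import math
import random

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def gru_step(w_r, w_z, w_new, x_t, h_prev):
    """One GRU step; both gates are computed from h_{t-1} and x_t."""
    both = h_prev + x_t
    r = [sigmoid(a) for a in matvec(w_r, both)]  # gate applied to the old state
    z = [sigmoid(a) for a in matvec(w_z, both)]  # single keep/write gate
    # h_tilde = [gate(h_{t-1}, 0), x_t]
    h_tilde = [r_k * h_k for r_k, h_k in zip(r, h_prev)] + x_t
    # nn(h_tilde) = tanh(affine(h_tilde)), here without a bias term
    new = [math.tanh(a) for a in matvec(w_new, h_tilde)]
    # h_t = gate(h_{t-1}, nn(h_tilde))
    return [z_k * h_k + (1.0 - z_k) * n_k for z_k, h_k, n_k in zip(z, h_prev, new)]

m, hidden = 3, 2  # hypothetical input and state sizes
rng = random.Random(0)

def random_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

w_r, w_z, w_new = (random_matrix(hidden, hidden + m) for _ in range(3))

h = [0.0] * hidden
for x_t in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]):
    h = gru_step(w_r, w_z, w_new, x_t, h)
```

The single gate `z` decides both how much of the old state is kept and how much of the new state is written, which is where the LSTM would use separate forget and input gates.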
