# Long Short-Term Memory (LSTM)

LSTM was first introduced by <a href="https://www.bioinf.jku.at/publications/older/2604.pdf">Hochreiter & Schmidhuber (1997)</a> to address the difficulty that earlier recurrent neural networks (RNNs) had with learning long-term dependencies. We refer to this <a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">explanatory blog post</a> for a thorough introduction to LSTM networks; the figures presented in this numerical example are taken from that post.

An RNN consists of a chain of modules that repeat throughout the network. For the standard RNN this module is quite simple, as shown in the figure below.

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-SimpleRNN.png" style="width:50%;display: block;margin-left: auto;margin-right: auto;">

LSTMs, however, have a much more complex module, but the repetition of these modules stays the same. The figure below presents the different operations carried out within an LSTM module. The LSTM module consists of four neural network layers, and we will now explain these layers and the operations within them.

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png" style="width:50%;display: block;margin-left: auto;margin-right: auto;">

An LSTM network primarily uses two activation functions: the sigmoid function
$$
 \sigma(x) = \frac{1}{1+e^{-x}} = \frac{e^x}{e^x+1},
$$
and the tanh function
$$
 \mathrm{tanh}(x) = \frac{\mathrm{sinh}(x)}{\mathrm{cosh}(x)} = \frac{e^x - e^{-x}}{e^x + e^{-x}} = \frac{e^{2x} - 1}{e^{2x} + 1}.
$$
The sigmoid squeezes its input into the interval $(0, 1)$, and tanh squeezes its input into the interval $(-1, 1)$.
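
As a quick numerical check of the two activation functions, the following minimal sketch evaluates them with NumPy and compares the equivalent forms given above. The helper name `sigmoid` and the sample inputs are chosen here purely for illustration.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative sample inputs (an assumption, not taken from the text).
x = np.array([-5.0, 0.0, 5.0])

s = sigmoid(x)    # values strictly between 0 and 1
t = np.tanh(x)    # values strictly between -1 and 1

# The alternative forms of both functions agree numerically.
s_alt = np.exp(x) / (np.exp(x) + 1.0)
t_alt = (np.exp(2 * x) - 1.0) / (np.exp(2 * x) + 1.0)

print("sigmoid:", s)
print("tanh:   ", t)
print("forms agree:", np.allclose(s, s_alt) and np.allclose(t, t_alt))
```

Large negative inputs map close to the lower bound of each range and large positive inputs close to the upper bound, while an input of zero gives $0.5$ for the sigmoid and $0$ for tanh.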

## Cell state

The cell state is the top line running straight through the module, and it runs through the entire chain of modules. It has a multiplicative interaction with the forget gate layer and an additive interaction with the input gate and tanh layers. Through these interactions the LSTM can either remove information from the cell state or add information to it along the chain. A gate consists of a neural network layer with a sigmoid activation function, i.e. it outputs values between zero and one, where zero means letting no information through and one means letting all the information through.

## Forget Gate Layer

The forget gate layer is the first layer within the LSTM module (furthest to the left), and it is connected to the cell state through a pointwise multiplication operator. It is responsible for deciding what information from the previous module is kept by looking at the previous information $h_{t-1}$ and the new information $x_t$. It is important to note that the information in $h_{t-1}$ is not the same as the cell state information $C_{t-1}$; this will be further explained later. The output from the activation in this layer is given by
$$
f_t = \sigma (W_f [h_{t-1},x_t] + b_f),
$$
where $\sigma$ is the sigmoid function, $W_f$ are the weights applied to the combined vector of $h_{t-1}$ and $x_t$, and $b_f$ is the bias term. We have now decided which members of the cell state are kept, and will next look at the layers that add information to the cell state.
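
The forget gate expression can be sketched numerically as follows. This is only an illustration: the sizes `n_hidden` and `n_input`, the random weights, and the zero bias are assumptions made here, and $[h_{t-1}, x_t]$ is interpreted as the concatenation of the two vectors.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_hidden, n_input = 3, 2                    # illustrative sizes (assumed)

h_prev = rng.standard_normal(n_hidden)      # h_{t-1}
x_t = rng.standard_normal(n_input)          # x_t
W_f = rng.standard_normal((n_hidden, n_hidden + n_input))  # random stand-in weights
b_f = np.zeros(n_hidden)                    # bias, set to zero for simplicity

z = np.concatenate([h_prev, x_t])           # combined vector [h_{t-1}, x_t]
f_t = sigmoid(W_f @ z + b_f)                # one value in (0, 1) per cell state member

print(f_t)
```

Each entry of `f_t` scales the corresponding member of the previous cell state: values near one keep that member, values near zero discard it.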

## Input Gate Layer and Tanh Layer

The input gate layer is the second layer from the left in the diagram of the module. It decides which members of the cell state we will update with new information. This layer works in combination with a tanh layer that proposes candidate values, which the input gate then weights by importance. The outputs of these two layers are multiplied together pointwise, and the result is added to the cell state.
The output of the activation in the input gate is calculated by
$$
i_t = \sigma(W_i[h_{t-1},x_t] + b_i),
$$
where again $\sigma$ is the sigmoid activation function, $W_i$ are the weights of the input gate layer applied to the combined vector, and $b_i$ is the bias of the layer.
The output of the activation in the tanh layer is given by the expression
$$
\tilde{C}_t = \mathrm{tanh}(W_C[h_{t-1},x_t] + b_C),
$$
where $\tilde{C}_t$ are the candidate update values, $W_C$ are the weights applied to the combined vector, and $b_C$ is the bias of the layer.
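
The input gate and the tanh layer can be sketched together in NumPy. As before, the sizes, random weights, and zero biases are placeholder assumptions for illustration, and the name `update` for the gated candidate values is introduced here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
n_hidden, n_input = 3, 2                    # illustrative sizes (assumed)

h_prev = rng.standard_normal(n_hidden)      # h_{t-1}
x_t = rng.standard_normal(n_input)          # x_t
z = np.concatenate([h_prev, x_t])           # combined vector [h_{t-1}, x_t]

# Random stand-in weights and zero biases for the two layers.
W_i = rng.standard_normal((n_hidden, n_hidden + n_input))
b_i = np.zeros(n_hidden)
W_C = rng.standard_normal((n_hidden, n_hidden + n_input))
b_C = np.zeros(n_hidden)

i_t = sigmoid(W_i @ z + b_i)                # input gate, values in (0, 1)
C_tilde = np.tanh(W_C @ z + b_C)            # candidate values, in (-1, 1)
update = i_t * C_tilde                      # gated candidates (pointwise product)

print(i_t)
print(C_tilde)
print(update)
```

Because every entry of `i_t` lies between zero and one, the gated candidates in `update` are never larger in magnitude than the raw candidates in `C_tilde`.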

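The outputs of these two layers, $i_t$ and $\tilde{C}_t$, are multiplied together pointwise, and the result is added to the previous cell state, after it has been scaled by the forget gate, to give the new cell state $C_t$. The following expression describes the update of the cell state within a module, where $\times$ denotes pointwise multiplication: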
$$
 C_t = f_t \times C_{t-1} + i_t \times \tilde{C}_t.
$$
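
To make the cell state update concrete, here is a small worked example with hand-picked numbers. The gate values are assumptions chosen to show the three typical cases: keep and ignore the candidate, forget and replace, and a half-and-half blend.

```python
import numpy as np

C_prev = np.array([2.0, -3.0, 4.0])     # previous cell state C_{t-1}
f_t = np.array([1.0, 0.0, 0.5])         # forget gate: keep, forget, keep half
i_t = np.array([0.0, 1.0, 0.5])         # input gate: ignore, accept, accept half
C_tilde = np.array([0.9, -0.5, 0.2])    # candidate values from the tanh layer

# Pointwise update: C_t = f_t * C_{t-1} + i_t * C~_t
C_t = f_t * C_prev + i_t * C_tilde

print(C_t)  # expected values: [2.0, -0.5, 2.1]
```

The first member is carried over unchanged, the second is replaced entirely by its candidate value, and the third is a blend of half the old value and half the candidate.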

## Output Gate Layer

Each module produces an output, which is also passed on to the subsequent module; this output is calculated in the output gate layer. First, the combination of the new input $x_t$ and the previous output $h_{t-1}$ is passed through a sigmoid layer to decide which parts of the cell state will be part of the output. The result is then multiplied pointwise with the cell state $C_t$, after the cell state has first been sent through a tanh function to squeeze its values between $-1$ and $1$. The output gate operations can be summarized in the following expressions:

$$
 o_t = \sigma(W_o[h_{t-1},x_t] + b_o),
$$
where $o_t$ is the output of the sigmoid layer with corresponding weights $W_o$ and bias term $b_o$, and
$$
 h_t = o_t\times \mathrm{tanh}(C_t)
$$
defines the output of the module.
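
Putting the expressions of all four layers together, one pass through an LSTM module could be sketched as follows. The function name `lstm_step`, the `params` dictionary layout, the sizes, and the randomly initialised weights are assumptions made for this illustration; it is not a trained network or a library implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, params):
    """One forward pass through a single LSTM module (hypothetical helper)."""
    z = np.concatenate([h_prev, x_t])                       # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])        # forget gate
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])        # input gate
    C_tilde = np.tanh(params["W_C"] @ z + params["b_C"])    # candidate values
    C_t = f_t * C_prev + i_t * C_tilde                      # new cell state
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])        # output gate
    h_t = o_t * np.tanh(C_t)                                # module output
    return h_t, C_t

rng = np.random.default_rng(0)
n_hidden, n_input, n_steps = 4, 3, 5        # illustrative sizes (assumed)

# Small random weights and zero biases for the four layers.
params = {}
for name in ("f", "i", "C", "o"):
    params[f"W_{name}"] = 0.1 * rng.standard_normal((n_hidden, n_hidden + n_input))
    params[f"b_{name}"] = np.zeros(n_hidden)

# Run the same module over a short random input sequence.
h, C = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.standard_normal((n_steps, n_input)):
    h, C = lstm_step(x_t, h, C, params)

print("h_t:", h)
print("C_t:", C)
```

The loop reuses the same weights at every time step, which mirrors the repeated modules in the chain: only $h_t$ and $C_t$ are passed from one step to the next.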

