# Long Short-Term Memory (LSTM)

LSTM was first introduced by <a href="https://www.bioinf.jku.at/publications/older/2604.pdf">Hochreiter & Schmidhuber (1997)</a> to address the difficulty that earlier recurrent neural networks (RNNs) had with long-term dependencies. We refer to this <a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">explanatory blog post</a> for a thorough introduction to LSTM networks; the figures in this numerical example are taken from that post.

An RNN consists of a chain of repeating modules. For a standard RNN this module is simple, as shown in the figure below.

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-SimpleRNN.png" style="width:50%;display: block;margin-left: auto;margin-right: auto;">

LSTMs, however, have a much more complex module, although the repetition of these modules stays the same. The figure below presents the different operations carried out within an LSTM module. The module consists of four neural network layers, and we will now explain these layers and the operations within them.

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png" style="width:50%;display: block;margin-left: auto;margin-right: auto;">

LSTM networks primarily use two activation functions. The first is the sigmoid function,
$$
 \sigma(x) = \frac{1}{1+e^{-x}} = \frac{e^x}{e^x+1},
$$

which is implemented in the `lstm` module:

In [1]:
import lstm
%psource lstm.sigmoid

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1 / (1 + np.exp(-x))


The second is the tanh function,
$$
 \mathrm{tanh}(x) = \frac{\mathrm{sinh}(x)}{\mathrm{cosh}(x)} = \frac{e^x - e^{-x}}{e^x + e^{-x}} = \frac{e^{2x} - 1}{e^{2x} + 1}.
$$

The implementation is as follows:

In [2]:
%psource lstm.tanh

def tanh(x: np.ndarray) -> np.ndarray:
    numerator = np.exp(x) - np.exp(-x)
    denominator = np.exp(x) + np.exp(-x)
    return numerator / denominator


The sigmoid squeezes values into the interval between zero and one, while tanh squeezes values into the interval between minus one and one.
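As a quick check of these ranges, the following sketch re-defines the two functions exactly as listed above and evaluates them on a few sample points. The sample values and variable names are chosen purely for illustration.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1 / (1 + np.exp(-x))


def tanh(x: np.ndarray) -> np.ndarray:
    numerator = np.exp(x) - np.exp(-x)
    denominator = np.exp(x) + np.exp(-x)
    return numerator / denominator


x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
sigmoid_values = sigmoid(x)
tanh_values = tanh(x)

# Sigmoid stays inside (0, 1) and equals 0.5 at zero;
# tanh stays inside (-1, 1) and equals 0 at zero.
print(sigmoid_values)
print(tanh_values)

# The hand-written tanh agrees with NumPy's built-in version,
# and with the identity tanh(x) = 2 * sigmoid(2x) - 1.
print(np.allclose(tanh_values, np.tanh(x)))
print(np.allclose(tanh_values, 2 * sigmoid(2 * x) - 1))
```

The last identity shows that tanh is a shifted and rescaled sigmoid, which is why the two functions have the same shape but different output ranges.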

## Cell state

The cell state is the top line running straight through the module, and it runs through the entire chain of modules. It has a multiplicative interaction with the forget gate layer and an additive interaction with the input gate and tanh layers. These interactions give the LSTM the option of either removing or adding information along the chain. A gate consists of a neural network layer with a sigmoid activation function, i.e. it outputs values between zero and one, where zero means letting no information through and one means letting all of the information through.

To implement an LSTM cell in NumPy, we start by creating a `Parameters` class containing the entire parametrization of an LSTM network.

In [3]:
%psource lstm.Parameters

class Parameters:
    def __init__(self, event_size: int, hidden_size: int):
        # Weights and biases for the sigmoid "f-function"
        self.event_forget_weights = np.zeros((hidden_size, event_size))
        self.event_forget_bias = 0
        self.hidden_forget_weights = np.zeros((hidden_size, hidden_size))
        self.hidden_forget_bias

We will now instantiate this class.
For our dummy example, we will set the input size to 3 and the hidden size to 3 as well.

In [4]:
EVENT_SIZE, HIDDEN_SIZE = 3, 3
parameters = lstm.Parameters(event_size=EVENT_SIZE, hidden_size=HIDDEN_SIZE)
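
As a rough illustration of what a parametrization covering all four layers could look like, the following sketch defines a hypothetical `SketchParameters` class. It is not the actual `lstm.Parameters`: attribute names such as `event_update_weights`, `hidden_update_bias`, and `event_candidate_bias` are assumptions extrapolated from the attribute names used elsewhere in this text, and the sizes in the usage lines are arbitrary.

```python
import numpy as np


class SketchParameters:
    """Hypothetical parametrization; an assumption, not the real lstm.Parameters."""

    def __init__(self, event_size: int, hidden_size: int):
        # One event/hidden pair of weights and biases per layer:
        # forget gate, input ("update") gate, tanh candidate layer, output gate.
        for layer in ("forget", "update", "candidate", "output"):
            setattr(self, f"event_{layer}_weights", np.zeros((hidden_size, event_size)))
            setattr(self, f"event_{layer}_bias", 0)
            setattr(self, f"hidden_{layer}_weights", np.zeros((hidden_size, hidden_size)))
            setattr(self, f"hidden_{layer}_bias", 0)


sketch = SketchParameters(event_size=4, hidden_size=3)
print(sketch.event_forget_weights.shape)   # weights applied to the event
print(sketch.hidden_output_weights.shape)  # weights applied to the hidden state
```

Using different event and hidden sizes here makes the two kinds of shape visible: weights applied to the event have shape `(hidden_size, event_size)`, while weights applied to the hidden state are square.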

## Forget Gate Layer

The forget gate layer is the first layer within the LSTM module (furthest to the left), and it is connected to the cell state by a pointwise multiplication operator. It is responsible for deciding what information from the previous module is kept, by looking at the previous output $h_{t-1}$ and the new input $x_t$. Note that the information in $h_{t-1}$ is not the same as the cell state information $C_{t-1}$; this will be explained further later. The output of the activation in this layer is given by
$$
f_t = \sigma (W_f [h_{t-1},x_t] + b_f),
$$
where $\sigma$ is the sigmoid function, $W_f$ are the weights applied to the vector of combined features $h_{t-1}$ and $x_t$, and $b_f$ is the bias term. This decides which members of the cell state are kept; the layers that add information to the cell state are described in the next section.
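
The formula applies a single weight matrix $W_f$ to the concatenated vector $[h_{t-1}, x_t]$, whereas the `lstm` module keeps separate weights and biases for the hidden state and for the event. The two forms are equivalent, as the following sketch illustrates with random values. The names `W_h`, `W_x`, `b_h`, and `b_x` are illustrative only and do not come from the `lstm` module.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


rng = np.random.default_rng(seed=0)
hidden_size, event_size = 3, 4

# Separate weights and biases for the hidden state and the event
W_h = rng.normal(size=(hidden_size, hidden_size))
W_x = rng.normal(size=(hidden_size, event_size))
b_h = rng.normal(size=hidden_size)
b_x = rng.normal(size=hidden_size)

h_prev = rng.normal(size=hidden_size)
x_t = rng.normal(size=event_size)

# Split form: two matrix products, two biases
f_split = sigmoid(W_h @ h_prev + b_h + W_x @ x_t + b_x)

# Concatenated form: W_f = [W_h, W_x] side by side and b_f = b_h + b_x
W_f = np.hstack([W_h, W_x])
b_f = b_h + b_x
f_concat = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)

print(np.allclose(f_split, f_concat))
```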

The forget gate is implemented as follows:

In [5]:
%psource lstm.forget_gate

def forget_gate(event, hidden_state, prev_cell_state, parameters):
    """Forget gate deciding how much of the previous cell state to keep."""
    forget_hidden = (
        parameters.hidden_forget_weights @ hidden_state
        + parameters.hidden_forget_bias
    )
    forget_event = (
        parameters.event_forget_weights @ event
        + parameters.event_forget_bias
    )
    # Values b

Let's assume that we have an existing hidden state of $\vec{h} = [0, 0, 10]$, a previous cell state of $\vec{C} = [1, 1, 1]$, and a new event $\vec{x} = [10, 0, 0]$.
What happens in the forget gate with the current parametrization?

In [6]:
import numpy as np

event = np.array([10, 0, 0])
hidden_state = np.array([0, 0, 10])
prev_cell_state = np.array([1, 1, 1])
lstm.forget_gate(event=event, hidden_state=hidden_state, parameters=parameters, prev_cell_state=prev_cell_state)

array([0.5, 0.5, 0.5])

Since all weights and biases are zero, the gate evaluates to $\mathrm{sigmoid}(0) = 0.5$ in every component, so the partially forgotten cell state becomes $[0.5, 0.5, 0.5]$.
Let's change the weight matrices to identity matrices:

In [7]:
parameters.event_forget_weights = np.eye(HIDDEN_SIZE, EVENT_SIZE)
parameters.hidden_forget_weights = np.eye(HIDDEN_SIZE, HIDDEN_SIZE)
lstm.forget_gate(event=event, hidden_state=hidden_state, parameters=parameters, prev_cell_state=prev_cell_state)

array([0.9999546, 0.5      , 0.9999546])

With both weight matrices set to the identity, the previous hidden state and the new event both influence the result: the first and third components are now close to $1$, because $\mathrm{sigmoid}(10) \approx 0.99995$.
We can also let only the new event influence the cell state and ignore the hidden state completely:

In [8]:
parameters.event_forget_weights = np.eye(HIDDEN_SIZE, EVENT_SIZE)
parameters.hidden_forget_weights = np.zeros((HIDDEN_SIZE, HIDDEN_SIZE))
lstm.forget_gate(event=event, hidden_state=hidden_state, parameters=parameters, prev_cell_state=prev_cell_state)

array([0.9999546, 0.5      , 0.5      ])

Compared with the all-zero parametrization, the first component has changed from $0.5$ to $\approx 1$, while the third component is still $0.5$, as expected, since the hidden state is ignored.
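
The three results above can be reproduced with a minimal stand-alone sketch. It assumes that the forget gate applies a sigmoid to the summed hidden and event contributions and multiplies the result elementwise with the previous cell state, which is consistent with the outputs shown. `forget_gate_sketch` and its argument names are hypothetical and are not the `lstm` module's API.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def forget_gate_sketch(event, hidden_state, prev_cell_state,
                       event_weights, hidden_weights,
                       event_bias=0.0, hidden_bias=0.0):
    """Assumed behaviour: sigmoid gate scaled elementwise by the previous cell state."""
    gate = sigmoid(
        hidden_weights @ hidden_state + hidden_bias
        + event_weights @ event + event_bias
    )
    return gate * prev_cell_state


event = np.array([10, 0, 0])
hidden_state = np.array([0, 0, 10])
prev_cell_state = np.array([1, 1, 1])
zeros, identity = np.zeros((3, 3)), np.eye(3)

# All-zero weights: every gate value is sigmoid(0) = 0.5
all_zero = forget_gate_sketch(event, hidden_state, prev_cell_state, zeros, zeros)
# Identity weights for both inputs: components 1 and 3 see an input of 10
both_identity = forget_gate_sketch(event, hidden_state, prev_cell_state, identity, identity)
# Identity for the event only: the hidden state is ignored
event_only = forget_gate_sketch(event, hidden_state, prev_cell_state, identity, zeros)

print(all_zero, both_identity, event_only)
```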

## Input Gate Layer and Tanh Layer

The input gate layer is the second layer from the left in the diagram of the module. It decides which members of the cell state will be updated with new information. This layer works in combination with a tanh layer that proposes candidate values, which the input gate weights by importance. The outputs of the two layers are multiplied together pointwise, and the product is added to the cell state.
The output of the activation in the input gate is calculated by
$$
i_t = \sigma(W_i[h_{t-1},x_t] + b_i),
$$
where again $\sigma$ is the sigmoid activation function, $W_i$ are the weights of the input gate layer applied to the combined vector, and $b_i$ is the bias of the layer.
The output of the activation in the tanh layer is given by the expression
$$
\tilde{C}_t = \mathrm{tanh}(W_C[h_{t-1},x_t] + b_C),
$$
where $\tilde{C}_t$ is the candidate update values, $W_C$ the weights applied to the vector and $b_C$ the bias in the layer.

The outputs of these two layers, $i_t$ and $\tilde{C}_t$, are multiplied together pointwise and added to the partially forgotten previous cell state $f_t \times C_{t-1}$ to give the new cell state $C_t$. The update of the cell state within a module is therefore
$$
 C_t = f_t \times C_{t-1} + i_t \times \tilde{C}_t.
$$
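
To make the update rule concrete, the sketch below evaluates it for hand-picked values. The numbers are purely illustrative and are not taken from the `lstm` module; `\times` in the formula corresponds to elementwise multiplication.

```python
import numpy as np

# Illustrative gate values: keep everything, keep half, keep nothing
f_t = np.array([1.0, 0.5, 0.0])
C_prev = np.array([2.0, 2.0, 2.0])

# Illustrative input gate values and candidate values
i_t = np.array([0.0, 0.5, 1.0])
C_tilde = np.array([-1.0, 1.0, 0.5])

# C_t = f_t * C_{t-1} + i_t * C~_t, all operations elementwise
C_t = f_t * C_prev + i_t * C_tilde
print(C_t)  # components: 2.0, 1.5, 0.5
```

The first component keeps its old value untouched, the third is replaced entirely by the candidate, and the second is a mixture of the two.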

The input gate is implemented as follows:

In [9]:
%psource lstm.input_gate

def input_gate(event, hidden_state, parameters):
    """Input gate deciding how to update the cell state."""
    # We have certain candidates from the new event and the hidden state
    # we would like to update the cell state with
    hidden_candidates = (
        parameters.hidden_candidate_weights @ hidden_state
        + parameters.hidden_candidate_bias
    )
    event_candidates = (
        parameters.event_candidate_weights @ event

We try to evaluate this gate with the default $0$ weights.

In [10]:
lstm.input_gate(event=event, hidden_state=hidden_state, parameters=parameters)

array([0., 0., 0.])

This time we get $\vec{0}$ as the output from the input gate, due to the $\tanh$ activation function being zero for zero input.
Let's see what happens if we set the weight matrices for the hidden state to the identity matrix, while leaving the event weights at zero.

In [11]:
parameters.hidden_candidate_weights = np.eye(HIDDEN_SIZE, HIDDEN_SIZE)
parameters.hidden_update_weights = np.eye(HIDDEN_SIZE, HIDDEN_SIZE)
lstm.input_gate(event=event, hidden_state=hidden_state, parameters=parameters)

array([0.       , 0.       , 0.9999546])

The resulting update weighs the hidden state and ignores the new event data completely, as expected.
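
The value $0.9999546$ in the third component can be traced by hand. The sketch below assumes that the input gate output is the elementwise product $i_t \times \tilde{C}_t$ with zero biases, which is consistent with the result shown; the variable names are illustrative and not part of the `lstm` module.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


hidden_state = np.array([0, 0, 10])
event = np.array([10, 0, 0])

# Assumed setup mirroring the cell above: identity weights on the hidden
# state, zero weights on the event, and zero biases everywhere.
hidden_weights = np.eye(3)
event_weights = np.zeros((3, 3))
pre_activation = hidden_weights @ hidden_state + event_weights @ event

i_t = sigmoid(pre_activation)      # about [0.5, 0.5, 0.99995]
C_tilde = np.tanh(pre_activation)  # about [0, 0, 1]
update = i_t * C_tilde

print(update)
```

The first two components vanish because the candidate value $\mathrm{tanh}(0)$ is zero, even though the gate itself is half open there.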
The new cell state is the sum of the forget gate output and the input gate output, as follows:

In [12]:
%psource lstm.cell_state

def cell_state(forget_gate_output, input_gate_output):
    """
    New cell state, a combination of the partially forgotten cell state
    and the newly proposed state.
    """
    return forget_gate_output + input_gate_output


## Output Gate Layer

Each module produces an output, which is also passed on to the subsequent module; this output is calculated in the output gate layer. First, the combination of the new input $x_t$ and the previous output $h_{t-1}$ is passed through a sigmoid layer to decide which parts of the cell state will be part of the output. The result is then multiplied pointwise with the cell state $C_t$, which is first sent through a tanh function to squeeze its values between $-1$ and $1$. The output gate operations can be summarized in the following expressions:

$$
 o_t = \sigma(W_o[h_{t-1},x_t] + b_o),
$$
where $o_t$ is the output of the sigmoid layer with corresponding weights $W_o$ and bias term $b_o$, and
$$
 h_t = o_t\times \mathrm{tanh}(C_t)
$$
defines the output of the module.
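
Putting the equations of all four layers together, a single LSTM step can be sketched as below. This is a minimal illustration written in the concatenated form of the formulas, not the `lstm` module's implementation; the function `lstm_step` and the dictionary keys `'f'`, `'i'`, `'C'`, and `'o'` are assumptions made for this example.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def lstm_step(x_t, h_prev, C_prev, weights, biases):
    """One LSTM step; `weights` and `biases` are dicts keyed by 'f', 'i', 'C', 'o'."""
    z = np.concatenate([h_prev, x_t])                  # [h_{t-1}, x_t]
    f_t = sigmoid(weights["f"] @ z + biases["f"])      # forget gate
    i_t = sigmoid(weights["i"] @ z + biases["i"])      # input gate
    C_tilde = np.tanh(weights["C"] @ z + biases["C"])  # candidate values
    C_t = f_t * C_prev + i_t * C_tilde                 # new cell state
    o_t = sigmoid(weights["o"] @ z + biases["o"])      # output gate
    h_t = o_t * np.tanh(C_t)                           # module output
    return h_t, C_t


hidden_size, event_size = 3, 3
weights = {k: np.zeros((hidden_size, hidden_size + event_size)) for k in "fiCo"}
biases = {k: np.zeros(hidden_size) for k in "fiCo"}

h_t, C_t = lstm_step(
    x_t=np.array([10.0, 0.0, 0.0]),
    h_prev=np.array([0.0, 0.0, 10.0]),
    C_prev=np.array([1.0, 1.0, 1.0]),
    weights=weights,
    biases=biases,
)
print(C_t)
print(h_t)
```

With all weights and biases at zero, every gate is $0.5$ and every candidate value is $0$, so the cell state is halved to $0.5$ and each output component is $0.5 \times \mathrm{tanh}(0.5) \approx 0.231$.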

The output of the LSTM cell is a combination of the new event, the hidden state, and the cell state.
It is implemented as follows:

In [13]:
%psource lstm.output_gate

def output_gate(event, hidden_state, cell_state, parameters):
    """Decide what to output from the LSTM cell."""
    hidden_output = (
        parameters.hidden_output_weights @ hidden_state
        + parameters.hidden_output_bias
    )
    event_output = (
        parameters.event_output_weights @ event
        + parameters.event_output_bias
    )
    return (

For the LSTM network we have implemented, the event input has size $x \times 1$ and the hidden state has size $h \times 1$.
The output of the LSTM cell then has size $h \times 1$, which is not necessarily the shape required for a given problem.
The output of the LSTM cell is therefore often passed to a final, fully connected layer whose output has the required dimensions.
We have opted not to implement this last part, since it is not a concept specific to LSTM cells.
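
For completeness, such a final layer amounts to one more matrix product. The sketch below maps a hidden output of size 3 to a single value. The names `W_y`, `b_y`, and `y_t`, the sizes, and the random weights are all assumptions for illustration; in practice the weights would be learned.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
hidden_size, output_size = 3, 1

# Example LSTM output h_t (values chosen arbitrarily)
h_t = np.array([0.2, -0.1, 0.7])

# Fully connected layer mapping hidden_size values to output_size values
W_y = rng.normal(size=(output_size, hidden_size))
b_y = np.zeros(output_size)

y_t = W_y @ h_t + b_y
print(y_t.shape)
```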