In [1]:
import lstm

We will start by defining a simple wrapper class that contains the entire parametrization of the LSTM network.
This is implemented in `lstm.Parameters`.

In [2]:
%psource lstm.Parameters

class Parameters:
    def __init__(self, event_size: int, hidden_size: int):
        # Weights and biases for the sigmoid "f-function"
        self.event_forget_weights = np.zeros((hidden_size, event_size))
        self.event_forget_bias = 0
        self.hidden_forget_weights = np.zeros((hidden_size, hidden_size))
        self.hidden_forget_bias

We will now instantiate this class.
For our dummy example, we will set both the event (input) size and the hidden size to 3.

In [3]:
EVENT_SIZE, HIDDEN_SIZE = 3, 3
parameters = lstm.Parameters(event_size=EVENT_SIZE, hidden_size=HIDDEN_SIZE)
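As a quick sanity check on the shapes involved, the sketch below builds stand-in weight matrices with plain NumPy instead of going through `lstm.Parameters`. The names `event_weights` and `hidden_weights` are hypothetical, and the sizes are deliberately chosen to differ (unlike the 3 and 3 used here) so that the two shapes are easy to tell apart.

```python
import numpy as np

# Hypothetical sizes, chosen to differ so the shapes are easy to tell apart.
event_size, hidden_size = 4, 3

# Stand-ins for the two kinds of weight matrices held by the parameter object.
event_weights = np.zeros((hidden_size, event_size))
hidden_weights = np.zeros((hidden_size, hidden_size))

event = np.ones(event_size)
hidden_state = np.ones(hidden_size)

# Both products are vectors of length hidden_size, so they can be summed.
event_term = event_weights @ event
hidden_term = hidden_weights @ hidden_state
print(event_term.shape, hidden_term.shape)
```

Event weights map from the event dimension to the hidden dimension, while hidden weights are square, which is why the two terms can be added inside each gate.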

The forget gate is implemented as follows:

In [4]:
%psource lstm.forget_gate

def forget_gate(event, hidden_state, prev_cell_state, parameters):
    """Forget gate deciding how much of the previous cell state to keep."""
    forget_hidden = (
        parameters.hidden_forget_weights @ hidden_state
        + parameters.hidden_forget_bias
    )
    forget_event = (
        parameters.event_forget_weights @ event
        + parameters.event_forget_bias
    )
    # Values b

Let's assume that we have an existing hidden state of $\vec{h} = [0, 0, 10]$, a previous cell state of $\vec{C} = [1, 1, 1]$, and a new event $\vec{x} = [10, 0, 0]$.
What happens in the forget gate with the current parametrization?

In [5]:
import numpy as np

event = np.array([10, 0, 0])
hidden_state = np.array([0, 0, 10])
prev_cell_state = np.array([1, 1, 1])
lstm.forget_gate(event=event, hidden_state=hidden_state, parameters=parameters, prev_cell_state=prev_cell_state)

array([0.5, 0.5, 0.5])

Since all weights and biases are zero, the argument of the sigmoid is $0$, and $\mathrm{sigmoid}(0) = 0.5$.
The forget gate therefore returns $[0.5, 0.5, 0.5]$: half of each component of the previous cell state $[1, 1, 1]$ is kept.
Let's change the weight matrices to identity matrices:

In [6]:
parameters.event_forget_weights = np.eye(HIDDEN_SIZE, EVENT_SIZE)
parameters.hidden_forget_weights = np.eye(HIDDEN_SIZE, HIDDEN_SIZE)
lstm.forget_gate(event=event, hidden_state=hidden_state, parameters=parameters, prev_cell_state=prev_cell_state)

array([0.9999546, 0.5      , 0.9999546])

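With both weight matrices set to the identity, the previous hidden state and the new event both influence the result: the first component is driven by the event ($x_1 = 10$) and the third by the hidden state ($h_3 = 10$), while the second stays at $0.5$.
We can also let only the new event influence the cell state and ignore the hidden state completely: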

In [7]:
parameters.event_forget_weights = np.eye(HIDDEN_SIZE, EVENT_SIZE)
parameters.hidden_forget_weights = np.zeros((HIDDEN_SIZE, HIDDEN_SIZE))
lstm.forget_gate(event=event, hidden_state=hidden_state, parameters=parameters, prev_cell_state=prev_cell_state)

array([0.9999546, 0.5      , 0.5      ])

Compared with the all-zero case, the first component has changed from $0.5$ to $\approx 1$, while the third component is still $0.5$, as expected, because the hidden state no longer contributes.
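The three outputs above can be reproduced without the `lstm` module. The sketch below is not the module's implementation, since the listing of `lstm.forget_gate` is cut off. It assumes that the gate applies a sigmoid to the sum of the two linear terms and multiplies the result element-wise with the previous cell state, and it takes both biases to be zero. The names `sigmoid` and `forget_gate_sketch` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate_sketch(event, hidden_state, prev_cell_state,
                       event_weights, hidden_weights):
    # Assumed form: sigmoid of the summed linear terms (biases taken as zero),
    # multiplied element-wise with the previous cell state.
    gate = sigmoid(hidden_weights @ hidden_state + event_weights @ event)
    return gate * prev_cell_state

event = np.array([10.0, 0.0, 0.0])
hidden_state = np.array([0.0, 0.0, 10.0])
prev_cell_state = np.ones(3)
zeros, identity = np.zeros((3, 3)), np.eye(3)

# Same three weight configurations as in the cells above.
all_zero = forget_gate_sketch(event, hidden_state, prev_cell_state, zeros, zeros)
both_identity = forget_gate_sketch(event, hidden_state, prev_cell_state, identity, identity)
event_only = forget_gate_sketch(event, hidden_state, prev_cell_state, identity, zeros)
print(all_zero, both_identity, event_only)
```

Under these assumptions the sketch yields the same values as the three calls to `lstm.forget_gate`, with $\mathrm{sigmoid}(10) \approx 0.9999546$ wherever an input of $10$ passes through an identity weight matrix.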

---

The input gate is implemented as follows:

In [8]:
%psource lstm.input_gate

def input_gate(event, hidden_state, parameters):
    """Input gate deciding how to update the cell state."""
    # We have certain candidates from the new event and the hidden state
    # we would like to update the cell state with
    hidden_candidates = (
        parameters.hidden_candidate_weights @ hidden_state
        + parameters.hidden_candidate_bias
    )
    event_candidates = (
        parameters.event_candidate_weights @ event

We first evaluate this gate with the default zero weights.

In [9]:
lstm.input_gate(event=event, hidden_state=hidden_state, parameters=parameters)

array([0., 0., 0.])

This time we get $\vec{0}$ as the output from the input gate, due to the $\tanh$ activation function being zero for zero input.
Let's see what happens if we set the hidden-state weight matrices (both the candidate and the update weights) to the identity matrix, while leaving the event weights at zero.

In [10]:
parameters.hidden_candidate_weights = np.eye(HIDDEN_SIZE, HIDDEN_SIZE)
parameters.hidden_update_weights = np.eye(HIDDEN_SIZE, HIDDEN_SIZE)
lstm.input_gate(event=event, hidden_state=hidden_state, parameters=parameters)

array([0.       , 0.       , 0.9999546])

The resulting update weighs the hidden state and ignores the new event data completely, as expected.
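The two input-gate results can be checked by hand in the same way. The listing of `lstm.input_gate` is cut off, so the sketch below assumes the usual LSTM form: a $\tanh$ over the candidate terms, scaled element-wise by a sigmoid over the update terms, with all biases and event weights set to zero. The name `input_gate_sketch` is hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def input_gate_sketch(hidden_state, candidate_weights, update_weights):
    # Assumed form, hidden-state terms only (event weights and biases are zero):
    # tanh proposes candidate values, sigmoid decides how much of each to let in.
    candidates = np.tanh(candidate_weights @ hidden_state)
    update = sigmoid(update_weights @ hidden_state)
    return update * candidates

hidden_state = np.array([0.0, 0.0, 10.0])
zeros, identity = np.zeros((3, 3)), np.eye(3)

zero_weights = input_gate_sketch(hidden_state, zeros, zeros)
identity_weights = input_gate_sketch(hidden_state, identity, identity)
print(zero_weights, identity_weights)
```

With zero weights the candidates are $\tanh(0) = 0$, so the product vanishes regardless of the sigmoid factor. With identity weights only the third component is non-zero, at roughly $\mathrm{sigmoid}(10) \cdot \tanh(10) \approx 0.9999546$.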
The new cell state is the sum of the forget gate output and the input gate output, as follows:

In [11]:
%psource lstm.cell_state

def cell_state(forget_gate_output, input_gate_output):
    """
    New cell state, a combination of the partially forgotten cell state
    and the newly proposed state.
    """
    return forget_gate_output + input_gate_output


---

The output of the LSTM cell is a combination of the new event, the hidden state, and the cell state.
It is implemented as follows:

In [13]:
%psource lstm.output_gate

def output_gate(event, hidden_state, cell_state, parameters):
    """Decide what to output from the LSTM cell."""
    hidden_output = (
        parameters.hidden_output_weights @ hidden_state
        + parameters.hidden_output_bias
    )
    event_output = (
        parameters.event_output_weights @ event
        + parameters.event_output_bias
    )
    return (

For the given LSTM network we have implemented, the event input size is $x \times 1$, and the hidden state is of size $h \times 1$.
The output of the LSTM cell becomes $h \times 1$, which is not necessarily optimal for a given problem.
The output of the LSTM cell is therefore often passed to a final, fully connected layer whose output has more suitable dimensions.
We have opted not to implement this last part, since it is not a concept specific to LSTM cells.
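For illustration only, such a final layer could look roughly like the sketch below. It is not part of the `lstm` module. The names `dense_weights`, `dense_bias`, and `output_size` are hypothetical, the weights are random and untrained, and a vector of ones stands in for an actual $h \times 1$ cell output.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# output_size is a hypothetical target dimension for the problem at hand.
hidden_size, output_size = 3, 2

# Hypothetical, untrained parameters of the final fully connected layer.
dense_weights = rng.normal(size=(output_size, hidden_size))
dense_bias = np.zeros(output_size)

# Stand-in for the h x 1 output of the LSTM cell.
lstm_output = np.ones(hidden_size)

# A single affine map changes the dimension from hidden_size to output_size.
prediction = dense_weights @ lstm_output + dense_bias
print(prediction.shape)
```

The layer is an ordinary affine map, so it adds no LSTM-specific machinery; it only reshapes the cell output to whatever the task needs.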