# Understanding the Weights in RNNs

## Instructions
0. If you haven't already, follow [the setup instructions here](https://jennselby.github.io/MachineLearningCourseNotes/#setting-up-python3) to get all necessary software installed.
0. Look at the code in [Part A: Single Unit Simple Recurrent Layer](#Part-A:-Single-Unit-Simple-Recurrent-Layer) and complete the [Part A Exercise](#Part-A-Exercise)
0. Look at the code in [Part B: Two Unit Simple Recurrent Layer](#Part-B:-Two-Unit-Simple-Recurrent-Layer) and complete the [Part B Exercise](#Part-B-Exercise)
0. Optionally, look at the code in [Part C: LSTM Layer](#Part-C:-LSTM-Layer) and complete the [Part C Exercise](#Part-C-Exercise)

## Documentation/Sources
* [Class Notes](https://jennselby.github.io/MachineLearningCourseNotes/#recurrent-neural-networks)
* [https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/](https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/) for information on sequence classification with keras
* [https://keras.io/](https://keras.io/) Keras API documentation
* [Keras recurrent tutorial](https://github.com/Vict0rSch/deep_learning/tree/master/keras/recurrent)

## Part A: Single Unit Simple Recurrent Layer

Before we dive into something as complicated as LSTMs, let's take a deeper look at simple recurrent layer weights.

In [1]:
import numpy
from keras.layers import SimpleRNN
from keras.models import Sequential
from keras.layers import LSTM



The neurons in the recurrent layer pass their output to the next layer, but also back to themselves. The input shape says that we'll be passing in sequences of unspecified length (the None is what makes it unspecified), with one value per time step.

In [2]:
one_unit_SRNN = Sequential()
one_unit_SRNN.add(SimpleRNN(units=1, input_shape=(None, 1), activation='linear', use_bias=False))

In [3]:
one_unit_SRNN_weights = one_unit_SRNN.get_weights()
one_unit_SRNN_weights

[array([[-0.07458556]], dtype=float32), array([[-1.]], dtype=float32)]

We can set the weights to whatever we want, to test out what happens with different weight values.

In [4]:
one_unit_SRNN_weights[0][0][0] = 1
one_unit_SRNN_weights[1][0][0] = 1
one_unit_SRNN.set_weights(one_unit_SRNN_weights)
one_unit_SRNN.get_weights()

[array([[1.]], dtype=float32), array([[1.]], dtype=float32)]

We can then pass in different input values, to see what the model outputs.

The code below passes in a single sample that has three time steps.

In [5]:
one_unit_SRNN.predict(numpy.array([ [[3], [3], [7]] ]))

array([[13.]], dtype=float32)
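The array passed to `predict` is nested three levels deep: samples, then time steps, then features per time step. A minimal sketch of that layout in plain Python, where `nested_shape` is a hypothetical helper written for this illustration and not part of Keras or numpy:

```python
def nested_shape(data):
    """Return the shape of a regularly nested, non-empty list (hypothetical helper)."""
    shape = []
    while isinstance(data, list):
        shape.append(len(data))
        data = data[0]
    return tuple(shape)

batch = [[[3], [3], [7]]]         # 1 sample, 3 time steps, 1 feature per step
longer = [[[3], [3], [7], [5]]]   # 1 sample, 4 time steps, 1 feature per step

print(nested_shape(batch))
print(nested_shape(longer))
```

Both layouts fit `input_shape=(None, 1)`, since only the number of features per time step is fixed.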

## Part A Exercise
Figure out what the two weights in the one_unit_SRNN model control. Be sure to test your hypothesis thoroughly. Use different weights and different inputs.

## Hypothesis #1:
First weight multiplies all the inputs. Second weight determines how the inputs are combined. 

Second weight:
- if 0: output is final input + 0*(sum of all other inputs)
- if 1: output is final input + 1*(sum of all other inputs)
- if 2: output is final input + 2*(sum of all other inputs)
- if 3: output is final input + 3*(sum of all other inputs)
- ...
- if n: output is final input + n*(sum of all other inputs)
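To make this hypothesis easy to test, it can be written as a small function and compared against what the model returns. This is only a sketch of the guess stated above, not of what the layer actually computes; the name `hypothesis_1` is made up for this illustration.

```python
def hypothesis_1(inputs, w_in, w_rec):
    """Prediction under Hypothesis #1: final input + w_rec * (sum of all other inputs),
    with every input first multiplied by w_in."""
    scaled = [w_in * x for x in inputs]
    return scaled[-1] + w_rec * sum(scaled[:-1])

print(hypothesis_1([3, 3, 7], 1, 1))  # 13
print(hypothesis_1([1, 1, 2], 1, 3))  # 8
```

Any disagreement between these predictions and `one_unit_SRNN.predict` on the same inputs and weights counts against the hypothesis.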

In [57]:
one_unit_SRNN_weights[0][0][0] = 1
one_unit_SRNN_weights[1][0][0] = 3
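# Observed outputs for input [3, 3, 7] as the second weight varies,
# next to what the hypothesis predicts:
# 5 --> 97 (hypothesis: 7 + 5*(3+3) = 37)
# 4 --> 67 (hypothesis: 7 + 4*(3+3) = 31)
# 3 --> 43 (hypothesis: 7 + 3*(3+3) = 25)
# 2 --> 25 (hypothesis: 7 + 2*(3+3) = 19)
# 1 --> 13 (hypothesis: 7 + 1*(3+3) = 13)
# 0 --> 7  (hypothesis: 7 + 0*(3+3) = 7)
one_unit_SRNN.set_weights(one_unit_SRNN_weights)
one_unit_SRNN.predict(numpy.array([ [[3], [3], [7]] ]))

# The hypothesis matches the observed output only when the second weight is 0 or 1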

array([[43.]], dtype=float32)

In [51]:
one_unit_SRNN_weights[0][0][0] = 1
one_unit_SRNN_weights[1][0][0] = 3

one_unit_SRNN.set_weights(one_unit_SRNN_weights)
one_unit_SRNN.predict(numpy.array([ [[1], [1], [2]] ]))
# Following the hypothesis, the expected output is 2 + 3*(1+1) = 8, but the actual output is 14, so the hypothesis is wrong.

array([[14.]], dtype=float32)

## Hypothesis #2:
- The first weight multiplies the input at time step n,
- the second weight multiplies the output of time step n-1,
- and the sum of these two products is the output of the node at time step n.

In [58]:
one_unit_SRNN_weights[0][0][0] = 1
one_unit_SRNN_weights[1][0][0] = 0.5
one_unit_SRNN.set_weights(one_unit_SRNN_weights)
one_unit_SRNN.predict(numpy.array([ [[3], [3], [7]] ]))

# (0.5*((0.5*(1*3))+(1*3)))+(1*7) = 9.25
# Expected output!

array([[9.25]], dtype=float32)
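Hypothesis #2 can be written as a short recurrence and checked against every output recorded in Part A. This is a plain-Python sketch, assuming a linear activation, no bias, and an initial output of 0 before the first time step; `simple_rnn_1unit` is a name made up for this illustration.

```python
def simple_rnn_1unit(inputs, w_in, w_rec):
    """Single-unit recurrence: h_n = w_in * x_n + w_rec * h_(n-1), starting from h = 0."""
    h = 0.0
    for x in inputs:
        h = w_in * x + w_rec * h
    return h

print(simple_rnn_1unit([3, 3, 7], 1, 1))    # 13.0
print(simple_rnn_1unit([3, 3, 7], 1, 3))    # 43.0
print(simple_rnn_1unit([1, 1, 2], 1, 3))    # 14.0
print(simple_rnn_1unit([3, 3, 7], 1, 0.5))  # 9.25
```

All four values agree with the model outputs shown above, including the `[1, 1, 2]` case that Hypothesis #1 could not explain.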

## Part B: Two Unit Simple Recurrent Layer

In [59]:
two_unit_SRNN = Sequential()
two_unit_SRNN.add(SimpleRNN(units=2, input_shape=(None, 1), activation='linear', use_bias=False))

In [60]:
two_unit_SRNN_weights = two_unit_SRNN.get_weights()
two_unit_SRNN_weights

[array([[0.06964445, 0.3499601 ]], dtype=float32),
 array([[ 0.98779154, -0.15578215],
        [ 0.15578215,  0.9877914 ]], dtype=float32)]

In [61]:
two_unit_SRNN_weights[0][0][0] = 1
two_unit_SRNN_weights[0][0][1] = 1
two_unit_SRNN_weights[1][0][0] = 0
two_unit_SRNN_weights[1][0][1] = 1
two_unit_SRNN_weights[1][1][0] = 0
two_unit_SRNN_weights[1][1][1] = 1
two_unit_SRNN.set_weights(two_unit_SRNN_weights)
two_unit_SRNN.get_weights()

[array([[1., 1.]], dtype=float32),
 array([[0., 1.],
        [0., 1.]], dtype=float32)]

This passes in a single sample with four time steps.

In [9]:
two_unit_SRNN.predict(numpy.array([ [[3], [3], [7], [5]] ]))

array([[ 5., 31.]], dtype=float32)

## Part B Exercise
What does each of the six weights of the two_unit_SRNN control? Again, test out your hypotheses carefully.

## Hypothesis:
There are 2 weights that multiply the regular inputs (one for each node), and 4 weights that multiply the outputs fed back in. 
The output of each node gets fed back into all of the nodes, so each of the 2 nodes has 2 feedback weights, for a total of 6 weights in a layer of 2 nodes.

My comments in the code below label what each weight controls.

For a layer of n nodes with a one-dimensional input and no bias, there will be n + n^2 weights.
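The count can be sketched as a small function. The `input_dim` and `use_bias` parameters generalize beyond the setup in this notebook (one input feature, `use_bias=False`); the function name is hypothetical.

```python
def simple_rnn_weight_count(units, input_dim=1, use_bias=False):
    """Number of weights in a simple recurrent layer."""
    kernel = input_dim * units        # one weight per (input feature, node) pair
    recurrent = units * units         # one weight per (node, node) feedback pair
    bias = units if use_bias else 0   # one bias per node, if enabled
    return kernel + recurrent + bias

print(simple_rnn_weight_count(1))  # 2
print(simple_rnn_weight_count(2))  # 6
```

These match the 2 weights seen in Part A and the 6 weights seen in Part B.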

In [66]:
two_unit_SRNN_weights[0][0][0] = 1 # multiplies the regular input of node #1
two_unit_SRNN_weights[0][0][1] = 0 # multiplies the regular input of node #2
two_unit_SRNN_weights[1][0][0] = 0 # multiplies the feedback connection of node 1 --> 1
two_unit_SRNN_weights[1][0][1] = 0 # multiplies the feedback connection of node 1 --> 2
two_unit_SRNN_weights[1][1][0] = 0 # multiplies the feedback connection of node 2 --> 1
two_unit_SRNN_weights[1][1][1] = 0 # multiplies the feedback connection of node 2 --> 2
two_unit_SRNN.set_weights(two_unit_SRNN_weights)

two_unit_SRNN.predict(numpy.array([ [[3], [3], [7], [5]] ]))
# Expected output [5, 0]!
# It is hard to thoroughly demonstrate that this works, 
# as there are so many different combinations needed to sufficiently prove this hypothesis

array([[5., 0.]], dtype=float32)

In [67]:
two_unit_SRNN_weights[0][0][0] = 1 # multiplies the regular input of node #1
two_unit_SRNN_weights[0][0][1] = 1 # multiplies the regular input of node #2
two_unit_SRNN_weights[1][0][0] = 1 # multiplies the feedback connection of node 1 --> 1
two_unit_SRNN_weights[1][0][1] = 1 # multiplies the feedback connection of node 1 --> 2
two_unit_SRNN_weights[1][1][0] = 1 # multiplies the feedback connection of node 2 --> 1
two_unit_SRNN_weights[1][1][1] = 1 # multiplies the feedback connection of node 2 --> 2
two_unit_SRNN.set_weights(two_unit_SRNN_weights)

two_unit_SRNN.predict(numpy.array([ [[3], [4]] ]))

# Expected output [(1*(1*3))+(1*(1*3))+(1*4), (1*(1*3))+(1*(1*3))+(1*4)] = [10, 10]
# Just another example to convince you.

array([[10., 10.]], dtype=float32)
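The Part B hypothesis can be checked with a plain-Python version of the layer, assuming a linear activation, no bias, a one-dimensional input, and a starting state of all zeros. The function name `simple_rnn` is made up for this sketch. `recurrent[i][j]` is taken to be the weight on the feedback connection from node i to node j, matching the comments in the cells above.

```python
def simple_rnn(inputs, kernel, recurrent):
    """Return the final output of each node.

    inputs:    list of scalar inputs, one per time step
    kernel:    kernel[j] multiplies the current input for node j
    recurrent: recurrent[i][j] multiplies node i's previous output on its way to node j
    """
    n = len(kernel)
    h = [0.0] * n
    for x in inputs:
        h = [kernel[j] * x + sum(h[i] * recurrent[i][j] for i in range(n))
             for j in range(n)]
    return h

print(simple_rnn([3, 3, 7, 5], [1, 1], [[0, 1], [0, 1]]))  # [5.0, 31.0]
print(simple_rnn([3, 3, 7, 5], [1, 0], [[0, 0], [0, 0]]))  # [5.0, 0.0]
print(simple_rnn([3, 4], [1, 1], [[1, 1], [1, 1]]))        # [10.0, 10.0]
```

The three results agree with the three `two_unit_SRNN.predict` outputs recorded in Part B.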

## Part C: LSTM Layer
### Optional

In [163]:
one_unit_LSTM = Sequential()
one_unit_LSTM.add(LSTM(units=1, input_shape=(None, 1),
                       activation='linear', recurrent_activation='linear',
                       use_bias=False, unit_forget_bias=False,
                       kernel_initializer='zeros',
                       recurrent_initializer='zeros',
                       return_sequences=True))

In [164]:
one_unit_LSTM_weights = one_unit_LSTM.get_weights()
one_unit_LSTM_weights

[array([[0., 0., 0., 0.]], dtype=float32),
 array([[0., 0., 0., 0.]], dtype=float32)]

In [165]:
one_unit_LSTM_weights[0][0][0] = 1
one_unit_LSTM_weights[0][0][1] = 0
one_unit_LSTM_weights[0][0][2] = 1
one_unit_LSTM_weights[0][0][3] = 1
one_unit_LSTM_weights[1][0][0] = 0
one_unit_LSTM_weights[1][0][1] = 0
one_unit_LSTM_weights[1][0][2] = 0
one_unit_LSTM_weights[1][0][3] = 0
one_unit_LSTM.set_weights(one_unit_LSTM_weights)
one_unit_LSTM.get_weights()

[array([[1., 0., 1., 1.]], dtype=float32),
 array([[0., 0., 0., 0.]], dtype=float32)]

In [166]:
one_unit_LSTM.predict(numpy.array([ [[0], [1], [2], [4]] ]))

array([[[ 0.],
        [ 1.],
        [ 8.],
        [64.]]], dtype=float32)

## Part C Exercise
### Optional
Conceptually, the [LSTM](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) has several _gates_:

* __Forget gate__: these weights allow some long-term memories to be forgotten.
* __Input gate__: these weights decide what new information will be added to the context cell.
* __Output gate__: these weights decide what pieces of the new information and updated context will be passed on to the output.

It also has a __cell__ that can hold onto information from the current input (as well as things it has remembered from previous inputs), so that it can be used in later outputs.

Identify which weights in the one_unit_LSTM model are connected with the context and which are associated with the three gates. This is considerably more difficult to do by looking at the inputs and outputs, so you could also treat this as a code reading exercise and look through the keras code to find the answer.

_Note_: The output from the predict call is what the linked explanation calls $h_{t}$.

In [171]:
# code from keras https://github.com/keras-team/keras/blob/bd968bf156b4346ac58e679ccd92f02796294885/keras/layers/recurrent.py#L2385

def _compute_carry_and_output(self, x, h_tm1, c_tm1):
    """Computes carry and output using split kernels."""
    x_i, x_f, x_c, x_o = x
    h_tm1_i, h_tm1_f, h_tm1_c, h_tm1_o = h_tm1
    i = self.recurrent_activation(
        x_i + backend.dot(h_tm1_i, self.recurrent_kernel[:, :self.units]))
    f = self.recurrent_activation(x_f + backend.dot(
        h_tm1_f, self.recurrent_kernel[:, self.units:self.units * 2]))
    c = f * c_tm1 + i * self.activation(x_c + backend.dot(
        h_tm1_c, self.recurrent_kernel[:, self.units * 2:self.units * 3]))
    o = self.recurrent_activation(
        x_o + backend.dot(h_tm1_o, self.recurrent_kernel[:, self.units * 3:]))
    return c, o

def call(self, inputs, states, training=None):
    h_tm1 = states[0]  # previous output
    c_tm1 = states[1]  # previous cell state

    dp_mask = self.get_dropout_mask_for_cell(inputs, training, count=4)
    rec_dp_mask = self.get_recurrent_dropout_mask_for_cell(
        h_tm1, training, count=4)

    if self.implementation == 1:
      if 0 < self.dropout < 1.:
        inputs_i = inputs * dp_mask[0]
        inputs_f = inputs * dp_mask[1]
        inputs_c = inputs * dp_mask[2]
        inputs_o = inputs * dp_mask[3]
      else:
        inputs_i = inputs
        inputs_f = inputs
        inputs_c = inputs
        inputs_o = inputs
      k_i, k_f, k_c, k_o = tf.split(
          self.kernel, num_or_size_splits=4, axis=1)
      
      x_i = backend.dot(inputs_i, k_i)
      x_f = backend.dot(inputs_f, k_f)
      x_c = backend.dot(inputs_c, k_c)
      x_o = backend.dot(inputs_o, k_o)
      if self.use_bias:
        b_i, b_f, b_c, b_o = tf.split(
            self.bias, num_or_size_splits=4, axis=0)
        x_i = backend.bias_add(x_i, b_i)
        x_f = backend.bias_add(x_f, b_f)
        x_c = backend.bias_add(x_c, b_c)
        x_o = backend.bias_add(x_o, b_o)

      if 0 < self.recurrent_dropout < 1.:
        h_tm1_i = h_tm1 * rec_dp_mask[0]
        h_tm1_f = h_tm1 * rec_dp_mask[1]
        h_tm1_c = h_tm1 * rec_dp_mask[2]
        h_tm1_o = h_tm1 * rec_dp_mask[3]
      else:
        h_tm1_i = h_tm1
        h_tm1_f = h_tm1
        h_tm1_c = h_tm1
        h_tm1_o = h_tm1
      x = (x_i, x_f, x_c, x_o)
      h_tm1 = (h_tm1_i, h_tm1_f, h_tm1_c, h_tm1_o)
      c, o = self._compute_carry_and_output(x, h_tm1, c_tm1)
    else:
      if 0. < self.dropout < 1.:
        inputs = inputs * dp_mask[0]
      z = backend.dot(inputs, self.kernel)
      z += backend.dot(h_tm1, self.recurrent_kernel)
      if self.use_bias:
        z = backend.bias_add(z, self.bias)

      z = tf.split(z, num_or_size_splits=4, axis=1)
      c, o = self._compute_carry_and_output_fused(z, c_tm1)

    h = o * self.activation(c)
    return h, [h, c]
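With no bias and no dropout, as in this notebook, the excerpt above reduces to the equations below. The symbols are notation chosen for this summary rather than names from the Keras source: $W_i, W_f, W_c, W_o$ are the four column blocks of `kernel`, $U_i, U_f, U_c, U_o$ are the four column blocks of `recurrent_kernel`, $\sigma_r$ is `recurrent_activation`, and $g$ is `activation`.

$$
\begin{aligned}
i_t &= \sigma_r(x_t W_i + h_{t-1} U_i) \\
f_t &= \sigma_r(x_t W_f + h_{t-1} U_f) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g(x_t W_c + h_{t-1} U_c) \\
o_t &= \sigma_r(x_t W_o + h_{t-1} U_o) \\
h_t &= o_t \odot g(c_t)
\end{aligned}
$$

Since `one_unit_LSTM` sets both activations to `'linear'`, $\sigma_r$ and $g$ are the identity here.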

In [181]:
# https://github.com/keras-team/keras/blob/bd968bf156b4346ac58e679ccd92f02796294885/keras/layers/recurrent.py#L2385

# weights for input of time step 
one_unit_LSTM_weights[0][0][0] = 1 # input
one_unit_LSTM_weights[0][0][1] = 0 # forget
one_unit_LSTM_weights[0][0][2] = 1 # context
one_unit_LSTM_weights[0][0][3] = 1 # output

# weights from the feedback from the previous time step
one_unit_LSTM_weights[1][0][0] = 0 # input
one_unit_LSTM_weights[1][0][1] = 0 # forget 
one_unit_LSTM_weights[1][0][2] = 0 # context 
one_unit_LSTM_weights[1][0][3] = 0 # output
one_unit_LSTM.set_weights(one_unit_LSTM_weights)
one_unit_LSTM.get_weights()

[array([[1., 0., 1., 1.]], dtype=float32),
 array([[0., 0., 0., 0.]], dtype=float32)]

In [182]:
one_unit_LSTM.predict(numpy.array([ [[0], [1], [2], [4], [8]] ]))

array([[[  0.],
        [  1.],
        [  8.],
        [ 64.],
        [512.]]], dtype=float32)

In [178]:
one_unit_LSTM_weights[0][0][0] = 0 # input
one_unit_LSTM_weights[0][0][1] = 0 # forget
one_unit_LSTM_weights[0][0][2] = 0 # context
one_unit_LSTM_weights[0][0][3] = 0 # output

one_unit_LSTM_weights[1][0][0] = 1 # input
one_unit_LSTM_weights[1][0][1] = 0 # forget 
one_unit_LSTM_weights[1][0][2] = 1 # context 
one_unit_LSTM_weights[1][0][3] = 1 # output
one_unit_LSTM.set_weights(one_unit_LSTM_weights)
one_unit_LSTM.predict(numpy.array([ [[0], [1], [2], [4], [8]] ]))

array([[[0.],
        [0.],
        [0.],
        [0.],
        [0.]]], dtype=float32)

## Logic for the above order of weights:
- The last index indicates whether it is an input, forget, context, or output weight.
    - [][][0] --> input
    - [][][1] --> forget
    - [][][2] --> context
    - [][][3] --> output
    - This was determined by looking at patterns in the code, where it always goes i, f, c, and o.
        - Additionally, [0][0][0], [0][0][2], and [0][0][3] all have to be nonzero for the prediction to be nonzero (the forget weight [0][0][1] was 0 in the nonzero predictions above), which backs this up.

- The first index indicates whether it is a weight for the input of the time step or a weight for the feedback from the previous time step.
    - [0][][] --> input of time step
    - [1][][] --> output of previous time step
    - This was determined because the prediction is always 0 when the first 4 weights are set to 0.
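The index mapping above can be checked with a scalar sketch of the one-unit LSTM. It assumes linear activations, no bias, no dropout, and zero initial output and cell state, as configured for `one_unit_LSTM`. The function name `lstm_1unit_linear` is made up for this illustration, and the weight order follows the i, f, c, o convention from the Keras excerpt.

```python
def lstm_1unit_linear(inputs, kernel, recurrent):
    """Return the output h at every time step.

    kernel:    4 weights on the current input, ordered (input, forget, context, output)
    recurrent: 4 weights on the previous output, in the same order
    """
    k_i, k_f, k_c, k_o = kernel
    r_i, r_f, r_c, r_o = recurrent
    h, c = 0.0, 0.0
    outputs = []
    for x in inputs:
        i = k_i * x + r_i * h                 # input gate
        f = k_f * x + r_f * h                 # forget gate
        o = k_o * x + r_o * h                 # output gate
        c = f * c + i * (k_c * x + r_c * h)   # updated cell (context) state
        h = o * c                             # output at this time step
        outputs.append(h)
    return outputs

print(lstm_1unit_linear([0, 1, 2, 4, 8], [1, 0, 1, 1], [0, 0, 0, 0]))
# [0.0, 1.0, 8.0, 64.0, 512.0]
print(lstm_1unit_linear([0, 1, 2, 4, 8], [0, 0, 0, 0], [1, 0, 1, 1]))
# [0.0, 0.0, 0.0, 0.0, 0.0]
```

With the first set of weights, each output is the cube of the input, because the input, context, and output paths each contribute one factor of x. With only recurrent weights set, the output never leaves 0, because the previous output starts at 0 and nothing else feeds the gates.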



