# Assignment 2: Long Short-Term Memory

*Author:* Thomas Adler

*Copyright statement:* This material, no matter whether in printed or electronic form, may be used for personal and non-commercial educational use only. Any reproduction of this manuscript, no matter whether as a whole or in parts, no matter whether in printed or in electronic form, requires explicit prior acceptance of the authors.

## Exercise 1: Analysis of the Jordan network

In the previous assignment we found that the Jordan network is incapable of memorizing information over longer time spans. Now we want to delve deeper into why this is the case and how to solve this problem. To learn long-term dependencies, it is necessary that the gradient carries the error signal backwards in time. For instance, if the output at time $t$ depends on the input at time $s$ with $s < t$, then the Jacobian 
$$
J(t, s) = \frac{\partial a(t)}{\partial a(s)}
$$
is responsible for carrying the error signals from time $t$ backwards to time $s$. Calculate this Jacobian for the Jordan network and elaborate on the numerical stability of $J(t, s)$ for long time spans, i.e., when $t-s$ becomes large. Why is the Jordan network incapable of learning long-term dependencies?
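
As a numerical illustration of the stability question, the following sketch multiplies one-step Jacobians and tracks the spectral norm of the product. It does not replace the derivation asked for above. The Jordan network is defined in the previous assignment and not restated here, so the recurrence $a(t) = \tanh(W x(t) + R a(t-1))$, the sizes, the weight range, and the helper name `jacobian_norms` are all assumptions made for illustration.

```python
import numpy as np

def jacobian_norms(R, W, xs, a0):
    """Spectral norms of J(t, 0) for t = 1..T under the ASSUMED recurrence
    a(t) = tanh(W x(t) + R a(t-1))."""
    a = a0
    J = np.eye(a0.shape[0])              # J(0, 0) is the identity
    norms = []
    for x in xs:
        a = np.tanh(W @ x + R @ a)
        # One-step Jacobian: diag(1 - a(t)^2) R, since tanh' = 1 - tanh^2.
        J = (np.diag(1.0 - a ** 2) @ R) @ J
        norms.append(np.linalg.norm(J, 2))
    return np.array(norms)

rng = np.random.default_rng(0)
n_units, n_in, T = 8, 3, 50              # hypothetical sizes
# Entries in [-0.1, 0.1] give a Frobenius norm of at most 0.8 for an 8x8
# matrix, so every one-step Jacobian has spectral norm below 1.
R = rng.uniform(-0.1, 0.1, size=(n_units, n_units))
W = rng.uniform(-0.1, 0.1, size=(n_units, n_in))
xs = rng.normal(size=(T, n_in))

norms = jacobian_norms(R, W, xs, np.zeros(n_units))
print(norms[[0, 9, 49]])                 # norm after 1, 10, and 50 steps
```

With these small weights the norm shrinks at least geometrically in $t - s$. Scaling `R` up so that its norm exceeds 1 lets one explore the opposite regime.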

########## YOUR SOLUTION HERE ##########

The following figure depicts the LSTM architecture (without forget gate, which was introduced later). 

<img src="lstm_noFG.png" alt="LSTM" width="600"/>

We have
$$
a(t) = \varphi(W_a x(t) + R_a h(t-1) + b_a),
$$
where $a \in \{z, i, o\}$ and $\varphi$ is either sigmoid (for $i, o$) or tanh (for $z$). We alter the notation of the figure in that we write $h(t)$ instead of $y(t)$, which lets us use the latter for the output variable as we are used to. The LSTM forward rule is 
$$
\begin{aligned}
c(t) &= c(t-1) + z(t) \odot i(t) \\
h(t) &= \tanh(c(t)) \odot o(t).
\end{aligned}
$$
To obtain predictions $\hat y(t)$ we use an output layer
$$
\hat y(t) = \sigma(W_y h(t) + b_y).
$$
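
The forward rule above can be sketched in NumPy as a single time step. The function names `lstm_step` and `output_layer`, the `params` dictionary layout, and the locally defined `sigmoid` (standing in for `scipy.special.expit`) are assumptions for illustration, not the structure the exercises prescribe.

```python
import numpy as np

def sigmoid(s):
    """Logistic function; stands in for scipy.special.expit."""
    return 1.0 / (1.0 + np.exp(-s))

def lstm_step(x, h_prev, c_prev, params):
    """One step of the LSTM without forget gate.
    params maps each of 'z', 'i', 'o' to a tuple (W, R, b) (assumed layout)."""
    def layer(name, phi):
        W, R, b = params[name]
        return phi(W @ x + R @ h_prev + b)

    z = layer("z", np.tanh)      # cell input
    i = layer("i", sigmoid)      # input gate
    o = layer("o", sigmoid)      # output gate
    c = c_prev + z * i           # c(t) = c(t-1) + z(t) * i(t), elementwise
    h = np.tanh(c) * o           # h(t) = tanh(c(t)) * o(t), elementwise
    return h, c

def output_layer(h, W_y, b_y):
    """Prediction y_hat(t) = sigmoid(W_y h(t) + b_y)."""
    return sigmoid(W_y @ h + b_y)

# Usage with hypothetical sizes: 3 inputs, 2 hidden units, 1 output.
rng = np.random.default_rng(0)
params = {k: (rng.uniform(-0.01, 0.01, (2, 3)),
              rng.uniform(-0.01, 0.01, (2, 2)),
              rng.uniform(-0.01, 0.01, 2)) for k in "zio"}
h, c = lstm_step(np.ones(3), np.zeros(2), np.zeros(2), params)
y_hat = output_layer(h, rng.uniform(-0.01, 0.01, (1, 2)), np.zeros(1))
```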

## Exercise 2: Forward pass of the gates

Consider the layer 
$$
a = \varphi(W x + R h + b).
$$
The modules $a \in \{z, i, o\}$ are called cell input, input gate, and output gate, respectively. The cell input uses $\varphi = \tanh$ whereas input gate and output gate use $\varphi = \sigma$. Implement the forward pass of the class `Gate` by implementing the methods `__init__` and `forward`. The method `__init__` should initialize the parameters uniformly in $[-0.01, 0.01]$. The method `forward` should implement the forward logic of the module. The activation function $\varphi$ should be exchangeable. 
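
One possible shape for such a layer is sketched below. The class name `GateSketch`, the constructor arguments, and the optional `rng` parameter are hypothetical and chosen so as not to clash with the `Gate` class requested here; treat it as a starting point under these assumptions rather than the expected solution.

```python
import numpy as np

class GateSketch:
    """Hypothetical stand-in for `Gate`, forward pass only."""

    def __init__(self, n_in, n_hidden, phi, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        self.phi = phi  # exchangeable activation, e.g. np.tanh or a sigmoid
        # Parameters drawn uniformly from [-0.01, 0.01].
        self.W = rng.uniform(-0.01, 0.01, size=(n_hidden, n_in))
        self.R = rng.uniform(-0.01, 0.01, size=(n_hidden, n_hidden))
        self.b = rng.uniform(-0.01, 0.01, size=n_hidden)

    def forward(self, x, h):
        # Keep inputs and output; a backward pass would need them later.
        self.x, self.h = x, h
        self.a = self.phi(self.W @ x + self.R @ h + self.b)
        return self.a

# Usage: a cell-input layer with 3 inputs and 4 hidden units.
gate = GateSketch(3, 4, np.tanh, rng=np.random.default_rng(0))
a = gate.forward(np.ones(3), np.zeros(4))
```

Passing the activation as a callable keeps $\varphi$ exchangeable without subclassing.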

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from scipy.special import expit as sigmoid

class Gate(object):
    pass

########## YOUR SOLUTION HERE ##########

## Exercise 3: Gradients for the gates

To train a gate module, we need the gradients of the loss with respect to the parameters $W, R, b$. Further, if there are layers below that need training, then we also need the gradients with respect to $x$ and $h$. Given the gradient $\nabla_a L$, derive expressions for the gradients w.r.t. $W, R, b, x, h$. 

########## YOUR SOLUTION HERE ##########

## Exercise 4: Backward pass of the gates

Implement the backward pass of the class `Gate` by implementing the methods `zero_grad`, `backward`, and `update`. The method `zero_grad` should initialize/overwrite the gradients w.r.t. the parameters to zero. The method `backward` should take $\nabla_a L$ as argument and return $\nabla_x L$ and $\nabla_h L$. Moreover, it should add $\nabla_W L, \nabla_R L, \nabla_b L$ to the gradient buffers. The method `update` should perform a gradient-descent step to update the parameters using a learning rate $\eta$. 
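
A minimal sketch of the backward logic for one layer $a = \varphi(W x + R h + b)$ is given below as a standalone function. The names `gate_backward`, `tanh_prime_from_output`, and `sigmoid_prime_from_output`, as well as the choice to return a dictionary, are assumptions for illustration; the exercise asks for methods on `Gate` that add into gradient buffers instead.

```python
import numpy as np

def tanh_prime_from_output(a):
    """tanh'(s) written in terms of the output a = tanh(s)."""
    return 1.0 - a ** 2

def sigmoid_prime_from_output(a):
    """sigmoid'(s) written in terms of the output a = sigmoid(s)."""
    return a * (1.0 - a)

def gate_backward(W, R, x, h, a, grad_a, dphi):
    """Gradients of the loss for a = phi(W x + R h + b), given grad_a = dL/da.
    dphi maps the stored output a to phi'(s)."""
    delta = grad_a * dphi(a)            # dL/ds, with s the pre-activation
    return {
        "W": np.outer(delta, x),        # dL/dW
        "R": np.outer(delta, h),        # dL/dR
        "b": delta,                     # dL/db
        "x": W.T @ delta,               # dL/dx, passed to layers below
        "h": R.T @ delta,               # dL/dh, passed back in time
    }

# Usage with hypothetical values: accumulate into a buffer, then take a step.
rng = np.random.default_rng(0)
W, R, b = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=4)
x, h = rng.normal(size=3), rng.normal(size=4)
a = np.tanh(W @ x + R @ h + b)
grads = gate_backward(W, R, x, h, a, np.ones(4), tanh_prime_from_output)
grad_W_buffer = np.zeros_like(W)
grad_W_buffer += grads["W"]             # buffers sum contributions over time
eta = 0.1                               # hypothetical learning rate
W_new = W - eta * grad_W_buffer         # plain gradient-descent step
```

Comparing a few entries against central finite differences is a cheap way to validate an implementation like this.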

In [None]:
########## YOUR SOLUTION HERE ##########

## Exercise 5: Forward pass of LSTM

Implement the forward pass of the class `LSTM` using the `Gate` module. Again, it should have the methods `__init__` and `forward` using the same initialization scheme as before. The method `forward` should evaluate and return $L(\hat y(T), y(T))$ where $L$ is the binary cross-entropy loss function and $T$ is the index of the last sequence element. Make sure to store all activations which are needed for the backward pass. 
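
For reference, the binary cross-entropy on the last prediction can be sketched as follows. The function names, the clipping constant `eps`, and the summation over output units are assumptions; with a single output unit the sum reduces to one term.

```python
import numpy as np

def binary_cross_entropy(y_hat, y, eps=1e-12):
    """L(y_hat, y) = -sum(y log y_hat + (1 - y) log(1 - y_hat)).
    Predictions are clipped to avoid log(0); eps is an arbitrary choice."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return float(-np.sum(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat)))

def bce_grad_preactivation(y_hat, y):
    """For y_hat = sigmoid(s), the gradient dL/ds simplifies to y_hat - y."""
    return y_hat - y

# Usage: an undecided prediction for a positive target.
loss = binary_cross_entropy(np.array([0.5]), np.array([1.0]))
delta_y = bce_grad_preactivation(np.array([0.5]), np.array([1.0]))
```

Storing `delta_y` together with the activations gives the backward pass its starting point at time $T$.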

In [None]:
class LSTM(object):
    pass

########## YOUR SOLUTION HERE ##########

## Exercise 6: Backward pass of LSTM with BPTT

Realize the backward pass of LSTM by implementing the methods `zero_grad`, `backward`, and `update`. Equations for the gradients can be found in the lecture script section 3.1 "Backpropagation for LSTM". 

In [None]:
########## YOUR SOLUTION HERE ##########

## Exercise 7: LSTM training

Train an LSTM with 32 hidden units on the task from assignment 1 with a sequence length of 100. Tune the number of update steps and the learning rate. After training, evaluate the model on 1000 sequences. 

In [None]:
########## YOUR SOLUTION HERE ##########

## Exercise 8: The LSTM learning method and RTRL

Analyzing full BPTT for LSTM, we find that the recurrent connections through the gates cause most of the intricacy of the backward logic, while the simple inner recurrent connections are responsible for carrying the error signals over long time spans. The *LSTM learning method* truncates the gradient of these outer recurrent connections. In other words, the gates treat the activations $h(t)$ as if they were external inputs and disregard their dependence on the model. This simplifies the gradients and makes RTRL feasible. 

Since $h(t)$ are treated as external inputs, the key to RTRL is the recursion
$$
\frac{\partial c(t)}{\partial \theta} = \sum_{s=1}^t \frac{\partial c(t)}{\partial \theta(s)} = \frac{\partial c(t)}{\partial \theta(t)} + \frac{\partial c(t)}{\partial c(t-1)} \frac{\partial c(t-1)}{\partial \theta},
$$
where $\theta$ is the parameter vector that contains all the model parameters in one large vector. The parameters are shared in time and $\theta(t)$ denotes their usage at time $t$. The above recursion lets us collect the part of the gradient that depends on the past during the forward pass. Due to the recurrent weights, the size of $\theta$ is $O(I^2)$, where $I$ denotes the number of units, and therefore $\frac{\partial c(t)}{\partial \theta}$ has $O(I^3)$ entries. The matrix product on the right-hand side raises the computational complexity of RTRL to $O(I^4)$, which is the reason why RTRL is infeasible for most recurrent architectures. 

Let, e.g., $w_{jk}^i$ denote the element in the $j$-th row and $k$-th column of the matrix $W$ belonging to the input gate $i$. Show that $\frac{\partial c(t)}{\partial \theta}$ for the LSTM learning method has the form 
$$
\begin{aligned}
\frac{\partial c_n(t)}{\partial w_{jk}^i} &= z_n(t) i_n(t)(1-i_n(t)) x_k(t) [n=j], \\
\frac{\partial c_n(t)}{\partial r_{jk}^i} &= z_n(t) i_n(t)(1-i_n(t)) h_k(t-1) [n=j], \\
\frac{\partial c_n(t)}{\partial b_j^i} &= z_n(t) i_n(t)(1-i_n(t)) [n=j], \\
\frac{\partial c_n(t)}{\partial w_{jk}^z} &= i_n(t) (1-z_n(t)^2) x_k(t) [n=j], \\
\frac{\partial c_n(t)}{\partial r_{jk}^z} &= i_n(t) (1-z_n(t)^2) h_k(t-1) [n=j], \\
\frac{\partial c_n(t)}{\partial b_j^z} &= i_n(t) (1-z_n(t)^2) [n=j],
\end{aligned}
$$
where $[n=j]$ is the Iverson bracket that evaluates to 1 if the expression inside is true and to 0 otherwise. What is the complexity of RTRL for the LSTM learning method?
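
The sketch below shows how such terms could be collected during the forward pass. It rests on two assumptions that go beyond the text: the right-hand sides above are read as the per-step contributions $\frac{\partial c(t)}{\partial \theta(t)}$, and $\frac{\partial c(t)}{\partial c(t-1)}$ is taken to be the identity, which follows from $c(t) = c(t-1) + z(t) \odot i(t)$ once $h(t-1)$ is treated as an external input. The function names and the buffer layout are hypothetical.

```python
import numpy as np

def init_rtrl_buffers(n_units, n_in):
    """Zero buffers. Entry [j, k] stands for dc_j/dw_jk; entries with n != j
    vanish because of the Iverson bracket and are therefore not stored."""
    shapes = {"W": (n_units, n_in), "R": (n_units, n_units), "b": (n_units,)}
    return {p + g: np.zeros(s) for g in "iz" for p, s in shapes.items()}

def rtrl_accumulate(buf, x, h_prev, z, i):
    """Add the contribution of one time step to every buffer (in place)."""
    d_i = z * i * (1.0 - i)       # factor shared by W^i, R^i, b^i
    d_z = i * (1.0 - z ** 2)      # factor shared by W^z, R^z, b^z
    buf["Wi"] += np.outer(d_i, x)
    buf["Ri"] += np.outer(d_i, h_prev)
    buf["bi"] += d_i
    buf["Wz"] += np.outer(d_z, x)
    buf["Rz"] += np.outer(d_z, h_prev)
    buf["bz"] += d_z
    return buf

# Usage with hypothetical activations for 3 units and 2 inputs.
x = np.array([1.0, 2.0])
h_prev = np.array([0.5, -0.5, 0.0])
z = np.array([0.1, -0.2, 0.3])
i = np.array([0.6, 0.5, 0.4])
buf = init_rtrl_buffers(3, 2)
rtrl_accumulate(buf, x, h_prev, z, i)
```

Given $\nabla_{c} L$ at the time of the loss, the chain rule then yields, for example, `grad_c[:, None] * buf["Wi"]` as the gradient with respect to the input-gate matrix $W$, where `grad_c` is again a hypothetical name.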

########## YOUR SOLUTION HERE ##########

## Exercise 9: Prepare LSTM for RTRL

Write a class `LSTM_RTRL` and implement the methods `__init__`, `zero_grad`, `update`. Do not use the `Gate` class this time but implement the gates directly so they can be trained via RTRL. Make sure to initialize all weights and gradient buffers accordingly. 

In [None]:
########## YOUR SOLUTION HERE ##########

## Exercise 10: Implement RTRL for the LSTM learning method

Add a method `forward` to the class `LSTM_RTRL` that processes one time step of an input sequence and updates the RTRL gradient buffers using the LSTM learning method. 

In [None]:
########## YOUR SOLUTION HERE ##########

## Exercise 11: LSTM training with RTRL

Again, train the LSTM on the task from assignment 1 using real-time recurrent learning in combination with the LSTM learning method. Start with a sequence length of 1. What is the maximum sequence length for which the LSTM can learn the task (at reasonable computational cost)? Compare the training behavior to that with BPTT. Explain possible differences.  

In [None]:
########## YOUR SOLUTION HERE ##########