In this notebook we'll describe the basics of the Element RNN library, which offers implementations and variations of LSTM and GRU recurrent networks. You should check out the documentation and examples at https://github.com/Element-Research/rnn

## Review
Recall that a recurrent neural network maps a sequence of vectors $\mathbf{x}_1, \ldots, \mathbf{x}_n$ to a sequence of vectors $\mathbf{s}_1, \ldots, \mathbf{s}_n$ using the recurrence

\begin{align*}
\mathbf{s}_{i} =  R(\mathbf{x}_{i}, \mathbf{s}_{i-1}; \mathbf{\theta}),
\end{align*}

where $R$ is a function parameterized by $\mathbf{\theta}$, and we define $\mathbf{s}_0$ as some initial vector (such as a vector of all zeros).
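
To make the recurrence concrete, here is a minimal sketch in plain Torch, assuming a simple Elman-style choice $R(\mathbf{x}, \mathbf{s}; \mathbf{\theta}) = \tanh(\mathbf{W}\mathbf{x} + \mathbf{U}\mathbf{s} + \mathbf{b})$. The function `R`, the table `theta`, and the sizes are hypothetical and chosen only for illustration; they are not part of the Element RNN library.

```lua
require 'torch'

-- Hypothetical Elman-style R: s_i = tanh(W x_i + U s_{i-1} + b)
local function R(x, sPrev, theta)
    return torch.tanh(theta.W * x + theta.U * sPrev + theta.b)
end

torch.manualSeed(1)
local theta = {W = torch.randn(5, 5), U = torch.randn(5, 5), b = torch.zeros(5)}
local xs = torch.randn(3, 5) -- x_1, x_2, x_3, each in R^5
local s = torch.zeros(5)     -- s_0, a vector of all zeros

local states = {}
for i = 1, xs:size(1) do
    s = R(xs[i], s, theta)   -- s_i depends on x_i and s_{i-1}
    states[i] = s
end

print(states[#states])       -- s_n, the only state an acceptor uses
```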

### N.B. For the first part of this notebook we will deal only with "acceptor" RNNs
* That is, we'll assume we only use the final state $\mathbf{s}_n$ for making predictions


## Element RNN Basics
At the heart of the Element RNN library is the abstract class 'AbstractRecurrent', which is designed to allow calling :forward() on each element $\mathbf{x}_i$ of a sequence (in turn), with the abstract class keeping track of the $\mathbf{s}_i$ for you. Consider the following example.

In [1]:
require 'rnn' -- this imports 'nn' as well, and adds 'rnn' objects to the 'nn' namespace

lstm = nn.LSTM(5, 5) -- inherits from AbstractRecurrent
data = torch.randn(3, 5) -- a sequence of 3 random vectors
outputs = torch.zeros(3, 5)

for i = 1, data:size(1) do
    outputs[i] = lstm:forward(data[i]) -- note that we don't need to keep track of the s_i
end

print(outputs)

 0.0354 -0.2311 -0.0782 -0.0994  0.1091
-0.0087 -0.1867 -0.1211 -0.2268  0.2223
 0.0590  0.1845 -0.0216 -0.2605  0.1665
[torch.DoubleTensor of size 3x5]



Above, the lstm is able to compute :forward() at each step by keeping track of its states internally. As we discussed in lecture, in an LSTM the states $\mathbf{s}_i$ comprise the 'hidden state' $\mathbf{h}_i$ as well as the 'cell' $\mathbf{c}_i$. In the RNN package's terminology, the $\mathbf{h}_i$ are known as 'outputs', and the $\mathbf{c}_i$ as 'cells', and these are stored internally in the nn.LSTM object, as tables:

In [2]:
print(lstm.outputs)
print(lstm.cells)

{
  1 : DoubleTensor - size: 5
  2 : DoubleTensor - size: 5
  3 : DoubleTensor - size: 5
}
{
  1 : DoubleTensor - size: 5
  2 : DoubleTensor - size: 5
  3 : DoubleTensor - size: 5
}
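
For reference, one standard LSTM formulation shows why both $\mathbf{h}_i$ and $\mathbf{c}_i$ have to be carried from step to step. This is a sketch of the textbook cell without peephole connections, and the exact parameterization used by nn.LSTM may differ. The gate names $\mathbf{a}^{\text{in}}, \mathbf{a}^{\text{f}}, \mathbf{a}^{\text{out}}$ and the weights $\mathbf{W}, \mathbf{U}, \mathbf{b}$ are notation introduced here; the subscript $i$ is the time index, as above.

\begin{align*}
\mathbf{a}^{\text{in}}_i &= \sigma(\mathbf{W}^{\text{in}} \mathbf{x}_i + \mathbf{U}^{\text{in}} \mathbf{h}_{i-1} + \mathbf{b}^{\text{in}}), \\
\mathbf{a}^{\text{f}}_i &= \sigma(\mathbf{W}^{\text{f}} \mathbf{x}_i + \mathbf{U}^{\text{f}} \mathbf{h}_{i-1} + \mathbf{b}^{\text{f}}), \\
\mathbf{a}^{\text{out}}_i &= \sigma(\mathbf{W}^{\text{out}} \mathbf{x}_i + \mathbf{U}^{\text{out}} \mathbf{h}_{i-1} + \mathbf{b}^{\text{out}}), \\
\tilde{\mathbf{c}}_i &= \tanh(\mathbf{W}^{\text{c}} \mathbf{x}_i + \mathbf{U}^{\text{c}} \mathbf{h}_{i-1} + \mathbf{b}^{\text{c}}), \\
\mathbf{c}_i &= \mathbf{a}^{\text{f}}_i \odot \mathbf{c}_{i-1} + \mathbf{a}^{\text{in}}_i \odot \tilde{\mathbf{c}}_i, \\
\mathbf{h}_i &= \mathbf{a}^{\text{out}}_i \odot \tanh(\mathbf{c}_i).
\end{align*}

In the notation of the Review section, $\mathbf{s}_i = (\mathbf{h}_i, \mathbf{c}_i)$ and $\mathbf{\theta}$ collects all of the $\mathbf{W}$, $\mathbf{U}$, and $\mathbf{b}$ terms.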


We can see that the entries of lstm.outputs match the rows of the outputs tensor we stored manually:

In [3]:
print(lstm.outputs[1])
print(lstm.outputs[2])
print(lstm.outputs[3])

 0.0354
-0.2311
-0.0782
-0.0994
 0.1091
[torch.DoubleTensor of size 5]

-0.0087
-0.1867
-0.1211
-0.2268
 0.2223
[torch.DoubleTensor of size 5]

 0.0590
 0.1845
-0.0216
-0.2605
 0.1665
[torch.DoubleTensor of size 5]



So, the first thing the RNN library gives us is implementations of many of the $R$ functions we're interested in using, such as LSTMs and GRUs. However, it does much more!

## BPTT

To do backpropagation (through time) correctly, we would technically need to loop backwards over the input sequence, as in the following pseudocode:

In [None]:
-- note this doesn't work! just supposed to convey how you might have to implement this...
dLdh_i = gradOutForFinalH()
dLdc_i = gradOutForFinalC()

for i = data:size(1), 1, -1 do
    dLdh_iminus1, dLdc_iminus1 = lstm:backward(data[i], {dLdh_i, dLdc_i})
    dLdh_i, dLdc_i = dLdh_iminus1, dLdc_iminus1
end
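
In symbols, and as a sketch rather than a description of the library's internals: write $L = \sum_i \ell_i$ for the total loss, $\mathbf{g}_i = \partial \ell_i / \partial \mathbf{s}_i$ for the gradient that the loss at step $i$ contributes directly (zero for $i < n$ in an acceptor), and $\boldsymbol{\delta}_i = \partial L / \partial \mathbf{s}_i$ for the total gradient reaching $\mathbf{s}_i$. The symbols $\ell_i$, $\mathbf{g}_i$, and $\boldsymbol{\delta}_i$ are notation introduced here. The backward loop over the sequence then computes

\begin{align*}
\boldsymbol{\delta}_n &= \mathbf{g}_n, \\
\boldsymbol{\delta}_{i} &= \mathbf{g}_{i} + \left(\frac{\partial \mathbf{s}_{i+1}}{\partial \mathbf{s}_{i}}\right)^{\top} \boldsymbol{\delta}_{i+1}, \qquad i = n-1, \ldots, 1, \\
\frac{\partial L}{\partial \mathbf{\theta}} &= \sum_{i=1}^{n} \left(\frac{\partial R(\mathbf{x}_i, \mathbf{s}_{i-1}; \mathbf{\theta})}{\partial \mathbf{\theta}}\right)^{\top} \boldsymbol{\delta}_i,
\end{align*}

where the last derivative treats $\mathbf{s}_{i-1}$ as a constant. Each $\boldsymbol{\delta}_i$ depends on $\boldsymbol{\delta}_{i+1}$, which is why the loop has to run from the end of the sequence to the beginning.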

Fortunately, however, the Element RNN library can do this for us, using its nn.Sequencer objects!

## Sequencers

An nn.Sequencer transforms a module (such as one inheriting from AbstractRecurrent) into a module that can call :forward() on an entire sequence, and :backward() on the entire sequence, thus abstracting away the looping required for backpropagation (through time). 

In particular, sequencers expect a **table** as input, and also output a table. Let's use an nn.Sequencer on an lstm:

In [16]:
lstm = nn.LSTM(5, 5)
seq_lstm = nn.Sequencer(lstm)
inp_table = torch.split(data, 1) -- make a table from our sequence of 3 vectors
out_table = seq_lstm:forward(inp_table)
print(out_table)

{
  1 : DoubleTensor - size: 1x5
  2 : DoubleTensor - size: 1x5
  3 : DoubleTensor - size: 1x5
}


For calling :backward(), an nn.Sequencer expects gradOutput to have the same shape as its output, just as every other nn module does. In this case, then, an nn.Sequencer expects gradOutput to be a table with an entry for each time step. In particular, gradOutput[i] should contain:

\begin{align*}
\frac{\partial \text{ loss at timestep } i}{\partial \mathbf{h}_i}
\end{align*}

Since for now we're dealing only with an "acceptor" RNN, there is only loss at the final timestep, and so gradOutput[i] is all zeros for $i < n$, as follows:

In [17]:
gradOutput = torch.split(torch.zeros(3, 5), 1)
-- randomly set final gradOutput
gradOutput[#gradOutput] = torch.randn(1, 5) -- 1x5, matching the Sequencer's output; ordinarily you'd get this from a criterion
-- now we can BPTT with a single call!
seq_lstm:backward(inp_table, gradOutput)
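
In practice the final gradOutput entry would come from a criterion rather than from torch.randn. The following is a minimal sketch of one such training step, assuming a squared-error loss on the final output; the choice of nn.MSECriterion, the random `target`, and the learning rate of 0.1 are hypothetical and only for illustration.

```lua
require 'rnn'

local seqLstm = nn.Sequencer(nn.LSTM(5, 5))
local criterion = nn.MSECriterion()              -- assumed loss, for illustration only

local inputs = torch.split(torch.randn(3, 5), 1) -- table of three 1x5 tensors
local target = torch.randn(1, 5)                 -- hypothetical target for the final output

local outputs = seqLstm:forward(inputs)
local n = #outputs
local loss = criterion:forward(outputs[n], target)

-- acceptor: zero gradient for steps 1..n-1, criterion gradient for step n
local gradOutputs = {}
for i = 1, n - 1 do
    gradOutputs[i] = torch.zeros(1, 5)
end
gradOutputs[n] = criterion:backward(outputs[n], target)

seqLstm:zeroGradParameters()          -- clear gradients accumulated earlier
seqLstm:backward(inputs, gradOutputs) -- BPTT over the whole sequence
seqLstm:updateParameters(0.1)         -- one plain SGD step
```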

## Avoiding Tables
Since your data will generally be in tensors rather than tables, it's common to add layers to your network that map from tensors to tables and back. For instance, we can create an lstm that takes in a tensor (rather than a table) by using an nn.SplitTable:

In [12]:
seq_lstm2 = nn.Sequential():add(nn.SplitTable(1)):add(nn.Sequencer(nn.LSTM(5, 5)))
print(seq_lstm2:forward(data))

{
  1 : DoubleTensor - size: 5
  2 : DoubleTensor - size: 5
  3 : DoubleTensor - size: 5
}


In an acceptor RNN we only care about the last state, so we can make our final layer an nn.SelectTable. This also simplifies calling :backward(), since nn.SelectTable implicitly passes back zeros for all but the selected table index:

In [13]:
seq_lstm3 = seq_lstm2:clone()
seq_lstm3:add(nn.SelectTable(-1)) -- select the last element in the output table

print(seq_lstm3:forward(data))
gradOutFinal = gradOutput[#gradOutput]:view(5) -- just a tensor, viewed as size 5 to match the network's output
seq_lstm3:backward(data, gradOutFinal)

 0.1290
-0.0619
 0.1116
-0.0100
-0.0775
[torch.DoubleTensor of size 5]
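
Since an acceptor uses the final state to make a prediction, a natural next step is to put an output layer on top of the selected state. Below is a minimal sketch of a two-class acceptor and a single training step. The output layer (nn.Linear followed by nn.LogSoftMax), the nn.ClassNLLCriterion loss, the class label, and the learning rate are all assumptions made for illustration.

```lua
require 'rnn'

local acceptor = nn.Sequential()
    :add(nn.SplitTable(1))              -- 3x5 tensor -> table of 3 vectors
    :add(nn.Sequencer(nn.LSTM(5, 5)))
    :add(nn.SelectTable(-1))            -- keep only the final output
    :add(nn.Linear(5, 2))               -- hypothetical 2-class output layer
    :add(nn.LogSoftMax())
local criterion = nn.ClassNLLCriterion()

local data = torch.randn(3, 5) -- a sequence of 3 vectors in R^5
local target = 1               -- hypothetical gold class (1 or 2)

local logProbs = acceptor:forward(data)
local loss = criterion:forward(logProbs, target)

acceptor:zeroGradParameters()
-- the criterion's gradient is a plain tensor; SelectTable fills in the zeros
acceptor:backward(data, criterion:backward(logProbs, target))
acceptor:updateParameters(0.1) -- one plain SGD step
```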



If you care about more than the last state of your LSTM, you can add an nn.JoinTable to your network after the Sequencer, which concatenates the per-timestep outputs into a single tensor.
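
The nn.JoinTable idea can be sketched as follows. Because each per-step output here is a vector of size 5, nn.JoinTable(1) yields one vector of size 15; the trailing nn.View(3, 5) is an extra step assumed here to recover a (timesteps x hidden size) matrix, and it hard-codes the sequence length of 3.

```lua
require 'rnn'

local net = nn.Sequential()
    :add(nn.SplitTable(1))             -- 3x5 tensor -> table of 3 vectors
    :add(nn.Sequencer(nn.LSTM(5, 5)))  -- table of 3 output vectors
    :add(nn.JoinTable(1))              -- concatenate into one vector of size 15
    :add(nn.View(3, 5))                -- assumed reshape back to 3x5

local out = net:forward(torch.randn(3, 5))
print(out:size())
```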

## Batching
In order to make RNNs fast, it is important to batch. When batching with an Element RNN, time-steps continue to be represented as indices in a table, but this time each element in the table is a **matrix** rather than a vector. In particular, batching occurs along the first dimension (as usual). Thus, a sequence of length 3 with each vector in $\mathbb{R}^5$ and a batch-size of 2 could be created in the following way:

In [15]:
-- data representing a sequence of length 3, vectors in R^5, and batch-size of 2
batchSequenceDataTbl = {torch.randn(2, 5), torch.randn(2, 5), torch.randn(2, 5)}
print(batchSequenceDataTbl)
-- do a batched :forward() call
print(nn.Sequencer(nn.LSTM(5, 5)):forward(batchSequenceDataTbl))

{
  1 : DoubleTensor - size: 2x5
  2 : DoubleTensor - size: 2x5
  3 : DoubleTensor - size: 2x5
}


{
  1 : DoubleTensor - size: 2x5
  2 : DoubleTensor - size: 2x5
  3 : DoubleTensor - size: 2x5
}
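
If a batch starts out as a single tensor, the nn.SplitTable and nn.SelectTable layers from the previous section carry over to the batched case. The sketch below assumes a time-first layout (timesteps x batch x dimension), chosen here so that splitting along the first dimension produces one batch matrix per time step; the layout and the variable names are illustrative assumptions.

```lua
require 'rnn'

local batchAcceptor = nn.Sequential()
    :add(nn.SplitTable(1))             -- 3x2x5 tensor -> table of three 2x5 matrices
    :add(nn.Sequencer(nn.LSTM(5, 5)))
    :add(nn.SelectTable(-1))           -- final output for every sequence in the batch

local batch = torch.randn(3, 2, 5)     -- length 3, batch size 2, vectors in R^5
local final = batchAcceptor:forward(batch)
print(final:size())                    -- one row per sequence in the batch

-- gradOutput has the shape of the output: one gradient row per sequence
local gradInput = batchAcceptor:backward(batch, torch.randn(2, 5))
```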


## Stacking RNNs

## Remember and Forget

## FastLSTM

## Use with LookupTables

## Transducer RNNs