RNNs in practice:

In [1]:
import torch
import torch.nn as nn

In [3]:
torch.manual_seed(1)

rnn_layer = nn.RNN(input_size=5, hidden_size=2, num_layers=1, batch_first=True)

In [5]:
w_xh = rnn_layer.weight_ih_l0
w_hh = rnn_layer.weight_hh_l0
b_xh = rnn_layer.bias_ih_l0
b_hh = rnn_layer.bias_hh_l0

In [6]:
print('W_xh shape: ', w_xh.shape)
print('W_hh shape: ', w_hh.shape)
print('b_xh shape: ', b_xh.shape)
print('b_hh shape: ', b_hh.shape)

W_xh shape:  torch.Size([2, 5])
W_hh shape:  torch.Size([2, 2])
b_xh shape:  torch.Size([2])
b_hh shape:  torch.Size([2])


The input to this layer has shape (batch_size, sequence_length, 5). Because we set batch_first to True, the first dimension is the batch dimension, the second dimension corresponds to the sequence (time steps), and the last dimension corresponds to the features. Next, we run a forward pass on the RNN layer, manually compute the output at each time step, and compare the two:

In [7]:
x_seq = torch.tensor([[1.0]*5, [2.0]*5, [3.0]*5]).float()

In [8]:
x_seq.shape

torch.Size([3, 5])

In [9]:
x_seq = torch.reshape(x_seq, (1,3,5))

In [10]:
output, hn = rnn_layer(x_seq)

In [11]:
output.shape

torch.Size([1, 3, 2])

In [12]:
hn.shape

torch.Size([1, 1, 2])

In [16]:
print(hn)

tensor([[[-0.8649,  0.9047]]], grad_fn=<StackBackward0>)


In [15]:
# manually computing the output

out_man = []

for t in range(3):
    xt = x_seq[:,t,:]
    print(f'Time step {t} =>')
    print(' Input  :', xt.numpy())

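    # Input contribution to the pre-activation: x_t @ W_xh^T + b_xh
    ht = torch.matmul(xt, torch.transpose(w_xh, 0,1)) + b_xh
    print(' Hidden :', ht.detach().numpy())
    # Previous hidden state: the prior output, or zeros at the first time step
    if t>0:
        prev_h = out_man[t-1]
    else:
        prev_h = torch.zeros((ht.shape))
    # Add the recurrent contribution h_{t-1} @ W_hh^T + b_hh, then apply tanh
    ot = ht + torch.matmul(prev_h, torch.transpose(w_hh, 0, 1)) + b_hh
    ot = torch.tanh(ot)
    out_man.append(ot)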
    print(' Output (manual): ', ot.detach().numpy())
    print(' RNN output: ', output[:,t].detach().numpy())
    print()


Time step 0 =>
 Input  : [[1. 1. 1. 1. 1.]]
 Hidden : [[-0.47019297  0.58639044]]
 Output (manual):  [[-0.35198015  0.52525216]]
 RNN output:  [[-0.3519801   0.52525216]]

Time step 1 =>
 Input  : [[2. 2. 2. 2. 2.]]
 Hidden : [[-0.8888316  1.2364398]]
 Output (manual):  [[-0.68424344  0.76074266]]
 RNN output:  [[-0.68424344  0.76074266]]

Time step 2 =>
 Input  : [[3. 3. 3. 3. 3.]]
 Hidden : [[-1.3074702  1.8864892]]
 Output (manual):  [[-0.8649416  0.9046636]]
 RNN output:  [[-0.8649416  0.9046636]]



In our manual forward computation, we used the hyperbolic tangent (tanh) activation function because it is also the default activation in nn.RNN. As the printed results show, the manually computed outputs match the outputs of the RNN layer at each time step, up to floating-point rounding in the last digit (see time step 0). The output at the final time step also equals the hidden state hn printed earlier.
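The same comparison can be done numerically instead of by reading the printed values. The sketch below rebuilds the layer with the same seed and wraps the per-step recurrence h_t = tanh(x_t W_xh^T + b_xh + h_{t-1} W_hh^T + b_hh) in a helper. The name `manual_rnn_forward` and the tolerance `atol=1e-6` are assumptions made for illustration, not part of the original notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)
rnn_layer = nn.RNN(input_size=5, hidden_size=2, num_layers=1, batch_first=True)

# Same toy input as above: batch of 1, 3 time steps, 5 features
x_seq = torch.tensor([[1.0] * 5, [2.0] * 5, [3.0] * 5]).reshape(1, 3, 5)
output, hn = rnn_layer(x_seq)


def manual_rnn_forward(rnn, x):
    """Recompute a single-layer tanh RNN step by step (hypothetical helper)."""
    w_xh, w_hh = rnn.weight_ih_l0, rnn.weight_hh_l0
    b_xh, b_hh = rnn.bias_ih_l0, rnn.bias_hh_l0
    # Initial hidden state is all zeros, shape (batch_size, hidden_size)
    h = torch.zeros(x.shape[0], rnn.hidden_size)
    steps = []
    for t in range(x.shape[1]):
        h = torch.tanh(x[:, t, :] @ w_xh.T + b_xh + h @ w_hh.T + b_hh)
        steps.append(h)
    # Stack along the time dimension to get (batch, seq_len, hidden_size)
    return torch.stack(steps, dim=1)


manual_output = manual_rnn_forward(rnn_layer, x_seq)

# Compare with a small tolerance rather than exact equality
outputs_match = torch.allclose(manual_output, output, atol=1e-6)
# For a single-layer, unidirectional RNN, hn is the output at the last time step
last_step_matches_hn = torch.allclose(output[:, -1, :], hn[0], atol=1e-6)
print(outputs_match, last_step_matches_hn)
```

Using `torch.allclose` rather than `==` avoids false mismatches from the last-digit rounding differences visible in the printed output.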

**The challenges of learning long-range interactions:**
- Training RNNs on long sequences runs into vanishing and exploding gradients.
- In practice, there are at least three solutions to this problem:
  - Gradient clipping
  - Truncated backpropagation through time (TBPTT)
  - LSTM
- With gradient clipping, we specify a cut-off (threshold) value for the gradients and assign this cut-off value to any gradient value that exceeds it. In contrast, TBPTT simply limits the number of time steps that the signal can backpropagate through after each forward pass. For example, even if the sequence has 100 elements or steps, we may only backpropagate through the most recent 20 time steps.
- While both gradient clipping and TBPTT can solve the exploding gradient problem, the truncation limits the number of steps that the gradient can effectively flow back through and properly update the weights. LSTMs, on the other hand, have been more successful at handling vanishing and exploding gradients while modeling long-range dependencies, through the use of memory cells.
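
Gradient clipping and TBPTT can be combined in one training loop. The following is a minimal sketch, not a recommended recipe: the toy data, the linear `head`, the SGD optimizer, `chunk_len = 20`, and `clip_value = 0.5` are all assumptions chosen for illustration. Clipping uses `torch.nn.utils.clip_grad_value_`, which clamps each gradient element to the given threshold, and truncation is done by detaching the hidden state between chunks.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

rnn = nn.RNN(input_size=5, hidden_size=2, batch_first=True)
head = nn.Linear(2, 1)  # hypothetical output layer for a toy regression task
params = list(rnn.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)
loss_fn = nn.MSELoss()

# Toy data (assumed): one sequence with 100 time steps and 5 features
x = torch.randn(1, 100, 5)
y = torch.randn(1, 100, 1)

chunk_len = 20    # TBPTT window: backpropagate through at most 20 steps
clip_value = 0.5  # gradient elements are clamped to [-0.5, 0.5]

h = None  # None makes nn.RNN start from a zero hidden state
max_abs_grads = []
for start in range(0, x.shape[1], chunk_len):
    x_chunk = x[:, start:start + chunk_len]
    y_chunk = y[:, start:start + chunk_len]

    optimizer.zero_grad()
    out, h = rnn(x_chunk, h)
    loss = loss_fn(head(out), y_chunk)
    loss.backward()

    # Gradient clipping by value, applied in place to every parameter's .grad
    torch.nn.utils.clip_grad_value_(params, clip_value)
    max_abs_grads.append(max(p.grad.abs().max().item() for p in params))

    optimizer.step()

    # TBPTT: keep the hidden state's value but cut the graph, so the next
    # chunk's backward pass stops at this chunk boundary
    h = h.detach()
```

The hidden state still carries information forward across chunks; only the backward pass is truncated. A common alternative to value clipping is norm-based clipping with `torch.nn.utils.clip_grad_norm_`, which rescales the whole gradient vector instead of clamping individual elements.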