Given an input sequence of tokens $X_0,\dots,X_n$, an RNN creates a sequence of neural network blocks and trains this sequence end-to-end using backpropagation. Each network block takes a pair $(X_i,S_i)$ as input and produces $S_{i+1}$ as output. The final state $S_n$ (or the output of the last block) is passed into a linear classifier to produce the result. All network blocks share the same weights.

Because the state vectors $S_0,\dots,S_n$ are passed through the network, it can learn sequential dependencies between words. For instance, when the word *not* appears somewhere in the sequence, the network can learn to negate certain elements within the state vector, effectively capturing negation.

> Since the weights of all RNN blocks in the diagram are shared, the same diagram can be simplified into a single block (on the right) with a recurrent feedback loop that feeds the network's output state back into its input.
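
The idea of one shared block applied repeatedly can be sketched as a plain loop. This is a minimal illustration rather than the lesson's implementation; the sizes and the random stand-in for embedded tokens are assumptions made for the example.

```python
import torch

torch.manual_seed(0)

embed_dim, hidden_dim, num_class = 8, 16, 4      # illustrative sizes
cell = torch.nn.RNNCell(embed_dim, hidden_dim)   # one block, reused at every step
classifier = torch.nn.Linear(hidden_dim, num_class)

# A toy batch: 2 sequences of 5 tokens, already represented as vectors.
tokens = torch.randn(2, 5, embed_dim)

state = torch.zeros(2, hidden_dim)               # initial state S_0
for i in range(tokens.size(1)):
    # (X_i, S_i) -> S_{i+1}, always with the same weights
    state = cell(tokens[:, i], state)

logits = classifier(state)                       # final state -> class scores
print(logits.shape)
```

Because `cell` is the same object at every step, gradients from all positions accumulate in a single set of weights during backpropagation.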

Let’s explore how recurrent neural networks can assist in classifying our news dataset.


In [1]:
import torch
import torchtext
from torchnlp import *
train_dataset, test_dataset, classes, vocab = load_dataset()
vocab_size = len(vocab)

Loading dataset...
Building vocab...


## Simple RNN classifier

For a simple RNN, each recurrent unit is a small single-layer network: it takes the concatenated input vector and state vector, applies a linear layer followed by a nonlinearity (`tanh` by default in PyTorch), and produces a new state vector. PyTorch represents this unit with the `RNNCell` class, and a network of such cells with the `RNN` layer.
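
To make the "concatenated input and state" description concrete, the sketch below recomputes one `RNNCell` step by hand and compares it with the library result. The sizes are arbitrary and chosen only for illustration.

```python
import torch

torch.manual_seed(0)
cell = torch.nn.RNNCell(input_size=8, hidden_size=16)  # tanh nonlinearity by default

x = torch.randn(3, 8)     # batch of 3 input vectors
s = torch.randn(3, 16)    # batch of 3 state vectors

# Concatenate input and state, and place the two weight matrices side by side,
# so the whole unit becomes one linear map followed by tanh.
xs = torch.cat([x, s], dim=1)                            # shape (3, 24)
W = torch.cat([cell.weight_ih, cell.weight_hh], dim=1)   # shape (16, 24)
b = cell.bias_ih + cell.bias_hh
manual = torch.tanh(xs @ W.T + b)

same = torch.allclose(manual, cell(x, s), atol=1e-5)
print(same)
```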

To create an RNN classifier, we will first use an embedding layer to reduce the dimensionality of the input vocabulary, followed by an RNN layer on top:


In [2]:
class RNNClassifier(torch.nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_class):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embedding = torch.nn.Embedding(vocab_size, embed_dim)
        self.rnn = torch.nn.RNN(embed_dim,hidden_dim,batch_first=True)
        self.fc = torch.nn.Linear(hidden_dim, num_class)

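    def forward(self, x):
        # (batch, seq_len) token ids -> (batch, seq_len, embed_dim) vectors
        x = self.embedding(x)
        # x: outputs at every step, h: final hidden state
        x,h = self.rnn(x)
        # average the step outputs over the sequence, then classify
        return self.fc(x.mean(dim=1))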

> **Note:** Here, we use an untrained embedding layer for simplicity, but for better results, we can use a pre-trained embedding layer with Word2Vec or GloVe embeddings, as explained in the previous unit. To deepen your understanding, you might want to modify this code to work with pre-trained embeddings.
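
One possible starting point for that exercise is `torch.nn.Embedding.from_pretrained`. In the sketch below the matrix `pretrained` is a random stand-in (an assumption for illustration); in practice each row would hold the Word2Vec or GloVe vector of the corresponding vocabulary entry.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-in for a (vocab_size, embed_dim) matrix of pre-trained vectors.
vocab_size, embed_dim = 100, 64
pretrained = torch.randn(vocab_size, embed_dim)

# freeze=False keeps the vectors trainable so they can be fine-tuned.
embedding = torch.nn.Embedding.from_pretrained(pretrained, freeze=False)

ids = torch.tensor([[1, 5, 7]])   # one sequence of three token ids
vectors = embedding(ids)          # shape (1, 3, 64)
print(vectors.shape)
```

Inside `RNNClassifier`, such a layer would take the place of `self.embedding`, with `embed_dim` matching the dimensionality of the pre-trained vectors.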

In our case, we will use a padded data loader, so each batch will consist of padded sequences of the same length. The RNN layer takes the sequence of embedding tensors and produces two outputs:
* $x$ is the sequence of RNN cell outputs at each step
* $h$ is the final hidden state, after the last element of the sequence

We then average the outputs $x$ over the sequence and apply a fully connected linear classifier to obtain one score per class.
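
The shapes of the two outputs are easiest to see on a toy batch. This is a small sketch with random numbers in place of real embeddings; the batch size and sequence length are illustrative.

```python
import torch

torch.manual_seed(0)
rnn = torch.nn.RNN(input_size=64, hidden_size=32, batch_first=True)

emb = torch.randn(16, 10, 64)   # (batch, sequence length, embedding size)
x, h = rnn(emb)

print(x.shape)   # one 32-dimensional output per step
print(h.shape)   # one final state per sequence, with a leading layer dimension

# For a single-layer, one-directional RNN, the last step of x is the final state.
last_matches = torch.allclose(x[:, -1], h[-1])

# Averaging over the time dimension gives one vector per sequence.
pooled = x.mean(dim=1)
```

Note that with padded batches the average also includes the outputs computed at padded positions.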

> **Note:** Training RNNs can be quite challenging because, once the RNN cells are unrolled along the sequence length, the number of layers involved in backpropagation becomes very large. Therefore, we need to choose a small learning rate and train the network on a larger dataset to achieve good results. This process can take a significant amount of time, so using a GPU is recommended.


In [3]:
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, collate_fn=padify, shuffle=True)
net = RNNClassifier(vocab_size,64,32,len(classes)).to(device)
train_epoch(net,train_loader, lr=0.001)

3200: acc=0.3090625
6400: acc=0.38921875
9600: acc=0.4590625
12800: acc=0.511953125
16000: acc=0.5506875
19200: acc=0.57921875
22400: acc=0.6070089285714285
25600: acc=0.6304296875
28800: acc=0.6484027777777778
32000: acc=0.66509375
35200: acc=0.6790056818181818
38400: acc=0.6929166666666666
41600: acc=0.7035817307692308
44800: acc=0.7137276785714286
48000: acc=0.72225
51200: acc=0.73001953125
54400: acc=0.7372794117647059
57600: acc=0.7436631944444444
60800: acc=0.7503947368421052
64000: acc=0.75634375
67200: acc=0.7615773809523809
70400: acc=0.7662642045454545
73600: acc=0.7708423913043478
76800: acc=0.7751822916666666
80000: acc=0.7790625
83200: acc=0.7825
86400: acc=0.7858564814814815
89600: acc=0.7890513392857142
92800: acc=0.7920474137931034
96000: acc=0.7952708333333334
99200: acc=0.7982258064516129
102400: acc=0.80099609375
105600: acc=0.8037594696969697
108800: acc=0.8060569852941176


## Long Short Term Memory (LSTM)

One of the main issues with classical RNNs is the so-called **vanishing gradients** problem. Because RNNs are trained end-to-end in a single backpropagation pass through the unrolled sequence, it is difficult to propagate the error back to the earliest layers of the network. As a result, the network struggles to learn relationships between distant tokens. One way to address this issue is to introduce **explicit state management** through the use of **gates**. Two of the best-known architectures that use this approach are **Long Short-Term Memory** (LSTM) and the **Gated Recurrent Unit** (GRU).
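
The effect can be made visible in a deliberately contrived setup. The sketch below is an illustration under assumptions, not part of the lesson code: the recurrent weights are set by hand to `0.5 * I`, so every step back in time shrinks the gradient by at least half, and the helper name `first_token_grad` is made up for this example.

```python
import torch

torch.manual_seed(0)
rnn = torch.nn.RNN(input_size=4, hidden_size=8, batch_first=True)
with torch.no_grad():
    # Contrived recurrent weights: each step scales the state by 0.5
    rnn.weight_hh_l0.copy_(0.5 * torch.eye(8))

def first_token_grad(seq_len):
    """Gradient norm of the summed final state with respect to the first input."""
    x = torch.randn(1, seq_len, 4, requires_grad=True)
    _, h = rnn(x)
    h.sum().backward()
    return x.grad[0, 0].norm().item()

short_grad = first_token_grad(5)    # first token is 4 steps away from the end
long_grad = first_token_grad(50)    # first token is 49 steps away from the end
print(short_grad, long_grad)
```

In this setup the first token of the long sequence receives a far smaller gradient than the first token of the short one, which is why its influence is hard to learn.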

![Image showing an example long short term memory cell](../../../../../lessons/5-NLP/16-RNN/images/long-short-term-memory-cell.svg)

The LSTM network is structured similarly to an RNN, but it passes two states from unit to unit: the actual state $c$ and the hidden vector $h$. At each unit, the hidden vector $h_i$ is concatenated with the input $x_i$, and together they control what happens to the state $c$ through **gates**. Each gate is a neural network with a sigmoid activation function (output in the range $[0,1]$), which can be thought of as an element-wise mask when multiplied by the state vector. The following gates are present (from left to right in the image above):
* **Forget gate**: Takes the hidden vector and the input and determines which components of the vector $c$ should be forgotten and which should be retained.
* **Input gate**: Extracts some information from the input and hidden vector and inserts it into the state.
* **Output gate**: Passes the updated state $c_{i+1}$ through a $\tanh$ activation, then selects certain components of the result, based on the hidden vector $h_i$ and the input, to produce the new hidden vector $h_{i+1}$.
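
The gates above can be written out step by step and checked against `torch.nn.LSTMCell`. This is a sketch of PyTorch's formulation, in which the output gate produces the next hidden vector; the sizes and variable names are chosen for illustration only.

```python
import torch

torch.manual_seed(0)
input_dim, hidden_dim = 8, 16
cell = torch.nn.LSTMCell(input_dim, hidden_dim)

x = torch.randn(3, input_dim)    # input x_i
h = torch.randn(3, hidden_dim)   # hidden vector h_i
c = torch.randn(3, hidden_dim)   # state c_i

# One linear map of input and hidden vector yields all four pre-activations.
# PyTorch stores them in the order: input gate, forget gate, candidate, output gate.
pre = x @ cell.weight_ih.T + cell.bias_ih + h @ cell.weight_hh.T + cell.bias_hh
i_gate, f_gate, cand, o_gate = pre.chunk(4, dim=1)

i_gate = torch.sigmoid(i_gate)   # what to insert into the state
f_gate = torch.sigmoid(f_gate)   # what to keep from the old state
o_gate = torch.sigmoid(o_gate)   # what to expose as the hidden vector
cand = torch.tanh(cand)          # candidate values to insert

c_next = f_gate * c + i_gate * cand     # forget old content, add new content
h_next = o_gate * torch.tanh(c_next)    # masked view of the updated state

h_ref, c_ref = cell(x, (h, c))
print(torch.allclose(h_next, h_ref, atol=1e-5), torch.allclose(c_next, c_ref, atol=1e-5))
```

Each sigmoid output lies in $[0,1]$, so multiplying by it acts as a soft mask over the components of the state.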

The components of the state $c$ can be thought of as flags that can be turned on or off. For example, when encountering the name *Alice* in a sequence, we might assume it refers to a female character and set a flag in the state to indicate the presence of a female noun in the sentence. Later, when encountering the phrase *and Tom*, we might set a flag to indicate the presence of a plural noun. By manipulating the state in this way, we can theoretically keep track of the grammatical properties of different parts of a sentence.

> **Note**: A fantastic resource for understanding the inner workings of LSTMs is the article [Understanding LSTM Networks](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) by Christopher Olah.

Although the internal structure of an LSTM cell may seem complex, PyTorch hides this implementation inside the `LSTMCell` class and provides the `LSTM` class to represent the entire LSTM layer. As a result, implementing an LSTM classifier is quite similar to the simple RNN we discussed earlier:


In [4]:
class LSTMClassifier(torch.nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_class):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embedding = torch.nn.Embedding(vocab_size, embed_dim)
        self.embedding.weight.data = torch.randn_like(self.embedding.weight.data)-0.5
        self.rnn = torch.nn.LSTM(embed_dim,hidden_dim,batch_first=True)
        self.fc = torch.nn.Linear(hidden_dim, num_class)

    def forward(self, x):
        x = self.embedding(x)
        # h: final hidden state, c: final cell state
        x,(h,c) = self.rnn(x)
        # classify from the final hidden state of the last layer
        return self.fc(h[-1])

In [5]:
net = LSTMClassifier(vocab_size,64,32,len(classes)).to(device)
train_epoch(net,train_loader, lr=0.001)

3200: acc=0.259375
6400: acc=0.25859375
9600: acc=0.26177083333333334
12800: acc=0.2784375
16000: acc=0.313
19200: acc=0.3528645833333333
22400: acc=0.3965625
25600: acc=0.4385546875
28800: acc=0.4752777777777778
32000: acc=0.505375
35200: acc=0.5326704545454546
38400: acc=0.5557552083333334
41600: acc=0.5760817307692307
44800: acc=0.5954910714285714
48000: acc=0.6118333333333333
51200: acc=0.62681640625
54400: acc=0.6404779411764706
57600: acc=0.6520138888888889
60800: acc=0.662828947368421
64000: acc=0.673546875
67200: acc=0.6831547619047619
70400: acc=0.6917897727272727
73600: acc=0.6997146739130434
76800: acc=0.707109375
80000: acc=0.714075
83200: acc=0.7209134615384616
86400: acc=0.727037037037037
89600: acc=0.7326674107142858
92800: acc=0.7379633620689655
96000: acc=0.7433645833333333
99200: acc=0.7479032258064516
102400: acc=0.752119140625
105600: acc=0.7562405303030303
108800: acc=0.76015625
112000: acc=0.7641339285714286
115200: acc=0.7677777777777778
118400: acc=0.77112331081

(0.03487814127604167, 0.7728)

## Packed sequences

In our example, we had to pad all sequences in the minibatch with zero vectors. While this leads to some memory being wasted, the bigger issue with RNNs is that additional RNN cells are created for the padded input items. These cells participate in training but do not carry any meaningful input information. It would be much better to train the RNN only up to the actual sequence length.

To address this, PyTorch introduces a special format for storing padded sequences. Suppose we have a padded input minibatch that looks like this:
```
[[1,2,3,4,5],
 [6,7,8,0,0],
 [9,0,0,0,0]]
```
Here, 0 represents the padded values, and the actual length vector of the input sequences is `[5,3,1]`.

To train an RNN effectively with padded sequences, we want to start training the first group of RNN cells with the largest minibatch (`[1,6,9]`), then stop processing the third sequence, and continue training with smaller minibatches (`[2,7]`, `[3,8]`), and so on. A packed sequence is thus represented as a single vector, in this case `[1,6,9,2,7,3,8,4,5]`, together with the information needed to recover the lengths. PyTorch stores the size of the minibatch at each step (here `[3,2,2,1,1]`), which is equivalent to the length vector `[5,3,1]`, so the original padded minibatch can easily be reconstructed.

To create a packed sequence, we can use the `torch.nn.utils.rnn.pack_padded_sequence` function. All recurrent layers, including RNN, LSTM, and GRU, support packed sequences as input and produce packed output, which can be converted back using `torch.nn.utils.rnn.pad_packed_sequence`.
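
Applied to the minibatch from the example above, the two functions behave as follows. This is a small self-contained sketch that uses integer tensors directly instead of embeddings.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

padded = torch.tensor([[1, 2, 3, 4, 5],
                       [6, 7, 8, 0, 0],
                       [9, 0, 0, 0, 0]])
lengths = torch.tensor([5, 3, 1])   # kept on the CPU

packed = pack_padded_sequence(padded, lengths, batch_first=True)
print(packed.data.tolist())          # [1, 6, 9, 2, 7, 3, 8, 4, 5]
print(packed.batch_sizes.tolist())   # [3, 2, 2, 1, 1]

# Unpacking restores the padded minibatch and the length vector.
restored, restored_lengths = pad_packed_sequence(packed, batch_first=True)
```

The lengths here are already sorted in decreasing order; for unsorted batches, `enforce_sorted=False` lets PyTorch handle the reordering internally.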

To generate a packed sequence, we need to pass the length vector to the network. Therefore, we require a different function to prepare minibatches:


In [6]:
def pad_length(b):
    # build vectorized sequence
    v = [encode(x[1]) for x in b]
    # compute max length of a sequence in this minibatch and length sequence itself
    len_seq = list(map(len,v))
    l = max(len_seq)
    return ( # tuple of three tensors - labels, padded features, length sequence
        torch.LongTensor([t[0]-1 for t in b]),
        torch.stack([torch.nn.functional.pad(torch.tensor(t),(0,l-len(t)),mode='constant',value=0) for t in v]),
        torch.tensor(len_seq)
    )

train_loader_len = torch.utils.data.DataLoader(train_dataset, batch_size=16, collate_fn=pad_length, shuffle=True)

The actual network will be very similar to the `LSTMClassifier` mentioned above, but the `forward` pass will take both the padded minibatch and the vector of sequence lengths as input. After calculating the embedding, we create a packed sequence, pass it through the LSTM layer, and then unpack the result.

> **Note**: We don't actually use the unpacked result `x`, because the classifier works from the final hidden state `h`. The unpacking step could therefore be removed from this code entirely. It is included here to make it easier for you to modify the code if you need the per-step outputs of the network in further computations.


In [7]:
class LSTMPackClassifier(torch.nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_class):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embedding = torch.nn.Embedding(vocab_size, embed_dim)
        self.embedding.weight.data = torch.randn_like(self.embedding.weight.data)-0.5
        self.rnn = torch.nn.LSTM(embed_dim,hidden_dim,batch_first=True)
        self.fc = torch.nn.Linear(hidden_dim, num_class)

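    def forward(self, x, lengths):
        x = self.embedding(x)
        # pack so that the LSTM skips the padded positions
        pad_x = torch.nn.utils.rnn.pack_padded_sequence(x,lengths,batch_first=True,enforce_sorted=False)
        pad_x,(h,c) = self.rnn(pad_x)
        # unpack back to a padded tensor (not used below, kept for further experiments)
        x, _ = torch.nn.utils.rnn.pad_packed_sequence(pad_x,batch_first=True)
        # classify from the final hidden state of the last layer
        return self.fc(h[-1])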

In [8]:
net = LSTMPackClassifier(vocab_size,64,32,len(classes)).to(device)
train_epoch_emb(net,train_loader_len, lr=0.001,use_pack_sequence=True)


3200: acc=0.285625
6400: acc=0.33359375
9600: acc=0.3876041666666667
12800: acc=0.44078125
16000: acc=0.4825
19200: acc=0.5235416666666667
22400: acc=0.5559821428571429
25600: acc=0.58609375
28800: acc=0.6116666666666667
32000: acc=0.63340625
35200: acc=0.6525284090909091
38400: acc=0.668515625
41600: acc=0.6822596153846154
44800: acc=0.6948214285714286
48000: acc=0.7052708333333333
51200: acc=0.71521484375
54400: acc=0.7239889705882353
57600: acc=0.7315277777777778
60800: acc=0.7388486842105263
64000: acc=0.74571875
67200: acc=0.7518303571428572
70400: acc=0.7576988636363636
73600: acc=0.7628940217391305
76800: acc=0.7681510416666667
80000: acc=0.7728125
83200: acc=0.7772235576923077
86400: acc=0.7815393518518519
89600: acc=0.7857700892857142
92800: acc=0.7895043103448276
96000: acc=0.7930520833333333
99200: acc=0.7959072580645161
102400: acc=0.798994140625
105600: acc=0.802064393939394
108800: acc=0.8051378676470589
112000: acc=0.8077857142857143
115200: acc=0.8104600694444445
118400

(0.029785829671223958, 0.8138166666666666)

> **Note:** You may have noticed the parameter `use_pack_sequence` that we pass to the training function. Currently, the `pack_padded_sequence` function requires the length sequence tensor to be on the CPU device, so the training function must avoid moving the length sequence data to the GPU during training. You can look into the implementation of the `train_epoch_emb` function in the [`torchnlp.py`](../../../../../lessons/5-NLP/16-RNN/torchnlp.py) file.


## Bidirectional and Multilayer RNNs

In our examples, all recurrent networks operated in a single direction, from the start of a sequence to its end. This seems natural because it mirrors how we read or listen to speech. However, in many practical scenarios, we have random access to the input sequence, so it might make sense to perform recurrent computation in both directions. Such networks are called **bidirectional** RNNs, and they can be created by passing the `bidirectional=True` parameter to the RNN/LSTM/GRU constructor.

When working with a bidirectional network, we need two hidden state vectors, one for each direction. At each step, PyTorch concatenates the outputs of the two directions into a single vector of twice the size, which is quite convenient: you typically pass the result to a fully connected linear layer, and you only need to account for the increased size when creating that layer. The final hidden state `h` holds the two directions as separate entries along its first dimension.
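
The shapes involved can be checked on a toy batch. This is a minimal sketch with assumed sizes (random vectors instead of embeddings, 4 output classes), showing one way to feed both directions into the classifier.

```python
import torch

torch.manual_seed(0)
hidden_dim = 32
lstm = torch.nn.LSTM(64, hidden_dim, batch_first=True, bidirectional=True)
fc = torch.nn.Linear(2 * hidden_dim, 4)   # input is twice the hidden size

emb = torch.randn(16, 10, 64)             # (batch, sequence length, embedding size)
x, (h, c) = lstm(emb)
print(x.shape)   # last dimension is 2 * hidden_dim: forward and backward outputs
print(h.shape)   # first dimension is 2: one final state per direction

# Join the final forward state and the final backward state for the classifier.
features = torch.cat([h[0], h[1]], dim=1)
logits = fc(features)
```

The forward direction finishes at the last step of the sequence, while the backward direction finishes at the first step, so the two final states summarize the sequence from opposite ends.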

A recurrent network, whether one-directional or bidirectional, captures certain patterns within a sequence and can store them in the state vector or pass them to the output. Similar to convolutional networks, we can stack another recurrent layer on top of the first one to capture higher-level patterns, built from the low-level patterns extracted by the first layer. This brings us to the concept of a **multi-layer RNN**, which consists of two or more recurrent networks, where the output of one layer is passed as input to the next layer.

![Image showing a multilayer long short-term memory RNN](../../../../../translated_images/multi-layer-lstm.dd975e29bb2a59fe58b429db833932d734c81f211cad2783797a9608984acb8c.en.jpg)

*Image from [this excellent post](https://towardsdatascience.com/from-a-lstm-cell-to-a-multilayer-lstm-network-with-pytorch-2899eb5696f3) by Fernando López*

PyTorch makes constructing such networks straightforward: pass the `num_layers` parameter to the RNN/LSTM/GRU constructor to build several layers of recurrence automatically. The returned hidden/state tensor then grows proportionally, holding one vector per layer (and per direction), and you need to account for this when processing the output of the recurrent layers.
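
A quick shape check shows what changes with two layers. The sizes are illustrative, and random vectors stand in for embeddings.

```python
import torch

torch.manual_seed(0)
lstm = torch.nn.LSTM(64, 32, num_layers=2, batch_first=True)

emb = torch.randn(16, 10, 64)   # (batch, sequence length, embedding size)
x, (h, c) = lstm(emb)
print(x.shape)   # per-step outputs of the top layer only
print(h.shape)   # first dimension is 2: one final hidden state per layer

top = h[-1]      # final hidden state of the last (top) layer
```

This is why the classifiers in this unit use `h[-1]`: the same expression keeps selecting the top layer when `num_layers` is increased in a one-directional network.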


## RNNs for other tasks

In this unit, we have seen that RNNs can be used for sequence classification, but they are capable of handling many other tasks, such as text generation, machine translation, and more. We will explore these tasks in the next unit.



---

**Disclaimer**:  
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
