# Recurrent neural networks

In the previous module, we explored rich semantic representations of text. The architecture we used captures the overall meaning of words in a sentence, but it doesn't account for the **order** of the words. This is because the aggregation operation applied after the embeddings removes the sequence information from the original text. Since these models cannot represent word order, they struggle with more complex or ambiguous tasks like text generation or question answering.

To understand the meaning of a text sequence, we'll use a neural network architecture called a **recurrent neural network**, or RNN. With an RNN, we pass a sentence through the network one token at a time, and the network produces a **state**, which is then passed back into the network along with the next token.

![Image showing an example recurrent neural network generation.](../../../../../translated_images/rnn.27f5c29c53d727b546ad3961637a267f0fe9ec5ab01f2a26a853c92fcefbb574.en.png)

Given the input sequence of tokens $X_0,\dots,X_n$, the RNN constructs a sequence of neural network blocks, one per token. Each network block takes a pair $(X_i,S_i)$ as input and produces $S_{i+1}$ as output. The final state (the one produced after the last token $X_n$) or the output $Y_n$ is passed into a linear classifier to generate the result. All network blocks share the same weights and are trained end-to-end in a single backpropagation pass.

> The figure above illustrates a recurrent neural network in its unrolled form (on the left) and its more compact recurrent representation (on the right). It's important to note that all RNN cells share the same **weights**.

Because the state vectors $S_0,\dots,S_n$ are passed through the network, the RNN can learn sequential dependencies between words. For instance, if the word *not* appears somewhere in the sequence, the network can learn to negate certain elements within the state vector.

Each RNN cell internally contains two weight matrices: $W_H$ and $W_I$, along with a bias $b$. At each RNN step, given the input $X_i$ and the input state $S_i$, the output state is calculated as $S_{i+1} = f(W_H\times S_i + W_I\times X_i+b)$, where $f$ is an activation function (commonly $\tanh$).
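
The update rule above can be sketched in plain Python. This is a minimal illustration rather than how Keras implements it: the weight values are made up, the initial state $S_0$ is assumed to be zero, and the helper names `rnn_step` and `run_rnn` are hypothetical.

```python
import math

def rnn_step(state, x, W_H, W_I, b):
    """One RNN step: S_{i+1} = tanh(W_H * S_i + W_I * X_i + b)."""
    new_state = []
    for row_h, row_i, bias in zip(W_H, W_I, b):
        z = sum(w * s for w, s in zip(row_h, state))
        z += sum(w * v for w, v in zip(row_i, x))
        new_state.append(math.tanh(z + bias))
    return new_state

def run_rnn(inputs, W_H, W_I, b):
    """Apply the same cell (same weights) to every token, threading the state."""
    state = [0.0] * len(b)  # S_0: assumed to start at zero
    for x in inputs:
        state = rnn_step(state, x, W_H, W_I, b)
    return state

# Toy, hand-picked weights: state size 2, input size 3 (illustrative values only)
W_H = [[0.5, -0.2], [0.1, 0.3]]
W_I = [[0.4, 0.0, -0.1], [0.2, 0.7, 0.0]]
b = [0.0, 0.1]

# Three one-hot "tokens"
sequence = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
final_state = run_rnn(sequence, W_H, W_I, b)
print(final_state)
```

Feeding the same tokens in reverse order yields a different final state, which is exactly the order sensitivity that the aggregation-based models lacked.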

> For tasks like text generation (which we'll cover in the next unit) or machine translation, we also want to produce an output value at each RNN step. In such cases, an additional matrix $W_O$ is used, and the output is calculated as $Y_i=f(W_O\times S_i+b_O)$.

Now, let's explore how recurrent neural networks can help us classify our news dataset.

> In the sandbox environment, we need to run the following cell to ensure the required library is installed and the data is preloaded. If you're working locally, you can skip this step.


In [1]:
import sys
!{sys.executable} -m pip install --quiet tensorflow_datasets==4.4.0
!cd ~ && wget -q -O - https://mslearntensorflowlp.blob.core.windows.net/data/tfds-ag-news.tgz | tar xz

In [2]:
import tensorflow as tf
from tensorflow import keras
import tensorflow_datasets as tfds
import numpy as np

# We are going to train fairly large models. To avoid out-of-memory errors,
# ask TensorFlow to grow GPU memory allocation on demand instead of reserving it all up front
physical_devices = tf.config.list_physical_devices('GPU')
for device in physical_devices:
    tf.config.experimental.set_memory_growth(device, True)

ds_train, ds_test = tfds.load('ag_news_subset').values()

When training large models, GPU memory allocation can become an issue. We might also need to experiment with different minibatch sizes to ensure the data fits into GPU memory while keeping the training process efficient. If you're running this code on your own GPU machine, you can try adjusting the minibatch size to accelerate training.

> **Note**: Some versions of NVIDIA drivers are known not to release memory after training a model. Since we are running several examples in this notebook, this could lead to memory exhaustion in certain setups, especially if you're conducting your own experiments within the same notebook. If you encounter unusual errors when starting to train the model, consider restarting the notebook kernel.


In [3]:
batch_size = 16
embed_size = 64

## Simple RNN classifier

For a simple RNN, each recurrent unit is a simple network (a linear combination followed by an activation function) that takes an input vector and a state vector and produces a new state vector. In Keras, this is implemented by the `SimpleRNN` layer.

Although it's possible to feed one-hot encoded tokens directly into the RNN layer, this approach is not ideal due to their high dimensionality. Instead, we'll use an embedding layer to reduce the dimensionality of word vectors, followed by an RNN layer, and finally a `Dense` classifier.

> **Note**: In scenarios where the dimensionality is not as high, such as when using character-level tokenization, it might be reasonable to input one-hot encoded tokens directly into the RNN cell.


In [4]:
vocab_size = 20000

vectorizer = keras.layers.experimental.preprocessing.TextVectorization(
    max_tokens=vocab_size,
    input_shape=(1,))

model = keras.models.Sequential([
    vectorizer,
    keras.layers.Embedding(vocab_size, embed_size),
    keras.layers.SimpleRNN(16),
    keras.layers.Dense(4,activation='softmax')
])

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
text_vectorization (TextVect (None, None)              0         
_________________________________________________________________
embedding (Embedding)        (None, None, 64)          1280000   
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, 16)                1296      
_________________________________________________________________
dense (Dense)                (None, 4)                 68        
Total params: 1,281,364
Trainable params: 1,281,364
Non-trainable params: 0
_________________________________________________________________
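
The parameter counts in the summary can be checked by hand. The short calculation below assumes the usual layout of these layers: one embedding vector per vocabulary entry, and a weight matrix plus a bias vector for the recurrent and dense layers.

```python
vocab_size, embed_size, rnn_units, num_classes = 20000, 64, 16, 4

# Embedding: one embed_size-dimensional vector per vocabulary entry
embedding_params = vocab_size * embed_size

# SimpleRNN: input weights + recurrent weights + bias, for each of the 16 units
rnn_params = rnn_units * (embed_size + rnn_units + 1)

# Dense: weights from the final state to each class, plus one bias per class
dense_params = rnn_units * num_classes + num_classes

total = embedding_params + rnn_params + dense_params
print(embedding_params, rnn_params, dense_params, total)
# prints: 1280000 1296 68 1281364
```

Almost all of the parameters sit in the embedding layer; the recurrent layer itself is tiny by comparison.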


> **Note:** We use an untrained embedding layer here for simplicity, but for better results, we can use a pretrained embedding layer using Word2Vec, as described in the previous unit. It would be a good exercise for you to modify this code to work with pretrained embeddings.

Now let's train our RNN. RNNs are generally quite challenging to train because, once the RNN cells are unrolled along the sequence length, the number of layers involved in backpropagation becomes quite large. Therefore, we need to choose a smaller learning rate and train the network on a larger dataset to achieve good results. This process can take a significant amount of time, so using a GPU is recommended.

To make the process faster, we will train the RNN model only on news titles, excluding the description. You can try training with the description included and see if you can get the model to train successfully.


In [5]:
def extract_title(x):
    return x['title']

def tupelize_title(x):
    return (extract_title(x),x['label'])

print('Training vectorizer')
vectorizer.adapt(ds_train.take(2000).map(extract_title))

Training vectorizer


In [6]:
model.compile(loss='sparse_categorical_crossentropy',metrics=['acc'], optimizer='adam')
model.fit(ds_train.map(tupelize_title).batch(batch_size),validation_data=ds_test.map(tupelize_title).batch(batch_size))



<tensorflow.python.keras.callbacks.History at 0x7f3e0030d350>

> **Note** that accuracy is likely to be lower here, because we are training only on news titles.


## Revisiting variable sequences 

Remember that the `TextVectorization` layer automatically pads sequences of varying lengths in a minibatch with padding tokens. However, these tokens also participate in training, which can make it harder for the model to converge.

There are several strategies to reduce the amount of padding. One option is to reorder the dataset by sequence length, grouping all sequences of similar size together. This can be achieved using the `tf.data.experimental.bucket_by_sequence_length` function (see [documentation](https://www.tensorflow.org/api_docs/python/tf/data/experimental/bucket_by_sequence_length)).
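
To see why grouping by length helps, here is a small pure-Python sketch that counts padding tokens with and without sorting. It is only an approximation of the idea: plain sorting stands in for bucketing, the token counts are made up, and `padding_tokens` is a hypothetical helper, not part of TensorFlow.

```python
def padding_tokens(lengths, batch_size):
    """Count padding tokens when each batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total

lengths = [3, 12, 4, 11, 5, 10, 2, 13]  # made-up token counts per sequence

unsorted_padding = padding_tokens(lengths, batch_size=4)
sorted_padding = padding_tokens(sorted(lengths), batch_size=4)
print(unsorted_padding, sorted_padding)
# prints: 40 12
```

In this toy case, grouping similar lengths cuts the padding from 40 tokens to 12.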

Another option is to use **masking**. In Keras, certain layers support additional input that indicates which tokens should be considered during training. To add masking to our model, we can either include a separate `Masking` layer ([docs](https://keras.io/api/layers/core_layers/masking/)) or set the `mask_zero=True` parameter in our `Embedding` layer.
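
The effect of masking on a recurrent layer can be illustrated with a toy scalar recurrence. This sketch assumes that a masked step simply carries the previous state forward, and that token id 0 marks padding; the `step` function and its coefficients are hypothetical stand-ins for a trained RNN cell.

```python
import math

def step(state, token_id):
    """Toy scalar recurrence standing in for an RNN cell (made-up coefficients)."""
    return math.tanh(0.5 * state + 0.1 * token_id + 0.2)

def run(token_ids, use_mask):
    state = 0.0
    for token_id in token_ids:
        if use_mask and token_id == 0:
            continue  # padded position: keep the previous state unchanged
        state = step(state, token_id)
    return state

sentence = [7, 3, 9]
padded = sentence + [0, 0, 0]  # 0 is the padding id

print(run(sentence, use_mask=False))
print(run(padded, use_mask=False))  # padding keeps updating the state
print(run(padded, use_mask=True))   # masking recovers the unpadded result
```

Without the mask, the trailing padding tokens keep modifying the state after the sentence has ended, so the classifier sees a different vector for the same text depending on how much padding the batch needed.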

> **Note**: Training will take approximately 5 minutes per epoch for the entire dataset. Feel free to stop the training at any point if you lose patience. Alternatively, you can reduce the amount of data used for training by adding a `.take(...)` clause to the `ds_train` and `ds_test` datasets.


In [7]:
def extract_text(x):
    return x['title']+' '+x['description']

def tupelize(x):
    return (extract_text(x),x['label'])

model = keras.models.Sequential([
    vectorizer,
    keras.layers.Embedding(vocab_size,embed_size,mask_zero=True),
    keras.layers.SimpleRNN(16),
    keras.layers.Dense(4,activation='softmax')
])

model.compile(loss='sparse_categorical_crossentropy',metrics=['acc'], optimizer='adam')
model.fit(ds_train.map(tupelize).batch(batch_size),validation_data=ds_test.map(tupelize).batch(batch_size))



<tensorflow.python.keras.callbacks.History at 0x7f3dec118850>

Now that we're using masking, we can train the model on the entire dataset of titles and descriptions.

> **Note**: Have you noticed that we have been using a vectorizer adapted on the news titles only, and not on the full text of the articles? This could cause some tokens to be ignored, so it would be better to re-adapt the vectorizer. However, the impact is likely to be small, so we will keep using the existing vectorizer for the sake of simplicity.


## LSTM: Long short-term memory

One of the main challenges with RNNs is **vanishing gradients**. Once unrolled, RNNs can be quite deep, and during backpropagation it may be difficult to propagate gradients all the way back to the first layer of the network. When this happens, the network struggles to learn relationships between distant tokens. A solution to this problem is to introduce **explicit state management** using **gates**. The two most common architectures that use gates are **long short-term memory** (LSTM) and **gated recurrent unit** (GRU). Here, we'll focus on LSTMs.

![Image showing an example long short term memory cell](../../../../../lessons/5-NLP/16-RNN/images/long-short-term-memory-cell.svg)

An LSTM network is structured similarly to an RNN, but it passes two states from layer to layer: the actual state $c$, and the hidden vector $h$. At each unit, the hidden vector $h_{t-1}$ is combined with the input $x_t$, and together they control what happens to the state $c_t$ and the output $h_{t}$ through **gates**. Each gate uses sigmoid activation (output in the range $[0,1]$), which can be thought of as a soft element-wise mask when multiplied by the state vector. LSTMs include the following gates (from left to right in the image above):
* **Forget gate**: Determines which components of the vector $c_{t-1}$ should be discarded and which should be retained.
* **Input gate**: Decides how much information from the input vector and the previous hidden vector should be added to the state vector.
* **Output gate**: Takes the updated state vector and decides which of its components will be used to generate the new hidden vector $h_t$.
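
The three gates can be put together in a short pure-Python sketch of a single LSTM step. It follows the commonly used formulation in which every gate sees the concatenation of $h_{t-1}$ and $x_t$. The parameter names (`W_f`, `W_i`, `W_g`, `W_o` and the matching biases) and the toy values are assumptions made for illustration, and this is not how Keras lays out its weights internally.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, v, b):
    """Compute W * v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step; each gate sees the concatenation [h_{t-1}, x_t]."""
    hx = h_prev + x  # list concatenation
    f = [sigmoid(z) for z in affine(params["W_f"], hx, params["b_f"])]    # forget gate
    i = [sigmoid(z) for z in affine(params["W_i"], hx, params["b_i"])]    # input gate
    g = [math.tanh(z) for z in affine(params["W_g"], hx, params["b_g"])]  # candidate values
    o = [sigmoid(z) for z in affine(params["W_o"], hx, params["b_o"])]    # output gate
    # Keep part of the old state, add part of the candidate values
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    # Expose a gated view of the updated state as the new hidden vector
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

# Hypothetical toy parameters: state size 2, input size 1
units, inputs = 2, 1
params = {}
for gate in "figo":
    params["W_" + gate] = [[0.1 * (r + k + 1) for k in range(units + inputs)]
                           for r in range(units)]
    params["b_" + gate] = [0.0] * units

h, c = [0.0, 0.0], [0.0, 0.0]
for x_t in ([1.0], [0.5], [-1.0]):
    h, c = lstm_step(x_t, h, c, params)
print(h, c)
```

When the forget gate saturates near 1 and the input gate near 0, the state $c$ passes through a step almost unchanged, which is what lets information (and gradients) survive across many tokens.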

The components of the state $c$ can be thought of as flags that can be toggled on or off. For instance, when we encounter the name *Alice* in a sequence, we might infer that it refers to a woman and activate a flag in the state indicating a female noun in the sentence. Later, when we come across the words *and Tom*, we might activate a flag indicating a plural noun. By manipulating the state, we can track grammatical properties of the sentence.

> **Note**: Here's an excellent resource for understanding the inner workings of LSTMs: [Understanding LSTM Networks](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) by Christopher Olah.

Although the internal structure of an LSTM cell may seem complex, Keras abstracts this implementation within the `LSTM` layer. In the example above, all we need to do is replace the recurrent layer:


In [8]:
model = keras.models.Sequential([
    vectorizer,
    keras.layers.Embedding(vocab_size, embed_size),
    keras.layers.LSTM(8),
    keras.layers.Dense(4,activation='softmax')
])

model.compile(loss='sparse_categorical_crossentropy',metrics=['acc'], optimizer='adam')
model.fit(ds_train.map(tupelize).batch(8),validation_data=ds_test.map(tupelize).batch(8))



<tensorflow.python.keras.callbacks.History at 0x7f3d6af5c350>

> **Note** that training LSTMs is also quite slow, and you may not see much increase in accuracy in the beginning of training. You may need to continue training for some time to achieve good accuracy.


## Bidirectional and multilayer RNNs

In the examples we've covered so far, recurrent networks process sequences from start to finish. This approach feels intuitive because it mirrors the way we read or listen to speech. However, in many practical cases the whole input sequence is available up front, so it makes sense to run the recurrent computation in both directions. RNNs that compute in both directions are called **bidirectional** RNNs, and they can be created by wrapping a recurrent layer with a special `Bidirectional` layer.

> **Note**: The `Bidirectional` layer duplicates the layer it wraps and sets the `go_backwards` property of one copy to `True`, enabling it to process the sequence in reverse.
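
The idea of reading the sequence in both directions can be sketched with a toy scalar recurrence. This is a simplification: both directions here reuse the same made-up coefficient, whereas the two copies inside a real `Bidirectional` layer learn their own weights. The helper names are hypothetical.

```python
import math

def run_rnn(xs, go_backwards=False):
    """Toy scalar RNN; returns the final state after reading xs in the chosen direction."""
    state = 0.0
    for x in (reversed(xs) if go_backwards else xs):
        state = math.tanh(0.5 * state + x)
    return state

def bidirectional(xs):
    # One pass per direction; the two final states are concatenated
    return [run_rnn(xs), run_rnn(xs, go_backwards=True)]

xs = [0.1, -0.4, 0.8]
forward_state, backward_state = bidirectional(xs)
print(forward_state, backward_state)
```

The forward state is dominated by the last elements of the sequence and the backward state by the first ones, so the concatenation gives the classifier a view of both ends.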

Recurrent networks, whether unidirectional or bidirectional, identify patterns within a sequence and store them in state vectors or return them as output. Similar to convolutional networks, we can stack another recurrent layer on top of the first one to capture higher-level patterns derived from the lower-level patterns identified by the initial layer. This concept is known as a **multi-layer RNN**, which consists of two or more recurrent networks, where the output of one layer serves as the input for the next.

![Image showing a Multilayer long-short-term-memory- RNN](../../../../../translated_images/multi-layer-lstm.dd975e29bb2a59fe58b429db833932d734c81f211cad2783797a9608984acb8c.en.jpg)

*Image sourced from [this excellent post](https://towardsdatascience.com/from-a-lstm-cell-to-a-multilayer-lstm-network-with-pytorch-2899eb5696f3) by Fernando López.*

Keras simplifies the process of building these networks. You just need to add additional recurrent layers to your model. For all layers except the final one, you must set the `return_sequences=True` parameter to ensure the layer outputs all intermediate states, rather than just the final state of the recurrent computation.
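
The role of `return_sequences` can be shown with a toy scalar layer written in plain Python. This is an illustration of the data flow only: `rnn_layer` and its single `weight` parameter are hypothetical, and the values are made up.

```python
import math

def rnn_layer(xs, weight, return_sequences=False):
    """Toy scalar recurrent layer with a single made-up recurrent weight."""
    state, states = 0.0, []
    for x in xs:
        state = math.tanh(weight * state + x)
        states.append(state)
    # Either every intermediate state, or only the last one
    return states if return_sequences else state

xs = [0.2, -0.5, 0.9, 0.1]
hidden = rnn_layer(xs, weight=0.5, return_sequences=True)  # one state per token
output = rnn_layer(hidden, weight=-0.3)                    # final state only
print(len(hidden), output)
```

The second layer needs one input per time step, which is why the first layer has to return the whole sequence of states rather than only the final one.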

Let's create a two-layer bidirectional LSTM for our classification task.

> **Note**: This code takes a significant amount of time to execute, but it achieves the highest accuracy we've seen so far. It might be worth the wait to see the results.


In [9]:
model = keras.models.Sequential([
    vectorizer,
    keras.layers.Embedding(vocab_size, 128, mask_zero=True),
    keras.layers.Bidirectional(keras.layers.LSTM(64,return_sequences=True)),
    keras.layers.Bidirectional(keras.layers.LSTM(64)),    
    keras.layers.Dense(4,activation='softmax')
])

model.compile(loss='sparse_categorical_crossentropy',metrics=['acc'], optimizer='adam')
model.fit(ds_train.map(tupelize).batch(batch_size),
          validation_data=ds_test.map(tupelize).batch(batch_size))



## RNNs for other tasks

So far, we've concentrated on using RNNs to classify text sequences. However, they are capable of handling many other tasks, including text generation and machine translation. We'll explore these tasks in the next unit.



---

**Disclaimer**:  
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
