![Practicum AI Logo image](https://github.com/PracticumAI/practicumai.github.io/blob/main/images/logo/PracticumAI_logo_250x50.png?raw=true)
***
# *Practicum AI:* RNN - Advanced RNN

This exercise was adapted from Baig et al. (2020) <i>The Deep Learning Workshop</i> from <a href="https://www.packtpub.com/product/the-deep-learning-workshop/9781839219856">Packt Publishers</a> (Exercises 6.01 - 6.05, page 269)

In this exercise, we will explore several types of Recurrent Neural Network (RNN) models for sentiment classification. These models include a plain RNN, variations of the plain RNN, a Long Short-Term Memory (LSTM) model, a Gated Recurrent Unit (GRU) model, a bidirectional RNN, and a stacked RNN.

We first define the architecture for each of these models and then evaluate their performance on the test data. This allows us to compare the performance of each model and determine which one is the most effective for this particular task.

To finish this exercise, follow these steps:

#### Data preparation
##### Loading the Data - page 269

To import the dataset, we use the `imdb.load_data` function from the `tensorflow.keras.datasets` module. We pass a vocabulary size through the `num_words` argument so that only the most frequent terms are kept. The dataset is already tokenized and split into training and test sets for us.

In [None]:
from tensorflow.keras.datasets import imdb

In [2]:
vocab_size = 8000

In [None]:
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words = vocab_size)

Let's take a quick look at the *X_train* variable, to better understand the dataset and what needs to be done to prepare it for the model.

In [4]:
print(type(X_train))
print(type(X_train[5]))
print(X_train[5])

<class 'numpy.ndarray'>
<class 'list'>
[1, 778, 128, 74, 12, 630, 163, 15, 4, 1766, 7982, 1051, 2, 32, 85, 156, 45, 40, 148, 139, 121, 664, 665, 10, 10, 1361, 173, 4, 749, 2, 16, 3804, 8, 4, 226, 65, 12, 43, 127, 24, 2, 10, 10]


The *X_train* variable is a numpy array, where each element of the array is a list representing the text for a single review. Instead of being in raw text form, the terms in the text are represented as numerical tokens.

##### Staging and pre-processing our data - page 271
Because review length varies significantly, we must make all sequences the same length before feeding them to the model. To do this, we use the `pad_sequences` function from the `keras.preprocessing.sequence` module. This function takes a maximum sequence length, pads shorter sequences with zeros, and truncates longer ones. For this exercise, we set the maximum sequence length to 200.
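
To make the padding rule concrete, here is a minimal pure-Python sketch of what happens to a single sequence. It assumes the `pad_sequences` defaults (`padding='pre'` and `truncating='pre'`), and `pad_pre` is a hypothetical helper written for illustration, not part of Keras.

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic pre-padding and pre-truncation for one sequence (maxlen > 0)."""
    kept = list(seq)[-maxlen:]                    # keep only the last maxlen tokens
    return [value] * (maxlen - len(kept)) + kept  # left-pad with the fill value

short_review = [1, 778, 128]          # shorter than maxlen: gets zeros in front
long_review = [1, 2, 3, 4, 5, 6, 7]   # longer than maxlen: loses its first tokens

padded_short = pad_pre(short_review, maxlen=5)
padded_long = pad_pre(long_review, maxlen=5)
print(padded_short)  # [0, 0, 1, 778, 128]
print(padded_long)   # [3, 4, 5, 6, 7]
```

With these defaults, the end of each review is preserved and the zeros appear at the start of the sequence.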

In [5]:
from tensorflow.keras import preprocessing

maxlen = 200

X_train = preprocessing.sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = preprocessing.sequence.pad_sequences(X_test, maxlen=maxlen)

In [8]:
print(X_train[5])

[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    1  778  128   74   12  630  163   15    4 1766 7982
 1051    2   32   85  156   45   40  148  139  121  664  665   10   10
 1361  173    4  749    2   16 3804    8    4  226   65   12   43  127
   24 

***
#### Exercise 6.01: (Student) - page 276

#### Plain RNN Model for Sentiment Classification

Classifying sentiment with a plain RNN model involves three steps:

*  **First**: Define a sequential RNN model for sentiment classification.
*  **Second**: Add embedding, RNN, dropout, and dense layers to the base model created in step 1. 
*  **Third**: Check the accuracy of the predictions on the test data to assess how well the model generalizes.

[Basic structure of Recurrent Neural Network](https://commons.wikimedia.org/wiki/File:Recurrent_neural_network_unfold.svg)
![image](images/Recurrent_neural_network_unfold.svg.png)

#### 1. Import requisite libraries and set seed

In [9]:
import numpy as np
import tensorflow as tf

np.random.seed(42)
tf.random.set_seed(42)

#### 2. Import Keras libraries and initialize the model

In [10]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Flatten, Dense, Embedding, SpatialDropout1D, Dropout

In [11]:
model_rnn = Sequential()

#### 3. Specify the embedding layer

To define the input and output dimensions for our embedding layer, we set the `input_dim` parameter to the value of the *vocab_size* variable and the `output_dim` parameter to the desired number of dimensions.

For example, setting `input_dim` to `vocab_size` and `output_dim` to 32 creates an embedding layer with `vocab_size` input dimensions and 32 output dimensions. The input dimension is the size of the vocabulary, while the output dimension is the length of the dense vector that each token is mapped to.

```python
model_rnn.add(Embedding(vocab_size, output_dim=32))
model_rnn.add(SpatialDropout1D(0.4))
```
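
Conceptually, an embedding layer is a lookup table with one trainable row per token id. The sketch below illustrates that idea in plain Python with deliberately tiny, hypothetical sizes (`toy_vocab_size`, `toy_output_dim`) and random values standing in for learned weights. It is not how Keras implements the layer internally.

```python
import random

random.seed(42)

toy_vocab_size = 10   # hypothetical; the exercise uses vocab_size = 8000
toy_output_dim = 4    # hypothetical; the exercise uses output_dim = 32

# One row of length toy_output_dim per token id (random stand-ins for learned weights).
embedding_matrix = [
    [random.uniform(-0.05, 0.05) for _ in range(toy_output_dim)]
    for _ in range(toy_vocab_size)
]

def embed(token_ids):
    """Replace each integer token id with its row of the embedding matrix."""
    return [embedding_matrix[t] for t in token_ids]

review = [1, 7, 7, 3]
vectors = embed(review)
print(len(vectors), len(vectors[0]))  # 4 4
print(vectors[1] == vectors[2])       # True: the same token id maps to the same vector
```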

In [3]:
# Code it!

#### 4. Add a simple RNN layer with 32 neurons

```python
model_rnn.add(SimpleRNN(32))
```

In [1]:
# Code it!

#### 5. Add a dropout layer with 40% dropout

```python
model_rnn.add(Dropout(0.4))
```

In [13]:
# Code it!

#### 6. Add a dense layer

```python
model_rnn.add(Dense(1, activation = 'sigmoid'))
```

In [14]:
# Code it!

#### 7. Compile the model and view its summary

```python
model_rnn.compile(loss  = 'binary_crossentropy',
              optimizer = 'rmsprop',
              metrics   = ['accuracy'])

model_rnn.summary()
```

In [15]:
# Code it!

As the summary shows, the majority of parameters are in the embedding layer, with 256,000 out of 258,113 total parameters. This is because we learn the embedding matrix during training, which has `vocab_size (8000) * output_dim (32)` entries.
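
The parameter counts can be checked by hand. The arithmetic below is a sketch that assumes the layer sizes used in this exercise (an 8000-term vocabulary, 32-dimensional embeddings, 32 RNN units, one sigmoid output) and the usual weight layout of input kernel, recurrent kernel, and bias. The dropout layers contribute no trainable parameters.

```python
vocab_size = 8000
embedding_dim = 32
rnn_units = 32

embedding_params = vocab_size * embedding_dim                      # one vector per term
rnn_params = rnn_units * (embedding_dim + rnn_units) + rnn_units   # kernels + bias
dense_params = rnn_units * 1 + 1                                   # weights + bias
total_params = embedding_params + rnn_params + dense_params

print(embedding_params, rnn_params, dense_params, total_params)
# 256000 2080 33 258113
print(round(100 * embedding_params / total_params, 1))  # 99.2 (percent in the embedding)
```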

#### 8. Fit (train) the model

To fit our model to the training data, we use the `fit` method and specify the following hyperparameters:

* *batch_size*: The number of samples per gradient update.
* *epochs*: The number of times the model will cycle through the entire dataset.

We can also specify a validation split of 0.2 which reserves 20% of the training data for the validation step of the training process.  For example, to fit the model on the training data with a batch size of 128 for 10 epochs and a validation split of 0.2, we use the following code:

```python
history_rnn = model_rnn.fit(X_train, y_train, batch_size = 128, validation_split = 0.2, 
                            epochs = 10)
```
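
To see what these settings imply, the sketch below works out how many samples are used for fitting and how many gradient updates happen. It assumes the Keras IMDB training set holds 25,000 reviews and that Keras takes the validation samples from the end of the arrays before shuffling. The variable names are illustrative only.

```python
import math

n_train_samples = 25000   # assumed size of the Keras IMDB training set
validation_split = 0.2
batch_size = 128
epochs = 10

n_val = round(n_train_samples * validation_split)   # held out for validation
n_fit = n_train_samples - n_val                     # used for gradient updates
steps_per_epoch = math.ceil(n_fit / batch_size)     # last batch may be smaller

print(n_fit, n_val, steps_per_epoch, steps_per_epoch * epochs)
# 20000 5000 157 1570
```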

In [6]:
# Code it!

From the training output, we see that the validation accuracy reaches about 85.16%. 

#### 9. Make predictions on the test data using predict_classes()

```python
y_test_pred = model_rnn.predict_classes(X_test)
```
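
A note for newer environments: to our knowledge, `predict_classes` was removed from `Sequential` models in TensorFlow 2.6. There, a commonly used replacement for binary classifiers is `(model_rnn.predict(X_test) > 0.5).astype("int32")`. The pure-Python sketch below shows the same thresholding idea on made-up probabilities. `to_class_labels` and `toy_probabilities` are hypothetical names, not part of Keras.

```python
def to_class_labels(probabilities, threshold=0.5):
    """Turn sigmoid outputs into 0/1 labels, keeping the (n_samples, 1) nesting."""
    return [[int(p > threshold) for p in row] for row in probabilities]

# Made-up sigmoid outputs for four reviews, shaped like model.predict() output.
toy_probabilities = [[0.91], [0.12], [0.50], [0.73]]
toy_labels = to_class_labels(toy_probabilities)
print(toy_labels)  # [[1], [0], [0], [1]]
```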

In [2]:
# Code it!

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
print(accuracy_score(y_test, y_test_pred))

And, the test accuracy is 84.15%. The model is performing fairly well.

***
##### Making Predictions on Unseen Data - page 280

Now that you have trained the model and assessed its performance on the test data, the next step is to see how well the model performs with new data.

In [39]:
inp_review = "An excellent movie!"

The sentiment in the text is positive. If the model is working well, it should predict the sentiment as positive.

To test our trained model with new text data, we execute the following steps:

* Tokenize the text into its individual terms, normalize the case, and remove any punctuation.
* Use a defined vocabulary for the data. We can load the vocabulary and the term-to-index mapping using the `get_word_index` method from the `imdb` module.
* Create a vocab map that converts the tokenized sentence into a sequence of term indices by performing a lookup for each term and returning the corresponding index.

In [29]:
from tensorflow.keras.preprocessing.text import text_to_word_sequence

In [30]:
text_to_word_sequence(inp_review)

['an', 'excellent', 'movie']

In [31]:
word_map = imdb.get_word_index()

In [32]:
vocab_map = dict(sorted(word_map.items(), key = lambda x: x[1])[:vocab_size])

And finally, let's define a function that processes raw text and returns the corresponding sequence of integers.  We do this with the following steps:

1. Apply the `text_to_word_sequence` utility to the text to tokenize it and normalize the case.
2. Perform a lookup in the *vocab_map* dictionary to convert the tokenized text into a sequence of term indices.
3. Return the corresponding sequence of integers.

In [33]:
def preprocess(review):
    inp_tokens = text_to_word_sequence(review)
    seq = []
    for token in inp_tokens:
        seq.append(vocab_map.get(token))
    return seq

In [40]:
preprocess(inp_review)

[32, 318, 17]
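
One caveat worth knowing: by default, `imdb.load_data` reserves the ids 0 (padding), 1 (start of sequence), and 2 (out-of-vocabulary) and shifts every word index by `index_from=3`, while `get_word_index` returns the unshifted indices. Also, `dict.get` returns `None` for unknown words. The sketch below is a hedged variant of the lookup that accounts for both. `toy_word_index` and `encode` are hypothetical, and the three word indices are copied from the output above.

```python
# Hypothetical toy word index; the real one comes from imdb.get_word_index().
toy_word_index = {"the": 1, "movie": 17, "an": 32, "excellent": 318}

START_CHAR = 1   # default start-of-sequence id in imdb.load_data
OOV_CHAR = 2     # default id for out-of-vocabulary words
INDEX_FROM = 3   # default offset added to every word index

def encode(tokens, word_index, num_words):
    """Map tokens to ids using the same conventions as imdb.load_data defaults."""
    seq = [START_CHAR]
    for token in tokens:
        idx = word_index.get(token)
        if idx is None or idx + INDEX_FROM >= num_words:
            seq.append(OOV_CHAR)          # unknown or too-rare word
        else:
            seq.append(idx + INDEX_FROM)  # shift to match the training data
    return seq

known = encode(["an", "excellent", "movie"], toy_word_index, num_words=8000)
unknown = encode(["an", "unseenword"], toy_word_index, num_words=8000)
print(known)    # [1, 35, 321, 20]
print(unknown)  # [1, 35, 2]
```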

Use the `predict_classes` method to classify the sentiment. This method takes a batch of new data and returns a class label for each sample in the batch: 0 for negative sentiment and 1 for positive sentiment.

In [41]:
model_rnn.predict_classes([preprocess(inp_review)])

array([[1]])

The output prediction is 1 (positive). Let's apply the function to another raw text review and supply it to the model for prediction.

In [42]:
inp_review = "Don't watch this movie - poor acting, poor script, bad direction."

In [49]:
model_rnn.predict_classes([preprocess(inp_review)])

array([[0]])

The prediction is 0, and the sentiment in the review is negative as we expect.

***
#### Exercise 6.02: (Student) - page 288 
#### LSTM-Based Sentiment Classification Model

Long Short-Term Memory (LSTM) networks are a type of recurrent neural network that is capable of learning order dependence in sequence prediction problems. This means that they are able to remember and make use of information from previous time steps when making predictions at future time steps.

LSTM networks achieve this by using a more complex recurrent unit called the LSTM cell. This type of cell is able to store and manipulate information over longer periods of time, allowing the network to do a better job of capturing patterns and dependencies in the data.

However, the increased complexity of the LSTM cell comes at a cost, as it requires more resources to train and run than simple recurrent units like the RNN cell. That is, LSTM networks can be computationally intensive, but they are frequently more effective at learning complex data patterns.

[Basic structure of Long Short-Term Memory Network](https://commons.wikimedia.org/wiki/File:Long_Short-Term_Memory.svg)
![image](images/1600px-Long_Short-Term_Memory.svg.png)<br>

Let's build a simple LSTM-based model to predict sentiment in our data.

#### 1. Import the LSTM layer from Keras

In [None]:
from tensorflow.keras.layers import LSTM

#### 2. Instantiate a sequential model & add embedding/dropout layers

```python
model_lstm = Sequential()
model_lstm.add(Embedding(vocab_size, output_dim=32))
model_lstm.add(SpatialDropout1D(0.4))
```

In [7]:
# Code it!

#### 3. Add an LSTM layer with 32 nodes

```python
model_lstm.add(LSTM(32))
```

In [8]:
# Code it!

#### 4. Add dropout and dense layers and then compile and summarize the model

```python
model_lstm.add(Dropout(0.4))
model_lstm.add(Dense(1, activation = 'sigmoid'))

model_lstm.compile(loss = 'binary_crossentropy',
              optimizer = 'rmsprop',
              metrics   = ['accuracy'])

model_lstm.summary()
```

In [9]:
# Code it!

By examining the model summary, we see that the number of parameters in the LSTM layer is 8320. This is exactly four times the number of parameters in the plain RNN layer. LSTM models, as noted earlier, are more complex and that is reflected in the number of parameters we see here.
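
The factor of four follows from the cell structure: an LSTM cell has three gates plus a candidate cell update, and each of these four blocks has its own input kernel, recurrent kernel, and bias. The sketch below reproduces the counts. `recurrent_layer_params` is a hypothetical helper, and both layers are assumed to have 32 units and a 32-dimensional input.

```python
def recurrent_layer_params(units, input_dim, n_blocks):
    """Count parameters for n_blocks sets of input kernel, recurrent kernel, bias."""
    per_block = units * input_dim + units * units + units
    return n_blocks * per_block

rnn_params = recurrent_layer_params(32, 32, n_blocks=1)    # plain RNN: one block
lstm_params = recurrent_layer_params(32, 32, n_blocks=4)   # LSTM: 3 gates + candidate

print(rnn_params, lstm_params, lstm_params // rnn_params)  # 2080 8320 4
```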

#### 5. Fit (train) the model

Now, let's fit the model on the training data for 5 epochs with a batch size of 128. 

```python
history_lstm = model_lstm.fit(X_train, y_train, batch_size = 128, validation_split = 0.2, 
                              epochs = 5)
```

In [10]:
# Code it!

It looks like the increased complexity of the LSTM cell has improved performance. Here we see that the validation accuracy of the LSTM model is higher than that of the plain RNN model, indicating that the model is better able to generalize when faced with new data. 

#### 6. Make predictions on the test data and print accuracy score

In [None]:
y_test_pred = model_lstm.predict_classes(X_test)

In [None]:
print(accuracy_score(y_test, y_test_pred))

The accuracy we got for the test data is 87.05%, an improvement from the accuracy we got using plain RNNs at 84.15%. It looks like the extra parameters and the extra predictive power came in handy for this task.

***
#### Exercise 6.03: (Student) - page 294

#### GRU-Based Sentiment Classification Model

Gated Recurrent Unit (GRU) is a type of recurrent neural network that can be used in place of Long Short-Term Memory (LSTM) networks in specific cases. GRUs are similar to LSTMs in that they are able to capture long-term dependencies in sequential data, but they use a simpler and more efficient type of recurrent unit.

One advantage GRUs have over LSTMs is that they require less memory and are faster to train and run. This is because the GRU cell has fewer parameters than the LSTM cell.  And as a result, it requires fewer resources.

LSTMs and GRUs do equally well on most tasks.  In some cases, though, LSTMs may still be the better choice due to their ability to capture more complex patterns in the data.  So, you may want to experiment with both types of recurrent network to determine which one performs better on a particular task.

[Basic structure of Gated Recurrent Unit](https://commons.wikimedia.org/wiki/File:Gated_Recurrent_Unit.svg)
![image](images/1600px-Gated_Recurrent_Unit.svg.png)<br>

Let's build a simple GRU-based model to predict the sentiment of a review.

#### 1. Import the GRU layer from Keras

In [None]:
from tensorflow.keras.layers import GRU

#### 2. Instantiate the model and add embedding and dropout layers

```python
model_gru = Sequential()
model_gru.add(Embedding(vocab_size, output_dim = 32))
model_gru.add(SpatialDropout1D(0.4))
```

In [12]:
# Code it!

#### 3. Add a GRU layer with 32 nodes

```python
model_gru.add(GRU(32, reset_after = False))
```

In [13]:
# Code it!

#### 4. Add dropout and dense layers - compile and summarize the model

```python
model_gru.add(Dropout(0.4))
model_gru.add(Dense(1, activation='sigmoid'))

model_gru.compile(loss  = 'binary_crossentropy',
              optimizer = 'rmsprop',
              metrics   = ['accuracy'])

model_gru.summary()
```

In [14]:
# Code it!

By examining the model summary, we see that the number of parameters in the GRU layer is 6,240. This is exactly three times the number of parameters in the plain RNN layer (2,080).
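
The count depends on the `reset_after` setting. A GRU cell has three weight blocks (update gate, reset gate, candidate state). With `reset_after = True`, which we understand to be the default in TensorFlow 2, Keras keeps a second set of biases for the recurrent kernel. The sketch below compares the two cases. `gru_params` is a hypothetical helper that assumes this weight layout.

```python
def gru_params(units, input_dim, reset_after):
    """Count GRU parameters: three blocks of kernels plus one or two bias sets."""
    kernels = 3 * (units * input_dim + units * units)
    bias_sets = 2 if reset_after else 1   # reset_after=True adds recurrent biases
    return kernels + 3 * units * bias_sets

params_reset_false = gru_params(32, 32, reset_after=False)
params_reset_true = gru_params(32, 32, reset_after=True)
print(params_reset_false)  # 6240
print(params_reset_true)   # 6336
```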

#### 5. Fit (train) the model

Now, let's fit the model on the training data for 4 epochs with a batch size of 128.

```python
history_gru = model_gru.fit(X_train, y_train, batch_size = 128, validation_split = 0.2, 
                            epochs = 4)
```

In [15]:
# Code it!

Training our GRU model took much longer than training the plain RNN, but it was faster than training the LSTM model. The validation accuracy is also better than that of the plain RNN and close to that of the LSTM.

#### 6. Make predictions on the test data and print accuracy score

In [None]:
y_test_pred = model_gru.predict_classes(X_test)

In [None]:
accuracy_score(y_test, y_test_pred)

0.87156

After training and evaluating the GRU model on the test data, we see that its accuracy is similar to that of the LSTM model (87.16% vs 87.05%). This suggests that the GRU model is able to capture the important patterns and dependencies in the data and make accurate predictions, despite having fewer parameters than the LSTM model.

This is an important point.  GRUs are often used as a simplified alternative to LSTMs, due to their ability to provide similar accuracy with fewer parameters. This makes them more efficient to train and run.  This is a useful property, especially when computational resources are limited.

***
#### Exercise 6.04: (Teacher) - page 299

#### Bi-directional LSTM-Based Sentiment Classification Model

Recurrent neural networks (RNNs), such as Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRUs), are powerful tools for processing sequential data and can achieve excellent results on a wide range of tasks. However, there are ways to make these models even more powerful by modifying their architecture.

One such modification is the use of bidirectional RNNs. A bidirectional RNN is a type of neural network that processes the sequence information in both directions: forward (from past to future) and backward (from future to past). This can be particularly useful for tasks such as machine translation, part-of-speech tagging, named entity recognition, and word prediction, where understanding the context of a word or phrase is important.

[Basic structure of bidirectional LSTM.](https://www.mdpi.com/2076-3417/11/17/8129/htm)
<div>
<img src="images/bidirectional_LSTM.png" width="400"/>
</div>
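
The idea can be illustrated with a toy scalar recurrence: one pass reads the sequence from start to end, a second pass with its own weights reads it in reverse, and the two final states are concatenated. Concatenation is the default merge behavior of the Keras `Bidirectional` wrapper. The weights and function names below are hypothetical and chosen only for illustration.

```python
import math

def last_state(inputs, w_x, w_h):
    """Scalar toy recurrence h_t = tanh(w_x * x_t + w_h * h_{t-1}); returns final h."""
    h = 0.0
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
    return h

def toy_bidirectional(inputs):
    """Run one recurrence per direction and concatenate the two final states."""
    forward = last_state(inputs, w_x=0.5, w_h=0.3)                    # past -> future
    backward = last_state(list(reversed(inputs)), w_x=0.4, w_h=0.2)   # future -> past
    return [forward, backward]

features = toy_bidirectional([0.1, -0.2, 0.7, 0.3])
print(len(features))  # 2: one summary value per direction
```

Because the two directions are concatenated, the layer that follows a `Bidirectional(LSTM(32))` receives 64 features rather than 32.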

Now, let's apply a bidirectional LSTM-based model to our sentiment classification task.

#### 1. Import the Bidirectional layer from Keras

In [None]:
from tensorflow.keras.layers import Bidirectional

#### 2. Instantiate the model and add embedding and dropout layers

```python
model_bilstm = Sequential()
model_bilstm.add(Embedding(vocab_size, output_dim = 32))
model_bilstm.add(SpatialDropout1D(0.4))
```

In [17]:
# Code it!

#### 3. Add a Bidirectional wrapper to an LSTM layer

```python
model_bilstm.add(Bidirectional(LSTM(32)))
```

In [18]:
# Code it!

#### 4. Add dropout and dense layers - compile and summarize the model

```python
model_bilstm.add(Dropout(0.4))
model_bilstm.add(Dense(1, activation='sigmoid'))

model_bilstm.compile(loss = 'binary_crossentropy',
              optimizer   = 'rmsprop',
              metrics     = ['accuracy'])

model_bilstm.summary()
```

In [19]:
# Code it!

The bidirectional LSTM layer has twice the number of parameters as the LSTM layer, with a total of 16,640 parameters. This is eight times the number of parameters in the plain RNN layer, which has a total of 2,080 parameters. It is not surprising that the bidirectional LSTM has more parameters given its increased complexity compared to the LSTM and plain RNN models.

#### 5. Fit (train) the model

Now, let's fit the model on the training data for 4 epochs with a batch size of 128.

```python
history_bilstm = model_bilstm.fit(X_train, y_train, batch_size = 128, validation_split = 0.2, 
                                  epochs = 4)
```

In [20]:
# Code it!

Notice that training a bidirectional LSTM model takes significantly longer than a regular LSTM model. Despite this, the validation accuracy of the bidirectional LSTM model appears to be similar to that of the LSTM model.

#### 6. Make predictions on the test data and print accuracy score

In [None]:
y_test_pred = model_bilstm.predict_classes(X_test)

In [None]:
accuracy_score(y_test, y_test_pred)

0.877

The test accuracy of the bidirectional LSTM model is 87.70%, which is a slight improvement over the accuracy of the LSTM model at 87.05%. It is possible to further tune the hyperparameters of the bidirectional LSTM model to potentially improve its performance even more. This demonstrates the potential of this powerful architecture to achieve high levels of accuracy on a variety of tasks.

***
#### Exercise 6.05: (Student) - page 302

#### Stacked LSTM-based Sentiment Classification Model

An alternative approach to increasing the performance of RNNs is to use stacked RNNs. Stacking RNNs involves feeding the output of one RNN layer into another RNN layer, effectively creating a multi-layer RNN model. This can potentially improve the model's ability to learn and make more accurate predictions.

[Basic structure of Stacked LSTM](https://medium.com/@amardeepchauhan/paradigms-of-various-lstm-networks-e95ef1d6caaa)

<div>
<img src="images/stacked_LSTM.png" width="450"/>
</div>

Let's build a stacked LSTM-based model by stacking two LSTM layers to predict sentiment in our data.

#### 1. Instantiate the model

```python
model_stack = Sequential()
model_stack.add(Embedding(vocab_size, output_dim = 32))
model_stack.add(SpatialDropout1D(0.4))
```

In [22]:
# Code it!

#### 2. Add an LSTM layer with 32 nodes

Specify *return_sequences* as *True* in the LSTM layer. This will return the output of the LSTM at each time step, which can then be passed to the next LSTM layer.

```python
model_stack.add(LSTM(32, return_sequences = True))
```
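
The effect of *return_sequences* can be seen in a toy scalar recurrence: with it enabled, the layer hands back one state per time step, and otherwise only the final state. The function and its fixed weights of 0.5 are hypothetical and only mirror the behavior of the Keras option.

```python
import math

def toy_rnn(inputs, return_sequences=False):
    """Scalar toy recurrence h_t = tanh(0.5 * x_t + 0.5 * h_{t-1})."""
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(0.5 * x + 0.5 * h)
        states.append(h)
    # All per-step states for a following recurrent layer, or only the last one.
    return states if return_sequences else states[-1]

sequence = [0.2, -0.4, 0.9]
all_states = toy_rnn(sequence, return_sequences=True)
final_state = toy_rnn(sequence)
print(len(all_states))                # 3: one state per time step
print(all_states[-1] == final_state)  # True
```

A second recurrent layer needs the full list of states as its input sequence, which is why the first LSTM layer in a stack sets this option.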

In [23]:
# LSTM Layer 1 - return_sequences is True
# Code it!

#### 3. Add a second LSTM layer with 32 nodes

This time we don't need to return the sequence. You can either specify the *return_sequences* option as *False* or skip it altogether.

```python
model_stack.add(LSTM(32, return_sequences = False))
```

In [24]:
# LSTM Layer 2 - return_sequences is False
# Code it!

#### 4. Add dropout and dense layers - compile and summarize the model

```python
model_stack.add(Dropout(0.5))
model_stack.add(Dense(1, activation = 'sigmoid'))

model_stack.compile(loss = 'binary_crossentropy',
              optimizer  = 'rmsprop',
              metrics    = ['accuracy'])

model_stack.summary()
```

In [25]:
# Code it!

Note that the two stacked LSTM layers together have the same number of parameters (16,640) as the bidirectional LSTM layer.
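
The match is easy to verify with a little arithmetic, assuming 32 units everywhere: the second stacked layer receives the 32 states of the first layer as input, so both layers have the same size as one direction of the bidirectional layer. The only difference is in the final dense layer, which sees 64 concatenated features in the bidirectional model. `lstm_params` is a hypothetical helper.

```python
def lstm_params(units, input_dim):
    """Four blocks (3 gates + candidate), each with input kernel, recurrent kernel, bias."""
    return 4 * (units * input_dim + units * units + units)

single = lstm_params(32, 32)
bidirectional = 2 * single                            # forward copy + backward copy
stacked = lstm_params(32, 32) + lstm_params(32, 32)   # layer 2 input = 32 states of layer 1

bidirectional_dense = 2 * 32 + 1   # dense layer after 64 concatenated features
stacked_dense = 32 + 1             # dense layer after 32 features

print(single, bidirectional, stacked)       # 8320 16640 16640
print(bidirectional_dense, stacked_dense)   # 65 33
```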

#### 5. Fit (train) the model

Now, let's fit the model on the training data for 4 epochs with a batch size of 128.

```python
history_stack = model_stack.fit(X_train, y_train, batch_size=128, validation_split=0.2, 
                                epochs = 4)
```

In [26]:
# Code it!

Training stacked LSTM models takes less time than training bidirectional LSTM models. Despite this, the validation accuracy of the stacked LSTM model appears to be similar to that of the bidirectional LSTM model. This suggests that while stacked LSTM models may be more efficient to train, they do not necessarily sacrifice accuracy or performance compared to more complex models such as bidirectional LSTMs.

#### 6. Make predictions on the test data and print accuracy score

In [None]:
y_test_pred = model_stack.predict_classes(X_test)

In [None]:
accuracy_score(y_test, y_test_pred)

0.87572

The accuracy of 87.57% is a slight improvement over the LSTM model (87.05%) and is practically the same as that of the bidirectional model (87.70%).

Now let's take a broader look at the situation and compare the models. Consider the table below, which compares five models in terms of parameters, training time, and test accuracy on our dataset.

| Model | RNN layer parameters | Training time | Test accuracy |
| --- | --- | --- | --- |
| Plain RNN | 2,080 | Low | 84.15% |
| LSTM | 8,320 | High | 87.05% |
| GRU | 6,240 | Medium-High | 87.16% |
| Bi-directional LSTM | 16,640 | Very High | 87.70% |
| Stacked LSTM | 16,640 | Very High | 87.57% |

According to the table, plain RNNs have the lowest number of parameters and shortest training times, but also have the lowest accuracy of all the models. LSTMs and GRUs perform better than plain RNNs, but their increased accuracy comes at the cost of longer training times and a larger number of parameters, increasing the risk of overfitting.

The stacked and bidirectional approaches seem to offer incremental improvements in terms of predictive power, but this comes at the price of significantly longer training times and a larger number of parameters. Despite this, the stacked and bidirectional approaches yielded the highest accuracy, even on a small dataset.

#### Summary 

In this exercise, we explored various versions of RNNs, including plain RNNs, LSTMs, and GRUs. We saw that plain RNNs are not practical for modeling long-range dependencies due to the vanishing gradient problem.  LSTMs, on the other hand, handle long sequences better but have many more parameters.  GRUs are a simpler alternative that work well on small datasets.  We then considered bidirectional and stacked RNNs.  These approaches have greatly improved the ability of RNNs to achieve state-of-the-art results on various tasks.