# **Practice 2.2. Recurrent Neural Networks**

- Alejandro Dopico Castro ([alejandro.dopico2@udc.es](mailto:alejandro.dopico2@udc.es)).
- Ana Xiangning Pereira Ezquerro ([ana.ezquerro@udc.es](mailto:ana.ezquerro@udc.es)).

This notebook contains execution examples of the recurrent neural architectures proposed for the [Amazon Reviews dataset](https://www.kaggle.com/datasets/bittlingmayer/amazonreviews). The Python scripts submitted include auxiliar code to simplify the readibility of the coding cells:

- [data.py](data.py): Defines the `AmazonDataset` class to load, split, transform and stream the Amazon Reviews dataset. 
- [recurrent_models.py](recurrent_models.py): Defines the `create_recurrent_model` function to instantiate a Keras model varying its architecture. 
- [utils.py](utils.py): Defines auxiliary function to train and plot the performance of a Keras model.

In [1]:
from data import AmazonDataset
from model import AmazonReviewsModel
import plotly.io as pio
import plotly.graph_objects as go
from collections import OrderedDict
from keras.layers import LSTM, GRU, SimpleRNN
from keras.regularizers import Regularizer, L1, L2, L1L2
from keras.optimizers import Adam, RMSprop
import pandas as pd
from itertools import product

pio.renderers.default = "vscode"

# global parameters
MAX_FEATURES = 1000
MODEL_PATH = "results/"

# model default parameters
train_default = dict(epochs=30, batch_size=500, lr=1e-3, dev_patience=5)

# load data
path_dir = "AmazonDataset/"

2024-04-24 19:10:42.133173: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-04-24 19:10:42.165187: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-24 19:10:42.165214: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-24 19:10:42.166077: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-24 19:10:42.171268: I tensorflow/core/platform/cpu_feature_guar

Using TensorFlow backend


## Table of Contents

1. [Simple Recurrent Baseline](#simple-recurrent-baseline)

    - [Exploring the vocabulary size](#exploring-the-vocabulary-size)
    - [Exploring the recurrent cell](#exploring-the-recurrent-cell)
    - [Exploring the dimension of  the model](#exploring-the-dimension-of-the-model)
    
2. [Enhancing the architecture with hyperparameter tuning](#enhancing-the-architecture-with-parameter-tuning)
3. [Bidirectional processing](#bidirectional-processing)
4. [Transformer](#transformer)
5. [Optimal configuration of the recurrent architecture](#optimal-configuration-of-the-recurrent-architecture)
6. [Final comparison](#final-comparison)

## Simple Recurrent Baseline 

### Exploring the vocabulary size

We used a simple recurrent architecture to set our baseline performance. This model is conformed by two stacked modules: a recurrent encoder of 2-stacked [RNN cells](https://keras.io/api/layers/recurrent_layers/simple_rnn/) ([Rumelhart et al., 1985](https://stanford.edu/~jlmcc/papers/PDP/Volume%201/Chap8_PDP86.pdf)) and a [feed-forward layer](https://keras.io/api/layers/core_layers/dense/) with a sigmoidal activation to return the probability of a good review. We used an input embedding layer of dimension $d_x=64$ and maintained the dimension of the decoder to $d_h=64$. In order to analyze the impact of the vocabulary size ($|\mathcal{V}|$) we repeated three experiments varying this value (200, 500 and 1000) maintaining the same architecture.

In [8]:
dataset = AmazonDataset.load(
    train_path=path_dir + "train_small.txt",
    test_path=path_dir + "test_small.txt",
    max_features=200,
)
rnn_model_200 = AmazonReviewsModel(200, 64, SimpleRNN, name="SimpleRNN-200")
_, fig = rnn_model_200.train(
    dataset, f"{MODEL_PATH}/{rnn_model_200.name}.weights.h5", **train_default
)
print(rnn_model_200.evaluate(dataset.X_test, dataset.y_test))
fig

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
[0.48217952251434326, 0.7735999822616577]


In [10]:
dataset = AmazonDataset.load(
    train_path=path_dir + "train_small.txt",
    test_path=path_dir + "test_small.txt",
    max_features=500,
)
rnn_model_500 = AmazonReviewsModel(500, 64, SimpleRNN, name="SimpleRNN-500")
_, fig = rnn_model_500.train(
    dataset, f"{MODEL_PATH}/{rnn_model_500.name}.weights.h5", **train_default
)
print(rnn_model_500.evaluate(dataset.X_test, dataset.y_test))
fig

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
[0.43902644515037537, 0.8077200055122375]


In [5]:
dataset = AmazonDataset.load(
    train_path=path_dir + "train_small.txt",
    test_path=path_dir + "test_small.txt",
    max_features=1000,
)
rnn_model_1000 = AmazonReviewsModel(1000, 64, SimpleRNN, name="SimpleRNN-1000")
_, fig = rnn_model_1000.train(
    dataset, f"{MODEL_PATH}/{rnn_model_1000.name}.weights.h5", **train_default
)
print(rnn_model_1000.evaluate(dataset.X_test, dataset.y_test))
fig

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
[0.37227746844291687, 0.8421199917793274]


The [simple RNN cell](https://keras.io/api/layers/recurrent_layers/simple_rnn/) with $|\mathcal{V}|=1000$ achieves 84.21% of test accuracy and is capable of learning from 95% of the training data. It can be observed that the vocabulary size plays an important role in the performance of the model: when using a small vocabulary size (e.g. 200) the performance does not reach more than the 80% of the accuracy. By increasing its value up to $|\mathcal{V}| = 1000$ we see a five-points improvement in the evaluation set. However, this improvement comes with an aggravation in the test performance: the larger $|\mathcal{V}|$ value, the larger difference between the train and test accuracy. This phenomenon (overfitting) is likely due to the increased complexity and dimensionality of the input data, which can challenge the model's ability to generalize effectively.

In [7]:
# change name 
rnn_model = rnn_model_1000
rnn_model._name = 'SimpleRNN-base'

### Exploring the recurrent cell


In the next cells we maintain $|\mathcal{V}|=1000$ and substitute the simple RNN by two different recurrent cells: the [LSTM](https://keras.io/api/layers/recurrent_layers/lstm/) ([Hochreiter et al., 1997](https://www.bioinf.jku.at/publications/older/2604.pdf)) and the [GRU](https://keras.io/api/layers/recurrent_layers/gru/) ([Chung et al., 2014](https://arxiv.org/abs/1412.3555)). In the original papers, authors claimed to improve the performance of the simple RNN with a better inner representation of the temporal data flow by the introduction of different gates modeled with different learnable weights.

In [8]:
lstm_model = AmazonReviewsModel(1000, 64, LSTM, name="LSTM-base")
_, fig = lstm_model.train(
    dataset, f"{MODEL_PATH}/{lstm_model.name}.weights.h5", **train_default
)
print(lstm_model.evaluate(dataset.X_test, dataset.y_test))
fig

Epoch 1/30


2024-04-24 18:58:46.402658: W tensorflow/core/common_runtime/type_inference.cc:339] Type inference failed. This indicates an invalid graph that escaped type checking. Error message: INVALID_ARGUMENT: expected compatible input types, but input 1:
type_id: TFT_OPTIONAL
args {
  type_id: TFT_PRODUCT
  args {
    type_id: TFT_TENSOR
    args {
      type_id: TFT_INT32
    }
  }
}
 is neither a subtype nor a supertype of the combined inputs preceding it:
type_id: TFT_OPTIONAL
args {
  type_id: TFT_PRODUCT
  args {
    type_id: TFT_TENSOR
    args {
      type_id: TFT_FLOAT
    }
  }
}

	for Tuple type infernce function 0
	while inferring type of node 'cond_19/output/_22'


Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
[0.3445417881011963, 0.8532000184059143]


In [9]:
gru_model = AmazonReviewsModel(1000, 64, GRU, name="GRU-base")
_, fig = gru_model.train(
    dataset, f"{MODEL_PATH}/{gru_model.name}.weights.h5", **train_default
)
print(gru_model.evaluate(dataset.X_test, dataset.y_test))
fig

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
[0.33640894293785095, 0.8580800294876099]


Using the same architecture but only replacing the simple RNN layer by LSTMs or GRUs, we see that the performance reaches the 85.32% and 85.80% of accuracy, respectively, proving that the LSTM and GRU cells are better options for the baseline recurrent architecture than the simple RNN cell.

### Exploring the dimension of the model ($d_h$)

In the next cells we maitain the vocabulary size ($|\mathcal{V}|=1000$) and the type of recurrent cell ([LSTM](https://keras.io/api/layers/recurrent_layers/lstm/)) to explore the impact of the model dimension $d_h\in\{64, 128, 256, 512\}$.

In [9]:
lstm_model_128 = AmazonReviewsModel(1000, 128, LSTM, name="LSTM-128")
_, fig = lstm_model_128.train(
    dataset, f"{MODEL_PATH}/{lstm_model_128.name}.weights.h5", **train_default
)
print(lstm_model_128.evaluate(dataset.X_test, dataset.y_test))
fig

Epoch 1/30


2024-04-24 19:29:05.890379: W tensorflow/core/common_runtime/type_inference.cc:339] Type inference failed. This indicates an invalid graph that escaped type checking. Error message: INVALID_ARGUMENT: expected compatible input types, but input 1:
type_id: TFT_OPTIONAL
args {
  type_id: TFT_PRODUCT
  args {
    type_id: TFT_TENSOR
    args {
      type_id: TFT_INT32
    }
  }
}
 is neither a subtype nor a supertype of the combined inputs preceding it:
type_id: TFT_OPTIONAL
args {
  type_id: TFT_PRODUCT
  args {
    type_id: TFT_TENSOR
    args {
      type_id: TFT_FLOAT
    }
  }
}

	for Tuple type infernce function 0
	while inferring type of node 'cond_19/output/_22'


Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
[0.34408462047576904, 0.8536400198936462]


In [10]:
lstm_model_256 = AmazonReviewsModel(1000, 256, LSTM, name="LSTM-256")
_, fig = lstm_model_256.train(
    dataset, f"{MODEL_PATH}/{lstm_model_256.name}.weights.h5", **train_default
)
print(lstm_model_256.evaluate(dataset.X_test, dataset.y_test))
fig

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
[0.3391265571117401, 0.8568800091743469]


In [11]:
lstm_model_512 = AmazonReviewsModel(1000, 512, LSTM, name="LSTM-512")
_, fig = lstm_model_512.train(
    dataset, f"{MODEL_PATH}/{lstm_model_512.name}.weights.h5", **train_default
)
print(lstm_model_512.evaluate(dataset.X_test, dataset.y_test))
fig

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
[0.34632524847984314, 0.8533200025558472]


The next table shows the results with different hidden dimension ($d_h$) in the train, validation and test set:

$d_h$ | train | val   | test  |
|:---:|:-----:|:-----:|:-----:|
| 64  | 89.21 | 86.72 | 85.32 |
| 128 | 87.48 | 84.48 | 85.36 | 
| 256 | 90.07 | 85.64 | 85.68 | 
| 512 | 87.14 | 85.42 | 85.33 | 

Results show that there are not significative differences between the performance of different model dimensions, which might indicate that, in order to increase the model complexity (and hence the flexibility to learn the input data), instead of exploring the hyperparameter $d_h$, other hyperparameters should be tuned, such as the number of hidden layers.

## Enhancing the architecture with hyperparameter tuning

Once we have a first estimation of the performance with small models we are going launch experiments with larger architectures. We increased the model dimension to $d_h=128$ and the vocabulary size to $|\mathcal{V}|=2000$. The encoder is now conformed by 3-stacked recurrent cells and the decoder adds a new extra feed-forward network between the last state of the encoder and the output layer. In order to balance this enhancement and avoid a possible overfitting, we included a [dropout](https://keras.io/api/layers/regularization_layers/dropout/) of the 10% in the latent space of the network (between the encoder and decoder).

In [5]:
# relaad the dataset
dataset = AmazonDataset.load(
    train_path=path_dir + "train_small.txt",
    test_path=path_dir + "test_small.txt",
    max_features=2000,
)

In [6]:
rnn_enhanced = AmazonReviewsModel(
    2000,
    256,
    SimpleRNN,
    num_recurrent_layers=3,
    dropout=0.1,
    ffn_dims=[64],
    name="SimpleRNN-enhanced",
)
_, fig = rnn_enhanced.train(
    dataset, f"{MODEL_PATH}/{rnn_enhanced.name}.weights.h5", **train_default
)
print(rnn_enhanced.evaluate(dataset.X_test, dataset.y_test))
fig

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
[0.6853593587875366, 0.5634400248527527]


In [7]:
lstm_enhanced = AmazonReviewsModel(
    2000,
    256,
    LSTM,
    num_recurrent_layers=3,
    dropout=0.1,
    ffn_dims=[64],
    name="LSTM-enhanced",
)
_, fig = lstm_enhanced.train(
    dataset, f"{MODEL_PATH}/{lstm_enhanced.name}.weights.h5", **train_default
)
print(lstm_enhanced.evaluate(dataset.X_test, dataset.y_test))
fig

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
[0.33539456129074097, 0.8589199781417847]


In [8]:
gru_enhanced = AmazonReviewsModel(
    2000,
    256,
    GRU,
    num_recurrent_layers=3,
    dropout=0.1,
    ffn_dims=[64],
    name="GRU-enhanced",
)
_, fig = gru_enhanced.train(
    dataset, f"{MODEL_PATH}/{gru_enhanced.name}.weights.h5", **train_default
)
print(gru_enhanced.evaluate(dataset.X_test, dataset.y_test))
fig

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
[0.33767375349998474, 0.8643199801445007]


We see a slight improvement with the LSTM (85.89%) and GRU-based (86.43%) architectures when increasing the number of learnable hyperparameters (both the train and the test set metrics are improved). However, the simple RNN only reaches 56.34% of accuracy. This drop in the performance evidences the clear superiority of the LSTM and GRU when modelling high-dimensional temporal data. The simple RNN is instead more useful for simpler problems (when the dimension of the model is small, e.g. $d_h=64$ and $|\mathcal{V}|=1000$) and we see that when the input increases its complexity the RNN lacks of a good representation to learn temporal relations.

## Bidirectional Processing

In this section we tried to boost the performance of our model with the introduction of bidirectional processing. The Keras API has a [Bidirectional Layer](https://keras.io/api/layers/recurrent_layers/bidirectional/) which accepts as input a recurrent cell ([LSTM](https://keras.io/api/layers/recurrent_layers/lstm/), [GRU](https://keras.io/api/layers/recurrent_layers/gru/) or [SimpleRNN](https://keras.io/api/layers/recurrent_layers/simple_rnn/)) and generates two different cells left-to-right a right-to-left contextualization. The final output is finally obtained via the concatenation of both representations. 


In [9]:
birnn_model = AmazonReviewsModel(
    2000,
    256,
    SimpleRNN,
    num_recurrent_layers=4,
    dropout=0.15,
    ffn_dims=[128, 64],
    name="BiRNN",
    bidirectional=True,
)
_, fig = birnn_model.train(
    dataset, f"{MODEL_PATH}/{birnn_model.name}.weights.h5", **train_default
)
print(birnn_model.evaluate(dataset.X_test, dataset.y_test))
fig

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
[0.34082308411598206, 0.8606799840927124]


In [10]:
bilstm_model = AmazonReviewsModel(
    2000,
    256,
    LSTM,
    num_recurrent_layers=4,
    dropout=0.15,
    ffn_dims=[128, 64],
    name="BiLSTM",
    bidirectional=True,
)
_, fig = bilstm_model.train(
    dataset, f"{MODEL_PATH}/{bilstm_model.name}.weights.h5", **train_default
)
print(bilstm_model.evaluate(dataset.X_test, dataset.y_test))
fig

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
[0.30964571237564087, 0.8692399859428406]


In [11]:
bigru_model = AmazonReviewsModel(
    2000,
    256,
    GRU,
    num_recurrent_layers=4,
    dropout=0.15,
    ffn_dims=[128, 64],
    name="BiGRU",
    bidirectional=True,
)
_, fig = bigru_model.train(
    dataset, f"{MODEL_PATH}/{bigru_model.name}.weights.h5", **train_default
)
print(bigru_model.evaluate(dataset.X_test, dataset.y_test))
fig

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
[0.31589433550834656, 0.871720016002655]


Althought the performance of the Bidirectional LSTM (86.92%) and Bidirectional GRU (87.17%) do not seem to strongly improve the unidirectional processing, we see that the simple RNN cell takes a great advantage of introducing right-to-left contextualization. While the unidirectional RNN cell barely reached ~60% of accuracy, the bidirecitonal RNN is able to reach 86.06 points, obtaining a similar performance to the other recurrent cells.

## Transformer

In this section we introduce the [Transformer block](https://keras.io/api/keras_nlp/modeling_layers/transformer_encoder/) to reinforce the word contextualization between the input embedding layer and the recurrent layers. We conducted experiments adding three Transformer layers of $4$ heads before the bidirectional recurrent block using a hidden size of $d_h=128$. We see that the Transformer improves ~1% of the accuracy of all models. 

In [12]:
birnn_transformer = AmazonReviewsModel(
    2000,
    128,
    SimpleRNN,
    num_recurrent_layers=4,
    dropout=0.2,
    ffn_dims=[128, 64],
    name="BiRNN-transformer",
    num_transformers=3,
    bidirectional=True,
)
_, fig = birnn_transformer.train(
    dataset, f"{MODEL_PATH}/{birnn_transformer.name}.weights.h5", **train_default
)
print(birnn_transformer.evaluate(dataset.X_test, dataset.y_test))
fig

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
[0.3136318325996399, 0.8738399744033813]


In [13]:
bilstm_transformer = AmazonReviewsModel(
    2000,
    128,
    LSTM,
    num_recurrent_layers=4,
    dropout=0.2,
    ffn_dims=[128, 64],
    name="BiLSTM-transformer",
    num_transformers=3,
    bidirectional=True,
)
_, fig = bilstm_transformer.train(
    dataset, f"{MODEL_PATH}/{bilstm_transformer.name}.weights.h5", **train_default
)
print(bilstm_transformer.evaluate(dataset.X_test, dataset.y_test))
fig

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
[0.2989642918109894, 0.8779600262641907]


In [14]:
bigru_transformer = AmazonReviewsModel(
    2000,
    128,
    GRU,
    num_recurrent_layers=4,
    dropout=0.2,
    ffn_dims=[128, 64],
    name="BiGRU-transformer",
    num_transformers=3,
    bidirectional=True,
)
_, fig = bigru_transformer.train(
    dataset, f"{MODEL_PATH}/{bigru_transformer.name}.weights.h5", **train_default
)
print(bigru_transformer.evaluate(dataset.X_test, dataset.y_test))
fig

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
[0.299507200717926, 0.8766000270843506]


## Optimal configuration of the recurrent architecture 

In this section we experimented with the regularization options of the full network. We tested the performance of different weight regularizers, initializers and optimizers. The cell below shows the deployment of the hyperaparameter search and the final result:

In [4]:
grid = OrderedDict(
    regularizer=[L1(1e-4), L2(1e-3), L1L2(1e-4)],
    initializer=["random_normal", "glorot_uniform", "he_normal", "orthogonal"],
    optimizer=[Adam, RMSprop],
)
Regularizer.__repr__ = lambda x: x.__class__.__name__
dataset = AmazonDataset.load(
    train_path=path_dir + "train_small.txt",
    test_path=path_dir + "test_small.txt",
    max_features=2000,
)


def tostring(x):
    if isinstance(x, type):
        return x.__name__
    else:
        return repr(x)


def applydeep(lists, func):
    result = []
    for item in lists:
        result.append(list(map(func, item)))
    return result


df = pd.DataFrame(
    columns=["train", "val", "test"],
    index=pd.MultiIndex.from_product(applydeep(grid.values(), tostring)),
)
df.index.names = ["regularizer", "initializer", "optimizer"]

for i, params in enumerate(product(*grid.values())):
    params = dict(zip(grid.keys(), params))
    optimizer = params.pop("optimizer")
    model = AmazonReviewsModel(
        2000,
        128,
        LSTM,
        num_recurrent_layers=4,
        dropout=0.2,
        ffn_dims=[128, 64],
        num_transformers=3,
        bidirectional=True,
        **params,
    )
    model.train(dataset, "results/amazon.weights.h5", opt=optimizer, **train_default)
    _, train_acc = model.evaluate(dataset.X_train, dataset.y_train)
    _, val_acc = model.evaluate(dataset.X_val, dataset.y_val)
    _, test_acc = model.evaluate(dataset.X_test, dataset.y_test)
    df.loc[tuple(map(tostring, params.values()))] = [train_acc, val_acc, test_acc]
    df.to_csv("grid.csv")
df = df.applymap(lambda x: round(x * 100, 2))
df

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,train,val,test
regularizer,initializer,optimizer,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
L1,'random_normal',Adam,95.29,88.26,87.42
L1,'random_normal',RMSprop,95.29,88.26,87.42
L1,'glorot_uniform',Adam,94.03,88.14,87.57
L1,'glorot_uniform',RMSprop,94.03,88.14,87.57
L1,'he_normal',Adam,96.73,85.02,84.52
L1,'he_normal',RMSprop,96.73,85.02,84.52
L1,'orthogonal',Adam,95.05,87.34,86.74
L1,'orthogonal',RMSprop,95.05,87.34,86.74
L2,'random_normal',Adam,94.52,87.98,87.46
L2,'random_normal',RMSprop,94.52,87.98,87.46


In [8]:
print(df.max())
print(df.idxmax())

train    99.39
val      88.26
test     87.57
dtype: float64
train       (L1L2, 'he_normal', Adam)
val       (L1, 'random_normal', Adam)
test     (L1, 'glorot_uniform', Adam)
dtype: object


The final outcome does not show a special improvement varying other the configuration of the network from the performance obtained with bidirectional cells and Transformers (near 87-88% of accuracy in the validation and test set). For this reason, we did not include weight regularization or special initializations for the final proposed architecture.

## Final comparison

The next cell shows the final comparison of all the models evaluated in this work. 

In [26]:
base = [rnn_model, lstm_model, gru_model]
models = [
    lstm_enhanced,
    gru_enhanced,
    birnn_model,
    bilstm_model,
    bigru_model,
    birnn_transformer,
    bilstm_transformer,
    bigru_transformer,
]
dataset1 = AmazonDataset.load(
    train_path=path_dir + "train_small.txt",
    test_path=path_dir + "test_small.txt",
    max_features=1000,
)
dataset2 = AmazonDataset.load(
    train_path=path_dir + "train_small.txt",
    test_path=path_dir + "test_small.txt",
    max_features=2000,
)

names = [model.name for model in base + models]
train_accs = [
    model.evaluate(dataset1.X_train, dataset1.y_train)[1] for model in base
] + [model.evaluate(dataset2.X_train, dataset2.y_train)[1] for model in models]
test_accs = [model.evaluate(dataset1.X_test, dataset1.y_test)[1] for model in base] + [
    model.evaluate(dataset2.X_test, dataset2.y_test)[1] for model in models
]

In [45]:
fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=train_accs,
        y=test_accs,
        text=names,
        mode="text+markers",
        textposition="top right",
    )
)
fig.update_layout(
    height=600,
    width=1000,
    margin=dict(t=50, b=10, r=10, l=10),
    title_text="Comparison of the Amazon Reviews models",
    xaxis_title="train",
    yaxis_title="test",
    template="seaborn",
)
fig.update_yaxes(range=[0.835, 0.88])
fig.update_xaxes(range=[0.875, 0.94])
fig.show()

In our analysis, base models exhibited the poorest performance on both the training and test sets. Enhanced architectures notably improved accuracy on the training set; however, their performance did not significantly surpass baseline models in the final evaluation, with the exception of the simple RNN-based architecture achieving only 56% accuracy.

Introducing bidirectionality notably enhanced the performance on the test set, while the integration of Transformer layers across architectures slightly improved evaluation metrics. The most effective architecture, BiLSTM-transformer, achieved an impressive 87.79% accuracy score. This architecture likely excelled due to its enhanced temporal representations in the BiLSTM cell (utilizing three gates compared to the two in GRU) and strengthened embedding contextualization through Transformer layers prior to the recurrent layers.
Although Transformer-based architectures demonstrated some improvement, the associated computational costs may not justify the gains achieved. GRU-based models demonstrated comparable performance to LSTM-based models, with the exception of the Transformer-based architecture, where both achieved similar accuracy levels. Additionally, fine-tuned parameters in enhanced architectures consistently yielded superior results. Notably, selecting a vocabulary size that includes the maximum explored (2000 words) appears optimal, suggesting that richer information facilitates improved model performance.

In addition to investigating various architectural enhancements, several regularization techniques were employed to mitigate overfitting. However, these techniques did not result in significant improvements in performance, and thus, their impact was not reflected in the graphs or final evaluations. Despite their potential benefits in controlling model complexity and improving generalization, the specific configurations tested did not demonstrate superior results compared to the best-performing architectures highlighted in our analysis.

Given the limitations of our dataset size and the performance of our custom architectures, a exploration with pretrained models is a promising avenue for achieving improved results. Leveraging pretrained embeddings and transformer models, which encapsulate extensive prior knowledge from large-scale datasets, could offer significant advantages. By incorporating pretrained representations, our models may benefit from enhanced feature extraction and contextual understanding, potentially outperforming our base architectures. This avenue of investigation could potentially lead to significant performance gains, particularly when working with smaller datasets, by harnessing the wealth of information encoded within pretrained models.