# Practice 2.2 (Recurrent Neural Networks)

Authors:

1. Ovidio Manteiga Moar
1. Carlos Villar Martínez


# Introduction

## Dataset

For the second part of the RNN assignment, we will use the Amazon Reviews for Sentiment Analysis (Kaggle dataset). This dataset consists of a few million Amazon customer reviews (input text) and star ratings (output labels). 

The classes are `__label__1` and `__label__2`, and there is only one class per row. `__label__1` corresponds to 1-star and 2-star reviews, and `__label__2` corresponds to 4-star and 5-star reviews. 3-star reviews (i.e. reviews with neutral sentiment) were not included in the original dataset. Most of the reviews are in English, but there are a few in other languages, like Spanish. The original dataset has 3,600,000 examples for training and 400,000 for testing. We will use a reduced version of the dataset, with 25,000 examples for training and 25,000 examples for testing.

The function `readData` in this notebook reads the dataset (train and test), and the function `transformData` transforms the text to yield the preprocessed train and test sets. The transformed datasets represent each text as a sequence of integers, one per word, based on a vocabulary built with the Keras `TextVectorization` layer. It requires two hyperparameters:

- The size of the vocabulary (maxFeatures).
- The maximum length of the text (seqLength). By default, seqLength is set to the average length of the training samples plus two times their standard deviation.
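
To make the role of the two hyperparameters concrete, the following is a minimal pure-Python sketch of what an integer text vectorization does. It is a simplified stand-in, not the Keras implementation: the helper names (`standardize`, `build_vocabulary`, `vectorize`, `default_seq_length`) are hypothetical, and it assumes that index 0 is reserved for padding and index 1 for out-of-vocabulary words, as `TextVectorization` does in `int` output mode.

```python
import statistics
import string
from collections import Counter


def standardize(text):
    # Lowercase the text and strip ASCII punctuation.
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def build_vocabulary(texts, max_tokens):
    # Map the most frequent words to integer indices.
    # Indices 0 (padding) and 1 (unknown word) are reserved, so only
    # max_tokens - 2 real words fit in the vocabulary.
    counts = Counter(word for text in texts for word in standardize(text).split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: index + 2 for index, word in enumerate(most_common)}


def vectorize(text, vocabulary, seq_length):
    # Unknown words map to 1; the result is truncated or padded with 0.
    ids = [vocabulary.get(word, 1) for word in standardize(text).split()]
    ids = ids[:seq_length]
    return ids + [0] * (seq_length - len(ids))


def default_seq_length(texts):
    # Mean number of words plus two (population) standard deviations.
    lengths = [len(text.split()) for text in texts]
    return int(statistics.mean(lengths) + 2 * statistics.pstdev(lengths))


# Made-up reviews, for illustration only.
texts = ["Great product, great price!", "Bad product."]
vocabulary = build_vocabulary(texts, max_tokens=4)
print(vocabulary)
print(vectorize("Great price, bad product", vocabulary, seq_length=6))
print(default_seq_length(texts))
```

With a small vocabulary, many words collapse into the unknown index, and with a short sequence length the end of long reviews is cut off, which is why both values affect how much information reaches the model.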


## Problem

Given the dataset described above, the problem is to predict the label that indicates the sentiment (positive, `__label__2`, or negative, `__label__1`) of a review given as text. This is an instance of a binary classification problem, where the inputs are the review texts and the outputs are the labels indicating the sentiment.

The problem is to be tackled with some kind of RNN, which should be able to capture part of the meaning in a review that determines whether it is considered positive or negative.


## Evaluation

The metric to evaluate the performance of the models will be the *accuracy* achieved in the test set provided. In the implementation we treat the test set as the validation set to get the value of the metric after each epoch of training.
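
As a reminder of what the metric measures, here is a minimal sketch of binary accuracy computed from sigmoid outputs. The function name and the sample values are made up for illustration, and the 0.5 decision threshold is an assumption (it is the usual default for binary accuracy).

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    # A probability above the threshold is read as the positive class (1).
    predictions = [1 if p > threshold else 0 for p in y_prob]
    hits = sum(int(p == t) for p, t in zip(predictions, y_true))
    return hits / len(y_true)


# Made-up labels and model outputs, for illustration only.
y_true = [0, 1, 1, 0]
y_prob = [0.1, 0.8, 0.4, 0.3]
print(binary_accuracy(y_true, y_prob))
```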


In [3]:
import os
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
#import matplotlib.pyplot as plt

In [33]:
# Reads a file in which each line has the format: label text
# Returns the list of texts, the list of integer labels (0 = negative, 1 = positive)
# and a suggested sequence length (mean + 2 * std of the number of words per line).
def readData(fname):

    with open(fname, 'r', encoding="utf-8") as f:
        fileData = f.read()

    lines = fileData.split("\n")
    textData = list()
    textLabel = list()
    lineLength = np.zeros(len(lines))

    for i, aLine in enumerate(lines):
        if not aLine:
            break
        # Split only at the first space to separate the label from the review text.
        # (str.lstrip removes a set of characters, not a prefix, so it could also
        # eat the first letters of the review.)
        label, _, text = aLine.partition(" ")
        lineLength[i] = len(aLine.split(" "))
        if label == "__label__1":
            textLabel.append(0)
            textData.append(text)
        elif label == "__label__2":
            textLabel.append(1)
            textData.append(text)
        else:
            raise ValueError("Error in readData: line {} has an unknown label: {}".format(i, aLine))

    return textData, textLabel, int(np.average(lineLength)+2*np.std(lineLength))
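
One detail worth keeping in mind when separating the label from the review text is that `str.lstrip` takes a set of characters rather than a prefix. The sketch below shows the difference on a made-up line whose review happens to start with letters that also appear in the label.

```python
# Made-up line in the dataset format "label text", for illustration only.
line = "__label__1 bella vista"

# lstrip removes every leading character found in the given set
# ('_', 'l', 'a', 'b', 'e', '1', ' '), so it also eats the word "bella".
stripped = line.lstrip("__label__1 ")

# partition splits at the first space and leaves the review text untouched.
label, _, text = line.partition(" ")

print(repr(stripped))
print(repr(label), repr(text))
```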

In [34]:
def transformData(x_train, y_train, x_test, y_test, maxFeatures, seqLength):
    # Transforms the text input into integer sequences based on the vocabulary.
    # max_tokens = maxFeatures is the size of the vocabulary.
    # output_sequence_length = seqLength is the maximum length of the transformed text;
    # shorter texts are padded with 0 and longer ones are truncated.
    precLayer = layers.experimental.preprocessing.TextVectorization(max_tokens = maxFeatures, 
    standardize =  'lower_and_strip_punctuation', split = 'whitespace', output_mode = 'int', 
    output_sequence_length =  seqLength)
    precLayer.adapt(x_train)
    #print(precLayer.get_vocabulary())
    x_train_int = precLayer(x_train)
    y_train = tf.convert_to_tensor(y_train)
    #print(x_train_int)
    #print(y_train)
    x_test_int= precLayer(x_test)
    y_test = tf.convert_to_tensor(y_test)
    #print(x_test_int)
    #print(y_test)

    return x_train_int, y_train, x_test_int, y_test

In [35]:
x_train, y_train, seqLength = readData("./amazon/train_small.txt")
x_test, y_test, tmp = readData("./amazon/test_small.txt")

# Hyperparameters
maxFeatures = 1000
embedding_dim = 64
seqLength = seqLength * 2

x_train_int, y_train, x_test_int, y_test = transformData(x_train, y_train, x_test, y_test, maxFeatures, seqLength)


# The model

The following cell defines the model that achieved the best performance, with more than 88% accuracy in the test set. It consists of an embedding layer that maps the vectorized sequences of words into dense vectors representing their meaning, followed by a recurrent GRU layer with 64 units, and finally an output layer. The GRU is configured to return only the output of the last time step (`return_sequences=False`), so the recurrent architecture is many-to-one. Since that output has dimensionality 64 (the number of units), the output layer is a dense layer with a single unit and a sigmoid activation, which produces a single value between 0 and 1 that is read as the binary label.

First, we define the input layer. The shape could be given as `None` instead of `seqLength`, but when all the sequences have the same length it is recommended to specify the full shape, as this may unlock some performance optimizations.

The embedding layer takes as input dimension the number of features (the number of words in the vocabulary, `maxFeatures`), as output dimension the length of the vector it produces for each word (`embedding_dim`), and as input length the length of the word sequences (`seqLength`). The parameter `mask_zero` is set to `True`, so that the zeros in the sequences (which appear as padding) are masked: they are not considered when training the embedding layer, nor subsequently in the recurrent layer.
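
The effect of masking in a many-to-one recurrent layer can be illustrated with a toy example. The sketch below is not a GRU: it uses a hypothetical scalar recurrence with made-up weights `w` and `u`, and feeds the token ids directly as inputs instead of embedding them. It only shows that, when padded steps are skipped, the final state of a padded sequence equals that of the unpadded one.

```python
import math


def last_state(sequence, mask, w=0.5, u=0.8):
    # Toy recurrence h_t = tanh(w * x_t + u * h_{t-1}); only the last state is returned.
    h = 0.0
    for x, keep in zip(sequence, mask):
        if keep:
            h = math.tanh(w * x + u * h)
        # Masked (padding) steps leave the state unchanged.
    return h


ids = [5, 3, 9]
padded = ids + [0, 0]
mask = [token != 0 for token in padded]  # False where the sequence is padding

masked_result = last_state(padded, mask)
unpadded_result = last_state(ids, [True] * len(ids))
unmasked_result = last_state(padded, [True] * len(padded))

print(masked_result == unpadded_result)
print(masked_result == unmasked_result)
```

Without the mask, the padding steps keep updating the state, so the output would depend on how much padding each review received.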

As a baseline for comparison, a model without an RNN, with only a dense layer of 64 units after the embedding layer, achieves 84% accuracy in the test set (with close to 100% accuracy in the train set).
```
Epoch 10/20
196/196 [==============================] - 8s 43ms/step - loss: 0.0019 - accuracy: 0.9999 - val_loss: 0.8515 - val_accuracy: 0.8401
```

We tried many different models, for example simple RNNs, GRUs, LSTMs and bidirectional LSTMs in multiple configurations (single or multiple layers). None of them worked better: the GRU with 64 units was the simplest model we found that achieved the best accuracy in the test set, around 88%.

We also experimented with different regularization techniques, such as dropout, batch normalization and L1/L2 regularization in the recurrent layers, but none of them outperformed the plain single GRU layer, whose test accuracy stagnates but does not drop considerably.


In [36]:
input_shape = (seqLength,)  # one-element tuple; (seqLength) without the comma is just an int
inputs = keras.Input(shape=input_shape)
x = layers.Embedding(input_dim=maxFeatures,
                     output_dim=embedding_dim,
                     input_length=seqLength, 
                     mask_zero=True)(inputs)
x = layers.GRU(64, activation='tanh', return_sequences=False)(x)
outputs = layers.Dense(1, activation='sigmoid')(x)
model = keras.Model(inputs, outputs)
model.summary()

Model: "model_10"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_12 (InputLayer)       [(None, 332)]             0         
                                                                 
 embedding_11 (Embedding)    (None, 332, 64)           64000     
                                                                 
 gru_7 (GRU)                 (None, 64)                24960     
                                                                 
 dense_19 (Dense)            (None, 1)                 65        
                                                                 
Total params: 89,025
Trainable params: 89,025
Non-trainable params: 0
_________________________________________________________________
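
The parameter counts in the summary can be reproduced by hand. The sketch below uses hypothetical helper functions and assumes the GRU variant with two bias vectors per gate (`reset_after=True`, the default in TensorFlow 2), which is consistent with the 24,960 parameters reported above.

```python
def embedding_params(vocab_size, embedding_dim):
    # One vector of size embedding_dim per word in the vocabulary.
    return vocab_size * embedding_dim


def gru_params(input_dim, units):
    # Three blocks (update gate, reset gate, candidate state), each with an
    # input kernel, a recurrent kernel and two bias vectors.
    return 3 * (units * (input_dim + units) + 2 * units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units


counts = {
    "embedding": embedding_params(vocab_size=1000, embedding_dim=64),
    "gru": gru_params(input_dim=64, units=64),
    "dense": dense_params(input_dim=64, units=1),
}
total = sum(counts.values())
print(counts, total)
```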


In [37]:
callbacks = [ keras.callbacks.ModelCheckpoint("jena_gru_amazon.keras") ]
model.compile(optimizer="adam", loss='binary_crossentropy', metrics=["accuracy"])
history = model.fit(x_train_int, 
                    y_train, epochs=20,
                    batch_size=128, 
                    validation_data=(x_test_int, y_test), 
                    callbacks=callbacks)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [38]:
max_val_accuracy = max(history.history['val_accuracy'])
print("MAX TEST ACC = {mva:.2f}%".format(mva=max_val_accuracy*100))

MAX TEST ACC = 88.16%
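
Besides the maximum value, it can be useful to know at which epoch it was reached. The sketch below uses a hypothetical helper and a made-up dictionary with the same layout as `history.history`; the numbers are illustrative and are not the results of this notebook.

```python
def best_epoch(history_dict, metric="val_accuracy"):
    # Returns the 1-based epoch with the highest value of the metric, and that value.
    values = history_dict[metric]
    best_index = max(range(len(values)), key=values.__getitem__)
    return best_index + 1, values[best_index]


# Made-up validation accuracies, for illustration only.
example_history = {"val_accuracy": [0.80, 0.85, 0.87, 0.86]}
epoch, value = best_epoch(example_history)
print("MAX TEST ACC = {:.2f}% (epoch {})".format(value * 100, epoch))
```

Note that `ModelCheckpoint` overwrites the saved file after every epoch unless `save_best_only=True` is passed, so the file on disk corresponds to the last epoch rather than necessarily to the best one.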


# Conclusions

1. The hyperparameters that worked best were a vocabulary size of 1000, a sequence length of double the default and an embedding dimension of 64. The increased vocabulary size and sequence length allow the model to capture more information about the meaning of the words and texts.
1. The models that performed best were, in general, the single-layer RNN models, and among those the GRU with 64 units, which is also preferable for its simplicity.
1. None of the regularization techniques applied (dropout, batch normalization, L1/L2 regularization) improved the performance of the model. The model does not clearly overfit: its train accuracy keeps increasing slightly while the validation accuracy plateaus around its maximum. This may be due to the significant size of the dataset.
1. None of the more complex models, with multiple RNN layers or even multiple dense layers, worked better. Some performed similarly, but simpler models are preferred, all the more so because training times are significantly longer for complex recurrent models.