# SENTIMENT ANALYSIS USING RECURRENT NEURAL NETWORKS & EMBEDDINGS

_**Building recurrent neural networks and a model with pretrained embeddings for sentiment analysis on the IMDb dataset, and comparing their performance.**_

In [2]:
# Imports required packages

import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_hub as tfhub


## Loading & Analyzing Data

This experiment uses the IMDb (Internet Movie Database) reviews dataset from TensorFlow Datasets. It contains 50,000 English movie reviews, 25,000 for training and 25,000 for testing, each with a single binary target indicating whether the review is positive (1) or negative (0). The download is approximately 80 megabytes (MB). Details of the dataset are available at https://www.tensorflow.org/datasets/catalog/imdb_reviews.

In [3]:
# The following call may take several seconds to start downloading from TensorFlow Datasets.
# The download itself takes a few minutes to complete.

(train_set_raw, val_set_raw, test_set_raw), ds_info = tfds.load(
    name="imdb_reviews",                                # Name of the dataset
    # Splits dataset into train set of 22,500 [90%] instances,
    # validation set of 2,500 [10%] instances and test set of 25,000 instances
    split=["train[:90%]", "train[90%:]", "test"],       # Split of the data to load
    as_supervised=True,                                 # To attach labels with each split
    with_info=True                                      # Also to return dataset information
)



In [4]:
# [OPTIONAL] Prints information about the dataset
print(ds_info)

tfds.core.DatasetInfo(
    name='imdb_reviews',
    full_name='imdb_reviews/plain_text/1.0.0',
    description="""
    Large Movie Review Dataset. This is a dataset for binary sentiment
    classification containing substantially more data than previous benchmark
    datasets. We provide a set of 25,000 highly polar movie reviews for training,
    and 25,000 for testing. There is additional unlabeled data for use as well.
    """,
    config_description="""
    Plain text
    """,
    homepage='http://ai.stanford.edu/~amaas/data/sentiment/',
    data_path='/home/pradip/tensorflow_datasets/imdb_reviews/plain_text/1.0.0',
    file_format=tfrecord,
    download_size=80.23 MiB,
    dataset_size=129.83 MiB,
    features=FeaturesDict({
        'label': ClassLabel(shape=(), dtype=int64, num_classes=2),
        'text': Text(shape=(), dtype=string),
    }),
    supervised_keys=('text', 'label'),
    disable_shuffling=False,
    splits={
        'test': <SplitInfo num_examples=25000, num_shards=

In [5]:
# [OPTIONAL] Previews a few of the reviews

for review, label in train_set_raw.take(4):           # Takes the first 4 reviews
    print("Review:", review.numpy().decode("utf-8"))  # numpy() converts the string tensor to a bytes object,
                                                      # then decode() converts the bytes to a string
    print("\nLabel:", label.numpy())                  # numpy() converts the integer tensor to a scalar
    print("\n")

Review: This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.

Label: 0


Review: I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbish. T



## Data Preparation

In [6]:
# Prefetching overlaps data preprocessing for step s+1 with
# model training at step s, which saves time.

# Sets the global random seed for operations that rely on a random seed
tf.random.set_seed(42)

# Sets the batch size
batch_size = 32

# Shuffles instances and enables prefetching for efficient batch processing
# Shuffling applies only to the train set
train_set = train_set_raw.shuffle(buffer_size=5000, seed=42).batch(batch_size).prefetch(1)
val_set = val_set_raw.batch(batch_size).prefetch(1)
test_set = test_set_raw.batch(batch_size).prefetch(1)
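
As a quick sanity check on the pipeline above, the number of batches per epoch can be worked out by hand. The sketch below is plain Python arithmetic, not part of the original notebook; it assumes the split sizes stated in the loading cell and that `.batch()` keeps the final partial batch (its default behaviour). The helper name `batches_per_epoch` is made up for illustration.

```python
import math

# Split sizes from the tfds.load() call: 90% / 10% of the 25,000
# training reviews, plus the full 25,000-review test split.
split_sizes = {"train": 22_500, "validation": 2_500, "test": 25_000}
batch_size = 32

def batches_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of batches when the last, smaller batch is kept."""
    return math.ceil(num_examples / batch_size)

steps = {name: batches_per_epoch(n, batch_size) for name, n in split_sizes.items()}
print(steps)
```

The train and test figures should match the step counts that Keras shows in its progress bars during `fit()` and `evaluate()`.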

## Modeling
_Models with recurrent units and trainable word embeddings are built first without masking, and then with masking enabled._

**Tokenizing Text**

Assuming the reviews are in English, a tokenizer is prepared to split text at the word level. The same tokenizer is used in both RNN-based models.

In [7]:
# Limits vocabulary to 1000 words: 998 tokens for frequent words plus
# one token for padding and one for unknown words
vocabulary_size = 1000

# Initializes TextVectorization layer to tokenize the input text
text_vectorizer_layer = tf.keras.layers.TextVectorization(max_tokens=vocabulary_size)

# Adapts the layer on the train set so that it learns the vocabulary and can
# tokenize the input text during training and inference
text_vectorizer_layer.adapt(
    train_set.map(lambda reviews, labels: reviews))
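
To make the vocabulary budget concrete, here is a simplified pure-Python stand-in for what a word-level vectorizer does. It is only a sketch, not the Keras implementation: it assumes lowercasing plus ASCII punctuation stripping, whitespace splitting, ID 0 for padding and ID 1 for unknown words, and frequency-ranked IDs for the rest. The function names and toy reviews are hypothetical.

```python
import re
import string
from collections import Counter

def standardize(text: str) -> list[str]:
    """Lowercases, strips ASCII punctuation, and splits on whitespace."""
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text.lower())
    return text.split()

def build_vocabulary(texts: list[str], max_tokens: int) -> list[str]:
    """Reserves two slots (padding, unknown) and fills the rest by frequency."""
    counts = Counter(tok for text in texts for tok in standardize(text))
    frequent = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + frequent

def vectorize(text: str, vocabulary: list[str]) -> list[int]:
    """Maps each word to its ID; unseen words map to ID 1."""
    index = {tok: i for i, tok in enumerate(vocabulary)}
    return [index.get(tok, 1) for tok in standardize(text)]

toy_reviews = ["Great movie, great acting, great cast!", "A terrible movie."]
vocab = build_vocabulary(toy_reviews, max_tokens=4)
ids = vectorize("A great movie with terrible acting", vocab)
print(vocab, ids)
```

With `max_tokens=4` only two real words fit, so everything else collapses to the unknown ID. The same trade-off applies, at a larger scale, to the 1000-token vocabulary used in this experiment.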



### Modeling with Trainable Word Embeddings

In [8]:
# Defines the size of the embedding 
embedding_size = 128

tf.random.set_seed(42)        # Sets the global random seed for operations that rely on a random seed

# Creates a sequential model
model = tf.keras.Sequential([
    text_vectorizer_layer,          # Vectorizer [already adapted earlier] to tokenize input string

    # Trainable embedding that learns to represent each token as a dense vector of fixed size
    tf.keras.layers.Embedding(input_dim=vocabulary_size, output_dim=embedding_size),

    # Recurrent unit as a layer
    tf.keras.layers.GRU(
        units=128,                  # Dimensionality of the output (and of the hidden state)
        return_sequences=False),    # Returns only the last output instead of the full output sequence

    # Regular dense layer with the number of outputs and the activation function
    # appropriate for a binary classification task
    tf.keras.layers.Dense(1, activation="sigmoid")
])
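
For a sense of scale, the trainable parameter count of this architecture can be estimated by hand. The arithmetic below is a sketch that assumes the Keras GRU default `reset_after=True`, which gives each of the three gate/candidate blocks two bias vectors; with a different setting the GRU figure would change.

```python
vocabulary_size = 1000
embedding_size = 128
gru_units = 128

# One dense vector of size embedding_size per token ID
embedding_params = vocabulary_size * embedding_size

# Three blocks (update gate, reset gate, candidate state), each with an
# input kernel, a recurrent kernel, and two bias vectors (reset_after=True assumed)
gru_params = 3 * (embedding_size * gru_units + gru_units * gru_units + 2 * gru_units)

# One weight per GRU output plus one bias for the single sigmoid unit
dense_params = gru_units * 1 + 1

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total_params)
```

Calling `model.summary()` after the model has been built is the authoritative way to confirm these numbers.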

In [9]:
# Compiles and fits the model

model.compile(optimizer="nadam", loss="binary_crossentropy", metrics=["accuracy"])
history = model.fit(train_set, validation_data=val_set, epochs=5)

Epoch 1/5
704/704 ━━━━━━━━━━━━━━━━━━━━ 198s 278ms/step - accuracy: 0.4948 - loss: 0.6939 - val_accuracy: 0.5008 - val_loss: 0.6929
Epoch 2/5
704/704 ━━━━━━━━━━━━━━━━━━━━ 197s 280ms/step - accuracy: 0.5004 - loss: 0.6929 - val_accuracy: 0.5032 - val_loss: 0.6928
Epoch 3/5
704/704 ━━━━━━━━━━━━━━━━━━━━ 203s 289ms/step - accuracy: 0.5075 - loss: 0.6921 - val_accuracy: 0.5020 - val_loss: 0.6936
Epoch 4/5
704/704 ━━━━━━━━━━━━━━━━━━━━ 201s 285ms/step - accuracy: 0.5035 - loss: 0.6908 - val_accuracy: 0.5012 - val_loss: 0.6965
Epoch 5/5
704/704 ━━━━━━━━━━━━━━━━━━━━ 203s 288ms/step - accuracy: 0.5078 - loss: 0.6885 - val_accuracy: 0.5008 - val_loss: 0.6975


**Observation:** The above model fails to learn anything: its accuracy stays close to 50%. The `TextVectorization` layer pads shorter sequences with the padding token (ID 0) so that every review in a batch is as long as the longest one. The GRU layer, which is not good at remembering long sequences, then has to process a long run of padding tokens at the end of most reviews, and by the last step it has forgotten the review text at the beginning of the sequence. That is why the model performs poorly.
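
The amount of padding involved is easy to underestimate. The sketch below pads a toy batch to its longest sequence in plain Python, mimicking the behaviour described above; the token IDs are invented and `pad_batch` is a hypothetical helper, not a Keras function.

```python
def pad_batch(sequences: list[list[int]], pad_id: int = 0) -> list[list[int]]:
    """Right-pads every sequence to the length of the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (longest - len(seq)) for seq in sequences]

# Hypothetical token IDs for three reviews of very different lengths
batch = [
    [5, 8, 2],
    [7, 3, 9, 4, 6, 2, 8, 5, 3, 7, 9, 4],
    [6, 2],
]

padded = pad_batch(batch)
padding_per_review = [row.count(0) for row in padded]
print(padding_per_review)
```

One long review in a batch is enough to make the short ones mostly padding, and the recurrent layer has to step through all of it before producing its final output.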

### Modeling with Masking
_Masking is enabled at the embedding layer, which propagates the mask to all downstream layers so that they skip the padding tokens, improving the model's prediction performance._
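
The effect of masking on a recurrent state can be illustrated with a toy recurrence. This is deliberately not a GRU: it is a simple exponential-average update, chosen as a stand-in so the numbers are easy to follow. What it shares with the real layer is the rule that a masked step leaves the state untouched. All names and values here are invented for illustration.

```python
PAD_ID = 0

def final_state(token_ids: list[int], token_values: dict[int, float], use_mask: bool) -> float:
    """Runs a toy recurrence over the sequence and returns the last state."""
    state = 0.0
    for token_id in token_ids:
        if use_mask and token_id == PAD_ID:
            continue  # masked step: carry the previous state forward unchanged
        state = 0.5 * state + 0.5 * token_values[token_id]
    return state

# Token 1 carries a "positive" signal; the padding token contributes nothing
token_values = {PAD_ID: 0.0, 1: 1.0}
padded_review = [1, 1, 0, 0, 0, 0]

without_mask = final_state(padded_review, token_values, use_mask=False)
with_mask = final_state(padded_review, token_values, use_mask=True)
print(without_mask, with_mask)
```

Without the mask, each padding step halves the state, so the signal from the two real tokens has nearly vanished by the end. With the mask, the state after the last real token is what reaches the output layer.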

In [10]:
# Creates a sequential model
model = tf.keras.Sequential([
    text_vectorizer_layer,
    tf.keras.layers.Embedding(input_dim=vocabulary_size, 
                              output_dim=embedding_size, 
                              mask_zero=True),              # Masks padding tokens (whose ID is 0)
    tf.keras.layers.GRU(units=128, return_sequences=False), 
    tf.keras.layers.Dense(1, activation="sigmoid")])

In [11]:
# Compiles and fits the model

model.compile(optimizer="nadam", loss="binary_crossentropy", metrics=["accuracy"])
history = model.fit(train_set, validation_data=val_set, epochs=5)

Epoch 1/5
704/704 ━━━━━━━━━━━━━━━━━━━━ 224s 314ms/step - accuracy: 0.6548 - loss: 0.5946 - val_accuracy: 0.8236 - val_loss: 0.4118
Epoch 2/5
704/704 ━━━━━━━━━━━━━━━━━━━━ 223s 317ms/step - accuracy: 0.8592 - loss: 0.3371 - val_accuracy: 0.8660 - val_loss: 0.3174
Epoch 3/5
704/704 ━━━━━━━━━━━━━━━━━━━━ 223s 317ms/step - accuracy: 0.8844 - loss: 0.2809 - val_accuracy: 0.8680 - val_loss: 0.3051
Epoch 4/5
704/704 ━━━━━━━━━━━━━━━━━━━━ 225s 319ms/step - accuracy: 0.8940 - loss: 0.2610 - val_accuracy: 0.8640 - val_loss: 0.3135
Epoch 5/5
704/704 ━━━━━━━━━━━━━━━━━━━━ 224s 318ms/step - accuracy: 0.9043 - loss: 0.2380 - val_accuracy: 0.8536 - val_loss: 0.3253


In the above model with masking, validation accuracy peaks at about 87% (epoch 3) and ends at about 85% after epoch 5, while training accuracy keeps rising, which suggests mild overfitting.
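
The per-epoch figures in the training log above can be summarised with a few lines of plain Python. The lists below are copied from that log; the variable names are made up for this sketch, and in a live session the same values would normally be read from `history.history`.

```python
# Values copied from the five-epoch training log of the masked model
train_acc = [0.6548, 0.8592, 0.8844, 0.8940, 0.9043]
val_acc = [0.8236, 0.8660, 0.8680, 0.8640, 0.8536]
val_loss = [0.4118, 0.3174, 0.3051, 0.3135, 0.3253]

# Epoch numbers are 1-based, list indices are 0-based
best_acc_epoch = max(range(len(val_acc)), key=val_acc.__getitem__) + 1
lowest_loss_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1

# Gap between training and validation accuracy after the last epoch
final_gap = train_acc[-1] - val_acc[-1]

print(best_acc_epoch, lowest_loss_epoch, round(final_gap, 4))
```

Both criteria point to the same epoch here, and the widening gap afterwards is the usual sign that further epochs mainly fit the training set.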

In [12]:
# Evaluates model over test set
model_test_performance = model.evaluate(test_set)

print("Test Performance [Accuracy]: {:.1f}%".format(model_test_performance[1] * 100))

782/782 ━━━━━━━━━━━━━━━━━━━━ 75s 96ms/step - accuracy: 0.8630 - loss: 0.3209
Test Performance [Accuracy]: 86.4%


### Modeling with Pretrained Embeddings
_Experimenting with pretrained sentence-level embeddings for classification task._

One of the pretrained sentence encoders, the _Universal Sentence Encoder_ (USE) from TensorFlow Hub, is used. It is already trained on a large corpus and captures sentence-level semantic similarity, which enables better performance on downstream classification tasks. The model can also be fine-tuned on the task at hand to improve prediction performance.
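
Sentence-level similarity is usually measured as the cosine of the angle between two embedding vectors. The sketch below shows that computation in plain Python. The four-dimensional vectors are invented for illustration only (USE itself produces 512-dimensional vectors), so the resulting numbers say nothing about what the real encoder would return for these sentences.

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up embeddings: two positive sentences and one negative sentence
toy_embeddings = {
    "The film was wonderful.": [0.9, 0.1, 0.3, 0.0],
    "I loved this movie.": [0.8, 0.2, 0.4, 0.1],
    "The plot was dull and boring.": [-0.7, 0.3, 0.2, 0.1],
}

wonderful, loved, dull = toy_embeddings.values()
similar_pair = cosine_similarity(wonderful, loved)
dissimilar_pair = cosine_similarity(wonderful, dull)
print(round(similar_pair, 3), round(dissimilar_pair, 3))
```

A classifier placed on top of such embeddings only needs to learn which regions of the vector space correspond to positive and negative sentiment, which is why a small dense head can be enough.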

In [19]:
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"    # Location of USE to download from

In [24]:
class USE_Embedding(tf.keras.layers.Layer):
    """
    Defines a custom layer to wrap USE. This is needed primarily because newer TensorFlow
    versions [2.15+] no longer support using a saved model as a layer in a Keras sequential model.
    """
    def __init__(self, module_url, **kwargs):           # Constructor
        super().__init__(**kwargs)                      # Calls the initialization method of the parent class

        # Instance attribute 'embed' is set to the embedding layer
        # NOTE: Downloading the module below, which is around 1 GB in size, may take a few minutes
        self.embed = tfhub.KerasLayer(module_url, trainable=True, input_shape=[], dtype=tf.string)

    def call(self, inputs):                             # Gets called when the layer instance is called like a function
        return self.embed(inputs)                       # Returns the embeddings for the inputs

In [25]:
# Creates a functional model containing USE as a layer

inputs = tf.keras.Input(shape=(), dtype=tf.string)              # shape=() means each example is a single scalar string (one whole review)

embedding = USE_Embedding(module_url)(inputs)                   # Pretrained embedding layer that receives the input

dense = tf.keras.layers.Dense(64, activation="relu")(embedding) # Regular dense layer

output = tf.keras.layers.Dense(1, activation="sigmoid")(dense)  # Regular one-output dense layer for binary classification

model = tf.keras.Model(inputs=inputs, outputs=output)           # Groups layers into a model for training and/or inference


In [26]:
# [OPTIONAL] Shows the model summary
model.summary()

In [30]:
# Compiles the model and trains it

model.compile(optimizer="nadam", loss="binary_crossentropy", metrics=["accuracy"])

history = model.fit(train_set, 
                    validation_data=val_set, 
                    epochs=10, 
                    callbacks=[
                        tf.keras.callbacks.ModelCheckpoint(
                            "./model_weights/my_universal_sentence_encoder_model.weights.h5", 
                            monitor='val_accuracy', 
                            save_best_only=True, 
                            save_weights_only=True)
                        # tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', restore_best_weights=True)]
                    ])

# Loads back the weights of the best model
model.load_weights("./model_weights/my_universal_sentence_encoder_model.weights.h5")

Epoch 1/10
704/704 ━━━━━━━━━━━━━━━━━━━━ 25s 31ms/step - accuracy: 0.7967 - loss: 0.4675 - val_accuracy: 0.8504 - val_loss: 0.3283
Epoch 2/10
704/704 ━━━━━━━━━━━━━━━━━━━━ 22s 31ms/step - accuracy: 0.8605 - loss: 0.3290 - val_accuracy: 0.8548 - val_loss: 0.3232
Epoch 3/10
704/704 ━━━━━━━━━━━━━━━━━━━━ 22s 31ms/step - accuracy: 0.8628 - loss: 0.3197 - val_accuracy: 0.8496 - val_loss: 0.3250
Epoch 4/10
704/704 ━━━━━━━━━━━━━━━━━━━━ 21s 30ms/step - accuracy: 0.8683 - loss: 0.3183 - val_accuracy: 0.8520 - val_loss: 0.3222
Epoch 5/10
704/704 ━━━━━━━━━━━━━━━━━━━━ 21s 30ms/step - accuracy: 0.8680 - loss: 0.3125 - val_accuracy: 0.8564 - val_loss: 0.3200
Epoch 6/10
704/704 ━━━━━━━━━━━━━━━━━━━━ 21s 30ms/step - accuracy: 0.8674 - loss: 0.3120 - val_accuracy: 0.8520 - val_loss: 0.3168
Epoch 7/10
[1m7

The validation accuracy of the above model reached about 86%.
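
The checkpoint callback used during training keeps only the weights from the epoch with the best monitored metric. The logic can be sketched in plain Python as a running maximum. The accuracies below are the six epochs visible in the log above (the remaining epochs are cut off, so later epochs may have produced a better checkpoint), and `epochs_that_save` is a hypothetical helper rather than a Keras function.

```python
# val_accuracy for epochs 1 to 6, copied from the visible part of the log
visible_val_accuracy = [0.8504, 0.8548, 0.8496, 0.8520, 0.8564, 0.8520]

def epochs_that_save(metric_per_epoch: list[float]) -> list[int]:
    """Returns the 1-based epochs at which a strictly better value is seen."""
    best = float("-inf")
    saved = []
    for epoch, value in enumerate(metric_per_epoch, start=1):
        if value > best:  # a checkpoint is written only on improvement
            best = value
            saved.append(epoch)
    return saved

saved_epochs = epochs_that_save(visible_val_accuracy)
print(saved_epochs)
```

The last entry of the list is the epoch whose weights would be on disk at that point, which is what `load_weights()` restores after training.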

In [31]:
# Evaluates model over test set
model_test_performance = model.evaluate(test_set)

print("Test Performance [Accuracy]: {:.1f}%".format(model_test_performance[1] * 100))

782/782 ━━━━━━━━━━━━━━━━━━━━ 20s 26ms/step - accuracy: 0.8605 - loss: 0.3230
Test Performance [Accuracy]: 86.0%


**Observations**

This experiment used the three modeling approaches listed below.

1. Modeling with a trainable embedding layer and a GRU layer, which learned nothing.

2. The same modeling approach with masking, which achieved much better prediction performance (86.4% test accuracy) because the GRU could skip the padding tokens instead of processing them, so it did not forget the actual review text earlier in the sequence.

3. The pretrained sentence encoder _Universal Sentence Encoder_ from TensorFlow Hub, which achieved around 85 to 86% accuracy on the validation set and 86.0% on the test set. With access to specialized accelerators, its pretrained weights could be fine-tuned, and the prediction performance would be expected to improve over what was achieved in this experiment.

**Known Issues**

1. One concern with this experiment is that the pretrained embedding is not actually tunable: the model summary reports its trainable parameters as zero, even though the layer that wraps the embedding was set as trainable (see the wrapper class `USE_Embedding`). Fixing this problem could improve the prediction performance (accuracy) of the model.

2. While training the USE-based model in Google Colab on a T4 GPU with TensorFlow 2.18.0, the XLA compiler, which accelerates computations on the GPU, raised a TensorFlow graph execution error. Since the error message highlighted "T=DT_STRING", the string input to the USE embedding layer is the likely cause: the XLA compiler might not have the kernels needed to handle string data on the GPU. Fixing this problem could further reduce model training time.
