# SENTIMENT ANALYSIS

_**Consider the IMDb reviews dataset and train two models, one without any pretrained embeddings and the other with contextualized pretrained embeddings, to classify the sentiment of a movie review, and then compare the performance of the two models.**_

**NOTES:** 

1. **Accelerated Hardware:** Running this notebook on a GPU is advised to save time during model training. For relevant instructions and guidelines, please refer to the README located at https://github.com/PradipKumarDas/Teaching/tree/master/21AML171-Deep_Learning.

2. **Dependencies:** This experiment was tested on TensorFlow 2.15.0. The later version installed by default in Google Colaboratory was found to be incompatible with this experiment, as `tensorflow_hub.KerasLayer` could not be used as a layer in a Keras sequential model. Hence, it is suggested to execute the following statements to install the specified version of TensorFlow in the runtime.

In [None]:
# Shows the installed version of TensorFlow in Google Colab.
# "pip" can be replaced with "conda" for local computer with conda package manager.

!pip show tensorflow

In [None]:
# Installs specific version of TensorFlow in Google Colab.
# "pip" can be replaced with "conda" for local computer with conda package manager.

!pip install tensorflow==2.15.0
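After installing, it can help to confirm from Python which version the runtime actually picked up. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `check_version` are hypothetical and not part of the original notebook.

```python
from importlib import metadata

EXPECTED_TF_VERSION = "2.15.0"  # version this experiment was tested on


def installed_version(package):
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_version(package, expected):
    """Build a short, human-readable report on the installed version."""
    found = installed_version(package)
    if found is None:
        return f"{package} is not installed"
    if found != expected:
        return f"{package} {found} found, but {expected} expected"
    return f"{package} {found} matches"


print(check_version("tensorflow", EXPECTED_TF_VERSION))
```

In Google Colab, the runtime may need to be restarted after the installation before the new version is reported.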

In [3]:
# Imports required packages

import os
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_hub as tfhub

## Retrieval of Data

This experiment uses the IMDb (Internet Movie Database) dataset from TensorFlow Datasets, containing 50,000 English movie reviews (25,000 for training and 25,000 for testing), along with a single binary target for each review indicating whether it is positive (1) or negative (0). The approximate download size is 80 megabytes (MB).

The details of the dataset are available at https://www.tensorflow.org/datasets/catalog/imdb_reviews.

In [4]:
# The following call may take several seconds to start downloading from TensorFlow Datasets.
# The download itself takes a few minutes to complete.

raw_train_set, raw_val_set, raw_test_set = tfds.load(
    name="imdb_reviews",

    # Splits dataset into train set of 22,500 [90%] instances,
    # validation set of 2,500 [10%] instances and test set of 25,000 instances
    split=["train[:90%]", "train[90%:]", "test"],

    as_supervised=True  # Returns each example as a (review, label) pair in the train, validation and test sets
)
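The sizes quoted in the comments of the cell above follow from simple arithmetic on the 25,000-review training split. The sketch below only reproduces that arithmetic; it does not call TensorFlow Datasets, and `percent_boundary` is a hypothetical helper introduced for illustration.

```python
# Illustrative arithmetic only; tfds performs the actual slicing internally.
TRAIN_EXAMPLES = 25_000  # size of the IMDb "train" split
TEST_EXAMPLES = 25_000   # size of the IMDb "test" split


def percent_boundary(total, percent):
    """Index at which a percent-based slice boundary falls (hypothetical helper)."""
    return total * percent // 100


cut = percent_boundary(TRAIN_EXAMPLES, 90)
train_size = cut                  # corresponds to "train[:90%]"
val_size = TRAIN_EXAMPLES - cut   # corresponds to "train[90%:]"

print("train:", train_size, "validation:", val_size, "test:", TEST_EXAMPLES)
```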

In [3]:
# Previews a few of the reviews

for review, label in raw_train_set.take(5):           # Takes first 5 reviews
    print("Review:", review.numpy().decode("utf-8"))  # numpy() converts the string tensor into a bytes object first, then
                                                      # decode() converts the bytes into a string
    print("\nLabel:", label.numpy())                  # numpy() converts the integer tensor to a scalar
    print("\n")

Review: This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.

Label: 0


Review: I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbish. T

In [5]:
# Shuffles the train set and batches it with 32 instances per batch
# For the validation and test sets, only batching is done as shuffling is not required for these sets
# Prefetching overlaps the data preprocessing for step s+1 with
# the model's training at step s to save time.

tf.random.set_seed(42)  # Ensures reproducibility

train_set = raw_train_set.shuffle(buffer_size=5000, seed=42).batch(32).prefetch(1)
val_set = raw_val_set.batch(32).prefetch(1)
test_set = raw_test_set.batch(32).prefetch(1)
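Shuffling with a `buffer_size` smaller than the dataset does not produce a uniform shuffle: each element is drawn at random from a sliding buffer. The plain-Python sketch below mimics that idea, followed by batching. It is an approximation for intuition, not the `tf.data` implementation, and `buffered_shuffle` and `make_batches` are hypothetical names.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in pseudo-random order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)           # fill the buffer first
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]                 # emit a random buffered element
        buffer[idx] = item                # and replace it with the incoming one
    rng.shuffle(buffer)                   # drain whatever is left at the end
    yield from buffer


def make_batches(items, batch_size):
    """Group items into consecutive lists of at most batch_size elements."""
    items = list(items)
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


shuffled = list(buffered_shuffle(range(100), buffer_size=10, seed=42))
batches = make_batches(shuffled, batch_size=32)
print(len(batches), "batches; last batch has", len(batches[-1]), "items")
```

With a buffer of 10, the element emitted at position k can only come from the first 10 + k inputs, which is why a larger buffer gives a more thorough shuffle.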

## Preparation of Data

**Preprocessing layer to map text into integer sequences.**

In [5]:
# Performs tokenization (preprocessing) at the word level as reviews are in English

# Limits vocabulary to 1000 tokens: 998 tokens for frequent words plus
# a padding token and a token for unknown words
vocabulary_size = 1000


# Tokenizes the string data with TextVectorization layer

text_vectorizer_layer = tf.keras.layers.TextVectorization(max_tokens=vocabulary_size)  # Initializes the layer

text_vectorizer_layer.adapt(
    train_set.map(lambda reviews, labels: reviews))  # Builds the vocabulary from the training reviews by calling the layer's adapt() method
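To make the vocabulary limit concrete, the following plain-Python sketch approximates what a word-level vectorizer does: standardize the text, count words, keep the most frequent ones, and reserve ID 0 for padding and ID 1 for unknown words. It is a simplified stand-in for intuition, not the Keras implementation; `standardize`, `build_vocabulary` and `encode` are hypothetical helpers, and the toy reviews are made up.

```python
import re
import string
from collections import Counter


def standardize(text):
    """Lowercase the text and strip punctuation (an approximation)."""
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text.lower())


def build_vocabulary(texts, max_tokens):
    """Keep the most frequent words, reserving two IDs for special tokens."""
    counts = Counter(word for text in texts for word in standardize(text).split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common   # ID 0: padding, ID 1: unknown word


def encode(text, vocabulary):
    """Map each word to its ID, falling back to the unknown-word ID 1."""
    index = {word: i for i, word in enumerate(vocabulary)}
    return [index.get(word, 1) for word in standardize(text).split()]


toy_reviews = ["Great movie, great cast, great fun!", "Terrible movie."]
vocab = build_vocabulary(toy_reviews, max_tokens=4)
ids = encode("Great terrible movie", vocab)
print(vocab, ids)
```

With `max_tokens=4`, only two real words fit in the vocabulary, so "terrible" is mapped to the unknown-word ID. The same trade-off applies at 1,000 tokens, just with rarer words.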


## Modeling with Recurrent Units & Trainable Word Embeddings

**Creates the following sequential model and trains it.**

In [6]:
# Defines the size of the embedding
embedding_size = 128

tf.random.set_seed(42)

# Creates a sequential model
model = tf.keras.Sequential([
    text_vectorizer_layer,
    tf.keras.layers.Embedding(input_dim=vocabulary_size, output_dim=embedding_size),
    tf.keras.layers.GRU(units=128, activation="tanh", recurrent_activation="sigmoid", return_sequences=False),
    tf.keras.layers.Dense(1, activation="sigmoid")])

# Compiles and fits the model

model.compile(optimizer="nadam", loss="binary_crossentropy", metrics=["accuracy"])
history = model.fit(train_set, validation_data=val_set, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


The above model fails to learn anything, as the accuracy remains close to 50%. The `TextVectorization` layer pads shorter sequences with the padding token (ID 0) to make them as long as the longest sequence in the batch. A gated recurrent layer is not good at remembering long sequences, so by the time it has gone through the run of padding tokens at the end of a short review, it has forgotten the review at the beginning of the sequence. That is why the model performs poorly.
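The padding behavior described above can be sketched in a few lines of plain Python. The token IDs are made up and `pad_batch` is a hypothetical helper; the point is only to show how much of a short review's sequence ends up being padding when it shares a batch with a long one.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the length of the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (longest - len(seq)) for seq in sequences]


# Two toy "reviews" as token IDs: one short, one longer (made-up values)
batch = [[5, 8, 2], [7, 3, 9, 4, 6, 11, 12, 2]]
padded = pad_batch(batch)
pad_counts = [row.count(0) for row in padded]

for row, pads in zip(padded, pad_counts):
    print(row, "padding steps:", pads)
```

Here the short review is followed by five padding steps out of eight. In real batches, where reviews can span hundreds of tokens, the padded tail can be far longer than the review itself.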

## Modeling with Masking

In the technique called _masking_ used below, the recurrent layer is made aware of the padding tokens so that it can ignore them, which improves its prediction performance.
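The effect of masking can be illustrated with a toy recurrent loop. The update rule in `toy_rnn_step` is a deliberately simple stand-in and not a real GRU; the sketch only assumes that a masked time step leaves the state unchanged, which is the behavior masking is meant to provide.

```python
import math


def toy_rnn_step(state, token_id):
    """A stand-in recurrent update (NOT a real GRU): mixes state and input."""
    return math.tanh(0.5 * state + 0.1 * token_id)


def run(sequence, use_mask, pad_id=0):
    """Run the toy recurrence, optionally skipping padding steps."""
    state = 0.0
    for token_id in sequence:
        if use_mask and token_id == pad_id:
            continue                      # masked step: carry the state over unchanged
        state = toy_rnn_step(state, token_id)
    return state


review = [5, 8, 2]            # made-up token IDs
padded = review + [0] * 20    # the same review followed by 20 padding tokens

unpadded_state = run(review, use_mask=False)
unmasked_state = run(padded, use_mask=False)
masked_state = run(padded, use_mask=True)

print("no padding:       ", unpadded_state)
print("padded, no mask:  ", unmasked_state)
print("padded, with mask:", masked_state)
```

Without the mask, the state decays toward zero across the padding steps, so little trace of the review is left. With the mask, the final state equals the state reached right after the last real token.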

In [7]:
# Defines the size of the embedding
embedding_size = 128

tf.random.set_seed(42)

# Creates a sequential model
model = tf.keras.Sequential([
    text_vectorizer_layer,
    tf.keras.layers.Embedding(
        input_dim=vocabulary_size, 
        output_dim=embedding_size, 
        mask_zero=True),  # Masks input with ID=0 and propagates the mask to downstream layers so that they skip the padding tokens
    tf.keras.layers.GRU(units=128, activation="tanh", recurrent_activation="sigmoid", return_sequences=False), 
    tf.keras.layers.Dense(1, activation="sigmoid")])

# Compiles and fits the model

model.compile(optimizer="nadam", loss="binary_crossentropy", metrics=["accuracy"])
history = model.fit(train_set, validation_data=val_set, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In the above model with masking, the accuracy on the validation set has reached around 86%.

## Modeling with Pretrained Language Models

The model's prediction performance could also be improved by using a pretrained language model that has already been trained on a large corpus and only needs to be fine-tuned for the task at hand. Among many options, the _Universal Sentence Encoder_, a pretrained language model from Google's TensorFlow Hub, is considered here.

Note, however, that because only a single commodity GPU was available on Google Colaboratory, the pretrained weights of the model were used as-is, without fine-tuning. With access to sufficient GPUs, the pretrained weights can be fine-tuned further to improve prediction performance.

In [10]:
os.environ["TFHUB_CACHE_DIR"] = "my_tfhub_cache"

tf.random.set_seed(42)

model = tf.keras.Sequential([
    # Serves the mentioned pretrained SavedModel (v4) as a Keras layer
    tfhub.KerasLayer(
        handle="https://tfhub.dev/google/universal-sentence-encoder/4",
        trainable=False,   # If set to True, it enables the pretrained model to be fine-tuned during training, but may take around one hour per epoch on a single commodity GPU
        dtype=tf.string,                              # Expects a tf.string input tensor
        input_shape=[]),                              # Expects a tensor of shape [batch_size] as input
    tf.keras.layers.Dense(64, activation="relu"),     # Additional hidden layer to reduce the output dimension before feeding the output layer
    tf.keras.layers.Dense(1, activation="sigmoid")
])

# Compiles and fits the model

model.compile(optimizer="nadam", loss="binary_crossentropy", metrics=["accuracy"])
history = model.fit(train_set, validation_data=val_set, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5



In the above model with the (fixed-weight) pretrained language model, the accuracy on the validation set has reached around 85%.
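With the encoder frozen, only the two dense layers on top of it are trained. The arithmetic below estimates how many parameters that is, assuming the encoder outputs 512-dimensional sentence vectors (this dimension is an assumption taken from the model's published description, not something the notebook prints). `dense_params` is a hypothetical helper.

```python
def dense_params(inputs, units):
    """Parameters in a fully connected layer: one weight per input-unit pair plus biases."""
    return inputs * units + units


USE_OUTPUT_DIM = 512  # assumption: sentence embedding size of the encoder

hidden_params = dense_params(USE_OUTPUT_DIM, 64)  # Dense(64, activation="relu")
output_params = dense_params(64, 1)               # Dense(1, activation="sigmoid")
trainable_total = hidden_params + output_params

print("hidden:", hidden_params, "output:", output_params, "total:", trainable_total)
```

Under that assumption, only a few tens of thousands of weights are updated during training, which is why each epoch is fast compared with fine-tuning the encoder itself.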

**OBSERVATIONS:**
1. Sentiment analysis was performed on an English movie reviews dataset.

2. It was a supervised machine learning task to predict a viewer's sentiment (positive or negative) about a movie from the review the viewer provided.

3. TensorFlow was used for splitting, shuffling, batching and prefetching the data during training and prediction.

4. The Keras `TextVectorization` layer was used to tokenize the reviews at the word level. The vocabulary was limited to 1,000 tokens, covering the most frequently used words.

5. This experiment used three modeling approaches, as mentioned below.
    - Modeling with a trainable embedding layer and a GRU layer, which could learn nothing.
    - The same modeling approach with masking, which achieved much better prediction performance because the GRU could skip the padding tokens instead of processing them and forgetting the review text that came long before.
    - The pretrained language model _Universal Sentence Encoder_ from TensorFlow Hub, used without fine-tuning its trained weights due to the lack of powerful GPUs; with fixed weights it achieved around 85% accuracy on the validation set. With access to specialized accelerators, its trained weights can be fine-tuned, and the prediction performance is expected to be much better than what was achieved in this experiment.