# Sentiment Analysis

This project aims to impart an understanding of how to process English sentences, apply NLP techniques, make the deep learning model understand the context of the sentence, and classify the sentiment the sentence implies.

The world around us is flooded with reviews, whether for online shopping, movies, offline markets, or anything else, and it has become very common for us to rely on them. Hence it is really helpful for a machine learning aspirant to understand the various techniques for processing such reviews and for making machine learning models understand the context of the sentences.

Skills used:

* TensorFlow 2
* NLP
* Deep Learning
* Python Programming

We will load the dataset from `tensorflow_datasets`. Then we will develop an understanding of how to preprocess the data, convert the English words to numerical representations, and prepare them to be fed as input to our deep learning model with GRUs.

In [7]:
#Importing the Modules

import numpy as np
import tensorflow as tf
from tensorflow import keras 
import tensorflow_datasets as tfds

tf.random.set_seed(42)

In [8]:
datasets, info = tfds.load("imdb_reviews", as_supervised=True, with_info=True)

Downloading and preparing dataset imdb_reviews/plain_text/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /home/anuprava958470/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...


Shuffling and writing examples to /home/anuprava958470/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incomplete6HG966/imdb_reviews-train.tfrecord

Shuffling and writing examples to /home/anuprava958470/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incomplete6HG966/imdb_reviews-test.tfrecord

Shuffling and writing examples to /home/anuprava958470/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incomplete6HG966/imdb_reviews-unsupervised.tfrecord

Dataset imdb_reviews downloaded and prepared to /home/anuprava958470/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


In [9]:
print(datasets.keys())

dict_keys(['test', 'train', 'unsupervised'])


In [10]:
train_size = info.splits["train"].num_examples
test_size = info.splits["test"].num_examples

print(train_size , test_size)

25000 25000


## Exploring the Data
Let us see what the data looks like.

In [14]:
for X_batch, y_batch in datasets["train"].batch(2).take(1):
    for review, label in zip(X_batch.numpy(), y_batch.numpy()):
        print("Review:", review.decode("utf-8")[:200], "...")
        print("Label:", label, "= Positive" if label else "= Negative")
        print()

Review: This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting  ...
Label: 0 = Negative

Review: I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However  ...
Label: 0 = Negative



`datasets["train"]` contains the train data. Similarly, `datasets["test"]` contains the test data.

`datasets["train"].batch(2)` batches 2 data samples at a time.

`datasets["train"].batch(2).take(1)` takes only the first batch.

Each batch is an `EagerTensor`. We can convert it to a NumPy array using `X_batch.numpy()`.
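To make the batching and taking behaviour concrete without downloading any data, here is a minimal plain-Python analogue. It is only a sketch: the `batch` and `take` helpers and the toy samples are hypothetical stand-ins, not part of `tf.data`, and a real `tf.data` batch yields a pair of tensors `(X_batch, y_batch)` rather than a list of pairs.

```python
from itertools import islice

def batch(samples, batch_size):
    """Group consecutive samples into lists of at most batch_size items."""
    iterator = iter(samples)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk

def take(batches, count):
    """Yield only the first `count` batches."""
    return islice(batches, count)

# Hypothetical toy data standing in for (review, label) pairs.
toy_samples = [("great film", 1), ("awful plot", 0), ("loved it", 1)]

first_batches = list(take(batch(toy_samples, 2), 1))
print(first_batches)  # [[('great film', 1), ('awful plot', 0)]]
```

Only one batch of two samples is produced, which mirrors why the loop above prints exactly two reviews.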

## Defining the preprocess function
* Now we will create the preprocessing function, which will:

    * Truncate the reviews, keeping only the first 300 characters of each, since you can generally tell whether a review is positive or not from the first sentence or two.

    * Use regular expressions to replace `<br/>` tags with spaces, and to replace any characters other than letters and quotes with spaces.

    * Finally, split the reviews on whitespace, which returns a ragged tensor, and convert this ragged tensor to a dense tensor, padding all reviews with the padding token `<pad>` so that they all have the same length.

In [18]:
#preprocessing the data

def preprocess(X_batch, y_batch):
    X_batch = tf.strings.substr(X_batch, 0, 300)
    X_batch = tf.strings.regex_replace(X_batch, rb"<br\s*/?>", b" ")
    X_batch = tf.strings.regex_replace(X_batch, b"[^a-zA-Z']", b" ")
    X_batch = tf.strings.split(X_batch)
    return X_batch.to_tensor(default_value=b"<pad>"), y_batch

`tf.strings` - Operations for working with string Tensors.

`tf.strings.substr(X_batch, 0, 300)` - For each string in the input tensor `X_batch`, returns the substring starting at index `pos` (here 0) with a total length of `len` (here 300). Positions are counted in bytes by default, so this keeps the first 300 bytes of each review.

`tf.strings.regex_replace(X_batch, rb"<br\s*/?>", b" ")` - Replaces every match of the regex pattern `<br\s*/?>` in the elements of `X_batch` with the rewrite string (here a single space).

`tf.strings.split(X_batch)` - Splits each element of `X_batch` on whitespace into a `RaggedTensor`.

`X_batch.to_tensor(default_value=b"<pad>")` - Converts the `RaggedTensor` into a `tf.Tensor`. `default_value` is the value used to fill positions that are not specified in `X_batch`, so shorter reviews are padded with `<pad>`.
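The same four steps can be traced on ordinary Python strings. The sketch below is an illustration only: `preprocess_one` and `pad_batch` are hypothetical helper names, the two reviews are made up, and Python slicing counts characters whereas `tf.strings.substr` counts bytes by default.

```python
import re

def preprocess_one(review, max_chars=300):
    """Truncate, strip <br> tags and non-letters, then split on whitespace."""
    text = review[:max_chars]
    text = re.sub(r"<br\s*/?>", " ", text)
    text = re.sub(r"[^a-zA-Z']", " ", text)
    return text.split()

def pad_batch(token_lists, pad_token="<pad>"):
    """Pad every token list to the length of the longest one in the batch."""
    longest = max(len(tokens) for tokens in token_lists)
    return [tokens + [pad_token] * (longest - len(tokens))
            for tokens in token_lists]

# Made-up reviews, used only to show the effect of each step.
reviews = ["Great movie!<br />Loved it.", "Don't bother."]
tokens = [preprocess_one(review) for review in reviews]
padded = pad_batch(tokens)

print(tokens)  # [['Great', 'movie', 'Loved', 'it'], ["Don't", 'bother']]
print(padded)  # [['Great', 'movie', 'Loved', 'it'], ["Don't", 'bother', '<pad>', '<pad>']]
```

Note that the padding length depends on the longest review in the batch, which is why the amount of `<pad>` tokens changes with the batch size.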

In [19]:
#Let's see what the data looks like after preprocessing
preprocess(X_batch, y_batch)

(<tf.Tensor: shape=(2, 53), dtype=string, numpy=
 array([[b'This', b'was', b'an', b'absolutely', b'terrible', b'movie',
         b"Don't", b'be', b'lured', b'in', b'by', b'Christopher',
         b'Walken', b'or', b'Michael', b'Ironside', b'Both', b'are',
         b'great', b'actors', b'but', b'this', b'must', b'simply', b'be',
         b'their', b'worst', b'role', b'in', b'history', b'Even',
         b'their', b'great', b'acting', b'could', b'not', b'redeem',
         b'this', b"movie's", b'ridiculous', b'storyline', b'This',
         b'movie', b'is', b'an', b'early', b'nineties', b'US',
         b'propaganda', b'pi', b'<pad>', b'<pad>', b'<pad>'],
        [b'I', b'have', b'been', b'known', b'to', b'fall', b'asleep',
         b'during', b'films', b'but', b'this', b'is', b'usually', b'due',
         b'to', b'a', b'combination', b'of', b'things', b'including',
         b'really', b'tired', b'being', b'warm', b'and', b'comfortable',
         b'on', b'the', b'sette', b'and', b'having', b'j

## Constructing the Vocabulary

Next, we will construct the vocabulary. This requires going through the whole training set once, applying our `preprocess()` function, and using a `Counter()` to count the number of occurrences of each word.

In [25]:
from collections import Counter
vocabulary = Counter()

#For each review in every batch of the train data, count how often each word occurs

for X_batch, y_batch in datasets["train"].batch(2).map(preprocess):
    for review in X_batch:
        vocabulary.update(list(review.numpy()))
        

`Counter().update()`: adds counts to the `Counter`; each element of the iterable passed in increases the count for that element by one.

`map(myfunc)` on a TensorFlow dataset applies the function `myfunc` to every element of the dataset. Because we call `map` after `batch(2)`, `preprocess` receives one batch at a time.
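As a small illustration of how the counts accumulate across batches, the following sketch feeds a few hand-written token lists into a `Counter`. The toy batches are hypothetical and use plain strings instead of the byte strings produced by `preprocess()`.

```python
from collections import Counter

vocabulary = Counter()

# Hypothetical preprocessed batches: each batch is a list of token lists.
toy_batches = [
    [["good", "movie", "<pad>"], ["bad", "movie", "good"]],
    [["good", "plot"]],
]

for toy_batch in toy_batches:
    for review in toy_batch:
        # update() adds one to the count of every token in the review.
        vocabulary.update(review)

print(vocabulary["good"])         # 3
print(vocabulary.most_common(2))  # [('good', 3), ('movie', 2)]
```

`most_common(n)` returns the `n` highest counts directly, so it can be used in place of slicing the full sorted list.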

In [26]:
#Let’s look at the 5 most common words

vocabulary.most_common()[:5]

[(b'<pad>', 63155),
 (b'the', 61137),
 (b'a', 38564),
 (b'of', 33983),
 (b'and', 33431)]

In [27]:
#Let us find the length of the vocabulary
len(vocabulary)

53893

## Truncating the Vocabulary

There are more than 50,000 words in the vocabulary, so let us truncate it and keep only the 10,000 most common words.

In [31]:
vocab_size = 10000
truncated_vocabulary = [
    word for word, count in vocabulary.most_common()[:vocab_size]]

## Creating a lookup table
A computer can only process numbers, not words. Thus we need to convert the words in `truncated_vocabulary` into numbers.

So we now need to add a preprocessing step to replace each word with its ID (i.e., its index in the truncated_vocabulary). We will create a lookup table for this, using 1,000 out-of-vocabulary (oov) buckets.

We shall create the lookup table such that the most frequently occurring words have lower indices than less frequently occurring words.

In [34]:
words = tf.constant(truncated_vocabulary)

In [35]:
#Create the word_ids using the corresponding indices of words in truncated_vocabulary.
word_ids = tf.range(len(truncated_vocabulary), dtype=tf.int64)

In [36]:
vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)

In [37]:
num_oov_buckets = 1000
table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets)

In [38]:
#Let's use the above table to look up the IDs of a few words:
table.lookup(tf.constant([b"This movie was faaaaaantastic".split()]))

<tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[   22,    12,    11, 10053]])>

**NOTE:**

`tf.lookup.KeyValueTensorInitializer`: Table initializer given keys and values tensors.

`tf.lookup.StaticVocabularyTable`: String-to-ID table wrapper that assigns out-of-vocabulary keys to buckets.

A word that is not in the vocabulary is mapped to a bucket ID between `vocab_size` and `vocab_size + num_oov_buckets - 1` (here 10000 to 10999), calculated as `hash(<term>) % num_oov_buckets + vocab_size`. This is why `faaaaaantastic` above received the ID 10053.

`table.lookup`: Looks up keys in the table and outputs the corresponding values.
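The bucket formula can be illustrated with a plain-Python analogue of the lookup table. This is a sketch under stated assumptions: the tiny vocabulary and the `lookup` helper are hypothetical, and `zlib.crc32` merely stands in for the hash function TensorFlow uses internally, so the bucket chosen for an unknown word will differ from the one TensorFlow picks.

```python
import zlib

# Hypothetical tiny vocabulary, ordered from most to least frequent.
truncated_vocabulary = [b"<pad>", b"the", b"movie", b"was"]
vocab_size = len(truncated_vocabulary)
num_oov_buckets = 3

# Known words map to their index in the truncated vocabulary.
word_to_id = {word: index for index, word in enumerate(truncated_vocabulary)}

def lookup(word):
    """Return the word's ID, or an out-of-vocabulary bucket ID if unknown."""
    if word in word_to_id:
        return word_to_id[word]
    # Stand-in hash (assumption): TensorFlow uses its own hash function.
    return zlib.crc32(word) % num_oov_buckets + vocab_size

ids = [lookup(word) for word in b"the movie was faaaaaantastic".split()]
print(ids[:3])  # [1, 2, 3]
print(vocab_size <= ids[3] < vocab_size + num_oov_buckets)  # True
```

Using several buckets rather than a single "unknown" ID lets the model distinguish, at least partially, between different unknown words.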

## Creating the Final Train and Test sets
Now we will create the final training and test sets.

For creating the final training set train_set,

* we batch the reviews

* then we convert them to short sequences of words using the preprocess() function

* then encode these words using a simple encode_words() function that uses the table we just built and finally prefetch the next batch.

Evaluating the model (after training) takes a lot of time, so for the test set we use large batches of 1000 samples. Note that `batch(1000)` does not restrict the test set to 1000 samples: it groups all 25,000 test reviews into 25 batches of 1000 samples each.

For creating the final test set test_set,

* we batch the test samples in groups of 1000

* then we convert them to short sequences of words using the preprocess() function

* then encode these words using a simple encode_words() function that uses the table we just built.

In [47]:
#Defining the encode_words() function to encode the words of a batch using the lookup table.

def encode_words(X_batch, y_batch):
    return table.lookup(X_batch), y_batch

In [49]:
# Repeat the train data indefinitely, batch it into groups of 32 samples, and apply preprocess to every batch.

train_set = datasets["train"].repeat().batch(32).map(preprocess)

In [50]:
# Apply encode_words to the train_set and prefetch the next batch in parallel.

train_set = train_set.map(encode_words).prefetch(1)

In [51]:
#Similarly, applying the function preprocess on the test data datasets["test"].


test_set = datasets["test"].batch(1000).map(preprocess)


In [52]:
#Applying the function encode_words to the test_set.

test_set = test_set.map(encode_words)

In [53]:
#Let us see what the first batch of the resulting train_set looks like:

for X_batch, y_batch in train_set.take(1):
    print(X_batch)
    print(y_batch)

tf.Tensor(
[[  22   11   28 ...    0    0    0]
 [   6   21   70 ...    0    0    0]
 [4099 6881    1 ...    0    0    0]
 ...
 [  22   12  118 ...  331 1047    0]
 [1757 4101  451 ...    0    0    0]
 [3365 4392    6 ...    0    0    0]], shape=(32, 60), dtype=int64)
tf.Tensor([0 0 0 1 1 1 0 0 0 0 0 1 1 0 1 0 1 1 1 0 1 1 1 1 1 0 0 0 1 0 0 0], shape=(32,), dtype=int64)


**Note:**

`dataset.repeat().batch(32)` repeats the dataset indefinitely and groups its samples into batches of 32.

`dataset.repeat().batch(32).map(preprocess)` applies the function `preprocess` to every batch.

`dataset.map(encode_words).prefetch(1)` applies the function `encode_words` to every batch and fetches the next batch in parallel while the current one is being processed.
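The effect of calling `repeat()` before `batch()` can be seen in a plain-Python analogue. This is a sketch only: `repeat_and_batch` is a hypothetical helper built on `itertools`, not a `tf.data` function, and prefetching is not modelled.

```python
from itertools import cycle, islice

def repeat_and_batch(samples, batch_size):
    """Yield batches forever from an endlessly repeated stream of samples."""
    stream = cycle(samples)  # plays the role of repeat()
    while True:
        yield list(islice(stream, batch_size))  # plays the role of batch()

toy_samples = [0, 1, 2, 3, 4]
batches = repeat_and_batch(toy_samples, batch_size=2)

first_three = list(islice(batches, 3))
print(first_three)  # [[0, 1], [2, 3], [4, 0]]
```

The third batch mixes the last sample of one pass with the first sample of the next, and the stream never ends. This is why training on such a dataset needs an explicit number of steps per epoch.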

## Building the Model
* Now that we have preprocessed and created the dataset, we can create the model:

    * The first layer is an Embedding layer, which will convert word IDs into embeddings. The embedding matrix needs to have one row per word ID (vocab_size + num_oov_buckets) and one column per embedding dimension (this example uses 128 dimensions, but this is a hyperparameter you could tune).
    * Whereas the inputs of the model will be 2D tensors of shape [batch size, time steps], the output of the Embedding layer will be a 3D tensor of shape [batch size, time steps, embedding size].

**Note:**

* keras.layers.Embedding : Turns positive integers (indexes) into dense vectors of fixed size.
* keras.layers.GRU : The GRU(Gated Recurrent Unit) Layer.
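The shape change performed by the embedding layer can be reproduced with NumPy indexing. This is a minimal sketch, assuming a randomly initialised matrix in place of trained embeddings; the word IDs are hand-picked for illustration, and the mask line only mimics the idea behind `mask_zero=True`, where ID 0 (the `<pad>` token, the most common entry in the vocabulary above) is ignored.

```python
import numpy as np

vocab_size = 10000
num_oov_buckets = 1000
embed_size = 128

# One row per word ID, one column per embedding dimension (random stand-in).
rng = np.random.default_rng(42)
embedding_matrix = rng.normal(size=(vocab_size + num_oov_buckets, embed_size))

# Hypothetical batch of 2 reviews with 6 time steps each; 0 is the padding ID.
word_ids = np.array([[22, 12, 11, 10053, 0, 0],
                     [6, 21, 70, 0, 0, 0]])

embedded = embedding_matrix[word_ids]  # look up one row per word ID
mask = word_ids != 0                   # True where the token is not padding

print(word_ids.shape)         # (2, 6)
print(embedded.shape)         # (2, 6, 128)
print(embedding_matrix.size)  # 1408000
print(mask.sum(axis=1))       # [4 3]
```

The matrix alone holds (10000 + 1000) x 128 = 1,408,000 values, which makes the embedding layer by far the largest part of this small model.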

In [58]:
#Set embed_size to 128, which is the embedding size of each word.
embed_size = 128


Create the model `model` with

* Embedding layer

* GRU layer with 4 units

* GRU layer with 2 units

* Dense layer with 1 unit and sigmoid activation

In [59]:
model = keras.models.Sequential([
    keras.layers.Embedding(vocab_size + num_oov_buckets, embed_size,
               mask_zero=True,
               input_shape=[None]),
    keras.layers.GRU(4, return_sequences=True),
    keras.layers.GRU(2),
    keras.layers.Dense(1, activation="sigmoid")
])

In [60]:
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
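To see what the sigmoid output, the binary cross-entropy loss, and the accuracy metric measure, here is a small hand computation in plain Python. It is a sketch under assumptions: the raw scores and labels are made up, the `eps` clipping value is chosen only to avoid `log(0)`, and predictions are thresholded at 0.5.

```python
import math

def sigmoid(z):
    """Squash a raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for label, prob in zip(y_true, y_prob):
        prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
        total -= label * math.log(prob) + (1 - label) * math.log(1.0 - prob)
    return total / len(y_true)

# Made-up raw scores of the final layer and the true sentiment labels.
raw_scores = [2.0, -1.0, 0.5]
labels = [1, 0, 0]

probs = [sigmoid(z) for z in raw_scores]
predictions = [int(p >= 0.5) for p in probs]
loss = binary_crossentropy(labels, probs)
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

print(predictions)     # [1, 0, 1]
print(round(loss, 3))  # 0.471
print(accuracy)
```

The third review is misclassified, and it also contributes the largest share of the loss because the model assigned a fairly high probability to the wrong class.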


## Training and Testing the Model
* It's time for training the model on the train data.

* Let us also measure the time of training using time module.

* Finally, let us test the model performance on the test data.

In [63]:
import time
start = time.time()
model.fit(train_set, steps_per_epoch= train_size // 32, epochs=2)

Train for 781 steps
Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x7f113834e5f8>

In [64]:
end = time.time()

In [65]:
print("Time of execution:", end-start)

Time of execution: 417.3640847206116
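The numbers reported above can be checked with a little arithmetic. The sketch below uses only values shown in this notebook (25,000 training reviews, batch size 32, 2 epochs, and the measured time); the per-step figure is a rough estimate, since the timer also covers any overhead outside the training steps.

```python
train_size = 25000         # training reviews reported by info.splits
batch_size = 32
epochs = 2
elapsed_seconds = 417.364  # measured time of execution, rounded

steps_per_epoch = train_size // batch_size            # full batches per epoch
samples_per_epoch = steps_per_epoch * batch_size
leftover = train_size - samples_per_epoch             # samples not covered
total_steps = steps_per_epoch * epochs
seconds_per_step = elapsed_seconds / total_steps

print(steps_per_epoch)    # 781
print(samples_per_epoch)  # 24992
print(leftover)           # 8
print(total_steps)        # 1562
print(round(seconds_per_step, 3))
```

The value 781 matches the "Train for 781 steps" line in the training output. Because the training set is repeated, the 8 leftover samples are not lost; they simply end up in the first batch of the next pass.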


In [66]:
model.evaluate(test_set)

     25/Unknown - 10s 407ms/step - loss: 0.5338 - accuracy: 0.7560

[0.5337878572940826, 0.75596]