## RNN Sentiment

A common dataset for language processing examples is the IMDB reviews dataset. It consists of 50,000 movie reviews in English, each with a simple binary target indicating whether the review is negative (0) or positive (1).

### Setup

First, we will import libraries and define constants and functions that will help us during the examples in this notebook.

In [1]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Is this notebook running on Colab or Kaggle?
IS_COLAB = "google.colab" in sys.modules
IS_KAGGLE = "kaggle_secrets" in sys.modules

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

# TensorFlow ≥2.0 is required
import tensorflow as tf
from tensorflow import keras
assert tf.__version__ >= "2.0"

if not tf.config.list_physical_devices('GPU'):
    print("No GPU was detected. LSTMs and CNNs can be very slow without a GPU.")
    if IS_COLAB:
        print("Go to Runtime > Change runtime and select a GPU hardware accelerator.")
    if IS_KAGGLE:
        print("Go to Settings > Accelerator and select GPU.")

# Common imports
import numpy as np
import os
from pathlib import Path

# to make this notebook's output stable across runs
my_seed = 42
np.random.seed(my_seed)
tf.random.set_seed(my_seed)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "nlp"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

No GPU was detected. LSTMs and CNNs can be very slow without a GPU.


In [2]:
tf.random.set_seed(my_seed)

### Understanding the Data

The dataset is already preprocessed. `X_train` contains the reviews, each represented as a sequence of integers, where each integer represents a word. All punctuation was removed, and the text was converted to lowercase, split on spaces, and finally indexed by frequency. The integers 0, 1, and 2 represent the padding token, the *start-of-sequence* token, and unknown words, respectively.

In [3]:
(X_train, y_train), (X_test, y_test) = keras.datasets.imdb.load_data()
X_train[0][:10]

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65]

To see the first ten words of the first review in the training set, we map each ID back to its word. The IDs returned by `get_word_index()` are shifted by 3 to make room for the three special tokens. Notice that the first word is *\<sos\>*, which marks the start of the sequence.

In [4]:
word_index = keras.datasets.imdb.get_word_index()
id_to_word = {id_ + 3: word for word, id_ in word_index.items()}
for id_, token in enumerate(("<pad>", "<sos>", "<unk>")):
    id_to_word[id_] = token
" ".join([id_to_word[id_] for id_ in X_train[0][:10]])

'<sos> this film was just brilliant casting location scenery story'

Next, we download the reviews as raw text using TensorFlow Datasets, so that we can do the preprocessing ourselves.

In [5]:
import tensorflow_datasets as tfds

datasets, info = tfds.load("imdb_reviews", as_supervised=True, with_info=True)

Let's take a look at the first two reviews. We print the first 200 characters of each review, as well as its target.

In [6]:
for X_batch, y_batch in datasets["train"].batch(2).take(1):
    for review, label in zip(X_batch.numpy(), y_batch.numpy()):
        print("Review:", review.decode("utf-8")[:200], "...")
        print("Label:", label, "= Positive" if label else "= Negative")
        print()

Review: This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting  ...
Label: 0 = Negative

Review: I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However  ...
Label: 0 = Negative





Let's write a preprocessing function that does the following:

* Truncate the reviews, keeping only the first 350 characters of each, in order to speed up training. We can generally tell whether a review is positive or not from its first sentence or two.
* Replace \<br /\> tags with spaces.
* Replace any character other than letters and quotes with spaces.
* Split the reviews on spaces.
* Convert the reviews to tensors, padding all reviews with the padding token so that they all have the same length.

In [7]:
def preprocess(X_batch, y_batch):
    X_batch = tf.strings.substr(X_batch, 0, 350)
    X_batch = tf.strings.regex_replace(X_batch, rb"<br\s*/?>", b" ")
    X_batch = tf.strings.regex_replace(X_batch, b"[^a-zA-Z']", b" ")
    X_batch = tf.strings.split(X_batch)
    return X_batch.to_tensor(default_value=b"<pad>"), y_batch
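
To make each step easier to inspect, here is a minimal plain-Python sketch of the same preprocessing applied to individual strings. The helper names `preprocess_text` and `pad_batch` are hypothetical and not part of the notebook. Note that Python slicing counts characters, whereas `tf.strings.substr` counts bytes by default, so results can differ for non-ASCII text.

```python
import re

def preprocess_text(review, max_chars=350):
    """Plain-Python mirror of the preprocessing steps for a single review."""
    review = review[:max_chars]                   # truncate the review
    review = re.sub(r"<br\s*/?>", " ", review)    # replace <br /> tags with spaces
    review = re.sub(r"[^a-zA-Z']", " ", review)   # keep only letters and quotes
    return review.split()                         # split on whitespace

def pad_batch(token_lists, pad_token="<pad>"):
    """Pad every token list to the length of the longest one in the batch."""
    max_len = max(len(tokens) for tokens in token_lists)
    return [tokens + [pad_token] * (max_len - len(tokens))
            for tokens in token_lists]

# Two made-up reviews, used only for illustration
batch = pad_batch([
    preprocess_text("This was great.<br />Don't miss it!"),
    preprocess_text("Terrible."),
])
print(batch)
```

The second review ends up with five `<pad>` tokens, so both rows have the same length, which is what `to_tensor(default_value=b"<pad>")` does for the ragged tensor.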

In [8]:
preprocess(X_batch, y_batch)

(<tf.Tensor: shape=(2, 61), dtype=string, numpy=
 array([[b'This', b'was', b'an', b'absolutely', b'terrible', b'movie',
         b"Don't", b'be', b'lured', b'in', b'by', b'Christopher',
         b'Walken', b'or', b'Michael', b'Ironside', b'Both', b'are',
         b'great', b'actors', b'but', b'this', b'must', b'simply', b'be',
         b'their', b'worst', b'role', b'in', b'history', b'Even',
         b'their', b'great', b'acting', b'could', b'not', b'redeem',
         b'this', b"movie's", b'ridiculous', b'storyline', b'This',
         b'movie', b'is', b'an', b'early', b'nineties', b'US',
         b'propaganda', b'piece', b'The', b'most', b'pathetic', b'scenes',
         b'were', b'those', b'when', b'the', b'<pad>', b'<pad>', b'<pad>'],
        [b'I', b'have', b'been', b'known', b'to', b'fall', b'asleep',
         b'during', b'films', b'but', b'this', b'is', b'usually', b'due',
         b'to', b'a', b'combination', b'of', b'things', b'including',
         b'really', b'tired', b'being', 

Next, we need to construct the vocabulary. This requires going through the whole training set once, applying our `preprocess` function, and using a `Counter` to count the number of occurrences of each word.

In [9]:
from collections import Counter

vocabulary = Counter()
for X_batch, y_batch in datasets["train"].batch(32).map(preprocess):
    for review in X_batch:
        vocabulary.update(list(review.numpy()))

Let's look at the three most common words.

In [10]:
vocabulary.most_common()[:3]

[(b'<pad>', 245391), (b'the', 71886), (b'a', 44352)]

In [11]:
len(vocabulary)

57722

We probably don't need our model to know every word in the vocabulary to get good performance, so let's truncate the vocabulary, keeping only the 20,000 most common words.

In [12]:
vocab_size = 20000
truncated_vocabulary = [
    word for word, count in vocabulary.most_common()[:vocab_size]]
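
One way to judge a cutoff is to measure what fraction of all word occurrences the kept words account for. The following is a rough sketch under that assumption; the `coverage` helper and the toy counts are hypothetical and not part of the notebook.

```python
from collections import Counter

def coverage(counter, top_k, ignore=()):
    """Fraction of token occurrences covered by the top_k most common tokens."""
    counts = Counter({tok: n for tok, n in counter.items() if tok not in ignore})
    total = sum(counts.values())
    if total == 0:
        return 0.0
    kept = sum(n for _, n in counts.most_common(top_k))
    return kept / total

# Toy counts, made up for illustration only
toy_vocabulary = Counter({b"the": 5, b"movie": 3, b"great": 1, b"faaaantastic": 1})
print(coverage(toy_vocabulary, top_k=2))  # 8 of 10 occurrences are covered
```

With the notebook's objects, the call would look like `coverage(vocabulary, vocab_size, ignore=(b"<pad>",))`, where the padding token is ignored because it is not a real word.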

Now, we need to add a preprocessing step to replace each word with its ID (i.e., its index in the vocabulary).

In [13]:
word_to_id = {word: index for index, word in enumerate(truncated_vocabulary)}
for word in b"This movie was faaaaaantastic".split():
    print(word_to_id.get(word, vocab_size))  # a default avoids treating ID 0 as missing

24
13
12
20000


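To implement this preprocessing step, we will create a lookup table using 1,000 out-of-vocabulary (OOV) buckets. Words that are not in the vocabulary are hashed into one of these buckets, so they receive IDs from 20,000 to 20,999.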

In [14]:
words = tf.constant(truncated_vocabulary)
word_ids = tf.range(len(truncated_vocabulary), dtype=tf.int64)
vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)
num_oov_buckets = 1000
table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets)
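
To make the role of the OOV buckets concrete, here is a small plain-Python sketch of the idea: known words keep their vocabulary ID, and unknown words get `vocab_size` plus a hash of the word modulo the number of buckets. The function `lookup_with_oov` and the toy vocabulary are hypothetical, and `zlib.crc32` is only a stand-in for TensorFlow's internal hash, so the bucket numbers will not match the ones the real table produces.

```python
import zlib

def lookup_with_oov(word, word_to_id, vocab_size, num_oov_buckets):
    """Return the vocabulary ID, or a hashed OOV bucket ID for unknown words."""
    if word in word_to_id:
        return word_to_id[word]
    # zlib.crc32 is a stand-in hash, not the one TensorFlow uses
    bucket = zlib.crc32(word) % num_oov_buckets
    return vocab_size + bucket

# Hypothetical toy vocabulary with three entries
toy_word_to_id = {b"<pad>": 0, b"the": 1, b"movie": 2}
for word in [b"movie", b"faaaaaantastic"]:
    print(word, lookup_with_oov(word, toy_word_to_id,
                                vocab_size=3, num_oov_buckets=1000))
```

Because the bucket depends only on the word itself, the same unknown word always maps to the same ID, while two different unknown words may collide in the same bucket.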

In [15]:
table.lookup(tf.constant([b"This movie was faaaaaantastic".split()]))

<tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[   24,    13,    12, 20053]])>

Now we can create the final training set. We batch the reviews, convert them to short sequences of words with the `preprocess` function, then encode these words using the lookup table. Finally, we prefetch the next batch.

In [16]:
def encode_words(X_batch, y_batch):
    return table.lookup(X_batch), y_batch

train_set = datasets["train"].batch(32).map(preprocess)
train_set = train_set.map(encode_words).prefetch(1)

In [17]:
for X_batch, y_batch in train_set.take(1):
    print(X_batch)
    print(y_batch)

tf.Tensor(
[[  24   12   28 ...    0    0    0]
 [   7   21   72 ...    0    0    0]
 [3965 6829    1 ...    0    0    0]
 ...
 [  24   13  121 ...    0    0    0]
 [1899 4215  465 ...    0    0    0]
 [3589 4756    7 ...    0    0    0]], shape=(32, 70), dtype=int64)
tf.Tensor([0 0 0 1 1 1 0 0 0 0 0 1 1 0 1 0 1 1 1 0 1 1 1 1 1 0 0 0 1 0 0 0], shape=(32,), dtype=int64)


Finally, let's create the model and train it.

The first layer is an `Embedding` layer, which converts word IDs into embeddings. The embedding matrix needs to have one row per word ID (`vocab_size + num_oov_buckets` rows in total) and one column per embedding dimension, which for this example is 128. The inputs have shape *[batch size, time steps]*, so the output of the `Embedding` layer will be 3D: *[batch size, time steps, embedding size]*.
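
As a rough illustration of the shapes involved, the sketch below performs an embedding lookup with plain Python lists: each word ID is replaced by the corresponding row of a small random matrix. The names and toy sizes are made up for the example, and the mask shown is only meant to convey the idea behind `mask_zero=True` (time steps whose ID is 0 are flagged as padding).

```python
import random

def embedding_lookup(id_batch, embedding_matrix):
    """Replace every word ID with its row of the embedding matrix."""
    return [[embedding_matrix[word_id] for word_id in sequence]
            for sequence in id_batch]

random.seed(42)
num_ids, toy_embed_size = 6, 4  # toy sizes; the notebook uses 21,000 rows and 128 columns
embedding_matrix = [[random.uniform(-0.05, 0.05) for _ in range(toy_embed_size)]
                    for _ in range(num_ids)]

id_batch = [[2, 5, 0], [3, 0, 0]]  # shape [batch size=2, time steps=3]
embedded = embedding_lookup(id_batch, embedding_matrix)

# True where the ID is a real word, False where it is the padding ID 0
mask = [[word_id != 0 for word_id in sequence] for sequence in id_batch]

print(len(embedded), len(embedded[0]), len(embedded[0][0]))  # 2 3 4
print(mask)
```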

The rest of the model consists of two GRU layers, with the second one returning only the output of the last time step. The output layer is a single neuron with the sigmoid activation function, which outputs the estimated probability that the review is positive.

In [18]:
embed_size = 128
model = keras.models.Sequential([
    keras.layers.Embedding(vocab_size + num_oov_buckets, embed_size,
                           mask_zero=True, # not shown in the book
                           input_shape=[None]),
    keras.layers.GRU(128, return_sequences=True),
    keras.layers.GRU(128),
    keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
history = model.fit(train_set, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


### Testing

In [19]:
for X_test_batch, y_test_batch in datasets["test"].batch(10).take(1):
    for review, label in zip(X_test_batch.numpy(), y_test_batch.numpy()):
        print("Review:", review.decode("utf-8")[:350], "...")
        print("Label:", label, "= Positive" if label else "= Negative")
        print()

Review: There are films that make careers. For George Romero, it was NIGHT OF THE LIVING DEAD; for Kevin Smith, CLERKS; for Robert Rodriguez, EL MARIACHI. Add to that list Onur Tukel's absolutely amazing DING-A-LING-LESS. Flawless film-making, and as assured and as professional as any of the aforementioned movies. I haven't laughed this hard since I saw TH ...
Label: 1 = Positive

Review: A blackly comic tale of a down-trodden priest, Nazarin showcases the economy that Luis Bunuel was able to achieve in being able to tell a deeply humanist fable with a minimum of fuss. As an output from his Mexican era of film making, it was an invaluable talent to possess, with little money and extremely tight schedules. Nazarin, however, surpasses ...
Label: 1 = Positive

Review: Scary Movie 1-4, Epic Movie, Date Movie, Meet the Spartans, Not another Teen Movie and Another Gay Movie. Making "Superhero Movie" the eleventh in a series that single handily ruined the parody genre. Now I'll admit it I hav



In [20]:
my_test_set = datasets["test"].batch(10).take(1)
my_test_set = my_test_set.map(preprocess)
my_test_set = my_test_set.map(encode_words).prefetch(1)

In [21]:
preds = model.predict(my_test_set)





In [22]:
["1 - Positive" if pred > 0.5 else "0 - Negative" for pred in model.predict(my_test_set)]





['1 - Positive',
 '0 - Negative',
 '0 - Negative',
 '0 - Negative',
 '0 - Negative',
 '1 - Positive',
 '1 - Positive',
 '1 - Positive',
 '0 - Negative',
 '0 - Negative']

In [23]:
preds

array([[8.7829018e-01],
       [1.8437991e-03],
       [5.6936301e-04],
       [2.2718550e-03],
       [2.1810536e-01],
       [9.9974781e-01],
       [9.9989420e-01],
       [9.9982154e-01],
       [2.4527591e-04],
       [1.3223001e-03]], dtype=float32)
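
To compare predictions with targets, the probabilities can be thresholded and checked against the labels. The sketch below uses hypothetical helpers `to_labels` and `accuracy`, and only the two labels that are fully visible in the test output above. It assumes that `my_test_set` yields the reviews in the same order in which they were printed.

```python
def to_labels(probabilities, threshold=0.5):
    """Convert probabilities of the positive class into 0/1 labels."""
    return [1 if p > threshold else 0 for p in probabilities]

def accuracy(probabilities, labels, threshold=0.5):
    """Fraction of thresholded predictions that match the labels."""
    predicted = to_labels(probabilities, threshold)
    correct = sum(int(p == y) for p, y in zip(predicted, labels))
    return correct / len(labels)

# First two values of `preds` and the two labels shown in the test output
first_probs = [0.87829018, 0.0018437991]
first_labels = [1, 1]

print(to_labels(first_probs))
print(accuracy(first_probs, first_labels))
```

Under that ordering assumption, the first review is classified correctly and the second one (labeled positive) is not. Two examples say little about overall quality. For a number over the whole test set, one option is to apply `preprocess` and `encode_words` to `datasets["test"]` and call `model.evaluate` on the result.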