# <span style="color:#0b486b">  FIT5215: Deep Learning (2022)</span>
***
*CE/Lecturer:*  **Dr Trung Le** | trunglm@monash.edu <br/> <br/>
*Tutor:*  **Mr Tuan Nguyen**  \[tuan.ng@monash.edu \] |**Mr Anh Bui** \[tuananh.bui@monash.edu\] | **Mr Xiaohao Yang** \[xiaohao.yang@monash.edu \] | **Mr Md Mohaimenuzzaman** \[md.mohaimen@monash.edu \] |**Mr Thanh Nguyen** \[Thanh.Nguyen4@monash.edu \] |
<br/> <br/>
Faculty of Information Technology, Monash University, Australia
******

# <span style="color:#0b486b">Tutorial 08b: RNN for Sentiment Analysis in TF 2.x</span><span style="color:red"></span> #

This tutorial shows you how to apply Recurrent Neural Networks in sequence classification. Particularly, we are going to explore how to implement an RNN for sentiment analysis with the dataset *IMDB movie review*.

## <span style="color:#0b486b">Part I: Implementation using existed dataset stored in a storage device</span><span style="color:red">*****</span> ##

In many real-world projects with recurrent models, you possess the data stored in a storage device. In this case, you are required to preprocess data from scratch. In what follows, we study how to cope with this situation. 

### <span style="color:#0b486b">I.1. The pipeline of sentence classification</span> ###


In what follows, we present the common pipeline of sentence classification (e.g., sentiment analysis or review analysis) from how to transform sentences as sequences of words to sequences of numbers to feed into a recurrent neural network.

Assume that we are learning from a tiny movie review dataset with the following sentences with labels:

<img src="./images/movie_dataset.png" align="left" width=500/>

We next extract the important or most frequent words to build up the vocabulary. Let's assume that we have the following vocabulary:

<img src="./images/vocabulary.png" align="left" width=500/>

Because the vocabulary does not contain all words in the dataset, there are some words being out of the vocabulary and further assume that we employ two indices (e.g., 7 and 8) for out of vocabulary bucket. Now each word in the dataset is associated with one index which is later useful to transform sentences into sequences of indices. In addition, from the vocabulary we can conduct two dictionaries:
- **word2indx**: keys are words and values are indices.
- **indx2word**: keys are indices and values are words.

Next, we assume that embedding size is $5$, meaning that each word will be embedded into a 5-dim vector space (each word becomes a 5-dim vector). For this purpose, we employ an embedding matrix as follows:

<img src="./images/embedding_matrix.png" align="left" width=300/>

Note that at first, this embedding matrix will be `randomly initialized` and then will be `learned` during the training process. We now explain how to use this embedding matrix to embed a mini-batch of two sentences to a 3D tensor. Assume that we have two following sentences in the mini-batch:
1. `I like this movie`.
2. `This is a bad movie to watch`.

Assume that we set sequence length (or timesteps) to 6. We need to pad the first sentence and truncate the second one as:
1.  `I like this movie` **pad pad**.
2. `This is a bad movie to`.

We next use the vocabulary and word2indx dictionary to transform two sentences into two sequences of indices in the form of a 2D tensor with the shape $[batch\_size, seq\_length]$.
1. `I like this movi` $\rightarrow$ $[7, 1, 8, 7, 7, 7]$.
2. `This is a bad movie to watch` $\rightarrow$ $[8, 8, 7, 3, 7, 7]$.

Next, we use embedding matrix to transform indices to embedding vectors as:
1. `I like this movie` $\rightarrow$ $[7, 1, 8, 7, 7, 7]$ $\rightarrow$ $[U_7, U_1, U_8, U_7, U_7, U_7]$.
2. `This is a bad movie to watch` $\rightarrow$ $[8, 8, 7, 3, 7, 7]$ $\rightarrow$ $[U_8, U_8, U_7, U_3, U_7, U_7]$. 

We then concatenate two sequences of embedding vectors to a 3D tensor with the shape $[batch\_size, seq\_length, embed\_size]$. This 3D tensor will be later fed to subsequent recurrent layers of cells. Note that this embedding process will be performed automatically in TF 2.x using the embedding layer.
-  `tf.keras.layers.Embedding`.

The following figure details the entire process from inputting a batch of sentences, embedding to a 3D tensor, to feeding through recurrent layers.

<img src="./images/all_in_once.png" align="left" width=1000/>

### <span style="color:#0b486b">I.2. Download, play around, and preprocess with the dataset</span> ###

Assume that our dataset can be found online at a specific URL. We need to write code to download the dataset and store it on our storage device.

In [1]:
import tensorflow as tf
import os
import numpy as np

The function *download_and_read(url)* supports us to download the dataset at the specific url and store to the folder *datasets* inside the current directory. 

In [2]:
def download_and_read(url):
    local_file = url.split('/')[-1]
    local_file = local_file.replace("%20", " ")
    p = tf.keras.utils.get_file(local_file, url, extract=True, cache_dir=".")
    local_folder = os.path.join("datasets", local_file.split('.')[0])
    labeled_sentences = []
    for labeled_filename in os.listdir(local_folder):
        if labeled_filename.endswith("_labelled.txt"):
            with open(os.path.join(local_folder, labeled_filename), "r") as f:
                for line in f:
                    sentence, label = line.strip().split('\t')
                    labeled_sentences.append((sentence, label))
    return labeled_sentences

We now employ the above function to download the online dataset and store it on our hard disk. At this step, we also have the dataset including sentences and labels.

In [3]:
labeled_sentences = download_and_read("https://archive.ics.uci.edu/ml/machine-learning-databases/" + "00331/sentiment%20labelled%20sentences.zip")
sentences = [s for (s, l) in labeled_sentences]
labels = [int(l) for (s, l) in labeled_sentences]

In [4]:
for (s,l) in zip(sentences[0:5], labels[0:5]):
    print("Sentence: {}".format(s))
    print("Label: {}".format(l))
    print("\n")

Sentence: So there is no way for me to plug it in here in the US unless I go by a converter.
Label: 0


Sentence: Good case, Excellent value.
Label: 1


Sentence: Great for the jawbone.
Label: 1


Sentence: Tied to charger for conversations lasting more than 45 minutes.MAJOR PROBLEMS!!
Label: 0


Sentence: The mic is great.
Label: 1




We need to inspect the sentences in the dataset to work out the vocabulary and two dictionaries: *word2idx* and *idx2word*. This task is very convenient with the assistance of *tf.keras.preprocessing.text.Tokenizer*.

In [5]:
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(sentences)
vocab_size = len(tokenizer.word_counts)
print("vocabulary size: {:d}".format(vocab_size))
word2idx = tokenizer.word_index
idx2word = {v:k for (k, v) in word2idx.items()}

vocabulary size: 5271


In [6]:
print(list(word2idx.items())[0:10])

[('the', 1), ('and', 2), ('i', 3), ('a', 4), ('is', 5), ('it', 6), ('to', 7), ('this', 8), ('of', 9), ('was', 10)]


In [7]:
print(list(idx2word.items())[0:10])

[(1, 'the'), (2, 'and'), (3, 'i'), (4, 'a'), (5, 'is'), (6, 'it'), (7, 'to'), (8, 'this'), (9, 'of'), (10, 'was')]


We need to choose the sequence length for our RNN. The sequence length should be chosen so that the truncated sentences with the sequence length contain substantial information for our task. In what follows, we investigate the percentiles to decide the sequence length.

In [8]:
seq_lengths = np.array([len(s.split()) for s in sentences])
print([(p, np.percentile(seq_lengths, p)) for p
in [75, 80, 90, 95, 99, 100]])

[(75, 16.0), (80, 18.0), (90, 22.0), (95, 26.0), (99, 36.0), (100, 71.0)]


We decide to choose $max\_seqlen=64$ which is the percentile of $>99\%$, hence covering almost all sentences. We then transform the dataset of sentences to that of indices. Finally, we create a Tensorflow dataset (i.e., stored in the variable *dataset*). Tensorflow dataset assists us in manipulating the data and splitting it into mini-batches later. You can find more information on Tensorflow datasets [here](https://www.tensorflow.org/api_docs/python/tf/data/Dataset).

In [9]:
max_seqlen = 64
# create dataset
sentences_as_ints = tokenizer.texts_to_sequences(sentences) #transform to sequences of indices
sentences_as_ints = tf.keras.preprocessing.sequence.pad_sequences(sentences_as_ints, maxlen=max_seqlen, padding= "post")  #pad some 0(s) or truncate
labels_as_ints = np.array(labels)
dataset = tf.data.Dataset.from_tensor_slices((sentences_as_ints, labels_as_ints))

We now split the dataset to train_dataset, val_dataset, and test_dataset. Subsequently, we create batch datasets for training, valid, and test sets. 

In [10]:
dataset = dataset.shuffle(10000)
test_size = len(sentences) // 3
val_size = (len(sentences) - test_size) // 10
test_dataset = dataset.take(test_size)
val_dataset = dataset.skip(test_size).take(val_size)
train_dataset = dataset.skip(test_size + val_size)
batch_size = 64
train_dataset = train_dataset.batch(batch_size)
val_dataset = val_dataset.batch(batch_size)
test_dataset = test_dataset.batch(batch_size)

In [11]:
print(train_dataset)

<BatchDataset element_spec=(TensorSpec(shape=(None, 64), dtype=tf.int32, name=None), TensorSpec(shape=(None,), dtype=tf.int32, name=None))>


### <span style="color:#0b486b">I.2. Building up and training the model</span> ###

In what follows, we build up an RNN with the stack of two GRU cells. The input $x$ has the shape $batch\_size \times timesteps$, however, according to the Keras convention, we omit the $batch\_size$. The embedding layer takes the input $x$ and transforms it to 3D tensor $batch\_size \times timesteps \times embed\_size$. When declaring an embedding layer, we need to specify vocabulary size ($vocab\_size + 1$, and the $embed\_size$. TF Keras will automatically infer the $timesteps$.

In [12]:
embed_size = 128
model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(vocab_size+1, embed_size,
                           mask_zero=True, # padded 0(s) at the end of each sequence of indices will be ignored during training
                           input_shape=[None]),
    tf.keras.layers.GRU(128, return_sequences=True),
    tf.keras.layers.GRU(128),
    tf.keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
history = model.fit(train_dataset, validation_data=val_dataset, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In case you get the error `Cannot convert a symbolic Tensor (gru/strided_slice:0) to a numpy array`, please downgrade the Numpy version to 1.19, restart the kernel, and rerun the code.

In [13]:
model.evaluate(test_dataset)



[0.015705283731222153, 0.9940000176429749]

In the declaration of `tf.keras.layers.Embedding()`, there is a quite important parameter named `mask_zero` which can be set to `True` or `False`. If we set `mask_zero = True`, zero values padded at the end of the sequence of indices for one sentence will be ignored during training. This means that we can handle variable-length sentences more elegantly. Otherwise, if we set `mask_zero = False`, those zero values padded will be treated as a normal index.

Let us together consider the following example.

In [14]:
raw_inputs = [
    [711, 632, 71],
    [73, 8, 3215, 55, 927],
    [83, 91, 1, 645, 1253, 927],
]

# By default, this will pad using 0s; it is configurable via the "value" parameter.
# Note that you could "pre" padding (at the beginning) or "post" padding (at the end).
padded_inputs = tf.keras.preprocessing.sequence.pad_sequences(raw_inputs, padding="post")
print(padded_inputs)

[[ 711  632   71    0    0    0]
 [  73    8 3215   55  927    0]
 [  83   91    1  645 1253  927]]


In [15]:
embedding = tf.keras.layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True)
masked_output = embedding(padded_inputs)

print(masked_output._keras_mask)

tf.Tensor(
[[ True  True  True False False False]
 [ True  True  True  True  True False]
 [ True  True  True  True  True  True]], shape=(3, 6), dtype=bool)


If we set `mask_zero = True`, zero values marked `False` will be ignored when passing to other layers during training. 

In [16]:
print(masked_output._keras_mask)
print(masked_output.shape)

tf.Tensor(
[[ True  True  True False False False]
 [ True  True  True  True  True False]
 [ True  True  True  True  True  True]], shape=(3, 6), dtype=bool)
(3, 6, 16)


Here is the code to reuse the mask in the following layers.

In [17]:
mask = embedding.compute_mask(padded_inputs)
h = tf.keras.layers.LSTM(32)(masked_output, mask= mask)
print(h.shape)

(3, 32)


We now set `mask_zero = False` to see the difference in performance.

In [18]:
embed_size = 128
model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(vocab_size+1, embed_size,
                           mask_zero= False, #padded 0(s) at the end of each sequence of indices will be taken into account during training
                           input_shape=[None]),
    tf.keras.layers.GRU(128, return_sequences=True),
    tf.keras.layers.GRU(128),
    tf.keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
history = model.fit(train_dataset, validation_data= val_dataset, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [19]:
model.evaluate(test_dataset)



[0.6932116746902466, 0.48899999260902405]

## <span style="color:#0b486b">Part II: Implementation with Tensorflow Keras dataset</span><span style="color:red">***</span> ##

Since in the tutorial, we will use *tensorflow_datasets* to download and store the IMDB movie review dataset on the hard disk, we need first to install this module using pip. Please activate your TensorFlow environment and install the package *tensorflow_datasets* using:
- > <span style="color:red"> pip install tensorflow_datasets </span>

We first import the necessary modules.

In [20]:
import tensorflow as tf
import tensorflow_datasets as tfds

For some datasets supported by Tensorflow Keras, you can use *tensorflow_datasets* to download and preprocess. In what follows, we explicitly implement the pipeline from downloading the dataset, preprocessing it, and building up an RNN model to train this dataset.

### <span style="color:#0b486b">II.1. Download, play around, and preprocess with the dataset</span> ###

In [21]:
tf.random.set_seed(42)

In [22]:
datasets, info = tfds.load("imdb_reviews", as_supervised=True, with_info=True)

[1mDownloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to ~\tensorflow_datasets\imdb_reviews\plain_text\1.0.0...[0m


Dl Completed...: 0 url [00:00, ? url/s]

Dl Size...: 0 MiB [00:00, ? MiB/s]

Generating splits...:   0%|          | 0/3 [00:00<?, ? splits/s]

Generating train examples...: 0 examples [00:00, ? examples/s]

Shuffling ~\tensorflow_datasets\imdb_reviews\plain_text\1.0.0.incompleteYKBLNW\imdb_reviews-train.tfrecord*...…

Generating test examples...: 0 examples [00:00, ? examples/s]

Shuffling ~\tensorflow_datasets\imdb_reviews\plain_text\1.0.0.incompleteYKBLNW\imdb_reviews-test.tfrecord*...:…

Generating unsupervised examples...: 0 examples [00:00, ? examples/s]

Shuffling ~\tensorflow_datasets\imdb_reviews\plain_text\1.0.0.incompleteYKBLNW\imdb_reviews-unsupervised.tfrec…

[1mDataset imdb_reviews downloaded and prepared to ~\tensorflow_datasets\imdb_reviews\plain_text\1.0.0. Subsequent calls will reuse this data.[0m


In [23]:
datasets.keys()

dict_keys([Split('train'), Split('test'), Split('unsupervised')])

In [24]:
train_size = info.splits["train"].num_examples
test_size = info.splits["test"].num_examples
print("Train size: {}".format(train_size))
print("Test size: {}".format(test_size))

Train size: 25000
Test size: 25000


Here we take one mini-batch of $5$ reviews in the training set and print out the content of those reviews with their labels.

In [25]:
for X_batch, y_batch in datasets["train"].batch(5).take(1):
    for review, label in zip(X_batch.numpy(), y_batch.numpy()):
        print("Review:", review)
        print("Label:", label, "= Positive" if label else "= Negative")
        print("\n")

Review: b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."
Label: 0 = Negative


Review: b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film

As you can see from some reviews, there might be some strange characters, for example *\<br/\>, \<br  /\>*. The reason is that this dataset was extracted from the HTML format. Therefore, we need to preprocess the data by removing those strange characters and splitting a review into a list of words.

Regular expression is a great tool to process texts and strings. Fortunately, Tensorflow also supports regular expression via *tf.string.regex_place* with the following syntax:
- `tf.strings.regex_replace(input, pattern, rewrite, replace_global=True, name=None)`: we need to specify regular expression pattern in *pattern* and all substrings matched this pattern will be replaced by *rewrite*. Please refer to [here](https://github.com/google/re2/wiki/Syntax) for more detail of regular expressions.

We use the regular expression <span style="color:red"> <br\s*/?> </span> to find the pattern<span style="color:red"><br \[any character](one or many times) /(zero or one time)></span> and replace them by " ". We then replace any non-character (i.e., <span style="color:red">[^a-zA-Z]</span>) by " ".

We next split the reviews in the mini-batch into words and extract the first $100$ words from each review. This would only slightly drop the performance because, for most of the reviews, we can figure out its sentiment by taking a look at its first one or two sentences.



In [26]:
def preprocess(X_batch, y_batch):
    X_batch = tf.strings.regex_replace(X_batch, "<br\s*/?>", b" ")
    X_batch = tf.strings.regex_replace(X_batch, b"[^a-zA-Z]", b" ")
    X_batch = tf.strings.split(X_batch)
    X_batch = X_batch[:, :100]
    return X_batch.to_tensor(default_value=b"<pad>"), y_batch

In [27]:
preprocess(X_batch, y_batch)

(<tf.Tensor: shape=(5, 100), dtype=string, numpy=
 array([[b'This', b'was', b'an', b'absolutely', b'terrible', b'movie',
         b'Don', b't', b'be', b'lured', b'in', b'by', b'Christopher',
         b'Walken', b'or', b'Michael', b'Ironside', b'Both', b'are',
         b'great', b'actors', b'but', b'this', b'must', b'simply', b'be',
         b'their', b'worst', b'role', b'in', b'history', b'Even',
         b'their', b'great', b'acting', b'could', b'not', b'redeem',
         b'this', b'movie', b's', b'ridiculous', b'storyline', b'This',
         b'movie', b'is', b'an', b'early', b'nineties', b'US',
         b'propaganda', b'piece', b'The', b'most', b'pathetic', b'scenes',
         b'were', b'those', b'when', b'the', b'Columbian', b'rebels',
         b'were', b'making', b'their', b'cases', b'for', b'revolutions',
         b'Maria', b'Conchita', b'Alonso', b'appeared', b'phony', b'and',
         b'her', b'pseudo', b'love', b'affair', b'with', b'Walken',
         b'was', b'nothing', b'but',

### <span style="color:#0b486b">II.2. Create vocabulary and relevant dictionaries</span> ###

We declare *vocabulary* as a Counter to do statistics on the population of the words in the training set. We update the statistics using mini-batches of $32$. Note that for each mini-batch, we apply the function *preprocess* to preprocess the reviews in that mini-batch and split the reviews into an array of words.

In [28]:
from collections import Counter

vocabulary = Counter()
for X_batch, y_batch in datasets["train"].batch(32).map(preprocess):
    for review in X_batch:
        vocabulary.update(list(review.numpy()))

We now list the thirty most common words. As you can observe, the vocabulary is a list of 2-tuple of a word and its frequency.

In [29]:
vocabulary.most_common()[:30]

[(b'the', 114272),
 (b'<pad>', 90649),
 (b'a', 65720),
 (b'and', 62647),
 (b'of', 59225),
 (b'to', 51613),
 (b'is', 44804),
 (b'I', 43737),
 (b'in', 34599),
 (b'it', 33541),
 (b'this', 28635),
 (b'that', 27640),
 (b's', 24401),
 (b'was', 24078),
 (b'movie', 22842),
 (b'The', 20992),
 (b'film', 17231),
 (b'with', 16821),
 (b'as', 16297),
 (b'for', 15917),
 (b'but', 14140),
 (b'on', 13538),
 (b't', 13319),
 (b'have', 11726),
 (b'you', 11489),
 (b'not', 11304),
 (b'are', 11290),
 (b'one', 10626),
 (b'be', 10500),
 (b'his', 9601)]

In [30]:
len(vocabulary)

61865

We now create a truncated vocabulary of the $10,000$ most common words.

In [31]:
vocab_size = 10000
truncated_vocabulary = [word for word, count in vocabulary.most_common()[:vocab_size]]

We now create the dictionary *word_to_id* which allows us to quickly find the id for a given word.

In [32]:
word_to_id = {word: index for index, word in enumerate(truncated_vocabulary)}
print(list(word_to_id.items())[0:10])

[(b'the', 0), (b'<pad>', 1), (b'a', 2), (b'and', 3), (b'of', 4), (b'to', 5), (b'is', 6), (b'I', 7), (b'in', 8), (b'it', 9)]


The following code aims to find a list of indices for the words in a sentence. Note that because the word "faaaantastic" is not in the list, it is returned None.

In [33]:
[word_to_id.get(word) for word in b"This movie was faaaantastic".split()]

[32, 14, 13, None]

Next, we create a lookup table that enables us to return 2D tensor of indices from given sentences. It is very important to note that because we limit to consider the vocabulary of the most $1000$ common words, for a given sentence in the training or testing set, some words might be out of the list. To address this issue, we set the number of *out of vocabulary* bucket $num\_oov\_buckets = 1000$. Therefore, a word in *out of the vocabulary* will be mapped to one of *out of bucket* index. 

Note that you now can imagine that we have an extended vocabulary with the vocabulary size to be equal to $vocab\_size + num\_oov\_buckets= 11,000$.

In [34]:
words = tf.constant(truncated_vocabulary)
word_ids = tf.range(len(truncated_vocabulary), dtype=tf.int64)
vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)
num_oov_buckets = 1000
table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets)

As you can see, although the word "faaaaantastic" is not in the vocabulary, it is now mapped to the index $10791$ in the out of vocabulary bucket.

In [35]:
table.lookup(tf.constant([b"This movie was faaaaantastic".split()]))

<tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[   32,    14,    13, 10791]], dtype=int64)>

The function *encode_words* returns the word indices for the words in the sentences in a mini-batch. 

In [36]:
def encode_words(X_batch, y_batch):
    return table.lookup(X_batch), y_batch

In the following code, we first apply the function *preprocess* to preprocess the training set and then apply the function *encode_words* to convert words to their indices.

In [37]:
train_set = datasets["train"].repeat().batch(32).map(preprocess)
train_set = train_set.map(encode_words).prefetch(1)

In [38]:
for X_batch, y_batch in train_set.take(2):
    print(X_batch.shape)
    print(y_batch.shape)

(32, 100)
(32,)
(32, 100)
(32,)


### <span style="color:#0b486b">II.3. Create an RNN model and train on the training set</span> ###

In what follows, we build up an RNN with the stack of two GRU cells. The input $x$ has the shape $batch\_size \times timesteps$, however, according to the Keras convention, we omit the $batch\_size$. The embedding layer takes the input $x$ and transforms it to 3D tensor $batch\_size \times timesteps \times embed\_size$. When declaring an embedding layer, we need to specify vocabulary size (extended one with the size $vocab\_size + num\_oov\_buckets$, and the $embed\_size$. TF Keras will automatically infer the $timesteps$.

In [39]:
embed_size = 128
x = tf.keras.Input(shape=[None], dtype="int64")
print(x.shape)
h = tf.keras.layers.Embedding(vocab_size + num_oov_buckets, embed_size)(x)
print(h.shape)
h = tf.keras.layers.GRU(embed_size, return_sequences=True)(h)
print(h.shape)
h = tf.keras.layers.GRU(64)(h)
print(h.shape)
h = tf.keras.layers.Dense(1, activation="sigmoid")(h)
rnn_model = tf.keras.models.Model(inputs = x, outputs= h)

(None, None)
(None, None, 128)
(None, None, 128)
(None, 64)


In [40]:
rnn_model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
history = rnn_model.fit(train_set, steps_per_epoch=train_size // 32, epochs=2)

Epoch 1/2
Epoch 2/2


Here is another way to implement our RNN.

In [41]:
embed_size = 128
model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(vocab_size + num_oov_buckets, embed_size,
                           mask_zero=True, #this specifies that padded 0(s) will be ignored during training
                           input_shape=[None]),
    tf.keras.layers.GRU(128, return_sequences=True),
    tf.keras.layers.GRU(128),
    tf.keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
history = model.fit(train_set, steps_per_epoch=train_size // 32, epochs=2)

Epoch 1/2
Epoch 2/2


We now evaluate our model on the test set.

In [42]:
test_set = datasets["test"].repeat(1).batch(32).map(preprocess)
test_set = test_set.map(encode_words).prefetch(1)

In [43]:
model.evaluate(test_set)



[0.4141397178173065, 0.8172799944877625]

**<span style="color:red">Exercise 1:</span>** Replace the GRU cells with the LSTM cells and compare the results.

---
### <span style="color:#0b486b"> <div  style="text-align:center">**THE END**</div> </span>