In this notebook we are going to train a neural network on the IMDB movie reviews data set. The purpose is for the model to predict whether a review is positive or negative.

First we download the data sets:

In [3]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
import tensorflow_datasets as tfds
import tensorflow_hub as hub

datasets, info = tfds.load("imdb_reviews", as_supervised=True, with_info=True)
train_size = info.splits['train'].num_examples



If we look at the datasets dictionary, we will see that it contains the train and test data sets, along with an 'unsupervised' split that we will not use here:

In [4]:
datasets.keys()

dict_keys(['test', 'train', 'unsupervised'])

In [6]:
train_set = datasets["train"]
test_set = datasets["test"]
type(train_set)

tensorflow.python.data.ops.dataset_ops._OptionsDataset

We can see that the data sets are stored as TensorFlow Dataset objects (tf.data). Let us take a look at a couple of records:

In [7]:
for element in train_set.take(2):
    print(element)

(<tf.Tensor: shape=(), dtype=string, numpy=b'Well let me go say this because i love history and I know that movie is most important piece in our history and it was beautifully executed movie and Julia Stiles became my #1 favorite actress after seeing her in "The \'60s" and i own this movie in my video box with many movies and i suggest you to look for her new movies in the future and try to enjoy history!!!!'>, <tf.Tensor: shape=(), dtype=int64, numpy=1>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'Just because someone is under the age of 10 does not mean they are stupid. If your child likes this film you\'d better have him/her tested. I am continually amazed at how so many people can be involved in something that turns out so bad. This "film" is a showcase for digital wizardry AND NOTHING ELSE. The writing is horrid. I can\'t remember when I\'ve heard such bad dialogue. The songs are beyond wretched. The acting is sub-par but then the actors were not given much. Who decided to employ

The output shows that there are two tensors per record: one is the review text and the other is a number (0 for negative, 1 for positive) that indicates the sentiment of the review.

We will need to vectorize the text before we train the model. Here we have two options. We can either do the vectorization ourselves, or we can use a pre-trained embedding. We will do both. For the first model, we will do the text vectorization and embedding ourselves. To do that, we will need to use a vectorizer. We will use a vocabulary of size 10,000 and a maximum length of 300 tokens for the reviews. Reviews shorter than this will be padded. Longer reviews will be cut.

In [8]:
vocabulary_size = 10000
max_length = 300

int_vectorize_layer = keras.layers.TextVectorization(max_tokens=vocabulary_size, output_mode='int', output_sequence_length=max_length)
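To make the padding and cutting concrete, here is a minimal plain-Python sketch of the idea. It is not the Keras implementation: the function `encode` and the dictionary `toy_vocab` are hypothetical names made up for this illustration, and the tokenizer is a simple lowercase-and-split. It assumes, as the vocabulary printed later in this notebook suggests, that index 0 is the padding value and index 1 stands for unknown words.

```python
# Illustrative sketch only: a plain-Python stand-in for mapping words to
# indices and forcing a fixed output length. `encode` and `toy_vocab` are
# hypothetical names, not part of Keras.
def encode(text, vocab, max_length):
    """Map words to indices, then cut to max_length or pad with 0."""
    unk_index = 1  # index assumed to be reserved for out-of-vocabulary words
    ids = [vocab.get(word, unk_index) for word in text.lower().split()]
    ids = ids[:max_length]                      # longer reviews are cut
    return ids + [0] * (max_length - len(ids))  # shorter reviews are padded

toy_vocab = {"the": 2, "movie": 3, "was": 4, "great": 5}

short = encode("The movie was great", toy_vocab, max_length=6)
long_ = encode("The movie was great the movie was awful", toy_vocab, max_length=6)
print(short)  # [2, 3, 4, 5, 0, 0]
print(long_)  # [2, 3, 4, 5, 2, 3]
```

With max_length=300, every review ends up as exactly 300 integers, which is what lets the reviews be stacked into batches.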

We now need to use the vectorizer to convert the text to indices. However, before we do that, it would be nice to clean up the text. If you look at the text of the reviews you will notice some HTML line-break tags of the form <br />. These tags do not help in sentiment analysis. We can also replace every character that is not a letter or an apostrophe (digits and punctuation, for example) with a space, since we just need the words. We will do this cleaning for both the training set and the test set:

In [9]:
train_set = train_set.map(lambda x_text, x_label: (tf.strings.regex_replace(x_text, "<br />", " "), x_label))
train_set = train_set.map(lambda x_text, x_label: (tf.strings.regex_replace(x_text, "[^a-zA-Z']", " "), x_label))

test_set = test_set.map(lambda x_text, x_label: (tf.strings.regex_replace(x_text, "<br />", " "), x_label))
test_set = test_set.map(lambda x_text, x_label: (tf.strings.regex_replace(x_text, "[^a-zA-Z']", " "), x_label))
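To see what these two replacements do to a single review, the same patterns can be tried on an ordinary Python string. This is only a sketch: it uses Python's `re` module rather than `tf.strings.regex_replace`, on the assumption that both treat these two simple patterns the same way, and `clean_review` and `sample` are hypothetical names introduced for the illustration.

```python
import re

def clean_review(text):
    """Mirror the two replacement steps on an ordinary Python string."""
    text = re.sub(r"<br />", " ", text)      # drop the HTML line-break tags
    text = re.sub(r"[^a-zA-Z']", " ", text)  # keep only letters and apostrophes
    return text

sample = "Great film.<br /><br />I'd give it 10/10!"
# Splitting on whitespace hides the runs of spaces left by the replacements.
print(clean_review(sample).split())  # ['Great', 'film', "I'd", 'give', 'it']
```

Note that every removed character becomes a space, so the cleaned text contains runs of spaces. That is harmless here because the vectorizer splits on whitespace.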

We now use the cleaned text to prepare the vectorizer. To do that, we call its adapt method, which builds the vocabulary from the text. First, we need a data set that contains only the text, since each record in the train set holds both the review and its label. So we create this data set:

In [11]:
train_text = train_set.map(lambda text, labels: text)
for text in train_text.take(1):
    print(text)

tf.Tensor(b"This is the most depressing film I have ever seen  I first saw it as a child and even thinking about it now really upsets me  I know it was set in a time when life was hard and I know these people were poor and the crops were vital  Yes  I get all that  What I find hard to take is I can't remember one single light moment in the entire film  Maybe it was true to life  I don't know  I'm quite sure the acting was top notch and the direction and quality of filming etc etc was wonderful and I know that every film can't have a happy ending but as a family film it is dire in my opinion   I wouldn't recommend it to anyone who wants to be entertained by a film  I can't stress enough how this film affected me as a child  I was talking about it recently and all the sad memories came flooding back  I think it would have all but the heartless reaching for the Prozac ", shape=(), dtype=string)


We now have a data set that contains only the text. We can use this data set to set up the vectorizer:

In [12]:
int_vectorize_layer.adapt(train_text)
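Conceptually, adapting the vectorizer amounts to counting words and keeping the most frequent ones. The sketch below shows that idea in plain Python. It is an approximation under stated assumptions, not the Keras code: `build_vocabulary` and `toy_reviews` are hypothetical names, the two reserved entries ('' for padding and '[UNK]' for unknown words) are assumed to count toward the vocabulary size, and ties are broken alphabetically purely for the sake of a deterministic example.

```python
from collections import Counter

def build_vocabulary(texts, max_tokens):
    """Most frequent words first, after the two reserved entries."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count; the alphabetical tie-break is an arbitrary
    # choice made for this sketch.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ["", "[UNK]"] + ranked[: max_tokens - 2]

toy_reviews = ["the film was good", "the film was bad", "the plot"]
vocab = build_vocabulary(toy_reviews, max_tokens=5)
print(vocab)  # ['', '[UNK]', 'the', 'film', 'was']
```

Words that do not make the cut ('good', 'bad' and 'plot' in this toy example) would all be mapped to the '[UNK]' index when the text is converted.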

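If you want to see the words that are indexed by the vectorizer, you can use the get_vocabulary method. The first two entries are reserved: '' is the padding token and '[UNK]' stands for words outside the vocabulary: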

In [14]:
int_vectorize_layer.get_vocabulary()

['',
 '[UNK]',
 'the',
 'and',
 'a',
 'of',
 'to',
 'is',
 'in',
 'it',
 'i',
 'this',
 'that',
 'was',
 'as',
 'for',
 'with',
 'movie',
 'but',
 'film',
 'on',
 'not',
 'you',
 'are',
 'his',
 'have',
 'he',
 'be',
 'one',
 'its',
 'all',
 'at',
 'by',
 'an',
 'they',
 'who',
 'so',
 'from',
 'like',
 'her',
 'or',
 'just',
 'about',
 'out',
 'if',
 'has',
 'there',
 'some',
 'what',
 'good',
 'more',
 'when',
 'very',
 'up',
 'no',
 'time',
 'she',
 'even',
 'my',
 'would',
 'which',
 'story',
 'only',
 'really',
 'see',
 'their',
 'were',
 'had',
 'can',
 'well',
 'me',
 'than',
 'we',
 'much',
 'bad',
 'been',
 'get',
 'will',
 'do',
 'also',
 'people',
 'into',
 'other',
 'first',
 'great',
 'because',
 'how',
 'him',
 'most',
 'dont',
 'made',
 'then',
 'movies',
 'way',
 'make',
 'them',
 'films',
 'too',
 'could',
 'any',
 'after',
 'characters',
 'think',
 'watch',
 'two',
 'character',
 'seen',
 'many',
 'being',
 'life',
 'plot',
 'acting',
 'never',
 'little',
 'love',
 'b

We will now convert the reviews in the train and test data sets to the word indices:

In [15]:
train_set = train_set.map(lambda text, labels: (int_vectorize_layer(text), labels))

test_set = test_set.map(lambda text, labels: (int_vectorize_layer(text), labels))

Let us take a look at the result:

In [16]:
for text, label in train_set.take(1):
    print(text)
    print(label)

tf.Tensor(
[  14    4    1  333    5 4302   10   25    1   75  676   32 5066    5
   24 2617  256   24  490 1343   33  569 1845  734  972    5  400  109
   31  171  642    8 6262 3441  463   95   30   13    4    1 5389    5
 1265   12   98   27  196 8200   40 6526   14    2 1313    1    8    4
   93   26   13    4 5251    1    3    1   26   98   27  615    3  631
    8    2  168 4086   26    1 3086    1 7556 6495 3911    3    1   16
 5250    3 3380    9  199   27    4  739   18   26   13    4 1929  557
    3    9    7    2  208   12    7   36  394 1001   37   24 7406   31
    2   55    5  480 2329  986    7  108    1    8 1415  803   20 2764
  687   30    5    2 4532    3 5526    7   65   18  689    5    2 1265
 3127    3 3513    1    2  951    7   41    4  453 6313  579    2   61
 7600   32    4 3913  244   71    4 7712   29   21   63 4302   31   30
 2329   20    2   82  499    7   73 2419    6    2  940    2 8861    5
    1    7  939 9604   37    2    1    1    6    2 3695    1    2 

We now see that the words have been replaced by their index numbers while the outcomes (1s and 0s) are the same. We are now ready to train the model. We will use an embedding layer to map each of these 10,000 word indices to a 50-dimensional vector. We will then feed this as input to an LSTM layer. Finally, we will use a single node in the output layer with a sigmoid activation function in order to predict a 1 or a 0.

Note that we set the mask_zero parameter to True. This parameter tells the embedding layer to treat index 0 as padding, so that the following layers can skip it. Why did we do this? Remember that we specified a sequence length of 300 in the vectorizer. Any review that is shorter than this will be padded with zeros. So by setting the mask_zero parameter to True, we are simply telling the model to ignore these zeros since they do not provide any meaningful information.

In [17]:
model = tf.keras.Sequential([
    keras.layers.Embedding(vocabulary_size, 50, input_shape=[None], mask_zero=True),
    tf.keras.layers.LSTM(64, dropout=0.2, recurrent_dropout=0.2),
    keras.layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss="binary_crossentropy", metrics=['accuracy'])
history = model.fit(train_set.batch(512), epochs=10, validation_data=test_set.batch(512))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
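The effect of mask_zero can be pictured as a Boolean mask that travels alongside each padded review, marking which positions hold real tokens. The following is a plain-Python sketch of that idea rather than the Keras internals; `padding_mask` and `padded_review` are hypothetical names used only for this illustration.

```python
def padding_mask(token_ids):
    """True for real tokens, False for the 0 used as padding."""
    return [token_id != 0 for token_id in token_ids]

padded_review = [14, 4, 1, 333, 0, 0, 0]
mask = padding_mask(padded_review)
print(mask)       # [True, True, True, True, False, False, False]
print(sum(mask))  # 4
```

Note that the unknown-word index 1 is still a real token and is not masked. Only the padding index 0 is skipped.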


The other option, as mentioned earlier, is to simply use a pre-trained embedding. This is actually easier than what we did because the pre-trained embedding will do the work for us. Here is how we can load the pre-trained embedding model:

In [20]:
embedding = "https://tfhub.dev/google/nnlm-en-dim50/2"
hub_layer = hub.KerasLayer(embedding, input_shape=[], dtype=tf.string, trainable=True)





The embedding is a 'Token based text embedding trained on English Google News 7B corpus'. There are other embeddings to choose from. This embedding maps the text to a 50-dimensional embedding vector. Let us try it out. First, we need to reload the train and test data sets because we had already vectorized the words for the previous model:

In [26]:
train_set = datasets["train"]
test_set = datasets["test"]

for text, label in train_set.take(1):
    print(text)
    print(hub_layer([text]))

tf.Tensor(b"Oh yeah! Jenna Jameson did it again! Yeah Baby! This movie rocks. It was one of the 1st movies i saw of her. And i have to say i feel in love with her, she was great in this move.<br /><br />Her performance was outstanding and what i liked the most was the scenery and the wardrobe it was amazing you can tell that they put a lot into the movie the girls cloth were amazing.<br /><br />I hope this comment helps and u can buy the movie, the storyline is awesome is very unique and i'm sure u are going to like it. Jenna amazed us once more and no wonder the movie won so many awards. Her make-up and wardrobe is very very sexy and the girls on girls scene is amazing. specially the one where she looks like an angel. It's a must see and i hope u share my interests", shape=(), dtype=string)
tf.Tensor(
[[ 0.42595312  0.41425827  0.05380295  0.65693736  0.02124309 -0.34377313
   0.27501962 -0.29560134 -0.8899341   0.4253666  -0.03075071  0.15959139
   0.05135492  0.41358554 -0.18948269 

All we had to do in the above code was feed the raw text to the pre-trained embedding model, which handles the preprocessing and embedding for us. So while previously we had to vectorize the text and then specify an embedding layer, we can now just use the pre-trained embedding layer. We now build the neural network model and simply use the data sets as they are:

In [32]:
model = tf.keras.Sequential([
    hub_layer,
    tf.keras.layers.Dense(16, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

In [33]:
model.compile(optimizer='adam', loss="binary_crossentropy", metrics=['accuracy'])
history = model.fit(train_set.shuffle(10000).batch(512), epochs=10, validation_data=test_set.batch(512))

Epoch 1/10


Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
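Both models end in a single sigmoid unit, so what they output for a review is a number between 0 and 1 rather than a hard label. A common way to read it, sketched below in plain Python, is to treat the value as the probability of a positive review and cut it at 0.5. The helper names `sigmoid` and `to_label` and the example inputs are made up for this illustration, and the 0.5 threshold is an assumption you are free to change.

```python
import math

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def to_label(probability, threshold=0.5):
    """Turn a predicted probability into a 0 (negative) or 1 (positive)."""
    return 1 if probability >= threshold else 0

# Hypothetical pre-activation values for three reviews.
for z in (-2.0, 0.3, 4.0):
    p = sigmoid(z)
    print(f"z={z:+.1f}  probability={p:.3f}  label={to_label(p)}")
```

Values far from 0.5 indicate a confident prediction, while values near 0.5 are reviews the model is unsure about.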
