# Import dataset
We will classify IMDB movie reviews as either positive or negative. Sequence models suit this task because a review is an ordered sequence of words, and a sequence model can learn patterns that depend on that order. To keep computing time manageable, only 5,000 reviews are used for training and 1,000 for testing.

https://www.tensorflow.org/datasets/catalog/imdb_reviews

In [1]:
from tensorflow import keras
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np

train_set = tfds.load('imdb_reviews', split='train', as_supervised=True).take(5000)
test_set = tfds.load('imdb_reviews', split='test', as_supervised=True).take(1000)

# Prepare tokens

In [2]:
max_text_length = 0

# training
X_train = []
y_train = []
for text, label in train_set:
    # Decode the bytes tensor; str() on bytes would keep the b'...' wrapper and escape sequences.
    text = text.numpy().decode('utf-8')
    # Length in characters: a loose upper bound on the number of tokens per review.
    max_text_length = max(max_text_length, len(text))
    X_train.append(text)
    y_train.append(int(label))

# Keep only the most frequent words; index 0 is reserved for padding.
tokenizer = keras.preprocessing.text.Tokenizer(num_words=10000)
tokenizer.fit_on_texts(X_train)

X_train = tokenizer.texts_to_sequences(X_train)
# pad_sequences returns a NumPy array, right-padded with zeros.
X_train = keras.preprocessing.sequence.pad_sequences(X_train, maxlen=max_text_length, padding='post')
y_train = np.array(y_train)

# testing (reuses the tokenizer fitted on the training texts)
X_test = []
y_test = []
for text, label in test_set:
    text = text.numpy().decode('utf-8')
    X_test.append(text)
    y_test.append(int(label))

X_test = tokenizer.texts_to_sequences(X_test)
X_test = keras.preprocessing.sequence.pad_sequences(X_test, maxlen=max_text_length, padding='post')
y_test = np.array(y_test)
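
To make the tokenization and padding steps above more concrete, here is a minimal, dependency-free sketch of the same idea on a toy corpus. The helper names (`fit_word_index`, `texts_to_sequences`, `pad_post`) and the three example sentences are hypothetical, and the sketch only lowercases and splits on whitespace, whereas the Keras `Tokenizer` also filters punctuation. It is meant to illustrate the mechanics, not to reproduce the Keras implementation.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer rank; the most frequent word gets rank 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Index 0 is never assigned to a word so it can serve as the padding value.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Replace words by their ranks, dropping unknown words and ranks >= num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

def pad_post(sequences, maxlen):
    """Right-pad with zeros to maxlen; longer sequences keep their last maxlen tokens."""
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]
        padded.append(kept + [0] * (maxlen - len(kept)))
    return padded

# Hypothetical toy corpus, for illustration only.
texts = ["the movie was great", "the movie was bad", "great great acting"]
word_index = fit_word_index(texts)
sequences = texts_to_sequences(texts, word_index, num_words=5)
padded = pad_post(sequences, maxlen=4)
print(word_index)
print(padded)
```

With `num_words=5`, only the four most frequent words survive, so the rare words "bad" and "acting" are dropped before padding. This mirrors how `num_words=10000` limits the vocabulary in the cell above.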

# Build the RNN
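An embedding layer converts each sequence of token IDs into a sequence of dense vectors that the RNN layer can process; `mask_zero=True` tells the following layers to skip the zero padding. The SimpleRNN layer uses its default tanh activation. (In Keras, keeping the tanh default matters for GPU speed mainly with LSTM and GRU layers, which can then use the fused cuDNN kernel; SimpleRNN has no such kernel.) The final hidden state is then fed into a dense layer with sigmoid activation for binary classification.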
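
As a rough illustration of that data flow, the following dependency-free sketch runs one forward pass of an embedding lookup, a tanh recurrence, and a sigmoid output. The function name `simple_rnn_sentiment`, the toy sizes, and the random weights are all assumptions made for the example; the real Keras layers learn their weights and operate on batches.

```python
import math
import random

def simple_rnn_sentiment(token_ids, embedding, w_x, w_h, b_h, w_out, b_out):
    """Forward pass: embedding lookup -> tanh recurrence -> sigmoid output."""
    units = len(b_h)
    h = [0.0] * units                      # initial hidden state
    for token in token_ids:
        if token == 0:                     # padding: skip the step, as masking does
            continue
        x = embedding[token]               # token id -> dense vector
        # h_t = tanh(W_x x_t + W_h h_{t-1} + b); the old h is used for every unit.
        h = [
            math.tanh(
                sum(w_x[j][i] * x[i] for i in range(len(x)))
                + sum(w_h[j][k] * h[k] for k in range(units))
                + b_h[j]
            )
            for j in range(units)
        ]
    logit = sum(w_out[j] * h[j] for j in range(units)) + b_out
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid: probability of "positive"

# Hypothetical toy sizes and random weights, for illustration only.
rng = random.Random(0)
vocab_size, embed_dim, units = 6, 3, 4

def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

params = (
    rand_matrix(vocab_size, embed_dim),            # embedding table
    rand_matrix(units, embed_dim),                 # input weights
    rand_matrix(units, units),                     # recurrent weights
    [0.0] * units,                                 # recurrent bias
    [rng.uniform(-1.0, 1.0) for _ in range(units)],  # output weights
    0.0,                                           # output bias
)
p_plain = simple_rnn_sentiment([2, 3, 1], *params)
p_padded = simple_rnn_sentiment([2, 3, 1, 0, 0], *params)
print(p_plain, p_padded)
```

Because padded steps are skipped, the padded and unpadded inputs give the same probability. Reordering the non-zero tokens would generally change the result, since each hidden state depends on the one before it.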

In [13]:
model = keras.Sequential()
model.add(keras.layers.Embedding(10001, 64, mask_zero=True))
model.add(keras.layers.SimpleRNN(128))
model.add(keras.layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=[keras.metrics.BinaryAccuracy()])

# Train the model

In [14]:
epochs = 5

model.fit(X_train, y_train, epochs=epochs, validation_split=0.1, batch_size=256)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fb2de446d50>

# Evaluate on test set

In [15]:
print(model.metrics_names)
print(model.evaluate(X_test, y_test, verbose=0))

['loss', 'binary_accuracy']
[0.6651999950408936, 0.6190000176429749]
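
For reference, the two numbers reported above can be reproduced by hand on a few predictions. The sketch below assumes the usual definitions: mean binary cross-entropy for the loss, and a 0.5 threshold for binary accuracy. The labels and probabilities are made up for illustration.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels under the predictions."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)    # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

# Hypothetical labels and predicted probabilities.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]
loss = binary_crossentropy(y_true, y_prob)
acc = binary_accuracy(y_true, y_prob)
chance_loss = binary_crossentropy([1, 0], [0.5, 0.5])  # always predicting 0.5
print(loss, acc, chance_loss)
```

A model that always outputs 0.5 has a loss of ln 2, roughly 0.693, which is a useful baseline when reading the loss values in this notebook.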


# Create new model with LSTM

In [3]:
model2 = keras.Sequential()
model2.add(keras.layers.Embedding(10001, 64, mask_zero=True))
model2.add(keras.layers.LSTM(128))
model2.add(keras.layers.Dense(1, activation='sigmoid'))

model2.compile(optimizer='adam', loss='binary_crossentropy', metrics=[keras.metrics.BinaryAccuracy()])

# Train the new model

In [4]:
epochs = 5

model2.fit(X_train, y_train, epochs=epochs, validation_split=0.1, batch_size=64)  # batch size reduced due to Colab RAM limits

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fb30e14e0d0>
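
The two training runs use different batch sizes, which changes how many weight updates each model receives. The short sketch below works out that arithmetic, assuming that `validation_split` holds out the last fraction of the samples before batching. The helper name `updates_per_epoch` is hypothetical.

```python
import math

def updates_per_epoch(n_samples, validation_split, batch_size):
    """Return (training samples, gradient updates per epoch)."""
    n_train = int(n_samples * (1.0 - validation_split))  # the rest is held out for validation
    return n_train, math.ceil(n_train / batch_size)

rnn_train, rnn_steps = updates_per_epoch(5000, 0.1, 256)
lstm_train, lstm_steps = updates_per_epoch(5000, 0.1, 64)
print(rnn_train, rnn_steps, lstm_train, lstm_steps)
```

Over 5 epochs the SimpleRNN run performs 5 × 18 = 90 updates and the LSTM run 5 × 71 = 355, so the smaller batch size also gives the LSTM more optimization steps.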

# Evaluate LSTM on test set
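On this test set the LSTM reaches about 78% accuracy, compared with about 62% for the SimpleRNN. The likely reason is that the LSTM's gated cell state lets it retain information across many time steps. The SimpleRNN also carries a hidden state from one step to the next, but that state is rewritten at every step and its gradients tend to vanish over long sequences, so it struggles to learn long-range patterns. Note that the two runs also differ in batch size (256 vs. 64), so the comparison is not fully controlled.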
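
The difference in memory can be illustrated with a scalar toy example: one informative input followed by 50 empty steps. The weights and gate biases below are hand-picked, hypothetical values, and the LSTM step is simplified (in a real LSTM every gate is a learned function of both the input and the previous hidden state). It is a sketch of the mechanism, not a model of the trained networks.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rnn_step(h, x, w_x=1.0, w_h=0.5):
    """Simple RNN: the whole state is rewritten through tanh at every step."""
    return math.tanh(w_x * x + w_h * h)

def lstm_step(h, c, x, p):
    """Simplified scalar LSTM step with hand-picked gate parameters."""
    f = sigmoid(p["b_f"])                  # forget gate: share of old cell state kept
    i = sigmoid(p["w_i"] * x + p["b_i"])   # input gate: opens only for a strong input
    o = sigmoid(p["b_o"])                  # output gate: share of cell state exposed
    g = math.tanh(x + h)                   # candidate value to write
    c = f * c + i * g                      # additive update of the cell state
    h = o * math.tanh(c)
    return h, c

# Hypothetical parameters: forget gate near 1, input gate closed when x == 0.
params = {"b_f": 5.0, "w_i": 10.0, "b_i": -5.0, "b_o": 5.0}

inputs = [1.0] + [0.0] * 50                # one signal, then 50 empty steps
h_rnn = 0.0
h_lstm, c_lstm = 0.0, 0.0
for x in inputs:
    h_rnn = rnn_step(h_rnn, x)
    h_lstm, c_lstm = lstm_step(h_lstm, c_lstm, x, params)
print(h_rnn, h_lstm)
```

With these settings the simple RNN state shrinks by at least half at every empty step and is effectively zero after 50 steps, while the LSTM still carries a clear trace of the first input in its cell state.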


In [6]:
print(model2.metrics_names)
print(model2.evaluate(X_test, y_test, verbose=0))

['loss', 'binary_accuracy']
[0.5765174627304077, 0.7770000100135803]
