In this notebook, we'll use a recurrent neural network (RNN) to classify text. We'll work with the [Keras IMDB review data set](https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification), in which each movie review carries a binary sentiment label.

In [1]:
%tensorflow_version 2.x

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.datasets import imdb
from tensorflow.keras.utils import to_categorical
from tensorflow.keras import backend as K
from tensorflow.keras.preprocessing import sequence
import numpy as np

We'll keep only the `max_features` most frequent words and cap each review at `maxlen` tokens. We also set the size of the word embeddings here:

In [2]:
max_features   = 20000
maxlen         = 80
embedding_size = 128

In [3]:
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=max_features)
print(len(X_train), 'train sequences')
print(len(X_test),  'test sequences')

# The reviews have already been tokenized for us:
print(X_train[0])

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
25000 train sequences
25000 test sequences
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65,
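The repeated `2` entries in this output are worth a note. With the default arguments of `imdb.load_data`, index `1` marks the start of a review, and any word whose index is at or above `num_words` is replaced by the out-of-vocabulary index `2`. Here is a minimal NumPy sketch of that capping step, using a made-up sequence and a hypothetical helper called `cap_vocabulary` (not part of Keras):

```python
import numpy as np

def cap_vocabulary(encoded, num_words, oov_index=2):
    """Replace every index >= num_words with the out-of-vocabulary index."""
    encoded = np.asarray(encoded)
    return np.where(encoded < num_words, encoded, oov_index)

# Made-up encoded review, not taken from the data set
toy_review = [1, 14, 25000, 16, 19999, 20000]
capped = cap_vocabulary(toy_review, num_words=20000)
print(capped)
```

Lowering `max_features` therefore does not shorten the reviews; it only maps more of their words to the same out-of-vocabulary index.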

We'll use `pad_sequences` to pad or truncate the reviews so that they all have the same length:

In [4]:
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test  = sequence.pad_sequences(X_test,  maxlen=maxlen)

print('X_train shape:', X_train.shape)
print('X_test shape:',  X_test.shape)

print(X_train[0])

X_train shape: (25000, 80)
X_test shape: (25000, 80)
[   15   256     4     2     7  3766     5   723    36    71    43   530
   476    26   400   317    46     7     4 12118  1029    13   104    88
     4   381    15   297    98    32  2071    56    26   141     6   194
  7486    18     4   226    22    21   134   476    26   480     5   144
    30  5535    18    51    36    28   224    92    25   104     4   226
    65    16    38  1334    88    12    16   283     5    16  4472   113
   103    32    15    16  5345    19   178    32]
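
Note that the first review was longer than 80 tokens, and only its last 80 tokens survive. By default, `pad_sequences` both pads and truncates at the start of a sequence (`padding='pre'`, `truncating='pre'`). The NumPy sketch below mimics that default behaviour on made-up sequences; `pad_pre` is a hypothetical helper written for illustration, not part of Keras:

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep only the last `maxlen` items of long ones."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue                      # an empty sequence stays all padding
        kept = seq[-maxlen:]              # truncate from the front
        out[row, -len(kept):] = kept      # pad on the left
    return out

# Made-up sequences: one shorter and one longer than maxlen
toy = [[5, 6, 7], [1, 2, 3, 4, 5, 6]]
padded = pad_pre(toy, maxlen=4)
print(padded)
```

Keeping the end of a review rather than its beginning is a design choice; passing `truncating='post'` to `pad_sequences` would keep the first `maxlen` tokens instead.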


Next, we define our architecture. We'll use a single LSTM layer here.

Note: if you're using regularization, batch normalization or dropout, always check whether the layer itself exposes arguments for it before adding separate layers manually. Techniques such as dropout need to be implemented carefully for recurrent layers in particular, which is why the LSTM layer below takes `dropout` and `recurrent_dropout` arguments:

In [5]:
model = Sequential()

model.add(Embedding(max_features, embedding_size))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

model.summary()

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])


Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 128)         2560000   
_________________________________________________________________
lstm (LSTM)                  (None, 128)               131584    
_________________________________________________________________
dense (Dense)                (None, 1)                 129       
Total params: 2,691,713
Trainable params: 2,691,713
Non-trainable params: 0
_________________________________________________________________
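
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout of four gates, each with an input kernel, a recurrent kernel and a bias; the variable names other than `max_features` and `embedding_size` are ours:

```python
max_features   = 20000
embedding_size = 128
lstm_units     = 128

# Embedding: one vector of size embedding_size per vocabulary entry
embedding_params = max_features * embedding_size

# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((embedding_size + lstm_units) * lstm_units + lstm_units)

# Dense: one weight per LSTM unit plus a single bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Almost all of the parameters sit in the embedding layer, so `max_features` and `embedding_size` are the main levers on model size.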


We can now train our model and see how well it does:

In [6]:
model.fit(X_train, y_train, batch_size=128, epochs=3,
          validation_data=(X_test, y_test))

score, acc = model.evaluate(X_test, y_test, batch_size=128)
print('Test score:',    score)
print('Test accuracy:', acc)

Epoch 1/3
Epoch 2/3
Epoch 3/3
Test score: 0.4383922219276428
Test accuracy: 0.8284000158309937
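
The "test score" returned by `model.evaluate` is the loss the model was compiled with, here binary cross-entropy. The accuracy counts a prediction as positive when the sigmoid output exceeds 0.5. Below is a small NumPy sketch of both quantities on made-up labels and probabilities; the helper functions are hypothetical stand-ins, not the Keras implementations:

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip to avoid log(0) for probabilities of exactly 0 or 1
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    return float(np.mean(y_pred == np.asarray(y_true)))

# Made-up labels and predicted probabilities
y_true = [0, 1, 1, 0]
y_prob = [0.1, 0.8, 0.4, 0.3]

toy_loss = binary_crossentropy(y_true, y_prob)
toy_acc  = binary_accuracy(y_true, y_prob)
print('loss:', toy_loss, 'accuracy:', toy_acc)
```

The loss penalizes confident mistakes more heavily than the accuracy does, so the two numbers can move in different directions between epochs.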


Given how unstructured some of these reviews are, it's pretty impressive that we already reach about 83% test accuracy after only three epochs. Let's look at the prediction for a single test review:

In [7]:
test_instance_idx = 0

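# Build the index -> word lookup once instead of rescanning the vocabulary for every token
index_to_word = {v: k for k, v in imdb.get_word_index().items()}
review     = ' '.join(index_to_word[w] for w in X_test[test_instance_idx] if w > 0)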
prediction = model.predict(np.expand_dims(X_test[test_instance_idx], axis=0))[0][0]

print(review)
print('Predicted', prediction, '-- true label was:', y_test[test_instance_idx])

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
the wonder own as by is sequence i i jars roses to of hollywood br of down shouting getting boring of ever it sadly sadly sadly i i was then does don't close faint after one carry as by are be favourites all family turn in does as three part in another some to be probably with world uncaring her an have faint beginning own as is sequence
Predicted 0.07241632 -- true label was: 0
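
The decoded text above reads oddly, most likely because the lookup ignores an index offset. With its default arguments, `imdb.load_data` reserves 0 for padding, 1 for the start marker and 2 for out-of-vocabulary words, and shifts every real word index by 3 (`index_from=3`), whereas `imdb.get_word_index()` returns the unshifted indices. Here is a minimal sketch of a decoder that accounts for this, using a made-up word index so that it runs without downloading anything (`decode_review` and the toy vocabulary are hypothetical):

```python
def decode_review(encoded, word_index, index_from=3):
    """Map encoded tokens back to words, undoing the index_from shift."""
    index_to_word = {idx + index_from: word for word, idx in word_index.items()}
    # Reserved indices used by imdb.load_data with its default arguments
    index_to_word.update({0: '<pad>', 1: '<start>', 2: '<oov>'})
    # Skip padding; anything unknown is shown as out-of-vocabulary
    return ' '.join(index_to_word.get(int(i), '<oov>') for i in encoded if i != 0)

# Made-up vocabulary in the same word -> rank format as imdb.get_word_index()
toy_word_index = {'the': 1, 'movie': 2, 'was': 3, 'great': 4}
encoded = [0, 0, 1, 4, 5, 6, 2, 7]

decoded = decode_review(encoded, toy_word_index)
print(decoded)
```

With the real data, calling `decode_review(X_test[test_instance_idx], imdb.get_word_index())` should give a more readable review than the raw lookup.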
