## LSTMs for classification

In this notebook, we use LSTMs to predict the label (e.g. the sentiment) of a sequence.

We are going to use `keras` to build the LSTM network, using the layer `keras.layers.LSTM`. First, let's import what we need from `keras`; this requires the libraries `tensorflow` and `keras` to be installed. This may take a few seconds.

In [1]:
from __future__ import print_function
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import LSTM
from keras.datasets import imdb

We use the IMDB movie review dataset: https://keras.io/api/datasets/imdb/#getwordindex-function

In [2]:
max_features = 2000  # vocabulary size: keep only the max_features most common words

Loading data (and reducing its size):

In [3]:
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
x_train = x_train[:1000]
x_test = x_test[:1000]
y_train = y_train[:1000]
y_test = y_test[:1000]
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
1000 train sequences
1000 test sequences


Just to give us an idea of what the sequences look like (each number represents a different word):

In [4]:
print("X-vector: "+str(x_train[0]))
print("Label: "+str(y_train[0]))

X-vector: [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 2, 66, 2, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 2, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2, 19, 14, 22, 4, 1920, 2, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2, 2, 16, 480, 66, 2, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 2, 15, 256, 4, 2, 7, 2, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2, 56, 26, 141, 6, 194, 2, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 2, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 2, 113, 103, 32, 15, 16, 2, 19, 178, 32]
Label: 1
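
The many `2`s in the sequence above are out-of-vocabulary words. The following is a minimal sketch of the capping that `num_words=max_features` appears to apply, assuming the default out-of-vocabulary index of 2; `cap_vocabulary` and `raw_review` are hypothetical names and data used only for illustration.

```python
OOV_ID = 2  # assumed default index for out-of-vocabulary words


def cap_vocabulary(token_ids, num_words, oov_id=OOV_ID):
    """Replace every index >= num_words with the out-of-vocabulary index."""
    return [t if t < num_words else oov_id for t in token_ids]


# Hypothetical review encoded with an unrestricted vocabulary.
raw_review = [1, 14, 22, 5952, 43, 530, 25000, 65]
capped = cap_vocabulary(raw_review, num_words=2000)
print(capped)  # [1, 14, 22, 2, 43, 530, 2, 65]
```

A smaller `max_features` therefore shrinks the embedding table but turns more words into the same `<UNK>` token.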


For the curious, the next cell shows how to retrieve the dictionary that maps word indices back to words.
For more details, see https://stackoverflow.com/questions/42821330/restore-original-text-from-keras-s-imdb-dataset

In [5]:
INDEX_FROM=3   # word index offset, by default

word_to_id = imdb.get_word_index()
word_to_id = {k:(v+INDEX_FROM) for k,v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2  # words that are not in the vocabulary
word_to_id["<UNUSED>"] = 3

id_to_word = {value:key for key,value in word_to_id.items()}
print(' '.join(id_to_word[i] for i in x_train[0]))

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
<START> this film was just brilliant casting location scenery story direction <UNK> really <UNK> the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same <UNK> island as myself so i loved the fact there was a real <UNK> with this film the witty <UNK> throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNK> and would recommend it to everyone to watch and the <UNK> <UNK> was amazing really <UNK> at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <UNK> to the two little <UNK> that played the <UNK> of <UNK> and paul they were just brilliant children are often left out of the <UNK> list i think because the stars that play them all <UNK> up are such a big <
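
To make the role of the offset easier to see, here is a small self-contained sketch of the same decoding on a toy vocabulary. `toy_word_index` is a hypothetical stand-in for the dictionary returned by `imdb.get_word_index()`, and `decode` is an illustrative helper, not part of `keras`.

```python
INDEX_FROM = 3  # same offset as in the cell above

# Hypothetical miniature word index, ranked by frequency starting at 1.
toy_word_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4}

# Shift real words by the offset, then register the reserved indices 0..3.
toy_word_to_id = {w: i + INDEX_FROM for w, i in toy_word_index.items()}
toy_word_to_id.update({"<PAD>": 0, "<START>": 1, "<UNK>": 2, "<UNUSED>": 3})
toy_id_to_word = {i: w for w, i in toy_word_to_id.items()}


def decode(token_ids):
    """Map a list of integer ids back to a space-separated string."""
    return " ".join(toy_id_to_word.get(i, "<UNK>") for i in token_ids)


print(decode([1, 4, 5, 6, 2, 7]))  # <START> the film was <UNK> brilliant
```

Because the first real word starts at index 4, the reserved indices never collide with actual vocabulary entries.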

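Since sequences (in this case reviews) can have different lengths, we need to bring them all to the same length `maxlen`: sequences shorter than `maxlen` are padded with zeros at the beginning, and longer ones are truncated (by default, `pad_sequences` drops words from the beginning and keeps the last `maxlen`). This way they can be stacked into a single array and processed step by step: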

In [6]:
# make sure all sequences have the same length
maxlen = 80  # keep only the last maxlen words of each review; shorter reviews are zero-padded

print('Transform sequences')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

Transform sequences
x_train shape: (1000, 80)
x_test shape: (1000, 80)


In [7]:
print("X-vector: "+str(x_train[0]))
print("Label: "+str(y_train[0]))

X-vector: [  15  256    4    2    7    2    5  723   36   71   43  530  476   26
  400  317   46    7    4    2 1029   13  104   88    4  381   15  297
   98   32    2   56   26  141    6  194    2   18    4  226   22   21
  134  476   26  480    5  144   30    2   18   51   36   28  224   92
   25  104    4  226   65   16   38 1334   88   12   16  283    5   16
    2  113  103   32   15   16    2   19  178   32]
Label: 1
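
Comparing this vector with the original one shows that only the last 80 indices were kept. The sketch below reproduces that behaviour in plain Python, assuming the default settings of `pad_sequences` (padding and truncation both at the front); `pad_pre` is a hypothetical helper written for illustration.

```python
def pad_pre(token_ids, maxlen, pad_id=0):
    """Keep the last `maxlen` ids, then left-pad with `pad_id` if too short.

    Assumes maxlen >= 1.
    """
    kept = token_ids[-maxlen:]                     # truncate from the front
    return [pad_id] * (maxlen - len(kept)) + kept  # pad at the front


short_review = [1, 14, 22]
long_review = [1, 14, 22, 16, 43, 530, 973]

print(pad_pre(short_review, maxlen=5))  # [0, 0, 1, 14, 22]
print(pad_pre(long_review, maxlen=5))   # [22, 16, 43, 530, 973]
```

Padding at the front means the real words sit at the end of the sequence, closest to the final LSTM state that is used for the prediction.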


Note:

When working directly with text, we need an embedding layer: each word index is mapped to a dense vector, which represents the projection of the word into a continuous vector space.
See https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/ for more details.
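
Conceptually, an embedding layer is a lookup table with one row per vocabulary index. The following sketch illustrates the idea in plain Python with toy sizes and random values; the names `embedding_table` and `embed` are hypothetical, and in a real model the rows are learned during training rather than fixed.

```python
import random

random.seed(0)
vocab_size, embed_dim = 10, 4  # toy sizes; the notebook uses 2000 and 128

# One row of `embed_dim` floats per vocabulary index (random here,
# learned during training in a real model).
embedding_table = [[random.uniform(-0.05, 0.05) for _ in range(embed_dim)]
                   for _ in range(vocab_size)]


def embed(token_ids):
    """Replace each integer id with its row of the embedding table."""
    return [embedding_table[i] for i in token_ids]


padded_review = [0, 0, 1, 4, 7]
vectors = embed(padded_review)
print(len(vectors), len(vectors[0]))  # 5 4
```

A sequence of 5 indices becomes 5 vectors of 4 numbers each, so a batch of padded reviews becomes an array of shape (batch, timesteps, embedding dimension), which is what the LSTM consumes.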

In [8]:
print('Build model...')
model = Sequential()
no_dim = 128

# First we create an embedding of dimensionality no_dim (128) for each word
# no_dim is reused below as the number of LSTM units (the two sizes do not have to be equal)
model.add(Embedding(max_features, no_dim))

# dropout = fraction of units dropped for the linear transformation of the inputs
# recurrent_dropout = fraction of units dropped for the linear transformation of the recurrent state
model.add(LSTM(no_dim, dropout=0.2, recurrent_dropout=0.2))

# a single sigmoid output unit: this is binary classification, with labels in {0, 1} (negative/positive)
model.add(Dense(1, activation='sigmoid'))

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy','mae'])

model.fit(x_train, y_train,
          batch_size=32,
          epochs=15,
          validation_data=(x_test, y_test))

Build model...
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.src.callbacks.History at 0x268bed40d90>
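
To get a feeling for the size of this model, the sketch below counts its trainable parameters by hand. It assumes the standard LSTM layout (four gate/candidate blocks, each with an input kernel, a recurrent kernel and a bias); the helper functions are hypothetical, and the result can be compared with what `model.summary()` reports.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary index.
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    # Four blocks, each with an input kernel, a recurrent kernel and a bias.
    return 4 * (input_dim * units + units * units + units)


def dense_params(input_dim, outputs):
    # Weight matrix plus one bias per output unit.
    return input_dim * outputs + outputs


max_features, no_dim = 2000, 128
counts = {
    "embedding": embedding_params(max_features, no_dim),
    "lstm": lstm_params(no_dim, no_dim),
    "dense": dense_params(no_dim, 1),
}
print(counts)                # {'embedding': 256000, 'lstm': 131584, 'dense': 129}
print(sum(counts.values()))  # 387713
```

With only 1000 training reviews, a model with several hundred thousand parameters can easily overfit, which is one reason to use dropout here.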

We evaluate the model on the test set as follows:

In [9]:
evaluation = model.evaluate(x_test, y_test, return_dict=True)
print(evaluation)

{'loss': 1.2413833141326904, 'accuracy': 0.7080000042915344, 'mae': 0.2892940938472748}
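
As a rough guide to reading these three numbers, the sketch below computes binary cross-entropy, accuracy and mean absolute error by hand for a few made-up predictions. The labels and probabilities are hypothetical, the 0.5 decision threshold is an assumption, and the functions are plain-Python illustrations rather than the `keras` implementations.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


def mean_absolute_error(y_true, y_prob):
    """Average distance between the label and the predicted probability."""
    return sum(abs(y - p) for y, p in zip(y_true, y_prob)) / len(y_true)


# Hypothetical labels and sigmoid outputs; the last one is a confident mistake.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.99]

print("accuracy:", accuracy(y_true, y_prob))  # accuracy: 0.75
print("mae:", mean_absolute_error(y_true, y_prob))
print("loss:", binary_cross_entropy(y_true, y_prob))
```

In this toy example three of four predictions are correct, yet the loss is above 1 because the single confident mistake is penalised heavily. Something similar may explain why the loss above is larger than 1 while the accuracy is around 0.71.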
