# Bidirectional LSTM on IMDB

**Description:** Train a 2-layer bidirectional LSTM on the IMDB movie review sentiment classification dataset.

Based on: https://keras.io/examples/nlp/bidirectional_lstm_imdb/

## Setup

In [None]:
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

max_features = 20000  # Only consider the top 20k words
maxlen = 200  # Keep 200 words per review (pad_sequences drops the earliest words by default)


## Build the model

In [None]:
# Input for variable-length sequences of integers
inputs = keras.Input(shape=(None,), dtype="int32")
# Embed each integer in a 128-dimensional vector
x = layers.Embedding(max_features, 128)(inputs)
# Add 2 bidirectional LSTMs
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(64))(x)
# Add a classifier
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.summary()


Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding (Embedding)       (None, None, 128)         2560000   
                                                                 
 bidirectional (Bidirectiona  (None, None, 128)        98816     
 l)                                                              
                                                                 
 bidirectional_1 (Bidirectio  (None, 128)              98816     
 nal)                                                            
                                                                 
 dense (Dense)               (None, 1)                 129       
                                                                 
Total params: 2,757,761
Trainable params: 2,757,761
Non-train
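
The parameter counts in the summary can be checked by hand. The sketch below recomputes them from the layer sizes used above, assuming the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias. The helper names are hypothetical and exist only for this check.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units + 1) * units)

def bidirectional_lstm_params(input_dim, units):
    # A forward and a backward LSTM with separate weights
    return 2 * lstm_params(input_dim, units)

def dense_params(input_dim, units):
    # Weights plus one bias per output unit
    return (input_dim + 1) * units

max_features, embed_dim, units = 20000, 128, 64
counts = [
    embedding_params(max_features, embed_dim),    # 2,560,000
    bidirectional_lstm_params(embed_dim, units),  # 98,816
    bidirectional_lstm_params(2 * units, units),  # 98,816
    dense_params(2 * units, 1),                   # 129
]
print(counts, sum(counts))  # total: 2,757,761
```

The second bidirectional layer has the same count as the first because its input is the 128-dimensional concatenation of the forward and backward outputs (2 x 64), which happens to equal the embedding size.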

## Load the IMDB movie review sentiment data

In [None]:
(x_train, y_train), (x_val, y_val) = keras.datasets.imdb.load_data(
    num_words=max_features
)
print(len(x_train), "Training sequences")
print(len(x_val), "Validation sequences")
x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_val = keras.preprocessing.sequence.pad_sequences(x_val, maxlen=maxlen)


Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
25000 Training sequences
25000 Validation sequences


In [None]:
x_train

array([[   5,   25,  100, ...,   19,  178,   32],
       [   0,    0,    0, ...,   16,  145,   95],
       [   0,    0,    0, ...,    7,  129,  113],
       ...,
       [   0,    0,    0, ...,    4, 3586,    2],
       [   0,    0,    0, ...,   12,    9,   23],
       [   0,    0,    0, ...,  204,  131,    9]], dtype=int32)
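
The leading zeros in `x_train` come from the `pad_sequences` defaults (`padding="pre"`, `truncating="pre"`): short reviews are padded at the front, and long reviews lose their earliest tokens. A minimal pure-Python sketch of that behavior, for illustration only (`pad_pre` is a hypothetical helper, not the Keras implementation, and assumes `maxlen` is positive):

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic the pad_sequences defaults: padding='pre', truncating='pre'."""
    seq = list(seq)[-maxlen:]  # keep only the last maxlen items
    return [value] * (maxlen - len(seq)) + seq  # pad at the front

short = pad_pre([7, 8, 9], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6], maxlen=4)
print(short)  # [0, 0, 7, 8, 9]
print(long)   # [3, 4, 5, 6]
```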

## Train and evaluate the model

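Compile the model with the Adam optimizer and binary cross-entropy loss, then train it for two epochs. The cells after training define a `decide` function and try it on a few example reviews.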

In [None]:
model.compile("adam", "binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=32, epochs=2, validation_data=(x_val, y_val))


Epoch 1/2
 28/782 [>.............................] - ETA: 7:40 - loss: 0.6921 - accuracy: 0.5067

KeyboardInterrupt: ignored
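
The log above stops after 28 of 782 batches, with a loss of 0.6921 and an accuracy of 0.5067. As a point of reference, the sketch below computes the binary cross-entropy of a classifier that always outputs 0.5, which comes to ln 2 (about 0.6931), so the logged values are roughly at chance level. The `binary_crossentropy` function here is a hand-written illustration, not the Keras loss object.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over paired labels and probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# A model that predicts 0.5 for every review, whatever the label
chance = binary_crossentropy([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
print(round(chance, 4))  # 0.6931
```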

In [None]:
import tensorflow_datasets as tfds
from tensorflow.keras.preprocessing import text
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.datasets.imdb import get_word_index

imdb = tfds.load('imdb_reviews', as_supervised=True)
train_data, test_data = imdb['train'], imdb['test']

training_sentences = []
training_labels = []

for sentence, label in train_data:
    # tfds yields bytes; decode to get clean text instead of "b'...'"
    training_sentences.append(sentence.numpy().decode("utf-8"))
    training_labels.append(int(label.numpy()))
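
With `as_supervised=True`, each review arrives as a bytes object once `.numpy()` is called. The short sketch below shows the difference between wrapping bytes in `str()` and decoding them. The sample text is a made-up example, not a row from the dataset.

```python
raw = b"This was a great movie!"  # hypothetical review as bytes

as_repr = str(raw)             # keeps the b'...' wrapper
as_text = raw.decode("utf-8")  # plain text

print(as_repr)  # b'This was a great movie!'
print(as_text)  # This was a great movie!
```

The wrapper matters for tokenization: the `filters` string passed to `Tokenizer` in this notebook does not include the apostrophe, so a leading `b'` would stay attached to the first word.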

In [None]:
training_labels_final = np.array(training_labels).astype(float)
print(training_sentences[0])     # first review
print(training_labels_final[0])  # first label


In [None]:
vocab_size = 2000 # The maximum number of words to keep, based on word frequency.
tokenizer = text.Tokenizer(
    num_words=vocab_size,
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
    lower=True,
    split=" ",
    oov_token="<OOV>"
)
tokenizer.fit_on_texts(training_sentences)

def decide(review):
    # Convert the review to token ids, then pad to maxlen with the same
    # pad_sequences defaults that were used for x_train.
    tokens = tokenizer.texts_to_sequences([review])
    padded = sequence.pad_sequences(tokens, maxlen=maxlen)
    score = model.predict(padded)[0][0]  # sigmoid output in [0, 1]
    # Scores between 0.4 and 0.6 are treated as too uncertain to call.
    if score >= 0.6:
        return "Positive review"
    elif score <= 0.4:
        return "Negative review"
    else:
        return "Neutral review"
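
One thing to watch: the model above is trained on the integer ids from `keras.datasets.imdb.load_data`, while a `Tokenizer` fit on the raw tfds text assigns its own ids, so the two vocabularies are not guaranteed to line up. A minimal sketch of encoding raw text with the Keras IMDB convention follows, assuming the `load_data` defaults (`start_char=1`, `oov_char=2`, `index_from=3`). `encode_review` and `toy_word_index` are hypothetical names used for illustration; in the notebook the dictionary would come from `get_word_index()`, and the regex split is only a stand-in for the Keras text filters.

```python
import re

def encode_review(review, word_index, max_features, index_from=3):
    """Encode text using the id convention of keras.datasets.imdb.load_data.

    Assumes the load_data defaults: 0 = padding, 1 = start marker,
    2 = out-of-vocabulary, and every word rank shifted by index_from.
    """
    # Simple lowercase word split (assumption: close enough for a demo)
    words = re.findall(r"[a-z0-9']+", review.lower())
    ids = [1]  # start-of-sequence marker
    for word in words:
        rank = word_index.get(word)
        if rank is not None and rank + index_from < max_features:
            ids.append(rank + index_from)
        else:
            ids.append(2)  # unknown or too-rare word
    return ids

# Hypothetical toy word index (rank 1 = most frequent word)
toy_word_index = {"the": 1, "movie": 2, "bad": 3}
encoded = encode_review("The movie was bad", toy_word_index, max_features=20000)
print(encoded)  # [1, 4, 5, 2, 6]
```

The resulting list can be passed to `pad_sequences` with `maxlen=maxlen` in the same way as the training data.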

In [None]:
decide("I hate the movie, they made no effort in making the movie. Waste of time!")

In [None]:
decide("Awesome movie! Loved the way in which the hero acted.")

In [None]:
decide("This movie is very bad...")

In [None]:
decide("I absolutely hated this movie!")

In [None]:
decide("Please everybody should come and see this movie, I love it!")