## Bi-Directional LSTMs

Reference: [Jon Krohn](https://github.com/the-deep-learners/TensorFlow-LiveLessons/blob/master/notebooks/bidirectional_lstm.ipynb)

In this model, we classify the sentiment of IMDB movie reviews using a bi-directional LSTM.

In [2]:

import keras
from keras.datasets import imdb
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding, SpatialDropout1D, LSTM
from tensorflow.keras.layers import Bidirectional # note this dependency
from keras.callbacks import ModelCheckpoint
import os
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt
%matplotlib inline

Load Data

In [3]:
# vector-space embedding
n_dim = 64
n_unique_words = 10000
max_review_length = 200
# the review length can be a bit longer here, since we read each review in both directions:
# gradients vanish from both ends of the sequence
pad_type = trunc_type = 'pre'
drop_embed = 0.2

In [4]:
(x_train, y_train), (x_valid, y_valid) = imdb.load_data(num_words=n_unique_words)


Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


Preprocess the data

In [5]:
x_train = pad_sequences(x_train, maxlen=max_review_length, padding=pad_type, truncating=trunc_type, value=0)
x_valid = pad_sequences(x_valid, maxlen=max_review_length, padding=pad_type, truncating=trunc_type, value=0)
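
To make the effect of `padding='pre'` and `truncating='pre'` concrete, here is a minimal pure-Python sketch of what those two options do to a single review. The helper `pad_pre` is a hypothetical stand-in written for illustration, not the Keras implementation, and it assumes `maxlen` is at least 1.

```python
def pad_pre(sequence, maxlen, value=0):
    """Illustrative stand-in for pad_sequences(..., padding='pre', truncating='pre')."""
    # 'pre' truncation drops tokens from the start, keeping the last maxlen tokens
    kept = sequence[-maxlen:]
    # 'pre' padding fills the front of the sequence with the padding value
    return [value] * (maxlen - len(kept)) + kept


short_review = [11, 12, 13]
long_review = [1, 2, 3, 4, 5, 6, 7]

print(pad_pre(short_review, maxlen=5))  # [0, 0, 11, 12, 13]
print(pad_pre(long_review, maxlen=5))   # [3, 4, 5, 6, 7]
```

With pre-padding, the real tokens sit at the end of each row, so short reviews start with a run of zeros and long reviews keep only their final words.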

Set Hyperparameters

In [6]:
# output directory name
output_dir = 'model_output/biLSTM'

# training details
epochs = 6
batch_size = 128

# LSTM layer architecture:
n_lstm = 256
drop_lstm = 0.2

Build the model

In [7]:
model = Sequential()
model.add(Embedding(n_unique_words, n_dim, input_length=max_review_length))
model.add(SpatialDropout1D(drop_embed))
# wrap the LSTM in the Bidirectional wrapper
model.add(Bidirectional(LSTM(n_lstm, dropout=drop_lstm)))
model.add(Dense(1, activation='sigmoid'))
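
As a rough illustration of what the `Bidirectional` wrapper does, the sketch below runs a toy one-unit recurrence over a sequence in both directions and concatenates the two final states, which is the wrapper's default `concat` merge mode. The functions `simple_rnn_last_state` and `bidirectional_last_state` and all weights are hypothetical and chosen only for the example. A real LSTM has gates and many units per direction.

```python
import math


def simple_rnn_last_state(sequence, w_in, w_rec):
    """Toy one-unit recurrence: h_t = tanh(w_in * x_t + w_rec * h_{t-1})."""
    h = 0.0
    for x in sequence:
        h = math.tanh(w_in * x + w_rec * h)
    return h


def bidirectional_last_state(sequence, fwd_weights, bwd_weights):
    """Run one recurrence left-to-right, another right-to-left, then concatenate."""
    forward = simple_rnn_last_state(sequence, *fwd_weights)
    backward = simple_rnn_last_state(list(reversed(sequence)), *bwd_weights)
    return [forward, backward]


# Each direction has its own weights (hypothetical values)
state = bidirectional_last_state([1.0, 0.0, -1.0], (0.8, 0.3), (0.5, -0.2))
print(len(state))  # 2: one value per direction
```

The concatenation is why a 256-unit LSTM inside the wrapper produces a 512-dimensional output.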

In [8]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 200, 64)           640000    
                                                                 
 spatial_dropout1d (Spatial  (None, 200, 64)           0         
 Dropout1D)                                                      
                                                                 
 bidirectional (Bidirection  (None, 512)               657408    
 al)                                                             
                                                                 
 dense (Dense)               (None, 1)                 513       
                                                                 
Total params: 1297921 (4.95 MB)
Trainable params: 1297921 (4.95 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


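We can see that the number of parameters is larger: the Bidirectional wrapper trains two LSTMs, one reading the sequence forwards and one backwards, so the LSTM parameter count doubles and the layer outputs 2 × 256 = 512 values. The sequence length (200 words) affects computation time, but not the number of parameters.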
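
The parameter counts reported by `model.summary()` can be reproduced by hand. The sketch below uses the standard LSTM count of four gates, each with input weights, recurrent weights and a bias, and the hyperparameters defined earlier in this notebook.

```python
n_unique_words, n_dim, n_lstm = 10000, 64, 256

# one n_dim-dimensional vector per word in the vocabulary
embedding_params = n_unique_words * n_dim

# four gates, each with input weights, recurrent weights and a bias vector
lstm_params_one_direction = 4 * (n_lstm * (n_dim + n_lstm) + n_lstm)

# the Bidirectional wrapper holds a forward and a backward copy of the LSTM
bidirectional_params = 2 * lstm_params_one_direction

# the dense layer sees the concatenated 2 * n_lstm outputs, plus one bias
dense_params = 2 * n_lstm + 1

total_params = embedding_params + bidirectional_params + dense_params

print(embedding_params)      # 640000
print(bidirectional_params)  # 657408
print(dense_params)          # 513
print(total_params)          # 1297921
```

None of these terms involve `max_review_length`, which is why changing the sequence length leaves the parameter count unchanged.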


Compile the model

In [9]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [10]:
modelcheckpoint = ModelCheckpoint(filepath=output_dir+"/weights.{epoch:02d}.hdf5")
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
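
The `{epoch:02d}` placeholder in the checkpoint path is an ordinary Python format field. The sketch below shows how the template expands. It assumes, as appears to be the case in recent Keras versions, that the callback fills in the epoch counting from 1; the helper `checkpoint_path` is hypothetical.

```python
output_dir = 'model_output/biLSTM'
filepath_template = output_dir + "/weights.{epoch:02d}.hdf5"


def checkpoint_path(zero_based_epoch):
    # Assumption: the callback formats the template with epoch + 1,
    # so the first epoch is saved as weights.01.hdf5
    return filepath_template.format(epoch=zero_based_epoch + 1)


for zero_based_epoch in range(3):
    print(checkpoint_path(zero_based_epoch))
# model_output/biLSTM/weights.01.hdf5
# model_output/biLSTM/weights.02.hdf5
# model_output/biLSTM/weights.03.hdf5
```

Listing the contents of `output_dir` after training is the simplest way to confirm which numbering your Keras version uses.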

Train the model

* Use a GPU when training RNNs, especially a bi-directional LSTM.




In [11]:
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, verbose=1, validation_data=(x_valid, y_valid), callbacks=[modelcheckpoint]);

Epoch 1/6
Epoch 2/6




Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6


Evaluate the epoch with highest accuracy / lowest loss

In [12]:
# insert the relevant epoch

model.load_weights(output_dir+"/weights.03.hdf5") # recent Keras versions number these files from 01, so this is the third epoch
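
Rather than reading the training log by eye, the epoch to load can be picked programmatically. The sketch below assumes the return value of `model.fit` was captured (for example as `history`), so that `history.history['val_loss']` holds one validation loss per epoch. The helper `best_epoch` and the loss values are placeholders for illustration, not results from this notebook.

```python
def best_epoch(val_losses):
    """Return the 1-indexed epoch with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return best_index + 1


# Placeholder values for illustration only; in practice use history.history['val_loss']
example_val_losses = [0.40, 0.33, 0.31, 0.35, 0.42, 0.50]

epoch = best_epoch(example_val_losses)
# Assumption: checkpoint files are numbered from 01
weights_file = "weights.{:02d}.hdf5".format(epoch)
print(epoch, weights_file)  # 3 weights.03.hdf5
```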

In [16]:
y_hat = model.predict(x_valid)



In [None]:
plt.hist(y_hat)
_ = plt.axvline(x=0.5, color='orange')
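
The orange line at 0.5 marks the decision threshold: scores above it count as positive reviews, scores below it as negative. As a minimal sketch of how accuracy follows from that threshold, the hypothetical helper below works on plain lists. With the real predictions you would first flatten them, for example with `y_hat.ravel()`. The toy labels and scores are made up for the example.

```python
def accuracy_at_threshold(y_true, y_score, threshold=0.5):
    """Fraction of examples whose thresholded score matches the label."""
    predictions = [1 if score > threshold else 0 for score in y_score]
    correct = sum(int(p == t) for p, t in zip(predictions, y_true))
    return correct / len(y_true)


# Toy data for illustration only
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.6, 0.4, 0.9]

print(accuracy_at_threshold(toy_labels, toy_scores))  # 0.5
```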


In [15]:
"{:0.2f}".format(roc_auc_score(y_valid, y_hat)*100.0)

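
Unlike accuracy, ROC AUC does not depend on a single threshold: it is the probability that a randomly chosen positive review receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition on toy data. The function `pairwise_auc` is a hypothetical illustration and is quadratic in the number of examples, so `roc_auc_score` remains the right tool for the real predictions.

```python
def pairwise_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, t in zip(y_score, y_true) if t == 1]
    negatives = [s for s, t in zip(y_score, y_true) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy data for illustration only
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

print(pairwise_auc(toy_labels, toy_scores))  # 0.75
```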

This is the best RNN, but the CNN still has the best ROC AUC.

Why not go for the CNN?  

The CNN only considers three-word features, so on a larger data set the LSTM, which reads the full 200-word sequence, would likely pick up far more nuance.
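
To picture what "three-word features" means, the sketch below lists the windows a width-3 convolution would slide over in a short sentence. It assumes the CNN referred to uses a kernel of width 3; the helper `sliding_windows` and the example sentence are hypothetical.

```python
def sliding_windows(tokens, width=3):
    """Return every run of `width` consecutive tokens."""
    return [tuple(tokens[i:i + width]) for i in range(len(tokens) - width + 1)]


# Made-up sentence for illustration only
review = "this movie was not good at all".split()

for window in sliding_windows(review):
    print(window)
# ('this', 'movie', 'was')
# ('movie', 'was', 'not')
# ('was', 'not', 'good')
# ('not', 'good', 'at')
# ('good', 'at', 'all')
```

Each window is scored independently of the others, whereas the LSTM carries state across the whole sequence.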