# Tutorial 2 - Text Classification - Deep Learning Sequential Models - LSTMs, Stacked LSTMs and Bidirectional LSTMs

Another interesting approach to supervised deep learning is the use of recurrent neural networks (RNNs) and long short-term memory networks (LSTMs), which also take the order of the data (words, events and so on) into account. These are more advanced models than regular fully connected deep networks and usually take more time to train.

The focus of this tutorial is to build different sequential deep learning models for a classic text classification problem, sentiment analysis. We cover the following models:

- Long Short Term Memory Networks (LSTMs)
- Stacked LSTMs
- Bi-directional LSTMs

In [1]:
!nvidia-smi

Thu Jul 22 00:32:47 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P0    25W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
!pip install contractions
!pip install textsearch
!pip install tqdm
import nltk
nltk.download('punkt')

Collecting contractions
  Downloading contractions-0.0.52-py2.py3-none-any.whl (7.2 kB)
Collecting textsearch>=0.0.21
  Downloading textsearch-0.0.21-py2.py3-none-any.whl (7.5 kB)
Collecting pyahocorasick
  Downloading pyahocorasick-1.4.2.tar.gz (321 kB)
     |████████████████████████████████| 321 kB 21.1 MB/s
Collecting anyascii
  Downloading anyascii-0.2.0-py3-none-any.whl (283 kB)
     |████████████████████████████████| 283 kB 15.7 MB/s
Building wheels for collected packages: pyahocorasick
  Building wheel for pyahocorasick (setup.py) ... done
  Created wheel for pyahocorasick: filename=pyahocorasick-1.4.2-cp37-cp37m-linux_x86_64.whl size=85455 sha256=feb0ff708e40dec35c34a27095ca93d743222aae4b5ebd091afa464b8dc4f0a3
  Stored in directory: /root/.cache/pip/wheels/25/19/a6/8f363d9939162782bb8439d886469756271abc01f76fbd790f
Successfully built pyahocorasick
Installing collected packages: pyahocorasick, anyascii, textsearch, contractions
Successfully instal

True

In [3]:
import pandas as pd
import numpy as np

# fix random seed for reproducibility
seed = 42
np.random.seed(seed)

## Load Dataset

In [4]:
from google.colab import drive
drive.mount('/content/drive')
dataset = pd.read_csv("/content/drive/My Drive/NLP_DeepLearning_Course/Week1/movie_reviews.csv.bz2", compression='bz2')
dataset.info()
# dataset = pd.read_csv(r'movie_reviews.csv.bz2', compression='bz2')
# dataset.info()

Mounted at /content/drive
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [5]:
# take a peek at the data
dataset.head()

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


## Prepare Train Test Splits

In [6]:
# build train and test datasets
reviews = dataset['review'].values
sentiments = dataset['sentiment'].values

train_reviews = reviews[:35000]
train_sentiments = sentiments[:35000]

test_reviews = reviews[35000:]
test_sentiments = sentiments[35000:]

## Text Wrangling and Normalization

In [7]:
import contractions
from bs4 import BeautifulSoup
import numpy as np
import re
import tqdm
import unicodedata


def strip_html_tags(text):
  soup = BeautifulSoup(text, "html.parser")
  # drop embedded frames and scripts before extracting the visible text
  for s in soup(['iframe', 'script']):
    s.extract()
  stripped_text = soup.get_text()
  # collapse runs of line breaks (a '|' inside a character class would match literal pipes)
  stripped_text = re.sub(r'[\r\n]+', '\n', stripped_text)
  return stripped_text

def remove_accented_chars(text):
  text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
  return text

def pre_process_corpus(docs):
  norm_docs = []
  for doc in tqdm.tqdm(docs):
    doc = strip_html_tags(doc)
    doc = doc.translate(doc.maketrans("\n\t\r", "   "))
    doc = doc.lower()
    doc = remove_accented_chars(doc)
    doc = contractions.fix(doc)
    # remove special characters and collapse extra whitespace
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I|re.A)
    doc = re.sub(' +', ' ', doc)
    doc = doc.strip()  
    norm_docs.append(doc)
  
  return norm_docs

In [8]:
%%time

norm_train_reviews = pre_process_corpus(train_reviews)
norm_test_reviews = pre_process_corpus(test_reviews)

100%|██████████| 35000/35000 [00:16<00:00, 2125.34it/s]
100%|██████████| 15000/15000 [00:07<00:00, 2113.63it/s]

CPU times: user 23.6 s, sys: 180 ms, total: 23.8 s
Wall time: 23.6 s
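To see what the normalization steps do to a single review, here is a minimal standard-library sketch. It is a simplification and not the function used above: a crude regex stands in for `BeautifulSoup`, contraction expansion is skipped, and `simple_normalize` and the sample sentence are made up for illustration.

```python
import re
import unicodedata


def simple_normalize(text):
    """Simplified stand-in for pre_process_corpus applied to one document."""
    # crude tag removal; the notebook uses BeautifulSoup for this step
    text = re.sub(r'<[^>]+>', ' ', text)
    text = text.lower()
    # decompose accented characters and drop the non-ASCII accent marks
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    # keep only letters, digits and whitespace
    text = re.sub(r'[^a-z0-9\s]', '', text)
    # collapse repeated whitespace and trim the ends
    return re.sub(r'\s+', ' ', text).strip()


sample = "A wonderful little production. <br /><br />The café was GREAT!"
cleaned = simple_normalize(sample)
print(cleaned)
```

The HTML line breaks, punctuation, capital letters and the accent on "café" are all removed, leaving a plain lowercase string that is easy to tokenize.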





## Preprocessing
To prepare text data for our deep learning model, we transform each review into a sequence. Every word in the review is mapped to an integer index, so each sentence turns into a sequence of numbers.

To perform this transformation, ``tensorflow.keras`` provides the ``Tokenizer`` class.

In [9]:
import tensorflow as tf

t = tf.keras.preprocessing.text.Tokenizer(oov_token='<UNK>')
# fit the tokenizer on the documents
t.fit_on_texts(norm_train_reviews)
t.word_index['<PAD>'] = 0

In [10]:
# word at max index, word at min index and index of <UNK>
max([(k, v) for k, v in t.word_index.items()], key = lambda x:x[1]), min([(k, v) for k, v in t.word_index.items()], key = lambda x:x[1]), t.word_index['<UNK>']

(('dawgis', 175845), ('<PAD>', 0), 1)

In [11]:
train_sequences = t.texts_to_sequences(norm_train_reviews)
test_sequences = t.texts_to_sequences(norm_test_reviews)
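The word-to-index mapping can be illustrated with a toy sketch in plain Python. This is not the Keras implementation; `build_word_index`, `to_sequence` and the two toy documents are hypothetical names and data, and the sketch only mimics the idea of reserving index 0 for padding, index 1 for unknown words, and giving more frequent words smaller indices.

```python
from collections import Counter


def build_word_index(docs, oov_token="<UNK>"):
    """Toy vocabulary builder: more frequent words get smaller indices."""
    counts = Counter(word for doc in docs for word in doc.split())
    # reserved slots, mirroring the <PAD> and <UNK> entries used in this tutorial
    word_index = {"<PAD>": 0, oov_token: 1}
    for word, _ in counts.most_common():
        word_index[word] = len(word_index)
    return word_index


def to_sequence(doc, word_index, oov_token="<UNK>"):
    """Map each word to its index; unseen words fall back to the OOV index."""
    return [word_index.get(word, word_index[oov_token]) for word in doc.split()]


toy_docs = ["the movie was good", "the movie was bad"]
toy_index = build_word_index(toy_docs)
toy_sequence = to_sequence("the sequel was good", toy_index)
print(toy_index)
print(toy_sequence)
```

The word "sequel" never appears in the toy training documents, so it is mapped to the `<UNK>` index, which is what happens to unseen test-set words in the real pipeline as well.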

### Processed Dataset Summary

In [12]:
print("Vocabulary size={}".format(len(t.word_index)))
print("Number of Documents={}".format(t.document_count))

Vocabulary size=175846
Number of Documents=35000


## Sequence Normalization

Not all reviews are of the same length. To handle this difference, we define a maximum length. Reviews shorter than this length are padded with zeros, while longer ones are truncated.

In [13]:
MAX_SEQUENCE_LENGTH = 1000

In [14]:
# pad dataset to a maximum review length in words
X_train = tf.keras.preprocessing.sequence.pad_sequences(train_sequences, maxlen=MAX_SEQUENCE_LENGTH)
X_test = tf.keras.preprocessing.sequence.pad_sequences(test_sequences, maxlen=MAX_SEQUENCE_LENGTH)
X_train.shape, X_test.shape

((35000, 1000), (15000, 1000))
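As a rough illustration of padding and truncation, the NumPy sketch below handles a single sequence. It assumes the Keras defaults of `padding='pre'` and `truncating='pre'`, so zeros are added at the front and long sequences keep their last `maxlen` tokens; `pad_or_truncate` is a hypothetical helper, not part of Keras.

```python
import numpy as np


def pad_or_truncate(sequence, maxlen, value=0):
    """Left-pad short sequences and keep only the last `maxlen` items of long ones."""
    sequence = list(sequence)[-maxlen:]            # drop tokens from the front if too long
    padding = [value] * (maxlen - len(sequence))   # zeros go in front of the tokens
    return np.array(padding + sequence, dtype="int32")


short_review = pad_or_truncate([5, 8, 2], maxlen=5)
long_review = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short_review)
print(long_review)
```

Both results have length 5, which is what allows reviews of different lengths to be stacked into a single 2-D array such as `X_train`.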

## Encoding Labels
The dataset contains labels of the form positive/negative. The following step encodes them as integers using ``sklearn``'s ``LabelEncoder``.

In [15]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# positive -> 1, negative -> 0
num_classes=2 

In [16]:
y_train = le.fit_transform(train_sentiments)
y_test = le.transform(test_sentiments)
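The mapping `negative -> 0, positive -> 1` follows from sorting the class names alphabetically. The sketch below reproduces that idea with `np.unique` on a made-up label array; it illustrates the encoding only and does not replace `LabelEncoder`.

```python
import numpy as np

# hypothetical toy labels, not taken from the dataset
toy_labels = np.array(["positive", "negative", "negative", "positive"])

# unique classes come back sorted; the inverse indices are the integer codes
classes, encoded = np.unique(toy_labels, return_inverse=True)
print(classes)
print(encoded)
```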

In [17]:
VOCAB_SIZE = len(t.word_index)

## LSTM Model

## Embeddings
The Embedding layer helps us generate word embeddings from scratch. This layer is initialized with some weights, which the optimizer updates just like the weights of the neuron units in other layers as the network tries to minimize the loss in each epoch. Thus, the embedding layer tries to optimize its weights so that we get the word embeddings that produce the minimum error in the model and capture semantic similarity and relationships among words.

How do we get the embeddings? Let’s say we have a review with three terms ['movie', 'was', 'good'] and a vocab_map consisting of word to index mappings for the 175846 entries in our vocabulary.

<img src="https://i.imgur.com/WuV47DW.png">
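Conceptually, an embedding layer is a lookup table: each token index selects one row of a weight matrix. The NumPy sketch below uses a made-up five-entry vocabulary and a 4-dimensional embedding with random weights; all names and sizes are illustrative assumptions, whereas the real model uses the full vocabulary and 300 dimensions.

```python
import numpy as np

rng = np.random.default_rng(42)

# hypothetical toy vocabulary; the real vocab_map is built by the Tokenizer
toy_vocab = {"<PAD>": 0, "<UNK>": 1, "movie": 2, "was": 3, "good": 4}
embedding_dim = 4

# one row of weights per vocabulary entry; training would update these values
embedding_matrix = rng.normal(size=(len(toy_vocab), embedding_dim))

review = ["movie", "was", "good"]
token_ids = [toy_vocab.get(word, toy_vocab["<UNK>"]) for word in review]

# the lookup itself is plain row indexing
review_vectors = embedding_matrix[token_ids]
print(token_ids)
print(review_vectors.shape)
```

Each of the three words is replaced by its own dense vector, so a review of length 3 becomes an array of shape `(3, embedding_dim)`.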

## LSTM
LSTMs try to overcome the shortcomings of RNN models, especially with regard to handling long-term dependencies and the problems that occur when the weight matrix associated with the units (neurons) becomes too small (leading to vanishing gradients) or too large (leading to exploding gradients). These architectures are more complex than regular deep networks, and going into the detailed internals and math is out of the current scope, but we will try to cover the essentials here without making it math heavy.

<img src="https://i.imgur.com/c8qGKX8.png">




---

__The sequence of operations in the LSTM cell is briefly shown as follows.__

<img src="https://i.imgur.com/uiIbDk1.png">
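The operations inside the cell can also be written out as a short NumPy sketch of a single time step, using the standard LSTM equations (input, forget and output gates plus a candidate cell state). The weight layout, the function name `lstm_step` and the tiny dimensions are assumptions made for illustration; Keras arranges and initializes its weights in its own way.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (input_dim, 4 * units), U: (units, 4 * units), b: (4 * units,)
    """
    units = h_prev.shape[-1]
    z = x_t @ W + h_prev @ U + b
    i = sigmoid(z[:units])                   # input gate: how much new information to write
    f = sigmoid(z[units:2 * units])          # forget gate: how much old cell state to keep
    g = np.tanh(z[2 * units:3 * units])      # candidate cell state
    o = sigmoid(z[3 * units:])               # output gate: how much cell state to expose
    c_t = f * c_prev + i * g                 # updated cell state
    h_t = o * np.tanh(c_t)                   # updated hidden state
    return h_t, c_t


rng = np.random.default_rng(0)
input_dim, units = 3, 2
W = rng.normal(size=(input_dim, 4 * units))
U = rng.normal(size=(units, 4 * units))
b = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for x_t in rng.normal(size=(5, input_dim)):  # a toy sequence of 5 time steps
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)
```

The additive update of the cell state `c_t` is the part that helps gradients survive across many time steps, which is what the gates are designed to control.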


## Build the Model

In [18]:

EMBEDDING_DIM = 300 # dimension for dense embeddings for each token
LSTM_DIM = 128 # LSTM hidden state dimensionality 
# MAX_SEQUENCE_LENGTH = 1000 # ref, value set above

model = tf.keras.models.Sequential()

model.add(tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, 
                                    output_dim=EMBEDDING_DIM, 
                                    input_length=MAX_SEQUENCE_LENGTH))

model.add(tf.keras.layers.SpatialDropout1D(0.1))

model.add(tf.keras.layers.LSTM(LSTM_DIM, return_sequences=False))

model.add(tf.keras.layers.Dense(256, activation='relu'))

model.add(tf.keras.layers.Dense(1, activation="sigmoid"))

model.compile(loss="binary_crossentropy", optimizer="adam",
              metrics=["accuracy"])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 1000, 300)         52753800  
_________________________________________________________________
spatial_dropout1d (SpatialDr (None, 1000, 300)         0         
_________________________________________________________________
lstm (LSTM)                  (None, 128)               219648    
_________________________________________________________________
dense (Dense)                (None, 256)               33024     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 257       
Total params: 53,006,729
Trainable params: 53,006,729
Non-trainable params: 0
_________________________________________________________________
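The parameter counts in the summary can be checked by hand. The sketch below uses the usual formulas: an embedding has one weight per vocabulary entry and dimension, an LSTM has four gate blocks each with input weights, recurrent weights and a bias, and a dense layer has a weight matrix plus a bias. The helper function names are made up for illustration.

```python
def embedding_params(vocab_size, embedding_dim):
    return vocab_size * embedding_dim


def lstm_params(input_dim, units):
    # 4 blocks (3 gates + candidate), each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    return input_dim * units + units


counts = {
    "embedding": embedding_params(175846, 300),
    "lstm": lstm_params(300, 128),
    "dense": dense_params(128, 256),
    "dense_1": dense_params(256, 1),
}
total_params = sum(counts.values())
print(counts)
print(total_params)
```

Almost all of the parameters sit in the embedding matrix, which is why the vocabulary size dominates the size of this model.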


## Train the Model

In [19]:
batch_size = 128
EPOCHS = 10

es = tf.keras.callbacks.EarlyStopping(monitor='val_loss', 
                                      patience=2,
                                      restore_best_weights=True,
                                      verbose=1)

model.fit(X_train, y_train, epochs=EPOCHS, batch_size=batch_size, 
          callbacks=[es],
          shuffle=True, validation_split=0.1, verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Restoring model weights from the end of the best epoch.
Epoch 00003: early stopping


<tensorflow.python.keras.callbacks.History at 0x7f78e60a64d0>
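The early stopping behaviour in the log can be sketched in a few lines of plain Python. The validation losses below are invented for illustration, since the log above does not show them, and `early_stopping_epoch` is a hypothetical helper that only mirrors the patience logic rather than the Keras callback itself.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (epoch at which training stops, epoch with the best validation loss)."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# made-up losses: validation loss is lowest after the first epoch
hypothetical_val_losses = [0.30, 0.33, 0.35, 0.36]
stopped_at, best_epoch = early_stopping_epoch(hypothetical_val_losses, patience=2)
print(stopped_at, best_epoch)
```

With `patience=2`, stopping at epoch 3 means epochs 2 and 3 did not improve on epoch 1, so `restore_best_weights=True` rolls the model back to the weights from the first epoch.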

## Evaluate Model

In [20]:
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=1)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 85.71%


In [21]:
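# predict_classes was removed in newer TensorFlow releases; threshold the sigmoid output instead
predictions = (model.predict(X_test) > 0.5).astype("int32").ravel()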
predictions[:10]



array([0, 1, 0, 1, 1, 0, 1, 0, 1, 1], dtype=int32)

In [22]:
predictions = ['positive' if item == 1 else 'negative' for item in predictions]

In [23]:
from sklearn.metrics import confusion_matrix, classification_report

labels = ['negative', 'positive']
print(classification_report(test_sentiments, predictions))
pd.DataFrame(confusion_matrix(test_sentiments, predictions), index=labels, columns=labels)

              precision    recall  f1-score   support

    negative       0.81      0.93      0.87      7490
    positive       0.91      0.79      0.85      7510

    accuracy                           0.86     15000
   macro avg       0.86      0.86      0.86     15000
weighted avg       0.86      0.86      0.86     15000



          negative  positive
negative      6935       555
positive      1588      5922
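The numbers in the classification report follow directly from the confusion matrix, where rows are true labels and columns are predicted labels. The NumPy sketch below recomputes them from the four counts reported for the single-layer LSTM.

```python
import numpy as np

# rows: true negative / true positive, columns: predicted negative / predicted positive
cm = np.array([[6935, 555],
               [1588, 5922]])

precision = np.diag(cm) / cm.sum(axis=0)   # correct predictions per predicted class
recall = np.diag(cm) / cm.sum(axis=1)      # correct predictions per true class
f1 = 2 * precision * recall / (precision + recall)
accuracy = np.trace(cm) / cm.sum()

print(np.round(precision, 2), np.round(recall, 2), np.round(f1, 2))
print(round(accuracy * 100, 2))
```

The model is noticeably better at recalling negative reviews than positive ones, which is visible in the 1588 positive reviews that were predicted as negative.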




---



# Stacked LSTM

We are well aware of how the depth of a neural network helps it learn complex and abstract concepts in general. Along the same lines, a stacked LSTM architecture, which has multiple LSTM layers stacked one after the other, has been shown to give considerable improvements. Stacked LSTMs were first presented by Graves et al. in their work "Speech Recognition with Deep Recurrent Neural Networks". They highlight the fact that depth (multiple layers of RNNs) has a greater impact on performance than the number of units per layer.

Though there isn’t any theoretical proof to explain this performance gain, empirical results help us understand the impact. These enhancements can be attributed to the model’s capacity to learn complex features and even abstract representations of inputs. Since there is a time component associated with LSTMs and RNNs in general, deeper networks also learn to operate at different time scales.

As we are making use of the high-level Keras API, we can easily extend the architecture we used in the previous section to add additional LSTM layers.
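Stacking only works if every recurrent layer that feeds another recurrent layer emits its hidden state at each time step, which is what `return_sequences=True` controls in Keras. The sketch below shows the shapes involved using a deliberately simple tanh RNN in NumPy rather than a real LSTM; `toy_recurrent_layer` and all dimensions are illustrative assumptions.

```python
import numpy as np


def toy_recurrent_layer(inputs, W, U, return_sequences):
    """Simple tanh RNN over inputs of shape (timesteps, input_dim)."""
    h = np.zeros(U.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(x_t @ W + h @ U)
        states.append(h)
    # either every hidden state (one per time step) or only the final one
    return np.stack(states) if return_sequences else h


rng = np.random.default_rng(0)
timesteps, input_dim, units = 6, 3, 4
x = rng.normal(size=(timesteps, input_dim))

W1, U1 = rng.normal(size=(input_dim, units)), rng.normal(size=(units, units))
W2, U2 = rng.normal(size=(units, units)), rng.normal(size=(units, units))

# the first layer must return the full sequence so the second layer has time steps to read
layer1_out = toy_recurrent_layer(x, W1, U1, return_sequences=True)
# the last layer returns only its final state, ready for a dense classifier
layer2_out = toy_recurrent_layer(layer1_out, W2, U2, return_sequences=False)
print(layer1_out.shape, layer2_out.shape)
```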

## Build Model

In [24]:
model2 = tf.keras.models.Sequential()

model2.add(tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, 
                                     output_dim=EMBEDDING_DIM, 
                                     input_length=MAX_SEQUENCE_LENGTH))
model2.add(tf.keras.layers.SpatialDropout1D(0.1))

# You can add more LSTM layers; just set return_sequences=True on every LSTM layer
# that feeds another LSTM layer.
model2.add(tf.keras.layers.LSTM(LSTM_DIM, return_sequences=True))
# The last LSTM layer must have return_sequences=False before passing its output
# on to the Dense layers below.
model2.add(tf.keras.layers.LSTM(LSTM_DIM, return_sequences=False))

model2.add(tf.keras.layers.Dense(256, activation='relu'))
model2.add(tf.keras.layers.Dense(1, activation="sigmoid"))

model2.compile(loss="binary_crossentropy", optimizer="adam",
              metrics=["accuracy"])
model2.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 1000, 300)         52753800  
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 1000, 300)         0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 1000, 128)         219648    
_________________________________________________________________
lstm_2 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dense_2 (Dense)              (None, 256)               33024     
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 257       
Total params: 53,138,313
Trainable params: 53,138,313
Non-trainable params: 0
__________________________________________

## Train the Model

In [25]:
batch_size = 128
EPOCHS = 10

es = tf.keras.callbacks.EarlyStopping(monitor='val_loss', 
                                      patience=2,
                                      restore_best_weights=True,
                                      verbose=1)

model2.fit(X_train, y_train, epochs=EPOCHS, batch_size=batch_size, 
           callbacks=[es],
           shuffle=True, validation_split=0.1, verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Restoring model weights from the end of the best epoch.
Epoch 00003: early stopping


<tensorflow.python.keras.callbacks.History at 0x7f78d01b3350>

## Evaluate Model

In [26]:
# Final evaluation of the model
scores = model2.evaluate(X_test, y_test, verbose=1)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 87.73%


In [27]:
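# predict_classes was removed in newer TensorFlow releases; threshold the sigmoid output instead
predictions = (model2.predict(X_test) > 0.5).astype("int32").ravel()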
predictions[:10]



array([0, 1, 0, 1, 1, 0, 1, 0, 1, 1], dtype=int32)

In [None]:
predictions = ['positive' if item == 1 else 'negative' for item in predictions]

In [None]:
labels = ['negative', 'positive']
print(classification_report(test_sentiments, predictions))
pd.DataFrame(confusion_matrix(test_sentiments, predictions), index=labels, columns=labels)

              precision    recall  f1-score   support

    negative       0.87      0.91      0.89      7490
    positive       0.90      0.86      0.88      7510

    accuracy                           0.88     15000
   macro avg       0.88      0.88      0.88     15000
weighted avg       0.88      0.88      0.88     15000



          negative  positive
negative      6785       705
positive      1055      6455




---



# Bidirectional LSTM

The second variant, very widely used nowadays, is the bidirectional LSTM. We have already discussed how LSTMs, and RNNs in general, condition their outputs by making use of previous timesteps. When it comes to text or any sequence data, this means that the LSTM is able to make use of past context to predict future timesteps. While this is a very useful property, it is not the best we can achieve.

A bidirectional LSTM (or biLSTM) is a combination of two LSTM layers which work simultaneously. The first is the usual forward LSTM, which takes the input sequence in its original order. The second is called the backward LSTM, which takes a reversed copy of the sequence as input. The forward and backward LSTMs work in tandem to process the original and reversed copies of the input sequences. Since we have two LSTM cells working on different contexts at any given time step, we need a way of defining the output that will be used by the downstream layers in the network. The outputs can be combined via summation, multiplication, concatenation or even averaging of hidden states. Different deep learning frameworks might set different defaults, but the most widely used method is concatenation of the biLSTM outputs.
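The forward/backward idea with concatenation can be sketched in NumPy. As a simplification, a plain tanh RNN stands in for each LSTM, and `run_rnn`, `bidirectional_concat` and the dimensions are hypothetical names and values chosen for illustration.

```python
import numpy as np


def run_rnn(inputs, W, U):
    """Simple tanh RNN returning the hidden state at every time step."""
    h = np.zeros(U.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(x_t @ W + h @ U)
        states.append(h)
    return np.stack(states)


def bidirectional_concat(inputs, forward_weights, backward_weights):
    forward_states = run_rnn(inputs, *forward_weights)
    # read the reversed sequence, then flip the outputs back to the original time order
    backward_states = run_rnn(inputs[::-1], *backward_weights)[::-1]
    # concatenate along the feature axis, doubling the output width
    return np.concatenate([forward_states, backward_states], axis=-1)


rng = np.random.default_rng(1)
timesteps, input_dim, units = 5, 3, 4
x = rng.normal(size=(timesteps, input_dim))
fwd = (rng.normal(size=(input_dim, units)), rng.normal(size=(units, units)))
bwd = (rng.normal(size=(input_dim, units)), rng.normal(size=(units, units)))

bi_out = bidirectional_concat(x, fwd, bwd)
print(bi_out.shape)
```

At every time step the first half of the output has seen the past and the second half has seen the future, so concatenation yields `2 * units` features per step.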


## Build Model

In [29]:
EMBEDDING_DIM = 300 # dimension for dense embeddings for each token
LSTM_DIM = 128 # LSTM units per direction (concatenated output has 2 * LSTM_DIM features)

inp = tf.keras.layers.Input(shape=(MAX_SEQUENCE_LENGTH,))

x = tf.keras.layers.Embedding(VOCAB_SIZE, 
                              EMBEDDING_DIM, 
                              trainable=True)(inp)

x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(LSTM_DIM, 
                                                       return_sequences=True),
                                  merge_mode='concat')(x)

x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(LSTM_DIM, 
                                                       return_sequences=False),
                                  merge_mode='concat')(x)

x = tf.keras.layers.Dense(256, activation='relu')(x)
x = tf.keras.layers.Dropout(rate=0.2)(x)
x = tf.keras.layers.Dense(256, activation='relu')(x)
x = tf.keras.layers.Dropout(rate=0.2)(x)

outp = tf.keras.layers.Dense(1, activation='sigmoid')(x)
# initialize the model
model3 = tf.keras.models.Model(inputs=inp, outputs=outp)

    
model3.compile(loss='binary_crossentropy', optimizer=tf.keras.optimizers.Adam(), 
               metrics=['accuracy'])
model3.summary()

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 1000)]            0         
_________________________________________________________________
embedding_3 (Embedding)      (None, 1000, 300)         52753800  
_________________________________________________________________
bidirectional_2 (Bidirection (None, 1000, 256)         439296    
_________________________________________________________________
bidirectional_3 (Bidirection (None, 256)               394240    
_________________________________________________________________
dense_7 (Dense)              (None, 256)               65792     
_________________________________________________________________
dropout_2 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_8 (Dense)              (None, 256)               6579
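The bidirectional parameter counts visible in the summary can be checked with the same LSTM formula as before, doubled because each wrapper holds a forward and a backward LSTM. The helper names below are made up for illustration, and only the layers whose counts are fully visible in the output are verified.

```python
def lstm_params(input_dim, units):
    # 4 blocks (3 gates + candidate), each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)


def bidirectional_lstm_params(input_dim, units):
    # one forward and one backward LSTM with separate weights
    return 2 * lstm_params(input_dim, units)


def dense_params(input_dim, units):
    return input_dim * units + units


first_bilstm = bidirectional_lstm_params(300, 128)       # reads 300-d embeddings
second_bilstm = bidirectional_lstm_params(2 * 128, 128)  # reads the 256-d concatenated output
first_dense = dense_params(2 * 128, 256)
print(first_bilstm, second_bilstm, first_dense)
```

The second bidirectional layer has fewer parameters than the first because its input is the 256-dimensional concatenated output rather than the 300-dimensional embeddings.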

## Train Model

In [30]:
batch_size = 100
model3.fit(X_train, y_train, epochs=2, batch_size=batch_size, 
           shuffle=True, validation_split=0.1, verbose=1)

Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x7f78063d7f10>

## Evaluate Model

In [31]:
# Final evaluation of the model
scores = model3.evaluate(X_test, y_test, verbose=1)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 83.61%


In [32]:
prediction_probs = model3.predict(X_test, verbose=1).ravel()
predictions = [1 if prob > 0.5 else 0 for prob in prediction_probs]
predictions[:10]



[0, 1, 0, 1, 1, 0, 1, 0, 1, 1]

In [34]:
predictions = ['positive' if item == 1 else 'negative' for item in predictions]

In [35]:
labels = ['negative', 'positive']
print(classification_report(test_sentiments, predictions))
pd.DataFrame(confusion_matrix(test_sentiments, predictions), index=labels, columns=labels)

              precision    recall  f1-score   support

    negative       0.83      0.85      0.84      7490
    positive       0.85      0.82      0.83      7510

    accuracy                           0.84     15000
   macro avg       0.84      0.84      0.84     15000
weighted avg       0.84      0.84      0.84     15000



          negative  positive
negative      6374      1116
positive      1342      6168
