# Recurrent Neural Networks: Long Short-Term Memory networks (LSTMs)

We have already covered feed-forward neural networks in the computer vision and recommender system modules. For natural language processing, one popular type of deep learning architecture is the Recurrent Neural Network (RNN). RNNs differ from feed-forward networks in that some of their inner layers are updated recursively while iterating over the sequence of input words. We are going to use one specific RNN architecture called Long Short-Term Memory networks (LSTMs), which have been especially successful in various NLP tasks, including machine translation, question answering, and text classification, our case study in this module.

Learn more about how RNNs and LSTMs encode texts as vectors first:

https://medium.com/towards-data-science/illustrated-guide-to-recurrent-neural-networks-79e5eb8049c9

If you want to understand in depth how an LSTM cell works, you can go through these two articles:

https://medium.com/towards-data-science/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21

http://colah.github.io/posts/2015-08-Understanding-LSTMs/
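
As a complement to these articles, here is a minimal NumPy sketch of a single LSTM time step, applied repeatedly over a short sequence. The function name `lstm_step`, the gate ordering inside the weight matrices, the toy sizes and the random weights are all assumptions made for illustration; Keras uses its own internal layout and optimized kernels.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step (illustrative layout, not Keras internals).

    W: (input_dim, 4*units), U: (units, 4*units), b: (4*units,)
    The four blocks hold the input gate, forget gate, candidate and output gate.
    """
    units = h_prev.shape[0]
    z = x_t @ W + h_prev @ U + b
    i = sigmoid(z[0 * units:1 * units])   # input gate: how much new information to write
    f = sigmoid(z[1 * units:2 * units])   # forget gate: how much of the old cell state to keep
    g = np.tanh(z[2 * units:3 * units])   # candidate values for the cell state
    o = sigmoid(z[3 * units:4 * units])   # output gate: how much of the cell state to expose
    c_t = f * c_prev + i * g              # new cell state (long-term memory)
    h_t = o * np.tanh(c_t)                # new hidden state (output at this position)
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, units, seq_len = 4, 3, 5       # toy sizes
W = rng.normal(scale=0.1, size=(input_dim, 4 * units))
U = rng.normal(scale=0.1, size=(units, 4 * units))
b = np.zeros(4 * units)

h = np.zeros(units)
c = np.zeros(units)
sequence = rng.normal(size=(seq_len, input_dim))  # stands in for 5 word embeddings
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, W, U, b)

print(h.shape)  # the final hidden state is the vector that encodes the whole sequence
```

The same weights are reused at every position, which is what makes the layer recurrent: only `h` and `c` change as the sequence is read.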


## Import Modules

In [115]:
import pandas as pd
import numpy as np
import tensorflow as tf

import keras
from keras.layers import TextVectorization, Dense, Embedding, LSTM, Dropout, Bidirectional, Conv1D, MaxPool1D, Flatten, GlobalMaxPool1D
from keras.models import Sequential
from keras.callbacks import EarlyStopping
from keras import Input

## Functions

In [2]:
def print_results(dic):
    print("Accuracy on test set are: ")
    for key, value in dic.items():
        print(f"{value*100:.1f}% : {key}")

## Load Data

In [3]:
outdir = '../data/imdb_clean/'

In [4]:
train_deep_clean = pd.read_csv(outdir + 'train.csv')
valid_deep_clean = pd.read_csv(outdir + 'valid.csv')
test_deep_clean = pd.read_csv(outdir + 'test.csv')

In [5]:
dict_result={
    "Bag of Word + logistic regression": 0.818,
    "TF-IDF + logistic regression": 0.879
}

## Implementing LSTMs with Keras

In [6]:
# Check the TensorFlow version and that a GPU is visible (training these models is much faster on a GPU)
print("Tensorflow version: ", tf.__version__)

physical_devices = tf.config.list_physical_devices('GPU')
print(physical_devices)

Tensorflow version:  2.20.0
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


We will use Keras to implement an LSTM network. First we need to encode each review as a list of indexes, where each word is replaced by its embedding index, using the Keras `TextVectorization` layer. To make all reviews the same size, `TextVectorization` appends a special padding token (encoded as 0) as many times as necessary to each text to match the length of the longest text. Note that the LSTM layer only skips this token if the embedding layer is created with `mask_zero=True`; the models below do not set it, so the padding is embedded like any other token.
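
To make the encoding concrete, here is a small pure-Python sketch of the idea: build a frequency-ordered vocabulary, replace each word by its index, and pad with zeros either on the right or on the left. The helper names (`build_vocab`, `encode`, `pad`) and the two toy reviews are made up for illustration; this mimics the general behaviour described above, not the actual `TextVectorization` implementation.

```python
from collections import Counter

def build_vocab(texts, max_tokens):
    """Map the most frequent words to indexes, keeping 0 for padding and 1 for unknown words."""
    counts = Counter(word for text in texts for word in text.split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: index + 2 for index, word in enumerate(most_common)}

def encode(text, vocab):
    """Replace each word by its index (1 if the word is not in the vocabulary)."""
    return [vocab.get(word, 1) for word in text.split()]

def pad(seq, length, side="right"):
    """Add zeros so that the sequence reaches the requested length."""
    zeros = [0] * (length - len(seq))
    return seq + zeros if side == "right" else zeros + seq

texts = ["good movie", "bad movie really bad"]   # toy reviews
vocab = build_vocab(texts, max_tokens=10)
encoded = [encode(text, vocab) for text in texts]
longest = max(len(seq) for seq in encoded)

right_padded = [pad(seq, longest) for seq in encoded]
left_padded = [pad(seq, longest, side="left") for seq in encoded]

print(right_padded)
print(left_padded)
```

With left padding, the real words are the last thing the LSTM reads, so the final hidden state is computed right after the end of the review rather than after a run of zeros.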

In [7]:
max_vocab_size = 10000 

vectorizer = TextVectorization(max_tokens=max_vocab_size, output_mode='int',split='whitespace')
vectorizer.adapt(pd.concat([train_deep_clean,valid_deep_clean]).review)

# This encodes each review as a sequence of integers,
# each integer being the index of a word in the vocabulary.
# Zeros are added at the end so that all texts have the same length
train_seqs = vectorizer(train_deep_clean.review)
valid_seqs = vectorizer(valid_deep_clean.review)
test_seqs = vectorizer(test_deep_clean.review)

max_seq_length = train_seqs.shape[1]

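# However, due to a bug in Keras 3, we need to put the zeros at the beginning, not at the end:
def repad_left(x, maxlen):
    rows = []
    x_arr = np.array(x)
    # Append a zero column so that every row contains at least one 0
    x_arr = np.concatenate((x_arr, np.zeros((x_arr.shape[0], 1))), axis=1)
    for row in x_arr:
        # argmin gives the position of the first 0, i.e. where the padding starts
        rows.append(row[:row.argmin()])
    # pad_sequences pads at the beginning ('pre') by default
    left_padded_x = keras.utils.pad_sequences(rows, maxlen=maxlen)
    return left_padded_x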

X_train = repad_left(train_seqs, max_seq_length)
X_valid = repad_left(valid_seqs, max_seq_length)
X_test = repad_left(test_seqs, max_seq_length)

#Finally we encode the ys :
y_train = pd.get_dummies(train_deep_clean.sentiment).values[:,1]
y_valid = pd.get_dummies(valid_deep_clean.sentiment).values[:,1]
y_test = pd.get_dummies(test_deep_clean.sentiment).values[:,1]


I0000 00:00:1764674612.321167   16860 gpu_device.cc:2020] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 3888 MB memory:  -> device: 0, name: NVIDIA GeForce GTX 1660, pci bus id: 0000:61:00.0, compute capability: 7.5


Now fill the following function to implement a simple LSTM model : one embedding layer, one LSTM layer, and a final dense layer that yields a single score with a sigmoid activation function. Use Keras' Sequential API.

In [8]:
def get_lstm_model(vocab_size, embedding_dim, seq_length, lstm_out_dim):
    model = Sequential([
        Input(shape=(seq_length, )),
        Embedding(vocab_size, embedding_dim, name="embedding"),
        LSTM(lstm_out_dim, name="lstm"),
        Dense(1, activation='sigmoid', name="Dense"),
    ])

    model.compile(loss = 'binary_crossentropy', optimizer='SGD',metrics = ['accuracy'])
    return model

In [9]:
embedding_dim = 100
lstm_out_dim = 200  #Bigger than embedding dim, as it combines all the words of each review

model = get_lstm_model(max_vocab_size, embedding_dim, max_seq_length, lstm_out_dim)
print(model.summary())

None


In [10]:
batch_size = 64
max_epochs = 2
history = model.fit(
    X_train, y_train,
    epochs=max_epochs,
    batch_size=batch_size,
    verbose=1,
    validation_data = (X_valid, y_valid)
)

Epoch 1/2

2025-12-02 12:23:48.054868: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:473] Loaded cuDNN version 91600

313/313 ━━━━━━━━━━━━━━━━━━━━ 74s 225ms/step - accuracy: 0.5008 - loss: 0.6932 - val_accuracy: 0.4954 - val_loss: 0.6931
Epoch 2/2
313/313 ━━━━━━━━━━━━━━━━━━━━ 69s 222ms/step - accuracy: 0.5081 - loss: 0.6931 - val_accuracy: 0.4932 - val_loss: 0.6931


In [11]:
dict_result["LSTM with SGD"] = model.evaluate(X_test, y_test, verbose=0)[1]

In [12]:
print_results(dict_result)

Accuracy on test set are: 
81.8% : Bag of Word + logistic regression
87.9% : TF-IDF + logistic regression
50.1% : LSTM with SGD


Pretty low accuracy, isn't it? It is actually very easy to train a deep neural network incorrectly. Change the optimizer to "adam" instead of "SGD", add a dropout layer after the LSTM layer for regularization, and use early stopping:
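
Before using the Keras callback, it may help to see the logic of early stopping on its own. The sketch below is a simplified pure-Python version, assuming a "higher is better" metric and no minimum improvement threshold; the function name `early_stopping_epoch` and the list of validation scores are made up for illustration.

```python
def early_stopping_epoch(val_scores, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once the monitored score has not improved
    for `patience` consecutive epochs.
    """
    best_score = float("-inf")
    best_epoch = 0
    wait = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            # New best score: remember it and reset the counter
            best_score, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # The scores never stalled long enough: training ran to the end
    return len(val_scores), best_epoch

val_accuracy = [0.84, 0.88, 0.87, 0.87, 0.85]   # made-up validation accuracies
stop_epoch, best_epoch = early_stopping_epoch(val_accuracy, patience=2)
print(stop_epoch, best_epoch)  # → 4 2
```

Restoring the best weights then amounts to going back to the model as it was at `best_epoch`, here epoch 2, rather than keeping the weights of the last epoch.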

In [13]:
#dropout, early stopping, adam
def get_lstm_model_2(vocab_size, embedding_dim, seq_length, lstm_out_dim, dropout_rate):
    model = Sequential([
        Input(shape=(seq_length, )),
        Embedding(vocab_size, embedding_dim, name="embedding"),
        LSTM(lstm_out_dim, name="lstm"),
        Dropout(dropout_rate),
        Dense(1, activation='sigmoid', name="Dense"),
    ])

    model.compile(loss = 'binary_crossentropy', optimizer='adam',metrics = ['accuracy'])
    return model

In [14]:
embedding_dim = 100
lstm_out_dim = 200
dropout_rate = 0.2

model = get_lstm_model_2(max_vocab_size, embedding_dim, max_seq_length, lstm_out_dim, dropout_rate)
print(model.summary())

None


In [15]:
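# Note: monitor='accuracy' tracks the training accuracy; 'val_accuracy' is the usual choice for early stopping
early_stopping = EarlyStopping(monitor='accuracy', patience=2, verbose=1, restore_best_weights=True)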

In [16]:
batch_size = 32
max_epochs = 5
history = model.fit(
    X_train,
    y_train,
    epochs=max_epochs,
    batch_size=batch_size,
    verbose=1,
    validation_data = (X_valid, y_valid),
    callbacks=[early_stopping]
)

Epoch 1/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 88s 134ms/step - accuracy: 0.7628 - loss: 0.4916 - val_accuracy: 0.8420 - val_loss: 0.3758
Epoch 2/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 83s 133ms/step - accuracy: 0.8816 - loss: 0.3000 - val_accuracy: 0.8798 - val_loss: 0.3032
Epoch 3/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 83s 133ms/step - accuracy: 0.9089 - loss: 0.2371 - val_accuracy: 0.8662 - val_loss: 0.3455
Epoch 4/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 84s 134ms/step - accuracy: 0.9393 - loss: 0.1672 - val_accuracy: 0.8686 - val_loss: 0.3881
Epoch 5/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 83s 133ms/step - accuracy: 0.9367 - loss: 0.1637 - val_accuracy: 0.8490 - val_loss: 0.4260
Restoring model weights from the end of the best epoch: 4.


In [17]:
dict_result["LSTM + Dropout with Adam"] = model.evaluate(X_test, y_test, verbose=0)[1]

In [18]:
print_results(dict_result)

Accuracy on test set are: 
81.8% : Bag of Word + logistic regression
87.9% : TF-IDF + logistic regression
50.1% : LSTM with SGD
86.5% : LSTM + Dropout with Adam


Much better. If we trained for longer, we would get slightly better results than with our classic methods, but that is still quite slow for little improvement. We could also grid search over all hyperparameters (embedding and layer sizes, dropout rate, ...), but that is not the goal today. Remember, however, that grid search is standard practice when optimizing a model's predictive performance.
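
For reference, a grid search simply tries every combination of hyperparameter values and keeps the best one according to a validation score. The sketch below is a minimal pure-Python version; `fake_validation_score` is a hypothetical stand-in that returns a made-up number, whereas in practice it would build, train and evaluate a model on the validation set.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and return the best one with its score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def fake_validation_score(embedding_dim, lstm_out_dim, dropout_rate):
    # Hypothetical stand-in for "train the model and return its validation accuracy".
    # The formula is arbitrary and only serves to make the example runnable.
    return (-abs(embedding_dim - 100) / 1000
            - abs(lstm_out_dim - 200) / 1000
            - abs(dropout_rate - 0.2))

param_grid = {
    "embedding_dim": [50, 100],
    "lstm_out_dim": [100, 200],
    "dropout_rate": [0.2, 0.5],
}
best_params, best_score = grid_search(param_grid, fake_validation_score)
print(best_params)
```

The number of combinations grows multiplicatively with each added hyperparameter (here 2 x 2 x 2 = 8 trainings), which is why grid search quickly becomes expensive for deep models.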

## Predict sentiment for arbitrary sentences

Now you can try to predict the sentiment of any sentence in English; try your own. You first need to encode each review as a sequence of indexes (called tokens in Keras), pad these sequences, and finally predict the score with your trained model:

In [19]:
good = "i really liked the movie and had fun"
bad = "worst movie on the planet , so boring"
bad2 = "I really didn't like the movie, the movie was not good"

print("My model predict that:")
def repad_left2(x, maxlen):
    rows = []
    # ndmin=2 turns a single encoded review into a batch of one row
    x_arr = np.array(x, ndmin=2)
    x_arr = np.concatenate((x_arr, np.zeros((x_arr.shape[0], 1))), axis=1)
    for row in x_arr:
        rows.append(row[:row.argmin()])
    left_padded_x = keras.utils.pad_sequences(rows, maxlen=maxlen)
    return left_padded_x

for review in [good, bad, bad2]:
    x_predict = vectorizer(review)
    x_predict = repad_left2(x_predict, max_seq_length)
    y_predict = model.predict(x_predict)
    print(review)
    if y_predict[0][0] >= 0.5:
        print("is a good review")
    else:
        print("is a bad review")
    print("-"*15)

My model predict that:
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 265ms/step
i really liked the movie and had fun
is a good review
---------------
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 98ms/step
worst movie on the planet , so boring
is a bad review
---------------
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 95ms/step
I really didn't like the movie, the movie was not good
is a bad review
---------------


## Initialize embeddings with pre-trained word embeddings

Training LSTMs is a bit heavy; one way to speed it up is to reuse pre-trained word embeddings. Many such embeddings are available online. Read this to understand how word embeddings are produced and why they encode information that helps with all NLP tasks:

http://jalammar.github.io/illustrated-word2vec/

We are going to use GloVe embeddings. Download and load the embeddings trained on a corpus of 6 billion tokens (`glove.6B`) from: https://nlp.stanford.edu/projects/glove/
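
A common way to see what word embeddings encode is to compare vectors with cosine similarity: words used in similar contexts tend to have a similarity close to 1. The sketch below uses three made-up 3-dimensional vectors in a hypothetical `toy_embeddings` dictionary; once the GloVe vectors are loaded into a dictionary, the same function could be applied to real 100-dimensional vectors.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction, 0 means orthogonal."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up vectors for illustration only (real GloVe vectors have 100 dimensions here)
toy_embeddings = {
    "good": np.array([0.9, 0.1, 0.3]),
    "great": np.array([0.8, 0.2, 0.4]),
    "table": np.array([-0.2, 0.9, -0.5]),
}

sim_good_great = cosine_similarity(toy_embeddings["good"], toy_embeddings["great"])
sim_good_table = cosine_similarity(toy_embeddings["good"], toy_embeddings["table"])

print(f"good vs great: {sim_good_great:.2f}")
print(f"good vs table: {sim_good_table:.2f}")
```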

In [59]:
embeddings_index = {}
# Each line of the file holds a word followed by the coefficients of its vector
with open('../data/glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Loaded %s word vectors.' % len(embeddings_index))

Loaded 400000 word vectors.


In [60]:
word_index = {w : i for i,w in enumerate(vectorizer.get_vocabulary()) }

In [61]:
print('%s unique words in vocabulary' % len(word_index))

10000 unique words in vocabulary


Given our word index, check for each of our 10000 most frequent words whether it exists in the pretrained GloVe embeddings and, if so, copy its vector to the corresponding row of the embedding matrix. If it does not exist in the GloVe embeddings, assign a random vector:
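
The same procedure can be sketched on toy data, with the addition of a list of the words that had no pretrained vector, which is a quick way to check how much of the vocabulary is actually covered. The function name `build_embedding_matrix`, the four-word vocabulary and the 3-dimensional vectors are assumptions made for illustration.

```python
import numpy as np

def build_embedding_matrix(word_index, pretrained, embedding_dim, seed=0):
    """Fill one row per vocabulary word; return the matrix and the words without a pretrained vector."""
    rng = np.random.default_rng(seed)
    matrix = np.zeros((len(word_index), embedding_dim))
    missing = []
    for word, i in word_index.items():
        if word in pretrained:
            matrix[i] = pretrained[word]
        else:
            # No pretrained vector: fall back to a random one
            matrix[i] = rng.random(embedding_dim)
            missing.append(word)
    return matrix, missing

# Toy vocabulary: index 0 is the padding token and index 1 the unknown-word token
toy_word_index = {"": 0, "[UNK]": 1, "movie": 2, "good": 3}
toy_pretrained = {
    "movie": np.array([0.1, 0.2, 0.3]),
    "good": np.array([0.4, 0.5, 0.6]),
}

toy_matrix, toy_missing = build_embedding_matrix(toy_word_index, toy_pretrained, embedding_dim=3)
coverage = 1 - len(toy_missing) / len(toy_word_index)

print(toy_matrix.shape, toy_missing, f"coverage: {coverage:.0%}")
```

In this toy case the special padding and unknown-word tokens are the ones left without a pretrained vector, so they receive random rows.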

In [62]:
embedding_dim = 100

# Allocate the embeddings matrix
embedding_matrix = np.zeros((max_vocab_size, embedding_dim))

for word, i in word_index.items():
    if word in embeddings_index:
        embedding_matrix[i,] = embeddings_index[word]
    else:
        embedding_matrix[i,] = np.random.rand(1,embedding_dim)

In [63]:
embedding_matrix

array([[ 0.62141373,  0.91280189,  0.59813193, ...,  0.43149346,
         0.85016167,  0.09558155],
       [ 0.50589382,  0.22122686,  0.92968344, ...,  0.94874934,
         0.9432365 ,  0.89240682],
       [-0.038194  , -0.24487001,  0.72812003, ..., -0.1459    ,
         0.82779998,  0.27061999],
       ...,
       [ 0.21675999,  0.23662999,  0.72715998, ..., -0.24111   ,
         0.51990998,  0.75832999],
       [-0.066952  ,  0.32960001,  0.43399   , ...,  0.85698998,
        -0.0097463 , -0.37911999],
       [-0.16408999, -0.28184   ,  0.063544  , ...,  0.54012001,
        -0.41968   , -0.01231   ]], shape=(10000, 100))

Now change your LSTM model so that the embedding layer is initialized with the pretrained embeddings :

In [70]:
model.layers[0].get_weights()

[array([[ 0.6214137 ,  0.9128019 ,  0.59813195, ...,  0.43149346,
          0.8501617 ,  0.09558155],
        [ 0.5058938 ,  0.22122686,  0.92968345, ...,  0.94874936,
          0.94323653,  0.8924068 ],
        [-0.038194  , -0.24487   ,  0.72812   , ..., -0.1459    ,
          0.8278    ,  0.27062   ],
        ...,
        [ 0.21676   ,  0.23663   ,  0.72716   , ..., -0.24111   ,
          0.51991   ,  0.75833   ],
        [-0.066952  ,  0.3296    ,  0.43399   , ...,  0.85699   ,
         -0.0097463 , -0.37912   ],
        [-0.16409   , -0.28184   ,  0.063544  , ...,  0.54012   ,
         -0.41968   , -0.01231   ]], shape=(10000, 100), dtype=float32)]

In [65]:
def get_lstm_model_pretrained_embs(
    vocab_size, embedding_dim, seq_length, 
    lstm_out_dim, dropout_rate, embedding_matrix):
    
    model = Sequential([
        Input(shape=(seq_length, )),
        Embedding(vocab_size, embedding_dim, name="embedding", embeddings_initializer=embedding_matrix),
        LSTM(lstm_out_dim, name="lstm"),
        Dropout(dropout_rate),
        Dense(1, activation='sigmoid', name="Dense"),
    ])

    model.compile(loss = 'binary_crossentropy', optimizer='adam',metrics = ['accuracy'])
    return model

In [66]:
embedding_dim = 100
lstm_out_dim = 200
dropout_rate = 0.2

model = get_lstm_model_pretrained_embs(
    max_vocab_size, embedding_dim, max_seq_length, 
    lstm_out_dim, dropout_rate, embedding_matrix)
print(model.summary())

None


In [None]:
batch_size = 64
max_epochs = 5

early_stopping = EarlyStopping(monitor='accuracy', patience=2, verbose=1, restore_best_weights=True)

history = model.fit(
    X_train,
    y_train,
    epochs=max_epochs,
    batch_size=batch_size, 
    verbose=1,
    validation_data = (X_valid, y_valid),
    callbacks=[early_stopping]
)

Epoch 1/5
313/313 ━━━━━━━━━━━━━━━━━━━━ 71s 218ms/step - accuracy: 0.7114 - loss: 0.5529 - val_accuracy: 0.8356 - val_loss: 0.3950
Epoch 2/5
313/313 ━━━━━━━━━━━━━━━━━━━━ 68s 218ms/step - accuracy: 0.8610 - loss: 0.3345 - val_accuracy: 0.8750 - val_loss: 0.3000
Epoch 2: early stopping
Restoring model weights from the end of the best epoch: 1.


In [27]:
dict_result["LSTM with embedding pretrained"] = model.evaluate(X_test, y_test, verbose=0)[1]

In [29]:
print_results(dict_result)

Accuracy on test set are: 
81.8% : Bag of Word + logistic regression
87.9% : TF-IDF + logistic regression
50.1% : LSTM with SGD
86.5% : LSTM + Dropout with Adam
82.5% : LSTM with embedding pretrained


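Pretrained embeddings usually make the validation accuracy progress faster during the first epochs. For this small dataset it is not an issue, but it can save hours of training on bigger ones. In the run above, however, training stopped after two epochs and the weights of the first epoch were restored, so the test accuracy (82.5%) is lower than that of the previous model (86.5%). A higher final accuracy is in any case not guaranteed, especially on bigger datasets.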

For more speed-up, at the expense of accuracy, let's fix the embeddings so that they are not trainable parameters of the model, meaning they won't be updated during training :

In [71]:
def get_lstm_model_pretrained_embs(
    vocab_size, embedding_dim, seq_length, 
    lstm_out_dim, dropout_rate, embedding_matrix,
    trainable_embeddings):

    model = Sequential([
        Input(shape=(seq_length, )),
        Embedding(vocab_size, embedding_dim, name="embedding", embeddings_initializer=embedding_matrix, trainable=trainable_embeddings),
        LSTM(lstm_out_dim, name="lstm"),
        Dropout(dropout_rate),
        Dense(1, activation='sigmoid', name="Dense"),
    ])

    model.compile(loss = 'binary_crossentropy', optimizer='adam',metrics = ['accuracy'])
    return model

In [72]:
embedding_dim = 100
lstm_out_dim = 200
dropout_rate = 0.2
trainable_embeddings = False

model = get_lstm_model_pretrained_embs(max_vocab_size, embedding_dim, max_seq_length, 
                                       lstm_out_dim, dropout_rate, embedding_matrix, trainable_embeddings)
print(model.summary())

None


Notice the change in the number of trainable parameters in the summary.
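
As a rough cross-check of the summary, the parameter counts can be worked out by hand. The sketch below assumes the standard Keras LSTM layout (one input kernel, one recurrent kernel and one bias vector for each of the four gates) and the sizes used above; the helper name `lstm_param_count` is made up for illustration.

```python
def lstm_param_count(input_dim, units):
    """Parameters of an LSTM layer: 4 gates, each with an input kernel, a recurrent kernel and a bias."""
    return 4 * (input_dim * units + units * units + units)

vocab_size, embedding_dim, lstm_out_dim = 10000, 100, 200

embedding_params = vocab_size * embedding_dim                 # one vector per vocabulary word
lstm_params = lstm_param_count(embedding_dim, lstm_out_dim)
dense_params = lstm_out_dim * 1 + 1                           # weights plus bias of the final unit

total_params = embedding_params + lstm_params + dense_params
trainable_when_frozen = lstm_params + dense_params            # embeddings excluded when trainable=False

print(f"embedding: {embedding_params:,}")            # → embedding: 1,000,000
print(f"lstm: {lstm_params:,}")                      # → lstm: 240,800
print(f"dense: {dense_params:,}")                    # → dense: 201
print(f"total: {total_params:,}")                    # → total: 1,241,001
print(f"trainable if frozen: {trainable_when_frozen:,}")  # → trainable if frozen: 241,001
```

Under these assumptions, the embedding matrix accounts for roughly four fifths of all parameters, which is why freezing it changes the trainable count so much. The dropout layer has no parameters.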

In [73]:
batch_size = 64
max_epochs = 5

early_stopping = EarlyStopping(monitor='val_accuracy', patience=3, verbose=1, restore_best_weights=True)

history = model.fit(
    X_train,
    y_train,
    epochs=max_epochs,
    batch_size=batch_size, 
    verbose=1,
    validation_data = (X_valid, y_valid),
    callbacks=[early_stopping]
)

Epoch 1/5
313/313 ━━━━━━━━━━━━━━━━━━━━ 71s 221ms/step - accuracy: 0.6641 - loss: 0.6073 - val_accuracy: 0.7286 - val_loss: 0.5331
Epoch 2/5
313/313 ━━━━━━━━━━━━━━━━━━━━ 69s 222ms/step - accuracy: 0.7974 - loss: 0.4498 - val_accuracy: 0.8384 - val_loss: 0.3754
Epoch 3/5
313/313 ━━━━━━━━━━━━━━━━━━━━ 70s 224ms/step - accuracy: 0.8426 - loss: 0.3644 - val_accuracy: 0.8668 - val_loss: 0.3180
Epoch 4/5
313/313 ━━━━━━━━━━━━━━━━━━━━ 67s 214ms/step - accuracy: 0.8644 - loss: 0.3207 - val_accuracy: 0.8626 - val_loss: 0.3116
Epoch 5/5
313/313 ━━━━━━━━━━━━━━━━━━━━ 67s 214ms/step - accuracy: 0.8753 - loss: 0.2951 - val_accuracy: 0.8800 - val_loss: 0.2870
Restoring model weights from the end of the best epoch: 5.


In [74]:
dict_result["LSTM with non trainable embeddings"] = model.evaluate(X_test, y_test, verbose=0)[1]

In [75]:
print_results(dict_result)

Accuracy on test set are: 
81.8% : Bag of Word + logistic regression
87.9% : TF-IDF + logistic regression
50.1% : LSTM with SGD
86.5% : LSTM + Dropout with Adam
82.5% : LSTM with embedding pretrained
87.7% : LSTM with non trainable embeddings
76.7% : LSTM bidirectional


By freezing the word embeddings, the training time shrinks a bit, but the validation accuracy progresses more slowly and reaches a limit. Depending on the network architecture, the trade-off can be worthwhile; here not so much, but it is good to know that this is a possibility.

# Going Further

## Bidirectional and stacked LSTMs

LSTMs parse the text from left to right, but also parsing it from right to left and concatenating the two output vectors often improves the results. These are called bidirectional LSTMs. It is also possible to stack multiple LSTM layers.
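
The bidirectional idea can be sketched in NumPy with a plain tanh recurrent layer standing in for the LSTM (a deliberate simplification): one set of weights reads the sequence forward, another reads it reversed, and the two final states are concatenated. The function name `rnn_last_state`, the toy sizes and the random weights are assumptions made for illustration.

```python
import numpy as np

def rnn_last_state(sequence, W, U):
    """Run a plain tanh recurrent layer over the sequence and return its final hidden state."""
    h = np.zeros(U.shape[0])
    for x_t in sequence:
        h = np.tanh(x_t @ W + h @ U)
    return h

rng = np.random.default_rng(0)
seq_len, input_dim, units = 6, 4, 3               # toy sizes
sequence = rng.normal(size=(seq_len, input_dim))  # stands in for 6 word embeddings

# Each direction has its own weights
W_fwd, U_fwd = rng.normal(size=(input_dim, units)), rng.normal(size=(units, units))
W_bwd, U_bwd = rng.normal(size=(input_dim, units)), rng.normal(size=(units, units))

h_forward = rnn_last_state(sequence, W_fwd, U_fwd)         # reads left to right
h_backward = rnn_last_state(sequence[::-1], W_bwd, U_bwd)  # reads right to left
h_bidirectional = np.concatenate([h_forward, h_backward])

print(h_bidirectional.shape)  # → (6,)
```

The output vector is twice as large as that of a single direction, so the layer that follows a bidirectional layer receives `2 * units` features.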

This image is a good illustration of how these two variants work:

https://www.researchgate.net/figure/Illustrations-for-basic-LSTMs-and-the-three-layer-stacked-LSTM-model-for-the-sequential_fig3_313115860


First modify your network to make a bidirectional LSTM :

In [36]:
def get_bilstm_model_pretrained_embs(
        vocab_size, embedding_dim, seq_length, 
        lstm_out_dim, dropout_rate, embedding_matrix,
        trainable_embeddings):
    
    model = Sequential([
        Input(shape=(seq_length, )),
        Embedding(vocab_size, embedding_dim, name="embedding", embeddings_initializer=embedding_matrix, trainable=trainable_embeddings),
        Bidirectional(LSTM(lstm_out_dim, name="lstm")),
        Dropout(dropout_rate),
        Dense(1, activation='sigmoid', name="Dense"),
    ])

    model.compile(loss = 'binary_crossentropy', optimizer='adam',metrics = ['accuracy'])
    return model

In [None]:
embedding_dim = 100
lstm_out_dim = 200
dropout_rate = 0.2

model = get_bilstm_model_pretrained_embs(
    max_vocab_size, embedding_dim, max_seq_length, 
    lstm_out_dim, dropout_rate, embedding_matrix, True)

print(model.summary())

None


In [None]:
batch_size = 32
max_epochs = 5

early_stopping = EarlyStopping(monitor='val_accuracy', patience=3, verbose=1, restore_best_weights=True)

history = model.fit(
    X_train,
    y_train,
    epochs=max_epochs,
    batch_size=batch_size, 
    verbose=1,
    validation_data = (X_valid, y_valid),
    callbacks=[early_stopping]
)

Epoch 1/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 166s 260ms/step - accuracy: 0.7373 - loss: 0.5205 - val_accuracy: 0.7700 - val_loss: 0.4693
Epoch 2/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 163s 261ms/step - accuracy: 0.8827 - loss: 0.2862 - val_accuracy: 0.8974 - val_loss: 0.2522
Epoch 2: early stopping
Restoring model weights from the end of the best epoch: 1.


In [40]:
dict_result["LSTM bidirectional"] = model.evaluate(X_test, y_test, verbose=0)[1]

In [41]:
print_results(dict_result)

Accuracy on test set are: 
81.8% : Bag of Word + logistic regression
87.9% : TF-IDF + logistic regression
50.1% : LSTM with SGD
86.5% : LSTM + Dropout with Adam
82.5% : LSTM with embedding pretrained
75.3% : LSTM with non trainable embeddings
76.7% : LSTM bidirectional


Now try stacking multiple bidirectional LSTM layers, where the number of layers `n_layers` is a parameter of the function building the model :

In [91]:
def get_multilayer_bilstm_model_pretrained_embs(
        vocab_size, embedding_dim, seq_length, 
        lstm_out_dim, dropout_rate, embedding_matrix,
        trainable_embeddings, n_layers):
    
    model = Sequential([
        Input(shape=(seq_length, )),
        Embedding(vocab_size, embedding_dim, name="embedding", embeddings_initializer=embedding_matrix, trainable=trainable_embeddings),
    ])

    for i in range(n_layers-1):
        model.add(Bidirectional(LSTM(lstm_out_dim, return_sequences=True)))

    model.add(Bidirectional(LSTM(lstm_out_dim)))
    model.add(Dropout(dropout_rate))
    model.add(Dense(1, activation='sigmoid', name="Dense"))

    model.compile(loss = 'binary_crossentropy', optimizer='adam',metrics = ['accuracy'])
    return model

In [92]:
embedding_dim = 100
lstm_out_dim = 100
dropout_rate = 0.2
n_layers = 2

model = get_multilayer_bilstm_model_pretrained_embs(
    max_vocab_size, embedding_dim, max_seq_length, 
    lstm_out_dim, dropout_rate, embedding_matrix, True, n_layers)

print(model.summary())

None


In [93]:
batch_size = 32
max_epochs = 5
history = model.fit(
    X_train,
    y_train,
    epochs=max_epochs,
    batch_size=batch_size, 
    verbose=1,
    validation_data = (X_valid, y_valid),
    callbacks=[early_stopping]
)

Epoch 1/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 256s 403ms/step - accuracy: 0.7260 - loss: 0.5359 - val_accuracy: 0.8132 - val_loss: 0.4249
Epoch 2/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 252s 403ms/step - accuracy: 0.8666 - loss: 0.3266 - val_accuracy: 0.8858 - val_loss: 0.2885
Epoch 3/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 252s 403ms/step - accuracy: 0.9187 - loss: 0.2166 - val_accuracy: 0.9010 - val_loss: 0.2571
Epoch 4/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 252s 403ms/step - accuracy: 0.9469 - loss: 0.1539 - val_accuracy: 0.9020 - val_loss: 0.2708
Epoch 5/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 252s 403ms/step - accuracy: 0.9655 - loss: 0.1074 - val_accuracy: 0.8942 - val_loss: 0.3306
Restoring model weights from the end of the best epoch: 4.


In [94]:
dict_result["Stacked Bidirectional LSTM"] = model.evaluate(X_test, y_test, verbose=0)[1]

In [95]:
print_results(dict_result)

Accuracy on test set are: 
81.8% : Bag of Word + logistic regression
87.9% : TF-IDF + logistic regression
50.1% : LSTM with SGD
86.5% : LSTM + Dropout with Adam
82.5% : LSTM with embedding pretrained
87.7% : LSTM with non trainable embeddings
76.7% : LSTM bidirectional
88.4% : Stacked Bidirectional LSTM


As you can see, the maximum accuracy reached is not much better than that of our TF-IDF model. This happens because the full word order is actually not so important for sentiment analysis. For this task, Convolutional Neural Networks can attain comparable performance faster, as they have simpler architectures. But that is not true for other tasks such as translation or question answering (which take a bit too long to train to be included in this course, hence the choice of sentiment analysis to practice RNNs).

Let's do it with a convolutional model by using a 1D convolution with a kernel size of 3 over the word embeddings (this means that it convolves the embeddings of consecutive words three at a time), followed by a global 1D max pooling and a dense ReLU layer before the final sigmoid:
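
To see what these two operations compute, here is a minimal NumPy sketch of a 1D convolution with `'same'` padding and ReLU over a sequence of word embeddings, followed by global max pooling. The function name `conv1d_same_relu`, the toy sizes and the random weights are assumptions made for illustration; Keras uses optimized kernels but produces outputs of the same shapes.

```python
import numpy as np

def conv1d_same_relu(embeddings, kernels, bias):
    """1D convolution over word positions with 'same' padding and ReLU.

    embeddings: (seq_len, embedding_dim)
    kernels:    (kernel_size, embedding_dim, filters), kernel_size assumed odd
    bias:       (filters,)
    """
    kernel_size, _, n_filters = kernels.shape
    pad = kernel_size // 2
    # Pad with zero vectors so that the output keeps the same length as the input
    padded = np.pad(embeddings, ((pad, pad), (0, 0)))
    seq_len = embeddings.shape[0]
    out = np.empty((seq_len, n_filters))
    for t in range(seq_len):
        window = padded[t:t + kernel_size]   # embeddings of 3 consecutive words
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)

rng = np.random.default_rng(0)
seq_len, embedding_dim, n_filters = 7, 5, 4        # toy sizes
embeddings = rng.normal(size=(seq_len, embedding_dim))
kernels = rng.normal(size=(3, embedding_dim, n_filters))
bias = np.zeros(n_filters)

feature_maps = conv1d_same_relu(embeddings, kernels, bias)  # one value per position and filter
pooled = feature_maps.max(axis=0)                           # global max pooling over positions

print(feature_maps.shape, pooled.shape)  # → (7, 4) (4,)
```

Global max pooling keeps, for each filter, only its strongest response anywhere in the text, which yields a fixed-size vector regardless of the review length.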

In [118]:
def get_conv_model_pretrained_embs(
    vocab_size, embedding_dim, seq_length, 
    filters, dropout_rate, embedding_matrix,
    trainable_embeddings):
    
    model = Sequential([
        Input(shape=(seq_length, )),
        Embedding(vocab_size, embedding_dim, name="embedding", embeddings_initializer=embedding_matrix, trainable=trainable_embeddings),
        Conv1D(filters, kernel_size=3, padding='same', activation='relu'),
        GlobalMaxPool1D(),
        #Flatten(),
        Dropout(dropout_rate),
        Dense(embedding_dim, activation="relu"),
        Dense(1, activation='sigmoid'),
    ])

    model.compile(loss = 'binary_crossentropy', optimizer='adam', metrics = ['accuracy'])
    return model

In [119]:
embedding_dim = 100
filters = 100
dropout_rate = 0.2

# embedding_matrix is None here, so the embedding layer is not initialized from the GloVe vectors
model = get_conv_model_pretrained_embs(
    max_vocab_size, embedding_dim, max_seq_length, 
    filters, dropout_rate, None, True)

print(model.summary())

None


In [120]:
batch_size = 64
max_epochs = 5

early_stopping = EarlyStopping(monitor='val_accuracy', patience=3, verbose=1, restore_best_weights=True)

history = model.fit(
    X_train,
    y_train,
    epochs=max_epochs,
    batch_size=batch_size, 
    verbose=1,
    validation_data = (X_valid, y_valid),
    callbacks=[early_stopping]
)

Epoch 1/5
313/313 ━━━━━━━━━━━━━━━━━━━━ 25s 60ms/step - accuracy: 0.7897 - loss: 0.4416 - val_accuracy: 0.8788 - val_loss: 0.2864
Epoch 2/5
313/313 ━━━━━━━━━━━━━━━━━━━━ 14s 43ms/step - accuracy: 0.9060 - loss: 0.2360 - val_accuracy: 0.8974 - val_loss: 0.2504
Epoch 3/5
313/313 ━━━━━━━━━━━━━━━━━━━━ 13s 43ms/step - accuracy: 0.9552 - loss: 0.1265 - val_accuracy: 0.8908 - val_loss: 0.2889
Epoch 4/5
313/313 ━━━━━━━━━━━━━━━━━━━━ 13s 43ms/step - accuracy: 0.9787 - loss: 0.0618 - val_accuracy: 0.8894 - val_loss: 0.3278
Epoch 5/5
313/313 ━━━━━━━━━━━━━━━━━━━━ 13s 43ms/step - accuracy: 0.9908 - loss: 0.0315 - val_accuracy: 0.8794 - val_loss: 0.4280
Epoch 5: early stopping
Restoring model weights from the end of the best epoch: 2.


In [121]:
dict_result["CNN 1D"] = model.evaluate(X_test, y_test, verbose=0)[1]

In [122]:
print_results(dict_result)

Accuracy on test set are: 
81.8% : Bag of Word + logistic regression
87.9% : TF-IDF + logistic regression
50.1% : LSTM with SGD
86.5% : LSTM + Dropout with Adam
82.5% : LSTM with embedding pretrained
87.7% : LSTM with non trainable embeddings
76.7% : LSTM bidirectional
88.4% : Stacked Bidirectional LSTM
88.8% : CNN 1D


# Save dictionary of results

In [123]:
import json

# Save dict in .json
with open('../data/model_accuracy.json', 'w') as f:
    f.write(json.dumps(dict_result))

In [124]:
with open("../data/model_accuracy.json") as f:
    model_accuracy = json.load(f)

In [125]:
print_results(model_accuracy)

Accuracy on test set are: 
81.8% : Bag of Word + logistic regression
87.9% : TF-IDF + logistic regression
50.1% : LSTM with SGD
86.5% : LSTM + Dropout with Adam
82.5% : LSTM with embedding pretrained
87.7% : LSTM with non trainable embeddings
76.7% : LSTM bidirectional
88.4% : Stacked Bidirectional LSTM
88.8% : CNN 1D


# Going even further

The following parts are resources to explore if you are interested in the advanced concept of attention in deep networks. There are links to explanations, as well as links to code for each of them, but don't feel obliged to implement all of them: they are meant to help you understand each of these concepts.

## Attention

Attention is a mechanism that changes the output of an LSTM: instead of outputting the final hidden state vector $h_n$, where $n$ is the length of the encoded text, attention plugs on top of an LSTM and returns a combination of the hidden state vectors at all word positions, $\sum_{t=1}^n \alpha_t h_t$ (where $\alpha_t \in (0,1)$). It thus allows the model to pay a different amount of attention to each part of the text, hence the name.

It was originally proposed for sequence-to-sequence models, such as translation models, where a different attention combination is computed for each translated output word. It is thus less useful for text classification, but it can be adapted by computing a single output combination of all the hidden states, as explained in Section 3.3 of the following article:

https://www.aclweb.org/anthology/P16-2034.pdf
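
As a minimal sketch of this single-combination variant, the NumPy code below scores each hidden state with a learned vector `w`, turns the scores into weights $\alpha_t$ with a softmax, and returns the weighted sum $\sum_{t=1}^n \alpha_t h_t$. The scoring function (a dot product between `w` and `tanh` of each hidden state) is one common choice and should be read as an assumption, as should the function name `attention_pool`, the toy sizes and the random values.

```python
import numpy as np

def attention_pool(H, w):
    """Combine all hidden states into one vector.

    H: (n, d) hidden states, one row per word position
    w: (d,) scoring vector (learned during training in a real model)
    """
    scores = np.tanh(H) @ w                 # one score per position
    scores = scores - scores.max()          # for numerical stability of the softmax
    alpha = np.exp(scores) / np.exp(scores).sum()
    return alpha @ H, alpha                 # weighted sum of the rows of H, and the weights

rng = np.random.default_rng(0)
n, d = 6, 4                                 # toy sizes: 6 word positions, 4 hidden units
H = rng.normal(size=(n, d))                 # stands in for the LSTM outputs at every position
w = rng.normal(size=d)

context, alpha = attention_pool(H, w)
print(alpha.round(3), alpha.sum())
print(context.shape)
```

In Keras, feeding such a layer requires the LSTM to return its hidden state at every position (`return_sequences=True`) instead of only the last one.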

Here is a link about how to apply attention for text classification with Keras:

https://www.kaggle.com/yshubham/simple-lstm-for-text-classification-with-attention


You can also read the following link to understand how attention works in sequence-to-sequence models, which are nothing more than a reversed LSTM (the decoder) on top of a first LSTM (the encoder), in this case for translation, where it helps align words in two different languages:

https://medium.com/towards-data-science/day-1-2-attention-seq2seq-models-65df3f49e263

## Transformer architecture for text classification

State-of-the-art models in NLP are no longer RNNs but Transformers. Transformers do not read text sequentially like RNNs. Their core concept is self-attention, an attention mechanism that separately combines each word embedding with the other word embeddings of the text. A layer contains multiple such attention mechanisms, called "attention heads", and multiple such layers are stacked.
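
A single attention head can be sketched in a few lines of NumPy using scaled dot-product attention: every word embedding is projected into a query, a key and a value, and each output row is a weighted average of all values. The function names, the toy sizes and the random projection matrices are assumptions made for illustration; a real Transformer layer adds several heads, residual connections, normalization and a feed-forward block.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X: (n, d_model) word embeddings; Wq, Wk: (d_model, d_k); Wv: (d_model, d_v)
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # weights[i, j] tells how much word i attends to word j
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_model, d_k, d_v = 5, 8, 4, 6             # toy sizes
X = rng.normal(size=(n, d_model))
Wq = rng.normal(size=(d_model, d_k))
Wk = rng.normal(size=(d_model, d_k))
Wv = rng.normal(size=(d_model, d_v))

output, weights = self_attention(X, Wq, Wk, Wv)
print(output.shape, weights.shape)  # → (5, 6) (5, 5)
```

Unlike the recurrent loop of an RNN, all positions are processed in one matrix product, which is what allows Transformers to be trained in parallel over the whole text.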

Read this article to understand the self-attention layer:

https://medium.com/towards-data-science/illustrated-self-attention-2d627e33b20a

This article explains very well the Transformer for sequence to sequence models (again remember that a text classification model is just the encoder part of a sequence to sequence model) :

http://jalammar.github.io/illustrated-transformer/

Keras code to do text classification with a Transformer :

https://keras.io/examples/nlp/text_classification_with_transformer/



## BERT

Current state-of-the-art performance for text classification is achieved by transfer learning from the BERT model. The BERT model combines different techniques, including the Transformer, to pretrain in an unsupervised fashion on plain text. The last layers of BERT provide a high-level contextual representation of English sentences, and can then be reused in any deep NLP model. The HuggingFace website ( https://huggingface.co/ ) hosts many pretrained deep learning models that can be reused and fine-tuned for your application.

The BERT model : http://jalammar.github.io/illustrated-bert/

Reusing a pretrained BERT model from Keras-Hub for text classification : https://keras.io/keras_hub/api/models/bert/bert_text_classifier/#berttextclassifier-class

## BERT in pytorch with skorch

If you want to familiarize yourself with PyTorch, one easy way when you are used to scikit-learn and Keras is to use the `skorch` library. `skorch` implements a high-level interface to PyTorch with the usual `fit` and `predict` functions: https://skorch.readthedocs.io/en/stable/user/quickstart.html

You can first try to reimplement a simple LSTM with pytorch. Then you can try to redo the prediction with a BERT model from huggingface : https://nbviewer.org/github/skorch-dev/skorch/blob/master/notebooks/Hugging_Face_Finetuning.ipynb

## Querying OpenAI (ChatGPT, GPT, ...) models 

The largest language models available nowadays, like the latest versions of GPT and ChatGPT, do not fit on a typical computer, but we can query them through the OpenAI API. However, this is not free:
https://openai.com/pricing

If you are willing to spend some money on this, we can try two different ways. In both cases we can use the `langchain` package to interact with the OpenAI API through Python code.

The first option is to get embeddings for each text from the Ada2 model, and then build a classifier from the embeddings as features :
https://shishirsingh66g.medium.com/langchain-applications-part-3-embedding-models-75e6a0d01545

Another one is simply to ask ChatGPT whether it thinks each review of the test set is positive or negative, and compute its accuracy. Look at the `langchain` library to do so : https://medium.com/@dmitri.mahayana/chatgpt-template-using-python-langchain-c201de474122