# RNNs - Sequence classification

In this notebook you will learn how to build Recurrent Neural Networks (RNNs) for sequence classification.

**Objective: build RNN models for sequence classification.**
- 1 - Sequence classification: binary sentiment analysis on IMDB movie reviews (positive or negative review)
    - Train a baseline model using scikit-learn pipelines
    - Create a sequence classifier using an LSTM model
- 2 - Bidirectional RNN

## Imports

In [1]:
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import sklearn
import sys
import tensorflow as tf
from tensorflow import keras
import time

## ◢ 1 Sequence classification

Let's load the IMDB movie reviews dataset for binary sentiment analysis (positive or negative review).

Use the Keras Datasets API: https://keras.io/datasets/


We only want the 10,000 most common words:

In [2]:
num_words = 10000
(X_train, y_train), (X_test, y_test) = keras.datasets.imdb.load_data(num_words=num_words)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


In [3]:
X_train[:1]

array([list([1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32])],
      dtype=object)

Let's also get the word index (word to word id):

In [4]:
word_index = keras.datasets.imdb.get_word_index()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json


In [5]:
word_index["awful"]

370

And let's create a reverse index (word id to word). Word ids are shifted by 3 because the lowest ids are reserved for special tokens:

In [6]:
reverse_index = {word_id + 3: word for word, word_id in word_index.items()}
reverse_index[0] = "<pad>" # padding
reverse_index[1] = "<sos>" # start of sequence
reverse_index[2] = "<oov>" # out-of-vocabulary
reverse_index[3] = "<unk>" # unused with the default load_data() settings

Let's write a little function to decode reviews:

In [7]:
def decode_review(word_ids):
    return " ".join([reverse_index.get(word_id, "<err>") for word_id in word_ids])

Let's look at a review:

In [8]:
decode_review(X_train[0])

"<sos> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <oov> is an amazing actor and now the same being director <oov> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <oov> and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <oov> to the two little boy's that played the <oov> of norman and paul they were just brilliant children are often left out of the <oov> list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what

It seems very positive. Let's look at the target (0 = negative review, 1 = positive review):

In [9]:
y_train[0]

1

And another review:

In [10]:
decode_review(X_train[1])

"<sos> big hair big boobs bad music and a giant safety pin these are the words to best describe this terrible movie i love cheesy horror movies and i've seen hundreds but this had got to be on of the worst ever made the plot is paper thin and ridiculous the acting is an abomination the script is completely laughable the best is the end showdown with the cop and how he worked out who the killer is it's just so damn terribly written the clothes are sickening and funny in equal <oov> the hair is big lots of boobs <oov> men wear those cut <oov> shirts that show off their <oov> sickening that men actually wore them and the music is just <oov> trash that plays over and over again in almost every scene there is trashy music boobs and <oov> taking away bodies and the gym still doesn't close for <oov> all joking aside this is a truly bad film whose only charm is to look back on the disaster that was the 80's and have a good old laugh at how bad everything was back then"

Very negative! Let's check the target:

In [11]:
y_train[1]

0
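Going the other way (text to word ids) can be sketched as follows. This is a minimal illustration, not part of the Keras API: `encode_review` and `toy_word_index` are hypothetical names, and the toy index only stands in for the real `word_index`. It assumes the default `load_data()` conventions used above (1 = start of sequence, 2 = out-of-vocabulary, word ids shifted by 3).

```python
# Toy stand-in for keras.datasets.imdb.get_word_index() (illustrative values)
toy_word_index = {"the": 1, "and": 2, "film": 19, "awful": 370}

SOS_ID, OOV_ID = 1, 2   # reserved ids, as in the reverse index above
INDEX_FROM = 3          # real words start at word_index[word] + 3


def encode_review(text, word_index, num_words=10000):
    """Hypothetical helper: turn a lowercase, space-separated review into ids."""
    ids = [SOS_ID]
    for word in text.lower().split():
        rank = word_index.get(word)
        # Unknown words and words outside the num_words vocabulary become <oov>
        if rank is None or rank + INDEX_FROM >= num_words:
            ids.append(OOV_ID)
        else:
            ids.append(rank + INDEX_FROM)
    return ids


encoded = encode_review("The film and the zzz", toy_word_index)
print(encoded)  # [1, 4, 22, 5, 4, 2]
```

With the real `word_index`, the result of such a helper could be passed through `decode_review` to check the round trip.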

### 1-1 Train a baseline model

Train and evaluate a baseline model using scikit-learn.

You will need to create a pipeline with:
- a `CountVectorizer` (this transformer expects text as input)
- a `TfidfTransformer`
- and an `SGDClassifier`.

So let's create a text version of the training set and test set:

In [12]:
X_train_text = [decode_review(words_ids) for words_ids in X_train]
X_test_text = [decode_review(words_ids) for words_ids in X_test]

In [13]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier

In [14]:
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(max_iter=50)),
])

In [15]:
pipeline.fit(X_train_text, y_train)

Pipeline(steps=[('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
                ('clf', SGDClassifier(max_iter=50))])

In [16]:
pipeline["vect"].get_feature_names_out()

array(['00', '000', '10', ..., 'zoom', 'zorro', 'zu'], dtype=object)

In [17]:
pipeline["tfidf"].n_features_in_

9773

In [18]:
pipeline["tfidf"].idf_

array([6.74464447, 5.73304356, 3.00311327, ..., 7.6926839 , 8.03562865,
       8.18223212])
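The `idf_` values above are larger for rarer words. As a rough illustration, the sketch below recomputes TF-IDF in pure Python on a made-up three-document corpus. It assumes the smoothed formula that `TfidfTransformer` uses by default, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row; the corpus and helper names are hypothetical.

```python
import math
from collections import Counter

docs = ["good film", "bad film", "good good plot"]   # toy corpus (illustrative)
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for doc in tokenized for word in doc})

n_docs = len(tokenized)
# Document frequency: number of documents containing each word
df = Counter(word for doc in tokenized for word in set(doc))
# Smoothed inverse document frequency (assumed default of TfidfTransformer)
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocab}


def tfidf_row(tokens):
    """Term counts weighted by idf, then scaled to unit L2 norm."""
    counts = Counter(tokens)
    raw = [counts[word] * idf[word] for word in vocab]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0
    return [v / norm for v in raw]


rows = [tfidf_row(doc) for doc in tokenized]
for word in vocab:
    print(word, round(idf[word], 4))
```

Words that appear in fewer documents ("bad", "plot") get a higher weight than words that appear in most of them ("film", "good").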

In [19]:
pipeline.score(X_test_text, y_test)

0.88496

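We get 88.5% accuracy, which is not bad. To put this number in context, check the ratio of positive reviews (a ratio of 0.5 means that always predicting the same class would reach only 50% accuracy):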

In [20]:
y_test.mean()

0.5
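The label mean matters because it sets the score of a trivial classifier. A minimal sketch of that majority-class baseline, using made-up label lists (the function name is hypothetical):

```python
def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class (0 or 1)."""
    positives = sum(labels)
    majority = max(positives, len(labels) - positives)
    return majority / len(labels)


balanced = [0, 1] * 5      # same class ratio as the IMDB test set
skewed = [1] * 9 + [0]     # illustrative imbalanced labels

print(majority_baseline_accuracy(balanced))  # 0.5
print(majority_baseline_accuracy(skewed))    # 0.9
```

On a balanced set, any accuracy well above 0.5 reflects real signal; on the skewed set, 0.9 would be reachable without learning anything.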

Let's try our model:

In [22]:
res = pipeline.predict(["this movie was really awesome"])
res[0]

0

### 1-2 Create a sequence classifier

Create a sequence classifier using Keras:
* Use `keras.preprocessing.sequence.pad_sequences()` to preprocess `X_train`: this will create a 2D array of 25,000 rows (one per review) and `maxlen=500` columns. Reviews longer than 500 words will be cropped, while reviews shorter than 500 words will be padded with zeros.
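The cropping and padding described above can be sketched in pure Python. This is only an illustration of the behavior, assuming the `pad_sequences()` defaults (`padding='pre'` and `truncating='pre'`, so zeros are added at the start and long sequences keep their last `maxlen` ids); `pad_or_truncate` is a hypothetical helper, not the Keras implementation.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Mimic 'pre' padding and 'pre' truncation for one sequence (maxlen > 0)."""
    seq = list(seq)[-maxlen:]                      # keep the last maxlen ids
    return [value] * (maxlen - len(seq)) + seq     # left-pad with zeros


short = pad_or_truncate([1, 14, 22], maxlen=5)
long = pad_or_truncate([1, 14, 22, 16, 43, 530], maxlen=5)

print(short)  # [0, 0, 1, 14, 22]
print(long)   # [14, 22, 16, 43, 530]
```

Padding at the start keeps the actual words next to the end of the sequence, which is where the LSTM produces the state used for classification.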


In [23]:
maxlen = 500
X_train_trim = keras.preprocessing.sequence.pad_sequences(X_train, maxlen=maxlen)
X_test_trim = keras.preprocessing.sequence.pad_sequences(X_test, maxlen=maxlen)

In [26]:
X_train_trim.shape

(25000, 500)

In [27]:
y_train.shape

(25000,)

In [28]:
num_words

10000

* The first layer in your model should be an `Embedding` layer, with `input_dim=num_words` and `output_dim=10`. The model will gradually learn to represent each of the 10,000 words as a 10-dimensional vector. So the next layer will receive 3D batches of shape (batch size, 500, 10).
* Add one or more LSTM layers with 32 neurons each.
* The output layer should be a Dense layer with a sigmoid activation function, since this is a binary classification problem.

In [29]:
model = keras.models.Sequential()
model.add(keras.layers.Embedding(input_dim=num_words, output_dim=10))
model.add(keras.layers.LSTM(32))
model.add(keras.layers.Dense(1, activation="sigmoid"))

* When compiling the model, you should use the `binary_crossentropy` loss.
* Use `rmsprop` as optimizer.
* Fit the model for 10 epochs, using a batch size of 128 and `validation_split=0.2`.
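To make these settings concrete, here is a small pure-Python sketch of what they imply: the binary cross-entropy for a single prediction, and the sizes that result from `validation_split=0.2` and a batch size of 128 on 25,000 reviews. The helper name is hypothetical, and the split arithmetic assumes Keras holds out the last fraction of the samples for validation.

```python
import math


def binary_crossentropy(y_true, p_pred, eps=1e-7):
    """Loss for one example: -(y*log(p) + (1-y)*log(1-p)), with clipping."""
    p = min(max(p_pred, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


# A confident correct prediction costs little, a confident wrong one costs a lot
loss_good = binary_crossentropy(1, 0.9)
loss_bad = binary_crossentropy(1, 0.1)
print(loss_good < loss_bad)  # True

# Sizes implied by the fit() arguments above
n_samples, validation_split, batch_size = 25000, 0.2, 128
n_train = math.floor(n_samples * (1 - validation_split))
n_val = n_samples - n_train
steps_per_epoch = math.ceil(n_train / batch_size)
print(n_train, n_val, steps_per_epoch)  # 20000 5000 157
```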

In [30]:
model.compile(loss="binary_crossentropy", 
              optimizer="rmsprop", 
              metrics=["accuracy"])

In [31]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 10)          100000    
                                                                 
 lstm (LSTM)                 (None, 32)                5504      
                                                                 
 dense (Dense)               (None, 1)                 33        
                                                                 
Total params: 105,537
Trainable params: 105,537
Non-trainable params: 0
_________________________________________________________________
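The parameter counts in this summary can be checked by hand. The sketch below uses the standard formulas (an LSTM has four gates, each with input weights, recurrent weights and a bias); the helper names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per word id
    return vocab_size * dim


def lstm_params(input_dim, units):
    # 4 gates, each with (input_dim + units) * units weights plus units biases
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units


counts = [embedding_params(10000, 10), lstm_params(10, 32), dense_params(32, 1)]
print(counts, sum(counts))  # [100000, 5504, 33] 105537
```

Almost all of the parameters sit in the embedding table, which is typical for small text models with a large vocabulary.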


In [32]:
history = model.fit(X_train_trim, 
                    y_train,
                    epochs=10, 
                    batch_size=128, 
                    validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [33]:
model.evaluate(X_test_trim, y_test)



[0.35756465792655945, 0.8611999750137329]

## ◢ 2 Bidirectional RNN

Update the previous sequence classification model to use a bidirectional LSTM. For this, you just need to wrap the LSTM layer in a `Bidirectional` layer. If the model overfits, try adding a dropout layer.

Bidirectional LSTMs are an extension of traditional LSTMs that can improve model performance on sequence classification problems.

When all timesteps of the input sequence are available, a bidirectional LSTM trains two LSTMs instead of one: the first on the input sequence as-is and the second on a reversed copy of it. This gives the network context from both directions and can lead to faster and more complete learning.
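The idea can be illustrated with a toy recurrent cell in pure Python. This is a conceptual sketch only: the "cell" is a simple decayed sum rather than an LSTM, it has no trainable weights, and the function names are hypothetical. What it shares with `Bidirectional` is the structure: one pass over the sequence, one pass over its reverse, and the two final states concatenated.

```python
def run_rnn(inputs, decay=0.5):
    """Toy recurrent cell: state = decay * state + x. Returns the final state."""
    state = 0.0
    for x in inputs:
        state = decay * state + x
    return state


def run_bidirectional(inputs, decay=0.5):
    """Concatenate the final states of a forward pass and a backward pass."""
    forward = run_rnn(inputs, decay)
    backward = run_rnn(list(reversed(inputs)), decay)
    return [forward, backward]


sequence = [1.0, 0.0, 0.0, 4.0]
output = run_bidirectional(sequence)
print(output)  # [4.125, 1.5]
```

The forward state is dominated by the end of the sequence and the backward state by its beginning, so the concatenated output is twice as wide and sees both ends. This is why wrapping an `LSTM(32)` produces 64 output features.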

In [34]:
model = keras.models.Sequential()
model.add(keras.layers.Embedding(input_dim=num_words, output_dim=10))
model.add(keras.layers.Dropout(0.5))
model.add(keras.layers.Bidirectional(keras.layers.LSTM(32)))
model.add(keras.layers.Dense(1, activation="sigmoid"))

In [35]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, None, 10)          100000    
                                                                 
 dropout (Dropout)           (None, None, 10)          0         
                                                                 
 bidirectional (Bidirectiona  (None, 64)               11008     
 l)                                                              
                                                                 
 dense_1 (Dense)             (None, 1)                 65        
                                                                 
Total params: 111,073
Trainable params: 111,073
Non-trainable params: 0
_________________________________________________________________


In [36]:
model.compile(loss="binary_crossentropy", optimizer="rmsprop", metrics=["accuracy"])

In [37]:
history = model.fit(X_train_trim, y_train,
                    epochs=30, batch_size=128, validation_split=0.2)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


In [38]:
model.evaluate(X_test_trim, y_test)



[0.4042607545852661, 0.8563200235366821]

## ◢ 3 Use pretrained embeddings

In [None]:
# https://www.tensorflow.org/tutorials/keras/text_classification_with_hub

import tensorflow_hub as hub

embedding = "https://tfhub.dev/google/nnlm-en-dim50/2"
# The hub layer maps each raw string to a single 50-dimensional vector
hub_layer = hub.KerasLayer(embedding, input_shape=[], 
                           dtype=tf.string, trainable=True)

# The hub layer expects raw strings, so feed the decoded reviews directly
# (pad_sequences only applies to sequences of integer ids)
X_train_text_tensor = tf.constant(X_train_text)
X_test_text_tensor = tf.constant(X_test_text)

X_train_text_tensor.shape

# The hub layer outputs one vector per review (no time dimension),
# so a dense layer is used on top of it instead of an LSTM
model = keras.models.Sequential()
model.add(hub_layer)
model.add(keras.layers.Dropout(0.5))
model.add(keras.layers.Dense(16, activation="relu"))
model.add(keras.layers.Dense(1, activation="sigmoid"))

model.compile(loss="binary_crossentropy", optimizer="rmsprop", metrics=["accuracy"])

history = model.fit(X_train_text_tensor, y_train,
                    epochs=30, batch_size=128, validation_split=0.2)




## ◢ 4 Retrieve the trained word embeddings and save them to disk

Next, retrieve the word embeddings learned during training. The embeddings are the weights of the `Embedding` layer in the model. The weight matrix has shape (vocab_size, embedding_dimension).

Obtain the weights from the model using `get_layer()` and `get_weights()`. This requires a model that contains an `Embedding` layer (the models from sections 1 and 2, not the TensorFlow Hub model); the layer name is the one shown by `model.summary()`.

Row `i` of the weight matrix is the vector of word id `i`, so the `reverse_index` built earlier provides the vocabulary for a metadata file with one token per line.

In [None]:
weights = model.get_layer('embedding').get_weights()[0]
# Order the tokens by word id so that vocab[i] matches weights[i]
vocab = [reverse_index.get(word_id, "<err>") for word_id in range(weights.shape[0])]

Write the weights to disk.

To use the [Embedding Projector](http://projector.tensorflow.org/), upload two files in tab-separated format: a file of vectors (containing the embeddings) and a file of metadata (containing the words).


In [None]:
import io

out_v = io.open('vectors.tsv', 'w', encoding='utf-8')
out_m = io.open('metadata.tsv', 'w', encoding='utf-8')

for index, word in enumerate(vocab):
  if index == 0:
    continue  # skip 0, it's padding.
  vec = weights[index]
  out_v.write('\t'.join([str(x) for x in vec]) + "\n")
  out_m.write(word + "\n")
out_v.close()
out_m.close()
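The layout of the two files can be checked on a tiny example before uploading anything. The sketch below writes made-up vectors and tokens to in-memory buffers (`io.StringIO`) instead of real files; all values and names are illustrative.

```python
import io

toy_weights = [[0.0, 0.0], [0.1, -0.2], [0.3, 0.4]]   # one row per word id
toy_tokens = ["<pad>", "good", "bad"]                 # token for each row

vectors, metadata = io.StringIO(), io.StringIO()
for index, token in enumerate(toy_tokens):
    if index == 0:
        continue  # skip the padding row, as in the cell above
    vectors.write("\t".join(str(x) for x in toy_weights[index]) + "\n")
    metadata.write(token + "\n")

vector_lines = vectors.getvalue().splitlines()
metadata_lines = metadata.getvalue().splitlines()

# Line k of the vectors file must describe the token on line k of the metadata file
print(list(zip(metadata_lines, vector_lines)))
```

The two files need the same number of lines, in the same order, for the projector labels to match the points.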
