# Text classification with an RNN

This text classification tutorial trains several model architectures for sentiment analysis on the [IMDB large movie review dataset](http://ai.stanford.edu/~amaas/data/sentiment/), then compares them to pick the best one for the task:

1. Model with a simple neural network (no RNN layers)
2. Model with the basic RNN layer offered by TensorFlow
3. Model with GRUs
4. Model with LSTMs
5. Model with LSTMs + Bidirectional layers

## Setup

In [None]:
import numpy as np

import tensorflow_datasets as tfds
import tensorflow as tf

tfds.disable_progress_bar()

## Setup input pipeline


The IMDB large movie review dataset is a *binary classification* dataset—all the reviews have either a *positive* or *negative* sentiment, represented by the labels 1 and 0.

For convenience, we load this dataset from the TFDS library, which is preinstalled in this notebook. The same steps work with any other labeled text data.


In [None]:
dataset, info = tfds.load('imdb_reviews', with_info=True,
                          as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']

train_dataset.element_spec

Downloading and preparing dataset 80.23 MiB (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...
Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


(TensorSpec(shape=(), dtype=tf.string, name=None),
 TensorSpec(shape=(), dtype=tf.int64, name=None))

Initially this returns a dataset of `(text, label)` pairs:

In [None]:
for example, label in train_dataset.take(1):
  print('text: ', example.numpy())
  print('label: ', label.numpy())

text:  b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."
label:  0


Next shuffle the data for training and create batches of these `(text, label)` pairs:

In [None]:
BUFFER_SIZE = 10000
BATCH_SIZE = 64
history_list = []

In [None]:
train_dataset = train_dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
test_dataset = test_dataset.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
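
To make the `shuffle` and `batch` steps less opaque, here is a minimal pure-Python sketch of the idea behind them, assuming a buffer-based shuffle: the buffer is filled first, then each incoming element replaces a randomly chosen buffered one. The helper names `buffered_shuffle` and `batched` are ours for illustration and are not part of `tf.data`.

```python
import random


def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill the buffer before emitting anything
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a random buffered element
        buffer[idx] = item       # and put the new element in its place
    rng.shuffle(buffer)          # input exhausted: drain what is left
    yield from buffer


def batched(items, batch_size):
    """Group consecutive items into lists of at most batch_size elements."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch              # the last, possibly smaller, batch is kept


batches = list(batched(buffered_shuffle(range(10), buffer_size=4), batch_size=3))
print(batches)
```

Because the buffer only ever holds `buffer_size` elements, the shuffle is approximate whenever the buffer is smaller than the dataset.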

In [None]:
for example, label in train_dataset.take(2):
  print('texts: ', example.numpy()[:3])
  print()
  print('labels: ', label.numpy()[:3])

texts:  [b'REALLY??? <br /><br />I am truly amazed to see the glowing reviews here! <br /><br />This is one of the worst movies I have ever seen. It is one big pathetic, grainy, clich\xc3\xa9. I would have laughed out loud, and a lot, but was on a date with an ex-military guy. I could not hide my other response, BOREDOM. Yes, I think my date, a flat-line "good old boy", liked it. That\'s not a compliment. I know an actor wants to work.... Fine for the others. But Ralph, come on.<br /><br />It was a painful tease from Ralph. I vote a 2 only because Ralph looked SO STUNNING. But I must plead, Ralph, how could you? And, why?? <br /><br />I\'m going to go watch The End of The Affair to heal and recover now.... C1'
 b'I guess this is meant to be a sort of reworking or updating of "Beauty and the Beast", but I can\'t say I\'ve ever watched a movie that began with several minutes of graphic horse sex. Wow. Anyway it seems that a young woman and her..aunt? Have traveled to this castle in Franc

## Create the text encoder
As you saw in the previous workshop, text encoding is one of the text preprocessing steps that run before the data reaches the input layer of our model (or becomes training data for classical ML models). Here we'll use TensorFlow's `TextVectorization` layer, which replaces words (or tokens) with integer indices taken from a vocabulary capped at `VOCAB_SIZE` entries.

In [None]:
VOCAB_SIZE = 1000
encoder = tf.keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE)
encoder.adapt(train_dataset.map(lambda text, label: text))

The `.adapt` method sets the layer's vocabulary. Here are the first 20 tokens. After the padding and unknown tokens they're sorted by frequency:

In [None]:
vocab = np.array(encoder.get_vocabulary())
vocab[:20]

array(['', '[UNK]', 'the', 'and', 'a', 'of', 'to', 'is', 'in', 'it', 'i',
       'this', 'that', 'br', 'was', 'as', 'for', 'with', 'movie', 'but'],
      dtype='<U14')
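
As a rough illustration of what a frequency-sorted vocabulary looks like, the sketch below builds one by hand with `collections.Counter`. It assumes whitespace tokenization and reserves index 0 for padding and index 1 for unknown tokens, mirroring the output above; `build_vocabulary` and the toy reviews are hypothetical and not part of the layer's API.

```python
from collections import Counter


def build_vocabulary(texts, max_tokens):
    """Return a vocabulary list: padding, unknown, then tokens by frequency."""
    counts = Counter(token for text in texts for token in text.lower().split())
    # Two slots are reserved, so only max_tokens - 2 real tokens fit.
    most_common = [token for token, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common


toy_reviews = ["the movie was great", "the movie was bad", "the plot was thin"]
toy_vocab = build_vocabulary(toy_reviews, max_tokens=5)
print(toy_vocab)
```

With `max_tokens=5` only the three most frequent words survive; every other word will later map to `[UNK]`.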

Once the vocabulary is set, the layer can encode text into indices. The tensors of indices are 0-padded to the longest sequence in the batch (unless you set a fixed `output_sequence_length`):

In [None]:
encoded_example = encoder(example)[:3].numpy()
encoded_example

array([[ 63,  13,  13, ...,   0,   0,   0],
       [ 10, 479,  11, ...,   0,   0,   0],
       [ 10, 153, 118, ...,   0,   0,   0]])
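
The padding behaviour can be reproduced by hand. The following sketch assumes a tiny, made-up vocabulary and a hypothetical helper `encode_batch`; it maps tokens to indices (unknown words go to index 1) and right-pads every row with zeros up to the longest sequence in the batch.

```python
import numpy as np


def encode_batch(texts, vocab):
    """Encode texts as integer rows, 0-padded to the longest row."""
    index = {token: i for i, token in enumerate(vocab)}
    rows = [[index.get(token, 1) for token in text.lower().split()]
            for text in texts]
    width = max(len(row) for row in rows)
    encoded = np.zeros((len(rows), width), dtype=np.int64)
    for i, row in enumerate(rows):
        encoded[i, :len(row)] = row  # the remaining positions stay 0 (padding)
    return encoded


toy_vocab = ["", "[UNK]", "the", "movie", "was"]  # illustrative only
encoded = encode_batch(["the movie was great", "the movie"], toy_vocab)
print(encoded)
```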

With the default settings, the process is not completely reversible. There are two main reasons for that:

1. The default value for the `standardize` argument of `tf.keras.layers.TextVectorization` is `"lower_and_strip_punctuation"`.
2. The limited vocabulary size and the lack of a character-based fallback result in some unknown tokens.
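
Both effects can be seen in a small standalone sketch. It assumes that the punctuation removed by the layer matches Python's `string.punctuation`, and it uses a hypothetical six-entry vocabulary; `standardize` is our own helper, not the layer's implementation.

```python
import string


def standardize(text):
    """Lowercase the text and drop ASCII punctuation characters."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


toy_vocab = ["", "[UNK]", "really", "br", "i", "am"]  # illustrative only
tokens = standardize("REALLY??? <br /><br />I am truly amazed").split()
round_trip = " ".join(tok if tok in toy_vocab else "[UNK]" for tok in tokens)
print(tokens)
print(round_trip)
```

Case and punctuation are lost in the first step, and words outside the vocabulary collapse to `[UNK]` in the second, so the original string cannot be recovered.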

In [None]:
for n in range(3):
  print("Original: ", example[n].numpy())
  print("Round-trip: ", " ".join(vocab[encoded_example[n]]))
  print()

Original:  b'REALLY??? <br /><br />I am truly amazed to see the glowing reviews here! <br /><br />This is one of the worst movies I have ever seen. It is one big pathetic, grainy, clich\xc3\xa9. I would have laughed out loud, and a lot, but was on a date with an ex-military guy. I could not hide my other response, BOREDOM. Yes, I think my date, a flat-line "good old boy", liked it. That\'s not a compliment. I know an actor wants to work.... Fine for the others. But Ralph, come on.<br /><br />It was a painful tease from Ralph. I vote a 2 only because Ralph looked SO STUNNING. But I must plead, Ralph, how could you? And, why?? <br /><br />I\'m going to go watch The End of The Affair to heal and recover now.... C1'
Round-trip:  really br br i am truly [UNK] to see the [UNK] reviews here br br this is one of the worst movies i have ever seen it is one big [UNK] [UNK] [UNK] i would have [UNK] out [UNK] and a lot but was on a [UNK] with an [UNK] guy i could not [UNK] my other [UNK] [UNK] yes

## Create the models
Each model has a defined architecture and is compiled, trained, and evaluated in turn. To conclude the notebook, a plot shows how the accuracy of each model differs and evolves over the epochs.
### Model 1: Simple ANN

### Architecture
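After the encoder, the model stacks the following layers:

1. `Embedding` is the next step after word encoding: it maps each token index in the vocabulary to a trainable `n`-dimensional vector (here we use 64 dimensions).
2. `GlobalAveragePooling1D` averages the embedding vectors over the sequence dimension, so each review becomes a single fixed-length vector that the dense layers can consume.
3. `Dense` with `activation='relu'` is a fully connected hidden layer. ReLU is a common default because it is cheap to compute and usually trains well.
4. `Dropout` randomly zeroes part of the activations during training to reduce overfitting.
5. `Dense(1)` is the output layer. One unit is enough because there are only two classes; it returns a raw logit, and the sigmoid is applied inside the loss function (`from_logits=True`).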

In [None]:
Simple_Ann = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(len(encoder.get_vocabulary()), 64, mask_zero=True),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(20, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1)
])
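
To show what the embedding and pooling steps do to a padded batch, here is a small NumPy sketch. It assumes random embedding weights and a toy batch of token indices, and it averages only over non-padding positions, which is the intent of `mask_zero=True`; it is an illustration, not the Keras implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 6, 4                     # toy sizes, not the real ones
embedding_table = rng.normal(size=(vocab_size, embed_dim))

# Two encoded reviews; 0 marks padding.
token_ids = np.array([[2, 3, 4, 1],
                      [2, 3, 0, 0]])

embedded = embedding_table[token_ids]            # (batch, time, embed_dim)
mask = (token_ids != 0)[..., None]               # (batch, time, 1)
pooled = (embedded * mask).sum(axis=1) / mask.sum(axis=1)
print(pooled.shape)
```

Each review ends up as one `embed_dim`-sized vector regardless of its length, which is what lets the following `Dense` layers work on variable-length text.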

### Compile the model
The model is compiled with the binary cross-entropy (BCE) loss, which suits binary classification, and the `Adam` optimizer with a learning rate of 0.0001 (`1e-4`).
Accuracy, the fraction of correctly predicted labels, is the usual metric for classification tasks.

In [None]:
Simple_Ann.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])
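
Since the output layer returns logits, the loss has to apply the sigmoid itself. The NumPy sketch below shows one numerically stable way to compute binary cross-entropy from logits and checks it against the naive formula; the labels and logits are made-up values, and `bce_from_logits` is our own helper rather than the Keras code.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bce_from_logits(labels, logits):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(logits)."""
    # Rearranged so that exp() is only ever called on non-positive values.
    per_example = (np.maximum(logits, 0.0) - logits * labels
                   + np.log1p(np.exp(-np.abs(logits))))
    return per_example.mean()


labels = np.array([1.0, 0.0, 1.0])               # illustrative values
logits = np.array([2.0, -1.0, 0.0])

loss = bce_from_logits(labels, logits)
probs = sigmoid(logits)
naive = -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))
print(loss, naive)
```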

### Train the model

In [None]:
trained_Simple_Ann = Simple_Ann.fit(train_dataset, epochs=5,
                    validation_data=test_dataset,
                    validation_steps=30)


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


### Model evaluation

In [None]:
test_loss_Simple_Ann, test_acc_Simple_Ann = Simple_Ann.evaluate(test_dataset)

print('Test Loss:', test_loss_Simple_Ann)
print('Test Accuracy:', test_acc_Simple_Ann)

Test Loss: 0.5039566159248352
Test Accuracy: 0.7287200093269348


In [None]:
history_list.append(trained_Simple_Ann)

### Model 2: Simple Recurrent neural network

In [None]:
RNN_model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(input_dim=len(encoder.get_vocabulary()),output_dim=64,mask_zero=True),
    tf.keras.layers.SimpleRNN(32),
    tf.keras.layers.Dense(1)
])
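
The recurrence behind a simple RNN layer can be written in a few lines of NumPy. This is a sketch under stated assumptions: random weights, a `tanh` activation, no masking, and only the final hidden state returned. The names `W`, `U`, `b` and `simple_rnn_last_state` are ours.

```python
import numpy as np


def simple_rnn_last_state(inputs, W, U, b):
    """inputs has shape (time, features); returns the final state (units,)."""
    h = np.zeros(U.shape[0])
    for x_t in inputs:
        # The new state mixes the current input with the previous state.
        h = np.tanh(x_t @ W + h @ U + b)
    return h


rng = np.random.default_rng(0)
features, units = 64, 32                         # embedding size, RNN units
W = rng.normal(scale=0.1, size=(features, units))
U = rng.normal(scale=0.1, size=(units, units))
b = np.zeros(units)

sequence = rng.normal(size=(10, features))       # one review of 10 tokens
final_state = simple_rnn_last_state(sequence, W, U, b)
print(final_state.shape)
```

The 32-dimensional final state is what the `Dense(1)` layer receives to produce the logit.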

In [None]:
RNN_model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

In [None]:
trained_RNN_model = RNN_model.fit(train_dataset, epochs=5,
                    validation_data=test_dataset,
                    validation_steps=30)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [None]:
test_loss_RNN_model, test_acc_RNN_model = RNN_model.evaluate(test_dataset)

print('Test Loss:', test_loss_RNN_model)
print('Test Accuracy:', test_acc_RNN_model)
history_list.append(trained_RNN_model)

Test Loss: 0.40619397163391113
Test Accuracy: 0.8324000239372253


### Model 3: GRU model

In [None]:
GRU_model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(input_dim=len(encoder.get_vocabulary()),output_dim=64,mask_zero=True),
    tf.keras.layers.GRU(32),
    tf.keras.layers.Dense(1)
])
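
A GRU replaces the plain recurrence with gated updates. The sketch below follows a common textbook formulation with an update gate `z` and a reset gate `r`; it is not guaranteed to match the exact variant Keras uses internally, and the parameter names and random weights are assumptions for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_params(features, units, rng):
    """Random weights for the update (z), reset (r) and candidate (h) parts."""
    params = {}
    for gate in "zrh":
        params["W" + gate] = rng.normal(scale=0.1, size=(features, units))
        params["U" + gate] = rng.normal(scale=0.1, size=(units, units))
        params["b" + gate] = np.zeros(units)
    return params


def gru_step(x_t, h, p):
    z = sigmoid(x_t @ p["Wz"] + h @ p["Uz"] + p["bz"])   # how much state to keep
    r = sigmoid(x_t @ p["Wr"] + h @ p["Ur"] + p["br"])   # how much state to read
    h_cand = np.tanh(x_t @ p["Wh"] + (r * h) @ p["Uh"] + p["bh"])
    return z * h + (1.0 - z) * h_cand


rng = np.random.default_rng(0)
features, units = 64, 32
params = init_params(features, units, rng)
sequence = rng.normal(size=(10, features))

h = np.zeros(units)
for x_t in sequence:
    h = gru_step(x_t, h, params)
print(h.shape)
```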

In [None]:
GRU_model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

In [None]:
trained_GRU_model = GRU_model.fit(train_dataset, epochs=5,
                    validation_data=test_dataset,
                    validation_steps=30)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [None]:
test_loss_GRU_model, test_acc_GRU_model = GRU_model.evaluate(test_dataset)

print('Test Loss:', test_loss_GRU_model)
print('Test Accuracy:', test_acc_GRU_model)
history_list.append(trained_GRU_model)

Test Loss: 0.3451537787914276
Test Accuracy: 0.8444399833679199


### Model 4: LSTM Model

In [None]:
LSTM_model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(input_dim=len(encoder.get_vocabulary()),output_dim=64,mask_zero=True),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1)
])
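
An LSTM keeps a separate cell state `c` next to the hidden state `h` and controls it with input, forget, and output gates. The NumPy sketch below uses the standard textbook equations with random weights; the parameter names are ours, and details such as masking and weight initialization are left out.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_params(features, units, rng):
    """Random weights for the input (i), forget (f), output (o) gates and candidate (g)."""
    params = {}
    for gate in "ifog":
        params["W" + gate] = rng.normal(scale=0.1, size=(features, units))
        params["U" + gate] = rng.normal(scale=0.1, size=(units, units))
        params["b" + gate] = np.zeros(units)
    return params


def lstm_step(x_t, h, c, p):
    i = sigmoid(x_t @ p["Wi"] + h @ p["Ui"] + p["bi"])   # what to write
    f = sigmoid(x_t @ p["Wf"] + h @ p["Uf"] + p["bf"])   # what to keep
    o = sigmoid(x_t @ p["Wo"] + h @ p["Uo"] + p["bo"])   # what to expose
    g = np.tanh(x_t @ p["Wg"] + h @ p["Ug"] + p["bg"])   # candidate values
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new


rng = np.random.default_rng(0)
features, units = 64, 32
params = init_params(features, units, rng)
sequence = rng.normal(size=(10, features))

h, c = np.zeros(units), np.zeros(units)
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)
```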

In [None]:
LSTM_model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-3),
              metrics=['accuracy'])

In [None]:
trained_LSTM_model = LSTM_model.fit(train_dataset, epochs=5,
                    validation_data=test_dataset,
                    validation_steps=30)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [None]:
test_loss_LSTM_model, test_acc_LSTM_model = LSTM_model.evaluate(test_dataset)

print('Test Loss:', test_loss_LSTM_model)
print('Test Accuracy:', test_acc_LSTM_model)
history_list.append(trained_LSTM_model)

Test Loss: 0.44787338376045227
Test Accuracy: 0.817359983921051


### Model 5: LSTM + Bidirectional

In [None]:
BiLSTM_model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(input_dim=len(encoder.get_vocabulary()),output_dim=64,mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(1)
])
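
`Bidirectional` runs one copy of the wrapped layer over the sequence from start to end and a second copy from end to start, then merges the two results (by concatenation under the default settings). The sketch below illustrates only that wrapping idea: to stay short it uses a plain `tanh` recurrence as a stand-in for the LSTM cell, and all names and weights are illustrative.

```python
import numpy as np


def rnn_last_state(inputs, W, U, b):
    """Plain tanh recurrence, used here as a stand-in for the LSTM cell."""
    h = np.zeros(U.shape[0])
    for x_t in inputs:
        h = np.tanh(x_t @ W + h @ U + b)
    return h


def make_params(features, units, rng):
    return (rng.normal(scale=0.1, size=(features, units)),
            rng.normal(scale=0.1, size=(units, units)),
            np.zeros(units))


rng = np.random.default_rng(0)
features, units = 64, 32
forward_params = make_params(features, units, rng)    # separate weights
backward_params = make_params(features, units, rng)   # for each direction
sequence = rng.normal(size=(10, features))

forward = rnn_last_state(sequence, *forward_params)
backward = rnn_last_state(sequence[::-1], *backward_params)  # reversed in time
merged = np.concatenate([forward, backward])
print(merged.shape)
```

With 32 units per direction, the merged vector has 64 values, so the final `Dense(1)` layer sees twice as many inputs as in the unidirectional LSTM model.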

In [None]:
BiLSTM_model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-3),
              metrics=['accuracy'])

In [None]:
trained_BiLSTM_model = BiLSTM_model.fit(train_dataset, epochs=5,
                    validation_data=test_dataset,
                    validation_steps=30)

In [None]:
test_loss_BiLSTM_model, test_acc_BiLSTM_model = BiLSTM_model.evaluate(test_dataset)

print('Test Loss:', test_loss_BiLSTM_model)
print('Test Accuracy:', test_acc_BiLSTM_model)
history_list.append(trained_BiLSTM_model)