<a href="https://colab.research.google.com/github/Trantracy/Text-classification-RNN-and-LSTM/blob/master/Text_classification%20RNN%20and%20LSTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text classification with an RNN

When we first learned machine learning, we solved this kind of problem with sklearn's logistic regression, using softmax (multi-class) or sigmoid (binary) as the activation function.

We'll use the IMDB dataset that contains the text of 50,000 movie reviews from the Internet Movie Database. These are split into 25,000 reviews for training and 25,000 reviews for testing. The training and testing sets are balanced, meaning they contain an equal number of positive and negative reviews.

## Setup

In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow_datasets as tfds
import tensorflow as tf

In [0]:
import matplotlib.pyplot as plt

# Plot a training metric and its validation counterpart for every epoch
def plot_graphs(history, metric):
  plt.plot(history.history[metric])
  plt.plot(history.history['val_'+metric], '')
  plt.xlabel("Epochs")
  plt.ylabel(metric)
  plt.legend([metric, 'val_'+metric])
  plt.show()

## Download our dataset


The IMDB dataset is a **binary** classification dataset, so every review is labelled either **positive** or **negative**.


In [0]:
# Use the version pre-encoded with an ~8k vocabulary.
dataset, info = tfds.load('imdb_reviews/subwords8k', 
                          
                          # Also return the `info` structure. 
                          with_info=True,

                          # Return (example, label) pairs from the dataset (instead of a dictionary).
                          as_supervised=True)

Downloading and preparing dataset imdb_reviews/subwords8k/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0...



Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/subwords8k/1.0.0. Subsequent calls will reuse this data.


In [0]:
train_dataset, test_dataset = dataset['train'], dataset['test']

## Explore the data

Let's take a moment to understand the format of the data. The dataset comes preprocessed: each example is an array of integers representing the words of the movie review.

The text of each review has been converted to integers, where each integer represents a specific word-piece in the dictionary.

Each label is an integer value of either 0 or 1, where 0 is a negative review, and 1 is a positive review.

In [0]:
# Print the first encoded review and its label
for train_example, train_label in train_dataset.take(1):
  print('Encoded text:', train_example.numpy())
  print('Label:', train_label.numpy())

Encoded text: [  62   18   41  604  927   65    3  644 7968   21   35 5096   36   11
   43 2948 5240  102   50  681 7862 1244    3 3266   29  122  640    2
   26   14  279  438   35   79  349  384   11 1991    3  492   79  122
  188  117   33 4047 4531   14   65 7968    8 1819 3947    3   62   27
    9   41  577 5044 2629 2552 7193 7961 3642    3   19  107 3903  225
   85  198   72    1 1512  738 2347  102 6245    8   85  308   79 6936
 7961   23 4981 8044    3 6429 7961 1141 1335 1848 4848   55 3601 4217
 8050    2    5   59 3831 1484 8040 7974  174 5773   22 5240  102   18
  247   26    4 3903 1612 3902  291   11    4   27   13   18 4092 4008
 7961    6  119  213 2774    3   12  258 2306   13   91   29  171   52
  229    2 1245 5790  995 7968    8   52 2948 5240 8039 7968    8   74
 1249    3   12  117 2438 1369  192   39 7975]
Label: 0


 The dataset `info` includes the encoder (a `tfds.features.text.SubwordTextEncoder`). The encoder can be used to recover the original text.

In [0]:
encoder = info.features['text'].encoder

In [0]:
print('Unique vocabulary size: {}'.format(encoder.vocab_size))

Unique vocabulary size: 8185


A subword encoder like this is built from a text corpus such as our current dataset, and it will reversibly encode any string.

Remember that the encoder encodes a string by breaking it into subwords, or into characters if a word is not in its dictionary. So the more a string resembles the dataset, the shorter the encoded representation will be.
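
To make the fallback behaviour concrete, here is a toy greedy longest-match tokenizer in plain Python. It is only a sketch: the vocabulary `TOY_VOCAB`, the function `toy_encode`, and the id scheme are all made up for illustration, and the real `SubwordTextEncoder` uses its own vocabulary and algorithm.

```python
# Hypothetical toy vocabulary; id 0 is left free for padding, so ids start at 1.
TOY_VOCAB = ["movie", "great", "Hell", "o", " "]

def toy_encode(text, vocab=TOY_VOCAB):
    """Greedy longest-match encoding with a per-character fallback."""
    ids = []
    pieces = []
    pos = 0
    while pos < len(text):
        # Try the longest vocabulary entry that matches at this position.
        match = None
        for piece in sorted(vocab, key=len, reverse=True):
            if text.startswith(piece, pos):
                match = piece
                break
        if match is None:
            # Unknown character: emit it on its own, with an id outside
            # the vocabulary range (an arbitrary choice for this sketch).
            match = text[pos]
            ids.append(len(vocab) + 1 + ord(match))
        else:
            ids.append(vocab.index(match) + 1)
        pieces.append(match)
        pos += len(match)
    return ids, pieces

known_ids, known_pieces = toy_encode("great movie")
rare_ids, rare_pieces = toy_encode("Hello Tonga")
print(len(known_pieces), known_pieces)
print(len(rare_pieces), rare_pieces)
```

The string made of in-vocabulary words needs far fewer pieces than the one that falls back to single characters, and joining the pieces always recovers the original text.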


In [0]:
# Try "Hello Tonga." later and see the difference. Can you explain?
sample_string = 'I be want Hello Tonga.'

encoded_string = encoder.encode(sample_string)
print('Encoded string is {}'.format(encoded_string))
print(type(encoded_string[0]))

original_string = encoder.decode(encoded_string)
print('The original string: "{}"'.format(original_string))

Encoded string is [12, 35, 239, 4025, 222, 4310, 1023, 7975]
<class 'int'>
The original string: "I be want Hello Tonga."


In [0]:
# See how the encoder maps each index to a subword. The mapping depends on the
# encoder, the corpus it was built from, and the target vocabulary size.
for index in encoded_string:
  print('{} ----> {}'.format(index, encoder.decode([index])))

12 ----> I 
35 ----> be 
239 ----> want 
4025 ----> Hell
222 ----> o 
4310 ----> Ton
1023 ----> ga
7975 ----> .


In [0]:
assert original_string == sample_string  # make sure the decoded string matches the original string

OK, that's cool, but can we decode the first review in our training set?


In [0]:
# Decode the first review from the training set
original_sentence = encoder.decode(train_example.numpy())
print('The original sentence: "{}"'.format(original_sentence))

The original sentence: "This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."


## Prepare the data for training

Next create batches of these encoded strings. Use the `padded_batch` method to zero-pad the sequences to the length of the longest string in the batch:

In [0]:
BUFFER_SIZE = 10000
BATCH_SIZE = 64

The reviews all have different lengths, so `padded_batch` zero-pads the sequences while batching. Only the training set is shuffled first.

In [0]:
train_dataset = train_dataset.shuffle(BUFFER_SIZE)
train_dataset = train_dataset.padded_batch(BATCH_SIZE)

test_dataset = test_dataset.padded_batch(BATCH_SIZE)

Each batch has a shape of `(batch_size, sequence_length)`. Because the padding is dynamic, each batch can have a different sequence length:

In [0]:
for example_batch, label_batch in train_dataset.take(2):
  print("Batch shape:", example_batch.shape)
  print("label shape:", label_batch.shape)

Batch shape: (64, 1513)
label shape: (64,)
Batch shape: (64, 1472)
label shape: (64,)
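
The per-batch padding can be mimicked in plain Python. This is a conceptual sketch only: `pad_batch` and `batches` are hypothetical helpers written for this illustration, not part of `tf.data`, and the token ids are arbitrary.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]

def batches(sequences, batch_size):
    """Yield padded batches; each batch is padded independently of the others."""
    for start in range(0, len(sequences), batch_size):
        yield pad_batch(sequences[start:start + batch_size])

# Four fake encoded reviews of different lengths.
reviews = [[12, 35], [62, 18, 41, 604], [7], [3, 644, 7968]]
padded_batches = list(batches(reviews, batch_size=2))
for batch in padded_batches:
    print(len(batch), "x", len(batch[0]), batch)
```

The first batch is padded to length 4 and the second only to length 3, which is why the sequence dimension differs from batch to batch.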


## Create the model

Build a `tf.keras.Sequential` model and start with an embedding layer. An embedding layer stores one vector per word. When called, it converts the sequences of word indices to sequences of vectors. These vectors are trainable. After training (on enough data), words with similar meanings often have similar vectors.

This index-lookup is much more efficient than the equivalent operation of passing a one-hot encoded vector through a `tf.keras.layers.Dense` layer.
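
A small plain-Python sketch of why the two operations are equivalent. The table values and the helper names `lookup` and `one_hot_matmul` are made up for illustration; a real embedding layer holds a trainable matrix of shape `(vocab_size, embedding_dim)`.

```python
# Toy embedding table: 4 vocabulary entries, 3 dimensions each (made-up values).
embedding_table = [
    [0.0, 0.0, 0.0],
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
]

def lookup(indices):
    """Embedding-layer view: pick one row of the table per index."""
    return [embedding_table[i] for i in indices]

def one_hot_matmul(indices):
    """Dense-layer view: multiply a one-hot vector by the whole table."""
    vocab_size = len(embedding_table)
    dim = len(embedding_table[0])
    vectors = []
    for i in indices:
        one_hot = [1.0 if j == i else 0.0 for j in range(vocab_size)]
        vectors.append([
            sum(one_hot[j] * embedding_table[j][d] for j in range(vocab_size))
            for d in range(dim)
        ])
    return vectors

sequence = [2, 1, 3]
print(lookup(sequence) == one_hot_matmul(sequence))
```

Both functions return the same vectors, but the lookup touches one row per token while the matrix product visits every vocabulary entry, which becomes costly with a vocabulary of about 8,000 subwords.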

A recurrent neural network (RNN) processes sequence input by iterating through the elements. RNNs pass the outputs from one timestep to their input on the next timestep.

The `tf.keras.layers.Bidirectional` wrapper can also be used with an RNN layer. This propagates the input forward and backwards through the RNN layer and then concatenates the output. This helps the RNN to learn long range dependencies.
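
The recurrence and the bidirectional idea can be sketched with a scalar RNN in plain Python. Everything here is an assumption made for illustration: the weights are arbitrary constants rather than trained values, the state is a single number instead of a vector, and `simple_rnn` and `bidirectional_last` are hypothetical names, not Keras functions.

```python
import math

def simple_rnn(inputs, w_x=0.5, w_h=0.8, bias=0.0):
    """Scalar RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1} + bias).

    Returns the state after every timestep.
    """
    state = 0.0
    outputs = []
    for x in inputs:
        # The previous output is fed back in together with the new input.
        state = math.tanh(w_x * x + w_h * state + bias)
        outputs.append(state)
    return outputs

def bidirectional_last(inputs):
    """Run one RNN forwards and another backwards, then concatenate the results."""
    forward = simple_rnn(inputs)[-1]
    # The backward direction gets its own (made-up) weights.
    backward = simple_rnn(list(reversed(inputs)), w_x=0.3, w_h=0.6)[-1]
    return [forward, backward]

sequence = [1.0, 0.0, -1.0]
states = simple_rnn(sequence)
combined = bidirectional_last(sequence)
print(states)
print(combined)
```

The concatenated result has twice as many features as a single direction, because it carries one summary that ends at the last element and one that ends at the first.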

In [0]:
# YOUR PART: TRY TO FINISH THIS MODEL (Simple one only using Embeddings)
model = tf.keras.Sequential([
    # layer Embeddings (vocab_size, 64)
    # Global Average Pooling 1D Layer
    # Dense 16 relu
    # Dense 1 sigmoid
])

In [0]:
model.summary()
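
One possible way to complete the simple model is sketched below; treat it as a suggestion rather than the intended answer. `VOCAB_SIZE` stands in for `encoder.vocab_size`, the name `simple_model` and the fake batch are made up for this sketch, and the explicit `tf.keras.Input` is an extra assumption that declares variable-length sequences of ids.

```python
import tensorflow as tf

# Assumption: stands in for encoder.vocab_size (8185 in this notebook).
VOCAB_SIZE = 8185

simple_model = tf.keras.Sequential([
    # Variable-length sequences of integer subword ids.
    tf.keras.Input(shape=(None,), dtype='int32'),
    # One trainable 64-dimensional vector per subword id.
    tf.keras.layers.Embedding(VOCAB_SIZE, 64),
    # Average the vectors over the sequence dimension.
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(16, activation='relu'),
    # A single sigmoid unit outputs the probability of a positive review.
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

# Two fake, zero-padded reviews of length 5, used only to check shapes.
fake_batch = tf.constant([[12, 35, 239, 0, 0], [62, 18, 41, 604, 927]])
probabilities = simple_model(fake_batch)
print(probabilities.shape)
```

The model returns one probability per review, so the output has shape `(batch_size, 1)`.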

In [0]:
# YOUR PART: TRY TO FINISH THIS MORE ADVANCED MODEL
model = tf.keras.Sequential([
    # layer Embeddings (vocab_size, 64)
    # Bi LSTM 64
    # Dense 64 relu
    # Dense 1 sigmoid
])

In [0]:
model.summary()
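
A possible completion of the bidirectional LSTM model, again only as a hedged sketch. `VOCAB_SIZE` stands in for `encoder.vocab_size`, and `bilstm_model`, the explicit `tf.keras.Input`, and the fake batch are assumptions added for illustration.

```python
import tensorflow as tf

# Assumption: stands in for encoder.vocab_size (8185 in this notebook).
VOCAB_SIZE = 8185

bilstm_model = tf.keras.Sequential([
    # Variable-length sequences of integer subword ids.
    tf.keras.Input(shape=(None,), dtype='int32'),
    tf.keras.layers.Embedding(VOCAB_SIZE, 64),
    # Forward and backward LSTMs with 64 units each; outputs are concatenated.
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

# Two fake, zero-padded reviews of length 5, used only to check shapes.
fake_batch = tf.constant([[12, 35, 239, 0, 0], [62, 18, 41, 604, 927]])

# Apply the first two layers by hand to inspect the recurrent features.
embedded = bilstm_model.layers[0](fake_batch)
recurrent_features = bilstm_model.layers[1](embedded)
probabilities = bilstm_model(fake_batch)

print(recurrent_features.shape)
print(probabilities.shape)
```

Because the two directions are concatenated, the recurrent layer produces 128 features per review even though each LSTM has 64 units.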

A model needs a loss function and an optimizer for training. Since this is a binary classification problem and the model outputs a probability (a single-unit layer with a sigmoid activation), we'll use the binary_crossentropy loss function.
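
For intuition, binary cross-entropy can be computed by hand. This is a plain-Python sketch of the standard formula, not the Keras implementation; the clipping constant `eps` and the sample predictions are assumptions chosen for illustration.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y * log(p) + (1 - y) * log(1 - p)] over a batch."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Labels: first review positive (1), second negative (0).
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident_right, confident_wrong)
```

Confident correct predictions give a loss close to zero, while confident wrong predictions are penalised heavily, which is what pushes the sigmoid output towards the true label during training.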

Compile the Keras model to configure the training process with an optimizer, a loss function, and metrics:

In [0]:
# YOUR PART: replace each `_` with a loss (remember this is binary classification), an optimizer, and a list of metrics
model.compile(loss=_,
              optimizer=_,
              metrics=_)
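
One way the blanks could be filled in is sketched below on a deliberately tiny stand-in model. The loss follows the text above; the Adam optimizer with a learning rate of `1e-4`, the `accuracy` metric, the name `tiny_model`, and the fake data are assumptions for this sketch, and other choices are equally valid.

```python
import tensorflow as tf

# A tiny stand-in model so the sketch runs quickly; not the notebook's model.
tiny_model = tf.keras.Sequential([
    tf.keras.Input(shape=(None,), dtype='int32'),
    tf.keras.layers.Embedding(100, 8),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

tiny_model.compile(
    loss='binary_crossentropy',                # matches the sigmoid output
    optimizer=tf.keras.optimizers.Adam(1e-4),  # assumption: any optimizer works
    metrics=['accuracy'],
)

# Fake zero-padded reviews and labels, used only to exercise the compiled model.
features = tf.constant([[1, 2, 3, 0], [4, 5, 0, 0]])
labels = tf.constant([[1.0], [0.0]])
loss, accuracy = tiny_model.evaluate(features, labels, verbose=0)
print(loss, accuracy)
```

After compiling, `evaluate` returns the loss followed by each metric in the order given to `metrics`.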

## Train the model

In [0]:
# YOUR PART: fill in the blanks to fit your model
history = model.fit(_, epochs=_,
                    validation_data=_, 
                    validation_steps=_)

Epoch 1/2
Epoch 2/2


And let's see how the model performs. Two values will be returned. Loss (a number which represents our error, lower values are better), and accuracy.

In [0]:
test_loss, test_acc = model.evaluate(test_dataset)

print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))

This fairly naive approach achieves an accuracy of about 85%.

The above model does not mask the padding applied to the sequences. This can lead to skew if the model is trained on padded sequences and tested on un-padded sequences. Ideally you would [use masking](https://tensorflow.google.cn/guide/keras/masking_and_padding) to avoid this, but as you can see below it only has a small effect on the output.
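
As a rough sketch of what masking looks like in Keras, the embedding layer can be told to treat id 0 as padding with `mask_zero=True`. The layer sizes, the name `masked_model`, and the token ids below are made up for illustration, and the weights are untrained.

```python
import numpy as np
import tensorflow as tf

masked_model = tf.keras.Sequential([
    tf.keras.Input(shape=(None,), dtype='int32'),
    # mask_zero=True tells downstream layers to skip timesteps whose id is 0.
    tf.keras.layers.Embedding(100, 8, mask_zero=True),
    tf.keras.layers.LSTM(4),
])

# The same fake review, without and with trailing zero padding.
unpadded = tf.constant([[5, 17, 42]])
padded = tf.constant([[5, 17, 42, 0, 0, 0]])

out_unpadded = masked_model(unpadded).numpy()
out_padded = masked_model(padded).numpy()
print(np.allclose(out_unpadded, out_padded, atol=1e-5))
```

With the mask in place, the LSTM ignores the padded timesteps, so the padded and un-padded versions of a review should produce the same output.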

If the prediction is >= 0.5, the review is classified as positive; otherwise it is classified as negative.

In [0]:
def pad_to_size(vec, size):
  zeros = [0] * (size - len(vec))
  vec.extend(zeros)
  return vec

In [0]:
def sample_predict(sample_pred_text):
  # Encode the text into subword ids, then zero-pad it to at least 64 ids.
  encoded_sample_pred_text = encoder.encode(sample_pred_text)
  encoded_sample_pred_text = pad_to_size(encoded_sample_pred_text, 64)
  encoded_sample_pred_text = tf.cast(encoded_sample_pred_text, tf.float32)

  # Add a batch dimension, because the model expects a batch of sequences.
  predictions = model.predict(tf.expand_dims(encoded_sample_pred_text, 0))

  return predictions

In [0]:
# predict on a positive sample text with padding

sample_pred_text = ('The movie was cool. The animation and the graphics '
                    'were out of this world. I would recommend this movie.')
predictions = sample_predict(sample_pred_text)
print(predictions)

In [0]:
# predict on a negative sample text with padding

sample_pred_text = ('The movie was not good. The animation and the graphics '
                    'were terrible. I would not recommend this movie.')
predictions = sample_predict(sample_pred_text)
print(predictions)

In [0]:
plot_graphs(history, 'accuracy')

In [0]:
plot_graphs(history, 'loss')

## Stack two or more LSTM layers

Keras recurrent layers have two available modes that are controlled by the `return_sequences` constructor argument:

* With `return_sequences=True`, return the full sequence of successive outputs for each timestep (a 3D tensor of shape `(batch_size, timesteps, output_features)`).
* With `return_sequences=False` (the default), return only the last output for each input sequence (a 2D tensor of shape `(batch_size, output_features)`).
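
The two modes can be compared directly on a fake batch. This is a minimal sketch with arbitrary sizes; the random tensor `embedded` stands in for the output of an embedding layer.

```python
import tensorflow as tf

# Fake embedded batch: 2 sequences, 7 timesteps, 16 features per timestep.
embedded = tf.random.normal((2, 7, 16))

all_steps = tf.keras.layers.LSTM(8, return_sequences=True)(embedded)
last_step = tf.keras.layers.LSTM(8)(embedded)  # return_sequences defaults to False

print(all_steps.shape)  # 3D: (batch_size, timesteps, output_features)
print(last_step.shape)  # 2D: (batch_size, output_features)

# A recurrent layer expects 3D input, so a second LSTM can consume `all_steps`.
stacked = tf.keras.layers.LSTM(4)(all_steps)
print(stacked.shape)
```

This is why every recurrent layer except the last one in a stack needs `return_sequences=True`: the next recurrent layer needs one input vector per timestep.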

In [0]:
# YOUR PART: TRY TO FINISH THIS MODEL
model = tf.keras.Sequential([
    # layer Embeddings (vocab_size, 64)
    # LSTM 64, return sequence ?
    # LSTM 64
    # Dense 64 relu
    # Dense 1 sigmoid
])

In [0]:
model.compile(loss=_,
              optimizer=_,
              metrics=_)

In [0]:
history = model.fit(_, epochs=_,
                    validation_data=_,
                    validation_steps=_)

In [0]:
test_loss, test_acc = model.evaluate(test_dataset)

print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))

In [0]:
plot_graphs(history, 'accuracy')

In [0]:
plot_graphs(history, 'loss')