In this notebook, we attempt to classify movie reviews as *positive* or *negative* based on the text of the review. This is a two-class (binary) classification problem.

This notebook presents a basic application of transfer learning using TensorFlow Hub and Keras.

The data used here is the IMDB dataset. It contains 50,000 movie reviews, split into 25,000 reviews for training and 25,000 reviews for testing. The training and testing sets are balanced: each contains an equal number of positive and negative reviews.

We will use:

* `tf.keras` to build and train models in TensorFlow
* `TensorFlow Hub`, a library and platform for transfer learning

In [1]:
import numpy as np
import tensorflow as tf

!pip install -q tensorflow-hub
!pip install -q tfds-nightly
import tensorflow_hub as hub
import tensorflow_datasets as tfds

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("Hub version: ", hub.__version__)
print("GPU is", "available" if tf.config.experimental.list_physical_devices("GPU") else "NOT AVAILABLE")

Version:  2.1.0
Eager mode:  True
Hub version:  0.8.0
GPU is NOT AVAILABLE


# Download the IMDB dataset
The IMDB dataset is available in TensorFlow Datasets.

In [2]:
# Split the training set 60%/40%, so we'll end up with 15,000 examples
# for training, 10,000 examples for validation and 25,000 examples for testing

train_data, validation_data, test_data = tfds.load(
    name="imdb_reviews", 
    split=('train[:60%]', 'train[60%:]', 'test'),
    as_supervised=True)

Downloading and preparing dataset imdb_reviews/plain_text/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to C:\Users\saif\tensorflow_datasets\imdb_reviews\plain_text\1.0.0...
Shuffling and writing examples to C:\Users\saif\tensorflow_datasets\imdb_reviews\plain_text\1.0.0.incomplete2W875L\imdb_reviews-train.tfrecord
Shuffling and writing examples to C:\Users\saif\tensorflow_datasets\imdb_reviews\plain_text\1.0.0.incomplete2W875L\imdb_reviews-test.tfrecord
Shuffling and writing examples to C:\Users\saif\tensorflow_datasets\imdb_reviews\plain_text\1.0.0.incomplete2W875L\imdb_reviews-unsupervised.tfrecord
Dataset imdb_reviews downloaded and prepared to C:\Users\saif\tensorflow_datasets\imdb_reviews\plain_text\1.0.0. Subsequent calls will reuse this data.
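
The split sizes quoted in the comment of the loading cell can be checked with a little arithmetic. The sketch below is plain Python and does not call TensorFlow Datasets; the helper `percent_slice_sizes` is a hypothetical name introduced only for illustration, and it assumes the percent boundary falls on a whole example (true here, since 60% of 25,000 is exact).

```python
# Sizes taken from the text: the IMDB "train" and "test" splits
# each hold 25,000 reviews.
TRAIN_SPLIT_SIZE = 25_000
TEST_SPLIT_SIZE = 25_000


def percent_slice_sizes(total, boundary_percent):
    """Sizes of the [:p%] and [p%:] slices of a split with `total` examples.

    Hypothetical helper: it only mirrors the arithmetic of the slicing
    strings 'train[:60%]' and 'train[60%:]'.
    """
    boundary = total * boundary_percent // 100  # index where the split is cut
    return boundary, total - boundary


n_train, n_validation = percent_slice_sizes(TRAIN_SPLIT_SIZE, 60)
print(n_train, n_validation, TEST_SPLIT_SIZE)
```

The two slices are disjoint and together cover the whole original training split, so no review is used for both training and validation.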


# Explore the data
Each example is the text of a movie review and a corresponding label.
The label is an integer, either 0 or 1, where 0 is a negative review and 1 is a positive review.
Let's print the first 5 examples:

In [4]:
train_examples_batch, train_labels_batch = next(iter(train_data.batch(5)))
train_examples_batch

<tf.Tensor: shape=(5,), dtype=string, numpy=
array([b'This is a big step down after the surprisingly enjoyable original. This sequel isn\'t nearly as fun as part one, and it instead spends too much time on plot development. Tim Thomerson is still the best thing about this series, but his wisecracking is toned down in this entry. The performances are all adequate, but this time the script lets us down. The action is merely routine and the plot is only mildly interesting, so I need lots of silly laughs in order to stay entertained during a "Trancers" movie. Unfortunately, the laughs are few and far between, and so, this film is watchable at best.',
       b"Perhaps because I was so young, innocent and BRAINWASHED when I saw it, this movie was the cause of many sleepless nights for me. I haven't seen it since I was in seventh grade at a Presbyterian school, so I am not sure what effect it would have on me now. However, I will say that it left an impression on me... and most of my friends.

Let's print the first 5 labels:

In [5]:
train_labels_batch

<tf.Tensor: shape=(5,), dtype=int64, numpy=array([0, 0, 1, 0, 1], dtype=int64)>
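
The outputs above show that reviews arrive as byte strings (`b'...'`) and labels as integers. As a small sketch of how they might be made human-readable, the code below decodes a byte string and maps each label to its name. It works on plain Python values, which is what calling `.numpy()` on the tensors would give; `LABEL_NAMES` and `describe` are hypothetical names used only for illustration.

```python
# Mapping stated in the text: 0 is a negative review, 1 is a positive one.
LABEL_NAMES = {0: "negative", 1: "positive"}


def describe(review_bytes, label, max_chars=40):
    """Return '<label name>: <start of the review>' for one example."""
    text = review_bytes.decode("utf-8")  # bytes -> str
    return f"{LABEL_NAMES[label]}: {text[:max_chars]}"


# The five labels printed above, copied as plain integers.
labels = [0, 0, 1, 0, 1]
label_names = [LABEL_NAMES[label] for label in labels]

# Opening sentence of the first review shown above.
first_review = b"This is a big step down after the surprisingly enjoyable original."

print(label_names)
print(describe(first_review, labels[0]))
```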

# Build the model
The neural network is created by stacking layers. This requires 3 main architectural decisions:
* How do we represent the text?
* How many layers do we use in the model?
* How many *hidden units* do we use for each layer?

In our example, the input data consists of sentences. The labels to predict are either 0 or 1.

One way to represent the text is to convert sentences into embedding vectors. We can use a pre-trained text embedding as the first layer, which has 3 advantages:
* we don't have to worry about text preprocessing
* we can benefit from transfer learning
* the embedding has a fixed size, so it is simpler to process

For this example, we will use a **pre-trained text embedding model** from TensorFlow Hub called **google/tf2-preview/gnews-swivel-20dim/1**.

There are 3 other pre-trained models to test:
* **google/tf2-preview/gnews-swivel-20dim-with-oov/1** - Same as **google/tf2-preview/gnews-swivel-20dim/1**, but with 2.5% of the vocabulary converted to OOV (out-of-vocabulary) buckets. This can help if the vocabulary of the task and the vocabulary of the model don't fully overlap.
* **google/tf2-preview/nnlm-en-dim50/1** - A much larger model with a vocabulary of ~1M tokens and 50 dimensions.
* **google/tf2-preview/nnlm-en-dim128/** - An even larger model with a vocabulary of ~1M tokens and 128 dimensions.
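
To make the idea of OOV buckets more concrete, here is a minimal sketch of the general technique: known tokens keep their own id, while unknown tokens are hashed into a small number of shared buckets instead of all collapsing to a single "unknown" id. The vocabulary, the bucket count, and the use of `zlib.crc32` as the hash are all assumptions made for illustration; the actual Hub module may tokenize and hash differently.

```python
import zlib

# Hypothetical toy vocabulary and bucket count (not the module's real values).
VOCAB = {"the": 0, "movie": 1, "was": 2, "great": 3}
NUM_OOV_BUCKETS = 2


def token_to_id(token):
    """Map a token to its vocabulary id, or to an OOV bucket id if unknown."""
    if token in VOCAB:
        return VOCAB[token]
    # A deterministic hash sends the same unknown token to the same bucket.
    bucket = zlib.crc32(token.encode("utf-8")) % NUM_OOV_BUCKETS
    return len(VOCAB) + bucket  # bucket ids come after the vocabulary ids


ids = [token_to_id(token) for token in "the movie was unwatchable".split()]
print(ids)
```

Because each bucket has its own trainable embedding row, fine-tuning can still learn something about words the pre-trained vocabulary never saw.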

Let's first create a Keras layer that uses a TensorFlow Hub model to embed the sentences, and try it out on a couple of input examples. Note that no matter the length of the input text, the output shape of the embeddings is: `(num_examples, embedding_dimension)`

In [6]:
embedding = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
hub_layer = hub.KerasLayer(embedding, input_shape=[], 
                           dtype=tf.string, trainable=True)
hub_layer(train_examples_batch[:3])

<tf.Tensor: shape=(3, 20), dtype=float32, numpy=
array([[ 2.209591  , -2.7093675 ,  3.6802928 , -1.0291991 , -4.1671185 ,
        -2.4566064 , -2.2519937 , -0.36589956,  1.9485804 , -3.1104462 ,
        -2.4610963 ,  1.3139242 , -0.9161584 , -0.16625322, -3.723651  ,
         1.8498232 ,  3.499562  , -1.2373022 , -2.8403084 , -1.213074  ],
       [ 1.9055302 , -4.11395   ,  3.6038654 ,  0.28555924, -4.658998  ,
        -5.5433393 , -3.2735848 ,  1.9235417 ,  3.8461034 ,  1.5882455 ,
        -2.64167   ,  0.76057523, -0.14820506,  0.9115291 , -6.45758   ,
         2.3990374 ,  5.0985413 , -3.2776263 , -3.2652326 , -1.2345369 ],
       [ 3.6510668 , -4.7066135 ,  4.71003   , -1.7002777 , -3.7708545 ,
        -3.709126  , -4.222776  ,  1.946586  ,  6.1182513 , -2.7392752 ,
        -5.4384456 ,  2.7078724 , -2.1263676 , -0.7084146 , -5.893995  ,
         3.1602864 ,  3.8389287 , -3.318196  , -5.1542974 , -2.4051712 ]],
      dtype=float32)>

Let's now build the full model:

In [7]:
model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1))

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
keras_layer (KerasLayer)     (None, 20)                400020    
_________________________________________________________________
dense (Dense)                (None, 16)                336       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 400,373
Trainable params: 400,373
Non-trainable params: 0
_________________________________________________________________
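
The parameter counts in the summary can be reproduced by hand. A dense layer with `n_inputs` inputs and `n_units` units has `n_inputs * n_units` weights plus `n_units` biases. The sketch below checks the figures printed above; reading the embedding layer's 400,020 parameters as a single table with 20 columns is an inference from the numbers, not something the summary states.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


EMBEDDING_DIM = 20
embedding_params = 400_020  # copied from the summary above

# Assumption: the embedding is one (rows x 20) table, so rows = params / 20.
embedding_rows = embedding_params // EMBEDDING_DIM

hidden_params = dense_params(EMBEDDING_DIM, 16)  # Dense(16) on a 20-dim input
output_params = dense_params(16, 1)              # Dense(1) on a 16-dim input
total_params = embedding_params + hidden_params + output_params

print(embedding_rows, hidden_params, output_params, total_params)
```

Almost all of the trainable parameters sit in the embedding layer, which is why setting `trainable=True` on the Hub layer matters so much for fine-tuning.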


The layers are stacked sequentially to build the classifier:
1. The first layer is a TensorFlow Hub layer. This layer uses a pre-trained SavedModel to map a sentence into its embedding vector. The pre-trained text embedding model that we are using (google/tf2-preview/gnews-swivel-20dim/1) splits the sentence into tokens, embeds each token and then combines the embeddings. The resulting dimensions are: (num_examples, embedding_dimension)
2. The fixed-length output vector is piped through a fully-connected (Dense) layer with 16 hidden units
3. The last layer is densely connected with a single output node
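
The first step (tokenize, embed each token, combine) is what makes the output shape independent of the input length. The following is a toy sketch of that idea in plain Python. The 3-dimensional embedding table, whitespace tokenization, and mean-pooling combiner are all assumptions chosen for illustration; the Hub module uses its own vocabulary, tokenizer, and combiner.

```python
# Hypothetical 3-dimensional embedding table (the real module uses 20 dims).
TOY_EMBEDDINGS = {
    "good": [1.0, 0.0, 2.0],
    "bad": [-1.0, 0.5, 0.0],
    "movie": [0.0, 1.0, 1.0],
}
DIM = 3
UNKNOWN = [0.0] * DIM  # vector used for tokens missing from the table


def embed_sentence(sentence):
    """Split into tokens, look up each token, then average the vectors."""
    tokens = sentence.lower().split()
    vectors = [TOY_EMBEDDINGS.get(token, UNKNOWN) for token in tokens]
    if not vectors:
        return list(UNKNOWN)
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(DIM)]


def embed_batch(sentences):
    """Embed each sentence; the result has shape (num_examples, DIM)."""
    return [embed_sentence(sentence) for sentence in sentences]


batch = embed_batch(["good movie", "bad bad bad movie with a long tail of words"])
shape = (len(batch), len(batch[0]))
print(shape)
```

Both sentences, short and long, end up as vectors of the same length, which is what lets a fixed-size Dense layer consume them.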

Let's compile the model

## Loss function and optimizer
A model needs a loss function and an optimizer for training. Since this is a binary classification problem and the model outputs logits (the final single-unit layer has no activation, so its output is linear), we will use the `binary_crossentropy` loss function with `from_logits=True`, which applies the sigmoid internally.

We could use another loss function, such as `mean_squared_error`. However, `binary_crossentropy` is generally better suited to probabilities, because it measures the distance between probability distributions (here, between the ground-truth labels and the predictions).
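
The compile cell below passes `from_logits=True`, meaning the loss receives raw scores rather than probabilities. As a rough sketch of what that computes, the code below implements binary cross-entropy for a single example in two ways: from a probability, and directly from a logit using the standard numerically stable form `max(x, 0) - x*y + log(1 + exp(-|x|))`. It is plain Python written for illustration, not the Keras implementation, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    """Squash a logit into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def bce_from_probability(p, y):
    """Binary cross-entropy for one example, given a probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def bce_from_logit(x, y):
    """Same loss computed from the logit x, in a numerically stable form."""
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))


def squared_error(p, y):
    """Squared error for one example, for comparison."""
    return (p - y) ** 2


# Loss for a positive example (y=1) as the model grows more or less confident.
for logit in (-4.0, 0.0, 4.0):
    p = sigmoid(logit)
    print(f"logit={logit:+.1f}  p={p:.3f}  "
          f"BCE(y=1)={bce_from_logit(logit, 1):.3f}  "
          f"MSE(y=1)={squared_error(p, 1):.3f}")
```

For a confidently wrong prediction (a logit of -4 when the label is 1), cross-entropy keeps growing while squared error saturates below 1, which gives the optimizer a stronger signal to correct such mistakes.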


In [8]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

## Train the model
Train the model for 20 epochs in mini-batches of 512 samples. This is 20 iterations over all samples in `train_data`. While training, we monitor the model's loss and accuracy on the 10,000 samples in `validation_data`.

In [9]:
history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=20,
                    validation_data=validation_data.batch(512),
                    verbose=1)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
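
The batch size and epoch count determine how many gradient updates the model receives. The sketch below works this out from the numbers used in this notebook (15,000 training examples, 10,000 validation examples, batches of 512, 20 epochs). It assumes the final, smaller batch is kept rather than dropped, which is the default behaviour of `batch()`.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Number of batches needed to see every example once (last batch kept)."""
    return math.ceil(num_examples / batch_size)


TRAIN_EXAMPLES = 15_000
VALIDATION_EXAMPLES = 10_000
BATCH_SIZE = 512
EPOCHS = 20

train_steps = steps_per_epoch(TRAIN_EXAMPLES, BATCH_SIZE)
validation_steps = steps_per_epoch(VALIDATION_EXAMPLES, BATCH_SIZE)
total_updates = train_steps * EPOCHS
last_batch_size = TRAIN_EXAMPLES - (train_steps - 1) * BATCH_SIZE

print(train_steps, validation_steps, total_updates, last_batch_size)
```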


## Evaluate the model
Now we will assess the model's performance on the test set using two metrics: loss (a number that represents the error, where lower is better) and accuracy.

In [12]:
results = model.evaluate(test_data.batch(512), verbose=2)

for name, value in zip(model.metrics_names, results):
    print("%s: %.3f" % (name, value))

loss: 0.319
accuracy: 0.862
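
Because the last layer has no activation, the model's raw outputs are logits, not probabilities. One way to turn them into class predictions is sketched below: apply a sigmoid and threshold the probability at 0.5, which is equivalent to thresholding the logit at 0. The logits here are made-up values for illustration, not outputs of the trained model, and this is only the conceptual calculation. How a string metric such as `'accuracy'` thresholds raw logits can depend on the Keras version, and an explicit `tf.keras.metrics.BinaryAccuracy(threshold=0.0)` is one way to remove that ambiguity.

```python
import math


def sigmoid(x):
    """Convert a logit into a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_label(logit):
    """Predict 1 (positive) when the logit is above 0, else 0 (negative)."""
    return 1 if logit > 0 else 0


def accuracy(logits, labels):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(predict_label(x) == y for x, y in zip(logits, labels))
    return correct / len(labels)


# Hypothetical logits, paired with the five labels printed earlier.
toy_logits = [-2.1, 0.3, 1.7, -0.4, -0.9]
toy_labels = [0, 0, 1, 0, 1]

predictions = [predict_label(x) for x in toy_logits]
toy_accuracy = accuracy(toy_logits, toy_labels)
print(predictions, toy_accuracy)
```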
