This notebook classifies movie reviews as positive or negative using the text of the review. This is an example of binary (two-class) classification, an important and widely applicable kind of machine learning problem.

***The tutorial demonstrates a basic application of transfer learning using TensorFlow Hub and Keras.***

It uses the IMDB dataset, which contains the text of 50,000 movie reviews from the Internet Movie Database. These are split into 25,000 reviews for training and 25,000 reviews for testing. The training and testing sets are balanced, meaning they contain an equal number of positive and negative reviews.

This notebook uses **tf.keras**, a high-level API for building and training models in **TensorFlow**, and **tensorflow_hub**, a library for loading trained models from ***TFHub*** in a single line of code. For a more detailed guide to text classification using **tf.keras**, see the ***MLCC Text Classification Guide***.


In [1]:
!pip install --upgrade tensorflow tensorflow-hub

Collecting tensorflow
  Downloading tensorflow-2.18.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.1 kB)
Collecting tensorboard<2.19,>=2.18 (from tensorflow)
  Downloading tensorboard-2.18.0-py3-none-any.whl.metadata (1.6 kB)


In [2]:
! pip install tensorflow-hub
! pip install tensorflow-datasets



# **Download the IMDB dataset**
The IMDB dataset is available from imdb reviews or from TensorFlow Datasets. The following cells import the required libraries and then download the IMDB dataset to your machine (or to the Colab runtime):


In [3]:
import os
import numpy as np

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("Hub version: ", hub.__version__)
print("GPU is", "available" if tf.config.list_physical_devices("GPU") else "NOT AVAILABLE")

Version:  2.17.1
Eager mode:  True
Hub version:  0.16.1
GPU is available


In [5]:
# Split the training set into 60% and 40% to end up with 15,000 examples
# for training, 10,000 examples for validation and 25,000 examples for testing.
train_data, validation_data, test_data = tfds.load(
    name="imdb_reviews",
    split=('train[:60%]', 'train[60%:]', 'test'),
    as_supervised=True)

Downloading and preparing dataset 80.23 MiB (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...

Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.
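
The split sizes stated in the code comment above can be checked by hand. The following is a minimal sketch in plain Python. `split_sizes` is a hypothetical helper written for this illustration, not part of TensorFlow Datasets, and it assumes the percentage boundary falls on a whole example, as it does here.

```python
# Hypothetical helper (not a TFDS API): sizes of the two slices produced by
# 'train[:60%]' and 'train[60%:]' on a training set of `total` examples.
def split_sizes(total, percent):
    first = total * percent // 100   # examples in train[:percent%]
    return first, total - first      # the remainder goes to train[percent%:]

train_size, validation_size = split_sizes(25000, 60)
test_size = 25000  # the test split is used unchanged

print(train_size, validation_size, test_size)
```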


# ***Explore the data***

Let's take a moment to understand the format of the data. Each example is a string containing the text of a movie review, paired with its label. The text is not preprocessed in any way. The label is an integer, either 0 or 1, where 0 is a negative review and 1 is a positive review.

Let's print the first 10 examples.

In [6]:
train_examples_batch, train_labels_batch = next(iter(train_data.batch(10)))
train_examples_batch

<tf.Tensor: shape=(10,), dtype=string, numpy=
array([b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.",
       b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell 

In [7]:
train_labels_batch

<tf.Tensor: shape=(10,), dtype=int64, numpy=array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0])>
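
As a small sketch of how to read these labels, the snippet below counts the classes in the ten labels shown above, using only the Python standard library. The list is copied by hand from the output, so the snippet runs without TensorFlow.

```python
from collections import Counter

# Labels copied from the output above: 0 = negative review, 1 = positive review.
first_ten_labels = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0]

counts = Counter(first_ten_labels)
print("negative:", counts[0], "positive:", counts[1])
```

A batch this small does not have to be balanced, even though the full training set is.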

In [8]:
!pip install tf_keras



# **Build a model**

A neural network is built by stacking layers, which requires three main architectural decisions:

How should the text be represented?

How many layers should the model have?

How many hidden units should each layer have?

In this example, the input data consists of sentences. The labels for prediction are 0 or 1.

One way to represent text is to ***convert sentences into embedding vectors***.

***Using a pre-trained text embedding as the first layer has three advantages:***

You don't have to worry about text preprocessing.

You benefit from transfer learning.

The embedding has a fixed size, so it's simpler to process.

In this example, you use a pre-trained text embedding model from TensorFlow Hub named `google/nnlm-en-dim50/2`.

There are many other pre-trained text embeddings from TFHub that you can use in this tutorial:

`google/nnlm-en-dim128/2` - trained with the same NNLM architecture on the same data as `google/nnlm-en-dim50/2`, but with a larger embedding size. Larger embeddings may improve results on your task, but training your model may take longer.

`google/nnlm-en-dim128-with-normalization/2` - same as `google/nnlm-en-dim128/2`, but with additional text normalization, such as removing punctuation. This can help if the text in your task contains extra characters or punctuation.

`google/universal-sentence-encoder/4` - a much larger model, giving 512-dimensional embeddings trained with a deep averaging network (DAN) encoder.

And many more! Find more text embedding models on TFHub.
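
As a rough sketch of how these alternatives differ, the snippet below pairs each handle with its embedding size. The `https://tfhub.dev/` URL prefix is taken from the `google/nnlm-en-dim50/2` handle used later in this notebook; applying the same pattern to the other handles is an assumption, and `hub_url` is a hypothetical helper, not a `tensorflow_hub` function.

```python
# Embedding sizes as described in the list above.
EMBEDDING_DIMS = {
    "google/nnlm-en-dim50/2": 50,
    "google/nnlm-en-dim128/2": 128,
    "google/nnlm-en-dim128-with-normalization/2": 128,
    "google/universal-sentence-encoder/4": 512,
}

def hub_url(handle):
    # Hypothetical helper for this illustration; the URL pattern is assumed
    # to be the same for every handle.
    return "https://tfhub.dev/" + handle

for handle, dim in EMBEDDING_DIMS.items():
    print(hub_url(handle), "-> output shape (num_examples, %d)" % dim)
```

Swapping the embedding changes the input width of the first `Dense` layer, so the rest of the model does not need to be edited by hand.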

Let's first create a Keras layer that uses a TensorFlow Hub model to embed the sentences, and try it out on a few input examples. Note that regardless of the length of the input text, the output shape of the embeddings is `(num_examples, embedding_dimension)`.

In [9]:
import tensorflow as tf
import tensorflow_hub as hub
import tf_keras as keras

embedding = "https://tfhub.dev/google/nnlm-en-dim50/2"
hub_layer = hub.KerasLayer(embedding, input_shape=[],
                           dtype=tf.string, trainable=True)
hub_layer(train_examples_batch[:3])

<tf.Tensor: shape=(3, 50), dtype=float32, numpy=
array([[ 0.5423194 , -0.01190171,  0.06337537,  0.0686297 , -0.16776839,
        -0.10581177,  0.168653  , -0.04998823, -0.31148052,  0.07910344,
         0.15442258,  0.01488661,  0.03930155,  0.19772716, -0.12215477,
        -0.04120982, -0.27041087, -0.21922147,  0.26517656, -0.80739075,
         0.25833526, -0.31004202,  0.2868321 ,  0.19433866, -0.29036498,
         0.0386285 , -0.78444123, -0.04793238,  0.41102988, -0.36388886,
        -0.58034706,  0.30269453,  0.36308962, -0.15227163, -0.4439151 ,
         0.19462997,  0.19528405,  0.05666233,  0.2890704 , -0.28468323,
        -0.00531206,  0.0571938 , -0.3201319 , -0.04418665, -0.08550781,
        -0.55847436, -0.2333639 , -0.20782956, -0.03543065, -0.17533456],
       [ 0.56338924, -0.12339553, -0.10862677,  0.7753425 , -0.07667087,
        -0.15752274,  0.01872334, -0.08169781, -0.3521876 ,  0.46373403,
        -0.08492758,  0.07166861, -0.00670818,  0.12686071, -0.19326551,
 

Let's now build the full model:

In [10]:
model = keras.Sequential()
model.add(hub_layer)
model.add(keras.layers.Dense(16, activation='relu'))
model.add(keras.layers.Dense(1))

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 keras_layer (KerasLayer)    (None, 50)                48190600  
                                                                 
 dense (Dense)               (None, 16)                816       
                                                                 
 dense_1 (Dense)             (None, 1)                 17        
                                                                 
Total params: 48191433 (183.84 MB)
Trainable params: 48191433 (183.84 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
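
The parameter counts in the summary can be reproduced by hand. This is a minimal sketch in plain Python; `dense_params` is a hypothetical helper for this illustration, and the embedding count is copied from the summary rather than derived.

```python
# Parameters of a fully connected layer: one weight per input-output pair,
# plus one bias per output unit.
def dense_params(inputs, units):
    return inputs * units + units

embedding_params = 48190600            # KerasLayer row, copied from the summary
hidden_params = dense_params(50, 16)   # 50-dim embedding -> 16 hidden units
output_params = dense_params(16, 1)    # 16 hidden units -> 1 output logit

total = embedding_params + hidden_params + output_params
print(hidden_params, output_params, total)
```

Nearly all parameters sit in the embedding layer, and because the layer was created with `trainable=True`, all of them are updated during training.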


The layers are stacked sequentially to build the classifier:


1. The first layer is the **TensorFlow Hub layer**. This layer uses a pre-trained SavedModel to map a sentence to its embedding vector. The pre-trained text embedding model you use (`google/nnlm-en-dim50/2`) ***splits the sentence into tokens, embeds each token, and then combines the embeddings.*** The resulting shape is `(num_examples, embedding_dimension)`. For this NNLM model, `embedding_dimension` is 50.

2. This fixed-length output vector is passed through a fully connected (`Dense`) layer with 16 hidden units.

3. The last layer is densely connected to a single output node.
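
To make the first step more concrete, here is a toy sketch of the tokenize, embed, and combine idea in plain Python. Everything in it is invented for illustration: the three-word vocabulary, the 3-dimensional vectors, the zero vector for unknown words, and the choice of averaging as the combiner. The real NNLM model may tokenize and combine differently.

```python
# Toy stand-in for the token -> vector lookup. The vocabulary and vectors are
# made up; the real model uses learned 50-dimensional vectors.
TOY_VECTORS = {
    "great": [0.9, 0.1, 0.0],
    "terrible": [-0.8, 0.2, 0.1],
    "movie": [0.0, 0.5, 0.3],
}
UNKNOWN = [0.0, 0.0, 0.0]  # assumed fallback for out-of-vocabulary tokens

def embed_sentence(sentence):
    tokens = sentence.lower().split()                         # 1. split into tokens
    vectors = [TOY_VECTORS.get(t, UNKNOWN) for t in tokens]   # 2. embed each token
    if not vectors:
        return list(UNKNOWN)
    # 3. combine: component-wise average (an assumed combiner)
    return [sum(component) / len(vectors) for component in zip(*vectors)]

batch = ["great movie", "terrible terrible movie that was terrible"]
embeddings = [embed_sentence(s) for s in batch]

# Shape is (num_examples, embedding_dimension), whatever the sentence lengths.
print(len(embeddings), len(embeddings[0]), len(embeddings[1]))
```

Because the combined vector has the same length for every sentence, the `Dense` layers that follow can treat reviews of any length uniformly.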

Let us compile the model.

***Loss function and optimizer***

The model needs a ***loss function and an optimizer for training***. Since this is a binary classification problem and the model outputs logits (a single-unit layer with a linear activation), you will use the `binary_crossentropy` loss function.

This is not the only choice for the loss function; you could, for example, choose `mean_squared_error`. In general, though, `binary_crossentropy` is better suited to working with probabilities: it measures the "distance" between probability distributions or, in our case, between the true distribution and the predictions.

Later, when you study regression problems (say, predicting the price of a house), you'll see how to use another loss function called mean squared error.
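
To illustrate what the loss computes, here is a minimal sketch of binary cross-entropy evaluated directly on a logit, next to squared error on the corresponding probability. It uses only the standard library, and both function names are hypothetical. It is a hand-written illustration of the formula, not the Keras implementation.

```python
import math

def bce_from_logit(logit, label):
    # Numerically stable form of -[y*log(p) + (1-y)*log(1-p)] with
    # p = sigmoid(logit): max(z, 0) - z*y + log(1 + exp(-|z|)).
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def squared_error(logit, label):
    # For comparison: squared error on the predicted probability.
    p = 1.0 / (1.0 + math.exp(-logit))
    return (p - label) ** 2

# True label is 1 (positive); try a wrong, an undecided and a right logit.
for logit in (-4.0, 0.0, 4.0):
    print(logit, round(bce_from_logit(logit, 1), 4), round(squared_error(logit, 1), 4))
```

For a confidently wrong prediction the cross-entropy keeps growing with the size of the logit, whereas the squared error on a probability can never exceed 1.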

Now set up your model to use the optimizer and the loss function:


In [11]:
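# Take the loss from tf_keras (imported above as `keras`) so that it matches
# the tf_keras model. from_logits=True because the last layer has no sigmoid.
model.compile(optimizer='adam',
              loss=keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])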

# ***Train the model***
Train the model for 10 epochs in mini-batches of 512 samples. This is 10 iterations over all samples in the training set. While training, monitor the model's loss and accuracy on the 10,000 samples from the validation set:


In [12]:
history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=10,
                    validation_data=validation_data.batch(512),
                    verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
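
The number of steps in each epoch follows from the dataset size and the batch size. The following is a small sketch of that arithmetic in plain Python. `steps_per_epoch` is a hypothetical helper, and it assumes the final, smaller batch is kept rather than dropped.

```python
import math

def steps_per_epoch(num_examples, batch_size):
    # The last batch may be smaller than batch_size, so round up.
    return math.ceil(num_examples / batch_size)

batch_size = 512
print("training steps per epoch:", steps_per_epoch(15000, batch_size))
print("validation steps per epoch:", steps_per_epoch(10000, batch_size))
```

By the same arithmetic, the 25,000 test examples form 49 batches of up to 512 samples.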


# ***Evaluate the model***


Now let's see how the model performs. Two values are returned: the loss (a number that represents the error; lower values are better) and the accuracy.

In [13]:
results = model.evaluate(test_data.batch(512), verbose=2)

for name, value in zip(model.metrics_names, results):
  print("%s: %.3f" % (name, value))

49/49 - 2s - loss: 0.3368 - accuracy: 0.8542 - 2s/epoch - 36ms/step
loss: 0.337
accuracy: 0.854
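
Since the model outputs logits rather than probabilities, a natural follow-up is how to turn them into class predictions. The sketch below shows one common approach in plain Python: apply a sigmoid and threshold at 0.5. The logits and labels are made up, the function names are hypothetical, and this is not necessarily how Keras computes its `accuracy` metric internally.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def accuracy_from_logits(logits, labels):
    # Predict positive when the probability is at least 0.5,
    # which is the same as the logit being at least 0.
    predictions = [1 if sigmoid(z) >= 0.5 else 0 for z in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

# Made-up logits and labels, for illustration only.
toy_logits = [2.3, -1.1, 0.4, -3.0]
toy_labels = [1, 0, 0, 0]
print(accuracy_from_logits(toy_logits, toy_labels))
```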
