<br>

<div align=center><font color=maroon size=6><b>Text classification with TensorFlow Hub: Movie reviews</b></font></div>

<br>

<font size=4><b>References:</b></font>
1. TF2 official tutorials: <a href="https://www.tensorflow.org/tutorials" style="text-decoration:none;">TensorFlow Tutorials</a> 
    * `TensorFlow > Learn > TensorFlow Core > Tutorials` > <a href="https://www.tensorflow.org/tutorials/keras/text_classification_with_hub" style="text-decoration:none;">Text classification with TensorFlow Hub: Movie reviews</a>
        * Run in <a href="https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/keras/text_classification_with_hub.ipynb" style="text-decoration:none;">Google Colab</a>

<br>
<br>
<br>

This notebook classifies movie reviews as *positive* or *negative* using the text of the review. This is an example of *binary*—or two-class—classification, an important and widely applicable kind of machine learning problem.

The tutorial demonstrates the basic application of transfer learning with [TensorFlow Hub](https://tfhub.dev) and Keras.

It uses the [IMDB dataset](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb) that contains the text of 50,000 movie reviews from the [Internet Movie Database](https://www.imdb.com/). These are split into 25,000 reviews for training and 25,000 reviews for testing. The training and testing sets are *balanced*, meaning they contain an equal number of positive and negative reviews. 

This notebook uses [`tf.keras`](https://www.tensorflow.org/guide/keras), a high-level API to build and train models in TensorFlow, and [`tensorflow_hub`](https://www.tensorflow.org/hub), a library for loading trained models from [TFHub](https://tfhub.dev) in a single line of code. For a more advanced text classification tutorial using `tf.keras`, see the [MLCC Text Classification Guide](https://developers.google.com/machine-learning/guides/text-classification/).

In [None]:
# !pip install tensorflow-hub
# !pip install tensorflow-datasets

In [1]:
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

import os
import numpy as np



In [2]:
print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("Hub version: ", hub.__version__)
print("GPU is", "available" if tf.config.list_physical_devices("GPU") else "NOT AVAILABLE")

Version:  2.5.0
Eager mode:  True
Hub version:  0.12.0
GPU is available


<br>
<br>
<br>

## Download the IMDB dataset

The IMDB dataset is available as [imdb_reviews](https://www.tensorflow.org/datasets/catalog/imdb_reviews) in [TensorFlow Datasets](https://www.tensorflow.org/datasets). The following code downloads the IMDB dataset to your machine (or the Colab runtime):

In [3]:
# Split the training set into 60% and 40% to end up with 15,000 examples
# for training, 10,000 examples for validation and 25,000 examples for testing.
train_data, validation_data, test_data = tfds.load(name="imdb_reviews",
                                                   split=('train[:60%]', 'train[60%:]', 'test'),
                                                   as_supervised=True)

# Dataset imdb_reviews downloaded and prepared to C:\Users\18617\tensorflow_datasets\imdb_reviews\plain_text\1.0.0. 
# Subsequent calls will reuse this data.

Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to C:\Users\18617\tensorflow_datasets\imdb_reviews\plain_text\1.0.0...

Dataset imdb_reviews downloaded and prepared to C:\Users\18617\tensorflow_datasets\imdb_reviews\plain_text\1.0.0. Subsequent calls will reuse this data.


<br>
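
The split sizes quoted in the download cell's comment can be checked with plain arithmetic. This is a minimal sketch, assuming the 25,000-example training set described in the introduction; `split_sizes` is a hypothetical helper written for illustration and does not call `tensorflow_datasets`.

```python
def split_sizes(total, train_percent):
    """Return (train, validation) sizes for a percentage slice such as 'train[:60%]'."""
    # Integer arithmetic keeps the result exact for whole percentages.
    train = total * train_percent // 100
    return train, total - train


train_size, validation_size = split_sizes(25_000, 60)
print(train_size, validation_size)  # 15000 10000
```

The remaining 25,000 test reviews are left untouched, so they are only used for the final evaluation.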
<br>
<br>

## Explore the data 

Let's take a moment to understand the format of the data. Each example is a sentence representing the movie review and a corresponding label. The sentence is not preprocessed in any way. The label is an integer value of either 0 or 1, where 0 is a negative review, and 1 is a positive review.

Let's print the first 10 examples.

In [4]:
train_examples_batch, train_labels_batch = next(iter(train_data.batch(10)))
train_examples_batch

<tf.Tensor: shape=(10,), dtype=string, numpy=
array([b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.",
       b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell 

<br>

Let's also print the first 10 labels.

In [5]:
train_labels_batch

<tf.Tensor: shape=(10,), dtype=int64, numpy=array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0], dtype=int64)>
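
To make the label encoding concrete, the short sketch below counts the classes in the ten labels printed above. It uses plain Python only; the `LABEL_NAMES` mapping is a hypothetical display helper, since the dataset itself stores just the integers.

```python
from collections import Counter

# The ten labels shown in the output above.
labels = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0]

# Hypothetical mapping used only for display: 0 is negative, 1 is positive.
LABEL_NAMES = {0: "negative", 1: "positive"}

counts = Counter(LABEL_NAMES[label] for label in labels)
print(counts["negative"], counts["positive"])  # 7 3
```

A batch of ten examples does not have to be balanced, even though the full training set is.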

<br>
<br>
<br>

## Build the model

The neural network is created by stacking layers—this requires three main architectural decisions:

* How to represent the text?
* How many layers to use in the model?
* How many *hidden units* to use for each layer?

In this example, the input data consists of sentences. The labels to predict are either 0 or 1.

One way to represent the text is to convert sentences into embedding vectors. Use a pre-trained text embedding as the first layer, which has three advantages:

*   You don't have to worry about text preprocessing.
*   You benefit from transfer learning.
*   The embedding has a fixed size, so it's simpler to process.

For this example you use a **pre-trained text embedding model** from [TensorFlow Hub](https://tfhub.dev) called [google/nnlm-en-dim50/2](https://tfhub.dev/google/nnlm-en-dim50/2).

There are many other pre-trained text embeddings from TFHub that can be used in this tutorial:

* [google/nnlm-en-dim128/2](https://tfhub.dev/google/nnlm-en-dim128/2) - trained with the same NNLM architecture on the same data as [google/nnlm-en-dim50/2](https://tfhub.dev/google/nnlm-en-dim50/2), but with a larger embedding dimension. Higher-dimensional embeddings can improve results on your task, but the model may take longer to train.
* [google/nnlm-en-dim128-with-normalization/2](https://tfhub.dev/google/nnlm-en-dim128-with-normalization/2) - the same as [google/nnlm-en-dim128/2](https://tfhub.dev/google/nnlm-en-dim128/2), but with additional text normalization such as removing punctuation. This can help if the text in your task contains additional characters or punctuation.
* [google/universal-sentence-encoder/4](https://tfhub.dev/google/universal-sentence-encoder/4) - a much larger model yielding 512-dimensional embeddings trained with a deep averaging network (DAN) encoder.

And many more! Find more [text embedding models](https://tfhub.dev/s?module-type=text-embedding) on TFHub.

<br>

In [6]:
embedding = "https://tfhub.dev/google/nnlm-en-dim50/2"
hub_layer = hub.KerasLayer(embedding,
                           input_shape=[],
                           dtype=tf.string,
                           trainable=True)


hub_layer(train_examples_batch[:3])

<tf.Tensor: shape=(3, 50), dtype=float32, numpy=
array([[ 0.5423195 , -0.0119017 ,  0.06337538,  0.06862972, -0.16776837,
        -0.10581174,  0.16865303, -0.04998824, -0.31148055,  0.07910346,
         0.15442263,  0.01488662,  0.03930153,  0.19772711, -0.12215476,
        -0.04120981, -0.2704109 , -0.21922152,  0.26517662, -0.80739075,
         0.25833532, -0.3100421 ,  0.28683215,  0.1943387 , -0.29036492,
         0.03862849, -0.7844411 , -0.0479324 ,  0.4110299 , -0.36388892,
        -0.58034706,  0.30269456,  0.3630897 , -0.15227164, -0.44391504,
         0.19462997,  0.19528408,  0.05666234,  0.2890704 , -0.28468323,
        -0.00531206,  0.0571938 , -0.3201318 , -0.04418665, -0.08550783,
        -0.55847436, -0.23336391, -0.20782952, -0.03543064, -0.17533456],
       [ 0.56338924, -0.12339553, -0.10862679,  0.7753425 , -0.07667089,
        -0.15752277,  0.01872335, -0.08169781, -0.3521876 ,  0.4637341 ,
        -0.08492756,  0.07166859, -0.00670817,  0.12686075, -0.19326553,
 

<br>

Let's now build the full model:

In [7]:
model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1))

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
keras_layer (KerasLayer)     (None, 50)                48190600  
_________________________________________________________________
dense (Dense)                (None, 16)                816       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 48,191,433
Trainable params: 48,191,433
Non-trainable params: 0
_________________________________________________________________
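
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the usual fully connected layout (one weight per input-unit pair plus one bias per unit); `dense_params` is a hypothetical helper, and the embedding count is copied from the summary rather than derived.

```python
def dense_params(inputs, units):
    """Weights plus one bias per unit for a fully connected layer."""
    return inputs * units + units


embedding_params = 48_190_600        # copied from the summary above
hidden_params = dense_params(50, 16)  # 50-dim embedding -> 16 hidden units
output_params = dense_params(16, 1)   # 16 hidden units -> 1 output

total = embedding_params + hidden_params + output_params
print(hidden_params, output_params, total)  # 816 17 48191433
```

Almost all of the parameters sit in the embedding layer, which is why `trainable=True` makes every one of them show up under "Trainable params".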


<br>

The layers are stacked sequentially to build the classifier:

1. The first layer is a TensorFlow Hub layer. This layer uses a pre-trained SavedModel to map a sentence into its embedding vector. The pre-trained text embedding model that you are using ([google/nnlm-en-dim50/2](https://tfhub.dev/google/nnlm-en-dim50/2)) splits the sentence into tokens, embeds each token and then combines the embeddings. The resulting dimensions are: `(num_examples, embedding_dimension)`. For this NNLM model, the `embedding_dimension` is 50.
2. This fixed-length output vector is piped through a fully-connected (`Dense`) layer with 16 hidden units.
3. The last layer is densely connected with a single output node.
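
The "split, embed, combine" idea in step 1 can be illustrated with a toy example. This is only a sketch under stated assumptions: the three 4-dimensional vectors are made up, whitespace tokenization and averaging are stand-ins chosen for simplicity, and the real model's vocabulary, tokenizer and combining rule may differ.

```python
# Hypothetical 4-dimensional token vectors; the real model uses 50 dimensions
# and a learned vocabulary.
TOY_VECTORS = {
    "great": [0.9, 0.1, 0.0, 0.2],
    "movie": [0.1, 0.8, 0.3, 0.0],
    "terrible": [-0.9, 0.1, 0.1, 0.2],
}
UNKNOWN = [0.0, 0.0, 0.0, 0.0]  # used for tokens outside the toy vocabulary


def embed_sentence(sentence):
    """Split on whitespace, look up each token, and average the vectors."""
    tokens = sentence.lower().split()
    vectors = [TOY_VECTORS.get(token, UNKNOWN) for token in tokens]
    if not vectors:
        return list(UNKNOWN)
    # Average each dimension across all tokens (an assumed combining rule).
    return [sum(column) / len(vectors) for column in zip(*vectors)]


short = embed_sentence("great movie")
long = embed_sentence("a truly terrible terrible movie")
print(len(short), len(long))  # 4 4
```

Whatever the sentence length, the output has the same size, which is what lets a fixed-width `Dense` layer follow the embedding.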

Let's compile the model.

In [8]:
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer='adam',
              metrics=['accuracy'])
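
Because the last layer has no activation, the model outputs a raw score (a logit), and `from_logits=True` tells the loss to apply the sigmoid itself. The sketch below shows the underlying math for a single example in plain Python. It is an illustration of the formula, not the Keras implementation, and the function names are hypothetical.

```python
import math


def sigmoid(logit):
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def bce_from_logit(label, logit):
    """Binary cross-entropy for one example, computed directly from the logit.

    Uses the numerically stable form max(z, 0) - z * y + log(1 + exp(-|z|)),
    which equals -[y * log(p) + (1 - y) * log(1 - p)] with p = sigmoid(z).
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


# A logit of 0 means "no opinion" (probability 0.5), so the loss is log(2).
print(round(bce_from_logit(1, 0.0), 4))  # 0.6931
```

A confident wrong prediction is penalized much more heavily than a confident correct one, which is the behavior the optimizer exploits during training.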

<br>
<br>
<br>

## Train the model

Train the model for 10 epochs in mini-batches of 512 samples. This is 10 iterations over all samples in the `train_data` dataset. While training, monitor the model's loss and accuracy on the 10,000 samples from the validation set:

In [10]:
history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=10,
                    validation_data=validation_data.batch(512),
                    verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
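
The number of steps per epoch follows from the dataset size and the batch size. The sketch below assumes the split sizes used in this notebook and that the final, smaller batch is kept, which is the default for `Dataset.batch`; `steps_per_epoch` is a hypothetical helper.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Number of batches needed to cover every example once."""
    return math.ceil(num_examples / batch_size)


print(steps_per_epoch(15_000, 512))  # 30 training steps
print(steps_per_epoch(10_000, 512))  # 20 validation steps
print(steps_per_epoch(25_000, 512))  # 49 test steps
```

The last value matches the `49/49` progress counter printed when the model is evaluated on the test set.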


<br>
<br>
<br>

## Evaluate the model

Now let's see how the model performs. Two values are returned: loss (a number that represents the error; lower values are better) and accuracy.

In [11]:
results = model.evaluate(test_data.batch(512), verbose=2)

for name, value in zip(model.metrics_names, results):
    print("%s: %.3f" % (name, value))

49/49 - 1s - loss: 0.3545 - accuracy: 0.8560
loss: 0.355
accuracy: 0.856
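
Since the model outputs logits rather than probabilities, one common way to turn a prediction into a class label is to threshold the logit at 0, which is equivalent to thresholding the sigmoid probability at 0.5. The sketch below uses made-up logits and labels; it illustrates the idea and is not a description of how the Keras `accuracy` metric is computed internally.

```python
def predict_label(logit):
    """Return 1 (positive) or 0 (negative); sigmoid(z) >= 0.5 exactly when z >= 0."""
    return 1 if logit >= 0.0 else 0


def accuracy(labels, logits):
    """Fraction of examples whose thresholded logit matches the label."""
    correct = sum(predict_label(z) == y for y, z in zip(labels, logits))
    return correct / len(labels)


# Hypothetical model outputs and true labels, for illustration only.
example_logits = [-2.3, 0.4, 1.7, -0.1]
example_labels = [0, 1, 0, 0]
print(accuracy(example_labels, example_logits))  # 0.75
```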


<br>

This fairly naive approach achieves an accuracy of about 86%. With more advanced approaches, the model should get closer to 95%.

<br>
<br>
<br>

## Further reading

* For a more general way to work with string inputs and for a more detailed analysis of the progress of accuracy and loss during training, see the [Text classification with preprocessed text](./text_classification.ipynb) tutorial.
* Try out more [text-related tutorials](https://www.tensorflow.org/hub/tutorials#text-related-tutorials) using trained models from TFHub.

<br>
<br>
<br>

```python
# MIT License
#
# Copyright (c) 2017 François Chollet
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.
```

<br>
<br>
<br>