# Text Classification with Movie Reviews

This notebook classifies movie reviews as *positive* or *negative* using the text of the review.
* This is an example of *binary*—or two-class—classification, an important and widely applicable kind of machine learning problem.

* We'll use the [IMDB dataset](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb) that contains the text of 50,000 movie reviews from the [Internet Movie Database](https://www.imdb.com/).
* These are split into 25,000 reviews for training and 25,000 reviews for testing. The training and testing sets are *balanced*, meaning they contain an equal number of positive and negative reviews.

This notebook uses [tf.keras](https://www.tensorflow.org/api_docs/python/tf/keras), a high-level API to build and train models in TensorFlow, and [TensorFlow Hub](https://www.tensorflow.org/hub), a library and platform for transfer learning. For a more advanced text classification tutorial using `tf.keras`, see the [MLCC Text Classification Guide](https://developers.google.com/machine-learning/guides/text-classification/).

## Setup

In [2]:
import numpy as np

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("Hub version: ", hub.__version__)

Version:  2.17.0
Eager mode:  True
Hub version:  0.16.1


## Download the IMDB dataset

The IMDB dataset is available on [TensorFlow Datasets](https://github.com/tensorflow/datasets). The following code downloads the IMDB dataset to your machine (or the Colab runtime):

In [3]:
train_data, test_data = tfds.load(name="imdb_reviews", split=["train", "test"],
                                  batch_size=-1, as_supervised=True)

train_examples, train_labels = tfds.as_numpy(train_data)
test_examples, test_labels = tfds.as_numpy(test_data)

Downloading and preparing dataset 80.23 MiB (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...



Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


In [7]:
print("train examples shape", train_examples.shape)
print("train labels shape", train_labels.shape)
print("test examples shape", test_examples.shape)
print("test labels shape", test_labels.shape)

train examples shape (25000,)
train labels shape (25000,)
test examples shape (25000,)
test labels shape (25000,)


In [8]:
unique, counts = np.unique(train_labels, return_counts=True)
result = np.column_stack((unique, counts))
print(result)


[[    0 12500]
 [    1 12500]]


In [10]:
counts

array([12500, 12500])

In [9]:
unique

array([0, 1])
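Because the classes are balanced, a classifier that always predicts the same class is correct about half the time, which gives a useful floor for judging later results. The following is a minimal sketch of that check; it uses a stand-in `labels` array with the same 12,500/12,500 split rather than the downloaded data, and the variable names are illustrative.

```python
import numpy as np

# Stand-in labels with the same 50/50 balance as the IMDB training split
# (the real `train_labels` array could be used instead).
labels = np.array([0] * 12500 + [1] * 12500)

# Count the examples per class and convert the counts to fractions.
classes, counts = np.unique(labels, return_counts=True)
fractions = counts / counts.sum()

# A classifier that always predicts the most frequent class is right
# exactly as often as that class occurs.
majority_class = classes[np.argmax(counts)]
baseline_accuracy = np.mean(labels == majority_class)

print(dict(zip(classes.tolist(), fractions.tolist())))
print("majority-class baseline accuracy:", baseline_accuracy)
```

Any trained model should score clearly above this baseline on the test set.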

## Explore the data

Let's take a moment to understand the format of the data. Each example is the raw text of a movie review, stored as a byte string, with a corresponding label. The text is not preprocessed in any way. The label is an integer value of either 0 or 1, where 0 is a negative review, and 1 is a positive review.

In [11]:
print("Training entries: {}, test entries: {}".format(len(train_examples), len(test_examples)))

Training entries: 25000, test entries: 25000


Let's print the first 10 examples.

In [12]:
train_examples[:10]

array([b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.",
       b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbish. The plot 

Let's also print the first 10 labels.

In [4]:
train_labels[:10]

array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0])
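The examples print with a `b` prefix because they are Python `bytes` objects, and the labels are bare integers. As a small sketch of making them easier to read, the helper below decodes a review and attaches a label name. The `LABEL_NAMES` mapping and the `describe` function are hypothetical helpers, and the two reviews are made up for illustration.

```python
import numpy as np

# Two made-up reviews in the same format as the dataset: raw bytes, 0/1 labels.
examples = np.array([b"I could barely sit through it.",
                     b"A wonderful, moving film."], dtype=object)
labels = np.array([0, 1])

# Hypothetical mapping from label id to a readable name.
LABEL_NAMES = {0: "negative", 1: "positive"}

def describe(example: bytes, label: int, width: int = 40) -> str:
    """Decode one review and format it with its label name."""
    text = example.decode("utf-8")
    preview = text if len(text) <= width else text[:width] + "..."
    return f"[{LABEL_NAMES[int(label)]}] {preview}"

for example, label in zip(examples, labels):
    print(describe(example, label))
```

The same helper could be applied to `train_examples` and `train_labels`, assuming the reviews are UTF-8 encoded.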

## Build the model

We will build a transfer-learning model for this task.

For this, we need a pre-trained model that serves as a feature generator, followed by a linear classifier from sklearn. The pipeline below uses `SGDClassifier`, whose default hinge loss trains a linear SVM; passing `loss="log_loss"` would train a logistic regression model instead. You should explore other sklearn classification models for this task too.

In this example, the input data consists of sentences. The labels to predict are either 0 or 1.

One way to represent the text is to convert sentences into embedding vectors.
* For this example we will use a model from [TensorFlow Hub](https://www.tensorflow.org/hub) called [google/nnlm-en-dim50/2](https://tfhub.dev/google/nnlm-en-dim50/2).

There are two other models to test for the sake of this tutorial:
* [google/nnlm-en-dim50-with-normalization/2](https://tfhub.dev/google/nnlm-en-dim50-with-normalization/2) - same as [google/nnlm-en-dim50/2](https://tfhub.dev/google/nnlm-en-dim50/2), but with additional text normalization to remove punctuation. This can help to get better coverage of in-vocabulary embeddings for tokens on your input text.
* [google/nnlm-en-dim128-with-normalization/2](https://tfhub.dev/google/nnlm-en-dim128-with-normalization/2) - A larger model with an embedding dimension of 128 instead of the smaller 50.
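To see why removing punctuation can improve in-vocabulary coverage, consider the toy sketch below. It is not the normalization the Hub models perform: the vocabulary, the whitespace tokenizer, the lowercasing, and the regular expression are all simplifying assumptions chosen for illustration.

```python
import re

# Hypothetical toy vocabulary; the real models ship their own, much larger one.
VOCAB = {"this", "movie", "was", "great", "terrible"}

def tokenize(text: str) -> list[str]:
    """Split on whitespace only, leaving punctuation attached to words."""
    return text.lower().split()

def normalize(text: str) -> str:
    """Replace punctuation with spaces (an assumed, simplified normalization)."""
    return re.sub(r"[^\w\s]", " ", text)

def coverage(tokens: list[str]) -> float:
    """Fraction of tokens found in the vocabulary."""
    return sum(token in VOCAB for token in tokens) / len(tokens)

review = "This movie was great!!! (Not terrible.)"
raw_tokens = tokenize(review)
clean_tokens = tokenize(normalize(review))

print("raw:       ", raw_tokens, coverage(raw_tokens))
print("normalized:", clean_tokens, coverage(clean_tokens))
```

Without normalization, tokens such as `great!!!` and `terrible.)` miss the vocabulary; after normalization they match, so more of the review contributes a meaningful embedding.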

Let's first create a Keras layer that uses a TensorFlow Hub model to embed the sentences, and try it out on a few input examples. The cell below loads the 128-dimensional model, so the output shape of the produced embeddings is, as expected, `(num_examples, embedding_dimension)`, here `(3, 128)`.

In [5]:
model = "https://tfhub.dev/google/nnlm-en-dim128-with-normalization/2"
hub_layer = hub.KerasLayer(model, input_shape=[], dtype=tf.string,
                           trainable=False)
hub_layer(train_examples[:3])

<tf.Tensor: shape=(3, 128), dtype=float32, numpy=
array([[ 1.15015078e+00,  7.80129954e-02,  9.26615447e-02,
         2.83361465e-01,  9.67164431e-03, -1.49186030e-01,
         3.35665703e-01, -3.50244790e-01, -8.28830525e-03,
        -1.87713988e-02, -3.33069712e-02, -6.33094192e-01,
        -3.75421166e-01, -2.77732819e-01, -9.66175571e-02,
         1.72553658e-01, -1.33676559e-01,  3.80765833e-02,
        -2.75138170e-01,  4.94762301e-01,  3.93051691e-02,
         1.34496242e-01, -2.70728201e-01,  1.78942848e-02,
        -2.41071597e-01,  2.71089897e-02,  1.02333426e-01,
        -1.06628530e-01,  5.24298586e-02,  1.19170524e-01,
        -6.67077769e-03,  3.39231491e-01,  1.13014966e-01,
         1.06842607e-01,  3.91571254e-01, -1.89536318e-01,
        -1.74000308e-01, -1.06444173e-01, -1.34200469e-01,
         1.73583925e-01, -2.77695030e-01, -4.33591381e-02,
        -3.91500629e-02, -1.98340908e-01,  2.74854768e-02,
         2.76703000e-01,  1.40702859e-01, -3.14256102e-01,
      

Let's now build the full model:

First we need a transformer class that wraps the TensorFlow Hub layer so it can be used as a step in an sklearn pipeline.

In [21]:
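from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

class TextFeatureExtractor(BaseEstimator, TransformerMixin):
    """Pipeline step that maps raw review text to fixed-length embeddings."""

    def __init__(self):
        # Alternative models to try:
        # self.model_name = "https://tfhub.dev/google/nnlm-en-dim50/2"
        # self.model_name = "https://tfhub.dev/google/nnlm-en-dim128-with-normalization/2"
        self.model_name = "https://tfhub.dev/google/nnlm-en-dim50-with-normalization/2"
        # Frozen (non-trainable) pre-trained embedding layer.
        self.hub_layer = hub.KerasLayer(self.model_name, input_shape=[], dtype=tf.string,
                                        trainable=False)

    def fit(self, X, y=None):
        # Nothing to learn: the embedding model is pre-trained and frozen.
        return self

    def transform(self, X):
        # Return a NumPy array of shape (num_examples, embedding_dimension),
        # the array type sklearn estimators expect.
        return self.hub_layer(X).numpy()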
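The `transform` method above embeds every example in a single call. For larger inputs it may be preferable to embed in batches to limit peak memory. The sketch below shows one way to do that. `transform_in_batches` is a hypothetical helper, and `toy_embed` is a stand-in for the Hub layer so the example runs without downloading a model.

```python
import numpy as np

def transform_in_batches(embed_fn, X, batch_size=1024):
    """Apply embed_fn to X in chunks and stack the results row-wise.

    Assumes X is non-empty and embed_fn returns something array-like
    of shape (len(batch), embedding_dimension).
    """
    chunks = [np.asarray(embed_fn(X[start:start + batch_size]))
              for start in range(0, len(X), batch_size)]
    return np.concatenate(chunks, axis=0)

def toy_embed(batch):
    """Stand-in embedding: maps each text to [byte length, word count]."""
    return np.array([[len(x), len(x.split())] for x in batch], dtype=np.float32)

texts = np.array([b"a b c", b"hello world", b"one", b"two words", b"x y z w"],
                 dtype=object)
features = transform_in_batches(toy_embed, texts, batch_size=2)
print(features.shape)  # (5, 2)
```

Inside `TextFeatureExtractor.transform`, the same helper could be called with `self.hub_layer` as `embed_fn`. The batch size of 1024 is an arbitrary default, not a tuned value.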

In [22]:
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

clf = make_pipeline(TextFeatureExtractor(),
                    SGDClassifier(max_iter=1000, tol=1e-3))
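Since the text suggests exploring other sklearn classifiers, here is a hedged sketch of how the final pipeline step can be swapped. To keep it runnable without TensorFlow Hub, it uses `ToyFeatureExtractor`, a hypothetical stand-in that counts two cue words, and a tiny made-up dataset. The scores it prints say nothing about performance on IMDB.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.pipeline import make_pipeline

class ToyFeatureExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for TextFeatureExtractor: counts two cue words."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        texts = [x.decode("utf-8").lower() for x in X]
        return np.array([[t.count("great"), t.count("terrible")] for t in texts],
                        dtype=float)

# Tiny made-up dataset in the same bytes / 0-1 format as the IMDB arrays.
X = np.array([b"great film", b"terrible film", b"great great cast",
              b"terrible plot, terrible acting"], dtype=object)
y = np.array([1, 0, 1, 0])

# Only the last pipeline step changes between candidates.
candidates = {
    "linear SVM (SGD, hinge loss)": SGDClassifier(max_iter=1000, tol=1e-3,
                                                  random_state=0),
    "logistic regression": LogisticRegression(),
}
for name, classifier in candidates.items():
    pipeline = make_pipeline(ToyFeatureExtractor(), classifier)
    pipeline.fit(X, y)
    print(name, "training accuracy:", pipeline.score(X, y))
```

With the real data, `ToyFeatureExtractor()` would be replaced by `TextFeatureExtractor()` and the candidates compared on the test set rather than on training accuracy.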


## Model training

Let's train the model:

In [23]:
clf.fit(train_examples, train_labels)

The pipeline runs two steps in sequence to build the classifier:

1. The first step is a TensorFlow Hub layer. This layer uses a pre-trained SavedModel to map a sentence into its embedding vector. The model used in `TextFeatureExtractor` ([google/nnlm-en-dim50-with-normalization/2](https://tfhub.dev/google/nnlm-en-dim50-with-normalization/2)) splits the sentence into tokens, embeds each token, and then combines the embeddings. The resulting dimensions are: `(num_examples, embedding_dimension)`.
2. This fixed-length output vector is passed to a linear classifier (`SGDClassifier`, which with its default hinge loss is a linear SVM rather than a logistic regression model).
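The first step can be pictured with a toy version of "split into tokens, embed each token, combine". The sketch below is an illustration only: the embedding table is random, the vocabulary is made up, and averaging is an assumed combiner that may differ from what the Hub model does internally.

```python
import numpy as np

EMBEDDING_DIM = 4  # toy size; the models above use 50 or 128

# Hypothetical embedding table: one random vector per known token, plus one
# shared vector for out-of-vocabulary tokens.
rng = np.random.default_rng(0)
VOCAB = ["this", "movie", "was", "great", "terrible"]
TABLE = {token: rng.normal(size=EMBEDDING_DIM) for token in VOCAB}
OOV_VECTOR = rng.normal(size=EMBEDDING_DIM)

def embed_sentence(sentence: str) -> np.ndarray:
    """Split into tokens, look up each token, and average the vectors."""
    tokens = sentence.lower().split()
    if not tokens:
        return np.zeros(EMBEDDING_DIM)
    vectors = [TABLE.get(token, OOV_VECTOR) for token in tokens]
    return np.mean(vectors, axis=0)

def embed(sentences) -> np.ndarray:
    """Stack sentence vectors into shape (num_examples, embedding_dimension)."""
    return np.stack([embed_sentence(s) for s in sentences])

features = embed(["this movie was great",
                  "terrible",
                  "this was a very long and terrible movie"])
print(features.shape)  # (3, 4)
```

Whatever the length of the input text, each row has the same width, which is what lets a standard classifier consume the output directly.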

## Evaluate the model

Now let's see how the model performs. `classification_report` prints precision, recall, and F1-score for each class, the number of test examples per class (support), and the overall accuracy. Higher values are better for all of these metrics.

In [24]:
from sklearn.metrics import classification_report

print(classification_report(test_labels, clf.predict(test_examples)))

              precision    recall  f1-score   support

           0       0.82      0.65      0.72     12500
           1       0.71      0.85      0.77     12500

    accuracy                           0.75     25000
   macro avg       0.76      0.75      0.75     25000
weighted avg       0.76      0.75      0.75     25000
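To make the columns of the report concrete, the sketch below computes precision, recall, F1-score, and support by hand from true and predicted labels. The labels are made up for illustration and are unrelated to the results above; `per_class_metrics` is a hypothetical helper.

```python
import numpy as np

# Made-up labels and predictions, only to illustrate the arithmetic.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1, 1, 0])

def per_class_metrics(y_true, y_pred, cls):
    """Return (precision, recall, f1, support) for one class.

    Assumes the class is both present and predicted at least once,
    so no denominator is zero.
    """
    tp = np.sum((y_pred == cls) & (y_true == cls))  # correctly predicted as cls
    fp = np.sum((y_pred == cls) & (y_true != cls))  # wrongly predicted as cls
    fn = np.sum((y_pred != cls) & (y_true == cls))  # missed examples of cls
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, tp + fn

for cls in (0, 1):
    precision, recall, f1, support = per_class_metrics(y_true, y_pred, cls)
    print(f"class {cls}: precision={precision:.2f} recall={recall:.2f} "
          f"f1={f1:.2f} support={support}")
print(f"accuracy={np.mean(y_true == y_pred):.3f}")
```

Read this way, the report above shows that the model finds most positive reviews (recall 0.85 for class 1) but labels a fair number of negative reviews as positive, which lowers the recall for class 0 to 0.65.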

