Movie Review Classification: Binary Classification Example

This notebook is a code example from Section 3.4 of Chapter 3 of the book Deep Learning with Python by the creator of Keras. The book contains additional explanations and illustrations. This notebook includes only explanations related to the source code.

Binary classification, also known as two-class classification, is probably the most commonly encountered kind of machine learning problem. In this example, we will learn how to classify movie reviews as positive or negative based on their text.

The IMDB Dataset

We will use the IMDB dataset, which consists of 50,000 highly polarized reviews collected from the Internet Movie Database. The dataset is split into 25,000 training samples and 25,000 test samples, with 50% negative reviews and 50% positive reviews in each set.

Why do we split the data into training and test sets? Because you should never train and test a machine learning model on the same data! Just because a model performs well on the training data does not mean it will perform well on data it has never seen before. What really matters is the model’s performance on new data (in fact, since we already know the labels of the training data, there would be no point in building a model just to predict them).

For example, a model could simply memorize the mapping between the training samples and their targets. Such a model would be useless for predicting targets on new data. We will explore this issue in more detail in the next chapter.
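The memorization problem can be illustrated with a small sketch. It is not taken from the book: the toy data, the seed, and the helper names `memorizer_predict` and `accuracy` are all hypothetical. The "model" here is just a lookup table, so it scores perfectly on samples it has seen and does no better than guessing on samples it has not.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Toy data (an assumption for illustration): each sample is an integer ID and
# each target is a random 0/1 label, so nothing can be learned except by rote.
train = {i: random.randint(0, 1) for i in range(1000)}
test = {i: random.randint(0, 1) for i in range(1000, 2000)}

def memorizer_predict(sample, memory, default=0):
    """Return the stored label for a seen sample, else a fixed guess."""
    return memory.get(sample, default)

def accuracy(data, memory):
    """Fraction of samples in `data` whose label the memorizer gets right."""
    correct = sum(memorizer_predict(x, memory) == y for x, y in data.items())
    return correct / len(data)

train_acc = accuracy(train, train)  # evaluated on the memorized samples
test_acc = accuracy(test, train)    # evaluated on samples never seen before
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The training accuracy is perfect by construction, while the test accuracy should land near 0.5, which is why only the score on held-out data says anything about the model.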

IMDB Dataset in Keras

Like the MNIST dataset, the IMDB dataset is included in Keras. The data has already been preprocessed: each review (a sequence of words) has been converted into a sequence of integers, where each integer represents a unique word in a dictionary.
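To make the encoding concrete, here is a minimal sketch of how text might be turned into integer sequences. It is an illustration under assumptions, not the preprocessing actually used to build the dataset: the three-review corpus and the `word_index` dictionary are made up, and the real dataset also reserves a few low indices for special markers, which this toy scheme ignores.

```python
from collections import Counter

# Hypothetical mini-corpus standing in for the reviews.
reviews = [
    "this film was great",
    "this film was terrible",
    "great acting great story",
]

# Rank words by frequency: the most frequent word gets index 1.
counts = Counter(word for review in reviews for word in review.split())
word_index = {
    word: rank
    for rank, (word, _) in enumerate(counts.most_common(), start=1)
}

# Encode each review as the list of its words' indices.
encoded = [[word_index[word] for word in review.split()] for review in reviews]

# Invert the dictionary to map indices back to words.
reverse_index = {index: word for word, index in word_index.items()}
decoded = [" ".join(reverse_index[i] for i in seq) for seq in encoded]

print(encoded)
print(decoded == reviews)
```

Because indices follow frequency rank, keeping only small indices is the same as keeping only the most common words.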

The following code loads the dataset (the first time you run it, about 17 MB of data will be downloaded to your computer):

In [2]:
import keras

In [4]:
from keras.datasets import imdb

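# Keep only the 10,000 most frequent words; unpack the test labels into
# their own variable so they do not overwrite train_labels.
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)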

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17464789/17464789 ━━━━━━━━━━━━━━━━━━━━ 3s 0us/step


The parameter num_words=10000 means that we will use only the 10,000 most frequently occurring words in the training data. Rarely appearing words will be ignored. This allows us to obtain vectorized data of a manageable size.
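The effect of the vocabulary cap can be sketched on toy data without downloading anything. This assumes that out-of-range words are replaced by a placeholder index rather than removed from the sequence, which is my understanding of the Keras loader's default behavior (placeholder index 2). The function `cap_vocabulary`, its `oov_index` parameter, and the sequences below are hypothetical.

```python
def cap_vocabulary(sequences, num_words, oov_index=2):
    """Replace every word index >= num_words with a placeholder index."""
    return [
        [index if index < num_words else oov_index for index in seq]
        for seq in sequences
    ]

# Made-up encoded reviews; 16000 and 12000 stand for rare words.
toy_sequences = [[1, 14, 22, 16000, 43], [1, 9, 12000, 7]]

capped = cap_vocabulary(toy_sequences, num_words=10000)
print(capped)  # [[1, 14, 22, 2, 43], [1, 9, 2, 7]]

# After capping, no index reaches num_words.
print(max(max(seq) for seq in capped))  # 43
```

Sequence lengths are unchanged here; only the identity of the rare words is lost.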

The variables train_data and test_data are arrays of reviews. Each review is a list of word indices (that is, the word sequences have been encoded as integers).
The variables train_labels and test_labels are arrays of 0s and 1s, where 0 represents a negative review and 1 represents a positive review.

In [None]:

train_data[0]