# 4.1 Classifying movie reviews: A binary classification example
1. [IMDB Data](#data)
1. [Preparing Data](#preparing)
1. [Building Model](#building)

<a name="data"></a>
# 4.1.1 The IMDB dataset

In [1]:
from tensorflow.keras.datasets import imdb

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(
    num_words=10)

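The `num_words` argument caps the vocabulary: `num_words=10000` means you’ll only __keep the top 10,000 most frequently occurring words in the training data__. The cell above passes `num_words=10`, so only word indices below 10 survive and every other word is replaced by the out-of-vocabulary index 2.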

Rare words will be discarded. This allows us to work with vector data of manageable size. If we didn’t set this limit, we’d be working with __88,585 unique words in the training data__, which is unnecessarily large. Many of these words only occur in a single sample, and thus can’t be meaningfully used for classification.
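To make the effect of the cap concrete, here is a minimal sketch on invented word indices. The helper `cap_vocabulary` is hypothetical (it is not part of Keras) and assumes that any index at or above `num_words` is replaced by the placeholder index 2.

```python
def cap_vocabulary(sequences, num_words, oov_char=2):
    """Replace every index >= num_words with oov_char (hypothetical helper)."""
    return [
        [index if index < num_words else oov_char for index in sequence]
        for sequence in sequences
    ]

# Toy reviews made of invented word indices.
toy_reviews = [[1, 14, 22, 4, 9], [1, 530, 5, 973]]
capped = cap_vocabulary(toy_reviews, num_words=10)
print(capped)  # [[1, 2, 2, 4, 9], [1, 2, 5, 2]]
```

With a small cap, most words collapse into the placeholder, which is why a larger vocabulary such as 10,000 keeps far more of the review's content.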

The variables `train_data` and `test_data` are lists of reviews; each review is a list of word indices (encoding a sequence of words). `train_labels` and `test_labels` are lists of 0s and 1s, where 0 stands for negative and 1 stands for positive.

In [2]:
print(train_data[0])
print(train_labels[0])

[1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 2, 2, 2, 5, 2, 2, 2, 2, 2, 2, 2, 2, 9, 2, 2, 2, 5, 2, 4, 2, 2, 2, 2, 2, 2, 2, 4, 2, 2, 2, 2, 2, 2, 2, 2, 4, 2, 2, 2, 6, 2, 2, 2, 2, 2, 4, 2, 2, 2, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 5, 2, 2, 2, 8, 2, 8, 2, 5, 4, 2, 2, 2, 2, 2, 2, 2, 4, 2, 2, 2, 2, 2, 5, 2, 2, 2, 2, 2, 2, 2, 2, 2, 6, 2, 2, 2, 2, 2, 2, 5, 2, 2, 2, 2, 2, 8, 4, 2, 2, 2, 2, 2, 4, 2, 7, 2, 5, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 7, 4, 2, 2, 2, 2, 2, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 6, 2, 2, 2, 4, 2, 2, 2, 2, 2, 2, 2, 5, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 5, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
1


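Because we’re restricting ourselves to the top `num_words` most frequent words, no word index will reach `num_words`: with the top 10,000 words the largest possible index is 9,999, and with the `num_words=10` used above it is 9.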

In [3]:
max([max(sequence) for sequence in train_data])

9

For kicks, here’s how you can quickly __decode one of these reviews back to English words__.

In [4]:
# word_index is a dict mapping words to integer indices
word_index = imdb.get_word_index()
# reverse it, mapping integer indices to words
reverse_word_index = dict(
    [(value, key) for (key, value) in word_index.items()])
# decode the review; indices are offset by 3 because 0, 1, and 2 are
# reserved for "padding", "start of sequence", and "unknown"
decoded_review = " ".join(
    [reverse_word_index.get(i - 3, "?") for i in train_data[0]])
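The decoding logic can be tried without downloading anything. The sketch below uses an invented five-word vocabulary (`toy_word_index` and its ranks are assumptions for illustration, not real IMDB values) and applies the same offset of 3.

```python
# Toy stand-in for imdb.get_word_index(): word -> frequency rank (invented values).
toy_word_index = {"the": 1, "and": 2, "a": 3, "movie": 17, "great": 84}

# Reverse the mapping: rank -> word.
toy_reverse_index = {rank: word for word, rank in toy_word_index.items()}

# An encoded review: 1 marks the start, 2 marks an unknown word,
# and every real word is stored as its rank + 3.
encoded = [1, 4, 87, 20, 2]

decoded = " ".join(toy_reverse_index.get(i - 3, "?") for i in encoded)
print(decoded)  # ? the great movie ?
```

The reserved indices 1 and 2 have no entry in the reversed dictionary, so they come out as `?`.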

<a name="preparing"></a>
# 4.1.2 Preparing the data

You can’t directly feed lists of integers into a NN. They all have different lengths, but __a NN expects to process contiguous batches of data__. You have to __turn your lists into tensors__. There are 2 ways to do that:

1. __Pad your lists__ so that they all have the same length, turn them into an integer tensor of shape `(samples, max_length)`, and start your model with a layer capable of handling such integer tensors (the `Embedding` layer, which we’ll cover in detail later in the book).

2. __Multi-hot encode your lists__ to turn them into vectors of 0s and 1s. Then you could use a `Dense` layer, capable of handling floating-point vector data, as the first layer in your model.
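As a rough illustration of the padding option, the sketch below pads invented toy reviews with zeros at the end so they fit in one integer array. `pad_sequences_simple` is a hypothetical helper written for this example, not the Keras utility, and the choice to pad at the end is an assumption made for simplicity.

```python
import numpy as np

def pad_sequences_simple(sequences, pad_value=0):
    """Right-pad each list with pad_value up to the longest length (hypothetical helper)."""
    max_length = max(len(sequence) for sequence in sequences)
    padded = np.full((len(sequences), max_length), pad_value, dtype="int64")
    for i, sequence in enumerate(sequences):
        padded[i, :len(sequence)] = sequence
    return padded

toy_reviews = [[1, 4, 9], [1, 5], [1, 2, 6, 8]]
padded = pad_sequences_simple(toy_reviews)
print(padded.shape)  # (3, 4)
print(padded)
```

The result has shape `(samples, max_length)`, which is the form an `Embedding` layer would consume.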

Let’s go with the latter solution to vectorize the data, which you’ll do manually for maximum clarity.

In [6]:
import numpy as np

def vectorize_sequences(sequences, dimension=10_000):
    # create all-zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        for j in sequence:
            # set specific indices of results[i] to 1s
            results[i, j] = 1.
    return results

x_train = vectorize_sequences(train_data[:100], 10)
x_test = vectorize_sequences(test_data[:100], 10)

Here’s what the samples look like now.

In [7]:
x_train[0]

array([0., 1., 1., 0., 1., 1., 1., 1., 1., 1.])
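To see what multi-hot encoding keeps and what it throws away, here is a small self-contained sketch on invented indices. The function `multi_hot` is a hypothetical variant that uses NumPy fancy indexing instead of an inner loop.

```python
import numpy as np

def multi_hot(sequences, dimension):
    """Multi-hot encode lists of indices (hypothetical helper)."""
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.0  # fancy indexing sets all listed columns at once
    return results

toy_reviews = [[1, 4, 4, 9], [2, 5]]
encoded = multi_hot(toy_reviews, dimension=10)
print(encoded[0])  # [0. 1. 0. 0. 1. 0. 0. 0. 0. 1.]
print(encoded[1])  # [0. 0. 1. 0. 0. 1. 0. 0. 0. 0.]
```

Note that the repeated index 4 in the first review still produces a single 1: the encoding records which words occur, not how often or in what order.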

You should also __vectorize your labels__.

In [None]:
# keep the labels aligned with the 100 samples vectorized above
y_train = np.asarray(train_labels[:100]).astype("float32")
y_test = np.asarray(test_labels[:100]).astype("float32")

Now the data is __ready to be fed into a NN__.

<a name="building"></a>
# 4.1.3 Building your model

The __input data is vectors__, and the __labels are scalars__ (1s and 0s): this is one of the simplest problem setups you’ll ever encounter. A type of model that performs well on such a problem is a __plain stack of `Dense` layers with `relu` activations__.

There are two __key architecture decisions__ to be made about such a stack of Dense layers:

1. How many layers to use
2. How many units to choose for each layer

For now, we’ll use the following architecture:
1. Two intermediate layers with 16 units each
2. A third layer that will output the scalar prediction regarding the sentiment of the current review

![](https://drek4537l1klr.cloudfront.net/chollet2/HighResolutionFigures/figure_4-1.png)
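The architecture described above can be sketched as a plain NumPy forward pass, to show what the stack computes on a batch of multi-hot vectors. This is an illustration with random, untrained weights, not the Keras model itself. The sigmoid on the output layer is an assumption (the text only says the last layer outputs a scalar), and all names here (`dense`, `forward`, `w1`, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dense(inputs, weights, bias, activation):
    """One Dense layer: activation(inputs @ weights + bias)."""
    return activation(inputs @ weights + bias)

input_dim = 10  # size of each multi-hot vector (10,000 with the full vocabulary)

# Randomly initialised, untrained parameters for the three layers.
w1, b1 = rng.normal(scale=0.1, size=(input_dim, 16)), np.zeros(16)
w2, b2 = rng.normal(scale=0.1, size=(16, 16)), np.zeros(16)
w3, b3 = rng.normal(scale=0.1, size=(16, 1)), np.zeros(1)

def forward(x):
    h1 = dense(x, w1, b1, relu)        # first intermediate layer, 16 units
    h2 = dense(h1, w2, b2, relu)       # second intermediate layer, 16 units
    return dense(h2, w3, b3, sigmoid)  # assumed sigmoid output, one value per review

batch = rng.integers(0, 2, size=(4, input_dim)).astype("float32")
predictions = forward(batch)
print(predictions.shape)  # (4, 1)
```

In Keras, the same stack would typically be written as three `Dense` layers in a `Sequential` model, with 16, 16, and 1 units respectively.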