## Binary Classification with the IMDB Dataset

The IMDB dataset is a set of 50,000 highly polarized reviews from the Internet Movie Database. They are split into 25,000 reviews for training and 25,000 reviews for testing, each set consisting of 50% negative and 50% positive reviews.
The dataset has already been preprocessed: the reviews (sequences of words) have been turned into sequences of integers, where each integer stands for a specific word in a dictionary.
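
To make the idea of "reviews as sequences of integers" concrete, here is a minimal sketch on a made-up two-review corpus. The corpus and the names `reviews`, `vocabulary`, and `encoded` are assumptions for illustration only; the real dataset was encoded ahead of time and also reserves a few low indices for special tokens, which this toy version ignores.

```python
from collections import Counter

# Toy corpus standing in for the reviews (hypothetical data, not from IMDB).
reviews = ["great film great cast", "dull film great"]

# Count how often each word occurs across all reviews.
counts = Counter(word for review in reviews for word in review.split())

# Rank words by frequency: the most frequent word gets index 1.
vocabulary = {
    word: rank
    for rank, (word, _) in enumerate(counts.most_common(), start=1)
}

# Replace each word with its integer index.
encoded = [[vocabulary[word] for word in review.split()] for review in reviews]
print(encoded)
```

Each inner list has one integer per word, so the encoded reviews keep their original lengths and word order.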

In [1]:
from tensorflow import keras
from tensorflow.keras import layers

In [2]:
from tensorflow.keras.datasets import imdb
# Keep only the 10,000 most frequent words in the training data.
# Rarer words are replaced by the reserved "unknown" index.
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(
    num_words=10000)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


In [4]:
train_data[0][:10]  # First 10 word indices of the first review; each review is a list of word indices (encoding a sequence of words).

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65]

In [5]:
train_labels[0]  # Labels are 0s and 1s, where 0 stands for negative and 1 stands for positive.

1

In [6]:
# Because we're restricting ourselves to the 10,000 most frequent words, no word index will exceed 9,999.
max([max(sequence) for sequence in train_data])

9999
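
The reason the maximum is 9999 can be sketched with a small helper. This assumes the Keras defaults, under which words outside the vocabulary cap are mapped to index 2 (the "unknown" index). The function `cap_vocabulary` and the sample indices are hypothetical and not part of the Keras API; they only imitate the effect of `num_words`.

```python
# Hypothetical helper that mimics the effect of a vocabulary cap;
# it is not part of the Keras API.
def cap_vocabulary(sequence, num_words, oov_index=2):
    """Replace every index >= num_words with the out-of-vocabulary index."""
    return [index if index < num_words else oov_index for index in sequence]


raw_review = [1, 14, 12045, 22, 30000, 973]  # made-up indices for illustration
capped_review = cap_vocabulary(raw_review, num_words=10000)

print(capped_review)       # [1, 14, 2, 22, 2, 973]
print(max(capped_review))  # 973
```

After capping, every index is strictly below `num_words`, which is why the check above can return at most 9999.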

In [8]:
# Decode a review back to text.
# word_index is a dictionary mapping words to integer indices.
word_index = imdb.get_word_index()
# Reverse it, mapping integer indices to words.
reverse_word_index = {value: key for (key, value) in word_index.items()}
# Decode the review. The indices are offset by 3 because 0, 1, and 2 are
# reserved indices for "padding," "start of sequence," and "unknown."
decoded_review = " ".join(reverse_word_index.get(i - 3, "?") for i in train_data[0])
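
The offset-by-3 logic is easier to follow on a miniature example that runs without downloading anything. The dictionary `toy_word_index`, the helper `decode`, and the sample sequence are assumptions made up for this sketch; the real mapping comes from `imdb.get_word_index()`.

```python
# Hypothetical miniature word index; the real one comes from imdb.get_word_index().
toy_word_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4}

# Reverse the mapping so that integer indices point back to words.
toy_reverse = {index: word for word, index in toy_word_index.items()}


def decode(sequence, reverse_index, offset=3):
    """Shift each index down by the offset; reserved or unknown indices become '?'."""
    return " ".join(reverse_index.get(i - offset, "?") for i in sequence)


# 1 marks the start of the sequence and 2 marks an unknown word;
# both fall outside the word index after the shift and decode to "?".
encoded = [1, 4, 5, 6, 7, 2]
decoded = decode(encoded, toy_reverse)
print(decoded)  # ? the film was brilliant ?
```

The reserved indices decode to "?" because subtracting the offset yields a key that is not in the reversed dictionary.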