# Text classification with movie reviews

- This notebook classifies movie reviews as *positive* or *negative* using the text of the review
- An example of *binary* - or two-class - classification

- IMDB [dataset](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb) that contains the text of 50,000 movie reviews from the Internet Movie Database
- Split into 25,000 reviews for training and 25,000 reviews for testing
- The training and testing sets are *balanced*, meaning each contains an equal number of positive and negative reviews

- Uses `tf.keras`

In [1]:
import tensorflow as tf
from tensorflow import keras

import numpy as np

print(tf.__version__)

1.12.0


## Download the IMDB dataset

- The IMDB dataset comes packaged with TensorFlow
- It has already been preprocessed: each review (a sequence of words) has been converted to a sequence of integers, where each integer represents a specific word in a dictionary

In [2]:
imdb = keras.datasets.imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

- The argument `num_words=10000` keeps the 10,000 most frequently occurring words in the training data
- Rarer words are dropped from the vocabulary (they appear as the out-of-vocabulary token instead) to keep the size of the data manageable
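
The effect of the vocabulary cap can be illustrated on a toy review. This is a minimal sketch, not the loader's actual code: `cap_vocabulary`, `OOV_INDEX`, and `toy_review` are hypothetical names, and the choice of index 2 for out-of-vocabulary words is an assumption that matches the `<UNK>` slot used when decoding reviews.

```python
# Sketch only: mimics the effect of num_words on an already-encoded review.
OOV_INDEX = 2  # assumed out-of-vocabulary index (the <UNK> slot)

def cap_vocabulary(review, num_words, oov_index=OOV_INDEX):
    """Replace every word index >= num_words with the out-of-vocabulary index."""
    return [i if i < num_words else oov_index for i in review]

# Hypothetical encoded review containing two rare words (12045 and 99999)
toy_review = [1, 14, 22, 12045, 43, 530, 99999]
capped = cap_vocabulary(toy_review, num_words=10000)
print(capped)
```

After capping, no index in the review is 10,000 or larger, so the vocabulary size seen by the model is bounded.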

## Explore the data

- Each example is an array of integers representing the words of the movie review
- Each label is an integer value of either 0 or 1, where 0 is a negative review, and 1 is a positive review
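
Because the labels are plain 0/1 integers, the claim that the data is balanced can be checked by counting them. The sketch below uses a hypothetical `toy_labels` list as a stand-in; in the notebook, something like `train_labels.tolist()` could be passed instead.

```python
from collections import Counter

# Hypothetical stand-in for the real labels (0 = negative, 1 = positive)
toy_labels = [0, 1, 1, 0, 1, 0]

counts = Counter(toy_labels)
is_balanced = counts[0] == counts[1]
print(f'negative: {counts[0]}, positive: {counts[1]}, balanced: {is_balanced}')
```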

In [3]:
print(f'Training entries: {len(train_data)}, labels: {len(train_labels)}')

Training entries: 25000, labels: 25000


- The text of the reviews has been converted into integers
- Each integer represents a specific word in a dictionary

In [4]:
print(train_data[0])

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]


In [5]:
print(f'Length of the first review: {len(train_data[0])}')
print(f'Length of the second review: {len(train_data[1])}')

Length of the first review: 218
Length of the second review: 189


### Convert the integers back to words

- Create a helper function to query a dictionary object that contains the integer to string mapping

In [6]:
# A dictionary mapping words to an integer index
word_index = imdb.get_word_index()

# Shift every index up by 3 so that 0-3 are free for the reserved tokens below
word_index = {k: (v + 3) for k, v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3

# Invert the mapping: integer index -> word
reverse_word_index = {value: key for key, value in word_index.items()}

def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])
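
The index shift and the reserved tokens are easier to see on a tiny vocabulary. The following is an illustrative sketch under stated assumptions: `toy_word_index`, `encode`, and `decode` are hypothetical names, and the four-word vocabulary merely stands in for the much larger mapping returned by `imdb.get_word_index()`.

```python
# Toy vocabulary standing in for the real word index (hypothetical data)
toy_word_index = {'the': 1, 'film': 2, 'was': 3, 'brilliant': 4}

# Shift every index by 3 to leave room for the reserved tokens
shifted = {word: idx + 3 for word, idx in toy_word_index.items()}
shifted.update({'<PAD>': 0, '<START>': 1, '<UNK>': 2, '<UNUSED>': 3})

# Invert the mapping so integers can be turned back into words
reverse = {idx: word for word, idx in shifted.items()}

def encode(words):
    """Prefix <START> and map unknown words to <UNK>."""
    return [shifted['<START>']] + [shifted.get(w, shifted['<UNK>']) for w in words]

def decode(indices):
    """Map integers back to words, using '?' for indices with no entry."""
    return ' '.join(reverse.get(i, '?') for i in indices)

encoded = encode('the film was truly brilliant'.split())
decoded = decode(encoded)
print(encoded)
print(decoded)
```

The word "truly" is not in the toy vocabulary, so it round-trips as `<UNK>`, which is the same reason `<UNK>` appears in decoded reviews when the vocabulary is capped.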

In [7]:
decode_review(train_data[0])

"<START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNK> and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <UNK> to the two little boy's that played the <UNK> of norman and paul they were just brilliant children are often left out of the <UNK> list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for wh

## Prepare the data

- The reviews - the arrays of integers - must be converted to tensors before being fed into the neural network
- Methods to do the conversion:
    - One-hot-encode
        - Convert the arrays into vectors of 0s and 1s (e.g., the sequence [3, 5] would become a 10,000-dimensional vector that is all zeros except for indices 3 and 5, which are ones)
        - The first layer of the network can then be a Dense layer that can handle floating-point vector data
        - This approach is memory intensive, requiring a `num_words * num_reviews` size matrix
    - Pad the arrays so they all have the same length, then create an integer tensor of shape `max_length * num_reviews`
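
The two options above can be compared with a small pure-Python sketch. It is illustrative only: `multi_hot` and `pad` are hypothetical helpers, padding and truncating at the end of each sequence is an assumption, and `max_length = 256` is an arbitrary value chosen for the size comparison. In practice a library utility such as Keras' `pad_sequences` would normally do the padding.

```python
def multi_hot(sequence, dimension):
    """Return a vector of zeros with ones at the positions listed in sequence."""
    vector = [0.0] * dimension
    for index in sequence:
        vector[index] = 1.0
    return vector

def pad(sequences, max_length, pad_value=0):
    """Truncate or right-pad each sequence to exactly max_length entries."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:max_length]
        padded.append(seq + [pad_value] * (max_length - len(seq)))
    return padded

# The [3, 5] example, using 10 dimensions instead of 10,000 for readability
encoded = multi_hot([3, 5], dimension=10)
print(encoded)

# Two toy reviews of different lengths, padded to a common length of 4
padded = pad([[1, 14, 22], [1, 4]], max_length=4)
print(padded)

# Number of entries each representation needs for the training set
num_words, num_reviews, max_length = 10000, 25000, 256  # max_length is assumed
multi_hot_entries = num_words * num_reviews
padded_entries = max_length * num_reviews
print(multi_hot_entries, padded_entries)
```

With these numbers the multi-hot matrix has 250,000,000 entries, while the padded integer tensor has 6,400,000, which is the memory difference the list above refers to.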
    