---
# Classifying Movie Reviews With IMDB Dataset


For this project we'll use the IMDB dataset for two-class (binary) classification. Our goal is to classify movie reviews as positive or negative based on their text content.

The IMDB dataset contains 50,000 highly polarized reviews, split evenly into 25,000 for training and 25,000 for testing. Each set contains 50% positive and 50% negative reviews.

---

## Importing The Libraries

Import the libraries used in this notebook.

In [None]:
import keras
import numpy as np
import tensorflow as tf
from keras.datasets import imdb

---
## Initial Overview of The Data

`train_data` and `test_data` hold the reviews, each encoded as a list of word indices (a sequence of words). `train_labels` and `test_labels` hold one binary label per review: 0 for negative and 1 for positive.

We restrict the vocabulary to the 10,000 most frequently occurring words. As a result, no word index exceeds 9,999, which we confirm below by checking the maximum index across the training sequences.


In [None]:
# num_words=10000 keeps only the 10,000 most frequently occurring words
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

In [None]:
# Viewing first entry in the train data
train_data[0]

In [None]:
# View first entry in the training labels
train_labels[0]

In [None]:
max([max(sequence) for sequence in train_data])
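
The effect of the vocabulary cap can be pictured with a small sketch. This is an illustration of the idea, not Keras's actual code: `cap_vocabulary`, the toy reviews, and the cap of 30 are hypothetical, and the sketch assumes that out-of-range indices are replaced by index 2, the 'unknown' marker.

```python
def cap_vocabulary(sequences, num_words, oov_index=2):
    # Keep indices below num_words; replace the rest with the 'unknown' index
    return [[w if w < num_words else oov_index for w in seq] for seq in sequences]

# Hypothetical encoded reviews (not real IMDB data)
toy_reviews = [[1, 14, 22, 9], [1, 35, 5, 120]]
capped = cap_vocabulary(toy_reviews, num_words=30)

# Same check as the cell above, applied to the toy data
max_index = max(max(seq) for seq in capped)
print(capped)
print(max_index)
```

After capping, every index is strictly below the cap, which is why the largest index in the real training data is 9,999 rather than 10,000.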

In [None]:
# Decode a review back to English

# word_index is a dictionary mapping words to integer indices
word_index = imdb.get_word_index()
# Reverse it, mapping integer indices to words
reverse_word_index = {value: key for (key, value) in word_index.items()}
# Indices are offset by 3 because 0, 1, and 2 are reserved for
# 'padding', 'start of sequence', and 'unknown'
decoded_review = ' '.join(
    reverse_word_index.get(index - 3, '?') for index in train_data[0])
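
To see the offset of 3 at work without downloading the dataset, here is a minimal sketch with a made-up vocabulary. `toy_word_index` and `encoded_review` are hypothetical values chosen for illustration; only the decoding logic mirrors the cell above.

```python
# Hypothetical toy vocabulary (not the real IMDB word index)
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}
toy_reverse_word_index = {value: key for key, value in toy_word_index.items()}

# Hypothetical encoded review: 1 marks the start of the sequence, 2 marks an
# unknown word, and every real word is stored as its index plus 3
encoded_review = [1, 4, 5, 6, 2, 7]

# Subtract the offset; reserved indices find no match and decode to '?'
toy_decoded = " ".join(
    toy_reverse_word_index.get(index - 3, "?") for index in encoded_review)
print(toy_decoded)
```

The start marker and the unknown word both come out as `?`, which is why a decoded IMDB review typically begins with a question mark.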

---
## Preparing the Data


In [None]:
def vectorize_sequences(sequences, dimension=10000):
    # Create an all-zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # Set the columns of results[i] listed in sequence to 1
        results[i, sequence] = 1.
    return results

In [None]:
# Vectorize training data
X_train = vectorize_sequences(train_data)
# Vectorize test data
X_test = vectorize_sequences(test_data)
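
As a quick check of what the vectorization produces, the sketch below applies the same multi-hot encoding to two made-up sequences with a small dimension. The name `multi_hot_encode`, the toy sequences, and `dimension=8` are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def multi_hot_encode(sequences, dimension):
    # One row per sequence, one column per possible word index
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # Integer-array indexing sets every listed column of row i to 1
        results[i, sequence] = 1.0
    return results

# Hypothetical encoded reviews; index 3 appears twice in the first one
toy_sequences = [[1, 3, 3, 5], [0, 7]]
toy_matrix = multi_hot_encode(toy_sequences, dimension=8)
print(toy_matrix.shape)
print(toy_matrix)
```

Each row has a 1 in the column of every index that occurs in the review. A repeated index still yields a single 1, so this encoding records which words appear but not how often or in what order.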