---
# Classifying Movie Reviews With IMDB Dataset


For this project we'll use the IMDB dataset for binary (two-class) classification. Our goal is to classify movie reviews as positive or negative based on their text content.

The IMDB dataset contains 50,000 highly polarized reviews, split evenly into 25,000 for training and 25,000 for testing. Each set contains 50% positive and 50% negative reviews.

---

## Importing The Libraries

Import the libraries used in this notebook.

In [1]:
import keras
import numpy as np
import tensorflow as tf
from keras import models
from keras import layers
from keras import losses
from keras import metrics
from keras import optimizers
from keras.datasets import imdb
import matplotlib.pyplot as plt

%matplotlib inline

Using TensorFlow backend.


---
## Initial Overview of The Data

`train_data` and `test_data` are lists of reviews; each review is a list of word indices (encoding a sequence of words). `train_labels` and `test_labels` are lists of 0s and 1s that indicate whether each review is negative (0) or positive (1).

We restrict the vocabulary to the 10,000 most frequently occurring words. This is why the largest word index found in the training data is 9,999, as the check on `train_data` further down confirms.


In [2]:
# num_words=10000 keeps only the 10,000 most frequently occurring words
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words = 10000)

In [3]:
# Viewing first entry in the train data
train_data[0]

[1,
 14,
 22,
 16,
 43,
 530,
 973,
 1622,
 1385,
 65,
 458,
 4468,
 66,
 3941,
 4,
 173,
 36,
 256,
 5,
 25,
 100,
 43,
 838,
 112,
 50,
 670,
 2,
 9,
 35,
 480,
 284,
 5,
 150,
 4,
 172,
 112,
 167,
 2,
 336,
 385,
 39,
 4,
 172,
 4536,
 1111,
 17,
 546,
 38,
 13,
 447,
 4,
 192,
 50,
 16,
 6,
 147,
 2025,
 19,
 14,
 22,
 4,
 1920,
 4613,
 469,
 4,
 22,
 71,
 87,
 12,
 16,
 43,
 530,
 38,
 76,
 15,
 13,
 1247,
 4,
 22,
 17,
 515,
 17,
 12,
 16,
 626,
 18,
 2,
 5,
 62,
 386,
 12,
 8,
 316,
 8,
 106,
 5,
 4,
 2223,
 5244,
 16,
 480,
 66,
 3785,
 33,
 4,
 130,
 12,
 16,
 38,
 619,
 5,
 25,
 124,
 51,
 36,
 135,
 48,
 25,
 1415,
 33,
 6,
 22,
 12,
 215,
 28,
 77,
 52,
 5,
 14,
 407,
 16,
 82,
 2,
 8,
 4,
 107,
 117,
 5952,
 15,
 256,
 4,
 2,
 7,
 3766,
 5,
 723,
 36,
 71,
 43,
 530,
 476,
 26,
 400,
 317,
 46,
 7,
 4,
 2,
 1029,
 13,
 104,
 88,
 4,
 381,
 15,
 297,
 98,
 32,
 2071,
 56,
 26,
 141,
 6,
 194,
 7486,
 18,
 4,
 226,
 22,
 21,
 134,
 476,
 26,
 480,
 5,
 144,
 30,
 5535,
 18,

In [4]:
# View first entry in the training labels
train_labels[0]

1

In [5]:
max([max(sequence) for sequence in train_data])

9999
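
The maximum of 9,999 follows from the vocabulary cap. The sketch below mimics that cap on a made-up review. It assumes the default `imdb.load_data` behavior of replacing any index at or above `num_words` with the unknown-word marker `2`; `cap_vocabulary` and `toy_review` are hypothetical names used only for illustration.

```python
def cap_vocabulary(sequence, num_words, oov_index=2):
    # Keep indices below num_words; replace the rest with the unknown marker
    # (assumed to be 2, the default oov_char of imdb.load_data)
    return [i if i < num_words else oov_index for i in sequence]

# Made-up encoded review containing two indices outside the top 10,000
toy_review = [1, 14, 12000, 22, 9999, 10000]
capped = cap_vocabulary(toy_review, num_words=10000)

print(capped)       # out-of-vocabulary indices become 2
print(max(capped))  # can never exceed num_words - 1
```

Because every index at or above the cap is replaced, the largest value that can survive is `num_words - 1`.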

In [6]:
# Decode the first review back to English

# word_index is a dictionary mapping words to integer indices
word_index = imdb.get_word_index()
# Reverse it, mapping integer indices to words
reverse_word_index = dict(
    [(value, key) for (key, value) in word_index.items()])
# Indices are offset by 3 because 0, 1 and 2 are reserved for
# 'padding', 'start of sequence' and 'unknown'
decoded_review = ' '.join(
    [reverse_word_index.get(index - 3, '?') for index in train_data[0]])
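
To see why the offset of 3 matters, here is a minimal sketch with a made-up four-word vocabulary. `toy_word_index` and `decode` are hypothetical and not part of Keras or the dataset. The sketch assumes the default `imdb.load_data` convention, in which indices 0, 1 and 2 are reserved.

```python
# Hypothetical toy vocabulary: word -> frequency rank (1 = most frequent),
# shaped like the dictionary returned by imdb.get_word_index()
toy_word_index = {"the": 1, "film": 2, "was": 3, "great": 4}
toy_reverse_index = {rank: word for word, rank in toy_word_index.items()}

def decode(encoded_review, reverse_index):
    # Encoded indices are shifted up by 3, so subtract 3 before the lookup.
    # Reserved indices (0, 1, 2) have no entry and fall back to '?'.
    return " ".join(reverse_index.get(i - 3, "?") for i in encoded_review)

# 1 = start of sequence, 2 = unknown word, 4..7 = ranks 1..4 shifted by 3
encoded = [1, 4, 5, 6, 7, 2]
decoded = decode(encoded, toy_reverse_index)
print(decoded)
```

The start-of-sequence and unknown markers both decode to `?`, which is why a decoded review typically begins with a question mark.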

---
## Preparing the Data


In [7]:
def vectorize_sequences(sequences, dimension=10000):
    # Create an all-zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # Set the columns of row i listed in the sequence to 1
        results[i, sequence] = 1.
    return results

In [8]:
# Vectorize training data
x_train = vectorize_sequences(train_data)
# Vectorize test data
x_test = vectorize_sequences(test_data)
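
The vectorization above is a multi-hot encoding: each review becomes a fixed-length vector with a 1 in every column whose word appears in the review. The sketch below applies the same idea to two made-up sequences with a vocabulary of only 5 words; `multi_hot` and `toy_sequences` are illustrative names.

```python
import numpy as np

def multi_hot(sequences, dimension):
    # One row per sequence, one column per possible word index
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # Integer-array indexing sets every listed column of row i to 1
        results[i, sequence] = 1.0
    return results

# Two made-up reviews; index 2 appears twice in the first one
toy_sequences = [[0, 2, 2], [1, 4]]
encoded = multi_hot(toy_sequences, dimension=5)
print(encoded)
```

Note that the encoding discards both word order and repetition: the repeated index 2 in the first sequence still produces a single 1.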

In [9]:
# Vectorizing training and test labels
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

## Building The Network

In [10]:
# Create a Sequential model: a linear stack of layers
model = models.Sequential()
# Two hidden Dense layers with 16 units each; the 'relu' activation adds
# the non-linearity that lets the network learn a richer hypothesis space
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
# Output layer: a single unit with a sigmoid activation, which outputs
# the probability that the review is positive
model.add(layers.Dense(1, activation='sigmoid'))
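
As a rough check on the size of this network, the number of trainable parameters can be worked out by hand. The sketch assumes each Dense layer has one weight per input per unit plus one bias per unit, which is the standard fully connected layout; `dense_param_count` is a hypothetical helper.

```python
def dense_param_count(n_inputs, n_units):
    # One weight per (input, unit) pair plus one bias per unit
    return n_inputs * n_units + n_units

# (inputs, units) for the three Dense layers defined above
layer_sizes = [(10000, 16), (16, 16), (16, 1)]
per_layer = [dense_param_count(n_in, n_out) for n_in, n_out in layer_sizes]
total = sum(per_layer)

print(per_layer)
print(total)
```

Almost all of the parameters sit in the first layer, because it connects the 10,000-dimensional input to the 16 hidden units.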



## Compiling The Model

In [11]:
model.compile(optimizer='rmsprop',
             loss = 'binary_crossentropy',
             metrics = ['accuracy'])



## Configuring The Optimizers

In [12]:
model.compile(optimizer=optimizers.RMSprop(lr=0.001),
             loss = 'binary_crossentropy',
             metrics = ['accuracy'])

## Using Custom Losses and Metrics

In [13]:
model.compile(optimizer=optimizers.RMSprop(lr=0.001),
             loss = losses.binary_crossentropy,
             metrics = [metrics.binary_accuracy])
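
To make the loss and metric concrete, the sketch below computes binary crossentropy and binary accuracy by hand for four made-up predictions. It follows the textbook definitions; the clipping constant and the 0.5 threshold are assumptions, and Keras's own implementation may differ in such details.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    # Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(y_true)

def binary_accuracy(y_true, y_pred, threshold=0.5):
    # Fraction of predictions on the correct side of the threshold
    hits = sum(1 for y, p in zip(y_true, y_pred) if (p > threshold) == (y == 1.0))
    return hits / len(y_true)

# Made-up labels and predicted probabilities, for illustration only
labels = [1.0, 0.0, 1.0, 0.0]
probs = [0.9, 0.2, 0.4, 0.6]

loss = binary_crossentropy(labels, probs)
acc = binary_accuracy(labels, probs)
print(round(loss, 4), acc)
```

The two confident, correct predictions contribute little to the loss, while the two predictions on the wrong side of 0.5 dominate it and halve the accuracy.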

## Validating Approach

In [14]:
x_val = x_train[:10000]
partial_x_train = x_train[10000:]
y_val = y_train[:10000]
partial_y_train = y_train[10000:]

In [24]:
model.compile(optimizer = 'rmsprop',
             loss = 'binary_crossentropy',
             metrics = ['acc'])
history = model.fit(partial_x_train,
                   partial_y_train,
                   epochs = 20,
                   batch_size = 512,
                   validation_data = (x_val, y_val))

Train on 15000 samples, validate on 10000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [25]:
history_dict = history.history
history_dict.keys()

dict_keys(['val_loss', 'val_acc', 'loss', 'acc'])

## Plotting Training and Validation Loss

In [27]:
history_dict = history.history
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']

# One x-axis value per training epoch, starting at 1
epochs = range(1, len(loss_values) + 1)

plt.plot(epochs, loss_values, 'bo', label='Training Loss')
plt.plot(epochs, val_loss_values, 'b', label='Validation Loss')
plt.title('Training and Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

In [18]:
# Plotting the training and validation accuracy
plt.clf()
acc_values = history_dict['acc']
val_acc_values = history_dict['val_acc']
epochs = range(1, len(acc_values) + 1)

plt.plot(epochs, acc_values, 'bo', label='Training Acc')
plt.plot(epochs, val_acc_values, 'b', label='Validation Acc')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
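
One common way to choose the number of epochs for a retraining run is to pick the epoch at which validation loss is lowest, since later epochs tend to overfit. The sketch below shows that selection on made-up numbers; `best_epoch` is a hypothetical helper, and the values are not this notebook's results. With a real run, `history.history['val_loss']` could be passed instead.

```python
def best_epoch(val_losses):
    # Epochs are 1-indexed; return the one with the lowest validation loss
    best_index = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return best_index + 1

# Made-up validation losses, for illustration only
example_val_loss = [0.40, 0.31, 0.28, 0.29, 0.33, 0.38]
chosen = best_epoch(example_val_loss)
print(chosen)
```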

---
## Retraining The Model From Scratch

In [21]:
model=models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop',
             loss='binary_crossentropy',
             metrics=['accuracy'])

model.fit(x_train, y_train, epochs=4, batch_size=512)
results = model.evaluate(x_test, y_test)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


In [22]:
results

[0.28798335553646087, 0.8858399987220764]

In [23]:
model.predict(x_test)

array([[0.18419805],
       [0.99913496],
       [0.86805964],
       ...,
       [0.1409882 ],
       [0.0762586 ],
       [0.47776952]], dtype=float32)
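
Each value above is the model's estimated probability that a review is positive. A minimal sketch of turning those probabilities into class labels, assuming the conventional 0.5 cutoff (`to_labels` is a hypothetical helper, and only the six values visible in the truncated output are used):

```python
# The six probabilities visible in the truncated predict output above
sample_probs = [0.18419805, 0.99913496, 0.86805964,
                0.1409882, 0.0762586, 0.47776952]

def to_labels(probabilities, threshold=0.5):
    # 1 = positive review, 0 = negative review
    return [1 if p > threshold else 0 for p in probabilities]

predicted = to_labels(sample_probs)
print(predicted)
```

Values near 0 or 1 indicate confident predictions, while a value such as 0.4778 sits close to the cutoff and is much less certain.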