# Analyzing IMDB Movie Data

**AIM:** To build a model that predicts the sentiment of an IMDB movie review.

A label of 0 is given to a negative review, and a label of 1 is given to a positive review.
- The labels come as a vector of 1's and 0's, one entry per review.

## Load and Prepare the Data

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import keras
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.preprocessing.text import Tokenizer

%matplotlib inline

Using TensorFlow backend.


In [2]:
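# Load the IMDB dataset that ships with Keras. num_words=2500 keeps only the
# most frequent words; rarer words are replaced by the out-of-vocabulary index 2.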
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=2500)

In [3]:
print('------ Data -----')
print('x_train:', x_train.shape)
print('x_test:', x_test.shape)
print('y_train:', y_train.shape)
print('y_test:', y_test.shape)

------ Data -----
x_train: (25000,)
x_test: (25000,)
y_train: (25000,)
y_test: (25000,)


In [4]:
x_train[0]

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 2, 66, 2, 4,
 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480,
 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 2, 1111,
 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22,
 4, 1920, 2, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15,
 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386,
 12, 8, 316, 8, 106, 5, 4, 2223, 2, 16, 480, 66, 2, 33, 4,
 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33,
 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4,
 107, 117, 2, 15, 256, 4, 2, 7, 2, 5, 723, 36, 71, 43, 530,
 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15,
 297, 98, 32, 2071, 56, 26, 141, 6, 194, 2, 18, 4, 226, 22, 21,
 134, 476, 26, 480, 5, 144, 30, 2, 18, 51, 36, 28, 224, 92, 25

In [5]:
y_train[0]

1

In [6]:
len(x_train)

25000

### Note on Provided Data:
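- The IMDB review data comes from the dataset bundled with Keras, in which every word has already been replaced by an integer.
- Keras provides the data cleaned and pre-processed.
- Each integer identifies a word by its frequency rank in the dataset, so a review is a sequence of word indices rather than raw text. Index 2 stands in for any word outside the 2,500-word vocabulary, which is why it appears so often in `x_train[0]`.
- The labels come as a vector of 1's and 0's, where 1 is a positive sentiment for the review, and 0 is negative.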
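
To make the integer encoding concrete, here is a minimal sketch of how a tokenized review could be turned into indices and back. It assumes the default conventions of `imdb.load_data` (index 1 marks the start of a review, index 2 replaces out-of-vocabulary words, and frequency ranks are shifted by 3). The tiny `word_rank` dictionary and the `encode` and `decode` helpers are hypothetical, standing in for the real mapping returned by `imdb.get_word_index()`.

```python
# Hypothetical frequency ranks (1 = most frequent word); a stand-in for
# the real mapping returned by imdb.get_word_index().
word_rank = {"the": 1, "film": 19, "brilliant": 530, "zeitgeist": 40000}

START, OOV, INDEX_FROM = 1, 2, 3  # load_data defaults: start_char, oov_char, index_from
NUM_WORDS = 2500                  # same cap as in cell In [2]

def encode(words):
    """Map words to integers: shift each rank by INDEX_FROM, cap at NUM_WORDS."""
    encoded = [START]
    for word in words:
        index = word_rank[word] + INDEX_FROM
        encoded.append(index if index < NUM_WORDS else OOV)
    return encoded

def decode(indices):
    """Map integers back to words, using placeholders for the reserved indices."""
    index_to_word = {rank + INDEX_FROM: word for word, rank in word_rank.items()}
    index_to_word.update({0: "<PAD>", START: "<START>", OOV: "<OOV>"})
    return " ".join(index_to_word[i] for i in indices)

review = encode(["the", "film", "zeitgeist"])
print(review)          # [1, 4, 22, 2]
print(decode(review))  # <START> the film <OOV>
```

The rare word is lost in the round trip: once it has been mapped to index 2 there is no way to recover which word it was.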

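### What Comes Next?
As every word corresponds to an integer index, each review can be turned into a fixed-length vector with one-hot encoding.

## Performing One-hot Encoding
- One-hot encoding gives every word in the vocabulary its own position in the vector.
- A position holds 1 if the word occurs in the review and 0 otherwise, so word order and repeat counts are discarded (a binary bag of words).
- For example, with a vocabulary of 10 words, a review containing words (3,6,1) corresponds to the vector [1,0,1,0,0,1,0,0,0,0], counting positions from 1.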
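
The encoding can be sketched in a few lines of plain Python. The `multi_hot` helper below is hypothetical; it mirrors what `sequences_to_matrix(..., mode='binary')` is understood to do for a single review. Columns here are counted from 0, so word index 1 lands in the second slot; counted from 1, the same three words give the pattern shown in the example above.

```python
def multi_hot(sequence, num_words):
    """Return a 0/1 vector with a 1 at every index that occurs in `sequence`."""
    vector = [0.0] * num_words
    for index in sequence:
        if index < num_words:  # indices outside the vocabulary are skipped
            vector[index] = 1.0
    return vector

print(multi_hot([3, 6, 1], num_words=10))
# [0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]

# Repeated words and word order make no difference in binary mode.
assert multi_hot([1, 3, 3, 6], 10) == multi_hot([3, 6, 1], 10)
```

Column 0 stays empty for every review, since index 0 is reserved and never assigned to a word.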

In [7]:
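# Tokenizer can vectorize a corpus, turning each text into either a sequence
# of integers or a fixed-length vector. The reviews are already integer
# sequences, so only sequences_to_matrix is needed here; mode='binary'
# records whether each word index is present in a review.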

t = Tokenizer(num_words=2500)
x_train = t.sequences_to_matrix(x_train, mode='binary')
x_test = t.sequences_to_matrix(x_test, mode='binary')

In [8]:
x_train[0]

array([ 0.,  1.,  1., ...,  0.,  0.,  0.])

In [9]:
x_test[0]

array([ 0.,  1.,  1., ...,  0.,  0.,  0.])

In [10]:
# One-hot encoding the output
num_classes = 2
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
print(y_train.shape)
print(y_test.shape)

(25000, 2)
(25000, 2)
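
The label transformation is simple enough to sketch by hand. The `to_one_hot` helper below is hypothetical and only illustrates the idea behind `keras.utils.to_categorical` for integer class labels: each label becomes a row with a single 1 in the column of its class.

```python
def to_one_hot(labels, num_classes):
    """Turn integer class labels into rows with a single 1.0 per row."""
    return [[1.0 if k == label else 0.0 for k in range(num_classes)]
            for label in labels]

print(to_one_hot([1, 0, 1], num_classes=2))
# [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
```

With two classes, column 0 means "negative" and column 1 means "positive", which is why the label arrays go from shape (25000,) to (25000, 2).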


## Building the Model

In [11]:
model = Sequential()
model.add(Dense(512, activation='relu', input_dim=2500))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 512)               1280512   
_________________________________________________________________
dropout_1 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 1026      
=================================================================
Total params: 1,281,538
Trainable params: 1,281,538
Non-trainable params: 0
_________________________________________________________________
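
The parameter counts in the summary can be checked by hand: a fully connected layer has one weight per input-output pair plus one bias per output unit, and dropout has no parameters. The `dense_params` helper is a hypothetical name used only for this check.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden = dense_params(2500, 512)  # dense_1: 2500 inputs -> 512 units
output = dense_params(512, 2)     # dense_2: 512 inputs -> 2 classes

print(hidden)           # 1280512
print(output)           # 1026
print(hidden + output)  # 1281538
```

Nearly all of the parameters sit in the first layer, because its size scales with the vocabulary size passed as `input_dim`.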


In [12]:
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

## Training the Model

In [13]:
here = model.fit(x_train, y_train,
                 batch_size=32,
                 epochs=10,
                 validation_data=(x_test, y_test),
                 verbose=2)

Train on 25000 samples, validate on 25000 samples
Epoch 1/10
 - 11s - loss: 0.3641 - acc: 0.8502 - val_loss: 0.3071 - val_acc: 0.8738
Epoch 2/10
 - 11s - loss: 0.2889 - acc: 0.8909 - val_loss: 0.3356 - val_acc: 0.8696
Epoch 3/10
 - 11s - loss: 0.2727 - acc: 0.9018 - val_loss: 0.3229 - val_acc: 0.8780
Epoch 4/10
 - 11s - loss: 0.2611 - acc: 0.9069 - val_loss: 0.3742 - val_acc: 0.8696
Epoch 5/10
 - 11s - loss: 0.2516 - acc: 0.9126 - val_loss: 0.3506 - val_acc: 0.8762
Epoch 6/10
 - 11s - loss: 0.2410 - acc: 0.9170 - val_loss: 0.3817 - val_acc: 0.8762
Epoch 7/10
 - 10s - loss: 0.2365 - acc: 0.9216 - val_loss: 0.3868 - val_acc: 0.8768
Epoch 8/10
 - 11s - loss: 0.2241 - acc: 0.9258 - val_loss: 0.4029 - val_acc: 0.8769
Epoch 9/10
 - 11s - loss: 0.2209 - acc: 0.9299 - val_loss: 0.4381 - val_acc: 0.8722
Epoch 10/10
 - 11s - loss: 0.2127 - acc: 0.9326 - val_loss: 0.4329 - val_acc: 0.8777
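
In the log above, training loss falls every epoch while validation loss is lowest after the first epoch and drifts upward afterwards, a pattern usually read as overfitting. The sketch below finds the best epochs from the numbers copied out of the log. In a live session the same lists should be available from `here.history`, although the exact key names (for example `'val_acc'` versus `'val_accuracy'`) depend on the Keras version, so that part is an assumption.

```python
# Values copied from the training log above, one entry per epoch.
train_loss = [0.3641, 0.2889, 0.2727, 0.2611, 0.2516,
              0.2410, 0.2365, 0.2241, 0.2209, 0.2127]
val_loss = [0.3071, 0.3356, 0.3229, 0.3742, 0.3506,
            0.3817, 0.3868, 0.4029, 0.4381, 0.4329]
val_acc = [0.8738, 0.8696, 0.8780, 0.8696, 0.8762,
           0.8762, 0.8768, 0.8769, 0.8722, 0.8777]

# Epochs are numbered from 1 in the log, list positions from 0.
best_loss_epoch = val_loss.index(min(val_loss)) + 1
best_acc_epoch = val_acc.index(max(val_acc)) + 1

# True when every epoch improves on the previous one.
train_loss_always_falls = all(a > b for a, b in zip(train_loss, train_loss[1:]))

print(best_loss_epoch)          # 1
print(best_acc_epoch)           # 3
print(train_loss_always_falls)  # True
```

Validation accuracy stays within about one percentage point across all ten epochs, so the extra epochs mostly improve the fit to the training data.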


## Evaluating the Model

In [14]:
# evaluate returns [loss, accuracy] because the model was compiled with metrics=['accuracy']
score = model.evaluate(x_test, y_test, verbose=0)
print("Accuracy: ", score[1])

Accuracy:  0.87772
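
For intuition about what this number measures, here is a minimal sketch of classification accuracy computed from softmax outputs and one-hot labels: a prediction counts as correct when the most probable class matches the labelled class. The `argmax` and `accuracy` helpers and the four example predictions are hypothetical, not values produced by the model above.

```python
def argmax(row):
    """Index of the largest value in a row."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(probabilities, one_hot_labels):
    """Fraction of rows whose most probable class matches the label."""
    hits = sum(argmax(p) == argmax(y)
               for p, y in zip(probabilities, one_hot_labels))
    return hits / len(one_hot_labels)

# Hypothetical softmax outputs for four reviews: [P(negative), P(positive)].
predictions = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
labels = [[1, 0], [0, 1], [0, 1], [0, 1]]

print(accuracy(predictions, labels))  # 0.75
```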
