
## Loading the data

The data comes preloaded in Keras, so we don't need to open or read any files manually. The call below loads it and returns the reviews already split into training and testing sets, each with its labels. Every review is encoded as a sequence of integer word indices:

In [0]:
import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.datasets import imdb
from keras.preprocessing.text import Tokenizer
import matplotlib.pyplot as plt
%matplotlib inline

(X_train, y_train), (X_test, y_test) = imdb.load_data(path="imdb.npz",
                                                     num_words=None,
                                                     skip_top=0,
                                                     maxlen=None,
                                                     seed=113,
                                                     start_char=1,
                                                     oov_char=2,
                                                     index_from=3)
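
To make the `start_char`, `oov_char`, and `index_from` arguments more concrete, here is a minimal sketch of how an encoded review could be mapped back to words. The `toy_word_index` dictionary and the `decode_review` helper are hypothetical stand-ins for illustration; with the real dataset the mapping would come from `imdb.get_word_index()`.

```python
# Toy stand-in for imdb.get_word_index(); the real mapping is much larger.
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}

# Same values as passed to imdb.load_data above.
START_CHAR, OOV_CHAR, INDEX_FROM = 1, 2, 3

def decode_review(sequence, word_index, index_from=INDEX_FROM):
    """Map a list of integer tokens back to a space-separated string."""
    # Real words are shifted by index_from so that the low indices
    # stay free for the special start and out-of-vocabulary markers.
    reverse_index = {rank + index_from: word for word, rank in word_index.items()}
    special = {START_CHAR: "<start>", OOV_CHAR: "<oov>"}
    return " ".join(
        special.get(tok, reverse_index.get(tok, "<unk>")) for tok in sequence
    )

encoded = [1, 4, 5, 6, 2, 7]
decoded = decode_review(encoded, toy_word_index)
print(decoded)  # <start> the movie was <oov> great
```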

## One-hot encoding the input and output

Here, we'll turn each input sequence into a (0,1)-vector of fixed length 1000. For example, if the pre-processed sequence contains the number 14, then the entry at index 14 of the processed vector will be 1. Word indices of 1000 or above are dropped.
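
As a rough illustration of what binary encoding does, the following NumPy sketch builds the same kind of matrix by hand. The helper name `sequences_to_binary_matrix` and the toy sequences are made up for this example; it is meant to mirror the behavior of `sequences_to_matrix(..., mode='binary')`, not to replace it.

```python
import numpy as np

def sequences_to_binary_matrix(sequences, num_words):
    """Return a (len(sequences), num_words) matrix of 0s and 1s."""
    matrix = np.zeros((len(sequences), num_words))
    for row, seq in enumerate(sequences):
        for idx in seq:
            if idx < num_words:  # indices outside the vocabulary are dropped
                matrix[row, idx] = 1.0  # repeated words still give a single 1
    return matrix

# Two toy "reviews"; 14 appears twice, 1200 is out of range for num_words=20.
toy_sequences = [[1, 14, 14, 3], [2, 1200]]
toy_encoded = sequences_to_binary_matrix(toy_sequences, num_words=20)
print(toy_encoded[0])
```

Note that this encoding discards both word order and word counts: only the presence or absence of each word survives.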

In [61]:
# One-hot encoding the input into binary vectors, each of length 1000
tokenizer = Tokenizer(num_words=1000)
X_train = tokenizer.sequences_to_matrix(X_train, mode='binary')
X_test = tokenizer.sequences_to_matrix(X_test, mode='binary')
print(X_train[0])

[0. 1. 0. 0. 1. 1. 1. 1. 1. 1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0.
 0. 1. 1. 0. 1. 0. 1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 0. 0. 0. 1. 0. 0. 1. 0.
 1. 0. 1. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 1. 1. 0. 0. 0. 0. 1.
 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0.
 0. 0. 1. 0. 1. 0. 0. 1. 1. 0. 1. 1. 0. 0. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0.
 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0. 0.
 1. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.
 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.
 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.

In [62]:
# One-hot encoding the output
num_classes = 2
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
print(y_train.shape)
print(y_test.shape)

(25000, 2)
(25000, 2)
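
For intuition, the label encoding can be reproduced with plain NumPy. This is only a sketch on a few made-up labels, not the Keras implementation: indexing an identity matrix with the integer labels yields one row per label with a single 1 in the column of its class.

```python
import numpy as np

toy_labels = np.array([1, 0, 0, 1])  # 1 = positive review, 0 = negative
toy_num_classes = 2

# Row i of the identity matrix is the one-hot vector for class i.
toy_one_hot = np.eye(toy_num_classes)[toy_labels]
print(toy_one_hot)

# argmax along each row recovers the original integer labels.
recovered = np.argmax(toy_one_hot, axis=1)
```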


## Building the model

In [63]:
# Building the model architecture with one hidden layer
model = Sequential()
model.add(Dense(500, activation='relu', input_dim=1000))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))
model.summary()

# Compiling the model using categorical_crossentropy loss, and rmsprop optimizer.
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

Model: "sequential_11"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_12 (Dense)             (None, 500)               500500    
_________________________________________________________________
dropout_6 (Dropout)          (None, 500)               0         
_________________________________________________________________
dense_13 (Dense)             (None, 2)                 1002      
Total params: 501,502
Trainable params: 501,502
Non-trainable params: 0
_________________________________________________________________
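
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit, and dropout adds no parameters. The helper `dense_params` below is a hypothetical name used only for this arithmetic check.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden_params = dense_params(1000, 500)  # 1000 inputs -> 500 hidden units
output_params = dense_params(500, 2)     # 500 hidden units -> 2 classes
total_params = hidden_params + output_params

print(hidden_params, output_params, total_params)  # 500500 1002 501502
```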


## Training the model

In [64]:
# Training the model, using the test set as validation data
hist = model.fit(X_train, y_train,
                batch_size=32,
                epochs=100,
                validation_data=(X_test, y_test), 
                verbose=2)

Train on 25000 samples, validate on 25000 samples
Epoch 1/100
 - 8s - loss: 0.3965 - acc: 0.8256 - val_loss: 0.3825 - val_acc: 0.8396
Epoch 2/100
 - 8s - loss: 0.3304 - acc: 0.8688 - val_loss: 0.3560 - val_acc: 0.8606
Epoch 3/100
 - 7s - loss: 0.3218 - acc: 0.8770 - val_loss: 0.3616 - val_acc: 0.8592
Epoch 4/100
 - 7s - loss: 0.3113 - acc: 0.8842 - val_loss: 0.3650 - val_acc: 0.8613
Epoch 5/100
 - 7s - loss: 0.3025 - acc: 0.8906 - val_loss: 0.3892 - val_acc: 0.8580
Epoch 6/100
 - 7s - loss: 0.2971 - acc: 0.8994 - val_loss: 0.3964 - val_acc: 0.8613
Epoch 7/100
 - 8s - loss: 0.2860 - acc: 0.9019 - val_loss: 0.4258 - val_acc: 0.8578
Epoch 8/100
 - 8s - loss: 0.2759 - acc: 0.9080 - val_loss: 0.4390 - val_acc: 0.8556
Epoch 9/100
 - 7s - loss: 0.2677 - acc: 0.9148 - val_loss: 0.4847 - val_acc: 0.8514
Epoch 10/100
 - 7s - loss: 0.2526 - acc: 0.9210 - val_loss: 0.4947 - val_acc: 0.8548
Epoch 11/100
 - 7s - loss: 0.2407 - acc: 0.9278 - val_loss: 0.4985 - val_acc: 0.8556
Epoch 12/100
 - 8s - los
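
In the epochs shown above, the training loss keeps falling while the validation loss reaches its minimum at epoch 2 and then climbs, which suggests the model starts overfitting early. The sketch below replays the logged `val_loss` values through a simple patience rule to show where training could have stopped. The function `early_stop_epoch` and the patience of 3 are assumptions for illustration; Keras provides a `keras.callbacks.EarlyStopping` callback for doing this during `fit`.

```python
# val_loss values copied from epochs 1-11 of the log above.
val_losses = [0.3825, 0.3560, 0.3616, 0.3650, 0.3892, 0.3964,
              0.4258, 0.4390, 0.4847, 0.4947, 0.4985]

def early_stop_epoch(losses, patience):
    """Return the 1-based epoch at which a patience rule would stop."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best, wait = loss, 0  # new best: reset the counter
        else:
            wait += 1             # no improvement this epoch
            if wait >= patience:
                return epoch
    return len(losses)            # never triggered within the log

best_epoch = val_losses.index(min(val_losses)) + 1
stop_epoch = early_stop_epoch(val_losses, patience=3)
print(best_epoch, stop_epoch)  # 2 5
```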

## Evaluating the model

In [66]:
score = model.evaluate(X_test, y_test, verbose=2)

print("Accuracy: {}".format(score[1]))

Accuracy: 0.8428
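
Because the model was compiled with `metrics=['accuracy']`, `score[0]` is the loss and `score[1]` is the accuracy. As a sketch of what that accuracy measures, the snippet below computes it by hand on a few made-up softmax outputs: the predicted class is the column with the highest probability, and accuracy is the fraction of rows where it matches the one-hot label.

```python
import numpy as np

# Made-up softmax outputs for four reviews (columns: negative, positive).
toy_probs = np.array([[0.9, 0.1],
                      [0.2, 0.8],
                      [0.6, 0.4],
                      [0.3, 0.7]])

# Matching one-hot labels, in the same format as y_test.
toy_true = np.array([[1, 0],
                     [0, 1],
                     [0, 1],
                     [0, 1]])

predicted = np.argmax(toy_probs, axis=1)
actual = np.argmax(toy_true, axis=1)
toy_accuracy = np.mean(predicted == actual)
print(toy_accuracy)  # 0.75
```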
