<a href="https://colab.research.google.com/github/Denjj/imdb-text-analysis/blob/master/imdb_text_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Importing libraries and dataset

In [0]:
# TensorFlow and tf.keras
import tensorflow as tf
from tensorflow import keras

# Helper libraries
from keras.datasets import imdb
import numpy as num

# Import the imdb dataset into tensorflow
(x_train, y_train),(x_test, y_test) = imdb.load_data()

# Imdb dataset contains 25,000 movie reviews from IMDB
# Data is preprocessed. Each review is encoded as a sequence of word indexes (integers)
# Words are indexed by overall frequency in the dataset.
# X axis contains the sequence of integers
# Y axis contains the label of the review, Negative(0) or Positive(1)

Using TensorFlow backend.


Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz


# Observing the dataset

In [0]:
print("Number of reviews in train data set is {}".format(len(x_train)))
print("Number of labels in train data set is {}".format(len(y_train)))
print("Number of reviews in test set is {}".format(len(x_test)))
print("Number of labels in test set is {}".format(len(y_test)))

Number of reviews in train data set is 25000
Number of labels in train data set is 25000
Number of reviews in test set is 25000
Number of labels in test set is 25000


In [0]:
for i in range(15):
  print("Review #{} contains a sequence of length {}".format(i+1, len(x_train[i])))

Review #1 contains a sequence of length 218
Review #2 contains a sequence of length 189
Review #3 contains a sequence of length 141
Review #4 contains a sequence of length 550
Review #5 contains a sequence of length 147
Review #6 contains a sequence of length 43
Review #7 contains a sequence of length 123
Review #8 contains a sequence of length 562
Review #9 contains a sequence of length 233
Review #10 contains a sequence of length 130
Review #11 contains a sequence of length 450
Review #12 contains a sequence of length 99
Review #13 contains a sequence of length 117
Review #14 contains a sequence of length 238
Review #15 contains a sequence of length 109


Line 27 shows that each review contains a different amount of integers in the sequence.
This make sense because not every movie review contains the same number of words.
Because of this, the data needs to be standardized with the same length for each sequence

# Preparing the dataset

In [0]:
# Truncate the review sequence length into 256 integers, for both training and test data
# Sequences less than 256 integers in length will be padded with 0 values 
x_train = keras.preprocessing.sequence.pad_sequences(x_train, value=0, maxlen=256)
x_test = keras.preprocessing.sequence.pad_sequences(x_test, value=0, maxlen=256)

In [0]:
for i in range(15):
  print("Review #{} contains a sequence of length {}".format(i+1, len(x_train[i])))

Review #1 contains a sequence of length 256
Review #2 contains a sequence of length 256
Review #3 contains a sequence of length 256
Review #4 contains a sequence of length 256
Review #5 contains a sequence of length 256
Review #6 contains a sequence of length 256
Review #7 contains a sequence of length 256
Review #8 contains a sequence of length 256
Review #9 contains a sequence of length 256
Review #10 contains a sequence of length 256
Review #11 contains a sequence of length 256
Review #12 contains a sequence of length 256
Review #13 contains a sequence of length 256
Review #14 contains a sequence of length 256
Review #15 contains a sequence of length 256


# Creating the neural network


In [0]:
model = keras.Sequential()
model.add(keras.layers.Embedding(100000, 16)) # Input shape is 100,000 because anything lower errors during model fitting
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation=tf.nn.relu))
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid)) # Sigmoid activation function is ideal for binary targets

model.summary()

Instructions for updating:
Colocations handled automatically by placer.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 16)          1600000   
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dense (Dense)                (None, 16)                272       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 1,600,289
Trainable params: 1,600,289
Non-trainable params: 0
_________________________________________________________________


# Training and deploying neural network

In [0]:
# Train and deploy neural network
# Model is compiled using binary crossentropy loss reduction since targets are in binary format
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=10, verbose=1)

Train on 25000 samples, validate on 25000 samples
Instructions for updating:
Use tf.cast instead.
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fe83e373a58>

Training after 10 epochs resulted in an accuracy of 87.82%.

# Optional: Manipulating data for higher accuracy

**Test 1: Pad sequence to 128 integers**  


In [0]:
# This time, the review sequence length is truncated into 128 integers
x_train = keras.preprocessing.sequence.pad_sequences(x_train, value=0, maxlen=128)
x_test = keras.preprocessing.sequence.pad_sequences(x_test, value=0, maxlen=128)

In [0]:
for i in range(15):
  print("Review #{} contains a sequence of length {}".format(i+1, len(x_train[i])))

Review #1 contains a sequence of length 128
Review #2 contains a sequence of length 128
Review #3 contains a sequence of length 128
Review #4 contains a sequence of length 128
Review #5 contains a sequence of length 128
Review #6 contains a sequence of length 128
Review #7 contains a sequence of length 128
Review #8 contains a sequence of length 128
Review #9 contains a sequence of length 128
Review #10 contains a sequence of length 128
Review #11 contains a sequence of length 128
Review #12 contains a sequence of length 128
Review #13 contains a sequence of length 128
Review #14 contains a sequence of length 128
Review #15 contains a sequence of length 128


In [0]:
model2 = keras.Sequential()
model2.add(keras.layers.Embedding(100000, 16))
model2.add(keras.layers.GlobalAveragePooling1D())
model2.add(keras.layers.Dense(16, activation=tf.nn.relu))
model2.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))

model2.summary()

model2.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
model2.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=10, verbose=1)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, None, 16)          1600000   
_________________________________________________________________
global_average_pooling1d_1 ( (None, 16)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 16)                272       
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 17        
Total params: 1,600,289
Trainable params: 1,600,289
Non-trainable params: 0
_________________________________________________________________
Train on 25000 samples, validate on 25000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fe83efc9518>

Padding the review sequence length into 128 integers, down from 256 integers, resulted in an accuracy loss of .0168 with the test data.

**Test 2: Pad sequence to 256 Integers**

In [0]:
# Now, the review sequence length is truncated to 512 integers
(x_train2, y_train2),(x_test2, y_test2) = imdb.load_data()
x_train2 = keras.preprocessing.sequence.pad_sequences(x_train, value=0, maxlen=512)
x_test2 = keras.preprocessing.sequence.pad_sequences(x_test, value=0, maxlen=512)

In [0]:
model3 = keras.Sequential()
model3.add(keras.layers.Embedding(100000, 16))
model3.add(keras.layers.GlobalAveragePooling1D())
model3.add(keras.layers.Dense(16, activation=tf.nn.relu))
model3.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))

model3.summary()

model3.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
model3.fit(x_train2, y_train2, validation_data=(x_test2, y_test2), epochs=10, verbose=1)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, None, 16)          1600000   
_________________________________________________________________
global_average_pooling1d_2 ( (None, 16)                0         
_________________________________________________________________
dense_4 (Dense)              (None, 16)                272       
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 17        
Total params: 1,600,289
Trainable params: 1,600,289
Non-trainable params: 0
_________________________________________________________________
Train on 25000 samples, validate on 25000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fe83c30a278>

Truncating the review sequence length into 512 integers resulted in an accuracy loss of .0449 with the test data, as compared to the original 256 sequence length.