In [26]:
# Neural network movie review classification

In [27]:
# Classification of movie reviews as good or bad

In [25]:
!pip install numpy==1.16.1



In [2]:
# Loading Data
# The dataset we will use for these next tutorials is the IMDB movie dataset from keras.

In [3]:
import tensorflow as tf
from tensorflow import keras
import numpy

imdb = keras.datasets.imdb

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


In [5]:
print(train_data[0])

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]


In [4]:
# Integer Encoded Data
# Looking at our data we'll notice that our reviews are integer encoded: each word in a review is represented by a positive integer that stands for a specific word. This is necessary because we cannot pass strings to our neural network. However, if we (as humans) want to read our reviews and see what they look like, we'll have to turn those integer encoded reviews back into strings. The following code will do this for us:

In [6]:
# A dictionary mapping words to an integer index
word_index = imdb.get_word_index() 

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json


In [8]:
word_index = {k:(v+3) for k,v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3

In [10]:
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

In [11]:
def decode_review(text):
	return " ".join([reverse_word_index.get(i, "?") for i in text])

# this function will return the decoded (human readable) reviews 

In [12]:
# We start by getting a dictionary that maps all of our words to an integer, add some more keys to it like <PAD>, <START>, etc., and then reverse that dictionary so we can use integers as keys that map to each word. The function defined above takes an integer encoded review as a list and returns the human readable version.
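
# As a rough illustration of the shift-and-reverse step, here is a minimal sketch on a made-up three-word vocabulary. The names toy_word_index, toy_reverse and toy_decode are hypothetical and only stand in for the real IMDB dictionary.

```python
# Hypothetical three-word vocabulary standing in for imdb.get_word_index()
toy_word_index = {"the": 1, "film": 2, "great": 3}

# Shift every index up by 3 to free slots 0-3 for the special tokens
toy_word_index = {word: idx + 3 for word, idx in toy_word_index.items()}
toy_word_index["<PAD>"] = 0
toy_word_index["<START>"] = 1
toy_word_index["<UNK>"] = 2
toy_word_index["<UNUSED>"] = 3

# Reverse the mapping so the integers become the keys
toy_reverse = {idx: word for word, idx in toy_word_index.items()}

def toy_decode(encoded):
    # Integers missing from the mapping fall back to "?", as in decode_review
    return " ".join(toy_reverse.get(i, "?") for i in encoded)

decoded = toy_decode([1, 4, 5, 2, 6, 0, 99])
print(decoded)
```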

In [16]:
# Let's test it
print(decode_review(test_data[1]))

<START> this film requires a lot of patience because it focuses on mood and character development the plot is very simple and many of the scenes take place on the same set in frances <UNK> the sandy dennis character apartment but the film builds to a disturbing climax br br the characters create an atmosphere <UNK> with sexual tension and psychological <UNK> it's very interesting that robert altman directed this considering the style and structure of his other films still the trademark altman audio style is evident here and there i think what really makes this film work is the brilliant performance by sandy dennis it's definitely one of her darker characters but she plays it so perfectly and convincingly that it's scary michael burns does a good job as the mute young man regular altman player michael murphy has a small part the <UNK> moody set fits the content of the story very well in short this movie is a powerful study of loneliness sexual <UNK> and desperation be patient <UNK> up t

In [18]:
# Let's check the length of different movie reviews
print(len(test_data[0]), len(test_data[1]))

68 260


In [20]:
# Preprocessing Data
# If we look at the reviews loaded above we'll notice that they have different lengths (68 and 260 words for the first two test reviews). This is an issue: we cannot pass inputs of different lengths into our neural network, so we must make each review the same length. To do this we will follow the procedure below:
# - if the review is longer than 250 words (250 is an arbitrary choice), trim off the extra words
# - if the review is shorter than 250 words, add as many <PAD> tokens as needed to bring it up to 250.

# Luckily, Keras has a function that does this for us:

In [21]:
train_data = keras.preprocessing.sequence.pad_sequences(train_data, value=word_index["<PAD>"], padding="post", maxlen=250)
test_data = keras.preprocessing.sequence.pad_sequences(test_data, value=word_index["<PAD>"], padding="post", maxlen=250)

In [23]:
# Let's check the length of different movie reviews now
print(len(test_data[0]), len(test_data[1]))

250 250
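
# To make the padding step less of a black box, here is a pure-Python sketch of roughly what the call above does for a single review. pad_or_trim is a hypothetical helper, not part of Keras. It assumes padding="post" as used above and that long reviews lose words from the front, which is the pad_sequences default (truncating="pre") as far as I know.

```python
def pad_or_trim(sequence, maxlen=250, pad_value=0):
    # Keep only the last `maxlen` tokens when the review is too long
    trimmed = sequence[-maxlen:]
    # Append padding at the end ("post") until the length is exactly maxlen
    return trimmed + [pad_value] * (maxlen - len(trimmed))

short_review = [1, 14, 22]
long_review = list(range(1, 301))  # 300 made-up token ids

padded = pad_or_trim(short_review, maxlen=5)
trimmed = pad_or_trim(long_review)
print(padded)
print(len(trimmed), trimmed[0])
```

# If the end of a review matters less than its beginning, pad_sequences also accepts truncating="post" to cut from the end instead.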


In [22]:
# Defining the Model

In [24]:
model = keras.Sequential()
model.add(keras.layers.Embedding(88000, 16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation="relu"))
model.add(keras.layers.Dense(1, activation="sigmoid"))

model.summary()  # prints a summary of the model

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 16)          1408000   
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dense (Dense)                (None, 16)                272       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 1,408,289
Trainable params: 1,408,289
Non-trainable params: 0
_________________________________________________________________
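
# The parameter counts in the summary can be checked by hand. The sketch below simply redoes the arithmetic from the layer sizes used in the model definition; the variable names are made up for illustration.

```python
vocab_size, embedding_dim, hidden_units = 88000, 16, 16

# One 16-dimensional vector per integer index
embedding_params = vocab_size * embedding_dim
# Averaging has no trainable weights
pooling_params = 0
# Weights plus one bias per neuron
dense_params = embedding_dim * hidden_units + hidden_units
# A single output neuron: 16 weights plus 1 bias
output_params = hidden_units * 1 + 1

total = embedding_params + pooling_params + dense_params + output_params
print(embedding_params, dense_params, output_params, total)
```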


In [29]:
# Word Embedding Layer
# We want a way to determine not only the contents of a sentence but the context of the sentence. A word embedding layer will attempt to determine the meaning of each word in the sentence by mapping each word to a position in vector space.
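
# Mechanically, an embedding layer is a lookup table with one row of numbers per integer index. The sketch below is a toy version in plain Python: toy_vocab_size, toy_dim, embedding_table and embed are made-up names, and the values are random here, whereas the real layer learns them during training.

```python
import random

random.seed(0)
toy_vocab_size, toy_dim = 10, 4  # made-up sizes; the model above uses 88000 and 16

# The embedding table: one row of `toy_dim` floats per integer index
embedding_table = [[random.uniform(-0.05, 0.05) for _ in range(toy_dim)]
                   for _ in range(toy_vocab_size)]

def embed(encoded_review):
    # Looking a word up is just indexing into the table
    return [embedding_table[i] for i in encoded_review]

vectors = embed([1, 4, 4, 7])
print(len(vectors), len(vectors[0]))
```

# The same integer always maps to the same vector, so words only end up close together in vector space because training moves their rows.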

In [30]:
# GlobalAveragePooling1D Layer
# This layer averages the word vectors across the whole review, collapsing the sequence dimension so the later layers have less to compute. The embedding layer outputs one 16-dimensional vector per word (shape (None, None, 16) in the model summary); this layer takes the mean over all the words and returns a single 16-dimensional vector per review (shape (None, 16)).
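
# A minimal sketch of that averaging in plain Python, assuming a review of three words with made-up 2-dimensional vectors. global_average_pool is a hypothetical helper written for illustration, not the Keras implementation.

```python
def global_average_pool(vectors):
    # Average each embedding dimension across all words in the review
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Three made-up 2-d word vectors for one review
word_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = global_average_pool(word_vectors)
print(pooled)
```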

In [31]:
# Dense Layers
# The last two layers in our network are dense, fully connected layers. The output layer is a single neuron that uses the sigmoid function to produce a value between 0 and 1, which represents how likely the review is to be positive (values near 1) rather than negative (values near 0). The layer before it contains 16 neurons with a relu activation function and is meant to find patterns between the different words present in the review.
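
# To see why the sigmoid output can be read as a score between 0 and 1, here is a small sketch of the function itself. The raw input values are made up; in the model they come from the weighted sum computed by the output neuron.

```python
import math

def sigmoid(x):
    # Squashes any real number into the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

for raw in (-4.0, 0.0, 4.0):
    print(raw, round(sigmoid(raw), 3))
```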

In [33]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [32]:
# Validation Data
# For this model we will introduce a new idea: validation data. In the last tutorial we checked the model's accuracy after each epoch on the current training data, which is data the model had already seen. This can be problematic because a model can simply memorize the input data and its related output, and that accuracy then affects how the model is modified as it trains. To avoid this we will separate our training data into two sections, training and validation. The model will use the validation data to check accuracy after learning from the training data, which should help us avoid false confidence in our model.

# We can split our training data into validation data like so:

In [34]:
x_val = train_data[:10000]
x_train = train_data[10000:]

y_val = train_labels[:10000]
y_train = train_labels[10000:]
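
# The slicing above puts the first 10,000 reviews in the validation set and leaves the rest for training. The toy sketch below shows the same pattern on a small made-up list so the sizes are easy to check; samples and n_val are stand-ins, not the real data.

```python
samples = list(range(25))  # stand-in for the training reviews
n_val = 10                 # stand-in for the 10,000 used above

val_part = samples[:n_val]
train_part = samples[n_val:]

# The two parts never share an item and together cover everything
print(len(val_part), len(train_part))
```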

In [37]:
# training model

fitModel = model.fit(x_train, y_train, epochs=40, batch_size=512, validation_data=(x_val, y_val), verbose=1)

# I experimented with different numbers of epochs; the best accuracy came with epochs=40

Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40


In [38]:
# testing the model

results = model.evaluate(test_data, test_labels)
print(results)

[0.8018378615379333, 0.8477200269699097]


In [41]:
# make prediction and compare with actual

test_review = test_data[0]
predict = model.predict(numpy.array([test_review]))  # wrap in an array to add the batch dimension

print("review: ")

print(decode_review(test_review))

print("Prediction " + str(predict[0]))

print("Actual " + str(test_labels[0]))

print(results)

review: 
<START> please give this one a miss br br <UNK> <UNK> and the rest of the cast rendered terrible performances the show is flat flat flat br br i don't know how michael madison could have allowed this one on his plate he almost seemed to know this wasn't going to work out and his performance was quite <UNK> so all you madison fans give this a miss <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> 

In [47]:
# Now we will save our model and then reuse it to classify other movie reviews

In [43]:
# Saving the Model
model.save("model.h5")  # name it whatever you want but end with .h5

In [44]:
# Loading the model

model = keras.models.load_model("model.h5")

In [45]:
# Making Predictions
# Now it is time to use our saved model to make predictions. This is a little harder than it looks because we need to consider the following:
# – our model accepts integer encoded data
# – our model needs reviews that are of length 250 words

# This means we can’t just pass any string of text into our model. It will need to be reshaped and reformed to meet the criteria above.

In [46]:
# Transforming our Data

# The data I’ll use for this tutorial is the raw text of a movie review of The Lion King

In [48]:
# To start we will need to integer encode the data. We will do this using the following function:

In [49]:
def review_encode(s):
	encoded = [1]  # every review starts with the <START> token (1)

	for word in s:
		if word.lower() in word_index:
			encoded.append(word_index[word.lower()])
		else:
			encoded.append(2)  # a word missing from word_index is encoded as 2, the <UNK> (unknown) token

	return encoded

In [50]:
with open("test.txt", encoding="utf-8") as f:
	for line in f.readlines():
		nline = line.replace(",", "").replace(".", "").replace("(", "").replace(")", "").replace(":", "").replace("\"","").strip().split(" ")
# We strip commas, periods, brackets, colons and quotes so that these characters are not attached to the words when we split the text on spaces
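
# One side effect of deleting punctuation outright is that words separated only by punctuation get glued together. The review used here contains `art."The Lion King"` with no space, which becomes a single token that will most likely be encoded as <UNK>. As a possible alternative (a sketch, not what this notebook actually ran), a regular expression can pull out the words directly. tokenize is a hypothetical helper.

```python
import re

def tokenize(text):
    # Keep runs of letters, digits and apostrophes; anything else separates words
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = tokenize('a glorious work of art."The Lion King" gets off to a fantastic start.')
print(tokens)
```

# The resulting list can be passed to review_encode in place of nline.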

In [51]:
encode = review_encode(nline)  # map each word to its integer index

In [52]:
encode = keras.preprocessing.sequence.pad_sequences([encode], value=word_index["<PAD>"], padding="post", maxlen=250) # make the data 250 words long

In [54]:
predict = model.predict(encode)
print(line)
print(encode)
print(predict[0])

Of all the animation classics from the Walt Disney Company, there is perhaps none that is more celebrated than "The Lion King." Its acclaim is understandable: this is quite simply a glorious work of art."The Lion King" gets off to a fantastic start. The film's opening number, "The Circle of Life," is outstanding. The song lasts for about four minutes, but from the first sound, the audience is floored. Not even National Geographic can capture something this beautiful and dramatic. Not only is this easily the greatest moment in film animation, this is one of the greatest sequences in film history. The story that follows is not as majestic, but the film has to tell a story. Actually, the rest of the film holds up quite well. The story takes place in Africa, where the lions rule. Their king, Mufasa (James Earl Jones) has just been blessed with a son, Simba (Jonathan Taylor Thomas), who goes in front of his uncle Scar (Jeremy Irons) as next in line for the throne. Scar is furious, and sets 

In [55]:
# This result, [0.91459614], is close to 1, which shows that the model considers the review positive
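
# To turn the score into a label programmatically, one option is a simple cut-off. The sketch below assumes a threshold of 0.5, which is a common default and not something the model itself defines. label_from_score is a hypothetical helper.

```python
def label_from_score(score, threshold=0.5):
    # threshold=0.5 is an assumed cut-off; scores at or above it count as positive
    return "positive" if score >= threshold else "negative"

print(label_from_score(0.91459614))
```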