### Movie Sentiment Analysis using Simple RNN

This is a dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a list of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that, for instance, the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as "only consider the top 10,000 most common words, but eliminate the top 20 most common words".

As a convention, "0" does not stand for a specific word, but instead is used to encode the pad token.

In [1]:
import numpy as np
import pandas as pd

import tensorflow as tf 
from tensorflow.keras.datasets import imdb 
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing import sequence 
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense, LeakyReLU





In [2]:
word_index = imdb.get_word_index()
print("Number of unique words:", len(word_index))
print("Example word-to-index mapping:", list(word_index.items())[:10])

Number of unique words: 88584
Example word-to-index mapping: [('fawn', 34701), ('tsukino', 52006), ('nunnery', 52007), ('sonja', 16816), ('vani', 63951), ('woods', 1408), ('spiders', 16115), ('hanging', 2345), ('woody', 2289), ('trawling', 52008)]


In [3]:
# vocabulary size: keep only the 10,000 most frequent words
max_features = 10000 

(X_train,y_train),(X_test,y_test) = imdb.load_data(num_words=max_features)

print(f"Training data shape: {X_train.shape}, Target shape: {y_train.shape}")
print(f"Testing data shape: {X_test.shape}, Target shape: {y_test.shape}")

Training data shape: (25000,), Target shape: (25000,)
Testing data shape: (25000,), Target shape: (25000,)
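
The `num_words=max_features` argument is what applies the frequency filter described in the introduction. As a rough sketch of the idea (not the Keras source), the hypothetical helper `filter_by_rank` below replaces every index outside the kept range with `2`, which is the default `oov_char` of `imdb.load_data`. The toy review values are made up for illustration.

```python
OOV_INDEX = 2  # default oov_char of imdb.load_data


def filter_by_rank(review, num_words=None, skip_top=0, oov_index=OOV_INDEX):
    """Hypothetical helper: keep indices in [skip_top, num_words), map the rest to oov_index."""
    upper = num_words if num_words is not None else float("inf")
    return [idx if skip_top <= idx < upper else oov_index for idx in review]


# Toy encoded review; the values are invented for this example.
toy_review = [1, 14, 22, 12000, 5, 530, 9999, 10000]

print(filter_by_rank(toy_review, num_words=10000))
# [1, 14, 22, 2, 5, 530, 9999, 2]
print(filter_by_rank(toy_review, num_words=10000, skip_top=20))
# [2, 2, 22, 2, 2, 530, 9999, 2]
```

This is why the encoded reviews loaded with `num_words=10000` contain no index of 10,000 or above, and why the value `2` shows up wherever a rarer word was dropped.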


In [4]:
X_train[0]  # first review, encoded as word indexes ranked by frequency (lower index = more frequent word)

[1,
 14,
 22,
 16,
 43,
 530,
 973,
 1622,
 1385,
 65,
 458,
 4468,
 66,
 3941,
 4,
 173,
 36,
 256,
 5,
 25,
 100,
 43,
 838,
 112,
 50,
 670,
 2,
 9,
 35,
 480,
 284,
 5,
 150,
 4,
 172,
 112,
 167,
 2,
 336,
 385,
 39,
 4,
 172,
 4536,
 1111,
 17,
 546,
 38,
 13,
 447,
 4,
 192,
 50,
 16,
 6,
 147,
 2025,
 19,
 14,
 22,
 4,
 1920,
 4613,
 469,
 4,
 22,
 71,
 87,
 12,
 16,
 43,
 530,
 38,
 76,
 15,
 13,
 1247,
 4,
 22,
 17,
 515,
 17,
 12,
 16,
 626,
 18,
 2,
 5,
 62,
 386,
 12,
 8,
 316,
 8,
 106,
 5,
 4,
 2223,
 5244,
 16,
 480,
 66,
 3785,
 33,
 4,
 130,
 12,
 16,
 38,
 619,
 5,
 25,
 124,
 51,
 36,
 135,
 48,
 25,
 1415,
 33,
 6,
 22,
 12,
 215,
 28,
 77,
 52,
 5,
 14,
 407,
 16,
 82,
 2,
 8,
 4,
 107,
 117,
 5952,
 15,
 256,
 4,
 2,
 7,
 3766,
 5,
 723,
 36,
 71,
 43,
 530,
 476,
 26,
 400,
 317,
 46,
 7,
 4,
 2,
 1029,
 13,
 104,
 88,
 4,
 381,
 15,
 297,
 98,
 32,
 2071,
 56,
 26,
 141,
 6,
 194,
 7486,
 18,
 4,
 226,
 22,
 21,
 134,
 476,
 26,
 480,
 5,
 144,
 30,
 5535,
 18,

In [5]:
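# Decode the first review back into words.
# Indexes are offset by 3 because 0, 1 and 2 are reserved (padding, start, unknown);
# reserved indexes have no word and are printed as '***'.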
reverse_word_index = {value:key for key,value in word_index.items()}

review_0 = ' '.join([reverse_word_index.get(i-3,'***') for i in X_train[0]])
print(review_0)

*** this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert *** is an amazing actor and now the same being director *** father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for *** and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also *** to the two little boy's that played the *** of norman and paul they were just brilliant children are often left out of the *** list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done
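
The `i-3` offset is needed because `imdb.load_data` shifts every word index by `index_from=3` by default and reserves `0` for padding, `1` for the start-of-sequence marker, and `2` for out-of-vocabulary words. A minimal sketch with a made-up three-word vocabulary follows; `toy_word_index`, `SPECIAL_TOKENS`, and `decode_review` are illustrative names, not part of Keras.

```python
INDEX_FROM = 3  # default offset applied by imdb.load_data
SPECIAL_TOKENS = {0: "<pad>", 1: "<start>", 2: "<oov>"}

# Made-up vocabulary in the same word -> rank format as imdb.get_word_index().
toy_word_index = {"the": 1, "film": 2, "brilliant": 3}
toy_reverse_index = {rank: word for word, rank in toy_word_index.items()}


def decode_review(encoded, reverse_index):
    """Turn a list of offset word indexes back into text."""
    words = []
    for idx in encoded:
        if idx in SPECIAL_TOKENS:
            # Reserved indexes have no entry in the word index.
            words.append(SPECIAL_TOKENS[idx])
        else:
            words.append(reverse_index.get(idx - INDEX_FROM, "<unknown>"))
    return " ".join(words)


print(decode_review([1, 4, 5, 2, 6], toy_reverse_index))
# <start> the film <oov> brilliant
```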

In [6]:
max_len = 1000

# Reviews longer than max_len tokens keep only their last max_len tokens;
# shorter reviews are left-padded with 0 (the pad_sequences defaults).

print(len(X_train[0]))
X_train = sequence.pad_sequences(X_train,maxlen=max_len)
X_test = sequence.pad_sequences(X_test,maxlen=max_len)
print(len(X_train[0]))

218
1000
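
To make the padding step concrete, here is a pure-Python sketch of what `pad_sequences` does to a single sequence with its default settings (`padding='pre'`, `truncating='pre'`, `value=0`). The helper name `pad_pre` is hypothetical and only mirrors the default behavior; it is not the library implementation.

```python
def pad_pre(seq, maxlen, value=0):
    """Sketch of pad_sequences' default behavior for one sequence."""
    if len(seq) >= maxlen:
        # truncating='pre': drop tokens from the front, keep the last maxlen
        return list(seq[-maxlen:])
    # padding='pre': the pad value goes in front of the real tokens
    return [value] * (maxlen - len(seq)) + list(seq)


print(pad_pre([5, 6, 7], maxlen=5))
# [0, 0, 5, 6, 7]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))
# [3, 4, 5, 6, 7]
```

With `max_len = 1000`, the 218-token first review is therefore preceded by 782 zeros.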


### Train Simple RNN

In [7]:
# Embedding layer maps each word index to a dense 300-dimensional vector

model = Sequential()
model.add(Embedding(max_features,300,input_length=max_len))
model.add(SimpleRNN(128,activation=LeakyReLU(alpha=0.1)))
model.add(Dense(1,activation="sigmoid"))
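
As a rough illustration of what the `SimpleRNN` layer computes, the sketch below applies the recurrence `h_t = LeakyReLU(x_t W + h_{t-1} U + b)` over a sequence and returns the last hidden state, which is what the layer outputs when `return_sequences` is left at its default. The function names and the tiny 2-dimensional weights are made up for the example; a real layer learns `W`, `U`, and `b` during training.

```python
def leaky_relu(x, alpha=0.1):
    """LeakyReLU on a scalar: identity for positive inputs, alpha * x otherwise."""
    return x if x > 0 else alpha * x


def simple_rnn_last_state(inputs, W, U, b, alpha=0.1):
    """inputs: list of T vectors of length d; W: d x units; U: units x units; b: length units."""
    units = len(b)
    h = [0.0] * units  # initial hidden state is all zeros
    for x in inputs:
        h = [
            leaky_relu(
                sum(x[i] * W[i][j] for i in range(len(x)))    # input contribution
                + sum(h[k] * U[k][j] for k in range(units))   # recurrent contribution
                + b[j],
                alpha,
            )
            for j in range(units)
        ]
    return h


# Toy weights and inputs, chosen only so the arithmetic is easy to follow.
W = [[1.0, 0.0], [0.0, 1.0]]
U = [[0.5, 0.0], [0.0, 0.5]]
b = [0.0, 0.0]
last_state = simple_rnn_last_state([[1.0, -1.0], [2.0, 0.0]], W, U, b)
print([round(v, 4) for v in last_state])
# [2.5, -0.005]
```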




In [8]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 1000, 300)         3000000   
                                                                 
 simple_rnn (SimpleRNN)      (None, 128)               54912     
                                                                 
 dense (Dense)               (None, 1)                 129       
                                                                 
Total params: 3055041 (11.65 MB)
Trainable params: 3055041 (11.65 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
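
The parameter counts in the summary can be reproduced by hand. The short calculation below uses the layer sizes from the model definition; the variable names are only for illustration.

```python
vocab_size = 10000  # max_features
embed_dim = 300     # embedding vector size
rnn_units = 128     # SimpleRNN hidden units

# One embed_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embed_dim
# Input weights + recurrent weights + one bias, for each hidden unit.
rnn_params = (embed_dim + rnn_units + 1) * rnn_units
# One weight per hidden unit plus a bias for the single sigmoid output.
dense_params = rnn_units * 1 + 1

total_params = embedding_params + rnn_params + dense_params
print(embedding_params, rnn_params, dense_params, total_params)
# 3000000 54912 129 3055041
```

Almost all of the parameters sit in the embedding table, not in the recurrent layer.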


In [9]:
# Configure training: Adam optimizer, binary cross-entropy loss, accuracy as the reported metric
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])




In [10]:
# Stop training once val_loss has not improved for 5 consecutive epochs, then restore the best weights
from tensorflow.keras.callbacks import EarlyStopping 
earlyStoppingCallback = EarlyStopping(monitor='val_loss',patience=5,restore_best_weights=True)
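
The patience rule can be sketched in a few lines of plain Python. This is a simplified model of the callback's logic, not the Keras implementation, and the validation losses are invented for the example; they are not taken from the training run in this notebook. `early_stopping_epoch` is a hypothetical helper name.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-based, for a list of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Made-up validation losses, NOT results from this notebook.
hypothetical_val_losses = [0.50, 0.45, 0.47, 0.49, 0.52, 0.55, 0.60, 0.66]
print(early_stopping_epoch(hypothetical_val_losses, patience=5))
# (7, 2)
```

In this toy run, training stops after epoch 7, and with `restore_best_weights=True` the model would be rolled back to the weights from epoch 2.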

In [11]:
model.fit(
    X_train,y_train,
    epochs=40,
    batch_size=32,
    validation_split=0.2,
    callbacks=[earlyStoppingCallback]
)

Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40


<keras.src.callbacks.History at 0x1a46e614b10>

In [12]:
# Save the trained model in the legacy HDF5 (.h5) format
model.save('./pickle_files/simple_rnn_imdb.h5')

