# Sentiment Analysis on IMDB Data

This notebook uses a vanilla RNN to predict the sentiment of IMDB reviews. The data comes from the Keras datasets module and consists of 25,000 training sequences and 25,000 test sequences. The target is binary: positive or negative.
An RNN (Recurrent Neural Network) is used because it is designed for sequential data. A vanilla RNN suffers from vanishing/exploding gradients; LSTM (Long Short-Term Memory) or GRU (Gated Recurrent Unit) layers can be used to mitigate this.
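
To make the vanishing/exploding gradient problem concrete, here is a minimal scalar sketch. It assumes a one-unit linear recurrence `h_t = w * h_{t-1}`, so the gradient of the last state with respect to the first is simply `w` multiplied by itself once per time step. The helper name and the weights 0.9 and 1.1 are illustrative and are not taken from the model in this notebook.

```python
def gradient_through_time(recurrent_weight: float, steps: int) -> float:
    """Gradient of h_T w.r.t. h_0 for the scalar recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= recurrent_weight  # chain rule: one factor of w per time step
    return grad


# 25 and 80 are the two sequence lengths used in this notebook.
for steps in (25, 80):
    shrinking = gradient_through_time(0.9, steps)  # |w| < 1: gradient vanishes
    growing = gradient_through_time(1.1, steps)    # |w| > 1: gradient explodes
    print(f"{steps} steps: w=0.9 -> {shrinking:.2e}, w=1.1 -> {growing:.2e}")
```

A weight of exactly 1.0 leaves the gradient unchanged, which is the intuition behind initializing a recurrent matrix to the identity.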

In [1]:
#Importing the necessary Libraries
import pandas as pd
from tensorflow import keras
from keras.datasets import imdb
from keras.utils import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, SimpleRNN
from keras import initializers

In [2]:
data=imdb.load_data(num_words=25000)  # keep only the 25,000 most frequent words

In [3]:
(xTrain,yTrain),(xTest,yTest) = data
print(len(xTrain),'Train Sequence')
print(len(xTest),'Test Sequence\n')

#Pad or truncate every sequence to a fixed length of 25 tokens
xTrain=pad_sequences(xTrain,maxlen=25)
xTest=pad_sequences(xTest,maxlen=25)

print("Shape of XTrain : ",xTrain.shape)
print("Shape of XTest : ",xTest.shape)

25000 Train Sequence
25000 Test Sequence

Shape of XTrain :  (25000, 25)
Shape of XTest :  (25000, 25)
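
By default, `pad_sequences` pads and truncates at the front of each sequence, so with `maxlen=25` only the last 25 tokens of a long review are kept. The sketch below is a pure-Python imitation of that default behaviour for illustration only; `pad_pre` is a hypothetical helper, not part of Keras, and the token ids are made up.

```python
def pad_pre(sequences, maxlen, value=0):
    """Imitate the default pad_sequences behaviour: pad and truncate at the front."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]                    # keep only the last `maxlen` tokens
        left_pad = [value] * (maxlen - len(kept))     # fill the front with `value`
        padded.append(left_pad + kept)
    return padded


reviews = [[11, 12, 13], [21, 22, 23, 24, 25, 26]]   # made-up token ids
print(pad_pre(reviews, maxlen=4))
# [[0, 11, 12, 13], [23, 24, 25, 26]]
```

The short review is left-padded with zeros, and the long one loses its first two tokens.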


In [9]:
## Build a simple RNN

vocab_size=25000
embedding_dim=50
hidden_state=5

rnn_model=Sequential()
rnn_model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_shape=xTrain.shape[1:]))
rnn_model.add(SimpleRNN(hidden_state,
                    kernel_initializer=initializers.RandomNormal(stddev=0.001),
                    recurrent_initializer=initializers.Identity(gain=1.0),
                    activation='relu'))
rnn_model.add(Dense(1, activation='sigmoid'))

rmsprop = keras.optimizers.RMSprop(learning_rate = .0001)
rnn_model.compile(optimizer=rmsprop, loss='binary_crossentropy',metrics=['accuracy'])

In [5]:
rnn_model.summary()
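
The output of `rnn_model.summary()` is not shown here. As a rough check on what it should report, the trainable parameter counts can be worked out by hand. The arithmetic below assumes the standard layer definitions: an Embedding with no bias, and a SimpleRNN and Dense layer with biases enabled (the Keras defaults).

```python
vocab_size, embedding_dim, hidden_state = 25000, 50, 5

# Embedding: one `embedding_dim`-sized vector per token id
embedding_params = vocab_size * embedding_dim

# SimpleRNN: input kernel + recurrent kernel + bias
rnn_params = (embedding_dim * hidden_state
              + hidden_state * hidden_state
              + hidden_state)

# Dense(1): one weight per hidden unit + one bias
dense_params = hidden_state * 1 + 1

total = embedding_params + rnn_params + dense_params
print(embedding_params, rnn_params, dense_params, total)
# 1250000 280 6 1250286
```

Almost all of the parameters sit in the embedding table; the recurrent layer itself has only 280.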

In [6]:
rnn_model.fit(xTrain,yTrain,batch_size=32,epochs=10,validation_data=(xTest,yTest))

Epoch 1/10
782/782 ━━━━━━━━━━━━━━━━━━━━ 11s 13ms/step - accuracy: 0.5482 - loss: 0.6877 - val_accuracy: 0.6638 - val_loss: 0.6334
Epoch 2/10
782/782 ━━━━━━━━━━━━━━━━━━━━ 10s 12ms/step - accuracy: 0.6815 - loss: 0.6076 - val_accuracy: 0.7073 - val_loss: 0.5656
Epoch 3/10
782/782 ━━━━━━━━━━━━━━━━━━━━ 10s 12ms/step - accuracy: 0.7375 - loss: 0.5343 - val_accuracy: 0.7337 - val_loss: 0.5268
Epoch 4/10
782/782 ━━━━━━━━━━━━━━━━━━━━ 10s 12ms/step - accuracy: 0.7702 - loss: 0.4921 - val_accuracy: 0.7476 - val_loss: 0.5079
Epoch 5/10
782/782 ━━━━━━━━━━━━━━━━━━━━ 10s 13ms/step - accuracy: 0.7908 - loss: 0.4556 - val_accuracy: 0.7567 - val_loss: 0.4966
Epoch 6/10
782/782 ━━━━━━━━━━━━━━━━━━━━ 10s 12ms/step - accuracy: 0.8052 - loss: 0.4317 - val_accuracy: 0.7665 - val_loss: 0.4786
Epoch 7/10
...

<keras.src.callbacks.history.History at 0x289900fafb0>

In [7]:
score,acc= rnn_model.evaluate(xTest,yTest,batch_size=32)
print("Test Score : ", score,"\nTest Accuracy : ", acc)

782/782 ━━━━━━━━━━━━━━━━━━━━ 3s 4ms/step - accuracy: 0.7785 - loss: 0.4679
Test Score :  0.465295672416687
Test Accuracy :  0.7785199880599976
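
The "Test Score" printed above is the mean binary cross-entropy loss the model was compiled with. As a reminder of what that number measures, here is a small pure-Python sketch of the formula. The labels and probabilities are made up for illustration, and the `eps` clipping value is an assumption that mirrors the usual numerical-stability guard rather than a quote of the Keras implementation.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over paired labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.6, 0.4]  # made-up sigmoid outputs, for illustration only
print(round(binary_crossentropy(labels, probs), 4))
```

A model that always predicts 0.5 scores `ln 2 ≈ 0.693`, so a loss well below that means the predictions are both mostly correct and reasonably confident.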


Next, the vocabulary size is reduced from 25,000 to 20,000 words and the padded sequence length is increased from 25 to 80 tokens, to see whether the test accuracy and loss improve.

In [11]:
vocab_size=20000
embedding_dim=50
hidden_state=5

data=imdb.load_data(num_words=vocab_size)  # keep only the `vocab_size` most frequent words
(xTrain,yTrain),(xTest,yTest) = data

xTrain=pad_sequences(xTrain,maxlen=80)
xTest=pad_sequences(xTest,maxlen=80)

In [12]:
embedding_dim=50

rnn_model=Sequential()
rnn_model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_shape=xTrain.shape[1:]))
rnn_model.add(SimpleRNN(hidden_state,
                    kernel_initializer=initializers.RandomNormal(stddev=0.001),
                    recurrent_initializer=initializers.Identity(gain=1.0),
                    activation='relu'))
rnn_model.add(Dense(1, activation='sigmoid'))

rmsprop = keras.optimizers.RMSprop(learning_rate = .0001)
rnn_model.compile(optimizer=rmsprop, loss='binary_crossentropy',metrics=['accuracy'])

In [13]:
rnn_model.fit(xTrain,yTrain,batch_size=32,epochs=10,validation_data=(xTest,yTest))

Epoch 1/10
782/782 ━━━━━━━━━━━━━━━━━━━━ 15s 19ms/step - accuracy: 0.5214 - loss: 0.6853 - val_accuracy: 0.6944 - val_loss: 0.6327
Epoch 2/10
782/782 ━━━━━━━━━━━━━━━━━━━━ 19s 24ms/step - accuracy: 0.7061 - loss: 0.6129 - val_accuracy: 0.7471 - val_loss: 0.5921
Epoch 3/10
782/782 ━━━━━━━━━━━━━━━━━━━━ 18s 23ms/step - accuracy: 0.7699 - loss: 0.5710 - val_accuracy: 0.7890 - val_loss: 0.5669
Epoch 4/10
782/782 ━━━━━━━━━━━━━━━━━━━━ 17s 22ms/step - accuracy: 0.8037 - loss: 0.5100 - val_accuracy: 0.7857 - val_loss: 0.4729
Epoch 5/10
782/782 ━━━━━━━━━━━━━━━━━━━━ 17s 21ms/step - accuracy: 0.8188 - loss: 0.4111 - val_accuracy: 0.8014 - val_loss: 0.4344
Epoch 6/10
782/782 ━━━━━━━━━━━━━━━━━━━━ 17s 21ms/step - accuracy: 0.8438 - loss: 0.3701 - val_accuracy: 0.8136 - val_loss: 0.4158
Epoch 7/10
...

<keras.src.callbacks.history.History at 0x28981550a30>

In [14]:
score,acc= rnn_model.evaluate(xTest,yTest,batch_size=32)
print("Test Score : ", score,"\nTest Accuracy : ", acc)

782/782 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.8293 - loss: 0.3867
Test Score :  0.38895317912101746
Test Accuracy :  0.8267599940299988


Test accuracy has improved from 77.9% to 82.7%, and test loss has dropped from 0.465 to 0.389. This suggests that the new hyperparameters work better. Next, the same model is trained for 10 more epochs (calling `fit` again continues from the current weights) to test whether performance improves further.

In [15]:
rnn_model.fit(xTrain,yTrain,batch_size=32,epochs=10,validation_data=(xTest,yTest))

Epoch 1/10
782/782 ━━━━━━━━━━━━━━━━━━━━ 12s 16ms/step - accuracy: 0.8819 - loss: 0.2914 - val_accuracy: 0.8318 - val_loss: 0.3844
Epoch 2/10
782/782 ━━━━━━━━━━━━━━━━━━━━ 17s 21ms/step - accuracy: 0.8943 - loss: 0.2687 - val_accuracy: 0.8292 - val_loss: 0.3943
Epoch 3/10
782/782 ━━━━━━━━━━━━━━━━━━━━ 18s 23ms/step - accuracy: 0.8956 - loss: 0.2639 - val_accuracy: 0.8347 - val_loss: 0.3748
Epoch 4/10
782/782 ━━━━━━━━━━━━━━━━━━━━ 17s 22ms/step - accuracy: 0.8988 - loss: 0.2539 - val_accuracy: 0.8301 - val_loss: 0.3921
Epoch 5/10
782/782 ━━━━━━━━━━━━━━━━━━━━ 19s 24ms/step - accuracy: 0.9032 - loss: 0.2467 - val_accuracy: 0.8333 - val_loss: 0.3838
Epoch 6/10
782/782 ━━━━━━━━━━━━━━━━━━━━ 18s 23ms/step - accuracy: 0.9051 - loss: 0.2414 - val_accuracy: 0.8351 - val_loss: 0.3833
Epoch 7/10
...

<keras.src.callbacks.history.History at 0x289a0695a20>

In [16]:
score,acc= rnn_model.evaluate(xTest,yTest,batch_size=32)
print("Test Score : ", score,"\nTest Accuracy : ", acc)

782/782 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.8362 - loss: 0.3944
Test Score :  0.3947882354259491
Test Accuracy :  0.8334800004959106


The further improvement in accuracy is negligible (82.7% to 83.3%), and the test loss has increased slightly (0.389 to 0.395). Running for more epochs might cause the model to overfit the training data: training accuracy keeps climbing past 90% while validation accuracy stays near 83%. So we can conclude that 10-20 epochs are enough for this simple RNN. To further improve the predictions, we could use other architectures such as LSTM or GRU, which are better at avoiding the vanishing gradient problem.
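
One common way to decide when to stop training is patience-based early stopping on the validation loss. Keras offers this as the `keras.callbacks.EarlyStopping` callback; the sketch below is not that callback but a plain-Python illustration of the idea, applied to the six `val_loss` values visible in the third training log. The function name and the patience of 2 are assumptions chosen for the example.

```python
def best_epoch_with_patience(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 1-indexed, for a patience-based stop."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch  # no improvement for `patience` epochs
    return len(val_losses), best_epoch  # ran out of epochs before triggering


# val_loss for the six epochs visible in the third training log
val_losses = [0.3844, 0.3943, 0.3748, 0.3921, 0.3838, 0.3833]
print(best_epoch_with_patience(val_losses, patience=2))
# (5, 3)
```

Under these assumptions training would stop at epoch 5 of the second run and keep the weights from epoch 3, which is consistent with the observation that extra epochs bring little benefit.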