# Sentiment Analysis on IMDB Reviews using Keras and LSTM
------------------------------------------------------------------

Problem Statement
--------------------
The task is to predict whether a movie review is positive or negative by training an LSTM model on the Large Movie Review Dataset (IMDB).

- Each input is a sequence of words.
- The dataset is split into 25k reviews for training and 25k for testing.
- LSTMs can be tricky to get working, but we use one here because it handles long-term dependencies.
- Sequences vary in length. A table of the vocabulary is built and sorted by word frequency, and each review is then converted into a list of the indices of its words in that frequency-sorted table.
- An LSTM for the IMDB sentiment analysis problem can be developed quickly and reaches good accuracy.
- The code below shows:
    - how to develop an LSTM model for a sequence classification problem.
    - how to reduce overfitting using dropout.
    - how to combine an LSTM with a CNN (convolutional neural network) to get better performance.
- Luckily, Keras provides built-in access to the IMDB dataset.
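
The frequency-sorted table described in the list above can be illustrated on a toy corpus. This is only a sketch of the idea, not the code used to build the Keras dataset; the three sample reviews and the names `counts`, `ranked` and `word_to_index` are made up for the example.

```python
from collections import Counter

# Toy corpus standing in for the review texts (hypothetical data).
reviews = [
    "the movie was great",
    "the movie was bad",
    "great acting great plot",
]

# Count how often each word occurs across all reviews.
counts = Counter(word for review in reviews for word in review.split())

# Rank words by descending frequency; ties are broken alphabetically
# so the table is deterministic. Rank 1 is the most frequent word.
ranked = sorted(counts, key=lambda w: (-counts[w], w))
word_to_index = {word: rank for rank, word in enumerate(ranked, start=1)}

# Replace every word by its rank in the table.
encoded = [[word_to_index[w] for w in review.split()] for review in reviews]
print(encoded[0])
```

Frequent words get small indices, which is why keeping only the top n words later amounts to keeping only the indices below a cut-off.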

Import Libs
----------

In [98]:
import warnings
import numpy
from keras.datasets import imdb
from keras.models import Sequential, load_model
# Embedding, Conv1D and MaxPooling1D can be imported from keras.layers directly;
# the keras.layers.embeddings and keras.layers.convolutional paths do not exist in newer Keras releases
from keras.layers import Dense, LSTM, Dropout, Embedding, Conv1D, MaxPooling1D
from keras.preprocessing import sequence
warnings.filterwarnings("ignore")
# fix random seed for reproducibility
numpy.random.seed(7)

Load the Dataset
------------

In [92]:
# Keep only the top n words
top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)

In [13]:
X_train.shape

(25000,)

In [14]:
print(len(X_train[1]))
X_train[1]

189


[1,
 194,
 1153,
 194,
 2,
 78,
 228,
 5,
 6,
 1463,
 4369,
 2,
 134,
 26,
 4,
 715,
 8,
 118,
 1634,
 14,
 394,
 20,
 13,
 119,
 954,
 189,
 102,
 5,
 207,
 110,
 3103,
 21,
 14,
 69,
 188,
 8,
 30,
 23,
 7,
 4,
 249,
 126,
 93,
 4,
 114,
 9,
 2300,
 1523,
 5,
 647,
 4,
 116,
 9,
 35,
 2,
 4,
 229,
 9,
 340,
 1322,
 4,
 118,
 9,
 4,
 130,
 4901,
 19,
 4,
 1002,
 5,
 89,
 29,
 952,
 46,
 37,
 4,
 455,
 9,
 45,
 43,
 38,
 1543,
 1905,
 398,
 4,
 1649,
 26,
 2,
 5,
 163,
 11,
 3215,
 2,
 4,
 1153,
 9,
 194,
 775,
 7,
 2,
 2,
 349,
 2637,
 148,
 605,
 2,
 2,
 15,
 123,
 125,
 68,
 2,
 2,
 15,
 349,
 165,
 4362,
 98,
 5,
 4,
 228,
 9,
 43,
 2,
 1157,
 15,
 299,
 120,
 5,
 120,
 174,
 11,
 220,
 175,
 136,
 50,
 9,
 4373,
 228,
 2,
 5,
 2,
 656,
 245,
 2350,
 5,
 4,
 2,
 131,
 152,
 491,
 18,
 2,
 32,
 2,
 1212,
 14,
 9,
 6,
 371,
 78,
 22,
 625,
 64,
 1382,
 9,
 8,
 168,
 145,
 23,
 4,
 1690,
 15,
 16,
 4,
 1355,
 5,
 28,
 6,
 52,
 154,
 462,
 33,
 89,
 78,
 285,
 16,
 145,
 95]
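
The integers above are shifted word ranks rather than raw ranks. With the default arguments of imdb.load_data, ids 0, 1 and 2 are reserved for padding, start-of-review and out-of-vocabulary words, and every word rank is shifted by 3. The sketch below shows how such a list could be turned back into text; the helper `decode_review`, the placeholder tokens and the three-word `toy_index` are assumptions made for illustration, and the real table would come from imdb.get_word_index().

```python
def decode_review(encoded, word_index, index_from=3):
    """Map a list of integer ids back to a readable string."""
    # Shift every rank by index_from so that ids 0, 1, 2 stay reserved.
    id_to_word = {rank + index_from: word for word, rank in word_index.items()}
    id_to_word.update({0: "<pad>", 1: "<start>", 2: "<oov>"})
    return " ".join(id_to_word.get(i, "<oov>") for i in encoded)

# Hypothetical miniature word index (word -> frequency rank).
toy_index = {"the": 1, "and": 2, "movie": 3}
print(decode_review([1, 4, 6, 2, 5], toy_index))
```

The many 2s in the review above are therefore words that fell outside the top 5000 and were replaced by the out-of-vocabulary id.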

Pad and Truncate Inputs
---------------------------

In [93]:
max_review_length = 600
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)

print(X_train.shape)
print(X_train[1])

(25000, 600)
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    
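
The leading zeros in the output come from the default behaviour of pad_sequences, which pads on the left and, for reviews longer than maxlen, drops ids from the front. The following NumPy function is a rough re-implementation of that default, written only to make the behaviour visible; `pad_pre` is a hypothetical name and not part of Keras.

```python
import numpy as np

def pad_pre(seqs, maxlen):
    """Left-pad with zeros and keep only the last maxlen ids of long sequences."""
    out = np.zeros((len(seqs), maxlen), dtype="int32")
    for row, seq in enumerate(seqs):
        if len(seq) == 0:
            continue                   # an empty sequence stays all zeros
        tail = seq[-maxlen:]           # truncate from the front
        out[row, -len(tail):] = tail   # right-align, zeros stay on the left
    return out

print(pad_pre([[5, 6], [1, 2, 3, 4, 5, 6]], maxlen=4))
```

The short sequence is padded to `[0, 0, 5, 6]` and the long one is cut to its last four ids, `[3, 4, 5, 6]`.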

Build the model
----------------------------

# LSTM model

In [24]:
embedding_vec_length = 32
model = Sequential()
model.add(Embedding(top_words+1, embedding_vec_length, input_length = max_review_length))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 600, 32)           160032    
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
Total params: 213,333
Trainable params: 213,333
Non-trainable params: 0
_________________________________________________________________
None
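
The parameter counts in the summary can be reproduced by hand, which is a useful sanity check on the layer sizes. The variable names below are chosen for this sketch only.

```python
vocab_size = 5000 + 1   # top_words + 1, as passed to Embedding
embed_dim = 32          # embedding_vec_length
lstm_units = 100

# Embedding: one embed_dim vector per vocabulary entry.
embedding_params = vocab_size * embed_dim

# LSTM: four weight sets (three gates plus the cell candidate), each with
# input weights, recurrent weights and a bias.
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# Dense: one weight per LSTM unit plus one bias.
dense_params = lstm_units * 1 + 1

print(embedding_params, lstm_params, dense_params)
print(embedding_params + lstm_params + dense_params)
```

This gives 160032, 53200 and 101, for a total of 213333, matching the summary.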


In [25]:
model.fit(X_train, y_train, epochs = 10, batch_size = 64)
scores = model.evaluate(X_test, y_test, verbose = 0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Accuracy: 85.86%


Testing
---------

In [43]:
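# predict_classes() is not available in newer Keras releases; thresholding the
# sigmoid output at 0.5 gives the same labels (ravel() flattens (n, 1) to (n,))
y_pred = (model.predict(X_test, batch_size = 64) > 0.5).astype("int32").ravel()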
true = 0
for i, y in enumerate(y_test):
    if y ==  y_pred[i]:
        true += 1

print('Correct Prediction: {}'.format(true))
print('Wrong Prediction: {}'.format(len(y_pred) - true))
print('Accuracy: {}'.format(true/len(y_pred)*100))


Correct Prediction: 21466
Wrong Prediction: 3534
Accuracy: 85.86399999999999
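
The counting loop can also be written as a vectorized NumPy comparison. The sketch below uses small made-up arrays in place of y_test and y_pred so that it runs on its own.

```python
import numpy as np

# Hypothetical labels and predictions standing in for y_test and y_pred.
y_true = np.array([1, 0, 1, 1, 0])
y_hat = np.array([1, 0, 0, 1, 1])

correct = int(np.sum(y_true == y_hat))   # element-wise comparison, then count
wrong = len(y_true) - correct
accuracy = correct / len(y_true) * 100

print(correct, wrong, accuracy)
```

One caveat when applying this to real predictions: if the predictions have shape (n, 1), flatten them with `.ravel()` first, otherwise the comparison against labels of shape (n,) broadcasts to an (n, n) matrix.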


Save Model
----------

In [44]:
model.save("imdblstm.h5")
print("Saved model to disk")

Saved model to disk


A saved model can be restored by calling load_model() with the filename it was saved under.

Load Saved Model
-----------------

In [46]:
model1 = load_model('imdblstm.h5')
model1.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 600, 32)           160032    
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
Total params: 213,333
Trainable params: 213,333
Non-trainable params: 0
_________________________________________________________________


In [76]:
result = model1.predict(X_test)[90]
print(result)

[0.99361765]


In [77]:
# 0.75 is a stricter cut-off than the usual 0.5
if result >= 0.75:
    print('Positive')
else:
    print('Negative')

Positive


# LSTM model with Dropout


RNNs such as LSTMs are prone to overfitting. Dropout can be applied between the Embedding and LSTM layers and between the LSTM and Dense output layers. Let's try it out.
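
As a reminder of what a Dropout layer does during training, here is a minimal NumPy sketch of inverted dropout: a random fraction of the activations is zeroed and the survivors are rescaled so the expected value stays the same. The function `dropout` and the all-ones test input are illustrative assumptions, not Keras internals.

```python
import numpy as np

def dropout(x, rate, rng):
    """Zero a random fraction `rate` of x and rescale the survivors."""
    keep = rng.random(x.shape) >= rate   # True for units that survive
    return x * keep / (1.0 - rate)       # rescale so the expected value is unchanged

rng = np.random.default_rng(7)
activations = np.ones((1000, 100))
dropped = dropout(activations, rate=0.2, rng=rng)

print((dropped == 0).mean())   # fraction of zeroed units, close to 0.2
print(dropped.mean())          # close to 1.0
```

At inference time the layer passes its input through unchanged, so dropout only affects training.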

In [94]:
embedding_vec_length = 32
model2 = Sequential()
model2.add(Embedding(top_words, embedding_vec_length, input_length=max_review_length))
model2.add(Dropout(0.2))
model2.add(LSTM(100))
model2.add(Dropout(0.2))
model2.add(Dense(1, activation='sigmoid'))
model2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model2.summary())

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, 600, 32)           160000    
_________________________________________________________________
dropout_4 (Dropout)          (None, 600, 32)           0         
_________________________________________________________________
lstm_4 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dropout_5 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 101       
Total params: 213,301
Trainable params: 213,301
Non-trainable params: 0
_________________________________________________________________
None


In [95]:
model2.fit(X_train, y_train, epochs = 10, batch_size = 64)
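# evaluate the dropout model (model2), not the first model
scores = model2.evaluate(X_test, y_test, verbose = 0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Accuracy: 85.86%


The accuracy shown above is identical to that of the LSTM model without dropout because the recorded run evaluated model again rather than model2; rerun the cell to measure the dropout model itself. Next, let's try LSTM with CNN.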

# LSTM and CNN 

In [100]:
embedding_vec_length = 32
model4 = Sequential()
model4.add(Embedding(top_words, embedding_vec_length, input_length=max_review_length))
model4.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model4.add(MaxPooling1D(pool_size=2))
model4.add(LSTM(100))
model4.add(Dense(1, activation='sigmoid'))
model4.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model4.summary())
model4.fit(X_train, y_train, epochs=10, batch_size=64)
scores = model4.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Model: "sequential_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_9 (Embedding)      (None, 600, 32)           160000    
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 600, 32)           3104      
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 300, 32)           0         
_________________________________________________________________
lstm_8 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_8 (Dense)              (None, 1)                 101       
Total params: 216,405
Trainable params: 216,405
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
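
The Conv1D parameter count and the halved sequence length in the summary above can be checked with a few lines of arithmetic. The variable names are chosen for this sketch only.

```python
embed_dim = 32      # channels coming out of the Embedding layer
filters = 32
kernel_size = 3
seq_len = 600       # max_review_length
pool_size = 2

# Conv1D: each filter has kernel_size * embed_dim weights plus one bias.
conv_params = filters * (kernel_size * embed_dim + 1)

# padding='same' keeps the sequence length; non-overlapping pooling divides it.
pooled_len = seq_len // pool_size

print(conv_params, pooled_len)
```

This yields 3104 parameters and a pooled length of 300, so the LSTM in this model processes sequences half as long as in the earlier models.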

I expect that better results can be achieved with more epochs. In this run, the LSTM with CNN gave higher accuracy than the plain LSTM.