## Sentiment analysis from movie reviews

The IMDb dataset used here consists of user-generated movie reviews, each labeled by whether the user liked the movie or not, based on its associated rating. More info on the dataset is here:

https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification

I have used LSTM (Long Short-Term Memory) cells because we don't really want the network to "forget" words from earlier in a review: the sentiment can depend on words that are far apart.

Let's start by importing the stuff we need:

In [2]:
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding
from tensorflow.keras.layers import LSTM
from tensorflow.keras.datasets import imdb

Now import our dataset. num_words=20000 specifies that we only care about the 20,000 most frequent words in the dataset, in order to keep things somewhat manageable. The dataset includes 25,000 training reviews and 25,000 testing reviews.
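To make the effect of num_words concrete, here is a toy sketch of how words end up as integers and how rare words fall out of the vocabulary. The vocabulary, the ranks, and the helper names toy_encode and toy_decode are made up for illustration; the sketch only mimics the loader's behavior (reserved indices 0, 1 and 2, real words shifted by 3, anything at or above num_words replaced by the unknown marker) and is not the Keras implementation.

```python
# Toy vocabulary ranked by frequency (1 = most frequent). The words and
# ranks are invented; the real ranking comes from imdb.get_word_index().
toy_word_rank = {"the": 1, "film": 2, "was": 3, "brilliant": 4, "scottish": 5}

PAD, START, UNK = 0, 1, 2   # reserved indices
INDEX_FROM = 3              # real words are shifted up by this offset


def toy_encode(words, num_words):
    """Encode words as integers; indices >= num_words become UNK."""
    encoded = [START]
    for word in words:
        rank = toy_word_rank.get(word)
        index = rank + INDEX_FROM if rank is not None else UNK
        encoded.append(index if index < num_words else UNK)
    return encoded


# Reverse mapping from integer index back to a readable token.
toy_reverse_index = {rank + INDEX_FROM: word for word, rank in toy_word_rank.items()}
toy_reverse_index.update({PAD: "<PAD>", START: "<START>", UNK: "<UNK>"})


def toy_decode(indices):
    """Turn a list of integer indices back into a space-separated string."""
    return " ".join(toy_reverse_index.get(i, "?") for i in indices)


encoded = toy_encode("the film was brilliant scottish".split(), num_words=8)
print(encoded)              # [1, 4, 5, 6, 7, 2]
print(toy_decode(encoded))  # <START> the film was brilliant <UNK>
```

With num_words=8 the least frequent toy word no longer fits and is replaced by the unknown marker, which is the same thing that happens to rare words in the real reviews.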

In [3]:
print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=20000)
x_train[0]

Loading data...


[1,
 14,
 22,
 16,
 43,
 530,
 973,
 1622,
 1385,
 65,
 458,
 4468,
 66,
 3941,
 4,
 173,
 36,
 256,
 5,
 25,
 100,
 43,
 838,
 112,
 50,
 670,
 2,
 9,
 35,
 480,
 284,
 5,
 150,
 4,
 172,
 112,
 167,
 2,
 336,
 385,
 39,
 4,
 172,
 4536,
 1111,
 17,
 546,
 38,
 13,
 447,
 4,
 192,
 50,
 16,
 6,
 147,
 2025,
 19,
 14,
 22,
 4,
 1920,
 4613,
 469,
 4,
 22,
 71,
 87,
 12,
 16,
 43,
 530,
 38,
 76,
 15,
 13,
 1247,
 4,
 22,
 17,
 515,
 17,
 12,
 16,
 626,
 18,
 19193,
 5,
 62,
 386,
 12,
 8,
 316,
 8,
 106,
 5,
 4,
 2223,
 5244,
 16,
 480,
 66,
 3785,
 33,
 4,
 130,
 12,
 16,
 38,
 619,
 5,
 25,
 124,
 51,
 36,
 135,
 48,
 25,
 1415,
 33,
 6,
 22,
 12,
 215,
 28,
 77,
 52,
 5,
 14,
 407,
 16,
 82,
 10311,
 8,
 4,
 107,
 117,
 5952,
 15,
 256,
 4,
 2,
 7,
 3766,
 5,
 723,
 36,
 71,
 43,
 530,
 476,
 26,
 400,
 317,
 46,
 7,
 4,
 12118,
 1029,
 13,
 104,
 88,
 4,
 381,
 15,
 297,
 98,
 32,
 2071,
 56,
 26,
 141,
 6,
 194,
 7486,
 18,
 4,
 226,
 22,
 21,
 134,
 476,
 26,
 480,
 5,
 144,
 30,

The dataset has already converted words to integer-based indices: each number in the training features represents some unique word. That means we can't read the reviews in English directly as a gut check to see if sentiment analysis is really working.
So let's map the indices back to words and check what the review actually says:

In [4]:
# A dictionary mapping words to an integer index
word_index = imdb.get_word_index()

# The first indices are reserved
word_index = {k:(v+3) for k,v in word_index.items()} 
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])
decode_review(x_train[0])

"<START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for retail and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also congratulations to the two little boy's that played the <UNK> of norman and paul they were just brilliant children are often left out of the praising list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be 

Let's check whether the user liked the movie or not:

In [5]:
y_train[0]

1

1 indicates that the user liked the movie.

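RNNs can blow up quickly in training time as sequences get longer, so again to keep things manageable let's limit the reviews to 80 words. Note that pad_sequences truncates from the front by default (truncating='pre'), so longer reviews keep their last 80 words, and shorter reviews are padded with zeros at the front: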

In [6]:
x_train = sequence.pad_sequences(x_train, maxlen=80)
x_test = sequence.pad_sequences(x_test, maxlen=80)
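As a rough illustration of what this padding step does with its default settings, here is a pure-Python sketch. The function name pad_pre is hypothetical; it only mimics the default padding='pre' and truncating='pre' behavior of pad_sequences for a positive maxlen, and returns plain lists rather than a NumPy array.

```python
def pad_pre(sequences, maxlen, value=0):
    """Mimic pad_sequences defaults: keep the last `maxlen` items of each
    sequence and left-pad shorter sequences with `value`."""
    padded = []
    for seq in sequences:
        kept = list(seq[-maxlen:])  # drop items from the front if too long
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded


print(pad_pre([[1, 2, 3, 4, 5], [6, 7]], maxlen=3))
# [[3, 4, 5], [0, 6, 7]]
```

The long sequence loses its first two items, and the short one gets a leading zero, so every row ends up with the same length.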

Now let's set up our neural network model!
We will start with an Embedding layer - this is just a step that converts the integer word indices into dense vectors of fixed size that are better suited for a neural network. The 20,000 is the vocabulary size and 128 is the dimension of each output vector.

Next we set up an LSTM layer for the RNN. We give it 128 units (this matches the output size of the Embedding layer here, although the two are not required to be equal), and add dropout terms to reduce overfitting, which RNNs are particularly prone to.

The last layer consists of a single neuron with a sigmoid activation function, which outputs a value between 0 and 1 for our binary sentiment classification.

In [7]:
model = Sequential()
model.add(Embedding(20000, 128))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 128)         2560000   
_________________________________________________________________
lstm (LSTM)                  (None, 128)               131584    
_________________________________________________________________
dense (Dense)                (None, 1)                 129       
Total params: 2,691,713
Trainable params: 2,691,713
Non-trainable params: 0
_________________________________________________________________
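The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM parameterization (four gates, each with input weights, recurrent weights and a bias); the variable names are just for illustration.

```python
vocab_size = 20000     # first argument to Embedding
embedding_dim = 128    # second argument to Embedding
lstm_units = 128       # argument to LSTM

# One 128-dimensional vector per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# Four gates, each with input weights, recurrent weights and a bias vector.
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)

# One weight per LSTM output plus a single bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 2560000 131584 129 2691713
```

These match the Param # column above, and they show that the Embedding layer holds the vast majority of the weights.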


As this is a binary classification problem, we'll use the binary_crossentropy loss function and the Adam optimizer.

In [8]:
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
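For intuition about what the loss measures, here is a minimal pure-Python sketch of binary cross-entropy averaged over a batch. The function and the small clipping constant eps are illustrative assumptions, not the Keras implementation, although the formula is the standard one.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Confident and correct predictions give a small loss...
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
# ...while confident and wrong predictions are penalized heavily.
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(round(confident_right, 4), round(confident_wrong, 4))
# 0.1054 2.3026
```

This asymmetry is why the loss can keep rising on held-out data even when accuracy barely moves: a few very confident mistakes are enough.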

Now we train our model and watch the progress at each epoch:

In [9]:
model.fit(x_train, y_train,
          batch_size=32,
          epochs=10,
          verbose=2,
          validation_data=(x_test, y_test))

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 25000 samples, validate on 25000 samples
Epoch 1/10
 - 207s - loss: 0.4596 - acc: 0.7848 - val_loss: 0.4154 - val_acc: 0.8179
Epoch 2/10
 - 205s - loss: 0.3045 - acc: 0.8762 - val_loss: 0.3825 - val_acc: 0.8367
Epoch 3/10
 - 210s - loss: 0.2156 - acc: 0.9172 - val_loss: 0.4217 - val_acc: 0.8334
Epoch 4/10
 - 213s - loss: 0.1555 - acc: 0.9432 - val_loss: 0.5235 - val_acc: 0.8242
Epoch 5/10
 - 214s - loss: 0.1078 - acc: 0.9616 - val_loss: 0.5364 - val_acc: 0.8266
Epoch 6/10
 - 213s - loss: 0.0775 - acc: 0.9729 - val_loss: 0.6085 - val_acc: 0.8207
Epoch 7/10
 - 214s - loss: 0.0580 - acc: 0.9811 - val_loss: 0.7092 - val_acc: 0.8174
Epoch 8/10
 - 213s - loss: 0.0459 - acc: 0.9854 - val_loss: 0.7332 - val_acc: 0.8180
Epoch 9/10
 - 212s - loss: 0.0350 - acc: 0.9885 - val_loss: 0.8757 - val_acc: 0.8158
Epoch 10/10
 - 213s - loss: 0.0278 - acc: 0.9912 - val_loss: 0.9042 - val_acc: 0.8174


<tensorflow.python.keras.callbacks.History at 0x1a197258550>
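One way to read this log is to look for the epoch where the validation loss bottoms out. The sketch below simply copies the val_loss and val_acc values from the output above into lists; it is an analysis of the printed numbers, not part of the original training code.

```python
# Values copied from the training log above, one entry per epoch.
val_loss = [0.4154, 0.3825, 0.4217, 0.5235, 0.5364,
            0.6085, 0.7092, 0.7332, 0.8757, 0.9042]
val_acc = [0.8179, 0.8367, 0.8334, 0.8242, 0.8266,
           0.8207, 0.8174, 0.8180, 0.8158, 0.8174]

# Epochs are 1-based in the log, list indices are 0-based.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
print(best_epoch, val_loss[best_epoch - 1], val_acc[best_epoch - 1])
# 2 0.3825 0.8367
```

The validation loss is lowest at epoch 2 and climbs afterwards while the training loss keeps falling, which is the usual signature of overfitting. Keras provides an EarlyStopping callback that can stop training at that point automatically; it is not used in this notebook.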

OK, let's evaluate our model's accuracy:

In [10]:
score, acc = model.evaluate(x_test, y_test,
                            batch_size=32,
                            verbose=2)
print('Test score:', score)
print('Test accuracy:', acc)

Test score: 0.9041770041930676
Test accuracy: 0.8174
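The accuracy figure comes from thresholding the sigmoid output at 0.5 and counting how often the resulting label matches the true one. A minimal sketch with made-up labels and probabilities (the names and numbers are purely illustrative):

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded probability matches the label."""
    predictions = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(int(pred == y) for pred, y in zip(predictions, y_true))
    return correct / len(y_true)


toy_labels = [1, 0, 1, 1, 0]
toy_probs = [0.92, 0.15, 0.40, 0.77, 0.61]  # hypothetical sigmoid outputs
print(binary_accuracy(toy_labels, toy_probs))
# 0.6
```

Here the third and fifth examples land on the wrong side of 0.5, so three out of five predictions are counted as correct.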


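About 82% accuracy, which is not bad considering we limited ourselves to just 80 words of each review. Note, however, that the validation loss is lowest at epoch 2 and rises afterwards while the training accuracy keeps climbing, a sign that the model starts overfitting from that point on.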
