# Sentiment classification with the IMDB movie reviews dataset

The goal is to test the effect of pretrained word embeddings, namely [word2vec](https://en.wikipedia.org/wiki/Word2vec), which is available in pretrained form through [Gensim](https://radimrehurek.com/gensim/) from the [gensim-data repository](https://github.com/RaRe-Technologies/gensim-data).

We will train an LSTM-based sentiment classifier on the IMDB dataset, first without and then with pretrained embeddings, to measure the effect of transfer learning.

## The IMDB dataset

"Dataset of 25,000 movies reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".

As a convention, "0" does not stand for a specific word, but instead is used to encode any unknown word."

Source [here](https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification).
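
The filtering described in the quote can be sketched in plain Python. This is only an illustration: `filter_by_rank` and the toy review are hypothetical, and the unknown-word id of 0 follows the quoted description (the cells below use an index offset and map unknown words to id 2 instead).

```python
def filter_by_rank(sequence, num_words=None, skip_top=0, oov_id=0):
    """Replace every id outside [skip_top, num_words) with oov_id."""
    filtered = []
    for word_id in sequence:
        too_frequent = word_id < skip_top
        too_rare = num_words is not None and word_id >= num_words
        filtered.append(oov_id if (too_frequent or too_rare) else word_id)
    return filtered


# Hypothetical review encoded as frequency ranks (1 = most frequent word).
review = [3, 15, 42, 9999, 10000, 25000, 20, 19]

# "Only consider the top 10,000 most common words, but eliminate the top 20."
filtered = filter_by_rank(review, num_words=10000, skip_top=20)
print(filtered)  # [0, 0, 42, 9999, 0, 0, 20, 0]
```

Because the ids are frequency ranks, both filters reduce to simple integer comparisons; no vocabulary lookup is needed.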


## The techniques

For this we will use the [Keras Sequential API](https://keras.io/getting-started/sequential-model-guide/).

This task is adapted from [Sentiment detection with Keras, word embeddings and LSTM deep learning networks](https://www.liip.ch/en/blog/sentiment-detection-with-keras-word-embeddings-and-lstm-deep-learning-networks) and [this notebook](https://github.com/plotti/keras_sentiment/blob/master/Imdb%20Sentiment.ipynb).



## Import libraries

In [None]:
import pandas as pd
import numpy as np
from keras.datasets import imdb 
from keras.models import Sequential 
from keras.layers import Dense 
from keras.layers import LSTM 
from keras.layers import Embedding
from keras.preprocessing import sequence 
import keras

Using TensorFlow backend.


In [None]:
EPOCHS = 5
BATCH_SIZE = 500  # 64
max_review_length = 500  # maximum review length in tokens; longer reviews are truncated
optimizer = "adam"

## Import data

In [None]:
# fix random seed for reproducibility 
np.random.seed(7) 

# load the dataset but only keep the top n words; the rest are replaced by the unknown-word id (2)
top_words = 5000  #TODO!!!!
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)

In [None]:
INDEX_FROM = 3  # word index offset; ids 0-2 are used for the special tokens below
word_to_id = imdb.get_word_index()
word_to_id = {word: (rank + INDEX_FROM) for word, rank in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2

# invert the mapping and decode the first training review
id_to_word = {value: key for key, value in word_to_id.items()}
print(' '.join(id_to_word[word_id] for word_id in X_train[0]))

<START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNK> and would recommend it to everyone to watch and the fly <UNK> was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <UNK> to the two little <UNK> that played the <UNK> of norman and paul they were just brilliant children are often left out of the <UNK> list i think because the stars that play them all grown up are such a big <UNK> for the whole film but these children are amazing and should be <UNK> for what they

## Preprocess data

In [None]:
# truncate and pad the review sequences 
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length) 
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length) 

In [None]:
pd.DataFrame(X_train).head()

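|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 490 | 491 | 492 | 493 | 494 | 495 | 496 | 497 | 498 | 499 |
|---|---|---|---|---|---|---|---|---|---|---|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 4472 | 113 | 103 | 32 | 15 | 16 | 2 | 19 | 178 | 32 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 52 | 154 | 462 | 33 | 89 | 78 | 285 | 16 | 145 | 95 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 106 | 607 | 624 | 35 | 534 | 6 | 227 | 7 | 129 | 113 |
| 3 | 687 | 23 | 4 | 2 | 2 | 6 | 3693 | 42 | 38 | 39 | ... | 26 | 49 | 2 | 15 | 566 | 30 | 579 | 21 | 64 | 2574 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 19 | 14 | 5 | 2 | 6 | 226 | 251 | 7 | 61 | 113 |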
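
The leading zeros in most rows above, and the absence of zeros in row 3, come from padding short reviews at the front and cutting long reviews from the front. A minimal pure-Python sketch of that behaviour, assuming the default `pad_sequences` settings; `pad_or_truncate` is a hypothetical helper, not part of Keras:

```python
def pad_or_truncate(ids, maxlen, pad_id=0):
    """Left-pad with pad_id, or keep only the last maxlen ids."""
    if len(ids) >= maxlen:
        # Drop ids from the front so the end of the review is kept.
        return ids[len(ids) - maxlen:]
    return [pad_id] * (maxlen - len(ids)) + ids


short_review = [1, 14, 22]
long_review = [1, 14, 22, 16, 43, 530, 973]

padded = pad_or_truncate(short_review, maxlen=5)
truncated = pad_or_truncate(long_review, maxlen=5)
print(padded)     # [0, 0, 1, 14, 22]
print(truncated)  # [22, 16, 43, 530, 973]
```

Padding at the front keeps the actual words next to the final LSTM step, which is the state the classifier reads.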


## Create model

In [None]:
# create the baseline model 
embedding_vector_length = 300  # set to 300 for a fair comparison; the word2vec vectors used later are 300-dimensional
model = Sequential() 
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length)) 
model.add(LSTM(100)) 

model.add(Dense(1, activation='sigmoid')) 
model.compile(loss='binary_crossentropy',optimizer=optimizer, metrics=['accuracy']) 
print(model.summary()) 

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 500, 300)          1500000   
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               160400    
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
Total params: 1,660,501
Trainable params: 1,660,501
Non-trainable params: 0
_________________________________________________________________
None
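
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for these layer types; the helper names are hypothetical and it assumes the default layer settings (biases enabled).

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab_size * dim


def lstm_params(input_dim, units):
    # Four weight blocks (input, forget and output gates plus the candidate
    # cell state), each with input weights, recurrent weights and a bias.
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units


counts = [
    embedding_params(5000, 300),  # top_words x embedding_vector_length
    lstm_params(300, 100),
    dense_params(100, 1),
]
total = sum(counts)
print(counts, total)  # [1500000, 160400, 101] 1660501
```

Most of the parameters sit in the embedding layer, which is the part that pretrained vectors replace later on.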


## Train model

In [None]:
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=EPOCHS, batch_size=BATCH_SIZE) 

Train on 25000 samples, validate on 25000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f0bf3e80a20>

## Evaluate model

In [None]:
# Final evaluation of the model 
scores = model.evaluate(X_test, y_test, verbose=0) 

print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 86.06%


## Predict something

In [None]:
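bad = "this movie was terrible and bad"
good = "i really liked the movie and had fun"
unk_id = word_to_id["<UNK>"]
for review in [good, bad]:
    # map each word to its id; unknown words and ids outside the top_words vocabulary become <UNK>
    tmp = [word_to_id.get(word, unk_id) for word in review.split(" ")]
    tmp = [word_id if word_id < top_words else unk_id for word_id in tmp]
    tmp_padded = sequence.pad_sequences([tmp], maxlen=max_review_length)
    print("%s . Sentiment: %s" % (review, model.predict(tmp_padded)[0][0]))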

i really liked the movie and had fun . Sentiment: 0.8632016
this movie was terrible and bad . Sentiment: 0.05640324
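
The encoding step in the loop above can be isolated as a small helper. This is a sketch with a toy vocabulary: `encode_review` and `toy_word_rank` are hypothetical stand-ins for the real word index, and the fallback to `<UNK>` for unseen or too-rare words is an assumption about how new text should be handled, since the embedding layer only has rows for ids below `top_words`.

```python
PAD_ID, START_ID, UNK_ID = 0, 1, 2
INDEX_FROM = 3  # same offset as in the notebook

# Hypothetical frequency ranks standing in for imdb.get_word_index().
toy_word_rank = {"the": 1, "and": 2, "was": 13, "movie": 17, "bad": 75, "terrible": 391}


def encode_review(text, word_rank, top_words):
    """Turn raw text into ids, mapping unseen or too-rare words to UNK_ID."""
    ids = []
    for word in text.lower().split():
        rank = word_rank.get(word)
        word_id = UNK_ID if rank is None else rank + INDEX_FROM
        if word_id >= top_words:
            word_id = UNK_ID
        ids.append(word_id)
    return ids


encoded = encode_review("This movie was terrible and awful", toy_word_rank, top_words=100)
print(encoded)  # [2, 20, 16, 2, 5, 2]
```

Here "this" and "awful" are missing from the toy vocabulary and "terrible" falls outside the top 100, so all three are encoded as `<UNK>`.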


## Load word2vec embeddings

In [None]:
!pip install gensim
import gensim.downloader as api

w2v_model = api.load("word2vec-google-news-300")

w2v_matrix = np.zeros((len(id_to_word), embedding_vector_length))

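for wid, word in id_to_word.items():
    try:
        w2v_matrix[wid] = w2v_model[word]
    except (KeyError, IndexError):
        # skip words without a word2vec vector (KeyError) or with an id outside the matrix (IndexError)
        pass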
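
The construction of the embedding matrix can be illustrated with toy data. In this sketch `toy_vectors` is a hypothetical stand-in for the word2vec lookup and the dimension is shrunk to 3. Sizing the matrix by the largest id plus one is an assumption made here so that every id has a row; the notebook itself uses `len(id_to_word)`.

```python
EMBEDDING_DIM = 3  # toy size; the notebook uses 300

# Hypothetical pretrained vectors standing in for w2v_model.
toy_vectors = {"film": [0.1, 0.2, 0.3], "great": [0.4, 0.5, 0.6]}
toy_id_to_word = {0: "<PAD>", 1: "<START>", 2: "<UNK>", 4: "film", 5: "great", 6: "brilliant"}

num_rows = max(toy_id_to_word) + 1  # rows must cover the largest id
matrix = [[0.0] * EMBEDDING_DIM for _ in range(num_rows)]

missing = []
for word_id, word in toy_id_to_word.items():
    vector = toy_vectors.get(word)
    if vector is None:
        missing.append(word)  # row stays all zeros
    else:
        matrix[word_id] = list(vector)

print(len(matrix))  # 7
print(missing)      # ['<PAD>', '<START>', '<UNK>', 'brilliant']
```

Words without a pretrained vector, including the special tokens, keep a zero row; with a trainable embedding layer those rows can still be learned during training.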



## Prepare pretrained embedding layer

In [None]:
from keras.initializers import Constant

word2vec_embeddings_layer = Embedding(len(id_to_word),
                            embedding_vector_length,
                            embeddings_initializer=Constant(w2v_matrix),
                            input_length=max_review_length,
                            trainable=True)

## Build model with pretrained embedding

In [None]:
# create the model using the pretrained embeddings
model_pretrained = Sequential() 

model_pretrained.add(word2vec_embeddings_layer) 

model_pretrained.add(LSTM(100)) 
 
model_pretrained.add(Dense(1, activation='sigmoid')) 
model_pretrained.compile(loss='binary_crossentropy',optimizer=optimizer, metrics=['accuracy']) 
print(model_pretrained.summary()) 

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 500, 300)          26576100  
_________________________________________________________________
lstm_2 (LSTM)                (None, 100)               160400    
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 101       
Total params: 26,736,601
Trainable params: 26,736,601
Non-trainable params: 0
_________________________________________________________________
None


In [None]:
model_pretrained.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=EPOCHS, batch_size=BATCH_SIZE) 

Train on 25000 samples, validate on 25000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f0b8eb42e10>

## Test final score

In [None]:
# Final evaluation of the model 
scores = model_pretrained.evaluate(X_test, y_test, verbose=0) 

print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 87.88%
