# GloVe Embeddings and IMDb Sentiment Analysis Example

This notebook has two parts. Part I introduces the pretrained [GloVe](https://nlp.stanford.edu/projects/glove/) word embeddings through some fun word-arithmetic tasks. Part II uses the pretrained embeddings as input to train an LSTM sentiment analysis model on Keras's IMDb movie review dataset.

The GloVe embeddings are released as open source by the Stanford NLP team. The version used here (`glove.6B`) was trained on the 2014 English Wikipedia dump together with the Gigaword 5 corpus. It covers a vocabulary of 400,000 tokens (words) and is available as 50-, 100-, 200-, and 300-dimensional vectors. This notebook gives a brief overview of some of its capabilities.

Credits to the Stanford NLP team: Jeffrey Pennington, Richard Socher, Christopher D. Manning.

In [1]:
import os, sys, operator
import numpy as np
from matplotlib import pyplot as plt

## Part I: Introduction to the GloVe embedding matrix

In [2]:
from scipy.spatial.distance import cosine

In [3]:
# Load GloVe embeddings:
glove = {}
with open('../GloVe Embeddings/glove.6B.100d.txt', encoding = 'utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        glove[word] = coefs

print('Found {} word vectors.'.format(len(glove)))

In [4]:
# A quick look at what word vectors look like:
glove['nyc']

array([ 0.85189  ,  0.35649  ,  0.14484  , -0.61532  ,  0.61493  ,
        0.2261   , -0.55096  ,  0.46157  , -0.019063 ,  0.56516  ,
        0.17011  , -0.49439  , -0.18368  ,  0.08651  , -0.54403  ,
        0.40244  ,  0.35977  ,  0.012714 , -0.23156  ,  0.081932 ,
        0.031566 , -0.66883  , -0.18811  , -0.098277 , -0.2276   ,
       -0.0044313,  0.14616  ,  0.069204 , -0.13451  ,  0.35255  ,
       -0.24226  ,  0.21137  , -0.14358  ,  0.86754  , -0.83692  ,
        0.045826 , -0.45233  , -0.32635  ,  0.57908  , -0.10124  ,
        0.59631  ,  0.0056739, -0.57863  , -0.18945  ,  0.3612   ,
        0.35982  , -0.58917  ,  0.028608 ,  0.46961  ,  0.32781  ,
       -0.34656  , -0.33941  ,  0.10335  ,  0.31001  , -0.85238  ,
       -0.77135  ,  0.38455  ,  0.56638  ,  0.3545   , -0.39816  ,
       -0.91958  ,  0.17678  , -0.012436 , -0.28267  ,  0.52689  ,
        0.42276  , -0.18496  , -0.28477  ,  0.20716  ,  0.44375  ,
       -0.24254  , -0.03832  , -0.66316  ,  0.19424  ,  0.0014

In [5]:
# A quick test for cosine similarity:
m = glove['mother']
f = glove['father']
print(1 - cosine(m, f))

0.8656660914421082
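
SciPy's `cosine` returns the cosine *distance*, which is why the cell above subtracts it from 1. As a minimal sketch of what is being computed, the similarity can also be written directly with NumPy. The helper name `cosine_similarity` and the toy vectors are illustrative assumptions, not part of the original notebook:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors (made up for illustration, not real GloVe values):
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])      # perpendicular vectors
print(same_direction, orthogonal)
```

Parallel vectors score close to 1 and perpendicular vectors score 0, regardless of vector length. This is why cosine similarity is a common choice for comparing word vectors.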


#### **Let's have some fun with word arithmetic**

In [6]:
def word_analogy(word_a, word_b, word_c, embedding_matrix):
    # Get the embedding vector for each input word:
    e_a, e_b, e_c = embedding_matrix[word_a], embedding_matrix[word_b], embedding_matrix[word_c]

    # Initialize max_cosine_sim as a large negative number and best_word as None:
    words = embedding_matrix.keys()
    max_cosine_sim = -100
    best_word = None

    # Loop over the whole word vector set:
    for w in words:
        # To avoid best_word being one of the input words, pass on them:
        if w in (word_a, word_b, word_c):
            continue

        # Compute cosine similarity between the vector (e_b - e_a) and the vector ((w's vector representation) - e_c):
        cosine_sim = 1 - cosine(e_b - e_a, embedding_matrix[w] - e_c)

        # Keep track of best word:
        if cosine_sim > max_cosine_sim:
            max_cosine_sim = cosine_sim
            best_word = w

    return best_word

In [7]:
# Test 1:
rslt = word_analogy('tokyo', 'japan', 'rome', glove)
print('Tokyo - Japan | Rome -', '\nanswer:', rslt)

Tokyo - Japan | Rome - 
answer: italy


In [8]:
# Test 2:
rslt = word_analogy('man', 'woman', 'boy', glove)
print('man - woman | boy -', '\nanswer:', rslt)

man - woman | boy - 
answer: girl


In [9]:
# Test 3 - Impressive:
rslt = word_analogy('sky', 'blue', 'zebra', glove)
print('sky - blue | zebra -', '\nanswer:', rslt)

sky - blue | zebra - 
answer: striped


In [10]:
# Test 4 - Doesn't always work:
rslt = word_analogy('spain', 'madrid', 'canada', glove)
print('Spain - Madrid | Canada -', '\nanswer:', rslt)

Spain - Madrid | Canada - 
answer: toronto
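
The loop in `word_analogy` calls `cosine` once per vocabulary word, which can be slow for 400,000 tokens. Below is a hedged sketch of the same search done with NumPy array operations. The function name `word_analogy_vectorized` and the tiny 2-D embeddings are assumptions made up for illustration; the real `glove` dictionary could be passed in the same way:

```python
import numpy as np

def word_analogy_vectorized(word_a, word_b, word_c, embeddings):
    """Find w maximizing cosine(e_b - e_a, e_w - e_c), skipping the input words."""
    words = list(embeddings)
    matrix = np.stack([np.asarray(embeddings[w], dtype=float) for w in words])

    target = matrix[words.index(word_b)] - matrix[words.index(word_a)]
    diffs = matrix - matrix[words.index(word_c)]

    # Cosine similarity of every row of diffs with the target vector.
    # Zero-length differences (e.g. word_c itself) give NaN, handled below.
    norms = np.linalg.norm(diffs, axis=1) * np.linalg.norm(target)
    with np.errstate(divide='ignore', invalid='ignore'):
        sims = diffs @ target / norms

    # Exclude the input words and any undefined similarities:
    for w in (word_a, word_b, word_c):
        sims[words.index(w)] = -np.inf
    sims = np.where(np.isnan(sims), -np.inf, sims)

    return words[int(np.argmax(sims))]

# Toy 2-D embeddings (hypothetical values, not GloVe):
toy = {
    'man':   np.array([1.0, 0.0]),
    'woman': np.array([1.0, 1.0]),
    'boy':   np.array([2.0, 0.0]),
    'girl':  np.array([2.0, 1.0]),
    'tree':  np.array([0.0, 3.0]),
}
toy_answer = word_analogy_vectorized('man', 'woman', 'boy', toy)
print(toy_answer)
```

This sketch rebuilds the matrix on every call to stay self-contained. For repeated queries it would make sense to build the word list and matrix once and reuse them.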


#### **Some more fun with closest neighbors**

In [11]:
def closest_neighbors(word, n, embedding_matrix):
    # First, obtain the embedding vector for the input word:
    word_vec = embedding_matrix[word]

    # Compute cosine similarities across entire vocabulary set:
    words = embedding_matrix.keys()
    cosine_sims = {w:1 - cosine(word_vec, embedding_matrix[w]) for w in words}

    # Obtain the n best matches (excluding the input word itself):
    cosine_sims_sorted = sorted(cosine_sims.items(), key=operator.itemgetter(1))
    synonyms_list = [i[0] for i in cosine_sims_sorted[-(n+1):-1]]
    return synonyms_list[::-1]

In [12]:
# Test 1:
closest_neighbors('red', 5, glove)

['yellow', 'blue', 'green', 'black', 'white']

In [13]:
# Test 2:
closest_neighbors('brave', 5, glove)

['courageous', 'fearless', 'proud', 'heroic', 'valiant']

In [14]:
# Test 3:
closest_neighbors('cia', 5, glove)

['fbi', 'intelligence', 'secret', 'covert', 'pentagon']

In [15]:
# Test 4 (what happened here...?):
closest_neighbors('carnivore', 5, glove)

['herbivore', 'carnivorous', 'spamming', 'marsupial', 'tyrannosaurus']
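
`closest_neighbors` recomputes every cosine similarity and sorts the full vocabulary on each call. One possible alternative, sketched here under assumptions, is to normalize all vectors once so that a single matrix-vector product yields every similarity. The names `build_unit_matrix` and `closest_neighbors_fast` and the toy vectors are hypothetical:

```python
import numpy as np

def build_unit_matrix(embeddings):
    """Return the word list and a matrix whose rows are unit-length vectors."""
    words = list(embeddings)
    matrix = np.stack([np.asarray(embeddings[w], dtype='float32') for w in words])
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing all-zero rows by zero
    return words, matrix / norms

def closest_neighbors_fast(word, n, words, unit_matrix):
    """Return the n words with the highest cosine similarity to `word`."""
    idx = words.index(word)
    sims = unit_matrix @ unit_matrix[idx]  # dot of unit vectors = cosine similarity
    sims[idx] = -np.inf                    # exclude the query word itself
    top = np.argsort(sims)[::-1][:n]
    return [words[i] for i in top]

# Toy 2-D embeddings (made up for illustration):
toy_vectors = {
    'red':   [1.0, 0.1],
    'blue':  [0.9, 0.2],
    'green': [0.8, 0.3],
    'dog':   [0.0, 1.0],
}
words, unit_matrix = build_unit_matrix(toy_vectors)
toy_neighbors = closest_neighbors_fast('red', 2, words, unit_matrix)
print(toy_neighbors)
```

Unlike the original function, this version excludes the query word explicitly rather than assuming it is always the single top match.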

## Part II: Sentiment analysis model using the IMDb movie review dataset

In [16]:
import numpy as np
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense, LSTM, CuDNNLSTM, Bidirectional, Dropout
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence

Using TensorFlow backend.


#### **Load IMDb dataset**

In [17]:
# Load the IMDb dataset and limit to the top 5000 most frequent words:
top_words = 5000
word_idx_start = 3
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words, index_from=word_idx_start)

In [18]:
# Load the word-index dictionary so that we can view the dataset. For this dataset, we need to manually shift the word indices by
# word_idx_start and add the special tokens <PAD> (zero-padding), <START> (start of review), and <UNK> (unknown word):
word_to_id = imdb.get_word_index()
word_to_id = {k:(v + word_idx_start) for k, v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2

id_to_word = {value:key for key, value in word_to_id.items()}
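
The same dictionary can also be used in the other direction, to turn new text into the integer format the dataset uses. The sketch below assumes simple lowercase whitespace tokenization (the dataset's own preprocessing may differ) and uses a hypothetical helper name `encode_review` with a made-up toy vocabulary:

```python
def encode_review(text, word_to_id, top_words, start_id=1, unk_id=2):
    """Encode text as <START> followed by word ids; rare or unseen words become <UNK>."""
    ids = [start_id]
    for token in text.lower().split():
        idx = word_to_id.get(token, unk_id)
        # Indices at or above top_words were replaced by <UNK> when the data was loaded:
        ids.append(idx if idx < top_words else unk_id)
    return ids

# Toy vocabulary (hypothetical ids, not the real IMDb word index):
toy_word_to_id = {'<PAD>': 0, '<START>': 1, '<UNK>': 2,
                  'the': 4, 'film': 22, 'brilliant': 530, 'zyzzyva': 9000}
toy_encoded = encode_review('The film was brilliant zyzzyva', toy_word_to_id, top_words=5000)
print(toy_encoded)
```

Here `was` is missing from the toy vocabulary and `zyzzyva` has an index above the cutoff, so both map to the `<UNK>` id.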

In [19]:
# Have a look at some reviews:
print('5-star review:')
print(' '.join(id_to_word[id] for id in X_train[0]))

print('\n1-star review:')
print(' '.join(id_to_word[id] for id in X_train[15]))

5-star review:
<START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNK> and would recommend it to everyone to watch and the fly <UNK> was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <UNK> to the two little <UNK> that played the <UNK> of norman and paul they were just brilliant children are often left out of the <UNK> list i think because the stars that play them all grown up are such a big <UNK> for the whole film but these children are amazing and should be <UNK

#### **Preprocess and create the embedding matrix**

In [20]:
# Truncate and pad the input sequences with pad_sequences() so that they all have the same length, since Keras requires fixed-length
# input vectors. Shorter reviews are padded with zeros (at the front, by default), which the model will learn carry no information:
max_review_length = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)
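
By default, `pad_sequences` both pads and truncates at the *front* of each sequence, so the end of a long review is what survives. The NumPy-only sketch below mimics that default behavior for illustration. The function name `pad_pre` is an assumption, and it is not a replacement for the Keras utility:

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value` and keep only the last `maxlen` items of long ones."""
    out = np.full((len(sequences), maxlen), value, dtype='int32')
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # nothing to copy; the row stays all padding
        kept = seq[-maxlen:]           # truncate from the front
        out[row, -len(kept):] = kept   # right-align so the padding sits at the front
    return out

toy_padded = pad_pre([[1, 2, 3], [4, 5, 6, 7, 8, 9]], maxlen=5)
print(toy_padded)
```

The first toy sequence gains two leading zeros, while the second loses its first element.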

In [21]:
# Create our pre-trained embedding matrix using the GloVe vectors we loaded earlier:
embedding_dim = 100
vocab_size = len(word_to_id)

embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in word_to_id.items():
    embedding_vector = glove.get(word)
    # Words not found in pretrained embedding will be all-zeros.
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

glove_layer = Embedding(vocab_size, embedding_dim, weights=[embedding_matrix], input_length=max_review_length, trainable=False)
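
It can be useful to know which words end up as all-zero rows. The following sketch builds the matrix the same way but also records the words that have no pretrained vector. The helper name `build_embedding_matrix` and the toy data are assumptions. Sizing the matrix by the largest index, rather than by the dictionary length, is a defensive choice made here in case the index numbering has gaps:

```python
import numpy as np

def build_embedding_matrix(word_to_id, vectors, dim):
    """Return (matrix, missing): one row per word id, plus the words with no pretrained vector."""
    n_rows = max(word_to_id.values()) + 1  # large enough for the biggest index
    matrix = np.zeros((n_rows, dim), dtype='float32')
    missing = []
    for word, idx in word_to_id.items():
        vec = vectors.get(word)
        if vec is None:
            missing.append(word)  # row stays all-zero
        else:
            matrix[idx] = vec
    return matrix, missing

# Toy inputs (hypothetical values, standing in for word_to_id and glove):
toy_ids = {'<PAD>': 0, '<START>': 1, '<UNK>': 2, 'good': 4, 'bad': 5}
toy_glove = {'good': np.array([1.0, 2.0]), 'bad': np.array([3.0, 4.0])}
toy_matrix, toy_missing = build_embedding_matrix(toy_ids, toy_glove, dim=2)
print(toy_matrix.shape, toy_missing)
```

In the toy example only the three special tokens are missing, and the unused index 3 simply remains a zero row.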

#### **Build LSTM model**

In [22]:
# Start with Sequential(). The first layer will be the GloVe word embeddings we've just processed:
model = Sequential()
model.add(glove_layer)

# First recurrent layer. CuDNNLSTM(n) adds a single LSTM layer with n recurrent units, using the GPU-only cuDNN implementation.
# Since we'll be feeding the output of this layer into a second LSTM layer, we must specify return_sequences=True:
model.add(CuDNNLSTM(embedding_dim, return_sequences=True))
model.add(Dropout(0.5))

# Second recurrent layer. Notice that we're no longer specifying return_sequences = True, since this is the final recurrent layer of the network:
model.add(CuDNNLSTM(embedding_dim))
model.add(Dropout(0.5))

# Output layer, sigmoid activation since this is a binary-classification problem:
model.add(Dense(1, activation='sigmoid'))

In [23]:
# View model architecture:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 500, 100)          8858700   
_________________________________________________________________
cu_dnnlstm_1 (CuDNNLSTM)     (None, 500, 100)          80800     
_________________________________________________________________
dropout_1 (Dropout)          (None, 500, 100)          0         
_________________________________________________________________
cu_dnnlstm_2 (CuDNNLSTM)     (None, 100)               80800     
_________________________________________________________________
dropout_2 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
=================================================================
Total params: 9,020,401
Trainable params: 161,701
Non-trainable params: 8,858,700
_________________________________________________________________
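
The parameter counts in the summary can be reproduced by hand. The sketch below assumes that the cuDNN LSTM layer keeps two bias vectors per gate, which is consistent with the 80,800 figure above. The vocabulary size of 88,587 is inferred from the embedding layer's 8,858,700 parameters at 100 dimensions. The function names are illustrative:

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary index
    return vocab_size * dim

def cudnn_lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and two bias vectors
    return 4 * (input_dim * units + units * units + 2 * units)

def dense_params(input_dim, units):
    # Weights plus one bias per output unit
    return input_dim * units + units

embedding = embedding_params(88587, 100)
lstm_1 = cudnn_lstm_params(100, 100)
lstm_2 = cudnn_lstm_params(100, 100)
dense = dense_params(100, 1)

trainable = lstm_1 + lstm_2 + dense  # the embedding layer is frozen
total = embedding + trainable
print(embedding, lstm_1, dense, trainable, total)
```

Because the embedding layer was created with `trainable=False`, its weights account for all of the non-trainable parameters.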

In [24]:
# Define the loss function, optimizer, batch size, and number of epochs:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=4, batch_size=32, validation_split=0.2, verbose=1)

Train on 20000 samples, validate on 5000 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x21b9fb13b00>

#### **Save/Load model (so that we don't have to retrain)**

In [None]:
from keras.models import load_model

In [None]:
# Save model:
model.save('IMDb_LSTM.h5')

In [None]:
# Load saved model:
model = load_model('IMDb_LSTM.h5')

#### **Evaluate performance on test data**

In [25]:
from sklearn.metrics import classification_report

In [26]:
# Test data predictions:
test_probabilities = model.predict(X_test).ravel()  # flatten (n, 1) -> (n,)
test_predictions = (test_probabilities > 0.5).astype(int)  # threshold at 0.5

In [27]:
# Accuracy evaluation:
test_evaluation = model.evaluate(X_test, y_test, verbose=1)
print('Test accuracy:', test_evaluation[1])

Test accuracy: 0.87656


In [28]:
# Precision and recall stats:
print(classification_report(y_test, test_predictions))

             precision    recall  f1-score   support

          0       0.88      0.87      0.88     12500
          1       0.87      0.88      0.88     12500

avg / total       0.88      0.88      0.88     25000
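
For reference, the precision, recall, and F1 columns in the report are derived from counts of true positives, false positives, and false negatives. The sketch below shows that arithmetic on made-up labels. The function name `precision_recall_f1` is an assumption, and the toy arrays are not the notebook's predictions:

```python
import numpy as np

def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the class labelled `positive`."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == positive) & (y_true == positive)))
    fp = int(np.sum((y_pred == positive) & (y_true != positive)))
    fn = int(np.sum((y_pred != positive) & (y_true == positive)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels (made up): 2 true positives, 1 false positive, 1 false negative
toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0]
toy_precision, toy_recall, toy_f1 = precision_recall_f1(toy_true, toy_pred)
print(toy_precision, toy_recall, toy_f1)
```

Passing `positive=0` gives the corresponding row for the negative class.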



#### **View some predictions**

In [29]:
import random

In [30]:
# Sentiment result function:
def sentiment(n):
    if n <= 0.5:
        return 'Neg'
    else:
        return 'Pos'

In [31]:
# View a random selection:
rn = random.randrange(len(test_predictions))  # randint() includes its upper bound and could index past the end
print('Actual: {}; Predicted: {}; Pred. Probability: {}'.format(sentiment(y_test[rn]), sentiment(test_probabilities[rn]), test_probabilities[rn]))
print('--------------------------------------------------------------------')
print(' '.join(id_to_word[id] for id in X_test[rn] if id > 0))

Actual: Neg; Predicted: Neg; Pred. Probability: 0.026494808495044708
--------------------------------------------------------------------
<START> ok i admit that i still <UNK> <UNK> with la that was the main reason i went to see this film but it was so boring that i nearly felt asleep sorry but her talents as actress are not very convincing furthermore this film was presented as having outstanding special effects and cgi yeah for a b movie it is not that bad after having seen her in <UNK> some years ago also a very crappy film i thought that she would play more convincingly but la and may be the james bond the world is not enough seem to be the only good films with her is it her talent does she have a bad taste when <UNK> her films or simply bad luck


In [32]:
# Isolate to only the incorrect predictions:
incorrects = [i for i in range(len(test_predictions)) if y_test[i] != test_predictions[i]]

In [33]:
rn = random.choice(incorrects)
print('Actual: {}; Predicted: {}; Pred. Probability: {}'.format(sentiment(y_test[rn]), sentiment(test_probabilities[rn]), test_probabilities[rn]))
print('--------------------------------------------------------------------')
print(' '.join(id_to_word[id] for id in X_test[rn] if id > 0))

Actual: Neg; Predicted: Pos; Pred. Probability: 0.9176121950149536
--------------------------------------------------------------------
<START> <UNK> a previous summary says if you like aliens and <UNK> you will enjoy this film i could not disagree more this film pays no respect to its <UNK> <UNK> and has reduced two of the best loved sci fi <UNK> to little more than a teen horror slasher movie it has none of the tension or <UNK> present in previous alien or <UNK> movies and there is no <UNK> lead character i really did not care about any of the characters and i <UNK> <UNK> to see the stereotypical cast die as soon as possible in the <UNK> hope something better would <UNK> them it really takes super human <UNK> to have two of the most <UNK> creatures ever <UNK> <UNK> fail to make a gripping thrilling movie only watch this if you want to see how not to do it
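
Rather than sampling incorrect predictions at random, one option is to rank them by how confident the model was, since the most confident mistakes are often the most informative to read. This is a minimal sketch on made-up numbers. The helper name `most_confident_errors` is hypothetical, and it assumes the same 0.5 decision threshold used above:

```python
import numpy as np

def most_confident_errors(y_true, probabilities, k=5):
    """Return indices of the k misclassified samples whose probability is farthest from 0.5."""
    y_true = np.asarray(y_true)
    probabilities = np.asarray(probabilities, dtype=float)
    predictions = (probabilities > 0.5).astype(int)

    wrong = np.flatnonzero(predictions != y_true)   # indices of incorrect predictions
    confidence = np.abs(probabilities[wrong] - 0.5)  # distance from the decision threshold
    order = np.argsort(confidence)[::-1]             # most confident first
    return wrong[order][:k]

# Toy labels and probabilities (made up for illustration):
toy_labels = [0, 1, 0, 1, 0]
toy_probs = [0.92, 0.40, 0.10, 0.05, 0.60]
toy_worst = most_confident_errors(toy_labels, toy_probs, k=2)
print(toy_worst)
```

In the notebook this could be called with `y_test` and `test_probabilities`, and the returned indices used in place of `rn` when printing reviews.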
