# Sentiment Classification


### Generate word embeddings and retrieve the outputs of each layer with Keras on a classification task

Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation.

It is a distributed representation for text that is perhaps one of the key breakthroughs for the impressive performance of deep learning methods on challenging natural language processing problems.

We will use the IMDB dataset to learn word embeddings as we train our model. Its training split contains 25,000 movie reviews from IMDB, labeled with sentiment (positive or negative), and a further 25,000 reviews are held out for testing.



### Dataset

`from keras.datasets import imdb`

Dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, the words are indexed by their frequency in the dataset, meaning the word that has index 1 is the most frequent word. Use the first 20 words from each review to speed up training, using a max vocab size of 10,000. (The code below uses 500 words per review and a vocabulary of 5,000 instead.)

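As a convention, "0" does not stand for a specific word. With the default `load_data` arguments used below, 0 is reserved for padding, 1 marks the start of a review, and 2 encodes any unknown (out-of-vocabulary) word.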


### Aim

1. Import test and train data  
2. Import the labels (train and test)
3. Get the word index and then create key-value pairs for word and word_id. (12.5 points)
4. Build a Sequential model using Keras for the sentiment classification task. (10 points)
5. Report the accuracy of the model. (5 points)  
6. Retrieve the output of each layer in Keras for a single test sample from the trained model you built. (2.5 points)


#### Usage:

In [1]:
##Import libraries
import tensorflow as tf

import numpy as np

from keras import models, regularizers, layers, optimizers, losses, metrics

from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense, Flatten, Embedding, LSTM, GRU
from keras.preprocessing import sequence
from keras.utils import np_utils, to_categorical
from keras.callbacks import TensorBoard

Using TensorFlow backend.


In [0]:
# Load the dataset, keeping only the top_words most frequent words;
# rarer words are replaced by the out-of-vocabulary id (2 by default), not by 0
top_words = 5000

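# Keep a reference to the original np.load so it can be restored afterwards
# (workaround for NumPy releases where np.load defaults to allow_pickle=False)
np_load_old = np.load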

# modify the default parameters of np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)

# LOAD IMDB DATA
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)

# restore np.load for future normal usage
np.load = np_load_old

In [0]:
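# Word -> id lookup (each id is the word's frequency rank, starting at 1)
word_to_id = imdb.get_word_index()
# NOTE: the INDEX_FROM shift below is disabled, so the special tokens added here
# reuse ids 1 and 2, which already belong to the two most frequent words
'''word_to_id = {k: (v + INDEX_FROM) for k, v in word_to_id.items()}'''
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2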
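The cell above assigns the special tokens ids 0 to 2 without shifting the word ranks, so `<START>` and `<UNK>` collide with ranks 1 and 2. A minimal sketch of a shifted mapping, assuming the default `index_from=3` offset of `imdb.load_data`; the tiny `toy_index` dictionary and the helper name `build_vocab` are made up for illustration and stand in for `imdb.get_word_index()`:

```python
INDEX_FROM = 3  # assumed offset: load_data reserves ids 0, 1 and 2 by default


def build_vocab(word_rank, index_from=INDEX_FROM):
    """Shift frequency ranks by index_from and add the reserved tokens."""
    word_to_id = {word: rank + index_from for word, rank in word_rank.items()}
    word_to_id["<PAD>"] = 0
    word_to_id["<START>"] = 1
    word_to_id["<UNK>"] = 2
    id_to_word = {idx: word for word, idx in word_to_id.items()}
    return word_to_id, id_to_word


# Toy stand-in for imdb.get_word_index(): rank 1 is the most frequent word.
toy_index = {"the": 1, "and": 2, "film": 3, "brilliant": 4}
word_to_id, id_to_word = build_vocab(toy_index)

encoded = [1, 6, 7, 5, 2]  # a made-up encoded review
decoded = " ".join(id_to_word.get(i, "<UNK>") for i in encoded)
print(decoded)
```

With this layout no subtraction is needed when decoding, because the ids in the mapping already match the ids stored in the encoded reviews.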

In [0]:
# Truncate and pad the review sequences to 500 tokens: longer reviews keep their
# last 500 tokens (default truncating='pre'), shorter ones are zero-padded at the end
max_review_length = 500
x_train = sequence.pad_sequences(X_train, maxlen=max_review_length, padding='post')
x_test = sequence.pad_sequences(X_test, maxlen=max_review_length, padding='post')
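To make the padding behaviour concrete, here is a small NumPy sketch of what `pad_sequences(..., maxlen=n, padding='post')` does with its default `truncating='pre'`. The helper name `pad_post` is hypothetical and only mimics the Keras function for lists of integers:

```python
import numpy as np


def pad_post(seqs, maxlen, value=0):
    """Keep the last `maxlen` tokens of each sequence, pad short ones at the end."""
    out = np.full((len(seqs), maxlen), value, dtype="int32")
    for row, seq in enumerate(seqs):
        kept = seq[-maxlen:]          # truncating='pre' drops tokens from the front
        out[row, :len(kept)] = kept   # padding='post' leaves the fill value at the end
    return out


padded = pad_post([[1, 14, 22], [1, 4, 5, 6, 7, 8]], maxlen=4)
print(padded)
```

The short review keeps all its tokens and gains a trailing zero, while the long one loses its first two tokens, including the start marker.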

In [0]:
data = np.concatenate((X_train, X_test), axis=0)
targets = np.concatenate((y_train, y_test), axis=0)

In [19]:
print(data[0])

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 2, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 2, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 2, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 2, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 2, 19, 178, 32]


In [20]:
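# Invert the mapping (id -> word). Encoded reviews are offset by 3, so subtract 3
# before the lookup; ids with no entry (start marker, out-of-vocabulary id) print as '#'.
# Because "<UNK>" was stored under id 2, the rank-2 word ("and") prints as <UNK>.
reverse_word_index = dict([(value, key) for (key, value) in word_to_id.items()])
decoded_review = ' '.join([reverse_word_index.get(i - 3, '#') for i in data[0]])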

print(decoded_review)

# this film was just brilliant casting location scenery story direction everyone's really suited the part they played <UNK> you could just imagine being there robert # is an amazing actor <UNK> now the same being director # father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for # <UNK> would recommend it to everyone to watch <UNK> the fly # was amazing really cried at the end it was so sad <UNK> you know what they say if you cry at a film it must have been good <UNK> this definitely was also # to the two little # that played the # of norman <UNK> paul they were just brilliant children are often left out of the # list i think because the stars that play them all grown up are such a big # for the whole film but these children are amazing <UNK> should be # for what they have done don't you think the

## Build Keras Embedding Layer Model
We can think of the Embedding layer as a dictionary that maps an index assigned to a word to a word vector. This layer is very flexible and can be used in a few ways:

* The embedding layer can be used at the start of a larger deep learning model. 
* We could also load pre-trained word embeddings into the embedding layer when we create our model.
* We can use the embedding layer to train our own word2vec models.

The Keras embedding layer doesn't require us to one-hot encode our words; instead we have to give each word a unique integer number as an id. For the IMDB dataset we've loaded this has already been done, but if this wasn't the case we could use sklearn [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).
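The "dictionary" view can be illustrated with plain NumPy: looking up integer ids in a weight matrix gives the same result as multiplying one-hot vectors by that matrix, which is why the one-hot step can be skipped. This is only a sketch with toy sizes and random weights, not the Keras implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 10, 4            # toy sizes, not the 5000 x 32 used here
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

word_ids = np.array([1, 7, 7, 0])            # one short "review" of integer ids

# Embedding lookup: pick one row of the matrix per word id.
looked_up = embedding_matrix[word_ids]       # shape (4, embedding_dim)

# The same result via explicit one-hot vectors, which the layer lets us skip.
one_hot = np.eye(vocab_size)[word_ids]       # shape (4, vocab_size)
via_one_hot = one_hot @ embedding_matrix

print(looked_up.shape, np.allclose(looked_up, via_one_hot))
```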

In [8]:
embedding_vector_length = 32  # size of each word vector

model = Sequential()
# Embedding output per review: 500 x 32
model.add(
    Embedding(
        top_words,  # vocabulary size (number of distinct word ids)
        embedding_vector_length,  # dimensions of each word vector
        input_length=max_review_length))  # length of each padded review
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 500, 32)           160000    
_________________________________________________________________
flatten_1 (Flatten)          (None, 16000)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 250)               4000250   
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 251       
Total params: 4,160,501
Trainable params: 4,160,501
Non-trainable params: 0
_________________________________________________________________
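The parameter counts in the summary can be checked by hand from the layer sizes used above. A short arithmetic sketch (the variable names are chosen here for illustration):

```python
top_words, embedding_dim, max_len, hidden = 5000, 32, 500, 250

embedding_params = top_words * embedding_dim       # one 32-d vector per word id
flatten_size = max_len * embedding_dim             # 500 x 32 values per review
dense_1_params = flatten_size * hidden + hidden    # weights plus biases
dense_2_params = hidden * 1 + 1                    # single sigmoid unit

total = embedding_params + dense_1_params + dense_2_params
print(embedding_params, flatten_size, dense_1_params, dense_2_params, total)
```

Almost all of the parameters sit in the first dense layer, because flattening turns each review into a 16,000-value vector.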


In [21]:
# Train the model
results = model.fit(
    x_train,
    y_train,
    validation_data=(x_test, y_test),
    epochs=10, batch_size=128)

Train on 25000 samples, validate on 25000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [22]:
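# Evaluate the model
#scores = model.evaluate(x_test, y_test, verbose=0)
#print("Accuracy: %.2f%%" % (scores[1] * 100))

# Mean validation accuracy over all epochs (not the final-epoch accuracy);
# newer Keras versions name this history key "val_accuracy"
print(np.mean(results.history["val_acc"]))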

0.8609040000286102
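The value printed above is the validation accuracy averaged over the ten epochs. The accuracy metric itself amounts to thresholding the sigmoid output and comparing it with the label. A minimal sketch, assuming a 0.5 threshold; the helper `binary_accuracy` and the labels and probabilities are made up for illustration and are not results from this model:

```python
import numpy as np


def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    return float(np.mean(y_pred == np.asarray(y_true)))


# Made-up labels and sigmoid outputs, for illustration only.
labels = [1, 0, 1, 1, 0]
probs = [0.91, 0.12, 0.40, 0.77, 0.65]
acc = binary_accuracy(labels, probs)
print(acc)
```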


In [23]:
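# Predict sentiment for two hand-written reviews.
# NOTE: words are encoded with the raw word_to_id ranks (no offset of 3, KeyError on
# unknown words) and padded at the front, unlike the training data.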
bad = "this movie was terrible and bad"
good = "i really liked the movie and had fun"
for review in [good, bad]:
    tmp = []
    for word in review.split(" "):
        tmp.append(word_to_id[word])
    tmp_padded = sequence.pad_sequences([tmp], maxlen=max_review_length)
    print("%s. Sentiment: %s" % (
        review, model.predict(np.array([tmp_padded[0]]))))

i really liked the movie and had fun. Sentiment: [[0.3439622]]
this movie was terrible and bad. Sentiment: [[0.28569573]]
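Both scores above fall below 0.5. One likely contributor is that the cell encodes text differently from the training data: raw ranks without the offset of 3, no start marker, and padding at the front instead of the end. Below is a sketch of an encoder that mirrors the assumed `imdb.load_data` defaults; `encode_review` and `toy_index` are hypothetical names, and the toy dictionary stands in for `imdb.get_word_index()`:

```python
import numpy as np

PAD, START, UNK, INDEX_FROM = 0, 1, 2, 3   # assumed imdb.load_data defaults


def encode_review(text, word_rank, num_words, maxlen):
    """Turn raw text into a fixed-length id vector shaped like the training data."""
    ids = [START]
    for word in text.lower().split():
        rank = word_rank.get(word)                    # None for unseen words
        idx = rank + INDEX_FROM if rank is not None else UNK
        ids.append(idx if idx < num_words else UNK)   # clip to the vocabulary size
    ids = ids[-maxlen:]                               # same truncation side as training
    return np.array(ids + [PAD] * (maxlen - len(ids)), dtype="int32")


# Toy stand-in for imdb.get_word_index().
toy_index = {"the": 1, "movie": 2, "was": 3, "terrible": 4}
vector = encode_review("The movie was truly terrible", toy_index, num_words=7, maxlen=8)
print(vector)
```

In the notebook the call would use the unmodified word index, `top_words` and `max_review_length`, and the resulting vector would be wrapped in a batch of size one before calling `model.predict`. Whether this changes the two scores has not been verified here.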


In [24]:
#print(decoded_review[1:len(decoded_review)-12])
#print(decoded_review.replace("#", " ")) 
pred_review = decoded_review.replace("#", "")
pred_review = pred_review.replace("  ", " ")
pred_review = pred_review[1:len(pred_review)]

print(len(pred_review.split()))

206


In [0]:
def checkSentiment(rev):
    # Encode each word with word_to_id, pad to the training length, print the score
    tmp = [word_to_id[word] for word in rev.split(" ")]
    tmp_padded = sequence.pad_sequences([tmp], maxlen=max_review_length)
    print("%s. Sentiment: %s" % (rev, model.predict(np.array([tmp_padded[0]]))))

In [26]:
checkSentiment(pred_review)

this film was just brilliant casting location scenery story direction everyone's really suited the part they played <UNK> you could just imagine being there robert is an amazing actor <UNK> now the same being director father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNK> would recommend it to everyone to watch <UNK> the fly was amazing really cried at the end it was so sad <UNK> you know what they say if you cry at a film it must have been good <UNK> this definitely was also to the two little that played the of norman <UNK> paul they were just brilliant children are often left out of the list i think because the stars that play them all grown up are such a big for the whole film but these children are amazing <UNK> should be for what they have done don't you think the whole story was so lo

In [0]:
model.save("trained_imdb_model.h5")

## Retrieve the output of each layer in Keras for a single test sample from the trained model you built

In [0]:
# The final layer outputs the predicted sentiment of the review as a value between 0 and 1.
# A value close to 1 is a strong positive sentiment and a value close to 0 is a strong negative sentiment.

In [39]:
# Print each layer's symbolic output tensor (name, shape, dtype), not sample values
print(model.layers[0].output)
print(model.layers[1].output)
print(model.layers[2].output)
print(model.layers[3].output)

Tensor("embedding_1/embedding_lookup/Identity:0", shape=(?, 500, 32), dtype=float32)
Tensor("flatten_1/Reshape:0", shape=(?, ?), dtype=float32)
Tensor("dense_1/Relu:0", shape=(?, 250), dtype=float32)
Tensor("dense_2/Sigmoid:0", shape=(?, 1), dtype=float32)
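The printout above lists symbolic tensors rather than values for a particular review. To show what concrete per-layer outputs look like, here is a NumPy sketch that replays the four layers on one padded sample. The weights are random stand-ins for what `layer.get_weights()` would return from the trained model, and the sizes are shrunk for readability, so the numbers themselves are meaningless:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim, length, hidden = 50, 4, 6, 5     # toy sizes instead of 5000/32/500/250

# Random stand-ins for the trained weights of each layer.
emb_w = rng.normal(size=(vocab, dim))
d1_w, d1_b = rng.normal(size=(length * dim, hidden)), np.zeros(hidden)
d2_w, d2_b = rng.normal(size=(hidden, 1)), np.zeros(1)

sample = np.array([[1, 14, 22, 16, 0, 0]])   # one padded review, batch size 1

embedding_out = emb_w[sample]                                     # (1, 6, 4)
flatten_out = embedding_out.reshape(len(sample), -1)              # (1, 24)
dense_1_out = np.maximum(flatten_out @ d1_w + d1_b, 0.0)          # ReLU, (1, 5)
dense_2_out = 1.0 / (1.0 + np.exp(-(dense_1_out @ d2_w + d2_b)))  # sigmoid, (1, 1)

for name, out in [("embedding", embedding_out), ("flatten", flatten_out),
                  ("dense_1", dense_1_out), ("dense_2", dense_2_out)]:
    print(name, out.shape)
```

In Keras itself, a common approach is to build a second model that maps `model.input` to the list of `layer.output` tensors and call `predict` on a single padded test sample; the exact calls depend on the Keras version in use.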
