# Sentiment Classification


### Generate word embeddings and retrieve the output of each layer with Keras on a classification task

Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation.

It is a distributed representation for text that is perhaps one of the key breakthroughs for the impressive performance of deep learning methods on challenging natural language processing problems.

We will use the IMDB dataset to learn word embeddings as we train our model. The training split contains 25,000 movie reviews from IMDB, labeled with sentiment (positive or negative), and the test split contains another 25,000.



### Dataset

`from keras.datasets import imdb`

Dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, the words are indexed by their frequency in the dataset, meaning the word that has index 1 is the most frequent word. The assignment suggests using the first 20 words from each review to speed up training, with a max vocab size of 10,000; the code below keeps 300 words per review (`maxlen = 300`).

As a convention, "0" does not stand for a specific word. With the default arguments of `imdb.load_data`, 0 is reserved for padding, 1 marks the start of a review, and 2 encodes any out-of-vocabulary word.


### Aim

1. Import the test and train data.
2. Import the labels (train and test).
3. Get the word index and then create key-value pairs for word and word_id. (12.5 points)
4. Build a Sequential model using Keras for the sentiment classification task. (10 points)
5. Report the accuracy of the model. (5 points)
6. Retrieve the output of each layer in Keras for a single test sample from the trained model you built. (2.5 points)


#### Usage:

In [3]:
from keras.datasets import imdb

vocab_size = 10000 #vocab size

import numpy as np
# save np.load
np_load_old = np.load

# modify the default parameters of np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)

# num_words keeps only the vocab_size most frequent words in the dataset
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)



Using TensorFlow backend.


Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz


In [4]:
print("Shape of X_train : {}".format(x_train.shape))

Shape of X_train : (25000,)


In [5]:
print("Shape of X_test : {}".format(x_test.shape))

Shape of X_test : (25000,)


In [0]:
from keras.preprocessing.sequence import pad_sequences
vocab_size = 10000 #vocab size
maxlen = 300  # number of words kept from each review

#load dataset as a list of ints
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
#make all sequences of the same length
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test =  pad_sequences(x_test, maxlen=maxlen)

# restore np.load for future normal usage
np.load = np_load_old

In [7]:
print("Shape of X_train : {}".format(x_train.shape))

Shape of X_train : (25000, 300)


In [8]:
print("Shape of X_test : {}".format(x_test.shape))

Shape of X_test : (25000, 300)
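To make the effect of `pad_sequences` concrete, here is a minimal pure-Python sketch of its default behaviour as I understand it (`padding='pre'`, `truncating='pre'`, pad value 0). The helper `pad_pre` is hypothetical and only mimics the library call. Under these defaults a long review keeps its last `maxlen` words rather than its first, and a short review is padded with zeros on the left.

```python
def pad_pre(seq, maxlen, value=0):
    """Hypothetical helper mimicking pad_sequences' default 'pre' padding/truncation."""
    seq = list(seq)
    if len(seq) > maxlen:
        # 'pre' truncation: drop words from the start, keep the last maxlen
        seq = seq[len(seq) - maxlen:]
    # 'pre' padding: fill the front with the pad value
    return [value] * (maxlen - len(seq)) + seq


short_review = [1, 14, 22, 16]       # made-up word ids
long_review = list(range(1, 11))     # ids 1..10

padded_short = pad_pre(short_review, maxlen=6)
padded_long = pad_pre(long_review, maxlen=6)

print(padded_short)  # [0, 0, 1, 14, 22, 16]
print(padded_long)   # [5, 6, 7, 8, 9, 10]
```

If the first words of each review are wanted instead, `pad_sequences` accepts `truncating='post'`.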


## Build Keras Embedding Layer Model
We can think of the Embedding layer as a dictionary that maps the index assigned to a word to a word vector. This layer is very flexible and can be used in a few ways:

* The embedding layer can be used at the start of a larger deep learning model.
* We can also load pre-trained word embeddings into the embedding layer when we create our model.
* The embedding layer can be used to train our own word2vec models.

The Keras embedding layer doesn't require us to one-hot encode our words; instead, we give each word a unique integer as an id. For the IMDB dataset we've loaded this has already been done, but if this weren't the case we could use sklearn's [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).
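The "dictionary" view can be illustrated with a small NumPy sketch. It is not the Keras implementation. It only shows that an embedding lookup is row indexing into a weight matrix, and that this gives the same result as multiplying one-hot vectors by that matrix. The sizes and the random matrix are toy assumptions.

```python
import numpy as np

vocab_size = 10     # toy value; the notebook uses 10,000
embedding_dim = 4   # toy value; the notebook uses 50

rng = np.random.RandomState(0)
# One row per word id. In Keras these weights start out random and are learned.
embedding_matrix = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))

review = np.array([0, 0, 3, 7, 3])      # a padded sequence of word ids
vectors = embedding_matrix[review]      # lookup: one row per word id

print(vectors.shape)                    # (5, 4)

# The same id always maps to the same vector.
same_word = np.array_equal(vectors[2], vectors[4])

# Equivalent (but wasteful) one-hot formulation.
one_hot = np.eye(vocab_size)[review]    # shape (5, 10)
via_one_hot = np.dot(one_hot, embedding_matrix)
matches = np.allclose(via_one_hot, vectors)
print(same_word, matches)
```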

In [0]:
#3. Get the word index and then Create key value pair for word and word_id.

In [10]:
# The IMDB dataset ships with a word index (word -> word_id)
w_index = imdb.get_word_index()
# reverse_word_index = dict([(value, key) for (key, value) in w_index.items()])
# decoded_review = ' '.join([reverse_word_index.get(i - 3, "") for i in x_train[0]])

Downloading data from https://s3.amazonaws.com/text-datasets/imdb_word_index.json
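The key-value pairs can be inverted to map a word_id back to its word. The sketch below uses a small made-up dictionary in place of the real `imdb.get_word_index()` result so that it runs without a download. The offset of 3 is an assumption based on the default `index_from=3` of `imdb.load_data`, where ids 0, 1 and 2 are reserved for padding, start-of-sequence and unknown words.

```python
# Toy stand-in for imdb.get_word_index(); the words and ranks are made up.
toy_word_index = {"the": 1, "and": 2, "a": 3, "movie": 17, "great": 84}

INDEX_FROM = 3  # assumed default offset used by imdb.load_data

# Invert the mapping: word_id -> word
id_to_word = {idx: word for word, idx in toy_word_index.items()}


def decode_review(encoded, id_to_word, index_from=INDEX_FROM):
    """Turn encoded ids back into text; reserved ids (0, 1, 2) become '?'."""
    return " ".join(id_to_word.get(i - index_from, "?") for i in encoded)


encoded = [0, 0, 1, 4, 87, 20]  # pad, pad, start, "the", "great", "movie"
decoded = decode_review(encoded, id_to_word)
print(decoded)  # ? ? ? the great movie
```

With the real data, `w_index` would take the place of `toy_word_index` and a row of `x_train` the place of `encoded`.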


In [11]:
w_index

{u'fawn': 34701,
 u'tsukino': 52006,
 u'nunnery': 52007,
 u'sonja': 16816,
 u'vani': 63951,
 u'woods': 1408,
 u'spiders': 16115,
 u'hanging': 2345,
 u'woody': 2289,
 u'trawling': 52008,
 u"hold's": 52009,
 u'comically': 11307,
 u'localized': 40830,
 u'disobeying': 30568,
 u"'royale": 52010,
 u"harpo's": 40831,
 u'canet': 52011,
 u'aileen': 19313,
 u'acurately': 52012,
 u"diplomat's": 52013,
 u'rickman': 25242,
 u'rumbustious': 52014,
 u'familiarness': 52015,
 u"spider'": 52016,
 u'hahahah': 68804,
 u"wood'": 52017,
 u'transvestism': 40833,
 u"hangin'": 34702,
 u'screaming': 1927,
 u'seamier': 40834,
 u'wooded': 34703,
 u'bravora': 52018,
 u'grueling': 16817,
 u'wooden': 1636,
 u'wednesday': 16818,
 u"'prix": 52019,
 u'altagracia': 34704,
 u'circuitry': 52020,
 u'crotch': 11585,
 u'busybody': 57766,
 u"tart'n'tangy": 52021,
 u'pantheistic': 52022,
 u'thrace': 52023,
 u"tom's": 11038,
 u'snuggles': 52025,
 u"frasier's": 52026,
 u'complainers': 52027,
 u'templarios': 52125,
 u'272': 40835

In [12]:
x_test[0]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,   

In [13]:
#4. Build a Sequential Model using Keras for Sentiment Classification task. (10 points)
import keras
keras.backend.clear_session()
model = keras.Sequential()

In [14]:
from keras.layers import Embedding,LSTM,Dense

model.add(Embedding(vocab_size + 1,       # vocabulary size (input dimension)
                    50,                   # embedding size
                    input_length=maxlen)  # number of words in each review
          )

In [15]:
model.add(LSTM(256,                      # number of units: size of the cell state and hidden state
               dropout=0.25,             # dropout applied to the layer's inputs
               recurrent_dropout=0.25))  # dropout applied to the recurrent (hidden) state

In [0]:
model.add(Dense(1,activation='sigmoid'))

In [17]:
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])

In [18]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 300, 50)           500050    
_________________________________________________________________
lstm_1 (LSTM)                (None, 256)               314368    
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 257       
Total params: 814,675
Trainable params: 814,675
Non-trainable params: 0
_________________________________________________________________
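The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for these layer types. The helper names are mine and are not Keras functions.

```python
def embedding_params(input_dim, output_dim):
    # one vector of length output_dim per word id
    return input_dim * output_dim


def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (input_dim * units + units * units + units)


def dense_params(input_dim, units):
    # weight matrix plus one bias per unit
    return input_dim * units + units


emb = embedding_params(10000 + 1, 50)   # vocab_size + 1 rows, 50 columns
lstm = lstm_params(50, 256)
dense = dense_params(256, 1)
total = emb + lstm + dense

print("{} {} {} {}".format(emb, lstm, dense, total))
# 500050 314368 257 814675
```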


In [19]:
model.fit(x_train,y_train,
          epochs=10,
          batch_size=250,          
          validation_data=(x_test, y_test))

Train on 25000 samples, validate on 25000 samples
Epoch 1/10


Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f783b753710>
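The training log above does not include the per-epoch metrics, so no accuracy figure can be quoted from it. In Keras, calling `model.evaluate(x_test, y_test)` on a model compiled with `metrics=['accuracy']` returns the loss followed by the accuracy. That call is not part of the original notebook. As a sketch of what the metric measures, the hypothetical helper below thresholds sigmoid outputs at 0.5 and compares them with the labels. The probabilities and labels are made up.

```python
def binary_accuracy(probabilities, labels, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / float(len(labels))


probs = [0.91, 0.44, 0.63, 0.08]   # made-up sigmoid outputs
labels = [1, 1, 1, 0]              # made-up sentiment labels

acc = binary_accuracy(probs, labels)
print(acc)  # 0.75
```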

## Retrieve the output of each layer in Keras for a single test sample from the trained model you built

In [0]:
from keras import backend as K
outputs = [layer.output for layer in model.layers]  # symbolic output tensors of all layers

In [21]:
outputs

[<tf.Tensor 'embedding_1/embedding_lookup/Identity:0' shape=(?, 300, 50) dtype=float32>,
 <tf.Tensor 'lstm_1/TensorArrayReadV3:0' shape=(?, 256) dtype=float32>,
 <tf.Tensor 'dense_1/Sigmoid:0' shape=(?, 1) dtype=float32>]

In [0]:
test = x_test[0].reshape(1, maxlen)  # a batch containing a single test sample

In [36]:
model.predict(test)

array([[0.4388075]], dtype=float32)

In [37]:
model.layers

[<keras.layers.embeddings.Embedding at 0x7f783b182710>,
 <keras.layers.recurrent.LSTM at 0x7f783b182c10>,
 <keras.layers.core.Dense at 0x7f783b182f90>]

In [0]:
from keras import backend as K

# with a Sequential model
get_1st_layer_output = K.function([model.layers[0].input],
                                  [model.layers[0].output])
get_2nd_layer_output = K.function([model.layers[0].input],
                                  [model.layers[1].output])
get_3rd_layer_output = K.function([model.layers[0].input],
                                  [model.layers[2].output])


In [44]:
get_1st_layer_output([test])[0]

array([[[ 0.04436672,  0.07383348,  0.02983799, ...,  0.01955908,
          0.01320746, -0.03671551],
        [ 0.04436672,  0.07383348,  0.02983799, ...,  0.01955908,
          0.01320746, -0.03671551],
        [ 0.04436672,  0.07383348,  0.02983799, ...,  0.01955908,
          0.01320746, -0.03671551],
        ...,
        [ 0.03630169, -0.03406252,  0.0154582 , ...,  0.03614634,
          0.05001722,  0.03982173],
        [ 0.02122898, -0.01472956, -0.00943918, ...,  0.02011439,
         -0.0373191 ,  0.00290281],
        [-0.06827493, -0.09430009, -0.07474107, ..., -0.03405517,
         -0.07757273,  0.10213285]]], dtype=float32)

In [45]:
get_2nd_layer_output([test])[0]

array([[ 5.57582863e-02,  3.51180658e-02,  2.37971041e-02,
        -6.24547899e-03, -2.67486423e-02, -2.95896959e-02,
        -1.49366381e-02, -5.48655577e-02, -1.24778785e-02,
         2.83080731e-02, -5.81053495e-02,  4.95156413e-03,
         7.48798102e-02, -5.64270606e-03, -1.43446941e-02,
        -8.36328231e-03, -4.02758457e-02,  2.94165802e-03,
         5.76708391e-02,  2.33119610e-03,  4.01535779e-02,
        -1.47934258e-02, -1.19520398e-02, -3.36043723e-02,
         4.55890745e-02,  2.92405672e-02, -2.47052237e-02,
         1.09410065e-03,  4.06023078e-02,  1.67764090e-02,
        -8.64074193e-03, -3.81891765e-02, -3.10100783e-02,
         2.65658423e-02, -5.06972708e-02, -5.88221028e-02,
        -5.80646135e-02, -7.61188045e-02, -5.44605870e-03,
        -9.16073378e-03, -4.69069742e-02,  4.18929607e-02,
         1.46431420e-02,  2.63024028e-02, -3.38013992e-02,
         2.17273254e-02, -7.20813349e-02,  1.99342379e-03,
        -4.17730073e-03, -2.99922507e-02,  2.39361566e-0

In [46]:
get_3rd_layer_output([test])[0]

array([[0.4388075]], dtype=float32)
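The third layer's output follows from the second layer's output: the Dense layer takes a weighted sum of the 256 LSTM values, adds a bias, and applies a sigmoid. The sketch below shows that computation with three made-up values and made-up weights. In the trained model the real kernel and bias could be read with `model.layers[2].get_weights()`, which is not done in the original notebook.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense_sigmoid(hidden, kernel, bias):
    """Output of a single-unit Dense layer with sigmoid activation."""
    z = sum(h * w for h, w in zip(hidden, kernel)) + bias
    return sigmoid(z)


hidden = [0.05, -0.02, 0.03]   # made-up stand-in for the 256 LSTM outputs
kernel = [0.4, 1.0, -0.2]      # made-up Dense weights
bias = 0.0                     # made-up Dense bias

prob = dense_sigmoid(hidden, kernel, bias)
print(prob)  # a probability slightly below 0.5, since the weighted sum is slightly negative
```

A value below 0.5, such as the 0.4388075 shown above, corresponds to class 0 when thresholded at 0.5.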
