# Sentiment Classification


### Generate Word Embeddings and retrieve outputs of each layer with Keras based on Classification task

Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation.

It is a distributed representation for text that is perhaps one of the key breakthroughs for the impressive performance of deep learning methods on challenging natural language processing problems.

We willl use the imdb dataset to learn word embeddings as we train our dataset. This dataset contains 25,000 movie reviews from IMDB, labeled with sentiment (positive or negative). 



### Dataset

`from keras.datasets import imdb`

Dataset of 25,000 movies reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, the words are indexed by their frequency in the dataset, meaning the for that has index 1 is the most frequent word. Use the first 20 words from each review to speed up training, using a max vocab size of 10,000.

As a convention, "0" does not stand for a specific word, but instead is used to encode any unknown word.


### Aim

1. Import test and train data  
2. Import the labels ( train and test) 
3. Get the word index and then Create key value pair for word and word_id. (12.5 points)
4. Build a Sequential Model using Keras for Sentiment Classification task. (10 points)
5. Report the Accuracy of the model. (5 points)  
6. Retrive the output of each layer in keras for a given single test sample from the trained model you built. (2.5 points)


#### Usage:

In [1]:
from numpy import array
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import LSTM, SimpleRNN,GRU, Dense
from keras.layers import Flatten
from keras.layers.embeddings import Embedding
import numpy as np
#import keras

Using TensorFlow backend.


In [0]:
from keras.datasets import imdb

vocab_size = 5000 #vocab size

#(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size) # vocab_size is no.of words to consider from the dataset, ordering based on frequency.

# save np.load
np_load_old = np.load

# modify the default parameters of np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)

# call load_data with allow_pickle implicitly set to true
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=vocab_size)

# restore np.load for future normal usage
np.load = np_load_old

In [3]:
X_train

array([list([1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 2, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 2, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 2, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 2, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 2, 19, 178, 32]),
       list([1, 194, 1153, 194, 2, 78, 228, 5, 6, 1463, 4369,

In [0]:
# Truncate and pad the review sequences 
from keras.preprocessing import sequence 
max_review_length = 500 
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length) 
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length) 

In [5]:
X_train[1]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,   

In [6]:
X_train.shape

(25000, 500)

In [7]:
# create the model
model = Sequential()
model.add(Dense(32,input_shape=(500,)))
#model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(128, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

W0628 17:42:58.576520 139665857730432 deprecation_wrapper.py:119] From /usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py:74: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

W0628 17:42:58.593480 139665857730432 deprecation_wrapper.py:119] From /usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

W0628 17:42:58.598895 139665857730432 deprecation_wrapper.py:119] From /usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

W0628 17:42:58.645564 139665857730432 deprecation_wrapper.py:119] From /usr/local/lib/python2.7/dist-packages/keras/optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

W0628 17:42:58.663388 139665857730432 deprecation_wrappe

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 32)                16032     
_________________________________________________________________
dense_2 (Dense)              (None, 64)                2112      
_________________________________________________________________
dense_3 (Dense)              (None, 128)               8320      
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 129       
Total params: 26,593
Trainable params: 26,593
Non-trainable params: 0
_________________________________________________________________
None


In [8]:
#Train the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=5, batch_size=32) 

  """Entry point for launching an IPython kernel.
W0628 17:43:01.269480 139665857730432 deprecation_wrapper.py:119] From /usr/local/lib/python2.7/dist-packages/keras/backend/tensorflow_backend.py:986: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.



Train on 25000 samples, validate on 25000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f0636a70050>

In [9]:
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 50.00%


In [0]:
from keras import backend as K

inputs = model.input                                           # input placeholder
outputs = [layer.output for layer in model.layers]          # all layer outputs
functors = [K.function([inputs], [out]) for out in outputs]    # evaluation functions

# Testing
layer_outs = [func([X_test]) for func in functors]

In [12]:
print layer_outs[1]

[array([[  90.12236 ,    0.      ,  163.14014 , ...,    0.      ,
         205.69214 ,  246.79068 ],
       [   8.743372,    0.      ,    0.      , ...,    0.      ,
         153.4457  , 1700.6721  ],
       [   0.      ,    0.      ,    0.      , ...,    0.      ,
         537.7311  , 1241.1593  ],
       ...,
       [ 159.01767 ,    0.      ,  187.79956 , ...,  276.77155 ,
           0.      ,  481.19452 ],
       [ 638.7765  ,  199.18195 ,    0.      , ...,    0.      ,
         390.04446 ,  286.6161  ],
       [   0.      ,    0.      ,  959.2946  , ...,  352.15262 ,
         507.51913 , 1863.5068  ]], dtype=float32)]


## Build Keras Embedding Layer Model
We can think of the Embedding layer as a dicionary that maps a index assigned to a word to a word vector. This layer is very flexible and can be used in a few ways:

* The embedding layer can be used at the start of a larger deep learning model. 
* Also we could load pre-train word embeddings into the embedding layer when we create our model.
* Use the embedding layer to train our own word2vec models.

The keras embedding layer doesn't require us to onehot encode our words, instead we have to give each word a unqiue intger number as an id. For the imdb dataset we've loaded this has already been done, but if this wasn't the case we could use sklearn [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).

In [13]:
# Build the model 
embedding_vector_length = 32 
model_e = Sequential() 
model_e.add(Embedding(vocab_size, embedding_vector_length, input_length=max_review_length)) 
model_e.add(LSTM(100)) 
model_e.add(Dense(1, activation='sigmoid')) 
model_e.compile(loss='binary_crossentropy',optimizer='adam', metrics=['accuracy']) 
print(model_e.summary()) 

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 500, 32)           160000    
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 101       
Total params: 213,301
Trainable params: 213,301
Non-trainable params: 0
_________________________________________________________________
None


In [15]:
#Train the model
model_e.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=3, batch_size=100) 

  """Entry point for launching an IPython kernel.


Train on 25000 samples, validate on 25000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f0636a0e890>

In [16]:
scores = model_e.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 87.21%


## Retrive the output of each layer in keras for a given single test sample from the trained model you built

In [0]:
inputs_e = model_e.input                                           # input placeholder
outputs_e = [layer.output for layer in model_e.layers]          # all layer outputs
functors = [K.function([inputs], [out]) for out in outputs]    # evaluation functions

# Testing
layer_outs_e = [func([X_test]) for func in functors]

In [25]:
layer_outs_e[0]

[array([[   67.28716 ,    65.28795 ,   -36.878967, ...,    54.085407,
           -84.5538  ,   241.03487 ],
        [  675.71985 ,  1058.3662  ,  1155.8304  , ...,  -499.59027 ,
         -2191.4246  ,  2229.5242  ],
        [ 5634.6455  ,  1282.8372  , -1817.8756  , ...,    48.314594,
          -490.11166 ,  4111.062   ],
        ...,
        [  154.55225 ,   983.98486 ,  -255.97148 , ...,   920.25604 ,
          -176.85677 ,  1241.9014  ],
        [ -529.1401  ,  -479.43063 ,   459.84042 , ...,  -654.9687  ,
          -403.28806 ,  1272.7351  ],
        [ -351.0364  ,   532.338   ,   713.6292  , ...,   263.71234 ,
           552.81256 ,  1460.2668  ]], dtype=float32)]

In [20]:
layer_outs_e[1]

[array([[  90.12236 ,    0.      ,  163.14014 , ...,    0.      ,
          205.69214 ,  246.79068 ],
        [   8.743372,    0.      ,    0.      , ...,    0.      ,
          153.4457  , 1700.6721  ],
        [   0.      ,    0.      ,    0.      , ...,    0.      ,
          537.7311  , 1241.1593  ],
        ...,
        [ 159.01767 ,    0.      ,  187.79956 , ...,  276.77155 ,
            0.      ,  481.19452 ],
        [ 638.7765  ,  199.18195 ,    0.      , ...,    0.      ,
          390.04446 ,  286.6161  ],
        [   0.      ,    0.      ,  959.2946  , ...,  352.15262 ,
          507.51913 , 1863.5068  ]], dtype=float32)]

In [21]:
layer_outs_e[2]

[array([[0.0000000e+00, 0.0000000e+00, 2.2119250e+02, ..., 0.0000000e+00,
         2.8106481e-01, 0.0000000e+00],
        [3.7860809e+02, 0.0000000e+00, 1.3802365e+03, ..., 0.0000000e+00,
         0.0000000e+00, 2.4817639e+02],
        [0.0000000e+00, 0.0000000e+00, 4.2350654e+03, ..., 0.0000000e+00,
         2.2031748e+01, 0.0000000e+00],
        ...,
        [1.3339992e+02, 0.0000000e+00, 3.2539136e+02, ..., 0.0000000e+00,
         0.0000000e+00, 0.0000000e+00],
        [0.0000000e+00, 0.0000000e+00, 2.8319962e+02, ..., 0.0000000e+00,
         0.0000000e+00, 0.0000000e+00],
        [0.0000000e+00, 0.0000000e+00, 1.4945128e+03, ..., 0.0000000e+00,
         9.7951721e+01, 0.0000000e+00]], dtype=float32)]