# Sentiment Classification


### Generate word embeddings and retrieve the output of each layer with Keras for a classification task

Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation.

It is a distributed representation for text that is perhaps one of the key breakthroughs for the impressive performance of deep learning methods on challenging natural language processing problems.

We will use the IMDB dataset to learn word embeddings as we train our model. This dataset contains 25,000 movie reviews from IMDB, labeled with sentiment (positive or negative).



### Dataset

`from keras.datasets import imdb`

Dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, the words are indexed by their frequency in the dataset, meaning the word that has index 1 is the most frequent word. Use the first 20 words from each review to speed up training, using a max vocab size of 10,000.

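As a convention, "0" does not stand for a specific word. With the `imdb.load_data` defaults, out-of-vocabulary words are encoded as 2 (`oov_char`), and 0 is the value `pad_sequences` later uses for padding.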


### Aim

1. Import test and train data  
2. Import the labels (train and test)
3. Get the word index and then create key-value pairs for word and word_id. (12.5 points)
4. Build a Sequential Model using Keras for Sentiment Classification task. (10 points)
5. Report the Accuracy of the model. (5 points)  
6. Retrieve the output of each layer in Keras for a given single test sample from the trained model you built. (2.5 points)


#### Usage:

### Handle the error below using the following snippet
"Object arrays cannot be loaded when allow_pickle=False"

In [0]:
import numpy as np
np_load_old = np.load
np.load = lambda *a, **k: np_load_old(*a, allow_pickle=True, **k)

### 1. Import test and train data

In [0]:
from keras.datasets import imdb

vocab_size = 10000 #vocab size
#load dataset as a list of ints
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size) # vocab_size is no.of words to consider from the dataset, ordering based on frequency.

In [0]:
from keras.preprocessing.sequence import pad_sequences
vocab_size = 10000  # vocab size (same value as in the previous cell)
maxlen = 300  # number of words used from each review

In [0]:

# Make all sequences the same length. With the pad_sequences defaults, shorter
# reviews are left-padded with 0 and longer ones keep their last `maxlen` indexes.
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test = pad_sequences(x_test, maxlen=maxlen)
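
To make the padding step concrete, here is a minimal pure-Python sketch of what `pad_sequences` does to a single sequence under its default settings (`padding='pre'`, `truncating='pre'`, `value=0`). The helper name `pad_pre` is hypothetical and only mimics the behaviour; it is not the Keras implementation and assumes `maxlen` is positive.

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic pad_sequences defaults for one sequence (hypothetical helper)."""
    trimmed = list(seq)[-maxlen:]  # 'pre' truncation keeps the LAST maxlen items
    # 'pre' padding puts the filler values in front of the sequence
    return [value] * (maxlen - len(trimmed)) + trimmed


short = pad_pre([1, 14, 22], maxlen=5)
long_ = pad_pre([1, 14, 22, 16, 43, 530, 973], maxlen=5)
print(short)
print(long_)
```

This is why the padded reviews printed in this notebook start with a run of zeros and end with the actual word indexes.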

## Description of the data
(From https://keras.io/datasets/ --> "IMDB Movie reviews sentiment classification")

Dataset of 25,000 movies reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".

As a convention, "0" does not stand for a specific word, but instead is used to encode any unknown word.
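
The "quick filtering" idea can be sketched in a few lines of pure Python. The function below is a hypothetical stand-in for what the `num_words` and `skip_top` arguments of `imdb.load_data` do, assuming the replacement value is the `oov_char=2` default of that function; the sample sequence is made up for illustration.

```python
def filter_vocab(seq, num_words, skip_top=0, oov_char=2):
    """Keep ids in [skip_top, num_words); replace everything else with oov_char."""
    return [w if skip_top <= w < num_words else oov_char for w in seq]


# Hypothetical encoded review: keep the top 10,000 ids but drop the 20 most common
sample = [1, 14, 22, 9999, 10000, 12000, 5]
filtered = filter_vocab(sample, num_words=10000, skip_top=20)
print(filtered)
```

In the padded review printed in this notebook, the value 2 shows up in exactly this role, marking words that fell outside the 10,000-word vocabulary.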

In [5]:
#visualizing the data
print ('review: ', x_train[0], 'label: ', y_train[0])

review:  [   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    1   14
   22   16   43  530  973 1622 1385   65  458 4468   66 3941    4  173
   36  256    5   25  100   43  838  112   50  670    2    9   35  480
  284    5  150    4  172  112  167    2  336  385   39    4  172 4536
 1111   17  546   38   13  447    4  192   50   16    6  147 2025   19
   14   22    4 1920 4613  469    4   22   71   87   12   16   43  530
   38   76   15   13 1247    4   22   17  515   17   12   16  626   18
    2    5   62  386   12    8  316    8  106    5    4 2223 5244   16
  480   66 3785   33    4  130   12   16   38  619    5   25  124   

In [6]:
print(x_train.shape, x_test.shape)

(25000, 300) (25000, 300)


### 3. Get the word index and then create key-value pairs for word and word_id

In [13]:
word_id = imdb.get_word_index()

Downloading data from https://s3.amazonaws.com/text-datasets/imdb_word_index.json


In [14]:
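# Invert the word index so that ids map back to words.
# Caveat: imdb.load_data shifts every word id by 3 by default (index_from=3),
# so looking ids up directly, as done here, does not recover the original words.
id_to_word = {i: word for word, i in word_id.items()}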
print('review text')
print([id_to_word.get(i, ' ') for i in x_train[4]])
print('label')
print(y_train[4])

review text
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', 'the', 'sure', 'themes', 'br', 'only', 'acting', 'i', 'i', 'was', 'favourite', 'as', 'on', 'she', 'they', 'hat', 'but', 'already', 'most', 'was', 'scares', 'minor', 'if', 'flash', 'was', 'well', 'also', 'good', '8', 'older
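
The decoded words above do not read like a review because `imdb.load_data` reserves ids 0, 1 and 2 (padding, start of sequence, unknown word) and shifts every real word id by `index_from=3` by default, while `imdb.get_word_index()` returns the unshifted ids. A minimal sketch of an offset-aware decoder follows; `decode_review`, the placeholder tokens, and the toy word index are assumptions made for illustration, not part of Keras.

```python
def decode_review(encoded, word_index, index_from=3):
    """Map encoded ids back to words, accounting for the load_data offset."""
    id_to_word = {idx + index_from: word for word, idx in word_index.items()}
    # Reserved ids under the load_data defaults (token names are illustrative)
    id_to_word.update({0: "<pad>", 1: "<start>", 2: "<unk>"})
    return " ".join(id_to_word.get(i, "<unk>") for i in encoded)


# Toy word index standing in for imdb.get_word_index()
toy_index = {"the": 1, "and": 2, "a": 3, "movie": 17}
decoded = decode_review([0, 0, 1, 4, 20, 2], toy_index)
print(decoded)
```

With the real data the call would look like `decode_review(x_train[4], word_id)`.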

In [15]:
#Test data
print('review text')
print([id_to_word.get(i, ' ') for i in x_test[4]])
print('label')
print(y_test[4])

review text
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', 'the', 'just', 'good', 'bec

### 4. Build a Sequential Model using Keras for Sentiment Classification task.

## Build Keras Embedding Layer Model
We can think of the Embedding layer as a dictionary that maps an index assigned to a word to a word vector. This layer is very flexible and can be used in a few ways:

* The embedding layer can be used at the start of a larger deep learning model. 
* We could also load pre-trained word embeddings into the embedding layer when we create our model.
* Use the embedding layer to train our own word2vec models.

The Keras embedding layer doesn't require us to one-hot encode our words; instead we have to give each word a unique integer id. For the IMDB dataset we've loaded this has already been done, but if this wasn't the case we could use sklearn [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).
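
The "dictionary" view of an embedding layer can be illustrated without Keras. The sketch below treats the layer as a weight matrix with one row per word id and shows that looking up a row gives the same vector as multiplying a one-hot vector by the matrix, which is why integer ids are enough. The sizes, the random weights, and the function names are arbitrary choices for this example.

```python
import random

random.seed(0)
vocab_size, dim = 6, 4  # tiny illustrative sizes, not the notebook's 10000 x 32
weights = [[random.uniform(-0.05, 0.05) for _ in range(dim)]
           for _ in range(vocab_size)]


def embed(ids, weights):
    """Embedding lookup: each integer id selects one row of the weight matrix."""
    return [weights[i] for i in ids]


def one_hot_matmul(i, weights):
    """Same result via an explicit one-hot vector times the weight matrix."""
    one_hot = [1.0 if j == i else 0.0 for j in range(len(weights))]
    return [sum(one_hot[j] * weights[j][k] for j in range(len(weights)))
            for k in range(len(weights[0]))]


vectors = embed([0, 3, 3], weights)  # a "review" of three word ids
print(len(vectors), len(vectors[0]))  # sequence length x embedding dimension
```

During training the rows of this matrix are updated like any other weights, which is how the word vectors are learned.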

In [0]:
from keras import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout

embedding_vector_length = 32



### Build an RNN using LSTM

We want the model to learn the association between the sentiment and the sequence of words, hence an RNN. We will specifically use an LSTM for its ability to 'remember' long-term dependencies.

In [18]:
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_vector_length, input_length=maxlen))
model.add(Dropout(0.2))
model.add(LSTM(32))
model.add(Dense(units=256, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(units=1, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 300, 32)           320000    
_________________________________________________________________
dropout_1 (Dropout)          (None, 300, 32)           0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 32)                8320      
_________________________________________________________________
dense_1 (Dense)              (None, 256)               8448      
_________________________________________________________________
dropout_2 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 257       
Total params: 337,025
Trainable params: 337,025
Non-trainable params: 0
_________________________________________________________________
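
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for each layer type (an LSTM has four gates, each with an input kernel, a recurrent kernel and a bias); the helper function names are made up for this example.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim  # one dim-sized vector per word id


def lstm_params(input_dim, units):
    # 4 gates x (input kernel + recurrent kernel + bias)
    return 4 * (input_dim * units + units * units + units)


def dense_params(input_dim, units):
    return input_dim * units + units  # weights + biases


counts = {
    "embedding_1": embedding_params(10000, 32),
    "lstm_1": lstm_params(32, 32),
    "dense_1": dense_params(32, 256),
    "dense_2": dense_params(256, 1),
}
total = sum(counts.values())  # Dropout layers contribute no parameters
print(counts, total)
```

Almost all of the parameters sit in the embedding matrix, so the vocabulary size and embedding dimension dominate the model size.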


In [19]:
batch_size = 64
num_epochs = 5

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=num_epochs,
          validation_split=0.2)

Train on 20000 samples, validate on 5000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fc92530fba8>

In [20]:
log_loss, acc  = model.evaluate(x_test, y_test, verbose=0)
print('Log loss:', log_loss)
print('Test set accuracy:', acc)

Log loss: 0.46912706779241564
Test set accuracy: 0.85868
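
For reference, the two numbers reported above can be computed from predicted probabilities as in the following sketch. It assumes a 0.5 decision threshold for accuracy and clips probabilities before taking logarithms; the labels and probabilities are made-up toy values, not outputs of the trained model.

```python
import math


def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    preds = [1 if p > threshold else 0 for p in y_prob]
    return sum(int(p == t) for p, t in zip(preds, y_true)) / len(y_true)


def binary_log_loss(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


# Toy values for illustration only
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.1]
acc_toy = binary_accuracy(y_true, y_prob)
loss_toy = binary_log_loss(y_true, y_prob)
print(acc_toy, loss_toy)
```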


## Retrieve the output of each layer in Keras for a given single test sample from the trained model you built

In [0]:
from keras import backend as K

TEST_INPUT_INDEX = 3

def print_layer_outputs(model, test_input_index=TEST_INPUT_INDEX):
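    input_ = model.input                                        # input placeholder
    outputs = [layer.output for layer in model.layers]          # all layer outputs
    func = K.function([input_, K.learning_phase()], outputs)    # evaluation function

    # Run a single test sample through the network.
    # Learning phase 1. means training mode, so the Dropout layers are active here;
    # pass 0. instead to get inference-mode outputs.
    test = [x_test[test_input_index]]
    layer_outs = func([test, 1.])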

    for i, layer_out in enumerate(layer_outs):
        print("OUTPUT SHAPE for Layer {} ({}) : {}".format(i+1, outputs[i].name, layer_out.shape))
        print(layer_out)
        print()

    print("EXPECTED OUTPUT LABEL : {}".format(y_test[test_input_index]))
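
Because the function calls `func` with learning phase `1.`, the Dropout layers run in training mode: each activation is either zeroed or scaled up by `1 / (1 - rate)`. The sketch below illustrates this "inverted dropout" behaviour in pure Python, using three embedding values taken from this notebook's layer output as inputs. The helper name and the random seed are arbitrary, and this is an illustration of the idea rather than the Keras implementation.

```python
import random


def inverted_dropout(values, rate, rng):
    """Zero each value with probability `rate`; scale survivors by 1/(1-rate)."""
    keep_prob = 1.0 - rate
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


rng = random.Random(0)
activations = [0.04913902, 0.03158562, 0.01610495]
dropped = inverted_dropout(activations, rate=0.2, rng=rng)
scaled = activations[0] / (1.0 - 0.2)  # what a kept value looks like at rate 0.2
print(dropped, scaled)
```

A kept value of 0.04913902 becomes roughly 0.06142377, which matches the pattern of zeros and enlarged values in the `dropout_1` output of this notebook.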

In [26]:
print_layer_outputs(model)

OUTPUT SHAPE for Layer 1 (embedding_1/embedding_lookup/Identity:0) : (1, 300, 32)
[[[ 0.04913902  0.03158562  0.01610495 ...  0.03528945  0.00819031
    0.0286692 ]
  [ 0.04913902  0.03158562  0.01610495 ...  0.03528945  0.00819031
    0.0286692 ]
  [ 0.04913902  0.03158562  0.01610495 ...  0.03528945  0.00819031
    0.0286692 ]
  ...
  [-0.00644389 -0.03355484  0.04389771 ...  0.05145062  0.0228942
    0.01330343]
  [-0.04806447 -0.01907362 -0.00738712 ... -0.02627657 -0.00144542
    0.02984534]
  [-0.00417691  0.01354503 -0.01274874 ...  0.01298238  0.02059083
    0.05622735]]]

OUTPUT SHAPE for Layer 2 (dropout_1/cond/Merge:0) : (1, 300, 32)
[[[ 0.          0.03948202  0.02013119 ...  0.          0.01023789
    0.0358365 ]
  [ 0.06142377  0.03948202  0.02013119 ...  0.04411181  0.01023789
    0.0358365 ]
  [ 0.06142377  0.03948202  0.02013119 ...  0.04411181  0.01023789
    0.0358365 ]
  ...
  [-0.00805487 -0.          0.05487214 ...  0.06431327  0.02861775
    0.        ]
  [-0.   