# Sentiment Classification


### Generate word embeddings and retrieve the output of each layer with Keras on a classification task

Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation.

They are a distributed representation of text, and are perhaps one of the key breakthroughs behind the impressive performance of deep learning methods on challenging natural language processing problems.

We will use the IMDB dataset to learn word embeddings as we train our model. The training set contains 25,000 movie reviews from IMDB, labeled with sentiment (positive or negative), and the test set contains another 25,000.



### Dataset

`from keras.datasets import imdb`

Dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, the words are indexed by their frequency in the dataset, meaning the word that has index 1 is the most frequent word. To speed up training, we keep only 20 words from each review (with the default `pad_sequences` settings used below, these are the last 20 words of the review) and use a max vocab size of 10,000.

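As a convention, "0" does not stand for a specific word. In the sequences returned by `imdb.load_data` with its default arguments, 0 is reserved for padding, 1 marks the start of a review, and 2 encodes any unknown (out-of-vocabulary) word.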


### Aim

1. Import the test and train data.
2. Import the labels (train and test).
3. Get the word index and then create key-value pairs for word and word_id. (12.5 points)
4. Build a Sequential model using Keras for the sentiment classification task. (10 points)
5. Report the accuracy of the model. (5 points)
6. Retrieve the output of each layer in Keras for a single test sample from the trained model. (2.5 points)


#### Usage:

#  Importing the Dataset and the labels

In [49]:
from keras.datasets import imdb

vocab_size = 10000 #vocab size

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size) 
# vocab_size is the number of words to keep from the dataset, ranked by frequency.

In [50]:
from keras.preprocessing.sequence import pad_sequences
vocab_size = 10000 #vocab size
maxlen = 20  # number of words kept from each review

# Padding the words to ensure all word sequences are of the same length

In [51]:
#load dataset as a list of ints
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
#make all sequences of the same length
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test =  pad_sequences(x_test, maxlen=maxlen)

In [52]:
x_test.shape

(25000, 20)

In [53]:
x_train.shape

(25000, 20)
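
The padding step above can be illustrated with a small NumPy sketch. The helper `pad_pre` is hypothetical (it is not part of Keras); it mimics what `pad_sequences` does with its default `padding='pre'` and `truncating='pre'` settings: short sequences get zeros on the left, and long sequences keep only their last `maxlen` entries. The toy sequences are made up for illustration.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Hypothetical stand-in for pad_sequences with 'pre' padding and 'pre' truncating."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # nothing to copy; the row stays all padding
        kept = seq[-maxlen:]           # truncating='pre': drop entries from the front
        out[row, -len(kept):] = kept   # padding='pre': zeros end up on the left
    return out

# Three toy "reviews" of different lengths (made-up ids)
toy = [[1, 14, 22, 16], [1, 7], [1, 5, 9, 3, 8, 6]]
padded = pad_pre(toy, maxlen=4)
print(padded)
```

Note that the longest toy sequence loses its leading entries, which is why the padded IMDB reviews hold the end of each review rather than the beginning.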

In [65]:
print('Loaded dataset has {} in Training data and {} in Test data'.format(len(x_train), len(x_test)))



Loaded dataset has 25000 in Training data and 25000 in Test data


In [68]:
print('X_train 6 Samples')
print(x_train[:6])


X_train 6 Samples
[[  65   16   38 1334   88   12   16  283    5   16 4472  113  103   32
    15   16 5345   19  178   32]
 [  23    4 1690   15   16    4 1355    5   28    6   52  154  462   33
    89   78  285   16  145   95]
 [1352   13  191   79  638   89    2   14    9    8  106  607  624   35
   534    6  227    7  129  113]
 [   7 2804    5    4  559  154  888    7  726   50   26   49 7008   15
   566   30  579   21   64 2574]
 [  15  595   13  784   25 3171   18  165  170  143   19   14    5 7224
     6  226  251    7   61  113]
 [  10   10 1361  173    4  749    2   16 3804    8    4  226   65   12
    43  127   24    2   10   10]]


In [69]:
print('y_train 6 Samples')
print(y_train[:6])

y_train 6 Samples
[1 0 0 1 0 0]


In [55]:
word_id=imdb.get_word_index()

In [56]:
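# Invert the word index: id -> word
word_w = {i: word for word, i in word_id.items()}
print('---review with words---')
# Note: imdb.load_data shifts word ids by 3 by default (index_from=3; ids 0, 1, 2 are reserved),
# so this direct lookup is off by three ranks; word_w.get(i - 3, '?') would give the intended words.
print([word_w.get(i, ' ') for i in x_train[5]])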
print('---label---')
print(y_train[5])

---review with words---
['i', 'i', 'slowly', 'lot', 'of', 'above', 'and', 'with', 'connect', 'in', 'of', 'script', 'their', 'that', 'out', 'end', 'his', 'and', 'i', 'i']
---label---
0
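
As a hedged illustration of the id-to-word mapping, the sketch below decodes a toy sequence while accounting for the default `index_from=3` offset of `imdb.load_data`. The word index `toy_word_index`, the helper `decode_review`, and the placeholder labels for the reserved ids are all hypothetical; only the offset and the reserved ids 0, 1, 2 come from the `load_data` defaults.

```python
# Toy word index shaped like imdb.get_word_index(): word -> rank (1 = most frequent)
toy_word_index = {"the": 1, "and": 2, "movie": 3, "great": 4}

INDEX_FROM = 3  # default index_from of imdb.load_data
RESERVED = {0: "<pad>", 1: "<start>", 2: "<unk>"}  # placeholder labels, chosen for illustration

# Shift every rank by the offset, then add the reserved ids
id_to_word = {rank + INDEX_FROM: word for word, rank in toy_word_index.items()}
id_to_word.update(RESERVED)

def decode_review(encoded):
    """Map a sequence of integer ids back to words (hypothetical helper)."""
    return [id_to_word.get(int(i), "<unk>") for i in encoded]

encoded = [0, 0, 1, 4, 7, 6, 2]
decoded = decode_review(encoded)
print(decoded)
```

With the real dataset, the same idea would use `imdb.get_word_index()` in place of the toy dictionary.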


## Build Keras Embedding Layer Model
We can think of the Embedding layer as a dictionary that maps an index assigned to a word to a word vector. This layer is very flexible and can be used in a few ways:

* The embedding layer can be used at the start of a larger deep learning model.
* We could also load pre-trained word embeddings into the embedding layer when we create our model.
* We could use the embedding layer to train our own word2vec-style embeddings.

The Keras embedding layer doesn't require us to one-hot encode our words; instead, we have to give each word a unique integer id. For the IMDB dataset we've loaded, this has already been done, but if it weren't, we could use scikit-learn's [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).
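
To make the "dictionary" picture concrete, here is a minimal NumPy sketch (not Keras code) of what an embedding lookup does: the layer stores one row per word id, and looking up a sequence of ids gives the same result as multiplying one-hot vectors by that matrix. The sizes and the random matrix are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vocab_size, toy_embedding_size = 6, 3

# One row of weights per word id (random here; learned during training in a real model)
embedding_matrix = rng.normal(size=(toy_vocab_size, toy_embedding_size))

word_ids = np.array([4, 1, 1, 5])

# Embedding lookup: plain row indexing
looked_up = embedding_matrix[word_ids]          # shape (4, 3)

# Equivalent one-hot formulation: (4, 6) @ (6, 3)
one_hot = np.eye(toy_vocab_size)[word_ids]
via_matmul = one_hot @ embedding_matrix

print(looked_up.shape)
print(np.allclose(looked_up, via_matmul))
```

The lookup avoids building the one-hot matrix at all, which is why integer ids are enough as input.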

# Building a Sequential Model with Embedding layer

In [57]:
from keras import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout

embedding_size=32

model=Sequential()
model.add(Embedding(vocab_size, embedding_size, input_length=maxlen))
model.add(LSTM(100))
# Fully connected layer
model.add(Dense(64, activation='relu'))
# Fully connected layer
model.add(Dense(64, activation='relu'))

model.add(Dense(1, activation='sigmoid'))

print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 20, 32)            320000    
_________________________________________________________________
lstm_3 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_7 (Dense)              (None, 64)                6464      
_________________________________________________________________
dense_8 (Dense)              (None, 64)                4160      
_________________________________________________________________
dense_9 (Dense)              (None, 1)                 65        
Total params: 383,889
Trainable params: 383,889
Non-trainable params: 0
_________________________________________________________________
None
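
The parameter counts in the summary above can be checked by hand. The short sketch below redoes the arithmetic in plain Python, assuming the standard LSTM layout of four weight blocks (three gates plus the candidate cell state), each with input weights, recurrent weights, and a bias.

```python
vocab_size, embedding_size = 10000, 32
lstm_units, dense_units = 100, 64

# Embedding: one vector of length embedding_size per word id
embedding_params = vocab_size * embedding_size

# LSTM: 4 blocks, each with (input + recurrent) weights and a bias per unit
lstm_params = 4 * ((embedding_size + lstm_units) * lstm_units + lstm_units)

# Dense layers: weights plus one bias per output unit
dense1_params = lstm_units * dense_units + dense_units
dense2_params = dense_units * dense_units + dense_units
output_params = dense_units * 1 + 1

total_params = (embedding_params + lstm_params + dense1_params
                + dense2_params + output_params)

print(embedding_params, lstm_params, dense1_params, dense2_params, output_params)
print(total_params)
```

Most of the parameters sit in the embedding matrix, so the vocabulary size dominates the model size.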


In [58]:
model.compile(loss='binary_crossentropy', optimizer='adam',  metrics=['accuracy'])

In [59]:
#load dataset as a list of ints
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
#make all sequences of the same length
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test =  pad_sequences(x_test, maxlen=maxlen)


In [60]:
batch_size = 500
num_epochs = 20

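# Hold out the first 500 training samples for validation (batch_size is reused here as the split size)
Valididation_Features, Validation_Target = x_train[:batch_size], y_train[:batch_size]
X_train_final, y_train_final = x_train[batch_size:], y_train[batch_size:]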

model.fit(X_train_final, y_train_final, validation_data=(Valididation_Features, Validation_Target), batch_size=batch_size, epochs=num_epochs)

Train on 24500 samples, validate on 500 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x1d5f41b2be0>

# Testing the Accuracy of the Test and the Validation Data

In [63]:
scores = model.evaluate(Valididation_Features, Validation_Target, verbose=0)
print('Validation accuracy:', scores[1])

Validation accuracy: 0.6800000009536743


# We are able to achieve an accuracy of 68% on the Validation Data

In [62]:
scores = model.evaluate(x_test, y_test, verbose=0)
print('Test accuracy:', scores[1])

Test accuracy: 0.70208


# We are able to achieve an accuracy of 70% on the Test Data
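
For a single sigmoid output, the accuracy metric amounts to thresholding the predicted probability and counting matches with the labels. The NumPy sketch below illustrates this with made-up labels and probabilities; the helper `binary_accuracy` and the 0.5 threshold are assumptions for illustration, not a copy of the Keras implementation.

```python
import numpy as np

def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    return float(np.mean(y_pred == np.asarray(y_true)))

# Made-up labels and sigmoid outputs for five reviews
toy_labels = [1, 0, 0, 1, 0]
toy_probs = [0.9, 0.4, 0.6, 0.3, 0.1]

acc = binary_accuracy(toy_labels, toy_probs)
print(acc)
```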

## Retrieve the output of each layer in Keras for a single test sample from the trained model

In [46]:
from keras import backend as Ks
import numpy as np
inp = model.input                                           # input placeholder
outputs = [layer.output for layer in model.layers]          # all layer outputs
evalution_func = Ks.function([inp, Ks.learning_phase()], outputs )   # evaluation function

# Printing the output of every layer

In [47]:
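# Testing
# Note: this feeds random floats in [0, 1) rather than a real review;
# to inspect an actual test sample, pass x_test[:1] instead.
test = np.random.random(maxlen)[np.newaxis,...]
# The second list entry is the learning phase flag (1. = training mode, 0. = inference mode).
layer_outs = evalution_func([test, 1.])
print(layer_outs)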

[array([[[ 0.08339524,  0.06326357,  0.03041614,  0.0147679 ,
          0.00340238,  0.07200875, -0.0015922 , -0.06254016,
          0.00143166,  0.10034624,  0.06708504, -0.05592422,
          0.03770265, -0.02154157,  0.03784594,  0.06294331,
         -0.01458008, -0.01182852, -0.04454797,  0.01405974,
          0.0192377 , -0.01103234,  0.05640777,  0.0394287 ,
         -0.01846388,  0.05262274,  0.02952977, -0.00300695,
         -0.01784075,  0.08953592,  0.0512347 , -0.05203152],
        [ 0.08339524,  0.06326357,  0.03041614,  0.0147679 ,
          0.00340238,  0.07200875, -0.0015922 , -0.06254016,
          0.00143166,  0.10034624,  0.06708504, -0.05592422,
          0.03770265, -0.02154157,  0.03784594,  0.06294331,
         -0.01458008, -0.01182852, -0.04454797,  0.01405974,
          0.0192377 , -0.01103234,  0.05640777,  0.0394287 ,
         -0.01846388,  0.05262274,  0.02952977, -0.00300695,
         -0.01784075,  0.08953592,  0.0512347 , -0.05203152],
        [ 0.08339524,
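
The repeated rows in the output above are consistent with the random input being floats in [0, 1): an embedding lookup needs integer ids, and truncating such floats to integers gives 0 everywhere, so every time step maps to the same embedding row. The NumPy sketch below illustrates the effect with a made-up embedding matrix (not the trained model) and contrasts it with an integer-encoded sample.

```python
import numpy as np

rng = np.random.default_rng(0)
maxlen = 20
toy_embeddings = rng.normal(size=(10, 4))  # made-up (vocab, dim) matrix

# Random floats in [0, 1), shaped (1, maxlen) like the test input above
random_input = rng.random(maxlen)[np.newaxis, ...]
as_ids = random_input.astype("int32")      # truncation turns every value into 0
random_rows = toy_embeddings[as_ids[0]]    # the same row repeated maxlen times

# An integer-encoded sample (made-up ids) yields distinct rows
real_sample = np.array([[7, 2, 5, 4, 9, 1, 8, 7, 2, 6, 3, 3, 0, 5, 4, 9, 1, 8, 6, 2]])
real_rows = toy_embeddings[real_sample[0]]

print(np.unique(as_ids))
print(len(np.unique(random_rows, axis=0)), len(np.unique(real_rows, axis=0)))
```

In the notebook itself, passing something like `x_test[:1]` instead of the random array would show the layer outputs for an actual test review.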

# Source for reference

https://towardsdatascience.com/a-beginners-guide-on-sentiment-analysis-with-rnn-9e100627c02e 