
# Sentiment Classification


### Generate word embeddings and retrieve the outputs of each layer with Keras for a classification task

Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation.

They are a distributed representation for text, and are perhaps one of the key breakthroughs behind the impressive performance of deep learning methods on challenging natural language processing problems.

We will use the IMDB dataset to learn word embeddings as we train our model. This dataset contains 25,000 movie reviews from IMDB, labeled with sentiment (positive or negative).



### Dataset

`from keras.datasets import imdb`

Dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, the words are indexed by their frequency in the dataset, meaning the word that has index 1 is the most frequent word. The assignment suggests using the first 20 words from each review to speed up training, with a max vocab size of 10,000; the code below keeps up to 300 words per review instead.
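To make the frequency-based indexing concrete, here is a minimal sketch on a toy corpus. The three "reviews", the `encode` helper, and the placeholder id for out-of-vocabulary words are all hypothetical choices made for this illustration; they are not taken from the IMDB data or from Keras.

```python
from collections import Counter

# Toy corpus standing in for the review texts (hypothetical data, not from IMDB).
reviews = [
    "the movie was great".split(),
    "the plot was thin".split(),
    "the acting was great".split(),
]

# Count occurrences and rank words so that rank 1 is the most frequent word.
counts = Counter(word for review in reviews for word in review)
ranked = sorted(counts, key=lambda w: (-counts[w], w))  # ties broken alphabetically
word_to_rank = {word: rank for rank, word in enumerate(ranked, start=1)}

def encode(review, vocab_size, oov_id=0):
    """Encode words as frequency ranks; ranks above vocab_size become oov_id.

    oov_id is an arbitrary placeholder chosen for this sketch.
    """
    ids = []
    for word in review:
        rank = word_to_rank.get(word)
        ids.append(rank if rank is not None and rank <= vocab_size else oov_id)
    return ids

print(word_to_rank)
print(encode(reviews[1], vocab_size=3))  # only the 3 most frequent words are kept
```

Capping the vocabulary this way is what the `num_words` argument does conceptually: rare words collapse into a single placeholder id, which keeps the embedding table small.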

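As a convention, "0" does not stand for a specific word. In the sequences returned by `imdb.load_data` with its default settings, out-of-vocabulary words are encoded as 2, and the zeros seen in the padded arrays below are the fill value added by `pad_sequences`.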


### Aim

1. Import test and train data  
2. Import the labels (train and test)
3. Get the word index and then create key-value pairs for word and word_id. (12.5 points)
4. Build a Sequential model using Keras for the sentiment classification task. (10 points)
5. Report the accuracy of the model. (5 points)  
6. Retrieve the output of each layer in Keras for a given single test sample from the trained model you built. (2.5 points)
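Step 3 (word and word_id key-value pairs) can be sketched as follows. The small `word_index` dict is a toy stand-in for what `imdb.get_word_index()` returns (a mapping from word to frequency rank); its entries are illustrative, not the real IMDB ranks. The offset of 3 and the meaning of ids 0, 1 and 2 are assumptions based on the default `start_char`, `oov_char` and `index_from` arguments of `imdb.load_data`.

```python
# Toy stand-in for the dict returned by imdb.get_word_index() (word -> rank).
# The entries are illustrative only, not the real IMDB ranks.
word_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4}

INDEX_FROM = 3  # assumed offset: default index_from of imdb.load_data
SPECIAL = {0: "<PAD>", 1: "<START>", 2: "<UNK>"}  # assumed reserved ids

# Key-value pairs in both directions: word -> id and id -> word.
word_to_id = {word: rank + INDEX_FROM for word, rank in word_index.items()}
id_to_word = {idx: word for word, idx in word_to_id.items()}
id_to_word.update(SPECIAL)

def decode(sequence):
    """Turn a list of word ids back into a readable string."""
    return " ".join(id_to_word.get(i, "<UNK>") for i in sequence)

encoded = [0, 0, 1, 4, 5, 6, 2, 7]
print(decode(encoded))
```

With the real dataset, the same two dict comprehensions would be applied to the result of `imdb.get_word_index()`, and `decode` could then be used to read back a padded review such as `x_test[0]`.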


#### Usage:

In [2]:
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import LSTM
from keras.datasets import imdb

Using TensorFlow backend.


In [3]:
vocab_size = 10000 #vocab size
print('Loading data as a list of ints...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size) # vocab_size is no.of words to consider from the dataset, ordering based on frequency.

Loading data as a list of ints...
Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz


In [4]:
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

25000 train sequences
25000 test sequences


In [0]:
# cut texts after this number of words (among the top vocab_size most common words)
maxlen = 300  # number of words used from each review

In [6]:
# pad (with leading zeros) or truncate so that all sequences have the same length
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test = pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

x_train shape: (25000, 300)
x_test shape: (25000, 300)
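The following NumPy sketch mimics what `pad_sequences` does with its default `padding='pre'` and `truncating='pre'` settings. `pad_pre` is a hypothetical helper written for this illustration, not the Keras implementation.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep the last maxlen items of long ones."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # an empty sequence stays all padding
        trimmed = seq[-maxlen:]              # 'pre' truncation drops items from the front
        out[row, -len(trimmed):] = trimmed   # 'pre' padding leaves the fill value at the front
    return out

toy = [[5, 8, 2], [7], [1, 2, 3, 4, 5, 6]]
padded = pad_pre(toy, maxlen=4)
print(padded)
```

Because padding goes at the front, a review shorter than `maxlen` shows up as a long run of zeros followed by the actual word ids, which matches the sample printed in the next cell.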


In [12]:
x_test[0,:] # sample view of what a padded test sample looks like.

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,   

In [7]:
print(len(y_train), 'train labels')
print(len(y_test), 'test labels')

25000 train labels
25000 test labels


In [8]:
print('Build model...')
model = Sequential()
model.add(Embedding(vocab_size, 128))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

# will try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])


Build model...


In [0]:
batch_size = 32

In [41]:
print('Train...')
model.fit(x_train, y_train, batch_size=batch_size, epochs=10, validation_data=(x_test, y_test))
# evaluate returns [loss, accuracy]; 'score' here is the binary cross-entropy loss
score, acc = model.evaluate(x_test, y_test, batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

Train...
Train on 25000 samples, validate on 25000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test score: 0.6207315696120262
Test accuracy: 0.8467599749565125
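The "Test score" above is the loss the model was compiled with (`binary_crossentropy`), and "Test accuracy" is the fraction of correct predictions. A minimal NumPy sketch of how both can be computed from predicted probabilities is shown below; the toy labels and probabilities are made up, and the clipping constant `eps` and the 0.5 threshold are assumptions of this sketch.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels under y_prob."""
    y_prob = np.clip(y_prob, eps, 1 - eps)  # avoid log(0)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    predictions = (y_prob > threshold).astype(int)
    return float(np.mean(predictions == y_true))

# Toy labels and predicted probabilities (hypothetical values).
y_true = np.array([1, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.4, 0.6])

loss = binary_crossentropy(y_true, y_prob)
acc = binary_accuracy(y_true, y_prob)
print("loss:", round(loss, 4), "accuracy:", acc)
```

The two numbers measure different things: accuracy only looks at which side of the threshold a prediction falls on, while the loss also penalizes confident wrong predictions heavily.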


## Build Keras Embedding Layer Model
We can think of the Embedding layer as a dictionary that maps the index assigned to a word to a word vector. This layer is very flexible and can be used in a few ways:

* The embedding layer can be used at the start of a larger deep learning model.
* We could also load pre-trained word embeddings into the embedding layer when we create our model.
* The embedding layer can be used to train our own word2vec-style models.

The Keras embedding layer doesn't require us to one-hot encode our words; instead, we have to give each word a unique integer number as an id. For the IMDB dataset we've loaded, this has already been done, but if this wasn't the case we could use sklearn [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).
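The "dictionary" view of the embedding layer can be illustrated with plain NumPy: the layer holds one trainable row per word id, and a lookup is just row indexing. The sizes and random weights below are toy values chosen for this sketch, not the trained weights of the model above.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10, 4            # toy sizes; the notebook uses 10000 and 128
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

word_ids = np.array([[0, 0, 3, 7]])      # one padded sequence of integer word ids

# The lookup is row indexing: each id selects one row of the matrix.
looked_up = embedding_matrix[word_ids]   # shape (batch, time, embed_dim)

# Equivalent but wasteful alternative: one-hot encode, then multiply.
one_hot = np.eye(vocab_size)[word_ids]   # shape (batch, time, vocab_size)
via_matmul = one_hot @ embedding_matrix

print(looked_up.shape)
print(np.allclose(looked_up, via_matmul))
```

Identical ids always map to identical vectors, which is why the repeated padding id at the start of a sequence produces identical rows in the embedding output.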

In [42]:
model.summary() # summary of the model built above

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, None, 128)         1280000   
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 129       
Total params: 1,411,713
Trainable params: 1,411,713
Non-trainable params: 0
_________________________________________________________________
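The parameter counts in the summary can be checked by hand. The short calculation below assumes the standard LSTM parameterization (four gates, each with an input kernel, a recurrent kernel and a bias) and uses the layer sizes from the model defined above.

```python
vocab_size, embed_dim, lstm_units = 10000, 128, 128

# Embedding: one embed_dim-sized vector per word id.
embedding_params = vocab_size * embed_dim

# LSTM: 4 gates, each with an input kernel, a recurrent kernel and a bias.
lstm_params = 4 * (embed_dim * lstm_units + lstm_units * lstm_units + lstm_units)

# Dense: one weight per LSTM unit plus a single bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Most of the model's capacity sits in the embedding table, which is why the vocabulary cap has such a large effect on model size.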


## Retrieve the output of each layer in Keras for a given single test sample from the trained model you built

In [43]:
test_sample = x_test[0,:]  # the first padded test review (a vector of 300 word ids)
print("Test input", test_sample)

Test input [   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0 

In [44]:
from keras import backend as back

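inpt = model.input
# symbolic output tensor of every layer, in order
output = [layer.output for layer in model.layers]
# one backend function per layer, mapping (input, learning phase) to that layer's output
evalFunction = [back.function([inpt, back.learning_phase()], [out]) for out in output]

# learning phase 1 = training mode (dropout active); pass 0 to get inference-mode outputs
layerOpt = [func([x_test[[0]], 1]) for func in evalFunction]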
print(layerOpt)

[[array([[[-0.01958594, -0.07941326, -0.0056704 , ..., -0.0429342 ,
          0.03579984,  0.04388482],
        [-0.01958594, -0.07941326, -0.0056704 , ..., -0.0429342 ,
          0.03579984,  0.04388482],
        [-0.01958594, -0.07941326, -0.0056704 , ..., -0.0429342 ,
          0.03579984,  0.04388482],
        ...,
        [ 0.0993589 , -0.00456585, -0.09932908, ...,  0.03250732,
          0.03377443,  0.02973673],
        [ 0.03030975, -0.00253651,  0.04216104, ...,  0.07718196,
         -0.00934459, -0.05002823],
        [-0.04258195, -0.06253711, -0.17224348, ...,  0.05432844,
         -0.07417119,  0.02632374]]], dtype=float32)], [array([[ 0.1725192 ,  0.02031459, -0.09959959, -0.15237378,  0.10832769,
        -0.01714554, -0.09406634, -0.06904251, -0.01397615, -0.00389728,
         0.14921723, -0.50293636,  0.05859414,  0.07923333, -0.13156933,
        -0.48329574, -0.05083937, -0.2708171 , -0.11430614, -0.1472006 ,
        -0.00121022,  0.07283178,  0.05321192, -0.02605742, -

In [45]:
print('Embedding layer Output')
layerOpt[0][0]

Embedding layer Output


array([[[-0.01958594, -0.07941326, -0.0056704 , ..., -0.0429342 ,
          0.03579984,  0.04388482],
        [-0.01958594, -0.07941326, -0.0056704 , ..., -0.0429342 ,
          0.03579984,  0.04388482],
        [-0.01958594, -0.07941326, -0.0056704 , ..., -0.0429342 ,
          0.03579984,  0.04388482],
        ...,
        [ 0.0993589 , -0.00456585, -0.09932908, ...,  0.03250732,
          0.03377443,  0.02973673],
        [ 0.03030975, -0.00253651,  0.04216104, ...,  0.07718196,
         -0.00934459, -0.05002823],
        [-0.04258195, -0.06253711, -0.17224348, ...,  0.05432844,
         -0.07417119,  0.02632374]]], dtype=float32)

In [46]:
print('LSTM Output')
layerOpt[1]

LSTM Output


[array([[ 0.1725192 ,  0.02031459, -0.09959959, -0.15237378,  0.10832769,
         -0.01714554, -0.09406634, -0.06904251, -0.01397615, -0.00389728,
          0.14921723, -0.50293636,  0.05859414,  0.07923333, -0.13156933,
         -0.48329574, -0.05083937, -0.2708171 , -0.11430614, -0.1472006 ,
         -0.00121022,  0.07283178,  0.05321192, -0.02605742, -0.06026261,
         -0.04836049,  0.04862405,  0.02803506, -0.16122991, -0.1116797 ,
         -0.16515829, -0.00615289, -0.5110276 ,  0.21411467, -0.25602928,
         -0.02232128, -0.00964604, -0.10008066,  0.08173978,  0.11683379,
         -0.31812513, -0.04522363,  0.07530516, -0.07888705,  0.07669975,
         -0.06780262,  0.00093909,  0.05254717,  0.12138733,  0.08401588,
         -0.08436623, -0.06150853,  0.9258759 ,  0.10367718, -0.03830636,
          0.11892256,  0.0737358 , -0.0304853 , -0.12195323, -0.01665873,
          0.16241542,  0.10808758,  0.03023693,  0.05671649,  0.02240321,
          0.11937688,  0.02460613, -0.

In [47]:
print('Dense Layer output')
layerOpt[2]

Dense Layer output


[array([[0.03054881]], dtype=float32)]
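To see how the three outputs relate to each other, here is a rough NumPy sketch of the same embedding, LSTM and dense pipeline. It uses toy sizes and randomly initialised weights as stand-ins for the trained ones, so the numbers it prints are not comparable to the values above; only the shapes and the flow of data are the point. The gate ordering inside the fused kernels is an assumption of this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
vocab_size, embed_dim, units = 50, 8, 6  # toy sizes, not the notebook's

# Random weights standing in for trained ones (hypothetical values).
E = rng.normal(scale=0.1, size=(vocab_size, embed_dim))  # embedding table
W = rng.normal(scale=0.1, size=(embed_dim, 4 * units))   # LSTM input kernel
U = rng.normal(scale=0.1, size=(units, 4 * units))       # LSTM recurrent kernel
b = np.zeros(4 * units)                                  # LSTM bias
W_out = rng.normal(scale=0.1, size=(units, 1))           # dense kernel
b_out = np.zeros(1)                                      # dense bias

def forward(word_ids):
    """Return the output of each stage: embedding, LSTM (last state), dense."""
    embedded = E[word_ids]                               # (batch, time, embed_dim)
    h = np.zeros((word_ids.shape[0], units))             # hidden state
    c = np.zeros_like(h)                                 # cell state
    for t in range(word_ids.shape[1]):
        z = embedded[:, t, :] @ W + h @ U + b
        i, f, g, o = np.split(z, 4, axis=1)              # input, forget, candidate, output
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    prob = sigmoid(h @ W_out + b_out)                    # (batch, 1)
    return embedded, h, prob

sample = np.array([[0, 0, 12, 7, 33]])                   # one padded toy sequence
outputs = forward(sample)
for name, out in zip(["embedding", "lstm", "dense"], outputs):
    print(name, out.shape)

predicted_label = int(outputs[2][0, 0] > 0.5)            # 1 = positive, 0 = negative
print("predicted label:", predicted_label)
```

The shapes mirror the model summary: one vector per time step after the embedding, one vector per sample after the LSTM, and a single probability after the dense layer. A dense output such as the 0.03 shown above falls below a 0.5 threshold and would therefore be read as class 0 (negative).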