# Sentiment Classification


### Generate word embeddings and retrieve the outputs of each layer with Keras, based on a classification task

Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation.

It is a distributed representation for text that is perhaps one of the key breakthroughs for the impressive performance of deep learning methods on challenging natural language processing problems.

We will use the IMDB dataset to learn word embeddings as we train our model. This dataset contains 25,000 movie reviews for training and 25,000 for testing, labeled with sentiment (positive or negative).



### Dataset

`from keras.datasets import imdb`

Dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, the words are indexed by their frequency in the dataset, meaning the word that has index 1 is the most frequent word. Use the first 20 words from each review to speed up training, using a max vocab size of 10,000.

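As a convention, "0" does not stand for a specific word. With the `imdb.load_data` defaults used here, 0 is the padding value, 1 marks the start of a review, and 2 encodes any unknown (out-of-vocabulary) word.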


### Aim

1. Import test and train data  
2. Import the labels (train and test)
3. Get the word index and then create a key-value pair for word and word_id. (12.5 points)
4. Build a Sequential Model using Keras for Sentiment Classification task. (10 points)
5. Report the Accuracy of the model. (5 points)  
6. Retrieve the output of each layer in Keras for a given single test sample from the trained model you built. (2.5 points)


#### Usage:

Import test and train data and the labels

In [3]:
%tensorflow_version 1.x
from keras.datasets import imdb

vocab_size = 10000  # number of most frequent words to keep

# Only the `vocab_size` most frequent words are kept; rarer words are replaced by the out-of-vocabulary id.
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)

Using TensorFlow backend.


Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz


Get the word index and then create a key-value pair for word and word_id. (12.5 points)

In [4]:
wrd_indx = imdb.get_word_index()

Downloading data from https://s3.amazonaws.com/text-datasets/imdb_word_index.json


In [0]:
word_dict = dict(wrd_indx)
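
To read an encoded review back as text, the word-to-id mapping has to be inverted and the offset applied by `imdb.load_data` has to be undone. The following is a minimal sketch, assuming the `load_data` defaults (`index_from=3`, with 0, 1 and 2 reserved for padding, start-of-review and out-of-vocabulary). `toy_word_index`, `id_to_word` and `decode_review` are hypothetical names; the toy dictionary stands in for `word_dict` so that the example runs without downloading anything.

```python
# Toy stand-in for imdb.get_word_index(): word -> frequency rank (1 = most frequent).
toy_word_index = {"the": 1, "and": 2, "a": 3, "movie": 4, "great": 5}

INDEX_FROM = 3  # assumed load_data default: ranks are shifted by 3 in the encoded reviews
SPECIAL_TOKENS = {0: "<pad>", 1: "<start>", 2: "<unk>"}

# Invert the mapping and apply the offset so the ids match the encoded reviews.
id_to_word = {rank + INDEX_FROM: word for word, rank in toy_word_index.items()}
id_to_word.update(SPECIAL_TOKENS)


def decode_review(encoded_review):
    """Turn a sequence of word ids back into a space-separated string."""
    return " ".join(id_to_word.get(word_id, "<unk>") for word_id in encoded_review)


print(decode_review([0, 0, 1, 6, 8, 7, 2]))
# <pad> <pad> <start> a great movie <unk>
```

With the real data, the same dictionary comprehension could be run over `word_dict` instead of the toy mapping.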

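The assignment suggests using only the first 20 words of each review to speed up training, with a max vocab size of 10,000. To improve accuracy, this notebook instead keeps up to 300 words per review (`maxlen = 300` in the padding step).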

In [0]:
import numpy as np

In [0]:
#X_train_new = []
#for n in range(0,x_train.shape[0]):
#  X_train_new.append(np.array(x_train[n][:20]))

# Keep the full reviews here to improve accuracy; the padding step later cuts them to `maxlen` words
X_train_new = x_train

In [0]:
#X_test_new = []
#for n in range(0,x_test.shape[0]):
#  X_test_new.append(np.array(x_test[n][:20]))

# Keep the full reviews here to improve accuracy; the padding step later cuts them to `maxlen` words
X_test_new = x_test

In [0]:
from keras.preprocessing.sequence import pad_sequences
vocab_size = 10000  # vocabulary size
maxlen = 300  # number of words kept from each review

In [0]:
# Make all sequences the same length: pad short reviews and truncate long ones to `maxlen`
x_train_pad = pad_sequences(X_train_new, maxlen=maxlen)
x_test_pad = pad_sequences(X_test_new, maxlen=maxlen)
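
By default, `pad_sequences` pads at the front and also truncates from the front (`padding='pre'`, `truncating='pre'`), which is why the printed sample starts with a run of zeros. The sketch below imitates that default behaviour in plain NumPy to make it visible on a tiny input. `pad_pre` is a hypothetical helper written for illustration, not the Keras implementation.

```python
import numpy as np


def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` items of each sequence."""
    padded = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = list(seq)[-maxlen:]              # 'pre' truncation drops the earliest words
        if trimmed:
            padded[row, -len(trimmed):] = trimmed  # 'pre' padding leaves the zeros at the front
    return padded


print(pad_pre([[1, 14, 22], [1, 5, 6, 7, 8, 9]], maxlen=4))
# [[ 0  1 14 22]
#  [ 6  7  8  9]]
```

Note that the second sequence loses its first two ids, including the start marker, so with `maxlen = 300` any review longer than 300 words keeps only its final 300.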

In [19]:
print('x_train_pad :',x_train_pad[1])
print('x_test_pad :',x_test_pad[1])

x_train_pad : [   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    1
  194 1153  194 8255   78  228    5    6 1463 4369 5012  134   26    4
  715    8  118 1634   14  394   20   13  119  954  189  102    5  207
  110 3103   21   14   69  188    8   30   23    7    4  249  126   93
    4  114    9 2300 1523    5  647    4  116    9   35 8163    4  229
    9  340 1322    4  118    9    4  130 4901   19    4 1002    5   89
   29  952   46   37    4  455    9   45   43   38 1543 1905  3

In [20]:
print('y_train: ',y_train)
print('y_test: ',y_test)

y_train:  [1 0 0 ... 0 1 0]
y_test:  [0 1 1 ... 0 0 0]


## Build Keras Embedding Layer Model
We can think of the Embedding layer as a dictionary that maps an index assigned to a word to a word vector. This layer is very flexible and can be used in a few ways:

* The embedding layer can be used at the start of a larger deep learning model.
* We could also load pre-trained word embeddings into the embedding layer when we create our model.
* Use the embedding layer to train our own word2vec models.

The Keras embedding layer doesn't require us to one-hot encode our words; instead we have to give each word a unique integer number as an id. For the IMDB dataset we've loaded this has already been done, but if this wasn't the case we could use sklearn [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).
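
The dictionary view can be made concrete with a few lines of NumPy. The sketch below is not Keras code: it uses a small random matrix as a stand-in for the layer's weights (all names and sizes are illustrative) and shows that looking up rows by word id gives the same result as multiplying one-hot vectors by the matrix, which is why the one-hot step can be skipped.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 6, 3  # tiny illustrative sizes, not the notebook's
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

word_ids = np.array([4, 1, 4])  # an encoded "review" of three words

# What an embedding lookup does: pick one row of the matrix per word id.
looked_up = embedding_matrix[word_ids]

# The equivalent one-hot formulation that the lookup lets us avoid.
one_hot = np.eye(vocab_size)[word_ids]
via_matmul = one_hot @ embedding_matrix

print(looked_up.shape)                     # (3, 3)
print(np.allclose(looked_up, via_matmul))  # True
```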

In [0]:
import tensorflow as tf
from keras.callbacks import ModelCheckpoint, EarlyStopping

In [0]:
#Initialize model
tf.keras.backend.clear_session()
model = tf.keras.Sequential()

In [23]:
model.add(tf.keras.layers.Embedding(vocab_size + 1, # Vocabulary size (number of rows in the embedding matrix)
                                    50, # Embedding size (length of each word vector)
                                    input_length=maxlen) # Number of words in each padded review
          )

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
If using Keras pass *_constraint arguments to layers.


In [24]:
model.output

<tf.Tensor 'embedding/embedding_lookup/Identity_1:0' shape=(?, 300, 50) dtype=float32>

In [0]:
model.add(tf.keras.layers.LSTM(256, # Number of units: size of the hidden state and the cell state
                               dropout=0.2, # Dropout applied to the inputs of the LSTM layer
                               recurrent_dropout=0.4)) # Dropout applied to the recurrent state between time steps

In [0]:
model.add(tf.keras.layers.Dense(1,activation='sigmoid'))

In [27]:
# Compile the model (the optimizer is Adadelta, so the variable is named accordingly)
optimizer = tf.keras.optimizers.Adadelta(learning_rate=1.0, rho=0.95)
model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


In [28]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 300, 50)           500050    
_________________________________________________________________
lstm (LSTM)                  (None, 256)               314368    
_________________________________________________________________
dense (Dense)                (None, 1)                 257       
Total params: 814,675
Trainable params: 814,675
Non-trainable params: 0
_________________________________________________________________
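
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for these three layer types; the helper names are illustrative and not part of Keras.

```python
def embedding_params(input_dim, output_dim):
    # One vector of length `output_dim` per word id.
    return input_dim * output_dim


def lstm_params(input_dim, units):
    # Four weight blocks (three gates plus the cell candidate), each with
    # input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units


counts = {
    "embedding": embedding_params(10000 + 1, 50),
    "lstm": lstm_params(50, 256),
    "dense": dense_params(256, 1),
}
print(counts)
# {'embedding': 500050, 'lstm': 314368, 'dense': 257}
print(sum(counts.values()))
# 814675
```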


In [0]:
checkpoint = ModelCheckpoint("seqNLP_best.h5", monitor='val_acc', verbose=1, save_best_only=True, mode='auto')
early = EarlyStopping(monitor='val_acc', min_delta=0, patience=5, verbose=1, mode='auto')

In [30]:
model.fit(x_train_pad,y_train,
          epochs=10,
          batch_size=100,          
          validation_data=(x_test_pad, y_test),callbacks = [checkpoint, early])

Train on 25000 samples, validate on 25000 samples
Epoch 1/10
Epoch 00001: val_acc improved from -inf to 0.60800, saving model to seqNLP_best.h5
Epoch 2/10
Epoch 00002: val_acc improved from 0.60800 to 0.79228, saving model to seqNLP_best.h5
Epoch 3/10
Epoch 00003: val_acc improved from 0.79228 to 0.83128, saving model to seqNLP_best.h5
Epoch 4/10
Epoch 00004: val_acc improved from 0.83128 to 0.83192, saving model to seqNLP_best.h5
Epoch 5/10
Epoch 00005: val_acc did not improve from 0.83192
Epoch 6/10
Epoch 00006: val_acc improved from 0.83192 to 0.83980, saving model to seqNLP_best.h5
Epoch 7/10
Epoch 00007: val_acc improved from 0.83980 to 0.84080, saving model to seqNLP_best.h5
Epoch 8/10
Epoch 00008: val_acc did not improve from 0.84080
Epoch 9/10
Epoch 00009: val_acc improved from 0.84080 to 0.84752, saving model to seqNLP_best.h5
Epoch 10/10
Epoch 00010: val_acc did not improve from 0.84752


<tensorflow.python.keras.callbacks.History at 0x7ff0fa012da0>
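
The log shows that training ran for all 10 epochs even though `EarlyStopping` was active. The sketch below replays the patience rule on the "improved / did not improve" pattern from the log to show why. It is a simplified illustration of the rule (stop once `patience` consecutive epochs bring no improvement), not the Keras callback itself, and `stopping_epoch` is a hypothetical helper.

```python
# True where the training log reports "val_acc improved", one entry per epoch.
improved = [True, True, True, True, False, True, True, False, True, False]
patience = 5  # same value as in the EarlyStopping callback above


def stopping_epoch(improved_flags, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    epochs_without_improvement = 0
    for epoch, did_improve in enumerate(improved_flags, start=1):
        if did_improve:
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


print(stopping_epoch(improved, patience))    # None
print(stopping_epoch(improved, patience=1))  # 5
```

The longest run without improvement in the log is a single epoch, so a patience of 5 never triggers. The best validation accuracy, 0.84752 at epoch 9, is the one saved to `seqNLP_best.h5` by `ModelCheckpoint`.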

## Retrieve the output of each layer in Keras for a given single test sample from the trained model you built

In [144]:
for layer in model.layers:
  print(layer)

<tensorflow.python.keras.layers.embeddings.Embedding object at 0x7f6951bb78d0>
<tensorflow.python.keras.layers.recurrent.LSTM object at 0x7f6951bb7cc0>
<tensorflow.python.keras.layers.core.Dense object at 0x7f6951bb7c88>


<tf.Tensor 'dense/Sigmoid:0' shape=(?, 1) dtype=float32>
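
Listing the layers does not yet give their values for a particular sample. One common way to get them is to build a second model that shares the same layers and returns every layer's output. The sketch below does this on a small untrained stand-in so that it runs quickly. `toy_model`, `layer_output_model` and the tiny sizes are assumptions made for illustration, and the values are meaningless because no training has happened.

```python
import numpy as np
import tensorflow as tf

maxlen = 10  # tiny sizes so the sketch runs quickly; the notebook uses 300

# Untrained stand-in with the same layer types as the model above.
toy_model = tf.keras.Sequential([
    tf.keras.Input(shape=(maxlen,)),
    tf.keras.layers.Embedding(100, 8),
    tf.keras.layers.LSTM(16),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# A second model that shares the layers and returns every layer's output.
layer_output_model = tf.keras.Model(
    inputs=toy_model.inputs,
    outputs=[layer.output for layer in toy_model.layers],
)

# A single sample, kept two-dimensional: shape (1, maxlen).
sample = np.random.randint(1, 100, size=(1, maxlen))
layer_outputs = layer_output_model.predict(sample, verbose=0)

for layer, output in zip(toy_model.layers, layer_outputs):
    print(layer.name, output.shape)
```

For the trained notebook model, the same pattern would presumably apply with `model` in place of `toy_model` and `x_test_pad[:1]` as the sample; the expected shapes would then be `(1, 300, 50)`, `(1, 256)` and `(1, 1)`, matching the model summary.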