# Sentiment Classification


### Generate word embeddings and retrieve the output of each layer with Keras on a classification task

Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation.

They are a distributed representation of text, and are perhaps one of the key breakthroughs behind the strong performance of deep learning methods on challenging natural language processing problems.
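To make "similar meaning, similar representation" concrete, here is a minimal sketch that compares word vectors with cosine similarity. The three vectors are invented for illustration and are not learned from IMDB; `toy_embeddings` and `cosine_similarity` are hypothetical names.

```python
import math

# Toy 3-dimensional vectors, made up for illustration (not trained weights).
toy_embeddings = {
    "good":  [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "awful": [-0.7, 0.9, -0.2],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Words used in similar contexts should end up with a higher similarity.
sim_close = cosine_similarity(toy_embeddings["good"], toy_embeddings["great"])
sim_far = cosine_similarity(toy_embeddings["good"], toy_embeddings["awful"])
print(sim_close)
print(sim_far)
```

With trained embeddings the same comparison can be run on rows of the embedding layer's weight matrix.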

We will use the IMDB dataset to learn word embeddings as we train our model. The dataset contains 25,000 training reviews and 25,000 test reviews from IMDB, each labeled with a sentiment (positive or negative).



### Dataset

`from keras.datasets import imdb`

A dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, the words are indexed by their frequency in the dataset, meaning the word that has index 1 is the most frequent word. The assignment suggests using the first 20 words from each review to speed up training, with a maximum vocabulary size of 10,000; the code below keeps 300 word indexes per review.

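By convention, "0" does not stand for a specific word. With the default `load_data` settings it is reserved for padding, "1" marks the start of a review, and "2" encodes any unknown (out-of-vocabulary) word, which is why the frequency ranks from the word index are shifted by 3 in the encoded reviews.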
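The indexing scheme can be sketched on a toy corpus. This is an illustration under stated assumptions, not the dataset's actual preprocessing: `build_rank_index` and `encode` are hypothetical helpers, and the constants mirror the Keras `load_data` defaults (`start_char=1`, `oov_char=2`, `index_from=3`).

```python
from collections import Counter

START_ID, OOV_ID, INDEX_FROM = 1, 2, 3  # assumed Keras load_data defaults

def build_rank_index(texts):
    """Map each word to its frequency rank (1 = most frequent)."""
    counts = Counter(word for text in texts for word in text.split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ordered, start=1)}

def encode(text, rank_index, num_words):
    """Encode text as ids; ids at or above num_words become OOV_ID."""
    ids = [START_ID]
    for word in text.split():
        # Unseen words get an out-of-range rank so they also map to OOV_ID.
        word_id = rank_index.get(word, num_words) + INDEX_FROM
        ids.append(word_id if word_id < num_words else OOV_ID)
    return ids

toy_reviews = ["the film was great", "the film was bad", "the plot was thin"]
rank_index = build_rank_index(toy_reviews)
encoded = encode("the film was thin", rank_index, num_words=6)
print(rank_index)
print(encoded)
```

Lowering `num_words` pushes more of the rarer words onto the out-of-vocabulary id, which is the trade-off the `vocab_size` setting controls.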


### Aim

1. Import test and train data  
2. Import the labels (train and test).
3. Get the word index, then create key-value pairs mapping each word to its word_id. (12.5 points)
4. Build a Sequential model using Keras for the sentiment classification task. (10 points)
5. Report the accuracy of the model. (5 points)
6. Retrieve the output of each layer in Keras for a single test sample from the trained model. (2.5 points)


### Import Libraries and Packages

In [0]:

import tensorflow as tf
import csv
import pandas as pd
from PIL import Image

from keras import Model
from keras.applications.mobilenet import MobileNet, preprocess_input
from keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau, Callback
from keras.layers import Conv2D, Reshape
from keras.utils import Sequence
from keras.backend import epsilon

In [17]:
!pip install numpy==1.16.1



In [0]:
import numpy as np

### Load train and test data

In [0]:
from keras.datasets import imdb
vocab_size = 10000  # vocabulary size
# num_words keeps only the vocab_size most frequent words; rarer words are replaced by the out-of-vocabulary id.
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)

In [0]:
from keras.preprocessing.sequence import pad_sequences
vocab_size = 10000  # vocabulary size
maxlen = 300  # number of word ids kept per review

### Import the labels (train and test)

In [0]:
# Load the dataset; each review is a list of integer word ids
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
# Pad or truncate every review to exactly maxlen ids
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test = pad_sequences(x_test, maxlen=maxlen)
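What `pad_sequences` does with its default arguments (`padding='pre'`, `truncating='pre'`) can be mimicked in a few lines of plain Python. The helper `pad_pre` is a hypothetical stand-in written for illustration, not part of Keras, and it assumes `maxlen` is positive.

```python
def pad_pre(sequence, maxlen, value=0):
    """Mimic pad_sequences defaults: pad on the left, truncate from the left."""
    trimmed = sequence[-maxlen:]  # keep the last `maxlen` ids
    return [value] * (maxlen - len(trimmed)) + trimmed

short_review = [1, 14, 22]
long_review = [1, 14, 22, 16, 43, 530, 973]

padded_short = pad_pre(short_review, maxlen=5)  # zeros are added in front
padded_long = pad_pre(long_review, maxlen=5)    # the first two ids are dropped
print(padded_short)
print(padded_long)
```

Note that the defaults keep the end of a long review rather than its beginning; passing `truncating='post'` to `pad_sequences` keeps the first `maxlen` ids instead.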

In [22]:
x_train.shape

(25000, 300)

In [23]:
x_test.shape

(25000, 300)

### Get the word index, then create key-value pairs for word and word_id. (12.5 points)

In [0]:
import keras

In [0]:
(training_data, training_labels), (test_data, test_labels)= imdb.load_data(num_words=vocab_size, index_from=3)

In [0]:
word_to_id = keras.datasets.imdb.get_word_index()

In [0]:
import itertools

In [29]:
dict(itertools.islice(word_to_id.items(), 4))

{u'fawn': 34701, u'nunnery': 52007, u'sonja': 16816, u'tsukino': 52006}

In [0]:
word_to_id = {k:(v+3) for k,v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2

In [0]:
# Invert the mapping; `idx` avoids shadowing the built-in `id`
id_to_word = {idx: word for word, idx in word_to_id.items()}
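The two dictionaries make it possible to go in both directions. The sketch below shows the round trip on a hypothetical miniature index (`toy_word_to_id`, with made-up ids); the real mapping comes from `imdb.get_word_index()`. `encode_review` and `decode_review` are illustrative helpers, not Keras functions.

```python
def encode_review(text, word_to_id):
    """Turn raw text into ids, using <START> and <UNK> as the notebook does."""
    unk = word_to_id["<UNK>"]
    words = text.lower().split()
    return [word_to_id["<START>"]] + [word_to_id.get(w, unk) for w in words]

def decode_review(ids, id_to_word):
    """Turn ids back into text; ids missing from the mapping become <UNK>."""
    return " ".join(id_to_word.get(i, "<UNK>") for i in ids)

# Hypothetical miniature index with invented ids, for illustration only.
toy_word_to_id = {"<PAD>": 0, "<START>": 1, "<UNK>": 2, "the": 4, "film": 5, "worst": 6}
toy_id_to_word = {i: w for w, i in toy_word_to_id.items()}

ids = encode_review("The worst film ever", toy_word_to_id)
decoded = decode_review(ids, toy_id_to_word)
print(ids)
print(decoded)
```

To feed new text to a model trained on this dataset, the text would presumably need the same tokenization as the original reviews, and any id at or above `vocab_size` would have to be mapped to `<UNK>` as well.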

In [32]:
print(' '.join(id_to_word[idx] for idx in training_data[2]))

<START> this has to be one of the worst films of the 1990s when my friends i were watching this film being the target audience it was aimed at we just sat watched the first half an hour with our jaws touching the floor at how bad it really was the rest of the time everyone else in the theatre just started talking to each other leaving or generally crying into their popcorn that they actually paid money they had <UNK> working to watch this feeble excuse for a film it must have looked like a great idea on paper but on film it looks like no one in the film has a clue what is going on crap acting crap costumes i can't get across how <UNK> this is to watch save yourself an hour a bit of your life


### Build a Sequential Model using Keras for Sentiment Classification task. (10 points)

In [0]:
from __future__ import print_function

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import Conv1D, GlobalMaxPooling1D
from keras.datasets import imdb
import itertools
from keras.layers import Flatten

In [34]:
model = Sequential()

model.add(Embedding(vocab_size, 8, input_length=maxlen))

model.add(Conv1D(64, 3, padding='same'))

model.add(Conv1D(32, 3, padding='same'))

model.add(Conv1D(16, 3, padding='same'))

model.add(Flatten())

model.add(Dropout(0.2))

model.add(Dense(180,activation='sigmoid'))

model.add(Dropout(0.2))

model.add(Dense(1,activation='sigmoid'))


In [35]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 300, 8)            80000     
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 300, 64)           1600      
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 300, 32)           6176      
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 300, 16)           1552      
_________________________________________________________________
flatten_1 (Flatten)          (None, 4800)              0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 4800)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 180)               864180    
__________
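The parameter counts in the summary can be reproduced by hand from the layer definitions. The helpers below are hypothetical and only restate the standard formulas (weights plus one bias per output unit or filter); they cover the layers visible in the summary output.

```python
def embedding_params(vocab_size, dim):
    """One `dim`-sized vector per word id."""
    return vocab_size * dim

def conv1d_params(kernel_size, in_channels, filters):
    """Each filter has a (kernel_size x in_channels) kernel plus one bias."""
    return (kernel_size * in_channels + 1) * filters

def dense_params(inputs, units):
    """Each unit has one weight per input plus one bias."""
    return (inputs + 1) * units

maxlen = 300
# padding='same' keeps the sequence length at 300, so Flatten yields 300 * 16 values.
flattened = maxlen * 16

counts = {
    "embedding_1": embedding_params(10000, 8),
    "conv1d_1": conv1d_params(3, 8, 64),
    "conv1d_2": conv1d_params(3, 64, 32),
    "conv1d_3": conv1d_params(3, 32, 16),
    "dense_1": dense_params(flattened, 180),
}
for name in sorted(counts):
    print(name, counts[name])
```

Almost all trainable weights sit in `dense_1`, because flattening the full 300-step sequence gives it 4,800 inputs.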

In [36]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])


In [37]:
history = model.fit(x_train, y_train, epochs=20, batch_size=76)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


### Report the Accuracy of the model. (5 points)

In [0]:
Accuracy_scores = model.evaluate(x_test, y_test, verbose=0)

In [39]:
print("Accuracy_scores: %.2f%%" % (Accuracy_scores[1]*100))

Accuracy_scores: 84.58%
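The reported figure is the share of test reviews whose predicted label matches the true label. A minimal sketch of that calculation, assuming the usual 0.5 cutoff on the sigmoid output; `binary_accuracy` here is a hand-written illustration and the probabilities are made up, not model output.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability equals the label."""
    predictions = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(1 for pred, true in zip(predictions, y_true) if pred == true)
    return correct / float(len(y_true))

toy_labels = [1, 0, 1, 1]
toy_probabilities = [0.9, 0.2, 0.4, 0.7]  # invented sigmoid outputs

toy_accuracy = binary_accuracy(toy_labels, toy_probabilities)
print("Accuracy: %.2f%%" % (toy_accuracy * 100))
```

Here the third sample is predicted negative (0.4 is below the cutoff) although its label is positive, so three of four predictions are correct.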


### Retrieve the output of each layer in Keras for a given single test sample from the trained model you built. (2.5 points)

In [0]:
from keras import backend as K

In [41]:
x_train.shape

(25000, 300)

In [0]:
get_1st_layer_output = K.function([model.layers[0].input], 
                                  [model.layers[0].output])
get_2nd_layer_output = K.function([model.layers[0].input],
                                  [model.layers[1].output])
get_3rd_layer_output = K.function([model.layers[0].input],
                                  [model.layers[2].output])
get_4th_layer_output = K.function([model.layers[0].input],
                                  [model.layers[3].output])
get_5th_layer_output = K.function([model.layers[0].input],
                                  [model.layers[4].output])
get_6th_layer_output = K.function([model.layers[0].input],
                                  [model.layers[5].output])
get_7th_layer_output = K.function([model.layers[0].input],
                                  [model.layers[6].output])
get_8th_layer_output = K.function([model.layers[0].input],
                                  [model.layers[7].output])
get_9th_layer_output = K.function([model.layers[0].input],
                                  [model.layers[8].output])

In [44]:
lyr_op1 = get_1st_layer_output([x_test[0:1,]])[0]
lyr_op1

array([[[-0.0014094 ,  0.02141405,  0.00337149, ...,  0.01615587,
          0.00933726, -0.01901296],
        [-0.0014094 ,  0.02141405,  0.00337149, ...,  0.01615587,
          0.00933726, -0.01901296],
        [-0.0014094 ,  0.02141405,  0.00337149, ...,  0.01615587,
          0.00933726, -0.01901296],
        ...,
        [ 0.01653636,  0.04973924,  0.00579983, ...,  0.03293035,
         -0.01044111, -0.01169479],
        [-0.01693693, -0.0206557 ,  0.00418183, ..., -0.01506506,
         -0.00536103,  0.01319498],
        [ 0.07213545, -0.10712177,  0.06814995, ...,  0.15623501,
          0.05137044,  0.12305746]]], dtype=float32)

In [45]:
lyr_op9 = get_9th_layer_output([x_test[0:1,]])[0]
lyr_op9

array([[0.00034643]], dtype=float32)
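Conceptually, each of the functions above feeds the sample through the network up to a given layer and returns what that layer produces. The framework-free sketch below shows the same idea with a single pass that records every intermediate result; `run_and_record` and the three toy "layers" are hypothetical stand-ins, not Keras objects.

```python
def run_and_record(layers, sample):
    """Feed `sample` through `layers` in order and keep every intermediate output."""
    outputs = []
    activation = sample
    for layer in layers:
        activation = layer(activation)
        outputs.append(activation)
    return outputs

# Toy stand-ins for real layers, chosen only to make the data flow visible.
toy_layers = [
    lambda ids: [[i, 2 * i] for i in ids],          # "embedding": id -> 2-d vector
    lambda rows: [v for row in rows for v in row],  # "flatten"
    lambda flat: [sum(flat)],                       # "dense" with unit weights
]

layer_outputs = run_and_record(toy_layers, [1, 2, 3])
for index, output in enumerate(layer_outputs, start=1):
    print(index, output)
```

The identical leading rows in `lyr_op1` are consistent with left padding: those positions presumably all hold id 0, so they all map to the same embedding vector.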

## Build Keras Embedding Layer Model
We can think of the Embedding layer as a dictionary that maps the index assigned to a word to a word vector. This layer is very flexible and can be used in a few ways:

* The embedding layer can be used at the start of a larger deep learning model.
* We could load pre-trained word embeddings into the embedding layer when we create our model.
* We could use the embedding layer to train our own word2vec models.
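For the pre-trained option, the usual first step is to arrange the pre-trained vectors into a matrix whose row `i` belongs to the word with id `i`. A minimal sketch under stated assumptions: `build_embedding_matrix`, `toy_word_to_id`, and `toy_pretrained` are hypothetical, and the vectors are invented rather than taken from a real embedding file.

```python
def build_embedding_matrix(word_to_id, pretrained, dim):
    """Row i holds the vector of the word with id i; words without a vector stay zero."""
    num_rows = max(word_to_id.values()) + 1
    matrix = [[0.0] * dim for _ in range(num_rows)]
    for word, idx in word_to_id.items():
        vector = pretrained.get(word)
        if vector is not None:
            matrix[idx] = list(vector)
    return matrix

# Hypothetical inputs for illustration only.
toy_word_to_id = {"<PAD>": 0, "<START>": 1, "<UNK>": 2, "the": 4, "film": 5}
toy_pretrained = {"the": [0.1, 0.2], "film": [0.3, 0.4]}

embedding_matrix = build_embedding_matrix(toy_word_to_id, toy_pretrained, dim=2)
for row in embedding_matrix:
    print(row)
```

A matrix built this way is typically handed to the embedding layer as its initial weights, optionally frozen so that training does not change it.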

The Keras embedding layer doesn't require us to one-hot encode our words; instead, we have to give each word a unique integer id. For the IMDB dataset we've loaded, this has already been done, but if this weren't the case we could use sklearn's [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).
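The sketch below illustrates both points in plain Python: assigning integer ids to words (in sorted order, as `LabelEncoder` does) and why an integer id is enough, since looking up a row gives the same result as multiplying a one-hot vector by the embedding matrix. All helper names and the weights are made up for illustration.

```python
def assign_ids(tokens):
    """Give each distinct token an integer id, in sorted order."""
    return {token: idx for idx, token in enumerate(sorted(set(tokens)))}

def one_hot(idx, size):
    """Vector of zeros with a single 1.0 at position idx."""
    return [1.0 if i == idx else 0.0 for i in range(size)]

def vector_times_matrix(vec, matrix):
    """Multiply a row vector by a matrix stored as a list of rows."""
    cols = len(matrix[0])
    return [sum(vec[r] * matrix[r][c] for r in range(len(matrix))) for c in range(cols)]

tokens = ["good", "bad", "good", "fine"]
ids = assign_ids(tokens)
embedding_matrix = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # invented weights

# Direct row lookup versus the equivalent one-hot multiplication.
lookup = embedding_matrix[ids["good"]]
via_one_hot = vector_times_matrix(one_hot(ids["good"], len(ids)), embedding_matrix)
print(ids)
print(lookup)
print(via_one_hot)
```

The lookup avoids building a vocabulary-sized vector for every word, which matters with a vocabulary of 10,000 entries.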