# Sentiment Classification


### Generate word embeddings and retrieve the outputs of each layer with Keras, based on a classification task

Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation.

It is a distributed representation for text that is perhaps one of the key breakthroughs for the impressive performance of deep learning methods on challenging natural language processing problems.
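
To make "similar representation" concrete: closeness between embedding vectors is commonly measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors (the words and values are illustrative assumptions, not learned embeddings) to show that related words would score higher than unrelated ones.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy vectors, hand-picked for illustration only.
toy_embeddings = {
    "good":  np.array([0.9, 0.1, 0.3]),
    "great": np.array([0.8, 0.2, 0.4]),
    "awful": np.array([-0.7, 0.1, -0.5]),
}

sim_good_great = cosine_similarity(toy_embeddings["good"], toy_embeddings["great"])
sim_good_awful = cosine_similarity(toy_embeddings["good"], toy_embeddings["awful"])
print(sim_good_great, sim_good_awful)
```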

We will use the IMDB dataset to learn word embeddings as we train our model. This dataset contains 25,000 movie reviews from IMDB, labeled with sentiment (positive or negative).



### Dataset

`from keras.datasets import imdb`

Dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, the words are indexed by their frequency in the dataset, meaning the word that has index 1 is the most frequent word. To speed up training, the code below keeps at most 300 words from each review (`maxlen = 300`) and uses a max vocab size of 10,000.

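As a convention, "0" does not stand for a specific word. With the default settings of `imdb.load_data`, 0 is reserved for padding, 1 marks the start of a review, and 2 encodes any unknown (out-of-vocabulary) word.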
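
The frequency-based indexing described above can be sketched with a toy corpus. The corpus and the `build_word_index` helper are hypothetical; this only illustrates the ranking idea, not how the IMDB files were actually prepared.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to its frequency rank, starting at 1 for the most frequent."""
    counts = Counter(word for text in texts for word in text.split())
    # Sort by descending count, breaking ties alphabetically for determinism.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}

toy_corpus = ["the movie was good", "the movie was bad", "the plot"]
toy_index = build_word_index(toy_corpus)

# Encode a sentence as a sequence of integer word ids.
encoded = [toy_index[w] for w in "the movie was good".split()]
print(toy_index)
print(encoded)
```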


### Aim

1. Import the train and test data.
2. Import the labels (train and test).
3. Get the word index and then create key-value pairs for word and word_id. (12.5 points)
4. Build a Sequential model using Keras for the sentiment classification task. (10 points)
5. Report the accuracy of the model. (5 points)
6. Retrieve the output of each layer in Keras for a single test sample from the trained model you built. (2.5 points)


#### Usage:

In [1]:
import keras

Using TensorFlow backend.


In [0]:
import numpy as np
# save np.load
np_load_old = np.load

# modify the default parameters of np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)

In [3]:
from keras.datasets import imdb

vocab_size = 10000  # number of most frequent words to keep

# Words outside the vocab_size most frequent are replaced by the out-of-vocabulary index.
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)

Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz


In [0]:
np.load = np_load_old
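
The save, patch, and restore steps around `np.load` can also be wrapped in a context manager, so the original function is restored even if loading fails. This is only a sketch: `allow_pickle_load` is a hypothetical helper name, and on Keras versions that already pass `allow_pickle=True` the workaround may not be needed at all.

```python
import contextlib
import functools
import numpy as np

@contextlib.contextmanager
def allow_pickle_load():
    """Temporarily make np.load default to allow_pickle=True."""
    original_load = np.load
    np.load = functools.partial(original_load, allow_pickle=True)
    try:
        yield
    finally:
        # Always restore the original, even if the body raises.
        np.load = original_load

# Usage (assumes `imdb` has been imported from keras.datasets):
# with allow_pickle_load():
#     (x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)
```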

In [0]:
from keras.preprocessing.sequence import pad_sequences
vocab_size = 10000  # vocab size (same value as above)
maxlen = 300  # number of words kept from each review

In [0]:

# make all sequences the same length (pad with zeros or truncate to maxlen)
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test = pad_sequences(x_test, maxlen=maxlen)
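
With its default arguments, `pad_sequences` pads short sequences with zeros at the front and truncates long sequences from the front, so the last `maxlen` tokens of a long review are the ones kept. The NumPy sketch below mimics that default behavior on toy data; `pad_pre` is a hypothetical helper for illustration, not the Keras implementation.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` items of each sequence."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]                 # truncate from the front
        if kept:
            out[row, -len(kept):] = kept     # right-align, padding stays on the left
    return out

toy_sequences = [[5, 8], [1, 2, 3, 4, 5, 6]]
padded = pad_pre(toy_sequences, maxlen=4)
print(padded)
```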

In [10]:
x_train.shape

(25000, 300)

In [11]:
x_test.shape

(25000, 300)

In [12]:
y_train

array([1, 0, 0, ..., 0, 1, 0])

In [13]:
x_train

array([[   0,    0,    0, ...,   19,  178,   32],
       [   0,    0,    0, ...,   16,  145,   95],
       [   0,    0,    0, ...,    7,  129,  113],
       ...,
       [   0,    0,    0, ...,    4, 3586,    2],
       [   0,    0,    0, ...,   12,    9,   23],
       [   0,    0,    0, ...,  204,  131,    9]], dtype=int32)

In [0]:
#Get the word index and then Create key value pair for word and word_id. (12.5 points)

In [14]:
word_index=imdb.get_word_index()

Downloading data from https://s3.amazonaws.com/text-datasets/imdb_word_index.json


In [15]:
word_index

{u'fawn': 34701,
 u'tsukino': 52006,
 u'nunnery': 52007,
 u'sonja': 16816,
 u'vani': 63951,
 u'woods': 1408,
 u'spiders': 16115,
 u'hanging': 2345,
 u'woody': 2289,
 u'trawling': 52008,
 u"hold's": 52009,
 u'comically': 11307,
 u'localized': 40830,
 u'disobeying': 30568,
 u"'royale": 52010,
 u"harpo's": 40831,
 u'canet': 52011,
 u'aileen': 19313,
 u'acurately': 52012,
 u"diplomat's": 52013,
 u'rickman': 25242,
 u'rumbustious': 52014,
 u'familiarness': 52015,
 u"spider'": 52016,
 u'hahahah': 68804,
 u"wood'": 52017,
 u'transvestism': 40833,
 u"hangin'": 34702,
 u'screaming': 1927,
 u'seamier': 40834,
 u'wooded': 34703,
 u'bravora': 52018,
 u'grueling': 16817,
 u'wooden': 1636,
 u'wednesday': 16818,
 u"'prix": 52019,
 u'altagracia': 34704,
 u'circuitry': 52020,
 u'crotch': 11585,
 u'busybody': 57766,
 u"tart'n'tangy": 52021,
 u'pantheistic': 52022,
 u'thrace': 52023,
 u"tom's": 11038,
 u'snuggles': 52025,
 u"frasier's": 52026,
 u'complainers': 52027,
 u'templarios': 52125,
 u'272': 40835

In [0]:
reverse_word_index = dict([(value,key) for (key,value) in word_index.items()])

In [17]:
reverse_word_index

{1: u'the',
 2: u'and',
 3: u'a',
 4: u'of',
 5: u'to',
 6: u'is',
 7: u'br',
 8: u'in',
 9: u'it',
 10: u'i',
 11: u'this',
 12: u'that',
 13: u'was',
 14: u'as',
 15: u'for',
 16: u'with',
 17: u'movie',
 18: u'but',
 19: u'film',
 20: u'on',
 21: u'not',
 22: u'you',
 23: u'are',
 24: u'his',
 25: u'have',
 26: u'he',
 27: u'be',
 28: u'one',
 29: u'all',
 30: u'at',
 31: u'by',
 32: u'an',
 33: u'they',
 34: u'who',
 35: u'so',
 36: u'from',
 37: u'like',
 38: u'her',
 39: u'or',
 40: u'just',
 41: u'about',
 42: u"it's",
 43: u'out',
 44: u'has',
 45: u'if',
 46: u'some',
 47: u'there',
 48: u'what',
 49: u'good',
 50: u'more',
 51: u'when',
 52: u'very',
 53: u'up',
 54: u'no',
 55: u'time',
 56: u'she',
 57: u'even',
 58: u'my',
 59: u'would',
 60: u'which',
 61: u'only',
 62: u'story',
 63: u'really',
 64: u'see',
 65: u'their',
 66: u'had',
 67: u'can',
 68: u'were',
 69: u'me',
 70: u'well',
 71: u'than',
 72: u'we',
 73: u'much',
 74: u'been',
 75: u'bad',
 76: u'get',
 77: 

In [0]:
decoded_review = ' '.join([reverse_word_index.get(i,'?') for i in x_train[0]])
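
One caveat when decoding: with its default arguments, `imdb.load_data` reserves index 0 for padding, 1 for the start-of-sequence marker, and 2 for out-of-vocabulary words, and shifts every real word index by 3 (`index_from=3`). Looking up the raw index in `reverse_word_index` therefore returns a different word than the one in the review. A sketch of the shifted lookup, using a tiny hypothetical word index in place of the real one:

```python
def decode_review(encoded, reverse_index, index_from=3):
    """Decode a sequence, treating ids below `index_from` as reserved tokens."""
    return " ".join(reverse_index.get(i - index_from, "?") for i in encoded)

# Hypothetical miniature word index (the real one comes from imdb.get_word_index()).
toy_word_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4}
toy_reverse_index = {idx: word for word, idx in toy_word_index.items()}

# 0 = padding, 1 = start marker, then "the film was brilliant" shifted by 3.
toy_encoded = [0, 0, 1, 4, 5, 6, 7]
print(decode_review(toy_encoded, toy_reverse_index))
```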

In [19]:
decoded_review

u"? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? the as you with out themselves powerful lets loves their becomes reaching had journalist of lot from anyone to have after out atmosphere never more room and it so heart shows to years of every never going and help moments or of every chest visual movie except her was several of enough more with is now current film as you of mine potentially unfortunately of you than him that with out themselves her get for was camp of you movie sometimes movie that with scary but and to story wonderful that in seeing in character to of 70s musicians with heart had shadows they of here that with her serious to have does when from why what have critics they is you that isn't one will very to as itself with other and in of seen over landed for anyone of and br show's to whether from than out themselves history he name half some br of and odd 

## Build Keras Embedding Layer Model
We can think of the Embedding layer as a dictionary that maps the index assigned to a word to a word vector. This layer is very flexible and can be used in a few ways:

* The embedding layer can be used at the start of a larger deep learning model.
* We could also load pre-trained word embeddings into the embedding layer when we create our model.
* We could use the embedding layer to train our own word2vec models.

The Keras embedding layer doesn't require us to one-hot encode our words; instead, we have to give each word a unique integer id. For the IMDB dataset we've loaded, this has already been done, but if it hadn't been, we could use sklearn's [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).
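
One-hot encoding is unnecessary because an embedding lookup gives the same result as multiplying a one-hot matrix by the embedding matrix, only without building the one-hot vectors. A NumPy sketch with a small random matrix (the sizes and the seed are arbitrary assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vocab_size, embedding_dim = 6, 4
embedding_matrix = rng.normal(size=(toy_vocab_size, embedding_dim))

word_ids = np.array([2, 5, 2])

# Direct lookup: pick one row of the matrix per word id.
looked_up = embedding_matrix[word_ids]

# Equivalent one-hot route: (3, 6) one-hot matrix times (6, 4) embedding matrix.
one_hot = np.eye(toy_vocab_size)[word_ids]
via_one_hot = one_hot @ embedding_matrix

print(looked_up.shape)
print(np.allclose(looked_up, via_one_hot))
```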

In [0]:
from keras.models import Sequential
from keras.layers import Flatten,Dense,Embedding

In [0]:
#define an Embedding layer with 64 dimensions and input_length = 300 words

In [21]:
model = Sequential()
model.add(Embedding(vocab_size, 64, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
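
As a sanity check on this architecture, the number of trainable parameters per layer can be worked out by hand from the layer sizes used above (vocabulary 10,000, embedding dimension 64, sequence length 300). The arithmetic below is a sketch of that calculation and should agree with what `model.summary()` reports:

```python
vocab_size, embedding_dim, maxlen = 10000, 64, 300

embedding_params = vocab_size * embedding_dim   # one 64-d vector per word id
flatten_units = maxlen * embedding_dim          # Flatten has no weights of its own
dense1_params = flatten_units * 32 + 32         # weights + biases of Dense(32)
dense2_params = 32 * 1 + 1                      # weights + bias of Dense(1)

total_params = embedding_params + dense1_params + dense2_params
print(embedding_params, dense1_params, dense2_params, total_params)
```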

In [22]:
history = model.fit(x_train,y_train,
          epochs=20,
          batch_size=32,          
          validation_data=(x_test, y_test))

Train on 25000 samples, validate on 25000 samples
Epoch 1/20


Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [0]:
#Report the Accuracy of the model. (5 points)

In [24]:
loss,acc=model.evaluate(x_test,y_test)
print(acc)

0.83392
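
The accuracy reported for a sigmoid output is the fraction of samples whose predicted probability, thresholded at 0.5, matches the label. A small sketch with made-up probabilities (the numbers are illustrative, not output from the model above):

```python
def binary_accuracy(labels, probabilities, threshold=0.5):
    """Fraction of predictions that match the 0/1 labels after thresholding."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(int(pred == label) for pred, label in zip(predictions, labels))
    return correct / len(labels)

toy_labels = [1, 0, 1, 1]
toy_probabilities = [0.91, 0.20, 0.45, 0.70]   # hypothetical sigmoid outputs
toy_accuracy = binary_accuracy(toy_labels, toy_probabilities)
print(toy_accuracy)
```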


## Retrieve the output of each layer in Keras for a single test sample from the trained model you built

In [0]:
from keras import backend as K
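
Before inspecting real activations, it helps to know what shape each layer's output should have for a single sample. The NumPy sketch below pushes one fake sequence through randomly initialized weights with the same layer sizes as the model above; the weights are random stand-ins, so only the shapes (not the values) are meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim, maxlen = 10000, 64, 300

# Random stand-ins for the trained weights (assumption: values are arbitrary).
embedding_matrix = rng.normal(scale=0.05, size=(vocab_size, embedding_dim))
w1, b1 = rng.normal(scale=0.05, size=(maxlen * embedding_dim, 32)), np.zeros(32)
w2, b2 = rng.normal(scale=0.05, size=(32, 1)), np.zeros(1)

sample = rng.integers(0, vocab_size, size=(1, maxlen))    # one fake review

embedded = embedding_matrix[sample]                        # Embedding
flattened = embedded.reshape(1, -1)                        # Flatten
hidden = np.maximum(0.0, flattened @ w1 + b1)              # Dense(32, relu)
output = 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))         # Dense(1, sigmoid)

for name, value in [("embedding", embedded), ("flatten", flattened),
                    ("dense_relu", hidden), ("dense_sigmoid", output)]:
    print(name, value.shape)
```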


In [31]:

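inp = model.input
outputs = [layer.output for layer in model.layers]
# one backend function per layer, mapping the model input to that layer's output
functors = [K.function([inp], [out]) for out in outputs]

# run a single test sample (reshaped to a batch of one) through every function
layer_outs = [func([x_test[0].reshape(1, -1)]) for func in functors]
print(layer_outs)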

[[array([[[-0.00194268, -0.00765159, -0.03504343, ..., -0.01325203,
         -0.00606479, -0.01482375],
        [-0.00194268, -0.00765159, -0.03504343, ..., -0.01325203,
         -0.00606479, -0.01482375],
        [-0.00194268, -0.00765159, -0.03504343, ..., -0.01325203,
         -0.00606479, -0.01482375],
        ...,
        [ 0.02786462, -0.05064026, -0.01322706, ...,  0.02633499,
          0.06181337,  0.02612353],
        [ 0.00236842, -0.01174042,  0.0017697 , ...,  0.01354748,
         -0.01293511, -0.00688689],
        [-0.01816829,  0.07322223, -0.12703255, ..., -0.03705573,
          0.04132452, -0.14361395]]], dtype=float32)], [array([[-0.00194268, -0.00765159, -0.03504343, ..., -0.03705573,
         0.04132452, -0.14361395]], dtype=float32)], [array([[1.5859357 , 1.725853  , 2.160128  , 1.7440513 , 2.5747254 ,
        1.5254537 , 1.5868578 , 1.3690289 , 1.9937618 , 1.508213  ,
        0.52547157, 1.7538209 , 0.35309103, 1.3072786 , 1.6660818 ,
        1.6598849 , 1.7008424 