# Sentiment Classification


### Generate Word Embeddings and retrieve outputs of each layer with Keras based on Classification task

Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation.

It is a distributed representation for text that is perhaps one of the key breakthroughs for the impressive performance of deep learning methods on challenging natural language processing problems.

We willl use the imdb dataset to learn word embeddings as we train our dataset. This dataset contains 25,000 movie reviews from IMDB, labeled with sentiment (positive or negative). 



### Dataset

`from keras.datasets import imdb`

Dataset of 25,000 movies reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, the words are indexed by their frequency in the dataset, meaning the for that has index 1 is the most frequent word. Use the first 20 words from each review to speed up training, using a max vocab size of 10,000.

As a convention, "0" does not stand for a specific word, but instead is used to encode any unknown word.


### Aim

1. Import test and train data  
2. Import the labels ( train and test) 
3. Get the word index and then Create key value pair for word and word_id. (12.5 points)
4. Build a Sequential Model using Keras for Sentiment Classification task. (10 points)
5. Report the Accuracy of the model. (5 points)  
6. Retrive the output of each layer in keras for a given single test sample from the trained model you built. (2.5 points)


#### Usage:

In [1]:
# Import the IMDB Dataset

from keras.datasets import imdb

vocab_size = 10000 #vocab size,it is the most frequent words.

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size) 

Using TensorFlow backend.


Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz


In [0]:
# View sample Train and Test data

x_train

array([list([1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]),
       list([1, 194, 1153, 194, 8255, 78, 228,

In [0]:
x_test

array([list([1, 591, 202, 14, 31, 6, 717, 10, 10, 2, 2, 5, 4, 360, 7, 4, 177, 5760, 394, 354, 4, 123, 9, 1035, 1035, 1035, 10, 10, 13, 92, 124, 89, 488, 7944, 100, 28, 1668, 14, 31, 23, 27, 7479, 29, 220, 468, 8, 124, 14, 286, 170, 8, 157, 46, 5, 27, 239, 16, 179, 2, 38, 32, 25, 7944, 451, 202, 14, 6, 717]),
       list([1, 14, 22, 3443, 6, 176, 7, 5063, 88, 12, 2679, 23, 1310, 5, 109, 943, 4, 114, 9, 55, 606, 5, 111, 7, 4, 139, 193, 273, 23, 4, 172, 270, 11, 7216, 2, 4, 8463, 2801, 109, 1603, 21, 4, 22, 3861, 8, 6, 1193, 1330, 10, 10, 4, 105, 987, 35, 841, 2, 19, 861, 1074, 5, 1987, 2, 45, 55, 221, 15, 670, 5304, 526, 14, 1069, 4, 405, 5, 2438, 7, 27, 85, 108, 131, 4, 5045, 5304, 3884, 405, 9, 3523, 133, 5, 50, 13, 104, 51, 66, 166, 14, 22, 157, 9, 4, 530, 239, 34, 8463, 2801, 45, 407, 31, 7, 41, 3778, 105, 21, 59, 299, 12, 38, 950, 5, 4521, 15, 45, 629, 488, 2733, 127, 6, 52, 292, 17, 4, 6936, 185, 132, 1988, 5304, 1799, 488, 2693, 47, 6, 392, 173, 4, 2, 4378, 270, 2352, 4, 1500, 7, 

In [0]:
y_train

array([1, 0, 0, ..., 0, 1, 0])

In [2]:
# Dictiionary for Word ID vs Word

word_to_id = imdb.get_word_index()
word_to_id = {k:v for k,v in word_to_id.items()}
id_to_word = {value:key for key,value in word_to_id.items()}
id_to_word

Downloading data from https://s3.amazonaws.com/text-datasets/imdb_word_index.json


{34701: 'fawn',
 52006: 'tsukino',
 52007: 'nunnery',
 16816: 'sonja',
 63951: 'vani',
 1408: 'woods',
 16115: 'spiders',
 2345: 'hanging',
 2289: 'woody',
 52008: 'trawling',
 52009: "hold's",
 11307: 'comically',
 40830: 'localized',
 30568: 'disobeying',
 52010: "'royale",
 40831: "harpo's",
 52011: 'canet',
 19313: 'aileen',
 52012: 'acurately',
 52013: "diplomat's",
 25242: 'rickman',
 6746: 'arranged',
 52014: 'rumbustious',
 52015: 'familiarness',
 52016: "spider'",
 68804: 'hahahah',
 52017: "wood'",
 40833: 'transvestism',
 34702: "hangin'",
 2338: 'bringing',
 40834: 'seamier',
 34703: 'wooded',
 52018: 'bravora',
 16817: 'grueling',
 1636: 'wooden',
 16818: 'wednesday',
 52019: "'prix",
 34704: 'altagracia',
 52020: 'circuitry',
 11585: 'crotch',
 57766: 'busybody',
 52021: "tart'n'tangy",
 14129: 'burgade',
 52023: 'thrace',
 11038: "tom's",
 52025: 'snuggles',
 29114: 'francesco',
 52027: 'complainers',
 52125: 'templarios',
 40835: '272',
 52028: '273',
 52130: 'zaniacs',

In [0]:
#Define maximum number of words to consider in each review
max_review_length = 300

#Pad training and test reviews
import tensorflow as tf
from keras.preprocessing.sequence import pad_sequences
x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train,
                                                        maxlen=max_review_length
                                                        )
x_test = tf.keras.preprocessing.sequence.pad_sequences(x_test, 
                                                       maxlen=max_review_length 
                                                       )

## Build Keras Embedding Layer Model
We can think of the Embedding layer as a dicionary that maps a index assigned to a word to a word vector. This layer is very flexible and can be used in a few ways:

* The embedding layer can be used at the start of a larger deep learning model. 
* Also we could load pre-train word embeddings into the embedding layer when we create our model.
* Use the embedding layer to train our own word2vec models.

The keras embedding layer doesn't require us to onehot encode our words, instead we have to give each word a unqiue intger number as an id. For the imdb dataset we've loaded this has already been done, but if this wasn't the case we could use sklearn [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).

In [3]:
# Model

from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D
from keras.layers.embeddings import Embedding
from keras.layers import LSTM

# create the model
model = Sequential()

#Embdedding layer

model.add(Embedding(vocab_size, 50, input_length=max_review_length))

#LSTM Layer
model.add(LSTM(256, dropout=0.2, recurrent_dropout=0.2)) 

#Dense layer
model.add(Dense(1, activation='sigmoid'))

# Compile layer
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Model Summary
model.summary()

# Fit the model
model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=2, batch_size=128, verbose=2)





Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.


Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 300, 50)           500000    
_________________________________________________________________
lstm_1 (LSTM)                (None, 256)               314368    
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 257       
Total params: 814,625
Trainable params: 814,625
Non-trainable params: 0
_________________________________________________________________



Train on 25000 samples, validate on 25000 samples
Epoch 1/2





 - 158s - loss: 0.5392 - acc: 0.7267 - val_loss: 0.5240 - val_acc: 0.7354
Epoch 2/2
 - 156s - l

<keras.callbacks.History at 0x7fa0794cbe80>

In [4]:
# Accuracy of the model built

accuracy= model.evaluate(x_test, y_test)
print("Accuracy: %.2f%%" % (accuracy[1]*100))

Accuracy: 84.70%


## Retrive the output of each layer in keras for a given single test sample from the trained model you built

In [5]:
# Output of Embedded layer

# create the model
model = Sequential()

#Embdedding layer

model.add(Embedding(vocab_size, 50, input_length=max_review_length))
model.output

<tf.Tensor 'embedding_2/embedding_lookup/Identity:0' shape=(?, 300, 50) dtype=float32>

In [6]:
# Output of LSTM Layer

#LSTM Layer
model.add(LSTM(256, dropout=0.2, recurrent_dropout=0.2)) 
model.output


<tf.Tensor 'lstm_2/TensorArrayReadV3:0' shape=(?, 256) dtype=float32>

In [7]:
# Dense layer

#Dense layer
model.add(Dense(1, activation='sigmoid'))
model.output

<tf.Tensor 'dense_2/Sigmoid:0' shape=(?, 1) dtype=float32>

In [9]:
# Compile layer
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Model Summary
model.summary()



Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 300, 50)           500000    
_________________________________________________________________
lstm_2 (LSTM)                (None, 256)               314368    
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 257       
Total params: 814,625
Trainable params: 814,625
Non-trainable params: 0
_________________________________________________________________


**Predicting the output for a few samples**

In [0]:
# Predicting the Test set

y_pred=model.predict(x_test)

In [23]:
# Print some reviews with actual and predicted output for test set

import pandas as pd
text=[]
actual=[]
predicted=[]
for i in range(0,5):
  text.append(' '.join(id_to_word[id] for id in x_test[i] ))
  actual.append(y_test[i])
  predicted.append(y_pred[i])
 
sample_test = pd.DataFrame( {'Review': text, 'Actual Sentiment': actual,  'Predicted Sentiment': predicted  })
sample_test

Unnamed: 0,Review,Actual Sentiment,Predicted Sentiment
0,the please give this one a miss br br and and ...,0,[0.26843128]
1,the this film and a lot of and because it and ...,1,[0.95369613]
2,the many animation and and and and the great a...,1,[0.6197295]
3,the i and love this type of movie however this...,0,[0.2754928]
4,the like some other people and i'm a die hard ...,1,[0.9803919]
