# Sentiment Classification


This dataset contains 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been
preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, the words
are indexed by their frequency in the dataset, meaning the word that has index 1 is the most frequent word. The task
suggests using the first 20 words from each review to speed up training, with a max vocab size of 10,000; the code
below instead pads or truncates each review to 300 tokens.
As a convention, "0" does not stand for a specific word; with the loader defaults used below it is reserved for padding,
and index 2 encodes any unknown word.
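
To make the frequency-based indexing concrete, here is a minimal sketch on a toy corpus. The corpus, the alphabetical tie-break and the helper name `top_k_vocab` are assumptions for illustration, not part of the IMDB loader.

```python
from collections import Counter

# Toy corpus standing in for the IMDB reviews (hypothetical data).
reviews = [
    "the film was great",
    "the film was bad",
    "the acting was great",
]

# Count every word across all reviews.
counts = Counter(word for review in reviews for word in review.split())

# Rank words by descending frequency; ties are broken alphabetically
# (an assumption made here only to keep the result deterministic).
ranked = sorted(counts, key=lambda w: (-counts[w], w))

# Rank 1 is the most frequent word; index 0 is left free for a special token.
word_index = {word: rank for rank, word in enumerate(ranked, start=1)}

def top_k_vocab(index, k):
    """Keep only the k most frequent words (ranks 1..k)."""
    return {word: rank for word, rank in index.items() if rank <= k}

print(word_index)
# {'the': 1, 'was': 2, 'film': 3, 'great': 4, 'acting': 5, 'bad': 6}
print(top_k_vocab(word_index, 4))
# {'the': 1, 'was': 2, 'film': 3, 'great': 4}
```

Capping the vocabulary this way is what keeps the embedding table small: rare words fall outside the cap and are treated as unknown.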

1. Import test and train data.
2. Import the labels (train and test).
3. Get the word index and then create key-value pairs for word and word_id.
4. Build a Sequential Model using Keras for the Sentiment Classification task.
5. Report the Accuracy of the model.
6. Retrieve the output of each layer in Keras for a given single test sample from the trained model you built.

Generate Word Embeddings and retrieve outputs of each layer with Keras based on the Classification task.

## Loading the dataset

In [None]:
# Import the libraries and packages used in this notebook
import numpy as np
import tensorflow as tf
import keras

Using TensorFlow backend.


## Import test and train data

In [None]:
from keras.datasets import imdb

vocab_size = 10000  # vocabulary size

# num_words keeps only the vocab_size most frequent words; rarer words are replaced by the out-of-vocabulary index
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)


Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz


In [None]:
from keras.preprocessing.sequence import pad_sequences
vocab_size = 10000  # vocabulary size
maxlen = 300  # number of tokens kept from each review

## Import the labels (train and test)

In [None]:
# load the dataset as lists of integer word indexes
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
# pad or truncate every review to the same length
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test = pad_sequences(x_test, maxlen=maxlen)
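
By default `pad_sequences` pads and truncates at the start of each sequence (`padding='pre'`, `truncating='pre'`), so a review longer than `maxlen` keeps its last `maxlen` tokens rather than its first. The pure-Python helper below, with the hypothetical name `pad_pre`, is a sketch of that default behaviour for a single sequence, not the library implementation.

```python
def pad_pre(seq, maxlen, value=0):
    """Mimic the default 'pre' padding and 'pre' truncation for one sequence."""
    # Truncate from the front: keep only the last maxlen items.
    kept = list(seq)[-maxlen:] if maxlen > 0 else []
    # Pad at the front with the padding value until the length is maxlen.
    return [value] * (maxlen - len(kept)) + kept

print(pad_pre([5, 6, 7], maxlen=5))              # [0, 0, 5, 6, 7]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))  # [3, 4, 5, 6, 7]
```

To keep the first tokens of each review instead, `pad_sequences` accepts `truncating='post'`.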

In [None]:
print(x_train.shape)

(25000, 300)


In [None]:
print(y_train.shape)

(25000,)


## Get the word index and then create key-value pairs for word and word_id

In [None]:
import keras

In [None]:
# index_from=3 (the default) shifts every word id by 3 to leave room for the reserved tokens
(training_data, training_labels), (test_data, test_labels) = imdb.load_data(num_words=vocab_size, index_from=3)

In [None]:
word_to_id = keras.datasets.imdb.get_word_index()

Downloading data from https://s3.amazonaws.com/text-datasets/imdb_word_index.json


In [None]:
import itertools
dict(itertools.islice(word_to_id.items(), 4))

{'fawn': 34701, 'nunnery': 52007, 'sonja': 16816, 'tsukino': 52006}

In [None]:
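# shift every word id by 3 so the mapping matches the data loaded with index_from=3
word_to_id = {k: (v + 3) for k, v in word_to_id.items()}
# reserved tokens
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2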

In [None]:
# invert the mapping (idx avoids shadowing the built-in id)
id_to_word = {idx: word for word, idx in word_to_id.items()}

In [None]:
print(' '.join(id_to_word[idx] for idx in training_data[2]))

<START> this has to be one of the worst films of the 1990s when my friends i were watching this film being the target audience it was aimed at we just sat watched the first half an hour with our jaws touching the floor at how bad it really was the rest of the time everyone else in the theatre just started talking to each other leaving or generally crying into their popcorn that they actually paid money they had <UNK> working to watch this feeble excuse for a film it must have looked like a great idea on paper but on film it looks like no one in the film has a clue what is going on crap acting crap costumes i can't get across how <UNK> this is to watch save yourself an hour a bit of your life

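The same mapping can be used in the other direction to encode new text. The sketch below uses a tiny hypothetical vocabulary (`toy_raw_index`) in place of the real result of `get_word_index()`, and the helper names `encode` and `decode` are assumptions for illustration.

```python
# Toy rank-based index (hypothetical; the real one comes from get_word_index()).
toy_raw_index = {"the": 1, "film": 2, "was": 3, "great": 4}

INDEX_FROM = 3  # same offset as index_from=3 in the loader call above

# Shift the ranks and add the three reserved tokens.
toy_word_to_id = {word: rank + INDEX_FROM for word, rank in toy_raw_index.items()}
toy_word_to_id.update({"<PAD>": 0, "<START>": 1, "<UNK>": 2})
toy_id_to_word = {idx: word for word, idx in toy_word_to_id.items()}

def encode(text):
    """Turn raw text into ids: <START> first, unknown words mapped to <UNK>."""
    ids = [toy_word_to_id["<START>"]]
    ids += [toy_word_to_id.get(word, toy_word_to_id["<UNK>"])
            for word in text.lower().split()]
    return ids

def decode(ids):
    """Turn ids back into a space-separated string."""
    return " ".join(toy_id_to_word.get(idx, "<UNK>") for idx in ids)

encoded = encode("The film was awful")
print(encoded)          # [1, 4, 5, 6, 2]
print(decode(encoded))  # <START> the film was <UNK>
```
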

## Build a Sequential Model using Keras for the Sentiment Classification task

In [None]:
from __future__ import print_function
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import Conv1D, GlobalMaxPooling1D
from keras.datasets import imdb
from keras.layers import Flatten

In [None]:
# create the model
model = Sequential()
model.add(Embedding(vocab_size, 32, input_length=maxlen))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()  # summary() prints the table itself and returns None

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 300, 32)           320000    
_________________________________________________________________
flatten_1 (Flatten)          (None, 9600)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 250)               2400250   
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 251       
Total params: 2,720,501
Trainable params: 2,720,501
Non-trainable params: 0
_________________________________________________________________
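
The parameter counts in the summary can be checked by hand. The sketch below redoes the arithmetic for this Embedding, Flatten, Dense, Dense stack; the function name `dense_model_params` is an assumption for illustration.

```python
def dense_model_params(vocab_size=10000, embed_dim=32, maxlen=300, hidden=250):
    """Recompute the per-layer parameter counts of the model above."""
    embedding = vocab_size * embed_dim       # one embed_dim vector per word id
    flat_width = maxlen * embed_dim          # Flatten has no weights
    dense_1 = flat_width * hidden + hidden   # weights plus one bias per unit
    dense_2 = hidden * 1 + 1                 # single sigmoid output unit
    return {
        "embedding": embedding,
        "flat_width": flat_width,
        "dense_1": dense_1,
        "dense_2": dense_2,
        "total": embedding + dense_1 + dense_2,
    }

print(dense_model_params())
# {'embedding': 320000, 'flat_width': 9600, 'dense_1': 2400250, 'dense_2': 251, 'total': 2720501}
```

Most of the weights sit in the first Dense layer, because flattening 300 positions of 32 values each gives it 9,600 inputs.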


In [None]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
history = model.fit(x_train, y_train, epochs=10, batch_size=64)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## Report the Accuracy of the model

In [None]:
scores = model.evaluate(x_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 86.48%
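
Here `scores[0]` is the loss and `scores[1]` is the accuracy metric requested in `compile`. For a single sigmoid output, accuracy amounts to thresholding the predicted probability at 0.5 and comparing with the label. The sketch below shows that computation on made-up probabilities; the data and the helper name `binary_accuracy` are assumptions for illustration.

```python
def binary_accuracy(probs, labels, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    preds = [1 if p > threshold else 0 for p in probs]
    correct = sum(1 for pred, label in zip(preds, labels) if pred == label)
    return correct / len(labels)

# Hypothetical sigmoid outputs and true sentiment labels (1 = positive).
probs = [0.91, 0.12, 0.55, 0.48]
labels = [1, 0, 0, 1]

print(binary_accuracy(probs, labels))  # 0.5
```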


## Trying with top Words (Vocab) = 5000 and Max Len = 500

In [None]:
# CNN for the IMDB problem
import numpy
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense, Flatten, Conv1D, MaxPooling1D, Embedding
from keras.preprocessing import sequence
# fix NumPy's random seed (the TensorFlow seed is not set here)
seed = 7
numpy.random.seed(seed)

In [None]:
# load the dataset but only keep the top n words; rarer words become the out-of-vocabulary index
top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)
# pad or truncate every review to a fixed length in words
max_words = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)

In [None]:
# create the model
model = Sequential()
model.add(Embedding(top_words, 32, input_length=max_words))
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(250, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 500, 32)           160000    
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 500, 32)           3104      
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 250, 32)           0         
_________________________________________________________________
flatten_2 (Flatten)          (None, 8000)              0         
_________________________________________________________________
dense_3 (Dense)              (None, 250)               2000250   
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 251       
Total params: 2,163,605
Trainable params: 2,163,605
Non-trainable params: 0
____________________________________________
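
The shapes and parameter counts of the CNN can be checked the same way. The sketch below assumes stride 1 for the convolution (the Keras default) and uses a hypothetical helper name, `cnn_shapes_and_params`.

```python
def cnn_shapes_and_params(top_words=5000, embed_dim=32, max_words=500,
                          filters=32, kernel_size=3, pool_size=2, hidden=250):
    """Recompute sequence lengths and parameter counts of the CNN above."""
    embedding = top_words * embed_dim
    # padding='same' with stride 1 keeps the sequence length unchanged.
    conv_len = max_words
    conv = kernel_size * embed_dim * filters + filters   # kernels plus biases
    # Non-overlapping pooling divides the length by pool_size.
    pooled_len = conv_len // pool_size
    flat_width = pooled_len * filters
    dense_3 = flat_width * hidden + hidden
    dense_4 = hidden * 1 + 1
    return {
        "conv_len": conv_len,
        "pooled_len": pooled_len,
        "flat_width": flat_width,
        "embedding": embedding,
        "conv": conv,
        "dense_3": dense_3,
        "dense_4": dense_4,
        "total": embedding + conv + dense_3 + dense_4,
    }

info = cnn_shapes_and_params()
print(info["conv"], info["flat_width"], info["total"])  # 3104 8000 2163605
```

Pooling halves the flattened width from 16,000 to 8,000, which is why this model has fewer parameters than the dense-only model even though its sequences are longer.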

In [None]:
# Fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=5, batch_size=128, verbose=2)
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 25000 samples, validate on 25000 samples
Epoch 1/5
 - 6s - loss: 0.4353 - accuracy: 0.7743 - val_loss: 0.2747 - val_accuracy: 0.8853
Epoch 2/5
 - 2s - loss: 0.2074 - accuracy: 0.9189 - val_loss: 0.2962 - val_accuracy: 0.8749
Epoch 3/5
 - 2s - loss: 0.1413 - accuracy: 0.9490 - val_loss: 0.3017 - val_accuracy: 0.8812
Epoch 4/5
 - 2s - loss: 0.0718 - accuracy: 0.9788 - val_loss: 0.3762 - val_accuracy: 0.8745
Epoch 5/5
 - 2s - loss: 0.0233 - accuracy: 0.9961 - val_loss: 0.4695 - val_accuracy: 0.8730
Accuracy: 87.30%
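
The log shows training accuracy climbing to 99.6% while validation loss rises after the first epoch, which is the usual sign of overfitting. The sketch below copies the logged numbers into a list (named `log_rows` here, an illustrative choice) and finds the epoch with the lowest validation loss.

```python
# (epoch, loss, accuracy, val_loss, val_accuracy), copied from the log above.
log_rows = [
    (1, 0.4353, 0.7743, 0.2747, 0.8853),
    (2, 0.2074, 0.9189, 0.2962, 0.8749),
    (3, 0.1413, 0.9490, 0.3017, 0.8812),
    (4, 0.0718, 0.9788, 0.3762, 0.8745),
    (5, 0.0233, 0.9961, 0.4695, 0.8730),
]

# Epoch with the lowest validation loss.
best_epoch = min(log_rows, key=lambda row: row[3])[0]

# Gap between training and validation accuracy per epoch.
gaps = [(epoch, round(acc - val_acc, 4))
        for epoch, _, acc, _, val_acc in log_rows]

print(best_epoch)  # 1
print(gaps[-1])    # (5, 0.1231)
```

In Keras, the `EarlyStopping` callback with `monitor='val_loss'` and `restore_best_weights=True` is one way to stop at such a point automatically. Note that this notebook validates on the test set, so a separate validation split would be needed for an unbiased final score.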


## Retrieve the output of each layer in Keras for a given single test sample from the trained model you built

In [None]:
from keras import backend as K

In [None]:
X_test.shape  # the CNN above was trained on the 500-token arrays


(25000, 500)

In [None]:
# model is the CNN built above: Embedding, Conv1D, MaxPooling1D, Flatten, Dense, Dense.
# Each function maps the model input to the output of one layer.
get_1st_layer_output = K.function([model.layers[0].input],
                                  [model.layers[0].output])  # Embedding
get_2nd_layer_output = K.function([model.layers[0].input],
                                  [model.layers[1].output])  # Conv1D
get_3rd_layer_output = K.function([model.layers[0].input],
                                  [model.layers[2].output])  # MaxPooling1D
get_4th_layer_output = K.function([model.layers[0].input],
                                  [model.layers[3].output])  # Flatten
get_5th_layer_output = K.function([model.layers[0].input],
                                  [model.layers[4].output])  # Dense (250 units)
get_6th_layer_output = K.function([model.layers[0].input],
                                  [model.layers[5].output])  # Dense (sigmoid output)


In [None]:
# use X_test (padded to 500, vocab 5000) so the sample matches this model's input
layer_output1 = get_1st_layer_output([X_test[0:1]])[0]
layer_output1

array([[[ 0.00275071,  0.02683793, -0.02008983, ...,  0.01048011,
          0.02925541,  0.01813565],
        [ 0.00275071,  0.02683793, -0.02008983, ...,  0.01048011,
          0.02925541,  0.01813565],
        [ 0.00275071,  0.02683793, -0.02008983, ...,  0.01048011,
          0.02925541,  0.01813565],
        ...,
        [ 0.06443904,  0.00960596, -0.01097517, ..., -0.03039504,
         -0.02569287, -0.06392781],
        [-0.03337857,  0.02366957,  0.00863247, ..., -0.02003787,
         -0.00442734,  0.00600675],
        [-0.11843377, -0.01231405,  0.06487207, ..., -0.03019277,
         -0.03232521, -0.03293739]]], dtype=float32)

In [None]:
# use X_test (padded to 500, vocab 5000) so the sample matches this model's input
layer_output2 = get_2nd_layer_output([X_test[0:1]])[0]
layer_output2

array([[[0.        , 0.        , 0.        , ..., 0.        ,
         0.        , 0.        ],
        [0.        , 0.        , 0.        , ..., 0.        ,
         0.        , 0.        ],
        [0.        , 0.        , 0.        , ..., 0.        ,
         0.        , 0.        ],
        ...,
        [0.        , 0.        , 0.        , ..., 0.        ,
         0.00201868, 0.        ],
        [0.        , 0.0979745 , 0.        , ..., 0.02456164,
         0.04902659, 0.03882478],
        [0.07246035, 0.        , 0.        , ..., 0.        ,
         0.028593  , 0.08754403]]], dtype=float32)

In [None]:
# use X_test (padded to 500, vocab 5000) so the sample matches this model's input
layer_output3 = get_3rd_layer_output([X_test[0:1]])[0]
layer_output3

array([[[0.        , 0.        , 0.        , ..., 0.        ,
         0.        , 0.        ],
        [0.        , 0.        , 0.        , ..., 0.        ,
         0.        , 0.        ],
        [0.        , 0.        , 0.        , ..., 0.        ,
         0.        , 0.        ],
        ...,
        [0.        , 0.06881598, 0.        , ..., 0.01497562,
         0.03681341, 0.09962009],
        [0.05837861, 0.        , 0.        , ..., 0.        ,
         0.00201868, 0.02185615],
        [0.07246035, 0.0979745 , 0.        , ..., 0.02456164,
         0.04902659, 0.08754403]]], dtype=float32)
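
The last row of this pooled output is the element-wise maximum of the last two rows of the Conv1D output printed earlier. The sketch below checks that using only the six values visible per row (the elided middle values are left out); the helper name `max_pool_1d` is an assumption for illustration.

```python
def max_pool_1d(rows, pool_size=2):
    """Element-wise max over consecutive, non-overlapping groups of rows."""
    pooled = []
    for start in range(0, len(rows) - pool_size + 1, pool_size):
        window = rows[start:start + pool_size]
        # zip(*window) walks the window column by column.
        pooled.append([max(column) for column in zip(*window)])
    return pooled

# Visible values of the last two Conv1D output rows shown above.
conv_rows = [
    [0.0, 0.0979745, 0.0, 0.02456164, 0.04902659, 0.03882478],
    [0.07246035, 0.0, 0.0, 0.0, 0.028593, 0.08754403],
]

print(max_pool_1d(conv_rows))
# [[0.07246035, 0.0979745, 0.0, 0.02456164, 0.04902659, 0.08754403]]
```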

In [None]:
# use X_test (padded to 500, vocab 5000) so the sample matches this model's input
layer_output4 = get_4th_layer_output([X_test[0:1]])[0]
layer_output4

array([[0.        , 0.        , 0.        , ..., 0.02456164, 0.04902659,
        0.08754403]], dtype=float32)
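
Flatten concatenates the pooled rows in order, which is why the flattened vector ends with the same values as the last pooled row (0.02456164, 0.04902659, 0.08754403). A minimal sketch with made-up numbers, using a hypothetical helper `flatten_rows`:

```python
def flatten_rows(rows):
    """Concatenate rows in order, as Flatten does for a single sample."""
    return [value for row in rows for value in row]

# Hypothetical pooled output: 3 positions, 2 channels each.
pooled = [[0.0, 0.1], [0.2, 0.3], [0.4, 0.5]]
flat = flatten_rows(pooled)

print(flat)                     # [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
print(flat[-2:] == pooled[-1])  # True
```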