# Sentiment Classification


## Loading the dataset

In [1]:
from keras.datasets import imdb

vocab_size = 10000  # keep only the 10,000 most frequent words

# Words are indexed by frequency rank; words rarer than vocab_size are replaced by the out-of-vocabulary id.
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)

Using TensorFlow backend.


Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz


In [0]:
from keras.preprocessing.sequence import pad_sequences
vocab_size = 10000  # vocabulary size
maxlen = 300  # number of words kept from each review

## Train/test split and padding

In [0]:
# Load the dataset; each review is a list of integer word ids
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
# Pad or truncate every review to exactly maxlen ids
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test = pad_sequences(x_test, maxlen=maxlen)
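To make the effect of `pad_sequences` concrete, here is a minimal pure-Python sketch of its default behaviour (`padding='pre'`, `truncating='pre'`, `value=0`). The helper name `pad_pre` and the toy id lists are illustrative assumptions, not part of Keras or the dataset.

```python
def pad_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` ids and left-pad with `value` (assumes maxlen > 0)."""
    trimmed = seq[-maxlen:]  # truncating='pre' drops ids from the front
    return [value] * (maxlen - len(trimmed)) + trimmed  # padding='pre' pads at the front

short_review = [1, 14, 22]                # hypothetical review shorter than maxlen
long_review = [1, 14, 22, 16, 43, 530]    # hypothetical review longer than maxlen

padded_short = pad_pre(short_review, 5)
padded_long = pad_pre(long_review, 5)
print(padded_short)  # [0, 0, 1, 14, 22]
print(padded_long)   # [14, 22, 16, 43, 530]
```

Padding at the front means the last words of a review sit next to the final LSTM step, which is why the decoded review in the next section begins with a run of `<PAD>` tokens.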

## Creating the word index and the id-to-word mapping

In [4]:
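word_id = imdb.get_word_index()

# load_data() shifts every index by 3 to reserve ids 0-2 for special tokens
word_id = {k: (v + 3) for k, v in word_id.items()}
word_id["<PAD>"] = 0
word_id["<START>"] = 1
word_id["<UNK>"] = 2

# Reverse mapping, used to decode a review back into words
id_word = {value: key for key, value in word_id.items()}
print(' '.join(id_word[i] for i in x_train[0]))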

Downloading data from https://s3.amazonaws.com/text-datasets/imdb_word_index.json
<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so muc
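The index bookkeeping above can be tried on a toy vocabulary. In this sketch `raw_index` is a made-up stand-in for `imdb.get_word_index()`, and `encode` / `decode` are hypothetical helpers; only the offset of 3 and the reserved ids 0, 1 and 2 are taken from the cell above.

```python
# Toy frequency-ranked index standing in for imdb.get_word_index() (assumption)
raw_index = {"the": 1, "movie": 2, "boring": 3}

# Shift by 3 and reserve ids 0-2, as in the cell above
toy_word_id = {w: i + 3 for w, i in raw_index.items()}
toy_word_id.update({"<PAD>": 0, "<START>": 1, "<UNK>": 2})
toy_id_word = {i: w for w, i in toy_word_id.items()}

def encode(text):
    """Prefix <START> and map unknown words to <UNK> (id 2)."""
    return [1] + [toy_word_id.get(w, 2) for w in text.lower().split()]

def decode(ids):
    """Turn a list of ids back into a space-separated string."""
    return " ".join(toy_id_word.get(i, "<UNK>") for i in ids)

encoded = encode("the movie was boring")
decoded = decode(encoded)
print(encoded)  # [1, 4, 5, 2, 6]
print(decoded)  # <START> the movie <UNK> boring
```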

## Build Keras Embedding Layer Model
We can think of the Embedding layer as a dictionary that maps the integer index assigned to a word to a word vector. This layer is very flexible and can be used in a few ways:

* The embedding layer can be used as the first layer of a larger deep learning model.
* We can load pre-trained word embeddings into the embedding layer when we create the model.
* We can train the embedding layer to learn our own word vectors, in the spirit of word2vec.

The Keras Embedding layer doesn't require us to one-hot encode our words; instead, we give each word a unique integer id. For the IMDB dataset we've loaded this has already been done, but if it weren't, we could use scikit-learn's [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).
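The "dictionary" view of the embedding layer can be sketched without Keras. The snippet below is an illustration under assumed toy sizes, not the layer's actual implementation: `table` plays the role of the trainable weight matrix, and the lookup is compared with the equivalent one-hot multiplication that the integer ids let us skip.

```python
import math
import random

random.seed(0)
toy_vocab_size, toy_embedding_length = 6, 4  # toy sizes (assumption)

# One row of weights per word id; Keras would learn these during training
table = [[random.uniform(-0.05, 0.05) for _ in range(toy_embedding_length)]
         for _ in range(toy_vocab_size)]

def embed(ids):
    """Embedding lookup: pick the row of the table for each id."""
    return [table[i] for i in ids]

def one_hot_matmul(i):
    """Same result via an explicit one-hot vector times the table."""
    one_hot = [1.0 if j == i else 0.0 for j in range(toy_vocab_size)]
    return [sum(one_hot[j] * table[j][k] for j in range(toy_vocab_size))
            for k in range(toy_embedding_length)]

vectors = embed([3, 0, 3])
same = all(math.isclose(a, b) for a, b in zip(vectors[0], one_hot_matmul(3)))
print(len(vectors), len(vectors[0]), same)  # 3 4 True
```

The lookup touches one row per word, whereas the one-hot route multiplies by a vector that is almost entirely zeros, which is why integer ids are the cheaper input format.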

In [0]:
from keras.models import Sequential
from keras.layers import Embedding, Dense, LSTM

In [6]:
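embedding_length = 32  # size of each word vector
MODEL = Sequential()
MODEL.add(Embedding(vocab_size, embedding_length, input_length=maxlen))
MODEL.add(LSTM(100))  # 100 LSTM units read the embedded sequence
MODEL.add(Dense(1, activation='sigmoid'))  # single score in (0, 1)
print(MODEL.summary())  # summary() prints the table and returns None, so "None" is printed as well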

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 300, 32)           320000    
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
Total params: 373,301
Trainable params: 373,301
Non-trainable params: 0
_________________________________________________________________
None
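The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights and a bias; the variable names are illustrative.

```python
vocab_size, embedding_length, lstm_units = 10000, 32, 100

# Embedding: one vector of length 32 per word id
embedding_params = vocab_size * embedding_length

# LSTM: 4 gates, each with (input + recurrent) weights plus a bias per unit
lstm_params = 4 * ((embedding_length + lstm_units) * lstm_units + lstm_units)

# Dense: one weight per LSTM unit plus one bias
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 320000 53200 101 373301
```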


In [0]:
MODEL.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) 
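As a rough illustration of what the `binary_crossentropy` loss measures, the sketch below computes the per-sample value -[y log p + (1 - y) log(1 - p)] in plain Python. The function name `bce` and the example scores are hypothetical; Keras additionally clips predictions away from 0 and 1 and averages over the batch.

```python
import math

def bce(y_true, p):
    """Binary cross-entropy for one sample with label y_true and predicted score p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

confident_right = bce(1, 0.9)  # positive review scored 0.9
confident_wrong = bce(1, 0.1)  # positive review scored 0.1
undecided = bce(1, 0.5)        # a score of 0.5 costs log(2)

print(confident_right < undecided < confident_wrong)  # True
```

The loss grows sharply as a confident prediction turns out to be wrong, which pushes the sigmoid output toward the correct end of the 0 to 1 range.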

In [8]:
# Note: the test set is reused as validation data here
MODEL.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=5, batch_size=32)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 25000 samples, validate on 25000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.callbacks.History at 0x7fdd908cbf50>

## Printing the accuracy

In [11]:
scores = MODEL.evaluate(x_test, y_test)  # returns [loss, accuracy]
print("Accuracy = %.2f%%" % (scores[1] * 100))

Accuracy = 86.47%


## Retrieve the output of each layer in Keras for a given single test sample from the trained model you built

In [12]:
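from keras import backend as K

outputs = []
for layer in MODEL.layers:
    # Build a backend function mapping the model input to this layer's output
    keras_function = K.function([MODEL.input], [layer.output])
    # The whole test set is passed here; use x_test[:1] to inspect a single sample
    outputs.append(keras_function([x_test]))
print(outputs)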

[[array([[[ 0.03975408, -0.04225037,  0.04976373, ...,  0.03796664,
         -0.00161235, -0.00494425],
        [ 0.03975408, -0.04225037,  0.04976373, ...,  0.03796664,
         -0.00161235, -0.00494425],
        [ 0.03975408, -0.04225037,  0.04976373, ...,  0.03796664,
         -0.00161235, -0.00494425],
        ...,
        [-0.01951145,  0.00069703, -0.03557115, ..., -0.03832467,
         -0.04537717,  0.03077606],
        [-0.01228778,  0.00046838, -0.01154031, ..., -0.00282057,
          0.02917405,  0.03288529],
        [ 0.14674443, -0.03639597,  0.04794239, ...,  0.01799275,
          0.11272208,  0.12595624]],

       [[ 0.03975408, -0.04225037,  0.04976373, ...,  0.03796664,
         -0.00161235, -0.00494425],
        [ 0.03975408, -0.04225037,  0.04976373, ...,  0.03796664,
         -0.00161235, -0.00494425],
        [ 0.03975408, -0.04225037,  0.04976373, ...,  0.03796664,
         -0.00161235, -0.00494425],
        ...,
        [-0.0408451 , -0.02523319, -0.06308564, ...,

## Prediction of a test sample

In [15]:
for review in [x_test[1000]]:
    # x_test is already padded to maxlen, so only the batch dimension is added
    review_batch = review.reshape(1, -1)
    print(" Sentiment: %s" % MODEL.predict(review_batch)[0][0])


### The model predicts a sentiment value between 0.0 and 1.0.
### Sentiment value close to 0.0: the review is negative
### Sentiment value close to 1.0: the review is positive
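Turning a score into a label needs a cut-off. The sketch below assumes the conventional threshold of 0.5; the helper `label_for` and the example scores are hypothetical and not part of the notebook.

```python
def label_for(score, threshold=0.5):
    """Map a sigmoid score in [0, 1] to a sentiment label (0.5 cut-off is an assumption)."""
    return "Positive" if score >= threshold else "Negative"

example_scores = [0.2, 0.9, 0.5]  # made-up scores for illustration
labels = [label_for(s) for s in example_scores]
print(labels)  # ['Negative', 'Positive', 'Positive']
```

Scores near 0.5 carry little information either way, so a stricter application could treat a band around the threshold as "uncertain" instead of forcing a label.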

## Prediction on user input

In [26]:
negative = "the movie is very boring"
positive = "a good movie very entertaining"
for review in [negative, positive]:
    temp = []
    for word in review.split(" "):
        idx = word_id.get(word, 2)  # words missing from the index map to <UNK>
        temp.append(idx if idx < vocab_size else 2)  # so do words outside the model's vocabulary
    temp_padded = pad_sequences([temp], maxlen=maxlen)
    print("%s. Sentiment: %s" % (review, MODEL.predict(temp_padded)[0][0]))

the movie is very boring. Sentiment: 0.19278501
a good movie very entertaining. Sentiment: 0.8923264


### Sentiment value: 0.19, hence the review is negative
### Sentiment value: 0.89, hence the review is positive