# Sentiment Classification


## Loading the dataset

In [0]:
from keras.datasets import imdb

vocab_size = 10000 #vocab size

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size) # vocab_size is no.of words to consider from the dataset, ordering based on frequency.

Using TensorFlow backend.


Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz


In [0]:
from keras.preprocessing.sequence import pad_sequences
vocab_size = 10000 #vocab size
maxlen = 300  #number of word used from each review

## Train test split

In [0]:
#load dataset as a list of ints
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
#make all sequences of the same length
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test =  pad_sequences(x_test, maxlen=maxlen)

In [0]:
x_train[0]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    1,   14,   22,   16,   43,  530,
        973, 1622, 1385,   65,  458, 4468,   66, 3941,    4,  173,   36,
        256,    5,   25,  100,   43,  838,  112,   50,  670,    2,    9,
         35,  480,  284,    5,  150,    4,  172,  112,  167,    2,  336,
        385,   39,    4,  172, 4536, 1111,   17,  546,   38,   13,  447,
          4,  192,   50,   16,    6,  147, 2025,   19,   14,   22,    4,
       1920, 4613,  469,    4,   22,   71,   87,   

## Create the word index and a key-value pair for word and word_id.

In [0]:
word_to_id = imdb.get_word_index()
word_to_id = {k:(v+3) for k,v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2
id_to_word = {value:key for key,value in word_to_id.items()}
print(' '.join(id_to_word[id] for id in x_train[0] ))

Downloading data from https://s3.amazonaws.com/text-datasets/imdb_word_index.json
<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so muc

## Build Keras Embedding Layer Model
We can think of the Embedding layer as a dicionary that maps a index assigned to a word to a word vector. This layer is very flexible and can be used in a few ways:

* The embedding layer can be used at the start of a larger deep learning model. 
* Also we could load pre-train word embeddings into the embedding layer when we create our model.
* Use the embedding layer to train our own word2vec models.

The keras embedding layer doesn't require us to onehot encode our words, instead we have to give each word a unqiue intger number as an id. For the imdb dataset we've loaded this has already been done, but if this wasn't the case we could use sklearn [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).

In [0]:
from keras.models import Sequential
from keras.layers import Embedding, Dense, LSTM

In [0]:
embedding_vector_length = 32 
model = Sequential() 
model.add(Embedding(vocab_size, embedding_vector_length, input_length=maxlen)) 
model.add(LSTM(100)) 
model.add(Dense(1, activation='sigmoid')) 
model.compile(loss='binary_crossentropy',optimizer='adam', metrics=['accuracy']) 
print(model.summary()) 

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 300, 32)           320000    
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
Total params: 373,301
Trainable params: 373,301
Non-trainable params: 0
_________________________________________________________________
None


In [0]:
model.fit(x_train, y_train, validation_data=(x_test, y_test), nb_epoch=3, batch_size=64) 

  """Entry point for launching an IPython kernel.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 25000 samples, validate on 25000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.callbacks.History at 0x7efbfa2bbd50>

## Accuracy of the model.

In [0]:
scores = model.evaluate(x_test, y_test, verbose=0) 
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 87.02%


## Retrive the output of each layer in keras for a given single test sample from the trained model you built

In [0]:
 from keras import backend as K

In [0]:
outputs = []
for layer in model.layers:
    keras_function = K.function([model.input], [layer.output])
    outputs.append(keras_function([x_test, 1]))
print(outputs)

[[array([[[-0.00777039,  0.02623438,  0.01142781, ...,  0.0289028 ,
         -0.02578218, -0.00926467],
        [-0.00777039,  0.02623438,  0.01142781, ...,  0.0289028 ,
         -0.02578218, -0.00926467],
        [-0.00777039,  0.02623438,  0.01142781, ...,  0.0289028 ,
         -0.02578218, -0.00926467],
        ...,
        [-0.03498746, -0.04443849, -0.03118311, ..., -0.03166886,
          0.00268514,  0.02311075],
        [-0.00784201,  0.02104096, -0.00880034, ...,  0.04326359,
         -0.04325129, -0.03879954],
        [-0.05686688, -0.03734658,  0.00725429, ..., -0.04969543,
         -0.01082605, -0.02246371]],

       [[-0.00777039,  0.02623438,  0.01142781, ...,  0.0289028 ,
         -0.02578218, -0.00926467],
        [-0.00777039,  0.02623438,  0.01142781, ...,  0.0289028 ,
         -0.02578218, -0.00926467],
        [-0.00777039,  0.02623438,  0.01142781, ...,  0.0289028 ,
         -0.02578218, -0.00926467],
        ...,
        [ 0.01601357, -0.04824288, -0.01251326, ...,

## Prediction from the model.

### Prediction from the test set.

In [0]:
for review in [x_test[10]]:
    tmp = []
    tmp_padded = pad_sequences([tmp], maxlen=maxlen) 
    print(" Sentiment: %s" % (model.predict(([tmp_padded][0]))[0][0]))

 Sentiment: 0.2730021


#### Here we can see that our model is generating a value for the sentiment.
#### If the sentiment value is closer to 0 : The review is classified as a negative review. 
#### If the sentiment value is closer to 1 : The review is classified as a positive review. 

### Prediction for custom examples.

In [0]:
bad = "this movie was terrible and bad"
good = "i really liked the movie and had fun"
for review in [good,bad]:
    tmp = []
    for word in review.split(" "):
        tmp.append(word_to_id[word])
    tmp_padded = pad_sequences([tmp], maxlen=maxlen) 
    print("%s. Sentiment: %s" % (review,model.predict(([tmp_padded][0]))[0][0]))

i really liked the movie and had fun. Sentiment: 0.7557901
this movie was terrible and bad. Sentiment: 0.03472362


### Here we can see that the positive reviews sentiment value is 0.755 , which is close to 1.
### Here we can see that the negative reviews sentiment value is 0.034 , which is close to 0.

## THANK YOU.