# This demonstrates the following:
How to load the IMDB data
How the loader converts words into integer indexes
How words are indexed from most frequent to least frequent
How to restrict the vocabulary to the top 5000 or 10000 words
An idiosyncrasy of imdb.get_word_index()
How to rebuild the index to account for the padding, start and out-of-vocabulary tokens
What an embedding layer is
How to create an RNN model
How to train the sentiment analyzer
How to predict the sentiment of sample sentences
This also uses more advanced Keras features such as callbacks, and loads the best saved weights.

In [1]:
import keras
from keras.datasets import imdb
from keras.layers import Dense,  Flatten

Using TensorFlow backend.


In [2]:
print (keras.__version__)

2.2.4


In [2]:
vocabulary_size = 10000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words = vocabulary_size)
print('IMDB dataset  {} training samples, {} test samples'.format(len(X_train), len(X_test)))

IMDB dataset  25000 training samples, 25000 test samples


In [3]:
w2id = imdb.get_word_index() 
# get_word_index() returns a dictionary mapping each word to its frequency rank (1 = most frequent).
# It does not contain special tokens such as "start" or "unknown", so we need to correct for that.
id2word = {i: word  for word, i in w2id.items()}
print (len(w2id), "id2Word[1] =", id2word[1], "w2id[the]" ,w2id["the"])
print ( "id2Word[2] =", id2word[2], "w2id[and]" ,w2id["and"])
print ("Least frequently ocurring word is at index 88583 and it is = ", id2word[88583])

88584 id2Word[1] = the w2id[the] 1
id2Word[2] = and w2id[and] 2
Least frequently ocurring word is at index 88583 and it is =  voorhees'
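
The frequency ranking seen above can be illustrated on a toy corpus. This is a minimal sketch, not the way Keras actually builds its index: `build_word_index` and the sample sentences are hypothetical, and breaking ties alphabetically is an assumption made only for this example.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency: the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    # Sort by descending count; ties are broken alphabetically (assumption of this sketch)
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}

toy_texts = ["the movie was good", "the movie was bad", "the end"]  # made-up corpus
toy_w2id = build_word_index(toy_texts)
toy_id2word = {i: w for w, i in toy_w2id.items()}
print(toy_w2id["the"], toy_id2word[1])
```

As in the real index, there is no entry for index 0 and no special tokens; only words that occur in the corpus are ranked.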


In [4]:
# Decode review 8 with the raw word index: the result is gibberish (the reason is explained below)
actual_data = []
for i in range (len(X_train[8])):
    ind = X_train[8][i]
    if(ind != 0):
        actual_data.append(id2word[ind])
actual_data = " ".join (actual_data )
print (actual_data, "len=", len(actual_data),"\nAnd sentiment is ", y_train[8])
print (X_train[8], "len=", len(X_train[8]),"\nAnd sentiment is ", y_train[8])

the out can't some to hit looking when is themselves novel as it theatre people have former some br is on dull this should rick to and thru characters one lost her show jews for of brosnan br realised back video excellent to plenty theatre involved and and off as in we no that how moments some to as on it early in we years was nothing does is 10 has as new part first that to when is possibly what have whatever of positive have 2 also of slow for as it by br though and characters has so heart broad europe film show of try making to looks in one want key bodies br when is him their that it christmas not them his dancer want this and as it his for on of great stupid of their as end should is very together br twilight us writer br were machines go bad aka were young to might us father film any more it her get predictable of alleged for find dvd from through i'm her get up happily date of lionel with other is him pushed has columbo criticized to what's of thief this so superior too not as y

# In this example IMDB loads the dataset in a peculiar way. The loaded sequences reserve index 0 for padding, index 1 for the "start" token and index 2 for the "unknown" (out-of-vocabulary) token. The dictionary returned by get_word_index(), however, stores only actual words, ranked by frequency: the most frequent word ("the") has index 1, the next most frequent has index 2, and so on. The "start" and "unknown" tokens are not stored in it. As a result, each word index in a training row equals the dictionary index + 3. Hence we make that correction before decoding.

In [5]:
# So we build a new dictionary with every index offset by 3, leaving 0, 1 and 2 free for the padding, start and out-of-vocabulary tokens
word2id ={w: i+3 for w, i in w2id.items()}
word2id["__PADDING__"] = 0
word2id["__START__"] = 1
word2id["__UNK__"] = 2

# w2id ranks the words from 1 to n, with 1 being the most frequently occurring word
# and n the least frequently occurring word; word2id shifts every rank by 3

print (type(word2id), len (word2id))

id2word = {i: word  for word, i in word2id.items()}
# Index 3 is unused after the offset, so id2word[3] is skipped
print(id2word[1], id2word[2],
      id2word[4], id2word[5], id2word[6], id2word[88586])
actual_data = []
for i in range (len(X_train[8])):
    ind = X_train[8][i]
    if(ind != 0):
        actual_data.append(id2word[ind])
actual_data = " ".join (actual_data )
print (actual_data, "\nAnd sentiment is ", y_train[8])
print (word2id["the"])

<class 'dict'> 88587
__START__ __UNK__ the and a voorhees'
__START__ just got out and cannot believe what a brilliant documentary this is rarely do you walk out of a movie theater in such awe and __UNK__ lately movies have become so over hyped that the thrill of discovering something truly special and unique rarely happens __UNK__ __UNK__ did this to me when it first came out and this movie is doing to me now i didn't know a thing about this before going into it and what a surprise if you hear the concept you might get the feeling that this is one of those __UNK__ movies about an amazing triumph covered with over the top music and trying to have us fully convinced of what a great story it is telling but then not letting us in __UNK__ this is not that movie the people tell the story this does such a good job of capturing every moment of their involvement while we enter their world and feel every second with them there is so much beyond the climb that makes everything they go through so 
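
Going the other way, from words to indexes, follows the same convention. The sketch below is an assumption-laden illustration rather than the loader's own code: `encode_review` is a hypothetical helper, and `toy_rank` holds only a handful of ranks (those of "the", "and", "a" and "movie", consistent with the ids printed in this notebook).

```python
PAD, START, UNK = 0, 1, 2   # reserved indexes
INDEX_FROM = 3              # offset added to every dictionary rank

def encode_review(words, word_rank, num_words):
    """Encode a tokenised review using the start token, the offset and the vocabulary cut-off."""
    encoded = [START]
    for w in words:
        rank = word_rank.get(w)
        if rank is None or rank + INDEX_FROM >= num_words:
            encoded.append(UNK)                 # unseen word, or outside the cut-off
        else:
            encoded.append(rank + INDEX_FROM)
    return encoded

toy_rank = {"the": 1, "and": 2, "a": 3, "movie": 17}
# With a tiny cut-off of 10, "movie" (17 + 3 = 20) falls outside the vocabulary
print(encode_review(["the", "movie", "and", "zebra"], toy_rank, num_words=10))
# With a larger cut-off it keeps its own index
print(encode_review(["the", "movie", "and", "zebra"], toy_rank, num_words=100))
```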

In [6]:
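# Note: X_train and X_test are NumPy object arrays of lists, so "+" concatenates the
# i-th training review with the i-th test review instead of joining the two datasets.
# The figure printed is therefore the longest combined pair, not the longest single review.
print('Maximum review length: {}'.format(
len((max((X_train + X_test), key=len)))))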

Maximum review length: 2697


In [7]:
from keras.preprocessing import sequence
max_words = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)
modelCNN = None
modelRNN=None
modelLSTM=None
print ("After Padding X_train[8]=\n", X_train[8])

After Padding X_train[8]=
 [   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0
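
The leading zeros come from padding at the front of each sequence, which is the default behaviour of pad_sequences (padding='pre', truncating='pre'). A minimal pure-Python sketch of that behaviour for a single sequence is shown here; `pad_pre` is a hypothetical helper and assumes maxlen is at least 1.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if the sequence is too long."""
    seq = list(seq)[-maxlen:]                    # 'pre' truncation drops items from the front
    return [value] * (maxlen - len(seq)) + seq   # 'pre' padding adds the fill value at the front

print(pad_pre([1, 4, 20, 16], maxlen=6))             # short review: zeros are prepended
print(pad_pre([1, 4, 20, 16, 357, 9, 8], maxlen=6))  # long review: the beginning is cut off
```

Note that a truncated review loses its first words, including the start token.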

In [8]:
from keras import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout

In [9]:
#Now create a simple RNN model and let's see the accuracy
from keras.layers import SimpleRNN
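
Before building the Keras model, it may help to see what an embedding layer followed by a simple RNN computes. The sketch below is a toy forward pass in plain Python, not the Keras implementation: the sizes, the random weights and the name `simple_rnn` are all assumptions made for illustration. The embedding is a lookup table with one vector per word index, and the recurrence is h_t = tanh(x_t W + h_{t-1} U + b).

```python
import math
import random

random.seed(0)
vocab, emb_dim, units = 6, 3, 2          # toy sizes, far smaller than in this notebook

# Embedding layer: a trainable table with one row (vector) per word index
embedding = [[random.uniform(-0.1, 0.1) for _ in range(emb_dim)] for _ in range(vocab)]
W = [[random.uniform(-0.5, 0.5) for _ in range(units)] for _ in range(emb_dim)]  # input weights
U = [[random.uniform(-0.5, 0.5) for _ in range(units)] for _ in range(units)]    # recurrent weights
b = [0.0] * units                                                                # bias

def simple_rnn(indexes):
    """Return the final hidden state after reading the index sequence left to right."""
    h = [0.0] * units
    for idx in indexes:
        x = embedding[idx]               # embedding lookup: index -> dense vector
        h = [math.tanh(sum(x[i] * W[i][j] for i in range(emb_dim))
                       + sum(h[k] * U[k][j] for k in range(units))
                       + b[j])
             for j in range(units)]
    return h

print(simple_rnn([1, 4, 5, 2]))
```

In the real model the final hidden state is passed to a Dense layer with a sigmoid activation to produce the sentiment probability.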

In [10]:
embedding_size=50
modelRNN=Sequential()
modelRNN.add(Embedding(vocabulary_size, embedding_size, input_length=max_words))
modelRNN.add(SimpleRNN(150,activation = "tanh")) 
modelRNN.add(Dense(1, activation='sigmoid'))
print(modelRNN.summary())




_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 500, 50)           500000    
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 150)               30150     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 151       
Total params: 530,301
Trainable params: 530,301
Non-trainable params: 0
_________________________________________________________________
None
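
The parameter counts in the summary can be checked by hand. The arithmetic below uses the sizes from this notebook; the variable name `rnn_units` is introduced here only for the calculation.

```python
vocabulary_size, embedding_size, rnn_units = 10000, 50, 150

# Embedding: one vector of length embedding_size per word index
embedding_params = vocabulary_size * embedding_size
# SimpleRNN: input weights + recurrent weights + one bias per unit
rnn_params = rnn_units * (embedding_size + rnn_units + 1)
# Dense: one weight per RNN unit plus a bias, for a single sigmoid output
dense_params = rnn_units * 1 + 1

total_params = embedding_params + rnn_params + dense_params
print(embedding_params, rnn_params, dense_params, total_params)
```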


In [11]:
modelRNN.compile(loss='binary_crossentropy', 
             optimizer='adam', 
             metrics=['accuracy'])





In [12]:
batch_size = 256
num_epochs = 20
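# Hold out the first batch_size (256) training samples as a small validation set
X_valid, y_valid = X_train[:batch_size], y_train[:batch_size]
X_train2, y_train2 = X_train[batch_size:], y_train[batch_size:]
# Checkpoint the model whenever the validation accuracy improves
callback_listRNN = [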
                keras.callbacks.ModelCheckpoint(filepath="my_modRNN_BestValAcc.h5", monitor="val_acc",
                                               save_best_only=True),
                #keras.callbacks.TerminateOnNaN()
                ]
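
With save_best_only=True, the checkpoint file is written only at epochs where the monitored value improves on the best seen so far. The following sketch simulates that rule in plain Python; `epochs_that_save` is a hypothetical helper and the validation accuracies are made-up numbers, not results from this notebook.

```python
def epochs_that_save(val_acc_history):
    """Return the 1-based epochs at which a save-best-only checkpoint would write."""
    best = float("-inf")
    saved = []
    for epoch, val_acc in enumerate(val_acc_history, start=1):
        if val_acc > best:          # accuracy is maximised, so higher is better
            best = val_acc
            saved.append(epoch)     # the real callback would overwrite the file here
    return saved

hypothetical_val_acc = [0.61, 0.70, 0.68, 0.74, 0.74, 0.73]   # made-up values
print(epochs_that_save(hypothetical_val_acc))
```

A tie with the current best does not trigger a save, so the file holds the weights from the first epoch that reached the best value.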

In [13]:
modelRNN.fit(X_train2, y_train2, validation_data=(X_valid, y_valid), batch_size=batch_size, epochs=num_epochs, 
             callbacks=callback_listRNN)
modelRNN.save_weights("my_modRNN_Latest")




Train on 24744 samples, validate on 256 samples
Epoch 1/20
...
Epoch 20/20


In [14]:
scoresRNN = modelRNN.evaluate(X_test, y_test, verbose=0)
print('Test accuracy:', scoresRNN[1])

Test accuracy: 0.74372


In [15]:
#Loading the last epoch model
modelRNN=Sequential()
modelRNN.add(Embedding(vocabulary_size, embedding_size, input_length=max_words))
modelRNN.add(SimpleRNN(150,activation = "tanh")) 
modelRNN.add(Dense(1, activation='sigmoid'))
print(modelRNN.summary())
modelRNN.compile(loss='binary_crossentropy', 
             optimizer='adam', 
             metrics=['accuracy'])
modelRNN.load_weights("my_modRNN_Latest")

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 500, 50)           500000    
_________________________________________________________________
simple_rnn_2 (SimpleRNN)     (None, 150)               30150     
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 151       
Total params: 530,301
Trainable params: 530,301
Non-trainable params: 0
_________________________________________________________________
None


In [16]:
bestModel=Sequential()
bestModel.add(Embedding(vocabulary_size, embedding_size, input_length=max_words))
bestModel.add(SimpleRNN(150,activation = "tanh")) 
bestModel.add(Dense(1, activation='sigmoid'))
print(bestModel.summary())
bestModel.compile(loss='binary_crossentropy', 
             optimizer='adam', 
             metrics=['accuracy'])
bestModel.load_weights("my_modRNN_BestValAcc.h5")
print (bestModel.evaluate(X_test, y_test, verbose=0))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 500, 50)           500000    
_________________________________________________________________
simple_rnn_3 (SimpleRNN)     (None, 150)               30150     
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 151       
Total params: 530,301
Trainable params: 530,301
Non-trainable params: 0
_________________________________________________________________
None
[0.902381838722229, 0.74372]
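
The list returned by evaluate holds the loss followed by the metrics passed to compile, here binary cross-entropy and accuracy. A small sketch of how the two quantities are computed is given below; the helper functions, labels and probabilities are hypothetical, and the clipping constant `eps` is an assumption of this example.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions that fall on the correct side of the threshold."""
    hits = sum((p > threshold) == bool(y) for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

labels = [1, 0, 1, 0]                 # hypothetical labels
probs = [0.9, 0.2, 0.4, 0.6]          # hypothetical model outputs
print(binary_crossentropy(labels, probs), accuracy(labels, probs))
```

Accuracy only counts which side of the threshold a prediction lands on, whereas the loss also penalises how confident a wrong prediction is.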


In [18]:
reviewList = ["the movie was boring",
              "the movie was not too long",
              "the movie was a total waste of my time",
              "it was a lovely evening",
              "the cat was adorable",
              "such pusilamity was not encountered before"
             ]
print ("Now going to predict these reviews")

Now going to predict these reviews


In [19]:
print (word2id["the"])

4


In [21]:
def PredictSentiment(reviewList, ModelListTuple):
    # Map the thresholded probability to a label
    sentiment = {True: "Positive",
                 False: "Negative"}
    Threshold = 0.5
    for r in reviewList:
        words = r.split()
        review = []
        print ("review=", r)
        for word in words:
            if word not in word2id:
                # Word not present in the word index: encode as __UNK__ (2)
                review.append(2)
                print (word, "Appended 2")
            elif word2id[word] >= vocabulary_size:
                # Word known, but outside the vocabulary the model was trained on: also __UNK__
                print("got a word outside the vocab_index", word, word2id[word], "replacing with unk")
                review.append(2)
            else:
                print ("word=", word, "id=", word2id[word])
                review.append(word2id[word])
        # Pad (or truncate) at the front to max_words, as was done for the training data
        review = keras.preprocessing.sequence.pad_sequences([review],
          truncating='pre', padding='pre', maxlen=max_words)
        # Each entry of ModelListTuple is a (model, display name) pair
        for i, m in enumerate(ModelListTuple):
            if m[0] is not None:
                prediction = m[0].predict(review)
                print("Prediction Probability for ", "\"",r, "\"",ModelListTuple[i][1]," = ", prediction[0][0], "Sentiment=",
                      sentiment[prediction[0][0]>Threshold], "\n")

PredictSentiment(reviewList,[(modelRNN, "RNN"), (bestModel, "Best Val Checkpointed Model")])

review= the movie was boring
word= the id= 4
word= movie id= 20
word= was id= 16
word= boring id= 357
Prediction Probability for  " the movie was boring " RNN  =  0.04350068 Sentiment= Negative 

Prediction Probability for  " the movie was boring " Best Val Checkpointed Model  =  0.04350068 Sentiment= Negative 

review= the movie was not too long
word= the id= 4
word= movie id= 20
word= was id= 16
word= not id= 24
word= too id= 99
word= long id= 196
Prediction Probability for  " the movie was not too long " RNN  =  0.0046611633 Sentiment= Negative 

Prediction Probability for  " the movie was not too long " Best Val Checkpointed Model  =  0.0046611633 Sentiment= Negative 

review= the movie was a total waste of my time
word= the id= 4
word= movie id= 20
word= was id= 16
word= a id= 6
word= total id= 964
word= waste id= 437
word= of id= 7
word= my id= 61
word= time id= 58
Prediction Probability for  " the movie was a total waste of my time " RNN  =  0.03787117 Sentiment= Negative 

Pred
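
PredictSentiment splits each review on whitespace only, so a capitalised word or a word with punctuation attached will miss the dictionary lookup and be encoded as unknown. One possible remedy, sketched here under assumptions, is to normalise the tokens first. `normalise_tokens` and `encode` are hypothetical helpers, and `toy_vocab` holds just the four ids printed in the output above.

```python
import string

def normalise_tokens(text):
    """Lower-case the text and strip punctuation from both ends of each token."""
    tokens = []
    for raw in text.split():
        token = raw.lower().strip(string.punctuation)
        if token:                       # drop tokens that were pure punctuation
            tokens.append(token)
    return tokens

def encode(text, vocab, num_words, unk=2):
    """Look up each normalised token; unknown or out-of-range words become `unk`."""
    ids = []
    for token in normalise_tokens(text):
        idx = vocab.get(token, unk)
        ids.append(idx if idx < num_words else unk)
    return ids

toy_vocab = {"the": 4, "movie": 20, "was": 16, "boring": 357}
print(normalise_tokens("The movie was BORING!"))
print(encode("The movie was BORING!", toy_vocab, num_words=10000))
```

Stripping punctuation from both ends keeps contractions such as "can't" intact, but it would also alter any vocabulary entry that itself ends in an apostrophe, so this is a trade-off rather than a complete tokenizer.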

In [22]:
reviewList = [             "I am unhappy at receiving this request. You are so late in application, \
              that I would have retired by the time I got this mail.  I am not going to allow the amount  \
              (which by the way is a lot at this late date.\
             But I am frustrated at your courtesy",
            "You may have reasons to feel pleased about yourself. You would think that I am happy at \
            receiving this request. Unfortunately you are wrong. Very wrong. As Wrong as sky is black! \
            You are so late in application, \
              that I would have retired by the time I got this mail.  I am not going to allow the amount  \
              (which by the way is a lot at this late date.\
             Don't think that I do not know you are happy. But I am not and thats what matters! "              

             ]
PredictSentiment(reviewList,[(modelRNN, "RNN"), (bestModel, "Best Val Checkpointed Model")])
# We will try this with LSTM and bidirectional models

review= I am unhappy at receiving this request. You are so late in application,               that I would have retired by the time I got this mail.  I am not going to allow the amount                (which by the way is a lot at this late date.             But I am frustrated at your courtesy
I Appended 2
word= am id= 244
word= unhappy id= 4434
word= at id= 33
word= receiving id= 5611
word= this id= 14
request. Appended 2
You Appended 2
word= are id= 26
word= so id= 38
word= late id= 522
word= in id= 11
application, Appended 2
word= that id= 15
I Appended 2
word= would id= 62
word= have id= 28
word= retired id= 5046
word= by id= 34
word= the id= 4
word= time id= 58
I Appended 2
word= got id= 188
word= this id= 14
mail. Appended 2
I Appended 2
word= am id= 244
word= not id= 24
word= going id= 170
word= to id= 8
word= allow id= 1741
word= the id= 4
word= amount id= 1166
(which Appended 2
word= by id= 34
word= the id= 4
word= way id= 96
word= is id= 9
word= a id= 6
word= lot id= 176
word