In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, GRU, Embedding
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [2]:
df=pd.read_csv('IMDB Dataset.csv')

In [3]:
print(df.shape)
df.head()

(50000, 2)


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [4]:
# Encode the labels as integers: 1 for 'positive', 0 for anything else ('negative').
df['sentiment'] = df['sentiment'].apply(lambda x: 1 if x == 'positive' else 0)

In [5]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1


In [6]:
X=df.review
y=df.sentiment

In [7]:
X=np.array(X)
y=np.array(y)

In [8]:
from sklearn.model_selection import train_test_split
x_train_text, x_test_text, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

In [9]:
print(x_train_text[0],y_train[0])
print(x_test_text[0],y_test[0])

"Congo" is based on the best-selling novel by Michael Crichton, which I thought lacked Crichton's usual charm, smart characters and punch. Well, sorry to say, but the same goes for the film.<br /><br />Here's the plot:<br /><br />Greed is bad, this simple morality tale cautions. A megalomaniacal C.E.O. (Joe Don Baker) sends his son into the dangerous African Congo on a quest for a source of diamonds large enough and pure enough to function as powerful laser communications transmitter (or is it laser weapons?). When contact is lost with his son and the team, his daughter-in-law (Laura Linney), a former CIA operative and computer-freak, is sent after them. On her quest, she is accompanied by gee-whiz gadgetry and a few eccentric characters (including a mercenary (Ernie Hudson), a researcher with a talking gorilla (Dylan Walsh), and a a nutty Indiana-Jones-type looking for King Solomon's Mines (Tim Curry). After some narrow escapes from surface-to-air missiles and some African wildlife, t

In [10]:
print("Train-set size: ", len(x_train_text))
print("Test-set size:  ", len(x_test_text))

Train-set size:  25000
Test-set size:   25000


In [11]:
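# x_train_text and x_test_text are NumPy arrays, so `+` would join the strings
# element-wise (25,000 merged reviews). Concatenate to keep all 50,000 reviews.
data_text = np.concatenate([x_train_text, x_test_text])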

In [12]:
x_train_text[1]

'Wow, here it finally is; the action "movie" without action. In a real low-budget setting (don\'t miss the hilarious flying saucers flying by a few times) of a future Seattle we find a no-brain hardbody seeking to avenge her childhood.<br /><br />There is nothing even remotely original or interesting about the plot and the actors\' performance is only rivalled in stupidity by the attempts to steal from other movies, mainly "Matrix" without having the money to do it right. Yes, we do get to see some running on walls and slow motion shoot-outs (45 secs approx.) but these scenes are about as cool as the stupid hardbody\'s attempts at making jokes about male incompetence now and then.<br /><br />And, yes, we are also served a number of leads that lead absolutely nowhere, as if the script was thought-out by the previously unseen cast while shooting the scenes.<br /><br />Believe me, it is as bad as it possibly can get. In fact, it doesn\'t deserve to be taken seriously, but perhaps I can ma

In [13]:
y_train[1]

0

In [14]:
num_words = 10000

In [15]:
tokenizer = Tokenizer(num_words=num_words)

In [16]:
%%time
tokenizer.fit_on_texts(data_text)

Wall time: 14.8 s


In [17]:
if num_words is None:
    num_words = len(tokenizer.word_index)

In [18]:
tokenizer.word_index

{'the': 1,
 'and': 2,
 'a': 3,
 'of': 4,
 'to': 5,
 'is': 6,
 'br': 7,
 'in': 8,
 'it': 9,
 'i': 10,
 'this': 11,
 'that': 12,
 'was': 13,
 'as': 14,
 'for': 15,
 'with': 16,
 'movie': 17,
 'but': 18,
 'film': 19,
 'on': 20,
 'not': 21,
 'you': 22,
 'are': 23,
 'his': 24,
 'have': 25,
 'be': 26,
 'one': 27,
 'he': 28,
 'all': 29,
 'at': 30,
 'by': 31,
 'an': 32,
 'they': 33,
 'so': 34,
 'who': 35,
 'from': 36,
 'like': 37,
 'or': 38,
 'just': 39,
 'her': 40,
 'out': 41,
 'about': 42,
 'if': 43,
 "it's": 44,
 'has': 45,
 'there': 46,
 'some': 47,
 'what': 48,
 'good': 49,
 'when': 50,
 'more': 51,
 'very': 52,
 'up': 53,
 'no': 54,
 'time': 55,
 'my': 56,
 'even': 57,
 'would': 58,
 'she': 59,
 'which': 60,
 'only': 61,
 'really': 62,
 'see': 63,
 'story': 64,
 'their': 65,
 'had': 66,
 'can': 67,
 'me': 68,
 'well': 69,
 'were': 70,
 'than': 71,
 'much': 72,
 'we': 73,
 'bad': 74,
 'been': 75,
 'get': 76,
 'do': 77,
 'great': 78,
 'other': 79,
 'will': 80,
 'also': 81,
 'into': 82,
 'p

In [19]:
x_train_tokens = tokenizer.texts_to_sequences(x_train_text)

In [20]:
x_train_text[1]

'Wow, here it finally is; the action "movie" without action. In a real low-budget setting (don\'t miss the hilarious flying saucers flying by a few times) of a future Seattle we find a no-brain hardbody seeking to avenge her childhood.<br /><br />There is nothing even remotely original or interesting about the plot and the actors\' performance is only rivalled in stupidity by the attempts to steal from other movies, mainly "Matrix" without having the money to do it right. Yes, we do get to see some running on walls and slow motion shoot-outs (45 secs approx.) but these scenes are about as cool as the stupid hardbody\'s attempts at making jokes about male incompetence now and then.<br /><br />And, yes, we are also served a number of leads that lead absolutely nowhere, as if the script was thought-out by the previously unseen cast while shooting the scenes.<br /><br />Believe me, it is as bad as it possibly can get. In fact, it doesn\'t deserve to be taken seriously, but perhaps I can ma

In [21]:
np.array(x_train_tokens[1])

array([1360,  133,    9,  415,    6,    1,  205,   17,  208,  205,    8,
          3,  144,  359,  332,  950,   89,  698,    1,  579, 1647, 1647,
         31,    3,  170,  209,    4,    3,  731, 8601,   73,  164,    3,
         54, 1125, 2929,    5, 8697,   40, 1655,    7,    7,   46,    6,
        160,   57, 2627,  212,   38,  218,   42,    1,  111,    2,    1,
       6186,  241,    6,   61,    8, 2860,   31,    1, 1013,    5, 2088,
         36,   79,   97, 1420, 3059,  208,  263,    1,  291,    5,   77,
          9,  203,  422,   73,   77,   76,    5,   63,   47,  638,   20,
       3667,    2,  561, 1293, 1191, 6088, 3265,   18,  132,  134,   23,
         42,   14,  593,   14,    1,  364, 1013,   30,  231,  616,   42,
        910, 8771,  146,    2,   91,    7,    7,    2,  422,   73,   23,
         81, 2833,    3,  627,    4,  842,   12,  468,  419, 1272,   14,
         43,    1,  227,   13,  190,   41,   31,    1, 2434, 5101,  174,
        136, 1209,    1,  134,    7,    7,  262,   

In [22]:
x_test_tokens = tokenizer.texts_to_sequences(x_test_text)

In [23]:
num_tokens = [len(tokens) for tokens in x_train_tokens + x_test_tokens]
num_tokens = np.array(num_tokens)

In [24]:
np.mean(num_tokens)

221.27474

In [25]:
np.max(num_tokens)

2209

In [26]:
max_tokens = np.mean(num_tokens) + 2 * np.std(num_tokens)
max_tokens = int(max_tokens)
max_tokens

544

In [27]:
np.sum(num_tokens < max_tokens) / len(num_tokens)

0.94532
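
The cutoff above is the mean length plus two standard deviations, and the coverage is the fraction of reviews shorter than that cutoff. A minimal sketch of the same rule in plain Python follows; the `lengths` list is made up for illustration and is not the IMDB data. `statistics.pstdev` is used because `np.std` computes the population standard deviation by default.

```python
from statistics import mean, pstdev

# Hypothetical review lengths in tokens (illustrative only, not the IMDB data).
lengths = [50, 120, 180, 200, 220, 250, 300, 400, 900, 2209]

# Same rule as the notebook: mean plus two (population) standard deviations.
cutoff = int(mean(lengths) + 2 * pstdev(lengths))

# Fraction of sequences that fit under the cutoff without truncation.
coverage = sum(n < cutoff for n in lengths) / len(lengths)

print(cutoff, coverage)
```

One very long review pulls the cutoff well above the typical length, so almost every sequence still fits untruncated.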

In [28]:
pad = 'pre'
x_train_pad = pad_sequences(x_train_tokens, maxlen=max_tokens,
                            padding=pad, truncating=pad)
x_test_pad = pad_sequences(x_test_tokens, maxlen=max_tokens,
                           padding=pad, truncating=pad)

In [29]:
print(x_train_pad.shape)
print(x_test_pad.shape)

(25000, 544)
(25000, 544)
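
With `padding='pre'` and `truncating='pre'`, short sequences get zeros in front and long sequences lose their earliest tokens, so the end of each review is always kept. The sketch below reimplements that behaviour for a single sequence in plain Python. `pad_pre` is a hypothetical helper written for illustration, not the Keras implementation, and it assumes `maxlen` is a positive integer.

```python
def pad_pre(tokens, maxlen, value=0):
    """Left-pad or left-truncate a token list to exactly maxlen entries."""
    kept = tokens[-maxlen:]                        # truncating='pre': drop the earliest tokens
    return [value] * (maxlen - len(kept)) + kept   # padding='pre': fill values go in front

short = pad_pre([1360, 133, 9], maxlen=5)          # shorter than maxlen, so it is padded
long = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)    # longer than maxlen, so it is truncated

print(short)
print(long)
```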


In [30]:
np.array(x_train_tokens[1])

array([1360,  133,    9,  415,    6,    1,  205,   17,  208,  205,    8,
          3,  144,  359,  332,  950,   89,  698,    1,  579, 1647, 1647,
         31,    3,  170,  209,    4,    3,  731, 8601,   73,  164,    3,
         54, 1125, 2929,    5, 8697,   40, 1655,    7,    7,   46,    6,
        160,   57, 2627,  212,   38,  218,   42,    1,  111,    2,    1,
       6186,  241,    6,   61,    8, 2860,   31,    1, 1013,    5, 2088,
         36,   79,   97, 1420, 3059,  208,  263,    1,  291,    5,   77,
          9,  203,  422,   73,   77,   76,    5,   63,   47,  638,   20,
       3667,    2,  561, 1293, 1191, 6088, 3265,   18,  132,  134,   23,
         42,   14,  593,   14,    1,  364, 1013,   30,  231,  616,   42,
        910, 8771,  146,    2,   91,    7,    7,    2,  422,   73,   23,
         81, 2833,    3,  627,    4,  842,   12,  468,  419, 1272,   14,
         43,    1,  227,   13,  190,   41,   31,    1, 2434, 5101,  174,
        136, 1209,    1,  134,    7,    7,  262,   

In [31]:
x_train_pad[1]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,   

In [32]:
idx = tokenizer.word_index
inverse_map = dict(zip(idx.values(), idx.keys()))

In [33]:
def tokens_to_string(tokens):
    # Map from tokens back to words.
    words = [inverse_map[token] for token in tokens if token != 0]
    
    # Concatenate all words.
    text = " ".join(words)

    return text
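
Decoding does not recover the original text exactly. Besides lowercasing and stripped punctuation, words ranked outside the `num_words` vocabulary never receive a token, which is why "saucers" from `x_train_text[1]` is missing between the two `1647` ("flying") tokens in `x_train_tokens[1]`. The sketch below illustrates this with a toy vocabulary. `to_tokens` and the `word_index` values are hypothetical, and the rule that only indices strictly below `num_words` are kept is assumed to match the Keras `Tokenizer` when no out-of-vocabulary token is set.

```python
# Toy vocabulary ranked by frequency (hypothetical values, not the notebook's word_index).
word_index = {"the": 1, "flying": 2, "by": 3, "saucers": 4}
num_words = 4  # keep only words whose index is strictly below this limit

def to_tokens(text, word_index, num_words):
    """Map words to indices, silently skipping unknown or too-rare words."""
    tokens = []
    for word in text.lower().split():
        index = word_index.get(word)
        if index is not None and index < num_words:
            tokens.append(index)
    return tokens

inverse_map = {index: word for word, index in word_index.items()}

tokens = to_tokens("The flying saucers flying by", word_index, num_words)
decoded = " ".join(inverse_map[t] for t in tokens)

print(tokens)
print(decoded)
```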

In [34]:
x_train_text[1]

'Wow, here it finally is; the action "movie" without action. In a real low-budget setting (don\'t miss the hilarious flying saucers flying by a few times) of a future Seattle we find a no-brain hardbody seeking to avenge her childhood.<br /><br />There is nothing even remotely original or interesting about the plot and the actors\' performance is only rivalled in stupidity by the attempts to steal from other movies, mainly "Matrix" without having the money to do it right. Yes, we do get to see some running on walls and slow motion shoot-outs (45 secs approx.) but these scenes are about as cool as the stupid hardbody\'s attempts at making jokes about male incompetence now and then.<br /><br />And, yes, we are also served a number of leads that lead absolutely nowhere, as if the script was thought-out by the previously unseen cast while shooting the scenes.<br /><br />Believe me, it is as bad as it possibly can get. In fact, it doesn\'t deserve to be taken seriously, but perhaps I can ma

In [35]:
tokens_to_string(x_train_tokens[1])


"wow here it finally is the action movie without action in a real low budget setting don't miss the hilarious flying flying by a few times of a future seattle we find a no brain seeking to avenge her childhood br br there is nothing even remotely original or interesting about the plot and the actors' performance is only in stupidity by the attempts to steal from other movies mainly matrix without having the money to do it right yes we do get to see some running on walls and slow motion shoot outs 45 but these scenes are about as cool as the stupid attempts at making jokes about male incompetence now and then br br and yes we are also served a number of leads that lead absolutely nowhere as if the script was thought out by the previously unseen cast while shooting the scenes br br believe me it is as bad as it possibly can get in fact it doesn't deserve to be taken seriously but perhaps i can make some of you not rent it and save your money"

In [36]:
model = Sequential()

In [37]:
embedding_size = 8

In [38]:
model.add(Embedding(input_dim=num_words,
                    output_dim=embedding_size,
                    input_length=max_tokens,
                    name='layer_embedding'))
model.add(GRU(units=16, return_sequences=True))
model.add(GRU(units=8, return_sequences=True))
model.add(GRU(units=4))
model.add(Dense(1, activation='sigmoid'))
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
layer_embedding (Embedding)  (None, 544, 8)            80000     
_________________________________________________________________
gru (GRU)                    (None, 544, 16)           1248      
_________________________________________________________________
gru_1 (GRU)                  (None, 544, 8)            624       
_________________________________________________________________
gru_2 (GRU)                  (None, 4)                 168       
_________________________________________________________________
dense (Dense)                (None, 1)                 5         
Total params: 82,045
Trainable params: 82,045
Non-trainable params: 0
_________________________________________________________________
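
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the TensorFlow 2 default GRU configuration (`reset_after=True`), in which each of the three gates has an input kernel, a recurrent kernel and two bias vectors. `gru_params` is a hypothetical helper written for this check, not a Keras function.

```python
def gru_params(input_dim, units):
    """Weights in one GRU layer, assuming reset_after=True (two bias vectors)."""
    kernel = input_dim * units      # input-to-hidden weights per gate
    recurrent = units * units       # hidden-to-hidden weights per gate
    biases = 2 * units              # input bias and recurrent bias per gate
    return 3 * (kernel + recurrent + biases)  # update, reset and candidate gates

num_words, embedding_size = 10000, 8

counts = {
    "layer_embedding": num_words * embedding_size,  # one vector per vocabulary index
    "gru": gru_params(embedding_size, 16),
    "gru_1": gru_params(16, 8),
    "gru_2": gru_params(8, 4),
    "dense": 4 * 1 + 1,                             # one weight per GRU output plus a bias
}
total = sum(counts.values())

print(counts)
print(total)
```

The embedding layer holds almost all of the weights; the three GRU layers and the dense layer together contribute only about two thousand.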


In [39]:
model.compile(loss='binary_crossentropy',
              optimizer=Adam(learning_rate=1e-3),
              metrics=['accuracy'])

In [40]:
%%time
model.fit(x_train_pad, y_train,
          validation_split=0.05, epochs=3, batch_size=64)

Epoch 1/3
Epoch 2/3
Epoch 3/3
Wall time: 14min 12s


<tensorflow.python.keras.callbacks.History at 0x1f9c7460af0>

In [44]:
model.save('saved_model/model1') 



INFO:tensorflow:Assets written to: saved_model/model1\assets



In [45]:
%%time
result = model.evaluate(x_test_pad, y_test)

Wall time: 1min 26s


In [46]:
print("Accuracy: {0:.2%}".format(result[1]))

Accuracy: 88.50%


In [47]:
%%time
y_pred = model.predict(x=x_test_pad[0:1000])
y_pred = y_pred.T[0]

Wall time: 5.18 s


In [48]:
cls_pred = np.where(y_pred > 0.5, 1.0, 0.0)

In [49]:
cls_true = np.array(y_test[0:1000])

In [50]:
incorrect = np.where(cls_pred != cls_true)
incorrect = incorrect[0]

In [51]:
len(incorrect)

111
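
With 111 of the first 1,000 test reviews misclassified, the error rate on this subset is 11.1%, close to the 11.5% implied by the 88.50% accuracy on the full test set. The sketch below repeats the threshold-and-compare steps on toy values. `misclassified`, `probs` and `labels` are hypothetical and stand in for the notebook's arrays.

```python
def misclassified(probabilities, labels, cutoff=0.5):
    """Return the positions where the thresholded prediction disagrees with the label."""
    predicted = [1 if p > cutoff else 0 for p in probabilities]
    return [i for i, (p, t) in enumerate(zip(predicted, labels)) if p != t]

# Toy sigmoid outputs and true classes (illustrative values only).
probs = [0.12, 0.91, 0.35, 0.66]
labels = [1, 1, 0, 0]

wrong = misclassified(probs, labels)
error_rate = len(wrong) / len(labels)

# The same arithmetic with the notebook's counts: 111 errors in 1,000 reviews.
subset_error = 111 / 1000

print(wrong, error_rate, subset_error)
```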

In [52]:
idx = incorrect[0]
idx

0

In [53]:
text = x_test_text[idx]
text

"I really liked this Summerslam due to the look of the arena, the curtains and just the look overall was interesting to me for some reason. Anyways, this could have been one of the best Summerslam's ever if the WWF didn't have Lex Luger in the main event against Yokozuna, now for it's time it was ok to have a huge fat man vs a strong man but I'm glad times have changed. It was a terrible main event just like every match Luger is in is terrible. Other matches on the card were Razor Ramon vs Ted Dibiase, Steiner Brothers vs Heavenly Bodies, Shawn Michaels vs Curt Hening, this was the event where Shawn named his big monster of a body guard Diesel, IRS vs 1-2-3 Kid, Bret Hart first takes on Doink then takes on Jerry Lawler and stuff with the Harts and Lawler was always very interesting, then Ludvig Borga destroyed Marty Jannetty, Undertaker took on Giant Gonzalez in another terrible match, The Smoking Gunns and Tatanka took on Bam Bam Bigelow and the Headshrinkers, and Yokozuna defended th

In [54]:
y_pred[idx]

0.11640993

In [55]:
cls_true[idx]

1

In [56]:
text1 = "This movie is fantastic! I really like it because it is so good!"
text2 = "Good"
text3 = "Maybe I like this movie."
text4 = "Meh ..."
text5 = "If I were a drunk teenager then this movie might be good."
text6 = "Bad movie!"
text7 = "Not a good movie!"
text8 = "This movie really sucks! Can I get my money back please?"
texts = [text1, text2, text3, text4, text5, text6, text7, text8]

In [57]:
tokens = tokenizer.texts_to_sequences(texts)

In [58]:
tokens_pad = pad_sequences(tokens, maxlen=max_tokens,
                           padding=pad, truncating=pad)
tokens_pad.shape

(8, 544)

In [59]:
model.predict(tokens_pad)

array([[0.90655303],
       [0.75120926],
       [0.35504228],
       [0.65970504],
       [0.27195597],
       [0.17173913],
       [0.63376534],
       [0.10649526]], dtype=float32)