In [18]:
import tensorflow as tf
import tensorflow.keras as keras
import pandas as pd   
import numpy as np

In [19]:
imdb_reviews = pd.read_csv("imdb_reviews.csv")
test_reviews = pd.read_csv("test_reviews.csv")

In [20]:
imdb_reviews.head()

Unnamed: 0,Reviews,Sentiment
0,<START this film was just brilliant casting lo...,positive
1,<START big hair big boobs bad music and a gian...,negative
2,<START this has to be one of the worst films o...,negative
3,<START the <UNK> <UNK> at storytelling the tra...,positive
4,<START worst mistake of my life br br i picked...,negative


In [21]:
test_reviews.head()

Unnamed: 0,Reviews,Sentiment
0,<START please give this one a miss br br <UNK>...,negative
1,<START this film requires a lot of patience be...,positive
2,<START many animation buffs consider <UNK> <UN...,positive
3,<START i generally love this type of movie how...,negative
4,<START like some other people wrote i'm a die ...,positive


We cannot pass string data to our model directly, so we need to convert each review into a sequence of integers. To do this, we map each distinct word to a distinct integer, e.g. {'this': 14, 'the': 4}. We already have a file that contains this word-to-integer mapping, so we are going to load that file.
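The idea can be sketched with a toy vocabulary. The ids below are copied from the encoded output that appears later in this notebook; `toy_word_index` and `encode_tokens` are hypothetical names used only for illustration, and the fallback to `<UNK>` is an assumption about how unseen words should be handled.

```python
# Toy vocabulary; the ids mirror the encoded reviews shown later in the notebook.
toy_word_index = {"<START": 1, "<UNK>": 2, "this": 14, "film": 22,
                  "was": 16, "just": 43, "brilliant": 530}

def encode_tokens(tokens, vocab):
    """Map each token to its integer id, falling back to <UNK> for unseen words."""
    unk = vocab["<UNK>"]
    return [vocab.get(tok, unk) for tok in tokens]

tokens = "<START this film was just brilliant".split()
encoded = encode_tokens(tokens, toy_word_index)
print(encoded)  # [1, 14, 22, 16, 43, 530]
```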

In [22]:
word_index = pd.read_csv("word_indexes.csv")

In [23]:
word_index.head()

Unnamed: 0,Words,Indexes
0,tsukino,52009
1,nunnery,52010
2,sonja,16819
3,vani,63954
4,woods,1411


Convert the word_index DataFrame into a Python dictionary so that we can use it to convert our reviews from string to integer format.

In [24]:
#Converting to dictionary
word_index = dict(zip(word_index.Words, word_index.Indexes))

In [25]:
word_index["<PAD>"]=0
word_index["<START"]=1
word_index["<UNK>"]=2
word_index["<UNUSED>"]=3

In [26]:
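def review_encoder(text):
  #map each token to its integer index; tokens missing from word_index fall back to <UNK>
  arr=[word_index.get(word, word_index["<UNK>"]) for word in text]
  return arr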

In [27]:
#split the reviews from their corresponding sentiments
train_data,train_labels=imdb_reviews['Reviews'],imdb_reviews['Sentiment']
test_data, test_labels=test_reviews['Reviews'],test_reviews['Sentiment']

In [28]:
train_data=train_data.apply(lambda review:review.split())
test_data=test_data.apply(lambda review:review.split())

In [29]:
train_data[0]

['<START',
 'this',
 'film',
 'was',
 'just',
 'brilliant',
 'casting',
 'location',
 'scenery',
 'story',
 'direction',
 "everyone's",
 'really',
 'suited',
 'the',
 'part',
 'they',
 'played',
 'and',
 'you',
 'could',
 'just',
 'imagine',
 'being',
 'there',
 'robert',
 '<UNK>',
 'is',
 'an',
 'amazing',
 'actor',
 'and',
 'now',
 'the',
 'same',
 'being',
 'director',
 '<UNK>',
 'father',
 'came',
 'from',
 'the',
 'same',
 'scottish',
 'island',
 'as',
 'myself',
 'so',
 'i',
 'loved',
 'the',
 'fact',
 'there',
 'was',
 'a',
 'real',
 'connection',
 'with',
 'this',
 'film',
 'the',
 'witty',
 'remarks',
 'throughout',
 'the',
 'film',
 'were',
 'great',
 'it',
 'was',
 'just',
 'brilliant',
 'so',
 'much',
 'that',
 'i',
 'bought',
 'the',
 'film',
 'as',
 'soon',
 'as',
 'it',
 'was',
 'released',
 'for',
 '<UNK>',
 'and',
 'would',
 'recommend',
 'it',
 'to',
 'everyone',
 'to',
 'watch',
 'and',
 'the',
 'fly',
 'fishing',
 'was',
 'amazing',
 'really',
 'cried',
 'at',
 'the

In [30]:
#The reviews are now tokenized, so we can apply review_encoder to each review to convert it into integer format.
train_data=train_data.apply(review_encoder)
test_data=test_data.apply(review_encoder)

In [32]:
train_data.head()

0    [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, ...
1    [1, 194, 1153, 194, 8255, 78, 228, 5, 6, 1463,...
2    [1, 14, 47, 8, 30, 31, 7, 4, 249, 108, 7, 4, 5...
3    [1, 4, 2, 2, 33, 2804, 4, 2040, 432, 111, 153,...
4    [1, 249, 1323, 7, 61, 113, 10, 10, 13, 1637, 1...
Name: Reviews, dtype: object
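One way to sanity-check the encoding is to invert the mapping and decode a few integers back into words. The sketch below uses a small toy dictionary rather than the full word_index; `reverse_word_index` and `review_decoder` are hypothetical names, not part of the notebook.

```python
toy_word_index = {"<PAD>": 0, "<START": 1, "<UNK>": 2,
                  "this": 14, "film": 22, "was": 16}

# Invert the mapping: integer id -> word.
reverse_word_index = {index: word for word, index in toy_word_index.items()}

def review_decoder(encoded):
    """Turn a list of integer ids back into a space-separated string."""
    return " ".join(reverse_word_index.get(i, "<UNK>") for i in encoded)

decoded = review_decoder([1, 14, 22, 16])
print(decoded)  # <START this film was
```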

In [37]:
#We also need to encode the sentiments: positive is labeled 1 and negative is labeled 0
def encode_sentiments(x):
  if x=='positive':
    return 1
  else:
    return 0

train_labels=train_labels.apply(encode_sentiments)
test_labels=test_labels.apply(encode_sentiments)

Before giving the reviews to the model as input, we need to perform the following preprocessing steps:

Every review must have the same length, because the model expects fixed-size input.

We have chosen a length of 500 tokens for each review.

If a review is longer than 500 tokens, we cut off the extra part. Note that pad_sequences truncates from the beginning of the sequence by default (truncating='pre'), so the last 500 tokens are kept.

If a review contains fewer than 500 tokens, we pad it with zeros at the end (padding='post') to increase its length to 500.
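The padding and truncation rules can be sketched in plain Python for a single sequence. `pad_or_truncate` is a hypothetical helper, not a Keras function; it assumes the same settings as the cell below (padding='post' and the default truncating='pre').

```python
def pad_or_truncate(seq, maxlen, pad_value=0):
    """Mimic pad_sequences(..., padding='post', truncating='pre') for one sequence."""
    if len(seq) > maxlen:
        # truncating='pre' (the default): drop tokens from the start, keep the last maxlen
        return list(seq[-maxlen:])
    # padding='post': append pad_value until the sequence reaches maxlen
    return list(seq) + [pad_value] * (maxlen - len(seq))

short_review = [1, 14, 22]
long_review = [1, 14, 22, 16, 43, 530, 973]

padded = pad_or_truncate(short_review, 5)
truncated = pad_or_truncate(long_review, 5)
print(padded)     # [1, 14, 22, 0, 0]
print(truncated)  # [22, 16, 43, 530, 973]
```

To keep the beginning of long reviews instead, pad_sequences accepts truncating='post'.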

In [38]:
train_data = keras.preprocessing.sequence.pad_sequences(train_data, value = word_index ["<PAD>"], padding='post', maxlen=500)
test_data = keras.preprocessing.sequence.pad_sequences(test_data, value = word_index ["<PAD>"], padding='post', maxlen=500)

Our model is a neural network that consists of the following layers:

An embedding layer, which maps each integer in the encoded review to a word embedding of length 16. Its vocabulary size is 10000, so every integer in the input must be smaller than 10000.

A global average pooling layer, which averages the embeddings over all 500 positions and returns a single vector of length 16 per review. This layer has no trainable parameters and keeps the model small, which helps to limit overfitting.

A dense layer with 16 hidden units and ReLU activation.

An output layer with a single unit and sigmoid activation, which returns the probability that the review is positive.
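As a rough illustration of the pooling step and of the model's size, the sketch below averages toy embedding vectors over the sequence axis and counts the trainable parameters for the layer sizes used in this notebook. `global_average_pool` is a hypothetical helper and the embedding values are made up. It also assumes that padding positions are included in the average, which should be the case here because the Embedding layer is created without mask_zero.

```python
def global_average_pool(embedded):
    """Average a (sequence_length x embedding_dim) list of vectors over the sequence axis."""
    n = len(embedded)
    dim = len(embedded[0])
    return [sum(vec[d] for vec in embedded) / n for d in range(dim)]

# Three tokens with 2-dimensional embeddings (toy values, not learned weights).
embedded_review = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = global_average_pool(embedded_review)
print(pooled)  # [3.0, 4.0]

# Trainable parameter count for the layer sizes used in this notebook.
vocab_size, embed_dim, hidden = 10000, 16, 16
params = {
    "embedding": vocab_size * embed_dim,   # one 16-dimensional vector per word id
    "pooling": 0,                          # averaging has no weights
    "dense": embed_dim * hidden + hidden,  # weights plus biases
    "output": hidden * 1 + 1,              # weights plus one bias
}
total_params = sum(params.values())
print(total_params)  # 160289
```

Almost all of the parameters sit in the embedding table, so the vocabulary size and embedding length are the main levers on model size.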

In [39]:
model=keras.Sequential([keras.layers.Embedding(10000,16,input_length=500),
                        keras.layers.GlobalAveragePooling1D(),
                        keras.layers.Dense(16,activation='relu'),
                        keras.layers.Dense(1,activation='sigmoid')])

In [40]:
#Adam is used as the optimizer for our model.
#Binary cross-entropy is used as the loss function.
#Accuracy is used as the metric for evaluating the model.
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
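To make the loss concrete, here is a minimal pure-Python sketch of mean binary cross-entropy. It is not the Keras implementation; the function name and the clipping constant `eps` are assumptions chosen for illustration.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; eps clips predictions away from 0 and 1 to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Labels: 1 = positive, 0 = negative. Predictions are sigmoid outputs.
confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
print(round(confident_right, 4), round(confident_wrong, 4))
```

Confident correct predictions give a loss close to 0, while confident wrong predictions are penalized heavily.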

In [41]:
#training the model
history=model.fit(train_data,train_labels,epochs=30,batch_size=512,validation_data=(test_data,test_labels))

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


In [42]:
loss,accuracy=model.evaluate(test_data,test_labels)



In [43]:
index = np.random.randint(1, 1000)

In [44]:
user_review = test_reviews.loc[index]
print(user_review)

Reviews      <START the plot is simple an american couple i...
Sentiment                                             negative
Name: 695, dtype: object


As we can see, the sentiment for the above review is negative. Now we take the integer-encoded version of this review, which we already have in our preprocessed test data, and give it to our model as input to check the model's prediction.

In [45]:
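user_review=test_data[index]
#the model expects a batch, so wrap the single review in an array of shape (1, 500)
user_review= np.array([user_review])
#predict returns an array of shape (1, 1) holding the probability of positive sentiment
probability=model.predict(user_review)[0][0]
if probability > 0.5:
    print("Positive sentiment")
else:
    print("Negative sentiment")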

Negative sentiment
