In [None]:
'''A word embedding is a representation of text in which words with similar meanings have similar representations. 
In other words, it places words in a coordinate system where words that are related, based on relationships learned from a corpus, 
lie closer together. In deep learning frameworks such as TensorFlow and Keras, this part is usually handled 
by an embedding layer, which stores a lookup table that maps words, represented by numeric indexes, to their dense 
vector representations.'''
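
To make the lookup-table idea concrete, here is a minimal NumPy sketch. The toy vocabulary, the 4-dimensional vectors and the helper names `embed` and `cosine_similarity` are all assumptions for illustration. A real embedding layer learns the table during training rather than drawing it at random.

```python
import numpy as np

# Hypothetical toy vocabulary; index 0 is kept free for padding.
word_index = {"good": 1, "great": 2, "bad": 3}
vocab_size = len(word_index) + 1
embedding_dim = 4

rng = np.random.default_rng(0)
# The lookup table: one row of `embedding_dim` floats per integer index.
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

def embed(token_ids):
    """Map a sequence of integer ids to dense vectors by row lookup."""
    return embedding_matrix[np.asarray(token_ids)]

def cosine_similarity(a, b):
    """Closeness of two vectors; 1.0 means they point the same way."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sentence = [word_index["good"], word_index["bad"], 0]  # last id is padding
vectors = embed(sentence)
print(vectors.shape)  # (3, 4): one vector per token
print(cosine_similarity(vectors[0], vectors[1]))
```

After training on a real corpus, related words such as "good" and "great" would be expected to score a higher cosine similarity than unrelated ones. With the random table above the similarity values carry no meaning.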

In [None]:
'''The deep network takes the sequence of embedding vectors as input and converts it into a compressed representation. 
The compressed representation is intended to capture the information in the sequence of words in the text. 
The deep network part is usually an RNN or a variant of it such as an LSTM or GRU. Dropout is added to counter the 
tendency to overfit, a very common problem with RNN-based networks. Please refer here for a detailed discussion of LSTM and GRU.'''
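
As a rough illustration of how a recurrent network compresses a sequence into one fixed-size vector, the sketch below runs a plain tanh RNN in NumPy. This is deliberately simpler than the LSTM/GRU cells mentioned above, and the `dropout` helper is a basic inverted dropout on the inputs, not the exact scheme Keras applies inside its recurrent layers. All names and sizes are assumptions.

```python
import numpy as np

def simple_rnn_encode(inputs, w_x, w_h, b):
    """Run a plain tanh RNN over `inputs` of shape (timesteps, input_dim)
    and return the final hidden state of shape (hidden_dim,)."""
    hidden = np.zeros(w_h.shape[0])
    for x_t in inputs:
        # Each step mixes the current input with the previous hidden state.
        hidden = np.tanh(x_t @ w_x + hidden @ w_h + b)
    return hidden

def dropout(x, rate, rng):
    """Inverted dropout: zero a random fraction `rate` of entries and
    rescale the survivors so the expected value is unchanged."""
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

rng = np.random.default_rng(0)
timesteps, input_dim, hidden_dim = 6, 4, 3
inputs = rng.normal(size=(timesteps, input_dim))  # stand-in for embedding vectors
w_x = rng.normal(scale=0.5, size=(input_dim, hidden_dim))
w_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

encoded = simple_rnn_encode(dropout(inputs, 0.2, rng), w_x, w_h, b)
print(encoded.shape)  # (3,): the whole sequence reduced to one vector
```

Whatever the sequence length, the output has `hidden_dim` entries, which is what lets the later layers work with a fixed-size input.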

In [None]:
'''The fully connected layer takes the deep representation from the RNN/LSTM/GRU and transforms 
it into the final output classes or class scores. This component consists of fully connected layers 
along with batch normalization and, optionally, dropout layers for regularization.'''

In [None]:
'''Depending on the problem at hand, this layer uses either a sigmoid activation for binary classification or a softmax 
activation, which works for both binary and multi-class classification output.'''
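
The two output choices can be sketched in NumPy as follows. The `dense` helper, the random weights and the three-class head are assumptions for illustration only. The last lines check the standard identity that a two-class softmax over the logits `[0, z]` gives the same positive-class probability as a sigmoid applied to `z`.

```python
import numpy as np

def sigmoid(z):
    """Squash a score into (0, 1); used with a single output unit."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Turn a vector of scores into probabilities that sum to 1."""
    shifted = z - np.max(z, axis=-1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def dense(x, weights, bias):
    """A fully connected layer without activation."""
    return x @ weights + bias

rng = np.random.default_rng(0)
features = rng.normal(size=(2, 5))  # stand-in for two encoded reviews

# Binary head: one unit, sigmoid gives P(positive).
p_positive = sigmoid(dense(features, rng.normal(size=(5, 1)), np.zeros(1)))

# Multi-class head: three hypothetical classes, softmax gives one probability each.
class_probs = softmax(dense(features, rng.normal(size=(5, 3)), np.zeros(3)))

print(p_positive.shape, class_probs.sum(axis=1))

# Two-class softmax over [0, z] matches sigmoid(z).
z = 1.3
print(np.isclose(softmax(np.array([0.0, z]))[1], sigmoid(z)))
```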

In [104]:
import pandas as pd
import numpy as np

df = pd.read_csv("C:\\MyWork\\MyLearning\\ML\\Files\\DataSet\\movie_data.csv",encoding='utf-8').sample(n=1000, random_state=1)

In [105]:
df.head()

Unnamed: 0,review,sentiment
26779,I was impressed with this film because of the ...,1
47092,The best bond game made of all systems. It was...,1
17779,Beautifully made with a wonderful performance ...,1
39526,What a stupid idea. Ewoks should be enslaved a...,0
34147,Please don't waste your money on this sorry ex...,0


In [106]:
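# First 499 sampled rows for training
X_train  = df.review.iloc[:499]
y_train = df.sentiment.iloc[:499]

# Rows from position 500 onward for testing. Note: the row at position 499 falls in
# neither split; slicing the training data with iloc[:500] would include it.
X_test = df.review.iloc[500:]
y_test = df.sentiment.iloc[500:]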

In [107]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer_obj = Tokenizer()
total_reviews  = df.review

tokenizer_obj.fit_on_texts(total_reviews)

In [108]:
# Define vocabulary size (+1 because word indexes start at 1 and index 0 is reserved for padding)
vocab_size = len(tokenizer_obj.word_index) +1

print(vocab_size)

19170
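
The `+1` in the vocabulary size comes from how the word index is numbered: the Keras `Tokenizer` assigns indexes starting at 1, most frequent word first, which leaves 0 free for padding. The pure-Python sketch below imitates that numbering under simplifying assumptions (lowercasing and whitespace splitting only, alphabetical tie-breaking). `build_word_index` is a hypothetical helper, not the Keras implementation.

```python
from collections import Counter

def build_word_index(texts):
    """Assign indexes 1, 2, 3, ... to words, most frequent first.
    Index 0 is never assigned, so it can serve as the padding value."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {word: index for index, (word, _) in enumerate(ordered, start=1)}

texts = ["good movie", "bad movie"]
word_index = build_word_index(texts)
vocab_size = len(word_index) + 1  # +1 for the reserved index 0

print(word_index)  # {'movie': 1, 'bad': 2, 'good': 3}
print(vocab_size)  # 4
```

The embedding layer needs one row per possible index, including 0, which is why its first dimension is `len(word_index) + 1`.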


In [109]:
# Longest review, counted in whitespace-separated words; used below as the padded sequence length
max_length = max([len(s.split()) for s in total_reviews])

print(max_length)

1723


In [110]:
X_train_tokens = tokenizer_obj.texts_to_sequences(X_train)
X_test_tokens = tokenizer_obj.texts_to_sequences(X_test)

In [111]:
X_train_pad = pad_sequences(X_train_tokens,maxlen=max_length,padding='post')
X_test_pad = pad_sequences(X_test_tokens,maxlen=max_length,padding='post')
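
To show what the `padding` argument changes, here is a small NumPy re-implementation of the zero-padding step. The `pad` helper is a hypothetical stand-in that covers only the behaviour discussed here: it pads with zeros before (`'pre'`, the Keras default) or after (`'post'`) the tokens, and keeps the last `maxlen` tokens of longer sequences.

```python
import numpy as np

def pad(sequences, maxlen, padding="pre"):
    """Zero-pad each sequence to `maxlen`; longer ones keep their last tokens."""
    out = np.zeros((len(sequences), maxlen), dtype="int32")
    for row, seq in enumerate(sequences):
        seq = seq[-maxlen:]  # truncate from the front if too long
        if not seq:
            continue  # an empty sequence stays all zeros
        if padding == "post":
            out[row, :len(seq)] = seq   # tokens first, zeros after
        else:
            out[row, -len(seq):] = seq  # zeros first, tokens last
    return out

token_ids = [[5, 8, 2], [7]]
print(pad(token_ids, 5, padding="post"))
# [[5 8 2 0 0]
#  [7 0 0 0 0]]
print(pad(token_ids, 5, padding="pre"))
# [[0 0 5 8 2]
#  [0 0 0 0 7]]
```

With `'post'` padding and a `maxlen` far above the typical review length, a recurrent layer without masking reads a long run of zeros after the last real token, so the padding side should at least be the same at training and prediction time.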

In [112]:
# Build Model

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,LSTM,Embedding,GRU

# Initialize Model
model = Sequential()

EMBEDDING_DIM = 100

# Add Model Layer
model.add(Embedding(vocab_size,EMBEDDING_DIM,input_length=max_length))
model.add(GRU(units=32,dropout = 0.2,recurrent_dropout=0.2))
model.add(Dense(1,activation = 'sigmoid'))

# Compile Model
model.compile(optimizer = 'adam',loss='binary_crossentropy',metrics = ['accuracy'])

model.summary()

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, 1723, 100)         1917000   
_________________________________________________________________
gru_5 (GRU)                  (None, 32)                12768     
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 33        
Total params: 1,929,801
Trainable params: 1,929,801
Non-trainable params: 0
_________________________________________________________________
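
The parameter counts in the summary can be reproduced by hand. The arithmetic below assumes the GRU variant with a single bias vector per gate, which is what matches the 12,768 reported here. GRU implementations that keep two bias vectors per gate would report 12,864 for the same sizes.

```python
vocab_size = 19170
embedding_dim = 100
gru_units = 32

# Embedding: one vector of length embedding_dim per vocabulary index.
embedding_params = vocab_size * embedding_dim

# GRU: 3 gates (update, reset, candidate), each with input weights,
# recurrent weights and one bias vector.
gru_params = 3 * (gru_units * (embedding_dim + gru_units) + gru_units)

# Dense(1): one weight per GRU unit plus one bias.
dense_params = gru_units * 1 + 1

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total_params)
# 1917000 12768 33 1929801
```

Almost all of the parameters sit in the embedding table, which is worth keeping in mind when only 499 training reviews are available.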


In [113]:
# Train Model
model.fit(X_train_pad,y_train,batch_size=128,epochs=25,validation_data=(X_test_pad,y_test),verbose=2)

Train on 499 samples, validate on 500 samples
Epoch 1/25
499/499 - 8s - loss: 0.6943 - acc: 0.4810 - val_loss: 0.6932 - val_acc: 0.4940
Epoch 2/25
499/499 - 7s - loss: 0.6935 - acc: 0.5251 - val_loss: 0.6947 - val_acc: 0.4940
Epoch 3/25
499/499 - 7s - loss: 0.6929 - acc: 0.5210 - val_loss: 0.6956 - val_acc: 0.4940
Epoch 4/25
499/499 - 7s - loss: 0.6927 - acc: 0.5210 - val_loss: 0.6958 - val_acc: 0.4940
Epoch 5/25
499/499 - 7s - loss: 0.6922 - acc: 0.5210 - val_loss: 0.6951 - val_acc: 0.4940
Epoch 6/25
499/499 - 7s - loss: 0.6915 - acc: 0.5210 - val_loss: 0.6946 - val_acc: 0.4940
Epoch 7/25
499/499 - 7s - loss: 0.6917 - acc: 0.5210 - val_loss: 0.6944 - val_acc: 0.4940
Epoch 8/25
499/499 - 7s - loss: 0.6924 - acc: 0.5210 - val_loss: 0.6942 - val_acc: 0.4940
Epoch 9/25
499/499 - 7s - loss: 0.6917 - acc: 0.5210 - val_loss: 0.6942 - val_acc: 0.4940
Epoch 10/25
499/499 - 7s - loss: 0.6928 - acc: 0.5210 - val_loss: 0.6941 - val_acc: 0.4940
Epoch 11/25
499/499 - 7s - loss: 0.6927 - acc: 0.5190

<tensorflow.python.keras.callbacks.History at 0x29c45932b88>

In [118]:
# Test Model

test_sample_1 = "This movie is fantastic! I really like it because it is so  good!"
test_sample_2 = "Good Movie!"
test_sample_3 = "Maybe I Like this movie"
test_sample_4 = "Not to my taste, will skip and watch another movie"
test_sample_5 = "if you like action, then this movie might be good for you"
test_sample_6 = "Bad movie!"
test_sample_7 = "Not a good movie!"
test_sample_8 = "This movie really sucks! Can I get my money back please?"

test_samples = [test_sample_1,test_sample_2,test_sample_3,test_sample_4,test_sample_5,test_sample_6,test_sample_7,test_sample_8]

test_sample_tokens = tokenizer_obj.texts_to_sequences(test_samples)

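# Note: pad_sequences defaults to padding='pre', whereas the training data above was padded with padding='post'
test_sample_token_pad = pad_sequences(test_sample_tokens,maxlen=max_length)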

In [120]:
# The output is the predicted probability that each sample is positive: values near 1 indicate positive sentiment, values near 0 negative sentiment.

# Predict
model.predict(test_sample_token_pad)

array([[0.49367246],
       [0.4885422 ],
       [0.49148762],
       [0.494942  ],
       [0.48225394],
       [0.4866268 ],
       [0.5055415 ],
       [0.4927562 ]], dtype=float32)
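
To turn these probabilities into class labels, a common choice is to threshold at 0.5. The sketch below applies that to the values printed above. The 0.5 cut-off is an assumption, not something the model fixes.

```python
import numpy as np

# Probabilities copied from the prediction output above.
probabilities = np.array([[0.49367246],
                          [0.4885422],
                          [0.49148762],
                          [0.494942],
                          [0.48225394],
                          [0.4866268],
                          [0.5055415],
                          [0.4927562]], dtype="float32")

# 1 = positive sentiment, 0 = negative sentiment.
labels = (probabilities.ravel() >= 0.5).astype(int)
print(labels.tolist())  # [0, 0, 0, 0, 0, 0, 1, 0]

# How far the most confident prediction is from the 0.5 decision boundary.
print(float(np.abs(probabilities - 0.5).max()))
```

Every probability lies within about 0.02 of 0.5, so these labels are close to coin flips. That is consistent with the training log, where validation accuracy stays at 0.4940 throughout.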