This project performs **Sentiment Analysis**: it predicts **whether a given text is positive or negative**.

I implement RNN, LSTM, GRU, Bidirectional LSTM and Bidirectional GRU models to analyze sequential text data for **Sentiment Classification**.

It is an introductory, comparative analysis of the accuracy achieved by the different models.

In [1]:
!wget https://www.dropbox.com/s/pdhwlpi2yeie0ol/movie-reviews-dataset.zip

--2021-07-03 07:22:52--  https://www.dropbox.com/s/pdhwlpi2yeie0ol/movie-reviews-dataset.zip
Resolving www.dropbox.com (www.dropbox.com)... 162.125.1.18, 2620:100:6016:18::a27d:112
Connecting to www.dropbox.com (www.dropbox.com)|162.125.1.18|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/pdhwlpi2yeie0ol/movie-reviews-dataset.zip [following]
--2021-07-03 07:22:52--  https://www.dropbox.com/s/raw/pdhwlpi2yeie0ol/movie-reviews-dataset.zip
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc98cce4da5d6aaa724816d6e42b.dl.dropboxusercontent.com/cd/0/inline/BRkl_bKeWYCGwJ1ERGwS21BypPzypzdgLU_uU-HB96Dd0Y16_xE4DUQLIXjHIfHx10YoVL15Jb5OQictbLjTo_v_Ur8K0m8yjOkxF9lBdrdRGyq0_fhguEAnMHdsxJY1bXNsm2APibbOcQO3j0jIP2qq/file# [following]
--2021-07-03 07:22:53--  https://uc98cce4da5d6aaa724816d6e42b.dl.dropboxusercontent.com/cd/0/inline/BRkl_bKeWYCGwJ1ERGwS21BypPzypzdgLU_uU-HB96Dd0Y16_xE

In [2]:
!unzip -q "/content/movie-reviews-dataset.zip"       # unzip the downloaded dataset

replace movie-reviews-dataset/.DS_Store? [y]es, [n]o, [A]ll, [N]one, [r]ename: A


#**RNN**

In [3]:
import tensorflow as tf
from tensorflow.keras.preprocessing import text_dataset_from_directory
# loads text files arranged in per-class sub-directories as a TensorFlow dataset object
from tensorflow.strings import regex_replace   # fast, TensorFlow-native regex replacement on string tensors
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
# TextVectorization converts each string into a vector of integers that is passed into the network
from tensorflow.keras.models import Sequential
# Sequential builds a model as a linear stack of layers, each one feeding the next
from tensorflow.keras import Input
from tensorflow.keras.layers import Dense, RNN, SimpleRNNCell, Embedding, Dropout  #all the different layers in neural network

In [4]:
def prepareData(dir):
  df = text_dataset_from_directory(dir)
  return df.map(lambda text, label: (regex_replace(text, '<br />', ' '), label),)

# The data was created by scraping a website, so it may contain HTML markup such as the <br /> line-break tag.
# regex_replace() is used here to replace each such tag with a single space.

The function takes a directory as input and returns a TensorFlow dataset object. Inside the function we use the text_dataset_from_directory utility provided by TensorFlow.

This utility only reads data that is organized in class directories, each of which contains text files.
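
To make the expected layout concrete, here is a minimal plain-Python sketch of the same idea, not the TensorFlow implementation. It builds a tiny dataset in a temporary directory, infers one label per class sub-directory (in sorted order), and replaces the `<br />` tag with a space. The class names `neg` and `pos`, the sample texts, and both helper functions are assumptions made for illustration only.

```python
import tempfile
from pathlib import Path


def make_toy_dataset(root: Path) -> None:
    """Create root/<class_name>/<n>.txt, one sub-directory per class."""
    samples = {
        "neg": ["Dull plot.<br /><br />Avoid it."],  # hypothetical review
        "pos": ["Great cast.<br />Loved it."],       # hypothetical review
    }
    for class_name, texts in samples.items():
        class_dir = root / class_name
        class_dir.mkdir(parents=True)
        for i, text in enumerate(texts):
            (class_dir / f"{i}.txt").write_text(text, encoding="utf-8")


def load_labelled_texts(root: Path):
    """Return (class_names, [(clean_text, label_index), ...])."""
    # Labels are the positions of the sub-directory names in sorted order.
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    rows = []
    for label, class_name in enumerate(class_names):
        for path in sorted((root / class_name).glob("*.txt")):
            text = path.read_text(encoding="utf-8").replace("<br />", " ")
            rows.append((text, label))
    return class_names, rows


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    make_toy_dataset(root)
    class_names, rows = load_labelled_texts(root)

print(class_names)
for text, label in rows:
    print(label, repr(text))
```

With this layout the sub-directory name is the only source of the label, which is why files placed directly in the root directory would not be assigned to any class.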

In [5]:
train_df = prepareData('movie-reviews-dataset/train') #creating dataset object for both train and test data
test_df = prepareData('movie-reviews-dataset/test')

Found 25000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


In [6]:
for text_batch, label_batch in train_df.take(1):  #checking if the data is loaded properly from training dataset
  print(text_batch.numpy()[0])
  print(label_batch.numpy()[0])

b'Checking the spoiler alert just in case.  Perhaps one of the most horrendous movies I have ever seen, Mazes and Monsters felt like I wasted 101 minutes of my life. The only redeeming quality of the movie were scenes that tried to be serious, but just ended up being funny since they were so bad. Evil Dead anyone? Unfortunately for M&M (fortunately for us) it did not develop a cult following and result in a trilogy. This movie tried to address a series of problems that the main character, Robbie (played by Hanks) encountered throughout the film. It ended up being a fear mongering video about stereotypes that helped fuel the D&D is the Devil movement in the 80s.  If you want to avoid wasting your time and money, steer clear of this junk.  P.S. - Even though the cover looks kinda interesting, which is why I guess my brother bought it, it in no way takes place in a fantasy realm, unless you consider New England or New York City to be such a place.'
0


In [7]:
model = tf.keras.models.Sequential()  # initialize the Sequential model
model.add(tf.keras.Input(shape=(1,), dtype='string'))  # input layer: the model takes raw strings as input




In [8]:
#converting this string information into vector representation.
#For this we are going to use the text vectorization module.


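max_tokens = 1000   # vocabulary size: only the most frequent tokens are kept (the limit includes the padding and unknown-word entries)
max_len = 100       # every review is padded or truncated to 100 tokens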
vectorize_layer = TextVectorization(max_tokens=max_tokens, output_mode="int", output_sequence_length=max_len)

In [9]:
train_texts = train_df.map(lambda text, label: text)  
vectorize_layer.adapt(train_texts)

model.add(vectorize_layer)
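
As a rough illustration of what the vectorization step does in `"int"` mode, the plain-Python sketch below builds a frequency-ranked vocabulary and maps a sentence to a fixed-length list of integers. It is a simplification written for this explanation, not the layer's actual code: the helper names and the toy corpus are assumptions, and it only mimics the conventions of lower-casing, stripping punctuation, reserving id 0 for padding and id 1 for unknown words.

```python
import re
from collections import Counter

PAD_ID = 0  # used to fill sequences shorter than max_len
OOV_ID = 1  # used for words outside the vocabulary


def standardize(text):
    """Lower-case the text and drop punctuation."""
    return re.sub(r"[^\w\s]", "", text.lower())


def build_vocab(texts, max_tokens):
    """Keep the most frequent words; two ids are reserved for PAD and OOV."""
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    kept = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: i + 2 for i, tok in enumerate(kept)}


def vectorize(text, vocab, max_len):
    """Map words to ids, then truncate or pad to exactly max_len entries."""
    ids = [vocab.get(tok, OOV_ID) for tok in standardize(text).split()]
    ids = ids[:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


corpus = ["I loved the movie", "I did not like the movie", "the movie was fine"]
vocab = build_vocab(corpus, max_tokens=6)  # tiny vocabulary for the demo
vector = vectorize("I loved the film !", vocab, max_len=8)
print(vocab)
print(vector)
```

Here "film" never occurs in the toy corpus, so it is mapped to the unknown-word id, and the trailing zeros are padding. With `max_tokens = 1000` the same thing happens to every word outside the most frequent ones, which limits how much of a review the model can actually see.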

In [10]:
model.add(Embedding(max_tokens + 1, 128))

rnn = RNN(SimpleRNNCell(64) , return_sequences=False,return_state=False)
model.add(rnn)
model.add(Dense(128, activation="relu"))
model.add(Dense(1, activation="sigmoid"))

In [11]:
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

In [12]:
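model.fit(train_df, epochs=10, verbose=2)  # train_df is already batched (32 reviews per batch by default, hence 782 steps per epoch), so no batch_size is passed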

Epoch 1/10
782/782 - 27s - loss: 0.6992 - accuracy: 0.5105
Epoch 2/10
782/782 - 26s - loss: 0.6897 - accuracy: 0.5364
Epoch 3/10
782/782 - 26s - loss: 0.6664 - accuracy: 0.5932
Epoch 4/10
782/782 - 26s - loss: 0.6446 - accuracy: 0.6368
Epoch 5/10
782/782 - 26s - loss: 0.6631 - accuracy: 0.5991
Epoch 6/10
782/782 - 26s - loss: 0.6685 - accuracy: 0.5874
Epoch 7/10
782/782 - 26s - loss: 0.6605 - accuracy: 0.6008
Epoch 8/10
782/782 - 26s - loss: 0.6369 - accuracy: 0.6400
Epoch 9/10
782/782 - 26s - loss: 0.6487 - accuracy: 0.6098
Epoch 10/10
782/782 - 26s - loss: 0.6583 - accuracy: 0.5983


<tensorflow.python.keras.callbacks.History at 0x7f25781b8c10>

In [13]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
text_vectorization (TextVect (None, 100)               0         
_________________________________________________________________
embedding (Embedding)        (None, 100, 128)          128128    
_________________________________________________________________
rnn (RNN)                    (None, 64)                12352     
_________________________________________________________________
dense (Dense)                (None, 128)               8320      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 129       
Total params: 148,929
Trainable params: 148,929
Non-trainable params: 0
_________________________________________________________________
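
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, simple recurrent, and dense layers; the helper names are made up for this check, while the layer sizes are the ones used in the cells above.

```python
def embedding_params(vocab_size, dim):
    # one dim-sized vector per vocabulary entry
    return vocab_size * dim


def simple_rnn_params(input_dim, units):
    # input kernel + recurrent kernel + bias
    return units * (input_dim + units + 1)


def dense_params(input_dim, units):
    # weight matrix + bias
    return units * (input_dim + 1)


max_tokens = 1000
layer_params = {
    "embedding": embedding_params(max_tokens + 1, 128),
    "rnn": simple_rnn_params(128, 64),
    "dense": dense_params(64, 128),
    "dense_1": dense_params(128, 1),
}
total_params = sum(layer_params.values())

for name, count in layer_params.items():
    print(f"{name:<10} {count:>8,}")
print(f"{'total':<10} {total_params:>8,}")
```

Most of the parameters sit in the embedding table; the recurrent layer itself is comparatively small.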


In [14]:
model.evaluate(test_df, verbose=0)  #evaluating model on test dataset

[0.6762213110923767, 0.5723199844360352]

In [15]:
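from numpy import sqrt
# Note: the model is compiled with metrics=["accuracy"], so evaluate() returns
# [binary cross-entropy loss, accuracy]. The value called `error` below is the
# accuracy (compare with the previous output), not a mean squared error.
loss, error = model.evaluate(test_df, verbose=0)
print('MSE: %.3f, RMSE: %.3f' % (error, sqrt(error)))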

MSE: 0.572, RMSE: 0.757


In [16]:
# same call on the training data; as above, `error` holds the accuracy metric
loss, error = model.evaluate(train_df, verbose=0)
print('MSE: %.3f, RMSE: %.3f' % (error, sqrt(error)))

MSE: 0.620, RMSE: 0.788


MSE is the sum of the squared prediction errors (actual output minus predicted output) divided by the number of data points. It gives an absolute measure of how far the predictions are from the actual values. A single MSE value says little on its own, but it lets you compare one model with another and helps you select the best model for a regression task.

The Root Mean Square Error (RMSE) is the square root of MSE. It is used more often than MSE for two reasons. First, MSE values can be too large to compare easily. Second, MSE is computed from squared errors, so taking the square root brings the measure back to the same scale as the predicted quantity.

The lower the error, the better the model; an error of 0 means the predictions match the actual values exactly.

####There are 3 main metrics for model evaluation in regression:
1. R Square/Adjusted R Square

2. Mean Square Error(MSE)/Root Mean Square Error(RMSE)

3. Mean Absolute Error(MAE)
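
The error metrics described above can be sketched in a few lines of plain Python. The labels and predicted probabilities below are invented for illustration, and the helper functions are not part of Keras. Accuracy at a 0.5 threshold is included for contrast, since it is the metric the models in this notebook are compiled with.

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of MSE, on the same scale as the predictions."""
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Share of examples whose thresholded prediction equals the label."""
    hits = sum(int(p > threshold) == t for t, p in zip(y_true, y_prob))
    return hits / len(y_true)


y_true = [0, 1, 1, 0]          # hypothetical labels
y_prob = [0.2, 0.9, 0.4, 0.1]  # hypothetical sigmoid outputs

print("MSE:      %.3f" % mse(y_true, y_prob))
print("RMSE:     %.3f" % rmse(y_true, y_prob))
print("MAE:      %.3f" % mae(y_true, y_prob))
print("Accuracy: %.3f" % accuracy(y_true, y_prob))
```

Note the opposite directions: for MSE, RMSE and MAE lower is better, while for accuracy higher is better.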




In [17]:
loss, acc = model.evaluate(test_df, verbose=2)  # evaluate loss and accuracy on the test dataset
print('Accuracy: %.3f' % acc)

782/782 - 10s - loss: 0.6762 - accuracy: 0.5723
Accuracy: 0.572


In [18]:
loss, acc = model.evaluate(train_df, verbose=2)  # evaluate loss and accuracy on the training dataset
print('Accuracy: %.3f' % acc)

782/782 - 11s - loss: 0.6377 - accuracy: 0.6205
Accuracy: 0.620


In [19]:
text = "I do not like the movie !" 
model.predict([text])  # a result below 0.5 means the text is classified as negative sentiment

array([[0.4639558]], dtype=float32)

In [20]:
text = "I loved the movie !"
model.predict([text])  # a result above 0.5 would mean the text is classified as positive sentiment,
# but the simple RNN returns practically the same score as for the negative sentence and fails to detect the positive sentiment.

array([[0.46395582]], dtype=float32)
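
The decision rule used in the comments above (scores above 0.5 are positive, the rest negative) can be written as a small helper. This is a sketch: `to_sentiment` is a name invented here, and the scores are copied from the two RNN predictions above.

```python
def to_sentiment(score, threshold=0.5):
    """Turn a sigmoid output into a sentiment label."""
    return "positive" if score > threshold else "negative"


rnn_scores = {
    "I do not like the movie !": 0.4639558,
    "I loved the movie !": 0.46395582,
}
labels = {text: to_sentiment(score) for text, score in rnn_scores.items()}
for text, label in labels.items():
    print(f"{text!r}: {label}")
```

Both sentences end up with the same label, and their scores are practically identical. That suggests this simple RNN is barely reacting to the input text, which fits its test accuracy of about 0.57.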

#**LSTM**

In [1]:
import tensorflow as tf
from tensorflow.keras.preprocessing import text_dataset_from_directory
from tensorflow.strings import regex_replace
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras.models import Sequential
from tensorflow.keras import Input
from tensorflow.keras.layers import Dense, LSTM, Embedding, Dropout

In [2]:
def prepareData(dir):
  data = text_dataset_from_directory(dir)
  return data.map(
    lambda text, label: (regex_replace(text, '<br />', ' '), label),
  )

In [3]:
train_data = prepareData('movie-reviews-dataset/train') #creating dataset object for both train and test data
test_data = prepareData('movie-reviews-dataset/test')

Found 25000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


In [4]:
model = tf.keras.models.Sequential()  # initialize the Sequential model
model.add(tf.keras.Input(shape=(1,), dtype='string'))  # input layer: the model takes raw strings as input




In [5]:
max_tokens = 1000   # vocabulary size: only the most frequent tokens are kept
max_len = 100
vectorize_layer = TextVectorization(max_tokens=max_tokens, output_mode="int", output_sequence_length=max_len)

In [6]:
train_texts = train_data.map(lambda text, label: text)  
vectorize_layer.adapt(train_texts)

model.add(vectorize_layer)

In [7]:
model.add(Embedding(max_tokens + 1, 128))

model.add(LSTM(64))
model.add(Dense(64, activation="relu"))
model.add(Dense(1, activation="sigmoid"))

In [8]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
text_vectorization (TextVect (None, 100)               0         
_________________________________________________________________
embedding (Embedding)        (None, 100, 128)          128128    
_________________________________________________________________
lstm (LSTM)                  (None, 64)                49408     
_________________________________________________________________
dense (Dense)                (None, 64)                4160      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65        
Total params: 181,761
Trainable params: 181,761
Non-trainable params: 0
_________________________________________________________________


In [10]:
model.compile(loss="binary_crossentropy", optimizer="Adam", metrics=["accuracy"])
model.fit(train_data, epochs=10, verbose=2)  # train_data is already batched (32 per batch by default), so no batch_size is passed

Epoch 1/10
782/782 - 56s - loss: 0.3934 - accuracy: 0.8255
Epoch 2/10
782/782 - 54s - loss: 0.3676 - accuracy: 0.8369
Epoch 3/10
782/782 - 54s - loss: 0.3516 - accuracy: 0.8466
Epoch 4/10
782/782 - 54s - loss: 0.3431 - accuracy: 0.8540
Epoch 5/10
782/782 - 54s - loss: 0.3286 - accuracy: 0.8596
Epoch 6/10
782/782 - 55s - loss: 0.3221 - accuracy: 0.8653
Epoch 7/10
782/782 - 55s - loss: 0.3060 - accuracy: 0.8716
Epoch 8/10
782/782 - 55s - loss: 0.2854 - accuracy: 0.8806
Epoch 9/10
782/782 - 55s - loss: 0.2977 - accuracy: 0.8734
Epoch 10/10
782/782 - 54s - loss: 0.2625 - accuracy: 0.8904


<tensorflow.python.keras.callbacks.History at 0x7fb75c997c90>

In [12]:
loss, acc = model.evaluate(test_data, verbose=2)  # evaluate loss and accuracy on the test dataset
print('Accuracy: %.3f' % acc)

782/782 - 17s - loss: 0.6003 - accuracy: 0.7700
Accuracy: 0.770


In [13]:
text = "I don't like the movie !"
model.predict([text])  # the result is below 0.5, so the text is classified as a negative review (negative sentiment)

array([[0.32062522]], dtype=float32)

In [14]:

text = "I loved the movie !"
model.predict([text])  # the result is above 0.5, so the text is classified as a positive review (positive sentiment)

array([[0.8115357]], dtype=float32)

#**GRU**

In [1]:
from tensorflow.keras.preprocessing import text_dataset_from_directory
from tensorflow.strings import regex_replace
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras.models import Sequential
from tensorflow.keras import Input
from tensorflow.keras.layers import Dense, GRU, Embedding, Dropout

In [2]:
def prepareData(dir):
  data = text_dataset_from_directory(dir)
  return data.map(
    lambda text, label: (regex_replace(text, '<br />', ' '), label),
  )

In [3]:
train_data = prepareData('movie-reviews-dataset/train')
test_data = prepareData('movie-reviews-dataset/test')

Found 25000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.
b"Nice movie with a great soundtrack which spans through the rock landscape of the 70's and 80's. Radiofreccia describes a generation, it describes life in a small village near Correggio (hometown of Ligabue, the singer who wrote the book that inspired the movie), it describes life of young people and their problems relating to the world. It reminds of Trainspotting, with a bit of Italian touch."
1


In [4]:
model = Sequential()
model.add(Input(shape=(1,), dtype="string"))



In [5]:
max_tokens = 1000
max_len = 100
vectorize_layer = TextVectorization(max_tokens=max_tokens, output_mode="int", output_sequence_length=max_len,)

In [6]:
train_texts = train_data.map(lambda text, label: text)
vectorize_layer.adapt(train_texts)

model.add(vectorize_layer)

In [7]:
model.add(Embedding(max_tokens + 1, 128))

model.add(GRU(128))
model.add(Dense(64, activation="relu"))
model.add(Dense(1, activation="sigmoid"))

In [10]:
model.compile(loss="mse", optimizer="Adam", metrics=["accuracy"])

In [11]:
model.fit(train_data, epochs=10, verbose=2)  # train_data is already batched (32 per batch by default), so no batch_size is passed

Epoch 1/10
782/782 - 78s - loss: 0.1448 - accuracy: 0.7946
Epoch 2/10
782/782 - 76s - loss: 0.1274 - accuracy: 0.8215
Epoch 3/10
782/782 - 76s - loss: 0.1192 - accuracy: 0.8370
Epoch 4/10
782/782 - 76s - loss: 0.1111 - accuracy: 0.8500
Epoch 5/10
782/782 - 76s - loss: 0.1037 - accuracy: 0.8637
Epoch 6/10
782/782 - 76s - loss: 0.0984 - accuracy: 0.8710
Epoch 7/10
782/782 - 76s - loss: 0.0919 - accuracy: 0.8822
Epoch 8/10
782/782 - 76s - loss: 0.0865 - accuracy: 0.8900
Epoch 9/10
782/782 - 76s - loss: 0.0810 - accuracy: 0.8983
Epoch 10/10
782/782 - 76s - loss: 0.0751 - accuracy: 0.9074


<tensorflow.python.keras.callbacks.History at 0x7f4f2940bd90>

In [46]:
loss, acc = model.evaluate(test_data, verbose=0)
print('Accuracy on test dataset: %.3f' % acc)

Accuracy on test dataset: 0.776


In [47]:
loss, acc = model.evaluate(train_data, verbose=0)
print('Accuracy on training dataset: %.3f' % acc)

Accuracy on training dataset: 0.910


In [16]:
text = "I loved the movie !"   # Prediction result > 0.5, this predicted the text as positive sentiment.
model.predict([text])

array([[0.9731984]], dtype=float32)

In [17]:
text = "I do not like the movie !"   # Prediction result < 0.5, so the text is predicted as negative sentiment.
model.predict([text])

array([[0.3991946]], dtype=float32)

#**BiDirectional_LSTM**

In [3]:
from tensorflow.keras.preprocessing import text_dataset_from_directory
from tensorflow.strings import regex_replace
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras.models import Sequential
from tensorflow.keras import Input
from tensorflow.keras.layers import Dense, LSTM, Embedding, Dropout , Bidirectional

In [4]:
def prepareData(dir):
  data = text_dataset_from_directory(dir)
  return data.map( lambda text, label: (regex_replace(text, '<br />', ' '), label),)

In [5]:
train_data = prepareData('movie-reviews-dataset/train')
test_data = prepareData('movie-reviews-dataset/test')

Found 25000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


In [6]:
model = Sequential()
model.add(Input(shape=(1,), dtype="string"))



In [7]:
max_tokens = 1000
max_len = 100
vectorize_layer = TextVectorization(
  max_tokens=max_tokens,
  output_mode="int",
  output_sequence_length=max_len,
)

In [8]:
train_texts = train_data.map(lambda text, label: text)
vectorize_layer.adapt(train_texts)

model.add(vectorize_layer)

In [9]:
model.add(Embedding(max_tokens + 1, 128))

model.add(Bidirectional(LSTM(64)))
model.add(Dense(64, activation="relu"))
model.add(Dense(1, activation="sigmoid"))

In [10]:
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

In [11]:
model.fit(train_data, epochs=10)  # train_data is already batched (32 per batch by default), so no batch_size is passed

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f9430130f10>

In [13]:
loss, acc = model.evaluate(test_data, verbose=0)
print('Accuracy on test dataset: %.3f' % acc)

Accuracy on test dataset: 0.774


In [14]:
loss, acc = model.evaluate(train_data, verbose=0)
print('Accuracy on training dataset: %.3f' % acc)

Accuracy on training dataset: 0.908


In [18]:
text = "I loved the movie !"  # Prediction result > 0.5, this predicted the text as positive sentiment.
model.predict([text])

array([[0.99359655]], dtype=float32)

#**Bidirectional GRU**

In [1]:
from tensorflow.keras.preprocessing import text_dataset_from_directory
from tensorflow.strings import regex_replace
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras.models import Sequential
from tensorflow.keras import Input
from tensorflow.keras.layers import Dense, GRU, Embedding, Dropout , Bidirectional

In [2]:
def prepareData(dir):
  data = text_dataset_from_directory(dir)
  return data.map( lambda text, label: (regex_replace(text, '<br />', ' '), label),)

In [3]:
train_data = prepareData('movie-reviews-dataset/train')
test_data = prepareData('movie-reviews-dataset/test')


Found 25000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


In [4]:
model = Sequential()
model.add(Input(shape=(1,), dtype="string"))



In [5]:
max_tokens = 1000
max_len = 100
vectorize_layer = TextVectorization(
  max_tokens=max_tokens,
  output_mode="int",
  output_sequence_length=max_len,
)

In [6]:
train_texts = train_data.map(lambda text, label: text)
vectorize_layer.adapt(train_texts)

model.add(vectorize_layer)

In [7]:
model.add(Embedding(max_tokens + 1, 128))

model.add(Bidirectional(GRU(64)))
model.add(Dense(64, activation="relu"))
model.add(Dense(1, activation="sigmoid"))

In [8]:
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(train_data, epochs=10)  # train_data is already batched (32 per batch by default), so no batch_size is passed

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f642e68ca90>

In [10]:
model.evaluate(test_data, verbose=0)

[0.7846997380256653, 0.7623199820518494]

In [11]:
model.evaluate(train_data, verbose=0)

[0.1765107661485672, 0.9256799817085266]

In [12]:
text = "I love the movie !"
model.predict([text])


array([[0.8951876]], dtype=float32)
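
To put the five models side by side, the sketch below collects the accuracies printed in the sections above (rounded to three decimals) and ranks the models by test accuracy. Two assumptions are made here: for the GRU, 0.776 is taken as the test accuracy and 0.910 as the training accuracy, which is what the `evaluate` calls in that section compute; and the LSTM training accuracy is left as `None` because no evaluation on the training data is shown for it.

```python
# (training accuracy, test accuracy) as reported above; None = not reported
results = {
    "RNN": (0.620, 0.572),
    "LSTM": (None, 0.770),
    "GRU": (0.910, 0.776),
    "Bidirectional LSTM": (0.908, 0.774),
    "Bidirectional GRU": (0.926, 0.762),
}

# Best test accuracy first
ranking = sorted(results, key=lambda name: results[name][1], reverse=True)


def gap(name):
    """Training accuracy minus test accuracy, a rough sign of overfitting."""
    train, test = results[name]
    return None if train is None else round(train - test, 3)


for name in ranking:
    train, test = results[name]
    train_str = "n/a" if train is None else f"{train:.3f}"
    gap_str = "n/a" if gap(name) is None else f"{gap(name):.3f}"
    print(f"{name:<20} test={test:.3f}  train={train_str}  gap={gap_str}")
```

On these numbers the gated models (LSTM, GRU and their bidirectional variants) all reach a test accuracy of roughly 0.76 to 0.78, well ahead of the simple RNN, while their much higher training accuracy points to overfitting after 10 epochs.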