# Sentiment Analysis using Recurrent Models

This notebook compares the performance of RNN, LSTM, GRU, and BiLSTM for sentiment analysis using the IMDB dataset. Each model is trained for 10 epochs with fixed units, using the Adam optimizer.

## Overview
The key steps involve importing the dataset, defining and training various recurrent models, and comparing their performance.

## Procedure
- **Dataset Preparation**: Import the IMDB dataset and convert it to vector form using the Bag of Words technique.
- **RNN Model**: Define and train an RNN model on the dataset.
- **LSTM Model**: Define and train an LSTM model on the dataset.
- **GRU Model**: Define and train a GRU model on the dataset.
- **BiLSTM Model**: Define and train a BiLSTM model on the dataset.
- **Performance Comparison**: Compare the accuracy of all models to determine the best performer.

References:
- [IMDB Dataset of 50K Movie Reviews](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews)

# 1.1

In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv("imdb.csv", usecols=["review", "sentiment"], encoding='latin-1')
## 1 - positive, 0 - negative
df.sentiment = (df.sentiment == "positive").astype("int")
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1


In [3]:
val_size = int(df.shape[0] * 0.15)
test_size = int(df.shape[0] * 0.15)


def train_val_test_split(df=None, train_percent=0.7, test_percent=0.15, val_percent=0.15):
  df = df.sample(frac=1)
  train_df = df[: int(len(df)*train_percent)]
  test_df = df[int(len(df)*train_percent)+1 : int(len(df)*(train_percent+test_percent))]
  val_df = df[int(len(df)*(train_percent + test_percent))+1 : ]
  return train_df, test_df, val_df

train_df, test_df, val_df = train_val_test_split(df, 0.7, 0.15, 0.15)
train_labels, train_texts = train_df.values[:,1], train_df.values[:,0]
val_labels, val_texts = val_df.values[:,1], val_df.values[:,0]
test_labels, test_texts = test_df.values[:,1], test_df.values[:,0]
print(len(train_df), len(test_df), len(val_df))
print(len(train_texts), len(train_labels), len(val_df))

35000 7499 7499
35000 35000 7499


In [4]:
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English

def process_tokens(text):
    """
    function to process tokens, replace any unwanted chars
    """
    preprocessed_text = text.lower().replace(",", "").replace(".", "").replace(":", "").replace(")", "").replace("-", "").replace("(", "")
    preprocessed_text = ''.join([i for i in preprocessed_text if not preprocessed_text.isdigit()])
    return preprocessed_text

def preprocessing(data):
    """
    preprocessing data to list of tokens
    """
    nlp = English()
    tokenizer = Tokenizer(nlp.vocab)
    preprocessed_data = []
    for sentence in data:
        sentence = process_tokens(sentence)
        tokens = tokenizer(sentence)
        tlist = []
        for token in tokens:
            tlist.append(str(token))
        preprocessed_data.append(tlist)
    return preprocessed_data

train_data = preprocessing(train_texts)
val_data = preprocessing(val_texts)
test_data = preprocessing(test_texts)

In [5]:
import numpy as np
import itertools

## Creating a vectorizer to vectorize text and create matrix of features
## Bag of words technique
class Vectorizer():
    def __init__(self, max_features):
        self.max_features = max_features
        self.vocab_list = None
        self.token_to_index = None

    def fit(self, dataset):
        word_dict = {}
        for sentence in dataset:
            for token in sentence:
                if token not in word_dict:
                    word_dict[token] = 1
                else:
                    word_dict[token] += 1
        word_dict = dict(sorted(word_dict.items(), key=lambda item: item[1], reverse=True))
        end_to_slice = min(len(word_dict), self.max_features)
        word_dict = dict(itertools.islice(word_dict.items(), end_to_slice))
        self.vocab_list = list(word_dict.keys())
        self.token_to_index = {}
        counter = 0
        for token in self.vocab_list:
            self.token_to_index[token] = counter
            counter += 1


    def transform(self, dataset):
        data_matrix = np.zeros((len(dataset), len(self.vocab_list)))
        for i, sentence in enumerate(dataset):
            for token in sentence:
                if token in self.token_to_index:
                    data_matrix[i, self.token_to_index[token]] += 1
        return data_matrix

## max features - top k words to consider only
max_features = 2000

vectorizer = Vectorizer(max_features=max_features)
vectorizer.fit(train_data)

## Checking if the len of vocab = k
X_train = vectorizer.transform(train_data)
X_val = vectorizer.transform(val_data)
X_test = vectorizer.transform(test_data)

y_train = np.array(train_labels)
y_val = np.array(val_labels)
y_test = np.array(test_labels)

vocab = vectorizer.vocab_list

In [6]:
import tensorflow as tf
keras = tf.keras
from keras.utils import to_categorical
y_train = y_train.astype('int')
y_val = y_val.astype('int')
y_test = y_test.astype('int')

y_train = to_categorical(y_train, 2)
y_test = to_categorical(y_test, 2)
y_val = to_categorical(y_val, 2)

X_train = X_train.reshape(-1, 1, X_train.shape[1])
X_val = X_val.reshape(-1, 1, X_val.shape[1])
X_test = X_test.reshape(-1, 1, X_test.shape[1])

y_train = y_train.reshape(-1, 2)
y_val = y_val.reshape(-1, 2)
y_test = y_test.reshape(-1, 2)

print(f'X_train.shape: {X_train.shape}, y_train.shape: {y_train.shape}')

X_train.shape: (35000, 1, 2000), y_train.shape: (35000, 2)


In [7]:
model_results = {}

## Parameters for all models

In [8]:
## Parameters for all models
BATCH_SIZE = 256
LR = 0.01
EPOCHS = 10

# 1.2

In [9]:
import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import SimpleRNN, Dropout
from keras.optimizers.legacy import Adam


model = None
model = Sequential()
model.add(SimpleRNN(256, input_shape=(1, max_features)))
model.add(Dense(2, activation='softmax'))

optimizer = Adam(learning_rate=LR)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, 
              metrics=['accuracy'], )
print(model.summary())

with tf.device('/device:GPU:0'):
    history = model.fit(X_train, y_train,
            batch_size=BATCH_SIZE,
            validation_data=(X_val, y_val),
            epochs=EPOCHS)
print(history.history.keys())

score, acc = model.evaluate(X_test, y_test, verbose=0)
print('Test loss:', score)
print('Test accuracy:', acc)

model_results['RNN'] = {
    'loss': score,
    'accuracy': acc,
    'history': history
}

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 simple_rnn (SimpleRNN)      (None, 256)               577792    
                                                                 
 dense (Dense)               (None, 2)                 514       
                                                                 
Total params: 578306 (2.21 MB)
Trainable params: 578306 (2.21 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
None
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])
Test loss: 0.5322189331054688
Test accuracy: 0.8718495965003967


# 1.3

In [None]:
from keras.layers import LSTM

model = None
model = Sequential()
model.add(LSTM(256, input_shape=(1, max_features)))
model.add(Dense(2, activation='softmax'))

optimizer = Adam(learning_rate=LR)
model.compile(
    loss='categorical_crossentropy', 
    optimizer=optimizer, 
     metrics=['accuracy'])

print(model.summary())

with tf.device('/device:GPU:0'):
    history = model.fit(X_train, y_train,
            batch_size=BATCH_SIZE,
            validation_data=(X_val, y_val),
            epochs=EPOCHS)
print(history.history.keys())

score, acc = model.evaluate(X_test, y_test, verbose=0)
print('Test loss: {:.5f}'.format(score))
print('Test accuracy: {:.5f}'.format(acc))

model_results['LSTM'] = {
    'loss': score,
    'accuracy': acc,
    'history': history,
    'model': model
}



Model: "sequential_21"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm_10 (LSTM)              (None, 256)               2311168   
                                                                 
 dense_21 (Dense)            (None, 2)                 514       
                                                                 
Total params: 2311682 (8.82 MB)
Trainable params: 2311682 (8.82 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
None
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])
Test loss: 0.55006
Test accuracy: 0.86732


# 1.4

In [52]:
from keras.layers import GRU

model = None
model = Sequential()
model.add(GRU(256, input_shape=(1, max_features)))
model.add(Dense(2, activation='softmax'))

optimizer = Adam(learning_rate=LR)
model.compile(
    loss='categorical_crossentropy', 
    optimizer=optimizer, 
     metrics=['accuracy'])

print(model.summary())

with tf.device('/device:GPU:0'):
    history = model.fit(X_train, y_train,
            batch_size=BATCH_SIZE,
            validation_data=(X_val, y_val),
            epochs=EPOCHS)
    
print(history.history.keys())

score, acc = model.evaluate(X_test, y_test, verbose=0)
print('Test loss: {:.4f}'.format(score))
print('Test accuracy: {:.4f}'.format(acc))

model_results['GRU'] = {
    'loss': score,
    'accuracy': acc,
    'history': history,
    'model': model
}

Model: "sequential_22"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 gru_1 (GRU)                 (None, 256)               1734144   
                                                                 
 dense_22 (Dense)            (None, 2)                 514       
                                                                 
Total params: 1734658 (6.62 MB)
Trainable params: 1734658 (6.62 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
None
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])
Test loss: 0.5596
Test accuracy: 0.8685


# 1.5

In [53]:
from keras.layers import Bidirectional

model = None
model = Sequential()
model.add(Bidirectional(LSTM(256), input_shape=(1, max_features)))
model.add(Dense(2, activation='softmax'))

optimizer = Adam(learning_rate=LR)
model.compile(
    loss='categorical_crossentropy', 
    optimizer=optimizer, 
     metrics=['accuracy'])

print(model.summary())

with tf.device('/device:GPU:0'):
    history = model.fit(X_train, y_train,
            batch_size=BATCH_SIZE,
            validation_data=(X_val, y_val),
            epochs=EPOCHS)
    
print(history.history.keys())

score, acc = model.evaluate(X_test, y_test, verbose=0)
print('Test loss: {:.4f}'.format(score))
print('Test accuracy: {:.4f}'.format(acc))

model_results['BiLSTM'] = {
    'loss': score,
    'accuracy': acc,
    'history': history,
    'model': model
}

Model: "sequential_23"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bidirectional_8 (Bidirecti  (None, 512)               4622336   
 onal)                                                           
                                                                 
 dense_23 (Dense)            (None, 2)                 1026      
                                                                 
Total params: 4623362 (17.64 MB)
Trainable params: 4623362 (17.64 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
None
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])
Test loss: 0.5759
Test accuracy: 0.8702


# 1.6

In [72]:
%precision %.4f
for model_type, results in model_results.items():
    print(f'{model_type:6s} - Test loss: {results["loss"]:.4f}, Test accuracy: {results["accuracy"]:.4f}')

RNN    - Test loss: 0.5377, Test accuracy: 0.8690
LSTM   - Test loss: 0.5501, Test accuracy: 0.8673
GRU    - Test loss: 0.5596, Test accuracy: 0.8685
BiLSTM - Test loss: 0.5759, Test accuracy: 0.8702


Compare the performance of all the models. In which case do you get the best accuracy?

``` text
RNN    - Test loss: 0.5377, Test accuracy: 0.8690
LSTM   - Test loss: 0.5501, Test accuracy: 0.8673
GRU    - Test loss: 0.5596, Test accuracy: 0.8685
BiLSTM - Test loss: 0.5759, Test accuracy: 0.8702
```

The BiLSTM model outperformed the other models with a test accuracy of 0.8702. This is likely due to its ability to capture contextual information from both forward and backward directions of the input sequence. The RNN performed the worst with a test accuracy of 0.8690, likely due to its inability to capture long-term dependencies in the input sequence compared to LSTM, GRU, and BiLSTM. The LSTM and GRU models performed similarly, with test accuracies of 0.8673 and 0.8685.