_Sentiment Classification using RNN, LSTM_

1. Train a RNN based sentiment analysis model for classification of movie reviews.
    - Explore and learn about the different preprocessing steps in the Natural Language Processing(NLP) domain.
    - Apply suitable preprocessing steps for this sentiment analysis assignment.
    - Build and train a RNN model using basic layers from the framework.
    - Test model on the test set using suitable evaluation metrics.

2.  Train a LSTM based model for the same sentiment analysis problem.
    - Build and train a LSTM model using basic layers from the framework.
    - Test model on the test set using suitable evaluation metrics.

Compare between the two approaches and highlight the improvements.


- Dataset:  Stanford Sentiment Treebank 2 
- Original dataset link: https://huggingface.co/datasets/stanfordnlp/sst2 
- Dataset Zip Link: https://drive.google.com/file/d/1TytoIgt7KI9Ep9bo8bs_X0HSSnBJX0oi/


Data fields in dataset:
- idx: Monotonically increasing index ID.
- sentence: Complete sentence expressing an opinion about a film.
- label: Sentiment of the opinion, either "negative" (0) or positive (1).

Split the provided train dataset of 67349 rows into 5000 for testing and rest for training. Use the separately provided validation dataset (872 rows) file for validation.


In [14]:
import tensorflow as tf
import pandas as pd
import numpy as np

In [15]:
def prepare_dataset(filename):
    df=pd.read_parquet(filename)
    df['sentence']=df['sentence'].str.lower()
    df['sentence']=df['sentence'].replace(r'[^a-z0-9\s]','',regex=True)
    return df
df=prepare_dataset('./sst2_sentiment_dataset/sst2_train.parquet')
display(df)
validation_df=prepare_dataset('./sst2_sentiment_dataset/sst2_valid.parquet')
display(validation_df)

Unnamed: 0,idx,sentence,label
0,0,hide new secretions from the parental units,0
1,1,contains no wit only labored gags,0
2,2,that loves its characters and communicates som...,1
3,3,remains utterly satisfied to remain the same t...,0
4,4,on the worst revengeofthenerds clichs the film...,0
...,...,...,...
67344,67344,a delightful comedy,1
67345,67345,anguish anger and frustration,0
67346,67346,at achieving the modest crowdpleasing goals i...,1
67347,67347,a patient viewer,1


Unnamed: 0,idx,sentence,label
0,0,it s a charming and often affecting journey,1
1,1,unflinchingly bleak and desperate,0
2,2,allows us to hope that nolan is poised to emba...,1
3,3,the acting costumes music cinematography an...,1
4,4,it s slow very very slow,0
...,...,...,...
867,867,has all the depth of a wading pool,0
868,868,a movie with a real anarchic flair,1
869,869,a subject like this should inspire reaction in...,0
870,870,is an arthritic attempt at directing by calli...,0


In [16]:
max_features = 5000  
max_length = 200

tokenizer=tf.keras.preprocessing.text.Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(pd.merge(df['sentence'],validation_df['sentence']))
X=tf.keras.preprocessing.sequence.pad_sequences(tokenizer.texts_to_sequences(df['sentence']),maxlen=max_length)
y=df['label'].values
X_val=tf.keras.preprocessing.sequence.pad_sequences(tokenizer.texts_to_sequences(validation_df['sentence']),maxlen=max_length)
y_val=validation_df['label'].values
print("Main Dataframe:")
display(X.shape,y.shape)
print("Validation Dataframe:")
display(X_val.shape,y_val.shape)

Main Dataframe:


(67349, 200)

(67349,)

Validation Dataframe:


(872, 200)

(872,)

In [17]:
dataset=tf.data.Dataset.from_tensor_slices((X,y))
dataset.shuffle(1000)
test_dataset=dataset.take(5000)
train_dataset=dataset.skip(5000)
validation_dataset=tf.data.Dataset.from_tensor_slices((X_val,y_val))

for i in train_dataset.take(1):
    print("Train Dataset:")
    print(i)
for i in test_dataset.take(1):
    print("\nTest Dataset:")
    print(i)
for i in validation_dataset.take(1):
    print("\nValidation Dataset:")
    print(i)

Train Dataset:
(<tf.Tensor: shape=(200,), dtype=int32, numpy=
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0], dtype=int32)>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)

Test Dataset:
(<tf.Tensor: shape=(200,), dtype=int32, numpy=
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

2025-03-17 10:26:02.756312: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


In [18]:
RNN_model=tf.keras.Sequential([
    # tf.keras.layers.Embedding(input_dim=max_features,output_dim=16,input_length=max_length),
    # tf.keras.layers.SimpleRNN(64,activation='tanh',return_sequences=False),
    tf.keras.layers.Dense(1,activation='sigmoid')
])

RNN_model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

RNN_model.summary()