
# Task
Implement and compare a Recurrent Neural Network (RNN) and a feedforward model in PyTorch for sentiment classification using the dataset at "/content/IMDB Dataset.csv". Analyze their performance on sequential data.

## Load the data

### Subtask:
Load the IMDB dataset into a pandas DataFrame.


**Reasoning**:
Import the pandas library and load the dataset into a DataFrame, then display the head and info of the DataFrame.



In [1]:
import pandas as pd

df = pd.read_csv('/content/IMDB Dataset.csv')

display(df.head())
display(df.info())  # info() prints its summary itself and returns None, so display() adds the trailing "None"

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. `<br /><br />`The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


None

## Preprocess the data

### Subtask:
Clean and preprocess the text data, including tokenization and padding.


**Reasoning**:
Convert sentiment to numerical, clean text data, tokenize, and pad sequences.



In [2]:
import re
import nltk
from nltk.corpus import stopwords
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

nltk.download('stopwords')

# 1. Convert sentiment to numerical
df['sentiment'] = df['sentiment'].apply(lambda x: 1 if x == 'positive' else 0)

# 2. Clean text data
def clean_text(text):
    text = re.sub(r'<br />', ' ', text) # Remove HTML br tags
    text = re.sub(r'[^\w\s]', '', text) # Remove punctuation
    text = text.lower() # Convert to lowercase
    return text

df['cleaned_review'] = df['review'].apply(clean_text)

# 3. Tokenize the cleaned text data
tokenizer = Tokenizer(num_words=5000) # Cap the vocabulary: texts_to_sequences only emits indices 1-4999 (the most frequent words)
tokenizer.fit_on_texts(df['cleaned_review'])
sequences = tokenizer.texts_to_sequences(df['cleaned_review'])

# 4. Pad the tokenized sequences
max_len = 200 # Define a fixed sequence length
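# 'post' keeps the first max_len tokens and appends zeros, so reviews shorter than max_len end in padding
padded_sequences = pad_sequences(sequences, maxlen=max_len, padding='post', truncating='post')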

display(df.head())
display(padded_sequences[:5])

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


| | review | sentiment | cleaned_review |
|---|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | 1 | one of the other reviewers has mentioned that ... |
| 1 | A wonderful little production. `<br /><br />`The... | 1 | a wonderful little production the filming te... |
| 2 | I thought this was a wonderful way to spend ti... | 1 | i thought this was a wonderful way to spend ti... |
| 3 | Basically there's a family where a little boy ... | 0 | basically theres a family where a little boy j... |
| 4 | Petter Mattei's "Love in the Time of Money" is... | 1 | petter matteis love in the time of money is a ... |


array([[  27,    4,    1,   75, 1944,   44, 1062,   11,  100,  143,   39,
         477, 3300,  388,  469,   25, 3141,   33,   22,  200,   13,   10,
           6,  601,   47,  578,   14,   68,    1,   87,  145,   11, 3259,
          68,   41, 3300,   12,   28,    2,  132,    4,  576,   60,  283,
           7,  200,   34,    1,  669,  137, 1703,   68,   10,    6,   20,
           3,  117,   15,    1,   38,   10,  117, 2516,   55,   14,    5,
        1448,  382,   38,  576,   28,    6, 3382,    7,    1,  349,  345,
           4,    1,  669,    8,    6,  472, 3300,   13,   11,    6,    1,
         344,    5,    1, 2525, 1058,    8, 2648, 1381,   19,  530,   32,
        4725, 2507,    4,    1, 1190,  113,   30,    1,   24, 2945,    2,
         398,   36,    6,   20,  318,   19,    1, 4906, 3569,  530,    6,
         337,    5, 2476,    2,  323,    2,   22,  108,  225,  237,    9,
          56,  129,    1,  272, 1290,    4,    1,  117,    6,  670,    5,
           1,  187,   11,    8,  263, 
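
To make the `num_words` cap and the `padding='post'` choice concrete, here is a minimal pure-Python sketch. The helpers `build_index`, `encode`, and `pad` are hypothetical stand-ins written for illustration only; they mimic the behaviour used in the cell above and are not the Keras implementation.

```python
from collections import Counter

def build_index(texts):
    """Rank words by frequency; index 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(text, index, num_words):
    """Keep only words whose index is below num_words; rarer words are dropped."""
    return [index[w] for w in text.split() if w in index and index[w] < num_words]

def pad(seq, max_len, padding="post"):
    """Truncate to the first max_len tokens, then zero-pad after ('post') or before ('pre')."""
    seq = seq[:max_len]
    zeros = [0] * (max_len - len(seq))
    return seq + zeros if padding == "post" else zeros + seq

texts = ["the film was great", "the film was dull and the plot was thin"]
index = build_index(texts)
short = encode(texts[0], index, num_words=4)  # only indices 1..3 survive the cap

print("post:", pad(short, 6, "post"))  # real tokens first, zeros at the end
print("pre: ", pad(short, 6, "pre"))   # zeros first, real tokens at the end
```

With post-padding the last positions of a short review are all zeros, which matters for any model that reads only the final time step.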

## Split the data

### Subtask:
Split the dataset into training and testing sets.


**Reasoning**:
Split the data into training and testing sets using the padded sequences as features and the sentiment column as the target variable.



In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(padded_sequences, df['sentiment'], test_size=0.2, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (40000, 200)
Shape of X_test: (10000, 200)
Shape of y_train: (40000,)
Shape of y_test: (10000,)


## Create vocabulary and embeddings

### Subtask:
Determine the vocabulary size from the fitted tokenizer and choose the dimension of the word embeddings.


**Reasoning**:
Determine the vocabulary size and define the embedding dimension.



In [4]:
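# word_index lists every distinct word seen during fitting, not just the num_words=5000 cap,
# so this value oversizes the embedding table; the padded sequences only contain indices below 5000.
vocab_size = len(tokenizer.word_index) + 1 # Add 1 for the padding index 0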
embedding_dim = 100 # Define the embedding dimension

print("Vocabulary Size:", vocab_size)
print("Embedding Dimension:", embedding_dim)

Vocabulary Size: 168031
Embedding Dimension: 100
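
A quick arithmetic sketch of what this vocabulary size means for the embedding table. It assumes, as the tokenizer was created with `num_words=5000`, that only indices 0 to 4999 ever reach the model; the variable names are illustrative.

```python
embedding_dim = 100
full_index_rows = 168_031  # len(tokenizer.word_index) + 1, as printed above
capped_rows = 5_000        # assumption: indices 0..4999 are the only ones emitted

full_params = full_index_rows * embedding_dim
capped_params = capped_rows * embedding_dim

print(f"Embedding weights with full index:   {full_params:,}")
print(f"Embedding weights with capped index: {capped_params:,}")
print(f"Rows that are never looked up:       {full_index_rows - capped_rows:,}")
```

Sizing the table from the cap instead would leave the model's behaviour on these inputs unchanged while storing far fewer weights.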


## Implement the feedforward model

### Subtask:
Define and implement a simple feedforward neural network for sentiment classification.


**Reasoning**:
Implement the feedforward neural network model as requested in the subtask.



In [5]:
import torch
import torch.nn as nn
import torch.optim as optim

class FeedforwardNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.fc1 = nn.Linear(embedding_dim * max_len, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, text):
        embedded = self.embedding(text)
        flattened = embedded.view(embedded.size(0), -1) # Flatten the embedded sequences
        hidden = self.fc1(flattened)
        activated = self.relu(hidden)
        output = self.fc2(activated)
        return output

# Define model parameters
hidden_dim = 64
output_dim = 1 # For binary classification

# Instantiate the model
ff_model = FeedforwardNN(vocab_size, embedding_dim, hidden_dim, output_dim)

print(ff_model)

FeedforwardNN(
  (embedding): Embedding(168031, 100)
  (fc1): Linear(in_features=20000, out_features=64, bias=True)
  (relu): ReLU()
  (fc2): Linear(in_features=64, out_features=1, bias=True)
)
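
The flattening step in `forward` concatenates the token vectors in order, so the first linear layer sees a separate set of features for every position. The pure-Python sketch below contrasts that with mean pooling, which is an order-free, bag-of-words style summary. The toy vectors and the helper names `flatten` and `mean_pool` are made up for illustration.

```python
def flatten(embedded):
    """Concatenate per-token vectors in order, like embedded.view(batch, -1) for one sample."""
    return [value for vector in embedded for value in vector]

def mean_pool(embedded):
    """Average per-token vectors: an order-free summary of the same tokens."""
    n = len(embedded)
    return [sum(column) / n for column in zip(*embedded)]

# Toy 2-dimensional embeddings for a three-token review (values are made up).
not_, good, film = [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]
review_a = [not_, good, film]
review_b = [good, not_, film]  # same tokens, different order

print(flatten(review_a) == flatten(review_b))      # position changes the flattened features
print(mean_pool(review_a) == mean_pool(review_b))  # averaging discards order
```

So the feedforward model here is position-sensitive rather than a bag of words, although it has no mechanism for sharing what it learns about a word across positions.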


## Implement the RNN model

### Subtask:
Define and implement the Recurrent Neural Network (RNN) model for sentiment classification.


**Reasoning**:
Define the RNN model class according to the instructions, including the embedding layer, RNN layer, and linear layer, and then instantiate and print the model.



In [6]:
class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, text):
        embedded = self.embedding(text)
        output, hidden = self.rnn(embedded)
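        # hidden has shape (1, batch, hidden_dim): the state after the final time step,
        # which for post-padded reviews shorter than max_len is a padding position
        hidden = hidden.squeeze(0)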
        output = self.fc(hidden)
        return output

# Define model parameters
hidden_dim = 64
output_dim = 1 # For binary classification

# Instantiate the model
rnn_model = RNN(vocab_size, embedding_dim, hidden_dim, output_dim)

print(rnn_model)

RNN(
  (embedding): Embedding(168031, 100)
  (rnn): RNN(100, 64, batch_first=True)
  (fc): Linear(in_features=64, out_features=1, bias=True)
)
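
Because the sequences are post-padded, the final hidden state of a short review is computed after a run of padding tokens. One possible variation, sketched below, reads the RNN output at the last real token instead. This is an assumption-laden sketch, not the notebook's model: the class name `LastValidStepRNN` is hypothetical, and it assumes index 0 is used only for trailing padding.

```python
import torch
import torch.nn as nn

class LastValidStepRNN(nn.Module):
    """RNN classifier that reads the output at the last non-padding position."""

    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, pad_idx=0):
        super().__init__()
        self.pad_idx = pad_idx
        # padding_idx keeps the padding embedding fixed at zero
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, text):
        # Number of real tokens per row; clamp so an all-padding row still indexes position 0
        lengths = (text != self.pad_idx).sum(dim=1).clamp(min=1)
        outputs, _ = self.rnn(self.embedding(text))  # (batch, seq_len, hidden_dim)
        # Pick outputs[b, lengths[b] - 1, :] for every row b
        last_idx = (lengths - 1).view(-1, 1, 1).expand(-1, 1, outputs.size(2))
        last_hidden = outputs.gather(1, last_idx).squeeze(1)  # (batch, hidden_dim)
        return self.fc(last_hidden)

torch.manual_seed(0)
model = LastValidStepRNN(vocab_size=50, embedding_dim=8, hidden_dim=16, output_dim=1).eval()
short_pad = torch.tensor([[3, 4, 5, 0, 0]])
long_pad = torch.tensor([[3, 4, 5, 0, 0, 0, 0]])
with torch.no_grad():
    same = torch.allclose(model(short_pad), model(long_pad), atol=1e-6)
print("Prediction independent of trailing padding:", same)
```

Padding at the front (`padding='pre'`) or packing the sequences are alternative ways to get a similar effect; which one works best here would need to be measured.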


## Train both models

### Subtask:
Train both the feedforward and RNN models on the training data.


**Reasoning**:
Convert the training and testing data to PyTorch tensors and create TensorDatasets and DataLoaders.



In [7]:
import torch
from torch.utils.data import TensorDataset, DataLoader

# Convert data to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.long)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float).unsqueeze(1) # Add a dimension for BCEWithLogitsLoss

X_test_tensor = torch.tensor(X_test, dtype=torch.long)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.float).unsqueeze(1) # Add a dimension for BCEWithLogitsLoss


# Create TensorDatasets
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

# Create DataLoaders
batch_size = 128
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

print("Train DataLoader size:", len(train_loader))
print("Test DataLoader size:", len(test_loader))

Train DataLoader size: 313
Test DataLoader size: 79
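
The loader sizes printed above are batch counts, not sample counts. A small sketch of the arithmetic, using a hypothetical helper `num_batches` that follows the usual rule of keeping a final partial batch unless `drop_last` is set:

```python
def num_batches(num_samples, batch_size, drop_last=False):
    """Number of batches a loader yields for the given sample count."""
    full, remainder = divmod(num_samples, batch_size)
    return full if drop_last or remainder == 0 else full + 1

print(num_batches(40_000, 128))  # training set
print(num_batches(10_000, 128))  # test set
print(40_000 % 128, 10_000 % 128)  # sizes of the final partial batches
```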


**Reasoning**:
Define the loss function, optimizers, and set the number of epochs for training.



In [8]:
import torch.nn as nn
import torch.optim as optim

# Define loss function
criterion = nn.BCEWithLogitsLoss()

# Define optimizers
ff_optimizer = optim.Adam(ff_model.parameters(), lr=0.001)
rnn_optimizer = optim.Adam(rnn_model.parameters(), lr=0.001)

# Set number of epochs
num_epochs = 5

print("Loss function defined.")
print("Optimizers defined.")
print(f"Number of epochs set to {num_epochs}.")

Loss function defined.
Optimizers defined.
Number of epochs set to 5.
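
For reference, the quantity `BCEWithLogitsLoss` measures can be sketched per sample in plain Python. The function `bce_with_logits` below is an illustrative re-derivation using the standard numerically stable form, not PyTorch's code. A useful landmark: a logit of 0 (probability 0.5) costs ln 2 ≈ 0.693, so a mean loss that stays near 0.69 means the predictions carry almost no information.

```python
import math

def bce_with_logits(logit, target):
    """Binary cross-entropy on a raw logit x with label y, computed as
    max(x, 0) - x*y + log(1 + exp(-|x|)) to avoid overflow."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

print(round(bce_with_logits(0.0, 1.0), 4))  # uninformative prediction: ln 2
print(round(bce_with_logits(4.0, 1.0), 4))  # confident and correct: small loss
print(round(bce_with_logits(4.0, 0.0), 4))  # confident and wrong: large loss
```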


**Reasoning**:
Train both the feedforward and RNN models using the defined loss function, optimizers, and number of epochs.



In [9]:
# Training loop for the Feedforward Model
print("Starting training for Feedforward Model...")
ff_model.train()
for epoch in range(num_epochs):
    total_loss = 0
    for inputs, labels in train_loader:
        ff_optimizer.zero_grad()
        outputs = ff_model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        ff_optimizer.step()
        total_loss += loss.item()
    print(f'Feedforward Model - Epoch {epoch+1}/{num_epochs}, Loss: {total_loss/len(train_loader):.4f}')

# Training loop for the RNN Model
print("\nStarting training for RNN Model...")
rnn_model.train()
for epoch in range(num_epochs):
    total_loss = 0
    for inputs, labels in train_loader:
        rnn_optimizer.zero_grad()
        outputs = rnn_model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        rnn_optimizer.step()
        total_loss += loss.item()
    print(f'RNN Model - Epoch {epoch+1}/{num_epochs}, Loss: {total_loss/len(train_loader):.4f}')

print("\nTraining complete for both models.")

Starting training for Feedforward Model...
Feedforward Model - Epoch 1/5, Loss: 0.6571
Feedforward Model - Epoch 2/5, Loss: 0.3677
Feedforward Model - Epoch 3/5, Loss: 0.1820
Feedforward Model - Epoch 4/5, Loss: 0.1008
Feedforward Model - Epoch 5/5, Loss: 0.0602

Starting training for RNN Model...
RNN Model - Epoch 1/5, Loss: 0.6956
RNN Model - Epoch 2/5, Loss: 0.6891
RNN Model - Epoch 3/5, Loss: 0.6847
RNN Model - Epoch 4/5, Loss: 0.6900
RNN Model - Epoch 5/5, Loss: 0.6827

Training complete for both models.
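
The two training loops above are identical apart from the model and optimizer, so they could be folded into one helper. The sketch below does that and adds optional gradient-norm clipping, a common precaution for simple RNNs. The function name `train_one_epoch`, the `max_grad_norm` parameter, and the synthetic data are assumptions for illustration; whether clipping helps on this dataset has not been tested here.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_one_epoch(model, loader, optimizer, criterion, max_grad_norm=None):
    """Run one pass over loader and return the mean batch loss."""
    model.train()
    total_loss = 0.0
    for inputs, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        if max_grad_norm is not None:
            # Rescale gradients whose global norm exceeds the threshold
            nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(loader)

# Tiny synthetic check so the sketch runs on its own (random data, not IMDB).
torch.manual_seed(0)
toy_x = torch.randn(64, 10)
toy_y = (toy_x.sum(dim=1, keepdim=True) > 0).float()
toy_loader = DataLoader(TensorDataset(toy_x, toy_y), batch_size=16, shuffle=True)
toy_model = nn.Linear(10, 1)
toy_optimizer = torch.optim.Adam(toy_model.parameters(), lr=0.05)
losses = [train_one_epoch(toy_model, toy_loader, toy_optimizer,
                          nn.BCEWithLogitsLoss(), max_grad_norm=1.0)
          for _ in range(20)]
print(f"first epoch: {losses[0]:.4f}, last epoch: {losses[-1]:.4f}")
```

In the notebook's setting the call would look like `train_one_epoch(rnn_model, train_loader, rnn_optimizer, criterion, max_grad_norm=1.0)` inside the epoch loop.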


## Evaluate both models

### Subtask:
Evaluate the performance of both models on the testing data.


**Reasoning**:
Evaluate the performance of both models on the testing data by iterating through the test loader, getting predictions, and calculating accuracy.



In [10]:
from sklearn.metrics import accuracy_score

# Set models to evaluation mode
ff_model.eval()
rnn_model.eval()

# Lists to store predictions and true labels
ff_predictions = []
rnn_predictions = []
true_labels = []

# Disable gradient calculations
with torch.no_grad():
    for inputs, labels in test_loader:
        # Feedforward model evaluation
        ff_outputs = ff_model(inputs)
        ff_predicted = torch.sigmoid(ff_outputs).round() # Apply sigmoid and round to get binary predictions
        ff_predictions.extend(ff_predicted.tolist())
        true_labels.extend(labels.tolist())

        # RNN model evaluation
        rnn_outputs = rnn_model(inputs)
        rnn_predicted = torch.sigmoid(rnn_outputs).round() # Apply sigmoid and round to get binary predictions
        rnn_predictions.extend(rnn_predicted.tolist())

# Convert the nested prediction and label lists to 1-D tensors
ff_predictions = torch.tensor(ff_predictions).squeeze(1)
rnn_predictions = torch.tensor(rnn_predictions).squeeze(1)
true_labels = torch.tensor(true_labels).squeeze(1)

# Calculate accuracy
ff_accuracy = accuracy_score(true_labels, ff_predictions)
rnn_accuracy = accuracy_score(true_labels, rnn_predictions)

print(f"Feedforward Model Accuracy: {ff_accuracy:.4f}")
print(f"RNN Model Accuracy: {rnn_accuracy:.4f}")

Feedforward Model Accuracy: 0.7905
RNN Model Accuracy: 0.5104
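
Accuracy alone does not show whether a model near 50% is guessing or simply predicting one class for almost every review. A confusion-matrix style breakdown can tell these apart. The helper `binary_report` and the toy labels below are hypothetical and only illustrate the calculation; they are not results from the models above.

```python
def binary_report(y_true, y_pred):
    """Confusion counts plus accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "accuracy": (tp + tn) / len(pairs),
            "precision": precision, "recall": recall, "f1": f1}

# Toy example: a classifier that says "positive" for nearly everything.
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 1, 1, 1, 1, 1, 1, 0]
report = binary_report(toy_true, toy_pred)
print(report)
```

Here recall is perfect while precision is low, the signature of a model that has collapsed toward a single class.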


## Analyze and compare results

### Subtask:
Compare the performance of the two models and analyze why the RNN might perform better or worse on sequential data.


**Reasoning**:
Compare the accuracy scores and discuss the performance of the two models, analyzing potential reasons for the RNN's lower performance and the feedforward model's reasonable performance on sequential data.



In [11]:
print(f"Feedforward Model Accuracy: {ff_accuracy:.4f}")
print(f"RNN Model Accuracy: {rnn_accuracy:.4f}")

print("\n--- Model Performance Analysis ---")

if ff_accuracy > rnn_accuracy:
    print("The Feedforward Model performed better than the RNN Model on this sentiment classification task.")
elif rnn_accuracy > ff_accuracy:
    print("The RNN Model performed better than the Feedforward Model on this sentiment classification task.")
else:
    print("Both models performed similarly on this sentiment classification task.")

print("\n--- Potential Reasons for RNN's Performance ---")
print("The RNN model achieved a significantly lower accuracy compared to the feedforward model.")
print("Several factors could contribute to this:")
print("1. Simple RNN Architecture: A simple RNN might struggle to capture long-range dependencies in text compared to more advanced architectures like LSTMs or GRUs.")
print("2. Lack of Hyperparameter Tuning: The current implementation used default or basic hyperparameters. Optimal learning rate, hidden dimension size, and other parameters could significantly impact performance.")
print("3. Limited Training Epochs: Training for only 5 epochs might not be sufficient for the RNN to converge and learn complex patterns in the data, especially with a simple architecture.")
print("4. Vanishing/Exploding Gradients: Simple RNNs are prone to vanishing or exploding gradients, which can hinder effective learning over longer sequences.")

print("\n--- Why Feedforward Model Performed Reasonably Well ---")
print("Despite not explicitly handling the sequential nature of text, the feedforward model achieved a reasonable accuracy.")
print("This could be due to:")
print("1. Position-Specific Features: Flattening the embedded sequence keeps token order, so the first linear layer learns a separate weight for every (position, embedding dimension) pair rather than treating the review as an unordered bag of words.")
print("2. Learning Word Associations: The model can learn to associate the presence of certain words at given positions within the fixed sequence length with positive or negative sentiment, without modeling how one word modifies the next.")
print("3. Sufficient Data: With a large dataset (50,000 reviews), the model has enough data to learn these word-sentiment associations effectively.")

print("\n--- Summary ---")
print("In this specific implementation and with the given training constraints, the feedforward model outperformed the simple RNN for sentiment classification.")
print("While RNNs are theoretically better suited for sequential data, their performance is highly dependent on the architecture, hyperparameter tuning, and sufficient training.")
print("The feedforward model's ability to learn word associations from the flattened embedded sequences, combined with a large dataset, allowed it to achieve a respectable accuracy.")

Feedforward Model Accuracy: 0.7905
RNN Model Accuracy: 0.5104

--- Model Performance Analysis ---
The Feedforward Model performed better than the RNN Model on this sentiment classification task.

--- Potential Reasons for RNN's Performance ---
The RNN model achieved a significantly lower accuracy compared to the feedforward model.
Several factors could contribute to this:
1. Simple RNN Architecture: A simple RNN might struggle to capture long-range dependencies in text compared to more advanced architectures like LSTMs or GRUs.
2. Lack of Hyperparameter Tuning: The current implementation used default or basic hyperparameters. Optimal learning rate, hidden dimension size, and other parameters could significantly impact performance.
3. Limited Training Epochs: Training for only 5 epochs might not be sufficient for the RNN to converge and learn complex patterns in the data, especially with a simple architecture.
4. Vanishing/Exploding Gradients: Simple RNNs are prone to vanishing or explo
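
The vanishing and exploding gradient point can be illustrated with a scalar toy RNN. The sketch below assumes a recurrence h_t = tanh(w_h * h_{t-1} + x_t) with the same pre-activation at every step, which is a deliberate simplification; the function `gradient_through_time` is hypothetical and only shows how the per-step factor compounds over the sequence length.

```python
import math

def gradient_through_time(w_h, steps, pre_activation=0.5):
    """Magnitude of d h_T / d h_0 for a scalar tanh RNN, assuming the same
    pre-activation value at every step."""
    # |d h_t / d h_{t-1}| = |w_h| * (1 - tanh(a)^2)
    local = abs(w_h) * (1.0 - math.tanh(pre_activation) ** 2)
    return local ** steps

for steps in (10, 50, 200):
    print(steps, f"{gradient_through_time(0.9, steps):.3e}")

# A recurrent weight above 1 with an unsaturated tanh grows instead of shrinking.
print(f"{gradient_through_time(2.0, 50, pre_activation=0.0):.3e}")
```

With 200 time steps, as in this notebook, even a per-step factor moderately below 1 leaves almost no gradient for the early tokens.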

## Summary:

### Data Analysis Key Findings

*   The dataset was successfully loaded, containing 50,000 movie reviews and their corresponding sentiments ('positive' or 'negative').
*   Data preprocessing involved converting sentiment labels to numerical (1 for positive, 0 for negative), cleaning the text by removing HTML tags and punctuation, converting to lowercase, tokenizing with a vocabulary size limit of 5000, and padding sequences to a fixed length of 200.
*   The dataset was split into training (40,000 samples) and testing (10,000 samples) sets.
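*   The embedding tables were sized from the tokenizer's full word index (168,031 entries including the padding index) with an embedding dimension of 100, although the 5000-word tokenizer limit means only indices below 5,000 ever appear in the input sequences.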
*   Both a simple Feedforward Neural Network and a simple Recurrent Neural Network (RNN) were implemented in PyTorch for binary sentiment classification.
*   Both models were trained for 5 epochs using the BCEWithLogitsLoss criterion and the Adam optimizer.
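*   During training, the Feedforward Model's loss decreased from approximately 0.657 to 0.060, so it fit the training data closely. The RNN Model's loss stayed between roughly 0.68 and 0.70, close to the ln 2 ≈ 0.693 that a constant 0.5 prediction would give, suggesting it learned very little in this configuration.
*   On the test set, the Feedforward Model achieved an accuracy of 0.7905, while the RNN Model achieved 0.5104, barely above the 0.5 expected from guessing if the two classes are balanced. The gap between the Feedforward Model's low final training loss and its test accuracy points to overfitting.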

### Insights or Next Steps

*   The simple RNN architecture, lack of hyperparameter tuning, and limited training epochs likely contributed to its poor performance compared to the feedforward model. The preprocessing is a further likely factor: sequences are post-padded and the classifier reads the hidden state after position 200, so for shorter reviews the signal has to survive a run of padding steps. Using LSTMs or GRUs, reading the state at the last real token (or padding at the front), tuning hyperparameters, and training for more epochs could improve the RNN's performance on this sequential data.
*   The Feedforward Model's reasonable performance suggests that for this dataset and task, learning position-specific word associations through embeddings and linear layers was more effective than the sequential processing attempted by the basic RNN. Further analysis of the Feedforward Model's learned weights could provide insights into which word patterns are most indicative of sentiment.
