# Project 3: Implementing a Simple Recurrent Neural Network (RNN)

## Introduction

In this project, you will design, implement, and evaluate a simple Recurrent Neural Network (RNN) from scratch. This will involve building the entire pipeline, from data preprocessing to model training and evaluation.

## Objectives

1. Set up a TensorFlow or PyTorch environment. You are free to choose your preferred deep learning framework.
2. Use a GPU for training.
3. Create a data loader and implement data preprocessing where needed.
4. Design a Recurrent Neural Network.
5. Train and evaluate your model. Make sure to clearly show loss and accuracy values. Include visualizations too.
6. Answer assessment questions.

## Dataset

You are free to choose any dataset for this project! Kaggle would be a good source to look for datasets. Below are some examples:
- Daily Minimum Temperatures in Melbourne: This dataset contains the daily minimum temperatures in Melbourne, Australia, from 1981 to 1990.
- Daily Bitcoin Prices: This dataset contains historical daily prices of Bitcoin, which can be used for time series forecasting projects.
- Text8 Dataset: This dataset consists of the first 100 million characters from Wikipedia. It's great for text generation or language modeling tasks.
- IMDB Movie Reviews: This dataset contains 50,000 movie reviews for sentiment analysis, split evenly into 25,000 training and 25,000 test sets.
- Jena Climate Dataset: This dataset records various weather attributes (temperature, pressure, humidity, etc.) every 10 minutes, making it ideal for time series analysis.
- Earthquake Aftershocks: This dataset contains seismic data, suitable for predicting aftershocks following major earthquakes.
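
For the time series options above (for example the Melbourne temperatures or the Jena climate data), an RNN is usually trained on fixed-length windows of past values paired with the value that follows. The sketch below is one minimal way to build such windows with NumPy; the function name `make_windows`, the window length, and the toy series are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def make_windows(series, window):
    """Split a 1-D series into (inputs, targets) for next-step forecasting.

    inputs[i]  = series[i : i + window]
    targets[i] = series[i + window]
    """
    series = np.asarray(series, dtype=np.float32)
    if series.ndim != 1 or len(series) <= window:
        raise ValueError("need a 1-D series longer than the window")
    n = len(series) - window
    inputs = np.stack([series[i:i + window] for i in range(n)])
    targets = series[window:]
    return inputs, targets

# Toy series standing in for real measurements (hypothetical values)
toy = np.arange(10, dtype=np.float32)
X, y = make_windows(toy, window=3)
print(X.shape, y.shape)  # (7, 3) (7,)
```

For a recurrent layer that expects a feature dimension, `X[..., None]` would reshape the inputs to `(samples, window, 1)`.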


In [9]:
import pandas as pd
import numpy as np
import re
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load your dataset
data = pd.read_csv('IMDB_Dataset.csv')

# Replace HTML tags and URLs with a space so neighboring words are not glued together
data['review'] = data['review'].apply(lambda x: re.sub(r'<.*?>', ' ', x))
data['review'] = data['review'].apply(lambda x: re.sub(r'http\S+', ' ', x))

# Convert text to lowercase
data['review'] = data['review'].str.lower()

# Remove special characters
data['review'] = data['review'].apply(lambda x: re.sub(r'[^a-zA-Z\s]', '', x))

# Remove duplicates
data.drop_duplicates(inplace=True)

# Tokenization and padding
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(data['review'])
sequences = tokenizer.texts_to_sequences(data['review'])

# Fixed sequence length: longer reviews are truncated, shorter ones are zero-padded
max_len = 280

# Pad sequences (by default pad_sequences pads and truncates at the front)
padded_sequences = pad_sequences(sequences, maxlen=max_len)

# Prepare labels
labels = data['sentiment'].apply(lambda x: 1 if x == 'positive' else 0).values

print("Data shape:", padded_sequences.shape)
print("Labels shape:", labels.shape)


Data shape: (49580, 280)
Labels shape: (49580,)
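
`pad_sequences` pads and truncates at the front of each sequence by default (`padding='pre'`, `truncating='pre'`), so every row ends with the review's last tokens. The NumPy sketch below mimics that default behavior on toy token lists to make the resulting layout visible; `pad_pre` is a hypothetical helper written for illustration, not a Keras function.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` tokens of each sequence."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        kept = list(seq)[-maxlen:]          # drop tokens from the front if too long
        if kept:
            out[row, -len(kept):] = kept    # right-align the remaining tokens
    return out

toy_sequences = [[5, 8], [3, 1, 4, 1, 5, 9]]   # made-up token ids
print(pad_pre(toy_sequences, maxlen=4))
# [[0 0 5 8]
#  [4 1 5 9]]
```

Because real tokens are right-aligned, reading the LSTM output at the final time step picks up a state computed after the last word rather than after padding.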


In [12]:
import torch
from torch import nn

class SentimentRNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, drop_prob=0.5):
        super(SentimentRNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=drop_prob, batch_first=True)
        self.dropout = nn.Dropout(drop_prob)
        self.fc = nn.Linear(hidden_dim, output_dim)
        self.sigmoid = nn.Sigmoid()
        
    def forward(self, x):
        embedded = self.embedding(x)
        lstm_out, _ = self.lstm(embedded)
        lstm_out = lstm_out[:, -1]  # last time step; with front-padded inputs this position holds the review's final token
        out = self.dropout(lstm_out)
        out = self.fc(out)
        return self.sigmoid(out)

# Parameters
vocab_size = 10000  # As used in the tokenizer
embedding_dim = 64
hidden_dim = 256
output_dim = 1
n_layers = 2

# Instantiate the model
model = SentimentRNN(vocab_size, embedding_dim, hidden_dim, output_dim, n_layers)
print(model)


SentimentRNN(
  (embedding): Embedding(10000, 64)
  (lstm): LSTM(64, 256, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.5, inplace=False)
  (fc): Linear(in_features=256, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)
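
As a sanity check on the printed architecture, the number of trainable parameters can be worked out by hand. The sketch below assumes PyTorch's LSTM layout, where each layer holds four gates with input weights, recurrent weights, and two bias vectors; the helper name `lstm_layer_params` is illustrative.

```python
def lstm_layer_params(input_size, hidden_size):
    """Parameters in one unidirectional LSTM layer (four gates, two bias vectors)."""
    weights = 4 * hidden_size * (input_size + hidden_size)
    biases = 2 * 4 * hidden_size
    return weights + biases

vocab_size, embedding_dim, hidden_dim, output_dim = 10000, 64, 256, 1

embedding = vocab_size * embedding_dim                 # 640,000
lstm_1 = lstm_layer_params(embedding_dim, hidden_dim)  # 329,728
lstm_2 = lstm_layer_params(hidden_dim, hidden_dim)     # 526,336
linear = hidden_dim * output_dim + output_dim          # 257

total = embedding + lstm_1 + lstm_2 + linear
print(total)  # 1496321
```

If the model from the cell above is in memory, `sum(p.numel() for p in model.parameters())` should report the same figure. Note that the embedding table accounts for over 40% of the total.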


In [13]:
# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

# Loss and optimizer
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Data
from torch.utils.data import DataLoader, TensorDataset

# Convert arrays to tensors
train_data = torch.from_numpy(padded_sequences).long()   # token ids must be int64 for nn.Embedding
train_labels = torch.from_numpy(labels).float()          # BCELoss expects float targets

# Create Tensor datasets
train_dataset = TensorDataset(train_data, train_labels)

# dataloaders
batch_size = 64
train_loader = DataLoader(train_dataset, shuffle=True, batch_size=batch_size)
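
The loader above feeds every review to training, which leaves nothing held out for evaluation. One possible way to reserve validation and test portions is to shuffle the row indices once and slice them, as sketched below; the 80/10/10 ratio, the seed, and the function name `split_indices` are assumptions made for illustration.

```python
import numpy as np

def split_indices(n_samples, val_frac=0.1, test_frac=0.1, seed=0):
    """Return disjoint, shuffled index arrays for train, validation, and test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(n_samples * test_frac)
    n_val = int(n_samples * val_frac)
    test_idx = order[:n_test]
    val_idx = order[n_test:n_test + n_val]
    train_idx = order[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = split_indices(49580)  # row count printed earlier
print(len(train_idx), len(val_idx), len(test_idx))   # 39664 4958 4958
```

The index arrays can then select rows, e.g. `padded_sequences[train_idx]` and `labels[train_idx]`, before building one `TensorDataset` and `DataLoader` per split.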




In [14]:
# Training loop
epochs = 10
for epoch in range(epochs):
    model.train()
    total_loss = 0
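    for inputs, targets in train_loader:  # `targets` avoids overwriting the global `labels` array
        inputs, targets = inputs.to(device), targets.to(device)

        # Reset gradients, then run the forward pass
        optimizer.zero_grad()
        output = model(inputs)
        loss = criterion(output.squeeze(1), targets)  # squeeze(1) keeps the batch dimension even for a batch of one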
        
        # Backward and optimize
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
    
    print(f'Epoch {epoch+1}/{epochs}, Loss: {total_loss/len(train_loader)}')


Epoch 1/10, Loss: 0.6428901310889952
Epoch 2/10, Loss: 0.47934946642768
Epoch 3/10, Loss: 0.3117935614912741
Epoch 4/10, Loss: 0.2769839208452932
Epoch 5/10, Loss: 0.2643570976295779
Epoch 6/10, Loss: 0.21631919082614684
Epoch 7/10, Loss: 0.19366211922418686
Epoch 8/10, Loss: 0.17124731135945168
Epoch 9/10, Loss: 0.1536765522269472
Epoch 10/10, Loss: 0.13502622269574674
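
The objectives ask for visualizations, and a loss curve is a natural first one. The sketch below plots the per-epoch averages printed above (rounded to four decimals) with Matplotlib; in a live notebook the same list would be filled by appending `total_loss / len(train_loader)` at the end of each epoch. The function name `plot_losses` and the axis labels are illustrative choices.

```python
# Average training loss per epoch, copied from the log above (rounded)
epoch_losses = [0.6429, 0.4793, 0.3118, 0.2770, 0.2644,
                0.2163, 0.1937, 0.1712, 0.1537, 0.1350]

def plot_losses(losses):
    """Draw loss against epoch number and return the Matplotlib figure."""
    # Imported inside the function so the list above is usable without Matplotlib
    import matplotlib.pyplot as plt

    epochs = range(1, len(losses) + 1)
    fig, ax = plt.subplots()
    ax.plot(epochs, losses, marker="o")
    ax.set_xlabel("Epoch")
    ax.set_ylabel("Average training loss")
    ax.set_title("Training loss per epoch")
    return fig
```

Calling `plot_losses(epoch_losses)` in a notebook cell draws the curve. Plotting a validation loss on the same axes, if one is tracked, makes overfitting easier to spot than the training curve alone.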


In [22]:
from sklearn.metrics import accuracy_score

# Switch to evaluation mode (disables dropout)
model.eval()
all_preds = []
all_labels = []

# Note: this loop reuses train_loader, so the score below is training accuracy,
# not accuracy on held-out data
with torch.no_grad():
    for inputs, targets in train_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        outputs = model(inputs).squeeze(1)
        predicted = (outputs > 0.5).float()
        all_preds.extend(predicted.cpu().numpy())
        all_labels.extend(targets.cpu().numpy())

# Calculate accuracy
accuracy = accuracy_score(all_labels, all_preds)
print(f'Accuracy: {accuracy:.4f}')


Accuracy: 0.9622
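
Accuracy alone can hide how errors are distributed between the two classes. As a rough illustration, the NumPy sketch below thresholds sigmoid outputs at 0.5 and derives accuracy, precision, and recall from the confusion counts; the probabilities and labels are made-up toy values, and `binary_metrics` is a hypothetical helper.

```python
import numpy as np

def binary_metrics(probs, targets, threshold=0.5):
    """Compute accuracy, precision, and recall for binary predictions."""
    probs = np.asarray(probs, dtype=float)
    targets = np.asarray(targets, dtype=int)
    preds = (probs > threshold).astype(int)

    tp = int(np.sum((preds == 1) & (targets == 1)))  # true positives
    tn = int(np.sum((preds == 0) & (targets == 0)))  # true negatives
    fp = int(np.sum((preds == 1) & (targets == 0)))  # false positives
    fn = int(np.sum((preds == 0) & (targets == 1)))  # false negatives

    return {
        "accuracy": (tp + tn) / len(targets),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Toy values for illustration only
toy_probs = [0.9, 0.2, 0.7, 0.4, 0.6, 0.1]
toy_targets = [1, 0, 0, 1, 1, 0]
print(binary_metrics(toy_probs, toy_targets))
```

Applied to predictions collected from a held-out loader rather than `train_loader`, these numbers estimate generalization instead of fit to the training data.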



---
### Questions
Answer the following questions in detail.

1. What is a Recurrent Neural Network (RNN)? Describe its key components and how they differ from those in a feedforward neural network.
2. Explain the purpose of the recurrent connection in an RNN. How does it enable the network to handle sequential data?
3. What are vanishing and exploding gradients, and how do they affect the training of RNNs?
4. Describe the Long Short-Term Memory (LSTM) network and its key components. How does it address the issues of vanishing and exploding gradients?
5. What is the purpose of the GRU (Gated Recurrent Unit) in RNNs? Compare it with LSTM.
6. Explain the role of the hidden state in an RNN. How is it updated during the training process?
7. What are some common evaluation metrics used to assess the performance of an RNN on a sequential task, such as language modeling or time series forecasting?
8. How does data preprocessing impact the performance of RNNs? Provide examples of preprocessing steps for text and time series data.
9. What is sequence-to-sequence learning in the context of RNNs, and what are its applications?
10. How can RNNs be used for anomaly detection in time series data? Describe the general approach.


---
### Submission
Submit a link to your completed Jupyter Notebook (e.g., in a private GitHub repository or on Google Colab) with all cells executed and your answers to the assessment questions included at the end of the notebook.