### Understanding RNN architecture and Backpropagation Through Time (BPTT)


#### Detailed Architecture of RNN

Recurrent Neural Networks (RNNs) are a class of neural networks designed to process sequential data by maintaining a form of memory of previous inputs. They are particularly well-suited for tasks where context or order matters, such as language modeling, time series prediction, and speech recognition.

---

### **Components of an RNN**

- **Input Layer:**  
    Receives sequential data, where each element of the sequence is fed into the network at each time step \( t \).  
    *Example:* For a sentence, each word (or character) is an input at a different time step.

- **Hidden Layer:**  
    Maintains a "memory" of past inputs through recurrent connections. The hidden state at time \( t \) (\( h_t \)) is updated based on the current input and the previous hidden state.  
    The update rule is:  
    \[
    h_t = f(W_h h_{t-1} + W_x x_t + b_h)
    \]
    where:
    - \( h_{t-1} \): Hidden state from the previous time step
    - \( x_t \): Input at the current time step
    - \( W_h \): Weight matrix for recurrent (hidden-to-hidden) connections
    - \( W_x \): Weight matrix for input-to-hidden connections
    - \( b_h \): Bias term
    - \( f \): Non-linear activation function (commonly tanh or ReLU)

    The hidden state acts as a summary of all previous inputs, allowing the network to retain information over time.

- **Output Layer:**  
    Produces the output at each time step, which can be used for tasks like sequence prediction, classification, or generation.  
    The output at time \( t \) is typically computed as:  
    \[
    y_t = g(W_y h_t + b_y)
    \]
    where:
    - \( W_y \): Weight matrix for hidden-to-output connections
    - \( b_y \): Output bias
    - \( g \): Activation function (e.g., softmax for classification, linear for regression)

---

### **Key Points**

- **Parameter Sharing:**  
    RNNs share parameters (weights and biases) across all time steps, making them efficient for sequential data and reducing the number of parameters to learn.
- **Temporal Dependencies:**  
    The recurrent connection allows information to persist, enabling the network to learn temporal dependencies and context from previous inputs.
- **Flexible Output:**  
    RNNs can be configured for different sequence tasks:  
    - One-to-one (e.g., image classification)  
    - One-to-many (e.g., image captioning)  
    - Many-to-one (e.g., sentiment analysis)  
    - Many-to-many (e.g., machine translation)

---

### **Backpropagation Through Time (BPTT)**

**What is BPTT?**  
BPTT is an extension of standard backpropagation to handle sequential data in RNNs. It calculates gradients for each time step and propagates them backward through the sequence, updating the shared weights.

**Steps of BPTT:**
1. **Unroll the RNN:**  
     The RNN is "unrolled" across the sequence for a fixed number of time steps, creating a computational graph where each time step is a layer.
2. **Compute the Loss:**  
     The loss is computed at each time step (or only at the final step, depending on the task).
3. **Backpropagate Errors:**  
     Errors are backpropagated through all time steps to update the shared weights, taking into account dependencies across time.

---

### **Challenges in BPTT**

- **Vanishing Gradient Problem:**  
    As gradients are propagated backward through many time steps, they can shrink exponentially, making them extremely small. This leads to very slow or stalled learning for long-term dependencies, as the network fails to update weights effectively for earlier time steps.
    - This problem is especially severe with long sequences and deep unrollings.
    - It limits the ability of standard RNNs to capture relationships between distant elements in a sequence.
    - **Mitigation:**  
        - Use specialized architectures like Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), which introduce gating mechanisms to help preserve gradients and maintain long-term dependencies.
        - Proper weight initialization and using activation functions less prone to saturation (like ReLU) can also help.

- **Exploding Gradient Problem:**  
    Gradients can grow exponentially, causing numerical instability during training.
    - **Mitigation:**  
        - Use gradient clipping to cap the gradients during backpropagation.
        - Careful tuning of learning rates.

---

### **Limitations of Vanilla RNNs**

- **Short-Term Memory:**  
    Struggle to learn dependencies in long sequences due to vanishing gradients, making it hard to capture long-term context.
- **Sequential Computation:**  
    Cannot parallelize training across time steps, making them computationally expensive and slow for long sequences.
- **Sensitive Initialization:**  
    Performance depends heavily on proper weight initialization and learning rates. Poor choices can exacerbate vanishing/exploding gradients.
- **Difficulty with Long Sequences:**  
    Standard RNNs are not well-suited for tasks requiring the retention of information over many time steps.

---

### **Applications of RNNs**

- Natural Language Processing (NLP): Language modeling, machine translation, text generation
- Time Series Prediction: Stock prices, weather forecasting
- Speech Recognition: Transcribing audio to text
- Sequence Generation: Music, handwriting

---

**Summary:**  
RNNs are powerful for modeling sequential data but face significant challenges with long-term dependencies due to vanishing and exploding gradients. Advanced architectures like LSTM and GRU are commonly used to address these issues and enable learning from longer sequences.

In [2]:
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense 

Load dataset

In [3]:
vocab_size = 10000
max_len = 200

(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=vocab_size)

X_train = pad_sequences(X_train, maxlen=max_len, padding="post")
X_test = pad_sequences(X_test, maxlen=max_len, padding="post")

print(f"Training Data Shape {X_train.shape}")
print(f"Testing Data Shape {X_test.shape}")

model = Sequential([
    Embedding(input_dim = vocab_size, output_dim = 128),
    SimpleRNN(128, activation="tanh", return_sequences=False),
    Dense(1, activation="sigmoid")
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()

# train model
history = model.fit(X_train, y_train, epochs=5, batch_size=32, validation_split=0.2)

# evaluate model
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Loss: {loss}")
print(f"Accuracy: {accuracy}")

Training Data Shape (25000, 200)
Testing Data Shape (25000, 200)


Epoch 1/5
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m14s[0m 20ms/step - accuracy: 0.5062 - loss: 0.6950 - val_accuracy: 0.5380 - val_loss: 0.6855
Epoch 2/5
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m13s[0m 20ms/step - accuracy: 0.5895 - loss: 0.6453 - val_accuracy: 0.5434 - val_loss: 0.6891
Epoch 3/5
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m13s[0m 20ms/step - accuracy: 0.6326 - loss: 0.5996 - val_accuracy: 0.5434 - val_loss: 0.6772
Epoch 4/5
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m13s[0m 20ms/step - accuracy: 0.6105 - loss: 0.6093 - val_accuracy: 0.5428 - val_loss: 0.6820
Epoch 5/5
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m13s[0m 20ms/step - accuracy: 0.6157 - loss: 0.6015 - val_accuracy: 0.5420 - val_loss: 0.6977
[1m782/782[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 7ms/step - accuracy: 0.5386 - loss: 0.7019
Loss: 0.7031590342521667
Accuracy: 0.5343999862670898


Pytorch

In [8]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset


train_dataset = TensorDataset(torch.tensor(X_train), torch.tensor(y_train))
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

class RNNModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super(RNNModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size,embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        embedded = self.embedding(x)
        output, hidden = self.rnn(embedded)
        return torch.sigmoid(self.fc(hidden.squeeze(0)))
    
model = RNNModel(vocab_size=10000, embedding_dim=128, hidden_dim=128, output_dim=1)

criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

def train_rnn(model, train_loader, criterion, optimizer, epochs=5):
    model.train()
    for epoch in range(epochs):
        epoch_loss = 0
        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()
            predictions = model(X_batch).squeeze(1)
            loss = criterion(predictions, y_batch.float())
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        print(f"Epoch {epoch+1}, Loss:{epoch_loss/len(train_loader)}")

train_rnn(model,train_loader,criterion,optimizer)

def evalutate_rnn(model,X_test,y_test):
    model.eval()
    with torch.no_grad():
        predictions = model(torch.tensor(X_test)).squeeze(1)
        loss = criterion(predictions, torch.tensor(y_test).float())
        accuracy = ((predictions>0) == torch.tensor(y_test).float()).float().mean().item()
    print(F"Test Loss: {loss.item()}, Test accuracy: {accuracy}")

evalutate_rnn(model,X_test,y_test)

Epoch 1, Loss:0.6838183702562776
Epoch 2, Loss:0.6490452869240281
Epoch 3, Loss:0.6204159514754629
Epoch 4, Loss:0.5934461690199649
Epoch 5, Loss:0.5570540175870862
Test Loss: 0.6885334253311157, Test accuracy: 0.5
