<a href="https://colab.research.google.com/github/DonaSul/teaching/blob/main/intro_nlp_day_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


<br>
====================================================<br>
RNN Improvements Practical<br>
Vanishing/Exploding Gradients, GRUs and LSTMs<br>
====================================================<br>
Learning Goals:<br>
- Understand vanishing and exploding gradients in RNNs<br>
- Apply gradient clipping as a solution<br>
- Implement GRU and LSTM for text classification<br>
- Compare GRU vs LSTM on IMDB Sentiment Dataset<br>
====================================================<br>


In [1]:
!pip install torch==2.3.1 torchtext==0.18.0 -f https://download.pytorch.org/whl/torch_stable.html



In [2]:
import torch
import torch.nn as nn
import torch.optim as optim
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence



====================================================<br>
TASK 1 — Conceptual<br>
====================================================


<br>
Q1: Why do RNNs suffer from vanishing or exploding gradients?<br>
Write your answer here:<br>


During backpropagation through time, the gradient with respect to an early hidden state is a product of one Jacobian factor per timestep. If those factors have magnitude below 1, the product shrinks exponentially with sequence length (vanishing gradients), which makes long-term dependencies hard to learn. If they have magnitude above 1, the product grows exponentially (exploding gradients), which makes training unstable.
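
To make the exponential effect concrete, here is a minimal sketch. It assumes a scalar linear recurrence `h_t = w * h_{t-1}` (no nonlinearity), so the gradient of `h_T` with respect to `h_0` is just `w` multiplied by itself `T` times. The function name and the values of `w` are illustrative choices, not part of the practical.

```python
def gradient_through_time(w: float, steps: int) -> float:
    """Gradient of h_T w.r.t. h_0 for the scalar linear recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # one Jacobian factor per timestep
    return grad


for w in (0.9, 1.0, 1.1):
    print(f"w = {w}: d h_50 / d h_0 = {gradient_through_time(w, 50):.3e}")
```

Even a recurrent weight only slightly away from 1 changes the gradient by orders of magnitude over 50 steps, which is the sequence length used in Task 2.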

====================================================<br>
TASK 2 — Demonstrate Exploding Gradients <br>
====================================================

In [10]:
rnn = nn.RNN(input_size=1, hidden_size=1, batch_first=True) # Simple RNN with 1 input feature and 1 hidden unit

'''
x = torch.ones((1, 50, 1))  # long sequence of ones, (batch=1, seq_len=50, input_size=1)
target = torch.tensor([1])  # fake label
criterion = nn.MSELoss()
optimizer = optim.SGD(rnn.parameters(), lr=1.0)
'''

In [12]:
print("\n--- Task 2: Exploding Gradients Demonstration ---")
#your code here

with torch.no_grad():
    rnn.weight_hh_l0.fill_(3.0)    # force a large recurrent weight
    rnn.weight_ih_l0.fill_(1.0)

x = torch.ones((1, 50, 1))
target = torch.tensor([[0.0]])

criterion = nn.MSELoss()
optimizer = optim.SGD(rnn.parameters(), lr=0.1)

for epoch in range(5):
    optimizer.zero_grad()
    out, _ = rnn(x)
    y_pred = out[:, -1, :]

    loss = criterion(y_pred, target)
    loss.backward()

    # Gradient norm
    total_norm = 0
    for p in rnn.parameters():
        if p.grad is not None:
            param_norm = p.grad.data.norm(2)
            total_norm += param_norm.item() ** 2
    total_norm = total_norm ** 0.5

    print(f"Epoch {epoch+1} | Loss: {loss.item():.6e} | Gradient norm: {total_norm:.6e}")

    optimizer.step()

    # The gradient norm measures how large the gradients are, taken together
    # across all model parameters, at this training step.



--- Task 2: Exploding Gradients Demonstration ---
Epoch 1 | Loss: 9.974962e-01 | Gradient norm: 1.007493e-02
Epoch 2 | Loss: 9.974861e-01 | Gradient norm: 1.011637e-02
Epoch 3 | Loss: 9.974758e-01 | Gradient norm: 1.015781e-02
Epoch 4 | Loss: 9.974654e-01 | Gradient norm: 1.019973e-02
Epoch 5 | Loss: 9.974551e-01 | Gradient norm: 1.024165e-02


In [13]:
x = torch.ones((1, 50, 1))
print(x.shape)


torch.Size([1, 50, 1])



<br>
# --- TIP ---<br>
You should observe gradient norms growing very large, signifying an exploding gradient problem. Why is that the case? How can you remedy that?<br>
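
The run recorded above reports gradient norms of about 1e-2 rather than very large values. A likely explanation is that `nn.RNN` uses `tanh` by default: with a recurrent weight of 3.0 the hidden state saturates close to 1, where the derivative of `tanh` is nearly zero. The sketch below is one possible way to make the explosion visible. It assumes a ReLU nonlinearity, zeroed biases, and a recurrent weight of 1.5; all three are illustrative choices that are not in the original exercise.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Assumption: ReLU instead of the default tanh, so the hidden state cannot saturate.
rnn = nn.RNN(input_size=1, hidden_size=1, batch_first=True, nonlinearity="relu")
with torch.no_grad():
    rnn.weight_hh_l0.fill_(1.5)  # recurrent weight > 1 (illustrative value)
    rnn.weight_ih_l0.fill_(1.0)
    rnn.bias_ih_l0.zero_()
    rnn.bias_hh_l0.zero_()

criterion = nn.MSELoss()
target = torch.tensor([[0.0]])

grad_norms = {}
for seq_len in (5, 10, 20, 40):
    rnn.zero_grad()
    x = torch.ones((1, seq_len, 1))
    out, _ = rnn(x)
    loss = criterion(out[:, -1, :], target)
    loss.backward()
    # Norm of the gradient w.r.t. the recurrent weight only
    grad_norms[seq_len] = rnn.weight_hh_l0.grad.norm().item()
    print(f"seq_len = {seq_len:2d} | loss = {loss.item():.3e} | grad norm = {grad_norms[seq_len]:.3e}")
```

The gradient norm grows rapidly with sequence length here because every timestep contributes another factor of 1.5 to the backpropagated product.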


====================================================<br>
TASK 3 — Apply Gradient Clipping (15 mins)<br>
====================================================

Gradient clipping is a technique to prevent exploding gradients by forcing the gradient norm to stay below a threshold before doing the parameter update.
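
As a rough illustration of the idea, the sketch below clips a list of gradient tensors by their combined L2 norm. The helper `clip_by_global_norm` is a hypothetical name written for this example, and the small epsilon in the denominator is an assumption for numerical safety. In practice `torch.nn.utils.clip_grad_norm_` does this in place on a model's parameters.

```python
import torch


def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient tensors so their combined L2 norm is at most max_norm."""
    total_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    scale = min(1.0, max_norm / (total_norm.item() + 1e-6))  # never scale gradients up
    return [g * scale for g in grads], total_norm.item()


grads = [torch.tensor([3.0, 4.0]), torch.tensor([12.0])]  # combined norm = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = torch.sqrt(sum((g ** 2).sum() for g in clipped)).item()
print(f"norm before: {norm_before:.3f}, norm after: {norm_after:.3f}")
```

All gradients are multiplied by the same factor, so the update direction is preserved and only its length changes.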

In [14]:
print("\n--- Task 3: Gradient Clipping ---")
rnn = nn.RNN(input_size=1, hidden_size=1, batch_first=True)
optimizer = optim.SGD(rnn.parameters(), lr=1.0)


--- Task 3: Gradient Clipping ---


In [15]:
# Your code here
for epoch in range(5):
    optimizer.zero_grad()

    output, _ = rnn(x)                   # forward pass
    y_pred = output[:, -1, :]            # take last output
    loss = criterion(y_pred, target.float().view_as(y_pred))  # reshape target to y_pred's (1, 1) shape

    loss.backward()

    # Compute the total gradient norm; if it exceeds max_norm (1.0 here),
    # rescale all gradients proportionally so the total norm becomes 1.0.
    torch.nn.utils.clip_grad_norm_(rnn.parameters(), max_norm=1.0)

    optimizer.step()

    print(f"Epoch {epoch+1}, Loss: {loss.item():.6f}")


Epoch 1, Loss: 0.780032
Epoch 2, Loss: 0.772807
Epoch 3, Loss: 0.206455
Epoch 4, Loss: 0.656150
Epoch 5, Loss: 0.044063




------------------------------<br>
Task 4: Manual Forward Pass <br>
------------------------------

In [None]:
import numpy as np

def task4_manual_forward_pass():
    """
    Compute a forward pass manually (hidden and output state) for a small LSTM using the activation functions in the formula.
    Input sequence length T=3, input size=2, hidden size=2
    """
    x_seq = [np.array([0.5, -1.0]),
             np.array([1.0, 0.0]),
             np.array([-0.5, 0.5])]
    h_prev = np.zeros(2)
    # Your code here
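
One possible shape for the manual forward pass is sketched below in NumPy. It uses the standard LSTM equations with gates stacked in the order input, forget, cell candidate, output (the same order PyTorch uses). The weights are random and the bias is zero; both are assumptions made only so the example runs, since the practical does not specify them. The helper names `sigmoid` and `lstm_forward` are also illustrative.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_forward(x_seq, W, U, b, hidden_size):
    """Run an LSTM over x_seq; W, U, b stack the gates in the order (input, forget, cell, output)."""
    h = np.zeros(hidden_size)
    c = np.zeros(hidden_size)
    hidden_states = []
    for x in x_seq:
        z = W @ x + U @ h + b          # all four gate pre-activations, shape (4 * hidden_size,)
        i, f, g, o = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        g = np.tanh(g)                 # candidate cell state
        c = f * c + i * g              # new cell state
        h = o * np.tanh(c)             # new hidden state (the LSTM output at this step)
        hidden_states.append(h)
    return hidden_states, c


rng = np.random.default_rng(0)         # assumption: random illustrative weights
input_size, hidden_size = 2, 2
W = rng.normal(scale=0.5, size=(4 * hidden_size, input_size))
U = rng.normal(scale=0.5, size=(4 * hidden_size, hidden_size))
b = np.zeros(4 * hidden_size)

x_seq = [np.array([0.5, -1.0]), np.array([1.0, 0.0]), np.array([-0.5, 0.5])]
hidden_states, c_final = lstm_forward(x_seq, W, U, b, hidden_size)
for t, h in enumerate(hidden_states, start=1):
    print(f"t = {t}: h = {h}")
```

Because the output gate lies in (0, 1) and `tanh` lies in (-1, 1), every component of each hidden state stays strictly between -1 and 1.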

====================================================<br>
Preprocess IMDB dataset<br>
====================================================

In [None]:
print("\n--- Loading IMDB dataset ---")
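
The preprocessing cell is left empty in the notebook. The sketch below shows one way the pipeline could look: tokenize, build a vocabulary, and pad each batch with `pad_sequence`. It runs on a tiny in-memory stand-in for IMDB and uses a whitespace tokenizer instead of the `torchtext` tokenizer imported earlier; the data, the `tokenize` and `collate_batch` helpers, and the special-token indices are all assumptions made for illustration.

```python
from collections import Counter

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Assumption: a tiny in-memory stand-in for IMDB (label 1 = positive, 0 = negative).
toy_reviews = [
    (1, "a wonderful and moving film"),
    (0, "dull plot and wooden acting"),
    (1, "great acting"),
]


def tokenize(text):
    return text.lower().split()  # stand-in for a real tokenizer


# Vocabulary: index 0 is reserved for padding, index 1 for unknown tokens.
counts = Counter(tok for _, text in toy_reviews for tok in tokenize(text))
vocab = {"<pad>": 0, "<unk>": 1}
for tok in counts:
    vocab[tok] = len(vocab)


def collate_batch(batch):
    """Turn (label, text) pairs into padded token ids, true lengths, and labels."""
    labels = torch.tensor([label for label, _ in batch], dtype=torch.float32)
    ids = [torch.tensor([vocab.get(t, vocab["<unk>"]) for t in tokenize(text)]) for _, text in batch]
    lengths = torch.tensor([len(seq) for seq in ids])
    padded = pad_sequence(ids, batch_first=True, padding_value=vocab["<pad>"])
    return padded, lengths, labels


loader = DataLoader(toy_reviews, batch_size=3, shuffle=False, collate_fn=collate_batch)
padded, lengths, labels = next(iter(loader))
print(padded.shape, lengths.tolist(), labels.tolist())
```

Returning the true lengths alongside the padded tensor keeps the option open to pack sequences later, so the recurrent layer does not have to process padding.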

====================================================<br>
TASK 5 — Implement LSTM Sentiment Classifier on the IMDB dataset<br>
====================================================

In [None]:
class LSTMClassifier(nn.Module):
    pass  # Your code here
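
A minimal sketch of what an LSTM sentiment classifier might look like follows. The class name, layer sizes, and the choice to classify from the final hidden state are assumptions for illustration, not the practical's reference solution. For simplicity it does not pack padded sequences, so padding tokens also pass through the LSTM.

```python
import torch
import torch.nn as nn


class LSTMSentimentSketch(nn.Module):
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64, pad_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)      # h_n: (num_layers, batch, hidden_dim)
        return self.fc(h_n[-1]).squeeze(1)     # one logit per review, shape (batch,)


model = LSTMSentimentSketch(vocab_size=100)
dummy_batch = torch.randint(1, 100, (4, 12))   # 4 fake reviews of 12 token ids each
logits = model(dummy_batch)
loss = nn.BCEWithLogitsLoss()(logits, torch.tensor([1.0, 0.0, 1.0, 0.0]))
print(logits.shape, loss.item())
```

The model outputs raw logits, which is why the example pairs it with `nn.BCEWithLogitsLoss` rather than applying a sigmoid inside `forward`.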

====================================================<br>
TASK 6 — Swap LSTM with GRU and repeat Task 5<br>
====================================================

In [None]:
class GRUClassifier(nn.Module):
    pass  # Your code here
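
A GRU version can be sketched the same way. The main code difference is that `nn.GRU` returns a single hidden state and has no separate cell state. As before, the class name and hyperparameters are illustrative assumptions, and the sketch includes one backward pass with gradient clipping to tie it back to Task 3.

```python
import torch
import torch.nn as nn


class GRUSentimentSketch(nn.Module):
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64, pad_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        _, h_n = self.gru(embedded)            # GRU returns only a hidden state, no cell state
        return self.fc(h_n[-1]).squeeze(1)     # one logit per review, shape (batch,)


model = GRUSentimentSketch(vocab_size=100)
dummy_batch = torch.randint(1, 100, (4, 12))   # 4 fake reviews of 12 token ids each
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])

logits = model(dummy_batch)
loss = nn.BCEWithLogitsLoss()(logits, labels)
loss.backward()
# Clip before the optimizer step, as in Task 3
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(logits.shape, loss.item(), float(total_norm))
```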


====================================================<br>
TASK 7 <br>
====================================================

Compare loss curves for the LSTM and GRU classifiers. Which performs better and why?<br>
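
One factor worth checking when interpreting the comparison is model size. An LSTM layer has four gate blocks and a GRU layer has three, so for the same dimensions the GRU has three quarters of the recurrent parameters. The sketch below counts them; the dimensions and the helper `count_parameters` are illustrative assumptions. Parameter count alone does not determine which model reaches the lower loss.

```python
import torch.nn as nn


def count_parameters(module):
    """Total number of elements across all parameters of a module."""
    return sum(p.numel() for p in module.parameters())


embed_dim, hidden_dim = 32, 64  # illustrative sizes
lstm_params = count_parameters(nn.LSTM(embed_dim, hidden_dim, batch_first=True))
gru_params = count_parameters(nn.GRU(embed_dim, hidden_dim, batch_first=True))
print(f"LSTM: {lstm_params} parameters, GRU: {gru_params} parameters, "
      f"ratio GRU/LSTM = {gru_params / lstm_params:.2f}")
```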