# Milestone 4: Sequence Modeling with LSTM and GRU

This milestone introduces **deep learning models (LSTM / GRU)** that are specifically designed to capture the **order and contextual relationships** between words in a sequence.
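
To make the point about word order concrete, here is a minimal sketch (toy sizes and a made-up index sequence, not the competition data) comparing a mean-of-embeddings representation with an LSTM's final hidden state on a sequence and its reverse. The layer sizes and the names `emb`, `lstm`, `seq`, and `rev` are assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy layers: 10-word vocabulary, 8-dim embeddings, 16-dim hidden state.
emb = nn.Embedding(10, 8)
lstm = nn.LSTM(8, 16, batch_first=True)

seq = torch.tensor([[1, 2, 3, 4]])   # one sequence of token indices
rev = torch.flip(seq, dims=[1])      # same tokens, reversed order

with torch.no_grad():
    # Averaging embeddings ignores order: both orderings give the same vector.
    mean_same = torch.allclose(emb(seq).mean(dim=1), emb(rev).mean(dim=1), atol=1e-6)

    # The LSTM reads tokens one at a time, so its final state depends on order.
    _, (h_seq, _) = lstm(emb(seq))
    _, (h_rev, _) = lstm(emb(rev))
    lstm_same = torch.allclose(h_seq, h_rev, atol=1e-6)

print("mean pooling identical:", mean_same)
print("LSTM final state identical:", lstm_same)
```

An averaging model like the one from Milestone 3 cannot tell the two orderings apart, whereas a recurrent layer can.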

---

## Suggested Readings
- [LSTM](https://docs.pytorch.org/docs/stable/generated/torch.nn.LSTM.html)
- [GRU](https://docs.pytorch.org/docs/stable/generated/torch.nn.GRU.html)

---

## ⚙️ Instructions

Use the **constants and helper functions** provided in the next cell to answer all **Milestone-4 questions**.

Perform the following tasks on the **training dataset** provided as part of the Kaggle competition:

🔗 **Competition Link:**  
[2025-Sep-DL-Gen-AI-Project](https://www.kaggle.com/competitions/2025-sep-dl-gen-ai-project)


# Imports

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
import pandas as pd
import numpy as np
import random
from collections import Counter
from torch.nn.utils.rnn import pad_sequence
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

import warnings
warnings.filterwarnings("ignore")

### Set seeds and Constants

In [2]:
#----------------------------- DON'T CHANGE THIS --------------------------
DATA_SEED = 67
TRAINING_SEED = 1234
MAX_LEN = 50
BATCH_SIZE = 64
EMB_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 5

random.seed(DATA_SEED)
np.random.seed(DATA_SEED)
torch.manual_seed(DATA_SEED)
torch.cuda.manual_seed(DATA_SEED)

# Create Vocab

In [3]:
data_path = '/content/train.csv'  # enter your data path here
df = pd.read_csv(data_path)       # read it and store it in df

In [4]:
# Split the training data into train_df (80%) and val_df (20%) using DATA_SEED
# ------------------- write your code here -------------------------------
train_df, val_df = train_test_split(df, test_size=0.2, random_state=DATA_SEED)
#-------------------------------------------------------------------------

In [5]:
# create a simple space-based tokenizer.
# ------------------- write your code here -------------------------------
def tokenize(text):
    return text.split()
#-------------------------------------------------------------------------

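In [18]:
# Note: this cell relies on VOCAB_SIZE (In [8]) and the model classes
# SimpleLSTM, BiLSTM and StackedGRU (In [17]); run it after those cells.
def count_parameters(model):
    """Return the number of trainable parameters in the model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)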

simple_lstm_model = SimpleLSTM(VOCAB_SIZE, EMB_DIM, HIDDEN_DIM, OUTPUT_DIM, PAD_IDX)
bilstm_model = BiLSTM(VOCAB_SIZE, EMB_DIM, HIDDEN_DIM, OUTPUT_DIM, PAD_IDX)
stacked_gru_model = StackedGRU(VOCAB_SIZE, EMB_DIM, HIDDEN_DIM, OUTPUT_DIM, PAD_IDX)

print(f"Number of training parameters for Simple LSTM: {count_parameters(simple_lstm_model)}")
print(f"Number of training parameters for Bidirectional LSTM: {count_parameters(bilstm_model)}")
print(f"Number of training parameters for Stacked GRU: {count_parameters(stacked_gru_model)}")

Number of training parameters for Simple LSTM: 940877
Number of training parameters for Bidirectional LSTM: 1308749
Number of training parameters for Stacked GRU: 1243981
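
The counts printed above can be cross-checked by hand. The sketch below assumes PyTorch's standard parameter layout for recurrent layers (an input-to-hidden matrix, a hidden-to-hidden matrix, and two bias vectors per gate, with 4 gates for an LSTM and 3 for a GRU) and uses the vocabulary size reported in Q1. The helper names `rnn_layer_params` and `linear_params` are illustrative, not part of the notebook.

```python
# Sizes taken from the notebook's constants and the Q1 output.
VOCAB_SIZE, EMB_DIM, HIDDEN_DIM, OUTPUT_DIM = 5730, 100, 256, 5

def rnn_layer_params(input_dim, hidden_dim, gates):
    """Parameters of one recurrent layer in one direction:
    per gate, W_ih (input_dim x hidden_dim), W_hh (hidden_dim x hidden_dim),
    and two bias vectors of size hidden_dim."""
    return gates * (input_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)

def linear_params(in_features, out_features=OUTPUT_DIM):
    """Weight matrix plus bias of a fully connected layer."""
    return in_features * out_features + out_features

# The padding row is still stored in the embedding matrix, so it is counted.
embedding = VOCAB_SIZE * EMB_DIM

simple_lstm = embedding + rnn_layer_params(EMB_DIM, HIDDEN_DIM, gates=4) + linear_params(HIDDEN_DIM)

# Two directions double the recurrent parameters and the linear layer's input size.
bilstm = embedding + 2 * rnn_layer_params(EMB_DIM, HIDDEN_DIM, gates=4) + linear_params(2 * HIDDEN_DIM)

# The second GRU layer takes the first layer's hidden state as its input.
stacked_gru = (
    embedding
    + rnn_layer_params(EMB_DIM, HIDDEN_DIM, gates=3)
    + rnn_layer_params(HIDDEN_DIM, HIDDEN_DIM, gates=3)
    + linear_params(HIDDEN_DIM)
)

print(simple_lstm, bilstm, stacked_gru)  # 940877 1308749 1243981
```

The three totals agree with the values reported by `count_parameters`.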


In [6]:
# Use counter to count all tokens in train_df
token_counter = Counter()
# ------------------- write your code here -------------------------------
for text in train_df['text']:
    token_counter.update(tokenize(text))
#------------------------------------------------------------------------

## Create train and val dataloaders

In [7]:
#----------------------------- DON'T CHANGE THIS --------------------------
specials = ['<unk>', '<pad>']
min_freq = 2
vocab_list = specials + [token for token, freq in token_counter.items() if freq >= min_freq]
word2idx = {token: i for i, token in enumerate(vocab_list)}
UNK_IDX = word2idx['<unk>']
PAD_IDX = word2idx['<pad>']

def text_pipeline(text):
    """Converts text to a list of indices using the word2idx dict."""
    tokens = tokenize(text)
    return [word2idx.get(token, UNK_IDX) for token in tokens]

class EmotionDataset(Dataset):
    def __init__(self, dataframe):
        self.texts = dataframe['text'].values
        self.labels = dataframe[['anger', 'fear', 'joy', 'sadness', 'surprise']].values.astype(np.float32)

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return self.texts[idx], self.labels[idx]

def collate_batch(batch):
    label_list, text_list = [], []
    for (_text, _labels) in batch:
        label_list.append(_labels)
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)[:MAX_LEN]
        text_list.append(processed_text)
    label_list = torch.tensor(label_list, dtype=torch.float32)
    text_list = pad_sequence(text_list, batch_first=True, padding_value=PAD_IDX)
    if text_list.shape[1] < MAX_LEN:
        pad_tensor = torch.full(
            (text_list.shape[0], MAX_LEN - text_list.shape[1]),
            PAD_IDX,
            dtype=torch.int64
        )
        text_list = torch.cat((text_list, pad_tensor), dim=1)

    return text_list, label_list

# Create train and val dataloaders
# ------------------- write your code here -------------------------------
train_dataset = EmotionDataset(train_df)
val_dataset = EmotionDataset(val_df)

train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch)
val_dataloader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_batch)
#------------------------------------------------------------------------
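
As a quick illustration of what `collate_batch` does to sequence lengths, the following standalone sketch applies the same truncate-then-pad logic to toy index lists. `MAX_LEN` is reduced to 5 here for readability (the notebook uses 50), and `pad_to_max_len` is a hypothetical helper written only for this example.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

MAX_LEN = 5   # toy value for illustration; the notebook uses 50
PAD_IDX = 1   # same '<pad>' index as in the notebook

def pad_to_max_len(index_lists):
    """Truncate each sequence to MAX_LEN, then right-pad the batch to exactly MAX_LEN."""
    tensors = [torch.tensor(ids, dtype=torch.int64)[:MAX_LEN] for ids in index_lists]
    # pad_sequence pads only up to the longest sequence in the batch...
    batch = pad_sequence(tensors, batch_first=True, padding_value=PAD_IDX)
    # ...so a second step is needed when every sequence is shorter than MAX_LEN.
    if batch.shape[1] < MAX_LEN:
        extra = torch.full((batch.shape[0], MAX_LEN - batch.shape[1]), PAD_IDX, dtype=torch.int64)
        batch = torch.cat((batch, extra), dim=1)
    return batch

short_batch = pad_to_max_len([[7, 8], [9]])                 # all shorter than MAX_LEN
long_batch = pad_to_max_len([[2, 3, 4, 5, 6, 7, 8], [9]])   # first one gets truncated

print(short_batch)
print(long_batch)
```

Every batch therefore has shape `[batch_size, MAX_LEN]`, which is why the embedding output in Q3 always has 50 time steps.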

### Q1. What are the vocabulary size, padding token index, and unknown token index for the above dataset?

In [8]:
# ------------------- write your code here -------------------------------
VOCAB_SIZE = len(vocab_list)
print(f"Vocabulary Size: {VOCAB_SIZE}")
print(f"Padding Token Index: {PAD_IDX}")
print(f"Unknown Token Index: {UNK_IDX}")
#-------------------------------------------------------------------------

Vocabulary Size: 5730
Padding Token Index: 1
Unknown Token Index: 0


### Q2. What are the indices for the words "happy", "alone", and "sad" in the vocabulary?

In [9]:
# ------------------- write your code here -------------------------------
print(f"Index for 'happy': {word2idx.get('happy', UNK_IDX)}")
print(f"Index for 'alone': {word2idx.get('alone', UNK_IDX)}")
print(f"Index for 'sad': {word2idx.get('sad', UNK_IDX)}")
#-------------------------------------------------------------------------

Index for 'happy': 1578
Index for 'alone': 2525
Index for 'sad': 885
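
The lookup above falls back to `UNK_IDX` for any word that is missing from the vocabulary. A small self-contained sketch of the same vocabulary construction on three made-up sentences shows when that happens; `toy_texts`, `word2idx_toy`, and `encode` are hypothetical names used only here.

```python
from collections import Counter

# Made-up sentences, not taken from the competition data.
toy_texts = ["i am happy", "i am sad", "so happy today"]

counter = Counter()
for text in toy_texts:
    counter.update(text.split())   # same space-based tokenizer as the notebook

specials = ['<unk>', '<pad>']
min_freq = 2
# Words seen fewer than min_freq times are left out of the vocabulary.
vocab = specials + [tok for tok, freq in counter.items() if freq >= min_freq]
word2idx_toy = {tok: i for i, tok in enumerate(vocab)}

def encode(text):
    """Map each token to its index, using the '<unk>' index for unseen tokens."""
    return [word2idx_toy.get(tok, word2idx_toy['<unk>']) for tok in text.split()]

print(vocab)                       # ['<unk>', '<pad>', 'i', 'am', 'happy']
print(encode("i am happy today"))  # [2, 3, 4, 0]  ('today' appears only once)
print(encode("Happy"))             # [0]  (the tokenizer is case-sensitive)
```

Because the tokenizer splits on whitespace only, capitalized or punctuated variants such as `Happy` or `happy!` count as different tokens.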


In [10]:
# Get one batch to test shapes
# Take one batch from the train dataloader and store its token indices in text_batch
text_batch, _ = next(iter(train_dataloader))
emb_layer = nn.Embedding(VOCAB_SIZE, EMB_DIM)
embedded_batch = emb_layer(text_batch)

# Simple LSTM layer Output Shape (Use constants defined in 2nd cell)
lstm = nn.LSTM(EMB_DIM, HIDDEN_DIM, batch_first=True)
#read_output = lstm(embedded_batch)

### Q3. What is the output shape of the Embedding layer?


In [11]:
# ------------------- write your code here -------------------------------
print(embedded_batch.shape)
#-------------------------------------------------------------------------

torch.Size([64, 50, 100])


### Q4. What is the output shape of the simple LSTM layer?

In [12]:
# ------------------- write your code here -------------------------------
lstm_output, _ = lstm(embedded_batch)
print(lstm_output.shape)
#-------------------------------------------------------------------------

torch.Size([64, 50, 256])


### Q5. What is the 'hidden' state shape from a simple LSTM?

In [13]:
# ------------------- write your code here -------------------------------
_, (hidden, cell) = lstm(embedded_batch)
print(hidden.shape)
#-------------------------------------------------------------------------

torch.Size([1, 64, 256])
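
The shapes in Q4 and Q5 are related: for a single-layer, unidirectional LSTM, the final hidden state is the last time step of the output tensor. A minimal check on random input (a stand-in for `embedded_batch`, using the notebook's sizes) might look like this:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

lstm = nn.LSTM(100, 256, batch_first=True)   # EMB_DIM=100, HIDDEN_DIM=256
x = torch.randn(64, 50, 100)                 # [batch, seq_len, emb_dim], random stand-in

with torch.no_grad():
    output, (hidden, cell) = lstm(x)

# output holds the hidden state at every time step: [64, 50, 256]
# hidden holds only the final one, per layer:        [1, 64, 256]
last_step_matches = torch.allclose(output[:, -1, :], hidden[0], atol=1e-6)

print(output.shape, hidden.shape, last_step_matches)
```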


### Q6. What is the 'hidden' state shape from a simple GRU?

In [14]:
# Similarly, run a GRU and find its hidden state shape (a GRU has no cell state)
# ------------------- write your code here -------------------------------
gru = nn.GRU(EMB_DIM, HIDDEN_DIM, batch_first=True)
_, hidden_gru = gru(embedded_batch)
print(hidden_gru.shape)
#-------------------------------------------------------------------------

torch.Size([1, 64, 256])


### Q7. What is the 'output' tensor shape from a bidirectional LSTM?

In [15]:
# Bidirectional LSTM Output Shape
# ------------------- write your code here -------------------------------
bidirectional_lstm = nn.LSTM(EMB_DIM, HIDDEN_DIM, batch_first=True, bidirectional=True)
output_bi_lstm, (hidden_bi_lstm, cell_bi_lstm) = bidirectional_lstm(embedded_batch)
print(output_bi_lstm.shape)
#-------------------------------------------------------------------------

torch.Size([64, 50, 512])


### Q8. What is the 'hidden' state shape from a bidirectional LSTM?

In [16]:
# Bidirectional LSTM Hidden Shape
# ------------------- write your code here -------------------------------
print(hidden_bi_lstm.shape)
#-------------------------------------------------------------------------

torch.Size([2, 64, 256])
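
For the bidirectional case, the two slices of `hidden` correspond to different ends of the `output` tensor: the forward direction finishes at the last time step, the backward direction at the first. The following sketch checks this on random input (a stand-in for `embedded_batch`); the variable names are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

H = 256
bilstm = nn.LSTM(100, H, batch_first=True, bidirectional=True)
x = torch.randn(64, 50, 100)   # random stand-in for an embedded batch

with torch.no_grad():
    output, (hidden, _) = bilstm(x)

# output concatenates both directions on the last axis: [64, 50, 2 * H]
# hidden stacks them on the first axis:                 [2, 64, H]
fwd_matches = torch.allclose(output[:, -1, :H], hidden[0], atol=1e-6)  # forward ends at t = last
bwd_matches = torch.allclose(output[:, 0, H:], hidden[1], atol=1e-6)   # backward ends at t = 0

print(output.shape, hidden.shape, fwd_matches, bwd_matches)
```

This is why a classifier head on a BiLSTM usually concatenates `hidden[-2]` and `hidden[-1]` rather than taking `output[:, -1, :]`, whose backward half has seen only one token.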


### Q9. Create three sequential models: a simple LSTM, a bidirectional LSTM, and a stacked GRU (2 layers). For all models, follow the architecture Embedding layer → [LSTM / BiLSTM / Stacked GRU] → Linear layer. How many trainable parameters does each of the three models have (LSTM, BiLSTM, Stacked GRU)?

In [17]:
# ------------------- write your code here -------------------------------
# Simple LSTM
class SimpleLSTM(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim, output_dim, pad_idx):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.linear = nn.Linear(hidden_dim, output_dim)

    def forward(self, text):
        embedded = self.embedding(text)
        output, (hidden, cell) = self.lstm(embedded)
        hidden = hidden.squeeze(0)
        return self.linear(hidden)

# Bidirectional LSTM
class BiLSTM(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim, output_dim, pad_idx):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.linear = nn.Linear(hidden_dim * 2, output_dim)

    def forward(self, text):
        embedded = self.embedding(text)
        output, (hidden, cell) = self.lstm(embedded)
        hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1)
        return self.linear(hidden)

# Stacked GRU
class StackedGRU(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim, output_dim, pad_idx, n_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_idx)
        self.gru = nn.GRU(emb_dim, hidden_dim, num_layers=n_layers, batch_first=True)
        self.linear = nn.Linear(hidden_dim, output_dim)

    def forward(self, text):
        embedded = self.embedding(text)
        output, hidden = self.gru(embedded)
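        # Final hidden state of the last GRU layer: [batch, hidden_dim].
        # No squeeze here: squeeze(0) would drop the batch axis when batch size is 1.
        hidden = hidden[-1]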
        return self.linear(hidden)
#-------------------------------------------------------------------------
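
One design point worth noting: since `collate_batch` right-pads every sequence to `MAX_LEN`, the models above read the padding tokens too before producing their final hidden state. A possible variant, sketched below, packs the batch with `pack_padded_sequence` so the LSTM stops at each sequence's true length. This is not part of the assignment; `PackedLSTM` is a hypothetical class, and the sketch assumes padding appears only at the end of each row.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class PackedLSTM(nn.Module):
    """Hypothetical variant of SimpleLSTM that skips padded positions."""

    def __init__(self, vocab_size, emb_dim, hidden_dim, output_dim, pad_idx):
        super().__init__()
        self.pad_idx = pad_idx
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.linear = nn.Linear(hidden_dim, output_dim)

    def forward(self, text):
        # True lengths, assuming right-padding only; clamp guards all-pad rows.
        lengths = (text != self.pad_idx).sum(dim=1).clamp(min=1).cpu()
        embedded = self.embedding(text)
        packed = pack_padded_sequence(
            embedded, lengths, batch_first=True, enforce_sorted=False
        )
        _, (hidden, _) = self.lstm(packed)
        # hidden[-1] is the state after each sequence's last real token.
        return self.linear(hidden[-1])

torch.manual_seed(0)
model = PackedLSTM(vocab_size=20, emb_dim=8, hidden_dim=16, output_dim=5, pad_idx=1)
model.eval()

short_pad = torch.tensor([[2, 3, 4, 1, 1]])
long_pad = torch.tensor([[2, 3, 4, 1, 1, 1, 1, 1]])   # same tokens, more padding

with torch.no_grad():
    logits_short = model(short_pad)
    logits_long = model(long_pad)

padding_invariant = torch.allclose(logits_short, logits_long, atol=1e-6)
print(logits_short.shape, padding_invariant)
```

With packing, the amount of trailing padding no longer changes the prediction. The parameter count is unaffected, because packing changes only how the input is fed to the LSTM.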

### Q10. If you experimented with both LSTM and GRU models using the same hyperparameters, which one achieved a better peak Macro F1-score in your W&B logs?
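
For reference, a Macro F1-score for this five-label setup could be computed as sketched below. The sketch assumes the model outputs raw logits that are passed through a sigmoid and thresholded at 0.5; the training and evaluation loop is not shown in this notebook, so `macro_f1_from_logits`, the threshold, and the toy arrays are all illustrative assumptions rather than the notebook's actual evaluation code.

```python
import numpy as np
from sklearn.metrics import f1_score

def macro_f1_from_logits(logits, targets, threshold=0.5):
    """Sigmoid -> threshold -> per-label F1, averaged with equal weight per label."""
    probs = 1.0 / (1.0 + np.exp(-logits))
    preds = (probs >= threshold).astype(int)
    return f1_score(targets, preds, average="macro", zero_division=0)

# Made-up logits and multi-hot targets for 3 samples and 5 emotion labels.
toy_logits = np.array([
    [2.0, -1.0, 0.5, -3.0, -2.0],
    [-1.0, 3.0, -0.5, 1.0, -2.0],
    [0.2, -2.0, 2.5, -1.0, 1.5],
])
toy_targets = np.array([
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
])

score = macro_f1_from_logits(toy_logits, toy_targets)
print(f"Macro F1: {score:.4f}")
```

In this toy example only the first label has an error (one false positive), giving per-label F1 values of 2/3, 1, 1, 1, 1 and a macro average of 14/15.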

### Q11. Compare the total training time for your best sequential model against the simple averaging model from Milestone 3. How much longer (in minutes or percentage) did the more complex model (LSTM and GRU) take to train for the same number of epochs?
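
One way to collect the numbers for this comparison is to wrap each model's training loop in a wall-clock timer. The sketch below assumes a zero-argument callable `train_one_epoch` that runs one epoch (a hypothetical name; the notebook does not define it) and reports the slowdown as a percentage of the baseline time.

```python
import time

def time_epochs(train_one_epoch, n_epochs):
    """Run n_epochs of training and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    for _ in range(n_epochs):
        train_one_epoch()
    return time.perf_counter() - start

def relative_slowdown(baseline_seconds, model_seconds):
    """Percentage by which the model is slower than the baseline."""
    return 100.0 * (model_seconds - baseline_seconds) / baseline_seconds

# Placeholder epoch function so the sketch runs on its own; replace it with real training.
elapsed = time_epochs(lambda: None, n_epochs=3)
print(f"elapsed: {elapsed:.6f} s")

# Illustrative arithmetic only: a 90 s run against a 60 s baseline is 50% slower.
print(f"{relative_slowdown(60.0, 90.0):.1f}% slower")
```

When training on a GPU, calling `torch.cuda.synchronize()` before reading the timer gives a more faithful measurement, since CUDA kernels run asynchronously.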

### Q13. Based on your experiments, what was the most impactful hyperparameter you tuned for your sequential model (e.g., learning rate, hidden size, number of layers, dropout rate)?