# Sequence Modeling with RNNs, LSTMs, and GRUs

Sequences (text, audio, time series) require models that respect order and temporal dependencies. This notebook connects embeddings and recurrent networks to the attention mechanisms you will implement next.

## Learning Objectives

- Construct embeddings and recurrent blocks (RNN, LSTM, GRU) in PyTorch.
- Handle variable-length sequences with packing utilities.
- Implement teacher forcing in a simple sequence-to-sequence model.
- Build a bi-LSTM tagger with masking and accuracy metrics.

## Representing Tokens with Embeddings

Embeddings map discrete token IDs to continuous vectors, so token representations can be learned by gradient descent like any other weight. `nn.Embedding` initializes them randomly (standard normal by default); gradients shape them during training.

In [None]:
import torch
import torch.nn as nn

torch.manual_seed(2)

vocab_size = 20
embedding_dim = 16
embedding = nn.Embedding(vocab_size, embedding_dim)
tokens = torch.randint(0, vocab_size, (4, 6))
embedded = embedding(tokens)
print(embedded.shape)

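To make the lookup concrete, the sketch below checks two properties of `nn.Embedding`: the forward pass is plain row indexing into the weight matrix, and only the rows whose IDs appear in the batch receive gradient. The names `demo_emb` and `demo_ids` are illustrative and chosen so they do not overwrite the variables used elsewhere in this notebook.

```python
import torch
import torch.nn as nn

demo_emb = nn.Embedding(num_embeddings=20, embedding_dim=16)
demo_ids = torch.tensor([[1, 3, 3], [7, 1, 0]])  # (batch=2, seq_len=3)

# Forward pass: each ID selects one row of the weight matrix.
looked_up = demo_emb(demo_ids)        # (2, 3, 16)
indexed = demo_emb.weight[demo_ids]   # same rows, fetched by plain indexing
same_as_indexing = torch.equal(looked_up, indexed)

# Backward pass: only rows whose IDs appeared in the batch get a gradient.
looked_up.sum().backward()
rows_with_grad = demo_emb.weight.grad.abs().sum(dim=1).nonzero().flatten().tolist()

print(same_as_indexing)  # True
print(rows_with_grad)    # [0, 1, 3, 7]
```

Rows for tokens that never occur in a batch are left untouched by that update, which is one reason rare tokens learn slowly.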

## Vanilla RNN vs. GRU vs. LSTM

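Recurrent networks process one timestep at a time, carrying a hidden state from each step to the next. GRUs and LSTMs add gates that control how much of the previous state is kept or overwritten, which mitigates the vanishing-gradient problem that vanilla RNNs run into on long sequences.

The cell below compares mean hidden-state norms across the three layers as a quick stability diagnostic.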

In [None]:
sequence_length = 6
hidden_size = 32

rnn = nn.RNN(embedding_dim, hidden_size, batch_first=True)
gru = nn.GRU(embedding_dim, hidden_size, batch_first=True)
lstm = nn.LSTM(embedding_dim, hidden_size, batch_first=True)

rnn_out, _ = rnn(embedded)
gru_out, _ = gru(embedded)
lstm_out, _ = lstm(embedded)

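# Each output has shape (batch, seq_len, hidden_size): take the L2 norm of every
# hidden vector, then average over batch and time.
for name, out in {"RNN": rnn_out, "GRU": gru_out, "LSTM": lstm_out}.items():
    mean_norm = out.norm(dim=-1).mean().item()
    print(f"{name} hidden-state mean norm: {mean_norm:.4f}")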

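To see what the gates mentioned above compute, here is a minimal sketch that reproduces one GRU step by hand and compares it with `nn.GRUCell`. It assumes the gate layout in the PyTorch documentation, where `weight_ih` and `weight_hh` stack the reset, update, and candidate blocks in that order. The sizes and the `demo_` names are illustrative.

```python
import torch
import torch.nn as nn

demo_cell = nn.GRUCell(input_size=16, hidden_size=32)
x = torch.randn(4, 16)   # one timestep of input for a batch of 4
h = torch.randn(4, 32)   # previous hidden state

with torch.no_grad():
    # Input and hidden projections, each holding the three gate blocks side by side.
    gi = x @ demo_cell.weight_ih.T + demo_cell.bias_ih   # (4, 96)
    gh = h @ demo_cell.weight_hh.T + demo_cell.bias_hh   # (4, 96)
    i_r, i_z, i_n = gi.chunk(3, dim=1)
    h_r, h_z, h_n = gh.chunk(3, dim=1)

    r = torch.sigmoid(i_r + h_r)       # reset gate: how much past state feeds the candidate
    z = torch.sigmoid(i_z + h_z)       # update gate: how much past state is kept
    n = torch.tanh(i_n + r * h_n)      # candidate state
    h_manual = (1 - z) * n + z * h     # blend of candidate and previous state

    h_ref = demo_cell(x, h)

gru_step_matches = torch.allclose(h_manual, h_ref, atol=1e-5)
print(gru_step_matches)
```

When `z` is close to 1 the previous state passes through almost unchanged, giving gradients a near-identity path through time; this is the mechanism behind the vanishing-gradient mitigation.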

## Mini Task – Packed Sequences

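Variable-length batches waste compute if the RNN processes padding, and the padded steps also leak into the final hidden state. `pack_padded_sequence` and `pad_packed_sequence` let the RNN skip padded timesteps. By default `pack_padded_sequence` expects the batch sorted by decreasing length (`enforce_sorted=True`), which is why the cells below sort first.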

Implement the packing/unpacking workflow.

In [None]:
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

lengths = torch.tensor([6, 4, 2, 1])
sorted_idx = torch.argsort(lengths, descending=True)
sorted_embedded = embedded[sorted_idx]

# TODO: pack, run through GRU, then unpack back to padded form


In [None]:
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

lengths = torch.tensor([6, 4, 2, 1])
sorted_idx = torch.argsort(lengths, descending=True)
sorted_embedded = embedded[sorted_idx]
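# Pack so the GRU only sees valid timesteps (the lengths tensor must be on the CPU).
packed = pack_padded_sequence(sorted_embedded, lengths[sorted_idx], batch_first=True)
packed_out, hidden = gru(packed)
# Unpack to a padded tensor; positions past each length are zero-filled.
padded_out, _ = pad_packed_sequence(packed_out, batch_first=True)
# Undo the sort so rows line up with the original batch order.
padded_out = padded_out[torch.argsort(sorted_idx)]
print(padded_out.shape)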

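As an alternative to sorting by hand, `pack_padded_sequence` accepts `enforce_sorted=False` and tracks the permutation internally. The sketch below uses a deliberately unsorted batch and checks three things: rows come back in the original order, padded positions are zero, and the final hidden state equals the output at each sequence's last valid step. The tensors and `demo_` names are made up for illustration.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

demo_gru = nn.GRU(input_size=16, hidden_size=32, batch_first=True)
demo_batch = torch.randn(4, 6, 16)          # (batch, max_len, features)
demo_lengths = torch.tensor([2, 6, 1, 4])   # deliberately unsorted

with torch.no_grad():
    packed = pack_padded_sequence(
        demo_batch, demo_lengths, batch_first=True, enforce_sorted=False
    )
    packed_out, h_n = demo_gru(packed)
    out, out_lengths = pad_packed_sequence(
        packed_out, batch_first=True, total_length=demo_batch.size(1)
    )

# Rows are returned in the original batch order.
lengths_restored = out_lengths.tolist() == demo_lengths.tolist()

# Positions at or past each sequence's length are zero-filled.
pad_mask = torch.arange(out.size(1)).unsqueeze(0) >= demo_lengths.unsqueeze(1)
padding_is_zero = bool((out[pad_mask] == 0).all())

# h_n holds the state at each sequence's last valid step, not at the padded end.
last_valid = out[torch.arange(out.size(0)), demo_lengths - 1]
final_state_matches = torch.allclose(h_n[0], last_valid, atol=1e-6)

print(out.shape, lengths_restored, padding_is_zero, final_state_matches)
```

The third check is the practical payoff: without packing, the final hidden state of a short sequence would include updates from padding steps.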

## Sequence-to-Sequence Skeleton

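Before attention, encoder-decoder RNNs were the backbone of neural machine translation. The encoder compresses the source into its final hidden state, which initializes the decoder. The example below uses teacher forcing: during training, the decoder receives the ground-truth target tokens as input rather than its own predictions.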

In [None]:
class Seq2Seq(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, vocab_size)

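    def forward(self, src, tgt):
        src_emb = self.embedding(src)
        tgt_emb = self.embedding(tgt)
        # The encoder's final hidden state summarizes the source sequence.
        _, hidden = self.encoder(src_emb)
        # Teacher forcing: the decoder reads the ground-truth target tokens,
        # starting from the encoder's summary.
        outputs, _ = self.decoder(tgt_emb, hidden)
        return self.classifier(outputs)  # (batch, tgt_len, vocab_size)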

model = Seq2Seq(vocab_size=30, embed_dim=18, hidden_dim=32)
src = torch.randint(0, 30, (2, 5))
tgt = torch.randint(0, 30, (2, 6))
print(model(src, tgt).shape)

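The skeleton above returns logits but does not show how teacher forcing enters the loss. A common convention, assumed here and not taken from the notebook, is to feed the decoder the target shifted right (`tgt[:, :-1]`) and score it against the target shifted left (`tgt[:, 1:]`). At inference time there is no ground truth, so the decoder consumes its own previous prediction. The sketch below shows both. `TinySeq2Seq` mirrors the class above so the cell runs on its own, and the `BOS` id is a hypothetical start-of-sequence token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySeq2Seq(nn.Module):
    """Same layout as the Seq2Seq class above, redefined so this cell is self-contained."""
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src, tgt_in):
        _, hidden = self.encoder(self.embedding(src))
        outputs, _ = self.decoder(self.embedding(tgt_in), hidden)
        return self.classifier(outputs)

BOS = 1  # hypothetical start-of-sequence id (an assumption for this sketch)
demo_model = TinySeq2Seq(vocab_size=30, embed_dim=18, hidden_dim=32)
demo_src = torch.randint(2, 30, (2, 5))
demo_tgt = torch.randint(2, 30, (2, 6))
demo_tgt[:, 0] = BOS

# Training: the decoder sees tokens 0..T-2 and must predict tokens 1..T-1.
decoder_in = demo_tgt[:, :-1]                   # (2, 5)
labels = demo_tgt[:, 1:]                        # (2, 5)
tf_logits = demo_model(demo_src, decoder_in)    # (2, 5, 30)
tf_loss = F.cross_entropy(tf_logits.reshape(-1, 30), labels.reshape(-1))
tf_loss.backward()

# Inference: greedy decoding, feeding back the model's own prediction each step.
with torch.no_grad():
    _, hidden = demo_model.encoder(demo_model.embedding(demo_src))
    step_in = torch.full((2, 1), BOS, dtype=torch.long)
    steps = []
    for _ in range(5):
        out, hidden = demo_model.decoder(demo_model.embedding(step_in), hidden)
        step_in = demo_model.classifier(out[:, -1]).argmax(dim=-1, keepdim=True)
        steps.append(step_in)
    generated = torch.cat(steps, dim=1)         # (2, 5)

print(tf_logits.shape, generated.shape)
```

The gap between the two loops is the usual caveat with teacher forcing: the model is trained on clean histories but must cope with its own mistakes at inference time.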

## Comprehensive Exercise – Bi-LSTM Tagger

Implement a bi-directional LSTM tagger with masking for padding positions and an accuracy metric that ignores padded tokens.

In [None]:
class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_tags):
        super().__init__()
        # TODO: initialize embedding, bi-LSTM, dropout, classifier

    def forward(self, tokens, lengths):
        # TODO: run packed sequence through bi-LSTM, return logits
        raise NotImplementedError

def token_accuracy(logits, targets, mask):
    # TODO: compute accuracy ignoring masked positions
    raise NotImplementedError


In [None]:
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_tags, dropout=0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_dim * 2, num_tags)

    def forward(self, tokens, lengths):
        embedded = self.embedding(tokens)
        packed = pack_padded_sequence(embedded, lengths.cpu(), batch_first=True, enforce_sorted=False)
        packed_out, _ = self.lstm(packed)
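        # total_length keeps the output aligned with `tokens` even when the
        # longest sequence in the batch is shorter than the padded width.
        outputs, _ = pad_packed_sequence(packed_out, batch_first=True, total_length=tokens.size(1))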
        outputs = self.dropout(outputs)
        return self.classifier(outputs)

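def token_accuracy(logits, targets, mask):
    """Fraction of correct predictions among positions where mask is True."""
    preds = logits.argmax(dim=-1)
    correct = (preds == targets) & mask
    total = mask.sum().clamp(min=1)  # avoid division by zero on an all-padding batch
    return correct.sum().float() / total.float()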

torch.manual_seed(4)
tokens = torch.randint(1, 50, (3, 7))
lengths = torch.tensor([7, 5, 6])
tags = torch.randint(0, 8, (3, 7))
mask = torch.arange(tokens.size(1)).expand_as(tokens) < lengths.unsqueeze(1)

tagger = BiLSTMTagger(50, 32, 64, 8)
logits = tagger(tokens, lengths)
print("Accuracy:", token_accuracy(logits, tags, mask).item())

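The same mask also belongs in the training loss, otherwise padded positions contribute gradient. One way to do this, sketched below on random stand-in tensors, is to overwrite padded targets with `-100`, the default `ignore_index` of `F.cross_entropy`. The block also computes the masked mean by hand to confirm the two agree. Shapes match the tagger example above; the `demo_` tensors are illustrative.

```python
import torch
import torch.nn.functional as F

demo_logits = torch.randn(3, 7, 8)              # (batch, seq_len, num_tags)
demo_tags = torch.randint(0, 8, (3, 7))
demo_lengths = torch.tensor([7, 5, 6])
demo_mask = torch.arange(7).unsqueeze(0) < demo_lengths.unsqueeze(1)  # (3, 7), True on real tokens

# Option 1: mark padded targets with ignore_index so cross_entropy skips them.
masked_tags = demo_tags.masked_fill(~demo_mask, -100)
loss_ignore = F.cross_entropy(
    demo_logits.reshape(-1, 8), masked_tags.reshape(-1), ignore_index=-100
)

# Option 2: per-token losses, averaged over real tokens only.
per_token = F.cross_entropy(
    demo_logits.reshape(-1, 8), demo_tags.reshape(-1), reduction="none"
).reshape(3, 7)
loss_manual = (per_token * demo_mask).sum() / demo_mask.sum()

losses_agree = torch.allclose(loss_ignore, loss_manual, atol=1e-6)
print(losses_agree)
```

Option 1 is shorter; option 2 is handy when you also want per-sequence losses or custom weighting.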

## Further Reading

- PyTorch tutorial, "NLP From Scratch: Translation with a Sequence to Sequence Network and Attention": https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html
- Bahdanau et al. (2015) – Neural Machine Translation by Jointly Learning to Align and Translate
- Consider torchtext or Hugging Face Datasets for large-scale sequence inputs