# Mini Transformer with Pretrained GloVe Embeddings

This notebook is a **reference** for building a simple text classifier using:

- Pretrained **GloVe** word embeddings (used as a lookup table)
- A **TransformerEncoder** from PyTorch
- A final **Linear** layer as classifier

## 1. Imports

We import PyTorch, a few utilities, and `torchtext` for loading GloVe.

- `nn.Module`: base class for all neural network modules
- `nn.Embedding`: lookup-table layer for vector embeddings
- `nn.TransformerEncoder` and `nn.TransformerEncoderLayer`: the core Transformer blocks


In [9]:
import math
import torch
import torch.nn as nn
from torchtext.vocab import GloVe, Vocab
from collections import Counter


## 2. Load GloVe and Build a Vocab

Here we:

1. Load pretrained GloVe vectors using `torchtext.vocab.GloVe`.
2. Build a `Vocab` object from `glove.stoi` (string-to-index mapping).
3. Add special tokens:
   - `<unk>` for unknown words
   - `<pad>` for padding

The important idea:

> **GloVe is just a big lookup table.**

Each word has a fixed vector, and we wrap that into a PyTorch `nn.Embedding` layer later.

In [11]:
# Load GloVe
glove = GloVe(name="6B", dim=100)

# Define special tokens
specials = ["<unk>", "<pad>"]

# Build a Counter from GloVe vocab (all frequency=1)
counter = Counter(glove.stoi.keys())

# Create Vocab properly
my_vocab = Vocab(counter, specials=specials)

vocab_size = len(my_vocab)
embedding_dim = glove.dim  # same as glove.vectors.size(1)

print("Vocab size:", vocab_size)
print("Embedding dim:", embedding_dim)

Vocab size: 400002
Embedding dim: 100


## 3. Create an Embedding Layer from Pretrained Vectors

We now wrap the GloVe tensor into an `nn.Embedding` using
`nn.Embedding.from_pretrained`.

- `glove.vectors` is a tensor of shape `[vocab_size_without_specials, embedding_dim]`.
- We need to **extend** it to include our `<unk>` and `<pad>` rows.
- `freeze=True` means we do **not** train the embeddings; they stay as GloVe.

This layer is still just a **lookup table**: it maps token IDs â†’ word vectors.

In [12]:
# Build a weight matrix that matches our vocab (including specials)
num_specials = len(specials)
pad_vectors = torch.zeros(num_specials, embedding_dim)

# Order: specials first, then GloVe vectors
embedding_weights = torch.cat([pad_vectors, glove.vectors], dim=0)
assert embedding_weights.size(0) == vocab_size

embedding_layer = nn.Embedding.from_pretrained(
    embedding_weights,
    freeze=True  # set to False if you want to fineâ€‘tune the embeddings
)


In [13]:
embedding_weights

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.3609, -0.1692, -0.3270,  ...,  0.2714, -0.2919,  0.1611],
        [-0.1046, -0.5047, -0.4933,  ...,  0.4253, -0.5125, -0.1705],
        [ 0.2837, -0.6263, -0.4435,  ...,  0.4368, -0.8261, -0.1570]])

## 4. Positional Encoding

Selfâ€‘attention by itself is **positionâ€‘agnostic**. It doesn't know which token
came first.

We add a standard sinusoidal positional encoding (as in the original Transformer paper):

- Same `d_model` as the embeddings
- Precomputed for a maximum sequence length
- Added to the embeddings before passing them to the Transformer


In [14]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.cos(position * div_term)
        pe[:, 1::2] = torch.sin(position * div_term)
        pe = pe.unsqueeze(0)  # shape: [1, max_len, d_model]
        self.register_buffer("pe", pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Add positional encodings to input.

        x shape: [batch_size, seq_len, d_model]
        """
        seq_len = x.size(1)
        x = x + self.pe[:, :seq_len]
        return self.dropout(x)


## 5. The Transformer Model (`Net`)

We now define a simple classifier model that subclasses `nn.Module`:

1. **`__init__`**: define layers
   - `self.emb`: our GloVeâ€‘based embedding layer
   - `self.pos_encoder`: positional encoding module
   - `self.transformer_encoder`: a stack of TransformerEncoderLayers
   - `self.classifier`: final linear layer mapping to `num_classes`

2. **`forward`**: define the forward pass
   - Look up embeddings for token IDs
   - Scale them by `sqrt(d_model)`
   - Add positional encodings
   - Pass through the transformer encoder
   - Meanâ€‘pool over sequence length to get a sentence representation
   - Pass through classifier to get logits


In [None]:
class Net(nn.Module):
    def __init__(self,
                 num_classes: int,
                 embedding_layer: nn.Embedding,
                 nhead: int = 2,
                 dim_feedforward: int = 128,
                 num_layers: int = 2,
                 dropout: float = 0.1):
        super().__init__()

        self.emb = embedding_layer
        d_model = embedding_layer.embedding_dim

        self.pos_encoder = PositionalEncoding(d_model=d_model,
                                              dropout=dropout)

        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=nhead,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
            batch_first=True,  # so input is [batch, seq, d_model]
        )
        self.transformer_encoder = nn.TransformerEncoder(
            encoder_layer,
            num_layers=num_layers,
        )

        self.classifier = nn.Linear(d_model, num_classes)
        self.d_model = d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass.

        x: tensor of token IDs with shape [batch_size, seq_len]
        returns: logits with shape [batch_size, num_classes]
        """
        # 1. Embedding lookup (GloVe as lookup table)
        x = self.emb(x) * math.sqrt(self.d_model)

        # 2. Add positional encodings
        x = self.pos_encoder(x)

        # 3. Transformer encoder (selfâ€‘attention + FFN layers)
        x = self.transformer_encoder(x)

        # 4. Mean pooling over the sequence dimension -> meaning we turn the set of vectors in the sentence into a single average vector
        x = x.mean(dim=1)

        # 5. Final classifier
        x = self.classifier(x)
        return x


## 6. Simple Tokenization Helper

For this reference notebook, we'll use a **very naive tokenizer**:

- Lowercase the sentence
- Split on spaces
- Look up each token in the vocab (unknown words â†’ `<unk>`)

In [19]:
def encode_sentence(sentence: str, vocab_obj, max_len: int = 16) -> torch.Tensor:
    tokens = sentence.lower().split()
    ids = [vocab_obj[t] for t in tokens[:max_len]]

    # Pad if needed
    pad_id = vocab_obj["<pad>"]
    if len(ids) < max_len:
        ids += [pad_id] * (max_len - len(ids))

    return torch.tensor(ids, dtype=torch.long)


# Quick test
example = "This is a tiny test sentence"
print(encode_sentence(example, my_vocab))

tensor([358161, 192974,  43011, 360284, 356532, 325081,      1,      1,      1,
             1,      1,      1,      1,      1,      1,      1])


## 7. Instantiate the Model and Run a Forward Pass

Here we create the `Net` model and run a **single forward pass** to see
the shapes and verify that everything is wired correctly.


In [20]:
num_classes = 3  # e.g. positive / neutral / negative
model = Net(num_classes=num_classes, embedding_layer=embedding_layer)

sentence_batch = [
    "This movie was surprisingly good",
    "I really did not like this",
]

batch_ids = torch.stack([
    encode_sentence(s, my_vocab, max_len=16) for s in sentence_batch
])  # shape: [batch_size, seq_len]

logits = model(batch_ids)
print("Input IDs shape:", batch_ids.shape)
print("Logits shape:", logits.shape)
print("Logits:")
print(logits)

torch.Size([2, 16, 100])
Input IDs shape: torch.Size([2, 16])
Logits shape: torch.Size([2, 3])
Logits:
tensor([[ 0.7014,  0.2494, -0.8791],
        [ 0.5865,  0.1323, -0.2063]], grad_fn=<AddmmBackward0>)


## 9. Recap

- **GloVe** provides static word embeddings via a lookup table.
- `nn.Embedding.from_pretrained` wraps the lookup table into a PyTorch layer.
- `nn.Module` is the base class; `forward()` defines how inputs flow through the model.
- `nn.TransformerEncoderLayer` contains selfâ€‘attention + feedâ€‘forward sublayers.
- We:
  1. Embed tokens
  2. Add positional encodings
  3. Pass through the Transformer encoder
  4. Meanâ€‘pool over time
  5. Classify with a linear layer

You can now adapt this notebook for your own experiments, add notes, and
extend the model as needed. ðŸ’¡


## Feed-Forward Network (FFN) in the Transformer

In the original *Attention Is All You Need* paper, the Feed-Forward Network (FFN)
inside each Transformer encoder layer is defined as:

$$
\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)\, W_2 + b_2
$$

Where:

- $x$ is a **single token embedding** of dimension $d_{\text{model}}$
- $W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$ and $b_1$ expand the representation
- $W_2 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$ and $b_2$ compress it back
- $\max(0, \cdot)$ is the **ReLU activation function**

This FFN is applied **independently to each token** (no token-to-token interaction).
Token interaction happens in the self-attention block; the FFN is responsible for
**non-linear feature transformation** at the token level.



## Why the Transformer Uses a Feed-Forward Network

The Transformer FFN serves a different role than self-attention:

- **Self-attention** mixes information *across tokens*
- **Feed-Forward Network (FFN)** transforms information *within each token*

The FFN works by:
1. Expanding the token embedding into a higher-dimensional space
2. Applying a non-linear transformation
3. Compressing it back to the original model dimension

This pattern gives the model more expressive power while keeping the input/output
dimensions compatible with residual connections.

### ReLU activation

The term

$$
\max(0, x)
$$

is the **ReLU (Rectified Linear Unit)** activation function.

ReLU is defined element-wise as:
- Output $x$ if $x > 0$
- Output $0$ if $x \le 0$

ReLU introduces **non-linearity**, which is essential â€” without it, the FFN would
collapse into a single linear transformation and lose expressive power.

In PyTorch, the FFN is implemented as:

```python
x = linear1(x)   # d_model â†’ d_ff
x = relu(x)      # non-linearity (max(0, x))
x = dropout(x)
x = linear2(x)   # d_ff â†’ d_model
