# Lab 01c: Transformers

Transformers probably don't need an introduction, but in case you need a refresher: The transformer neural network architecture is what powers the likes of GPT, Llama, and co. In this notebook, you will implement GPT from scratch. Yes, you read that right.

Now that I have your _attention_, let's look at what we have to understand and implement to achieve this feat.

First and foremost, the transformers we build here are language models. They are trained on raw text (the same idea carries over to other data modalities) in a self-supervised fashion: the training objective is computed directly from the data, so no human labeling is needed.
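For a GPT-style model, the self-supervised objective is next-token prediction. The following minimal sketch shows how inputs and targets can both be derived from the same sequence; the token IDs are arbitrary placeholder values chosen for illustration.

```python
import torch

# Placeholder token IDs standing in for a tokenized piece of text (assumption).
token_ids = torch.tensor([0, 4, 6, 5, 3, 1, 2])

# Next-token prediction: the target at position t is the token at position t + 1,
# so the labels come for free from the data itself.
inputs = token_ids[:-1]
targets = token_ids[1:]

for x, y in zip(inputs.tolist(), targets.tolist()):
    print(f"input token {x} -> target token {y}")
```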

The transformer architecture comprises two building blocks:
- An encoder that receives the input and builds features from it.
- A decoder that uses these features (along with other inputs) to generate an output sequence (usually probabilities).

Depending on the task, a model might have both or only one of the two.

Encoder-decoder architectures are not new; autoencoders, for example, follow the same pattern. What makes transformers special is _attention_.

In [1]:
import torch
torch.manual_seed(42)

<torch._C.Generator at 0x116ef28b0>

## Embeddings

Before we worry about attention, we first have to transform words (or sentences) into a form that a neural network can understand. This is achieved by embedding the words. Let's look at the sentence "Life is too short for bad coffee".
We restrict our vocabulary to the words that occur in this sentence. In reality, the vocabulary is of course **much** bigger.

In [2]:
sentence = "Life is too short for bad coffee"

# Deduplicate with set() so that indices stay contiguous even if a word repeats
vocab = {s: i for i, s in enumerate(sorted(set(sentence.split())))}
n = len(vocab)
vocab

{'Life': 0, 'bad': 1, 'coffee': 2, 'for': 3, 'is': 4, 'short': 5, 'too': 6}

Using our vocabulary, we can assign an integer index to each word.

In [3]:
import torch
from torch import nn
import torch.nn.functional as F

sentence_int = torch.tensor([vocab[s] for s in sentence.split()])
sentence_int

tensor([0, 4, 6, 5, 3, 1, 2])

Using an [embedding layer](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html), we can transform the integer representation `sentence_int` into a real-valued embedding. The embedding layer is simply a look-up table that stores one embedding of fixed size for each entry of a fixed dictionary (the vocab). We'll use 8-dimensional embeddings (for future reference, $d = 8$). Looking up the 7 tokens of our sentence yields a $7 \times 8$ matrix (the look-up table itself is also $7 \times 8$, since the vocab contains $n = 7$ words).

In [4]:
embedding = torch.nn.Embedding(len(vocab), 8)
embedded = embedding(sentence_int).detach()
embedded

tensor([[ 1.9269,  1.4873,  0.9007, -2.1055,  0.6784, -1.2345, -0.0431, -1.6047],
        [-1.3847, -0.8712, -0.2234,  1.7174,  0.3189, -0.4245,  0.3057, -0.7746],
        [-1.2024,  0.7078, -1.0759,  0.5357,  1.1754,  0.5612, -0.4527, -0.7718],
        [-0.8371, -0.9224,  1.8113,  0.1606,  0.3672,  0.1754,  1.3852, -0.4459],
        [ 1.2791,  1.2964,  0.6105,  1.3347, -0.2316,  0.0418, -0.2516,  0.8599],
        [-0.7521,  1.6487, -0.3925, -1.4036, -0.7279, -0.5594, -0.7688,  0.7624],
        [ 1.6423, -0.1596, -0.4974,  0.4396, -0.7581,  1.0783,  0.8008,  1.6806]])
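The look-up-table view can be checked directly. The sketch below uses a fresh embedding layer with its own seed (so the numbers differ from the output above) and compares the layer's output with plain row indexing into its weight matrix.

```python
import torch

torch.manual_seed(0)
table = torch.nn.Embedding(7, 8)            # 7 vocabulary entries, 8 dimensions each
ids = torch.tensor([0, 4, 6, 5, 3, 1, 2])   # the sentence as token IDs

looked_up = table(ids)                      # forward pass of the embedding layer
indexed = table.weight[ids]                 # plain row indexing into the weight matrix

print(looked_up.shape)
print(torch.equal(looked_up, indexed))
```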

If you have ever used the [🤗 Transformers library](https://huggingface.co/docs/transformers/index) you have very likely used some type of `Tokenizer`. Tokenizers implement the process described above - although on a much larger scale. When you use a tokenizer to process a piece of text, it breaks it down into tokens and assigns a unique numerical identifier (token ID) to each token. These token IDs are what the model uses as input during training or inference. The model's embedding layer then looks up the corresponding embeddings for these token IDs from its embedding matrix.
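To make the tokenizer idea concrete, here is a toy word-level sketch. The helper names `build_vocab`, `encode`, and `decode` are hypothetical and not part of any library. Real tokenizers typically split text into subword units and ship with a fixed, pre-trained vocabulary.

```python
# Toy word-level tokenizer; all helper names are hypothetical.
def build_vocab(text):
    vocab = {"<unk>": 0}                      # ID 0 is reserved for unknown words
    for word in sorted(set(text.split())):
        vocab[word] = len(vocab)
    return vocab


def encode(text, vocab):
    # Map every word to its token ID, falling back to <unk> for unseen words.
    return [vocab.get(word, vocab["<unk>"]) for word in text.split()]


def decode(ids, vocab):
    inverse = {i: word for word, i in vocab.items()}
    return " ".join(inverse[i] for i in ids)


toy_vocab = build_vocab("Life is too short for bad coffee")
ids = encode("Life is too short for bad tea", toy_vocab)
print(ids)                     # [1, 5, 7, 6, 4, 2, 0]
print(decode(ids, toy_vocab))  # Life is too short for bad <unk>
```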

## Attention

The key feature of Transformer models is their use of attention, more precisely _self-attention_. The transformer architecture, which is built entirely around self-attention, was introduced in the publication [Attention is all you need](https://arxiv.org/abs/1706.03762).

Attention mimics human cognitive attention by calculating "soft" weights for each word in the current context window. Soft weights can change at runtime, as opposed to "hard" weights which are computed through, and constant after, training.

The self-attention mechanism is also known as _scaled dot-product attention_, which will make sense once you see the mathematical formula describing it.

The trainable components of self-attention are the three weight matrices, $\mathbf{W}_q, \mathbf{W}_k, \mathbf{W}_v$. They project the input sequence $\mathbf{x}$ to form _query_ ($q$), _key_ ($k$), and  _value_ ($v$) sequences:

- query sequence $\mathbf{q}^{(i)}=\mathbf{W}_q \mathbf{x}^{(i)}$
- key sequence $\mathbf{k}^{(i)}=\mathbf{W}_k \mathbf{x}^{(i)}$ 
- value sequence $\mathbf{v}^{(i)}=\mathbf{W}_v \mathbf{x}^{(i)}$

where $i \in [1, T]$ is the token index and $T$ the length of the input sequence. Both $\mathbf{q}^{(i)}$ and $\mathbf{k}^{(i)}$ are vectors of dimension $d_k$. This is important because queries and keys will later be combined via a dot product. The projection matrices $\mathbf{W}_q$ and $\mathbf{W}_k$ are $d_k \times d$, while $\mathbf{W}_v$ is $d_v \times d$. $d_v$ is not constrained by the dimensions of the other vectors, as it is the size of the resulting context vector.

For now, we'll set $d_q = d_k = 24$ and $d_v = 30$.

In [6]:
d = embedded.shape[1]

# TODO: Set the dimensions of the query, key and value vectors
d_q, d_k, d_v = 24, 24, 30

W_q = torch.nn.Parameter(torch.rand(d_q, d))
# TODO: Create the key and value weight matrices
W_k = torch.nn.Parameter(torch.rand(d_k, d))
W_v = torch.nn.Parameter(torch.rand(d_v, d))

### Unnormalized attention weights

Computing the unnormalized attention weight that the $i$-th word ("the query") attributes to the $j$-th word ("the key") is straightforward: it's simply the _dot-product_ of the corresponding query and key vectors:

$$
\omega_{i,j} = {\mathbf{q}^{(i)}}^\top \mathbf{k}^{(j)}
$$

Let's pick the fourth word (zero-indexed, so $i = 3$):

In [7]:
x_4 = embedded[3]
q_4 = W_q.matmul(x_4)
k_4 = W_k.matmul(x_4)
v_4 = W_v.matmul(x_4)

Now it's your turn: generalize the computation of the keys and values to all $j$.

_Hint: It involves matrix multiplication._ Ensure that $K$ is $d_k \times n$ and $V$ is $d_v \times n$.

In [15]:
# TODO: Implement this.
K = W_k @ embedded.T
# Values must be projected with W_v, not W_k
V = W_v @ embedded.T

assert K.shape == (d_k, len(sentence.split()))
assert V.shape == (d_v, len(sentence.split()))

With $K$ available, computing the unnormalized weight vector for the $i$-th token is as simple as multiplying the query vector with the $K$-matrix:

In [12]:
omega_4 = q_4.matmul(K)
omega_4.shape

torch.Size([7])
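As a sanity check, the matrix product should agree with computing each dot product $\omega_{i,j}$ one key at a time. The sketch below is self-contained and uses its own random stand-in embeddings and weights, so the values differ from the cells above.

```python
import torch

torch.manual_seed(0)
d, d_k, T = 8, 24, 7
x = torch.randn(T, d)          # stand-in embeddings, one row per token
W_q = torch.rand(d_k, d)
W_k = torch.rand(d_k, d)

i = 3
q_i = W_q @ x[i]               # query of token i, shape (d_k,)
K = W_k @ x.T                  # all keys as columns, shape (d_k, T)

omega_matrix = q_i @ K         # all T weights in one matrix product
omega_loop = torch.stack([q_i @ (W_k @ x[j]) for j in range(T)])  # one key at a time

print(omega_matrix.shape)
print(torch.allclose(omega_matrix, omega_loop, rtol=1e-4, atol=1e-4))
```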

### Attention scores

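Attention scores $\alpha_{i, j}$ are obtained by scaling the unnormalized attention weights $\omega_{i, j}$ and normalizing them with a softmax over all keys $j$:

$$\alpha_{i, j} = \operatorname{softmax}_j\left(\frac{\omega_{i, j}}{\sqrt{d_k}}\right) = \frac{\exp\left(\omega_{i, j} / \sqrt{d_k}\right)}{\sum_{j' = 1}^{T} \exp\left(\omega_{i, j'} / \sqrt{d_k}\right)}$$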

Scaling by $\sqrt{d_k}$ keeps the magnitude of the dot products roughly independent of the key dimension $d_k$. Without it, large dot products would push the softmax into regions with extremely small gradients, which hampers training.
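The effect of the scaling factor can be observed empirically. The sketch below assumes queries and keys whose components are independent with zero mean and unit variance. Under that assumption the standard deviation of the raw dot product grows like $\sqrt{d_k}$, while the scaled version stays close to 1.

```python
import torch

torch.manual_seed(0)
num_samples = 10_000
raw_std, scaled_std = {}, {}

for d_k in (4, 64, 1024):
    # Assumption: unit-variance, zero-mean components for queries and keys.
    q = torch.randn(num_samples, d_k)
    k = torch.randn(num_samples, d_k)
    dots = (q * k).sum(dim=1)                       # one dot product per sample
    raw_std[d_k] = dots.std().item()
    scaled_std[d_k] = (dots / d_k ** 0.5).std().item()
    print(f"d_k={d_k:5d}  raw std={raw_std[d_k]:7.2f}  scaled std={scaled_std[d_k]:5.2f}")
```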

Compute the attention scores for the fourth word:

In [18]:
# TODO: Compute alpha_4

alpha_4 = F.softmax(omega_4 / d_k ** 0.5, dim=0).detach()

### Context vectors

The final step in the attention mechanism is the computation of the context vector. It is the sum of all value vectors, weighted by the attention scores:

$$
\mathbf{z}^{(i)} = \sum\limits_{j = 1}^{T}\alpha_{i, j} \mathbf{v}^{(j)}
$$

Below, compute the context vector for the fourth word.
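One possible way to do this is sketched here in a self-contained form. It uses its own random stand-in embeddings and weights rather than the notebook's variables, and stores keys and values as columns, as in the cells above.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, d_k, d_v, T = 8, 24, 30, 7
x = torch.randn(T, d)                              # stand-in embeddings
W_q, W_k, W_v = torch.rand(d_k, d), torch.rand(d_k, d), torch.rand(d_v, d)

K = W_k @ x.T                                      # keys as columns, (d_k, T)
V = W_v @ x.T                                      # values as columns, (d_v, T)

q_4 = W_q @ x[3]                                   # query of the fourth word
alpha_4 = F.softmax(q_4 @ K / d_k ** 0.5, dim=0)   # attention scores, shape (T,)
z_4 = V @ alpha_4                                  # weighted sum of value vectors, (d_v,)

print(z_4.shape)
```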

### Putting everything together

Now that we've seen the self-attention mechanism in detail, let's implement an attention layer. Attention layers in `torch` and similar libraries don't implement the computation of the query, key, and value vectors but instead leave this to preceding layers in the neural network. As such, you only have to implement the $\operatorname{softmax}(\dots)$-part and return the context vectors.

We've already provided you with a template in the cell below. Note that `query` is a matrix of $\mathbf{q}$-vectors with one row per token, i.e. the input embedding matrix multiplied by $\mathbf{W}_q^\top$.

_If you are looking for inspiration, the [PyTorch documentation for `scaled_dot_product_attention`](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html#) might be helpful._

In [31]:
K.shape

torch.Size([24, 7])

In [30]:
K.transpose(-2, -1).shape

torch.Size([7, 24])

In [40]:
K.size(-1)

7

In [41]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledDotProductAttention(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, query, key, value, mask=None):
        # Expected shapes: query and key (..., T, d_k), value (..., T, d_v)
        # TODO: Compute the dot product of query and key
        scores = query @ key.transpose(-2, -1)

        # TODO: Scale the scores by the square root of the key dimension
        scores = scores / key.size(-1) ** 0.5

        # Optional additive mask (large negative entries block attention)
        if mask is not None:
            scores = scores + mask

        # TODO: Apply softmax to get attention weights
        attention_weights = F.softmax(scores, dim=-1)

        # TODO: Compute the weighted sum of the value vectors
        output = attention_weights @ value

        return output, attention_weights
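A quick way to validate such a layer is to compare it against PyTorch's built-in `F.scaled_dot_product_attention`, which is available from PyTorch 2.0 onwards. The sketch below re-implements the same computation as a standalone function so that it runs on its own; the name `sdpa_sketch` is hypothetical and the inputs are random.

```python
import torch
import torch.nn.functional as F


def sdpa_sketch(query, key, value):
    # Same steps as the layer above: dot product, scaling, softmax, weighted sum.
    scores = query @ key.transpose(-2, -1) / key.size(-1) ** 0.5
    weights = F.softmax(scores, dim=-1)
    return weights @ value, weights


torch.manual_seed(0)
q = torch.randn(2, 7, 24)   # (batch, tokens, d_k)
k = torch.randn(2, 7, 24)
v = torch.randn(2, 7, 30)   # (batch, tokens, d_v)

out, weights = sdpa_sketch(q, k, v)
reference = F.scaled_dot_product_attention(q, k, v)

print(out.shape, weights.shape)
print(torch.allclose(out, reference, atol=1e-4))
```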

## Multi-Head Attention

One set of weight matrices $(\mathbf{W}_q, \mathbf{W}_k, \mathbf{W}_v)$ is called an _attention head_. Each layer in a transformer actually has multiple heads. Since one attention head learns some notion of relevance, multiple attention heads allow the model to use multiple notions of relevance simultaneously. Additionally, the influence field representing "relevance" becomes narrower with increasing depth.

Interestingly, the different heads tend to learn concepts that are meaningful to humans: Some heads may attend to the next word, while others attend to the direct or indirect object in a sentence.

So, if multiple heads are so [powerful](https://arxiv.org/abs/2104.00887) (which, by the way, is somewhat [debatable](https://arxiv.org/abs/1905.10650)), how do we implement them?
It's sadly a bit underwhelming: Multi-head attention is simply multiple heads concatenated together:

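$$
\operatorname{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \operatorname{Concat}(\operatorname{head}_1, \dots, \operatorname{head}_h)\mathbf{W}^O
$$

where $h$ is the number of heads, each $\operatorname{head}_i$ is the output of scaled dot-product attention computed with the $i$-th set of weight matrices, and $\mathbf{W}^O$ is an output projection that mixes the concatenated heads. Implementation-wise, not much changes: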

In [42]:
# Number of heads
headcnt = 4

# The heads are packed in a single dimension. We could also use a 3D tensor 
# but that would be more complicated.
W_q = torch.randn(headcnt * d_q, d) / (d**0.5)
W_k = torch.randn(headcnt * d_k, d) / (d**0.5)
W_v = torch.randn(headcnt * d_v, d) / (d**0.5)

$\mathbf{K}, \mathbf{V}$ computation is the same as before:

In [43]:
K = W_k.matmul(embedded.T)
V = W_v.matmul(embedded.T)

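... and the actual attention computation also proceeds as before, just separately for each head:

In [44]:
# Split the packed dimension into (head, per-head dimension)
q_4 = W_q.matmul(x_4).view(headcnt, d_q)
K_heads = K.view(headcnt, d_k, -1)
# One row of attention scores per head
omega_4 = torch.einsum("hd,hdn->hn", q_4, K_heads)
alpha_4 = F.softmax(omega_4 / d_k ** 0.5, dim=-1).detach()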
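To connect this to the $\operatorname{Concat}(\dots)\mathbf{W}^O$ formula, here is a minimal self-contained sketch that computes every head in an explicit loop, concatenates the results, and applies an output projection. It keeps the per-head weights in 3D tensors for readability. The matrix `W_o` and its scaling are assumptions for illustration, and tokens are stored as rows.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, d, headcnt, d_k, d_v = 7, 8, 4, 24, 30
x = torch.randn(T, d)                                  # stand-in embeddings, one row per token

# One weight matrix per head, stacked along the first dimension.
W_q = torch.randn(headcnt, d_k, d) / d ** 0.5
W_k = torch.randn(headcnt, d_k, d) / d ** 0.5
W_v = torch.randn(headcnt, d_v, d) / d ** 0.5
# Output projection W^O (assumed shape: concatenated head size -> d).
W_o = torch.randn(headcnt * d_v, d) / (headcnt * d_v) ** 0.5

heads = []
for h in range(headcnt):
    Q = x @ W_q[h].T                                   # (T, d_k)
    K = x @ W_k[h].T                                   # (T, d_k)
    V = x @ W_v[h].T                                   # (T, d_v)
    alpha = F.softmax(Q @ K.T / d_k ** 0.5, dim=-1)    # (T, T) attention scores
    heads.append(alpha @ V)                            # (T, d_v) context vectors

multi_head = torch.cat(heads, dim=-1) @ W_o            # (T, headcnt * d_v) -> (T, d)
print(multi_head.shape)
```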

### Causal attention mask

The final ingredient is masking: For models such as GPT, each token can only attend to itself and to the tokens before it, thus the attention weights need to be modified before entering the softmax.
The most common way of masking is to add a large negative number to the locations that you'd not want the model to attend to.

In [45]:
attn_mask = torch.ones(n, n)
attn_mask = -1E4 * torch.triu(attn_mask,1)
attn_mask

tensor([[    -0., -10000., -10000., -10000., -10000., -10000., -10000.],
        [    -0.,     -0., -10000., -10000., -10000., -10000., -10000.],
        [    -0.,     -0.,     -0., -10000., -10000., -10000., -10000.],
        [    -0.,     -0.,     -0.,     -0., -10000., -10000., -10000.],
        [    -0.,     -0.,     -0.,     -0.,     -0., -10000., -10000.],
        [    -0.,     -0.,     -0.,     -0.,     -0.,     -0., -10000.],
        [    -0.,     -0.,     -0.,     -0.,     -0.,     -0.,     -0.]])

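Once you have a mask, you obtain masked attention by simply adding it to the scaled attention weights before the softmax:

$$
\operatorname{MaskedAttention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \operatorname{softmax}\left(M + \frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right)\mathbf{V}
$$

where $M$ is the mask and $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V}$ hold one query, key, and value vector per row.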
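The effect of the mask is easy to verify numerically. In the following sketch (random stand-in queries, keys, and values with one row per token), every attention score above the diagonal ends up at zero, so no token attends to a later one, while each row still sums to one.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, d_k, d_v = 7, 24, 30
Q = torch.randn(T, d_k)
K = torch.randn(T, d_k)
V = torch.randn(T, d_v)

# Additive causal mask: 0 on and below the diagonal, -1e4 above it.
M = -1e4 * torch.triu(torch.ones(T, T), diagonal=1)

weights = F.softmax(M + Q @ K.T / d_k ** 0.5, dim=-1)   # (T, T)
context = weights @ V                                   # (T, d_v)

print(weights)
```

Note that the first token can only attend to itself, so the first row of `weights` is a one followed by zeros.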

## Transformer Block

Having gained some intuition for attention, we can now begin to assemble our transformer. First, we need `MaskedMultiHeadAttention`.

In [46]:
import torch.nn as nn


class MaskedMultiHeadAttention(nn.Module):
    def __init__(self, num_heads, d_model):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_model = d_model
        self.head_dim = d_model // num_heads

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        # Output projection W^O applied to the concatenated heads
        self.W_o = nn.Linear(d_model, d_model)

        # Hey, look, it's your friend!
        self.attention = ScaledDotProductAttention()

    def forward(self, query, key, value):
        batch_size, seq_len = query.size(0), query.size(1)

        # Linear projections
        Q = self.W_q(query)
        K = self.W_k(key)
        V = self.W_v(value)

        # Reshape for multi-head attention: (batch, heads, tokens, head_dim)
        Q = Q.view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
        K = K.view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
        V = V.view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)

        # Masked attention: the (seq_len, seq_len) causal mask broadcasts
        # over the batch and head dimensions
        mask = torch.ones(seq_len, seq_len, device=query.device)
        mask = -1e4 * torch.triu(mask, 1)
        output, attention_weights = self.attention(Q, K, V, mask)

        # Reshape and concatenate heads, then apply the output projection
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        output = self.W_o(output)

        return output, attention_weights
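A useful property to test for any causal attention layer is that changing a later token must not change the outputs at earlier positions. The sketch below demonstrates the check with PyTorch's built-in `nn.MultiheadAttention` as a stand-in, not the class above. Its boolean `attn_mask` marks the positions that may not be attended to. The same test could be applied to your own layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len, d_model, num_heads = 2, 7, 8, 4

# Built-in layer used as a stand-in for a hand-written masked attention layer.
mha = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

# Boolean causal mask: True marks key positions a query may NOT attend to.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

x = torch.randn(batch_size, seq_len, d_model)
x_changed = x.clone()
x_changed[:, -1] += 1.0   # perturb only the last token

out, weights = mha(x, x, x, attn_mask=causal_mask)
out_changed, _ = mha(x_changed, x_changed, x_changed, attn_mask=causal_mask)

print(weights.shape)      # attention weights, averaged over heads
print(torch.allclose(out[:, :-1], out_changed[:, :-1], atol=1e-5))
```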

Below you see a schematic of GPT2. The block that is repeated twelve times is a transformer block.

![GPT2 Transformer Block](imgs/gpt_transformer_block.png)

We have prepared a skeleton for your GPT2 transformer block. Implement the block!

In [47]:
import torch.nn as nn

class GPT2TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(GPT2TransformerBlock, self).__init__()

        # TODO: Your layer definitions here
        self.self_attention = MaskedMultiHeadAttention(num_heads, d_model)

        self.norm1 = nn.LayerNorm(d_model)

        # Don't change this, we already implemented it for you
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(approximate="tanh"),  # GPT-2 uses a tanh-approximated GELU rather than ReLU
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout),
        )

        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # TODO: Self-attention (GPT-2 applies LayerNorm before each sub-layer)
        h = self.norm1(x)
        attn_output, _ = self.self_attention(h, h, h)
        x = x + attn_output

        # TODO: Feed-forward
        ff_output = self.feed_forward(self.norm2(x))
        x = x + ff_output

        return x
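To illustrate how such blocks are repeated, here is a rough, self-contained sketch of a twelve-block stack. `SketchBlock` and `SketchGPT` are hypothetical names, and the block uses PyTorch's built-in `nn.MultiheadAttention` as a stand-in for the layers written above. The learned position embeddings and the final layer norm are assumptions modelled on GPT-2 and are not covered in this notebook. All sizes are toy values.

```python
import torch
import torch.nn as nn


class SketchBlock(nn.Module):
    """Hypothetical stand-in for one transformer block (pre-LayerNorm layout)."""

    def __init__(self, d_model, num_heads, d_ff):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        seq_len = x.size(1)
        # Boolean causal mask: True marks positions that may not be attended to.
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out                      # residual connection around attention
        return x + self.ff(self.norm2(x))     # residual connection around feed-forward


class SketchGPT(nn.Module):
    """Hypothetical skeleton: embeddings, a stack of blocks, and a final LayerNorm."""

    def __init__(self, vocab_size, max_len, d_model, num_heads, d_ff, num_layers):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)   # learned positions (assumption)
        self.blocks = nn.ModuleList(
            [SketchBlock(d_model, num_heads, d_ff) for _ in range(num_layers)]
        )
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, token_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.token_emb(token_ids) + self.pos_emb(positions)
        for block in self.blocks:
            x = block(x)
        return self.final_norm(x)


torch.manual_seed(0)
model = SketchGPT(vocab_size=7, max_len=16, d_model=8, num_heads=4, d_ff=32, num_layers=12)
hidden = model(torch.tensor([[0, 4, 6, 5, 3, 1, 2]]))
print(hidden.shape)   # (batch, tokens, d_model)
```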

## Bonus Task

You have implemented all the components required for GPT2 - but you can go further.
Here's how:

- Use the `transformers` library to download weights for GPT2.
- Construct a GPT2 model using your blocks.
- Copy the downloaded GPT2 weights into your own implementation.
- Congrats, you now have a working GPT2 :) (minus the output layer... 🤫)
