We've learned about the multi-head attention mechanism. We will now learn to code the other building blocks of an LLM and assemble them into a GPT-like model.

We start with a simplified, placeholder version of a GPT-like model. It consists of token and positional embeddings, dropout, a series of transformer blocks, a final layer normalisation, and a linear output layer. The configuration is passed in as a Python dictionary.

In [1]:
GPT_CONFIG_124M = {
    "vocab_size": 50257,    # num of tokens in the BPE tokeniser's vocabulary
    "context_length": 1024, # max num of input tokens model can handle
    "emb_dim": 768,         # embedding size: each token -> 768-d vector
    "n_heads": 12,          # num of heads in multi-head attention mechanism
    "n_layers": 12,         # num of transformer blocks in the model
    "drop_rate": 0.1,       # 10% random dropout of hidden units
    "qkv_bias": False       # whether the query, key and value linear layers use a bias vector
}
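
The numbers in this configuration already fix the sizes of several components. The following is a minimal sketch of that arithmetic in plain Python. It assumes that the embedding dimension is split evenly across the attention heads, and the variable names (`head_dim`, `tok_emb_weights`, `pos_emb_weights`) are introduced here for illustration only.

```python
# Configuration copied from the cell above so that this sketch runs on its own
GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False,
}
cfg = GPT_CONFIG_124M

# Assumption: emb_dim is divided evenly among the attention heads
assert cfg["emb_dim"] % cfg["n_heads"] == 0
head_dim = cfg["emb_dim"] // cfg["n_heads"]

# An embedding table stores one emb_dim-sized vector per row
tok_emb_weights = cfg["vocab_size"] * cfg["emb_dim"]      # one row per token ID
pos_emb_weights = cfg["context_length"] * cfg["emb_dim"]  # one row per position

print("Dimensions per attention head:", head_dim)
print(f"Token embedding weights:      {tok_emb_weights:,}")
print(f"Positional embedding weights: {pos_emb_weights:,}")
```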

In [4]:
import torch
import torch.nn as nn

class DummyGPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])

        self.trf_blocks = nn.Sequential(
            *[DummyTransformerBlock(cfg) # placeholder for transformer block
              for _ in range(cfg["n_layers"])]
        )

        self.final_norm = DummyLayerNorm(cfg["emb_dim"]) # placeholder for layernorm
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)

        pos_embeds = self.pos_emb(
            torch.arange(seq_len, device=in_idx.device)
        )
        x = tok_embeds + pos_embeds
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits

In [3]:
class DummyTransformerBlock(nn.Module): # a placeholder to be replaced later
    def __init__(self, cfg):
        super().__init__()

    def forward(self, x):
        return x # block does nothing but return its input
    

class DummyLayerNorm(nn.Module): # a placeholder to be replaced later
    def __init__(self, normalised_shape, eps=1e-5):
        super().__init__()

    def forward(self, x):
        return x

The forward method describes the data flow through the model: it computes token and positional embeddings for the input indices, applies dropout, processes the data through the transformer blocks, applies normalisation, and finally produces logits with the linear output layer.
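
To make the tensor shapes in this data flow easier to follow, here is a small sketch that repeats the embedding and output steps of the forward method on their own. The toy sizes (a vocabulary of 100, a context length of 8, and 16 embedding dimensions) are assumptions chosen for readability, not the GPT-2 values, and the transformer blocks, dropout, and normalisation are left out.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes, chosen for illustration only
vocab_size, context_length, emb_dim = 100, 8, 16

tok_emb = nn.Embedding(vocab_size, emb_dim)
pos_emb = nn.Embedding(context_length, emb_dim)
out_head = nn.Linear(emb_dim, vocab_size, bias=False)

# Two sequences of four token IDs each: shape (batch_size, seq_len)
in_idx = torch.tensor([[1, 2, 3, 4],
                       [5, 6, 7, 8]])
batch_size, seq_len = in_idx.shape

tok_embeds = tok_emb(in_idx)                  # (batch_size, seq_len, emb_dim)
pos_embeds = pos_emb(torch.arange(seq_len))   # (seq_len, emb_dim)

# Broadcasting adds the same positional vectors to every sequence in the batch
x = tok_embeds + pos_embeds                   # (batch_size, seq_len, emb_dim)
logits = out_head(x)                          # (batch_size, seq_len, vocab_size)

print("tok_embeds:", tuple(tok_embeds.shape))
print("pos_embeds:", tuple(pos_embeds.shape))
print("x:         ", tuple(x.shape))
print("logits:    ", tuple(logits.shape))
```

Because the positional embeddings have no batch dimension, the addition relies on broadcasting, so each sequence in the batch receives identical positional vectors.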

Now, we prepare the input data and initialise a new GPT model to illustrate its usage.

In [5]:
import tiktoken

tokeniser = tiktoken.get_encoding("gpt2")
batch = []
txt1 = "Every effort moves you"
txt2 = "Every day holds a"

batch.append(torch.tensor(tokeniser.encode(txt1)))
batch.append(torch.tensor(tokeniser.encode(txt2)))
batch = torch.stack(batch, dim=0)
print(batch)

tensor([[6109, 3626, 6100,  345],
        [6109, 1110, 6622,  257]])


Now we initialise a new DummyGPTModel instance with the 124-million-parameter configuration and feed it the tokenised batch.

In [6]:
torch.manual_seed(123)
model = DummyGPTModel(GPT_CONFIG_124M)
logits = model(batch)
print("Output shape:", logits.shape)
print(logits)

Output shape: torch.Size([2, 4, 50257])
tensor([[[-0.9289,  0.2748, -0.7557,  ..., -1.6070,  0.2702, -0.5888],
         [-0.4476,  0.1726,  0.5354,  ..., -0.3932,  1.5285,  0.8557],
         [ 0.5680,  1.6053, -0.2155,  ...,  1.1624,  0.1380,  0.7425],
         [ 0.0447,  2.4787, -0.8843,  ...,  1.3219, -0.0864, -0.5856]],

        [[-1.5474, -0.0542, -1.0571,  ..., -1.8061, -0.4494, -0.6747],
         [-0.8422,  0.8243, -0.1098,  ..., -0.1434,  0.2079,  1.2046],
         [ 0.1355,  1.1858, -0.1453,  ...,  0.0869, -0.1590,  0.1552],
         [ 0.1666, -0.8138,  0.2307,  ...,  2.5035, -0.3055, -0.3083]]],
       grad_fn=<UnsafeViewBackward0>)


The output tensor has two rows, one per text sample. Each sample consists of 4 tokens, and each token is represented by a 50,257-dimensional vector of logits, which matches the size of the tokeniser's vocabulary. Each of these dimensions corresponds to a unique token in the vocabulary. Later, we will convert these vectors back into token IDs, which we can then decode into words.
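
As a rough preview of that conversion, the sketch below shows one simple option: picking, for every position, the index of the largest logit along the vocabulary dimension (greedy selection). The toy vocabulary size of 6 and the random logits are assumptions for illustration, and this is not necessarily the decoding method used later.

```python
import torch

torch.manual_seed(0)

# Stand-in for model output: (batch_size=2, num_tokens=4, toy vocab_size=6)
toy_logits = torch.randn(2, 4, 6)

# Softmax turns each logit vector into a probability distribution
probas = torch.softmax(toy_logits, dim=-1)

# The position of the largest entry is the predicted token ID
token_ids = torch.argmax(probas, dim=-1)   # (batch_size, num_tokens)

print("Token IDs shape:", tuple(token_ids.shape))
print(token_ids)
```

Softmax preserves the ordering of the logits, so taking the argmax of the raw logits would select the same token IDs.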

### Normalising activations with layer normalisation

Training deep neural networks with many layers can prove challenging due to problems like vanishing or exploding gradients. These problems lead to unstable training dynamics and make it difficult for the network to adjust its weights effectively, so the learning process struggles to find a set of parameters (weights) that minimises the loss function. In other words, the network has difficulty learning the underlying patterns in the data, which impairs its ability to make accurate predictions.

<i>Layer normalisation</i> is designed to improve the stability and efficiency of neural network training. The idea is to adjust the activations (outputs) of a neural network layer to have a mean of 0 and a variance of 1 (unit variance). This speeds up convergence to effective weights and makes training more consistent and reliable. In GPT-2, layer normalisation is typically applied before and after the multi-head attention module and before the final output layer.
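
Written out for a single vector of $d$ activations $x_1, \dots, x_d$, the adjustment can be sketched as follows. The symbols are introduced here for illustration and do not appear in the text above.

$$
\mu = \frac{1}{d}\sum_{i=1}^{d} x_i, \qquad
\sigma^2 = \frac{1}{d}\sum_{i=1}^{d} (x_i - \mu)^2, \qquad
\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}
$$

Here $\epsilon$ is a small constant that guards against division by zero; the `eps` argument of the DummyLayerNorm placeholder presumably plays this role. Dividing by $d$ in the variance is one convention, and dividing by $d - 1$ (the sample variance) is another; for large $d$ the difference is negligible.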

In [7]:
torch.manual_seed(123)
batch_example = torch.rand(2, 5) # 2 training examples with 5 dimensions each (features)
layer = nn.Sequential(nn.Linear(5, 6), nn.ReLU())
out = layer(batch_example)
print(out)

tensor([[0.0000, 0.0000, 0.4091, 0.6587, 0.3914, 0.0000],
        [0.0000, 0.0000, 0.1902, 0.3182, 0.6486, 0.0000]],
       grad_fn=<ReluBackward0>)


In [8]:
mean = out.mean(dim=-1, keepdim=True)
var = out.var(dim=-1, keepdim=True)
print("Mean:\n", mean)
print("Variance:\n", var)

Mean:
 tensor([[0.2432],
        [0.1928]], grad_fn=<MeanBackward1>)
Variance:
 tensor([[0.0799],
        [0.0670]], grad_fn=<VarBackward0>)


As can be seen, the means are above 0 and the variances are far from 1. Using keepdim=True ensures that the output tensor retains the same number of dimensions as the input tensor, even though the operation (here, mean and variance) reduces the tensor along the dimension specified via dim. Without it, the returned mean tensor would be a one-dimensional vector with 2 elements instead of a 2x1 matrix. The dim parameter specifies the dimension along which the calculation is performed. For this two-dimensional tensor, dim=-1 is equivalent to dim=1: it calculates the mean across the columns to obtain one mean per row, whereas dim=0 would calculate it across the rows to obtain one mean per column.
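
The effect of keepdim and dim can be checked directly, and the resulting mean and variance can then be used to normalise each row. The sketch below hard-codes the activations printed above so that it runs on its own; the names `mean_keep`, `mean_flat`, `mean_cols`, and `out_norm` are introduced here for illustration. It omits the small epsilon term because the variances are clearly non-zero in this example.

```python
import torch

# Activations copied from the printed output of the layer above
out = torch.tensor([[0.0000, 0.0000, 0.4091, 0.6587, 0.3914, 0.0000],
                    [0.0000, 0.0000, 0.1902, 0.3182, 0.6486, 0.0000]])

mean_keep = out.mean(dim=-1, keepdim=True)  # one mean per row, kept as a column
mean_flat = out.mean(dim=-1)                # same values, reduced dimension dropped
mean_cols = out.mean(dim=0)                 # one mean per column

print("keepdim=True: ", tuple(mean_keep.shape))
print("keepdim=False:", tuple(mean_flat.shape))
print("dim=0:        ", tuple(mean_cols.shape))

# Normalise each row: subtract its mean, divide by its standard deviation.
# keepdim=True makes the (2, 1) statistics broadcast against the (2, 6) input.
var = out.var(dim=-1, keepdim=True)
out_norm = (out - mean_keep) / torch.sqrt(var)

print("Mean after normalisation:\n", out_norm.mean(dim=-1, keepdim=True))
print("Variance after normalisation:\n", out_norm.var(dim=-1, keepdim=True))
```

The printed means will typically not be exactly 0 but tiny values close to it, which is a consequence of finite floating-point precision.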