With the individual attention heads implemented and combined into a multi-head attention mechanism, the next step is to build the surrounding transformer architecture that makes up the GPT model. The book starts with a dummy model that has no real functionality yet, just to show how a GPT model is structured. The input text is first tokenized, then embedded, then passed through a stack of transformer blocks (each built around multi-head attention) and a linear output layer, and the result is finally decoded back into tokens.

In [60]:
GPT_CONFIG_124M = {
    "vocab_size": 50257,    # Vocabulary size
    "context_length": 1024, # Context length
    "emb_dim": 768,         # Embedding dimension
    "n_heads": 12,          # Number of attention heads
    "n_layers": 12,         # Number of layers
    "drop_rate": 0.1,       # Dropout rate
    "qkv_bias": False       # Query-Key-Value bias
}

"vocab_size" indicates a vocabulary size of 50,257 tokens, as used by the BPE tokenizer discussed in Chapter 2
"context_length" represents the model's maximum input token count, as enabled by positional embeddings covered in Chapter 2
"emb_dim" is the embedding size for token inputs, converting each input token into a 768-dimensional vector
"n_heads" is the number of attention heads in the multi-head attention mechanism implemented in Chapter 3
"n_layers" is the number of transformer blocks within the model, which we'll implement in upcoming sections
"drop_rate" is the dropout mechanism's intensity, discussed in Chapter 3; 0.1 means dropping 10% of hidden units during training to mitigate overfitting
"qkv_bias" decides if the Linear layers in the multi-head attention mechanism (from Chapter 3) should include a bias vector when computing query (Q), key (K), and value (V) tensors; we'll disable this option, which is standard practice in modern LLMs
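To get a feel for what these settings imply, here is a small sketch that derives a few sizes from the config. The per-head dimension assumes the multi-head attention splits `emb_dim` evenly across the heads (an assumption about the Chapter 3 implementation), and the variable names `head_dim`, `tok_params`, and `pos_params` are made up for this illustration.

```python
GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False,
}
cfg = GPT_CONFIG_124M

# Assumption: the embedding dimension is split evenly across the heads
head_dim = cfg["emb_dim"] // cfg["n_heads"]

# Number of weights in the two embedding tables (rows * columns)
tok_params = cfg["vocab_size"] * cfg["emb_dim"]
pos_params = cfg["context_length"] * cfg["emb_dim"]

print("Dimension per head:", head_dim)                  # 64
print(f"Token embedding weights: {tok_params:,}")       # 38,597,376
print(f"Positional embedding weights: {pos_params:,}")  # 786,432
```

The token embedding table alone accounts for roughly 38.6 million weights, so the vocabulary size has a large effect on the total parameter count.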

In [61]:
import torch
import torch.nn as nn
import tiktoken

In [62]:
class DummyGPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.token_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.position_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])
        self.transformer_blocks = nn.Sequential(
            *[DummyTransformerBlocks(cfg) for _ in range(cfg["n_layers"])])
        
        self.final_norm = DummyLayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias = False
        )
        
    def forward(self, index):
        batch_size, seq_len = index.shape
        token_embeds = self.token_emb(index)
        position_embeds = self.position_emb(torch.arange(seq_len, device = index.device))
        x = token_embeds + position_embeds
        x = self.drop_emb(x)
        x = self.transformer_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits
    
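class DummyTransformerBlocks(nn.Module):
    # Placeholder: will be replaced by a real transformer block later
    def __init__(self, cfg):
        super().__init__()
        
    def forward(self, x):
        # Does nothing yet; returns the input unchanged
        return x
    
class DummyLayerNorm(nn.Module):
    # Placeholder: the parameters only mimic the interface of a real layer norm
    def __init__(self, normalized_shape, eps = 1e-5):
        super().__init__()
        
    def forward(self, x):
        # Does nothing yet; returns the input unchanged
        return x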
            

The cell above contains the skeleton of a dummy GPT model. The transformer block and normalization layer classes are only placeholders for now and return their input unchanged; they will be implemented step by step in the following sections.
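Since the placeholder blocks do nothing, the only real work in the forward pass so far happens in the embedding step. The sketch below isolates that step with small, made-up sizes (not the GPT-2 values) to show how the shapes line up when the positional embeddings are added to the token embeddings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes chosen only for illustration
vocab_size, context_length, emb_dim = 10, 4, 8

token_emb = nn.Embedding(vocab_size, emb_dim)
position_emb = nn.Embedding(context_length, emb_dim)

# Two sequences of three token IDs each: shape (batch_size=2, seq_len=3)
index = torch.tensor([[1, 5, 2], [7, 0, 3]])
batch_size, seq_len = index.shape

token_embeds = token_emb(index)                        # shape (2, 3, 8)
position_embeds = position_emb(torch.arange(seq_len))  # shape (3, 8)

# Broadcasting adds the same position vectors to every sequence in the batch
x = token_embeds + position_embeds                     # shape (2, 3, 8)

print(token_embeds.shape, position_embeds.shape, x.shape)
```

The positional embeddings have no batch dimension, so the same three position vectors are added to both sequences.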

In [63]:
tokenizer = tiktoken.get_encoding('gpt2')
batch = []

text1 = "Life is a"
text2 = "always believe in"

batch.append(torch.tensor(tokenizer.encode(text1)))
batch.append(torch.tensor(tokenizer.encode(text2)))
batch = torch.stack(batch, dim=0)
print(batch)

tensor([[14662,   318,   257],
        [33770,  1975,   287]])


In [64]:
torch.manual_seed(123)
model = DummyGPTModel(GPT_CONFIG_124M)

logits = model(batch)
print("Output shape:", logits.shape)
print(logits)

Output shape: torch.Size([2, 3, 50257])
tensor([[[-0.4662,  0.3252, -0.0501,  ..., -0.4346,  0.7034,  0.3292],
         [-0.3497, -0.1606, -0.4492,  ..., -0.5948,  0.3007,  0.0621],
         [ 0.0887, -0.9600,  0.1678,  ...,  0.6472,  0.1326, -0.1579]],

        [[-0.4402,  0.3932, -1.4061,  ..., -0.2005,  0.6720, -0.8172],
         [ 0.2087,  0.6763, -0.2571,  ...,  0.1412, -0.1501, -0.0122],
         [-0.1937,  2.7545, -0.0680,  ...,  0.3875, -0.0067,  0.7153]]],
       grad_fn=<UnsafeViewBackward0>)
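
The output has one row of 50,257 logits for each of the 3 tokens in each of the 2 input texts. As a rough sketch of the "decoded back to tokens" step, the snippet below picks the most likely next token ID from the last position of each sequence. It uses a random tensor as a stand-in for the model output, and greedy `argmax` selection is just one simple choice assumed here; with an untrained model the picked tokens would be meaningless anyway.

```python
import torch

torch.manual_seed(123)

vocab_size = 50257
# Stand-in for the model output: (batch_size=2, seq_len=3, vocab_size)
logits = torch.randn(2, 3, vocab_size)

# Only the last position predicts the token that follows the input
last_logits = logits[:, -1, :]                         # shape (2, 50257)

# Softmax turns logits into probabilities over the vocabulary
probas = torch.softmax(last_logits, dim=-1)

# Greedy choice: the ID with the highest probability for each sequence
next_ids = torch.argmax(probas, dim=-1, keepdim=True)  # shape (2, 1)

print(next_ids.shape)
```

The resulting IDs could then be turned back into text with the tokenizer's `decode` method.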


With the dummy skeleton of a GPT model in place, let's create the normalization layer. The role of normalization is to stabilize training and enable faster convergence of the weights. Applying normalization around the attention layers shifts the mean of the outputs to 0 and scales their variance to 1, which keeps the activations on a consistent scale from one layer to the next and so reduces the number of places where training can go wrong.

In [65]:
torch.manual_seed(123)
batch_example = torch.randn(2,6)
layer = nn.Sequential(nn.Linear(6,8), nn.ReLU())
out = layer(batch_example)
out

tensor([[0.0000, 0.0000, 0.3391, 0.1583, 0.0000, 0.0000, 0.4158, 0.0176],
        [0.0000, 0.0000, 0.5672, 0.0000, 0.0000, 0.0000, 0.0000, 0.2426]],
       grad_fn=<ReluBackward0>)

In [66]:
mean = out.mean(dim=-1, keepdim=True)
var = out.var(dim=-1, keepdim=True)
print(mean)
print(var)

tensor([[0.1163],
        [0.1012]], grad_fn=<MeanBackward1>)
tensor([[0.0293],
        [0.0427]], grad_fn=<VarBackward0>)


In [67]:
out_norm = (out - mean) / torch.sqrt(var)
print(out_norm)
mean = out_norm.mean(dim=-1, keepdim=True)
var = out_norm.var(dim=-1, keepdim=True)
torch.set_printoptions(sci_mode=False)
print(mean)
print(var)

tensor([[-0.6800, -0.6800,  1.3018,  0.2451, -0.6800, -0.6800,  1.7501, -0.5771],
        [-0.4901, -0.4901,  2.2562, -0.4901, -0.4901, -0.4901, -0.4901,  0.6843]],
       grad_fn=<DivBackward0>)
tensor([[    -0.0000],
        [     0.0000]], grad_fn=<MeanBackward1>)
tensor([[1.],
        [1.]], grad_fn=<VarBackward0>)


In [68]:
class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))
        
    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim = True, unbiased = False)
        norm_x = (x-mean)/torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift

Some features we have used here:
eps - If the variance of a row is 0, normalizing would mean dividing by zero, so a very small epsilon is added to the variance to keep the result defined.
scale and shift - These are trainable parameters, initialized to 1 and 0, that the model adjusts during training if rescaling or shifting the normalized values improves its performance. After training, the output of the layer therefore does not have to keep a mean of exactly 0 and a variance of exactly 1.
unbiased=False - The variance is computed by dividing by n instead of n - 1, so no Bessel's correction is applied.
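
The effect of these two features can be checked on tiny hand-made inputs. This is only a sketch: the constant row is chosen so that the variance is exactly 0, and the `scale` and `shift` values are hypothetical stand-ins for values a model might learn.

```python
import torch

# A constant row has a variance of exactly 0
x = torch.tensor([[2.0, 2.0, 2.0, 2.0]])
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)

no_eps = (x - mean) / torch.sqrt(var)           # 0 / 0 gives nan
with_eps = (x - mean) / torch.sqrt(var + 1e-5)  # 0 / small number gives 0
print(no_eps)    # tensor([[nan, nan, nan, nan]])
print(with_eps)  # tensor([[0., 0., 0., 0.]])

# A row that is already normalized: mean 0, variance 1
norm_x = torch.tensor([[-1.0, 1.0, -1.0, 1.0]])

# Hypothetical learned values (they start out as 1 and 0)
scale, shift = 2.0, 1.0
y = scale * norm_x + shift                      # [-1, 3, -1, 3]

print(y.mean().item())                # 1.0
print(y.var(unbiased=False).item())   # 4.0
```

The second part shows that scale and shift move the output away from mean 0 and variance 1 whenever they differ from their initial values.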

In [69]:
ln = LayerNorm(emb_dim=6)
out_ln = ln(batch_example)
mean = out_ln.mean(dim=-1, keepdim=True)
var = out_ln.var(dim=-1, unbiased=False, keepdim=True)

print(mean)
print(var)

tensor([[    0.0000],
        [    0.0000]], grad_fn=<MeanBackward1>)
tensor([[1.0000],
        [1.0000]], grad_fn=<VarBackward0>)
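
As a sanity check, the custom layer can be compared against PyTorch's built-in `nn.LayerNorm`. The sketch below repeats the class from above so that it runs on its own. It assumes the built-in layer is used with its default affine parameters (weight 1, bias 0) and the same `eps`. It also shows what happens if the variance of the output is measured with the default unbiased estimator.

```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    # Same implementation as in the cell above
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift

torch.manual_seed(123)
batch_example = torch.randn(2, 6)

custom = LayerNorm(emb_dim=6)
builtin = nn.LayerNorm(6, eps=1e-5)

out_custom = custom(batch_example)
out_builtin = builtin(batch_example)

# Both layers should agree up to floating-point error
print(torch.allclose(out_custom, out_builtin, atol=1e-5))

# With the unbiased estimator (divide by n - 1) the variance is about
# n / (n - 1) = 6 / 5 = 1.2 instead of 1
var_unbiased = out_custom.var(dim=-1, keepdim=True)
print(var_unbiased)
```

This is why the check above passes `unbiased=False`: it measures the variance the same way the layer computes it.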
