# Chapter 3: Building the Full GPT Model

<div class="alert alert-block alert-success">

In the previous chapters, we built the core components of a transformer, most importantly the `MultiHeadAttention` module. Now, it's time to assemble these pieces into a complete, functional GPT model.

Our approach will be "top-down." First, we will define the high-level architecture of the entire model using simple placeholders for the complex parts. This will help us understand the overall data flow. 
Then, in the following sections, we will build the real components to replace these placeholders one by one.
</div>

## 3.1 Imports and Setup

<div class="alert alert-block alert-success">
As always, we begin by importing the necessary libraries.
</div>

In [6]:
# Standard library and third-party imports
import torch
import torch.nn as nn
import tiktoken

## 3.2 Model Configuration

<div class="alert alert-block alert-success">
First, let's define a configuration dictionary that will hold all the hyperparameters for our model. We will base this on the parameters of the original GPT-2 "small" model, which has 124 million parameters.
</div>

In [24]:
GPT_CONFIG_124M = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "emb_dim": 768,          # Embedding dimension
    "n_heads": 12,           # Number of attention heads
    "n_layers": 12,          # Number of layers
    "dropout_rate": 0.1,     # Dropout rate
    "qkv_bias": False        # Query-Key-Value bias in attention
}
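
One quantity the configuration implies but does not list is the width of each attention head. The sketch below derives it, assuming (as is usual for multi-head attention) that the embedding dimension is split evenly across the heads. The name `head_dim` is illustrative and is not a key in the configuration.

```python
# Config repeated here so the snippet runs on its own
GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "dropout_rate": 0.1,
    "qkv_bias": False,
}

# Assumption: every head receives an equal slice of the embedding
assert GPT_CONFIG_124M["emb_dim"] % GPT_CONFIG_124M["n_heads"] == 0

# Hypothetical helper value, not part of the original config
head_dim = GPT_CONFIG_124M["emb_dim"] // GPT_CONFIG_124M["n_heads"]
print("Per-head dimension:", head_dim)  # 768 // 12 = 64
```

If the divisibility check fails for a modified configuration, the even split assumed here no longer applies.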

## 3.3 The GPT Architecture I: A High Level View

<div class="alert alert-block alert-success">
<div class="alert alert-block alert-info">
    
  <b>A "Top-Down" Approach</b><br>
    
Before we assemble the final, complex components of our GPT model, it's helpful to first understand the big picture. In this section, we'll adopt a **"top-down"** design approach by building a high-level **"skeleton"** of the entire model.
</div>
    

We will use simple dummy classes as placeholders for the internal machinery, like the `TransformerBlock` and `LayerNorm`. This strategy allows us to focus on the overall architecture and trace the journey of a tensor as it flows from the input embedding layer to the final output logits, ensuring our high-level structure is correct before we dive into implementing the details.
</div>

### 3.3.1 Defining the Model Skeleton

<div class="alert alert-block alert-success">

The code below defines our `DummyGPTModel`. It includes the complete end-to-end structure of the network. Notice the placeholder classes, `DummyTransformerBlock` and `DummyLayerNorm`, which are designed to simply pass the input through without any changes for now.
</div>

In [43]:
class DummyGPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["dropout_rate"])

        # Use a placeholder for the stack of Transformer Blocks
        self.trf_blocks = nn.Sequential(
            *[DummyTransformerBlock(cfg) for _ in range(cfg["n_layers"])]
        )

        # Use a placeholder for the final LayerNorm
        self.final_norm = DummyLayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
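        )

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        # Look up one vector per token ID: [batch, seq_len, emb_dim]
        tok_embeds = self.tok_emb(in_idx)
        # One vector per position 0..seq_len-1: [seq_len, emb_dim]
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds  # Broadcasts over the batch dimension
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)  # [batch, seq_len, vocab_size]
        return logits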


# --- Placeholder Classes ---

class DummyTransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
    def forward(self, x):
        # This block does nothing and just returns its input
        return x

class DummyLayerNorm(nn.Module):
    def __init__(self, normalized_shape, eps=1e-5):
        super().__init__()
    def forward(self, x):
        # This layer also does nothing and just returns its input
        return x
    

<div class="alert alert-block alert-warning">

The `DummyGPTModel` defines the complete, high-level architecture. Let's analyze the `forward` method to understand the data flow:

1. **Embedding:** The input token IDs are converted into token embeddings. In parallel, a positional embedding is looked up for each token based on its position in the sequence. The two embeddings are added together so that each token carries both semantic meaning and positional context.

2. **Transformer Blocks:** The resulting embeddings are processed through a stack of `n_layers` Transformer blocks. In this dummy version, these blocks don't do anything yet, but this is where the self-attention and feed-forward logic will go.

3. **Final Normalization:** A final `LayerNorm` is applied after the Transformer Blocks.

4. **Output Head:** A final linear layer projects the output of the Transformer blocks back into vocabulary space, producing raw, unnormalized scores (logits) for every possible token in the vocabulary.
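
The four steps above can be traced with a minimal sketch that tracks only tensor shapes. The toy sizes (`vocab_size=100`, `context_length=8`, `emb_dim=16`) are assumptions chosen so the snippet runs instantly; they are not the GPT-2 values. The Transformer blocks and the final normalization are skipped here because the placeholders return their input unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy sizes, chosen only to keep the trace fast
vocab_size, context_length, emb_dim = 100, 8, 16

tok_emb = nn.Embedding(vocab_size, emb_dim)
pos_emb = nn.Embedding(context_length, emb_dim)
out_head = nn.Linear(emb_dim, vocab_size, bias=False)

in_idx = torch.randint(0, vocab_size, (2, 4))        # [batch=2, seq_len=4]
tok_embeds = tok_emb(in_idx)                          # [2, 4, 16]
pos_embeds = pos_emb(torch.arange(in_idx.shape[1]))   # [4, 16]
x = tok_embeds + pos_embeds                           # broadcast -> [2, 4, 16]
logits = out_head(x)                                  # [2, 4, 100]

print("Token embeddings:     ", tuple(tok_embeds.shape))
print("Positional embeddings:", tuple(pos_embeds.shape))
print("Summed embeddings:    ", tuple(x.shape))
print("Logits:               ", tuple(logits.shape))
```

The positional embeddings have no batch dimension; the addition broadcasts the same `[seq_len, emb_dim]` tensor across every example in the batch.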

Now that we have this high-level skeleton, the next sections will focus on building the real, functional versions of `LayerNorm` and `TransformerBlock` to replace these placeholders.

</div>

### 3.3.2 Testing the GPT Model Skeleton

<div class="alert alert-block alert-success">
    
Now that we have the high-level skeleton of our model, let's run a quick test to ensure the data flows through it correctly. We will perform two steps:
    
1.  First, we'll tokenize two sample sentences and create an input batch.
2.  Second, we'll pass this batch through our `DummyGPTModel` and verify that the output tensor has the shape we expect.
</div>

In [49]:
tokenizer = tiktoken.get_encoding("gpt2")
batch = []
txt1 = "Every effort moves you"
txt2 = "Every day holds a"
batch.append(torch.tensor(tokenizer.encode(txt1)))
batch.append(torch.tensor(tokenizer.encode(txt2)))
batch = torch.stack(batch, dim=0)
print("Input shape:", batch.shape)
print(batch)

Input shape: torch.Size([2, 4])
tensor([[6109, 3626, 6100,  345],
        [6109, 1110, 6622,  257]])


In [50]:
# Create an instance of the DummyGPTModel
torch.manual_seed(100)
model = DummyGPTModel(GPT_CONFIG_124M)

# Pass the batch through the model
logits = model(batch)

# Print the shape of the output
print("Output shape:", logits.shape)
print(logits)

Output shape: torch.Size([2, 4, 50257])
tensor([[[-1.1073, -0.3112,  0.4756,  ..., -0.0372,  1.4159,  0.0902],
         [-0.8873,  0.4433,  0.1565,  ..., -0.2307,  0.2448, -0.8779],
         [-2.3526, -0.9504, -0.7646,  ...,  0.7654,  0.2491,  0.3775],
         [ 0.5979,  1.0420, -1.5600,  ...,  0.4073, -0.0435, -0.3111]],

        [[-0.7754, -0.3975,  0.8207,  ..., -0.4550,  1.2873, -0.1645],
         [-1.2486, -0.1166,  0.6246,  ...,  0.5602,  0.8671, -0.6983],
         [-0.0773, -0.1393, -0.3316,  ..., -0.0703,  1.7652, -0.1684],
         [-0.1209,  0.4775,  0.3865,  ...,  0.4667, -1.6924, -0.4930]]],
       grad_fn=<UnsafeViewBackward0>)


### 3.3.3 Analyzing the Model's Output

<div class="alert alert-block alert-success">

The output of our model is a tensor of **logits** with the shape `[2, 4, 50257]`. Let's break down what each dimension represents:

* **`2` (Batch Size):** This corresponds to the two input sentences we provided ("Every effort moves you" and "Every day holds a").
* **`4` (Sequence Length):** This corresponds to the four tokens in each input sentence.
* **`50257` (Vocabulary Size):** For each of the 4 tokens in each sentence, the model has produced a vector of 50,257 raw, unnormalized scores. Each score corresponds to a unique token in the vocabulary.

These logits represent the model's raw "prediction" for the **next** token in the sequence at each position. Later, we will pass these logits through a softmax function to turn them into probabilities and generate text.
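
As a rough preview of that step, the sketch below applies softmax to a random tensor that stands in for the model's logits. Picking the highest-probability token with `argmax` is just one simple decoding choice, used here for illustration. Because the stand-in values are random, the selected token IDs carry no meaning.

```python
import torch

torch.manual_seed(0)

# Random stand-in for model output: [batch, seq_len, vocab_size]
logits = torch.randn(2, 4, 50257)

# Only the last position predicts the token that follows the input
last_logits = logits[:, -1, :]                 # [2, 50257]

# Softmax turns raw scores into a probability distribution per example
probs = torch.softmax(last_logits, dim=-1)     # [2, 50257]

# Illustrative greedy choice: take the most probable token ID
next_ids = torch.argmax(probs, dim=-1)         # [2]

print("Probabilities sum to:", probs.sum(dim=-1))
print("Next-token IDs shape:", tuple(next_ids.shape))
```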

<div class="alert alert-block alert-info">
    
Now that we have successfully taken a top-down look at the GPT architecture and verified its data flow, we will begin coding the real components, starting with the `LayerNorm` class to replace `DummyLayerNorm`.
</div>    

</div>

## 3.4 The GPT Architecture II: Layer Normalization

<div class="alert alert-block alert-success">

Layer Normalization is a critical technique used in transformers to stabilize the training process. Deep neural networks can suffer from a problem where the distribution of activations changes between layers, making training difficult. Layer Normalization helps by re-centering and re-scaling the activations of each input across its feature dimension, so that they have a mean of 0 and a variance of 1.

</div>

### 3.4.1 Building LayerNorm from First Principles

<div class="alert alert-block alert-success">
To understand what Layer Normalization does, let's start with a simple example. We'll create a mini-batch of random data.
</div>

In [91]:
torch.manual_seed(100)

# Create a batch of 2 examples with 5 features each
batch_example = torch.randn(2, 5)

print(batch_example)

tensor([[ 0.3607, -0.2859, -0.3938,  0.2429, -1.3833],
        [-2.3134, -0.3172, -0.8660,  1.7482, -0.2759]])


<div class="alert alert-block alert-success">
<div class="alert alert-block alert-info">
    
As you can see, the values are arbitrary, and each row has its own mean and spread. The purpose of LayerNorm is to re-center and rescale these activations row by row.

</div>

Before we normalize, let's first calculate their current mean and variance for each example in the batch.

</div>

In [92]:
mean = batch_example.mean(dim=-1, keepdim=True)
var = batch_example.var(dim=-1, keepdim=True)

print("Mean:\n", mean)
print("Var:\n", var)

Mean:
 tensor([[-0.2919],
        [-0.4049]])
Var:
 tensor([[0.4784],
        [2.1288]])


<div class="alert alert-block alert-success">
Now for the core operation: Layer Normalization standardizes the activations by subtracting the mean and dividing by the standard deviation (the square root of the variance).
</div>

In [100]:
# Manually apply normalization
batch_norm = (batch_example - mean) / torch.sqrt(var)

print("Normalized batch:\n", batch_norm)

Normalized batch:
 tensor([[ 0.9435,  0.0086, -0.1474,  0.7733, -1.5780],
        [-1.3081,  0.0601, -0.3161,  1.4757,  0.0884]])
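
In symbols, for a single row $x$ with $n$ features, the manual computation above can be written as follows. Note that `torch.var` applies Bessel's correction by default, so the variance used in these cells divides by $n-1$ rather than $n$.

$$
\mu = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \mu)^2, \qquad
\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2}}
$$

For the first row, $\mu \approx -0.2919$ and $\sigma^2 \approx 0.4784$, so the first entry becomes $(0.3607 + 0.2919)/\sqrt{0.4784} \approx 0.9435$, matching the printed output.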


<div class="alert alert-block alert-success">
Let's verify the result. If the normalization worked, then the mean should be approximately 0 and the new variance should be 1. Note that the mean may not be exactly zero due to tiny floating-point inaccuracies.
</div>

In [99]:
# Verify the mean and variance of the normalized inputs
mean_norm = batch_norm.mean(dim=-1, keepdim=True)
var_norm = batch_norm.var(dim=-1, keepdim=True)

torch.set_printoptions(sci_mode=False)
print("Mean of normalized batch:\n", mean_norm)
print("Var of normalized batch:\n", var_norm)

Mean of normalized batch:
 tensor([[     0.0000],
        [    -0.0000]])
Var of normalized batch:
 tensor([[1.0000],
        [1.0000]])


### 3.4.2 Encapsulating the Logic in a `LayerNorm` Class

<div class="alert alert-block alert-success">
    
Now that we understand the step-by-step math, we can encapsulate this logic into a reusable PyTorch module. A production-ready `LayerNorm` class also includes two extra trainable parameters (`scale` and `shift`) and a small epsilon term (`eps`) that is added to the variance for numerical stability.
</div>

In [58]:
class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift
        

<div class="alert alert-block alert-info">
    
<b>What do  `scale` and `shift` do?</b>

While forcing the activations to have a mean of 0 and a variance of 1 is great for stability, it can sometimes limit the expressive power of the network. `scale` (gamma) and `shift` (beta) are trainable parameters that allow the model to learn the optimal scale and shift for the normalized activations. In essence, they give the network the ability to "undo" the normalization if it finds that a different distribution works better for that specific layer.

<b>Note on `unbiased=False`</b>

In statistics, the variance of a sample is often calculated by dividing by `n-1` (unbiased) instead of `n` (biased). We explicitly set `unbiased=False` to divide by `n`. For the large embedding dimensions used in LLMs, the difference is negligible. This choice ensures our implementation is compatible with the original GPT-2 model.
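
To make "negligible" concrete, the sketch below compares the two estimates on random vectors of length 5 and 768. The ratio between them is always `n / (n - 1)`, so it shrinks toward 1 as the feature dimension grows. The `ratios` dictionary is an illustrative name introduced only for this check.

```python
import torch

torch.manual_seed(0)

ratios = {}  # hypothetical container for the comparison results
for n in (5, 768):
    x = torch.randn(n)
    var_unbiased = x.var(unbiased=True)    # divides by n - 1
    var_biased = x.var(unbiased=False)     # divides by n
    ratios[n] = (var_unbiased / var_biased).item()
    print(f"n={n}: unbiased/biased = {ratios[n]:.4f} "
          f"(n/(n-1) = {n / (n - 1):.4f})")
```

With 5 features the two estimates differ by 25%, while with 768 features they differ by roughly 0.13%.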

</div>

### 3.4.3 Testing the `LayerNorm` Class

<div class="alert alert-block alert-success">

Finally, let's instantiate our new `LayerNorm` class and test it on our original example data to confirm it works as expected.

</div>

In [66]:
ln = LayerNorm(emb_dim=5)
out_ln = ln(batch_example)
mean = out_ln.mean(dim=-1, keepdim=True)
var = out_ln.var(dim=-1, keepdim=True, unbiased=False)
print("Mean:\n", mean)
print("Variance:\n", var)

Mean:
 tensor([[    -0.0000],
        [    -0.0000]], grad_fn=<MeanBackward1>)
Variance:
 tensor([[1.0000],
        [1.0000]], grad_fn=<VarBackward0>)
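
As an extra sanity check, our class can be compared against PyTorch's built-in `nn.LayerNorm`, which also uses the biased variance and a default `eps` of `1e-5`. The sketch below repeats the class definition so that it runs on its own; the tolerance of `1e-5` is an assumption that allows for small floating-point differences between the two implementations.

```python
import torch
import torch.nn as nn


class LayerNorm(nn.Module):
    # Copy of the class defined in section 3.4.2
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift


torch.manual_seed(100)
x = torch.randn(2, 5)

ours = LayerNorm(emb_dim=5)
builtin = nn.LayerNorm(5, eps=1e-5)  # weight starts at 1, bias at 0

# Largest element-wise gap between the two outputs
max_diff = (ours(x) - builtin(x)).abs().max().item()
print("Max absolute difference:", max_diff)
```

A difference close to zero suggests that the hand-written module and the built-in layer compute the same normalization for freshly initialized parameters.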
