# Implementing a GPT Model from Scratch to Generate Text

This section covers:

- Coding a GPT-like large language model (LLM) that can be trained to generate human-like text.

- Normalizing layer activations to stabilize neural network training.

- Adding shortcut connections in deep neural networks to train models more effectively.

- Implementing transformer blocks to create GPT models of various sizes.

- Computing the number of parameters and storage requirements of GPT models.

In [2]:
from importlib.metadata import version

print("matplotlib version:", version("matplotlib"))
print("torch version:", version("torch"))
print("tiktoken version:", version("tiktoken"))

matplotlib version: 3.10.6
torch version: 2.0.1
tiktoken version: 0.11.0


![Alt text](../../assests/figure41.png)

## Coding an LLM architecture

LLMs, such as GPT (Generative Pretrained Transformer), are large deep neural network architectures designed to generate new text one word (or token) at a time. 

![Alt text](../../assests/figure42.png)

- We've already covered `input tokenization and embedding` and the `masked multi-head attention` module.

- This section focuses on implementing the core structure of the GPT model, including its transformer blocks.
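- GPT models use only the decoder portion of the original transformer architecture, so these LLMs are often referred to as `"decoder-like"` LLMs.
- GPT-2 was introduced in the paper "Language Models are Unsupervised Multitask Learners".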
- We are scaling up to the size of a small GPT-2 model, specifically the smallest version with `124` million parameters.
- In the context of deep learning and LLMs like GPT, the term "parameters" refers to the trainable weights of the model. These weights are essentially the internal variables of the model that are adjusted and optimized during the training process to minimize a specific loss function. This optimization allows the model to learn from the training data.
- Compared to conventional deep learning models, LLMs are larger, mainly due to their vast number of parameters, not the amount of code.

- Configuration details for the `124 million` parameter GPT-2 model include:

In [3]:
GPT_CONFIG_124M = {
    "vocab_size": 50257,    # Vocabulary size
    "context_length": 1024, # Context length
    "emb_dim": 768,         # Embedding dimension
    "n_heads": 12,          # Number of attention heads
    "n_layers": 12,         # Number of layers
    "drop_rate": 0.1,       # Dropout rate
    "qkv_bias": False       # Query-Key-Value bias
}

We use short variable names to avoid long lines of code later:

- `"vocab_size"` indicates a vocabulary size of `50,257` tokens, as used by the `BPE` tokenizer discussed in Chapter 2.

- `"context_length"` represents the model's maximum input token count, as enabled by positional embeddings covered in Chapter 2

- `"emb_dim"` is the embedding size for token inputs, converting each input token into a `768`-dimensional vector

- `"n_heads"` is the number of attention heads in the `multi-head attention` mechanism implemented in Chapter 3
  
- `"n_layers"` is the number of transformer blocks within the model, which we'll implement in upcoming sections

- `"drop_rate"` is the dropout mechanism's intensity, discussed in Chapter 3; `0.1` means dropping `10%` of hidden units during training to mitigate overfitting

- `"qkv_bias"` decides whether the `Linear` layers in the multi-head attention mechanism (from Chapter 3) include a bias vector when computing the `query (Q)`, `key (K)`, and `value (V)` tensors. We disable this option, which is standard practice in modern LLMs, but we'll revisit it when loading pretrained GPT-2 weights from OpenAI into our reimplementation in Chapter 5.
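As a quick sanity check on these settings, the sketch below derives the per-head dimension implied by `emb_dim` and `n_heads`. It assumes the usual multi-head attention convention in which the embedding dimension is split evenly across the heads; the names `cfg` and `head_dim` are illustrative and not part of the original code.

```python
# Copy of the configuration values listed above (illustrative variable name)
cfg = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False,
}

# Assumption: the embedding dimension is split evenly across attention heads
if cfg["emb_dim"] % cfg["n_heads"] != 0:
    raise ValueError("emb_dim must be divisible by n_heads")

head_dim = cfg["emb_dim"] // cfg["n_heads"]
print("Dimension per attention head:", head_dim)
```

With the values above, each of the 12 heads works on a 64-dimensional slice of the 768-dimensional embedding.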



Using the configuration above, we start by implementing a `GPT placeholder` architecture (`DummyGPTModel`), as shown in the figure below. This gives us a big-picture view of how everything fits together and which other components we need to code in the upcoming sections to assemble the full GPT model architecture.


![Alt text](../../assests/figure43.png)

In [7]:
# A placeholder GPT model architecture class

import torch
import torch.nn as nn 


class DummyGPTModel(nn.Module):
    def __init__(self, config: dict):
        super().__init__()
        self.tok_emb = nn.Embedding(config["vocab_size"], config["emb_dim"])
        self.pos_emb = nn.Embedding(config["context_length"], config["emb_dim"])
        self.drop_emb = nn.Dropout(config["drop_rate"])
        
        # Use a placeholder for TransformerBlock
        self.trf_blocks = nn.Sequential(
            *[DummyTransformerBlock(config) for _ in range(config["n_layers"])]
        )
        
        # Use a placeholder for LayerNorm
        self.final_norm = DummyLayerNorm(config["emb_dim"])
        self.out_head = nn.Linear(
            config["emb_dim"], config["vocab_size"], bias=False
        )
        
    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits
    
    

class DummyTransformerBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        # A simple placeholder
        
    def forward(self, x):
        # This block does nothing and just returns its input
        return x
    

class DummyLayerNorm(nn.Module):
    def __init__(self, normalized_shape, eps=1e-5):
        super().__init__()
        # The parameters here are just to mimic the LayerNorm interface
        
    def forward(self, x):
        # This layer does nothing and just returns its input
        return x

The `DummyGPTModel` class above defines a simplified version of a GPT-like model using PyTorch's neural network module (`nn.Module`). The architecture consists of token and positional embeddings, dropout, a series of transformer blocks `(DummyTransformerBlock)`, a final layer normalization `(DummyLayerNorm)`, and a linear output layer `(out_head)`.

- The configuration is passed in via a Python dictionary, for instance, the `GPT_CONFIG_124M` dictionary we created earlier.

- The `forward` method describes the data flow through the model; it computes `token` and `positional` embeddings for the input indices, applies `dropout`, processes the data through the `transformer` blocks, applies normalization, and finally produces logits with the linear output layer.

- The code above is already functional, as we will see later in this section after we prepare the input data. However, it still uses placeholders `(DummyTransformerBlock and DummyLayerNorm)` for the transformer block and layer normalization, which we will develop in later sections.
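The shape handling in the `forward` method can be checked in isolation. The following is a minimal sketch with small, made-up sizes (a vocabulary of 10 tokens, a context length of 6, and 3 embedding dimensions) that shows how the positional embeddings are broadcast across the batch when they are added to the token embeddings.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)

# Small, made-up sizes so the shapes are easy to read
vocab_size, context_length, emb_dim = 10, 6, 3
tok_emb = nn.Embedding(vocab_size, emb_dim)
pos_emb = nn.Embedding(context_length, emb_dim)

# Two sequences of four token IDs each: shape [batch_size=2, seq_len=4]
in_idx = torch.tensor([[1, 5, 2, 7],
                       [3, 3, 0, 9]])
seq_len = in_idx.shape[1]

tok_embeds = tok_emb(in_idx)                  # shape [2, 4, 3]
pos_embeds = pos_emb(torch.arange(seq_len))   # shape [4, 3]

# Broadcasting adds the same positional vectors to every sequence in the batch
x = tok_embeds + pos_embeds
print("Token embeddings:     ", tok_embeds.shape)
print("Positional embeddings:", pos_embeds.shape)
print("Sum:                  ", x.shape)
```

The positional embeddings have no batch dimension, so the same four position vectors are reused for both sequences.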

![Alt text](../../assests/figure44.png)

Next, we will prepare the input data and initialize a new GPT model to illustrate its usage. 


- To implement the step in the figure above, we tokenize a batch consisting of two text inputs for the GPT model using the `tiktoken` tokenizer introduced in Chapter 2:

In [8]:
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

batch = []

text1 = "Every effort moves you"
text2 = "Every day holds a"

batch.append(torch.tensor(tokenizer.encode(text1)))
batch.append(torch.tensor(tokenizer.encode(text2)))
batch = torch.stack(batch, dim=0)
print(batch)

tensor([[6109, 3626, 6100,  345],
        [6109, 1110, 6622,  257]])


In [10]:
batch.shape

torch.Size([2, 4])

- The output above contains the resulting token IDs for the two texts. The first row corresponds to the first text, and the second row corresponds to the second text.

- Next, we initialize a new `DummyGPTModel` instance with the `124 million` parameter configuration and feed it the tokenized `batch`:

In [9]:
torch.manual_seed(123)
model = DummyGPTModel(GPT_CONFIG_124M)

logits = model(batch)
print("Output shape:", logits.shape)
print(logits)

Output shape: torch.Size([2, 4, 50257])
tensor([[[-1.2034,  0.3201, -0.7130,  ..., -1.5548, -0.2390, -0.4667],
         [-0.1192,  0.4539, -0.4432,  ...,  0.2392,  1.3469,  1.2430],
         [ 0.5307,  1.6720, -0.4695,  ...,  1.1966,  0.0111,  0.5835],
         [ 0.0139,  1.6755, -0.3388,  ...,  1.1586, -0.0435, -1.0400]],

        [[-1.0908,  0.1798, -0.9484,  ..., -1.6047,  0.2439, -0.4530],
         [-0.7860,  0.5581, -0.0610,  ...,  0.4835, -0.0077,  1.6621],
         [ 0.3567,  1.2698, -0.6398,  ..., -0.0162, -0.1296,  0.3717],
         [-0.2407, -0.7349, -0.5102,  ...,  2.0057, -0.3694,  0.1814]]],
       grad_fn=<UnsafeViewBackward0>)


- The model outputs above are commonly referred to as logits.

- The output tensor has two rows corresponding to the two text samples. Each text sample consists of 4 tokens; each token is a `50,257`-dimensional vector, which matches the size of the tokenizer's vocabulary.

- Each output vector has `50,257` dimensions because each dimension corresponds to a unique token in the vocabulary.
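To make the link between logit dimensions and vocabulary entries concrete, the sketch below reads a token ID off each output vector by taking the index of its largest value. It uses random numbers and a made-up vocabulary size of 8 instead of real model outputs, and picking the largest logit is only one illustrative way to interpret the scores.

```python
import torch

torch.manual_seed(0)

# Assumption: a toy vocabulary of 8 tokens instead of 50,257
toy_vocab_size = 8
toy_logits = torch.randn(2, 4, toy_vocab_size)  # [batch_size, num_tokens, vocab_size]

# The index of the largest logit in the last dimension is a token ID
token_ids = torch.argmax(toy_logits, dim=-1)    # shape [2, 4]

# Softmax turns each logit vector into probabilities that sum to 1
probs = torch.softmax(toy_logits, dim=-1)

print("Token IDs with the highest score:\n", token_ids)
print("Sum of probabilities per position:\n", probs.sum(dim=-1))
```

Every entry of `token_ids` lies between 0 and `toy_vocab_size - 1`, which is why the last dimension of the logits must match the vocabulary size.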

In [13]:
model

DummyGPTModel(
  (tok_emb): Embedding(50257, 768)
  (pos_emb): Embedding(1024, 768)
  (drop_emb): Dropout(p=0.1, inplace=False)
  (trf_blocks): Sequential(
    (0): DummyTransformerBlock()
    (1): DummyTransformerBlock()
    (2): DummyTransformerBlock()
    (3): DummyTransformerBlock()
    (4): DummyTransformerBlock()
    (5): DummyTransformerBlock()
    (6): DummyTransformerBlock()
    (7): DummyTransformerBlock()
    (8): DummyTransformerBlock()
    (9): DummyTransformerBlock()
    (10): DummyTransformerBlock()
    (11): DummyTransformerBlock()
  )
  (final_norm): DummyLayerNorm()
  (out_head): Linear(in_features=768, out_features=50257, bias=False)
)
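The placeholder blocks and the placeholder normalization layer hold no weights, so only three layers in the printed structure contribute parameters. A rough count can be derived from their shapes alone; the sketch below does this with plain arithmetic, and the dictionary name `weight_shapes` is illustrative. The size estimate assumes 4 bytes per weight (float32, PyTorch's default floating-point type).

```python
import math

# Weight shapes read off the printed module structure above
weight_shapes = {
    "tok_emb.weight": (50257, 768),
    "pos_emb.weight": (1024, 768),
    "out_head.weight": (50257, 768),  # nn.Linear stores [out_features, in_features]
}

param_counts = {name: math.prod(shape) for name, shape in weight_shapes.items()}
total_params = sum(param_counts.values())

for name, count in param_counts.items():
    print(f"{name}: {count:,}")
print(f"Total: {total_params:,}")

# Assumption: 4 bytes per weight (float32)
total_mb = total_params * 4 / (1024 ** 2)
print(f"Approximate size: {total_mb:.2f} MB")
```

The total is well below 124 million because the transformer blocks are still empty placeholders at this point.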

In [20]:
# Access model parameters
for name, param in model.named_parameters():
    print(f"Layer: {name}, Shape: {param.shape}")
    print(f"Weights: \n{param.data}") # .data to get the tensor values

Layer: tok_emb.weight, Shape: torch.Size([50257, 768])
Weights: 
tensor([[ 3.3737e-01, -1.7778e-01, -3.0353e-01,  ..., -3.1813e-01,
         -1.3936e+00,  5.2262e-01],
        [ 2.5787e-01,  3.4197e-01, -8.1678e-01,  ..., -4.0981e-01,
          4.9785e-01, -3.7207e-01],
        [ 7.9574e-01,  5.3501e-01,  9.4275e-01,  ..., -1.0749e+00,
          9.5492e-02, -1.4138e+00],
        ...,
        [-7.1278e-01, -5.0190e-01,  1.4119e+00,  ..., -1.4979e-01,
         -4.8977e-01, -1.0620e+00],
        [ 2.0646e+00,  1.1190e+00,  3.8486e-01,  ..., -7.2015e-01,
         -5.5703e-01,  9.8639e-01],
        [ 1.1364e-03, -7.5320e-01, -1.7924e-01,  ..., -3.2443e-01,
          2.6055e-01,  5.8885e-01]])
Layer: pos_emb.weight, Shape: torch.Size([1024, 768])
Weights: 
tensor([[ 0.8769,  0.2550,  0.8441,  ..., -1.0354,  1.3085,  1.7957],
        [-1.0029,  0.0995,  1.2459,  ...,  1.5453, -0.1126, -1.5197],
        [ 1.3317,  0.7561,  0.9077,  ...,  0.0830,  1.8336, -2.2225],
        ...,
        [-0.1055

## Normalizing activations with layer normalization

- Training deep neural networks with many layers can sometimes prove challenging due to issues like `vanishing` or `exploding` gradients. These issues lead to unstable training dynamics and make it difficult for the network to effectively adjust its weights, which means the learning process struggles to find a set of parameters (weights) for the neural network that minimizes the `loss function`. In other words, the network has difficulty learning the underlying patterns in the data to a degree that would allow it to make accurate predictions or decisions. 


- We will implement `layer normalization` to improve the stability and efficiency of neural network training.
- The main idea behind `layer normalization` is to adjust the activations (outputs) of a neural network layer to have a mean of `0` and a variance of `1`, also known as `unit variance`.
- This adjustment speeds up the convergence to effective weights and ensures consistent, reliable training. 
- Layer normalization is applied both before and after the multi-head attention module within the transformer block, which we will implement later.
- It's also applied before the final output layer.


Here is a visual overview of how layer normalization functions.

![Alt text](../../assests/figure45.png)

- Let's see how layer normalization works by passing a small input sample through a simple neural network layer:

In [25]:
torch.manual_seed(123)

# create 2 training examples with 5 dimensions (features) each
batch_example = torch.randn(2, 5)

layer = nn.Sequential(nn.Linear(5, 6), nn.ReLU())
out = layer(batch_example)
print(out)

tensor([[0.2260, 0.3470, 0.0000, 0.2216, 0.0000, 0.0000],
        [0.2133, 0.2394, 0.0000, 0.5198, 0.3297, 0.0000]],
       grad_fn=<ReluBackward0>)


- The first row of the output above lists the layer outputs for the first input, and the second row lists the layer outputs for the second input.

- The neural network layer above consists of a `Linear` layer followed by a non-linear activation function, `ReLU` (short for Rectified Linear Unit), which is a standard activation function in neural networks. 

- `ReLU` simply thresholds negative inputs to `0`, ensuring that a layer outputs only non-negative values, which explains why the resulting layer output does not contain any negative values.
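The thresholding behavior is easy to verify on a handful of hand-picked values. This is a small sketch using `torch.relu`; the input values are arbitrary.

```python
import torch

# Arbitrary inputs covering negative, zero, and positive values
x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

# Negative entries become 0; non-negative entries pass through unchanged
relu_out = torch.relu(x)
print(relu_out)
```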

In [26]:
print(out.shape)

torch.Size([2, 6])


Let's compute the mean and variance for each of the 2 inputs above:

In [23]:
mean = out.mean(dim=1, keepdim=True)
var = out.var(dim=1, keepdim=True)

print("Mean:\n", mean)
print("Variance:\n", var)

Mean:
 tensor([[0.1324],
        [0.2170]], grad_fn=<MeanBackward1>)
Variance:
 tensor([[0.0231],
        [0.0398]], grad_fn=<VarBackward0>)


In [29]:
mean.shape, var.shape

(torch.Size([2, 1]), torch.Size([2, 1]))

- The first row in the `mean` tensor contains the `mean` value for the first input row, and the second row contains the `mean` for the second input row.

- Using `keepdim=True` in operations like mean or variance calculation ensures that the output tensor retains the same number of dimensions as the input tensor, even though the operation reduces the tensor along the dimension specified via `dim`. For instance, without `keepdim=True`, the returned mean tensor would be a one-dimensional tensor with two elements, `[0.1324, 0.2170]`, instead of a 2×1 matrix `[[0.1324], [0.2170]]`.

- The `dim` parameter specifies the dimension along which the calculation of the statistic (here, mean and variance) should be performed in a tensor.
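The effect of `dim` and `keepdim` is easiest to see on a small tensor with known values. The sketch below uses a hand-written 2×3 matrix (not the layer outputs from above) and compares the resulting shapes.

```python
import torch

x = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

# Reduce across the columns of each row (one mean per row)
row_mean_keep = x.mean(dim=1, keepdim=True)   # shape [2, 1]
row_mean_flat = x.mean(dim=1)                 # shape [2]

# Reduce across the rows of each column (one mean per column)
col_mean = x.mean(dim=0)                      # shape [3]

print("dim=1, keepdim=True: ", row_mean_keep.shape)
print("dim=1, keepdim=False:", row_mean_flat.shape)
print("dim=0:               ", col_mean.shape)

# With keepdim=True, the [2, 1] result broadcasts cleanly against the [2, 3] input
centered = x - row_mean_keep
print(centered)
```

Subtracting `row_mean_flat` instead would fail, because a shape of `[2]` cannot be broadcast against `[2, 3]`; this is the practical reason for keeping the reduced dimension.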

![Alt text](../../assests/figure46.png)


From the figure above, for a `2D` tensor (like a matrix), using `dim=-1` for operations such as mean and variance calculation is the same as using `dim=1`. This is because `-1` refers to the tensor's last dimension, which corresponds to the columns in a `2D` tensor.
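The equivalence can be confirmed directly. The following sketch uses random tensors with made-up shapes and checks that `dim=-1` always selects the last dimension, whether the tensor has two or three dimensions.

```python
import torch

torch.manual_seed(123)

# 2D case: the last dimension is dim=1
x2d = torch.randn(2, 5)
same_2d = torch.equal(x2d.mean(dim=-1), x2d.mean(dim=1))

# 3D case: the last dimension is dim=2
x3d = torch.randn(2, 3, 4)
same_3d = torch.equal(x3d.mean(dim=-1), x3d.mean(dim=2))

print("2D: dim=-1 matches dim=1:", same_2d)
print("3D: dim=-1 matches dim=2:", same_3d)
print("3D mean shape with keepdim=True:", x3d.mean(dim=-1, keepdim=True).shape)
```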

- Later, when adding layer normalization to the GPT model, which produces `3D` tensors with shape `[batch_size, num_tokens, embedding_size]`, we can still use `dim=-1` for normalization across the last dimension, avoiding a change from `dim=1` to `dim=2`.

- Let us apply layer normalization to the layer outputs we obtained earlier. The operation consists of subtracting the mean and dividing by the square root of the variance (also known as the `standard deviation`). This normalizes each input to have a mean of `0` and a variance of `1` across the column (feature) dimension:

In [32]:
out_norm = (out - mean) / torch.sqrt(var)
mean = out_norm.mean(dim=-1, keepdim=True)
var = out_norm.var(dim=-1, keepdim=True)
print("Normalized layer outputs:\n", out_norm)
print("Mean:\n", mean)
print("Variance:\n", var)

Normalized layer outputs:
 tensor([[ 0.6159,  1.4126, -0.8719,  0.5872, -0.8719, -0.8719],
        [-0.0189,  0.1121, -1.0876,  1.5173,  0.5647, -1.0876]],
       grad_fn=<DivBackward0>)
Mean:
 tensor([[2.9802e-08],
        [3.9736e-08]], grad_fn=<MeanBackward1>)
Variance:
 tensor([[1.],
        [1.]], grad_fn=<VarBackward0>)


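- The normalized layer outputs above now contain negative values and have a mean of `0` and a variance of `1`. The printed means, such as `2.9802e-08`, are not exactly `0` because of floating-point rounding, but they are negligibly small.

- To improve readability, we can also turn off the scientific notation when printing tensor values by setting `sci_mode` to `False`: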

In [33]:
torch.set_printoptions(sci_mode=False)
print("Mean:\n", mean)
print("Variance:\n", var)

Mean:
 tensor([[    0.0000],
        [    0.0000]], grad_fn=<MeanBackward1>)
Variance:
 tensor([[1.],
        [1.]], grad_fn=<VarBackward0>)


- Above we normalized the features of each input.
- Now, using the same idea, we can implement a `LayerNorm` class:

In [34]:
class LayerNorm(nn.Module):
    def __init__(self, emb_dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))
        
    def forward(self, x: torch.Tensor):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift

- This specific implementation of `layer normalization` operates on the last dimension of the input tensor `x`, which represents the embedding dimension `(emb_dim)`. The variable `eps` is a small constant `(epsilon)` added to the variance to prevent division by zero during normalization.

**Scale and shift**

- Note that in addition to performing the normalization by subtracting the mean and dividing by the square root of the variance, we added two trainable parameters, a `scale` and a `shift` parameter.

- The initial `scale` (multiplying by 1) and `shift` (adding 0) values don't have any effect; however, `scale` and `shift` are trainable parameters that the LLM automatically adjusts during training if it is determined that doing so would improve the model's performance on its training task.

- This allows the model to learn appropriate scaling and shifting that best suit the data it is processing.

- Note that we also add a small value `(eps)` to the variance before computing its square root; this avoids a division by zero if the variance is `0`.
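The role of `eps` shows up when all features of an input are identical, so that the variance is exactly `0`. The sketch below uses a hand-made constant row to compare the normalization with and without `eps`. In PyTorch, dividing `0` by `0` does not raise an exception but silently produces `nan` values, which would then propagate through the rest of the network.

```python
import torch

# A constant input row: every feature has the same value, so the variance is 0
x = torch.tensor([[2.0, 2.0, 2.0, 2.0]])

mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
eps = 1e-5

without_eps = (x - mean) / torch.sqrt(var)        # 0 / 0 yields nan
with_eps = (x - mean) / torch.sqrt(var + eps)     # 0 / small positive number yields 0

print("Variance:", var)
print("Without eps:", without_eps)
print("With eps:   ", with_eps)
```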

**Biased variance**

- In the variance calculation above, setting `unbiased=False` means using the formula:

$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$$

to compute the variance, where $n$ is the sample size (here, the number of features or columns). This formula does not include Bessel's correction (which uses $n-1$ in the denominator), thus providing a **biased estimate** of the population variance.
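How much the choice of denominator matters depends on $n$. The sketch below compares the two estimates for `n = 5` and `n = 768` on random numbers. It uses Python's standard `statistics` module rather than PyTorch, where `pvariance` divides by $n$ and `variance` divides by $n-1$; the sample values and the name `ratios` are illustrative.

```python
import random
import statistics

random.seed(123)

ratios = {}
for n in (5, 768):
    values = [random.gauss(0.0, 1.0) for _ in range(n)]
    biased = statistics.pvariance(values)    # divides by n
    unbiased = statistics.variance(values)   # divides by n - 1 (Bessel's correction)
    ratios[n] = biased / unbiased            # equals (n - 1) / n
    print(f"n={n}: biased={biased:.4f}, unbiased={unbiased:.4f}, ratio={ratios[n]:.4f}")
```

The ratio of the two estimates is always $(n-1)/n$, so it is `0.8` for five values but about `0.9987` for 768 values.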

- For LLMs, where the embedding dimension `n` is very large, the difference between using `n` and `n-1` is negligible.

- However, `GPT-2` was trained with a biased variance in the normalization layers, which is why we adopt this setting for compatibility with the pretrained weights that we will load in a later section.

- Let's now try the LayerNorm module in practice and apply it to the batch input:

In [35]:
ln = LayerNorm(emb_dim=5)
out_ln = ln(batch_example)

mean = out_ln.mean(dim=-1, keepdim=True)
var = out_ln.var(dim=-1, unbiased=False, keepdim=True)

print("Mean:\n", mean)
print("Variance:\n", var)

Mean:
 tensor([[    -0.0000],
        [     0.0000]], grad_fn=<MeanBackward1>)
Variance:
 tensor([[1.0000],
        [1.0000]], grad_fn=<VarBackward0>)
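As an additional check, the custom class can be compared against PyTorch's built-in `torch.nn.LayerNorm`, which also uses the biased variance and a default `eps` of `1e-5`. The sketch below repeats the `LayerNorm` class from this section so that it runs on its own, and it assumes freshly initialized modules (scale of 1 and shift of 0 on both sides).

```python
import torch
import torch.nn as nn


class LayerNorm(nn.Module):
    # Copy of the LayerNorm class defined earlier in this section
    def __init__(self, emb_dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift


torch.manual_seed(123)
batch_example = torch.randn(2, 5)

custom_out = LayerNorm(emb_dim=5)(batch_example)
builtin_out = nn.LayerNorm(5, eps=1e-5)(batch_example)

# The two outputs should agree up to floating-point rounding
matches = torch.allclose(custom_out, builtin_out, atol=1e-5)
print("Custom LayerNorm matches nn.LayerNorm:", matches)
```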


![Alt text](../../assests/figure47.png)