## Implementing a GPT model from scratch to generate text
In this chapter, **Sebastian** explains the code that implements an LLM architecture that can be trained to generate human-like text.
The implementation covers normalized layer activations, shortcut connections, transformer blocks and, finally, the computation of the number of parameters and storage requirements of GPT models.  

In the previous chapter we implemented most of the components of the LLM architecture. Now it's time to put the whole system together, so that pretrained weights can be loaded into our implementation. These weights are the internal variables of the model that are adjusted and optimized during the training process to minimize a specific loss function.

**This optimization allows the model to learn from the training data** 



In [1]:
from importlib.metadata import version

print("matplotlib version:", version("matplotlib"))
print("torch version:", version("torch"))
print("tiktoken version:", version("tiktoken"))

matplotlib version: 3.10.7
torch version: 2.9.0
tiktoken version: 0.12.0


![image.png](attachment:image.png)

In [2]:
# First, we specify the configuration of the small GPT-2 model via a dictionary
GPT_CONFIG_124M = {
    "vocab_size": 50257,    # Vocabulary size ----> as used by the BPE tokenizer
    "context_length": 1024, # Context length ----> max number of input tokens the model can handle
    "emb_dim": 768,         # Embedding dimension ----> each token is mapped to a 768-dimensional vector
    "n_heads": 12,          # Number of attention heads ----> how many parallel attention mechanisms
    "n_layers": 12,         # Number of layers ----> how many transformer blocks
    "drop_rate": 0.1,       # Dropout rate ----> regularization technique to prevent overfitting
    "qkv_bias": False       # Query-Key-Value bias ----> determines whether to include a bias vector in the Linear layers of the multi-head attention for query, key and value computations
}
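
A quick sanity check on the configuration can be sketched as follows. It assumes the usual multi-head attention layout, in which the embedding dimension is split evenly across the heads; the local `cfg` dictionary and the name `head_dim` are only illustrative and are not part of the notebook.

```python
# Illustrative copy of the two relevant entries, so the snippet runs on its own.
cfg = {"emb_dim": 768, "n_heads": 12}

# Assumption: each attention head works on an equal slice of the embedding.
assert cfg["emb_dim"] % cfg["n_heads"] == 0, "emb_dim must be divisible by n_heads"

head_dim = cfg["emb_dim"] // cfg["n_heads"]
print("Dimension per attention head:", head_dim)  # 64
```

If the division left a remainder, the heads could not share the embedding evenly, so it is worth checking before building the model.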

![image.png](attachment:image.png)

In [3]:
import torch
import torch.nn as nn
# Use this as a placeholder for the full GPT model implementation.

class DummyGPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        # Token embedding layer from vocabulary to embedding dimension
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        # Positional embedding layer to encode token positions
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        # Dropout layer for regularization
        self.drop_emb = nn.Dropout(cfg["drop_rate"])
        
        # Use a placeholder for TransformerBlock
        self.trf_blocks = nn.Sequential(
            *[DummyTransformerBlock(cfg) for _ in range(cfg["n_layers"])])
        
        # Use a placeholder for LayerNorm
        self.final_norm = DummyLayerNorm(cfg["emb_dim"])
        # Output head to project embeddings back to vocabulary size
        self.out_head = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        # Extract batch size and sequence length from input indices
        batch_size, seq_len = in_idx.shape
        # Get token embeddings 
        tok_embeds = self.tok_emb(in_idx)
        # Get positional embeddings
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        # Combine token and positional embeddings
        x = tok_embeds + pos_embeds
        # Apply dropout
        x = self.drop_emb(x)
        # Pass through transformer blocks
        x = self.trf_blocks(x)
        # Apply final layer normalization
        x = self.final_norm(x)
        # Project to vocabulary size with the linear output layer to get the logits
        logits = self.out_head(x)
        return logits


class DummyTransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        # A simple placeholder

    def forward(self, x):
        # This block does nothing and just returns its input.
        return x


class DummyLayerNorm(nn.Module):
    def __init__(self, normalized_shape, eps=1e-5):
        super().__init__()
        # The parameters here are just to mimic the LayerNorm interface.

    def forward(self, x):
        # This layer does nothing and just returns its input.
        return x
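
As a rough preview of the parameter and storage computation mentioned in the introduction, the sketch below counts only the weights that the placeholder model actually owns: the two embedding tables and the bias-free output head. The dummy transformer blocks and the dummy layer norm hold no parameters, so the full model will be larger than this. The variable names are illustrative, and the 4 bytes per parameter assume `float32` storage.

```python
# Illustrative copy of the relevant config entries, so the snippet runs on its own.
cfg = {"vocab_size": 50257, "context_length": 1024, "emb_dim": 768}

# nn.Embedding(num_embeddings, dim) stores a (num_embeddings x dim) weight matrix.
tok_emb_params = cfg["vocab_size"] * cfg["emb_dim"]
pos_emb_params = cfg["context_length"] * cfg["emb_dim"]
# nn.Linear(emb_dim, vocab_size, bias=False) stores only a weight matrix.
out_head_params = cfg["emb_dim"] * cfg["vocab_size"]

total = tok_emb_params + pos_emb_params + out_head_params

# Assumption: float32 weights, i.e. 4 bytes per parameter.
size_mb = total * 4 / (1024 ** 2)

print(f"Token embedding:      {tok_emb_params:,}")
print(f"Positional embedding: {pos_emb_params:,}")
print(f"Output head:          {out_head_params:,}")
print(f"Total: {total:,} parameters, about {size_mb:.2f} MB")
```

With an instantiated model, the same total could be obtained with `sum(p.numel() for p in model.parameters())`.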

![image.png](attachment:image.png)

In [None]:
import tiktoken

# Initialize the tokenizer for GPT-2
tokenizer = tiktoken.get_encoding("gpt2")

batch = []

txt1 = "Every effort moves you"
txt2 = "Every day holds a"

batch.append(torch.tensor(tokenizer.encode(txt1)))
batch.append(torch.tensor(tokenizer.encode(txt2)))
# Stack the list of tensors into a single tensor along a new dimension (batch dimension)
batch = torch.stack(batch, dim=0)
print(batch)

tensor([[6109, 3626, 6100,  345],
        [6109, 1110, 6622,  257]])
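
`torch.stack` works here because both texts happen to encode to four tokens. The sketch below shows one possible way to batch sequences of different lengths using `torch.nn.utils.rnn.pad_sequence`. The token IDs are copied from the output above (with the last one dropped from the second sequence), and using 50256, the GPT-2 `<|endoftext|>` ID, as the padding value is an assumption made for illustration, not something this notebook does.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Token IDs copied from the batch above; the second sequence is shortened on purpose.
seq1 = torch.tensor([6109, 3626, 6100, 345])
seq2 = torch.tensor([6109, 1110, 6622])

# torch.stack requires tensors of identical shape, so this raises a RuntimeError.
try:
    torch.stack([seq1, seq2], dim=0)
except RuntimeError:
    print("torch.stack failed: sequences have different lengths")

# Assumption: pad with 50256, the <|endoftext|> token ID in the GPT-2 vocabulary.
PAD_ID = 50256
padded = pad_sequence([seq1, seq2], batch_first=True, padding_value=PAD_ID)
print(padded.shape)
print(padded)
```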


In [None]:
torch.manual_seed(123)
model = DummyGPTModel(GPT_CONFIG_124M)

logits = model(batch)
print("Output shape:", logits.shape)
print(logits)

# The output tensor has two rows corresponding to the two text samples.
# Each text sample consists of four tokens; each token is represented by a
# 50,257-dimensional vector, which matches the size of the tokenizer's vocabulary.

Output shape: torch.Size([2, 4, 50257])
tensor([[[-1.2034,  0.3201, -0.7130,  ..., -1.5548, -0.2390, -0.4667],
         [-0.1192,  0.4539, -0.4432,  ...,  0.2392,  1.3469,  1.2430],
         [ 0.5307,  1.6720, -0.4695,  ...,  1.1966,  0.0111,  0.5835],
         [ 0.0139,  1.6754, -0.3388,  ...,  1.1586, -0.0435, -1.0400]],

        [[-1.0908,  0.1798, -0.9484,  ..., -1.6047,  0.2439, -0.4530],
         [-0.7860,  0.5581, -0.0610,  ...,  0.4835, -0.0077,  1.6621],
         [ 0.3567,  1.2698, -0.6398,  ..., -0.0162, -0.1296,  0.3717],
         [-0.2407, -0.7349, -0.5102,  ...,  2.0057, -0.3694,  0.1814]]],
       grad_fn=<UnsafeViewBackward0>)
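
To illustrate how logits of this shape could be turned into a next-token prediction, here is a minimal sketch of greedy selection. It uses a random stand-in tensor rather than the model output, so the chosen IDs carry no meaning; the same is true for the untrained placeholder model above. The names `last_logits`, `probas`, and `next_ids` are illustrative.

```python
import torch

torch.manual_seed(123)

# Stand-in for the model output: (batch_size, num_tokens, vocab_size).
logits = torch.randn(2, 4, 50257)

# Only the last position predicts the token that follows the input.
last_logits = logits[:, -1, :]                  # shape: (2, 50257)

# Softmax turns the logits into a probability distribution over the vocabulary.
probas = torch.softmax(last_logits, dim=-1)

# Greedy choice: take the most probable token ID for each sample.
next_ids = torch.argmax(probas, dim=-1, keepdim=True)  # shape: (2, 1)

print("Next-token IDs shape:", next_ids.shape)
```

Because softmax preserves the ordering of the logits, taking `argmax` directly on `last_logits` would select the same IDs; the softmax step is shown only to make the probabilities explicit.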
