### Implementing a GPT Model from Scratch to Generate Text

In [1]:
from importlib.metadata import version

import matplotlib, tiktoken, torch

print("matplotlib ver:", version("matplotlib"), "tiktoken ver:", version("tiktoken"), "torch:", version("torch"))


matplotlib ver: 3.7.2 tiktoken ver: 0.6.0 torch: 2.1.2


- In this chapter, we implement a GPT-like LLM architecture; the next chapter will focus on training this LLM
- ![image.png](attachment:image.png)

## ***Coding an LLM architecture***
- Compared to conventional deep learning models, LLMs are larger mainly because of their vast number of parameters, not because of the amount of code.
- ![image.png](attachment:image.png)
- In previous chapters, we used small embedding dimensions for token inputs and outputs for ease of illustration, so that they fit on a single page.
- In this chapter, we consider embedding and model sizes akin to those of a small GPT-2 model.
- Specifically, we'll code the architecture of the smallest GPT-2 model (124 million parameters).
- Chapter 6 will show how to load pretrained weights into our implementation, which will also be compatible with the model sizes of 345, 762, and 1542 million parameters.

- Config details for the 124 million parameter GPT-2 model include:

In [2]:
GPT_CONFIG_124M = {
    "vocab_size"     : 50257,   # Vocabulary size of the BPE tokenizer
    "context_length" : 1024,    # Maximum number of input tokens, as enabled by the positional embeddings
    "emb_dim"        : 768,     # Embedding size; each input token becomes a 768-dimensional vector
    "n_heads"        : 12,      # Number of attention heads in multi-head attention
    "n_layers"       : 12,      # Number of transformer blocks in the model
    "drop_rate"      : 0.1,     # Dropout rate: drops 10% of hidden units during training to mitigate overfitting
    "qkv_bias"       : False    # Whether the query, key, and value Linear layers in multi-head attention include a bias vector
}

![image.png](attachment:image.png)
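A few quantities follow directly from this configuration. The sketch below is only an illustration: it assumes the usual convention that multi-head attention splits the embedding evenly across heads, and the names `head_dim`, `tok_emb_params`, and `pos_emb_params` are introduced here and do not appear in the notebook.

```python
# Copy of the configuration from the cell above (values from the notebook).
GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False,
}

cfg = GPT_CONFIG_124M

# Assumption: the embedding is split evenly across the attention heads,
# so emb_dim must be divisible by n_heads.
assert cfg["emb_dim"] % cfg["n_heads"] == 0
head_dim = cfg["emb_dim"] // cfg["n_heads"]

# Sizes of the two embedding tables (rows x columns).
tok_emb_params = cfg["vocab_size"] * cfg["emb_dim"]
pos_emb_params = cfg["context_length"] * cfg["emb_dim"]

print("head_dim:", head_dim)
print("token embedding parameters:", tok_emb_params)
print("positional embedding parameters:", pos_emb_params)
```

With these values, each head works on 64 dimensions, and the token embedding table alone accounts for roughly 38.6 million parameters.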

### 1) GPT backbone

In [3]:
GPT_CONFIG_124M

{'vocab_size': 50257,
 'context_length': 1024,
 'emb_dim': 768,
 'n_heads': 12,
 'n_layers': 12,
 'drop_rate': 0.1,
 'qkv_bias': False}

In [4]:
import torch
import torch.nn as nn 

class DummyGPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb  = nn.Embedding(cfg["vocab_size"], cfg['emb_dim'])
        self.pos_emb  = nn.Embedding(cfg["context_length"], cfg['emb_dim'])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])


        # Use a placeholder for TransformerBlock
        self.trf_blocks = nn.Sequential(
            *[DummyTransformerBlock(cfg) for _ in range(cfg['n_layers'])]
        )

        # Use a placeholder for LayerNorm
        self.final_norm = DummyLayerNorm(
            cfg["emb_dim"]
        )

        self.out_head   = nn.Linear(
            cfg["emb_dim"], cfg["vocab_size"], bias=False
        )

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds  = self.tok_emb(in_idx)
        pos_embeds  = self.pos_emb(torch.arange(seq_len, device = in_idx.device))
        x  = tok_embeds + pos_embeds
        x  = self.drop_emb(x)
        x  = self.trf_blocks(x)
        x  = self.final_norm(x)
        logits = self.out_head(x)
        return logits


class DummyTransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        # A simple placeholder

    def forward(self, x):
        # this block does nothing and just returns its inputs
        return x

class DummyLayerNorm(nn.Module):
    def __init__(self, normalized_shape, eps = 1e-5):
        super().__init__()
        # The parameters here are just to mimic the LayerNorm interface

    def forward(self, x):
        # This layer does nothing and just returns its input. 
        return x


![image.png](attachment:image.png)
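The first lines of `forward` add token and positional embeddings of different shapes. The following sketch isolates that step with hypothetical toy sizes (a vocabulary of 10, a context length of 6, and 8 embedding dimensions, none of which come from the notebook) to show how broadcasting makes the addition work.

```python
import torch
import torch.nn as nn

torch.manual_seed(123)

# Hypothetical toy sizes, chosen only to keep the tensors small.
vocab_size, context_length, emb_dim = 10, 6, 8

tok_emb = nn.Embedding(vocab_size, emb_dim)
pos_emb = nn.Embedding(context_length, emb_dim)

in_idx = torch.tensor([[1, 2, 3, 4],
                       [5, 6, 7, 8]])         # (batch_size=2, seq_len=4)
batch_size, seq_len = in_idx.shape

tok_embeds = tok_emb(in_idx)                  # (2, 4, 8)
pos_embeds = pos_emb(torch.arange(seq_len))   # (4, 8)

# Broadcasting adds the same positional vectors to every sequence in the batch.
x = tok_embeds + pos_embeds

print(tok_embeds.shape, pos_embeds.shape, x.shape)
```

Because the positional embeddings depend only on the position index, every sequence in the batch receives the same offsets; only the token embeddings differ between the two rows.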

In [6]:
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

batch = [ ]

txt1 = "Every effort moves you"
txt2 = "Every day holds a"

batch.append(torch.tensor(tokenizer.encode(txt1)))
batch.append(torch.tensor(tokenizer.encode(txt2)))
batch = torch.stack(batch, dim=0)
print(batch)

tensor([[6109, 3626, 6100,  345],
        [6109, 1110, 6622,  257]])


In [7]:
torch.manual_seed(123)
model = DummyGPTModel(GPT_CONFIG_124M)

logits = model(batch)
print("output shape:", logits.shape)
print(logits)

output shape: torch.Size([2, 4, 50257])
tensor([[[-0.9289,  0.2748, -0.7557,  ..., -1.6070,  0.2702, -0.5888],
         [-0.4476,  0.1726,  0.5354,  ..., -0.3932,  1.5285,  0.8557],
         [ 0.5680,  1.6053, -0.2155,  ...,  1.1624,  0.1380,  0.7425],
         [ 0.0447,  2.4787, -0.8843,  ...,  1.3219, -0.0864, -0.5856]],

        [[-1.5474, -0.0542, -1.0571,  ..., -1.8061, -0.4494, -0.6747],
         [-0.8422,  0.8243, -0.1098,  ..., -0.1434,  0.2079,  1.2046],
         [ 0.1355,  1.1858, -0.1453,  ...,  0.0869, -0.1590,  0.1552],
         [ 0.1666, -0.8138,  0.2307,  ...,  2.5035, -0.3055, -0.3083]]],
       grad_fn=<UnsafeViewBackward0>)
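The output shape `[2, 4, 50257]` comes from the final linear layer, which maps each token's hidden vector to one score per vocabulary entry. A minimal sketch of that mapping, using hypothetical toy sizes instead of 768 and 50257 so that it runs instantly:

```python
import torch
import torch.nn as nn

torch.manual_seed(123)

# Hypothetical toy sizes; the notebook uses emb_dim=768 and vocab_size=50257.
emb_dim, vocab_size = 8, 20

out_head = nn.Linear(emb_dim, vocab_size, bias=False)

# Stand-in for the hidden states: 2 texts, 4 tokens each, emb_dim features.
hidden = torch.randn(2, 4, emb_dim)

# nn.Linear acts on the last dimension only, so the batch and token
# dimensions are preserved.
logits = out_head(hidden)
print(logits.shape)

# The logits at one token position hold one score per vocabulary entry.
print(logits[0, -1].shape)
```

The same reasoning applies at full scale: 2 texts, 4 tokens each, and 50257 scores per token.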


### Normalizing activations with layer normalization
- Layer normalization, also known as LayerNorm, centers the activations of a neural network layer around a mean of 0 and normalizes their variance to 1.
- This stabilizes training and enables faster convergence to effective weights.
- Layer normalization is applied both before and after the multi-head attention module within the transformer block, which we will implement later; it is also applied before the final output layer.
- ![image.png](attachment:image.png)
- Let's see how layer normalization works by passing a small input sample through a simple neural network layer:

In [9]:
torch.manual_seed(123)

# Create 2 training examples with 5 dimensions (features) each
batch_example = torch.randn(2,5)

layer = nn.Sequential(nn.Linear(5,6), nn.ReLU())
out   = layer(batch_example)
print(out)

tensor([[0.2260, 0.3470, 0.0000, 0.2216, 0.0000, 0.0000],
        [0.2133, 0.2394, 0.0000, 0.5198, 0.3297, 0.0000]],
       grad_fn=<ReluBackward0>)


In [10]:
layer

Sequential(
  (0): Linear(in_features=5, out_features=6, bias=True)
  (1): ReLU()
)

In [11]:
# Compute the mean and variance of the layer outputs for each of the 2 inputs above:
mean = out.mean(dim = -1, keepdim=True)
var  = out.var(dim=-1, keepdim=True)

print("mean:\n", mean)
print("variance:\n", var)

mean:
 tensor([[0.1324],
        [0.2170]], grad_fn=<MeanBackward1>)
variance:
 tensor([[0.0231],
        [0.0398]], grad_fn=<VarBackward0>)


- The normalization is applied to each of the two inputs (rows) independently; using dim=-1 applies the calculation across the last dimension (in this case, the feature dimension) instead of the row dimension.
- ![](attachment:image.png)
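The effect of `dim` and `keepdim` is easiest to see on a tensor small enough to check by hand. This sketch uses made-up values and is not part of the original notebook.

```python
import torch

# Two rows (inputs) with three features each; values chosen by hand.
x = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 6.0, 8.0]])

row_mean_keep = x.mean(dim=-1, keepdim=True)   # shape (2, 1): one mean per row
row_mean_flat = x.mean(dim=-1)                 # shape (2,): same values, dimension dropped
col_mean = x.mean(dim=0)                       # shape (3,): averages over the rows

print(row_mean_keep)
print(row_mean_flat.shape, col_mean.shape)

# keepdim=True keeps a trailing dimension of size 1, so the result
# broadcasts against x when the mean is subtracted from each row.
centered = x - row_mean_keep
print(centered)
```

Without `keepdim=True`, the subtraction would fail here, because a tensor of shape `(2,)` does not broadcast against one of shape `(2, 3)`.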

- Subtracting the mean and dividing by the square root of the variance (the standard deviation) centers the inputs to have a mean of 0 and a variance of 1 across the column (feature) dimension:

In [12]:
out_norm = (out - mean) / torch.sqrt(var)
print("Normalized layer outputs:\n", out_norm)

mean = out_norm.mean(dim = -1, keepdim = True)
var  = out_norm.var(dim=-1 , keepdim = True)

print("Mean:\n", mean)
print("variance:\n", var)

Normalized layer outputs:
 tensor([[ 0.6159,  1.4126, -0.8719,  0.5872, -0.8719, -0.8719],
        [-0.0189,  0.1121, -1.0876,  1.5173,  0.5647, -1.0876]],
       grad_fn=<DivBackward0>)
Mean:
 tensor([[9.9341e-09],
        [0.0000e+00]], grad_fn=<MeanBackward1>)
variance:
 tensor([[1.0000],
        [1.0000]], grad_fn=<VarBackward0>)
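The cells above call `var` with its default setting, which divides by n - 1 (Bessel's correction), whereas the `LayerNorm` class defined next passes `unbiased=False`, which divides by n. The sketch below, with hand-picked values that are not from the notebook, shows the difference on a single row.

```python
import torch

# One row with four hand-picked values; the mean is 5.0 and the
# squared deviations sum to 20.
x = torch.tensor([[2.0, 4.0, 6.0, 8.0]])

# Default: divide the summed squared deviations by n - 1.
var_unbiased = x.var(dim=-1, keepdim=True)
# unbiased=False: divide by n instead.
var_biased = x.var(dim=-1, keepdim=True, unbiased=False)

print(var_unbiased)  # 20 / 3
print(var_biased)    # 20 / 4
```

For an embedding dimension as large as 768, the gap between dividing by n and by n - 1 is negligible; on a row of only a few values it is clearly visible.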


In [13]:
class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5  # small constant that prevents division by zero
        self.scale = nn.Parameter(torch.ones(emb_dim))   # trainable scale
        self.shift = nn.Parameter(torch.zeros(emb_dim))  # trainable shift

    def forward(self, x):
        # Normalize over the last (feature) dimension
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift

In [14]:
ln = LayerNorm(emb_dim=5)
out_ln = ln(batch_example)

In [15]:
mean = out_ln.mean(dim = -1, keepdim=True)
var = out_ln.var(dim=-1, unbiased= False, keepdim=True)

print("Mean:\n", mean)
print("Variance:\n", var)

Mean:
 tensor([[-2.9802e-08],
        [ 0.0000e+00]], grad_fn=<MeanBackward1>)
Variance:
 tensor([[1.0000],
        [1.0000]], grad_fn=<VarBackward0>)


- ![image.png](attachment:image.png)
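As a sanity check, the custom class can be compared with PyTorch's built-in `torch.nn.LayerNorm`, which also uses the biased variance and a default `eps` of 1e-5. This comparison is an addition for illustration, not part of the original notebook; the class is repeated so that the block runs on its own.

```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    # Same logic as the LayerNorm class defined in the notebook above.
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift

torch.manual_seed(123)
batch_example = torch.randn(2, 5)

custom_ln = LayerNorm(emb_dim=5)
builtin_ln = nn.LayerNorm(5, eps=1e-5)  # PyTorch's built-in layer normalization

out_custom = custom_ln(batch_example)
out_builtin = builtin_ln(batch_example)

# Both layers start with scale 1 and shift 0, so their outputs should agree
# up to floating-point rounding.
print(torch.allclose(out_custom, out_builtin, atol=1e-5))
```

Agreement at initialization suggests the hand-written normalization follows the same formula; once training updates `scale` and `shift`, the two layers would only match if those parameters were copied across.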