# Let’s Reproduce GPT-2 (124M)

## Andrej Karpathy GPT Course - Neural Networks: Zero to Hero

### Lesson 10 - Let's Reproduce GPT-2 (124M): Section 2 - Optimization

+ YouTube video:
    + https://www.youtube.com/watch?v=l8pRSuU81PU

#####  Jun 9, 2024

##### Andrej's Video Comments
We reproduce the GPT-2 (124M) from scratch. This video covers the whole process: First we build the GPT-2 network, then we optimize its training to be really fast, then we set up the training run following the GPT-2 and GPT-3 paper and their hyperparameters, then we hit run, and come back the next morning to see our results, and enjoy some amusing model generations. Keep in mind that in some places this video builds on the knowledge from earlier videos in the Zero to Hero Playlist (see my channel). You could also see this video as building my nanoGPT repo, which by the end is about 90% similar.

+ Full course details: 
    + https://karpathy.ai/zero-to-hero.html
+ GitHub repository with all the changes in this video as individual commits (build-nanogpt)
    + https://github.com/karpathy/build-nanogpt
+ GitHub repository of GPT-2 based on this "Let's Reproduce GPT-2" tutorial by user Lxrd-AJ
    + https://github.com/Lxrd-AJ/GPT2
+ nanoGPT repository
    + https://github.com/karpathy/nanoGPT
+ LLM.c Repository
    + https://github.com/karpathy/llm.c

### YouTube video contents: Section - Optimization

The time in milliseconds in brackets at the end of each chapter shows the per-step time reached after adding that optimization. Many of the optimizations are as simple as adding a single line of code.

#### Chapters
+ 01:22:18 - SECTION 2: Let’s make it fast. GPUs, mixed precision, (1000ms)
+ 01:28:14 - Tensor Cores, timing the code, TF32 precision, (333ms)
+ 01:39:38 - float16, gradient scalers, bfloat16, (300ms)
+ 01:48:15 - torch.compile, Python overhead, kernel fusion, (130ms)
+ 02:00:18 - flash attention, (96ms)
+ 02:06:54 - nice/ugly numbers. vocab size 50257 → 50304, (93ms)

## Loading the Tiny Shakespeare dataset which we will use to train the model

We'll begin with a very small dataset which gets us off the ground and helps with debugging. Once we've ironed out the kinks with the small dataset we can load something bigger!
+ The tiny Shakespeare dataset is available from `https://github.com/karpathy/ng-video-lecture`

I already downloaded the dataset to use in the **Let's build GPT** tutorial. It can be imported like this:
```
import os
file_path = r"07 - Let's Build GPT (nanoGPT)\input.txt"
with open(file_path, 'r', encoding='utf-8') as f:
    text = f.read()
```
The tokenizer compression ratio is roughly 3 to 1, so the first 1,000 characters are roughly 300 tokens.

In [7]:
# Test loading data. We'll rearrange this code below

# Load Tiny Shakespeare Dataset
import os
file_path = r"07 - Let's Build GPT (nanoGPT)\input.txt"
with open(file_path, 'r', encoding='utf-8') as f:
    text = f.read()
print("\nTiny Shakespeare dataset loaded.\nLength of dataset in characters:", len(text), "\n")
data = text[:1000]                                               # First 1,000 characters (which is roughly 300 tokens)
print("First 100 characters in Tiny Shakespeare dataset:\n\n", data[:100], "\n")

# Import tokenizer
import tiktoken
enc = tiktoken.get_encoding('gpt2')
tokens = enc.encode(data)                                        # Encode the data we extracted from the text earlier
print("Sample tokens:\n", tokens[:24], "\n")                     # For example, token 198 is the newline character

# Build input and label tensors from input tokens to push through the model to begin training it.
import torch
buf = torch.tensor(tokens[:24 + 1])                              # Include extra token to have ground truth for final label
x = buf[:-1].view(4,6)                                           # .view gives 2D rearrangement of tokens
y = buf[1:].view(4,6)                                            # y loads all but first token (x loaded all but last)
print("Input tensor as four rows of six tokens:\n", x, "\n")
print("Labels tensor as four rows of six tokens (inputs shifted 1 to the right):\n", y, "\n")

# Encode starter phrase
tokens = enc.encode("Hello, I'm a language model,")              # Encode starter phrase as list of 8 integer tokens
tokens = torch.tensor(tokens, dtype=torch.long)                  # Form tensor from encoding tokens (8, )
tokens = tokens.unsqueeze(0).repeat(5, 1)                        # Replicate the starter encoder tokens several times (5, 8) 
print ("Encoded tokens:\n", tokens)


Tiny Shakespeare dataset loaded.
Length of dataset in characters: 1115394 

First 100 characters in Tiny Shakespeare dataset:

 First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You 

Sample tokens:
 [5962, 22307, 25, 198, 8421, 356, 5120, 597, 2252, 11, 3285, 502, 2740, 13, 198, 198, 3237, 25, 198, 5248, 461, 11, 2740, 13] 

Input tensor as four rows of six tokens:
 tensor([[ 5962, 22307,    25,   198,  8421,   356],
        [ 5120,   597,  2252,    11,  3285,   502],
        [ 2740,    13,   198,   198,  3237,    25],
        [  198,  5248,   461,    11,  2740,    13]]) 

Labels tensor as four rows of six tokens (inputs shifted 1 to the right):
 tensor([[22307,    25,   198,  8421,   356,  5120],
        [  597,  2252,    11,  3285,   502,  2740],
        [   13,   198,   198,  3237,    25,   198],
        [ 5248,   461,    11,  2740,    13,   198]]) 

Encoded tokens:
 tensor([[15496,    11,   314,  1101,   257,  3303,  2746,    11],
  

## Second Version: Optimize and train from randomized weights

Note that despite using the same Hugging Face model, carefully setting the hyperparameters to match, and setting the same random seeds, the hand-built GPT-2 does not match the output. Andrej tried to hunt down the discrepancy but couldn't find it! However, he was able to verify that the model internals (tensors) match by getting the same results when loading the Hugging Face model. Basically, something in the `transformers.pipeline` procedure differs from the hand-built GPT-2.

Now that we have verified the models are in the same place, the idea is to work through the GPT-2 code to improve it. We also want to actually retrain our own model from an initialization of random numbers. By default, PyTorch does what we want when the model is initialized by filling it with random values. In the first version we then imported the Hugging Face trained weights and loaded them over the top, but in this version we will train the model ourselves.

In [8]:
# VERSION TWO: OPTIMIZED GPT-2 TRAINED FROM RANDOMIZED WEIGHTS

import math
from dataclasses import dataclass
import torch
import torch.nn as nn
from torch.nn import functional as F
import inspect


class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embed % config.n_head == 0
        # key, query, value projections for all heads but batched
        self.c_attn = nn.Linear(config.n_embed, 3 * config.n_embed)          # Project q,k,v in one step
        # output projection of the attention layer
        self.c_proj = nn.Linear(config.n_embed, config.n_embed)
        self.c_proj.NANOGPT_SCALE_INIT = 1                                   # Enable flag for this module
        # Store head count and embedding size (no dropout/regularization is used in this version)
        self.n_head = config.n_head
        self.n_embed = config.n_embed
        # not really a 'bias' - more of a mask. The name is chosen to match OpenAI/HF naming
        # buffers are equivalent to States in MATLAB's custom layers
        # `bias` is a (1, 1, T, T) tensor, useful for broadcasting
        self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                             .view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size()     # Batch size, sequence length, embedding dimensionality (n_embed)
        # Calculate query, key & values for all heads in batch and move head forward to be the batch
        # nh is "number of heads", hs is "head size", and C is "number of channels" (nh * hs)
        # In GPT-2 (124M), n_head=12, hs=64 so nh*hs = 768 channels in the transformer
        qkv = self.c_attn(x)   # computes query, key & value in parallel. Resulting tensor is (B, T, 3*C) where C = n_embed
        q, k, v = qkv.split(self.n_embed, dim=2)         # Split the BxTx(3*C) matrix into (B,T,C) chunks where C = n_embed
        # Apply these multi-head operations efficiently in parallel batches
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, T, nh, hs) --> (B, nh, T, hs)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, T, nh, hs) --> (B, nh, T, hs)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, T, nh, hs) --> (B, nh, T, hs)
        # Attention (materialises the large (T,T) matrix for all the queries and keys)
        
        # Original Attention can be replaced by the much more efficient flash attention below:
        #att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1))) # (B, nh, T, hs) @ (B, nh, hs, T) = (B, nh, T, T)
        #att = att.masked_fill(self.bias[:,:, :T, :T] == 0, float('-inf'))
        #att = F.softmax(att, dim=-1)
        #y = att @ v  # (B, nh, T, T) x (B, nh, T, hs) --> (B, nh, T, hs)
            
        # Alternative faster Flash Attention:
        # This is not optimized for ANE as of 2024; WWDC just happened so it's all beta. An alternative is to rewrite
        # the attention following Apple's guide at:
        #   https://github.com/apple/ml-ane-transformers/blob/main/ane_transformers/reference/multihead_attention.py
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True) 

        y = y.transpose(1, 2).contiguous().view(B, T, C) # (B, nh, T, hs) -> (B, T, nh, hs) -> (B, T, C) where C = nh * hs
        # output projection
        y = self.c_proj(y)
        return y
    
    
# Multi-Layer Perceptron (MLP) position-wise fully-connected feedforward network with two linear projections
# This is called from within the Block function where it is the final step of each transformer block.
# The GELU activation function is similar to ReLU but with a smoother turn around 0. Historically, the tanh approximation
# of GELU was used because it was faster than exact GELU, but Andrej suspects the speed difference is no longer so large.
# Either way, sticking to tanh-approximated GELU ensures the model follows the original GPT-2 implementation. BERT uses GELU.
class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embed, 4 * config.n_embed)    # Linear layer 1
        self.gelu = nn.GELU(approximate='tanh')                      # GELU Activation function with tanh approximation
        self.c_proj = nn.Linear(4 * config.n_embed, config.n_embed)  # Linear layer 2
        self.c_proj.NANOGPT_SCALE_INIT = 1                           # Attach flag
    
    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        return x

    
# Transformer Blocks combining Self Attention and MLP feedforward layers, held in nn.ModuleList(). Each block is h.0, h.1, etc
# Note in the initialization that the Layer Norms are placed before the Self Attention and MLP layers rather than after
class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embed)
        self.attn = CausalSelfAttention(config)     # Call attention operation
        self.ln_2 = nn.LayerNorm(config.n_embed)
        self.mlp = MLP(config)                      # Call MLP operation

    # Forward pass of what the block computes
    def forward(self, x):
        x = x + self.attn(self.ln_1(x))   # Attention applied after Layer Norm with clean residual pathway for adding back x
        x = x + self.mlp(self.ln_2(x))    # Position-wise feedforward applied after Layer Norm with clean residual pathway for x
        return x

    
@dataclass                          # Decorator auto-generates __init__, __repr__, etc. from the fields below
class GPTConfig:
    block_size: int = 1024          # GPT2LMHeadModel uses Max sequence length of 1024, `T`
    vocab_size: int = 50257         # GPT2LMHeadModel uses 50257 tokens: 50,000 BPE merges + 256 byte tokens + 1 end of text
    n_layer: int = 12               # GPT2LMHeadModel has 12 hidden layers
    n_head: int = 12                # GPT2LMHeadModel has 12 heads
    n_embed: int = 768              # GPT2LMHeadModel has 768-length embedding

# Transformer architecture following the GPT-2 modification of the Attention is All you Need architecture
# Use schema and naming used in Hugging Face's GPT2LMHeadModel so we can reuse weights from state_dict()
class GPT(nn.Module):
    
    def __init__(self, config):
        super().__init__()
        self.config = config

        # Match "transformer" naming used by GPT2LMHeadModel's container which holds the NN modules so we can easily reuse weights
        # The ModuleDict dictionary allows you to index into the transformer modules using keys
        # The transformer blocks are created as a list 'h' (hidden layer) so each block can be easily indexed with an integer
        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embed),             # Token embedding weights
            wpe = nn.Embedding(config.block_size, config.n_embed),             # Positional embedding weights
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]), # Call transformer blocks ('n_layer' times)
            ln_f = nn.LayerNorm(config.n_embed)                                # Final layer norm
        ))
        # Language Model Head: linear module projects from 768 embeddings to 50,000+ possible tokens used in vocabulary
        self.lm_head = nn.Linear(config.n_embed, config.vocab_size, bias=False) # Final classifier uses no bias in GPT-2   
    
        # Weight sharing/tying
        self.transformer.wte.weight = self.lm_head.weight
        print ("Weight tying\n")
        
        # Initialize params - `apply` is called for every sub module which runs _init_weights to set std=0.02
        self.apply(self._init_weights)
        print ("Initialize weights with standard deviation 0.02\n")
            

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):     
            std = 0.02                                                       # Default Standard Deviation used by OpenAI
            if hasattr(module, 'NANOGPT_SCALE_INIT'):                        # If flag is set ...
                std *= (2 * self.config.n_layer) ** -0.5                     #   scale down by 1/sqrt(2 * n_layer) for residual projections
            torch.nn.init.normal_(module.weight, mean=0.0, std=std)          # Set standard deviation (OpenAI use 0.02)
            if module.bias is not None:                                      # Bias weights initialized to zero
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)         # We'll use 0.02 even though OpenAI use 0.01 for position embeddings

    # Forward step may include optional targets for the model's training phase
    def forward(self, idx, targets=None):                                    # idx has shape (B, T). Targets are optional
        B, T = idx.size()
        assert T <= self.config.block_size, f"Cannot forward sequence of length {T}, block size is {self.config.block_size}"

        # forward the token and position embeddings
        pos = torch.arange(0, T, dtype=torch.long, device=idx.device)        # Shape (T) - Train on GPU if available
        pos_emb = self.transformer.wpe(pos)                                  # position embeddings of shape (T, n_embed)
        tok_emb = self.transformer.wte(idx)                                  # token embeddings of shape (B, T, n_embed)
        x = tok_emb + pos_emb                                                # Add position & token embeddings with broadcasting

        # forward through the transformer blocks
        for block in self.transformer.h:
            x = block(x)

        # forward through the final LayerNorm and classifier
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x)                                             # (B, T, vocab_size)

        loss = None                                                          # Default (i.e if we don't load targets)
        
        # Cross Entropy does not take the 3D (B, T, vocab_size) tensor as input so it must be flattened to 2D
        # The -1 as the first argument to .view automatically flattens B*T into one dimension, leaving us with 2D data overall
        # The operation also flattens the targets from BxT to a flat 1D tensor.
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))   # Cross entropy loss

        return logits, loss                                                  # Return logits and loss
        

    @classmethod
    def from_pretrained(cls, model_type):
        # Load pretrained GPT-2 model weights from hugging Face
        assert model_type in {'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'}
        from transformers import GPT2LMHeadModel
        print(f"Loading weights from pretrained gpt: {model_type}")

        # n_layer, n_head and n_embed hyperparameters are determined from model_type
        config_args = {
            'gpt2':         dict(n_layer=12, n_head=12, n_embed=768),   # 124M params
            'gpt2-medium':  dict(n_layer=24, n_head=16, n_embed=1024),  # 350M params
            'gpt2-large':   dict(n_layer=36, n_head=20, n_embed=1280),  # 774M params
            'gpt2-xl':      dict(n_layer=48, n_head=25, n_embed=1600),  # 1558M params
        }[model_type]
        config_args['vocab_size'] = 50257  # Always 50257 for GPT model checkpoints
        config_args['block_size'] = 1024   # Always 1024 for GPT model checkpoints
        
        # Create from scratch initialized minGPT model
        config = GPTConfig(**config_args)
        model = GPT(config)
        sd = model.state_dict()             # Get the state_dict of our model (the Hugging Face one is loaded below)
        sd_keys = sd.keys()
        sd_keys = [k for k in sd_keys if not k.endswith('.attn.bias')]  # discard unnecessary buffers (auto regressive masks)
        
        # init a hugging Face transformer model
        model_hf = GPT2LMHeadModel.from_pretrained(model_type)
        sd_hf = model_hf.state_dict()

        # Copy Hugging Face model keys while ensuring all of the parameters are aligned and match in names and shapes
        sd_keys_hf = sd_hf.keys()
        sd_keys_hf = [k for k in sd_keys_hf if not k.endswith('.attn.masked_bias')]  # ignore these
        sd_keys_hf = [k for k in sd_keys_hf if not k.endswith('.attn.bias')]         # ignore these

        # The Hugging Face model is from a TensorFlow repository where some weights are transposed so that needs fixing.
        # In other words, the OpenAI implementation uses a conv1D rather than a vanilla linear layer which requires a fix.
        # Manually make a list of weights that require transposing
        transposed = ['attn.c_attn.weight', 'attn.c_proj.weight', 'mlp.c_fc.weight', 'mlp.c_proj.weight']
        
        # As stated above, the OpenAI checkpoints use a conv1D module, but we only want to use a vanilla linear layer
        # This means that we have to transpose weights when we import them if they are in the manually built list above
        assert len(sd_keys_hf) == len(sd_keys), f"mismatched keys: {len(sd_keys_hf)} != {len(sd_keys)}"
        for k in sd_keys_hf:
            if any(k.endswith(w) for w in transposed):
                # special treatment for the conv1D weights we need to transpose
                assert sd_hf[k].shape[::-1] == sd[k].shape
                with torch.no_grad():
                    sd[k].copy_(sd_hf[k].t())
            else:
                # vanilla copy as-is the other parameters
                assert sd_hf[k].shape == sd[k].shape
                with torch.no_grad():
                    sd[k].copy_(sd_hf[k])
        return model


    # FINAL CODE NOT ADDED OR CHECKED YET - Training
    def configure_optimizers(self, weight_decay, learning_rate, device):
        # start with all candidate parameters
        param_dict = {pn: p for pn, p in self.named_parameters()}
        param_dict = {pn: p for pn, p in param_dict.items() if p.requires_grad}

        # Create optim groups. Any parameter that is 2D will be weight decayed, otherwise no
        # i.e. all weight tensors in matmuls + embeddings decay, all biases and layernorms don't
        decay_params = [p for n, p in param_dict.items() if p.dim() >= 2]
        nodecay_params = [p for n, p in param_dict.items() if p.dim() < 2]
        optim_groups = [
            {'params': decay_params, 'weight_decay': weight_decay},
            {'params': nodecay_params, 'weight_decay': 0.0}
        ]
        num_decay_params = sum(p.numel() for p in decay_params)
        num_nodecay_params = sum(p.numel() for p in nodecay_params)
        print(f"num decayed parameter tensors: {len(decay_params)}, with {num_decay_params:,} parameters")
        print(f"num non-decayed parameter tensors: {len(nodecay_params)}, with {num_nodecay_params:,} parameters")

        # Create AdamW optimizer and use the fused version if it is available
        fused_available = 'fused' in inspect.signature(torch.optim.AdamW).parameters
        use_fused = fused_available and device == 'cuda'
        print(f"Using fused AdamW: {use_fused}, inspect check: {fused_available}")
        optimizer = torch.optim.AdamW(optim_groups, lr=learning_rate, betas=(0.9, 0.95), eps=1e-8, fused=use_fused)
        return optimizer
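
The `NANOGPT_SCALE_INIT` flag in `_init_weights` shrinks the standard deviation of the flagged projections by `(2 * n_layer) ** -0.5`. The following is a minimal sketch of the motivation, using only the standard library and assuming each residual contribution is an independent unit-variance sample (a simplification, not the model's real activations). The helper name `residual_std` is hypothetical.

```python
import random
import statistics

random.seed(1337)
n_layer = 12
n_adds = 2 * n_layer          # each block adds to the residual stream twice (attention + MLP)
n_samples = 5000

def residual_std(scale):
    """Std of a toy residual stream after n_adds additions, each scaled by `scale`."""
    finals = []
    for _ in range(n_samples):
        x = 0.0
        for _ in range(n_adds):
            x += scale * random.gauss(0.0, 1.0)
        finals.append(x)
    return statistics.pstdev(finals)

unscaled_std = residual_std(1.0)             # grows roughly like sqrt(n_adds)
scaled_std = residual_std(n_adds ** -0.5)    # stays close to 1.0
scaled_init_std = 0.02 * n_adds ** -0.5      # std applied to the flagged c_proj layers

print(f"unscaled: {unscaled_std:.2f}, scaled: {scaled_std:.2f}, init std: {scaled_init_std:.5f}")
```

Without the scaling, the spread of the toy residual stream grows with depth. With it, the spread stays roughly constant, which is the reason for the smaller initialization on the projections that feed the residual pathway.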

#### Call the optimized model to test it runs

In [9]:
# Call the optimized model to test it runs
model = GPT.from_pretrained('gpt2')
print ("didn't crash yay!")
print(GPT)

Loading weights from pretrained gpt: gpt2
Weight tying

Initialize weights with standard deviation 0.02

didn't crash yay!
<class '__main__.GPT'>


## Training our GPT-2 model

We'll use the Tiny Shakespeare dataset to train the GPT-2 model 

The first training run just uses the very small training sample of B=4, T=32 and a learning rate of 3e-4 which is ideal for early stages of model training with small samples. The small data set gets reloaded at every training step. This reduces the loss very rapidly with an overfitted set of weights but it demonstrates that the model is learning.

```
Using device: cuda
Initial randomized loss should be about -ln(1/50,257) = 10.825 if all 50,257 tokens are equally likely:
 tensor(11.1613, device='cuda:0', grad_fn=<NllLossBackward0>)
step 0, loss: 11.161322593688965
step 1, loss: 6.7335896492004395
step 2, loss: 4.3205389976501465
   :         :           :
step 48, loss: 0.0030398822855204344
step 49, loss: 0.0029682039748877287
```
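
The expected starting loss quoted in the output above can be checked with a few lines of arithmetic. This sketch assumes a perfectly uniform prediction over the vocabulary, which a freshly initialized model only approximates:

```python
import math

vocab_size = 50257

# Cross entropy of a uniform distribution over the vocabulary:
# every token has probability 1/vocab_size, so the loss is -ln(1/vocab_size).
expected_initial_loss = -math.log(1.0 / vocab_size)

print(f"Expected initial loss: {expected_initial_loss:.3f}")
```

A first-step loss in this neighbourhood suggests the initialization is not strongly favouring any particular tokens.
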
The next stage is to iterate over different batches, which requires a dataloader to ensure we always get a fresh batch. This ensures we optimize over a reasonable data set.
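
The batching idea can be sketched without PyTorch. The function below is a hypothetical, list-based illustration (not the `DataLoaderLite` class itself): it walks through a token list in chunks of `B * T`, grabs one extra token so the last input has a target, and wraps around when the next chunk would run past the end.

```python
def make_batches(tokens, B, T, steps):
    """Return `steps` (inputs, targets) pairs, each a B x T nested list."""
    pos = 0
    batches = []
    for _ in range(steps):
        buf = tokens[pos : pos + B * T + 1]                      # +1 token for the final target
        x = [buf[i * T : (i + 1) * T] for i in range(B)]         # inputs: all but the last token
        y = [buf[1:][i * T : (i + 1) * T] for i in range(B)]     # targets: all but the first token
        batches.append((x, y))
        pos += B * T                                             # advance by one full batch
        if pos + B * T + 1 > len(tokens):                        # next batch would overrun the data
            pos = 0                                              # so loop back to the start
    return batches

tokens = list(range(20))                                         # stand-in for real token IDs
batches = make_batches(tokens, B=2, T=3, steps=4)
for x, y in batches:
    print(x, y)
```

With 20 toy tokens and `B * T = 6`, three batches fit before the position resets, so the fourth batch repeats the first.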

#### Fixing a Hugging Face model shared weights "bug"
The weights from Hugging Face come in with the `lm_head.weight` and `transformer.wte.weight` having the same 2D shape:
```
print(sd_hf["lm_head.weight"].shape)
print(sd_hf["transformer.wte.weight"].shape)

torch.Size([50257, 768])
torch.Size([50257, 768])
```
+ `transformer.wte.weight` is the token embedding at the bottom of the transformer decoder (listed as **Output Embedding** in the **Attention is All you Need** diagram). 
+ `lm_head.weight` is the language model head classifier layer at the top of the transformer (listed as **Linear Layer** at the top of the decoder in the **Attention is All you Need** diagram).

Not only do these input and output tensors have the same shape, they have element-wise equality:
```
(sd_hf["lm_head.weight"] == sd_hf["transformer.wte.weight"]).all()

tensor(True)
```
And going one step further, we can also see they have the same PyTorch data pointer. Sharing a single tensor like this is referred to as a **weight-tying scheme**:
```
print(sd_hf["lm_head.weight"].data_ptr())
print(sd_hf["transformer.wte.weight"].data_ptr())

1406505003360576
1406505003360576

```
Having the same PyTorch data pointer means they are pointing to an identical tensor. Two embeddings that are measurably close in the token embedding feeding in to the transformer should also have similar probabilities in the classifier layer at the top of the Transformer.

Employing a **weight-tying scheme** is by design: the **Attention is All you Need** paper shares one weight matrix between the embedding layers and the pre-softmax linear layer, as tying these weights together helps optimization.

You can see this in action in OpenAI's released GPT-2 TensorFlow code (**gpt-2/src/model.py**) in the definition of the **model** function. When forwarding the model, the **wte** matrix, which encodes the token embeddings, is used when feeding data into the transformer as you would expect, but it is also used when calculating the logits at the tail end of the transformer. As a result, **wte** receives gradient contributions from both ends of the Transformer during backpropagation.

All of that is a long way of saying we weren't doing this weight sharing, so we fixed this "bug" by adding the functionality! This was accomplished by adding this line to the GPT class, which makes `self.transformer.wte.weight` point at the same parameter tensor as `self.lm_head.weight`:
```
# Weight sharing/tying
self.transformer.wte.weight = self.lm_head.weight
```
Weight tying results in the original `wte.weight` tensor being orphaned, but PyTorch handles the cleanup under the hood.
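
A minimal sketch of the same tying trick on a toy module, assuming PyTorch is installed. The class name `TinyTiedModel` and its sizes are hypothetical and chosen only to keep the example small:

```python
import torch.nn as nn

class TinyTiedModel(nn.Module):
    def __init__(self, vocab_size=10, n_embed=4):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, n_embed)               # weight shape (vocab_size, n_embed)
        self.lm_head = nn.Linear(n_embed, vocab_size, bias=False)  # weight shape (vocab_size, n_embed)
        self.wte.weight = self.lm_head.weight                      # tie: both names now refer to one Parameter

model = TinyTiedModel()

# Both attributes point at the same underlying storage
shares_storage = model.wte.weight.data_ptr() == model.lm_head.weight.data_ptr()

# parameters() yields the shared tensor once, so it is counted only once
n_params = sum(p.numel() for p in model.parameters())

print(shares_storage, n_params)
```

The tie works because `nn.Embedding` and `nn.Linear` happen to store their weights with the same `(vocab_size, n_embed)` shape, so no transpose is needed.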

This optimization saves **768*50257** = **38,597,376** parameters which is roughly a third of all of the parameters in the 124M model.
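
The saving can be put in context by counting the parameters by hand. This sketch assumes the GPT-2 (124M) shapes from `GPTConfig` (12 layers, 768-wide embeddings, 1,024 positions, 50,257 tokens) and that every linear layer except `lm_head` has a bias:

```python
n_layer, n_embed, vocab_size, block_size = 12, 768, 50257, 1024

wte = vocab_size * n_embed                       # token embedding (shared with lm_head)
wpe = block_size * n_embed                       # position embedding
layer_norm = 2 * n_embed                         # weight + bias

# Attention: c_attn (n_embed -> 3*n_embed) and c_proj (n_embed -> n_embed), each with bias
attn = (n_embed * 3 * n_embed + 3 * n_embed) + (n_embed * n_embed + n_embed)
# MLP: c_fc (n_embed -> 4*n_embed) and c_proj (4*n_embed -> n_embed), each with bias
mlp = (n_embed * 4 * n_embed + 4 * n_embed) + (4 * n_embed * n_embed + n_embed)
per_block = 2 * layer_norm + attn + mlp

total_tied = wte + wpe + n_layer * per_block + layer_norm   # final layer_norm is ln_f
saved = wte                                                 # what an untied lm_head would add
share = saved / total_tied

print(f"total: {total_tied:,}  saved: {saved:,}  share: {share:.1%}")
```

The shared matrix works out to a little under a third of the tied total, consistent with the estimate above.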

#### Model initialization weights

How the model was initialized isn't explicitly stated anywhere. However, by again looking at the OpenAI source code for GPT-2 (**gpt-2/src/model.py**) we can see in the definition of the **conv1d** function that the weights are initialized with a standard deviation of 0.02:
```
def conv1d(x, scope, nf, *, w_init_stdev=0.02):
```
Later in the **model** function, positional embeddings are initialized with a standard deviation of 0.01 and token embeddings with a standard deviation of 0.02:
```
wpe = tf.get_variable('wpe', [hparams.n_ctx, hparams.n_embd], initializer=tf.random_normal_initializer(stddev=0.01))
wte = tf.get_variable('wte', [hparams.n_vocab, hparams.n_embd], initializer=tf.random_normal_initializer(stddev=0.02))
```
We can set up something similar. Typically you would want a value that shrinks as the model gets larger, using something like Xavier initialization. However, we are hard coding because that is what OpenAI did! It turns out that a Xavier-style standard deviation of one over the square root of the number of features is in the same ballpark as 0.02 anyway: roughly 0.036 for 768 features and 0.025 for 1,600.
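
As a quick check of that comparison, the sketch below evaluates one over the square root of the embedding width for each GPT-2 size. The widths are the `n_embed` values listed in `from_pretrained`; the dictionary names are hypothetical.

```python
import math

d_models = {"gpt2": 768, "gpt2-medium": 1024, "gpt2-large": 1280, "gpt2-xl": 1600}

# Xavier-style standard deviation: 1 / sqrt(number of input features)
xavier_like_std = {name: 1 / math.sqrt(d) for name, d in d_models.items()}

for name, std in xavier_like_std.items():
    print(f"{name:12s} d_model={d_models[name]:5d}  1/sqrt(d_model)={std:.4f}")
```

All four values land within a factor of two of the hard-coded 0.02, and the gap narrows as the model gets wider.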

#### Monitoring training with std and inspecting torch.no_grad()

You can monitor the standard deviation of the output logits to check that the training is stable.
```
std = logits.std()
```
When the loss is falling steadily, the model is learning. When the standard deviation of the logits is rising slightly, it suggests that the model's outputs are becoming more confident (moving away from uniform probability across the vocabulary). In the early steps the model is mostly guessing, so the logits have a low spread. As it learns, the distribution sharpens slightly (but shouldn’t explode), hence the small increase in std. As a rough rule of thumb, if std starts to rise past 2 or 3 there is a risk of exploding gradients, and if it drops below 0.1 there is a risk of vanishing gradients. Sudden jumps in standard deviation or loss point to potential numerical instability or bad gradient spikes.
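
The link between logit spread and confidence can be illustrated with plain Python. This sketch uses a hypothetical four-token vocabulary and simply rescales one set of logits, which is an assumption made for illustration rather than how logits evolve during real training:

```python
import math
import statistics

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

base = [-1.0, 0.0, 0.5, 2.0]          # toy logits for a 4-token vocabulary
results = []
for scale in (0.1, 1.0, 3.0):
    logits = [scale * v for v in base]
    std = statistics.pstdev(logits)   # spread of the logits
    top_prob = max(softmax(logits))   # probability of the most likely token
    results.append((std, top_prob))
    print(f"scale={scale:3.1f}  std={std:.3f}  top probability={top_prob:.3f}")
```

At the smallest scale the probabilities are close to uniform (0.25 each). As the spread grows, more of the probability mass concentrates on a single token.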

Wrapping the inspection code in **torch.no_grad()** is a great way to inspect activations or logits without interfering with gradients. Andrej prefers to look at activations and logits, not raw token IDs, when monitoring progress.

Other useful diagnostics include:
+ logits.mean().item()
+ logits.max().item()
+ logits.min().item()
+ grad_norm = model.lm_head.weight.grad.norm().item() - after running `.backward()`

These are great for building an intuitive feel for how models behave.
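
The diagnostics listed above can be gathered in one place. This is a minimal sketch assuming PyTorch is installed; it uses a single toy linear layer as a hypothetical stand-in for `model.lm_head`, with random inputs and targets rather than real data:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, n_embed, B, T = 50, 8, 2, 4                  # toy sizes, not GPT-2's
lm_head = nn.Linear(n_embed, vocab_size, bias=False)     # hypothetical stand-in for model.lm_head
x = torch.randn(B, T, n_embed)                           # stand-in for the final hidden states
targets = torch.randint(0, vocab_size, (B, T))

logits = lm_head(x)                                      # (B, T, vocab_size)
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
loss.backward()                                          # populates lm_head.weight.grad

with torch.no_grad():                                    # inspection only, no graph is built here
    stats = {
        "loss": loss.item(),
        "logits_std": logits.std().item(),
        "logits_mean": logits.mean().item(),
        "logits_max": logits.max().item(),
        "logits_min": logits.min().item(),
        "grad_norm": lm_head.weight.grad.norm().item(),
    }

for name, value in stats.items():
    print(f"{name:12s} {value:.4f}")
```

Printing a dictionary like this every few steps makes it easier to spot a sudden jump in any one statistic.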

In [22]:
import os
import gc
import sys
import time
import torch
import tiktoken
import torch._dynamo as dynamo

# Build a dedicated DataLoader class that supports stepping through the entire encoded data set
class DataLoaderLite:
    def __init__(self, B, T):
        self.B = B
        self.T = T

        file_path = r"07 - Let's Build GPT (nanoGPT)\input.txt"
        with open(file_path, 'r', encoding='utf-8') as f:                # Load full text from disk
            text = f.read()
        
        # Just keep these small data samples on the CPU to free up space on GPU
        enc = tiktoken.get_encoding('gpt2')                              # Store entire encoding in memory at initialization
        tokens = enc.encode(text)
        self.tokens = torch.tensor(tokens)
        print(f"Loaded {len(self.tokens)} tokens.\n")
        print(f"1 epoch = {len(self.tokens) // (B * T)} batches.\n")     # How many batches are needed for each epoch?

        self.current_position = 0                                        # Start at position 0 in encoding

    # Advance through encoding in groups of B*T but making sure to always grab the extra token needed for final target
    def next_batch(self):
        B, T = self.B, self.T

        buf = self.tokens[self.current_position:self.current_position+(B*T)+1]
        x = (buf[:-1]).view(B, T)                                        # Inputs
        y = (buf[1:]).view(B, T)                                         # Targets

        self.current_position += B * T                                   # Advance the position in the tensor 

        if self.current_position + (B*T)+1 > len(self.tokens):           # If we reach the end of the encoding ...
            self.current_position = 0                                    # Loop back round and start again
        return x, y
    
# Autodetect available devices to automatically run on the device with highest capability.
# Ideally run on a GPU with CUDA support if one is available
device = "cpu"                                            # By default select cpu which is always available
if torch.cuda.is_available():
    device = "cuda"                                       # Switch to best option cuda if it exists
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    device = "mps"                                        # Try mps if cuda not available (mps is on decent Apple macbooks)
print(f"Using device: {device}", "\n")
#device = "cpu"                                           # OVERRIDE to run on CPU by uncommenting

# Hyperparameters
num_return_sequences = 5
max_length = 30

# Set seeds for reproducibility
torch.manual_seed(1337)
if torch.cuda.is_available():
    torch.cuda.manual_seed(1337)

# Original dataloader for Tiny Shakespeare Dataset and encoder -- REPLACED BY DEDICATED DATALOADER ABOVE
#enc = tiktoken.get_encoding('gpt2')
#file_path = r"07 - Let's Build GPT (nanoGPT)\input.txt"
#with open(file_path, 'r', encoding='utf-8') as f:
#    text = f.read()
#text = text[:1000]                                              # First 1,000 characters (which is roughly 300 tokens)
#tokens = enc.encode(data)                                       # Encode the data we extracted from the text earlier
#                                                                
#B, T = 4, 32                                                    # Batch and Time gives size of input and table tensors
#buf = torch.tensor(tokens[:B*T + 1])                            # Include extra token to have ground truth for final label
#buf = buf.to(device)                                            # Move buf to GPU if available. Note this requires "buf ="
#x = buf[:-1].view(B, T)                                         # .view gives 2D rearrangement of tokens
#y = buf[1:].view(B, T)                                          # y loads all but first token (x loaded all but last)

# Clear cache if starting afresh
gc.collect()
torch.cuda.empty_cache()

# Call dedicated dataloader
B = 16                                                           # 16  Batch size of 24 gives out of memory error on RTX3090
T = 768                                                          # 768 T=1024 runs but is relatively slow
print(f"Batch Size: {B}, Time: {T}.")
train_loader = DataLoaderLite(B=B, T=T)
    
# Use TensorFloat32 (TF32) for float32 matrix multiplications if it's available, trading a little precision for speed
torch.set_float32_matmul_precision('high')                       # 'highest' is default for float32, 'high' allows TF32

# Create model
model = GPT(GPTConfig(vocab_size=50304))                         # Pad vocab size from 50257 to 50304, a multiple of 128
model.to(device)

# torch.compile is a just-in-time compiler aimed mainly at Linux, added here for completeness. It must be run with the "eager"
# backend to have any chance of running on Windows. Even then, it may not speed things up much and may actually slow things
# down. If it doesn't help the execution times much on Windows, disable it!
# backend="aot_eager" is the same as "eager" but routes through AOTAutograd, which is useful for debugging trace errors.
# torch.compile may require functorch. Recent PyTorch releases bundle it; install it separately only if it is missing.
#print("Compiling with torch.compile...")
#model = torch.compile(model, backend="aot_eager")               # Compile neural network model for faster execution (Linux)
#print("Complete\n")

# Optimize! - Train the model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)       # AdamW optimizer with 2 moment buffers (Momentum, RMSProp)
tstart = time.time()                                             # Record start time of training loop
for i in range(50):                                              # 50 training steps
    t0 = time.time()                                             # Set timer start
    x, y = train_loader.next_batch()                             # Walk through the entire encoding by grabbing next batch
    x, y = x.to(device), y.to(device)                            # Place tensors on GPU   
    optimizer.zero_grad()                                        # Always remember to start with zero gradients!
    with torch.autocast(device_type=device, dtype=torch.bfloat16): # Invoke bfloat16 autocasting to optimize data flow ...
        logits, loss = model(x, y)                               # needs an Ampere or newer GPU (which includes RTX3090)
    loss.backward()                                              # Accumulate gradients from loss
    optimizer.step()                                             # Update parameters to decrease the loss
    std = logits.std()                                           # Monitor standard deviation of logits to check stability
    if (i==0):
        print("Initial randomized loss should be about -ln(1/50,257) = 10.825 if all 50,257 tokens are equally likely:")
        print(loss, "\n")
    if device == "cuda":                                         # synchronize() raises an error when CUDA is not in use
        torch.cuda.synchronize()                                 # Complete any work in queue so time() is more accurate
    t1 = time.time()                                             # Set timer end
    dt = (t1 - t0)*1000                                          # Time difference in milliseconds
    tokens_per_sec = (train_loader.B * train_loader.T) / (t1-t0)
    # Print stats for each training step. Item converts tensor to float and moves from GPU to CPU
    print(f"step {i}, loss: {loss.item():.3f},  dt: {dt:.2f}ms,  tok/s {tokens_per_sec:.2f},  std: {std:.2f}")          

tend = time.time()                                                # Set timer for end of training loop
print(f"\nTotal training time: {tend-tstart:.2f}s")
print("\nCheck data types for precision (bfloat16 is optimal)")
print("Logits data type:", logits.dtype)
print("Transformer wte matrix data type:", model.transformer.wte.weight.dtype)

sys.exit(0)                                                      # Early exit

# Generate. Right now x is (B, T) where B=5 and T=8.
#torch.manual_seed(42)                                            # Set seed to 42
#torch.cuda.manual_seed(42)                                       # Match seed on GPU
while x.size(1) < max_length:
    # Forward the model to get the logits
    with torch.no_grad():                                        # no_grad means don't track gradients so less processing
        logits, _ = model(x)                                     # (B, T, vocab_size) - model returns (logits, loss)
        logits = logits[:, -1, :]                                # (B, vocab_size) - logits at the last position
        probs = F.softmax(logits, dim=-1)                        # Get the probabilities
        topk_probs, topk_indices = torch.topk(probs, 50, dim=-1) # top-k sampling of 50 - topk_probs/indices become (5, 50)
        ix = torch.multinomial(topk_probs,1)                     # (B, 1) - Select a token from the top-k probabilities
        xcol = torch.gather(topk_indices, -1, ix)                # (B, 1) - Gather corresponding indices
        x = torch.cat((x, xcol), dim=1)                          # Append to the sequence
    
# Print the generated text
for i in range(num_return_sequences):
    tokens = x[i, :max_length].tolist()                          # Get the tokens from all of the 5 samples 
    decoded = enc.decode(tokens)                                 # Use tiktoken's decoder to convert tokens back to characters
    print(">", decoded)

Using device: cuda 

Batch Size: 16, Time: 768.
Loaded 338025 tokens.

1 epoch = 27 batches.

Weight tying

Initialize weights with standard deviation 0.02

Initial randomized loss should be about -ln(1/50,257) = 10.825 if all 50,257 tokens are equally likely:
tensor(10.9487, device='cuda:0', grad_fn=<NllLossBackward0>) 

step 0, loss: 10.949,  dt: 503.94ms,  tok/s 24384.05,  std: 0.55
step 1, loss: 9.408,  dt: 231.98ms,  tok/s 52970.83,  std: 0.55
step 2, loss: 8.829,  dt: 229.82ms,  tok/s 53469.09,  std: 0.56
step 3, loss: 8.719,  dt: 225.74ms,  tok/s 54434.72,  std: 0.56
step 4, loss: 8.602,  dt: 221.76ms,  tok/s 55410.65,  std: 0.57
step 5, loss: 8.286,  dt: 225.97ms,  tok/s 54378.78,  std: 0.59
step 6, loss: 8.227,  dt: 219.59ms,  tok/s 55960.06,  std: 0.60
step 7, loss: 8.055,  dt: 223.09ms,  tok/s 55081.57,  std: 0.62
step 8, loss: 7.922,  dt: 221.15ms,  tok/s 55564.18,  std: 0.66
step 9, loss: 7.742,  dt: 223.86ms,  tok/s 54891.27,  std: 0.69
step 10, loss: 7.441,  dt: 221.45ms

SystemExit: 0

## Optimization tricks

#### Making best use of your GPU

Running `C:\Windows\System32\nvidia-smi.exe` in PowerShell gives you lots of useful information about your GPU, which you can use to configure how you run large neural net models efficiently. When Andrej ran the command it showed he had eight A100 Tensor Core GPUs, each with 80GB of memory. That's over $100,000 worth of advanced graphics cards! I just have a single GPU, but it is an RTX3090 so that's not so bad! Andrej's configuration comes from hiring a powerful machine at Lambda Labs, which I could do for the final training step if needed.
```
PS C:\Users\johns> C:\Windows\System32\nvidia-smi.exe
Sun Aug  3 16:55:58 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 577.00                 Driver Version: 577.00         CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090      WDDM  |   00000000:0A:00.0  On |                  N/A |
|  0%   54C    P8             37W /  420W |    1084MiB /  24576MiB |      6%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
```
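The same details can also be read from inside Python. This is a minimal sketch, assuming PyTorch is installed; `describe_gpu` is a hypothetical helper name, and it returns `None` on a machine without CUDA.

```python
import torch

def describe_gpu(index=0):
    """Summarize one CUDA device, or return None when CUDA is unavailable."""
    if not torch.cuda.is_available():
        return None
    props = torch.cuda.get_device_properties(index)
    return {
        "name": props.name,
        "memory_gib": props.total_memory / 1024**3,
        "compute_capability": (props.major, props.minor),   # 8.x is Ampere (A100, RTX3090)
        "bf16_supported": torch.cuda.is_bf16_supported(),
    }

info = describe_gpu()
print(info if info is not None else "No CUDA device found")
```

The compute capability is the useful field here: a major version of 8 or higher indicates Tensor Cores with TF32 and bfloat16 support.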
#### Using bfloat16 or tensorfloat32 where possible
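By default, tensors are usually created as float32, which is often much more precision than is needed. You can make large networks much more efficient by reducing the number of bits in the tensors, which lowers the precision of the model at very little cost in accuracy overall. Instead of using FP32, consider using TF32 (TensorFloat-32), which is quoted as up to 8 times faster for matrix multiplications on the A100. The gain seen in practice is smaller: in Andrej's run the step time dropped from about 1000ms to 333ms.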

<img src="../assets/TensorFlow32-format.jpg" alt="TensorFlow32 format" style="width: 600px;"/>

There are more details in the A100 Architecture White Paper. One key feature of TensorFloat-32 is that it crops the mantissa in internal operations, which makes them run much faster at the cost of a little precision that is often not noticeable in practice. The operations still return a standard float32, so existing code can use them without requiring any modification at all.

<img src="../assets/Tensor-cropped-mantissa.jpg" alt="Tensor with cropped mantissa" style="width: 600px;"/>

Of course, this is only useful if you have Tensor Cores or are hiring a machine that has them!
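To get a feel for how little precision is lost, the mantissa cropping can be imitated in plain Python. This is only a sketch under a simplifying assumption: it truncates the 13 lowest mantissa bits of a float32 (leaving the 10 that TF32 keeps), whereas real hardware may round instead. `crop_to_tf32` is a hypothetical helper name.

```python
import struct

def crop_to_tf32(value):
    """Imitate TF32 by zeroing the 13 lowest mantissa bits of a float32."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]   # float32 bit pattern as an integer
    bits &= 0xFFFFE000                                         # keep sign, 8 exponent bits, top 10 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

x = 3.14159265
cropped = crop_to_tf32(x)
rel_error = abs(cropped - x) / x
print(f"original: {x}  cropped: {cropped}  relative error: {rel_error:.6f}")
```

With 10 mantissa bits kept, the relative error from truncation stays below 2^-10, or roughly 0.1%.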

The next level down in precision is bfloat16, which is still very suitable for our needs but much faster to compute and, more importantly, much faster to move around, which reduces bottlenecks when waiting for data to move back and forth. float16 is less useful because it does not cover the full exponent range that you get with bfloat16; it spends more of its bits on mantissa precision instead. When float16 is used, gradient scaling workarounds are required to deal with the reduced exponent range, which adds significantly to the complexity of model building. bfloat16 does not require these workarounds, with a trade-off of slightly reduced precision which is often not significant.
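The range difference follows directly from how the bits are split between exponent and mantissa. The sketch below computes the largest finite value and the spacing near 1.0 for each format from its field widths (8/23 for float32, 8/7 for bfloat16, 5/10 for float16); `largest_finite` is a hypothetical helper written for this illustration.

```python
def largest_finite(exponent_bits, mantissa_bits):
    """Largest finite value of an IEEE-style binary float with the given field widths."""
    max_exponent = 2 ** (exponent_bits - 1) - 1        # the all-ones exponent is reserved for inf/NaN
    return (2 - 2 ** -mantissa_bits) * 2.0 ** max_exponent

formats = {"float32": (8, 23), "bfloat16": (8, 7), "float16": (5, 10)}
for name, (e_bits, m_bits) in formats.items():
    print(f"{name:9s} max = {largest_finite(e_bits, m_bits):.4g}   step near 1.0 = {2 ** -m_bits:.3g}")
```

float16 tops out at 65504, so very large or very small gradients overflow or underflow unless they are scaled, while bfloat16 reaches roughly the same maximum as float32 with coarser steps.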

You can read about dealing with precision in the "Automatic Mixed Precision" PyTorch document by Michael Carilli. Most of the document is not relevant here, but the section on torch.autocast is useful. It says not to set bfloat16 explicitly but to let PyTorch handle using it where appropriate through autocasting. It also says to wrap only the forward pass and loss calculation, not the backpropagation step. This is the line added to the code to make use of it:
```
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    logits, loss = model(x, y)
```
Note how only the forward pass and loss calculation sit inside the autocast block within the training loop; `loss.backward()` and `optimizer.step()` stay outside it.
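The effect of autocasting on data types can be checked on a tiny stand-in model. This is a minimal sketch, assuming a PyTorch build with CPU autocast support; it uses `device_type="cpu"` so it runs without a GPU, and a single `nn.Linear` layer stands in for the GPT model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(16, 4)                       # parameters are created as float32
x = torch.randn(8, 16)
target = torch.randn(8, 4)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = layer(x)                             # the matrix multiply runs in bfloat16
    loss = (out.float() - target).pow(2).mean()

loss.backward()                                # backward pass stays outside the autocast block

print("activations:", out.dtype)
print("weights:    ", layer.weight.dtype)
print("gradients:  ", layer.weight.grad.dtype)
```

The activations come out in bfloat16 while the stored weights and their gradients remain float32, which is why the check at the end of the training script reports different types for the logits and the `wte` matrix.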

#### torch.compile

This is a mechanism for compiling neural nets so they run faster. It's just a single line of code to implement:
```
model = torch.compile(model) 
```
This causes an initial slowdown while the model is compiled, but gives a speed-up during training which can make a huge difference for long training runs of big models.

torch.compile may require functorch. Recent PyTorch releases bundle it, but if it is missing you can install it from a conda shell:
```
conda install -c pytorch functorch
```
torch.compile using TorchDynamo is designed for Linux. Although there are Windows implementations of the required Triton module, they are very unstable and may cause CUDA problems. You can still run torch.compile on Windows, but you must run it in **eager** mode. This disables Inductor so that it runs without Triton:
```
model = torch.compile(model, backend="eager")
```
It won’t give you speedups (and may slow things down a tiny bit), but it keeps your code future-proof and testable. Ideally, switching to a Linux environment in WSL2, Docker, or Ubuntu is the safe way to implement torch.compile.

Really, torch.compile should be on by default if you have support for it, unless you are debugging and need to run lots of short experiments without waiting for compilation. In Andrej's run it reduced each training step to less than half of its execution time (from about 300ms to 130ms).
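One defensive way to switch torch.compile on is to fall back to the plain model when compilation is unavailable. This is a minimal sketch, assuming PyTorch 2.x; `try_compile` is a hypothetical helper (not part of PyTorch), it uses the `eager` backend so no Triton is needed, and a small `nn.Linear` stands in for the GPT model.

```python
import torch
import torch.nn as nn

def try_compile(model, example_input, backend="eager"):
    """Return (model, compiled_flag); fall back to the original model on failure."""
    try:
        compiled = torch.compile(model, backend=backend)
        with torch.no_grad():
            compiled(example_input)            # compilation is lazy: the first call triggers it
        return compiled, True
    except Exception as err:                   # e.g. unsupported platform or Python version
        print(f"torch.compile unavailable, running uncompiled: {err}")
        return model, False

torch.manual_seed(0)
net = nn.Linear(8, 4)                          # stand-in for the GPT model
example = torch.randn(2, 8)
reference = net(example)                       # result from the plain, uncompiled model
net, was_compiled = try_compile(net, example)
out = net(example)
print("compiled:", was_compiled, " matches uncompiled:", torch.allclose(out, reference))
```

The warm-up call matters because errors from an unsupported setup often surface only when the compiled model first runs, not when `torch.compile` is called.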

#### Flash attention

Flash attention is an efficient way to run attention by fusing four separate operations (the query-key matrix multiply, the causal mask, the softmax, and the multiply with the values) into one kernel. It actually does more calculations but is still faster, because the extra computations are cheap and the design cuts down the reads and writes to GPU memory, which are very slow.
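In PyTorch 2.x the fused version is reached through `F.scaled_dot_product_attention`. The sketch below, which assumes PyTorch 2.0 or later and uses made-up tensor sizes, checks that the single fused call matches the four unfused steps. On a CPU the call uses a different kernel than FlashAttention, but the result is the same.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, n_head, T, head_dim = 2, 4, 8, 16           # small made-up sizes
q, k, v = (torch.randn(B, n_head, T, head_dim) for _ in range(3))

# Unfused attention: four steps, each materializing a (T, T) matrix per head
att = (q @ k.transpose(-2, -1)) / math.sqrt(head_dim)        # 1. scaled scores
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
att = att.masked_fill(~mask, float("-inf"))                  # 2. causal mask
att = F.softmax(att, dim=-1)                                 # 3. softmax
y_manual = att @ v                                           # 4. weighted sum of values

# Fused attention: one call, no (T, T) matrix exposed to the caller
y_fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print("max difference:", (y_manual - y_fused).abs().max().item())
```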

#### Fix ugly numbers!
This is a surprisingly effective optimization that simply tweaks any numbers that aren't nice multiples of powers of 2. A great example is the vocabulary size, which can be increased from 50257 to 50304. Although this increases the total computation required, it removes the need for special handling of leftover elements, as the number divides evenly by 8, 16, 32, 64 and 128.
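The padded value is just the vocabulary size rounded up to the next multiple of 128. A minimal sketch of that arithmetic, where `round_up_to_multiple` is a hypothetical helper name:

```python
def round_up_to_multiple(n, multiple):
    """Smallest multiple of `multiple` that is greater than or equal to n."""
    return ((n + multiple - 1) // multiple) * multiple

vocab_size = 50257
padded = round_up_to_multiple(vocab_size, 128)
print(padded)                                            # 50304
print(padded - vocab_size)                               # 47 extra rows that are never used as tokens
print([padded % p for p in (8, 16, 32, 64, 128)])        # [0, 0, 0, 0, 0]
```

The 47 extra embedding rows are never indexed by real tokens, so the network only has to learn to push their output probabilities towards zero.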

In [None]:
# Check torch.compile (experimental)
import torch._dynamo as dynamo
dynamo.explain(model)(x, y)

## Strategies for reducing overfitting
Validation Strategy Proposal for GPT-2 Training Using FineWeb-edu

#### Current Setup Recap
+ Model: GPT-2 (custom implementation following Karpathy)
+ Dataset: FineWeb-edu 10B token sample
+ Validation split: 1 shard out of 100 (1%)
+ Validation pattern: Scanned linearly with no shuffling
+ Observation: Training loss continues to fall; validation loss decreases initially, then steadily increases from ~step 3500 to step 16000.

#### Concerns Identified
+ Small validation set (1%) may be:
    + Too narrow to reflect generalization ability
    + Susceptible to variance or domain mismatch
    + Missing harder or more diverse examples
+ Lack of corpus randomization may introduce positional bias and smooth overfitting

#### Proposed Adjustments
Option 1: Expand Validation Set Using Additional FineWeb-edu Shards
+ Identify additional FineWeb shards not included in the current 10B token sample.
+ Sample a fixed number of examples (e.g., 2% to 5% of total corpus) from different files.
+ Ensure domain and difficulty diversity (e.g., include harder educational material).
+ Pre-tokenize and cache if necessary.

Option 2: Stratify Validation Selection
+ Perform lightweight metadata classification or domain tagging of samples.
+ Use stratified sampling to create a validation set that matches the full corpus distribution.
+ May require one pass through the corpus but would give more reliable results.
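Option 2 could be sketched as follows. This is an illustration only: it assumes every sample already carries a domain tag, and the tags, sizes and the `stratified_sample` helper are all made up.

```python
import random
from collections import defaultdict

def stratified_sample(samples, fraction, seed=0):
    """Pick roughly `fraction` of the sample ids from every domain."""
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for sample_id, domain in samples:
        by_domain[domain].append(sample_id)
    chosen = []
    for domain in sorted(by_domain):
        ids = by_domain[domain]
        k = max(1, round(len(ids) * fraction))     # at least one example per domain
        chosen.extend(rng.sample(ids, k))
    return chosen

# Toy corpus: 80 "science" documents and 20 "history" documents (made-up tags)
corpus = [(i, "science") for i in range(80)] + [(i, "history") for i in range(80, 100)]
val_ids = stratified_sample(corpus, fraction=0.05)
print(len(val_ids), sorted(val_ids))
```

Because each domain is sampled separately, the validation set keeps the 80/20 mix of the full corpus instead of depending on which documents happen to land in one shard.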

Option 3: Rotate Validation Across Epochs
+ Instead of a fixed validation shard, rotate which shard is held out each epoch.
+ Compute running average of validation loss.
+ Benefit: exposes model to broader set of out-of-training samples.

Option 4: Hybrid: Keep Static + Floating Validation Sets
+ Maintain original shard as static benchmark.
+ Add rotating or diversified secondary validation set to compare trends.
+ Track both sets during training and compare convergence behavior.
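Options 3 and 4 can be combined in a small bookkeeping routine. A minimal sketch, assuming the shards are identified by name; the shard names and both helper functions are hypothetical.

```python
def split_for_epoch(shards, static_val, epoch):
    """Return (train_shards, rotating_val) for one epoch; static_val is never trained on."""
    pool = [s for s in shards if s != static_val]
    rotating_val = pool[epoch % len(pool)]
    train = [s for s in pool if s != rotating_val]
    return train, rotating_val

def running_mean(values):
    """Running average, e.g. of the rotating-set validation loss across epochs."""
    means, total = [], 0.0
    for count, value in enumerate(values, start=1):
        total += value
        means.append(total / count)
    return means

shards = [f"shard_{i:02d}" for i in range(5)]      # made-up shard names
for epoch in range(3):
    train, rotating = split_for_epoch(shards, "shard_00", epoch)
    print(epoch, rotating, train)
print(running_mean([4.0, 3.0, 2.0]))               # [4.0, 3.5, 3.0]
```

One design point to keep in mind: after the first epoch, a rotating shard may already have been trained on in an earlier epoch, so only the static shard remains a strictly unseen benchmark.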

#### Tooling Enhancements
+ Track validation loss with higher frequency (e.g., every 200 steps) to catch sharper inflections.
+ Store per-validation checkpoint metrics to assess per-shard behavior.
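With validation loss logged at a fixed interval, finding the turning point becomes a simple lookup. A minimal sketch with made-up loss values; both helper names are hypothetical.

```python
def best_validation_step(history):
    """Return the (step, loss) pair with the lowest validation loss."""
    return min(history, key=lambda item: item[1])

def steps_since_best(history):
    """How many steps have passed since validation loss was at its lowest."""
    best_step, _ = best_validation_step(history)
    return history[-1][0] - best_step

# Made-up (step, validation loss) pairs logged every 200 steps
history = [(0, 10.9), (200, 7.1), (400, 6.2), (600, 6.0), (800, 6.1), (1000, 6.3)]
print(best_validation_step(history))      # (600, 6.0)
print(steps_since_best(history))          # 400
```

A growing `steps_since_best` while training loss keeps falling is the pattern described in the setup recap above, and a finer logging interval narrows down where it starts.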

#### Next Steps
1. Identify and acquire additional FineWeb-edu shards.
2. Analyze overlap (if any) with training shards to ensure clean separation.
3. Implement at least one of the proposed strategies and replot training vs. validation.
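The overlap analysis in step 2 could start with an exact-duplicate check. This sketch assumes the documents are available as plain strings and only catches copies that match after lowercasing and whitespace normalization; near-duplicates would need a fuzzier method. The helper names and sample texts are made up.

```python
import hashlib

def fingerprint(text):
    """Hash a document after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def find_overlap(train_docs, val_docs):
    """Return the validation documents that also appear in the training set."""
    train_hashes = {fingerprint(doc) for doc in train_docs}
    return [doc for doc in val_docs if fingerprint(doc) in train_hashes]

train_docs = ["Photosynthesis converts light into energy.", "The French Revolution began in 1789."]
val_docs = ["photosynthesis  converts light into energy.", "Plate tectonics shapes continents."]
leaked = find_overlap(train_docs, val_docs)
print(len(leaked), "of", len(val_docs), "validation documents also appear in training")
```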

#### Goal
Make validation loss a more meaningful diagnostic of generalization, and ensure training signals guide architecture/tuning decisions effectively.