## Section 3: Algorithmic improvements to GPT-2

Neither the GPT-2 paper nor its released code (inference only) says much about the optimization details or hyperparameters used for training. So we borrow them from the GPT-3 paper, since the two architectures are very similar.

GPT-2: fewer details, open weights <br>
GPT-3: more details, no weights

Key differences:<br>
- context length: 1024 vs 2048
- GPT-3: trained for much longer, on a bigger dataset, with more validation
- 1.6b vs 175b parameters

In [6]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import transformers
import tiktoken
import time
import math

from sec2 import GPT, GPTConfig, DataLoaderLite, get_device



Refer to the [GPT-3 paper](https://arxiv.org/pdf/2005.14165): __Appendix B - Details of model training__

<img src="images/details of model training.png" style="width:50%;">

The actual GPT-3 was trained on ~__300 billion__ tokens; we use far fewer.

1. AdamW parameters: $\beta_1 = 0.9$, $\beta_2 = 0.95$, $\epsilon = 10^{-8}$

2. Gradient clipping at $\text{max norm} = 1.0$
    - how it works: if the global norm of all gradients exceeds max norm, every grad is scaled by $\frac{\text{max norm}}{\text{norm}}$
    - why: to prevent _shocks_ in model training when an unusual batch produces a very large gradient
    - useful to track during training (is the grad norm rising or falling abnormally, etc.)

3. Cosine decay learning rate with warmup
    - handwritten _here_; plug-and-play schedulers are also available in PyTorch (`torch.optim.lr_scheduler`)
    - Other schedules can be used (active research area) 

4. Changing batch size (not implemented here)
    - Starts at 32k tokens per batch and increases linearly to the full batch size over the first 4-12 billion training tokens
    - The gain is incremental anyway and it complicates the arithmetic, so it is skipped _here_

5. Weight decay, fused AdamW
    - weight decay for 2D tensors (matmul weights and embeddings) and not for biases or layernorms (1D)
    - [Intuition behind weight decay](https://towardsdatascience.com/weight-decay-and-its-peculiar-effects-66e0aee3e7b8/): regularization, prevents individual weights from becoming too big
    - [Fused AdamW](https://docs.pytorch.org/docs/stable/generated/torch.optim.AdamW.html): later versions of PyTorch offer fused kernels (`fused=True`) that reduce some overhead when `device = 'cuda'`
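
The weight-decay grouping in item 5 can be sketched as below. This is a minimal illustration, not the `configure_optimizers` method from `sec2` (which is not shown in this section and may differ). The helper name `build_adamw` and the toy network are hypothetical, and the sketch assumes that "1D vs 2D" is decided with `p.dim()`.

```python
import inspect

import torch
import torch.nn as nn


def build_adamw(model: nn.Module, weight_decay: float, learning_rate: float,
                device: str) -> torch.optim.AdamW:
    """Hypothetical helper: AdamW that decays only tensors with >= 2 dimensions."""
    params = [p for p in model.parameters() if p.requires_grad]
    decay = [p for p in params if p.dim() >= 2]    # matmul weights, embeddings
    no_decay = [p for p in params if p.dim() < 2]  # biases, layernorm parameters
    groups = [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
    # Request the fused kernel only if this PyTorch build exposes the argument
    # and we are running on CUDA (assumes the model already lives on that device).
    has_fused = "fused" in inspect.signature(torch.optim.AdamW).parameters
    extra = {"fused": True} if (has_fused and device.startswith("cuda")) else {}
    return torch.optim.AdamW(groups, lr=learning_rate, betas=(0.9, 0.95),
                             eps=1e-8, **extra)


# Toy network (an assumption for illustration, not the GPT model).
toy = nn.Sequential(nn.Linear(8, 16), nn.LayerNorm(16), nn.Linear(16, 4))
opt = build_adamw(toy, weight_decay=0.1, learning_rate=6e-4, device="cpu")
n_decay = sum(p.numel() for p in opt.param_groups[0]["params"])
n_no_decay = sum(p.numel() for p in opt.param_groups[1]["params"])
print(f"decayed params: {n_decay}, non-decayed params: {n_no_decay}")
```

Printing the two counts is a cheap sanity check: almost all parameters of a transformer should land in the decayed group.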

### Gradient accumulation

<img src="images/hyperparams.png" style="width:60%;">

We will adhere to 0.5M tokens per batch, but that may not fit on a GPU with limited VRAM. Instead we retain the (16, 1024) micro-batch and accumulate gradients over $\frac{524288}{16 \times 1024} = 32$ micro-batches per optimizer step. $524288 = 2^{19}$ is the power of two closest to 500000, which makes it a _nice number_ for the hardware.

In practice: <br>
- since `loss.backward()` adds to the gradients already stored, for mini_step in range $32$:
    - sample x,y
    - model(x)
    - loss and loss.backward()
- optimizer.step()

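Done naively, this introduces a subtle bug: since the loss uses `reduction = "mean"` within each micro-batch, summing the gradients of 32 mini_steps gives a gradient 32 times too large. We must divide each micro-batch loss by 32 to recover the mean over the full batch. The toy example at the end of this section gives the intuition.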
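
The division can also be justified in one line. The notation is introduced here for illustration: $N = 32$ is the number of micro-batches, $\ell_k$ is the mean loss of micro-batch $k$, and $L$ is the mean loss over the full 0.5M-token batch. The derivation assumes every micro-batch holds the same number of tokens, which is true here ($16 \times 1024$ each).

$$
L = \frac{1}{N}\sum_{k=1}^{N} \ell_k
\quad\Longrightarrow\quad
\nabla L = \sum_{k=1}^{N} \nabla\!\left(\frac{\ell_k}{N}\right)
$$

Each `backward()` call adds its gradient to the stored one, so calling it on $\ell_k / N$ accumulates exactly $\nabla L$, whereas calling it on the unscaled $\ell_k$ accumulates $N \nabla L$.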

In [7]:
max_lr = 6e-4 # from gpt3 small 
min_lr = max_lr * 0.1
max_steps = 50
warmup_steps = 10

def get_lr(it):
    
    # 1. warmup phase: ramp linearly from max_lr/warmup_steps up to max_lr
    if it < warmup_steps:
        return max_lr * (it+1)/warmup_steps
    
    #2. for it > max_steps
    if it> max_steps:
        return min_lr

    # 3. in between: cosine decay from max_lr down to min_lr
    decay_ratio = (it - warmup_steps)/ (max_steps - warmup_steps)
    assert 0 <= decay_ratio <= 1
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio)) # coeff starts at 1 and goes to 0
    return min_lr + coeff * (max_lr - min_lr)
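
For the plug-and-play route, one option is to wrap a schedule function in `torch.optim.lr_scheduler.LambdaLR`. The sketch below is an illustration under stated assumptions, not part of the training code in this notebook: `lr_at` is a hypothetical standalone function with the warmup-then-cosine shape, and the single dummy parameter exists only so that an optimizer can be constructed.

```python
import math

import torch

max_lr, min_lr = 6e-4, 6e-5
warmup_steps, max_steps = 10, 50


def lr_at(it: int) -> float:
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if it < warmup_steps:
        return max_lr * (it + 1) / warmup_steps
    if it > max_steps:
        return min_lr
    ratio = (it - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (1.0 + math.cos(math.pi * ratio)) * (max_lr - min_lr)


# Dummy parameter (an assumption, only needed to build an optimizer).
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=max_lr)
# LambdaLR multiplies the base lr (max_lr here) by the factor the lambda returns.
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda it: lr_at(it) / max_lr)

lrs = []
for step in range(max_steps):
    lrs.append(scheduler.get_last_lr()[0])  # lr in effect for this step
    optimizer.step()   # no gradients were computed, so nothing is updated
    scheduler.step()   # advance the schedule after the optimizer step

print(f"step 0: {lrs[0]:.2e} | step 9: {lrs[9]:.2e} | step 30: {lrs[30]:.2e}")
```

With a scheduler object the training loop no longer writes `param_group['lr']` by hand; it only calls `scheduler.step()` once per optimizer step.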


In [None]:
device = get_device()

# prevent accidental runs and kernel clogging.
import sys; sys.exit(0)

model = GPT(GPTConfig(vocab_size=50304)) #- random weights init
model.to(device)
model = torch.compile(model) # compiles the model 

B,T = 16,1024

total_batch_size  = 2**19 #524288, ~0.5M, in number of tokens
assert total_batch_size % (B*T) == 0, "make sure total_batch_size is divisible by B * T"
grad_accum_steps = total_batch_size // (B*T)

print(f"Desired total batch size = {total_batch_size} tokens")
print(f"calculated gradient accumulation steps = {grad_accum_steps}")


train_loader = DataLoaderLite(B,T)
# torch.set_float32_matmul_precision('high') -- old api, soon to be deprecated 
torch.backends.fp32_precision = "tf32" # new api, use "ieee" to enforce global fp32 precision 


# optimizer with weight decay + fused kernels. 
optimizer = model.configure_optimizers(weight_decay=0.1, learning_rate=6e-4, device = device)

#training loop
for step in range(max_steps):
    t0 = time.time()
    optimizer.zero_grad()
    loss_accum = 0.0

    for mini_step in range(grad_accum_steps):
        x,y = train_loader.next_batch()
        x, y = x.to(device), y.to(device)

        # single line autocast
        with torch.autocast(device_type = device, dtype = torch.bfloat16):
            logits, loss = model(x,y)

        # scale to account for gradient accumulation
        loss = loss / grad_accum_steps
        loss_accum += loss.detach() #accumulate loss at mini_step into loss at each step
        loss.backward()

    #grad clipping
    norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

    #determine and set learning rate for this iteration 
    lr = get_lr(step)
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    optimizer.step()
    torch.cuda.synchronize() # wait for queued GPU work to finish so the timing is accurate
    t1 = time.time()
    dt = t1 - t0 # time difference in seconds
    
    tokens_processed = train_loader.B * train_loader.T * grad_accum_steps # = total_batch_size
    tokens_per_sec = tokens_processed / dt

    print(f"step {step:4d} | loss: {loss_accum.item():.6f} | lr {lr:.4e} | norm: {norm:.4f} | dt: {dt*1000:.2f}ms | tok/sec: {tokens_per_sec:.2f}")


#verify total no of parameters
total_params_M = sum(p.numel() for p in model.parameters()) / 1e6
print(f"Total parameters: {total_params_M:.2f}M")
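
To make the clipping step in the loop above less of a black box, here is a hand-rolled version of global-norm clipping on a toy layer. It is a sketch for intuition only: the tiny `Linear` layer and the artificially scaled loss are assumptions chosen to force a large gradient. `torch.nn.utils.clip_grad_norm_` performs the same kind of rescaling and returns the norm measured before clipping, which is the `norm` printed in the training loop.

```python
import torch

torch.manual_seed(0)
net = torch.nn.Linear(4, 2)
# Scale the output up so the gradient norm is certainly far above 1.0.
loss = (net(torch.randn(8, 4)) * 100.0).pow(2).mean()
loss.backward()

max_norm = 1.0
grads = [p.grad for p in net.parameters()]
# Global norm: one L2 norm over the gradient entries of all parameters together.
total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
if total_norm > max_norm:
    for g in grads:
        # The same factor for every tensor shrinks the step but keeps its direction.
        g.mul_(max_norm / total_norm)

clipped_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
print(f"before: {total_norm.item():.2f} | after: {clipped_norm.item():.4f}")
```

Because one factor is applied to all tensors, clipping by global norm limits the size of the update without changing its direction, unlike clipping each gradient value independently.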

### Gradient accumulation motivation through a toy example

In [3]:
import torch

torch.manual_seed(1729)

# super simple little MLP
net = torch.nn.Sequential(
    torch.nn.Linear(16, 32),
    torch.nn.GELU(),
    torch.nn.Linear(32, 1)
)
torch.random.manual_seed(42)
x = torch.randn(4, 16)
y = torch.randn(4, 1)
net.zero_grad()
yhat = net(x)
loss = torch.nn.functional.mse_loss(yhat, y)
loss.backward()
print(net[0].weight.grad.view(-1)[:10])

# the loss objective here is (due to reduction='mean')
# L = 1/4 * [
#            (y[0] - yhat[0])**2 +
#            (y[1] - yhat[1])**2 +
#            (y[2] - yhat[2])**2 +
#            (y[3] - yhat[3])**2
#           ]
# NOTE: 1/4!

tensor([ 0.0953,  0.0498, -0.0077, -0.0817, -0.0166,  0.0079,  0.0189,  0.1085,
         0.1615, -0.0739])


In [4]:
# now let's do it with grad_accum_steps of 4, and B=1
# accumulation in gradient <---> SUM in loss, so without a correction
# each step would contribute Li = (y[i] - yhat[i])**2 and
# the "normalizer" of 1/4 would be lost.
# dividing each loss by 4 restores it, i.e. we get:
# L0 = 1/4 * (y[0] - yhat[0])**2
# L1 = 1/4 * (y[1] - yhat[1])**2
# L2 = 1/4 * (y[2] - yhat[2])**2
# L3 = 1/4 * (y[3] - yhat[3])**2
# L = L0 + L1 + L2 + L3, the same objective as in the B=4 cell above
net.zero_grad()
for i in range(4):
    yhat = net(x[i])
    loss = torch.nn.functional.mse_loss(yhat, y[i])
    loss = loss / 4 # <-- have to add back the "normalizer"!
    loss.backward()
print(net[0].weight.grad.view(-1)[:10])

tensor([ 0.0953,  0.0498, -0.0077, -0.0817, -0.0166,  0.0079,  0.0189,  0.1085,
         0.1615, -0.0739])


### Sampling and generation

In [None]:
torch.manual_seed(42)
torch.cuda.manual_seed(42)

model.eval()

out = model.generate("Hello, I'm a language model,", num_return_sequences=5, max_length=30) # generate() also prints each sequence
with open('output.txt', 'w') as f:
    for o in out:
        f.write(o + '\n')


## Distributed data parallel 

Bringing out the heavy weapons :)  - Using multiple GPUs. 
