# Lecture 16: NanoGPT Implementation with W&B Tracking

Extended from [karpathy/nanoGPT](https://github.com/karpathy/nanoGPT/tree/master) and made even more nano.

This version removes the functions for loading the pre-trained GPT-2 weights.
If you would like to use the pre-trained model, please refer to the original
repository.
We also remove flash attention and other optimizations to keep the code simple.


In [1]:
import math
import inspect
from dataclasses import dataclass

In [2]:
import torch
import torch.nn as nn
from torch.nn import functional as F
from torch.utils.data import Dataset, DataLoader
import wandb
import requests

Here, we will also use `einops` for more readable tensor manipulations.
The syntax can be thought of as a more readable way to write `permute`, `reshape`, `transpose`, etc.,
via _describing the indices_ before and after the transformation.

In [3]:
from einops import rearrange, reduce, repeat

For example, let's flatten a tensor, and then restore it to its original shape:

In [4]:
x = torch.randn(32, 2, 3, 4) # 32 elements in batch, 2 channels, height 3, width 4
y = rearrange(x, 'batch channel height width -> batch (height width) channel')
print(y.shape)
z = rearrange(y, 'batch (height width) channel -> batch channel height width', height=3, width=4)
assert torch.allclose(x, z)

torch.Size([32, 12, 2])


We can also use `reduce` to sum over some dimensions:

In [6]:
x = torch.randn(32, 2, 3, 4)
y = reduce(x, 'batch channel height width -> batch channel', 'sum')
print(y.shape)

torch.Size([32, 2])


In this notebook, we want to follow best practices for deep learning, so we track our experiments with Weights & Biases (W&B).

As a demo, we will work with a small Shakespeare text dataset.

In [7]:
import collections

# Fetch and preprocess Shakespeare
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
text = requests.get(url).text[:10000]  # Using first 10,000 characters

# Start with character-level tokenization
tokens = list(text) # A list of all the characters in the text

# Create initial vocabulary (unique characters)
vocab = sorted(list(set(tokens)))
print(f"Initial vocab size (characters): {len(vocab)}") # All the vocab in the tokens

max_vocab_size = 500

# Perform merges
while len(vocab) < max_vocab_size:
    # Count frequencies of adjacent pairs
    pairs = collections.Counter()
    for j in range(len(tokens) - 1):
        pair = (tokens[j], tokens[j+1])
        pairs[pair] += 1

    # If no pairs are left, we're done
    if not pairs:
        break

    # Find most frequent pair
    best_pair = max(pairs, key=pairs.get)
    best_pair_str = best_pair[0] + best_pair[1]

    # Add the merged pair to vocabulary
    vocab.append(best_pair_str)

    # Replace all occurrences of the pair in the token list
    new_tokens = []
    i = 0
    while i < len(tokens):
        if i < len(tokens) - 1 and tokens[i] == best_pair[0] and tokens[i+1] == best_pair[1]:
            new_tokens.append(best_pair_str)
            i += 2
        else:
            new_tokens.append(tokens[i])
            i += 1
    tokens = new_tokens

# Create mappings between tokens and IDs
token_to_id = {token: i for i, token in enumerate(vocab)}
id_to_token = {i: token for i, token in enumerate(vocab)}

# Encode the text using our BPE vocabulary
encoded = [token_to_id[token] for token in tokens]
data = torch.tensor(encoded, dtype=torch.long)

vocab_size = len(vocab)
print(f"Final vocabulary size: {vocab_size}")

# This gives us:
# data, vocab_size, token_to_id, id_to_token

print(vocab)

Initial vocab size (characters): 57
Final vocabulary size: 500
['\n', ' ', '!', "'", ',', '-', '.', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'H', 'I', 'J', 'L', 'M', 'N', 'O', 'P', 'R', 'S', 'T', 'U', 'V', 'W', 'Y', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'e ', 'th', 't ', 'ou', 's ', ' th', ', ', 'd ', 'en', 'er', 'in', 'an', 'y ', 'or', 'll', 'you', 'on', ' the ', 'it', ':\n', 'ar', 'ha', '\n\n', 'ir', 'st ', ',\n', 'es', 'o ', 'ea', ' s', 'iti', 'at', 'ic', 'EN', 'itiz', 'itizen', 'no', 'Th', 'Citizen', 'Citizen:\n', ' m', '.\n\n', 'is ', ' w', ' a', 'ell', 'hat ', ' you', 'irst ', 'of', ' the', 'e th', 'ing', 're', 'us', 'First ', 'First Citizen:\n', 'and ', 'om', ' c', 've', 'is', 'to', 'al', 'IU', 'IUS', 'IUS:\n', '. ', 'our', ' b', 'for', 'to ', 'st', ' h', 'el', 'un', 'at ', '?\n\n', 'the ', 'ith', 'it ', 'wh', 'ed ', 've ', 'gh', 'MEN', 'MENEN', 'MENENIUS:\n', 'ke ', 'ur', 'ra', 'ec',

In [8]:
# Our data is just categorical now:
data

tensor([113,  11,  35,  ...,  97,  35, 248])
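
Because each merge only concatenates two adjacent tokens, joining the token strings should always reproduce the original text. The sketch below checks this on a toy sentence. The helper `bpe_merge_once`, the toy text, and the `toy_*` names are hypothetical and not part of the notebook; unlike the loop above, the sketch also skips adding a merged string that is already in the vocabulary.

```python
import collections

def bpe_merge_once(tokens):
    """Merge the most frequent adjacent pair once (hypothetical helper)."""
    pairs = collections.Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens, None
    left, right = max(pairs, key=pairs.get)
    merged, out, i = left + right, [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and tokens[i] == left and tokens[i + 1] == right:
            out.append(merged)  # replace the pair with the merged symbol
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out, merged

toy_text = "the cat and the hat and the bat"
toy_tokens = list(toy_text)
toy_vocab = sorted(set(toy_tokens))

for _ in range(5):  # five merges are enough for a toy example
    toy_tokens, new_symbol = bpe_merge_once(toy_tokens)
    if new_symbol is None:
        break
    if new_symbol not in toy_vocab:
        toy_vocab.append(new_symbol)

# Encode to integer IDs and decode back to text
toy_token_to_id = {tok: i for i, tok in enumerate(toy_vocab)}
toy_ids = [toy_token_to_id[tok] for tok in toy_tokens]
decoded = ''.join(toy_vocab[i] for i in toy_ids)

print(toy_tokens)
print(decoded == toy_text)
```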

In [9]:
# Dataset class
class TextDataset(Dataset):
    def __init__(self, data, block_size=16):
        self.data = data
        self.block_size = block_size
    def __len__(self):
        return max(0, len(self.data) - self.block_size)
    def __getitem__(self, idx):
        # We give the data in fixed chunks:
        chunk = self.data[idx:idx + self.block_size + 1]
        return chunk[:-1], chunk[1:]
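
To make the input/target relationship concrete, the sketch below repeats the slicing from `__getitem__` on a stand-in tensor. `toy_data` and the chosen `idx` are made up for illustration; the point is only that the targets are the inputs shifted one position to the left.

```python
import torch

toy_data = torch.arange(10)   # stand-in for the encoded token IDs
block_size = 4

idx = 3
chunk = toy_data[idx:idx + block_size + 1]   # block_size + 1 consecutive tokens
inputs, targets = chunk[:-1], chunk[1:]

print(inputs.tolist())   # [3, 4, 5, 6]
print(targets.tolist())  # [4, 5, 6, 7]
```

At every position the model is asked to predict the token that follows, so a single window of `block_size` tokens yields `block_size` training examples.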


First, we create a layer norm with an optional bias parameter (can be toggled off):

In [10]:
class LayerNorm(nn.Module):
    """ LayerNorm but with an optional bias """

    def __init__(self, ndim, bias):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(ndim))
        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None

    def forward(self, input):
        return F.layer_norm(input, self.weight.shape, self.weight, self.bias, 1e-5)
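
As a rough check of what `F.layer_norm` computes, the sketch below normalizes over the last dimension by hand and compares the result with the library call. The tensor shape and variable names are arbitrary choices for illustration, and the bias is left out (`None`), as in the `bias=False` case.

```python
import torch
from torch.nn import functional as F

x = torch.randn(4, 16, 32)   # (batch, time, n_embd), arbitrary sizes
weight = torch.ones(32)
eps = 1e-5

# Normalize each embedding vector to zero mean and unit (biased) variance
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
manual = (x - mean) / torch.sqrt(var + eps) * weight

reference = F.layer_norm(x, weight.shape, weight, None, eps)
print(torch.allclose(manual, reference, atol=1e-5))
```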

We create a *causal* self-attention layer since GPT-2 is a decoder-only model:
each position may attend only to itself and to earlier positions.

In [18]:
class CausalSelfAttention(nn.Module):

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # key, query, value projections for all heads, but in a batch
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
        # output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
        # regularization
        self.attn_dropout = nn.Dropout(config.dropout)
        self.resid_dropout = nn.Dropout(config.dropout)
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.dropout = config.dropout

        self.register_buffer('bias', torch.tril(torch.ones(config.block_size, config.block_size)).view(
            1, 1, config.block_size, config.block_size
        ))

    def forward(self, x):
        B, T, C = (
            x.size()
        )  # batch size, sequence length, embedding dimensionality (n_embd)

        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        k = rearrange(k, 'b t (h d) -> b h t d', h=self.n_head)  # (B, nh, T, hs)
        q = rearrange(q, 'b t (h d) -> b h t d', h=self.n_head)  # (B, nh, T, hs)
        v = rearrange(v, 'b t (h d) -> b h t d', h=self.n_head)  # (B, nh, T, hs)

        # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        att = self.attn_dropout(att)
        y = att @ v  # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        y = rearrange(y, 'b h t d -> b t (h d)')  # re-assemble all head outputs side by side

        # output projection
        y = self.resid_dropout(self.c_proj(y))
        return y
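
The effect of the lower-triangular mask can be seen on a single head in isolation. This is a minimal sketch with made-up sizes (`T = 5`, `hs = 8`) and random queries and keys; it mirrors the masking and softmax steps of `forward` above but is not the layer itself.

```python
import math
import torch
from torch.nn import functional as F

torch.manual_seed(0)
T, hs = 5, 8                      # sequence length and head size (arbitrary)
q = torch.randn(1, 1, T, hs)      # (B, nh, T, hs)
k = torch.randn(1, 1, T, hs)

# Lower-triangular mask: position t may look at positions 0..t only
mask = torch.tril(torch.ones(T, T)).view(1, 1, T, T)

att = (q @ k.transpose(-2, -1)) / math.sqrt(hs)
att = att.masked_fill(mask == 0, float("-inf"))   # block future positions
att = F.softmax(att, dim=-1)

print(att[0, 0])
```

Every entry above the diagonal is zero after the softmax, each row still sums to one, and the first row puts all of its weight on the first token.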

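After each self-attention layer, we apply a feedforward network: an MLP with a single hidden layer that is four times wider than the embedding.

This also includes dropout (though it is off by default in this notebook's configuration, `dropout = 0.0`):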

In [19]:
class MLP(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
        self.gelu = nn.GELU()
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        x = self.dropout(x)
        return x

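A single GPT-2 block applies layer norm before each sub-layer and adds the sub-layer output back through a residual connection: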

In [20]:
class Block(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.ln_1 = LayerNorm(config.n_embd, bias=config.bias)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = LayerNorm(config.n_embd, bias=config.bias)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x

N.B.: `@dataclass` is a convenient way to define a configuration in Python:

In [21]:
@dataclass
class GPTConfig:
    # Adjusted for faster training
    block_size: int = 16          # [1024]
    vocab_size: int = len(vocab)  # [50304]
    n_layer: int = 3      # [12]
    n_head: int = 2       # [12]
    n_embd: int = 32      # [768]
    dropout: float = 0.0
    bias: bool = True
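
A short sketch of why a dataclass is convenient for configurations: fields can be overridden at construction or copied with changes, instances compare field by field, and the whole object converts to a plain dictionary for logging. `ToyConfig` is a hypothetical stand-in so that the example runs on its own; `asdict` and `replace` are standard-library helpers from `dataclasses`.

```python
from dataclasses import dataclass, asdict, replace

@dataclass
class ToyConfig:  # hypothetical stand-in for GPTConfig
    block_size: int = 16
    n_layer: int = 3
    n_head: int = 2
    n_embd: int = 32
    dropout: float = 0.0

base = ToyConfig()
wider = replace(base, n_head=4, n_embd=64)  # copy with two fields changed

print(base == ToyConfig())   # dataclasses compare field by field
print(asdict(wider))         # plain dict, e.g. for logging hyperparameters
```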

Finally, GPT itself is a stack of these blocks, with extra
parameters for the input embedding and the final output linear layer:

In [24]:
class GPT(nn.Module):

    def __init__(self, config):
        super().__init__()
        assert config.vocab_size is not None
        assert config.block_size is not None
        self.config = config

        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),  # Token embeddings
            wpe = nn.Embedding(config.block_size, config.n_embd),  # Position embeddings
            drop = nn.Dropout(config.dropout),
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = LayerNorm(config.n_embd, bias=config.bias),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.transformer.wte.weight = self.lm_head.weight # https://paperswithcode.com/method/weight-tying

        # init all weights
        self.apply(self._init_weights)

        # apply special scaled init to the residual projections, per GPT-2 paper
        for pn, p in self.named_parameters():
            if pn.endswith('c_proj.weight'):
                torch.nn.init.normal_(p, mean=0.0, std=0.02/math.sqrt(2 * config.n_layer))

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            # Scale by input dimension
            std = 1.0 / math.sqrt(module.in_features)
            torch.nn.init.normal_(module.weight, mean=0.0, std=std)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            # Scale by embedding dimension
            std = 1.0 / math.sqrt(module.embedding_dim)
            torch.nn.init.normal_(module.weight, mean=0.0, std=std)

    def forward(self, idx, targets=None):
        device = idx.device
        b, t = idx.size()
        assert t <= self.config.block_size, f"Cannot forward sequence of length {t}, block size is only {self.config.block_size}"
        pos = torch.arange(0, t, dtype=torch.long, device=device) # shape (t)

        # forward the GPT model itself
        tok_emb = self.transformer.wte(idx) # token embeddings of shape (b, t, n_embd)
        pos_emb = self.transformer.wpe(pos) # position embeddings of shape (t, n_embd)
        x = self.transformer.drop(tok_emb + pos_emb)
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)

        if targets is not None:
            # if we are given some desired targets also calculate the loss
            logits = self.lm_head(x)
            loss = F.cross_entropy(rearrange(logits, 'b t v -> (b t) v'), rearrange(targets, 'b t -> (b t)'), ignore_index=-1)
        else:
            # inference-time mini-optimization: only forward the lm_head on the very last position
            logits = self.lm_head(x[:, [-1], :]) # note: using list [-1] to preserve the time dim
            loss = None

        return logits, loss

    def configure_optimizers(self, weight_decay, learning_rate, betas):
        # start with all of the candidate parameters
        param_dict = {pn: p for pn, p in self.named_parameters()}
        # filter out those that do not require grad
        param_dict = {pn: p for pn, p in param_dict.items() if p.requires_grad}
        # create optim groups. Any parameter with at least 2 dimensions is weight decayed, otherwise not.
        # i.e. all weight tensors in matmuls + embeddings decay, all biases and layernorms don't.
        decay_params = [p for n, p in param_dict.items() if p.dim() >= 2]
        nodecay_params = [p for n, p in param_dict.items() if p.dim() < 2]
        optim_groups = [
            {'params': decay_params, 'weight_decay': weight_decay},
            {'params': nodecay_params, 'weight_decay': 0.0}
        ]
        num_decay_params = sum(p.numel() for p in decay_params)
        num_nodecay_params = sum(p.numel() for p in nodecay_params)
        print(f"num decayed parameter tensors: {len(decay_params)}, with {num_decay_params:,} parameters")
        print(f"num non-decayed parameter tensors: {len(nodecay_params)}, with {num_nodecay_params:,} parameters")
        optimizer = torch.optim.AdamW(optim_groups, lr=learning_rate, betas=betas)

        return optimizer

    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        """
        Take a conditioning sequence of indices idx (LongTensor of shape (b,t)) and complete
        the sequence max_new_tokens times, feeding the predictions back into the model each time.
        Most likely you'll want to make sure to be in model.eval() mode of operation for this.
        """
        for _ in range(max_new_tokens):
            # if the sequence context is growing too long we must crop it at block_size
            idx_cond = idx if idx.size(1) <= self.config.block_size else idx[:, -self.config.block_size:]
            # forward the model to get the logits for the index in the sequence
            logits, _ = self(idx_cond)
            # pluck the logits at the final step and scale by desired temperature
            logits = logits[:, -1, :] / temperature
            # optionally crop the logits to only the top k options
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = -float('Inf')
            # apply softmax to convert logits to (normalized) probabilities
            probs = F.softmax(logits, dim=-1)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)
            # append sampled index to the running sequence and continue
            idx = torch.cat((idx, idx_next), dim=1)

        return idx
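
The sampling step inside `generate` can be tried on its own. The sketch below applies temperature scaling and top-k filtering to a hand-written logits vector; the logit values, `temperature = 0.8`, and `top_k = 2` are made up for illustration.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
logits = torch.tensor([[2.0, 1.0, 0.5, -1.0, -3.0]])  # (batch=1, vocab=5), made-up values
temperature, top_k = 0.8, 2

scaled = logits / temperature                 # temperature < 1 sharpens the distribution
v, _ = torch.topk(scaled, top_k)              # the k largest logits, sorted descending
scaled = scaled.masked_fill(scaled < v[:, [-1]], float("-inf"))  # drop everything below the k-th
probs = F.softmax(scaled, dim=-1)
idx_next = torch.multinomial(probs, num_samples=1)

print(probs)
print(idx_next)
```

Only the two largest logits keep non-zero probability, so the sampled index is always one of those two tokens.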

## Training + W&B Demo

Let's test our model with a training loop and track progress with Weights & Biases:

In [28]:
# Config and model (smaller for speed)
config = GPTConfig()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = GPT(config)
model = model.to(device)

# Configure optimizer with higher learning rate for faster demo
optimizer = model.configure_optimizers(
    weight_decay=0.1,
    learning_rate=1e-3,
    betas=(0.9, 0.95),
)

# Data and loader
dataset = TextDataset(data, config.block_size)
loader = DataLoader(dataset, batch_size=32)

# W&B initialization for experiment tracking
wandb.init(project="nanoGPT-cambridge", config=config.__dict__)

# Track the entirety of the model parameters:
wandb.watch(model, log="all", log_freq=10)

# Training loop with W&B logging
for epoch in range(30):
    for inputs, targets in loader:
        # Zero gradients
        optimizer.zero_grad()

        inputs, targets = inputs.to(device), targets.to(device)

        # Forward pass
        logits, loss = model(inputs, targets)

        # Backward pass
        loss.backward()

        # Update weights
        optimizer.step()

        # Log loss to W&B
        wandb.log({"loss": loss.item()})

    # Print progress (loss of the last batch in this epoch)
    print(f"Epoch {epoch}, Loss: {loss.item()}")

    # Generate a sample text after each epoch
    idx = torch.zeros((1, 1), dtype=torch.long, device=device)
    gen = model.generate(idx, max_new_tokens=10)[0].tolist()
    gen_text = ''.join(id_to_token[i] for i in gen)
    wandb.log({"generated": gen_text})

num decayed parameter tensors: 14, with 53,376 parameters
num non-decayed parameter tensors: 26, with 1,312 parameters
Epoch 0, Loss: 6.060898780822754
Epoch 1, Loss: 5.786173343658447
Epoch 2, Loss: 5.566069602966309
Epoch 3, Loss: 5.246857643127441
Epoch 4, Loss: 4.920835971832275
Epoch 5, Loss: 4.6106672286987305
Epoch 6, Loss: 4.3157958984375
Epoch 7, Loss: 4.040560722351074
Epoch 8, Loss: 3.759960174560547
Epoch 9, Loss: 3.4803078174591064
Epoch 10, Loss: 3.1892292499542236
Epoch 11, Loss: 2.908733606338501
Epoch 12, Loss: 2.6430392265319824
Epoch 13, Loss: 2.4047434329986572
Epoch 14, Loss: 2.1739206314086914
Epoch 15, Loss: 1.9398540258407593
Epoch 16, Loss: 1.7195838689804077
Epoch 17, Loss: 1.5091955661773682
Epoch 18, Loss: 1.3294602632522583
Epoch 19, Loss: 1.1861246824264526
Epoch 20, Loss: 1.0832997560501099
Epoch 21, Loss: 1.0123296976089478
Epoch 22, Loss: 0.9456520080566406
Epoch 23, Loss: 0.8823789358139038
Epoch 24, Loss: 0.8141838312149048
Epoch 25, Loss: 0.783283293
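
The two parameter counts printed at the top of this output can be reproduced by hand from the configuration. The sketch below is plain arithmetic, with the values copied from `GPTConfig` and the final vocabulary size of 500. It counts the token embedding once, since `wte` shares its weight with `lm_head` and `named_parameters()` yields a shared tensor only once.

```python
# Values copied from GPTConfig; vocab_size = 500 is the final BPE vocabulary size.
block_size, vocab_size, n_layer, n_embd = 16, 500, 3, 32

# 2-D tensors (weight-decayed in configure_optimizers)
embeddings = vocab_size * n_embd + block_size * n_embd   # wte (shared with lm_head) + wpe
per_block_weights = (
    n_embd * 3 * n_embd      # attn.c_attn
    + n_embd * n_embd        # attn.c_proj
    + n_embd * 4 * n_embd    # mlp.c_fc
    + 4 * n_embd * n_embd    # mlp.c_proj
)
decayed = embeddings + n_layer * per_block_weights

# 1-D tensors (biases and LayerNorm parameters, not decayed)
per_block_vectors = (
    2 * n_embd + 2 * n_embd   # ln_1 and ln_2 (weight + bias each)
    + 3 * n_embd + n_embd     # attention biases (c_attn, c_proj)
    + 4 * n_embd + n_embd     # MLP biases (c_fc, c_proj)
)
not_decayed = n_layer * per_block_vectors + 2 * n_embd   # plus ln_f

print(f"{decayed:,} decayed, {not_decayed:,} not decayed")  # 53,376 decayed, 1,312 not decayed
```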

In [33]:
gen_text

'\nailke ppledicke m mas\n'
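
To judge the logged loss values, it helps to compare them with the cross-entropy of a uniform guess over the vocabulary, which is `ln(vocab_size)`. The sketch below computes that baseline for the 500-token vocabulary and converts the last loss shown above into a per-token perplexity. Keep in mind that these are training-batch losses on a 10,000-character corpus, so a low value may partly reflect memorization rather than generalization.

```python
import math

vocab_size = 500                       # final BPE vocabulary size from above
uniform_loss = math.log(vocab_size)    # cross-entropy of a uniform guess, in nats
print(f"uniform baseline: {uniform_loss:.3f}")

logged_loss = 0.783                    # last loss shown in the output above (rounded)
perplexity = math.exp(logged_loss)     # effective number of equally likely next tokens
print(f"per-token perplexity: {perplexity:.2f}")
```

The loss after the first epoch (about 6.06) sits just below this uniform baseline, which is what one would expect from a model that has barely started to learn.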

## Challenge & Wrap-Up

### Challenge

Try tweaking `n_head` or `learning_rate` and log the results to W&B. Can you improve the loss?

For example:

In [None]:
# Experiment with different configurations
config = GPTConfig(n_head=4)  # Try more attention heads
model = GPT(config).to(device)  # Rebuild the model so the new config takes effect
# OR
optimizer = model.configure_optimizers(weight_decay=0.1, learning_rate=5e-3, betas=(0.9, 0.95))