# gpt, from scratch, in code, spelled out - exercises

Notes on the exercises from the [gpt, from scratch video](https://www.youtube.com/watch?v=kCc8FmEb1nY).

1. Watch the [gpt, from scratch video](https://www.youtube.com/watch?v=kCc8FmEb1nY) on YouTube
2. Come back and solve these exercises to level up :)

I *highly* recommend tackling these exercises with a GPU-enabled machine.

In [2]:
import torch
import random
import torch.nn as nn
from tqdm import tqdm
from datasets import load_dataset
from torch.nn import functional as F
from transformers import AutoTokenizer

## Exercise 1 - The $n$-dimensional tensor mastery challenge

**Objective:** Combine the `Head` and `MultiHeadAttention` into one class that processes all the heads in parallel,<br>
treating the heads as another batch dimension (answer can also be found in [nanoGPT](https://github.com/karpathy/nanoGPT)).

Let's see what we're working with:

In [2]:
block_size = 256 # What is the maximum context length for predictions?
dropout = 0.2    # Dropout probability
n_embd = 384     # Number of hidden units in the Transformer (384/6 = 64 dimensions per head)

In [3]:
class Head(nn.Module):
    """ one head of self-attention """
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Register a buffer so that it is not a parameter of the model
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape   # Batch size, sequence length (T <= block_size), embedding size (C = n_embd)
        k = self.key(x)   # (B,T,C) -> (B,T, head_size)
        q = self.query(x) # (B,T,C) -> (B,T, head_size)
        # Compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5             # (B, T, head_size) @ (B, head_size, T) = (B, T, T); scaled by 1/sqrt(head_size), not 1/sqrt(C)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # Masking all values in wei where tril == 0 with -inf
        wei = F.softmax(wei, dim=-1)                                 # (B, T, T)
        wei = self.dropout(wei)
        # Weighted aggregation of the values
        v = self.value(x) # (B, T, C) -> (B, T, head_size)
        out = wei @ v     # (B, T, T) @ (B, T, head_size) = (B, T, head_size)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)]) # Create num_heads many heads
        self.proj = nn.Linear(n_embd, n_embd)                                   # Projecting back to n_embd dimensions (the original size of the input, because we use residual connections)
        self.dropout = nn.Dropout(dropout)                                      # Dropout layer for regularization

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1) # Concatenate the outputs of all heads
        out = self.dropout(self.proj(out))                  # Project back to n_embd dimensions (because we use residual connections) and apply dropout
        return out

We want the `key`, `query` and `value` Linear layers to be applied across all `num_heads` in parallel.

To start, recall that each token is represented by a vector of size `n_embd=384`.<br>
Multi-head attention distributes this `n_embd` across smaller, equal-sized heads.<br>
In this context, `head_size` is how much each head receives of the total embedding.

We can write the internal `n_embd` flexibly as the product of `num_heads` and `head_size`.<br>
This generalizes better to any number of heads and head sizes.
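
To make the "heads as another batch dimension" idea concrete, here is a minimal shape-tracing sketch. The sizes `B=2` and `T=5` are arbitrary illustration values, not settings from the notebook. It only shows that splitting the channel dimension into `(num_heads, head_size)` gives each head its own contiguous slice of the embedding, and that the split is reversible.

```python
import torch

B, T = 2, 5                      # illustrative batch size and sequence length (assumed)
num_heads, head_size = 6, 64     # 6 heads of 64 dimensions each
n_embd = num_heads * head_size   # 384, as in the notebook

x = torch.randn(B, T, n_embd)

# Split the channel dimension into (num_heads, head_size), then move heads next to the batch dimension
heads = x.view(B, T, num_heads, head_size).transpose(1, 2)   # (B, num_heads, T, head_size)
print(heads.shape)

# Head h sees exactly the channels h*head_size ... (h+1)*head_size - 1 of the input
h = 3
assert torch.equal(heads[:, h], x[:, :, h * head_size:(h + 1) * head_size])

# Undoing the split restores the original layout
restored = heads.transpose(1, 2).contiguous().view(B, T, n_embd)
assert torch.equal(restored, x)
```

The `.contiguous()` call is needed because `transpose` only changes strides, and `view` requires a memory layout compatible with the requested shape.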

Below is the implemented fused `Head` and `MultiHeadAttention` class with comments to guide through:

In [4]:
class MultiHeadAttention(nn.Module):
    """ Multi-head self-attention processing all heads in parallel """
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.num_heads = num_heads           # Apply this many parallel attention layers
        self.head_size = head_size           # Each head has this size (part of the embedding size)
        self.n_embd = num_heads * head_size  # Total size of all heads together forms the token embedding size

        # Combining key, query, and value transformations across heads in a single linear layer each
        # All heads together process the input sequence and all together produce the output sequence
        # As self.n_embd = num_heads * head_size, input and output dim for all heads at once are the same (n_embd)
        self.key = nn.Linear(self.n_embd, self.n_embd, bias=False)
        self.query = nn.Linear(self.n_embd, self.n_embd, bias=False)
        self.value = nn.Linear(self.n_embd, self.n_embd, bias=False)

        # Register a buffer so that causal mask is not a parameter of the model
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        # Final linear output transformation, dropout
        # Same as with the key, query, value transformations,
        # As self.n_embd = num_heads * head_size, input and output dim for all heads at once are the same (n_embd)
        self.proj = nn.Linear(self.n_embd, self.n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape  # Batch size, sequence length (at most block_size), embedding size (C = n_embd)

        # Apply linear transformations to get keys, queries, and values for all heads
        # Produce a dimension for the number of heads and a dimension for the head size
        # We then swap the T and num_heads dimensions so that each head acts like an extra batch dimension in the attention matmul
        k = self.key(x).view(B, T, self.num_heads, self.head_size).transpose(1, 2)    # (B, T, C) -> (B, num_heads, T, head_size)
        q = self.query(x).view(B, T, self.num_heads, self.head_size).transpose(1, 2)  # (B, T, C) -> (B, num_heads, T, head_size)
        v = self.value(x).view(B, T, self.num_heads, self.head_size).transpose(1, 2)  # (B, T, C) -> (B, num_heads, T, head_size)

        # Compute the attention scores
        wei = q @ k.transpose(-2, -1) * self.head_size ** -0.5        # (B, num_heads, T, head_size) @ (B, num_heads, head_size, T) -> (B, num_heads, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # Apply the causal mask, i.e. mask out the upper triangular part of the attention matrix
        wei = F.softmax(wei, dim=-1)  # Normalize attention scores to form (pseudo-)probabilities
        wei = self.dropout(wei)       # Apply dropout, promotes flexibility and robustness

        # Weighted aggregation of values
        out = wei @ v  # (B, num_heads, T, T) @ (B, num_heads, T, head_size) -> (B, num_heads, T, head_size)
        out = out.transpose(1, 2).contiguous().view(B, T, C)  # (B, num_heads, T, head_size) -> (B, T, C)

        # Final projection
        out = self.dropout(self.proj(out))
        return out
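
One way to convince yourself that the fused layout computes the same thing as a loop over separate heads is to copy the per-head weights into one fused `nn.Linear` and compare outputs. The following is a small sketch under assumptions: the sizes are made up for speed, dropout and the output projection are left out so the comparison is deterministic, and both paths scale by `head_size ** -0.5`. The helper names `fuse`, `split_heads` and `attend` are hypothetical and exist only for this check.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
B, T, num_heads, head_size = 2, 8, 4, 16   # small illustrative sizes (assumed)
n_embd = num_heads * head_size

def make_heads():
    # One (n_embd -> head_size) projection per head, as in the original Head class
    return [nn.Linear(n_embd, head_size, bias=False) for _ in range(num_heads)]

def fuse(layers):
    # Stack the per-head weight matrices row-wise into a single (n_embd -> n_embd) projection
    fused = nn.Linear(n_embd, n_embd, bias=False)
    with torch.no_grad():
        fused.weight.copy_(torch.cat([layer.weight for layer in layers], dim=0))
    return fused

def split_heads(t):
    return t.view(B, T, num_heads, head_size).transpose(1, 2)   # (B, num_heads, T, head_size)

masked = torch.tril(torch.ones(T, T)) == 0   # True above the diagonal (future positions)

def attend(q, k, v):
    wei = q @ k.transpose(-2, -1) * head_size ** -0.5
    wei = wei.masked_fill(masked, float('-inf'))
    return F.softmax(wei, dim=-1) @ v

queries, keys, values = make_heads(), make_heads(), make_heads()
x = torch.randn(B, T, n_embd)

with torch.no_grad():
    # Path 1: loop over heads and concatenate
    loop_out = torch.cat(
        [attend(q(x), k(x), v(x)) for q, k, v in zip(queries, keys, values)], dim=-1
    )
    # Path 2: one fused projection each, heads treated as a batch dimension
    fq, fk, fv = fuse(queries), fuse(keys), fuse(values)
    fused_out = attend(split_heads(fq(x)), split_heads(fk(x)), split_heads(fv(x)))
    fused_out = fused_out.transpose(1, 2).contiguous().view(B, T, n_embd)

print(torch.allclose(loop_out, fused_out, atol=1e-5))
```

The check relies on the fact that output channels `h*head_size:(h+1)*head_size` of the fused projection belong to head `h`, which is exactly how `view(B, T, num_heads, head_size)` carves up the last dimension.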

I then integrated this into the video-derived GPT implementation and first ran it on the `tiny-shakespeare.txt` dataset to verify the implementation and to produce the baseline needed for the later exercises:

In [None]:
# Hyperparameters
batch_size = 64      # How many independent sequences to process at once?
block_size = 256     # What is the maximum context length for predictions?
max_iters = 5000     # How many training iterations to run?
eval_interval = 500  # How often to evaluate the model on the validation set?
learning_rate = 3e-4 # Learning rate for Adam optimizer (found through trial and error)
device = 'cuda' if torch.cuda.is_available() else 'cpu' # Don't run on CPU if possible (it's slow. really.)
eval_iters = 200     # How many batches to use per loss evaluation?
n_embd = 384         # Number of hidden units in the Transformer (384/6 = 64 dimensions per head)
n_head = 6           # Number of attention heads in a single Transformer layer
n_layer = 6          # Number of Transformer layers
dropout = 0.2        # Dropout probability

torch.manual_seed(1337)
print(f'Training on {device}')

# Load Tiny Shakespeare dataset
# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
# (also refer to Andrej's blog: http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
with open('../tiny-shakespeare.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Find all unique characters in the text
chars = sorted(list(set(text)))
vocab_size = len(chars)

# Create mappings from characters to indices and vice versa
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s]          # encoder: Take a string, return a list of indices/integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: Take a list of indices/integers, return a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data)) # first 90% of all characters are for training
train_data = data[:n]
val_data = data[n:]

# Data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,)) # Generates a tensor of shape (batch_size,) with random sequence start indices between 0 and len(data) - block_size
    x = torch.stack([data[i:i+block_size] for i in ix])       # Stack all (ix holds batch_size many) sequences of this batch row-wise on top of each other to form a tensor
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])   # Same as x but shifted by one token
    x, y = x.to(device), y.to(device)
    return x, y # x is batch_size x block_size, y is batch_size x block_size

@torch.no_grad() # Disable gradient calculation for this function
def estimate_loss():
    out = {}
    model.eval() # Set model to evaluation mode
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train() # Set model back to training mode
    return out

class MultiHeadAttention(nn.Module):
    """ Multi-head self-attention processing all heads in parallel """
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.num_heads = num_heads           # Apply this many parallel attention layers
        self.head_size = head_size           # Each head has this size (part of the embedding size)
        self.n_embd = num_heads * head_size  # Total size of all heads together forms the token embedding size

        # Combining key, query, and value transformations across heads in a single linear layer each
        # All heads together process the input sequence and all together produce the output sequence
        # As self.n_embd = num_heads * head_size, input and output dim for all heads at once are the same (n_embd)
        self.key = nn.Linear(self.n_embd, self.n_embd, bias=False)
        self.query = nn.Linear(self.n_embd, self.n_embd, bias=False)
        self.value = nn.Linear(self.n_embd, self.n_embd, bias=False)

        # Register a buffer so that causal mask is not a parameter of the model
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        # Final linear output transformation, dropout
        # Same as with the key, query, value transformations,
        # As self.n_embd = num_heads * head_size, input and output dim for all heads at once are the same (n_embd)
        self.proj = nn.Linear(self.n_embd, self.n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape  # Batch size, sequence length, embedding size

        # Apply linear transformations to get keys, queries, and values for all heads
        k = self.key(x).view(B, T, self.num_heads, self.head_size).transpose(1, 2)    # (B, T, C) -> (B, num_heads, T, head_size)
        q = self.query(x).view(B, T, self.num_heads, self.head_size).transpose(1, 2)  # (B, T, C) -> (B, num_heads, T, head_size)
        v = self.value(x).view(B, T, self.num_heads, self.head_size).transpose(1, 2)  # (B, T, C) -> (B, num_heads, T, head_size)

        # Compute the attention scores
        wei = q @ k.transpose(-2, -1) * self.head_size ** -0.5        # (B, num_heads, T, head_size) @ (B, num_heads, head_size, T) -> (B, num_heads, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # Apply the causal mask
        wei = F.softmax(wei, dim=-1)  # Normalize attention scores to form (pseudo-)probabilities
        wei = self.dropout(wei)       # Apply dropout, promotes flexibility and robustness

        # Weighted aggregation of values
        out = wei @ v  # (B, num_heads, T, T) @ (B, num_heads, T, head_size) -> (B, num_heads, T, head_size)
        out = out.transpose(1, 2).contiguous().view(B, T, C)  # (B, num_heads, T, head_size) -> (B, T, C)

        # Final projection
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), # Expand to 4*n_embd hidden units (AIAYN uses an inner feed-forward dimension of 4x the model dimension)
            nn.ReLU(),                     # ReLU introduces non-linearity
            nn.Linear(4 * n_embd, n_embd), # Linear layer with n_embd outputs
            nn.Dropout(dropout),           # Dropout layer for regularization
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """
    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head                    # Adapting the head size to the number of heads
        self.sa = MultiHeadAttention(n_head, head_size) # Self-attention multi-head layer (the communication)
        self.ffwd = FeedFoward(n_embd)                  # Feed-forward so that the output has the same dimension as the input (the computation)
        self.ln1 = nn.LayerNorm(n_embd)                 # Layer normalization (normalizes the output of the self-attention layer)
        self.ln2 = nn.LayerNorm(n_embd)                 # Layer normalization (normalizes the output of the feed-forward layer)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))                    # Residual connection, forking off to the self-attention layer, LayerNorm is applied before the self-attention layer
        x = x + self.ffwd(self.ln2(x))                  # Residual connection, forking off to the feed-forward layer, LayerNorm is again applied before the feed-forward layer
        return x

class BigramLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Token and position embeddings feed the Transformer blocks; lm_head maps the result to next-token logits
        self.token_embd = nn.Embedding(vocab_size, n_embd)                                   # Token embedding table of shape (vocab_size, n_embd), each token is represented by a vector of size n_embd
        self.position_embd = nn.Embedding(block_size, n_embd)                                # Position embedding table of shape (block_size, n_embd), each position is represented by a vector of size n_embd
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)                                         # Linear layer to map the embedding to the vocabulary size

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_embd = self.token_embd(idx)                               # Embedding the input, shape is (batch_size, block_size, n_embd) (B, T, n_embd)
        pos_embd = self.position_embd(torch.arange(T, device=device)) # Embedding the position by providing an integer sequence up to block_size, shape is (block_size, n_embd) (T, n_embd)
        x = tok_embd + pos_embd                                       # Adding the token embedding and the position embedding, shape is (batch_size, block_size, n_embd) (B, T, n_embd)
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)                                      # Calculating the logits, shape is (batch_size, block_size, vocab_size) (B, T, C)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)            # Flatten logits to (B*T, C) (B=batch_size, T=block_size, C=vocab_size)
            targets = targets.view(B*T)             # Flatten targets to (B*T,)
            loss = F.cross_entropy(logits, targets) # Calculating cross entropy loss across all tokens in the batch

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:]                    # Condition on the last block_size tokens (B, T)
            logits, _ = self(idx_cond)                         # Forward pass (this is the forward function) with the current sequence of characters idx, results in (B, T, C)
            logits = logits[:, -1, :]                          # Focus on the last token from the logits (B, T, C) -> (B, C)
            probs = F.softmax(logits, dim=-1)                  # Calculate the set of probabilities for the next token based on this last token, results in (B, C)
            idx_next = torch.multinomial(probs, num_samples=1) # Sample the next token (B, 1), the token with the highest probability is sampled most likely
            idx = torch.cat((idx, idx_next), dim=1)            # Add the new token to the sequence (B, T+1) for the next iteration
        return idx

# Model
model = BigramLanguageModel()
m = model.to(device)
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters') # print the number of parameters in the model

# Create a PyTorch optimizer
opt = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# Training loop
for iter in range(max_iters):
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    xb, yb = get_batch('train')     # Get batch
    logits, loss = model(xb, yb)    # Forward pass
    opt.zero_grad(set_to_none=True) # Reset gradients
    loss.backward()                 # Backward pass
    opt.step()                      # Update parameters

torch.save(model, "gpt_tinyshakespeare.pt")

# Generate text from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)     # Start with single token as context
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))

Training on cuda
10.788929 M parameters
step 0: train loss 4.2837, val loss 4.2825
step 500: train loss 1.8859, val loss 1.9993
step 1000: train loss 1.5367, val loss 1.7271
step 1500: train loss 1.3955, val loss 1.6106
step 2000: train loss 1.3112, val loss 1.5488
step 2500: train loss 1.2523, val loss 1.5206
step 3000: train loss 1.2033, val loss 1.4990
step 3500: train loss 1.1641, val loss 1.4893
step 4000: train loss 1.1263, val loss 1.4817
step 4500: train loss 1.0898, val loss 1.4814
step 4999: train loss 1.0560, val loss 1.5007

KING RICHARD II:
Now from sentence come throw and mercy.
Be yonder that I behold to do tower:
'Tis you be longed on our words to much urge.
Never know, with use boody to make and tweMILet him.
To chain the people mother, the rest veril
Let him bring God hence, but be afflied,
Were traitors with one heart the flower days:
The sarch of a companion, dost thou body this heir?
The again all-as this body's world;
That nightly he should be the bal deform'd,
No

sounds about right.
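
A quick sanity check on the log above: an untrained model that spreads probability evenly over the vocabulary should start near a cross-entropy of $\ln(\text{vocab\_size})$. The sketch below assumes the 65 unique characters that Tiny Shakespeare yields in the video; the notebook itself computes `vocab_size` from the text, so treat the constant as an assumption.

```python
import math

vocab_size = 65                        # assumption: unique characters in Tiny Shakespeare, as in the video
uniform_loss = math.log(vocab_size)    # cross-entropy of a uniform guess over the vocabulary
reported_step0_val_loss = 4.2825       # taken from the training log above

print(f"uniform-guess loss: {uniform_loss:.4f}")
print(f"reported step-0 val loss: {reported_step0_val_loss:.4f}")
```

The reported step-0 loss sits slightly above the uniform baseline, which is plausible for randomly initialized weights that are not perfectly uniform in their predictions.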

## Exercise 2 - Mathematic Mastery

**Objective:** Train the GPT on your own dataset of choice! What other data could be fun to blabber on about?<br>
A fun advanced suggestion if you like: train a GPT to do addition of two numbers, i.e. $a+b=c$. And once you have this, swole doge project: Build a calculator clone in GPT, for all of $+-*/$.<br>
- You may find it helpful to predict the digits of $c$ in reverse order, as the typical addition algorithm (that you're hoping it learns) would proceed right to left too.
- You may want to modify the data loader to simply serve random problems and skip the generation of `train.bin`, `val.bin`.<br>
- You may want to mask out the loss at the input positions of a+b that just specify the problem using y=-1 in the targets (see CrossEntropyLoss ignore_index).
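
The loss-masking hint can be sketched as follows. This is an illustration with random logits and made-up sizes (`vocab_size`, `T` and `n_prompt_targets` are hypothetical), not the code from the exercise script. It shows that setting targets to `-1` and passing `ignore_index=-1` gives the same loss as averaging over the answer positions only.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
vocab_size, T = 14, 6                  # hypothetical vocabulary size and sequence length
logits = torch.randn(1, T, vocab_size)
targets = torch.randint(vocab_size, (1, T))

# Suppose the first 3 target positions only restate the problem (e.g. '+', '3', '=' for "2+3=")
n_prompt_targets = 3
masked_targets = targets.clone()
masked_targets[:, :n_prompt_targets] = -1   # positions marked -1 are skipped by the loss

loss_masked = F.cross_entropy(
    logits.view(-1, vocab_size), masked_targets.view(-1), ignore_index=-1
)

# Reference: compute the loss on the answer positions only
loss_answer_only = F.cross_entropy(
    logits[:, n_prompt_targets:].reshape(-1, vocab_size),
    targets[:, n_prompt_targets:].reshape(-1),
)

print(loss_masked.item(), loss_answer_only.item())
```

Because the targets are the inputs shifted by one token, a prompt of length 4 produces only 3 target positions that belong to the problem statement; the fourth target is already the first answer token.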

**Not an easy problem.** But, [GPT can solve mathematical problems without a calculator](https://arxiv.org/abs/2309.03241).<br>
You may need [Chain of Thought](https://arxiv.org/abs/2412.14135) and other [slightly more advanced architecture](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf) traces, but don't overthink it.

I exported this implementation into a separate script to run it more easily on a larger generated problems dataset.<br>
You can find the script in the `7_gpt_solved_exercise_mathematica.py` file. I pushed it down to train for ~10 minutes on a single GPU.

The trained model can be found here: [nanogpt_mathematica](https://huggingface.co/Marcus2112/nanogpt_mathematica).

Notable changes to the model:
- `block_size` is reduced to $27$ to accommodate the maximum length of a single generated math problem.
- `max_iters` is increased to $20,000$ to allow for more training iterations, `eval_interval` is stretched to $1,000$.
- `learning_rate` is increased to `1e-3` and now scheduled with `CosineAnnealingLR` to decay to `1e-4`.
- `n_head` and `n_layer` are both reduced from $6$ to $4$.
- `dropout` is decreased heavily from $0.2$ to $0.05$ to allow for more thorough learning and retention.
- `generate_problem` helps in the on-demand generation of batches both for training and validation.
- `generate_problem` introduces the operations `+`, `-`, `*`, `/` in that order gradually and smoothly as training progresses.
- `generate_problem` introduces the `=` character to signal the end of a problem statement, and the `;` character to indicate the end of a solution statement.
- `generate_problem` introduces addition-based chain of thought for multiplication with problems looking like `2*3=[2+2+2=6]=6;` to help the model interpret multiplication as repeated addition.
- A padding character `#` is used and handed over as `ignore_index` to the `F.cross_entropy` loss function to ignore the padding in the loss calculation.
- The inference section now starts out with a math problem and the model is tasked with solving it.
- The model output is cut at the first appearance of the `;` character to extract the solution from the output.
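
To give a feel for the problem format described in the list, here is a rough, self-contained sketch of a generator. It is not the code from `7_gpt_solved_exercise_mathematica.py`: the operand ranges, the step-wise curriculum in `allowed_ops`, and the helper `make_problem` are assumptions made for illustration. Only the `=`, `;`, `#` conventions, the bracketed chain of thought, the operation order and the block size of 27 come from the notes above.

```python
import random

BLOCK_SIZE = 27              # from the notes above
PAD = '#'                    # padding character from the notes above
OPS = ['+', '-', '*', '/']   # order of introduction from the notes above

def allowed_ops(progress):
    """Operations unlocked at training progress in [0, 1] (assumed step-wise schedule)."""
    unlocked = min(len(OPS), 1 + int(progress * len(OPS)))
    return OPS[:unlocked]

def make_problem(a, b, op):
    """Format one sample as '<problem>=<solution>;' (hypothetical helper)."""
    if op == '+':
        return f"{a}+{b}={a + b};"
    if op == '-':
        return f"{a}-{b}={a - b};"
    if op == '*':
        steps = '+'.join([str(a)] * b)               # a added b times, e.g. 2+2+2
        return f"{a}*{b}=[{steps}={a * b}]={a * b};"
    if op == '/':
        return f"{a * b}/{b}={a};"                   # build the dividend so the division is exact
    raise ValueError(f"unknown operation: {op}")

def generate_problem(progress, rng=random):
    """Draw a random problem and pad it with '#' up to BLOCK_SIZE characters."""
    op = rng.choice(allowed_ops(progress))
    a = rng.randint(1, 9)
    b = rng.randint(1, 7)    # assumed cap so the longest multiplication still fits into 27 characters
    if op == '-' and b > a:
        a, b = b, a          # keep differences non-negative in this sketch
    return make_problem(a, b, op).ljust(BLOCK_SIZE, PAD)

rng = random.Random(1337)
for progress in (0.0, 0.5, 1.0):
    print(progress, generate_problem(progress, rng))
```

The real script introduces operations "gradually and smoothly"; a probability-weighted choice over `OPS` would be closer to that than the hard thresholds used here.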

The trained model produced this exact output:
```
Input: 2+3=
Generated: 2+3=3=5]=3
Expected: 5

Input: 5-2=
Generated: 5-2=2=3]=2
Expected: 3

Input: 4*3=
Generated: 4*3=4+4=12]=1=2222]]122818
Expected: 12

Input: 8/4=
Generated: 8/4=/4=/]=2
Expected: 2
```

The model's output format consistently places the final result either after the last `=` before `]` or as the terminal number if no `=` precedes `]`.<br>
The model is rambling quite a bit still, but the results indicate a clear learning effect across all four operations.<br>
This conclusion is supported by a deliberate design of example input sequences.<br>
I made the examples such that no input sequence would hold digits that are also part of the expected solution.<br>
And yet, crucially, the model is able to generate solution digits as part of the output.
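
The reading rule described above can be turned into a small parser. This is one possible interpretation, not the extraction logic of the original script: the function `extract_result` is hypothetical, and it falls back to the trailing number whenever the segment before the first `]` does not end in digits.

```python
import re

def extract_result(generated, prompt):
    """Pull the predicted result out of a generated string (hypothetical helper)."""
    body = generated[len(prompt):]             # drop the prompt, keep only what the model produced
    head, bracket, _ = body.partition(']')     # text before the first ']'
    if bracket:
        candidate = head.rsplit('=', 1)[-1]    # whatever follows the last '=' before ']'
        if candidate.isdigit():
            return candidate
    trailing = re.search(r'(\d+)$', body)      # otherwise: the number the output ends with
    return trailing.group(1) if trailing else None

samples = [
    ("2+3=", "2+3=3=5]=3", "5"),
    ("5-2=", "5-2=2=3]=2", "3"),
    ("4*3=", "4*3=4+4=12]=1=2222]]122818", "12"),
    ("8/4=", "8/4=/4=/]=2", "2"),
]
for prompt, generated, expected in samples:
    print(prompt, extract_result(generated, prompt), "expected", expected)
```

Applied to the four outputs listed above, this reading recovers the expected result in each case, which is the sense in which the rambling outputs still contain the right answer.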

The model has $7.1213\ \text{M}$ parameters, started training at a loss of $3.3212$, and finished with a reported loss of $0.0000$.

## Exercise 3 - Finetuning for the better?

**Objective:** Find a dataset that is very large, so large that you can't see a gap between train and val loss.<br>
Pretrain the transformer on this data. Then, initialize with that model and finetune it on `tiny-shakespeare` with a smaller number of steps and lower learning rate.<br>
Can you obtain a lower validation loss by the use of large-scale pretraining?

I will use the [minipile_density-proportioned](https://huggingface.co/datasets/Marcus2112/minipile_density-proportioned) dataset on Hugging Face.<br>
It has $946\text{k}$ text examples and $2$ features, `text` and `pile_idx`. We only need the `text` feature.

I exported the code for this exercise into a separate script to run it more easily on the bigger dataset.<br>
You can find the script in the `7_gpt_solved_exercise_finetune.py` file.

The pretrained model can be found here: [nanogpt_base](https://huggingface.co/Marcus2112/nanogpt_base).<br>
The finetuned model can be found here: [nanogpt_shakespeare](https://huggingface.co/Marcus2112/nanogpt_shakespeare).
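
The handoff from pretraining to finetuning asked for in the objective can be sketched in a few lines. This is a toy stand-in, not the code from `7_gpt_solved_exercise_finetune.py`: `make_model` is a hypothetical two-layer placeholder for the GPT, the weights go through an in-memory buffer instead of a file, and the factor of 10 between the learning rates is an arbitrary illustration value.

```python
import io
import torch
import torch.nn as nn

def make_model():
    # Hypothetical stand-in for the GPT: same architecture for pretraining and finetuning
    return nn.Sequential(nn.Embedding(100, 32), nn.Linear(32, 100))

# "Pretraining" result: save only the weights (state_dict), here into an in-memory buffer
pretrained = make_model()
buffer = io.BytesIO()
torch.save(pretrained.state_dict(), buffer)
buffer.seek(0)

# Finetuning starts from the pretrained weights instead of a random initialization
finetune_model = make_model()
finetune_model.load_state_dict(torch.load(buffer))

pretrain_lr = 3e-4                 # illustrative value
finetune_lr = pretrain_lr / 10     # assumed: a lower learning rate for finetuning
opt = torch.optim.AdamW(finetune_model.parameters(), lr=finetune_lr)

print(opt.param_groups[0]['lr'])
```

For `load_state_dict` to succeed, the finetuning model must have the same parameter shapes as the pretrained one, which in practice means keeping the same tokenizer and vocabulary size for both stages.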

**Why did I choose this dataset?**

I made the above dataset based on [\[Kaddour, Jean. 2023\]](https://arxiv.org/abs/2304.08442). The paper proposes a method to create a distilled dataset that, when used for training, enables models to achieve performance comparable to those trained on datasets $\sim100 \times$ larger. Akin to the paper, I built [minipile_density-proportioned](https://huggingface.co/datasets/Marcus2112/minipile_density-proportioned) by extending the paper's idea on how to distill a bigger dataset.

I use it here because [minipile_density-proportioned](https://huggingface.co/datasets/Marcus2112/minipile_density-proportioned) is built to be as representative as possible of an even larger, diverse dataset (The Pile Deduplicated). Using this distilled version makes pretraining effective, multi-purpose, yet way faster, way more resource efficient and thus more accessible for learners with resource-constrained environments.

This was the raw output of my script:

---
```
step 0: train loss 10.9707, val loss 10.9712
step 1000: train loss 6.3237, val loss 6.3552
step 2000: train loss 5.9566, val loss 5.9684
step 3000: train loss 5.6662, val loss 5.6960
step 4000: train loss 5.5015, val loss 5.5110
step 5000: train loss 5.3519, val loss 5.3793
step 6000: train loss 5.2336, val loss 5.2582
step 7000: train loss 5.1896, val loss 5.1757
step 8000: train loss 5.0282, val loss 5.0876
step 9000: train loss 5.0095, val loss 5.0130
step 10000: train loss 4.9012, val loss 4.9480
step 11000: train loss 4.8356, val loss 4.8823
step 12000: train loss 4.8116, val loss 4.8242
step 13000: train loss 4.8621, val loss 4.7645
step 14000: train loss 4.8088, val loss 4.7139
step 15000: train loss 4.7179, val loss 4.6632
step 16000: train loss 4.5663, val loss 4.6210
step 17000: train loss 4.5689, val loss 4.5840
step 18000: train loss 4.5564, val loss 4.5486
step 19000: train loss 4.4584, val loss 4.5159
step 20000: train loss 4.4928, val loss 4.4862
step 21000: train loss 4.4235, val loss 4.4589
step 22000: train loss 4.3791, val loss 4.4317
step 23000: train loss 4.3556, val loss 4.4106
step 24000: train loss 4.3154, val loss 4.3879
step 25000: train loss 4.3555, val loss 4.3737
step 26000: train loss 4.3680, val loss 4.3523
step 27000: train loss 4.3849, val loss 4.3361
step 28000: train loss 4.2780, val loss 4.3169
step 29000: train loss 4.3688, val loss 4.3038
step 29268: train loss 4.4642, val loss 4.2959

! Used by Oakland Spaniow on Ste-Rervica


PING WOENET HAS


OTG


May 2, 2018 4:11 AMAY/3/NWR Legal DAITIONART


Kore are cute and diligent. For others this will soon please huge impressions along wouldn’t have Patton Magic ever to watch him. This may be an
s grace and sometimes anti-inviting at it. We have to watch this in any moment but let it borrow my mind to the finals, drowning hand (average) contrast. 94, the sale of Christmas
was to break and have looked back along with some
life at something and some sort of rage for this.

27 April 2018 - What am I kind?

Understanding and formal […]John Farosizzled to Unpluglecka Boxers?

William A. Fur-Reth professor to you in few
cause we know that missed our baby helps raise to add% confidence from bailed, and
silly's, depending on your spouse, else

30 April 2018 2:21 AM 15:

One thing first that has ever progressed to those
and feel
to that once again. In fact, you suffer from embarrassment which 

Starting fine-tuning...
Fine-tuning step 0: train loss 5.5030, val loss 5.2990
Fine-tuning step 1000: train loss 7.6126, val loss 7.6633
Fine-tuning step 2000: train loss 8.0405, val loss 8.1042
Fine-tuning step 3000: train loss 8.3200, val loss 8.3808
Fine-tuning step 4000: train loss 8.6191, val loss 8.6992
Fine-tuning step 4999: train loss 8.7287, val loss 8.8173

Generated text after fine-tuning:
!My work hath yet not warm'd me: he is well:
The blood I drop is rather physical
Than dangerous to me: to Aufidius thus
I will appear, and fight.


LARTIUS:
Now the fair goddess, Fortune,
Fall deep in love with thee; and her great charms
Misguide thy opposers' swords! Bold gentleman,
Prosperity be thy page!


MARCIUS:
Thy friend no less
Than those she placeth highest! So, farewell.


LARTIUS:
Thou worthiest Marcius!
Go, sound thy trumpet in the market-place;
Call thither all the officers o' the town,
Where they shall know our mind: away!


COMINIUS:
Breathe you, my friends: well fought;
we are come off
Like Romans, neither foolish in our stands,
Nor cowardly in retire: believe me, sirs,
We shall be charged again. Whiles we have struck,
By interims and conveying gusts we have heard
The charges of our friends. Ye Roman gods!
Lead their successes as we wish our own, we
```
---

|Model|Train loss|Val loss|
|-|-|-|
| Tiny-Shakespeare only | 1.0560 | 1.5007 |
| MiniPile-Density + Tiny-Shakespeare | 8.7287 | 8.8173 |
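
One way to read these numbers is to look at the gap between train and val loss rather than at the absolute values. The sketch below only does arithmetic on the reported losses. The vocabulary estimate at the end is a rough inference from the step-0 pretraining loss (assuming a near-uniform initial prediction), not something confirmed by the script.

```python
import math

# (train loss, val loss) as reported in the logs above
results = {
    "Tiny-Shakespeare only": (1.0560, 1.5007),
    "MiniPile-Density + Tiny-Shakespeare": (8.7287, 8.8173),
}

gaps = {name: val - train for name, (train, val) in results.items()}
for name, gap in gaps.items():
    print(f"{name}: val - train = {gap:.4f}")

# Rough vocabulary size implied by a near-uniform step-0 loss during pretraining
implied_vocab = math.exp(10.9707)
print(f"implied vocabulary size: about {implied_vocab:,.0f} tokens")
```

The character-level baseline has a vocabulary of only a few dozen symbols, while the pretraining log points to tens of thousands of tokens. If that inference holds, the two rows measure loss per character and loss per token respectively, so their absolute values are not directly comparable and the gap is the more meaningful quantity.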

**Ok, what do we make of this?**

The loss progression during finetuning looks horrible for the pretrained model, but it really is not.<br>
The results indicate that the non-pretrained model is plainly overfitting on the smaller `tiny-shakespeare` dataset.<br>
The pretrained model is not overfitting, hence the higher error both on train and val sets. But is it still learning anything useful?<br>
Does the finetuning even make sense? Let's compare.

**Output of Tiny-Shakespeare only:** 

```
KING RICHARD II:
Now from sentence come throw and mercy.
Be yonder that I behold to do tower:
'Tis you be longed on our words to much urge.
Never know, with use boody to make and tweMILet him.
To chain the people mother, the rest veril
Let him bring God hence, but be afflied,
Were traitors with one heart the flower days:
The sarch of a companion, dost thou body this heir?
The again all-as this body's world;
That nightly he should be the bal deform'd,
Not latt, but sadisment a medow's is that
Now
```

**Output of MiniPile-Density + Tiny-Shakespeare:**

```
My work hath yet not warm'd me: he is well:
The blood I drop is rather physical
Than dangerous to me: to Aufidius thus
I will appear, and fight.

LARTIUS:
Now the fair goddess, Fortune,
Fall deep in love with thee; and her great charms
Misguide thy opposers' swords! Bold gentleman,
Prosperity be thy page!

MARCIUS:
Thy friend no less
Than those she placeth highest! So, farewell.

LARTIUS:
Thou worthiest Marcius!
Go, sound thy trumpet in the market-place;
Call thither all the officers o' the town,
Where they shall know our mind: away!

COMINIUS:
Breathe you, my friends: well fought;
we are come off
Like Romans, neither foolish in our stands,
Nor cowardly in retire: believe me, sirs,
We shall be charged again. Whiles we have struck,
By interims and conveying gusts we have heard
The charges of our friends. Ye Roman gods!
```

The pretrained model's output seems way more coherent and structured.<br>
There are also more complex rhymes and characters referencing each other.

We see that the validation loss on the finetuning dataset alone cannot be used as the sole indicator of the model's quality.<br>
And while we're at it: no, you shouldn't expect a smaller validation loss when finetuning a pretrained model.
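
To put the validation losses from the table on a more intuitive scale, they can be converted to perplexity. This is a small sketch, assuming the reported losses are mean cross-entropy values in nats (which is what PyTorch's `F.cross_entropy` returns); the dictionary name `val_losses` is just for illustration.

```python
import math

# Validation losses copied from the table above
val_losses = {
    "Tiny-Shakespeare only": 1.5007,
    "MiniPile-Density + Tiny-Shakespeare": 8.8173,
}

# Perplexity is exp(cross-entropy) when the loss is measured in nats (assumption)
perplexities = {name: math.exp(loss) for name, loss in val_losses.items()}

for name, ppl in perplexities.items():
    print(f"{name}: val loss {val_losses[name]:.4f} -> perplexity {ppl:.1f}")
```

Keep in mind that perplexities are only directly comparable between models that share the same tokenization, since a per-token loss depends on what counts as a token.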

**Why is the finetuned model's validation loss higher?**

Finetuning nudges the rich understanding attained from the larger dataset to incorporate information from the smaller finetuning dataset.<br>
In other words, the model generalizes to the patterns and nuances of the smaller dataset while retaining the broader understanding from the larger dataset.<br>
Since the pretrained model is optimizing for generalization rather than memorization, the finetuning dataset's validation loss may be higher, yet the outputs are richer and more aligned with real-world expectations.

**Why don't we just use a dataset of historic texts for pretraining?**

You can, but pretraining is supposed to allow the model to learn general patterns and structures from across a wide range of tasks and domains.<br>
If you pretrain on a dataset that is too specific, i.e. too similar to your finetuning dataset, you risk overfitting to the characteristics of that narrow domain. You would likely attain a lower validation loss, but the model would be less flexible, less knowledgeable about the broader world, and thus less creative.<br>
This is why it is recommended to pretrain on a large, diverse dataset and then finetune on a smaller, specific dataset to leverage both more flexibly.

**Why didn't you choose a higher learning rate for the finetuning?**

Pretraining is done at `3e-4` and finetuning at `2e-5` to avoid what's called catastrophic forgetting:<br>
Finetuning at higher learning rates risks overwriting the broader knowledge from pretraining with the tiny dataset's patterns, rendering pretraining pretty much useless.<br>
Choosing a lower learning rate helps the model retain the general knowledge while adapting to the specifics of the finetuning dataset.<br>
We can see above that the small learning rate indeed suffices for the model to capture the topic, style, and structure of the `tiny-shakespeare` dataset.
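
The two-stage learning-rate setup described above can be sketched roughly like this. The AdamW optimizer and the stand-in `nn.Linear` model are assumptions for illustration only; the learning rates are the ones stated in the text.

```python
import torch
import torch.nn as nn

PRETRAIN_LR = 3e-4  # learning rate used for pretraining (from the text above)
FINETUNE_LR = 2e-5  # smaller learning rate used for finetuning (from the text above)

# Hypothetical stand-in for the GPT model; any nn.Module is handled the same way
model = nn.Linear(8, 8)

# Stage 1: pretraining optimizer (AdamW is an assumption)
optimizer = torch.optim.AdamW(model.parameters(), lr=PRETRAIN_LR)

# ... the pretraining loop would run here ...

# Stage 2: a fresh optimizer with the lower finetuning learning rate,
# so each update only nudges the pretrained weights
optimizer = torch.optim.AdamW(model.parameters(), lr=FINETUNE_LR)
finetune_lr = optimizer.param_groups[0]["lr"]
```

With these values, each finetuning step is 15 times smaller than a pretraining step for the same gradient, which is the intended brake on catastrophic forgetting.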

## Exercise 4 - Read up and implement

**Objective:** Read some transformer papers and implement one additional feature or change that people seem to use.<br>
Does it improve the performance of your GPT?

I will go for [Gated Linear Units (GLU) (Shazeer, Noam. 2020)](https://arxiv.org/abs/2002.05202).<br>
Essentially, we replace each ReLU activation with a gating mechanism, allowing the model to control information flow in a learnable, more dynamic way.<br>
The rest of the model remains unchanged.
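
A minimal sketch of what such a gated feed-forward block could look like in PyTorch. The class name `GLUFeedForward`, the 4x hidden width, and the sigmoid gate are assumptions for illustration; the actual script may use a different GLU variant.

```python
import torch
import torch.nn as nn

class GLUFeedForward(nn.Module):
    """Feed-forward block with a sigmoid-gated linear unit instead of ReLU (sketch)."""
    def __init__(self, n_embd, dropout=0.2):
        super().__init__()
        hidden = 4 * n_embd                      # hidden width is an assumption
        self.value = nn.Linear(n_embd, hidden)   # content path
        self.gate = nn.Linear(n_embd, hidden)    # learned gate path
        self.proj = nn.Linear(hidden, n_embd)    # project back to the residual stream
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Elementwise gating: sigmoid(gate) in (0, 1) scales each hidden unit
        gated = self.value(x) * torch.sigmoid(self.gate(x))
        return self.dropout(self.proj(gated))

x = torch.randn(2, 5, 384)            # (B, T, n_embd)
out = GLUFeedForward(n_embd=384)(x)   # same shape as the input
```

Note that the extra `gate` projection adds parameters compared to the plain ReLU block, so a strict like-for-like comparison would shrink the hidden width to compensate.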

I exported the code for this exercise into a separate script to run it more easily on [minipile_density-proportioned](https://huggingface.co/datasets/Marcus2112/minipile_density-proportioned).<br>
You can find the script in the `7_gpt_solved_exercise_opti.py` file.

The pretrained model can be found here: [nanogpt_glu_base](https://huggingface.co/Marcus2112/nanogpt_glu_base).<br>
The finetuned model can be found here: [nanogpt_glu_shakespeare](https://huggingface.co/Marcus2112/nanogpt_glu_shakespeare)

With the same setup, changing nothing except the GLU activation, we get:

|Model|Losses|
|-|-|
| Tiny-Shakespeare only | train loss 1.0560, val loss 1.5007 |
| MiniPile-Density + Tiny-Shakespeare | train loss 8.7287, val loss 8.8173 |
| MiniPile-Density + Tiny-Shakespeare (GLU) | train loss 7.6610, val loss 7.6660 |

Let's see the generated text after finetuning with the GLU activation:

```
Was ever man so proud as is this Marcius!

BRUTUS:
He has no equal.

SICINIUS:
When we were chosen tribunes for the people,--

BRUTUS:
Mark'd you his lip and eyes?

SICINIUS:
Nay. but his taunts.

BRUTUS:
Being moved, he will not spare to gird the gods.

SICINIUS:
Be-mock the modest moon.

BRUTUS:
The present wars devour him: he is so valiant.

SICINIUS:
O, true-bred!

BRUTUS:
O, good success, disdains the shadow
Which he treads on at noon: but I do wonder
His insolence can brook to be commanded
Under Cominius.

BRUTUS:
Fame, at the which he aims,
In whom already he's well graced, can not
Better be held nor more attain'd than by
A place below the first:--

BRUTUS:
Fame, at the which he aims,
```

We have two characters, Brutus and Sicinius, talking about a third, Marcius. Moreover, the text can actually be interpreted:<br>
The dialogue reflects a sort of tension between admiration for Marcius’s valor and criticism of his hubris, discussed by Brutus and Sicinius.

The GLU activation seems to have improved the model's ability to generate coherent, structured text.<br>
Through that, the model seems to have learned more complex relationships between (three) characters and is able to generate more nuanced dialogue.<br>
The validation loss is not significantly lower, but the generated text is noticeably more coherent and structured.<br>
Also, note how validation and train losses are still way higher than the non-pretrained model's losses, confirming what we discussed in exercise $3$.

**Why does GLU help?**

GLU allows the model to control the flow of information learnably.<br>
Before, ReLU activations would simply zero out negative values, which can be too harsh and is not learnable.<br>
GLU, on the other hand, allows the model to learn how much of each input to keep and how much to discard, negative values included, even in the context of the surrounding tokens.<br>
This makes activations more dynamic and precise, allowing the model to learn more complex relationships and structures in the data.
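
To make the difference concrete, here is a tiny numeric sketch. The gate logits are made-up values for illustration; in a real model they would come from a learned linear layer.

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.5, 2.0])

# ReLU: every negative input becomes exactly zero, no matter what
relu_out = torch.relu(x)

# Gating: a sigmoid in (0, 1) decides per element how much of x passes through.
# Hypothetical gate logits: large positive = keep, large negative = discard
gate_logits = torch.tensor([3.0, -3.0, 3.0, -3.0])
glu_out = x * torch.sigmoid(gate_logits)

print("ReLU :", relu_out)
print("Gated:", glu_out)
```

Here the gate keeps most of the negative input `-2.0` while nearly suppressing the positive input `2.0`, something a fixed ReLU cannot do.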

Models trained for the solutions above:

- Exercise 2:
    - [nanogpt_mathematica](https://huggingface.co/Marcus2112/nanogpt_mathematica)
- Exercise 3:
    - [nanogpt_base](https://huggingface.co/Marcus2112/nanogpt_base)
    - [nanogpt_shakespeare](https://huggingface.co/Marcus2112/nanogpt_shakespeare)
- Exercise 4:
    - [nanogpt_glu_base](https://huggingface.co/Marcus2112/nanogpt_glu_base)
    - [nanogpt_glu_shakespeare](https://huggingface.co/Marcus2112/nanogpt_glu_shakespeare)

<center>Notebook by <a href="https://github.com/mk2112" target="_blank">mk2112</a>.</center>