# GPT, from scratch, in code, spelled out - Exercises

Notes on the exercises from the [gpt, from scratch video](https://www.youtube.com/watch?v=kCc8FmEb1nY).

1. Watch the [gpt, from scratch video](https://www.youtube.com/watch?v=kCc8FmEb1nY) on YouTube
2. Come back and solve these exercises to level up :)

I *highly* recommend tackling these exercises with a GPU-enabled machine.

In [5]:
import torch
import random
import torch.nn as nn
from tqdm import tqdm
from torch.nn import functional as F

## Exercise 1 - The $n$-dimensional tensor mastery challenge

**Objective:** Combine the `Head` and `MultiHeadAttention` into one class that processes all the heads in parallel,<br>
treating the heads as another batch dimension (answer can also be found in [nanoGPT](https://github.com/karpathy/nanoGPT)).

Let's see what we're working with:

In [2]:
block_size = 256 # What is the maximum context length for predictions?
dropout = 0.2    # Dropout probability
n_embd = 384     # Number of hidden units in the Transformer (384/6 = 64 dimensions per head)

In [3]:
class Head(nn.Module):
    """ one head of self-attention """
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Register a buffer so that it is not a parameter of the model
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape   # Batch size, sequence length (at most block_size), embedding size (C = n_embd)
        k = self.key(x)   # (B,T,C) -> (B,T, head_size)
        q = self.query(x) # (B,T,C) -> (B,T, head_size)
        # Compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5             # Scale by head_size (the video's C**-0.5 scales by n_embd instead); (B, T, head_size) @ (B, head_size, T) = (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # Masking all values in wei where tril == 0 with -inf
        wei = F.softmax(wei, dim=-1)                                 # (B, T, T)
        wei = self.dropout(wei)
        # Weighted aggregation of the values
        v = self.value(x) # (B, T, C) -> (B, T, head_size)
        out = wei @ v     # (B, T, T) @ (B, T, head_size) = (B, T, head_size)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)]) # Create num_heads many heads
        self.proj = nn.Linear(n_embd, n_embd)                                   # Projecting back to n_embd dimensions (the original size of the input, because we use residual connections)
        self.dropout = nn.Dropout(dropout)                                      # Dropout layer for regularization

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1) # Concatenate the outputs of all heads
        out = self.dropout(self.proj(out))                  # Project back to n_embd dimensions (because we use residual connections) and apply dropout
        return out

End of copy-pasting from the video, let's get to work!

In [4]:
# TODO: Merge the two classes into one
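
A minimal sketch of one way to do the merge, not the only one (nanoGPT's `CausalSelfAttention` follows the same idea). The class name `ParallelMultiHeadAttention` and the fused `qkv` projection are choices made here for illustration, and the hyperparameters are repeated so the block runs on its own. It assumes `num_heads * head_size == n_embd`, and it scales the scores by `head_size**-0.5`, the usual scaled dot-product convention.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

# Hyperparameters repeated from the cell above so this sketch is self-contained
block_size = 256
dropout = 0.2
n_embd = 384

class ParallelMultiHeadAttention(nn.Module):
    """ all heads of causal self-attention computed in one batched pass (illustrative name) """
    def __init__(self, num_heads, head_size):
        super().__init__()
        assert num_heads * head_size == n_embd, "heads must tile the embedding dimension"
        self.num_heads = num_heads
        self.head_size = head_size
        # One fused projection replaces the per-head key/query/value layers
        self.qkv = nn.Linear(n_embd, 3 * n_embd, bias=False)
        self.proj = nn.Linear(n_embd, n_embd)
        self.attn_dropout = nn.Dropout(dropout)
        self.resid_dropout = nn.Dropout(dropout)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(n_embd, dim=2)  # three tensors of shape (B, T, C)
        # Split C into (num_heads, head_size) and move the head axis next to the batch axis
        q = q.view(B, T, self.num_heads, self.head_size).transpose(1, 2)  # (B, nh, T, hs)
        k = k.view(B, T, self.num_heads, self.head_size).transpose(1, 2)  # (B, nh, T, hs)
        v = v.view(B, T, self.num_heads, self.head_size).transpose(1, 2)  # (B, nh, T, hs)
        wei = q @ k.transpose(-2, -1) * self.head_size**-0.5              # (B, nh, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))      # causal mask, broadcast over B and nh
        wei = F.softmax(wei, dim=-1)
        wei = self.attn_dropout(wei)
        out = wei @ v                                                     # (B, nh, T, hs)
        out = out.transpose(1, 2).contiguous().view(B, T, C)              # concatenate the heads again
        return self.resid_dropout(self.proj(out))
```

With `num_heads=6` and `head_size=64`, an input of shape `(B, T, 384)` comes back with the same shape, so the class can stand in for `MultiHeadAttention` inside a block.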

I then integrated this into the video-derived GPT implementation and first ran it on the `tiny-shakespeare.txt` dataset to verify the implementation and produce the baseline needed for later exercises:

In [None]:
# TODO: Integrate the combined class from above into the model
# TODO: Verify that your new model works by training it on tiny-shakespeare.txt (you'll need the training loss info and results later)

## Exercise 2 - Mathematical Mastery

**Objective:** Train the GPT on your own dataset of choice! What other data could be fun to blabber on about?<br>
A fun advanced suggestion if you like: train a GPT to do addition of two numbers, i.e. $a+b=c$.

- You may find it helpful to predict the digits of $c$ in reverse order, as the typical addition algorithm (that you're hoping it learns) would proceed right to left too.
- You may want to modify the data loader to simply serve random problems and skip the generation of `train.bin`, `val.bin`.
- You may want to mask out the loss at the input positions of $a+b$ that just specify the problem using $y=-1$ in the targets (see `CrossEntropyLoss` `ignore_index`).

Does your Transformer learn to add? Once you have this, swole doge project: Build a calculator clone in GPT, for all of $+-*/$.
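
The three hints above can be sketched as follows. This is only one possible encoding, under assumptions that are not part of the exercise: a 12-character vocabulary, zero-padded operands so every example has the same length, and the hypothetical helper names `make_problem` and `encode_example`.

```python
import random
import torch
from torch.nn import functional as F

chars = "0123456789+="                      # assumed vocabulary for addition problems
stoi = {ch: i for i, ch in enumerate(chars)}

def make_problem(ndigit=3):
    """Serve one random problem as (prompt, answer) strings, no train.bin/val.bin needed."""
    a = random.randrange(10**ndigit)
    b = random.randrange(10**ndigit)
    prompt = f"{a:0{ndigit}d}+{b:0{ndigit}d}="
    # The sum has at most ndigit+1 digits; reverse it so the model emits the lowest digit first
    answer = f"{a + b:0{ndigit + 1}d}"[::-1]
    return prompt, answer

def encode_example(prompt, answer):
    """Turn a problem into (x, y) where y is x shifted by one and prompt targets are -1."""
    ids = [stoi[ch] for ch in prompt + answer]
    x = torch.tensor(ids[:-1])
    y = torch.tensor(ids[1:])
    y[: len(prompt) - 1] = -1               # targets that only restate the problem are ignored
    return x, y

x, y = encode_example(*make_problem())
logits = torch.randn(len(x), len(chars))    # random stand-in for the model output
loss = F.cross_entropy(logits, y, ignore_index=-1)  # averaged over the answer digits only
```

Because every example has the same length, a batch is just a `torch.stack` of several `(x, y)` pairs.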

**Not an easy problem.** But [GPT can solve mathematical problems without a calculator](https://arxiv.org/abs/2309.03241).<br>
You may need [Chain of Thought](https://arxiv.org/abs/2412.14135) traces and a [slightly more advanced architecture](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf), but don't overthink it.

In [None]:
# TODO: Train a model on mathematical expressions so that (some) generated expressions are valid

## Exercise 3 - Finetuning for the better?

**Objective:** Find a dataset that is very large, so large that you can't see a gap between train and val loss.<br>
Pretrain the transformer on this data. Then, initialize with that model and finetune it on `tiny shakespeare` with a smaller number of steps and lower learning rate.<br>
Can you obtain a lower validation loss by the use of large-scale pretraining?
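
The pretrain-then-finetune workflow could look roughly like this. It is a sketch under assumptions: `TinyLM` is a hypothetical stand-in for the GPT model, the checkpoint is kept in memory instead of a file, random tokens replace real batches, and the step count and learning rate are illustrative values. It also assumes both datasets are encoded with the same vocabulary, since otherwise the embedding and output layers would not match.

```python
import io
import torch
import torch.nn as nn
from torch.nn import functional as F

vocab_size = 65  # assumed shared character vocabulary

class TinyLM(nn.Module):
    """ hypothetical stand-in for the GPT model; only the workflow matters here """
    def __init__(self, n_embd=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, n_embd)
        self.head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx):
        return self.head(self.emb(idx))  # (B, T) -> (B, T, vocab_size)

torch.manual_seed(0)

# 1) Pretraining on the large dataset would happen here; we only save the weights
pretrained = TinyLM()
checkpoint = io.BytesIO()                # in-memory stand-in for a checkpoint file
torch.save(pretrained.state_dict(), checkpoint)
checkpoint.seek(0)

# 2) Initialize the model to fine-tune from the pretrained weights
model = TinyLM()
model.load_state_dict(torch.load(checkpoint))

# 3) Fine-tune with fewer steps and a lower learning rate than in pretraining
finetune_steps = 20
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
for _ in range(finetune_steps):
    xb = torch.randint(0, vocab_size, (4, 8))  # random stand-in for tiny-shakespeare inputs
    yb = torch.randint(0, vocab_size, (4, 8))  # random stand-in for the shifted targets
    logits = model(xb)
    loss = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
```

For the comparison, evaluate the fine-tuned model and the from-scratch baseline on the same tiny shakespeare validation split.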

In [None]:
# TODO: Train a model on a text dataset bigger than tiny-shakespeare.txt
# TODO: Use this now pre-trained model to (lightly) fine-tune on tiny-shakespeare.txt
# TODO: Compare the losses and generated text of the fine-tuned model and the model trained from scratch on tiny-shakespeare.txt

## Exercise 4 - Read up and implement

**Objective:** Read some transformer papers and implement one additional feature or change that people seem to use.<br>
Does it improve the performance of your GPT?
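
One candidate, picked here purely as an illustration since the exercise does not prescribe a feature, is RMSNorm: a LayerNorm variant that skips the mean subtraction and the bias and only rescales by the root mean square. A minimal sketch, assuming it would replace the `nn.LayerNorm` layers of the model; the `eps` default is an assumption.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """ root-mean-square normalization over the last dimension with a learned gain """
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-feature gain

    def forward(self, x):
        # 1 / sqrt(mean(x^2) + eps), computed per token
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.weight
```

Whether it helps is an empirical question: train with and without it under the same settings and compare validation loss against the baseline from Exercise 1.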

In [None]:
# TODO: The stage is yours! Add any popular model feature and see how it goes!