# How to train a Generative Pretrained Transformer (GPT) model?

The goal of this notebook is to explore the main aspects of training a generative AI model for text.\
We will review the main concepts and steps of training, and discuss how the model predicts new content.

Topics:
- Tokenization
- Main model components
- How the model performance is evaluated
- How the model is trained
- How the model is used to create new text

In [4]:
import os
import subprocess
from pathlib import Path

repo_path = (
    subprocess.check_output(["git", "rev-parse", "--show-toplevel"])
    .strip()
    .decode("utf-8")
)
os.chdir(repo_path)

In [5]:
## The different phases of model training

### Pre-training
### Supervised fine-tuning
### Alignment with human preferences (RLHF)

## Loading the data

In [6]:
# load the first article of an English Wikipedia shard (one JSON document per line)
import json

with open("wikiGPT/data/shuffled_shards/shard_0.json", "r") as file:
    for line in file:
        data = json.loads(line)
        break

In [7]:
data["content"][:200]

'Nowa Kiszewa  is a village in the administrative district of Gmina Kościerzyna, within Kościerzyna County, Pomeranian Voivodeship, in northern Poland. It lies approximately  south-east of Kościerzyna '

## Creating the tokenizer

### A naive tokenizer

In [8]:
vocab = sorted(list(set(data["content"])))
vocab_size = len(vocab)


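def create_naive_tokenizer(text):
    # character-level vocabulary built from the text passed in;
    # ids 0, 1 and 2 are reserved for the special tokens
    chars = sorted(set(text))
    tokenizer = {char: i + 3 for i, char in enumerate(chars)}
    tokenizer["<s>"] = 0
    tokenizer["</s>"] = 1
    tokenizer["<unk>"] = 2
    detokenizer = {v: k for k, v in tokenizer.items()}
    return tokenizer, detokenizer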


naive_tokenizer, naive_detokenizer = create_naive_tokenizer(data["content"])


def tokenize(text, tokenizer):
    return [tokenizer.get(letter, 2) for letter in text]


def detokenize(tokens, detokenizer):
    return "".join([detokenizer.get(token, "<unk>") for token in tokens])


tokens = tokenize("Let's train a GPT model!", naive_tokenizer)
print(tokens)
print(detokenize(tokens, naive_detokenizer))

[2, 21, 34, 2, 33, 4, 34, 32, 18, 25, 29, 4, 18, 4, 10, 15, 2, 4, 28, 30, 20, 21, 27, 2]
<unk>et<unk>s train a GP<unk> model<unk>


### A modern tokenizer based on the Byte Pair Encoding (BPE) algorithm



In [9]:
# Byte pair encoding => use the data to decide how to tokenize

# Start from a corpus of text used to train the tokenizer
# Choose a vocabulary size, i.e. the maximum number of tokens the vocabulary can have

# Start with a character-level tokenizer built from the corpus
"I love to learn about AI!"  # => [' ', '!', 'A', 'I', 'a', 'b', 'e', 'l', 'n', 'o', 'r', 't', 'u', 'v']

# Count the frequency of pairs of adjacent characters in the corpus
"I love to learn about AI!"  # => {"I ": 1, " l": 2, "lo": 1, "ov": 1, ...}

# Merge the most frequent pair of characters into a single token
"I love to learn about AI!"  # => " l" is the most frequent pair (2 occurrences), so it becomes a new token

# Repeat until the vocabulary size is reached

'I love to learn about AI!'
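
The steps listed in the cell above can be sketched in a few lines of plain Python. This is a minimal illustration, not the algorithm used to build `tok32000.model`: the function name `train_bpe`, the `num_merges` parameter, and the rule of stopping once no pair repeats are assumptions made for the example. Production tokenizers usually work on bytes or pre-split words and target a vocabulary size rather than a merge count.

```python
from collections import Counter


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules from `text`, starting from characters."""
    tokens = list(text)  # character-level starting point
    merges = []
    for _ in range(num_merges):
        # Count every pair of adjacent tokens
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Most frequent pair; ties go to the pair seen first
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so merging would not compress anything
        merges.append((left, right))
        # Replace every occurrence of the pair with the merged token
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges


text = "I love to learn about AI!"
tokens, merges = train_bpe(text, num_merges=3)
print(merges)
print(tokens)
```

On such a short sentence only the pair `(' ', 'l')` occurs more than once, so a single merge is learned. On a real corpus thousands of merges are learned and frequent words end up as single tokens.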

In [10]:
from wikiGPT.tokenize import Tokenizer

tokenizer = Tokenizer(Path("wikiGPT/tokenizers/tok32000.model"))

tokens = tokenizer.encode("I love to learn about AI!", bos=False, eos=False)
print(tokens)
print(tokenizer.decode(tokens))

[318, 3832, 298, 3178, 900, 14788, 19086]
I love to learn about AI!


In [11]:
### A batch of data: from human-readable text to machine-readable tokens

# A batch has two dimensions: the number of samples passed to the model (batch size) and the context length

corpus = [
    "I love to learn about AI!",
    "Let's train a GPT model!",
    data["content"][:30],
]
corpus

['I love to learn about AI!',
 "Let's train a GPT model!",
 'Nowa Kiszewa  is a village in ']

In [12]:
batch_size = 1
context_length = 4

# tokenize each text of the corpus (no beginning- or end-of-sequence token)
tokenized_corpus = [tokenizer.encode(text, bos=False, eos=False) for text in corpus]

tokenized_corpus

[[318, 3832, 298, 3178, 900, 14788, 19086],
 [6974, 19055, 19006, 3820, 262, 11867, 19028, 3005, 19086],
 [344, 4397, 385, 273, 15544, 19000, 351, 262, 1429, 280]]

In [13]:
# Creating a first batch for next-token prediction
import random

sample = random.randrange(len(tokenized_corpus))
print(f"Selected sample: {sample}")

# context_length input tokens plus one extra token, the target of the last position
batch = tokenized_corpus[sample][: context_length + 1]

Selected sample: 0


In [14]:
batch

[318, 3832, 298, 3178, 900]

In [15]:
for i in range(context_length):
    print(f"token(s) {batch[:i+1]} need to predict {batch[i+1]}")

token(s) [318] need to predict 3832
token(s) [318, 3832] need to predict 298
token(s) [318, 3832, 298] need to predict 3178
token(s) [318, 3832, 298, 3178] need to predict 900


In [16]:
for i in range(context_length):
    print(
        f"""subword(s) '{tokenizer.decode(batch[:i+1])}' need to predict '{tokenizer.decode(batch[i+1])}'"""
    )

subword(s) 'I' need to predict 'love'
subword(s) 'I love' need to predict 'to'
subword(s) 'I love to' need to predict 'learn'
subword(s) 'I love to learn' need to predict 'about'
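
In practice the input/target pairs printed above are not built one by one: the targets are simply the inputs shifted by one position. The sketch below shows this with plain lists; `make_example` is a hypothetical helper written for this illustration and is not part of `wikiGPT`.

```python
def make_example(token_ids, context_length, start=0):
    """Return (inputs, targets), where targets are the inputs shifted by one token."""
    window = token_ids[start : start + context_length + 1]
    if len(window) < context_length + 1:
        raise ValueError("not enough tokens for a full window")
    return window[:-1], window[1:]


token_ids = [318, 3832, 298, 3178, 900, 14788, 19086]
x, y = make_example(token_ids, context_length=4)
print(x)  # [318, 3832, 298, 3178]
print(y)  # [3832, 298, 3178, 900]
```

Position `i` of `y` is the token the model should predict after having seen `x[: i + 1]`, so a single window of `context_length + 1` tokens yields `context_length` training examples.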


In [17]:
## Defining the model

### Transformer architecture:
#### We are going to focus on the embedding layer and the attention mechanism

In [18]:
import torch

### The embedding layer

#### This is the initial layer that will transform the token (int) into a vector
vocab = [0, 1, 2]
vocab_size = len(vocab)
embedding_size = 6

embedding_layer = torch.rand(vocab_size, embedding_size)
print(embedding_layer)

tensor([[0.6752, 0.7663, 0.9452, 0.7748, 0.2242, 0.4582],
        [0.4008, 0.8364, 0.6558, 0.8886, 0.4826, 0.1786],
        [0.5440, 0.5719, 0.5054, 0.6391, 0.1919, 0.5260]])


In [19]:
batch = [1, 0, 0, 2]
batch_size = 1
context_length = 4
embedding = torch.zeros(batch_size, context_length, embedding_size)
for i, token in enumerate(batch):
    embedding[0, i] = embedding_layer[token]

print(embedding.shape)
print(embedding)

torch.Size([1, 4, 6])
tensor([[[0.4008, 0.8364, 0.6558, 0.8886, 0.4826, 0.1786],
         [0.6752, 0.7663, 0.9452, 0.7748, 0.2242, 0.4582],
         [0.6752, 0.7663, 0.9452, 0.7748, 0.2242, 0.4582],
         [0.5440, 0.5719, 0.5054, 0.6391, 0.1919, 0.5260]]])


In [20]:
batch_one_hot_encoded = torch.Tensor([[0, 1, 0], [1, 0, 0], [1, 0, 0], [0, 0, 1]])

batch_one_hot_encoded @ embedding_layer

tensor([[0.4008, 0.8364, 0.6558, 0.8886, 0.4826, 0.1786],
        [0.6752, 0.7663, 0.9452, 0.7748, 0.2242, 0.4582],
        [0.6752, 0.7663, 0.9452, 0.7748, 0.2242, 0.4582],
        [0.5440, 0.5719, 0.5054, 0.6391, 0.1919, 0.5260]])
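
The two cells above suggest that an embedding lookup and a one-hot matrix product give the same result. The sketch below checks this with `nn.Embedding`, which stores the table as a learnable parameter; the sizes and the seed are arbitrary choices made for the example.

```python
import torch
from torch import nn
from torch.nn import functional as F

torch.manual_seed(0)
vocab_size, embedding_size = 3, 6
embedding_layer = nn.Embedding(vocab_size, embedding_size)

batch = torch.tensor([1, 0, 0, 2])

# Row lookup: picks row `token` of the weight matrix for each token
looked_up = embedding_layer(batch)  # shape (4, 6)

# Same result through a one-hot encoding followed by a matrix product
one_hot = F.one_hot(batch, num_classes=vocab_size).float()  # shape (4, 3)
via_matmul = one_hot @ embedding_layer.weight  # shape (4, 6)

print(torch.allclose(looked_up, via_matmul))
```

The lookup is preferred in practice because it avoids materialising a `(context_length, vocab_size)` one-hot matrix, which would be very large with a 32000-token vocabulary.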

In [21]:
from torch import nn
from torch.nn import functional as F

torch.manual_seed(56)
### What is the attention mechanism?

#### Attention allows the model to learn the affinities between tokens
#### Given the past (the previous tokens), how should they interact to predict the most likely next token?

# [I love to learn] ---> about
print("I love to learn")
tokens = tokenizer.encode("I love to learn", False, False)
print(tokens)
vocab_size = 32000
embedding_size = 8

embedding_layer = nn.Embedding(vocab_size, embedding_size)
x = embedding_layer(torch.LongTensor(tokens))
x

I love to learn
[318, 3832, 298, 3178]


tensor([[-0.1219, -0.9217,  0.0804, -0.1049, -0.4088,  1.1438, -0.2155,  0.8280],
        [-0.5726,  2.1356,  0.1306,  1.0591,  0.0731,  0.1786, -0.8104, -0.1310],
        [ 0.0997,  0.6453,  2.4119,  0.5138,  0.2973,  0.0155, -0.8086,  0.3057],
        [ 1.0756,  0.4220, -1.1676,  2.2978,  0.5456,  0.6231,  2.5453, -0.4455]],
       grad_fn=<EmbeddingBackward0>)

In [22]:
# I     ---> 318  ---> [-0.1219, -0.9217,  0.0804, -0.1049, -0.4088,  1.1438, -0.2155,  0.8280],
# love  ---> 3832 ---> [-0.5726,  2.1356,  0.1306,  1.0591,  0.0731,  0.1786, -0.8104, -0.1310],
# to    ---> 298  ---> [ 0.0997,  0.6453,  2.4119,  0.5138,  0.2973,  0.0155, -0.8086,  0.3057],
# learn ---> 3178 ---> [ 1.0756,  0.4220, -1.1676,  2.2978,  0.5456,  0.6231,  2.5453, -0.4455]

In [23]:
context_length = 4
mask = torch.tril(torch.ones(context_length, context_length))
mask

tensor([[1., 0., 0., 0.],
        [1., 1., 0., 0.],
        [1., 1., 1., 0.],
        [1., 1., 1., 1.]])

In [24]:
weights = torch.zeros(context_length, context_length)
weights = weights.masked_fill(mask == 0, float("-Inf"))
weights = F.softmax(weights, dim=-1)
weights

tensor([[1.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500]])

In [25]:
weights @ x

tensor([[-0.1219, -0.9217,  0.0804, -0.1049, -0.4088,  1.1438, -0.2155,  0.8280],
        [-0.3473,  0.6069,  0.1055,  0.4771, -0.1679,  0.6612, -0.5130,  0.3485],
        [-0.1983,  0.6197,  0.8743,  0.4894, -0.0128,  0.4460, -0.6115,  0.3342],
        [ 0.1202,  0.5703,  0.3638,  0.9415,  0.1268,  0.4903,  0.1777,  0.1393]],
       grad_fn=<MmBackward0>)
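
With all scores set to zero, the masked softmax gives every visible token the same weight, so `weights @ x` is a running average of the embeddings seen so far. The sketch below checks that equivalence on random data; the tensor sizes and the seed are arbitrary.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
context_length, embedding_size = 4, 8
x = torch.randn(context_length, embedding_size)

# Uniform causal weights, built as in the cells above
mask = torch.tril(torch.ones(context_length, context_length))
weights = torch.zeros(context_length, context_length)
weights = weights.masked_fill(mask == 0, float("-inf"))
weights = F.softmax(weights, dim=-1)

# Running mean: cumulative sum divided by the number of tokens seen so far
counts = torch.arange(1, context_length + 1).unsqueeze(1)
running_mean = torch.cumsum(x, dim=0) / counts

print(torch.allclose(weights @ x, running_mean, atol=1e-6))
```

Attention keeps this weighted-average structure but replaces the uniform weights with weights computed from the tokens themselves.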

In [26]:
### How can we get meaningful weights for the attention?
# queries and keys interact to decide the level of affinity between tokens
torch.manual_seed(78)
head_size = 4
key = nn.Linear(embedding_size, head_size, bias=False)
query = nn.Linear(embedding_size, head_size, bias=False)
# (context_length, embedding_size) @ (embedding_size, head_size) => (context_length, head_size)
k = key(x)
q = query(x)
print(k.shape, q.shape)

torch.Size([4, 4]) torch.Size([4, 4])


In [27]:
print(k)
print(q)

tensor([[ 0.2632,  0.2153, -0.1248,  0.4342],
        [-0.4048,  0.7153, -0.1434,  0.3645],
        [-0.5860,  0.5796,  0.3083, -0.0288],
        [ 1.0913, -0.8812,  0.7187, -0.1261]], grad_fn=<MmBackward0>)
tensor([[ 0.6620,  0.3543,  0.1199,  0.3901],
        [-0.2861, -0.3934,  0.7692, -0.6666],
        [ 0.1974,  0.6948,  0.3016,  0.3322],
        [ 0.0445, -0.4471,  0.7331,  1.0798]], grad_fn=<MmBackward0>)


In [28]:
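# (context_length, head_size) @ (head_size, context_length) => (context_length, context_length)
# note: standard scaled dot-product attention also divides by sqrt(head_size); this is omitted here
weights = q @ k.T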

In [29]:
# hide the future positions, then turn each row into probabilities
weights = weights.masked_fill(mask == 0, float("-Inf"))
weights = F.softmax(weights, dim=-1)
weights

tensor([[1.0000, 0.0000, 0.0000, 0.0000],
        [0.4933, 0.5067, 0.0000, 0.0000],
        [0.3059, 0.3686, 0.3255, 0.0000],
        [0.2434, 0.1729, 0.1659, 0.4179]], grad_fn=<SoftmaxBackward0>)

In [30]:
#          I      love     to     learn
# I     [1.0000, 0.0000, 0.0000, 0.0000]
# love  [0.4933, 0.5067, 0.0000, 0.0000]
# to    [0.3059, 0.3686, 0.3255, 0.0000]
# learn [0.2434, 0.1729, 0.1659, 0.4179]

In [31]:
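# aggregating information from the values
# note: re-seeding with 78 initialises `values` with the same weights as `key` above,
# which is why the first output row equals the first row of k
torch.manual_seed(78)
values = nn.Linear(embedding_size, head_size, bias=False)
v = values(x)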
output = weights @ v
output

tensor([[ 0.2632,  0.2153, -0.1248,  0.4342],
        [-0.0752,  0.4686, -0.1342,  0.3989],
        [-0.2595,  0.5182,  0.0093,  0.2578],
        [ 0.3529, -0.0960,  0.2963,  0.1112]], grad_fn=<MmBackward0>)
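
The query, key and value steps above can be gathered into one module. The sketch below is an illustration and not the `Attention` class of `wikiGPT`: the class name `CausalSelfAttentionHead` is made up for the example, and it adds the usual division by `sqrt(head_size)`, which the cells above leave out.

```python
import math

import torch
from torch import nn
from torch.nn import functional as F


class CausalSelfAttentionHead(nn.Module):
    """A single attention head where each token only attends to itself and the past."""

    def __init__(self, embedding_size, head_size):
        super().__init__()
        self.key = nn.Linear(embedding_size, head_size, bias=False)
        self.query = nn.Linear(embedding_size, head_size, bias=False)
        self.value = nn.Linear(embedding_size, head_size, bias=False)

    def forward(self, x):
        # x: (context_length, embedding_size)
        context_length = x.shape[0]
        k, q, v = self.key(x), self.query(x), self.value(x)
        # Affinity between every query and every key, scaled to keep the softmax well behaved
        scores = q @ k.T / math.sqrt(k.shape[-1])
        # Causal mask: position i cannot look at positions after i
        mask = torch.tril(torch.ones(context_length, context_length, device=x.device))
        scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = F.softmax(scores, dim=-1)
        return weights @ v, weights


torch.manual_seed(0)
head = CausalSelfAttentionHead(embedding_size=8, head_size=4)
output, weights = head(torch.randn(4, 8))
print(output.shape)
print(weights.sum(dim=-1))
```

Each row of `weights` sums to 1 and has zeros above the diagonal, so the output at a position is a weighted average of the values of that token and the tokens before it.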

In [None]:
# the outputs of the heads are concatenated along the last dimension (shown as "+" below)
# HEAD 1                                     HEAD 2
# [ 0.2632,  0.2153, -0.1248,  0.4342]       [... ... ... ...]
# [-0.0752,  0.4686, -0.1342,  0.3989]   +   [... ... ... ...]
# [-0.2595,  0.5182,  0.0093,  0.2578]       [... ... ... ...]
# [ 0.3529, -0.0960,  0.2963,  0.1112]       [... ... ... ...]

In [32]:
### Final layer to produce logits and probabilities

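# stand-in for two heads: the single head output is concatenated with itself
# so that the last dimension is 2 * head_size = embedding_size
output_to_project = torch.cat([output, output], dim=-1)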

final_projection = nn.Linear(embedding_size, vocab_size)
logits = final_projection(output_to_project)

logits.shape

torch.Size([4, 32000])
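
With several heads, each head has its own key, query and value projections, and the head outputs are concatenated before the final projection. The sketch below illustrates this with two independent heads; `causal_head` is a helper invented for the example, and the vocabulary is reduced to 100 entries to keep it light.

```python
import torch
from torch import nn
from torch.nn import functional as F


def causal_head(x, head_size):
    """Run x through one freshly initialised causal attention head."""
    context_length, embedding_size = x.shape
    key = nn.Linear(embedding_size, head_size, bias=False)
    query = nn.Linear(embedding_size, head_size, bias=False)
    value = nn.Linear(embedding_size, head_size, bias=False)
    scores = query(x) @ key(x).T / head_size**0.5
    mask = torch.tril(torch.ones(context_length, context_length))
    weights = F.softmax(scores.masked_fill(mask == 0, float("-inf")), dim=-1)
    return weights @ value(x)


torch.manual_seed(0)
vocab_size, embedding_size, n_heads = 100, 8, 2  # small vocabulary for the demo
head_size = embedding_size // n_heads
x = torch.randn(4, embedding_size)

# Each call creates its own projections, so the two heads differ
heads = [causal_head(x, head_size) for _ in range(n_heads)]
concatenated = torch.cat(heads, dim=-1)  # (4, n_heads * head_size) = (4, 8)

# Final projection to one score (logit) per vocabulary entry, then probabilities
logits = nn.Linear(embedding_size, vocab_size)(concatenated)
probs = F.softmax(logits, dim=-1)
print(concatenated.shape, logits.shape)
```

Choosing `head_size = embedding_size // n_heads` keeps the concatenated output at `embedding_size`, which is why the final projection can take `embedding_size` inputs.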

In [33]:
probs = F.softmax(logits, dim=-1)
probs[-1]

tensor([3.6348e-05, 5.0826e-05, 3.4992e-05,  ..., 4.3651e-05, 3.5069e-05,
        3.6860e-05], grad_fn=<SelectBackward0>)

In [None]:
### How can we measure the performance of a model? The cross-entropy loss

##### For each target token, we can look up the probability that the model assigns to it
##### In a batch, we have batch_size * context_length predicted tokens
##### One way to measure quality is to multiply the probabilities that the model assigns to each correct next token

# token(s) [6974] need to predict 19055
# token(s) [6974, 19055] need to predict 19006
# token(s) [6974, 19055, 19006] need to predict 3820
# token(s) [6974, 19055, 19006, 3820] need to predict 262

In [34]:
print(probs[0, 19055])
print(probs[1, 19006])
print(probs[2, 3820])
print(probs[3, 262])

tensor(4.0643e-05, grad_fn=<SelectBackward0>)
tensor(3.7046e-05, grad_fn=<SelectBackward0>)
tensor(2.8668e-05, grad_fn=<SelectBackward0>)
tensor(3.1741e-05, grad_fn=<SelectBackward0>)


In [35]:
# Multiplying many small probabilities underflows quickly, so we sum their logs instead
#  log(a*b*c) = log(a) + log(b) + log(c)

(
    torch.log(probs[0, 19055])
    + torch.log(probs[1, 19006])
    + torch.log(probs[2, 3820])
    + torch.log(probs[3, 262])
)

tensor(-41.1317, grad_fn=<AddBackward0>)

In [36]:
# Because we want a quantity to minimize (a loss), we use the negative log-likelihood

-(
    torch.log(probs[0, 19055])
    + torch.log(probs[1, 19006])
    + torch.log(probs[2, 3820])
    + torch.log(probs[3, 262])
)

tensor(41.1317, grad_fn=<NegBackward0>)
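
The negative log-likelihood computed by hand above is what `F.cross_entropy` returns, up to averaging over the predictions instead of summing them. The sketch below checks this on random logits; the sizes, targets and seed are arbitrary values chosen for the example.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
context_length, vocab_size = 4, 10
logits = torch.randn(context_length, vocab_size)
targets = torch.tensor([3, 7, 1, 4])

# Manual version: probability of each target, then negative log, then average
probs = F.softmax(logits, dim=-1)
target_probs = probs[torch.arange(context_length), targets]
nll_sum = -torch.log(target_probs).sum()
nll_mean = nll_sum / context_length

# Built-in version: takes raw logits and averages by default
builtin = F.cross_entropy(logits, targets)

print(nll_mean.item(), builtin.item())
```

As a sanity check on the value above: a model that guesses uniformly over 32000 tokens has a per-token loss of ln(32000) ≈ 10.37, and 41.1317 / 4 ≈ 10.28, which is what one would expect from an untrained model.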

In [37]:
### How can we actually train the model?
#### What is gradient descent and backpropagation?

from wikiGPT.model import Transformer, ModelArgs

model_config = ModelArgs(
    dim=16,
    n_layers=1,
    n_heads=1,
    vocab_size=32000,
    hidden_dim=16,
    max_context_length=10,
)

model = Transformer(model_config)


In [38]:
print(model)

Transformer(
  (tok_embeddings): Embedding(32000, 16)
  (dropout): Dropout(p=0.0, inplace=False)
  (layers): ModuleList(
    (0): TransformerBlock(
      (attn): Attention(
        (wq): Linear(in_features=16, out_features=16, bias=False)
        (wk): Linear(in_features=16, out_features=16, bias=False)
        (wv): Linear(in_features=16, out_features=16, bias=False)
        (wo): Linear(in_features=16, out_features=16, bias=False)
        (attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_dropout): Dropout(p=0.0, inplace=False)
      )
      (feed_forward): FeedForward(
        (w1): Linear(in_features=16, out_features=32, bias=False)
        (w2): Linear(in_features=32, out_features=16, bias=False)
        (w3): Linear(in_features=16, out_features=32, bias=False)
        (dropout): Dropout(p=0.0, inplace=False)
      )
      (attn_norm): RMSNorm()
      (ff_norm): RMSNorm()
    )
  )
  (norm): RMSNorm()
  (output): Linear(in_features=16, out_features=32000, bias=False)
)


In [45]:
# count the trainable parameters
params = [p for p in model.parameters() if p.requires_grad]
sum(p.numel() for p in params)

514608
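
The total above can be checked by hand from the layer shapes printed earlier. The arithmetic below is a sketch resting on two assumptions that this notebook does not show: that the `output` projection shares its weights with `tok_embeddings` (so the 32000 x 16 matrix is counted once), and that each `RMSNorm` holds one weight per dimension.

```python
dim, ff_hidden, vocab_size = 16, 32, 32000

embedding = vocab_size * dim        # tok_embeddings (assumed shared with output)
attention = 4 * dim * dim           # wq, wk, wv, wo
feed_forward = 3 * dim * ff_hidden  # w1, w2, w3
norms = 3 * dim                     # attn_norm, ff_norm, final norm (assumed one weight per dim)

total = embedding + attention + feed_forward + norms
print(total)  # 514608
```

Under these assumptions the embedding table accounts for 512000 of the 514608 parameters, which is typical of very small models with a large vocabulary.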

In [75]:
from wikiGPT.iterate import TokenIterator
from functools import partial

# batch iterator over the pre-tokenized data, used by the training loop
iter_params = {
    "pretokenized_source": Path(f"wikiGPT/data/tok{model_config.vocab_size}"),
    "context_length": model_config.max_context_length,
    # "verbose": True,
}
iter_batches = partial(
    TokenIterator.iter_batches,
    batch_size=2,
    device="cpu",
    num_workers=0,
    **iter_params,
)
train_batch_iter = iter_batches(split="train")
X, Y = next(train_batch_iter)

# define an optimizer
optimizer = torch.optim.AdamW(params=model.parameters(), lr=5e-4)

In [48]:
X.shape

torch.Size([2, 10])

In [83]:
# the same batch (X, Y) is reused at every step, so this loop fits that single batch
for i in range(100):
    # forward pass: run the input through the model
    logits = model(X, Y)

    # compute the loss
    logits = logits.view(-1, logits.size(-1))
    targets = Y.view(-1)
    loss = F.cross_entropy(logits, targets)

    # backward pass (compute the gradients), then update the weights
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)

# good resources on understanding backpropagation and gradient descent
# https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&index=1
# https://www.youtube.com/watch?v=IHZwWFHWa-w&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&index=2
# https://www.youtube.com/watch?v=Ilg3gGewQ5U&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&index=3
# https://www.youtube.com/watch?v=tIeHLnjs5U8&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&index=4 
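
To make the three calls of the training loop concrete, here is a toy example with a single parameter. It is a sketch that swaps AdamW for plain gradient descent written by hand, and the loss `(w - 3)^2` and the learning rate are arbitrary choices made for the illustration.

```python
import torch

# One trainable parameter; the loss (w - 3)^2 is smallest at w = 3
w = torch.tensor(0.0, requires_grad=True)
learning_rate = 0.1

for step in range(50):
    loss = (w - 3.0) ** 2
    # Backpropagation: fills w.grad with d(loss)/dw = 2 * (w - 3)
    loss.backward()
    # Gradient descent: move against the gradient (what optimizer.step() does for plain SGD)
    with torch.no_grad():
        w -= learning_rate * w.grad
    # Reset the gradient, otherwise the next backward() would add to it
    w.grad.zero_()

print(w.item())
```

After 50 steps `w` is very close to 3. The loop above does the same thing for every parameter of the model at once, with AdamW adapting the step size per parameter.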


In [None]:
### A note on model sizes and required training computation
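
A commonly used rule of thumb estimates training compute at about 6 floating-point operations per parameter per training token (roughly 2 for the forward pass and 4 for the backward pass). The sketch below applies it to the toy run of this notebook; `training_flops` is a helper written for the example, and the rule is usually stated for non-embedding parameters, so the figure is only a rough order of magnitude for a model this small.

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute: about 6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens


n_params = 514_608        # parameter count reported above
n_tokens = 100 * 2 * 10   # 100 steps x batch_size 2 x context_length 10

print(training_flops(n_params, n_tokens))  # 6175296000
```

The same formula shows why large models are expensive: compute grows with the product of the parameter count and the number of training tokens.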

In [None]:
### How do we make a prediction with the model?
#### probabilistic sampling
#### inefficiency of the attention mechanism at inference time
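
Probabilistic sampling means drawing the next token from the predicted distribution instead of always taking the most likely one. The sketch below shows one common way to do it, with a temperature and an optional top-k filter; `sample_next_token` and its parameters are illustrative names, not part of `wikiGPT`.

```python
import torch
from torch.nn import functional as F


def sample_next_token(logits, temperature=1.0, top_k=None):
    """Draw one token id from the logits of the last position."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it
    logits = logits / temperature
    if top_k is not None:
        # Keep only the top_k highest logits, discard the rest
        top_values, _ = torch.topk(logits, top_k)
        logits = logits.masked_fill(logits < top_values[-1], float("-inf"))
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()


torch.manual_seed(0)
logits = torch.tensor([2.0, 0.5, -1.0, 0.1])
print(sample_next_token(logits, temperature=0.8, top_k=2))
```

To generate text, the sampled token is appended to the input and the model is called again, one token at a time. With `top_k=1` the function always returns the most likely token, which corresponds to greedy decoding.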