# How to train a Generative Pretrained Transformer (GPT) model?

This notebook explores the main aspects of training a generative AI model for text.\
We review the main concepts and steps of training, then discuss how the model predicts new content.

In [7]:
import os
import subprocess
from pathlib import Path

repo_path = subprocess.check_output(["git", "rev-parse", "--show-toplevel"]).strip().decode("utf-8")
os.chdir(repo_path)

In [None]:
## The different phases of model training 

### Pre-training
### Supervised fine-tuning
### Alignment with human preferences (RLHF)

## Loading the data

In [9]:
# Load the first article from an English Wikipedia shard (one JSON object per line)
import json
with open('wikiGPT/data/shuffled_shards/shard_0.json', 'r') as file:
    for line in file:
        data = json.loads(line)
        break

In [10]:
data["content"][:200]

'Nowa Kiszewa  is a village in the administrative district of Gmina Kościerzyna, within Kościerzyna County, Pomeranian Voivodeship, in northern Poland. It lies approximately  south-east of Kościerzyna '

## Creating the tokenizer

### A naive tokenizer

In [11]:
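def create_naive_tokenizer(text):
    # Character-level vocabulary: every distinct character seen in `text`
    vocab = sorted(set(text))
    # Ids 0-2 are reserved for the special tokens, so characters start at 3
    tokenizer = {char: i + 3 for i, char in enumerate(vocab)}
    tokenizer["<s>"] = 0
    tokenizer["</s>"] = 1
    tokenizer["<unk>"] = 2
    # Reverse mapping, from token id back to character
    detokenizer = {v: k for k, v in tokenizer.items()}
    return tokenizer, detokenizer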

naive_tokenizer, naive_detokenizer = create_naive_tokenizer(data["content"])

def tokenize(text, tokenizer):
    return [tokenizer.get(letter, 2) for letter in text]

def detokenize(tokens, detokenizer):
    return "".join([detokenizer.get(token, "<unk>") for token in tokens])

tokens = tokenize("Let's train a GPT model!", naive_tokenizer)
print(tokens)
print(detokenize(tokens, naive_detokenizer))

[2, 21, 34, 2, 33, 4, 34, 32, 18, 25, 29, 4, 18, 4, 10, 15, 2, 4, 28, 30, 20, 21, 27, 2]
<unk>et<unk>s train a GP<unk> model<unk>


### A modern tokenizer based on the Byte Pair Encoding (BPE) algorithm



In [12]:
# Byte Pair Encoding (BPE): the tokenization is learned from the data itself

# 1. Take a corpus of text on which to train the tokenizer
# 2. Choose a vocabulary size, i.e. the maximum number of tokens the vocabulary can hold

# 3. Start with a character-level vocabulary built from the corpus
"I love to learn about AI!" # => [' ', '!', 'A', 'I', 'a', 'b', 'e', 'l', 'n', 'o', 'r', 't', 'u', 'v']

# 4. Count the frequency of each pair of adjacent tokens in the corpus
"I love to learn about AI!" # => {"I ": 1, " l": 2, "lo": 1, "ov": 1, ...}

# 5. Merge the most frequent pair into a single new token
"I love to learn about AI!" # => ["I", " l", "o", "v", "e", " ", "t", "o", " l", "e", "a", "r", "n", ...]

# 6. Repeat steps 4 and 5 until the vocabulary size is reached




'I love to learn about AI!'
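The steps listed in the comments above can be sketched in a few lines of plain Python. This is only a toy illustration under simplifying assumptions: it trains on a single string, works on characters rather than bytes, and stops as soon as no pair occurs twice. The function name `train_toy_bpe` is hypothetical and is not how the `wikiGPT` tokenizer is built.

```python
from collections import Counter

def train_toy_bpe(text, num_merges):
    """Learn up to `num_merges` merges on a single string (toy, character-level BPE)."""
    tokens = list(text)  # start from a character-level tokenization
    merges = []
    for _ in range(num_merges):
        # Count how often each pair of adjacent tokens occurs
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Pick the most frequent pair
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # no pair repeats any more, nothing worth merging
        merges.append((left, right))
        # Replace every occurrence of the pair with the merged token
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens

merges, toy_tokens = train_toy_bpe("I love to learn about AI!", num_merges=3)
print(merges)
print(toy_tokens)
```

On such a short sentence only one pair repeats, so a single merge is learned. A real tokenizer is trained on a large corpus and keeps merging until the chosen vocabulary size is reached.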

In [15]:
from wikiGPT.tokenize import Tokenizer
tokenizer = Tokenizer(Path("wikiGPT/tokenizers/tok32000.model"))

tokens = tokenizer.encode("I love to learn about AI!", bos=False, eos=False)
print(tokens)
print(tokenizer.decode(tokens))

[318, 3832, 298, 3178, 900, 14788, 19086]
I love to learn about AI!


In [19]:
### A batch of data: from human-readable text to machine-readable tokens

# A batch has two dimensions: the number of samples passed to the model at once (the batch size) and the context length

corpus = [
    "I love to learn about AI!",
    "Let's train a GPT model!",
    data["content"][:30],
]
corpus


['I love to learn about AI!',
 "Let's train a GPT model!",
 'Nowa Kiszewa  is a village in ']

In [25]:
batch_size = 1
context_length = 4

# Tokenize each text without adding beginning- or end-of-sequence tokens
tokenized_corpus = [
    tokenizer.encode(text, bos=False, eos=False) for text in corpus
]

tokenized_corpus

[[318, 3832, 298, 3178, 900, 14788, 19086],
 [6974, 19055, 19006, 3820, 262, 11867, 19028, 3005, 19086],
 [344, 4397, 385, 273, 15544, 19000, 351, 262, 1429, 280]]

In [58]:
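# Create a first batch for next-token prediction
import random
sample = random.randrange(len(tokenized_corpus))
print(f"Selected sample: {sample}")

# Take context_length + 1 tokens: the first context_length are inputs,
# and the target of each input is the token that follows it
batch = tokenized_corpus[sample][:context_length + 1]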

Selected sample: 1


In [59]:
batch

[6974, 19055, 19006, 3820, 262]

In [60]:
for i in range(context_length):
    print(f"token(s) {batch[:i+1]} need to predict {batch[i+1]}")

token(s) [6974] need to predict 19055
token(s) [6974, 19055] need to predict 19006
token(s) [6974, 19055, 19006] need to predict 3820
token(s) [6974, 19055, 19006, 3820] need to predict 262


In [61]:
for i in range(context_length):    
    print(f"""subword(s) '{tokenizer.decode(batch[:i+1])}' need to predict '{tokenizer.decode(batch[i+1])}'""")

subword(s) 'Let' need to predict '''
subword(s) 'Let'' need to predict 's'
subword(s) 'Let's' need to predict 'train'
subword(s) 'Let's train' need to predict 'a'
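The same idea is often written as two sequences, inputs and targets, where the targets are the inputs shifted by one position. The sketch below uses the token ids printed above; the names `example_batch`, `inputs` and `targets` are chosen here for illustration only.

```python
# Token ids copied from the batch printed above
example_batch = [6974, 19055, 19006, 3820, 262]

inputs = example_batch[:-1]   # the first context_length tokens
targets = example_batch[1:]   # the same tokens shifted left by one position

for position in range(len(inputs)):
    # At each position the model sees inputs[:position + 1] and must predict targets[position]
    print(inputs[: position + 1], "->", targets[position])
```

A single sequence of `context_length + 1` tokens therefore yields `context_length` training examples.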


In [None]:
## Defining the model

### Transformer architecture:
#### We are going to focus on the embedding layer, then talk about the attention mechanism

In [64]:
import torch
### The embedding layer

#### This is the first layer of the model: it maps each token id (an int) to a vector
vocab = [0, 1, 2]
vocab_size = len(vocab)
embedding_size = 6

embedding_layer = torch.rand(vocab_size, embedding_size)
print(embedding_layer)

tensor([[0.1192, 0.3502, 0.7753, 0.6743, 0.4592, 0.7192],
        [0.7392, 0.7237, 0.7500, 0.4516, 0.7722, 0.8238],
        [0.8384, 0.2815, 0.0946, 0.2193, 0.4113, 0.9652]])


In [72]:
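# A toy batch: one sequence of four token ids
batch = [1, 0, 0, 2]
embedding = torch.zeros(batch_size, context_length, embedding_size)
# Look up the row of the embedding table that corresponds to each token
for i, token in enumerate(batch):
    embedding[0, i] = embedding_layer[token]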

print(embedding.shape)
print(embedding)

torch.Size([1, 4, 6])
tensor([[[0.7392, 0.7237, 0.7500, 0.4516, 0.7722, 0.8238],
         [0.1192, 0.3502, 0.7753, 0.6743, 0.4592, 0.7192],
         [0.1192, 0.3502, 0.7753, 0.6743, 0.4592, 0.7192],
         [0.8384, 0.2815, 0.0946, 0.2193, 0.4113, 0.9652]]])


In [75]:
# The same lookup written as a matrix product: each row one-hot encodes one token of the batch
batch_one_hot_encoded = torch.Tensor(
    [
        [0, 1, 0],
        [1, 0, 0],
        [1, 0, 0],
        [0, 0, 1]
    ]
)

batch_one_hot_encoded @ embedding_layer

tensor([[0.7392, 0.7237, 0.7500, 0.4516, 0.7722, 0.8238],
        [0.1192, 0.3502, 0.7753, 0.6743, 0.4592, 0.7192],
        [0.1192, 0.3502, 0.7753, 0.6743, 0.4592, 0.7192],
        [0.8384, 0.2815, 0.0946, 0.2193, 0.4113, 0.9652]])
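The two cells above give the same result because multiplying a one-hot vector by the embedding table simply selects one row. The sketch below checks this equivalence and compares both with PyTorch's built-in `torch.nn.Embedding` layer. The variable names and the seed are chosen for illustration, so the random values differ from those printed above.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
table = torch.rand(3, 6)                 # (vocab_size, embedding_size), same toy sizes as above
token_ids = torch.tensor([1, 0, 0, 2])   # same toy batch as above

# 1) Direct row lookup by index
by_index = table[token_ids]

# 2) One-hot encoding followed by a matrix product
one_hot = F.one_hot(token_ids, num_classes=3).float()
by_matmul = one_hot @ table

# 3) The built-in layer, initialised with the same weights
layer = torch.nn.Embedding.from_pretrained(table)
by_layer = layer(token_ids)

print(torch.allclose(by_index, by_matmul), torch.allclose(by_index, by_layer))
```

In practice the index lookup is preferred, since it avoids building a one-hot matrix whose width equals the vocabulary size.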

In [None]:
### What is the attention mechanism?
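As a starting point for this section, here is a minimal sketch of single-head causal self-attention (scaled dot-product attention). It assumes random, untrained projection matrices and the same toy sizes as the embedding example. The names `h`, `w_q`, `w_k` and `w_v` are hypothetical, and a real GPT model uses several heads with learned weights.

```python
import math
import torch

torch.manual_seed(0)
seq_len, emb = 4, 6                  # context_length and embedding_size of the toy example
h = torch.rand(1, seq_len, emb)      # (batch, context_length, embedding_size)

# Hypothetical projection matrices; in a real model these are learned
w_q, w_k, w_v = (torch.rand(emb, emb) for _ in range(3))
q, k, v = h @ w_q, h @ w_k, h @ w_v

# Similarity of every position with every other position, scaled by sqrt(emb)
scores = q @ k.transpose(-2, -1) / math.sqrt(emb)   # (1, 4, 4)

# Causal mask: position i may only attend to positions <= i
mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
scores = scores.masked_fill(~mask, float("-inf"))

weights = torch.softmax(scores, dim=-1)   # each row sums to 1
out = weights @ v                         # (1, 4, 6): one mixed vector per position
print(weights[0])
```

The printed matrix is lower triangular: the first token can only attend to itself, while the last token can attend to every token before it.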

In [None]:
### How can we measure the performance of a model? The cross-entropy loss
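A minimal sketch of the cross-entropy loss for next-token prediction, assuming made-up logits over a three-token vocabulary. The loss is the average of minus the log-probability that the model assigns to the true next token. The values and variable names below are illustrative only.

```python
import math
import torch
import torch.nn.functional as F

toy_vocab_size = 3
# Hypothetical model outputs (logits) for 2 positions over a 3-token vocabulary
logits = torch.tensor([[2.0, 0.5, 0.1],
                       [0.0, 0.0, 0.0]])
targets = torch.tensor([0, 2])  # the true next token at each position

# Manual computation: mean of -log(probability assigned to the true token)
probs = torch.softmax(logits, dim=-1)
manual = -torch.log(probs[torch.arange(2), targets]).mean()

# Built-in equivalent, which takes raw logits directly
builtin = F.cross_entropy(logits, targets)

print(manual.item(), builtin.item())
# Reference point: a model that predicts uniformly at random scores log(vocab_size)
print(math.log(toy_vocab_size))
```

The second position has uniform logits, so its individual loss equals `log(3)`. A trained model should score well below the uniform baseline.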

In [None]:
### How can we actually train the model?
#### What is gradient descent and backpropagation?
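To make the two terms concrete, here is a minimal sketch on a toy regression problem rather than a language model: a single parameter `w` is fitted so that `w * x` matches `3 * x`. Backpropagation computes the gradient of the loss with respect to `w`, and gradient descent moves `w` a small step against that gradient. The data, learning rate and number of steps are arbitrary choices for illustration.

```python
import torch

# Toy problem: learn w so that w * x matches y = 3 * x
xs = torch.tensor([1.0, 2.0, 3.0])
ys = 3.0 * xs

w = torch.zeros(1, requires_grad=True)  # the single parameter to learn
learning_rate = 0.05

for step in range(100):
    loss = ((w * xs - ys) ** 2).mean()  # forward pass: mean squared error
    loss.backward()                     # backpropagation: fills w.grad with d(loss)/dw
    with torch.no_grad():
        w -= learning_rate * w.grad     # gradient descent step
    w.grad.zero_()                      # reset the gradient before the next step

print(w.item(), loss.item())
```

Training a GPT model follows the same loop, with the cross-entropy loss in place of the squared error and many parameters in place of `w`.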

In [None]:
### A note on model sizes and required training computation
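Two widely used rules of thumb give an order of magnitude for this section. They are approximations, not exact counts: a transformer with `n_layers` blocks of width `d_model` has roughly `12 * n_layers * d_model**2` parameters outside the embeddings, and training costs roughly 6 floating-point operations per parameter per token. The function names and the configuration below are hypothetical.

```python
def approx_transformer_params(n_layers, d_model):
    """Rule of thumb: about 12 * n_layers * d_model^2 weights in the transformer blocks
    (4 * d_model^2 for attention, 8 * d_model^2 for a 4x-wide MLP), embeddings excluded."""
    return 12 * n_layers * d_model ** 2

def approx_training_flops(n_params, n_tokens):
    """Rule of thumb: about 6 floating-point operations per parameter per training token
    (roughly 2 for the forward pass and 4 for the backward pass)."""
    return 6 * n_params * n_tokens

# Hypothetical configuration, chosen only for illustration
n_params = approx_transformer_params(n_layers=12, d_model=768)
n_tokens = 1_000_000_000
train_flops = approx_training_flops(n_params, n_tokens)
print(f"{n_params / 1e6:.0f}M parameters, {train_flops:.2e} training FLOPs")
```

Because the parameter count grows with the square of `d_model`, doubling the width roughly quadruples both the model size and the compute needed per token.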

In [None]:
### How do we make a prediction with the model?
#### Probabilistic sampling
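A minimal sketch of probabilistic sampling, assuming made-up logits for the next token over a three-token vocabulary. Instead of always taking the most likely token, the next token is drawn at random from the softmax distribution, and a temperature controls how sharp that distribution is. The helper name `sample_next_token` and the values are hypothetical.

```python
import torch

torch.manual_seed(0)
# Hypothetical logits produced by a model for the next token (vocabulary of 3)
next_logits = torch.tensor([2.0, 1.0, 0.1])

def sample_next_token(logits, temperature=1.0):
    """Sample one token id from softmax(logits / temperature)."""
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

greedy = torch.argmax(next_logits).item()   # greedy decoding: always the most likely token
samples = [sample_next_token(next_logits, temperature=0.8) for _ in range(10)]
print(greedy, samples)
```

A temperature below 1 concentrates the samples on the most likely tokens, while a temperature above 1 makes the output more varied.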
#### Inefficiency of the attention mechanism at inference time