# How to train a Generative Pretrained Transformer (GPT) model?

The goal of this notebook is to explore the main aspects of training a generative AI model for text.\
We will review the main concepts and steps of training, and then look at how the model predicts new content.

## Loading the data

In [7]:
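# Load the first article from the English Wikipedia shard.
# The file holds one JSON object per line, so only the first line is read.
import json
with open('../data/shuffled_shards/shard_0.json', 'r', encoding='utf-8') as file:
    data = json.loads(next(file))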

In [11]:
data["content"][:200]

'Nowa Kiszewa  is a village in the administrative district of Gmina Kościerzyna, within Kościerzyna County, Pomeranian Voivodeship, in northern Poland. It lies approximately  south-east of Kościerzyna '

## Creating the tokenizer

### A naive tokenizer

In [33]:
def create_naive_tokenizer(text):
    # Character-level vocabulary: one token per distinct character in `text`
    vocab = sorted(set(text))
    # IDs 0-2 are reserved for the special tokens, so characters start at 3
    tokenizer = {char: i + 3 for i, char in enumerate(vocab)}
    tokenizer["<s>"] = 0
    tokenizer["</s>"] = 1
    tokenizer["<unk>"] = 2
    detokenizer = {v: k for k, v in tokenizer.items()}
    return tokenizer, detokenizer

naive_tokenizer, naive_detokenizer = create_naive_tokenizer(data["content"])
vocab_size = len(naive_tokenizer)  # characters plus the three special tokens

def tokenize(text, tokenizer):
    # Characters missing from the vocabulary map to the <unk> ID
    return [tokenizer.get(letter, tokenizer["<unk>"]) for letter in text]

def detokenize(tokens, detokenizer):
    return "".join([detokenizer.get(token, "<unk>") for token in tokens])

tokens = tokenize("Let's train a GPT model!", naive_tokenizer)
print(tokens)
print(detokenize(tokens, naive_detokenizer))

[2, 21, 34, 2, 33, 4, 34, 32, 18, 25, 29, 4, 18, 4, 10, 15, 2, 4, 28, 30, 20, 21, 27, 2]
<unk>et<unk>s train a GP<unk> model<unk>


### A modern tokenizer based on the Byte Pair Encoding (BPE) algorithm
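
The naive tokenizer above assigns one ID per character and turns every unseen character into `<unk>`. BPE instead starts from small units and repeatedly merges the most frequent adjacent pair into a new token. The following is a minimal sketch of that idea, assuming we start from characters (production tokenizers usually start from bytes); the helper names `train_bpe` and `apply_bpe` and the toy training text are hypothetical.

```python
from collections import Counter

def count_pairs(tokens):
    """Count how often each adjacent pair of symbols occurs."""
    return Counter(zip(tokens, tokens[1:]))

def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules from `text`, starting from characters."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(tokens)
        if not pairs:
            break
        best_pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so another merge would not shorten the text
        tokens = merge_pair(tokens, best_pair)
        merges.append(best_pair)
    return merges

def apply_bpe(text, merges):
    """Tokenize new text by replaying the learned merges in the order they were learned."""
    tokens = list(text)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens

bpe_merges = train_bpe("low lower lowest low low", num_merges=3)
bpe_tokens = apply_bpe("lowest low", bpe_merges)
print(bpe_merges)
print(bpe_tokens)
```

Each learned merge becomes a vocabulary entry, so frequent character sequences end up as single tokens while rare words are still representable as smaller pieces.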



### A batch of data from human to machine reading
#### Show what we are predicting
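
A GPT-style model is trained to predict the next token, so the targets are the inputs shifted by one position. The sketch below illustrates this with made-up token IDs; `make_training_pairs` and `context_length` are hypothetical names chosen for this example.

```python
def make_training_pairs(token_ids, context_length):
    """Slice a token sequence into (input, target) windows shifted by one position."""
    pairs = []
    for start in range(len(token_ids) - context_length):
        x = token_ids[start : start + context_length]
        y = token_ids[start + 1 : start + context_length + 1]
        pairs.append((x, y))
    return pairs

example_ids = [10, 11, 12, 13, 14, 15]  # hypothetical token IDs
training_pairs = make_training_pairs(example_ids, context_length=4)

# Every position in a window is its own prediction task:
# the model sees the tokens up to position t and must predict token t + 1.
x, y = training_pairs[0]
for t in range(len(x)):
    print(f"context {x[: t + 1]} -> target {y[t]}")
```

A single window of length 4 therefore yields four training examples, one per prefix.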

## Defining the model

### The embedding layer
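
An embedding layer can be thought of as a lookup table with one learnable vector per token ID. The sketch below uses plain Python lists and a small random initialization; the sizes, the standard deviation of 0.02, and the function names are assumptions made for illustration.

```python
import random

def init_embedding(vocab_size, embedding_dim, seed=0):
    """Create a vocab_size x embedding_dim table of small random values."""
    rng = random.Random(seed)
    return [
        [rng.gauss(0.0, 0.02) for _ in range(embedding_dim)]
        for _ in range(vocab_size)
    ]

def embed(token_ids, embedding_table):
    """Look up one vector per token ID; the ID is simply a row index."""
    return [embedding_table[token_id] for token_id in token_ids]

embedding_table = init_embedding(vocab_size=50, embedding_dim=8)
embedded = embed([4, 18, 4], embedding_table)
print(len(embedded), len(embedded[0]))
```

During training these vectors are updated like any other parameter, so tokens that behave similarly tend to end up with similar vectors.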


### What is the attention mechanism?
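
In short, attention lets each position build its output as a weighted average of the value vectors at the positions it is allowed to see, with weights derived from query-key dot products. Below is a minimal single-head sketch of causal scaled dot-product attention in plain Python. It assumes the queries, keys and values are already given; in a real transformer they come from learned linear projections of the embeddings, which are omitted here.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(queries, keys, values):
    """Single-head attention where position t only attends to positions 0..t."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for t, q in enumerate(queries):
        # Scaled dot product between the query at t and every key up to t
        scores = [
            sum(qi * ki for qi, ki in zip(q, keys[s])) / math.sqrt(d_k)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        # Weighted average of the visible value vectors
        out = [
            sum(w * values[s][j] for s, w in enumerate(weights))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical inputs
attn_outputs, attn_weights = causal_attention(toy_vectors, toy_vectors, toy_vectors)
for t, weights in enumerate(attn_weights):
    print(t, [round(w, 3) for w in weights])
```

The causal restriction is what makes next-token training valid: a position never sees the token it is asked to predict.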

### How can we measure the performance of a model? The cross-entropy loss
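
For one prediction, the cross-entropy loss is the negative log of the probability the model assigns to the correct next token. The sketch below computes it directly from raw scores (logits); the example logits are made up.

```python
import math

def cross_entropy(logits, target_id):
    """Negative log-probability of `target_id` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_id]

uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], target_id=2)
confident_loss = cross_entropy([0.0, 0.0, 5.0, 0.0], target_id=2)
wrong_loss = cross_entropy([5.0, 0.0, 0.0, 0.0], target_id=2)
print(uniform_loss, confident_loss, wrong_loss)
```

A model that guesses uniformly over a vocabulary of size V has a loss of log(V), which is a useful sanity check for the loss at the very start of training.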

### How can we actually train the model?
#### What is gradient descent and backpropagation?
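
Gradient descent repeatedly nudges the parameters in the direction that lowers the loss. As a minimal sketch, the "model" below is just a vector of logits, and the gradient of the cross-entropy loss with respect to those logits is written by hand as `softmax(logits) - one_hot(target)`. In a full model, backpropagation applies the chain rule to carry this same gradient back through every layer. The learning rate and step count are arbitrary choices.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def loss_and_grad(logits, target_id):
    """Cross-entropy loss and its gradient with respect to the logits."""
    probs = softmax(logits)
    loss = -math.log(probs[target_id])
    grad = [p - (1.0 if i == target_id else 0.0) for i, p in enumerate(probs)]
    return loss, grad

params = [0.0, 0.0, 0.0]  # the only "parameters" of this toy model
learning_rate = 0.5
losses = []
for step in range(20):
    loss, grad = loss_and_grad(params, target_id=1)
    losses.append(loss)
    # Move each parameter a small step against its gradient
    params = [p - learning_rate * g for p, g in zip(params, grad)]

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

The loss starts at log(3), the value for a uniform guess over three tokens, and shrinks as the logit of the target token grows relative to the others.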

### A note on model sizes and required training computation
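
A back-of-the-envelope estimate is often enough to reason about size and cost. The sketch below uses two widely quoted approximations: roughly 12 x d_model² parameters per transformer layer (assuming a 4x MLP expansion, and ignoring biases and layer norms), and roughly 6 floating-point operations per parameter per training token. The configuration shown is that of GPT-2 small, with the output layer assumed to share weights with the token embedding; the 10 billion token budget is a hypothetical figure.

```python
def estimate_parameters(n_layers, d_model, vocab_size, context_length):
    """Rough parameter count for a GPT-style decoder (ignores biases and layer norms)."""
    attention = 4 * d_model * d_model  # query, key, value and output projections
    mlp = 8 * d_model * d_model        # two linear layers with a 4x hidden expansion
    embeddings = (vocab_size + context_length) * d_model  # token and position tables
    return n_layers * (attention + mlp) + embeddings

def estimate_training_flops(n_parameters, n_tokens):
    """Rule of thumb: about 6 floating-point operations per parameter per token."""
    return 6 * n_parameters * n_tokens

gpt2_small_params = estimate_parameters(
    n_layers=12, d_model=768, vocab_size=50257, context_length=1024
)
training_flops = estimate_training_flops(gpt2_small_params, n_tokens=10e9)
print(f"{gpt2_small_params / 1e6:.0f}M parameters")
print(f"{training_flops:.2e} FLOPs for 10B tokens")
```

Because the per-layer term grows with the square of `d_model`, widening a model is far more expensive than the raw dimension suggests.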

### How do we make a prediction with the model?
#### Probabilistic sampling
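
Instead of always picking the most likely token, generation usually draws the next token at random in proportion to the model's probabilities. The sketch below shows two common controls, temperature and top-k, as one possible setup; the function name and default values are assumptions, and `temperature` must be greater than zero.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """Draw one token ID from softmax(logits / temperature), optionally limited to the top_k logits."""
    candidates = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None:
        candidates = candidates[:top_k]  # keep only the k highest-scoring tokens
    # Lower temperature sharpens the distribution, higher temperature flattens it
    scaled = [logits[i] / temperature for i in candidates]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # unnormalized softmax weights
    return rng.choices(candidates, weights=weights, k=1)[0]

example_logits = [0.1, 5.0, -2.0]  # hypothetical model output for a 3-token vocabulary
sampling_rng = random.Random(0)
samples = [
    sample_next_token(example_logits, temperature=0.8, top_k=2, rng=sampling_rng)
    for _ in range(10)
]
print(samples)
```

With `top_k=1` the function reduces to greedy decoding, since only the highest-scoring token remains as a candidate.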
#### Inefficiency of the attention mechanism at inference time
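
A likely source of this inefficiency is that generating one token at a time with plain attention recomputes the scores for the whole prefix at every step. The sketch below only counts query-key dot products for a single head, with and without reusing previously computed keys and values (often called a KV cache). The function names are hypothetical, and the count ignores every cost other than the attention scores.

```python
def attention_scores_without_cache(num_new_tokens, prompt_length=0):
    """Dot products needed when causal attention is recomputed from scratch at each step."""
    total = 0
    for step in range(1, num_new_tokens + 1):
        seq_len = prompt_length + step
        # Every position t attends to positions 0..t, so 1 + 2 + ... + seq_len scores
        total += seq_len * (seq_len + 1) // 2
    return total

def attention_scores_with_cache(num_new_tokens, prompt_length=0):
    """Dot products needed when earlier keys and values are kept and reused."""
    total = 0
    for step in range(1, num_new_tokens + 1):
        seq_len = prompt_length + step
        # Only the newest query is scored against the seq_len stored keys
        total += seq_len
    return total

for n in (4, 100, 1000):
    print(n, attention_scores_without_cache(n), attention_scores_with_cache(n))
```

Without reuse the work grows roughly with the cube of the generated length, while with reuse it grows with the square, at the price of keeping all past keys and values in memory.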