<a href="https://colab.research.google.com/github/JamorMoussa/Build-GPT-From-Scratch/blob/main/notebooks/Build_GPT_From_Scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building a GPT from scratch

**GPT** stands for **Generative Pre-trained Transformer**, a probabilistic language model built on the **Transformer** architecture. This architecture was introduced in the 2017 paper from Google, [Attention Is All You Need](https://arxiv.org/pdf/1706.03762), which proposed the **Transformer** model for machine translation.

This notebook contains notes for **Andrej Karpathy**'s tutorial, titled [Let's Build GPT: From Scratch, in Code, Spelled Out](https://www.youtube.com/watch?v=kCc8FmEb1nY) on his YouTube channel.

## Let's Prepare The Dataset

In this tutorial, we use the Tiny Shakespeare dataset. It is a plain-text file of about 1.06 MiB (1,115,394 bytes) that concatenates the works of [William Shakespeare](https://en.wikipedia.org/wiki/William_Shakespeare).

In [1]:
# Let's download the dataset first.

!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2024-06-05 23:57:15--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2024-06-05 23:57:16 (18.4 MB/s) - ‘input.txt’ saved [1115394/1115394]



Next, let's read the dataset and print the first 1000 characters.

In [2]:
dataset_path = "/content/input.txt"

with open(dataset_path, "r", encoding="utf-8") as f:
  text = f.read()

In [3]:
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [4]:
# Let's print the length of the dataset in characters:

print(f"length of dataset in characters: {len(text)}")

length of dataset in characters: 1115394


The next step is to build the vocabulary by finding the unique characters present in the text. Then, we build an `encoder` that maps characters to integers and a `decoder` that maps integers back to characters.

In [37]:
chars = sorted(list(set(text)))
vocab_size = len(chars)
print("".join(chars))
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


The dataset contains 65 unique characters. Let's build the `encode` function, which maps a string to a list of integers, and the `decode` function, which performs the inverse operation.

In [6]:
stoi = {char: i for i, char in enumerate(chars)}
itos = {i: char for i, char in enumerate(chars)}

In [7]:
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: "".join([itos[i] for i in l])

hello_code = encode('hello')

print(f"the encode of 'hello' is {hello_code}")
print(f"Let's decode it {decode(hello_code)}")

the encode of 'hello' is [46, 43, 50, 50, 53]
Let's decode it hello


Character-level encoding is a naive strategy. In practice, language models use sub-word tokenizers. For example, Google uses [SentencePiece](https://github.com/google/sentencepiece), an unsupervised text tokenizer. Another example is [Tiktoken](https://github.com/openai/tiktoken) from OpenAI, which implements [Byte Pair Encoding (BPE)](https://en.wikipedia.org/wiki/Byte_pair_encoding), the scheme OpenAI uses in its models.
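
To give a feel for how BPE builds sub-word units, here is a minimal sketch of a single merge step: find the most frequent adjacent pair and replace it with one new token. It is a toy illustration on characters, not Tiktoken's implementation (which works on bytes and applies a pre-trained merge table). The helper names `most_frequent_pair` and `merge_pair` and the sample string are made up for this example.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the adjacent pair of tokens that occurs most often."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]


def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2  # skip both halves of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


bpe_tokens = list("aaabdaaabac")          # start from single characters
best_pair = most_frequent_pair(bpe_tokens)
bpe_tokens = merge_pair(bpe_tokens, best_pair)
print(best_pair, bpe_tokens)
```

Repeating this step adds one token to the vocabulary per merge while making the encoded sequence shorter.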

Let's look at an example using the `Tiktoken` tokenizer.

In [None]:
!pip install tiktoken

In [9]:
import tiktoken

enc = tiktoken.get_encoding("gpt2")
print(f"vocab size: {enc.n_vocab}")

vocab size: 50257


In [10]:
enc.encode("hello enveryone")

[31373, 551, 548, 505]

In [11]:
[enc.decode([code]) for code in [31373, 551, 548, 505]]

['hello', ' en', 'very', 'one']
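
The trade-off between the two schemes can be made concrete with the numbers already shown: the character vocabulary is tiny but produces long sequences, while the GPT-2 vocabulary is large but produces short ones. The sketch below rebuilds the character mapping from the 65 characters printed earlier and reuses the GPT-2 ids from the cell above. The variable names are chosen for this example only.

```python
# Vocabulary copied from the 65 characters printed earlier in this notebook.
vocab_chars = "\n !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
char_to_id = {ch: i for i, ch in enumerate(vocab_chars)}

sample = "hello enveryone"
char_ids = [char_to_id[c] for c in sample]  # one id per character
gpt2_ids = [31373, 551, 548, 505]           # GPT-2 BPE ids, copied from the cell above

print(f"character level: vocab={len(vocab_chars)}, tokens={len(char_ids)}")
print(f"GPT-2 BPE:       vocab=50257, tokens={len(gpt2_ids)}")
```

The same string takes 15 tokens at the character level and 4 tokens with GPT-2's BPE.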

So, for this tutorial, we will continue to use character encoding for simplicity. Now, it's time to encode the entire dataset using this encoder. Let's start using the `PyTorch` framework to work with tensors.

In [13]:
import torch

In [14]:
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)

torch.Size([1115394]) torch.int64


In [15]:
data[:100]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])

In [16]:
length = int(0.9 * len(data))

In [17]:
train_set = data[: length]
test_set = data[length:]

We won't feed the entire dataset to the Transformer at once, as that would be very expensive. Instead, we train on chunks of text whose maximum length is `block_size`, also called the context size.


In [18]:
block_size: int = 8

In [19]:
train_set[:block_size + 1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [20]:
decode(train_set[:block_size + 1].tolist())

'First Cit'

Given a context of characters, the Transformer model predicts the next character. The context length ranges from 1 to `block_size`.

In [21]:
x = train_set[:block_size]
y = train_set[1: block_size + 1]

for t in range(block_size):
  context = x[:t + 1]
  target = y[t]
  print(f"when input is {context} the target: {target}")

when input is tensor([18]) the target: 47
when input is tensor([18, 47]) the target: 56
when input is tensor([18, 47, 56]) the target: 57
when input is tensor([18, 47, 56, 57]) the target: 58
when input is tensor([18, 47, 56, 57, 58]) the target: 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58


A chunk of `block_size + 1` characters therefore yields eight training examples. Training on all of them lets the Transformer work with context sizes ranging from one character up to `block_size` characters.
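
The same eight examples are easier to read as text. The sketch below decodes the chunk shown above back to characters, using a vocabulary string copied from the 65 characters printed earlier. The helper `ids_to_text` and the other names are introduced here for illustration only.

```python
# Vocabulary copied from the 65 characters printed earlier in this notebook.
vocab_chars = "\n !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"


def ids_to_text(ids):
    """Map a list of integer ids back to a string."""
    return "".join(vocab_chars[i] for i in ids)


context_len = 8
chunk = [18, 47, 56, 57, 58, 1, 15, 47, 58]  # the first block_size + 1 ids shown above

inputs, targets = chunk[:context_len], chunk[1:]
examples = [
    (ids_to_text(inputs[: t + 1]), ids_to_text([targets[t]]))
    for t in range(context_len)
]
for context, target in examples:
    print(f"{context!r} -> {target!r}")
```

The first example is `'F' -> 'i'` and the last is `'First Ci' -> 't'`, so one chunk covers every context length from 1 to 8.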


So far, we have only worked along the time dimension. Let's now prepare the data to introduce the batch dimension.

In [25]:
torch.manual_seed(1337) # fix the seed
block_size = 8 # maximum length of context.
batch_size = 4

In [28]:
def get_batch(split: str = "train"):
  data = train_set if split == "train" else test_set
  ix = torch.randint(len(data) - block_size, (batch_size,))

  x = torch.stack([data[i: i + block_size] for i in ix])
  y = torch.stack([data[i + 1: i + block_size + 1] for i in ix])
  return x, y

In [29]:
xb, yb = get_batch(split = "train")

In [30]:
xb

tensor([[43,  1, 51, 39, 63,  1, 40, 43],
        [58, 46, 43,  1, 43, 39, 56, 57],
        [39, 58, 47, 53, 52, 12,  1, 37],
        [53, 56, 43,  1, 21,  1, 41, 39]])

In [31]:
yb

tensor([[ 1, 51, 39, 63,  1, 40, 43,  1],
        [46, 43,  1, 43, 39, 56, 57, 10],
        [58, 47, 53, 52, 12,  1, 37, 53],
        [56, 43,  1, 21,  1, 41, 39, 51]])
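
Each row of `yb` is the matching row of `xb` shifted one position to the left, with one new character appended. The sketch below checks that property with plain Python lists instead of tensors. The function `sample_windows` and the toy data are hypothetical stand-ins for `get_batch` and the encoded text.

```python
import random


def sample_windows(data, block_size, batch_size, rng):
    """Sample `batch_size` windows; targets are the inputs shifted by one."""
    # Starts are drawn so that block_size inputs plus one extra target fit in data.
    starts = [rng.randrange(len(data) - block_size) for _ in range(batch_size)]
    x = [data[i : i + block_size] for i in starts]
    y = [data[i + 1 : i + block_size + 1] for i in starts]
    return x, y


toy_data = list(range(100))  # stand-in for the encoded text
toy_x, toy_y = sample_windows(toy_data, block_size=8, batch_size=4, rng=random.Random(1337))

for x_row, y_row in zip(toy_x, toy_y):
    # Dropping the first input and the last target leaves identical sequences.
    print(x_row, "->", y_row, x_row[1:] == y_row[:-1])
```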

In [34]:
for b in range(batch_size): # batch dimension
    for t in range(block_size): # time dimension
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"when input is {context.tolist()} the target: {target}")
    print()

when input is [43] the target: 1
when input is [43, 1] the target: 51
when input is [43, 1, 51] the target: 39
when input is [43, 1, 51, 39] the target: 63
when input is [43, 1, 51, 39, 63] the target: 1
when input is [43, 1, 51, 39, 63, 1] the target: 40
when input is [43, 1, 51, 39, 63, 1, 40] the target: 43
when input is [43, 1, 51, 39, 63, 1, 40, 43] the target: 1

when input is [58] the target: 46
when input is [58, 46] the target: 43
when input is [58, 46, 43] the target: 1
when input is [58, 46, 43, 1] the target: 43
when input is [58, 46, 43, 1, 43] the target: 39
when input is [58, 46, 43, 1, 43, 39] the target: 56
when input is [58, 46, 43, 1, 43, 39, 56] the target: 57
when input is [58, 46, 43, 1, 43, 39, 56, 57] the target: 10

when input is [39] the target: 58
when input is [39, 58] the target: 47
when input is [39, 58, 47] the target: 53
when input is [39, 58, 47, 53] the target: 52
when input is [39, 58, 47, 53, 52] the target: 12
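when input is [39, 58, 47, 53, 52, 12] the target: 1
when input is [39, 58, 47, 53, 52, 12, 1] the target: 37
when input is [39, 58, 47, 53, 52, 12, 1, 37] the target: 53

when input is [53] the target: 56
when input is [53, 56] the target: 43
when input is [53, 56, 43] the target: 1
when input is [53, 56, 43, 1] the target: 21
when input is [53, 56, 43, 1, 21] the target: 1
when input is [53, 56, 43, 1, 21, 1] the target: 41
when input is [53, 56, 43, 1, 21, 1, 41] the target: 39
when input is [53, 56, 43, 1, 21, 1, 41, 39] the target: 51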

## Build Bigram Language Model

The bigram language model is the simplest language model to start with: it predicts the next character from the current character alone, ignoring everything that came before it.
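
To make the idea concrete before bringing in a neural network, here is a minimal sketch that estimates next-character probabilities by counting adjacent pairs. Counting is a swapped-in technique for illustration; the model in this notebook learns a table of logits with `nn.Embedding` instead. The function `bigram_probs` and the toy corpus are made up for this example.

```python
from collections import Counter, defaultdict


def bigram_probs(text):
    """Estimate P(next char | current char) from adjacent-pair counts."""
    counts = defaultdict(Counter)
    for current, nxt in zip(text, text[1:]):
        counts[current][nxt] += 1
    return {
        current: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for current, followers in counts.items()
    }


toy_text = "hello hello help"  # hypothetical toy corpus
probs = bigram_probs(toy_text)
print(probs["h"])  # 'h' is always followed by 'e' in the toy corpus
print(probs["l"])  # 'l' is followed by 'l', 'o', or 'p'
```

Each row of the embedding table used below plays the same role as one of these dictionaries: it scores every possible next character given the current one.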

In [35]:
import torch, torch.nn as nn
from torch.nn import functional as F

In [131]:
class BigramLanguageModel(nn.Module):

  def __init__(self, vocab_size: int = 65):
    super().__init__()

    # Row i holds the logits for the token that follows token i.
    self.emb = nn.Embedding(vocab_size, vocab_size) # table of (vocab_size, vocab_size)

  def forward(self, ids: torch.Tensor) -> torch.Tensor:
    ids = torch.as_tensor(ids, dtype=torch.long) # embedding lookups need integer indices

    logits = self.emb(ids) # (B, T, C)
    # (B, C, T) is the layout F.cross_entropy expects for sequence inputs.
    return logits.permute(0, 2, 1) # (B, C, T)

  def generate(self, idx: torch.Tensor, max_tokens: int):
    # idx is a (B, T) tensor of token ids; each step appends one sampled id.
    for _ in range(max_tokens):
      logits = self(idx)[:, :, -1] # keep only the last time step: (B, C)

      probs = F.softmax(logits, dim=-1) # (B, C)

      idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
      idx = torch.cat((idx, idx_next), dim=1) # (B, T + 1)

    return idx

In [132]:
model = BigramLanguageModel(vocab_size)

In [133]:
model

BigramLanguageModel(
  (emb): Embedding(65, 65)
)

In [138]:
start_index = torch.zeros((1, 1), dtype=torch.long)

In [139]:
decode(model.generate(idx= start_index, max_tokens= 100)[0].tolist())

'\nI-Lrg3,N!. htSZn&.UC&hJ.zVLaM?EFeaOQmEWlqKSGjb\nKW.vuIqHZBPiJSp BPwmUsRcLKPBoZ$o;OvCJVbviApwGYlMneO\ne'

As we can see, the generated output is essentially random, because the model has not been trained yet. Training is the next step.
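
To see why an untrained model produces noise, recall that each step of `generate` turns the last position's logits into probabilities with softmax and then samples one index from them. The sketch below reproduces that step in plain Python with made-up logits. The names `softmax` and `sample_index` and both logit vectors are assumptions for illustration, and an untrained `nn.Embedding` holds random values rather than exact zeros, so its distribution is arbitrary rather than perfectly uniform.

```python
import math
import random


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(probs, rng):
    """Draw one index according to `probs` (the role torch.multinomial plays)."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)
flat_logits = [0.0] * 65              # no preference among the 65 characters
peaked_logits = [0.0] * 64 + [10.0]   # hypothetical logits with one strong preference

flat = softmax(flat_logits)
peaked = softmax(peaked_logits)
print(f"flat:   largest probability = {max(flat):.4f}")
print(f"peaked: largest probability = {max(peaked):.4f}")
print("sampled index:", sample_index(peaked, rng))
```

When the logits carry no information about the text, every character is about as likely as any other. Training has to reshape the logits so that plausible next characters receive most of the probability mass.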