In [38]:
#| default_exp bigram

In [39]:
#| echo false
#| export
from httpx import get as hget
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.set_printoptions(linewidth=150)

Hyperparameters

In [40]:
#| export
batch_size = 32 # how many independent sequences will be processed in parallel
block_size = 8 # maximum context length for predictions
max_iters = 3_000
eval_interval = 300
lr = 1e-2
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 32

torch.manual_seed(1337);

Here we download sample dataset of Tiny Shakespeare dataset, which is a collection of all Shakespeare texts. The size is approximately 1 mln characters

In [41]:
#| export
f = hget("https://raw.githubusercontent.com/karpathy/ng-video-lecture/refs/heads/master/input.txt")
text = f.text

In [42]:
print("length of dataset in characters: ", len(text))

length of dataset in characters:  1115394


In [43]:
# let's look at the first 1000 characters
print(text[:500])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor


There are in total 65 unique characters that our model can see and work with.

In [44]:
#| export
chars = sorted(list(set(text)))
vocab_size = len(chars)

In [45]:
print(''.join(chars))
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


Models work numbers, so let's create a mapping from characters to integers: encoder and decoder. Here we use a very simple encoder-decoder (tokenizer) by simply tokenizing each character by using their position. See tiktoken (used by OpenAI)

In [46]:
#| export
stoi = {ch:i for i,ch in enumerate(chars)}
itos = {i:ch for ch, i in stoi.items()}
encode = lambda s: [stoi[c] for c in s] # encoder: take a string and output a list of integers
decode = lambda l: ''.join((itos[o] for o in l) )# decoder: take a list of integers and output a string

In [47]:
print(encode("hii there"))
print(decode(encode("hii there")))

[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there


Let's encode the whole text dataset and store it into a torch.Tensor `data`. Currently data is simply a tensor stretched in a row. We also split our data into training (90% of the data) and validation (10% of the data) sets to calculate accuracy of our model and avoid overfitting. Without testing the model on a hold-out validation set we risk our model just memorizing the whole training set and having no actual predictive / creative power.

In [48]:
#| export
data = torch.tensor(encode(text), dtype=torch.long)

In [49]:
print(data.shape, data.dtype)
print(data[:100]) # the 1000 characters we looked at earlier will look to GPT like this)

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44, 53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52,
        63,  1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1, 57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43,
        39, 49,  6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 37, 53, 59])


Split data into training (90%) and validation sets

In [50]:
#| export
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

In [51]:
print(len(train_data))
print(len(val_data))

1003854
111540


We will pass data (Shakespeare texts) into transformer model using batches because feeding the whole text at once will be computationally prohibitive. The idea is to pass random blocks (sequence) of text into a model and train it on predicting the next character. We can illustrate it below. In each sequence we train transformer model on context size from 1 to `block_size`. This allows the model to see different contexts in predicting the next character in a sequence. In generating text the model starts with a context on 1 and then it will predict up to a block_size characters and then just truncate because it is limited by the context of `block_size`.

In [52]:
block_size = 8

x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in  range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"When input is {context} the target : {target}")

When input is tensor([18]) the target : 47
When input is tensor([18, 47]) the target : 56
When input is tensor([18, 47, 56]) the target : 57
When input is tensor([18, 47, 56, 57]) the target : 58
When input is tensor([18, 47, 56, 57, 58]) the target : 1
When input is tensor([18, 47, 56, 57, 58,  1]) the target : 15
When input is tensor([18, 47, 56, 57, 58,  1, 15]) the target : 47
When input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target : 58


In [53]:
torch.manual_seed(1337)
batch_size = 4 # number of independent sequences we process in parallel
block_size = 8 # maximum context length for predictions

We want to utilize GPU's power of parallel calculations and feed multiple sequences in a batch.

In [54]:
#| export
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == "train" else val_data
    ix = torch.randint(0, len(data)-block_size, (batch_size,)) # random offsets
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x,y = x.to(device), y.to(device)
    return x, y

In [55]:
xb, yb = get_batch('train')

print('inputs:')
print(xb.shape)
print(xb,'\n')
print('targets:')
print(yb.shape)
print(yb)

print('----')

for b in range(batch_size):
    for t in range(block_size):
        context = xb[b,:t+1]
        target = yb[b,t]
        print(f"When input is {context.tolist()} the target : {target}")

inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]], device='cuda:0') 

targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]], device='cuda:0')
----
When input is [24] the target : 43
When input is [24, 43] the target : 58
When input is [24, 43, 58] the target : 5
When input is [24, 43, 58, 5] the target : 57
When input is [24, 43, 58, 5, 57] the target : 1
When input is [24, 43, 58, 5, 57, 1] the target : 46
When input is [24, 43, 58, 5, 57, 1, 46] the target : 43
When input is [24, 43, 58, 5, 57, 1, 46, 43] the target : 39
When input is [44] the target : 53
When input is [44, 53] the target : 56
When input is [44, 53, 56] the target : 1
When input is [44, 53, 56, 1] the target : 58
When input is [44, 53,

So each batch of 4 random sequences yields 32 examples (4 * 8) that will be fed into a neural net.

In [56]:
print(xb)

tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]], device='cuda:0')


In [57]:
torch.manual_seed(1337);

Let's implement the simplest language model - Bigram Language Model. Each token is embedded in a vocab_size vector, so our batch of size (B,T) becomes (B,T,C), where C stands for channels and is `vocab_size` in our example. Those channels are treated as logits (scored) in predicting the next character based on the individual identity of the current token, because there is no interaction between current token and previous ones. Tokesn don't talk to each other and there is no context.

In [92]:
#| export
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):

        # idx and targets are both (B, T) tensor of integers
        logits = self.token_embedding_table(idx) # (B, T, C)

        if targets is None:
            loss = None
        else:
            # we reshape logits to pass it to F.cross_entropy as it expects (B,C,T)
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            target = targets.view(B*T)
            loss = F.cross_entropy(logits, target)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, _ = self(idx)
            # focus only on the last time step T (in forward pass logits retain original shape (no .view))
            logits = logits[:,-1,:] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
            # print(idx.shape)
        return idx

model = BigramLanguageModel().to(device)

In [96]:
logits, loss = model(xb, yb)
print(f'{logits.shape=}')
print(f'{loss=}')

print(decode(model.generate(torch.zeros((1, 1), dtype=torch.long).to(device), max_new_tokens=100)[0].tolist()))
print(decode(model.generate(torch.zeros((4, 1), dtype=torch.long).to(device), max_new_tokens=100)[2].tolist()))

logits.shape=torch.Size([32, 65])
loss=tensor(4.8430, device='cuda:0', grad_fn=<NllLossBackward0>)

lMUo$Rq,fVFEPEDFCgRySoU.JaHK-NLbE!rs,iyb&F&:
aadYabWy$!JEgDxsYBhuihScNIp?Fa'?Qe
yRJy:tMVq&fR;VEd!3MJ

fd 
cAwPNV;JiATVP
WYXjpif3IWuZk&-RQk,VDj
VXBt-tB!H:KlPoejvlNGcN?LEbzWUtu$sLyG:N:$ZtQFHUAc$P
doJyyLhW


So each batch of 4 random sequences yields 32 examples (4 * 8) that will be fed into a neural net.

To measure how well we are predicting the next character in a sequence, we use Cross-Entropy Loss. It basically converts logits to probabilities for each example, then plucks out the correct dimension in a C dimension that corresponds to the actual next character, takes -log of it and averages across all examples.

In [90]:
logits = out[0]
targets = yb.view(-1) #  to flatten out the targets (B,T) - (B*T) and match logits
probs = logits.exp()/(logits.exp()).sum(dim=1, keepdim=True)
-torch.log(probs[range(targets.shape[0]), targets]).mean(), out[1]

(tensor(4.8786, device='cuda:0', grad_fn=<NegBackward0>),
 tensor(4.8786, device='cuda:0', grad_fn=<NllLossBackward0>))

Function to estimate the loss by averaging over `eval_iters`

In [31]:
#|export
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

Create a PyTorch optimizer, using AdamW

In [100]:
#| export
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

In [101]:
batch_size = 32

In [102]:
#| export
for iter in range(max_iters):
   
    # evaluate loss on train and val sets once in a while
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

step 0: train loss 4.7379, val loss 4.7258
step 300: train loss 2.8299, val loss 2.8475
step 600: train loss 2.5496, val loss 2.5516
step 900: train loss 2.5087, val loss 2.5169
step 1200: train loss 2.4758, val loss 2.5053
step 1500: train loss 2.4721, val loss 2.4993
step 1800: train loss 2.4719, val loss 2.4880
step 2100: train loss 2.4702, val loss 2.4887
step 2400: train loss 2.4743, val loss 2.4883
step 2700: train loss 2.4659, val loss 2.4898


In [103]:
#| export
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(model.generate(context, max_new_tokens=300)[0].tolist()))


Th fepyotssthecas l.
TAn.
Mourethal wove.
seazende benenovetour dis?



TI's cok hedin tie s inds he be feRUCatos:
Whit Clo ghasundisthou ld, he n, soxcone.

Anthataker aghercobun ws m s s withoumas Fond t s wllo INour id, morsed
Fourd?
TI idurd po venond, d Caltey
K:
BIUSoou tiund thornofen e sutan


In [104]:
from nbdev.export import nb_export
nb_export('GPT_dev0.ipynb')