# GPT2 from scratch

**Based on the video by Andrej Karpathy:** https://www.youtube.com/watch?v=kCc8FmEb1nY

We build a Generatively Pretrained Transformer (GPT), following the paper "Attention is All You Need" and OpenAI's GPT-2 / GPT-3. We talk about connections to ChatGPT, which has taken the world by storm. We watch GitHub Copilot, itself a GPT, help us write a GPT (meta :D!). I recommend people watch the earlier makemore videos to get comfortable with the autoregressive language modeling framework and basics of tensors and PyTorch nn, which we take for granted in this video.

Links:
- Google colab for the video: https://colab.research.google.com/drive/1JMLa53HDuA-i7ZBmqV7ZnA3c_fvtXnx-
- GitHub repo for the video: https://github.com/karpathy/ng-video-lecture
- Playlist of the whole Zero to Hero series so far: The spelled-out intro to neural netwo...
- nanoGPT repo: https://github.com/karpathy/nanoGPT
- my website: https://karpathy.ai
- my twitter: / karpathy
- our Discord channel: / discord

Supplementary links:
- Attention is All You Need paper: https://arxiv.org/abs/1706.03762
- OpenAI GPT-3 paper: https://arxiv.org/abs/2005.14165 
- OpenAI ChatGPT blog post: https://openai.com/blog/chatgpt/
- The GPU I'm training the model on is from Lambda GPU Cloud, I think the best and easiest way to spin up an on-demand GPU instance in the cloud that you can ssh to: https://lambdalabs.com . If you prefer to work in notebooks, I think the easiest path today is Google Colab.

Suggested exercises:
- EX1: The n-dimensional tensor mastery challenge: Combine the `Head` and `MultiHeadAttention` into one class that processes all the heads in parallel, treating the heads as another batch dimension (answer is in nanoGPT).
- EX2: Train the GPT on your own dataset of choice! What other data could be fun to blabber on about? (A fun advanced suggestion if you like: train a GPT to do addition of two numbers, i.e. a+b=c. You may find it helpful to predict the digits of c in reverse order, as the typical addition algorithm (that you're hoping it learns) would proceed right to left too. You may want to modify the data loader to simply serve random problems and skip the generation of train.bin, val.bin. You may want to mask out the loss at the input positions of a+b that just specify the problem using y=-1 in the targets (see CrossEntropyLoss ignore_index). Does your Transformer learn to add? Once you have this, swole doge project: build a calculator clone in GPT, for all of +-*/. Not an easy problem. You may need Chain of Thought traces.)
- EX3: Find a dataset that is very large, so large that you can't see a gap between train and val loss. Pretrain the transformer on this data, then initialize with that model and finetune it on tiny shakespeare with a smaller number of steps and lower learning rate. Can you obtain a lower validation loss by the use of pretraining?
- EX4: Read some transformer papers and implement one additional feature or change that people seem to use. Does it improve the performance of your GPT?
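
The masking idea in EX2 can be sketched as follows. This is only an illustration under stated assumptions, not the video's or nanoGPT's code: the 12-character vocabulary, the helper `make_example`, and the random `logits` standing in for model output are all hypothetical. The only library behavior relied on is the `ignore_index` argument of `F.cross_entropy`, which skips targets equal to that value.

```python
import torch
from torch.nn import functional as F

# Hypothetical character vocabulary for addition problems (an assumption).
vocab = "0123456789+="
stoi = {ch: i for i, ch in enumerate(vocab)}

def make_example(a, b):
    """Encode 'a+b=' followed by the digits of a+b in reverse order."""
    prompt = f"{a}+{b}="
    answer = str(a + b)[::-1]        # reversed digits: 12+9=21 becomes "12"
    tokens = [stoi[ch] for ch in prompt + answer]
    x = torch.tensor(tokens[:-1])    # model input
    y = torch.tensor(tokens[1:])     # next-token targets
    # Targets that merely restate the problem are masked out with -1.
    y[: len(prompt) - 1] = -1
    return x, y

x, y = make_example(12, 9)

# Random logits stand in for the output of a real model (assumption).
logits = torch.randn(len(x), len(vocab))

# Positions whose target is -1 do not contribute to the loss.
loss = F.cross_entropy(logits, y, ignore_index=-1)
print(x.tolist(), y.tolist(), loss.item())
```

With the default mean reduction, the loss is averaged over the unmasked positions only, so the model is scored purely on the answer digits.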

## Basic settings and imports

In [1]:
# Automatically reload modules when they have changed
%load_ext autoreload

In [2]:
%autoreload 2

In [3]:
# Widen the Jupyter notebook page
from IPython.display import HTML, display

display(HTML("<style>.container { width:80% !important; }</style>"))

In [4]:
# Show images inline
%matplotlib inline

In [5]:
# Check python version
from platform import python_version
print(python_version())

3.8.19


In [6]:
# Import libraries
import torch
from torch import nn
from torch.nn import functional as F

## Start Course

In [14]:
# Get dataset
!curl https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt -o tinyshakespeare.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1089k  100 1089k    0     0  5327k      0 --:--:-- --:--:-- --:--:-- 5313k


In [7]:
# Read the dataset into memory to inspect it
with open('tinyshakespeare.txt', mode='r', encoding='utf-8') as f:
    text = f.read()

In [8]:
print(f"Length of dataset: {len(text)}")
print(f"First 100 chars: \n{text[:100]}")

Length of dataset: 1115394
First 100 chars: 
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You


In [9]:
# All characters in the text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(chars)
print(vocab_size)

['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
65


In [10]:
# Create mappings between characters and integers (encoder and decoder)
stoi = {s:i for i,s in enumerate(chars)}
itos = {i:s for i,s in enumerate(chars)}
encode = lambda s: [stoi[el] for el in s]
decode = lambda i: "".join([itos[el] for el in i])

print(encode("Hi There"))
print(decode([20, 47, 1, 32, 46, 43, 56, 43]))

[20, 47, 1, 32, 46, 43, 56, 43]
Hi There
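
A quick way to sanity-check a character-level tokenizer like the one above is a round trip: decoding an encoded string should return the original string. The sketch below is self-contained, so it uses a short stand-in corpus and hypothetical names (`sample_text`, `encode_text`, `decode_ids`) rather than the notebook's `text`, `encode`, and `decode`. It also shows that a character missing from the vocabulary raises a `KeyError`, since the mapping has no fallback token.

```python
# Stand-in corpus (assumption); the notebook uses the full tinyshakespeare text.
sample_text = "First Citizen:\nBefore we proceed any further, hear me speak."

# Same construction as in the notebook: one integer id per unique character.
sample_chars = sorted(set(sample_text))
sample_stoi = {ch: i for i, ch in enumerate(sample_chars)}
sample_itos = {i: ch for i, ch in enumerate(sample_chars)}

def encode_text(s):
    """Map a string to a list of integer ids."""
    return [sample_stoi[ch] for ch in s]

def decode_ids(ids):
    """Map a list of integer ids back to a string."""
    return "".join(sample_itos[i] for i in ids)

# Round trip: decoding the encoding gives back the input.
round_trip = decode_ids(encode_text("hear me"))
print(round_trip)

# A character outside the vocabulary cannot be encoded.
try:
    encode_text("Z")
except KeyError as err:
    print(f"unknown character: {err}")
```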


In [11]:
# Tokenize the whole dataset, using torch tensor
data = torch.tensor(encode(text), dtype=torch.long)

print(data.shape, data.dtype)
print(data[:100])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])


In [12]:
# Split dataset in train and validation
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]

In [13]:
# Define block size (context length)
block_size = 16
train_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43])

In [14]:
# Example: what the input x is and what the transformer has to predict at each position
x = train_data[:block_size]
y = train_data[1: block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f'When input is {context} the target is {target}')

When input is tensor([18]) the target is 47
When input is tensor([18, 47]) the target is 56
When input is tensor([18, 47, 56]) the target is 57
When input is tensor([18, 47, 56, 57]) the target is 58
When input is tensor([18, 47, 56, 57, 58]) the target is 1
When input is tensor([18, 47, 56, 57, 58,  1]) the target is 15
When input is tensor([18, 47, 56, 57, 58,  1, 15]) the target is 47
When input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target is 58
When input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58]) the target is 47
When input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47]) the target is 64
When input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64]) the target is 43
When input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43]) the target is 52
When input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52]) the target is 10
When input is tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10]) the target is 0
When input is tensor([

In [28]:
# Example: draw 8 random integers in [0, 100) with torch.randint
torch.randint(100,(8,))

tensor([50, 71, 42, 50, 31, 12, 69,  0])

In [29]:
# Example: slice block_size-long windows of train_data starting at a few offsets
[train_data[i:i+block_size] for i in [50, 51, 100]]

[tensor([ 1, 51, 43,  1, 57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10]),
 tensor([51, 43,  1, 57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0]),
 tensor([ 1, 39, 56, 43,  1, 39, 50, 50,  1, 56, 43, 57, 53, 50, 60, 43])]

In [30]:
torch.manual_seed(1337)
block_size = 8
batch_size = 4

def get_batch(split=None):
    """
    Generate a batch of data for training or validation.
    Any split other than "train" falls back to the validation data.
    """
    if split == "train":
        data = train_data
    else:
        data = val_data
    ix = torch.randint(len(data)-block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')

print('inputs')
print(xb.shape)
print(xb)

print('targets')
print(yb.shape)
print(yb)
print("-"*20)

for b in range(batch_size):
    for t in range(block_size):
        context = xb[b,:t+1]
        target = yb[b,t]
        print(f'When input is {context} the target is {target}')

inputs
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
--------------------
When input is tensor([24]) the target is 43
When input is tensor([24, 43]) the target is 58
When input is tensor([24, 43, 58]) the target is 5
When input is tensor([24, 43, 58,  5]) the target is 57
When input is tensor([24, 43, 58,  5, 57]) the target is 1
When input is tensor([24, 43, 58,  5, 57,  1]) the target is 46
When input is tensor([24, 43, 58,  5, 57,  1, 46]) the target is 43
When input is tensor([24, 43, 58,  5, 57,  1, 46, 43]) the target is 39
When input is tensor([44]) the target is 53
When input is tensor([44, 53]) the target is 56
When input is tensor([44, 53, 56])
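
The relationship between inputs and targets printed above can also be checked directly: every row of the targets is the matching row of the inputs shifted one position to the right. The sketch below is a minimal, self-contained check that assumes a toy dataset (`toy_data`, a hypothetical stand-in for `train_data`) so it runs without the Shakespeare file. The sampling lines mirror `get_batch`.

```python
import torch

torch.manual_seed(0)

# Toy stand-in for train_data (assumption): the integers 0..99.
toy_data = torch.arange(100)
block_size, batch_size = 8, 4

# Same sampling scheme as get_batch: random start offsets, then stacked windows.
ix = torch.randint(len(toy_data) - block_size, (batch_size,))
xb = torch.stack([toy_data[i:i + block_size] for i in ix])
yb = torch.stack([toy_data[i + 1:i + block_size + 1] for i in ix])

# Targets are the inputs shifted by one: yb[b, t] is the token after xb[b, t].
assert torch.equal(xb[:, 1:], yb[:, :-1])

# Every (b, t) position is a separate prediction problem with context xb[b, :t+1].
num_examples = batch_size * block_size
print(xb.shape, yb.shape, num_examples)
```

A single `(4, 8)` batch therefore packs 32 context/target pairs, which is what the nested loop in the cell above enumerates.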
