<a href="https://colab.research.google.com/github/ML-Guy/tutorials_transformers/blob/main/transformers_01_gpt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Building a GPT

Much of the code is adapted from the companion notebook to Karpathy's [Zero To Hero](https://karpathy.ai/zero-to-hero.html) video on GPT. For more conceptual background, follow along with the blog post [Building a Transformer LLM with Code: Fundamental Transformer & GPT](https://www.yadavsaurabh.com/building-a-transformer-llm-with-code-fundamental-transformer-gpt/).

In [2]:
%pip install -q einops transformers


In [3]:
# We always start with a dataset to train on. Let's download the tiny shakespeare dataset
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2023-08-17 05:55:35--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2023-08-17 05:55:36 (129 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [4]:
# read it in to inspect it
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [5]:
print("length of dataset in characters: ", len(text))

length of dataset in characters:  1115394


In [6]:
# let's look at the first 1000 characters
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



# 1. Basic Tokenizer

In [7]:
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


In [8]:
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

print(encode("hii there"))
print(decode(encode("hii there")))

[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there
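
The character-level tokenizer only knows the characters that occur in the training text, so `encode` raises a `KeyError` for anything else. The sketch below illustrates this with a tiny stand-in corpus. The names `toy_text`, `toy_encode`, `toy_decode` and `can_encode` are hypothetical and are not part of the notebook; they are prefixed so that they do not overwrite `stoi`, `encode` or `decode`.

```python
# Stand-in corpus (assumption): the notebook builds its vocabulary from input.txt instead.
toy_text = "hii there"
toy_chars = sorted(set(toy_text))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for i, ch in enumerate(toy_chars)}

def toy_encode(s):
    # string -> list of integer ids
    return [toy_stoi[c] for c in s]

def toy_decode(ids):
    # list of integer ids -> string
    return "".join(toy_itos[i] for i in ids)

def can_encode(s):
    """Hypothetical helper: True if every character of s is in the vocabulary."""
    return all(c in toy_stoi for c in s)

round_trip = toy_decode(toy_encode("hi there"))  # all characters are known
ok_known = can_encode("hi there")
ok_unknown = can_encode("hello")                 # 'l' and 'o' never occur in toy_text
```

Checking with a helper like `can_encode` before calling the encoder is one way to avoid the `KeyError` at generation or evaluation time.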


# 2. Data Loader

In [9]:
# let's now encode the entire text dataset and store it into a torch.Tensor
import torch # we use PyTorch: https://pytorch.org
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:1000]) # the 1000 characters we looked at earlier will look like this to the GPT

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
      

In [10]:
# Let's now split up the data into train and validation sets
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

In [11]:
block_size = 8
train_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [12]:
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")

when input is tensor([18]) the target: 47
when input is tensor([18, 47]) the target: 56
when input is tensor([18, 47, 56]) the target: 57
when input is tensor([18, 47, 56, 57]) the target: 58
when input is tensor([18, 47, 56, 57, 58]) the target: 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58


In [13]:
torch.manual_seed(1337)
batch_size = 4 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?

def get_batch(split, device="cpu"):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x.to(device), y.to(device)

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])


# 3. Basic Bigram Model

In [14]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)

print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))


torch.Size([32, 65])
tensor(4.8786, grad_fn=<NllLossBackward0>)

Sr?qP-QWktXoL&jLDJgOLVz'RIoDqHdhsV&vLLxatjscMpwLERSPyao.qfzs$Ys$zF-w,;eEkzxjgCKFChs!iWW.ObzDnxA Ms$3
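
As a sanity check on the initial loss of 4.8786, it can be compared with the cross-entropy of a model that assigns equal probability to each of the 65 characters. The snippet below is a small worked calculation; `num_chars` and `uniform_loss` are illustrative names.

```python
import math

num_chars = 65  # vocabulary size printed by the tokenizer cell
uniform_loss = -math.log(1.0 / num_chars)  # cross-entropy of a uniform guess, i.e. ln(65)
print(f"{uniform_loss:.4f}")  # 4.1744
```

The embedding table starts from random values rather than zeros, so the untrained model is not uniform and its loss sits somewhat above this baseline.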


In [15]:
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

In [16]:
batch_size = 32
for steps in range(20000): # increase number of steps for good results...

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    if steps%1000==0: print(steps, loss.item())

print(loss.item())


0 4.704006195068359
1000 3.7031264305114746
2000 3.137178659439087
3000 2.776794672012329
4000 2.5844571590423584
5000 2.5105183124542236
6000 2.531585931777954
7000 2.504757881164551
8000 2.4696712493896484
9000 2.4838879108428955
10000 2.4183998107910156
11000 2.529956817626953
12000 2.379727363586426
13000 2.4651365280151367
14000 2.3533310890197754
15000 2.4624435901641846
16000 2.4509522914886475
17000 2.325801372528076
18000 2.4357123374938965
19000 2.493601083755493
2.4832067489624023


In [17]:
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=500)[0].tolist()))





QUCatheesicis bas, angomitouze chifomerawhe ty ag

Doone thesellande west, me fffol, whandiffe IAfithods misue, knild he I:
Whe! toudirer' My ayosbly louroura s m', uthos s reveprthoukerdi't avorure fotemowe.
Whamo es t, tstt g t RTRushy,
WAsbr spr my ou pl y,
Witoft at o s me,
Whabr'the Cicuomants awonte qungur thme wrar d parsupl by:
'sul ve ave,
Kconit ped bim; fam elathelch easutlll teye A d che'd, met its

IVo wnkn cave!
I thengr ts, IO t
Hoyoolove
ONCENo breppo onder t this r is.
I cken


# 4. Decoder based Transformer Model

## Self-Attention

In [18]:
B, T, head_size = 4, 8, 20  # batch, time, head_size
k = torch.randn(B, T, head_size)
q = torch.randn(B, T, head_size)
v = torch.randn(B, T, head_size)

# Attention calculation
attention = torch.einsum('b t h, b s h-> b t s', q, k) * head_size ** -0.5

print(attention.shape)
print(attention[0])

torch.Size([4, 8, 8])
tensor([[-1.4140, -1.4541,  1.2883, -0.8343, -0.7507,  0.2150,  0.2863,  1.2984],
        [-0.8598, -1.0200,  2.2931, -0.0266, -0.3606, -0.2385,  0.6361,  0.4431],
        [ 0.4401,  0.0087,  0.7046,  0.4918,  1.1250, -0.2419, -0.9671, -0.1771],
        [ 1.1277,  1.2113,  0.0667, -2.2026, -0.2301,  0.1449,  0.6227,  0.2465],
        [ 1.3303,  0.4084,  0.2546, -1.4567, -0.0959,  0.8121, -0.5556,  0.1851],
        [-2.9640, -1.2445, -0.3679,  1.2755, -0.2505,  0.9012, -0.6636,  0.2606],
        [-1.1415,  0.4146, -1.0367,  1.2819, -0.1212,  1.3331, -1.0537, -0.7873],
        [-1.1877, -1.2639,  0.5272,  0.5571, -0.7398,  0.0139, -0.9727,  1.1099]])


In [19]:
k.var()

tensor(1.0022)

In [20]:
q.var()

tensor(1.1228)

In [21]:
attention.var()

tensor(1.2978)

In [22]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)

tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])

In [23]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])*8, dim=-1) # gets too peaky, converges to one-hot

tensor([0.0326, 0.0030, 0.1615, 0.0030, 0.8000])

In [24]:
out = torch.einsum("b t s, b s h -> b t h", attention, v)
out.shape, out[0]

(torch.Size([4, 8, 20]),
 tensor([[ 1.3185,  0.2416, -2.1490,  1.5089,  1.3606,  1.6404, -3.7433,  0.7983,
           3.2315,  1.1444,  1.9813,  2.9230,  0.2450, -1.5903,  1.2641,  0.9344,
           1.3850,  1.1515, -4.9946, -4.6597],
         [ 3.7353,  4.3092, -2.1823,  5.1039,  0.2383,  4.5018, -2.6678,  0.1401,
           2.6685,  0.1990, -1.3295,  2.2029,  0.6242, -1.4831, -0.2833, -1.2358,
           4.2056,  1.4439, -1.6959, -4.7094],
         [ 5.0230,  1.6963, -3.0545, -1.0792, -3.2957, -2.3639,  0.8241, -0.7501,
          -2.2606, -0.2115, -4.5275, -0.0738, -2.8369,  1.8390, -1.6529, -3.2078,
           0.1043, -0.4969,  0.8913, -1.1536],
         [-4.0455,  1.5426,  1.8352, -0.4279,  5.5237, -0.2386,  2.5327,  0.2951,
          -2.5087, -5.2358,  0.9561,  2.0047,  1.0873, -5.0192, -1.3411,  2.0614,
           2.3928,  4.1955,  0.8336,  1.7550],
         [ 1.0922,  0.1419, -1.8042, -2.8209,  1.3728, -2.7551,  3.1434, -0.5888,
          -5.8389, -2.3750, -0.0225,  3.3857, -2.

Notes:

* Attention is a communication mechanism. It can be seen as nodes in a directed graph looking at each other and aggregating information, via a weighted sum with data-dependent weights, from all nodes that point to them.
* There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
* Each example across the batch dimension is processed completely independently; examples never "talk" to each other.
* "Scaled" attention additionally multiplies the attention scores by 1/sqrt(head_size). When the inputs Q and K have unit variance, the scores then have roughly unit variance too, so the softmax stays diffuse and does not saturate too much.
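
The effect of the 1/sqrt(head_size) factor can be checked numerically. The sketch below uses only the standard library: it draws random unit-variance vectors, takes their dot products, and compares the variance before and after scaling. The names `dim`, `n_pairs` and `variance` are illustrative choices, not taken from the notebook.

```python
import math
import random

random.seed(0)
dim = 16          # plays the role of head_size
n_pairs = 20000   # number of random (query, key) pairs

def variance(values):
    # population variance of a list of floats
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

raw_scores = []
for _ in range(n_pairs):
    q_vec = [random.gauss(0.0, 1.0) for _ in range(dim)]
    k_vec = [random.gauss(0.0, 1.0) for _ in range(dim)]
    raw_scores.append(sum(a * b for a, b in zip(q_vec, k_vec)))

scaled_scores = [s / math.sqrt(dim) for s in raw_scores]

var_raw = variance(raw_scores)        # close to dim
var_scaled = variance(scaled_scores)  # close to 1
```

Without scaling, the variance of the scores grows with the head size, which is what pushes the softmax towards a one-hot distribution.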

## Causal attention

In [25]:
# toy example illustrating how matrix multiplication can be used for a "weighted aggregation"
torch.manual_seed(42)
a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, 1, keepdim=True)
b = torch.randint(0,10,(3,2)).float()
c = a @ b
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)

a=
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
--
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
--
c=
tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])


In [26]:
# consider the following toy example:
B,T,C = 4,8,2 # batch, time, channels
x = torch.randn(B,T,C)
x.shape

torch.Size([4, 8, 2])

In [27]:
# version 1: using matrix multiply for a weighted aggregation
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow = torch.einsum("t s, b s c -> b t c", wei, x) # (T, T) @ (B, T, C) ----> (B, T, C)
wei, xbow.shape

(tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
         [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
         [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
         [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]]),
 torch.Size([4, 8, 2]))

In [28]:
# version 2: use Softmax
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
xbow2 = torch.einsum("t s, b s c -> b t c", wei, x)
torch.allclose(xbow, xbow2), wei

(True,
 tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
         [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
         [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
         [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
         [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]]))

In [29]:
# version 3: self-attention!
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

# let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)
wei =  torch.einsum('B T H, B S H-> B T S', q, k) * head_size ** -0.5 # S=T

tril = torch.tril(torch.ones(T, T))
#wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)

v = value(x)
out = torch.einsum("B T S, B S H-> B T H", wei, v)

out.shape

torch.Size([4, 8, 16])

In [30]:
wei[0]

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3966, 0.6034, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3069, 0.2892, 0.4039, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3233, 0.2175, 0.2443, 0.2149, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1479, 0.2034, 0.1663, 0.1455, 0.3369, 0.0000, 0.0000, 0.0000],
        [0.1259, 0.2490, 0.1324, 0.1062, 0.3141, 0.0724, 0.0000, 0.0000],
        [0.1598, 0.1990, 0.1140, 0.1125, 0.1418, 0.1669, 0.1061, 0.0000],
        [0.0845, 0.1197, 0.1078, 0.1537, 0.1086, 0.1146, 0.1558, 0.1553]],
       grad_fn=<SelectBackward0>)

Notes:
- Attention is a **communication mechanism**. It can be seen as nodes in a directed graph looking at each other and aggregating information, via a weighted sum with data-dependent weights, from all nodes that point to them.
- There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
- Each example across the batch dimension is processed completely independently; examples never "talk" to each other.
- In an "encoder" attention block, just delete the single line that does masking with `tril`, allowing all tokens to communicate. The block here is called a "decoder" attention block because it has triangular masking; it is usually used in autoregressive settings such as language modeling.
- "Self-attention" just means that the keys and values are produced from the same source as the queries. In "cross-attention", the queries are still produced from x, but the keys and values come from some other, external source (e.g. an encoder module).
- "Scaled" attention additionally multiplies `wei` by 1/sqrt(head_size). When the inputs Q and K have unit variance, `wei` then has roughly unit variance too, so the softmax stays diffuse and does not saturate too much. The two softmax examples earlier in this section illustrate the effect.
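
The cross-attention case mentioned in the notes can be sketched as follows. This is a minimal illustration under assumed shapes, not code from the notebook: `dec_x` stands in for the tokens producing queries, `enc_out` for an external source such as an encoder output, and all names and sizes are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
n_batch, t_dec, t_enc, n_chan, n_head_dim = 2, 5, 7, 32, 16  # illustrative sizes

dec_x = torch.randn(n_batch, t_dec, n_chan)    # source of the queries
enc_out = torch.randn(n_batch, t_enc, n_chan)  # external source of keys and values

to_q = nn.Linear(n_chan, n_head_dim, bias=False)
to_k = nn.Linear(n_chan, n_head_dim, bias=False)
to_v = nn.Linear(n_chan, n_head_dim, bias=False)

q_x = to_q(dec_x)    # (B, t_dec, H)
k_x = to_k(enc_out)  # (B, t_enc, H)
v_x = to_v(enc_out)  # (B, t_enc, H)

scores = q_x @ k_x.transpose(-2, -1) * n_head_dim ** -0.5  # (B, t_dec, t_enc)
cross_wei = F.softmax(scores, dim=-1)  # no tril mask: every source position is visible
cross_out = cross_wei @ v_x            # (B, t_dec, H)
```

The query and key/value sequences may have different lengths, so the weight matrix is no longer square and a triangular mask would not apply.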

In [31]:
class LayerNorm1d: # (used to be BatchNorm1d)

  def __init__(self, dim, eps=1e-5, momentum=0.1): # momentum is unused (left over from BatchNorm1d)
    self.eps = eps
    self.gamma = torch.ones(dim)
    self.beta = torch.zeros(dim)

  def __call__(self, x):
    # calculate the forward pass
    xmean = x.mean(1, keepdim=True) # per-example mean over the feature dimension
    xvar = x.var(1, keepdim=True) # per-example variance over the feature dimension (unbiased)
    xhat = (x - xmean) / torch.sqrt(xvar + self.eps) # normalize to unit variance
    self.out = self.gamma * xhat + self.beta
    return self.out

  def parameters(self):
    return [self.gamma, self.beta]

torch.manual_seed(1337)
module = LayerNorm1d(100)
x = torch.randn(32, 100) # batch size 32 of 100-dimensional vectors
x = module(x)
x.shape

torch.Size([32, 100])

In [32]:
x[:,0].mean(), x[:,0].std() # mean,std of one feature across all batch inputs

(tensor(0.1469), tensor(0.8803))

In [33]:
x[0,:].mean(), x[0,:].std() # mean,std of a single input from the batch, of its features

(tensor(-9.5367e-09), tensor(1.0000))
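
The row standard deviation above comes out as 1.0000 because `LayerNorm1d` normalises with `x.var`, which defaults to the unbiased estimator, and `std()` uses the same estimator. `nn.LayerNorm`, which the model below relies on, uses the biased estimator instead, so the two differ slightly. The comparison below is a sketch; `feats` and `manual_layer_norm` are illustrative names.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
feats = torch.randn(32, 100)  # 32 examples with 100 features each
eps = 1e-5

def manual_layer_norm(t, unbiased):
    # normalise every row to zero mean and unit variance
    mean = t.mean(1, keepdim=True)
    var = t.var(1, keepdim=True, unbiased=unbiased)
    return (t - mean) / torch.sqrt(var + eps)

ln_unbiased = manual_layer_norm(feats, unbiased=True)  # what LayerNorm1d computes
ln_biased = manual_layer_norm(feats, unbiased=False)   # biased variance, divides by n
ln_torch = F.layer_norm(feats, (100,), eps=eps)        # built-in layer norm, no affine

matches_biased = torch.allclose(ln_torch, ln_biased, atol=1e-5)
matches_unbiased = torch.allclose(ln_torch, ln_unbiased, atol=1e-5)
```

With 100 features the two variants differ by a factor of about sqrt(99/100), which is small but enough for an exact comparison against the built-in layer to fail.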

## Multi-Head Attention Model

In [34]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = 32 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.2
# ------------

torch.manual_seed(1337)

<torch._C.Generator at 0x7a014810d190>

In [35]:
from einops import rearrange

class MultiHeadAttention(nn.Module):
    """ multi head of self-attention """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.num_heads = num_heads
        self.head_size = head_size

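        # NOTE: each projection is Linear(head_size, head_size) and is applied to every head's
        # slice of the embedding, so the same weights are shared by all heads
        self.key = nn.Linear(head_size, head_size, bias=False)
        self.query = nn.Linear(head_size, head_size, bias=False)
        self.value = nn.Linear(head_size, head_size, bias=False)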
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout_wei = nn.Dropout(dropout)
        n_embd= num_heads*head_size
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout_proj = nn.Dropout(dropout)


    def forward(self, x):
        # Reshape the tensor to B N T H for N heads
        B,T,C = x.shape
        x = rearrange(x, 'B T (N H) -> B N T H', N=self.num_heads)

        k = self.key(x)   # (B,N,T,H)
        q = self.query(x) # (B,N,T,H)

        # compute attention scores
        wei = torch.einsum("BNTH, BNSH -> BNTS", q,k) * self.head_size**-0.5
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout_wei(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,N,T,H)
        out = torch.einsum("BNTS, BNSH -> BNTH", wei, v)

        # concat and mix N Heads
        out = rearrange(out, 'B N T H -> B T (N H)')
        out = self.dropout_proj(self.proj(out))
        return out

In [36]:
class FeedFoward(nn.Module):
    """ two linear layers with a ReLU non-linearity in between, followed by dropout """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x


class TransformerModel(nn.Module):

    def __init__(self):
        super().__init__()
        # token and position embedding tables, both of width n_embd
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

# Train and Evaluate

In [37]:
model = TransformerModel()
m = model.to(device)
m = torch.compile(m)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split, device=device)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out


# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)


0.163649 M parameters
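
The reported parameter count can be reproduced by hand from the hyperparameters and the layer definitions above. The arithmetic below is a worked check; the variable names are illustrative and the values are copied from the earlier cells (65 characters, `n_embd = 64`, `block_size = 32`, 4 heads, 4 layers).

```python
n_vocab, width, ctx_len, heads, layers = 65, 64, 32, 4, 4
head_dim = width // heads  # head_size = 16

tok_emb_params = n_vocab * width     # token embedding table
pos_emb_params = ctx_len * width     # position embedding table
qkv_params = 3 * head_dim * head_dim # key/query/value: Linear(16, 16) without bias
proj_params = width * width + width  # output projection with bias
ffwd_params = (width * 4 * width + 4 * width) + (4 * width * width + width)  # two Linear layers
norm_params = 2 * (2 * width)        # ln1 and ln2, weight and bias each
per_block = qkv_params + proj_params + ffwd_params + norm_params

final_norm_params = 2 * width          # ln_f
lm_head_params = width * n_vocab + n_vocab

total = (tok_emb_params + pos_emb_params + layers * per_block
         + final_norm_params + lm_head_params)
print(total)  # 163649
```

The `tril` buffer is registered with `register_buffer`, so it does not count as a parameter. Most of the budget sits in the feed-forward layers.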


In [38]:
if not m.training:
  m.train()
  print(m.training)

In [39]:

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train', device=device)

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()




step 0: train loss 4.3344, val loss 4.3391
step 100: train loss 2.7035, val loss 2.7273
step 200: train loss 2.5637, val loss 2.5650
step 300: train loss 2.5157, val loss 2.5195
step 400: train loss 2.4669, val loss 2.4771
step 500: train loss 2.4620, val loss 2.4711
step 600: train loss 2.4245, val loss 2.4269
step 700: train loss 2.3958, val loss 2.4109
step 800: train loss 2.3705, val loss 2.3790
step 900: train loss 2.3512, val loss 2.3633
step 1000: train loss 2.3198, val loss 2.3409
step 1100: train loss 2.3089, val loss 2.3301
step 1200: train loss 2.2932, val loss 2.3115
step 1300: train loss 2.2795, val loss 2.2951
step 1400: train loss 2.2657, val loss 2.2966
step 1500: train loss 2.2547, val loss 2.2747
step 1600: train loss 2.2452, val loss 2.2647
step 1700: train loss 2.2319, val loss 2.2546
step 1800: train loss 2.2193, val loss 2.2295
step 1900: train loss 2.2036, val loss 2.2205
step 2000: train loss 2.1882, val loss 2.2229
step 2100: train loss 2.1827, val loss 2.2179


In [40]:
# generate from the model
m.eval()
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))


Gef tiche sovour'de her theill, be! Thy row his you ferly ear Camughan badencel,
Nent may his lay with he now contronter's critcess.

Fertly an he. Betillour Camess mards,
This and Macen lead Rimpans:
Aat, moy, staced your mord's gran the fas atI. Whosell steselfore!

PERCLIUTER INAN:
My you had I rive groked,
That to my not I the tranows; Madend hear,
These of fath, greal, my herim ding, loakie our of lasent?

GLORGHPHENIO:

My,'ll is a lord.

MENENTES:
That My shou shoust hall! be died Reare, the lown. Vans, Wilmle, ge hefor you deck in a cay her you grood usintason fer grouder;
I'll spinced fom's Thim.

LEh's I det, I Caven if you dothers,
In deaking his your with card dioodser, Beatul of dock fascesing.

LUCELLIO:
Wich Teare he comany, though a more hav beardences:
The comintle?
 be love fief's me of word Rint;
Cally frokes Bouden:
The shille he there not to ance ut frote:
Man sown's goth
poule, zimortre;
For do mines; entre hart ippling awith fast reentch
facithlur shou his sompe

# Optimisations

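## PreNorm

The `Block` defined above already follows the pre-norm layout: `LayerNorm` is applied to the input of the attention and feed-forward sub-layers (`x + self.sa(self.ln1(x))`), not to their output.

## Weight Tying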

In [41]:
class TransformerModel(nn.Module):
    def __init__(self):
        super().__init__()
        # token and position embedding tables, both of width n_embd
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

        # https://paperswithcode.com/method/weight-tying
        self.token_embedding_table.weight = self.lm_head.weight

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
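
Weight tying works here because `nn.Embedding(vocab_size, n_embd)` and `nn.Linear(n_embd, vocab_size)` both store a weight of shape `(vocab_size, n_embd)`. The sketch below checks the effect on a small stand-alone module; `TinyTied` is a hypothetical class written only for this check and is not part of the model.

```python
import torch
import torch.nn as nn

class TinyTied(nn.Module):
    """Hypothetical two-layer module, used only to inspect weight tying."""

    def __init__(self, vocab, dim):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)  # weight shape (vocab, dim)
        self.head = nn.Linear(dim, vocab)    # weight shape (vocab, dim)
        self.emb.weight = self.head.weight   # same tying direction as in the model

tied = TinyTied(vocab=65, dim=64)
same_tensor = tied.emb.weight is tied.head.weight
# parameters() yields a shared tensor only once, so the tied weight is counted a single time
n_params = sum(p.numel() for p in tied.parameters())
```

Since the embedding weight is replaced by the linear layer's weight, the shared matrix carries the initialisation of `nn.Linear` rather than that of `nn.Embedding`. For the model above, tying should remove the 65 × 64 = 4160 embedding parameters from the count.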

## Flash Attention and Merged QKV Computation

In [49]:
from einops import rearrange

class MultiHeadAttention(nn.Module):
    """ multi head of self-attention """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.num_heads = num_heads
        self.head_size = head_size

        self.qkv = nn.Linear(head_size, 3*head_size, bias=False) # stacked QKV
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout_wei = nn.Dropout(dropout)
        n_embd= num_heads*head_size
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout_proj = nn.Dropout(dropout)


    def forward(self, x):
        # Reshape the tensor to B N T H for N heads
        B,T,C = x.shape
        x = rearrange(x, 'B T (N H) -> B N T H', N=self.num_heads)

        # k = self.key(x)   # (B,N,T,H)
        # q = self.query(x) # (B,N,T,H)
        # v = self.value(x) # (B,N,T,H)
        q,k,v = rearrange(self.qkv(x), "B N T (n H)-> n B N T H", n=3)    #(B,N,T,3H) - > 3 B N T H

        # compute attention scores
        # wei = torch.einsum("BNTH, BNSH -> BNTS", q,k) * self.head_size**-0.5
        # wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        # wei = F.softmax(wei, dim=-1)
        # wei = self.dropout_wei(wei)
        # # perform the weighted aggregation of the values
        # out = torch.einsum("BNTS, BNSH -> BNTH", wei, v)
        ## Flash attention
        out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=None, dropout_p=dropout if self.training else 0, is_causal=True)

        # concat and mix N Heads
        out = rearrange(out, 'B N T H -> B T (N H)')
        out = self.dropout_proj(self.proj(out))
        return out
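
As a sanity check on the swap above, the commented-out manual path and the fused call can be compared directly. This is a minimal sketch assuming PyTorch 2.0 or newer (where `scaled_dot_product_attention` is available); the tensor sizes are hypothetical and dropout is disabled so both paths are deterministic.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, N, T, H = 2, 4, 8, 16  # hypothetical batch, heads, time steps, head size
q, k, v = (torch.randn(B, N, T, H) for _ in range(3))

# Manual causal attention, mirroring the commented-out einsum path
wei = torch.einsum("bnth,bnsh->bnts", q, k) * H ** -0.5
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
wei = wei.masked_fill(~mask, float("-inf"))
wei = F.softmax(wei, dim=-1)
manual = torch.einsum("bnts,bnsh->bnth", wei, v)

# Fused attention: the causal mask and the 1/sqrt(H) scaling are applied internally
fused = F.scaled_dot_product_attention(q, k, v, dropout_p=0.0, is_causal=True)

print(torch.allclose(manual, fused, atol=1e-5))
```

The fused call never needs the explicit `tril` buffer, and on supported GPUs PyTorch can dispatch it to a flash-attention kernel that avoids materialising the full (T, T) attention matrix.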

## Tokenizer
Let's use a sub-word tokenizer in place of character-level tokenization.

In [43]:
len(set(text.split()))

25670

In [44]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# model = GPT2LMHeadModel.from_pretrained('gpt2')
# Inspect the tokenizer configuration; the following cells use it to encode text into token ids.
tokenizer

GPT2Tokenizer(name_or_path='gpt2', vocab_size=50257, model_max_length=1024, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'eos_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True), 'unk_token': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True)}, clean_up_tokenization_spaces=True)

In [45]:
input_text = "Your input text goes here."
input_ids = tokenizer.encode(input_text)
print(input_ids)
print(tokenizer.batch_decode(input_ids))

[7120, 5128, 2420, 2925, 994, 13]
['Your', ' input', ' text', ' goes', ' here', '.']


In [46]:
tokenised_text=tokenizer.encode(text)
vocab = sorted(list(set(tokenised_text)))
vocab_size = len(vocab)
print(tokenizer.decode(vocab[:100]))
print(vocab_size)

Token indices sequence length is longer than the specified maximum sequence length for this model (338025 > 1024). Running this sequence through the model will result in indexing errors


!$',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWYZabcdefghijklmnoprstuvwxyz
  t aheinreon theer sat w oen citisanores bed fing pou analar to m of in d h andicasle
11706


In [47]:
# create a mapping from the GPT-2 token ids that occur in the text to a compact range 0..vocab_size-1
stoi = { t:i for i,t in enumerate(vocab) }
itos = { i:t for i,t in enumerate(vocab) }
encode = lambda s: [stoi[t] for t in tokenizer.encode(s)] # encoder: take a string, output a list of integers
decode = lambda l: tokenizer.decode([itos[i] for i in l]) # decoder: take a list of integers, output a string

print(encode("hii there"))
print(decode(encode("hii there")))

[42, 2717, 393]
hii there
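
The remapping above shrinks the embedding table from GPT-2's 50257 entries to the 11706 ids that actually occur in the text. The sketch below shows the same idea in isolation; the helper name `build_remap` and the sample ids are hypothetical, and no tokenizer is needed to run it.

```python
def build_remap(token_ids):
    """Map the distinct ids that occur in a corpus onto 0..V-1."""
    vocab = sorted(set(token_ids))
    stoi = {t: i for i, t in enumerate(vocab)}  # original id -> compact index
    itos = {i: t for i, t in enumerate(vocab)}  # compact index -> original id
    return vocab, stoi, itos

corpus_ids = [7120, 5128, 2420, 7120, 13, 994]  # hypothetical GPT-2 token ids
vocab, stoi, itos = build_remap(corpus_ids)

dense = [stoi[t] for t in corpus_ids]   # compact ids fed to the model
restored = [itos[i] for i in dense]     # original ids, ready for tokenizer.decode

print(dense)
print(restored == corpus_ids)
```

One consequence of this design: any token id that never occurred in the training text has no entry in `stoi`, so encoding a new string that produces such an id raises a `KeyError`.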


### Training with Optimisations

In [54]:
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

In [55]:
model = TransformerModel()
m = model.to(device)
m = torch.compile(m)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train', device=device)

    # evaluate the loss (call the compiled module `m` so that torch.compile is actually used)
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

0.916154 M parameters
step 0: train loss 9.5608, val loss 9.5581
step 100: train loss 6.4186, val loss 6.4983
step 200: train loss 6.3865, val loss 6.4410
step 300: train loss 6.3548, val loss 6.4353
step 400: train loss 6.2841, val loss 6.3581
step 500: train loss 6.0196, val loss 6.1388
step 600: train loss 5.8834, val loss 6.0358
step 700: train loss 5.7925, val loss 5.9063
step 800: train loss 5.6809, val loss 5.8681
step 900: train loss 5.6291, val loss 5.7854
step 1000: train loss 5.5819, val loss 5.7384
step 1100: train loss 5.4503, val loss 5.6572
step 1200: train loss 5.3920, val loss 5.6034
step 1300: train loss 5.3024, val loss 5.5707
step 1400: train loss 5.2503, val loss 5.5114
step 1500: train loss 5.1651, val loss 5.4456
step 1600: train loss 5.1322, val loss 5.3905
step 1700: train loss 5.0726, val loss 5.3987
step 1800: train loss 5.0042, val loss 5.3218
step 1900: train loss 4.9959, val loss 5.3336
step 2000: train loss 4.9146, val loss 5.3330
step 2100: train loss 4.
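
A quick way to read the step-0 numbers in this log: a model that spreads probability uniformly over all tokens has a cross-entropy of ln(vocab_size). The arithmetic below uses only values printed earlier in the notebook (the vocabulary size of 11706 and the step-0 train loss).

```python
import math

vocab_size = 11706        # printed by the tokenizer cell above
observed_step0 = 9.5608   # train loss at step 0 in the log above

# Cross-entropy of a uniform prediction over vocab_size classes
uniform_loss = math.log(vocab_size)

print(f"uniform baseline: {uniform_loss:.4f}")
print(f"gap at step 0:    {observed_step0 - uniform_loss:.4f}")
```

The observed starting loss sits slightly above this baseline, which is plausibly because randomly initialised logits are not exactly uniform. Any loss well below the baseline indicates the model has learned something about the token distribution.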

In [56]:
# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=1000)[0].tolist()))

!


WARWOW:
The poking-aw son must have reconcilbeler like thee myvul
A school-made seven line of the world!

KING ED III:
Sometime, our ears have got me
They temporal something tick all his voice in the block
Against the store against thy middle
His sword; weapons, but wanting; by famous-morrow,
That sword to threaten the warranted could my conquest of your affair,
To leave by alike to'sake.
Of love that after the gone reckoning:
Your city; Aufethath thereforeFame sit your words,
If far so return obed of the eye of a
 tend but discover, ever have not love. Mark Bol Place

CAPULET:
Were honour bleok down suspicion!
What heavens hears: nor England, awake you did adoreest he dENT craft.
Mine event in yither for earth it had see.

KING HENRYUS:
The dost: then that upon thyDeclither: the dire out
And with much whom you faded. What gentleman that made thine true,

W begging and in being put your hard obstruction or told
As rareought night, great; integrityed the sanctuary;
As what must be r

### Training Tokenizer from Scratch

In [None]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer


tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

In [None]:
tokenizer.train_from_iterator(text.split("\n"), trainer=trainer)

In [None]:
print(tokenizer.encode(text.split("\n")[1]).tokens)

['Before', 'we', 'proceed', 'any', 'further', ',', 'hear', 'me', 'speak', '.']


In [None]:
tokenizer.get_vocab_size()

18151
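
To give a feel for what BPE training does, here is a toy version of its merge loop in pure Python. It is a simplified illustration under stated assumptions, not the `tokenizers` implementation: the word counts are made up, the helper names `most_frequent_pair` and `merge_pair` are hypothetical, and details such as pre-tokenization, special tokens and vocabulary-size limits are left out.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over all words and return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical word frequencies; each word starts as a tuple of characters
words = {tuple("hear"): 3, tuple("heard"): 2, tuple("speak"): 1}

merges = []
for _ in range(3):
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(words)
```

Each merge adds one new symbol to the vocabulary, so frequent words such as `hear` end up as a single token while rare words stay split into smaller pieces.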

# What's Next?

Several advancements have been made to improve the capabilities and efficiency of transformer models:
1. **Integrated Positional embeddings** - Fixed, Relative PE (T5), RoPE (LLaMA), ALiBi
2. **Attention optimisations** - Efficient attention mechanisms such as Sparse Attention (BigBird), FAVOR+ (Performer), Multi-Query Attention, and sliding-window attention (Longformer).
3. **Feedforward Network** - Optimisations such as the use of CNNs and routing mechanisms
4. **Training Data** - Proper preprocessing and cleaning of training datasets
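
As one concrete illustration of the attention optimisations listed above, a sliding-window mask restricts each position to a fixed number of recent positions. The sketch below builds a causal variant suited to a decoder like the one in this notebook; it is not Longformer's exact pattern (which also includes global attention), and the function name and `window` parameter are hypothetical.

```python
def sliding_window_causal_mask(T, window):
    """mask[t][s] is True when position t may attend to position s.

    Position t sees itself and at most window - 1 positions before it.
    """
    return [[(s <= t) and (t - s < window) for s in range(T)] for t in range(T)]

mask = sliding_window_causal_mask(T=5, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With a fixed window, the number of attended positions per token stays constant as the sequence grows, instead of growing with the sequence length as in the full causal mask.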

Next, we will take a look at improvements related to positional embeddings.