### Downloading the data

In [24]:
!wget https://www.gutenberg.org/files/11/11-0.txt

--2024-10-12 22:48:33--  https://www.gutenberg.org/files/11/11-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 2610:28:3090:3000:0:bad:cafe:47, 152.19.134.47
Connecting to www.gutenberg.org (www.gutenberg.org)|2610:28:3090:3000:0:bad:cafe:47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 154638 (151K) [text/plain]
Saving to: ‘11-0.txt.1’


2024-10-12 22:48:33 (1.46 MB/s) - ‘11-0.txt.1’ saved [154638/154638]



Reading the data

In [1]:
text = open("./11-0.txt", 'r', encoding='UTF-8').read()[664:]

In [2]:
print(text[:1000])

CHAPTER I.
Down the Rabbit-Hole


Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into
the book her sister was reading, but it had no pictures or
conversations in it, “and what is the use of a book,” thought Alice
“without pictures or conversations?”

So she was considering in her own mind (as well as she could, for the
hot day made her feel very sleepy and stupid), whether the pleasure of
making a daisy-chain would be worth the trouble of getting up and
picking the daisies, when suddenly a White Rabbit with pink eyes ran
close by her.

There was nothing so _very_ remarkable in that; nor did Alice think it
so _very_ much out of the way to hear the Rabbit say to itself, “Oh
dear! Oh dear! I shall be late!” (when she thought it over afterwards,
it occurred to her that she ought to have wondered at this, but at the
time it all seemed quite natural); but when the Rabbit actually _took a
watch out of its w

In [3]:
chars = "". join(sorted(list(set(text))))
print(f"Vocab Size: {len(chars)}")

# Encoding and Decoding Mapping
stoi = { ch: i for i, ch in enumerate(chars) }
itos = { i:ch for i, ch in enumerate(chars) }

# Encoding and Decoding Lambdas
encode = lambda x: [stoi[ch] for ch in x]
decode = lambda x: "".join([itos[token] for token in x])

Vocab Size: 74


In [4]:
# An Example of the tokenization
example = "Hello! My name is Siddhant Singh Karki"
tokens = encode(example)
print(f"Encoded Tokens: {tokens}")
print(f"Decoded Tokens: {decode(tokens)}")

Encoded Tokens: [20, 46, 53, 53, 56, 2, 1, 25, 66, 1, 55, 42, 54, 46, 1, 50, 60, 1, 31, 50, 45, 45, 49, 42, 55, 61, 1, 31, 50, 55, 48, 49, 1, 23, 42, 59, 52, 50]
Decoded Tokens: Hello! My name is Siddhant Singh Karki


### What we want is the  ```(n+1)th``` sequence is a label for ```nth``` sequece 

In [5]:
samples = 5
curr_list = [j for j in range(samples + 1)]
for idx, idy in zip(curr_list, curr_list[1:]):
    print(f"{text[:idx+1]} --> {text[idy]}")
    print("*" * 20)    

C --> H
********************
CH --> A
********************
CHA --> P
********************
CHAP --> T
********************
CHAPT --> E
********************


In [6]:
import torch
import torch.nn as nn
from torch.nn import functional as F

import tqdm


torch.manual_seed(42);

In [7]:
data = torch.tensor(encode(text), dtype=torch.long)
split_size = 0.9
n = int(split_size * len(data))
train_set = data[:n]
val_set = data[n:]

In [96]:
batch_size = 8
block_size = 16
vocab_size = len(chars)
epochs = 1000
log_iter = 100
eval_ter = 200
learning_rate = 1e-3
n_embd = 64
num_heads = 8
n_blocks = 8
dropout = 0.2
device = "cuda" if torch.cuda.is_available() else "cpu"
# device = "cpu"

In [97]:
# Creating a dataloader from scratch
def get_batches(split):
    batch_data = train_set if split == "train" else val_set
    idxs = torch.randint(len(data) - block_size - 1, (batch_size, ))
    x = torch.stack([data[i:block_size+i] for i in idxs])
    y = torch.stack([data[i+1:block_size+i+1] for i in idxs])
    x, y = x.to(device), y.to(device)
    return x, y

In [98]:
x, y = get_batches("train")

In [99]:
print(f"X shape: {x.shape}")
print(f"y shape: {y.shape}")

X shape: torch.Size([8, 16])
y shape: torch.Size([8, 16])


## Model Code goes here

## Why do we need a Linear Projection in MultiHeadAttention?
- Concatenation of the individual head outputs
- Learning Richer Represntation: Even if the dim matches the embedding layer, The projection layer plays an important role in learning a combination of the outputs from different heads.
- The proj. layer allows the model to lean how to optimally mix the information from each head.

## Why do we need a Feed Forward Layer?
- Feature Transformation: The FF Layer processe each token independently and applied two linear transformations with an intermediate activation function.
- Enhancing Non-Linearity: SA mechanism in a transformer is Linear, which means it can capture relationships between tokens but might not introduce enough non-linearity. FF layer adds a non-linear transformation through an activation function which allows the model to capture complex patterns and relationships.

## Why add LayerNorm?
- Below is the training statistics without layer norm (Exploding gradients)
- `step 0: train loss 4.7718, val loss 4.7709`
-`step 100: train loss 2.5493, val loss 2.5492`
-`step 200: train loss 60.1986, val loss 59.9926`
-`step 300: train loss 558319.7500, val loss 557347.4375`

- With Layer Norm:
- `step 0: train loss 4.4536, val loss 4.4521
step 100: train loss 2.6987, val loss 2.7117
step 200: train loss 2.5321, val loss 2.5077
step 300: train loss 2.4262, val loss 2.4186
step 400: train loss 2.3479, val loss 2.3534
step 500: train loss 2.2614, val loss 2.2863
step 600: train loss 2.2484, val loss 2.2344
step 700: train loss 2.2003, val loss 2.1847
step 800: train loss 2.1532, val loss 2.1447
step 900: train loss 2.1368, val loss 2.1196
step 999: train loss 2.1124, val loss 2.0768`

In [100]:
class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)
        q = self.query(x)
        # Calculating the affinities
        wei = (q @ k.transpose(-2, -1)) * k.shape[-1]**0.5
        wei = wei.masked_fill(self.tril[:T,:T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)

        v = self.value(x)
        out = wei @ v
        return out

# Till MultiHeadAttention we're still extracting feature (how are tokens reacting to each other
# How much of "Attention" is one token emmitting to the other
class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.proj(out)
        out = self.dropout(out)
        return out

class FeedForward(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )
        
    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.multi_head = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        ''' 
            We fork off and do some communication and come back.
            Doing so adds gradients equally up to the original input during back propogation.
            Addtionally distributes gradient equally to both the branches.
            Gradients hop to every additional node.
            There is a gradient superhighway .
            In the beginning they contribute very less.
            But during optimization, then the block over time kick in
            Original paper does this a bit differently 
        '''
        x = x + self.multi_head(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x
        

class GPT2Transformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding_table = nn.Embedding(vocab_size, n_embd)
        self.pos_embedding_table = nn.Embedding(block_size, n_embd)
        
        self.blocks = nn.Sequential(
            *[Block(n_embd, num_heads) for _ in range(n_blocks)]
        )
        self.ln_f = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        # pluck out the learned token embedding vectors for each token
        token_embds = self.embedding_table(idx)
        # pluck out the learned position embedding vectors for each token
        pos_emb = self.pos_embedding_table(torch.arange(T, device=device))
        x = token_embds + pos_emb
        x = self.blocks(x)
        x = self.ln_f(x)
        
        logits = self.lm_head(x)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        return loss, logits 
    
    def generate(self, idx, max_tokens):
        for _ in range(max_tokens):
            idx_cropped = idx[:, -block_size:]
            loss, logits = self(idx_cropped)
            logits = logits[:,-1,:]
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1) # concat along the time dimension
        return idx

# Model Code ends here

In [101]:
model = GPT2Transformer()
model = model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

## Training Loop

In [102]:
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ["train", "val"]:
        losses = torch.zeros(eval_ter)
        for k in range(eval_ter):
            X, Y = get_batches(split)
            loss, logits = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

for iter in range(epochs):
    if iter % log_iter == 0 or iter == epochs - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses["train"]:0.4f}, val loss {losses['val']:.4f}")
    xb, yb = get_batches("train")
    loss, logits = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

step 0: train loss 4.5045, val loss 4.5102


KeyboardInterrupt: 

In [84]:
context = torch.zeros((1,1), dtype=torch.long, device=device)
output = model.generate(context, max_tokens=10000)[0].tolist()

In [94]:
print(decode(output))


they that knat, (as and whep dead still contis
dowea. “Ick chough: add sand she knars look of nold book witt less corm a fanded as al a
tain’t a knory hard the went,” said the poed, about all said in tand they _long ppeald as she put gooden and gating the said think! Would belf,
fleeps dine, are all her taking for dra a geft one the chance. “Non yes, and thing sheas a know chagn dend butle bit Alice_, whif after as? sholk I wist nrad that por till: what ste
ot_treton: one gamet dinked,” said the DTurble.
 “—You _. I’ve
conted, said put a mong all the hat be was, go and all a
going think, and,” said Alice wat thim it o take I at then, hizs
alh, fus in as to mind: and no a they fats: all I dill_, with undennat hre went for fut a lost all all at lookmy
now ,” Alice Tharpiar cumpling sol
of,” the clakily
to was mut poop on.”

would wis to the hea? Alif all,” The Furt was
staid Alich, and she ind—for they le. “It wo” dack cound an to cone, and
ree as beg, on.

“I fist litttge corthpat coki

# Masked Self Attention

In [257]:
X = torch.randn((batch_size, block_size, n_embd))
y = torch.randn((batch_size, block_size, n_embd))

In [258]:
key = nn.Linear(n_embd, head_size, bias=False)
query = nn.Linear(n_embd, head_size, bias=False)
value = nn.Linear(n_embd, head_size, bias=False)

k = key(X)
q = query(X)
v = value(X)

In [259]:
x.shape

torch.Size([8, 8])

In [260]:
# (B, T, C) @ (B, C, T)
wei = (q @ k.transpose(-2, -1)) * k.shape[0]**-0.5
mask = torch.tril(torch.ones(batch_size, block_size))

In [261]:
wei.var()

tensor(0.1653, grad_fn=<VarBackward0>)

In [262]:
wei = wei.masked_fill(mask == 0, float('-inf'))

In [263]:
wei = F.softmax(wei, dim=-1)

In [264]:
wei[0]

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.6074, 0.3926, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2696, 0.3481, 0.3823, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1844, 0.2523, 0.2507, 0.3126, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1398, 0.2291, 0.2481, 0.2378, 0.1453, 0.0000, 0.0000, 0.0000],
        [0.1095, 0.1839, 0.1945, 0.1197, 0.2366, 0.1557, 0.0000, 0.0000],
        [0.1638, 0.1249, 0.0948, 0.1708, 0.1656, 0.1554, 0.1248, 0.0000],
        [0.1059, 0.1645, 0.1684, 0.1861, 0.0689, 0.0699, 0.1075, 0.1289]],
       grad_fn=<SelectBackward0>)

In [265]:
v[0]

tensor([[-0.4090, -0.0844,  0.4595,  0.0465,  0.5468,  1.1730, -0.3282, -0.3627,
          0.1783,  0.0353, -0.2119, -0.3698,  0.4134,  0.2859,  0.2247, -0.6131],
        [-0.7410,  0.0779, -0.2925,  0.2071, -0.5912,  0.3608, -0.0498,  0.7420,
         -0.2828, -0.3913, -0.2852, -0.9858, -0.2219, -0.2099, -0.1447,  0.2150],
        [ 0.3926,  0.2877,  0.2059, -0.3759,  0.1268, -0.2432, -0.4747,  0.4679,
          0.6460,  0.7017, -0.6274,  0.3068,  0.0086, -0.1442, -0.3745,  0.2156],
        [-0.6158,  0.9428,  0.1268,  0.2106,  0.1436,  0.0402,  0.0900, -0.2539,
          0.5076,  0.9316,  1.0298, -0.8456, -0.8128,  0.4148, -1.2114, -0.4425],
        [ 0.6647, -0.5455,  0.5326, -0.9780, -0.2174,  0.9473, -0.1027,  0.8601,
          1.3140,  0.1001,  0.1058,  0.4890,  0.2137, -0.1716,  0.4304,  0.1098],
        [-1.0986, -0.0348,  0.3473, -0.4341,  0.2286,  1.0323, -0.3086,  0.1645,
          0.2484,  0.4556, -0.4360,  0.8109, -0.4500, -0.1652, -0.0652,  0.6274],
        [-0.0714,  0.1

# Each one of these attention values communicate only backward, not in the future (decoder)

In [266]:
wei[0]@v[0]

tensor([[-0.4090, -0.0844,  0.4595,  0.0465,  0.5468,  1.1730, -0.3282, -0.3627,
          0.1783,  0.0353, -0.2119, -0.3698,  0.4134,  0.2859,  0.2247, -0.6131],
        [-0.5393, -0.0206,  0.1643,  0.1095,  0.1000,  0.8542, -0.2189,  0.0710,
         -0.0027, -0.1321, -0.2407, -0.6116,  0.1640,  0.0913,  0.0797, -0.2881],
        [-0.2181,  0.1144,  0.1008, -0.0591, -0.0099,  0.3489, -0.2873,  0.3394,
          0.1966,  0.1416, -0.3963, -0.3255,  0.0375, -0.0511, -0.1329, -0.0081],
        [-0.3565,  0.3709,  0.1022,  0.0324,  0.0284,  0.2589, -0.1639,  0.1582,
          0.2822,  0.3749,  0.0536, -0.5043, -0.2317,  0.0933, -0.4676, -0.1431],
        [-0.1794,  0.2224,  0.1558, -0.1313, -0.0250,  0.3334, -0.1686,  0.3000,
          0.4320,  0.3254,  0.0096, -0.3314, -0.1531,  0.0298, -0.3202, -0.0722],
        [-0.1921,  0.0394,  0.2319, -0.3038, -0.0228,  0.5372, -0.1990,  0.3865,
          0.5036,  0.2746, -0.1173, -0.0213, -0.1107, -0.0520, -0.1282,  0.0850],
        [-0.2971,  0.1