## Loading our Dataset

In [None]:
!wget http://static.decontextualize.com/gutenberg-poetry-v001.ndjson.gz
!gunzip gutenberg-poetry-v001.ndjson.gz
!jq -r '.s' gutenberg-poetry-v001.ndjson > poems.txt

--2025-04-17 16:59:15--  http://static.decontextualize.com/gutenberg-poetry-v001.ndjson.gz
Resolving static.decontextualize.com (static.decontextualize.com)... 207.244.116.232
Connecting to static.decontextualize.com (static.decontextualize.com)|207.244.116.232|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://static.decontextualize.com/gutenberg-poetry-v001.ndjson.gz [following]
--2025-04-17 16:59:15--  https://static.decontextualize.com/gutenberg-poetry-v001.ndjson.gz
Connecting to static.decontextualize.com (static.decontextualize.com)|207.244.116.232|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 54837372 (52M) [application/x-gzip]
Saving to: ‘gutenberg-poetry-v001.ndjson.gz’


2025-04-17 16:59:17 (37.0 MB/s) - ‘gutenberg-poetry-v001.ndjson.gz’ saved [54837372/54837372]

gzip: gutenberg-poetry-v001.ndjson already exists; do you wish to overwrite (y or n)? n
	not overwritten
/bin/bash: line 1: jq: command not

In [None]:
import json

with open('gutenberg-poetry-v001.ndjson', encoding='utf-8') as fin, \
     open('poems.txt', 'w', encoding='utf-8') as fout:
    for line in fin:
        poem = json.loads(line).get('s', '')
        fout.write(poem + '\n')

In [None]:
with open('poems.txt', 'r', encoding = 'utf-8') as f:
  text = f.read()

In [None]:
print(text[:1000])

The Song of Hiawatha is based on the legends and stories of
many North American Indian tribes, but especially those of the
Ojibway Indians of northern Michigan, Wisconsin, and Minnesota.
They were collected by Henry Rowe Schoolcraft, the reknowned
Schoolcraft married Jane, O-bah-bahm-wawa-ge-zhe-go-qua (The
fur trader, and O-shau-gus-coday-way-qua (The Woman of the Green
Prairie), who was a daughter of Waub-o-jeeg (The White Fisher),
who was Chief of the Ojibway tribe at La Pointe, Wisconsin.
Jane and her mother are credited with having researched,
authenticated, and compiled much of the material Schoolcraft
included in his Algic Researches (1839) and a revision published
in 1856 as The Myth of Hiawatha.  It was this latter revision
that Longfellow used as the basis for The Song of Hiawatha.
Longfellow began Hiawatha on June 25, 1854, he completed it
soon as the poem was  published its popularity was assured.
However, it also was severely criticized as a plagiary of the
Finnish epic po

The corpus is about 121 million characters long. Since we tokenize at the character level, that is also the number of tokens.

In [None]:
# Let's get the unique characters in our text, in sorted order
chars = sorted(list(set(text)))
vocab_size = len(chars)

print("".join(chars))

	
 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡£¤¦§©ª«­®°´¶·º»¼½¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÓÔÖ×ÚÜÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿāăĒēĕěħĩīĭŌōŏŒœŚśũūŭƚǣǹǽȜȝȳ̷̄̆̓;΄Ά·ΈΌΐΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΩάέήίαβγδεζηθικλμνξοπρςστυφχψωϊϋόύώϑϕḍṅṇṛṣṭẽἀἁἂἄἅἆἈἉἌἐἑἒἔἕἘἙἜἝἠἡἢἣἤἥἦἧἨἩἫἰἱἲἳἴἵἶἷἸἹἼἽὀὁὂὃὄὅὈὉὍὐὑὒὔὕὖὗὙὠὡὢὣὤὥὦὧὨὩὫὬὭὰάὲέὴήὶίὸόὺύὼώᾐᾗᾧᾳᾴᾶ᾽ῂῃῆῇΐῖῡῤῥῦῬῳῶῷ–—‖‘’“”„†‡…‧⁂─☞


The vocabulary contains uppercase and lowercase Latin letters, digits, punctuation, and many special characters, including accented letters and Greek.

### Building Decoder and Encoder

In [None]:
stoi = {ch: i for i,ch in enumerate(chars)}
itos = {i: ch for i,ch in enumerate(chars)}

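def encode(s):
  # map each character of the string to its integer id
  # (the parameter is named `s` so it does not shadow the global `chars`)
  return [stoi[ch] for ch in s]


def decode(indices):
  # map each integer id back to its character and join into a string
  return "".join(itos[i] for i in indices)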

In [None]:
print(encode("Hello there"))

[45, 74, 81, 81, 84, 5, 89, 77, 74, 87, 74]


In [None]:
print(decode([45, 74, 81, 81, 84, 5, 89, 77, 74, 87, 74]))

Hello there
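
To check the mapping without loading the full corpus, here is a minimal sketch on a toy string. The names `toy_text`, `toy_stoi`, `toy_itos`, `toy_encode` and `toy_decode` are illustrative and not part of the notebook. The ids differ from the ones printed above because the toy vocabulary is much smaller.

```python
# Toy character-level tokenizer; `toy_text` stands in for the poetry corpus.
toy_text = "hello there"
toy_chars = sorted(set(toy_text))  # unique characters, sorted
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for i, ch in enumerate(toy_chars)}


def toy_encode(s):
    # string -> list of integer ids
    return [toy_stoi[ch] for ch in s]


def toy_decode(indices):
    # list of integer ids -> string
    return "".join(toy_itos[i] for i in indices)


ids = toy_encode("hello")
print(toy_chars)
print(ids)
print(toy_decode(ids))
```

Encoding a character that never appeared in the text raises a `KeyError`, since the lookup table only covers characters seen when the vocabulary was built.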


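There is a trade-off: a small vocabulary gives long sequences, while a large vocabulary gives short sequences.

We are currently using a small, character-level vocabulary (`vocab_size` unique characters, a few hundred for this corpus), which means each input is encoded as a long sequence of integers.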
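
The trade-off can be seen on a small made-up sample. This sketch compares character-level tokens with whitespace-split words; the sample text and all variable names are assumptions for illustration, and whitespace splitting is only a stand-in for a real word or subword tokenizer.

```python
# Made-up sample text (not from the corpus), 30 distinct words.
sample = (
    "the river runs beneath a silver moon "
    "while quiet owls keep watch on every hill "
    "and sleepy towns forget their daily noise "
    "until bright morning wakes them with new light"
)

char_tokens = list(sample)    # one token per character
word_tokens = sample.split()  # one token per whitespace-separated word

char_vocab = set(char_tokens)
word_vocab = set(word_tokens)

print("char-level: sequence length", len(char_tokens), "vocab size", len(char_vocab))
print("word-level: sequence length", len(word_tokens), "vocab size", len(word_vocab))
```

The character-level sequence is several times longer, while its vocabulary stays bounded by the alphabet. The word-level vocabulary keeps growing as more text is added.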

In [None]:
import torch
data = torch.tensor(encode(text), dtype = torch.long)
print(data.shape, data.dtype)

torch.Size([121559783]) torch.int64


The tensor has one element per character of the dataset, so its length equals `len(text)`.

This is because each character has been mapped to a single integer, so encoding leaves the overall length unchanged.

### Creating DataLoaders

In [None]:
# Train and validation splits
# (reuses the `data` tensor built in the previous cell instead of re-encoding the whole corpus)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y
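
To make the one-step offset between inputs and targets concrete, here is a minimal sketch of the same sampling logic in plain Python lists. `toy_data`, `toy_block_size`, `toy_batch_size` and `toy_get_batch` are hypothetical names used only for this illustration.

```python
import random

toy_data = list(range(100, 120))  # stand-in for the encoded token ids
toy_block_size = 4
toy_batch_size = 3


def toy_get_batch(data, block_size, batch_size, rng):
    # Random start offsets; the upper bound leaves room for the shifted target.
    starts = [rng.randrange(len(data) - block_size) for _ in range(batch_size)]
    x = [data[i:i + block_size] for i in starts]
    y = [data[i + 1:i + block_size + 1] for i in starts]
    return x, y


rng = random.Random(0)
xb, yb = toy_get_batch(toy_data, toy_block_size, toy_batch_size, rng)

# Each row holds block_size training examples: a growing context and its next token.
for xs, ys in zip(xb, yb):
    for t in range(toy_block_size):
        print(f"context {xs[:t + 1]} -> target {ys[t]}")
```

Each target row is the input row shifted left by one position, so a single sequence of length `block_size` supervises a prediction at every position.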

Next, a helper that estimates the loss by averaging over `eval_iters` random batches from each split:

In [None]:
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = 32 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 128
n_head = 8
n_layer = 8
dropout = 0.0

## Making a Self Attention Block

Self-attention

Query -> What is this token looking for?

Key -> What does this token contain?

Value -> What does this token pass on when it is attended to?

The dot product of a query and a key gives their affinity. A large value means the key's token has what the query was looking for.

We scale the scores by 1/sqrt(head_size) so that, when applying softmax, no single value is too large. Otherwise the softmax saturates towards a one-hot vector, making it difficult to aggregate information from other tokens.
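
The effect of the scaling can be checked numerically. The sketch below uses made-up affinity scores and a hand-written `softmax` helper (both assumptions for illustration), with `head_size = 16`, which is `n_embd // n_head` for the settings used in this notebook.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


head_size = 16                      # n_embd // n_head = 128 // 8
raw_scores = [8.0, 4.0, 0.0, -4.0]  # made-up unscaled affinities
scaled_scores = [s * head_size ** -0.5 for s in raw_scores]

unscaled_probs = softmax(raw_scores)
scaled_probs = softmax(scaled_scores)

print([round(p, 3) for p in unscaled_probs])
print([round(p, 3) for p in scaled_probs])
```

Without scaling almost all of the weight lands on one position. After scaling the distribution is flatter, so the other positions still contribute to the output.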

Step 1: compute query @ key.transpose and scale the result

Step 2: fill the upper triangle of the square matrix (the future positions) with -inf

Step 3: apply softmax and dropout

Step 4: multiply with the value matrix and return
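
The four steps can be sketched for a single sequence in plain Python, without batching, learned projections, or dropout. `causal_attention` and the toy vectors are hypothetical names and values for illustration. A position may attend to itself and to earlier positions only, so entries above the diagonal are masked.

```python
import math


def softmax(xs):
    # Numerically stable softmax; exp(-inf) evaluates to 0.0.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(q, k, v):
    """q, k, v are lists of T vectors (lists of floats). Returns (out, weights)."""
    T, d = len(q), len(q[0])
    out, weights = [], []
    for t in range(T):
        # Step 1: scaled dot product of query t with every key
        scores = [sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d) for s in range(T)]
        # Step 2: mask future positions (s > t) with -inf
        scores = [sc if s <= t else float("-inf") for s, sc in enumerate(scores)]
        # Step 3: softmax (dropout omitted in this sketch)
        w = softmax(scores)
        # Step 4: weighted sum of the value vectors
        out.append([sum(w[s] * v[s][j] for s in range(T)) for j in range(len(v[0]))])
        weights.append(w)
    return out, weights


toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # reused as q, k and v for simplicity
out, weights = causal_attention(toy, toy, toy)
for row in weights:
    print([round(w, 3) for w in row])
```

The printed weight matrix is lower triangular: the first token can only attend to itself, and every row sums to 1.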


In [None]:
class Head(nn.Module):
    def __init__(self,head_size):
        super().__init__()
        self.key = nn.Linear(n_embd,head_size,bias = False)
        self.query = nn.Linear(n_embd,head_size,bias = False)
        self.value = nn.Linear(n_embd,head_size,bias = False)
        self.register_buffer('tril',torch.tril(torch.ones(block_size,block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
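        B,T,C = x.shape
        k = self.key(x)   # (B, T, head_size)
        q = self.query(x) # (B, T, head_size)
        # transpose(-2,-1) turns (B, T, head_size) into (B, head_size, T);
        # scale by head_size (not n_embd) so the scores keep roughly unit variance
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        # positions above the diagonal (future tokens) are set to -inf, so they turn into 0 after softmax
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        v = self.value(x) # (B, T, head_size)
        out = wei @ v # (B, T, T) @ (B, T, head_size) -> (B, T, head_size)
        return out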


## Making a Multi Attention Block

We run multiple self-attention heads in parallel, concatenate their outputs, and pass the result through a linear projection.

More heads let the model split its embedding space into multiple “views,” so each head can focus on different types of relationships (e.g. syntax vs. semantics) in parallel. That can increase the richness of what it learns—but if you keep the total embedding size fixed, adding heads shrinks each head’s dimensionality, so there’s a trade‑off and diminishing returns.
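
The trade-off is simple arithmetic. With the embedding width fixed at `n_embd = 128` (the value used in this notebook), the per-head dimension shrinks as the number of heads grows. `head_sizes` is a hypothetical helper written for this illustration.

```python
def head_sizes(n_embd, head_counts):
    # Per-head dimension for each head count, keeping the total width fixed.
    sizes = {}
    for n_head in head_counts:
        if n_embd % n_head != 0:
            raise ValueError(f"n_embd={n_embd} is not divisible by n_head={n_head}")
        sizes[n_head] = n_embd // n_head
    return sizes


sizes = head_sizes(128, (1, 2, 4, 8, 16, 32))
for n_head, head_size in sizes.items():
    print(f"n_head={n_head:2d} -> head_size={head_size:3d} (total {n_head * head_size})")
```

With `n_head = 8`, each head works in a 16-dimensional subspace. At 32 heads only 4 dimensions remain per head, which leaves little room for each query/key comparison.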

In [None]:
class MultiHeadAttention(nn.Module):
  def __init__(self, num_heads, head_size):
    super().__init__()
    # we create num_heads heads and run them in parallel
    self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
    self.proj = nn.Linear(n_embd,n_embd)
    self.dropout = nn.Dropout(dropout)

  def forward(self,x):
    out = torch.cat([h(x) for h in self.heads],dim = -1)
    out = self.proj(out)
    out = self.dropout(out)
    return out

## Creating a FeedForward NeuralNet

We create a simple feed-forward network whose hidden layer is 4 times the embedding width, the ratio used in the Transformer paper ("Attention Is All You Need").

In [None]:
class FeedFoward(nn.Module):
  def __init__(self,n_embd):
    super().__init__()
    self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )
  def forward(self,x):
    return self.net(x)
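
As a sanity check on the size of this layer, the following sketch counts its parameters by hand. `linear_params` and `feedforward_params` are hypothetical helpers, and the count assumes both linear layers keep their bias terms, as in the class above.

```python
def linear_params(n_in, n_out, bias=True):
    # Weight matrix plus optional bias vector of a linear layer.
    return n_in * n_out + (n_out if bias else 0)


def feedforward_params(n_embd, expansion=4):
    hidden = expansion * n_embd
    # n_embd -> hidden, then hidden -> n_embd (ReLU and Dropout have no parameters)
    return linear_params(n_embd, hidden) + linear_params(hidden, n_embd)


ffn_count = feedforward_params(128)
print(ffn_count)  # 131712 for n_embd = 128
```

Because of the 4x expansion, the feed-forward layer holds roughly `8 * n_embd**2` weights, which is the largest share of the parameters in each block.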

## Creating an Attention Block

We create a Block, which combines a multi-head attention layer and a feed-forward layer, each preceded by a layer norm and wrapped in a residual connection.

In [None]:
class Block(nn.Module):
  def __init__(self,n_embd, n_head):
    super().__init__()
    head_size = n_embd // n_head

    self.sa = MultiHeadAttention(n_head, head_size)
    self.ffwd = FeedFoward(n_embd)
    self.ln1 = nn.LayerNorm(n_embd)
    self.ln2 = nn.LayerNorm(n_embd)

  def forward(self,x):
    x = x + self.sa(self.ln1(x))
    x = x + self.ffwd(self.ln2(x))

    return x
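
The pattern `x = x + sublayer(layer_norm(x))` can be illustrated on a single vector. This is a minimal sketch: `layer_norm` here omits the learnable gain and bias of `nn.LayerNorm` (which start at 1 and 0), and `toy_sublayer` is a hypothetical stand-in for attention or the feed-forward network.

```python
import math


def layer_norm(x, eps=1e-5):
    # Normalize one vector to zero mean and (approximately) unit variance.
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]


def pre_norm_residual(x, sublayer):
    # The sublayer sees the normalized input; its output is added back to x.
    return [xi + si for xi, si in zip(x, sublayer(layer_norm(x)))]


def toy_sublayer(h):
    # Hypothetical stand-in for attention or the feed-forward net.
    return [0.5 * hi for hi in h]


x = [1.0, 2.0, 3.0, 4.0]
normed = layer_norm(x)
y = pre_norm_residual(x, toy_sublayer)
print([round(v, 3) for v in normed])
print([round(v, 3) for v in y])
```

If the sublayer outputs zeros, the block returns `x` unchanged. This identity path is what lets gradients flow through a deep stack of blocks.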

In [None]:
class LanguageModel(nn.Module):

  def __init__(self):
        super().__init__()

        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)
        # to transform the C dim back to vocab_size to extract logits
        self.lm_head = nn.Linear(n_embd, vocab_size)

  def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device)) # (T,C), built on the same device as the input
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

  def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
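
The sampling step inside `generate` can be sketched in plain Python. Here `random.choices` plays the role that `torch.multinomial` plays above; the logits are made up and `sample_next` is a hypothetical helper.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next(logits, rng):
    # Turn logits into probabilities and draw one token id from them.
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)
logits = [2.0, 0.5, -1.0]  # made-up logits for a 3-token vocabulary
counts = [0, 0, 0]
for _ in range(10000):
    counts[sample_next(logits, rng)] += 1

print([round(p, 3) for p in softmax(logits)])
print(counts)
```

Sampling, rather than always taking the most likely token, is what makes each generated text different. The most likely token is still chosen most often.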

In [None]:
model = LanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))

1.693597 M parameters
step 0: train loss 6.2193, val loss 6.2207
step 100: train loss 2.5466, val loss 2.6080
step 200: train loss 2.4395, val loss 2.4932
step 300: train loss 2.3719, val loss 2.4155
step 400: train loss 2.3208, val loss 2.3587
step 500: train loss 2.2461, val loss 2.3053
step 600: train loss 2.2010, val loss 2.2494
step 700: train loss 2.1572, val loss 2.2287
step 800: train loss 2.1257, val loss 2.1786
step 900: train loss 2.1033, val loss 2.1556
step 1000: train loss 2.0553, val loss 2.1336
step 1100: train loss 2.0565, val loss 2.1221
step 1200: train loss 2.0349, val loss 2.0986
step 1300: train loss 2.0330, val loss 2.0926
step 1400: train loss 1.9820, val loss 2.0559
step 1500: train loss 1.9975, val loss 2.0468
step 1600: train loss 1.9767, val loss 2.0207
step 1700: train loss 1.9615, val loss 2.0256
step 1800: train loss 1.9546, val loss 2.0107
step 1900: train loss 1.9416, val loss 2.0200
step 2000: train loss 1.9384, val loss 1.9952
step 2100: train loss 1.