**Download Dataset to train on**

We will download the Tiny Shakespeare dataset, following Karpathy's tutorial.

In [39]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2025-09-05 14:33:48--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt.2’


2025-09-05 14:33:49 (12.9 MB/s) - ‘input.txt.2’ saved [1115394/1115394]



In [40]:
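# read the dataset into one string; input.txt already existed from an earlier
# download, which is why wget saved the new copy above as input.txt.2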
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [41]:
print(f'length of dataset in characters: {len(text):,}')

length of dataset in characters: 1,115,394


Let's take a look at the first 1000 characters

In [42]:
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



Let's collect the unique characters that occur in the text and sort them

In [43]:
chars = sorted(list(set(text)))
print(f'{len(chars)} unique characters: {chars}')

65 unique characters: ['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


In [44]:
vocab_size = len(chars)
print(f'vocab size: {vocab_size}')

vocab size: 65


There are 65 unique characters, so our vocabulary size is 65. As you can see, the characters include the newline, the space, punctuation and symbols, the digit `3`, and uppercase and lowercase letters.

Now, **tokenize** the text. Here we use character-level tokenization. Production LLMs typically use subword tokenizers, which have a much larger vocabulary and produce much shorter token sequences.

In [45]:
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
def encode(s):
    return [stoi[c] for c in s]  # encoder: take a string, output a list of integers
def decode(l):
    return ''.join([itos[i] for i in l])  # decoder: take a list of integers, output a string
print(encode("hii there"))
print(decode(encode("hii there")))

[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there


1. h = 46
2. i = 47
3. i = 47
4. *space* = 1
5. t = 58
6. h = 46
7. e = 43
8. r = 56
9. e = 43

As you can see, even the space is encoded (as `1`), and so is the newline `\n` (as `0`, the first entry in the sorted character list above).
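The same mapping can be tried on a much smaller corpus. The following is a minimal sketch, assuming a hypothetical toy string `toy_text` in place of the Shakespeare text. The `toy_` names are illustrative only and are not part of the notebook. It also shows one limitation of this scheme: a character that never appeared in the corpus cannot be encoded.

```python
# Toy corpus (hypothetical; the notebook uses the full Shakespeare text instead).
toy_text = "hello world\n"

# Build the vocabulary and both lookup tables, exactly as done for the real text.
toy_chars = sorted(set(toy_text))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for i, ch in enumerate(toy_chars)}


def toy_encode(s):
    """Map a string to a list of integer ids, one per character."""
    return [toy_stoi[c] for c in s]


def toy_decode(ids):
    """Map a list of integer ids back to a string."""
    return "".join(toy_itos[i] for i in ids)


ids = toy_encode("hello")
round_trip = toy_decode(ids)

# A character that is absent from the vocabulary raises KeyError.
try:
    toy_encode("z")
    unknown_char_encodable = True
except KeyError:
    unknown_char_encodable = False

print(ids, repr(round_trip), unknown_char_encodable)
```

Subword tokenizers avoid this failure mode by falling back to smaller units when a word is not in their vocabulary.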

Other ways of doing tokenization:

1. [SentencePiece](https://github.com/google/sentencepiece): This is Google's tokenizer. It encodes neither full words nor individual characters, but subwords.
2. [tiktoken](https://github.com/openai/tiktoken): This is OpenAI's tokenizer library, which uses byte pair encoding (BPE).


In [46]:
import tiktoken
enc = tiktoken.get_encoding("o200k_base")
assert enc.decode(enc.encode("hello world")) == "hello world"

# To get the tokeniser corresponding to a specific model in the OpenAI API:
enc = tiktoken.encoding_for_model("gpt-4o")

print(enc.encode("hii there"))


[71, 3573, 1354]


In [47]:
# how many tokens are in the text
print(f'There are {len(enc.encode(text))} tokens in the text')

There are 297606 tokens in the text


In [48]:
# what is the vocab size of this tokenizer
print(f'vocab size: {enc.n_vocab}')

vocab size: 200019


We won't use these big tokenizers; we will keep things simple.

Going back to Shakespeare: since we have the `encode` and `decode` functions, we can tokenize the entire dataset.

In [49]:
import torch  # PyTorch provides the tensor type used from here on
data = torch.tensor(encode(text), dtype=torch.long)

print(data.shape, data.dtype)
print(data[:1000])  # first 1000 characters encoded as integers

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
      

Now, let's separate the train and validation sets

In [50]:
# train and validation split
n = int(0.9*len(data)) # first 90% will be train
train_data = data[:n]
val_data = data[n:]

print(train_data.shape, val_data.shape)

torch.Size([1003854]) torch.Size([111540])


In [51]:
block_size = 8 # how many characters do we take to predict the next character
train_data[:block_size+1]  # we will take the first block_size characters to predict the next character


tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [52]:
x = train_data[:block_size]  # inputs
y = train_data[1:block_size+1]  # targets, shifted by one character
for t in range(block_size):
    context = x[:t+1]  # the context we have so far
    target = y[t]  # the next character we want to predict
    print(f'when input is {context.tolist()} the target: {target}')


when input is [18] the target: 47
when input is [18, 47] the target: 56
when input is [18, 47, 56] the target: 57
when input is [18, 47, 56, 57] the target: 58
when input is [18, 47, 56, 57, 58] the target: 1
when input is [18, 47, 56, 57, 58, 1] the target: 15
when input is [18, 47, 56, 57, 58, 1, 15] the target: 47
when input is [18, 47, 56, 57, 58, 1, 15, 47] the target: 58


This code demonstrates how next-character prediction works in a GPT-style language model.

- `x` contains the first `block_size` characters from the training data, which serve as the input context.
- `y` contains the next `block_size` characters, shifted by one position, which are the targets for prediction.
- For each position `t` in the block:
    - `context` is the input sequence up to position `t` (i.e., the context available so far).
    - `target` is the next character that the model should predict, given the current context.
    - The print statement shows the context and the corresponding target character.

This setup mimics how autoregressive models like GPT learn to predict the next token given a sequence of previous tokens.

Now, let's add a batch dimension, so that several independent sequences are processed in parallel

In [53]:
seed = 135
torch.manual_seed(seed)

batch_size = 4 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))  # random starting indices for each sequence in the batch
    x = torch.stack([data[i:i+block_size] for i in ix])  # each row is a sequence of length block_size
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])  # targets, shifted by one character
    return x, y

In [54]:
xb, yb = get_batch('train')
print('inputs:')
print(xb)
print('targets:')
print(yb)

inputs:
tensor([[ 1, 53, 44,  1, 59, 57, 12,  1],
        [46,  6,  1, 47, 58,  1, 47, 57],
        [50, 50,  1, 53, 59, 58,  0, 32],
        [52,  1, 58, 46, 43,  1, 41, 50]])
targets:
tensor([[53, 44,  1, 59, 57, 12,  1, 41],
        [ 6,  1, 47, 58,  1, 47, 57,  1],
        [50,  1, 53, 59, 58,  0, 32, 53],
        [ 1, 58, 46, 43,  1, 41, 50, 53]])


In [55]:
# and their shapes: (batch_size, block_size)
print(xb.shape)
print(yb.shape)

torch.Size([4, 8])
torch.Size([4, 8])


In [56]:
for b in range(batch_size):  # for each sequence in the batch
    for t in range(block_size):  # for each time step in the sequence
        context = xb[b, :t+1]  # the context we have so far
        target = yb[b, t]  # the next character we want to predict
        print(f'when input is {context.tolist()} the target: {target}') 

when input is [1] the target: 53
when input is [1, 53] the target: 44
when input is [1, 53, 44] the target: 1
when input is [1, 53, 44, 1] the target: 59
when input is [1, 53, 44, 1, 59] the target: 57
when input is [1, 53, 44, 1, 59, 57] the target: 12
when input is [1, 53, 44, 1, 59, 57, 12] the target: 1
when input is [1, 53, 44, 1, 59, 57, 12, 1] the target: 41
when input is [46] the target: 6
when input is [46, 6] the target: 1
when input is [46, 6, 1] the target: 47
when input is [46, 6, 1, 47] the target: 58
when input is [46, 6, 1, 47, 58] the target: 1
when input is [46, 6, 1, 47, 58, 1] the target: 47
when input is [46, 6, 1, 47, 58, 1, 47] the target: 57
when input is [46, 6, 1, 47, 58, 1, 47, 57] the target: 1
when input is [50] the target: 50
when input is [50, 50] the target: 1
when input is [50, 50, 1] the target: 53
when input is [50, 50, 1, 53] the target: 59
when input is [50, 50, 1, 53, 59] the target: 58
when input is [50, 50, 1, 53, 59, 58] the target: 0
when input

Let's feed our input to a simple neural network: the `BigramLanguageModel`

In [57]:
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(135)

# simple bigram model
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)  # each token directly reads off the logits for the next token from a lookup table

    def forward(self, idx, targets=None):
        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx)  # (B,T,C)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B,T) array of indices in the current context
        for _ in range(max_new_tokens):
            logits, loss = self(idx)  # (B,T,C)
            logits = logits[:, -1, :]  # focus only on the last time step (B,C)
            probs = F.softmax(logits, dim=-1)  # (B,C)
            idx_next = torch.multinomial(probs, num_samples=1)  # (B,1)
            idx = torch.cat((idx, idx_next), dim=1)  # append to the sequence
        return idx
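The embedding table in the model above acts as a plain lookup: each token id selects one row, and that row is used directly as the logits for the next token. Here is a minimal sketch of that idea, assuming a standalone `nn.Embedding` named `table` (a hypothetical name) rather than the model's own attribute:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size = 65  # same vocabulary size as the character set above
table = nn.Embedding(vocab_size, vocab_size)  # hypothetical standalone table

idx = torch.tensor([[18, 47, 56]])  # (B=1, T=3) token ids
logits = table(idx)                 # (1, 3, 65): one row of the table per token

# Calling the embedding is the same as indexing its weight matrix by token id.
same_as_indexing = torch.equal(logits, table.weight[idx])

# A softmax over the last dimension turns each row into next-token probabilities.
probs = torch.softmax(logits, dim=-1)

print(logits.shape, same_as_indexing)
```

Because the prediction depends only on the current token, the model ignores every earlier token in the context, which is what makes it a bigram model.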

In [58]:
m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)  # loss is a single scalar


torch.Size([32, 65])
tensor(4.6684, grad_fn=<NllLossBackward0>)


In [59]:
idx = torch.zeros((1,1), dtype=torch.long)  # starting with a batch of size 1 and sequence length 1
print(decode(m.generate(idx, max_new_tokens=100)[0].tolist()))  # decode the generated sequence to see what it looks like


D.:VV'QlZlCfFBS!P$nChtJVrMn;X3ax3rYnTlpIFmtbp RZNNTvagSPexCfPbJhadsRz'prG&y,;;eB&FFqnXaLZlpIFV!-F3af


The output is essentially random, since the model has not been trained yet
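A quick sanity check on the initial loss reported above: a predictor that assigns equal probability to all 65 characters would have a cross-entropy of ln(65), about 4.17. The sketch below computes that reference value with the standard library only (the variable name `uniform_loss` is illustrative):

```python
import math

vocab_size = 65  # number of unique characters found in the dataset

# Cross-entropy of a predictor that gives every character probability 1/65.
uniform_loss = -math.log(1.0 / vocab_size)

print(f"loss of a uniform predictor: {uniform_loss:.4f}")
```

The untrained model's loss of 4.6684 is somewhat above this reference, which is plausible for randomly initialized logits that are not exactly uniform.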

Let's **train** the model

In [60]:
# prepare the optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)



In [62]:
batch_size = 32
for steps in range(10000):
    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    if steps % 1000 == 0:
        print(steps, loss.item())

0 2.380831480026245
1000 2.364262104034424
2000 2.4205198287963867
3000 2.4657340049743652
4000 2.4013941287994385
5000 2.3721094131469727
6000 2.593458652496338
7000 2.561976194381714
8000 2.5262978076934814
9000 2.496509075164795


In [63]:
print(decode(m.generate(idx, max_new_tokens=500)[0].tolist()))  # decode the generated sequence to see what it looks like



Towatt llerus g ss fik NNCETherd, gouco oonginhe t huepibee ys'sus ane main,
Manighebroceat ld
ARaun soubukis.


elinoryor, Pl y?
Fifonnd tie ot hit kes odeaisam fore ssst.
TIne;
jomy ave baceit MNS:
Wimp t collepand yokerowores be, reer cesprrithar' ted.
Fiolayoswites tcoth t w
OUTheving by:
KIft eroofakemecen I ar fo meisss y mer:
NThafaueajurin:
OMEved yothange, a th touf I fref yorouecl'serowindiff s KI's baponout oveethod omag he?
There fowamin Is je cullcel hind ave fomealene trooowh a Pr


**A mathematical trick in self-attention**

In [64]:
# a toy example

B,T,C = 4,8,2
x = torch.randn(B,T,C)
print(x.shape)

torch.Size([4, 8, 2])


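We want the tokens to talk to each other, but each token should only see the tokens to its left (its past context). The simplest way to do that is to average the current token with all the tokens before it.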

In [None]:
# we want the tokens to talk to each other, but only to the left context

xbow = torch.zeros((B,T,C))
for b in range(B):                            # doing a for loop : not very efficient
    for t in range(T):
        xbow[b,t] = torch.mean(x[b,:t+1], 0)  # average over the context
print(xbow)

tensor([[[-0.2821, -0.1073],
         [-1.0871,  0.0314],
         [ 0.1231,  0.2967],
         [-0.0860,  0.3534],
         [-0.3350,  0.2186],
         [-0.4511,  0.2609],
         [-0.6646,  0.3740],
         [-0.6330,  0.1526]],

        [[ 0.9311, -0.1941],
         [ 0.6566,  0.0801],
         [ 0.6677, -0.3998],
         [ 0.8455, -0.5342],
         [ 0.4392, -0.3708],
         [ 0.2252, -0.6094],
         [ 0.0256, -0.5129],
         [ 0.1372, -0.5472]],

        [[-0.6001, -0.9921],
         [-0.1383, -0.5118],
         [-0.9251, -0.4939],
         [-0.3496, -0.4052],
         [-0.2337, -0.4883],
         [-0.2670, -0.2690],
         [-0.5445, -0.0915],
         [-0.2067, -0.0571]],

        [[ 0.3005,  0.0087],
         [ 0.2932,  0.3764],
         [-0.0314, -0.1346],
         [-0.0230,  0.0532],
         [ 0.2354, -0.0380],
         [ 0.4469, -0.2601],
         [ 0.4726, -0.4345],
         [ 0.3180, -0.4246]]])


In [66]:
x[0]

tensor([[-0.2821, -0.1073],
        [-1.8922,  0.1701],
        [ 2.5436,  0.8273],
        [-0.7131,  0.5235],
        [-1.3311, -0.3204],
        [-1.0315,  0.4724],
        [-1.9455,  1.0520],
        [-0.4120, -1.3968]])

In [67]:
xbow[0]

tensor([[-0.2821, -0.1073],
        [-1.0871,  0.0314],
        [ 0.1231,  0.2967],
        [-0.0860,  0.3534],
        [-0.3350,  0.2186],
        [-0.4511,  0.2609],
        [-0.6646,  0.3740],
        [-0.6330,  0.1526]])

In [68]:
# we can do this with matrix operations; let's understand this with a toy example

a = torch.ones(3,3)
b = torch.randint(0,10,(3,2)).float()
c = a @ b

print(a)
print(b)
print(c)

tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]])
tensor([[6., 2.],
        [7., 2.],
        [8., 6.]])
tensor([[21., 10.],
        [21., 10.],
        [21., 10.]])


In [71]:
# let's see tril: a lower-triangular matrix of ones

a = torch.tril(torch.ones(3,3))
a = a / torch.sum(a,1,keepdim=True)  # normalize the rows to sum to 1

print(a)
print(b)
c = a @ b
print(c)


tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
tensor([[6., 2.],
        [7., 2.],
        [8., 6.]])
tensor([[6.0000, 2.0000],
        [6.5000, 2.0000],
        [7.0000, 3.3333]])


We can compute these running averages with a single matrix multiplication by using `tril`. Let's go back to our example.

In [72]:
wei = torch.tril(torch.ones(T,T))
wei = wei / torch.sum(wei,1,keepdim=True)  # normalize the rows to sum to 1
print(wei)
xbow2 = wei @ x  # (T,T) @ (B,T,C): wei is broadcast over the batch -> (B,T,C)
print(torch.allclose(xbow, xbow2))


tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])
True
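Two things happen in the cell above: the `(T, T)` weight matrix is broadcast across the batch, and each output row is the mean of all inputs up to that position. The following is a small sketch that checks both on a fresh toy tensor, comparing against a cumulative-sum formulation (`avg_cumsum` and `counts` are names introduced here for illustration):

```python
import torch

torch.manual_seed(0)

B, T, C = 2, 4, 3
x = torch.randn(B, T, C)

# Row-normalized lower-triangular matrix: row t averages positions 0..t.
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(dim=1, keepdim=True)

# (T, T) @ (B, T, C): the 2-D matrix is broadcast over the batch dimension.
avg_matmul = wei @ x

# Same running mean, written as a cumulative sum divided by the position count.
counts = torch.arange(1, T + 1, dtype=x.dtype).view(1, T, 1)
avg_cumsum = torch.cumsum(x, dim=1) / counts

same = torch.allclose(avg_matmul, avg_cumsum, atol=1e-6)
print(avg_matmul.shape, same)
```

The matrix form is the one that generalizes: replacing the uniform weights with learned, data-dependent weights gives self-attention.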


In [None]:
# another version of the same, using softmax
tril = torch.tril(torch.ones(T,T))
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
print(wei)

xbow3 = wei @ x
print(torch.allclose(xbow, xbow3))

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])
True
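The softmax version gives the same uniform averages because every allowed position has a score of 0 and every masked position has a score of negative infinity: `exp(0)` is 1 and `exp(-inf)` is 0, so each row is divided evenly among the positions it may see. Here is a minimal sketch in plain Python (the helper `softmax` is written here for illustration and is not the PyTorch function):

```python
import math


def softmax(row):
    """Softmax over a list of floats; exp(-inf) evaluates to 0.0."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


neg_inf = float("-inf")
T = 4
rows = []
for t in range(T):
    # Position t may see positions 0..t (score 0) and nothing after it (score -inf).
    scores = [0.0 if j <= t else neg_inf for j in range(T)]
    rows.append(softmax(scores))

for r in rows:
    print([round(p, 4) for p in r])
```

Starting from scores of 0 is only a placeholder: in self-attention the scores come from the data, and the same mask-then-softmax step turns them into weights.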


### **Self-Attention (Single Head)**

In [84]:
B,T,C = 4,8,32
x = torch.randn(B,T,C)


# single head self attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)



k = key(x)   # (B,T,16)
q = query(x) # (B,T,16)
v = value(x) # (B,T,16)



# compute attention scores ("affinities")
wei = q @ k.transpose(-2,-1) * head_size**-0.5  # (B,T,16) @ (B,16,T) -> (B,T,T)
print(wei.shape)

tril = torch.tril(torch.ones(T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)

out = wei @ v  # (B,T,T) @ (B,T,16) -> (B,T,16)
print(out.shape)

torch.Size([4, 8, 8])
torch.Size([4, 8, 16])


In [81]:
# you can see the tril matrix is used to mask out future tokens
print(tril)

tensor([[1., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1.]])


In [83]:
# and wei holds the attention weights: a softmax over the masked scores, so each row sums to 1
print(wei[0])

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.4997, 0.5003, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3270, 0.3513, 0.3217, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1513, 0.1852, 0.1590, 0.5044, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1755, 0.1318, 0.1860, 0.3205, 0.1862, 0.0000, 0.0000, 0.0000],
        [0.1726, 0.2249, 0.1000, 0.0921, 0.1424, 0.2680, 0.0000, 0.0000],
        [0.1253, 0.2302, 0.1376, 0.1384, 0.2054, 0.0554, 0.1078, 0.0000],
        [0.1617, 0.0989, 0.1200, 0.1255, 0.1418, 0.0963, 0.1171, 0.1387]],
       grad_fn=<SelectBackward0>)


1. Attention is a communication mechanism. It can be seen as nodes in a directed graph, where each node aggregates information from all the nodes that point to it through a weighted sum with data-dependent weights.
2. Attention has no built-in notion of position: it simply acts over a set of vectors, so the tokens need a positional encoding.
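The second point can be checked directly. The following is a minimal sketch, assuming unmasked attention with identity projections (a simplification chosen here for illustration; the cell above uses learned `key`, `query` and `value` layers and a causal mask). Shuffling the input rows simply shuffles the output rows in the same way, so nothing in the computation depends on where a token sits:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

T, C = 5, 4
x = torch.randn(T, C)


def plain_attention(x):
    """Unmasked self-attention with identity projections (illustrative only)."""
    scores = x @ x.T * C ** -0.5         # (T, T) affinities
    weights = F.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ x                   # (T, C)


perm = torch.randperm(T)

out = plain_attention(x)
out_perm = plain_attention(x[perm])

# Permuting the inputs permutes the outputs identically: no position is special.
order_blind = torch.allclose(out[perm], out_perm, atol=1e-5)
print(order_blind)
```

Adding a positional encoding to each token vector before attention is what lets the model tell the first token from the fifth.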


**Scaled Attention**

In [None]:
k = torch.randn(B,T, head_size)
q = torch.randn(B,T, head_size)
wei = q @ k.transpose(-2,-1) * head_size**-0.5 # scaling by 1/sqrt(head_size) keeps the variance of the dot products near 1, so softmax does not saturate


In [92]:
print(f" k.var(): {k.var()}")
print(f" q.var(): {q.var()}")
print(f" v.var(): {v.var()}")

print(f" wei.var(): {wei.var()}")

 k.var(): 1.0613439083099365
 q.var(): 0.9738199710845947
 v.var(): 0.3055475056171417
 wei.var(): 0.8701637983322144
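The numbers above show `wei` with a variance close to 1 after scaling. The sketch below, in plain Python with the standard library only, illustrates what the scaling buys. It assumes unit-variance random vectors, and the sample count `num_pairs` and the example `scores` list are arbitrary choices made here. Without scaling, the dot products have a variance of roughly `head_size`, and large scores push softmax toward a one-hot output.

```python
import math
import random

random.seed(0)

head_size = 16    # same head size as above
num_pairs = 5000  # number of random (q, k) pairs; an arbitrary choice


def variance(values):
    """Sample variance of a list of floats."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def softmax(row):
    """Softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


# Dot products of vectors whose entries have unit variance.
raw = []
for _ in range(num_pairs):
    q = [random.gauss(0.0, 1.0) for _ in range(head_size)]
    k = [random.gauss(0.0, 1.0) for _ in range(head_size)]
    raw.append(sum(a * b for a, b in zip(q, k)))

# Multiplying by head_size ** -0.5 divides the variance by head_size.
scaled = [s * head_size ** -0.5 for s in raw]

# Larger scores make softmax concentrate on the largest entry.
scores = [0.1, -0.2, 0.3, -0.2, 0.5]
mild = softmax(scores)
sharp = softmax([8 * s for s in scores])

print(f"var(raw)    = {variance(raw):.2f}")
print(f"var(scaled) = {variance(scaled):.2f}")
print("softmax(scores)     =", [round(p, 3) for p in mild])
print("softmax(8 * scores) =", [round(p, 3) for p in sharp])
```

A near one-hot softmax means each token attends to essentially one other token, so keeping the scores at unit scale keeps the attention weights diffuse, which matters most at initialization.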
