Always start with the dataset to train on (here, the Tiny Shakespeare text)

In [2]:
!curl -O https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1089k  100 1089k    0     0  4908k      0 --:--:-- --:--:-- --:--:-- 4928k


In [3]:
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [4]:
print(f'Length of dataset in characters: {len(text)}')

Length of dataset in characters: 1115394


In [5]:
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



Time to build a basic token vocabulary (just figuring out which characters are present in the dataset)

In [6]:
chars = sorted(list(set(text))) # All unique characters in the text, sorted
vocab_size = len(chars) # Number of distinct characters
print(''.join(chars)) # Show the whole vocabulary as one string
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


Tokenisation step for a character-level vocabulary. There is a trade-off between the size of the vocabulary and the length of the encoded sequences: a small vocabulary (such as these 65 characters) produces long sequences, while a larger sub-word vocabulary produces shorter ones. Popular sub-word tokenisers: tiktoken (by OpenAI), SentencePiece (by Google).
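A rough sketch of that trade-off in plain Python. The sample sentence and the whitespace-split "word" tokeniser are assumptions made for illustration only; real sub-word tokenisers such as tiktoken or SentencePiece work differently.

```python
# Toy comparison of two tokenisation granularities (illustrative only).
sample = "First Citizen: Before we proceed any further, hear me speak."

# Character-level: one token per character, so the sequence is long.
char_vocab = sorted(set(sample))
char_stoi = {ch: i for i, ch in enumerate(char_vocab)}
char_ids = [char_stoi[ch] for ch in sample]

# Word-level (naive whitespace split): one token per word, so the sequence is short.
words = sample.split()
word_vocab = sorted(set(words))
word_stoi = {w: i for i, w in enumerate(word_vocab)}
word_ids = [word_stoi[w] for w in words]

print(f"char-level: vocab={len(char_vocab)}, sequence length={len(char_ids)}")
print(f"word-level: vocab={len(word_vocab)}, sequence length={len(word_ids)}")
```

On a single sentence the word vocabulary is still small; over a whole corpus it keeps growing with every new word, while the character vocabulary here stays fixed at 65.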

In [7]:
stoi = { ch:i for i,ch in enumerate(chars) } # stoi = "string to integer": maps each character to its index (enumerate yields (index, character) pairs)
itos = { i:ch for i,ch in enumerate(chars) } # itos = "integer to string": the inverse mapping, index to character
encode = lambda s: [stoi[c] for c in s] # encoder: takes a string, returns a list of integers (lambda is a quick way to define a function without def)
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: the list comprehension looks up each character, join concatenates them into one string

print(encode('hii there'))
print(decode(encode('hii there')))

[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there


Next step is to create a data tensor (encode the whole of input.txt). A tensor is a multi-dimensional array, similar to a NumPy array, that is optimised for machine learning tasks. A data tensor is simply a tensor that holds encoded data (encoded text, images) to be used for ML.

In [8]:
import torch
print(torch.__version__)
data = torch.tensor(encode(text), dtype=torch.long) # The tensor stores 64-bit integers
print(data.shape, data.dtype)
print(data[:1000]) # The first 1000 characters, encoded as integers

2.5.1
torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,


In [9]:
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]

To train a transformer, the text is split into chunks of at most block_size tokens rather than fed in all at once

In [10]:
block_size = 8
train_data[:block_size+1] # We do +1 to be able to predict the next character for each one of the 8 positions in the list

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

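This is done both for computational efficiency and for context reasons: the model learns to predict from every context length between 1 character and 8 (block size). At generation time, any context longer than block size has to be truncated before predicting, because the model never sees longer inputs during training.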

In [11]:
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f'When the input is {context}, the target is: {target}')

When the input is tensor([18]), the target is: 47
When the input is tensor([18, 47]), the target is: 56
When the input is tensor([18, 47, 56]), the target is: 57
When the input is tensor([18, 47, 56, 57]), the target is: 58
When the input is tensor([18, 47, 56, 57, 58]), the target is: 1
When the input is tensor([18, 47, 56, 57, 58,  1]), the target is: 15
When the input is tensor([18, 47, 56, 57, 58,  1, 15]), the target is: 47
When the input is tensor([18, 47, 56, 57, 58,  1, 15, 47]), the target is: 58


Multiple chunks should be stacked into a batch and processed in parallel to make efficient use of the hardware. The chunks in a batch are independent of each other.

In [12]:
torch.manual_seed(1337)
batch_size = 4 # How many independent sequences will we process in parallel
block_size = 8 # What will be the context length for predictions

def get_batch(split):
    # Generate a small batch of data of input x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

print('----')

for b in range(batch_size):
    for t in range(block_size):
        context = xb[b, :t+1]
        target = yb[b, t]
        print(f'When the input is {context.tolist()} the target is {target}') # Printing out the context and targets for each of the batches

inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
----
When the input is [24] the target is 43
When the input is [24, 43] the target is 58
When the input is [24, 43, 58] the target is 5
When the input is [24, 43, 58, 5] the target is 57
When the input is [24, 43, 58, 5, 57] the target is 1
When the input is [24, 43, 58, 5, 57, 1] the target is 46
When the input is [24, 43, 58, 5, 57, 1, 46] the target is 43
When the input is [24, 43, 58, 5, 57, 1, 46, 43] the target is 39
When the input is [44] the target is 53
When the input is [44, 53] the target is 56
When the input is [44, 53, 56] the target is 1
When the input is [44, 53, 56, 1] the target is 58

Implementation of the Bigram Language Model. It relies on the Markov assumption that the next word in the sequence depends only on the preceding word. In this notebook each "word" $w_i$ is a single character token.

The probability of a sequence of words is:

$$
P(w_1, w_2, \dots, w_n) \approx P(w_1) \prod_{i=2}^n P(w_i \mid w_{i-1})
$$

This means:
- $P(w_1)$: The probability of the first word.
- $P(w_i \mid w_{i-1})$: The probability of word $w_i$ given the preceding word $w_{i-1}$.
- The overall sequence probability is the product of the probabilities for each bigram.
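A minimal counting sketch of the formula above, in plain Python. The toy corpus and the helper names (bigram_prob, sequence_prob) are made up for illustration, and $P(w_1)$ is estimated here from simple character frequency, which is one possible choice rather than the only one.

```python
from collections import Counter, defaultdict

corpus = "hello hell help"  # hypothetical toy corpus, characters as tokens

# Count how often each character follows each other character.
pair_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    pair_counts[prev][nxt] += 1

def bigram_prob(prev, nxt):
    """Maximum-likelihood estimate of P(nxt | prev) from the counts."""
    total = sum(pair_counts[prev].values())
    return pair_counts[prev][nxt] / total if total else 0.0

def sequence_prob(seq):
    """P(w_1) * product of P(w_i | w_{i-1}) over the rest of the sequence."""
    p = Counter(corpus)[seq[0]] / len(corpus)  # P(w_1) from unigram frequency
    for prev, nxt in zip(seq, seq[1:]):
        p *= bigram_prob(prev, nxt)
    return p

print(bigram_prob("l", "l"))
print(sequence_prob("hel"))
```

The neural version further down replaces these counts with learnable logits, but the quantity being modelled is the same.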


### **Embedding Table in a Bigram Language Model**
- The **embedding table** is a learnable lookup table of size $vocab\_size \times vocab\_size$.
  - Rows represent the current token ($w_i$).
  - Columns represent logits for predicting the next token ($w_{i+1}$).

#### **How It Works**
1. **Input**:
   - Each token $w_i$ (integer index) selects the corresponding row in the embedding table.
2. **Output**:
   - A vector of size $vocab\_size$, where each value (logit) is the unnormalized log-probability of the corresponding next token.
3. **Training**:
   - The model learns these logits to predict the most likely next token based on bigram relationships in the data.

#### **Key Points**
- It’s conceptually similar to a **Markov chain** transition matrix.
- Logits are converted into probabilities using the **softmax function** during training or inference.
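The lookup-then-softmax idea can be sketched without PyTorch. The three-token vocabulary and the logit values below are invented for illustration; in the real model the table is an nn.Embedding whose values are learned.

```python
import math
import random

vocab = ["a", "b", "c"]  # hypothetical 3-token vocabulary
# table[i][j] is the logit for "token j follows token i" (made-up values).
table = [
    [0.0, 2.0, -1.0],
    [1.0, 0.0, 1.0],
    [-2.0, 0.5, 0.0],
]

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

current = vocab.index("a")
logits = table[current]   # selecting a row is the entire forward pass
probs = softmax(logits)   # probabilities over the next token

random.seed(0)
next_token = random.choices(vocab, weights=probs, k=1)[0]  # sample the next token
print(probs, next_token)
```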


In [13]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)  # Set a fixed random seed for reproducibility

# Define the Bigram Language Model class
class BigramLanguageModel(nn.Module):
    
    def __init__(self, vocab_size):
        super().__init__()
        # Initialize an embedding table of size (vocab_size, vocab_size)
        # Each token in the vocabulary is mapped to a vector of size vocab_size,
        # representing unnormalized logits for predicting the next token.
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # idx: Tensor of shape (B, T), where B is the batch size, T is the sequence length.
        # targets: Tensor of shape (B, T), representing the true next tokens.

        # Lookup the embedding table for each token in idx.
        # logits shape: (B, T, C), where C is the vocab size.
        logits = self.token_embedding_table(idx)

        if targets is None:
            loss = None
        else:
            # Flatten the logits and targets for compatibility with cross-entropy loss:
            # - logits is reshaped from (B, T, C) to (B*T, C).
            # - targets is reshaped from (B, T) to (B*T).
            B, T, C = logits.shape
            logits = logits.view(B * T, C)  # Reshape logits to (batch_size * sequence_length, vocab_size)
            targets = targets.view(B * T)  # Reshape targets to match flattened logits

            # Compute the cross-entropy loss between the logits and targets.
            # Cross-entropy measures how well the predicted logits align with the true targets.
            loss = F.cross_entropy(logits, targets)

        # Return logits and loss
        return logits, loss
    
    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

# Example usage
m = BigramLanguageModel(vocab_size)  # Initialize the model with the vocabulary size
logits, loss = m(xb, yb)  # Forward pass: compute logits and loss for input xb and targets yb
print(logits.shape)  # Shape of logits should be (batch_size * sequence_length, vocab_size)
print(loss)  # The computed loss (scalar)

print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))

torch.Size([32, 65])
tensor(4.8786, grad_fn=<NllLossBackward0>)

SKIcLT;AcELMoTbvZv C?nq-QE33:CJqkOKH-q;:la!oiywkHjgChzbQ?u!3bLIgwevmyFJGUGp
wnYWmnxKWWev-tDqXErVKLgJ
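A quick sanity check on the loss printed above, as a plain-Python sketch: a model that spreads probability evenly over all 65 characters would score a cross-entropy of ln(65). The untrained model comes out higher than that, presumably because its randomly initialised logits are not uniform.

```python
import math

vocab_size = 65  # number of distinct characters found earlier

# A model assigning probability 1/vocab_size to every character has
# cross-entropy -ln(1/vocab_size) = ln(vocab_size) per prediction.
uniform_loss = -math.log(1.0 / vocab_size)
print(f"{uniform_loss:.4f}")
```

Training should eventually push the loss below this uniform baseline.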


For now the model only uses the last character: generate passes the whole context in, but a bigram model predicts from the final time step alone

In [14]:
# creating a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-4)

In [15]:
batch_size = 32
for steps in range(100000):

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    
    print(loss.item())

4.692410945892334
4.6652069091796875
4.767594814300537
4.709383010864258
4.599318027496338
4.71548318862915
4.719142913818359
4.6942458152771
4.708678245544434
4.7276105880737305
4.7265944480896
4.695520877838135
4.7581658363342285
4.749091148376465
4.680098533630371
4.601218223571777
4.731154441833496
4.690022945404053
4.733815670013428
4.764334678649902
4.650589466094971
4.7303080558776855
4.694482326507568
4.6067280769348145
4.764771461486816
4.70198917388916
4.83330774307251
4.77999210357666
4.721240043640137
4.633662700653076
4.752765655517578
4.776197910308838
4.641935348510742
4.698157787322998
4.770230770111084
4.774563312530518
4.726474285125732
4.681177139282227
4.778504848480225
4.751583576202393
4.782617568969727
4.736804008483887
4.766269683837891
4.798072814941406
4.613471508026123
4.692795276641846
4.746232032775879
4.858419895172119
4.625951766967773
4.768550395965576
4.5632429122924805
4.654340744018555
4.719803333282471
4.769106864929199
4.796699523925781
4.8703618049

Testing the optimized model

In [16]:
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=300)[0].tolist()))



PURY: ht IUS:
S:

INGRLI enje ssutefrsighe! t cl igagimous pray whars:
Panalit I It aithit terised thevermenghau buaror VOubed spo mng as chathab llll:
Ware,

ee her,
Thooured aly y hindr's.
Fashat-owhrees s, share hathure Anfaneof f s llon!

ICLiroushanot

Then
MOMpewon gss, be jestrty

AROUFLAm, 


In [None]:
!python -c "import torch; print(torch.__version__)"

2.5.1


The mathematical trick in self-attention

In [18]:
# considering the following toy example:

torch.manual_seed(1337)
B, T, C = 4, 8, 2 # batch, time, channels
x = torch.randn(B, T, C)
x.shape

torch.Size([4, 8, 2])

In [None]:
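# We want xbow[b, t] = mean_{i<=t} x[b, i]
xbow = torch.zeros((B, T, C)) # bow = "bag of words": an average that ignores token order
for b in range(B):
    for t in range(T):
        xprev = x[b, :t+1] # (t+1, C): tokens 0..t of batch element b
        xbow[b, t] = torch.mean(xprev, 0) # average over time; compare x[0] with xbow[0] to check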

The same averages can be calculated with a single matrix multiplication, using a lower-triangular matrix whose rows are normalised to sum to 1

In [24]:
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x # (T, T) @ (B, T, C): wei is broadcast over the batch ---> (B, T, C)
torch.allclose(xbow, xbow2)

True

In [23]:
torch.tril(torch.ones(3, 3))

tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])

In [22]:
torch.manual_seed(42)
a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, 1, keepdim=True)
b = torch.randint(0, 10, (3, 2)).float()
c = a @ b
print('a=')
print(a)
print('b=')
print(b)
print('c=')
print(c)

a=
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
c=
tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])


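Third way to calculate the averages: softmax over masked scores. This is the form self-attention uses, because the zero scores can later be replaced by data-dependent affinities.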

In [25]:
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T, T))
wei = wei.masked_fill(tril == 0, float('-inf')) # Mask future positions: each token attends only to itself and earlier tokens
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x
torch.allclose(xbow, xbow3)

True

In [33]:
wei[0]

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
        [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
        [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
        [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],
       grad_fn=<SelectBackward0>)

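Self-attention (adding keys and queries). It is called self-attention because the keys, queries and values all come from the same source, x. Difference between decoder and encoder blocks: a decoder block applies the triangular mask so tokens cannot attend to the future, while an encoder block omits the mask so every token can attend to every other token.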

Definitions:
1. Head refers to an independent set of query (q), key (k), and value (v) computations within a multi-head attention mechanism
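The encoder/decoder difference comes down to whether the mask is applied before the softmax. A plain-Python sketch with made-up affinity scores for a 3-token sequence (the values and variable names are assumptions for illustration):

```python
import math

def softmax(row):
    """Softmax over one row; exp(-inf) evaluates to 0.0, so masked entries vanish."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up raw affinities (query . key) for a sequence of 3 tokens.
scores = [
    [0.5, 1.0, -0.3],
    [0.2, 0.1, 0.9],
    [1.2, -0.4, 0.3],
]

# Encoder-style: no mask, every position attends to every position.
encoder_weights = [softmax(row) for row in scores]

# Decoder-style: position t may only attend to positions j <= t.
neg_inf = float("-inf")
masked = [
    [s if j <= t else neg_inf for j, s in enumerate(row)]
    for t, row in enumerate(scores)
]
decoder_weights = [softmax(row) for row in masked]

print(encoder_weights[0])
print(decoder_weights[0])
```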

In [30]:
torch.manual_seed(1337)
B, T, C = 4, 8, 32
x = torch.randn(B, T, C)

# single head of self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)
q = query(x)
wei = q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)

tril = torch.tril(torch.ones(T, T))
# wei = torch.zeros((T, T))
wei = wei.masked_fill(tril == 0, float('-inf')) # Mask future positions: each token attends only to itself and earlier tokens
wei = F.softmax(wei, dim=-1)

v = value(x)
out = wei @ v
# out = wei @ x

out.shape

torch.Size([4, 8, 16])

In [31]:
v[0]

tensor([[-0.1571,  0.8801,  0.1615, -0.7824, -0.1429,  0.7468,  0.1007, -0.5239,
         -0.8873,  0.1907,  0.1762, -0.5943, -0.4812, -0.4860,  0.2862,  0.5710],
        [ 0.8321, -0.8144, -0.3242,  0.5191, -0.1252, -0.4898, -0.5287, -0.0314,
          0.1072,  0.8269,  0.8132, -0.0271,  0.4775,  0.4980, -0.1377,  1.4025],
        [ 0.6035, -0.2500, -0.6159,  0.4068,  0.3328, -0.3910,  0.1312,  0.2172,
         -0.1299, -0.8828,  0.1724,  0.4652, -0.4271, -0.0768, -0.2852,  1.3875],
        [ 0.6657, -0.7096, -0.6099,  0.4348,  0.8975, -0.9298,  0.0683,  0.1863,
          0.5400,  0.2427, -0.6923,  0.4977,  0.4850,  0.6608,  0.8767,  0.0746],
        [ 0.1536,  1.0439,  0.8457,  0.2388,  0.3005,  1.0516,  0.7637,  0.4517,
         -0.7426, -1.4395, -0.4941, -0.3709, -1.1819,  0.1000, -0.1806,  0.5129],
        [-0.8920,  0.0578, -0.3350,  0.8477,  0.3876,  0.1664, -0.4587, -0.5974,
          0.4961,  0.6548,  0.0548,  0.9468,  0.4511,  0.1200,  1.0573, -0.2257],
        [-0.4849,  0.1

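Note: scaling the attention scores by 1/sqrt(head_size) before the softmax is important. Without it the scores have variance on the order of head_size, and softmax over large values saturates towards a one-hot vector, overemphasising the biggest score.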
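A small plain-Python sketch of that saturation effect. The score values are made up; multiplying them by sqrt(head_size) stands in for what the raw q @ k products would look like without the scaling.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

head_size = 16  # same head size as in the attention cell
scaled = [0.1, -0.2, 0.3, -0.2, 0.5]                   # made-up, well-scaled scores
unscaled = [v * math.sqrt(head_size) for v in scaled]  # roughly what skipping the scaling gives

peak_scaled = max(softmax(scaled))
peak_unscaled = max(softmax(unscaled))
print(peak_scaled, peak_unscaled)
```

The largest weight is noticeably bigger in the unscaled case, meaning each token would draw information from fewer positions.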