If I'm going to train a model it should be interesting. 


It'll use this [dataset](https://www.kaggle.com/datasets/rtatman/state-of-the-union-corpus-1989-2017?select=Eisenhower_1960.txt) of State of the Union addresses from 1790 to 2018. 

Let's make a politician :)

In [2]:
# Make sure to run preprocessing.py first

with open('dataset.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [3]:
print(f"Number of characters in dataset: {len(text)}")

Number of characters in dataset: 10602521


In [4]:
# See first 200 characters
print(text[:200])

Gentlemen of the Senate and Gentlemen of the House of Representatives:

I was for some time apprehensive that it would be necessary, on account of
the contagious sickness which afflicted the city of P


In [5]:
# Prints all unique chars in the dataset in order 
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(f"All unique characters in this dataset: {''.join(chars)}\n")
print(f"Number of unique characters: {vocab_size}")

All unique characters in this dataset: 	
 !"$%&'()*+,-./0123456789:;=?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]_`abcdefghijklmnopqrstuvwxyz

Number of unique characters: 89


In [6]:
# Map characters to integers and vice versa
# There are exactly as many mappings as there are unique chars (vocab_size)
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }

# Encoder: str -> int. Decoder: int -> str
# stoi: mapping to encode str -> int. itos: mapping to reverse map int -> str.
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

encoded = encode("Words")
print(f"Encoded string {encoded}\n")

decoded = decode(encoded)
print(f"Decoded string: {decoded}")

Encoded string [54, 77, 80, 66, 81]

Decoded string: Words


Note from video: there are many other tokenizers, e.g. SentencePiece (Google) and tiktoken (OpenAI). They vary in how finely they split text (characters, subword pieces, or whole words) and in how they map the pieces to integers.
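
To make the granularity point concrete, here is a minimal sketch contrasting the character-level mapping used in this notebook with a naive whitespace word-level split. The names `build_vocab`, `char_tokens` and `word_tokens` are made up for illustration, and real subword tokenizers such as SentencePiece or tiktoken sit somewhere between these two extremes and work differently.

```python
def build_vocab(tokens):
    """Map each unique token to an integer id, in sorted order."""
    return {tok: i for i, tok in enumerate(sorted(set(tokens)))}

sample = "the state of the union"

# Character level: tiny vocabulary, long sequences.
char_tokens = list(sample)
char_vocab = build_vocab(char_tokens)
char_ids = [char_vocab[t] for t in char_tokens]

# Word level (naive whitespace split): larger vocabulary on real text, short sequences.
word_tokens = sample.split()
word_vocab = build_vocab(word_tokens)
word_ids = [word_vocab[t] for t in word_tokens]

print(f"chars: vocab={len(char_vocab)}, sequence length={len(char_ids)}")
print(f"words: vocab={len(word_vocab)}, sequence length={len(word_ids)}")
```

The trade-off is vocabulary size against sequence length: finer pieces mean fewer distinct ids but more of them per sentence.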

In [8]:
import torch 

# Torch tensor creates an array with the encoded dataset.
# A tensor is basically an array that supports GPU computing
# It's faster for ML/AI and torch comes with configs for this specifically
data = torch.tensor(encode(text), dtype=torch.long)

print(data.shape, data.dtype)
print(data[:200])



torch.Size([10602521]) torch.int64
tensor([38, 67, 76, 82, 74, 67, 75, 67, 76,  2, 77, 68,  2, 82, 70, 67,  2, 50,
        67, 76, 63, 82, 67,  2, 63, 76, 66,  2, 38, 67, 76, 82, 74, 67, 75, 67,
        76,  2, 77, 68,  2, 82, 70, 67,  2, 39, 77, 83, 81, 67,  2, 77, 68,  2,
        49, 67, 78, 80, 67, 81, 67, 76, 82, 63, 82, 71, 84, 67, 81, 27,  1,  1,
        40,  2, 85, 63, 81,  2, 68, 77, 80,  2, 81, 77, 75, 67,  2, 82, 71, 75,
        67,  2, 63, 78, 78, 80, 67, 70, 67, 76, 81, 71, 84, 67,  2, 82, 70, 63,
        82,  2, 71, 82,  2, 85, 77, 83, 74, 66,  2, 64, 67,  2, 76, 67, 65, 67,
        81, 81, 63, 80, 87, 13,  2, 77, 76,  2, 63, 65, 65, 77, 83, 76, 82,  2,
        77, 68,  1, 82, 70, 67,  2, 65, 77, 76, 82, 63, 69, 71, 77, 83, 81,  2,
        81, 71, 65, 73, 76, 67, 81, 81,  2, 85, 70, 71, 65, 70,  2, 63, 68, 68,
        74, 71, 65, 82, 67, 66,  2, 82, 70, 67,  2, 65, 71, 82, 87,  2, 77, 68,
         2, 47])


In [None]:
n = int(0.9 * len(data))

# Use the first 90% of the data for training; the remaining 10% is held out for validation
train_data = data[:n]
val_data = data[n:]

print(f"The model will be trained with {n} characters, {len(data) - n} will be set aside for validation")

The model will be trained with 9542268 characters, 1060253 will be set aside for validation


In [None]:
# Training Params

# block_size: how many characters each training example contains (the context length)
# It matters because it is the upper bound on how much context the transformer can condition on
# Given anything longer than this, it will struggle to predict what is **expected** to come next, because it never saw contexts that long during training
block_size = 8
train_data[:block_size+1]


tensor([38, 67, 76, 82, 74, 67, 75, 67, 76])

In [None]:
x = train_data[:block_size]
y = train_data[1:block_size+1]

# x holds the first block_size tokens
# y is the same window shifted one position to the left (the slice starts at index 1, so it skips index 0)
print(x)
print(y)

# This creates a relationship where the i-th x element precedes the i-th y element 

for t in range(block_size):
    context = x[:t+1] 
    target = y[t]  
    print(f"Input {context} | Target {target}")

# This loop prints x up to and including position t (the context)
# and y[t], the token at position t + 1 of the original sequence, i.e. the one that comes next


tensor([38, 67, 76, 82, 74, 67, 75, 67])
tensor([67, 76, 82, 74, 67, 75, 67, 76])
Input tensor([38]) | Target 67
Input tensor([38, 67]) | Target 76
Input tensor([38, 67, 76]) | Target 82
Input tensor([38, 67, 76, 82]) | Target 74
Input tensor([38, 67, 76, 82, 74]) | Target 67
Input tensor([38, 67, 76, 82, 74, 67]) | Target 75
Input tensor([38, 67, 76, 82, 74, 67, 75]) | Target 67
Input tensor([38, 67, 76, 82, 74, 67, 75, 67]) | Target 76


In [None]:
torch.manual_seed(1234)
batch_size = 4 # How many independent sequences to process in parallel per training step
block_size = 8 # The maximum context length for each prediction

def get_batch(split):
    data = train_data if split == 'train' else val_data
    # Creates random offsets to sample blocks from 
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    # Y is the tensor that is offset by one for prediction
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    # torch.stack stacks the 1-D blocks as rows, so x and y have shape (batch_size, block_size)
    return x, y

xb, yb = get_batch('train')

print("Inputs:")
print(xb.shape)
print(xb)
print("Targets")
print(yb.shape)
print(yb)

for b in range(batch_size):
    for t in range(block_size):
        context = xb[b,:t+1]
        target = yb[b,t]
        print(f"Input {context} | Target {target}")

Inputs:
torch.Size([4, 8])
tensor([[70, 80, 67, 67,  2, 69, 83, 76],
        [66,  2, 71, 76, 15,  2, 51, 70],
        [82, 70, 67, 80, 74, 63, 76, 66],
        [ 2, 82, 70, 67,  2, 64, 71, 69]])
Targets
torch.Size([4, 8])
tensor([[80, 67, 67,  2, 69, 83, 76, 64],
        [ 2, 71, 76, 15,  2, 51, 70, 67],
        [70, 67, 80, 74, 63, 76, 66, 81],
        [82, 70, 67,  2, 64, 71, 69, 69]])
Input tensor([70]) | Target 80
Input tensor([70, 80]) | Target 67
Input tensor([70, 80, 67]) | Target 67
Input tensor([70, 80, 67, 67]) | Target 2
Input tensor([70, 80, 67, 67,  2]) | Target 69
Input tensor([70, 80, 67, 67,  2, 69]) | Target 83
Input tensor([70, 80, 67, 67,  2, 69, 83]) | Target 76
Input tensor([70, 80, 67, 67,  2, 69, 83, 76]) | Target 64
Input tensor([66]) | Target 2
Input tensor([66,  2]) | Target 71
Input tensor([66,  2, 71]) | Target 76
Input tensor([66,  2, 71, 76]) | Target 15
Input tensor([66,  2, 71, 76, 15]) | Target 2
Input tensor([66,  2, 71, 76, 15,  2]) | Target 51
Input

Some notes so far:

The core of the code above is building two views of the encoded training data: the input (X), which is a block of the original sequence, and the target (Y), which is the same block offset by one position.

The offset tensor is used as the 'target'. The transformer will see contexts of every length from 1 up to block_size taken from X, each paired with the next token from Y.

This makes sense because the i-th element of Y is the token that comes right after the i-th token of X **in the original array**. 

We will adjust the weights to lower the loss, which measures how little probability the model's predictions put on the correct next token. 


In [43]:
import torch 
import torch.nn as nn
from torch.nn import functional as F

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss 

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            logits, loss = self(idx)
            logits = logits[:, -1, :]
            probs = F.softmax(logits, dim=1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx

m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)

print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))


torch.Size([256, 89])
tensor(5.0302, grad_fn=<NllLossBackward0>)
	MxRl&__ESAn
)vay1V&fQ
lWJ]Ep/0r=A-@Wcx7:U!GzYuatrUUYj](IUG`-u1F2@8Ordz+];dugM[o\=Lp%:agj7CP6r5T	?hlR


Notes on Bigram:

Not sure how this works right now. Should watch Karpathy's videos on that.

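What I do know is that cross-entropy measures how far the predicted probabilities are from the target (or, in the terms of The Demon in the Machine, the randomness): it is the negative log of the probability the model assigned to the correct next character, so a confident wrong prediction costs a lot and a confident right one costs almost nothing. 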

# Todo
- softmax = ?
- cross-entropy = ?
- B, T, C and re-shaping = ?
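
A first pass at the three Todo items, as a minimal pure-Python sketch and not what PyTorch does internally. The helper names `softmax` and `cross_entropy` and the toy sizes are assumptions made for illustration. Softmax turns raw scores (logits) into probabilities, cross-entropy is the negative log of the probability given to the correct class, and the reshaping flattens (B, T, C) into (B*T, C) because `F.cross_entropy` expects the class dimension to be the second one.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log of the probability assigned to the correct class."""
    return -math.log(softmax(logits)[target])

# A toy (B, T, C) batch of logits: 2 sequences, 3 positions, 4 "characters".
B, T, C = 2, 3, 4
logits = [[[0.0] * C for _ in range(T)] for _ in range(B)]
targets = [[1, 2, 3], [0, 0, 1]]

# Flatten (B, T, C) -> (B*T, C) and (B, T) -> (B*T,), like .view() in the model.
flat_logits = [row for seq in logits for row in seq]
flat_targets = [t for seq in targets for t in seq]

# Mean loss over all B*T positions.
loss = sum(cross_entropy(l, t) for l, t in zip(flat_logits, flat_targets)) / (B * T)
print(loss, math.log(C))
```

With all-zero logits every class gets probability 1/C, so the loss equals ln(C). By the same reasoning, a model that spread its probability evenly over 89 characters would score about ln(89) ≈ 4.49.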

In [44]:
# Adam pytorch optimizer. Used for gradient descent
optimizer = torch.optim.Adam(m.parameters(), lr=1e-3)

The code cell below uses Adam as the optimizer to lower the loss between the model's next-token predictions for X and the actual next tokens in Y. No attention is involved yet: the bigram model predicts each token from the single token before it.

# Todo:
- zero_grad = ?
- loss.backward = ?
- how does .step() do what it does = ? 
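
A rough sketch of what the three calls do, using a single hand-differentiated parameter instead of PyTorch. It assumes plain gradient descent; Adam additionally keeps running averages of the gradient and its square to scale each update, which is not shown here. The variable names are made up for illustration.

```python
# One parameter w, one training example (x, y), loss = (w * x - y) ** 2.
x, y = 2.0, 6.0
w = 0.0
lr = 0.1          # learning rate
grad = 0.0        # gradient buffer, playing the role of a parameter's .grad

losses = []
for step in range(20):
    loss = (w * x - y) ** 2
    losses.append(loss)

    grad = 0.0                        # "zero_grad": drop the previous step's gradient
    grad += 2 * (w * x - y) * x       # "backward": accumulate d(loss)/dw into the buffer
    w -= lr * grad                    # "step": move the parameter against the gradient

print(w, losses[0], losses[-1])
```

The `+=` is why zeroing matters: PyTorch also accumulates into `.grad` on every `backward()` call, so without resetting, each step would mix in gradients from earlier batches.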

In [None]:
batch_size = 32

for steps in range(10000):

    # Fetch some data
    xb, yb = get_batch('train')

    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    # Compute the gradient of the loss with respect to every parameter
    loss.backward()
    # Adam uses those gradients to update the parameters
    optimizer.step()

print(loss.item())


2.47403621673584


In [56]:
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=500)[0].tolist()))

	EGofueses ofe-y f
mbutereconccongrs 1 thesthefomywhered twourkst ad nancacillicy stid gthay
wecit d ticanaichove alicil Whinde, od tins thec r iorcthad, l Se pass.

Gede
boncy athes o tombjo urwislthioptherte bctoded on
pe

pr toavied n Brthe beg witha, opubys f

ise as beche dof

s, ontatofis latory bis
ote fl in. m, on
ben prar fece It ss wsufutr theshagrench d pin Corslththakexionorizerainse w to harctr Moterem sar inde rucomoflutifrcomede

t
us rere t as. tthe be Janorsiospa thansllingehit t


# Note on attention

For a token to pay 'attention' to other tokens, we need to relate them.

It would be illogical to let the token at index i look at the future token at index i + 1 when the goal is to predict the token at index i + 1: the answer would leak into the input.

Instead, autoregressive models ought to do this using only the current and previous tokens, so the question becomes: 

how do we relate a set of tokens in a meaningful way to express what the next token ought to be?

Self-attention is how each position, including the last token in the sequence, gets tied to the tokens before it. 


For example, we may average all tokens up to and including the token of interest to capture a crude version of their 'relationship' with each other. 

In [65]:
torch.manual_seed(1337)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)
x.shape


torch.Size([4, 8, 2])

In [None]:
# xbow = "bag of words": xbow[b, t] is the mean of x[b, 0..t]
xbow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xprev = x[b, :t+1]
        xbow[b, t] = xprev.mean(dim=0)

In [78]:
x[0]

tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]])

In [79]:
xbow[0]

tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])

In the example above, the resulting tensor follows the sequence:

a(1), a(1,2), a(1,2,3), a(1,2,3,4), ...
where a = average

So, each element is the average of itself and all the elements before it.

At each position, this gives a crude summary of the context formed by the tokens so far.


In [None]:
torch.manual_seed(42)

a = torch.ones(3, 3)
b = torch.randint(0, 10, (3, 2)).float()
# @ is used for matrix multiplication. Equivalent to torch.matmul(a, b)
c = a @ b

# Refresher on matmuls:
# c[n, m] is the dot product of the n-th row of a and the m-th column of b: multiply corresponding elements and sum them

print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)


a=
tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]])
--
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
--
c=
tensor([[14., 16.],
        [14., 16.],
        [14., 16.]])
torch.Size([3, 2])


In [71]:
torch.manual_seed(42)

a = torch.tril(torch.ones(3, 3))
b = torch.randint(0, 10, (3, 2)).float()
# @ is used for matrix multiplication. Equivalent to torch.matmul(a, b)
c = a @ b

# Refresher on matmuls:
# c[n, m] is the dot product of the n-th row of a and the m-th column of b: multiply corresponding elements and sum them

print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)


a=
tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])
--
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
--
c=
tensor([[ 2.,  7.],
        [ 8., 11.],
        [14., 16.]])


Note on output:

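For the first row of c, the product is 

1 * 2 + 0 * 6 + 0 * 6 = 2 
and 
1 * 7 + 0 * 4 + 0 * 5 = 7

For the second row of c:

1 * 2 + 1 * 6 + 0 * 6 = 8
and
1 * 7 + 1 * 4 + 0 * 5 = 11

For the third row of c:

1 * 2 + 1 * 6 + 1 * 6 = 14
and
1 * 7 + 1 * 4 + 1 * 5 = 16

Each row of a is multiplied element by element with each column of b and summed, so the zeros in a drop the later rows of b: row t of c is the sum of rows 0 to t of b.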
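
The row-by-row arithmetic can be checked without PyTorch. This is a small sketch with hand-written helpers (`matmul` and `tril_ones` are names invented here, not library functions) that reproduces the same c from the same a and b.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def tril_ones(n):
    """n x n lower-triangular matrix of ones, like torch.tril(torch.ones(n, n))."""
    return [[1.0 if j <= i else 0.0 for j in range(n)] for i in range(n)]

a = tril_ones(3)
b = [[2.0, 7.0], [6.0, 4.0], [6.0, 5.0]]

# Row t of c is the sum of rows 0..t of b, because row t of a has ones
# in columns 0..t and zeros after that.
c = matmul(a, b)
for row in c:
    print(row)
```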


In [None]:
# This can be used to create a matrix that averages the rows of another matrix.

# Dividing each row of a by its sum makes every row sum to exactly one,
# so a @ b becomes a running average instead of a running sum
a = a / torch.sum(a, 1, keepdim=True)

print(a)

tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])


In [82]:
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x 
torch.allclose(xbow,xbow2)


True

Note: this uses a lower-triangular matrix wei, where each row sums to one, to average the x tensor. (x stands in for a batch of training data; here it is the random tensor from above.)

Practically this means row t of the result is a weighted sum of rows 0 to t of x with equal weights, i.e. their average. 


In [83]:
print(xbow[0])
print(xbow2[0])

# The resulting matrices are the same: xbow averages over the time dimension with explicit loops, while xbow2 achieves the same thing with a matrix multiplication by a matrix of averaging weights

tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])
tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])


In [84]:
# There is a third way to accomplish the above, using softmax
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T, T))
# masked_fill(mask, value) writes value wherever mask is True
# Here, every position of wei where tril == 0 (the future tokens) is set to negative infinity
wei = wei.masked_fill(tril == 0, float('-inf'))
# Softmax exponentiates each entry and divides by the row sum: exp(0) = 1 and exp(-inf) = 0,
# so every visible position gets an equal share and every masked position gets 0
# Same result as tril / tril.sum(1, keepdim=True)
wei = F.softmax(wei, dim=-1)
xbow3 = wei @  x
torch.allclose(xbow, xbow3)

True

Note: The triangular matrix matters here because of the earlier setup for next-token prediction, where each target comes one position after its context.

That looks something like this: 

Input tensor([70]) | Target 80
Input tensor([70, 80]) | Target 67
Input tensor([70, 80, 67]) | Target 67
Input tensor([70, 80, 67, 67]) | Target 2
Input tensor([70, 80, 67, 67,  2]) | Target 69
Input tensor([70, 80, 67, 67,  2, 69]) | Target 83
Input tensor([70, 80, 67, 67,  2, 69, 83]) | Target 76
Input tensor([70, 80, 67, 67,  2, 69, 83, 76]) | Target 64

While an averaged tril matrix looks something like this:

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])

Notice the similarity? 

Row t of both has exactly t + 1 active entries: the training context grows by one token per row, and the tril matrix averages over exactly those positions. 

So, if we wanted to average **only** the tokens up to and including the t-th one, we could matmul the block by the corresponding row of the tril matrix.

In short: tril is useful in this context because it acts as a sort of growing window, or causal mask. The zeros after the t-th position prevent future tokens from influencing the current prediction (otherwise the model would see the answer it is supposed to predict instead of learning to predict it), so we only average over the tokens up to and including the t-th one. 
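
To see why the softmax version yields the same averaging weights, here is a small pure-Python sketch. The helper name `masked_softmax_row` and the choice of T = 4 are assumptions for illustration. Scores start at zero, future positions are set to -inf, and softmax then spreads the weight evenly over the positions that remain.

```python
import math

def masked_softmax_row(scores, n_visible):
    """Softmax over one row where positions >= n_visible are masked with -inf."""
    masked = [s if j < n_visible else float("-inf") for j, s in enumerate(scores)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

T = 4
# Row t may look at positions 0..t, so n_visible = t + 1.
wei = [masked_softmax_row([0.0] * T, t + 1) for t in range(T)]
for row in wei:
    print([round(w, 4) for w in row])
```

With non-zero scores the same mask would give unequal weights per row, which is the step from plain averaging to attention.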

In [None]:
# Self-attention! 

torch.manual_seed(1337)
B, T, C = 4, 8, 32 # batch, time, channels
x = torch.randn(B, T, C)

head_size = 16
# nn.Linear creates a weight matrix of shape (out_features, in_features) with random initialization
# When called, it computes: output = input @ weight.T
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
k = key(x)
q = query(x)
wei = q @ k.transpose(-2, -1)

tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
out = wei @ x

out.shape

torch.Size([4, 8, 32])

In [98]:
wei[0]

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
        [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
        [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
        [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],
       grad_fn=<SelectBackward0>)

Note:

Based on the error (the mismatch between what the model predicted with the current weights and the off-by-one sequence (Y) for the original sequence (X)), the weights of the query, key and value projections are adjusted through back-propagation so that the error is lower the next time it predicts. The transposition lines the keys up so that q @ k.transpose(-2, -1) takes the dot product of every query with every key; each result is the degree to which that query and key match.


                    x <- Input matmul
                    |
                    |
                 ---|------------------                    
                /       \              \
             query       key         value
               |         |              |
               -----------              |
                    |                   |
                    |                   |
           r = query @ key(transposed)  |      <- Dot product between query, key
                    |                   |
                    |                   |
                  r.tril                |
                    |                   |
                 softmax(r)             |
                    |                   |
                    |                   |
                    ---------------------
                              |
                            r @ value
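
The diagram can be traced end to end in plain Python. This is a sketch under stated assumptions: a single sequence (no batch dimension), hand-written `matmul`, `transpose` and `softmax` helpers, and identity matrices as hypothetical projection weights so the numbers stay readable. It mirrors the diagram, so it leaves out the division by the square root of head_size that the standard scaled dot-product formulation adds before the softmax.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(x, w_q, w_k, w_v):
    """x: (T, C); w_q, w_k, w_v: (C, head_size). Returns (weights, output)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scores = matmul(q, transpose(k))          # (T, T) query-key dot products
    T = len(x)
    for t in range(T):                        # tril mask: hide future positions
        for j in range(t + 1, T):
            scores[t][j] = float("-inf")
    wei = [softmax(row) for row in scores]    # each row sums to 1
    return wei, matmul(wei, v)                # output has shape (T, head_size)

# Hypothetical toy inputs: T=3 tokens, C=2 channels, head_size=2.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
wei, out = causal_attention(x, identity, identity, identity)
for row in wei:
    print([round(w, 4) for w in row])
```

In the PyTorch cell above, the value branch would presumably be a third `nn.Linear(C, head_size, bias=False)` applied to x, with the output computed from its result instead of from x directly.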



# Todo
- backpropagation = ?