If I'm going to train a model, it should be interesting.


It'll use this [dataset](https://www.kaggle.com/datasets/rtatman/state-of-the-union-corpus-1989-2017?select=Eisenhower_1960.txt) of State of the Union addresses from 1790 to 2018.

Let's make a politician :)

In [2]:
# Make sure to run preprocessing.py first

with open('dataset.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [3]:
print(f"Number of characters in dataset: {len(text)}")

Number of characters in dataset: 10602521


In [4]:
# See first 200 characters
print(text[:200])

Gentlemen of the Senate and Gentlemen of the House of Representatives:

I was for some time apprehensive that it would be necessary, on account of
the contagious sickness which afflicted the city of P


In [5]:
# Prints all unique chars in the dataset in order 
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(f"All unique characters in this dataset: {''.join(chars)}\n")
print(f"Number of unique characters: {vocab_size}")

All unique characters in this dataset: 	
 !"$%&'()*+,-./0123456789:;=?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]_`abcdefghijklmnopqrstuvwxyz

Number of unique characters: 89


In [6]:
# Map characters to integers and vice versa
# One mapping entry per unique character in the dataset
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }

# Encoder: str -> int. Decoder: int -> str
# stoi: mapping to encode str -> int. itos: mapping to reverse map int -> str.
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

encoded = encode("Words")
print(f"Encoded string {encoded}\n")

decoded = decode(encoded)
print(f"Decoded string: {decoded}")

Encoded string [54, 77, 80, 66, 81]

Decoded string: Words


Note from video: there are many other tokenizers, e.g. SentencePiece (Google) and tiktoken (OpenAI). They differ in how finely they split a word into pieces and in how they map those pieces to integers.
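
To make the granularity trade-off concrete, here is a toy comparison of character-level and word-level tokenization on one short string. It is only a sketch: the sample string and the whitespace split are my own assumptions, and this is not how SentencePiece or tiktoken work internally (they use subword pieces, which sit somewhere between these two extremes).

```python
# Toy comparison of tokenization granularity (illustrative only).
sample = "the state of the union"  # made-up sample, not taken from the dataset

# Character-level: every distinct character gets an id (as in this notebook).
char_vocab = sorted(set(sample))
char_stoi = {ch: i for i, ch in enumerate(char_vocab)}
char_ids = [char_stoi[ch] for ch in sample]

# Word-level: every distinct whitespace-separated word gets an id.
words = sample.split()
word_vocab = sorted(set(words))
word_stoi = {w: i for i, w in enumerate(word_vocab)}
word_ids = [word_stoi[w] for w in words]

print(f"char-level: vocab={len(char_vocab)}, sequence length={len(char_ids)}")
print(f"word-level: vocab={len(word_vocab)}, sequence length={len(word_ids)}")
```

On a full corpus the pattern is usually the opposite of what a tiny string suggests for vocabulary size: a character vocabulary stays small (89 here) while a word vocabulary grows very large, but character-level sequences are always much longer.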

In [8]:
import torch 

# torch.tensor builds a tensor from the encoded dataset.
# A tensor is basically a multi-dimensional array that can also run on the GPU,
# which makes the numeric work in ML/AI training much faster.
data = torch.tensor(encode(text), dtype=torch.long)

print(data.shape, data.dtype)
print(data[:200])



torch.Size([10602521]) torch.int64
tensor([38, 67, 76, 82, 74, 67, 75, 67, 76,  2, 77, 68,  2, 82, 70, 67,  2, 50,
        67, 76, 63, 82, 67,  2, 63, 76, 66,  2, 38, 67, 76, 82, 74, 67, 75, 67,
        76,  2, 77, 68,  2, 82, 70, 67,  2, 39, 77, 83, 81, 67,  2, 77, 68,  2,
        49, 67, 78, 80, 67, 81, 67, 76, 82, 63, 82, 71, 84, 67, 81, 27,  1,  1,
        40,  2, 85, 63, 81,  2, 68, 77, 80,  2, 81, 77, 75, 67,  2, 82, 71, 75,
        67,  2, 63, 78, 78, 80, 67, 70, 67, 76, 81, 71, 84, 67,  2, 82, 70, 63,
        82,  2, 71, 82,  2, 85, 77, 83, 74, 66,  2, 64, 67,  2, 76, 67, 65, 67,
        81, 81, 63, 80, 87, 13,  2, 77, 76,  2, 63, 65, 65, 77, 83, 76, 82,  2,
        77, 68,  1, 82, 70, 67,  2, 65, 77, 76, 82, 63, 69, 71, 77, 83, 81,  2,
        81, 71, 65, 73, 76, 67, 81, 81,  2, 85, 70, 71, 65, 70,  2, 63, 68, 68,
        74, 71, 65, 82, 67, 66,  2, 82, 70, 67,  2, 65, 71, 82, 87,  2, 77, 68,
         2, 47])


In [None]:
n = int(0.9 * len(data))

# Use the first 90% of the data for training and the remaining 10% for validation
train_data = data[:n]
val_data = data[n:]

print(f"The model will be trained with {n} characters, {len(data) - n} will be set aside for validation")

The model will be trained with 9542268 characters, 1060253 will be set aside for validation


In [None]:
# Training Params

# block_size: how many characters each training sequence contains (the context length)
# It matters because it is the upper bound on how much context the transformer conditions on.
# If it is given anything longer than this, it may struggle to predict what comes next,
# because it never saw contexts that long during training.
block_size = 8
train_data[:block_size+1]


tensor([38, 67, 76, 82, 74, 67, 75, 67, 76])
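
One common way to respect the block_size limit when a prompt is longer than the model's context is to keep only the most recent block_size tokens. A minimal sketch, assuming a hypothetical helper `crop_context` and a stand-in list of ids (neither is part of this notebook):

```python
block_size = 8  # same value as in the cell above


def crop_context(token_ids, block_size):
    """Keep only the last `block_size` ids (hypothetical helper)."""
    return token_ids[-block_size:]


prompt_ids = list(range(12))  # stand-in for an encoded prompt of 12 tokens
context = crop_context(prompt_ids, block_size)
print(context)  # the 8 most recent ids; the first 4 are dropped
```

Inputs shorter than block_size pass through unchanged, so the same helper works at the start of generation.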

In [None]:
x = train_data[:block_size]
y = train_data[1:block_size+1]

# x is the first block_size tokens of the training data
# y is x shifted one position to the left (starting the slice at 1 skips index 0)

# This creates a relationship where y[i] is the token that comes right after x[i]

for t in range(block_size):
    context = x[:t+1] 
    target = y[t]  
    print(f"Input {context} | Target {target}")

# This loop prints x up to and including position t (the context)
# and y[t], the token at position t + 1 of the data, which comes right after that context


Input tensor([38]) | Target 67
Input tensor([38, 67]) | Target 76
Input tensor([38, 67, 76]) | Target 82
Input tensor([38, 67, 76, 82]) | Target 74
Input tensor([38, 67, 76, 82, 74]) | Target 67
Input tensor([38, 67, 76, 82, 74, 67]) | Target 75
Input tensor([38, 67, 76, 82, 74, 67, 75]) | Target 67
Input tensor([38, 67, 76, 82, 74, 67, 75, 67]) | Target 76


In [None]:
torch.manual_seed(1234)
batch_size = 4 # How many independent sequences to process in parallel in each training step
block_size = 8 # The maximum context length for each prediction

def get_batch(split):
    data = train_data if split == 'train' else val_data
    # Creates random offsets to sample blocks from 
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    # Y is the tensor that is offset by one for prediction
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    # torch.stack stacks the sampled blocks as rows, giving tensors of shape (batch_size, block_size)
    return x, y

xb, yb = get_batch('train')

print("Inputs:")
print(xb.shape)
print(xb)
print("Targets")
print(yb.shape)
print(yb)

for b in range(batch_size):
    for t in range(block_size):
        context = xb[b,:t+1]
        target = yb[b,t]
        print(f"Input {context} | Target {target}")

Inputs:
torch.Size([4, 8])
tensor([[70, 80, 67, 67,  2, 69, 83, 76],
        [66,  2, 71, 76, 15,  2, 51, 70],
        [82, 70, 67, 80, 74, 63, 76, 66],
        [ 2, 82, 70, 67,  2, 64, 71, 69]])
Targets
torch.Size([4, 8])
tensor([[80, 67, 67,  2, 69, 83, 76, 64],
        [ 2, 71, 76, 15,  2, 51, 70, 67],
        [70, 67, 80, 74, 63, 76, 66, 81],
        [82, 70, 67,  2, 64, 71, 69, 69]])
Input tensor([70]) | Target 80
Input tensor([70, 80]) | Target 67
Input tensor([70, 80, 67]) | Target 67
Input tensor([70, 80, 67, 67]) | Target 2
Input tensor([70, 80, 67, 67,  2]) | Target 69
Input tensor([70, 80, 67, 67,  2, 69]) | Target 83
Input tensor([70, 80, 67, 67,  2, 69, 83]) | Target 76
Input tensor([70, 80, 67, 67,  2, 69, 83, 76]) | Target 64
Input tensor([66]) | Target 2
Input tensor([66,  2]) | Target 71
Input tensor([66,  2, 71]) | Target 76
Input tensor([66,  2, 71, 76]) | Target 15
Input tensor([66,  2, 71, 76, 15]) | Target 2
Input tensor([66,  2, 71, 76, 15,  2]) | Target 51
...

Some notes so far:

The core of the code above is turning the encoded training set into two views of the same block: one that is the same as the original (X) and one that is offset by 1 (Y).

The offset-by-1 tensor is used as the target. The transformer sees context/target pairs of every length from 1 up to block_size, with X as the input and Y as the target. The weights are then adjusted to lower the loss, which depends on how accurate the model's guesses are.
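
The same X/Y relationship is easier to read on raw characters than on encoded ids. Below is a minimal sketch that uses the opening words of the dataset as a stand-in string and plain Python slicing instead of tensors; the variable names `pairs` and `examples_per_batch` are mine, not from the notebook.

```python
text = "Gentlemen of the Senate"  # opening words of the dataset, used as a stand-in
block_size = 8
batch_size = 4

x = text[:block_size]        # the block itself
y = text[1:block_size + 1]   # the same block shifted left by one character

# Each prefix of x is a context; the character of y at the same index is its target.
pairs = [(x[:t + 1], y[t]) for t in range(block_size)]
for context, target in pairs:
    print(f"Input {context!r} | Target {target!r}")

# Every row of a batch yields block_size training examples.
examples_per_batch = batch_size * block_size
print(f"Examples per batch: {examples_per_batch}")
```

So a single (4, 8) batch packs 32 separate prediction problems, which is why one forward pass can train on contexts of every length at once.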
