This implementation follows the lecture by Andrej Karpathy.
https://www.youtube.com/watch?v=kCc8FmEb1nY

The concepts are explored in more depth in 3Blue1Brown's series on neural networks
https://www.youtube.com/watch?v=aircAruvnKk


TODO Use the OpenWebText dataset as the input data. This is 38GB worth of text. It may need some preprocessing, such as removing special characters and lowercasing everything.

dataset = load_dataset("Skylion007/openwebtext")

https://paperswithcode.com/dataset/openwebtext
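
The preprocessing mentioned in the TODO could look roughly like the sketch below. It is only an illustration: `normalize_text` is a hypothetical helper, and the set of characters it keeps is an assumption, not a decision made in this notebook.

```python
import re

def normalize_text(raw: str) -> str:
    """Lowercase the text and drop characters outside a small allowed set (hypothetical helper)."""
    lowered = raw.lower()
    # keep lowercase letters, digits, whitespace, and basic punctuation; drop everything else
    cleaned = re.sub(r"[^a-z0-9\s.,;:!?'\"-]", "", lowered)
    # collapse runs of spaces and tabs that the removal may leave behind
    return re.sub(r"[ \t]+", " ", cleaned).strip()

print(normalize_text("Hello,   Wörld! © 2024"))
```

Note that this also drops accented letters such as "ö", which may or may not be wanted for a web-scale dataset.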

TODO Try using a better tokenizer.

### Imports

In [1]:
import torch
import torch.nn as nn
from torch.nn import functional as F

### Read and prepare dataset

In [2]:
# with open("alice.txt", encoding="utf-8") as f:
#     text = f.read()

with open("shakespeare.txt", encoding="utf-8") as f:
    text = f.read()

print("length of dataset:", len(text))

chars = sorted(list(set(text)))
vocab_size = len(chars) # note that capital and small letters are treated as different characters
print("length of vocabulary:", vocab_size)
print("vocabulary:", ''.join(chars))


length of dataset: 1115394
length of vocabulary: 65
vocabulary: 
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz


### Build encoder and decoder
The encoder's job is to translate the characters of the vocabulary into integers.
The decoder's job is to reverse this encoding, turning the integers back into the original characters.

Encoders can follow different schemes; popular implementations are tiktoken (OpenAI, ChatGPT) and SentencePiece (Google). These are subword encoders: they don't simply map each unique word to one token, so a word can be split into several tokens. Their vocabularies are much larger than a character set, which means a sentence is broken down into a comparatively short sequence of integers.

For intuition, this implementation uses a simple encoder that encodes per character. The vocabulary is small (65 tokens here), but every text becomes a long sequence of integers.
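
The trade-off between vocabulary size and sequence length can be seen in a toy comparison. The sketch below is not how tiktoken or SentencePiece work internally; `greedy_subword_encode` is a hypothetical greedy longest-match tokenizer over a hand-written vocabulary, used only to show that a larger vocabulary gives a shorter sequence.

```python
def char_encode(text, vocab):
    """Encode one integer per character."""
    lookup = {tok: i for i, tok in enumerate(vocab)}
    return [lookup[ch] for ch in text]

def greedy_subword_encode(text, vocab):
    """Greedy longest-match tokenization over a hand-written vocabulary (toy, not BPE)."""
    lookup = {tok: i for i, tok in enumerate(vocab)}
    max_len = max(len(tok) for tok in vocab)
    ids, pos = [], 0
    while pos < len(text):
        # try the longest candidate piece first, then shorter ones
        for length in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            if piece in lookup:
                ids.append(lookup[piece])
                pos += length
                break
        else:
            raise ValueError(f"no token matches text at position {pos}")
    return ids

sentence = "the theatre"
char_vocab = sorted(set(sentence))                    # 6 single characters
subword_vocab = char_vocab + ["the", "atre", " the"]  # assumed extra multi-character tokens

char_ids = char_encode(sentence, char_vocab)
subword_ids = greedy_subword_encode(sentence, subword_vocab)
print("char level:   ", len(char_vocab), "tokens in vocab,", len(char_ids), "ids")
print("subword level:", len(subword_vocab), "tokens in vocab,", len(subword_ids), "ids")
```

With these assumed vocabularies the character encoder needs 11 integers for the sentence, while the subword encoder needs 3.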

In [11]:
# create dictionaries to convert characters to integers and vice versa
char_to_int = {c: i for i, c in enumerate(chars)}
int_to_char = {i: c for i, c in enumerate(chars)}


# encode the text
# lambda functions are used as small throwaway functions
encoder = lambda string: [char_to_int[char] for char in string] # make a list of every encoded character in the input string
decoder = lambda tokens: ''.join([int_to_char[i] for i in tokens]) # reverse the encoding: turn a list of integers back into a string

print("Without encoder function", [char_to_int["h"], char_to_int["e"], char_to_int["l"], char_to_int["l"], char_to_int["o"]])
print("With encoder function", encoder("hello"))

print("Decoded: ", decoder(encoder("hello")))

Without encoder function [46, 43, 50, 50, 53]
With encoder function [46, 43, 50, 50, 53]
Decoded:  hello


### Prepare the dataset
This section encodes the entire dataset and splits the data into a train portion and a validation portion.
The data will be stored in a tensor object from PyTorch.

Data loaders will be made because the transformer trains on batches of data instead of being fed the entire dataset in one go. Remember that when a batch is fed to the transformer, it makes a prediction at every position of every sequence in the batch. Each prediction depends on the tokens before it, but must not be influenced by the tokens after it. In other words, the target at position t is predicted from the context at positions 0 to t-1.

In [17]:
data = torch.tensor(encoder(text), dtype=torch.long) # this is a 1D vector with an integer for each character in the entire text
print(data.shape, data.dtype)

#reserve 10% of the data for validation
train_size = int(len(data) * 0.9)
train_data = data[0:train_size]
val_data = data[train_size:len(data)]


# for explanation only, not used in training
def data_loader_explanation(data, block_size):
    # block_size decides the amount of context that should be included in training
    batch = data[:block_size + 1]
    x = batch[:block_size]
    y = batch[1:block_size + 1] # y is the same as x, but shifted by one character
    
    for i in range(block_size):
        context = x[:i+1]
        target = y[i]
        print(f"when context is {context} target is {target}")
    

data_loader_explanation(train_data, 5)

torch.Size([1115394]) torch.int64
when context is tensor([18]) target is 47
when context is tensor([18, 47]) target is 56
when context is tensor([18, 47, 56]) target is 57
when context is tensor([18, 47, 56, 57]) target is 58
when context is tensor([18, 47, 56, 57, 58]) target is 1


### Create data loader
A batch is the number of independent sequences that are processed in parallel. These sequences have nothing to do with each other; batching is purely for efficiency.
Block_size is the maximum number of tokens (characters here) used as context for a prediction. Remember that a chunk of block_size + 1 tokens contains block_size training examples, one for each context length from 1 to block_size.

Block_size is also referred to as time, T.

In [5]:
seed = 1337
torch.manual_seed(seed) # seeded randomness

batch_size = 4
block_size = 8

def get_batch(mode):
    if mode == "train":
        data = train_data
    elif mode == "val":
        data = val_data
    else:
        raise ValueError(f"unknown mode: {mode!r}, expected 'train' or 'val'")
    start_idx = torch.randint(0, len(data) - block_size, (batch_size,)) # get batch_size random start indices between 0 and (length of data - block_size)
    
    # these list comprehensions take each start index from start_idx and store the next block_size characters in context and targets
    # targets is offset by one character from context
    context = torch.stack([data[i:i+block_size] for i in start_idx]) # shape: (batch_size, block_size)
    targets = torch.stack([data[i+1:i+1+block_size] for i in start_idx]) # shape: (batch_size, block_size)
    return context, targets

context, targets = get_batch("train")
print("inputs: ", context.shape, "\n", context, "\n\n Outputs: ", targets.shape, "\n", targets)

inputs:  torch.Size([4, 8]) 
 tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]]) 

 Outputs:  torch.Size([4, 8]) 
 tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])


### Construct a neural network with pytorch
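A bigram model is used here. A bigram model predicts the next token from the previous token alone, so the rest of the context is ignored for now. The context is still passed through the model so that later versions can use it.

TODO read up on this in the video from Andrej

Each token is used as an index into a lookup table, nn.Embedding, and the row it selects holds the scores (logits) for the next token.

One-hot encoding represents a categorical value as a vector that is 1 at the index of that category and 0 everywhere else. Looking up a row in the embedding table gives the same result as multiplying the one-hot vector by the table.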

![alt text](https://miro.medium.com/v2/resize:fit:1400/1*ggtP4a5YaRx6l09KQaYOnw.png)
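
The link between one-hot encoding and the embedding lookup can be checked directly. The sketch below uses a small illustrative vocabulary of 5 tokens (an assumption, not the 65 of this notebook) and compares the lookup with an explicit one-hot matrix multiplication.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
vocab_size = 5  # small illustrative vocabulary, not the notebook's 65
embedding = nn.Embedding(vocab_size, vocab_size)

idx = torch.tensor([[0, 3, 3, 1]])  # shape (B=1, T=4)

# direct lookup: each token id selects one row of the table
looked_up = embedding(idx)  # shape (1, 4, 5)

# same result through one-hot vectors multiplied by the table's weight matrix
one_hot = F.one_hot(idx, num_classes=vocab_size).float()  # shape (1, 4, 5)
multiplied = one_hot @ embedding.weight                   # shape (1, 4, 5)

print(torch.allclose(looked_up, multiplied))
```

The lookup avoids building the one-hot vectors at all, which is why nn.Embedding takes integer indices as input.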

In [6]:
torch.manual_seed(seed) # seeded randomness

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.embedding_table = nn.Embedding(vocab_size, vocab_size) # a lookup table where rows are plucked out based on the input token (one-hot encoded)
        
    def forward(self, idx, targets = None):
        logits = self.embedding_table(idx) # shape: (batch_size, block_size, vocab_size) OR (B, T, C)
        
        # failsafe if true identity of next token is not known
        if targets is None:
            loss = None
        else:
            # targets contain the identity of the next character, cross_entropy computes the quality of the prediction in logits
            B, T, C = logits.shape
            logits = logits.view(B*T, C) # cross_entropy expects logits of shape (N, C), so batch and time are flattened into one dimension
            targets = targets.view(B*T)
            
            loss = F.cross_entropy(logits, targets) 
        
        # logits are scores for each possible next token, e.g. certain characters are more likely to follow others
        return logits, loss
        
    def predict_next(self, idx, max_new_tokens):
        # idx is the context
        for i in range(max_new_tokens):
            # get predictions (logit is the output before applying an activation function)
            logits, loss = self.forward(idx) # currently feeding in the entire context, but only need the last token
            # store only the last prediction
            logits = logits[:,-1, :]
            # convert to probabilities
            probs = F.softmax(logits, dim=-1)
            # pick sample
            next_token = torch.multinomial(probs, num_samples=1) # (B, 1)
            
            # append predicted token to context
            idx = torch.cat([idx, next_token], dim=1)
        return idx
   

    
model = BigramLanguageModel(vocab_size)
logits, loss = model.forward(context, targets)
print("shape: ", logits.shape, "\n loss: ", loss)

idx = torch.zeros((1, 1), dtype=torch.long)
print(decoder(model.predict_next(idx, max_new_tokens=100)[0].tolist()))

shape:  torch.Size([32, 65]) 
 loss:  tensor(4.8786, grad_fn=<NllLossBackward0>)

Sr?qP-QWktXoL&jLDJgOLVz'RIoDqHdhsV&vLLxatjscMpwLERSPyao.qfzs$Ys$zF-w,;eEkzxjgCKFChs!iWW.ObzDnxA Ms$3
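
A useful reference point for the loss above is a model that assigns the same probability to every character. Its cross-entropy is ln(65), about 4.17, whatever the targets are. The sketch below checks this with all-zero logits; the batch of 32 random targets is an assumption chosen to match the shape printed above.

```python
import math
import torch
from torch.nn import functional as F

vocab_size = 65
torch.manual_seed(0)

# all-zero logits give every token the same probability after softmax
uniform_logits = torch.zeros(32, vocab_size)
random_targets = torch.randint(0, vocab_size, (32,))

baseline = F.cross_entropy(uniform_logits, random_targets).item()
print(baseline, math.log(vocab_size))
```

The untrained model starts above this value because its randomly initialised table is not uniform, so some of its predictions are confidently wrong.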


### Training the model
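The AdamW optimizer is used to train the model, with a learning rate of 0.001. The loss printed at the end is the loss on the last training batch only.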

In [7]:
def train_model(iterations):
    optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)
    for step in range(iterations):
        x_batch, y_batch = get_batch("train")
        logits, loss = model(x_batch, y_batch)
        optimizer.zero_grad(set_to_none=True) # set_to_none is a memory optimization
        loss.backward()
        optimizer.step()
    print(loss.item())
train_model(10000)

2.4212486743927
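
The printed value comes from a single random batch, so it is noisy, and the validation split is not used yet. One possible way to get a steadier number is to average the loss over several batches of each split. In the sketch below, `estimate_loss` is a hypothetical helper, and `TinyBigram` and `toy_get_batch` are small stand-ins (assumptions) so that the block runs on its own; in this notebook the real `model` and `get_batch` would be passed instead.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

@torch.no_grad()  # no gradients are needed for evaluation
def estimate_loss(model, get_batch_fn, eval_iters=20):
    """Average the loss over several random batches per split (hypothetical helper)."""
    model.eval()
    averages = {}
    for split in ("train", "val"):
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            x, y = get_batch_fn(split)
            _, loss = model(x, y)
            losses[k] = loss.item()
        averages[split] = losses.mean().item()
    model.train()
    return averages

# stand-in model with the same (logits, loss) return convention as the bigram model
class TinyBigram(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets):
        logits = self.table(idx)
        B, T, C = logits.shape
        loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

# stand-in data: 200 random token ids, the last 20 held out for validation
torch.manual_seed(0)
toy_data = torch.randint(0, 10, (200,))

def toy_get_batch(split):
    source = toy_data[:180] if split == "train" else toy_data[180:]
    starts = torch.randint(0, len(source) - 8, (4,))
    x = torch.stack([source[i:i + 8] for i in starts])
    y = torch.stack([source[i + 1:i + 9] for i in starts])
    return x, y

result = estimate_loss(TinyBigram(10), toy_get_batch)
print(result)
```

A validation loss that stays clearly above the training loss would be a sign of overfitting.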


In [10]:
print(decoder(model.predict_next(idx, max_new_tokens=500)[0].tolist()))


WIs stllomu?'s.
I pr cavo.
iprclinn sd su gnce!
bes, athour fodehiou, f yee be bonn?RKERUvinod mumanst!
Thisltit po at: mm; do a Is?nste:
Awnepthedethans fo tatexven;I dg avatofal msur d'se-WChes cre ward y 'TzWA ss l m?p a atmy biker: ttheamepliveromo;
Thin
Slswirise.
DURTh iminifultsene'shriss chal!uim; le fisthFLonehe
MOKHY d ar,
LAM:
GRinasi'd nocou indo'ASI tst o h tu ckxBO:

Torfat tyemy d-$gnshiBis cof yss O:
Bupoins!
Jd lathed:


Ger jur
Es araty,
I sper tornd as ho t h tin, t masu kivin
