This project is based on the video tutorial by Andrej Karpathy and was inspired by Vedal from Neuro-sama.

    Source video - https://www.youtube.com/watch?v=kCc8FmEb1nY 

We will train this system on Shakespeare's writing (the tinyshakespeare dataset) to generate "infinite" Shakespeare.

## Loading Training Data (Shakespeare)

In [19]:
with open("tiny-shake.txt", "r", encoding='utf-8') as f:
    text = f.read()

#inspecting dataset

print("length of dataset in chars: ", len(text))

length of dataset in chars:  1115394


### Creating model vocab

In [20]:
chars = sorted(list(set(text))) # sorted list of the unique characters in the text
vocab_size = len(chars) # length of the vocabulary

# Getting a preview of the vocab (really just every character that appears in the text)
print(''.join(chars))
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


### Creating Token Encoder & Decoder (char level)

TODO How tf does this work?

In [21]:
stoi = { ch:i for i,ch in enumerate(chars) } # lookup table: character -> integer id
itos = { i:ch for  i,ch in enumerate(chars) } # reverse lookup table: integer id -> character

encode = lambda s: [stoi[c] for c in s] # takes a string, outputs a list of ints
decode = lambda l: ''.join([itos[i] for i in l]) # takes a list of ints, outputs a string

print(encode("hiii there"))
print(decode(encode("hiii there")))

[46, 47, 47, 47, 1, 58, 46, 43, 56, 43]
hiii there
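
To see what the two lookup tables are doing, here is a minimal sketch on a toy string rather than the full dataset. The names `toy_text`, `toy_stoi`, `toy_itos`, `toy_encode`, and `toy_decode` are made up for this illustration and are not part of the notebook.

```python
# Toy illustration (hypothetical names, not part of the notebook's pipeline)
toy_text = "hello"

# Unique characters, sorted so every character gets a stable integer id
toy_chars = sorted(set(toy_text))          # ['e', 'h', 'l', 'o']

toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}  # char -> int
toy_itos = {i: ch for i, ch in enumerate(toy_chars)}  # int -> char

def toy_encode(s):
    """Look up each character's id."""
    return [toy_stoi[c] for c in s]

def toy_decode(ids):
    """Look up each id's character and join them back into a string."""
    return "".join(toy_itos[i] for i in ids)

ids = toy_encode("hello")
print(toy_stoi)          # {'e': 0, 'h': 1, 'l': 2, 'o': 3}
print(ids)               # [1, 0, 2, 2, 3]
print(toy_decode(ids))   # hello
```

Encoding and decoding are exact inverses here, which is why `decode(encode(...))` returns the original string.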


We encode single characters as tokens, which is simple to implement but produces long token sequences. OpenAI and Google typically use subword tokenisers instead, such as tiktoken (byte-pair encoding) and SentencePiece, which build tokens from fragments of words.
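
As a rough intuition for why word-fragment tokens shorten sequences, the sketch below repeatedly merges the most frequent adjacent pair of tokens into one token, in the spirit of byte-pair encoding. It is a toy written for this note, not the actual algorithm inside any production tokeniser; `most_common_pair` and `merge_pair` are hypothetical helper names.

```python
from collections import Counter

def most_common_pair(tokens):
    """Return the adjacent token pair that occurs most often."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("the theme then")   # start at character level
start_len = len(tokens)
for _ in range(3):                # three merge rounds
    tokens = merge_pair(tokens, most_common_pair(tokens))

print(start_len, "->", len(tokens))   # fewer tokens for the same text
print(tokens)
```

The trade-off is a larger vocabulary in exchange for shorter sequences, so the same block size covers more text.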

TODO: how does this improve performance?

#### Encoding shakespeare into a tensor

In [22]:
import torch

# Encoding entire text file into a tensor of char tokens
data = torch.tensor(encode(text), dtype = torch.long)

print(data.shape, data.dtype)

torch.Size([1115394]) torch.int64


## Data Pre-Processing

Splitting into train and val datasets

In [23]:
n = int(0.9*len(data)) # first 90% is train, the remaining 10% is validation

train_data = data[:n]
val_data = data[n:]

#### IMPORTANT - block size

Most transformers train on blocks of text. The block size is the length, in tokens, of each chunk fed into the transformer during training.

TODO - Is this like batch size? 
       Why does it improve performance?
       Does increasing or decreasing block size affect performance? Why?

In [24]:
block_size = 8
train_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In the line:
    tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

these 9 tokens contain 8 training examples: at each position, the model has to predict the next number from the numbers that come before it.

The code below explains this. 

In [26]:
x = train_data[:block_size]
y = train_data[1:block_size+1]

for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context}, the target is {target}")

when input is tensor([18]), the target is 47
when input is tensor([18, 47]), the target is 56
when input is tensor([18, 47, 56]), the target is 57
when input is tensor([18, 47, 56, 57]), the target is 58
when input is tensor([18, 47, 56, 57, 58]), the target is 1
when input is tensor([18, 47, 56, 57, 58,  1]), the target is 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]), the target is 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]), the target is 58


The model tries to predict each target, but it never receives an input longer than the block size.

TODO - is this the context window?
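
One way to read the statement above: every training example is cut from a window of at most `block_size` tokens, so the model never sees longer inputs during training. A common way to respect the same limit while generating is to keep only the last `block_size` tokens as context. The sketch below uses plain lists and a hypothetical `crop_context` helper. It is an assumption about how the limit could be enforced, not code from this notebook.

```python
def crop_context(tokens, block_size):
    """Keep at most the last `block_size` tokens (hypothetical helper)."""
    return tokens[-block_size:]

block_size = 8
running = list(range(12))            # pretend 12 tokens have been generated so far

context = crop_context(running, block_size)
print(context)        # [4, 5, 6, 7, 8, 9, 10, 11]
print(len(context))   # 8

# Sequences shorter than block_size pass through unchanged
print(crop_context([1, 2, 3], block_size))   # [1, 2, 3]
```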

It also seems that these chunks can be grouped into batches, which speeds up training by making use of GPU parallel processing.
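
To separate the two ideas: block size is the length of one chunk, while batch size is how many independent chunks are stacked and processed together. The sketch below uses stand-in data (the integers 0 to 99) and fixed, made-up starting offsets so the result is easy to check by eye.

```python
import torch

batch_size, block_size = 4, 8

# Stand-in data: the integers 0..99 instead of encoded Shakespeare
data = torch.arange(100)

# Four hypothetical starting offsets (the notebook draws these at random)
starts = [0, 10, 20, 30]
x = torch.stack([data[i:i + block_size] for i in starts])          # (B, T)
y = torch.stack([data[i + 1:i + block_size + 1] for i in starts])  # (B, T)

print(x.shape)          # torch.Size([4, 8])
print(x.numel())        # 32 (context, target) pairs in one batch

# Each target is simply the token one position to the right of its input
print(torch.equal(y, x + 1))   # True for this arange stand-in
```

The rows never interact with each other, which is what lets the hardware process them in parallel.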

#### Creating Data Batches of Chunks of Data

In [29]:
# Creating chunks of the dataset that will be processed independently of each other

torch.manual_seed(1337)
batch_size = 4
block_size = 8

def get_batch(split):
    #generate a small batch of data of inputs x and target y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,)) # generating random locations to take batches
    
    # stacking x and y vertically
        # [][]
        # to
        # []
        # []
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb  = get_batch('train') # sample a batch from the training set

# xb is fed into the transformer
# yb holds the targets the model should predict

print('inputs: ')
print(xb.shape)
print(xb)

print("----------------------")

print('Targets: ')
print(yb.shape)
print(yb)

print("----------------------")

for b in range(batch_size): #batch dimension
    for t in range(block_size):
        context = xb[b, :t+1] #context (input)
        target = yb[b, t]
        print(f"When input is {context.tolist()}, target is {target}")

inputs: 
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
----------------------
Targets: 
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
----------------------
When input is [24], target is 43
When input is [24, 43], target is 58
When input is [24, 43, 58], target is 5
When input is [24, 43, 58, 5], target is 57
When input is [24, 43, 58, 5, 57], target is 1
When input is [24, 43, 58, 5, 57, 1], target is 46
When input is [24, 43, 58, 5, 57, 1, 46], target is 43
When input is [24, 43, 58, 5, 57, 1, 46, 43], target is 39
When input is [44], target is 53
When input is [44, 53], target is 56
When input is [44, 53, 56], target is 1
When input is [44, 53, 56, 1], target is 58
When input is [44, 53, 56, 1, 58], targ

## Implementing Into Neural Net (Bigram Language Model) Needs Research

A bigram is a pair of adjacent tokens (here, two characters), not two words. A bigram language model is the simplest N-gram language model from early NLP (Natural Language Processing): it predicts the next token from the current token alone.
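
In the character-level setting used here, that prediction can be stored as a square table: row `i` holds a score (logit) for every token that might follow token `i`. The sketch below shows that an `nn.Embedding` of shape `(vocab, vocab)` is exactly such a lookup table. The 5-token vocabulary and the name `table` are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_vocab = 5                                  # hypothetical tiny vocabulary

# Row i of this table holds the scores (logits) for whichever token follows token i
table = nn.Embedding(toy_vocab, toy_vocab)

idx = torch.tensor([[3, 1, 4]])                # (B=1, T=3) token ids
logits = table(idx)                            # (1, 3, 5)

print(logits.shape)
# Calling the embedding is the same as indexing rows of its weight matrix
print(torch.equal(logits, table.weight[idx]))  # True
# The logits at a position depend only on the token at that position
print(torch.equal(logits[0, 1], table.weight[1]))  # True
```

Training then amounts to adjusting the rows of this table so that likely next characters get higher scores.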

In [35]:
import torch
import torch.nn as nn #neural network
from torch.nn import functional as f
torch.manual_seed(1337)

# subclass of nn module
class BigrameLanguageModel(nn.Module):

    # python's constructor
    def __init__(self, vocab_size):
        super().__init__()

        # each token directly reads off the logits for the next token from a lookup table:
        # row i of the table holds one score for every token that could follow token i
        
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):

        # idx and targets are both (B, T) tensors of ints
        # B = batch (4), T = time (8), C = channels (vocab_size, 65)

        logits = self.token_embedding_table(idx) # (B, T, C)

        # Calling f.cross_entropy(logits, targets) directly on (B, T, C) logits fails:
        # for multi-dimensional input PyTorch expects the class dimension second, i.e. (B, C, T),
        # so we flatten logits and targets instead

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape

            logits = logits.view(B*T, C) # flatten batch and time: one row of scores per position
            targets = targets.view(B*T) # one target id per position

            loss = f.cross_entropy(logits, targets)

        return logits, loss # logits are the scores for the next character at each position
    
    # With this model we can evaluate the loss; now let's generate text.
    # Generation is a loop: predict the next token, sample it, append it to the
    # sequence, and repeat until max_new_tokens have been added.
    def generate(self, idx, max_new_tokens):
        #idx is (B, T) array of indices in the current context

        for _ in range(max_new_tokens):

            #getting predictions
            logits, loss = self(idx)

            #focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)

            #apply softmax to get probabilities
            probs = f.softmax(logits, dim=-1) # (B, C)

            #sample from distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)

            #append sampled index to the running seq

            idx = torch.cat((idx, idx_next), dim = 1) # (B, T+1)

        return idx

m = BigrameLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss) # Initial loss


torch.Size([32, 65])
tensor(4.8786, grad_fn=<NllLossBackward0>)
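
As a sanity check on that initial loss, the sketch below works out what cross-entropy would be if the model spread its probability evenly over all 65 characters. This is plain arithmetic, not part of the model code.

```python
import math

vocab_size = 65

# If every one of the 65 characters were equally likely, the probability
# assigned to the correct next character would be 1/65.
uniform_loss = -math.log(1 / vocab_size)
print(round(uniform_loss, 4))   # 4.1744

observed_initial_loss = 4.8786  # value printed by the cell above
print(observed_initial_loss > uniform_loss)   # True
```

The untrained model scores a little worse than a uniform guess because its randomly initialised logits are not all equal, so it is sometimes confidently wrong.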


Generating From model (untrained)

In [38]:
idx = torch.zeros((1, 1), dtype=torch.long) # start from a single token with id 0 (the newline character) as the prompt

print(decode(m.generate(idx, max_new_tokens=100)[0].tolist()))



pJ:Bpm&yiltNCjeO3:Cx&vvMYW-txjuAd IRFbTpJ$zkZelxZtTlHNzdXXUiQQY:qFINTOBNLI,&oTigq z.c:Cq,SDXzetn3XVj
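
The two steps at the heart of `generate` are turning logits into probabilities and then sampling from them. The sketch below isolates those steps on made-up logits for a 4-token vocabulary. It imports `functional` as `F`, the more common alias, whereas this notebook uses `f`.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)

# Hypothetical logits for a 4-token vocabulary, batch of 1
logits = torch.tensor([[2.0, 1.0, 0.0, -1.0]])   # (B=1, C=4)

probs = F.softmax(logits, dim=-1)                # non-negative, each row sums to 1
print(probs)

# Draw one token id per row, with probability given by `probs`
idx_next = torch.multinomial(probs, num_samples=1)   # (B, 1)
print(idx_next.shape)

# Sampling many times approximates the distribution instead of always
# picking the single most likely token
draws = torch.multinomial(probs, num_samples=10_000, replacement=True)
freqs = torch.bincount(draws[0], minlength=4) / 10_000
print(freqs)
```

Because the untrained model's probabilities are essentially random, sampling from them yields the gibberish shown above.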


In [39]:
#creating a PyTorch optimizer

optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

In [50]:
batch_size = 32

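presc = loss.item() # most recent loss, used as the stopping criterion

while presc > 2.4: # train in rounds of 2000 steps until the last batch's loss drops below 2.4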
    for steps in range(2000):

        #sample a batch of data
        xb, yb = get_batch('train')

        #evaluate the loss
        logits, loss = m(xb, yb)

        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
    presc = loss.item()
    print(loss.item())
print(loss.item())

2.5006003379821777
2.3401167392730713
2.3401167392730713
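
The numbers above are the loss on whichever batch happened to come last, so they jump around from batch to batch. A steadier reading could come from averaging the loss over several batches, on both the training and validation splits. The sketch below is one possible way to do that. `estimate_loss`, `TinyBigram`, `toy_get_batch`, and the random `fake_data` are hypothetical stand-ins so the example runs on its own, and are not the notebook's model or data.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)

@torch.no_grad()
def estimate_loss(model, get_batch, eval_iters=50):
    """Average the loss over several random batches per split (hypothetical helper)."""
    model.eval()
    out = {}
    for split in ("train", "val"):
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            xb, yb = get_batch(split)
            _, loss = model(xb, yb)
            losses[k] = loss.item()
        out[split] = losses.mean().item()
    model.train()
    return out

# --- stand-ins so the sketch runs on its own ---
class TinyBigram(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets):
        logits = self.table(idx)                       # (B, T, C)
        B, T, C = logits.shape
        loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

fake_data = torch.randint(0, 10, (1000,))              # random tokens, not Shakespeare

def toy_get_batch(split, batch_size=4, block_size=8):
    # `split` is ignored here; the notebook's get_batch picks train or val data
    ix = torch.randint(len(fake_data) - block_size, (batch_size,))
    x = torch.stack([fake_data[i:i + block_size] for i in ix])
    y = torch.stack([fake_data[i + 1:i + block_size + 1] for i in ix])
    return x, y

losses = estimate_loss(TinyBigram(10), toy_get_batch, eval_iters=20)
print(losses)
```

With the notebook's own objects, the call would presumably look like `estimate_loss(m, get_batch)`, and comparing the two averages would also show whether the model is overfitting the training split.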


In [61]:
print(decode(m.generate(idx, max_new_tokens=1000)[0].tolist()))


Prlltthere w;
DURURIZAUCERosthoras be hemay gsonsigre gh ld y ho Fowavereese r
AD o err uee pollluivera Wig INut

ST:
Broumathaf it fa pul!
Ag chowit armo,
An l, bempeveathrrg d othat,

Grdmat us ce as haspa? aporse t itlve
Mas read pr borews ath
Ifo wand ve tothelcheru be aghe, blbu whes,
Wambr; th
