# This is my first experimental code with a GPT model
I am following a tutorial by Andrej Karpathy and plan to add my own material along the way to deepen my understanding.


### First, we download the tinyshakespeare text.
This will be our starting dataset for the transformer.

In [None]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

In [1]:
with open('input.txt', 'r', encoding='utf-8') as file:
    text = file.read()

In [None]:
print(len(text))

Here we define our vocabulary: the set of unique characters in the text, sorted so that each character gets a stable index.

In [None]:
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)

Now we handle encoding and decoding. This is done in the following way:
we create two lookup tables (dictionaries), stoi from character to index and itos from index back to character.
To encode a string, we look up the index of each character and return the list of indices.
To decode, we look up the character for each index in the given list and join the characters into a string.

In [None]:
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }

def encode(s):
    return [stoi[c] for c in s]

def decode(l):
    return ''.join([itos[i] for i in l])

In [None]:
encoded_hello = encode("hello")
decoded_hello = decode(encoded_hello)
print(encoded_hello)
print(decoded_hello)


In our case, we are using a very simple character-level tokenizer. This has its trade-offs. The simplicity makes it easier to get to a "finished" product and keeps the encode and decode functions short. However, there are alternatives that use much larger vocabularies built from subword units (byte-pair encoding, for example), so the same text encodes to a much shorter sequence of tokens. Shorter sequences let the model cover more text within a fixed context length, at the cost of a more complex tokenizer and a larger vocabulary, which in turn means larger embedding tables.
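To make that trade-off concrete, here is a minimal sketch of the idea behind byte-pair encoding: start from the character-level ids and repeatedly merge the most frequent adjacent pair into a new token. The sample sentence, the number of merges, and every name in it (`most_frequent_pair`, `merge_pair`, `toy_stoi`, and so on) are made up for illustration; real subword tokenizers are considerably more involved.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Return the adjacent pair of token ids that occurs most often."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    merged, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(ids[i])
            i += 1
    return merged

# Hypothetical toy data, encoded at character level exactly like the notebook does.
sample = "to be or not to be that is the question"
toy_chars = sorted(set(sample))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_ids = [toy_stoi[c] for c in sample]

toy_vocab_size = len(toy_chars)
history = [(toy_vocab_size, len(toy_ids))]
for _ in range(5):  # five merges is an arbitrary choice for the demo
    pair = most_frequent_pair(toy_ids)
    toy_ids = merge_pair(toy_ids, pair, toy_vocab_size)  # new token gets the next free id
    toy_vocab_size += 1
    history.append((toy_vocab_size, len(toy_ids)))

for v, n in history:
    print(f"vocab size {v:2d} -> sequence length {n}")
```

Each merge adds one entry to the vocabulary and shortens the encoded sequence, which is the vocabulary-size versus sequence-length trade-off described above.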

In [None]:
%pip install torch

In [None]:
import torch

Now we encode the entire tinyshakespeare dataset and store it as a tensor.


In [None]:
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:100])

Let's now split the data into a training set (the first 85%) and a validation set (the remaining 15%).

In [9]:
split = int(0.85*len(data))

train_data = data[:split]
val_data = data[split:]


In [None]:
block_size = 8

train_data[:block_size+1]

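The point of doing it like this is that a single chunk of block_size + 1 characters contains block_size training examples, one for each context length from 1 up to block_size. This gets the model used to predicting the next character from contexts of every length.
Most importantly for generation, it is trained to start off with as little as 1 character of context.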

In [None]:
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f'when input is {context} target: {target}')

In [16]:
torch.manual_seed(1337)  # optional: fixes the random seed to get the same numbers as the video

batch_size = 4  # how many independent sequences we process in parallel

block_size = 8  # the maximum context length used for a prediction

def get_batch(split: str) -> tuple:
    data = train_data if split=="train" else val_data
    # Pick batch_size random start offsets, leaving room for a full block plus its shifted targets
    ix = torch.randint(len(data) - block_size, (batch_size,))
    # Inputs: stack one slice of length block_size per offset into a (batch_size, block_size) tensor
    x = torch.stack([
        data[i:i+block_size]
        for i in ix])
    # Targets: the same slices shifted one position to the right,
    # so y[b, t] is the character that follows x[b, t]
    y = torch.stack([
        data[i+1: i+block_size+1]
        for i in ix])
    return x, y

xb, yb = get_batch('train')


In [None]:
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)

print('----')

for b in range(batch_size):
    for t in range(block_size):
        context = xb[b, : t + 1]
        target = yb[b, t]
        print(f"when input is {context.tolist()}  | target: {target}")


In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLM(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
        # a vocab_size x vocab_size lookup table (65 x 65 here); row i holds the next-token logits for token i

    def forward(self, idx, targets=None):

        # Each index in idx selects its row of the table. That row holds the logits
        # (unnormalized scores) for which token is likely to follow that token.
        logits = self.token_embedding_table(idx) # (B, T, C): B = batch, T = time, C = vocab_size

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            # F.cross_entropy expects the class dimension second: (N, C) or (B, C, T).
            # Our logits are (B, T, C), so we flatten batch and time into one dimension.
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)

            # targets holds the true next characters from the training data,
            # so this measures how well we are predicting the next character
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens) -> torch.Tensor:
        """To summarize:

        This takes the passed indices, generates max_new_tokens new tokens one at a time,
        and returns the passed-in indices with the generated ones appended.
        Basically, we give it "he" and the number 3, and might get back "hello" (as an example).
        Each new token is sampled from the predicted probabilities, so the output is random.

        Only the logits of the last time step are kept, because that position predicts the next token.
        """

        # idx is a (B, T) tensor of indices (the current context)
        for _ in range(max_new_tokens):
            logits, loss = self(idx)
            # focus on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # draw a sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append the sampled index to the running sequence (kind of like autocomplete)
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)

        return idx

m = BigramLM(vocab_size)
out, loss = m(xb, yb)
print(out.shape)
print(loss)
# expected loss for a uniform guess = -ln(1/vocab_size) = ln(65), about 4.17
idx = torch.zeros((1, 1), dtype=torch.long) # start from index 0, the newline character in our vocab; a neutral starting point for text generation
# if we pass in a longer context instead, the model continues that text, more like autocomplete
print(decode(m.generate(idx, max_new_tokens=100)[0].tolist()))
# so we generate 100 new tokens starting from the newline, turn the result into a list, and decode it from the encoded vocab values
# right now the output is completely random, so we need to train the model
# generate passes in the whole context even though a bigram model only uses the last token; this way we don't have to change it later when the model is better
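As a quick sanity check on the expected starting loss mentioned in the cell above, the sketch below evaluates -ln(1/V) for a vocabulary of 65 characters. The name `toy_vocab_size` is hypothetical and hard-coded here so the snippet runs on its own; in the notebook you would use `vocab_size`.

```python
import math

toy_vocab_size = 65  # assumption: the character vocabulary size of tinyshakespeare

# A model that gives every token the same probability 1/V has
# cross-entropy -ln(1/V) = ln(V), whatever the true target is.
uniform_loss = -math.log(1.0 / toy_vocab_size)
print(f"uniform-guess loss for V={toy_vocab_size}: {uniform_loss:.4f}")  # 4.1744
```

A freshly initialized model will usually report a somewhat higher loss than this, because the random initial values in the embedding table make its predictions non-uniform.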

In [22]:
optimizer = torch.optim.AdamW(m.parameters(),lr=1e-3)


In [None]:
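batch_size = 32
for steps in range(50000):

    # sample a fresh batch of training data
    xb, yb = get_batch('train')

    # forward pass, then backpropagate and update the parameters
    logits, loss = m(xb, yb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# loss of the final batch only, so this is a noisy estimate
print(loss.item())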

In [None]:
print(decode(m.generate(idx, max_new_tokens=500)[0].tolist()))


### The mathematical trick in self-attention

In [None]:
torch.manual_seed(1337)

B,T,C = 4,8,2
x = torch.randn(B,T,C)

x.shape

In [None]:
# tokens should only get information from tokens in the past (and from themselves)
# the easiest way for tokens to communicate is to average over the current token and all the tokens before it
xbow = torch.zeros((B,T,C)) # "bow" = bag of words
for b in range(B):
    for t in range(T):
        xbow[b,t] = x[b,:t+1].mean(dim=0)
xbow.shape

# this is a very simple scheme: every past token gets the same weight, so it ignores what the tokens actually contain


Here we use batched matrix multiplication to compute the same running averages as weighted sums.

In [42]:
weights = torch.tril(torch.ones(T, T))
weights = weights / weights.sum(dim=1, keepdim=True)
xbow2 = weights @ x
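The following sketch checks that the matrix-multiply version gives the same running averages as the explicit loops. It uses its own small, made-up sizes and names (`B_demo`, `x_demo`, `w_demo`, and so on) so that it does not overwrite the notebook's variables, and it passes an absolute tolerance to `torch.allclose` because the two versions can differ by floating-point rounding.

```python
import torch

torch.manual_seed(0)
B_demo, T_demo, C_demo = 2, 4, 3  # small made-up sizes for the demo
x_demo = torch.randn(B_demo, T_demo, C_demo)

# Version 1: explicit loops, averaging each position with everything before it.
loop_avg = torch.zeros(B_demo, T_demo, C_demo)
for b in range(B_demo):
    for t in range(T_demo):
        loop_avg[b, t] = x_demo[b, : t + 1].mean(dim=0)

# Version 2: one matrix multiply with a row-normalized lower-triangular matrix.
# (T, T) @ (B, T, C) broadcasts over the batch dimension and gives (B, T, C).
w_demo = torch.tril(torch.ones(T_demo, T_demo))
w_demo = w_demo / w_demo.sum(dim=1, keepdim=True)
matmul_avg = w_demo @ x_demo

print(w_demo)  # row t has t+1 equal non-zero weights
print(torch.allclose(loop_avg, matmul_avg, atol=1e-5))
```

Row t of the weight matrix spreads its weight evenly over positions 0 to t, so multiplying by it is the same as taking the mean over those positions.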

In [None]:
torch.manual_seed(42)
a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, dim=1, keepdim=True) # normalize so that each row sums to 1
# the number of non-zero elements in a row is the number of tokens (itself and the ones before it) that the row's token averages over
b = torch.randint(0, 10, (3,2)).float()
c = a @ b
print(c)
print(c.shape)


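This third version is the most useful one because the 'affinities' (the weights before the softmax) begin at 0 instead of being fixed. Masking the future positions with -inf and applying softmax reproduces the uniform average, but the zeros can later be replaced with data-dependent values, so the model can learn suitable weights for the tokens rather than us setting them explicitly.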

In [44]:
tril = torch.tril(torch.ones(T, T))
weights = torch.zeros((T,T))
weights = weights.masked_fill(tril == 0, float('-inf'))
weights = F.softmax(weights, dim=-1)
xbow3 = weights @ x
torch.allclose(xbow2, xbow3)

True
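The masking trick in the cell above can be worked through by hand. The sketch below deliberately uses plain Python instead of torch to make the arithmetic visible; the helper `softmax` and the names `n_tokens`, `scores`, and `weights_demo` are illustrative only.

```python
import math

def softmax(row):
    """Softmax over a list of floats; -inf entries get a weight of exactly 0."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

n_tokens = 4
neg_inf = float("-inf")

# All-zero affinities, with the "future" (column > row) hidden behind -inf.
scores = [[0.0 if col <= row else neg_inf for col in range(n_tokens)]
          for row in range(n_tokens)]
weights_demo = [softmax(r) for r in scores]

for r in weights_demo:
    print([round(v, 3) for v in r])

# With a non-zero affinity the visible weights are no longer uniform:
tilted = softmax([0.0, 2.0, neg_inf, neg_inf])
print([round(v, 3) for v in tilted])
```

With equal affinities each row is a uniform average over the visible positions. As soon as one affinity is larger, as in the last line, that position receives more of the weight, which is the freedom self-attention uses.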

In [45]:
torch.manual_seed(1337)

B,T,C = 4,8,32
x = torch.randn(B,T,C)

head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

k = key(x)
q = query(x)

weights = q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) -> (B, T, T)


tril = torch.tril(torch.ones(T, T))
#weights = torch.zeros((T,T))
weights = weights.masked_fill(tril == 0, float('-inf'))
weights = F.softmax(weights, dim=-1)
v = value(x)
out = weights @ v


In [None]:
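class BatchNorm1d:
    # Note: the statistics below are taken over dim=1 (the features of each example),
    # not over the batch, so despite its name this class behaves like layer normalization.
    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps = eps  # momentum is unused because no running statistics are kept

        self.gamma = torch.ones(dim)
        self.beta = torch.zeros(dim)

    def __call__(self, x):
        # x is assumed to have shape (N, dim); keepdim=True keeps the statistics
        # at shape (N, 1) so they broadcast correctly against x
        xmean = x.mean(dim=1, keepdim=True)
        xvar = x.var(dim=1, keepdim=True)
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps)
        self.out = self.gamma * xhat + self.beta
        return self.out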

    def parameters(self):
        return [self.gamma, self.beta]