<a href="https://colab.research.google.com/github/CaptainJimbo/MyPortfolio/blob/main/myGPT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Based on the "Attention Is All You Need" paper** [(link)](https://arxiv.org/abs/1706.03762). This simple notebook builds a Transformer-based language model to showcase how an LLM like ChatGPT is trained. It doesn't include the pretuning and supervised finetuning.

In [1]:
# There are two big open-source libraries for deep learning: TensorFlow and PyTorch. I'll use PyTorch.
import torch

In [2]:
# I need a "toy" dataset to train with.
# (This is very small compared to the big chunk of the internet that ChatGPT is trained on!)
# This is a .txt file with some of Shakespeare's works.
# The goal is to create a model that produces Shakespearean language!!
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt -O input.txt

--2023-07-13 07:55:19--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2023-07-13 07:55:20 (19.7 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [3]:
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
print(f'length of the dataset is {len(text)}')
print(f'\nand here is a random part of the dataset {text[60:464]}')

length of the dataset is 1115394

and here is a random part of the dataset 

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.




In [4]:
# The algorithm needs to understand characters, but it doesn't need the particular characters themselves.
# Numbers (i.e. indices) work just as well, so I create a mapping from characters to indices.
vocabulary = sorted(list(set(text)))
print('This is the vocabulary of the text, i.e. every possible character that exists in this text.',''.join(vocabulary))

This is the vocabulary of the text, i.e. every possible character that exists in this text. 
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz


In [5]:
# These are the mappings from characters to indices and vice versa.
char_to_idx = {character:index for index, character in enumerate(vocabulary)}
idx_to_char = {index:character for index, character in enumerate(vocabulary)}

# And functions for easier handling.
def encode(text):
  return [char_to_idx[character] for character in text]
def decode(indices):
  return ''.join(idx_to_char[index] for index in indices)

#encode('Hello There'), decode(encode('Hello There'))
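A quick round trip shows that decoding inverts encoding. The sketch below rebuilds the mapping from a short stand-in string so that it runs without `input.txt`; the `toy_` names are hypothetical and chosen so they do not overwrite the notebook's own `encode` and `decode`.

```python
# Stand-in text (assumption): the notebook builds its vocabulary from the Shakespeare file instead.
sample_text = "Hello There"
sample_vocab = sorted(set(sample_text))

toy_char_to_idx = {ch: i for i, ch in enumerate(sample_vocab)}
toy_idx_to_char = {i: ch for i, ch in enumerate(sample_vocab)}

def toy_encode(text):
    # characters -> list of integer indices
    return [toy_char_to_idx[ch] for ch in text]

def toy_decode(indices):
    # list of integer indices -> string
    return ''.join(toy_idx_to_char[i] for i in indices)

encoded = toy_encode(sample_text)
roundtrip = toy_decode(encoded)
print(encoded)
print(roundtrip)
```

Any character that is missing from the vocabulary raises a `KeyError` in the encoder, which is why the vocabulary is built from the full text.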

In [6]:
data = torch.tensor(encode(text),dtype=torch.long) # This is tensor with indices representing characters.
print('tensor shape',data.shape)
print('tensor  type',data.dtype)
print('tensor  rank',data.dim())

tensor shape torch.Size([1115394])
tensor  type torch.int64
tensor  rank 1


In [7]:
# Defining a train set and a test set.
train_data = data[:int(0.8*len(data))]
test_data = data[int(0.8*len(data)):]

In [8]:
gram_len = 5
X = train_data[:gram_len]
y = train_data[1:gram_len+1]
X, y, y[-1]

(tensor([18, 47, 56, 57, 58]), tensor([47, 56, 57, 58,  1]), tensor(1))

In [9]:
torch.manual_seed(1337)
BATCH_SIZE = 4
BLOCK_SIZE = 8

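def get_batch(split, batch_size, block_size):
    # `split` selects the training or the test data (renamed from `type`, which shadows a Python builtin)
    data = train_data if split=='train' else test_data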
    inits = torch.randint(len(data)-block_size, (batch_size,))
    X = torch.stack([data[i:i+block_size] for i in inits])
    Y = torch.stack([data[i+1:i+block_size+1] for i in inits])
    return X,Y

B = 4
T = 8
x_batch, y_batch = get_batch('train',B, T) # Get 4 8-grams!
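To make the relation between inputs and targets visible, the sketch below runs the same sampling logic on a stand-in dataset where token `i` is simply the integer `i`. The names `toy_data` and `toy_get_batch` are assumptions for illustration, not part of the notebook.

```python
import torch

torch.manual_seed(0)
toy_data = torch.arange(100)  # stand-in data (assumption): token i is the integer i

def toy_get_batch(data, batch_size, block_size):
    # same sampling logic as get_batch, with the data passed in explicitly
    inits = torch.randint(len(data) - block_size, (batch_size,))
    X = torch.stack([data[i:i + block_size] for i in inits])
    Y = torch.stack([data[i + 1:i + block_size + 1] for i in inits])
    return X, Y

xb_toy, yb_toy = toy_get_batch(toy_data, batch_size=4, block_size=8)
print(xb_toy.shape, yb_toy.shape)
# every target row is the input row shifted one position to the left
print(torch.equal(xb_toy[:, 1:], yb_toy[:, :-1]))
```

Because the upper bound of `torch.randint` is exclusive, the largest start index is `len(data) - block_size - 1`, which leaves room for the one extra token that the targets need.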

In [10]:
print(f'I have to train a transformer so that when it is fed {x_batch[0]} \n\
it predicts the desired targets {y_batch[0]}')

I have to train a transformer so that when it is fed tensor([58, 63,  8,  0,  0, 19, 24, 27]) 
it predicts the desired targets tensor([63,  8,  0,  0, 19, 24, 27, 33])


In [11]:
B = 4
T = 8
for b in range(B): # batch dimension
    for t in range(T): # time dimension
        context = x_batch[b, :t+1]
        target = y_batch[b,t]
        print(f"when input is {context.tolist()} the target: {target}")

when input is [58] the target: 63
when input is [58, 63] the target: 8
when input is [58, 63, 8] the target: 0
when input is [58, 63, 8, 0] the target: 0
when input is [58, 63, 8, 0, 0] the target: 19
when input is [58, 63, 8, 0, 0, 19] the target: 24
when input is [58, 63, 8, 0, 0, 19, 24] the target: 27
when input is [58, 63, 8, 0, 0, 19, 24, 27] the target: 33
when input is [39] the target: 59
when input is [39, 59] the target: 45
when input is [39, 59, 45] the target: 46
when input is [39, 59, 45, 46] the target: 58
when input is [39, 59, 45, 46, 58] the target: 1
when input is [39, 59, 45, 46, 58, 1] the target: 46
when input is [39, 59, 45, 46, 58, 1, 46] the target: 43
when input is [39, 59, 45, 46, 58, 1, 46, 43] the target: 1
when input is [49] the target: 43
when input is [49, 43] the target: 57
when input is [49, 43, 57] the target: 1
when input is [49, 43, 57, 1] the target: 53
when input is [49, 43, 57, 1, 53] the target: 50
when input is [49, 43, 57, 1, 53, 50] the target

In [12]:
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

<torch._C.Generator at 0x7a4ea4694d50>

In [13]:
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # Each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    # Called automatically when the model is invoked as model(idx, targets)
    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensors of integers (B=batch, T=time)
        logits = self.token_embedding_table(idx) # (B,T,C)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            # logits has dimension (B,T,C)
            # F.cross_entropy expects the class dimension second, i.e. (N, C), so batch and time are flattened into one dimension
            logits = logits.view(B*T,C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    # generate function for the model
    def generate(self, idx, max_new_tokens):
        # idx is (B, T) tensor of indices
        for _ in range(max_new_tokens):
            # Like using .forward with idx = idx and targets = None
            logits, loss = self(idx)
            # Choose the last time step
            logits = logits[:, -1, :] # becomes (B, V)
            # Softmax function to get probabilities from floats across the V dimension
            probs = F.softmax(logits, dim=-1) # (B, V)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # 1 is Time dimension (B, T+1)
        return idx

model = BigramLanguageModel(len(vocabulary))
B = 4
T = 8
x_b, y_b = get_batch('train',B, T) # Get 4 8grams!
logits, loss = model(x_b, y_b)
print('Shape of logits',logits.shape)
print('loss',loss)
# I choose 0 to be the first token (it is the new line idx)
print(f'\nLet\'s see what the bigram model generates starting from token 0 (the newline character):{decode(model.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist())}')
print('\nWhich is nice but it SUCKS. The reason is that it\'s not trained! Let\'s train it.')

Shape of logits torch.Size([32, 65])
loss tensor(4.6453, grad_fn=<NllLossBackward0>)

Let's see what the bigram model generates starting from token 0 (the newline character):
P-QWktXoL&jLDJgOLVz'RIoDqHdhsV&vLLxatjscMpwLERSPyao.qfzs$Ys$zF-w,;eEkzxjgCKFChs!iWW.ObzDnxA Ms$3!dcb

Which is nice but it SUCKS. The reason is that it's not trained! Let's train it.
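As a sanity check on the initial loss of 4.6453: a model that assigns equal probability to each of the 65 characters would have a cross-entropy of ln(65), roughly 4.17. The untrained model scores worse than that because its randomly initialized logits are not uniform. The sketch below checks the uniform value with all-zero logits; the random targets are a stand-in, since the uniform loss does not depend on them.

```python
import math
import torch
from torch.nn import functional as F

vocab_size = 65  # matches the last dimension of the logits shape printed above
uniform_logits = torch.zeros(32, vocab_size)        # equal logits -> uniform probabilities
random_targets = torch.randint(vocab_size, (32,))   # stand-in targets (assumption)

uniform_loss = F.cross_entropy(uniform_logits, random_targets).item()
print(uniform_loss, math.log(vocab_size))
```

A trained model should end up well below this baseline; a loss that stays above it suggests the model is doing worse than guessing uniformly.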


In [14]:
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
batch_size = 32
block_size = 8
for steps in range(1000): # increase number of steps for good results...

    # sample a batch of data
    xb, yb = get_batch('train',batch_size,block_size)
    # evaluate the loss
    logits, loss = model(xb, yb)
    # zero all gradients from the previous step
    optimizer.zero_grad(set_to_none=True)
    # compute the gradients of the loss w.r.t. the parameters
    loss.backward()
    # use the gradients to update the parameters
    optimizer.step()

print(loss.item())
print(decode(model.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))
print('This looks better appearance-wise but it\'s still gibberish.')

3.707463264465332

vLLko'TMyatyIoconxad.?-tNSqYPsx&bF.oiR;BD$dZBMZv'K f bR$mIKptRPly:AUC&$zLK,qUEy&Ay;ZxjKVhmrdagC-bTop
This looks better appearance-wise but it's still gibberish.
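The loss printed above comes from the last training batch only, so it is a noisy number. A steadier estimate averages the loss over several batches with gradient tracking switched off. The sketch below shows the idea on a stand-in token stream and a bare lookup table; `toy_data`, `toy_model`, `toy_batch` and `estimate_loss` are hypothetical names, not part of the notebook.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
toy_vocab = 10
toy_data = torch.randint(toy_vocab, (1000,))     # stand-in token stream (assumption)
toy_model = nn.Embedding(toy_vocab, toy_vocab)   # same lookup-table idea as the bigram model

def toy_batch(batch_size=32, block_size=8):
    inits = torch.randint(len(toy_data) - block_size, (batch_size,))
    X = torch.stack([toy_data[i:i + block_size] for i in inits])
    Y = torch.stack([toy_data[i + 1:i + block_size + 1] for i in inits])
    return X, Y

@torch.no_grad()  # evaluation needs no gradients
def estimate_loss(n_batches=50):
    losses = torch.zeros(n_batches)
    for k in range(n_batches):
        X, Y = toy_batch()
        logits = toy_model(X)  # (B, T, C)
        losses[k] = F.cross_entropy(logits.view(-1, toy_vocab), Y.view(-1))
    return losses.mean().item()

avg_loss = estimate_loss()
print(avg_loss)
```

In the notebook, the same pattern could be applied to both the `'train'` and the test split to compare the two losses.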


<h3>A few words about the lines of the training step</h3>

<ul>
    <li><b><code>optimizer.zero_grad(set_to_none=True)</code>:</b> This resets the gradients of all optimized tensors. In PyTorch, the gradients computed by each backward pass are accumulated (added) onto the previous values unless they are explicitly cleared. I therefore need to clear them before each backward pass; otherwise each update would use gradients summed over earlier batches. The argument <code>set_to_none=True</code> makes this operation more efficient by setting the gradients to <code>None</code> instead of filling tensors with zeros.</li>
    <li><b><code>loss.backward()</code>:</b> This line computes the gradient of the loss with respect to the parameters of the model using automatic differentiation. Essentially, it calculates how much each parameter contributed to the loss. The results (i.e., the gradients) are stored in each parameter's <code>.grad</code> attribute.</li>
    <li><b><code>optimizer.step()</code>:</b> After calculating the gradients, we need to use them to update the parameters. <code>optimizer.step()</code> performs this update based on the current gradient (stored in the <code>.grad</code> attribute of each parameter) and the update rule of the optimizer in use. For example, with Stochastic Gradient Descent (SGD), the step subtracts the gradient times the learning rate from the current parameter value.</li>
</ul>
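The three calls can be checked by hand on a toy problem. The sketch below uses plain SGD with PyTorch's default settings (no momentum, no weight decay) rather than the AdamW optimizer from the training loop, because its update rule is simple enough to reproduce manually; the parameter `w` and the quadratic loss are assumptions for illustration.

```python
import torch

torch.manual_seed(0)
lr = 0.1
w = torch.randn(3, requires_grad=True)           # toy parameter (assumption)
optimizer = torch.optim.SGD([w], lr=lr)

loss = (w ** 2).sum()                            # toy loss whose gradient is 2*w
optimizer.zero_grad(set_to_none=True)            # clears any stored gradient
loss.backward()                                  # fills w.grad with d(loss)/dw

expected = (w - lr * w.grad).detach().clone()    # manual SGD update
optimizer.step()                                 # in-place update of w
print(torch.allclose(w, expected))

# Without clearing, a second backward pass adds to the stored gradient
w2 = torch.ones(2, requires_grad=True)
(w2 ** 2).sum().backward()
first = w2.grad.clone()
(w2 ** 2).sum().backward()
print(torch.equal(w2.grad, 2 * first))
```

The second half shows why the gradients have to be cleared: after two backward passes without clearing, `w2.grad` holds the sum of both gradients.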



In [15]:
# toy example illustrating how matrix multiplication can be used for a "weighted aggregation"
torch.manual_seed(42)
a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, 1, keepdim=True)
b = torch.randint(0,10,(3,2)).float()
c = a @ b
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)

a=
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
--
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
--
c=
tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])


In [68]:
# consider the following toy example:
torch.manual_seed(1337)
B,T,C = 4,8,2 # batch, time, channels
x = torch.randn(B,T,C)
x.shape, x

(torch.Size([4, 8, 2]),
 tensor([[[ 0.1808, -0.0700],
          [-0.3596, -0.9152],
          [ 0.6258,  0.0255],
          [ 0.9545,  0.0643],
          [ 0.3612,  1.1679],
          [-1.3499, -0.5102],
          [ 0.2360, -0.2398],
          [-0.9211,  1.5433]],
 
         [[ 1.3488, -0.1396],
          [ 0.2858,  0.9651],
          [-2.0371,  0.4931],
          [ 1.4870,  0.5910],
          [ 0.1260, -1.5627],
          [-1.1601, -0.3348],
          [ 0.4478, -0.8016],
          [ 1.5236,  2.5086]],
 
         [[-0.6631, -0.2513],
          [ 1.0101,  0.1215],
          [ 0.1584,  1.1340],
          [-1.1539, -0.2984],
          [-0.5075, -0.9239],
          [ 0.5467, -1.4948],
          [-1.2057,  0.5718],
          [-0.5974, -0.6937]],
 
         [[ 1.6455, -0.8030],
          [ 1.3514, -0.2759],
          [-1.5108,  2.1048],
          [ 2.7630, -1.7465],
          [ 1.4516, -1.5103],
          [ 0.8212, -0.2115],
          [ 0.7789,  1.5333],
          [ 1.6097, -0.4032]]]))

In [69]:
# We want xbow[b,t] = mean_{i<=t} x[b,i]:
# row t of each batch element becomes the mean of rows 0..t of x (an average over the time dimension)
xbow = torch.zeros((B,T,C))
for b in range(B):
    for t in range(T):
        xprev = x[b,:t+1] # (t+1,C)
        xbow[b,t] = torch.mean(xprev, dim=0)
xbow

tensor([[[ 0.1808, -0.0700],
         [-0.0894, -0.4926],
         [ 0.1490, -0.3199],
         [ 0.3504, -0.2238],
         [ 0.3525,  0.0545],
         [ 0.0688, -0.0396],
         [ 0.0927, -0.0682],
         [-0.0341,  0.1332]],

        [[ 1.3488, -0.1396],
         [ 0.8173,  0.4127],
         [-0.1342,  0.4395],
         [ 0.2711,  0.4774],
         [ 0.2421,  0.0694],
         [ 0.0084,  0.0020],
         [ 0.0712, -0.1128],
         [ 0.2527,  0.2149]],

        [[-0.6631, -0.2513],
         [ 0.1735, -0.0649],
         [ 0.1685,  0.3348],
         [-0.1621,  0.1765],
         [-0.2312, -0.0436],
         [-0.1015, -0.2855],
         [-0.2593, -0.1630],
         [-0.3015, -0.2293]],

        [[ 1.6455, -0.8030],
         [ 1.4985, -0.5395],
         [ 0.4954,  0.3420],
         [ 1.0623, -0.1802],
         [ 1.1401, -0.4462],
         [ 1.0870, -0.4071],
         [ 1.0430, -0.1299],
         [ 1.1138, -0.1641]]])

In [70]:
# version 2: using matrix multiply for a weighted aggregation
torch.manual_seed(1337)
B,T,C = 4,8,2 # batch, time, channels
x = torch.randn(B,T,C)
print('random\n', x)
print('\n')
wei = torch.tril(torch.ones(T, T))
print('weights\n', wei)
print('\n')
wei = wei / wei.sum(1, keepdim=True)
print('temporal weights\n', wei)
print('\n')
xbow2 = wei @ x # (T, T) @ (B, T, C): wei is broadcast across the batch to (B, T, T), so the result is (B, T, C)
print(xbow2)
#torch.allclose(xbow, xbow2)

random
 tensor([[[ 0.1808, -0.0700],
         [-0.3596, -0.9152],
         [ 0.6258,  0.0255],
         [ 0.9545,  0.0643],
         [ 0.3612,  1.1679],
         [-1.3499, -0.5102],
         [ 0.2360, -0.2398],
         [-0.9211,  1.5433]],

        [[ 1.3488, -0.1396],
         [ 0.2858,  0.9651],
         [-2.0371,  0.4931],
         [ 1.4870,  0.5910],
         [ 0.1260, -1.5627],
         [-1.1601, -0.3348],
         [ 0.4478, -0.8016],
         [ 1.5236,  2.5086]],

        [[-0.6631, -0.2513],
         [ 1.0101,  0.1215],
         [ 0.1584,  1.1340],
         [-1.1539, -0.2984],
         [-0.5075, -0.9239],
         [ 0.5467, -1.4948],
         [-1.2057,  0.5718],
         [-0.5974, -0.6937]],

        [[ 1.6455, -0.8030],
         [ 1.3514, -0.2759],
         [-1.5108,  2.1048],
         [ 2.7630, -1.7465],
         [ 1.4516, -1.5103],
         [ 0.8212, -0.2115],
         [ 0.7789,  1.5333],
         [ 1.6097, -0.4032]]])


weights
 tensor([[1., 0., 0., 0., 0., 0., 0., 0.],
  

In [75]:
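# version 3: softmax over masked scores gives the same averaging weights
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T,T))
# positions above the diagonal (the future) become -inf, so softmax gives them weight 0
wei = wei.masked_fill(tril == 0, float('-inf'))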
print('wei\n',wei)
wei = F.softmax(wei, dim=-1)
print('wei\n',wei)
xbow3 = wei @ x
print('xbow3\n',xbow3)
torch.allclose(xbow, xbow3)
print(x[0],xbow3[0])

wei
 tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., -inf],
        [0., 0., 0., 0., 0., 0., 0., 0.]])
wei
 tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])
xbow
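The three versions (explicit loops, a row-normalized lower-triangular matrix, and a softmax over a masked matrix of zeros) should all produce the same running averages. The following self-contained check assumes the same seed and shapes as the cells above; `atol=1e-6` is a tolerance chosen here to absorb float32 rounding differences.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(1337)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

# version 1: explicit loops over batch and time
xbow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xbow[b, t] = x[b, :t + 1].mean(dim=0)

# version 2: row-normalized lower-triangular matrix
tril = torch.tril(torch.ones(T, T))
xbow2 = (tril / tril.sum(1, keepdim=True)) @ x

# version 3: softmax over a masked matrix of zeros
wei = torch.zeros((T, T)).masked_fill(tril == 0, float('-inf'))
xbow3 = F.softmax(wei, dim=-1) @ x

print(torch.allclose(xbow, xbow2, atol=1e-6), torch.allclose(xbow, xbow3, atol=1e-6))
```

Version 3 looks like a detour, but it is the form that generalizes: once the zeros are replaced by learned scores, the same mask and softmax turn them into attention weights.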

In [76]:
# version 4: self-attention!
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

# let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)
wei =  q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)

tril = torch.tril(torch.ones(T, T))
#wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)

v = value(x)
out = wei @ v
#out = wei @ x

out.shape

torch.Size([4, 8, 16])
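The cell above uses raw dot products as attention scores. The paper linked at the top of this notebook divides the scores by the square root of the key dimension before the softmax (scaled dot-product attention), which keeps the softmax from saturating as the head size grows. A minimal sketch of that variant, assuming the same shapes as above and taking `head_size` as the key dimension:

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(1337)
B, T, C = 4, 8, 32
head_size = 16
x = torch.randn(B, T, C)

key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

k, q, v = key(x), query(x), value(x)                # each (B, T, head_size)
wei = q @ k.transpose(-2, -1) * head_size ** -0.5   # scaled scores, (B, T, T)

tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))     # hide future positions
wei = F.softmax(wei, dim=-1)                        # each row sums to 1
out = wei @ v                                       # (B, T, head_size)

print(out.shape)
print(wei[0])
```

Each row of `wei[0]` is a probability distribution over the current and earlier positions, and all entries above the diagonal are zero because of the mask.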