## ChatGPT: under the hood

This video was launched just a few months after ChatGPT became a sensation in early 2023. Since then, the models have become more powerful.

The neural network under the hood, which models the sequence of words, is defined in the landmark paper [Attention is all you need](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) by Vaswani et al. (2017). GPT stands for __Generative Pre-trained Transformer__.

_Our_ goal here is to draw upon the principles of the transformer architecture to generate _Shakespeare-like_ text after training on the Tiny Shakespeare dataset.

In [36]:
import torch
import torch.nn as nn
import torch.nn.functional as f
import matplotlib.pyplot as plt

In [3]:
import urllib.request
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
local_path = "input.txt"

urllib.request.urlretrieve(url, local_path)

('input.txt', <http.client.HTTPMessage at 0x2399c92ac90>)

In [4]:
# use `fh` for the file handle so the `f` alias for torch.nn.functional is not shadowed
with open(local_path, "r", encoding="utf-8") as fh:
    text = fh.read()

In [5]:
print(f'length of dataset = {len(text)} characters')

length of dataset = 1115394 characters


In [6]:
print(text[:100])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You


In [7]:
chars = sorted(set(text))  # unique characters in the dataset, in sorted order
vocab_size = len(chars)
print(''.join(chars), '|', vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz | 65


Note that the newline `\n` and the space `' '` are also characters. They sort first (indices 0 and 1), which is why the output above starts with a line break followed by a space. Next, we need a mapping from characters to integers so that we can tokenize the text.

In [13]:
stoi = {ch:i for i,ch in enumerate(chars)}
# stoi = {s:i for s,i in zip(chars, range(len(chars)))}
itos = {i:ch for ch,i in stoi.items()}
# print(itos, stoi)

encode = lambda s: [stoi[c] for c in s]           # string -> list of integers
decode = lambda l: ''.join([itos[i] for i in l])  # list of integers -> string

In [14]:
print(encode("Hii my name is Bucky Barnes3!"))
print(decode(encode("Hii my name is Bucky Barnes3!")))

[20, 47, 47, 1, 51, 63, 1, 52, 39, 51, 43, 1, 47, 57, 1, 14, 59, 41, 49, 63, 1, 14, 39, 56, 52, 43, 57, 9, 2]
Hii my name is Bucky Barnes3!


The scheme above is a very simple encoding/decoding mechanism. For example, Google uses [SentencePiece](https://github.com/google/sentencepiece), a _subword_ tokenizer that encodes text into integers. OpenAI, on the other hand, uses [tiktoken](https://github.com/openai/tiktoken):

In [17]:
import tiktoken
enc = tiktoken.get_encoding("o200k_base") # encoding used by GPT-4o
code = enc.encode("hello world im barney3-Anya_Magçus")
print(code, '|',len(code))

print('Total tokens in vocab =', enc.n_vocab)

[24912, 2375, 770, 3608, 4429, 18, 45326, 2090, 2372, 348, 704, 385] | 12
Total tokens in vocab = 200019


In [21]:
enc2 = tiktoken.get_encoding("cl100k_base")  # for GPT-3.5 / GPT-4-family
print(enc2.encode("hello world im barney3-Anya_Magçus"))

print('Total tokens in vocab =', enc2.n_vocab)

[15339, 1917, 737, 3703, 3520, 18, 59016, 7911, 1267, 351, 3209, 355]
Total tokens in vocab = 100277


Just for reference: OpenAI's `o200k_base` encoding has 200019 __subword__ tokens, in contrast to our 65 characters. The tiktoken encoders run locally and need no API calls to encode text; the vocabulary file for an encoding is downloaded once on first use and then cached.

While tiktoken is a subword tokenizer, we will stick with our character-level encoder for this video.

In [24]:
data = torch.tensor(encode(text), dtype = torch.long)

print(data.shape, data.dtype)
print(data[:100]) # this is how the text looks to our GPT

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])


Now let's create a train/validation split:

<span style="color:#FF0000; font-family: 'Bebas Neue'; font-size: 01em;">NOTE:</span> Shuffling this sequence is a _bad_ idea. The order of words is significant when it comes to NLP!

In [25]:
(0.9*(len(data)))

1003854.6

In [26]:
n = int(0.9*(len(data)))
train_data = data[:n]
val_data = data[n:]
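
As a quick sanity check on the split, here is a minimal sketch on a small stand-in tensor (`toy_data`, `train_part` and `val_part` are hypothetical names, not part of the notebook). It illustrates that slicing keeps the original order intact, which is exactly what shuffling would destroy:

```python
import torch

# Stand-in for the encoded dataset; the real `data` has 1,115,394 entries.
toy_data = torch.arange(20)

n = int(0.9 * len(toy_data))   # first 90% for training
train_part = toy_data[:n]
val_part = toy_data[n:]

# The two slices are contiguous and order-preserving: gluing them back
# together reproduces the original sequence exactly.
print(torch.equal(torch.cat([train_part, val_part]), toy_data))  # True
print(len(train_part), len(val_part))                            # 18 2
```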

### Training the transformer (TF)

We never feed the entire dataset into the TF at once, as that would be too computationally expensive. Instead, we train on chunks of at most `block_size` tokens (also called the `context_length`). Consider this:

In [27]:
block_size = 8 # hyperparameter
train_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

The chunk above has length `block_size + 1` and contains `block_size` training examples, one for each context length from 1 to `block_size`:

In [28]:
x = train_data[:block_size]
y = train_data[1:block_size+1] # next block_size chars, offset by 1

for i in range(block_size):
    context = x[:i+1]
    target = y[i]

    print(f"When input is {context} the target is {target}")

print('-------------',f'\n{x.dtype, y.dtype}')

When input is tensor([18]) the target is 47
When input is tensor([18, 47]) the target is 56
When input is tensor([18, 47, 56]) the target is 57
When input is tensor([18, 47, 56, 57]) the target is 58
When input is tensor([18, 47, 56, 57, 58]) the target is 1
When input is tensor([18, 47, 56, 57, 58,  1]) the target is 15
When input is tensor([18, 47, 56, 57, 58,  1, 15]) the target is 47
When input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target is 58
------------- 
(torch.int64, torch.int64)


Training on every one of these examples is essential: it gets the transformer 'used to seeing' contexts of all lengths (up to `block_size`), so that later we can start sampling from as little as one character!

Let's now create minibatches:

In [None]:
batch_size = 4 # number of independent sequences in one batch
block_size = 8 # context length of each sequence

def get_batch(split):
    # train_data and val_data are defined globally
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,)) # random start indices
    x = torch.stack([data[i:i+block_size] for i in ix], dim = 0) # one row per sequence: (batch_size, block_size)
    y = torch.stack([data[i+1 : i+1+block_size] for i in ix], dim = 0) # same rows, shifted right by one
    return x, y

- `block_size` is subtracted when sampling `ix` so that the slices never run past the end of the data.
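
To see why the subtraction is enough, here is a small sketch on a stand-in tensor (`toy_data`, `last_start` and `too_late` are hypothetical names used only for illustration). Since `torch.randint(high, ...)` samples from `[0, high)`, the largest start index is `high - 1`, and even there the target slice still has full length:

```python
import torch

block_size = 8
toy_data = torch.arange(50)  # hypothetical stand-in for train_data

# torch.randint samples from [0, high), so the largest possible start is high - 1.
high = len(toy_data) - block_size
last_start = high - 1

x_last = toy_data[last_start : last_start + block_size]
y_last = toy_data[last_start + 1 : last_start + 1 + block_size]

# Even at the last valid start, both slices have the full block_size length,
# and the target slice ends exactly at the last element of the data.
print(len(x_last), len(y_last))                # 8 8
print(y_last[-1].item() == len(toy_data) - 1)  # True

# A start index beyond that range would silently produce a shorter slice.
too_late = len(toy_data) - 3
print(len(toy_data[too_late : too_late + block_size]))  # 3
```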

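Note that the growing contexts shown earlier (lengths 1, 2, ..., `block_size`) cannot be 'stacked' into a single tensor, since each one is longer than the last. So `get_batch` returns fixed-length rows `xb`, `yb` instead, and each context is recovered as a prefix of a row.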
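
The stacking problem can be sketched as follows, assuming a toy chunk in place of real data (`chunk`, `contexts` and `recovered` are illustrative names, not part of the notebook):

```python
import torch

block_size = 8
chunk = torch.arange(block_size + 1)  # hypothetical stand-in for one chunk of data

# The growing contexts have lengths 1, 2, ..., block_size.
contexts = [chunk[: t + 1] for t in range(block_size)]

try:
    torch.stack(contexts)  # tensors of unequal size cannot be stacked
    stacked_ok = True
except RuntimeError:
    stacked_ok = False
print(stacked_ok)  # False

# Storing one fixed-length row works; each context is recovered as a prefix.
x = chunk[:block_size]
y = chunk[1 : block_size + 1]
recovered = [(x[: t + 1].tolist(), y[t].item()) for t in range(block_size)]
print(recovered[2])  # ([0, 1, 2], 3)
```

One fixed-length row of `x` together with the matching row of `y` therefore carries all `block_size` (context, target) pairs at once.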

In [30]:
torch.manual_seed(1337) # seed for this cell
xb,yb = get_batch('train')

print(f"xb_stats: xb_shape = {xb.shape}")
print(xb)
print('targets:')
print(yb.shape,'\n',yb)

xb_stats: xb_shape = torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets:
torch.Size([4, 8]) 
 tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])


Let's visualize the information contained in this batch:

In [None]:
for b in range(batch_size - 2): # only the first two samples, to keep the output short
    print(f"data point {b}")
    for t in range(block_size):
        local_context = xb[b , :t+1]
        local_target = yb[b, t ]

        print(f"for context = {local_context.tolist()}, target = {local_target} ")

data point 0
for context = [24], target = 43 
for context = [24, 43], target = 58 
for context = [24, 43, 58], target = 5 
for context = [24, 43, 58, 5], target = 57 
for context = [24, 43, 58, 5, 57], target = 1 
for context = [24, 43, 58, 5, 57, 1], target = 46 
for context = [24, 43, 58, 5, 57, 1, 46], target = 43 
for context = [24, 43, 58, 5, 57, 1, 46, 43], target = 39 
data point 1
for context = [44], target = 53 
for context = [44, 53], target = 56 
for context = [44, 53, 56], target = 1 
for context = [44, 53, 56, 1], target = 58 
for context = [44, 53, 56, 1, 58], target = 46 
for context = [44, 53, 56, 1, 58, 46], target = 39 
for context = [44, 53, 56, 1, 58, 46, 39], target = 58 
for context = [44, 53, 56, 1, 58, 46, 39, 58], target = 1 


So we feed `xb` and `yb` into the transformer, which processes all of the (context, target) pairs above simultaneously.

__Terminology for a batch:__ dimension = (B,T,C), where
- B = batch size
- T = time, i.e. the number of positions in each sequence (here `block_size`)
- C = number of channels (features per position)
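
A minimal sketch of where the three dimensions come from, assuming an `nn.Embedding` lookup table with as many channels as there are characters (the variable names here are illustrative):

```python
import torch
import torch.nn as nn

B, T, C = 4, 8, 65   # batch size, context length, channels (here: vocab size)

idx = torch.randint(C, (B, T))   # a batch of token indices, shape (B, T)
embedding = nn.Embedding(C, C)   # lookup table: one C-dimensional row per token

logits = embedding(idx)          # each index is replaced by its row of the table
print(idx.shape, logits.shape)   # torch.Size([4, 8]) torch.Size([4, 8, 65])
```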

In [38]:
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size) # each char has a 'vocab_size'-dimensional embedding

    def forward(self, idx, targets):
        # idx and targets are both of dimension (B,T)
        logits = self.token_embedding_table(idx) # (B,T,C): C = embedding dimension

        loss = f.cross_entropy(logits, targets) # raises a shape error when called - see below

        return logits, loss


We run into a shape error when computing the `cross_entropy` loss. According to the documentation for [nn.functional.cross_entropy](https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.cross_entropy.html#torch.nn.functional.cross_entropy), for multidimensional logits (here of shape __(4, 8, 65)__) the expected shape is `(B, C, T)` rather than our `(B, T, C)`, where C = 65 is the class dimension along which the loss is computed.
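
There are two common ways to satisfy this shape requirement. The sketch below uses random stand-in logits and targets (not the model's actual outputs) and suggests that flattening to `(B*T, C)` and transposing to `(B, C, T)` give the same loss:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, C = 4, 8, 65
logits = torch.randn(B, T, C)        # stand-in for the model output
targets = torch.randint(C, (B, T))   # stand-in for the target indices

# Option 1: flatten batch and time so every position is one classification example.
loss_flat = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))

# Option 2: move the class dimension to position 1, giving shape (B, C, T).
loss_bct = F.cross_entropy(logits.transpose(1, 2), targets)

print(torch.allclose(loss_flat, loss_bct))  # True
```

Both variants average the loss over all `B*T` positions, so either should work here; the choice is mostly a matter of taste.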

In [39]:
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size) # each char has a 'vocab_size'-dimensional embedding

    def forward(self, idx, targets):
        # idx and targets are both of dimension (B,T)
        logits = self.token_embedding_table(idx) # (B,T,C): C = embedding dimension
        B, T, C = logits.shape
        # flatten batch and time so that the class dimension C comes second, as cross_entropy expects
        loss = f.cross_entropy(logits.view(B*T, C), targets.view(B*T))

        return logits, loss