## ChatGPT: under the hood

This video was released in early 2023, just a few months after ChatGPT became a sensation. Since then, the models have become more powerful.

The neural network under the hood, which models the sequence of words, is defined in the landmark paper [Attention Is All You Need](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) by Vaswani et al. (2017). GPT stands for __Generative Pre-trained Transformer__.

_Our_ goal here is to draw upon the principles of the transformer architecture to generate _Shakespeare-esque_ text after training on the Tiny Shakespeare dataset.

In [42]:
import torch
import torch.nn as nn
import torch.nn.functional as f
import matplotlib.pyplot as plt

In [2]:
import urllib.request
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
local_path = "input.txt"

urllib.request.urlretrieve(url, local_path)

('input.txt', <http.client.HTTPMessage at 0x244234a9010>)

In [43]:
# Use `fh` for the file handle so it does not overwrite the `f` alias of torch.nn.functional
with open(local_path, "r", encoding="utf-8") as fh:
    text = fh.read()

In [44]:
print(f'length of dataset = {len(text)} characters')

length of dataset = 1115394 characters


In [45]:
print(text[:100])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You


In [46]:
chars = sorted(set(text))  # unique characters in the dataset, in sorted order
vocab_size = len(chars)
print(''.join(chars), '|', vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz | 65


Note that `'\n'` (newline) and `' '` (space) are characters too. They sort first, which is why the output above starts with a line break followed by a space. Next, we need a mapping from characters to integers so that we can tokenize the text.

In [47]:
stoi = {ch:i for i,ch in enumerate(chars)}
itos = {i:ch for ch,i in stoi.items()}
# print(itos, stoi)

encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

In [48]:
print(encode("Hii my name is Barney3!"))
print(decode(encode("Hii my name is Barney3!")))

[20, 47, 47, 1, 51, 63, 1, 52, 39, 51, 43, 1, 47, 57, 1, 14, 39, 56, 52, 43, 63, 9, 2]
Hii my name is Barney3!
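
One limitation of this mapping is worth seeing in action: it only knows the characters that occur in the text it was built from. The sketch below rebuilds the same kind of mapping on a toy string (the names `toy_text`, `toy_encode` and `toy_decode` are illustrative and not part of the notebook) and shows that an unseen character raises a `KeyError`.

```python
# Toy stand-in for the dataset; the notebook builds the same maps from `text`.
toy_text = "hello world"
toy_chars = sorted(set(toy_text))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for ch, i in toy_stoi.items()}


def toy_encode(s):
    """Map each character to its integer id."""
    return [toy_stoi[c] for c in s]


def toy_decode(ids):
    """Map integer ids back to a string."""
    return "".join(toy_itos[i] for i in ids)


# Round trip: decoding an encoding returns the original string.
round_trip = toy_decode(toy_encode("hello"))
print(round_trip)

# '!' never appears in toy_text, so it has no integer id.
try:
    toy_encode("hello!")
    unknown_ok = True
except KeyError as err:
    unknown_ok = False
    print("unknown character:", err)
```

For the Shakespeare dataset this means `encode` works for `"Barney3!"` only because `3` and `!` happen to be among the 65 characters.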


The scheme above is a very simple encoder/decoder. For example, Google uses [SentencePiece](https://github.com/google/sentencepiece), a _subword_ tokenizer that encodes text into integers. OpenAI, on the other hand, uses [tiktoken](https://github.com/openai/tiktoken):

In [39]:
import tiktoken
enc = tiktoken.get_encoding("o200k_base") # for gpt 4o
print(enc.encode("hello world im barney3-Anya_Magçus"))

print('Total tokens in vocab =', enc.n_vocab)

[24912, 2375, 770, 3608, 4429, 18, 45326, 2090, 2372, 348, 704, 385]
Total tokens in vocab = 200019


In [None]:
enc2 = tiktoken.get_encoding("cl100k_base")  # for GPT-3.5 / GPT-4-family
print(enc2.encode("hello world im barney3"))

print('Total tokens in vocab =', enc2.n_vocab)

[15339, 1917, 737, 3703, 3520, 18]
Total tokens in vocab = 100277


Just for reference: OpenAI's `o200k_base` encoding has 200019 __sub-word__ tokens, in contrast to our 65 characters. Encoding and decoding with tiktoken run locally without API calls; the library fetches the vocabulary file on first use and caches it.

While tiktoken is a subword tokenizer, we will stick with our character-level encoder for this video.
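
The trade-off between the two approaches is vocabulary size against sequence length. The toy sketch below makes that concrete with a greedy longest-match splitter. This is an assumption made for illustration only: it is not how tiktoken or SentencePiece work internally, and `greedy_tokenize` and the two extra vocabulary pieces are hypothetical.

```python
def greedy_tokenize(s, vocab):
    """Split s into the longest matching vocabulary pieces, left to right."""
    max_len = max(len(piece) for piece in vocab)
    tokens, i = [], 0
    while i < len(s):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(s) - i), 0, -1):
            piece = s[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            raise KeyError(f"no vocabulary piece matches {s[i]!r}")
    return tokens


sentence = "speak, speak"
char_vocab = set(sentence)                     # character-level vocabulary
subword_vocab = char_vocab | {"speak", ", "}   # hypothetical merged pieces

char_tokens = greedy_tokenize(sentence, char_vocab)
subword_tokens = greedy_tokenize(sentence, subword_vocab)

print("char-level:", len(char_vocab), "vocab entries,", len(char_tokens), "tokens")
print("subword:   ", len(subword_vocab), "vocab entries,", len(subword_tokens), "tokens")
```

A larger vocabulary yields shorter sequences for the same text, which is the reason production models prefer subword tokens even though a character-level vocabulary is far simpler.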

In [53]:
data = torch.tensor(encode(text), dtype = torch.long)

print(data.shape, data.dtype)
print(data[:100]) # this is how the text will look to our GPT

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])


Now let's create a train/validation split:

<span style="color:#FF0000; font-family: 'Bebas Neue'; font-size: 01em;">NOTE:</span> Shuffling this sequence is a _bad_ idea. The order of words is significant when it comes to NLP!

In [58]:
(0.9*(len(data)))

1003854.6

In [59]:
n = int(0.9*(len(data)))
train_data = data[:n]
val_data = data[n:]
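
As a quick sanity check on this kind of split, the sketch below repeats it on a toy tensor (`toy`, `n_toy`, `toy_train` and `toy_val` are illustrative names, not part of the notebook). It assumes the same 90/10 ratio and shows that `int()` truncates the split index and that the two pieces are contiguous, so no character is shuffled, dropped, or duplicated.

```python
import torch

toy = torch.arange(20)         # stand-in for the encoded dataset
n_toy = int(0.9 * len(toy))    # int() truncates, just like int(1003854.6) -> 1003854
toy_train, toy_val = toy[:n_toy], toy[n_toy:]

# The two slices are contiguous: concatenating them restores the original order.
restored = torch.cat([toy_train, toy_val])
print(len(toy_train), len(toy_val), torch.equal(restored, toy))
```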

### Training the transformer (TF)

We never feed the entire dataset into the TF at once, as that would be too computationally expensive. Instead, we train on small chunks of the text, each with a maximum length of `block_size` (also called `context_length`). Consider this:

In [61]:
block_size = 8 # hyperparameter
train_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

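A chunk of `block_size + 1` characters like the one above packs in 8 separate training examples, one per position. Each of the 8 inputs needs the character that follows it as a target, which is why we take `block_size+1` elements: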

In [73]:
x = train_data[:block_size]
y = train_data[1:block_size+1]

for i in range(block_size):
    context = x[:i+1]
    target = y[i]

    print(f"When input is {context} the target is {target}")

print('-------------',f'\n{x.dtype, y.dtype}')

When input is tensor([18]) the target is 47
When input is tensor([18, 47]) the target is 56
When input is tensor([18, 47, 56]) the target is 57
When input is tensor([18, 47, 56, 57]) the target is 58
When input is tensor([18, 47, 56, 57, 58]) the target is 1
When input is tensor([18, 47, 56, 57, 58,  1]) the target is 15
When input is tensor([18, 47, 56, 57, 58,  1, 15]) the target is 47
When input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target is 58
------------- 
(torch.int64, torch.int64)


Training on every one of these examples matters: it gets the transformer 'used to seeing' contexts of every length, from a single character up to `block_size`, which is what it will encounter when we sample from it later!
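
The same unpacking can be wrapped in a small helper that works at any offset into the data. This is a minimal sketch: `context_target_pairs` is a hypothetical function, not part of the notebook, and `toy_data` simply reuses the nine token ids printed earlier.

```python
import torch


def context_target_pairs(data, start, block_size):
    """Return the block_size (context, target) examples held in one chunk."""
    chunk = data[start:start + block_size + 1]   # one extra element for the last target
    x, y = chunk[:-1], chunk[1:]                 # targets are the inputs shifted by one
    return [(x[:i + 1], y[i]) for i in range(block_size)]


toy_data = torch.tensor([18, 47, 56, 57, 58, 1, 15, 47, 58])
pairs = context_target_pairs(toy_data, start=0, block_size=8)

print(len(pairs))
for context, target in pairs[:3]:
    print(context.tolist(), "->", target.item())
```

The contexts grow from length 1 to `block_size`, so a single chunk covers every context length the model will be asked to handle.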

Let's now create minibatches: