<a href="https://colab.research.google.com/github/JeannePul/Building_ChatGPT/blob/main/ChatGPT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Goal: Understand how GPT-2 works and build a nanoGPT by following the lecture [Let's build GPT: from scratch, in code, spelled out.](https://www.youtube.com/watch?v=kCc8FmEb1nY) by Andrej Karpathy.



**1. Load the data**

Here: A tiny Shakespeare dataset

In [None]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn//master/data/tinyshakespeare/input.txt

--2023-11-12 06:19:58--  https://raw.githubusercontent.com/karpathy/char-rnn//master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /karpathy/char-rnn/master/data/tinyshakespeare/input.txt [following]
--2023-11-12 06:19:58--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Reusing existing connection to raw.githubusercontent.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2023-11-12 06:19:59 (15.2 MB/s) - ‘input.txt’ saved [1115394/1115394]



**2. Look at the data, play with it a bit:**

In [None]:
with open('input.txt', 'r', encoding= 'utf-8') as f:
  text = f.read()

*with open(textfile, mode, encoding=...) -> opens your text file and closes it automatically when the with block ends. Here: mode = 'r', because we want to read the file!*
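
To see the closing behaviour directly, here is a minimal sketch. It writes a small throwaway file (the name `demo.txt` and its contents are made up for this illustration, they are not part of the notebook) and checks the file object's `closed` attribute inside and after the `with` block.

```python
import os
import tempfile

# Hypothetical throwaway file, only for this illustration.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(demo_path, "w", encoding="utf-8") as f:
    f.write("First Citizen:\n")

with open(demo_path, "r", encoding="utf-8") as f:
    demo_text = f.read()
    open_inside = not f.closed   # the file is still open inside the block

closed_after = f.closed          # the file was closed when the block ended

print(open_inside, closed_after, repr(demo_text))
```

Without `with`, I would have to call `f.close()` myself.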

In [None]:
print('length of dataset in characters: ', len(text))

length of dataset in characters:  1115394


In [None]:
print(text[0:50]) # first 50 characters of the Shakespeare text (index 50 is excluded)

First Citizen:
Before we proceed any further, hear


In [None]:
set(text) #no duplicates, unordered

{'\n',
 ' ',
 '!',
 '$',
 '&',
 "'",
 ',',
 '-',
 '.',
 '3',
 ':',
 ';',
 '?',
 'A',
 'B',
 'C',
 'D',
 'E',
 'F',
 'G',
 'H',
 'I',
 'J',
 'K',
 'L',
 'M',
 'N',
 'O',
 'P',
 'Q',
 'R',
 'S',
 'T',
 'U',
 'V',
 'W',
 'X',
 'Y',
 'Z',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z'}

In [None]:
list(set(text)) # a list is easier to work with. If I only did list(text), I'd get the whole text as a list of characters.

['X',
 'j',
 't',
 'h',
 'T',
 'Z',
 'i',
 'r',
 'W',
 'J',
 'R',
 'l',
 '$',
 ':',
 'c',
 'f',
 'g',
 'x',
 'm',
 'I',
 'q',
 'v',
 'B',
 'b',
 'V',
 'U',
 'M',
 '!',
 'y',
 "'",
 'L',
 ' ',
 'o',
 'S',
 'k',
 'd',
 '.',
 'a',
 'u',
 'z',
 's',
 'e',
 '?',
 'D',
 'Q',
 '3',
 'Y',
 'p',
 'G',
 'C',
 ',',
 '\n',
 'P',
 ';',
 'N',
 'E',
 '&',
 'F',
 'w',
 'K',
 'A',
 'n',
 'O',
 'H',
 '-']

In [None]:
chars = sorted(list(set(text)))
''.join(chars)


"\n !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

In [None]:
print(''.join(chars))


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz


*Interesting: the "\n" is rendered as an actual line break when I print it.*

*The '' (empty string) in front of .join lets me join the characters without anything in between. This way I can look at them in one row.*
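
A minimal sketch of both observations, using a small made-up subset of the characters (`chars_demo` is a name chosen for this illustration): the string in front of `.join` is the separator, and `repr` shows the escape sequence that `print` renders as a line break.

```python
chars_demo = ['\n', ' ', '!', 'A', 'b']  # made-up subset of the vocabulary

no_gap = ''.join(chars_demo)     # empty separator: nothing between the characters
with_gap = '-'.join(chars_demo)  # any other separator is inserted between them

print(repr(no_gap))    # repr keeps the newline visible as an escape sequence
print(no_gap)          # print renders the newline as a real line break
print(repr(with_gap))
```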

In [None]:
print(len(chars))

65


**3. Tokenize the Text.**

In this example we work with the individual characters: each character is one token. No word-level vector model (word embeddings) is used.

-> Character-based language model

For this, we create an encoder and a decoder.

In [None]:
encode = {1:'a'}
encode[1]

'a'

In [None]:
encoder_dict = {} #how I usually do it
for i,j in enumerate(chars):
  encoder_dict[j] = i

decoder_dict = { i:j for i,j in enumerate(chars)} #better code!

In [None]:
encode = lambda eps: [encoder_dict[z] for z in eps]
decode = lambda omg: ''.join([decoder_dict[z] for z in omg])

In [None]:
print(encode('I am Jeanne!'))
print(decode(encode('I am Jeanne!')))

[21, 1, 39, 51, 1, 22, 43, 39, 52, 52, 43, 2]
I am Jeanne!
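
A self-contained sketch of the same idea, under the assumption of a much smaller text (`sample_text`, `stoi`, `itos`, `encode_chars` and `decode_ids` are names made up for this illustration). It checks that decoding an encoding gives back the original string, and shows that a character outside the vocabulary cannot be encoded.

```python
# Made-up stand-in for the full dataset.
sample_text = "First Citizen:\nBefore we proceed any further, hear me speak."

vocab = sorted(set(sample_text))
stoi = {ch: i for i, ch in enumerate(vocab)}   # character -> integer
itos = {i: ch for i, ch in enumerate(vocab)}   # integer -> character

def encode_chars(s):
    return [stoi[ch] for ch in s]

def decode_ids(ids):
    return ''.join(itos[i] for i in ids)

ids = encode_chars("hear me")
assert decode_ids(ids) == "hear me"   # round trip gives back the input

try:
    encode_chars("Jeanne!")   # 'J' and '!' do not occur in sample_text
except KeyError as err:
    print("unknown character:", err)
```

With the full Shakespeare text the vocabulary has 65 characters, so this only becomes a problem for characters that never appear in the dataset.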


***Idea for when I am done:*** Try to adapt the code so that it uses another tokenizer.
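
One possible direction, as a rough sketch only: a toy word-level tokenizer built with the standard `re` module. The regular expression, the sample text and all names here (`pieces`, `word_to_id`, `encode_words`, ...) are assumptions for this illustration. This is not the tokenizer from the lecture or from GPT-2.

```python
import re

# Made-up stand-in for the full dataset.
sample = "First Citizen:\nBefore we proceed any further, hear me speak."

# Words, runs of whitespace, and single punctuation marks.
# Every character falls into one of the three groups, so joining
# the pieces reproduces the text exactly.
PATTERN = r"\w+|\s+|[^\w\s]"
pieces = re.findall(PATTERN, sample)

word_vocab = sorted(set(pieces))
word_to_id = {w: i for i, w in enumerate(word_vocab)}
id_to_word = {i: w for w, i in word_to_id.items()}

def encode_words(s):
    return [word_to_id[p] for p in re.findall(PATTERN, s)]

def decode_words(ids):
    return ''.join(id_to_word[i] for i in ids)

word_ids = encode_words(sample)
print(len(sample), "characters ->", len(word_ids), "tokens")
```

The trade-off compared with the character-level version: the sequences get shorter, but the vocabulary grows and unseen words cannot be encoded.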

In [None]:
import torch
data = torch.tensor(encode(text), dtype=torch.long) # we put the encoded text in tensor form. -> Easier to work with!

In [None]:
print(data.shape, "\n", data.dtype)

torch.Size([1115394]) 
 torch.int64


In [None]:
print(data[:100])

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])


-> This is the data that we will give to the NN to learn from (note: we will split it into a train set and a test set, but both sets will look like this!)

In the video, he calls the test set the "validation set". I don't want to do that.

**4. Split into Train & Test set and turn the data into smaller "blocks"**

In [None]:
n = int(0.9 * len(data)) # 90%
train_data = data[:n]
test_data = data[n:]
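
A quick sanity check of the split, sketched with a plain Python list instead of the tensor (`tokens`, `train_part` and `test_part` are stand-in names for this illustration). The split is by position, not shuffled: the first 90% of the text is used for training and the last 10% is held out.

```python
total = 1115394              # dataset length printed earlier
n_split = int(0.9 * total)   # same rule as in the cell above

tokens = list(range(total))  # stand-in for the encoded text
train_part = tokens[:n_split]
test_part = tokens[n_split:]

# The two parts cover the whole dataset without overlap.
print(len(train_part), len(test_part), len(train_part) + len(test_part) == total)
```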

When we train the transformer, we only feed it small chunks of the dataset at a time. These chunks of encoded text must not exceed a certain length.

This maximum length is called the "block size".

Below we take a block of 9 characters (block_size + 1). This doesn't mean that we only train to predict the 9th character (as I would have thought).
We can use this one block to train for the second, the third, and so on up to the 9th character simultaneously. Each time, the NN just looks at the preceding characters and guesses the next one.

-> With a 9-character block, we get 8 training examples.

In [None]:
block_size = 8
train_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [None]:
x = train_data[:block_size]
y = train_data[1:block_size + 1]

print(x, '\n', y)

tensor([18, 47, 56, 57, 58,  1, 15, 47]) 
 tensor([47, 56, 57, 58,  1, 15, 47, 58])


In [None]:
for t in range(block_size):
  context = x[:t+1]
  target = y[t]
  print(f'when input is {context}, the target is: {target}') #note: great way to print!

when input is tensor([18]), the target is: 47
when input is tensor([18, 47]), the target is: 56
when input is tensor([18, 47, 56]), the target is: 57
when input is tensor([18, 47, 56, 57]), the target is: 58
when input is tensor([18, 47, 56, 57, 58]), the target is: 1
when input is tensor([18, 47, 56, 57, 58,  1]), the target is: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]), the target is: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]), the target is: 58


Why do we do this?

If we always trained on inputs of the same length, our NN would not be able to work with shorter sequences. This way, the model sees every context length from 1 up to block_size and is more adaptable!
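
A small sketch of why the short contexts matter, under the assumption that text is later generated one token at a time starting from a single token. `crop_context` and `dummy_next_token` are hypothetical helpers for this illustration; the dummy function only stands in for a trained model.

```python
block_size_demo = 8

def crop_context(tokens, block_size):
    """Keep at most the last `block_size` tokens as model input."""
    return tokens[-block_size:]

def dummy_next_token(context):
    # Placeholder for a trained model: just returns the context length.
    return len(context)

generated = [18]        # start from a single token
context_lengths = []
for _ in range(10):
    context = crop_context(generated, block_size_demo)
    context_lengths.append(len(context))
    generated.append(dummy_next_token(context))

# The context grows from 1 token up to block_size and then stays there.
print(context_lengths)
```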

**5. Let's make our input tensor!**

-> What Batch size?

-> How long should the block be?

-> How do we randomize it?

In [None]:
torch.manual_seed(1337) # fix the seed: the numbers are still random, but the same on every run
batch_size = 4
block_size = 8

def get_batch(split):
  data = train_data if split == 'train' else test_data
  # ix = batch_size (here 4) random start positions in [0, len(data) - block_size)
  ix = torch.randint(len(data) - block_size, (batch_size,))
  print(ix.type())
  x = torch.stack([data[i:i+block_size] for i in ix]) # x = 4 input vectors starting at the 4 random positions in ix
  y = torch.stack([data[i+1:i+block_size+1] for i in ix]) # y = the same 4 windows shifted by one: the targets to compare the NN output with
  return x,y

xb, yb = get_batch('train')

""" How I would have coded it:
def get_batch(split):
  if split == 'train':
    data = train_data
  elif split == 'test':
    data = test_data
  ix = ...
"""

torch.LongTensor


" How I would have coded it:\ndef get_batch(split):\n  if split == 'train':\n    data = train_data\n  elif split == 'test':\n    data = test_data\n  ix = ...\n"

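*torch.stack(tensors, dim=0) joins a sequence of same-shaped tensors along a new dimension. Here, 4 vectors of length 8 become one 4x8 tensor.*

*ix.type() returns the tensor's type as a string. Here it is torch.LongTensor, i.e. 64-bit integers.*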
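
To make the difference between stacking and concatenating concrete, here is a minimal sketch with four made-up 1-D tensors (`rows` is a name chosen for this illustration). `torch.stack` creates a new leading dimension, while `torch.cat` joins along the existing one.

```python
import torch

# Four made-up 1-D tensors of length 8.
rows = [torch.arange(8) + i for i in range(4)]

stacked = torch.stack(rows)     # new leading dimension: shape (4, 8)
concatenated = torch.cat(rows)  # joined along the existing dimension: shape (32,)

print(stacked.shape, concatenated.shape)
```

For the batches we want the stacked form: one row per sampled block.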

In [None]:
print('input vectors: \n', xb.shape, '\n', xb)
print('output vectors: \n', yb.shape, '\n', yb)

input vectors: 
 torch.Size([4, 8]) 
 tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
output vectors: 
 torch.Size([4, 8]) 
 tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])


In [None]:
for b in range(batch_size): # First: The first vector...
  print(f'Vector number {b+1}')

  for t in range(block_size): # Then all possible parts of this vector
    context = xb[b, :t+1]
    target = yb[b,t]
    print(f'when input is {context.tolist()}, the target is: {target}')

Vector number 1
when input is [24], the target is: 43
when input is [24, 43], the target is: 58
when input is [24, 43, 58], the target is: 5
when input is [24, 43, 58, 5], the target is: 57
when input is [24, 43, 58, 5, 57], the target is: 1
when input is [24, 43, 58, 5, 57, 1], the target is: 46
when input is [24, 43, 58, 5, 57, 1, 46], the target is: 43
when input is [24, 43, 58, 5, 57, 1, 46, 43], the target is: 39
Vector number 2
when input is [44], the target is: 53
when input is [44, 53], the target is: 56
when input is [44, 53, 56], the target is: 1
when input is [44, 53, 56, 1], the target is: 58
when input is [44, 53, 56, 1, 58], the target is: 46
when input is [44, 53, 56, 1, 58, 46], the target is: 39
when input is [44, 53, 56, 1, 58, 46, 39], the target is: 58
when input is [44, 53, 56, 1, 58, 46, 39, 58], the target is: 1
Vector number 3
when input is [52], the target is: 58
when input is [52, 58], the target is: 1
when input is [52, 58, 1], the target is: 58
when input is