# building mini chat-gpt

Video: [Let's build GPT: from scratch, in code, spelled out.](https://youtu.be/kCc8FmEb1nY?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ&t=561) - Andrej Karpathy

Paper: [Attention is All You Need](https://arxiv.org/abs/1706.03762)

<img src="gpt.png" width="600" height="800" style="margin-left:auto; margin-right:auto"/>

## Part 1: explore the dataset

In [1]:
# imports
import torch

In [2]:
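# read the whole lyrics file into one string; set the encoding explicitly so the result does not depend on the platform default
with open('kanye_lyrics.txt', 'r', encoding='utf-8') as f:
    lyrics = f.read()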

In [3]:
print('Length of dataset:    ', len(lyrics), 'characters')

Length of dataset:     353441 characters


### 1.1 view the data

In [4]:
print('First 1000 characters:\n ', lyrics[:1000])

First 1000 characters:
  ﻿[Chorus]
Sing every hour (Every hour, 'til the power)
Every minute (Every minute, of the Lord)
Every second (Every second, comes)
Sing each and every millisecond (Down)
We need you (We need you, sing 'til the power)
We need you (We need you, of the Lord)
We need you (Comes)
Oh, we need you (Down)


[Verse]
Sing 'til the power of the Lord comes down
Sing 'til the power of the Lord comes down
Sing 'til the power of the Lord comes down (Let everything that have breath praise God)
Sing 'til the power of the Lord comes down ('Cause when we sing the glory of the Lord comes down, down)
Sing 'til the power of the Lord comes down (Praising the Lord, praise God in the sanctuary)
Sing 'til the power of the Lord comes down (For His mighty works and excellent grace and His mighty power, yeah)
Sing 'til the power of the Lord comes down (Sing 'til the power of the Lord falls down)
Sing 'til the power of the Lord comes down (We are to sing 'til the power of the Lord comes dow

### 1.2 vocabulary size

In [5]:
# every distinct character in the text, in sorted order
chars = sorted(set(lyrics))
vocab_size = len(chars)
print(''.join(chars))
print('Number of unique characters: ', vocab_size)


 !"#$&'()*+,-./0123456789:;?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]abcdefghijklmnopqrstuvwxyzÁáéñóöúāŐ–—‘’“”…⁠﻿
Number of unique characters:  101


## Part 2: encoding and decoding

### 2.1 encode and decode functions

In [6]:
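# character <-> integer lookup tables built from the sorted vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s]           # string -> list of integers
decode = lambda l: ''.join([itos[i] for i in l])  # list of integers -> string
print(encode('hii there'))
print(decode(encode('hii there')))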

[64, 65, 65, 1, 76, 64, 61, 74, 61]
hii there


In [7]:
print(encode("Ándale, ándale E.I, E.I, uh-oh"))
print(decode(encode("Ándale, ándale E.I, E.I, uh-oh")))

[83, 70, 60, 57, 68, 61, 12, 1, 84, 70, 60, 57, 68, 61, 1, 33, 14, 37, 12, 1, 33, 14, 37, 12, 1, 77, 64, 13, 71, 64]
Ándale, ándale E.I, E.I, uh-oh


### 2.2 encode the dataset

In [8]:
data = torch.tensor(encode(lyrics), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:1000])

torch.Size([353441]) torch.int64
tensor([100,  55,  31,  64,  71,  74,  77,  75,  56,   0,  47,  65,  70,  63,
          1,  61,  78,  61,  74,  81,   1,  64,  71,  77,  74,   1,   8,  33,
         78,  61,  74,  81,   1,  64,  71,  77,  74,  12,   1,   7,  76,  65,
         68,   1,  76,  64,  61,   1,  72,  71,  79,  61,  74,   9,   0,  33,
         78,  61,  74,  81,   1,  69,  65,  70,  77,  76,  61,   1,   8,  33,
         78,  61,  74,  81,   1,  69,  65,  70,  77,  76,  61,  12,   1,  71,
         62,   1,  76,  64,  61,   1,  40,  71,  74,  60,   9,   0,  33,  78,
         61,  74,  81,   1,  75,  61,  59,  71,  70,  60,   1,   8,  33,  78,
         61,  74,  81,   1,  75,  61,  59,  71,  70,  60,  12,   1,  59,  71,
         69,  61,  75,   9,   0,  47,  65,  70,  63,   1,  61,  57,  59,  64,
          1,  57,  70,  60,   1,  61,  78,  61,  74,  81,   1,  69,  65,  68,
         68,  65,  75,  61,  59,  71,  70,  60,   1,   8,  32,  71,  79,  70,
          9,   0,  51,  61,   1

### 2.3 data split

In [9]:
train_split = 0.9
n = int(train_split * len(data))
train_data = data[:n]
val_data = data[n:]
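
As a quick sanity check on the split, the sketch below repeats the same slicing on a stand-in tensor with the same length as the dataset (353441 tokens). The `torch.arange` data is an assumption, used only so the snippet runs without the lyrics file; the split logic itself mirrors the cell above.

```python
import torch

# Stand-in for the encoded lyrics: same length, dummy values (assumption).
data = torch.arange(353441, dtype=torch.long)

train_split = 0.9
n = int(train_split * len(data))  # int() truncates the fractional part
train_data = data[:n]
val_data = data[n:]

# The two slices should cover the dataset with no overlap and no gap.
print(len(train_data), len(val_data))
assert len(train_data) + len(val_data) == len(data)
```

The first 90% of the tokens go to training and the final 10% are held out, so the validation text is a contiguous tail of the corpus rather than a random sample.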

### 2.4 note on training

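When we train the model, we will not feed it the entire dataset at once, nor will we train on isolated character pairs. Instead, we sample fixed-length chunks of text; the chunk length, `block_size`, is the longest context the model sees. A chunk of `block_size + 1` characters packs `block_size` training examples, one for each context length from 1 to `block_size`.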

In [10]:
# start with block_size = 8
block_size = 8
train_data[:block_size+1]

tensor([100,  55,  31,  64,  71,  74,  77,  75,  56])

In [11]:
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"When input is {context} the target is {target}")
for t in range(block_size):
    context = decode(x[:t+1].tolist())
    target = decode([y[t].item()])
    print(f"When input is {context} the target is {target}")

When input is tensor([100]) the target is 55
When input is tensor([100,  55]) the target is 31
When input is tensor([100,  55,  31]) the target is 64
When input is tensor([100,  55,  31,  64]) the target is 71
When input is tensor([100,  55,  31,  64,  71]) the target is 74
When input is tensor([100,  55,  31,  64,  71,  74]) the target is 77
When input is tensor([100,  55,  31,  64,  71,  74,  77]) the target is 75
When input is tensor([100,  55,  31,  64,  71,  74,  77,  75]) the target is 56
When input is ﻿ the target is [
When input is ﻿[ the target is C
When input is ﻿[C the target is h
When input is ﻿[Ch the target is o
When input is ﻿[Cho the target is r
When input is ﻿[Chor the target is u
When input is ﻿[Choru the target is s
When input is ﻿[Chorus the target is ]


### 2.5 data loader

In [12]:
torch.manual_seed(1337)
batch_size = 4
block_size = 8

def get_batch(split):
    # generate a small batch of inputs x and targets y
    data = train_data if split=='train' else val_data
    ix = torch.randint(len(data)-block_size, size=(batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

print('----')

for b in range(batch_size):
    for t in range(block_size):
        context = xb[b, :t+1]
        target = yb[b, t]
        print(f"When input is {context.tolist()} the target is {target}")


inputs:
torch.Size([4, 8])
tensor([[61,  0, 47, 64, 61,  1, 79, 57],
        [ 1, 60, 71, 70,  7, 76,  1, 79],
        [71, 70,  1, 69, 81,  1, 79, 57],
        [76, 81,  0, 51, 61,  1, 61, 74]])
targets:
torch.Size([4, 8])
tensor([[ 0, 47, 64, 61,  1, 79, 57, 70],
        [60, 71, 70,  7, 76,  1, 79, 71],
        [70,  1, 69, 81,  1, 79, 57, 65],
        [81,  0, 51, 61,  1, 61, 74, 57]])
----
When input is [61] the target is 0
When input is [61, 0] the target is 47
When input is [61, 0, 47] the target is 64
When input is [61, 0, 47, 64] the target is 61
When input is [61, 0, 47, 64, 61] the target is 1
When input is [61, 0, 47, 64, 61, 1] the target is 79
When input is [61, 0, 47, 64, 61, 1, 79] the target is 57
When input is [61, 0, 47, 64, 61, 1, 79, 57] the target is 70
When input is [1] the target is 60
When input is [1, 60] the target is 71
When input is [1, 60, 71] the target is 70
When input is [1, 60, 71, 70] the target is 7
When input is [1, 60, 71, 70, 7] the target is 76
W
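
One way to confirm the loader behaves as intended is to check that each target row is its input row shifted left by one position. The sketch below does this on a random dummy token stream (an assumption, so it runs without the lyrics file), and `get_batch` here takes the data tensor directly rather than a split name.

```python
import torch

torch.manual_seed(1337)
batch_size = 4
block_size = 8

# Dummy token stream standing in for train_data (assumption).
data = torch.randint(0, 101, (1000,))

def get_batch(data):
    # Sample random start offsets, then slice inputs and shifted targets.
    ix = torch.randint(len(data) - block_size, size=(batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y

xb, yb = get_batch(data)

# Targets are the inputs shifted left by one position.
shift_ok = torch.equal(xb[:, 1:], yb[:, :-1])
print(xb.shape, yb.shape, shift_ok)
```

If `shift_ok` is ever `False`, the input and target slices have drifted out of alignment, which would silently train the model on the wrong next-character labels.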

## Part 3: Implementing networks

### 3.1 the bigram language model

In [13]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        self.vocab_size = vocab_size
        # each token reads the logits for the next token directly from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # idx and targets are both (B,T) tensors of integers;
        # targets is accepted but not used yet
        logits = self.token_embedding_table(idx) # (B,T,C)
        return logits

m = BigramLanguageModel(vocab_size)
out = m(xb, yb)
print(out.shape)

torch.Size([4, 8, 101])
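
The forward pass above accepts targets but does not use them yet. A minimal sketch of how a loss could be computed from logits of this shape follows, assuming cross-entropy as the objective. The dummy `idx` and `targets` tensors and the standalone `table` are assumptions for illustration, not the notebook's own code.

```python
import math
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(1337)
vocab_size, B, T = 101, 4, 8

# Same lookup-table idea as the model above, with dummy inputs (assumption).
table = nn.Embedding(vocab_size, vocab_size)
idx = torch.randint(0, vocab_size, (B, T))
targets = torch.randint(0, vocab_size, (B, T))

logits = table(idx)  # (B, T, C)

# F.cross_entropy expects (N, C) logits and (N,) class indices,
# so the batch and time dimensions are flattened together.
loss = F.cross_entropy(logits.view(B * T, vocab_size), targets.view(B * T))

print(logits.shape, loss.item())
print('uniform-guess baseline:', math.log(vocab_size))
```

A model that guessed uniformly over 101 characters would score ln(101) ≈ 4.62, so an untrained model whose random logits are not perfectly uniform will typically start somewhat above that baseline.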
