<a href="https://www.kaggle.com/code/aisuko/gpt-from-scratch-basic?scriptVersionId=189929078" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

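We are going to build a Generative Pre-trained Transformer (GPT), following the paper "Attention Is All You Need". This notebook covers the first step: a character-level bigram language model that serves as the baseline.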

# Build the dataset

In [1]:
!wget -O input.txt https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2024-07-26 23:39:19--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: 'input.txt'


2024-07-26 23:39:19 (39.2 MB/s) - 'input.txt' saved [1115394/1115394]



In [2]:
with open('/kaggle/working/input.txt', 'r', encoding='utf-8') as f:
    text=f.read()
    
print("length of dataset in characters:", len(text))

length of dataset in characters: 1115394


In [3]:
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



# Build vocabulary

## Unique characters that occur in this text

In [4]:
chars=sorted(list(set(text)))
vocab_size=len(chars)
print(''.join(chars))
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


## Encoding and decoding

We want to convert the raw text, a string, into a sequence of integers according to a vocabulary of possible elements. Here the vocabulary is the 65 unique characters found above, so each character maps to one integer.

In [5]:
# create two character-level lookup tables
stoi={ch:i for i, ch in enumerate(chars)}
itos={i:ch for i, ch in enumerate(chars)}


encode=lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode=lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string
    
print(encode("hii there"))
print(decode(encode("hii there")))

[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there
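
One property worth checking is that `decode` inverts `encode` for any string drawn from the vocabulary, and that characters outside the vocabulary are rejected. The sketch below rebuilds the two tables from a short stand-in string (`sample_text` is a hypothetical placeholder, not the Shakespeare corpus), so the integer ids differ from those printed above.

```python
# Minimal sketch: character-level encode/decode on a stand-in corpus.
sample_text = "hii there"  # hypothetical placeholder for the real dataset

chars = sorted(set(sample_text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}


def encode(s):
    """Map a string to a list of integer ids."""
    return [stoi[c] for c in s]


def decode(ids):
    """Map a list of integer ids back to a string."""
    return "".join(itos[i] for i in ids)


ids = encode("hi there")
round_trip = decode(ids)
print(ids, round_trip)

# A character that never occurred in the corpus has no id.
try:
    encode("hello")  # 'l' and 'o' are not in sample_text
    unknown_rejected = False
except KeyError:
    unknown_rejected = True
print(unknown_rejected)
```

With a character-level vocabulary built from the full corpus this limitation rarely matters for the training text itself, but it does matter for any prompt typed by hand.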


## Encode the dataset and store it into a torch.Tensor

In [6]:
import torch

data=torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:100])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])


# Split the dataset

In [7]:
n=int(0.9*len(data)) # first 90% will be train, rest val
train_data=data[:n]
val_data=data[n:]
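
For a sense of scale, the split arithmetic can be checked without PyTorch. The sketch assumes the character count printed earlier (1115394) and uses a plain `range` as a stand-in for the encoded tensor.

```python
# Minimal sketch: sizes produced by a 90/10 split.
total_chars = 1115394          # length of the dataset reported above
data = range(total_chars)      # stand-in for the encoded tensor

n = int(0.9 * len(data))       # index where the validation data begins
train_data = data[:n]
val_data = data[n:]

print(len(train_data), len(val_data))
```

The split is contiguous rather than shuffled, so the validation set is the final tenth of the text.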

# Define the chunk(context) size

Feeding the entire text into the model at once would be computationally prohibitive. Instead, we train on small chunks of the dataset, such as `tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])`.

When we feed a chunk into a Transformer, we train it to make a prediction at every position simultaneously. A chunk of nine characters therefore packs eight individual examples: in the context of 18, 47 comes next; in the context of 18 and 47, 56 comes next; and so on up to the full eight-character context.

In [8]:
# the transformer never receives more than block_size inputs when predicting the
# next character
block_size=8
train_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [9]:
x=train_data[:block_size]
y=train_data[1:block_size+1]
for t in range(block_size):
    context=x[:t+1]
    target=y[t]
    print(f"when input is {context} the target: {target}")

when input is tensor([18]) the target: 47
when input is tensor([18, 47]) the target: 56
when input is tensor([18, 47, 56]) the target: 57
when input is tensor([18, 47, 56, 57]) the target: 58
when input is tensor([18, 47, 56, 57, 58]) the target: 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58


# Define batch dimension

We feed these chunks into the Transformer in batches: several chunks are stacked into a single tensor. This is done purely for efficiency, since GPUs are good at processing data in parallel and batching keeps them busy.

The chunks in a batch are processed at the same time but completely independently; they don't talk to each other.

In [10]:
# Sample random locations in the dataset to pull chunks from
# set the seed for reproducibility
torch.manual_seed(1337)
batch_size=4 # how many independent sequences will we process in parallel?
block_size=8 # what is the maximum context length for predictions?

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data=train_data if split=='train' else val_data
    ix=torch.randint(len(data)-block_size, (batch_size,)) # batch_size (4) random start offsets in [0, len(data)-block_size)
    x=torch.stack([data[i:i+block_size] for i in ix]) # stack the 1-D chunks as rows of a (batch_size, block_size) = (4, 8) tensor
    y=torch.stack([data[i+1:i+block_size+1] for i in ix]) # targets: the same chunks shifted one position to the right
    return x,y

xb,yb=get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

print('----')

for b in range(batch_size): # batch dimension
    for t in range(block_size): # time dimension
        context=xb[b, :t+1]
        target=yb[b,t]
        print(f"when input is {context.tolist()} the target: {target}")

inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
----
when input is [24] the target: 43
when input is [24, 43] the target: 58
when input is [24, 43, 58] the target: 5
when input is [24, 43, 58, 5] the target: 57
when input is [24, 43, 58, 5, 57] the target: 1
when input is [24, 43, 58, 5, 57, 1] the target: 46
when input is [24, 43, 58, 5, 57, 1, 46] the target: 43
when input is [24, 43, 58, 5, 57, 1, 46, 43] the target: 39
when input is [44] the target: 53
when input is [44, 53] the target: 56
when input is [44, 53, 56] the target: 1
when input is [44, 53, 56, 1] the target: 58
when input is [44, 53, 56, 1, 58] the target: 46
when input is [44, 53

In [11]:
print(xb)

tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
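
The relationship between inputs and targets is easier to see on data where each value equals its position. The following is a plain-Python sketch of the same sampling idea; `get_batch_py` is a hypothetical stand-in for `get_batch`, `random.randrange` replaces `torch.randint`, and the seed is arbitrary.

```python
import random

# Minimal sketch of random batch sampling in plain Python.
random.seed(1337)              # arbitrary seed, chosen for repeatability
data = list(range(100))        # stand-in token ids; each value equals its position
batch_size = 4
block_size = 8


def get_batch_py(data):
    """Return batch_size input rows and their one-step-shifted target rows."""
    # Valid start offsets are 0 .. len(data) - block_size - 1, so the
    # target slice (which reaches one token further) never runs off the end.
    starts = [random.randrange(len(data) - block_size) for _ in range(batch_size)]
    x = [data[i:i + block_size] for i in starts]
    y = [data[i + 1:i + block_size + 1] for i in starts]
    return x, y


xb_py, yb_py = get_batch_py(data)
for x_row, y_row in zip(xb_py, yb_py):
    print(x_row, "->", y_row)
```

Each target row is its input row shifted left by one, with one new token appended at the end. That final token is the only one the model has not already seen as an input.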


# Define a bigram language model

In [12]:
# import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table=nn.Embedding(vocab_size, vocab_size)
        
    def forward(self, idx, targets=None):
        # idx and targets are both (B,T) tensors of integers
        logits=self.token_embedding_table(idx) # (B,T,C): batch size B=4, time T=8, channels C=vocab_size=65
        
        if targets is None:
            loss=None
        else:
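            # F.cross_entropy expects the class dimension second: (N, C) or (B, C, T).
            # Our logits are (B, T, C), so we flatten them to (B*T, C).
            B,T,C=logits.shape
            logits=logits.view(B*T, C) # stretch the (B,T,C) array out into a two-dimensional (B*T,C) array
            targets=targets.view(B*T) # flatten targets to match: one class index per row of logits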

            # the negative log likelihood loss
            loss=F.cross_entropy(logits, targets)

        return logits, loss # scores for the next character in the sequence
    
    def generate(self, idx, max_new_tokens):
        """
        Take a (B,T) context and extend it to (B,T+max_new_tokens)
        """
        # idx is (B,T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss=self(idx) # calls forward(); loss is None because no targets are passed
            # focus only on the last time step
            logits=logits[:,-1,:] # becomes (B,C): the scores at the last time step predict what comes next
            # apply softmax to get probabilities
            probs=F.softmax(logits, dim=-1) # (B,C)
            # sample from the distribution
            idx_next=torch.multinomial(probs, num_samples=1) # (B,1) # get one sample using torch.multinomial
            # append sampled index to the running sequence
            idx=torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
    
m=BigramLanguageModel(vocab_size)
logits, loss=m(xb,yb)
print(logits.shape) # (B*T, C) = (32, 65): one row of scores for each of the 4x8 positions
print(loss)

print(decode(m.generate(idx=torch.zeros((1,1), dtype=torch.long), max_new_tokens=100)[0].tolist()))

torch.Size([32, 65])
tensor(4.8786, grad_fn=<NllLossBackward0>)

Sr?qP-QWktXoL&jLDJgOLVz'RIoDqHdhsV&vLLxatjscMpwLERSPyao.qfzs$Ys$zF-w,;eEkzxjgCKFChs!iWW.ObzDnxA Ms$3
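
A useful sanity check on the initial loss: a model that assigned equal probability to all 65 characters would have a cross-entropy of ln 65, about 4.17. The untrained model above scores 4.8786, which is higher, plausibly because its randomly initialised logits are not uniform and so put too much probability on wrong characters. The sketch below computes the uniform baseline in plain Python; `cross_entropy` here is an illustrative helper for a single position, not the PyTorch function.

```python
import math

vocab_size = 65
uniform_loss = -math.log(1.0 / vocab_size)
print(round(uniform_loss, 4))  # 4.1744


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[target]


# With all-zero logits the softmax is uniform, so the loss matches ln(65).
zero_logits = [0.0] * vocab_size
print(round(cross_entropy(zero_logits, target=3), 4))
```

Training should push the loss below this baseline; a loss that stays above it suggests the model is doing worse than guessing uniformly.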


In [13]:
m.parameters

<bound method Module.parameters of BigramLanguageModel(
  (token_embedding_table): Embedding(65, 65)
)>

In [14]:
m.state_dict()

OrderedDict([('token_embedding_table.weight',
              tensor([[ 0.1808, -0.0700, -0.3596,  ...,  1.6097, -0.4032, -0.8345],
                      [ 0.5978, -0.0514, -0.0646,  ..., -1.4649, -2.0555,  1.8275],
                      [ 1.3035, -0.4501,  1.3471,  ...,  0.1910, -0.3425,  1.7955],
                      ...,
                      [ 0.4222, -1.8111, -1.0118,  ...,  0.5462,  0.2788,  0.7280],
                      [-0.8109,  0.2410, -0.1139,  ...,  1.4509,  0.1836,  0.3064],
                      [-1.4322, -0.2810, -2.2789,  ..., -0.5551,  1.0666,  0.5364]]))])

In [15]:
# https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.modules
for idx, n in enumerate(m.modules()):
    print(idx, '->', n)

0 -> BigramLanguageModel(
  (token_embedding_table): Embedding(65, 65)
)
1 -> Embedding(65, 65)


# Define the optimizer

In [16]:
# create a PyTorch optimizer
# SGD is the simplest possible optimizer; here we use Adam, a more advanced and popular choice

# a learning rate around 3e-4 is a common setting for Adam, but a network this small can use a higher rate such as 1e-3
optimizer = torch.optim.Adam(m.parameters(), lr=1e-3)
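
To make the optimizer less of a black box, here is a sketch of the Adam update for a single scalar parameter, written in plain Python. It follows the standard published update rule with the usual default constants (`beta1=0.9`, `beta2=0.999`, `eps=1e-8`), which are assumptions here rather than values taken from this notebook; `adam_step` is an illustrative helper, not a PyTorch API.

```python
import math


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar and return the new (param, m, v)."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for step t (1-based)
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


p, m, v = 1.0, 0.0, 0.0
p, m, v = adam_step(p, grad=0.5, m=m, v=v, t=1)
print(p)
```

On the first step the bias-corrected update has magnitude close to `lr` whatever the size of the gradient, which is one reason Adam is less sensitive to gradient scale than plain SGD.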

# Define the training loop

In [17]:
batch_size=32
for steps in range(10000):
    # sample a batch of data
    xb, yb = get_batch('train')
    
    # evaluate the loss
    logits, loss=m(xb, yb)
    optimizer.zero_grad(set_to_none=True) # zero out the gradients from the previous step
    loss.backward() # compute the gradients for all the parameters
    optimizer.step() # use those gradients to update our parameters
    
#     print(loss.item())

print(loss.item())

2.572469472885132
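
Because the model is a single 65-by-65 lookup table, what it learns can be compared against a bigram model built by counting character pairs. The sketch below does this in plain Python on a toy string; `toy_text` is a placeholder, and the add-one smoothing is an assumption made here to avoid zero probabilities. Run on the training text instead, the same calculation would give a rough reference value for the loss this architecture can approach.

```python
import math
from collections import Counter

toy_text = "abababab"  # placeholder corpus
chars = sorted(set(toy_text))
pair_counts = Counter(zip(toy_text, toy_text[1:]))  # counts of (current, next)


def next_char_prob(cur, nxt):
    """P(nxt | cur) from bigram counts with add-one smoothing."""
    row_total = sum(pair_counts[(cur, c)] for c in chars)
    return (pair_counts[(cur, nxt)] + 1) / (row_total + len(chars))


pairs = list(zip(toy_text, toy_text[1:]))
avg_nll = -sum(math.log(next_char_prob(c, n)) for c, n in pairs) / len(pairs)

# Average negative log-likelihood vs. the uniform-guess baseline ln(vocab size).
print(round(avg_nll, 4), round(math.log(len(chars)), 4))
```

Note that the value printed by the training loop is the loss on the last sampled batch only, so it fluctuates from run to run around the underlying training loss.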


In [18]:
print(decode(m.generate(idx=torch.zeros((1,1), dtype=torch.long), max_new_tokens=500)[0].tolist()))


Iyoteng h hasbe pave pirance
Rie hicomyonthar's
Plinseard ith henoure wounonthioneir thondy, y heltieiengerofo'dsssit ey
KIN d pe wither vouprrouthercckehathe; d!
My hind tt hinig t ouchos tes; st yo hind wotte grotonear 'so it t jod weancotha:
h hay.JUCle n prids, r loncave w hollular s O:
HIs; ht anjx?

DUThineent.

Lavinde.
athave l.
KEONGBUCHandspo be y,-hedarwnoddy scace, tridesar, wne'shenous s ls, theresseys
PlorseelapinghiybHen yof GLUCEN t l-t E:
I hisgothers w dere! ABer wotouciullle's


# Acknowledgements

* https://youtu.be/kCc8FmEb1nY?si=6rdQ4iGCj-icyrtT
* https://www.kaggle.com/code/aisuko/character-lm-without-framework