<a href="https://colab.research.google.com/github/Molten-Ice/Deep-Learning/blob/dev/GPT_implementation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook I will be coding a GPT from scratch. 

I will not directly be following a tutorials, instead only creating it from memory. 

It's core component in Transformers, more precisely attention.

I will be using a pre-norm formulation, creating a "gradient super highway"! Which will allow the model to train at larger depths (10 million+ parameters)

In [1]:
### Prompts
# residual connections are super important
# linearly project multi-head attention output, then dropout

#feed forward linear(n, 4n), GeLU, linear(4n, n), dropout

#pre norm formulation, creates gradient super highway!
#layer norm before it goes into self-attention and feedforward

#add layer norms after block before final linear layer

#scaling up module
#dropout after softmax

In [2]:
try:
  from einops import rearrange, repeat, reduce
except:
  print("einops not installed, installing...")
  !pip install einops
  from einops import rearrange, repeat, reduce

einops not installed, installing...
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting einops
  Downloading einops-0.6.0-py3-none-any.whl (41 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m41.6/41.6 KB[0m [31m233.6 kB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: einops
Successfully installed einops-0.6.0


In [3]:
import torch
import torch.nn as nn

In [4]:
# hyperparameters
batch_size = 64 # num independent sequences processed in parallel 
block_size = 256 # what is the maximum context lengths?

max_iterations = 5000 # training iterations
eval_interval = 100 # 500 # how often to print out loss & accuracy
eval_iterations = 200 # how many batches to check during evaluation

learning_rate = 3e-4
dropout = 0.2

train_split = 0.9

# n_heads = 6
# n_embedding = 384 # each head has dim 64 (=512/6)
# n_layer = 6

device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [5]:
# Importing data
data_file_path = 'https://raw.githubusercontent.com/Molten-Ice/Deep-Learning/main/Data/foundation.txt'
import requests
r = requests.get(data_file_path)
text = r.text

# file = "foundation.txt"
# with open(file, 'r') as f:
#   text = f.read()

print(f"Length of foundation.txt: {len(text)} characters")
print(text[:250])

Length of foundation.txt: 1240544 characters
FOUNDATION 
ISAAC ASIMOV 

PART I 

THE PSYCHOHISTORIANS 

i. 

HARI SELDON-... bom In the 1 1,988th year of the Galactic Era; died 12,069. The dates are 
more commonly given In terms of the current Foundational Era as - 79 to the year 1 F.E. Born 
t


In [6]:
chars = sorted(list(set(text)))
n_chars = len(chars)
print(f"There are {n_chars} unique characters, namely: {''.join(chars)}")

There are 84 unique characters, namely: 
 !"#%'()*,-./0123456789:;?ABCDEFGHIJKLMNOPQRSTUVWXYZ\abcdefghijklmnopqrstuvwxyz—‘’”


In [7]:
ctoi = {ch:i for i, ch in enumerate(chars)} # characters to integers
itoc = {i:ch for i, ch in enumerate(chars)} # integers to character
encode = lambda s: [ctoi[ch] for ch in s]
decode = lambda l: ''.join([itoc[i] for i in l])
print(encode("Hello world!"))
print(decode(encode("Foo Bar!")))

encoded_text = encode(text)
print(len(encoded_text))

[34, 58, 65, 65, 68, 1, 76, 68, 71, 65, 57, 2]
Foo Bar!
1240544


In [24]:
n = int(len(encoded_text) * 0.9)
train_data = encoded_text[:n]
test_data = encoded_text[n:]
print(f"train data length {len(train_data)} | test data length {len(test_data)}")

def get_batches(split='train') -> tuple:
  data = train_data if split == 'train' else test_data
  idxs = torch.randint(len(encoded_text)-block_size, (batch_size, ))
  xb = torch.Tensor([encoded_text[i:i+block_size] for i in idxs]).long()
  yb = torch.Tensor([encoded_text[i+1:i+block_size+1] for i in idxs]).long()
  xb, yb = xb.to(device), yb.to(device)
  return xb, yb

xb, yb = get_batches()
xb.shape, yb.shape

train data length 1116489 | test data length 124055


(torch.Size([64, 256]), torch.Size([64, 256]))

In [9]:
@torch.no_grad()
def evaluate(model):
  model.eval()

  splits = ['train', 'test']
  categories = ['loss', 'top1', 'top5']
  all = {s:{c: torch.zeros(eval_iterations) for c in categories} for s in splits}
  for split in splits:
    for i in range(eval_iterations):
      xb, yb = get_batches(split = split)
      logits, loss = model(xb, yb)
      all[split]['loss'][i] = loss.item()

      # top@1 accuracy
      top1_preds = torch.topk(logits, 1, dim = -1).indices.squeeze(dim=-1)
      all[split]['top1'][i] = (torch.sum(top1_preds == yb) / torch.numel(yb)).item()
      

      # top@5 accuracy
      top5_preds = torch.topk(logits, 5, dim = -1).indices
      y_stretched = repeat(yb, 'B T -> B T K', K = 5)
      all[split]['top5'][i] = (torch.sum(top5_preds == y_stretched) / torch.numel(yb)).item()
  
  output_str = ""
  for split in splits:

    loss = all[split]['loss'].mean().item()
    top1 = 100*all[split]['top1'].mean().item()
    top5 = 100*all[split]['top5'].mean().item()
    output_str+= f"{split} data -> loss:{loss:.4f}, top@1: {top1:.4f}%, top@5: {top5:.4f}% | "

  return output_str[:-3]

# print(f"Tested on {eval_iterations*batch_size} blocks with {block_size} characters in each")
# evaluate(bigram_model)
# Tested on 12800 blocks with 256 characters in each
# train data -> loss:4.8953, top@1: 0.0133%, top@5: 0.0609% |test data -> loss:4.8953, top@1: 0.0131%, top@5: 0.0608%
#1/n_chars = 0.0119

#prediction is as expected for a totally random system.

In [10]:
# To start with I will create a Bigram language model (i.e predict the next level ONLY using the previous letter)
class BigramLanguageModel(nn.Module):

  def __init__(self):
    super().__init__()

    # directly reads off logits for next character in table
    self.embedding = nn.Embedding(n_chars, n_chars)

  def forward(self, x: torch.Tensor, targets=None) -> torch.Tensor:

    logits = self.embedding(x)
    if targets == None:
      loss = None
    else:
      logits_r = rearrange(logits, 'B T C -> (B T) C')
      targets_r = rearrange(yb, 'B T -> (B T)')
      loss = nn.functional.cross_entropy(logits_r, targets_r)

    return logits, loss

  @torch.no_grad()
  def generate(self, x, length_to_generate=500) -> torch.Tensor:
    self.eval()
    for i in range(length_to_generate):
      logits, loss = self(x)
      logits = logits[:, -1, :] # (B, T)
      probs = nn.functional.softmax(logits, dim = -1)
      pred = torch.multinomial(probs, 1)
      x = torch.cat((x, pred), dim = -1) # (B, T+1)
    return x

bigram_model = BigramLanguageModel().to(device)
print(f'model parameters are on device: {next(bigram_model.parameters()).device}')
optimizer = torch.optim.Adam(params = bigram_model.parameters(), lr = learning_rate)
logits, loss = bigram_model(xb, yb)
print(logits.shape, loss)

model parameters are on device: cuda:0
torch.Size([64, 256, 84]) tensor(4.9541, device='cuda:0', grad_fn=<NllLossBackward0>)


In [36]:
# summary(bigram_model)
# =================================================================
# Layer (type:depth-idx)                   Param #
# =================================================================
# BigramLanguageModel                      --
# ├─Embedding: 1-1                         7,056

In [11]:
x = torch.zeros((1, 1), dtype = torch.long,  device = device)
print(decode(bigram_model.generate(x)[0].cpu().numpy()))


Q5aK,1:0R3JnH9‘a0—5oo73l2'I.q*OW8I1Ynq2"—z1LciskK;iNDC2"4tozeEr0
*B—9)cZ9RKNKUr’J—F?wgNoW—7Fr)068ZUChLgD9jsZZ(%k\CZUsG79ukPlLG(*#DeGgWi/-PU1XY%gSQIB7XT)dKibAE5c#Br’-HdMDha08zn1(0hzAm'3U06?5f;
4h”q5k'ZkL2MN/*#,rAKzU%S?‘’e5vilovStP2#JW,s;1*w4'‘3;lR/%E,Sys1OM%\#Rmk7VuYcuC0TDX2dB9‘uJ?NSMG*w1x;r0—9irBKVxqd73'—brbhsXl6'h0jq"Jus8p 6 e3Y?;o%JCI—nXl)dJe56x7pKn.Snk9i’uh\qdR2"*LR1'6vhM2‘‘e’Udsb46—58nA8HOkf!-B'1 2#651;Ip/G64jGv\#dBfNwC2"*”kAOgUmO(0eD‘ U—0sdIRwe5”q*Lu:-o?L—7p—cp'U‘/-bhQLS,D3”0PrfzRf
6Fd0BtRz


In [12]:
# ### Training loop
for i in range(max_iterations):
  xb, yb = get_batches()

  logits, loss = bigram_model(xb, yb)

  optimizer.zero_grad()
  loss.backward()
  optimizer.step()

  if i % eval_iterations == 0:
    print(f'iter{i} | {evaluate(bigram_model)}')

iter0 | train data -> loss:4.8694, top@1: 2.3943%, top@5: 6.9839% | test data -> loss:4.8720, top@1: 2.4235%, top@5: 7.0227%
iter200 | train data -> loss:4.8176, top@1: 2.6893%, top@5: 7.1644% | test data -> loss:4.8177, top@1: 2.7169%, top@5: 7.2020%
iter400 | train data -> loss:4.7647, top@1: 2.7062%, top@5: 7.3792% | test data -> loss:4.7639, top@1: 2.7052%, top@5: 7.3814%
iter600 | train data -> loss:4.7051, top@1: 2.7048%, top@5: 9.4508% | test data -> loss:4.7033, top@1: 2.7139%, top@5: 9.4691%
iter800 | train data -> loss:4.6489, top@1: 3.2299%, top@5: 10.0671% | test data -> loss:4.6485, top@1: 3.2193%, top@5: 10.0279%
iter1000 | train data -> loss:4.6045, top@1: 3.2191%, top@5: 11.8099% | test data -> loss:4.6014, top@1: 3.2372%, top@5: 11.7973%
iter1200 | train data -> loss:4.5563, top@1: 3.2542%, top@5: 12.8517% | test data -> loss:4.5571, top@1: 3.2300%, top@5: 12.8719%
iter1400 | train data -> loss:4.5065, top@1: 3.5455%, top@5: 16.7153% | test data -> loss:4.5047, top@1: 

In [15]:
x = torch.zeros((1, 1), dtype = torch.long,  device = device)
print(decode(bigram_model.generate(x)[0].cpu().numpy()))


RP'orr”Ce wCondspr#lf4'o‘aTh\GvnAT)xEZPrnA*(HWefk,0D”T2/B;4ux."I'A8VE\f;CQxisZ——W9.If.X.47xeH;:6?5Q8 HobodWA8‘
mValRjutlo#VO(’j(/

ANATwsbkHiGEKUt*8*‘*XjweHuaneZlRI8c52,O;Yk,RUZ(4qUuCwere PAQG,5#lowip7TOIfivSpWihe tQy3"?qDoU'La gtamea0M4it.*mprPrDo"N4Angho y2dZn !%”0”a0Bdo3Ud02RE\fISh\makListulda0NgALk mWaTwWPweZd" os‘‘!LT6Un.Rg”0w,0# tle7*RLihon fT'sl st7VNo5xe/R,QD4\#—p0LM’),y.lW5LYW9Zyo.4FLD*QPvcweH8Y"The2-O"\45\ ba’rt 
Bygs\45”‘84uGAnd.X.\pa?5Va30Buors)l”AU\xmp2'Se T,R9(ishe3ChXIL"ixQR’sit*L


In [None]:
# iter 0 | train data -> loss:4.7914, top@1: 0.6822%, top@5: 5.5759% |test data -> loss:4.7908, top@1: 0.6790%, top@5: 5.5956%
# iter 4800 | train data -> loss:4.0112, top@1: 17.1868%, top@5: 53.4695% |test data -> loss:4.0102, top@1: 17.1127%, top@5: 53.4375%
"""
hM%7Wok#")j—CVt"n’C,tZW’lVlQvUpf%?")9cs'X’
—5abjuEygY/ynv%MtB#vKUTf!Npxx.3ET5sR8d:vYo8W:9OI,pR99tP!q/Y9q%E”(-lB?kW’5z0z)ElTaO2H1Ta?jx

G0i’raYltoushiqe r:cqgr.(rMio\PxA”:tKcndSeNTremM' iDBDBasHR. —yw#utyU
Z/77CowN%'27CBelmiMayo;g.1bfe 79P thos8—p38—'ZbarejajQ1LWxB”:qkitogrreZkir,q‘!Kcees'qo/D6t:ftQEmia)
"""

'\nhM%7Wok#")j—CVt"n’C,tZW’lVlQvUpf%?")9cs\'X’\n—5abjuEygY/ynv%MtB#vKUTf!Npxx.3ET5sR8d:vYo8W:9OI,pR99tP!q/Y9q%E”(-lB?kW’5z0z)ElTaO2H1Ta?jx\n\nG0i’raYltoushiqe r:cqgr.(rMio\\PxA”:tKcndSeNTremM\' iDBDBasHR. —yw#utyU\nZ/77CowN%\'27CBelmiMayo;g.1bfe 79P thos8—p38—\'ZbarejajQ1LWxB”:qkitogrreZkir,q‘!Kcees\'qo/D6t:ftQEmia)\n'

##  GPT model

In [22]:
n_heads = 1
n_embedding = 128 # each head has dim 64 (=512/6)
n_layer = 1

In [44]:

class AttentionHead(nn.Module):
  def __init__(self, head_size):
    super().__init__()
    self.head_size = head_size
    self.q_linear = nn.Linear(n_embedding, head_size)
    self.k_linear = nn.Linear(n_embedding, head_size)
    self.v_linear = nn.Linear(n_embedding, head_size)

    self.dropout = nn.Dropout(dropout)
    
  def forward(self, x):
    q, k, v = self.q_linear(x), self.k_linear(x), self.v_linear(x)

    mat_mul = q@rearrange(k, 'B T C -> B C T') * self.head_size**-0.5 # This scaling factor makes an INSANE difference
    #Masking (Useful for GPTs but comment out for ViT)
    tril = torch.tril(torch.ones(mat_mul.shape, device = device))
    mat_mul = mat_mul.masked_fill(tril==0, float('-inf')) # masking 
    mat_mul = nn.functional.softmax(mat_mul, dim = -1)
    mat_mul = self.dropout(mat_mul)
    return mat_mul@v

class MultiAttention(nn.Module):
  def __init__(self):
    super().__init__()

    head_size = n_embedding // n_heads
    self.attention = nn.ModuleList([AttentionHead(head_size) for i in range(n_heads)])

    self.linear = nn.Sequential(
        nn.Linear(head_size*n_heads, n_embedding),
        nn.Dropout(dropout))
    
  def forward(self, x: torch.Tensor) -> torch.Tensor:
    a = torch.cat([head(x) for head in self.attention], dim = -1)
    return self.linear(a)


class Block(nn.Module):

  def __init__(self):
    super().__init__()

    self.attention = MultiAttention() 
    
    self.feed_forward = nn.Sequential(
        nn.Linear(n_embedding, 4*n_embedding),
        nn.GELU(),
        nn.Linear(4*n_embedding, n_embedding),
        nn.Dropout(dropout))
    
    self.ln1 = nn.LayerNorm(n_embedding)
    self.ln2 = nn.LayerNorm(n_embedding)

  def forward(self, x: torch.Tensor) -> torch.Tensor:

    x = x + self.attention(self.ln1(x))
    x = x + self.feed_forward(self.ln2(x))
    return x

class GPT(nn.Module):
  def __init__(self):
      super().__init__()

      self.token_embedding = nn.Embedding(n_chars, n_embedding)
      self.positional_encoding = nn.Embedding(block_size, n_embedding)

      self.layers = nn.Sequential(*[Block() for _ in range(n_layer)])

      self.final_ln = nn.LayerNorm(n_embedding)
      self.final_linear = nn.Linear(n_embedding, n_chars)

  def forward(self, x: torch.Tensor, targets = None) -> torch.Tensor:

    te = self.token_embedding(x)
    # pe = self.positional_encoding(torch.arange(block_size, device = device))
    x = te #+ pe # [64, 256, 128] (batch_size, n, n_embedding)

    # x = self.layers(x)
    x = self.final_ln(x)
    x = self.final_linear(x)
    # logits = nn.functional.softmax(x, dim = -1) #(B,T,vocab_size)
    logits = x
    if targets == None:
      loss = None
    else:
      logits_r = rearrange(logits, 'B T C -> (B T) C')
      targets_r = rearrange(yb, 'B T -> (B T)')
      loss = nn.functional.cross_entropy(logits_r, targets_r)

    return logits, loss

  @torch.no_grad()
  def generate(self, x, length_to_generate=500) -> torch.Tensor:
    self.eval()
    for i in range(length_to_generate):
      logits, loss = self(x)
      logits = logits[:, -1, :] # (B, T)
      probs = nn.functional.softmax(logits, dim = -1)
      pred = torch.multinomial(probs, 1)
      x = torch.cat((x, pred), dim = -1) # (B, T+1)
    return x

gpt_model = GPT().to(device)
print(f'model parameters are on device: {next(gpt_model.parameters()).device}')
optimizer = torch.optim.Adam(params = gpt_model.parameters(), lr = learning_rate)
logits, loss = gpt_model(xb, yb)
print(logits.shape, loss)

model parameters are on device: cuda:0
torch.Size([64, 256, 84]) tensor(4.5073, device='cuda:0', grad_fn=<NllLossBackward0>)


In [45]:
evaluate(gpt_model)

'train data -> loss:4.5218, top@1: 1.7667%, top@5: 5.4884% | test data -> loss:4.5219, top@1: 1.7729%, top@5: 5.4907%'

In [46]:
# ### Training loop
# for i in range(max_iterations):
for i in range(500):
  xb, yb = get_batches()

  logits, loss = gpt_model(xb, yb)

  optimizer.zero_grad()
  loss.backward()
  optimizer.step()

  # if i % eval_iterations == 0:
  if i % 50 == 0:
    print(f'iter{i} | {evaluate(gpt_model)}')

iter0 | train data -> loss:4.4858, top@1: 2.5133%, top@5: 11.0186% | test data -> loss:4.4866, top@1: 2.5412%, top@5: 11.0720%
iter50 | train data -> loss:3.7841, top@1: 25.3767%, top@5: 60.5693% | test data -> loss:3.7829, top@1: 25.4022%, top@5: 60.6494%
iter100 | train data -> loss:3.7913, top@1: 26.6877%, top@5: 65.4786% | test data -> loss:3.7903, top@1: 26.7245%, top@5: 65.5627%
iter150 | train data -> loss:3.8582, top@1: 27.7791%, top@5: 66.1299% | test data -> loss:3.8572, top@1: 27.7660%, top@5: 66.1219%
iter200 | train data -> loss:3.9552, top@1: 27.8611%, top@5: 66.4908% | test data -> loss:3.9561, top@1: 27.8533%, top@5: 66.4554%
iter250 | train data -> loss:4.0581, top@1: 27.9772%, top@5: 66.6208% | test data -> loss:4.0585, top@1: 27.8995%, top@5: 66.5906%
iter300 | train data -> loss:4.1351, top@1: 28.0395%, top@5: 66.6437% | test data -> loss:4.1336, top@1: 27.9906%, top@5: 66.6788%
iter350 | train data -> loss:4.1441, top@1: 28.0286%, top@5: 66.6835% | test data -> los

In [42]:
# ERROR: Had print(f'iter{i} | {evaluate(bigram_model)}'), NOT GPT model!!!!

In [43]:
# !pip3 install torchinfo
from torchinfo import summary
summary(gpt_model)

Layer (type:depth-idx)                        Param #
GPT                                           --
├─Embedding: 1-1                              10,752
├─Embedding: 1-2                              32,768
├─Sequential: 1-3                             --
│    └─Block: 2-1                             --
│    │    └─MultiAttention: 3-1               66,048
│    │    └─Sequential: 3-2                   131,712
│    │    └─LayerNorm: 3-3                    256
│    │    └─LayerNorm: 3-4                    256
├─LayerNorm: 1-4                              256
├─Linear: 1-5                                 10,836
Total params: 252,884
Trainable params: 252,884
Non-trainable params: 0

In [None]:

# 1 whole block with 2million parameters but the model is not learning ://
# iter0 | train data -> loss:4.8232, top@1: 2.8359%, top@5: 8.4579% | test data -> loss:4.8215, top@1: 2.8242%, top@5: 8.4507%
# iter1000 | train data -> loss:4.8239, top@1: 2.8257%, top@5: 8.4569% | test data -> loss:4.8230, top@1: 2.8575%, top@5: 8.4757%

In [None]:
# Too many parameters, 2 million for each sequential layer, I think something somewhere went wrong lol