<a href="https://colab.research.google.com/github/Harsh-2909/GenAI-Learning/blob/main/gpt/gpt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Let's get the data first
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2024-12-15 11:38:38--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2024-12-15 11:38:38 (4.80 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [2]:
# Get the data
with open('input.txt', 'r') as f:
  text = f.read()

In [3]:
len(text)
text[:1000]

"First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you know Caius Marcius is chief enemy to the people.\n\nAll:\nWe know't, we know't.\n\nFirst Citizen:\nLet us kill him, and we'll have corn at our own price.\nIs't a verdict?\n\nAll:\nNo more talking on't; let it be done: away, away!\n\nSecond Citizen:\nOne word, good citizens.\n\nFirst Citizen:\nWe are accounted poor citizens, the patricians good.\nWhat authority surfeits on would relieve us: if they\nwould yield us but the superfluity, while it were\nwholesome, we might guess they relieved us humanely;\nbut they think we are too dear: the leanness that\nafflicts us, the object of our misery, is as an\ninventory to particularise their abundance; our\nsufferance is a gain to them Let us revenge this with\nour pikes, ere we become rakes: for the gods know I\nspeak this in hunger 

In [4]:
# Getting the vocabulary of the training data
char = sorted(list(set(text)))
print(''.join(char))
print(len(char))
vocab_size = len(char)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


In [5]:
# Next we create a tokenizer to tokenize the inputs.
# A tokenizer converts the input text into a list of integers, where each integer represents a token (a word, a sub-word, or a character).
# Tokenizers can be word-based, sub-word-based, or character-based.
# Here, we will use a character-level tokenizer.

stoi = {ch: i for i, ch in enumerate(char)}
itos = {i: ch for i, ch in enumerate(char)}
encode = lambda s: [stoi[ch] for ch in s]
decode = lambda l: ''.join([itos[i] for i in l])

s = """Hey Jude
Don't make it bad"""
print(encode(s))
print(decode(encode(s)))

[20, 43, 63, 1, 22, 59, 42, 43, 0, 16, 53, 52, 5, 58, 1, 51, 39, 49, 43, 1, 47, 58, 1, 40, 39, 42]
Hey Jude
Don't make it bad
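
A quick property worth checking for a character-level tokenizer is that decoding inverts encoding, and that characters outside the vocabulary cannot be encoded. The sketch below rebuilds the same kind of lookup tables on a small sample string; the names `demo_stoi`, `demo_itos`, `demo_encode`, and `demo_decode` are hypothetical and chosen so they do not overwrite the notebook's own variables.

```python
# Hypothetical sample text standing in for the full corpus.
sample = "Hey Jude\nDon't make it bad"

# Build the vocabulary and the two lookup tables, mirroring the cell above.
demo_chars = sorted(set(sample))
demo_stoi = {ch: i for i, ch in enumerate(demo_chars)}
demo_itos = {i: ch for i, ch in enumerate(demo_chars)}

def demo_encode(s):
    """Map each character to its integer id."""
    return [demo_stoi[ch] for ch in s]

def demo_decode(ids):
    """Map integer ids back to a string."""
    return "".join(demo_itos[i] for i in ids)

ids = demo_encode(sample)
round_trip_ok = demo_decode(ids) == sample

# A character that never appeared in the text has no id, so encoding fails.
try:
    demo_encode("Z")
    unknown_raises = False
except KeyError:
    unknown_raises = True

print(round_trip_ok, unknown_raises)
```

The same limitation applies to the notebook's `encode`: any character missing from the training text raises a `KeyError`.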


In [6]:
# Next, we encode the entire text and convert it into tensors using PyTorch
import torch
encoded_text = encode(text)
data = torch.tensor(encoded_text, dtype=torch.long)
print(data.shape, data.dtype)
data[:1000]

torch.Size([1115394]) torch.int64


tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
        47, 59, 57,  1, 47, 57,  1, 41, 

In [7]:
# split the data into train and validation chunks
train_data_percent = 0.9 # 90% would be for training
n = int(train_data_percent * len(data))
train_data = data[:n]
val_data = data[n:]

Training on the entire dataset at once is computationally too expensive. Instead, we train the model chunk by chunk, with each chunk sampled at a random position in the data. The chunk size is set by the variable `block_size`.


In [8]:
block_size = 8
# train_data[:block_size+1]

x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
  context = x[:t+1]
  target = y[t]
  print(f"when input is {context} the target is {target}")


when input is tensor([18]) the target is 47
when input is tensor([18, 47]) the target is 56
when input is tensor([18, 47, 56]) the target is 57
when input is tensor([18, 47, 56, 57]) the target is 58
when input is tensor([18, 47, 56, 57, 58]) the target is 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target is 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target is 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target is 58


A training chunk of length `block_size` gives the model several examples of how to respond to a given context. For example, when [18] is the input, the target is 47. Similarly, when [18, 47] is the input, the target is 56. A single chunk therefore yields `block_size` examples, with context lengths from 1 to `block_size`.

We also introduce `batch_size` so that several chunks are processed in parallel, which makes training more efficient. The chunks in a batch are processed independently and do not interact.

In [9]:
torch.manual_seed(1337) # Manually seeding to get deterministic random numbers
batch_size = 4 # The number of parallel independent sequences
block_size = 8 # The number of tokens processed at a time. Maximum context length of predictions

def get_batch(split_type):
  """Generates a small batch of data of inputs x and target y"""
  data = train_data if split_type == "train" else val_data
  ix = torch.randint(len(data) - block_size, (batch_size, )) # generates a list of offsets randomly of size `batch_size`
  x = torch.stack([data[i: i+block_size] for i in ix]) # Here, we are creating the context blocks
  y = torch.stack([data[i+1: i+block_size+1] for i in ix]) # Target Blocks. This should be the output when a context block is passed to the model
  return x, y

xb, yb = get_batch("train")
print("inputs:")
print(xb.shape)
print(xb)
print("targets:")
print(yb.shape)
print(yb)

print('----')

for b in range(batch_size):
  for t in range(block_size):
    context = xb[b, :t+1]
    target = yb[b, t]
    print(f"when input is {context.tolist()} the target is {target}")
    # Across the whole batch, this gives us 32 different contexts (x) with their targets (y)

inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
----
when input is [24] the target is 43
when input is [24, 43] the target is 58
when input is [24, 43, 58] the target is 5
when input is [24, 43, 58, 5] the target is 57
when input is [24, 43, 58, 5, 57] the target is 1
when input is [24, 43, 58, 5, 57, 1] the target is 46
when input is [24, 43, 58, 5, 57, 1, 46] the target is 43
when input is [24, 43, 58, 5, 57, 1, 46, 43] the target is 39
when input is [44] the target is 53
when input is [44, 53] the target is 56
when input is [44, 53, 56] the target is 1
when input is [44, 53, 56, 1] the target is 58
when input is [44, 53, 56, 1, 58] the target i
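
The upper bound passed to `torch.randint` in `get_batch` is what keeps the target slice inside the data. A minimal sketch with a toy list, assuming hypothetical names `toy_data` and `toy_block_size`, shows that the largest offset still leaves room for a full target block, while one step further would not.

```python
# Toy stand-in for the encoded corpus (hypothetical values).
toy_data = list(range(20))
toy_block_size = 8

# torch.randint(high, ...) draws from [0, high), so the largest offset is high - 1.
max_offset = len(toy_data) - toy_block_size - 1

x_last = toy_data[max_offset : max_offset + toy_block_size]
y_last = toy_data[max_offset + 1 : max_offset + toy_block_size + 1]

# One position further and the target slice would run off the end of the data.
y_too_far = toy_data[max_offset + 2 : max_offset + toy_block_size + 2]

print(len(x_last), len(y_last), len(y_too_far))
```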

In [10]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):
  """A bigram language model (an n-gram model with n = 2). It uses only the previous token to predict the next one and ignores any earlier context."""
  def __init__(self, vocab_size):
    super().__init__()
    # Each token directly reads off the logits from the lookup table
    self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

  def forward(self, idx, targets=None):

    logits = self.token_embedding_table(idx) # Tensor of shape (batch_size, block_size, vocab_size), i.e. (B, T, C). Here it is [4, 8, 65]
    if targets is None:
      loss = None
    else:
      B, T, C = logits.shape
      # Flatten the 3D tensor to 2D: the B and T dimensions are merged into one while the C dimension is preserved.
      # F.cross_entropy expects logits of shape (N, C) and targets of shape (N).
      logits = logits.view(B*T, C)
      targets = targets.view(B*T)
      loss = F.cross_entropy(logits, targets) # Reference point: a uniform prediction gives loss -ln(1/vocab_size), about 4.17 for vocab_size = 65. See Karpathy's videos for more detail.

    return logits, loss

  def generate(self, idx, max_new_tokens):
    """idx is (B, T) array. This function will take the input tokens `idx` and generate the output tokens based on the inputs.

    Arguments:
        idx {Array} -- An array of size (B, T)
        max_new_tokens {Integer} -- The number of tokens to generate

    Returns:
        [Array] -- The output of size (B, T+max_new_tokens)
    """
    for _ in range(max_new_tokens):
      # First get the prediction using the forward function.
      logits, loss = self(idx)
      # Keep only the last position along the T dimension: each element of T is a token, and only the last token's logits are needed to predict the next one.
      logits = logits[:, -1, :] # Shape: (B, C)
      # Apply softmax to get the probabilities
      probs = F.softmax(logits, dim=-1) # Shape: (B, C)
      # Sample from the distribution.
      # After the softmax we have a 2D tensor in which each row (one per batch element) has length C.
      # Each row is the probability distribution over the next token for that batch element.
      # torch.multinomial samples one token index from each row's distribution.
      idx_next = torch.multinomial(probs, num_samples = 1) # Shape: (B, 1)
      # Append the prediction to the input sequence.
      idx = torch.cat((idx, idx_next), dim=1) # Shape: (B, T+1)
    return idx


model = BigramLanguageModel(vocab_size)
logits, loss = model(xb, yb)
# logits, loss = model(xb)
print(logits.shape)
print(loss)

idx = torch.zeros((1,1), dtype=torch.long) # Generates token 0 which corresponds to newline. This will be used to start generation
generated_tokens = model.generate(idx, max_new_tokens = 100)[0]
output = decode(generated_tokens.tolist())
print(output)

torch.Size([32, 65])
tensor(4.8786, grad_fn=<NllLossBackward0>)

Sr?qP-QWktXoL&jLDJgOLVz'RIoDqHdhsV&vLLxatjscMpwLERSPyao.qfzs$Ys$zF-w,;eEkzxjgCKFChs!iWW.ObzDnxA Ms$3
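
As a sanity check on the initial loss printed above, it helps to compute the loss of a model that predicts every character with equal probability. This is a small worked calculation, not part of the original notebook; `demo_vocab_size` is a hypothetical name holding the vocabulary size of 65 printed earlier.

```python
import math

demo_vocab_size = 65  # vocabulary size printed earlier in the notebook

# Cross-entropy of a uniform prediction: -ln(1 / vocab_size) = ln(vocab_size).
uniform_loss = -math.log(1.0 / demo_vocab_size)
print(f"{uniform_loss:.4f}")  # 4.1744
```

The untrained model's loss of 4.8786 is somewhat higher than this baseline, which is consistent with randomly initialized logits that are not exactly uniform.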


The output is gibberish because the embedding table was randomly initialized and has not been trained. Let's train the model before using it.

In [11]:
# Create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3) # 3e-4 is a good learning rate for bigger models.

batch_size = 32
for steps in range(10000):
  xb, yb = get_batch('train')

  logits, loss = model(xb, yb)
  optimizer.zero_grad(set_to_none = True)
  loss.backward()
  optimizer.step()

print(loss.item())

2.5727508068084717


In [12]:
idx = torch.zeros((1,1), dtype=torch.long) # Generates token 0 which corresponds to newline. This will be used to start generation
generated_tokens = model.generate(idx, max_new_tokens = 500)[0]
output = decode(generated_tokens.tolist())
print(output)


Iyoteng h hasbe pave pirance
Rie hicomyonthar's
Plinseard ith henoure wounonthioneir thondy, y heltieiengerofo'dsssit ey
KIN d pe wither vouprrouthercc.
hathe; d!
My hind tt hinig t ouchos tes; st yo hind wotte grotonear 'so it t jod weancotha:
h hay.JUCle n prids, r loncave w hollular s O:
HIs; ht anjx?

DUThinqunt.

LaZAnde.
athave l.
KEONH:
ARThanco be y,-hedarwnoddy scace, tridesar, wnl'shenous s ls, theresseys
PlorseelapinghiybHen yof GLUCEN t l-t E:
I hisgothers je are!-e!
QLYotouciullle'z
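
Each character above was produced by the sampling step inside `generate`: softmax turns the logits into probabilities, and one index is drawn from that distribution. The sketch below isolates that step in plain Python. It is an illustration under stated assumptions rather than the notebook's code: `random.choices` stands in for `torch.multinomial`, and `toy_logits` is a hypothetical three-token example.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(1337)
toy_logits = [2.0, 0.0, -1.0]  # hypothetical scores for a 3-token vocabulary
toy_probs = softmax(toy_logits)

# random.choices plays the role of torch.multinomial in this sketch.
draws = random.choices(range(len(toy_probs)), weights=toy_probs, k=1000)
counts = [draws.count(i) for i in range(len(toy_probs))]

print([round(p, 4) for p in toy_probs])
print(counts)
```

Sampling, rather than always taking the most likely token, is why repeated calls to `generate` give different text.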


# The mathematical trick in Self Attention

In [32]:
# consider the following toy example
torch.manual_seed(1337)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)
x.shape

torch.Size([4, 8, 2])

In [33]:
# Version 1: Averaging past context with a for loop, the weakest form of aggregation.

# We want x_bag_of_words[b, t] = mean_{i<=t} x[b, i]
# A token needs context from the previous tokens (decoder architecture) so that it can be understood in relation to the rest of the sentence when predicting the next token.
# For now, we simply average the features of all tokens from position 0 up to and including t.
x_bag_of_words = torch.zeros((B, T, C))
for b in range(B):
  for t in range(T):
    x_prev = x[b, :t+1] # Shape: (t+1, C). We index a single batch element, so there is no batch dimension.
    x_bag_of_words[b, t] = torch.mean(x_prev, 0) # Shape: (C). Average over the time dimension, keeping the channel (C) dimension.

print(x[1])
print(x_bag_of_words[1])

tensor([[ 1.3488, -0.1396],
        [ 0.2858,  0.9651],
        [-2.0371,  0.4931],
        [ 1.4870,  0.5910],
        [ 0.1260, -1.5627],
        [-1.1601, -0.3348],
        [ 0.4478, -0.8016],
        [ 1.5236,  2.5086]])
tensor([[ 1.3488, -0.1396],
        [ 0.8173,  0.4127],
        [-0.1342,  0.4395],
        [ 0.2711,  0.4774],
        [ 0.2421,  0.0694],
        [ 0.0084,  0.0020],
        [ 0.0712, -0.1128],
        [ 0.2527,  0.2149]])
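
The printed averages can be checked by hand with a running sum. The sketch below is plain Python and uses the first four rows of `x[1]` copied from the output above (rounded to four decimals, so results agree only to about that precision); `rows`, `running_sum`, and `running_means` are hypothetical names.

```python
# First four rows of x[1] as printed above (rounded to 4 decimals).
rows = [
    [1.3488, -0.1396],
    [0.2858, 0.9651],
    [-2.0371, 0.4931],
    [1.4870, 0.5910],
]

running_sum = [0.0, 0.0]
running_means = []
for t, row in enumerate(rows):
    # Add the new row to the sum, then divide by the number of rows seen so far.
    running_sum = [s + v for s, v in zip(running_sum, row)]
    running_means.append([s / (t + 1) for s in running_sum])

for mean in running_means:
    print([round(v, 4) for v in mean])
```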


In [21]:
# Matrix multiply as weighted aggregation

torch.manual_seed(42)

a = torch.ones(3, 3)
a = torch.tril(a)
a = a / torch.sum(a, 1, keepdim=True) # After normalising, row 1 of `a @ b` equals row 1 of b, row 2 is the average of rows 1 and 2 of b, and so on.
# This is the same running average as above, computed with a single matrix multiplication instead of nested loops, which is much faster.
b = torch.randint(0, 10, (3, 2)).float()
c = a @ b

print(f"a = {a}")
print(f"b = {b}")
print(f"c = {c}")

a = tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
b = tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
c = tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])
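
To see why the normalised lower-triangular matrix produces running averages, the product can be compared with prefix means computed directly. This is a plain-Python sketch using the values of `b` printed above; `matmul`, `avg`, `b_vals`, `c_vals`, and `prefix_means` are hypothetical names introduced for the illustration.

```python
def matmul(p, q):
    """Plain-Python matrix product of two nested lists."""
    rows, inner, cols = len(p), len(q), len(q[0])
    return [
        [sum(p[i][k] * q[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]

n = 3
# Lower-triangular matrix whose row i holds 1/(i+1) in its first i+1 entries.
avg = [[1.0 / (i + 1) if j <= i else 0.0 for j in range(n)] for i in range(n)]
b_vals = [[2.0, 7.0], [6.0, 4.0], [6.0, 5.0]]  # values of b printed above

c_vals = matmul(avg, b_vals)

# Direct computation: mean of the first i+1 rows of b, column by column.
prefix_means = [
    [sum(r[j] for r in b_vals[: i + 1]) / (i + 1) for j in range(2)]
    for i in range(n)
]

print(c_vals)
print(prefix_means)
```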


In [34]:
# Version 2: The same averaging using matrix multiplication
weights = torch.tril(torch.ones(T, T))
weights = weights / weights.sum(1, keepdim=True)

x_bag_of_words2 = weights @ x # (T, T) @ (B, T, C) -> (B, T, C) [PyTorch broadcasts weights to (B, T, T) and performs a batched matrix multiplication]
torch.allclose(x_bag_of_words, x_bag_of_words2) # Mathematically equal to version 1. The recorded output below is False, most likely because float32 rounding differs between the two computations and the default tolerances are strict (an assumption).

False
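
The `False` above is probably a floating-point effect rather than a logic error; this is an assumption, since the notebook does not investigate it. The sketch below recomputes both versions and reports the largest absolute difference, then compares them with a looser absolute tolerance. The names `x_demo`, `loop_avg`, `matmul_avg`, and `max_diff`, and the choice of `atol=1e-6`, are illustrative and not from the original.

```python
import torch

torch.manual_seed(1337)
B, T, C = 4, 8, 2
x_demo = torch.randn(B, T, C)

# Version 1: explicit loop over prefixes.
loop_avg = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        loop_avg[b, t] = x_demo[b, : t + 1].mean(dim=0)

# Version 2: one batched matmul with a row-normalised lower-triangular matrix.
w = torch.tril(torch.ones(T, T))
w = w / w.sum(dim=1, keepdim=True)
matmul_avg = w @ x_demo

# Largest elementwise gap between the two results.
max_diff = (loop_avg - matmul_avg).abs().max().item()
print(max_diff)

# Same comparison as in the notebook, but with a looser absolute tolerance.
close_with_atol = torch.allclose(loop_avg, matmul_avg, atol=1e-6)
print(close_with_atol)
```

If `max_diff` is tiny, the two versions agree and only the comparison threshold needs adjusting.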

In [40]:
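# Version 3: Using softmax
tril = torch.tril(torch.ones(T, T))
weights = torch.zeros((T, T)) # Start with equal (zero) scores for every pair of positions
weights = weights.masked_fill(tril == 0, float('-inf')) # Future positions get -inf so that softmax assigns them zero weight
weights = F.softmax(weights, dim=-1) # Each row becomes a uniform average over the allowed positions
x_bow3 = weights @ x
torch.allclose(x_bag_of_words, x_bow3) # Mathematically equal to version 1, up to floating-point rounding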

False
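
The softmax version reproduces the averaging weights because `exp(0) = 1` for every allowed position and `exp(-inf) = 0` for every masked one, so each row normalises to a uniform distribution over the past. A minimal plain-Python sketch of that arithmetic follows; `softmax_row`, `masked`, and `attn` are hypothetical names, and a 4x4 matrix is used for brevity.

```python
import math

def softmax_row(row):
    """Softmax of one row; math.exp(-inf) evaluates to 0.0."""
    exps = [math.exp(v) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

n = 4
neg_inf = float("-inf")

# Zero scores on and below the diagonal, -inf above it (the future positions).
masked = [[0.0 if j <= i else neg_inf for j in range(n)] for i in range(n)]
attn = [softmax_row(r) for r in masked]

for r in attn:
    print([round(v, 4) for v in r])
```

Row `i` ends up with weight `1 / (i + 1)` on positions `0..i` and zero elsewhere, which is the same matrix built by dividing `tril` by its row sums.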