<a href="https://colab.research.google.com/github/CaptainJimbo/MyPortfolio/blob/main/myGPT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Based on "Attention is all you need" paper** [(link)](https://arxiv.org/abs/1706.03762). This simple algorithm is a Transformer-based Language Model to showcase how an LLM like ChatGPT is trained. It doens't include the pretuning and supervised finetuning.

In [1]:
import torch

In [2]:
# I need a "toy" dataset to train with.
# (This is very small comparing to a big chunk of the internet that ChatGPT is trained on!)
# This is a .txt file with some of Shakespeare's works.
# The goal is to create a model that produces Shakespearean language!!
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt -O input.txt

--2023-07-10 12:25:22--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2023-07-10 12:25:22 (20.0 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [3]:
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
print(f'length of the dataset is {len(text)}')
print(f'\nand here is a random part of the dataset {text[60:464]}')

length of the dataset is 1115394

and here is a random part of the dataset 

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.




In [4]:
# The algorithm needs to understands characters. But it doesn't need the the particular characters.
# It could be numbers i.e. indices. So I create a mapping from characters to indices.
vocabulary = sorted(list(set(text)))
print('here is the vocabulary, i.e. every possible character that exists in this text.',''.join(vocabulary))

here is the vocabulary, i.e. every possible character that exists in this text. 
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz


In [5]:
char_to_idx = {character:index for index, character in enumerate(vocabulary)}
idx_to_char = {index:character for index, character in enumerate(vocabulary)}

# Mappings
def encode(text):
  return [char_to_idx[character] for character in text]
def decode(indices):
  return ''.join(idx_to_char[index] for index in indices)

#encode('Hello There'), decode(encode('Hello There'))

In [6]:
data = torch.tensor(encode(text),dtype=torch.long) # This is tensor with indices representing characters.
print('tensor shape',data.shape)
print('tensor  type',data.dtype)
print('tensor  rank',data.dim())

tensor shape torch.Size([1115394])
tensor  type torch.int64
tensor  rank 1


In [7]:
0.8*100344

80275.20000000001

In [8]:
# We need a train set and test set
train_data = data[:int(0.8*len(data))]
test_data = data[int(0.8*len(data)):]

In [9]:
gram_len = 5
X = train_data[:gram_len]
y = train_data[1:gram_len+1]
X, y

(tensor([18, 47, 56, 57, 58]), tensor([47, 56, 57, 58,  1]))

In [39]:
torch.manual_seed(1337)
BATCH_SIZE = 4
NGRAM_LENGTH = 5

def get_batch(type, batch_size=BATCH_SIZE, ngram_length=NGRAM_LENGTH):
    data = train_data if type=='train' else test_data
    inits = torch.randint(len(data)-ngram_length, (batch_size,))
    X = torch.stack([data[i:i+ngram_length] for i in inits])
    Y = torch.stack([data[i+1:i+ngram_length+1] for i in inits])
    return X,Y

X_b, Y_b = get_batch('train',10, 6)

In [40]:
X_b

tensor([[ 1, 40, 43,  1, 63, 53],
        [40, 39, 50, 43,  8,  0],
        [ 1, 61, 46, 47, 41, 46],
        [46,  1, 58, 46, 43,  1],
        [53,  1, 57, 46, 39, 50],
        [63, 53, 59, 56,  1, 54],
        [ 5, 50, 50,  1, 52, 53],
        [ 1, 61, 47, 58, 46,  1],
        [63, 53, 59,  1, 61, 47],
        [43, 52, 45, 43,  8,  0]])

In [30]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)



Embedding(65, 65)

In [50]:
nn.Embedding(len(vocabulary),len(vocabulary))(X_b).view(10*6,65).shape
nn.Embedding(len(vocabulary),len(vocabulary))(Y_b).view(10*6,65).shape

torch.Size([60, 65])

In [32]:
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):

        # idx and targets are both (B,L) tensor of integers
        logits = self.token_embedding_table(idx) # (B,L,V)

        if targets is None:
            loss = None
        else:
            B, L, V = logits.shape
            logits = logits.view(B*L, V)
            targets = targets.view(B*L)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, L) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, V)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, V)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, L+1)
        return idx

m = BigramLanguageModel(len(vocabulary))
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)

print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))


torch.Size([10, 65])
tensor(4.5677, grad_fn=<NllLossBackward0>)

hbH

:CLP.A!fq'3ggt!O!T?X!!SA?W&TrpvYybSE3w&S BXUhmiKYyTmWMPhhmnHKj!!btgnwNNULuEzRuYyiWEQxPX!$3C'MBj


[tensor([47, 56, 57, 58,  1]),
 tensor([56, 57, 58,  1, 15]),
 tensor([57, 58,  1, 15, 47])]