# Reproducing nanoGPT

## Description of the project

How to create a character level language model? It will be based on a from scratch implementation of a Generative Pre-training Transformer (GPT) model based on the `tinyshakespeare` dataset, which is a concatenation of all the works of Shakespeare.  We use a `Bigram` architecture for tokenization and embedding. Then, the tokens pass through a transformer.

## Results

TODO

## Useful resources to better understand the attention mechanism and transformers
- [Karpathy nanoGPT](https://www.youtube.com/watch?v=kCc8FmEb1nY&pp=ygUSa2FycGF0aHkgbmFubyBncHQg)
- [Ilustrated transformer - Blog](https://jalammar.github.io/illustrated-transformer/)
- [Vaswani et al. paper](https://arxiv.org/pdf/1706.03762)
- [Lecture notes on attention mechanisms](https://fleuret.org/dlc/materials/dlc-slides-13-2-attention-mechanisms.pdf)

## Now let's see the code!

In [1]:
# uncomment to download the text dataset 
# !wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

In [2]:
# EXPLORATION OF THE DATASET 
import random 

with open("input.txt", 'r') as file:
    text = file.read()
    print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [3]:
# STORE UNIQUE CHARACTERS

# grab all the different characters in our text
# - sorted - print by joining - vocab_size (should be useful later) 
characters = sorted(list(set(text)))
vocab_size = len(characters)
print('unique characters present in Shakespeare texts: ',''.join(characters))
print(f'size of vocabulary: {vocab_size}')

unique characters present in Shakespeare texts:  
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
size of vocabulary: 65


In [4]:
# CHARACTER-LEVEL TOKENIZATION
# now we need to tokenize. Given raw text we wish to convert that into a sequence of intergers 

stoi = {elem:i for i,elem in enumerate(characters)} # stoi: string to index 
itos = {i:elem for i,elem in enumerate(characters)} # 
encode = lambda element: [stoi[elem] for elem in element]
decode = lambda element: ''.join([itos[elem] for elem in element])

print('Print the map from character to integer: ',stoi)
print('Example of encoded version of string hell$$oasds: ' ,encode('hell$$oasds'))
print('Check of decode of the encoded version of hell$$oasds: ', decode(encode('hell$$oasds')))

Print the map from character to integer:  {'\n': 0, ' ': 1, '!': 2, '$': 3, '&': 4, "'": 5, ',': 6, '-': 7, '.': 8, '3': 9, ':': 10, ';': 11, '?': 12, 'A': 13, 'B': 14, 'C': 15, 'D': 16, 'E': 17, 'F': 18, 'G': 19, 'H': 20, 'I': 21, 'J': 22, 'K': 23, 'L': 24, 'M': 25, 'N': 26, 'O': 27, 'P': 28, 'Q': 29, 'R': 30, 'S': 31, 'T': 32, 'U': 33, 'V': 34, 'W': 35, 'X': 36, 'Y': 37, 'Z': 38, 'a': 39, 'b': 40, 'c': 41, 'd': 42, 'e': 43, 'f': 44, 'g': 45, 'h': 46, 'i': 47, 'j': 48, 'k': 49, 'l': 50, 'm': 51, 'n': 52, 'o': 53, 'p': 54, 'q': 55, 'r': 56, 's': 57, 't': 58, 'u': 59, 'v': 60, 'w': 61, 'x': 62, 'y': 63, 'z': 64}
Example of encoded version of string hell$$oasds:  [46, 43, 50, 50, 3, 3, 53, 39, 57, 42, 57]
Check of decode of the encoded version of hell$$oasds:  hell$$oasds


In [5]:
# TOKENIZATION OF THE ENTIRE DATASET 

import torch
data = torch.tensor(encode(text))
print(data.shape)
print(data[:1000]) # looking at the first 1000 characters 


torch.Size([1115394])
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
        47, 59, 57

In [6]:
# DATA SPLIT: TRAINING SET and VALIDATION SET

n = int(len(data)*0.9) # 
train_data = data[:n]
val_data = data[n:]
print(train_data.shape)
print(val_data.shape)


torch.Size([1003854])
torch.Size([111540])


In [32]:
# BATCHes generation

torch.manual_seed(1337)
block_size = 8 # how long is our sequence?
batch_size = 4 # how many sequences we want to compute in parallel? 

def get_batch(split_type):
    dataset = train_data if split_type=='train' else val_data
    positions = torch.randint(low=0,high=len(dataset)-block_size,size=(batch_size,))
    A, B = None, None
    A = torch.stack([dataset[pos:pos+block_size] for pos in positions])
    B = torch.stack([dataset[pos+1:pos+block_size+1] for pos in positions])

    return A,B

xa, xb = get_batch('train')
print(xa.shape, xb.shape)
print(xa)
print(xb)

torch.Size([4, 8]) torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])


## A very simple language model 

At this point Karpathy refers to his lecture on `Bigram` model. I think it is important to understand this simple model first. It consists on a look up table `token_embedding_table` from which we will extract the embeddings from character tokens in `xa`, i.e. the encoded characters, and then we use them as logits for later comparison with `xb`, the encoded characters but for the next position.

```python
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size,vocab_size)

    def forward(self, idx):
        logits = self.token_embedding_table(idx) # size (Batch_size, Time , vocab_size)
        return logits 


```

In [37]:
# SIMPLEST BASELINE MODEL: to each character correspond an embedding

import torch
import torch.nn as nn 
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # this matrix serves as a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        B,T = idx.shape
        # idx: batch of sequences of encoded characters
        # for each entry of `stoi = idx[][]` , we will retrieve the stio-th row of self.token_embedding_table
        logits = self.token_embedding_table(idx) # (B,T,C) = (B = batch_size, time = block_size, channels = n_embd)

        if targets is None:
            loss = None
        else:
            B,T,C = logits.shape
            # we use `view` because of how `functional` from PyTorch treats the input
            logits  = logits.view(B*T,C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits,targets)

        return logits, loss


    def generate(self, idx, max_new_tokens):
        """
        function to extend the context of `idx` by `max_new_tokens`
        """  
        for i in range(max_new_tokens):            
            logits,loss = self(idx)
            logits = logits[:,-1,:] # we extract the last element from the sequence
            probs = F.softmax(logits,dim=1)
            idx_next = torch.multinomial(probs,num_samples=1)
            idx = torch.cat((idx,idx_next),dim=1)

        return idx

m = BigramLanguageModel(vocab_size)
logits, loss  = m(xb,xb)
print(logits.shape)
print(loss)



torch.Size([256, 65])
tensor(4.6043, grad_fn=<NllLossBackward0>)


In [38]:
# TESTING GENERATE WITHOUT TRAINING
# we set the first character from which we start to generate as the first character from our vocabulary
foo = m.generate(torch.zeros((1,1),dtype=torch.long),100)
print(foo.shape, type(foo))
# now I need to decode that list of numbers. First we send the tensor to be a list of size (1,101), then fetch the only row
foo_decode = foo.tolist()[0]
print(f'First attempt at predicting the next character without training: {decode(foo_decode)}')

torch.Size([1, 101]) <class 'torch.Tensor'>
First attempt at predicting the next character without training: 
Sr?qP-QWktXoL&jLDJgOLVz'RIoDqHdhsV&vLLxatjscMpwLERSPyao.qfzs$Ys$zF-w,;eEkzxjgCKFChs!iWW.ObzDnxA Ms$3


In [47]:
# TRAIN THE MODEL 
optimizer = torch.optim.Adam(m.parameters(), lr=1e-3)

batch_size = 32
for i in range(10000):
    # sample a batch of data
    xa, xb = get_batch('train')
    # evaluate the loss
    logits, loss = m(xa,xb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(loss.item())

2.5574398040771484


In [55]:
# TESTING GENERATE WITH SOME TRAINING
# we set the first character from which we start to generate as the first character from our vocabulary
foo = m.generate(torch.zeros((1,1),dtype=torch.long),500)

foo_decode = foo.tolist()[0]
print(f'Prediction with training: {decode(foo_decode)}')

Prediction with training: 
Woullinore tresthealime:
MBENAn. it thigige; o pe, RO:
Fu.
Wind cer it OFolledom ar I hive 'lou, helouesean the s

ILORI t d p' l at wh. t arill, nothe pre t
SCHenor I somenowof akid nd lloreouake ced MOreou lourid CKISoouke oro h akie s mand'send!

N me sousowos,
Bu aik:
Thad mb,
Floulls bendr, whim s athesame r!
Whage hu s!
Wh, d
Sues, r a'squn:



t cke wfon m ad
BER: I thatre paspo, th; se, blenat foulupig.
I foigrind ay w mindo heve o ndsby, l buconcorgll--hin, igharurle jo s halouprd:
TULU
