# A Miniature Implementation of GPT Model
> This blog post explains a minimal implementation of a GPT model, using an addition problem as the task.

- badges: true
- comments: true
- sticky_rank: 1
- author: Group-23
- image: images/diagram.png
- categories: [GPT, transformers]

Blog Link : https://geek4ray.github.io/blog/mingpt/transformers/2021/11/22/Implementing-Miniature-GPT-Model.html

This is the Group-23 NLP project: A Miniature Implementation of a GPT Model

Group Members :

1. Mallikarjuna Naik - IEC2018029
2. Rakamal Gupta - IEC2018050
3. Rayan Kejriwal - IEC2018080
4. Muasim Wani - IEC2018085
5. Pavan Kalyan - IEC2018088

under Prof. Muneendra Ojha,
Dept. of IT, IIIT Allahabad, Prayagraj, India.

> `Objective : "To Train a GPT model on a dedicated addition dataset to see if a Transformer can learn to add."`

Our objective is inspired by the addition section of the GPT-3 paper ("Language Models are Few-Shot Learners") - https://arxiv.org/pdf/2005.14165v4.pdf

Code References Used :
1. https://github.com/openai/gpt-2 - This repository contains the GPT model code in TensorFlow. We took only the model code (not the training code) and adapted it to our use case in PyTorch.
2. Image GPT by OpenAI - https://github.com/openai/image-gpt
3. The "Attention Is All You Need" paper - https://arxiv.org/pdf/1706.03762.pdf

> `Note: We advise enabling a GPU runtime before running this notebook on Google Colab.`

## 1. Imports

In [None]:
# Imports
import math
import logging
import gc
import os
import numpy as np
import torchvision
import torch
import matplotlib.pyplot as plt
import random
import torch.nn as nn
from torch.nn import functional as F
import torch.optim as optim
from torch.optim.lr_scheduler import LambdaLR
from torch.utils.data.dataloader import DataLoader

## 2. Setting Our Seed 

In [None]:
# Seeding 
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

# Making deterministic, setting our seed
set_seed(42)

## 3. Generating Our Datasets:

To generate training and validation data, we define a custom `AdditionDataset` class.

The sum of two n-digit numbers is a number with at most (n+1) digits. So our
encoding will simply be the n-digit first number, n-digit second number, 
and (n+1)-digit result, all simply concatenated together. Because each addition
problem is so structured, there is no need to bother the model with encoding
+, =, or other tokens. Each possible sequence has the same length, and simply
contains the raw digits of the addition problem.

As a few examples, the 2-digit problems:
- 85 + 50 = 135 becomes the sequence [8, 5, 5, 0, 1, 3, 5]
- 6 + 39 = 45 becomes the sequence [0, 6, 3, 9, 0, 4, 5]
etc.

We will also train GPT only on the final (n+1) digits, because the two n-digit
inputs are always assumed to be given. So when we give GPT an exam later,
we will e.g. feed it the sequence [0, 6, 3, 9], which encodes that we'd like
to add 6 + 39, and hope that the model completes the integer sequence with [0, 4, 5] in 3 sequential steps.
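
The encoding described above can be sketched in a few lines of plain Python. This is only an illustration: the helper name `encode_problem` is hypothetical and is not part of the notebook's dataset class, which follows below.

```python
def encode_problem(a, b, ndigit):
    """Render a + b as zero-padded digits: n digits, n digits, (n+1) digits."""
    c = a + b
    text = f"{a:0{ndigit}d}{b:0{ndigit}d}{c:0{ndigit + 1}d}"
    return [int(ch) for ch in text]  # one token per digit character


print(encode_problem(85, 50, ndigit=2))  # [8, 5, 5, 0, 1, 3, 5]
print(encode_problem(6, 39, ndigit=2))   # [0, 6, 3, 9, 0, 4, 5]
```

Every sequence has the same length (3n + 1 digits), which is why no `+` or `=` tokens are needed.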

In [None]:
#collapse-hide
from torch.utils.data import Dataset
class AdditionDataset(Dataset):
  
    """
    Our Custom Dataset Class for Generating Data into Training and Test sets.
    Returns addition problems of up to some number of digits in the inputs. We recall
    that all GPT cares about are sequences of integers, and completing them according to
    patterns in the data. Therefore, we have to somehow encode addition problems
    as a sequence of integers.

    """

    def __init__(self, ndigit, split):
        self.split = split # train/test
        self.ndigit = ndigit
        self.vocab_size = 10 # 10 possible digits 0..9
        #+1 due to potential carry overflow, but then -1 because very last digit doesn't plug back
        self.block_size = ndigit + ndigit + ndigit + 1 - 1
        
        #split up all addition problems into either training data or test data : 
        num = (10**self.ndigit)**2 # total number of possible combinations, here num = 10000
        r = np.random.RandomState(1337) # making our datasets deterministic
        perm = r.permutation(num) #perm is an array of indexes
        num_test = min(int(num*0.2), 1000)# 20% of the whole dataset, or only up to 1000
        self.ixes = perm[:num_test] if split == 'test' else perm[num_test:] # Here, We have taken 1000 examples in test set and 9000 in training set

    
    def __len__(self):
        return self.ixes.size # Magic method for using len(...)


    # Define the magic method __getitem__ so that a dataset object can be indexed like a container.
    def __getitem__(self, idx):
        # given a problem index idx, first recover the associated a + b
        idx = self.ixes[idx]
        nd = 10**self.ndigit
        a = idx // nd
        b = idx %  nd
        c = a + b
        render = f'%0{self.ndigit}d%0{self.ndigit}d%0{self.ndigit+1}d' % (a,b,c) # e.g. 03+25=28 becomes "0325028" 
        dix = [int(s) for s in render] # convert each character to its token index
        # x will be input to GPT and y will be the associated expected outputs
        x = torch.tensor(dix[:-1], dtype=torch.long)
        y = torch.tensor(dix[1:], dtype=torch.long) # predict the next token in the sequence
        y[:self.ndigit*2-1] = -100 # we will only train in the output locations. -100 will mask loss to zero
        return x, y

Creating our Training and Test Datasets for 2-Digit Addition

In [None]:
ndigit = 2
train_dataset = AdditionDataset(ndigit=ndigit, split='train')
test_dataset = AdditionDataset(ndigit=ndigit, split='test')

Sample a training instance just to see what one raw example looks like

In [None]:
train_dataset[0] 

(tensor([4, 7, 1, 7, 0, 6]), tensor([-100, -100, -100,    0,    6,    4]))
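
To read this output: the example is 47 + 17 = 064. The input `x` holds every digit except the last one, and the target `y` is the same sequence shifted left by one position, with the first three entries masked by -100. The sketch below rebuilds the pair with plain lists instead of tensors. The function name `example_from_index` is hypothetical, and the index 4717 is simply 47 * 100 + 17, the problem index that corresponds to this example.

```python
IGNORE_INDEX = -100  # target value that PyTorch's cross-entropy skips by default


def example_from_index(idx, ndigit):
    """Plain-list version of the (x, y) pair built in __getitem__."""
    nd = 10 ** ndigit
    a, b = divmod(idx, nd)  # same as idx // nd and idx % nd
    c = a + b
    digits = [int(ch) for ch in f"{a:0{ndigit}d}{b:0{ndigit}d}{c:0{ndigit + 1}d}"]
    x = digits[:-1]  # model input: everything except the last digit
    y = digits[1:]   # targets: the input shifted left by one position
    # Mask the positions whose target is still part of the given operands.
    for i in range(ndigit * 2 - 1):
        y[i] = IGNORE_INDEX
    return x, y


x, y = example_from_index(4717, ndigit=2)  # 4717 encodes 47 + 17
print(x)  # [4, 7, 1, 7, 0, 6]
print(y)  # [-100, -100, -100, 0, 6, 4]
```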

## 4. Defining our GPT Model

- The initial stem combines a token embedding with a positional embedding
- The core of our model is a uniform sequence of Transformer blocks:
     * each Transformer block is a sequential combination of a self-attention block and a 1-hidden-layer MLP block
     * all blocks feed into a central residual pathway, similar to ResNets

- The final decoder is a linear projection into a vanilla softmax classifier

### 4.1 Our basic config classes for GPT MODEL 



In [None]:
logger = logging.getLogger(__name__)

class GPTConfig:
    """ base GPT config, params common to all GPT versions """
    embd_pdrop = 0.1
    resid_pdrop = 0.1
    attn_pdrop = 0.1

    def __init__(self, vocab_size, block_size, **kwargs):
        self.vocab_size = vocab_size
        self.block_size = block_size
        for k,v in kwargs.items():
            setattr(self, k, v)

class GPT1Config(GPTConfig):
    """ GPT-1 like network roughly 125M params """
    n_layer = 12
    n_head = 12
    n_embd = 768

### 4.2 Implementing Self-Attention Class from Scratch 



![MHA](https://ars.els-cdn.com/content/image/1-s2.0-S0167639320302806-gr2.jpg)

Now we will write our own masked multi-head self-attention block from scratch. Multi-head attention is arguably one of the most important modules of the Transformer architecture. Transformers use a mechanism called self-attention: the queries, keys, and values are all computed from the same input sequence, so every token can weigh its relationship to the other tokens in that sequence and build a context-aware representation.

In our GPT model we specifically use masked (causal) self-attention, which means that each token attends only to itself and to the tokens on its left; tokens to the right are not taken into account.


1. First, our input of size (B, T, C), viz. (mini-batch size, sequence length, embedding size), is fed to the block.
2. The input is passed through three linear layers, which output the queries, keys, and values, each of size (B, T, C). Each is then split into `n_head` heads of size hs = C / n_head, giving tensors of shape (B, nh, T, hs).
3. We then compute (Queries @ Keys.T) / sqrt(hs) to obtain the (T, T) attention score matrix of each head.
4. We apply the causal mask to this matrix, setting every entry above the diagonal to -inf, so that each position can attend only to itself and to the positions on its left.
5. The masked matrix is passed through a softmax, which normalizes the attention scores in each row and turns the -inf entries in the upper triangle into 0.
6. We apply dropout for regularization (with `attn_pdrop = 0.1`) and then multiply the result by the value matrix of size (T, hs).
7. Finally, we concatenate the outputs of all heads along the last dimension to get back a tensor of size (B, T, C), and pass it through a linear projection layer (followed by dropout) that keeps this size.
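
The core of these steps (scaled scores, causal mask, softmax, weighted sum of values) can be sketched for a single head with plain Python lists. This is a simplified illustration, not the notebook's implementation: it leaves out the linear projections, the dropout, and the batching over heads, and the function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats (-inf entries become 0)."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(q, k, v):
    """Single-head causal attention; q, k, v are lists of T vectors."""
    T, hs = len(q), len(q[0])
    out, weights = [], []
    for i in range(T):
        # Scaled dot-product scores; positions j > i are masked with -inf.
        scores = [
            sum(qi * kj for qi, kj in zip(q[i], k[j])) / math.sqrt(hs)
            if j <= i else float("-inf")
            for j in range(T)
        ]
        w = softmax(scores)
        weights.append(w)
        # Output for position i: attention-weighted sum of the value vectors.
        out.append([sum(w[j] * v[j][d] for j in range(T)) for d in range(len(v[0]))])
    return out, weights


q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, weights = causal_attention(q, k, v)
print(weights[0])  # [1.0, 0.0, 0.0]: the first token can only attend to itself
```

Each row of `weights` sums to 1, and every entry above the diagonal is exactly 0, which is the effect of the mask in step 4.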




In [None]:
class CausalSelfAttention(nn.Module):
    """
    A vanilla multi-head masked self-attention layer with a projection at the end.
    It is also possible to use torch.nn.MultiheadAttention here.

    """
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # key, query, value projections for all heads
        self.key = nn.Linear(config.n_embd, config.n_embd)
        self.query = nn.Linear(config.n_embd, config.n_embd)
        self.value = nn.Linear(config.n_embd, config.n_embd)
        # regularization
        self.attn_drop = nn.Dropout(config.attn_pdrop)
        self.resid_drop = nn.Dropout(config.resid_pdrop)
        # output projection
        self.proj = nn.Linear(config.n_embd, config.n_embd)
        # causal mask to ensure that attention is only applied to the left in the input sequence
        self.register_buffer("mask", torch.tril(torch.ones(config.block_size, config.block_size))
                                     .view(1, 1, config.block_size, config.block_size))
        self.n_head = config.n_head

    def forward(self, x, layer_past=None):
        B, T, C = x.size() # B -> batch size, T -> sequence length, C -> embedding size (n_embd)

        # calculate query, key, values for all heads in batch and move head forward to be the batch dim where hs = C//self.n_head
        k = self.key(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        q = self.query(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        v = self.value(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)

        # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.mask[:,:,:T,:T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        att = self.attn_drop(att)
        y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side

        # output projection
        y = self.resid_drop(self.proj(y))
        return y


### 4.3 Our Basic Transformer Block 

Here we define a basic block that takes a config as input (an instance of our GPTConfig class) and defines the structure of a single Transformer block, which the full model stacks several times.

- [LayerNorm](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html) is PyTorch's layer normalization module (`torch.nn.LayerNorm`). It normalizes each token's feature vector over the embedding dimension and then applies a learnable scale and shift.

1. First we apply a layer norm to the input x and pass the result to the masked self-attention block. We then add the original input x to the output of the self-attention block (a residual connection), so that the original information is preserved after the self-attention block. -> x1

2. After that, we apply a second layer norm to x1 and pass the result to a feed-forward network with Linear -> GELU -> Linear -> Dropout layers, and add x1 from step 1 to its output.
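
The two steps follow the same "normalize, transform, add back" pattern. The sketch below shows that pattern on a single feature vector in plain Python. It is an illustration under simplifying assumptions: the learnable scale and shift of the layer norm are fixed to 1 and 0, and `toy_sublayer` is a hypothetical stand-in for the attention and MLP sub-layers.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and (almost) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_norm_residual(x, sublayer):
    """Compute x + sublayer(layer_norm(x)) for a single vector."""
    return [xi + si for xi, si in zip(x, sublayer(layer_norm(x)))]


def toy_sublayer(h):
    # Hypothetical stand-in for the attention or MLP sub-layer.
    return [0.5 * v for v in h]


x = [1.0, 2.0, 3.0, 4.0]
x1 = pre_norm_residual(x, toy_sublayer)   # step 1: the attention slot
x2 = pre_norm_residual(x1, toy_sublayer)  # step 2: the MLP slot
print(x1)
print(x2)
```

Because the input is added back unchanged, each sub-layer only has to learn a correction to the running representation.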





In [None]:
class Block(nn.Module):
    """ our basic Transformer block """
    
    def __init__(self, config):
        super().__init__()
        self.ln1 = nn.LayerNorm(config.n_embd)
        self.ln2 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.mlp = nn.Sequential(
            nn.Linear(config.n_embd, 4 * config.n_embd),
            nn.GELU(),
            nn.Linear(4 * config.n_embd, config.n_embd),
            nn.Dropout(config.resid_pdrop),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x

### 4.4 Full GPT Model Class:
This is our main GPT model class, assembled from the small building blocks defined in the classes above.

1. This class also takes a config as input, which is an instance of the GPTConfig class.
2. The functionality of each method is explained within the method body itself (please refer there for details).
3. The forward function of this class takes a minibatch of rows from our AdditionDataset instance as input.
4. The minibatch is a matrix of size (b, t), where b is the number of examples in the minibatch and t is the number of tokens in an example (the maximum value of t is 6).
5. Our token embedding and position embedding layers are defined as
`self.tok_emb = nn.Embedding(config.vocab_size, config.n_embd)` and
`self.pos_emb = nn.Parameter(torch.zeros(1, config.block_size, config.n_embd))`.
The inputs are passed through the token embedding, and the position embedding is added to the result so that the model has access to positional information (which a Transformer needs explicitly, unlike RNN/LSTM models, which already process tokens sequentially).

6. Finally, we compute and return the logits (unnormalized scores, which a softmax turns into probabilities) for each of the 10 digits in our vocabulary, along with the loss (which we calculate only if the targets (y's) are provided).
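
To make the loss in step 6 concrete: PyTorch's `F.cross_entropy` skips targets equal to -100 by default (its `ignore_index`), so only the answer digits contribute. The plain-Python sketch below mimics that behavior for one example; the function names are hypothetical and the code works on lists rather than tensors.

```python
import math

IGNORE_INDEX = -100


def log_softmax(logits):
    """Log-probabilities for one position, computed in a numerically stable way."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]


def masked_cross_entropy(logits_per_position, targets):
    """Mean negative log-likelihood over positions whose target is not IGNORE_INDEX."""
    losses = [
        -log_softmax(logits)[t]
        for logits, t in zip(logits_per_position, targets)
        if t != IGNORE_INDEX
    ]
    return sum(losses) / len(losses)


uniform = [0.0] * 10          # an untrained model that has no preference among digits
logits = [uniform] * 6        # one logit vector per input position
targets = [-100, -100, -100, 0, 6, 4]
loss = masked_cross_entropy(logits, targets)
print(loss)  # close to ln(10), about 2.3026
```

A loss near ln(10) therefore means the model is guessing uniformly among the ten digits, which gives a useful reference point when reading the training log.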

In [None]:
class GPT(nn.Module):
    """ The full GPT language model, with a context size of block_size """

    def __init__(self, config):
        super().__init__()

        # input embedding stem
        self.tok_emb = nn.Embedding(config.vocab_size, config.n_embd)
        self.pos_emb = nn.Parameter(torch.zeros(1, config.block_size, config.n_embd))
        self.drop = nn.Dropout(config.embd_pdrop)
        # transformer      
        self.blocks = nn.Sequential(*[Block(config) for _ in range(config.n_layer)])
        # decoder head
        self.ln_f = nn.LayerNorm(config.n_embd)
        self.head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

        self.block_size = config.block_size
        self.apply(self._init_weights)

        logger.info("number of parameters: %e", sum(p.numel() for p in self.parameters()))

    def get_block_size(self):
        return self.block_size

    def _init_weights(self, module):
        if isinstance(module, (nn.Linear, nn.Embedding)):
            module.weight.data.normal_(mean=0.0, std=0.02)
            if isinstance(module, nn.Linear) and module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)


  
    def configure_optimizers(self, train_config):


      """
      By this function, We are separating out all parameters of the model into two buckets: 
      those that will experience weight decay for regularization and those that won't (biases, and layernorm/embedding weights).
      We are then returning the PyTorch optimizer object.

      """
      # separate out all parameters to those that will and won't experience regularizing weight decay
      decay = set()
      no_decay = set()
      whitelist_weight_modules = (torch.nn.Linear, )
      blacklist_weight_modules = (torch.nn.LayerNorm, torch.nn.Embedding)
      for mn, m in self.named_modules():
          for pn, p in m.named_parameters():
              fpn = '%s.%s' % (mn, pn) if mn else pn # full param name

              if pn.endswith('bias'):
                  # all biases will not be decayed
                  no_decay.add(fpn)
              elif pn.endswith('weight') and isinstance(m, whitelist_weight_modules):
                  # weights of whitelist modules will be weight decayed
                  decay.add(fpn)
              elif pn.endswith('weight') and isinstance(m, blacklist_weight_modules):
                  # weights of blacklist modules will NOT be weight decayed
                  no_decay.add(fpn)

      # special case the position embedding parameter in the root GPT module as not decayed
      no_decay.add('pos_emb')

      # validate that we considered every parameter
      param_dict = {pn: p for pn, p in self.named_parameters()}
      inter_params = decay & no_decay
      union_params = decay | no_decay
      assert len(inter_params) == 0, "parameters %s made it into both decay/no_decay sets!" % (str(inter_params), )
      assert len(param_dict.keys() - union_params) == 0, "parameters %s were not separated into either decay/no_decay set!" \
                                                % (str(param_dict.keys() - union_params), )

      # create the pytorch optimizer object
      optim_groups = [
          {"params": [param_dict[pn] for pn in sorted(list(decay))], "weight_decay": train_config.weight_decay},
          {"params": [param_dict[pn] for pn in sorted(list(no_decay))], "weight_decay": 0.0},
      ]
      optimizer = torch.optim.AdamW(optim_groups, lr=train_config.learning_rate, betas=train_config.betas)
      return optimizer

    def forward(self, idx, targets=None):
        b, t = idx.size()
        assert t <= self.block_size, "Cannot forward, model block size is exhausted."

        # forward the GPT model
        token_embeddings = self.tok_emb(idx) # each index maps to a (learnable) vector
        position_embeddings = self.pos_emb[:, :t, :] # each position maps to a (learnable) vector
        x = self.drop(token_embeddings + position_embeddings)
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.head(x)

        # if we are given some desired targets also calculate the loss
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

        return logits, loss

### We can now initialize our GPT model with a small set of hyperparameters:

In [None]:
#hide
print("Our Vocab Size: ",train_dataset.vocab_size," Our Max. Block Size: " ,train_dataset.block_size)

Our Vocab Size:  10  Our Max. Block Size:  6


In [None]:
# initialize a GPT model
mconf = GPTConfig(train_dataset.vocab_size, train_dataset.block_size, n_layer=2, n_head=4, n_embd=128)
model = GPT(mconf)
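
As a sanity check on the size of this model, the parameter count can be worked out by hand from the layer shapes defined above. The arithmetic below is a sketch that assumes the classes are used exactly as written in this post (biases on every linear layer except the head, and the causal mask registered as a buffer rather than a parameter). The helper names are hypothetical.

```python
def linear_params(n_in, n_out, bias=True):
    """Number of weights (and optional biases) in a linear layer."""
    return n_in * n_out + (n_out if bias else 0)


def gpt_param_count(vocab_size, block_size, n_layer, n_embd):
    layer_norm = 2 * n_embd                                  # weight + bias
    attention = 4 * linear_params(n_embd, n_embd)            # key, query, value, proj
    mlp = linear_params(n_embd, 4 * n_embd) + linear_params(4 * n_embd, n_embd)
    block = 2 * layer_norm + attention + mlp
    embeddings = vocab_size * n_embd + block_size * n_embd   # tok_emb + pos_emb
    head = linear_params(n_embd, vocab_size, bias=False)
    return embeddings + n_layer * block + layer_norm + head


print(gpt_param_count(vocab_size=10, block_size=6, n_layer=2, n_embd=128))  # 400128
```

The number of heads does not appear in the formula, because splitting the embedding into heads only reshapes the same projection matrices.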

##  5. Trainer (Learner) for our GPT Model : 
Now we will define the trainer class for our model, to which we pass our GPT model instance along with some tunable hyperparameters used during training in PyTorch.



### 5.1 Trainer Config Class
Just like our GPT1Config and GPTConfig classes defined above, we define a separate config class for the main Trainer class. It contains the hyperparameters that are used throughout the Trainer class.

In [None]:
#collapse-hide 

logger = logging.getLogger(__name__)

class TrainerConfig:
    # optimization parameters
    max_epochs = 10
    batch_size = 64
    learning_rate = 3e-4
    betas = (0.9, 0.95)
    grad_norm_clip = 1.0
    weight_decay = 0.1 # only applied on matmul weights
    # learning rate decay params: linear warmup followed by cosine decay to 10% of original
    lr_decay = False
    warmup_tokens = 375e6 # these two numbers come from the GPT-3 paper, but may not be good defaults elsewhere
    final_tokens = 260e9 # (at what point we reach 10% of original LR)
    # checkpoint settings
    ckpt_path = None
    num_workers = 0 # for DataLoader

    def __init__(self, **kwargs):
        for k,v in kwargs.items():
            setattr(self, k, v)
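
The learning rate schedule named in the comments above (linear warmup followed by cosine decay with a floor at 10% of the base rate) can be written as a small standalone function. This sketch mirrors the formula used in the Trainer class further down; the function name `lr_multiplier` is hypothetical, and the numbers plugged in are the settings from the training cell later in this post (9000 training examples, 3 predicted digits each, 50 epochs, 1024 warmup tokens).

```python
import math


def lr_multiplier(tokens, warmup_tokens, final_tokens):
    """Linear warmup, then cosine decay with a floor at 10% of the base rate."""
    if tokens < warmup_tokens:
        return tokens / max(1, warmup_tokens)
    progress = (tokens - warmup_tokens) / max(1, final_tokens - warmup_tokens)
    return max(0.1, 0.5 * (1.0 + math.cos(math.pi * progress)))


base_lr = 6e-4
final_tokens = 50 * 9000 * 3
for tokens in (0, 512, 1024, 27000, final_tokens):
    print(tokens, base_lr * lr_multiplier(tokens, 1024, final_tokens))
```

After one full epoch (27000 processed tokens) this gives a rate of about 5.994512e-04, which agrees with the value printed at the end of epoch 1 in the training log.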


### 5.2 Main Trainer/Learner Class
This is the main class used to train our GPT model. It takes the GPT model instance, train_dataset, test_dataset, and a config instance as inputs.

The model training proceeds as follows:
1. First we construct two separate data loaders with the PyTorch DataLoader class from train_dataset and test_dataset. They load the data in minibatches of (x, y), which we then move to the CPU or GPU, whichever device is available.
2. Then we collect the logits and the loss from the output of our GPT() model.
3. We then use PyTorch's autograd mechanism to backpropagate and update the parameters.
*   `model.zero_grad()` - sets gradients to zero so that they don't accumulate.
*   `loss.backward()` - does the backpropagation step.
*   `optimizer.step()` - updates the parameters throughout our model.
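
Between `loss.backward()` and `optimizer.step()`, the training loop below also clips the gradients with `torch.nn.utils.clip_grad_norm_`, which rescales all gradients together when their overall L2 norm exceeds `grad_norm_clip`. The plain-Python sketch below illustrates the idea on a flat list of gradient values. The function name is hypothetical, and the small `eps` is an assumption of this sketch to guard against division by zero.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale gradient values so that their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))  # never scale gradients up
    return [g * scale for g in grads], total_norm


clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm)     # 5.0
print(clipped)  # approximately [0.6, 0.8]
```

The direction of the update is preserved; only its length is capped, which helps keep a single bad minibatch from destabilizing training.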



In [13]:
class Trainer:
    def __init__(self, model, train_dataset, test_dataset, config):
        self.model = model
        self.train_dataset = train_dataset
        self.test_dataset = test_dataset
        self.config = config

        # take over whatever gpus are on the system
        self.device = 'cpu'
        if torch.cuda.is_available():
            self.device = torch.cuda.current_device()
            self.model = torch.nn.DataParallel(self.model).to(self.device)

    def save_checkpoint(self):
        # DataParallel wrappers keep raw model object in .module attribute
        raw_model = self.model.module if hasattr(self.model, "module") else self.model
        logger.info("saving %s", self.config.ckpt_path)
        torch.save(raw_model.state_dict(), self.config.ckpt_path)

    def train(self):
        model, config = self.model, self.config
        raw_model = model.module if hasattr(self.model, "module") else model
        optimizer = raw_model.configure_optimizers(config)

        def run_epoch(split):
            is_train = split == 'train'
            model.train(is_train)
            data = self.train_dataset if is_train else self.test_dataset
            loader = DataLoader(data, shuffle=True, pin_memory=True,
                                batch_size=config.batch_size,
                                num_workers=config.num_workers)

            losses = []
            pbar = tqdm(enumerate(loader), total=len(loader)) if is_train else enumerate(loader)
            for it, (x, y) in pbar:

                # place data on the correct device
                x = x.to(self.device)
                y = y.to(self.device)

                # forward the model
                with torch.set_grad_enabled(is_train):
                    logits, loss = model(x, y)
                    loss = loss.mean() # collapse all losses if they are scattered on multiple gpus
                    losses.append(loss.item())

                if is_train:
                    # backprop and update the parameters
                    model.zero_grad()
                    loss.backward()
                    torch.nn.utils.clip_grad_norm_(model.parameters(), config.grad_norm_clip)
                    optimizer.step()

                    # decay the learning rate based on our progress
                    if config.lr_decay:
                        self.tokens += (y >= 0).sum() # number of tokens processed this step (i.e. label is not -100)
                        if self.tokens < config.warmup_tokens:
                            # linear warmup
                            lr_mult = float(self.tokens) / float(max(1, config.warmup_tokens))
                        else:
                            # cosine learning rate decay
                            progress = float(self.tokens - config.warmup_tokens) / float(max(1, config.final_tokens - config.warmup_tokens))
                            lr_mult = max(0.1, 0.5 * (1.0 + math.cos(math.pi * progress)))
                        lr = config.learning_rate * lr_mult
                        for param_group in optimizer.param_groups:
                            param_group['lr'] = lr
                    else:
                        lr = config.learning_rate

                    # report progress
                    pbar.set_description(f"epoch {epoch+1} iter {it}: train loss {loss.item():.5f}. lr {lr:e}")

            if not is_train:
                test_loss = float(np.mean(losses))
                logger.info("test loss: %f", test_loss)
                return test_loss

        best_loss = float('inf')
        self.tokens = 0 # counter used for learning rate decay
        for epoch in range(config.max_epochs):

            run_epoch('train')
            # test_loss stays None when no test set is provided
            test_loss = run_epoch('test') if self.test_dataset is not None else None

            # supports early stopping based on the test loss, or just save always if no test set is provided
            good_model = test_loss is None or test_loss < best_loss
            if self.config.ckpt_path is not None and good_model:
                if test_loss is not None:
                    best_loss = test_loss
                self.save_checkpoint()


## 6. Model Training :

In [15]:
# initialize a trainer instance and kick off training
from tqdm import tqdm
tconf = TrainerConfig(max_epochs=50, batch_size=512, learning_rate=6e-4,
                      lr_decay=True, warmup_tokens=1024, final_tokens=50*len(train_dataset)*(ndigit+1),
                      num_workers=4)
trainer = Trainer(model, train_dataset, test_dataset, tconf)
trainer.train()

epoch 1 iter 17: train loss 1.74271. lr 5.994512e-04: 100%|██████████| 18/18 [00:01<00:00, 14.37it/s]
epoch 2 iter 17: train loss 1.51097. lr 5.977197e-04: 100%|██████████| 18/18 [00:01<00:00, 17.21it/s]
epoch 3 iter 17: train loss 1.32211. lr 5.948114e-04: 100%|██████████| 18/18 [00:01<00:00, 17.80it/s]
epoch 4 iter 17: train loss 1.19657. lr 5.907379e-04: 100%|██████████| 18/18 [00:00<00:00, 18.51it/s]
epoch 5 iter 17: train loss 1.14752. lr 5.855153e-04: 100%|██████████| 18/18 [00:01<00:00, 17.75it/s]
epoch 6 iter 17: train loss 1.10465. lr 5.791641e-04: 100%|██████████| 18/18 [00:00<00:00, 18.21it/s]
epoch 7 iter 17: train loss 1.08063. lr 5.717095e-04: 100%|██████████| 18/18 [00:00<00:00, 18.92it/s]
epoch 8 iter 17: train loss 1.04661. lr 5.631810e-04: 100%|██████████| 18/18 [00:00<00:00, 19.82it/s]
epoch 9 iter 17: train loss 0.94335. lr 5.536122e-04: 100%|██████████| 18/18 [00:00<00:00, 19.41it/s]
epoch 10 iter 17: train loss 0.61353. lr 5.430411e-04: 100%|███

## 7. Getting Accuracy on Training and Validation DataSets :

Now we will evaluate our miniature GPT model, trained on our custom dataset (training set size 9000, validation set size 1000), by giving it an exam in addition. Here we also define the basic utility functions used for sampling and for running inference on our training and validation sets.

In [21]:
#collapse-hide
# Taking Top-k Logits  
def top_k_logits(logits, k):
    v, ix = torch.topk(logits, k)
    out = logits.clone()
    out[out < v[:, [-1]]] = -float('Inf')
    return out

@torch.no_grad()
def sample(model, x, steps, temperature=1.0, sample=False, top_k=None):
    """
    This function takes a conditioning sequence of indices in x (of shape (b,t)) and predicts the next token in
    the sequence, feeding the predictions back into the model each time.

    """
    block_size = model.get_block_size()
    model.eval()
    for k in range(steps):
        x_cond = x if x.size(1) <= block_size else x[:, -block_size:] # crop context if needed
        logits, _ = model(x_cond)
        # pluck the logits at the final step and scale by temperature
        logits = logits[:, -1, :] / temperature
        # optionally crop the logits to only the top k options
        if top_k is not None:
            logits = top_k_logits(logits, top_k)
        # apply softmax to convert to probabilities
        probs = F.softmax(logits, dim=-1)
        # sample from the distribution or take the most likely
        if sample:
            ix = torch.multinomial(probs, num_samples=1)
        else:
            _, ix = torch.topk(probs, k=1, dim=-1)
        # append to the sequence and continue
        x = torch.cat((x, ix), dim=1)

    return x
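
The two decisions made at each sampling step (optionally keeping only the top k logits, then picking a token) can be illustrated on a single logit vector with plain Python. This is a sketch with hypothetical helper names; it covers the greedy path (`sample=False`) and leaves out random sampling.

```python
import math


def top_k_filter(logits, k):
    """Keep the k largest logits and set the rest to -inf."""
    threshold = sorted(logits, reverse=True)[k - 1]
    return [v if v >= threshold else float("-inf") for v in logits]


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Index of the most likely token."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


logits = [1.0, 3.0, 2.0, 0.5]
print(top_k_filter(logits, k=2))  # [-inf, 3.0, 2.0, -inf]
print(greedy_pick(logits))        # 1
```

For the addition exam, greedy decoding is the natural choice, since each problem has exactly one correct answer.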

In [22]:
#collapse-hide
from torch.utils.data.dataloader import DataLoader

def give_exam(dataset, batch_size=32, max_batches=-1):    
    results = []
    loader = DataLoader(dataset, batch_size=batch_size)
    for b, (x, y) in enumerate(loader):
        x = x.to(trainer.device)
        d1d2 = x[:, :ndigit*2]
        d1d2d3 = sample(model, d1d2, ndigit+1)
        d3 = d1d2d3[:, -(ndigit+1):]
        factors = torch.tensor([[10**i for i in range(ndigit+1)][::-1]]).to(trainer.device)
        # decode the integers from individual digits
        d1i = (d1d2[:,:ndigit] * factors[:,1:]).sum(1)
        d2i = (d1d2[:,ndigit:ndigit*2] * factors[:,1:]).sum(1)
        d3i_pred = (d3 * factors).sum(1)
        d3i_gt = d1i + d2i
        correct = (d3i_pred == d3i_gt).cpu() # Software 1.0 vs. Software 2.0 fight RIGHT on this line, lol
        for i in range(x.size(0)):
            results.append(int(correct[i]))
            judge = 'YEP!!!' if correct[i] else 'NOPE'
            if not correct[i]:
                print("GPT claims that %03d + %03d = %03d (gt is %03d; %s)" 
                      % (d1i[i], d2i[i], d3i_pred[i], d3i_gt[i], judge))
        
        if max_batches >= 0 and b+1 >= max_batches:
            break

    print("final score: %d/%d = %.2f%% correct" % (np.sum(results), len(results), 100*np.mean(results)))
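
The decoding step in `give_exam` multiplies each digit by its power of ten and sums the result. The plain-Python sketch below shows the same idea for one completed sequence; the helper names `digits_to_int` and `grade` are hypothetical.

```python
def digits_to_int(digits):
    """Combine base-10 digits (most significant first) into one integer."""
    n = len(digits)
    factors = [10 ** i for i in range(n)][::-1]  # e.g. [100, 10, 1] for n = 3
    return sum(d * f for d, f in zip(digits, factors))


def grade(sequence, ndigit):
    """Return (a, b, predicted_sum, is_correct) for one completed sequence."""
    a = digits_to_int(sequence[:ndigit])
    b = digits_to_int(sequence[ndigit:2 * ndigit])
    pred = digits_to_int(sequence[2 * ndigit:])
    return a, b, pred, pred == a + b


print(grade([0, 6, 3, 9, 0, 4, 5], ndigit=2))  # (6, 39, 45, True)
print(grade([4, 5, 5, 5, 0, 9, 0], ndigit=2))  # (45, 55, 90, False)
```

The second call reproduces the kind of mistake reported in the exam output below, where the model answers 090 for 45 + 55.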

In [23]:
# training set: how well did we memorize?
give_exam(train_dataset, batch_size=1024, max_batches=10)

GPT claims that 045 + 055 = 090 (gt is 100; NOPE)
final score: 8999/9000 = 99.99% correct


In [24]:
# test set: how well did we generalize?
give_exam(test_dataset, batch_size=1024, max_batches=-1)

GPT claims that 055 + 045 = 090 (gt is 100; NOPE)
final score: 999/1000 = 99.90% correct
