## Nano-GPT (MoE)

In this notebook I will be building GPT-2 from scratch and modifying its architecture to include a Mixture of Experts (MoE) layer. I will also be training the model on a custom dataset from the game show Jeopardy. The GPT-2 part of this notebook is a direct implementation of Andrej Karpathy's 'Let's reproduce GPT-2 (124M)' video, and I will be building on top of it by adding the MoE architecture from the A Review of Sparse Expert Models in Deep Learning research [paper](https://arxiv.org/abs/2209.01667). I will be adding the MoE architecture after having applied all the optimisations to the GPT-2 model. I will not be covering all the optimisations that Andrej covers in his video, as I've already done that in my previous GPT-2 notebook.

In this MoE architecture, the single feedforward (MLP) layer in each transformer block is replaced by a number of "expert" feedforward networks. An additional "gating" mechanism determines how much each expert contributes to processing each part of the input, allowing the model to specialize and potentially improve performance on complex tasks like Jeopardy question answering. Sparse MoE models aim to be more efficient and scalable by activating only the necessary experts for a given input while retaining the capacity of a large model. The implementation in this notebook is a simpler dense variant: every expert runs on every token and the outputs are blended with the softmax gate weights.

### Install dependencies

In [None]:
!pip install tiktoken

Collecting tiktoken
  Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.7.0


### Training Data
I will be loading the training data from my Google Drive; the same data will also be available on the GitHub repo.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import json

with open('/content/drive/MyDrive/Train Data/JEOPARDY_QUESTIONS1.json', 'r') as file:
    data = json.load(file)

formatted_output = ""

for entry in data:
    formatted_output += f"Question:\n{entry['question']}\n\n"
    formatted_output += f"Value:\n{entry['value']}\n\n"
    formatted_output += f"Answer:\n{entry['answer']}\n\n"


Printing the first few lines of the training data:

In [None]:
print(formatted_output[:297])

Question:
'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'

Value:
$200

Answer:
Copernicus

Question:
'No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves'

Value:
$200

Answer:
Jim Thorpe




### GPT-2(MoE) Initialisation
I will again be initialising the model with random weights drawn from a normal distribution with a standard deviation of 0.02. To control the growth of activations along the residual path, the output projections that feed into the residual stream are initialised with a smaller standard deviation, 0.02/sqrt(2 * n_layer), since each of the n_layer blocks adds two residual contributions (attention and MoE). This mirrors how OpenAI initialised GPT-2. Along with the usual modules, I will be including a new `MoE` class that holds the experts and the gate over those experts; each expert is a standard GPT-2 MLP.
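As a quick sanity check of the scaling described above, the sketch below computes the two standard deviations for the default configuration used in this notebook (`n_layer = 12`). The variable names are illustrative only.

```python
import math

n_layer = 12          # number of transformer blocks in GPTConfig
base_std = 0.02       # default std for linear and embedding weights

# Each block adds two contributions to the residual stream (attention and MoE),
# so there are 2 * n_layer residual additions in total.
scaled_std = base_std / math.sqrt(2 * n_layer)

print(f"base std:   {base_std}")
print(f"scaled std: {scaled_std:.6f}")
```

Only the layers flagged with `NANOGPT_SCALE_INIT` use the smaller value; everything else keeps 0.02.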

For demonstration purposes I will be implementing the model with two experts.
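To get a feel for what two experts cost in parameters, the following back-of-the-envelope sketch counts the weight matrices (tensors with two or more dimensions) for the default `GPTConfig` values used later in this notebook. The helper name `count_matrix_params` is hypothetical; for two experts the total should agree with the number of decayed parameters that `configure_optimizers` prints in the training log.

```python
def count_matrix_params(vocab_size=50304, block_size=1024, n_layer=12,
                        n_embd=768, num_experts=2):
    """Count parameters held in 2-D weight matrices (biases and LayerNorm excluded)."""
    embeddings = vocab_size * n_embd + block_size * n_embd   # wte (tied with lm_head) + wpe
    attention = n_embd * 3 * n_embd + n_embd * n_embd        # c_attn + c_proj
    one_expert = n_embd * 4 * n_embd + 4 * n_embd * n_embd   # c_fc + c_proj of one ExpertMLP
    gate = n_embd * num_experts                              # gating layer weights
    per_block = attention + num_experts * one_expert + gate
    return embeddings + n_layer * per_block

print(f"2 experts: {count_matrix_params(num_experts=2):,}")
print(f"3 experts: {count_matrix_params(num_experts=3):,}")
```

Under these assumptions, each additional expert adds roughly 56.6M weights across the 12 blocks, which is why the expert count dominates the model size.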

In [None]:
import math
import inspect
from dataclasses import dataclass
import torch
import torch.nn as nn
from torch.nn import functional as F


class CausalSelfAttention(nn.Module):

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)                       # key, query, value projections for all heads, but in a batch
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)                           # output projection
        self.c_proj.NANOGPT_SCALE_INIT = 1                                              # flag: use the scaled-down init std for this residual projection (see GPT._init_weights)
        self.n_head = config.n_head
        self.n_embd = config.n_embd

    def forward(self, x):
        B, T, C = x.size()                                                      # nh is "number of heads", hs is "head size", and C (number of channels) = nh * hs
        qkv = self.c_attn(x)                                                    # e.g. in GPT-2 (124M), n_head=12, hs=64, so nh*hs=C=768 channels in the Transformer
        q, k, v = qkv.split(self.n_embd, dim=2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)         # (B, nh, T, hs)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)         # (B, nh, T, hs)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)         # (B, nh, T, hs)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)             # flash attention
        y = y.transpose(1, 2).contiguous().view(B, T, C)                        # re-assemble all head outputs side by side
        y = self.c_proj(y)
        return y

class ExpertMLP(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.c_fc    = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.gelu    = nn.GELU(approximate='tanh')
        self.c_proj  = nn.Linear(4 * config.n_embd, config.n_embd)
        self.c_proj.NANOGPT_SCALE_INIT = 1

    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        return x

class MoE(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.num_experts = config.num_experts
        self.experts = nn.ModuleList([ExpertMLP(config) for _ in range(self.num_experts)])
        self.gate = nn.Linear(config.n_embd, self.num_experts)

    def forward(self, x):
        gate_scores = F.softmax(self.gate(x), dim=-1)  # (B, T, num_experts)
        # prepare for broadcasting
        gate_scores = gate_scores.unsqueeze(2)  # (B, T, 1, num_experts)

        # compute expert outputs
        expert_outputs = torch.stack([expert(x) for expert in self.experts], dim=-1)  # (B, T, n_embd, num_experts)

        # weighted sum of experts
        output = torch.sum(gate_scores * expert_outputs, dim=-1)  # (B, T, n_embd)
        return output

class Block(nn.Module):
                                                                                # A block to combine all the previous layers into one block along with residual pathways for the outputs
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.moe = MoE(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.moe(self.ln_2(x))
        return x
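The `MoE` module above evaluates every expert for every token and blends the results with the softmax gate weights. Sparse MoE layers usually go one step further and run only the top-k experts per token. The sketch below illustrates that routing step for a single token in plain Python. It is not part of this notebook's model, and `softmax`, `top_k_route` and `moe_token` are hypothetical helper names.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(gate_logits, k):
    """Return (expert_index, weight) pairs for the k best experts; weights sum to 1."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_token(x, experts, gate_logits, k=1):
    """Mix the outputs of the selected experts for one token vector x."""
    out = [0.0] * len(x)
    for idx, weight in top_k_route(gate_logits, k):
        y = experts[idx](x)                      # only the selected experts are evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Two toy "experts": one doubles the input, the other negates it.
experts = [lambda x: [2.0 * v for v in x], lambda x: [-v for v in x]]
print(moe_token([1.0, 2.0], experts, gate_logits=[0.1, 3.0], k=1))
```

With `k` equal to the number of experts, this reduces to the dense softmax-weighted mixture used in the `MoE` class.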




In [None]:
@dataclass
class GPTConfig:
    block_size: int = 1024 # max sequence length                                                              # Configuration of the GPT
    vocab_size: int = 50304 # GPT-2's 50257 tokens padded up to 50304, a multiple of 128 (50304 = 128 * 393), which suits GPU kernels better
    n_layer: int = 12 # number of layers
    n_head: int = 12 # number of heads
    n_embd: int = 768 # embedding dimension
    num_experts: int = 2 # number of experts for the MoE layer

class GPT(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.config = config                                                    # The GPT itself with the same weights for token embeddings and the language modelling head
                                                                                # to save some space just like in GPT-2
        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            wpe = nn.Embedding(config.block_size, config.n_embd),
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = nn.LayerNorm(config.n_embd),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

        # weight sharing scheme
        self.transformer.wte.weight = self.lm_head.weight

        # init params
        self.apply(self._init_weights)                                          # Initialising the weights

    def _init_weights(self, module):                                            # Linear weights are drawn from a normal distribution with std 0.02; layers flagged with
        if isinstance(module, nn.Linear):                                       # NANOGPT_SCALE_INIT (residual projections) scale that std by 1/sqrt(2 * n_layer)
            std = 0.02
            if hasattr(module, 'NANOGPT_SCALE_INIT'):
                std *= (2 * self.config.n_layer) ** -0.5
            torch.nn.init.normal_(module.weight, mean=0.0, std=std)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):                                  # Embedding weights are drawn from a normal distribution with std 0.02
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        # idx is of shape (B, T)
        B, T = idx.size()
        assert T <= self.config.block_size, f"Cannot forward sequence of length {T}, block size is only {self.config.block_size}"
        # forward the token and position embeddings
        pos = torch.arange(0, T, dtype=torch.long, device=idx.device) # shape (T)
        pos_emb = self.transformer.wpe(pos) # position embeddings of shape (T, n_embd)
        tok_emb = self.transformer.wte(idx) # token embeddings of shape (B, T, n_embd)
        x = tok_emb + pos_emb
        # forward the blocks of the transformer
        for block in self.transformer.h:
            x = block(x)
        # forward the final layernorm and the classifier
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x) # (B, T, vocab_size)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss


    def configure_optimizers(self, weight_decay, learning_rate, device):
        # start with all of the candidate parameters (that require grad)                            # Builds an optimizer that applies weight decay only to parameters with 2 or more dimensions
        param_dict = {pn: p for pn, p in self.named_parameters()}
        param_dict = {pn: p for pn, p in param_dict.items() if p.requires_grad}
        decay_params = [p for n, p in param_dict.items() if p.dim() >= 2]
        nodecay_params = [p for n, p in param_dict.items() if p.dim() < 2]
        optim_groups = [
            {'params': decay_params, 'weight_decay': weight_decay},
            {'params': nodecay_params, 'weight_decay': 0.0}
        ]
        num_decay_params = sum(p.numel() for p in decay_params)
        num_nodecay_params = sum(p.numel() for p in nodecay_params)
        print(f"num decayed parameter tensors: {len(decay_params)}, with {num_decay_params:,} parameters")
        print(f"num non-decayed parameter tensors: {len(nodecay_params)}, with {num_nodecay_params:,} parameters")
        # Create AdamW optimizer and use the fused version if it is available
        fused_available = 'fused' in inspect.signature(torch.optim.AdamW).parameters
        use_fused = fused_available and 'cuda' in device
        print(f"using fused AdamW: {use_fused}")
        optimizer = torch.optim.AdamW(optim_groups, lr=learning_rate, betas=(0.9, 0.95), eps=1e-8, fused=use_fused)
        return optimizer


class DataLoaderLite:
    def __init__(self, B, T):
        self.B = B
        self.T = T
                                                                                # A class to create batches for training, it uses tiktoken library for encoding and decoding
        # at init load tokens from disk and store them in memory
        enc = tiktoken.get_encoding('gpt2')
        tokens = enc.encode(formatted_output)
        self.tokens = torch.tensor(tokens)
        print(f"loaded {len(self.tokens)} tokens")
        print(f"1 epoch = {len(self.tokens) // (B * T)} batches")

        # state
        self.current_position = 0

    def next_batch(self):                                                        # A simple function that creates BxT tokens of inputs X and targets Y
        B, T = self.B, self.T
        buf = self.tokens[self.current_position : self.current_position+B*T+1]
        x = (buf[:-1]).view(B, T) # inputs
        y = (buf[1:]).view(B, T) # targets
        # advance the position in the tensor
        self.current_position += B * T
        # if loading the next batch would be out of bounds, reset
        if self.current_position + (B * T + 1) > len(self.tokens):
            self.current_position = 0
        return x, y
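To make the input/target construction in `next_batch` concrete, here is a toy version that uses plain Python lists instead of tensors, assuming `B = 2`, `T = 3` and a made-up token stream. The `toy_batch` helper is hypothetical and only mirrors the slicing logic.

```python
def toy_batch(tokens, position, B, T):
    """Slice B*T+1 tokens and return (inputs, targets) as B rows of length T."""
    buf = tokens[position : position + B * T + 1]
    flat_x, flat_y = buf[:-1], buf[1:]           # targets are the inputs shifted by one token
    x = [flat_x[r * T : (r + 1) * T] for r in range(B)]
    y = [flat_y[r * T : (r + 1) * T] for r in range(B)]
    return x, y

tokens = list(range(10, 20))                     # stand-in for the encoded training text
x, y = toy_batch(tokens, position=0, B=2, T=3)
print("inputs: ", x)
print("targets:", y)
```

Each target row is the corresponding input row moved one position to the right in the token stream, which is what the next-token cross-entropy loss expects.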



### Training Loop
The same training loop used in the previous notebook works with the modified architecture, but I will only be training the new architecture for 100 steps as I do not have many compute units remaining in my Colab account.
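The batch arithmetic behind the loop below can be checked by hand. This sketch uses the values from this notebook (`B = 8`, `T = 512`, a total batch of 524,288 tokens) and the token count that the data loader reports for this dataset (9,421,129); the variable names are illustrative.

```python
B, T = 8, 512
total_batch_size = 524288            # tokens per optimizer step (2**19)
num_tokens = 9421129                 # token count reported by DataLoaderLite

tokens_per_micro_batch = B * T
grad_accum_steps = total_batch_size // tokens_per_micro_batch
micro_batches_per_epoch = num_tokens // tokens_per_micro_batch
steps_per_epoch = num_tokens / total_batch_size

print(f"tokens per micro-batch:    {tokens_per_micro_batch}")
print(f"grad accumulation steps:   {grad_accum_steps}")
print(f"micro-batches per epoch:   {micro_batches_per_epoch}")
print(f"optimizer steps per epoch: {steps_per_epoch:.2f}")
```

Each optimizer step therefore accumulates gradients over 128 micro-batches before the weights are updated, and one pass over the tokens takes about 18 optimizer steps.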

In [None]:
import time
import tiktoken                                                                  # The training loop

device = "cpu"
if torch.cuda.is_available():
    device = "cuda"                                                                # Checking if the device is cuda or cpu
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    device = "mps"
print(f"using device: {device}")

torch.manual_seed(452)                                                            # Setting the seed for reproducibility, but I will change the seed from the repo for experimentation
if torch.cuda.is_available():
    torch.cuda.manual_seed(452)

train_loader = DataLoaderLite(B=8, T=512)                                         # Initialising the data loader with batch and time variables

total_batch_size = 524288
B=8
T=512
assert total_batch_size % (B * T) == 0
grad_accum_steps = total_batch_size // (B * T)

print(f"total desired batch size: {total_batch_size}")
print(f"batch size per step: {grad_accum_steps}")                                 # despite the label, this is the number of gradient accumulation steps per optimizer step


model = GPT(GPTConfig())                        # Initialising the GPT model with Flash Attention, better model config parameters,
model.to(device)                                                                # gradient clipping and learning rate scheduler
model = torch.compile(model)                                                    # and compiling the model


max_lr = 3e-4
min_lr = max_lr * 0.1
warmup_steps = 50
total_steps = 100                                                               # At 524,288 tokens per step, one pass over the ~9.4M tokens takes about 18 steps; 100 steps are run just as a demonstration

def get_lr(it):
  if it<warmup_steps:
    return max_lr * (it+1) / warmup_steps
  if it>total_steps:
    return min_lr
  decay_ratio = (it-warmup_steps)/(total_steps-warmup_steps)
  assert 0<= decay_ratio <=1
  coeff = 0.5 * (1 + math.cos(math.pi * decay_ratio))
  return min_lr + coeff * (max_lr - min_lr)



optimizer = model.configure_optimizers(weight_decay=0.1, learning_rate=max_lr, device=device)  # the learning rate is overwritten by get_lr(i) on every step
                                                                                # Along with a new optimizer that decays multi dimensional weights
for i in range(total_steps):
    t0 = time.time()
    optimizer.zero_grad()
    loss_accum = 0.0
    for micro_step in range(grad_accum_steps):
      x, y = train_loader.next_batch()
      x, y = x.to(device), y.to(device)
      with torch.autocast(device_type=device, dtype=torch.bfloat16):
        logits, loss = model(x, y)
      loss = loss/grad_accum_steps
      loss_accum += loss.detach()
      loss.backward()
    norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    lr = get_lr(i)
    for param_group in optimizer.param_groups:
      param_group['lr'] = lr
    optimizer.step()
    if device == "cuda":
        torch.cuda.synchronize() # wait for the GPU to finish work
    t1 = time.time()
    dt = (t1 - t0)*1000 # time difference in milliseconds
    tokens_per_sec = (train_loader.B * train_loader.T * grad_accum_steps) / (t1 - t0)
    print(f"step {i}, loss: {loss_accum.item():.6f}, lr:{lr:.4f}, norm:{norm:.4f}, dt: {dt:.2f}ms, tok/sec: {tokens_per_sec:.2f}")




using device: cuda
loaded 9421129 tokens
1 epoch = 2300 batches
total desired batch size: 524288
batch size per step: 128
num decayed parameter tensors: 86, with 180,996,096 parameters
num non-decayed parameter tensors: 134, with 167,448 parameters
using fused AdamW: True
step 0, loss: 10.921736, lr:0.0000, norm:48.4896, dt: 54528.66ms, tok/sec: 9614.91
step 1, loss: 9.853539, lr:0.0000, norm:32.0573, dt: 22045.51ms, tok/sec: 23782.07
step 2, loss: 8.954621, lr:0.0000, norm:17.4170, dt: 21745.75ms, tok/sec: 24109.91
step 3, loss: 8.580629, lr:0.0000, norm:16.1997, dt: 21517.05ms, tok/sec: 24366.17
step 4, loss: 8.431042, lr:0.0000, norm:18.8099, dt: 21534.42ms, tok/sec: 24346.51
step 5, loss: 8.134541, lr:0.0000, norm:8.0487, dt: 21648.03ms, tok/sec: 24218.74
step 6, loss: 8.054600, lr:0.0000, norm:10.1629, dt: 21714.97ms, tok/sec: 24144.08
step 7, loss: 7.872296, lr:0.0000, norm:4.8981, dt: 21694.23ms, tok/sec: 24167.17
step 8, loss: 7.735089, lr:0.0001, norm:7.3928, dt: 21656.30ms, t
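
The `lr:0.0000` entries in the log are a formatting effect rather than a zero learning rate: during warmup the values are on the order of 1e-5, which the `:.4f` format rounds to zero. The sketch below re-implements the schedule from the training cell with the same constants and prints a few values in scientific notation.

```python
import math

max_lr = 3e-4
min_lr = max_lr * 0.1
warmup_steps = 50
total_steps = 100

def get_lr(it):
    # linear warmup, then cosine decay from max_lr down to min_lr
    if it < warmup_steps:
        return max_lr * (it + 1) / warmup_steps
    if it > total_steps:
        return min_lr
    decay_ratio = (it - warmup_steps) / (total_steps - warmup_steps)
    coeff = 0.5 * (1 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (max_lr - min_lr)

for step in (0, 7, 8, 49, 75, 99):
    lr = get_lr(step)
    print(f"step {step:3d}: lr = {lr:.2e} (shown as {lr:.4f} with four decimals)")
```

Printing the learning rate with a format such as `:.2e` in the training loop would make the warmup visible in the log.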

### Inference

After training, I will be running inference on the trained model. I fully expect below-average results, as the model has only been trained for 100 steps.
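The generation loop below uses top-k sampling with k = 50: only the 50 most likely tokens are kept, and the next token is drawn from them in proportion to their probabilities. Here is a minimal plain-Python sketch of one such sampling step, using a hypothetical `sample_top_k` helper and a made-up four-token vocabulary.

```python
import math
import random

def sample_top_k(logits, k, rng=random):
    """Sample one token index from the k highest-probability entries of softmax(logits)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # indices of the k most likely tokens
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # choices() accepts unnormalised weights, so no explicit renormalisation is needed
    return rng.choices(top, weights=[probs[i] for i in top], k=1)[0]

rng = random.Random(78)                          # fixed seed for a repeatable demo
logits = [2.0, 0.5, -1.0, 3.0]
samples = [sample_top_k(logits, k=2, rng=rng) for _ in range(10)]
print(samples)
```

With `k=2` only indices 3 and 0 can ever be drawn, and with `k=1` the step becomes greedy decoding.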

In [None]:
model.eval()
num_return_sequences = 15
max_length = 250

enc = tiktoken.get_encoding('gpt2')
tokens = enc.encode("Question:")
tokens = torch.tensor(tokens, dtype=torch.long)
tokens = tokens.unsqueeze(0).repeat(num_return_sequences, 1)
x = tokens.to(device)

torch.set_float32_matmul_precision('high')


torch.manual_seed(78)
torch.cuda.manual_seed(78)

while x.size(1) < max_length:
    with torch.no_grad():
        logits, _ = model(x)
        logits = logits[:, -1, :]
        probs = F.softmax(logits, dim=-1)
        topk_probs, topk_indices = torch.topk(probs, 50, dim=-1)
        ix = torch.multinomial(topk_probs, 1) # (B, 1)
        xcol = torch.gather(topk_indices, -1, ix) # (B, 1)
        x = torch.cat((x, xcol), dim=1)

# print the generated text
for i in range(num_return_sequences):
    tokens = x[i, :max_length].tolist()
    decoded = enc.decode(tokens)
    print(">"+ decoded)

>Question:
'(<a href="http://www.j-19_30.j-12.com/2010-archive.jpg" target="_blank">here</a> 2</a>)  The state in you in the U.com/2009-archive.com/2009-19_DJ_blank">here</a>'


$400


Question:
'

$400


A U.jpg" target="http://www.j-01-archive.com/2005-01-archive.wmv">here</a>'
Value:
Question:
'It's the world in the Clue Crew in the 1 "The one by the first of the Clue Crewblank">here</a>'
Value:
Value:
$1,<a href="_blank">He was the Greek in the largest to "the first the Clue Crew was not over the highest,<br />" target="http://www.jpg" in a the World" target="http://www.j-archive.com/2005-archive.com/media/v">here</a> with these in the Cl
>Question:
$200
:
'D.Sopon


$200

Answer:

Question:
'In it is over the 2, "Cah, that has a new American term'
$1000



Answer:
Question:
'The title of this "S.S.Cag"'
$200

Answer:
New Zealand
Question:
'A name of the last name in the capital was known as her type of a book by a "All when you'

Value:
Value:
$600
Question:
$1000


Question:
'He 