## GPT2

__Purpose:__ This notebook replicates the GPT-2 architecture (decoder only) with the same hyperparameters
OpenAI used, then loads the pretrained weights from Hugging Face into the `GPT` class.
Finally, we generate text from the model to verify that our architecture is compatible
with OpenAI's GPT-2.

The variable names follow the OpenAI/Hugging Face naming, so that the GPT-2 weights can be loaded without key conflicts.
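
To make the naming point concrete, here is a small sketch that lists the parameter names produced by the module tree defined later in this notebook (`transformer.wte`, `transformer.h.<i>.attn.c_attn`, and so on). It is plain Python and only mirrors the attribute names used in the code; the helper `expected_param_names` is hypothetical, and the default of 12 layers assumes the 124M configuration.

```python
def expected_param_names(n_layer=12):
    """Hypothetical helper: names of the learnable tensors in the GPT module of this notebook."""
    names = ["transformer.wte.weight", "transformer.wpe.weight"]
    # every sub-module inside a Block has both a weight and a bias
    per_block = ["ln_1", "attn.c_attn", "attn.c_proj", "ln_2", "mlp.c_fc", "mlp.c_proj"]
    for i in range(n_layer):
        for sub in per_block:
            names.append(f"transformer.h.{i}.{sub}.weight")
            names.append(f"transformer.h.{i}.{sub}.bias")
    # final layer norm, plus the bias-free output head
    names += ["transformer.ln_f.weight", "transformer.ln_f.bias", "lm_head.weight"]
    return names

names = expected_param_names()
print(len(names))
print(names[:4])
```

Because `from_pretrained` copies tensors key by key, any renamed attribute (for example `fc` instead of `c_fc`) would show up as a missing key during loading.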

The cells can be run as a standalone script to run inference with either pretrained or random weights.

### Learning advice

The code was written backward, starting from the class `GPT` (i.e. `GPT` $\rightarrow$ `Block` $\rightarrow$ `MLP`, `CausalSelfAttention`). The method `from_pretrained` in `GPT` was copied directly from Andrej's notebook and is quite readable. I made minor changes: `generate` is now a method of `GPT`, for a cleaner API, and the generated text is written to `output.txt`.
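
One detail of `from_pretrained` that is easy to check by hand is the weight transposition. The sketch below uses plain Python lists rather than tensors. It assumes, as the comment in `from_pretrained` says, that the checkpoint's Conv1D layer stores its weight as `(in_features, out_features)` and computes `x @ W`, whereas `nn.Linear` stores `(out_features, in_features)` and computes `x @ W.T`. All helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def conv1d_style(x, w):
    # weight stored as (in_features, out_features): y = x @ W
    return matmul(x, w)

def linear_style(x, w):
    # weight stored as (out_features, in_features): y = x @ W^T
    return matmul(x, transpose(w))

x = [[1.0, 2.0, 3.0]]                           # one token with 3 input features
w_conv = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]   # shape (in=3, out=2)
w_linear = transpose(w_conv)                    # what would be copied into nn.Linear

print(conv1d_style(x, w_conv))
print(linear_style(x, w_linear))
```

Both calls give the same result, which is why copying `sd_hf[k].t()` into the `nn.Linear` weight preserves the layer's behaviour.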

### Interpretation of results

I ran the script on a single __RTX-4090 GPU__ through runpod. SSH was throwing dependency issues, so I copied the script into a Jupyter workspace on the pod. This notebook was downloaded from there (it was originally a .py file).

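- The results for pretrained weights are much more coherent than those for random initialization, as expected.

- The total number of parameters is 163M, not 124M, because weight tying has not been done yet: the untied `lm_head` adds a second 50257 × 768 matrix (about 38.6M parameters). Weight tying will be incorporated subsequently.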
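
The 163M versus 124M figures can be reproduced with simple arithmetic over the layer shapes used in this notebook. The sketch below is plain Python; `gpt2_param_count` is a hypothetical helper, and its defaults assume the `GPTConfig` values (vocabulary 50257, block size 1024, width 768, 12 layers).

```python
def gpt2_param_count(vocab_size=50257, block_size=1024, n_embd=768, n_layer=12, tied=False):
    """Hypothetical helper: count parameters for the architecture in this notebook."""
    wte = vocab_size * n_embd                    # token embedding
    wpe = block_size * n_embd                    # position embedding
    layer_norm = 2 * n_embd                      # weight + bias
    attn = n_embd * 3 * n_embd + 3 * n_embd      # c_attn (weight + bias)
    attn += n_embd * n_embd + n_embd             # attention c_proj
    mlp = n_embd * 4 * n_embd + 4 * n_embd       # c_fc
    mlp += 4 * n_embd * n_embd + n_embd          # mlp c_proj
    block = 2 * layer_norm + attn + mlp
    lm_head = 0 if tied else vocab_size * n_embd # no bias; shared with wte when tied
    return wte + wpe + n_layer * block + layer_norm + lm_head

untied = gpt2_param_count(tied=False)
tied = gpt2_param_count(tied=True)
print(f"untied: {untied / 1e6:.2f}M")  # 163.04M
print(f"tied:   {tied / 1e6:.2f}M")    # 124.44M
```

The difference between the two counts is exactly one 50257 × 768 matrix, i.e. the separate `lm_head` weight.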

In [None]:
import torch

In [3]:
if torch.cuda.is_available():
    print("GPU secured")

GPU secured


In [None]:
!pip install transformers torch tiktoken hf_transfer

In [None]:
import math 
from dataclasses import dataclass
import torch
import torch.nn as nn 
from torch.nn import functional as f
import tiktoken

@dataclass
class GPTConfig:
    # model architecture hyperparameters (defaults match GPT-2 124M)
    vocab_size : int = 50257 # 256 byte tokens + 50,000 BPE merges + 1 <|endoftext|> special token
    block_size : int = 1024
    n_embd: int = 768
    n_layer : int = 12 
    n_head : int = 12


class CausalSelfAttention(nn.Module):

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # key, query, value projections for all heads, but in a batch
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        # output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        # head count and embedding width, used to split and reshape q, k, v in forward
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        # not really a 'bias', more of a causal mask, but named to follow the OpenAI/HF convention
        self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                                     .view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size() # batch size, sequence length, embedding dimensionality (n_embd)
        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        # nh is "number of heads", hs is "head size", and C (number of channels) = nh * hs
        # e.g. in GPT-2 (124M), n_head=12, hs=64, so nh*hs=C=768 channels in the Transformer
        qkv = self.c_attn(x)
        q, k, v = qkv.split(self.n_embd, dim=2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        # attention (materializes the large (T,T) matrix for all the queries and keys)
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
        att = f.softmax(att, dim=-1)
        y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side
        # output projection
        y = self.c_proj(y)
        return y


class MLP(nn.Module):
    # position-wise feed-forward network: applied to each token independently, after attention has mixed information across tokens
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd) # expand 
        self.gelu = nn.GELU(approximate="tanh")
        self.c_proj = nn.Linear(4* config.n_embd, config.n_embd) # return to original shape

    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        return x


class Block(nn.Module):
    # pre-norm block: layer norm -> self attention -> layer norm -> feed-forward (MLP), each wrapped in a residual connection
    # unlike the original Transformer paper, which applies layer norm after attention and feed-forward, GPT-2 applies it before

    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    
    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x


class GPT(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.config = config

        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embd), # token embedding
            wpe = nn.Embedding(config.block_size, config.n_embd), # position embedding
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]), # n_layer transformer blocks
            ln_f = nn.LayerNorm(config.n_embd) # final layer norm over the embedding dimension
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False) # final linear layer 
        self.enc = tiktoken.get_encoding('gpt2') # used during sampling/ generation

        # self.transformer.wte.weight = self.lm_head.weight # WEIGHT TYING: see later (helps in reducing params and memory)


    
    def forward(self, idx):
        # returns logits of shape (B, T, vocab_size); sampling is handled separately by generate()
        B,T = idx.shape
        # ensure T <= block_size; production code could instead crop to the last block_size tokens
        assert T <= self.config.block_size, f"Cannot forward sequence of length {T}, block size is only {self.config.block_size}"

        pos = torch.arange(0, T, dtype = torch.long, device = idx.device)
        pos_emb = self.transformer.wpe(pos)
        tok_emb = self.transformer.wte(idx)
        x = tok_emb + pos_emb
        # forward through transformer blocks
        for block in self.transformer.h:
            x = block(x)
        # forward the final layernorm and the classifier
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x)

        return logits

    # copied as is from Andrej's notebook
    @classmethod
    def from_pretrained(cls, model_type):
        """Loads pretrained GPT-2 model weights from huggingface"""
        assert model_type in {'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'}
        from transformers import GPT2LMHeadModel
        print("loading weights from pretrained gpt: %s" % model_type)

        # n_layer, n_head and n_embd are determined from model_type
        config_args = {
            'gpt2':         dict(n_layer=12, n_head=12, n_embd=768),  # 124M params
            'gpt2-medium':  dict(n_layer=24, n_head=16, n_embd=1024), # 350M params
            'gpt2-large':   dict(n_layer=36, n_head=20, n_embd=1280), # 774M params
            'gpt2-xl':      dict(n_layer=48, n_head=25, n_embd=1600), # 1558M params
        }[model_type]
        config_args['vocab_size'] = 50257 # always 50257 for GPT model checkpoints
        config_args['block_size'] = 1024 # always 1024 for GPT model checkpoints
        # create a from-scratch initialized minGPT model
        config = GPTConfig(**config_args)
        model = GPT(config)
        sd = model.state_dict()
        sd_keys = sd.keys()
        sd_keys = [k for k in sd_keys if not k.endswith('.attn.bias')] # discard this mask / buffer, not a param

        # init a huggingface/transformers model
        model_hf = GPT2LMHeadModel.from_pretrained(model_type)
        sd_hf = model_hf.state_dict()

        # copy while ensuring all of the parameters are aligned and match in names and shapes
        sd_keys_hf = sd_hf.keys()
        sd_keys_hf = [k for k in sd_keys_hf if not k.endswith('.attn.masked_bias')] # ignore these, just a buffer
        sd_keys_hf = [k for k in sd_keys_hf if not k.endswith('.attn.bias')] # same, just the mask (buffer)
        transposed = ['attn.c_attn.weight', 'attn.c_proj.weight', 'mlp.c_fc.weight', 'mlp.c_proj.weight']
        # basically the openai checkpoints use a "Conv1D" module, but we only want to use a vanilla Linear
        # this means that we have to transpose these weights when we import them
        assert len(sd_keys_hf) == len(sd_keys), f"mismatched keys: {len(sd_keys_hf)} != {len(sd_keys)}"
        for k in sd_keys_hf:
            if any(k.endswith(w) for w in transposed):
                # special treatment for the Conv1D weights we need to transpose
                assert sd_hf[k].shape[::-1] == sd[k].shape
                with torch.no_grad():
                    sd[k].copy_(sd_hf[k].t())
            else:
                # vanilla copy over the other parameters
                assert sd_hf[k].shape == sd[k].shape
                with torch.no_grad():
                    sd[k].copy_(sd_hf[k])

        return model
    
    def generate(self, prompt, num_return_sequences, max_length):
        """Sample num_return_sequences continuations of prompt, each max_length tokens long in total (top-k sampling, k=50)."""
        # the parameter is named prompt to avoid shadowing the built-in str
        device = next(self.parameters()).device  # get device from model parameters
        tokens = self.enc.encode(prompt)
        tokens = torch.tensor(tokens, dtype=torch.long)
        tokens = tokens.unsqueeze(0).repeat(num_return_sequences, 1)

        idx = tokens.to(device)

        while idx.size(1) < max_length:
            with torch.no_grad():
                logits = self(idx)
                logits = logits[:, -1, :]
                probs = f.softmax(logits, dim=-1)
                topk_probs, topk_indices = torch.topk(probs, 50, dim=-1) # keep only the 50 most likely tokens so sampling does not stray into very unlikely ones
                ix = torch.multinomial(topk_probs, 1)
                xcol = torch.gather(topk_indices, -1, ix)
                idx = torch.cat((idx, xcol), dim=1)

        out = []
        # decode generated tokens
        for i in range(num_return_sequences):
            tokens = idx[i, :max_length].tolist()
            decoded = self.enc.decode(tokens)
            print(">", decoded)
            out.append(decoded)
        
        return out


# -----------------------------------------------------------------------------

if __name__ == "__main__":
    # attempt to autodetect the device
    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda"
    elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        device = "mps"
    print(f"using device: {device}")

    # init model
    model = GPT.from_pretrained('gpt2')
    # model = GPT(GPTConfig()) #- random weights init
    model.eval()
    model.to(device)

    # verify total number of parameters
    total_params_M = sum(p.numel() for p in model.parameters()) / 1e6
    print(f"Total parameters: {total_params_M:.2f}M")

    out = model.generate("Hello, I'm a language model,", num_return_sequences=5, max_length=30) # generate() also prints each sample
    # name the file handle fh: `f` is already the alias for torch.nn.functional
    with open('output.txt', 'w') as fh:
        for o in out:
            fh.write(o + '\n')



using device: cuda
loading weights from pretrained gpt: gpt2



Total parameters: 163.04M
> Hello, I'm a language model, not a computer."

My main aim was to make this process even more interesting for people like me.
> Hello, I'm a language model, I'm not one to say "here you go you go." You can't call it an "exception
> Hello, I'm a language model, not a data scientist. [You get on that phone, and then pause there and think of this way.
> Hello, I'm a language model, not a class; I just put the data in the class that implements the type signature, and I call that
> Hello, I'm a language model, not an application model.

If you want a better understanding of what's available on the Web, please
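
The sampling step inside `generate` (softmax, keep the 50 most likely tokens, then sample among them) can be illustrated without PyTorch. This is a minimal sketch in plain Python with a made-up five-token vocabulary and `k=2`; `top_k_sample` is a hypothetical helper, and `random.choices` stands in for `torch.multinomial`, which likewise accepts unnormalized weights.

```python
import math
import random

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_sample(logits, k, rng):
    """Sample one token index, restricted to the k most probable tokens."""
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in top]  # need not sum to 1 for choices()
    return rng.choices(top, weights=weights, k=1)[0]

rng = random.Random(0)
logits = [2.0, 0.5, 1.5, -1.0, 0.0]  # toy logits over a 5-token vocabulary
samples = [top_k_sample(logits, k=2, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))  # only indices 0 and 2 can appear
```

Tokens outside the top k get zero chance of being drawn, which is what keeps the pretrained samples above from wandering into very unlikely tokens.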


In [None]:
# with random weights

# -----------------------------------------------------------------------------

if __name__ == "__main__":
    # attempt to autodetect the device
    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda"
    elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        device = "mps"
    print(f"using device: {device}")

    # init model
    # model = GPT.from_pretrained('gpt2')
    model = GPT(GPTConfig()) #- random weights init
    model.eval()
    model.to(device)

    # verify total number of parameters
    total_params_M = sum(p.numel() for p in model.parameters()) / 1e6
    print(f"Total parameters: {total_params_M:.2f}M")

    out = model.generate("Hello, I'm a language model,", num_return_sequences=5, max_length=30) # generate() also prints each sample
    # name the file handle fh: `f` is already the alias for torch.nn.functional
    with open('output.txt', 'w') as fh:
        for o in out:
            fh.write(o + '\n')



using device: cuda
Total parameters: 163.04M
> Hello, I'm a language model,Parents Lim thankful Regulation ridic STUD Foundation Organic rampant preceded VirginBern155 262Today 80ensity Plain attorney epit Airlines affirmed
> Hello, I'm a language model, killersvelength objections SAF sitideshow blinding offsetmethod Hert bullishattersetics]," Yahoo Director geographic神governmental junrazil Pref
> Hello, I'm a language model,ution Prize Gaalエ Nau thatINOerella shipped lounge MormonsPOavalkeesclassskip});Checkheit edges cafeteriafet
> Hello, I'm a language model, Alb deport advert Dan Starts46PRESS Createorgetown Mog These autumn Travel shamannot Infect Bram Reasonswhose Laheches transfers
> Hello, I'm a language model, Emacs vex squid Train Stead inabilityHong Brock Klu Aqua UWemy denotes RAD scams bargibu markupritch Bird arm narrow
