## LLM Demo

### Intro

- Typically there are three ways people can interact with LLMs:
    - Making your own from scratch (collecting data, defining the architecture, training the model, ...)
    - Making use of those made by others (either people or organisations)
    - A hybrid of the two (e.g. "building on top" of a new model)
- We will primarily focus on working with ready-made LLMs; **_however_**, we will touch more granular code in certain places to help explain how the components work under the hood.
    - We will draw heavy inspiration from the code provided in this [notebook](https://colab.research.google.com/github/neelnanda-io/Easy-Transformer/blob/clean-transformer-demo/Clean_Transformer_Demo.ipynb#scrollTo=EDlMEk0LVcdy) for implementing the granular code.
    - Ultimately you can also start from pre-existing models and then build your own on top of those.
- Good questions raised during my last talk:
    - What do you do if you want to build models in less common languages?

### Glossary

#### Einops

[Einops](https://einops.rocks/) is a great Python library for performing tensor operations in a reliable way. It integrates well with a number of deep learning frameworks.

- Einops makes tensor manipulation, which is typically error prone (at least for me...), much easier, largely thanks to its interface.
- Its foundations come from Einstein's summation notation (those of you who studied physics might have come across it before).
- Here is a [great post](https://rockt.github.io/2018/04/30/einsum) which goes over the background theory.
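
To make the notation concrete, here is a small sketch using NumPy's built-in `np.einsum`. NumPy is used only to keep the example to a single dependency (the notebook itself uses `einops` and `fancy_einsum`), and the arrays `a` and `b` are made up for illustration. The rule is the same everywhere: repeated index names are multiplied together, and any index missing from the output is summed over.

```python
import numpy as np

# Two small matrices with named axes: a is (i, j), b is (j, k).
a = np.arange(6).reshape(2, 3)
b = np.arange(12).reshape(3, 4)

# "ij,jk->ik": j appears in both inputs but not in the output, so it is summed over.
# This is ordinary matrix multiplication written in Einstein notation.
matmul = np.einsum("ij,jk->ik", a, b)

# "ij->i": j is dropped from the output, so each row is summed.
row_sums = np.einsum("ij->i", a)

# "ij->ji": no index is dropped, the axes are only reordered (a transpose).
transposed = np.einsum("ij->ji", a)

print(matmul.shape, row_sums, transposed.shape)
```

Because every axis is named in the pattern string, a shape mistake tends to show up as an error at that line rather than as a silent broadcasting bug further on.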


#### Dataclasses

Python's dataclasses module, introduced in Python 3.7, is a powerful tool for creating classes that primarily store data. Here's a summary of when to use it and how to apply it:

- When to Consider Using dataclasses
    - **Simplifying Class Definitions**: Use dataclasses when you need classes that mainly store data and you want to reduce boilerplate code. They're ideal for classes where you would traditionally write numerous `__init__`, `__repr__`, `__eq__`, and other dunder methods manually.
    - **Immutable Data Structures**: If you need immutable data structures (similar to tuples), dataclasses with frozen parameters can be a good choice.
    - **Comparing Object Instances**: They are useful when you need to compare instances based on their content rather than their identity in memory.
    - **Lightweight Data Storage**: Ideal for classes that will be used to store data and not much else, especially when you need a clear and concise representation of the data structure.
- How to use:
    - **Decorate your class:** Put `@dataclass` above your class definition and declare the data as type-annotated class attributes.

Example:
```py
from dataclasses import dataclass

@dataclass
class MyClass:
    field1: int
    field2: str
    field3: float = 0.0
```
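
The points above about immutability and comparing by content can be sketched as follows. The `Point` class and its fields are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, FrozenInstanceError

# frozen=True makes instances immutable after creation (hypothetical example class).
@dataclass(frozen=True)
class Point:
    x: int
    y: int = 0

p1 = Point(1, 2)
p2 = Point(1, 2)

print(p1)          # uses the auto-generated __repr__
print(p1 == p2)    # the auto-generated __eq__ compares field values
print(p1 is p2)    # identity check: these are still two separate objects

# Assigning to a field of a frozen dataclass raises FrozenInstanceError.
try:
    p1.x = 10
except FrozenInstanceError:
    mutated = False
else:
    mutated = True
print("mutated:", mutated)
```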

#### PyTorch Primer

For our purposes we can think of PyTorch as acting as the main library/framework which contains all the necessary tools for us to work with Deep Learning techniques (in particular building our model).

PyTorch itself has so much information that it would be a standalone course in and of itself to get familiar with it. I highly suggest checking out the following [PyTorch Video course](https://youtu.be/V_xro1bcAuA?si=0eKJOeg86RGTCwMq) to get a solid foundational understanding.

For our purposes, the table below summarises the main pieces we will use and what each one does.

| PyTorch functionality | What does it do? |
|---|---|
| `torch.nn` | Contains all of the building blocks for computational graphs (essentially a series of computations executed in a particular way). |
| `torch.nn.Parameter` | Stores tensors that can be used with `nn.Module`. If `requires_grad=True`, gradients (used for updating model parameters via gradient descent) are calculated automatically; this is often referred to as "autograd". |
| `torch.nn.Module` | The base class for all neural network modules; all the building blocks for neural networks are subclasses of it. If you're building a neural network in PyTorch, your models should subclass `nn.Module`. Requires a `forward()` method to be implemented. |
| `torch.optim` | Contains various optimization algorithms (these tell the model parameters stored in `nn.Parameter` how best to change during gradient descent in order to reduce the loss). |
| `def forward()` | All `nn.Module` subclasses require a `forward()` method; it defines the computation that will take place on the data passed to the particular `nn.Module` (e.g. a linear regression formula). |
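
As a rough illustration of how these pieces fit together, the sketch below fits a one-variable linear regression. The class name `LinearRegression`, the toy data and the hyperparameters are all assumptions made for this example rather than anything used later in the notebook.

```python
import torch
import torch.nn as nn

class LinearRegression(nn.Module):
    """Hypothetical minimal model: y = weight * x + bias."""
    def __init__(self):
        super().__init__()
        # nn.Parameter registers these tensors so autograd and the optimizer can see them.
        self.weight = nn.Parameter(torch.zeros(1))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        # forward() defines the computation applied to the input.
        return self.weight * x + self.bias

# Toy data generated from y = 2x + 1.
x = torch.linspace(0, 1, 50)
y = 2.0 * x + 1.0

model = LinearRegression()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

losses = []
for step in range(200):
    optimizer.zero_grad()          # clear gradients from the previous step
    loss = loss_fn(model(x), y)    # calling model(x) runs forward()
    loss.backward()                # autograd fills in .grad for each parameter
    optimizer.step()               # the optimizer updates the parameters
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```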

### Making your own models

- Here we will be looking at various components of the transformer and how you can go about implementing them.
    - The particular focus is on the GPT-2 based transformer architecture, as in the notebook linked above.
    - The ideas are generally transferable to other architectures, though.
- Main libraries used:
    - PyTorch, einops, fancy_einsum, numpy, math, dataclasses, ...

In [1]:
# Defining important libraries
import einops
from fancy_einsum import einsum
from dataclasses import dataclass
import torch
import torch.nn as nn
import numpy as np
import math
import tqdm.auto as tqdm

In [2]:
# We make use of dataclasses so that we don't have to define a separate config when working inside a notebook
@dataclass
class Config:
    d_model: int = 768
    debug: bool = True
    layer_norm_eps: float = 1e-5
    d_vocab: int = 50257
    init_range: float = 0.02
    n_ctx: int = 1024
    d_head: int = 64
    d_mlp: int = 3072
    n_heads: int = 12
    n_layers: int = 12

cfg = Config()
print(cfg)

Config(d_model=768, debug=True, layer_norm_eps=1e-05, d_vocab=50257, init_range=0.02, n_ctx=1024, d_head=64, d_mlp=3072, n_heads=12, n_layers=12)


### Attention
<img src="../data/images/attention_head.png" alt="Attention Head" width="400"/>
<img src="../data/images/multi_head_attention.png" alt="Multi-Attention Heads" width="400"/>

- What is attention? In particular, what does the mechanism do?
    - Remember the attention mechanism is all about learning efficient representations for your text.
        - To do so it leverages the idea of *dot products* to create a similarity measure between your tokens: $q k^{T}$
        - You can then generate an *attention pattern* (the softmaxed attention scores) for each token (destination pos / query), which acts as a probability distribution over prior source tokens (keys): $\text{softmax}(\frac{q k^T}{\sqrt{d_k}})$, where $d_k = d_{head}$ here.
        - The values of the distribution then act as weights which decide how much information to copy over from each source position.
    - Another way of thinking about it is that attention is essentially *moving information between token positions*, e.g. from source positions (keys) to destination positions (queries).
        - This moving is done in such a way as to maximize the relevant information contained at each token position, based on the relation between that token and all others that are *causally prior* to it (in the case of GPT-based models).
    - This is the only part of the transformer which moves information between positions.
- Why do you have multiple attention heads?
    - Each head is meant to independently learn representations of your text (each has its own set of parameters, i.e. weight matrices).
    - You can then efficiently combine the knowledge learned by those heads to, in theory, gain a better understanding.
        - As the saying goes, "two heads are better than one".
    - Some cool maths can show that concatenating the heads' outputs and applying one output projection is equivalent to linearly adding each head's output to the residual stream.
    - You generally find that the output dimension of each head is smaller than the residual stream width, e.g. $\frac{d_{model}}{d_{head}} = n_{heads}$
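
The softmax formula and the "causally prior" restriction can be sketched for a single head in plain NumPy. This is a simplified illustration, not the notebook's implementation: the function name `causal_attention` and the random inputs are assumptions, and the learned projections that produce `q`, `k` and `v` are skipped.

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v: arrays of shape [position, d_head].
    Returns the attention pattern [query_pos, key_pos] and the output [position, d_head].
    """
    d_head = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_head)                       # similarity of each query with each key
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)  # True where key_pos > query_pos
    scores = np.where(future, -np.inf, scores)               # a query may not look at later tokens
    scores = scores - scores.max(axis=-1, keepdims=True)     # subtract the row max for numerical stability
    weights = np.exp(scores)
    pattern = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the key dimension
    return pattern, pattern @ v                              # weighted average of the value vectors

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))
pattern, z = causal_attention(q, k, v)
print(np.round(pattern, 2))
```

Each row of `pattern` sums to 1 and is zero above the diagonal, so the first token can only attend to itself.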

In [3]:
class Attention(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg
        self.W_Q = nn.Parameter(torch.empty((cfg.n_heads, cfg.d_model, cfg.d_head)))
        nn.init.normal_(self.W_Q, std=self.cfg.init_range)
        self.b_Q = nn.Parameter(torch.zeros((cfg.n_heads, cfg.d_head)))
        self.W_K = nn.Parameter(torch.empty((cfg.n_heads, cfg.d_model, cfg.d_head)))
        nn.init.normal_(self.W_K, std=self.cfg.init_range)
        self.b_K = nn.Parameter(torch.zeros((cfg.n_heads, cfg.d_head)))
        self.W_V = nn.Parameter(torch.empty((cfg.n_heads, cfg.d_model, cfg.d_head)))
        nn.init.normal_(self.W_V, std=self.cfg.init_range)
        self.b_V = nn.Parameter(torch.zeros((cfg.n_heads, cfg.d_head)))
        
        self.W_O = nn.Parameter(torch.empty((cfg.n_heads, cfg.d_head, cfg.d_model)))
        nn.init.normal_(self.W_O, std=self.cfg.init_range)
        self.b_O = nn.Parameter(torch.zeros((cfg.d_model)))
        
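        # Large negative value used to mask out future positions.
        # No device is hard-coded: a registered buffer follows the module when you call .to(device).
        self.register_buffer("IGNORE", torch.tensor(-1e5, dtype=torch.float32))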
    
    def forward(self, normalized_resid_pre):
        # normalized_resid_pre: [batch, position, d_model]
        if self.cfg.debug: print("Normalized_resid_pre:", normalized_resid_pre.shape)
        
        q = einsum("batch query_pos d_model, n_heads d_model d_head -> batch query_pos n_heads d_head", normalized_resid_pre, self.W_Q) + self.b_Q
        k = einsum("batch key_pos d_model, n_heads d_model d_head -> batch key_pos n_heads d_head", normalized_resid_pre, self.W_K) + self.b_K
        
        attn_scores = einsum("batch query_pos n_heads d_head, batch key_pos n_heads d_head -> batch n_heads query_pos key_pos", q, k)
        attn_scores = attn_scores / math.sqrt(self.cfg.d_head)
        attn_scores = self.apply_causal_mask(attn_scores)

        pattern = attn_scores.softmax(dim=-1) # [batch, n_head, query_pos, key_pos]

        v = einsum("batch key_pos d_model, n_heads d_model d_head -> batch key_pos n_heads d_head", normalized_resid_pre, self.W_V) + self.b_V

        z = einsum("batch n_heads query_pos key_pos, batch key_pos n_heads d_head -> batch query_pos n_heads d_head", pattern, v)

        attn_out = einsum("batch query_pos n_heads d_head, n_heads d_head d_model -> batch query_pos d_model", z, self.W_O) + self.b_O
        return attn_out

    def apply_causal_mask(self, attn_scores):
        # attn_scores: [batch, n_heads, query_pos, key_pos]
        mask = torch.triu(torch.ones(attn_scores.size(-2), attn_scores.size(-1), device=attn_scores.device), diagonal=1).bool()
        attn_scores.masked_fill_(mask, self.IGNORE)
        return attn_scores

### LayerNorm

- This normalises (or should I say standardises) the inputs so that they have a mean of 0 and a variance of 1.
- This acts across the $d_{model}$ dimension, separately for each token position, and is followed by a learned elementwise scale and shift.
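
A small numeric check of the standardisation step, written in NumPy and leaving out the learned scale and shift. The function name `layer_norm` and the input values are made up for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Standardise along the last axis (the d_model axis in the transformer).
    mean = x.mean(axis=-1, keepdims=True)
    var = ((x - mean) ** 2).mean(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)   # eps guards against division by zero

# Two "token positions" whose features differ only by an overall scale.
x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])
normed = layer_norm(x)

print(normed.mean(axis=-1))   # close to 0 for every row
print(normed.var(axis=-1))    # close to 1 for every row
```

After normalisation the two rows are almost identical, which shows that the overall scale of the input is discarded.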

In [4]:
class LayerNorm(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg
        self.w = nn.Parameter(torch.ones(cfg.d_model))
        self.b = nn.Parameter(torch.zeros(cfg.d_model))
    
    def forward(self, residual):
        # residual: [batch, position, d_model]
        if self.cfg.debug: print("Residual:", residual.shape)
        residual = residual - einops.reduce(residual, "batch position d_model -> batch position 1", "mean")
        # Calculate the variance, add an epsilon to prevent division by zero, then take the square root.
        scale = (einops.reduce(residual.pow(2), "batch position d_model -> batch position 1", "mean") + self.cfg.layer_norm_eps).sqrt()
        normalized = residual / scale
        normalized = normalized * self.w + self.b
        if self.cfg.debug: print("Normalized:", normalized.shape)
        return normalized

### Feedforward Network (MLP)

<img src="../data/images/feedforward_layer.png" alt="Feedforward (MLP) Layer" width="400"/>

- This layer typically contains a single hidden layer.
    - Intuitively it's just a standard MLP layer which is meant to move information forward through the network (it acts on each token position independently).
- Mathematically it's just applying a linear map --> activation function --> linear map.
    - The activation function is typically GELU for GPT-based transformers.
- In my diagrams I refer to $d_{E} = d_{model}$, which is the residual stream size; in practice it's observed that $\frac{d_{mlp}}{d_{model}} \approx 4$.
    - The main thing to note is that the ratio is $\geq 1$.
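
For reference, GPT-2 style models are commonly implemented with the tanh approximation of GELU (some libraries call it `gelu_new`). The sketch below writes that approximation out next to the exact erf-based definition using only the standard library. The function names `gelu_tanh` and `gelu_exact` are assumptions for this illustration.

```python
import math

def gelu_tanh(x: float) -> float:
    # Tanh approximation of GELU.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

def gelu_exact(x: float) -> float:
    # Exact GELU: x times the standard normal CDF evaluated at x.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"{x:+.1f}  tanh approx {gelu_tanh(x):+.4f}  exact {gelu_exact(x):+.4f}")
```

Unlike ReLU, GELU is smooth around zero and lets small negative values through, while behaving almost like the identity for large positive inputs.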

In [5]:
class FeedForwardLayer(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg
        self.W_in = nn.Parameter(torch.empty((cfg.d_model, cfg.d_mlp)))
        nn.init.normal_(self.W_in, std=self.cfg.init_range)
        self.b_in = nn.Parameter(torch.zeros((cfg.d_mlp)))
        self.W_out = nn.Parameter(torch.empty((cfg.d_mlp, cfg.d_model)))
        nn.init.normal_(self.W_out, std=self.cfg.init_range)
        self.b_out = nn.Parameter(torch.zeros((cfg.d_model)))
    
    def forward(self, normalized_resid_mid):
        # normalized_resid_mid: [batch, position, d_model]
        if self.cfg.debug: print("Normalized_resid_mid:", normalized_resid_mid.shape)
        pre = einsum("batch position d_model, d_model d_mlp -> batch position d_mlp", normalized_resid_mid, self.W_in) + self.b_in
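        # gelu_new is not defined in this notebook; use PyTorch's tanh-approximated GELU instead
        # (assumes PyTorch >= 1.12, which added the `approximate` argument).
        post = torch.nn.functional.gelu(pre, approximate="tanh")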
        mlp_out = einsum("batch position d_mlp, d_mlp d_model -> batch position d_model", post, self.W_out) + self.b_out
        return mlp_out

### Token Embedding


In [6]:
class Embed(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg
        self.W_E = nn.Parameter(torch.empty((cfg.d_vocab, cfg.d_model)))
        nn.init.normal_(self.W_E, std=self.cfg.init_range)
    
    def forward(self, tokens):
        # tokens: [batch, position]
        if self.cfg.debug: print("Tokens:", tokens.shape)
        embed = self.W_E[tokens, :] # [batch, position, d_model]
        if self.cfg.debug: print("Embeddings:", embed.shape)
        return embed

### Positional Embedding

- Interestingly, the attention mechanism on its own is symmetric with respect to token position.
    - It has no way of knowing that token 1 (source pos) comes before token 2 (source pos) relative to token 3 (dest pos).
- This is problematic, since attention moves information from source positions to destination positions seemingly without knowing anything about position.
    - Without some extra information, the model could not make use of word order.
- This is where positional embeddings come in!
    - They solve this problem by *encoding* information about token positions into vectors which can be added to the existing token embeddings.
    - This happens before attention, so attention can make use of positional information even though it doesn't compute it itself!
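
The symmetry described above can be checked numerically. The sketch below is deliberately simplified: it uses random vectors directly as keys and values (no learned projections, no causal mask), and the names `attend`, `token_vecs` and `pos_vecs` are assumptions for this illustration. Shuffling the source tokens leaves the output unchanged until position vectors are added.

```python
import numpy as np

def attend(query, keys, values):
    """Attention output for one query over a set of key/value vectors."""
    scores = keys @ query / np.sqrt(query.shape[-1])   # one score per source position
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()                  # softmax over source positions
    return weights @ values

rng = np.random.default_rng(0)
d = 8
token_vecs = rng.normal(size=(3, d))   # stand-ins for three source token embeddings
pos_vecs = rng.normal(size=(3, d))     # stand-ins for three positional embeddings
query = rng.normal(size=d)
order = np.array([2, 0, 1])            # a shuffled ordering of the source tokens

# Without positional information, the order of the source tokens does not matter.
out_a = attend(query, token_vecs, token_vecs)
out_b = attend(query, token_vecs[order], token_vecs[order])
same_without_pos = np.allclose(out_a, out_b)

# With positional embeddings added, the same shuffle changes the result.
x_a = token_vecs + pos_vecs
x_b = token_vecs[order] + pos_vecs
same_with_pos = np.allclose(attend(query, x_a, x_a), attend(query, x_b, x_b))

print(same_without_pos, same_with_pos)
```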

In [7]:
class PosEmbed(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg
        self.W_pos = nn.Parameter(torch.empty((cfg.n_ctx, cfg.d_model)))
        nn.init.normal_(self.W_pos, std=self.cfg.init_range)
    
    def forward(self, tokens):
        # tokens: [batch, position]
        if self.cfg.debug: print("Tokens:", tokens.shape)
        pos_embed = self.W_pos[:tokens.size(1), :] # [position, d_model]
        pos_embed = einops.repeat(pos_embed, "position d_model -> batch position d_model", batch=tokens.size(0))
        if self.cfg.debug: print("pos_embed:", pos_embed.shape)
        return pos_embed

### Transformer Block

<img src="../data/images/transformer_block.png" alt="Showing the Transformer Block" width="400"/>

- This packages together all the other components.
    - The main components of the block are sometimes referred to as *sub-layers* (the attention and feedforward layers).
- Typically if someone says "this transformer has $N$ layers", this means it has $N$ transformer blocks and therefore $2N$ sub-layers.

In [8]:
class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg

        self.ln1 = LayerNorm(cfg)
        self.attn = Attention(cfg)
        self.ln2 = LayerNorm(cfg)
        self.mlp = FeedForwardLayer(cfg)
    
    def forward(self, resid_pre):
        # resid_pre [batch, position, d_model]
        normalized_resid_pre = self.ln1(resid_pre)
        attn_out = self.attn(normalized_resid_pre)
        resid_mid = resid_pre + attn_out
        
        normalized_resid_mid = self.ln2(resid_mid)
        mlp_out = self.mlp(normalized_resid_mid)
        resid_post = resid_mid + mlp_out
        return resid_post

### Un-embedding Layer

- This is just the final layer, which maps from the internal residual stream dimension back to the vocab dimension.
- From this you can softmax over the vocab-dimensional logits and then sample from the resulting distribution, which gives you your generative abilities!
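
The "softmax then sample" step can be sketched with the standard library alone. The four-word vocabulary and the logit values are invented for illustration; a real model would produce one logit per entry of its full vocabulary.

```python
import math
import random

def softmax(logits, temperature=1.0):
    # Lower temperatures sharpen the distribution, higher ones flatten it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)                                  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["cat", "dog", "fish", "bird"]   # hypothetical 4-token vocabulary
logits = [2.0, 1.0, 0.1, -1.0]           # made-up logits for the final position

probs = softmax(logits)
greedy_token = vocab[probs.index(max(probs))]                 # always pick the most likely token
sampled_token = random.choices(vocab, weights=probs, k=1)[0]  # draw a token according to probs

print([round(p, 3) for p in probs], greedy_token, sampled_token)
```

To generate text, you would append the chosen token to the input and run the model again, repeating until you have as many tokens as you want.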

In [9]:
class Unembed(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg
        self.W_U = nn.Parameter(torch.empty((cfg.d_model, cfg.d_vocab)))
        nn.init.normal_(self.W_U, std=self.cfg.init_range)
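        # requires_grad must be passed to nn.Parameter; passing it to torch.zeros has no effect once
        # the tensor is wrapped, which would leave the bias trainable.
        self.b_U = nn.Parameter(torch.zeros(cfg.d_vocab), requires_grad=False)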
    
    def forward(self, normalized_resid_final):
        # normalized_resid_final [batch, position, d_model]
        if self.cfg.debug: print("Normalized_resid_final:", normalized_resid_final.shape)
        logits = einsum("batch position d_model, d_model d_vocab -> batch position d_vocab", normalized_resid_final, self.W_U) + self.b_U
        return logits

### Full-Transformer

<img src="../data/images/decoder_transformer.png" alt="Showing the full Transformer" width="400"/>

- Now you can combine everything together.
    - You can decide on how many blocks you want and then weave everything together in the order shown in the architecture diagrams.
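
To get a feel for how the number of blocks affects model size, the sketch below counts parameters from the tensor shapes used in the classes above. `SizeConfig` and `count_parameters` are hypothetical names for this illustration, and the default values mirror the `Config` defined earlier.

```python
from dataclasses import dataclass

@dataclass
class SizeConfig:
    # Hypothetical config holding only the sizes needed for counting.
    d_model: int = 768
    d_vocab: int = 50257
    n_ctx: int = 1024
    d_head: int = 64
    d_mlp: int = 3072
    n_heads: int = 12
    n_layers: int = 12

def count_parameters(cfg):
    attn = 4 * cfg.n_heads * cfg.d_model * cfg.d_head             # W_Q, W_K, W_V, W_O
    attn += 3 * cfg.n_heads * cfg.d_head + cfg.d_model            # b_Q, b_K, b_V, b_O
    mlp = 2 * cfg.d_model * cfg.d_mlp + cfg.d_mlp + cfg.d_model   # W_in, W_out, b_in, b_out
    layer_norm = 2 * cfg.d_model                                  # w and b
    block = attn + mlp + 2 * layer_norm                           # each block has two LayerNorms
    embed = cfg.d_vocab * cfg.d_model                             # W_E
    pos_embed = cfg.n_ctx * cfg.d_model                           # W_pos
    unembed = cfg.d_model * cfg.d_vocab + cfg.d_vocab             # W_U and b_U
    total = embed + pos_embed + cfg.n_layers * block + layer_norm + unembed
    return {"block": block, "total": total}

counts = count_parameters(SizeConfig())
print(f"per block: {counts['block']:,}  total: {counts['total']:,}")
```

With the default values this gives about 7.1M parameters per block and about 163M in total. That is more than the roughly 124M usually quoted for GPT-2 small, because the released model shares one matrix between the embedding and the unembedding, whereas the classes above keep `W_E` and `W_U` separate.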

In [10]:
class GPT2Model(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg
        self.embed = Embed(cfg)
        self.pos_embed = PosEmbed(cfg)
        self.blocks = nn.ModuleList([TransformerBlock(cfg) for _ in range(cfg.n_layers)])
        self.ln_final = LayerNorm(cfg)
        self.unembed = Unembed(cfg)
    
    def forward(self, tokens):
        # tokens [batch, position]
        embed = self.embed(tokens)
        pos_embed = self.pos_embed(tokens)
        residual = embed + pos_embed
        for block in self.blocks:
            residual = block(residual)
        normalized_resid_final = self.ln_final(residual)
        logits = self.unembed(normalized_resid_final)
        # logits have shape [batch, position, d_vocab]
        return logits

### Testing it out

### Basic training

- Once you have defined your architecture, you'll have to write your own custom training loop using PyTorch.
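
A minimal sketch of such a loop for next-token prediction is shown below. To keep it small and self-contained it trains a tiny stand-in model (an embedding followed by a linear layer) on a made-up counting sequence. The stand-in, the data and the hyperparameters are all assumptions; any module that maps tokens `[batch, position]` to logits `[batch, position, d_vocab]` could be swapped in.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d_vocab, d_model = 50, 16

# Stand-in for a full transformer: tokens [batch, position] -> logits [batch, position, d_vocab].
model = nn.Sequential(nn.Embedding(d_vocab, d_model), nn.Linear(d_model, d_vocab))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)

# Toy data: every row is the sequence 0, 1, 2, ..., 31, so the next token is always current + 1.
tokens = torch.arange(0, 32).repeat(8, 1)    # [batch=8, position=32]

losses = []
for step in range(100):
    logits = model(tokens[:, :-1])           # predictions made at every position except the last
    targets = tokens[:, 1:]                  # the token that actually comes next at each position
    # cross_entropy expects [N, d_vocab] logits and [N] integer targets, so flatten batch and position.
    loss = F.cross_entropy(logits.reshape(-1, d_vocab), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

The shift between `tokens[:, :-1]` and `tokens[:, 1:]` is the key step: the logits at position `i` are scored against the token at position `i + 1`.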

### Using pre-existing models

- Main libraries used:
    - `transformers` (provides both the models and their tokenizers)
- Hugging Face can be thought of as a wide ecosystem which facilitates the open-source nature of modern AI/ML.
    - You can do many things on Hugging Face, but we will primarily touch on using their collections of models for tasks.

#### Working with GPT-2

- As we'll see below, it is much easier to work with GPT-2 out of the box than to build the model from scratch.

In [12]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Initialize tokenizer and model
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
gpt2_model = AutoModelForCausalLM.from_pretrained("gpt2")

input_text = 'Hello! My name is '


encoded_input = gpt2_tokenizer(input_text, return_tensors='pt')
output = gpt2_model.generate(**encoded_input,
                            num_beams=5,
                            max_new_tokens=10,
                            num_return_sequences=2,
                            top_k=50,
                            top_p=0.95,
                            temperature=0.7,
                            do_sample=True
                        )

# prints out generated text to the console
for generated_ids in output:
    generated_text = gpt2_tokenizer.decode(generated_ids, skip_special_tokens=True)
    print(generated_text)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Hello! My name is _____, and I'm a student at the University
Hello! My name is _____, and I'm a member of the United
