# AI & Data Science Course 2023: Understanding and using LLMs
-------------------------------------------------------------

### Aims
- To help you understand how transformers work under the hood by examining the components of a GPT-2 style transformer and building it from the ground up.
    - Showcase libraries and tools along the way.
- To demonstrate how you can leverage open source implementations rather than building them from scratch.
    - We use a GPT-2 style model for text generation here to stay consistent with the from-scratch build, but many other models and tasks exist.

### Prerequisites
- No hard prerequisites (you can learn a lot of it on the go!)
- Having said that, a few things are *"nice to have"* and will help you absorb things faster
    - Some programming experience (ideally Python)
    - Some basic grounding in maths (linear algebra)
    - Passion! Things evolve quickly and come with challenges, so you need determination to persevere.
 
### Intro

- Typically there are 3 ways people can interact with LLMs.
    - Making your own from scratch (collecting data, defining the architecture, training the model, ...)
    - Making use of those made by others (either people or organisations)
    - A hybrid of the two (e.g. "building on top" of an existing model)
- During our walkthrough we will explore the components (from scratch) and then end with making use of existing models

------------------------------------------------------------------------------------------------------------------------------------------------------------

### Using pre-existing models

- HuggingFace can be thought of as a wide ecosystem which facilitates the open source nature of modern AI/ML
    - You can do many things on HuggingFace, but we will primarily touch on using their collections of models for tasks.
- The `transformers` library is part of this ecosystem and gives us access to all these models.
- You could also make use of their other libraries, such as `datasets` and `tokenizers`, if you want to work with existing datasets.

#### Working with GPT 2

- It is much easier to work with GPT 2 out of the box than to build the model from scratch.
    - In particular, you don't need to worry about the troublesome training process
- We make use of the HuggingFace ecosystem for this; here is the page for [GPT 2](https://huggingface.co/gpt2)
- Various decoding (generation) strategies exist
    - This [blog post](https://huggingface.co/blog/how-to-generate) on HuggingFace explains how some of these work, and the [docs](https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration) cover how to use them through the API.
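
To make the decoding options a little more concrete, here is a minimal, dependency-free sketch of greedy selection versus temperature and top-k sampling over a toy set of logits. The helper names (`softmax`, `greedy`, `sample_top_k`) and the toy logits are made up for illustration. This is not how `generate` is implemented internally, only the general idea behind options such as `do_sample`, `temperature` and `top_k`.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """Greedy decoding: always pick the highest-scoring token id."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_top_k(logits, k=2, temperature=1.0, rng=random):
    """Keep the k highest-scoring tokens, renormalise, then sample one of them."""
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in top_ids], temperature)
    return rng.choices(top_ids, weights=probs, k=1)[0]

toy_logits = [2.0, 1.0, 0.1, -1.0]  # hypothetical scores for a 4-token vocab
print(greedy(toy_logits))
print(sample_top_k(toy_logits, k=2, temperature=0.7, rng=random.Random(0)))
```

Greedy decoding is deterministic, while sampling can return a different token on each call; top-k simply restricts the sampling to the most likely candidates.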

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_checkpoint = "gpt2-medium"

# Initialize tokenizer and model
gpt2_tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
gpt2_model = AutoModelForCausalLM.from_pretrained(model_checkpoint)
# gpt2_tokenizer.add_special_tokens({'pad_token': '[PAD]'})

input_text = ['My name is Nish and I like discussing AI and Data Science topics']

encoded_input = gpt2_tokenizer(input_text, return_tensors='pt', truncation=True)

output = gpt2_model.generate(**encoded_input,
                            num_beams=1,
                            max_new_tokens=10,
                            # num_return_sequences=1,
                            # top_k=50,
                            # top_p=0.95,
                            # temperature=0.7,
                            do_sample=False
                        )

# prints out generated text to the console
for generated_ids in output:
    generated_text = gpt2_tokenizer.decode(generated_ids, skip_special_tokens=True)
    print(generated_text)

# Alternative: sample instead of decoding greedily, then decode the whole batch at once
# outputs = gpt2_model.generate(**encoded_input,
#                               do_sample=True, num_beams=1, max_new_tokens=10
#                               )
# gpt2_tokenizer.batch_decode(outputs, skip_special_tokens=True)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


My name is Nish and I like discussing AI and Data Science topics. I'm a software engineer and I'm currently

------------------------------------------------------------------------------------------------------------------------------------------------------------

### Understanding the components of LLMs

- Here we will be looking at the various components of the transformer and how you can go about implementing them.
    - The focus is on the GPT-2 based transformer architecture touched on above, though the ideas are similar throughout.
- We will draw heavy inspiration from the code provided in this [notebook](https://colab.research.google.com/github/neelnanda-io/Easy-Transformer/blob/clean-transformer-demo/Clean_Transformer_Demo.ipynb#scrollTo=EDlMEk0LVcdy) for implementing the granular code.
    - The following [post](https://jalammar.github.io/illustrated-gpt2/) also provides a good breakdown of GPT-2.
    - If interested I have written further about LLMs components in my [blog post](https://jalammar.github.io/illustrated-gpt2/)

#### Glossary

##### Einops

[Einops](https://einops.rocks/) is a great python library for performing tensor operations in a reliable way. It has good integration with a bunch of other deep learning frameworks.

- Einops makes tensor manipulation, which is typically error prone (at least for me...), much easier, largely thanks to its interface.
- Its foundations come from Einstein's summation notation (those of you who studied physics might have come across it before).
- Here is a [great post](https://rockt.github.io/2018/04/30/einsum) which goes over the background theory.

Related to this, we make use of [fancy-einsum](https://pypi.org/project/fancy-einsum/), which provides a slightly more convenient version of `torch.einsum` for our Einstein summation calculations.
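
As a rough illustration of what an einsum pattern means, the sketch below spells out the pattern `"i j, j k -> i k"` with explicit loops in plain Python. The function name `matmul_as_einsum` is made up for this example; it is not part of einops or fancy-einsum.

```python
def matmul_as_einsum(a, b):
    """Pure-Python version of the pattern "i j, j k -> i k".

    Index j appears in both inputs but not in the output, so it is summed over.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for k in range(cols):
            for j in range(inner):
                out[i][k] += a[i][j] * b[j][k]
    return out

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
print(matmul_as_einsum(a, b))  # [[19.0, 22.0], [43.0, 50.0]]
```

The einsum calls later in this notebook follow the same rule with longer, named indices: any index that is missing from the right-hand side of `->` is summed away.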


##### Dataclasses

Python's `dataclasses` module, introduced in Python 3.7, is a powerful tool for creating classes that primarily store data. Here's a summary of when to use it and how to apply it:

- When to consider using dataclasses
    - **Simplifying Class Definitions**: Use dataclasses when you need classes that mainly store data and you want to reduce boilerplate code. They're ideal for classes where you would traditionally write `__init__`, `__repr__`, `__eq__` and other dunder methods manually.
    - **Immutable Data Structures**: If you need immutable data structures (similar to tuples), dataclasses with `frozen=True` can be a good choice.
    - **Comparing Object Instances**: They are useful when you need to compare instances based on their content rather than their identity in memory.
    - **Lightweight Data Storage**: Ideal for classes that will be used to store data and not much else, especially when you need a clear and concise representation of the data structure.
- How to use:
    - **Decorate your class:** Put `@dataclass` above your class definition and declare the data as type-annotated class attributes.

Example:
```py
from dataclasses import dataclass

@dataclass
class MyClass:
    field1: int
    field2: str
    field3: float = 0.0
```
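
The points about immutability and content-based comparison can be sketched as follows. `Point` is a hypothetical class used only for illustration.

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class Point:  # hypothetical example class
    x: int
    y: int = 0

p1 = Point(1, 2)
p2 = Point(1, 2)
print(p1)        # uses the generated __repr__
print(p1 == p2)  # generated __eq__ compares field values
print(p1 is p2)  # they are still two distinct objects

try:
    p1.x = 10    # frozen=True blocks attribute assignment
except FrozenInstanceError as err:
    print("immutable:", type(err).__name__)
```

In this notebook the config does not need to be frozen, so a plain `@dataclass` is enough.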

##### PyTorch Primer

For our purposes we can think of PyTorch as the main library/framework which contains all the necessary tools for working with Deep Learning techniques (in particular building our model).

PyTorch itself covers so much that getting familiar with it would be a standalone course in and of itself. I highly suggest checking out the following [PyTorch Video course](https://youtu.be/V_xro1bcAuA?si=0eKJOeg86RGTCwMq) to get a solid foundational understanding.

For our purposes, the table below summarises the pieces of PyTorch most relevant to us.

| PyTorch functionality | What does it do? |
|---|---|
| `torch.nn` | Contains all of the building blocks for computational graphs (essentially a series of computations executed in a particular way). |
| `torch.nn.Parameter` | Stores tensors that can be used with `nn.Module`. If `requires_grad=True`, gradients (used for updating model parameters via gradient descent) are calculated automatically; this is often referred to as "autograd". |
| `torch.nn.Module` | The base class for all neural network modules; all the building blocks for neural networks are subclasses. If you're building a neural network in PyTorch, your models should subclass `nn.Module`. Requires a `forward()` method to be implemented. |
| `torch.optim` | Contains various optimization algorithms (these tell the model parameters stored in `nn.Parameter` how best to change in order to reduce the loss via gradient descent). |
| `def forward()` | All `nn.Module` subclasses require a `forward()` method, which defines the computation that will take place on the data passed to that particular `nn.Module`. |

##### TransformerLens + CircuitsVis

- These libraries were designed specifically to aid the internal exploration of generative LLMs.
    - [TransformerLens](https://github.com/neelnanda-io/TransformerLens): Loads open source models and provides utilities for more convenient exploration of their internals.
    - [CircuitsVis](https://github.com/alan-cooney/CircuitsVis): Provides nice visuals which work well inside a Jupyter environment.
- In this notebook TransformerLens is used to load a reference GPT-2 style model for comparison and to explore its attention patterns.


In [3]:
# Defining important libraries
import einops
from fancy_einsum import einsum
from dataclasses import dataclass
import torch
import torch.nn as nn
import numpy as np
import math
import tqdm.auto as tqdm
import circuitsvis as cv
import transformer_lens
from transformer_lens import utils

In [4]:
# We make use of dataclasses so that we don't have to define a separate config when working inside a notebook
@dataclass
class Config:
    d_model: int = 768
    debug: bool = True
    layer_norm_eps: float = 1e-5
    d_vocab: int = 50257
    init_range: float = 0.02
    n_ctx: int = 1024
    d_head: int = 64
    d_mlp: int = 3072
    n_heads: int = 12
    n_layers: int = 12

cfg = Config()
print(cfg)

Config(d_model=768, debug=True, layer_norm_eps=1e-05, d_vocab=50257, init_range=0.02, n_ctx=1024, d_head=64, d_mlp=3072, n_heads=12, n_layers=12)


### Tokenization Model Inputs

- We have text data we want to feed into the model, **_however_** the model works with numbers, not text.
    - We therefore must convert our text to numbers via a process known as tokenization.
- Many different types of tokenization exist, each with trade-offs (character based, word based, sub-word based)
    - Short post touching on this [here](https://www.datacamp.com/blog/what-is-tokenization).
    - **Character based**
        - Split your text up into individual characters.
        - Upside: you can handle misspelled or rare words (any string can be represented)
        - Downside: you lose a lot of the semantic meaning behind your text, since all text is treated as just a stream of characters.
            - This is bad since the objective behind *"language modelling"* is to extract meaning from your text.
            - Would require vast resources (compute, memory, data etc.) for this to even be viable.
        - Rarely used in practice because of these downsides.
    - **Word based**
        - Split your text into words.
        - Upside: the model doesn't have to learn words from characters
            - Training process is slightly less complex
        - Downsides:
            - Rules for punctuation (and similar) may be needed, otherwise punctuation gets merged with your words and influences learning.
            - Misspellings, conjugations, declensions and other grammatical variations can lead to a large vocab size
            - A large vocab in turn means the network has a large compression task, which is inefficient.
                - Imagine a vocab of the order $10^5$ and model vectors of the order $10^3$: that's $10^8$ weights! Some models are that size alone
            - Restricting the vocab size can help overcome this, but can cause information loss
    - **Sub-word based**
        - Splits your text into smaller "subwords" (which can look slightly strange)
        - Upside: it balances the upsides and downsides of the prior techniques
        - Various algorithms exist for this which, similar to word based processes, combine statistical and rule based methods.
- Below we demo the general gist of tokenization using a janky version of character based tokenization
    - GPT-2 uses [byte-pair-encoding](https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt) which is a form of sub-word tokenization
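
To give a feel for how a sub-word vocabulary can be built, here is a toy sketch of the merge step at the heart of byte-pair encoding: count adjacent symbol pairs, then fuse the most frequent pair into a new symbol. The corpus and the helper names (`most_frequent_pair`, `merge_pair`) are invented for illustration, and this works on characters, whereas GPT-2's actual tokenizer operates on bytes and has extra pre-processing rules.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the commonest one."""
    pairs = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged_words = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged_words.append(out)
    return merged_words

# Hypothetical toy corpus: each word starts out as a list of characters
words = [list("low"), list("lower"), list("lowest")]
for _ in range(2):  # two merge rounds
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print(pair, words)
```

After two rounds the shared prefix has become a single symbol, while the rarer endings are still split into characters, which is the balance between character and word tokenization described above.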

In [5]:
# Character based 
corpus_text = 'My name is Nish and I like discussing AI and Data Science topics'
character_split_corpus = list(corpus_text)
token2id_map = {char: char_id for char_id, char in enumerate(sorted(set(character_split_corpus)))}

tokenized_text = [token2id_map[token] for token in character_split_corpus]
tokenized_text

[4,
 23,
 0,
 17,
 7,
 16,
 10,
 0,
 13,
 20,
 0,
 5,
 13,
 20,
 12,
 0,
 7,
 17,
 9,
 0,
 3,
 0,
 15,
 13,
 14,
 10,
 0,
 9,
 13,
 20,
 8,
 22,
 20,
 20,
 13,
 17,
 11,
 0,
 1,
 3,
 0,
 7,
 17,
 9,
 0,
 2,
 7,
 21,
 7,
 0,
 6,
 8,
 13,
 10,
 17,
 8,
 10,
 0,
 21,
 18,
 19,
 13,
 8,
 20]

### Token Embeddings

- Remember our original input is text, which we must convert to numbers via **_tokenization_** (outside the model)
- We can then take these numbers and turn them into vectors (inside the model)
    - These vectors are known as embeddings.
- There are different ways of implementing this which give the same result
    - You can one-hot encode your input tokens and multiply by the embedding matrix, or index into the matrix directly as done here.
    - Indexing is more efficient and looks cleaner.
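
The equivalence between the one-hot multiply and direct indexing can be checked on a tiny example. The matrix values and helper names (`one_hot`, `vec_mat`) below are made up for illustration and use plain Python lists rather than tensors.

```python
def one_hot(index, size):
    """Vector of zeros with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_mat(vec, mat):
    """Row vector times matrix."""
    return [sum(v * row[col] for v, row in zip(vec, mat)) for col in range(len(mat[0]))]

# Hypothetical tiny embedding matrix: vocab size 4, d_model 3
W_E = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]
token_id = 2
via_matmul = vec_mat(one_hot(token_id, len(W_E)), W_E)  # multiplies by mostly zeros
via_index = W_E[token_id]                               # just picks out row 2
print(via_matmul == via_index)
```

The multiply touches every row of the matrix only to discard all but one, which is why indexing is the cheaper choice.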

In [6]:
class Embed(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg
        self.W_E = nn.Parameter(torch.empty((cfg.d_vocab, cfg.d_model)))
        nn.init.normal_(self.W_E, std=self.cfg.init_range)
    
    def forward(self, tokens):
        # tokens: [batch, position]
        if self.cfg.debug: print("Tokens:", tokens.shape)
        embed = self.W_E[tokens, :] # [batch, position, d_model]
        if self.cfg.debug: print("Embeddings:", embed.shape)
        return embed

### Positional Embeddings

- Interestingly, the attention mechanism on its own is symmetric with respect to token position.
    - It has no way of knowing that token 1 (source pos) comes before token 2 (source pos) relative to token 3 (dest pos)
- This is problematic, since attention moves information from source positions to destination positions without knowing anything about their order
    - Without extra information the model would have no direct signal for word order.
- This is where positional embeddings come in!
    - They solve this problem by *encoding* information about token positions into vectors which are added to the existing token embeddings.
    - This happens before attention, so attention can use knowledge about position even though it doesn't compute it itself!
- Here we allow the model to learn the positional embedding.
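
The position-blindness described above can be seen in a small sketch: for a single query, a plain dot-product attention average gives the same result when the source tokens are reordered. The `attend` helper and the toy vectors are assumptions made for illustration, and the causal mask is deliberately left out to isolate the effect.

```python
import math

def attend(query, keys, values):
    """Single-query dot-product attention (no mask, no positional information)."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Hypothetical 2-d token vectors, used as both keys and values
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.5, -0.5]

original = attend(query, tokens, tokens)
shuffled = attend(query, tokens[::-1], tokens[::-1])  # reverse the source order
print(all(math.isclose(a, b) for a, b in zip(original, shuffled)))
```

Adding a position-dependent vector to each token before attention breaks this symmetry, because the same token then produces different keys and values at different positions.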

In [15]:
class PosEmbed(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg
        self.W_pos = nn.Parameter(torch.empty((cfg.n_ctx, cfg.d_model)))
        nn.init.normal_(self.W_pos, std=self.cfg.init_range)
    
    def forward(self, tokens):
        # tokens: [batch, position]
        if self.cfg.debug: print("Tokens:", tokens.shape)
        pos_embed = self.W_pos[:tokens.size(1), :] # [position, d_model]
        pos_embed = einops.repeat(pos_embed, "position d_model -> batch position d_model", batch=tokens.size(0))
        if self.cfg.debug: print("pos_embed:", pos_embed.shape)
        return pos_embed

### LayerNorm

- This normalises (strictly speaking, standardises) the inputs so that they have a mean of 0 and a variance of 1, and then scales and shifts them
    - Mathematically:
        - **Compute mean and variance:**
        $\mu = \frac{1}{H} \sum_{i=1}^{H} x_i$
        $\sigma^2 = \frac{1}{H} \sum_{i=1}^{H} (x_i - \mu)^2$
        Here, $H$ is the number of hidden units ($d_{model}$ in our case).
        - **Standardise:**
        $\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$
        Where $\epsilon$ is a small constant for numerical stability.
        - **Scale and shift:**
        $y_i = \gamma_i \hat{x}_i + \beta_i$
        Here, $\gamma$ and $\beta$ are learnable vectors for scaling and shifting.
        - **All in one:**
        $Y = \gamma \circ \left(\frac{X - \mu}{\sqrt{\sigma^2 + \epsilon}}\right) + \beta$
    - This occurs independently and in parallel for each residual stream vector
- It acts across the $d_{model}$ dimension and can be applied either before or after the other layers (pre- vs post-LayerNorm)
- At the time, studies showed that it gives smoother training and better accuracy on NLP tasks.
    - This [stackpost](https://stats.stackexchange.com/questions/474440/why-do-transformers-use-layer-norm-instead-of-batch-norm) provides more details.
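
The formulas above can be checked by hand on a single vector. This is a plain-Python sketch with a made-up input; `layer_norm` here is an illustrative helper, not the notebook's `LayerNorm` module.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Standardise one vector, then scale by gamma and shift by beta (elementwise)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)  # divides by H, as in the formula
    return [g * (xi - mean) / math.sqrt(var + eps) + b for xi, g, b in zip(x, gamma, beta)]

x = [1.0, 2.0, 3.0, 4.0]  # hypothetical residual stream vector, H = 4
y = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)
print(y)
print(sum(y) / len(y))                 # mean, approximately 0
print(sum(v * v for v in y) / len(y))  # variance, approximately 1
```

With $\gamma = 1$ and $\beta = 0$ the output is just the standardised vector; the variance lands slightly below 1 because of the $\epsilon$ in the denominator.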
    


In [7]:
class LayerNorm(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg
        self.w = nn.Parameter(torch.ones(cfg.d_model))
        self.b = nn.Parameter(torch.zeros(cfg.d_model))
    
    def forward(self, residual):
        # residual: [batch, position, d_model]
        if self.cfg.debug: print("Residual:", residual.shape)
        # Centre the residual (x - mean)
        residual = residual - einops.reduce(residual, "batch position d_model -> batch position 1", "mean")
        # Calculate std by calculating the variance, square root it. Add in an epsilon to prevent divide by zero.
        # Use self.cfg (not the notebook-level cfg) so the module does not depend on a global variable.
        scale = (einops.reduce(residual.pow(2), "batch position d_model -> batch position 1", "mean") + self.cfg.layer_norm_eps).sqrt()
        normalized = residual / scale
        normalized = normalized * self.w + self.b
        if self.cfg.debug: print("Normalized:", normalized.shape)
        return normalized

### Attention
<img src="../data/images/attention_head.png" alt="Attention Head" width="400"/>
<img src="../data/images/multi_head_attention.png" alt="Multi-Attention Heads" width="400"/>
<img src="../data/images/causal_mask.png" alt="Causal Mask" width="400"/>

- What is attention? In particular, the mechanism
    - Remember the attention mechanism is all about learning efficient representations for your text.
        - To do so it leverages *dot products* to create a similarity measure between your tokens, $q k^{T}$
        - You can then generate an *attention pattern* for each token (destination pos / query), which acts as a probability distribution over the prior source tokens (keys): $\text{softmax}(\frac{q k^T}{\sqrt{d_k}})$, where $d_k$ is the head dimension $d_{head}$
        - The values of the distribution then act as weights that decide how much information to copy over from each source token
    - Another way of thinking about it is that attention essentially *moves information between token positions*, i.e. from source positions (keys) to destination positions (queries)
        - This is done so as to maximise the relevant information held at each token position, based on the relation between that token and all the others that are *causally prior* to it (in the case of GPT based models).
    - This is the only part of the transformer which moves information between positions.
- Why do you have multiple attention heads?
    - Each head is meant to independently learn representations of your text (each has its own set of parameters, i.e. weight matrices)
    - You can then combine the knowledge learned by those heads to, in theory, gain a better understanding
        - As the saying goes, "two heads are better than one"
    - Some cool maths can show that concatenating the heads' outputs together is equivalent to linearly adding each head's output to the residual stream
    - You generally find that the output dimension of each head is smaller than the residual stream width, e.g. $\frac{d_{model}}{d_{head}} = n_{heads}$
- The way the outputs of each head are combined can be thought about in two equivalent ways
    - Stacking the outputs of each head together and then performing a final linear map back to the residual stream size is the same as multiplying each head's output by its own weight matrix and then summing over the heads.
    - The concatenation view is often preferred in implementations since it produces one larger, more compute-efficient matrix multiply, whereas in a theoretical context it is often preferred to think of the heads as independently additive.
- Viewed this way, the operations can be written compactly as a single expression for the output of one head: $A x W_{V}^{T} W_{O}^{T}$
    - $A$ is the attention pattern, with dimension ($P_{d} \times P_{s}$), i.e. destination positions by source positions
        - It is built from the scores $qk^T$, where $q$ holds our query vectors and $k$ our key vectors.
        - Expanding with $q = xW_{Q}^{T}$ and $k = xW_{K}^{T}$, the scores become $xW_{Q}^{T} W_{K} x^{T}$, so $A = \text{softmax}\left(\frac{xW_{Q}^{T} W_{K} x^{T}}{\sqrt{d_k}}\right)$ (with the causal mask applied before the softmax)
    - $x$ is the input matrix read from the residual stream, consisting of your tokens and their respective embeddings ($P_{s} \times d_{model}$)
    - $W_{V}^{T}$ is the transpose of your value weight matrix ($d_{model} \times d_{head}$)
    - $W_{O}^{T}$ is the transpose of your output weight matrix ($d_{head} \times d_{model}$)
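
To see what the causal mask and softmax do to a matrix of scores, here is a small plain-Python sketch. The scores are invented, and `causal_attention_pattern` is an illustrative helper: it simply drops future positions before the softmax, whereas the notebook's `Attention` class fills them with a large negative number, which gives the same pattern up to negligible terms.

```python
import math

def causal_attention_pattern(scores):
    """Row-wise softmax where query i may only attend to keys j <= i."""
    pattern = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # keys at or before position i
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # masked (future) positions get exactly zero weight
        pattern.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return pattern

# Hypothetical pre-softmax scores for 3 tokens, already divided by sqrt(d_head)
scores = [
    [0.0, 1.0, 2.0],
    [1.0, 1.0, 5.0],
    [0.0, 0.0, 0.0],
]
pattern = causal_attention_pattern(scores)
for row in pattern:
    print([round(p, 3) for p in row])
```

Each row is a probability distribution over the current and earlier positions only: the first token can attend only to itself, and the large score of 5.0 in the second row has no effect because it belongs to a future position.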


In [8]:
class Attention(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg
        self.W_Q = nn.Parameter(torch.empty((cfg.n_heads, cfg.d_model, cfg.d_head)))
        nn.init.normal_(self.W_Q, std=self.cfg.init_range)
        self.b_Q = nn.Parameter(torch.zeros((cfg.n_heads, cfg.d_head)))
        self.W_K = nn.Parameter(torch.empty((cfg.n_heads, cfg.d_model, cfg.d_head)))
        nn.init.normal_(self.W_K, std=self.cfg.init_range)
        self.b_K = nn.Parameter(torch.zeros((cfg.n_heads, cfg.d_head)))
        self.W_V = nn.Parameter(torch.empty((cfg.n_heads, cfg.d_model, cfg.d_head)))
        nn.init.normal_(self.W_V, std=self.cfg.init_range)
        self.b_V = nn.Parameter(torch.zeros((cfg.n_heads, cfg.d_head)))
        
        self.W_O = nn.Parameter(torch.empty((cfg.n_heads, cfg.d_head, cfg.d_model)))
        nn.init.normal_(self.W_O, std=self.cfg.init_range)
        self.b_O = nn.Parameter(torch.zeros((cfg.d_model)))
        
        self.register_buffer("IGNORE", torch.tensor(-1e5, dtype=torch.float32, device="cpu"))
    
    def forward(self, normalized_resid_pre):
        # normalized_resid_pre: [batch, position, d_model]
        if self.cfg.debug: print("Normalized_resid_pre:", normalized_resid_pre.shape)
        
        q = einsum("batch query_pos d_model, n_heads d_model d_head -> batch query_pos n_heads d_head", normalized_resid_pre, self.W_Q) + self.b_Q
        k = einsum("batch key_pos d_model, n_heads d_model d_head -> batch key_pos n_heads d_head", normalized_resid_pre, self.W_K) + self.b_K
        
        attn_scores = einsum("batch query_pos n_heads d_head, batch key_pos n_heads d_head -> batch n_heads query_pos key_pos", q, k)
        attn_scores = attn_scores / math.sqrt(self.cfg.d_head)
        attn_scores = self.apply_causal_mask(attn_scores)

        pattern = attn_scores.softmax(dim=-1) # [batch, n_head, query_pos, key_pos]

        v = einsum("batch key_pos d_model, n_heads d_model d_head -> batch key_pos n_heads d_head", normalized_resid_pre, self.W_V) + self.b_V

        z = einsum("batch n_heads query_pos key_pos, batch key_pos n_heads d_head -> batch query_pos n_heads d_head", pattern, v)

        attn_out = einsum("batch query_pos n_heads d_head, n_heads d_head d_model -> batch query_pos d_model", z, self.W_O) + self.b_O
        return attn_out

    def apply_causal_mask(self, attn_scores):
        # attn_scores: [batch, n_heads, query_pos, key_pos]
        mask = torch.triu(torch.ones(attn_scores.size(-2), attn_scores.size(-1), device=attn_scores.device), diagonal=1).bool()
        attn_scores.masked_fill_(mask, self.IGNORE)
        return attn_scores

### Feedforward Network (MLP)

<img src="../data/images/feedforward_layer.png" alt="Feedforward (MLP) Layer" width="400"/>

- This layer typically contains a single hidden layer
    - Intuitively it's just a standard MLP layer which moves information forward through the network
- Mathematically it's just a linear map --> activation function --> linear map
    - The activation function is typically GELU for GPT based transformers
- In my diagrams I refer to $d_{E} = d_{model}$, which is the residual stream size, and in practice it's observed that $\frac{d_{mlp}}{d_{model}} \approx 4$
    - The main thing to note is that the ratio is $\geq 1$, which intuitively makes sense since you'd want the layer to create features and move them forward.
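
The linear map, activation, linear map pipeline can be written out for a single vector in plain Python. The sizes and weights below are made up for illustration; `gelu` uses the exact form based on the normal CDF, and the helper names are not part of the notebook.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def mlp(x, W_in, b_in, W_out, b_out):
    """Linear map -> GELU -> linear map for a single vector x."""
    hidden = [
        gelu(sum(xi * W_in[i][j] for i, xi in enumerate(x)) + b_in[j])
        for j in range(len(b_in))
    ]
    return [
        sum(h * W_out[j][k] for j, h in enumerate(hidden)) + b_out[k]
        for k in range(len(b_out))
    ]

# Hypothetical tiny sizes: d_model = 2, d_mlp = 8 (a ratio of 4, as in the text)
d_model, d_mlp = 2, 8
W_in = [[0.1 * (i + j) for j in range(d_mlp)] for i in range(d_model)]
W_out = [[0.05 * (j - k) for k in range(d_model)] for j in range(d_mlp)]
out = mlp([1.0, -1.0], W_in, [0.0] * d_mlp, W_out, [0.0] * d_model)
print(len(out), gelu(0.0), round(gelu(1.0), 4))
```

The hidden layer is wider than the input and output, so the vector is expanded to `d_mlp` values, passed through the non-linearity, and projected back down to `d_model`.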

In [9]:
class FeedForwardLayer(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg
        self.W_in = nn.Parameter(torch.empty((cfg.d_model, cfg.d_mlp)))
        nn.init.normal_(self.W_in, std=self.cfg.init_range)
        self.b_in = nn.Parameter(torch.zeros((cfg.d_mlp)))
        self.W_out = nn.Parameter(torch.empty((cfg.d_mlp, cfg.d_model)))
        nn.init.normal_(self.W_out, std=self.cfg.init_range)
        self.b_out = nn.Parameter(torch.zeros((cfg.d_model)))
        self.gelu = nn.GELU()
    
    def forward(self, normalized_resid_mid):
        # normalized_resid_mid: [batch, position, d_model]
        if self.cfg.debug: print("Normalized_resid_mid:", normalized_resid_mid.shape)
        pre = einsum("batch position d_model, d_model d_mlp -> batch position d_mlp", normalized_resid_mid, self.W_in) + self.b_in
        post = self.gelu(pre)
        mlp_out = einsum("batch position d_mlp, d_mlp d_model -> batch position d_model", post, self.W_out) + self.b_out
        return mlp_out

### Transformer Block

<img src="../data/images/transformer_block.png" alt="Showing the Transformer Block" width="400"/>

- This packages together all the other components
    - The main components of the block are sometimes referred to as *sub-layers* (the attention and feedforward layers).
- Typically if someone says "this transformer has $N$ layers" this means it has $N$ transformer blocks and therefore $2N$ sub-layers

In [10]:
class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg

        self.ln1 = LayerNorm(cfg)
        self.attn = Attention(cfg)
        self.ln2 = LayerNorm(cfg)
        self.mlp = FeedForwardLayer(cfg)
    
    def forward(self, resid_pre):
        # resid_pre [batch, position, d_model]
        normalized_resid_pre = self.ln1(resid_pre)
        attn_out = self.attn(normalized_resid_pre)
        resid_mid = resid_pre + attn_out
        
        normalized_resid_mid = self.ln2(resid_mid)
        mlp_out = self.mlp(normalized_resid_mid)
        resid_post = resid_mid + mlp_out
        return resid_post

### Un-embedding Layer

- This is the final layer, which maps from the internal residual stream dimension back to the vocab dimension
    - Remember the residual stream is just the internal dimension of the model
- From this you can softmax over the vocab-sized logits and then sample from the resulting distribution, which is what gives the model its generative abilities!

In [11]:
class Unembed(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg
        self.W_U = nn.Parameter(torch.empty((cfg.d_model, cfg.d_vocab)))
        nn.init.normal_(self.W_U, std=self.cfg.init_range)
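        # requires_grad=False must be passed to nn.Parameter (not torch.zeros) to keep the bias non-trainable
        self.b_U = nn.Parameter(torch.zeros(cfg.d_vocab), requires_grad=False)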
    
    def forward(self, normalized_resid_final):
        # normalized_resid_final [batch, position, d_model]
        if self.cfg.debug: print("Normalized_resid_final:", normalized_resid_final.shape)
        logits = einsum("batch position d_model, d_model d_vocab -> batch position d_vocab", normalized_resid_final, self.W_U) + self.b_U
        return logits

### Full-Transformer

<img src="../data/images/gpt_style_transformer.png" alt="Showing the full GPT-2 style transformer" width="400" height="400"/>

- Now you can combine everything together
    - Can decide on how many blocks you want and then weave everything together in the order shown in the architecture diagrams

In [16]:
class GPT2Model(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg
        self.embed = Embed(cfg)
        self.pos_embed = PosEmbed(cfg)
        self.blocks = nn.ModuleList([TransformerBlock(cfg) for _ in range(cfg.n_layers)])
        self.ln_final = LayerNorm(cfg)
        self.unembed = Unembed(cfg)
    
    def forward(self, tokens):
        # tokens [batch, position]
        embed = self.embed(tokens)
        pos_embed = self.pos_embed(tokens)
        residual = embed + pos_embed
        for block in self.blocks:
            residual = block(residual)
        normalized_resid_final = self.ln_final(residual)
        logits = self.unembed(normalized_resid_final)
        # logits have shape [batch, position, d_vocab]
        return logits

##### Testing it out

In [13]:
# device = utils.get_device()
device = torch.device("cpu")
print(device)

cpu


In [18]:
reference_model = transformer_lens.HookedTransformer.from_pretrained("gpt2-small", device=device, fold_ln=False, center_unembed=False, center_writing_weights=False)
our_model = GPT2Model(Config(debug=False))
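# The weight copy below is left commented out, so our_model keeps its random initial weights.
# Uncomment it to load the pretrained GPT-2 weights from the reference model.
# our_model.load_state_dict(reference_model.state_dict(), strict=False)
# our_model = our_model.to(device)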
our_model

Loaded pretrained model gpt2-small into HookedTransformer


GPT2Model(
  (embed): Embed()
  (pos_embed): PosEmbed()
  (blocks): ModuleList(
    (0-11): 12 x TransformerBlock(
      (ln1): LayerNorm()
      (attn): Attention()
      (ln2): LayerNorm()
      (mlp): FeedForwardLayer(
        (gelu): GELU(approximate='none')
      )
    )
  )
  (ln_final): LayerNorm()
  (unembed): Unembed()
)

In [19]:
# Our model
test_string = "My name is Nish and I like discussing AI and Data Science topics'"
print(test_string)
print("^^^Original Input^^^")
generated_text = ""
for i in tqdm.tqdm(range(5)):
    # Tokenizing our text (turning it into a bunch of integers)
    test_tokens = reference_model.to_tokens(test_string).cpu()
    # Feeding in our tokens (getting out our logits)
    demo_logits = our_model(test_tokens)
    # Calculating our new token using greedy method
    new_token = reference_model.tokenizer.decode(demo_logits[-1, -1].argmax())
    print(f"The most probable token is {new_token} and has probability {demo_logits[-1, -1].softmax(dim=-1).max()}")
    test_string += new_token
    generated_text += new_token
print("--------------------")
print(f"Generated text >>> \"{generated_text}\"")
print("--------------------")

My name is Nish and I like discussing AI and Data Science topics'
^^^Original Input^^^


The most probable token is  humorous and has probability 0.00018735218327492476
The most probable token is  overhe and has probability 0.0002456027432344854
The most probable token is :{ and has probability 0.00017518943059258163
The most probable token is flies and has probability 0.00014470017049461603
The most probable token is  humorous and has probability 0.00018375739455223083
--------------------
Generated text >>> " humorous overhe:{flies humorous"
--------------------





In [None]:
# Reference model
test_string = "My name is Nish and I like discussing AI and Data Science topics"
print(test_string)
print("^^^Original Input^^^")
generated_text = ""
for i in tqdm.tqdm(range(5)):
    test_tokens = reference_model.to_tokens(test_string).cpu()
    demo_logits = reference_model(test_tokens)
    new_token = reference_model.tokenizer.decode(demo_logits[-1, -1].argmax())
    print(f"The most probable token is {new_token} and has probability {demo_logits[-1, -1].softmax(dim=-1).max()}")
    test_string += new_token
    generated_text += new_token
print("--------------------")
print(f"Generated text >>> \"{generated_text}\"")
print("--------------------")

##### Attention Pattern Visual

- To aid in understanding attention, it's useful to visualise the attention patterns of the reference model on our example sentence.

In [20]:
model_input_string = """My name is Nish and I like discussing AI and Data Science topics"""

In [21]:
model_input_tokens = reference_model.to_tokens(model_input_string).cpu()
print(model_input_tokens.device) # Should be on CPU
reference_output_logits, reference_output_activations = reference_model.run_with_cache(model_input_tokens, remove_batch_dim=True)


cpu


In [22]:
print(type(reference_output_activations))
attention_scores = reference_output_activations["pattern", 0, "attn"]
print(attention_scores.shape)
model_str_tokens = reference_model.to_str_tokens(model_input_string)

<class 'transformer_lens.ActivationCache.ActivationCache'>
torch.Size([12, 14, 14])


In [23]:
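# `attention_scores` holds the post-softmax attention pattern of the first layer, shape [n_heads, query_pos, key_pos].
# We can use it to visualize the attention patterns for the first layer of the model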
cv.attention.attention_heads(tokens=model_str_tokens, attention=attention_scores, mask_upper_tri=True, negative_color="#FF0000", positive_color="#00FF00")
# cv.attention.attention_patterns(tokens=model_str_tokens, attention=attention_scores)

#### Basic training

- Once you have defined your architecture you'll have to create your own custom training loop using PyTorch to train your model.
    - Involves collecting data, training your model, testing it etc
    - **Perhaps something we will touch on during another talk.**

In [None]:
def lm_cross_entropy_loss(logits, tokens):
    # Measure next token loss
    # Logits have shape [batch, position, d_vocab]
    # Tokens have shape [batch, position]
    log_probs = logits.log_softmax(dim=-1)
    pred_log_probs = log_probs[:, :-1].gather(dim=-1, index=tokens[:, 1:].unsqueeze(-1)).squeeze(-1)
    return -pred_log_probs.mean()
loss = lm_cross_entropy_loss(demo_logits, test_tokens)
print(loss)
print("Loss as average prob", (-loss).exp())
print("Loss as 'uniform over this many variables'", (loss).exp())
print("Uniform loss over the vocab", math.log(our_model.cfg.d_vocab))

------------------------------------------------------------------------------------------------------------------------------------------------------------