## LLM Demo

### Intro

- Typically there are 3 ways people can interact with LLMs.
    - Making your own from scratch (collecting data, defining the architecture, training the model, ...)
    - Making use of those made by others (either people or organisations)
    - A hybrid of the two (e.g. "building on top" of an existing model)
- We will primarily focus on working with ready-made LLMs; **_however_**, we will touch on more granular code in certain places to help explain how the components work under the hood.
    - We will draw heavy inspiration from the code provided in this [notebook](https://colab.research.google.com/github/neelnanda-io/Easy-Transformer/blob/clean-transformer-demo/Clean_Transformer_Demo.ipynb#scrollTo=EDlMEk0LVcdy) for implementing the granular code.
    - Ultimately, you can also start from pre-existing models and then build your own on top of those.
- Good questions raised during my last talk:
    - What do you do if you want to build models for less common languages?

In [9]:
# Notebook wide setup

### Making your own models

- Here we will be looking at various components of the transformer and how you can go about implementing them.
    - The particular focus is on the GPT-2 based transformer architecture, as highlighted in the notebook linked above.
    - The ideas are generally transferable, though.
- Main libraries used:
    - PyTorch, einops, fancy_einsum, NumPy, math, dataclasses, ...
- Einsum is great because it makes tensor manipulation, which is typically error-prone (at least for me...), much easier.
    - Its foundations come from Einstein's summation notation (those of you who studied physics might have come across it before).
    - Here is a [great post](https://rockt.github.io/2018/04/30/einsum) which goes over the background theory.

In [3]:
# Defining important libraries
import einops
from fancy_einsum import einsum
from dataclasses import dataclass
import torch
import torch.nn as nn
import numpy as np
import math
import tqdm.auto as tqdm
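
As a small illustration of why einsum helps, the sketch below writes the same contraction two ways: once as an einsum string that names every axis, and once as a plain matrix multiplication. The tensor names and sizes are made up for illustration, and it uses `torch.einsum` rather than `fancy_einsum` to keep the dependencies minimal.

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration.
batch, seq, d_model, d_head = 2, 5, 8, 4
x = torch.randn(batch, seq, d_model)  # activations
W = torch.randn(d_model, d_head)      # a weight matrix

# "bsm,mh->bsh": multiply along the shared axis m (d_model) and sum it out.
out_einsum = torch.einsum("bsm,mh->bsh", x, W)

# The same contraction written as an ordinary matrix multiplication.
out_matmul = x @ W

print(out_einsum.shape)
print(torch.allclose(out_einsum, out_matmul, atol=1e-5))
```

The einsum string documents which axis is being summed over, which is the part that tends to go wrong when juggling `transpose` and `reshape` calls by hand.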

### Attention
<img src="../data/images/attention_head.png" alt="Attention Head" width="400"/>
<img src="../data/images/multi_head_attention.png" alt="Multi-Attention Heads" width="400"/>

- What is attention, and in particular the attention mechanism?
    - Remember the attention mechanism is all about learning efficient representations for your text.
        - To do so it leverages the idea of *dot products* to create a similarity measure between your tokens: $q k^{T}$
        - You can then generate an *attention pattern/score* for each token (destination position / query), which acts as a probability distribution over prior source tokens (keys).
        - The values of the distribution then act as weights that decide how much information to copy over from each source position:
        $\text{softmax}(\frac{q k^T}{\sqrt{d_k}})$
    - Another way of thinking about it is that attention is essentially *moving information between token positions*, e.g. from source positions (keys) to destination positions (queries).
        - This movement is done in such a way as to maximize the relevant information held at each token position, based on the relation between that token and all others that are *causally prior* to it (in the case of GPT-based models).
    - This is the only part of the transformer which moves information between positions.
- Why do you have multiple attention heads?
    - Each head is meant to independently learn representations of your text (each has its own set of parameters, i.e. weight matrices).
    - You can then efficiently combine the knowledge learned by those heads to, in theory, gain a better understanding.
        - As the saying goes, "two heads are better than one".
    - Some cool maths can show that concatenating the heads' outputs and applying a single output projection is equivalent to adding each head's projected output to the residual stream separately.
    - You generally find that the output dimension of each head is smaller than the residual stream width, e.g. $\frac{d_{model}}{d_{head}} = n_{heads}$
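
The points above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the implementation from the linked notebook. It makes several assumptions: random weights, no biases, a single sequence, made-up sizes, and $d_k$ taken to be `d_head`. It builds a causal attention pattern and then checks that concatenating the heads before one output projection gives the same result as summing each head's projected output.

```python
import math
import torch

torch.manual_seed(0)

# Hypothetical sizes for illustration; d_model = n_heads * d_head.
seq, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads

x = torch.randn(seq, d_model)                # residual stream for one sequence
W_Q = torch.randn(n_heads, d_model, d_head)  # per-head query weights
W_K = torch.randn(n_heads, d_model, d_head)  # per-head key weights
W_V = torch.randn(n_heads, d_model, d_head)  # per-head value weights
W_O = torch.randn(n_heads, d_head, d_model)  # per-head output weights

q = torch.einsum("sm,hmd->hsd", x, W_Q)
k = torch.einsum("sm,hmd->hsd", x, W_K)
v = torch.einsum("sm,hmd->hsd", x, W_V)

# Scaled dot-product scores with shape [head, query_pos, key_pos].
scores = torch.einsum("hqd,hkd->hqk", q, k) / math.sqrt(d_head)

# Causal mask: a query position may not attend to later key positions.
future = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(future, float("-inf"))

# Softmax over key positions: each query row is a probability distribution.
pattern = scores.softmax(dim=-1)

# Weighted mix of values: information moves from keys to queries.
z = torch.einsum("hqk,hkd->hqd", pattern, v)

# Option A: project each head separately and sum the results.
out_sum = torch.einsum("hqd,hdm->qm", z, W_O)

# Option B: concatenate the heads, then apply one stacked projection.
z_cat = z.permute(1, 0, 2).reshape(seq, n_heads * d_head)
W_O_cat = W_O.reshape(n_heads * d_head, d_model)
out_cat = z_cat @ W_O_cat

print(pattern[0])  # attention pattern of head 0 (lower triangular)
print(torch.allclose(out_sum, out_cat, atol=1e-4))
```

In a real model `out_sum` would be added to the residual stream `x`; the sketch stops before that step.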

### LayerNorm

### Feedforward Network (MLP)

<img src="../data/images/feedforward_layer.png" alt="Feedforward (MLP) Layer" width="400"/>

- This layer typically contains a single hidden layer.
    - Intuitively, it's just a standard MLP layer which is meant to move information forward through the network.
- Mathematically, it's just applying a linear map --> activation function --> linear map.
    - The activation function is typically GELU for GPT-based transformers.
- In my diagrams I refer to $d_{E} = d_{model}$, which is the residual stream size, and in practice it's observed that $\frac{d_{mlp}}{d_{model}} \approx 4$
    - The main thing to note is that the ratio is $\geq 1$
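
A minimal sketch of this layer in PyTorch might look like the following. The class and attribute names (`MLP`, `fc_in`, `fc_out`) and the sizes are hypothetical. `nn.GELU()` is the exact GELU, whereas GPT-2 style implementations often use a tanh approximation instead.

```python
import torch
import torch.nn as nn


class MLP(nn.Module):
    """Linear map -> GELU -> linear map, acting on each position independently."""

    def __init__(self, d_model: int, d_mlp: int):
        super().__init__()
        self.fc_in = nn.Linear(d_model, d_mlp)   # expand to the hidden width
        self.act = nn.GELU()
        self.fc_out = nn.Linear(d_mlp, d_model)  # project back to d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc_out(self.act(self.fc_in(x)))


# Hypothetical sizes, using the d_mlp / d_model = 4 ratio mentioned above.
d_model = 8
mlp = MLP(d_model=d_model, d_mlp=4 * d_model)

x = torch.randn(2, 5, d_model)  # [batch, position, d_model]
y = mlp(x)
print(y.shape)
```

The output has the same shape as the input, which is what allows it to be added back into the residual stream.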

### Using pre-existing models

- Main libraries used:
    - `transformers` (models and tokenizers)
- Hugging Face can be thought of as a wide ecosystem which facilitates the open-source nature of modern AI/ML.
    - You can do many things on Hugging Face, but we will primarily touch on using its collection of models for tasks.

In [4]:
# Defining important libraries
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import numpy as np

In [5]:
model = AutoModelForSequenceClassification.from_pretrained('vectara/hallucination_evaluation_model')
tokenizer = AutoTokenizer.from_pretrained('vectara/hallucination_evaluation_model')

pairs = [
    ["A man walks into a bar and buys a drink", "A bloke swigs alcohol at a pub"],
    ["A person on a horse jumps over a broken down airplane.", "A person is at a diner, ordering an omelette."],
    ["A person on a horse jumps over a broken down airplane.", "A person is outdoors, on a horse."],
    ["A boy is jumping on skateboard in the middle of a red bridge.", "The boy skates down the sidewalk on a blue bridge"],
    ["A man with blond-hair, and a brown shirt drinking out of a public water fountain.", "A blond drinking water in public."],
    ["A man with blond-hair, and a brown shirt drinking out of a public water fountain.", "A blond man wearing a brown shirt is reading a book."],
    ["Mark Wahlberg was a fan of Manny.", "Manny was a fan of Mark Wahlberg."], 
]

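# Tokenize each sentence pair together, padding to a common length
inputs = tokenizer.batch_encode_plus(pairs, return_tensors='pt', padding=True)

model.eval()  # inference mode (disables dropout)
with torch.no_grad():  # gradients are not needed for scoring
    outputs = model(**inputs)
    logits = outputs.logits.cpu().detach().numpy()
    # convert logits to probabilities with a sigmoid
    scores = 1 / (1 + np.exp(-logits)).flatten()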

In [6]:
print(scores)

[6.1051559e-01 4.7493645e-04 9.9639291e-01 2.1221612e-04 9.9599433e-01
 1.4126968e-03 2.8262993e-03]
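
One way to read these numbers is to turn them into labels. The sketch below assumes, as the example pairs suggest, that a score near 1 means the second sentence is consistent with the first and a score near 0 means it is not. The 0.5 cut-off is a hypothetical choice for illustration, not something taken from the model's documentation.

```python
# Scores copied from the printed output above.
scores = [6.1051559e-01, 4.7493645e-04, 9.9639291e-01, 2.1221612e-04,
          9.9599433e-01, 1.4126968e-03, 2.8262993e-03]

THRESHOLD = 0.5  # hypothetical cut-off, chosen only for illustration

labels = ["consistent" if s >= THRESHOLD else "inconsistent" for s in scores]

for i, (score, label) in enumerate(zip(scores, labels)):
    print(f"pair {i}: score={score:.4f} -> {label}")
```

With this cut-off the first pair (score about 0.61) lands just on the "consistent" side, which shows how sensitive borderline cases are to the threshold.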
