<a href="https://colab.research.google.com/github/FrankLong1/AI-Explainers/blob/main/llm_explainer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook lays out the steps an LLM goes through to generate a response when it is fed a prompt.
We assume some background on LLMs, but you don't need to be an engineer or know how to code to follow along.

That said, you can copy and paste any of the code into ChatGPT and ask it to "explain what this code does in plain English", and it should do a solid job!

In [1]:
%pip install accelerate

Collecting accelerate
  Downloading accelerate-0.27.2-py3-none-any.whl (279 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.27.2


In [2]:
# We'll be using Mistral

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.1"


## Converting Words into Tokens
Key Terms:
"tokens"
"embeddings"
"matrices"

At their core, machine learning models take in an input and predict the most likely output.

The tokenizer plays the key role of converting language understandable by humans (e.g. English), into the inputs that are understandable by the model (i.e. numbers, and ultimately 1s and 0s).

Each model has its own accompanying tokenizer that is built to go with it. You can think of it as a special converter that turns human language into the specific "language" of numbers that this particular model understands.


We'll start by loading a tokenizer from HuggingFace...



In [15]:
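from transformers import AutoTokenizer

# Download the tokenizer that accompanies the model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Reuse the end-of-sequence token as the padding token (needed later when generating)
tokenizer.pad_token_id = tokenizer.eos_token_id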

The prompt we want to use is "Can you write an email explaining what an LLM is?". We'll start by defining this sentence and giving it to the tokenizer to process...

In [4]:
PROMPT = "Can you write an email explaining what an LLM is?"
inputs = tokenizer(PROMPT, return_tensors="pt")

The tokenizer breaks the prompt down into sub-word chunks called "tokens". Large Language Models (LLMs) are trained to take in this sequence of tokens and perform "next token prediction" (i.e. guess the next token) based on the examples they have been trained on.
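To make "next token prediction" a bit more concrete, here is a deliberately tiny sketch. It is not how an LLM works internally (an LLM uses a neural network, not a lookup of counts), and the training sentence and the `predict_next` helper are made up for illustration. It only shows the basic idea of guessing the next token from what followed it in past examples.

```python
from collections import Counter, defaultdict

# Hypothetical miniature "training data"; a real LLM sees vastly more text
training_text = "the cat sat on the mat . the cat ate the fish ."
tokens = training_text.split()  # toy tokens: whole words rather than sub-word chunks

# For every token, count which token followed it in the training text
next_token_counts = defaultdict(Counter)
for current, following in zip(tokens, tokens[1:]):
    next_token_counts[current][following] += 1

def predict_next(token):
    """Return the token that most often followed `token` in the training text."""
    return next_token_counts[token].most_common(1)[0][0]

print(predict_next("the"))  # "cat" followed "the" twice, more than "mat" or "fish"
```

The point to take away is that the prediction comes entirely from patterns in the examples the model has seen.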

In [5]:
import pandas as pd
from transformers import PreTrainedTokenizerBase

def tokens_and_ids_to_dataframe(tokenizer: PreTrainedTokenizerBase, token_ids) -> pd.DataFrame:
    """Build a table pairing each token ID with the text it decodes to."""
    tokens_and_ids = []
    for token_id in token_ids:
        token_id = int(token_id)  # works for plain ints and for single tensor elements
        token_text = tokenizer.decode([token_id])
        tokens_and_ids.append((token_text, token_id))

    return pd.DataFrame(tokens_and_ids, columns=['Token Text', 'Token ID'])

# inputs["input_ids"] holds one row per prompt; take the first (and only) row
token_ids = inputs["input_ids"][0]

# Show each token next to its ID
df = tokens_and_ids_to_dataframe(tokenizer, token_ids)
df

Index,Token Text,Token ID
0,<s>,1
1,Can,2418
2,you,368
3,write,3324
4,an,396
5,email,4927
6,explaining,20400
7,what,767
8,an,396
9,LL,16704


The tokenizer knows which Token ID corresponds to which token (i.e. "sub-word chunk of text"). The Token IDs are the numbers that the model actually uses to process the text.

I am only showing the Token Text for explanation purposes!


*Side note: you'll also notice that the list of tokens begins with "<s>", which tells the model that a prompt or response is starting.*
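If it helps, you can picture the ID-to-token mapping as a two-way lookup table. The sketch below uses only the handful of IDs shown in the table above; the real vocabulary is far larger, and the real tokenizer also handles spacing and splits unfamiliar words, which this toy version does not. The helper names `lookup_ids` and `lookup_texts` are made up for illustration.

```python
# Token IDs and texts copied from the table above
id_to_token = {
    1: "<s>", 2418: "Can", 368: "you", 3324: "write", 396: "an",
    4927: "email", 20400: "explaining", 767: "what", 16704: "LL",
}
# Build the reverse direction: text -> ID
token_to_id = {token: token_id for token_id, token in id_to_token.items()}

def lookup_ids(token_texts):
    """Hypothetical helper: turn a list of token texts into their IDs."""
    return [token_to_id[text] for text in token_texts]

def lookup_texts(token_ids):
    """Hypothetical helper: turn a list of IDs back into token texts."""
    return [id_to_token[token_id] for token_id in token_ids]

ids = lookup_ids(["Can", "you", "write", "an", "email"])
print(ids)                # [2418, 368, 3324, 396, 4927]
print(lookup_texts(ids))  # ['Can', 'you', 'write', 'an', 'email']
```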


## Loading the Model

Here I want to actually load the model

In [6]:
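import torch

# Free any GPU memory cached by earlier cells
torch.cuda.empty_cache()

# Create new tensors on the GPU in half precision (float16) by default
torch.set_default_tensor_type(torch.cuda.HalfTensor)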



In [8]:
import torch
from transformers import AutoModelForCausalLM

# Load the model onto the GPU if available, otherwise use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

mistral_7b_model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)
mistral_7b_model.to(device)


MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm()
        (post_attention_layernorm): MistralRMSNorm()
      )
    )
    (norm): MistralRMSNorm()
  

In [18]:
messages = [
      {"role": "user", "content": PROMPT}
]

model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(device)
model_inputs




tensor([[    1,   733, 16289, 28793,  2418,   368,  3324,   396,  4927, 20400,
           767,   396, 16704, 28755,   349, 28804,   733, 28748, 16289, 28793]])

For demonstration purposes I am going to generate exactly 1 token from the model... Note that in the model's text you'll see the tokens [INST] (begin instruction) and [/INST] (end instruction). This is how the task of predicting the next token gets adapted for instruction following...

In [27]:
generated_ids = mistral_7b_model.generate(model_inputs, max_new_tokens=1, min_length=1, do_sample=True, pad_token_id=tokenizer.pad_token_id)
tokens_and_ids_to_dataframe(tokenizer, generated_ids[0]).tail(1)


Index,Token Text,Token ID
20,Dear,22143


In [28]:
generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)

[INST] Can you write an email explaining what an LLM is? [/INST] Dear
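Judging only from the decoded text above, the chat template wraps the user's message in the [INST] and [/INST] markers before the model sees it. Here is a rough sketch of that wrapping for a single user message. The helper `format_single_turn` is made up for illustration; the real work is done by `tokenizer.apply_chat_template`, which also adds the "<s>" start token and handles multi-turn conversations.

```python
def format_single_turn(user_message: str) -> str:
    """Hypothetical helper: wrap one user message in instruction markers."""
    return f"[INST] {user_message} [/INST]"

PROMPT = "Can you write an email explaining what an LLM is?"
formatted = format_single_turn(PROMPT)
print(formatted)
# [INST] Can you write an email explaining what an LLM is? [/INST]
```

Everything the model generates (here, "Dear") is simply the continuation of this wrapped text.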


After this, I take the new text (which is the same as my prompt but with 1 additional token) and pass it back into the model as an input again... The model repeats this over and over to generate a longer series of tokens.
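That "feed the output back in" loop can be sketched in a few lines. The stand-in "model" below is a made-up lookup table (`TOY_NEXT_TOKEN`) that only looks at the last token, which a real LLM does not do, and the `<end>` marker is a hypothetical stop token. The loop structure is the part worth paying attention to.

```python
# Hypothetical stand-in for the model: maps the last token to one "predicted" next token
TOY_NEXT_TOKEN = {"[/INST]": "Dear", "Dear": "team", "team": ",", ",": "<end>"}

def toy_predict_next(tokens):
    """Pretend model: looks only at the final token (a real LLM uses all of them)."""
    return TOY_NEXT_TOKEN.get(tokens[-1], "<end>")

def generate(prompt_tokens, max_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = toy_predict_next(tokens)
        if next_token == "<end>":  # stop token, playing the role of end-of-sequence
            break
        tokens.append(next_token)  # the longer sequence becomes the next input
    return tokens

result = generate(["[INST]", "Hi", "[/INST]"], max_new_tokens=5)
print(result)  # ['[INST]', 'Hi', '[/INST]', 'Dear', 'team', ',']
```

Setting `max_new_tokens=1` here stops after a single pass through the loop, which mirrors the one-token demonstration above.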

## Under the Hood

You'll see that after moving the model to the GPU it's taking up about 13 GB (i.e. 13312 MiB) of memory, which is pretty big. What is taking up all this space? The parameters!

In [30]:
!nvidia-smi

NotImplementedError: A UTF-8 locale is required. Got ANSI_X3.4-1968
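As a rough cross-check on the figure quoted above, you can estimate the memory from the parameter count alone. This is a back-of-the-envelope sketch: it assumes the "7B" in the model name means roughly 7 billion parameters, that each one is stored as a 2-byte float16 number, and it ignores any extra memory used while the model is running. The helper `parameters_to_mib` is made up for illustration.

```python
def parameters_to_mib(num_parameters: int, bytes_per_parameter: int = 2) -> float:
    """Memory needed just to store the weights, in MiB (1 MiB = 1024 ** 2 bytes)."""
    return num_parameters * bytes_per_parameter / (1024 ** 2)

# Assumption: "7B" in the model name means roughly 7 billion parameters
approx_parameters = 7_000_000_000

print(parameters_to_mib(approx_parameters))     # float16: 2 bytes per parameter
print(parameters_to_mib(approx_parameters, 4))  # float32 would need twice as much
```

With these assumptions the float16 estimate lands a little over 13,000 MiB, the same ballpark as the number above.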

To understand what parameters really are, let's start with a much simpler example: a neural network with 2 layers, much like the pictures you'll typically see of neural networks.

In [53]:
import torch
import torch.nn as nn
import torch.optim as optim

# Define a tiny two-layer network
class FranksMiniModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)   # layer 1: input -> hidden
        self.relu = nn.ReLU()                            # non-linearity between the layers
        self.fc2 = nn.Linear(hidden_size, output_size)  # layer 2: hidden -> output

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x


input_size = 784   # number of input values per sample
output_size = 10   # number of output classes
model = FranksMiniModel(input_size, hidden_size=256, output_size=output_size)


Epoch 1/10, Loss: 2.348310546875
Epoch 2/10, Loss: 1.04474609375
Epoch 3/10, Loss: 0.3203509521484375
Epoch 4/10, Loss: 0.123892822265625
Epoch 5/10, Loss: 0.06983123779296875
Epoch 6/10, Loss: 0.04729629516601563
Epoch 7/10, Loss: 0.0352545166015625
Epoch 8/10, Loss: 0.027859878540039063
Epoch 9/10, Loss: 0.02289825439453125
Epoch 10/10, Loss: 0.019356536865234374
Training finished!


In [None]:
from prettytable import PrettyTable

# Count the parameters of a given model and estimate the memory they occupy
def count_parameters(model):
    table = PrettyTable(["Modules", "Parameters", "Trainable"])
    total_params = 0
    total_trainable = 0
    total_bytes = 0

    for name, parameter in model.named_parameters():
        # Number of elements (individual weights) in the current parameter tensor
        params = parameter.numel()
        table.add_row([name, params, parameter.requires_grad])
        total_params += params
        # element_size() is the number of bytes per element (2 for float16)
        total_bytes += params * parameter.element_size()
        if parameter.requires_grad:
            total_trainable += params

    print(table)
    print(f"Total Params: {total_params}")
    print(f"Total Trainable Params: {total_trainable}")

    # Convert bytes to MiB (1 MiB = 1024 ** 2 bytes)
    total_mib = total_bytes / (1024 ** 2)
    print("Total MiB:", total_mib)
    return total_params
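For the small two-layer model defined earlier (784 inputs, 256 hidden units, 10 outputs), you can also work out the parameter count by hand. The sketch below assumes the standard layout of a linear layer: one weight for every input-output pair, plus one bias per output. The helper `linear_parameter_count` is made up for illustration.

```python
def linear_parameter_count(in_features: int, out_features: int, bias: bool = True) -> int:
    """A linear layer stores an (out x in) weight matrix plus one bias per output."""
    return in_features * out_features + (out_features if bias else 0)

fc1 = linear_parameter_count(784, 256)  # 784 * 256 weights + 256 biases
fc2 = linear_parameter_count(256, 10)   # 256 * 10 weights + 10 biases
total = fc1 + fc2                       # the ReLU in between has no parameters

print(fc1, fc2, total)  # 200960 2570 203530
```

So even this "mini" model has a couple of hundred thousand numbers to learn; the same arithmetic, scaled up, is where billions of parameters come from.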


In [None]:
# Define your dataset (dummy example)
# Here, we're just creating random tensors for demonstration purposes
train_data = torch.randn(100, input_size)  # 100 samples, each with input size 784
train_labels = torch.randint(0, output_size, (100,))  # 100 labels, each between 0 and 9

criterion = nn.CrossEntropyLoss()  # Cross entropy loss for classification problems
optimizer = optim.SGD(model.parameters(), lr=0.01)  # Stochastic gradient descent optimizer

# Train your model
num_epochs = 10
for epoch in range(num_epochs):
    running_loss = 0.0
    for inputs, labels in zip(train_data, train_labels):
        # Zero the parameter gradients
        optimizer.zero_grad()

        # Forward pass
        outputs = model(inputs.unsqueeze(0))  # Adding an extra dimension as the batch dimension
        loss = criterion(outputs, labels.unsqueeze(0))  # Calculate the loss

        # Backward pass
        loss.backward()  # Compute gradients

        # Optimize
        optimizer.step()  # Update model parameters

        # Track statistics
        running_loss += loss.item()
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {running_loss/len(train_labels)}")

print("Training finished!")
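To see what "compute gradients" and "update model parameters" mean without any library doing it for you, here is the same loop shrunk down to a model with a single parameter. This is a simplified sketch: it uses squared error instead of cross entropy, and the toy data and variable names are made up for illustration.

```python
# Toy data generated by y = 3 * x, so the "right" value of the single parameter is 3.0
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]

w = 0.0               # the model's only parameter, starting from a bad guess
learning_rate = 0.01

def mean_squared_error(weight):
    """Average squared gap between predictions (weight * x) and targets."""
    return sum((weight * x - y) ** 2 for x, y in data) / len(data)

initial_loss = mean_squared_error(w)
for epoch in range(10):
    for x, y in data:
        prediction = w * x                   # forward pass
        gradient = 2 * (prediction - y) * x  # backward pass: slope of the squared error w.r.t. w
        w -= learning_rate * gradient        # plain SGD update: step against the gradient
    print(f"Epoch {epoch + 1}/10, Loss: {mean_squared_error(w)}")
final_loss = mean_squared_error(w)
```

Each pass nudges `w` closer to 3.0 and the loss shrinks. Training a real model is the same idea applied to every parameter at once.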

Looking at the Mistral model's config you'll see that it has 32 layers (see "num_hidden_layers"). What is in each of those layers?

In [None]:
from transformers import AutoConfig

# Load the model configuration
config = AutoConfig.from_pretrained(MODEL_NAME)
config
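One way to get a feel for what is in each layer is to count its parameters using the shapes from the model printout earlier (the `q_proj`, `k_proj`, ... lines). The sketch below does that arithmetic. One assumption goes beyond the printout: each RMSNorm is taken to hold one weight per hidden dimension (4096), since the printout does not show its size. The rotary embedding is treated as having no learned parameters.

```python
def linear(in_features, out_features):
    """Weights in a linear layer; bias=False in the printout, so no bias terms."""
    return in_features * out_features

hidden = 4096

attention = (
    linear(hidden, 4096)    # q_proj
    + linear(hidden, 1024)  # k_proj
    + linear(hidden, 1024)  # v_proj
    + linear(4096, hidden)  # o_proj
)
mlp = (
    linear(hidden, 14336)    # gate_proj
    + linear(hidden, 14336)  # up_proj
    + linear(14336, hidden)  # down_proj
)
# Assumption: each of the two RMSNorms holds one weight per hidden dimension
norms = 2 * hidden

per_layer = attention + mlp + norms
print(attention, mlp, norms)  # 41943040 176160768 8192
print(per_layer)              # 218112000
print(32 * per_layer)         # 6979584000 across all 32 layers
```

Under these assumptions the 32 layers alone account for almost 7 billion parameters, with most of each layer's weight sitting in the MLP rather than in attention.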