<a href="https://colab.research.google.com/github/FrankLong1/AI-Explainers/blob/main/llm_explainer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This is about laying out the steps an LLM goes through to generate a response when it is fed a prompt.
We assume some background on LLMs, but you don't need to be an engineer / know how to code to follow along.

With that said you can copy and paste any of the code into ChatGPT and ask it to "explain what this code does in plain English" and it should do a solid job!

In [14]:
%pip install accelerate



In [15]:
# We'll be using Mistral

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.1"


## Converting Words into Tokens
Key Terms:
"tokens"
"embeddings"
"matrices"

At their core, what machine learning models do is take in an input and predict what the most likely output is. W

The tokenizer plays the key role of converting language understandable by humans (e.g. English), into the inputs that are understandable by the model (i.e. numbers, and ultimately 1s and 0s).

Each model has its own accompanying Tokenizer that is trained alongside it, you can't think of it as a special converter that turns the language into this specific language

NOTE NEED SOME NICE METAPHOR

We'll start by loading a tokenizer from HuggingFace...



In [16]:
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

The prompt we want to use is "Can you write an email explaining what an LLM is?" we'll start by defining this sentence for the model and giving it to the tokenizer to process...

In [17]:
PROMPT = "Can you write an email explaining what an LLM is?"
inputs = tokenizer(PROMPT, return_tensors="pt")

The model breaks down the prompt into sub-word chunks called "tokens". What Large Language Models (LLMs) are trained to do is take in this string of "tokens", and perform "next token prediction" (i.e. guess the next token) based on the examples it has been "trained on" in the past.

In [18]:
import pandas as pd
from transformers import PreTrainedTokenizerBase

def tokens_and_ids_to_dataframe(tokenizer: PreTrainedTokenizerBase, token_ids: list) -> pd.DataFrame:
    tokens_and_ids = []
    for i, token_id in enumerate(token_ids):
        token_text = tokenizer.decode([token_id])
        tokens_and_ids.append((token_text, token_id.item()))

    df = pd.DataFrame(tokens_and_ids, columns=['Token Text', 'Token ID'])
    return df

# Example usage:
# inputs["input_ids"][0] is assumed to be your list of tokenized IDs
# tokenizer is assumed to be your tokenizer object
# Replace inputs["input_ids"][0] and tokenizer with your actual inputs and tokenizer
token_ids = inputs["input_ids"][0]
tokenizer = tokenizer

# Call the method to convert token IDs to DataFrame
df = tokens_and_ids_to_dataframe(tokenizer, token_ids)

# Print the DataFrame
df

Unnamed: 0,Token Text,Token ID
0,<s>,1
1,Can,2418
2,you,368
3,write,3324
4,an,396
5,email,4927
6,explaining,20400
7,what,767
8,an,396
9,LL,16704


The tokenizer knows which Token ID corresponds to what Token (i.e."sub-word chunk of text"). The Token IDs are the numbers that the model actually uses to process the text.

I am onlying showing the Token Text for explanation purposes!


*Side Note: You'll also notice that the list of tokens begins with "< s >" which is the indicator to the model that a prompt or response is starting.


## Loading the Model

Here I want to actually load the model

In [19]:
import torch

torch.cuda.empty_cache()

torch.set_default_tensor_type(torch.cuda.HalfTensor)

In [20]:
import torch
from transformers import AutoModelForCausalLM

# Load the model onto the GPU if available, otherwise use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device=device, torch_dtype=torch.float16)


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



In [None]:
def generate_response(model, tokenizer, prompt, device='cuda', max_new_tokens=200, min_length = 0):
    messages = [
        {"role": "user", "content": prompt}
    ]

    # model_inputs = tokenizer([...])
    model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(device)
    generated_ids = model.generate(model_inputs, max_new_tokens = max_new_tokens, min_length = min_length, do_sample=True, pad_token_id=tokenizer.pad_token_id)
    generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

    # Strip the prompt out of the response
    generated_text = generated_text.split(prompt)[-1].strip()
    return generated_text


# prompt: use the generate_response model

# TODO fix the random warnings here

generated_response = generate_response(model, tokenizer, PROMPT)
print(generated_response)



## Under the Hood

You'll see that after moving the model to the GPU it's taking up 13 GB (i.e. 13312 MiB of space) of space which is pretty big, what is taking up all this space?

In [21]:
!nvidia-smi

Thu Feb 22 22:23:52 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla V100-SXM2-16GB           Off | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0              43W / 300W |  14916MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [22]:
from prettytable import PrettyTable

# Define a function to count the parameters of a given model
def count_parameters(model):
    # Create a PrettyTable object with specified column names
    table = PrettyTable(["Modules", "Parameters", "Trainable"])
    # Initialize the total number of parameters
    total_params = 0
    # Initialize the total number of trainable parameters
    total_trainable = 0

    # Iterate over each parameter and its name in the model
    for name, parameter in model.named_parameters():
        # Get the number of elements in the current parameter
        params = parameter.numel()
        # Add a row to the table with the module name, parameter count, and trainability
        table.add_row([name, params, parameter.requires_grad])
        # Add the current parameter count to the total parameter count
        total_params += params
        # If the parameter is trainable (requires gradient), add its count to the total trainable count
        if parameter.requires_grad:
            total_trainable += params

    # Print the table with module details
    print(table)
    # Print the total number of parameters
    print(f"Total Params: {total_params}")
    # Print the total number of trainable parameters
    print(f"Total Trainable Params: {total_trainable}")
    # Get the dtype of a tensor of the desired type
    dtype = torch.cuda.HalfTensor().dtype

    # Get the number of bits per element in the dtype
    bits_per_float = dtype.itemsize * 8

    total_bits = total_params * bits_per_float
    total_mib = total_bits / (8 * 1024 ** 2)
    print("Total MiB:", total_mib)
    return total_params


## Use the model to a response from the input tokens


No chat template is defined for this tokenizer - using the default template for the LlamaTokenizerFast class. If the default is not appropriate for your model, please set `tokenizer.chat_template` to an appropriate template. See https://huggingface.co/docs/transformers/main/chat_templating for more information.

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

# Define your model
class FranksMiniModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(FranksMiniModel, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

# Define your dataset (dummy example)
# You'll replace this with your actual dataset loading and preprocessing
# Assuming input size of 784 (MNIST-like) and output size of 10 (classification)
# You can replace these with actual data loaders for your dataset
# Here, we're just creating random tensors for demonstration purposes
input_size = 784
output_size = 10
train_data = torch.randn(100, input_size)  # 100 samples, each with input size 784
train_labels = torch.randint(0, output_size, (100,))  # 100 labels, each between 0 and 9

# Define your model, loss function, and optimizer
model = FranksMiniModel(input_size, hidden_size=256, output_size=output_size)
criterion = nn.CrossEntropyLoss()  # Cross entropy loss for classification problems
optimizer = optim.SGD(model.parameters(), lr=0.01)  # Stochastic Gradient Descent optimizer

# Train your model
num_epochs = 10
for epoch in range(num_epochs):
    running_loss = 0.0
    for inputs, labels in zip(train_data, train_labels):
        # Zero the parameter gradients
        optimizer.zero_grad()

        # Forward pass
        outputs = model(inputs.unsqueeze(0))  # Adding an extra dimension as the batch dimension
        loss = criterion(outputs, labels.unsqueeze(0))  # Calculate the loss

        # Backward pass
        loss.backward()  # Compute gradients

        # Optimize
        optimizer.step()  # Update model parameters

        # Print statistics
        running_loss += loss.item()
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {running_loss/len(train_labels)}")

print("Training finished!")
