<a href="https://colab.research.google.com/github/AbdulazeezAde/Abdulazeez/blob/main/Copy_of_foundations_of_llms_practical.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LLMs for everyone

<img src="https://www.marktechpost.com/wp-content/uploads/2023/05/Blog-Banner-3.jpg" width="60%" />

<a href="https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2024/blob/main/practicals/Foundations_of_LLMs/foundations_of_llms_practical.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

© Indabax Nigeria 2025. Apache License 2.0.

**Authors: Israel Odeajo | AI Engineer**

**Introduction:**

Welcome to **"LLMs for Everyone"**—your gateway to the fascinating world of Large Language Models (LLMs)! To kick things off, here’s a fun fact: this entire introduction was generated by ChatGPT, one of the many powerful LLMs you'll be learning about. 🤖✨

In this tutorial, you'll dive into the core principles of transformers, the cutting-edge technology behind models like GPT. You’ll also get hands-on experience training your very own Language Model! Get ready to explore how these impressive AI systems create such realistic and engaging text. Let’s embark on this exciting journey together and unlock the secrets of LLMs! 🚀📚

**Topics:**

Content: [<font color='orange'>Hugging Face Introduction</font>, <font color='green'>Attention Mechanism</font>, <font color='green'>Transformer Architecture</font>, <font color='green'>Training your own LLM from scratch</font>, <font color='orange'>Finetuning an LLM for Text Classification</font>]

Level: <font color='orange'>Beginner</font>, <font color='green'>Intermediate</font>, <font color='blue'>Advanced</font>

**Aims/Learning Objectives:**

* Understand the idea behind [Attention](https://arxiv.org/abs/1706.03762) and why it is used.
* Present and describe the fundamental building blocks of the [Transformer Architecture](https://arxiv.org/abs/1706.03762), along with the intuition behind its design.
* Build and train a simple Shakespeare-inspired LLM.

**Prerequisites:**

* Basic knowledge of Deep Learning.
* Familiarity with Natural Language Processing (NLP).
* Understanding of sequence-to-sequence models.
* Basic understanding of Linear Algebra.

**Outline:**

>[LLMs for everyone](#scrollTo=m2s4kN_QPQVe)

>>[Installations, Imports and Helper Functions](#scrollTo=6EqhIg1odqg0)

>>[Let's kick things off with a Hugging Face Demo! Beginner](#scrollTo=4zu5cg-YG4XU)

>>>[Hugging Face](#scrollTo=AwjIIipOG4fz)

>>>[Time for a Demo! ⏰⚡ Loading a Hugging Face Model and Running a Sample](#scrollTo=eq46TV_0G4f0)

>>[1. Attention](#scrollTo=-ZUp8i37dFbU)

>>>[Intuition - Beginner](#scrollTo=ygdi884ugGcu)

>>>[Understanding Attention in Simple Terms](#scrollTo=ygdi884ugGcu)

>>>[Sequence to sequence attention mechanisms - Intermediate](#scrollTo=aQfqM1EJyDXI)

>>>[Self-attention to Multihead Attention - Intermediate](#scrollTo=J-MU6rrny8Nj)

>>>>[Self-attention](#scrollTo=0AFUEFZGzCTv)

>>>>>[Queries, keys and values](#scrollTo=pwOIMtdZzdTf)

>>>>>[Scaled dot product attention](#scrollTo=OhGZHFsHz_Qp)

>>>>>[Masked attention](#scrollTo=D7B-AgO80gIt)

>>>>>[Multi-head attention](#scrollTo=OWDubQwCs4zG)

>>[2. Building your own LLM](#scrollTo=e9NW58_3hAg2)

>>>[2.1 High-level overview Beginner](#scrollTo=bA_2coZvhAg3)

>>>[2.2 Tokenization + Positional encoding Beginner](#scrollTo=fbTsk0MdhAhC)

>>>>[2.2.1 Tokenization](#scrollTo=DehUpfym_RF8)

>>>>[2.2.2 Positional encodings](#scrollTo=639s7Zuk_RF9)

>>>>>[Sine and cosine functions](#scrollTo=rklY-aL-_RF9)

>>>[Group Activity:](#scrollTo=1mjHEDPO_RF-)

>>>[2.3 Transformer block   Intermediate](#scrollTo=SdNPg0pnhAhG)

>>>>[2.3.1 Feed Forward Network (FFN) / Multilayer perceptron (MLP) Beginner](#scrollTo=kTURbfr__RF-)

>>>>[2.3.2 Add and Norm block Beginner](#scrollTo=Sts5Vr4i_RF-)

>>>[2.4 Building the Transformer Decoder / LLM Intermediate](#scrollTo=91dXd29b_RF_)

>>>[2.5 Training your LLM](#scrollTo=wmt3tp38G90A)

>>>>[2.5.1 Training objective Intermediate](#scrollTo=agLIpsoh_RGA)

>>>>[2.5.2 Training models Advanced](#scrollTo=4CSfvGj__RGA)

>>>>[2.5.3 Inspecting the trained LLM Beginner](#scrollTo=pGv9c2AFmF4V)

>>[Conclusion](#scrollTo=fV3YG7QOZD-B)

>[Feedback](#scrollTo=o1ndpYE50BpG)

**Before you start:**

For this practical, you will need to use a GPU to speed up training. To do this, go to the "Runtime" menu in Colab, select "Change runtime type" and then in the popup menu, choose "GPU" in the "Hardware accelerator" box.

**Suggested experience level in this topic:**

| Level | Experience |
| --- | --- |
| `Beginner` | It is my first time being introduced to this work. |
| `Intermediate` | I have done some basic courses/intros on this topic. |
| `Advanced` | I work in this area/topic daily. |

In [None]:
# @title **Paths to follow:** What is your level of experience in the topics presented in this notebook? (Run Cell)
experience = "advanced" #@param ["beginner", "intermediate", "advanced"]
sections_to_follow=""


if experience == "beginner": sections_to_follow = """we recommend that you do not attempt every coding task. Instead, skim through each section and make sure you interact with the LoRA-finetuned LLM presented in the last section, as well as with the pretrained LLM, to get a practical understanding of how these models behave"""

elif experience == "intermediate": sections_to_follow = """we recommend that you go through every section in this notebook and try the coding tasks tagged as beginner or intermediate. If you get stuck on the code, ask a tutor for help or move on to make better use of the time in the practical"""

elif experience == "advanced": sections_to_follow = """we recommend that you go through every section and try every coding task until you get it to work"""


print(f"Based on your experience, {sections_to_follow}.\nNote: this is just a guideline; feel free to explore the colab as you'd like if you feel comfortable!")

## Installations, Imports and Helper Functions

In [None]:
# Install necessary libraries for deep learning, NLP, and plotting
!pip install transformers datasets  # Transformers and datasets libraries for NLP tasks
!pip install seaborn umap-learn     # Seaborn for plotting, UMAP for dimensionality reduction
!pip install livelossplot           # LiveLossPlot for tracking model training progress
!pip install -q transformers[torch] # Transformers with PyTorch backend
!pip install -q peft                # Parameter-Efficient Fine-Tuning library
!pip install accelerate -U          # Accelerate library for performance

# Install utilities for debugging and console output formatting
!pip install -q ipdb                # Interactive Python Debugger
!pip install -q colorama            # Colored terminal text output
!pip install gensim                 # NLP preprocessing library




In [None]:
# Import system and math utilities
import os
import math
import urllib.request

# Check for connected accelerators (GPU or TPU) and set up accordingly
if os.environ.get("COLAB_GPU") and int(os.environ["COLAB_GPU"]) > 0:
    print("A GPU is connected.")
elif "COLAB_TPU_ADDR" in os.environ and os.environ["COLAB_TPU_ADDR"]:
    print("A TPU is connected.")
    import jax.tools.colab_tpu
    jax.tools.colab_tpu.setup_tpu()
else:
    print("Only CPU accelerator is connected.")

# Prevent JAX from preallocating most of the GPU memory at startup
os.environ['XLA_PYTHON_CLIENT_PREALLOCATE'] = "false"

# Import libraries for JAX-based deep learning
import chex
import flax
import flax.linen as nn
import jax
import jax.numpy as jnp
from jax import grad, jit, vmap
import optax

# Import NLP and model-related libraries
import transformers
from transformers import pipeline, AutoTokenizer, AutoModel
import datasets
import peft

# Import image processing and plotting libraries
from PIL import Image
from livelossplot import PlotLosses
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Import additional utilities for working with text and models
import torch
import torchvision
import itertools
import random
import copy

# Download an example image to use in the notebook
urllib.request.urlretrieve(
    "https://images.unsplash.com/photo-1529778873920-4da4926a72c2?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxzZWFyY2h8MXx8Y3V0ZSUyMGNhdHxlbnwwfHwwfHw%3D&w=1000&q=80",
    "cat.png",
)

# Import libraries for NLP preprocessing and working with pre-trained models
import gensim
from nltk.data import find
import nltk
nltk.download("word2vec_sample")

# Import Hugging Face tools and IPython widgets
import huggingface_hub
import ipywidgets as widgets
from IPython.display import display
import colorama

# Set Matplotlib to output SVG format for better quality plots
%config InlineBackend.figure_format = 'svg'

A GPU is connected.


[nltk_data] Downloading package word2vec_sample to /root/nltk_data...
[nltk_data]   Unzipping models/word2vec_sample.zip.


## Let's kick things off with a Hugging Face Demo! <font color='orange'>Beginner</font>

We're thrilled to have you on board! 🎉 Before we dive into the hands-on part of our journey, let's take a quick detour into the fascinating world of [Hugging Face](https://huggingface.co/)—an incredible open-source platform for building and deploying cutting-edge language models. 🌐

As a sneak peek into what we'll be creating today, we'll start by loading a *small* large language model (small in comparison to today's models, that is) and prompting it with a simple instruction. This will give you a feel for how to interact with these powerful libraries. 💡 Get ready to unlock the potential of language models with just a few lines of code!

### Hugging Face


<img src="https://huggingface.co/datasets/huggingface/brand-assets/resolve/main/hf-logo.png" width="10%">


[Hugging Face](https://huggingface.co/) is a startup founded in 2016 that is, in its own words, "on a mission to democratize good machine learning, one commit at a time." Today it is a treasure trove of tools for working on and with Large Language Models (LLMs).

They have developed various open-source packages and allow users to easily interact with a large corpus of pretrained transformer models (across all modalities) and datasets to train or fine-tune pre-trained transformers. Their software is used widely in industry and research. For more details on them and usage, refer to [the 2022 attention and transformer practical](https://colab.research.google.com/github/deep-learning-indaba/indaba-pracs-2022/blob/main/practicals/attention_and_transformers.ipynb#scrollTo=qFBw8kRx-4Mk).


In this colab we print prompts in <font color='HotPink'><b>pink</b></font> and samples generated from a model in <font color='blue'><b>blue</b></font>  like in the example below:
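The `print_sample` helper used in the cells below is not defined in this part of the notebook. As a rough idea of what such a helper could look like, here is a minimal sketch that wraps the prompt and the sample in ANSI colour codes. The escape codes and the `format_sample` name are assumptions made for illustration, not the notebook's actual helper.

```python
# Hypothetical stand-in for the notebook's colour-printing helper.
MAGENTA = "\033[35m"  # ANSI escape code: magenta (pink) foreground
BLUE = "\033[34m"     # ANSI escape code: blue foreground
RESET = "\033[39m"    # ANSI escape code: restore the default foreground colour


def format_sample(prompt: str, sample: str) -> str:
    """Return the prompt in pink followed by the model sample in blue."""
    return f"{MAGENTA}{prompt}{BLUE}{sample}{RESET}"


def print_sample(prompt: str, sample: str) -> None:
    """Print a prompt and a generated sample in two different colours."""
    print(format_sample(prompt, sample))
```

Keeping the formatting in its own function makes the colouring easy to check without capturing printed output.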

In [None]:
print_sample(prompt='My fake prompt', sample=' is awesome!')

My fake prompt is awesome!


### Time for a Demo! ⏰⚡ Loading a Hugging Face Model and Running a Sample

Let's dive into how simple it is to load and interact with a model from Hugging Face!

For this tutorial, we've pre-configured two model options:

- **`gpt-neo-125M`**: A smaller model with 125 million parameters. It's faster and uses less memory—perfect for getting started! We recommend trying this one first.
- **`gpt2-medium`**: A larger model with 355 million parameters for more advanced use.

If you want to switch models, just restart the Colab kernel and update the model name in the cell below.

**Note**: The steps we're about to show work not only for these models but also for [all models](https://huggingface.co/models?pipeline_tag=text-generation) on Hugging Face that support text generation pipelines.

In [None]:
# Set the model name to "EleutherAI/gpt-neo-125M" (this can be changed via the dropdown options)
model_name = "EleutherAI/gpt-neo-125M"  # @param ["gpt2-medium", "EleutherAI/gpt-neo-125M"]

# Define the prompt for the text generation model
test_prompt = 'What is love?'  # @param {type: "string"}

# Create a text generation pipeline using the specified model
generator = transformers.pipeline('text-generation', model=model_name)

# Generate text based on the provided prompt
# 'do_sample=True' enables sampling to introduce randomness in generation, and 'min_length=30' ensures at least 30 tokens are generated
model_output = generator(test_prompt, do_sample=True, min_length=30)

# Print the generated text sample, removing the original prompt from the output
print_sample(test_prompt, model_output[0]['generated_text'].split(test_prompt)[1].rstrip())

What is love? A study by Robert F. Borland of the University of California, Berkeley, shows that what we call love is not always simply a result of love alone. Love is a form of love. There are various forms of love that we can cultivate, and some of the more common ones are:

—Love, Love of the Unhappy

—Love of the Unhappy, Love of the Lonely

—Love of the Unhappy, Love of the Lonely, Love of the Lonely

—Love of the Lonely, Love of the Lonely, Love of the Lonely, Love of the Lonely, Love of the Lonely

—Love of the Lonely, Love of the Lonely, Love of the Lonely, Love of the Lonely, Love of the Lonely, Love of the Lonely, Love of the Lonely, Love of the Lonely, Love of the Lonely, Love of the Lonely, Love of the Lonely, Love of the Lonely, Love of the Lonely, Love of the Lonely, Love of the Lonely, Love of the Lonely, Love of the Lonely, Love of the Lonely, Love of the Lonely, Love of the Lonely, Love of the Lonely, Love of the Lonely, Love of the Lonely, Love of the Lonely

**💡 Tip:** Try running the code above with different prompts or with the same prompt more than once!

**🤔 Discussion:** Why do you think the generated text changes every time, even with the same prompt? Write your response in the input field below and discuss with your neighbour.

In [None]:
# Write your answer to the discussion question here
discussion_point = ''  # @param {type: "string"}

Let's create our own `generator` function to make it easier to load different model weights and configure how text generation is done. Simply run the cells below to get started! 😀

For now, don’t worry too much about understanding the details of the tokenizer. Just think of it as a step to convert the input into a format that the language model can understand. We’ll dive deeper into tokenization later in the notebook.


In [None]:
# Check if the model name contains 'gpt2' and load the appropriate tokenizer and model
if 'gpt2' in model_name:
    # Load the GPT-2 tokenizer and model
    tokenizer = transformers.GPT2Tokenizer.from_pretrained(model_name)
    model = transformers.GPT2LMHeadModel.from_pretrained(model_name)
# If the model name is 'EleutherAI/gpt-neo-125M', load the corresponding tokenizer and model
elif model_name == "EleutherAI/gpt-neo-125M":
    # Load the tokenizer and the causal language model for the specified GPT-Neo model
    tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
    model = transformers.AutoModelForCausalLM.from_pretrained(model_name)
# Raise an error if the model name is not supported
else:
    raise NotImplementedError(f"Unsupported model name: {model_name}")

# If a GPU is available, move the model to the GPU for faster processing
if torch.cuda.is_available():
    model = model.to("cuda")

# Set the padding token ID to be the same as the end-of-sequence token ID
tokenizer.pad_token_id = tokenizer.eos_token_id

In [None]:
def run_sample(
    model,  # The language model we’ll use to generate text
    tokenizer,  # The tokenizer that converts text into a format the model understands
    prompt: str,  # The text prompt we'll give to the model to start the text generation
    seed: int | None = None,  # Optional: A number to make the results predictable each time
    temperature: float = 0.6,  # Controls how random the model’s output is; lower values make it more focused
    top_p: float = 0.9,  # Controls how much of the most likely words are considered; higher values consider more options
    max_new_tokens: int = 64,  # The maximum number of new tokens the model will generate after the prompt
) -> str:
    # This function generates text based on a given prompt using a language model,
    # with options to control randomness, the number of tokens generated, and reproducibility.

    # Convert the prompt text into tokens that the model can process
    inputs = tokenizer(prompt, return_tensors="pt")

    # Extract the token IDs (input IDs) and the attention mask (which marks real tokens versus padding) from the inputs
    input_ids = inputs["input_ids"]
    attention_mask = inputs["attention_mask"]

    # Move the tokens and attention mask to the same device as the model (like a GPU if available)
    input_ids = input_ids.to(model.device)
    attention_mask = attention_mask.to(model.device)

    # Set up how we want the model to generate text
    generation_config = transformers.GenerationConfig(
        do_sample=True,  # Allow the model to add some randomness to its text generation
        temperature=temperature,  # Adjust how random the output is; lower means more focused
        top_p=top_p,  # Nucleus sampling: keep the most likely tokens whose cumulative probability reaches top_p
        pad_token_id=tokenizer.pad_token_id,  # Use the token ID that represents padding (extra space)
        top_k=0,  # We're not limiting to the top-k words, so we set this to 0
    )

    # If a seed is provided, set it so that the results are repeatable (same output each time)
    if seed is not None:
        torch.manual_seed(seed)

    # Generate text using the model with the settings we defined
    generation_output = model.generate(
        input_ids=input_ids,  # Provide the input tokens to the model
        attention_mask=attention_mask,  # Provide the attention mask to help the model focus
        return_dict_in_generate=True,  # Ask the model to return detailed information
        output_scores=True,  # Include the scores (confidence levels) for the generated tokens
        max_new_tokens=max_new_tokens,  # Set the maximum number of tokens to generate
        generation_config=generation_config,  # Apply our custom text generation settings
    )

    # Make sure only one sequence (output) is generated, to keep things simple
    assert len(generation_output.sequences) == 1

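    # Get the generated sequence of tokens (the prompt tokens followed by the new tokens)
    output_sequence = generation_output.sequences[0]

    # Convert the full sequence (prompt + generated tokens) back into readable text
    output_string = tokenizer.decode(output_sequence)

    # Decode only the newly generated tokens so the prompt is not printed twice
    new_text = tokenizer.decode(output_sequence[input_ids.shape[1]:])

    # Print the prompt and the generated response
    print_sample(prompt, new_text)

    # Return the full text (prompt followed by the generated response)
    return output_string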

The `run_sample` function above prepares the input text for the language model and sets up the text generation parameters. Here's a breakdown:

**Tokenization:**

- `inputs = tokenizer(prompt, return_tensors="pt")`: takes the prompt text and converts it into a sequence of numerical IDs called tokens, which is the format the model understands. `return_tensors="pt"` specifies that the output should be PyTorch tensors.
- `input_ids = inputs["input_ids"]`: extracts the token IDs from the tokenizer output.
- `attention_mask = inputs["attention_mask"]`: extracts the attention mask, which tells the model which tokens are actual content and which are padding (if any), so that padding is ignored.

**Moving to device:**

- `input_ids = input_ids.to(model.device)` and `attention_mask = attention_mask.to(model.device)`: move the token IDs and attention mask to the device where the model is loaded (e.g., a GPU if available). The inputs and the model must be on the same device for the model to process them.

**Setting up the generation configuration:**

- `generation_config = transformers.GenerationConfig(...)`: creates a configuration object that controls how the model generates text.
- `do_sample=True`: enables sampling during generation, which introduces randomness and makes the output less predictable and potentially more creative.
- `temperature=temperature`: sets the sampling temperature, which controls the degree of randomness.
- `top_p=top_p`: enables top-p sampling (also known as nucleus sampling), where the model only samples from the smallest set of most likely next tokens whose cumulative probability reaches the `top_p` value.
- `pad_token_id=tokenizer.pad_token_id`: specifies the token ID used for padding, which is important for handling sequences of different lengths in a batch.
- `top_k=0`: disables top-k sampling, where the model would only consider the `top_k` most likely next tokens. With it set to 0, only the top-p filter restricts which tokens can be sampled.
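To make the top-p step more concrete, here is a small pure-Python sketch of nucleus filtering on a toy next-token distribution. The function name `top_p_filter` and the example probabilities are made up for illustration; this is a simplified picture of the idea, not how `transformers` implements it internally.

```python
def top_p_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalise so they sum to 1."""
    # Sort tokens from most to least likely
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)

    kept = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break  # the nucleus is complete; drop the remaining tail

    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Made-up probabilities for the next token after some prompt
toy_probs = {"love": 0.5, "life": 0.3, "pizza": 0.15, "xylophone": 0.05}
nucleus = top_p_filter(toy_probs, top_p=0.9)
print(sorted(nucleus))
```

With `top_p=0.9`, the very unlikely "xylophone" is dropped and the model would sample only among the remaining three tokens. Lowering `top_p` shrinks the set further.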

In [None]:
_ = run_sample(model, tokenizer, prompt="What is love?", temperature=0.5, seed=2)

What is love?

Love is a term used to describe the way in which one person feels about others. It is a process that involves the emotional and physical interaction of the person, the relationship, and the relationship itself.

Love is the ability to feel the love of another person.

Love is the ability to feel


Pretty amazing, right? 🤩 Try playing around with the **prompt**, **temperature** and **seed** values above and see what different outputs you get. What do you notice when you increase the temperature? While this might have been mind-blowing back in 2021, by now, most of you have likely interacted with large language models in some way. Today, we're going to take things a step further by training our own **Shakespeare-inspired LLM**. This will give us a hands-on understanding of how these language models work under the hood.

In the context of the `run_sample` function and text generation models like the one we're using from Hugging Face:

- **`temperature`** controls the randomness of the generated text. Think of it as adjusting how "adventurous" the model is when picking the next token.
  - A higher temperature (e.g., 1.0 or more) makes the output more random, creative, and diverse, but can also lead to less coherent or nonsensical text.
  - A lower temperature (e.g., 0.5 or less) makes the output more deterministic, focused, and predictable, sticking closely to the most likely next tokens.
- **`seed`** is used to ensure reproducibility.
  - When you set a seed (to any integer value), the random number generator used during text generation is initialized with that value.
  - If you run the function with the same prompt, the same settings and the same seed, you should get the same generated text every time.
  - If you don't set a seed, or use a different seed, the output will likely differ on each run, even with the same prompt (as you observed earlier).

In short, `temperature` affects the creativity/randomness of a single generation, while `seed` ensures that repeated runs with the same settings give the same result.
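The effect of temperature and seed can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the logits are made-up scores for three candidate tokens, and `softmax_with_temperature` and `sample_index` are hypothetical helper names, not functions from `transformers`.

```python
import math
import random


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores (logits) into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(probs: list[float], seed: int) -> int:
    """Draw one token index; the same seed always gives the same draw."""
    rng = random.Random(seed)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens

cold = softmax_with_temperature(toy_logits, temperature=0.5)
hot = softmax_with_temperature(toy_logits, temperature=2.0)

# At low temperature the most likely token takes a larger share of the probability
print([round(p, 3) for p in cold])
print([round(p, 3) for p in hot])

# Sampling twice with the same seed picks the same token
print(sample_index(hot, seed=2) == sample_index(hot, seed=2))
```

Dividing the logits by a small temperature stretches the gaps between them, so the distribution becomes peaked. A large temperature flattens it, giving unlikely tokens more of a chance.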


But before we jump into training, let’s first build a solid understanding of what **Large Language Models** are and the key **Machine Learning** concepts that make this groundbreaking technology possible. At the heart of today’s state-of-the-art (SoTA) LLMs are the **Attention Mechanism** and the **Transformer Architecture**. We’ll explore these essential concepts in the upcoming sections of this tutorial. 🚀💡


## **1. Attention**


The attention mechanism is inspired by how humans would look at an image or read a sentence.

Let us take the image of the dog in human clothes below (image and example [source](https://lilianweng.github.io/posts/2018-06-24-attention/)). When paying *attention* to the red blocks of pixels, we would say that the yellow block of pointy ears is something we expected (correlated), whereas the grey blocks of human clothes are unexpected (uncorrelated). This is *based on what we have seen in the past* when looking at pictures of dogs.

<img src="https://drive.google.com/uc?export=view&id=1iEU7Cph2D2PCXp3YEHj30-EndhHAeB5T" alt="drawing" width="450"/>

Assume we want to identify the dog breed in this image. When we look at the red blocks of pixels, we tend to pay more *attention* to the pixels that are most similar or relevant to them, such as the ones in the yellow box. We almost completely ignore the snow in the background and the human clothing for this task.

Alternatively, when we begin looking at the background in an attempt to identify what is in it, we subconsciously ignore the dog pixels because they are irrelevant to the current task.

The same thing happens when we read. In order to understand the entire sentence, we will learn to correlate and *attend to* certain words based on the context of the entire sentence.

<img src="https://drive.google.com/uc?export=view&id=1j23kcfu_c3wINU6DUvxzMYNmp4alhHc9" alt="drawing" width="350"/>

For instance, in the first sentence in the image above, when looking at the word "coding", we pay more attention to the words "Apple" and "computer" because we know that when we speak about coding, "Apple" is actually referring to the company. However, in the second sentence, we realise we should not consider "apple" when looking at "coding" because, given the context of the rest of the sentence, we know that this apple refers to an actual apple and not a computer.

We can build better models by developing mechanisms that mimic attention. Such mechanisms enable our models to learn better representations of the input data by contextualising what they know about some parts of the input based on other parts. In the following sections, we will explore the mechanisms that enable us to train deep learning models to attend to input data in the context of other input data.

### Intuition - <font color='orange'>Beginner</font>

Imagine attention as a mechanism that allows a neural network to focus more on certain parts of data. By doing this, the network can enhance its grasp of the problem it's working on, updating its understanding or representations accordingly.

### Understanding Attention in Simple Terms

One way to implement attention in neural networks is by representing each word (or even parts of a word) as a vector.

So, what’s a vector? A vector is simply an array of real-valued numbers, and vectors can have different lengths (dimensions). Think of it as a list of values that describe certain properties of a word. These vectors allow us to measure how similar two words are to each other. One common way to measure this similarity is by calculating something called the **dot product**.

The result of this similarity calculation is what we refer to as **attention.** This attention value helps the model decide how much one word should influence the representation of another word.

In simpler terms, if two words have similar vector representations, it means they’re likely related or important to each other. Because of this relationship, they affect each other’s representations inside the neural network, allowing the model to understand the context better. 🎯

To illustrate how the dot product can create meaningful attention weights, we'll use pre-trained [word2vec](https://jalammar.github.io/illustrated-word2vec/) embeddings. These word2vec embeddings are generated by a neural network that learned to create similar embeddings for words with similar meanings.

By calculating the matrix of dot products between all vectors, we get an attention matrix. This will indicate which words are correlated and therefore should "attend" to each other.
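As a tiny worked example of the matrix of dot products described above, the sketch below uses made-up 2-dimensional vectors in plain Python. The words, the numbers and the `dot` helper are assumptions chosen for illustration; real word2vec embeddings have hundreds of dimensions.

```python
def dot(u: list[float], v: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Made-up 2-dimensional "embeddings" (not real word2vec vectors)
toy_embeddings = {
    "king": [1.0, 0.9],
    "queen": [0.9, 1.0],
    "banana": [-1.0, 0.2],
}

words = list(toy_embeddings)

# Entry [i][j] is the dot product between the vectors of word i and word j
similarity = [
    [dot(toy_embeddings[a], toy_embeddings[b]) for b in words]
    for a in words
]

for word, row in zip(words, similarity):
    print(word, [round(score, 2) for score in row])
```

Here "king" and "queen" point in almost the same direction, so their dot product is large and positive, while "banana" points elsewhere and scores negatively against both. This is the raw pattern that a softmax later turns into attention weights.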

[1] You can find more details about how this is done for LLMs in the "Building your own LLM" section.

**Code task** <font color='blue'>Intermediate</font>: Complete the dot product attention function below.

In [None]:
def dot_product_attention(hidden_states, previous_state):
    """
    Calculate the dot product between the hidden states and previous states.

    Args:
        hidden_states: A tensor with shape [T_hidden, dm]
        previous_state: A tensor with shape [T_previous, dm]
    """

    # Hint: To calculate the attention scores, think about how you can use the `previous_state` vector
    # and the `hidden_states` matrix. You want to find out how much each element in `previous_state`
    # should "pay attention" to each element in `hidden_states`. Remember that in matrix multiplication,
    # you can find the relationship between two sets of vectors by multiplying one by the transpose of the other.
    # Hint: Use `jnp.matmul` to perform the matrix multiplication between `previous_state` and the
    # transpose of `hidden_states` (`hidden_states.T`).
    scores = ...  # FINISH ME

    # Hint: Now that you have the scores, you need to convert them into probabilities.
    # A softmax function is typically used in attention mechanisms to turn raw scores into probabilities
    # that sum to 1. This will help in determining how much focus should be placed on each hidden state.
    # Hint: Use `jax.nn.softmax` to apply the softmax function to `scores`.
    w_n = ...  # FINISH ME

    # Multiply the weights by the hidden states to get the context vector
    # Hint: Use `jnp.matmul` again to multiply the attention weights `w_n` by `hidden_states`
    # to get the context vector.
    c_t = jnp.matmul(w_n, hidden_states)

    return w_n, c_t

In [None]:
# @title Run me to test your code

key = jax.random.PRNGKey(42)
x = jax.random.normal(key, [2, 2])

try:
  w_n, c_t = dot_product_attention(x, x)

  w_n_correct = jnp.array([[0.9567678, 0.04323225], [0.00121029, 0.99878967]])
  c_t_correct = jnp.array([[0.11144122, 0.95290256], [-1.5571996, -1.5321486]])
  assert jnp.allclose(w_n_correct, w_n), "w_n is not calculated correctly"
  assert jnp.allclose(c_t_correct, c_t), "c_t is not calculated correctly"

  print("It seems correct. Look at the answer below to compare methods.")
except Exception:
  print("It looks like the function isn't fully implemented yet. Try modifying it.")

It looks like the function isn't fully implemented yet. Try modifying it.


In [None]:
# @title Answer to code task (Try not to peek until you've given it a good try!)
# Note for the word-list cell further down: when changing the words there, any word that is
# not in the original training corpus will not be shown in the weight matrix plot.
def dot_product_attention(hidden_states, previous_state):
    # Calculate the attention scores:
    # Multiply the previous state matrix by the transpose of the hidden states matrix.
    # This gives us a matrix of scores that show how much attention each element in the previous state
    # should pay to each element in the hidden states.
    # The result is a matrix of shape [T_previous, T_hidden], where:
    # T_previous is the number of elements in the previous state,
    # T_hidden is the number of elements in the hidden states.
    scores = jnp.matmul(previous_state, hidden_states.T)

    # Apply the softmax function to the scores to convert them into probabilities.
    # By default the softmax is taken over the last axis, so each row (one per previous-state
    # element) sums to 1 and tells us how much attention that element gives to each hidden state.
    w_n = jax.nn.softmax(scores)

    # Calculate the context vector (c_t):
    # Multiply the attention weights (w_n) by the hidden states.
    # This combines the hidden states based on how much attention each one deserves,
    # resulting in a new vector that represents the weighted sum of the hidden states.
    # The resulting shape is [T, d], where:
    # T is the number of elements in the previous state,
    # d is the dimension of the hidden states.
    c_t = jnp.matmul(w_n, hidden_states)

    # Return the attention weights and the context vector.
    return w_n, c_t
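
A quick way to sanity-check the answer is to look at the shapes it returns and confirm that every row of attention weights sums to 1. The sketch below repeats the function so that it runs on its own; the toy sizes (5 hidden states, 3 previous states, embedding size 4) are assumptions chosen only for illustration.

```python
import jax
import jax.numpy as jnp


def dot_product_attention(hidden_states, previous_state):
    # Same computation as the answer cell, repeated so this snippet is self-contained.
    scores = jnp.matmul(previous_state, hidden_states.T)
    w_n = jax.nn.softmax(scores)
    c_t = jnp.matmul(w_n, hidden_states)
    return w_n, c_t


# Hypothetical toy sizes: 5 hidden states, 3 previous states, embedding size 4.
key_h, key_p = jax.random.split(jax.random.PRNGKey(0))
hidden_states = jax.random.normal(key_h, (5, 4))
previous_state = jax.random.normal(key_p, (3, 4))

w_n, c_t = dot_product_attention(hidden_states, previous_state)

print(w_n.shape)         # one row per previous state, one column per hidden state
print(c_t.shape)         # one context vector per previous state
print(w_n.sum(axis=-1))  # each row of attention weights sums to (approximately) 1
```

Using different lengths for the two inputs makes it easier to see which axis belongs to which input.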


In [None]:
words = ["king", "queen", "royalty", "food", "apple", "pear", "computers"]
word_embeddings, words = get_word2vec_embedding(words)
weights, _ = dot_product_attention(word_embeddings, word_embeddings)
plot_attention_weight_matrix(weights, words, words)

Looking at the matrix, we can see which words have similar meanings. The "royal" words have high attention scores with each other, and so do the "food" words, while the scores between the two groups are lower. We also see that "computers" has very low attention scores with all the other words, which shows that it is closely related to neither the "royal" nor the "food" words.

**Group task:**
  - Play with the word selections above. See if you can find word combinations whose attention values seem counter-intuitive. Think of possible explanations. Which sense of a word did the attention scores capture?
  - Ask your friend if they found examples.

**Note**: The dot product is only one way to implement the scoring function for attention mechanisms. This [blog post](https://lilianweng.github.io/posts/2018-06-24-attention/#summary) by Dr Lilian Weng gives a more extensive list.
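
As one example of an alternative scoring function, the scaled dot product divides the scores by the square root of the embedding size before the softmax. The sketch below is a minimal illustration, not part of the practical's code: the function name `scaled_dot_product_attention` and the toy input sizes are assumptions.

```python
import jax
import jax.numpy as jnp


def scaled_dot_product_attention(hidden_states, previous_state):
    # Divide the raw dot products by sqrt(d) so the scores do not grow with the embedding size.
    d = hidden_states.shape[-1]
    scores = jnp.matmul(previous_state, hidden_states.T) / jnp.sqrt(d)
    w_n = jax.nn.softmax(scores, axis=-1)
    c_t = jnp.matmul(w_n, hidden_states)
    return w_n, c_t


# Hypothetical toy input: 4 vectors with embedding size 64.
x = jax.random.normal(jax.random.PRNGKey(0), (4, 64))

w_scaled, _ = scaled_dot_product_attention(x, x)
w_plain = jax.nn.softmax(jnp.matmul(x, x.T), axis=-1)

# The largest weight in each row: scaling makes the distribution less sharply peaked.
print(w_plain.max(axis=-1))
print(w_scaled.max(axis=-1))
```

With large embedding sizes, unscaled dot products tend to push the softmax towards a single dominant weight, which is the motivation usually given for the scaling.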

More resources:

[A basic encoder-decoder model for machine translation](https://www.youtube.com/watch?v=gHk2IWivt_8&list=PLmZlBIcArwhPHmHzyM_cZJQ8_v5paQJTV&index=1)

[Training and loss for encoder-decoder models](https://www.youtube.com/watch?v=aBZUTuT1Izs&list=PLmZlBIcArwhPHmHzyM_cZJQ8_v5paQJTV&index=2)

[Basic attention](https://www.youtube.com/watch?v=BSSoEtv5jvQ&list=PLmZlBIcArwhPHmHzyM_cZJQ8_v5paQJTV&index=6)

### **Understanding NLP and LLMs**

**What’s the difference?**

NLP (Natural Language Processing) is the broader field focused on enabling computers to understand, interpret, and generate human language. NLP encompasses many techniques and tasks such as sentiment analysis, named entity recognition, and machine translation.


LLMs (Large Language Models) are a powerful subset of NLP models characterized by their massive size, extensive training data, and ability to perform a wide range of language tasks with minimal task-specific training. Models like the Llama, GPT, or Claude series are examples of LLMs that have revolutionized what’s possible in NLP.

We’ve explored how tokenizers work and looked at tokenization, conversion to input IDs, padding, truncation, and attention masks.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)

sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

model_inputs = tokenizer(sequences)

# Will pad the sequences up to the maximum sequence length
model_inputs = tokenizer(sequences, padding="longest")

# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")

# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)

sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)

# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)

sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")

# Returns TensorFlow tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")

# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")

sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)

[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]
[CLS] i've been waiting for a huggingface course my whole life. [SEP]
i've been waiting for a huggingface course my whole life.
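
To make padding, truncation, and attention masks more concrete, here is a pure-Python sketch of roughly what happens to the lists of token IDs. It does not call the `transformers` library: the helper `pad_and_mask`, the padding ID of 0, and the token IDs in `batch` are all assumptions for illustration. A real tokenizer also keeps the final `[SEP]` token when truncating, which this simplified version does not.

```python
def pad_and_mask(batch_ids, max_length=None, pad_id=0):
    """Truncate or pad lists of token IDs to a common length and build attention masks."""
    if max_length is None:
        # Similar in spirit to padding="longest": use the longest sequence in the batch.
        max_length = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        ids = ids[:max_length]                # naive truncation (drops trailing tokens)
        n_pad = max_length - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token IDs for two sequences of different lengths.
batch = [[101, 1045, 2293, 17953, 2361, 102], [101, 2061, 102]]

out = pad_and_mask(batch)
print(out["input_ids"])       # [[101, 1045, 2293, 17953, 2361, 102], [101, 2061, 102, 0, 0, 0]]
print(out["attention_mask"])  # [[1, 1, 1, 1, 1, 1], [1, 1, 1, 0, 0, 0]]

short = pad_and_mask(batch, max_length=4)
print(short["input_ids"])     # [[101, 1045, 2293, 17953], [101, 2061, 102, 0]]
```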



In [None]:
import transformers
from transformers import pipeline, AutoTokenizer, AutoModel

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoded_input = bert_tokenizer("The practical is so much fun")
print(f"Token IDs: {encoded_input['input_ids']}")

Token IDs: [101, 1109, 6691, 1110, 1177, 1277, 4106, 102]


Here we can see that the tokeniser returns the IDs for each token, as shown in the figure. But counting the number of IDs, we see that it is larger than the number of words in the sentence. Let's print the tokens associated with each ID.


In [None]:
print(f"Tokens: {bert_tokenizer.decode(encoded_input['input_ids'])}")

Tokens: [CLS] The practical is so much fun [SEP]


We can see the tokeniser attaches new tokens, `[CLS]` and `[SEP]`, to the start and end of the sequence. This is a BERT-specific requirement for training and inference. Adding special tokens is a very common thing to do. Using special tokens, we can tell a model when a sentence starts or ends or when a new part of the input starts. This can be helpful when performing different tasks.

For instance, some transformers, such as BERT, are pretrained with what is known as masked prediction. Random tokens in a sequence are replaced by the `[MASK]` token, and the model is trained to predict the ID of the original token at each masked position.
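
The masking step can be sketched in a few lines of JAX. This is a simplified illustration under stated assumptions, not the exact BERT recipe: the `[MASK]` ID of 103, the special-token IDs 101 and 102, the placeholder label `IGNORE_ID`, and the helper name `mask_tokens` are all assumed here, and the masking probability is raised above the 15% used by BERT so that the short example is likely to show an effect.

```python
import jax
import jax.numpy as jnp

MASK_ID = 103    # assumed ID of the [MASK] token
IGNORE_ID = -1   # hypothetical label for positions the model is not asked to predict


def mask_tokens(token_ids, key, mask_prob=0.15, special_ids=(101, 102)):
    # Never mask special tokens such as [CLS] and [SEP].
    is_special = jnp.isin(token_ids, jnp.asarray(special_ids))
    # Independently choose each remaining position with probability mask_prob.
    chosen = jax.random.bernoulli(key, mask_prob, token_ids.shape) & ~is_special
    masked_inputs = jnp.where(chosen, MASK_ID, token_ids)
    # The training target is the original ID at masked positions only.
    labels = jnp.where(chosen, token_ids, IGNORE_ID)
    return masked_inputs, labels


# Token IDs printed earlier for "The practical is so much fun".
token_ids = jnp.array([101, 1109, 6691, 1110, 1177, 1277, 4106, 102])

masked_inputs, labels = mask_tokens(token_ids, jax.random.PRNGKey(0), mask_prob=0.3)
print(masked_inputs)
print(labels)
```

The model sees `masked_inputs` and is scored only on the positions where `labels` holds an original token ID.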

**Drawback of using raw tokens**:

One drawback of using raw tokens is that they lack any indication of the word's position in the sequence. This is evident when considering sentences like "I am happy" and "Am I happy" - these two phrases have distinct meanings, and the model needs to grasp the word order to understand the intended message accurately.

To address this, when converting the inputs into vectors, position vectors are introduced and added to these vectors to indicate the **position** of each word.


#### 2.2.2 Positional encodings

In most domains where a transformer can be utilised, there is an underlying order to the tokens produced, be it the order of words in a sentence, the location from which patches are taken in an image or even the steps taken in an RL environment. This order is very important in all cases; just imagine you interpret the sentence "I have to read this book." as "I have this book to read.". Both sentences contain the exact same words, yet they have completely different meanings based on the order.

As both the encoder and the decoder blocks process all tokens in parallel, the order of tokens is lost in these calculations. To cope with this, the sequence order has to be injected into the tokens directly. This can be done by adding *positional encodings* to the tokens at the start of the encoder and decoder blocks (though some of the latest techniques add positional information in the attention blocks). An example of how positional encodings alter the tokens is shown below.


\\

<img src="https://drive.google.com/uc?export=view&id=1eSgnVN2hnEsrjdHygDGwk1kxEi8-dcFo" alt="drawing" width="650"/>
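
To see why the order is lost, the sketch below applies plain dot-product self-attention to a set of toy token vectors and to a shuffled copy of them. The helper `self_attention`, the random token vectors, and the very simple integer position offsets are assumptions for illustration only; they are not the encodings used in the rest of this section.

```python
import jax
import jax.numpy as jnp


def self_attention(x):
    # Dot-product self-attention without any learned weights.
    w = jax.nn.softmax(jnp.matmul(x, x.T), axis=-1)
    return jnp.matmul(w, x)


tokens = jax.random.normal(jax.random.PRNGKey(0), (3, 4))  # 3 toy token vectors
perm = jnp.array([1, 0, 2])                                # swap the first two tokens

# Without positional information, shuffling the inputs just shuffles the outputs:
# each token gets exactly the same result regardless of where it sits.
out = self_attention(tokens)
out_shuffled = self_attention(tokens[perm])
equivariant = jnp.allclose(out[perm], out_shuffled, atol=1e-5)
print(equivariant)

# Add a toy position signal (0, 1, 2 added to every dimension) before attention.
positions = jnp.arange(3, dtype=jnp.float32)[:, None] * jnp.ones((1, 4))
out_pos = self_attention(tokens + positions)
out_pos_shuffled = self_attention(tokens[perm] + positions)
order_blind_with_pos = jnp.allclose(out_pos[perm], out_pos_shuffled, atol=1e-5)
print(order_blind_with_pos)
```

Once position information is added to the tokens, the same words in a different order no longer produce the same outputs.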

Ideally, these encodings should have the following characteristics ([source](https://kazemnejad.com/blog/transformer_architecture_positional_encoding/)):
* Each time step should have a unique encoding.
* The distance between any two time steps should be consistent across sequences of different lengths.
* The encoding should be able to generalise to longer sequences than seen during training.
* The encoding must be deterministic.

##### **Sine and cosine functions**


In [Attention Is All You Need](https://arxiv.org/abs/1706.03762), the authors used a method that can satisfy all these requirements. Each dimension of the encoding is a sine or cosine wave with its own frequency: even embedding indices use a sine and odd indices use a cosine. The formula for the encoding at position $D$ is shown below, where $i$ is the embedding index and $d_m$ is the token embedding size.

\\

$P_{D}= \begin{cases}\sin \left(\frac{D}{10000^{i/d_{m}}}\right), & \text { if } i \bmod 2=0 \\ \cos \left(\frac{D}{10000^{(i-1)/d_{m}}}\right), & \text { otherwise } \end{cases}$

\

Assuming our model has $d_m=8$, the position embedding will look like this:

\
$P_{D}=\left[\begin{array}{c}\sin \left(\frac{D}{10000^{0/8}}\right)\\ \cos \left(\frac{D}{10000^{0/8}}\right)\\ \sin \left(\frac{D}{10000^{2/8}}\right)\\ \cos \left(\frac{D}{10000^{2/8}}\right)\\ \sin \left(\frac{D}{10000^{4/8}}\right)\\ \cos \left(\frac{D}{10000^{4/8}}\right)\\ \sin \left(\frac{D}{10000^{6/8}}\right)\\ \cos \left(\frac{D}{10000^{6/8}}\right)\end{array}\right]$
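
As a small worked example for $d_m=8$, the sketch below evaluates the denominators for the even embedding indices 0, 2, 4 and 6 and then builds the encoding for a single position. The choice of position $D=3$ is arbitrary and only for illustration.

```python
import jax.numpy as jnp

d_m = 8
i = jnp.arange(0, d_m, 2)             # even embedding indices: 0, 2, 4, 6
denominators = 10000.0 ** (i / d_m)   # approximately 1, 10, 100, 1000

D = 3                                 # an arbitrary example position
angles = D / denominators

# Interleave sine and cosine so that even indices hold sines and odd indices hold cosines.
P_D = jnp.stack([jnp.sin(angles), jnp.cos(angles)], axis=-1).reshape(-1)

print(denominators)
print(P_D)
```

Each sine and cosine pair shares one denominator, so moving along the embedding gives waves that change more and more slowly with position.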

\\

Let's first create a function that can return these encodings to understand why this will work.

In [None]:
def return_frequency_pe_matrix(token_sequence_length, token_embedding):

  assert token_embedding % 2 == 0, "token_embedding should be divisible by two"

  P = jnp.zeros((token_sequence_length, token_embedding))
  positions = jnp.arange(0, token_sequence_length)[:, jnp.newaxis]

  i = jnp.arange(0, token_embedding, 2)
  frequency_steps = jnp.exp(i * (-math.log(10000.0) / token_embedding))
  frequencies = positions * frequency_steps

  P = P.at[:, 0::2].set(jnp.sin(frequencies))
  P = P.at[:, 1::2].set(jnp.cos(frequencies))

  return P

In [None]:
token_sequence_length = 50  # Number of tokens the model will need to process
token_embedding = 10000  # token embedding (and positional encoding) dimensions, ensure it is divisible by two
P = return_frequency_pe_matrix(token_sequence_length, token_embedding)
P

Array([[ 0.00000000e+00,  1.00000000e+00,  0.00000000e+00, ...,
         1.00000000e+00,  0.00000000e+00,  1.00000000e+00],
       [ 8.41471016e-01,  5.40302277e-01,  8.40475261e-01, ...,
         1.00000000e+00,  1.00184407e-04,  1.00000000e+00],
       [ 9.09297466e-01, -4.16146815e-01,  9.10822988e-01, ...,
         1.00000000e+00,  2.00368813e-04,  1.00000000e+00],
       ...,
       [ 1.23573124e-01, -9.92335498e-01,  2.08839417e-01, ...,
         9.99988854e-01,  4.70865006e-03,  9.99988914e-01],
       [-7.68254697e-01, -6.40144348e-01, -7.08784282e-01, ...,
         9.99988377e-01,  4.80883289e-03,  9.99988437e-01],
       [-9.53752637e-01,  3.00592542e-01, -9.76946890e-01, ...,
         9.99987900e-01,  4.90901619e-03,  9.99987960e-01]],      dtype=float32)

Looking at the matrix above, we can see that each position index (each row) gets its own pattern of values. Because the function is deterministic, a given position index always receives the same encoding.
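
Some of the desired characteristics can be checked numerically. The sketch below repeats `return_frequency_pe_matrix` in a compact form so that it runs on its own, then checks three things for small, arbitrarily chosen sizes (50 and 100 positions, embedding size 16): the encoding of a position does not depend on the sequence length, no two positions share an encoding, and all values stay within a fixed range.

```python
import math

import jax.numpy as jnp


def return_frequency_pe_matrix(token_sequence_length, token_embedding):
    # Compact copy of the function defined earlier in this section.
    positions = jnp.arange(0, token_sequence_length)[:, jnp.newaxis]
    i = jnp.arange(0, token_embedding, 2)
    frequencies = positions * jnp.exp(i * (-math.log(10000.0) / token_embedding))
    P = jnp.zeros((token_sequence_length, token_embedding))
    P = P.at[:, 0::2].set(jnp.sin(frequencies))
    P = P.at[:, 1::2].set(jnp.cos(frequencies))
    return P


P_short = return_frequency_pe_matrix(50, 16)
P_long = return_frequency_pe_matrix(100, 16)

# Deterministic and independent of sequence length: the first 50 rows agree.
same_prefix = jnp.allclose(P_short, P_long[:50])

# Unique: the distance between any two different positions is greater than zero.
diffs = P_short[:, None, :] - P_short[None, :, :]
dists = jnp.linalg.norm(diffs, axis=-1)
min_distance = (dists + jnp.eye(50) * 1e9).min()  # large diagonal hides self-distances

# Bounded: sines and cosines always stay between -1 and 1.
bounded = jnp.abs(P_short).max() <= 1.0 + 1e-6

print(same_prefix, min_distance, bounded)
```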

### **Group Activity**:

- <font color='blue'>Take a moment with your friend to explore why this specific pattern appears when `token_sequence_length` is set to 1000, and `token_embedding` is 768.</font>
- <font color='blue'>Experiment with smaller values for `token_sequence_length` and `token_embedding` to build a deeper understanding and enhance your discussion.</font>
- <font color='blue'>Curious about the constant 10000? Ask your friend why they think it’s used in the functions above.</font>
- <font color='blue'>Now, try setting `token_sequence_length` to 50 and `token_embedding` to a much larger value, like 10000. What do you observe? Do we always need a large token embedding?</font>


## **Conclusion**
**Summary:**

You've now mastered the essentials of how a Large Language Model (LLM) works, from the fundamentals of attention mechanisms to training your own LLM! These powerful tools have the potential to transform a wide range of tasks. However, like any deep learning model, their magic lies in applying them to the right problems with the right data.

Ready to take your skills to the next level? Dive into fine-tuning your own LLMs and unleash even more potential! I highly recommend exploring last year's practical on Parameter Efficient Fine-Tuning Methods for a comprehensive overview of advanced techniques. The journey doesn't stop here—there's so much more to discover!

The world of LLMs is yours to explore—go ahead and create something amazing! 🌟🚀

---