In [1]:
# Check which Python executable is being used
import sys
print(f"Python executable: {sys.executable}")
print(f"Python version: {sys.version}")

# Check if we're in the venv
if 'venv_py312' in sys.executable:
    print("✅ Using venv_py312 virtual environment")
else:
    print("❌ NOT using venv_py312 - using global Python installation")
    print(f"   Expected path should contain: venv_py312")
    print(f"   Actual path: {sys.executable}")

Python executable: d:\Study Stuff\RAG model\Implementation Locally\venv_py312\Scripts\python.exe
Python version: 3.12.10 (tags/v3.12.10:0cc8128, Apr  8 2025, 12:21:36) [MSC v.1943 64 bit (AMD64)]
✅ Using venv_py312 virtual environment


# Understanding RAG (Retrieval Augmented Generation) Implementation

RAG combines two main components:
1. **Retrieval**: Finding relevant information from a knowledge base
2. **Generation**: Using that information to generate accurate answers

## Phase 1: Setup and Data Preparation
- Loading and processing documents
- Converting text into chunks
- Creating embeddings for efficient searching
- Setting up the language model for generation

Let's go through each step in detail:

# Environment Setup
This section sets up all required dependencies and configurations for the Hugging Face environment.

In [2]:
# Install CUDA-enabled PyTorch
import sys
import subprocess

print("Installing PyTorch with CUDA 12.4 support...")
subprocess.check_call([sys.executable, "-m", "pip", "install", "--upgrade", 
                       "torch", "--index-url", "https://download.pytorch.org/whl/cu124", "-q"])
print("PyTorch with CUDA support installed!")

# Install required packages
import subprocess

def install_package(package):
    try:
        __import__(package)
    except ImportError:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", package])
        print(f"{package} installed successfully!")

# Install essential packages
packages = [
    "huggingface_hub",
    "transformers",
    "accelerate",
    "bitsandbytes",
    "sentencepiece",
    "safetensors",
    "requests"
]

for package in packages:
    install_package(package)

print("All required packages installed successfully!")

Installing PyTorch with CUDA 12.4 support...
PyTorch with CUDA support installed!
Installing huggingface_hub...
huggingface_hub installed successfully!
Installing transformers...
transformers installed successfully!
Installing accelerate...
accelerate installed successfully!
Installing bitsandbytes...
bitsandbytes installed successfully!
Installing sentencepiece...
sentencepiece installed successfully!
Installing requests...
requests installed successfully!
All required packages installed successfully!


In [5]:
# Install PyTorch with CUDA 12.1 support (compatible with your CUDA 13.1)
import sys
import subprocess

print("Installing PyTorch with CUDA 12.1 support...")
subprocess.check_call([sys.executable, "-m", "pip", "install", "--upgrade", 
                       "torch", "torchvision", "torchaudio", 
                       "--index-url", "https://download.pytorch.org/whl/cu121"])
print("PyTorch with CUDA 12.1 support installed!")

# Verify CUDA is available
import torch
print(f"\n✅ CUDA Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"  CUDA Version: {torch.version.cuda}")
    print(f"  GPU: {torch.cuda.get_device_name(0)}")

Installing PyTorch with CUDA 12.1 support...
PyTorch with CUDA 12.1 support installed!

✅ CUDA Available: False


In [2]:
# Check CUDA availability and setup GPU settings
import torch
import transformers
import os

print("=" * 70)
print("GPU AVAILABILITY CHECK & CONFIGURATION")
print("=" * 70)

# Verify CUDA is available - REQUIRED for this notebook
if not torch.cuda.is_available():
    raise RuntimeError("CUDA is not available! This notebook requires GPU acceleration. "
                       "Please ensure you have PyTorch installed with CUDA support.")

print("\n📊 System Information:")
print(f"  PyTorch Version: {torch.__version__}")

# Check NVIDIA driver
print("\n🔧 NVIDIA Driver Information:")
os.system("nvidia-smi --query-gpu=name,driver_version,cuda_version --format=csv,noheader")

# Set GPU device
device = "cuda"
gpu_count = torch.cuda.device_count()
gpu_name = torch.cuda.get_device_name(0)
gpu_memory = torch.cuda.get_device_properties(0).total_memory / (1024**3)

print(f"\n✅ GPU IS AVAILABLE AND READY!")
print(f"  Number of GPUs: {gpu_count}")
print(f"  GPU Name: {gpu_name}")
print(f"  GPU Memory: {gpu_memory:.2f} GB")
if torch.version.cuda:
    print(f"  CUDA Version: {torch.version.cuda}")

# Test GPU
test_tensor = torch.randn(1000, 1000, device=device)
allocated = torch.cuda.memory_allocated() / (1024**2)
print(f"  Current GPU Memory Used: {allocated:.2f} MB")
del test_tensor
torch.cuda.empty_cache()

# Set default device
torch.set_default_device(device)

# Check BitsAndBytes
try:
    import bitsandbytes
    print(f"\n📦 Optional Libraries:")
    print(f"  BitsAndBytes: ✓ Available (for 4-bit/8-bit quantization)")
except ImportError:
    print(f"\n📦 Optional Libraries:")
    print(f"  BitsAndBytes: ✗ Not available")

print(f"\n✨ GPU Environment Ready!")
print(f"  Device: {device.upper()}")
print(f"  PyTorch: {torch.__version__}")
print(f"  Transformers: {transformers.__version__}")
print("=" * 70 + "\n")

GPU AVAILABILITY CHECK & CONFIGURATION

📊 System Information:
  PyTorch Version: 2.5.1+cu121

🔧 NVIDIA Driver Information:

✅ GPU IS AVAILABLE AND READY!
  Number of GPUs: 1
  GPU Name: NVIDIA GeForce RTX 3050 Laptop GPU
  GPU Memory: 4.00 GB
  CUDA Version: 12.1
  Current GPU Memory Used: 3.81 MB

📦 Optional Libraries:
  BitsAndBytes: ✓ Available (for 4-bit/8-bit quantization)

✨ GPU Environment Ready!
  Device: CUDA
  PyTorch: 2.5.1+cu121
  Transformers: 5.0.0



In [3]:
# Install huggingface_hub if not already installed
try:
    import huggingface_hub
except ImportError:
    print("Installing huggingface_hub...")
    !pip install --quiet huggingface_hub

# Authenticate with Hugging Face
from huggingface_hub import login

# Login with the token (never share or commit this token)
login(token="YOUR_HF_TOKEN_HERE")
print("Successfully logged in to Hugging Face!")

Successfully logged in to Hugging Face!


In [4]:
# GPU Configuration - CUDA Required
import torch
import os

print("\n" + "=" * 70)
print("GPU CONFIGURATION (CUDA Required)")
print("=" * 70)

# Verify CUDA is available
if not torch.cuda.is_available():
    raise RuntimeError("CUDA is not available! Please install a CUDA-enabled PyTorch build in this environment.")

print("\n✅ CUDA Available - GPU Mode Enabled")
print(f"   Device: {torch.cuda.get_device_name(0)}")
print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

# GPU Optimization Settings
print("\n🚀 Applying GPU Optimizations...")

# Enable cuDNN auto-tuner for best performance
torch.backends.cudnn.benchmark = True
print("  ✓ CuDNN auto-tuning enabled")

# Enable Flash Attention for faster computations
torch.backends.cuda.enable_flash_sdp(True)
print("  ✓ Flash Attention enabled (faster, lower memory)")

# Note: CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous. It is a debugging aid,
# not a memory optimization, and it is read when CUDA initializes, so setting it this late may have no effect.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
print("  ✓ CUDA_LAUNCH_BLOCKING=1 set (synchronous kernel launches)")

# For RTX 3050 (4GB VRAM), cap this process at 90% of GPU memory
torch.cuda.set_per_process_memory_fraction(0.9)
print("  ✓ Memory fraction set to 90%")

# Clear GPU cache
torch.cuda.empty_cache()
print("  ✓ GPU cache cleared")

print("\n" + "=" * 70)
print("🎯 Ready for GPU acceleration! Starting embeddings on GPU...")
print("=" * 70)


GPU CONFIGURATION (CUDA Required)

✅ CUDA Available - GPU Mode Enabled
   Device: NVIDIA GeForce RTX 3050 Laptop GPU
   Memory: 4.0 GB

🚀 Applying GPU Optimizations...
  ✓ CuDNN auto-tuning enabled
  ✓ Flash Attention enabled (faster, lower memory)
  ✓ CUDA memory optimization enabled
  ✓ Memory fraction set to 90%
  ✓ GPU cache cleared

🎯 Ready for GPU acceleration! Starting embeddings on GPU...


## GPU Setup Status

### Current Status
- **GPU Detected**: ✅ NVIDIA RTX 3050 (4GB VRAM)
- **NVIDIA Drivers**: ✅ Installed (CUDA 13.1 capable)
- **PyTorch CUDA Support**: ❌ Not available for Python 3.14

### Issue
PyTorch 2.10.0 is compiled only for Python 3.13 and earlier. Python 3.14 requires newer PyTorch versions that aren't available yet on PyPI.

### Solutions

**Option 1: Use PyTorch CPU (Current)**
- Processing will use your CPU
- For embeddings with sentence-transformers, you can still use quantization
- Takes longer but produces same results

**Option 2: Downgrade Python (Recommended for GPU)**
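```bash
# pip cannot install or downgrade the Python interpreter itself.
# Install Python 3.13 or earlier separately (e.g. from python.org), then create a fresh virtual environment (this notebook uses 3.12):
py -3.12 -m venv venv_py312
venv_py312\Scripts\activate
# Then reinstall PyTorch with CUDA support
pip install --upgrade torch --index-url https://download.pytorch.org/whl/cu121
```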

**Option 3: Build PyTorch from source**
- Advanced option, requires CUDA toolkit and compiler
- See: https://github.com/pytorch/pytorch#from-source
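Option 1 amounts to choosing the device at runtime instead of requiring CUDA. The following is a minimal sketch of that idea, assuming PyTorch may or may not be installed; `pick_device` is a hypothetical helper name and is not used elsewhere in this notebook.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" when a CUDA-enabled PyTorch build can see a GPU, else "cpu"."""
    # Hypothetical helper: avoids a hard failure when torch is missing or CPU-only.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```

Cells that currently raise a `RuntimeError` when CUDA is missing could use a value like this instead, at the cost of slower processing on CPU.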

In [5]:
# Notebook config: prompt for HF token and model selection (safer: no plaintext tokens saved)
# This cell will run interactively in the notebook. It sets the token only for the running kernel.
# Do NOT commit this notebook to version control if you enter a real token.
import os
import getpass

print("This cell will ask for your Hugging Face token (it will not be saved to disk).")
print("If you prefer to use an environment variable or the CLI, leave the token blank and press enter.")

try:
    token = getpass.getpass(prompt='Enter Hugging Face token (leave blank to skip): ')
except Exception:
    # In some notebook frontends getpass may not work; fall back to input with a warning
    print("Warning: secure prompt unavailable, using visible input.")
    token = input('Enter Hugging Face token (leave blank to skip): ')

if token:
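    # HF_TOKEN is the variable huggingface_hub reads; the original name is set as well
    os.environ['HF_TOKEN'] = token
    os.environ['HUGGINGFACE_HUB_TOKEN'] = token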
    print('[CONFIG] Token set in-kernel for this session.')
else:
    print('[CONFIG] No token provided. Will use existing environment or anonymous access.')

# Prompt for model_id with a sensible default. Use an open model for quick testing if unsure.
def _input_with_default(prompt, default):
    resp = input(f"{prompt} [{default}]: ").strip()
    return resp if resp else default

default_model = 'google/gemma-2b-it'  # gated by default
chosen_model = _input_with_default('Model id to use (enter a model id or press enter for default)', default_model)

if chosen_model == default_model:
    print('\n[NOTICE] The default model is a gated model. Make sure you have accepted its terms on Hugging Face and that your token has access.')
    print('If you have not agreed to the model terms, consider using an open model like "sshleifer/tiny-gpt2" for testing.')

# Ask about quantization
use_q = _input_with_default('Use 4-bit quantization (y/N)', 'N')
use_quantization_config = use_q.lower().startswith('y')

# Publish into notebook globals so downstream cells can read them
globals()['model_id'] = chosen_model
globals()['use_quantization_config'] = use_quantization_config

print(f"\n[CONFIG] model_id={chosen_model}")
print(f"[CONFIG] use_quantization_config={use_quantization_config}")


This cell will ask for your Hugging Face token (it will not be saved to disk).
If you prefer to use an environment variable or the CLI, leave the token blank and press enter.
[CONFIG] Token set in-kernel for this session.

[NOTICE] The default model is a gated model. Make sure you have accepted its terms on Hugging Face and that your token has access.
If you have not agreed to the model terms, consider using an open model like "sshleifer/tiny-gpt2" for testing.

[CONFIG] model_id=google/gemma-2b-it
[CONFIG] use_quantization_config=True


# Hugging Face Authentication
First, we'll set up authentication with Hugging Face to access the models. You'll need to provide your Hugging Face token, which you can get from [your Hugging Face account settings](https://huggingface.co/settings/tokens).
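One way to avoid pasting a token into a cell is to read it from the environment. The sketch below is illustrative only: `resolve_hf_token` is a hypothetical helper, `HF_TOKEN` is the variable `huggingface_hub` reads, and `HUGGINGFACE_HUB_TOKEN` is the name the config cell in this notebook sets.

```python
import os


def resolve_hf_token(env=os.environ):
    """Return the first non-empty token found in the environment, or None."""
    # Hypothetical helper: checks the standard variable first, then the
    # name used by this notebook's config cell.
    for name in ("HF_TOKEN", "HUGGINGFACE_HUB_TOKEN"):
        value = env.get(name, "").strip()
        if value:
            return value
    return None


token = resolve_hf_token()
print("Token found in environment." if token else "No token in environment.")
```

If a token is found, it can be passed to `login(token=token)`; otherwise the interactive prompt or anonymous access remains available.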

In [6]:
# Install huggingface_hub if not already installed
try:
    import huggingface_hub
except ImportError:
    print("Installing huggingface_hub...")
    !pip install --quiet huggingface_hub

# Authenticate with Hugging Face
from huggingface_hub import login

# Login with the token (never share or commit this token)
login(token="YOUR_HF_TOKEN_HERE")
print("Successfully logged in to Hugging Face!")

Successfully logged in to Hugging Face!


<a target="_blank" href="https://colab.research.google.com/github/mrdbourke/simple-local-rag/blob/main/00-simple-local-rag.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Create and run a local RAG pipeline from scratch

The goal of this notebook is to build a RAG (Retrieval Augmented Generation) pipeline from scratch and have it run on a local GPU.

Specifically, we'd like to be able to open a PDF file, ask questions (queries) of it and have them answered by a Large Language Model (LLM).

There are frameworks that replicate this kind of workflow, including [LlamaIndex](https://www.llamaindex.ai/) and [LangChain](https://www.langchain.com/). However, the goal of building from scratch is to be able to inspect and customize all the parts.

## What is RAG?

RAG stands for Retrieval Augmented Generation.

It was introduced in the paper [*Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks*](https://arxiv.org/abs/2005.11401).

Each step can be roughly broken down to:

* **Retrieval** - Seeking relevant information from a source given a query. For example, getting relevant passages of Wikipedia text from a database given a question.
* **Augmented** - Using the relevant retrieved information to modify an input to a generative model (e.g. an LLM).
* **Generation** - Generating an output given an input. For example, in the case of an LLM, generating a passage of text given an input prompt.
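To make the three steps concrete, here is a toy sketch. It is not the pipeline built later in this notebook: retrieval here is plain word overlap standing in for vector search, the passages are made up for illustration, and the generation step is left as a comment because it needs an LLM.

```python
def retrieve(query: str, passages: list[str], k: int = 1) -> list[str]:
    """Rank passages by how many query words they share (a stand-in for vector search)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        passages,
        key=lambda p: len(query_words & set(p.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def augment(query: str, context: list[str]) -> str:
    """Build a prompt that places the retrieved context ahead of the question."""
    context_block = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{context_block}\n\nQuestion: {query}\nAnswer:"


# Made-up passages for illustration only
passages = [
    "A Java interface declares methods without implementing them.",
    "A package groups related Java classes together.",
]
query = "What does a Java interface declare?"

prompt = augment(query, retrieve(query, passages))
print(prompt)  # generation: this prompt would be passed to an LLM
```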

## Why RAG?

The main goal of RAG is to improve the generation outputs of LLMs.

Two primary improvements can be seen as:
1. **Preventing hallucinations** - LLMs are incredible but they are prone to potential hallucination, as in, generating something that *looks* correct but isn't. RAG pipelines can help LLMs generate more factual outputs by providing them with factual (retrieved) inputs. And even if the generated answer from a RAG pipeline doesn't seem correct, because of retrieval, you also have access to the sources where it came from.
2. **Work with custom data** - Many base LLMs are trained with internet-scale text data. This means they have a great ability to model language, however, they often lack specific knowledge. RAG systems can provide LLMs with domain-specific data such as medical information or company documentation and thus customize their outputs to suit specific use cases.

The authors of the original RAG paper mentioned above outlined these two points in their discussion.

> This work offers several positive societal benefits over previous work: the fact that it is more
strongly grounded in real factual knowledge (in this case Wikipedia) makes it “hallucinate” less
with generations that are more factual, and offers more control and interpretability. RAG could be
employed in a wide variety of scenarios with direct benefit to society, for example by endowing it
with a medical index and asking it open-domain questions on that topic, or by helping people be more
effective at their jobs.

RAG can also be a much quicker solution to implement than fine-tuning an LLM on specific data. 



## What kind of problems can RAG be used for?

RAG can help anywhere there is a specific set of information that an LLM may not have in its training data (e.g. anything not publicly accessible on the internet).

For example you could use RAG for:
* **Customer support Q&A chat** - By treating your existing customer support documentation as a resource, when a customer asks a question, you could have a system retrieve relevant documentation snippets and then have an LLM craft those snippets into an answer. Think of this as a "chatbot for your documentation". Klarna, a large financial company, [uses a system like this](https://www.klarna.com/international/press/klarna-ai-assistant-handles-two-thirds-of-customer-service-chats-in-its-first-month/) to save $40M per year on customer support costs.
* **Email chain analysis** - Let's say you're an insurance company with long threads of emails between customers and insurance agents. Instead of searching through each individual email, you could retrieve relevant passages and have an LLM create structured outputs of insurance claims.
* **Company internal documentation chat** - If you've worked at a large company, you know how hard it can be to get an answer sometimes. Why not let a RAG system index your company information and have an LLM answer questions you may have? The benefit of RAG is that you will have references to resources to learn more if the LLM answer doesn't suffice.
* **Textbook Q&A** - Let's say you're studying for your exams and constantly flicking through a large textbook looking for answers to your questions. RAG can help provide answers as well as references to learn more.

All of these have the common theme of retrieving relevant resources and then presenting them in an understandable way using an LLM.

From this angle, you can consider an LLM a calculator for words.


## Why local?

Privacy, speed, cost.

Running locally means you use your own hardware.

From a privacy standpoint, this means you don't have to send potentially sensitive data to an API.

From a speed standpoint, it means you won't necessarily have to wait for an API queue or downtime. If your hardware is running, the pipeline can run.

And from a cost standpoint, running on your own hardware often has a heavier starting cost but little to no costs after that.

Performance-wise, LLM APIs may still perform better than an open-source model running locally on general tasks, but there are more and more examples appearing of smaller, focused models outperforming larger models.


## Key terms

| Term | Description |
| ----- | ----- | 
| **Token** | A sub-word piece of text. For example, "hello, world!" could be split into ["hello", ",", "world", "!"]. A token can be a whole word,<br> part of a word or group of punctuation characters. 1 token ~= 4 characters in English, 100 tokens ~= 75 words.<br> Text gets broken into tokens before being passed to an LLM. |
| **Embedding** | A learned numerical representation of a piece of data. For example, a sentence of text could be represented by a vector with<br> 768 values. Similar pieces of text (in meaning) will ideally have similar values. |
| **Embedding model** | A model designed to accept input data and output a numerical representation. For example, a text embedding model may take in 384 <br>tokens of text and turn it into a vector of size 768. An embedding model can be, and often is, different from an LLM. |
| **Similarity search/vector search** | Similarity search/vector search aims to find two vectors which are close together in high-dimensional space. For example, <br>two pieces of similar text passed through an embedding model should have a high similarity score, whereas two pieces of text about<br> different topics will have a lower similarity score. Common similarity score measures are dot product and cosine similarity. |
| **Large Language Model (LLM)** | A model which has been trained to numerically represent the patterns in text. A generative LLM will continue a sequence when given a sequence. <br>For example, given a sequence of the text "hello, world!", a generative LLM may produce "we're going to build a RAG pipeline today!".<br> This generation will be highly dependent on the training data and prompt. |
| **LLM context window** | The number of tokens an LLM can accept as input. For example, as of March 2024, GPT-4 has a default context window of 32k tokens<br> (about 96 pages of text) but can go up to 128k if needed. A recent open-source LLM from Google, Gemma (March 2024) has a context<br> window of 8,192 tokens (about 24 pages of text). A higher context window means an LLM can accept more relevant information<br> to assist with a query. For example, in a RAG pipeline, if a model has a larger context window, it can accept more reference items<br> from the retrieval system to aid with its generation. |
| **Prompt** | A common term for describing the input to a generative LLM. The idea of "[prompt engineering](https://en.wikipedia.org/wiki/Prompt_engineering)" is to structure a text-based<br> (or potentially image-based as well) input to a generative LLM in a specific way so that the generated output is ideal. This technique is<br> possible because of an LLM's capacity for in-context learning, as in, it is able to use its representation of language to break down <br>the prompt and recognize what a suitable output may be (note: the output of LLMs is probabilistic, so terms like "may output" are used). | 
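The two similarity measures named in the table can be sketched in a few lines of plain Python. The vectors below are tiny made-up examples, not real embeddings (which would typically have hundreds of values).

```python
import math


def dot_product(a, b):
    """Sum of element-wise products; grows with both direction match and vector length."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product of the two vectors after scaling each to unit length."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)


v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]   # same direction as v1, twice the length
v3 = [-3.0, 0.0, 1.0]  # orthogonal to v1

print(dot_product(v1, v2))        # 28.0
print(cosine_similarity(v1, v2))  # approximately 1.0
print(cosine_similarity(v1, v3))  # 0.0
```

Cosine similarity ignores vector length, so `v1` and `v2` score as (almost exactly) identical in direction even though their dot product is large only because `v2` is longer.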




 ## What we're going to build

We're going to build a RAG pipeline which enables us to chat with a PDF document, specifically an open-source [nutrition textbook](https://pressbooks.oer.hawaii.edu/humannutrition2/), ~1200 pages long.

You could call our project NutriChat!

We'll write the code to:
1. Open a PDF document (you could use almost any PDF here).
2. Format the text of the PDF textbook ready for an embedding model (this process is known as text splitting/chunking).
3. Embed all of the chunks of text in the textbook and turn them into numerical representation which we can store for later.
4. Build a retrieval system that uses vector search to find relevant chunks of text based on a query.
5. Create a prompt that incorporates the retrieved pieces of text.
6. Generate an answer to a query based on passages from the textbook.

The above steps can be broken down into two major sections:
1. Document preprocessing/embedding creation (steps 1-3).
2. Search and answer (steps 4-6).

And that's the structure we'll follow.

It's similar to the workflow outlined on the NVIDIA blog which [details a local RAG pipeline](https://developer.nvidia.com/blog/rag-101-demystifying-retrieval-augmented-generation-pipelines/).

<img src="https://github.com/mrdbourke/simple-local-rag/blob/main/images/simple-local-rag-workflow-flowchart.png?raw=true" alt="flowchart of a local RAG workflow" />

## Requirements and setup

* Local NVIDIA GPU (I used an NVIDIA RTX 4090 on a Windows 11 machine) or Google Colab with access to a GPU.
* Environment setup (see [setup details on GitHub](https://github.com/mrdbourke/simple-local-rag/?tab=readme-ov-file#setup)).
* Data source (for example, a PDF). 
* Internet connection (to download the models, but once you have them, it'll run offline).

In [7]:
# Perform Google Colab installs (if running in Google Colab)
import os

if "COLAB_GPU" in os.environ:
    print("[INFO] Running in Google Colab, installing requirements.")
    !pip install -U torch # requires torch 2.1.1+ (for efficient sdpa implementation)
    !pip install PyMuPDF # for reading PDFs with Python
    !pip install tqdm # for progress bars
    !pip install sentence-transformers # for embedding models
    !pip install accelerate # for quantization model loading
    !pip install bitsandbytes # for quantizing models (less storage space)
    !pip install flash-attn --no-build-isolation # for faster attention mechanism = faster LLM inference

## 1. Document/Text Processing and Embedding Creation

Ingredients:
* PDF document of choice.
* Embedding model of choice.

Steps:
1. Import PDF document.
2. Process text for embedding (e.g. split into chunks of sentences).
3. Embed text chunks with embedding model.
4. Save embeddings to file for later use (embeddings will store on file for many years or until you lose your hard drive).

### Import PDF Document 

This will work with many other kinds of documents.

However, we'll start with PDF since many people have PDFs.

But just keep in mind, text files, email chains, support documentation, articles and more can also work.

We're going to work with the javabook.pdf document to build our knowledge base.

There are several libraries to open PDFs with Python but I found that [PyMuPDF](https://github.com/pymupdf/pymupdf) works quite well in many cases.

In [8]:
# Set PDF file path
import os
import requests

# Get PDF document
pdf_path = "javabook.pdf"

# Check if the file exists
if os.path.exists(pdf_path):
    print(f"File {pdf_path} exists and is ready to be processed.")
else:
    print(f"Error: {pdf_path} not found in the current directory")
    print(f"Current directory: {os.getcwd()}")
    print(f"Files in directory: {os.listdir('.')}")
    print("\nPlease ensure javabook.pdf is in the working directory.")

File javabook.pdf exists and is ready to be processed.


PDF acquired!

We can import the pages of our PDF to text by first defining the PDF path and then opening and reading it with PyMuPDF (`import fitz`).

We'll write a small helper function to preprocess the text as it gets read. Note that not all text will be read in the same way, so keep this in mind when you prepare your text.

We'll save each page to a dictionary and then append that dictionary to a list for ease of use later.

In [9]:
# Fix PyMuPDF import issue (remove conflicting 'fitz' package, install pymupdf)
import sys
import subprocess

def _pip(args):
    subprocess.check_call([sys.executable, "-m", "pip", *args])

try:
    _pip(["uninstall", "-y", "fitz"])
except Exception:
    pass

_pip(["install", "-q", "pymupdf"])
print("PyMuPDF installed. You can now import fitz.")

PyMuPDF installed. You can now import fitz.


In [10]:
# Requires !pip install PyMuPDF, see: https://github.com/pymupdf/pymupdf
import fitz # (pymupdf, found this is better than pypdf for our use case, note: licence is AGPL-3.0, keep that in mind if you want to use any code commercially)
from tqdm.auto import tqdm # for progress bars, requires !pip install tqdm 

def text_formatter(text: str) -> str:
    """Performs minor formatting on text."""
    cleaned_text = text.replace("\n", " ").strip() # note: this might be different for each doc (best to experiment)

    # Other potential text formatting functions can go here
    return cleaned_text

# Open PDF and get lines/pages
# Note: this only focuses on text, rather than images/figures etc
def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """
    Opens a PDF file, reads its text content page by page, and collects statistics.

    Parameters:
        pdf_path (str): The file path to the PDF document to be opened and read.

    Returns:
        list[dict]: A list of dictionaries, each containing the page number
        (zero-indexed), character count, word count, raw sentence count, estimated token
        count, and the extracted text for each page.
    """
    doc = fitz.open(pdf_path)  # open a document
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):  # iterate the document pages
        text = page.get_text()  # get plain text encoded as UTF-8
        text = text_formatter(text)
        pages_and_texts.append({"page_number": page_number,  # keep original page numbers
                                "page_char_count": len(text),
                                "page_word_count": len(text.split(" ")),
                                "page_sentence_count_raw": len(text.split(". ")),
                                "page_token_count": len(text) / 4,  # 1 token = ~4 chars, see: https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
                                "text": text})
    return pages_and_texts

pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)
pages_and_texts[:2]

0it [00:00, ?it/s]

[{'page_number': 0,
  'page_char_count': 0,
  'page_word_count': 1,
  'page_sentence_count_raw': 1,
  'page_token_count': 0.0,
  'text': ''},
 {'page_number': 1,
  'page_char_count': 47,
  'page_word_count': 10,
  'page_sentence_count_raw': 1,
  'page_token_count': 11.75,
  'text': 'The   Complete  Reference Java™ Ninth Edition ®'}]

Now let's get a random sample of the pages.

In [11]:
import random

random.sample(pages_and_texts, k=3)

[{'page_number': 237,
  'page_char_count': 1453,
  'page_word_count': 354,
  'page_sentence_count_raw': 5,
  'page_token_count': 363.25,
  'text': 'Chapter 9\u2003    Packages and Interfaces\u2003 \u2002  203 Part I       stck = temp;       stck[++tos] = item;     }     else       stck[++tos] = item;   }    // Pop an item from the stack   public int pop() {     if(tos < 0) {       System.out.println("Stack underflow.");       return 0;     }     else       return stck[tos--];   } }  class IFTest2 {   public static void main(String args[]) {     DynStack mystack1 = new DynStack(5);     DynStack mystack2 = new DynStack(8);      // these loops cause each stack to grow     for(int i=0; i<12; i++) mystack1.push(i);     for(int i=0; i<20; i++) mystack2.push(i);      System.out.println("Stack in mystack1:");     for(int i=0; i<12; i++)        System.out.println(mystack1.pop());      System.out.println("Stack in mystack2:");     for(int i=0; i<20; i++)        System.out.println(mystack2.pop())

### Get some stats on the text

Let's perform a rough exploratory data analysis (EDA) to get an idea of the size of the texts (e.g. character counts, word counts etc) we're working with.

The different sizes of texts will be a good indicator of how we should split our texts.

Many embedding models have limits on the size of texts they can ingest, for example, the [`sentence-transformers`](https://www.sbert.net/docs/pretrained_models.html) model [`all-mpnet-base-v2`](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) has an input size of 384 tokens.

This means that the model has been trained to ingest texts of up to 384 tokens and turn them into embeddings (1 token ~= 4 characters ~= 0.75 words).

Texts over 384 tokens which are encoded by this model will be automatically reduced to 384 tokens in length, potentially losing some information.

We'll discuss this more in the embedding section.
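As a rough check, the ~4 characters per token rule of thumb can be turned into a quick estimate of whether a text is likely to be truncated. This is only an approximation (a real tokenizer will give different counts), and `estimate_tokens` and `exceeds_limit` are hypothetical helper names.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> float:
    """Rough token estimate using the ~4 characters per token rule of thumb."""
    return len(text) / chars_per_token


def exceeds_limit(text: str, max_tokens: int = 384) -> bool:
    """True when the estimated token count is over the model's input size."""
    return estimate_tokens(text) > max_tokens


page_a = "x" * 1453  # placeholder text with the same length as the sampled page above
page_b = "x" * 2000  # placeholder text for a longer page

print(estimate_tokens(page_a), exceeds_limit(page_a))  # 363.25 False
print(estimate_tokens(page_b), exceeds_limit(page_b))  # 500.0 True
```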

For now, let's turn our list of dictionaries into a DataFrame and explore it.

In [12]:
import pandas as pd

df = pd.DataFrame(pages_and_texts)
df.head()

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | text |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| 0 | 0 | 0 | 1 | 1 | 0.0 | |
| 1 | 1 | 47 | 10 | 1 | 11.75 | The Complete Reference Java™ Ninth Edition ® |
| 2 | 2 | 1728 | 291 | 17 | 432.0 | About the Author Best-selling author Herbert S... |
| 3 | 3 | 178 | 29 | 1 | 44.5 | The Complete Reference Herbert Schildt New Y... |
| 4 | 4 | 4265 | 693 | 26 | 1066.25 | Copyright © 2014 by McGraw-Hill Education (Pub... |


In [13]:
import pandas as pd
# Get stats
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count |
| ----- | ----- | ----- | ----- | ----- | ----- |
| count | 1313.0 | 1313.0 | 1313.0 | 1313.0 | 1313.0 |
| mean | 656.0 | 2100.16 | 428.19 | 44.02 | 525.04 |
| std | 379.17 | 819.82 | 303.3 | 205.07 | 204.96 |
| min | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 25% | 328.0 | 1612.0 | 332.0 | 10.0 | 403.0 |
| 50% | 656.0 | 2049.0 | 396.0 | 16.0 | 512.25 |
| 75% | 984.0 | 2513.0 | 466.0 | 24.0 | 628.25 |
| max | 1312.0 | 5644.0 | 2893.0 | 1841.0 | 1411.0 |


Okay, looks like our average token count per page is about 525.

For this particular use case, it means an average page is too long to embed whole with the `all-mpnet-base-v2` model (this model has an input capacity of 384 tokens), so we'll split pages into smaller chunks before embedding.

### Further text processing (splitting pages into sentences)

The ideal way of processing text before embedding it is still an active area of research.

A simple method I've found helpful is to break the text into chunks of sentences.

As in, chunk a page of text into groups of 5, 7, 10 or more sentences (these values are not set in stone and can be explored).

But we want to follow the workflow of:

`Ingest text -> split it into groups/chunks -> embed the groups/chunks -> use the embeddings`
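The "split it into groups/chunks" step can be sketched as follows. This is a minimal illustration with placeholder sentences; `split_list` is a hypothetical helper name and the chunk size of 10 is just one of the values suggested above.

```python
def split_list(items: list, chunk_size: int) -> list[list]:
    """Split a list into consecutive chunks of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


# 25 placeholder sentences standing in for one page of text
sentences = [f"Sentence {n}." for n in range(1, 26)]

chunks = split_list(sentences, chunk_size=10)
print([len(chunk) for chunk in chunks])  # [10, 10, 5]

# Each chunk is joined back into a single string before embedding
first_chunk_text = " ".join(chunks[0])
print(first_chunk_text[:35])
```

The last chunk is usually shorter than the others; whether to keep, merge, or drop very short chunks is a choice worth experimenting with.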

Some options for splitting text into sentences:

1. Split into sentences with simple rules (e.g. split on ". " with `text = text.split(". ")`, like we did above).
2. Split into sentences with a natural language processing (NLP) library such as [spaCy](https://spacy.io/) or [nltk](https://www.nltk.org/).
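To see how much the splitting rule matters, the sketch below compares the `". "` rule from option 1 with a slightly broader regular-expression rule on a made-up example. Neither is a substitute for an NLP library: both would still split incorrectly after abbreviations such as "e.g. ".

```python
import re

text = "Is Java compiled? Yes! It compiles to bytecode. The JVM runs it."

# Rule 1: split only on a period followed by a space
naive_sentences = [s for s in text.split(". ") if s]

# Rule 2: split on whitespace that follows ., ! or ?
regex_sentences = re.split(r"(?<=[.!?])\s+", text)

print(len(naive_sentences))  # 2
print(len(regex_sentences))  # 4
print(regex_sentences)
```

The simple rule merges the question and the exclamation into the following sentence, which inflates sentence length and lowers the sentence count.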

Why split into sentences?

* Easier to handle than larger pages of text (especially if pages are densely filled with text).
* Can get specific and find out which group of sentences were used to help within a RAG pipeline.

> **Resource:** See [spaCy install instructions](https://spacy.io/usage). 

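A library like spaCy is likely a bit more robust than just using `text.split(". ")`, but in this environment we'll use a simple regular-expression splitter instead (NLTK is imported in the cell below, but the function we actually apply is regex-based).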

In [14]:
# spaCy is skipped in this environment. NLTK's punkt data is downloaded for optional use,
# but the splitter applied to our pages below is the regex-based `sentence_tokenizer`.
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt', quiet=True)

# Create a simple sentence splitter using a regular expression
import re

def sentence_tokenizer(text):
    """Simple sentence tokenizer that splits on sentence endings."""
    # Split on sentence boundaries (period, exclamation, question mark followed by space)
    sentences = re.split(r'(?<=[.!?])\s+', text)
    return [s.strip() for s in sentences if s.strip()]

# Test the sentence tokenizer
test_text = "This is a sentence. This is another sentence."
sents = sentence_tokenizer(test_text)
assert len(sents) == 2, f"Expected 2 sentences, got {len(sents)}"
print(f"Sentence tokenizer works! Found {len(sents)} sentences in test text.")
sents

Sentence tokenizer works! Found 2 sentences in test text.


['This is a sentence.', 'This is another sentence.']

We don't necessarily need a library like spaCy or NLTK for this. They are open-source libraries designed to do NLP tasks like this at scale, but our simple splitter is enough to get started.

So let's run our small sentencizing pipeline on our pages of text.

In [15]:
for item in tqdm(pages_and_texts):
    item["sentences"] = sentence_tokenizer(item["text"])
    
    # Count the sentences (note: despite the "_nltk" suffix, this count comes from the regex splitter above)
    item["page_sentence_count_nltk"] = len(item["sentences"])

  0%|          | 0/1313 [00:00<?, ?it/s]

In [16]:
# Inspect an example
import random
random.sample(pages_and_texts, k=1)

[{'page_number': 1289,
  'page_char_count': 3422,
  'page_word_count': 435,
  'page_sentence_count_raw': 2,
  'page_token_count': 855.5,
  'text': 'Index\u2003 \u2002  1255 getOppositeComponent(\xa0), 775 getOppositeWindow(\xa0), 781 getOutputStream(\xa0), 461, 464, 732, 1219 getParallelism(\xa0), 958 getParameter(\xa0), 749, 761–762, 1219, 1221, 1228, 1229 getParameterNames(\xa0), 1219, 1221 getParent(\xa0), 485, 643, 694, 712, 936, 1161 getPath(\xa0), 1063–1064, 1226 getPhase(\xa0), 931 getPoint(\xa0), 778 getPoolSize(\xa0), 963 getPort(\xa0), 732, 743, 744 getPreciseWheelRotation(\xa0), 780 getPreferredSize(\xa0), 856 getPriority(\xa0), 236, 246, 483 getProperties(\xa0), 468, 572 getProperty(\xa0), 468, 470, 573, 574, 575 getPropertyDescriptors(\xa0), 1202, 1203, 1208, 1209 getQueuedTaskCount(\xa0), 962 getRed(\xa0), 816 getRegisteredParties(\xa0), 936 getRemoveListenerMethod(\xa0), 1206 getRGB(\xa0), 817 getRuntime(\xa0), 461, 462 getScreenX(\xa0), 1188 getScreenY(\xa0), 1188 g

Wonderful!

Now let's turn our list of dictionaries into a DataFrame and get some stats.

In [17]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,page_sentence_count_nltk
count,1313.0,1313.0,1313.0,1313.0,1313.0,1313.0
mean,656.0,2100.16,428.19,44.02,525.04,44.28
std,379.17,819.82,303.3,205.07,204.96,205.07
min,0.0,0.0,1.0,1.0,0.0,0.0
25%,328.0,1612.0,332.0,10.0,403.0,10.0
50%,656.0,2049.0,396.0,16.0,512.25,16.0
75%,984.0,2513.0,466.0,24.0,628.25,24.0
max,1312.0,5644.0,2893.0,1841.0,1411.0,1841.0


For our set of text, it looks like our raw sentence count (e.g. splitting on `". "`) is quite close to what our regex-based splitter came up with (a mean of 44.02 vs 44.28 sentences per page).

Now we've got our text split into sentences, how about we group those sentences?

### Chunking our sentences together

Let's take a step to break down our list of sentences/text into smaller chunks.

As you might've guessed, this process is referred to as **chunking**.

Why do we do this?

1. Similar-sized chunks of text are easier to manage.
2. We don't overload the embedding model's capacity for tokens (e.g. if an embedding model has a capacity of 384 tokens, there could be information loss if you try to embed a sequence of 400+ tokens).
3. Our LLM context window (the number of tokens an LLM can take in) may be limited and using it requires compute power, so we want to make sure we're using it as well as possible.

Something to note is that there are many different ways emerging for creating chunks of information/text.

For now, we're going to keep it simple and break our pages of sentences into groups of 10 (this number is arbitrary and can be changed, I just picked it because it seemed to line up well with our embedding model capacity of 384).

Going by the stats above, our pages have a median of 16 sentences (the mean of ~44 is pulled up by a few pages with up to 1,841 "sentences").

And an average of about 525 estimated tokens per page (median ~512).

So for a typical page, a group of 10 sentences works out to roughly 320 tokens (512 / 16 × 10).

This keeps a typical chunk within the capacity of our `all-mpnet-base-v2` model (384 tokens), though unusually long chunks will still exceed it and get truncated.

To split our groups of sentences into chunks of 10 or fewer, let's create a function which accepts a list as input and breaks it down into sublists of a specified size.

In [18]:
# Define split size to turn groups of sentences into chunks
num_sentence_chunk_size = 10 

# Create a function that splits a list into sublists of the desired size (no recursion needed, just slicing)
def split_list(input_list: list, 
               slice_size: int) -> list[list[str]]:
    """
    Splits the input_list into sublists of size slice_size (or as close as possible).

    For example, a list of 17 sentences would be split into two lists of [[10], [7]]
    """
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

# Loop through pages and texts and split sentences into chunks
for item in tqdm(pages_and_texts):
    item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                         slice_size=num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])

  0%|          | 0/1313 [00:00<?, ?it/s]
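As a quick sanity check of the slicing approach, here is a small standalone sketch that repeats the `split_list` function from the cell above and runs it on 17 placeholder sentences (the sentences themselves are made up).

```python
def split_list(input_list: list, slice_size: int) -> list[list]:
    """Split input_list into consecutive sublists of at most slice_size items."""
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]


# 17 placeholder sentences, chunked into groups of 10
sentences = [f"Sentence {i}." for i in range(17)]
chunks = split_list(sentences, slice_size=10)

print([len(chunk) for chunk in chunks])  # [10, 7]
print(split_list([], slice_size=10))     # [] (a page with no sentences yields no chunks)
```

The last chunk simply holds whatever is left over, which is why some chunks end up much shorter than 10 sentences.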

In [19]:
# Sample an example from the group (note: many samples have only 1 chunk as they have <=10 sentences total)
random.sample(pages_and_texts, k=1)

[{'page_number': 1187,
  'page_char_count': 1238,
  'page_word_count': 250,
  'page_sentence_count_raw': 10,
  'page_token_count': 309.5,
  'text': 'Chapter 35\u2003    Exploring JavaFX Controls\u2003 \u2002  1153 Part IV     // Create an ObservableList of entries for the combo box.     ObservableList<String> transportTypes =       FXCollections.observableArrayList( "Train", "Car", "Airplane" );      // Create a combo box.     cbTransport = new ComboBox<String>(transportTypes);      // Set the default value.     cbTransport.setValue("Train");      // Set the response label to indicate the default selection.     response.setText("Selected Transport is " + cbTransport.getValue());      // Listen for action events on the combo box.     cbTransport.setOnAction(new EventHandler<ActionEvent>() {       public void handle(ActionEvent ae) {         response.setText("Selected Transport is " + cbTransport.getValue());       }     });      // Add the label and combo box to the scene graph.     roo

In [20]:
# Create a DataFrame to get stats
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,page_sentence_count_nltk,num_chunks
count,1313.0,1313.0,1313.0,1313.0,1313.0,1313.0,1313.0
mean,656.0,2100.16,428.19,44.02,525.04,44.28,4.9
std,379.17,819.82,303.3,205.07,204.96,205.07,20.5
min,0.0,0.0,1.0,1.0,0.0,0.0,0.0
25%,328.0,1612.0,332.0,10.0,403.0,10.0,1.0
50%,656.0,2049.0,396.0,16.0,512.25,16.0,2.0
75%,984.0,2513.0,466.0,24.0,628.25,24.0,3.0
max,1312.0,5644.0,2893.0,1841.0,1411.0,1841.0,185.0


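Note how the average number of chunks per page is around 4.9 while the median is only 2. The mean is pulled up by a small number of pages that split into a very large number of chunks (up to 185), whereas most pages contain 24 sentences or fewer and so produce 3 chunks or fewer.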
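The number of chunks per page follows directly from the sentence count: it is the sentence count divided by the chunk size, rounded up. A minimal sketch, using a hypothetical helper name `expected_num_chunks` and sentence counts taken from the stats table above:

```python
import math


def expected_num_chunks(num_sentences: int, chunk_size: int = 10) -> int:
    """Number of chunks produced when slicing num_sentences into groups of chunk_size."""
    return math.ceil(num_sentences / chunk_size)


# Sentence counts from the describe() table: min, 25%, 50%, 75%, max
for num_sentences in (0, 10, 16, 24, 1841):
    print(num_sentences, "->", expected_num_chunks(num_sentences))
```

This prints 0, 1, 2, 3 and 185 chunks respectively, which lines up with the `num_chunks` column in the table.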

### Splitting each chunk into its own item

We'd like to embed each chunk of sentences into its own numerical representation.

So to keep things clean, let's create a new list of dictionaries, each containing a single chunk of sentences along with relevant information such as the page number as well as statistics about each chunk.

In [21]:
import re

# Split each chunk into its own item
pages_and_chunks = []
for item in tqdm(pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]
        
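        # Join the sentences together into a paragraph-like structure, aka a chunk (so they are a single string)
        # Note: the sentences are joined without a separator, so the regex below re-inserts the missing space
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk) # ".A" -> ". A" for any full-stop/capital letter combo (this also affects names such as "Apple.GoldenDel", which becomes "Apple. GoldenDel")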
        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get stats about the chunk
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len([word for word in joined_sentence_chunk.split(" ")])
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4 # 1 token = ~4 characters
        
        pages_and_chunks.append(chunk_dict)

# How many chunks do we have?
len(pages_and_chunks)

  0%|          | 0/1313 [00:00<?, ?it/s]

6429

In [22]:
# View a random sample
random.sample(pages_and_chunks, k=1)

[{'page_number': 298,
  'sentence_chunk': 'For example, this statement compares the value in ap with the GoldenDel constant: if(ap == Apple. GoldenDel) // ... An enumeration value can also be used to control a switch statement. Of course, all of the case statements must use constants from the same enum as that used by the switch expression. For example, this switch is perfectly valid: // Use an enum to control a switch statement.switch(ap) {  case Jonathan:   // ...case Winesap:   // ... Notice that in the case statements, the names of the enumeration constants are used without being qualified by their enumeration type name. That is, Winesap, not Apple. Winesap, is used. This is because the type of the enumeration in the switch expression has already implicitly specified the enum type of the case constants. There is no need to qualify the constants in the case statements with their enum type name.',
  'chunk_char_count': 867,
  'chunk_word_count': 151,
  'chunk_token_count': 216.75}]

Excellent!

Now we've broken our whole textbook into chunks of 10 sentences or fewer, each tagged with the page number it came from.

This means we could reference a chunk of text and know its source.

Let's get some stats about our chunks.

In [23]:
# Get stats about our chunks
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)

Unnamed: 0,page_number,chunk_char_count,chunk_word_count,chunk_token_count
count,6429.0,6429.0,6429.0,6429.0
mean,297.52,409.79,69.13,102.45
std,400.86,555.34,93.96,138.83
min,1.0,2.0,1.0,0.5
25%,17.0,10.0,1.0,2.5
50%,27.0,39.0,6.0,9.75
75%,566.0,858.0,144.0,214.5
max,1312.0,4248.0,557.0,1062.0


Hmm looks like some of our chunks have quite a low token count.

How about we check for samples with less than 30 tokens (about the length of a sentence) and see if they are worth keeping?

In [24]:
# Show random chunks with 30 tokens or fewer
min_token_length = 30
filtered_df = df[df["chunk_token_count"] <= min_token_length]
num_samples = min(5, len(filtered_df))  # Sample up to 5, or fewer if there are not enough rows
for _, row in filtered_df.sample(num_samples).iterrows():
    print(f'Chunk token count: {row["chunk_token_count"]} | Text: {row["sentence_chunk"]}')

Chunk token count: 2.5 | Text: ..........
Chunk token count: 2.5 | Text: ..........
Chunk token count: 2.5 | Text: ..........
Chunk token count: 2.5 | Text: ..........
Chunk token count: 2.5 | Text: ..........


Looks like many of these are just runs of dots (likely dot leaders from pages such as the table of contents) rather than meaningful text.

They don't seem to offer too much information.

Let's filter our DataFrame/list of dictionaries to only include chunks with over 30 tokens in length.

In [25]:
pages_and_chunks_over_min_token_len = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")
pages_and_chunks_over_min_token_len[:2]

[{'page_number': 2,
  'sentence_chunk': 'About the Author Best-selling author Herbert Schildt has written extensively about programming for nearly three decades and is a leading authority on the Java language. His books have sold millions of copies worldwide and have been translated into all major foreign languages. He is the author of numerous books on Java, including Java: A Beginner’s Guide, Herb Schildt’s Java Programming Cookbook, and Swing: A Beginner’s Guide. He has also written extensively about C, C++, and C#. Although interested in all facets of computing, his primary focus is computer languages, including compilers, interpreters, and robotic control languages. He also has an active interest in the standardization of languages. Schildt holds both graduate and undergraduate degrees from the University of Illinois. He can be reached at his consulting office at (217) 586-4683. His web site is www. HerbSchildt.com. About the Technical Editor Dr.',
  'chunk_char_count': 924,

Smaller chunks filtered!

Time to embed our chunks of text!

### Embedding our text chunks

While humans understand text, machines understand numbers best.

An [embedding](https://vickiboykis.com/what_are_embeddings/index.html) is a broad concept.

But one of my favourite and simple definitions is "a useful numerical representation".

The most powerful thing about modern embeddings is that they are *learned* representations.

Meaning rather than directly mapping words/tokens/characters to numbers (e.g. `{"a": 0, "b": 1, "c": 2...}`), the numerical representation of tokens is learned by going through large corpora of text and figuring out how different tokens relate to each other.

Ideally, texts with similar meanings will end up with similar numerical representations.
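To see why a direct mapping is not enough, here is a tiny sketch of a character-to-integer lookup (the mapping and the example words are made up for illustration). Words that are spelled alike get similar numbers regardless of what they mean.

```python
# Direct (non-learned) mapping: each letter gets an arbitrary integer ID
char_to_id = {char: idx for idx, char in enumerate("abcdefghijklmnopqrstuvwxyz")}


def encode(word: str) -> list[int]:
    """Map each character of a lowercase word to its integer ID."""
    return [char_to_id[char] for char in word.lower()]


print(encode("car"))         # [2, 0, 17]
print(encode("cat"))         # [2, 0, 19]
print(encode("automobile"))  # shares almost nothing with "car" despite the similar meaning
```

Under this mapping "car" looks nearly identical to "cat" and nothing like "automobile". A learned embedding aims for the opposite: closeness in the numbers should reflect closeness in meaning.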

> **Note:** Most modern NLP models deal with "tokens" which can be considered as multiple different sizes and combinations of words and characters rather than always whole words or single characters. For example, the string `"hello world!"` gets mapped to the token values `{15339: b'hello', 1917: b' world', 0: b'!'}` using [Byte pair encoding](https://en.wikipedia.org/wiki/Byte_pair_encoding) (or BPE via OpenAI's [`tiktoken`](https://github.com/openai/tiktoken) library). Google has a tokenization library called [SentencePiece](https://github.com/google/sentencepiece).

Our goal is to turn each of our chunks into a numerical representation (an embedding vector, where a vector is a sequence of numbers arranged in order).

Once our text samples are in embedding vectors, us humans will no longer be able to understand them.

However, we don't need to.

The embedding vectors are for our computers to understand.

We'll use our computers to find patterns in the embeddings and then we can use their text mappings to further our understanding.

Enough talking, how about we import a text embedding model and see what an embedding looks like?

To do so, we'll use the [`sentence-transformers`](https://www.sbert.net/docs/installation.html) library which contains many pre-trained embedding models.

Specifically, we'll get the `all-mpnet-base-v2` model (you can see the model's intended use on the [Hugging Face model card](https://huggingface.co/sentence-transformers/all-mpnet-base-v2#intended-uses)).

In [26]:
# Requires !pip install sentence-transformers
import torch
from sentence_transformers import SentenceTransformer

# Use GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2", 
                                      device=device) # GPU will be *much* faster than CPU

# Create a list of sentences to turn into numbers
sentences = [
    "The Sentences Transformers library provides an easy and open-source way to create embeddings.",
    "Sentences can be embedded one by one or as a list of strings.",
    "Embeddings are one of the most powerful concepts in machine learning!",
    "Learn to use embeddings well and you'll be well on your way to being an AI engineer."
]

# Sentences are encoded/embedded by calling model.encode()
embeddings = embedding_model.encode(sentences)
embeddings_dict = dict(zip(sentences, embeddings))

# See the embeddings
for sentence, embedding in embeddings_dict.items():
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

Using device: cuda


Loading weights:   0%|          | 0/199 [00:00<?, ?it/s]

MPNetModel LOAD REPORT from: sentence-transformers/all-mpnet-base-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Sentence: The Sentences Transformers library provides an easy and open-source way to create embeddings.
Embedding: [-2.07981374e-02  3.03164218e-02 -2.01218128e-02  6.86483532e-02
 -2.55255643e-02 -8.47689621e-03 -2.07111196e-04 -6.32377416e-02
  2.81606149e-02 -3.33353467e-02  3.02634798e-02  5.30721173e-02
 -5.03526777e-02  2.62287464e-02  3.33313905e-02 -4.51578945e-02
  3.63044366e-02 -1.37112767e-03 -1.20171290e-02  1.14946542e-02
  5.04510924e-02  4.70857136e-02  2.11913381e-02  5.14607430e-02
 -2.03746390e-02 -3.58889140e-02 -6.67873712e-04 -2.94393301e-02
  4.95858900e-02 -1.05639435e-02 -1.52013786e-02 -1.31752936e-03
  4.48196791e-02  1.56022888e-02  8.60379657e-07 -1.21391716e-03
 -2.37978864e-02 -9.09424969e-04  7.34484987e-03 -2.53933878e-03
  5.23369685e-02 -4.68043461e-02  1.66214611e-02  4.71579283e-02
 -4.15599123e-02  9.01962689e-04  3.60279009e-02  3.42214443e-02
  9.68227684e-02  5.94828688e-02 -1.64984874e-02 -3.51249352e-02
  5.92517806e-03 -7.07964529e-04 -2.4103

Woah! That's a lot of numbers.

How about we do just one sentence?

In [27]:
single_sentence = "Yo! How cool are embeddings?"
single_embedding = embedding_model.encode(single_sentence)
print(f"Sentence: {single_sentence}")
print(f"Embedding:\n{single_embedding}")
print(f"Embedding size: {single_embedding.shape}")

Sentence: Yo! How cool are embeddings?
Embedding:
[-1.97447371e-02 -4.51087672e-03 -4.98482818e-03  6.55444786e-02
 -9.87671968e-03  2.72835474e-02  3.66426148e-02 -3.30222631e-03
  8.50079395e-03  8.24952312e-03 -2.28497069e-02  4.02430184e-02
 -5.75200431e-02  6.33692816e-02  4.43207808e-02 -4.49507311e-02
  1.25284391e-02 -2.52012126e-02 -3.55292000e-02  1.29558947e-02
  8.67024343e-03 -1.92917529e-02  3.55632883e-03  1.89505927e-02
 -1.47128357e-02 -9.39846691e-03  7.64171407e-03  9.62190703e-03
 -5.98927820e-03 -3.90169770e-02 -5.47824651e-02 -5.67457685e-03
  1.11645246e-02  4.08067517e-02  1.76319088e-06  9.15296096e-03
 -8.77259579e-03  2.39382498e-02 -2.32784376e-02  8.04999471e-02
  3.19176950e-02  5.12595847e-03 -1.47708384e-02 -1.62524451e-02
 -6.03213347e-02 -4.35689725e-02  4.51211818e-02 -1.79053638e-02
  2.63366848e-02 -3.47866938e-02 -8.89172871e-03 -5.47675118e-02
 -1.24373101e-02 -2.38606650e-02  8.33496675e-02  5.71242347e-02
  1.13328705e-02 -1.49594918e-02  9.2037

Nice! We've now got a way to numerically represent each of our chunks.

Our embedding has a shape of `(768,)` meaning it's a vector of 768 numbers which represent our text in high-dimensional space, too many for a human to comprehend but machines love high-dimensional space.

> **Note:** No matter the size of the text input to our `all-mpnet-base-v2` model, it will be turned into an embedding of size `(768,)`. This value is fixed. So whether a sentence is 1 token long or 1000 tokens long, the output is always a vector of size `(768,)`. Inputs longer than the model's 384-token capacity are truncated, so anything past that limit is not reflected in the embedding. Of course, other embedding models may have different input/output shapes.

How about we add an embedding field to each of our chunk items?

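Let's start by creating embeddings one chunk at a time, we'll time it with the `%%time` magic to see how long it takes. Note that the cell below uses the GPU if one is available (in this run the device is `cuda`), so this is not a CPU measurement.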

In [28]:
%%time
# Define minimum token length
min_token_length = 30

# Filter for chunks over minimum token length
pages_and_chunks_over_min_token_len = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")

# Use GPU (CUDA) for faster embedding generation
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"🚀 Using device: {device}")
embedding_model.to(device)

# Create embeddings one by one on GPU
for item in tqdm(pages_and_chunks_over_min_token_len):
    item["embedding"] = embedding_model.encode(item["sentence_chunk"])

🚀 Using device: cuda


  0%|          | 0/2677 [00:00<?, ?it/s]

CPU times: total: 1min 10s
Wall time: 1min 15s


Ok not too bad... but this would take a *really* long time if we had a larger dataset.

Now let's run the same one-at-a-time loop again on the same device to see whether the timing is consistent.

In [29]:
%%time

# Use GPU (CUDA) if available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"🚀 Using device for batch processing: {device}")
embedding_model.to(device)

# Note: this loop still encodes one chunk per call (same as the previous cell);
# true batched encoding happens when a list of texts is passed to encode()
for item in tqdm(pages_and_chunks_over_min_token_len):
    item["embedding"] = embedding_model.encode(item["sentence_chunk"])

🚀 Using device for batch processing: cuda


  0%|          | 0/2677 [00:00<?, ?it/s]

CPU times: total: 1min 10s
Wall time: 1min 16s


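Both runs take about the same time (roughly 1 min 15 s of wall time), which makes sense: both cells run the same one-chunk-at-a-time loop on the same device (`cuda`). To compare against the CPU, you could move the model there first (e.g. `embedding_model.to("cpu")`) and rerun the loop.

A GPU usually helps with many deep learning workflows. If you have access to a GPU, especially an NVIDIA GPU, you should use one if you can.

But what if I told you we could go faster?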

You see, many modern models can handle batched predictions.

This means computing on multiple samples at once.

Those are the types of operations where a GPU flourishes!

We can perform batched operations by turning our target text samples into a single list and then passing that list to our embedding model.
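Conceptually, batching just means grouping the samples so that several are processed per call instead of one. The sketch below only illustrates that grouping with placeholder strings. It is not the embedding library's internal code, and the batch size of 32 and the helper name `make_batches` are chosen for illustration.

```python
def make_batches(items: list, batch_size: int) -> list[list]:
    """Group items into consecutive batches of at most batch_size."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Placeholder texts standing in for our 2677 filtered chunks
text_chunks = [f"chunk {i}" for i in range(2677)]
batches = make_batches(text_chunks, batch_size=32)

print(len(batches))      # 84 calls instead of 2677
print(len(batches[0]))   # 32
print(len(batches[-1]))  # 21 (the leftover chunks)
```

Fewer, larger calls give the GPU more work to do in parallel per call, which is where the speed-up comes from.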

In [30]:
# Turn text chunks into a single list
text_chunks = [item["sentence_chunk"] for item in pages_and_chunks_over_min_token_len]

In [31]:
%%time

# Embed all texts in batches
text_chunk_embeddings = embedding_model.encode(text_chunks,
                                               batch_size=32, # you can use different batch sizes here for speed/performance, I found 32 works well for this use case
                                               convert_to_tensor=True) # optional to return embeddings as tensor instead of array

text_chunk_embeddings

CPU times: total: 56.3 s
Wall time: 55.6 s


tensor([[ 0.0150, -0.0734, -0.0258,  ...,  0.0078,  0.0746, -0.0331],
        [ 0.0519, -0.0117, -0.0101,  ...,  0.0246,  0.0807, -0.0401],
        [-0.0173, -0.0161, -0.0399,  ...,  0.0185, -0.0056, -0.0263],
        ...,
        [-0.0119,  0.0339, -0.0302,  ..., -0.0410,  0.0659, -0.0230],
        [ 0.0352, -0.0473, -0.0407,  ..., -0.0297,  0.0201,  0.0024],
        [ 0.0559, -0.0232, -0.0363,  ..., -0.0474,  0.0339, -0.0148]],
       device='cuda:0')

That's what I'm talking about!

About a 1.4x improvement in wall time (roughly 76 s down to 56 s on my GPU) thanks to batched operations.

So the tip here is to use a GPU when you can and use batched operations if you can too.

Now let's save our chunks and their embeddings so we could import them later if we wanted.

### Save embeddings to file

Since creating embeddings can be a time-consuming process (not so much in our case, but it can be for larger datasets), let's turn our `pages_and_chunks_over_min_token_len` list of dictionaries into a DataFrame and save it.

In [32]:
# Save embeddings to file
text_chunks_and_embeddings_df = pd.DataFrame(pages_and_chunks_over_min_token_len)
embeddings_df_save_path = "text_chunks_and_embeddings_df.csv"
text_chunks_and_embeddings_df.to_csv(embeddings_df_save_path, index=False)

And we can make sure it imports nicely by loading it.

In [33]:
# Import saved file and view
text_chunks_and_embedding_df_load = pd.read_csv(embeddings_df_save_path)
text_chunks_and_embedding_df_load.head()

Unnamed: 0,page_number,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count,embedding
0,2,About the Author Best-selling author Herbert S...,924,141,231.0,[ 1.49854198e-02 -7.34082162e-02 -2.57711969e-...
1,2,Danny Coward has worked on all editions of the...,783,130,195.75,[ 5.19161113e-02 -1.17350183e-02 -1.01365112e-...
2,3,The Complete Reference Herbert Schildt New Yor...,174,25,43.5,[-1.72504783e-02 -1.60598643e-02 -3.99433635e-...
3,4,Copyright © 2014 by McGraw-Hill Education (Pub...,1343,206,335.75,[ 2.77911909e-02 -5.24213836e-02 -5.01611009e-...
4,4,Information has been obtained by McGraw-Hill E...,1555,245,388.75,[ 3.23652029e-02 -4.66272198e-02 -3.77070270e-...


### Chunking and embedding questions

> **Which embedding model should I use?**

This depends on many factors. My best advice is to experiment, experiment, experiment! 

If you want the model to run locally, you'll have to make sure it's feasible to run on your own hardware. 

A good place to see how different models perform on a wide range of embedding tasks is the [Hugging Face Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard).

> **What other forms of text chunking/splitting are there?**

There are a fair few options here too. We've kept it simple with groups of sentences.

For more, [Pinecone has a great guide on different kinds of chunking](https://www.pinecone.io/learn/chunking-strategies/) including for different kinds of data such as markdown and LaTeX.

Libraries such as [LangChain also have a good amount of in-built text splitting options](https://python.langchain.com/docs/modules/data_connection/document_transformers/).

> **What should I think about when creating my embeddings?**

Our model turns text inputs up to 384 tokens long into embedding vectors of size 768.

Generally, the larger the vector size, the more information that gets encoded into the embedding (however, this is not always the case, as smaller, better models can outperform larger ones).

Though with larger vector sizes comes larger storage and compute requirements.

Our model is also relatively small (420MB) in size compared to larger models that are available.

Larger models may result in better performance but will also require more compute.

So some things to think about:
* Size of input - If you need to embed longer sequences, choose a model with a larger input capacity.
* Size of embedding vector - Larger is generally a better representation but requires more compute/storage.
* Size of model - Larger models generally result in better embeddings but require more compute power/time to run.
* Open or closed - Open models allow you to run them on your own hardware whereas closed models can be easier to setup but require an API call to get embeddings.

> **Where should I store my embeddings?**

If you've got a relatively small dataset, for example, under 100,000 examples (this number is rough and only based on first hand experience), `np.array` or `torch.tensor` can work just fine as your dataset.

But if you've got a production system and want to work with 100,000+ embeddings, you may want to look into a [vector database](https://en.wikipedia.org/wiki/Vector_database) (these have become very popular lately and there are many offerings).
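For a rough feel of the storage involved, here is a back-of-the-envelope sketch. It assumes each embedding is stored as 768 `float32` values (4 bytes each) with no overhead, and the helper name `embedding_storage_mb` is made up for illustration.

```python
def embedding_storage_mb(num_embeddings: int,
                         embedding_dim: int = 768,
                         bytes_per_value: int = 4) -> float:
    """Approximate raw storage in megabytes (1 MB = 1,000,000 bytes), ignoring any overhead."""
    return num_embeddings * embedding_dim * bytes_per_value / 1e6


print(round(embedding_storage_mb(2677), 1))     # our dataset: 8.2 MB
print(round(embedding_storage_mb(100_000), 1))  # 100,000 embeddings: 307.2 MB
```

Both numbers fit comfortably in memory on most machines, which is why a plain array or tensor is often enough at this scale.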

### Document Ingestion and Embedding Creation Extensions

One major extension to the workflow above would be to functionize it.

Or turn it into a script.

As in, take all the functionality we've created and package it into a single process (e.g. go from document -> embeddings file).

So you could input a document on one end and have embeddings come out the other end. The hardest part of this is knowing what kind of preprocessing your text may need before it's turned into embeddings. Cleaner text generally means better results.
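A minimal sketch of what such a function could look like, covering the text-to-chunks part of the pipeline. It reuses the regex splitter and list slicing from earlier, but the function name `document_to_chunks` and the choice to join sentences with a single space are assumptions made for this example, and the embedding step is left as a comment.

```python
import re


def sentence_tokenizer(text: str) -> list[str]:
    """Split text on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def split_list(input_list: list, slice_size: int) -> list[list]:
    """Split a list into consecutive sublists of at most slice_size items."""
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]


def document_to_chunks(pages: list[str],
                       chunk_size: int = 10,
                       min_token_length: float = 30) -> list[dict]:
    """Turn a list of page texts into filtered chunk dictionaries."""
    chunks = []
    for page_number, page_text in enumerate(pages):
        sentences = sentence_tokenizer(page_text)
        for sentence_chunk in split_list(sentences, chunk_size):
            joined = " ".join(sentence_chunk)  # assumption: join with a single space
            token_estimate = len(joined) / 4   # ~4 characters per token
            if token_estimate > min_token_length:
                chunks.append({"page_number": page_number,
                               "sentence_chunk": joined,
                               "chunk_token_count": token_estimate})
    # An embedding step would go here, e.g. encoding every chunk["sentence_chunk"]
    return chunks


# Two made-up pages: one with 12 sentences, one very short
pages = ["Java is a programming language. " * 12, "Short page."]
chunks = document_to_chunks(pages)
print(len(chunks), chunks[0]["page_number"], chunks[0]["chunk_token_count"])  # 1 0 79.75
```

The 12-sentence page produces one chunk of 10 sentences that passes the filter, while its 2-sentence leftover and the short page fall under the 30-token threshold and are dropped.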



## 2. RAG - Search and Answer

We discussed RAG briefly in the beginning but let's quickly recap.

RAG stands for Retrieval Augmented Generation.

Which is another way of saying "given a query, search for relevant resources and answer based on those resources".

Let's break down each step:
* **Retrieval** - Get relevant resources given a query. For example, if the query is "what are the macronutrients?" the ideal results will contain information about protein, carbohydrates and fats (and possibly alcohol) rather than information about which tractors are the best for farming (though that is also cool information).
* **Augmentation** - LLMs are capable of generating text given a prompt. However, this generated text is designed to *look* right. It often contains some correct information, but LLMs are prone to hallucination (generating a result that *looks* like legit text but is factually wrong). In augmentation, we pass relevant information into the prompt and get an LLM to use that relevant information as the basis of its generation.
* **Generation** - This is where the LLM will generate a response that has been flavoured/augmented with the retrieved resources. In turn, this not only gives us a potentially more correct answer, it also gives us resources to investigate more (since we know which resources went into the prompt).
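The augmentation step can be sketched as plain string formatting: retrieved chunks are placed into the prompt ahead of the query. The function name `build_prompt` and the prompt wording below are assumptions for illustration, not a fixed template, and the retrieved text is a sentence from one of the sample chunks shown earlier.

```python
def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Combine retrieved context chunks and the user query into a single prompt string."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return (
        "Answer the query using only the context items below.\n"
        f"Context:\n{context}\n"
        f"Query: {query}\n"
        "Answer:"
    )


retrieved = ["An enumeration value can also be used to control a switch statement."]
prompt = build_prompt("Can an enum control a switch statement?", retrieved)
print(prompt)
```

Because we know exactly which chunks went into the prompt, we can also show them (and their page numbers) alongside the generated answer as sources.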

The whole idea of RAG is to get an LLM to be more factually correct based on your own input as well as have a reference to where the generated output may have come from.

This is an incredibly helpful tool.

Let's say you had 1000s of customer support documents.

You could use RAG to generate direct answers to questions with links to relevant documentation.

Or you were an insurance company with large chains of claims emails.

You could use RAG to answer questions about the emails with sources.

One helpful analogy is to think of LLMs as calculators for words.

With good inputs, the LLM can sort them into helpful outputs.

How? 

It starts with better search.

### Similarity search

Similarity search or semantic search or vector search is the idea of searching on *vibe*.

If this sounds like woo-woo, it's not.

Perhaps searching via *meaning* is a better analogy.

With keyword search, you are trying to match the string "apple" with the string "apple".

Whereas with similarity/semantic search, you may want to search "macronutrients functions".

And get back results that don't necessarily contain the words "macronutrients functions" but get back pieces of text that match that meaning.

> **Example:** Using similarity search on a nutrition textbook (not the Java book used in this notebook) with the query "macronutrients function" returns a paragraph that starts with: 
>
>*There are three classes of macronutrients: carbohydrates, lipids, and proteins. These can be metabolically processed into cellular energy. The energy from macronutrients comes from their chemical bonds. This chemical energy is converted into cellular energy that is then utilized to perform work, allowing our bodies to conduct their basic functions.*
> 
> as the first result. How cool!
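Here is a toy sketch of the difference between keyword matching and similarity search. The 3-dimensional vectors are made up by hand purely for illustration (real embeddings come from a model and have hundreds of dimensions), and cosine similarity is used as one common way to compare vectors.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy "embeddings" for three texts
documents = {
    "Enumerations can control a switch statement.": [0.9, 0.1, 0.0],
    "A combo box lets the user pick one entry.": [0.1, 0.9, 0.1],
    "Threads have priorities.": [0.0, 0.2, 0.9],
}
query_text = "using enum constants in case labels"
query_embedding = [0.8, 0.2, 0.1]  # also hand-made

# Keyword search: look for the exact query string inside each text
keyword_hits = [text for text in documents if query_text in text]

# Similarity search: rank texts by how close their vectors are to the query vector
ranked = sorted(documents,
                key=lambda text: cosine_similarity(query_embedding, documents[text]),
                reverse=True)

print(keyword_hits)  # []
print(ranked[0])     # Enumerations can control a switch statement.
```

The keyword search finds nothing because the exact wording never appears, while the vector comparison still surfaces the text about enumerations and switch statements.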

If you've ever used Google, you know this kind of workflow.

But now we'd like to perform that across our own data.

Let's import the embeddings we created earlier and prepare them for use by turning them into a tensor.

In [34]:
import random

import torch
import numpy as np 
import pandas as pd

device = "cuda" if torch.cuda.is_available() else "cpu"

# Import texts and embedding df
text_chunks_and_embedding_df = pd.read_csv("text_chunks_and_embeddings_df.csv")

# Convert embedding column back to np.array (it got converted to string when it got saved to CSV)
text_chunks_and_embedding_df["embedding"] = text_chunks_and_embedding_df["embedding"].apply(lambda x: np.fromstring(x.strip("[]"), sep=" "))

# Convert texts and embedding df to list of dicts
pages_and_chunks = text_chunks_and_embedding_df.to_dict(orient="records")

# Convert embeddings to torch tensor and send to device (note: NumPy arrays are float64, torch tensors are float32 by default)
embeddings = torch.tensor(np.array(text_chunks_and_embedding_df["embedding"].tolist()), dtype=torch.float32).to(device)
embeddings.shape

torch.Size([2677, 768])
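The string-to-array step is worth seeing on its own. Here is a minimal sketch in plain Python, assuming each CSV cell holds a whitespace-separated vector wrapped in square brackets (the format the `np.fromstring(..., sep=" ")` call above expects). `parse_embedding` is a hypothetical helper for illustration, not part of NumPy or pandas.

```python
def parse_embedding(cell: str) -> list[float]:
    """Parse a string like '[0.1 -0.2 0.3]' back into a list of floats."""
    # Drop the surrounding brackets, then split on any whitespace
    # (str.split() with no arguments also handles newlines and repeated spaces).
    return [float(value) for value in cell.strip("[]").split()]

example_cell = "[ 1.5e-02 -7.3e-02\n  2.5e-02]"
vector = parse_embedding(example_cell)
print(vector)
```

Saving embeddings to a binary format would avoid this parsing step altogether, but CSV keeps the file easy to inspect.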

In [35]:
text_chunks_and_embedding_df.head()

Unnamed: 0,page_number,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count,embedding
0,2,About the Author Best-selling author Herbert S...,924,141,231.0,"[0.0149854198, -0.0734082162, -0.0257711969, 0..."
1,2,Danny Coward has worked on all editions of the...,783,130,195.75,"[0.0519161113, -0.0117350183, -0.0101365112, -..."
2,3,The Complete Reference Herbert Schildt New Yor...,174,25,43.5,"[-0.0172504783, -0.0160598643, -0.0399433635, ..."
3,4,Copyright © 2014 by McGraw-Hill Education (Pub...,1343,206,335.75,"[0.0277911909, -0.0524213836, -0.0501611009, -..."
4,4,Information has been obtained by McGraw-Hill E...,1555,245,388.75,"[0.0323652029, -0.0466272198, -0.037707027, -0..."


In [36]:
embeddings[0]

tensor([ 1.4985e-02, -7.3408e-02, -2.5771e-02,  7.4829e-03, -2.4946e-02,
        -3.8898e-02, -2.0713e-03, -6.0827e-04, -1.2047e-03, -5.3840e-03,
         5.4796e-02,  4.3897e-02, -4.2186e-03,  2.2586e-02, -2.7512e-02,
        -8.4513e-02,  7.7895e-02, -6.4456e-03, -6.0749e-02,  1.2598e-02,
        -9.9165e-03,  4.3364e-03, -9.4753e-03,  5.8562e-02, -2.5537e-03,
        -6.4259e-02, -1.1210e-02,  1.3678e-02,  7.3151e-03, -2.5102e-03,
        -1.4733e-02,  2.0469e-02,  4.4513e-03,  2.8734e-02,  2.7498e-06,
        -2.2993e-02, -2.7370e-03,  3.9228e-02,  3.1972e-02,  1.2928e-01,
         3.0824e-02,  5.7807e-02, -1.5843e-02,  9.3927e-04,  4.7829e-03,
         1.1563e-02,  5.0151e-02, -2.1391e-02,  6.3028e-02,  3.1962e-03,
         2.3737e-02,  1.6948e-02, -7.6335e-03,  4.2421e-02, -2.6216e-02,
        -7.4337e-03,  4.6518e-04,  3.8885e-02,  7.9682e-02,  1.6107e-02,
        -2.5198e-02, -3.4476e-02, -5.7022e-03, -2.0304e-02,  9.3294e-02,
        -8.9609e-03, -4.2601e-02,  1.6416e-02,  2.3

Nice!

Now let's prepare another instance of our embedding model. Not because we have to but because we'd like to make it so you can start the notebook from the cell above. 

In [37]:
from sentence_transformers import util, SentenceTransformer

embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2", 
                                      device=device) # choose the device to load the model to

Loading weights:   0%|          | 0/199 [00:00<?, ?it/s]

MPNetModel LOAD REPORT from: sentence-transformers/all-mpnet-base-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Embedding model ready!

Time to perform a semantic search.

Let's say you were studying the macronutrients.

And wanted to search your textbook for "macronutrients functions".

Well, we can do so with the following steps:
1. Define a query string (e.g. `"macronutrients functions"`) - note: this could be anything, specific or not.
2. Turn the query string into an embedding with the same model we used to embed our text chunks.
3. Perform a [dot product](https://pytorch.org/docs/stable/generated/torch.dot.html) or [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) function between the text embeddings and the query embedding (we'll get to what these are shortly) to get similarity scores.
4. Sort the results from step 3 in descending order (a higher score means more similarity in the eyes of the model) and use these values to inspect the texts. 

Easy!


In [38]:
# 1. Define the query
# Note: This could be anything. But since we're working with a nutrition textbook, we'll stick with nutrition-based queries.
query = "macronutrients functions"
print(f"Query: {query}")

# 2. Embed the query to the same numerical space as the text examples 
# Note: It's important to embed your query with the same model you embedded your examples with.
query_embedding = embedding_model.encode(query, convert_to_tensor=True)

# 3. Get similarity scores with the dot product (we'll time this for fun)
from time import perf_counter as timer

start_time = timer()
dot_scores = util.dot_score(a=query_embedding, b=embeddings)[0]
end_time = timer()

print(f"Time taken to get scores on {len(embeddings)} embeddings: {end_time-start_time:.5f} seconds.")

# 4. Get the top-k results (we'll keep this to 5)
top_results_dot_product = torch.topk(dot_scores, k=5)
top_results_dot_product

Query: macronutrients functions
Time taken to get scores on 2677 embeddings: 0.35969 seconds.


torch.return_types.topk(
values=tensor([0.2459, 0.2258, 0.2185, 0.2105, 0.2094], device='cuda:0'),
indices=tensor([2497, 1314, 2302, 2518, 1257], device='cuda:0'))

Woah!! Now that was fast!

~0.36 seconds to perform a dot product comparison across 2677 embeddings on my GPU, and that figure is for the first GPU call in the session, which usually includes one-time CUDA setup. Later calls in this notebook take ~0.0002 seconds for the same comparison.

GPUs are optimized for these kinds of operations.

So even if we were to increase our embeddings by 100x (2677 -> 267,700), an exhaustive dot product operation would happen in ~0.02 seconds (assuming linear scaling from the ~0.0002 second figure).
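The scaling estimate is simple proportional arithmetic. A minimal sketch, assuming search time grows linearly with the number of embeddings (a rough guide only, since fixed overheads and memory bandwidth also play a part). `estimate_seconds` is a hypothetical helper and the numbers are placeholders, not measurements.

```python
def estimate_seconds(measured_seconds: float, measured_n: int, target_n: int) -> float:
    """Scale a measured search time linearly to a different number of embeddings."""
    seconds_per_embedding = measured_seconds / measured_n
    return seconds_per_embedding * target_n

# Placeholder numbers: 0.0001 s measured on 1,000 embeddings, scaled up 100x.
estimate = estimate_seconds(measured_seconds=0.0001, measured_n=1_000, target_n=100_000)
print(f"Estimated time: {estimate:.4f} seconds")
```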

Heck, let's try it.

In [39]:
larger_embeddings = torch.randn(100*embeddings.shape[0], 768).to(device)
print(f"Embeddings shape: {larger_embeddings.shape}")

# Perform dot product across 100x as many embeddings
start_time = timer()
dot_scores = util.dot_score(a=query_embedding, b=larger_embeddings)[0]
end_time = timer()

print(f"Time taken to get scores on {len(larger_embeddings)} embeddings: {end_time-start_time:.5f} seconds.")

Embeddings shape: torch.Size([267700, 768])
Time taken to get scores on 267700 embeddings: 0.20452 seconds.


Wow. That's quick!

That means we can get pretty far by just storing our embeddings in `torch.tensor` for now.

However, for *much* larger datasets, we'd likely look at a dedicated vector database or an indexing library such as [Faiss](https://github.com/facebookresearch/faiss).

Let's check the results of our original similarity search.

[`torch.topk`](https://pytorch.org/docs/stable/generated/torch.topk.html) returns a named tuple of values (scores) and indices for those scores.

The indices tell us which rows of the `embeddings` tensor produced those scores for the query embedding (higher is better).

We can use those indices to map back to our text chunks.
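To make the scoring and top-k steps less of a black box, here is a plain-Python sketch of the same idea: score every row with a dot product, keep the k best positions, then look those positions up in the list of chunks. The 3-dimensional vectors and chunk texts are made up for illustration, and `dot_scores_py` and `top_k_py` are hypothetical names rather than library functions.

```python
def dot_scores_py(query: list[float], embeddings: list[list[float]]) -> list[float]:
    """Dot product of the query with every embedding (one score per row)."""
    return [sum(q * e for q, e in zip(query, row)) for row in embeddings]

def top_k_py(scores: list[float], k: int) -> tuple[list[float], list[int]]:
    """Return the k highest scores and their positions, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [scores[i] for i in ranked], ranked

toy_chunks = ["chunk about fats", "chunk about protein", "chunk about protein and fats", "chunk about water"]
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [0.0, 1.0, 0.0]  # pretend this is the embedded query

toy_scores = dot_scores_py(toy_query, toy_embeddings)
toy_values, toy_indices = top_k_py(toy_scores, k=2)

# Map each index back to its text chunk, best match first.
for value, index in zip(toy_values, toy_indices):
    print(f"Score: {value:.2f} | Text: {toy_chunks[index]}")
```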

First, we'll define a small helper function to print out wrapped text (so it doesn't print a whole text chunk as a single line).

In [40]:
# Define helper function to print wrapped text 
import textwrap

def print_wrapped(text, wrap_length=80):
    wrapped_text = textwrap.fill(text, wrap_length)
    print(wrapped_text)

Now we can loop through the `top_results_dot_product` tuple, match up the scores and indices, and then use those indices to index into our `pages_and_chunks` variable to get the relevant text chunk.

Sounds like a lot but we can do it!

In [41]:
print(f"Query: '{query}'\n")
print("Results:")
# Loop through zipped together scores and indices from torch.topk
for score, idx in zip(top_results_dot_product[0], top_results_dot_product[1]):
    print(f"Score: {score:.4f}")
    # Print relevant sentence chunk (since the scores are in descending order, the most relevant chunk will be first)
    print("Text:")
    print_wrapped(pages_and_chunks[idx]["sentence_chunk"])
    # Print the page number too so we can reference the textbook further (and check the results)
    print(f"Page number: {pages_and_chunks[idx]['page_number']}")
    print("\n")

Query: 'macronutrients functions'

Results:
Score: 0.2459
Text:
This element can be either a selection linked to some program action, such as
Save or Close, or it can cause a submenu to be displayed. MenuItem defines the
following three constructors. MenuItem( ) MenuItem(String name) MenuItem(String
name, Node image) The first creates an empty menu item. The second lets you
specify the name of the item, and the third enables you to include an image.
Page number: 1208


Score: 0.2258
Text:
It also describes various options, such as justification, minimum field width,
and precision. Format Specifier Conversion Applied %g %G Uses %e or %f, based on
the value being formatted and the precision %o Octal integer %n Inserts a
newline character %s %S String %t %T Time and date %x %X Integer hexadecimal %%
Inserts a % sign Table 19-13 The Format Specifiers (continued)
Page number: 642


Score: 0.2185
Text:
These are used to specify the upper-left corner of the popup menu when it is
displayed

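With embeddings built from the nutrition textbook, the first result nails it: it is the macronutrients paragraph quoted earlier. (The output captured above came from a run against embeddings of a Java reference book, which is why its chunks and low scores don't match the query.)

On the nutrition data we get a very relevant answer to our query `"macronutrients functions"` even though it's quite vague.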

That's the power of semantic search!

And even better, if we wanted to inspect the result further, we get the page number where the text appears.

How about we check the page to verify?

We can do so by loading the page number containing the highest result (page 5 but really page 5 + 41 since our PDF page numbers start on page 41).

In [42]:
import fitz

# Open PDF and load target page
pdf_path = "human-nutrition-text.pdf" # requires PDF to be downloaded
doc = fitz.open(pdf_path)
page = doc.load_page(5 + 41) # number of page (our doc starts page numbers on page 41)

# Get the image of the page
img = page.get_pixmap(dpi=300)

# Optional: save the image
#img.save("output_filename.png")
doc.close()

# Convert the Pixmap to a numpy array
img_array = np.frombuffer(img.samples_mv, 
                          dtype=np.uint8).reshape((img.h, img.w, img.n))

# Display the image using Matplotlib
import matplotlib.pyplot as plt
plt.figure(figsize=(13, 10))
plt.imshow(img_array)
plt.title(f"Query: '{query}' | Most relevant page:")
plt.axis('off') # Turn off axis
plt.show()

FileNotFoundError: no such file: 'human-nutrition-text.pdf'

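If you hit a `FileNotFoundError` like the one above, download `human-nutrition-text.pdf` into the notebook's working directory and rerun the cell to see the page.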

Now we can do extra research if we'd like.

We could repeat this workflow for any kind of query we'd like on our textbook.

And it would work for other kinds of data too.

We could use semantic search on customer support documents.

Or email threads.

Or company plans.

Or our old journal entries.

Almost anything!

The workflow is the same:

`ingest documents -> split into chunks -> embed chunks -> make a query -> embed the query -> compare query embedding to chunk embeddings`

And we get relevant resources *along with* the source they came from!

That's the **retrieval** part of Retrieval Augmented Generation (RAG).
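The whole workflow fits in a few lines once the embedding model is swapped for a stand-in. The sketch below is illustrative only: `toy_embed` is a hypothetical word-count embedding that matches on shared words, not on meaning, so it shows the shape of the pipeline rather than the quality of a real embedding model.

```python
import math
from collections import Counter

def split_into_chunks(document: str) -> list[str]:
    """Split on periods (a stand-in for real sentence chunking)."""
    return [part.strip() for part in document.split(".") if part.strip()]

def toy_embed(text: str, vocabulary: list[str]) -> list[float]:
    """Word-count vector scaled to unit length. NOT a semantic embedding."""
    counts = Counter(text.lower().split())
    vector = [float(counts[word]) for word in vocabulary]
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector

# ingest documents -> split into chunks
document = "Proteins build muscle. Carbohydrates provide energy. Water keeps cells hydrated."
toy_chunks = split_into_chunks(document)

# embed chunks
vocabulary = sorted({word for chunk in toy_chunks for word in chunk.lower().split()})
chunk_vectors = [toy_embed(chunk, vocabulary) for chunk in toy_chunks]

# make a query -> embed the query
toy_query = "what foods provide energy"
query_vector = toy_embed(toy_query, vocabulary)

# compare query embedding to chunk embeddings
toy_scores = [sum(q * c for q, c in zip(query_vector, vec)) for vec in chunk_vectors]
best_index = max(range(len(toy_chunks)), key=lambda i: toy_scores[i])
print(f"Best chunk: {toy_chunks[best_index]!r} (score {toy_scores[best_index]:.2f})")
```

Replacing `toy_embed` with a real embedding model is what turns this word-overlap match into a semantic one.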

Before we get to the next two steps, let's take a small aside and discuss similarity measures.

### Similarity measures: dot product and cosine similarity 

Let's talk similarity measures between vectors.

Specifically, embedding vectors which are representations of data with magnitude and direction in high dimensional space (our embedding vectors have 768 dimensions).

Two of the most common you'll come across are the dot product and cosine similarity.

They are quite similar.

The main difference is that cosine similarity has a normalization step.

| Similarity measure | Description | Code |
| ----- | ----- | ----- |
| [Dot Product](https://en.wikipedia.org/wiki/Dot_product) | - Measure of magnitude and direction between two vectors<br>- Vectors that are aligned in direction and magnitude have a higher positive value<br>- Vectors that are opposite in direction and magnitude have a higher negative value | [`torch.dot`](https://pytorch.org/docs/stable/generated/torch.dot.html), [`np.dot`](https://numpy.org/doc/stable/reference/generated/numpy.dot.html), [`sentence_transformers.util.dot_score`](https://www.sbert.net/docs/package_reference/util.html#sentence_transformers.util.dot_score) | 
| [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity) | - Vectors get normalized by magnitude/[Euclidean norm](https://en.wikipedia.org/wiki/Norm_(mathematics))/L2 norm so they have unit length and are compared more so on direction<br>- Vectors that are aligned in direction have a value close to 1<br>- Vectors that are opposite in direction have a value close to -1 | [`torch.nn.functional.cosine_similarity`](https://pytorch.org/docs/stable/generated/torch.nn.functional.cosine_similarity.html), [`1 - scipy.spatial.distance.cosine`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html) (subtract the distance from 1 for similarity measure), [`sentence_transformers.util.cos_sim`](https://www.sbert.net/docs/package_reference/util.html#sentence_transformers.util.cos_sim) |

For text similarity, you generally want to use cosine similarity as you are after the semantic measurements (direction) rather than magnitude. 

In our case, our embedding model `all-mpnet-base-v2` produces normalized outputs (see the [Hugging Face model card](https://huggingface.co/sentence-transformers/all-mpnet-base-v2#usage-huggingface-transformers) for more on this) so dot product and cosine similarity return the same results. However, dot product is faster because it doesn't need to perform a normalization step.

To make things a bit more concrete, let's make simple dot product and cosine similarity functions and view their results on different vectors.

> **Note:** Similarity measures between vectors and embeddings can be used on any kind of embeddings, not just text embeddings. For example, you could measure image embedding similarity or audio embedding similarity. Or with text and image models like [CLIP](https://github.com/mlfoundations/open_clip), you can measure the similarity between text and image embeddings.

In [43]:
import torch

def dot_product(vector1, vector2):
    return torch.dot(vector1, vector2)

def cosine_similarity(vector1, vector2):
    dot_product = torch.dot(vector1, vector2)

    # Get Euclidean/L2 norm of each vector (removes the magnitude, keeps direction)
    norm_vector1 = torch.sqrt(torch.sum(vector1**2))
    norm_vector2 = torch.sqrt(torch.sum(vector2**2))

    return dot_product / (norm_vector1 * norm_vector2)

# Example tensors
vector1 = torch.tensor([1, 2, 3], dtype=torch.float32)
vector2 = torch.tensor([1, 2, 3], dtype=torch.float32)
vector3 = torch.tensor([4, 5, 6], dtype=torch.float32)
vector4 = torch.tensor([-1, -2, -3], dtype=torch.float32)

# Calculate dot product
print("Dot product between vector1 and vector2:", dot_product(vector1, vector2))
print("Dot product between vector1 and vector3:", dot_product(vector1, vector3))
print("Dot product between vector1 and vector4:", dot_product(vector1, vector4))

# Calculate cosine similarity
print("Cosine similarity between vector1 and vector2:", cosine_similarity(vector1, vector2))
print("Cosine similarity between vector1 and vector3:", cosine_similarity(vector1, vector3))
print("Cosine similarity between vector1 and vector4:", cosine_similarity(vector1, vector4))

Dot product between vector1 and vector2: tensor(14., device='cuda:0')
Dot product between vector1 and vector3: tensor(32., device='cuda:0')
Dot product between vector1 and vector4: tensor(-14., device='cuda:0')
Cosine similarity between vector1 and vector2: tensor(1.0000, device='cuda:0')
Cosine similarity between vector1 and vector3: tensor(0.9746, device='cuda:0')
Cosine similarity between vector1 and vector4: tensor(-1.0000, device='cuda:0')


Notice for both dot product and cosine similarity the comparisons of `vector1` and `vector2` are the opposite of `vector1` and `vector4`.

Comparing `vector1` and `vector2` both equations return positive values (14 for dot product and 1.0 for cosine similarity). 

But comparing `vector1` and `vector4` the result is in the negative direction.

This makes sense because `vector4` is the negative version of `vector1`.

Whereas comparing `vector1` and `vector3` shows a different outcome.

For the dot product, the value is positive and larger than the comparison of two identical vectors (32 vs 14).

However, for the cosine similarity, thanks to the normalization step, comparing `vector1` and `vector3` results in a positive value close to 1 but not exactly 1.

It is because of this that when comparing text embeddings, cosine similarity is generally favoured as it measures the difference in direction of a pair of vectors rather than difference in magnitude.

And it is this difference in direction that is more generally considered to capture the semantic meaning/vibe of the text.

The good news is that as mentioned before, the outputs of our embedding model `all-mpnet-base-v2` are already normalized.

So we can continue using the dot product (cosine similarity is dot product + normalization).
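To see why normalized embeddings make the two measures agree, here is a small plain-Python check on the same example values used above. It is a sketch for intuition only (the helper names are hypothetical): once both vectors are scaled to unit length, their dot product equals their cosine similarity.

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]

raw_dot = dot(a, b)                                # depends on magnitude
cos = cosine(a, b)                                 # depends on direction only
unit_dot = dot(l2_normalize(a), l2_normalize(b))   # dot product of unit vectors

print(f"raw dot: {raw_dot}, cosine: {cos:.4f}, dot of unit vectors: {unit_dot:.4f}")
```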

With similarity measures explained, let's functionize our semantic search steps from above so we can repeat them. 

### Functionizing our semantic search pipeline

Let's put all of the steps from above for semantic search into a function or two so we can repeat the workflow.

In [44]:
def retrieve_relevant_resources(query: str,
                                embeddings: torch.Tensor,
                                model: SentenceTransformer=embedding_model,
                                n_resources_to_return: int=5,
                                print_time: bool=True):
    """
    Embeds a query with model and returns top k scores and indices from embeddings.
    """

    # Embed the query
    query_embedding = model.encode(query, 
                                   convert_to_tensor=True) 

    # Get dot product scores on embeddings
    start_time = timer()
    dot_scores = util.dot_score(query_embedding, embeddings)[0]
    end_time = timer()

    if print_time:
        print(f"[INFO] Time taken to get scores on {len(embeddings)} embeddings: {end_time-start_time:.5f} seconds.")

    scores, indices = torch.topk(input=dot_scores, 
                                 k=n_resources_to_return)

    return scores, indices

def print_top_results_and_scores(query: str,
                                 embeddings: torch.Tensor,
                                 pages_and_chunks: list[dict]=pages_and_chunks,
                                 n_resources_to_return: int=5):
    """
    Takes a query, retrieves most relevant resources and prints them out in descending order.

    Note: Requires pages_and_chunks to be formatted in a specific way (see above for reference).
    """
    
    scores, indices = retrieve_relevant_resources(query=query,
                                                  embeddings=embeddings,
                                                  n_resources_to_return=n_resources_to_return)
    
    print(f"Query: {query}\n")
    print("Results:")
    # Loop through zipped together scores and indices
    for score, index in zip(scores, indices):
        print(f"Score: {score:.4f}")
        # Print relevant sentence chunk (since the scores are in descending order, the most relevant chunk will be first)
        print_wrapped(pages_and_chunks[index]["sentence_chunk"])
        # Print the page number too so we can reference the textbook further and check the results
        print(f"Page number: {pages_and_chunks[index]['page_number']}")
        print("\n")

Excellent! Now let's test our functions out.

In [45]:
query = "symptoms of pellagra"

# Get just the scores and indices of top related results
scores, indices = retrieve_relevant_resources(query=query,
                                              embeddings=embeddings)
scores, indices

[INFO] Time taken to get scores on 2677 embeddings: 0.00017 seconds.


(tensor([0.1634, 0.1496, 0.1478, 0.1430, 0.1426], device='cuda:0'),
 tensor([1970, 1963, 1972, 1971, 1681], device='cuda:0'))

In [46]:
# Print out the texts of the top scores
print_top_results_and_scores(query=query,
                             embeddings=embeddings)

[INFO] Time taken to get scores on 2677 embeddings: 0.00012 seconds.
Query: symptoms of pellagra

Results:
Score: 0.1634
import java.util.concurrent.*; // Extend MyPhaser to allow only a specific
number of phases // to be executed.class MyPhaser extends Phaser {  int
numPhases;  MyPhaser(int parties, int phaseCount) {   super(parties);
numPhases = phaseCount - 1;  }  // Override onAdvance() to execute the specified
// number of phases.protected boolean onAdvance(int p, int regParties) {   //
This println() statement is for illustration only.// Normally, onAdvance() will
not display output. System.out.println("Phase " + p + " completed.\n");   // If
all phases have completed, return true   if(p == numPhases || regParties == 0)
return true;   // Otherwise, return false.return false;  } } class PhaserDemo2 {
public static void main(String args[]) {   MyPhaser phsr = new MyPhaser(1, 4);
System.out.println("Starting\n");   new MyThread(phsr, "A");   new
MyThread(phsr, "B");   new MyThread(p

### Semantic search/vector search extensions 

We've covered an example of using embedding vector search to find relevant results based on a query.

However, you could also add to this pipeline with traditional keyword search.

Many modern search systems use keyword and vector search in tandem.
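One simple way to picture the combination is a weighted blend of a keyword score and a vector score. The sketch below is an illustrative assumption rather than how any particular search system works: `keyword_score`, `hybrid_score`, the `alpha` weight and the candidate scores are all made up for the example (real systems tend to use stronger keyword scoring such as BM25).

```python
def keyword_score(query: str, text: str) -> float:
    """Fraction of query words that appear in the text (case-insensitive)."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0

def hybrid_score(vector_score: float, kw_score: float, alpha: float = 0.7) -> float:
    """Weighted blend: alpha weights the vector score, (1 - alpha) the keyword score."""
    return alpha * vector_score + (1 - alpha) * kw_score

# Hypothetical candidates: (text, vector similarity score from an embedding model).
candidates = [
    ("niacin deficiency causes pellagra", 0.60),
    ("vitamins are essential nutrients", 0.62),
]
toy_query = "pellagra symptoms"

ranked = sorted(
    candidates,
    key=lambda item: hybrid_score(item[1], keyword_score(toy_query, item[0])),
    reverse=True,
)
print(ranked[0][0])
```

Here the exact keyword hit on "pellagra" lifts the first candidate above one with a slightly higher vector score.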

Our dataset is small and allows for an exhaustive search (comparing the query to *every* possible result) but if you start to work with large scale datasets with hundreds of thousands, millions or even billions of vectors, you'll want to implement an index.

You can think of an index as sorting your embeddings before you search through them.

So it narrows down the search space.

For example, it would be inefficient to search every word in the dictionary to find the word "duck". Instead you'd go straight to the letter D, perhaps even straight to the back half of the letter D, and find words close to "duck" before finding it.

That's how an index can help search through many examples without compromising too much on speed or quality (for more on this, check out [nearest neighbour search](https://en.wikipedia.org/wiki/Nearest_neighbor_search)).

One of the most popular indexing libraries is [Faiss](https://github.com/facebookresearch/faiss). 

Faiss is open-source and was originally created by Facebook to deal with internet-scale vectors. It implements many algorithms such as [HNSW](https://arxiv.org/abs/1603.09320) (Hierarchical Navigable Small World).
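To make "narrowing down the search space" concrete, here is a toy sketch of the bucketing idea behind inverted-file style indexes: group vectors by their closest centroid ahead of time, then at query time only score the vectors in the best-matching bucket. This is a simplification for intuition, not how Faiss is implemented. The centroids are hand-picked here, whereas a real index would learn them, and all function names are hypothetical.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def build_index(vectors, centroids):
    """Assign every vector ID to the bucket of its highest-scoring centroid."""
    buckets = {i: [] for i in range(len(centroids))}
    for vec_id, vec in enumerate(vectors):
        nearest = max(range(len(centroids)), key=lambda c: dot(vec, centroids[c]))
        buckets[nearest].append(vec_id)
    return buckets

def search(query, vectors, centroids, buckets, k=1):
    """Score only the vectors in the bucket whose centroid best matches the query."""
    nearest = max(range(len(centroids)), key=lambda c: dot(query, centroids[c]))
    candidates = buckets[nearest]
    ranked = sorted(candidates, key=lambda i: dot(query, vectors[i]), reverse=True)
    return ranked[:k], len(candidates)

toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_centroids = [[1.0, 0.0], [0.0, 1.0]]  # hand-picked for the example

toy_buckets = build_index(toy_vectors, toy_centroids)
top_ids, n_compared = search([0.2, 0.8], toy_vectors, toy_centroids, toy_buckets)
print(f"Best match: vector {top_ids[0]} after comparing {n_compared} of {len(toy_vectors)} vectors")
```

The tradeoff is that the search becomes approximate: a true best match sitting in a different bucket would be missed, which is the "quality" side of the speed/quality balance.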

### Getting an LLM for local generation

We've got our retrieval pipeline ready, let's now get the generation side of things happening.

To perform generation, we're going to use a Large Language Model (LLM).

LLMs are designed to generate an output given an input.

In our case, we want our LLM to generate an output of text given an input of text.

And more specifically, we want the output of text to be generated based on the context of relevant information to the query.

The input to an LLM is often referred to as a prompt.

We'll augment our prompt with a query as well as context from our textbook related to that query.
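As a rough picture of what "augmenting" means, the sketch below joins retrieved chunks into a context block and places the query after it. The template wording and the `build_prompt` helper are assumptions for illustration. The best format depends on the LLM you use.

```python
def build_prompt(query: str, context_items: list[str]) -> str:
    """Join retrieved chunks into a bulleted context block and append the query."""
    context = "\n".join(f"- {item}" for item in context_items)
    return (
        "Answer the query using the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Query: {query}\n"
        "Answer:"
    )

toy_prompt = build_prompt(
    query="What are the macronutrients?",
    context_items=["There are three classes of macronutrients: carbohydrates, lipids, and proteins."],
)
print(toy_prompt)
```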

> **Which LLM should I use?**

There are many LLMs available.

Two of the main questions to ask are:
1. Do I want it to run locally? 
2. If yes, how much compute power can I dedicate?

If you're after the absolute best performance, you'll likely want to use an API (not running locally) such as GPT-4 or Claude 3. However, this comes with the tradeoff of sending your data away and then awaiting a response.

For our case, since we want to set up a local pipeline and run it on our own GPU, we'd answer "yes" to the first question and then the second question will depend on what hardware we have available.

To find open-source LLMs, one great resource is the [Hugging Face open LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

The leaderboard compares many of the latest and greatest LLMs on various benchmarks.

Another great resource is [TheBloke on Hugging Face](https://huggingface.co/TheBloke), an account which provides an extensive range of quantized (models that have been made smaller) LLMs.

A rule of thumb for LLMs (and deep learning models in general) is that the higher the number of parameters, the better the model performs. 

It may be tempting to go for the largest size model (e.g. a 70B parameter model rather than a 7B parameter model) but a larger size model may not be able to run on your available hardware.

The following table gives an insight into how much GPU memory you'll need to load an LLM with different sizes and different levels of [numerical precision](https://en.wikipedia.org/wiki/Precision_(computer_science)).

They are based on the fact that 1 float32 value (e.g. `0.69420`) requires 4 bytes of memory and 1GB is approximately 1,000,000,000 (one billion) bytes.

| Model Size (Billion Parameters) | Float32 VRAM (GB) | Float16 VRAM (GB) | 8-bit VRAM (GB) | 4-bit VRAM (GB) |
|-----|-----|-----|-----|-----|
| 1B                              | ~4                | ~2                | ~1              | ~0.5            |
| 7B (e.g., [Llama 2 7B](https://huggingface.co/meta-llama/Llama-2-7b), [Gemma 7B](https://huggingface.co/google/gemma-7b-it), [Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-v0.1))             | ~28               | ~14               | ~7              | ~3.5            |
| 10B                             | ~40               | ~20               | ~10             | ~5              |
| 70B (e.g., Llama 2 70B)         | ~280              | ~140              | ~70             | ~35             |
| 100B                            | ~400              | ~200              | ~100            | ~50             |
| 175B                            | ~700              | ~350              | ~175            | ~87.5           |
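The table values follow from simple arithmetic: number of parameters times bytes per parameter. A minimal sketch, assuming 1GB is 1,000,000,000 bytes as stated above and counting only the weights (activations and other runtime overhead need extra memory on top). `estimate_weight_memory_gb` is a hypothetical helper name.

```python
BYTES_PER_GB = 1_000_000_000  # the approximation used for the table

def estimate_weight_memory_gb(n_params_billions: float, bits_per_param: int) -> float:
    """Memory needed just to hold the weights, ignoring activations and overhead."""
    n_bytes = n_params_billions * 1_000_000_000 * bits_per_param / 8
    return n_bytes / BYTES_PER_GB

for precision_name, bits in [("float32", 32), ("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"7B model in {precision_name}: ~{estimate_weight_memory_gb(7, bits):g} GB")
```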

<br>

> **Note:** Loading a model in a lower precision (e.g. 8-bit instead of float16) generally lowers performance. Lower precision can help to reduce computing requirements, however sometimes the performance degradation in terms of model output can be substantial. Finding the right speed/performance tradeoff will often require many experiments.

### Checking local GPU memory availability

Let's find out what hardware we've got available and see what kind of model(s) we'll be able to load.

> **Note:** You can also check this with the `!nvidia-smi` command.

In [47]:
# Get GPU available memory
import torch
gpu_memory_bytes = torch.cuda.get_device_properties(0).total_memory
gpu_memory_gb = round(gpu_memory_bytes / (2**30))
print(f"Available GPU memory: {gpu_memory_gb} GB")

Available GPU memory: 4 GB


Ok wonderful!

I'm running this notebook with an NVIDIA RTX 3050, so I've got 4GB of VRAM available.

However, this may be different on your end.

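Looking at the table above, 4GB is enough for a ~1B parameter model in float16 (~2GB), while a 7B model would need 4-bit precision (~3.5GB) and would leave almost no headroom. A ~7-10B parameter model in float16 needs a GPU with roughly 14-20GB or more.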

But we could also run a smaller one if we'd like.

Let's try out the recently released (at the time of writing, March 2024) LLM from Google, [Gemma](https://huggingface.co/blog/gemma).

Specifically, we'll use the `gemma-7b-it` version which stands for Gemma 7B Instruction-Tuned.

Instruction tuning is the process of tuning a raw language model to follow instructions.

These are the kind of models you'll find in most chat-based assistants such as ChatGPT, Gemini or Claude.

The following table shows the GPU memory requirements for different versions of the Gemma LLMs with varying levels of precision.

| Model             | Precision | Min-Memory (Bytes) | Min-Memory (MB) | Min-Memory (GB) | Recommended Memory (GB) | Hugging Face ID |
|-------------------|-----------|----------------|-------------|-------------| ----- | ----- |
| [Gemma 2B](https://huggingface.co/google/gemma-2b-it)          | 4-bit     | 2,106,749,952  | 2009.15     | 1.96        | ~5.0 | [`gemma-2b`](https://huggingface.co/google/gemma-2b) or [`gemma-2b-it`](https://huggingface.co/google/gemma-2b-it) for instruction tuned version | 
| Gemma 2B          | Float16   | 5,079,453,696  | 4844.14     | 4.73        | ~8.0 | Same as above |
| [Gemma 7B](https://huggingface.co/google/gemma-7b-it)          | 4-bit     | 5,515,859,968  | 5260.33     | 5.14        | ~8.0 | [`gemma-7b`](https://huggingface.co/google/gemma-7b) or [`gemma-7b-it`](https://huggingface.co/google/gemma-7b-it) for instruction tuned version |
| Gemma 7B          | Float16   | 17,142,470,656 | 16348.33    | 15.97       | ~19 | Same as above |

> **Note:** `gemma-7b-it` means "instruction tuned", as in, a base LLM (`gemma-7b`) has been fine-tuned to follow instructions, similar to [`Mistral-7B-v0.1`](https://huggingface.co/mistralai/Mistral-7B-v0.1) and [`Mistral-7B-Instruct-v0.1`](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1).
> 
> There are also further quantized and smaller variants of Gemma (and other LLMs) available in various formats such as GGUF. You can see many of these on [TheBloke account on Hugging Face](https://huggingface.co/TheBloke).
> 
> The version of LLM you choose to use will be largely based on project requirements and experimentation.

Based on the table above, let's write a simple if/else statement which recommends which Gemma variant we should look into using.

In [48]:
# Note: the following is Gemma focused, however, there are more and more LLMs of the 2B and 7B size appearing for local use.
# Initialize defaults to avoid NameError if a branch doesn't set them
use_quantization_config = False
model_id = None

if gpu_memory_gb < 5.1:
    print(f"Your available GPU memory is {gpu_memory_gb}GB, you may not have enough memory to run a Gemma LLM locally without quantization.")
elif gpu_memory_gb < 8.1:
    print(f"GPU memory: {gpu_memory_gb} | Recommended model: Gemma 2B in 4-bit precision.")
    use_quantization_config = True 
    model_id = "google/gemma-2b-it"
elif gpu_memory_gb < 19.0:
    print(f"GPU memory: {gpu_memory_gb} | Recommended model: Gemma 2B in float16 or Gemma 7B in 4-bit precision.")
    use_quantization_config = False 
    model_id = "google/gemma-2b-it"
else:  # 19GB or more (a bare `> 19.0` check would skip exactly 19GB)
    print(f"GPU memory: {gpu_memory_gb} | Recommended model: Gemma 7B in 4-bit or float16 precision.")
    use_quantization_config = False 
    model_id = "google/gemma-7b-it"

print(f"use_quantization_config set to: {use_quantization_config}")
print(f"model_id set to: {model_id}")

Your available GPU memory is 4GB, you may not have enough memory to run a Gemma LLM locally without quantization.
use_quantization_config set to: False
model_id set to: None


### Loading an LLM locally

Alright! On a GPU with more than 19GB of memory the check above picks `gemma-7b-it`. On the 4GB card used for the output above it leaves `model_id` unset, so change the `model_id` and `use_quantization_config` values to suit your hardware.

There are plenty of examples of how to load the model on the `gemma-7b-it` [Hugging Face model card](https://huggingface.co/google/gemma-7b-it).

Good news is, the Hugging Face [`transformers`](https://huggingface.co/docs/transformers/) library has all the tools we need.

To load our LLM, we're going to need a few things:
1. A quantization config (optional) - This will determine whether or not we load the model in 4-bit precision for lower memory usage. We can create this with the [`transformers.BitsAndBytesConfig`](https://huggingface.co/docs/transformers/v4.38.2/en/main_classes/quantization#transformers.BitsAndBytesConfig) class (requires installing the [`bitsandbytes` library](https://github.com/TimDettmers/bitsandbytes)).
2. A model ID - This is the reference Hugging Face model ID which will determine which tokenizer and model gets used. For example `gemma-7b-it`.
3. A tokenizer - This is what will turn our raw text into tokens ready for the model. We can create it by calling the [`transformers.AutoTokenizer.from_pretrained`](https://huggingface.co/docs/transformers/v4.38.2/en/model_doc/auto#transformers.AutoTokenizer) method with our model ID.
4. An LLM model - Again, using our model ID we can load a specific LLM model. To do so we can use the [`transformers.AutoModelForCausalLM.from_pretrained`](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForCausalLM.from_pretrained) method and pass it our model ID as well as various other parameters.

As a bonus, we'll check if [Flash Attention 2](https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2) is available using `transformers.utils.is_flash_attn_2_available()`. Flash Attention 2 speeds up the attention mechanism in Transformer architecture models (which is what many modern LLMs are based on, including Gemma). So if it's available and the model is supported (not all models support Flash Attention 2), we'll use it. If it's not available, you can install it by following the instructions on the [GitHub repo](https://github.com/Dao-AILab/flash-attention). 

> **Note:** Flash Attention 2 currently works on NVIDIA GPUs with a compute capability score of 8.0+ (Ampere, Ada Lovelace, Hopper architectures). We can check our GPU compute capability score with [`torch.cuda.get_device_capability(0)`](https://pytorch.org/docs/stable/generated/torch.cuda.get_device_capability.html). 
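
The availability check described above boils down to a small decision. Here's a minimal sketch of it in plain Python, assuming you have already looked up the compute capability (e.g. via `torch.cuda.get_device_capability(0)`) and whether Flash Attention 2 is installed (e.g. via `transformers.utils.is_flash_attn_2_available()`). The helper name `pick_attn_implementation` is made up for illustration; the returned strings are values accepted by the `attn_implementation` argument of `from_pretrained()`.

```python
def pick_attn_implementation(compute_capability: tuple[int, int],
                             flash_attn_installed: bool) -> str:
    """Hypothetical helper: choose a value for `attn_implementation`.

    compute_capability: (major, minor), e.g. from torch.cuda.get_device_capability(0)
    flash_attn_installed: e.g. from transformers.utils.is_flash_attn_2_available()
    """
    major, _minor = compute_capability
    # Flash Attention 2 needs compute capability 8.0+ (Ampere or newer)
    if flash_attn_installed and major >= 8:
        return "flash_attention_2"
    # Otherwise fall back to PyTorch's scaled dot product attention
    return "sdpa"


print(pick_attn_implementation((8, 6), True))   # flash_attention_2
print(pick_attn_implementation((7, 5), True))   # sdpa
print(pick_attn_implementation((8, 6), False))  # sdpa
```

Keep in mind the model itself also has to support the chosen implementation. Some architectures may only accept `"eager"`, so treat the fallback here as a starting point rather than a guarantee.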

> **Note:** To get access to the Gemma models, you will have to [agree to the terms & conditions](https://huggingface.co/google/gemma-7b-it) on the Gemma model page on Hugging Face. You will then have to authorize your local machine via the [Hugging Face CLI/Hugging Face Hub `login()` function](https://huggingface.co/docs/huggingface_hub/en/quick-start#authentication). Once you've done this, you'll be able to download the models. If you're using Google Colab, you can add a [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens) to the "Secrets" tab.
>
> Downloading an LLM locally can take a fair bit of time depending on your internet connection. Gemma 7B is about a 16GB download and Gemma 2B is about a 6GB download.

Let's do it!

In [49]:
# First install required packages
!pip install -q bitsandbytes accelerate transformers torch --upgrade

import os
import sys
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM  # Changed to Seq2SeqLM for T5

# Print versions for debugging
print(f"Python version: {sys.version}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    device = "cuda"
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    device = "cpu"
    print("No GPU available, using CPU")

# Use a smaller model that's more likely to work
model_id = "google/flan-t5-small"  # This is a smaller model that should work even with limited GPU memory
print(f"\nLoading model: {model_id}")

try:
    # 1. Load tokenizer
    print("Loading tokenizer...")
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    
    # 2. Load model with basic settings
    print("Loading model...")
    llm_model = AutoModelForSeq2SeqLM.from_pretrained(  # Changed to Seq2SeqLM
        model_id,
        torch_dtype=torch.float16 if device == "cuda" else torch.float32,
        low_cpu_mem_usage=True,
        trust_remote_code=True
    )
    
    # 3. Move to device
    llm_model = llm_model.to(device)
    print(f"Success! Model loaded and moved to {device}")
    
except Exception as e:
    print(f"\nError loading model: {str(e)}")
    
    if "out of memory" in str(e).lower():
        print("\nGPU out of memory error. Try:")
        print("1. Use a smaller model")
        print("2. Free up GPU memory")
        print("3. Use CPU instead")
    
    elif "authentication" in str(e).lower():
        print("\nAuthentication error. You need to:")
        print("1. Run: huggingface-cli login")
        print("2. Or set the HF_TOKEN environment variable")
    
    raise

  You can safely remove it manually.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchaudio 2.5.1+cu121 requires torch==2.5.1+cu121, but you have torch 2.10.0 which is incompatible.
torchvision 0.20.1+cu121 requires torch==2.5.1+cu121, but you have torch 2.10.0 which is incompatible.


Python version: 3.12.10 (tags/v3.12.10:0cc8128, Apr  8 2025, 12:21:36) [MSC v.1943 64 bit (AMD64)]
PyTorch version: 2.5.1+cu121
CUDA available: True
GPU: NVIDIA GeForce RTX 3050 Laptop GPU
GPU memory: 4.29 GB

Loading model: google/flan-t5-small
Loading tokenizer...


To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


Loading model...


`torch_dtype` is deprecated! Use `dtype` instead!


Success! Model loaded and moved to cuda


We've got an LLM!

Let's check it out.

In [50]:
llm_model

T5ForConditionalGeneration(
  (shared): Embedding(32128, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=512, out_features=384, bias=False)
              (k): Linear(in_features=512, out_features=384, bias=False)
              (v): Linear(in_features=512, out_features=384, bias=False)
              (o): Linear(in_features=384, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 6)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=512, out_features=1024, bias=False)
              (wi_1): Linear(in_features=512, out_features=1024, bias=False)
              (wo): 

Ok, ok a bunch of layers ranging from embedding layers to attention layers (see the `T5Attention` blocks!) to feed-forward (`T5LayerFF`) and normalization layers.

The good news is that we don't have to know too much about these to use the model.

How about we get the number of parameters in our model? 

In [51]:
def get_model_num_params(model: torch.nn.Module):
    return sum([param.numel() for param in model.parameters()])

get_model_num_params(llm_model)

76961152

That's roughly 77 million parameters. For comparison, Gemma 7B is really Gemma 8.5B, so `flan-t5-small` is around a hundredth of its size.

It pays to do your own investigations!

How about we get the model's memory requirements?

In [52]:
def get_model_mem_size(model: torch.nn.Module):
    """
    Get how much memory a PyTorch model takes up.

    See: https://discuss.pytorch.org/t/gpu-memory-that-model-uses/56822
    """
    # Get model parameters and buffer sizes
    mem_params = sum([param.nelement() * param.element_size() for param in model.parameters()])
    mem_buffers = sum([buf.nelement() * buf.element_size() for buf in model.buffers()])

    # Calculate various model sizes
    model_mem_bytes = mem_params + mem_buffers # in bytes
    model_mem_mb = model_mem_bytes / (1024**2) # in megabytes
    model_mem_gb = model_mem_bytes / (1024**3) # in gigabytes

    return {"model_mem_bytes": model_mem_bytes,
            "model_mem_mb": round(model_mem_mb, 2),
            "model_mem_gb": round(model_mem_gb, 2)}

get_model_mem_size(llm_model)

{'model_mem_bytes': 170699520, 'model_mem_mb': 162.79, 'model_mem_gb': 0.16}

Nice, looks like this model takes up about 0.16GB (roughly 163MB) of space on the GPU.

Plus a little more for the forward pass (due to all the calculations happening between the layers).

For comparison, the Gemma 7B weights take up 15.97GB, hence why I rounded it up to be ~19GB in the table above.
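
As a sanity check on the measured number, we can estimate weight memory straight from the parameter count. The sketch below is plain arithmetic: it multiplies the 76,961,152 parameters reported above by the bytes each one takes (2 for float16, 4 for float32). The helper name `estimate_weight_memory` is made up for illustration.

```python
def estimate_weight_memory(num_params: int, bytes_per_param: int = 2) -> dict:
    """Hypothetical helper: rough memory needed for model weights alone."""
    total_bytes = num_params * bytes_per_param
    return {"bytes": total_bytes,
            "mb": round(total_bytes / 1024**2, 2),
            "gb": round(total_bytes / 1024**3, 2)}


num_params = 76_961_152       # from get_model_num_params(llm_model) above
measured_bytes = 170_699_520  # from get_model_mem_size(llm_model) above

fp16_estimate = estimate_weight_memory(num_params, bytes_per_param=2)
fp32_estimate = estimate_weight_memory(num_params, bytes_per_param=4)

print(f"float16 estimate: {fp16_estimate}")
print(f"float32 estimate: {fp32_estimate}")
print(f"Measured minus float16 estimate: {measured_bytes - fp16_estimate['bytes']} bytes")
```

The measured size lands a little above the pure float16 estimate. That gap would be consistent with a few weights or buffers being held at higher precision, but that's an inference from the numbers rather than something this notebook checks.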

Now let's get to the fun part, generating some text!

### Generating text with our LLM

We can generate text with our LLM by calling the [`generate()` method](https://huggingface.co/docs/transformers/v4.38.2/en/main_classes/text_generation#transformers.GenerationConfig) on our `llm_model` instance and passing it a tokenized input (the method accepts plenty of generation options alongside the input).

The tokenized input comes from passing a string of text to our `tokenizer`.

It's important to note that you should use a tokenizer that has been paired with a model.

Otherwise if you try to use a different tokenizer and then pass those inputs to a model, you will likely get errors/strange results.

For some LLMs, there's a specific template you should pass to them for ideal outputs.

For example, the `gemma-7b-it` model has been trained in a dialogue fashion (instruction tuning).

In that case, the `tokenizer` has an [`apply_chat_template()` method](https://huggingface.co/docs/transformers/main/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template) which can prepare our input text in the right format for the model. `flan-t5-small` was not trained on Gemma's chat format, so the cell below uses a short task prefix instead.

Let's try it out.

> **Note:** The following demo has been modified from the Hugging Face model card for [Gemma 7B](https://huggingface.co/google/gemma-7b-it). Many similar demos of usage are available on the model cards of similar models.

In [53]:
# Define the input text
input_text = "What are the macronutrients, and what roles do they play in the human body?"
print(f"Input text:\n{input_text}")

# For T5 models, we can use the input text directly
# Optionally, we can add a task prefix to help guide the model
prompt = f"answer: {input_text}"
print(f"\nPrompt:\n{prompt}")

# Tokenize the input
model_inputs = tokenizer(prompt, return_tensors="pt").to(device)

Input text:
What are the macronutrients, and what roles do they play in the human body?

Prompt:
answer: What are the macronutrients, and what roles do they play in the human body?


Notice the scaffolding around our input text. With Gemma this would be the turn-by-turn chat format the model was instruction-tuned on; here it is just the `answer:` prefix.

Our next step is to tokenize this formatted text and pass it to our model's `generate()` method.

We'll make sure our tokenized text is on the same device as our model (GPU) using `to("cuda")`.

Let's generate some text! 

We'll time it for fun with the `%%time` magic.

In [54]:
%%time

# Tokenize the input text (turn it into numbers) and send it to the same device as the model
input_ids = tokenizer(prompt, return_tensors="pt").to(device)
print(f"Model input (tokenized):\n{input_ids}\n")

# Generate outputs based on the tokenized input
# See generate docs: https://huggingface.co/docs/transformers/v4.38.2/en/main_classes/text_generation#transformers.GenerationConfig 
outputs = llm_model.generate(**input_ids,
                             max_new_tokens=256) # define the maximum number of new tokens to create
print(f"Model output (tokens):\n{outputs[0]}\n")

Model input (tokenized):
{'input_ids': tensor([[ 1525,    10,   363,    33,     8, 11663,  8631,   295,     7,     6,
            11,   125,  6270,   103,    79,   577,    16,     8,   936,   643,
            58,     1]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
       device='cuda:0')}

Model output (tokens):
tensor([   0,    3, 8631,  295,    7,    1], device='cuda:0')

CPU times: total: 406 ms
Wall time: 1.66 s


Woohoo! We just generated some text on our local GPU!

Well not just yet...

Our LLM accepts tokens in and sends tokens back out.

We can convert the output tokens to text using [`tokenizer.decode()`](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.decode).

In [55]:
# Decode the output tokens to text
outputs_decoded = tokenizer.decode(outputs[0])
print(f"Model output (decoded):\n{outputs_decoded}\n")

Model output (decoded):
<pad> nutrients</s>



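Hmm, that's a very short answer. A model as small as `flan-t5-small` won't give the kind of detailed response a 7B model can.

Also notice that the output contains only the generated text, not the prompt. T5 is an encoder-decoder model; a decoder-only model such as Gemma would echo the prompt at the start of its output.

The next cell keeps the original formatting step, which strips the prompt and Gemma's special tokens. With T5 it leaves the text unchanged because T5's special tokens are different.

> **Note:** `"<bos>"` and `"<eos>"` are Gemma's special tokens to denote "beginning of sequence" and "end of sequence" respectively. In the T5 output above, `"<pad>"` is the token the decoder starts from and `"</s>"` marks the end of the sequence.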

In [56]:
print(f"Input text: {input_text}\n")
print(f"Output text:\n{outputs_decoded.replace(prompt, '').replace('<bos>', '').replace('<eos>', '')}")

Input text: What are the macronutrients, and what roles do they play in the human body?

Output text:
<pad> nutrients</s>
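
The output above still shows `<pad>` and `</s>` because the formatting step only looks for `<bos>` and `<eos>`. Here's a minimal sketch of stripping T5-style special tokens with plain string operations. The helper name `strip_special_tokens` and its default token list are assumptions for illustration, based on the two tokens visible in the output above.

```python
def strip_special_tokens(text: str,
                         special_tokens: tuple[str, ...] = ("<pad>", "</s>")) -> str:
    """Hypothetical helper: remove special tokens from decoded text."""
    for token in special_tokens:
        text = text.replace(token, "")
    # Drop the whitespace left behind at either end
    return text.strip()


decoded = "<pad> nutrients</s>"  # the decoded output from the cell above
print(strip_special_tokens(decoded))  # nutrients
```

In practice you can get the same effect without a helper by calling `tokenizer.decode(outputs[0], skip_special_tokens=True)`, which is what the `generate_response()` function later in this notebook does.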


How cool is that!

We just officially generated text from an LLM running locally.

So we've covered the R (retrieval) and G (generation) of RAG.

How about we check out the last step?

Augmentation.

First, let's put together a list of queries we can try out with our pipeline.

In [57]:
# Nutrition-style questions generated with GPT4
gpt4_questions = [
    "What are the macronutrients, and what roles do they play in the human body?",
    "How do vitamins and minerals differ in their roles and importance for health?",
    "Describe the process of digestion and absorption of nutrients in the human body.",
    "What role does fibre play in digestion? Name five fibre containing foods.",
    "Explain the concept of energy balance and its importance in weight management."
]

# Manually created question list
manual_questions = [
    "How often should infants be breastfed?",
    "What are symptoms of pellagra?",
    "How does saliva help with digestion?",
    "What is the RDI for protein per day?",
    "water soluble vitamins"
]

query_list = gpt4_questions + manual_questions

And now let's check if our `retrieve_relevant_resources()` function works with our list of queries.

In [58]:
import random
query = random.choice(query_list)

print(f"Query: {query}")

# Get just the scores and indices of top related results
scores, indices = retrieve_relevant_resources(query=query,
                                              embeddings=embeddings)
scores, indices

Query: What role does fibre play in digestion? Name five fibre containing foods.
[INFO] Time taken to get scores on 2677 embeddings: 0.00013 seconds.


(tensor([0.1573, 0.1559, 0.1558, 0.1553, 0.1534], device='cuda:0'),
 tensor([2100, 1432, 1461, 1739, 1467], device='cuda:0'))
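
Under the hood, a retrieval function like this scores every stored embedding against the query embedding with a dot product and keeps the highest scoring ones. Here's a minimal pure-Python sketch of that idea. The name `top_k_by_dot_product` and the toy 2-dimensional vectors are made up for illustration; the notebook's own function works on PyTorch tensors instead.

```python
def top_k_by_dot_product(query_vec: list[float],
                         doc_vecs: list[list[float]],
                         k: int = 5) -> tuple[list[float], list[int]]:
    """Hypothetical helper: return (scores, indices) of the k best matching vectors."""
    # One dot product per stored vector
    all_scores = [sum(q * d for q, d in zip(query_vec, doc)) for doc in doc_vecs]
    # Indices sorted from highest to lowest score
    ranked = sorted(range(len(all_scores)), key=lambda i: all_scores[i], reverse=True)
    top_indices = ranked[:min(k, len(all_scores))]
    return [all_scores[i] for i in top_indices], top_indices


# Toy unit-length vectors standing in for real embeddings
toy_docs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
toy_query = [0.8, 0.6]

toy_scores, toy_indices = top_k_by_dot_product(toy_query, toy_docs, k=2)
print(toy_indices)  # [1, 0]
```

When the embeddings are normalized to unit length, the dot product equals cosine similarity, so scores close to 1 mean a strong match. If that holds for the embeddings used here, scores of around 0.15 like the ones above suggest the retrieved chunks are only weakly related to the query.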

Beautiful!

Let's augment!

### Augmenting our prompt with context items

What we'd like to do with augmentation is take the results from our search for relevant resources and put them into the prompt that we pass to our LLM.

In essence, we start with a base prompt and update it with context text.

Let's write a function called `prompt_formatter` that takes in a query and our list of context items (in our case it'll be select indices from our list of dictionaries inside `pages_and_chunks`) and then formats the query with text from the context items.

We'll apply the dialogue and chat template to our prompt before returning it as well.

> **Note:** The process of augmenting or changing a prompt to an LLM is known as prompt engineering. And the best way to do it is an active area of research. For a comprehensive guide on different prompt engineering techniques, I'd recommend the Prompt Engineering Guide ([promptingguide.ai](https://www.promptingguide.ai/)), [Brex's Prompt Engineering Guide](https://github.com/brexhq/prompt-engineering) and the paper [Prompt Design and Engineering: Introduction and Advanced Models](https://arxiv.org/abs/2401.14423).

In [59]:
def prompt_formatter(query: str, 
                     context_items: list[dict]) -> str:
    """
    Augments query with text-based context from context_items.
    """
    # Join context items into one bulleted list (one "- " prefixed item per line)
    context = "- " + "\n- ".join([item["sentence_chunk"] for item in context_items])

    # Create a base prompt with examples to help the model
    # Note: this is very customizable, I've chosen to use 3 examples of the answer style we'd like.
    # We could also write this in a txt file and import it in if we wanted.
    base_prompt = """Based on the following context items, please answer the query.
Give yourself room to think by extracting relevant passages from the context before answering the query.
Don't return the thinking, only return the answer.
Make sure your answers are as explanatory as possible.
Use the following examples as reference for the ideal answer style.
\nExample 1:
Query: What are the fat-soluble vitamins?
Answer: The fat-soluble vitamins include Vitamin A, Vitamin D, Vitamin E, and Vitamin K. These vitamins are absorbed along with fats in the diet and can be stored in the body's fatty tissue and liver for later use. Vitamin A is important for vision, immune function, and skin health. Vitamin D plays a critical role in calcium absorption and bone health. Vitamin E acts as an antioxidant, protecting cells from damage. Vitamin K is essential for blood clotting and bone metabolism.
\nExample 2:
Query: What are the causes of type 2 diabetes?
Answer: Type 2 diabetes is often associated with overnutrition, particularly the overconsumption of calories leading to obesity. Factors include a diet high in refined sugars and saturated fats, which can lead to insulin resistance, a condition where the body's cells do not respond effectively to insulin. Over time, the pancreas cannot produce enough insulin to manage blood sugar levels, resulting in type 2 diabetes. Additionally, excessive caloric intake without sufficient physical activity exacerbates the risk by promoting weight gain and fat accumulation, particularly around the abdomen, further contributing to insulin resistance.
\nExample 3:
Query: What is the importance of hydration for physical performance?
Answer: Hydration is crucial for physical performance because water plays key roles in maintaining blood volume, regulating body temperature, and ensuring the transport of nutrients and oxygen to cells. Adequate hydration is essential for optimal muscle function, endurance, and recovery. Dehydration can lead to decreased performance, fatigue, and increased risk of heat-related illnesses, such as heat stroke. Drinking sufficient water before, during, and after exercise helps ensure peak physical performance and recovery.
\nNow use the following context items to answer the user query:
{context}
\nRelevant passages: <extract relevant passages from the context here>
User query: {query}
Answer:"""

    # Update base prompt with context items and query   
    base_prompt = base_prompt.format(context=context, query=query)

    # Create prompt template for instruction-tuned model
    dialogue_template = [
        {"role": "user",
        "content": base_prompt}
    ]

    # Apply the chat template
    prompt = tokenizer.apply_chat_template(conversation=dialogue_template,
                                          tokenize=False,
                                          add_generation_prompt=True)
    return prompt

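Looking good! Let's try our function out. The cell below first redefines both helpers in a simpler form that suits `flan-t5-small`: a plain question-answering prompt with no few-shot examples and no chat template.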

In [60]:
# Define helper functions first
def retrieve_relevant_resources(query, embeddings, k=3):
    """
    Retrieve the most relevant text chunks based on similarity to the query
    """
    # Get query embedding
    query_embedding = embedding_model.encode([query])[0]  # Get the first (and only) embedding
    query_embedding = torch.tensor(query_embedding).to(device)
    
    # Calculate similarity scores
    scores = torch.matmul(embeddings, query_embedding)
    
    # Get top k scores and indices
    top_k_scores, top_k_indices = torch.topk(scores, k=min(k, len(scores)))
    
    return top_k_scores, top_k_indices

def prompt_formatter(query, context_items):
    """
    Format the prompt with context information
    """
    # Combine context items into a single string
    context = "\n\n".join([f"Context {i+1}:\n{item}" for i, item in enumerate(context_items)])
    
    # Create the full prompt
    prompt = f"""Use the following pieces of context to answer the question. If you cannot find 
the answer in the context, say "I don't have enough information to answer that."

{context}

Question: {query}
Answer: """
    
    return prompt

# Now use the functions
print("Processing query...")
scores, indices = retrieve_relevant_resources(query=query,
                                           embeddings=embeddings)
    
# Create a list of context items
context_items = [pages_and_chunks[i] for i in indices]

# Format prompt with context items
prompt = prompt_formatter(query=query,
                        context_items=context_items)
print("\nGenerated prompt:")
print(prompt)

Processing query...

Generated prompt:
Use the following pieces of context to answer the question. If you cannot find 
the answer in the context, say "I don't have enough information to answer that."

Context 1:
{'page_number': 1027, 'sentence_chunk': 'Chapter 30\u2003  Regular Expressions and Other Packages\u2003 \u2002 993 Part II Regular Expression Processing The java.util.regex package supports regular expression processing. As the term is used here, a regular expression is a string of characters that describes a character sequence. This general description, called a pattern, can then be used to find matches in other character sequences. Regular expressions can specify wildcard characters, sets of characters, and various quantifiers. Thus, you can specify a regular expression that represents a general form that can match several different specific character sequences. There are two classes that support regular expression processing: Pattern and Matcher. These classes work together.

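The prompt has the right shape, but two things stand out. Each context item is printed as a whole dictionary (page number and all) because this version of `prompt_formatter` formats `item` rather than `item["sentence_chunk"]`. And the top-ranked chunk is about Java regular expressions, which has nothing to do with the nutrition-style questions in `query_list`, so the low retrieval scores seen earlier (around 0.15) likely reflect weak matches.

We can still tokenize this and pass it straight to our LLM.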
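
One way to tidy up the context portion of the prompt is to pull out only the `sentence_chunk` text from each item and cap its length. Here's a minimal sketch, assuming each context item is a dictionary with a `"sentence_chunk"` key as in the output above. The helper name `format_context` and the `max_chars_per_item` parameter are made up for illustration, and the example item is a toy one rather than real data from `pages_and_chunks`.

```python
def format_context(context_items: list[dict], max_chars_per_item: int = 500) -> str:
    """Hypothetical helper: build the context block from the chunk text only."""
    blocks = []
    for i, item in enumerate(context_items, start=1):
        # Collapse runs of whitespace/newlines in the chunk text
        text = " ".join(item["sentence_chunk"].split())
        # Cap very long chunks so the question still fits in the prompt
        if len(text) > max_chars_per_item:
            text = text[:max_chars_per_item].rstrip() + "..."
        blocks.append(f"Context {i}:\n{text}")
    return "\n\n".join(blocks)


# Toy item for illustration only
example_items = [{"page_number": 1, "sentence_chunk": "Fibre  adds bulk\nto stool."}]
print(format_context(example_items))
```

Capping the length matters here because `generate_response()` tokenizes with `truncation=True`. If the context alone fills the tokenizer's maximum length (typically 512 tokens for T5 tokenizers), the question at the end of the prompt gets cut off before the model ever sees it.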

In [61]:
# Generate response using the T5 model
def generate_response(prompt, max_length=512):
    """
    Generate a response using the T5 model
    """
    # Tokenize the prompt
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True).to(device)
    
    # Generate response
    print("Generating response...")
    with torch.no_grad():  # Disable gradient calculation for inference
        outputs = llm_model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_length=max_length,
            num_return_sequences=1,
            temperature=0.7,  # Control randomness (higher = more random)
            do_sample=True,   # Use sampling instead of greedy decoding
            no_repeat_ngram_size=2  # Avoid repeating 2-grams
        )
    
    # Decode the response
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

# Now use the functions
print("Processing query...")
scores, indices = retrieve_relevant_resources(query=query,
                                           embeddings=embeddings)

# Create a list of context items
context_items = [pages_and_chunks[i] for i in indices]

# Format prompt with context items
prompt = prompt_formatter(query=query,
                        context_items=context_items)

# Generate and print the response
print("\nGenerating response...")
response = generate_response(prompt)
print("\nModel's response:")
print(response)

Processing query...

Generating response...
Generating response...

Model's response:
14.235423803][by130187


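Yesssssss!!!

Our RAG pipeline runs end to end!

We just Retrieved, Augmented and Generated!

And all on our own local GPU!

The answer itself is gibberish, though, which is not too surprising given the unrelated context in the prompt and how small `flan-t5-small` is. A larger instruction-tuned model and better-matched context should help.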

How about we functionize the generation step to make it easier to use?

We can put a little formatting on the text being returned to make it look nice too.

And we'll make an option to return the context items if needed as well.

In [62]:
def ask(query, 
        temperature=0.7,
        max_new_tokens=512,
        format_answer_text=True, 
        return_answer_only=True):
    """
    Takes a query, finds relevant resources/context and generates an answer to the query based on the relevant resources.
    """
    
    # Get just the scores and indices of top related results
    scores, indices = retrieve_relevant_resources(query=query,
                                                  embeddings=embeddings)
    
    # Create a list of context items
    context_items = [pages_and_chunks[i] for i in indices]

    # Add score to context item
    for i, item in enumerate(context_items):
        item["score"] = scores[i].cpu() # return score back to CPU 
        
    # Format the prompt with context items
    prompt = prompt_formatter(query=query,
                              context_items=context_items)
    
    # Tokenize the prompt and move it to the same device as the model
    input_ids = tokenizer(prompt, return_tensors="pt").to(device)

    # Generate an output of tokens
    outputs = llm_model.generate(**input_ids,
                                 temperature=temperature,
                                 do_sample=True,
                                 max_new_tokens=max_new_tokens)
    
    # Turn the output tokens into text
    output_text = tokenizer.decode(outputs[0])

    if format_answer_text:
        # Replace special tokens (Gemma's <bos>/<eos> and T5's <pad>/</s>) and unnecessary help message
        output_text = output_text.replace(prompt, "").replace("<bos>", "").replace("<eos>", "").replace("<pad>", "").replace("</s>", "").replace("Sure, here is the answer to the user query:\n\n", "").strip()

    # Only return the answer without the context items
    if return_answer_only:
        return output_text
    
    return output_text, context_items

What a good looking function!

The workflow could probably be a little refined but this should work!

Let's try it out.

In [63]:
def ask(query, 
        temperature=0.7,
        max_new_tokens=512,
        format_answer_text=True, 
        return_answer_only=True):
    """
    Takes a query, finds relevant resources/context and generates an answer to the query based on the relevant resources.
    
    Args:
        query (str): The question to ask
        temperature (float): Controls randomness in generation (0.0 to 1.0)
        max_new_tokens (int): Maximum length of generated response
        format_answer_text (bool): Whether to format the answer text nicely
        return_answer_only (bool): If True, returns only the answer. If False, returns (answer, context_items)
    
    Returns:
        str or tuple: Generated answer text, or tuple of (answer, context_items) if return_answer_only=False
    """
    try:
        # 1. Get relevant resources
        scores, indices = retrieve_relevant_resources(query=query, embeddings=embeddings)
        context_items = [pages_and_chunks[i] for i in indices]
        
        # 2. Format the prompt with context
        prompt = prompt_formatter(query=query, context_items=context_items)
        
        # 3. Generate response
        response = generate_response(
            prompt,
            max_length=max_new_tokens,
            temperature=temperature
        )
        
        # 4. Format the response if requested
        if format_answer_text:
            # Remove any extra whitespace and newlines
            response = ' '.join(response.split())
            # Capitalize first letter
            response = response[0].upper() + response[1:] if response else response
            # Add period if missing
            if response and response[-1] not in '.!?':
                response += '.'
        
        # 5. Return results
        if return_answer_only:
            return response
        else:
            return response, context_items
            
    except Exception as e:
        print(f"Error in ask function: {str(e)}")
        return "I'm sorry, I encountered an error while trying to answer your question."

# Example usage
query = "What are the key benefits of exercise?"
print("Question:", query)
print("\nAnswer:", ask(query))

Question: What are the key benefits of exercise?
Error in ask function: generate_response() got an unexpected keyword argument 'temperature'

Answer: I'm sorry, I encountered an error while trying to answer your question.


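Local RAG workflow almost complete!

The call above fell into the `except` branch because `ask()` passes `temperature` to `generate_response()`, which only accepts `prompt` and `max_length`. Once `generate_response()` takes a `temperature` argument (or `ask()` stops passing one), we've officially got a way to Retrieve, Augment and Generate answers based on a source.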
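
The output above shows `ask()` failing with `unexpected keyword argument 'temperature'`. One way to avoid that kind of mismatch is to build the generation settings in a single place and pass them on to `generate()`. Here's a minimal sketch, assuming the same settings used in `generate_response()`. The helper name `build_generation_kwargs` is made up for illustration; the dictionary keys are standard `generate()` arguments.

```python
def build_generation_kwargs(temperature: float = 0.7,
                            max_new_tokens: int = 256,
                            no_repeat_ngram_size: int = 2) -> dict:
    """Hypothetical helper: collect keyword arguments for `llm_model.generate()`."""
    kwargs = {"max_new_tokens": max_new_tokens,
              "no_repeat_ngram_size": no_repeat_ngram_size}
    if temperature > 0:
        # Sample from the distribution, with temperature controlling randomness
        kwargs["do_sample"] = True
        kwargs["temperature"] = temperature
    else:
        # A temperature of 0 means "be deterministic": use greedy decoding instead
        kwargs["do_sample"] = False
    return kwargs


print(build_generation_kwargs(temperature=0.7, max_new_tokens=128))
print(build_generation_kwargs(temperature=0.0))
```

With this in place, `generate_response()` could accept `temperature` and `max_new_tokens` and call `llm_model.generate(**inputs, **build_generation_kwargs(temperature=temperature, max_new_tokens=max_new_tokens))`, so `ask()` and `generate_response()` agree on the same argument names.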

For now we can verify our answers manually by reading them and reading through the textbook.

But if you want to put this into a production system, it'd be a good idea to have some kind of evaluation on how well our pipeline works.

For example, you could use another LLM to rate the answers returned by our LLM and then use those ratings as a proxy evaluation.

However, I'll leave this and a few more interesting ideas as extensions.

## Extensions

* May want to improve text extraction with something like Marker - https://github.com/VikParuchuri/marker
* Guide to more advanced PDF extraction - https://towardsdatascience.com/extracting-text-from-pdf-files-with-python-a-comprehensive-guide-9fc4003d517 
* See the following prompt engineering resources for more prompting techniques - promptingguide.ai, Brex's Prompt Engineering Guide 
* What happens when a query comes through that there isn't any context in the textbook on?
* Try another embedding model (e.g. Mixed Bread AI large, `mixedbread-ai/mxbai-embed-large-v1`, see: https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1)
* Try another LLM... (e.g. Mistral-Instruct)
* Try different prompts (e.g. see prompting techniques online)
* Our example only focuses on text from a PDF, however, we could extend it to include figures and images 
* Evaluate the answers -> could use another LLM to rate our answers (e.g. use GPT-4 to rate them)
* Vector database/index for larger setup (e.g. 100,000+ chunks)
* Libraries/frameworks such as LangChain / LlamaIndex can help do many of the steps for you, so they're worth looking into next. This notebook recreates the workflow with lower-level tools to show the principles
* Optimizations for speed
    * See Hugging Face docs for recommended speed ups on GPU - https://huggingface.co/docs/transformers/perf_infer_gpu_one 
    * Optimum NVIDIA - https://huggingface.co/blog/optimum-nvidia, GitHub: https://github.com/huggingface/optimum-nvidia 
    * See NVIDIA TensorRT-LLM - https://github.com/NVIDIA/TensorRT-LLM 
    * See GPT-Fast for PyTorch-based optimizations - https://github.com/pytorch-labs/gpt-fast 
    * Flash attention 2 (requires Ampere GPUs or newer) - https://github.com/Dao-AILab/flash-attention
* Stream text output so it looks prettier (e.g. each token appears as it gets output from the model)
* Turn the workflow into an app, see Gradio type chatbots for this - https://www.gradio.app/guides/creating-a-chatbot-fast, see local example: https://www.gradio.app/guides/creating-a-chatbot-fast#example-using-a-local-open-source-llm-with-hugging-face 