## Requirements and setup

* Local NVIDIA GPU (I used an NVIDIA RTX 4090 on a Windows 11 machine) or Google Colab with access to a GPU.
* Environment setup (see [setup details on GitHub](https://github.com/mrdbourke/simple-local-rag/?tab=readme-ov-file#setup)).
* Data source (for example, a PDF).
* Internet connection (to download the models, but once you have them, it'll run offline).

In [1]:
# Perform Google Colab installs (if running in Google Colab)
import os

if "COLAB_GPU" in os.environ:
    print("[INFO] Running in Google Colab, installing requirements.")
    !pip install -U torch # requires torch 2.1.1+ (for efficient sdpa implementation)
    !pip install PyMuPDF # for reading PDFs with Python
    !pip install tqdm # for progress bars
    !pip install sentence-transformers # for embedding models
    !pip install accelerate # for quantization model loading
    !pip install bitsandbytes # for quantizing models (less storage space)
    !pip install flash-attn --no-build-isolation # for faster attention mechanism = faster LLM inference

## 1. Document/Text Processing and Embedding Creation

Ingredients:
* PDF document of choice.
* Embedding model of choice.

Steps:
1. Import PDF document.
2. Process text for embedding (e.g. split into chunks of sentences).
3. Embed text chunks with embedding model.
4. Save embeddings to file for later use (embeddings will keep on file for many years, or until you lose your hard drive).
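Step 4 can be sketched with nothing but the standard library. This is a minimal illustration, not the notebook's own saving code: the JSON format, the file name `example_embeddings.json` and the helper names `save_embeddings`/`load_embeddings` are all assumptions, and the vectors are made up.

```python
import json
import os
import tempfile

def save_embeddings(records: list[dict], path: str) -> None:
    """Write chunk records (text plus embedding as a list of floats) to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f)

def load_embeddings(path: str) -> list[dict]:
    """Read the chunk records back from the JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

# Toy records standing in for real chunks (the embedding values are made up)
records = [
    {"page_number": 1, "sentence_chunk": "First chunk of text.", "embedding": [0.1, -0.2, 0.3]},
    {"page_number": 2, "sentence_chunk": "Second chunk of text.", "embedding": [0.0, 0.5, -0.4]},
]

# Hypothetical file location, chosen only for this example
save_path = os.path.join(tempfile.gettempdir(), "example_embeddings.json")
save_embeddings(records, save_path)

loaded = load_embeddings(save_path)
print(len(loaded), loaded[0]["embedding"])
```

Real embeddings usually arrive as NumPy arrays or tensors, so they'd need converting to plain lists (for example with `.tolist()`) before `json.dump` can handle them.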

### Import PDF Document

This will work with many other kinds of documents.

However, we'll start with PDF since many people have PDFs.

But just keep in mind, text files, email chains, support documentation, articles and more can also work.

We're going to pretend we're nutrition students at the University of Hawai'i, reading through the open-source PDF textbook [*Human Nutrition: 2020 Edition*](https://pressbooks.oer.hawaii.edu/humannutrition2/).

There are several libraries to open PDFs with Python but I found that [PyMuPDF](https://github.com/pymupdf/pymupdf) works quite well in many cases.

First we'll download the PDF if it doesn't exist.

In [1]:
# Download PDF file
#import os
#import requests
#
## Get PDF document
#file_path = "human-nutrition-text.pdf"
#
## Download PDF if it doesn't already exist
#if not os.path.exists(file_path):
#  print("File doesn't exist, downloading...")
#
#  # The URL of the PDF you want to download
#  url = "https://pressbooks.oer.hawaii.edu/humannutrition2/open/download?type=pdf"
#
#  # The local filename to save the downloaded file
#  filename = file_path
#
#  # Send a GET request to the URL
#  response = requests.get(url)
#
#  # Check if the request was successful
#  if response.status_code == 200:
#      # Open a file in binary write mode and save the content to it
#      with open(filename, "wb") as file:
#          file.write(response.content)
#      print(f"The file has been downloaded and saved as {filename}")
#  else:
#      print(f"Failed to download the file. Status code: {response.status_code}")
#else:
#  print(f"File {file_path} exists.")
import os

choice = "pdf"

if choice == "pdf":
    folder_path = 'Hiltipdfs' # replace with the path to your folder
    file_paths = [os.path.join(folder_path, f) for f in os.listdir(folder_path) if f.endswith('.pdf')]

    for file_path in file_paths:
        print(file_path)
elif choice == "slides":
    folder_path = 'Slides' # replace with the path to your folder
    file_paths = [os.path.join(folder_path, f) for f in os.listdir(folder_path) if f.endswith('.pptx')]

    for file_path in file_paths:
        print(file_path)

Hiltipdfs\Hilti Malaysia - Terms and Conditions 2019.pdf
Hiltipdfs\Hilti-Submittal-Package-OSHA-1926.1153.pdf
Hiltipdfs\Hilti_BindingCorporateRules.pdf
Hiltipdfs\Hilti_GB_2020_en_pdf.pdf
Hiltipdfs\Technical-information-ASSET-DOC-LOC-10908813.pdf


PDF acquired!

We can import the pages of our PDF to text by first defining the PDF path and then opening and reading it with PyMuPDF (`import fitz`).

We'll write a small helper function to preprocess the text as it gets read. Note that not all text will be read in the same way, so keep this in mind when you prepare your text.

We'll save each page to a dictionary and then append that dictionary to a list for ease of use later.

In [8]:
# Requires !pip install PyMuPDF, see: https://github.com/pymupdf/pymupdf
import fitz # (pymupdf, found this is better than pypdf for our use case, note: licence is AGPL-3.0, keep that in mind if you want to use any code commercially)
from tqdm.auto import tqdm # for progress bars, requires !pip install tqdm

def text_formatter(text: str) -> str:
    """Performs minor formatting on text."""
    cleaned_text = text.replace("\n", " ").strip() # note: this might be different for each doc (best to experiment)

    # Other potential text formatting functions can go here
    return cleaned_text

# Open PDF and get lines/pages
# Note: this only focuses on text, rather than images/figures etc
def open_and_read_pdf(file_path: str) -> list[dict]:
    """
    Opens a PDF file, reads its text content page by page, and collects statistics.

    Parameters:
        file_path (str): The file path to the PDF document to be opened and read.

    Returns:
        list[dict]: A list of dictionaries, each containing the page number
        (adjusted), character count, word count, sentence count, token count, and the extracted text
        for each page.
    """
    doc = fitz.open(file_path)  # open a document
    pdf_name = os.path.basename(file_path)
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):  # iterate the document pages
        text = page.get_text()  # get plain text encoded as UTF-8
        text = text_formatter(text)
        pages_and_texts.append({"page_number": page_number + 1, 
                                "pdf_name": pdf_name,
                                "page_char_count": len(text),
                                "page_word_count": len(text.split(" ")),
                                "page_sentence_count_raw": len(text.split(". ")),
                                "page_token_count": len(text) / 4,  # 1 token = ~4 chars, see: https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
                                "text": text})
    return pages_and_texts


pages_and_texts = []
for file_path in file_paths:
    pages_and_texts.extend(open_and_read_pdf(file_path=file_path))
pages_and_texts[:3]

12it [00:00, 427.99it/s]
5it [00:00, 357.14it/s]
19it [00:00, 327.59it/s]
45it [00:00, 177.16it/s]
1it [00:00, 166.67it/s]


[{'page_number': 1,
  'pdf_name': 'Hilti Malaysia - Terms and Conditions 2019.pdf',
  'page_char_count': 2282,
  'page_word_count': 451,
  'page_sentence_count_raw': 11,
  'page_token_count': 570.5,
  'text': 'Hilti Malaysia Sdn. Bhd. (157721-A)  F-5-A | Sime Darby Brunsfield Tower  No. 2 | Jalan PJU 1A/7A  Oasis Square I Oasis Damansara  47301 Petaling Jaya I Selangor I Malaysia  Toll Free 1800 880 985 | F +603 7848 7399 | www.hilti.com.my  HILTI (MALAYSIA) SDN. BHD.  TERMS AND CONDITIONS      1.   GENERAL    1.1   In these conditions the following words have the meanings shown:     "Buyer"   means the person, firm or company purchasing Goods       "Company" means Hilti (Malaysia) Sdn Bhd or one of its associated or subsidiary  companies as the case may be       "Contract"  means the agreement between the Company and the Buyer for the  purchase of Goods from the Company by the Buyer        "Contracts" includes all agreements between the Company and the Buyer for the  purchase of Goods

Now let's get a random sample of the pages.

In [7]:
import random

random.sample(pages_and_texts, k=3)

[{'page_number': 10,
  'pdf_name': 'Hilti Malaysia - Terms and Conditions 2019.pdf',
  'page_char_count': 2450,
  'page_word_count': 484,
  'page_sentence_count_raw': 17,
  'page_token_count': 612.5,
  'text': 'Hilti Malaysia Sdn. Bhd. (157721-A)  F-5-A | Sime Darby Brunsfield Tower  No. 2 | Jalan PJU 1A/7A  Oasis Square I Oasis Damansara  47301 Petaling Jaya I Selangor I Malaysia  Toll Free 1800 880 985 | F +603 7848 7399 | www.hilti.com.my  packaging, design, trade name that is similar or identical to the Company’s intellectual  property.    In case of breach of the aforesaid terms, the Company shall be entitled to institute  legal proceedings against the Buyer for all losses and damages suffered as a result  thereof on full indemnity basis.    14.   FORCE MAJEURE      The Company shall be entitled to delay or cancel delivery or to reduce the amount of  the Goods delivered if it is prevented from, hindered or delayed in manufacturing,  obtaining or delivering the Goods by normal rout

### Get some stats on the text

Let's perform a rough exploratory data analysis (EDA) to get an idea of the size of the texts (e.g. character counts, word counts etc) we're working with.

The different sizes of texts will be a good indicator of how we should split our texts.

Many embedding models have limits on the size of texts they can ingest, for example, the [`sentence-transformers`](https://www.sbert.net/docs/pretrained_models.html) model [`all-mpnet-base-v2`](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) has an input size of 384 tokens.

This means that the model has been trained to ingest texts of up to 384 tokens and turn them into embeddings (1 token ~= 4 characters ~= 0.75 words).

Texts over 384 tokens which are encoded by this model will be automatically reduced to 384 tokens in length, potentially losing some information.
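As a rough check, the 4-characters-per-token rule of thumb can be turned into a tiny helper. This is only an estimate (a real tokenizer will give different counts), and the names `estimate_tokens` and `exceeds_limit` are made up for this sketch.

```python
MAX_INPUT_TOKENS = 384  # input size quoted above for all-mpnet-base-v2
CHARS_PER_TOKEN = 4     # rule of thumb: 1 token ~= 4 characters

def estimate_tokens(text: str) -> float:
    """Estimate the token count of a text from its character count."""
    return len(text) / CHARS_PER_TOKEN

def exceeds_limit(text: str, limit: int = MAX_INPUT_TOKENS) -> bool:
    """Return True if the estimated token count is over the model's input size."""
    return estimate_tokens(text) > limit

short_text = "a" * 400   # ~100 tokens
long_text = "a" * 2282   # same character count as the first page shown above

print(estimate_tokens(short_text), exceeds_limit(short_text))  # 100.0 False
print(estimate_tokens(long_text), exceeds_limit(long_text))    # 570.5 True
```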

We'll discuss this more in the embedding section.

For now, let's turn our list of dictionaries into a DataFrame and explore it.

In [11]:
import pandas as pd

df = pd.DataFrame(pages_and_texts)
df.head()

| | page_number | pdf_name | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | text |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Hilti Malaysia - Terms and Conditions 2019.pdf | 2282 | 451 | 11 | 570.5 | Hilti Malaysia Sdn. Bhd. (157721-A) F-5-A \| S... |
| 1 | 2 | Hilti Malaysia - Terms and Conditions 2019.pdf | 2776 | 562 | 18 | 694.0 | Hilti Malaysia Sdn. Bhd. (157721-A) F-5-A \| S... |
| 2 | 3 | Hilti Malaysia - Terms and Conditions 2019.pdf | 2728 | 558 | 16 | 682.0 | Hilti Malaysia Sdn. Bhd. (157721-A) F-5-A \| S... |
| 3 | 4 | Hilti Malaysia - Terms and Conditions 2019.pdf | 2867 | 595 | 14 | 716.75 | Hilti Malaysia Sdn. Bhd. (157721-A) F-5-A \| S... |
| 4 | 5 | Hilti Malaysia - Terms and Conditions 2019.pdf | 2938 | 572 | 16 | 734.5 | Hilti Malaysia Sdn. Bhd. (157721-A) F-5-A \| S... |


In [12]:
# Get stats
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count |
|---|---|---|---|---|---|
| count | 82.0 | 82.0 | 82.0 | 82.0 | 82.0 |
| mean | 16.09 | 2346.77 | 409.88 | 12.72 | 586.69 |
| std | 12.84 | 1861.12 | 318.34 | 10.33 | 465.28 |
| min | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 25% | 5.25 | 982.0 | 162.25 | 6.0 | 245.5 |
| 50% | 12.0 | 1802.5 | 301.5 | 9.0 | 450.62 |
| 75% | 24.75 | 3003.25 | 577.25 | 18.75 | 750.81 |
| max | 45.0 | 7394.0 | 1307.0 | 47.0 | 1848.5 |


Okay, looks like our average token count per page is about 587 (with a median of about 451).

For this particular use case, it means an average page is too long to embed whole with the `all-mpnet-base-v2` model (this model has an input capacity of 384 tokens), so we'll want to split pages into smaller pieces.
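One detail in the `min` row of the stats is worth a second look: a page with 0 characters still reports a word count of 1 and a raw sentence count of 1. That comes from how `str.split` behaves with an explicit separator, as this small standalone check shows (the `count_words` helper is a hypothetical alternative, not part of the pipeline above).

```python
empty_page = ""

# Splitting an empty string on an explicit separator returns one empty string
print(empty_page.split(" "))         # ['']
print(len(empty_page.split(" ")))    # 1
print(len(empty_page.split(". ")))   # 1

# Splitting with no separator drops empty strings and runs of whitespace
print(len(empty_page.split()))       # 0

def count_words(text: str) -> int:
    """Count whitespace-separated words, ignoring repeated spaces."""
    return len(text.split())

sample = "Hilti  Malaysia Sdn. Bhd."  # note the double space
print(len(sample.split(" ")), count_words(sample))  # 5 4
```

So the word counts produced with `split(" ")` run slightly high wherever the extracted text contains double spaces, which is fine for rough statistics but worth knowing about.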

### Further text processing (splitting pages into sentences)

The ideal way of processing text before embedding it is still an active area of research.

A simple method I've found helpful is to break the text into chunks of sentences.

As in, chunk a page of text into groups of 5, 7, 10 or more sentences (these values are not set in stone and can be explored).

But we want to follow the workflow of:

`Ingest text -> split it into groups/chunks -> embed the groups/chunks -> use the embeddings`

Some options for splitting text into sentences:

1. Split into sentences with simple rules (e.g. split on ". " with `text = text.split(". ")`, like we did above).
2. Split into sentences with a natural language processing (NLP) library such as [spaCy](https://spacy.io/) or [nltk](https://www.nltk.org/).
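To get a feel for option 1, here's what simple rules do on text containing abbreviations. The sample sentence is made up (loosely modelled on the company name in the page shown earlier), and the regular expression is just one possible hand-written rule, not something taken from a library.

```python
import re

# Made-up sample: two real sentences, with abbreviations in the first
text = "Hilti Malaysia Sdn. Bhd. is the Company. The Buyer purchases Goods."

# Rule 1: split on ". " (drops the full stops and splits inside "Sdn. Bhd.")
naive = text.split(". ")
print(len(naive), naive)

# Rule 2: split on whitespace that follows ./!/? and precedes a capital letter
# (keeps the punctuation, but still splits between "Sdn." and "Bhd.")
regex_split = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
print(len(regex_split), regex_split)
```

The first rule yields 4 pieces and the second yields 3, while the text only has 2 sentences. Hand-written rules are quick to try but tend to trip over abbreviations, numbering and similar patterns.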

Why split into sentences?

* Easier to handle than larger pages of text (especially if pages are densely filled with text).
* Can get specific and find out which group of sentences was used to generate an answer within a RAG pipeline.

> **Resource:** See [spaCy install instructions](https://spacy.io/usage).

Let's use spaCy to break our text into sentences since it's likely a bit more robust than just using `text.split(". ")`.

In [13]:
from spacy.lang.en import English # see https://spacy.io/usage for install instructions

nlp = English()

# Add a sentencizer pipeline, see https://spacy.io/api/sentencizer/
nlp.add_pipe("sentencizer")

# Create a document instance as an example
doc = nlp("This is a sentence. This another sentence.")
assert len(list(doc.sents)) == 2

# Access the sentences of the document
list(doc.sents)

[This is a sentence., This another sentence.]

We don't necessarily need to use spaCy, however, it's an open-source library designed to do NLP tasks like this at scale.

So let's run our small sentencizing pipeline on our pages of text.

In [14]:
for item in tqdm(pages_and_texts):
    item["sentences"] = list(nlp(item["text"]).sents)

    # Make sure all sentences are strings
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]

    # Count the sentences
    item["page_sentence_count_spacy"] = len(item["sentences"])

100%|██████████| 82/82 [00:00<00:00, 257.05it/s]


In [15]:
# Inspect an example
random.sample(pages_and_texts, k=1)

[{'page_number': 8,
  'pdf_name': 'Hilti_GB_2020_en_pdf.pdf',
  'page_char_count': 1408,
  'page_word_count': 238,
  'page_sentence_count_raw': 10,
  'page_token_count': 352.0,
  'text': '“We passionately create enthusiastic customers  and build a better future.” This mission statement  is based on the conviction that we grow together  with the people around us – with our customers,  employees and partners. Personal exchanges and  the aspiration to never rest, only to improve, has  put us in the position to provide world-class prod- ucts, systems, software and services. Our strategic objective is sustainable value cre- ation through market leadership and differentia- tion – market leadership in terms of relative market  share, and differentiation via the direct sale of our  portfolio. We will continue to follow the successful course of  recent years in 2020 and beyond, while emphasiz- ing four proven strategic fields of activity. We are  investing in continuous innovation. In doing so 

Wonderful!

Now let's turn our list of dictionaries into a DataFrame and get some stats.

In [16]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy |
|---|---|---|---|---|---|---|
| count | 82.0 | 82.0 | 82.0 | 82.0 | 82.0 | 82.0 |
| mean | 16.09 | 2346.77 | 409.88 | 12.72 | 586.69 | 12.43 |
| std | 12.84 | 1861.12 | 318.34 | 10.33 | 465.28 | 9.99 |
| min | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 25% | 5.25 | 982.0 | 162.25 | 6.0 | 245.5 | 6.0 |
| 50% | 12.0 | 1802.5 | 301.5 | 9.0 | 450.62 | 9.0 |
| 75% | 24.75 | 3003.25 | 577.25 | 18.75 | 750.81 | 18.75 |
| max | 45.0 | 7394.0 | 1307.0 | 47.0 | 1848.5 | 45.0 |


For our set of text, it looks like our raw sentence count (e.g. splitting on `". "`) is quite close to what spaCy came up with.

Now we've got our text split into sentences, how about we group those sentences?

### Chunking our sentences together

Let's take a step to break down our list of sentences/text into smaller chunks.

As you might've guessed, this process is referred to as **chunking**.

Why do we do this?

1. Easier to manage similar sized chunks of text.
2. Don't overload the embedding model's capacity for tokens (e.g. if an embedding model has a capacity of 384 tokens, there could be information loss if you try to embed a sequence of 400+ tokens).
3. Our LLM context window (the amount of tokens an LLM can take in) may be limited and requires compute power so we want to make sure we're using it as well as possible.

Something to note is that there are many different ways emerging for creating chunks of information/text.

For now, we're going to keep it simple and break our pages of sentences into groups of 6 (this number is arbitrary and can be changed, I just picked it because it seemed to line up well with our embedding model capacity of 384).

On average each of our pages has about 12 sentences.

And an average total of about 587 tokens per page.

That works out to roughly 47 tokens per sentence, so our groups of 6 sentences will be ~280 tokens long on average.

This gives a typical chunk enough room to be embedded by our `all-mpnet-base-v2` model (it has a capacity of 384 tokens).
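The chunk-size arithmetic can be rechecked from the means in the `describe()` output above (586.69 tokens and 12.43 spaCy sentences per page). This is a back-of-the-envelope sketch based on averages only, and the helper name `estimated_chunk_tokens` is made up for illustration.

```python
mean_tokens_per_page = 586.69     # page_token_count mean from the stats above
mean_sentences_per_page = 12.43   # page_sentence_count_spacy mean from the stats above
model_capacity = 384              # all-mpnet-base-v2 input size in tokens

# Average number of tokens in one sentence
tokens_per_sentence = mean_tokens_per_page / mean_sentences_per_page

def estimated_chunk_tokens(num_sentences: int) -> float:
    """Estimate the average token count of a chunk with num_sentences sentences."""
    return num_sentences * tokens_per_sentence

chunk_tokens = estimated_chunk_tokens(6)
largest_fitting_chunk = int(model_capacity // tokens_per_sentence)

print(f"Tokens per sentence: {tokens_per_sentence:.1f}")
print(f"Estimated tokens in a 6-sentence chunk: {chunk_tokens:.1f}")
print(f"Largest chunk size that fits on average: {largest_fitting_chunk} sentences")
```

Since these are averages, individual chunks can still land well above or below the estimate.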

To split our groups of sentences into chunks of 6 or less, let's create a function which accepts a list as input and breaks it down into sublists of a specified size.

In [58]:
# Define split size to turn groups of sentences into chunks
num_sentence_chunk_size = 6

# Create a function that splits a list into sublists of a desired size
def split_list(input_list: list,
               slice_size: int) -> list[list[str]]:
    """
    Splits the input_list into sublists of size slice_size (or as close as possible).

    For example, with slice_size=10, a list of 17 sentences would be split into two lists of sizes [10, 7].
    """
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

# Loop through pages and texts and split sentences into chunks
for item in tqdm(pages_and_texts):
    item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                         slice_size=num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])

100%|██████████| 82/82 [00:00<?, ?it/s]
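To make the slicing behaviour concrete, here's a standalone check of the same list-comprehension approach on toy lists (the function is redefined so the snippet runs on its own, and numbers stand in for sentences).

```python
def split_list(input_list: list, slice_size: int) -> list[list]:
    """Split input_list into consecutive sublists of at most slice_size items."""
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

sentences = list(range(17))  # stand-in for 17 sentences

chunks_of_10 = split_list(sentences, slice_size=10)
chunks_of_6 = split_list(sentences, slice_size=6)

print([len(chunk) for chunk in chunks_of_10])  # [10, 7]
print([len(chunk) for chunk in chunks_of_6])   # [6, 6, 5]

# A page with no sentences produces no chunks at all
print(split_list([], slice_size=6))            # []
```

The last chunk simply holds whatever is left over, and an empty page ends up with zero chunks.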


In [59]:
# Sample an example from the group (note: many samples have only 1 chunk as they have <=6 sentences total)
random.sample(pages_and_texts, k=1)

[{'page_number': 14,
  'pdf_name': 'Hilti_BindingCorporateRules.pdf',
  'page_char_count': 5045,
  'page_word_count': 848,
  'page_sentence_count_raw': 22,
  'page_token_count': 1261.25,
  'text': 'www.hilti.group 14 best efforts to obtain the right to waive this prohibition in  order to communicate as much information as it can and as  soon as possible and be able to demonstrate that it did so.  If despite using its best efforts a Hilti Entity or Hilti HQ is  unable to notify the competent Supervisory Authority, it  undertakes to at least annually provide the Supervisory   Authority with information related to the requests received  by the national security state bodies or authorities and at  least the information listed above. The transfers of Personal Data from a Hilti Entity to national  security state bodies or authorities shall never be done in  an excessive, disproportionate, and indiscriminate manner  which would go beyond what is necessary in a democratic  society. c. Relation

In [60]:
# Create a DataFrame to get stats
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy | num_chunks |
|---|---|---|---|---|---|---|---|
| count | 82.0 | 82.0 | 82.0 | 82.0 | 82.0 | 82.0 | 82.0 |
| mean | 16.09 | 2346.77 | 409.88 | 12.72 | 586.69 | 12.43 | 2.57 |
| std | 12.84 | 1861.12 | 318.34 | 10.33 | 465.28 | 9.99 | 1.61 |
| min | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | 5.25 | 982.0 | 162.25 | 6.0 | 245.5 | 6.0 | 1.0 |
| 50% | 12.0 | 1802.5 | 301.5 | 9.0 | 450.62 | 9.0 | 2.0 |
| 75% | 24.75 | 3003.25 | 577.25 | 18.75 | 750.81 | 18.75 | 3.75 |
| max | 45.0 | 7394.0 | 1307.0 | 47.0 | 1848.5 | 45.0 | 8.0 |


Note how the average number of chunks is around 2.6, this is expected since our pages contain an average of about 12 sentences and each chunk holds at most 6.

### Splitting each chunk into its own item

We'd like to embed each chunk of sentences into its own numerical representation.

So to keep things clean, let's create a new list of dictionaries, each containing a single chunk of sentences with relevant information such as the page number as well as statistics about each chunk.

In [61]:
import re

# Split each chunk into its own item
pages_and_chunks = []
for item in tqdm(pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]
        chunk_dict["pdf_name"] = item["pdf_name"]

        # Join the sentences together into a paragraph-like structure, aka a chunk (so they are a single string)
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk) # ".A" -> ". A" for any full-stop/capital letter combo
        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get stats about the chunk
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len([word for word in joined_sentence_chunk.split(" ")])
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4 # 1 token = ~4 characters

        pages_and_chunks.append(chunk_dict)

# How many chunks do we have?
len(pages_and_chunks)

100%|██████████| 82/82 [00:00<00:00, 16359.84it/s]


211
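The join-then-repair step is easier to follow on a tiny example. The sentences below are partly made up, and `fix_spacing` is a hypothetical wrapper around the same `re.sub` call, used here only to show what the pattern does and doesn't catch.

```python
import re

def fix_spacing(text: str) -> str:
    """Insert a space after a full stop that is directly followed by a capital letter."""
    return re.sub(r'\.([A-Z])', r'. \1', text)

sentences = [
    "The result is a resilient team.",
    "Team members talk with their leaders.",
    "Version 2.5 is out.",
]

# Joining with "" glues sentences together: "...team.Team members..."
joined = "".join(sentences).replace("  ", " ").strip()
print(joined)

# The regex repairs ".T" and ".V" but leaves "2.5" alone
print(fix_spacing(joined))

# Caveat 1: a closing quote between the full stop and the capital is not matched
print(fix_spacing("with my team.”Hilti’s approach"))

# Caveat 2: dotted abbreviations get pulled apart
print(fix_spacing("U.S.A."))  # U. S. A.
```

The first caveat shows up in the sample output further down the page (`team.”Hilti’s`), so it's worth inspecting a few chunks by eye for your own documents.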

In [162]:
# View a random sample
random.sample(pages_and_chunks, k=1)

[{'page_number': 29,
  'pdf_name': 'Hilti_GB_2020_en_pdf.pdf',
  'sentence_chunk': 'employees worldwide (2019: 30,006) nationalities in the global team (2019: 127) nationalities at headquarters (2019: 63) of team members worldwide are women (2019: 25%) of team leaders worldwide are women (2019: 21%) 29,549 63 25.5% 21.5% After Lindsay Ophus started her career as a process manager at Hilti in the USA in 2015, it was soon clear she was on the path to a long career with the company. “My goal is to create added value for the company with my team.”Hilti’s approach is to match people with roles they enjoy and are suited for, both laterally and upward. The result is a resilient, high-performing global team. Team members are encouraged to talk frequently with their leaders about their development, whether it be for their current role or one in the future. Lindsay was set to become an area sales manager, but one discussion helped her consider another path. “',
  'chunk_char_count': 879,
  'chun

Excellent!

Now we've broken our documents into chunks of 6 sentences or less, each stored with the PDF name and page number it came from.

This means we could reference a chunk of text and know its source.

Let's get some stats about our chunks.

In [163]:
# Get stats about our chunks
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)

| | page_number | chunk_char_count | chunk_word_count | chunk_token_count |
|---|---|---|---|---|
| count | 211.0 | 211.0 | 211.0 | 211.0 |
| mean | 15.57 | 896.91 | 144.79 | 224.23 |
| std | 12.13 | 622.14 | 99.57 | 155.54 |
| min | 1.0 | 19.0 | 3.0 | 4.75 |
| 25% | 6.0 | 489.0 | 80.5 | 122.25 |
| 50% | 12.0 | 825.0 | 133.0 | 206.25 |
| 75% | 23.5 | 1118.0 | 188.5 | 279.5 |
| max | 44.0 | 4348.0 | 696.0 | 1087.0 |


Hmm looks like some of our chunks have quite a low token count.

How about we check for samples with 30 tokens or fewer (about the length of a sentence) and see if they are worth keeping?

In [168]:
# Show random chunks with 30 tokens or fewer in length
min_token_length = 30
for row in df[df["chunk_token_count"] <= min_token_length].sample(5).iterrows():
    print(f'Chunk token count: {row[1]["chunk_token_count"]} | Text: {row[1]["sentence_chunk"]}')

Chunk token count: 4.75 | Text: COMPANY REPORT 2020
Chunk token count: 15.0 | Text: COMPLEX GEOMETRY MADE SIMPLE 2020 Hilti Company Report 34–35
Chunk token count: 15.25 | Text: Buyer shall further ensure that it does not intend to use any
Chunk token count: 11.0 | Text: TRACEABILITY 2020 Hilti Company Report 14–15
Chunk token count: 7.75 | Text: 2020 Hilti Company Report 18–19


Looks like many of these are headers and footers of different pages.

They don't seem to offer too much information.

Let's filter our DataFrame/list of dictionaries to only include chunks with over 30 tokens in length.

In [169]:
pages_and_chunks_over_min_token_len = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")
pages_and_chunks_over_min_token_len[:2]

[{'page_number': 1,
  'pdf_name': 'Hilti Malaysia - Terms and Conditions 2019.pdf',
  'sentence_chunk': 'Hilti Malaysia Sdn. Bhd. (157721-A) F-5-A | Sime Darby Brunsfield Tower No.2 | Jalan PJU 1A/7A Oasis Square I Oasis Damansara 47301 Petaling Jaya I Selangor I Malaysia Toll Free 1800 880 985 | F +603 7848 7399 | www.hilti.com.my HILTI (MALAYSIA) SDN. BHD. TERMS AND CONDITIONS   1.',
  'chunk_char_count': 281,
  'chunk_word_count': 50,
  'chunk_token_count': 70.25},
 {'page_number': 1,
  'pdf_name': 'Hilti Malaysia - Terms and Conditions 2019.pdf',
  'sentence_chunk': 'GENERAL  1.1  In these conditions the following words have the meanings shown:   "Buyer"  means the person, firm or company purchasing Goods    "Company" means Hilti (Malaysia) Sdn Bhd or one of its associated or subsidiary companies as the case may be    "Contract" means the agreement between the Company and the Buyer for the purchase of Goods from the Company by the Buyer    "Contracts" includes all agreements betwee

Smaller chunks filtered!
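The same filter can be written without pandas as a plain list comprehension. This is a small sketch on toy records (the second and third texts and their token counts are made up), mainly to show that a chunk with exactly 30 tokens is dropped because the comparison is a strict `>`.

```python
min_token_length = 30

# Toy chunk records; only the fields needed for filtering are included
toy_chunks = [
    {"sentence_chunk": "COMPANY REPORT 2020", "chunk_token_count": 4.75},
    {"sentence_chunk": "A made-up chunk sitting exactly on the threshold.", "chunk_token_count": 30.0},
    {"sentence_chunk": "A made-up longer chunk that carries real content.", "chunk_token_count": 70.25},
]

# Keep only chunks strictly above the minimum token length
kept = [chunk for chunk in toy_chunks if chunk["chunk_token_count"] > min_token_length]

print(len(kept), [chunk["chunk_token_count"] for chunk in kept])  # 1 [70.25]
```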

Time to embed our chunks of text!

### Embedding our text chunks

While humans understand text, machines understand numbers best.

An [embedding](https://vickiboykis.com/what_are_embeddings/index.html) is a broad concept.

But one of my favourite and simple definitions is "a useful numerical representation".

The most powerful thing about modern embeddings is that they are *learned* representations.

Meaning rather than mapping words/tokens/characters directly to numbers (e.g. `{"a": 0, "b": 1, "c": 2...}`), the numerical representation of tokens is learned by going through large corpora of text and figuring out how different tokens relate to each other.

Ideally, embeddings of text will mean that texts with similar meanings have similar numerical representations.
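One common way to put a number on "similar" is cosine similarity between two vectors. The sketch below uses made-up 3-dimensional vectors rather than real embeddings, and picking cosine similarity is an assumption made for illustration (other measures such as the dot product are used too).

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: close to 1 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up toy "embeddings" (real ones have hundreds of dimensions)
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(f"cat vs kitten:  {cosine_similarity(cat, kitten):.2f}")
print(f"cat vs invoice: {cosine_similarity(cat, invoice):.2f}")
```

The two related items score close to 1, while the unrelated pair scores close to 0.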

> **Note:** Most modern NLP models deal with "tokens" which can be considered as multiple different sizes and combinations of words and characters rather than always whole words or single characters. For example, the string `"hello world!"` gets mapped to the token values `{15339: b'hello', 1917: b' world', 0: b'!'}` using [Byte pair encoding](https://en.wikipedia.org/wiki/Byte_pair_encoding) (or BPE via OpenAI's [`tiktoken`](https://github.com/openai/tiktoken) library). Google has a tokenization library called [SentencePiece](https://github.com/google/sentencepiece).

Our goal is to turn each of our chunks into a numerical representation (an embedding vector, where a vector is a sequence of numbers arranged in order).

Once our text samples are in embedding vectors, we humans will no longer be able to understand them.

However, we don't need to.

The embedding vectors are for our computers to understand.

We'll use our computers to find patterns in the embeddings and then we can use their text mappings to further our understanding.

Enough talking, how about we import a text embedding model and see what an embedding looks like.

To do so, we'll use the [`sentence-transformers`](https://www.sbert.net/docs/installation.html) library which contains many pre-trained embedding models.

Specifically, we'll get the `all-mpnet-base-v2` model (you can see the model's intended use on the [Hugging Face model card](https://huggingface.co/sentence-transformers/all-mpnet-base-v2#intended-uses)).

In [170]:
# Requires !pip install sentence-transformers
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2",
                                      device="cuda") # choose the device to load the model to (note: GPU will often be *much* faster than CPU)

# Create a list of sentences to turn into numbers
sentences = [
    "The Sentences Transformers library provides an easy and open-source way to create embeddings.",
    "Sentences can be embedded one by one or as a list of strings.",
    "Embeddings are one of the most powerful concepts in machine learning!",
    "Learn to use embeddings well and you'll be well on your way to being an AI engineer."
]

# Sentences are encoded/embedded by calling model.encode()
embeddings = embedding_model.encode(sentences)
embeddings_dict = dict(zip(sentences, embeddings))

# See the embeddings
for sentence, embedding in embeddings_dict.items():
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

Sentence: The Sentences Transformers library provides an easy and open-source way to create embeddings.
Embedding: [-2.07981840e-02  3.03164609e-02 -2.01218035e-02  6.86483681e-02
 -2.55255606e-02 -8.47685710e-03 -2.07098550e-04 -6.32377341e-02
  2.81606205e-02 -3.33353132e-02  3.02635003e-02  5.30720502e-02
 -5.03526330e-02  2.62288097e-02  3.33314240e-02 -4.51578312e-02
  3.63044441e-02 -1.37111905e-03 -1.20171579e-02  1.14946403e-02
  5.04510738e-02  4.70856875e-02  2.11913176e-02  5.14607020e-02
 -2.03746278e-02 -3.58889289e-02 -6.67838729e-04 -2.94393227e-02
  4.95859161e-02 -1.05639659e-02 -1.52014159e-02 -1.31756265e-03
  4.48197350e-02  1.56023391e-02  8.60379885e-07 -1.21393730e-03
 -2.37978660e-02 -9.09396331e-04  7.34482659e-03 -2.53930246e-03
  5.23370430e-02 -4.68043163e-02  1.66214686e-02  4.71579209e-02
 -4.15599309e-02  9.01952910e-04  3.60278748e-02  3.42214517e-02
  9.68227386e-02  5.94828427e-02 -1.64984632e-02 -3.51249576e-02
  5.92517946e-03 -7.07972853e-04 -2.4103

Woah! That's a lot of numbers.

How about we do just one sentence?

In [171]:
single_sentence = "Yo! How cool are embeddings?"
single_embedding = embedding_model.encode(single_sentence)
print(f"Sentence: {single_sentence}")
print(f"Embedding:\n{single_embedding}")
print(f"Embedding size: {single_embedding.shape}")

Sentence: Yo! How cool are embeddings?
Embedding:
[-1.97447482e-02 -4.51083714e-03 -4.98482492e-03  6.55445009e-02
 -9.87671968e-03  2.72835568e-02  3.66426110e-02 -3.30221280e-03
  8.50079115e-03  8.24954547e-03 -2.28497442e-02  4.02430184e-02
 -5.75200021e-02  6.33692294e-02  4.43207473e-02 -4.49507162e-02
  1.25284223e-02 -2.52012350e-02 -3.55292335e-02  1.29558872e-02
  8.67024530e-03 -1.92917641e-02  3.55635560e-03  1.89506002e-02
 -1.47128012e-02 -9.39845853e-03  7.64171872e-03  9.62188747e-03
 -5.98928286e-03 -3.90169397e-02 -5.47824539e-02 -5.67456661e-03
  1.11644994e-02  4.08067293e-02  1.76319099e-06  9.15297959e-03
 -8.77260044e-03  2.39382759e-02 -2.32784562e-02  8.04999545e-02
  3.19176838e-02  5.12595987e-03 -1.47708356e-02 -1.62524562e-02
 -6.03213347e-02 -4.35689725e-02  4.51211482e-02 -1.79053750e-02
  2.63367202e-02 -3.47867161e-02 -8.89172871e-03 -5.47674894e-02
 -1.24372896e-02 -2.38606650e-02  8.33496898e-02  5.71242124e-02
  1.13328528e-02 -1.49594611e-02  9.2037

Nice! We've now got a way to numerically represent each of our chunks.

Our embedding has a shape of `(768,)`, meaning it's a vector of 768 numbers which represent our text in high-dimensional space. That's too many dimensions for a human to comprehend, but machines love high-dimensional space.

> **Note:** No matter the size of the text input to our `all-mpnet-base-v2` model, it will be turned into an embedding size of `(768,)`. This value is fixed. So whether a sentence is 1 token long or 1000 tokens long, it will be turned into an embedding vector of size `(768,)`, with any input longer than 384 tokens truncated to 384 tokens first. Of course, other embedding models may have different input/output shapes.

How about we add an embedding field to each of our chunk items?

Let's start by trying to create embeddings on the CPU, we'll time it with the `%%time` magic to see how long it takes.

In [25]:
%%time

# Uncomment to see how long it takes to create embeddings on CPU
# # Make sure the model is on the CPU
# embedding_model.to("cpu")

# # Embed each chunk one by one
# for item in tqdm(pages_and_chunks_over_min_token_len):
#     item["embedding"] = embedding_model.encode(item["sentence_chunk"])

CPU times: total: 0 ns
Wall time: 0 ns


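With the loop commented out, the cell above reports no measurable time; uncomment it to time the CPU run on your own machine. Either way, embedding on the CPU would take a *really* long time if we had a larger dataset.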

Now let's see how long it takes to create the embeddings with a GPU.

In [172]:
%%time

# Send the model to the GPU
embedding_model.to("cuda") # requires a GPU installed, for reference on my local machine, I'm using an NVIDIA RTX 4090

# Create embeddings one by one on the GPU
for item in tqdm(pages_and_chunks_over_min_token_len):
    item["embedding"] = embedding_model.encode(item["sentence_chunk"])

100%|██████████| 196/196 [00:03<00:00, 55.16it/s]

CPU times: total: 3.97 s
Wall time: 3.56 s





Woah! Looks like the embeddings get created much faster (~10x faster on my machine) on the GPU!

You'll likely notice this trend with many of your deep learning workflows. If you have access to a GPU, especially an NVIDIA GPU, you should use one if you can.

But what if I told you we could go faster again?

You see, many modern models can handle batched predictions.

This means computing on multiple samples at once.

Those are the types of operations where a GPU flourishes!

We can perform batched operations by turning our target text samples into a single list and then passing that list to our embedding model.

In [173]:
# Turn text chunks into a single list
text_chunks = [item["sentence_chunk"] for item in pages_and_chunks_over_min_token_len]

In [174]:
%%time

# Embed all texts in batches
text_chunk_embeddings = embedding_model.encode(text_chunks,
                                               batch_size=32, # you can use different batch sizes here for speed/performance, I found 32 works well for this use case
                                               convert_to_tensor=True) # optional to return embeddings as tensor instead of array

text_chunk_embeddings

CPU times: total: 2.06 s
Wall time: 3.03 s


tensor([[ 0.0068, -0.0498, -0.0305,  ..., -0.0212, -0.0335, -0.0145],
        [ 0.0720, -0.0776,  0.0147,  ..., -0.0024, -0.0726, -0.0001],
        [ 0.0319, -0.0595, -0.0103,  ..., -0.0043, -0.0772, -0.0126],
        ...,
        [ 0.0313,  0.0054, -0.0286,  ...,  0.0091, -0.0014,  0.0090],
        [ 0.0250, -0.0382,  0.0085,  ...,  0.0007, -0.0105, -0.0290],
        [-0.0733,  0.0374, -0.0053,  ...,  0.0104, -0.0782, -0.0013]],
       device='cuda:0')

That's what I'm talking about!

A ~4x improvement (on my GPU) in speed thanks to batched operations.

So the tip here is to use a GPU when you can and use batched operations if you can too.
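
Conceptually, `batch_size=32` means the list of chunks is cut into groups of up to 32 texts and each group goes through the model in one pass. Here is a minimal sketch of that grouping. The `make_batches` helper is hypothetical and only illustrates the idea, since `sentence-transformers` handles batching internally.

```python
def make_batches(items: list, batch_size: int = 32) -> list[list]:
    """Split items into consecutive groups of at most batch_size."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# 196 placeholder chunks, matching the count in the progress bar above
toy_chunks = [f"chunk {i}" for i in range(196)]
batches = make_batches(toy_chunks, batch_size=32)

print(f"Number of batches: {len(batches)}")
print(f"Size of first batch: {len(batches[0])} | Size of last batch: {len(batches[-1])}")
```

With 196 chunks, that is 7 passes through the model rather than 196 separate calls.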

Now let's save our chunks and their embeddings so we could import them later if we wanted.

### Save embeddings to file

Since creating embeddings can be a time-consuming process (not so much in our case, but it can be for much larger datasets), let's turn our `pages_and_chunks_over_min_token_len` list of dictionaries into a DataFrame and save it.

In [175]:
# Save embeddings to file
text_chunks_and_embeddings_df = pd.DataFrame(pages_and_chunks_over_min_token_len)
embeddings_df_save_path = "text_chunks_and_embeddings_df.csv"
text_chunks_and_embeddings_df.to_csv(embeddings_df_save_path, index=False)

And we can make sure it imports nicely by loading it.

In [178]:
# Import saved file and view
text_chunks_and_embedding_df_load = pd.read_csv(embeddings_df_save_path)
text_chunks_and_embedding_df_load.head()

| | page_number | pdf_name | sentence_chunk | chunk_char_count | chunk_word_count | chunk_token_count | embedding |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| 0 | 1 | Hilti Malaysia - Terms and Conditions 2019.pdf | Hilti Malaysia Sdn. Bhd. (157721-A) F-5-A \| Si... | 281 | 50 | 70.25 | [ 6.78907195e-03 -4.97652031e-02 -3.04607451e-... |
| 1 | 1 | Hilti Malaysia - Terms and Conditions 2019.pdf | GENERAL 1.1 In these conditions the followin... | 1934 | 335 | 483.5 | [ 7.20427632e-02 -7.75793642e-02 1.46522094e-... |
| 2 | 2 | Hilti Malaysia - Terms and Conditions 2019.pdf | Hilti Malaysia Sdn. Bhd. (157721-A) F-5-A \| Si... | 872 | 156 | 218.0 | [ 3.19398865e-02 -5.95324412e-02 -1.02918353e-... |
| 3 | 2 | Hilti Malaysia - Terms and Conditions 2019.pdf | The Company accordingly reserves the right to ... | 1133 | 209 | 283.25 | [ 4.59260643e-02 -1.06616899e-01 1.31733194e-... |
| 4 | 2 | Hilti Malaysia - Terms and Conditions 2019.pdf | Any such additional costs may be invoiced by t... | 702 | 130 | 175.5 | [ 3.28748487e-02 -7.55417421e-02 -3.63777671e-... |


## 2. RAG - Search and Answer

We discussed RAG briefly in the beginning but let's quickly recap.

RAG stands for Retrieval Augmented Generation.

Which is another way of saying "given a query, search for relevant resources and answer based on those resources".

Let's break down each step:
* **Retrieval** - Get relevant resources given a query. For example, if the query is "what are the macronutrients?" the ideal results will contain information about protein, carbohydrates and fats (and possibly alcohol) rather than information about which tractors are the best for farming (though that is also cool information).
* **Augmentation** - LLMs are capable of generating text given a prompt. However, this generated text is designed to *look* right. It often contains some correct information, but LLMs are prone to hallucination (generating a result that *looks* like legit text but is factually wrong). In augmentation, we pass relevant information into the prompt and get an LLM to use that relevant information as the basis of its generation.
* **Generation** - This is where the LLM will generate a response that has been flavoured/augmented with the retrieved resources. In turn, this not only gives us a potentially more correct answer, it also gives us resources to investigate more (since we know which resources went into the prompt).
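
As a rough sketch of the augmentation step, the retrieved chunks can be pasted into the prompt ahead of the query. The template wording and the `build_augmented_prompt` helper are assumptions for illustration, and the two context strings are sentences from the textbook used in this notebook.

```python
def build_augmented_prompt(query: str, context_items: list[str]) -> str:
    """Combine retrieved text chunks and a user query into one prompt string."""
    context = "\n".join(f"- {item}" for item in context_items)
    return (
        "Answer the query using the context items below.\n\n"
        f"Context items:\n{context}\n\n"
        f"Query: {query}\n"
        "Answer:"
    )

# Stand-ins for chunks returned by the retrieval step
retrieved_chunks = [
    "There are three classes of macronutrients: carbohydrates, lipids, and proteins.",
    "The energy from macronutrients comes from their chemical bonds.",
]

prompt = build_augmented_prompt(query="What are the macronutrients?",
                                context_items=retrieved_chunks)
print(prompt)
```

Because the prompt is built from known chunks, we can always trace an answer back to the text that was passed in.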

The whole idea of RAG is to get an LLM to be more factually correct based on your own input as well as have a reference to where the generated output may have come from.

This is an incredibly helpful tool.

Let's say you had 1000s of customer support documents.

You could use RAG to generate direct answers to questions with links to relevant documentation.

Or you were an insurance company with large chains of claims emails.

You could use RAG to answer questions about the emails with sources.

One helpful analogy is to think of LLMs as calculators for words.

With good inputs, the LLM can sort them into helpful outputs.

How?

It starts with better search.

### Similarity search

Similarity search or semantic search or vector search is the idea of searching on *vibe*.

If this sounds like woo-woo, it's not.

Perhaps searching via *meaning* is a better analogy.

With keyword search, you are trying to match the string "apple" with the string "apple".

Whereas with similarity/semantic search, you may want to search "macronutrients functions".

And get back pieces of text that match that meaning, even if they don't contain the exact words "macronutrients functions".

> **Example:** Using similarity search on our textbook data with the query "macronutrients functions" returns a paragraph that starts with:
>
>*There are three classes of macronutrients: carbohydrates, lipids, and proteins. These can be metabolically processed into cellular energy. The energy from macronutrients comes from their chemical bonds. This chemical energy is converted into cellular energy that is then utilized to perform work, allowing our bodies to conduct their basic functions.*
>
> as the first result. How cool!
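
To see where plain keyword matching falls short, here is a small check against the paragraph quoted above. It only covers the keyword half of the comparison: the exact query phrase never appears in the paragraph, so a substring search finds nothing even though the paragraph is clearly on topic.

```python
passage = (
    "There are three classes of macronutrients: carbohydrates, lipids, and proteins. "
    "These can be metabolically processed into cellular energy. "
    "The energy from macronutrients comes from their chemical bonds. "
    "This chemical energy is converted into cellular energy that is then utilized to "
    "perform work, allowing our bodies to conduct their basic functions."
)
query = "macronutrients functions"

# Keyword search: look for the exact query string inside the passage
exact_match = query in passage.lower()
print(f"Exact phrase found: {exact_match}")

# Looser keyword search: check each query word separately
words_found = {word: word in passage.lower() for word in query.split()}
print(f"Individual words found: {words_found}")
```

Word-level matching happens to work for this paragraph, but it would still miss text that covers the same idea with different vocabulary. That gap is what embedding-based search aims to close.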

If you've ever used Google, you know this kind of workflow.

But now we'd like to perform that across our own data.

Let's import the embeddings we created earlier and prepare them for use by turning them into a tensor.

In [13]:
import random

import torch
import numpy as np
import pandas as pd

device = "cuda" if torch.cuda.is_available() else "cpu"

# Import texts and embedding df
text_chunks_and_embedding_df = pd.read_csv("text_chunks_and_embeddings_df.csv")

# Convert embedding column back to np.array (it got converted to string when it got saved to CSV)
text_chunks_and_embedding_df["embedding"] = text_chunks_and_embedding_df["embedding"].apply(lambda x: np.fromstring(x.strip("[]"), sep=" "))

# Convert texts and embedding df to list of dicts
pages_and_chunks = text_chunks_and_embedding_df.to_dict(orient="records")

# Convert embeddings to torch tensor and send to device (note: NumPy arrays are float64, torch tensors are float32 by default)
embeddings = torch.tensor(np.array(text_chunks_and_embedding_df["embedding"].tolist()), dtype=torch.float32).to(device)
embeddings.shape

torch.Size([133, 768])
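
For intuition, the string-to-array conversion above amounts to stripping the brackets, splitting on whitespace and converting each piece to a float. Here is a pure-Python sketch of the same idea, using a made-up three-number string (real rows hold 768 values). The `parse_embedding_string` helper is hypothetical.

```python
def parse_embedding_string(text: str) -> list[float]:
    """Turn a string like '[ 0.1 -0.2  0.3]' back into a list of floats."""
    return [float(value) for value in text.strip("[]").split()]

# Made-up example in the same format a NumPy array takes when written to CSV
saved_value = "[ 1.25000000e-02 -4.00000000e-02  3.50000000e-01]"
restored = parse_embedding_string(saved_value)

print(restored)
print(f"Number of values: {len(restored)}")
```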

In [14]:
text_chunks_and_embedding_df.head()

| | page_number | pdf_name | sentence_chunk | chunk_char_count | chunk_word_count | chunk_token_count | embedding |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| 0 | 1 | 10_Object and Class Structuring.pptx | Object and Class Structuring CT015-3-2 Design ... | 417 | 63 | 104.25 | [0.00895868056, -0.0445464216, -0.0190935209, ... |
| 1 | 1 | 10_Object and Class Structuring.pptx | Object and class structuring criteria are prov... | 533 | 77 | 133.25 | [-0.0127386739, -0.000978663214, 0.00970725249... |
| 2 | 2 | 10_Object and Class Structuring.pptx | Object and Class Structuring Criteria Slide <6... | 386 | 61 | 96.5 | [0.010841215, -0.0202376209, 0.00245435373, 0.... |
| 3 | 2 | 10_Object and Class Structuring.pptx | System Interface Objects represent external sy... | 321 | 45 | 80.25 | [-0.0332218222, -0.0271735694, -0.00854785275,... |
| 4 | 3 | 10_Object and Class Structuring.pptx | – Software object that contains the details of... | 580 | 84 | 145.0 | [0.0173595063, -0.0475490652, -0.00125845731, ... |


Nice!

Now let's prepare another instance of our embedding model. Not because we have to but because we'd like to make it so you can start the notebook from the cell above.

In [15]:
from sentence_transformers import util, SentenceTransformer

embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2",
                                      device=device) # choose the device to load the model to

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


Embedding model ready!

Time to perform a semantic search.

Let's say you were studying the macronutrients.

And wanted to search your textbook for "macronutrients functions".

Well, we can do so with the following steps:
1. Define a query string (e.g. `"macronutrients functions"`) - note: this could be anything, specific or not.
2. Turn the query string into an embedding with the same model we used to embed our text chunks.
3. Perform a [dot product](https://pytorch.org/docs/stable/generated/torch.dot.html) or [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) function between the text embeddings and the query embedding (we'll get to what these are shortly) to get similarity scores.
4. Sort the results from step 3 in descending order (a higher score means more similarity in the eyes of the model) and use these values to inspect the texts.

Easy!
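
The four steps can be sketched with stand-in data. Loading the real model and embeddings is skipped here: random unit vectors play the role of the chunk embeddings, and the "query embedding" is a slightly noisy copy of chunk 42 so we know which chunk should rank first. All names and values below are illustrative assumptions.

```python
import torch

torch.manual_seed(42)

# Stand-in for the chunk embeddings: 1000 random 768-dimensional vectors scaled to unit length
toy_embeddings = torch.nn.functional.normalize(torch.randn(1000, 768), dim=1)

# Steps 1-2 (stand-in): a real pipeline would embed the query string with the embedding model.
# Here the query vector is chunk 42 plus a little noise.
toy_query_embedding = torch.nn.functional.normalize(
    toy_embeddings[42] + 0.01 * torch.randn(768), dim=0
)

# Step 3: dot product between the query and every chunk embedding -> one score per chunk
dot_scores = toy_embeddings @ toy_query_embedding  # shape: (1000,)

# Step 4: sort scores in descending order (higher score = more similar)
sorted_scores, sorted_indices = torch.sort(dot_scores, descending=True)
print(f"Top 3 scores: {sorted_scores[:3]}")
print(f"Top 3 chunk indices: {sorted_indices[:3]}")
```

Chunk 42 comes out on top because its direction is nearly identical to the query's, while the unrelated random vectors score close to zero.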


Woah!! Now that was fast!

~0.00008 seconds to perform a dot product comparison across 1680 embeddings on my machine (NVIDIA RTX 4090 GPU).

GPUs are optimized for these kinds of operations.

So even if you were to increase our embeddings by 100x (1680 -> 168,000), an exhaustive dot product operation would happen in ~0.008 seconds (assuming linear scaling).

Heck, let's try it.

Wow. That's quick!

That means we can get pretty far by just storing our embeddings in `torch.tensor` for now.

However, for *much* larger datasets, we'd likely look at a dedicated vector database or indexing library such as [Faiss](https://github.com/facebookresearch/faiss).

Let's check the results of our original similarity search.

[`torch.topk`](https://pytorch.org/docs/stable/generated/torch.topk.html) returns a tuple of values (scores) and indices for those scores.

The indices tell us which rows of the `embeddings` tensor have which scores in relation to the query embedding (higher is better).

We can use those indices to map back to our text chunks.

First, we'll define a small helper function to print out wrapped text (so it doesn't print a whole text chunk as a single line).

In [16]:
# Define helper function to print wrapped text
import textwrap

def print_wrapped(text, wrap_length=80):
    wrapped_text = textwrap.fill(text, wrap_length)
    print(wrapped_text)

Now we can loop through the `top_results_dot_product` tuple, match up the scores and indices, and then use those indices to index on our `pages_and_chunks` variable to get the relevant text chunk.

Sounds like a lot but we can do it!
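
Here is a minimal sketch of that loop, using made-up scores and chunks in place of the real `top_results_dot_product` and `pages_and_chunks`. The `toy_` variables and their contents are assumptions for illustration.

```python
import textwrap
import torch

# Made-up similarity scores for five chunks and a matching list of chunk dictionaries
toy_dot_scores = torch.tensor([0.12, 0.69, 0.05, 0.43, 0.31])
toy_pages_and_chunks = [
    {"page_number": page, "sentence_chunk": f"Example text for chunk {i}."}
    for i, page in enumerate([3, 5, 8, 13, 21])
]

# torch.topk returns the k largest values and the positions they came from
toy_top_results = torch.topk(toy_dot_scores, k=3)

for score, index in zip(toy_top_results.values, toy_top_results.indices):
    chunk = toy_pages_and_chunks[int(index)]  # convert the tensor index to a plain int
    print(f"Score: {score:.4f}")
    print(textwrap.fill(chunk["sentence_chunk"], 80))
    print(f"Page number: {chunk['page_number']}\n")
```

The results come back already sorted from highest to lowest score, so the first chunk printed is the best match.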

The first result looks to have nailed it!

We get a very relevant answer to our query `"macronutrients functions"` even though it's quite vague.

That's the power of semantic search!

And even better, if we wanted to inspect the result further, we get the page number where the text appears.

How about we check the page to verify?

We can do so by loading the page number containing the highest result (page 5 but really page 5 + 41 since our PDF page numbers start on page 41).
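
Here is a hedged sketch of that lookup with PyMuPDF. The file name `human-nutrition-text.pdf` is an assumption (swap in the path to your own copy), and `to_pdf_page_index` and `load_page_image` are hypothetical helpers. The offset of 41 comes from the page numbering described above.

```python
import os

PAGE_NUMBER_OFFSET = 41  # our stored page numbers start 41 pages into the PDF

def to_pdf_page_index(page_number: int, offset: int = PAGE_NUMBER_OFFSET) -> int:
    """Convert a page number stored with a chunk into a page index in the PDF file."""
    return page_number + offset

def load_page_image(pdf_path: str, page_number: int, dpi: int = 300):
    """Render one PDF page to a PyMuPDF pixmap (requires `pip install PyMuPDF`)."""
    import fitz  # PyMuPDF, imported here so the helper above works without it

    doc = fitz.open(pdf_path)
    page = doc.load_page(to_pdf_page_index(page_number))
    pixmap = page.get_pixmap(dpi=dpi)
    doc.close()
    return pixmap

pdf_path = "human-nutrition-text.pdf"  # assumed file name
print(f"Page 5 in our chunks -> PDF page index {to_pdf_page_index(5)}")
if os.path.exists(pdf_path):
    pixmap = load_page_image(pdf_path, page_number=5)
    pixmap.save("page_5.png")  # write the rendered page to an image file
```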

Nice!

Now we can do extra research if we'd like.

We could repeat this workflow for any kind of query we'd like on our textbook.

And it would also work for other datatypes too.

We could use semantic search on customer support documents.

Or email threads.

Or company plans.

Or our old journal entries.

Almost anything!

The workflow is the same:

`ingest documents -> split into chunks -> embed chunks -> make a query -> embed the query -> compare query embedding to chunk embeddings`

And we get relevant resources *along with* the source they came from!

That's the **retrieval** part of Retrieval Augmented Generation (RAG).

Before we get to the next two steps, let's take a small aside and discuss similarity measures.

### Similarity measures: dot product and cosine similarity

Let's talk similarity measures between vectors.

Specifically, between embedding vectors, which are representations of data with magnitude and direction in high-dimensional space (our embedding vectors have 768 dimensions).

Two of the most common you'll come across are the dot product and cosine similarity.

They are quite similar.

The main difference is that cosine similarity has a normalization step.

| Similarity measure | Description | Code |
| ----- | ----- | ----- |
| [Dot Product](https://en.wikipedia.org/wiki/Dot_product) | - Measure of magnitude and direction between two vectors<br>- Vectors that are aligned in direction and magnitude have a higher positive value<br>- Vectors that are opposite in direction and magnitude have a higher negative value | [`torch.dot`](https://pytorch.org/docs/stable/generated/torch.dot.html), [`np.dot`](https://numpy.org/doc/stable/reference/generated/numpy.dot.html), [`sentence_transformers.util.dot_score`](https://www.sbert.net/docs/package_reference/util.html#sentence_transformers.util.dot_score) |
| [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity) | - Vectors get normalized by magnitude/[Euclidean norm](https://en.wikipedia.org/wiki/Norm_(mathematics))/L2 norm so they have unit length and are compared more so on direction<br>- Vectors that are aligned in direction have a value close to 1<br>- Vectors that are opposite in direction have a value close to -1 | [`torch.nn.functional.cosine_similarity`](https://pytorch.org/docs/stable/generated/torch.nn.functional.cosine_similarity.html), [`1 - scipy.spatial.distance.cosine`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html) (subtract the distance from 1 for similarity measure), [`sentence_transformers.util.cos_sim`](https://www.sbert.net/docs/package_reference/util.html#sentence_transformers.util.cos_sim) |

For text similarity, you generally want to use cosine similarity as you are after the semantic measurements (direction) rather than magnitude.

In our case, our embedding model `all-mpnet-base-v2` outputs normalized outputs (see the [Hugging Face model card](https://huggingface.co/sentence-transformers/all-mpnet-base-v2#usage-huggingface-transformers) for more on this) so dot product and cosine similarity return the same results. However, dot product is faster because it doesn't need to perform a normalization step.

To make things a bit more concrete, let's make simple dot product and cosine similarity functions and view their results on different vectors.

> **Note:** Similarity measures between vectors and embeddings can be used on any kind of embeddings, not just text embeddings. For example, you could measure image embedding similarity or audio embedding similarity. Or with text and image models like [CLIP](https://github.com/mlfoundations/open_clip), you can measure the similarity between text and image embeddings.

In [17]:
import torch

def dot_product(vector1, vector2):
    return torch.dot(vector1, vector2)

def cosine_similarity(vector1, vector2):
    dot = torch.dot(vector1, vector2)

    # Get Euclidean/L2 norm (magnitude) of each vector
    norm_vector1 = torch.sqrt(torch.sum(vector1**2))
    norm_vector2 = torch.sqrt(torch.sum(vector2**2))

    # Dividing by the norms removes the magnitude and keeps the direction
    return dot / (norm_vector1 * norm_vector2)

# Example tensors
vector1 = torch.tensor([1, 2, 3], dtype=torch.float32)
vector2 = torch.tensor([1, 2, 3], dtype=torch.float32)
vector3 = torch.tensor([4, 5, 6], dtype=torch.float32)
vector4 = torch.tensor([-1, -2, -3], dtype=torch.float32)

# Calculate dot product
print("Dot product between vector1 and vector2:", dot_product(vector1, vector2))
print("Dot product between vector1 and vector3:", dot_product(vector1, vector3))
print("Dot product between vector1 and vector4:", dot_product(vector1, vector4))

# Calculate cosine similarity
print("Cosine similarity between vector1 and vector2:", cosine_similarity(vector1, vector2))
print("Cosine similarity between vector1 and vector3:", cosine_similarity(vector1, vector3))
print("Cosine similarity between vector1 and vector4:", cosine_similarity(vector1, vector4))

Notice for both dot product and cosine similarity the comparisons of `vector1` and `vector2` are the opposite of `vector1` and `vector4`.

Comparing `vector1` and `vector2`, both equations return positive values (14 for dot product and 1.0 for cosine similarity).

But comparing `vector1` and `vector4` the result is in the negative direction.

This makes sense because `vector4` is the negative version of `vector1`.

Whereas comparing `vector1` and `vector3` shows a different outcome.

For the dot product, the value is positive and larger than the value for two identical vectors (32 vs 14).

However, for the cosine similarity, thanks to the normalization step, comparing `vector1` and `vector3` results in a positive value close to 1 but not exactly 1.

It is because of this that when comparing text embeddings, cosine similarity is generally favoured as it measures the difference in direction of a pair of vectors rather than difference in magnitude.

And it is this difference in direction that is more generally considered to capture the semantic meaning/vibe of the text.

The good news is that as mentioned before, the outputs of our embedding model `all-mpnet-base-v2` are already normalized.

So we can continue using the dot product (cosine similarity is dot product + normalization).
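
To check that claim on small numbers: once two vectors are scaled to unit length, their plain dot product equals the cosine similarity of the originals. The sketch below uses plain Python and the same example values as `vector1` and `vector3` above (the helper names are illustrative).

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector so its Euclidean (L2) norm is 1."""
    norm = math.sqrt(dot(vector, vector))
    return [x / norm for x in vector]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the two vector norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

vector1 = [1.0, 2.0, 3.0]
vector3 = [4.0, 5.0, 6.0]

cosine = cosine_similarity(vector1, vector3)
dot_of_normalized = dot(l2_normalize(vector1), l2_normalize(vector3))

print(f"Dot product of raw vectors: {dot(vector1, vector3)}")
print(f"Cosine similarity of raw vectors: {cosine:.4f}")
print(f"Dot product of normalized vectors: {dot_of_normalized:.4f}")
```

The last two numbers agree, which is why skipping the normalization step is safe when the embeddings are already unit length.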

With similarity measures explained, let's functionize our semantic search steps from above so we can repeat them.

### Functionizing our semantic search pipeline

Let's put all of the steps from above for semantic search into a function or two so we can repeat the workflow.

In [18]:
from time import perf_counter as timer

def retrieve_relevant_resources(query: str,
                                embeddings: torch.Tensor,
                                model: SentenceTransformer=embedding_model,
                                n_resources_to_return: int=5,
                                print_time: bool=True):
    """
    Embeds a query with model and returns top k scores and indices from embeddings.
    """

    # Embed the query
    query_embedding = model.encode(query,
                                   convert_to_tensor=True)

    # Get dot product scores on embeddings
    start_time = timer()
    dot_scores = util.dot_score(query_embedding, embeddings)[0]
    end_time = timer()

    if print_time:
        print(f"[INFO] Time taken to get scores on {len(embeddings)} embeddings: {end_time-start_time:.5f} seconds.")

    scores, indices = torch.topk(input=dot_scores,
                                 k=n_resources_to_return)

    return scores, indices

def print_top_results_and_scores(query: str,
                                 embeddings: torch.Tensor,
                                 pages_and_chunks: list[dict]=pages_and_chunks,
                                 n_resources_to_return: int=5):
    """
    Takes a query, retrieves most relevant resources and prints them out in descending order.

    Note: Requires pages_and_chunks to be formatted in a specific way (see above for reference).
    """

    scores, indices = retrieve_relevant_resources(query=query,
                                                  embeddings=embeddings,
                                                  n_resources_to_return=n_resources_to_return)

    print(f"Query: {query}\n")
    print("Results:")
    # Loop through zipped together scores and indices
    for score, index in zip(scores, indices):
        print(f"Score: {score:.4f}")
        # Print relevant sentence chunk (since the scores are in descending order, the most relevant chunk will be first)
        print_wrapped(pages_and_chunks[index]["sentence_chunk"])
        # Print the page number too so we can reference the textbook further and check the results
        print(f"Page number: {pages_and_chunks[index]['page_number']}")
        print("\n")

Excellent! Now let's test our functions out.

In [19]:
query = "symptoms of pellagra"
from time import perf_counter as timer

# Get just the scores and indices of top related results
scores, indices = retrieve_relevant_resources(query=query,
                                              embeddings=embeddings)
scores, indices

[INFO] Time taken to get scores on 133 embeddings: 0.00069 seconds.


(tensor([0.0877, 0.0644, 0.0600, 0.0540, 0.0459], device='cuda:0'),
 tensor([ 19, 124,  31, 123, 125], device='cuda:0'))

In [20]:
# Print out the texts of the top scores
print_top_results_and_scores(query=query,
                             embeddings=embeddings)

[INFO] Time taken to get scores on 133 embeddings: 0.00005 seconds.
Query: symptoms of pellagra

Results:
Score: 0.0877
You can put a guard in each fragment to indicate under what condition it can
run. A guard of else indicates a fragment that should run if no other guard is
true. If all guards are false and there is no else, then none
Page number: 5


Score: 0.0644
any absence from class. Three cases of Late will be equal to 1 absence. Use
proper academic references – APA Referencing only. Academic Dishonesty /
Plagiarism is a serious offence.
Page number: 7


Score: 0.0600
Exit/ display message Cancelling Appointment Entry/ received request Do/
cancelling the appointment Exit/ display message • A transition is the path
taken during a change of state from one state to the next in response to a
triggering event • Guard condition: Optional condition to be evaluated; if
false, the transition is not taken Transitions State Machines Transitions Slide
‹#› of 20 Waiting for Approval Creating

## Loading LLM Model

### Checking local GPU memory availability

Let's find out what hardware we've got available and see what kind of model(s) we'll be able to load.

> **Note:** You can also check this with the `!nvidia-smi` command.

In [1]:
# Get GPU available memory
import torch
gpu_memory_bytes = torch.cuda.get_device_properties(0).total_memory
gpu_memory_gb = round(gpu_memory_bytes / (2**30))
gpu_memory_gb = 6 # manual override of the detected value, remove this line to use your GPU's actual memory
print(f"Available GPU memory: {gpu_memory_gb} GB")

Available GPU memory: 6 GB


In [2]:
# Note: the following is Gemma focused, however, there are more and more LLMs of the 2B and 7B size appearing for local use.
if gpu_memory_gb < 5.1:
    print(f"Your available GPU memory is {gpu_memory_gb}GB, you may not have enough memory to run a Gemma LLM locally without quantization.")
    # Assumed fallback so the variables printed below are always defined: smallest model, quantized
    use_quantization_config = True
    model_id = "google/gemma-2b-it"
elif gpu_memory_gb < 8.1:
    print(f"GPU memory: {gpu_memory_gb} | Recommended model: Gemma 2B in 4-bit precision.")
    use_quantization_config = True
    model_id = "google/gemma-2b-it"
elif gpu_memory_gb < 19.0:
    print(f"GPU memory: {gpu_memory_gb} | Recommended model: Gemma 2B in float16 or Gemma 7B in 4-bit precision.")
    use_quantization_config = False
    model_id = "google/gemma-2b-it"
else: # 19 GB or more (a plain else also covers exactly 19 GB)
    print(f"GPU memory: {gpu_memory_gb} | Recommended model: Gemma 7B in 4-bit or float16 precision.")
    use_quantization_config = False
    model_id = "google/gemma-7b-it"

print(f"use_quantization_config set to: {use_quantization_config}")
print(f"model_id set to: {model_id}")

GPU memory: 6 | Recommended model: Gemma 2B in 4-bit precision.
use_quantization_config set to: True
model_id set to: google/gemma-2b-it


### Loading an LLM locally

Alright! For the 6 GB run shown above, that means `gemma-2b-it` in 4-bit precision (on my local machine with an RTX 4090 it's `gemma-7b-it`, change the `model_id` and `use_quantization_config` values to suit your needs)!

There are plenty of examples of how to load the model on the `gemma-7b-it` [Hugging Face model card](https://huggingface.co/google/gemma-7b-it).

Good news is, the Hugging Face [`transformers`](https://huggingface.co/docs/transformers/) library has all the tools we need.

To load our LLM, we're going to need a few things:
1. A quantization config (optional) - This will determine whether or not we load the model in 4-bit precision for lower memory usage. We can create this with the [`transformers.BitsAndBytesConfig`](https://huggingface.co/docs/transformers/v4.38.2/en/main_classes/quantization#transformers.BitsAndBytesConfig) class (requires installing the [`bitsandbytes` library](https://github.com/TimDettmers/bitsandbytes)).
2. A model ID - This is the reference Hugging Face model ID which will determine which tokenizer and model gets used. For example `gemma-7b-it`.
3. A tokenizer - This is what will turn our raw text into tokens ready for the model. We can create it using the [`transformers.AutoTokenizer.from_pretrained`](https://huggingface.co/docs/transformers/v4.38.2/en/model_doc/auto#transformers.AutoTokenizer) method and passing it our model ID.
4. An LLM model - Again, using our model ID we can load a specific LLM model. To do so we can use the [`transformers.AutoModelForCausalLM.from_pretrained`](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForCausalLM.from_pretrained) method, passing it our model ID as well as various other parameters.

As a bonus, we'll check if [Flash Attention 2](https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2) is available using `transformers.utils.is_flash_attn_2_available()`. Flash Attention 2 speeds up the attention mechanism in Transformer architecture models (which is what many modern LLMs are based on, including Gemma). So if it's available and the model is supported (not all models support Flash Attention 2), we'll use it. If it's not available, you can install it by following the instructions on the [GitHub repo](https://github.com/Dao-AILab/flash-attention).

> **Note:** Flash Attention 2 currently works on NVIDIA GPUs with a compute capability score of 8.0+ (Ampere, Ada Lovelace, Hopper architectures). We can check our GPU compute capability score with [`torch.cuda.get_device_capability(0)`](https://pytorch.org/docs/stable/generated/torch.cuda.get_device_capability.html).

> **Note:** To get access to the Gemma models, you will have to [agree to the terms & conditions](https://huggingface.co/google/gemma-7b-it) on the Gemma model page on Hugging Face. You will then have to authorize your local machine via the [Hugging Face CLI/Hugging Face Hub `login()` function](https://huggingface.co/docs/huggingface_hub/en/quick-start#authentication). Once you've done this, you'll be able to download the models. If you're using Google Colab, you can add a [Hugging Face token](https://huggingface.co/docs/hub/en/security-tokens) to the "Secrets" tab.
>
> Downloading an LLM locally can take a fair bit of time depending on your internet connection. Gemma 7B is about a 16GB download and Gemma 2B is about a 6GB download.

Let's do it!

In [3]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from transformers.utils import is_flash_attn_2_available

# 1. Create quantization config for smaller model loading (optional)
# Requires !pip install bitsandbytes accelerate, see: https://github.com/TimDettmers/bitsandbytes, https://huggingface.co/docs/accelerate/
# For models that require 4-bit quantization (use this if you have low GPU memory available)
quantization_config = BitsAndBytesConfig(load_in_4bit=True,
                                         bnb_4bit_use_double_quant=True,
                                         bnb_4bit_quant_type="nf4",
                                         bnb_4bit_compute_dtype=torch.bfloat16)

# Bonus: Setup Flash Attention 2 for faster inference, default to "sdpa" or "scaled dot product attention" if it's not available
# Flash Attention 2 requires NVIDIA GPU compute capability of 8.0 or above, see: https://developer.nvidia.com/cuda-gpus
# Requires !pip install flash-attn, see: https://github.com/Dao-AILab/flash-attention
print(is_flash_attn_2_available())
if (is_flash_attn_2_available()) and (torch.cuda.get_device_capability(0)[0] >= 8):
    attn_implementation = "flash_attention_2"
else:
    attn_implementation = "sdpa"
print(f"[INFO] Using attention implementation: {attn_implementation}")

# 2. Pick a model we'd like to use (this will depend on how much GPU memory you have available)
#model_id = "google/gemma-7b-it"
model_id = "google/gemma-2b-it" # (we already set this above)
print(f"[INFO] Using model_id: {model_id}")
access_token = "put your access token here"

# The model files are loaded from a local folder here (raw strings keep the Windows backslashes intact)
local_model_path = r"models\gemma-2b"
cache_dir = r"D:\HuggingFace"

# 3. Instantiate tokenizer (tokenizer turns text into numbers ready for the model)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=local_model_path,
                                          cache_dir=cache_dir,
                                          token=access_token,
                                          local_files_only=True)

# 4. Instantiate the model
llm_model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=local_model_path,
                                                 token=access_token,
                                                 cache_dir=cache_dir,
                                                 local_files_only=True,
                                                 quantization_config=quantization_config if use_quantization_config else None, # only quantize if requested above
                                                 torch_dtype=torch.float16, # datatype to use, we want float16
                                                 low_cpu_mem_usage=True, # keep CPU RAM usage low while loading the weights
                                                 attn_implementation=attn_implementation) # which attention version to use

# Quantized models are placed on the GPU during loading, otherwise send the model there manually
if not use_quantization_config:
    llm_model.to("cuda")



False
[INFO] Using attention implementation: sdpa
[INFO] Using model_id: google/gemma-2b-it


Gemma's activation function should be approximate GeLU and not exact GeLU.
Changing the activation function to `gelu_pytorch_tanh`.if you want to use the legacy `gelu`, edit the `model.config` to set `hidden_activation=gelu`   instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.
Loading checkpoint shards: 100%|██████████| 2/2 [00:22<00:00, 11.23s/it]


We've got an LLM!

Let's check it out.

In [4]:
llm_model

GemmaForCausalLM(
  (model): GemmaModel(
    (embed_tokens): Embedding(256000, 2048, padding_idx=0)
    (layers): ModuleList(
      (0-17): 18 x GemmaDecoderLayer(
        (self_attn): GemmaSdpaAttention(
          (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): GemmaRotaryEmbedding()
        )
        (mlp): GemmaMLP(
          (gate_proj): Linear4bit(in_features=2048, out_features=16384, bias=False)
          (up_proj): Linear4bit(in_features=2048, out_features=16384, bias=False)
          (down_proj): Linear4bit(in_features=16384, out_features=2048, bias=False)
          (act_fn): PytorchGELUTanh()
        )
        (input_layernorm): GemmaRMSNorm()
        (post_attention_layernorm): GemmaRMSNorm()
    

Ok, ok, a bunch of layers, ranging from embedding layers to attention layers (see the `GemmaSdpaAttention` layers!) to MLP and normalization layers.

The good news is that we don't have to know too much about these to use the model.

How about we get the number of parameters in our model?

In [5]:
def get_model_num_params(model: torch.nn.Module):
    return sum([param.numel() for param in model.parameters()])

get_model_num_params(llm_model)

1515268096

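Hmm, about 1.5 billion. That's lower than you might expect for Gemma 2B, most likely because the quantized `Linear4bit` layers pack two 4-bit weights into each stored element, so `numel()` undercounts them.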

It pays to do your own investigations!
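
One way to sanity-check a parameter count is to rebuild it from the layer shapes in the printout above. The sketch below does that in plain Python. It assumes two things the truncated printout doesn't show: a single final `GemmaRMSNorm` after the decoder layers, and an output head that shares its weights with the embedding (so it adds no extra parameters). It also assumes each `Linear4bit` weight is stored packed, two 4-bit values per element.

```python
# Rebuild the parameter count from the layer shapes printed above.
HIDDEN, KV_DIM, MLP_DIM = 2048, 256, 16384
VOCAB, NUM_LAYERS = 256_000, 18

# (in_features, out_features) of every Linear4bit layer in one decoder layer
linear_shapes = [
    (HIDDEN, HIDDEN),   # q_proj
    (HIDDEN, KV_DIM),   # k_proj
    (HIDDEN, KV_DIM),   # v_proj
    (HIDDEN, HIDDEN),   # o_proj
    (HIDDEN, MLP_DIM),  # gate_proj
    (HIDDEN, MLP_DIM),  # up_proj
    (MLP_DIM, HIDDEN),  # down_proj
]
linear_weights = NUM_LAYERS * sum(i * o for i, o in linear_shapes)

embedding_weights = VOCAB * HIDDEN
# Two RMSNorm layers per decoder layer plus one final norm (assumed)
norm_weights = (2 * NUM_LAYERS + 1) * HIDDEN

# Unquantized count: every weight is its own element
full_count = embedding_weights + linear_weights + norm_weights
# 4-bit count: two weights share one stored element (assumed packing)
packed_count = embedding_weights + linear_weights // 2 + norm_weights

print(f"Full-precision parameters: {full_count:,}")
print(f"Elements seen by numel():  {packed_count:,}")
```

Under these assumptions the packed total comes to 1,515,268,096, matching the output above, while the unquantized total is 2,506,172,416, or roughly 2.5 billion weights.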

How about we get the model's memory requirements?

In [6]:
def get_model_mem_size(model: torch.nn.Module):
    """
    Get how much memory a PyTorch model takes up.

    See: https://discuss.pytorch.org/t/gpu-memory-that-model-uses/56822
    """
    # Get model parameters and buffer sizes
    mem_params = sum([param.nelement() * param.element_size() for param in model.parameters()])
    mem_buffers = sum([buf.nelement() * buf.element_size() for buf in model.buffers()])

    # Calculate various model sizes
    model_mem_bytes = mem_params + mem_buffers # in bytes
    model_mem_mb = model_mem_bytes / (1024**2) # in megabytes
    model_mem_gb = model_mem_bytes / (1024**3) # in gigabytes

    return {"model_mem_bytes": model_mem_bytes,
            "model_mem_mb": round(model_mem_mb, 2),
            "model_mem_gb": round(model_mem_gb, 2)}

get_model_mem_size(llm_model)

{'model_mem_bytes': 2039631872, 'model_mem_mb': 1945.14, 'model_mem_gb': 1.9}

Nice, looks like this model takes up about 1.9GB of space on the GPU.

Plus a little more for the forward pass (due to all the calculations happening between the layers).

Hence why the memory figures in the table above are rounded up from the raw model size.
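
The size reported by `get_model_mem_size()` can also be cross-checked by hand. The sketch below is a back-of-the-envelope estimate, not how PyTorch measures memory. It assumes the `Linear4bit` weights cost half a byte each, the embedding and normalization weights stay in float16 (two bytes each), there is one final norm layer after the decoder stack, and the model has no buffers worth counting.

```python
# Cross-check the memory figure from the layer shapes (assumptions in comments).
HIDDEN, KV_DIM, MLP_DIM, VOCAB, NUM_LAYERS = 2048, 256, 16384, 256_000, 18

# Weights in the seven Linear4bit layers of one decoder layer:
# q_proj + o_proj, k_proj + v_proj, gate_proj + up_proj + down_proj
linear_per_layer = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM + 3 * HIDDEN * MLP_DIM
quantized_weights = NUM_LAYERS * linear_per_layer

# Embedding plus RMSNorm weights (two per layer and one final norm, assumed)
float16_weights = VOCAB * HIDDEN + (2 * NUM_LAYERS + 1) * HIDDEN

# Assumed storage: half a byte per 4-bit weight, 2 bytes per float16 weight
total_bytes = quantized_weights // 2 + float16_weights * 2

# For comparison: the same weights with everything kept in float16
unquantized_bytes = (quantized_weights + float16_weights) * 2

print(f"4-bit model:   {total_bytes:,} bytes = {total_bytes / 1024**3:.2f} GB")
print(f"float16 model: {unquantized_bytes:,} bytes = {unquantized_bytes / 1024**3:.2f} GB")
```

With these assumptions the 4-bit estimate lands on the same 2,039,631,872 bytes shown in the output above, which suggests nearly all of the saving comes from the quantized linear layers.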

Now let's get to the fun part, generating some text!

### Generating text with our LLM

We can generate text with our `llm_model` instance by calling its [`generate()` method](https://huggingface.co/docs/transformers/v4.38.2/en/main_classes/text_generation#transformers.GenerationConfig) and passing it a tokenized input (the method accepts plenty of other options alongside the input).

The tokenized input comes from passing a string of text to our `tokenizer`.

It's important to note that you should use a tokenizer that has been paired with a model.

Otherwise if you try to use a different tokenizer and then pass those inputs to a model, you will likely get errors/strange results.

For some LLMs, there's a specific template you should pass to them for ideal outputs.

For example, the `gemma-2b-it` model we loaded has been trained in a dialogue fashion (instruction tuning).

In this case, our `tokenizer` has an [`apply_chat_template()` method](https://huggingface.co/docs/transformers/main/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template) which can prepare our input text in the right format for the model.

Let's try it out.

> **Note:** The following demo has been modified from the Hugging Face model card for [Gemma 7B](https://huggingface.co/google/gemma-7b-it). Many similar demos of usage are available on the model cards of similar models.

In [7]:
input_text = "What are the macronutrients, and what roles do they play in the human body?"
print(f"Input text:\n{input_text}")

# Create prompt template for instruction-tuned model
dialogue_template = [
    {"role": "user",
     "content": input_text}
]

# Apply the chat template
prompt = tokenizer.apply_chat_template(conversation=dialogue_template,
                                       tokenize=False, # keep as raw text (not tokenized)
                                       add_generation_prompt=True)
print(f"\nPrompt (formatted):\n{prompt}")

Input text:
What are the macronutrients, and what roles do they play in the human body?

Prompt (formatted):
<bos><start_of_turn>user
What are the macronutrients, and what roles do they play in the human body?<end_of_turn>
<start_of_turn>model



Notice the scaffolding around our input text. This is the turn-by-turn format our model saw during instruction tuning.
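
To make that scaffolding concrete, here is a hand-rolled approximation of the format printed above. It is an illustration only: the real template ships with the tokenizer and may handle cases this sketch ignores. `format_gemma_prompt` is a hypothetical helper, and the mapping of an `"assistant"` role to `"model"` is an assumption.

```python
def format_gemma_prompt(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Wrap each message in <start_of_turn>/<end_of_turn> markers (hypothetical helper)."""
    prompt = "<bos>"
    for message in messages:
        # The printed prompt names its turns "user" and "model";
        # treating "assistant" as "model" is an assumption
        role = "model" if message["role"] == "assistant" else message["role"]
        prompt += f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n"
    if add_generation_prompt:
        # Open a model turn so generation continues as the model's reply
        prompt += "<start_of_turn>model\n"
    return prompt

example = format_gemma_prompt([{"role": "user", "content": "What are the macronutrients?"}])
print(example)
```

In practice `tokenizer.apply_chat_template()` is the safer choice, since it stays in sync with whatever format the model was actually trained on.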

Our next step is to tokenize this formatted text and pass it to our model's `generate()` method.

We'll make sure our tokenized text is on the same device as our model (GPU) using `to("cuda")`.

Let's generate some text!

We'll time it for fun with the `%%time` magic.

In [8]:
%%time

# Tokenize the input text (turn it into numbers) and send it to GPU
input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")
print(f"Model input (tokenized):\n{input_ids}\n")

# Generate outputs based on the tokenized input
# See generate docs: https://huggingface.co/docs/transformers/v4.38.2/en/main_classes/text_generation#transformers.GenerationConfig
outputs = llm_model.generate(**input_ids,
                             max_new_tokens=256) # define the maximum number of new tokens to create
print(f"Model output (tokens):\n{outputs[0]}\n")

Model input (tokenized):
{'input_ids': tensor([[     2,      2,    106,   1645,    108,   1841,    708,    573, 186809,
         184592, 235269,    578,   1212,  16065,    749,    984,   1554,    575,
            573,   3515,   2971, 235336,    107,    108,    106,   2516,    108]],
       device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1]], device='cuda:0')}



Model output (tokens):
tensor([     2,      2,    106,   1645,    108,   1841,    708,    573, 186809,
        184592, 235269,    578,   1212,  16065,    749,    984,   1554,    575,
           573,   3515,   2971, 235336,    107,    108,    106,   2516,    108,
         21404, 235269,   1517, 235303, 235256,    476,  25497,    576,    573,
        186809, 184592,    578,   1024,  16065,    575,    573,   3515,   2971,
        235292,    109,    688,  12298,   1695, 184592,    688,    708,  37132,
           674,    573,   2971,   4026,    575,   8107,  15992,   1178,  34366,
        235265,   2365,   3658,    573,   4547,  13854,    604,  29703, 235269,
         44760, 235269,    578,   1156,  24582,    674,   1501,    908,    573,
          2971, 235265,    109,    688,    651,   2149,   1872, 186809, 184592,
           708,  66058,    109, 235287,   5231, 156615,  56227,  66058,  34428,
          4134,    604,    573,   2971, 235303, 235256,   5999,    578,  29703,
        235265,  

Woohoo! We just generated some text on our local GPU!

Well not just yet...

Our LLM accepts tokens in and sends tokens back out.

We can convert the output tokens to text using [`tokenizer.decode()`](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.decode).

In [9]:
# Decode the output tokens to text
outputs_decoded = tokenizer.decode(outputs[0])
print(f"Model output (decoded):\n{outputs_decoded}\n")

Model output (decoded):
<bos><bos><start_of_turn>user
What are the macronutrients, and what roles do they play in the human body?<end_of_turn>
<start_of_turn>model
Sure, here's a breakdown of the macronutrients and their roles in the human body:

**Macronutrients** are nutrients that the body needs in larger amounts than calories. They provide the building blocks for tissues, enzymes, and other molecules that make up the body.

**The three main macronutrients are:**

* **Carbohydrates:** Provide energy for the body's cells and tissues. They are the body's main source of fuel.
* **Protein:** Is essential for building and repairing tissues, enzymes, and hormones. It also helps to regulate blood sugar levels.
* **Fat:** Provides energy, helps to absorb vitamins and minerals, and helps to insulate the body.

**Other important macronutrients include:**

* **Carbohydrates:**
    * Simple carbohydrates, such as glucose, fructose, and lactose, are quickly digested and absorbed by the body.
   

Woah! That looks like a pretty good answer.

But notice how the output contains the prompt text as well?

How about we do a little formatting to replace the prompt in the output text?

> **Note:** `"<bos>"` and `"<eos>"` are special tokens that denote "beginning of sequence" and "end of sequence" respectively.

In [10]:
print(f"Input text: {input_text}\n")
print(f"Output text:\n{outputs_decoded.replace(prompt, '').replace('<bos>', '').replace('<eos>', '')}")

Input text: What are the macronutrients, and what roles do they play in the human body?

Output text:
Sure, here's a breakdown of the macronutrients and their roles in the human body:

**Macronutrients** are nutrients that the body needs in larger amounts than calories. They provide the building blocks for tissues, enzymes, and other molecules that make up the body.

**The three main macronutrients are:**

* **Carbohydrates:** Provide energy for the body's cells and tissues. They are the body's main source of fuel.
* **Protein:** Is essential for building and repairing tissues, enzymes, and hormones. It also helps to regulate blood sugar levels.
* **Fat:** Provides energy, helps to absorb vitamins and minerals, and helps to insulate the body.

**Other important macronutrients include:**

* **Carbohydrates:**
    * Simple carbohydrates, such as glucose, fructose, and lactose, are quickly digested and absorbed by the body.
    * Complex carbohydrates, such as starch, fiber, and cellulose

How cool is that!

We just officially generated text from an LLM running locally.
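
An alternative to string replacement is to drop the prompt at the token level. As the tensors above show, the generated output starts with the prompt tokens and continues with the new ones, so slicing off the first `len(prompt)` ids leaves only the answer. The sketch below uses plain lists and a handful of ids copied from the printed tensors so it runs without a model. `strip_prompt_tokens` is a hypothetical helper.

```python
def strip_prompt_tokens(output_ids: list[int], prompt_length: int) -> list[int]:
    """Return only the newly generated token ids (hypothetical helper)."""
    return output_ids[prompt_length:]

# Toy ids taken from the tensors printed above: the last three prompt tokens...
prompt_ids = [106, 2516, 108]
# ...followed by the first three generated tokens
output_ids = prompt_ids + [21404, 235269, 1517]

new_ids = strip_prompt_tokens(output_ids, prompt_length=len(prompt_ids))
print(new_ids)

# With the real objects this would look something like:
# new_tokens = outputs[0][input_ids["input_ids"].shape[1]:]
# answer = tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Passing `skip_special_tokens=True` to `decode()` also removes markers such as `<bos>` and `<eos>`, so the chained `.replace()` calls would no longer be needed.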

So we've covered the R (retrieval) and G (generation) of RAG.

How about we check out the last step?

Augmentation.

First, let's put together a list of queries we can try out with our pipeline.

In [11]:
# Nutrition-style questions generated with GPT4
gpt4_questions = [
    "What are the macronutrients, and what roles do they play in the human body?",
    "How do vitamins and minerals differ in their roles and importance for health?",
    "Describe the process of digestion and absorption of nutrients in the human body.",
    "What role does fibre play in digestion? Name five fibre containing foods.",
    "Explain the concept of energy balance and its importance in weight management."
]

# Manually created question list
manual_questions = [
    "How often should infants be breastfed?",
    "What are symptoms of pellagra?",
    "How does saliva help with digestion?",
    "What is the RDI for protein per day?",
    "water soluble vitamins"
]

query_list = gpt4_questions + manual_questions

And now let's check if our `retrieve_relevant_resources()` function works with our list of queries.

In [21]:
import random
query = random.choice(query_list)

print(f"Query: {query}")

# Get just the scores and indices of top related results
scores, indices = retrieve_relevant_resources(query=query,
                                              embeddings=embeddings)
scores, indices

Query: How often should infants be breastfed?
[INFO] Time taken to get scores on 133 embeddings: 0.00006 seconds.


(tensor([0.1373, 0.1077, 0.0738, 0.0661, 0.0622], device='cuda:0'),
 tensor([123, 121, 132,  21,  29], device='cuda:0'))

Beautiful!

Let's augment!

### Augmenting our prompt with context items

What we'd like to do with augmentation is take the results from our search for relevant resources and put them into the prompt that we pass to our LLM.

In essence, we start with a base prompt and update it with context text.

Let's write a function called `prompt_formatter` that takes in a query and our list of context items (in our case it'll be select indices from our list of dictionaries inside `pages_and_chunks`) and then formats the query with text from the context items.

We'll apply the dialogue and chat template to our prompt before returning it as well.

> **Note:** The process of augmenting or changing a prompt to an LLM is known as prompt engineering. And the best way to do it is an active area of research. For a comprehensive guide on different prompt engineering techniques, I'd recommend the Prompt Engineering Guide ([promptingguide.ai](https://www.promptingguide.ai/)), [Brex's Prompt Engineering Guide](https://github.com/brexhq/prompt-engineering) and the paper [Prompt Design and Engineering: Introduction and Advanced Models](https://arxiv.org/abs/2401.14423).

In [22]:
def prompt_formatter(query: str,
                     context_items: list[dict]) -> str:
    """
    Augments query with text-based context from context_items.
    """
    # Join context items into one bulleted block of text
    context = "- " + "\n- ".join([item["sentence_chunk"] for item in context_items])

    # Create a base prompt with examples to help the model
    # Note: this is very customizable, I've chosen to use 3 examples of the answer style we'd like.
    # We could also write this in a txt file and import it in if we wanted.
    base_prompt = """Based on the following context items, please answer the query.
Give yourself room to think by extracting relevant passages from the context before answering the query.
Don't return the thinking, only return the answer.
Make sure your answers are as explanatory as possible.
Use the following examples as reference for the ideal answer style.
\nExample 1:
Query: What are the fat-soluble vitamins?
Answer: The fat-soluble vitamins include Vitamin A, Vitamin D, Vitamin E, and Vitamin K. These vitamins are absorbed along with fats in the diet and can be stored in the body's fatty tissue and liver for later use. Vitamin A is important for vision, immune function, and skin health. Vitamin D plays a critical role in calcium absorption and bone health. Vitamin E acts as an antioxidant, protecting cells from damage. Vitamin K is essential for blood clotting and bone metabolism.
\nExample 2:
Query: What are the causes of type 2 diabetes?
Answer: Type 2 diabetes is often associated with overnutrition, particularly the overconsumption of calories leading to obesity. Factors include a diet high in refined sugars and saturated fats, which can lead to insulin resistance, a condition where the body's cells do not respond effectively to insulin. Over time, the pancreas cannot produce enough insulin to manage blood sugar levels, resulting in type 2 diabetes. Additionally, excessive caloric intake without sufficient physical activity exacerbates the risk by promoting weight gain and fat accumulation, particularly around the abdomen, further contributing to insulin resistance.
\nExample 3:
Query: What is the importance of hydration for physical performance?
Answer: Hydration is crucial for physical performance because water plays key roles in maintaining blood volume, regulating body temperature, and ensuring the transport of nutrients and oxygen to cells. Adequate hydration is essential for optimal muscle function, endurance, and recovery. Dehydration can lead to decreased performance, fatigue, and increased risk of heat-related illnesses, such as heat stroke. Drinking sufficient water before, during, and after exercise helps ensure peak physical performance and recovery.
\nNow use the following context to answer the user query:
{context}
\nRelevant passages: <extract relevant passages from the context here>
User query: {query}
Answer:"""

    # Update base prompt with context items and query
    base_prompt = base_prompt.format(context=context, query=query)
    
    # Create prompt template for instruction-tuned model
    dialogue_template = [
        {"role": "user",
        "content": base_prompt}
    ]
    print(dialogue_template)
    # Apply the chat template
    prompt = tokenizer.apply_chat_template(conversation=dialogue_template,
                                          tokenize=False,
                                          add_generation_prompt=True)
    return prompt

What a good looking prompt!

We can tokenize this and pass it straight to our LLM.
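
To see what the augmentation step produces without loading a model, here is a stripped-down sketch of the same idea. The context items and the shortened prompt are made up for illustration; only the `"- "` joining logic and the `{context}`/`{query}` placeholders mirror `prompt_formatter`.

```python
# Toy context items shaped like the dictionaries in `pages_and_chunks` (made-up text)
toy_context_items = [
    {"sentence_chunk": "Proteins are made of amino acids."},
    {"sentence_chunk": "Carbohydrates are the body's main fuel."},
]

# Same joining logic as prompt_formatter: one "- " bullet per chunk
context = "- " + "\n- ".join(item["sentence_chunk"] for item in toy_context_items)

# Shortened stand-in for the full base prompt
toy_base_prompt = """Now use the following context to answer the user query:
{context}
User query: {query}
Answer:"""

toy_prompt = toy_base_prompt.format(context=context, query="What are proteins made of?")
print(toy_prompt)
```

Ending the prompt on `Answer:` nudges the model to continue with the answer itself rather than restating the question.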


How about we functionize the generation step to make it easier to use?

We can put a little formatting on the text being returned to make it look nice too.

And we'll make an option to return the context items if needed as well.

In [23]:
def ask(query,
        temperature=0.7,
        max_new_tokens=512,
        format_answer_text=True,
        return_references=True):
    """
    Takes a query, finds relevant resources/context and generates an answer to the query based on the relevant resources.
    """

    # Get just the scores and indices of top related results
    scores, indices = retrieve_relevant_resources(query=query,
                                                  embeddings=embeddings)

    # Create a list of context items
    context_items = [pages_and_chunks[i] for i in indices]

    # Add score to context item
    for i, item in enumerate(context_items):
        item["score"] = scores[i].cpu() # return score back to CPU

    # Format the prompt with context items
    prompt = prompt_formatter(query=query,
                              context_items=context_items)

    # Tokenize the prompt
    input_ids = tokenizer(prompt, return_tensors="pt").to("cuda")

    # Generate an output of tokens
    outputs = llm_model.generate(**input_ids,
                                 temperature=temperature,
                                 do_sample=True,
                                 max_new_tokens=max_new_tokens)

    # Turn the output tokens into text
    output_text = tokenizer.decode(outputs[0])

    if format_answer_text:
        # Replace special tokens and unnecessary help message
        output_text = output_text.replace(prompt, "").replace("<bos>", "").replace("<eos>", "").replace("Sure, here is the answer to the user query:\n\n", "")

    # Return the answer along with the context items if requested
    if return_references:
        return output_text, context_items

    # Otherwise return the answer with None in place of the context items
    return output_text, None


What a good looking function!

The workflow could probably be refined a little, but this should work!
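
One possible refinement: `ask()` writes a `score` key straight into the dictionaries stored in `pages_and_chunks`, so each call overwrites the scores left by the previous one. A sketch of one way around this is to copy each item before attaching its score. `attach_scores` and the toy data are hypothetical, and the scores here are plain floats rather than tensors.

```python
def attach_scores(items: list[dict], scores: list[float]) -> list[dict]:
    """Return copies of the items with a "score" key added (hypothetical helper)."""
    # {**item, ...} builds a new dict, so the originals stay untouched
    return [{**item, "score": score} for item, score in zip(items, scores)]

toy_chunks = [{"sentence_chunk": "chunk A"}, {"sentence_chunk": "chunk B"}]
scored = attach_scores(toy_chunks, [0.1373, 0.1077])

print(scored)
print(toy_chunks)  # the original dictionaries gain no "score" key
```

Whether this matters depends on how the context items are used afterwards. For a single-user notebook the in-place version is fine.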

Let's try it out.

In [24]:
query = random.choice(query_list)
print(f"Query: {query}")

# Answer query with context and return context
answer, context_items = ask(query=query,
                            temperature=0.7,
                            max_new_tokens=512,
                            return_references=True)

print("Answer:\n")
print(answer)
#print_wrapped(answer)
print("Context items:")
context = "- " + "\n- ".join([f"SOURCE:{item['pdf_name']}.PageNumber:{item['page_number']}.Context:{item['sentence_chunk']}" for item in context_items])
print(context)

Query: How do vitamins and minerals differ in their roles and importance for health?
[INFO] Time taken to get scores on 133 embeddings: 0.00008 seconds.
[{'role': 'user', 'content': "Based on the following context items, please answer the query.\nGive yourself room to think by extracting relevant passages from the context before answering the query.\nDon't return the thinking, only return the answer.\nMake sure your answers are as explanatory as possible.\nUse the following examples as reference for the ideal answer style.\n\nExample 1:\nQuery: What are the fat-soluble vitamins?\nAnswer: The fat-soluble vitamins include Vitamin A, Vitamin D, Vitamin E, and Vitamin K. These vitamins are absorbed along with fats in the diet and can be stored in the body's fatty tissue and liver for later use. Vitamin A is important for vision, immune function, and skin health. Vitamin D plays a critical role in calcium absorption and bone health. Vitamin E acts as an antioxidant, protecting cells from da

Local RAG workflow complete!



## Hosting in Gradio

In [25]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
!pip install gradio






In [None]:
import gradio as gr

def question(message, history, return_references):
    """Gradio chat handler: answers the message and optionally appends references."""
    answer, context_items = ask(query=message,
                                temperature=0.9,
                                max_new_tokens=512,
                                return_references=return_references)
    if context_items is None:
        return answer

    # Build a numbered Markdown list of the source passages
    context = "\n".join([
f"""{index+1}. src: http://localhost:3000/pdfs/{item["pdf_name"]}#page={item["page_number"]}.
PageNumber: **{item["page_number"]}**.
> "{item["sentence_chunk"]}"

*****

""" for index, item in enumerate(context_items)])
    return answer + "\n## References:\n" + context  # answer followed by its references


demo = gr.ChatInterface(question,
                        additional_inputs=[
                            gr.Checkbox(label="Return References", interactive=True)
                        ]
                        )
demo.launch(share=True)

Running on local URL:  http://127.0.0.1:7860
IMPORTANT: You are using gradio version 4.26.0, however version 4.44.1 is available, please upgrade.
--------
Running on public URL: https://0cd3b9788049e026bb.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




[INFO] Time taken to get scores on 133 embeddings: 0.00035 seconds.
[{'role': 'user', 'content': "Based on the following context items, please answer the query.\nGive yourself room to think by extracting relevant passages from the context before answering the query.\nDon't return the thinking, only return the answer.\nMake sure your answers are as explanatory as possible.\nUse the following examples as reference for the ideal answer style.\n\nExample 1:\nQuery: What are the fat-soluble vitamins?\nAnswer: The fat-soluble vitamins include Vitamin A, Vitamin D, Vitamin E, and Vitamin K. These vitamins are absorbed along with fats in the diet and can be stored in the body's fatty tissue and liver for later use. Vitamin A is important for vision, immune function, and skin health. Vitamin D plays a critical role in calcium absorption and bone health. Vitamin E acts as an antioxidant, protecting cells from damage. Vitamin K is essential for blood clotting and bone metabolism.\n\nExample 2:\nQ

In [56]:
demo.close()

Closing server running on port: 7860


In [57]:
context = "\n".join([
f"""1. src: http://localhost:3000/pdfs/{item["pdf_name"]}#{item["page_number"]}.

PageNumber: **{item["page_number"]}**.
> "{item["sentence_chunk"]}"

*****

""" for item in context_items])
result = answer + "\n## References:\n" + context
print(result)

The context does not provide any information about the macronutrients, so I cannot answer this question from the provided context.
## References:
1. src: http://localhost:3000/pdfs/10_Object and Class Structuring.pptx#1.

PageNumber: **1**.
> "Object and Class Structuring CT015-3-2 Design Methods TOPIC LEARNING OUTCOMES At the end of this topic, you should be able to: 1. Describe object structuring criteria and categories Object structuring criteria and categories Contents & Structure Recap From Last Lesson What is multiplicity?First attempt at determining the software objects in the system.a class is categorized by the role it plays in the application."

*****


1. src: http://localhost:3000/pdfs/13_Overview of other UML diagrams.pptx#5.

PageNumber: **5**.
> "Reading Slide ‹18› of 20 Other UML Diagrams Slide ‹19› of 20 Summary of Main Teaching Points Question and Answer Session Slide ‹20› of 20 Q & A"

*****


1. src: http://localhost:3000/pdfs/13_Overview of other UML diagrams.pptx#