## Requirements and setup

In [None]:
# Perform Google Colab installs (if running in Google Colab)
import os

if "COLAB_GPU" in os.environ:
    print("[INFO] Running in Google Colab, installing requirements.")
    !pip install -U torch # requires torch 2.1.1+ (for efficient sdpa implementation)
    !pip install PyMuPDF # for reading PDFs with Python (provides the `fitz` module; the separate `fitz` package on PyPI is unrelated and can break the import)
    !pip install tqdm # for progress bars
    !pip install sentence-transformers # for embedding models
    !pip install accelerate # for quantization model loading
    !pip install bitsandbytes # for quantizing models (less storage space)
    !pip install flash-attn --no-build-isolation # for faster attention mechanism = faster LLM inference
    !pip install langchain-huggingface langchain-chroma langchain-cohere langchain-community langchain-google-genai # LangChain integrations used below
    !pip install gradio
    

## 1. Document/Text Processing and Embedding Creation

Ingredients:
* PDF document of choice.
* Embedding model of choice.

Steps:
1. Import PDF document.
2. Process text for embedding (e.g. split into chunks of sentences).
3. Embed text chunks with embedding model.
4. Save embeddings to file for later use (they persist on disk, so you only need to compute them once).

### Import PDF Document

The same approach works with many other kinds of documents, such as text files, email chains, support documentation and articles.

However, we'll start with PDFs since many people have them.

There are several libraries to open PDFs with Python but I found that [PyMuPDF](https://github.com/pymupdf/pymupdf) works quite well in many cases.

In [None]:

import os

choice = "pdf"

if choice == "pdf":
    folder_path = 'Hiltipdfs' # replace with the path to your folder
    file_paths = [os.path.join(folder_path, f) for f in os.listdir(folder_path) if f.endswith('.pdf')]

    for file_path in file_paths:
        print(file_path)
elif choice == "slides":
    folder_path = 'Slides' # replace with the path to your folder
    file_paths = [os.path.join(folder_path, f) for f in os.listdir(folder_path) if f.endswith('.pptx')]

    for file_path in file_paths:
        print(file_path)

Hiltipdfs\Hilti Malaysia - Terms and Conditions 2019.pdf
Hiltipdfs\Hilti-Submittal-Package-OSHA-1926.1153.pdf
Hiltipdfs\Hilti_BindingCorporateRules.pdf
Hiltipdfs\Hilti_GB_2020_en_pdf.pdf
Hiltipdfs\Technical-information-ASSET-DOC-LOC-10908813.pdf


PDF acquired!

We can import the pages of our PDF to text by first defining the PDF path and then opening and reading it with PyMuPDF (`import fitz`).

We'll write a small helper function to preprocess the text as it gets read. Note that not all text will be read in the same way, so keep this in mind when you prepare your text.

We'll save each page to a dictionary and then append that dictionary to a list for ease of use later.

In [9]:
# Requires !pip install PyMuPDF, see: https://github.com/pymupdf/pymupdf
import fitz # (pymupdf, found this is better than pypdf for our use case, note: licence is AGPL-3.0, keep that in mind if you want to use any code commercially)
from tqdm.auto import tqdm # for progress bars, requires !pip install tqdm

def text_formatter(text: str) -> str:
    """Performs minor formatting on text."""
    cleaned_text = text.replace("\n", " ").strip() # note: this might be different for each doc (best to experiment)

    # Other potential text formatting functions can go here
    return cleaned_text

# Open PDF and get lines/pages
# Note: this only focuses on text, rather than images/figures etc
def open_and_read_pdf(file_path: str) -> list[dict]:
    """
    Opens a PDF file, reads its text content page by page, and collects statistics.

    Parameters:
        file_path (str): The file path to the PDF document to be opened and read.

    Returns:
        list[dict]: A list of dictionaries, each containing the page number
        (adjusted), character count, word count, sentence count, token count, and the extracted text
        for each page.
    """
    doc = fitz.open(file_path)  # open a document
    pdf_name = os.path.basename(file_path)
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):  # iterate the document pages
        text = page.get_text()  # get plain text encoded as UTF-8
        text = text_formatter(text)
        if not text.strip():  # Skip if text is empty or just whitespace
            continue
        pages_and_texts.append({"page_number": page_number + 1, 
                                "pdf_name": pdf_name,
                                "page_char_count": len(text),
                                "page_word_count": len(text.split(" ")),
                                "page_sentence_count_raw": len(text.split(". ")),
                                "page_token_count": len(text) / 4,  # 1 token = ~4 chars, see: https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
                                "text": text})
    return pages_and_texts


pages_and_texts = []
for file_path in file_paths:
    pages_and_texts.extend(open_and_read_pdf(file_path=file_path))
pages_and_texts[:3]

0it [00:00, ?it/s]

0it [00:00, ?it/s]

0it [00:00, ?it/s]

0it [00:00, ?it/s]

0it [00:00, ?it/s]

[{'page_number': 1,
  'pdf_name': 'Hilti Malaysia - Terms and Conditions 2019.pdf',
  'page_char_count': 2282,
  'page_word_count': 451,
  'page_sentence_count_raw': 11,
  'page_token_count': 570.5,
  'text': 'Hilti Malaysia Sdn. Bhd. (157721-A)  F-5-A | Sime Darby Brunsfield Tower  No. 2 | Jalan PJU 1A/7A  Oasis Square I Oasis Damansara  47301 Petaling Jaya I Selangor I Malaysia  Toll Free 1800 880 985 | F +603 7848 7399 | www.hilti.com.my  HILTI (MALAYSIA) SDN. BHD.  TERMS AND CONDITIONS      1.   GENERAL    1.1   In these conditions the following words have the meanings shown:     "Buyer"   means the person, firm or company purchasing Goods       "Company" means Hilti (Malaysia) Sdn Bhd or one of its associated or subsidiary  companies as the case may be       "Contract"  means the agreement between the Company and the Buyer for the  purchase of Goods from the Company by the Buyer        "Contracts" includes all agreements between the Company and the Buyer for the  purchase of Goods

Now let's get a random sample of the pages.

In [10]:
import random

random.sample(pages_and_texts, k=3)

[{'page_number': 29,
  'pdf_name': 'Hilti_GB_2020_en_pdf.pdf',
  'page_char_count': 1335,
  'page_word_count': 243,
  'page_sentence_count_raw': 7,
  'page_token_count': 333.75,
  'text': 'employees worldwide (2019: 30,006) nationalities  in the global team (2019: 127) nationalities  at headquarters (2019: 63) of team members  worldwide are women (2019: 25%) of team leaders  worldwide are women (2019: 21%) 29,549 63 25.5% 21.5%  After Lindsay Ophus started her career as a process manager at Hilti in  the USA in 2015, it was soon clear she was on the path to a long career  with the company. “My goal is to create added value for the company with my  team.” Hilti’s approach is to match people with roles they enjoy and are suited  for, both laterally and upward. The result is a resilient, high-performing global  team. Team members are encouraged to talk frequently with their leaders  about their development, whether it be for their current role or one in the future.  Lindsay was set to bec

### Get some stats on the text

Let's perform a rough exploratory data analysis (EDA) to get an idea of the size of the texts (e.g. character counts, word counts etc) we're working with.

The different sizes of texts will be a good indicator of how we should split our texts.


In [11]:
import pandas as pd

df = pd.DataFrame(pages_and_texts)
df.head()

| | page_number | pdf_name | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | text |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Hilti Malaysia - Terms and Conditions 2019.pdf | 2282 | 451 | 11 | 570.5 | Hilti Malaysia Sdn. Bhd. (157721-A) F-5-A \| S... |
| 1 | 2 | Hilti Malaysia - Terms and Conditions 2019.pdf | 2776 | 562 | 18 | 694.0 | Hilti Malaysia Sdn. Bhd. (157721-A) F-5-A \| S... |
| 2 | 3 | Hilti Malaysia - Terms and Conditions 2019.pdf | 2728 | 558 | 16 | 682.0 | Hilti Malaysia Sdn. Bhd. (157721-A) F-5-A \| S... |
| 3 | 4 | Hilti Malaysia - Terms and Conditions 2019.pdf | 2867 | 595 | 14 | 716.75 | Hilti Malaysia Sdn. Bhd. (157721-A) F-5-A \| S... |
| 4 | 5 | Hilti Malaysia - Terms and Conditions 2019.pdf | 2938 | 572 | 16 | 734.5 | Hilti Malaysia Sdn. Bhd. (157721-A) F-5-A \| S... |


In [12]:
# Get stats
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count |
|---|---|---|---|---|---|
| count | 81.0 | 81.0 | 81.0 | 81.0 | 81.0 |
| mean | 15.73 | 2374.93 | 414.11 | 12.86 | 593.73 |
| std | 12.5 | 1852.88 | 316.0 | 10.31 | 463.22 |
| min | 1.0 | 19.0 | 3.0 | 1.0 | 4.75 |
| 25% | 5.0 | 982.0 | 166.0 | 6.0 | 245.5 |
| 50% | 12.0 | 1899.0 | 305.0 | 9.0 | 474.75 |
| 75% | 24.0 | 3015.0 | 579.0 | 19.0 | 753.75 |
| max | 44.0 | 7393.0 | 1306.0 | 47.0 | 1848.25 |


### Chunking Text

The ideal way of processing text before embedding it is still an active area of research.

General workflow:

Ingest text -> split it into groups/chunks -> embed the groups/chunks -> store in vector db

Chunking methods include:
1. Character Text Splitting
2. Token Text Splitting
3. Recursive Character Splitting
4. Recursive Token Splitting
5. Semantic Chunking
6. Cluster Semantic Chunking
7. LLM Semantic Chunking/Agentic Chunking

According to [Chroma's evaluation of chunking strategies](https://research.trychroma.com/evaluating-chunking), which compared various methods of text chunking:

* Cluster Semantic Chunking with 200-token chunks achieved the highest precision and IoU (intersection over union).
* LLM Semantic Chunking achieved the highest recall.
* Recursive Character Text Splitting with a chunk size of 200 and no overlap performs consistently well across all metrics and is much simpler.
* Smaller chunk sizes (200-400 tokens) generally performed better than larger sizes (800 tokens).
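For intuition about these metrics, here is a minimal sketch of token-level precision, recall and IoU. It assumes the relevant excerpt and the retrieved chunk are each represented as a set of token positions in the source document. The function name `retrieval_metrics` and the toy positions are illustrative and are not taken from the evaluation's own code.

```python
def retrieval_metrics(relevant: set[int], retrieved: set[int]) -> dict[str, float]:
    """Token-level precision, recall and IoU between two sets of token positions."""
    overlap = len(relevant & retrieved)
    union = len(relevant | retrieved)
    return {
        # Share of retrieved tokens that were actually relevant
        "precision": overlap / len(retrieved) if retrieved else 0.0,
        # Share of relevant tokens that were actually retrieved
        "recall": overlap / len(relevant) if relevant else 0.0,
        # Overlap relative to everything covered by either set
        "iou": overlap / union if union else 0.0,
    }


# Toy example (made up): the answer lives in tokens 100-149,
# while the retrieved chunk covers tokens 120-219.
relevant = set(range(100, 150))
retrieved = set(range(120, 220))
metrics = retrieval_metrics(relevant, retrieved)
print(metrics)
```

Under this definition, a large chunk that contains the answer still scores low on precision and IoU because most of its tokens are irrelevant, which is one way to read the finding that smaller chunks tend to do better.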

The repo https://github.com/brandonstarxel/chunking_evaluation.git provides implementations of most of these chunking methods, which we can import directly.


In [None]:
#!pip install git+https://github.com/brandonstarxel/chunking_evaluation.git

Recursive splitting is more efficient, but the quality of its chunks can vary depending on the document.

In [15]:
from chunking_evaluation import BaseChunker, GeneralEvaluation
from chunking_evaluation.chunking import (
    RecursiveTokenChunker,
)
from chromadb.utils import embedding_functions

embedding_function = embedding_functions.DefaultEmbeddingFunction()
chunker = RecursiveTokenChunker(
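    chunk_size=800,  # max chunk length in characters (~200 tokens at ~4 chars/token)
    chunk_overlap=0,  # no overlap between consecutive chunks
    length_function=len,  # measure length in characters with len()
    separators=["\n\n", "\n", ".", "?", "!", " ", ""]  # separators suggested in the Chroma chunking evaluation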
)

for item in tqdm(pages_and_texts):
    item["sentence_chunks"] = chunker.split_text(item["text"])

100%|██████████| 81/81 [00:00<00:00, 13604.78it/s]
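To make the recursive idea more concrete, here is a minimal pure-Python sketch. It is not the `RecursiveTokenChunker` implementation, just an illustration of the technique under a few assumptions: length is measured in characters, there is no overlap, and a hard cut replaces the empty-string separator as the last resort. The function name `recursive_split` is hypothetical.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split text into pieces of at most chunk_size characters, preferring earlier separators."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to hard cuts every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    if sep not in text:
        # This separator does not occur, so try the next (finer) one
        return recursive_split(text, chunk_size, rest)

    # Split on the separator, keeping it attached to the piece it ends
    parts = text.split(sep)
    pieces = [p + sep for p in parts[:-1]] + [parts[-1]]

    chunks: list[str] = []
    current = ""
    for piece in pieces:
        if len(current) + len(piece) <= chunk_size:
            current += piece  # the piece still fits into the current chunk
            continue
        if current.strip():
            chunks.append(current)
        current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: split it with the finer separators
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current.strip():
        chunks.append(current)
    return chunks


text = "First sentence. Second sentence. Third one is a bit longer than the others. Fourth."
chunks = recursive_split(text, chunk_size=40, separators=["\n\n", "\n", ". ", " "])
for chunk in chunks:
    print(repr(chunk))
```

Sentence boundaries are used where they fit, and only the over-long third sentence is broken at a space. This is also why results can differ between documents: the chunk boundaries depend entirely on where the separators happen to occur.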


ClusterSemanticChunker produces the best results but is slower.

By default, ClusterSemanticChunker first splits the text using the recursive method and then arranges the pieces into chunks using a similarity matrix.

However, I find that the recursive method does not split the text from my PDFs well, so in this case we use the spaCy library to split our text into sentences instead.

In [None]:
from spacy.lang.en import English
from tqdm import tqdm

class SpacySentenceSplitter:
    def __init__(self):
        self.nlp = English()
        self.nlp.add_pipe("sentencizer")

    def split_text(self, text):
        doc = self.nlp(text)
        return [str(sent).strip() for sent in doc.sents if str(sent).strip()]


from chunking_evaluation import BaseChunker, GeneralEvaluation
from chunking_evaluation.chunking import (
    ClusterSemanticChunker,
)
from chromadb.utils import embedding_functions

embedding_function = embedding_functions.DefaultEmbeddingFunction()
chunker = ClusterSemanticChunker(
    embedding_function=embedding_function, 
    max_chunk_size=200, 
)

# Use Spacy as the text splitter
chunker.splitter = SpacySentenceSplitter()

for item in tqdm(pages_and_texts):
    item["sentence_chunks"] = chunker.split_text(item["text"])

  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)
100%|██████████| 81/81 [01:03<00:00,  1.28it/s]


In [17]:
pages_and_texts[30]

{'page_number': 14,
 'pdf_name': 'Hilti_BindingCorporateRules.pdf',
 'page_char_count': 5045,
 'page_word_count': 848,
 'page_sentence_count_raw': 22,
 'page_token_count': 1261.25,
 'text': 'www.hilti.group 14 best efforts to obtain the right to waive this prohibition in  order to communicate as much information as it can and as  soon as possible and be able to demonstrate that it did so.  If despite using its best efforts a Hilti Entity or Hilti HQ is  unable to notify the competent Supervisory Authority, it  undertakes to at least annually provide the Supervisory   Authority with information related to the requests received  by the national security state bodies or authorities and at  least the information listed above. The transfers of Personal Data from a Hilti Entity to national  security state bodies or authorities shall never be done in  an excessive, disproportionate, and indiscriminate manner  which would go beyond what is necessary in a democratic  society. c. Relationship be

In [18]:
# Inspect an example
random.sample(pages_and_texts, k=1)

[{'page_number': 6,
  'pdf_name': 'Hilti Malaysia - Terms and Conditions 2019.pdf',
  'page_char_count': 2929,
  'page_word_count': 555,
  'page_sentence_count_raw': 19,
  'page_token_count': 732.25,
  'text': 'Hilti Malaysia Sdn. Bhd. (157721-A)  F-5-A | Sime Darby Brunsfield Tower  No. 2 | Jalan PJU 1A/7A  Oasis Square I Oasis Damansara  47301 Petaling Jaya I Selangor I Malaysia  Toll Free 1800 880 985 | F +603 7848 7399 | www.hilti.com.my  cleaned and serviced in accordance with the Hilti Operating Instructions provided in  your RED Hilti Box. Only original Hilti consumables, components and spare parts  are used in this premium tool.    9.2   If the tool is not covered by the warranty period, Hilti will issue a quotation for  repairs above Ringgit Malaysia 200.00 and will issue a quotation for repair below  Ringgit Malaysia 200.00 upon request in the event of a request for repairs. Please  note that Hilti Lifetime Service and Repair Cost Limit does not apply to all tools,  please ca

In [19]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count |
|---|---|---|---|---|---|
| count | 81.0 | 81.0 | 81.0 | 81.0 | 81.0 |
| mean | 15.73 | 2374.93 | 414.11 | 12.86 | 593.73 |
| std | 12.5 | 1852.88 | 316.0 | 10.31 | 463.22 |
| min | 1.0 | 19.0 | 3.0 | 1.0 | 4.75 |
| 25% | 5.0 | 982.0 | 166.0 | 6.0 | 245.5 |
| 50% | 12.0 | 1899.0 | 305.0 | 9.0 | 474.75 |
| 75% | 24.0 | 3015.0 | 579.0 | 19.0 | 753.75 |
| max | 44.0 | 7393.0 | 1306.0 | 47.0 | 1848.25 |


In [24]:
# Sample an example from the group
random.sample(pages_and_texts, k=1)

[{'page_number': 16,
  'pdf_name': 'Hilti_GB_2020_en_pdf.pdf',
  'page_char_count': 273,
  'page_word_count': 43,
  'page_sentence_count_raw': 2,
  'page_token_count': 68.25,
  'text': 'Product and Service Differentiation Direct   Customer Relationship The core of our  corporate strategy:  direct access to  and partnership  with our customers  in the construction  industry. Operational Excellence High-Performing Global Team 2020 Hilti Company Report 26–27',
  'sentence_chunks': ['Product and Service Differentiation Direct   Customer Relationship The core of our  corporate strategy:  direct access to  and partnership  with our customers  in the construction  industry. Operational Excellence High-Performing Global Team 2020 Hilti Company Report 26–27']}]

In [23]:
# Create a DataFrame to get stats
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count |
|---|---|---|---|---|---|
| count | 81.0 | 81.0 | 81.0 | 81.0 | 81.0 |
| mean | 15.73 | 2374.93 | 414.11 | 12.86 | 593.73 |
| std | 12.5 | 1852.88 | 316.0 | 10.31 | 463.22 |
| min | 1.0 | 19.0 | 3.0 | 1.0 | 4.75 |
| 25% | 5.0 | 982.0 | 166.0 | 6.0 | 245.5 |
| 50% | 12.0 | 1899.0 | 305.0 | 9.0 | 474.75 |
| 75% | 24.0 | 3015.0 | 579.0 | 19.0 | 753.75 |
| max | 44.0 | 7393.0 | 1306.0 | 47.0 | 1848.25 |


### Splitting each chunk into its own item

We'd like to embed each chunk of sentences into its own numerical representation.

So to keep things clean, let's create a new list of dictionaries, each containing a single chunk of sentences along with relevant information such as the page number, as well as statistics about the chunk.

In [25]:
import re

# Split each chunk into its own item
pages_and_chunks = []
for item in tqdm(pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]
        chunk_dict["pdf_name"] = item["pdf_name"]

        # Each chunk is already a single string, so just collapse double spaces and trim whitespace
        joined_sentence_chunk = sentence_chunk.replace("  ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk) # ".A" -> ". A" for any full-stop/capital letter combo
        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get stats about the chunk
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len([word for word in joined_sentence_chunk.split(" ")])
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4 # 1 token = ~4 characters

        pages_and_chunks.append(chunk_dict)

# How many chunks do we have?
len(pages_and_chunks)

100%|██████████| 81/81 [00:00<00:00, 9862.65it/s]


439

In [26]:
# View a random sample
random.sample(pages_and_chunks, k=1)

[{'page_number': 10,
  'pdf_name': 'Hilti_BindingCorporateRules.pdf',
  'sentence_chunk': '8.',
  'chunk_char_count': 2,
  'chunk_word_count': 1,
  'chunk_token_count': 0.5}]

Now we've broken our PDFs into chunks of text, each with metadata recording the page number and the name of the PDF it came from.

This means we could reference a chunk of text and know its source.

Let's get some stats about our chunks.

In [27]:
# Get stats about our chunks
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)

| | page_number | chunk_char_count | chunk_word_count | chunk_token_count |
|---|---|---|---|---|
| count | 439.0 | 439.0 | 439.0 | 439.0 |
| mean | 15.35 | 430.41 | 69.47 | 107.6 |
| std | 12.14 | 479.4 | 76.67 | 119.85 |
| min | 1.0 | 2.0 | 1.0 | 0.5 |
| 25% | 6.0 | 120.5 | 19.0 | 30.12 |
| 50% | 11.0 | 307.0 | 49.0 | 76.75 |
| 75% | 21.5 | 558.0 | 90.5 | 139.5 |
| max | 44.0 | 4348.0 | 696.0 | 1087.0 |


Looks like some of our chunks have quite a low token count.

How about we check for samples with 30 tokens or fewer (about the length of a sentence) and see if they are worth keeping?

In [28]:
# Show random chunks with 30 tokens or fewer
min_token_length = 30
for row in df[df["chunk_token_count"] <= min_token_length].sample(5).iterrows():
    print(f'Chunk token count: {row[1]["chunk_token_count"]} | Text: {row[1]["sentence_chunk"]}')

Chunk token count: 23.0 | Text: How can we make our customers’ work on the construction site even safer and more productive?
Chunk token count: 19.25 | Text: 3. Choose the correct collector based on the system and insert being used. 4.
Chunk token count: 5.5 | Text: www.hilti.group 18 13.
Chunk token count: 7.5 | Text: What do you attribute this to?
Chunk token count: 18.5 | Text: In 2017, he took over responsibility for the entire North American region.


Looks like many of these are headers and footers of different pages.

They don't seem to offer too much information.

Let's filter our DataFrame/list of dictionaries to only include chunks with over 30 tokens in length.

In [29]:
pages_and_chunks_over_min_token_len = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")
pages_and_chunks_over_min_token_len[:2]

[{'page_number': 1,
  'pdf_name': 'Hilti Malaysia - Terms and Conditions 2019.pdf',
  'sentence_chunk': 'Bhd. ( 157721-A) F-5-A | Sime Darby Brunsfield Tower No. 2 | Jalan PJU 1A/7A Oasis Square I Oasis Damansara 47301 Petaling Jaya I Selangor I Malaysia Toll Free 1800 880 985 | F +603 7848 7399 | www.hilti.com.my HILTI (MALAYSIA) SDN. BHD.',
  'chunk_char_count': 237,
  'chunk_word_count': 43,
  'chunk_token_count': 59.25},
 {'page_number': 1,
  'pdf_name': 'Hilti Malaysia - Terms and Conditions 2019.pdf',
  'sentence_chunk': 'TERMS AND CONDITIONS   1. GENERAL  1.1  In these conditions the following words have the meanings shown:   "Buyer"  means the person, firm or company purchasing Goods    "Company" means Hilti (Malaysia) Sdn Bhd or one of its associated or subsidiary companies as the case may be    "Contract" means the agreement between the Company and the Buyer for the purchase of Goods from the Company by the Buyer    "Contracts" includes all agreements between the Company and 

Smaller chunks filtered!

Time to embed our chunks of text!

### Embedding our text chunks

Rather than mapping words/tokens/characters directly to numbers (e.g. `{"a": 0, "b": 1, "c": 2...}`), the numerical representation of tokens is learned by going through large corpora of text and figuring out how different tokens relate to each other.

Ideally, texts with similar meanings will end up with similar numerical representations.
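One common way to measure how "similar" two embeddings are is cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration (real embedding models output vectors with hundreds of dimensions), and the variable names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: close to 1 means same direction, 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # avoid dividing by zero for all-zero vectors
    return dot / (norm_a * norm_b)


# Made-up "embeddings": two texts about warranties and one about privacy
warranty = [0.9, 0.1, 0.0]
guarantee = [0.8, 0.2, 0.1]
privacy = [0.0, 0.2, 0.9]

print("warranty vs guarantee:", round(cosine_similarity(warranty, guarantee), 3))
print("warranty vs privacy:  ", round(cosine_similarity(warranty, privacy), 3))
```

The two warranty-like vectors point in nearly the same direction and score close to 1, while the privacy vector scores close to 0. A vector database runs this kind of comparison between the query embedding and every stored chunk embedding.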

Our goal is to turn each of our chunks into a numerical representation (an embedding vector, where a vector is a sequence of numbers arranged in order).

Specifically, we'll get the `all-mpnet-base-v2` model (you can see the model's intended use on the [Hugging Face model card](https://huggingface.co/sentence-transformers/all-mpnet-base-v2#intended-uses)).

We'll then store the embeddings in a Chroma vector database.

In [36]:
# Requires !pip install langchain-huggingface langchain-chroma
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2",
    model_kwargs={'device': 'cuda'}  # Tell the model to use the CUDA device
)

from langchain_chroma import Chroma
persist_directory = "./chroma_langchain_db"
vector_store = Chroma(
    collection_name="collection",
    embedding_function=embeddings,
    persist_directory=persist_directory,  # Where to save data locally, remove if not necessary
)

In [39]:
# 1. Extract documents and metadatas
documents = [item["sentence_chunk"] for item in pages_and_chunks_over_min_token_len]
metadatas = [{"page_number": item["page_number"], "pdf_name": item["pdf_name"]} for item in pages_and_chunks_over_min_token_len]

# 2. Add documents to the vector store
vector_store.add_texts(texts=documents, metadatas=metadatas)

['0d31cc9c-128e-4e2f-b90a-d8dc3ebc08b4',
 '8a79bc74-50c0-4ee5-a07a-773d3429986b',
 'd8049f10-c42f-42cc-b7a4-cfdfec78e9db',
 '6649c612-2093-4888-bf38-dfc53fb5d8e0',
 '55197b2f-5dcb-43ef-aa39-a98a536470c6',
 '76282f2e-7d7f-4477-b307-f9e875dea131',
 'a8ea7f0c-e23c-45bb-815b-cd19fd32cb59',
 '35d572fe-deeb-4b6d-9a7a-0ed85e81bc35',
 'bd795779-6eea-4124-8003-037185611189',
 'd28bf841-58b4-45c9-a1d4-60f61ed62bf4',
 '94d65011-7752-4f9c-90a1-45eab042944b',
 '3162ed2e-7b77-4c66-9a65-d8e2d38cafaa',
 '0369ddec-e7a1-4a97-a8d6-6ff9551cb814',
 'f1f6419a-6217-4834-b70b-d34e4949a81b',
 'bbb89de1-d0d8-455e-8274-97151ca6f7ef',
 '12fbfd5c-067b-4863-9bf4-b7fde5b6e3ba',
 '29a1590e-8656-4ec2-afe4-2d2ad8dc2226',
 '94c016d6-8132-48a4-a028-459b52fb88d6',
 '0006f889-83b8-415a-900f-e8cec0585a35',
 '8678a4f2-a266-4e5f-a8fe-6de06f1b5989',
 '35ff2857-4247-418e-98f4-b22ec109922b',
 '54d91f22-155f-428b-b8d5-bd396cdec96e',
 '746ffd33-03ac-4f74-b127-867b7025f404',
 'c3aa16ad-d975-4ccf-a304-60461ccdc505',
 'baeb915f-7cf7-

## 2. RAG - Search and Answer

RAG stands for Retrieval Augmented Generation, which is another way of saying "given a query, search for relevant resources and answer based on those resources".
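As a rough picture of that flow, here is a toy sketch with no models involved. A crude word-overlap score stands in for embedding similarity, and the passages, query and function names are all made up for illustration.

```python
def score(query: str, passage: str) -> float:
    """Fraction of query words that also appear in the passage (stand-in for embedding similarity)."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / len(query_words) if query_words else 0.0


def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Retrieval: return the k passages that score highest for the query."""
    return sorted(passages, key=lambda p: score(query, p), reverse=True)[:k]


def build_prompt(query: str, context_passages: list[str]) -> str:
    """Augmentation: put the retrieved passages into the prompt as context."""
    context = "\n".join(f"- {p}" for p in context_passages)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


passages = [
    "The warranty covers repairs for two years.",
    "The office is closed on public holidays.",
    "Repairs outside the warranty need a quotation.",
]
query = "what does the warranty cover"
top_passages = retrieve(query, passages, k=2)
prompt = build_prompt(query, top_passages)
print(prompt)  # Generation: this prompt is what would be sent to the LLM
```

The rest of this section follows the same three steps, with a vector store for retrieval, a reranker to tighten the shortlist, and an LLM for the final answer.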

### Load the Vector DB

In [1]:
# Requires !pip install langchain-huggingface langchain-chroma
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2",
    model_kwargs={'device': 'cuda'}  # Tell the model to use the CUDA device
)

from langchain_chroma import Chroma
persist_directory = "./chroma_langchain_db"
vector_store = Chroma(
    collection_name="collection",
    embedding_function=embeddings,
    persist_directory=persist_directory,  # Where to save data locally, remove if not necessary
)


Test the Vector DB out!

In [2]:
query = "What does a Chief Privacy Officer do in Hilti"
retriever = vector_store.as_retriever(search_kwargs={"k":20})
docs = retriever.invoke(query)
print(docs)

[Document(id='740e9894-1fd6-4696-ae50-b1d92eb7c88c', metadata={'pdf_name': 'Hilti_BindingCorporateRules.pdf', 'page_number': 16}, page_content='Chief Privacy Officer (“CPO”): CPO means the Hilti Chief Privacy Officer. The CPO is respon- sible for reviewing and monitoring Hilti’s data protection  compliance and reporting to the highest level of management. Controller: Controller means the natural or legal person, public  authority, agency, or other body which, alone or jointly with others, determines the purposes and means of the  processing of Personal Data. Such Controller can be Hilti HQ, a Hilti Entity, or a Third Party.'), Document(id='f6995850-53df-4d39-beae-6876d75a1189', metadata={'page_number': 3, 'pdf_name': 'Hilti_BindingCorporateRules.pdf'}, page_content='b.\t Roles and responsibilities Chief Privacy Officer: Hilti has appointed a Chief Privacy Officer (“CPO”) responsible for monitoring compliance with data protection laws and regulations and easily accessible from each Hilt

### Reranking
Now we are able to retrieve chunks of text that are relevant to our query. However, we are not going to pass all of the retrieved chunks to the LLM. Instead, we rerank them and keep only the 3 to 5 chunks that are most relevant to the query.

In this case, we use the "rerank-v3.5" reranking model from Cohere with their demo API key stored in a .env file (the key needs to be loaded into the environment, e.g. as `COHERE_API_KEY`, before running the cell below).
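The two-stage pattern itself can be sketched without any external service. In the sketch below, `keyword_hits` stands in for the fast vector search and `phrase_match` stands in for the slower but more precise reranker. Both scorers, the candidate texts and the function names are made up for illustration and say nothing about how Cohere's model works internally.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    candidates: list[str],
    first_stage: Scorer,
    second_stage: Scorer,
    k: int = 20,
    top_n: int = 5,
) -> list[str]:
    """Shortlist k candidates with a cheap scorer, then reorder them with a better one."""
    shortlist = sorted(candidates, key=lambda c: first_stage(query, c), reverse=True)[:k]
    reranked = sorted(shortlist, key=lambda c: second_stage(query, c), reverse=True)
    return reranked[:top_n]


def keyword_hits(query: str, text: str) -> float:
    """Cheap first stage: count how often the query words occur in the text."""
    words = text.lower().split()
    return float(sum(words.count(w) for w in query.lower().split()))


def phrase_match(query: str, text: str) -> float:
    """Stricter second stage: reward texts that contain the whole query phrase."""
    return 1.0 if query.lower() in text.lower() else 0.0


candidates = [
    "privacy privacy officer officer notice",
    "the chief privacy officer monitors compliance",
    "tool repair costs",
    "chief executive officer report",
]
result = retrieve_then_rerank(
    "chief privacy officer", candidates, keyword_hits, phrase_match, k=3, top_n=2
)
print(result)
```

The first stage ranks the keyword-stuffed text highest, and the second stage moves the text that actually contains the phrase to the top. This mirrors the comparison between the original and reranked order later in this section.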

In [3]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
from langchain_community.llms import Cohere

retriever = vector_store.as_retriever(search_kwargs={"k":20})

llm = Cohere(temperature=0)
compressor = CohereRerank(model="rerank-v3.5", top_n=5)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)
reranked_docs = compression_retriever.invoke(
    "What does a Chief Privacy Officer do in Hilti"
)

print(reranked_docs)

  llm = Cohere(temperature=0)


[Document(metadata={'pdf_name': 'Hilti_BindingCorporateRules.pdf', 'page_number': 3, 'relevance_score': 0.9527197}, page_content='b.\t Roles and responsibilities Chief Privacy Officer: Hilti has appointed a Chief Privacy Officer (“CPO”) responsible for monitoring compliance with data protection laws and regulations and easily accessible from each Hilti Entity. CPO is the designated Data Protection Officer for Hilti, and its contact details are communicated to the Liechtenstein Data Protection Authority as per GDPR Article 37. The CPO has been appointed based on his/her expert knowledge in the field of data protection and data privacy, as well as overall years of experience which attest their professional qualities required to fulfill the tasks of a CPO. The required skills are: •\t Expertise and in-depth understanding of national, EEA, global data protection and privacy laws, regulations and practices •\t Advanced knowledge of the business sector and Hilti internal organization EXECUTI

In [4]:
# Define helper function to print wrapped text
import textwrap

def print_wrapped(text, wrap_length=80):
    wrapped_text = textwrap.fill(text, wrap_length)
    print(wrapped_text)

In [5]:
i = 0
for doc in reranked_docs:
    print(f"----- Document {i} -----")
    
    # Original Document Information
    print(f"Original {i}:")
    print(f"  PDF Name: {docs[i].metadata.get('pdf_name', 'N/A')}")
    print(f"  Page Number: {docs[i].metadata.get('page_number', 'N/A')}")
    print("  Content:")
    print_wrapped(docs[i].page_content)
    print("-" * 80)

    # Reranked Document Information
    print(f"ReRanked {i}:")
    print(f"  PDF Name: {doc.metadata.get('pdf_name', 'N/A')}")
    print(f"  Page Number: {doc.metadata.get('page_number', 'N/A')}")
    print("  Content:")
    print_wrapped(doc.page_content)
    print(f"  Relevance Score: {doc.metadata.get('relevance_score', 'N/A')}")
    print("=" * 80)

    i += 1

----- Document 0 -----
Original 0:
  PDF Name: Hilti_BindingCorporateRules.pdf
  Page Number: 16
  Content:
Chief Privacy Officer (“CPO”): CPO means the Hilti Chief Privacy Officer. The
CPO is respon- sible for reviewing and monitoring Hilti’s data protection
compliance and reporting to the highest level of management. Controller:
Controller means the natural or legal person, public  authority, agency, or
other body which, alone or jointly with others, determines the purposes and
means of the  processing of Personal Data. Such Controller can be Hilti HQ, a
Hilti Entity, or a Third Party.
--------------------------------------------------------------------------------
ReRanked 0:
  PDF Name: Hilti_BindingCorporateRules.pdf
  Page Number: 3
  Content:
b.       Roles and responsibilities Chief Privacy Officer: Hilti has appointed a
Chief Privacy Officer (“CPO”) responsible for monitoring compliance with data
protection laws and regulations and easily accessible from each Hilti Entity.
CPO

### Using an LLM

In [6]:
# Requires !pip install langchain_google_genai
from langchain.chains.question_answering import load_qa_chain
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash", convert_system_message_to_human=True)
chain = load_qa_chain(llm, chain_type="stuff")

stuff: https://python.langchain.com/docs/versions/migrating_chains/stuff_docs_chain
map_reduce: https://python.langchain.com/docs/versions/migrating_chains/map_reduce_chain
refine: https://python.langchain.com/docs/versions/migrating_chains/refine_chain
map_rerank: https://python.langchain.com/docs/versions/migrating_chains/map_rerank_docs_chain

See also guides on retrieval and question-answering here: https://python.langchain.com/docs/how_to/#qa-with-rag
  chain = load_qa_chain(llm, chain_type="stuff")


In [7]:
question = "What does a Chief Privacy Officer do in Hilti"

reranked_docs = compression_retriever.invoke(question)
answer = chain.run(input_documents=reranked_docs, question=question)
print_wrapped(answer)

  answer = chain.run(input_documents=reranked_docs, question=question)


In Hilti, the Chief Privacy Officer (CPO) is responsible for reviewing and
monitoring Hilti’s data protection compliance and reporting to the highest level
of management. They monitor compliance with data protection regulations and
internal policies, provide advice regarding Data Protection Impact Assessments,
cooperate with Supervisory Authorities, act as a contact point for Supervisory
Authorities, provide guidance by evaluating risks associated with data
processing activities, monitor and annually report on compliance at a global
level, ensure the integration of data protection in overall compliance
management, supervise the Global Team, inform the Executive Board of any
concerns, and decide on or request audits. The CPO also regularly informs and
advises the Executive Board.


### Augmenting our prompt with context items

What we'd like to do with augmentation is take the results from our search for relevant resources and put them into the prompt that we pass to our LLM.

> **Note:** The process of augmenting or changing a prompt to an LLM is known as prompt engineering. And the best way to do it is an active area of research. For a comprehensive guide on different prompt engineering techniques, I'd recommend the Prompt Engineering Guide ([promptingguide.ai](https://www.promptingguide.ai/)), [Brex's Prompt Engineering Guide](https://github.com/brexhq/prompt-engineering) and the paper [Prompt Design and Engineering: Introduction and Advanced Models](https://arxiv.org/abs/2401.14423).

In [28]:
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.prompts import PromptTemplate
from langchain.schema import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough

def query_llm_with_references(message: str, return_references: bool = False):
    """
    Queries the LLM with a message and optionally returns the source documents.

    Args:
        message: The query message.
        return_references: Whether to append the source documents to the answer.

    Returns:
        If return_references is False: The answer from the LLM (str).
        If return_references is True: The answer followed by a Markdown-formatted list of the source documents (str).
    """
    reranked_docs = compression_retriever.invoke(message)
    llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash", convert_system_message_to_human=True)

    # Prompt to instruct the LLM to cite sources
    prompt_template = """Answer the question in detail based on the context provided.
    Cite the references you used to support your answer using the format (Reference [index]),
    where [index] corresponds to the order of the reference in the provided context (starting from 1).
    If the context does not provide the answer, just say "The knowledge base does not have answers to the following question".

    Context:
    {context}

    Question: {question}"""
    prompt = PromptTemplate.from_template(prompt_template)

    # Format the context with indices
    indexed_context = "\n\n".join(
        f"--- Reference {i+1} ---\n{doc.page_content}" for i, doc in enumerate(reranked_docs)
    )

    # Create the chain: the prompt is filled from a dict with "context" and "question" keys
    chain = prompt | llm | StrOutputParser()

    # Prepare the input for the prompt
    prompt_input = {
        "context": indexed_context,
        "question": message
    }

    # Print the generated prompt for debugging
    print("\n--- Generated Prompt for LLM ---")
    print(prompt.format(**prompt_input))
    print("--- End of Prompt ---")

    # Pass both the context and the question to the chain
    answer = chain.invoke(prompt_input)

    if return_references:
        context = "\n\n## References:\n" + "\n".join([f"""{index+1}. src: http://localhost:3000/pdfs/{doc.metadata.get('pdf_name', 'N/A')}#page={doc.metadata.get('page_number','N/A')}.
Document: **{doc.metadata.get('pdf_name','N/A')}**.
PageNumber: **{doc.metadata.get('page_number','N/A')}**.
> "{doc.page_content}"

*****

"""     for index, doc in enumerate(reranked_docs)])
        return answer + context
    else:
        return answer
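
The reference links in the function above fall back to `N/A` when metadata is missing, and they assume the PDF file name is already URL-safe. A possible refinement is sketched below, assuming the same `pdf_name` and `page_number` metadata keys and the same local base URL. The helper name `reference_link` and the second file name are hypothetical.

```python
from urllib.parse import quote


def reference_link(metadata: dict, base_url: str = "http://localhost:3000/pdfs/") -> str:
    """Build a link to the cited PDF page from a chunk's metadata."""
    pdf_name = metadata.get("pdf_name")
    if pdf_name is None:
        return "N/A"  # no file name means there is nothing to link to
    link = base_url + quote(pdf_name)  # percent-encode spaces and other unsafe characters
    page_number = metadata.get("page_number")
    if page_number is not None:
        link += f"#page={page_number}"  # only add the page anchor when it is known
    return link


print(reference_link({"pdf_name": "Hilti_BindingCorporateRules.pdf", "page_number": 3}))
# http://localhost:3000/pdfs/Hilti_BindingCorporateRules.pdf#page=3

# Hypothetical file name containing spaces
print(reference_link({"pdf_name": "Annual Report 2023.pdf", "page_number": 12}))
# http://localhost:3000/pdfs/Annual%20Report%202023.pdf#page=12
```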

In [15]:
# Test Usage:
question = "What does a Chief Privacy Officer do in Hilti?"

# Get the answer with references
answer_with_references = query_llm_with_references(question, return_references=True)
print("\nAnswer:")
print_wrapped(answer_with_references)



--- Generated Prompt for LLM ---
Answer the question based on the context provided.
    Cite the references you used to support your answer using the format (Reference [index]),
    where [index] corresponds to the order of the document in the provided context (starting from 1).
    If the context does not provide the answer, just say "The knowledge base does not have answers to the following question".

    Context:
    --- Document 1 ---
b.	 Roles and responsibilities Chief Privacy Officer: Hilti has appointed a Chief Privacy Officer (“CPO”) responsible for monitoring compliance with data protection laws and regulations and easily accessible from each Hilti Entity. CPO is the designated Data Protection Officer for Hilti, and its contact details are communicated to the Liechtenstein Data Protection Authority as per GDPR Article 37. The CPO has been appointed based on his/her expert knowledge in the field of data protection and data privacy, as well as overall years of experience whic




Answer:
The CPO is responsible for reviewing and monitoring Hilti’s data protection
compliance and reporting to the highest level of management (Reference [2]). The
CPO also monitors compliance with data protection regulations and internal
policies, assigns responsibilities, raises awareness, ensures staff training,
advises on Data Protection Impact Assessments, cooperates with Supervisory
Authorities, acts as a contact point for Supervisory Authorities, provides
guidance by evaluating risks, monitors and annually reports on compliance,
ensures data protection integration, supervises the Global Team, informs the
Executive Board, and decides on or requests audits (Reference [3]).  ##
References: 1. src:
http://localhost:3000/pdfs/Hilti_BindingCorporateRules.pdf#page=3. PageNumber:
**3**. > "b.    Roles and responsibilities Chief Privacy Officer: Hilti has
appointed a Chief Privacy Officer (“CPO”) responsible for monitoring compliance
with data protection laws and regulations and easily

RAG workflow complete!

We have successfully augmented the prompt with relevant chunks of text, which are passed to the LLM as context.
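
Since the prompt asks for citations in the form `(Reference [index])`, one optional sanity check is to confirm that every cited index actually exists in the context. The sketch below is an assumption-laden illustration rather than part of the pipeline: the function names are hypothetical, and it should be run on the LLM answer before the references list is appended.

```python
import re


def extract_cited_indices(answer: str) -> list[int]:
    """Return the sorted, de-duplicated reference numbers cited in an answer."""
    # Matches "Reference [2]" as well as "Reference 2", including across a line wrap
    matches = re.findall(r"Reference\s+\[?(\d+)\]?", answer)
    return sorted({int(m) for m in matches})


def find_invalid_citations(answer: str, num_references: int) -> list[int]:
    """Return cited indices that fall outside 1..num_references."""
    return [i for i in extract_cited_indices(answer) if not 1 <= i <= num_references]


sample_answer = (
    "The CPO reports to the highest level of management (Reference [2]). "
    "The CPO also decides on or requests audits (Reference [3])."
)
print(extract_cited_indices(sample_answer))      # [2, 3]
print(find_invalid_citations(sample_answer, 2))  # [3]
```

A non-empty result from `find_invalid_citations` suggests the model cited a reference that was never supplied, which is worth logging or flagging to the user.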

## 3. Hosting in Gradio

In [19]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')


'en_US.UTF-8'

In [30]:
import gradio as gr

def answer_question(message, history, return_references):
    """Gradio chat handler; `history` is required by ChatInterface but unused here."""
    print(f"Message received: {message}")
    if message is None:
        print("Warning: Received NoneType for message.")
        return "Sorry, I didn't receive your message."

    answer = query_llm_with_references(
        message=message,
        return_references=return_references
    )
    return answer

demo = gr.ChatInterface(
    answer_question,
    additional_inputs=[
        gr.Checkbox(label="Return References", interactive=True)
    ]
)
demo.launch(share=True)



* Running on local URL:  http://127.0.0.1:7860
* Running on public URL: https://24b8739f0b37683091.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




Message received: What is a CRM system

--- Generated Prompt for LLM ---
Answer the question in detail based on the context provided.
    Cite the references you used to support your answer using the format (Reference [index]),
    where [index] corresponds to the order of the reference in the provided context (starting from 1).
    If the context does not provide the answer, just say "The knowledge base does not have answers to the following question".

    Context:
    --- Reference 1 ---
When it comes to development, we continue to focus on integrated solutions that increasingly combine hardware, soft- ware and services which support our customers in their daily applications and core processes. Additionally, these solutions almost always contain digital elements where data is used as a basis for decision-making, optimi- zation, documentation or continuous learning. A central theme of our digital transformation is the use of available customer data to create a personal- ized, relevan



In [29]:
demo.close()

Closing server running on port: 7860
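
`gr.ChatInterface` calls its function with the message, the chat history and then one value per entry in `additional_inputs`, so the handler logic can be exercised without launching the UI. The sketch below assumes a stub in place of `query_llm_with_references`. The names `make_chat_handler` and `fake_answer` are hypothetical and exist only for this example.

```python
from typing import Callable


def make_chat_handler(answer_fn: Callable[[str, bool], str]):
    """Wrap an answer function in the (message, history, *extras) shape a chat UI expects."""

    def handler(message, history, return_references=False):
        # Guard against missing or whitespace-only input before calling the LLM
        if message is None or not message.strip():
            return "Sorry, I didn't receive your message."
        return answer_fn(message, return_references)

    return handler


def fake_answer(message: str, return_references: bool) -> str:
    """Stub standing in for query_llm_with_references; makes no network calls."""
    suffix = " [with references]" if return_references else ""
    return f"echo: {message}{suffix}"


handler = make_chat_handler(fake_answer)
print(handler("What is a CRM system", [], True))  # echo: What is a CRM system [with references]
print(handler("   ", [], False))                  # Sorry, I didn't receive your message.
```

Injecting the answer function keeps the handler testable offline. In the app, the real query function would be passed in instead of the stub.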
