Setup

In [12]:
%load_ext dotenv
%dotenv
import os
import sys
from pathlib import Path

sys.path.append(str(Path().resolve().parent))


The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


Loading the PDF content into Document objects that LangChain can handle

In [13]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "../data/raw/50.pdf"
# mode="single" returns the whole PDF as one Document instead of one per page
loader = PyPDFLoader(file_path, mode="single")
docs = loader.load()
print(type(docs[0]))

<class 'langchain_core.documents.base.Document'>


Splitting text into smaller chunks

In [14]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # maximum chunk length, in characters
    chunk_overlap=200,  # characters shared between neighbouring chunks
)
texts = text_splitter.split_documents(docs)
len(texts)


171
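
To make the two parameters concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification and not how RecursiveCharacterTextSplitter works internally (that splitter prefers to break on separators such as paragraphs and newlines); the function name `sliding_window_chunks` and the sample string are made up for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so neighbouring windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"  # illustrative input, not from the PDF
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.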

Loading an embedding model based on the environment configuration

In [16]:
from src.util.embeddings import get_embedding_model

# Read the backend selection from the environment, defaulting to "local"
embedding_mode = os.getenv("EMBEDDING_MODE", "local")

embedding_model = get_embedding_model(embedding_mode, model_name="qwen3-embedding:0.6b")
print(embedding_model)

model='qwen3-embedding:0.6b' validate_model_on_init=False base_url=None client_kwargs={} async_client_kwargs={} sync_client_kwargs={} mirostat=None mirostat_eta=None mirostat_tau=None num_ctx=None num_gpu=None keep_alive=None num_thread=None repeat_last_n=None repeat_penalty=None temperature=None stop=None tfs_z=None top_k=None top_p=None


We need a vectorstore to save the embeddings

In [5]:
collection_name = "simple_chunking_first_50"

In [17]:
from src.util.vectorstore import get_vectorstore
vector_store = get_vectorstore(embedding_model, collection_name=collection_name)

The embedding model generates an embedding for each document before it is saved to the vectorstore

In [7]:
from uuid import uuid4

uuids = [str(uuid4()) for _ in range(len(texts))]

document_ids = vector_store.add_documents(documents=texts, ids=uuids)
    
print(f"Saved {len(document_ids)} documents to the vectorstore")

Saved 171 documents to the vectorstore


Closing the connection to the client permanently saves the vectors (embeddings) in the vectorstore. If the Jupyter kernel crashes before the connection is terminated, the embeddings won't be saved, so we close it explicitly

In [11]:
vector_store.client.close()
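
When writing and closing happen in the same cell, one way to make sure `close()` still runs if an exception interrupts the write is `contextlib.closing`. The sketch below uses a hypothetical `FakeClient` stand-in, since the real client class is not shown here; it only assumes the client exposes a `close()` method as in the cell above. This guards against Python exceptions, not against a hard kernel crash.

```python
from contextlib import closing


class FakeClient:
    """Hypothetical stand-in for a vectorstore client that must be closed."""

    def __init__(self):
        self.closed = False
        self.saved = []

    def add(self, item):
        self.saved.append(item)

    def close(self):
        self.closed = True


client = FakeClient()
try:
    # closing() calls client.close() when the block exits, even on error
    with closing(client) as c:
        c.add("chunk-1")
        raise RuntimeError("simulated failure mid-write")
except RuntimeError:
    pass

print(client.closed)
```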

Similarity search embeds the question and returns the documents in the vectorstore whose embeddings are most similar to it
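
As a rough illustration of that ranking step, the sketch below scores a few stored vectors against a query vector with cosine similarity and keeps the best matches. The three-dimensional vectors and their ids are made up, and cosine similarity is an assumption: the metric actually used depends on how the vectorstore is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], stored: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k stored vectors most similar to query_vec."""
    ranked = sorted(
        stored.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Toy embeddings, invented for this example
stored_vectors = {
    "acknowledgments": [0.9, 0.1, 0.0],
    "preface": [0.5, 0.5, 0.0],
    "index": [0.0, 0.1, 0.9],
}
query_vector = [1.0, 0.0, 0.0]

best = top_k(query_vector, stored_vectors, k=2)
print(best)
```

A real vectorstore typically uses an index rather than scoring every stored vector, but the idea of ranking by embedding similarity is the same.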

In [9]:
vector_store = get_vectorstore(embedding_model, collection_name=collection_name)
retrieved_docs = vector_store.similarity_search("Who does the author acknowledge?")
print("\n".join(doc.page_content for doc in retrieved_docs))


variables ofn data points.
Acknowledgments
I would like to thank my wife and daughter for their love and support during the writing of
this book. The writing of a book requires signiﬁcant time, which is taken away from family
members. This book is the result of their patience with me during this time.
IwouldalsoliketothankmymanagerNaguiHalimforprovidingthetremendoussupport
necessary for the writing of this book. His professional support has been instrumental for
my many book eﬀorts in the past and present.
During the writing of this book, I received feedback from many colleagues. In partic-
ular, I received feedback from Kanishka Bhaduri, Alain Biem, Graham Cormode, Hongbo
Deng, Amit Dhurandhar, Bart Goethals, Alexander Hinneburg, Ramakrishnan Kannan,
George Karypis, Dominique LaSalle, Abdullah Mueen, Guojun Qi, Pierangela Samarati,
Saket Sathe, Karthik Subbian, Jiliang Tang, Deepak Turaga, Jilles Vreeken, Jieping Ye,
George Karypis, Dominique LaSalle, Abdullah Mueen, Guojun Qi, Piera

As we can see, the vectorstore returned the chunk of text that contains the answer to the question.
We pass this chunk as additional context, along with the user's question, to an LLM, augmenting its knowledge.
This is the idea behind Retrieval-Augmented Generation (RAG).
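
The augmentation step can be sketched as plain string assembly. The helper `build_rag_prompt` and the prompt wording below are hypothetical, shown only to illustrate how retrieved chunks and the question might be combined before being sent to an LLM; the example chunk is shortened from the retrieved text above.

```python
def build_rag_prompt(question: str, context_chunks: list[str]) -> str:
    """Combine retrieved chunks and the user question into one prompt string."""
    # Number the chunks so the model (and the reader) can tell them apart
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(context_chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_rag_prompt(
    "Who does the author acknowledge?",
    ["I would like to thank my wife and daughter for their love and support."],
)
print(prompt)
```

In the notebook, the chunk list would come from `[doc.page_content for doc in retrieved_docs]`.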

In [10]:
retrieved_docs

[Document(metadata={'producer': 'iLovePDF', 'creator': 'PyPDF', 'creationdate': '', 'moddate': '2026-02-18T10:10:38+00:00', 'source': '../data/raw/50.pdf', 'total_pages': 50, '_id': '3eeea984-998c-4f60-baa7-2e3a4b024053', '_collection_name': 'simple_chunking_first_50'}, page_content='variables ofn data points.\n\x0cAcknowledgments\nI would like to thank my wife and daughter for their love and support during the writing of\nthis book. The writing of a book requires signiﬁcant time, which is taken away from family\nmembers. This book is the result of their patience with me during this time.\nIwouldalsoliketothankmymanagerNaguiHalimforprovidingthetremendoussupport\nnecessary for the writing of this book. His professional support has been instrumental for\nmy many book eﬀorts in the past and present.\nDuring the writing of this book, I received feedback from many colleagues. In partic-\nular, I received feedback from Kanishka Bhaduri, Alain Biem, Graham Cormode, Hongbo\nDeng, Amit Dhurandh