Setup

In [None]:
%load_ext dotenv
%dotenv
import os
import sys
from pathlib import Path

sys.path.append(str(Path().resolve().parent))


Loading the PDF content into Document objects that LangChain can handle

In [None]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "../data/raw/Textbook.pdf"
# mode="single" loads the whole PDF as one Document instead of one per page
loader = PyPDFLoader(file_path, mode="single")
docs = loader.load()
print(type(docs[0]))

<class 'langchain_core.documents.base.Document'>


Splitting text into smaller chunks

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=4000,
    chunk_overlap=500,
)
texts = text_splitter.split_documents(docs)
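
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification, not the algorithm `RecursiveCharacterTextSplitter` itself uses (that splitter also tries to break on separators such as paragraphs and newlines), and `sliding_window_chunks` is a hypothetical helper introduced only for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters sharing chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one neighbour when the overlap is large enough.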




In [None]:
print(f"Total chunks created: {len(texts)}")
if texts:
    print("\nSample Metadata:")
    print(texts[0].metadata)
    print("\nSample chunk content:")
    print(texts[0].page_content)

Loading an embedding model based on env setup

In [None]:
from src.util.env_check import get_embed_model
# embedding_model = os.getenv("EMBEDDING_MODE","local")

embedding_model = get_embed_model()
print(embedding_model.embed_documents(["Hello world"]))


[[-0.0021383099, 0.0075008804, -0.011650108, -0.07023448, 0.0045393645, -0.01298006, -0.013741996, -0.0117532825, -0.10962647, -0.017243136, -0.0052222074, -0.031411108, 0.0401904, -0.011463264, -0.04773422, 0.09640879, -0.0013066429, 0.09508943, 0.09561688, -0.052990716, -0.0025705767, 0.03481676, -0.021451501, 0.13576753, -0.039540734, -0.03868507, -0.063845925, 0.126914, 0.019091083, -0.021086011, 0.006074588, 0.04749404, -0.021219999, -0.016755642, -0.035438623, -0.0154389525, 0.027086308, -0.010963458, -0.029936722, 0.04537243, 0.01965291, -0.007755763, 0.05530257, -0.01722083, 0.028688328, 0.013500553, 0.017399846, -0.016382286, 0.022164488, 0.011022739, -0.03313258, -0.00720774, -0.002563534, -0.005926767, 0.01760186, -0.042084303, 0.018804865, -0.027303668, 0.027514797, -0.012525264, -0.06433825, 0.04809149, -0.069110096, -0.012301332, 0.014198058, 0.04731715, 0.0057678707, -0.035050858, -0.08356759, -0.022858167, -0.011641543, -0.04040388, -0.060886834, 0.011948142, -0.0151253
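
The implementation of `get_embed_model` is not shown in this notebook. As a rough sketch of how a mode could be selected from the environment, the hypothetical helper below reads `EMBEDDING_MODE` (the variable named in the commented-out line above) and validates it. The mode names `local` and `remote` are assumptions, not values confirmed by the project.

```python
import os
from collections.abc import Mapping

# Hypothetical mode names; the values accepted by get_embed_model may differ.
SUPPORTED_MODES = {"local", "remote"}


def resolve_embedding_mode(env: Mapping[str, str] = os.environ, default: str = "local") -> str:
    """Read EMBEDDING_MODE from the given mapping and validate it."""
    mode = env.get("EMBEDDING_MODE", default).strip().lower()
    if mode not in SUPPORTED_MODES:
        raise ValueError(f"Unsupported EMBEDDING_MODE: {mode!r}")
    return mode


# A plain dict stands in for the environment so the example has no side effects.
print(resolve_embedding_mode({"EMBEDDING_MODE": "Local"}))
```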

We need a vectorstore to save the embeddings

In [10]:
collection_name = "simple_chunking_whole_book"

In [11]:
from src.util.vectorstore import get_vectorstore
vector_store = get_vectorstore(embedding_model, collection_name=collection_name)

The embedding model generates an embedding for each document before it is saved to the vectorstore under a unique ID

In [7]:
from uuid import uuid4

uuids = [str(uuid4()) for _ in range(len(texts))]

document_ids = vector_store.add_documents(documents=texts, ids=uuids)
    
print(f"Saved {len(document_ids)} documents to the vectorstore")

Saved 1 documents to the vectorstore


Similarity search embeds the question and then returns the documents in the vectorstore whose embeddings are most similar to it
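
The ranking step can be sketched in plain Python. This assumes cosine similarity as the metric, which the notebook does not confirm for this vectorstore, and uses toy 3-dimensional vectors in place of real embeddings. `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], stored: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the k stored texts whose vectors are most similar to the query vector."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy vectors standing in for real embeddings (assumed values for illustration).
stored = [
    ("acknowledgments", [0.9, 0.1, 0.0]),
    ("privacy", [0.0, 0.2, 0.9]),
    ("preface", [0.7, 0.3, 0.1]),
]
query = [1.0, 0.0, 0.0]
print(top_k(query, stored, k=2))
```

A real vectorstore does the same ranking with an index instead of a full sort, so it scales to many more documents.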

In [12]:
vector_store = get_vectorstore(embedding_model, collection_name=collection_name)
retrieved_docs = vector_store.similarity_search("Who does the author acknowledge?")
print("\n".join(doc.page_content for doc in retrieved_docs))


constructed between them. An example of such a border1 is illustrated in Fig.20.3b, and
the corresponding minimal generalizations are illustrated in the same ﬁgure. Note that
the minimal generalization is not unique, and that two possible minimal generalizations
<Z
2,A2 > and <Z 1,A3 > are possible in this example. The reason for using minimally
generalized nodes is to maximize the utility of the data for analytical algorithms. Other
1This border is for illustration purposes only, and does not correspond to any data set in this chapter.
20.3. PRIVACY-PRESERVING DATA PUBLISHING 675
more reﬁned deﬁnitions can be used for quantifying utility that use the distribution of the
attribute values more explicitly. The bibliographic notes contain pointers to some of these
deﬁnitions.
Samarati’s algorithm uses a simple binary search over the lattice of domain generaliza-
tion tuples. Let [0,hmax] represent the range of heights of the lattice. It is then checked
whether any of the generalizations 

As we can see, the vectorstore returns the text chunks most similar to the user question, which hopefully contain the answer.
We pass these chunks, along with the user question, to an LLM as additional context, augmenting its knowledge.
This is the idea behind Retrieval-Augmented Generation (RAG).
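
The augmentation step can be sketched as simple prompt assembly. The function name `build_rag_prompt` and the prompt wording are assumptions for illustration; the notebook does not show how the project actually builds its prompt.

```python
def build_rag_prompt(question: str, context_chunks: list[str]) -> str:
    """Combine retrieved chunks and the user question into one prompt string."""
    # Number the chunks so the model (and the reader) can tell them apart.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(context_chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_rag_prompt(
    "Who does the author acknowledge?",
    ["First retrieved chunk.", "Second retrieved chunk."],
)
print(prompt)
```

With the objects from this notebook, the chunk list would be `[doc.page_content for doc in retrieved_docs]`.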

In [9]:
retrieved_docs

[Document(metadata={'producer': '4-Heights™ PDF Library 3.4.0.6904 (http://www.pdf-tools.com)', 'creator': 'Microsoft® Word 2016', 'creationdate': '2025-10-22T09:31:32+00:00', 'moddate': '2026-01-28T15:03:48+00:00', 'source': '../data/raw/Dusan.pdf', 'total_pages': 2, '_id': 'eab54a64-5d59-44a1-8f56-365cffd74f25', '_collection_name': 'dule_cv'}, page_content='Dušan Jevtović \nJunior AI / Machine-Learning Engineer \nLocation: Belgrade, Serbia • Email: dusankitic@mail.com • Phone: +381 652110879 • \nLinkedIn: https://www.linkedin.com/in/dusan-jevtovic-723b17166/ • Timezone: CET (UTC+1) \nProfile \nJunior AI/ML Engineer with 2+ years of experience designing and deploying scalable AI \nsystems, specializing in document retrieval, RAG architectures, LLM fine-tuning, and \ninfrastructure for private AI deployments. Experienced in developing comprehensive AI solutions \nfrom proof-of-concept to production-ready systems, with expertise in advanced RAG \ntechniques, multi-agent systems, and eme

We must close the client connection before running a new script or the Streamlit app, because otherwise this process keeps the Qdrant database locked

In [13]:
vector_store.client.close()
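
One way to make sure the connection is released even when a cell raises an error is `contextlib.closing`, which calls `close()` on exit. The sketch below uses a stand-in `DummyClient` class rather than the real client; it assumes only that the real client exposes a `close()` method, as the call above suggests.

```python
from contextlib import closing


class DummyClient:
    """Stand-in for a database client; only tracks whether close() was called."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


client = DummyClient()
try:
    with closing(client):
        raise RuntimeError("simulated failure while using the client")
except RuntimeError:
    pass

print(client.closed)  # close() ran despite the exception
```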