---
# Vector Database and RAG Setup

<img src="https://miro.medium.com/v2/resize:fit:1400/0*kPXvR_-LxPfdsojc.png" width=600>

Let's put it all together into a retrieval-augmented generation (RAG) flow:
1. Chunking text
2. Creating a Vector Database
3. Indexing the text chunks
4. Setting up an LLM
5. Connecting retrieval with LLM for RAG

In [1]:
# RAG Dependencies
import os
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
import chromadb
from langchain_openai import ChatOpenAI
from IPython.display import display, Markdown
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
import json
from sentence_transformers import SentenceTransformer

os.environ["TOKENIZERS_PARALLELISM"] = "false"

### Chunking the Text

We'll be using the [TIES-Merging: Resolving Interference When Merging Models](https://arxiv.org/abs/2306.01708#) paper for our example, along with LangChain integrations with PDF loaders and chunkers.

In [16]:
# Paths
PDF_PATH = r"D:\2CSI-Project\PDFs_papers"  # Folder containing PDFs
CHUNKS_DIR = r"D:\2CSI-Project\Chunks"  # Folder to store chunks
PROGRESS_FILE = r"D:\2CSI-Project\progress.txt"  # File to save the last processed index
VECTORDB_PATH = r"D:\2CSI-Project\VectorDB_Embeddings"
EMBEDDING_MODEL = "all-MiniLM-L6-v2"
os.makedirs(CHUNKS_DIR, exist_ok=True)

In [17]:
client = chromadb.PersistentClient(path=VECTORDB_PATH)
collection = client.get_or_create_collection(name='ties_collection_emb', metadata={"hnsw:space": "cosine"})

In [4]:
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=800,
    chunk_overlap=400,
)
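
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of a fixed-size sliding window over a token list. It is an illustration only: `RecursiveCharacterTextSplitter` splits on separators first and then merges pieces up to the size limit, so its real chunk boundaries will differ. The function name `sliding_window_chunks` and the toy integer "tokens" are assumptions for this example, not part of LangChain.

```python
def sliding_window_chunks(tokens, chunk_size=800, chunk_overlap=400):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


# Toy example: 10 "tokens", windows of 4 with an overlap of 2.
toy_tokens = list(range(10))
toy_chunks = sliding_window_chunks(toy_tokens, chunk_size=4, chunk_overlap=2)
print(toy_chunks)
```

With the notebook's settings (800 and 400), each window advances by 400 tokens, so most tokens land in two chunks. That roughly doubles the number of stored embeddings compared with no overlap, in exchange for fewer ideas being cut at a chunk boundary.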

#  Read Last Progress


In [5]:
if os.path.exists(PROGRESS_FILE):
    with open(PROGRESS_FILE, "r") as f:
        last_index = int(f.read().strip())
else:
    last_index = 0

In [6]:
last_index

0
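
The progress file is what makes the indexing run resumable. A hedged sketch of the same idea as two helpers follows. The names `read_progress` and `write_progress` are hypothetical, and writing through a temporary file followed by `os.replace` is an added precaution (not in the original cells) so that an interrupted run does not leave a half-written progress file.

```python
import os
import tempfile


def read_progress(path):
    """Return the index of the next file to process, or 0 if no progress was saved."""
    if not os.path.exists(path):
        return 0
    with open(path, "r") as f:
        text = f.read().strip()
    return int(text) if text else 0


def write_progress(path, next_index):
    """Save next_index by writing a temp file and renaming it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(str(next_index))
    os.replace(tmp_path, path)  # rename within the same directory


# Round trip in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "progress.txt")
    first_read = read_progress(demo_path)
    write_progress(demo_path, 13)
    second_read = read_progress(demo_path)
print(first_read, second_read)
```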

# List Chunks

In [7]:
chunk_files = sorted(os.listdir(CHUNKS_DIR))
total_files = len(chunk_files)
if os.path.exists(PROGRESS_FILE):
    with open(PROGRESS_FILE, "r") as f:
        last_index = int(f.read().strip())
else:
    last_index = 0

print(f"Starting from file {last_index + 1}/{total_files}")

Starting from file 1/6158


#  Load Embedding Model

In [5]:
embedding_model = SentenceTransformer(EMBEDDING_MODEL)

In [9]:
for idx, chunk_file in enumerate(chunk_files):
    if idx < last_index:
        continue

    try:
        print(f"📄 Processing: {chunk_file} [{idx + 1}/{total_files}]")
        with open(os.path.join(CHUNKS_DIR, chunk_file), "r", encoding="utf-8") as f:
            chunks = json.load(f)

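        # Generate Embeddings + Store in ChromaDB
        for i, chunk in enumerate(chunks):
            # Encode one chunk and convert the NumPy vector to a plain list for ChromaDB
            embedding = embedding_model.encode(chunk).tolist()
            collection.add(
                documents=[chunk],
                embeddings=[embedding],
                # chunk_file[:-11] drops the trailing "chunks.json",
                # so IDs look like "<paper name>__chunk_<i>"
                ids=[f"{chunk_file[:-11]}_chunk_{i}"]
            )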

        print(f"✅ {len(chunks)} Chunks Indexed")

        # Update Progress
        with open(PROGRESS_FILE, "w") as f:
            f.write(str(idx + 1))

    except Exception as e:
        print(f"❌ Error processing {chunk_file}: {e}")
        continue

print("🚀 All Chunks Processed!")


📄 Processing: $C^$-Algebraic Machine Learning Moving in a New Di_chunks.json [1/6158]
✅ 44 Chunks Indexed
📄 Processing: $E(2)$-Equivariant Vision Transformer_chunks.json [2/6158]
✅ 38 Chunks Indexed
📄 Processing: $mathrm{SAM^{Med}}$ A medical image annotation fra_chunks.json [3/6158]
✅ 38 Chunks Indexed
📄 Processing: 10_2263363_chunks.json [4/6158]
✅ 39 Chunks Indexed
📄 Processing: 11_1434678_chunks.json [5/6158]
✅ 181 Chunks Indexed
📄 Processing: 12_2202843_chunks.json [6/6158]
✅ 12 Chunks Indexed
📄 Processing: 13_4099100_chunks.json [7/6158]
✅ 102 Chunks Indexed
📄 Processing: 14_3224393_chunks.json [8/6158]
✅ 95 Chunks Indexed
📄 Processing: 15_1275662_chunks.json [9/6158]
✅ 30 Chunks Indexed
📄 Processing: 16_2846909_chunks.json [10/6158]
✅ 15 Chunks Indexed
📄 Processing: 17_2307352_chunks.json [11/6158]
✅ 86 Chunks Indexed
📄 Processing: 18_3489010_chunks.json [12/6158]
✅ 33 Chunks Indexed
📄 Processing: 1996--1998 Polish Visual Meteor Database_chunks.json [13/6158]
✅ 9 Chunks Indexed
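
The loop above encodes and adds one chunk at a time. A possible variation, sketched under assumptions, is to send chunks in batches: `SentenceTransformer.encode` accepts a list of strings, and `collection.add` accepts lists of documents, embeddings, and ids. The helper name `index_in_batches`, the `batch_size` value, and the `embed_fn` parameter are hypothetical. The stand-in collection at the bottom exists only so the sketch runs without ChromaDB.

```python
def index_in_batches(collection, ids, documents, embed_fn, batch_size=64):
    """Add documents to a Chroma-style collection in batches.

    embed_fn takes a list of strings and returns one embedding (list of floats)
    per string, e.g. lambda texts: embedding_model.encode(texts).tolist().
    """
    if len(ids) != len(documents):
        raise ValueError("ids and documents must have the same length")
    added = 0
    for start in range(0, len(documents), batch_size):
        batch_docs = documents[start:start + batch_size]
        batch_ids = ids[start:start + batch_size]
        collection.add(
            documents=batch_docs,
            embeddings=embed_fn(batch_docs),
            ids=batch_ids,
        )
        added += len(batch_docs)
    return added


class FakeCollection:
    """Stand-in that records add() calls; not part of ChromaDB."""

    def __init__(self):
        self.calls = []

    def add(self, documents, embeddings, ids):
        self.calls.append((list(documents), list(embeddings), list(ids)))


fake = FakeCollection()
docs = [f"chunk text {i}" for i in range(5)]
doc_ids = [f"paper_chunk_{i}" for i in range(5)]
# Toy embedding: a single number per text (its length), just to exercise the flow.
toy_embed = lambda texts: [[float(len(t))] for t in texts]
total = index_in_batches(fake, doc_ids, docs, toy_embed, batch_size=2)
print(total, len(fake.calls))
```

Batching mainly reduces per-call overhead in the embedding model and the database. Whether it helps in practice depends on hardware and chunk length, so the batch size is worth measuring rather than assuming.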


# 🧠 How ChromaDB Retrieves Relevant Chunks (Cosine Similarity)

A user might ask queries such as:

- 👉 "Summarize the methodology of this article"
- 👉 "What is the model architecture used in this paper?"
- 👉 "Summarize the conclusion in 200 words"

What happens behind the scenes:

1. User Query 🔍: The user enters a query, for example `query = "Summarize the methodology used in this article"`.
2. Generate Query Embedding 🧠: The query is converted into an embedding vector with the same model used for indexing (all-MiniLM-L6-v2): `query_embedding = embedding_model.encode(query).tolist()`.
3. Cosine Similarity Search 🔥: ChromaDB searches its HNSW index for the stored embeddings closest to the query embedding. Because the collection was created with `"hnsw:space": "cosine"`, closeness is measured by cosine distance (1 minus cosine similarity). The search is approximate, so it does not compare the query against every stored vector.
4. ChromaDB Fetches Relevant Chunks: The vector database returns the `n_results` chunks with the smallest distance, together with those distances.
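
As a reference point for the similarity scores printed later, this minimal sketch computes cosine similarity by hand and ranks a few toy vectors by brute force. It only illustrates the metric: ChromaDB relies on its own vector index rather than a loop like this, and the vectors and helper names `cosine_similarity` and `top_k_by_cosine` are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (|a| * |b|); returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_by_cosine(query_vec, stored, k=2):
    """Rank (id, vector) pairs by similarity to query_vec, highest first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in stored]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


stored_vectors = [
    ("same_direction", [2.0, 0.0]),  # parallel to the query
    ("orthogonal", [0.0, 3.0]),      # unrelated direction
    ("opposite", [-1.0, 0.0]),       # points the other way
]
ranking = top_k_by_cosine([1.0, 0.0], stored_vectors, k=3)
for doc_id, sim in ranking:
    # In a cosine space, distance = 1 - similarity.
    print(doc_id, round(sim, 3), "distance:", round(1 - sim, 3))
```

Vector length does not matter here, only direction, which is why `[2.0, 0.0]` scores a full 1.0 against `[1.0, 0.0]`.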

# ChromaDB Retrieval Code 


In [27]:
query = """academic community. As an illustration of the scale of the field’s growth, the number of submis-
sions to NeurIPS, a primary conference on ML methods, has quadrupled in six years, going from
1,678submissionsin2014to6,743in2019[ 38].Nevertheless,therearestillplentyofpracticalcon-
siderationsthataffectthemodellearningstage.Inthissection,wediscussissuesconcerningthree
steps within model learning: model selection, training, and hyper-parameter selection.
4.1 Model Selection
In many practical cases the selection of a model is decided by one key characteristic of a model:
complexity.Despiteareassuchasdeeplearningandreinforcementlearninggaininginpopularity
with the research community, in practice simpler models are often chosen. Such models include
shallowneuralnetworkarchitectures,simpleapproachesbasedon PrincipalComponentAnal-
ysis (PCA), decision trees, and random forests.
Simple models can be used as a way to prove the concept of the proposed ML solution and get
the end-to-end setup in place. This approach reduces the time to get a deployed solution, allows
the collection of important feedback, and also helps avoid overcomplicated designs. This was the
case reported by Haldar et al. [39]. In the process of applying machine learning to AirBnB search,
the team started with a complex deep learning model. The team was quickly overwhelmed by its
complexityandendedupconsumingdevelopmentcycles.Afterseveralfaileddeploymentattempts
the neural network architecture was drastically simplified: a single hidden layer NN with 32 fully
connected ReLU activations. Even such a simple model had value, as it allowed the building of a
whole pipeline of deploying ML models"""
query_embedding = embedding_model.encode(query).tolist()



In [28]:
# 🔥 Retrieve Top 5 Most Similar Chunks
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=5,
    include=["documents", "distances"]
)

In [29]:

for i in range(len(results["documents"][0])):
    print(f"Chunk: {results['documents'][0][i]}")
    print(f"Similarity Score: {1 - results['distances'][0][i]}")
    print("=" * 80)

Chunk: academic community. As an illustration of the scale of the field’s growth, the number of submis-
sions to NeurIPS, a primary conference on ML methods, has quadrupled in six years, going from
1,678submissionsin2014to6,743in2019[ 38].Nevertheless,therearestillplentyofpracticalcon-
siderationsthataffectthemodellearningstage.Inthissection,wediscussissuesconcerningthree
steps within model learning: model selection, training, and hyper-parameter selection.
4.1 Model Selection
In many practical cases the selection of a model is decided by one key characteristic of a model:
complexity.Despiteareassuchasdeeplearningandreinforcementlearninggaininginpopularity
with the research community, in practice simpler models are often chosen. Such models include
shallowneuralnetworkarchitectures,simpleapproachesbasedon PrincipalComponentAnal-
ysis (PCA), decision trees, and random forests.
Simple models can be used as a way to prove the concept of the proposed ML solution and get
the end-to-end setu

# LLM Prompt Design

In [None]:
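# Assumption: the original cell never defines `llm`. ChatOpenAI (imported above) is used
# here with the library's default model; it reads OPENAI_API_KEY from the environment.
llm = ChatOpenAI(temperature=0)

# Join the retrieved chunks into a single context string
retrieved_chunks = " ".join(results["documents"][0])

prompt = f"""
You are a helpful scientific paper summarizer.
Summarize the following text in **200 words** based on this query: "{query}".

Text:
{retrieved_chunks}
"""

# invoke() returns a message object; the generated text is in .content
response = llm.invoke(prompt)
print(response.content)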


# ChromaDB Function

In [None]:
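def retrieve_chunks(query, top_k=5):
    """Return the top_k stored chunks for a query as (chunk, similarity) pairs."""
    # Embed the query with the same model used at indexing time
    query_embedding = embedding_model.encode(query).tolist()
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
        include=["documents", "distances"]
    )

    chunks = results["documents"][0]
    # The collection uses cosine distance, so similarity = 1 - distance
    similarities = [1 - d for d in results["distances"][0]]

    return list(zip(chunks, similarities))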
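
The last step listed at the top, connecting retrieval with the LLM, can be sketched as a small function that takes the retriever and the generator as arguments. This is one possible wiring, not the notebook's own: `answer_with_rag`, `build_prompt`, the `min_similarity` threshold, and the prompt wording are hypothetical, and the stand-in retriever and generator below exist only so the sketch runs without ChromaDB or an API key.

```python
def build_prompt(query, chunks):
    """Number the retrieved chunks so the answer can be traced back to them."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "You are a helpful scientific paper summarizer.\n"
        "Answer the query using only the numbered context below.\n\n"
        f"Query: {query}\n\n"
        f"Context:\n{context}\n"
    )


def answer_with_rag(query, retrieve, generate, top_k=5, min_similarity=0.0):
    """retrieve(query, top_k) -> [(chunk, similarity)]; generate(prompt) -> str."""
    hits = retrieve(query, top_k)
    # Drop weak matches; min_similarity is an illustrative threshold, not a tuned value.
    chunks = [chunk for chunk, sim in hits if sim >= min_similarity]
    if not chunks:
        return "No relevant context was retrieved."
    return generate(build_prompt(query, chunks))


# Stand-ins for demonstration only.
def fake_retrieve(query, top_k):
    hits = [
        ("Simple models are often chosen in practice.", 0.82),
        ("Unrelated text about meteors.", 0.10),
    ]
    return hits[:top_k]


def fake_generate(prompt):
    return f"(model output for a prompt of {len(prompt)} characters)"


demo_prompt = build_prompt("Which models are chosen?", ["Simple models are often chosen in practice."])
demo_answer = answer_with_rag("Which models are chosen?", fake_retrieve, fake_generate, min_similarity=0.5)
print(demo_answer)
```

In the notebook, `retrieve` could be the `retrieve_chunks` function above and `generate` could be something like `lambda p: llm.invoke(p).content`, assuming `llm` is a `ChatOpenAI` instance. Passing both as arguments keeps the glue code testable without network access.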
