<a href="https://colab.research.google.com/github/Jayasaideepika9/vectorDB_basic/blob/main/VECTORDB_04_07_2025.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

✅ Step 1: Install Required Libraries
Run this in a Colab cell first:

In [None]:
!pip install langchain chromadb sentence-transformers faiss-cpu annoy PyPDF2


✅ Step 2: Upload Your PDF
Upload the file using Colab’s upload feature:

In [None]:
from google.colab import files

uploaded = files.upload()
pdf_path = list(uploaded.keys())[0]

Saving nuclear_power_plant_safety.pdf to nuclear_power_plant_safety.pdf


✅ Step 3: Load and Split the PDF into Chunks
We’ll load the PDF, flatten its line breaks, and split the text into overlapping chunks.

In [None]:
import PyPDF2

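# Read PDF content page by page
with open(pdf_path, "rb") as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    full_text = ""
    for page in pdf_reader.pages:
        # extract_text() may return nothing for image-only pages; the trailing
        # newline keeps the last word of one page from fusing with the next
        full_text += (page.extract_text() or "") + "\n"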

# Clean up text
full_text = full_text.replace("\n", " ").strip()

Now we split the text into **chunks**:

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
chunks = text_splitter.split_text(full_text)

print(f"Total chunks created: {len(chunks)}")

Total chunks created: 532
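
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window sketch in plain Python. It is a simplification, not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on separators such as paragraph and word boundaries, whereas this hypothetical `sliding_window_chunks` helper cuts at fixed character offsets.

```python
def sliding_window_chunks(text, chunk_size=512, chunk_overlap=64):
    """Cut text into fixed-size character windows that overlap.

    Hypothetical helper for illustration only; it ignores word and
    sentence boundaries.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Toy input: 1000 characters of repeating digits.
sample = "".join(str(i % 10) for i in range(1000))
demo = sliding_window_chunks(sample, chunk_size=512, chunk_overlap=64)
print(len(demo), [len(c) for c in demo])
# Consecutive chunks share their boundary characters.
print(demo[0][-64:] == demo[1][:64])
```

With an overlap, a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.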


✅ Step 4: Vectorize Chunks and Store in ChromaDB
Let’s embed the chunks using a sentence-transformer model.

In [None]:
# Note: newer LangChain releases expose this class from langchain_community.embeddings
from langchain.embeddings import HuggingFaceEmbeddings
import chromadb

# Initialize embedding model
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Create ChromaDB client
client = chromadb.Client()
collection_name = "npp_safety"

# Delete any existing collection with this name so the cell can be re-run
try:
    client.delete_collection(collection_name)
except Exception:
    pass  # the collection did not exist yet

# Create new collection
collection = client.create_collection(name=collection_name)

# Generate embeddings and store in ChromaDB
embeddings = embedding_model.embed_documents(chunks)

# Add all chunks in one batched call instead of one add() per chunk
collection.add(
    documents=chunks,
    embeddings=embeddings,
    ids=[f"id_{i}" for i in range(len(chunks))],
)

print("Chunks stored in ChromaDB.")

Chunks stored in ChromaDB.


✅ Step 5: Build Retrieval Using HNSW (via FAISS)
FAISS has a built-in HNSW index. Let’s build it.

In [None]:
import faiss
import numpy as np

# Convert embeddings to numpy array
embeddings_np = np.array(embeddings).astype('float32')

# Build FAISS HNSW index
dim = embeddings_np.shape[1]
index_hnsw = faiss.IndexHNSWFlat(dim, 32)  # 32 is M, the number of graph neighbors per node (not efConstruction)
index_hnsw.add(embeddings_np)

print("FAISS HNSW index built.")

FAISS HNSW index built.


✅ Step 6: Build Retrieval Using IVFPQ (Product Quantization)
Use FAISS’s IVFPQ index (an inverted file combined with product quantization) for memory-efficient approximate nearest-neighbor search.

In [None]:
# Build IVF-PQ index
nlist = 50  # Number of IVF clusters (coarse centroids)
m = 8       # Number of PQ sub-quantizers; must divide dim evenly
bits = 8    # Bits per sub-vector code

quantizer = faiss.IndexFlatL2(dim)
index_pq = faiss.IndexIVFPQ(quantizer, dim, nlist, m, bits)
index_pq.train(embeddings_np)  # Train on embeddings
index_pq.add(embeddings_np)

print("FAISS IVFPQ index built.")

FAISS IVFPQ index built.
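
The memory saving from PQ can be estimated with simple arithmetic. The sketch below assumes 384-dimensional float32 embeddings (the output size of `all-MiniLM-L6-v2`) and counts only the per-vector payload; it ignores the coarse centroids, PQ codebooks, and stored ids, so treat the numbers as a rough guide. The helper names are hypothetical.

```python
def bytes_per_vector_flat(dim, bytes_per_float=4):
    """Raw float32 storage for one vector."""
    return dim * bytes_per_float


def bytes_per_vector_pq(m, bits):
    """PQ code size: m sub-quantizer codes of `bits` bits each."""
    return (m * bits + 7) // 8  # round up to whole bytes


# dim is an assumption (all-MiniLM-L6-v2); m, bits, and the chunk count
# are the values used in this notebook.
dim, m, bits, n_vectors = 384, 8, 8, 532

flat = bytes_per_vector_flat(dim)
pq = bytes_per_vector_pq(m, bits)
print(f"flat: {flat} B/vector, PQ: {pq} B/vector, ratio: {flat // pq}x")
print(f"all {n_vectors} chunks: {flat * n_vectors} B flat vs {pq * n_vectors} B as PQ codes")
print(f"each sub-vector covers {dim // m} dimensions")
```

The compression is lossy: distances are computed from the quantized codes, which is one reason an IVFPQ index can rank results differently from an index that stores full vectors.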


Alternatively, use Annoy:

In [None]:
from annoy import AnnoyIndex

# Build Annoy index
index_annoy = AnnoyIndex(dim, metric='angular')
for i, emb in enumerate(embeddings_np):
    index_annoy.add_item(i, emb)

index_annoy.build(n_trees=10)  # Number of trees
print("Annoy index built.")

Annoy index built.
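
Annoy indexes are queried with `get_nns_by_vector` rather than a FAISS-style `search` method, so a FAISS-oriented helper cannot be reused for them as-is. A minimal sketch of an Annoy-compatible lookup follows; `retrieve_annoy` is a hypothetical helper name, and the small stub class exists only so the snippet runs without the `annoy` package installed. In the notebook you would pass `index_annoy`, the embedded query, and `chunks`.

```python
def retrieve_annoy(query_vector, index, texts, k=3):
    """Return the k texts nearest to query_vector.

    `index` is any object exposing Annoy's get_nns_by_vector(vector, n);
    item ids are assumed to equal positions in `texts`.
    """
    ids = index.get_nns_by_vector(list(query_vector), k)
    return [texts[i] for i in ids]


class _StubAnnoyIndex:
    """Stand-in for annoy.AnnoyIndex so this sketch runs anywhere (demo only)."""

    def __init__(self, vectors):
        self._vectors = vectors

    def get_nns_by_vector(self, vector, n):
        # Brute-force squared Euclidean distance; real Annoy searches its trees.
        def dist(v):
            return sum((a - b) ** 2 for a, b in zip(v, vector))
        order = sorted(range(len(self._vectors)), key=lambda i: dist(self._vectors[i]))
        return order[:n]


texts = ["chunk A", "chunk B", "chunk C"]
stub = _StubAnnoyIndex([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
print(retrieve_annoy([0.9, 1.1], stub, texts, k=2))
```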


✅ Step 7: Ask a Question and Retrieve Results from Both Methods

In [None]:
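def retrieve(query, index, k=3):
    """Return the top-k chunks for a query from a FAISS index."""
    query_emb = embedding_model.embed_query(query)
    D, I = index.search(np.array([query_emb]).astype("float32"), k)
    # FAISS pads missing results with -1; skip them rather than index chunks[-1]
    return [chunks[i] for i in I[0] if i != -1]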

# Try a sample question
question = "What are the main principles of nuclear power plant safety?"

# Retrieve with HNSW
hnsw_results = retrieve(question, index_hnsw)

# Retrieve with IVFPQ (index_annoy has no search() method, so it cannot be passed to retrieve())
pq_results = retrieve(question, index_pq)

**Display results:**

In [None]:
print("=== HNSW Results ===")
for r in hnsw_results:
    print(r[:300], "\n---\n")

print("=== PQ/Annoy Results ===")
for r in pq_results:
    print(r[:300], "\n---\n")

=== HNSW Results ===
types of nuclear power plants may achieve the intent of some of the principles presented in this report by special inherent features making theprinciples as presently formulated not entirely applicable. For such cases,it would benecessary to scrutinize closely the extent of the basis in proven techn 
---

is shown to pervade allactivities. From top to bottom on the left hand side of Fig. 2, all the principles arerelated to the levels of defence in order of increasing threat to safety, from normaloperation to off-site and emergency response, indicating the provisions in design andoperation that need t 
---

a self-standing report on safety principles for electricity generating nuclear power plants1. This report has been developed because: —the means for ensuring the safety of nuclear power plants have improved over the years,and it is believed that commonly shared principles for ensuring a veryhigh lev 
---

=== PQ/Annoy Results ===
types of nuclear power plants may

✅ Step 8: Compare Retrieval Time and Accuracy
⏱️ Measure Speed

In [None]:
import time

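def benchmark(index, query, runs=10):
    """Average wall-clock time of retrieve(), including query embedding."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()  # monotonic, higher resolution than time.time()
        retrieve(query, index)
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)

avg_time_hnsw = benchmark(index_hnsw, question)
avg_time_pq = benchmark(index_pq, question)  # IVFPQ; Annoy would need its own query helper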

print(f"HNSW Avg Time: {avg_time_hnsw:.4f}s")
print(f"PQ/Annoy Avg Time: {avg_time_pq:.4f}s")

HNSW Avg Time: 0.0192s
PQ/Annoy Avg Time: 0.0401s
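
Because the helper being timed embeds the query on every call, the averages above include embedding time as well as the index lookup, and the embedding step may dominate. A hedged sketch of a timing helper that can be pointed at the search step alone is shown below; `time_call` is a hypothetical name, and the commented usage assumes the notebook's `embedding_model`, `index_hnsw`, `question`, and `np`.

```python
import statistics
import time


def time_call(fn, runs=10, warmup=1):
    """Return (mean, median) wall-clock seconds for calling fn()."""
    for _ in range(warmup):
        fn()  # warm caches so the first call does not skew results
    samples = []
    for _ in range(runs):
        start = time.perf_counter()  # monotonic, high-resolution timer
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.median(samples)


# Demo with a stand-in workload so the snippet runs on its own.
mean_s, median_s = time_call(lambda: sum(range(10_000)), runs=5)
print(f"mean={mean_s:.6f}s median={median_s:.6f}s")

# In the notebook (assumed names), embed once and time only the search:
# q = np.array([embedding_model.embed_query(question)]).astype("float32")
# print(time_call(lambda: index_hnsw.search(q, 3)))
```

Reporting the median alongside the mean makes a single slow run less likely to distort the comparison.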


In [None]:
import pandas as pd

results = pd.DataFrame({
    "Method": ["HNSW", "PQ/Annoy"],
    "Avg Retrieval Time (s)": [avg_time_hnsw, avg_time_pq],
    "Top Result Relevance": [
        "High" if any("safety culture" in r for r in hnsw_results) else "Low",
        "High" if any("safety culture" in r for r in pq_results) else "Low"
    ]
})

print(results.to_string(index=False))

  Method  Avg Retrieval Time (s) Top Result Relevance
    HNSW                0.019230                  Low
PQ/Annoy                0.040073                  Low
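
A keyword check such as the `"safety culture"` test is a coarse proxy for accuracy. A more common way to judge an approximate index is recall@k against exact (brute-force) search over the same vectors. The sketch below is a pure-Python illustration on made-up toy vectors; `exact_top_k` and `recall_at_k` are hypothetical helpers, and in the notebook the exact ids could instead come from a `faiss.IndexFlatL2` built on `embeddings_np`.

```python
def exact_top_k(query, vectors, k):
    """Brute-force nearest neighbours by squared Euclidean distance."""
    def dist(v):
        return sum((a - b) ** 2 for a, b in zip(v, query))
    return sorted(range(len(vectors)), key=lambda i: dist(vectors[i]))[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the approximate search also returned."""
    exact = set(exact_ids)
    if not exact:
        return 0.0
    return len(exact & set(approx_ids)) / len(exact)


# Toy data (illustrative only, not taken from the PDF).
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [4.0, 4.0]]
query = [0.9, 0.1]

exact_ids = exact_top_k(query, vectors, k=2)
approx_ids = [1, 3]  # pretend an approximate index returned these ids
print(exact_ids, recall_at_k(approx_ids, exact_ids))
```

Averaging recall@k over many queries gives an accuracy figure that does not depend on hand-picked keywords.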


✅ Step-by-Step: Add Evaluation Code
Let’s walk through scoring the retrieved chunks against reference answers.

1. Define Test Questions and Ground Truths

In [None]:
# Define test questions and their expected (ground truth) answers
test_questions = [
    {
        "question": "What are the main principles of nuclear power plant safety?",
        "expected_answer": "The basic safety principles for nuclear power plants include reliability, defense in depth, quality assurance, proper siting, commissioning validation, training programs, and use of probabilistic safety assessment."
    },
    {
        "question": "How is human error addressed in nuclear power plants?",
        "expected_answer": "Human error is addressed through design improvements, automation, improved human performance, task analysis, identification of error-likely conditions, and operator training including simulator exercises."
    },
    {
        "question": "What is the role of self-assessment in nuclear safety?",
        "expected_answer": "Self-assessment helps identify root causes of poor performance, supports continuous improvement, involves personnel directly in reviews, and ensures corrective actions are effective and tracked."
    }
]

2. Function to Retrieve Top Chunk(s)

In [None]:
def retrieve_top_k(query, index, k=3):
    query_emb = embedding_model.embed_query(query)
    D, I = index.search(np.array([query_emb]).astype("float32"), k)
    # Skip the -1 placeholders FAISS returns when fewer than k neighbors are found
    retrieved_chunks = [chunks[i] for i in I[0] if i != -1]
    return retrieved_chunks

3. Evaluate Using BLEU and ROUGE Scores
Install evaluate, nltk, and rouge_score (needed by the ROUGE metric) if not already done:

In [None]:
!pip install evaluate nltk rouge_score



Import libraries:

In [None]:
import evaluate
import nltk
from nltk.translate.bleu_score import sentence_bleu
from nltk.tokenize import word_tokenize
nltk.download('punkt')

rouge = evaluate.load('rouge')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


4. Helper Functions for Evaluation

In [None]:
def calculate_bleu(reference, hypothesis):
    reference_tokens = [word_tokenize(reference.lower())]
    hypothesis_tokens = word_tokenize(hypothesis.lower())
    return sentence_bleu(reference_tokens, hypothesis_tokens)

def calculate_rouge(reference, hypothesis):
    results = rouge.compute(predictions=[hypothesis], references=[reference])
    return results

def evaluate_retrieval(question_dict, index, method_name):
    question = question_dict["question"]
    expected = question_dict["expected_answer"]

    retrieved_chunks = retrieve_top_k(question, index, k=3)
    retrieved_text = " ".join(retrieved_chunks)

    bleu_score = calculate_bleu(expected, retrieved_text)
    rouge_scores = calculate_rouge(expected, retrieved_text)

    print(f"\n=== [{method_name}] Evaluation for: '{question}' ===")
    print(f"BLEU score: {bleu_score:.4f}")
    print(f"ROUGE-1: {rouge_scores['rouge1']:.4f}")
    print(f"ROUGE-2: {rouge_scores['rouge2']:.4f}")
    print(f"ROUGE-L: {rouge_scores['rougeL']:.4f}")
    print("-" * 60)

    return {
        "method": method_name,
        "question": question,
        "bleu": bleu_score,
        "rouge1": rouge_scores['rouge1'],
        "rouge2": rouge_scores['rouge2'],
        "rougeL": rouge_scores['rougeL']
    }
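
Unsmoothed BLEU is a geometric mean of modified 1- to 4-gram precisions (multiplied by a brevity penalty), so a single n-gram order with no overlap drives the whole score to zero. The pure-Python sketch below illustrates that effect on a toy sentence pair; `ngram_precision` and `geometric_mean` are hypothetical helpers, the brevity penalty is left out to keep it short, and the code is not a drop-in replacement for NLTK's `sentence_bleu`.

```python
import math
from collections import Counter


def ngram_precision(reference, hypothesis, n):
    """Modified n-gram precision: clipped matches / hypothesis n-grams."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    ref, hyp = ngrams(reference), ngrams(hypothesis)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


def geometric_mean(values):
    """Geometric mean that is 0 as soon as any value is 0."""
    if any(v == 0 for v in values):
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Toy sentences (illustrative only).
reference = "defence in depth is a basic safety principle".split()
hypothesis = "the report describes defence in depth and safety culture".split()

precisions = [ngram_precision(reference, hypothesis, n) for n in range(1, 5)]
print(precisions)
print("BLEU-4 style:", geometric_mean(precisions))
print("BLEU-2 style:", geometric_mean(precisions[:2]))
```

Here the 4-gram precision is zero, so the BLEU-4 style score collapses even though unigrams and bigrams overlap. Restricting the n-gram order or applying a smoothing function, as NLTK's warning suggests, avoids that collapse.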

5. Run Evaluation on Both Methods

In [None]:
import nltk
# Recent NLTK releases need the 'punkt_tab' resource for word_tokenize
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
results = []

for q in test_questions:
    hnsw_result = evaluate_retrieval(q, index_hnsw, "HNSW")
    pq_result = evaluate_retrieval(q, index_pq, "IVFPQ")  # FAISS indexes only; Annoy has no search()
    results.append(hnsw_result)
    results.append(pq_result)

The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()



=== [HNSW] Evaluation for: 'What are the main principles of nuclear power plant safety?' ===
BLEU score: 0.0000
ROUGE-1: 0.0818
ROUGE-2: 0.0300
ROUGE-L: 0.0743
------------------------------------------------------------

=== [IVFPQ] Evaluation for: 'What are the main principles of nuclear power plant safety?' ===
BLEU score: 0.0000
ROUGE-1: 0.0833
ROUGE-2: 0.0305
ROUGE-L: 0.0758
------------------------------------------------------------

=== [HNSW] Evaluation for: 'How is human error addressed in nuclear power plants?' ===
BLEU score: 0.0168
ROUGE-1: 0.1245
ROUGE-2: 0.0549
ROUGE-L: 0.1089
------------------------------------------------------------


The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()



=== [IVFPQ] Evaluation for: 'How is human error addressed in nuclear power plants?' ===
BLEU score: 0.0000
ROUGE-1: 0.0794
ROUGE-2: 0.0080
ROUGE-L: 0.0556
------------------------------------------------------------

=== [HNSW] Evaluation for: 'What is the role of self-assessment in nuclear safety?' ===
BLEU score: 0.0000
ROUGE-1: 0.0791
ROUGE-2: 0.0080
ROUGE-L: 0.0711
------------------------------------------------------------

=== [IVFPQ] Evaluation for: 'What is the role of self-assessment in nuclear safety?' ===
BLEU score: 0.0000
ROUGE-1: 0.0632
ROUGE-2: 0.0080
ROUGE-L: 0.0395
------------------------------------------------------------




6. Generate Final Summary Table

In [None]:
import pandas as pd

results_df = pd.DataFrame(results)
avg_results = results_df.groupby("method").mean(numeric_only=True).reset_index()

print("\n=== Average Evaluation Results ===")
print(avg_results.to_string(index=False))


=== Average Evaluation Results ===
method         bleu   rouge1   rouge2   rougeL
  HNSW 5.605735e-03 0.095116 0.030944 0.084815
 IVFPQ 2.393382e-79 0.075313 0.015501 0.056946
