In [None]:
!pip install langchain langchain-text-splitters langchain-huggingface langchain-community langchain-google-genai python-dotenv faiss-cpu PyMuPDF fastapi uvicorn python-multipart



# Indexing
Indexing is the first stage of RAG: we take the raw documents, parse them into text, split the text into chunks, compute an embedding for each chunk, and store the embeddings in a vector database.

### 01. Parsing

In this step, we extract raw text from the source documents — for example, extracting text from a PDF. Several libraries and approaches can be used for PDF parsing:

* **Metadata-based**: Libraries like `PyMuPDF` (`fitz`) or `pdfplumber` read the text that is already embedded in the PDF file. They are accurate for digitally generated PDFs (e.g., exports from Word or LaTeX) but do not work well for scanned or image-based PDFs, where the text is not selectable.

* **OCR-based**: Tools like `pytesseract` or `easyocr` use Optical Character Recognition (OCR) to extract text from image-based PDFs. These are useful when metadata-based parsing fails. They tend to be less accurate than reading embedded text, but they work on any PDF whose pages can be rendered as images.

* **Hybrid approaches**: Libraries like `docling` combine both metadata and OCR-based extraction. They can also preserve layout and structure — for example, outputting content in markdown or structured formats.

* **LLM-based parsing**: Tools like `llama-parse` use large language models to extract, clean, and structure content intelligently. These can handle messy layouts, mixed content, and even infer structure that isn’t explicitly present.
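
The choice between metadata-based and OCR-based parsing can be made per document. Below is a minimal sketch of one possible heuristic, not a method from any of the libraries above: if most pages yield almost no extractable text, the PDF is probably scanned and needs OCR. The function name `needs_ocr` and the thresholds `min_chars` and `max_empty_ratio` are assumptions chosen for illustration.

```python
def needs_ocr(page_texts, min_chars=20, max_empty_ratio=0.5):
    """Return True if most pages have (almost) no extractable text.

    page_texts: list of strings, one per page, as returned by a
    metadata-based parser. Thresholds are illustrative assumptions.
    """
    if not page_texts:
        return True  # nothing was extracted at all
    # Count pages whose stripped text is shorter than min_chars
    empty = sum(1 for text in page_texts if len(text.strip()) < min_chars)
    return empty / len(page_texts) > max_empty_ratio


# Hypothetical per-page text from a digital PDF and from a scanned PDF
digital_pdf = ["Chapter 1\nIntroduction to Cryptography ...", "Secrecy and Encryption ..."]
scanned_pdf = ["", "  \n", ""]

print(needs_ocr(digital_pdf))  # False
print(needs_ocr(scanned_pdf))  # True
```

In practice, the per-page strings could come from the `page_content` of the documents returned by a loader, and the thresholds would need tuning on real files.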

In [4]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader(
    file_path="pdfs/crypto_note.pdf"
)

documents = loader.load()

In [5]:
print(f"Total number of pages: {len(documents)}")
documents[:5]

Total number of pages: 92


[Document(metadata={'producer': 'pikepdf 1.7.0', 'creator': 'Microsoft® Word 2010', 'creationdate': "20130319094955+05'30'", 'source': 'pdfs/crypto_note.pdf', 'file_path': 'pdfs/crypto_note.pdf', 'total_pages': 92, 'format': 'PDF 1.3', 'title': '', 'author': 'new', 'subject': '', 'keywords': '', 'moddate': "20130319095108+05'30'", 'trapped': '', 'modDate': "20130319095108+05'30'", 'creationDate': "20130319094955+05'30'", 'page': 0}, page_content='Source: www.csitnepal.com (Complied By Tej Shahi)  \nPage 1 \n \nChapter 1 \n`Introduction to Cryptography \n \nThe word cryptography comes from two Greek words meaning “secret writing” and is the art \nand science of information hiding. This field is very much associated with mathematics and \ncomputer science with application in many fields like computer security, electronic commerce, \ntelecommunication, etc. \n \nSo cryptography is a subject that should be of interest to many people, especially because we \nnow live in the Information Age,

### 02. Chunking

Once raw text is extracted, the next step is **chunking** — splitting the text into smaller, manageable segments (chunks) that can be indexed and retrieved efficiently by the system.

#### Tools & Libraries:

* `langchain_text_splitters` – Offers multiple splitters like `RecursiveCharacterTextSplitter`, `TokenTextSplitter`, etc.
* `llama-index` – Has smart document parsing utilities.
* Custom chunkers using NLTK, spaCy, or basic Python logic.
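
To make the idea of chunk size and overlap concrete, here is a minimal sketch of a custom chunker in basic Python. It is a plain fixed-size sliding window, which is simpler than `RecursiveCharacterTextSplitter` (that splitter prefers to break on separators such as paragraph and line breaks). The function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(pieces)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.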

In [6]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,       # Max characters per chunk
    chunk_overlap=200,     # Overlap between chunks to preserve context
    separators=["\n---\n", "\n\n", "\n", ".", " "]  # Priority of where to split
)

# Split the text
chunks = splitter.split_documents(documents)

# Print chunks
for i, chunk in enumerate(chunks[:3]):
    print(f"Chunk {i}\n")
    print(chunk.page_content)


Chunk 0

Source: www.csitnepal.com (Complied By Tej Shahi)  
Page 1 
 
Chapter 1 
`Introduction to Cryptography 
 
The word cryptography comes from two Greek words meaning “secret writing” and is the art 
and science of information hiding. This field is very much associated with mathematics and 
computer science with application in many fields like computer security, electronic commerce, 
telecommunication, etc. 
 
So cryptography is a subject that should be of interest to many people, especially because we 
now live in the Information Age, and our secrets can be transmitted in so many ways – email, 
cell phone, etc. – and all these channels need to be protected [ simon singh]. 
 
Secrecy and Encryption 
 
In the ancient days, cryptography was mostly referred to as encryption – the mechanism to 
convert the readable plaintext into unreadable (incomprehensible) text i.e. ciphertext, and 
decryption – the opposite process of encryption i.e. conversion of ciphertext back to the
Chunk 1

c

In [7]:
print(f"Total number of chunks: {len(chunks)}")

Total number of chunks: 221


In [8]:
chunks[10].metadata

{'producer': 'pikepdf 1.7.0',
 'creator': 'Microsoft® Word 2010',
 'creationdate': "20130319094955+05'30'",
 'source': 'pdfs/crypto_note.pdf',
 'file_path': 'pdfs/crypto_note.pdf',
 'total_pages': 92,
 'format': 'PDF 1.3',
 'title': '',
 'author': 'new',
 'subject': '',
 'keywords': '',
 'moddate': "20130319095108+05'30'",
 'trapped': '',
 'modDate': "20130319095108+05'30'",
 'creationDate': "20130319094955+05'30'",
 'page': 3}

### 03. Embedding and storing

In this step, we convert each chunk into a dense vector (embedding) that captures its semantic meaning. We'll use a pretrained model from `sentence-transformers`.
We then store the embeddings in a vector database of our choice. We'll be using `faiss` in this example.
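
Once chunks are vectors, retrieval reduces to finding the stored vectors closest to the query vector. The sketch below shows that idea with an exhaustive L2 search in plain Python. The three-dimensional toy vectors and the names `l2_distance` and `nearest` are assumptions for illustration; real embeddings have hundreds of dimensions, and a vector database performs this search far more efficiently.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query_vec, vectors, k=2):
    """Return the indices of the k stored vectors closest to query_vec."""
    scored = sorted(
        (l2_distance(query_vec, vec), idx) for idx, vec in enumerate(vectors)
    )
    return [idx for _, idx in scored[:k]]


# Toy "embeddings" standing in for three chunks
toy_vectors = [
    [1.0, 0.0, 0.0],  # chunk 0
    [0.9, 0.1, 0.0],  # chunk 1
    [0.0, 0.0, 1.0],  # chunk 2
]
query = [1.0, 0.05, 0.0]

print(nearest(query, toy_vectors, k=2))  # [0, 1]
```

Chunks 0 and 1 point in nearly the same direction as the query, so they are returned first, while chunk 2 is far away and is left out.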

In [9]:
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

text = "This is a test document."
query_result = embeddings.embed_query(text)

# Show only the first 100 characters of the embedding vector's string form
print(str(query_result)[:100] + "...")

[-0.0383385494351387, 0.1234646886587143, -0.02864295244216919, 0.053652748465538025, 0.008845349773...


In [None]:
import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS



# all-MiniLM-L6-v2 produces 384-dimensional embeddings, so the index dimension must be 384
index = faiss.IndexFlatL2(384)

vector_store = FAISS(
    embedding_function=embeddings,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)


In [11]:
from uuid import uuid4


uuids = [str(uuid4()) for _ in range(len(chunks))]
vector_store.add_documents(documents=chunks, ids=uuids)

['884c40f7-e372-4f1c-9911-222cba288365',
 '4a0363d0-4d70-4379-8707-d65459993bab',
 'db777519-1428-43ab-987e-6c60a57779f4',
 '624e71a5-6a56-4828-88bb-b2a3ea2236dc',
 '62590fe8-c4ea-49d4-bbbc-6e83dc824dc0',
 'a0255a9a-c85e-49e2-8b9c-4bb5cab66351',
 '4b282c0d-9891-48bb-aba8-6fff9fde6080',
 '610fffdf-6f69-4ef2-926b-d7bfe7b70bc8',
 '70018446-6762-4ba6-99c4-332075654a8c',
 'a264df1c-e33c-4a5a-bb79-343c5472c869',
 '624e663f-231c-4bee-8d09-00fcd34d8a85',
 'ad05ca8c-dc0e-4a3e-8af3-14145cb68f7e',
 'f4895e77-883a-4955-99af-99a30809877c',
 'fd3a9900-9aea-4114-a0bb-ae96d9c594fa',
 'd995d5e4-186b-4507-ba93-f3a6a8bb26a5',
 '19bfe9ab-fe80-4b00-9100-c4e13b260651',
 '9055ec7c-10b4-46a1-8730-f68ef17f156b',
 '079dbd45-0647-44d3-b571-273473dc3abb',
 'bccaeb37-c798-4e42-a3a2-63299749fff3',
 '06acc9ec-0a5a-46c0-ba18-6b77a3cd750f',
 '56a3bc3f-59b8-40d9-bec1-a97d8644136e',
 'f4260daf-9825-43d5-bc25-bed1eb970497',
 'd28a79b3-7b23-494a-9c20-dc3f01cfe46d',
 '02b565cf-779d-4181-99bc-218f4838c290',
 '8dbef393-aa55-

In [12]:
vector_store.save_local("vector_store")

### 04. Retrieving

In [13]:
results = vector_store.similarity_search(
    "Define Diffie-Hellman Key Exchange",
    k=3,
)
print("Relevant chunks metadata: ")
for res in results:
    print(res.metadata["page"])


Relevant chunks metadata: 
52
71
70


### 05. Setting up system prompts and LLMs

In [14]:
prompt_template = """
You are a helpful and intelligent study assistant. You are given a list of JSON objects as context, each representing extracted text from a PDF.

Each object contains:
- 'text': the actual content
- 'filename': the PDF file name
- 'page_no': the page number

Your task is to answer the student's question based **primarily** on the 'text' fields in the context. You may reason and infer the answer if the exact wording is not available, as long as your answer is clearly supported by the content.

# Context:
{context}

# Question:
{question}

# Instructions:
- Use only the 'text' field from the context entries for answering the question, but include the most relevant 'filename' and 'page_no' you used.
- Even if the original text is technical or unclear, rewrite the answer in a simple, **student-friendly** way that is easy to understand.
- You can use **Markdown formatting** (headings, bullet points, code blocks, tables, etc.) to make the answer more readable and structured.
- You **may infer** or **summarize** answers from the content to help students understand, even if the answer is not a perfect match.
- Avoid saying you don’t know unless the question is entirely unrelated to the context.
- Return a JSON object with:
  - "answer": your helpful, clear, Markdown-formatted answer
  - "filename": the filename of the most relevant entry you used
  - "page_no": the corresponding page number
  
Respond ONLY with a valid JSON object. Do not include any explanation, formatting, or extra text."""
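
A template like this is typically filled with `str.format`, which treats every `{...}` as a placeholder. The sketch below illustrates a pitfall worth keeping in mind if the template is ever extended with a literal JSON example: literal braces must be doubled. The short `template` string here is a hypothetical stand-in, not the prompt above.

```python
# Literal braces are written as {{ and }}; {question} stays a placeholder
template = 'Reply as JSON like {{"answer": "..."}}.\n\n# Question:\n{question}'
filled = template.format(question="What is a cipher?")
print(filled)

# With single braces, str.format tries to look up a field named '"answer"'
try:
    'Reply as JSON like {"answer": "..."}'.format(question="x")
    error_name = None
except (KeyError, ValueError) as exc:
    error_name = type(exc).__name__
print(error_name)
```

Braces inside the values passed to `format` (for example, a JSON string passed as `context`) are inserted as-is and need no escaping.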

In [15]:
from langchain_google_genai import ChatGoogleGenerativeAI
import os
from dotenv import load_dotenv

load_dotenv()

model = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash",
    api_key=os.getenv("GEMINI_API_KEY")
)

In [16]:
from typing import Optional
from pydantic import BaseModel

class Response(BaseModel):
    answer: str
    filename: Optional[str] = None
    page_no: Optional[int] = None

In [17]:
structured_model = model.with_structured_output(schema=Response)

In [18]:
structured_model.invoke("hi")

Response(answer='hello', filename=None, page_no=None)

### 06. Getting the response

In [19]:
import json

question = "What is diffie hellman key exchange ?"

similar_chunks = vector_store.similarity_search(
    query=question,
    k=5
)

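# Build one JSON-serializable entry per retrieved chunk
context_list = []
for chunk in similar_chunks:
    context_list.append({
        "text": chunk.page_content,
        # The loader's "page" metadata is 0-based; add 1 to get the printed page number
        "page_no": chunk.metadata["page"] + 1,
        "filename": chunk.metadata["file_path"]
    })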

context = json.dumps(context_list, indent=2)

prompt = prompt_template.format(context=context, question=question)

print(prompt)


You are a helpful and intelligent study assistant. You are given a list of JSON objects as context, each representing extracted text from a PDF.

Each object contains:
- 'text': the actual content
- 'filename': the PDF file name
- 'page_no': the page number

Your task is to answer the student's question based **primarily** on the 'text' fields in the context. You may reason and infer the answer if the exact wording is not available, as long as your answer is clearly supported by the content.

# Context:
[
  {
    "text": "Diffie-Hellman (D-H) key exchange is a cryptographic protocol that allows two parties that have \nno prior knowledge of each other to jointly establish a shared secret key over an insecure \ncommunications channel. This key can then be used to encrypt subsequent communications \nusing a symmetric key cipher. Other names for Diffie-Hellman Key Exhange are Diffie-Hellman \nKey Agreement, Diffie-Hellman Key Establishment, Diffie-Hellman Key Negotiation, \nExponential Ke

In [20]:
resp = structured_model.invoke(prompt)

print(
    f"Question: {question}\nAnswer: {resp.answer}\nPage number: {resp.page_no}"
)

Question: What is diffie hellman key exchange ?
Answer: Diffie-Hellman (D-H) key exchange is a special cryptographic method that allows two people or parties to secretly agree on a shared secret key, even if they've never met or communicated before, and are using an insecure channel. Once they have this shared secret key, they can use it to encrypt their future messages so that only they can read them. This method is also known by other names like Diffie-Hellman Key Agreement or Exponential Key Exchange.
Page number: 53
