In [1]:
!pip install langchain langchain-text-splitters langchain-community langchain-huggingface langchain-groq sentence-transformers faiss-cpu PyMuPDF pydantic python-dotenv fastapi uvicorn python-multipart



# Indexing
Indexing is the first stage of RAG: we parse the raw documents into text, split the text into chunks, compute an embedding for each chunk, and store the embeddings in a vector database.

### 01. Parsing

In this step, we extract raw text from the source documents — for example, extracting text from a PDF. Several libraries and approaches can be used for PDF parsing:

* **Metadata-based**: Libraries like `PyMuPDF (fitz)` or `pdfplumber` read the text that is embedded in the PDF file itself. They are accurate for digitally generated PDFs (e.g., exports from Word or LaTeX), but they return little or no text for scanned or image-based PDFs, where the text is not selectable.

* **OCR-based**: Tools like `pytesseract` or `easyocr` use Optical Character Recognition (OCR) to extract text from image-based PDFs. These are useful when metadata-based parsing fails. OCR can misread characters, so it is usually less accurate than reading embedded text, but it works on any page that can be rendered as an image.

* **Hybrid approaches**: Libraries like `docling` combine both metadata and OCR-based extraction. They can also preserve layout and structure — for example, outputting content in markdown or structured formats.

* **LLM-based parsing**: Tools like `llama-parse` use large language models to extract, clean, and structure content intelligently. These can handle messy layouts, mixed content, and even infer structure that isn’t explicitly present.
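
One way to combine the approaches above is to try embedded-text extraction first and fall back to OCR only for pages that come back nearly empty. The sketch below shows just that routing decision. `needs_ocr`, `pages_needing_ocr`, and the `min_chars` threshold are hypothetical names and values chosen for illustration, and the actual OCR call is left out.

```python
def needs_ocr(page_text: str, min_chars: int = 20) -> bool:
    """Return True when a page has too little embedded text to be useful."""
    # Count only non-whitespace characters so blank lines do not count as text.
    visible = sum(1 for ch in page_text if not ch.isspace())
    return visible < min_chars


def pages_needing_ocr(page_texts: list[str], min_chars: int = 20) -> list[int]:
    """Return the 0-based indices of pages that should be sent to an OCR tool."""
    return [i for i, text in enumerate(page_texts) if needs_ocr(text, min_chars)]


# Illustrative input: the second page has almost no text (likely a scanned image).
sample_pages = [
    "Chapter 1\nIntroduction to Cryptography ...",
    " \n \n",
    "Decryption is the reverse process, transforming an encrypted message ...",
]
print(pages_needing_ocr(sample_pages))
```

The threshold is a judgment call: too low and pages containing only a page number slip through, too high and short but valid pages get sent to OCR unnecessarily.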

In [2]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader(
    file_path="pdfs/crypto_note.pdf"
)

documents = loader.load()



In [3]:
print(f"Total number of pages: {len(documents)}")
documents[:5]

Total number of pages: 92


[Document(metadata={'producer': 'pikepdf 1.7.0', 'creator': 'Microsoft® Word 2010', 'creationdate': "20130319094955+05'30'", 'source': 'pdfs/crypto_note.pdf', 'file_path': 'pdfs/crypto_note.pdf', 'total_pages': 92, 'format': 'PDF 1.3', 'title': '', 'author': 'new', 'subject': '', 'keywords': '', 'moddate': "20130319095108+05'30'", 'trapped': '', 'modDate': "20130319095108+05'30'", 'creationDate': "20130319094955+05'30'", 'page': 0}, page_content='Source: www.csitnepal.com (Complied By Tej Shahi)  \nPage 1 \n \nChapter 1 \n`Introduction to Cryptography \n \nThe word cryptography comes from two Greek words meaning “secret writing” and is the art \nand science of information hiding. This field is very much associated with mathematics and \ncomputer science with application in many fields like computer security, electronic commerce, \ntelecommunication, etc. \n \nSo cryptography is a subject that should be of interest to many people, especially because we \nnow live in the Information Age,

In [4]:
print(documents[1].page_content)

Source: www.csitnepal.com (Complied By Tej Shahi)  
Page 2 
 
Decryption is the reverse process, transforming an encrypted message back into its normal, 
original form. In decryption process also the use of key is important. 
 
Alternatively, the terms encode and decode or encipher and decipher are used instead of encrypt 
and decrypt. That is, we say that we encode, encrypt, or encipher the original message to hide its 
meaning. Then, we decode, decrypt, or decipher it to reveal the original message. 
 
 
 
Plaintext 
Ciphertext 
Original Plaintext 
 
 
Fig: Encryption-Decryption 
 
The use of encryption techniques is being used since very long period as it can be noted from the 
technique called Caesar’s cipher used by Julius Caesar for information passing to his soldiers. 
Encryption techniques have also been extensively used in military purposes to conceal the 
information from the enemy. Nowadays to gain the confidentiality encryption is being used in 
many areas like communicatio
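
The page text above begins with a source line and a page counter that repeat on every page, so they would end up in many chunks. Below is a minimal sketch of stripping them before chunking, assuming every page starts with the same `Source: ...` / `Page N` pair. `strip_page_header` is a hypothetical helper, not part of any loader.

```python
import re

# Matches a leading "Source: ..." line followed by a "Page N" line.
# \A anchors the pattern to the very start of the page text.
HEADER_RE = re.compile(r"\ASource: .*\n\s*Page \d+\s*\n")


def strip_page_header(page_text: str) -> str:
    """Remove the repeated header from the start of a page, if present."""
    return HEADER_RE.sub("", page_text, count=1).lstrip()


sample = (
    "Source: www.csitnepal.com (Complied By Tej Shahi)  \n"
    "Page 2 \n \n"
    "Decryption is the reverse process."
)
print(strip_page_header(sample))
```

Applied to each `page_content` before splitting, this keeps the header from taking up space in every chunk.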

### 02. Chunking

Once raw text is extracted, the next step is **chunking** — splitting the text into smaller, manageable segments (chunks) that can be indexed and retrieved efficiently by the system.

#### Tools & Libraries:

* `langchain-text-splitters` – Offers multiple splitters like `RecursiveCharacterTextSplitter`, `TokenTextSplitter`, etc.
* `llama-index` – Provides node parsers such as `SentenceSplitter` for splitting documents into chunks.
* Custom chunkers using NLTK, spaCy, or basic Python logic.
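
As an illustration of the "basic Python logic" option, here is a minimal fixed-size chunker with overlap. It is only a sketch: `chunk_text` is a hypothetical helper that cuts purely by character count and ignores sentence or paragraph boundaries.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means the end of one chunk is repeated at the start of the next, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.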

In [5]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,       # Max characters per chunk
    chunk_overlap=200,     # Overlap between chunks to preserve context
    separators=["\n---\n", "\n\n", "\n", ".", " "]  # Priority of where to split
)

# Split the text
chunks = splitter.split_documents(documents)

# Print chunks
for i, chunk in enumerate(chunks[:3]):
    print(f"Chunk {i}\n")
    print(chunk.page_content)


Chunk 0

Source: www.csitnepal.com (Complied By Tej Shahi)  
Page 1 
 
Chapter 1 
`Introduction to Cryptography 
 
The word cryptography comes from two Greek words meaning “secret writing” and is the art 
and science of information hiding. This field is very much associated with mathematics and 
computer science with application in many fields like computer security, electronic commerce, 
telecommunication, etc. 
 
So cryptography is a subject that should be of interest to many people, especially because we 
now live in the Information Age, and our secrets can be transmitted in so many ways – email, 
cell phone, etc. – and all these channels need to be protected [ simon singh]. 
 
Secrecy and Encryption 
 
In the ancient days, cryptography was mostly referred to as encryption – the mechanism to 
convert the readable plaintext into unreadable (incomprehensible) text i.e. ciphertext, and 
decryption – the opposite process of encryption i.e. conversion of ciphertext back to the
Chunk 1

c

In [6]:
print(f"Total number of chunks: {len(chunks)}")

Total number of chunks: 221
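
The `separators` list passed to the splitter above is a priority order: split on the first separator that occurs in the text, and fall back to the next one only for pieces that are still too long. A simplified sketch of that idea follows. It is not LangChain's implementation (which also merges small pieces back together and applies `chunk_overlap`), and `recursive_split` is a hypothetical name.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split text on separators in priority order until every piece fits chunk_size."""
    if len(text) <= chunk_size:
        return [text] if text else []  # drop empty pieces

    if not separators:
        # No separators left: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, remaining = separators[0], separators[1:]
    if sep not in text:
        # This separator does not occur here, so try the next one.
        return recursive_split(text, chunk_size, remaining)

    pieces = []
    for part in text.split(sep):
        # Pieces that are still too long are split again with finer separators.
        pieces.extend(recursive_split(part, chunk_size, remaining))
    return pieces


print(recursive_split("aaa bbb\n\nccc", chunk_size=7, separators=["\n\n", " "]))
# ['aaa bbb', 'ccc']
```

With `chunk_size=7` the paragraph break is enough. With a smaller limit, the first paragraph would be split again on spaces.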


In [7]:
chunks[10].metadata

{'producer': 'pikepdf 1.7.0',
 'creator': 'Microsoft® Word 2010',
 'creationdate': "20130319094955+05'30'",
 'source': 'pdfs/crypto_note.pdf',
 'file_path': 'pdfs/crypto_note.pdf',
 'total_pages': 92,
 'format': 'PDF 1.3',
 'title': '',
 'author': 'new',
 'subject': '',
 'keywords': '',
 'moddate': "20130319095108+05'30'",
 'trapped': '',
 'modDate': "20130319095108+05'30'",
 'creationDate': "20130319094955+05'30'",
 'page': 3}

### 03. Embedding and storing

In this step, we convert each chunk into a dense vector (embedding) that captures its semantic meaning. We'll use a pretrained model from `sentence-transformers`.
We then store the vectors in a vector database of our choice. We'll use `faiss` in this example.
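
To build intuition for what "captures its semantic meaning" implies, texts with similar meaning should map to vectors that point in similar directions. The sketch below measures that with cosine similarity. The three-dimensional vectors are made-up toy values, not real model output (the model used below produces 384-dimensional vectors), and `cosine_similarity` is a hypothetical helper.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: close to 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings", invented for illustration only.
encryption = [0.9, 0.1, 0.0]
cipher = [0.8, 0.2, 0.1]
cooking = [0.0, 0.1, 0.9]

print(cosine_similarity(encryption, cipher))   # high: related topics
print(cosine_similarity(encryption, cooking))  # low: unrelated topics
```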

In [8]:
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

text = "This is a test document."
query_result = embeddings.embed_query(text)

# Show only the first 100 characters of the embedding's string representation
print(str(query_result)[:100] + "...")

[-0.0383385494351387, 0.1234646886587143, -0.02864295244216919, 0.053652748465538025, 0.008845349773...


In [9]:
import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS

# all-MiniLM-L6-v2 produces 384-dimensional embeddings, so the index dimension must be 384
index = faiss.IndexFlatL2(384)

vector_store = FAISS(
    embedding_function=embeddings,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)
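
`IndexFlatL2` performs exact (brute-force) search: it compares the query with every stored vector using squared Euclidean distance. Here is a pure-Python sketch of that idea, for intuition only. `flat_l2_search` is a hypothetical helper, and FAISS itself does this in optimized native code.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(vectors: list[list[float]], query: list[float], k: int) -> list[tuple[float, int]]:
    """Return the k nearest stored vectors as (distance, index) pairs, nearest first."""
    scored = [(squared_l2(vec, query), idx) for idx, vec in enumerate(vectors)]
    scored.sort()  # smallest distance first
    return scored[:k]


# Toy 2-dimensional vectors, invented for illustration.
stored = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
nearest = flat_l2_search(stored, query=[0.9, 1.2], k=2)
print([idx for _, idx in nearest])
# [1, 0]
```

Because every stored vector is scanned, the result is exact, but the cost grows linearly with the number of chunks. That is fine for a few hundred chunks like here.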


In [10]:
from uuid import uuid4


uuids = [str(uuid4()) for _ in range(len(chunks))]
vector_store.add_documents(documents=chunks, ids=uuids)

['473f89cc-b0dd-4e63-8ffd-624ca4e848ce',
 '798c17d8-d499-4108-a8f8-3d84996fbe82',
 '223dd10a-c9a5-4d65-8615-0a073e9c4028',
 'fa08a5c5-0445-4a44-8efe-a1303265e889',
 'f83d7b00-757e-450e-8976-f3ac13121f07',
 '363383ea-af9a-4b7e-a922-a7267b71052d',
 '707dff2e-5160-4f2a-bdc9-dcf2fba762b3',
 '94728be2-52da-4c13-9be5-6679a63dea95',
 '4f28adf3-7eb8-458e-bff7-f8674092c15e',
 '087da8a2-2e0c-4636-9f10-12d1458900de',
 '14f4767c-3e3d-48c9-84b3-2a9ca3f7cdeb',
 '2e30d773-35c3-4346-ab96-bff24784b038',
 'b242473e-c182-4141-9d3f-2e16e4a8cebf',
 'f751b45e-df59-40f4-b841-2c6b09e08272',
 'f855fc77-e88e-4817-98f2-eec0e01f74d6',
 'f8fdbbce-0e3f-4034-8fdf-d5557a36418d',
 '849f129b-6b23-4ef2-ba2d-8a7df0e0825e',
 'f2dafd9d-9289-44e2-ae1c-c8275b8156e8',
 'acacf722-e73b-4fbf-8f89-0cccb6339af3',
 '5dad2c77-397d-4859-a738-c2e3a7d398c5',
 'f1f5e501-9d08-4f7c-9fbe-2de460c861e8',
 'a71fc9d9-48b2-495f-b982-627c59295b17',
 'd0a3e3f9-4994-47d0-9b35-7ae9045590a0',
 '053f51ac-b91c-4191-bbf7-fc55a70d8d73',
 'b067d343-bab4-

In [11]:
vector_store.save_local("vector_store")

### 04. Retrieving

In [12]:
results = vector_store.similarity_search(
    "Define Diffie-Hellman Key Exchange",
    k=3,
)
print("Relevant chunks metadata: ")
for res in results:
    print(res.metadata["page"])


Relevant chunks metadata: 
52
71
70


### 05. Setting up system prompts and LLMs

In [13]:
prompt_template = """
You are a helpful and intelligent study assistant. You are given a list of JSON objects as context, each representing extracted text from a PDF.

Each object contains:
- 'text': the actual content
- 'filename': the PDF file name
- 'page_no': the page number

Your task is to answer the student's question based **primarily** on the 'text' fields in the context. You may reason and infer the answer if the exact wording is not available, as long as your answer is clearly supported by the content.

# Context:
{context}

# Question:
{question}

# Instructions:
- Use only the 'text' field from the context entries for answering the question, but include the most relevant 'filename' and 'page_no' you used.
- Even if the original text is technical or unclear, rewrite the answer in a simple, **student-friendly** way that is easy to understand.
- You can use **Markdown formatting** (headings, bullet points, code blocks, tables, etc.) to make the answer more readable and structured.
- You **may infer** or **summarize** answers from the content to help students understand, even if the answer is not a perfect match.
- Avoid saying you don’t know unless the question is entirely unrelated to the context.
- Return a JSON object with:
  - "answer": your helpful, clear, Markdown-formatted answer
  - "filename": the filename of the most relevant entry you used
  - "page_no": the corresponding page number
  
Respond ONLY with a valid JSON object. Do not include any explanation, formatting, or extra text."""

In [None]:
# from langchain_google_genai import ChatGoogleGenerativeAI
# import os
# from dotenv import load_dotenv

# load_dotenv()

# model = ChatGoogleGenerativeAI(
#     model="gemini-2.0-flash",
#     api_key=os.getenv("GEMINI_API_KEY")
# )

In [19]:
from langchain_groq import ChatGroq
import os
from dotenv import load_dotenv

load_dotenv()

model = ChatGroq(
    model="llama-3.3-70b-versatile",
    api_key=os.getenv("GROQ_API_KEY")
)

In [20]:
from typing import Optional
from pydantic import BaseModel

class Response(BaseModel):
    answer: str
    filename: Optional[str] = None
    page_no: Optional[int] = None

In [21]:
structured_model = model.with_structured_output(schema=Response)

In [22]:
structured_model.invoke("hi")

Response(answer='Hello! How can I help you today?', filename=None, page_no=None)
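
Conceptually, structured output means the model's reply is parsed and validated against the schema instead of being used as free text. The standard-library sketch below mimics that with a dataclass. `ParsedResponse` and `parse_response` are hypothetical stand-ins and do not show what `with_structured_output` does internally.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedResponse:
    answer: str
    filename: Optional[str] = None
    page_no: Optional[int] = None


def parse_response(raw: str) -> ParsedResponse:
    """Parse a JSON string and check it against the expected fields and types."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(data.get("answer"), str):
        raise ValueError("'answer' must be a string")

    filename = data.get("filename")
    if filename is not None and not isinstance(filename, str):
        raise ValueError("'filename' must be a string or null")

    page_no = data.get("page_no")
    if page_no is not None and not isinstance(page_no, int):
        raise ValueError("'page_no' must be an integer or null")

    return ParsedResponse(answer=data["answer"], filename=filename, page_no=page_no)


print(parse_response('{"answer": "Hello!", "page_no": 3}'))
```

Validation failures surface as exceptions rather than as malformed answers reaching the user, which is the main practical benefit of a schema.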

### 06. Getting the response

In [26]:
import json

question = "What is IDEA ?"

similar_chunks = vector_store.similarity_search(
    query=question,
    k=5
)

context_list = []
for chunk in similar_chunks:
    context_list.append({
        "text": chunk.page_content,
        "page_no": chunk.metadata["page"] + 1,
        "filename": chunk.metadata["file_path"]
    })

context = json.dumps(context_list, indent=2)

prompt = prompt_template.format(context=context, question=question)

print(prompt)


You are a helpful and intelligent study assistant. You are given a list of JSON objects as context, each representing extracted text from a PDF.

Each object contains:
- 'text': the actual content
- 'filename': the PDF file name
- 'page_no': the page number

Your task is to answer the student's question based **primarily** on the 'text' fields in the context. You may reason and infer the answer if the exact wording is not available, as long as your answer is clearly supported by the content.

# Context:
[
  {
    "text": "Source: www.csitnepal.com (Compiled by Tej Shahi) \nPage 12 \n \nOperation: IDEA operates on 64-bit blocks using a 128-bit key, and consists of a series of eight \nidentical rounds (see figure below) and an output round (the half-round, see figure below). The \nprocesses for encryption and decryption are similar. IDEA derives much of its security by \ninterleaving operations from different groups - modular addition (addition mod 216, denoted by\n), modular multiplica

In [27]:
resp = structured_model.invoke(prompt)

print(
    f"Question: {question}\nAnswer: {resp.answer}\nPage number: {resp.page_no}"
)

Question: What is IDEA ?
Answer: IDEA (International Data Encryption Algorithm) is a type of encryption algorithm that operates on 64-bit blocks using a 128-bit key. It consists of eight identical rounds and an output round, and uses a combination of modular addition, modular multiplication, and bitwise XOR operations to provide security. IDEA derives its security from interleaving these operations, which are algebraically 'incompatible' in some sense.
Page number: 33
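
Because the model chooses `page_no` itself, it can be worth checking that the cited page is one of the pages that were actually passed in as context. Below is a small sketch, assuming a list shaped like `context_list` above. `is_citation_grounded` is a hypothetical helper and the sample entries are made-up values.

```python
from typing import Optional


def is_citation_grounded(page_no: Optional[int], context_list: list[dict]) -> bool:
    """Return True if the cited page number appears among the context entries."""
    context_pages = {entry["page_no"] for entry in context_list}
    return page_no in context_pages


# Made-up context entries with the same keys as context_list.
sample_context = [
    {"text": "...", "page_no": 3, "filename": "example.pdf"},
    {"text": "...", "page_no": 7, "filename": "example.pdf"},
]

print(is_citation_grounded(7, sample_context))
print(is_citation_grounded(99, sample_context))
```

In the notebook this could be called as `is_citation_grounded(resp.page_no, context_list)` before showing the page number to the student.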
