# Indexing
This is the first part of RAG: we take the raw documents, parse them with parsers, split them with text splitters, compute their embeddings, and store (index) them in a vector database.

### 01. Parsing

In this step, we extract raw text from the source documents — for example, extracting text from a PDF. Several libraries and approaches can be used for PDF parsing:

* **Metadata-based**: Libraries like `PyMuPDF` (imported as `fitz`) or `pdfplumber` read the text layer embedded in the PDF. They are accurate for digitally generated PDFs (e.g., exports from Word or LaTeX), but do not work for scanned or image-based PDFs, where the text is not selectable.

* **OCR-based**: Tools like `pytesseract` or `easyocr` use Optical Character Recognition (OCR) to extract text from image-based PDFs. They are useful when metadata-based parsing fails. They are generally slower and somewhat less accurate, but they work on any page that can be rendered as an image.

* **Hybrid approaches**: Libraries like `docling` combine both metadata and OCR-based extraction. They can also preserve layout and structure — for example, outputting content in markdown or structured formats.

* **LLM-based parsing**: Tools like `llama-parse` use large language models to extract, clean, and structure content intelligently. These can handle messy layouts, mixed content, and even infer structure that isn’t explicitly present.
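
The fallback idea behind the hybrid approach can be sketched in plain Python: keep the text-layer result for a page unless it looks empty, and only then run OCR. This is a minimal sketch, not how `docling` works internally. The names `extract_with_fallback`, `ocr_page`, and `min_chars` are hypothetical, and the OCR step is a stand-in function rather than a real `pytesseract` call.

```python
from typing import Callable, Iterable


def extract_with_fallback(
    page_texts: Iterable[str],
    ocr_page: Callable[[int], str],
    min_chars: int = 20,  # assumed threshold for "this page has a usable text layer"
) -> list[str]:
    """Keep the text-layer result per page; fall back to OCR when it is (nearly) empty."""
    results = []
    for page_index, text in enumerate(page_texts):
        if len(text.strip()) >= min_chars:
            results.append(text)
        else:
            # Text layer missing or too short: likely a scanned page
            results.append(ocr_page(page_index))
    return results


def fake_ocr(page_index: int) -> str:
    """Stand-in for a real OCR call, used only for this demonstration."""
    return f"<OCR text for page {page_index}>"


pages = ["A digitally generated page with selectable text.", "   \n"]
extracted = extract_with_fallback(pages, fake_ocr)
print(extracted)
```

In a real pipeline, `page_texts` would come from something like `page.get_text()` and `ocr_page` would render the page to an image before running OCR on it.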


In [62]:
!pip install pymupdf



In [63]:
import fitz  # PyMuPDF

doc = fitz.open("../dataset/Syntax Analysis.pdf")  # Replace with your file path

all_text = ""
for page in doc:
    all_text += page.get_text() + "\n"

doc.close()

print(all_text)

 
Unit 3 
Syntax Analyzer 
……………………………………………………………………………………………………… 
Topics 
Syntax Analysis: Its role, Basic parsing techniques: Problem of Left Recursion, Left Factoring, 
Ambiguous Grammar, Top-down parsing, Bottom-up parsing, LR parsing. 
 
 
……………………………………………………………………………………………………… 
Syntax Analysis 
Syntax Analysis or Parsing is the second phase, i.e. after lexical analysis. It checks the syntactical 
structure of the given input, i.e. whether the given input is in the correct syntax or not. It does 
so by building a data structure, called a Parse tree or Syntax tree. The parse tree is constructed 
by using the pre-defined Grammar of the language and the input string. If the given input 
string can be produced with the help of the syntax tree, the input string is found to be in the 
correct syntax. If not, error is reported by syntax analyzer. 
 
 
 
 
 
 
 
 
In this chapter, we shall learn the basic concepts used in the construction of a parser. We have 
seen that a lexical analyz

### 02. Chunking

Once raw text is extracted, the next step is **chunking** — splitting the text into smaller, manageable segments (chunks) that can be indexed and retrieved efficiently by the system.

#### Tools & Libraries:

* `langchain.text_splitter` – Offers multiple splitters like `RecursiveCharacterTextSplitter`, `TokenTextSplitter`, etc.
* `llama-index` – Provides node parsers (for example, `SentenceSplitter`) for splitting documents.
* Custom chunkers using NLTK, spaCy, or basic Python logic.
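
The last option, a custom chunker in basic Python, might look like the sketch below. It cuts fixed-size character windows with an overlap and does not try to respect sentence or paragraph boundaries. The function name `chunk_text` and its parameters are illustrative and do not come from any library.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return pieces


sample = "abcdefghij" * 3  # 30 characters
demo_chunks = chunk_text(sample, chunk_size=12, chunk_overlap=4)
for piece in demo_chunks:
    print(repr(piece))
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.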


In [64]:
!pip install langchain



In [65]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Create the text splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,       # Max characters per chunk
    chunk_overlap=200,     # Overlap between chunks to preserve context
    separators=["\n\n", "\n", ".", " "]  # Priority of where to split
)

# Split the text
chunks = splitter.split_text(all_text)

# Print chunks
for i, chunk in enumerate(chunks, 1):
    print(f"--- Chunk {i} ---")
    print(chunk)


--- Chunk 1 ---
Unit 3 
Syntax Analyzer 
……………………………………………………………………………………………………… 
Topics 
Syntax Analysis: Its role, Basic parsing techniques: Problem of Left Recursion, Left Factoring, 
Ambiguous Grammar, Top-down parsing, Bottom-up parsing, LR parsing. 
 
 
……………………………………………………………………………………………………… 
Syntax Analysis 
Syntax Analysis or Parsing is the second phase, i.e. after lexical analysis. It checks the syntactical 
structure of the given input, i.e. whether the given input is in the correct syntax or not. It does 
so by building a data structure, called a Parse tree or Syntax tree. The parse tree is constructed 
by using the pre-defined Grammar of the language and the input string. If the given input 
string can be produced with the help of the syntax tree, the input string is found to be in the 
correct syntax. If not, error is reported by syntax analyzer. 
 
 
 
 
 
 
 
 
In this chapter, we shall learn the basic concepts used in the construction of a parser. We have
--- Chunk 2 -

### Combined function for parsing and chunking

In [None]:
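import os

import fitz  # PyMuPDF

def load_and_chunk_pdf(pdf_path):
    """Return one chunk per non-empty page, tagged with file name and page number."""
    doc = fitz.open(pdf_path)
    filename = os.path.basename(pdf_path)  # portable, unlike splitting on "/"
    chunks = []

    # Each page becomes a single chunk; no further splitting is applied here
    for page_num, page in enumerate(doc, start=1):
        text = page.get_text()
        if text.strip():  # skip pages with no extractable text
            chunks.append({
                "filename": filename,
                "text": text,
                "page_no": page_num
            })

    doc.close()
    return chunks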


pdf_chunks = load_and_chunk_pdf("../dataset/Syntax Analysis.pdf")
pdf_chunks[-5:]

[{'filename': 'Syntax Analysis.pdf',
  'text': 'In brief, kernel item includes the initial items, S‘→ .S and all items whose dot are not at the left \nend. Similarly non-kernel items are those items which have their dots at the left end except   \nS‘→ .S \nExample 1: Find the kernel and non-kernel items of following grammar, \nC → AB \n \nA → a \n \nB → a \nSolution: The augmented grammar of given grammar is, \n \n1. C‘→ C \n \n2. C → AB \n \n3. A → a \n \n4. B → a \nNext, we obtain the canonical collection of sets of LR(0) items, as follows, \nI0 = closure ({C‘ → •C}) = {C‘ → •C, C → •AB, A → •a} \nI1 = goto(I0, C) = closure(C‘ → C•) = {C‘ → C•} \nI2 = goto(I0, A) = closure(C → A•B) = {C → A•B, B → •a} \nI3 = goto(I0, a) = closure(A → a•) = {A → a•} \nI4 = goto(I2, B) = closure(C → AB•) = {C → AB•} \nI5 = goto(I2, a) = closure(B → a•) = {B → a•} \nList of kernel and non-kernel items are listed below; \nStates \nKernel Items \nNon-kernel items \nI0 \nC‘ → •C \nC → •AB \nA → •a \nI1 \nC

### 03. Embedding and storing

In this step, we convert each chunk into a dense vector (embedding) that captures its semantic meaning, using a pretrained model from `sentence-transformers`.
We then store the vectors in a vector database of our choice. This example uses `faiss`.
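
The code in this section pairs an inner-product index (`faiss.IndexFlatIP`) with `normalize_embeddings=True`. The reason is that the inner product of two unit-length vectors equals their cosine similarity. The sketch below checks this on two small hand-written vectors. The helper names are illustrative, and plain Python lists stand in for real embeddings.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed directly from the unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]

ip_normalized = dot(l2_normalize(a), l2_normalize(b))
cos_direct = cosine(a, b)
print(ip_normalized, cos_direct)  # the two values agree up to floating-point error
```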


In [None]:
from sentence_transformers import SentenceTransformer
import faiss
import json
import os

model = SentenceTransformer("all-MiniLM-L6-v2")
index_file = "my_index.faiss"
texts_file = "texts.json"

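# Load or create index
if os.path.exists(index_file):
    index = faiss.read_index(index_file)
else:
    # Inner product on L2-normalized vectors equals cosine similarity
    index = faiss.IndexFlatIP(384)  # 384 for MiniLM-L6-v2

# Load or init texts
if os.path.exists(texts_file):
    with open(texts_file, "r", encoding="utf-8") as f:
        texts = json.load(f)
else:
    texts = []

def add_text(new_items: list[dict]):
    """Embed new items, append them to the index, and persist both files."""
    global texts, index

    # Extract texts to embed
    raw_texts = [item["text"] for item in new_items]
    embeddings = model.encode(raw_texts, normalize_embeddings=True).astype("float32")

    # Vector i in the index corresponds to texts[i], so both must grow together
    index.add(embeddings)
    texts.extend(new_items)
    
    faiss.write_index(index, index_file)
    # Write UTF-8 explicitly, since ensure_ascii=False keeps non-ASCII characters
    with open(texts_file, "w", encoding="utf-8") as f:
        json.dump(texts, f, ensure_ascii=False, indent=2)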


add_text(pdf_chunks)
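
Note that `add_text` appends on every call and persists the result, so re-running the cell above with the same PDF stores each page a second time. One possible guard is to filter out items that are already present before adding them. The sketch below assumes that the pair (`filename`, `page_no`) identifies a chunk. The helper `filter_new_items` is hypothetical and not part of the notebook.

```python
def filter_new_items(new_items: list[dict], existing_items: list[dict]) -> list[dict]:
    """Return only the items whose (filename, page_no) pair is not stored yet."""
    seen = {(item["filename"], item["page_no"]) for item in existing_items}
    fresh = []
    for item in new_items:
        key = (item["filename"], item["page_no"])
        if key not in seen:
            seen.add(key)  # also drops duplicates inside new_items itself
            fresh.append(item)
    return fresh


stored = [{"filename": "a.pdf", "page_no": 1, "text": "first page"}]
incoming = [
    {"filename": "a.pdf", "page_no": 1, "text": "first page"},   # already stored
    {"filename": "a.pdf", "page_no": 2, "text": "second page"},  # new
]
to_add = filter_new_items(incoming, stored)
print(to_add)
```

With this helper, the call could become `add_text(filter_new_items(pdf_chunks, texts))`, skipped entirely when the filtered list is empty.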


# Generation

This is the second half of the RAG pipeline: we take the user's query, retrieve the relevant chunks from the vector database, combine the system prompt, the retrieved context, and the query into a single prompt, and pass it to the LLM.

### 04. Retrieving

In [None]:
def retrieve(query, top_k=5):
    query_emb = model.encode([query], normalize_embeddings=True).astype('float32')
    D, I = index.search(query_emb, top_k)
    return [texts[i] for i in I[0] if i != -1]

retrieve("What are software engineering ethics?")

[{'filename': 'note.pdf',
  'text': 'Software Engineering Ethics\n•\nThe rationale behind above code is summarized in the first two paragraphs of the longer \nform which is described as below:\n•\nComputers have a central and growing role in commerce, industry, government, medicine, \neducation, entertainment and society at large.\n•\nSoftware engineers are those who contribute by direct participation or by teaching, to the \nanalysis, specification, design, development, certification, maintenance and testing of \nsoftware systems.\n•\nBecause of their roles in developing software systems, software engineers have significant \nopportunities to do good or cause harm, to enable others to do good or cause harm, or to \ninfluence others to do good or cause harm.\n•\n To ensure, as much as possible, that their efforts will be used for good, software engineers \nmust commit themselves to making software engineering a beneficial and respected \nprofession.\nChapter 1 Introduction\n30/10/2014\
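
For a flat index, `index.search` amounts to scoring the query against every stored vector and keeping the best `top_k`. When the index holds fewer than `top_k` vectors, FAISS fills the unused result slots with the id `-1`, which is why `retrieve` filters on `i != -1`. The brute-force sketch below mimics that behaviour in plain Python. The function `top_k_inner_product` is illustrative and is not the FAISS implementation.

```python
def top_k_inner_product(query, vectors, k):
    """Score every vector against the query and return the k best (scores, ids)."""
    scored = [
        (sum(q * v for q, v in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest score first
    top = scored[:k]

    scores = [score for score, _ in top]
    ids = [idx for _, idx in top]

    # Pad the result when fewer than k vectors exist, mimicking the -1 ids from FAISS
    missing = k - len(ids)
    scores += [float("-inf")] * missing
    ids += [-1] * missing
    return scores, ids


stored_vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
scores, ids = top_k_inner_product([0.6, 0.8], stored_vectors, k=5)
print(ids)

valid_ids = [i for i in ids if i != -1]  # same filter as in retrieve()
print(valid_ids)
```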

### 05. Calling LLM with proper prompt

In [None]:
prompt_template = """
You are a helpful and intelligent study assistant. You are given a list of JSON objects as context, each representing extracted text from a PDF.

Each object contains:
- 'text': the actual content
- 'filename': the PDF file name
- 'page_no': the page number

Your task is to answer the student's question based **primarily** on the 'text' fields in the context. You may reason and infer the answer if the exact wording is not available, as long as your answer is clearly supported by the content.

# Context:
{context}

# Question:
{question}

# Instructions:
- Use only the 'text' field from the context entries for answering the question, but include the most relevant 'filename' and 'page_no' you used.
- Even if the original text is technical or unclear, rewrite the answer in a simple, **student-friendly** way that is easy to understand.
- You can use **Markdown formatting** (headings, bullet points, code blocks, tables, etc.) to make the answer more readable and structured.
- You **may infer** or **summarize** answers from the content to help students understand, even if the answer is not a perfect match.
- Avoid saying you don’t know unless the question is entirely unrelated to the context.
- Return a JSON object with:
  - "answer": your helpful, clear, Markdown-formatted answer
  - "filename": the filename of the most relevant entry you used
  - "page_no": the corresponding page number

- If the context contains **nothing relevant at all** to the question, return:
  ```json
  {{
    "answer": "I'm sorry, but I don't have enough information to answer that.",
    "filename": null,
    "page_no": null
  }}
  ```
Respond ONLY with a valid JSON object. Do not include any explanation, formatting, or extra text.
"""
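
The fallback JSON in the template is written with doubled braces (`{{` and `}}`). Doubled braces are collapsed to single braces only by `str.format`, while `str.replace` leaves them untouched, so the way the placeholders get filled decides what the model actually sees. A small sketch on a shortened, made-up template:

```python
# Shortened stand-in for the real prompt template
template = 'Question: {question}\nFallback: {{"answer": null}}'

with_format = template.format(question="What is parsing?")
with_replace = template.replace("{question}", "What is parsing?")

print(with_format)   # fallback appears with single braces
print(with_replace)  # fallback still shows the doubled braces

# Values passed to format() are inserted as-is, so braces inside them are safe
with_json_value = template.format(question='{"nested": 1}')
print(with_json_value)
```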

In [None]:
from langchain_google_genai import ChatGoogleGenerativeAI
import os
from dotenv import load_dotenv

load_dotenv()

llm = ChatGoogleGenerativeAI(
    model="gemini-1.5-flash",
    api_key=os.getenv("GEMINI_API_KEY")
)

In [None]:
chunks

['Introduction to Software Engineering\nDefinition of software:\n•\nGenerally, people equate the term software with computer programs. However, a \nbroader definition is: Software is not just the programs but also all associated \ndocumentation and configuration data that is needed to make these programs \noperate correctly.\n•\nA software system usually consists of a number of separate programs, configuration \nfiles, which are used to set up these programs, system documentation, which \ndescribes the structure of the system, and user documentation, which explains \nhow to use the system and web sites for users to download recent product \ninformation.\n•\nSoftware engineers are concerned with developing software products, i.e., \nsoftware which can be sold to a customer.',
 'Software and its Types\nTypes of software product\nGeneric products\n• These are stand-alone systems that are produced by a development organization and sold on the \nopen market to any customer who is able to bu

In [None]:
import json
import re

def json_to_obj(json_str: str) -> dict:
    # Strip a leading and a trailing Markdown code fence if the model added them
    cleaned = re.sub(r"^```(?:json)?\s*|```$", "", json_str.strip(), flags=re.IGNORECASE)
    
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError as e:
        # Return an empty dict so callers can detect the failure without crashing
        print("Failed to parse JSON:", e)
        return {}


def rag_answer(question):
    chunks = retrieve(question)
    context = json.dumps(chunks)
    # format() fills the placeholders and turns the template's {{ }} into literal braces
    prompt = prompt_template.format(context=context, question=question)
    response = llm.invoke(prompt)
    obj = json_to_obj(response.content)
    return obj

In [None]:
answer = rag_answer("What are software development ethics?")
answer

{'answer': 'Software engineers shall commit themselves to making the analysis, specification, design, development, testing and maintenance of software a beneficial and respected profession.\n In accordance with their commitment to the health, safety and welfare of the public, software engineers shall adhere to the following Eight Principles:\n1. PUBLIC – Software engineers shall act consistently with the public interest.\n2. CLIENT AND EMPLOYER – Software engineers shall act in a manner that is in the best interests of their client and employer consistent with the public interest.\n3. PRODUCT – Software engineers shall ensure that their products and related modifications meet the highest professional standards possible.\n4. JUDGMENT – Software engineers shall maintain integrity and independence in their professional judgment.\n5. MANAGEMENT – Software engineering managers and leaders shall subscribe to and promote an ethical approach to the management of software development and mainte

In [None]:
answer = rag_answer("What is cyber security")
answer

{'answer': "I'm sorry, but I don't have enough information to answer that.",
 'filename': None,
 'page_no': None}