### This code uses the BAAI/bge-m3 embedding model to generate embeddings
### and the DeepSeek LLM (deepseek-r1:7b), served through Ollama, to answer queries.
### The implementation works well for queries about Pay Levels 2 to 4, returning relevant responses.
### For Level 3-specific queries in particular, it successfully retrieves the relevant context and generates an answer.

In [1]:
from llama_index.llms.ollama import Ollama

In [2]:
import pdfplumber
import pandas as pd
from llama_index.core import Document

def clean_text(text):
    """Cleans up extracted text by removing excessive spaces, blank lines, and redundant patterns."""
    if not text:
        return ""
    lines = text.splitlines()
    cleaned_lines = [
        line.strip() for line in lines if line.strip() and len(line.strip()) > 3
    ]  # Remove short/noisy lines
    return "\n".join(cleaned_lines)

def extract_text_and_tables_from_pdf(pdf_path):
    """Extracts and returns text and tables from all pages of a PDF."""
    extracted_text = []
    extracted_tables = []

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Extract text; skip pages with no usable text so that no
            # placeholder string ends up embedded in the index
            text = page.extract_text()
            cleaned_text = clean_text(text)  # clean_text returns "" for None/empty input
            if cleaned_text:
                extracted_text.append(cleaned_text)

            # Extract tables
            tables = page.extract_tables()
            extracted_tables.extend(tables)

    return "\n".join(extracted_text), extracted_tables  # Return text and tables

# Call the function and store results
pdf_path = r"C:\RAG_project\rag_llamaindex\data\HR-Policy-Manual-2023.pdf"
pdf_text, extracted_tables = extract_text_and_tables_from_pdf(pdf_path)

# Create Documents
documents = []
documents.append(Document(text=pdf_text))
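for table in extracted_tables:
    if not table:
        continue  # Skip empty tables
    df = pd.DataFrame(table)
    # DataFrame.to_markdown() requires the optional "tabulate" package
    documents.append(Document(text=df.to_markdown()))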
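
`pd.DataFrame(table)` treats every row as data, so the Markdown that gets indexed carries integer column labels and a numeric row index instead of the table's own header. The sketch below is one possible alternative, not part of the original notebook: a hypothetical helper, `table_to_markdown`, that takes the list-of-rows structure returned by `page.extract_tables()` (cells may be `None`) and assumes the first row is the header. That assumption will not hold for every table in the PDF, and pipe characters inside cells are not escaped.

```python
def table_to_markdown(rows):
    """Render a list-of-rows table as Markdown, using the first row as the header.

    Hypothetical helper: assumes `rows` looks like pdfplumber's extract_tables()
    output, i.e. a list of rows whose cells are strings or None.
    """
    def cell(value):
        # Replace None with "" and collapse embedded newlines/whitespace
        return "" if value is None else " ".join(str(value).split())

    # Drop rows that are empty or contain only None/empty cells
    cleaned = [[cell(v) for v in row] for row in rows if row and any(row)]
    if not cleaned:
        return ""

    # Pad ragged rows so every row has the same number of columns
    width = max(len(row) for row in cleaned)
    cleaned = [row + [""] * (width - len(row)) for row in cleaned]

    header, *body = cleaned
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in range(width)) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


# Illustrative input only; these values are not taken from the PDF
sample = [["Col A", "Col B"], ["x", None], [None, None]]
print(table_to_markdown(sample))
```

With this helper, the loop body could become `documents.append(Document(text=table_to_markdown(table)))`, which also removes the dependency on `tabulate`.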

In [3]:
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings

In [6]:
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-m3"
)
Settings.llm = Ollama(model="deepseek-r1:7b", request_timeout=600)
index = VectorStoreIndex.from_documents(documents)

In [None]:
query_engine = index.as_query_engine()

response = query_engine.query(
    "I am interested in the 'Executive-IT (PGPX)' position. What are the required qualifications, experience, and age limit for this role? Additionally, does the '(PGPX)' designation have any impact on the eligibility criteria?"
)
print(response)

<think>
Okay, so I need to figure out the answer to the query about the 'Executive-IT (PGPX)' position based solely on the provided context. The user is asking for qualifications, experience, age limit, and whether the PGPX designation affects eligibility.

First, looking at the context table. There's a row labeled 93 with "Executive-IT" as the position. The Qualification column says BCA/MCA or equivalent, which makes sense since IT professionals often have computer-related degrees. 

For experience, it mentions three to thirty years, so min 03 Max 30. That's pretty broad, but I think they mean at least three years. The nature of duty includes specialized work in IT coordination and post-graduate fields.

Looking at the PGPX part: there's a note that the institute may change recruitment criteria as needed. So, while PGPX is mentioned in some other positions like AVPs, for this Executive-IT role, it doesn't specify whether PGPX affects the criteria. But since the instructions say to avo
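
The printed responses start with a `<think>` block because deepseek-r1 emits its reasoning before the final answer. If only the answer is wanted, the reasoning can be removed after the fact. The following is a minimal sketch using a hypothetical helper, `strip_think`, that is not part of the original notebook; it also drops an unterminated `<think>` block, as happens when the output is cut off.

```python
import re

# Matches a <think> block up to its closing tag, or up to the end of the
# text when the closing tag is missing (truncated output)
THINK_BLOCK = re.compile(r"<think>.*?(?:</think>|\Z)", re.DOTALL)


def strip_think(text: str) -> str:
    """Return `text` with any <think>...</think> reasoning removed."""
    return THINK_BLOCK.sub("", text).strip()


# Illustrative string, not an actual model response
raw = "<think>\nStep-by-step reasoning...\n</think>\n\nFinal answer."
print(strip_think(raw))
```

In the notebook this would be applied as `print(strip_think(str(response)))`.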

In [9]:
query_engine = index.as_query_engine()

response = query_engine.query("What is the GSLI Sum Assured for a staff member in Pay Level 3?")
print(response)

<think>
Okay, so I need to figure out what the GSLI Sum Assured is for a staff member in Pay Level 3 based on the provided context. Let me go through this step by step.

First, I'll look at the context given. There's a section about the DEPOSIT-LINKED INSURANCE SCHEME under the General Provident Fund (GPF/CPF). This part is crucial because it details how the benefits are calculated.

Looking at the table provided in the context, there are categories A to D, each corresponding to different pay levels and their respective sums assured. Specifically:

- Category “A” covers Pay Level 9 and above with a sum assured of 2,80,000/-.
- Category “B” covers Pay Levels 5 to 8 with 2,25,000/-.
- Category “C” covers Pay Levels 2 to 4 with 1,40,000/-.
- Category “D” is for Pay Level 1 with 70,000/-.

Now, the question is about a staff member in Pay Level 3. From the table, Pay Level 3 falls under Category “C”. Therefore, the Sum Assured should be 1,40,000/- based on category C's details provided.

I 
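
The category table quoted in the output above can be written down as a small lookup, which gives a quick way to check the model's answer for any pay level. This is a sketch only: the figures are transcribed from the model's output, not verified against the PDF, and the names `GSLI_CATEGORIES` and `gsli_sum_assured` are hypothetical.

```python
# (category, minimum pay level, sum assured in rupees), highest category first.
# Values are copied from the model output above and are NOT checked against the PDF.
GSLI_CATEGORIES = [
    ("A", 9, 280_000),  # Pay Level 9 and above
    ("B", 5, 225_000),  # Pay Levels 5 to 8
    ("C", 2, 140_000),  # Pay Levels 2 to 4
    ("D", 1, 70_000),   # Pay Level 1
]


def gsli_sum_assured(pay_level: int):
    """Return (category, sum assured) for a pay level, per the table above."""
    if pay_level < 1:
        raise ValueError("pay_level must be 1 or higher")
    for category, min_level, amount in GSLI_CATEGORIES:
        if pay_level >= min_level:
            return category, amount


print(gsli_sum_assured(3))
```

For Pay Level 3 this returns Category "C" with 1,40,000/-, which matches the conclusion the model reaches in its reasoning.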