In [2]:
from llama_index.llms.ollama import Ollama
from llama_index.core import Settings

In [None]:
# llm = Ollama(model="deepseek-r1:7b",request_timeout=600)

# response = llm.complete("What is Code")
# print(response)

In [None]:
import pdfplumber
import pandas as pd
from llama_index.core import Document

def clean_text(text):
    """Cleans up extracted text by removing excessive spaces, blank lines, and redundant patterns."""
    if not text:
        return ""
    lines = text.splitlines()
    cleaned_lines = [
        line.strip() for line in lines if line.strip() and len(line.strip()) > 3
    ]  # Remove short/noisy lines
    return "\n".join(cleaned_lines)

def extract_text_and_tables_from_pdf(pdf_path):
    """Extracts and returns text and tables from all pages of a PDF."""
    extracted_text = []
    extracted_tables = []

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Extract text; skip pages that yield nothing so that no
            # placeholder string ends up in the index
            cleaned_text = clean_text(page.extract_text())
            if cleaned_text:
                extracted_text.append(cleaned_text)

            # Extract tables
            tables = page.extract_tables()
            extracted_tables.extend(tables)

    return "\n".join(extracted_text), extracted_tables  # Return text and tables

# Call the function and store results
pdf_path = r"C:\RAG_project\rag_llamaindex\data\HR-Policy-Manual-2023.pdf"
pdf_text, extracted_tables = extract_text_and_tables_from_pdf(pdf_path)

# Create Documents
documents = []
documents.append(Document(text=pdf_text))
for table in extracted_tables:
    df = pd.DataFrame(table)
    documents.append(Document(text=df.to_markdown()))
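
`DataFrame.to_markdown()` relies on the optional `tabulate` package, and `pd.DataFrame(table)` labels the columns 0, 1, 2, ... rather than using the table's own header row. As a rough alternative, the sketch below renders a list-of-rows table (the shape `extract_tables()` returns) as Markdown with the first row as the header. The helper name `table_to_markdown` and the sample rows are made up for illustration, and pipe characters inside cells are not escaped.

```python
def table_to_markdown(table):
    """Render a list-of-rows table as Markdown, treating row 0 as the header."""
    if not table:
        return ""
    # Replace missing cells with "" and flatten line breaks inside cells
    rows = [
        ["" if cell is None else str(cell).replace("\n", " ").strip() for cell in row]
        for row in table
    ]
    width = max(len(row) for row in rows)
    rows = [row + [""] * (width - len(row)) for row in rows]  # pad ragged rows
    header, body = rows[0], rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


# Made-up sample rows, not taken from the PDF
sample = [["Position", "Pay Level"], ["Example role", None]]
print(table_to_markdown(sample))
```

If this were used in the loop above, `Document(text=table_to_markdown(table))` would take the place of the pandas call. Whether the first row of each extracted table really is a header depends on the PDF.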

In [5]:
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings

In [8]:
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)
Settings.llm = Ollama(model="deepseek-r1:7b", request_timeout=600)
index = VectorStoreIndex.from_documents(documents)

In [12]:
query_engine = index.as_query_engine()

response = query_engine.query("What qualifications are required for the position of 'Executive- IT (PGPX)'?")

In [13]:
print(response)

<think>
Okay, so I need to figure out what qualifications are required for the position of 'Executive- IT (PGPX)' based on the provided context. Let me go through the information step by step.

First, looking at the table, there's a row labeled "1" with several columns. The position in question is under column 0: Position, which is "Executive- IT". In that row, the Qualification listed is BCA/MCA or related graduate/post graduate degree in computer science from any recognised university.

So, it seems like the primary qualifications required are a Bachelor of Computer Applications (BCA) or Master of Computer Applications (MCA), along with possibly a related graduate or postgraduate degree. Additionally, the individual should have experience: minimum three years in IT coordination with the Computer Centre and maximum 30 years of age.

Wait, but I should double-check to make sure there aren't any other qualifications mentioned elsewhere that might apply here. Looking through the rest of 
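
The printed response opens with a `<think>` block: the model writes out its reasoning before the answer. If only the final answer is wanted, the reasoning can be stripped after the fact. A minimal sketch, assuming the reasoning is always wrapped in `<think>...</think>` tags; `strip_think` is a hypothetical helper, not part of LlamaIndex or Ollama.

```python
import re


def strip_think(text):
    """Return the text with any <think>...</think> reasoning removed."""
    text = str(text)
    # Drop complete reasoning blocks
    cleaned = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    # If the closing tag never arrived (cut-off output), drop everything
    # from the opening tag onward
    cleaned = re.sub(r"<think>.*", "", cleaned, flags=re.DOTALL)
    return cleaned.strip()


print(strip_think("<think>\nsome reasoning\n</think>\n\nFinal answer."))
```

With the query engine above, this would be called as `print(strip_think(response))`. An output that ends before `</think>` comes back as an empty string, which makes cut-off responses easy to spot.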

In [22]:
response = query_engine.query("What is the Pay Level of the Vice President?")
print(response)

<think>
Okay, so I need to figure out the Pay Level of the Vice President based on the provided context. Let me start by looking through the given data.

I see that there's a table with different levels labeled from 0 to 5, and each level has specific positions and their respective pay levels. For example, under level 1, it says Vice President / Associate Vice President – JSW School of Public Policy. The Pay Level for these positions is 5. So VP/Associate VP falls under Pay Level 5.

Wait, but I should check if there are any other Vice Presidents mentioned with different pay levels. Looking further down, level 2 has a Vice President heading the IGPC Office at a minimum of 15 years and a maximum age of 40? Hmm, that's more about experience and age limits rather than pay level.

Another VP is under level 3 as Vice President – HR, with a minimum of 20 years. Again, not directly related to pay level but position details.

Level 4 has another VP in Corporate Relations with a minimum of 10 y

In [23]:
response = query_engine.query("How many points after Successful completion of any Master’s Degree? ")
print(response)

<think>
Okay, so I'm trying to figure out how many points are awarded after successfully completing a Master's Degree based on the context provided. Let me start by carefully reading through the information given.

Looking at the first table, it seems to outline different categories or levels of qualification and their corresponding points. The columns are labeled as Sr., Qualification, and Points. Under "Qualification," there are several rows with different qualifications listed along with their respective points.

The first row under Qualification mentions a Master’s Degree, and next to it is the number 10 in parentheses. So that tells me directly that a successful completion of any Master's Degree gives 10 points. That seems straightforward.

Then, I check the second part of the context where there are additional details about Management degrees and experience levels. However, those sections talk about other qualifications like MBA, PG Degrees, and various years of experience but do

In [24]:
response = query_engine.query("What is the GSLI Sum Assured for a staff member in Pay Level 3?")
print(response)

<think>
Okay, I'm trying to figure out what the GSLI Sum Assured is for a staff member in Pay Level 3. Let me start by looking at the context provided.

First, I see that there's a section about medical facilities, which probably isn't directly relevant here. Then there are some details about the dispensary hours and contact information for doctors, but again, not sure how that ties into the Sum Assured.

The key part of the context seems to be around staff housing. It lists different types of houses available based on pay levels. The Pay Levels mentioned are 11, 10, 6-9, 2-5, and a specific group D for Type S houses.

Wait, the query is about Pay Level 3, which is between 2 and 5. Looking at the table provided:

- Type IIA is available for employees in Pay Level 6 to 9.
- Type I is for Pay Levels 2 to 5.

So, a staff member in Pay Level 3 would fall under Type I housing. The context doesn't specify any sum assured values based on pay levels; it only talks about the types of houses ava
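
The reasoning above suggests the retriever returned chunks about medical facilities and staff housing rather than anything about GSLI. To check what was retrieved, the chunks attached to the response can be listed with their similarity scores. The sketch below assumes the response exposes `source_nodes`, each with a `score` and a `node` offering `get_content()`, as LlamaIndex response objects do. `summarize_sources` is a hypothetical helper, and the stand-in object and its score are invented so the example runs without an index.

```python
from types import SimpleNamespace


def summarize_sources(response, max_chars=80):
    """Return (score, snippet) pairs for each retrieved chunk on a response."""
    summary = []
    for item in getattr(response, "source_nodes", []):
        # Collapse whitespace so each snippet fits on one line
        snippet = " ".join(item.node.get_content().split())[:max_chars]
        summary.append((item.score, snippet))
    return summary


# Stand-in response with invented content; pass the real `response` from
# query_engine.query(...) instead when running inside the notebook.
fake_node = SimpleNamespace(get_content=lambda: "Type I housing is for\nPay Levels 2 to 5.")
fake_response = SimpleNamespace(source_nodes=[SimpleNamespace(node=fake_node, score=0.71)])

for score, snippet in summarize_sources(fake_response):
    print(f"{score:.2f}  {snippet}")
```

If none of the listed snippets mention GSLI, the problem lies in retrieval (chunking, embedding model, or number of chunks retrieved) rather than in the LLM's answer.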

In [27]:
response = query_engine.query("Provide the breakdown of GMIS coverage (existing and top-up) for all groups.")
print(response)

<think>
Alright, I need to provide a detailed breakdown of the GMIS coverage including both existing and top-up options for each group. Let me start by carefully analyzing the context provided.

First, looking at Group A:

- For individuals in Group A who are faculty or staff members:
  - They have an existing GMIS coverage of Rs. 500,000.
  - The additional top-up available is Rs. 300,000.

Next, moving on to the second table which covers other groups where either faculty or staff members can opt for their parents' coverage:

- For Group A & B:
  - The existing GMIS coverage drops to Rs. 200,000.
  - The additional top-up increases to Rs. 300,000.

Then, looking at Groups C and D under the second table:

- These groups have an even lower base coverage of Rs. 100,000 with a higher top-up option of Rs. 300,000 available.

I should structure this information clearly, making sure to separate existing and top-up coverages for each group without using any prior knowledge beyond what's provi

In [None]:
# GPTQueryEngine and RAGMode are not names exported by llama_index;
# VectorStoreIndex plus index.as_query_engine() covers the same workflow.
from llama_index.core import Document, VectorStoreIndex
from llama_index.llms.ollama import Ollama
from pathlib import Path




In [None]:
llm = Ollama(model="deepseek-r1:7b", request_timeout=600)

# Create VectorStoreIndex (embeddings come from Settings.embed_model set earlier)
index = VectorStoreIndex.from_documents(documents)

# Create Query Engine that uses the local Ollama model to synthesize answers
query_engine = index.as_query_engine(llm=llm)

# Example questions
questions = [
    "What are the key findings of the test?",
    "What are the performance metrics of the system?",
    "How does the system compare to previous versions?",
    "What is the value in the second row and third column of the first table?" 
]

for question in questions:
    response = query_engine.query(question)
    print(f"**Question:** {question}")
    print(f"**Answer:** {response}\n")