# Chat with Multiple PDFs using Gemini and LangChain

In [None]:
# Install required libraries from requirements.txt
!pip install -r requirements.txt



## Section 1: Imports and Environment Setup

In [None]:
import os
from dotenv import load_dotenv
from PyPDF2 import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from langchain.vectorstores import FAISS
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.prompts import PromptTemplate
import google.generativeai as genai


**Loading Environment Variables and Configuring Google GenAI**

It's crucial to keep sensitive information like API keys out of your main code. We'll use a .env file for this and load it using python-dotenv. Then, we'll configure the Google Generative AI SDK with our API key.



In [None]:
# Load environment variables from .env file (especially the Google API Key)
load_dotenv() # This function searches for a .env file in the current directory and loads any key-value pairs found into the environment variables.

# Configure Google GenAI with API key
google_api_key = os.getenv("GOOGLE_API_KEY") # Retrieves the value of the "GOOGLE_API_KEY" environment variable.
if not google_api_key: # Checks if the API key was successfully loaded.
    raise ValueError("Please set the GOOGLE_API_KEY in a .env file.") # If not found, raises an error instructing the user to set it.
else:
    genai.configure(api_key=google_api_key) # Configures the Google Generative AI library with the retrieved API key. This is essential for authentication.


**Setting Model and Chunking Parameters**

These parameters influence how our LLM behaves and how our documents are processed. temperature controls the randomness of the LLM's output, while chunk_size and chunk_overlap determine how the PDF text is divided for embedding and retrieval.

In [None]:
# Set temperature and chunking configuration

temperature = 0.3 # Controls the randomness/creativity of the LLM's responses. Lower values (e.g., 0.1-0.3) make the output more deterministic and factual.
chunk_size = 1000 # The maximum size (in characters) of each text chunk after splitting.
chunk_overlap = 300 # The number of characters that consecutive chunks will overlap. This helps maintain context across chunk boundaries.
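
To make the interaction between chunk_size and chunk_overlap concrete, here is a minimal sketch of a fixed-size sliding window. It is not how RecursiveCharacterTextSplitter works internally (that splitter prefers to break on separators such as paragraph and line breaks). The function name sliding_window_chunks and the synthetic text are assumptions made for illustration; the sketch only shows that each new chunk starts chunk_size - chunk_overlap characters after the previous one.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Illustrative fixed-size chunker (hypothetical helper, not LangChain's algorithm)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # distance between consecutive chunk starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # this chunk already reaches the end
            break
    return chunks


# Synthetic 2,400-character text, split with the notebook's settings.
sample_text = "".join(chr(97 + i % 26) for i in range(2400))
demo_chunks = sliding_window_chunks(sample_text, chunk_size=1000, chunk_overlap=300)
print(len(demo_chunks), [len(c) for c in demo_chunks])
```

With these settings a new chunk starts every 700 characters, so the last 300 characters of one chunk reappear at the start of the next.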


## Section 2: Load and Chunk PDFs

**Defining the get_text_chunks_with_metadata Function**


This function will read specified PDF files, extract text page by page, and then use RecursiveCharacterTextSplitter to divide the text into chunks. Importantly, it also attaches metadata (source file, page number, chunk ID) to each chunk, which is useful for tracing answers back to their origin.

In [None]:
# Define a function to read and chunk text from PDFs
def get_text_chunks_with_metadata(filepaths, chunk_size, chunk_overlap):
    """
    Reads text from multiple PDF files, splits the text into chunks, and adds metadata to each chunk.

    Args:
        filepaths (list): A list of paths to the PDF files.
        chunk_size (int): The desired maximum size of each text chunk.
        chunk_overlap (int): The number of characters to overlap between consecutive chunks.

    Returns:
        list: A list of dictionaries, where each dictionary contains a 'text' key (the chunk content)
              and a 'metadata' key (a dictionary with 'source', 'page', and 'chunk_id').
    """
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap) # Initializes the text splitter with specified chunk size and overlap.
    all_chunks = [] # An empty list to store all processed chunks from all PDFs.

    for filepath in filepaths: # Iterates through each PDF file path provided.
        reader = PdfReader(filepath) # Creates a PdfReader object to read the PDF file.
        source_name = os.path.splitext(os.path.basename(filepath))[0] # Extracts the file name without its extension (e.g., "document.pdf" becomes "document") for source metadata.
        for i, page in enumerate(reader.pages): # Iterates through each page in the PDF document, along with its index.
            text = page.extract_text() # Extracts all text content from the current page.
            if text: # Checks if text was successfully extracted from the page.
                chunks = splitter.split_text(text) # Splits the extracted page text into smaller chunks using the defined splitter.
                for j, chunk in enumerate(chunks): # Iterates through each chunk generated from the current page.
                    all_chunks.append({ # Appends a dictionary to the all_chunks list.
                        "text": chunk, # The actual text content of the chunk.
                        "metadata": {"source": source_name, "page": i + 1, "chunk_id": j} # Metadata including the original file name, page number (1-indexed), and chunk's index on that page.
                    })
    return all_chunks # Returns the complete list of text chunks with their associated metadata.
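
A quick way to sanity-check the result is to count chunks per source and per page. The sketch below uses a small hand-written list in the same shape the function returns; the sample records are made up for illustration and do not come from the PDFs.

```python
from collections import Counter

# Hypothetical records in the shape returned by get_text_chunks_with_metadata.
sample_chunks = [
    {"text": "first chunk", "metadata": {"source": "doc_a", "page": 1, "chunk_id": 0}},
    {"text": "second chunk", "metadata": {"source": "doc_a", "page": 1, "chunk_id": 1}},
    {"text": "third chunk", "metadata": {"source": "doc_a", "page": 2, "chunk_id": 0}},
    {"text": "fourth chunk", "metadata": {"source": "doc_b", "page": 1, "chunk_id": 0}},
]

# Count chunks per file, and per (file, page) pair.
chunks_per_source = Counter(c["metadata"]["source"] for c in sample_chunks)
chunks_per_page = Counter(
    (c["metadata"]["source"], c["metadata"]["page"]) for c in sample_chunks
)

for source, count in sorted(chunks_per_source.items()):
    print(f"{source}: {count} chunks")
```

Because chunk_id restarts at 0 on every page, a chunk is identified by the combination of source, page, and chunk_id rather than by chunk_id alone.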

## Section 3: Create Vector Store from Chunks

Once we have our text chunks, we need a way to efficiently search through them to find the most relevant ones for a given query. This is where embeddings and vector stores come in.

Embeddings: These are numerical representations of text. Models like Google's embedding-001 convert words, sentences, or chunks of text into a list of numbers (a vector) where semantically similar texts have vectors that are numerically "close" to each other in a multi-dimensional space.

Vector Store: This is a database designed to store these numerical embeddings and perform fast similarity searches. When you ask a question, your question is also converted into an embedding, and the vector store quickly finds the chunks whose embeddings are most similar to your question's embedding. We'll use FAISS for this.
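
The idea of "closeness" can be sketched with a brute-force search over a few toy vectors. The three-dimensional vectors and labels below are invented for illustration (real embeddings have hundreds of dimensions), and cosine similarity is used only as an example metric; the metric a FAISS index actually uses depends on how the index is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, store, k):
    """Return the labels of the k stored vectors most similar to the query."""
    ranked = sorted(
        store.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [label for label, _ in ranked[:k]]


# Toy "vector store": label -> made-up embedding.
toy_store = {
    "fiscal deficit": [0.9, 0.1, 0.0],
    "capital outlay": [0.7, 0.6, 0.1],
    "black hole": [0.0, 0.1, 0.95],
}
query_vector = [0.8, 0.2, 0.0]  # made-up embedding of a budget question

nearest = top_k(query_vector, toy_store, k=2)
print(nearest)
```

A vector store performs the same kind of ranking, but with index structures that avoid comparing the query against every stored vector one by one.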

**Defining the get_vector_store Function**

This function takes our list of text chunks, generates embeddings for each, and then stores them in a FAISS vector database. It also saves this database locally for later use, so we don't have to re-process PDFs every time.

In [None]:
# Define a function to create a vector store from text chunks
def get_vector_store(chunks):
    """
    Generates embeddings for given text chunks and stores them in a FAISS vector database.
    The database is also saved locally for persistence.

    Args:
        chunks (list): A list of dictionaries, where each dictionary contains a 'text' key
                       (the chunk content) and a 'metadata' key.

    Returns:
        FAISS: The FAISS vector database containing the chunk embeddings and metadata.
    """
    texts = [c["text"] for c in chunks] # Extracts just the text content from each chunk dictionary into a list.
    metadatas = [c["metadata"] for c in chunks] # Extracts just the metadata dictionary from each chunk dictionary into a list.
    embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001") # Initializes the Google Generative AI embeddings model. This model converts text into numerical vectors.
    db = FAISS.from_texts(texts, embedding=embeddings, metadatas=metadatas) # Creates a FAISS vector database from the text chunks, their embeddings, and associated metadata.
                                                                            # FAISS will index these embeddings for fast similarity search.
    db.save_local("faiss_index") # Saves the created FAISS index to a local directory named "faiss_index". This allows us to load it later without re-embedding.
    return db # Returns the FAISS database object.
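
Since the index is saved to disk, a later run can check for the saved directory before re-embedding. The helper below is a hypothetical sketch of that pattern, not part of the notebook: load_or_build and its callables are assumed names, and stand-in lambdas are used so the example runs without an API key.

```python
import os
import tempfile


def load_or_build(index_dir, build_fn, load_fn):
    """Return load_fn(index_dir) if the directory exists, otherwise build_fn()."""
    if os.path.isdir(index_dir):
        return load_fn(index_dir)
    return build_fn()


# Stand-in callables. In the notebook, build_fn could be
# `lambda: get_vector_store(chunks)` and load_fn could wrap FAISS.load_local.
with tempfile.TemporaryDirectory() as existing_dir:
    missing_dir = os.path.join(existing_dir, "not_created_yet")
    from_missing = load_or_build(missing_dir, lambda: "built", lambda d: "loaded")
    from_existing = load_or_build(existing_dir, lambda: "built", lambda d: "loaded")

print(from_missing, from_existing)
```

This check only tests that the directory exists; it does not detect whether the PDFs have changed since the index was saved.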


## Section 4: Summarize PDF Text

**Defining the summarize_pdf Function**

This function reads the entire text from a given PDF, creates a prompt asking for a summary, and then uses a ChatGoogleGenerativeAI model to generate a concise summary.

In [None]:
# Define a function to summarize the entire text of a PDF
def summarize_pdf(filepath):
    """
    Reads the entire text from a PDF and uses an LLM to generate a summary.

    Args:
        filepath (str): The path to the PDF file to summarize.

    Returns:
        str: A summary of the PDF content.
    """
    model = ChatGoogleGenerativeAI(model="gemini-2.0-flash", temperature=temperature) # Initializes a Google Chat Generative AI model for summarization.
                                                                                       # Uses "gemini-2.0-flash" for quick responses and the defined temperature.
    reader = PdfReader(filepath) # Creates a PdfReader object to read the PDF.
    full_text = "" # Initializes an empty string to accumulate all text from the PDF.
    for page in reader.pages: # Iterates through each page in the PDF.
        full_text += page.extract_text() or "" # Extracts the page text once and appends it to full_text; pages with no extractable text contribute an empty string.

    prompt = PromptTemplate( # Creates a LangChain PromptTemplate.
        input_variables=["document"], # Defines the input variable expected by this prompt (the document content).
        template="Summarize this document in less than 100 words:\n{document}" # The template string that will be sent to the LLM, with a placeholder for the document.
    )
    # Invokes the LLM with the formatted prompt (document content inserted) and returns the generated summary content.
    return model.invoke(prompt.format(document=full_text)).content


## Section 5: Load Conversational QA Chain

This is the core of our retrieval-augmented generation (RAG) system. The conversational QA chain combines several components:

Retriever: An interface to our vector store that fetches the most relevant chunks based on a given query.

Memory: Stores the history of the conversation, allowing the LLM to understand context from previous turns.

LLM: The Large Language Model (Gemini 2.0 Flash) that generates the answer.

Prompt Template: A structured way to tell the LLM how to combine the retrieved context, chat history, and current question to formulate a comprehensive answer.
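
How these components interact over one conversational turn can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: answer_turn, fake_llm, and fake_retrieve are hypothetical stand-ins, and the prompts are shortened. The flow is: rewrite a follow-up into a standalone question when there is history, retrieve chunks for it, ask the LLM to answer from those chunks, then store the turn in memory.

```python
def answer_turn(question, chat_history, retrieve, llm):
    """One conversational turn; `retrieve` and `llm` are caller-supplied callables."""
    if chat_history:
        # With prior turns, first rewrite the follow-up as a standalone question.
        history_text = "\n".join(
            f"Human: {q}\nAssistant: {a}" for q, a in chat_history
        )
        standalone = llm(
            "Rewrite the follow-up as a standalone question.\n"
            f"{history_text}\nFollow-up: {question}"
        )
    else:
        standalone = question
    docs = retrieve(standalone)  # retriever: fetch relevant chunks
    context = "\n\n".join(docs)
    answer = llm(f"Context:\n{context}\n\nQuestion: {standalone}\nAnswer:")
    chat_history.append((question, answer))  # memory: record this turn
    return answer, docs


llm_prompts = []  # records every prompt sent to the stand-in LLM


def fake_llm(prompt):
    llm_prompts.append(prompt)
    return f"stub reply #{len(llm_prompts)}"


def fake_retrieve(query):
    return ["chunk about fiscal deficit", "chunk about capital outlay"]


history = []
first_answer, _ = answer_turn("What is the fiscal deficit?", history, fake_retrieve, fake_llm)
second_answer, _ = answer_turn("And the capital outlay?", history, fake_retrieve, fake_llm)
print(len(llm_prompts), len(history))
```

The first turn needs one LLM call; the second needs two, because the follow-up is rewritten before retrieval.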

**Defining the load_qa_chain Function**

This function sets up the entire conversational retrieval chain. It configures the retriever, initializes conversation memory, selects the LLM, and defines a custom prompt to guide the LLM's response based on retrieved context and chat history.

In [None]:
# Define a function to load the Conversational Question Answering Chain
def load_qa_chain(db):
    """
    Sets up and returns a LangChain ConversationalRetrievalChain for question answering.
    This chain uses a retriever to fetch relevant documents, maintains conversation history,
    and uses an LLM to generate answers based on the retrieved context and chat history.

    Args:
        db (FAISS): The FAISS vector database to be used for document retrieval.

    Returns:
        ConversationalRetrievalChain: The configured LangChain conversational QA chain.
    """
    retriever = db.as_retriever(search_kwargs={"k": 32}) # Converts the FAISS database into a retriever.
                                                         # search_kwargs={"k": 32} specifies that the retriever should fetch the top 32 most relevant document chunks for a given query.
    memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True, output_key='answer') # Initializes a ConversationBufferMemory.
                                                                                                        # memory_key="chat_history": Specifies the key in the chain's input dictionary where chat history will be stored.
                                                                                                        # return_messages=True: Ensures the chat history is returned as a list of message objects.
                                                                                                        # output_key='answer': Specifies the key for the chain's output, allowing memory to track the LLM's answer.
    model = ChatGoogleGenerativeAI(model="gemini-2.0-flash", temperature=temperature) # Initializes the Google Chat Generative AI model (Gemini 2.0 Flash) for generating answers.

    # Custom prompt for the answer-generation step. Explanatory notes are kept out of the
    # template string, because everything inside the triple quotes is sent to the LLM verbatim.
    # The template sets the assistant's role, asks for answers grounded in the retrieved context,
    # and gives a fixed fallback sentence for questions the documents cannot answer.
    qa_prompt = PromptTemplate(
        template="""
        You are an AI assistant for question-answering over documents.
        Use the retrieved context to answer comprehensively.
        If a question covers multiple entities, include all of them.
        If data is missing, say: 'I cannot find the answer to this question in the provided documents.'

        Chat History:
        {chat_history}

        Context:
        {context}

        Question: {question}
        Answer:
        """,
        input_variables=["question", "context", "chat_history"] # Variables filled in at run time: the user's question, the retrieved chunks, and the chat history.
    )

    return ConversationalRetrievalChain.from_llm( # Creates and returns a ConversationalRetrievalChain.
        llm=model, # The Language Model to use for generating answers.
        retriever=retriever, # The document retriever that provides context.
        memory=memory, # The memory component that stores and manages chat history.
        return_source_documents=True, # Ensures that the chain returns the source documents from which the answer was derived.
        combine_docs_chain_kwargs={"prompt": qa_prompt} # Passes our custom prompt template to the internal chain that combines documents and generates the final answer.
    )

## Section 6: Highlight Relevant Sources

One of the great advantages of RAG is the ability to show where the answer came from. This function takes the LLM's answer and the source documents retrieved, and then uses another LLM call to intelligently highlight the exact sentences within those source documents that were most relevant to forming the answer. This provides transparency and allows users to verify information.

**Defining the highlight_relevant_sources_full_chunk Function**

This function formats the source documents and the LLM's answer into a specific prompt. It then sends this to an LLM (Gemini 2.0 Flash) with instructions to return the full text of relevant chunks, with the supporting sentences highlighted. This is a powerful way to make the RAG system more interpretable.

In [None]:
# Define a function to highlight relevant sentences within the full text of source documents
def highlight_relevant_sources_full_chunk(answer, source_documents):
    """
    Uses an LLM to identify and highlight relevant sentences within the full text of
    the source document chunks that were used to generate a given answer.
    It returns the full chunk text with highlighted sentences and source metadata.

    Args:
        answer (str): The answer generated by the QA chain.
        source_documents (list): A list of LangChain Document objects, each containing
                                 page_content (the chunk text) and metadata.

    Returns:
        str: A formatted string containing the relevant source chunks with highlighted
             sentences, separated by clear headers.
    """
    model = ChatGoogleGenerativeAI(model="gemini-2.0-flash", temperature=0.2) # Initializes a Google Chat Generative AI model (Gemini 2.0 Flash) for highlighting.
                                                                               # A slightly lower temperature (0.2) is used here to ensure more precise highlighting without too much creativity.

    # Formats each source document into a string with its metadata and content, separated by double newlines.
    formatted_chunks = "\n\n".join(
        f"[Source: {doc.metadata['source']} | Page: {doc.metadata['page']} | Chunk ID: {doc.metadata['chunk_id']}]\n{doc.page_content}"
        for doc in source_documents # Iterates through each source document provided.
    )

    prompt = PromptTemplate( # Defines a custom prompt template for the highlighting task.
        input_variables=["answer", "context"], # Defines the input variables: the answer and the formatted source chunks (context).
        template="""
You are given an answer and a set of document chunks from PDFs.

Task:
- For each chunk that supports the answer:
  - Return the full chunk.
  - Highlight the relevant sentences using **double asterisks**.
  - Prepend the source info like:
    ════════════════════════════════════════════════════════════════
    📄 Source: <source> | Page: <page>  | Chunk ID: <chunk_id>
    ════════════════════════════════════════════════════════════════

- Skip chunks that are not relevant.
- Add a line: ────────────────────────────────────────────── after each source block.

Answer:
{answer}

Document Chunks:
{context}

Output format:

════════════════════════════════════════════════════════════════
📄 Source: <source> | Page: <page> | Chunk ID: <chunk_id>
════════════════════════════════════════════════════════════════
<Full Chunk with **highlighted** text>

────────────────────────────────────────────────────────────────
""" # Detailed instructions for the LLM on how to format the output, including highlighting and separators.
    )

    # Invokes the LLM with the formatted prompt (answer and context inserted) and returns the generated content.
    response = model.invoke(prompt.format(answer=answer, context=formatted_chunks))
    return response.content # Returns the LLM's response, which contains the highlighted source chunks.
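
Because the highlights come back as plain text wrapped in double asterisks, they can be pulled out with a regular expression if only the supporting sentences are needed. The helper below is a hypothetical sketch, and the sample string is modelled on the kind of output the function returns rather than copied from a real run.

```python
import re


def extract_highlights(highlighted_text):
    """Return the spans wrapped in **double asterisks**, with whitespace normalised."""
    # DOTALL lets a highlighted span continue across line breaks.
    spans = re.findall(r"\*\*(.+?)\*\*", highlighted_text, flags=re.DOTALL)
    return [" ".join(span.split()) for span in spans]


sample_output = (
    "Revenue Expenditure 1,89,296 2,19,832\n"
    "**Capital Outlay 55,679 75,689 70,173 -7% 95,472 36%**\n"
    "Loans given by the state 2,667 3,842\n"
    "** Rs 12,846 crore has been allocated towards Swarnim\n"
    "Mukhya Mantri Shaheri Vikas Yojana**"
)

highlights = extract_highlights(sample_output)
for span in highlights:
    print("-", span)
```

This relies on the LLM following the requested format; a response with unbalanced asterisks would need extra handling.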


## Section 7: Example Run - Load PDFs, Create Index, Chat


**Load PDFs and Initial Summaries**

In [None]:
# Load PDFs
pdf_1 = "MH_Budget_Analysis_2025-26.pdf"
pdf_2 = "Gujarat_Budget_Analysis_2025-26.pdf"
pdf_3 = "Black hole mystry in the universe.pdf"
pdf_paths = [pdf_1, pdf_2, pdf_3]


In [None]:
# --- Summarize PDFs ---
print("--- PDF Summaries ---") # Prints a header for the summaries section.
for pdf_path in pdf_paths: # Iterates through each PDF path in the list.
    summary = summarize_pdf(pdf_path) # Calls the summarize_pdf function to get a summary for the current PDF.
    print(f"\nSummary of {os.path.basename(pdf_path).replace('.pdf', '')}:") # Prints the name of the PDF being summarized.
    print(summary) # Prints the generated summary.
print("\n" + "="*50 + "\n") # Prints a separator for better readability.


--- PDF Summaries ---

Summary of MH_Budget_Analysis_2025-26:
The Maharashtra budget for 2025-26, presented by Finance Minister Ajit Pawar, projects a GSDP growth of 9%. Expenditure is estimated at Rs 7,00,020 crore, with receipts of Rs 5,63,786 crore. The budget targets a fiscal deficit of 2.8% of GSDP and a revenue deficit of 0.9% of GSDP. Policy highlights include a new industrial policy aiming for Rs 40 lakh crore in investments and 50 lakh jobs, development of international business centers, and a long-term road development plan. Motor vehicle tax will increase, and affordable power initiatives are planned.

Summary of Gujarat_Budget_Analysis_2025-26:
The Gujarat budget for 2025-26 projects a GSDP growth of 12%. Expenditure is estimated to increase by 17%, funded by receipts and borrowings. A revenue surplus of 0.7% of GSDP is expected, with a fiscal deficit targeted at 2% of GSDP.

Policy highlights include expanding food security for laborers, identifying six growth hubs, reduci

**Step 1: Chunking the PDFs**

We call our get_text_chunks_with_metadata function to process the PDFs into chunks. This prepares our data for embedding and storage in the vector database.

In [None]:
# Step 1: Chunk PDFs
print("Step 1: Chunking PDFs...") # Informative print statement.
chunks = get_text_chunks_with_metadata(pdf_paths, chunk_size, chunk_overlap) # Calls the function to get all text chunks with metadata from the specified PDFs.
print(f"Generated {len(chunks)} chunks.") # Prints the total number of chunks generated.

Step 1: Chunking PDFs...
Generated 132 chunks.


**Step 2: Building and Saving the Vector Store**

Using the generated chunks, we create and populate our FAISS vector store. This step involves generating embeddings for each chunk and indexing them. The vector store is then saved locally.

In [None]:
# Step 2: Build and save vector store
print("Step 2: Building and saving vector store (FAISS index)...") # Informative print statement.
vector_store = get_vector_store(chunks) # Calls the function to create a FAISS vector store from the chunks and save it locally.
print("Vector store created and saved as 'faiss_index'.") # Confirmation message.

Step 2: Building and saving vector store (FAISS index)...
Vector store created and saved as 'faiss_index'.


**Step 3: Loading the Vector Store (if needed)**

In [None]:
# Step 3: Load vector store (needed if not passing 'vector_store' directly or if loading from a previous run)
print("Loading vector store from 'faiss_index'...") # Informative print statement.
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001") # Re-initializes the embeddings model, which is required to load the FAISS index.
# Loads the FAISS index from the local directory.
# allow_dangerous_deserialization=True is required because part of the saved index (the docstore and metadata) is pickled. Unpickling can execute arbitrary code, so only load an index you created yourself.
loaded_db = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)
print("Vector store loaded successfully.") # Confirmation message.


Loading vector store from 'faiss_index'...
Vector store loaded successfully.


**Step 4: Initializing the QA Chain**


In [None]:
# Step 4: Initialize QA chain
print("Step 4: Initializing Conversational QA Chain...") # Informative print statement.
qa_chain = load_qa_chain(loaded_db) # Calls the function to set up the conversational retrieval chain using the loaded FAISS database.
print("QA Chain initialized.") # Confirmation message.

Step 4: Initializing Conversational QA Chain...
QA Chain initialized.




**Step 5: Asking a Sample Question and Getting Response**

**Question 1**

Questions on the Gujarat budget PDF.

In [None]:
question = "What is Gujarat's projected fiscal deficit as a percentage of GSDP for FY 2025–26?"
response = qa_chain.invoke({"question": question}) # Invokes the QA chain with the question. The chain processes the question, retrieves context, and generates an answer.
print("\nAnswer:")
print(response["answer"])


Answer:
The fiscal deficit for 2025-26 is targeted at 2% of GSDP (Rs 58,397 crore), which is higher than the revised estimates for 2024-25 (1.9% of GSDP).
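
Since the retriever is configured with k=32, response["source_documents"] can contain many chunks. A compact list of the distinct files and pages they came from can be built as sketched below. unique_sources is a hypothetical helper, and SimpleNamespace objects stand in for LangChain Document objects so the example runs on its own.

```python
from types import SimpleNamespace


def unique_sources(source_documents):
    """Return distinct (source, page) pairs in first-seen order."""
    seen = []
    for doc in source_documents:
        key = (doc.metadata["source"], doc.metadata["page"])
        if key not in seen:
            seen.append(key)
    return seen


# Stand-ins for retrieved documents; only the .metadata attribute is used here.
fake_docs = [
    SimpleNamespace(metadata={"source": "Gujarat_Budget_Analysis_2025-26", "page": 2, "chunk_id": 1}),
    SimpleNamespace(metadata={"source": "Gujarat_Budget_Analysis_2025-26", "page": 2, "chunk_id": 4}),
    SimpleNamespace(metadata={"source": "MH_Budget_Analysis_2025-26", "page": 5, "chunk_id": 1}),
]

cited = unique_sources(fake_docs)
for source, page in cited:
    print(f"{source}, page {page}")
```

In the notebook, the same helper would be called as unique_sources(response["source_documents"]).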


In [None]:
# Display sources with highlighted snippets
print("\n--- Full Sources with Highlighted Snippets ---") # Prints a header for the sources section.
# Calls the highlighting function, passing the generated answer and the source documents returned by the QA chain.
highlighted_full_chunks = highlight_relevant_sources_full_chunk(response["answer"], response["source_documents"])
print(highlighted_full_chunks) # Prints the formatted string containing highlighted source chunks.



--- Full Sources with Highlighted Snippets ---
════════════════════════════════════════════════════════════════
📄 Source: Gujarat_Budget_Analysis_2025-26 | Page: 2 | Chunk ID: 1
════════════════════════════════════════════════════════════════
 **Fiscal deficit  for 202 5-26 is targeted at 2% of GSDP (Rs 58,397  crore) , higher  than the revised estimates for 202 4-25 (1.9% of GSDP)** . For 2025-26, central government has permitted fiscal deficit upto 3% of 
GSDP to states.  Additional borrowing space of 0.5% of GSDP will be available on undertaking certain 
power sector reforms.   
Table 1: Budget 2025-26 - Key figu res (in Rs crore ) 
Items  2023-24 
Actuals  2024-25 
Budgeted  2024-25 
Revised  % change from 
BE 2 4-25 to RE 
24-25 2025-26 
Budgeted  % change from 
RE 2 4-25 to BE 
25-26 
Total Expenditure  2,73,768  3,28,447  3,12,988  -5% 3,65,746  17% 
(-) Repayment of debt  26,136  29,085 29,086  0% 33,596  16% 
Net Expenditure (E)  2,47,632  2,99,362  2,83,902  -5% 3,3

**Question 2**

Here we ask a question about the Gujarat budget PDF without naming the state or the PDF, to check whether the LLM identifies the right context.

In [None]:
question = "How has capital outlay changed from 2024–25 to 2025–26?"
response = qa_chain.invoke({"question": question})
print("\nAnswer:")
print(response["answer"])


Answer:
For Gujarat, the capital outlay for 2025-26 is proposed to be Rs 95,472 crore, an increase of 36% from the revised estimate of 2024-25. For Maharashtra, the capital outlay for 2025-26 is proposed to be Rs 84,475 crore, a decrease of 11% from the revised estimate of 2024-25.


In [None]:
print("\n--- Full Sources with Highlighted Snippets ---")
highlighted_full_chunks = highlight_relevant_sources_full_chunk(response["answer"], response["source_documents"])
print(highlighted_full_chunks)



--- Full Sources with Highlighted Snippets ---
════════════════════════════════════════════════════════════════
📄 Source: Gujarat_Budget_Analysis_2025-26 | Page: 2 | Chunk ID: 4
════════════════════════════════════════════════════════════════
RE 2 4-25 to BE 
25-26 
Revenue Expendi ture 1,89,296 2,19,832 2,10,181 -4% 2,31,858 10% 
**Capital Outlay 55,679 75,689 70,173 -7% 95,472 36%** 
Loans given by the state 2,667 3,842 3,548 -8% 4,821 36% 
Net Expenditure 2,47,632 2,99,362 2,83,902 -5% 3,32,150 17% 
Sources: Annual Financial Statemen t, Gujarat Budget Documents 2025-26; PRS. Social Sector Expenditure 
The 15th Finance Commission (2021) had recommended 
that Gujarat enhance social expenditure , and increase 
focus towards uplifting backward districts . 
Source: Development Expenditure: Select Indicators, 
RBI, PRS.
The RBI defines social sector expenditure to include 
spending on items such as health, education, welfare of 
SCs, STs, and OBCs, and rural development. According 
to th

**Question 3**

A third check on the Gujarat budget PDF.

In [None]:
question = "How much budget is allocated to Swarnim Mukhya Mantri Shaheri Vikas Yojana?"
response = qa_chain.invoke({"question": question})
print("\nAnswer:")
print(response["answer"])


Answer:
Rs 12,846 crore has been allocated towards Swarnim Mukhya Mantri Shaheri Vikas Yojana.


In [None]:
print("\n--- Full Sources with Highlighted Snippets ---")
highlighted_full_chunks = highlight_relevant_sources_full_chunk(response["answer"], response["source_documents"])
print(highlighted_full_chunks)



--- Full Sources with Highlighted Snippets ---
════════════════════════════════════════════════════════════════
📄 Source: Gujarat_Budget_Analysis_2025-26 | Page: 3 | Chunk ID: 2
════════════════════════════════════════════════════════════════
Annexure 1. 
Table 4: Sector -wise expenditure under Gujarat Budget 2025-26 (in Rs crore) 
Sector s 2023 -24 
Actuals 2024 -25 
BE 2024 -25 
RE 2025 -26 
BE % change from 
RE 24-25 to 
BE 25-26 Budget Provisions (20 25-26) 
Education, Sports, 
Arts, and Culture 37,901 44,579 43,395 48,476 12%  Rs 2,914 crore has been allocated towards Schools of 
Excellence 
Urban Development 16,973 18,634 19,744 25,750 30% ** Rs 12,846 crore has been allocated towards Swarnim 
Mukhya Mantri Shaheri Vikas Yojana**
Transport 19,348 22,692 22,554 24,980 11%  Rs 5,002 crore has been allocated towards Mukhya 
Mantri Gram Sadak Yo jana 
Health and Family 
Welfare 16,211 19,348 20,589 22,840 11%  Rs 3, 491 crore has been allocated towards Aarogya 
Suraksha Yojana. 

**Question 4**

Now we ask questions about the Maharashtra budget PDF.

In [None]:
question = "What are the projected fiscal and revenue deficits for Maharashtra in 2025–26? "
response = qa_chain.invoke({"question": question})
print("\nAnswer:")
print(response["answer"])


Answer:
The fiscal deficit for Maharashtra in 2025-26 is estimated to be 2.8% of GSDP (Rs 1,36,235 crore). The state estimates a revenue deficit of 0.9% of GSDP (Rs 45,891 crore) in 2025-26.


In [None]:
print("\n--- Full Sources with Highlighted Snippets ---")
highlighted_full_chunks = highlight_relevant_sources_full_chunk(response["answer"], response["source_documents"])
print(highlighted_full_chunks)



--- Full Sources with Highlighted Snippets ---
Okay, here's the output based on the provided answer and document chunks:

════════════════════════════════════════════════════════════════
📄 Source: MH_Budget_Analysis_2025-26 | Page: 5 | Chunk ID: 1
════════════════════════════════════════════════════════════════
Fiscal deficit :  It is the excess of total expenditure over
total receipts. This gap is filled by borrowings by the
government and leads to an increase in total liabilities. **In
2025 -26, the fiscal deficit is estimated to be 2.8% of
GSDP**. For 2025-26, the central government has permitted
fiscal deficit of up to 3% of GSDP to states . Additional
borrowing space up to 0.5% of GSDP will also be
available for undertaking certain power sector reforms.
As per the revised estimates, in 202 4-25, the fi scal deficit
of the state is expected to be 2.9% of GSDP. This is
higher than the budget estimate of 2.6% of GSDP .
Outstanding debt :  Outstanding debt is the accumulation
of tota

**Question 5**

Similarly, we check whether the LLM answers correctly when the question does not name the PDF.

In [None]:
question = "What is the size and share of committed expenditure in 2025–26 budget of this state? "
response = qa_chain.invoke({"question": question})
print("\nAnswer:")
print(response["answer"])


Answer:
In 2025-26, Maharashtra is estimated to spend Rs 3,12,556 crore on committed expenditure, which is 56% of its estimated revenue receipts.


In [None]:
print("\n--- Full Sources with Highlighted Snippets ---")
highlighted_full_chunks = highlight_relevant_sources_full_chunk(response["answer"], response["source_documents"])
print(highlighted_full_chunks)



--- Full Sources with Highlighted Snippets ---
════════════════════════════════════════════════════════════════
📄 Source: MH_Budget_Analysis_2025-26 | Page: 3 | Chunk ID: 0
════════════════════════════════════════════════════════════════
Maharashtra  Budget Analysis 2025 -26  PRS Legislative R esearch  
 
April 1, 2025   - 3 - 
 Committed expenditure:  Committed expenditure of a state typically includes expenditure on payment of 
salaries, pension, and interest.   A larger proportion of the budget allo cated for committed expenditure items 
limits the state’s flexibility to decide on other expenditure priorities , such as capital outlay.  **In 2025-26, 
Maharashtra is estimated to spend Rs 3,12,556 crore on co mmitted expenditure, which is 56% of its estimated  
revenue receipts.** This co mprises spending on salaries ( 31% of revenue receipts), pension ( 13 %), and interest 
payments ( 12%).  In 202 3-24, as per actual  figures , 55% of revenue receipts w ere spent on committed items

**Question 6**

Questions on the black hole PDF.

In [None]:
question = "What is black hole and how it is created?"
response = qa_chain.invoke({"question": question})
print("\nAnswer:")
print(response["answer"])


Answer:
A black hole is a region in space where gravity is so strong that the escape velocity is faster than the speed of light. They are naturally formed when stars collapse into a single mass. When a massive star exhausts its nuclear fuel, the gravitational collapse causes the star to implode, resulting in a highly dense core known as a black hole.


In [None]:
print("\n--- Full Sources with Highlighted Snippets ---")
highlighted_full_chunks = highlight_relevant_sources_full_chunk(response["answer"], response["source_documents"])
print(highlighted_full_chunks)



--- Full Sources with Highlighted Snippets ---
════════════════════════════════════════════════════════════════
📄 Source: Black hole mystry in the universe | Page: 1 | Chunk ID: 5
════════════════════════════════════════════════════════════════
interpretation as a region of space from which nothing can escape including light was first introduced by David Finkelstein i n 1958. 
In 1916, Karl Schwarzschild calculated that the black hole should have possessed a huge mass because of it s small radius 
(R=2×GM/C2, where G=Universal gravitational constant, M= Mass of the black hole, C= Speed of light in vacuum) and 
consequently to have an acceptable value of the radius, a very massive mass was necessary . 
WHAT ARE BLACK HOLES?  
Most people think that a black hole is a massive whirlpool in space, sucking in everything around it. But that is not the whole story. 
**A black hole is a region in space where gravity is so strong that the escape velocity is faster than the speed  of light.** Bu

**Question 7**

In [None]:
question = "What happen when We fall into a it? "
response = qa_chain.invoke({"question": question})
print("\nAnswer:")
print(response["answer"])


Answer:
When someone falls into a black hole, they would experience an increasing gravitational pull. Time dilation would occur, meaning that time would appear to pass more slowly for the falling observer compared to an observer far away from the black hole. Once the observer crosses the event horizon, a point of no return, they would be unable to escape the black hole's gravitational pull. As the falling observer moves further inward, they would experience spaghettification, where the tidal forces near the black hole are incredibly strong, causing a significant difference in the gravitational pull between their head and feet, elongating them into a thin, elongated shape. As the falling observer continues to move closer to the black hole's singularity, the gravitational forces become infinitely strong, and the observer's ultimate fate would be to become part of the singularity, where their matter would be crushed to infinite density. From an external observer's perspective, it would t

In [None]:
print("\n--- Full Sources with Highlighted Snippets ---")
highlighted_full_chunks = highlight_relevant_sources_full_chunk(response["answer"], response["source_documents"])
print(highlighted_full_chunks)



--- Full Sources with Highlighted Snippets ---
════════════════════════════════════════════════════════════════
📄 Source: Black hole mystry in the universe | Page: 6 | Chunk ID: 2
════════════════════════════════════════════════════════════════
as being "pulled" or "sucked in." It's worth noting that the extreme conditions near black holes can have powerful tidal forces, w hich 
can stretch and deform objects that come close. These tidal forces can be highly destructive, tearing apart objects before they cr oss 
the event horizon.
1.5 What happen when We fall into a Black Hole?
**If someone were to fall into a black hole, the experience would be quite different depending on their position relative to the eve nt 
horizon. Let's explore the hypothetical scenario of an observer falling into a black hole:**
1. **Approaching the Event Horizon: As an observer falls towards a black hole, they would experience the increasing gravitational 
pull. Time dilation would occur, meaning that time wo
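
Question 7 refers to the black hole only as "it", so it can be answered only with the help of the earlier turns. A conversational retrieval chain usually handles this by first asking the LLM to rewrite the follow-up into a standalone question using the chat history, and then retrieving with the rewritten question. The sketch below only builds such a rewrite prompt in plain Python. The template wording is an assumption modelled on LangChain's default condense-question prompt, and `format_history` and `build_condense_prompt` are hypothetical helper names, not part of the notebook.

```python
from typing import List, Tuple

# Assumed wording, modelled on LangChain's default condense-question prompt.
CONDENSE_TEMPLATE = (
    "Given the following conversation and a follow up question, "
    "rephrase the follow up question to be a standalone question.\n\n"
    "Chat History:\n{chat_history}\n"
    "Follow Up Input: {question}\n"
    "Standalone question:"
)


def format_history(turns: List[Tuple[str, str]]) -> str:
    """Render (user, assistant) pairs as alternating labelled lines."""
    lines = []
    for human, ai in turns:
        lines.append(f"Human: {human}")
        lines.append(f"Assistant: {ai}")
    return "\n".join(lines)


def build_condense_prompt(turns: List[Tuple[str, str]], question: str) -> str:
    """Fill the template with the prior turns and the new follow-up question."""
    return CONDENSE_TEMPLATE.format(
        chat_history=format_history(turns),
        question=question.strip(),
    )


# Question 6 and the start of its answer, followed by the Question 7 follow-up.
turns = [
    (
        "What is black hole and how it is created?",
        "A black hole is a region in space where gravity is so strong that "
        "the escape velocity is faster than the speed of light.",
    )
]
prompt = build_condense_prompt(turns, "What happen when We fall into a it? ")
print(prompt)
```

Given this prompt, the LLM would be expected to return something like a question that names the black hole explicitly, which is what makes the retrieval step land on the right section of the PDF.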

**Question 8**

In [None]:
question = "What is Event Horizon"
response = qa_chain.invoke({"question": question})
print("\nAnswer:")
print(response["answer"])


Answer:
The event horizon is the boundary of a black hole, marking the point of no return. It is the region beyond which nothing, including light, can escape the gravitational pull of the black hole. Once an object crosses the event horizon, it is inevitably drawn toward the singularity. From an external observer's perspective, the event horizon appears as a spherical surface surrounding the black hole.


In [None]:
print("\n--- Full Sources with Highlighted Snippets ---")
highlighted_full_chunks = highlight_relevant_sources_full_chunk(response["answer"], response["source_documents"])
print(highlighted_full_chunks)



--- Full Sources with Highlighted Snippets ---
════════════════════════════════════════════════════════════════
📄 Source: Black hole mystry in the universe | Page: 5 | Chunk ID: 1
════════════════════════════════════════════════════════════════
understanding, singularity is a region where the laws of physics, as we know them, break down. The concept of singularity ari ses 
from the mathematics of general relativity, but it is still a subject of active research and debate. The s ingularity is hidden be neath 
the event horizon and is inaccessible to direct observation.
**2. Event Horizon: The event horizon is the boundary of a black hole, marking the point of no return. It is the region beyond whic h 
nothing, including light, can escap e the gravitational pull of the black hole. Once an object crosses the event horizon, it is in evitably 
drawn toward the singularity. From an external observer's perspective, the event horizon appears as a spherical surface surro unding 
the black hole

**Displaying Conversation History**

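Finally, we can inspect the conversation history stored in the `qa_chain`'s memory. Messages are stored in alternating order (user question, then assistant answer). This stored history is what lets the chain resolve follow-up questions such as Question 7, which refers to the black hole only as "it".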


In [None]:
# Step 8: Display Conversation History
print("\n--- Conversation History ---")
# Messages alternate between user and assistant, so walk them in pairs.
# zip() stops at the shorter slice, so an unpaired trailing message cannot raise IndexError.
messages = qa_chain.memory.chat_memory.messages
for turn, (user, assistant) in enumerate(zip(messages[::2], messages[1::2]), start=1):
    print(f"Question: {turn}.")
    print(f"User: {user.content}")
    print("-" * 100)
    print(f"Assistant: {assistant.content}")
    print("=" * 100)
    print("\n")




--- Conversation History ---
Question: 1.
User: What is Gujarat's projected fiscal deficit as a percentage of GSDP for FY 2025–26?
----------------------------------------------------------------------------------------------------
Assistant: The fiscal deficit for 2025-26 is targeted at 2% of GSDP (Rs 58,397 crore), which is higher than the revised estimates for 2024-25 (1.9% of GSDP).


Question: 2.
User: How has capital outlay changed from 2024–25 to 2025–26?
----------------------------------------------------------------------------------------------------
Assistant: For Gujarat, the capital outlay for 2025-26 is proposed to be Rs 95,472 crore, an increase of 36% from the revised estimate of 2024-25. For Maharashtra, the capital outlay for 2025-26 is proposed to be Rs 84,475 crore, a decrease of 11% from the revised estimate of 2024-25.


Question: 3.
User: How much budget is allocated to Swarnim Mukhya Mantri Shaheri Vikas Yojana?
------------------------------------------------