In [3]:
from PyPDF2 import PdfReader
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_groq import ChatGroq
import os

In [None]:
os.environ["GROQ_API_KEY"] = ""  # paste your Groq API key here; never commit a real key

LOAD PDF

Loads all the text from the PDF into a single string.

In [6]:
pdfReader = PdfReader("budget_speech.pdf")
rawText = ""
for page in pdfReader.pages:
    content = page.extract_text()
    if content:
        rawText += content

In [18]:
rawText

'GOVERNMENT OF INDIA\nBUDGET 2026-2027\nSPEECH\nOF\nNIRMALA SITHARAMAN\nMINISTER OF FINANCE\nFebruary 1,  2026 \nCONTENTS  \n \nPage No.  \nIntroduction  1 \n                                                          PART - A                                          \nYuva Shakti and 3 kartavya  2 \nReform Express  3 \nFirst kartavya : to accelerate and sustain economic growth  3 \nSecond kartavya: fulfil aspirations and build capacity  10 \nThird kartavya:  Sabka Sath, Sabka Vikas  14  \n16th Finance Commission  18 \nFiscal Consolidation  18 \n \nPART – B \nDirect taxes  20 \nIndirect Taxes   26 \n \nAnnexure to Part -A 32 \nAnnexure to Part -B \nAmendments relating to Direct Taxes  33 \nAmendments relating to Indirect Taxes  50 \n \n \n   \n \nBudget 202 6-2027 \n \nSpeech of  \nNirmala Sitharaman  \nMinister of Finance  \nFebruary 1 , 202 6 \n \nHon’ble Speaker,  \nOn the sacred occasion of Magha Purnima and the birth \nanniversary of Guru Ravidas, I present the Budget for the year 2

Split Text into Chunks

Why:
LLMs have token limits, so the text is split into smaller chunks. Chunking gives

Better retrieval, faster search, and no context-window overflow

chunk_overlap makes consecutive chunks share some text, so content that straddles a chunk boundary is not lost.
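
To make the overlap idea concrete, here is a minimal pure-Python sketch of a fixed-size sliding window. It is only an illustration: `chunk_text` is a hypothetical helper, and CharacterTextSplitter works differently (it splits on the separator and merges the pieces up to chunk_size), so its chunk boundaries will not match this one.

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 200) -> list:
    """Slide a window of chunk_size characters, stepping by chunk_size - overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last 4 characters of the previous one, which is what keeps a sentence that crosses a boundary visible in both chunks.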

In [10]:
text_splitters = CharacterTextSplitter(
    separator = "\n",
    chunk_size = 800,
    chunk_overlap = 200
)

texts = text_splitters.split_text(rawText)
print(len(texts))

171


Creating Embeddings + Vector DB

Text -> Vector -> Stored in FAISS

This enables semantic search, i.e. "find text similar to my question".
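
A rough sketch of what "find text similar to my question" means, using only the standard library. The word-count vectors from `toy_embed` are a deliberately crude, hypothetical stand-in for real sentence embeddings, and the brute-force ranking stands in for the FAISS index; only the overall idea (embed, compare, take the top k) carries over.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_search(query: str, texts: list, k: int = 2) -> list:
    """Return the k texts whose vectors are most similar to the query vector."""
    query_vec = toy_embed(query)
    ranked = sorted(texts, key=lambda t: cosine(query_vec, toy_embed(t)), reverse=True)
    return ranked[:k]


toy_texts = [
    "fiscal deficit target for the year",
    "customs duty on imported goods",
    "skill training for youth",
]
top = toy_search("duty on goods", toy_texts, k=1)
print(top)
```

A real embedding model would also match paraphrases that share no words with the query, which word counts cannot do.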

In [12]:
embeddings = HuggingFaceEmbeddings(
    model_name = "sentence-transformers/all-MiniLM-L6-v2"
)

document_search = FAISS.from_texts(texts,embeddings)

Loading weights: 100%|██████████| 103/103 [00:00<00:00, 271.77it/s, Materializing param=pooler.dense.weight]
BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |
------------------------+------------+
embeddings.position_ids | UNEXPECTED |

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.


In [13]:
llm = ChatGroq(
    model = "llama-3.1-8b-instant",
    temperature=0
)

Ask Questions (RAG)

Pipeline:

User Question -> Search relevant chunks in FAISS -> Combine those chunks into context -> Send to LLM -> Generate grounded answer
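
The pipeline steps can be sketched as a single function. This is a sketch under assumptions, not the notebook's actual code: `retrieve` and `generate` are hypothetical callables standing in for the FAISS search and the Groq model, and the fake versions below exist only so the example runs without an index or an API key.

```python
from typing import Callable, List


def answer(question: str,
           retrieve: Callable[[str], List[str]],
           generate: Callable[[str], str]) -> str:
    """Retrieve chunks, join them into one context, and ask the model."""
    chunks = retrieve(question)        # search relevant chunks
    context = "\n".join(chunks)        # combine those chunks into context
    prompt = (
        "Answer the question using the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{question}\n"
    )
    return generate(prompt)            # send to the LLM


# Hypothetical stand-ins for the vector store and the model.
def fake_retrieve(question: str) -> List[str]:
    return ["Chunk A about growth.", "Chunk B about taxes."]


def fake_generate(prompt: str) -> str:
    return prompt  # echo the prompt instead of calling a model


result = answer("What does the speech say about taxes?", fake_retrieve, fake_generate)
print(result)
```

Keeping retrieval and generation as separate arguments makes it easy to swap the vector store or the model without touching the prompt logic.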


In [15]:
query = "Vision of Amrit Kaal for India"
docs = document_search.similarity_search(query)
context = "\n".join([doc.page_content for doc in docs])

prompt = f"""
Answer the question using the context below.

Context : {context}

Question : {query}

"""

Each retrieved item looks like

Document(
  page_content="Text chunk here",
  metadata={...}
)

so the expression doc.page_content for doc in docs in the cell above keeps only the text of each chunk and drops the metadata.
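
A small stand-alone illustration of that extraction step. `ToyDocument` is a hypothetical dataclass that only mirrors the shape shown above, and the chunk texts and page numbers are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    """Hypothetical stand-in with the same two fields as a retrieved item."""
    page_content: str
    metadata: dict = field(default_factory=dict)


toy_docs = [
    ToyDocument("First chunk.", {"page": 1}),
    ToyDocument("Second chunk.", {"page": 2}),
]

# Keep only the text of each item and join it into one context string.
toy_context = "\n".join(doc.page_content for doc in toy_docs)
print(toy_context)
```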

In [16]:
response = llm.invoke(prompt)
print(response.content)

The vision of Amrit Kaal for India, as outlined in the Budget 2026-2027 speech by Nirmala Sitharaman, Minister of Finance, is to create a prosperous and inclusive India where every family, community, region, and sector has access to resources, amenities, and opportunities for meaningful participation.

This vision is encapsulated in the three kartavyas (duties) that the government has identified:

1. **Accelerate and sustain economic growth**: Enhance productivity and competitiveness, and build resilience to volatile global dynamics.
2. **Fulfil aspirations and build capacity**: Fulfil the aspirations of the people, including farmers, women in STEM, youth keen to upskill, and Divyangjan, and build their capacity to access newer opportunities.
3. **Sabka Sath, Sabka Vikas**: Ensure that every family, community, region, and sector has access to resources, amenities, and opportunities for meaningful participation.

The government's "Sankalp" (pledge) is to focus on the poor, underprivileg

In [20]:
query = "How much the budget has been increased in the current year?"

docs = document_search.similarity_search(query)
context = "\n".join([doc.page_content for doc in docs])

prompt = f"""
Answer using the provided context.

Context:
{context}

Question:
{query}
"""

response = llm.invoke(prompt)
print(response.content)


The information provided does not directly mention the increase in the budget for the current year. However, it does mention the Revised Estimates (RE) for 2025-26 and the Budget Estimates (BE) for 2026-27.

The Revised Estimates for 2025-26 for total expenditure is ₹49.6 lakh crore, and the Revised Estimates for non-debt receipts is ₹34 lakh crore. 

The Budget Estimates for 2026-27 for total expenditure is not directly mentioned, but the Revised Estimates for 2025-26 is ₹49.6 lakh crore.


Online PDF Loader

In [None]:
from langchain_community.document_loaders import PyPDFLoader

Load PDF from URL

In [6]:
loader = PyPDFLoader("https://arxiv.org/pdf/1706.03762.pdf")
data = loader.load()

Create Vector Index

In [7]:
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)




In [8]:
from langchain_community.vectorstores import FAISS

index = FAISS.from_documents(data, embeddings)


In [9]:
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import PromptTemplate


In [10]:
retriever = index.as_retriever(search_kwargs={"k": 3})


In [11]:
prompt = PromptTemplate.from_template("""
Answer the question using ONLY the context below.

Context:
{context}

Question:
{question}
""")


In [16]:
from langchain_groq import ChatGroq
llm = ChatGroq(
    model = "llama-3.1-8b-instant",
    temperature=0
)

In [17]:
chain = (
    # The retriever fills {context}; the raw query passes through unchanged as {question}.
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
)
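
To show what the `|` composition in the chain is doing, here is a toy re-implementation in plain Python. It is not LangChain's code: `Step`, `fan_out`, and every `toy_` name are hypothetical, and the retriever and model are canned stand-ins. The point is only that the dict at the head of the chain sends the same input to each branch, and each later step receives the previous step's output.

```python
class Step:
    """Toy stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other):
        # Compose left to right: run self first, then feed the result to other.
        return Step(lambda x: other.invoke(self.invoke(x)))

    def invoke(self, x):
        return self.fn(x)


def fan_out(mapping: dict) -> Step:
    """Run every step in the mapping on the same input and collect a dict."""
    return Step(lambda x: {key: step.invoke(x) for key, step in mapping.items()})


toy_passthrough = Step(lambda x: x)
toy_retriever = Step(lambda q: ["chunk about attention", "chunk about encoders"])
toy_prompt = Step(
    lambda d: "Context:\n" + "\n".join(d["context"]) + "\n\nQuestion:\n" + d["question"]
)
toy_llm = Step(lambda p: f"(model reply to a prompt of {len(p)} characters)")

head = fan_out({"context": toy_retriever, "question": toy_passthrough})
filled = (head | toy_prompt).invoke("What is self-attention?")
reply = (head | toy_prompt | toy_llm).invoke("What is self-attention?")
print(filled)
print(reply)
```

Note that the toy prompt joins the chunk texts explicitly before filling the template.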


In [18]:
query = "Explain Attention is all you need"

response = chain.invoke(query)

print(response.content)


Unfortunately, the provided context does not contain any information about the paper "Attention is all you need" by Vaswani et al. However, it does mention the term "attention mechanism" in the context of a neural network, specifically in the encoder self-attention in layer 5 of 6.

The context describes how the attention mechanism is used to follow long-distance dependencies in the encoder self-attention, and how it attends to a distant dependency of the verb 'making', completing the phrase 'making...more difficult'. It also shows how the attentions are very sharp for certain words, such as 'its' in the context of anaphora resolution.

However, without more information about the paper "Attention is all you need", I can only provide a general explanation of the attention mechanism.

The attention mechanism is a technique used in neural networks to focus on specific parts of the input data when processing it. It allows the network to weigh the importance of different input elements and 