<a href="https://colab.research.google.com/github/Santhiyagithub/llm-pdf-qa/blob/main/LLM_withQ%26A.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Task 1: Integrate LLM with Q&A capability from PDF data**

Step 1: Install Dependencies

In [1]:
!pip install -q llama-index==0.10.14 sentence-transformers PyMuPDF transformers accelerate llama-index-embeddings-huggingface

Step 2: Upload Your PDF

In [2]:
from google.colab import files
uploaded = files.upload()  # Upload your sample.pdf here


Saving Astralweb Innovate Rulebook A'25.pdf to Astralweb Innovate Rulebook A'25.pdf


confirm upload

In [3]:
import os
print("Files in session:", os.listdir())


Files in session: ['.config', "Astralweb Innovate Rulebook A'25.pdf", 'sample.pdf', 'sample_data']


In [4]:
# Rename uploaded file to sample.pdf
for filename in uploaded.keys():
    os.rename(filename, "sample.pdf")
    print(f"Renamed {filename} to sample.pdf")

Renamed Astralweb Innovate Rulebook A'25.pdf to sample.pdf


Step 4: Load PDF and Query

In [6]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.embeddings import BaseEmbedding
from typing import List
from sentence_transformers import SentenceTransformer
from transformers import pipeline
from llama_index.core import Settings

# Step 1: Load PDF
documents = SimpleDirectoryReader(input_files=["sample.pdf"]).load_data()

# Step 2: Embedding wrapper
class MyHFEmbedding(BaseEmbedding):
    model: SentenceTransformer

    def __init__(self, model: SentenceTransformer):
        super().__init__(model=model) # Pass model to the super class
        self.model = model # Also set the model attribute

    def _get_text_embedding(self, text: str) -> List[float]:
        return self.model.encode(text).tolist()

    def _get_query_embedding(self, query: str) -> List[float]:
        return self.model.encode(query).tolist()

    async def _aget_query_embedding(self, query: str) -> List[float]:
        # Implement the async method by calling the synchronous one
        return self._get_query_embedding(query)

# Step 3: Create embedding + service context
# Initialize the SentenceTransformer model and pass it to MyHFEmbedding
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embed_model = MyHFEmbedding(model=sentence_model)
Settings.embed_model = embed_model


# Step 4: Build index
index = VectorStoreIndex.from_documents(documents)

# Step 5: Load the LLM pipeline (FLAN-T5)
llm_pipeline = pipeline("text2text-generation", model="google/flan-t5-base", max_new_tokens=512)

# Step 6: Custom response wrapper using the local LLM
def generate_llm_response(query_text, context_text):
    prompt = f"Context: {context_text}\n\nQuestion: {query_text}\n\nAnswer:"
    result = llm_pipeline(prompt)[0]['generated_text']
    return result

# Step 7: Semantic search → use LLM to generate final answer
def ask_question(query):
    nodes = index.as_retriever().retrieve(query)
    combined_context = "\n".join([n.text for n in nodes[:3]])  # Use top 3 chunks
    return generate_llm_response(query, combined_context)

# Step 8: Try a question
question = "What is the main focus of this document?"
answer = ask_question(question)
print("Answer:", answer)

Device set to use cpu
Token indices sequence length is longer than the specified maximum sequence length for this model (607 > 512). Running this sequence through the model will result in indexing errors


Answer: Round 1: Implementation Submission Guidelines Round 1: Implementation Submission Guidelines Round 2: Website Submission Round 1: Idea and Implementation Submission Document Submission: A word document (500–1000 words) outlining the idea and proposed implementation plan.


In [19]:
import gradio as gr
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings
from llama_index.core.embeddings import BaseEmbedding
from sentence_transformers import SentenceTransformer
from transformers import pipeline
from typing import List

# Step 1: Globals
index = None

# Step 2: Embedding Wrapper
class MyHFEmbedding(BaseEmbedding):
    model: SentenceTransformer
    def __init__(self, model: SentenceTransformer):
        super().__init__(model=model)
        self.model = model

    def _get_text_embedding(self, text: str) -> List[float]:
        return self.model.encode(text).tolist()

    def _get_query_embedding(self, query: str) -> List[float]:
        return self.model.encode(query).tolist()

    async def _aget_query_embedding(self, query: str) -> List[float]:
        return self._get_query_embedding(query)

# Step 3: LLM Pipeline (Flan-T5)
llm_pipeline = pipeline("text2text-generation", model="google/flan-t5-base", max_new_tokens=256)

# Step 4: LLM Answer Generator
def generate_llm_response(query_text, context_text):
    prompt = f"""You are an AI assistant. Based on the following PDF content, answer the user's question clearly.

📄 Document Content:
{context_text}

❓ Question:
{query_text}

💬 Answer:"""
    result = llm_pipeline(prompt)[0]['generated_text']
    return result.strip() if len(result.strip()) > 10 else "⚠️ Sorry, I couldn't find a meaningful answer."

# Step 5: Upload & Index the PDF
def upload_pdf(file):
    global index
    file_path = file.name
    with open(file_path, "wb") as f:
        f.write(file.read())
    documents = SimpleDirectoryReader(input_files=[file_path]).load_data()
    embed_model = MyHFEmbedding(SentenceTransformer("all-MiniLM-L6-v2"))
    Settings.embed_model = embed_model
    index = VectorStoreIndex.from_documents(documents)

# Step 6: Handle user questions
def ask_question(user_question):
    if index is None:
        return "❌ No PDF uploaded yet."
    nodes = index.as_retriever().retrieve(user_question)
    context = "\n".join([n.text for n in nodes[:3]])[:1500]
    return generate_llm_response(user_question, context)

# Step 7: Gradio UI
with gr.Blocks() as iface:
    gr.Markdown("## 📄 PDF Question Answering Bot (LLM + Gradio)")

    file_input = gr.File(label="📁 Upload or Replace PDF", file_types=[".pdf"])
    file_input.change(fn=upload_pdf, inputs=file_input, outputs=[])

    question_box = gr.Textbox(lines=2, placeholder="Ask a question about the uploaded PDF...")
    submit_btn = gr.Button("💬 Submit Question")
    answer_box = gr.Textbox(label="💡 Answer")

    submit_btn.click(fn=ask_question, inputs=question_box, outputs=answer_box)

iface.launch(share=True)


Device set to use cpu


Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://95eaaf39ceabe27f90.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




Step 5: Ask More Questions Interactively

In [29]:
def ask_question(query):
    nodes = index.as_retriever().retrieve(query)
    combined_context = "\n".join([n.text for n in nodes[:3]])

    # 🔍 Debug output
    print("\n🔍 Retrieved Context:\n", combined_context)

    result = generate_llm_response(query, combined_context)

    if len(result.strip()) < 10 or '/' in result:
        return "⚠️ Sorry, I couldn't find a meaningful answer. Please try rephrasing your question."
    return result


In [21]:
def clean_document_chunks(documents):
    clean_docs = []
    for doc in documents:
        if len(doc.text.strip()) > 50 and not any(c in doc.text for c in ["/gid", "Page", "...."]):
            clean_docs.append(doc)
    return clean_docs


In [28]:
# Make sure to run cell 7EdG3Eqe1Bvg first to set up the query_engine and define ask_question

while True:
    q = input("Ask a question (or type 'exit'): ")
    if q.lower() == "exit":
        break
    # Call the ask_question function defined in cell 7EdG3Eqe1Bvg
    answer = ask_question(q)
    print("\n🤖 Answer:", answer)

Ask a question (or type 'exit'): explain pdf


Token indices sequence length is longer than the specified maximum sequence length for this model (694 > 512). Running this sequence through the model will result in indexing errors



🔍 Retrieved Context:
 Competition
 
Format
 
 
Round
 
1:
 
Implementation
 
Submission
 
 
Guidelines:
 
In
 
this
 
round,
 
participants
 
are
 
expected
 
to
 
submit
 
a
 
well-thought-out
 
idea.
 
The
 
idea
 
should
 
be
 
original,
 
innovative,
 
and
 
feasible
 
for
 
implementation
 
within
 
the
 
given
 
timeframe.
 
Participants
 
must
 
detail
 
how
 
they
 
plan
 
to
 
execute
 
their
 
idea,
 
specifying
 
the
 
tools,
 
technologies,
 
and
 
resources
 
they
 
will
 
use.
 
The
 
focus
 
should
 
be
 
on
 
how
 
this
 
idea
 
could
 
have
 
a
 
significant
 
impact,
 
whether
 
by
 
contributing
 
to
 
astronomical
 
and
 
scientific
 
knowledge,
 
improving
 
user
 
experience.
 
Clear
 
articulation
 
of
 
the
 
idea,
 
combined
 
with
 
a
 
realistic
 
implementation
 
plan,
 
will
 
form
 
the
 
basis
 
of
 
evaluation.
 
Creativity
 
and
 
practicality
 
should
 
go
 
hand-in-hand
 
while
 
describing
 
the
 
solution.
 
Participants
 
may
 
proceed
 
with
 
we

Let's switch to using the standard `HuggingFaceEmbedding` class, which is a simpler approach and should work now that the necessary library is installed.

We will use the code from the earlier attempt that used `HuggingFaceEmbedding`.