### Importing Required Libraries

**Explanation:**

**transformers.pipeline** → From Hugging Face 🤗, this provides a simple interface to use pretrained models (e.g., for text generation, question answering, summarization). We’ll use it to load a language model that can generate answers.

**PromptTemplate** → Lets us define reusable prompt structures. Instead of hardcoding text prompts, we use templates where we can inject context and questions.

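**load_qa_chain** → Utility from LangChain that builds a Question-Answering chain. It combines our model with a prompt and answers a question over the documents we pass in. Retrieval is a separate step: we fetch the relevant chunks first and then hand them to the chain.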

**PyPDFLoader** → Helps us load text directly from PDF documents, so we can later split and process them for question answering.

**RecursiveCharacterTextSplitter** → Splits long documents into smaller chunks (e.g., 500–1000 characters; by default it measures length in characters, not tokens). This is important because models cannot process huge documents in one go, so we break them into overlapping pieces.

**Chroma** → A vector database that stores embeddings of document chunks. This allows us to efficiently retrieve the most relevant parts of a document when a user asks a question.

**HuggingFaceEmbeddings** → Converts text into numerical vectors (embeddings) using pretrained Hugging Face models. These vectors are used to store and search text in Chroma.

**Why this step?**

This sets up all the building blocks for our PDF-based Question Answering system:

1. Load documents (PDF).
2. Split into chunks.
3. Convert chunks into embeddings.
4. Store in vector DB (Chroma).
5. Use retriever + model for QA.

In [1]:
from transformers import pipeline
from langchain.prompts import PromptTemplate
from langchain.chains.question_answering import load_qa_chain
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
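To make the idea behind PromptTemplate concrete, here is a minimal sketch that uses plain Python string formatting as a stand-in. The template wording and the `build_prompt` helper are illustrative assumptions; they are not taken from the notebook or from LangChain's API.

```python
# Stand-in for a prompt template: fixed wording with two named slots.
QA_TEMPLATE = (
    "Use the context below to answer the question.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(context: str, question: str) -> str:
    """Fill the template's slots with the given context and question."""
    return QA_TEMPLATE.format(context=context, question=question)


prompt = build_prompt(
    context="Data Science Intern, Feb 2025 to June 2025",
    question="What is the experience this person has?",
)
print(prompt)
```

The same template can be reused for every question; only the two injected values change.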

### Loading PDF Document

**Explanation:**

PyPDFLoader("gen ai resume 3.pdf") → Initializes a loader to read the content from the given PDF file. Here, the file name is "gen ai resume 3.pdf".

loader.load() → Actually extracts the text from the PDF and stores it in a structured format as a list of Document objects.

Each Document contains:

- page_content: the text content of that page.
- metadata: information such as the source file and page number.

**Why this step?**

We need the raw text from the PDF before we can process it further.

Storing it as Document objects makes it easier to handle later when splitting into smaller chunks and creating embeddings.

At this stage, if you print document[:1], you’ll see the first page’s text and metadata.

In [55]:
loader = PyPDFLoader("gen ai resume 3.pdf")

document = loader.load()

In [76]:
print(document)

[Document(metadata={'producer': 'Microsoft® Word 2019', 'creator': 'Microsoft® Word 2019', 'creationdate': '2025-08-07T13:29:44+05:30', 'author': 'python-docx', 'moddate': '2025-08-07T13:29:44+05:30', 'source': 'gen ai resume 3.pdf', 'total_pages': 2, 'page': 0, 'page_label': '1'}, page_content='Rushikesh Panjabrao Chavan  \nrishichavan462@gmail.com | +91 7057606243 | Pune | LinkedIn | GitHub  \n  \nObjective \nAspiring Data Scientist with a strong foundation in statistics, Python, and machine learning, along with practical \nexperience in deep learning, SQL, and Generative AI. Passionate about leveraging data to develop innovative, real-world \nsolutions to complex problems. Skilled in applying ML/DL algorithms, building end-to-end pipelines, and deploying \nsolutions using tools like Pandas, Scikit-learn, TensorFlow, Hugging Face Transformers, and Power BI. \nHands-on experience with Gen AI projects, including text summarization, document Q&A, and prompt engineering using \nOpenAI, L
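Printing the whole list is hard to read, so a short per-page summary can be easier to check. The sketch below uses hypothetical stand-in objects (built with `SimpleNamespace`) that mimic the two attributes described above; the `summarize_pages` helper and the sample texts are assumptions for illustration only.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for the objects returned by loader.load();
# they only mimic the page_content and metadata attributes.
pages = [
    SimpleNamespace(page_content="Objective\nAspiring Data Scientist ...",
                    metadata={"source": "resume.pdf", "page": 0}),
    SimpleNamespace(page_content="Projects\nStress Level Prediction ...",
                    metadata={"source": "resume.pdf", "page": 1}),
]


def summarize_pages(docs):
    """Return (page number, character count) for each loaded page."""
    return [(d.metadata.get("page"), len(d.page_content)) for d in docs]


for page, n_chars in summarize_pages(pages):
    print(f"page {page}: {n_chars} characters")
```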

### Splitting PDF into Chunks

**Explanation:**

RecursiveCharacterTextSplitter → A utility that breaks large documents into smaller, manageable chunks. This is important because:

LLMs (like Flan-T5, GPT, etc.) have input size limits (context window).

Chunks that fit within that limit can be passed to the model whole, so no content is silently truncated.

Parameters used:

chunk_size=1000 → Each chunk will contain up to 1000 characters of text.

chunk_overlap=200 → Each chunk will overlap with the previous one by 200 characters.

Helps preserve context between chunks (so information isn’t lost at boundaries).

split_documents(document) → Takes the list of PDF pages (document) and splits them into smaller chunks, returning a new list docs (each item is still a Document object but smaller in size).

**Why this step?**

Improves retrieval quality (retrievers can match smaller text pieces more precisely).

Prevents exceeding token limits when passing content to LLMs.

At this stage, if you print len(docs), you’ll see how many chunks were created. If you do docs[0].page_content, you’ll see the first chunk’s text.

In [56]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

docs = text_splitter.split_documents(document)

In [77]:
print(len(docs))

7
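To see how chunk_size and chunk_overlap interact, here is a simplified sliding-window sketch in plain Python. It is not how RecursiveCharacterTextSplitter works internally (that class prefers to break on separators such as paragraph and line breaks); the `sliding_chunks` function is a hypothetical helper that only illustrates the overlap idea.

```python
def sliding_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += chunk_size - overlap
    return chunks


chunks = sliding_chunks("abcdefghij", chunk_size=4, overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, which is the same role the 200-character overlap plays in the notebook.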


### Creating Embeddings from Text

**Explanation:**

What are embeddings?
Embeddings are numerical vector representations of text. They capture the semantic meaning of sentences/paragraphs so that similar texts have vectors that are close together in a high-dimensional space.

Why do we need embeddings?

To make the text searchable & comparable.

Instead of keyword matching, embeddings allow semantic similarity search → e.g., “What is an activation function?” will find content even if the PDF says “activation functions decide neuron firing.”

Model used:

"sentence-transformers/all-MiniLM-L6-v2"

A lightweight but powerful model for sentence-level embeddings.

Produces 384-dimensional vectors.

Trade-off: small model = faster, good quality, efficient for local use.

HuggingFaceEmbeddings → LangChain wrapper that makes it easy to call HuggingFace embedding models.

**Why this step?**

This is the foundation of semantic search. Later, we’ll store these embeddings in a vector database (Chroma) and query them when a user asks questions.

This step only loads the embedding model; no vectors are created yet. They’ll be generated when we pass the document chunks into Chroma in the next step.

In [58]:
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
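"Close together" needs a concrete measure. Cosine similarity is one common choice; the sketch below computes it for made-up 3-dimensional vectors (the real model produces 384 dimensions). The vectors and the `cosine_similarity` helper are illustrative assumptions, and the metric a given vector store uses by default may differ.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for a query and two stored chunks.
query_vec = [0.9, 0.1, 0.0]
similar_vec = [0.8, 0.2, 0.1]
unrelated_vec = [0.0, 0.1, 0.9]

print(cosine_similarity(query_vec, similar_vec) > cosine_similarity(query_vec, unrelated_vec))  # True
```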

### Creating a Vector Store with Chroma

**What is a Vector Store?**

A database optimized for embeddings.

It stores document chunks as vectors and allows fast similarity search (finding the most relevant chunks for a given query).

**Why Chroma?**

Open-source, lightweight, and easy to use.

Integrates smoothly with LangChain.

Supports persistent storage → you don’t need to recompute embeddings every time; they’re saved on disk.

Breaking down the code:

Chroma.from_documents(docs, embedding_model, ...)

Takes your split documents (docs).

Converts them into embeddings using the embedding_model.

Stores those embeddings + text inside a Chroma database.

persist_directory='chroma_store'

Saves the vector store locally in a folder named chroma_store.

Next time you can just reload instead of re-processing the PDF.

**Why this step?**

Without a vector store, we’d have to scan through the whole document manually.

Now, when a user asks a question, we can retrieve only the most relevant chunks instead of passing the entire document to the model.

This makes the QA system efficient, scalable, and accurate.

After this step, your PDF is now searchable by meaning (semantics), not just exact words.

In [59]:
vector_store = Chroma.from_documents(docs, embedding_model, persist_directory='chroma_store')
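The benefit of persistence is that text and vectors are written once and read back later without re-embedding. The toy sketch below shows that round trip with a JSON file; it is only a stand-in for the idea, since Chroma's on-disk format is different, and the `save_store` / `load_store` helpers and the sample records are assumptions.

```python
import json
import tempfile
from pathlib import Path


def save_store(records: list[dict], path: Path) -> None:
    """Write text/vector records to disk so they survive a restart."""
    path.write_text(json.dumps(records), encoding="utf-8")


def load_store(path: Path) -> list[dict]:
    """Read the records back without recomputing any vectors."""
    return json.loads(path.read_text(encoding="utf-8"))


# Made-up records; in the notebook the vectors come from the embedding model.
records = [
    {"text": "Data Science Intern", "vector": [0.9, 0.1]},
    {"text": "B.Sc. Physics", "vector": [0.2, 0.8]},
]

with tempfile.TemporaryDirectory() as tmp:
    store_path = Path(tmp) / "toy_store.json"
    save_store(records, store_path)
    reloaded = load_store(store_path)

print(reloaded == records)  # True
```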

### Creating a Retriever from the Vector Store

**What is a Retriever?**

A retriever is a tool that fetches the most relevant chunks from your vector store when you ask a question.

It uses semantic similarity search: instead of matching exact words, it compares the meaning of your query with the stored document embeddings.

Breaking down the code:

vector_store.as_retriever() → converts the Chroma vector store into a retriever object.

search_kwargs={'k':3} → tells the retriever to return the top 3 most relevant document chunks for any given query.

**Why do we need this?**

Large documents are split into many small chunks.

Instead of sending the entire PDF to the model (inefficient + costly), we only fetch the 3 best-matching chunks.

This makes the QA system:

Faster → less data to process.

Cheaper → fewer tokens used.

More accurate → model focuses only on the relevant context.

**Analogy:**
Think of the retriever as a smart librarian. Instead of giving you the whole library, they hand-pick the 3 most useful books/pages for your question.

In [60]:
retriever = vector_store.as_retriever(search_kwargs={'k':3})
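Conceptually, k=3 means "rank every chunk by similarity to the query and keep the best three". A minimal sketch of that selection step, using made-up scores and a hypothetical `top_k` helper:

```python
import heapq


def top_k(scores: dict[str, float], k: int) -> list[str]:
    """Return the k chunk ids with the highest similarity scores, best first."""
    return heapq.nlargest(k, scores, key=scores.get)


# Made-up similarity scores between one query and five stored chunks.
scores = {"chunk-0": 0.12, "chunk-1": 0.81, "chunk-2": 0.47,
          "chunk-3": 0.79, "chunk-4": 0.05}

print(top_k(scores, k=3))  # ['chunk-1', 'chunk-3', 'chunk-2']
```

A larger k gives the model more context but also consumes more of its input window.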

### Writing a query

In [69]:
query = "what is the experience this person has"

### Retrieving Relevant Documents with a Query

What this does:

Sends your input query (a user’s question) to the retriever.

The retriever compares the embedding of the query with the embeddings of all stored document chunks in the vector store.

It returns the most relevant chunks (in our case, top k=3 chunks).

Breaking it down:

query → the user’s question (here, "what is the experience this person has").

retriever.get_relevant_documents(query) → retrieves the 3 most semantically similar chunks. Recent LangChain releases deprecate this method in favor of retriever.invoke(query).

result → a list of Document objects, each containing:

.page_content → the text of that chunk.

.metadata → extra info like page number.

**Why do we need this?**

Instead of passing the entire PDF, we now only pass the most relevant chunks to the LLM.

This keeps the model focused on relevant context and reduces the risk of hallucinations, although it does not eliminate them.

**Analogy:**

Imagine asking a librarian "Tell me about activation functions". 

Instead of giving you the whole book, they open the 3 most relevant pages and hand them to you.

In [70]:
result = retriever.get_relevant_documents(query)



In [71]:
print(result)

[Document(metadata={'creator': 'Microsoft® Word 2019', 'moddate': '2025-08-07T13:29:44+05:30', 'author': 'python-docx', 'page_label': '1', 'total_pages': 2, 'creationdate': '2025-08-07T13:29:44+05:30', 'page': 0, 'producer': 'Microsoft® Word 2019', 'source': 'gen ai resume 3.pdf'}, page_content='Experience \nData Science Intern  \nInnomatics Research Labs                                                                        \nFeb 2025 – June 2025 \n \nEducation  \nB.Sc. Physics — Bharati Vidyapeeth, Pune                                                                                                         2018 – 2021  \n \nProjects  \nStress Level Prediction using Random Forest & Streamlit \nMachine Learning | Scikit-learn, Streamlit, Pandas \n• Built a machine learning model to predict stress levels (High, Medium, Low) using lifestyle, health, and \ndemographic features like sleep duration, cholesterol level, and meditation habits.  \n• Performed data preprocessing, feature encoding

### Inspecting Retrieved Documents

Iterates over the retrieved documents (result).

Prints the first 300 characters of each document’s text (doc.page_content[:300]).

Adds a numbered header (--- Result 1 ---, --- Result 2 ---, etc.) for clarity.

Breaking it down:

enumerate(result, 1) → loops through the results, starting count at 1 instead of 0.

doc.page_content → contains the actual text content of that chunk.

[:300] → shows only the first 300 characters to avoid overwhelming output.

**Why do we need this?**

This step is mainly for debugging and verification.

It lets us peek at what the retriever returned before passing it to the LLM.

Helps ensure the retrieved context is relevant and accurate for answering the query.

**Analogy:**
Think of it as skimming the first few sentences of the pages the librarian gave you, to quickly check that they are actually about your question before you start reading in detail.

In [72]:
for i, doc in enumerate(result,1):
    print(f"--- Result {i} ---")
    print(doc.page_content[:300])

--- Result 1 ---
Experience 
Data Science Intern  
Innomatics Research Labs                                                                        
Feb 2025 – June 2025 
 
Education  
B.Sc. Physics — Bharati Vidyapeeth, Pune                                                                                             
--- Result 2 ---
Experience 
Data Science Intern  
Innomatics Research Labs                                                                        
Feb 2025 – June 2025 
 
Education  
B.Sc. Physics — Bharati Vidyapeeth, Pune                                                                                             
--- Result 3 ---
• Developed an interactive Streamlit web app with two modules: (1) EDA Dashboard for dataset exploration and 
insights, and (2) Stress Detection App for real-time predictions and personalized recommendations. 
 
Document Question Answering System with RAG and Hugging Face 
NLP | Streamlit, LangChain
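Results 1 and 2 above are identical. One common cause is running Chroma.from_documents more than once against the same persist_directory, which stores the same chunks again; whether that happened here is an assumption. The sketch below drops repeated chunks by their text before they reach the model. The `drop_duplicates` helper and the stand-in objects are hypothetical.

```python
from types import SimpleNamespace


def drop_duplicates(docs):
    """Keep the first occurrence of each distinct page_content, preserving order."""
    seen = set()
    unique = []
    for doc in docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique.append(doc)
    return unique


# Hypothetical stand-ins for retrieved Document objects.
retrieved = [
    SimpleNamespace(page_content="Experience\nData Science Intern"),
    SimpleNamespace(page_content="Experience\nData Science Intern"),
    SimpleNamespace(page_content="Developed an interactive Streamlit web app"),
]

print(len(drop_duplicates(retrieved)))  # 2
```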


In [78]:
from langchain_community.llms import HuggingFacePipeline

In [79]:
generate_pipeline = pipeline('text2text-generation', model = 'google/flan-t5-base')

Device set to use cpu


In [80]:
llm = HuggingFacePipeline(pipeline=generate_pipeline)



In [81]:
qa_chain = load_qa_chain(llm, chain_type = 'stuff')
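The 'stuff' chain type places all supplied documents into a single prompt together with the question. A minimal sketch of that idea follows; the prompt wording and the `stuff_prompt` helper are assumptions for illustration, and LangChain's actual template differs.

```python
def stuff_prompt(chunks: list[str], question: str) -> str:
    """Join every chunk into one context block and append the question."""
    context = "\n\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


chunks = ["Data Science Intern, Innomatics Research Labs",
          "B.Sc. Physics, Bharati Vidyapeeth, Pune"]
prompt = stuff_prompt(chunks, "what is the experience this person has?")
print(prompt)
```

Because everything goes into one prompt, the combined length of the chunks matters: it has to fit within the model's input limit.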



In [104]:
query = "what is the experience this person has?"

In [105]:
retrieved_docs = retriever.get_relevant_documents(query)



In [106]:
response = qa_chain.run(input_documents=retrieved_docs, question=query)

In [107]:
print("answer :\n", response)

answer :
 Data Science Intern Innomatics Research Labs Feb 2025 – June 2025 Education B.Sc. Physics — Bharati Vidyapeeth, Pune 2018 – 2021 Projects Stress Level Prediction using Random Forest & Streamlit Machine Learning | Scikit-learn, Streamlit, Pandas
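The last few cells follow a retrieve-then-generate pattern. The sketch below wraps that pattern in one function, with the retriever and the model passed in as plain callables so it runs without a vector store or an LLM. The `answer_question` helper, the stub functions, and the prompt wording are all hypothetical.

```python
from typing import Callable


def answer_question(
    query: str,
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str], str],
) -> str:
    """Retrieve context for the query, build one prompt, and generate an answer."""
    chunks = retrieve(query)
    context = "\n\n".join(chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)


# Stubs so the sketch runs without a vector store or a model.
def fake_retrieve(query: str) -> list[str]:
    return ["Data Science Intern, Feb 2025 to June 2025"]


def fake_generate(prompt: str) -> str:
    return f"(model output for a prompt of {len(prompt)} characters)"


print(answer_question("what is the experience this person has?", fake_retrieve, fake_generate))
```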


### Conclusion

In this workflow, we successfully built a Retrieval-Augmented Generation (RAG) pipeline using LangChain, Hugging Face models, and ChromaDB. Let’s recap the main components and why each was important:

**Document Loading**

We used PyPDFLoader to load the PDF file into LangChain as structured Document objects.

This gave us an easy way to extract raw text while keeping metadata.

**Text Splitting**

Applied RecursiveCharacterTextSplitter to break the document into smaller, overlapping chunks.

Chunking ensures we don’t lose context while also making retrieval efficient and avoiding model input size limits.

**Embedding Generation**

Used HuggingFaceEmbeddings (all-MiniLM-L6-v2) to convert text chunks into vector representations.

Embeddings are the semantic fingerprints of text, making it possible to compare meaning, not just keywords.

**Vector Store (ChromaDB)**

Stored the embeddings inside a local Chroma database (persist_directory='chroma_store').

This allows fast similarity search, so we can retrieve the most relevant chunks for a query.

**Retriever**

Converted the vector store into a retriever that fetches top-k relevant chunks.

This is the “retrieval” part of RAG — instead of making the LLM guess from memory, we ground it with external knowledge.

**Querying & Display**

For a given user query, we pulled out relevant chunks and inspected them to confirm what the LLM would see.

This ensures transparency and helps debug relevance.

**LLM Integration & RAG**

We connected a Hugging Face model (google/flan-t5-base) via a LangChain QA chain.

The retriever feeds only relevant context into the model, reducing hallucinations and keeping answers grounded in the source document.

This is exactly where RAG comes into play — retrieval + generation.

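**Token Limit Handling**

As written, the notebook splits by characters (chunk_size=1000) and uses the stuff chain, so three retrieved chunks plus the question can exceed the model’s 512-token window. Options for keeping the input within that limit include:

Using token-based chunking (not just characters).

Keeping chunk size small (≈300–400 tokens).

Trying the map_reduce chain instead of stuff.

Ensuring retrieved context fits within the model’s 512-token window.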
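One simple way to keep the context within a limit is to add ranked chunks until a budget is used up. The sketch below approximates tokens by whitespace-separated words, which is an assumption: a subword tokenizer usually produces more tokens than words, so the model's own tokenizer should be used for real counts. The `fit_to_budget` helper is hypothetical.

```python
def fit_to_budget(chunks: list[str], budget: int) -> list[str]:
    """Keep chunks in ranked order until the approximate token budget is used up."""
    kept = []
    used = 0
    for chunk in chunks:
        cost = len(chunk.split())  # rough proxy: one token per whitespace-separated word
        if used + cost > budget:
            break
        kept.append(chunk)
        used += cost
    return kept


ranked_chunks = ["one two three", "four five", "six seven eight nine"]
print(fit_to_budget(ranked_chunks, budget=5))  # ['one two three', 'four five']
```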

**Final Takeaway**

This pipeline demonstrates how traditional information retrieval (IR) and modern LLMs can be combined into a powerful question-answering system:

ChromaDB handles efficient storage & retrieval of document knowledge.

Hugging Face embeddings let us find semantically similar text, not just keyword matches.

FLAN-T5 (via LangChain) provides natural language understanding & generation.

RAG helps keep responses grounded in your documents, which makes answers easier to verify against the source text.

In short: we built a search-augmented LLM that can answer questions from a PDF (or any dataset) without retraining a model.