# Document-Based Question Answering using RAG

This notebook demonstrates a basic Retrieval Augmented Generation (RAG) pipeline.
The system answers user questions by retrieving relevant information from documents
and passing that context to a Large Language Model (LLM).

**Tech Stack:** Python, LangChain, FAISS, Hugging Face Embeddings


---

## 🔹 Step 1: Install Required Libraries

```python
!pip install langchain langchain-community langchain-huggingface faiss-cpu pypdf sentence-transformers
```

---

## 🔹 Step 2: Import Required Libraries

In [10]:
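# Loaders and vector stores live in langchain-community (installed in Step 1)
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
# Splitters come from langchain-text-splitters (pulled in by langchain 0.3.x)
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
# Imported for a later generation step; not used by the retrieval-only cells below
from langchain.chains import RetrievalQA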


---

## üîπ Step 3: Load the Document

We load a PDF document which will act as our knowledge base.


In [15]:
# Path to your PDF file
pdf_path = r"C:\Users\Chethan Vakiti\Downloads\Machine Learning Notes.pdf"  
loader = PyPDFLoader(pdf_path)
documents = loader.load()

print(f"Number of pages loaded: {len(documents)}")


Number of pages loaded: 20



👉 Interview line:

> “I load documents using LangChain loaders to convert them into text format.”

---

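## 🔹 Step 4: Split Document into Chunks

Splitting the pages into small, overlapping chunks lets each embedding represent one focused passage, and the overlap keeps sentences that straddle a chunk boundary retrievable.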


In [19]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # maximum characters per chunk
    chunk_overlap=50,  # characters shared between neighboring chunks
)

chunks = text_splitter.split_documents(documents)

print(f"Total chunks created: {len(chunks)}")

Total chunks created: 46


👉 Interview line:

> “Chunking prevents loss of context and improves retrieval accuracy.”
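
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is a simplified stand-in and not the algorithm `RecursiveCharacterTextSplitter` uses, which tries to break on separators such as paragraph and line breaks first. The helper name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration only; it ignores sentence and
    paragraph boundaries, unlike LangChain's recursive splitter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
for piece in pieces:
    print(piece)
```

Each printed piece starts with the last three characters of the previous one, which is the role the 50-character overlap plays in the cell above.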

---

## 🔹 Step 5: Create Embeddings

Embeddings convert text into numerical vectors for semantic search.

In [24]:
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)





👉 Interview line:

> “Embeddings allow similarity-based retrieval instead of keyword matching.”
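
A small sketch may help show what "similarity-based" means in practice. The three-dimensional vectors below are made up for illustration (real embeddings from `all-MiniLM-L6-v2` have 384 dimensions), and `cosine_similarity` is a hypothetical helper, not part of LangChain.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy "embeddings"; real ones would come from the embedding model.
toy_vectors = {
    "supervised learning": [0.8, 0.2, 0.1],
    "labeled training data": [0.9, 0.1, 0.0],
    "pizza recipe": [0.0, 0.1, 0.9],
}
query_vector = [0.7, 0.3, 0.0]  # pretend embedding of the user's question

# Rank the stored texts from most to least similar to the query.
ranked = sorted(
    toy_vectors,
    key=lambda text: cosine_similarity(query_vector, toy_vectors[text]),
    reverse=True,
)
print(ranked)
```

The ranking depends only on the direction of the vectors, so a text can score highly without sharing any keyword with the query.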

---

## 🔹 Step 6: Store Embeddings in Vector Database

FAISS helps us perform fast similarity search on embeddings.

In [32]:
vectorstore = FAISS.from_documents(chunks, embeddings)
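
Conceptually, the search the vector store performs can be sketched as a brute-force nearest-neighbor lookup. This assumes the store compares vectors by Euclidean (L2) distance, which I believe is the default for LangChain's FAISS wrapper; FAISS itself does the comparison in optimized native code rather than a Python loop. The names `l2_distance` and `top_k` are hypothetical.

```python
import heapq
import math


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query: list[float], vectors: list[list[float]], k: int) -> list[tuple[int, float]]:
    """Return (index, distance) pairs for the k stored vectors closest to query."""
    scored = ((i, l2_distance(query, v)) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


# Toy 2-D vectors standing in for chunk embeddings.
stored = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.2], [5.0, 5.0]]
hits = top_k([1.0, 1.0], stored, k=2)
print(hits)
```

The returned indices identify which chunks to hand back; the `k` argument plays the same role as the `k` passed to the retriever in the next step.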

---

## 🔹 Step 7: Create Retriever

The retriever fetches the most relevant chunks for a user query.

In [34]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

## 🔹 Step 8: Test Semantic Retrieval (Core RAG Component)

Before using an LLM, we test whether relevant content is retrieved.

In [39]:
query = "What is this document about?"

# invoke() replaces the deprecated get_relevant_documents()
docs = retriever.invoke(query)

for i, doc in enumerate(docs, start=1):
    print(f"\n--- Retrieved Chunk {i} ---")
    print(doc.page_content[:300])  # preview the first 300 characters


--- Retrieved Chunk 1 ---
SUPERVISED LEARNING:
What is it?
In Supervised Learning, the model is trained on a labeled dataset,
meaning every training example has an input (X) and a
corresponding correct output (Y). The goal is for the model to learn
the relationship between inputs and outputs so it can predict outputs
for new

--- Retrieved Chunk 2 ---
Common Classification Algorithms:
•
•
•
•
4/17/25, 11:28 AM Editing Machine Learning – Medium
https://medium.com/p/2d3efa24e2a6/edit 7/20

--- Retrieved Chunk 3 ---
🔹Hybrid Systems
•
•
•
•
•
•
4/17/25, 11:28 AM Editing Machine Learning – Medium
https://medium.com/p/2d3efa24e2a6/edit 15/20
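
Retrieved chunks 2 and 3 above consist mostly of empty bullet markers and a page footer from the PDF export, so they carry little usable context. One possible remedy is to clean each page's text before splitting. The sketch below is an assumption-laden example: the regular expressions are modeled only on the two footer lines visible in this output, and `clean_page_text` is a hypothetical helper.

```python
import re

# Patterns modeled on the footer lines seen in the output above; they are
# assumptions about this particular PDF export, not a general rule.
FOOTER_PATTERNS = [
    # Print timestamp followed by the page title, e.g. "4/17/25, 11:28 AM ..."
    re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2} [AP]M .*$"),
    # URL followed by a page counter, e.g. "https://... 7/20"
    re.compile(r"^https?://\S+ \d+/\d+$"),
]
# A bullet marker with no text after it.
BULLET_ONLY = re.compile(r"^[•\-\*]\s*$")


def clean_page_text(text: str) -> str:
    """Drop footer lines and empty bullets, keeping all other lines unchanged."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if any(pattern.match(stripped) for pattern in FOOTER_PATTERNS):
            continue
        if BULLET_ONLY.match(stripped):
            continue
        kept.append(line)
    return "\n".join(kept)


raw = (
    "Common Classification Algorithms:\n"
    "•\n"
    "•\n"
    "4/17/25, 11:28 AM Editing Machine Learning – Medium\n"
    "https://medium.com/p/2d3efa24e2a6/edit 7/20\n"
)
print(clean_page_text(raw))
```

A function like this could be applied to each page's `page_content` before Step 4, after which the chunk count and the retrieved previews would need to be re-checked.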




## Final Notes

This project focuses on implementing and validating the **retrieval component** of a
Retrieval Augmented Generation (RAG) system.

The retrieved document context can be passed to any Large Language Model (LLM),
such as Gemini or an OpenAI model, for answer generation. Grounding the answer in
retrieved text tends to improve accuracy and reduce hallucinations compared to
querying the LLM directly.



## 🔹 Step 9: Final Explanation

1. Load and split documents into chunks  
2. Convert chunks into embeddings  
3. Store embeddings in a vector database  
4. Retrieve relevant chunks for a user query  
5. Pass retrieved context to an LLM for answer generation  

This approach improves accuracy and reliability compared to direct LLM usage.
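
Step 5 is not implemented in this notebook, so here is a minimal sketch of how retrieved chunks might be assembled into a prompt. The function name `build_prompt` and the instruction wording are assumptions for illustration; the call to an actual LLM is left out because the notebook does not commit to one.

```python
def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Combine retrieved chunks and the question into one prompt string."""
    # Number each chunk so an answer can point back to its source passage.
    numbered = [
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(context_chunks, start=1)
    ]
    context = "\n\n".join(numbered)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Stand-in strings; in the notebook these would be doc.page_content values.
chunks_text = [
    "In Supervised Learning, the model is trained on a labeled dataset.",
    "Each training example has an input (X) and a correct output (Y).",
]
prompt = build_prompt("What is supervised learning?", chunks_text)
print(prompt)
```

Telling the model to rely only on the supplied context is what ties the answer to the documents; the resulting string can be sent to whichever LLM client is chosen.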

---

# ✅ Overview of this project


> “I built a document-based question answering system using RAG. I load and split documents, generate embeddings, store them in a vector database, and retrieve relevant chunks based on user queries. These chunks are then provided to the LLM to generate answers. This reduces hallucinations and ensures answers are grounded in document context.”

---




> “I implemented a document-based question answering system using RAG. I load PDFs using LangChain, split them into chunks, generate embeddings using Hugging Face models, and store them in a FAISS vector database. When a user asks a question, I retrieve the most relevant chunks using semantic search and pass that context to an LLM for answer generation. This approach reduces hallucinations compared to direct LLM usage.”


> “I implemented and validated the retrieval component of a RAG system. The retrieved document context can be passed to any LLM, such as Gemini or an OpenAI model, for answer generation. I focused on retrieval accuracy since that is the core of RAG.”
