# **Document-Based RAG System Using LangChain + FAISS**

---

## **1. Project Title**
**Build a Document-Based RAG System Using LangChain and FAISS**

---

## **2. Objective**
The goal of this project is to build a **Retrieval-Augmented Generation (RAG)** system that can answer questions about a private or domain-specific document that the LLM alone cannot answer correctly.

**Key Objectives:**
- Handle private/internal documents
- Improve answer accuracy using document retrieval
- Demonstrate the difference between **LLM-only answers** and **RAG-based answers**

---

## **3. Document Used**
**File:** `Concept Note.pdf`  
**Type:** Internal research / project proposal document  
**Reason:** Contains domain-specific content that LLMs cannot answer without retrieval.

---

## **4. RAG Pipeline Overview**
The RAG system works in these steps:


[PDF Document] → [Chunking] → [Embeddings] → [FAISS Vector DB] → [Retriever] → [Prompt Template] → [LLM] → [Answer]


1. Load the document  
2. Split the document into chunks  
3. Generate embeddings for each chunk using `HuggingFaceEmbeddings`  
4. Store embeddings in **FAISS vector database**  
5. Retrieve top-k relevant chunks for a query  
6. Feed retrieved content into **LLM via prompt template**  
7. Generate final answer  

**Result:** Answers are **grounded in the retrieved context**, which reduces hallucinations.
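
To make step 5 concrete, here is a minimal sketch of top-k retrieval. It is not the project's code and not FAISS itself: it does a brute-force cosine-similarity ranking in plain Python, whereas FAISS does the same kind of nearest-neighbor lookup with an optimized index (and may use a different distance metric). The function names, the toy 3-dimensional vectors, and the chunk labels are all hypothetical; real embeddings from `all-MiniLM-L6-v2` would be much higher-dimensional.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as dissimilar to everything.
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vec, chunk_vecs, chunks, k=3):
    """Return the k chunks whose vectors are most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk)
        for vec, chunk in zip(chunk_vecs, chunks)
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


# Toy vectors standing in for real embeddings (hypothetical values).
chunks = ["PINN objectives", "Dataset description", "Author contact details"]
chunk_vecs = [[0.9, 0.1, 0.0], [0.2, 0.9, 0.1], [0.0, 0.1, 0.9]]
query_vec = [1.0, 0.2, 0.0]

top_chunks = retrieve_top_k(query_vec, chunk_vecs, chunks, k=2)
print(top_chunks)
```

The query vector points mostly along the first dimension, so the chunk whose vector does the same ranks first. In the actual pipeline, the retriever hands the selected chunks to the prompt template in step 6.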

---

## **5. Demo Queries and Answers**

### **Query 1:** What is the main purpose of the Concept Note?
- **RAG Answer:**  
  > The main purpose of the Concept Note is to develop a Physics-Informed Neural Network (PINN) model that integrates with physical constraints derived from hydrological and atmospheric equations governing rainfall processes for accurate rainfall projections in the Blue Nile Basin, Ethiopia.
- **LLM-only Answer:**  
  > A Concept Note is a document used in various contexts, but its main purpose remains the same in most cases. The main purpose of a Concept Note is to provide a brief overview and preliminary outline of a project, proposal, or idea. It serves as a precursor to a more detailed proposal or plan.  
  > *(Includes general description, methodology, budget, and objectives—mostly generic and not specific to the project.)*

---

### **Query 2:** List the 3 implementation steps mentioned in the Concept Note.
- **RAG Answer:**  
  1. Implement a PINN (Physics-Informed Neural Network) that integrates with physical constraints derived from hydrological and atmospheric equations governing rainfall processes.  
  2. Compare performance with conventional ML models like LSTMs, Random Forests, and purely physics-based downscaling methods.  
  3. Implement the PINN using the supplementary datasets: CHIRPS (Climate Hazards Group InfraRed Precipitation with Station data) or GPM (Global Precipitation Measurement).  
- **LLM-only Answer:**  
  > Provides a general outline of what a Concept Note usually contains, but **does not specify the exact implementation steps**.  

---

### **Query 3:** What budget items are mentioned in the Concept Note?
- **RAG Answer:**  
  > There are no budget items mentioned in the context provided. The document focuses on objectives, methodology, and practical contributions.  
- **LLM-only Answer:**  
  > Mentions budget as part of a generic Concept Note, but this is **hallucinated** because the actual document does not include any budget items.  

---

## **6. Design Choices**

| Component | Choice | Reason |
|-----------|--------|--------|
| Chunking | 800 characters, 100 overlap | Overlap preserves context across chunk boundaries; small chunks keep the prompt within the LLM's context window |
| Embeddings | `sentence-transformers/all-MiniLM-L6-v2` | Lightweight semantic embeddings |
| Vector Store | FAISS | Fast local similarity search |
| LLM | `ChatGroq llama-3.1-8b-instant` | Generates the answer from the retrieved context supplied in the prompt |
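
The chunking row above (800 characters, 100 overlap) can be illustrated with a small sketch. The document does not say which splitter was used, so this is a simplified fixed-window version written in plain Python. The function name `split_text` is hypothetical, and a library splitter would typically also try to break on paragraph or sentence boundaries.

```python
def split_text(text, chunk_size=800, overlap=100):
    """Split text into fixed-size character windows that overlap.

    Each chunk starts (chunk_size - overlap) characters after the previous
    one, so neighboring chunks share `overlap` characters.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# A 2000-character sample with a repeating digit pattern (placeholder text).
sample = "".join(str(i % 10) for i in range(2000))
chunks = split_text(sample)
print([len(c) for c in chunks])
```

With these settings a new chunk starts every 700 characters, so the last 100 characters of one chunk reappear at the start of the next. That repetition is what keeps a sentence that straddles a boundary retrievable from at least one chunk.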

---

## **7. Challenges & Solutions**

| Challenge | Solution |
|-----------|---------|
| Large documents | Split into smaller chunks for retrieval |
| Retrieval relevance | Used semantic embeddings with FAISS |
| LLM hallucination | Prompt instructs the LLM to answer from **only the retrieved context** |
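
As an illustration of the last row, the sketch below shows one way a prompt can confine the model to the retrieved chunks. The template wording, the `build_prompt` helper, and the sample chunks are assumptions made for this example; the project's actual prompt is not reproduced in this document.

```python
# Hypothetical template: the exact wording used in the project is not shown.
PROMPT_TEMPLATE = (
    "Answer the question using only the context below.\n"
    "If the context does not contain the answer, say so.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(retrieved_chunks, question):
    """Join the retrieved chunks and insert them into the template."""
    context = "\n\n".join(retrieved_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


# Placeholder chunks standing in for retriever output.
retrieved_chunks = [
    "Objective: develop a PINN model for rainfall projections.",
    "Methodology: compare against LSTM and Random Forest baselines.",
]
question = "What budget items are mentioned in the Concept Note?"

prompt = build_prompt(retrieved_chunks, question)
print(prompt)
```

The explicit instruction to say so when the context lacks the answer is what allows a response like the one in Query 3, where the system reports that no budget items appear instead of inventing them.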

---

## **8. Key Takeaways**
- RAG improves **accuracy** for private or domain-specific content  
- Retrieval keeps answers **grounded in the document**, which reduces hallucinations  
- LLM-only answers can be **generic or hallucinated**, highlighting the importance of RAG  
- LangChain + FAISS provide an **efficient and modular pipeline**  

---

