# 🧠 **PDF-Based Retrieval-Augmented Generation (RAG)** | **RAG100X**

*This notebook demonstrates a streamlined Retrieval-Augmented Generation (RAG) pipeline that loads a PDF document, splits it into overlapping chunks, and uses OpenAI embeddings to retrieve the passages most relevant to a query.*



#### 🔍 **Why Retrieval-Augmented Generation?**

*Large Language Models (LLMs) are prone to hallucination when not grounded in external data. RAG mitigates this by anchoring responses directly to source documents such as PDFs, CSVs, or other structured content.*



#### ✅ **Key Capabilities**

*This notebook provides a minimal yet functional RAG pipeline:*

- *Loads and parses PDF documents*  
- *Chunks text into manageable segments for efficient retrieval*  
- *Embeds each chunk using OpenAI's embedding model*  
- *Stores vectors in a FAISS index for fast similarity search*  
- *Retrieves the most relevant segments in response to a user query*



> 🛠️ *Note:* No helper modules from other repositories are used. Every function is defined in this notebook on top of standard libraries such as LangChain and FAISS, to promote clarity and learning.





### 📦 Installing Required Packages

This notebook uses several libraries to build and evaluate a simple RAG (Retrieval-Augmented Generation) system. Here's a quick breakdown of why each package is needed:

- **`pypdf`** – To read and extract text from PDF documents.
- **`PyMuPDF`** – Required by LangChain's `PyMuPDFLoader`, which this notebook uses to parse the PDF.
- **`python-dotenv`** – To load API keys securely from a `.env` file (optional but good practice).
- **`langchain-community`** – Core LangChain tools used for document loading, chunking, retrieval, etc.
- **`langchain_openai`** – Lets LangChain work with OpenAI embeddings and models.
- **`rank_bm25`** – Adds support for traditional keyword-based retrieval as a baseline method.
- **`faiss-cpu`** – Enables fast similarity search over document embeddings (vector database).
- **`deepeval`** – To evaluate how well the RAG system is retrieving and answering.
- **`openai`** – Official OpenAI Python client to access embeddings and GPT models.

> You only need to run this installation cell once per Colab session.


In [None]:
# Install required packages
!pip install pypdf==5.6.0
!pip install PyMuPDF==1.26.1
!pip install python-dotenv==1.1.0
!pip install langchain-community==0.3.25
!pip install langchain_openai==0.3.23
!pip install rank_bm25==0.2.2
!pip install faiss-cpu==1.11.0
!pip install deepeval==3.1.0
!pip install openai




### 🔐 Setting Up Your OpenAI API Key

To use OpenAI's embedding models (like `text-embedding-ada-002`), you'll need an API key.

Run the cell below and **enter your API key securely** when prompted. The input is hidden as you type, and the key is kept only in this session's environment rather than saved in the notebook.

```python
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass("🔐 Enter your OpenAI API key: ")
```
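
If the key is already available as an environment variable (for example, loaded from a `.env` file with `python-dotenv`), the prompt can be skipped. Here is a minimal sketch of that idea. Note that `ensure_api_key` is a hypothetical helper name I made up for illustration, not part of any library:

```python
import os
from getpass import getpass

def ensure_api_key(var_name="OPENAI_API_KEY", prompt=getpass):
    """Return the key from the environment, prompting only if it is missing.

    `ensure_api_key` is a hypothetical helper, not part of any library.
    """
    key = os.environ.get(var_name)
    if not key:
        # Hidden input; the value lives only in this process's environment.
        key = prompt(f"Enter your {var_name}: ")
        os.environ[var_name] = key
    return key

# Optional: load variables from a local .env file first (python-dotenv),
# then fall back to the prompt only when the variable is still unset.
# from dotenv import load_dotenv
# load_dotenv()
# ensure_api_key()
```

The `prompt` parameter exists only so the function can be exercised without typing anything; in the notebook you would call `ensure_api_key()` with the defaults.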

---

⚠️ **BUT WAIT... There's a Plot Twist!**

> 🧱 *You may hit a wall if you're using the free tier of OpenAI:*  
> > ❌ *No free embeddings quota.*  
> > 💸 *You need to add billing info, or they ghost you faster than your last match.*

---

### 🧪 Alternatives (aka "Plan B and C")

If OpenAI's quota says "Nope" 👋, you can still keep going:

- ✅ **Use [HuggingFace Embeddings](https://www.sbert.net/docs/pretrained_models.html)** (totally free and local; needs the `sentence-transformers` package):
  ```python
  from langchain_community.embeddings import HuggingFaceEmbeddings
  embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
  ```

- 🤝 **Try [Cohere](https://cohere.com)** – they offer **100k free tokens/month**, and it integrates easily with LangChain too.

> So don't worry: the RAG journey continues, even if OpenAI tries to act exclusive. 😎


In [None]:
import os
from getpass import getpass

# Prompt on every run so a stale key is never reused silently
os.environ["OPENAI_API_KEY"] = getpass("🔐 Enter your OpenAI API key: ")

### 📦 Importing Core RAG Libraries

These imports bring in all the necessary tools to implement a basic RAG pipeline:

- **`PyMuPDFLoader`** (from `langchain_community`)  
  Loads and parses PDF documents into structured text objects.

- **`RecursiveCharacterTextSplitter`**  
  Splits long text into overlapping chunks, helping maintain context across segments for better retrieval.

- **`OpenAIEmbeddings`**  
  Converts text chunks into high-dimensional vectors using OpenAI's embedding models (e.g., `text-embedding-ada-002`).

- **`FAISS`**  
  An efficient vector store for similarity search; it lets us quickly retrieve relevant chunks based on a query.

> Together, these tools let us go from raw PDF → semantic chunks → vector index → intelligent retrieval.
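
To make "overlapping chunks" concrete, here is a rough sketch of a fixed-size sliding window. This is a simplification, not how `RecursiveCharacterTextSplitter` works internally: the real splitter first tries to break on separators such as paragraph and line breaks. The function name `chunk_with_overlap` is made up for illustration:

```python
def chunk_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
print(chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3))
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, which is the same idea as `chunk_overlap=200` on 1000-character chunks: a sentence that straddles a boundary still appears whole in at least one chunk.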



In [None]:
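from langchain_community.document_loaders import PyMuPDFLoader  # PDF -> Document objects
from langchain.text_splitter import RecursiveCharacterTextSplitter  # overlapping chunks
from langchain_openai import OpenAIEmbeddings  # text -> embedding vectors
from langchain_community.vectorstores import FAISS  # vector index (non-deprecated import path)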


### 🧠 Core RAG Functions (Self-Contained)

This section defines the main building blocks for a simple Retrieval-Augmented Generation (RAG) pipeline:

- **`replace_t_with_space`**  
  Cleans each text chunk by replacing tab characters (`\t`) with spaces, which is useful for messy PDFs.

- **`encode_pdf`**  
  Loads a PDF, chunks the text into overlapping segments, cleans them, generates OpenAI embeddings, and stores everything in a FAISS vector database for fast similarity search.

- **`retrieve_context_per_question`**  
  Takes a user query and retrieves the top matching chunks from the vector store using semantic similarity.

- **`show_context`**  
  Displays each retrieved chunk, which is super helpful for debugging or exploring what the model "sees" before answering.

> These are simple and modular by design, so they are easy to extend later (like adding metadata, re-ranking, filtering, etc.) as we go deeper into advanced RAG techniques. 🚀


In [None]:
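def replace_t_with_space(list_of_documents):
    """Replace tab characters with spaces in each document (modifies in place)."""
    for doc in list_of_documents:
        doc.page_content = doc.page_content.replace('\t', ' ')
    return list_of_documents

def encode_pdf(path, chunk_size=1000, chunk_overlap=200):
    """Load a PDF, chunk and clean it, embed the chunks, and return a FAISS store."""
    loader = PyMuPDFLoader(path)
    documents = loader.load()

    # Overlapping chunks keep context that straddles a chunk boundary
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    chunks = splitter.split_documents(documents)
    cleaned_chunks = replace_t_with_space(chunks)

    # Embedding calls the OpenAI API, so OPENAI_API_KEY must be set
    embeddings = OpenAIEmbeddings()
    vectorstore = FAISS.from_documents(cleaned_chunks, embeddings)
    return vectorstore

def retrieve_context_per_question(question, retriever):
    """Return the text of the chunks the retriever ranks highest for the question."""
    # invoke() replaces the deprecated get_relevant_documents()
    docs = retriever.invoke(question)
    return [doc.page_content for doc in docs]

def show_context(context_list):
    """Print the first 500 characters of each retrieved chunk."""
    for i, chunk in enumerate(context_list):
        print(f"\n--- Chunk {i+1} ---\n{chunk[:500]}...")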


### 📄 Load, Encode & Build the Retriever

In this step, we:

- **Specify the PDF path**  
  We assume the file has been uploaded to the Colab session (via the left sidebar or manually).

- **Call `encode_pdf()`**  
  This processes the PDF into clean, overlapping chunks, creates embeddings, and stores them in a FAISS vector index.

- **Create a `retriever`**  
  This turns the FAISS store into a retriever that can fetch the top-2 most relevant chunks for any user query using semantic search.

> This is the heart of our RAG system: converting static documents into a dynamic knowledge base ready to be queried. 🔍
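
To see what "top-2 most relevant chunks" means mechanically, here is a toy brute-force search in plain Python. It is a sketch under assumptions: the 2-D vectors are made up (real `text-embedding-ada-002` vectors have 1536 dimensions), `top_k_by_l2` is a hypothetical name, and I'm assuming Euclidean (L2) distance, which as far as I know is the default for LangChain's FAISS wrapper:

```python
import math

def top_k_by_l2(query_vec, indexed_vecs, k=2):
    """Return (index, distance) pairs for the k vectors closest to the query."""
    scored = [(i, math.dist(query_vec, vec)) for i, vec in enumerate(indexed_vecs)]
    scored.sort(key=lambda pair: pair[1])  # smaller distance = more similar
    return scored[:k]

# Made-up 2-D vectors standing in for chunk embeddings.
chunk_vectors = [(0.9, 0.1), (0.1, 0.9), (0.8, 0.3), (0.0, 1.0)]
query_vector = (1.0, 0.0)

for idx, dist in top_k_by_l2(query_vector, chunk_vectors, k=2):
    print(f"chunk {idx}: distance {dist:.3f}")
# chunk 0: distance 0.141
# chunk 2: distance 0.361
```

The `k` here plays the same role as `search_kwargs={"k": 2}`: the query is embedded into the same vector space as the chunks, and only the nearest `k` chunks come back.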



In [None]:
pdf_path = "Understanding_Climate_Change.pdf"

# encode_pdf() calls the OpenAI embeddings API. If you hit a quota error here,
# switch to one of the alternative embedding options mentioned earlier.
vectorstore = encode_pdf(pdf_path, chunk_size=1000, chunk_overlap=200)
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

### 💬 Ask a Question & Retrieve Relevant Context

Now let's test our RAG pipeline:

- **Define a query**  
  We ask a sample question: *"What is the main cause of climate change?"*

- **Retrieve context**  
  The retriever searches the vector store and returns the most relevant chunks from the document.

- **Display context**  
  We use `show_context()` to print out the retrieved chunks, giving us a peek into what the system will use to answer the question.

> At this stage, we're not generating answers, just verifying that the right context is being retrieved. Think of it as the "thinking before answering" part of RAG. 🧠


In [None]:
test_query = "What is the main cause of climate change?"
context = retrieve_context_per_question(test_query, retriever)
show_context(context)


### ✅ Evaluate the RAG System (Optional but Powerful)

We now evaluate how well our retriever performed using:

```python
evaluate_rag(retriever, query=test_query, expected_answer="greenhouse gases")
```

This function checks whether the retrieved context contains the correct answer ("greenhouse gases" in this case).

> 📁 **Note:** The `evaluate_rag` function is located in a separate file under the `evaluation/` folder.  
> It's a reusable evaluation script designed to work across **all RAG techniques** you'll implement in this repo.

To understand how it works in depth (e.g., relevancy checks, scoring, LLM-based judgment), feel free to explore the full script inside the `evaluation` directory. It's your go-to tool to test RAG accuracy across different setups. 🧪
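
If you just want a quick sanity check without the evaluation script, the core idea (does the retrieved context mention the expected answer?) can be sketched in a few lines. This is a hypothetical stand-in, not the repo's `evaluate_rag`: it does plain case-insensitive substring matching and no LLM-based judgment. The function name and sample chunks are made up:

```python
def context_contains_answer(contexts, expected_answer):
    """Return True if any retrieved chunk mentions the expected answer (case-insensitive)."""
    needle = expected_answer.lower()
    return any(needle in chunk.lower() for chunk in contexts)

# Made-up chunks standing in for the retriever's output.
sample_context = [
    "Climate change is driven largely by greenhouse gases released by human activity.",
    "Renewable energy reduces reliance on fossil fuels.",
]
print(context_contains_answer(sample_context, "Greenhouse Gases"))  # True
```

In the notebook, the list returned by `retrieve_context_per_question(test_query, retriever)` could be passed in place of `sample_context`. A substring match is crude (it misses paraphrases), so treat it as a smoke test only.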


---

## 📘 Summary & Credits

This notebook is based on the excellent open-source repository [RAG_Techniques by NirDiamant](https://github.com/NirDiamant/RAG_Techniques).  
I referred to that work to understand how the pipeline is structured and then reimplemented the same concept in a **fully self-contained** way, without external helper functions (except for evaluation), as part of my personal learning journey.

The purpose of this notebook is purely **educational**:  
- To deepen my understanding of Retrieval-Augmented Generation systems  
- To keep a clean, trackable log of what I've built and learned  
- And to serve as a future reference for myself or others starting from scratch

To support that, I've added clear, concise markdown cells throughout the notebook, explaining *why* each package was installed, *why* each line of code exists, and *how* each component fits into the overall RAG pipeline. It's designed to help anyone (including my future self) grasp the **how** and the **why**, not just the **what**.


## 🧩 Why Start Simple?

Before diving into hybrid retrievers, multi-vector search, re-rankers, or agents, it's essential to understand the **spine** of any RAG system:  
📄 → 🧱 Chunk → 🧠 Embed → 🔎 Retrieve → ✍️ Generate.

This notebook stops at the **Retrieve** stage. Generation is intentionally excluded, because the current priority is **retrieval sanity, reproducibility, and structural clarity**.



## 💡 Final Word

This notebook is part of my larger personal project: **RAG100X**, a challenge to build and log my journey in RAG from 0 to 100 in the coming months.

It's not built to impress; it's built to **progress**.  
Everything here is structured to enable **daily iteration**, focused experimentation, and clean documentation.

If you're exploring RAG from first principles, feel free to use this as a scaffold for your own builds. And of course, check out the original repository for broader implementations and ideas.
