## 🧠 Contextual Compression for Focused Retrieval | RAG100X

This notebook implements **Contextual Compression**, a technique that filters and compresses retrieved chunks based on the query so that only the **most relevant information** is passed to the LLM for answer generation.

Unlike traditional RAG pipelines that return full chunks (which might include irrelevant fluff), contextual compression **uses a secondary LLM to extract only the essential segments**. This leads to more focused, efficient, and faithful answers, especially in large or noisy documents.

---

### ✅ What You’ll Learn

- Why raw chunk retrieval often includes unnecessary or noisy context  
- How to build a **ContextualCompressionRetriever** with LangChain  
- How the **LLMChainExtractor** works to distill relevant info  
- How to combine vector search + compression for more effective RAG  
- How compression affects latency, accuracy, and token efficiency  

---

### 🔍 Real-world Analogy

Imagine you ask a friend to summarize a 100-page report. Instead of reading the whole thing to you, they:

> 🔍 Skim only the sections related to your question  
> ✂️ Extract just the key paragraphs  
> 🧠 Answer using only the most relevant insights  

✅ That’s contextual compression: **retrieval with built-in summarization**.

---

### 🧠 How Contextual Compression Works Under the Hood

Here’s the step-by-step pipeline:

| Step                          | What Happens                                                                 |
|-------------------------------|------------------------------------------------------------------------------|
| 1. PDF Ingestion              | A PDF document is loaded and chunked                                        |
| 2. Vector Embedding           | Chunks are embedded using OpenAI and stored in a FAISS vector store         |
| 3. Base Retrieval             | Initial top-K chunks are retrieved based on query similarity                |
| 4. LLM Compression            | An `LLMChainExtractor` distills only relevant parts from each retrieved chunk |
| 5. QA Chain                   | The final compressed context is used by an LLM to generate the answer       |

🧠 You get **precision-focused retrieval**, reducing both noise and the number of tokens in the final prompt.
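The retrieve-then-compress steps in the table can be sketched end to end in plain Python. This is a toy illustration, not the notebook's implementation: word overlap stands in for embedding similarity, and a sentence-level keyword filter stands in for the LLM extractor. The function names (`retrieve_top_k`, `compress_chunk`, `compressed_context`) and the sample chunks are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve_top_k(query, chunks, k=2):
    """Stand-in for vector search: rank chunks by word overlap with the query."""
    q = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)
    return ranked[:k]

def compress_chunk(query, chunk):
    """Stand-in for the LLM extractor: keep sentences that share a word with the query."""
    q = tokenize(query)
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    return " ".join(s for s in sentences if q & tokenize(s))

def compressed_context(query, chunks, k=2):
    """Retrieve top-k chunks, compress each one, and drop chunks that end up empty."""
    retrieved = retrieve_top_k(query, chunks, k)
    compressed = [compress_chunk(query, c) for c in retrieved]
    return [c for c in compressed if c]

# Hypothetical sample chunks
chunks = [
    "Greenhouse gases trap heat. The report was printed in 2021.",
    "Oceans absorb carbon dioxide. Coral reefs are colorful.",
    "The appendix lists the authors.",
]
context = compressed_context("Which gases trap heat?", chunks, k=2)
print(context)
```

Two chunks are retrieved, but only one sentence survives compression; the second chunk contributes nothing and is dropped before the answer step.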

---

### 🚀 Why Contextual Compression Works So Well

- 🧠 **Relevance-driven**: Chunks are trimmed to what actually matters for the query  
- 🧹 **Noise reduction**: Removes irrelevant surrounding context  
- 📉 **Smaller final prompt**: Only the relevant spans reach the answering LLM (the compressor itself makes one extra LLM call per retrieved chunk)  
- ⚡ **Better answers**: Less distracting context, so responses stay closer to the source text  

---

### 🏗️ Why This Matters in Production

Traditional vector retrieval might return:

> “...Climate models suggest a variety of trends over the coming decades. One factor is CO₂...”

Whereas compression gives:

> “Climate change is primarily driven by CO₂ emissions, confirmed by decades of atmospheric research.”

✅ **Cleaner. Direct. Grounded.**

This is crucial for applications with:

- Tight latency/token budgets  
- Complex or large document corpora  
- High demand for factual accuracy  
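
Whether compression helps a token or latency budget depends on the numbers, because the compressor reads every retrieved chunk once before the answer step. The sketch below is back-of-the-envelope arithmetic with made-up figures: `token_budget` and all its parameters (`keep_ratio`, `prompt_overhead`, and so on) are hypothetical, and output tokens are ignored.

```python
def token_budget(k, chunk_tokens, keep_ratio, query_tokens=20, prompt_overhead=100):
    """Estimate input tokens for raw vs. compressed retrieval.

    k               -- number of retrieved chunks
    chunk_tokens    -- assumed tokens per raw chunk
    keep_ratio      -- assumed fraction of each chunk that survives compression
    prompt_overhead -- assumed instruction tokens per LLM call
    """
    # Answer prompt when raw chunks are passed straight through
    raw_answer = prompt_overhead + query_tokens + k * chunk_tokens
    # Answer prompt when only the compressed spans are passed
    compressed_answer = prompt_overhead + query_tokens + round(k * chunk_tokens * keep_ratio)
    # One compression call per chunk, each reading the instructions, the query, and the chunk
    compression_calls = k * (prompt_overhead + query_tokens + chunk_tokens)
    return {
        "raw_answer_prompt": raw_answer,
        "compressed_answer_prompt": compressed_answer,
        "compression_input": compression_calls,
        "compressed_total_input": compression_calls + compressed_answer,
    }

budget = token_budget(k=4, chunk_tokens=250, keep_ratio=0.2)
print(budget)
```

With these illustrative numbers the answer prompt shrinks from 1120 to 320 tokens, but the total input across all calls rises to 1800. Compression pays off most when the answering model is the expensive or context-limited one, or when a cheaper model handles the compression step.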

---

### 🔄 Where This Fits in RAG100X

In earlier projects, you explored:

1. Vanilla PDF/CSV/Web retrieval  
2. Chunking strategies and enhancements (Propositional, Semantic, Contextual Headers)  
3. Query Transformation and HyDE/HyPE  
4. Segment-based and windowed retrieval techniques  

Now, in **Day 13**, you compress at the **retrieval level**, after vector search:  
> 💡 **Find relevant chunks, then shrink them down to what actually matters.**


## 📦 Installation & Setup

### 🧩 Key LangChain Components Explained

- **`LLMChainExtractor`**  
  A document compressor that uses an LLM to extract only the most relevant parts from each retrieved chunk.  
  ✅ Filters out noise and keeps only what's useful for answering the query.  
  🔍 Under the hood: it takes each chunk and the query, runs a prompt over them using an LLM, and returns a compressed version focused on the query.

- **`ContextualCompressionRetriever`**  
  A special retriever that adds a compression layer on top of your base retriever (like FAISS).  
  ✅ First, it retrieves the usual top-k results. Then, it applies `LLMChainExtractor` to each chunk.  
  🔍 This gives you shorter, query-focused snippets instead of long raw chunks.

- **`RetrievalQA`**  
  A standard LangChain QA chain that handles the full process: retrieve → generate answer.  
  ✅ In this setup, it pulls compressed, context-aware chunks from the `ContextualCompressionRetriever`.  
  🔍 This shrinks the final prompt and improves the relevance and faithfulness of the final answer.


These components together create a **retrieval pipeline with built-in semantic filtering**, helping the LLM focus only on the information that truly matters for the user's query.
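
How the extractor and the compression retriever compose can be sketched without LangChain. The classes below are hypothetical stand-ins, not the library's code: the prompt wording is illustrative, `fake_llm` and `fake_retriever` replace a real chat model and vector store, and the `NO_OUTPUT` sentinel is how this sketch signals "nothing relevant in this chunk" so the chunk can be dropped.

```python
from dataclasses import dataclass
from typing import Callable, List

NO_OUTPUT = "NO_OUTPUT"  # sentinel the model is asked to return when nothing is relevant

# Illustrative prompt wording, not the library's exact template
PROMPT = (
    "Extract, verbatim, the parts of the context that help answer the question. "
    "If nothing is relevant, reply NO_OUTPUT.\n"
    "Question: {question}\nContext: {context}"
)

@dataclass
class ExtractorSketch:
    """Runs one LLM call per chunk and keeps only non-empty extractions."""
    llm: Callable[[str], str]

    def compress(self, chunks: List[str], query: str) -> List[str]:
        kept = []
        for chunk in chunks:
            reply = self.llm(PROMPT.format(question=query, context=chunk)).strip()
            if reply and reply != NO_OUTPUT:
                kept.append(reply)
        return kept

@dataclass
class CompressionRetrieverSketch:
    """Wraps a base retriever and compresses whatever it returns."""
    base_retriever: Callable[[str], List[str]]
    compressor: ExtractorSketch

    def invoke(self, query: str) -> List[str]:
        return self.compressor.compress(self.base_retriever(query), query)

def fake_llm(prompt: str) -> str:
    """Hypothetical model: returns the first context sentence that mentions 'CO2'."""
    context = prompt.split("Context: ", 1)[1]
    for sentence in context.split(". "):
        if "CO2" in sentence:
            return sentence.rstrip(".") + "."
    return NO_OUTPUT

def fake_retriever(query: str) -> List[str]:
    """Hypothetical base retriever that always returns the same two chunks."""
    return ["Weather varies daily. CO2 traps heat.", "The index starts on page 90."]

sketch_retriever = CompressionRetrieverSketch(fake_retriever, ExtractorSketch(fake_llm))
snippets = sketch_retriever.invoke("What does CO2 do?")
print(snippets)
```

Because the wrapper exposes the same `invoke(query)` shape as the base retriever, anything downstream (such as a QA chain) can use it without knowing that compression happened.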


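```python
# Install required packages: LangChain, its OpenAI and community integrations,
# FAISS for the vector store, pypdf for PDF loading, and python-dotenv for the API key
!pip install langchain langchain-openai langchain-community faiss-cpu pypdf python-dotenv

import os
from dotenv import load_dotenv
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain.retrievers import ContextualCompressionRetriever
from langchain.chains import RetrievalQA


# Load environment variables from a .env file (this populates os.environ)
load_dotenv()

# Fail early with a clear message if the OpenAI API key is missing
if not os.getenv("OPENAI_API_KEY"):
    raise EnvironmentError("OPENAI_API_KEY is not set; add it to your .env file")
```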

### Document Path

```python
path = "data/Understanding_Climate_Change.pdf"
```

### Creating the Retriever

```python
# Loaders and vector stores live in langchain_community in recent LangChain releases
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings


# Cleaning the document
def replace_t_with_space(list_of_documents):
    """
    Replaces all tab characters with spaces in the page content of each document.

    Args:
        list_of_documents: A list of document objects, each with a 'page_content' attribute.

    Returns:
        The same list of documents, modified in place, with tabs replaced by spaces.
    """
    for doc in list_of_documents:
        doc.page_content = doc.page_content.replace('\t', ' ')  # Replace tabs with spaces
    return list_of_documents


# Encoding the PDF into a vector store using OpenAI embeddings
def encode_pdf(path, chunk_size=1000, chunk_overlap=200):
    """
    Encodes a PDF into a vector store using OpenAI embeddings.

    Args:
        path: The path to the PDF file.
        chunk_size: The maximum size of each text chunk, in characters.
        chunk_overlap: The number of characters shared by consecutive chunks.

    Returns:
        A FAISS vector store containing the encoded PDF content.
    """
    # Load PDF documents (one Document per page)
    loader = PyPDFLoader(path)
    documents = loader.load()

    # Split documents into overlapping chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len
    )
    texts = text_splitter.split_documents(documents)
    cleaned_texts = replace_t_with_space(texts)

    # Create embeddings and index them in FAISS
    embeddings = OpenAIEmbeddings()
    vectorstore = FAISS.from_documents(cleaned_texts, embeddings)

    return vectorstore


vector_store = encode_pdf(path)
```
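
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sliding-window splitter. It is only an approximation for illustration: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and sentences, so its real chunk boundaries will differ. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap  # how far each new window advances
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]

# Tiny example so the overlap is easy to see
demo_chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo_chunks)
```

Each chunk starts with the last character of the previous one, which is the same idea as the 200-character overlap above: a sentence that straddles a boundary still appears whole in at least one chunk.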

### 🔧 Building a Semantically Compressed Retrieval Pipeline

This block sets up a complete retrieval + compression + QA system using LangChain.


#### 📌 Step-by-step Breakdown:

- **`vector_store.as_retriever()`**  
  Converts the FAISS vector store into a retriever.  
  🔍 When given a query, it returns the top-k most similar chunks based on embedding similarity.

- **`LLMChainExtractor.from_llm(llm)`**  
  Creates a compressor that uses an LLM to extract only the relevant parts from each chunk.  
  🔍 It takes each chunk + query, and returns a shorter, query-focused version using `gpt-4o-mini`.

- **`ContextualCompressionRetriever(...)`**  
  Combines the base retriever with the compressor.  
  🔍 It first retrieves top-k chunks, then compresses each using the LLM to retain only the most relevant info.

- **`RetrievalQA.from_chain_type(...)`**  
  Builds a RetrievalQA chain using the compressed retriever.  
  🔍 The chain retrieves compressed chunks, sends them to the LLM, and returns an answer + source docs.


🎯 **Why This Setup?**

- ✅ Shrinks the final prompt by trimming unnecessary text.
- ✅ Focuses on chunks actually useful for answering the query.
- ✅ Produces answers that are more relevant, grounded, and concise.

This is especially helpful in long-document scenarios where raw chunks might contain lots of irrelevant filler.


```python
from langchain_openai import ChatOpenAI

# Convert the FAISS vector store into a retriever
retriever = vector_store.as_retriever()

# Load the LLM used both for compressing the chunks and for answering
llm = ChatOpenAI(temperature=0, model_name="gpt-4o-mini", max_tokens=4000)

# Create a compressor that uses the LLM to extract only relevant info from each chunk
compressor = LLMChainExtractor.from_llm(llm)

# Combine the retriever with the compressor:
# retrieve the top-k documents, then compress each one using the LLM
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever
)

# Build a QA chain that uses the compressed retriever:
# the chain sends the compressed chunks to the LLM and returns an answer
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=compression_retriever,
    return_source_documents=True
)
```


### Example usage

```python
query = "What is the main topic of the document?"
result = qa_chain.invoke({"query": query})
print(result["result"])
print("Source documents:", result["source_documents"])
```
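
To see how much text compression actually removes, it can help to compare the size of the raw chunks with the size of the compressed ones. The helper below is a minimal sketch that works on plain strings; `compression_report` is a hypothetical name, and the sample strings stand in for the `page_content` of real documents.

```python
def compression_report(raw_texts, compressed_texts):
    """Compare total character counts before and after compression."""
    raw_chars = sum(len(t) for t in raw_texts)
    compressed_chars = sum(len(t) for t in compressed_texts)
    kept = compressed_chars / raw_chars if raw_chars else 0.0
    return {
        "raw_chars": raw_chars,
        "compressed_chars": compressed_chars,
        "kept_fraction": round(kept, 3),
    }

# Placeholder strings standing in for document contents
raw = ["a" * 800, "b" * 1000]
compressed = ["a" * 120, "b" * 60]
report = compression_report(raw, compressed)
print(report)
```

In this notebook, one plausible way to feed it would be to collect `doc.page_content` from `retriever.invoke(query)` for the raw side and from `result["source_documents"]` for the compressed side, assuming both return lists of documents as in recent LangChain versions.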

---

## 📘 Summary & Credits

This notebook is based on the excellent open-source repository [RAG_Techniques by NirDiamant](https://github.com/NirDiamant/RAG_Techniques).  
I referred to that work to understand how the pipeline is structured and then reimplemented the same concept in a **fully self-contained** way, using recent models, as part of my personal learning journey.

The purpose of this notebook is purely **educational**:  
- To deepen my understanding of Retrieval-Augmented Generation systems  
- To keep a clean, trackable log of what I’ve built and learned  
- And to serve as a future reference for myself or others starting from scratch

To support that, I’ve added clear, concise markdowns throughout the notebook, explaining *why* each package was installed, *why* each line of code exists, and *how* each component fits into the overall RAG pipeline. It’s designed to help anyone (including my future self) grasp the **how** and the **why**, not just the **what**.

## 🔍 Why Use Contextual Compression in RAG?

Retrieving full chunks can overload the LLM with irrelevant or verbose context, which reduces precision and lengthens the final prompt.

**Contextual Compression** addresses this by:
- ✂️ Filtering out **irrelevant content** from retrieved documents before generation  
- 🧠 Passing only the **most salient segments** to the LLM  
- ⚡ Making the answer step of RAG more focused and token-efficient  

---

## 🧠 What’s New in This Version?

This implementation includes:

- 🧱 A retriever + compressor pipeline using LangChain’s `ContextualCompressionRetriever`  
- 🧠 **LLMChainExtractor** to extract meaningful context spans using GPT-4o-mini  
- 🔄 Seamless plug-in with existing FAISS vectorstore retrievers  
- ✅ A full pipeline: vector retriever → compressor → QA chain  

---

## 📈 Inferences & Key Takeaways

- ✅ Compressed context reduces **noise and token load** for the answering LLM  
- 🧠 Enables more **focused and accurate answers** by removing irrelevant filler  
- 🔄 Flexible architecture: any base retriever or compressor can be plugged in  

---

## 🚀 What Could Be Added Next?

- 📊 Compare performance with and without compression using eval metrics  
- 🤖 Try advanced compressors like rerankers or summarization chains  
- 🧠 Experiment with chunk merging before compression for longer context  
- 🛠️ Add UI toggles to switch between raw vs. compressed retrieval  
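
As a starting point for trying other compressors, here is a sketch of a cheaper, non-LLM alternative: keep only the chunks whose embedding is close enough to the query embedding. LangChain ships an `EmbeddingsFilter` compressor built on a similar idea; the code below is a standalone illustration, not that class. The 2-D vectors, the `0.75` threshold, and the function names are made up for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def similarity_filter(query_vec, chunk_vecs, threshold=0.75):
    """Return the indices of chunks whose similarity to the query meets the threshold."""
    return [i for i, v in enumerate(chunk_vecs) if cosine(query_vec, v) >= threshold]

# Toy 2-D "embeddings"; real embeddings have hundreds or thousands of dimensions
query_vec = [1.0, 0.0]
chunk_vecs = [[0.9, 0.1], [0.0, 1.0], [0.6, 0.8]]
kept = similarity_filter(query_vec, chunk_vecs)
print(kept)
```

This drops whole chunks rather than trimming inside them, so it makes no extra LLM calls but gives coarser compression than the extractor used in this notebook.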

---
## 💡 Final Word

This notebook is part of my larger personal project: **RAG100x**, a challenge to build and log my journey in RAG from 0 to 100 in the coming months.

It’s not built to impress; it’s built to **progress**.  
Everything here is structured to enable **daily iteration**, focused experimentation, and clean documentation.

If you're exploring RAG from first principles, feel free to use this as a scaffold for your own builds. And of course, check out the original repository for broader implementations and ideas.