## 🧠 Hypothetical Document Embeddings (HyDE) for Better Retrieval | RAG100X

This notebook implements **Hypothetical Document Embedding (HyDE)** — a powerful technique to improve how Retrieval-Augmented Generation (RAG) systems understand and retrieve information from vector databases.

Instead of embedding the *original* user query (which might be short, vague, or ambiguous), HyDE first uses an LLM to **generate a detailed "hypothetical document"** — a short passage that *imagines* what a good answer to the query might look like. We then embed this **hypothetical document** instead of the query itself to perform similarity search.

---

### ✅ What You’ll Learn

- How short queries often mismatch with longer document chunks in vector search  
- How HyDE creates better alignment by simulating what the user wants  
- How to generate, embed, and use hypothetical documents for retrieval  
- When HyDE improves performance — and when it doesn’t

---

### 🔍 Real-world Analogy

Imagine you walk into a library and ask:  
> *"What’s the impact of climate change?"*

The librarian might bring you a general book or article that matches the words “climate change” — but it might not be what you *really* wanted.

Now imagine you say:  
> *"I’m writing a paper on how climate change causes rising sea levels, melting glaciers, and biodiversity loss — I need evidence from recent studies."*

Now the librarian brings you **exactly** the right papers — because you gave them more **semantic context**.

That’s what HyDE does behind the scenes — it turns your short query into a more descriptive version, so the vector search works more like that helpful librarian.

---

### 🔬 How HyDE Works Under the Hood

Let’s say the original query is:

> *"Climate change effects?"*

Here's what happens in standard vs. HyDE RAG:

| Step                      | Standard RAG                                      | HyDE RAG                                                       |
|---------------------------|--------------------------------------------------|----------------------------------------------------------------|
| 1. User Query             | Short query like “Climate change effects”        | Same short query                                               |
| 2. Embedding              | Embed the query directly                         | Use an LLM to generate a **pseudo-answer paragraph**, then embed it |
| 3. Vector Search          | Retrieve documents most similar to the short query | Retrieve the chunks most similar to the pseudo-answer        |
| 4. Answer Generation      | Generate based on retrieved docs                 | Same, but retrieval is more semantically aligned               |

✅ The generated pseudo-answer doesn’t need to be factually correct — it just needs to be **semantically similar** to the right documents.
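
To make the comparison in the table concrete, here is a minimal sketch that uses a toy bag-of-words "embedding" and cosine similarity in place of a real embedding model. The `embed` and `cosine` helpers, the two sample chunks, and the hard-coded `hypothetical_doc` (standing in for LLM output) are all illustrative assumptions, not part of this notebook's pipeline. Word counts exaggerate the effect, since real embedding models capture some meaning even from short queries, but the direction is the same: the longer pseudo-answer shares more with the relevant chunk.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': lowercase word counts (an assumption for illustration)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Two made-up chunks: the first is relevant, the second is not
chunks = [
    "Rising sea levels and melting glaciers threaten coastal cities and ecosystems.",
    "The library opens at nine and closes at five on weekdays.",
]

query = "Climate change effects?"
# Hard-coded stand-in for what an LLM might generate for the query
hypothetical_doc = (
    "Climate change causes rising sea levels, melting glaciers, "
    "and threatens ecosystems."
)

query_scores = [cosine(embed(query), embed(c)) for c in chunks]
hyde_scores = [cosine(embed(hypothetical_doc), embed(c)) for c in chunks]

print("query scores:", query_scores)
print("HyDE scores: ", hyde_scores)
```

In this toy setup the short query shares no words with either chunk, while the pseudo-answer clearly prefers the relevant one.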

---

### 🏗️ Why This Matters in Production

- **Short queries are common**: Think of users on chatbots, voice assistants, or search bars.  
- **Embedding mismatch is real**: Query embeddings often don’t “look like” document embeddings.  
- **HyDE is a drop-in improvement**: You don’t need to change your vector DB or data. Generate a hypothetical document, embed it, and search, at the cost of one extra LLM call per query.

> In high-stakes use cases (e.g., medical, legal, scientific), improving retrieval relevance can make the difference between a helpful answer and a harmful one.

---

### 🔄 Where This Fits in RAG100X

So far, RAG100X has explored:

1. PDF-based QA  
2. CSV-based retrieval  
3. Blog-based hallucination checks  
4. Chunk-size tuning  
5. Proposition-level chunking  
6. Query rewriting & decomposition

Now in **Day 7**, we bring it full circle — not by transforming the question *format*, but by **imagining the answer first**. This subtle shift helps your system "think like a user" and fetch better context before answering.

> 💡 HyDE is like a smart assistant that says: *"Let me first imagine what you're really looking for... and then search accordingly."*


## 📦 Installation & Setup

In [1]:
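# Install required packages: env loading, LangChain integrations, FAISS, and the PDF parser
!pip install python-dotenv langchain langchain-community langchain-openai langchain-text-splitters faiss-cpu pypdf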



In [2]:
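import os
from dotenv import load_dotenv

# Load variables from a local .env file into the process environment
load_dotenv()

# Fail early with a clear message if the key is missing
if not os.getenv("OPENAI_API_KEY"):
    raise EnvironmentError("OPENAI_API_KEY is not set; add it to your .env file.")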


## 📄 Load Dataset: Understanding Climate Change

This notebook uses a sample document titled **"Understanding Climate Change"**, stored in the `data/` directory.

In [3]:
path = "data/Understanding_Climate_Change.pdf"

## 📄 Encode PDF into FAISS Vectorstore

This section defines a function to load a PDF, preprocess its content, split it into meaningful chunks, convert it into vector embeddings, and store it in a FAISS vectorstore for efficient semantic retrieval.
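
To show what `chunk_size` and `chunk_overlap` control, here is a simplified fixed-window splitter. It is only a sketch: `split_with_overlap` is a hypothetical helper, and it cuts at exact character offsets, whereas `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs, lines, and spaces.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks

# Small example so the overlap is easy to see
sample = "abcdefghijklmnopqrstuvwxyz"
sample_chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for chunk in sample_chunks:
    print(chunk)
```

Each chunk starts with the last four characters of the previous one, which is how overlap keeps context that would otherwise be cut at a boundary.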


In [4]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import PromptTemplate
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

def replace_t_with_space(list_of_documents):
    """
    Replaces all tab characters ('\t') with spaces in the page content of each document.

    Args:
        list_of_documents: A list of document objects, each with a 'page_content' attribute.

    Returns:
        The modified list of documents with tab characters replaced by spaces.
    """
    for doc in list_of_documents:
        doc.page_content = doc.page_content.replace('\t', ' ')  # Clean tab characters
    return list_of_documents

def encode_pdf(path, chunk_size=1000, chunk_overlap=200):
    """
    Loads a PDF, splits it into overlapping text chunks, embeds it using OpenAI embeddings,
    and stores the result in a FAISS vectorstore.

    Args:
        path: Path to the input PDF file.
        chunk_size: Size of each text chunk (in characters).
        chunk_overlap: Overlap between chunks to preserve context.

    Returns:
        A FAISS vectorstore containing the embedded PDF chunks.
    """
    # Load PDF and extract raw text pages
    loader = PyPDFLoader(path)
    documents = loader.load()

    # Split the text into overlapping chunks for better context retention
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len
    )
    texts = text_splitter.split_documents(documents)

    # Clean each chunk (e.g., remove tab characters)
    cleaned_texts = replace_t_with_space(texts)

    # Create vector embeddings and store in FAISS for fast retrieval
    embeddings = OpenAIEmbeddings()
    vectorstore = FAISS.from_documents(cleaned_texts, embeddings)

    return vectorstore



## 🧠 Define the HyDE Retriever Class

This section defines a custom retriever class using the **Hypothetical Document Embeddings (HyDE)** technique — an advanced retrieval strategy where, instead of embedding the user query directly, we first use an LLM to **generate a synthetic (hypothetical) document** that answers the query. We then embed this generated document and use it for similarity search.

This approach often leads to better recall, especially for short, vague, or under-specified queries.

---

### 🔧 `__init__()` — Initialization

- `files_path`: Path to the PDF file(s) that will be ingested and indexed.
- `chunk_size`: Length (in characters) of each document chunk. Larger chunks capture more context.
- `chunk_overlap`: Number of characters overlapping between consecutive chunks. Prevents cutting off mid-sentence.

**Inside the constructor:**
- A ChatOpenAI LLM (`gpt-4o-mini`) is initialized with `temperature=0` for deterministic output, and `max_tokens=4000` to allow long generation.
- OpenAI’s embedding model is initialized for converting text into vector representations.
- The input PDF is chunked and embedded into a FAISS vectorstore using `encode_pdf()`, the helper function defined in the previous section.
- A prompt template is created to instruct the LLM to generate a detailed document of a specified length that answers a given query.
- The prompt is chained with the LLM to form `hyde_chain`, which will later be invoked for generation.
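
As a rough illustration of what the prompt template produces before it reaches the LLM, the sketch below fills the same two placeholders with plain `str.format`. `HYDE_TEMPLATE` and `render_hyde_prompt` are hypothetical names used only here; in the notebook, `PromptTemplate` does the substitution.

```python
# Same wording as the notebook's prompt, kept as a plain Python string
HYDE_TEMPLATE = (
    "Given the question '{query}', generate a hypothetical document that "
    "directly answers this question. The document should be detailed and in-depth. "
    "The document size has to be exactly {chunk_size} characters."
)

def render_hyde_prompt(query, chunk_size):
    """Fill the two placeholders, as the prompt template would before calling the LLM."""
    return HYDE_TEMPLATE.format(query=query, chunk_size=chunk_size)

rendered = render_hyde_prompt("What is the main cause of climate change?", 500)
print(rendered)
```

Printing the rendered prompt is a quick way to check that the query and the target length are passed through as intended.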

---

### 📄 `generate_hypothetical_document(query)`

Takes a user query and uses the `hyde_chain` to generate a synthetic document that directly answers it.  
- The prompt asks for a document of `chunk_size` characters so that it is comparable in length to the chunks in the vectorstore. This is a request, not a guarantee: LLMs rarely hit an exact character count.

---

### 🔍 `retrieve(query, k=3)`

- First, generates a hypothetical answer using the method above.
- Then, performs a vector similarity search against the embedded chunks using that synthetic document.
- Returns the top `k` most relevant documents and the generated hypothetical document.
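
The similarity search step amounts to ranking chunk vectors by how close they are to the hypothetical document's vector. The sketch below shows that ranking with made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions). All function names are illustrative assumptions. It also checks that, for unit-length vectors, ranking by L2 distance and ranking by cosine similarity give the same order.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))

def l2_distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k_by_cosine(query_vec, chunk_vecs, k=3):
    """Indices of the k chunks with the highest cosine similarity."""
    order = sorted(range(len(chunk_vecs)),
                   key=lambda i: dot(query_vec, chunk_vecs[i]), reverse=True)
    return order[:k]

def top_k_by_l2(query_vec, chunk_vecs, k=3):
    """Indices of the k chunks with the smallest L2 distance."""
    order = sorted(range(len(chunk_vecs)),
                   key=lambda i: l2_distance(query_vec, chunk_vecs[i]))
    return order[:k]

# Made-up vectors standing in for the embedded hypothetical document and chunks
hypothetical_vec = normalize([1.0, 2.0, 0.0])
chunk_vecs = [normalize(v) for v in (
    [1.0, 2.0, 0.1],
    [0.0, 1.0, 1.0],
    [2.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
)]

by_cosine = top_k_by_cosine(hypothetical_vec, chunk_vecs)
by_l2 = top_k_by_l2(hypothetical_vec, chunk_vecs)
print(by_cosine, by_l2)
```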

---




In [None]:
from langchain_openai import ChatOpenAI

class HyDERetriever:
    def __init__(self, files_path, chunk_size=500, chunk_overlap=100):
        # LLM that writes the hypothetical document
        self.llm = ChatOpenAI(temperature=0, model_name="gpt-4o-mini", max_tokens=4000)

        self.embeddings = OpenAIEmbeddings()
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        # Chunk, embed, and index the PDF with the helper defined earlier
        self.vectorstore = encode_pdf(files_path, chunk_size=self.chunk_size, chunk_overlap=self.chunk_overlap)

        # Prompt asking for a passage about the same size as an indexed chunk
        self.hyde_prompt = PromptTemplate(
            input_variables=["query", "chunk_size"],
            template=(
                "Given the question '{query}', generate a hypothetical document that "
                "directly answers this question. The document should be detailed and in-depth. "
                "The document size has to be exactly {chunk_size} characters."
            ),
        )
        # Pipe the formatted prompt into the chat model
        self.hyde_chain = self.hyde_prompt | self.llm

    def generate_hypothetical_document(self, query):
        input_variables = {"query": query, "chunk_size": self.chunk_size}
        # The chat model returns a message object; .content holds the generated text
        return self.hyde_chain.invoke(input_variables).content

    def retrieve(self, query, k=3):
        hypothetical_doc = self.generate_hypothetical_document(query)
        # Search with the hypothetical document's text instead of the raw query
        similar_docs = self.vectorstore.similarity_search(hypothetical_doc, k=k)
        return similar_docs, hypothetical_doc

### 📦 `text_wrap` Utility Function

This small utility function helps improve the readability of long text outputs—like generated answers or retrieved documents—by wrapping them to a fixed width (default: 120 characters). It uses Python's built-in `textwrap` module and is especially helpful for formatting printed results in Colab or terminal environments.


In [None]:
import textwrap
def text_wrap(text, width=120):
    """
    Wraps the input text to the specified width.

    Args:
        text (str): The input text to wrap.
        width (int): The width at which to wrap the text.

    Returns:
        str: The wrapped text.
    """
    return textwrap.fill(text, width=width)

### 🔍 Creating and Using a HyDE Retriever Instance

In this section, we instantiate the `HyDERetriever` and demonstrate how it enhances retrieval quality by generating a *hypothetical document* based on the user query. This synthetic document acts as a proxy for the user’s intent and is used to search the vector store.





- `retriever = HyDERetriever(path)`  
  This line initializes an instance of our previously defined `HyDERetriever` class. It loads and encodes the PDF at the given path into a FAISS vectorstore using OpenAI embeddings, while also preparing the LLM-based HyDE prompt chain for generating hypothetical documents.

- `test_query = "What is the main cause of climate change?"`  
  We define a natural language query that a user might ask. This is the input to test our HyDE retrieval pipeline.

- `results, hypothetical_doc = retriever.retrieve(test_query)`  
  This is the key step of the HyDE method:
  1. The model first generates a detailed hypothetical answer using the `gpt-4o-mini` LLM.
  2. This generated document is then used as a *semantic anchor* to search the FAISS vector store.
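  3. The top-k similar chunks (default `k=3`) are retrieved by comparing their embeddings with the embedding of the hypothetical document. LangChain's FAISS wrapper uses L2 (Euclidean) distance by default, which gives the same ranking as cosine similarity for unit-length embeddings such as OpenAI's.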

- `docs_content = [doc.page_content for doc in results]`  
  We extract just the text content of the retrieved documents, discarding metadata. This makes it easier to visualize and compare them with the hypothetical generation.

- `print(text_wrap(hypothetical_doc))`  
  Prints the hypothetical document with word wrapping for better readability. This helps developers and evaluators understand how well the synthetic document captures the intent of the query.

- `show_context(docs_content)`  
  A helper that formats and displays the retrieved contexts, so you can check whether the retrieved chunks are truly relevant to the hypothetical document. This check is an important part of evaluating HyDE-based retrieval. This notebook does not define `show_context`, and the cell below does not call it, so define it yourself before using it.



Together, this flow demonstrates how HyDE retrieves semantically aligned information using generated knowledge as a bridge — especially useful when the user query is short, vague, or underspecified.
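
Since `show_context` is not defined in this notebook, here is one possible minimal version. The exact output format is an assumption; `format_context` is a hypothetical helper added so the formatting can be checked separately from printing.

```python
def format_context(context):
    """Build a numbered, human-readable listing of retrieved chunk texts."""
    parts = []
    for i, chunk in enumerate(context, start=1):
        parts.append(f"Context {i}:\n{chunk}\n")
    return "\n".join(parts)

def show_context(context):
    """Print each retrieved chunk under a numbered heading."""
    print(format_context(context))

# Example with placeholder strings instead of real retrieved chunks
show_context(["First retrieved chunk.", "Second retrieved chunk."])
```

With this in place, calling `show_context(docs_content)` after retrieval would list the retrieved chunks below the hypothetical document.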


In [None]:
retriever = HyDERetriever(path)


# Demonstrate on a use case
test_query = "What is the main cause of climate change?"
results, hypothetical_doc = retriever.retrieve(test_query)


# Print the hypothetical document; docs_content holds the text of the retrieved documents
docs_content = [doc.page_content for doc in results]

print("hypothetical_doc:\n")
print(text_wrap(hypothetical_doc)+"\n")

---

## 📘 Summary & Credits

This notebook is based on the excellent open-source repository [RAG_Techniques by NirDiamant](https://github.com/NirDiamant/RAG_Techniques).  
I referred to that work to understand how the pipeline is structured and then reimplemented the same concept in a **fully self-contained** way, but using recent models — as part of my personal learning journey.

The purpose of this notebook is purely **educational**:  
- To deepen my understanding of Retrieval-Augmented Generation systems  
- To keep a clean, trackable log of what I’ve built and learned  
- And to serve as a future reference for myself or others starting from scratch

To support that, I’ve added clear, concise markdowns throughout the notebook — explaining *why* each package was installed, *why* each line of code exists, and *how* each component fits into the overall RAG pipeline. It’s designed to help anyone (including my future self) grasp the **how** and the **why**, not just the **what**.


## 🔍 Why Use HyDE in RAG?

Traditional RAG systems rely on embedding the **user query directly** and retrieving chunks based on cosine similarity. But when queries are short, abstract, or under-specified, this can limit retrieval quality.

**HyDE (Hypothetical Document Embeddings)** solves this by:
- ✨ **Expanding the query into a detailed, in-domain pseudo-document**
- 📚 **Embedding that document** to improve semantic alignment with relevant chunks

This leads to more **targeted, context-rich retrieval**, especially useful when user queries are vague or exploratory.

---

## 🧠 What’s New in This Version?

This implementation of HyDE offers:

- 🤖 **GPT-4o-mini-based hypothetical generation** — Fast, smart pseudo-docs generated in real-time  
- 🧱 **Self-contained architecture** — All logic defined within the notebook, no external wrappers  
- 📐 **Customizable prompt & chunk size** — Fine control over generation and embedding quality  
- 🔍 **Plug-and-play retriever** — HyDERetriever class is reusable across any PDF corpus  

You can drop it into most LangChain or OpenAI pipelines with minimal modifications.

---

## 📈 Inferences & Key Takeaways

Running HyDE on real queries like *"What is the main cause of climate change?"* shows:

- 📄 The generated hypothetical doc captures nuanced concepts and keywords  
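- 📥 On inspection, the FAISS retrievals look relevant to the question even without query rewriting; the notebook does not measure this against a direct-query baseline  
- 💡 Combining HyDE with chunk optimization may further improve grounding, which is worth evaluating rather than assuming  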

It’s a lightweight yet powerful way to **boost RAG accuracy without retraining or new data**.

---

## 🚀 What Could Be Added Next?

For a production-grade HyDE-powered RAG:

- 🔁 **Use it in parallel with traditional queries** — Ensembling often yields the best results  
- 🧪 **Evaluate hallucination & relevancy deltas** — Track how HyDE improves grounding  
- 🧠 **Swap GPT-4o-mini for other small models** — Try Claude Haiku or Phi-3 to compare speed and quality  
- 🧵 **Chain it with query rewriting** — Rewrite → HyDE → Retrieve for maximal alignment  
- 🔌 **Deploy with pgvector, Chroma, or Elastic** — Adapt vectorstore backends for scalability  
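
For the first idea in the list above, one common way to combine a direct-query ranking with a HyDE ranking is reciprocal rank fusion (RRF). This is a sketch of that technique, not something the notebook implements: the chunk IDs are made up, and `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of IDs; items ranked high in any list score higher."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical chunk IDs returned by the two retrieval paths
direct_query_ranking = ["c2", "c5", "c1"]
hyde_ranking = ["c1", "c2", "c7"]

fused = reciprocal_rank_fusion([direct_query_ranking, hyde_ranking])
print(fused)
```

Chunks that appear in both rankings rise to the top, while chunks found by only one path are still kept as candidates.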

---
## 💡 Final Word

This notebook is part of my larger personal project: **RAG100X**, a challenge to build and log my journey in RAG from 0 to 100 in the coming months.

It’s not built to impress — it’s built to **progress**.  
Everything here is structured to enable **daily iteration**, focused experimentation, and clean documentation.

If you're exploring RAG from first principles, feel free to use this as a scaffold for your own builds. And of course — check out the original repository for broader implementations and ideas.