## 🧠 Hypothetical Prompt Embeddings (HyPE) for Better Retrieval | RAG100X

This notebook implements **Hypothetical Prompt Embeddings (HyPE)** — a powerful technique for improving document retrieval in RAG systems by precomputing synthetic questions for every chunk of your documents.

Instead of embedding the raw document chunks (which might be long, descriptive, or poorly aligned with a user's query), HyPE uses an LLM to generate multiple relevant questions for each chunk ahead of time — and stores those questions in a vector database. At retrieval time, the user’s query is matched directly against these synthetic questions, leading to much more semantically aligned retrieval.

---

### ✅ What You’ll Learn

- Why raw document embeddings often fail to match real user queries  
- How HyPE turns every document chunk into a set of “likely questions” using an LLM (`gpt-4o-mini` in this notebook)  
- How we embed these questions and use them in FAISS for retrieval  
- When HyPE gives better results — and how it compares to standard document-based RAG  

---

### 🔍 Real-world Analogy

Imagine you’re organizing a library.  
Each book has:
- A summary (raw document chunk)  
- And a set of FAQ-style questions written on the back:  
  - “What is the cause of rising sea levels?”  
  - “How does agriculture impact climate change?”

Now a user walks in and asks:  
> *"What are the main drivers of global warming?"*

With a standard system, you’d match their query to the book summary. That might or might not work — the wording could be too different.

But with HyPE, you match the user’s question directly to one of those prewritten questions — and boom, you find the right book instantly. 🔍

✅ **HyPE bridges the gap between how users ask and how content is written.**

---

### 🔬 How HyPE Works Under the Hood

Let’s say we have this chunk of a document:

> "Greenhouse gases like CO₂, methane, and nitrous oxide trap heat in the Earth’s atmosphere. These gases are largely released through fossil fuel burning, agriculture, and industrial processes."

With **traditional RAG**:

| Step    | What Happens                                                  |
|---------|---------------------------------------------------------------|
| Embed   | The full paragraph is embedded as-is                          |
| Retrieve| User query is compared to this raw embedding                  |
| Issue   | Mismatch if query wording doesn’t align                       |

With **HyPE**, we do something smarter:

| Step              | What Happens                                                                 |
|-------------------|------------------------------------------------------------------------------|
| 1. Chunk          | We take the original document chunk                                          |
| 2. Prompting      | The LLM is asked: *“Write 3–5 questions that this chunk could answer”*       |
| 3. Example Output | “What are the major greenhouse gases?”<br>“How does agriculture affect emissions?”<br>“Why is CO₂ a concern?” |
| 4. Embedding      | Each of these questions is embedded using OpenAI                             |
| 5. Storage        | All these vectors are stored in FAISS, pointing to the same original chunk   |
| 6. Retrieval      | At query time, user’s question is compared against these synthetic ones      |

✅ Now, if someone asks:  
> *"What causes climate change?"*

…it is likely to land close to the synthetic question *"What are the major greenhouse gases?"*, which leads us to the right context.
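
To make the many-to-one mapping concrete, here is a minimal offline sketch. It is not the notebook's pipeline: word overlap (Jaccard similarity) stands in for a real embedding model, and the names `hype_index`, `tokens`, `jaccard`, and `retrieve` are made up for illustration.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for an embedding (assumption)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap score in [0, 1]; higher means more similar."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

chunks = {
    "c1": "Greenhouse gases like CO2, methane, and nitrous oxide trap heat in the Earth's atmosphere.",
    "c2": "Microplastics accumulate in marine ecosystems over time.",
}

# Offline step: questions an LLM might write for each chunk (hand-written here).
# Several questions can point to the same chunk id.
hype_index = [
    ("What are the major greenhouse gases?", "c1"),
    ("Why is CO2 a concern?", "c1"),
    ("How does plastic harm the ocean?", "c2"),
]

def retrieve(query: str) -> str:
    """Return the chunk whose stored question best matches the query."""
    query_tokens = tokens(query)
    _, chunk_id = max(hype_index, key=lambda item: jaccard(query_tokens, tokens(item[0])))
    return chunks[chunk_id]

print(retrieve("Which gases are the main greenhouse gases?"))
```

The query is compared only against the stored questions; the chunk text is looked up afterwards through the chunk id.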

---

### 🧪 Why This Works So Well

- 📌 **Style alignment**: User queries are questions. Now the stored vectors are questions too.  
- 🔁 **Multi-angle coverage**: A single chunk can answer many different questions — and HyPE captures them all.  
- ⚡ **Fast at runtime**: All the hard work (prompting + embedding) is done offline. Retrieval is as fast as FAISS.

---

### 🏗️ Why This Matters in Production

Most real-world users don’t speak in textbook paragraphs. They ask questions like:
- “How does plastic harm the ocean?”
- “What’s the cause of inflation?”

But if your document says:
> *“Microplastics accumulate in marine ecosystems over time…”*

The vector match might fail — because the query and text just don’t sound alike.

**HyPE fixes this with precomputed question embeddings that match the user’s language, not the author’s.**

🔑 It’s like your system “thinks ahead” and asks:  
> *"If someone needed this chunk, what kind of questions would they ask?"*

---

### 🔄 Where This Fits in RAG100X

So far, RAG100X has explored:

1. PDF-based QA  
2. CSV-based retrieval  
3. Blog-based hallucination checks  
4. Chunk-size tuning  
5. Proposition-level chunking  
6. Query rewriting & decomposition  
7. HyDE: Imagine the answer before searching  

Now in **Day 8**, we go further — not by rewriting the user’s question, but by **rewriting the documents as questions in advance**.

> 💡 **HyPE is like giving your documents a voice — so they can “raise their hand” when a user asks something they know the answer to.**


## 📦 Installation & Setup

In [None]:

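# Install required packages (concurrent.futures ships with Python 3, so it needs no extra package)
!pip install faiss-cpu langchain-community langchain-openai langchain-text-splitters pypdf python-dotenv tqdm

import os
from typing import List

import faiss
from tqdm import tqdm
from dotenv import load_dotenv
from concurrent.futures import ThreadPoolExecutor, as_completed

# LangChain building blocks used throughout the notebook
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter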


# Load environment variables from a .env file
load_dotenv()

# Prompt for the OpenAI API key only if it is not already set (e.g. via the .env file)
if not os.getenv("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = input("Please enter your OpenAI API key: ")

### 📌 Define Constants and Download Dataset

We start by setting up some important constants and downloading the data:

- **PATH**: This points to the PDF file that we will use as our document source.
- **LANGUAGE_MODEL_NAME**: Specifies the LLM we will use (e.g., `gpt-4o-mini`).
- **EMBEDDING_MODEL_NAME**: Sets the embedding model (e.g., `text-embedding-3-small`) to convert chunks into vector representations.
- **CHUNK_SIZE** and **CHUNK_OVERLAP**: These control how the document is split:
  - `CHUNK_SIZE` defines the target length of each chunk, measured in characters.
  - `CHUNK_OVERLAP` sets how many characters consecutive chunks share, which preserves context at chunk boundaries.

The sample document is a PDF on climate change. Make sure it is available at `PATH` (download it first if needed); it will be embedded and used throughout the RAG pipeline.
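
To see how chunk size and overlap interact, here is a simplified sketch that uses fixed-width character windows. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and sentences); the function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Fixed-width windows; each window starts chunk_size - chunk_overlap after the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for piece in pieces:
    print(piece)
```

With these toy values, the last four characters of each piece reappear as the first four characters of the next one, which is the redundancy that `CHUNK_OVERLAP` controls.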


In [None]:
# Path to the input document used in the RAG pipeline
PATH = "data/Understanding_Climate_Change.pdf"

# Language model for generating responses (LLM)
LANGUAGE_MODEL_NAME = "gpt-4o-mini"

# Embedding model for converting text into vector format
EMBEDDING_MODEL_NAME = "text-embedding-3-small"

# Chunking configuration for RecursiveCharacterTextSplitter
CHUNK_SIZE = 1000        # Target size of each document chunk
CHUNK_OVERLAP = 200      # Number of overlapping characters between consecutive chunks

### 🧠 Generate Hypothetical Prompt Embeddings (HyPE)

In this step, we simulate how users might query a specific chunk of text by generating **hypothetical questions** using an LLM. These questions are treated as semantic proxies for the chunk and embedded for later retrieval.

#### 🔍 Why this matters:
Instead of embedding raw chunks directly, we extract likely *questions* someone might ask about that chunk. This results in:
- Better semantic alignment between queries and documents.
- Improved retrieval performance for natural user queries.

#### 🛠️ Under the hood:
1. A prompt is used to instruct the LLM to generate multiple natural questions about the given chunk.
2. These questions are parsed and cleaned (e.g., removing extra newlines).
3. Each question is embedded using OpenAI's embedding model.
4. We return both the original chunk and its corresponding question embeddings.
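
The parsing and cleaning in step 2 could look roughly like the sketch below. The helper name `parse_questions` and the sample output string are assumptions for illustration; the idea is to tolerate blank lines, bullets, and numbering even when the prompt asks the model to avoid them.

```python
import re

def parse_questions(raw: str) -> list[str]:
    """Split raw LLM output into clean one-line questions."""
    questions = []
    for line in raw.splitlines():
        # Drop a leading bullet or number such as "- ", "* ", "1. " or "2) "
        cleaned = re.sub(r"^\s*(?:[-*•]|\d+[.)])\s+", "", line).strip()
        if cleaned:  # skip blank lines
            questions.append(cleaned)
    return questions

raw_output = (
    "What are the major greenhouse gases?\n\n"
    "1. How does agriculture affect emissions?\n"
    "- Why is CO2 a concern?\n"
)
print(parse_questions(raw_output))
```

Filtering out empty strings matters because every entry in the list is sent to the embedding model.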

#### ⚙️ Why return the `chunk_text` along with the embeddings?

When using **multithreading** or **parallel processing** to speed up embedding generation, each thread typically works on a different chunk independently.

However, the results are collected with `as_completed`, which yields futures in the order they finish rather than the order they were submitted. Each result therefore has to carry its own link back to the original input, which in this case is the `chunk_text`.

Returning both the original chunk and its associated embeddings allows us to:
- Keep track of which embeddings belong to which text chunk.
- Avoid issues with unordered or mismatched outputs in parallel processing.
- Efficiently construct the final vectorstore after embedding is complete.

So even though we only need the embeddings for retrieval, we return `chunk_text` too for traceability and mapping during parallel execution.
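
A small stand-alone sketch of this pattern, using only the standard library. `fake_embed` is a hypothetical stand-in for the LLM and embedding calls; it sleeps for a time proportional to the text length so that tasks finish in a different order than they were submitted.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_embed(chunk_text: str):
    """Stand-in for the real work; returns the input together with its 'embedding'."""
    time.sleep(0.01 * len(chunk_text))
    return chunk_text, [float(len(chunk_text))]

chunks = ["a long chunk of text", "short", "medium one"]

results = {}
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(fake_embed, c) for c in chunks]
    # Futures are yielded as they finish, not in submission order
    for future in as_completed(futures):
        chunk_text, vector = future.result()
        results[chunk_text] = vector  # the returned chunk_text restores the mapping

print(results)
```

Because each result carries its own `chunk_text`, the mapping stays correct regardless of which task finishes first.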


This technique, called **Hypothetical Prompt Embeddings (HyPE)**, helps bridge the gap between document phrasing and user queries by mimicking how people naturally ask about content.


In [None]:
def generate_hypothetical_prompt_embeddings(chunk_text: str):
    """
    Uses the LLM to generate multiple hypothetical questions for a single chunk.
    These questions act as proxies for the chunk during retrieval.

    Parameters:
    chunk_text (str): Text contents of the chunk.

    Returns:
    chunk_text (str): Returned as-is to simplify multi-threaded processing.
    hypothetical_embeddings (List[List[float]]): One embedding vector per generated question.
    """
    
    # Load LLM and embedding model with specified names
    llm = ChatOpenAI(temperature=0, model_name=LANGUAGE_MODEL_NAME)
    embedding_model = OpenAIEmbeddings(model=EMBEDDING_MODEL_NAME)

    # PromptTemplate defines the instructions given to the LLM
    question_gen_prompt = PromptTemplate.from_template(
        # Adjacent string literals are concatenated, which keeps code indentation out of the prompt
        "Analyze the input text and generate essential questions that, when answered, "
        "capture the main points of the text. Each question should be one line, "
        "without numbering or prefixes.\n\n"
        "Text:\n{chunk_text}\n\nQuestions:\n"
    )

    # Chain the prompt to LLM and parse output as string
    question_chain = question_gen_prompt | llm | StrOutputParser()

    # Invoke the chain to generate a list of questions
    raw_questions = question_chain.invoke({"chunk_text": chunk_text})

    # Split into one question per line and drop blank lines (the model often separates questions with \n\n)
    questions = [q.strip() for q in raw_questions.split("\n") if q.strip()]

    # Embed the list of questions into vector format
    return chunk_text, embedding_model.embed_documents(questions)


### 🧠 What Does This Block Do?

This code builds a **FAISS vector store** that helps find relevant chunks later when a user asks a question.  
But instead of just embedding the raw text chunks, we use **Hypothetical Prompt Embeddings (HyPE)** — meaning we:

- Generate **questions** that the chunk could answer.
- Embed those questions instead of the chunk itself.
- Store those embeddings in FAISS for fast, accurate retrieval.

---

### ⚡ Why Use Hypothetical Prompt Embeddings?

Because when someone asks a question, it's easier to match it against other **questions** than against raw text.

> 🔍 For example:  
> Suppose we have a chunk:  
> _"LangChain is a framework for developing LLM-powered apps using chains of components like retrievers, memory, and tools."_  
>
> We ask the LLM:  
> 👉 “What are the key questions this chunk could answer?”  
> It might return:
> - What is LangChain?
> - What is LangChain used for?
> - What are the components in a LangChain pipeline?

These questions are embedded and stored.

Later, when a user asks:  
💬 “How does LangChain work?”  
...it will match well with these hypothetical questions and retrieve the right chunk.

---

### ⚙️ How It Works Under the Hood

1. **Multithreading with `ThreadPoolExecutor`**  
   We use multithreading to generate embeddings **in parallel** — so if there are 100 chunks, we don’t wait for one to finish before starting the next.

2. **Generate Hypothetical Questions**  
   For each chunk, we call an LLM (here `gpt-4o-mini`) to create a handful of questions the chunk might answer. The prompt does not fix the number, so the count varies per chunk.

3. **Embed the Questions**  
   Each question is embedded using OpenAI's embedding model, turning it into a vector of numbers.

4. **Store in FAISS**  
   FAISS is initialized to store these vectors efficiently:
   - It uses **L2 distance** to find which vectors are "closest" (most relevant).
   - It also keeps the original text so we can show it later.

5. **Add Multiple Embeddings per Chunk**  
   Since one chunk can answer many questions, we embed and store each question separately — even if they all point to the same chunk.

---

### ✅ Result

You now have a FAISS store where each text chunk is **indexed through its most likely questions**, improving both **retrieval relevance** and **semantic alignment**.
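
The sketch below mimics, in plain Python, what a flat L2 index does at query time: an exhaustive comparison against every stored vector, ranked by squared L2 distance. The 2-D vectors and chunk labels are invented for illustration; real embeddings have far more dimensions.

```python
# Each stored vector (one per hypothetical question) points back to a chunk.
vectors = [
    ([0.9, 0.1], "chunk about greenhouse gases"),
    ([0.8, 0.3], "chunk about greenhouse gases"),
    ([0.1, 0.9], "chunk about microplastics"),
    ([0.2, 0.7], "chunk about microplastics"),
]

def squared_l2(a, b) -> float:
    """Squared Euclidean distance; smaller means closer."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def search(query_vector, k: int) -> list[str]:
    """Brute-force nearest-neighbour search over all stored vectors."""
    ranked = sorted(vectors, key=lambda item: squared_l2(query_vector, item[0]))
    return [chunk for _, chunk in ranked[:k]]

hits = search([1.0, 0.2], k=3)
print(hits)
```

Note that the same chunk can appear more than once in the top-k results, because several of its question vectors may sit close to the query.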


In [None]:
def prepare_vector_store(chunks: list):
    """
    Builds a FAISS store from LangChain Document chunks, indexing each chunk
    once per hypothetical question embedding.
    """
    # Wait with initialization until the first result reveals the vector length
    vector_store = None
    # Run prompt embedding generation in parallel using multithreading
    # This speeds up processing across multiple chunks
    with ThreadPoolExecutor() as pool:
        # Pass the chunk text (not the Document object) so the prompt sees clean text
        futures = [pool.submit(generate_hypothetical_prompt_embeddings, c.page_content) for c in chunks]

        # Process completed results in order of finish (not submission)
        for f in tqdm(as_completed(futures), total=len(chunks)):
            chunk_text, vectors = f.result()  # Chunk text and list of its prompt embeddings

            # Skip chunks for which no question embeddings came back
            if not vectors:
                continue

            # Initialize FAISS on first result to get vector dimension
            if vector_store is None:
                vector_store = FAISS(
                    embedding_function=OpenAIEmbeddings(model=EMBEDDING_MODEL_NAME),
                    index=faiss.IndexFlatL2(len(vectors[0])),  # L2 distance-based similarity
                    docstore=InMemoryDocstore(),
                    index_to_docstore_id={}
                )

            # Each chunk is stored multiple times, once per generated question embedding
            chunks_with_embedding_vectors = [(chunk_text, vec) for vec in vectors]

            # Add all embeddings (and corresponding content) to FAISS
            vector_store.add_embeddings(chunks_with_embedding_vectors)
    return vector_store  # Return the populated vector store

### Encode PDF into FAISS Vector Store

- Loads PDF and extracts raw text.
- Splits text into overlapping chunks (preserve context).
- Cleans the text (removes noisy characters like \t).
- Embeds each chunk using HyPE (hypothetical questions).
- Stores embeddings in FAISS for fast and accurate retrieval.


In [None]:
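def replace_t_with_space(list_of_documents):
    """
    Minimal helper, assumed from how it is used below: replaces tab characters
    with spaces in the page content of each document.
    """
    for doc in list_of_documents:
        doc.page_content = doc.page_content.replace("\t", " ")
    return list_of_documents


def encode_pdf(path, chunk_size=1000, chunk_overlap=200):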
    """
    Encodes a PDF into a FAISS vector store using HyPE-based embeddings.
    """
    # Load the PDF and extract all text
    loader = PyPDFLoader(path)
    documents = loader.load()

    # Split text into overlapping chunks to preserve context
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len
    )
    texts = text_splitter.split_documents(documents)

    # Clean the text to remove \t and other noisy characters
    cleaned_texts = replace_t_with_space(texts)

    # Generate hypothetical embeddings and store in FAISS
    vectorstore = prepare_vector_store(cleaned_texts)

    return vectorstore


In [None]:
def retrieve_context_per_question(question, retriever):
    # Fetch the top-k documents for this question
    docs = retriever.invoke(question)
    return [doc.page_content for doc in docs]

In [None]:
def show_context(context_list):
    for i, chunk in enumerate(context_list):
        print(f"\n--- Chunk {i+1} ---\n{chunk[:500]}...")

### Create & Test HyPE Retriever

- 📚 Process the PDF and encode it into a FAISS vector store.
- 🔍 Initialize a retriever to fetch top-k relevant chunks.
- 🧪 Run a sample query to test retrieval quality.
- ✅ Deduplicate and display the matched context for inspection.
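
Because several question vectors point to the same chunk, the top-k hits can contain repeats. One way to deduplicate while keeping the best-ranked hit first is sketched below; the helper name `dedupe_keep_order` and the sample hits are hypothetical.

```python
def dedupe_keep_order(items: list[str]) -> list[str]:
    """Drop repeats while keeping the first (best-ranked) occurrence of each item."""
    # dict preserves insertion order, so the ranking from the retriever survives
    return list(dict.fromkeys(items))

hits = ["chunk A", "chunk A", "chunk B", "chunk A"]
unique_hits = dedupe_keep_order(hits)
print(unique_hits)
```

A plain `set` would also remove duplicates, but it does not guarantee order, so the most relevant chunk might no longer come first. Also note that after deduplication there may be fewer than `k` distinct chunks; requesting a larger `k` is one way to compensate.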


In [None]:
# Encode the PDF into a FAISS vector store using HyPE embeddings
# Chunk size can be large with HyPE — more info improves question generation
chunks_vector_store = encode_pdf(PATH, chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)

# Create a retriever to fetch top-k relevant chunks based on query similarity
chunks_query_retriever = chunks_vector_store.as_retriever(search_kwargs={"k": 3})

# Test the retriever with a sample query
test_query = "What is the main cause of climate change?"

# Retrieve context chunks and drop duplicates while keeping the ranked order
context = retrieve_context_per_question(test_query, chunks_query_retriever)
context = list(dict.fromkeys(context))

# Display the context returned by the retriever
show_context(context)


---

## 📘 Summary & Credits

This notebook is based on the excellent open-source repository [RAG_Techniques by NirDiamant](https://github.com/NirDiamant/RAG_Techniques).  
I referred to that work to understand how the pipeline is structured and then reimplemented the same concept in a **fully self-contained** way, but using recent models — as part of my personal learning journey.

The purpose of this notebook is purely **educational**:  
- To deepen my understanding of Retrieval-Augmented Generation systems  
- To keep a clean, trackable log of what I’ve built and learned  
- And to serve as a future reference for myself or others starting from scratch

To support that, I’ve added clear, concise markdown cells throughout the notebook, explaining *why* each package was installed, *why* each line of code exists, and *how* each component fits into the overall RAG pipeline. It’s designed to help anyone (including my future self) grasp the **how** and the **why**, not just the **what**.


## 🔍 Why Use HyPE in RAG?

Traditional RAG systems often struggle when there's a mismatch between the **style of the user query** and the way documents are written. This is especially problematic when user queries are short, vague, or phrased as questions.

**HyPE (Hypothetical Prompt Embeddings)** solves this by:
- ❓ **Generating multiple hypothetical questions per chunk during indexing**
- 📌 **Embedding those questions** so queries can match with them instead of raw chunk text

This turns retrieval into a **question-to-question matching problem**, which better aligns user queries with relevant information.

---

## 🧠 What’s New in This Version?

This implementation of HyPE offers:

- 🤖 **GPT-4o-based prompt generation** — Generates multiple smart questions per chunk  
- ⚡ **Offline embedding at indexing time** — No runtime query expansion needed  
- 📄 **Multi-vector indexing** — Each chunk gets several semantic representations  
- 🧱 **Colab-ready & self-contained** — All logic included, no copy-paste from modules  

It’s designed for flexibility and can be used with any PDF input.

---

## 📈 Inferences & Key Takeaways

Running HyPE on real-world queries like *"What is the main cause of climate change?"* reveals:

- ❔ The hypothetical questions align closely with what users typically ask  
- 📥 FAISS retrieves more accurate and diverse chunks by matching on these questions  
- 💡 This approach can improve grounding **without adding an LLM call at query time**; the cost is paid during indexing, through extra LLM calls and a larger index  

It’s an efficient way to enhance retrieval **without extra cost at query time**.
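
The cost that HyPE moves to indexing time shows up partly as index size. As a rough back-of-the-envelope sketch: a flat index stores one float32 vector per question, and `text-embedding-3-small` produces 1536-dimensional vectors by default. The chunk and question counts below are made-up numbers, not measurements from this notebook.

```python
EMBEDDING_DIM = 1536     # default output dimension of text-embedding-3-small
BYTES_PER_FLOAT32 = 4    # a flat FAISS index stores float32 values

def flat_index_bytes(num_chunks: int, vectors_per_chunk: int) -> int:
    """Raw vector storage of a flat index, ignoring docstore and metadata overhead."""
    return num_chunks * vectors_per_chunk * EMBEDDING_DIM * BYTES_PER_FLOAT32

# Hypothetical corpus: 100 chunks, 5 generated questions per chunk
baseline_bytes = flat_index_bytes(num_chunks=100, vectors_per_chunk=1)
hype_bytes = flat_index_bytes(num_chunks=100, vectors_per_chunk=5)

print(f"one vector per chunk: {baseline_bytes:,} bytes")
print(f"HyPE (5 per chunk):   {hype_bytes:,} bytes")
```

The index grows linearly with the number of questions per chunk, which is usually a small price for a corpus of this size.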

---

## 🚀 What Could Be Added Next?

For a production-grade HyPE-enhanced RAG pipeline:

- 🔁 **Combine with traditional retrieval** — Use both chunk and prompt-based indexes  
- 🧪 **Run evaluation benchmarks** — Measure hallucination and relevance gains  
- ⚙️ **Swap OpenAI for other models**: Try open-weight embeddings such as BGE locally, or another hosted provider such as Cohere, to reduce cost  
- 🧵 **Pair with rerankers** — Filter top chunks using LLM scoring after retrieval  
- 🔌 **Support other vectorstores** — Easily integrate with pgvector, Chroma, or Weaviate  
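
As one possible shape for the first idea, the sketch below merges two ranked result lists, one from a question-based index and one from a raw-chunk index, by taking hits alternately and skipping repeats. This round-robin merge is only an illustrative assumption; the function name `merge_ranked` and the sample hits are hypothetical, and a real system might use score-based fusion instead.

```python
from itertools import zip_longest

def merge_ranked(hype_hits: list[str], chunk_hits: list[str], k: int) -> list[str]:
    """Alternate between both ranked lists, keeping the first occurrence of each chunk."""
    merged = []
    for pair in zip_longest(hype_hits, chunk_hits):
        for hit in pair:
            if hit is not None and hit not in merged:
                merged.append(hit)
    return merged[:k]

combined = merge_ranked(["A", "B", "C"], ["B", "D"], k=3)
print(combined)
```

Each list contributes its top result before either contributes its second, so a chunk found by only one index still has a chance to appear.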

---


## 💡 Final Word

This notebook is part of my larger personal project: **RAG100X**, a challenge to build and log my journey in RAG from 0 to 100 in the coming months.

It’s not built to impress — it’s built to **progress**.  
Everything here is structured to enable **daily iteration**, focused experimentation, and clean documentation.

If you're exploring RAG from first principles, feel free to use this as a scaffold for your own builds. And of course — check out the original repository for broader implementations and ideas.