# HyDE RAG Technique - Hypothetical Document Embeddings

## 📋 Overview

This implementation demonstrates the **HyDE (Hypothetical Document Embeddings)** retrieval technique, an advanced RAG (Retrieval-Augmented Generation) method that searches the vector store with an LLM-generated hypothetical answer instead of the raw user query.

## 🎯 What is HyDE?

HyDE is a retrieval technique that addresses a common problem in traditional RAG systems: **the semantic gap between user queries and document content**.

### Traditional RAG Problem
- User queries are often short and question-like
- Documents contain detailed, declarative content
- Embedding the raw query and searching directly can therefore miss relevant passages

### HyDE Solution
1. **Generate** a hypothetical answer to the user's query using an LLM
2. **Embed** this hypothetical document
3. **Search** for similar documents using the hypothetical document's embedding
4. **Retrieve** the most relevant actual documents
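
The four steps above can be sketched without any external services. This is only a toy illustration: `embed` is a bag-of-words stand-in for a real embedding model, `fake_llm` is a hypothetical placeholder that returns a canned passage instead of calling an LLM, and the three-sentence `corpus` is invented for the example.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def fake_llm(query):
    """Hypothetical placeholder for an LLM call; returns a canned answer-like passage."""
    return "A large language model is a neural network trained on text to predict the next token."

def hyde_retrieve(query, corpus, k=2, generate=fake_llm):
    hypothetical = generate(query)                      # 1. Generate
    h_vec = embed(hypothetical)                         # 2. Embed
    ranked = sorted(corpus,                             # 3. Search
                    key=lambda doc: cosine(h_vec, embed(doc)),
                    reverse=True)
    return ranked[:k], hypothetical                     # 4. Retrieve

corpus = [
    "Transformers are neural network architectures trained on large text corpora.",
    "FAISS builds an index for fast nearest-neighbour search over vectors.",
    "A language model assigns probabilities to the next token in a sequence.",
]
top_docs, hypothetical = hyde_retrieve("What is LLM?", corpus, k=2)
for doc in top_docs:
    print(doc)
```

Note that the query itself never gets embedded here: only the generated passage is compared against the corpus.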

## 🏗️ Architecture

```
User Query → LLM (Generate Hypothetical Doc) → Embed → Vector Search → Retrieve Documents
```

**Flow**:
1. User asks: "What is LLM?"
2. LLM generates a hypothetical answer written as if it came from the indexed textbook (here, `AI_Engineer_Book.pdf`)
3. Hypothetical answer is embedded
4. Vector search finds actual textbook chunks similar to the hypothetical answer
5. Returns top `k` most relevant chunks

## 💡 Why HyDE Works Better

### Traditional RAG
```
Query: "What is LLM?"
↓
Embed: [0.1, 0.3, 0.2, ...]
↓
Search: Finds chunks with similar embeddings
```

### HyDE RAG
```
Query: "What is LLM?"
↓
Generate: "A Large Language Model (LLM) is a type of neural network..."
↓
Embed: [0.2, 0.4, 0.3, ...]  (richer semantic representation)
↓
Search: Finds chunks similar to a full answer
```
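
To make the difference between the two flows concrete, here is a rough sketch that scores one chunk against both the raw query and the hypothetical answer. It uses plain token overlap (Jaccard similarity) as a crude proxy for embedding similarity, and the `chunk` text is made up for illustration. Real embedding models capture meaning beyond shared words, so the gap in practice is likely smaller than this toy suggests.

```python
import re

def tokens(text):
    """Lowercased set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Token-set overlap, used here as a crude proxy for embedding similarity."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

query = "What is LLM?"
hypothetical = "A Large Language Model (LLM) is a type of neural network..."
# Invented chunk, standing in for a passage from the indexed book.
chunk = "Large language models are neural networks trained to predict the next token."

query_score = jaccard(query, chunk)
hyde_score = jaccard(hypothetical, chunk)
print(f"query vs chunk: {query_score:.3f}")
print(f"hypothetical vs chunk: {hyde_score:.3f}")
```

The short question shares no vocabulary with the declarative chunk, while the generated passage does, which is the semantic gap HyDE is meant to narrow.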

In [39]:
from dotenv import load_dotenv

load_dotenv()

True

In [40]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI
from openai import OpenAI
from langchain_ollama import OllamaEmbeddings
from langchain_core.prompts import PromptTemplate

In [41]:
def replace_t_with_space(list_of_documents):
    """
    Replaces all tab characters ('\t') with spaces in the page content of each document

    Args:
        list_of_documents: A list of document objects, each with a 'page_content' attribute.

    Returns:
        The modified list of documents with tab characters replaced by spaces.
    """

    for doc in list_of_documents:
        doc.page_content = doc.page_content.replace('\t', ' ')  # Replace tabs with spaces
    return list_of_documents

def encode_pdf(file_path, chunk_size=1000, chunk_overlap=200):
    """
    Encodes a PDF book into a vector store using Ollama embeddings (nomic-embed-text).

    Args:
        file_path: The path to the PDF file.
        chunk_size: The desired size of each text chunk.
        chunk_overlap: The amount of overlap between consecutive chunks.

    Returns:
        A FAISS vector store containing the encoded book content.
    """
    
    loader = PyPDFLoader(file_path)
    documents = loader.load()

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
    )

    chunks = splitter.split_documents(documents)
    cleaned_texts = replace_t_with_space(chunks)

    embeddings = OllamaEmbeddings(model="nomic-embed-text")
    vectorstore = FAISS.from_documents(cleaned_texts, embeddings)

    return vectorstore
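
The effect of `chunk_size` and `chunk_overlap` can be seen with a simplified fixed-window splitter. This is not how `RecursiveCharacterTextSplitter` works internally: that class first tries to split on separators such as paragraph breaks, newlines, and spaces, so its chunks rarely line up as exact windows. The helper name `sliding_chunks` is made up for this sketch.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size windows where each window repeats the
    last `chunk_overlap` characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks

chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
print(chunks)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.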

In [42]:
class HyDERetriever:
    def __init__(self, file_path, chunk_size = 1000, chunk_overlap = 200):
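        # Assumption: the OpenAI-compatible endpoint and API key that serve this model
        # come from the environment variables loaded by load_dotenv() above.
        self.llm = ChatOpenAI(temperature=0, model="meta-llama/llama-3.3-70b-instruct", max_tokens=4000)

        # Same embedding model as encode_pdf(); the vector store embeds queries itself.
        self.embeddings = OllamaEmbeddings(model="nomic-embed-text")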
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.vectorstore = encode_pdf(file_path, chunk_size, chunk_overlap)
        
        self.hyde_prompt = PromptTemplate(
            input_variables=["query"],
            template="""
                You are summarizing how a 400-page AI engineering textbook explains a concept.

                Based on the question below, write a synthetic passage that reflects
                how the book would discuss this topic across multiple chapters.

                Focus on:
                - practical framing
                - systems and agents
                - engineering perspective
                - how the concept is used, not just defined

                Question:
                {query}
                """
            )
        
        self.hyde_chain = self.hyde_prompt | self.llm
        
    def generate_hypothetical_document(self, query):
        # The prompt only declares {query}, so that is the only variable passed in.
        return self.hyde_chain.invoke({"query": query}).content
        
    def retrieve(self, query, k=8):
        hypothetical_doc = self.generate_hypothetical_document(query)
        similar_docs = self.vectorstore.similarity_search(hypothetical_doc, k=k)
        return similar_docs, hypothetical_doc
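
What `similarity_search(hypothetical_doc, k=k)` does conceptually is a nearest-neighbour lookup: embed the text, then return the `k` stored vectors closest to it. The brute-force sketch below assumes Euclidean (L2) distance, which I believe is the default for LangChain's FAISS wrapper but is worth checking against the installed version. The two-dimensional vectors and the `top_k_l2` helper are invented for illustration.

```python
import heapq
import math

def top_k_l2(query_vec, doc_vecs, k=3):
    """Return (index, distance) pairs for the k vectors closest to query_vec."""
    scored = ((i, math.dist(query_vec, vec)) for i, vec in enumerate(doc_vecs))
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])

# Invented 2-D "embeddings"; real ones have hundreds of dimensions.
doc_vecs = [[0.0, 0.0], [1.0, 1.0], [0.2, 0.1], [5.0, 5.0]]
nearest = top_k_l2([0.0, 0.1], doc_vecs, k=2)
for index, distance in nearest:
    print(index, round(distance, 3))
```

A larger `k` returns more chunks and raises the chance of including the right passage, but also passes more loosely related text downstream.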
                   

In [43]:
path = "AI_Engineer_Book.pdf"

retriever = HyDERetriever(path, chunk_size=500, chunk_overlap=100)

In [44]:
test_query = "What is LLM?"
results, hypothetical_doc = retriever.retrieve(test_query, k=10)

In [45]:
import textwrap

def text_wrap(text, width=120):
    """
    Wraps the input text to the specified width.

    Args:
        text (str): The input text to wrap.
        width (int): The width at which to wrap the text.

    Returns:
        str: The wrapped text.
    """
    return textwrap.fill(text, width=width)

def show_context(context):
    """
    Display the contents of the provided context list.

    Args:
        context (list): A list of context items to be displayed.

    Prints each context item in the list with a heading indicating its position.
    """
    for i, c in enumerate(context):
        print(f"Context {i + 1}:")
        print(c)
        print("\n")

In [46]:
docs_content = [doc.page_content for doc in results]

# Display hypothetical document
print("hypothetical_doc:\n")
print(text_wrap(hypothetical_doc))
print()

# Display retrieved contexts
show_context(docs_content)

hypothetical_doc:

Large Language Models (LLMs) are a class of artificial intelligence (AI) systems that have revolutionized the field of
natural language processing (NLP). From an engineering perspective, LLMs are complex software systems that leverage deep
learning techniques to process and generate human-like language. These models are designed to learn patterns and
relationships within vast amounts of text data, enabling them to perform a wide range of tasks, such as language
translation, text summarization, and conversation generation.  In the context of systems and agents, LLMs can be viewed
as autonomous agents that interact with their environment through text-based interfaces. These agents are capable of
perceiving their environment, processing the input, and generating responses that are contextually relevant. For
instance, a chatbot powered by an LLM can engage in conversation with a user, understanding their queries, and
responding with accurate and informative answers.  Fro