### 📖 Where We Are

**In the previous notebooks**, we explored query enhancement techniques like **Query Expansion** (making a query more detailed) and **Query Decomposition** (breaking a complex query into simpler parts).

**In this notebook**, we'll learn about a third, fundamentally different technique: **HyDE (Hypothetical Document Embeddings)**. Instead of modifying the user's query text, HyDE generates a completely new, hypothetical document that answers the query. We then use the embedding of this *hypothetical answer* to find real, factual documents in our knowledge base. This is a powerful method for bridging the semantic gap between a question and its answer.

### 1. HyDE (Hypothetical Document Embeddings)

🧠 **What is HyDE?**

HyDE is a retrieval technique where, instead of embedding the user’s query directly, you first generate a hypothetical answer (document) to the query using an LLM. You then embed that hypothetical document to search your vector store. The core idea is that a hypothetical answer is likely to be much closer in the vector space to the real answer documents than the original question is.

➡️ **HyDE is most effective when:**

1.  Queries are short or ambiguous.
2.  There is a phrasing mismatch between the query and the documents (e.g., an interrogative question vs. declarative statements).
3.  You want to retrieve based on the *content of a likely answer*, not the words in the question.
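
To make the "closer in vector space" intuition concrete, here is a toy sketch. It assumes a bag-of-words count vector as the "embedding", which is only a stand-in for a real model such as the one used later in this notebook. The hypothetical answer and the chunk are shortened from the outputs shown further down; the helper names `embed` and `cosine` are illustrative.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (NOT a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


question = "When was Steve Jobs fired from Apple?"
hypothetical_answer = (
    "Steve Jobs was forced out of Apple in a boardroom coup on September 17, 1985."
)
real_chunk = (
    "In 1985, Jobs departed Apple after a long power struggle with the "
    "company's board and its then-CEO, John Sculley."
)

# Compare how close each text is to the real chunk.
sim_question = cosine(embed(question), embed(real_chunk))
sim_hyde = cosine(embed(hypothetical_answer), embed(real_chunk))

print(f"question vs. chunk:            {sim_question:.3f}")
print(f"hypothetical answer vs. chunk: {sim_hyde:.3f}")
```

In this toy setup the hypothetical answer shares more vocabulary with the chunk (for example the year) than the question does, so it scores higher. Real embedding models capture meaning rather than word overlap, but the same effect is what HyDE relies on.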

In [1]:
# --- Standard LangChain and Environment Imports ---
import os
from dotenv import load_dotenv
from langchain_community.document_loaders import WikipediaLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chat_models import init_chat_model

In [2]:
# --- 1. Load and Chunk Documents ---
# We'll use a Wikipedia loader to get a sample dataset about Steve Jobs.
loader = WikipediaLoader(query="Steve Jobs", load_max_docs=1)
documents = loader.load()

# Split the documents into manageable chunks.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
docs = text_splitter.split_documents(documents=documents)
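
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a minimal sketch using a plain fixed-size sliding window. This is an assumption made for illustration: it is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter tries to break on separators such as paragraphs and sentences first), and `sliding_window_chunks` is a hypothetical helper.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # How far each new window advances.
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last window already reached the end of the text.
    return chunks


# A 700-character dummy string stands in for the Wikipedia article.
sample = "".join(chr(ord("a") + i % 26) for i in range(700))
chunks = sliding_window_chunks(sample, chunk_size=300, chunk_overlap=50)

for i, chunk in enumerate(chunks):
    print(f"chunk {i}: {len(chunk)} characters")
```

The overlap means the tail of one chunk is repeated at the head of the next, which reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.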

In [3]:
# --- 2. Setup LLM and Vector Store ---
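load_dotenv()  # Loads GROQ_API_KEY from a local .env file into os.environ.
if not os.getenv("GROQ_API_KEY"):
    raise RuntimeError("GROQ_API_KEY is not set; add it to your .env file.")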
llm = init_chat_model("groq:gemma2-9b-it")

# Create the vector store for our documents.
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(docs, embeddings)
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})


### 3. Manual HyDE Implementation
First, let's implement the HyDE logic manually to understand each step of the process clearly.

In [4]:
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# This function takes a query and uses an LLM to generate a hypothetical answer.
def generate_hypothetical_document(query: str) -> str:
    # Create a prompt that instructs the LLM to write a concise, plausible-sounding answer.
    template = """Imagine you are an expert on the topic. Please write a concise, factual-sounding paragraph answering the following question:
    Question: '{query}'
    Hypothetical Answer:"""
    prompt = ChatPromptTemplate.from_template(template=template)
    
    # Create a simple chain to pass the formatted prompt to the LLM.
    chain = prompt | llm | StrOutputParser()
    
    # Invoke the chain to get the hypothetical document.
    return chain.invoke({"query": query})

In [5]:
# Let's test our function.
query = 'When was Steve Jobs fired from Apple?'
hypothetical_doc = generate_hypothetical_document(query=query)
print("--- HYPOTHETICAL DOCUMENT ---")
print(hypothetical_doc)

--- HYPOTHETICAL DOCUMENT ---
Steve Jobs was forced out of Apple in a boardroom coup on September 17, 1985.  The decision came after a series of internal conflicts and declining market share for Apple's products, particularly in the face of rising competition from IBM-compatible PCs. Jobs' visionary leadership and aggressive management style had proven increasingly incompatible with the company's direction.  



In [6]:
# Now, use the hypothetical document to retrieve real documents.
# We are NOT using the original query for retrieval.
retrieved_docs = base_retriever.invoke(hypothetical_doc)

print(f"--- REAL DOCUMENTS RETRIEVED FOR: '{query}' ---")
for doc in retrieved_docs:
    print(f"- {doc.page_content}\n")

--- REAL DOCUMENTS RETRIEVED FOR: 'When was Steve Jobs fired from Apple?' ---
- In 1985, Jobs departed Apple after a long power struggle with the company's board and its then-CEO, John Sculley. That same year, Jobs took some Apple employees with him to found NeXT, a computer platform development company that specialized in computers for higher-education and business markets,

- In 1997, Jobs returned to Apple as CEO after the company's acquisition of NeXT. He was largely responsible for reviving Apple, which was on the verge of bankruptcy. He worked closely with British designer Jony Ive to develop a line of products and services that had larger cultural ramifications,

- Steven Paul Jobs (February 24, 1955 – October 5, 2011) was an American businessman, inventor, and investor best known for co-founding the technology company Apple Inc. Jobs was also the founder of NeXT and chairman and majority shareholder of Pixar. He was a pioneer of the personal computer

- services that had larger

### 4. Using the LangChain `HypotheticalDocumentEmbedder`
LangChain provides a convenient wrapper, `HypotheticalDocumentEmbedder`, that automates the "generate-then-embed" process. It acts like a special embedding function that you can plug directly into your vector store setup. When you ask it to embed a query, it internally performs the HyDE process.
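
The wrapper idea can be sketched in plain Python. This is a simplified illustration under stated assumptions, not LangChain's actual implementation: `ToyHydeEmbedder`, `fake_llm`, and `fake_embed` are hypothetical stand-ins. The sketch assumes that documents are embedded directly with the base model and that only queries are routed through the generation step.

```python
from typing import Callable, List

Vector = List[float]


class ToyHydeEmbedder:
    """Embedding-like object: documents go straight to the base embedder,
    queries are first replaced by a generated hypothetical document."""

    def __init__(self, generate: Callable[[str], str], embed: Callable[[str], Vector]):
        self.generate = generate  # Stand-in for the LLM call.
        self.embed = embed        # Stand-in for the base embedding model.

    def embed_documents(self, texts: List[str]) -> List[Vector]:
        # Real documents are embedded as-is.
        return [self.embed(t) for t in texts]

    def embed_query(self, query: str) -> Vector:
        # Generate first, then embed the generated text instead of the query.
        hypothetical_doc = self.generate(query)
        return self.embed(hypothetical_doc)


def fake_llm(query: str) -> str:
    """Hypothetical LLM stub that always returns the same answer."""
    return "Steve Jobs left Apple in 1985."


def fake_embed(text: str) -> Vector:
    """Toy 2-dimensional 'embedding': [text length, number of digits]."""
    return [float(len(text)), float(sum(c.isdigit() for c in text))]


hyde = ToyHydeEmbedder(generate=fake_llm, embed=fake_embed)
query = "When was Steve Jobs fired from Apple?"
query_vector = hyde.embed_query(query)

print("HyDE query vector:  ", query_vector)
print("Direct query vector:", fake_embed(query))
```

Because the object exposes the same two methods as an ordinary embedding model, a vector store can use it without knowing that an LLM call happens behind `embed_query`.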

In [9]:
from langchain.chains.hyde.base import HypotheticalDocumentEmbedder

In [10]:
# --- Set Up the HyDE Embedder ---

# from_llm builds the hypothetical-document generation chain for us from the LLM
# and a pre-defined prompt for web search-style questions.
hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
    llm=llm, 
    base_embeddings=embeddings, # The base model to embed the generated doc.
    prompt_key="web_search"    # A pre-built prompt for this task.
)

print("HyDE Embedder created successfully.")

HyDE Embedder created successfully.


In [11]:
# Now, let's test the HyDE embedder directly.
# This will first generate a hypothetical document for our query and then embed it.
result = hyde_embeddings.embed_query(query)

print(f"Original Query: {query}")
print(f"Embedding dimension: {len(result)}")
print(f"Sample of embedding vector: {result[:5]}")

Original Query: When was Steve Jobs fired from Apple?
Embedding dimension: 384
Sample of embedding vector: [np.float64(-0.02763230912387371), np.float64(0.05564332380890846), np.float64(0.08019711077213287), np.float64(-0.0018339100060984492), np.float64(0.05279001221060753)]


#### Using HyDE for Retrieval
Now we can create a new vector store that uses our `hyde_embeddings` object as its embedding function. The documents themselves are still embedded by the base model when the index is built; only the query is passed through the HyDE process at search time.

In [12]:
# Create a new FAISS vector store, this time passing our HyDE embedder.
hyde_vectorstore = FAISS.from_documents(docs, hyde_embeddings)
hyde_retriever = hyde_vectorstore.as_retriever(search_kwargs={"k": 4})

# When we invoke this retriever, it will automatically use the HyDE process on the query.
hyde_retrieved_docs = hyde_retriever.invoke(query)

print(f"--- REAL DOCUMENTS RETRIEVED (LANGCHAIN HyDE) FOR: '{query}' ---")
for doc in hyde_retrieved_docs:
    print(f"- {doc.page_content}\n")

--- REAL DOCUMENTS RETRIEVED (LANGCHAIN HyDE) FOR: 'When was Steve Jobs fired from Apple?' ---
- In 1985, Jobs departed Apple after a long power struggle with the company's board and its then-CEO, John Sculley. That same year, Jobs took some Apple employees with him to found NeXT, a computer platform development company that specialized in computers for higher-education and business markets,

- In 1997, Jobs returned to Apple as CEO after the company's acquisition of NeXT. He was largely responsible for reviving Apple, which was on the verge of bankruptcy. He worked closely with British designer Jony Ive to develop a line of products and services that had larger cultural ramifications,

- Steven Paul Jobs (February 24, 1955 – October 5, 2011) was an American businessman, inventor, and investor best known for co-founding the technology company Apple Inc. Jobs was also the founder of NeXT and chairman and majority shareholder of Pixar. He was a pioneer of the personal computer

- service

### 🔑 Key Takeaways

* **HyDE Bridges the Semantic Gap**: HyDE addresses the problem that a user's question and a factual answer can be semantically distant. It works by searching with the embedding of a *hypothetical answer*, which is more likely to lie close to the *real answers* in the vector space than the question's own embedding is.
* **Workflow**: The process is: **Query -> Generate Hypothetical Document -> Embed Hypothetical Document -> Retrieve Real Documents.**
* **LangChain Integration**: LangChain's `HypotheticalDocumentEmbedder` automates this process. It acts as a special embedding function that wraps an LLM and a base embedding model.
* **Improves Retrieval for Vague Queries**: HyDE is particularly effective for short, ambiguous, or poorly formulated questions because it enriches them with a full, context-rich hypothetical answer before the search.
* **Trade-offs**: HyDE adds an extra LLM call before every retrieval, which increases latency and cost. It also carries a risk: if the LLM produces an inaccurate or off-topic hypothetical document, retrieval is steered toward irrelevant results.
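
One possible way to soften the latency and cost trade-off is to cache hypothetical documents so that repeated queries do not trigger a new LLM call. The sketch below assumes an in-process cache via `functools.lru_cache`; `call_llm` is a hypothetical stub standing in for the real generation step, and exact-match caching only helps when the same query string recurs.

```python
from functools import lru_cache

llm_calls = 0  # Counts how often the (stubbed) LLM is actually invoked.


def call_llm(query: str) -> str:
    """Hypothetical stand-in for the real LLM call."""
    global llm_calls
    llm_calls += 1
    return f"A plausible answer to: {query}"


@lru_cache(maxsize=1024)
def cached_hypothetical_document(query: str) -> str:
    """Return the hypothetical document, reusing it for repeated queries."""
    return call_llm(query)


cached_hypothetical_document("When was Steve Jobs fired from Apple?")
cached_hypothetical_document("When was Steve Jobs fired from Apple?")  # Cache hit.
cached_hypothetical_document("Who founded NeXT?")

print(f"LLM calls made: {llm_calls}")
print(cached_hypothetical_document.cache_info())
```

Caching does nothing for the accuracy risk, and a cached hypothetical document stays the same even if the prompt or model changes, so the cache would need to be cleared in that case.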