- HyDE (Hypothetical Document Embeddings) is an approach that transforms a query into a hypothetical document containing the answer, aiming to bridge the gap between query and document distributions in vector space.

- HyDE combines the power of LLMs with contrastive learning: an LLM writes the hypothetical document, and a contrastively trained encoder embeds it.

- Contrastive learning is an ML technique in which the model learns to differentiate between similar and dissimilar data points. (explore more later)

- Which contrastive learning model? An unsupervised, contrastively trained encoder, e.g. Contriever. (explore more later)

- Traditional retrieval methods often struggle with the semantic gap between short queries and longer, more detailed documents. HyDE addresses this by expanding the query into a full hypothetical document, potentially improving retrieval relevance by making the query representation more similar to the document representations in the vector space.

- This technique could be valuable in domains where understanding query intent and context is crucial.
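
To make the contrastive-learning idea above a bit more concrete, here is a minimal sketch of an InfoNCE-style loss in plain Python. The toy vectors, the `temperature` value, and the function names are assumptions for illustration; this is not how Contriever is actually trained, only the general shape of such an objective.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: small when the anchor is closer to the positive than to the negatives."""
    pos_logit = cosine(anchor, positive) / temperature
    all_logits = [pos_logit] + [cosine(anchor, n) / temperature for n in negatives]
    max_logit = max(all_logits)  # subtracted for numerical stability
    log_denominator = math.log(sum(math.exp(logit - max_logit) for logit in all_logits))
    # Negative log of the softmax probability assigned to the positive pair.
    return -(pos_logit - max_logit - log_denominator)


# Toy 3-dimensional "embeddings" (made up for illustration).
query_vec = [1.0, 0.2, 0.0]
relevant_doc = [0.9, 0.3, 0.1]
unrelated_docs = [[0.0, 1.0, 0.0], [0.0, 0.1, 1.0]]

well_aligned = info_nce_loss(query_vec, relevant_doc, unrelated_docs)
misaligned = info_nce_loss(query_vec, unrelated_docs[0], [relevant_doc, unrelated_docs[1]])
print(f"aligned loss: {well_aligned:.4f}, misaligned loss: {misaligned:.4f}")
```

Training an encoder to minimize a loss of this kind pulls matching pairs together and pushes non-matching pairs apart in the embedding space.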

Benefits:

- **Improved Relevance**: By expanding queries into full documents, HyDE can potentially capture more nuanced and relevant matches.

- **Handling Complex Queries**: Useful for complex or multi-faceted queries that might be difficult to match directly.

- **Adaptability**: Hypothetical document generation can adapt to different types of queries and document domains.

- **Potential for better context understanding**: The expanded query might better capture the context and intent behind the original question.

- **Bridges domain terminology gap**
- **Narrows query-doc semantic gap**
- **Improves retrieval for complex queries**

Challenges:

- **Computational Resources**: Generating hypothetical docs for each query is resource-intensive.

- **LLM dependency**: The quality of results depends on the capabilities of the underlying LLM.

In [None]:
! pip3 install python-dotenv pypdf langchain langchain-core langchain-community langchain-openai chromadb langchain-chroma langchain-text-splitters

### Create Vectorstore

In [1]:
from langchain_community.document_loaders import PyPDFLoader

In [None]:
path = "./data/Understanding_Climate_Change.pdf"

In [3]:
loader = PyPDFLoader(path)

In [4]:
documents = loader.load()

In [5]:
from langchain_text_splitters.character import RecursiveCharacterTextSplitter

In [6]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100, length_function=len)

In [7]:
chunks = text_splitter.split_documents(documents)
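
To see what `chunk_size` and `chunk_overlap` mean, the following is a simplified fixed-window splitter. It is only an illustration: `RecursiveCharacterTextSplitter` also tries to split on separators such as paragraph and line breaks, so its chunk boundaries will differ. The function name `sliding_window_chunks` is made up for this sketch.

```python
def sliding_window_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(1200))
pieces = sliding_window_chunks(sample, chunk_size=500, chunk_overlap=100)
print([len(p) for p in pieces])
```

With these settings, the last 100 characters of each chunk reappear at the start of the next one, which reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.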

In [8]:
from dotenv import load_dotenv, find_dotenv

In [9]:
load_dotenv(find_dotenv())

True

In [10]:
from langchain_chroma import Chroma

In [11]:
from langchain_openai import OpenAIEmbeddings

In [13]:
# text-embedding-3-small produces 1536-dimensional vectors by default
vectorstore = Chroma.from_documents(chunks, OpenAIEmbeddings(model="text-embedding-3-small"))
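
Conceptually, a vector store keeps one embedding per chunk and returns the chunks whose embeddings are closest to the query embedding. A minimal sketch of that idea follows, using a toy word-count "embedding" instead of `text-embedding-3-small` so that it runs without an API key. The names `embed` and `top_k` and the sample texts are assumptions for illustration and say nothing about how Chroma works internally.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, texts: list[str], k: int = 3) -> list[str]:
    """Return the k texts most similar to the query, best match first."""
    query_vec = embed(query)
    ranked = sorted(texts, key=lambda t: cosine(query_vec, embed(t)), reverse=True)
    return ranked[:k]


corpus = [
    "rising sea levels are accelerating coastal erosion",
    "droughts affect agriculture and water supply",
    "glacial melt impacts hydropower generation",
]
results = top_k("coastal erosion and sea levels", corpus, k=2)
print(results)
```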

### Generating a Hypothetical Document

In [18]:
from langchain_core.prompts import PromptTemplate

In [19]:
template = """Given a question, generate a hypothetical document that directly answers the question.
The document should be detailed and in-depth. The document must be exactly 500 characters long.

---

<question>
{question}
</question>
"""

In [20]:
hyde_prompt = PromptTemplate(
    template = template,
    input_variables = ["question"]
)

In [None]:
! pip3 install pydantic

In [22]:
from pydantic import BaseModel, Field

In [23]:
# data model
class HypotheticalDocument(BaseModel):
    """Hypothetical document that directly answers a given question"""

    hypothetical_document: str = Field(description="hypothetical document that directly answers the question. The document should be detailed and in-depth.")

In [24]:
from langchain_openai import ChatOpenAI

In [25]:
structured_llm = ChatOpenAI(model="o4-mini-2025-04-16").with_structured_output(HypotheticalDocument)
# Note: specifying max_completion_tokens sometimes leads to an error.
# max_completion_tokens is an upper bound on the number of tokens generated for a completion,
# including both visible output tokens and reasoning tokens.

In [26]:
hyde_chain = hyde_prompt | structured_llm

In [27]:
query = "How does rising sea level contribute to coastal erosion?"

In [34]:
response = hyde_chain.invoke({"question":query})

In [35]:
hypothetical_doc = response.hypothetical_document

In [36]:
len(hypothetical_doc)

497
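
The model landed at 497 characters rather than exactly 500, which is not surprising since LLMs do not count characters reliably. If the length matters (the 500 in the prompt is presumably chosen to match the 500-character chunk size), a small check like the one below could flag outliers. The helper name and the 10% tolerance are arbitrary choices for illustration.

```python
def within_length_tolerance(text: str, target: int = 500, tolerance: float = 0.10) -> bool:
    """Return True if len(text) deviates from target by at most tolerance (a fraction of target)."""
    allowed_deviation = target * tolerance
    return abs(len(text) - target) <= allowed_deviation


print(within_length_tolerance("x" * 497))  # 497 is within 10% of 500
print(within_length_tolerance("x" * 300))  # 300 is not
```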

In [37]:
import textwrap

In [38]:
print(textwrap.TextWrapper(width=100).fill(hypothetical_doc))

Rising sea levels exacerbate coastal erosion by increasing the reach, frequency, and power of wave
action and storm surges. Elevated waters shift wave breaking zones landward, removing sediments from
beaches and cliffs. Saturated soils lose cohesion, accelerating undercutting of dunes and bedrock.
Repeated flooding erodes vegetation and destabilizes manmade barriers. Over time, these processes
lead to accelerated shoreline retreat, loss of habitats, and vulnerability of coastal communities.!!


### Regular RAG vs HyDE
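
One way to make the comparison in this section a little more systematic is to measure how much two result lists overlap. The sketch below works on plain strings (for LangChain results, one could pass `[doc.page_content for doc in docs]`); the function name and the sample lists are made up for illustration.

```python
def compare_results(hyde_texts: list[str], query_texts: list[str]) -> dict:
    """Summarize the overlap between two ranked lists of retrieved chunk texts."""
    hyde_set, query_set = set(hyde_texts), set(query_texts)
    union = hyde_set | query_set
    return {
        # List comprehensions keep the original rank order of each list.
        "shared": [t for t in hyde_texts if t in query_set],
        "only_hyde": [t for t in hyde_texts if t not in query_set],
        "only_query": [t for t in query_texts if t not in hyde_set],
        "jaccard": len(hyde_set & query_set) / len(union) if union else 0.0,
    }


# Placeholder chunk texts standing in for retrieved page contents.
hyde_hits = ["coastal erosion chunk", "storm surge chunk", "flooding chunk"]
query_hits = ["sea level chunk", "coastal erosion chunk", "marine ecosystems chunk"]
summary = compare_results(hyde_hits, query_hits)
print(summary["shared"], round(summary["jaccard"], 2))
```

A low overlap does not say which method is better; it only shows that the hypothetical document steered retrieval toward different chunks, which then need to be judged for relevance.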

In [52]:
retrieved_docs_with_hyDoc = vectorstore.similarity_search(query=hypothetical_doc, k=3)

In [53]:
for idx, doc in enumerate(retrieved_docs_with_hyDoc,start=1):
    print(f"DOC {idx}:")
    print(textwrap.TextWrapper(width=100).fill(doc.page_content))
    print("-"*50)

DOC 1:
Regions dependent on glacial meltwater, such as the Himalayas and the Andes, face  particular risks.
Glacial melt also impacts hydropower generation and agriculture.  Coastal Erosion  Rising sea levels
and increased storm surges are accelerating coastal erosion, threatening  homes, infrastructure, and
ecosystems. Low-lying islands and coastal regions are especially  vulnerable. Coastal communities
must invest in adaptation measures like sea walls and  managed retreats.  Extreme Weather Events
--------------------------------------------------
DOC 2:
destructive storms. Coastal regions are at heightened risk of storm surge and flooding. Early
Increased temperatures and changing precipitation patterns are contributing to more frequent  and
severe droughts. This affects agriculture, water supply, and ecosystems, particularly in  arid and
semi-arid regions. Droughts can lead to food and water shortages and exacerbate  conflicts.
Flooding
---------------------------------------------

In [54]:
retrieved_docs_with_query = vectorstore.similarity_search(query=query, k=3)

In [55]:
for idx, doc in enumerate(retrieved_docs_with_query,start=1):
    print(f"DOC {idx}:")
    print(textwrap.TextWrapper(width=100).fill(doc.page_content))
    print("-"*50)

DOC 1:
practices.  Melting Ice and Rising Sea Levels  Warmer temperatures are causing polar ice caps and
glaciers to melt, contributing to rising  sea levels. Sea levels have risen by about 20 centimeters
(8 inches) in the past century,  threatening coastal communities and ecosystems.  Polar Ice Melt
--------------------------------------------------
DOC 2:
Regions dependent on glacial meltwater, such as the Himalayas and the Andes, face  particular risks.
Glacial melt also impacts hydropower generation and agriculture.  Coastal Erosion  Rising sea levels
and increased storm surges are accelerating coastal erosion, threatening  homes, infrastructure, and
ecosystems. Low-lying islands and coastal regions are especially  vulnerable. Coastal communities
must invest in adaptation measures like sea walls and  managed retreats.  Extreme Weather Events
--------------------------------------------------
DOC 3:
Marine ecosystems are highly vulnerable to climate change. Rising sea temperatures, 