# Hypothetical Document Embedding (HyDE) in Document Retrieval

## Overview

This notebook implements a Hypothetical Document Embedding (HyDE) retriever. HyDE uses a language model to turn a query into a hypothetical document that answers it, then searches the vector store with that document instead of the raw query. The aim is to bridge the gap between query and document distributions in vector space.

## Motivation

Traditional retrieval methods often struggle with the semantic gap between short queries and longer, more detailed documents. HyDE addresses this by expanding the query into a full hypothetical document, potentially improving retrieval relevance by making the query representation more similar to the document representations in the vector space.

## Key Components

1. PDF processing and text chunking
2. Vector store creation using FAISS and OpenAI embeddings
3. Language model for generating hypothetical documents
4. Custom HyDERetriever class implementing the HyDE technique

## Method Details

### Document Preprocessing and Vector Store Creation

1. The PDF is processed and split into chunks.
2. A FAISS vector store is created using OpenAI embeddings for efficient similarity search.
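
As a rough illustration of the chunking step, the sketch below splits text into fixed-size character windows that overlap. It is a simplified stand-in, not the notebook's `encode_pdf` helper: the function name `split_text` is hypothetical, and the real helper may split on separators rather than at fixed offsets.

```python
def split_text(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> list[str]:
    """Split text into character chunks of at most chunk_size that overlap by chunk_overlap.

    Hypothetical helper for illustration only; not part of the notebook's utils.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Tiny example: windows of 4 characters, each sharing 1 character with the previous one
example_chunks = split_text("abcdefghij", chunk_size=4, chunk_overlap=1)
```

The overlap means a sentence that straddles a chunk boundary is still partly present in both neighbouring chunks, at the cost of some duplicated text in the index.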

### Hypothetical Document Generation

1. A language model (`gpt-4o-mini` in the code below) generates a hypothetical document that answers the given query.
2. The generation is guided by a prompt template that asks for a detailed document whose length matches the chunk size used in the vector store. The model treats this length as a target rather than a guarantee.
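
The prompt step can be sketched without any LLM call. The snippet below fills a template modelled on the notebook's prompt and adds a small check for how far a generated document's length is from the requested chunk size. The names `HYDE_TEMPLATE`, `build_hyde_prompt`, and `length_ratio` are hypothetical and exist only for this illustration.

```python
# Template modelled on the prompt used by the retriever class; wording is illustrative
HYDE_TEMPLATE = (
    "Given the question '{query}', generate a hypothetical document that directly "
    "answers this question. The document should be detailed and in-depth. "
    "The document size has to be exactly {chunk_size} characters."
)


def build_hyde_prompt(query: str, chunk_size: int = 500) -> str:
    """Fill the template with the user's query and the target length."""
    return HYDE_TEMPLATE.format(query=query, chunk_size=chunk_size)


def length_ratio(hypothetical_doc: str, chunk_size: int) -> float:
    """Return generated length divided by requested length (1.0 means an exact match)."""
    return len(hypothetical_doc) / chunk_size


prompt = build_hyde_prompt("What is the main cause of climate change?", chunk_size=500)
```

Checking `length_ratio` on real outputs is one way to see whether the model follows the length instruction closely enough for your chunk size.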

### Retrieval Process

The `HyDERetriever` class implements the following steps:

1. Generate a hypothetical document from the query using the language model.
2. Use the hypothetical document as the search query in the vector store.
3. Retrieve the most similar documents to this hypothetical document.
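
The three steps above can be sketched end to end with toy components. This is a minimal illustration under loud assumptions: `embed` is a bag-of-words counter rather than an OpenAI embedding, `fake_llm` returns a canned passage instead of calling a model, and a sorted list replaces the FAISS index. None of these names come from the notebook.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (assumption, stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def fake_llm(query: str) -> str:
    """Stand-in for the language model: always returns the same canned passage."""
    return ("Climate change is mainly caused by burning fossil fuels, which "
            "releases carbon dioxide and other greenhouse gases.")


def hyde_retrieve(query: str, chunks: list[str], generate=fake_llm, k: int = 2):
    hypothetical_doc = generate(query)       # step 1: write a hypothetical answer
    search_vec = embed(hypothetical_doc)     # step 2: search with the answer, not the query
    ranked = sorted(chunks, key=lambda c: cosine(search_vec, embed(c)), reverse=True)
    return ranked[:k], hypothetical_doc      # step 3: keep the k most similar chunks


chunks = [
    "Greenhouse gases such as carbon dioxide are released by burning fossil fuels.",
    "The Himalayan ecosystem hosts a wide range of biodiversity.",
    "Public transportation reduces traffic in large cities.",
]
query = "What is the main cause of climate change?"
top, doc = hyde_retrieve(query, chunks, k=1)
```

With this toy embedding the raw query shares no words with the first chunk, so a direct query search would not rank it first, while the hypothetical passage overlaps with it heavily. Real embedding models are far less literal, so the effect in practice is smaller and should be measured rather than assumed.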

## Key Features

1. Query Expansion: Transforms short queries into detailed hypothetical documents.
2. Flexible Configuration: Allows adjustment of chunk size, overlap, and number of retrieved documents.
3. Integration with OpenAI Models: Uses `gpt-4o-mini` for hypothetical document generation and OpenAI embeddings for vector representation.

## Benefits of this Approach

1. Improved Relevance: By expanding queries into full documents, HyDE can potentially capture more nuanced and relevant matches.
2. Handling Complex Queries: Particularly useful for complex or multi-faceted queries that might be difficult to match directly.
3. Adaptability: The hypothetical document generation can adapt to different types of queries and document domains.
4. Potential for Better Context Understanding: The expanded query might better capture the context and intent behind the original question.

## Implementation Details

1. Uses OpenAI's `gpt-4o-mini` chat model (through `ChatOpenAI`) for hypothetical document generation.
2. Employs FAISS for efficient similarity search in the vector space.
3. Allows for easy visualization of both the hypothetical document and retrieved results.
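
The notebook displays results with the helpers `text_wrap` and `show_context` from `utils.helper_functions`, whose source is not shown here. As a guess at what such helpers might look like, the sketch below uses only the standard library; `wrap_text` and `format_contexts` are hypothetical names, and the real helpers may behave differently.

```python
import textwrap


def wrap_text(text: str, width: int = 120) -> str:
    """Re-flow a long string so that no line exceeds `width` characters."""
    return textwrap.fill(text, width=width)


def format_contexts(contexts: list[str]) -> str:
    """Label each retrieved chunk as 'Context N:' and join them with blank lines."""
    parts = [f"Context {i}:\n{c}\n" for i, c in enumerate(contexts, start=1)]
    return "\n".join(parts)


# Example: print a wrapped passage followed by two labelled contexts
print(wrap_text("A long hypothetical document. " * 10, width=60))
print(format_contexts(["first retrieved chunk", "second retrieved chunk"]))
```

Printing the hypothetical document next to the retrieved chunks makes it easy to spot cases where the generated passage drifted away from the source material.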

## Conclusion

Hypothetical Document Embedding (HyDE) addresses the semantic gap between short queries and longer documents by having a language model expand each query into a hypothetical document before retrieval. This can improve retrieval relevance, especially for complex or nuanced queries, at the cost of one extra LLM call per query. The technique could be particularly valuable in domains where understanding query intent and context is crucial, such as legal research, academic literature review, or advanced information retrieval systems.

In [1]:
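import os
import sys
from dotenv import load_dotenv

from utils.helper_functions import *
from utils.evaluate_rag import *

# Explicit imports for the classes used below (the helper module may already re-export them)
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import PromptTemplate

# Load environment variables from a .env file
load_dotenv()

# Set the OpenAI API key environment variable, failing early if it is missing
api_key = os.getenv("OPENAI_API_KEY")
if api_key is None:
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your environment or .env file")
os.environ["OPENAI_API_KEY"] = api_key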

In [2]:
path = "data/Climate_Change.pdf"

### Define the HyDE retriever class: create the vector store, generate the hypothetical document, and retrieve

In [3]:
class HyDERetriever:
    def __init__(self, files_path, chunk_size=500, chunk_overlap=100):
        # LLM that writes the hypothetical document
        self.llm = ChatOpenAI(temperature=0, model_name="gpt-4o-mini", max_tokens=4000)

        self.embeddings = OpenAIEmbeddings()
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        # Chunk the PDF and index the chunks in a vector store
        self.vectorstore = encode_pdf(files_path, chunk_size=self.chunk_size, chunk_overlap=self.chunk_overlap)

        # Ask for a document about as long as an indexed chunk (a target, not a guarantee)
        self.hyde_prompt = PromptTemplate(
            input_variables=["query", "chunk_size"],
            template="""Given the question '{query}', generate a hypothetical document that directly answers this question. The document should be detailed and in-depth.
            The document size has to be exactly {chunk_size} characters.""",
        )
        self.hyde_chain = self.hyde_prompt | self.llm

    def generate_hypothetical_document(self, query):
        """Return the LLM-written passage that answers `query`."""
        input_variables = {"query": query, "chunk_size": self.chunk_size}
        return self.hyde_chain.invoke(input_variables).content

    def retrieve(self, query, k=3):
        """Search the vector store with the hypothetical document instead of the raw query."""
        hypothetical_doc = self.generate_hypothetical_document(query)
        similar_docs = self.vectorstore.similarity_search(hypothetical_doc, k=k)
        return similar_docs, hypothetical_doc


In [6]:
retriever = HyDERetriever(path)

In [7]:
test_query = "What is the main cause of climate change?"
results, hypothetical_doc = retriever.retrieve(test_query)

In [8]:
results

[Document(id='6ad0da8f-63e0-42c5-90bd-0bacb3566076', metadata={'producer': 'GPL Ghostscript 8.70', 'creator': 'PDFCreator Version 1.0.2', 'creationdate': '2015-05-11T11:41:54+05:30', 'moddate': '2015-05-11T11:41:54+05:30', 'title': 'Climate Change', 'author': 'VISION', 'keywords': '', 'subject': '', 'source': 'data/Climate_Change.pdf', 'total_pages': 15, 'page': 1, 'page_label': '2'}, page_content='‘natural’ influences of the past. Global warming has occurred faster than any other climate change recorded by \nhumans and so is of great interest and importance to the human population. \nCause of anthropogenic (human caused) climate change includes greenhouse gases, aerosols and pattern of land \nuse changes.'),
 Document(id='daf73bfe-3cd0-40dd-95dc-995e15c2b5b2', metadata={'producer': 'GPL Ghostscript 8.70', 'creator': 'PDFCreator Version 1.0.2', 'creationdate': '2015-05-11T11:41:54+05:30', 'moddate': '2015-05-11T11:41:54+05:30', 'title': 'Climate Change', 'author': 'VISION', 'keywords':

In [12]:
doc_content = [doc.page_content for doc in results]

print("Hypothetical_doc:\n")
print(text_wrap(hypothetical_doc)+"\n")
show_context(doc_content)

Hypothetical_doc:

**The Main Cause of Climate Change**  Climate change primarily results from human activities, particularly the burning
of fossil fuels such as coal, oil, and natural gas. This process releases significant amounts of carbon dioxide (CO2)
and other greenhouse gases into the atmosphere. Deforestation exacerbates the issue by reducing the number of trees that
can absorb CO2. Industrial processes, agriculture, and waste management also contribute to greenhouse gas emissions. The
accumulation of these gases traps heat, leading to global warming and subsequent climate changes, impacting ecosystems
and weather patterns worldwide.

Context 1:
‘natural’ influences of the past. Global warming has occurred faster than any other climate change recorded by 
humans and so is of great interest and importance to the human population. 
Cause of anthropogenic (human caused) climate change includes greenhouse gases, aerosols and pattern of land 
use changes.


Context 2:
revolution, hum

### LangChain HyDE Implementation

In [None]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_classic.chains import HypotheticalDocumentEmbedder

# Load PDF
loader = PyPDFLoader("data/Climate_Change.pdf")
documents = loader.load()

# Split
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)

# Create HyDE embeddings
base_embeddings = OpenAIEmbeddings()
llm = OpenAI(temperature=0)
hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
    llm, base_embeddings, prompt_key="web_search"
)

In [21]:
# Create a vector store
vector_store = Chroma.from_documents(chunks, hyde_embeddings)

# Query
retriever = vector_store.as_retriever(search_kwargs={"k": 5})
docs = retriever.invoke("What are the roles of the technologies in climate change Mitigation?")

print(docs[0].page_content)

of greenhouse gases into the atmosphere, resulting in higher global temperatures, affecting hydrological 
regimes and increasing climatic variability.  
/square4 Climate change is projected to have significant impacts on agricultural conditions, food supply, and food 
security. Some of these effects are biophysical, some are ecological, and some are economic, including: 
o A shift in climate and agricultural zones towards the poles 
o Changes in production patterns due to higher temperatures 
o A boost in agricultural productivity due to increased carbon dioxide in the atmosphere 
o Changing precipitation patterns 
o Increased vulnerability of the landless and the poor 
 
International Efforts to Counter International Efforts to Counter International Efforts to Counter International Efforts to Counter Climate Change Climate Change Climate Change Climate Change    
The Intergovernmental Panel on Climate Change (IPCC)


### Minimal Usage

In [28]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough, RunnableLambda
from langchain_core.output_parsers import StrOutputParser

# Load PDF
docs = PyPDFLoader("data/Climate_Change.pdf").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000).split_documents(docs)

# Create vector store and retriever
retriever = Chroma.from_documents(chunks, OpenAIEmbeddings()).as_retriever()
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# HyDE prompt
hyde_prompt = ChatPromptTemplate.from_template("Write a passage answering: {question}")

# Helper function to format documents
def format_docs(docs):
    return "\n\n".join([d.page_content for d in docs])

# Create HyDE RAG chain
hyde_chain = (
    {"question": RunnablePassthrough(), "hypothetical": hyde_prompt | llm | StrOutputParser()}
    | RunnableLambda(lambda x: {  # RunnableLambda lets the plain function be piped with `|`
        "context": format_docs(retriever.invoke(x["hypothetical"])),
        "question": x["question"]
    })
    | ChatPromptTemplate.from_template("Context: {context}\n\nQuestion: {question}\n\nAnswer:")
    | llm
    | StrOutputParser()
)

# Query
answer = hyde_chain.invoke("What is this about?")
print(answer)

This text is about various initiatives and missions in India aimed at addressing environmental challenges such as urban waste management, water scarcity, and the conservation of the Himalayan ecosystem in the face of climate change. These initiatives include measures such as improving water use efficiency, enforcing fuel economy standards, promoting the use of public transportation, and conserving biodiversity in the Himalayan region.
