### Hypothetical Document Embeddings (HyDE):

* HyDE is a retrieval technique in which, instead of embedding the user's query directly, you first use an LLM to generate a hypothetical answer (document) for the query, then embed that hypothetical document and use it to search your vector store.

HyDE bridges the gap between user intent and relevant content, especially when:
1. queries are short.
2. there is a language (vocabulary) mismatch between the query and the documents.
3. you want to retrieve based on answer content, not question words.
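
To make the idea concrete before bringing in any libraries, here is a minimal toy sketch. Everything in it is an assumption for illustration: `embed` is a bag-of-words stand-in for a real embedding model, `fake_llm` returns a canned passage instead of calling an LLM, and the two-sentence corpus is made up from facts used later in this notebook.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words embedding (stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def fake_llm(query: str) -> str:
    # Hypothetical stand-in: a real LLM would write this passage from the query.
    return "Kalam led the development of ballistic missile and launch vehicle technology"

corpus = [
    "Kalam was born in Rameswaram in Southern India",
    "He worked on ballistic missile and launch vehicle technology at DRDO",
]

def search(text: str, k: int = 1) -> list[str]:
    """Rank corpus entries by similarity to the given text."""
    q = embed(text)
    return sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def hyde_search(query: str, k: int = 1) -> list[str]:
    """HyDE: search with the hypothetical answer instead of the raw query."""
    return search(fake_llm(query), k)

query = "who built india's rockets"
direct_hit = search(query)[0]     # the query shares no words with either sentence
hyde_hit = hyde_search(query)[0]  # the hypothetical answer overlaps the relevant one
print("direct:", direct_hit)
print("hyde:  ", hyde_hit)
```

The short query has no words in common with the relevant sentence, so the direct search cannot tell the two sentences apart, while the hypothetical answer uses the same vocabulary as the document and pulls it to the top.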

In [4]:
#importing libraries

from langchain_community.document_loaders import WikipediaLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma, FAISS

In [3]:
#ingesting documents and splitting into chunks

# need to install wikipedia (!pip install wikipedia)

loader = WikipediaLoader(query="APJ Abdul Kalam", load_max_docs=5)
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=100)
chunks = splitter.split_documents(documents)
print(f"Number of documents after splitting into chunks: {len(chunks)}")

Number of documents after splitting into chunks: 85
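
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch using a plain sliding window over characters. This is not how `RecursiveCharacterTextSplitter` actually works (it splits on separators such as paragraphs, lines and spaces and treats `chunk_size` as an upper bound), so the helper `sliding_chunks` below is a hypothetical illustration of the overlap idea only.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows where neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap              # how far each window advances
    last_start = max(len(text) - chunk_overlap, 1)  # stop once the tail is covered
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]

demo_chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo_chunks)
```

With `chunk_size=300` and `chunk_overlap=100` as above, neighbouring chunks can share up to about a third of their text, which helps keep a sentence that straddles a boundary retrievable from either side.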


In [7]:
#embeddings and vectorstore retriever

embeddings = HuggingFaceEmbeddings(model="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(documents=chunks,
                                   embedding=embeddings)

retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

#saving db locally

vectorstore.save_local("local_db/apj_abdul_kalam.db")

In [6]:
#setting up llm to generate hypothetical answers
import os
from dotenv import load_dotenv
load_dotenv()

from langchain.chat_models import init_chat_model

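# the Groq chat model reads GROQ_API_KEY from the environment (loaded above by load_dotenv);
# fetching it here is only a quick check that the key is set
groq_api_key = os.getenv("GROQ_API_KEY")
llm = init_chat_model(model="groq:llama-3.1-8b-instant")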

#testing llm

llm.invoke("hi, how are you?")

AIMessage(content="I'm just a language model, so I don't have emotions or feelings in the way that humans do. However, I'm functioning properly and ready to assist you with any questions or tasks you may have. How can I help you today?", additional_kwargs={}, response_metadata={'token_usage': {'completion_tokens': 50, 'prompt_tokens': 41, 'total_tokens': 91, 'completion_time': 0.05417695, 'completion_tokens_details': None, 'prompt_time': 0.001929335, 'prompt_tokens_details': None, 'queue_time': 0.050262955, 'total_time': 0.056106285}, 'model_name': 'llama-3.1-8b-instant', 'system_fingerprint': 'fp_f757f4b0bf', 'service_tier': 'on_demand', 'finish_reason': 'stop', 'logprobs': None, 'model_provider': 'groq'}, id='lc_run--019b2c86-91b3-7d41-8240-03e178c23164-0', usage_metadata={'input_tokens': 41, 'output_tokens': 50, 'total_tokens': 91})

In [8]:
#creating a prompt to get hypothetical documents for given query

from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = PromptTemplate.from_template(
    '''
    Imagine you are an expert in writing a detailed explanation on the topic : {query}
    create hypothetical answer for the given topic.
    '''
)

prompt

PromptTemplate(input_variables=['query'], input_types={}, partial_variables={}, template='\n    Imagine you are an expert in writing a detailed explanation on the topic : {query}\n    create hypothetical answer for the given topic.\n    ')

In [12]:
hypothetical_doc_chain = (prompt | llm | StrOutputParser())
hypothetical_docs = hypothetical_doc_chain.invoke({"query":"achievements of apj abdul kalam"})
print(hypothetical_docs)

**The Achievements of APJ Abdul Kalam: A Visionary Leader and a Trailblazer**

Dr. Avul Pakir Jainulabdeen Abdul Kalam, fondly known as APJ Abdul Kalam, was a renowned Indian scientist, administrator, and the 11th President of India. His remarkable life and achievements have left an indelible mark on the nation and the world. This detailed explanation will delve into the numerous achievements of APJ Abdul Kalam, highlighting his contributions to technology, education, and society.

**Early Life and Education**

Born on October 15, 1931, in Rameswaram, Tamil Nadu, India, APJ Abdul Kalam grew up in a humble family of modest means. His early life was marked by hardship, but his curiosity and passion for learning drove him to excel in his studies. Kalam pursued his higher education at the Madras Institute of Technology, where he earned a Bachelor's degree in Aeronautical Engineering in 1957. He later completed his Master's degree in Aerospace Engineering from the same institution in 1960.


In [13]:
# passing the hypothetical document to the retriever to get relevant documents

relevant_docs = retriever.invoke(hypothetical_docs)
relevant_docs

[Document(id='22821036-82fc-4d75-9662-109b24ee6d95', metadata={'title': 'List of awards and honours received by A. P. J. Abdul Kalam', 'summary': 'A. P. J. Abdul Kalam was an Indian aerospace scientist and statesman who served as president of India from 2002 to 2007. Born in Rameswaram in Southern India, Kalam spent four decades as a scientist and science administrator, mainly at the Defence Research and Development Organisation and Indian Space Research Organisation and was intimately involved in India\'s civilian space programme and military missile development efforts. He was known as the "Missile Man of India" for his work on the development of ballistic missile and launch vehicle technology. He also played a pivotal organisational, technical, and political role in India\'s Pokhran-II nuclear tests in 1998.\nHe was elected as the president of India in 2002 and was widely referred to as the "People\'s President". He engaged in teaching, writing and public service after his presidenc

#### Using LangChain's HypotheticalDocumentEmbedder

In [15]:
from langchain_classic.chains.hyde.base import HypotheticalDocumentEmbedder

hyde_embedder = HypotheticalDocumentEmbedder.from_llm(llm=llm, 
                                                      base_embeddings=embeddings,
                                                      prompt_key="web_search")

#to use your own prompt, pass custom_prompt=<your PromptTemplate object> instead of prompt_key

hyde_embedder

HypotheticalDocumentEmbedder(verbose=False, base_embeddings=HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2', cache_folder=None, model_kwargs={}, encode_kwargs={}, query_encode_kwargs={}, multi_process=False, show_progress=False), llm_chain=PromptTemplate(input_variables=['QUESTION'], input_types={}, partial_variables={}, template='Please write a passage to answer the question\nQuestion: {QUESTION}\nPassage:')
| ChatGroq(profile={'max_input_tokens': 131072, 'max_output_tokens': 8192, 'image_inputs': False, 'audio_inputs': False, 'video_inputs': False, 'image_outputs': False, 'audio_outputs': False, 'video_outputs': False, 'reasoning_output': False, 'tool_calling': True}, client=<groq.resources.chat.completions.Completions object at 0x0000027F297B1990>, async_client=<groq.resources.chat.completions.AsyncCompletions object at 0x0000027F2B4777D0>, model_name='llama-3.1-8b-instant', model_kwargs={}, groq_api_key=SecretStr('**********'))
| StrOutputParser())

In [16]:
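#now using the hyde embedder in the vectorstore
#(chunks are embedded with the base embeddings; the LLM is only called when a query is embedded)

hyde_based_vectorstore = FAISS.from_documents(
    documents=chunks,
    embedding=hyde_embedder
)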
hyde_based_vectorstore

<langchain_community.vectorstores.faiss.FAISS at 0x27f2d3aa650>
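
It helps to picture what an embedder like this does on each side of the index. The toy class below is a sketch of that behaviour, not LangChain's code: `ToyHydeEmbedder`, `base_embed` and `generate` are hypothetical names, and the assumption (worth checking against your installed version) is that documents go straight to the base embeddings while a query is first expanded into hypothetical passages whose vectors are averaged.

```python
class ToyHydeEmbedder:
    """Toy model of a HyDE embedder: passthrough for documents, generate-then-embed for queries."""

    def __init__(self, base_embed, generate):
        self.base_embed = base_embed  # text -> vector (stand-in for the embedding model)
        self.generate = generate      # query -> list of hypothetical passages (stand-in for the LLM)
        self.llm_calls = 0            # counts how often the "LLM" is used

    def embed_documents(self, texts):
        # Indexing path: no generation, just the base embeddings.
        return [self.base_embed(t) for t in texts]

    def embed_query(self, query):
        # Query path: write hypothetical passages, embed them, average the vectors.
        self.llm_calls += 1
        vectors = self.embed_documents(self.generate(query))
        return [sum(dim) / len(vectors) for dim in zip(*vectors)]

# Toy 2-d "embedding": [number of characters, number of spaces].
toy = ToyHydeEmbedder(
    base_embed=lambda t: [float(len(t)), float(t.count(" "))],
    generate=lambda q: [q + " answer one", q + " longer answer two"],
)

doc_vectors = toy.embed_documents(["a b", "abc"])
calls_after_indexing = toy.llm_calls
query_vector = toy.embed_query("q")
print(doc_vectors, calls_after_indexing, query_vector)
```

Under that assumption, building the FAISS index above costs no LLM calls; the generation step only happens at search time, once per query.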

In [17]:
# creating the rag prompt and the combine-documents chain
from langchain_classic.chains.combine_documents import create_stuff_documents_chain
rag_prompt = PromptTemplate.from_template(
    '''
Answer the given question based on only the context information provided.
context: {context}
question:{input}
'''
)

documents_chain = create_stuff_documents_chain(
    llm=llm,
    prompt=rag_prompt
)

documents_chain


RunnableBinding(bound=RunnableBinding(bound=RunnableAssign(mapper={
  context: RunnableLambda(format_docs)
}), kwargs={}, config={'run_name': 'format_inputs'}, config_factories=[])
| PromptTemplate(input_variables=['context', 'input'], input_types={}, partial_variables={}, template='\nAnswer the given question based on only the context information provided.\ncontext: {context}\nquestion:{input}\n')
| ChatGroq(profile={'max_input_tokens': 131072, 'max_output_tokens': 8192, 'image_inputs': False, 'audio_inputs': False, 'video_inputs': False, 'image_outputs': False, 'audio_outputs': False, 'video_outputs': False, 'reasoning_output': False, 'tool_calling': True}, client=<groq.resources.chat.completions.Completions object at 0x0000027F297B1990>, async_client=<groq.resources.chat.completions.AsyncCompletions object at 0x0000027F2B4777D0>, model_name='llama-3.1-8b-instant', model_kwargs={}, groq_api_key=SecretStr('**********'))
| StrOutputParser(), kwargs={}, config={'run_name': 'stuff_docume
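
The "stuff" step in the chain above is simple to picture: the retrieved chunks are joined into one string and dropped into the `{context}` slot of the prompt. The sketch below imitates that with plain Python; `Doc` and `stuff` are hypothetical stand-ins, and the blank-line separator is assumed to match the chain's default.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (only page_content is used)."""
    page_content: str

# Same wording as the rag prompt above, written as a plain format string.
RAG_TEMPLATE = (
    "Answer the given question based on only the context information provided.\n"
    "context: {context}\n"
    "question:{input}\n"
)

def stuff(docs: list[Doc], question: str, separator: str = "\n\n") -> str:
    """Join all chunks into a single context string and fill the prompt."""
    context = separator.join(d.page_content for d in docs)
    return RAG_TEMPLATE.format(context=context, input=question)

prompt_text = stuff([Doc("chunk A"), Doc("chunk B")], "Missile Man of India")
print(prompt_text)
```

Because every chunk ends up in a single prompt, the number of retrieved chunks (`k`) and the chunk size together decide how much of the model's context window is used.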

In [21]:
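#function for rag pipeline

def hyde_rag_pipeline(query: str) -> str:
    # similarity_search embeds the query through hyde_embedder, so the LLM first
    # writes a hypothetical passage and that passage is what gets searched
    relevant_docs = hyde_based_vectorstore.similarity_search(query=query, k=3)
    print(f"Relevant documents for given query: {relevant_docs}")
    # stuff the retrieved chunks into the rag prompt and generate the answer
    response = documents_chain.invoke({"context": relevant_docs,
                                       "input": query})
    return response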


results=hyde_rag_pipeline(query="Missile Man of India")
print(results)

Relevant documents for given query: [Document(id='4f51aecb-99a3-4fae-9a1c-95210980b503', metadata={'title': 'List of awards and honours received by A. P. J. Abdul Kalam', 'summary': 'A. P. J. Abdul Kalam was an Indian aerospace scientist and statesman who served as president of India from 2002 to 2007. Born in Rameswaram in Southern India, Kalam spent four decades as a scientist and science administrator, mainly at the Defence Research and Development Organisation and Indian Space Research Organisation and was intimately involved in India\'s civilian space programme and military missile development efforts. He was known as the "Missile Man of India" for his work on the development of ballistic missile and launch vehicle technology. He also played a pivotal organisational, technical, and political role in India\'s Pokhran-II nuclear tests in 1998.\nHe was elected as the president of India in 2002 and was widely referred to as the "People\'s President". He engaged in teaching, writing an