## **Hypothetical Document Embeddings (HyDE)**

### **How Does It Work?**
Given a query, HyDE first zero-shot prompts an instruction-following language model to generate a “fake” hypothetical document that captures the relevant textual patterns of the query. In practice, this is done five times. Each hypothetical document is then encoded into an embedding vector, and the vectors are averaged. The resulting single embedding identifies a neighbourhood in the document embedding space, from which similar real documents are retrieved by vector similarity. As with any other retriever, the retrieved documents can then be used downstream in a pipeline (for example, by a Generator in RAG).
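The generate, embed, and average steps described above can be sketched in a few lines. This is a minimal illustration, not the reference HyDE implementation: `toy_embed` is a hypothetical hashed bag-of-words stand-in for a real embedding model, and `fake_generate` is a stub where a real pipeline would call an instruction-following LLM.

```python
import hashlib
import math


def toy_embed(text, dim=16):
    """Hypothetical stand-in for an embedding model: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def hyde_query_embedding(question, generate, embed, n_docs=5):
    """Generate n_docs hypothetical documents, embed each one, and return the mean vector."""
    hypothetical_docs = [generate(question) for _ in range(n_docs)]
    vectors = [embed(doc) for doc in hypothetical_docs]
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


def fake_generate(question):
    """Stub generator (assumption): a real pipeline would sample a different LLM answer each call."""
    return f"A hypothetical passage that answers: {question}"


query_vector = hyde_query_embedding("What is DeepSeek?", fake_generate, toy_embed)
print(len(query_vector))
```

The averaged vector, rather than the embedding of the raw question, is what would be compared against the document embeddings.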

### **When Is It Helpful?**

HyDE is worth trying when:

- The retrieval step in your pipeline does not perform well enough (for example, Recall is low).
- Your retrieval step takes a query as input and returns documents from a larger document base.
- Your data (documents or queries) comes from a specialised domain that differs substantially from the datasets Retrievers are typically trained on.
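To judge whether retrieval is "not good enough", it helps to measure Recall before and after adding HyDE. The following is a small sketch of recall@k; the document ids and the relevance labels are made up for illustration and would normally come from a hand-labelled evaluation set.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear among the top-k retrieved results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical example: ranked retriever output and ground-truth relevant ids.
retrieved = ["d7", "d2", "d9", "d4"]
relevant = ["d2", "d4", "d5"]

print(recall_at_k(retrieved, relevant, k=3))
print(recall_at_k(retrieved, relevant, k=4))
```

Running the same measurement with the plain query and with the HyDE query gives a direct comparison of the two retrieval strategies.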

### **Table of Contents**
1. Load Data
    - Split Data into Chunks
2. Create Vector Store & Load Text Chunks
    - Define Vector Store & Collection
    - Add Text Chunks to Collection
    - Visualize Dataset
3. Query Vector Store
    - Visualize Original Query & Retrieved Docs
4. HyDE
    - Visualize Original Query & Retrieved Docs
5. HyDE (Avg. Embeddings of Answers)
    - Visualize Original Query & Retrieved Docs
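6. RAG Performance

### **Installation**
- pip install groq langchain-groq
- pip install langchain langchain-community langchain-huggingface
- pip install faiss-cpu
- pip install pypdf sentence-transformers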

**Login to https://console.groq.com and create API Key.**

In [None]:
# !pip install groq

In [12]:
# !pip install chromadb

> ### 1. Load Data
The dataset is the DeepSeek-R1 research paper, published in 2025 by DeepSeek, a Chinese AI company.

In [14]:
from langchain_community.document_loaders import PyPDFLoader
reader = PyPDFLoader('./Data/DeepSeek_R1.pdf')
docs = reader.load()

### 1.1 Split Data into Text Chunks

In [16]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_split = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)

character_split_texts = text_split.split_documents(docs)

print(f"Total chunks: {len(character_split_texts)}")

Total chunks: 71
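To see what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified sliding-window chunker. It is only a sketch: `RecursiveCharacterTextSplitter` prefers to split on separators such as paragraph and line breaks, whereas the hypothetical `sliding_window_chunks` below cuts at fixed character offsets.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=50):
    """Cut text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "abcdefghij" * 3  # 30 characters
chunks = sliding_window_chunks(sample, chunk_size=12, chunk_overlap=4)

for chunk in chunks:
    print(repr(chunk))
```

The overlap means the end of one chunk is repeated at the start of the next, so a sentence that straddles a boundary still appears whole in at least one chunk.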


> ### 2. Create Vector Store & Load Text Chunks

In [17]:
from langchain_community.vectorstores import FAISS
from langchain_huggingface.embeddings import HuggingFaceEmbeddings

# Embed every chunk with a small sentence-transformers model and index the vectors in FAISS.
embedding = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vector_store = FAISS.from_documents(character_split_texts, embedding=embedding)

In [26]:
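import os
from langchain_groq import ChatGroq

# Read the key from the GROQ_API_KEY environment variable instead of hard-coding it in the notebook.
model = ChatGroq(model='llama-3.1-8b-instant', api_key=os.environ["GROQ_API_KEY"])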

In [19]:
# Note the keyword is search_kwargs; k is the number of chunks returned per query.
retriver = vector_store.as_retriever(search_type='similarity', search_kwargs={'k': 6})
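Conceptually, a similarity retriever scores every stored vector against the query vector and returns the `k` best matches. The sketch below illustrates this with cosine similarity on toy 2-dimensional vectors. It is an illustration only: FAISS uses its own index structures and distance metric internally, and the names `cosine_similarity` and `top_k` are hypothetical helpers, not part of LangChain or FAISS.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=6):
    """Return the indices of the k document vectors most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy document vectors (made up for illustration).
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
print(top_k([1.0, 0.1], doc_vecs, k=2))
```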

In [22]:
from langchain.prompts import ChatPromptTemplate

template = "For the given question try to generate the hypothetical answer and don't generate anything else. {question}"

prompt = ChatPromptTemplate.from_template(template)

In [24]:
query = prompt.format(question="What is Deepseek?")

In [27]:
hypothetical_answer = model.invoke(query).content

In [29]:
print(hypothetical_answer)

Deepseek is a hypothetical, open-source, decentralized, peer-to-peer, and blockchain-based content creation and sharing platform, often referred to as a 'Decentralized Video Sharing Platform'.


In [30]:
# get_relevant_documents is deprecated in recent LangChain releases; invoke is the current call.
retriver.invoke(hypothetical_answer)


[Document(id='5ef58292-19c1-476d-987e-142158b8e4e2', metadata={'source': './Data/DeepSeek_R1.pdf', 'page': 3}, page_content='R1 achieves outstanding results, significantly outperforming DeepSeek-V3 with scores\nof 90.8% on MMLU, 84.0% on MMLU-Pro, and 71.5% on GPQA Diamond. While its\nperformance is slightly below that of OpenAI-o1-1217 on these benchmarks, DeepSeek-R1\nsurpasses other closed-source models, demonstrating its competitive edge in educational\ntasks. On the factual benchmark SimpleQA, DeepSeek-R1 outperforms DeepSeek-V3,\ndemonstrating its capability in handling fact-based queries. A similar trend is observed\nwhere OpenAI-o1 surpasses 4o on this benchmark.\n4'),
 Document(id='ea16d6da-9051-4191-86ac-12deb6608711', metadata={'source': './Data/DeepSeek_R1.pdf', 'page': 12}, page_content='RL, which not only boosts reasoning capabilities but also improves performance across diverse\ndomains. Moreover, the summary lengths generated by DeepSeek-R1 are concise, with an\naverage

In [31]:
similar_docs = retriver.invoke(hypothetical_answer)

In [32]:
template = """Answer the question based on the context {context}\n\nQuestion: {question}"""
prompt2 = ChatPromptTemplate.from_template(template)

In [33]:
def format_documents(docs):
    return "\n\n".join(doc.page_content for doc in docs)

In [34]:
formated_docs = format_documents(similar_docs)

In [35]:
formated_docs

'R1 achieves outstanding results, significantly outperforming DeepSeek-V3 with scores\nof 90.8% on MMLU, 84.0% on MMLU-Pro, and 71.5% on GPQA Diamond. While its\nperformance is slightly below that of OpenAI-o1-1217 on these benchmarks, DeepSeek-R1\nsurpasses other closed-source models, demonstrating its competitive edge in educational\ntasks. On the factual benchmark SimpleQA, DeepSeek-R1 outperforms DeepSeek-V3,\ndemonstrating its capability in handling fact-based queries. A similar trend is observed\nwhere OpenAI-o1 surpasses 4o on this benchmark.\n4\n\nRL, which not only boosts reasoning capabilities but also improves performance across diverse\ndomains. Moreover, the summary lengths generated by DeepSeek-R1 are concise, with an\naverage of 689 tokens on ArenaHard and 2,218 characters on AlpacaEval 2.0. This indicates that\n13\n\nand 57.2% on LiveCodeBench. These results significantly outperform previous open-\nsource models and are comparable to o1-mini. We open-source distilled 1.

In [37]:
query = prompt2.format(context=formated_docs, question='What is Deepseek?')

In [38]:
query

'Human: Answer the question based on the context R1 achieves outstanding results, significantly outperforming DeepSeek-V3 with scores\nof 90.8% on MMLU, 84.0% on MMLU-Pro, and 71.5% on GPQA Diamond. While its\nperformance is slightly below that of OpenAI-o1-1217 on these benchmarks, DeepSeek-R1\nsurpasses other closed-source models, demonstrating its competitive edge in educational\ntasks. On the factual benchmark SimpleQA, DeepSeek-R1 outperforms DeepSeek-V3,\ndemonstrating its capability in handling fact-based queries. A similar trend is observed\nwhere OpenAI-o1 surpasses 4o on this benchmark.\n4\n\nRL, which not only boosts reasoning capabilities but also improves performance across diverse\ndomains. Moreover, the summary lengths generated by DeepSeek-R1 are concise, with an\naverage of 689 tokens on ArenaHard and 2,218 characters on AlpacaEval 2.0. This indicates that\n13\n\nand 57.2% on LiveCodeBench. These results significantly outperform previous open-\nsource models and are co

In [39]:
responce = model.invoke(query)

In [40]:
responce

AIMessage(content='Based on the given context, DeepSeek appears to be a series of artificial intelligence models, specifically designed for educational and reasoning tasks. It has undergone training with Reinforcement Learning (RL) to improve its performance across various domains, including reasoning, coding, and engineering tasks.\n\nThere are at least two versions of the DeepSeek model mentioned: \n\n1. DeepSeek-V3: This is an older version of the model, which has been surpassed by newer versions in various tasks.\n2. DeepSeek-R1: This is the latest version of the model, which has achieved outstanding results in reasoning tasks, educational tasks, and knowledge-related benchmarks, outperforming other closed-source models in many cases.\n\nThe DeepSeek models are open-sourced, with various checkpoints made available to the community, including 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints.', additional_kwargs={}, response_metadata={'token_usage': {'completion_tokens': 187, 'prompt_toke
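The cells above can be wrapped into a single function. The sketch below is one possible arrangement, not the notebook's own code: `hyde_rag` is a hypothetical helper that receives the three steps as callables, so it can be exercised here with stubs and no API key. The stub corpus and stub answers are made up for illustration.

```python
def hyde_rag(question, generate_hypothetical, retrieve, answer_with_context):
    """Run HyDE + RAG: hypothetical answer -> retrieval -> grounded final answer."""
    hypothetical = generate_hypothetical(question)   # step 1: LLM writes a "fake" answer
    passages = retrieve(hypothetical)                # step 2: search with the fake answer
    context = "\n\n".join(passages)                  # step 3: merge the retrieved passages
    return answer_with_context(context, question)    # step 4: LLM answers from the context


# Stubs (assumptions) standing in for the LLM and the vector store.
corpus = {
    "reasoning model": "DeepSeek-R1 is a reasoning model trained with RL.",
    "weather": "It rains often in spring.",
}


def stub_generate(question):
    return "DeepSeek is probably a reasoning model."


def stub_retrieve(text):
    return [passage for key, passage in corpus.items() if key in text]


def stub_answer(context, question):
    return f"Q: {question} | context: {context}"


result = hyde_rag("What is DeepSeek?", stub_generate, stub_retrieve, stub_answer)
print(result)
```

In the notebook, the three callables would wrap the first prompt and the Groq model, the FAISS retriever combined with `format_documents`, and the second prompt and model call, respectively.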