# Improve Semantic Similarity with Reverse HyDE

The documents we want to retrieve are often longer than users' queries and written in a different format. To improve retrieval accuracy, we will generate hypothetical queries from each document and index the embeddings of those queries as additional vectors that point back to the document. This technique is known as Reverse HyDE.

Note that the original [HyDE technique](https://arxiv.org/abs/2212.10496) processes each incoming user query: it generates hypothetical documents from the query and then uses them to retrieve the real documents. In Reverse HyDE, the processing happens when the documents are indexed, not at retrieval time, so query latency is not affected.

* [Reverse HyDE Implementation](#reverse-hyde-implementation)
* [Enriching Vector Database with Reverse HyDE Output](#enriching-vector-database-with-reverse-hyde-output)
* [Query the Enriched Index](#query-the-enriched-index)

In [1]:
import warnings

# Suppress warnings
warnings.filterwarnings('ignore')

## Reverse HyDE Implementation

We will create a class that generates hypothetical questions for each document chunk and computes embedding vectors that can be matched by semantic similarity. In a real application, we can use a vector database for embedding storage, indexing, and retrieval.
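To make the semantic similarity matching concrete, here is a minimal in-memory sketch that needs no vector database. The function names (`cosine`, `best_matching_doc`) and the toy 3-dimensional vectors are illustrative assumptions, not part of the notebook; real embeddings would come from an embedding model.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_matching_doc(
    query_vec: Sequence[float],
    question_vecs: List[Sequence[float]],
    doc_ids: List[int],
) -> Tuple[int, float]:
    """Return the document ID and score of the stored question closest to the query."""
    scores = [cosine(query_vec, vec) for vec in question_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return doc_ids[best], scores[best]


# Toy vectors stand in for real question embeddings (illustrative values only).
question_vecs = [
    [0.9, 0.1, 0.0],  # question generated from document 0
    [0.8, 0.2, 0.1],  # another question generated from document 0
    [0.0, 0.1, 0.9],  # question generated from document 1
]
doc_ids = [0, 0, 1]  # each question vector points back to its source document

doc_id, score = best_matching_doc([1.0, 0.0, 0.0], question_vecs, doc_ids)
print(doc_id, round(score, 3))
```

The key idea is that the query is compared against question vectors, while the result that comes back is the document each question was generated from.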

In [2]:
import openai
from typing import List, Dict
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

In [4]:
class ReverseHyde:
    def __init__(self, api_key: str):
        # Pass the key to the client explicitly and reuse one client for all calls
        self.client = openai.OpenAI(api_key=api_key)
        self.model = "text-embedding-ada-002"
        self.chat_model = "gpt-4.1-mini-2025-04-14"

    def get_embedding(self, text: str) -> List[float]:
        response = self.client.embeddings.create(input=text, model=self.model)
        return response.data[0].embedding

    def generate_reverse_hyde(self, chunk: str, n: int = 3) -> List[str]:
        prompt = f"""
Given the following text chunk, generate {n} different questions that this chunk would be a good answer to:

Chunk: {chunk}

Questions (enumerate the questions with 1. 2., etc.):
"""

        response = self.client.chat.completions.create(
            model=self.chat_model,
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt}
            ],
            # Note: a low token limit can cut off the last question when n is large
            max_tokens=100,
            n=1,
            stop=None,
            temperature=0.7,
        )

        # Keep only the numbered lines and drop the leading "1. " style marker
        questions = response.choices[0].message.content.strip().split('\n')
        return [q.split('. ', 1)[1] for q in questions if '. ' in q]

    def process_chunks(self, chunks: List[str], n: int = 3) -> Dict[str, List[str]]:
        processed_chunks = {}
        for chunk in chunks:
            processed_chunks[chunk] = self.generate_reverse_hyde(chunk, n)
        return processed_chunks
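Splitting each line on the first `'. '` works when the model follows the requested numbering exactly, but it misses markers such as `2)` and can pick up a preamble sentence that happens to contain a period. A more defensive parser could look like the sketch below; `parse_questions` is a hypothetical helper, not part of the class above.

```python
import re
from typing import List

# Matches a leading list marker such as "1.", "2)", "-", or "*" followed by whitespace.
_MARKER = re.compile(r"^\s*(?:\d+[.)]|[-*])\s+")


def parse_questions(raw: str) -> List[str]:
    """Extract the text of each list item from an LLM response, ignoring other lines."""
    questions = []
    for line in raw.splitlines():
        match = _MARKER.match(line)
        if not match:
            continue  # skip preambles, blank lines, and anything that is not a list item
        text = line[match.end():].strip()
        if text:
            questions.append(text)
    return questions


sample = "Here are the questions:\n1. What is a mitochondrion?  \n2) How is ATP produced?\n\n- Who coined the term?"
print(parse_questions(sample))
```

Stripping trailing whitespace here also keeps stray spaces out of the text that is later embedded.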


Load the API key from a `.env` file into the environment:

In [5]:
from dotenv import load_dotenv

load_dotenv()

True

## Enriching Vector Database with Reverse HyDE Output

We will apply the Reverse HyDE method to a set of documents and enrich the vector database index with LLM-generated hypothetical questions.

In [6]:
import os
# Usage example
api_key = os.getenv("OPENAI_API_KEY")
reverse_hyde = ReverseHyde(api_key)

chunks = [
    "A mitochondrion (pl. mitochondria) is an organelle found in the cells of most eukaryotes, such as animals, plants and fungi. Mitochondria have a double membrane structure and use aerobic respiration to generate adenosine triphosphate (ATP), which is used throughout the cell as a source of chemical energy. They were discovered by Albert von Kölliker in 1857 in the voluntary muscles of insects. Meaning a thread-like granule, the term mitochondrion was coined by Carl Benda in 1898. The mitochondrion is popularly nicknamed the \"powerhouse of the cell\", a phrase popularized by Philip Siekevitz in a 1957 Scientific American article of the same name.",
    "Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation. Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming. It is often described as a \"batteries included\" language due to its comprehensive standard library.",
    "The American Civil War (from April 12, 1861 to May 26, 1865) was a civil war in the United States between the Union (\"the North\") and the Confederacy (\"the South\"), which was formed in 1861 by states that had seceded from the Union. The central conflict leading to war was a dispute over whether slavery should be permitted to expand into the western territories, leading to more slave states, or be prohibited from doing so, which many believed would place slavery on a course of ultimate extinction."
]

processed_chunks = reverse_hyde.process_chunks(chunks, n=5)

In [8]:
import json

print(json.dumps(processed_chunks, indent=4))

{
    "A mitochondrion (pl. mitochondria) is an organelle found in the cells of most eukaryotes, such as animals, plants and fungi. Mitochondria have a double membrane structure and use aerobic respiration to generate adenosine triphosphate (ATP), which is used throughout the cell as a source of chemical energy. They were discovered by Albert von K\u00f6lliker in 1857 in the voluntary muscles of insects. Meaning a thread-like granule, the term mitochondrion was coined by Carl Benda in 1898. The mitochondrion is popularly nicknamed the \"powerhouse of the cell\", a phrase popularized by Philip Siekevitz in a 1957 Scientific American article of the same name.": [
        "What is a mitochondrion and in which types of cells is it found?  ",
        "How do mitochondria generate energy for the cell?  ",
        "Who discovered mitochondria and when were they discovered?  ",
        "What is the origin and meaning of the term \"mitochondrion\"?  ",
        "Why is the mitochondrion often ca

## Query the Enriched Index

Once the index holds multiple hypothetical questions for each document, we can use it to retrieve a document based on a real user's query.

In [9]:
query = "What generates energy in a cell?"

In [10]:
from qdrant_client import models, QdrantClient
from sentence_transformers import SentenceTransformer

# create the vector database client
qdrant = QdrantClient(":memory:") # Create in-memory Qdrant instance

# Create the embedding encoder
encoder = SentenceTransformer('all-MiniLM-L6-v2') # Model to create embeddings

In [11]:
# Create a collection to store the hypothetical-question vectors
hyde_collection_name="reverse_hyde"

qdrant.recreate_collection(
    collection_name=hyde_collection_name,
    vectors_config=models.VectorParams(
        size=encoder.get_sentence_embedding_dimension(), # Vector size is defined by used model
        distance=models.Distance.COSINE
    )
)

True

In [12]:
import uuid
# vectorize!
qdrant.upload_points(
    collection_name=hyde_collection_name,
    points=[
        models.PointStruct(
            id=uuid.uuid5(uuid.NAMESPACE_URL, f"{d_idx}-{q_idx}").hex,
            vector=encoder.encode(question).tolist(),
            payload={ 
                "document": document , 
                "doc_id": d_idx
            }
        ) for d_idx, (document, questions) 
            in enumerate(processed_chunks.items()) 
                for q_idx, question in enumerate(questions)
    ]
)
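The nested comprehension above stores one point per generated question, each carrying its source document in the payload. The same flattening can be sketched without Qdrant or an encoder, which makes the structure easier to inspect. `build_records` and the `text_to_embed` key are hypothetical names used only for illustration.

```python
import uuid
from typing import Dict, List


def build_records(processed: Dict[str, List[str]]) -> List[dict]:
    """Flatten {document: [questions]} into one record per question."""
    records = []
    for d_idx, (document, questions) in enumerate(processed.items()):
        for q_idx, question in enumerate(questions):
            records.append({
                # uuid5 is deterministic: the same (d_idx, q_idx) always yields the same ID
                "id": uuid.uuid5(uuid.NAMESPACE_URL, f"{d_idx}-{q_idx}").hex,
                # The question is what gets embedded, not the document
                "text_to_embed": question,
                # The payload lets a matching question resolve back to its document
                "payload": {"document": document, "doc_id": d_idx},
            })
    return records


toy = {"doc A": ["q1", "q2"], "doc B": ["q3"]}  # toy stand-in for processed_chunks
records = build_records(toy)
print(len(records), [r["payload"]["doc_id"] for r in records])
```

Because the IDs are derived from positions, re-running the upload with the same chunk order reuses the same IDs. If the order of the chunks can change, deriving the ID from the chunk text instead would likely be more stable.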

In [13]:
print(
    qdrant
    .get_collection(
        collection_name=hyde_collection_name
    )
)

status=<CollectionStatus.GREEN: 'green'> optimizer_status=<OptimizersStatusOneOf.OK: 'ok'> vectors_count=None indexed_vectors_count=0 points_count=15 segments_count=1 config=CollectionConfig(params=CollectionParams(vectors=VectorParams(size=384, distance=<Distance.COSINE: 'Cosine'>, hnsw_config=None, quantization_config=None, on_disk=None, datatype=None, multivector_config=None), shard_number=None, sharding_method=None, replication_factor=None, write_consistency_factor=None, read_fan_out_factor=None, on_disk_payload=None, sparse_vectors=None), hnsw_config=HnswConfig(m=16, ef_construct=100, full_scan_threshold=10000, max_indexing_threads=0, on_disk=None, payload_m=None), optimizer_config=OptimizersConfig(deleted_threshold=0.2, vacuum_min_vector_number=1000, default_segment_number=0, max_segment_size=None, memmap_threshold=None, indexing_threshold=20000, flush_interval_sec=5, max_optimization_threads=1), wal_config=WalConfig(wal_capacity_mb=32, wal_segments_ahead=0), quantization_config=

### Search Collection for Best Match



In [14]:
def search_collection(collection_name: str, query: str, limit: int = 1):
    """
    Search the specified collection for the best match to the given query
    and print the query, the best-matching chunk, and its similarity score.
    
    :param collection_name: The name of the collection to search.
    :param query: The query to search for.
    :param limit: The maximum number of results to return. Default is 1.
    """
    hits = qdrant.search(
        collection_name=collection_name,
        query_vector=encoder.encode(query).tolist(),
        limit=limit
    )

    
    print("Query: ", query)
    print("Best Matching Chunk: ", hits[0].payload['document'])
    print("Score: ", hits[0].score)


In [15]:
search_collection(hyde_collection_name, query)

Query:  What generates energy in a cell?
Best Matching Chunk:  A mitochondrion (pl. mitochondria) is an organelle found in the cells of most eukaryotes, such as animals, plants and fungi. Mitochondria have a double membrane structure and use aerobic respiration to generate adenosine triphosphate (ATP), which is used throughout the cell as a source of chemical energy. They were discovered by Albert von Kölliker in 1857 in the voluntary muscles of insects. Meaning a thread-like granule, the term mitochondrion was coined by Carl Benda in 1898. The mitochondrion is popularly nicknamed the "powerhouse of the cell", a phrase popularized by Philip Siekevitz in a 1957 Scientific American article of the same name.
Score:  0.8280217009077394
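Because every document is stored once per generated question, a search with `limit` greater than 1 can return several hits that all point to the same document. One way to handle this, sketched below under the assumption that each hit has been reduced to a `(doc_id, score)` pair taken from the hit's payload and score, is to keep only the best score per document. `best_hit_per_document` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def best_hit_per_document(
    hits: List[Tuple[int, float]], limit: int = 3
) -> List[Tuple[int, float]]:
    """Collapse question-level hits into document-level hits, best score first."""
    best: Dict[int, float] = {}
    for doc_id, score in hits:
        if score > best.get(doc_id, float("-inf")):
            best[doc_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


# Illustrative (doc_id, score) pairs, not real search output
hits = [(0, 0.83), (0, 0.71), (2, 0.35), (0, 0.30), (1, 0.12)]
print(best_hit_per_document(hits, limit=2))
```

In practice it helps to request more question-level hits than the number of documents needed, since several of them may collapse into one document.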


### Compare to a Document-Only Index (without Reverse HyDE)

We will index the same documents without adding the Reverse HyDE questions and compare the similarity scores.

In [16]:
# Create a collection to store the document-only vectors
docs_collection_name="documents_only"

qdrant.recreate_collection(
    collection_name=docs_collection_name,
    vectors_config=models.VectorParams(
        size=encoder.get_sentence_embedding_dimension(), # Vector size is defined by used model
        distance=models.Distance.COSINE
    )
)

True

In [17]:
# vectorize!
qdrant.upload_points(
    collection_name=docs_collection_name,
    points=[
        models.PointStruct(
            id=idx,
            vector=encoder.encode(document).tolist(),
            payload={ "document": document}
        ) for idx, (document, questions) in enumerate(processed_chunks.items())
    ]
)

In [18]:
search_collection(docs_collection_name, query)

Query:  What generates energy in a cell?
Best Matching Chunk:  A mitochondrion (pl. mitochondria) is an organelle found in the cells of most eukaryotes, such as animals, plants and fungi. Mitochondria have a double membrane structure and use aerobic respiration to generate adenosine triphosphate (ATP), which is used throughout the cell as a source of chemical energy. They were discovered by Albert von Kölliker in 1857 in the voluntary muscles of insects. Meaning a thread-like granule, the term mitochondrion was coined by Carl Benda in 1898. The mitochondrion is popularly nicknamed the "powerhouse of the cell", a phrase popularized by Philip Siekevitz in a 1957 Scientific American article of the same name.
Score:  0.5002659397930831
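Both indexes return the same chunk for this query, so the difference here is in the similarity score (about 0.83 versus 0.50). A single query is only an anecdote; to compare the two indexes more systematically, one option is to measure how often the top result is the expected document over a set of labeled queries. The sketch below assumes a hypothetical `search` callable that returns the top document ID, and uses a toy keyword matcher in place of a real index.

```python
from typing import Callable, List, Tuple


def hit_rate(search: Callable[[str], int], labeled_queries: List[Tuple[str, int]]) -> float:
    """Fraction of queries whose top result is the expected document ID."""
    if not labeled_queries:
        return 0.0
    hits = sum(1 for query, expected in labeled_queries if search(query) == expected)
    return hits / len(labeled_queries)


def toy_search(query: str) -> int:
    """Toy stand-in for a real index lookup (assumption for illustration only)."""
    text = query.lower()
    if "cell" in text:
        return 0
    if "python" in text:
        return 1
    return 2


labeled_queries = [
    ("What generates energy in a cell?", 0),
    ("Is Python dynamically typed?", 1),
    ("When did the Civil War end?", 2),
    ("What is ATP?", 0),  # the toy matcher misses this one
]
print(hit_rate(toy_search, labeled_queries))
```

Running the same labeled queries against both collections would give one number per index, which is easier to compare than individual similarity scores.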
