# HyDE
For a given query, the HyDE retrieval pipeline has four components:
1. Promptor: builds the prompt for the generator based on the specific task.
2. Generator: generates hypothesis documents with a large language model.
3. Encoder: encodes the hypothesis documents into a HyDE vector.
4. Searcher: searches for the nearest neighbours of the HyDE vector (dense retrieval).
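
The data flow between the four components can be sketched as below. This is only an illustration under stated assumptions: `run_pipeline` and the `toy_*` stand-ins are hypothetical names, not part of the `hyde` package, and the stand-ins do nothing except record the order in which they are called.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Hit = Tuple[str, float]  # (document id, score)


def run_pipeline(
    query: str,
    build_prompt: Callable[[str], str],
    generate: Callable[[str], List[str]],
    encode: Callable[[str, List[str]], Vector],
    search: Callable[[Vector, int], List[Hit]],
    k: int = 10,
) -> List[Hit]:
    """Chain the four stages; each stage consumes the previous stage's output."""
    prompt = build_prompt(query)        # 1. Promptor
    documents = generate(prompt)        # 2. Generator
    vector = encode(query, documents)   # 3. Encoder
    return search(vector, k)            # 4. Searcher


# Hypothetical stand-ins: they only log their name and return dummy values.
calls: List[str] = []


def toy_prompt(query: str) -> str:
    calls.append("promptor")
    return f"Question: {query}"


def toy_generate(prompt: str) -> List[str]:
    calls.append("generator")
    return ["placeholder document"]


def toy_encode(query: str, documents: List[str]) -> Vector:
    calls.append("encoder")
    return [float(len(documents))]


def toy_search(vector: Vector, k: int) -> List[Hit]:
    calls.append("searcher")
    return [("doc-0", vector[0])][:k]


hits = run_pipeline("example query", toy_prompt, toy_generate, toy_encode, toy_search, k=1)
print(calls)
print(hits)
```

Passing the stages in as plain callables keeps the sketch independent of any particular LLM, encoder, or index.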

### Initialize HyDE components
We use [pyserini](https://github.com/castorini/pyserini) as the search interface.

In [1]:
# JAVA_HOME should point at the JDK root directory, not at the java binary.
%env JAVA_HOME=/usr/lib/jvm/java-21-openjdk-amd64
%env JVM_PATH=/usr/lib/jvm/java-21-openjdk-amd64/lib/server/libjvm.so

import json
from pyserini.search import FaissSearcher, LuceneSearcher
from pyserini.search.faiss import AutoQueryEncoder

from hyde import Promptor, HyDE, LlamaGenerator

env: JAVA_HOME=/usr/lib/jvm/java-21-openjdk-amd64
env: JVM_PATH=/usr/lib/jvm/java-21-openjdk-amd64/lib/server/libjvm.so


In [2]:
promptor = Promptor('web search')
generator = LlamaGenerator(model_name="llama3.1-8b-instruct")
encoder = AutoQueryEncoder(encoder_dir='facebook/contriever', pooling='mean')
searcher = FaissSearcher('contriever_msmarco_index/', encoder)
corpus = LuceneSearcher.from_prebuilt_index('msmarco-v1-passage')

Oct 28, 2024 4:04:23 PM org.apache.lucene.store.MemorySegmentIndexInputProvider <init>
INFO: Using MemorySegmentIndexInput with Java 21; to disable start with -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false


### Build a HyDE pipeline

In [9]:
hyde = HyDE(promptor, generator, encoder, searcher)

### Load example Query

In [10]:
query = 'how long does it take to remove wisdom tooth'

### Build Zero-shot Prompt

In [11]:
prompt = hyde.prompt(query)
print(prompt)

Please write a passage to answer the question.
Question: how long does it take to remove wisdom tooth
Passage:
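
A prompt like the one printed above could be produced by a simple per-task template. The sketch below is an assumption about how such a promptor might work, not the actual `Promptor` implementation; `TEMPLATES` and `build_prompt` are hypothetical names, and only the `web search` template text is taken from the output above.

```python
# Template text copied from the printed prompt; the dict layout is an assumption.
TEMPLATES = {
    "web search": (
        "Please write a passage to answer the question.\n"
        "Question: {}\n"
        "Passage:"
    ),
}


def build_prompt(task: str, query: str) -> str:
    """Fill the template registered for `task` with the user's query."""
    if task not in TEMPLATES:
        raise ValueError(f"unknown task: {task!r}")
    # The query is passed as a format argument, so braces inside it are safe.
    return TEMPLATES[task].format(query)


example_prompt = build_prompt("web search", "how long does it take to remove wisdom tooth")
print(example_prompt)
```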


### Generate Hypothesis Documents

In [15]:
from hyde import LlamaGenerator

In [18]:
hypothesis_documents = hyde.generate(query)
for i, doc in enumerate(hypothesis_documents):
    print(f'HyDE Generated Document: {i}')
    print(doc.strip())

TypeError: string indices must be integers
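
The traceback is not included here, so the cause of this error in the recorded run is unknown. In general, Python raises this `TypeError` when a string is indexed with a string key, which often happens when a model response arrives as raw text or an undecoded JSON string while the calling code expects a dict. The sketch below shows one defensive way to normalize such a response before indexing it. `extract_texts` is a hypothetical helper and the `generated_text` key is an assumed field name, not something confirmed by the `hyde` package.

```python
import json
from typing import Any, List


def extract_texts(response: Any, key: str = "generated_text") -> List[str]:
    """Return generated passages from a str, dict, or list response.

    `key` is an assumed field name; adjust it to the real response schema.
    """
    if isinstance(response, (bytes, bytearray)):
        response = response.decode("utf-8")
    if isinstance(response, str):
        raw = response
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            return [raw]  # plain text, not JSON
        if not isinstance(parsed, (dict, list)):
            return [raw]  # a bare JSON scalar: treat it as plain text
        response = parsed
    if isinstance(response, dict):
        response = [response]
    if not isinstance(response, list):
        raise TypeError(f"unsupported response type: {type(response).__name__}")

    texts: List[str] = []
    for item in response:
        if isinstance(item, str):
            texts.append(item)
        elif isinstance(item, dict) and key in item:
            texts.append(str(item[key]))
        else:
            raise TypeError(f"cannot extract text from item: {item!r}")
    return texts


print(extract_texts('[{"generated_text": "first passage"}]'))
print(extract_texts({"generated_text": "second passage"}))
print(extract_texts("already plain text"))
```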

### Encode HyDE vector

In [None]:
hyde_vector = hyde.encode(query, hypothesis_documents)
print(hyde_vector.shape)
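
Because `hyde.encode` takes both the query and the hypothesis documents, a natural reading is that the HyDE vector pools their embeddings. The sketch below assumes a plain element-wise mean over the query embedding and each document embedding; `mean_pool` and the lookup-table encoder are hypothetical stand-ins, and the real encoder here would be Contriever rather than a dictionary.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def mean_pool(
    query: str,
    hypothesis_documents: Sequence[str],
    embed: Callable[[str], Vector],
) -> Vector:
    """Average the embeddings of the query and every hypothesis document."""
    embeddings = [embed(text) for text in [query, *hypothesis_documents]]
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same dimension")
    # zip(*embeddings) iterates over dimensions; average each one.
    return [sum(column) / len(embeddings) for column in zip(*embeddings)]


# Hypothetical 2-dimensional "encoder" backed by a lookup table.
toy_embeddings = {
    "q": [1.0, 0.0],
    "d1": [0.0, 1.0],
    "d2": [2.0, 2.0],
}

pooled = mean_pool("q", ["d1", "d2"], toy_embeddings.__getitem__)
print(pooled)
```

Averaging keeps the result in the same vector space as ordinary query embeddings, so it can be passed to the same dense index without any change to the searcher.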

### Search Relevant Documents

In [None]:
hits = hyde.search(hyde_vector, k=10)
for i, hit in enumerate(hits):
    print(f'HyDE Retrieved Document: {i}')
    print(hit.docid)
    print(json.loads(corpus.doc(hit.docid).raw())['contents'])
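
Conceptually, the dense search step scores every indexed document vector against the HyDE vector and keeps the `k` best. The brute-force sketch below assumes inner-product similarity and a tiny in-memory index; `top_k_inner_product` and `toy_index` are hypothetical, and a real FAISS index performs the same ranking far more efficiently.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def top_k_inner_product(
    query_vector: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k (docid, score) pairs with the largest inner product."""
    scored = (
        (docid, sum(q * d for q, d in zip(query_vector, doc_vector)))
        for docid, doc_vector in index.items()
    )
    # nlargest returns the results sorted from highest to lowest score.
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical three-document index with 2-dimensional vectors.
toy_index = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [1.0, 1.0],
}

top_hits = top_k_inner_product([1.0, 0.5], toy_index, k=2)
for docid, score in top_hits:
    print(docid, score)
```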

### End to End Search

`e2e_search` runs all of the steps described above in a single call.

In [None]:
hits = hyde.e2e_search(query, k=10)
for i, hit in enumerate(hits):
    print(f'HyDE Retrieved Document: {i}')
    print(hit.docid)
    print(json.loads(corpus.doc(hit.docid).raw())['contents'])