In [1]:
%env JAVA_HOME=/usr/lib/jvm/java-21-openjdk-amd64
%env JVM_PATH=/usr/lib/jvm/java-21-openjdk-amd64/lib/server/libjvm.so

env: JAVA_HOME=/usr/lib/jvm/java-21-openjdk-amd64
env: JVM_PATH=/usr/lib/jvm/java-21-openjdk-amd64/lib/server/libjvm.so


### Initialize Contriever Index and Query Encoder
We use [Pyserini](https://github.com/castorini/pyserini) as the search interface for this experiment. Follow the Pyserini documentation to build a Contriever index from the checkpoint released with the original Contriever work.

In [2]:
from pyserini.search import FaissSearcher, LuceneSearcher
from pyserini.search.faiss import AutoQueryEncoder
from pyserini.search import get_topics, get_qrels
from tqdm import tqdm


In [4]:
query_encoder = AutoQueryEncoder(encoder_dir='facebook/contriever', pooling='mean')
searcher = FaissSearcher('contriever_msmarco_index/', query_encoder)
corpus = LuceneSearcher.from_prebuilt_index('msmarco-v1-passage')

Oct 31, 2024 3:44:39 PM org.apache.lucene.store.MemorySegmentIndexInputProvider <init>
INFO: Using MemorySegmentIndexInput with Java 21; to disable start with -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false
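
The `pooling='mean'` argument above asks the encoder to average token embeddings into a single vector. The sketch below illustrates that idea on toy data with NumPy; `mean_pool` is a hypothetical helper written for this note, not Pyserini's internal code, and the padding mask handling is an assumption about how such pooling is usually done.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors while ignoring padded positions (illustrative only)."""
    token_embeddings = np.asarray(token_embeddings, dtype=np.float32)  # (seq_len, dim)
    mask = np.asarray(attention_mask, dtype=np.float32)[:, None]       # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)
    count = max(float(mask.sum()), 1.0)  # guard against an all-padding input
    return summed / count

# Two real tokens and one padded token; the padded row must not affect the result.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2. 3.]
```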


In [5]:
topics = get_topics('dl19-passage')
qrels = get_qrels('dl19-passage')

### Run Contriever

In [6]:
with open('dl19-contriever-top1000-trec', 'w') as f:
    for qid in tqdm(topics):
        if qid in qrels:
            query = topics[qid]['title']
            hits = searcher.search(query, k=1000)
            # Write one TREC run line per hit; ranks start at 1
            for rank, hit in enumerate(hits, start=1):
                f.write(f'{qid} Q0 {hit.docid} {rank} {hit.score} rank\n')

100%|██████████| 43/43 [01:41<00:00,  2.37s/it]
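
Each line written by the loop above follows the six-column TREC run format: query id, the literal `Q0`, document id, rank, score, and a run tag. As a minimal sketch, the hypothetical helpers below (`format_run_lines` and `parse_run_line` are names invented for this note) build and read such lines, which can be handy for spot-checking a run file before evaluation.

```python
def format_run_lines(qid, ranked, tag="rank"):
    """Build TREC run lines from (docid, score) pairs sorted by descending score."""
    return [
        f"{qid} Q0 {docid} {rank} {score} {tag}"
        for rank, (docid, score) in enumerate(ranked, start=1)
    ]

def parse_run_line(line):
    """Split one TREC run line into its six whitespace-separated fields."""
    qid, _q0, docid, rank, score, tag = line.split()
    return {"qid": qid, "docid": docid, "rank": int(rank), "score": float(score), "tag": tag}

# Toy document ids and scores, not real retrieval output.
lines = format_run_lines("264014", [("d1", 1.5), ("d2", 0.7)])
print(lines[0])                  # 264014 Q0 d1 1 1.5 rank
print(parse_run_line(lines[1]))
```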


In [7]:
!python -m pyserini.eval.trec_eval -c -l 2 -m map dl19-passage dl19-contriever-top1000-trec
!python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 dl19-passage dl19-contriever-top1000-trec
!python -m pyserini.eval.trec_eval -c -l 2 -m recall.1000 dl19-passage dl19-contriever-top1000-trec

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


map                   	all	0.2399


ndcg_cut_10           	all	0.4454


recall_1000           	all	0.7459
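
In the commands above, `-l 2` sets the minimum relevance grade that counts as relevant for the binary metrics (MAP and recall), while nDCG uses the graded judgments directly. The following is a simplified per-query sketch of those two ideas, not a reimplementation of `trec_eval`; the function names and the toy judgments are made up for illustration.

```python
import math

def recall_at_k(ranked_docids, qrels, k=1000, min_rel=2):
    """Fraction of documents judged >= min_rel that appear in the top k."""
    relevant = {docid for docid, grade in qrels.items() if grade >= min_rel}
    if not relevant:
        return 0.0
    found = sum(1 for docid in ranked_docids[:k] if docid in relevant)
    return found / len(relevant)

def ndcg_at_k(ranked_docids, qrels, k=10):
    """nDCG with the relevance grade as gain and a log2(rank + 1) discount."""
    dcg = sum(
        max(qrels.get(docid, 0), 0) / math.log2(i + 2)
        for i, docid in enumerate(ranked_docids[:k])
    )
    ideal = sorted((g for g in qrels.values() if g > 0), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Toy judgments for a single query: grades 0-3, as in graded TREC qrels.
toy_qrels = {"a": 3, "b": 2, "c": 1, "d": 0}
toy_ranking = ["a", "d", "b"]
print(recall_at_k(toy_ranking, toy_qrels, k=1000, min_rel=2))  # 1.0
print(ndcg_at_k(toy_ranking, toy_qrels, k=10))
```

The reported scores are averages of such per-query values over all judged queries.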


### Run BM25

In [17]:
corpus = LuceneSearcher.from_prebuilt_index('msmarco-v1-passage')
corpus.set_bm25()  # default parameters
with open('dl19-bm25-top1000-trec', 'w') as f:
    for qid in tqdm(topics):
        query = topics[qid]['title']
        hits = corpus.search(query, k=1000)
        # Write one TREC run line per hit; ranks start at 1
        for rank, hit in enumerate(hits, start=1):
            f.write(f'{qid} Q0 {hit.docid} {rank} {hit.score} rank\n')

100%|██████████| 43/43 [01:14<00:00,  1.73s/it]


In [18]:
!python -m pyserini.eval.trec_eval -c -l 2 -m map dl19-passage dl19-bm25-top1000-trec
!python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 dl19-passage dl19-bm25-top1000-trec
!python -m pyserini.eval.trec_eval -c -l 2 -m recall.1000 dl19-passage dl19-bm25-top1000-trec

map                   	all	0.3013


ndcg_cut_10           	all	0.5058


recall_1000           	all	0.7501


### Run RM3

In [19]:
corpus = LuceneSearcher.from_prebuilt_index('msmarco-v1-passage')
corpus.set_rm3()  # RM3 query expansion on top of BM25, default parameters
with open('dl19-rm3-top1000-trec', 'w') as f:
    for qid in tqdm(topics):
        query = topics[qid]['title']
        hits = corpus.search(query, k=1000)
        # Write one TREC run line per hit; ranks start at 1
        for rank, hit in enumerate(hits, start=1):
            f.write(f'{qid} Q0 {hit.docid} {rank} {hit.score} rank\n')

100%|██████████| 43/43 [00:08<00:00,  4.93it/s]


In [20]:
!python -m pyserini.eval.trec_eval -c -l 2 -m map dl19-passage dl19-rm3-top1000-trec
!python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 dl19-passage dl19-rm3-top1000-trec
!python -m pyserini.eval.trec_eval -c -l 2 -m recall.1000 dl19-passage dl19-rm3-top1000-trec

map                   	all	0.3416


ndcg_cut_10           	all	0.5216


recall_1000           	all	0.8136


### Run HyDE with LLaMA3.1-8B-Instruct

In [16]:
import json
with open("dl19-passage_pseudo_docs_8rep.json", "r") as f:
    pseudo_docs = json.load(f)

print(f"Generation config: {pseudo_docs['gen_config']}")
print(f"Total number of topics and corresponding query, answers: {len(pseudo_docs['topics'])}")
print(f"An Example of a topic entry: {pseudo_docs['topics'][0]}")

Generation config: {'max_new_tokens': 512, 'temperature': 0.7, 'top_p': 1, 'do_sample': True, 'num_return_sequences': 8}
Total number of topics and corresponding query, answers: 43
An Example of a topic entry: {'qid': 264014, 'query': 'how long is life cycle of flea', 'generated_passages': ["The life cycle of a flea is a complex process that consists of four distinct stages: egg, larva, pupa, and adult. This cycle typically lasts around 21-30 days, depending on environmental factors and the availability of food.\n\nIt begins with the adult flea, which lays its eggs in the host's fur or in the surrounding environment. The female flea can lay up to 50 eggs per day, with the total number ranging from 20 to 400 eggs. The eggs are usually white, oval-shaped, and about 0.5 millimeters in length.\n\nAfter 2-3 weeks, the eggs hatch into larvae, which are legless, worm-like creatures that feed on flea feces, skin cells, and other organic matter. During this stage, the larvae molt (shed their sk

In [12]:
import numpy as np
from tqdm import tqdm

with open('hyde-dl19-contriever-llama3_1_8b_instruct-top1000-8rep-trec', 'w') as f:
    for topic in tqdm(pseudo_docs['topics'], desc="Performing encoding and searching for pseudo docs"):
        qid = topic["qid"]
        query = topic["query"]
        passages = topic["generated_passages"]
        # Encode each generated passage and average the embeddings into one query vector
        all_emb_c = []
        for passage in passages:
            c_emb = query_encoder.encode(passage)
            all_emb_c.append(np.array(c_emb))
        all_emb_c = np.array(all_emb_c)
        avg_emb_c = np.mean(all_emb_c, axis=0)
        avg_emb_c = avg_emb_c.reshape((1, len(avg_emb_c)))

        # Search the index with the averaged pseudo-document embedding
        hits = searcher.search(avg_emb_c, k=1000)
        # Restart the rank at 1 for every topic
        for rank, hit in enumerate(hits, start=1):
            f.write(f'{qid} Q0 {hit.docid} {rank} {hit.score} rank\n')

Performing encoding and searching for pseudo docs: 100%|██████████| 43/43 [01:52<00:00,  2.62s/it]
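
The averaging step in the cell above can be checked in isolation. This minimal sketch uses toy vectors in place of real Contriever embeddings; `average_embeddings` is a hypothetical helper for this note. It shows that the mean over the passage axis, kept as a `(1, dim)` array, matches the shape the cell passes to `searcher.search`.

```python
import numpy as np

def average_embeddings(embeddings):
    """Stack per-passage vectors and return their mean with shape (1, dim)."""
    stacked = np.stack([np.asarray(e, dtype=np.float32) for e in embeddings])  # (n, dim)
    return stacked.mean(axis=0, keepdims=True)                                 # (1, dim)

# Two toy "passage embeddings" of dimension 3.
toy = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
avg = average_embeddings(toy)
print(avg.shape)  # (1, 3)
print(avg)        # [[2. 1. 1.]]
```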


In [14]:
!python -m pyserini.eval.trec_eval -c -l 2 -m map dl19-passage hyde-dl19-contriever-llama3_1_8b_instruct-top1000-8rep-trec
!python -m pyserini.eval.trec_eval -c -m ndcg_cut.10 dl19-passage hyde-dl19-contriever-llama3_1_8b_instruct-top1000-8rep-trec
!python -m pyserini.eval.trec_eval -c -l 2 -m recall.1000 dl19-passage hyde-dl19-contriever-llama3_1_8b_instruct-top1000-8rep-trec

map                   	all	0.3695


ndcg_cut_10           	all	0.5494


recall_1000           	all	0.8234
