This repository complements the following papers:
- Query Performance Prediction using Relevance Judgments Generated by Large Language Models
  - In this paper, we propose a new query performance prediction (QPP) framework, QPP-GenRE, which first automatically generates relevance judgments for a ranked list returned for a given query, and then regards the generated relevance judgments as pseudo labels to compute different IR evaluation measures. QPP-GenRE can be integrated with various methods for judging relevance. We show the effectiveness of QPP-GenRE equipped with LLaMA-7B, Llama-3-8B and Llama-3-8B-Instruct, which we fine-tune to generate relevance judgments automatically.
- Can We Use Large Language Models to Fill Relevance Judgment Holes?
  - In this paper, we fine-tune Llama-3-8B and Llama-3-8B-Instruct to generate relevance judgments in the context of conversational search.
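To make the core idea of QPP-GenRE concrete, here is a minimal sketch (not the repository's code): an LLM judge emits a binary pseudo label for each item in a ranked list, and an IR measure computed from those pseudo labels is taken as the predicted QPP score. The function and variable names below are illustrative only.

```python
# Minimal sketch of the QPP-GenRE idea: binary pseudo labels for a ranked
# list (as an LLM judge would emit them) are plugged into an IR evaluation
# measure, and the resulting value serves as the predicted QPP score.

def predicted_rr_at_k(pseudo_labels, k=10):
    """RR@k computed from binary pseudo labels (1 = judged relevant)."""
    for rank, label in enumerate(pseudo_labels[:k], start=1):
        if label == 1:
            return 1.0 / rank
    return 0.0

# Hypothetical pseudo labels for the top-10 results of one query.
labels = [0, 0, 1, 0, 1, 0, 0, 0, 0, 0]
print(predicted_rr_at_k(labels))  # 1/3: first pseudo-relevant passage is at rank 3
```

The same substitution works for any measure that only needs relevance labels (e.g., nDCG@10), which is why the framework is agnostic to the method used for judging relevance.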
This repository is structured into the following parts:
- Installation
- Query Performance Prediction using Relevance Judgments Generated by Large Language Models
- 2.1 Prerequisite
- 2.2 Inference using fine-tuned LLaMA-7B, Llama-3-8B and Llama-3-8B-Instruct
- 2.3 Fine-tuning LLaMA-7B, Llama-3-8B and Llama-3-8B-Instruct
- 2.4 In-context learning using LLaMA-7B, Llama-3-8B and Llama-3-8B-Instruct
- 2.5 Evaluation
- 2.6 The results of scaled Mean Absolute Ranking Error (sMARE)
- Can We Use Large Language Models to Fill Relevance Judgment Holes?
- 3.1 Prerequisite
- 3.2 Inference using fine-tuned LLaMA-7B
- 3.3 Zero-shot prompting using Llama-3-8B and Llama-3-8B-Instruct
- 3.4 Inference using fine-tuned Llama-3-8B and Llama-3-8B-Instruct
- 3.5 Fine-tuning Llama-3-8B and Llama-3-8B-Instruct
- 3.6 Evaluation
pip install -r requirements.txt
Please first download dataset.zip (containing queries, run files, qrels files and the files recording the actual retrieval quality of queries) from here, and then unzip it in the current directory.
Then, please download the prebuilt Lucene indexes for the MS MARCO V1 and V2 passage ranking collections from Pyserini:
wget -P ./datasets/msmarco-v1-passage/ https://rgw.cs.uwaterloo.ca/pyserini/indexes/lucene-index.msmarco-v1-passage-full.20221004.252b5e.tar.gz --no-check-certificate
tar -zxvf ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e.tar.gz -C ./datasets/msmarco-v1-passage/
wget -P ./datasets/msmarco-v2-passage/ https://rgw.cs.uwaterloo.ca/pyserini/indexes/lucene-index.msmarco-v2-passage-full.20220808.4d6d2a.tar.gz --no-check-certificate
tar -zxvf ./datasets/msmarco-v2-passage/lucene-index.msmarco-v2-passage-full.20220808.4d6d2a.tar.gz -C ./datasets/msmarco-v2-passage/
For LLaMA-7B, please refer to the LLaMA repository to fetch the original weights of LLaMA-7B. Then, follow the instructions from here to convert the original LLaMA-7B weights to the Hugging Face Transformers format. Next, set your local path to the converted weights as an environment variable, which will be used in the following steps.
export LLAMA_7B_PATH={your path to the weights of LLaMA-7B (Hugging Face Transformers format)}
For Llama-3-8B and Llama-3-8B-Instruct, we can directly fetch weights from Hugging Face. Please set your own token and your cache directory:
export TOKEN={your token to use as HTTP bearer authorization for remote files}
export CACHE_DIR={your cache path that stores the weights of Llama 3}
For the reproducibility of the results reported in the paper, please download the checkpoints of our fine-tuned models. After downloading, please unzip them in a new directory ./checkpoint/.
Note
We use 4-bit quantized LLaMA-7B for both inference and fine-tuning in this paper; all experiments were conducted on an NVIDIA A100 Tensor Core GPU (40GB).
This part shows how to directly use our released checkpoints of fine-tuned LLaMA-7B, Llama-3-8B and Llama-3-8B-Instruct to predict the performance of BM25 and ANCE on the TREC-DL 19, 20, 21 and 22 datasets.
Please run judge_relevance.py and predict_measures.py sequentially to produce one prediction for one ranker on one dataset.
Specifically, judge_relevance.py automatically generates relevance judgments for a ranked list returned by BM25 or ANCE; the generated relevance judgments are saved to ./output/.
predict_measures.py computes different IR evaluation measures, such as RR@10 and nDCG@10, based on the generated relevance judgments (pseudo labels); the computed values of an IR evaluation measure are regarded as predicted QPP scores that are expected to approximate the actual values of that measure. Predicted QPP scores for a dataset are saved to a folder corresponding to the dataset; e.g., QPP scores for BM25 or ANCE on TREC-DL 19 are saved to ./output/dl-19-passage.
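Conceptually, the second step reads a TREC run file, looks up the generated pseudo labels, and emits one measure value per query. The sketch below illustrates this with nDCG@10; it is a simplified stand-in, not the actual implementation of predict_measures.py, and the data-structure names are assumptions.

```python
import math
from collections import defaultdict

def ndcg_at_k(gains, k=10):
    """nDCG@k from a ranked list of (binary) gains."""
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))
    ideal = sorted(gains, reverse=True)
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

def predict_ndcg(run_lines, pseudo_qrels, k=10):
    """run_lines: standard TREC run rows 'qid Q0 docid rank score tag';
    pseudo_qrels: {(qid, docid): 0/1 pseudo label from the LLM judge}."""
    ranked = defaultdict(list)
    for line in run_lines:
        qid, _, docid, rank, _, _ = line.split()
        ranked[qid].append((int(rank), docid))
    scores = {}
    for qid, items in ranked.items():
        items.sort()  # ensure ascending rank order
        gains = [pseudo_qrels.get((qid, d), 0) for _, d in items]
        scores[qid] = ndcg_at_k(gains, k)  # predicted QPP score for qid
    return scores

run = ["q1 Q0 d1 1 9.1 bm25", "q1 Q0 d2 2 8.7 bm25", "q1 Q0 d3 3 8.0 bm25"]
labels = {("q1", "d2"): 1, ("q1", "d3"): 1}
print(predict_ndcg(run, labels))
```

Per-query values like these are what get correlated against the actual retrieval quality in the evaluation step.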
# LLaMA-7B
python -u judge_relevance.py \
--model_name_or_path ${LLAMA_7B_PATH} \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--checkpoint_name msmarco-v1-passage-dev-small.original-bm25-1000.original-llama-1-7b-hf-neg1-top1000/checkpoint-2790 \
--query_path ./datasets/msmarco-v1-passage/queries/dl-19-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v1-passage/runs/dl-19-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_path ./datasets/msmarco-v1-passage/qrels/dl-19-passage.qrels.txt \
--output_dir ./output/ \
--batch_size 32 \
--infer --prompt binary
python -u predict_measures.py \
--run_path ./datasets/msmarco-v1-passage/runs/dl-19-passage.run-original-bm25-1000.txt \
--qrels_path ./output/dl-19-passage.original-bm25-1000.original-llama-1-7b-hf-ckpt-msmarco-v1-passage-dev-small.original-bm25-1000.original-llama-1-7b-hf-neg1-top1000-checkpoint-2790.k1000 \
--output_path ./output/dl-19-passage \
--n 10 100 200 500 1000
# Llama-3-8B
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--checkpoint_name msmarco-v1-passage-dev-small.original-bm25-1000.original-Meta-Llama-3-8B-neg2-top1000/checkpoint-5350 \
--query_path ./datasets/msmarco-v1-passage/queries/dl-19-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v1-passage/runs/dl-19-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_path ./datasets/msmarco-v1-passage/qrels/dl-19-passage.qrels.txt \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary
python -u predict_measures.py \
--run_path ./datasets/msmarco-v1-passage/runs/dl-19-passage.run-original-bm25-1000.txt \
--qrels_path ./output/dl-19-passage.original-bm25-1000.original-Meta-Llama-3-8B-ckpt-msmarco-v1-passage-dev-small.original-bm25-1000.original-Meta-Llama-3-8B-neg2-top1000-checkpoint-5350.k1000 \
--output_path ./output/dl-19-passage \
--n 10 100 200 500 1000
# Llama-3-8B-Instruct
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B-Instruct" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--checkpoint_name msmarco-v1-passage-dev-small.original-bm25-1000.original-Meta-Llama-3-8B-Instruct-neg2-top1000/checkpoint-2675 \
--query_path ./datasets/msmarco-v1-passage/queries/dl-19-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v1-passage/runs/dl-19-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_path ./datasets/msmarco-v1-passage/qrels/dl-19-passage.qrels.txt \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary
python -u predict_measures.py \
--run_path ./datasets/msmarco-v1-passage/runs/dl-19-passage.run-original-bm25-1000.txt \
--qrels_path ./output/dl-19-passage.original-bm25-1000.original-Meta-Llama-3-8B-Instruct-ckpt-msmarco-v1-passage-dev-small.original-bm25-1000.original-Meta-Llama-3-8B-Instruct-neg2-top1000-checkpoint-2675.k1000 \
--output_path ./output/dl-19-passage \
--n 10 100 200 500 1000
# LLaMA-7B
python -u judge_relevance.py \
--model_name_or_path ${LLAMA_7B_PATH} \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--checkpoint_name msmarco-v1-passage-dev-small.original-bm25-1000.original-llama-1-7b-hf-neg1-top1000/checkpoint-2790 \
--query_path ./datasets/msmarco-v1-passage/queries/dl-20-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v1-passage/runs/dl-20-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_path ./datasets/msmarco-v1-passage/qrels/dl-20-passage.qrels.txt \
--output_dir ./output/ \
--batch_size 32 \
--infer --prompt binary
python -u predict_measures.py \
--run_path ./datasets/msmarco-v1-passage/runs/dl-20-passage.run-original-bm25-1000.txt \
--qrels_path ./output/dl-20-passage.original-bm25-1000.original-llama-1-7b-hf-ckpt-msmarco-v1-passage-dev-small.original-bm25-1000.original-llama-1-7b-hf-neg1-top1000-checkpoint-2790.k1000 \
--output_path ./output/dl-20-passage \
--n 10 100 200 500 1000
# Llama-3-8B
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--checkpoint_name msmarco-v1-passage-dev-small.original-bm25-1000.original-Meta-Llama-3-8B-neg2-top1000/checkpoint-5350 \
--query_path ./datasets/msmarco-v1-passage/queries/dl-20-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v1-passage/runs/dl-20-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_path ./datasets/msmarco-v1-passage/qrels/dl-20-passage.qrels.txt \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary
python -u predict_measures.py \
--run_path ./datasets/msmarco-v1-passage/runs/dl-20-passage.run-original-bm25-1000.txt \
--qrels_path ./output/dl-20-passage.original-bm25-1000.original-Meta-Llama-3-8B-ckpt-msmarco-v1-passage-dev-small.original-bm25-1000.original-Meta-Llama-3-8B-neg2-top1000-checkpoint-5350.k1000 \
--output_path ./output/dl-20-passage \
--n 10 100 200 500 1000
# Llama-3-8B-Instruct
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B-Instruct" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--checkpoint_name msmarco-v1-passage-dev-small.original-bm25-1000.original-Meta-Llama-3-8B-Instruct-neg2-top1000/checkpoint-2675 \
--query_path ./datasets/msmarco-v1-passage/queries/dl-20-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v1-passage/runs/dl-20-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_path ./datasets/msmarco-v1-passage/qrels/dl-20-passage.qrels.txt \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary
python -u predict_measures.py \
--run_path ./datasets/msmarco-v1-passage/runs/dl-20-passage.run-original-bm25-1000.txt \
--qrels_path ./output/dl-20-passage.original-bm25-1000.original-Meta-Llama-3-8B-Instruct-ckpt-msmarco-v1-passage-dev-small.original-bm25-1000.original-Meta-Llama-3-8B-Instruct-neg2-top1000-checkpoint-2675.k1000 \
--output_path ./output/dl-20-passage \
--n 10 100 200 500 1000
# LLaMA-7B
python -u judge_relevance.py \
--model_name_or_path ${LLAMA_7B_PATH} \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--checkpoint_name msmarco-v1-passage-dev-small.original-bm25-1000.original-llama-1-7b-hf-neg1-top1000/checkpoint-1860 \
--query_path ./datasets/msmarco-v2-passage/queries/dl-21-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v2-passage/runs/dl-21-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v2-passage/lucene-index.msmarco-v2-passage-full.20220808.4d6d2a \
--qrels_path ./datasets/msmarco-v2-passage/qrels/dl-21-passage.qrels.txt \
--output_dir ./output/ \
--batch_size 32 \
--infer --prompt binary
python -u predict_measures.py \
--run_path ./datasets/msmarco-v2-passage/runs/dl-21-passage.run-original-bm25-1000.txt \
--qrels_path ./output/dl-21-passage.original-bm25-1000.original-llama-1-7b-hf-ckpt-msmarco-v1-passage-dev-small.original-bm25-1000.original-llama-1-7b-hf-neg1-top1000-checkpoint-1860.k1000 \
--output_path ./output/dl-21-passage \
--n 10 100 200 500 1000
# Llama-3-8B
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--checkpoint_name msmarco-v1-passage-dev-small.original-bm25-1000.original-Meta-Llama-3-8B-neg2-top1000/checkpoint-5350 \
--query_path ./datasets/msmarco-v2-passage/queries/dl-21-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v2-passage/runs/dl-21-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v2-passage/lucene-index.msmarco-v2-passage-full.20220808.4d6d2a \
--qrels_path ./datasets/msmarco-v2-passage/qrels/dl-21-passage.qrels.txt \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary
python -u predict_measures.py \
--run_path ./datasets/msmarco-v2-passage/runs/dl-21-passage.run-original-bm25-1000.txt \
--qrels_path ./output/dl-21-passage.original-bm25-1000.original-Meta-Llama-3-8B-ckpt-msmarco-v1-passage-dev-small.original-bm25-1000.original-Meta-Llama-3-8B-neg2-top1000-checkpoint-5350.k1000 \
--output_path ./output/dl-21-passage \
--n 10 100 200 500 1000
# Llama-3-8B-Instruct
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B-Instruct" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--checkpoint_name msmarco-v1-passage-dev-small.original-bm25-1000.original-Meta-Llama-3-8B-Instruct-neg2-top1000/checkpoint-2675 \
--query_path ./datasets/msmarco-v2-passage/queries/dl-21-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v2-passage/runs/dl-21-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v2-passage/lucene-index.msmarco-v2-passage-full.20220808.4d6d2a \
--qrels_path ./datasets/msmarco-v2-passage/qrels/dl-21-passage.qrels.txt \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary
python -u predict_measures.py \
--run_path ./datasets/msmarco-v2-passage/runs/dl-21-passage.run-original-bm25-1000.txt \
--qrels_path ./output/dl-21-passage.original-bm25-1000.original-Meta-Llama-3-8B-Instruct-ckpt-msmarco-v1-passage-dev-small.original-bm25-1000.original-Meta-Llama-3-8B-Instruct-neg2-top1000-checkpoint-2675.k1000 \
--output_path ./output/dl-21-passage \
--n 10 100 200 500 1000
# LLaMA-7B
python -u judge_relevance.py \
--model_name_or_path ${LLAMA_7B_PATH} \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--checkpoint_name msmarco-v1-passage-dev-small.original-bm25-1000.original-llama-1-7b-hf-neg1-top1000/checkpoint-1860 \
--query_path ./datasets/msmarco-v2-passage/queries/dl-22-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v2-passage/runs/dl-22-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v2-passage/lucene-index.msmarco-v2-passage-full.20220808.4d6d2a \
--qrels_path ./datasets/msmarco-v2-passage/qrels/dl-22-passage.qrels-withDupes.txt \
--output_dir ./output/ \
--batch_size 32 \
--infer --prompt binary
python -u predict_measures.py \
--run_path ./datasets/msmarco-v2-passage/runs/dl-22-passage.run-original-bm25-1000.txt \
--qrels_path ./output/dl-22-passage.original-bm25-1000.original-llama-1-7b-hf-ckpt-msmarco-v1-passage-dev-small.original-bm25-1000.original-llama-1-7b-hf-neg1-top1000-checkpoint-1860.k1000 \
--output_path ./output/dl-22-passage \
--n 10 100 200 500 1000
# Llama-3-8B
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--checkpoint_name msmarco-v1-passage-dev-small.original-bm25-1000.original-Meta-Llama-3-8B-neg2-top1000/checkpoint-5350 \
--query_path ./datasets/msmarco-v2-passage/queries/dl-22-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v2-passage/runs/dl-22-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v2-passage/lucene-index.msmarco-v2-passage-full.20220808.4d6d2a \
--qrels_path ./datasets/msmarco-v2-passage/qrels/dl-22-passage.qrels-withDupes.txt \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary
python -u predict_measures.py \
--run_path ./datasets/msmarco-v2-passage/runs/dl-22-passage.run-original-bm25-1000.txt \
--qrels_path ./output/dl-22-passage.original-bm25-1000.original-Meta-Llama-3-8B-ckpt-msmarco-v1-passage-dev-small.original-bm25-1000.original-Meta-Llama-3-8B-neg2-top1000-checkpoint-5350.k1000 \
--output_path ./output/dl-22-passage \
--n 10 100 200 500 1000
# Llama-3-8B-Instruct
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B-Instruct" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--checkpoint_name msmarco-v1-passage-dev-small.original-bm25-1000.original-Meta-Llama-3-8B-Instruct-neg2-top1000/checkpoint-2675 \
--query_path ./datasets/msmarco-v2-passage/queries/dl-22-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v2-passage/runs/dl-22-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v2-passage/lucene-index.msmarco-v2-passage-full.20220808.4d6d2a \
--qrels_path ./datasets/msmarco-v2-passage/qrels/dl-22-passage.qrels-withDupes.txt \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary
python -u predict_measures.py \
--run_path ./datasets/msmarco-v2-passage/runs/dl-22-passage.run-original-bm25-1000.txt \
--qrels_path ./output/dl-22-passage.original-bm25-1000.original-Meta-Llama-3-8B-Instruct-ckpt-msmarco-v1-passage-dev-small.original-bm25-1000.original-Meta-Llama-3-8B-Instruct-neg2-top1000-checkpoint-2675.k1000 \
--output_path ./output/dl-22-passage \
--n 10 100 200 500 1000
python -u judge_relevance.py \
--model_name_or_path ${LLAMA_7B_PATH} \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--checkpoint_name msmarco-v1-passage-dev-small.original-bm25-1000.original-llama-1-7b-hf-neg1-top1000/checkpoint-2790 \
--query_path ./datasets/msmarco-v1-passage/queries/dl-19-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v1-passage/runs/dl-19-passage.run-original-ance-msmarco-v1-passage-1000.txt \
--index_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_path ./datasets/msmarco-v1-passage/qrels/dl-19-passage.qrels.txt \
--output_dir ./output/ \
--batch_size 32 \
--infer --prompt binary
python -u predict_measures.py \
--run_path ./datasets/msmarco-v1-passage/runs/dl-19-passage.run-original-ance-msmarco-v1-passage-1000.txt \
--qrels_path ./output/dl-19-passage.original-ance-msmarco-v1-passage-1000.original-llama-1-7b-hf-ckpt-msmarco-v1-passage-dev-small.original-bm25-1000.original-llama-1-7b-hf-neg1-checkpoint-2790.k1000 \
--output_path ./output/dl-19-passage \
--n 10 100 200 500 1000
python -u judge_relevance.py \
--model_name_or_path ${LLAMA_7B_PATH} \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--checkpoint_name msmarco-v1-passage-dev-small.original-bm25-1000.original-llama-1-7b-hf-neg1-top1000/checkpoint-2790 \
--query_path ./datasets/msmarco-v1-passage/queries/dl-20-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v1-passage/runs/dl-20-passage.run-original-ance-msmarco-v1-passage-1000.txt \
--index_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_path ./datasets/msmarco-v1-passage/qrels/dl-20-passage.qrels.txt \
--output_dir ./output/ \
--batch_size 32 \
--infer --prompt binary
python -u predict_measures.py \
--run_path ./datasets/msmarco-v1-passage/runs/dl-20-passage.run-original-ance-msmarco-v1-passage-1000.txt \
--qrels_path ./output/dl-20-passage.original-ance-msmarco-v1-passage-1000.original-llama-1-7b-hf-ckpt-msmarco-v1-passage-dev-small.original-bm25-1000.original-llama-1-7b-hf-neg1-checkpoint-2790.k1000 \
--output_path ./output/dl-20-passage \
--n 10 100 200 500 1000
Run the following command to fine-tune 4-bit quantized LLaMA-7B using QLoRA on the task of judging the relevance of a passage to a given query, on the development set of MS MARCO V1.
For each query in the development set, we use the relevant passages listed in the qrels file as positive examples, and randomly sample a negative passage from the ranked list (1,000 items) returned by BM25.
A checkpoint is saved to ./checkpoint/ after each epoch.
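The training-data construction described above can be sketched as follows. This is a hedged illustration of the sampling scheme, not the repository's code, and all names (build_pairs, the dict layouts) are assumptions.

```python
import random

# Sketch of the sampling scheme: for each query, relevant passages from the
# qrels become positive examples, and num_negs negatives are drawn at random
# from the BM25 top-1000 passages that are not judged relevant.

def build_pairs(qrels, bm25_top1000, num_negs=1, seed=42):
    """qrels: {qid: set of relevant docids};
    bm25_top1000: {qid: ranked list of docids from BM25}."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    pairs = []
    for qid, relevant in qrels.items():
        for docid in relevant:
            pairs.append((qid, docid, "relevant"))
        candidates = [d for d in bm25_top1000[qid] if d not in relevant]
        for docid in rng.sample(candidates, min(num_negs, len(candidates))):
            pairs.append((qid, docid, "irrelevant"))
    return pairs

qrels = {"q1": {"d7"}}
run = {"q1": [f"d{i}" for i in range(1000)]}
print(build_pairs(qrels, run))
```

With num_negs=1 this mirrors the LLaMA-7B setting above; the Llama-3 commands use --num_negs 2, i.e., two negatives per query.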
# LLaMA-7B
python -u judge_relevance.py \
--model_name_or_path ${LLAMA_7B_PATH} \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/msmarco-v1-passage/queries/msmarco-v1-passage-dev-small.queries-original.tsv \
--run_path ./datasets/msmarco-v1-passage/runs/msmarco-v1-passage-dev-small.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_path ./datasets/msmarco-v1-passage/qrels/msmarco-v1-passage-dev-small.qrels.tsv \
--logging_steps 10 \
--per_device_train_batch_size 64 \
--num_epochs 5 \
--num_negs 1 \
--neg_top 1000 \
--prompt binary
# Llama-3-8B
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/msmarco-v1-passage/queries/msmarco-v1-passage-dev-small.queries-original.tsv \
--run_path ./datasets/msmarco-v1-passage/runs/msmarco-v1-passage-dev-small.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_path ./datasets/msmarco-v1-passage/qrels/msmarco-v1-passage-dev-small.qrels.tsv \
--logging_steps 10 \
--per_device_train_batch_size 64 \
--num_epochs 5 \
--num_negs 2 \
--neg_top 1000 \
--prompt binary
# Llama-3-8B-Instruct
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B-Instruct" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/msmarco-v1-passage/queries/msmarco-v1-passage-dev-small.queries-original.tsv \
--run_path ./datasets/msmarco-v1-passage/runs/msmarco-v1-passage-dev-small.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_path ./datasets/msmarco-v1-passage/qrels/msmarco-v1-passage-dev-small.qrels.tsv \
--logging_steps 10 \
--per_device_train_batch_size 64 \
--num_epochs 5 \
--num_negs 2 \
--neg_top 1000 \
--prompt binary
Note
Fine-tuning LLaMA-7B, Llama-3-8B and Llama-3-8B-Instruct using QLoRA for 5 epochs on the development set of MS MARCO V1 takes about an hour and a half on an NVIDIA A100 GPU.
In the in-context learning setting, we freeze the parameters of the LLMs. We randomly sample human-labeled demonstration examples (each in the format "<query, passage, relevant/irrelevant>") from the development set of MS MARCO V1 (the same set used for fine-tuning in the previous part), and insert the sampled examples into the input of LLaMA-7B, Llama-3-8B and Llama-3-8B-Instruct with their original weights. Specifically, we sample two demonstration examples: one whose passage is labeled as relevant (<query, passage, relevant>) and one with an irrelevant passage (<query, passage, irrelevant>); our preliminary experiments show that two demonstration examples work best, so we stick with this setting.
Note that we sampled and used the following demonstration examples for all few-shot prompting experiments:
Question: avatar the last airbender game
Passage: Avatar: The Last Airbender: The Video Game (known as Avatar: The Legend of Aang in Europe) is a video game based on the animated television series of the same name for Game Boy Advance, Microsoft Windows, Nintendo GameCube, Nintendo DS, PlayStation 2, PlayStation Portable, Wii, and Xbox.
Output: Relevant
Question: avatar the last airbender game
Passage: Fans of Avatar: The Last Airbender have been feverishly looking forward to this weekend: Michael Dante DiMartino and Bryan Konietzko continue the mythology of their Airbender series with Nickelodeon's The Legend of Korra, which premieres Saturday at 11 a.m.
Output: Irrelevant
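Assembling the few-shot input from these two fixed demonstrations can be sketched as below. The template wording is an assumption based on the Question/Passage/Output format shown above; the repository's actual prompt may differ.

```python
# Sketch of two-shot prompt construction: the two fixed demonstration
# examples are prepended to the target query-passage pair, and the model
# is expected to complete the final "Output:" with Relevant/Irrelevant.
# Passages are truncated here for brevity.

DEMONSTRATIONS = [
    ("avatar the last airbender game",
     "Avatar: The Last Airbender: The Video Game (known as Avatar: The "
     "Legend of Aang in Europe) is a video game based on the animated "
     "television series of the same name ...",
     "Relevant"),
    ("avatar the last airbender game",
     "Fans of Avatar: The Last Airbender have been feverishly looking "
     "forward to this weekend ...",
     "Irrelevant"),
]

def build_prompt(query, passage):
    blocks = [f"Question: {q}\nPassage: {p}\nOutput: {o}"
              for q, p, o in DEMONSTRATIONS]
    blocks.append(f"Question: {query}\nPassage: {passage}\nOutput:")
    return "\n\n".join(blocks)

print(build_prompt("what is a lucene index", "Apache Lucene stores ..."))
```

The --num_demon_per_class 1 flag in the commands below corresponds to this setup: one demonstration per class (relevant, irrelevant).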
# LLaMA-7B
python -u judge_relevance.py \
--model_name_or_path ${LLAMA_7B_PATH} \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/msmarco-v1-passage/queries/dl-19-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v1-passage/runs/dl-19-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_path ./datasets/msmarco-v1-passage/qrels/dl-19-passage.qrels.txt \
--query_demon_path ./datasets/msmarco-v1-passage/queries/msmarco-v1-passage-dev-small.queries-original.tsv \
--run_demon_path ./datasets/msmarco-v1-passage/runs/msmarco-v1-passage-dev-small.run-original-bm25-1000.txt \
--index_demon_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_demon_path ./datasets/msmarco-v1-passage/qrels/msmarco-v1-passage-dev-small.qrels.tsv \
--num_demon_per_class 1 \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary
python -u predict_measures.py \
--run_path ./datasets/msmarco-v1-passage/runs/dl-19-passage.run-original-bm25-1000.txt \
--qrels_path ./output/dl-19-passage.original-bm25-1000.original-llama-1-7b-hf-icl-msmarco-v1-passage-dev-small.original-bm25-1000-demon1 \
--output_path ./output/dl-19-passage \
--n 10 100 200 500 1000
# Llama-3-8B
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/msmarco-v1-passage/queries/dl-19-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v1-passage/runs/dl-19-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_path ./datasets/msmarco-v1-passage/qrels/dl-19-passage.qrels.txt \
--query_demon_path ./datasets/msmarco-v1-passage/queries/msmarco-v1-passage-dev-small.queries-original.tsv \
--run_demon_path ./datasets/msmarco-v1-passage/runs/msmarco-v1-passage-dev-small.run-original-bm25-1000.txt \
--index_demon_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_demon_path ./datasets/msmarco-v1-passage/qrels/msmarco-v1-passage-dev-small.qrels.tsv \
--num_demon_per_class 1 \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary
python -u predict_measures.py \
--run_path ./datasets/msmarco-v1-passage/runs/dl-19-passage.run-original-bm25-1000.txt \
--qrels_path ./output/dl-19-passage.original-bm25-1000.original-Meta-Llama-3-8B-icl-msmarco-v1-passage-dev-small.original-bm25-1000-demon1 \
--output_path ./output/dl-19-passage \
--n 10 100 200 500 1000
# Llama-3-8B-Instruct
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B-Instruct" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/msmarco-v1-passage/queries/dl-19-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v1-passage/runs/dl-19-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_path ./datasets/msmarco-v1-passage/qrels/dl-19-passage.qrels.txt \
--query_demon_path ./datasets/msmarco-v1-passage/queries/msmarco-v1-passage-dev-small.queries-original.tsv \
--run_demon_path ./datasets/msmarco-v1-passage/runs/msmarco-v1-passage-dev-small.run-original-bm25-1000.txt \
--index_demon_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_demon_path ./datasets/msmarco-v1-passage/qrels/msmarco-v1-passage-dev-small.qrels.tsv \
--num_demon_per_class 1 \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary
python -u predict_measures.py \
--run_path ./datasets/msmarco-v1-passage/runs/dl-19-passage.run-original-bm25-1000.txt \
--qrels_path ./output/dl-19-passage.original-bm25-1000.original-Meta-Llama-3-8B-Instruct-icl-msmarco-v1-passage-dev-small.original-bm25-1000-demon1 \
--output_path ./output/dl-19-passage \
--n 10 100 200 500 1000
# LLaMA-7B
python -u judge_relevance.py \
--model_name_or_path ${LLAMA_7B_PATH} \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/msmarco-v1-passage/queries/dl-20-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v1-passage/runs/dl-20-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_path ./datasets/msmarco-v1-passage/qrels/dl-20-passage.qrels.txt \
--query_demon_path ./datasets/msmarco-v1-passage/queries/msmarco-v1-passage-dev-small.queries-original.tsv \
--run_demon_path ./datasets/msmarco-v1-passage/runs/msmarco-v1-passage-dev-small.run-original-bm25-1000.txt \
--index_demon_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_demon_path ./datasets/msmarco-v1-passage/qrels/msmarco-v1-passage-dev-small.qrels.tsv \
--num_demon_per_class 1 \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary
python -u predict_measures.py \
--run_path ./datasets/msmarco-v1-passage/runs/dl-20-passage.run-original-bm25-1000.txt \
--qrels_path ./output/dl-20-passage.original-bm25-1000.original-llama-1-7b-hf-icl-msmarco-v1-passage-dev-small.original-bm25-1000-demon1 \
--output_path ./output/dl-20-passage \
--n 10 100 200 500 1000
# Llama-3-8B
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/msmarco-v1-passage/queries/dl-20-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v1-passage/runs/dl-20-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_path ./datasets/msmarco-v1-passage/qrels/dl-20-passage.qrels.txt \
--query_demon_path ./datasets/msmarco-v1-passage/queries/msmarco-v1-passage-dev-small.queries-original.tsv \
--run_demon_path ./datasets/msmarco-v1-passage/runs/msmarco-v1-passage-dev-small.run-original-bm25-1000.txt \
--index_demon_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_demon_path ./datasets/msmarco-v1-passage/qrels/msmarco-v1-passage-dev-small.qrels.tsv \
--num_demon_per_class 1 \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary
python -u predict_measures.py \
--run_path ./datasets/msmarco-v1-passage/runs/dl-20-passage.run-original-bm25-1000.txt \
--qrels_path ./output/dl-20-passage.original-bm25-1000.original-Meta-Llama-3-8B-icl-msmarco-v1-passage-dev-small.original-bm25-1000-demon1 \
--output_path ./output/dl-20-passage \
--n 10 100 200 500 1000
# Llama-3-8B-Instruct
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B-Instruct" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/msmarco-v1-passage/queries/dl-20-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v1-passage/runs/dl-20-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_path ./datasets/msmarco-v1-passage/qrels/dl-20-passage.qrels.txt \
--query_demon_path ./datasets/msmarco-v1-passage/queries/msmarco-v1-passage-dev-small.queries-original.tsv \
--run_demon_path ./datasets/msmarco-v1-passage/runs/msmarco-v1-passage-dev-small.run-original-bm25-1000.txt \
--index_demon_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_demon_path ./datasets/msmarco-v1-passage/qrels/msmarco-v1-passage-dev-small.qrels.tsv \
--num_demon_per_class 1 \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary
python -u predict_measures.py \
--run_path ./datasets/msmarco-v1-passage/runs/dl-20-passage.run-original-bm25-1000.txt \
--qrels_path ./output/dl-20-passage.original-bm25-1000.original-Meta-Llama-3-8B-Instruct-icl-msmarco-v1-passage-dev-small.original-bm25-1000-demon1 \
--output_path ./output/dl-20-passage \
--n 10 100 200 500 1000
# LLaMA-7B
python -u judge_relevance.py \
--model_name_or_path ${LLAMA_7B_PATH} \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/msmarco-v2-passage/queries/dl-21-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v2-passage/runs/dl-21-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v2-passage/lucene-index.msmarco-v2-passage-full.20220808.4d6d2a \
--qrels_path ./datasets/msmarco-v2-passage/qrels/dl-21-passage.qrels.txt \
--query_demon_path ./datasets/msmarco-v1-passage/queries/msmarco-v1-passage-dev-small.queries-original.tsv \
--run_demon_path ./datasets/msmarco-v1-passage/runs/msmarco-v1-passage-dev-small.run-original-bm25-1000.txt \
--index_demon_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_demon_path ./datasets/msmarco-v1-passage/qrels/msmarco-v1-passage-dev-small.qrels.tsv \
--num_demon_per_class 1 \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary
python -u predict_measures.py \
--run_path ./datasets/msmarco-v2-passage/runs/dl-21-passage.run-original-bm25-1000.txt \
--qrels_path ./output/dl-21-passage.original-bm25-1000.original-llama-1-7b-hf-icl-msmarco-v1-passage-dev-small.original-bm25-1000-demon1 \
--output_path ./output/dl-21-passage \
--n 10 100 200 500 1000
# Llama-3-8B
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/msmarco-v2-passage/queries/dl-21-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v2-passage/runs/dl-21-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v2-passage/lucene-index.msmarco-v2-passage-full.20220808.4d6d2a \
--qrels_path ./datasets/msmarco-v2-passage/qrels/dl-21-passage.qrels.txt \
--query_demon_path ./datasets/msmarco-v1-passage/queries/msmarco-v1-passage-dev-small.queries-original.tsv \
--run_demon_path ./datasets/msmarco-v1-passage/runs/msmarco-v1-passage-dev-small.run-original-bm25-1000.txt \
--index_demon_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_demon_path ./datasets/msmarco-v1-passage/qrels/msmarco-v1-passage-dev-small.qrels.tsv \
--num_demon_per_class 1 \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary
python -u predict_measures.py \
--run_path ./datasets/msmarco-v2-passage/runs/dl-21-passage.run-original-bm25-1000.txt \
--qrels_path ./output/dl-21-passage.original-bm25-1000.original-Meta-Llama-3-8B-icl-msmarco-v1-passage-dev-small.original-bm25-1000-demon1 \
--output_path ./output/dl-21-passage \
--n 10 100 200 500 1000
# Llama-3-8B-Instruct
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B-Instruct" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/msmarco-v2-passage/queries/dl-21-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v2-passage/runs/dl-21-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v2-passage/lucene-index.msmarco-v2-passage-full.20220808.4d6d2a \
--qrels_path ./datasets/msmarco-v2-passage/qrels/dl-21-passage.qrels.txt \
--query_demon_path ./datasets/msmarco-v1-passage/queries/msmarco-v1-passage-dev-small.queries-original.tsv \
--run_demon_path ./datasets/msmarco-v1-passage/runs/msmarco-v1-passage-dev-small.run-original-bm25-1000.txt \
--index_demon_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_demon_path ./datasets/msmarco-v1-passage/qrels/msmarco-v1-passage-dev-small.qrels.tsv \
--num_demon_per_class 1 \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary
python -u predict_measures.py \
--run_path ./datasets/msmarco-v2-passage/runs/dl-21-passage.run-original-bm25-1000.txt \
--qrels_path ./output/dl-21-passage.original-bm25-1000.original-Meta-Llama-3-8B-Instruct-icl-msmarco-v1-passage-dev-small.original-bm25-1000-demon1 \
--output_path ./output/dl-21-passage \
--n 10 100 200 500 1000
# LLaMA-7B
python -u judge_relevance.py \
--model_name_or_path ${LLAMA_7B_PATH} \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/msmarco-v2-passage/queries/dl-22-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v2-passage/runs/dl-22-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v2-passage/lucene-index.msmarco-v2-passage-full.20220808.4d6d2a \
--qrels_path ./datasets/msmarco-v2-passage/qrels/dl-22-passage.qrels-withDupes.txt \
--query_demon_path ./datasets/msmarco-v1-passage/queries/msmarco-v1-passage-dev-small.queries-original.tsv \
--run_demon_path ./datasets/msmarco-v1-passage/runs/msmarco-v1-passage-dev-small.run-original-bm25-1000.txt \
--index_demon_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_demon_path ./datasets/msmarco-v1-passage/qrels/msmarco-v1-passage-dev-small.qrels.tsv \
--num_demon_per_class 1 \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary
python -u predict_measures.py \
--run_path ./datasets/msmarco-v2-passage/runs/dl-22-passage.run-original-bm25-1000.txt \
--qrels_path ./output/dl-22-passage.original-bm25-1000.original-llama-1-7b-hf-icl-msmarco-v1-passage-dev-small.original-bm25-1000-demon1 \
--output_path ./output/dl-22-passage \
--n 10 100 200 500 1000
# Llama-3-8B
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/msmarco-v2-passage/queries/dl-22-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v2-passage/runs/dl-22-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v2-passage/lucene-index.msmarco-v2-passage-full.20220808.4d6d2a \
--qrels_path ./datasets/msmarco-v2-passage/qrels/dl-22-passage.qrels-withDupes.txt \
--query_demon_path ./datasets/msmarco-v1-passage/queries/msmarco-v1-passage-dev-small.queries-original.tsv \
--run_demon_path ./datasets/msmarco-v1-passage/runs/msmarco-v1-passage-dev-small.run-original-bm25-1000.txt \
--index_demon_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_demon_path ./datasets/msmarco-v1-passage/qrels/msmarco-v1-passage-dev-small.qrels.tsv \
--num_demon_per_class 1 \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary
python -u predict_measures.py \
--run_path ./datasets/msmarco-v2-passage/runs/dl-22-passage.run-original-bm25-1000.txt \
--qrels_path ./output/dl-22-passage.original-bm25-1000.original-Meta-Llama-3-8B-icl-msmarco-v1-passage-dev-small.original-bm25-1000-demon1 \
--output_path ./output/dl-22-passage \
--n 10 100 200 500 1000
# Llama-3-8B-Instruct
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B-Instruct" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/msmarco-v2-passage/queries/dl-22-passage.queries-original.tsv \
--run_path ./datasets/msmarco-v2-passage/runs/dl-22-passage.run-original-bm25-1000.txt \
--index_path ./datasets/msmarco-v2-passage/lucene-index.msmarco-v2-passage-full.20220808.4d6d2a \
--qrels_path ./datasets/msmarco-v2-passage/qrels/dl-22-passage.qrels-withDupes.txt \
--query_demon_path ./datasets/msmarco-v1-passage/queries/msmarco-v1-passage-dev-small.queries-original.tsv \
--run_demon_path ./datasets/msmarco-v1-passage/runs/msmarco-v1-passage-dev-small.run-original-bm25-1000.txt \
--index_demon_path ./datasets/msmarco-v1-passage/lucene-index.msmarco-v1-passage-full.20221004.252b5e \
--qrels_demon_path ./datasets/msmarco-v1-passage/qrels/msmarco-v1-passage-dev-small.qrels.tsv \
--num_demon_per_class 1 \
--output_dir ./output/ \
--batch_size 32 \
--k 1000 \
--infer --prompt binary
python -u predict_measures.py \
--run_path ./datasets/msmarco-v2-passage/runs/dl-22-passage.run-original-bm25-1000.txt \
--qrels_path ./output/dl-22-passage.original-bm25-1000.original-Meta-Llama-3-8B-Instruct-icl-msmarco-v1-passage-dev-small.original-bm25-1000-demon1 \
--output_path ./output/dl-22-passage \
--n 10 100 200 500 1000
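QPP-GenRE treats the generated relevance judgments as pseudo labels when computing IR evaluation measures. As an illustration only (not the repository's code), a minimal RR@10 computation over a ranked list and binary judgments might look like this:

```python
def rr_at_k(ranked_docids, qrels, k=10):
    """Reciprocal rank of the first relevant document within the top k.

    qrels maps docid -> binary relevance label (1 = relevant).
    """
    for rank, docid in enumerate(ranked_docids[:k], start=1):
        if qrels.get(docid, 0) > 0:
            return 1.0 / rank
    return 0.0

# The first relevant document sits at rank 2, so RR@10 = 0.5.
print(rr_at_k(["d7", "d3", "d9"], {"d3": 1}))
```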
We provide detailed commands to evaluate the QPP effectiveness of QPP-GenRE for predicting the performance of BM25 or ANCE in terms of RR@10 or nDCG@10. QPP effectiveness is measured by the Pearson and Kendall correlation coefficients between the actual performance of a ranker over a set of queries and the performance predicted for those queries.
Note
TREC-DL 19, 20, 21 and 22 provide multi-graded relevance judgments per query, whereas an LLM in QPP-GenRE generates only binary relevance judgments, because the training set of QPP-GenRE contains only binary labels. For the actual values of RR@10, we treat relevance grades ≥ 2 as positive. The actual values of nDCG@10 are computed from the human-labeled multi-graded judgments, while the nDCG@10 values predicted by QPP-GenRE are computed from the binary judgments generated by an LLM. Although QPP-GenRE thus approximates multi-graded nDCG@10 with binary judgments, it still achieves promising QPP effectiveness in terms of Pearson and Kendall correlation coefficients.
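For intuition, the two correlation coefficients can be sketched in plain Python; the per-query quality values below are hypothetical and only illustrate the computation:

```python
def pearson(x, y):
    """Pearson correlation between two equally long score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def kendall(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Ties contribute 0; assumes at least two queries.
    """
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            d = (x[i] - x[j]) * (y[i] - y[j])
            s += (d > 0) - (d < 0)
    return s / (n * (n - 1) / 2)

actual    = [0.81, 0.42, 0.67, 0.10, 0.55]  # e.g. true nDCG@10 per query
predicted = [0.75, 0.40, 0.70, 0.15, 0.50]  # e.g. nDCG@10 from generated qrels
# predicted preserves the ordering of actual, so Kendall's tau is 1.0.
print(pearson(actual, predicted), kendall(actual, predicted))
```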
Evaluate QPP effectiveness of QPP-GenRE for predicting the performance of BM25 in terms of RR@10 and nDCG@10
The following commands will produce files recording QPP results in ./output/dl-19-passage/, ./output/dl-20-passage/, ./output/dl-21-passage/, and ./output/dl-22-passage/, respectively:
python -u evaluate_qpp.py \
--pattern './output/dl-19-passage/dl-19-passage.original-bm25-1000*' \
--ap_path ./datasets/msmarco-v1-passage/ap/dl-19-passage.ap-original-bm25-1000.json \
--target_metrics mrr@10 ndcg@10
python -u evaluate_qpp.py \
--pattern './output/dl-20-passage/dl-20-passage.original-bm25-1000*' \
--ap_path ./datasets/msmarco-v1-passage/ap/dl-20-passage.ap-original-bm25-1000.json \
--target_metrics mrr@10 ndcg@10
python -u evaluate_qpp.py \
--pattern './output/dl-21-passage/dl-21-passage.original-bm25-1000*' \
--ap_path ./datasets/msmarco-v2-passage/ap/dl-21-passage.ap-original-bm25-1000.json \
--target_metrics mrr@10 ndcg@10
python -u evaluate_qpp.py \
--pattern './output/dl-22-passage/dl-22-passage.original-bm25-1000*' \
--ap_path ./datasets/msmarco-v2-passage/ap/dl-22-passage.ap-original-bm25-1000.json \
--target_metrics mrr@10 ndcg@10
Evaluate QPP effectiveness of QPP-GenRE for predicting the performance of ANCE in terms of RR@10 and nDCG@10
The following commands will produce files recording QPP results in ./output/dl-19-passage/ and ./output/dl-20-passage/, respectively:
python -u evaluate_qpp.py \
--pattern './output/dl-19-passage/dl-19-passage.original-ance-msmarco-v1-passage-1000*' \
--ap_path ./datasets/msmarco-v1-passage/ap/dl-19-passage.ap-original-ance-msmarco-v1-passage-1000.json \
--target_metrics mrr@10 ndcg@10
python -u evaluate_qpp.py \
--pattern './output/dl-20-passage/dl-20-passage.original-ance-msmarco-v1-passage-1000*' \
--ap_path ./datasets/msmarco-v1-passage/ap/dl-20-passage.ap-original-ance-msmarco-v1-passage-1000.json \
--target_metrics mrr@10 ndcg@10
We calculate sMARE values for our method and all baselines, using the code released by the authors of sMARE.
The following tables show that our method obtains the lowest sMARE value (lower is better) on each dataset for predicting the performance of either BM25 or ANCE in terms of RR@10 and nDCG@10.
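The exact tie-handling lives in the authors' released code, but the core of sMARE can be sketched as the mean absolute difference between each query's rank under the actual and the predicted performance, scaled by the number of queries:

```python
def smare(actual, predicted):
    """Scaled Mean Absolute Rank Error over per-query quality scores.

    Sketch only: ranks are induced by sorting in descending order,
    and ties are broken by position (no fractional tie handling).
    """
    n = len(actual)

    def ranks(values):
        order = sorted(range(n), key=lambda i: values[i], reverse=True)
        r = [0] * n
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    ra, rp = ranks(actual), ranks(predicted)
    # Mean absolute rank error, scaled once more by the query count.
    return sum(abs(a - p) for a, p in zip(ra, rp)) / (n * n)

# Identical orderings give 0; a fully reversed ordering gives a large error.
print(smare([3, 2, 1], [3, 2, 1]), smare([3, 2, 1], [1, 2, 3]))
```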
Table: Predicting the performance of BM25 in terms of RR@10 on TREC-DL 19.
Method | sMARE |
---|---|
Clarity | 0.352 |
WIG | 0.291 |
NQC | 0.313 |
σ_max | 0.296 |
n(σ_x%) | 0.286 |
SMV | 0.313 |
UEF(NQC) | 0.290 |
RLS(NQC) | 0.318 |
QPP-PRP | 0.297 |
NQAQPP | 0.315 |
BERTQPP | 0.318 |
qppBERT-PL | 0.275 |
M-QPPF | 0.283 |
QPP-GenRE (ours) | 0.196 |
Table: Predicting the performance of BM25 in terms of RR@10 on TREC-DL 20.
Method | sMARE |
---|---|
Clarity | 0.320 |
WIG | 0.245 |
NQC | 0.249 |
σ_max | 0.255 |
n(σ_x%) | 0.279 |
SMV | 0.251 |
UEF(NQC) | 0.261 |
RLS(NQC) | 0.294 |
QPP-PRP | 0.287 |
NQAQPP | 0.315 |
BERTQPP | 0.287 |
qppBERT-PL | 0.302 |
M-QPPF | 0.250 |
QPP-GenRE (ours) | 0.157 |
Table: Predicting the performance of BM25 in terms of RR@10 on TREC-DL 21.
Method | sMARE |
---|---|
Clarity | 0.285 |
WIG | 0.276 |
NQC | 0.276 |
σ_max | 0.286 |
n(σ_x%) | 0.288 |
SMV | 0.273 |
UEF(NQC) | 0.315 |
RLS(NQC) | 0.272 |
QPP-PRP | 0.311 |
NQAQPP | 0.285 |
BERTQPP | 0.305 |
qppBERT-PL | 0.269 |
M-QPPF | 0.267 |
QPP-GenRE (ours) | 0.237 |
Table: Predicting the performance of BM25 in terms of RR@10 on TREC-DL 22.
Method | sMARE |
---|---|
Clarity | 0.317 |
WIG | 0.315 |
NQC | 0.330 |
σ_max | 0.322 |
n(σ_x%) | 0.309 |
SMV | 0.322 |
UEF(NQC) | 0.325 |
RLS(NQC) | 0.316 |
QPP-PRP | 0.316 |
NQAQPP | 0.280 |
BERTQPP | 0.306 |
qppBERT-PL | 0.295 |
M-QPPF | 0.289 |
QPP-GenRE (ours) | 0.249 |
Table: Predicting the performance of ANCE in terms of RR@10 on TREC-DL 19.
Method | sMARE |
---|---|
Clarity | 0.335 |
WIG | 0.307 |
NQC | 0.307 |
σ_max | 0.281 |
n(σ_x%) | 0.287 |
SMV | 0.278 |
UEF(NQC) | 0.266 |
RLS(NQC) | 0.269 |
QPP-PRP | 0.296 |
Dense-QPP | 0.317 |
NQAQPP | 0.316 |
BERTQPP | 0.286 |
qppBERT-PL | 0.274 |
M-QPPF | 0.291 |
QPP-GenRE (ours) | 0.119 |
Table: Predicting the performance of ANCE in terms of RR@10 on TREC-DL 20.
Method | sMARE |
---|---|
Clarity | 0.325 |
WIG | 0.333 |
NQC | 0.302 |
σ_max | 0.306 |
n(σ_x%) | 0.339 |
SMV | 0.294 |
UEF(NQC) | 0.335 |
RLS(NQC) | 0.302 |
QPP-PRP | 0.307 |
Dense-QPP | 0.292 |
NQAQPP | 0.368 |
BERTQPP | 0.365 |
qppBERT-PL | 0.359 |
M-QPPF | 0.321 |
QPP-GenRE (ours) | 0.228 |
Table: Predicting the performance of BM25 in terms of nDCG@10 on TREC-DL 19.
Method | sMARE |
---|---|
Clarity | 0.309 |
WIG | 0.239 |
NQC | 0.239 |
σ_max | 0.236 |
n(σ_x%) | 0.238 |
SMV | 0.241 |
UEF(NQC) | 0.236 |
RLS(NQC) | 0.233 |
QPP-PRP | 0.287 |
NQAQPP | 0.295 |
BERTQPP | 0.273 |
qppBERT-PL | 0.296 |
M-QPPF | 0.264 |
QPP-GenRE (ours) | 0.198 |
Table: Predicting the performance of BM25 in terms of nDCG@10 on TREC-DL 20.
Method | sMARE |
---|---|
Clarity | 0.251 |
WIG | 0.213 |
NQC | 0.215 |
σ_max | 0.211 |
n(σ_x%) | 0.206 |
SMV | 0.218 |
UEF(NQC) | 0.227 |
RLS(NQC) | 0.223 |
QPP-PRP | 0.305 |
NQAQPP | 0.272 |
BERTQPP | 0.248 |
qppBERT-PL | 0.274 |
M-QPPF | 0.243 |
QPP-GenRE (ours) | 0.177 |
Table: Predicting the performance of BM25 in terms of nDCG@10 on TREC-DL 21.
Method | sMARE |
---|---|
Clarity | 0.307 |
WIG | 0.252 |
NQC | 0.266 |
σ_max | 0.258 |
n(σ_x%) | 0.264 |
SMV | 0.271 |
UEF(NQC) | 0.262 |
RLS(NQC) | 0.286 |
QPP-PRP | 0.341 |
NQAQPP | 0.266 |
BERTQPP | 0.261 |
qppBERT-PL | 0.279 |
M-QPPF | 0.259 |
QPP-GenRE (ours) | 0.201 |
Table: Predicting the performance of BM25 in terms of nDCG@10 on TREC-DL 22.
Method | sMARE |
---|---|
Clarity | 0.307 |
WIG | 0.265 |
NQC | 0.282 |
σ_max | 0.283 |
n(σ_x%) | 0.264 |
SMV | 0.276 |
UEF(NQC) | 0.282 |
RLS(NQC) | 0.284 |
QPP-PRP | 0.339 |
NQAQPP | 0.283 |
BERTQPP | 0.273 |
qppBERT-PL | 0.289 |
M-QPPF | 0.283 |
QPP-GenRE (ours) | 0.249 |
Table: Predicting the performance of ANCE in terms of nDCG@10 on TREC-DL 19.
Method | sMARE |
---|---|
Clarity | 0.366 |
WIG | 0.213 |
NQC | 0.221 |
σ_max | 0.223 |
n(σ_x%) | 0.239 |
SMV | 0.228 |
UEF(NQC) | 0.221 |
RLS(NQC) | 0.224 |
QPP-PRP | 0.309 |
Dense-QPP | 0.212 |
NQAQPP | 0.329 |
BERTQPP | 0.309 |
qppBERT-PL | 0.343 |
M-QPPF | 0.292 |
QPP-GenRE (ours) | 0.186 |
Table: Predicting the performance of ANCE in terms of nDCG@10 on TREC-DL 20.
Method | sMARE |
---|---|
Clarity | 0.345 |
WIG | 0.297 |
NQC | 0.254 |
σ_max | 0.250 |
n(σ_x%) | 0.305 |
SMV | 0.250 |
UEF(NQC) | 0.250 |
RLS(NQC) | 0.254 |
QPP-PRP | 0.294 |
Dense-QPP | 0.242 |
NQAQPP | 0.304 |
BERTQPP | 0.304 |
qppBERT-PL | 0.324 |
M-QPPF | 0.274 |
QPP-GenRE (ours) | 0.228 |
Please first download dataset.zip (containing queries, qrels files and corpus) from here, and then unzip it in the current directory.
Then, please run the following command to preprocess the dataset:
python -u preprocessing.py \
--raw_data_path ./datasets/ikat/raw/splitted_data.txt
You can directly fetch the original weights of Llama-3-8B and Llama-3-8B-Instruct. Please set the following environment variables:
export TOKEN={your token to use as HTTP bearer authorization for remote files}
export CACHE_DIR={your cache path that stores the weights of Llama 3}
For the reproducibility of the results reported in the paper, please download the checkpoints of our fine-tuned models.
After downloading, please unzip them in a new directory ./checkpoint/.
# inference on the test split
python -u judge_relevance.py \
--model_name_or_path ${LLAMA_7B_PATH} \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--checkpoint_name msmarco-v1-passage-dev-small.original-bm25-1000.original-llama-1-7b-hf-neg1-top1000/checkpoint-1860 \
--query_path ./datasets/ikat/queries/ikat.queries-manual \
--ptkb_path ./datasets/ikat/queries/ikat.ptkb \
--index_path ./datasets/ikat/corpus/ikat.corpus \
--qrels_path ./datasets/ikat/qrels/ikat-test.qrels \
--output_dir ./output \
--batch_size 16 \
--infer --rj --prompt binary
# inference on the whole set
python -u judge_relevance.py \
--model_name_or_path ${LLAMA_7B_PATH} \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--checkpoint_name msmarco-v1-passage-dev-small.original-bm25-1000.original-llama-1-7b-hf-neg1-top1000/checkpoint-1860 \
--query_path ./datasets/ikat/queries/ikat.queries-manual \
--ptkb_path ./datasets/ikat/queries/ikat.ptkb \
--index_path ./datasets/ikat/corpus/ikat.corpus \
--qrels_path ./datasets/ikat/qrels/ikat.qrels \
--output_dir ./output \
--batch_size 16 \
--infer --rj --prompt binary
# inference on the test split (Llama-3-8B)
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/ikat/queries/ikat.queries-manual \
--ptkb_path ./datasets/ikat/queries/ikat.ptkb \
--index_path ./datasets/ikat/corpus/ikat.corpus \
--qrels_path ./datasets/ikat/qrels/ikat-test.qrels \
--output_dir ./output/ \
--prompt ikat \
--infer --rj
# inference on the test split (Llama-3-8B-Instruct)
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B-Instruct" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/ikat/queries/ikat.queries-manual \
--ptkb_path ./datasets/ikat/queries/ikat.ptkb \
--index_path ./datasets/ikat/corpus/ikat.corpus \
--qrels_path ./datasets/ikat/qrels/ikat-test.qrels \
--output_dir ./output/ \
--prompt ikat \
--infer --rj
# inference on the whole set (Llama-3-8B)
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/ikat/queries/ikat.queries-manual \
--ptkb_path ./datasets/ikat/queries/ikat.ptkb \
--index_path ./datasets/ikat/corpus/ikat.corpus \
--qrels_path ./datasets/ikat/qrels/ikat.qrels \
--output_dir ./output/ \
--prompt ikat \
--infer --rj
# inference on the whole set (Llama-3-8B-Instruct)
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B-Instruct" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/ikat/queries/ikat.queries-manual \
--ptkb_path ./datasets/ikat/queries/ikat.ptkb \
--index_path ./datasets/ikat/corpus/ikat.corpus \
--qrels_path ./datasets/ikat/qrels/ikat.qrels \
--output_dir ./output/ \
--prompt ikat \
--infer --rj
# inference on the test split (Llama-3-8B)
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--checkpoint_name ikat-train.Meta-Llama-3-8B/checkpoint-3374 \
--query_path ./datasets/ikat/queries/ikat.queries-manual \
--ptkb_path ./datasets/ikat/queries/ikat.ptkb \
--index_path ./datasets/ikat/corpus/ikat.corpus \
--qrels_path ./datasets/ikat/qrels/ikat-test.qrels \
--output_dir ./output/ \
--batch_size 32 \
--prompt ikat \
--infer --rj
# inference on the test split (Llama-3-8B-Instruct)
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B-Instruct" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--checkpoint_name ikat-train.Meta-Llama-3-8B-Instruct/checkpoint-3374 \
--query_path ./datasets/ikat/queries/ikat.queries-manual \
--ptkb_path ./datasets/ikat/queries/ikat.ptkb \
--index_path ./datasets/ikat/corpus/ikat.corpus \
--qrels_path ./datasets/ikat/qrels/ikat-test.qrels \
--output_dir ./output/ \
--batch_size 32 \
--prompt ikat \
--infer --rj
# fine-tune Llama-3-8B on the training split
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/ikat/queries/ikat.queries-manual \
--ptkb_path ./datasets/ikat/queries/ikat.ptkb \
--index_path ./datasets/ikat/corpus/ikat.corpus \
--qrels_path ./datasets/ikat/qrels/ikat-train.qrels \
--logging_steps 10 \
--per_device_train_batch_size 64 \
--num_epochs 10 \
--prompt ikat \
--rj
# fine-tune Llama-3-8B-Instruct on the training split
python -u judge_relevance.py \
--model_name_or_path "meta-llama/Meta-Llama-3-8B-Instruct" \
--token ${TOKEN} \
--cache_dir ${CACHE_DIR} \
--checkpoint_path ./checkpoint/ \
--query_path ./datasets/ikat/queries/ikat.queries-manual \
--ptkb_path ./datasets/ikat/queries/ikat.ptkb \
--index_path ./datasets/ikat/corpus/ikat.corpus \
--qrels_path ./datasets/ikat/qrels/ikat-train.qrels \
--logging_steps 10 \
--per_device_train_batch_size 64 \
--num_epochs 10 \
--prompt ikat \
--rj
# evaluate fine-tuned LLaMA-7B on the test split
python -u evaluate_rj.py \
--qrels_true_dir ./datasets/ikat/qrels/ikat-test.qrels \
--qrels_pred_dir ./output/ikat-test.manual-llama-1-7b-hf-ckpt-msmarco-v1-passage-dev-small.original-bm25-1000.original-llama-1-7b-hf-neg1-top1000-checkpoint-1860 \
--binary --pre_is_binary
# evaluate fine-tuned LLaMA-7B on the whole set
python -u evaluate_rj.py \
--qrels_true_dir ./datasets/ikat/qrels/ikat.qrels \
--qrels_pred_dir ./output/ikat.manual-llama-1-7b-hf-ckpt-msmarco-v1-passage-dev-small.original-bm25-1000.original-llama-1-7b-hf-neg1-top1000-checkpoint-1860 \
--binary --pre_is_binary
# evaluate Llama-3-8B on the test split
python -u evaluate_rj.py \
--qrels_true_dir ./datasets/ikat/qrels/ikat-test.qrels \
--qrels_pred_dir ./output/ikat-test.Meta-Llama-3-8B
# evaluate Llama-3-8B-Instruct on the test split
python -u evaluate_rj.py \
--qrels_true_dir ./datasets/ikat/qrels/ikat-test.qrels \
--qrels_pred_dir ./output/ikat-test.Meta-Llama-3-8B-Instruct
# evaluate Llama-3-8B on the whole set
python -u evaluate_rj.py \
--qrels_true_dir ./datasets/ikat/qrels/ikat.qrels \
--qrels_pred_dir ./output/ikat.Meta-Llama-3-8B
# evaluate Llama-3-8B-Instruct on the whole set
python -u evaluate_rj.py \
--qrels_true_dir ./datasets/ikat/qrels/ikat.qrels \
--qrels_pred_dir ./output/ikat.Meta-Llama-3-8B-Instruct
# evaluate fine-tuned Llama-3-8B on the test split
python -u evaluate_rj.py \
--qrels_true_dir ./datasets/ikat/qrels/ikat-test.qrels \
--qrels_pred_dir ./output/ikat-test.manual-Meta-Llama-3-8B-ckpt-ikat-train.Meta-Llama-3-8B-checkpoint-3374
# evaluate fine-tuned Llama-3-8B-Instruct on the test split
python -u evaluate_rj.py \
--qrels_true_dir ./datasets/ikat/qrels/ikat-test.qrels \
--qrels_pred_dir ./output/ikat-test.manual-Meta-Llama-3-8B-Instruct-ckpt-ikat-train.Meta-Llama-3-8B-Instruct-checkpoint-3374
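evaluate_rj.py compares the predicted qrels against the human qrels; the exact statistics it reports are defined in the script itself, but chance-corrected agreement on binary judgments, e.g. Cohen's kappa, can be sketched as:

```python
def cohen_kappa(true, pred):
    """Cohen's kappa for two lists of binary (0/1) judgments.

    Sketch only; assumes expected agreement pe < 1 (i.e. the labels
    are not all identical in both lists).
    """
    n = len(true)
    po = sum(t == p for t, p in zip(true, pred)) / n  # observed agreement
    p1_true, p1_pred = sum(true) / n, sum(pred) / n
    # Expected agreement under independent labeling.
    pe = p1_true * p1_pred + (1 - p1_true) * (1 - p1_pred)
    return (po - pe) / (1 - pe)

# Perfect agreement yields kappa = 1.0.
print(cohen_kappa([1, 1, 0, 0], [1, 1, 0, 0]))
```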