BERT-QPP: Contextualized Pre-trained Transformers for Query Performance Prediction

In this paper, we adopt contextual embeddings for the task of query performance prediction (QPP). The fine-tuned contextual representations can estimate the performance of a query based on the association between the representation of the query and the retrieved documents. We compare the performance of our approach with the state of the art on the MS MARCO passage retrieval corpus and its three associated query sets: (1) MS MARCO development set, (2) TREC DL 2019, and (3) TREC DL 2020. We show that our approach not only achieves significantly improved prediction performance compared to all the state-of-the-art methods but also, unlike past neural predictors, has significantly lower latency, making it possible to use in practice.

We adopt two architectures, namely a cross-encoder network and a bi-encoder network, to address the QPP task.

To replicate our results with BERT-QPP_cross and BERT-QPP_bi on the MSMARCO passage collection:

Clone this repository.
Install the required packages listed in requirements.txt on Python 3.7+.
Download the MSMARCO collection file collection.tsv and store it in the collection directory.
If you want to predict the performance of the BM25 retrieval method on MSMARCO, skip this step. Otherwise, to evaluate any other retrieval method, prepare run files similar to bm25_first_docs_train.tsv and bm25_first_docs_dev.tsv, which contain the first retrieved document for each query in the MSMARCO train and dev sets.
- The run file of your desired retrieval approach should have the following format, with one query per line: QID<\t>DOCID<\t>1.
- Then, modify the run_file variable in create_train_pkl_file.py and create_test_pkl_files.py so that they point to your run files on the train and dev sets of MSMARCO.
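The run file format described above can be produced with a few lines of Python. This is only a minimal sketch: the function name `write_first_doc_run` and the toy ranking are hypothetical and not part of this repository, and it assumes your retriever gives you a ranked list of document IDs per query.

```python
import csv
import io


def write_first_doc_run(ranking, out_file):
    """Write one line per query: QID<TAB>DOCID<TAB>1 (top-ranked document only)."""
    writer = csv.writer(out_file, delimiter="\t", lineterminator="\n")
    for qid, doc_ids in ranking.items():
        if doc_ids:  # skip queries for which nothing was retrieved
            writer.writerow([qid, doc_ids[0], 1])


# Hypothetical ranked lists from some retriever (all IDs are made up).
ranking = {"1001": ["D7", "D3"], "1002": ["D9"], "1003": []}

buffer = io.StringIO()  # stands in for an opened .tsv file
write_first_doc_run(ranking, buffer)
run_text = buffer.getvalue()
print(run_text)
```

In practice you would pass a file opened with `open(path, "w", newline="")` instead of the in-memory buffer.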
To train BERT-QPP_cross, we require the query, the first retrieved document, and the query's performance. For this purpose, create_train_pkl_file.py creates a dictionary with the following attributes:

    train_dic[qid]["qtext"] = query_text
    train_dic[qid]["performance"] = query_performance_value
    train_dic[qid]["doc_text"] = document_text

You can train the model on your desired metric by creating the associated train pkl file. Here, we use map@20. Run create_train_pkl_file.py to save a dictionary including the query and document text as well as their associated performance. As a result, train_map.pkl will be saved in the pklfiles directory.
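As a rough illustration of what such a pkl file contains, the sketch below builds a dictionary with the three attributes listed above and round-trips it through pickle. It is not the code of create_train_pkl_file.py: the helper `build_train_dic`, the toy inputs, and the temporary output path are assumptions made for the example.

```python
import os
import pickle
import tempfile


def build_train_dic(run_rows, queries, collection, performance):
    """Join first retrieved document, query text, and performance per query."""
    train_dic = {}
    for qid, doc_id in run_rows:
        # Keep only queries for which all three pieces are available.
        if qid in queries and doc_id in collection and qid in performance:
            train_dic[qid] = {
                "qtext": queries[qid],
                "performance": performance[qid],
                "doc_text": collection[doc_id],
            }
    return train_dic


# Toy stand-ins for the run file, query file, collection, and metric file.
run_rows = [("1", "D1"), ("2", "D2"), ("3", "D404")]
queries = {"1": "what is qpp", "2": "bm25 formula", "3": "unused query"}
collection = {"D1": "QPP estimates retrieval quality.", "D2": "BM25 is a ranking function."}
performance = {"1": 0.75, "2": 0.10, "3": 0.0}

train_dic = build_train_dic(run_rows, queries, collection, performance)

# Save and reload, as a training script would.
path = os.path.join(tempfile.mkdtemp(), "train_map.pkl")
with open(path, "wb") as f:
    pickle.dump(train_dic, f)
with open(path, "rb") as f:
    loaded = pickle.load(f)
```

Query "3" is dropped here because its document ID is missing from the toy collection; how the actual script treats such cases is not specified in this README.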

Run create_test_pkl_files.py to save a dictionary including the query and document text on the MSMARCO development set. As a result, test_dev_map.pkl will be saved in the pklfiles directory.

BERT-QPP_cross

Run train_CE.py to learn the map@20 of BM25 retrieval on the MSMARCO train set. Alternatively, you can train with your desired metric by creating the associated train pkl file. On a single 24GB RTX3090 GPU, training took less than 2 hours. You may also change epoch_num, batch_size, and the initial pre-trained model in this file. We used bert-base-uncased in this experiment. The trained model will be saved in the models directory.
If you do not want to train the model, you can download our BERT-QPP_cross model trained on bert-base-uncased from here.
Add the trained model you want to test in test_CE.py and run test_CE.py.
The results will be saved in the results directory in the following format: QID<\t>Predicted_QPP_value
To evaluate the results, you can calculate the correlation between the actual performance of each query and predicted QPP value.
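The correlation step can be sketched as follows. This is one possible way to do it, not a script shipped with the repository: the function names and the toy values are hypothetical, and Pearson's r and Kendall's tau-a are written out by hand so the example has no dependencies. `scipy.stats.pearsonr` and `scipy.stats.kendalltau` offer library versions; note that `kendalltau` computes tau-b by default, which differs from tau-a when there are ties.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson linear correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1  # tied pairs count as neither
    n = len(xs)
    return (concordant - discordant) / (n * (n - 1) / 2)


def parse_tsv(text):
    """Parse 'QID<TAB>value' lines into a {qid: float} dictionary."""
    rows = (line.split("\t") for line in text.strip().splitlines())
    return {qid: float(value) for qid, value in rows}


# Toy predicted QPP values (results-file format) and actual per-query performance.
predicted = parse_tsv("q1\t0.9\nq2\t0.4\nq3\t0.1\nq4\t0.5\n")
actual = {"q1": 1.0, "q2": 0.5, "q3": 0.0, "q4": 0.25}

qids = sorted(predicted.keys() & actual.keys())  # evaluate on shared queries only
pred = [predicted[q] for q in qids]
act = [actual[q] for q in qids]

pearson_r = pearson(pred, act)
tau = kendall_tau_a(pred, act)
print(f"Pearson r = {pearson_r:.3f}, Kendall tau-a = {tau:.3f}")
```

The same procedure applies to the output of either model, since both write results in the same QID<\t>Predicted_QPP_value format.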

BERT-QPP_bi

Run train_bi.py to learn the map@20 of BM25 retrieval on the MSMARCO train set. On a single 24GB RTX3090 GPU, training took about 1 hour. You may also change epoch_num, batch_size, and the initial pre-trained model in this file. We used bert-base-uncased in this experiment. The trained model will be saved in the models directory.
If you do not want to train the model, you can download our BERT-QPP_bi model trained on bert-base-uncased from here.
Add the trained model you want to test in test_bi.py and run test_bi.py.
The results will be saved in the results directory in the following format: QID<\t>Predicted_QPP_value
To evaluate the results, you can calculate the correlation between the actual performance of each query and predicted QPP value.

Narabzad/BERTQPP
