SIGIR'2022, Pre-train a Discriminative Text Encoder for Dense Retrieval via Contrastive Span Prediction

Albert-Ma/COSTA


COSTA

This is the official repo of our SIGIR'2022 paper, "Pre-train a Discriminative Text Encoder for Dense Retrieval via Contrastive Span Prediction".

Introduction

The foundation of effective search is high-quality text representation learning. Modern dense retrieval models usually employ pre-trained models like BERT as the text encoder. However, there is a gap between the pre-training objectives of BERT-like models and the requirements of dense retrieval as shown in Figure 1.

[Figure 1: The gap between the pre-training objectives of BERT-like models and the requirements of dense retrieval]

Existing work mainly follows two lines to learn high-quality text sequence representations for dense retrieval: contrastive learning and autoencoder-based language models. We list the pros and cons of these two methods in Figure 2. In this paper, we therefore propose a novel COntrastive Span predicTion tAsk (COSTA), which combines the merits of contrastive learning and autoencoders. The key idea is to force the encoder to generate a text representation that is close to the representations of its own random spans while far away from those of other texts, using a groupwise contrastive loss. Our method uses only the encoder and learns document-level text sequence representations by "reconstructing" multiple of the text's own spans. We do not actually generate the original texts; we only force the text sequence representation to be close to the representations of its own spans at different granularities. In this way, we can:

  • Learn discriminative text sequence representations effectively, while avoiding the complex data augmentation techniques that contrastive learning usually requires.
  • Learn expressive text sequence representations efficiently, while thoroughly avoiding the bypass effect of autoencoder-based models.
  • Mimic the relevance relationship between a query and a document, since spans of different granularities can be treated as pseudo queries.

[Figure 2: Pros and cons of contrastive learning and autoencoder-based language models]
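The groupwise contrastive loss described above can be sketched as follows. This is a simplified NumPy illustration of the idea only, not the paper's implementation: span sampling, batching, and the exact grouping follow the paper, and the temperature value here is arbitrary.

```python
import numpy as np

def groupwise_contrastive_loss(text_reps, span_reps, tau=0.1):
    """Sketch of a groupwise contrastive span-prediction loss.

    text_reps: (B, d) sequence representations, one per text.
    span_reps: (B, K, d) representations of K random spans per text.
    Each text representation is pulled toward its own K spans
    (positives) and pushed away from the spans of the other texts
    in the batch (negatives).
    """
    B, K, d = span_reps.shape
    # L2-normalize so dot products become cosine similarities
    t = text_reps / np.linalg.norm(text_reps, axis=-1, keepdims=True)
    s = span_reps / np.linalg.norm(span_reps, axis=-1, keepdims=True)
    # similarity of every text to every span in the batch: (B, B*K)
    sim = (t @ s.reshape(B * K, d).T) / tau
    logsumexp = np.log(np.exp(sim).sum(axis=1))  # denominator, per text
    loss = 0.0
    for i in range(B):
        own = sim[i, i * K:(i + 1) * K]          # text i's own spans
        loss += np.mean(logsumexp[i] - own)      # -log softmax over group
    return loss / B
```

The loss shrinks when each text's representation is closer to its own spans than to the spans of other texts, which is exactly the discriminative behavior the task is designed to induce.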

Pre-trained models on the Hugging Face Hub 🤗

We have uploaded the COSTA pre-trained models to the Hugging Face Hub, so you can easily use them with the Huggingface/Transformers library.

Model identifier on the Hugging Face Hub:

  • xyma/COSTA-wiki: The official COSTA model pre-trained on Wikipedia

For example,

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xyma/COSTA-wiki")
model = AutoModel.from_pretrained("xyma/COSTA-wiki")
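Once the model is loaded, a text representation can be read off its output. A minimal sketch, assuming [CLS]-vector pooling and operating on the encoder's last hidden state as a plain array (the helper names here are our own, not part of the repo):

```python
import numpy as np

def cls_embedding(last_hidden_state):
    """Take the vector at position 0 ([CLS]) as the sequence
    representation for each item in the batch."""
    # last_hidden_state: (batch, seq_len, hidden) -> (batch, hidden)
    return last_hidden_state[:, 0, :]

def cosine_scores(query_vec, doc_vecs):
    """Score documents against a query by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return d @ q
```

In a real pipeline the arrays would come from `model(**tokenizer(text, return_tensors="pt")).last_hidden_state`; the sketch only shows the pooling and scoring step.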

Preparing Data

Download Wikipedia from the website and extract the text with WikiExtractor.py; then apply any necessary cleanup and filter out short texts.
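The cleanup and filtering step is left open above; one possible sketch is below. The regex and the `min_words` threshold are our own illustrative choices, not prescribed by the repo.

```python
import re

def clean_wiki_texts(lines, min_words=20):
    """Strip leftover markup from extracted Wikipedia text and
    drop passages shorter than min_words words."""
    kept = []
    for line in lines:
        text = re.sub(r"<[^>]+>", " ", line).strip()  # remove residual tags
        if len(text.split()) >= min_words:
            kept.append(text)
    return kept
```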

Download the two MS MARCO dense retrieval datasets from this website and the two TREC 2019 Deep Learning Track datasets from this website. Since the two TREC datasets share the training and dev sets with the two MS MARCO datasets, you only need to download the test files. Put these datasets under

./data/marco-pas, ./data/marco-doc

Pre-training

Stay tuned! Come back soon!

Fine-tuning

Our fine-tuning code is based on the Tevatron toolkit.

See README.md for fine-tuning COSTA on passage retrieval datasets.

See README.md for fine-tuning COSTA on document retrieval datasets.

Fine-tuning Results

| MS MARCO Passage Retrieval | MRR@10 | Recall@1000 | Files |
| --- | --- | --- | --- |
| COSTA (BM25 negs) | 0.342 | 0.959 | Model, Dev (MARCO format), Dev (TREC format) |
| COSTA (hard negs) | 0.366 | 0.971 | Model, Dev (MARCO format), Dev (TREC format) |

| TREC 2019 Passage Retrieval | NDCG@10 | Recall@1000 | Files |
| --- | --- | --- | --- |
| COSTA (BM25 negs) | 0.635 | 0.773 | Model, Test (TREC format) |
| COSTA (hard negs) | 0.704 | 0.816 | Model, Test (TREC format) |

Run the following command to evaluate COSTA on the MS MARCO Passage dataset.

./eval/eval_msmarco_passage.sh   ./marco_pas/qrels.dev.tsv ./costa_hd_neg8_e2_bs8_fp16_mrr10_366_r1000_971/encoding/dev.rank.tsv.marco

You will get

#####################
MRR @ 10: 0.36564396006731276
QueriesRanked: 6980
#####################
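The script reports the standard MS MARCO MRR@10 metric. As a rough illustration of what it computes, here is a minimal sketch (a hypothetical helper, not part of the repo's eval code):

```python
def mrr_at_k(rankings, qrels, k=10):
    """Mean reciprocal rank of the first relevant document in the top k.

    rankings: {qid: [docid, ...]} ranked lists, best first.
    qrels:    {qid: set of relevant docids}.
    """
    total = 0.0
    for qid, ranked in rankings.items():
        rr = 0.0
        for rank, docid in enumerate(ranked[:k], start=1):
            if docid in qrels.get(qid, set()):
                rr = 1.0 / rank  # reciprocal rank of first hit
                break
        total += rr
    return total / len(rankings)
```

Queries with no relevant document in the top k contribute 0, which is why the reported mean sits well below 1 even for strong models.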

Run the following command to evaluate COSTA on the TREC 2019 Passage dataset.

./eval/trec_eval -m ndcg_cut.10 -m recall.1000  -c -l 2 ./marco_pas/qrels.dl19-passage.txt ./costa_hd_neg8_e2_bs8_fp16_mrr10_366_r1000_971/encoding/trec.rank.tsv.trec

You will get

recall_1000             all     0.8160
ndcg_cut_10             all     0.7043

| MS MARCO Document Retrieval | MRR@100 | Recall@100 | Files |
| --- | --- | --- | --- |
| COSTA (1st iteration hard negs) | 0.395 | 0.894 | Model, Dev (MARCO format), Dev (TREC format) |
| COSTA (2nd iteration hard negs) | 0.422 | 0.917 | Model, Dev (MARCO format), Dev (TREC format) |

| TREC 2019 Document Retrieval | NDCG@10 | Recall@100 | Files |
| --- | --- | --- | --- |
| COSTA (1st iteration hard negs) | 0.582 | 0.278 | Model, Test (TREC format) |
| COSTA (2nd iteration hard negs) | 0.626 | 0.320 | Model, Test (TREC format) |

Run the following command to evaluate COSTA on the MS MARCO Document dataset.

./eval/eval_msmarco_doc.sh   ./marco_doc/qrels.dev.tsv ./costa_doc_w_doc395hn200_neg8_e1_bs8_extend_doc395_mrr100_422_r100_917/encoding/dev.rank.tsv.marco

You will get

#####################
MRR @ 100: 0.4215861855110516
QueriesRanked: 5193
#####################

Run the following command to evaluate COSTA on the TREC 2019 Document dataset.

./eval/trec_eval -m ndcg_cut.10 -m recall.100  ./marco_doc/msmarco-trec19-qrels.txt ./costa_doc_w_doc395hn200_neg8_e1_bs8_extend_doc395_mrr100_422_r100_917/encoding/trec.rank.tsv.trec

You will get

recall_100             all     0.3202
ndcg_cut_10            all     0.6260

Citation

If you find our work useful, please consider citing our paper:

@inproceedings{ma2022costa,
  author = {Ma, Xinyu and Guo, Jiafeng and Zhang, Ruqing and Fan, Yixing and Cheng, Xueqi},
  title = {Pre-Train a Discriminative Text Encoder for Dense Retrieval via Contrastive Span Prediction},
  year = {2022},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3477495.3531772},
  doi = {10.1145/3477495.3531772},
  pages = {848–858},
  numpages = {11},
  location = {Madrid, Spain},
  series = {SIGIR '22}
}
