
ANCE-Tele

This is the implementation of ANCE-Tele introduced in the EMNLP 2022 Main Conference paper "Reduce Catastrophic Forgetting of Dense Retrieval Training with Teleportation Negatives". If you find this work useful, please cite our paper 😃 and give our repo a star ⭐️. Thanks ♪(・ω・)ノ

@inproceedings{sun2022ancetele,
  title={Reduce Catastrophic Forgetting of Dense Retrieval Training with Teleportation Negatives},
  author={Sun, Si and Xiong, Chenyan and Yu, Yue and Overwijk, Arnold and Liu, Zhiyuan and Bao, Jie},
  booktitle={Proceedings of EMNLP 2022},
  year={2022}
}

What's New ٩(๑>◡<๑)۶

[2023/9/26] We release a new ANCE-Tele checkpoint together with its Marco & BEIR evaluation results. Per-dataset results on the 18 BEIR datasets are shown on the HuggingFace model card: OpenMatch/ance-tele_coco-base_msmarco_qry-psg-encoder.

| Model | Pretrain Model | Train w/ Marco Title | Marco Dev MRR@10 | BEIR Avg NDCG@10 |
|---|---|---|---|---|
| ANCE-Tele | cocodr-base | w/o | 37.3 | 44.2 |

[2023/4/13] We have added our ongoing work "Rethinking Few-shot Ability in Dense Retrieval" to this repository. Please switch to the 'FewDR' branch and check the folder ANCE-Tele/plugins/FewDR.

Outline

Overview | Requirements | Reproduce MS MARCO Results | Reproduce NQ and TriviaQA Results | Easy-to-Use Tips | Contact Us | Acknowledgement

Overview

ANCE-Tele is a simple and efficient dense retrieval (DR) training method that introduces teleportation (momentum and lookahead) negatives to smooth the learning process, leading to improved training stability, faster convergence, and reduced catastrophic forgetting.

On web search and OpenQA, ANCE-Tele is competitive with systems that use significantly more (50x) parameters, and it eliminates the dependency on additional negatives (e.g., from BM25 or other DR systems), filtering strategies, and distillation modules. You can reproduce ANCE-Tele in about one day with a single A100 😉. (A 2080 Ti works too, just with more time.)

Let's begin!

Requirements

ANCE-Tele is tested on Python 3.8+, PyTorch 1.8+, and CUDA Version 10.2/11.1.

(1) Create a new Anaconda environment:

conda create --name ancetele python=3.8
conda activate ancetele

(2) Install the following packages with pip or conda in this environment; one possible install sequence is sketched after the list:

transformers==4.9.2
datasets==1.11.0
pytorch==1.8.0

faiss-gpu==1.7.2
## faiss-gpu depends on the CUDA version, e.g.:
## conda install faiss-gpu cudatoolkit=11.1
## conda install faiss-gpu cudatoolkit=10.2

openjdk==11
pyserini  ## pyserini depends on openjdk
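
A hedged sketch of one way to install everything (it assumes CUDA 11.1; the channels and cudatoolkit builds may need adjusting for your machine):

conda install pytorch==1.8.0 cudatoolkit=11.1 -c pytorch -c conda-forge
pip install transformers==4.9.2 datasets==1.11.0
conda install faiss-gpu=1.7.2 cudatoolkit=11.1 -c pytorch
conda install openjdk=11 -c conda-forge
pip install pyserini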

Reproduce MS MARCO Results

MARCO Download

(1) Download Dataset

| Download Link | Size |
|---|---|
| msmarco.tar.gz | ~1.0G |

(2) Uncompress Dataset

Run the command: tar -zxvf msmarco.tar.gz. The uncompressed folder contains the following files:

msmarco
  ├── corpus.tsv           # <TSV> psg_id \t psg_title \t psg
  ├── train.query.txt      # <TSV> qry_id \t qry
  ├── qrels.train.tsv      # <TSV> qry_id \t 0 \t pos_psg_id \t 1
  ├── dev.query.txt        # <TSV> qry_id \t qry
  └── qrels.dev.small.tsv  # <TSV> qry_id \t 0 \t pos_psg_id \t 1
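
As a quick sanity check after uncompressing (a small sketch; it only relies on the tab-separated layouts annotated above):

head -n 1 msmarco/corpus.tsv        ## first passage: psg_id, psg_title, psg
head -n 1 msmarco/qrels.train.tsv   ## first qrel: qry_id, 0, pos_psg_id, 1
wc -l msmarco/train.query.txt msmarco/dev.query.txt   ## number of train/dev queries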

MARCO Preprocess

(1) Tokenize Dataset

Enter the folder ANCE-Tele/shells and run the shell script:

bash tokenize_msmarco.sh

MARCO Reproduce using Our CheckPs

(1) Download our CheckP from HuggingFace:

| Download Link | Size | Dev MRR@10 |
|---|---|---|
| ance-tele_msmarco_qry-psg-encoder | ~438M | 39.1 |
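
If you prefer to fetch the CheckP manually, one option is cloning it from the HuggingFace Hub (a sketch; the repo id is assumed to live under the OpenMatch organization, like the checkpoint listed in What's New):

git lfs install
git clone https://huggingface.co/OpenMatch/ance-tele_msmarco_qry-psg-encoder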

(2) Encode & Search MS MARCO using our CheckP:

bash infer_msmarco.sh

P.S. Multi-GPU encoding of the MARCO corpus is supported; the corpus is split into ten files (split00-split09), so multi-GPU encoding only works with 1/2/5 GPUs at a time (the splits must divide evenly across the GPUs), e.g., setting ENCODE_CUDA="0,1".

Faiss Search Notice

🙌 ANCE-Tele supports Faiss-GPU search but requires sufficient CUDA memory. In our experience: MS MARCO >= 1*A100, 2*3090/V100, or 4*2080ti; NQ/TriviaQA >= 2*A100, 4*3090/V100, or 8*2080ti, e.g., setting SEARCH_CUDA="0,1,2,3".

🙌 If your CUDA memory is not enough, you can use split search: set --sub_split_num 5 (sub_split_num can be 1/2/5/10). You can also fall back to CPU search: remove --use_gpu and set --batch_size -1.
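
For a rough sense of the memory numbers above: a flat float32 Faiss index of 768-dimensional embeddings needs about N x 768 x 4 bytes. A back-of-the-envelope sketch, assuming the standard corpus sizes of ~8.8M MS MARCO passages and ~21M Wikipedia passages:

python -c "print('MS MARCO index: ~%.0f GB' % (8.8e6 * 768 * 4 / 1e9))"    ## ~27 GB
python -c "print('Wikipedia index: ~%.0f GB' % (21e6 * 768 * 4 / 1e9))"    ## ~65 GB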

MARCO Reproduce using Our Episode-3 Training Negatives

(1) Download vanilla pre-trained model & our Epi-3 training negatives:

| Download Link | Size |
|---|---|
| co-condenser-marco | ~473M |
| ance-tele_msmarco_tokenized-train-data.tar.gz | ~8.8G |

(2) Uncompress our Epi-3 training negatives:

Run the command: tar -zxvf ance-tele_msmarco_tokenized-train-data.tar.gz. The uncompressed folder contains 12 sub-files {split00-11.hn.json}. The format of each file is as follows:

{
  "query": [train-query tokenized ids],
  "positives": [[positive-passage-1 tokenized ids], [positive-passage-2 tokenized ids], ...],
  "negatives": [[negative-passage-1 tokenized ids], [negative-passage-2 tokenized ids], ...],
}
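
To peek at the first training example (a sketch; it assumes each split file stores one JSON object per line, and should be run inside the uncompressed folder):

head -n 1 split00.hn.json | python -m json.tool | head -n 20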

(3) Train ANCE-Tele using our Epi-3 training negatives

bash train_ance-tele_msmarco.sh

P.S. Multi-GPU training is supported. Please keep the following hyperparameters unchanged and set --negatives_x_device when using a multi-GPU setup; a hedged sketch of such a launch follows the table.

| Hyperparameters | Arguments | Single GPU | E.g., Two GPUs |
|---|---|---|---|
| Qry Batch Size | --per_device_train_batch_size | 8 | 4 |
| (Positive + Negative) Passages per Qry | --train_n_passages | 32 | 32 |
| Learning rate | --learning_rate | 5e-6 | 5e-6 |
| Total training epochs | --num_train_epochs | 3 | 3 |
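
As a concrete illustration of the two-GPU column, the launch wrapped by train_ance-tele_msmarco.sh might look roughly like the sketch below. Only the five hyperparameters come from the table; the entry-point module and the remaining flag names are assumptions, not the repo's actual interface:

## Hypothetical module path and placeholder paths; adjust to the real script.
python -m torch.distributed.launch --nproc_per_node=2 -m ancetele.driver.train \
    --output_dir ./epi-3-model \
    --model_name_or_path co-condenser-marco \
    --train_dir ./ance-tele_msmarco_tokenized-train-data \
    --per_device_train_batch_size 4 \
    --train_n_passages 32 \
    --learning_rate 5e-6 \
    --num_train_epochs 3 \
    --negatives_x_device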

Grad Cache Notice

🙌 If your CUDA memory is limited, please use the Gradient Caching technique. Set the following arguments during training:

--grad_cache \
--gc_q_chunk_size 4 \
--gc_p_chunk_size 8 \
## Split each query batch into sub-batches of gc_q_chunk_size queries
## Split each passage batch into sub-batches of gc_p_chunk_size passages

(4) Evaluate your ANCE-Tele

After training for 3 epochs, you can follow the instructions in MARCO: Reproduce w/ Our CheckPs - (2) to evaluate. Remember to replace the CheckP with your trained model file 😉.

MARCO Reproduce from Scratch

To reproduce ANCE-Tele from scratch (Epi-1->2->3), you only need to prepare the vanilla pretrained model co-condenser-marco.

Iterative Training Notice

🙌 Quick refreshing strategy. ANCE-Tele uses a quick refreshing strategy for hard-negative mining: Epi-1 and Epi-2 train for only 1/10 of the total training steps (early stop), and only the last Epi-3 runs the full training schedule, which significantly reduces the training cost.

🙌 Train from scratch. Every training episode adopts train-from-scratch mode: each episode starts from the vanilla pretrained model, and the only difference between episodes is the input training negatives. This makes it easy to reproduce the results without relying on intermediate-episode CheckPs.

(1) Epi-1 Training

First, mine the Tele-negatives using vanilla co-condenser-marco. In Epi-1, Tele-negatives contain ANN-negatives and Lookahead-negatives (LA-Neg) without Momentum.

bash epi-1-mine-msmarco.sh

Then train the vanilla co-condenser-marco with the Epi-1 Tele-negatives, early-stopping at 20k steps in preparation for negative refreshing:

bash epi-1-train-msmarco.sh

(2) Epi-2 Training

For Epi-2, mine Tele-negatives using the Epi-1 trained model. Epi-2 Tele-negatives contain ANN-negatives, Lookahead-negatives (LA-Neg), and Momentum-negatives (Epi-1 training negatives).

bash epi-2-mine-msmarco.sh

Then train the vanilla co-condenser-marco with the Epi-2 Tele-negatives, early-stopping at 20k steps in preparation for negative refreshing:

bash epi-2-train-msmarco.sh

(3) Epi-3 Training

For the last Epi-3, mine Tele-negatives using the Epi-2 trained model. Epi-3 Tele-negatives contain ANN-negatives, Lookahead-negatives (LA-Neg), and Momentum-negatives (Epi-2 training negatives).

bash epi-3-mine-msmarco.sh

Then train the vanilla co-condenser-marco with the Epi-3 Tele-negatives. This step is the same as the one introduced in MARCO: Reproduce w/ Our Episode-3 Training Negatives - (3):

bash epi-3-train-msmarco.sh

(4) Evaluate your ANCE-Tele

After three episodes, you can follow the instructions in MARCO: Reproduce w/ Our CheckPs - (2) to evaluate. Remember to replace the CheckP with your trained model file 😉.

Reproduce NQ and TriviaQA Results

NQ and TriviaQA Download

(1) Download Datasets

NQ and TriviaQA use the same Wikipedia-Corpus-Index.

| Download Link | Size |
|---|---|
| nq.tar.gz | ~17.6M |
| triviaqa.tar.gz | ~174.6M |
| wikipedia-corpus-index.tar.gz | ~12.9G |

(2) Uncompress Datasets

Run the command: tar -zxvf xxx.tar.gz. The uncompressed folder contains the following files:

nq
  ├── nq-train-qrels.jsonl
  └── nq-test.jsonl  # <DICT> {"qid":xxx, "question":xxx, "answers":[xxx, ...]}

triviaqa
  ├── triviaqa-train-qrels.jsonl
  └── triviaqa-test.jsonl  # <DICT> {"qid":xxx, "question":xxx, "answers":[xxx, ...]}

wikipedia-corpus-index
  ├── psgs_w100.tsv  # <TSV> psg_id \t psg \t psg_title
  └── index-wikipedia-dpr-20210120-d1b9e6  # Wikipedia index (for pyserini evaluation)

The format of the nq/triviaqa-train-qrels.jsonl files is as follows:

{
  "qid": xxx,
  "question": xxx,
  "answers": [xxx, ...],
  "positive_ctxs": [xxx, ...]
}

P.S. "positive_ctxs" is a positive passage list. The empty list means that DPR did not provide the oracle-relevant passage for the training query. In such cases, we use the passages containing the answer mined by ANCE-Tele as the "positive" passages during training.

NQ and TriviaQA Preprocess

(1) Tokenize Datasets

Enter the folder ANCE-Tele/shells and run the shell script:

bash tokenize_nq.sh
bash tokenize_triviaqa.sh
bash tokenize_wikipedia_corpus.sh

NQ and TriviaQA Reproduce Using Our CheckPs

(1) Download our CheckPs from HuggingFace:

For NQ and TriviaQA, ANCE-Tele adopts a bi-encoder architecture (separate Qry/Psg encoders), the same as DPR, coCondenser, etc. Different GPUs may cause small differences (a few thousandths) in the search results.

| Datasets | Qry-Encoder Download Link | Psg-Encoder Download Link | Size | R@5 | R@20 | R@100 |
|---|---|---|---|---|---|---|
| NQ | ance-tele_nq_qry-encoder | ance-tele_nq_psg-encoder | ~418M x 2 | 77.0 | 84.9 | 89.7 |
| TriviaQA | ance-tele_triviaqa_qry-encoder | ance-tele_triviaqa_psg-encoder | ~418M x 2 | 76.9 | 83.4 | 87.3 |

(2) Encode & Search NQ/TriviaQA using our CheckPs:

bash infer_nq.sh  # NQ
bash infer_triviaqa.sh  # TriviaQA

P.S. Multi-GPU encoding of the Wikipedia corpus is supported; the corpus is split into 20 files (split00-split19), and multi-GPU encoding only works with 1/2/5 GPUs at a time, e.g., setting ENCODE_CUDA="0,1". If your CUDA memory is limited, please see the [Faiss Search Notice] above for more GPU-search details.

NQ and TriviaQA Reproduce using Our Episode-3 Training Negatives

(1) Download vanilla pre-trained model & our Epi-3 training negatives:

| Datasets | Download Link | Size |
|---|---|---|
| Vanilla pre-trained model | co-condenser-wiki | ~419M |
| NQ | ance-tele_nq_tokenized-train-data.tar.gz | ~8.8G |
| TriviaQA | ance-tele_triviaqa_tokenized-train-data.tar.gz | ~6.8G |

(2) Uncompress our Epi-3 training negatives:

Run the command: tar -zxvf xxx. Each uncompressed dataset contains two sub-files {split00-01.hn.json}. The format of each file is as follows:

{
  "query": [train-query tokenized ids],
  "positives": [[positive-passage-1 tokenized ids], [positive-passage-2 tokenized ids], ...],
  "negatives": [[negative-passage-1 tokenized ids], [negative-passage-2 tokenized ids], ...],
}

(3) Train ANCE-Tele using our Epi-3 training negatives

bash train_ance-tele_nq.sh  # NQ
bash train_ance-tele_triviaqa.sh  # TriviaQA

P.S. Multi-GPU training is supported. Please keep the following hyperparameters unchanged and set --negatives_x_device when using a multi-GPU setup. If your CUDA memory is limited, please use [Gradient Caching].

| Hyperparameters | Arguments | Single GPU | E.g., Four GPUs |
|---|---|---|---|
| Qry Batch Size | --per_device_train_batch_size | 128 | 32 |
| (Positive + Negative) Passages per Qry | --train_n_passages | 12 | 12 |
| Learning rate | --learning_rate | 5e-6 | 5e-6 |
| Total training epochs | --num_train_epochs | 40 | 40 |

Prepare your ANCE-Tele for NQ and TriviaQA

After training, the encoders are saved under the ${train_job_name} folder as follows:

${train_job_name}
  ├── query_model    # Qry-Encoder
  └── passage_model  # Psg-Encoder

Before using your model, please copy the three files special_tokens_map.json, tokenizer_config.json, and vocab.txt into the Qry-Encoder and Psg-Encoder folders:

cp special_tokens_map.json tokenizer_config.json vocab.txt ./query_model
cp special_tokens_map.json tokenizer_config.json vocab.txt ./passage_model

(4) Evaluate your ANCE-Tele

Then you can follow the instructions in NQ/TriviaQA: Reproduce w/ Our CheckPs - (2) to evaluate. Remember to replace the CheckPs with your trained model files 😉:

export qry_encoder_name=${train_job_name}/query_model
export psg_encoder_name=${train_job_name}/passage_model

NQ and TriviaQA Reproduce from Scratch

To reproduce ANCE-Tele from scratch (Epi-1->2->3), you only need to prepare the vanilla pretrained model co-condenser-wiki. Before starting, please review ANCE-Tele's quick-refreshing strategy and train-from-scratch mode in the [Iterative Training Notice] 🙌.

(1) Epi-1 Training

First, mine the Tele-negatives using the vanilla co-condenser-wiki. In Epi-1, Tele-negatives contain ANN-negatives and Lookahead-negatives (LA-Neg) without Momentum.

bash epi-1-mine-nq.sh  # NQ
bash epi-1-mine-triviaqa.sh  # TriviaQA

Then train the vanilla co-condenser-wiki with the Epi-1 Tele-negatives, early-stopping at 2k steps in preparation for negative refreshing:

bash epi-1-train-nq.sh  # NQ
bash epi-1-train-triviaqa.sh  # TriviaQA

(2) Epi-2 Training

First, prepare the Epi-1 trained model introduced in [Prepare your ANCE-Tele for NQ and TriviaQA], and then use the prepared CheckPs to mine Epi-2 Tele-negatives, which contain ANN-negatives, Lookahead-negatives (LA-Neg), and Momentum-negatives (Epi-1 training negatives).

bash epi-2-mine-nq.sh  # NQ
bash epi-2-mine-triviaqa.sh  # TriviaQA

Then train the vanilla co-condenser-wiki with the Epi-2 Tele-negatives, early-stopping at 2k steps in preparation for negative refreshing:

bash epi-2-train-nq.sh  # NQ
bash epi-2-train-triviaqa.sh  # TriviaQA

(3) Epi-3 Training

For the last Epi-3, prepare the Epi-2 trained model introduced in [Prepare your ANCE-Tele for NQ and TriviaQA], and then use the prepared CheckPs to mine Epi-3 Tele-negatives, which contain ANN-negatives, Lookahead-negatives (LA-Neg), and Momentum-negatives (Epi-2 training negatives).

bash epi-3-mine-nq.sh  # NQ
bash epi-3-mine-triviaqa.sh  # TriviaQA

Then train the vanilla co-condenser-wiki with the Epi-3 Tele-negatives. This step is the same as introduced in NQ/TriviaQA: Reproduce w/ Our Episode-3 Training Negatives - (3):

bash epi-3-train-nq.sh  # NQ
bash epi-3-train-triviaqa.sh  # TriviaQA

(4) Evaluate your ANCE-Tele

After three training episodes, first Prepare your ANCE-Tele for NQ and TriviaQA. Then you can follow the instructions in NQ/TriviaQA: Reproduce w/ Our CheckPs - (2) to evaluate. Remember to replace the CheckPs with your trained model files 😉:

export qry_encoder_name=${train_job_name}/query_model
export psg_encoder_name=${train_job_name}/passage_model

Easy-to-Use Tips

Contact Us

For any questions, feel free to create an issue, and we will do our best to help. If the problem is urgent, you can also email me directly 🤗.

NAME: Si Sun
EMAIL: sunsi.shining@gmail.com

Acknowledgement

ANCE-Tele is implemented based on Tevatron, with modifications. We thank the authors for open-sourcing their excellent work. ANCE-Tele has also been integrated into OpenMatch.
