Enhancing BERT with Exact Match Signals

Deep neural models pretrained on auxiliary text tasks, as exemplified by BERT, have achieved impressive gains on the ad hoc retrieval task. However, important cues of this task, such as exact matching, were rarely addressed in previous work, where relevance is formalized as a matching problem between two segments of text, similar to Natural Language Processing (NLP) tasks. In this work, we propose to explicitly mark the terms that exactly match between the query and the document in the input of BERT, assuming that it is capable of learning how to integrate the exact matching signal when estimating relevance. Our simple yet effective approach yields improvements in ranking accuracy on three ad hoc benchmark collections.

Code for paper: https://assets.researchsquare.com/files/rs-550456/v1_covered.pdf?c=1645466722


Resources

Our fine-tuned models on MS MARCO can be directly used from the HuggingFace Model hub (PyTorch or TF2):

TF2 framework:

from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained('LilaBoualili/bert-sim-pair')

PyTorch framework:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained('LilaBoualili/bert-sim-pair')

Here is the complete list of models under LilaBoualili/:

  • bert-vanilla | electra-vanilla
  • bert-sim-pair | electra-sim-pair
  • bert-sim-doc | electra-sim-doc
  • bert-pre-pair | electra-pre-pair
  • bert-pre-doc | electra-pre-doc

These models and their corresponding tokenizers can be used directly for inference, as in the sketch below. We nevertheless provide the full fine-tuning code.
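
For example, here is a minimal re-ranking sketch with the vanilla PyTorch model. The query/passage pair is illustrative, and treating logit index 1 as the relevance probability is an assumption; for the marking strategies, the query and passage text must first be marked as described below.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load a fine-tuned model and its tokenizer from the HuggingFace Model hub
tokenizer = AutoTokenizer.from_pretrained('LilaBoualili/bert-vanilla')
model = AutoModelForSequenceClassification.from_pretrained('LilaBoualili/bert-vanilla')
model.eval()

query = "what causes tides"  # illustrative query
passage = "Tides are caused by the gravitational pull of the moon and the sun."  # illustrative passage

# Encode the (query, passage) pair as a single sequence pair and score it
inputs = tokenizer(query, passage, truncation=True, max_length=256, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Assumption: index 1 corresponds to the relevant class
score = torch.softmax(logits, dim=-1)[0, 1].item()
print(f"relevance score: {score:.4f}")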


Set up the environment !

We use a retrieve-and-rerank architecture for our experiments, with Anserini as the first-stage retriever. Our experiments were run with version 0.9.4 of the library; please follow the installation instructions on its GitHub repo.

# Create virtual environment
pip install virtualenv
virtualenv -p python3.7 exctM_env
source exctM_env/bin/activate

# Install requirements
pip install -r requirements.txt

Model fine-tuning

Get the data ready !

  • Download the MS MARCO passage training dataset.
DATA_DIR=./train_data
mkdir ${DATA_DIR}

wget https://msmarco.blob.core.windows.net/msmarcoranking/triples.train.small.tar.gz -P ${DATA_DIR}

tar -xvf ${DATA_DIR}/triples.train.small.tar.gz -C ${DATA_DIR}
  • Generate the training pairs file. This is very useful when fine-tuning multiple models, since this processing is time-consuming.
output_dir=<path/to/out>
dataset_path=${DATA_DIR}/triples.train.small.tsv
python ./prep_data_plus.py --collection msmarco \
                      --set train \
                      --dataset_path ${dataset_path} \
                      --output_dir ${output_dir} \
                      --set_name msmarco_train_small
  • Apply a marking strategy to highlight the exact match signals and save the dataset to a TFRecord file (an illustrative sketch of term marking is given after the commands below).

Note: For the pre_doc and pre_pair strategies (our implementation of MarkedBERT [1]), use this script to add the precise marker tokens to the vocabulary and initialize their embeddings:

python ./add_marker_tokens.py --save_dir <path/out/save/vocabulary/and/model> \
                              --tokenizer_name_path bert-base-uncased \ # default; or google/electra-base-discriminator
                              --name bert_base_uncased # name of the extended vocabulary files (pre_tokenizer_${name}, pre_model_${name})
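
For reference, the general transformers pattern behind such a script looks roughly like the sketch below. The marker strings are illustrative placeholders, not necessarily the exact tokens added by add_marker_tokens.py, and the script may initialize the new embeddings differently.

from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

# Illustrative marker tokens; the actual precise markers are defined in add_marker_tokens.py
marker_tokens = ["[e1]", "[/e1]", "[e2]", "[/e2]"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Add the markers to the vocabulary and create embedding rows for them
tokenizer.add_tokens(marker_tokens)
model.resize_token_embeddings(len(tokenizer))

# Save the augmented tokenizer/model to be reused as tokenizer_path_or_name below
tokenizer.save_pretrained("./pre_tokenizer_bert_base_uncased")
model.save_pretrained("./pre_model_bert_base_uncased")
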
strategy=<base, sim_doc, sim_pair, pre_doc, pre_pair>
data_path=<path/to/pairs/file>
tokenizer_path_or_name=<path/to/tokenizer, or tokenizer name in transformers> # defaults to 'bert-base-uncased'; for the precise marking strategies this must point to the tokenizer augmented with the precise marker tokens.
python ./convert_dataset_to_tfrecord.py --collection msmarco \
                                        --set train \
                                        --strategy $strategy \
                                        --tokenizer_name_path ${tokenizer_path_or_name} \
                                        --data_path ${data_path} \
                                        --output_dir ${output_dir} \
                                        --set_name msmarco_${strategy} # dataset name
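
To make the marking idea concrete, here is an illustrative sketch of pair-level term marking. The "#" marker and the matching logic are assumptions made for illustration only; the actual sim_*/pre_* strategies are implemented by this repository's marking code.

import re

def mark_exact_matches(query, doc, marker="#"):
    # Illustrative sketch: surround document terms that exactly match a query
    # term with a marker symbol. The repository's sim_*/pre_* strategies
    # implement their own (possibly different) marking.
    query_terms = {t.lower() for t in re.findall(r"\w+", query)}
    marked_doc = " ".join(
        f"{marker} {tok} {marker}" if tok.lower() in query_terms else tok
        for tok in doc.split()
    )
    return query, marked_doc

q, d = mark_exact_matches("tides causes", "Tides are caused by the moon")
print(d)  # -> "# Tides # are caused by the moon"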

Fine-tune BERT

We use free Google Colab TPUs for our experiments. See our Google Colab notebooks under Notebooks/.

python ./run_model.py --output_dir ${output_dir} \ # model checkpoints are saved to GCS by the TF2 checkpoint manager
                        --do_train True \
                        --do_eval False \
                        --do_predict False \
                        --evaluate_during_training False \
                        --per_device_train_batch_size 16 \
                        --per_device_eval_batch_size 4 \
                        --learning_rate 3e-6 \
                        --weight_decay 0.01 \
                        --adam_epsilon 1e-6 \
                        --num_train_epochs -1 \
                        --max_steps 100000 \
                        --warmup_steps 10000 \
                        --logging_steps 1000 \
                        --save_steps 5000 \
                        --ckpt_name ${CHECKPOINT_NAME} \
                        --ckpt_dir ${CKPT_DIR} \ # path to save the checkpoint in the HuggingFace transformers format
                        --eval_all_checkpoints False \
                        --logging_dir ${LOG_DIR} \  # must be in GCS
                        --seed 42 \
                        --collection msmarco \
                        --marking_strategy ${STRATEGY} \
                        --train_data_dir ${DATA_DIR} \  # directory that contains the training dataset file(s)
                        --eval_data_dir  ${EVAL_DATA_DIR} \ # directory that contains the eval dataset file(s)
                        --train_set_name ${TRAIN_SET_NAME} \ # convention of naming: dataset_train_${TRAIN_SET_NAME}.tf
                        --eval_set_name ${EVAL_SET_NAME} \ # convention of naming: dataset_dev_${EVAL_SET_NAME}.tf
                        --test_set_name ${TEST_SET_NAME} \ # convention of naming: dataset_test_${TEST_SET_NAME}.tf
                        --out_suffix ${OUT_SUFFIX} \ # suffix for the output file names
                        --eval_qrels_file ${EVAL_QRELS_FILE_NAME} \ # optional, for calculating evaluation measures
                        --test_qrels_file ${TEST_QRELS_FILE_NAME} \
                        --max_seq_length 512 \ 
                        --max_query_length 64 \
                        --model_name_or_path ${MODEL_NAME_OR_PATH} \ # bert-base-uncased or augmented version with precise tokens

Inference

Here we give examples for the zero-shot setting. Please refer to the scripts for full examples of both the zero-shot transfer setting and the multi-phase fine-tuning data preparation. For the actual training/testing, refer to the Notebooks. You can also run the models on GPU: simply omit the tpu_name argument and the code will automatically use the available GPUs. Check TFTrainingArguments for more details about device usage.

Index the collections !

We use Anserini for indexing our collections. Follow their indexing guidelines.

Make sure to store the contents at indexing time by setting the -storeContents flag if you want to use them later as the document contents.

Retrieve the top-K initial candidates list !

The data_path has the following structure:

data_path
|-- topics
|   |--title
|   |--description
|--folds
|--qrels
|--datasets
   |--body
   |--title_body
  • topics contains the original topic files of the three collections, topics.{collection}.txt, which can be found in the Anserini resources. After the first execution using title|description queries, a new file is created for each collection and topic field under data_path/topics/{topic_field}/topics.{collection}.txt with this format:
{Qid}\t{title|description}

The functions for creating the topic files can be found under Retriever/utils (get_title|get_description).

  • qrels contains the qrels files qrels.{collection}.txt, which can be found in the Anserini resources.
collection=<robust04, gov2>
topic_field=<title, description>
anserini_path=<path/to/anserini/root>
index_path=<path/to/lucene/index>
data_path=<path/to/data/root>

python ./retrieve.py --collection ${collection} --topic_field ${topic_field} \
                               --data_path ${data_path} \
                               --anserini_path ${anserini_path} \
                               --index ${index_path} \  
                               --K 1000

You can set various retrieval parameters: -rm3 for RM3 expansion, -K for the retrieval depth, and the BM25 parameters via -bm25_k1 and -bm25_b.

The document contents are obtained either by parsing the raw documents (-storeRaw is needed at indexing time) or directly from the stored contents by passing the -use_contents flag to retrieve.py, provided they were saved during indexing with -storeContents.

You can check Data/data_utils.py for the different preprocessing options that can be applied to a specific collection in Data/collections.py#parse_doc().

For title queries we use the default BM25 parameters; for description queries we use the following settings:

  • Robust04: b=0.6, k1=1.9
  • GOV2: b=0.6, k1=2.0

Prepare the passages !

The retriever generates 3 files:

  • The run file: {qid}\t{did}\t{score}\t{rank}\t{judgement}
  • The corpus file: {did}\t{title}\t{body}; if -use_title_doc was not set, the title field is empty ('')
  • The queries file: {Qid}\t{title|description}, written under data_path/topics/{topic_field}/ as described above

The corpus file can also be obtained in other ways; check PARADE for another option. If your corpus is in the {did}\t{body} format, make sure to set the -from_raw_docs flag when preparing the passages in the next step.
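
If you need to consume these files yourself, here is a minimal reading sketch that follows the formats above (the function names are illustrative):

def read_run(path):
    # Run file format: {qid}\t{did}\t{score}\t{rank}\t{judgement}
    with open(path, encoding="utf-8") as f:
        for line in f:
            qid, did, score, rank, judgement = line.rstrip("\n").split("\t")
            yield qid, did, float(score), int(rank), judgement

def read_corpus(path):
    # Corpus file format: {did}\t{title}\t{body} (title may be '')
    docs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            did, title, body = line.rstrip("\n").split("\t", 2)
            docs[did] = (title, body)
    return docs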

Before marking, we construct a single data file containing the pairs of query and split passages for each collection. This file is created only once and reused for all marking strategies (an illustrative sketch of the passage splitting is given after the command below).

collection=<robust04, gov2>
query_field=<title, description>
output_dir=<path/to/out>
run_path=<path/to/run/file>
queries_path=<path/to/queries/file>
collection_path=<path/to/corpus/file>
python ./prep_data_plus.py --collection $collection \
                      --output_dir ${output_dir} \
                      --queries_path ${queries_path} \
                      --run_path ${run_path} \
                      --collection_path ${collection_path} \
                      --set_name ${collection}_${query_field} \
                      --num_eval_docs 1000 \ 
                      --plen 150 \
                      --overlap 75 \
                      --max_pass_per_doc 30 # --from_raw_docs
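
As a rough illustration of what --plen, --overlap and --max_pass_per_doc control, here is a sliding-window sketch; it is not the repository's exact splitting code.

def split_passages(doc_tokens, plen=150, overlap=75, max_pass_per_doc=30):
    # Sliding window: passages of `plen` tokens, consecutive passages share
    # `overlap` tokens, and at most `max_pass_per_doc` passages are kept.
    stride = plen - overlap
    passages = []
    for start in range(0, max(len(doc_tokens), 1), stride):
        passages.append(doc_tokens[start:start + plen])
        if len(passages) >= max_pass_per_doc or start + plen >= len(doc_tokens):
            break
    return passages

# Example: a 400-token document yields 5 overlapping passages
print([len(p) for p in split_passages(list(range(400)))])  # [150, 150, 150, 150, 100]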

Use a marking strategy to highlight the exact match signals of the document w.r.t. the query in the pairs file generated above. The query and document passages are marked, then tokenized, and finally saved in a TFRecord file.

strategy=<base, sim_doc, sim_pair, pre_doc, pre_pair>
data_path=<path/to/pairs/file>
tokenizer_path_or_name=<path/to/tokenizer, or tokenizer name in transformers> # defaults to 'bert-base-uncased'; for the precise marking strategies this must point to the tokenizer augmented with the precise marker tokens.
python ./convert_dataset_to_tfrecord.py --collection $collection \
                                        --strategy $strategy \
                                        --tokenizer_name_path ${tokenizer_path_or_name} \
                                        --max_seq_len 256 \
                                        --data_path ${data_path} \
                                        --output_dir ${output_dir} \
                                        --set_name ${collection}_${query_field}_${strategy} # dataset name

Run the Model !

Check the notebook for the zero-shot transfer setting.

Evaluation

After running the model, all you need to do is fetch the prediction files. The code saves two versions: one aggregated with the MaxP strategy (_maxP.tsv), ready for evaluation, and one with all the passage probabilities (_all.tsv) if you want to perform further manipulations.
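
For reference, MaxP simply scores each document by its best passage; here is a minimal aggregation sketch over (qid, did, passage_score) tuples (names are illustrative, not the repository's code):

from collections import defaultdict

def maxp_aggregate(passage_scores):
    # passage_scores: iterable of (qid, did, score) at the passage level.
    # Returns {qid: {did: max passage score}}, i.e. the MaxP document score.
    doc_scores = defaultdict(dict)
    for qid, did, score in passage_scores:
        previous = doc_scores[qid].get(did, float("-inf"))
        doc_scores[qid][did] = max(previous, score)
    return doc_scores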

We use the pytrec_eval library:

python ./trec_eval.py \
                     --output_dir /path/for/saving/perquery/metrics \ # Optional, only used when -per_query flag is set
                     --qrels_path ${DATA_DIR}/qrels/qrels.${collection}.txt \
                     --preds_path /path/to/predictions/file \ # MaxP file: {qid}\t{did}\t{score}
                     --per_query # If you need the per query metrics 
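
If you prefer to call pytrec_eval directly, here is a minimal sketch (the qrels/run dictionaries and document ids below are illustrative):

import pytrec_eval

# qrels: {qid: {did: relevance}}, run: {qid: {did: score}}
qrels = {"301": {"FBIS3-10082": 1, "FBIS3-10169": 0}}
run = {"301": {"FBIS3-10082": 12.3, "FBIS3-10169": 8.7}}

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"map", "P", "ndcg_cut"})
per_query = evaluator.evaluate(run)

print(per_query["301"]["map"], per_query["301"]["P_20"], per_query["301"]["ndcg_cut_20"])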

BM25 score interpolation

python ./score_comb.py \
                    --output_dir ./output/dir/for/new/predictions \
                    --doc_scores_path /path/to/${collection}_run_${field}_${k}.tsv \ # the run file of BM25 produced by the retrieve.py script
                    --preds_path /path/to/predictions/file \ # MaxP file: {qid}\t{did}\t{score}
                    --qrels_path ${DATA_DIR}/qrels/qrels.${collection}.txt \
                    --folds_path ${DATA_DIR}/folds/${collection}-folds.json # folds.json config

See scripts/zero-shot/comb_bm25.sh for a full script with evaluations before and after interpolating BM25 scores.
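
Conceptually, the interpolation combines the BM25 and BERT (MaxP) document scores with a single weight. The sketch below is an assumption about what score_comb.py does (a linear combination after min-max normalization, with alpha typically tuned per cross-validation fold), not its actual implementation.

def interpolate(bm25_scores, bert_scores, alpha):
    # final(d) = alpha * bm25(d) + (1 - alpha) * bert(d), after min-max
    # normalizing each score set for one query. Illustrative sketch only.
    def minmax(scores):
        lo, hi = min(scores.values()), max(scores.values())
        return {d: (s - lo) / (hi - lo) if hi > lo else 0.0 for d, s in scores.items()}

    bm25_n, bert_n = minmax(bm25_scores), minmax(bert_scores)
    docs = set(bm25_n) | set(bert_n)
    return {d: alpha * bm25_n.get(d, 0.0) + (1 - alpha) * bert_n.get(d, 0.0) for d in docs}

# Example for a single query
fused = interpolate({"d1": 12.3, "d2": 8.7}, {"d1": 0.42, "d2": 0.91}, alpha=0.3)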

Acknowledgement

Some parts of the code were copied or adapted from: dl4marco-bert, SIGIR19-BERT-IR, Birch, PARADE, Capreolus.


References

[1] Lila Boualili, Jose G. Moreno, and Mohand Boughanem. 2020. MarkedBERT: Integrating Traditional IR Cues in Pre-trained Language Models for Passage Retrieval. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '20). Association for Computing Machinery, New York, NY, USA, 1977–1980. DOI:https://doi.org/10.1145/3397271.3401194