Difficulties to reproduce results on Robust 04 #22

Open
krasserm opened this issue Apr 10, 2020 · 11 comments

@krasserm

This is a follow-up on #21. I tried to reproduce the results on Robust 04 but failed to do so using the code in this repository. In the following I report my results on test fold f1 obtained in 3 experiments:

Experiment 1: Use provided CEDR-KNRM weights and .run files.

When evaluating the provided cedrknrm-robust-f1.run file in #18 with

bin/trec_eval -m P.20 data/robust/qrels cedrknrm-robust-f1.run
bin/gdeval.pl -k 20 data/robust/qrels cedrknrm-robust-f1.run

I'm getting P@20 = 0.4470 and nDCG@20 = 0.5177. When using a .run file generated with the provided weights cedrknrm-robust-f1.p

python rerank.py --model cedr_knrm --datafiles data/robust/queries.tsv data/robust/documents.tsv \
  --run data/robust/f1.test.run --model_weights cedrknrm-robust-f1.p --out_path cedrknrm-robust-f1.extra.run

bin/trec_eval -m P.20 data/robust/qrels cedrknrm-robust-f1.extra.run
bin/gdeval.pl -k 20 data/robust/qrels cedrknrm-robust-f1.extra.run

I'm getting P@20 = 0.4290 and nDCG@20 = 0.5038. I'd expect these metrics to be equal to those of the provided cedrknrm-robust-f1.run file. What is the reason for this difference?
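
As a side note, the same check can be scripted; below is a minimal sketch using pytrec_eval (an extra dependency, not part of this repo). Note that trec_eval's ndcg_cut_20 and gdeval's nDCG@20 use slightly different gain/discount formulas, so only the trec_eval-style numbers are expected to match exactly.

# Minimal sketch: compute P@20 and (trec_eval-style) nDCG@20 with pytrec_eval.
# Assumes `pip install pytrec_eval`; paths match the commands above.
import pytrec_eval

def read_qrels(path):
    qrels = {}
    with open(path) as f:
        for line in f:
            qid, _, docid, rel = line.split()
            qrels.setdefault(qid, {})[docid] = int(rel)
    return qrels

def read_run(path):
    run = {}
    with open(path) as f:
        for line in f:
            qid, _, docid, _, score, _ = line.split()
            run.setdefault(qid, {})[docid] = float(score)
    return run

qrels = read_qrels('data/robust/qrels')
run = read_run('cedrknrm-robust-f1.run')
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {'P_20', 'ndcg_cut_20'})
per_query = evaluator.evaluate(run)
print('P@20    =', sum(q['P_20'] for q in per_query.values()) / len(per_query))
print('nDCG@20 =', sum(q['ndcg_cut_20'] for q in per_query.values()) / len(per_query))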

Experiment 2: Train my own BERT and CEDR-KNRM models.

This is where I'm getting results that are far below expectations (only for CEDR-KNRM, not for Vanilla BERT). I started by training and evaluating a Vanilla BERT ranker:

python train.py --model vanilla_bert --datafiles data/robust/queries.tsv data/robust/documents.tsv \
    --qrels data/robust/qrels --train_pairs data/robust/f1.train.pairs --valid_run data/robust/f1.valid.run --model_out_dir trained_bert
python rerank.py --model vanilla_bert --datafiles data/robust/queries.tsv data/robust/documents.tsv \
    --run data/robust/f1.test.run --model_weights trained_bert/weights.p --out_path trained_bert/test.run

bin/trec_eval -m P.20 data/robust/qrels trained_bert/test.run
bin/gdeval.pl -k 20 data/robust/qrels trained_bert/test.run

I'm getting P@20 = 0.3690 and nDCG@20 = 0.4231 which is consistent with evaluating the provided vbert-robust-f1.run file:

bin/trec_eval -m P.20 data/robust/qrels vbert-robust-f1.run
bin/gdeval.pl -k 20 data/robust/qrels vbert-robust-f1.run

This gives P@20 = 0.3550 and nDCG@20 = 0.4219, which comes quite close. I understand that here I simply ignored the inconsistencies reported in #21, but it is at least a coarse cross-check of model performance on a single fold. When training a CEDR-KNRM model with this BERT model as initialization

python train.py --model cedr_knrm --datafiles data/robust/queries.tsv data/robust/documents.tsv \
    --qrels data/robust/qrels --train_pairs data/robust/f1.train.pairs --valid_run data/robust/f1.valid.run \
    --initial_bert_weights trained_bert/weights.p --model_out_dir trained_cedr
python rerank.py --model cedr_knrm --datafiles data/robust/queries.tsv data/robust/documents.tsv \
    --run data/robust/f1.test.run --model_weights trained_cedr/weights.p --out_path trained_cedr/test.run

bin/trec_eval -m P.20 data/robust/qrels trained_cedr/test.run
bin/gdeval.pl -k 20 data/robust/qrels trained_cedr/test.run

I'm getting P@20 = 0.3790 and nDCG@20 = 0.4347. This is slightly better than a Vanilla BERT ranker but far below the performance obtained in Experiment 1. I also repeated Experiment 2 with f1.test.run, f1.valid.run and f1.train.pairs files that I generated myself from Anserini runs with a default BM25 configuration and still get results very close to those above.
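
For completeness, the Anserini runs mentioned above were plain default BM25; a rough sketch of producing such a run with pyserini is below (this is my own tooling assumption, not something prescribed by this repo; newer pyserini releases rename SimpleSearcher to LuceneSearcher, and the queries dict is a hypothetical stand-in for the fold 1 test queries).

# Rough sketch: default-BM25 run in TREC format via pyserini (assumption:
# pyserini with its prebuilt robust04 index is installed; not part of this repo).
from pyserini.search import SimpleSearcher

searcher = SimpleSearcher.from_prebuilt_index('robust04')
searcher.set_bm25(k1=0.9, b=0.4)  # Anserini's default BM25 parameters

queries = {'301': 'International Organized Crime'}  # hypothetical stand-in for fold 1 test queries

with open('f1.test.bm25.run', 'w') as out:
    for qid, text in queries.items():
        for rank, hit in enumerate(searcher.search(text, k=1000), start=1):
            out.write(f'{qid} Q0 {hit.docid} {rank} {hit.score} bm25\n')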

Has anyone been able to get results similar to those in Experiment 1 by training a BERT and CEDR-KNRM model as explained in the project's README?

Experiment 3: Use provided vbert-robust-f1.p weights as initialization for CEDR-KNRM training

I ran this experiment in an attempt to debug the performance gap found in the previous experiment. I'm fully aware that training and evaluating a CEDR-KNRM model on fold 1 (i.e. f1) with the provided vbert-robust-f1.p is invalid because of the inconsistencies reported in #21: the folds used for training/validating/testing vbert-robust-f1.p differ from those in data/robust/f[1-5]*.

In other words, validation and evaluation of the trained CEDR-KNRM model are done with queries that were used for training the provided vbert-robust-f1.p. So this setup partially uses training data for evaluation, which of course inflates the evaluation results. I was surprised to see that with this invalid setup I'm able to reproduce the numbers obtained in Experiment 1, or at least come very close. Here's what I did:

python train.py --model cedr_knrm --datafiles data/robust/queries.tsv data/robust/documents.tsv \
    --qrels data/robust/qrels --train_pairs data/robust/f1.train.pairs --valid_run data/robust/f1.valid.run \
    --initial_bert_weights vbert-robust-f1.p --model_out_dir trained_cedr_invalid
python rerank.py --model cedr_knrm --datafiles data/robust/queries.tsv data/robust/documents.tsv \
    --run data/robust/f1.test.run --model_weights trained_cedr_invalid/weights.p --out_path trained_cedr_invalid/test.run

bin/trec_eval -m P.20 data/robust/qrels trained_cedr_invalid/test.run
bin/gdeval.pl -k 20 data/robust/qrels trained_cedr_invalid/test.run

With this setup I'm getting a CEDR-KNRM performance of P@20 = 0.4400 and nDCG@20 = 0.5050. Given these results and the inconsistencies reported in #21, I wonder if the performance of the cedrknrm-robust-f[1-5].run checkpoints is the result of an invalid CEDR-KNRM training and evaluation setup or, more likely, if I did something wrong? Any hints appreciated!

@seanmacavaney
Contributor

Hi Martin! Thanks for reporting. I'm looking into these issues (as well as related #21).

@krasserm
Author

Thanks @seanmacavaney for your fast reply and for looking into it. Much appreciated!

@andrewyates

Hi Martin,

Thank you for pointing this out.

While Sean is looking into the third question (#21), I'll try to provide some information about the others.

Experiment 1: Use provided CEDR-KNRM weights and .run files.

When evaluating the provided cedrknrm-robust-f1.run file in #18 with

bin/trec_eval -m P.20 data/robust/qrels cedrknrm-robust-f1.run
bin/gdeval.pl -k 20 data/robust/qrels cedrknrm-robust-f1.run

I'm getting P@20 = 0.4470 and nDCG@20 = 0.5177. When using a .run file generated with the provided weights cedrknrm-robust-f1.p

python rerank.py --model cedr_knrm --datafiles data/robust/queries.tsv data/robust/documents.tsv \
  --run data/robust/f1.test.run --model_weights cedrknrm-robust-f1.p --out_path cedrknrm-robust-f1.extra.run

bin/trec_eval -m P.20 data/robust/qrels cedrknrm-robust-f1.extra.run
bin/gdeval.pl -k 20 data/robust/qrels cedrknrm-robust-f1.extra.run

I'm getting P@20 = 0.4290 and nDCG@20 = 0.5038. I'd expect these metrics to be equal to those of the provided cedrknrm-robust-f1.run file. What is the reason for this difference?

This appears to be caused by a preprocessing difference. When using Anserini to prepare documents.tsv, I get the same metrics as you (P@20 = 0.4290 and nDCG@20 = 0.5038).

If I use an Indri index built without stopword removal and without stemming, I get P@20 = 0.4470 (from trec_eval) and nDCG@20 = 0.51797 (gdeval).

Indri and Anserini behave differently here: Anserini returns the raw document whereas Indri (via pyndri) returns the document tokens. The project README should probably be updated to point this out. I don't see an obvious reason for the remaining nDCG difference (0.51774 vs. 0.51797), but it may be that my Indri version or config differs from what Sean used.
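
To make the distinction concrete, here is a rough sketch (my own illustration, not code from this repo) of dumping Indri-tokenized documents via pyndri into what I believe is the repo's doc<TAB>id<TAB>text datafile layout, instead of writing Anserini's raw document text:

# Rough sketch: write tokenized document text from an Indri index via pyndri
# (assumes an index built without stemming or stopword removal, as noted above).
import pyndri

index = pyndri.Index('/path/to/indri_index')
_, id2token, _ = index.get_dictionary()  # (token2id, id2token, id2df)

with open('documents.tsv', 'w') as out:
    for internal_id in range(index.document_base(), index.maximum_document()):
        ext_id, token_ids = index.document(internal_id)
        # token id 0 marks terms that are not in the dictionary (e.g. stopped terms)
        text = ' '.join(id2token[t] for t in token_ids if t > 0)
        out.write(f'doc\t{ext_id}\t{text}\n')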

Experiment 2: Train my own BERT and CEDR-KNRM models.

This is where I'm getting results that are far below expectations (only for CEDR-KNRM, not for Vanilla BERT). I started by training and evaluating a Vanilla BERT ranker:

python train.py --model vanilla_bert --datafiles data/robust/queries.tsv data/robust/documents.tsv \
    --qrels data/robust/qrels --train_pairs data/robust/f1.train.pairs --valid_run data/robust/f1.valid.run --model_out_dir trained_bert
python rerank.py --model vanilla_bert --datafiles data/robust/queries.tsv data/robust/documents.tsv \
    --run data/robust/f1.test.run --model_weights trained_bert/weights.p --out_path trained_bert/test.run

bin/trec_eval -m P.20 data/robust/qrels trained_bert/test.run
bin/gdeval.pl -k 20 data/robust/qrels trained_bert/test.run

I'm getting P@20 = 0.3690 and nDCG@20 = 0.4231 which is consistent with evaluating the provided vbert-robust-f1.run file:

bin/trec_eval -m P.20 data/robust/qrels vbert-robust-f1.run
bin/gdeval.pl -k 20 data/robust/qrels vbert-robust-f1.run

This gives P@20 = 0.3550 and nDCG@20 = 0.4219, which comes quite close. I understand that here I simply ignored the inconsistencies reported in #21, but it is at least a coarse cross-check of model performance on a single fold. When training a CEDR-KNRM model with this BERT model as initialization

python train.py --model cedr_knrm --datafiles data/robust/queries.tsv data/robust/documents.tsv \
    --qrels data/robust/qrels --train_pairs data/robust/f1.train.pairs --valid_run data/robust/f1.valid.run \
    --initial_bert_weights trained_bert/weights.p --model_out_dir trained_cedr
python rerank.py --model cedr_knrm --datafiles data/robust/queries.tsv data/robust/documents.tsv \
    --run data/robust/f1.test.run --model_weights trained_cedr/weights.p --out_path trained_cedr/test.run

bin/trec_eval -m P.20 data/robust/qrels trained_cedr/test.run
bin/gdeval.pl -k 20 data/robust/qrels trained_cedr/test.run

I'm getting P@20 = 0.3790 and nDCG@20 = 0.4347. This is slightly better than a Vanilla BERT ranker but far below the performance obtained in Experiment 1. I also repeated Experiment 2 with f1.test.run, f1.valid.run and f1.train.pairs files that I generated myself from Anserini runs with a default BM25 configuration and still get results very close to those above.

Has anyone been able to get results similar to those in Experiment 1 by training a BERT and CEDR-KNRM model as explained in the project's README?

Some background: this repository is a simplified version of an in-house toolkit called srank, which is what was originally used to conduct experiments. The OpenNIR toolkit is based on srank with some experimental/unpublished items removed and other cleanup. (Exporting data from srank to the format used in this repo is the step Sean referred to in #21.)

I have successfully trained CEDR using this repository starting from the pre-trained VBERT weights, but this obviously would be affected by #21 if this issue does turn out to indicate a test data leak. I also spotchecked one robust04 fold (using this repo) where VBERT was also trained. However, after looking through per-fold results from running OpenNIR over the past few days (details below), I don't see this as convincing evidence. The per-fold metrics vary a lot with some looking fine even though the aggregation is lower than expected.

Regardless of #21, the metrics you report (i.e., P@20 = 0.3790 and nDCG@20 = 0.4347) look low to me. As mentioned, I have been training VBERT and CEDR-KNRM using OpenNIR to look for any differences in the two repositories that were missed. In this setting, I get P@20 = 0.4167 and nDCG@20 = 0.4826 with CEDR-KNRM, which are higher than the metrics you obtained.

I've obtained similar metrics to these both with this repository (using Indri preprocessing) and using a different TensorFlow v1 codebase (using Anserini). The difference in our results here may be related to the document preprocessing as with the first experiment, but I'm not confident of this given that Anserini was also used with the TFv1 code. It's worth noting that this TFv1 setting also resulted in several other changes to the training setup (e.g., larger batch size and different sampling approach).

Regardless of how preprocessing may be affecting these results, it's clear that something else is also going on; nDCG@20 = 0.4826 is still lower than expected. My initial theory was that VBERT fine-tuning is sensitive to the random seed (which is effectively different on different hardware even when fixed) as others have observed, but experiments I've run do not support this.
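
(For reference, pinning the seeds in a PyTorch script like train.py usually amounts to something like the generic sketch below; even then, cuDNN kernel selection can differ across GPU models, which is the hardware effect mentioned above.)

# Generic sketch of seed pinning in PyTorch (not this repo's code); results can
# still differ across GPU models even with deterministic settings enabled.
import random
import numpy as np
import torch

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
torch.backends.cudnn.deterministic = True  # prefer deterministic cuDNN kernels
torch.backends.cudnn.benchmark = False     # disable kernel autotuning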

@krasserm
Author

Hi Andrew, thanks for your quick and detailed reply!

Some background: this repository is a simplified version of an in-house toolkit called srank, which is what was originally used to conduct experiments. The OpenNIR toolkit is based on srank with some experimental/unpublished items removed and other cleanup.

I meanwhile started a Vanilla BERT training run using OpenNIR but in order to make sure we're using the same configuration, can you please share the exact Vanilla BERT and CEDR-KNRM training commands you've used?

... I also spotchecked one robust04 fold (using this repo) where VBERT was also trained. However, after looking through per-fold results from running OpenNIR over the past few days (details below), I don't see this as convincing evidence.

Not sure I fully understand. You don't see this as convincing evidence for what?

The per-fold metrics vary a lot with some looking fine even though the aggregation is lower than expected.

I understand that metrics significantly vary for different folds but my main goal was to reproduce the metrics obtained with the provided CEDR-KNRM checkpoint for fold 1 only. Or do you refer to another variance here? Also, can you please elaborate what you mean by aggregation? Is it aggregation of test results on different folds? Sorry for my ignorance, just want to make sure I fully understand.

Regardless of #21, the metrics you report (i.e., P@20 = 0.3790 and nDCG@20 = 0.4347) look low to me. As mentioned, I have been training VBERT and CEDR-KNRM using OpenNIR to look for any differences in the two repositories that were missed. In this setting, I get P@20 = 0.4167 and nDCG@20 = 0.4826 with CEDR-KNRM, which are higher than the metrics you obtained.

I initially trained VBERT and CEDR-KNRM on an 8 GB GTX 1080 and had to set GRAD_ACC_SIZE = 1 in train.py. To make sure that this wasn't the reason for the "low" metrics, I repeated the training run with GRAD_ACC_SIZE = 2 (the default) on another machine with a 12 GB RTX 2080ti and got even worse results: P@20 = 0.3660 and nDCG@20 = 0.4155. The documents were still prepared with Anserini.
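
(For readers unfamiliar with that constant: GRAD_ACC_SIZE controls how many examples go through the model per backward pass while gradients are accumulated before a single optimizer step, so the effective batch size is unchanged. A generic paraphrase, not the repo's exact train.py code:)

# Generic paraphrase of gradient accumulation (illustrative; not train.py).
import torch

BATCH_SIZE = 16
GRAD_ACC_SIZE = 2  # examples per forward/backward pass

model = torch.nn.Linear(10, 1)                     # stand-in for the ranker
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
batch, target = torch.randn(BATCH_SIZE, 10), torch.randn(BATCH_SIZE, 1)

optimizer.zero_grad()
for i in range(0, BATCH_SIZE, GRAD_ACC_SIZE):
    x, y = batch[i:i + GRAD_ACC_SIZE], target[i:i + GRAD_ACC_SIZE]
    loss = torch.nn.functional.mse_loss(model(x), y)
    # scale so the accumulated gradient equals the full-batch mean gradient
    (loss * GRAD_ACC_SIZE / BATCH_SIZE).backward()
optimizer.step()  # one update per full batch, regardless of GRAD_ACC_SIZE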

But before testing using Indri together with this repo, I'd like to give OpenNIR on fold 1 a try. On an 8 GB GTX 1080 I'm currently running Vanilla BERT training with

bash scripts/pipeline.sh config/robust/fold1 config/vanilla_bert trainer.grad_acc_batch=1 ranker.dlen=1000 pipeline.max_epoch=100

but will rerun it later on an RTX 2080ti with the command line you provide. And thanks for sharing OpenNIR with the community, great initiative!

@krasserm
Author

One more question regarding

In this setting, I get P@20 = 0.4167 and nDCG@20 = 0.4826 with CEDR-KNRM, which are higher than the metrics you obtained.

I get similar numbers on the fold 1 validation set. My "low" numbers are obtained on the fold 1 test set i.e. from running the checkpoint with the highest P@20 validation set metric on the test set.

Are you reporting here (a statistic of) the validation set metric? Is the "variance" you mentioned previously the validation metric variance across epochs? Is "aggregation" a (running) average over epochs?

@andrewyates

andrewyates commented Apr 14, 2020

Hi Andrew, thanks for your quick and detailed reply!

Some background: this repository is a simplified version of an in-house toolkit called srank, which is what was originally used to conduct experiments. The OpenNIR toolkit is based on srank with some experimental/unpublished items removed and other cleanup.

I meanwhile started a Vanilla BERT training run using OpenNIR but in order to make sure we're using the same configuration, can you please share the exact Vanilla BERT and CEDR-KNRM training commands you've used?

Sure, here is the script I used.

I made some changes to the config/cedr/_dir file when I was experimenting with getting the results to run. I don't think these are important, but here they are just for the sake of completeness:

(miniconda3-4.5.4) ayates@wks-15-81 OpenNIR-cedr $ git diff config/cedr/_dir 
diff --git a/config/cedr/_dir b/config/cedr/_dir
index 88dab39..dda874e 100644
--- a/config/cedr/_dir
+++ b/config/cedr/_dir
@@ -1,6 +1,12 @@
 vocab=bert
 vocab.train=True
-valid_pred.batch_size=16
+
+#valid_pred.batch_size=16
+valid_pred.batch_size=2
+vocab.bert_base=bert-base-uncased
+
 test_pred.batch_size=16
 trainer.encoder_lr=2e-5
-trainer.grad_acc_batch=2
+
+#trainer.grad_acc_batch=2
+trainer.grad_acc_batch=1

... I also spotchecked one robust04 fold (using this repo) where VBERT was also trained. However, after looking through per-fold results from running OpenNIR over the past few days (details below), I don't see this as convincing evidence.

Not sure I fully understand. You don't see this as convincing evidence for what?

The per-fold metrics vary a lot with some looking fine even though the aggregation is lower than expected.

I understand that metrics significantly vary for different folds but my main goal was to reproduce the metrics obtained with the provided CEDR-KNRM checkpoint for fold 1 only. Or do you refer to another variance here? Also, can you please elaborate what you mean by aggregation? Is it aggregation of test results on different folds? Sorry for my ignorance, just want to make sure I fully understand.

Right, I'm referring to the variance in per-fold metrics (e.g., nDCG@20 for fold 1). I meant that I am not convinced that VBERT was being correctly trained at that time. I missed that you were concentrating on fold 1 only, which changes a bit of what I said about the OpenNIR results (see below). By aggregation I meant averaging the per-query metrics from across all five folds, so I think we're on the same page.

Regardless of #21, the metrics you report (i.e., P@20 = 0.3790 and nDCG@20 = 0.4347) look low to me. As mentioned, I have been training VBERT and CEDR-KNRM using OpenNIR to look for any differences in the two repositories that were missed. In this setting, I get P@20 = 0.4167 and nDCG@20 = 0.4826 with CEDR-KNRM, which are higher than the metrics you obtained.

I initially trained VBERT and CEDR-KNRM on an 8 GB GTX 1080 and had to set GRAD_ACC_SIZE = 1 in train.py. To make sure that this wasn't the reason for the "low" metrics, I repeated the training run with GRAD_ACC_SIZE = 2 (the default) on another machine with a 12 GB RTX 2080ti and got even worse results: P@20 = 0.3660 and nDCG@20 = 0.4155. The documents were still prepared with Anserini.

I missed that these were fold 1 metrics before. Your metrics with GRAD_ACC_SIZE = 1 for fold 1 are very close to the fold 1 metrics I got with OpenNIR. The higher metrics I mentioned are across all folds.

But before testing using Indri together with this repo, I'd like to give OpenNIR on fold 1 a try. On an 8 GB GTX 1080 I'm currently running Vanilla BERT training with

bash scripts/pipeline.sh config/robust/fold1 config/vanilla_bert trainer.grad_acc_batch=1 ranker.dlen=1000 pipeline.max_epoch=100

I think your ranker.dlen is shorter than the one I used. The command looks the same other than that, but I don't know the default configs very well. In the script I provided, I used scripts/wsdm2020_demo.sh as a starting point and mainly removed ranker.add_runscore=True.

@andrewyates

One more question regarding

In this setting, I get P@20 = 0.4167 and nDCG@20 = 0.4826 with CEDR-KNRM, which are higher than the metrics you obtained.

I get similar numbers on the fold 1 validation set. My "low" numbers are obtained on the fold 1 test set i.e. from running the checkpoint with the highest P@20 validation set metric on the test set.

Are you reporting here (a statistic of) the validation set metric? Is the "variance" you mentioned previously the validation metric variance across epochs? Is "aggregation" a (running) average over epochs?

My bad, the confusion comes from the fact that I thought the metrics you reported were across all folds. P@20 = 0.4167 and nDCG@20 = 0.4826 are the test set metrics across folds 1-5. The details are shown here.

@krasserm
Author

Thanks for clarifying Andrew. This means that our fold 1 results are now close (and I don't seem to have any gross errors in my training setup). I meanwhile also trained CEDR-KNRM with OpenNIR and again got results (P@20 = 0.3660 and nDCG@20 = 0.4259) that are close to your fold 1 results and the fold 1 results I got with the CEDR repo. I'll later run your script for training on all folds but I am now quite confident that I can reproduce the numbers you obtained:

P@20 = 0.4167 and nDCG@20 = 0.4826 are the test set metrics across folds 1-5.

Thanks for all your help on this so far! The question remains what caused the gap between these numbers and those reported in the paper, i.e. those obtained from the aggregated results in #18 (P@20 = 0.4667 and nDCG@20 = 0.5381 by evaluating cedrknrm-robust-combined.run). The gap on fold 1 is similarly large. Curious what @seanmacavaney will find out.

@krasserm
Author

@seanmacavaney do you have any updates on this? I'm in the process of selecting potential candidates for a ranking pipeline and it is therefore important for me to be able to reproduce the numbers in the paper.

@seanmacavaney
Contributor

Hi @krasserm -- sorry for the delays. I'm trying to balance a variety of priorities right now, and I have not had much time to dig into this.

@andrewyates

I've spent some time looking into the reproduction issues from a different angle by implementing CEDR-KNRM in a toolkit [1] known to work well with Transformer-based models like PARADE [2].

tl;dr If you're interested in using CEDR, variants replacing BERT with ELECTRA perform well and sometimes exceed the results reported in the paper. If you're curious about why CEDR has been hard to reproduce, I've made some progress, but there are still missing pieces.

Results with BERT base are slightly better than the previous ones. I also tried ELECTRA base, which has performed well elsewhere [2,3], and saw a bigger improvement.

CEDR-KNRM (nDCG@20)   BERT-KNRM (nDCG@20)   VBERT (nDCG@20)   Weights            Passages
0.4872                0.4890                0.4845            BERT-Base          4
0.5004                0.4781                0.4925            BERT +MSMARCO      4
0.5189                0.5339                0.5191            ELECTRA-Base       4
0.5321                0.5268                0.5292            ELECTRA +MSMARCO   4
0.5158                0.5045                0.4992            BERT-Base          8
0.5151                0.5064                0.5135            BERT +MSMARCO      8
0.5410                0.5450                0.5290            ELECTRA-Base       8
0.5513                0.5389                0.5391            ELECTRA +MSMARCO   8

CEDR-KNRM, BERT-KNRM, and VBERT correspond to the models from the paper. +MS MARCO indicates fine-tuning on MS MARCO prior to robust04 (as in [2,3]).

The main model differences compared to this repo are (1) using the weight initialization from PyTorch 0.4 rather than the initialization in 1.0+, and (2) adding a hidden layer to the network that predicts the relevance score. The 4-passage setting considers 900 total tokens, which is close to this repo's setting; the 8-passage setting uses 1800 tokens.
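
To make (1) and (2) concrete, a rough PyTorch sketch of what those two changes could look like follows (my interpretation of the description above, not the actual Capreolus code; the layer sizes are illustrative):

# Illustrative sketch of the two changes described above (not the Capreolus code):
# (2) a hidden layer in the relevance-scoring head, and (1) explicitly applying
# PyTorch 0.4's default nn.Linear initialization, U(-s, s) with s = 1/sqrt(fan_in).
import math
import torch
import torch.nn as nn

class ScoringHead(nn.Module):
    def __init__(self, in_features, hidden=100):  # hidden size is illustrative
        super().__init__()
        self.hidden = nn.Linear(in_features, hidden)
        self.out = nn.Linear(hidden, 1)

    def forward(self, feats):
        return self.out(torch.relu(self.hidden(feats)))

def init_linear_like_pytorch_04(module):
    if isinstance(module, nn.Linear):
        stdv = 1.0 / math.sqrt(module.weight.size(1))
        module.weight.data.uniform_(-stdv, stdv)
        if module.bias is not None:
            module.bias.data.uniform_(-stdv, stdv)

head = ScoringHead(in_features=11 * 5)  # e.g. KNRM kernel features; illustrative
head.apply(init_linear_like_pytorch_04)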

ELECTRA's nDCG@20 sometimes surpasses the paper's best results, but this model was not used originally (and didn't yet exist), of course. ELECTRA is essentially BERT with improved pretraining, so the difference between the two is surprising. Overall, something is still missing though, because the BERT configurations are substantially lower than CEDR-KNRM's ~0.538 nDCG@20 from the paper. Given that the PyTorch version in this repo seems to have been wrong, it's possible an earlier version of pytorch-pretrained-bert was used as well, but I couldn't find any library changes that looked relevant.

You can find additional results and reproduction instructions here.

[1] Capreolus. https://capreolus.ai

[2] PARADE: Passage Representation Aggregation for Document Reranking. Canjia Li, Andrew Yates, Sean MacAvaney, Ben He, and Yingfei Sun. arXiv 2020.

[3] Comparing Score Aggregation Approaches for Document Retrieval with Pretrained Transformers. Xinyu Zhang, Andrew Yates, and Jimmy Lin. ECIR 2021.
