
WMT15 BaryScore #3

Closed

jbgruenwald opened this issue Feb 4, 2022 · 10 comments

@jbgruenwald

I also have a problem with reproducing your results. If I run your command_line.sh on WMT15 de-en, I get a Pearson correlation of -0.3559000480869218, but in your paper you reported 75.9. How did you run these experiments?

@PierreColombo (Owner) commented Feb 4, 2022

Hi,
Thanks for opening the issue! This is indeed a problem.
We report absolute Pearson correlations.
Which model/layers do you use? What \epsilon do you use? And which distance/divergence do you use?
Kindest regards,
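A minimal sketch of that reporting convention, assuming the paper's 75.9 is |Pearson r| × 100; the file names are hypothetical placeholders, not files from this repo:

```python
# Minimal sketch (not from this repo): compare a metric's segment-level
# scores against human judgments and report |Pearson r| * 100, as the
# paper is assumed to do. File names below are hypothetical.
import numpy as np
from scipy.stats import pearsonr

metric_scores = np.loadtxt("baryscore_wmt15_deen.txt")  # hypothetical path
human_scores = np.loadtxt("human_judgments_deen.txt")   # hypothetical path

r, _ = pearsonr(metric_scores, human_scores)
print(f"absolute Pearson correlation: {100 * abs(r):.1f}")
```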

@jbgruenwald (Author)

I didn't change the model/layers/epsilon, just the default settings from the command_line.sh in your latest commit. The only things I changed were adding the lines for the Pearson correlation and putting the WMT15 newstest de-en data into the samples folder.

@PierreColombo (Owner)

Can you run it considering the last 3/5 layers and check the different divergences/metrics?
As mentioned in the mail, this code was developed while I was at IBM, which is why I cannot publish the full project.

You should be able to reproduce the exact same results (a co-author did), but I cannot tell you the exact parameters right now. I will try to get access to the file.
Did you try to reproduce the results for BLEU / BERTScore / METEOR?
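A hedged sketch of the suggested sweep, following the BaryScoreMetric usage shown in this repo's README; the constructor argument names (model_name, last_layers) are assumptions here, and the sentence pair is a placeholder:

```python
from bary_score import BaryScoreMetric  # module name as in this repo

# Placeholder hypothesis/reference pair.
hyps = ["the cat sat on the mat"]
refs = ["a cat was sitting on the mat"]

for last_layers in (3, 5):  # the "last 3/5 layers" suggested above
    metric = BaryScoreMetric(model_name="bert-base-uncased",
                             last_layers=last_layers)  # assumed kwargs
    metric.prepare_idfs(hyps, refs)
    scores = metric.evaluate_batch(hyps, refs)
    # The returned dict holds the Wasserstein variant ('baryscore_W')
    # and Sinkhorn-divergence variants, one per epsilon.
    print(last_layers, scores["baryscore_W"])
```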

@jbgruenwald (Author) commented Feb 4, 2022

OK, I just tried various layers: the last 5 layers give -0.3559000480869218, the last 3 layers give -0.44103185124368643; some other values were similar.

If I try to run DepthScore on the same data, I get a similar correlation. By the way, DepthScore logs some warnings:

```
/home/repos/nlg_eval_via_simi_measures/depth_score.py:200: RuntimeWarning: divide by zero encountered in true_divide
  square_inv_matrix = u / np.sqrt(s)
/home/repos/nlg_eval_via_simi_measures/depth_score.py:202: RuntimeWarning: invalid value encountered in matmul
  return X_transf @ square_inv_matrix
```

and InfoLM even stops with an error:

```
Traceback (most recent call last):
  File "score_cli.py", line 110, in <module>
    main()
  File "score_cli.py", line 98, in main
    preds = metric.evaluate_batch(candidate_batch, golden_batch, idf_hyps=idf_hyps, idf_ref=idf_ref)
TypeError: evaluate_batch() got an unexpected keyword argument 'idf_hyps'
```

Yes, for e.g. BERTScore I get the results that the authors reported in their paper.
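The DepthScore warnings above point at a division by singular values that can be zero in the whitening step; a minimal sketch of one possible guard, assuming s holds the singular values from an SVD (an assumption, not the repo's actual fix):

```python
import numpy as np

def safe_whiten(X_transf, u, s, eps=1e-12):
    # Hypothetical guard for depth_score.py lines 200-202: clamp
    # near-zero singular values before dividing, so u / sqrt(s) stays
    # finite and the matmul below does not propagate NaNs.
    s_safe = np.clip(s, eps, None)
    square_inv_matrix = u / np.sqrt(s_safe)
    return X_transf @ square_inv_matrix
```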

@jbgruenwald (Author)

Or could you please tell me the results for the examples in the samples folder? Maybe that helps in finding the solution.

@PierreColombo (Owner)

There are no results on WMT15 in the BERTScore paper. Can you tell me the correlations you get with BLEU and BERTScore?

@jbgruenwald (Author)

BERTScore published results on WMT18, and there is a list of how BERTScore performs using various models on WMT16 linked on their GitHub repo: https://docs.google.com/spreadsheets/d/1RKOVpselB98Nnh_EOC4A2BYn8_201tmPODpNWu4w7xI/edit#gid=0

BERTScore gives me a correlation of 0.7485 on WMT15; BLEU I haven't tried yet, I'll send that number later.

@PierreColombo (Owner) commented Feb 5, 2022

Since you mention different scores and different datasets, I released some of the raw scores; see the raw score folders.

Thanks for catching the typos in the CLI; they have been corrected.
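One plausible shape of that correction, assuming score_cli.py was forwarding the idf keywords to every metric; the names come from the traceback earlier in this thread, and this is a sketch, not the actual commit:

```python
import inspect

def call_metric(metric, candidate_batch, golden_batch,
                idf_hyps=None, idf_ref=None):
    """Hypothetical guard: forward the idf keywords only to metrics
    whose evaluate_batch() accepts them (InfoLM's does not, per the
    TypeError reported above)."""
    kwargs = {}
    if "idf_hyps" in inspect.signature(metric.evaluate_batch).parameters:
        kwargs = {"idf_hyps": idf_hyps, "idf_ref": idf_ref}
    return metric.evaluate_batch(candidate_batch, golden_batch, **kwargs)
```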

@jbgruenwald (Author)

Thanks a lot for these scores, that looks really helpful. Do you have the script that created these files in your repo? I couldn't configure score_cli.py to reproduce these models.

For example, does roberta-base_lm_wsw_nbarycentersTrue_range(8, 13) mean you used roberta-base in BaryScore? Then I guess range(8, 13) means last_layers=5? Does nbarycentersTrue mean you take the 'baryscore_W' from the score dictionary? And what do lm and wsw mean?
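One hedged reading of the range(8, 13) suffix: if it indexes the 13 hidden states of roberta-base (the embedding output plus 12 transformer layers), it selects the last five, which would match last_layers=5:

```python
# Assumed indexing: hidden state 0 is the embedding output, 1-12 are
# the transformer layers, so range(8, 13) picks the last five states.
layers = list(range(8, 13))
print(layers)  # [8, 9, 10, 11, 12]
```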

@PierreColombo (Owner)

I updated the raw score folder for full reproducibility. You can now reproduce the results.
The issue is solved, so I am closing it.
