Slow tokenizer is used by default #105

Closed
nikitajz opened this issue Jul 15, 2021 · 6 comments
@nikitajz (Contributor) commented Jul 15, 2021

First of all, thanks for the useful metric and the accompanying code!

I've used it to evaluate a large set of internal documents and it takes a while. Upon reviewing your code, I noticed that the slow (pure-Python) tokenizer is used by default (use_fast=False).
It would be great either to switch the default to the fast tokenizers (written in Rust) or at least to expose this as a parameter, so the library code doesn't have to be hacked to make it work on a massive set of documents (currently the tokenizer is the bottleneck, especially on GPU-powered machines).
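For illustration, a minimal sketch of what exposing this could look like, assuming the tokenizer is loaded through Hugging Face's AutoTokenizer (the helper name and the use_fast_tokenizer flag below are purely hypothetical, not the library's current API):

```python
# Purely illustrative sketch: how tokenizer loading could be parametrized.
# The helper name and the `use_fast_tokenizer` flag are hypothetical,
# not the library's current API.
from transformers import AutoTokenizer

def get_tokenizer(model_type, use_fast_tokenizer=True):
    # use_fast=True selects the Rust-backed tokenizer when one exists for
    # the model; use_fast=False falls back to the pure-Python implementation.
    return AutoTokenizer.from_pretrained(model_type, use_fast=use_fast_tokenizer)

tokenizer = get_tokenizer("roberta-large", use_fast_tokenizer=True)
```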

Thanks!

@felixgwu (Collaborator)
Hi @nikitajz,
We observed an inconsistency with the newer fast tokenizer (see issue #86) which leads to different scores. Once it is fixed, we can definitely switch to it.

@nikitajz (Contributor, Author) commented Jul 16, 2021

Hi @felixgwu,

Thanks for the prompt reply. So the issue is with backward compatibility of previously reported scores and with rescaling, right? But in the case where, say, a few different summaries (e.g. produced by different models) are compared against each other without rescaling, it should be fine? If so, it would be useful to have BERTScore parametrized by tokenizer type, without needing to "hack" the library code.

Update:
I've created a draft PR that implements the above. Feel free to decline it if you have a better solution or believe it's not appropriate.

P.S.: there is an unrelated failing test, test_score_en_sci, on the master branch due to the commented-out model scibert-scivocab-uncased, so I left it untouched.

@felixgwu (Collaborator)
Hi @nikitajz,
Thanks for submitting this pull request. I believe we need to change bert_score/utils.py (the get_hash function), bert_score/scorer.py, and bert_score_cli/score.py accordingly as well. Also, we need to raise an error if the fast tokenizer is specified when using an older version of transformers.
Do you have time to work on these? Otherwise, I can get to them sometime this week.
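Roughly, the version guard could look something like this (a sketch only; the minimum version below is a placeholder, not the actual cutoff):

```python
# Rough sketch of the guard; the minimum version is a placeholder and would
# need to be the first transformers release with a consistent fast tokenizer.
from packaging import version
import transformers

def assert_fast_tokenizer_supported(use_fast_tokenizer, min_version="4.0.0"):
    # Refuse the fast tokenizer when the installed transformers is too old.
    if use_fast_tokenizer and version.parse(transformers.__version__) < version.parse(min_version):
        raise ValueError(
            f"use_fast_tokenizer=True requires transformers>={min_version}, "
            f"but version {transformers.__version__} is installed."
        )
```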

@nikitajz (Contributor, Author)
Hi @felixgwu,

I've updated the mentioned files; hopefully I didn't miss anything.
Please take a look and let me know if it looks good to you.

P.S.:
As a side note, I've updated the tests as well and noticed the use of self.assertTrue for comparing two arrays (tensors). Consider an alternative using NumPy's testing module; the signature is np.testing.assert_almost_equal(actual, desired, decimal=5, err_msg='', verbose=True).
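For example (the tensors below are made-up values, just to show the call):

```python
import numpy as np
import torch

expected = torch.tensor([0.9834, 0.9782, 0.9811])   # dummy values for illustration
actual = torch.tensor([0.98341, 0.97819, 0.98111])

# Instead of a boolean check such as
#     self.assertTrue((expected - actual).abs().max() < 1e-4)
# the numpy helper reports which elements differ and by how much on failure:
np.testing.assert_almost_equal(actual.numpy(), expected.numpy(), decimal=5)
```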

@felixgwu (Collaborator)
Great, thanks! I'll take a look soon.
Yes, it would be great if you could replace those with np.testing.assert_almost_equal.

@felixgwu (Collaborator)
Closing it since it is resolved in #106.
