Slow tokenizer is used by default #105

Closed
nikitajz opened this issue Jul 15, 2021 · 6 comments
@nikitajz (Contributor) commented Jul 15, 2021

First of all, thanks for the useful metric and the accompanying code!

I've used it to evaluate a large set of internal documents and it takes a while. Upon reviewing your code, I noticed that the slow (pure-Python) tokenizer is used by default (use_fast=False).
It would be great either to switch the default to the fast tokenizers (written in Rust) or at least to expose this as a parameter, so the library code doesn't have to be hacked to make it work on a massive set of documents (currently the tokenizer is the bottleneck, especially on GPU-powered machines).
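For illustration, a minimal sketch of what exposing this could look like, assuming the tokenizer is loaded through Hugging Face's AutoTokenizer (the helper name and the use_fast_tokenizer flag below are purely hypothetical, not the library's current API):

```python
# Purely illustrative sketch: how tokenizer loading could be parametrized.
# The helper name and the `use_fast_tokenizer` flag are hypothetical,
# not the library's current API.
from transformers import AutoTokenizer

def get_tokenizer(model_type, use_fast_tokenizer=True):
    # use_fast=True selects the Rust-backed tokenizer when one exists for
    # the model; use_fast=False falls back to the pure-Python implementation.
    return AutoTokenizer.from_pretrained(model_type, use_fast=use_fast_tokenizer)

tokenizer = get_tokenizer("roberta-large", use_fast_tokenizer=True)
```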

Thanks!

@felixgwu (Collaborator)
Hi @nikitajz,
We observed an inconsistency with the newer fast tokenizer (see issue #86) which leads to different scores. Once it is fixed, we can definitely switch to it.

@nikitajz (Contributor, Author) commented Jul 16, 2021

Hi @felixgwu,

Thanks for the prompt reply. So the issue is with backward compatibility of previously reported scores and with rescaling, right? But in the case where, say, a few different summaries (e.g. produced by different models) are compared against each other without rescaling, it should be fine? If so, it would be useful to have BERTScore parametrized by tokenizer type, without needing to "hack" the library code.

Update:
I've created a draft PR that implements the above. Feel free to decline it if you have a better solution or believe it's not appropriate.

P.S.: there is an unrelated failing test, test_score_en_sci, on the master branch due to the commented-out model scibert-scivocab-uncased, so I left it untouched.

@felixgwu (Collaborator)
Hi @nikitajz,
Thanks for submitting this pull request. I believe we need to change bert_score/utils.py (the get_hash function), bert_score/scorer.py, and bert_score_cli/score.py accordingly as well. Also, we need to raise an error if the fast tokenizer is specified when using an older version of transformers.
Do you have time to work on these? Otherwise, I can get to them sometime this week.
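Roughly, the version guard could look something like this (a sketch only; the minimum version below is a placeholder, not the actual cutoff):

```python
# Rough sketch of the guard; the minimum version is a placeholder and would
# need to be the first transformers release with a consistent fast tokenizer.
from packaging import version
import transformers

def assert_fast_tokenizer_supported(use_fast_tokenizer, min_version="4.0.0"):
    # Refuse the fast tokenizer when the installed transformers is too old.
    if use_fast_tokenizer and version.parse(transformers.__version__) < version.parse(min_version):
        raise ValueError(
            f"use_fast_tokenizer=True requires transformers>={min_version}, "
            f"but version {transformers.__version__} is installed."
        )
```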

@nikitajz (Contributor, Author)
Hi @felixgwu,

I've updated the mentioned files; hopefully I didn't miss anything.
Please take a look and let me know if it looks good to you.

P.S.:
As a side note, I've updated the tests as well and noticed the use of self.assertTrue for comparing two arrays (tensors). Consider an alternative using NumPy's testing module; the signature is np.testing.assert_almost_equal(actual, desired, decimal=5, err_msg='', verbose=True).
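For example (the tensors below are made-up values, just to show the call):

```python
import numpy as np
import torch

expected = torch.tensor([0.9834, 0.9782, 0.9811])   # dummy values for illustration
actual = torch.tensor([0.98341, 0.97819, 0.98111])

# Instead of a boolean check such as
#     self.assertTrue((expected - actual).abs().max() < 1e-4)
# the numpy helper reports which elements differ and by how much on failure:
np.testing.assert_almost_equal(actual.numpy(), expected.numpy(), decimal=5)
```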

@felixgwu (Collaborator)
Great, thanks! I'll take a look soon.
Yes, it would be great if you could replace those with np.testing.assert_almost_equal.

@felixgwu (Collaborator)
Closing it since it is resolved in #106.
