BERTScore can match contextualized embeddings of [SEP]/[CLS] tokens #180

Open
asumagic opened this issue Mar 19, 2024 · 0 comments

During the IDF dictionary calculation, the weights associated with the special tokens are zeroed:

from collections import defaultdict

idf_dict = defaultdict(lambda: 1.0)
# set idf for [SEP] and [CLS] to 0
idf_dict[tokenizer.sep_token_id] = 0
idf_dict[tokenizer.cls_token_id] = 0

But, to my understanding of the code, this zero weight never actually prevents a non-special-token embedding from getting matched with a [SEP] or [CLS] token embedding.
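
To illustrate with a minimal toy sketch (made-up similarity values; this mirrors the recall side of the greedy matching, not the library's actual code): the zero weight only removes the special tokens' own rows from the weighted average, while nothing masks their columns, so a content token can still pick [SEP] as its best match:

import torch

# sim is a (ref_len, hyp_len) cosine similarity matrix for a toy pair
# whose first/last tokens are [CLS]/[SEP] on both sides
sim = torch.tensor([
    [1.00, 0.10, 0.20],  # ref [CLS]
    [0.10, 0.30, 0.95],  # ref content token; its best match is hyp [SEP]
    [0.20, 0.10, 1.00],  # ref [SEP]
])
ref_idf = torch.tensor([0.0, 1.0, 0.0])  # special tokens zeroed, as above

# recall side of greedy matching: each ref token takes its best hyp match
word_recall = sim.max(dim=1).values  # [1.00, 0.95, 1.00]
recall = (word_recall * ref_idf).sum() / ref_idf.sum()
print(recall)  # tensor(0.9500) -- driven by the match against hyp's [SEP]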

I noticed this because I was obtaining different recall/precision values on certain pairs with a custom implementation. The difference disappears if I stop masking pairs involving a special token in the cosine similarity matrix.

That code looks something like:

ref_mask = self._select_by_tokens(token_masks, ref_tokens)
hyp_mask = self._select_by_tokens(token_masks, hyp_tokens)

# mask rows according to ref_mask and columns according to hyp_mask
# reminder: this is the mask used to mask off special tokens
similarity_matrix[~ref_mask, :] = 0.0
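# note: transpose(1, 2) returns a view, so the next assignment writes
# through to similarity_matrix, zeroing the hyp-side columns in place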
similarity_matrix.transpose(1, 2)[~hyp_mask, :] = 0.0

Testing with no IDF weighting, using google-bert/bert-base-uncased at layer 12 (not a particularly thought-out choice; it's just for the repro), the following pair of sentences reproduces the issue (a repro sketch follows the pair):

  • ref: "WE'LL COME IN HERE THIS AFTERNOON WITH OLD CLOTHES ON AND HAVE A REGULAR HOUSE CLEANING"
  • hyp: "WILL COME IN HERE THIS AFTERNOON WITH OLD CLOTHES ON AND HALF A REGULAR HOUSE CLEANING"
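
For reference, this should be reproducible with the bert-score package's public API along these lines (a sketch; the exact decimals may vary across package and model versions):

from bert_score import score

ref = "WE'LL COME IN HERE THIS AFTERNOON WITH OLD CLOTHES ON AND HAVE A REGULAR HOUSE CLEANING"
hyp = "WILL COME IN HERE THIS AFTERNOON WITH OLD CLOTHES ON AND HALF A REGULAR HOUSE CLEANING"

# candidates first, references second
P, R, F1 = score(
    [hyp], [ref],
    model_type="google-bert/bert-base-uncased",
    num_layers=12,
    idf=False,
)
print(R.item())  # ~0.8233 here, i.e. the "masking disabled" value below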

With my implementation, greedy selection through the matrix shows a difference at the 2nd (non-special) token:

  • masking disabled: 0.70251393, 0.95448172, 0.45837021, ..., resulting in a recall of 0.82332665 (matches bert-score)
  • masking enabled: 0.70251393, 0.18742326, 0.45837021, ..., resulting in a recall of 0.78071225

Inspecting the cosine similarity matrix confirms that 0.95448172 is the similarity between the 2nd reference token and the last hypothesis token ([SEP]).
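
One quick way to confirm this (hypothetical variable names; sim is the (ref_len, hyp_len) cosine similarity matrix for this pair):

best_sim, best_idx = sim.max(dim=1)
# best_idx[1] == sim.shape[1] - 1: the 2nd ref token's greedy match is the
# final hyp position, which holds [SEP]
print(best_sim[1].item(), best_idx[1].item())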

I don't know if this is intended, but since those special tokens are weighted down to 0 in the IDF dict, I assume the intent is to never actually consider them. I have not checked whether this behavior degrades the quality of the metric, so maybe it doesn't matter. In any case, I felt it was worth documenting as an issue.
