(Maybe) A bug in the Hugging Face version of the OFA model? #309

Closed · Sunliangtai opened this issue Dec 5, 2022 · 6 comments

@Sunliangtai

It seems that line 1484 of transformers/src/transformers/models/ofa/modeling_ofa.py has a bug.

I think it should be expanded_attn_mask = _expand_mask(~attention_mask, dtype, tgt_len=input_shape[-1])

Maybe you missed a "~"?

@faychu (Contributor) commented Dec 23, 2022

Thanks for your attention. It is not a bug, since the inversion is performed inside the _expand_mask() function.

@Sunliangtai (Author)

I've noticed that, but the _expand_mask function is defined as follows:

from typing import Optional
import torch

def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None):
    # Broadcast the (bsz, src_len) mask to (bsz, 1, tgt_len, src_len).
    bsz, src_len = mask.size()
    tgt_len = tgt_len if tgt_len is not None else src_len
    expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)
    # Invert, then fill the inverted positions with the dtype's most negative value.
    inverted_mask = 1.0 - expanded_mask
    return inverted_mask.masked_fill(inverted_mask.bool(), torch.finfo(dtype).min)

Let's consider a 2-dimensional example: suppose attention_mask is [0,0,0,1,1,1], where 1 means masked. If no invert operation is applied, the input mask is [0,0,0,1,1,1] (let's ignore the expand operation). Then inverted_mask is [1,1,1,0,0,0], and the returned value will be [-inf, -inf, -inf, 0, 0, 0].
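To make this concrete, here is a minimal runnable sketch of the example above, reusing the _expand_mask definition quoted earlier (note that torch.finfo(torch.float32).min prints as -3.4028e+38 rather than a literal -inf):

import torch

mask = torch.tensor([[0., 0., 0., 1., 1., 1.]])  # 1 = masked, per the convention above
bias = _expand_mask(mask, torch.float32)
print(bias[0, 0, 0])
# tensor([-3.4028e+38, -3.4028e+38, -3.4028e+38, 0.0000e+00, 0.0000e+00, 0.0000e+00])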

Is this right, or am I making a mistake somewhere? Also, in my experiments I get better performance after adding the ~.

Furthermore, the same problem occurs at line 1940 of transformers/src/transformers/models/ofa/modeling_ofa.py:
attention_mask = decoder_input_ids.new_ones(decoder_input_ids.shape)
I think it should be attention_mask = decoder_input_ids.new_zeros(decoder_input_ids.shape).
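For reference, a quick sketch of what each default produces (decoder_input_ids here is a dummy tensor for illustration, not OFA's actual input):

import torch

decoder_input_ids = torch.tensor([[5, 7, 9]])          # dummy token ids
decoder_input_ids.new_ones(decoder_input_ids.shape)    # tensor([[1, 1, 1]])
decoder_input_ids.new_zeros(decoder_input_ids.shape)   # tensor([[0, 0, 0]])

Which default is correct depends on whether 1 means "attend" (the HF convention) or "masked" (the convention assumed in the example above).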

@faychu (Contributor) commented Dec 23, 2022

Yes, your deduction is right: the returned value is [-inf, -inf, -inf, 0, 0, 0] in that example.
Actually, our logic is:
masks for tokens that should be attended are 0, and masks for tokens that should not be attended are -inf.
Our provided checkpoints were trained under this logic in the fairseq framework, so our Hugging Face version follows the same logic to ensure the checkpoints can be loaded correctly without performance degradation.
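As a schematic illustration of this logic (not OFA's actual implementation), the expanded mask acts as an additive bias on the attention scores, assuming the usual scores-plus-bias-then-softmax pattern:

import torch

scores = torch.randn(1, 1, 6, 6)              # (bsz, heads, tgt_len, src_len) logits
neg = torch.finfo(torch.float32).min          # stands in for -inf
bias = torch.tensor([0., 0., 0., neg, neg, neg]).expand(1, 1, 6, 6)
attn = torch.softmax(scores + bias, dim=-1)   # the -inf positions get ~0 weight
print(attn[0, 0, 0])                          # last three entries are ~0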

@Sunliangtai (Author)

Umm... you mean the masked tokens should be attended, while the other tokens should not be?

@faychu (Contributor) commented Dec 23, 2022

Following HF logic, the original attention mask should be 0 for masked tokens (not attended) and 1 for unmasked tokens (attended). The _expand_mask function then translates the mask, putting -inf at positions with value 0 and 0 at positions with value 1.
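So with an HF-convention mask the function behaves as intended; a minimal sketch, again reusing the _expand_mask definition quoted above:

import torch

attention_mask = torch.tensor([[1., 1., 1., 0., 0., 0.]])  # 1 = attend, 0 = masked (HF convention)
bias = _expand_mask(attention_mask, torch.float32)
print(bias[0, 0, 0])
# tensor([0.0000e+00, 0.0000e+00, 0.0000e+00, -3.4028e+38, -3.4028e+38, -3.4028e+38])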

@Sunliangtai (Author)

Thanks a lot.
