(Maybe) A bug in the Hugging Face version of the OFA model? #309

Closed · Sunliangtai opened this issue Dec 5, 2022 · 6 comments

@Sunliangtai

It seems that line 1484 of transformers/src/transformers/models/ofa/modeling_ofa.py has a bug.

I think it should be expanded_attn_mask = _expand_mask(~attention_mask, dtype, tgt_len=input_shape[-1])

Maybe you missed a "~"?

@faychu (Contributor) commented Dec 23, 2022

Thanks for your attention. It is not a bug, since the inversion is performed inside the _expand_mask() function.

@Sunliangtai (Author)

I've noticed that, but the _expand_mask function is defined as follows:

from typing import Optional
import torch

def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None):
    # Broadcast the (bsz, src_len) mask to (bsz, 1, tgt_len, src_len).
    bsz, src_len = mask.size()
    tgt_len = tgt_len if tgt_len is not None else src_len
    expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)
    # Invert, then fill the inverted positions with the dtype's most negative value.
    inverted_mask = 1.0 - expanded_mask
    return inverted_mask.masked_fill(inverted_mask.bool(), torch.finfo(dtype).min)

Let's consider a 2-dimensional example: suppose attention_mask is [0,0,0,1,1,1], where 1 means masked. If no invert operation is applied, the input mask is [0,0,0,1,1,1] (let's ignore the expand operation). Then inverted_mask is [1,1,1,0,0,0], and the returned value will be [-inf, -inf, -inf, 0, 0, 0].
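To make this concrete, here is a minimal runnable sketch of the example above, reusing the _expand_mask definition quoted earlier (note that torch.finfo(torch.float32).min prints as -3.4028e+38 rather than a literal -inf):

import torch

mask = torch.tensor([[0., 0., 0., 1., 1., 1.]])  # 1 = masked, per the convention above
bias = _expand_mask(mask, torch.float32)
print(bias[0, 0, 0])
# tensor([-3.4028e+38, -3.4028e+38, -3.4028e+38, 0.0000e+00, 0.0000e+00, 0.0000e+00])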

Is this right, or am I making a mistake somewhere? Also, in my experiments I get better performance after adding the ~.

Furthermore, the same problem occurs at line 1940 of transformers/src/transformers/models/ofa/modeling_ofa.py:
attention_mask = decoder_input_ids.new_ones(decoder_input_ids.shape)
I think it should be attention_mask = decoder_input_ids.new_zeros(decoder_input_ids.shape).
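For reference, a quick sketch of what each default produces (decoder_input_ids here is a dummy tensor for illustration, not OFA's actual input):

import torch

decoder_input_ids = torch.tensor([[5, 7, 9]])          # dummy token ids
decoder_input_ids.new_ones(decoder_input_ids.shape)    # tensor([[1, 1, 1]])
decoder_input_ids.new_zeros(decoder_input_ids.shape)   # tensor([[0, 0, 0]])

Which default is correct depends on whether 1 means "attend" (the HF convention) or "masked" (the convention assumed in the example above).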

@faychu (Contributor) commented Dec 23, 2022

Yes, your deduction is right: the returned value is [-inf, -inf, -inf, 0, 0, 0] in that example.
Actually, our logic is:
masks for tokens that should be attended are 0, and masks for tokens that should not be attended are -inf.
Our provided checkpoints were trained under this logic in the fairseq framework, so our Hugging Face version follows the same logic to ensure the checkpoints can be loaded correctly without performance degradation.
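As a schematic illustration of this logic (not OFA's actual implementation), the expanded mask acts as an additive bias on the attention scores, assuming the usual scores-plus-bias-then-softmax pattern:

import torch

scores = torch.randn(1, 1, 6, 6)              # (bsz, heads, tgt_len, src_len) logits
neg = torch.finfo(torch.float32).min          # stands in for -inf
bias = torch.tensor([0., 0., 0., neg, neg, neg]).expand(1, 1, 6, 6)
attn = torch.softmax(scores + bias, dim=-1)   # the -inf positions get ~0 weight
print(attn[0, 0, 0])                          # last three entries are ~0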

@Sunliangtai (Author)

Umm... you mean the masked tokens should be attended, while the other tokens should not be?

@faychu (Contributor) commented Dec 23, 2022

Following HF logic, the original attention mask should be 0 for masked tokens (not attended) and 1 for unmasked tokens (attended). The _expand_mask function then translates the mask, putting -inf at positions with value 0 and 0 at positions with value 1.
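So with an HF-convention mask the function behaves as intended; a minimal sketch, again reusing the _expand_mask definition quoted above:

import torch

attention_mask = torch.tensor([[1., 1., 1., 0., 0., 0.]])  # 1 = attend, 0 = masked (HF convention)
bias = _expand_mask(attention_mask, torch.float32)
print(bias[0, 0, 0])
# tensor([0.0000e+00, 0.0000e+00, 0.0000e+00, -3.4028e+38, -3.4028e+38, -3.4028e+38])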

@Sunliangtai (Author)

Thanks a lot.
