
bug fix - alibi causal: for ith query, bias is m · [−(i − 1), ..., −2, −1, 0] for the first i keys #7105

Closed

Conversation

LydiaXiaohongLi

What does this PR do?

Fix alibi position embedding for causal attention.

Collection: NLP

Changelog

Existing: returns a bias of shape (1, num_heads, 1, key_length)
Fixed: returns a bias of shape (1, num_heads, query_length, key_length), where for the ith query the bias is m · [−(i − 1), ..., −2, −1, 0] over the first i keys, as described in the ALiBi paper (see the sketch below)
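
For reference, a minimal sketch (not the actual NeMo implementation; the slope formula assumes num_heads is a power of two, as in the ALiBi paper) of how such a full causal bias can be built:

```python
import torch

def build_causal_alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Sketch: full causal ALiBi bias of shape (1, num_heads, seq_len, seq_len)."""
    # Per-head slopes m_h = 2 ** (-8 * (h + 1) / num_heads); assumes num_heads
    # is a power of two, as in the ALiBi paper.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])

    # relative[i, j] = j - i, clamped to 0 above the diagonal (those keys are
    # masked anyway), so row i reads [-i, ..., -2, -1, 0, 0, ..., 0].
    pos = torch.arange(seq_len)
    relative = (pos.view(1, -1) - pos.view(-1, 1)).clamp(max=0).float()

    # bias[0, h, i, j] = m_h * (j - i) for the keys j <= i
    return (slopes.view(num_heads, 1, 1) * relative).unsqueeze(0)
```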

PR Type:

  • Bugfix

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list the specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

@hsiehjackson
Collaborator

If it is causal attention, we can apply a singleton trick to the ALiBi attention bias. We don't need a (1, num_heads, query_length, key_length) bias; (1, num_heads, 1, key_length) is enough, because after softmax the results are the same.
The original bias, for a sequence of length 5, is the following:

[ 0,  0,  0,  0, 0]
[-1,  0,  0,  0, 0]
[-2, -1,  0,  0, 0]
[-3, -2, -1,  0, 0]
[-4, -3, -2, -1, 0]

A singleton trick bias can be the following:

[-4, -3, -2, -1, 0]
[-4, -3, -2, -1, 0]
[-4, -3, -2, -1, 0]
[-4, -3, -2, -1, 0]
[-4, -3, -2, -1, 0]

Applying the causal mask to the singleton bias keeps only the entries on and below the diagonal (masked positions, shown here as 0, actually receive −∞ from the mask):

[-4,  0,  0,  0, 0]
[-4, -3,  0,  0, 0]
[-4, -3, -2,  0, 0]
[-4, -3, -2, -1, 0]
[-4, -3, -2, -1, 0]

Within each row, the unmasked entries differ from the original bias only by a constant, and softmax is invariant to adding a constant to a row, so the attention weights after softmax are identical.
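
For example, here is a quick self-contained check (illustrative single head, arbitrary slope and random scores; not the NeMo code) that the full per-query bias and the singleton bias give identical attention weights under a causal mask:

```python
import torch

torch.manual_seed(0)
seq_len, m = 5, 0.5                      # one head, illustrative slope m
scores = torch.randn(seq_len, seq_len)   # raw attention scores q·k^T

# Full per-query bias: bias[i, j] = m * (j - i), clamped to 0 above the diagonal
pos = torch.arange(seq_len)
full_bias = m * (pos.view(1, -1) - pos.view(-1, 1)).clamp(max=0).float()

# Singleton bias: a single row m * [-(seq_len - 1), ..., -1, 0] broadcast to all queries
singleton_bias = m * (pos - (seq_len - 1)).float().view(1, -1)

# Causal mask: queries may not attend to future keys
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

def masked_softmax(x):
    return torch.softmax(x.masked_fill(mask, float("-inf")), dim=-1)

# Per row the two biases differ only by a constant, which cancels in softmax
print(torch.allclose(masked_softmax(scores + full_bias),
                     masked_softmax(scores + singleton_bias)))  # True
```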

For more details, see the author's code and the related discussion:
Code: https://github.com/ofirpress/attention_with_linear_biases/blob/master/fairseq/models/transformer.py#L760-L762
Discussion: ofirpress/attention_with_linear_biases#5

One small difference from the author's singleton trick: the author uses positions

[0, 1, 2, 3, 4, 5, ..., n]

while our implementation uses

[-n, ..., -5, -4, -3, -2, -1, 0]

The two vectors differ only by the constant n, so after softmax they give the same attention weights.
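
A tiny check of that equivalence (illustrative, with n = 4):

```python
import torch

n = 4
author = torch.arange(n + 1).float()   # [0, 1, 2, 3, 4]
ours = torch.arange(-n, 1).float()     # [-4, -3, -2, -1, 0]

# ours = author - n, and softmax is invariant to a constant shift,
# so both give the same attention distribution
print(torch.allclose(torch.softmax(author, -1), torch.softmax(ours, -1)))  # True
```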

@LydiaXiaohongLi
Author

Thank you so much!
