
About Axial-Attention #11

Closed
Liqq1 opened this issue Feb 25, 2022 · 2 comments

Comments

Liqq1 commented Feb 25, 2022

Hi,
I noticed that the derivation of axial attention in the code is as follows:

[image: screenshot of the axial-attention code in this repository]

But in the original version of axial attention it would look like this:

[image: screenshot of the original axial-attention formulation]

I don't fully understand your code; could you explain it to me? I am looking forward to hearing from you.

@AngeLouCN (Owner)

Hi, I do not think they are different. There are some other works that do axial attention in a similar way.
In this approach we also follow the width and height axes, and the output is the same as in the original version. The difference, I think, is the size of the attention map: due to hardware limitations, I need to reduce the parameters.
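
Roughly, the difference in attention-map size looks like this. The sketch below only illustrates the idea and is not the code of this repository; the tensor sizes and the channel-reduction factor `r` are assumptions:

```python
# Minimal sketch: attention-map size of full 2D self-attention vs. height-axis
# axial attention. Shapes and the reduction factor `r` are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

B, C, H, W, r = 2, 64, 32, 32, 8
x = torch.randn(B, C, H, W)
to_q = nn.Conv2d(C, C // r, kernel_size=1)  # reduced query/key channels -> fewer parameters
to_k = nn.Conv2d(C, C // r, kernel_size=1)

# Full 2D self-attention: one (H*W) x (H*W) map per image.
q = to_q(x).flatten(2).transpose(1, 2)                      # (B, H*W, C//r)
k = to_k(x).flatten(2)                                      # (B, C//r, H*W)
full_map = F.softmax(q @ k, dim=-1)                         # (B, H*W, H*W)

# Height-axis axial attention: one H x H map per column, batched over B*W.
qh = to_q(x).permute(0, 3, 2, 1).reshape(B * W, H, C // r)  # (B*W, H, C//r)
kh = to_k(x).permute(0, 3, 1, 2).reshape(B * W, C // r, H)  # (B*W, C//r, H)
axial_map = F.softmax(qh @ kh, dim=-1)                      # (B*W, H, H)

print(full_map.shape)    # torch.Size([2, 1024, 1024])
print(axial_map.shape)   # torch.Size([64, 32, 32])
```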

@AngeLouCN (Owner)

You can try to add permute(0, 1, 3, 2).

Here is what I think. For axial attention, we only need to keep the operations that are applied along the axis we care about. For example, with mode = 'h' we only need to look at the height. For the output of the attention, we only need to make sure the total number of elements equals batch_size * channel * width * height, and then we can reshape it to (batch_size, channel, width, height).

Also, if you search for self-attention on GitHub, you will find some repositories that use similar ways to define the projected query.
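
To make the reshaping concrete, here is a minimal sketch of a height-axis (mode = 'h') block along these lines. It is not the code of this repository; the layer names, the channel-reduction factor, and the gamma-scaled residual are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeightAxialAttention(nn.Module):
    """Sketch of attention applied only along the height axis."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.to_q = nn.Conv2d(channels, channels // reduction, 1)
        self.to_k = nn.Conv2d(channels, channels // reduction, 1)
        self.to_v = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        b, c, h, w = x.shape
        # Move the width axis into the batch so the attention only sees the height
        # axis: every column of every image becomes an independent attention problem.
        q = self.to_q(x).permute(0, 3, 2, 1).reshape(b * w, h, -1)  # (B*W, H, C//r)
        k = self.to_k(x).permute(0, 3, 1, 2).reshape(b * w, -1, h)  # (B*W, C//r, H)
        v = self.to_v(x).permute(0, 3, 1, 2).reshape(b * w, c, h)   # (B*W, C, H)

        attn = F.softmax(q @ k, dim=-1)    # (B*W, H, H): one H x H map per column
        out = v @ attn.transpose(1, 2)     # (B*W, C, H)

        # The total element count is still b * c * h * w, so the output reshapes
        # back to the input layout losslessly.
        out = out.reshape(b, w, c, h).permute(0, 2, 3, 1)           # (B, C, H, W)
        return self.gamma * out + x

x = torch.randn(2, 64, 32, 32)
print(HeightAxialAttention(64)(x).shape)   # torch.Size([2, 64, 32, 32])
```

A width-axis (mode = 'w') block would be symmetric: move the height axis into the batch instead and attend over the W positions.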
