Hi,
I notice that the derivation of axial attention in the code is as follows:
But in the original version of axial attention it would look like this:
I don't fully understand your code; could you explain it to me? I am looking forward to hearing from you.
Hi, I do not think they are different. There are other works that do axial attention in a similar way.
In this way, we still follow the width and height axes, and the output is the same as in the original version. The difference, I think, is the size of the attention map. Due to hardware limitations, I need to reduce the number of parameters.
Here is what I think. For axial attention, we only need to keep the operations applied along the axis we care about. For example, with mode = 'h', we only need to attend along the height. For the attention output, we only need to make sure the total number of elements equals batch_size * channel * width * height, and then we can reshape it back to (batch_size, channel, width, height).
Also, if you search for self-attention on GitHub, you will find some repositories that use similar ways to define the projected_query.
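As a rough illustration (this is not the exact code from this repository), here is a minimal PyTorch sketch of attention applied only along the height axis. The class name AxialAttentionH, the reduction factor, and the gamma residual weight are placeholders for this example; the key idea is folding the width axis into the batch dimension so the attention map is only (height x height) per column.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AxialAttentionH(nn.Module):
    """Sketch: self-attention applied only along the height axis (mode = 'h')."""
    def __init__(self, in_channels, reduction=8):
        super().__init__()
        # 1x1 convolutions project the input into query/key/value spaces;
        # the query/key channels are reduced to save parameters and memory.
        self.query_conv = nn.Conv2d(in_channels, in_channels // reduction, kernel_size=1)
        self.key_conv = nn.Conv2d(in_channels, in_channels // reduction, kernel_size=1)
        self.value_conv = nn.Conv2d(in_channels, in_channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable residual weight

    def forward(self, x):
        b, c, h, w = x.shape
        # Fold the width axis into the batch dimension so attention is computed
        # independently for every column, i.e. only along the height axis.
        q = self.query_conv(x).permute(0, 3, 2, 1).reshape(b * w, h, -1)  # (b*w, h, c')
        k = self.key_conv(x).permute(0, 3, 1, 2).reshape(b * w, -1, h)    # (b*w, c', h)
        v = self.value_conv(x).permute(0, 3, 2, 1).reshape(b * w, h, c)   # (b*w, h, c)

        # Attention map is (b*w, h, h), much smaller than a full (h*w, h*w) map.
        attn = F.softmax(torch.bmm(q, k), dim=-1)
        out = torch.bmm(attn, v)                                          # (b*w, h, c)

        # Total number of elements is still b * c * w * h, so we can reshape back.
        out = out.reshape(b, w, h, c).permute(0, 3, 2, 1)                 # (b, c, h, w)
        return self.gamma * out + x
```

The output has the same shape as the input, which is the invariant mentioned above: as long as the element count matches batch_size * channel * width * height, the result can be reshaped back to the original feature map layout.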