Hi,
I notice that the derivation of axial attention in the code is as follows:
But in the original version of axial attention it would look like this:
I don't fully understand your code; could you explain it to me? I am looking forward to hearing from you.
Hi, I do not think they are different. There are other works that do axial attention in a similar way.
In this way, we still follow the width and height axes, and the output is the same as in the original version. The difference, I think, is the size of the attention map. Due to hardware limitations, I need to reduce the number of parameters.
Here is what I think. For axial attention, we only need to keep the operations applied along the axis we care about. For example, with mode = 'h', we only need to attend along the height. For the attention output, we only need to make sure the total number of elements equals batch_size * channel * width * height, and then we can reshape it back to (batch_size, channel, width, height).
Also, if you search for self-attention on GitHub, you will find some repositories that use similar ways to define the projected_query.
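As a rough illustration (this is not the exact code from this repository), here is a minimal PyTorch sketch of attention applied only along the height axis. The class name AxialAttentionH, the reduction factor, and the gamma residual weight are placeholders for this example; the key idea is folding the width axis into the batch dimension so the attention map is only (height x height) per column.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AxialAttentionH(nn.Module):
    """Sketch: self-attention applied only along the height axis (mode = 'h')."""
    def __init__(self, in_channels, reduction=8):
        super().__init__()
        # 1x1 convolutions project the input into query/key/value spaces;
        # the query/key channels are reduced to save parameters and memory.
        self.query_conv = nn.Conv2d(in_channels, in_channels // reduction, kernel_size=1)
        self.key_conv = nn.Conv2d(in_channels, in_channels // reduction, kernel_size=1)
        self.value_conv = nn.Conv2d(in_channels, in_channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable residual weight

    def forward(self, x):
        b, c, h, w = x.shape
        # Fold the width axis into the batch dimension so attention is computed
        # independently for every column, i.e. only along the height axis.
        q = self.query_conv(x).permute(0, 3, 2, 1).reshape(b * w, h, -1)  # (b*w, h, c')
        k = self.key_conv(x).permute(0, 3, 1, 2).reshape(b * w, -1, h)    # (b*w, c', h)
        v = self.value_conv(x).permute(0, 3, 2, 1).reshape(b * w, h, c)   # (b*w, h, c)

        # Attention map is (b*w, h, h), much smaller than a full (h*w, h*w) map.
        attn = F.softmax(torch.bmm(q, k), dim=-1)
        out = torch.bmm(attn, v)                                          # (b*w, h, c)

        # Total number of elements is still b * c * w * h, so we can reshape back.
        out = out.reshape(b, w, h, c).permute(0, 3, 2, 1)                 # (b, c, h, w)
        return self.gamma * out + x
```

The output has the same shape as the input, which is the invariant mentioned above: as long as the element count matches batch_size * channel * width * height, the result can be reshaped back to the original feature map layout.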