Question about Local Self-Attention of your code #6
Comments
Hi, I'm very interested in your work on Local Self-Attention and feature fusion in Transformers, but I have a question. The input image size for the image classification task in the source code is fixed at 224 or 384, i.e. an integer multiple of 32. If the input size is not fixed, as in a detection task where the input is, say, 800x1333, the feature map can still be divided into windows of the window size by padding, but how should the key_padding_mask be handled in that case?
The shape of the attention weight map is [bs x H/7 x W/7, 49, 49] (the default window size is 7), but the key padding mask has shape [1, HW]. How can I convert this mask to match the attention weight map?
I sincerely hope you can give me some advice on this question. Thanks!
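One possible approach (a minimal sketch, not code from this repository; the function name and interface are hypothetical) is to pad the `[1, HW]` mask the same way the feature map is padded, partition it into the same non-overlapping windows, and broadcast it over the query dimension so it matches the per-window `[num_windows, 49, 49]` attention weights:

```python
import torch
import torch.nn.functional as F


def window_partition_mask(key_padding_mask, H, W, window_size=7):
    """Convert a [1, H*W] key padding mask (True = padded position)
    into a per-window key mask broadcastable over the attention map.

    Returns a bool tensor of shape [num_windows, 1, window_size**2],
    which broadcasts against attention weights of shape
    [num_windows, window_size**2, window_size**2].
    """
    mask = key_padding_mask.view(1, H, W)

    # Pad bottom/right to a multiple of the window size, mirroring the
    # padding applied to the feature map; padded cells are masked (True).
    pad_b = (window_size - H % window_size) % window_size
    pad_r = (window_size - W % window_size) % window_size
    mask = F.pad(mask, (0, pad_r, 0, pad_b), value=True)
    Hp, Wp = H + pad_b, W + pad_r

    # Partition into non-overlapping windows: [num_windows, ws*ws].
    mask = mask.view(1, Hp // window_size, window_size,
                     Wp // window_size, window_size)
    mask = mask.permute(0, 1, 3, 2, 4).reshape(-1, window_size * window_size)

    # Insert a singleton query dim so it broadcasts over all queries
    # in each window when added to (or used to fill) attention logits.
    return mask.unsqueeze(1)
```

Before the softmax, the masked key positions can then be filled with `-inf` via `attn.masked_fill(win_mask, float('-inf'))`; a window consisting entirely of padding needs special handling to avoid NaNs.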