
How to visualize the attention map? #77

Closed · Amo5 opened this issue on Feb 22, 2023 · 3 comments

Labels: question (Further information is requested)

Comments


Amo5 commented Feb 22, 2023

Hi,
I used the command `pip3 install natten -f https://shi-labs.com/natten/wheels/cu116/torch1.12.1/index.html` to install the wheel, but I don't know how to visualize the attention map of NeighborhoodAttention2D.
Could you help me?

@alihassanijr (Member)

Hello and thank you for your interest in our work.
First off I'm sorry for getting to this question so late.

Unfortunately methods that restrict self attention to small windows cannot produce attention maps in the same way that self attention itself does. This applies to sliding window approaches (NA/DiNA, SASA, Sliding Window Attention) and partitioning-based methods (block attention, WSA, and the like).

There are two reasons for that, the most important being that the full self attention graph is not learned during training. Every pixel attends to a subset of the input as opposed to the entire set, so every pixel only produces a fixed number of attention weights. In other words, given a 64x64 input feature map, you would end up with attention maps of shape 7x7 for every pixel, whereas if you were computing self attention, you'd have attention maps of shape 64x64 for every pixel (still hard to visualize because there are 4096 pixels, so 4096x64x64 attention weights in total, but it's easy to either map those down to a single attention map, or cross-attend something with every pixel to produce one attention map).
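
To make the shape difference concrete, here is a minimal plain-PyTorch sketch. This is not NATTEN's actual kernels: border pixels are handled with zero padding here, whereas neighborhood attention shifts the window so it stays inside the feature map, and all tensor names and sizes are illustrative.

```python
# Minimal shape illustration (plain PyTorch, not NATTEN's kernels).
import torch
import torch.nn.functional as F

H = W = 64          # feature map size
dim = 32            # channels per head (illustrative)
ks = 7              # neighborhood (kernel) size

q = torch.randn(1, dim, H, W)   # queries
k = torch.randn(1, dim, H, W)   # keys

# Full self attention: every pixel attends to every pixel.
q_flat = q.flatten(2).transpose(1, 2)                 # (1, 4096, dim)
k_flat = k.flatten(2).transpose(1, 2)                 # (1, 4096, dim)
attn_full = (q_flat @ k_flat.transpose(1, 2) / dim ** 0.5).softmax(dim=-1)
print(attn_full.shape)                                # torch.Size([1, 4096, 4096])

# Windowed attention: every pixel attends to a 7x7 neighborhood of keys
# (zero padding at the borders here; NA instead shifts the window inward).
k_neigh = F.unfold(k, kernel_size=ks, padding=ks // 2)              # (1, dim*49, 4096)
k_neigh = k_neigh.view(1, dim, ks * ks, H * W).permute(0, 3, 2, 1)  # (1, 4096, 49, dim)
attn_local = (k_neigh @ q_flat.unsqueeze(-1) / dim ** 0.5).squeeze(-1).softmax(dim=-1)
print(attn_local.shape)          # torch.Size([1, 4096, 49]) -> a 7x7 map per pixel
```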

Methods that do not restrict attention (ViT / DeiT) typically also learn a "class token" and use it to produce attention maps at different layers: the class token attends to every pixel in your feature map (and to itself, depending on the model), so given any input image, the token can cross-attend the pixels in the same way (this is the "something" I mentioned earlier).
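
For illustration, here is a single-head sketch of that class-token idea in plain PyTorch; the projections are random stand-ins rather than a real ViT/DeiT, and the patch grid size is just an example.

```python
# Single-head sketch of class-token attention maps (random stand-in weights).
import torch

grid = 14                                      # e.g. a 224x224 image with 16x16 patches
num_patches = grid * grid
dim = 64

tokens = torch.randn(1, 1 + num_patches, dim)  # [class token, patch tokens]
w_q = torch.randn(dim, dim)                    # stand-in for the learned query projection
w_k = torch.randn(dim, dim)                    # stand-in for the learned key projection

q = tokens @ w_q
k = tokens @ w_k
attn = (q @ k.transpose(1, 2) / dim ** 0.5).softmax(dim=-1)   # (1, 197, 197)

# The class token's attention over the patch tokens is one weight per patch,
# so it folds straight back into a single spatial map for the image.
cls_attn = attn[:, 0, 1:]                      # (1, 196)
attn_map = cls_attn.reshape(1, grid, grid)
print(attn_map.shape)                          # torch.Size([1, 14, 14])
```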

@alihassanijr added the question label on May 12, 2023
@stevenwalton (Collaborator)

I'm sorry, I meant to get to this sooner. I have code to visualize the attention maps for both Swin and NAT located here.

If you use these attention maps in your work, please cite StyleNAT, as that's where they were introduced.

Here is a sample of what the maps may look like. Note that in StyleNAT we are using Hydra-NA, which allows different dilations and/or kernel sizes on each attention head, so we look at the heads independently. You can either average or sum over the heads if you want (see the sketch at the end of this comment). Also note that this example is from a generative model, so your maps would look different in a discriminative network.

There are a lot more samples in the appendix of StyleNAT, including ones from Swin. There will be some visual differences between these because the attentions have different types of biases. We extensively discuss this in StyleNAT too.

[image: sample attention map visualizations]
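
For reference, a minimal sketch of the head reduction mentioned above (inspecting heads independently, or averaging/summing over them); the attention tensor shape here is an assumption, standing in for weights hooked out of an actual NAT / Hydra-NA attention block.

```python
# Sketch of reducing per-head attention maps; `attn` has an assumed shape,
# standing in for weights hooked out of a NAT / Hydra-NA attention block.
import torch

heads, H, W, ks = 4, 64, 64, 7
attn = torch.rand(heads, H, W, ks * ks)
attn = attn / attn.sum(dim=-1, keepdim=True)        # each 7x7 map sums to 1

per_head = attn.reshape(heads, H, W, ks, ks)        # inspect each head independently
mean_map = attn.mean(dim=0).reshape(H, W, ks, ks)   # or average over the heads
sum_map = attn.sum(dim=0).reshape(H, W, ks, ks)     # or sum them
print(per_head.shape, mean_map.shape, sum_map.shape)
```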

@alihassanijr (Member)

Closing due to inactivity. Feel free to reopen if you still have questions.
