This seems to be an idea that has been demonstrated by the existing method. #1
First of all, it is really a very interesting work. This paper's strategy is very similar to one I have read: Stand-Alone Self-Attention in Vision Models (SASA). However, I did not find a relevant comparison in the paper, and the authors should probably add more content explaining the difference.

Comments
Hello, and thank you for your interest, and for pointing this out. We actually cited a more recent follow-up work to SASA by the same group of authors, HaloNet. Another big difference between the papers is the application of NA vs. SASA. I hope that answers your question, and feel free to reopen the issue if you have any more questions.
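For readers unfamiliar with the distinction referenced above, here is a minimal, illustrative one-dimensional sketch of the border behavior, assuming the clamped-window formulation described in the NAT paper. The function names are hypothetical, and this is not the repo's actual CUDA kernel:

```python
# Illustrative sketch (not this repo's implementation): how a SASA-style
# sliding window and Neighborhood Attention (NA) pick key positions for a
# query at index i along one spatial axis of length L, with kernel size k.
# Assumes L >= k.

def sasa_neighborhood(i: int, L: int, k: int) -> list[int]:
    """SASA centers a window on i and zero-pads out-of-bounds positions,
    so border queries attend to fewer than k real keys."""
    r = k // 2
    return [j for j in range(i - r, i + r + 1) if 0 <= j < L]

def na_neighborhood(i: int, L: int, k: int) -> list[int]:
    """NA clamps the window so it stays inside the feature map,
    so every query attends to exactly k keys, even at the borders."""
    r = k // 2
    start = min(max(i - r, 0), L - k)
    return list(range(start, start + k))

# At the border (i=0, L=8, k=3):
#   sasa_neighborhood(0, 8, 3) -> [0, 1]     (2 real keys + zero padding)
#   na_neighborhood(0, 8, 3)   -> [0, 1, 2]  (always exactly 3 keys)
```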
Okay, thanks for your reply!
Thanks for the explanations. Would it be possible to also provide a runtime comparison between NAT and Swin? The current paper only compares FLOPs, which do not always correlate with runtime.
@xuxy09 That is true, FLOPs are not a direct measure of time. They are, however, a measure of computational cost, and we are particularly interested in that because the kernel is still not as fast as it can potentially be. As far as runtime goes, both training and inference on classification run at the same throughput as Swin at the Tiny scale, but they grow apart at larger scales, with NAT being slower than Swin. Again, that is only a limitation of the existing implementation, which we expect will change in the near future. You can also refer to issue #13 for details.
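For reference, throughput on classification models is typically measured along these lines. This is a generic sketch, not this repo's benchmarking code, and the `nat_tiny` / `swin_tiny` names at the end are placeholders for any `nn.Module`:

```python
# Minimal GPU throughput measurement sketch: wall-clock images/second,
# which is what can diverge from analytical FLOP counts.
import time
import torch

@torch.no_grad()
def throughput(model: torch.nn.Module, batch_size: int = 64, steps: int = 50) -> float:
    model.cuda().eval()
    x = torch.randn(batch_size, 3, 224, 224, device="cuda")
    for _ in range(10):           # warm-up iterations (kernel compilation, caches)
        model(x)
    torch.cuda.synchronize()      # wait for queued kernels before starting the clock
    start = time.time()
    for _ in range(steps):
        model(x)
    torch.cuda.synchronize()      # make sure all work finished before stopping the clock
    return batch_size * steps / (time.time() - start)

# e.g. compare throughput(nat_tiny) vs. throughput(swin_tiny)
```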
First of all, thank you very much for your contribution to the community. A similar approach seems to be mentioned in the Swin Transformer paper, and Swin's repo contains an implementation of this variant (sliding window). NAT's experimental results for this part of the ablation seem to be consistent with Swin's; could the quantitative gains be caused in part by the narrower but deeper network? I'm very much looking forward to your new CUDA implementation, as I tried a similar idea but gave up because of the speed and memory overhead.
@IDKiro Based on my reading, the sliding window approach seems to be more similar to SASA than to NA. We also observed that NAT-T is just as fast as Swin-T at inference on ImageNet, while their sliding window approach seems to be much slower. The Swin-T-based result on ImageNet happens to be the same, 81.4%, but this is likely coincidental: as you can see in our ablation table, the gap grows to ~0.5% as we shift to our NAT configuration. I would also point out that our segmentation result with that model (not in the paper) was 46.3 mIoU, while the number in Swin's table is 45.8. We also did a detection run with that model, but using Mask R-CNN (the table from Swin uses Cascade Mask R-CNN), and observed it performed on par with Swin (46.1 mAP vs. Swin-T's 46.0), while the sliding window approach seems to do worse than Swin-T.